From 94962fb9d4513aa47804acfebdcca29c037bbfad Mon Sep 17 00:00:00 2001
From: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Date: Thu, 19 Dec 2024 16:38:22 +0000
Subject: [PATCH] Commit from GitHub Actions (GH Arxiv Posterbot)
---
combined.json | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/combined.json b/combined.json
index 6ff949c..7c6636c 100644
--- a/combined.json
+++ b/combined.json
@@ -1 +1 @@
-{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. In simulation studies, our joint testing procedures outperform the\nBonferroni correction. 
We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging to balance\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. 
We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2310.04576": {"title": "Finite Sample Performance of a Conduct Parameter Test in Homogenous Goods Markets", "link": "http://arxiv.org/abs/2310.04576", "description": "We assess the finite sample performance of the conduct parameter test in\nhomogeneous goods markets. Statistical power rises with an increase in the\nnumber of markets, a larger conduct parameter, and a stronger demand rotation\ninstrument. However, even with a moderate number of markets and five firms,\nregardless of instrument strength and the utilization of optimal instruments,\nrejecting the null hypothesis of perfect competition remains challenging. Our\nfindings indicate that empirical results that fail to reject perfect\ncompetition are a consequence of the limited number of markets rather than\nmethodological deficiencies."}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.05311": {"title": "Identification and Estimation in a Class of Potential Outcomes Models", "link": "http://arxiv.org/abs/2310.05311", "description": "This paper develops a class of potential outcomes models characterized by\nthree main features: (i) Unobserved heterogeneity can be represented by a\nvector of potential outcomes and a type describing the manner in which an\ninstrument determines the choice of treatment; (ii) The availability of an\ninstrumental variable that is conditionally independent of unobserved\nheterogeneity; and (iii) The imposition of convex restrictions on the\ndistribution of unobserved heterogeneity. 
The proposed class of models\nencompasses multiple classical and novel research designs, yet possesses a\ncommon structure that permits a unifying analysis of identification and\nestimation. In particular, we establish that these models share a common\nnecessary and sufficient condition for identifying certain causal parameters.\nOur identification results are constructive in that they yield estimating\nmoment conditions for the parameters of interest. Focusing on a leading special\ncase of our framework, we further show how these estimating moment conditions\nmay be modified to be doubly robust. The corresponding double robust estimators\nare shown to be asymptotically normally distributed, bootstrap based inference\nis shown to be asymptotically valid, and the semi-parametric efficiency bound\nis derived for those parameters that are root-n estimable. We illustrate the\nusefulness of our results for developing, identifying, and estimating causal\nmodels through an empirical evaluation of the role of mental health as a\nmediating variable in the Moving To Opportunity experiment."}, "http://arxiv.org/abs/2310.05761": {"title": "Robust Minimum Distance Inference in Structural Models", "link": "http://arxiv.org/abs/2310.05761", "description": "This paper proposes minimum distance inference for a structural parameter of\ninterest, which is robust to the lack of identification of other structural\nnuisance parameters. Some choices of the weighting matrix lead to asymptotic\nchi-squared distributions with degrees of freedom that can be consistently\nestimated from the data, even under partial identification. In any case,\nknowledge of the level of under-identification is not required. We study the\npower of our robust test. Several examples show the wide applicability of the\nprocedure and a Monte Carlo investigates its finite sample performance. Our\nidentification-robust inference method can be applied to make inferences on\nboth calibrated (fixed) parameters and any other structural parameter of\ninterest. We illustrate the method's usefulness by applying it to a structural\nmodel on the non-neutrality of monetary policy, as in \\cite{nakamura2018high},\nwhere we empirically evaluate the validity of the calibrated parameters and we\ncarry out robust inference on the slope of the Phillips curve and the\ninformation effect."}, "http://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "http://arxiv.org/abs/2302.13066", "description": "Different proxy variables used in fiscal policy SVARs lead to contradicting\nconclusions regarding the size of fiscal multipliers. In this paper, we show\nthat the conflicting results are due to violations of the exogeneity\nassumptions, i.e. the commonly used proxies are endogenously related to the\nstructural shocks. We propose a novel approach to include proxy variables into\na Bayesian non-Gaussian SVAR, tailored to accommodate potentially endogenous\nproxy variables. Using our model, we show that increasing government spending\nis a more effective tool to stimulate the economy than reducing taxes. 
We\nconstruct new exogenous proxies that can be used in the traditional proxy VAR\napproach resulting in similar estimates compared to our proposed hybrid SVAR\nmodel."}, "http://arxiv.org/abs/2303.01863": {"title": "Constructing High Frequency Economic Indicators by Imputation", "link": "http://arxiv.org/abs/2303.01863", "description": "Monthly and weekly economic indicators are often taken to be the largest\ncommon factor estimated from high and low frequency data, either separately or\njointly. To incorporate mixed frequency information without directly modeling\nthem, we target a low frequency diffusion index that is already available, and\ntreat high frequency values as missing. We impute these values using multiple\nfactors estimated from the high frequency data. In the empirical examples\nconsidered, static matrix completion that does not account for serial\ncorrelation in the idiosyncratic errors yields imprecise estimates of the\nmissing values irrespective of how the factors are estimated. Single equation\nand systems-based dynamic procedures that account for serial correlation yield\nimputed values that are closer to the observed low frequency ones. This is the\ncase in the counterfactual exercise that imputes the monthly values of consumer\nsentiment series before 1978 when the data was released only on a quarterly\nbasis. This is also the case for a weekly version of the CFNAI index of\neconomic activity that is imputed using seasonally unadjusted data. The imputed\nseries reveals episodes of increased variability of weekly economic information\nthat are masked by the monthly data, notably around the 2014-15 collapse in oil\nprices."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2009.01995": {"title": "Instrument Validity for Heterogeneous Causal Effects", "link": "http://arxiv.org/abs/2009.01995", "description": "This paper provides a general framework for testing instrument validity in\nheterogeneous causal effect models. The generalization includes the cases where\nthe treatment can be multivalued ordered or unordered. Based on a series of\ntestable implications, we propose a nonparametric test which is proved to be\nasymptotically size controlled and consistent. Compared to the tests in the\nliterature, our test can be applied in more general settings and may achieve\npower improvement. Refutation of instrument validity by the test helps detect\ninvalid instruments that may yield implausible results on causal effects.\nEvidence that the test performs well on finite samples is provided via\nsimulations. 
We revisit the empirical study on return to schooling to\ndemonstrate application of the proposed test in practice. An extended\ncontinuous mapping theorem and an extended delta method, which may be of\nindependent interest, are provided to establish the asymptotic distribution of\nthe test statistic under null."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "http://arxiv.org/abs/2205.02274", "description": "Marketplace companies rely heavily on experimentation when making changes to\nthe design or operation of their platforms. The workhorse of experimentation is\nthe randomized controlled trial (RCT), or A/B test, in which users are randomly\nassigned to treatment or control groups. However, marketplace interference\ncauses the Stable Unit Treatment Value Assumption (SUTVA) to be violated,\nleading to bias in the standard RCT metric. In this work, we propose techniques\nfor platforms to run standard RCTs and still obtain meaningful estimates\ndespite the presence of marketplace interference. We specifically consider a\ngeneralized matching setting, in which the platform explicitly matches supply\nwith demand via a linear programming algorithm. Our first proposal is for the\nplatform to estimate the value of global treatment and global control via\noptimization. We prove that this approach is unbiased in the fluid limit. Our\nsecond proposal is to compare the average shadow price of the treatment and\ncontrol groups rather than the total value accrued by each group. We prove that\nthis technique corresponds to the correct first-order approximation (in a\nTaylor series sense) of the value function of interest even in a finite-size\nsystem. We then use this result to prove that, under reasonable assumptions,\nour estimator is less biased than the RCT estimator. At the heart of our result\nis the idea that it is relatively easy to model interference in matching-driven\nmarketplaces since, in such markets, the platform intermediates the spillover."}, "http://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "http://arxiv.org/abs/2208.09638", "description": "What is the purpose of pre-analysis plans, and how should they be designed?\nWe propose a principal-agent model where a decision-maker relies on selective\nbut truthful reports by an analyst. The analyst has data access, and\nnon-aligned objectives. 
In this model, the implementation of statistical\ndecision rules (tests, estimators) requires an incentive-compatible mechanism.\nWe first characterize which decision rules can be implemented. We then\ncharacterize optimal statistical decision rules subject to implementability. We\nshow that implementation requires pre-analysis plans. Focussing specifically on\nhypothesis tests, we show that optimal rejection rules pre-register a valid\ntest for the case when all data is reported, and make worst-case assumptions\nabout unreported data. Optimal tests can be found as a solution to a\nlinear-programming problem."}, "http://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "http://arxiv.org/abs/2302.11505", "description": "This paper studies settings where the analyst is interested in identifying\nand estimating the average causal effect of a binary treatment on an outcome.\nWe consider a setup in which the outcome realization does not get immediately\nrealized after the treatment assignment, a feature that is ubiquitous in\nempirical settings. The period between the treatment and the realization of the\noutcome allows other observed actions to occur and affect the outcome. In this\ncontext, we study several regression-based estimands routinely used in\nempirical work to capture the average treatment effect and shed light on\ninterpreting them in terms of ceteris paribus effects, indirect causal effects,\nand selection terms. We obtain three main and related takeaways. First, the\nthree most popular estimands do not generally satisfy what we call \\emph{strong\nsign preservation}, in the sense that these estimands may be negative even when\nthe treatment positively affects the outcome conditional on any possible\ncombination of other actions. Second, the most popular regression that includes\nthe other actions as controls satisfies strong sign preservation \\emph{if and\nonly if} these actions are mutually exclusive binary variables. Finally, we\nshow that a linear regression that fully stratifies the other actions leads to\nestimands that satisfy strong sign preservation."}, "http://arxiv.org/abs/2302.13455": {"title": "Nickell Bias in Panel Local Projection: Financial Crises Are Worse Than You Think", "link": "http://arxiv.org/abs/2302.13455", "description": "Local Projection is widely used for impulse response estimation, with the\nFixed Effect (FE) estimator being the default for panel data. This paper\nhighlights the presence of Nickell bias for all regressors in the FE estimator,\neven if lagged dependent variables are absent in the regression. This bias is\nthe consequence of the inherent panel predictive specification. We recommend\nusing the split-panel jackknife estimator to eliminate the asymptotic bias and\nrestore the standard statistical inference. 
Revisiting three macro-finance\nstudies on the linkage between financial crises and economic contraction, we\nfind that the FE estimator substantially underestimates the post-crisis\neconomic losses."}, "http://arxiv.org/abs/2310.07151": {"title": "Identification and Estimation of a Semiparametric Logit Model using Network Data", "link": "http://arxiv.org/abs/2310.07151", "description": "This paper studies the identification and estimation of a semiparametric\nbinary network model in which the unobserved social characteristic is\nendogenous, that is, the unobserved individual characteristic influences both\nthe binary outcome of interest and how links are formed within the network. The\nexact functional form of the latent social characteristic is not known. The\nproposed estimators are obtained based on matching pairs of agents whose\nnetwork formation distributions are the same. The consistency and the\nasymptotic distribution of the estimators are established. The finite sample\nproperties of the proposed estimators are assessed in a Monte-Carlo simulation.\nWe conclude this study with an empirical application."}, "http://arxiv.org/abs/2310.07558": {"title": "Smoothness-Adaptive Dynamic Pricing with Nonparametric Demand Learning", "link": "http://arxiv.org/abs/2310.07558", "description": "We study the dynamic pricing problem where the demand function is\nnonparametric and H\\\"older smooth, and we focus on adaptivity to the unknown\nH\\\"older smoothness parameter $\\beta$ of the demand function. Traditionally the\noptimal dynamic pricing algorithm heavily relies on the knowledge of $\\beta$ to\nachieve a minimax optimal regret of\n$\\widetilde{O}(T^{\\frac{\\beta+1}{2\\beta+1}})$. However, we highlight the\nchallenge of adaptivity in this dynamic pricing problem by proving that no\npricing policy can adaptively achieve this minimax optimal regret without\nknowledge of $\\beta$. Motivated by the impossibility result, we propose a\nself-similarity condition to enable adaptivity. Importantly, we show that the\nself-similarity condition does not compromise the problem's inherent complexity\nsince it preserves the regret lower bound\n$\\Omega(T^{\\frac{\\beta+1}{2\\beta+1}})$. Furthermore, we develop a\nsmoothness-adaptive dynamic pricing algorithm and theoretically prove that the\nalgorithm achieves this minimax optimal regret bound without prior\nknowledge of $\\beta$."}, "http://arxiv.org/abs/1910.07452": {"title": "Identifying Network Ties from Panel Data: Theory and an Application to Tax Competition", "link": "http://arxiv.org/abs/1910.07452", "description": "Social interactions determine many economic behaviors, but information on\nsocial ties does not exist in most publicly available and widely used datasets.\nWe present results on the identification of social networks from observational\npanel data that contains no information on social ties between agents. In the\ncontext of a canonical social interactions model, we provide sufficient\nconditions under which the social interactions matrix, endogenous and exogenous\nsocial effect parameters are all globally identified. While this result is\nrelevant across different estimation strategies, we then describe how\nhigh-dimensional estimation techniques can be used to estimate the interactions\nmodel based on the Adaptive Elastic Net GMM method. We employ the method to\nstudy tax competition across US states. 
We find the identified social\ninteractions matrix implies tax competition differs markedly from the common\nassumption of competition between geographically neighboring states, providing\nfurther insights for the long-standing debate on the relative roles of factor\nmobility and yardstick competition in driving tax setting behavior across\nstates. Most broadly, our identification and application show the analysis of\nsocial interactions can be extended to economic realms where no network data\nexists."}, "http://arxiv.org/abs/2308.00913": {"title": "The Bayesian Context Trees State Space Model for time series modelling and forecasting", "link": "http://arxiv.org/abs/2308.00913", "description": "A hierarchical Bayesian framework is introduced for developing rich mixture\nmodels for real-valued time series, partly motivated by important applications\nin financial time series analysis. At the top level, meaningful discrete states\nare identified as appropriately quantised values of some of the most recent\nsamples. These observable states are described as a discrete context-tree\nmodel. At the bottom level, a different, arbitrary model for real-valued time\nseries -- a base model -- is associated with each state. This defines a very\ngeneral framework that can be used in conjunction with any existing model class\nto build flexible and interpretable mixture models. We call this the Bayesian\nContext Trees State Space Model, or the BCT-X framework. Efficient algorithms\nare introduced that allow for effective, exact Bayesian inference and learning\nin this setting; in particular, the maximum a posteriori probability (MAP)\ncontext-tree model can be identified. These algorithms can be updated\nsequentially, facilitating efficient online forecasting. The utility of the\ngeneral framework is illustrated in two particular instances: When\nautoregressive (AR) models are used as base models, resulting in a nonlinear AR\nmixture model, and when conditional heteroscedastic (ARCH) models are used,\nresulting in a mixture model that offers a powerful and systematic way of\nmodelling the well-known volatility asymmetries in financial data. In\nforecasting, the BCT-X methods are found to outperform state-of-the-art\ntechniques on simulated and real-world data, both in terms of accuracy and\ncomputational requirements. In modelling, the BCT-X structure finds natural\nstructure present in the data. In particular, the BCT-ARCH model reveals a\nnovel, important feature of stock market index data, in the form of an enhanced\nleverage effect."}, "http://arxiv.org/abs/2310.07790": {"title": "Integration or fragmentation? A closer look at euro area financial markets", "link": "http://arxiv.org/abs/2310.07790", "description": "This paper examines the degree of integration at euro area financial markets.\nTo that end, we estimate overall and country-specific integration indices based\non a panel vector-autoregression with factor stochastic volatility. Our results\nindicate a more heterogeneous bond market compared to the market for lending\nrates. At both markets, the global financial crisis and the sovereign debt\ncrisis led to a severe decline in financial integration, which fully recovered\nsince then. We furthermore identify countries that deviate from their peers\neither by responding differently to crisis events or by taking on different\nroles in the spillover network. 
The latter analysis reveals two sets of\ncountries, namely a main body of countries that receives and transmits\nspillovers and a second, smaller group of spillover absorbing economies.\nFinally, we demonstrate by estimating an augmented Taylor rule that euro area\nshort-term interest rates are positively linked to the level of integration on\nthe bond market."}, "http://arxiv.org/abs/2310.07839": {"title": "Marital Sorting, Household Inequality and Selection", "link": "http://arxiv.org/abs/2310.07839", "description": "Using CPS data for 1976 to 2022 we explore how wage inequality has evolved\nfor married couples with both spouses working full time full year, and its\nimpact on household income inequality. We also investigate how marriage sorting\npatterns have changed over this period. To determine the factors driving income\ninequality we estimate a model explaining the joint distribution of wages which\naccounts for the spouses' employment decisions. We find that income inequality\nhas increased for these households and increased assortative matching of wages\nhas exacerbated the inequality resulting from individual wage growth. We find\nthat positive sorting partially reflects the correlation across unobservables\ninfluencing the wages of both members of the marriage. We decompose the changes in\nsorting patterns over the 47 years comprising our sample into structural,\ncomposition and selection effects and find that the increase in positive\nsorting primarily reflects the increased skill premia for both observed and\nunobserved characteristics."}, "http://arxiv.org/abs/2310.08063": {"title": "Uniform Inference for Nonlinear Endogenous Treatment Effects with High-Dimensional Covariates", "link": "http://arxiv.org/abs/2310.08063", "description": "Nonlinearity and endogeneity are common in causal effect studies with\nobservational data. In this paper, we propose new estimation and inference\nprocedures for nonparametric treatment effect functions with endogeneity and\npotentially high-dimensional covariates. The main innovation of this paper is\nthe double bias correction procedure for the nonparametric instrumental\nvariable (NPIV) model under high dimensions. We provide a useful uniform\nconfidence band of the marginal effect function, defined as the derivative of\nthe nonparametric treatment function. The asymptotic honesty of the confidence\nband is verified in theory. Simulations and an empirical study of air pollution\nand migration demonstrate the validity of our procedures."}, "http://arxiv.org/abs/2310.08115": {"title": "Model-Agnostic Covariate-Assisted Inference on Partially Identified Causal Effects", "link": "http://arxiv.org/abs/2310.08115", "description": "Many causal estimands are only partially identifiable since they depend on\nthe unobservable joint distribution between potential outcomes. Stratification\non pretreatment covariates can yield sharper partial identification bounds;\nhowever, unless the covariates are discrete with relatively small support, this\napproach typically requires consistent estimation of the conditional\ndistributions of the potential outcomes given the covariates. Thus, existing\napproaches may fail under model misspecification or if consistency assumptions\nare violated. In this study, we propose a unified and model-agnostic\ninferential approach for a wide class of partially identified estimands, based\non duality theory for optimal transport problems. 
In randomized experiments,\nour approach can wrap around any estimates of the conditional distributions and\nprovide uniformly valid inference, even if the initial estimates are\narbitrarily inaccurate. Also, our approach is doubly robust in observational\nstudies. Notably, this property allows analysts to use the multiplier bootstrap\nto select covariates and models without sacrificing validity even if the true\nmodel is not included. Furthermore, if the conditional distributions are\nestimated at semiparametric rates, our approach matches the performance of an\noracle with perfect knowledge of the outcome model. Finally, we propose an\nefficient computational framework, enabling implementation on many practical\nproblems in causal inference."}, "http://arxiv.org/abs/2310.08173": {"title": "Structural Vector Autoregressions and Higher Moments: Challenges and Solutions in Small Samples", "link": "http://arxiv.org/abs/2310.08173", "description": "Generalized method of moments estimators based on higher-order moment\nconditions derived from independent shocks can be used to identify and estimate\nthe simultaneous interaction in structural vector autoregressions. This study\nhighlights two problems that arise when using these estimators in small\nsamples. First, imprecise estimates of the asymptotically efficient weighting\nmatrix and the asymptotic variance lead to volatile estimates and inaccurate\ninference. Second, many moment conditions lead to a small sample scaling bias\ntowards innovations with a variance smaller than the normalizing unit variance\nassumption. To address the first problem, I propose utilizing the assumption of\nindependent structural shocks to estimate the efficient weighting matrix and\nthe variance of the estimator. For the second issue, I propose incorporating a\ncontinuously updated scaling term into the weighting matrix, eliminating the\nscaling bias. To demonstrate the effectiveness of these measures, I conducted a\nMonte Carlo simulation which shows a significant improvement in the performance\nof the estimator."}, "http://arxiv.org/abs/2310.08536": {"title": "Real-time Prediction of the Great Recession and the Covid-19 Recession", "link": "http://arxiv.org/abs/2310.08536", "description": "A series of standard and penalized logistic regression models is employed to\nmodel and forecast the Great Recession and the Covid-19 recession in the US.\nThese two recessions are scrutinized by closely examining the movement of five\nchosen predictors, their regression coefficients, and the predicted\nprobabilities of recession. The empirical analysis explores the predictive\ncontent of numerous macroeconomic and financial indicators with respect to NBER\nrecession indicator. The predictive ability of the underlying models is\nevaluated using a set of statistical evaluation metrics. The results strongly\nsupport the application of penalized logistic regression models in the area of\nrecession prediction. Specifically, the analysis indicates that a mixed usage\nof different penalized logistic regression models over different forecast\nhorizons largely outperform standard logistic regression models in the\nprediction of Great recession in the US, as they achieve higher predictive\naccuracy across 5 different forecast horizons. The Great Recession is largely\npredictable, whereas the Covid-19 recession remains unpredictable, given that\nthe Covid-19 pandemic is a real exogenous event. 
The results are validated by\nconstructing via principal component analysis (PCA) on a set of selected\nvariables a recession indicator that suffers less from publication lags and\nexhibits a very high correlation with the NBER recession indicator."}, "http://arxiv.org/abs/2210.11355": {"title": "Network Synthetic Interventions: A Causal Framework for Panel Data Under Network Interference", "link": "http://arxiv.org/abs/2210.11355", "description": "We propose a generalization of the synthetic controls and synthetic\ninterventions methodology to incorporate network interference. We consider the\nestimation of unit-specific potential outcomes from panel data in the presence\nof spillover across units and unobserved confounding. Key to our approach is a\nnovel latent factor model that takes into account network interference and\ngeneralizes the factor models typically used in panel data settings. We propose\nan estimator, Network Synthetic Interventions (NSI), and show that it\nconsistently estimates the mean outcomes for a unit under an arbitrary set of\ncounterfactual treatments for the network. We further establish that the\nestimator is asymptotically normal. We furnish two validity tests for whether\nthe NSI estimator reliably generalizes to produce accurate counterfactual\nestimates. We provide a novel graph-based experiment design that guarantees the\nNSI estimator produces accurate counterfactual estimates, and also analyze the\nsample complexity of the proposed design. We conclude with simulations that\ncorroborate our theoretical findings."}, "http://arxiv.org/abs/2310.08672": {"title": "Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal", "link": "http://arxiv.org/abs/2310.08672", "description": "In many settings, interventions may be more effective for some individuals\nthan others, so that targeting interventions may be beneficial. We analyze the\nvalue of targeting in the context of a large-scale field experiment with over\n53,000 college students, where the goal was to use \"nudges\" to encourage\nstudents to renew their financial-aid applications before a non-binding\ndeadline. We begin with baseline approaches to targeting. First, we target\nbased on a causal forest that estimates heterogeneous treatment effects and\nthen assigns students to treatment according to those estimated to have the\nhighest treatment effects. Next, we evaluate two alternative targeting\npolicies, one targeting students with low predicted probability of renewing\nfinancial aid in the absence of the treatment, the other targeting those with\nhigh probability. The predicted baseline outcome is not the ideal criterion for\ntargeting, nor is it a priori clear whether to prioritize low, high, or\nintermediate predicted probability. Nonetheless, targeting on low baseline\noutcomes is common in practice, for example because the relationship between\nindividual characteristics and treatment effects is often difficult or\nimpossible to estimate with historical data. We propose hybrid approaches that\nincorporate the strengths of both predictive approaches (accurate estimation)\nand causal approaches (correct criterion); we show that targeting intermediate\nbaseline outcomes is most effective, while targeting based on low baseline\noutcomes is detrimental. 
In one year of the experiment, nudging all students\nimproved early filing by an average of 6.4 percentage points over a baseline\naverage of 37% filing, and we estimate that targeting half of the students\nusing our preferred policy attains around 75% of this benefit."}, "http://arxiv.org/abs/2310.09013": {"title": "Smoothed instrumental variables quantile regression", "link": "http://arxiv.org/abs/2310.09013", "description": "In this article, I introduce the sivqr command, which estimates the\ncoefficients of the instrumental variables (IV) quantile regression model\nintroduced by Chernozhukov and Hansen (2005). The sivqr command offers several\nadvantages over the existing ivqreg and ivqreg2 commands for estimating this IV\nquantile regression model, which complements the alternative \"triangular model\"\nbehind cqiv and the \"local quantile treatment effect\" model of ivqte.\nComputationally, sivqr implements the smoothed estimator of Kaplan and Sun\n(2017), who show that smoothing improves both computation time and statistical\naccuracy. Standard errors are computed analytically or by Bayesian bootstrap;\nfor non-iid sampling, sivqr is compatible with bootstrap. I discuss syntax and\nthe underlying methodology, and I compare sivqr with other commands in an\nexample."}, "http://arxiv.org/abs/2310.09105": {"title": "Estimating Individual Responses when Tomorrow Matters", "link": "http://arxiv.org/abs/2310.09105", "description": "We propose a regression-based approach to estimate how individuals'\nexpectations influence their responses to a counterfactual change. We provide\nconditions under which average partial effects based on regression estimates\nrecover structural effects. We propose a practical three-step estimation method\nthat relies on subjective beliefs data. We illustrate our approach in a model\nof consumption and saving, focusing on the impact of an income tax that not\nonly changes current income but also affects beliefs about future income. By\napplying our approach to survey data from Italy, we find that considering\nindividuals' beliefs matters for evaluating the impact of tax policies on\nconsumption decisions."}, "http://arxiv.org/abs/2210.09828": {"title": "Modelling Large Dimensional Datasets with Markov Switching Factor Models", "link": "http://arxiv.org/abs/2210.09828", "description": "We study a novel large dimensional approximate factor model with regime\nchanges in the loadings driven by a latent first order Markov process. By\nexploiting the equivalent linear representation of the model, we first recover\nthe latent factors by means of Principal Component Analysis. We then cast the\nmodel in state-space form, and we estimate loadings and transition\nprobabilities through an EM algorithm based on a modified version of the\nBaum-Lindgren-Hamilton-Kim filter and smoother that makes use of the factors\npreviously estimated. Our approach is appealing as it provides closed form\nexpressions for all estimators. More importantly, it does not require knowledge\nof the true number of factors. We derive the theoretical properties of the\nproposed estimation procedure, and we show their good finite sample performance\nthrough a comprehensive set of Monte Carlo experiments. 
The empirical\nusefulness of our approach is illustrated through an application to a large\nportfolio of stocks."}, "http://arxiv.org/abs/2210.13562": {"title": "Prediction intervals for economic fixed-event forecasts", "link": "http://arxiv.org/abs/2210.13562", "description": "The fixed-event forecasting setup is common in economic policy. It involves a\nsequence of forecasts of the same (`fixed') predictand, so that the difficulty\nof the forecasting problem decreases over time. Fixed-event point forecasts are\ntypically published without a quantitative measure of uncertainty. To construct\nsuch a measure, we consider forecast postprocessing techniques tailored to the\nfixed-event case. We develop regression methods that impose constraints\nmotivated by the problem at hand, and use these methods to construct prediction\nintervals for gross domestic product (GDP) growth in Germany and the US."}, "http://arxiv.org/abs/2302.02747": {"title": "Testing Quantile Forecast Optimality", "link": "http://arxiv.org/abs/2302.02747", "description": "Quantile forecasts made across multiple horizons have become an important\noutput of many financial institutions, central banks and international\norganisations. This paper proposes misspecification tests for such quantile\nforecasts that assess optimality over a set of multiple forecast horizons\nand/or quantiles. The tests build on multiple Mincer-Zarnowitz quantile\nregressions cast in a moment equality framework. Our main test is for the null\nhypothesis of autocalibration, a concept which assesses optimality with respect\nto the information contained in the forecasts themselves. We provide an\nextension that allows to test for optimality with respect to larger information\nsets and a multivariate extension. Importantly, our tests do not just inform\nabout general violations of optimality, but may also provide useful insights\ninto specific forms of sub-optimality. A simulation study investigates the\nfinite sample performance of our tests, and two empirical applications to\nfinancial returns and U.S. macroeconomic series illustrate that our tests can\nyield interesting insights into quantile forecast sub-optimality and its\ncauses."}, "http://arxiv.org/abs/2305.00700": {"title": "Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control", "link": "http://arxiv.org/abs/2305.00700", "description": "Motivated by a recent literature on the double-descent phenomenon in machine\nlearning, we consider highly over-parameterized models in causal inference,\nincluding synthetic control with many control units. In such models, there may\nbe so many free parameters that the model fits the training data perfectly. We\nfirst investigate high-dimensional linear regression for imputing wage data and\nestimating average treatment effects, where we find that models with many more\ncovariates than sample size can outperform simple ones. We then document the\nperformance of high-dimensional synthetic control estimators with many control\nunits. We find that adding control units can help improve imputation\nperformance even beyond the point where the pre-treatment fit is perfect. We\nprovide a unified theoretical perspective on the performance of these\nhigh-dimensional models. Specifically, we show that more complex models can be\ninterpreted as model-averaging estimators over simpler ones, which we link to\nan improvement in average performance. 
This perspective yields concrete\ninsights into the use of synthetic control when control units are many relative\nto the number of pre-treatment periods."}, "http://arxiv.org/abs/2310.09398": {"title": "An In-Depth Examination of Requirements for Disclosure Risk Assessment", "link": "http://arxiv.org/abs/2310.09398", "description": "The use of formal privacy to protect the confidentiality of responses in the\n2020 Decennial Census of Population and Housing has triggered renewed interest\nand debate over how to measure the disclosure risks and societal benefits of\nthe published data products. Following long-established precedent in economics\nand statistics, we argue that any proposal for quantifying disclosure risk\nshould be based on pre-specified, objective criteria. Such criteria should be\nused to compare methodologies to identify those with the most desirable\nproperties. We illustrate this approach, using simple desiderata, to evaluate\nthe absolute disclosure risk framework, the counterfactual framework underlying\ndifferential privacy, and prior-to-posterior comparisons. We conclude that\nsatisfying all the desiderata is impossible, but counterfactual comparisons\nsatisfy the most while absolute disclosure risk satisfies the fewest.\nFurthermore, we explain that many of the criticisms levied against differential\nprivacy would be levied against any technology that is not equivalent to\ndirect, unrestricted access to confidential data. Thus, more research is\nneeded, but in the near-term, the counterfactual approach appears best-suited\nfor privacy-utility analysis."}, "http://arxiv.org/abs/2310.09545": {"title": "A Semiparametric Instrumented Difference-in-Differences Approach to Policy Learning", "link": "http://arxiv.org/abs/2310.09545", "description": "Recently, there has been a surge in methodological development for the\ndifference-in-differences (DiD) approach to evaluate causal effects. Standard\nmethods in the literature rely on the parallel trends assumption to identify\nthe average treatment effect on the treated. However, the parallel trends\nassumption may be violated in the presence of unmeasured confounding, and the\naverage treatment effect on the treated may not be useful in learning a\ntreatment assignment policy for the entire population. In this article, we\npropose a general instrumented DiD approach for learning the optimal treatment\npolicy. Specifically, we establish identification results using a binary\ninstrumental variable (IV) when the parallel trends assumption fails to hold.\nAdditionally, we construct a Wald estimator, novel inverse probability\nweighting (IPW) estimators, and a class of semiparametric efficient and\nmultiply robust estimators, with theoretical guarantees on consistency and\nasymptotic normality, even when relying on flexible machine learning algorithms\nfor nuisance parameters estimation. Furthermore, we extend the instrumented DiD\nto the panel data setting. We evaluate our methods in extensive simulations and\na real data application."}, "http://arxiv.org/abs/2310.09597": {"title": "Adaptive maximization of social welfare", "link": "http://arxiv.org/abs/2310.09597", "description": "We consider the problem of repeatedly choosing policies to maximize social\nwelfare. Welfare is a weighted sum of private utility and public revenue.\nEarlier outcomes inform later policies. Utility is not observed, but indirectly\ninferred. 
Response functions are learned through experimentation.\n\nWe derive a lower bound on regret, and a matching adversarial upper bound for\na variant of the Exp3 algorithm. Cumulative regret grows at a rate of\n$T^{2/3}$. This implies that (i) welfare maximization is harder than the\nmulti-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets),\nand (ii) our algorithm achieves the optimal rate. For the stochastic setting,\nif social welfare is concave, we can achieve a rate of $T^{1/2}$ (for\ncontinuous policy sets), using a dyadic search algorithm.\n\nWe analyze an extension to nonlinear income taxation, and sketch an extension\nto commodity taxation. We compare our setting to monopoly pricing (which is\neasier), and price setting for bilateral trade (which is harder)."}, "http://arxiv.org/abs/2008.08387": {"title": "A Novel Approach to Predictive Accuracy Testing in Nested Environments", "link": "http://arxiv.org/abs/2008.08387", "description": "We introduce a new approach for comparing the predictive accuracy of two\nnested models that bypasses the difficulties caused by the degeneracy of the\nasymptotic variance of forecast error loss differentials used in the\nconstruction of commonly used predictive comparison statistics. Our approach\ncontinues to rely on the out of sample MSE loss differentials between the two\ncompeting models, leads to nuisance parameter free Gaussian asymptotics and is\nshown to remain valid under flexible assumptions that can accommodate\nheteroskedasticity and the presence of mixed predictors (e.g. stationary and\nlocal to unit root). A local power analysis also establishes its ability to\ndetect departures from the null in both stationary and persistent settings.\nSimulations calibrated to common economic and financial applications indicate\nthat our methods have strong power with good size control across commonly\nencountered sample sizes."}, "http://arxiv.org/abs/2211.07506": {"title": "Type I Tobit Bayesian Additive Regression Trees for Censored Outcome Regression", "link": "http://arxiv.org/abs/2211.07506", "description": "Censoring occurs when an outcome is unobserved beyond some threshold value.\nMethods that do not account for censoring produce biased predictions of the\nunobserved outcome. This paper introduces Type I Tobit Bayesian Additive\nRegression Tree (TOBART-1) models for censored outcomes. Simulation results and\nreal data applications demonstrate that TOBART-1 produces accurate predictions\nof censored outcomes. TOBART-1 provides posterior intervals for the conditional\nexpectation and other quantities of interest. The error term distribution can\nhave a large impact on the expectation of the censored outcome. Therefore the\nerror is flexibly modeled as a Dirichlet process mixture of normal\ndistributions."}, "http://arxiv.org/abs/2302.02866": {"title": "Out of Sample Predictability in Predictive Regressions with Many Predictor Candidates", "link": "http://arxiv.org/abs/2302.02866", "description": "This paper is concerned with detecting the presence of out of sample\npredictability in linear predictive regressions with a potentially large set of\ncandidate predictors. We propose a procedure based on out of sample MSE\ncomparisons that is implemented in a pairwise manner using one predictor at a\ntime and resulting in an aggregate test statistic that is standard normally\ndistributed under the global null hypothesis of no linear predictability.\nPredictors can be highly persistent, purely stationary or a combination of\nboth. 
Upon rejection of the null hypothesis we subsequently introduce a\npredictor screening procedure designed to identify the most active predictors.\nAn empirical application to key predictors of US economic activity illustrates\nthe usefulness of our methods and highlights the important forward looking role\nplayed by the series of manufacturing new orders."}, "http://arxiv.org/abs/2305.12883": {"title": "Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors", "link": "http://arxiv.org/abs/2305.12883", "description": "In recent years, there has been a significant growth in research focusing on\nminimum $\\ell_2$ norm (ridgeless) interpolation least squares estimators.\nHowever, the majority of these analyses have been limited to a simple\nregression error structure, assuming independent and identically distributed\nerrors with zero mean and common variance. In this paper, we explore prediction\nrisk as well as estimation risk under more general regression error\nassumptions, highlighting the benefits of overparameterization in a finite\nsample. We find that including a large number of unimportant parameters\nrelative to the sample size can effectively reduce both risks. Notably, we\nestablish that the estimation difficulties associated with the variance\ncomponents of both risks can be summarized through the trace of the\nvariance-covariance matrix of the regression errors."}, "http://arxiv.org/abs/2203.09001": {"title": "Selection and parallel trends", "link": "http://arxiv.org/abs/2203.09001", "description": "We study the role of selection into treatment in difference-in-differences\n(DiD) designs. We derive necessary and sufficient conditions for parallel\ntrends assumptions under general classes of selection mechanisms. These\nconditions characterize the empirical content of parallel trends. For settings\nwhere the necessary conditions are questionable, we propose tools for\nselection-based sensitivity analysis. We also provide templates for justifying\nDiD in applications with and without covariates. A reanalysis of the causal\neffect of NSW training programs demonstrates the usefulness of our\nselection-based approach to sensitivity analysis."}, "http://arxiv.org/abs/2207.07318": {"title": "Flexible global forecast combinations", "link": "http://arxiv.org/abs/2207.07318", "description": "Forecast combination -- the aggregation of individual forecasts from multiple\nexperts or models -- is a proven approach to economic forecasting. To date,\nresearch on economic forecasting has concentrated on local combination methods,\nwhich handle separate but related forecasting tasks in isolation. Yet, it has\nbeen known for over two decades in the machine learning community that global\nmethods, which exploit task-relatedness, can improve on local methods that\nignore it. Motivated by the possibility for improvement, this paper introduces\na framework for globally combining forecasts while being flexible to the level\nof task-relatedness. Through our framework, we develop global versions of\nseveral existing forecast combinations. To evaluate the efficacy of these new\nglobal forecast combinations, we conduct extensive comparisons using synthetic\nand real data. 
Our real data comparisons, which involve forecasts of core\neconomic indicators in the Eurozone, provide empirical evidence that the\naccuracy of global combinations of economic forecasts can surpass local\ncombinations."}, "http://arxiv.org/abs/2210.01938": {"title": "Probability of Causation with Sample Selection: A Reanalysis of the Impacts of J\\'ovenes en Acci\\'on on Formality", "link": "http://arxiv.org/abs/2210.01938", "description": "This paper identifies the probability of causation when there is sample\nselection. We show that the probability of causation is partially identified\nfor individuals who are always observed regardless of treatment status and\nderive sharp bounds under three increasingly restrictive sets of assumptions.\nThe first set imposes an exogenous treatment and a monotone sample selection\nmechanism. To tighten these bounds, the second set also imposes the monotone\ntreatment response assumption, while the third set additionally imposes a\nstochastic dominance assumption. Finally, we use experimental data from the\nColombian job training program J\\'ovenes en Acci\\'on to empirically illustrate\nour approach's usefulness. We find that, among always-employed women, at least\n18% and at most 24% transitioned to the formal labor market because of the\nprogram."}, "http://arxiv.org/abs/2212.06312": {"title": "Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation", "link": "http://arxiv.org/abs/2212.06312", "description": "Methods for learning optimal policies use causal machine learning models to\ncreate human-interpretable rules for making choices around the allocation of\ndifferent policy interventions. However, in realistic policy-making contexts,\ndecision-makers often care about trade-offs between outcomes, not just\nsingle-mindedly maximising utility for one outcome. This paper proposes an\napproach termed Multi-Objective Policy Learning (MOPoL) which combines optimal\ndecision trees for policy learning with a multi-objective Bayesian optimisation\napproach to explore the trade-off between multiple outcomes. It does this by\nbuilding a Pareto frontier of non-dominated models for different hyperparameter\nsettings which govern outcome weighting. The key here is that a low-cost greedy\ntree can be an accurate proxy for the very computationally costly optimal tree\nfor the purposes of making decisions which means models can be repeatedly fit\nto learn a Pareto frontier. The method is applied to a real-world case-study of\nnon-price rationing of anti-malarial medication in Kenya."}, "http://arxiv.org/abs/2302.08002": {"title": "Deep Learning Enhanced Realized GARCH", "link": "http://arxiv.org/abs/2302.08002", "description": "We propose a new approach to volatility modeling by combining deep learning\n(LSTM) and realized volatility measures. This LSTM-enhanced realized GARCH\nframework incorporates and distills modeling advances from financial\neconometrics, high frequency trading data and deep learning. Bayesian inference\nvia the Sequential Monte Carlo method is employed for statistical inference and\nforecasting. The new framework can jointly model the returns and realized\nvolatility measures, has an excellent in-sample fit and superior predictive\nperformance compared to several benchmark models, while being able to adapt\nwell to the stylized facts in volatility. 
The performance of the new framework\nis tested using a wide range of metrics, from marginal likelihood, volatility\nforecasting, to tail risk forecasting and option pricing. We report on a\ncomprehensive empirical study using 31 widely traded stock indices over a time\nperiod that includes the COVID-19 pandemic."}, "http://arxiv.org/abs/2303.10019": {"title": "Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices", "link": "http://arxiv.org/abs/2303.10019", "description": "This paper presents a new method for combining (or aggregating or ensembling)\nmultivariate probabilistic forecasts, considering dependencies between\nquantiles and marginals through a smoothing procedure that allows for online\nlearning. We discuss two smoothing methods: dimensionality reduction using\nbasis matrices and penalized smoothing. The new online learning algorithm\ngeneralizes the standard CRPS learning framework into multivariate dimensions.\nIt is based on Bernstein Online Aggregation (BOA) and yields optimal asymptotic\nlearning properties. The procedure uses horizontal aggregation, i.e.,\naggregation across quantiles. We provide an in-depth discussion on possible\nextensions of the algorithm and several nested cases related to the existing\nliterature on online forecast combination. We apply the proposed methodology to\nforecasting day-ahead electricity prices, which are 24-dimensional\ndistributional forecasts. The proposed method yields significant improvements\nover uniform combination in terms of continuous ranked probability score\n(CRPS). We discuss the temporal evolution of the weights and hyperparameters\nand present the results of reduced versions of the preferred model. A fast C++\nimplementation of the proposed algorithm will be made available in connection\nwith this paper as an open-source R-Package on CRAN."}, "http://arxiv.org/abs/2309.02072": {"title": "DeepVol: A Pre-Trained Universal Asset Volatility Model", "link": "http://arxiv.org/abs/2309.02072", "description": "This paper introduces DeepVol, a pre-trained deep learning volatility model\nthat is more general than traditional econometric models. DeepVol leverages the\npower of transfer learning to effectively capture and model the volatility\ndynamics of all financial assets, including previously unseen ones, using a\nsingle universal model. This contrasts with the usual practice in the\neconometrics literature, which trains a separate model for each asset. The\nintroduction of DeepVol opens up new avenues for volatility modeling in the\nfinance industry, potentially transforming the way volatility is predicted."}, "http://arxiv.org/abs/2310.11680": {"title": "Trimmed Mean Group Estimation of Average Treatment Effects in Ultra Short T Panels under Correlated Heterogeneity", "link": "http://arxiv.org/abs/2310.11680", "description": "Under correlated heterogeneity, the commonly used two-way fixed effects\nestimator is biased and can lead to misleading inference. This paper proposes a\nnew trimmed mean group (TMG) estimator which is consistent at the irregular\nrate of n^{1/3} even if the time dimension of the panel is as small as the\nnumber of its regressors. Extensions to panels with time effects are provided,\nand a Hausman-type test of correlated heterogeneity is proposed. Small sample\nproperties of the TMG estimator (with and without time effects) are\ninvestigated by Monte Carlo experiments and shown to be satisfactory and to\nperform better than other trimmed estimators proposed in the literature. 
The\nproposed test of correlated heterogeneity is also shown to have the correct\nsize and satisfactory power. The utility of the TMG approach is illustrated\nwith an empirical application."}, "http://arxiv.org/abs/2310.11962": {"title": "Machine Learning for Staggered Difference-in-Differences and Dynamic Treatment Effect Heterogeneity", "link": "http://arxiv.org/abs/2310.11962", "description": "We combine two recently proposed nonparametric difference-in-differences\nmethods, extending them to enable the examination of treatment effect\nheterogeneity in the staggered adoption setting using machine learning. The\nproposed method, machine learning difference-in-differences (MLDID), allows for\nestimation of time-varying conditional average treatment effects on the\ntreated, which can be used to conduct detailed inference on drivers of\ntreatment effect heterogeneity. We perform simulations to evaluate the\nperformance of MLDID and find that it accurately identifies the true predictors\nof treatment effect heterogeneity. We then use MLDID to evaluate the\nheterogeneous impacts of Brazil's Family Health Program on infant mortality,\nand find those in poverty and urban locations experienced the impact of the\npolicy more quickly than other subgroups."}, "http://arxiv.org/abs/2310.11969": {"title": "Survey calibration for causal inference: a simple method to balance covariate distributions", "link": "http://arxiv.org/abs/2310.11969", "description": "This paper proposes a simple method for balancing distributions of covariates\nfor causal inference based on observational studies. The method makes it\npossible to balance an arbitrary number of quantiles (e.g., medians, quartiles,\nor deciles) together with means if necessary. The proposed approach is based on\nthe theory of calibration estimators (Deville and S\\\"arndal 1992), in\nparticular, calibration estimators for quantiles, proposed by Harms and\nDuchesne (2006). By modifying the entropy balancing method and the covariate\nbalancing propensity score method, it is possible to balance the distributions\nof the treatment and control groups. The method does not require numerical\nintegration, kernel density estimation or assumptions about the distributions;\nvalid estimates can be obtained by drawing on existing asymptotic theory.\nResults of a simulation study indicate that the method efficiently estimates\naverage treatment effects on the treated (ATT), the average treatment effect\n(ATE), the quantile treatment effect on the treated (QTT) and the quantile\ntreatment effect (QTE), especially in the presence of non-linearity and\nmis-specification of the models. The proposed methods are implemented in an\nopen source R package jointCalib."}, "http://arxiv.org/abs/2203.04080": {"title": "On Robust Inference in Time Series Regression", "link": "http://arxiv.org/abs/2203.04080", "description": "Least squares regression with heteroskedasticity and autocorrelation\nconsistent (HAC) standard errors has proved very useful in cross section\nenvironments. However, several major difficulties, which are generally\noverlooked, must be confronted when transferring the HAC estimation technology\nto time series environments. First, in plausible time-series environments\ninvolving failure of strong exogeneity, OLS parameter estimates can be\ninconsistent, so that HAC inference fails even asymptotically. Second, most\neconomic time series have strong autocorrelation, which renders HAC regression\nparameter estimates highly inefficient. 
Third, strong autocorrelation similarly\nrenders HAC conditional predictions highly inefficient. Finally, the structure\nof popular HAC estimators is ill-suited for capturing the autoregressive\nautocorrelation typically present in economic time series, which produces large\nsize distortions and reduced power in HAC-based hypothesis testing, in all but\nthe largest samples. We show that all four problems are largely avoided by the\nuse of a simple dynamic regression procedure, which is easily implemented. We\ndemonstrate the advantages of dynamic regression with detailed simulations\ncovering a range of practical issues."}, "http://arxiv.org/abs/2308.15338": {"title": "Another Look at the Linear Probability Model and Nonlinear Index Models", "link": "http://arxiv.org/abs/2308.15338", "description": "We reassess the use of linear models to approximate response probabilities of\nbinary outcomes, focusing on average partial effects (APE). We confirm that\nlinear projection parameters coincide with APEs in certain scenarios. Through\nsimulations, we identify other cases where OLS does or does not approximate\nAPEs and find that having a large fraction of fitted values in [0, 1] is neither\nnecessary nor sufficient. We also show nonlinear least squares estimation of\nthe ramp model is consistent and asymptotically normal and is equivalent to\nusing OLS on an iteratively trimmed sample to reduce bias. Our findings offer\npractical guidance for empirical research."}, "http://arxiv.org/abs/2309.10642": {"title": "Testing and correcting sample selection in academic achievement comparisons", "link": "http://arxiv.org/abs/2309.10642", "description": "Country comparisons using standardized test scores may in some cases be\nmisleading unless we make sure that the potential sample selection bias created\nby drop-outs and non-enrollment patterns does not alter the analysis. In this\npaper, I propose an answer to this issue, which consists of identifying the\ncounterfactual distribution of achievement (that is, the distribution of\nachievement if there were hypothetically no selection) from the observed\ndistribution of achievements. International comparison measures like means,\nquantiles, and inequality measures have to be computed using that\ncounterfactual distribution, which is statistically closer to the observed one\nfor a low proportion of out-of-school children. I identify the quantiles of\nthat latent distribution by readjusting the percentile levels of the observed\nquantile function of achievement. Because the data on test scores is by nature\ntruncated, I have to rely on auxiliary data to borrow identification power. I\nfinally apply my method to compute selection-corrected means using PISA 2018\nand PASEC 2019 and find that rankings/comparisons can change."}, "http://arxiv.org/abs/2310.01104": {"title": "Multi-period static hedging of European options", "link": "http://arxiv.org/abs/2310.01104", "description": "We consider the hedging of European options when the price of the underlying\nasset follows a single-factor Markovian framework. By working in such a\nsetting, Carr and Wu \cite{carr2014static} derived a spanning relation between\na given option and a continuum of shorter-term options written on the same\nasset. In this paper, we extend their approach to simultaneously include\noptions over multiple short maturities. We then show a practical implementation\nof this with a finite set of shorter-term options to determine the hedging\nerror using a Gaussian Quadrature method. 
We perform a wide range of\nexperiments for both the \textit{Black-Scholes} and \textit{Merton Jump\nDiffusion} models, illustrating the comparative performance of the two methods."}, "http://arxiv.org/abs/2310.02414": {"title": "On Optimal Set Estimation for Partially Identified Binary Choice Models", "link": "http://arxiv.org/abs/2310.02414", "description": "In this paper we reconsider the notion of optimality in estimation of\npartially identified models. We illustrate the general problem in the context\nof a semiparametric binary choice model with discrete covariates as an example\nof a model which is partially identified as shown in, e.g. Bierens and Hartog\n(1988). A set estimator for the regression coefficients in the model can be\nconstructed by implementing the Maximum Score procedure proposed by Manski\n(1975). For many designs this procedure converges to the identified set for\nthese parameters, and so in one sense is optimal. But as shown in Komarova\n(2013), for other cases the Maximum Score objective function gives an outer\nregion of the identified set. This motivates alternative methods that are\noptimal in the sense that they converge to the identified region in all\ndesigns, and we propose and compare such procedures. One is a Hodges-type\nestimator combining the Maximum Score estimator with existing procedures. A\nsecond is a two-step estimator using a Maximum Score-type objective function in\nthe second step. Lastly we propose a new random set quantile estimator,\nmotivated by definitions introduced in Molchanov (2006). Extensions of these\nideas for the cross sectional model to static and dynamic discrete panel data\nmodels are also provided."}, "http://arxiv.org/abs/2310.12825": {"title": "Nonparametric Regression with Dyadic Data", "link": "http://arxiv.org/abs/2310.12825", "description": "This paper studies the identification and estimation of a nonparametric\nnonseparable dyadic model where the structural function and the distribution of\nthe unobservable random terms are assumed to be unknown. The identification and\nthe estimation of the distribution of the unobservable random term are also\nprovided. I assume that the structural function is continuous and strictly\nincreasing in the unobservable heterogeneity. I propose a suitable normalization\nfor identification by allowing the structural function to have some\ndesirable properties such as homogeneity of degree one in the unobservable\nrandom term and some of its observables. The consistency and the asymptotic\ndistribution of the estimators are established. The finite sample properties of\nthe proposed estimators are assessed in a Monte Carlo simulation."}, "http://arxiv.org/abs/2310.12863": {"title": "Moment-dependent phase transitions in high-dimensional Gaussian approximations", "link": "http://arxiv.org/abs/2310.12863", "description": "High-dimensional central limit theorems have been intensively studied with\nmost focus being on the case where the data is sub-Gaussian or sub-exponential.\nHowever, heavier tails are omnipresent in practice. In this article, we study\nthe critical growth rates of dimension $d$ below which Gaussian approximations\nare asymptotically valid but beyond which they are not. We are particularly\ninterested in how these thresholds depend on the number of moments $m$ that the\nobservations possess. For every $m\in(2,\infty)$, we construct i.i.d. 
random\nvectors $\textbf{X}_1,...,\textbf{X}_n$ in $\mathbb{R}^d$, the entries of which\nare independent and have a common distribution (independent of $n$ and $d$)\nwith finite $m$th absolute moment, and such that the following holds: if there\nexists an $\varepsilon\in(0,\infty)$ such that $d/n^{m/2-1+\varepsilon}\not\to\n0$, then the Gaussian approximation error (GAE) satisfies $$\n\limsup_{n\to\infty}\sup_{t\in\mathbb{R}}\left[\mathbb{P}\left(\max_{1\leq\nj\leq d}\frac{1}{\sqrt{n}}\sum_{i=1}^n\textbf{X}_{ij}\leq\nt\right)-\mathbb{P}\left(\max_{1\leq j\leq d}\textbf{Z}_j\leq\nt\right)\right]=1,$$ where $\textbf{Z} \sim\n\mathsf{N}_d(\textbf{0}_d,\mathbf{I}_d)$. On the other hand, a result in\nChernozhukov et al. (2023a) implies that the left-hand side above is zero if\njust $d/n^{m/2-1-\varepsilon}\to 0$ for some $\varepsilon\in(0,\infty)$. In\nthis sense, there is a moment-dependent phase transition at the threshold\n$d=n^{m/2-1}$ above which the limiting GAE jumps from zero to one."}, "http://arxiv.org/abs/2209.11840": {"title": "Revisiting the Analysis of Matched-Pair and Stratified Experiments in the Presence of Attrition", "link": "http://arxiv.org/abs/2209.11840", "description": "In this paper we revisit some common recommendations regarding the analysis\nof matched-pair and stratified experimental designs in the presence of\nattrition. Our main objective is to clarify a number of well-known claims about\nthe practice of dropping pairs with an attrited unit when analyzing\nmatched-pair designs. Contradictory advice appears in the literature about\nwhether dropping pairs is beneficial or harmful, and stratifying into\nlarger groups has been recommended as a resolution to the issue. To address\nthese claims, we derive the estimands obtained from the difference-in-means\nestimator in a matched-pair design both when the observations from pairs with\nan attrited unit are retained and when they are dropped. We find limited\nevidence to support the claims that dropping pairs helps recover the average\ntreatment effect, but we find that it may potentially help in recovering a\nconvex weighted average of conditional average treatment effects. We report\nsimilar findings for stratified designs when studying the estimands obtained\nfrom a regression of outcomes on treatment with and without strata fixed\neffects."}, "http://arxiv.org/abs/2210.04523": {"title": "An identification and testing strategy for proxy-SVARs with weak proxies", "link": "http://arxiv.org/abs/2210.04523", "description": "When proxies (external instruments) used to identify target structural shocks\nare weak, inference in proxy-SVARs (SVAR-IVs) is nonstandard and the\nconstruction of asymptotically valid confidence sets for the impulse responses\nof interest requires weak-instrument robust methods. In the presence of\nmultiple target shocks, test inversion techniques require extra restrictions on\nthe proxy-SVAR parameters other than those implied by the proxies, which may be\ndifficult to interpret and test. We show that frequentist asymptotic inference\nin these situations can be conducted through Minimum Distance estimation and\nstandard asymptotic methods if the proxy-SVAR can be identified by using\n`strong' instruments for the non-target shocks; i.e. the shocks which are not\nof primary interest in the analysis. 
The suggested identification strategy\nhinges on a novel pre-test for the null of instrument relevance based on\nbootstrap resampling, which is not subject to pre-testing issues, in the sense\nthat the validity of post-test asymptotic inferences is not affected by the\noutcomes of the test. The test is robust to conditional heteroskedasticity\nand/or zero-censored proxies, is computationally straightforward and applicable\nregardless of the number of shocks being instrumented. Some illustrative\nexamples show the empirical usefulness of the suggested identification and\ntesting strategy."}, "http://arxiv.org/abs/2301.07241": {"title": "Unconditional Quantile Partial Effects via Conditional Quantile Regression", "link": "http://arxiv.org/abs/2301.07241", "description": "This paper develops a semi-parametric procedure for estimation of\nunconditional quantile partial effects using quantile regression coefficients.\nThe estimator is based on an identification result showing that, for continuous\ncovariates, unconditional quantile effects are a weighted average of\nconditional ones at particular quantile levels that depend on the covariates.\nWe propose a two-step estimator for the unconditional effects where in the\nfirst step one estimates a structural quantile regression model, and in the\nsecond step a nonparametric regression is applied to the first step\ncoefficients. We establish the asymptotic properties of the estimator, namely\nconsistency and asymptotic normality. Monte Carlo simulations provide numerical\nevidence that the estimator has very good finite sample performance and is\nrobust to the selection of bandwidth and kernel. To illustrate the proposed\nmethod, we study the canonical application of Engel's curve, i.e. food\nexpenditures as a share of income."}, "http://arxiv.org/abs/2302.04380": {"title": "Covariate Adjustment in Experiments with Matched Pairs", "link": "http://arxiv.org/abs/2302.04380", "description": "This paper studies inference on the average treatment effect in experiments\nin which treatment status is determined according to \"matched pairs\" and it is\nadditionally desired to adjust for observed, baseline covariates to gain\nfurther precision. By a \"matched pairs\" design, we mean that units are sampled\ni.i.d. from the population of interest, paired according to observed, baseline\ncovariates and finally, within each pair, one unit is selected at random for\ntreatment. Importantly, we presume that not all observed, baseline covariates\nare used in determining treatment assignment. We study a broad class of\nestimators based on a \"doubly robust\" moment condition that permits us to study\nestimators with both finite-dimensional and high-dimensional forms of covariate\nadjustment. We find that estimators with finite-dimensional, linear adjustments\nneed not lead to improvements in precision relative to the unadjusted\ndifference-in-means estimator. This phenomenon persists even if the adjustments\nare interacted with treatment; in fact, doing so leads to no changes in\nprecision. However, gains in precision can be ensured by including fixed\neffects for each of the pairs. Indeed, we show that this adjustment is the\n\"optimal\" finite-dimensional, linear adjustment. We additionally study two\nestimators with high-dimensional forms of covariate adjustment based on the\nLASSO. 
For each such estimator, we show that it leads to improvements in\nprecision relative to the unadjusted difference-in-means estimator and also\nprovide conditions under which it leads to the \"optimal\" nonparametric,\ncovariate adjustment. A simulation study confirms the practical relevance of\nour theoretical analysis, and the methods are employed to reanalyze data from\nan experiment using a \"matched pairs\" design to study the effect of\nmacroinsurance on microenterprise."}, "http://arxiv.org/abs/2305.03134": {"title": "Debiased inference for dynamic nonlinear models with two-way fixed effects", "link": "http://arxiv.org/abs/2305.03134", "description": "Panel data models often use fixed effects to account for unobserved\nheterogeneities. These fixed effects are typically incidental parameters and\ntheir estimators converge slowly relative to the square root of the sample\nsize. In the maximum likelihood context, this induces an asymptotic bias of the\nlikelihood function. Test statistics derived from the asymptotically biased\nlikelihood, therefore, no longer follow their standard limiting distributions.\nThis causes severe distortions in test sizes. We consider a generic class of\ndynamic nonlinear models with two-way fixed effects and propose an analytical\nbias correction method for the likelihood function. We formally show that the\nlikelihood ratio, the Lagrange-multiplier, and the Wald test statistics derived\nfrom the corrected likelihood follow their standard asymptotic distributions. A\nbias-corrected estimator of the structural parameters can also be derived from\nthe corrected likelihood function. We evaluate the performance of our bias\ncorrection procedure through simulations and an empirical example."}, "http://arxiv.org/abs/2310.13240": {"title": "Transparency challenges in policy evaluation with causal machine learning -- improving usability and accountability", "link": "http://arxiv.org/abs/2310.13240", "description": "Causal machine learning tools are beginning to see use in real-world policy\nevaluation tasks to flexibly estimate treatment effects. One issue with these\nmethods is that the machine learning models used are generally black boxes,\ni.e., there is no globally interpretable way to understand how a model makes\nestimates. This is a clear problem in policy evaluation applications,\nparticularly in government, because it is difficult to understand whether such\nmodels are functioning in ways that are fair, based on the correct\ninterpretation of evidence and transparent enough to allow for accountability\nif things go wrong. However, there has been little discussion of transparency\nproblems in the causal machine learning literature and how these might be\novercome. This paper explores why transparency issues are a problem for causal\nmachine learning in public policy evaluation applications and considers ways\nthese problems might be addressed through explainable AI tools and by\nsimplifying models in line with interpretable AI principles. It then applies\nthese ideas to a case-study using a causal forest model to estimate conditional\naverage treatment effects for a hypothetical change in the school leaving age\nin Australia. It shows that existing tools for understanding black-box\npredictive models are poorly suited to causal machine learning and that\nsimplifying the model to make it interpretable leads to an unacceptable\nincrease in error (in this application). 
It concludes that new tools are needed\nto properly understand causal machine learning models and the algorithms that\nfit them."}, "http://arxiv.org/abs/2308.04276": {"title": "Causal Interpretation of Linear Social Interaction Models with Endogenous Networks", "link": "http://arxiv.org/abs/2308.04276", "description": "This study investigates the causal interpretation of linear social\ninteraction models in the presence of endogeneity in network formation under a\nheterogeneous treatment effects framework. We consider an experimental setting\nin which individuals are randomly assigned to treatments while no interventions\nare made for the network structure. We show that running a linear regression\nignoring network endogeneity is not problematic for estimating the average\ndirect treatment effect. However, it leads to sample selection bias and\nnegative-weights problem for the estimation of the average spillover effect. To\novercome these problems, we propose using potential peer treatment as an\ninstrumental variable (IV), which is automatically a valid IV for actual\nspillover exposure. Using this IV, we examine two IV-based estimands and\ndemonstrate that they have a local average treatment-effect-type causal\ninterpretation for the spillover effect."}, "http://arxiv.org/abs/2309.01889": {"title": "The Local Projection Residual Bootstrap for AR(1) Models", "link": "http://arxiv.org/abs/2309.01889", "description": "This paper proposes a local projection residual bootstrap method to construct\nconfidence intervals for impulse response coefficients of AR(1) models. Our\nbootstrap method is based on the local projection (LP) approach and a residual\nbootstrap procedure. We present theoretical results for our bootstrap method\nand proposed confidence intervals. First, we prove the uniform consistency of\nthe LP-residual bootstrap over a large class of AR(1) models that allow for a\nunit root. Then, we prove the asymptotic validity of our confidence intervals\nover the same class of AR(1) models. Finally, we show that the LP-residual\nbootstrap provides asymptotic refinements for confidence intervals on a\nrestricted class of AR(1) models relative to those required for the uniform\nconsistency of our bootstrap."}, "http://arxiv.org/abs/2310.13785": {"title": "Bayesian Estimation of Panel Models under Potentially Sparse Heterogeneity", "link": "http://arxiv.org/abs/2310.13785", "description": "We incorporate a version of a spike and slab prior, comprising a pointmass at\nzero (\"spike\") and a Normal distribution around zero (\"slab\") into a dynamic\npanel data framework to model coefficient heterogeneity. In addition to\nhomogeneity and full heterogeneity, our specification can also capture sparse\nheterogeneity, that is, there is a core group of units that share common\nparameters and a set of deviators with idiosyncratic parameters. We fit a model\nwith unobserved components to income data from the Panel Study of Income\nDynamics. We find evidence for sparse heterogeneity for balanced panels\ncomposed of individuals with long employment histories."}, "http://arxiv.org/abs/2310.14068": {"title": "Unobserved Grouped Heteroskedasticity and Fixed Effects", "link": "http://arxiv.org/abs/2310.14068", "description": "This paper extends the linear grouped fixed effects (GFE) panel model to\nallow for heteroskedasticity from a discrete latent group variable. 
Key\nfeatures of GFE are preserved, such as individuals belonging to one of a finite\nnumber of groups and group membership being unrestricted and estimated. Ignoring\ngroup heteroskedasticity may lead to poor classification, which worsens the\nfinite sample bias and standard errors of estimators. I introduce the\n\"weighted grouped fixed effects\" (WGFE) estimator that minimizes a weighted\naverage of group sums of squared residuals. I establish $\sqrt{NT}$-consistency\nand normality under a concept of group separation based on second moments. A\ntest of group homoskedasticity is discussed. A fast computation procedure is\nprovided. Simulations show that WGFE outperforms alternatives that exclude\nsecond moment information. I demonstrate this approach by considering the link\nbetween income and democracy and the effect of unionization on earnings."}, "http://arxiv.org/abs/2310.14142": {"title": "On propensity score matching with a diverging number of matches", "link": "http://arxiv.org/abs/2310.14142", "description": "This paper reexamines Abadie and Imbens (2016)'s work on propensity score\nmatching for average treatment effect estimation. We explore the asymptotic\nbehavior of these estimators when the number of nearest neighbors, $M$, grows\nwith the sample size. It is shown, in a result that is hardly surprising but technically\nnontrivial, that the modified estimators can improve upon the original\nfixed-$M$ estimators in terms of efficiency. Additionally, we demonstrate the\npotential to attain the semiparametric efficiency lower bound when the\npropensity score achieves \"sufficient\" dimension reduction, echoing Hahn\n(1998)'s insight about the role of dimension reduction in propensity\nscore-based causal inference."}, "http://arxiv.org/abs/2310.14438": {"title": "BVARs and Stochastic Volatility", "link": "http://arxiv.org/abs/2310.14438", "description": "Bayesian vector autoregressions (BVARs) are the workhorse in macroeconomic\nforecasting. Research in the last decade has established the importance of\nallowing time-varying volatility to capture both secular and cyclical\nvariations in macroeconomic uncertainty. This recognition, together with the\ngrowing availability of large datasets, has propelled a surge in recent\nresearch in building stochastic volatility models suitable for large BVARs.\nSome of these new models are also equipped with additional features that are\nespecially desirable for large systems, such as order invariance -- i.e.,\nestimates are not dependent on how the variables are ordered in the BVAR -- and\nrobustness against COVID-19 outliers. Estimation of these large, flexible\nmodels is made possible by the recently developed equation-by-equation approach\nthat drastically reduces the computational cost of estimating large systems.\nDespite these recent advances, there remains much ongoing work, such as the\ndevelopment of parsimonious approaches for time-varying coefficients and other\ntypes of nonlinearities in large BVARs."}, "http://arxiv.org/abs/2310.14983": {"title": "Causal clustering: design of cluster experiments under network interference", "link": "http://arxiv.org/abs/2310.14983", "description": "This paper studies the design of cluster experiments to estimate the global\ntreatment effect in the presence of spillovers on a single network. We provide\nan econometric framework to choose the clustering that minimizes the worst-case\nmean-squared error of the estimated global treatment effect. 
We show that the\noptimal clustering can be approximated as the solution of a novel penalized\nmin-cut optimization problem computed via off-the-shelf semi-definite\nprogramming algorithms. Our analysis also characterizes easy-to-check\nconditions to choose between a cluster or individual-level randomization. We\nillustrate the method's properties using unique network data from the universe\nof Facebook's users and existing network data from a field experiment."}, "http://arxiv.org/abs/2004.08318": {"title": "Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions", "link": "http://arxiv.org/abs/2004.08318", "description": "We study causal inference under case-control and case-population sampling.\nSpecifically, we focus on the binary-outcome and binary-treatment case, where\nthe parameters of interest are causal relative and attributable risks defined\nvia the potential outcome framework. It is shown that strong ignorability is\nnot always as powerful as it is under random sampling and that certain\nmonotonicity assumptions yield comparable results in terms of sharp identified\nintervals. Specifically, the usual odds ratio is shown to be a sharp identified\nupper bound on causal relative risk under the monotone treatment response and\nmonotone treatment selection assumptions. We offer algorithms for inference on\nthe causal parameters that are aggregated over the true population distribution\nof the covariates. We show the usefulness of our approach by studying three\nempirical examples: the benefit of attending private school for entering a\nprestigious university in Pakistan; the relationship between staying in school\nand getting involved with drug-trafficking gangs in Brazil; and the link\nbetween physicians' hours and size of the group practice in the United States."}, "http://arxiv.org/abs/2108.07455": {"title": "Causal Inference with Noncompliance and Unknown Interference", "link": "http://arxiv.org/abs/2108.07455", "description": "We consider a causal inference model in which individuals interact in a\nsocial network and they may not comply with the assigned treatments. In\nparticular, we suppose that the form of network interference is unknown to\nresearchers. To estimate meaningful causal parameters in this situation, we\nintroduce a new concept of exposure mapping, which summarizes potentially\ncomplicated spillover effects into a fixed dimensional statistic of\ninstrumental variables. We investigate identification conditions for the\nintention-to-treat effects and the average treatment effects for compliers,\nwhile explicitly considering the possibility of misspecification of exposure\nmapping. Based on our identification results, we develop nonparametric\nestimation procedures via inverse probability weighting. Their asymptotic\nproperties, including consistency and asymptotic normality, are investigated\nusing an approximate neighborhood interference framework. For an empirical\nillustration, we apply our method to experimental data on the anti-conflict\nintervention school program. The proposed methods are readily available with\nthe companion R package latenetwork."}, "http://arxiv.org/abs/2112.03872": {"title": "Nonparametric Treatment Effect Identification in School Choice", "link": "http://arxiv.org/abs/2112.03872", "description": "This paper studies nonparametric identification and estimation of causal\neffects in centralized school assignment. 
In many centralized assignment\nsettings, students are subjected to both lottery-driven variation and\nregression discontinuity (RD) driven variation. We characterize the full set of\nidentified atomic treatment effects (aTEs), defined as the conditional average\ntreatment effect between a pair of schools, given student characteristics.\nAtomic treatment effects are the building blocks of more aggregated notions of\ntreatment contrasts, and common approaches estimating aggregations of aTEs can\nmask important heterogeneity. In particular, many aggregations of aTEs put zero\nweight on aTEs driven by RD variation, and estimators of such aggregations put\nasymptotically vanishing weight on the RD-driven aTEs. We develop a diagnostic\ntool for empirically assessing the weight put on aTEs driven by RD variation.\nLastly, we provide estimators and accompanying asymptotic results for inference\non aggregations of RD-driven aTEs."}, "http://arxiv.org/abs/2203.01425": {"title": "A Modern Gauss-Markov Theorem? Really?", "link": "http://arxiv.org/abs/2203.01425", "description": "We show that the theorems in Hansen (2021a) (the version accepted by\nEconometrica), except for one, are not new as they coincide with classical\ntheorems like the good old Gauss-Markov or Aitken Theorem, respectively; the\nexceptional theorem is incorrect. Hansen (2021b) corrects this theorem. As a\nresult, all theorems in the latter version coincide with the above mentioned\nclassical theorems. Furthermore, we also show that the theorems in Hansen\n(2022) (the version published in Econometrica) either coincide with the\nclassical theorems just mentioned, or contain extra assumptions that are alien\nto the Gauss-Markov or Aitken Theorem."}, "http://arxiv.org/abs/2204.12723": {"title": "Information-theoretic limitations of data-based price discrimination", "link": "http://arxiv.org/abs/2204.12723", "description": "This paper studies third-degree price discrimination (3PD) based on a random\nsample of valuation and covariate data, where the covariate is continuous, and\nthe distribution of the data is unknown to the seller. The main results of this\npaper are twofold. 
The first set of results is pricing strategy independent and\nreveals the fundamental information-theoretic limitation of any data-based\npricing strategy in revenue generation for two cases: 3PD and uniform pricing.\nThe second set of results proposes the $K$-markets empirical revenue\nmaximization (ERM) strategy and shows that the $K$-markets ERM and the uniform\nERM strategies achieve the optimal rate of convergence in revenue to that\ngenerated by their respective true-distribution 3PD and uniform pricing optima.\nOur theoretical and numerical results suggest that the uniform (i.e.,\n$1$-market) ERM strategy generates a larger revenue than the $K$-markets ERM\nstrategy when the sample size is small enough, and vice versa."}, "http://arxiv.org/abs/2304.12698": {"title": "Enhanced multilayer perceptron with feature selection and grid search for travel mode choice prediction", "link": "http://arxiv.org/abs/2304.12698", "description": "Accurate and reliable prediction of individual travel mode choices is crucial\nfor developing multi-mode urban transportation systems, conducting\ntransportation planning and formulating traffic demand management strategies.\nTraditional discrete choice models have dominated the modelling methods for\ndecades yet suffer from strict model assumptions and low prediction accuracy.\nIn recent years, machine learning (ML) models, such as neural networks and\nboosting models, are widely used by researchers for travel mode choice\nprediction and have yielded promising results. However, despite the superior\nprediction performance, a large body of ML methods, especially the branch of\nneural network models, is also limited by overfitting and tedious model\nstructure determination process. To bridge this gap, this study proposes an\nenhanced multilayer perceptron (MLP; a neural network) with two hidden layers\nfor travel mode choice prediction; this MLP is enhanced by XGBoost (a boosting\nmethod) for feature selection and a grid search method for optimal hidden\nneurone determination of each hidden layer. The proposed method was trained and\ntested on a real resident travel diary dataset collected in Chengdu, China."}, "http://arxiv.org/abs/2306.02584": {"title": "Synthetic Regressing Control Method", "link": "http://arxiv.org/abs/2306.02584", "description": "Estimating weights in the synthetic control method, typically resulting in\nsparse weights where only a few control units have non-zero weights, involves\nan optimization procedure that simultaneously selects and aligns control units\nto closely match the treated unit. However, this simultaneous selection and\nalignment of control units may lead to a loss of efficiency. Another concern\narising from the aforementioned procedure is its susceptibility to\nunder-fitting due to imperfect pre-treatment fit. It is not uncommon for the\nlinear combination, using nonnegative weights, of pre-treatment period outcomes\nfor the control units to inadequately approximate the pre-treatment outcomes\nfor the treated unit. To address both of these issues, this paper proposes a\nsimple and effective method called Synthetic Regressing Control (SRC). The SRC\nmethod begins by performing the univariate linear regression to appropriately\nalign the pre-treatment periods of the control units with the treated unit.\nSubsequently, a SRC estimator is obtained by synthesizing (taking a weighted\naverage) the fitted controls. 
To determine the weights in the synthesis\nprocedure, we propose an approach that utilizes a criterion based on an unbiased risk\nestimator. Theoretically, we show that the synthesis procedure is asymptotically\noptimal in the sense of achieving the lowest possible squared error. Extensive\nnumerical experiments highlight the advantages of the SRC method."}, "http://arxiv.org/abs/2308.12470": {"title": "Scalable Estimation of Multinomial Response Models with Uncertain Consideration Sets", "link": "http://arxiv.org/abs/2308.12470", "description": "A standard assumption in the fitting of unordered multinomial response models\nfor $J$ mutually exclusive nominal categories, on cross-sectional or\nlongitudinal data, is that the responses arise from the same set of $J$\ncategories across subjects. However, when responses measure a choice made by\nthe subject, it is more appropriate to assume that the distribution of\nmultinomial responses is conditioned on a subject-specific consideration set,\nwhere this consideration set is drawn from the power set of $\{1,2,\ldots,J\}$.\nBecause the cardinality of this power set is exponential in $J$, estimation is\ninfeasible in general. In this paper, we provide an approach to overcoming this\nproblem. A key step in the approach is a probability model over consideration\nsets, based on a general representation of probability distributions on\ncontingency tables, which results in mixtures of independent consideration\nmodels. Although the support of this distribution is exponentially large, the\nposterior distribution over consideration sets given parameters is typically\nsparse, and is easily sampled in an MCMC scheme. We show posterior consistency\nof the parameters of the conditional response model and the distribution of\nconsideration sets. The effectiveness of the methodology is documented in\nsimulated longitudinal data sets with $J=100$ categories and real data from the\ncereal market with $J=68$ brands."}, "http://arxiv.org/abs/2310.15512": {"title": "Inference for Rank-Rank Regressions", "link": "http://arxiv.org/abs/2310.15512", "description": "Slope coefficients in rank-rank regressions are popular measures of\nintergenerational mobility, for instance in regressions of a child's income\nrank on their parent's income rank. In this paper, we first point out that\ncommonly used variance estimators such as the homoskedastic or robust variance\nestimators do not consistently estimate the asymptotic variance of the OLS\nestimator in a rank-rank regression. We show that the probability limits of\nthese estimators may be too large or too small depending on the shape of the\ncopula of child and parent incomes. Second, we derive a general asymptotic\ntheory for rank-rank regressions and provide a consistent estimator of the OLS\nestimator's asymptotic variance. We then extend the asymptotic theory to other\nregressions involving ranks that have been used in empirical work. Finally, we\napply our new inference methods to three empirical studies. We find that the\nconfidence intervals based on estimators of the correct variance may sometimes\nbe substantially shorter and sometimes substantially longer than those based on\ncommonly used variance estimators. The differences in confidence intervals\nconcern economically meaningful values of mobility and thus lead to different\nconclusions when comparing mobility in U.S. 
commuting zones with mobility in\nother countries."}, "http://arxiv.org/abs/2310.15796": {"title": "Testing for equivalence of pre-trends in Difference-in-Differences estimation", "link": "http://arxiv.org/abs/2310.15796", "description": "The plausibility of the ``parallel trends assumption'' in\nDifference-in-Differences estimation is usually assessed by a test of the null\nhypothesis that the difference between the average outcomes of both groups is\nconstant over time before the treatment. However, failure to reject the null\nhypothesis does not imply the absence of differences in time trends between\nboth groups. We provide equivalence tests that allow researchers to find\nevidence in favor of the parallel trends assumption and thus increase the\ncredibility of their treatment effect estimates. While we motivate our tests in\nthe standard two-way fixed effects model, we discuss simple extensions to\nsettings in which treatment adoption is staggered over time."}, "http://arxiv.org/abs/1712.04802": {"title": "Fisher-Schultz Lecture: Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments, with an Application to Immunization in India", "link": "http://arxiv.org/abs/1712.04802", "description": "We propose strategies to estimate and make inference on key features of\nheterogeneous effects in randomized experiments. These key features include\nbest linear predictors of the effects using machine learning proxies, average\neffects sorted by impact groups, and average characteristics of most and least\nimpacted units. The approach is valid in high dimensional settings, where the\neffects are proxied (but not necessarily consistently estimated) by predictive\nand causal machine learning methods. We post-process these proxies into\nestimates of the key features. Our approach is generic, it can be used in\nconjunction with penalized methods, neural networks, random forests, boosted\ntrees, and ensemble methods, both predictive and causal. Estimation and\ninference are based on repeated data splitting to avoid overfitting and achieve\nvalidity. We use quantile aggregation of the results across many potential\nsplits, in particular taking medians of p-values and medians and other\nquantiles of confidence intervals. We show that quantile aggregation lowers\nestimation risks over a single split procedure, and establish its principal\ninferential properties. Finally, our analysis reveals ways to build provably\nbetter machine learning proxies through causal learning: we can use the\nobjective functions that we develop to construct the best linear predictors of\nthe effects, to obtain better machine learning proxies in the initial step. We\nillustrate the use of both inferential tools and causal learners with a\nrandomized field experiment that evaluates a combination of nudges to stimulate\ndemand for immunization in India."}, "http://arxiv.org/abs/2301.09016": {"title": "Inference for Two-stage Experiments under Covariate-Adaptive Randomization", "link": "http://arxiv.org/abs/2301.09016", "description": "This paper studies inference in two-stage randomized experiments under\ncovariate-adaptive randomization. In the initial stage of this experimental\ndesign, clusters (e.g., households, schools, or graph partitions) are\nstratified and randomly assigned to control or treatment groups based on\ncluster-level covariates. 
Subsequently, an independent second-stage design is\ncarried out, wherein units within each treated cluster are further stratified\nand randomly assigned to either control or treatment groups, based on\nindividual-level covariates. Under the homogeneous partial interference\nassumption, I establish conditions under which the proposed\ndifference-in-\"average of averages\" estimators are consistent and\nasymptotically normal for the corresponding average primary and spillover\neffects and develop consistent estimators of their asymptotic variances.\nCombining these results establishes the asymptotic validity of tests based on\nthese estimators. My findings suggest that ignoring covariate information in\nthe design stage can result in efficiency loss, and commonly used inference\nmethods that ignore or improperly use covariate information can lead to either\nconservative or invalid inference. Finally, I apply these results to studying\noptimal use of covariate information under covariate-adaptive randomization in\nlarge samples, and demonstrate that a specific generalized matched-pair design\nachieves minimum asymptotic variance for each proposed estimator. The practical\nrelevance of the theoretical results is illustrated through a simulation study\nand an empirical application."}, "http://arxiv.org/abs/2306.12003": {"title": "Difference-in-Differences with Interference: A Finite Population Perspective", "link": "http://arxiv.org/abs/2306.12003", "description": "In many scenarios, such as the evaluation of place-based policies, potential\noutcomes are not only dependent upon the unit's own treatment but also its\nneighbors' treatment. Despite this, \"difference-in-differences\" (DID) type\nestimators typically ignore such interference among neighbors. I show in this\npaper that the canonical DID estimators generally fail to identify interesting\ncausal effects in the presence of neighborhood interference. To incorporate\ninterference structure into DID estimation, I propose doubly robust estimators\nfor the direct average treatment effect on the treated as well as the average\nspillover effects under a modified parallel trends assumption. When spillover\neffects are of interest, we often sample the entire population. Thus, I adopt a\nfinite population perspective in the sense that the estimands are defined as\npopulation averages and inference is conditional on the attributes of all\npopulation units. The approach in this paper relaxes common restrictions in the\nliterature, such as partial interference and correctly specified spillover\nfunctions. Moreover, robust inference is discussed based on the asymptotic\ndistribution of the proposed estimators."}, "http://arxiv.org/abs/2310.16281": {"title": "Improving Robust Decisions with Data", "link": "http://arxiv.org/abs/2310.16281", "description": "A decision-maker (DM) faces uncertainty governed by a data-generating process\n(DGP), which is only known to belong to a set of sequences of independent but\npossibly non-identical distributions. A robust decision maximizes the DM's\nexpected payoff against the worst possible DGP in this set. This paper studies\nhow such robust decisions can be improved with data, where improvement is\nmeasured by expected payoff under the true DGP. In this paper, I fully\ncharacterize when and how such an improvement can be guaranteed under all\npossible DGPs and develop inference methods to achieve it. 
These inference\nmethods are needed because, as this paper shows, common inference methods\n(e.g., maximum likelihood or Bayesian) often fail to deliver such an\nimprovement. Importantly, the developed inference methods are given by simple\naugmentations to standard inference procedures, and are thus easy to implement\nin practice."}, "http://arxiv.org/abs/2310.16290": {"title": "Fair Adaptive Experiments", "link": "http://arxiv.org/abs/2310.16290", "description": "Randomized experiments have been the gold standard for assessing the\neffectiveness of a treatment or policy. The classical complete randomization\napproach assigns treatments based on a prespecified probability and may lead to\ninefficient use of data. Adaptive experiments improve upon complete\nrandomization by sequentially learning and updating treatment assignment\nprobabilities. However, their application can also raise fairness and equity\nconcerns, as assignment probabilities may vary drastically across groups of\nparticipants. Furthermore, when treatment is expected to be extremely\nbeneficial to certain groups of participants, it is more appropriate to expose\nmany of these participants to favorable treatment. In response to these\nchallenges, we propose a fair adaptive experiment strategy that simultaneously\nenhances data use efficiency, achieves an envy-free treatment assignment\nguarantee, and improves the overall welfare of participants. An important\nfeature of our proposed strategy is that we do not impose parametric modeling\nassumptions on the outcome variables, making it more versatile and applicable\nto a wider array of applications. Through our theoretical investigation, we\ncharacterize the convergence rate of the estimated treatment effects and the\nassociated standard deviations at the group level and further prove that our\nadaptive treatment assignment algorithm, despite not having a closed-form\nexpression, approaches the optimal allocation rule asymptotically. Our proof\nstrategy takes into account the fact that the allocation decisions in our\ndesign depend on sequentially accumulated data, which poses a significant\nchallenge in characterizing the properties and conducting statistical inference\nof our method. We further provide simulation evidence to showcase the\nperformance of our fair adaptive experiment strategy."}, "http://arxiv.org/abs/2310.16638": {"title": "Covariate Shift Adaptation Robust to Density-Ratio Estimation", "link": "http://arxiv.org/abs/2310.16638", "description": "Consider a scenario where we have access to train data with both covariates\nand outcomes while test data only contains covariates. In this scenario, our\nprimary aim is to predict the missing outcomes of the test data. With this\nobjective in mind, we train parametric regression models under a covariate\nshift, where covariate distributions are different between the train and test\ndata. For this problem, existing studies have proposed covariate shift\nadaptation via importance weighting using the density ratio. This approach\naverages the train data losses, each weighted by an estimated ratio of the\ncovariate densities between the train and test data, to approximate the\ntest-data risk. 
Although it allows us to obtain a test-data risk minimizer, its\nperformance heavily relies on the accuracy of the density ratio estimation.\nMoreover, even if the density ratio can be consistently estimated, the\nestimation errors of the density ratio also yield bias in the estimators of the\nregression model's parameters of interest. To mitigate these challenges, we\nintroduce a doubly robust estimator for covariate shift adaptation via\nimportance weighting, which incorporates an additional estimator for the\nregression function. Leveraging double machine learning techniques, our\nestimator reduces the bias arising from the density ratio estimation errors. We\nderive the asymptotic distribution of the regression parameter estimator.\nNotably, our estimator remains consistent if either the density ratio estimator\nor the regression function is consistent, showcasing its robustness against\npotential errors in density ratio estimation. Finally, we confirm the soundness\nof our proposed method via simulation studies."}, "http://arxiv.org/abs/2310.16819": {"title": "CATE Lasso: Conditional Average Treatment Effect Estimation with High-Dimensional Linear Regression", "link": "http://arxiv.org/abs/2310.16819", "description": "In causal inference about two treatments, Conditional Average Treatment\nEffects (CATEs) play an important role as a quantity representing an\nindividualized causal effect, defined as a difference between the expected\noutcomes of the two treatments conditioned on covariates. This study assumes\ntwo linear regression models between a potential outcome and covariates of the\ntwo treatments and defines CATEs as a difference between the linear regression\nmodels. Then, we propose a method for consistently estimating CATEs even under\nhigh-dimensional and non-sparse parameters. In our study, we demonstrate that\ndesirable theoretical properties, such as consistency, remain attainable even\nwithout assuming sparsity explicitly, under a weaker assumption called\nimplicit sparsity originating from the definition of CATEs. In this assumption,\nwe suppose that parameters of linear models in potential outcomes can be\ndivided into treatment-specific and common parameters, where the\ntreatment-specific parameters take different values across the linear\nregression models, while the common parameters remain identical. Thus, in a\ndifference between two linear regression models, the common parameters\ndisappear, leaving only differences in the treatment-specific parameters.\nConsequently, the non-zero parameters in CATEs correspond to the differences in\nthe treatment-specific parameters. Leveraging this assumption, we develop a\nLasso regression method specialized for CATE estimation and show that the\nestimator is consistent. Finally, we confirm the soundness of the proposed\nmethod by simulation studies."}, "http://arxiv.org/abs/2203.06685": {"title": "Encompassing Tests for Nonparametric Regressions", "link": "http://arxiv.org/abs/2203.06685", "description": "We set up a formal framework to characterize encompassing of nonparametric\nmodels through the L2 distance. We contrast it to previous literature on the\ncomparison of nonparametric regression models. We then develop testing\nprocedures for the encompassing hypothesis that are fully nonparametric. Our\ntest statistics depend on kernel regression, raising the issue of bandwidth\nchoice. We investigate two alternative approaches to obtain a \"small bias\nproperty\" for our test statistics. 
We show the validity of a wild bootstrap\nmethod. We empirically study the use of a data-driven bandwidth and illustrate\nthe attractive features of our tests for small and moderate samples."}, "http://arxiv.org/abs/2212.11012": {"title": "Partly Linear Instrumental Variables Regressions without Smoothing on the Instruments", "link": "http://arxiv.org/abs/2212.11012", "description": "We consider a semiparametric partly linear model identified by instrumental\nvariables. We propose an estimation method that does not smooth on the\ninstruments and we extend the Landweber-Fridman regularization scheme to the\nestimation of this semiparametric model. We then show the asymptotic normality\nof the parametric estimator and obtain the convergence rate for the\nnonparametric estimator. Our estimator that does not smooth on the instruments\ncoincides with a typical estimator that does smooth on the instruments but\nkeeps the respective bandwidth fixed as the sample size increases. We propose a\ndata driven method for the selection of the regularization parameter, and in a\nsimulation study we show the attractive performance of our estimators."}, "http://arxiv.org/abs/2212.11112": {"title": "A Bootstrap Specification Test for Semiparametric Models with Generated Regressors", "link": "http://arxiv.org/abs/2212.11112", "description": "This paper provides a specification test for semiparametric models with\nnonparametrically generated regressors. Such variables are not observed by the\nresearcher but are nonparametrically identified and estimable. Applications of\nthe test include models with endogenous regressors identified by control\nfunctions, semiparametric sample selection models, or binary games with\nincomplete information. The statistic is built from the residuals of the\nsemiparametric model. A novel wild bootstrap procedure is shown to provide\nvalid critical values. We consider nonparametric estimators with an automatic\nbias correction that makes the test implementable without undersmoothing. In\nsimulations the test exhibits good small sample performances, and an\napplication to women's labor force participation decisions shows its\nimplementation in a real data context."}, "http://arxiv.org/abs/2305.07993": {"title": "The Nonstationary Newsvendor with (and without) Predictions", "link": "http://arxiv.org/abs/2305.07993", "description": "The classic newsvendor model yields an optimal decision for a \"newsvendor\"\nselecting a quantity of inventory, under the assumption that the demand is\ndrawn from a known distribution. Motivated by applications such as cloud\nprovisioning and staffing, we consider a setting in which newsvendor-type\ndecisions must be made sequentially, in the face of demand drawn from a\nstochastic process that is both unknown and nonstationary. All prior work on\nthis problem either (a) assumes that the level of nonstationarity is known, or\n(b) imposes additional statistical assumptions that enable accurate predictions\nof the unknown demand.\n\nWe study the Nonstationary Newsvendor, with and without predictions. We\nfirst, in the setting without predictions, design a policy which we prove (via\nmatching upper and lower bounds) achieves order-optimal regret -- ours is the\nfirst policy to accomplish this without being given the level of\nnonstationarity of the underlying demand. We then, for the first time,\nintroduce a model for generic (i.e. 
with no statistical assumptions)\npredictions with arbitrary accuracy, and propose a policy that incorporates\nthese predictions without being given their accuracy. We upper bound the regret\nof this policy, and show that it matches the best achievable regret had the\naccuracy of the predictions been known. Finally, we empirically validate our\nnew policy with experiments based on two real-world datasets containing\nthousands of time series, showing that it succeeds in closing approximately 74%\nof the gap between the best approaches based on nonstationarity and predictions\nalone."}, "http://arxiv.org/abs/2310.16849": {"title": "Correlation structure analysis of the global agricultural futures market", "link": "http://arxiv.org/abs/2310.16849", "description": "This paper adopts random matrix theory (RMT) to analyze the correlation\nstructure of the global agricultural futures market from 2000 to 2020. It is\nfound that the distribution of correlation coefficients is asymmetric and right\nskewed, and many eigenvalues of the correlation matrix deviate from the RMT\nprediction. The largest eigenvalue reflects a collective market effect common\nto all agricultural futures, the other large deviating eigenvalues can be\nused to identify futures groups, and there are modular structures based\non regional properties or agricultural commodities among the significant\nparticipants of their corresponding eigenvectors. Except for the smallest\neigenvalue, the other small deviating eigenvalues represent the agricultural\nfutures pairs with the highest correlations. This paper provides a useful\nreference for using agricultural futures to manage risk and optimize asset\nallocation."}, "http://arxiv.org/abs/2310.16850": {"title": "The impact of the Russia-Ukraine conflict on the extreme risk spillovers between agricultural futures and spots", "link": "http://arxiv.org/abs/2310.16850", "description": "The ongoing Russia-Ukraine conflict between two major agricultural powers has\nposed significant threats and challenges to the global food system and world\nfood security. Focusing on the impact of the conflict on the global\nagricultural market, we propose a new analytical framework for tail dependence,\nand combine the Copula-CoVaR method with the ARMA-GARCH-skewed Student-t model\nto examine the tail dependence structure and extreme risk spillover between\nagricultural futures and spots over the pre- and post-outbreak periods. Our\nresults indicate that the tail dependence structures in the futures-spot\nmarkets of soybean, maize, wheat, and rice have all reacted to the\nRussia-Ukraine conflict. Furthermore, the outbreak of the conflict has\nintensified risks in the four agricultural markets to varying degrees, with the\nwheat market being affected the most. Additionally, all the agricultural\nfutures markets exhibit significant downside and upside risk spillovers to\ntheir corresponding spot markets before and after the outbreak of the conflict,\nwhereas the strengths of these extreme risk spillover effects demonstrate\nsignificant asymmetries at the directional (downside versus upside) and\ntemporal (pre-outbreak versus post-outbreak) levels."}, "http://arxiv.org/abs/2310.17278": {"title": "Dynamic Factor Models: a Genealogy", "link": "http://arxiv.org/abs/2310.17278", "description": "Dynamic factor models have been developed out of the need to analyze and\nforecast time series in increasingly high dimensions. 
While mathematical\nstatisticians faced with inference problems in high-dimensional observation\nspaces were focusing on the so-called spiked-model-asymptotics, econometricians\nadopted an entirely different and considerably more effective asymptotic approach, rooted\nin the factor models originally considered in psychometrics. In two decades, the so-called\ndynamic factor model methods have grown into a wide and\nsuccessful body of techniques that are routinely used in central banks, financial\ninstitutions, and economic and statistical institutes. The objective of this\nchapter is not an extensive survey of the topic but a sketch of its historical\ngrowth, with emphasis on the various assumptions and interpretations, and a\nfamily tree of its main variants."}, "http://arxiv.org/abs/2310.17473": {"title": "Bayesian SAR model with stochastic volatility and multiple time-varying weights", "link": "http://arxiv.org/abs/2310.17473", "description": "A novel spatial autoregressive model for panel data is introduced, which\nincorporates multilayer networks and accounts for time-varying relationships.\nMoreover, the proposed approach allows the structural variance to evolve\nsmoothly over time and enables the analysis of shock propagation in terms of\ntime-varying spillover effects. The framework is applied to analyse the\ndynamics of international relationships among the G7 economies and their impact\non stock market returns and volatilities. The findings underscore the\nsubstantial impact of cooperative interactions and highlight discernible\ndisparities in network exposure across G7 nations, along with nuanced patterns\nin direct and indirect spillover effects."}, "http://arxiv.org/abs/2310.17496": {"title": "Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach", "link": "http://arxiv.org/abs/2310.17496", "description": "In modern recommendation systems, the standard pipeline involves training\nmachine learning models on historical data to predict user behaviors and\nimprove recommendations continuously. However, these data training loops can\nintroduce interference in A/B tests, where data generated by control and\ntreatment algorithms, potentially with different distributions, are combined.\nTo address these challenges, we introduce a novel approach called weighted\ntraining. This approach entails training a model to predict the probability of\neach data point appearing in either the treatment or control data and\nsubsequently applying weighted losses during model training. We demonstrate\nthat this approach achieves the least variance among all estimators without\ncausing shifts in the training distributions. Through simulation studies, we\ndemonstrate the lower bias and variance of our approach compared to other\nmethods."}, "http://arxiv.org/abs/2310.17571": {"title": "Inside the black box: Neural network-based real-time prediction of US recessions", "link": "http://arxiv.org/abs/2310.17571", "description": "A feedforward neural network (FFN) and two specific types of recurrent neural\nnetwork, long short-term memory (LSTM) and gated recurrent unit (GRU), are used\nfor modeling US recessions in the period from 1967 to 2021. The estimated\nmodels are then employed to conduct real-time predictions of the Great\nRecession and the Covid-19 recession in the US. Their predictive performances are\ncompared to those of traditional linear models, namely the logistic regression\nmodel both with and without a ridge penalty. 
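For readers who want to see the linear benchmarks mentioned above in code, here is a minimal sketch of a logistic regression recession classifier with and without a ridge penalty; the data, features, and hyperparameters are placeholders rather than the paper's specification.

```python
# Hedged sketch: ridge-penalized vs. (effectively) unpenalized logistic
# regression as recession-classification benchmarks on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, d = 600, 10
X = rng.normal(size=(n, d))                      # stand-ins for lagged indicators
latent = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = (latent > 0.8).astype(int)                   # rare "recession" label

X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

ridge_logit = LogisticRegression(C=1.0, max_iter=1000).fit(X_tr, y_tr)   # L2 penalty
plain_logit = LogisticRegression(C=1e6, max_iter=1000).fit(X_tr, y_tr)   # ~no penalty

for name, m in [("ridge", ridge_logit), ("unpenalized", plain_logit)]:
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name} logit: out-of-sample AUC = {auc:.3f}")
```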
The out-of-sample performance\nsupports the use of LSTM and GRU in recession forecasting,\nespecially for long-term forecasting tasks. They outperform the other types of\nmodels across five forecasting horizons with respect to several\nstatistical performance metrics. The Shapley additive explanations (SHAP) method is\napplied to the fitted GRUs across different forecasting horizons to gain\ninsight into feature importance. The evaluation of predictor importance\ndiffers between the GRU and ridge logistic regression models, as reflected in\nthe variable order determined by SHAP values. When considering the top 5\npredictors, key indicators such as the S\\&P 500 index, real GDP, and private\nresidential fixed investment consistently appear for short-term forecasts (up\nto 3 months). In contrast, for longer-term predictions (6 months or more), the\nterm spread and producer price index become more prominent. These findings are\nsupported by both local interpretable model-agnostic explanations (LIME) and\nmarginal effects."}, "http://arxiv.org/abs/2205.07836": {"title": "2SLS with Multiple Treatments", "link": "http://arxiv.org/abs/2205.07836", "description": "We study what two-stage least squares (2SLS) identifies in models with\nmultiple treatments under treatment effect heterogeneity. Two conditions are\nshown to be necessary and sufficient for 2SLS to identify positively\nweighted sums of agent-specific effects of each treatment: average conditional\nmonotonicity and no cross effects. Our identification analysis allows for any\nnumber of treatments, any number of continuous or discrete instruments, and the\ninclusion of covariates. We provide testable implications, present\ncharacterizations of choice behavior implied by our identification conditions,\nand discuss how the conditions can be tested empirically."}, "http://arxiv.org/abs/2308.12485": {"title": "Optimal Shrinkage Estimation of Fixed Effects in Linear Panel Data Models", "link": "http://arxiv.org/abs/2308.12485", "description": "Shrinkage methods are frequently used to estimate fixed effects to reduce the\nnoisiness of the least squares estimators. However, widely used shrinkage\nestimators guarantee such noise reduction only under strong distributional\nassumptions. I develop an estimator for the fixed effects that obtains the best\npossible mean squared error within a class of shrinkage estimators. This class\nincludes conventional shrinkage estimators and the optimality does not require\ndistributional assumptions. The estimator has an intuitive form and is easy to\nimplement. Moreover, the fixed effects are allowed to vary with time and to be\nserially correlated, and the shrinkage optimally incorporates the underlying\ncorrelation structure in this case. In such a context, I also provide a method\nto forecast fixed effects one period ahead."}, "http://arxiv.org/abs/2310.18504": {"title": "Nonparametric Doubly Robust Identification of Causal Effects of a Continuous Treatment using Discrete Instruments", "link": "http://arxiv.org/abs/2310.18504", "description": "Many empirical applications estimate causal effects of a continuous\nendogenous variable (treatment) using a binary instrument. Estimation is\ntypically done through linear 2SLS. This approach requires a mean treatment\nchange, and causal interpretation requires LATE-type monotonicity in the\nfirst stage. 
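As background for the linear 2SLS approach just mentioned, the sketch below runs two-stage least squares with a single binary instrument on simulated data; with one instrument and one endogenous regressor, the 2SLS slope coincides with the Wald ratio. Everything here is an illustrative assumption, not the doubly robust procedure the paper goes on to propose.

```python
# Hedged sketch: linear 2SLS with a binary instrument for a continuous treatment.
# With one instrument, the 2SLS slope equals the Wald ratio Cov(Y,Z)/Cov(D,Z).
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.binomial(1, 0.5, n).astype(float)        # binary instrument
u = rng.normal(size=n)                           # unobserved confounder
d = 0.8 * z + u + rng.normal(size=n)             # continuous endogenous treatment
y = 1.5 * d + 2.0 * u + rng.normal(size=n)       # outcome; true effect is 1.5

def ols(X, target):
    return np.linalg.lstsq(X, target, rcond=None)[0]

first_stage = np.column_stack([np.ones(n), z])
d_hat = first_stage @ ols(first_stage, d)        # fitted treatment
second_stage = np.column_stack([np.ones(n), d_hat])
beta_2sls = ols(second_stage, y)[1]

wald = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]   # equivalent Wald ratio
print(beta_2sls, wald)                           # both close to 1.5
```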
An alternative approach is to explore distributional changes in\nthe treatment, where the first-stage restriction is treatment rank similarity.\nWe propose causal estimands that are doubly robust in that they are valid under\neither of these two restrictions. We apply the doubly robust estimation to\nestimate the impacts of sleep on well-being. Our results corroborate the usual\n2SLS estimates."}, "http://arxiv.org/abs/2310.18563": {"title": "Covariate Balancing and the Equivalence of Weighting and Doubly Robust Estimators of Average Treatment Effects", "link": "http://arxiv.org/abs/2310.18563", "description": "We show that when the propensity score is estimated using a suitable\ncovariate balancing procedure, the commonly used inverse probability weighting\n(IPW) estimator, augmented inverse probability weighting (AIPW) with linear\nconditional mean, and inverse probability weighted regression adjustment\n(IPWRA) with linear conditional mean are all numerically the same for\nestimating the average treatment effect (ATE) or the average treatment effect\non the treated (ATT). Further, suitably chosen covariate balancing weights are\nautomatically normalized, which means that normalized and unnormalized versions\nof IPW and AIPW are identical. For estimating the ATE, the weights that achieve\nthe algebraic equivalence of IPW, AIPW, and IPWRA are based on propensity\nscores estimated using the inverse probability tilting (IPT) method of Graham,\nPinto and Egel (2012). For the ATT, the weights are obtained using the\ncovariate balancing propensity score (CBPS) method developed in Imai and\nRatkovic (2014). These equivalences also make covariate balancing methods\nattractive when the treatment is confounded and one is interested in the local\naverage treatment effect."}, "http://arxiv.org/abs/2310.18836": {"title": "Design of Cluster-Randomized Trials with Cross-Cluster Interference", "link": "http://arxiv.org/abs/2310.18836", "description": "Cluster-randomized trials often involve units that are irregularly\ndistributed in space without well-separated communities. In these settings,\ncluster construction is a critical aspect of the design due to the potential\nfor cross-cluster interference. The existing literature relies on partial\ninterference models, which take clusters as given and assume no cross-cluster\ninterference. We relax this assumption by allowing interference to decay with\ngeographic distance between units. This induces a bias-variance trade-off:\nconstructing fewer, larger clusters reduces bias due to interference but\nincreases variance. We propose new estimators that exclude units most\npotentially impacted by cross-cluster interference and show that this\nsubstantially reduces asymptotic bias relative to conventional\ndifference-in-means estimators. We then study the design of clusters to\noptimize the estimators' rates of convergence. We provide formal justification\nfor a new design that chooses the number of clusters to balance the asymptotic\nbias and variance of our estimators and uses unsupervised learning to automate\ncluster construction."}, "http://arxiv.org/abs/2310.19200": {"title": "Popularity, face and voice: Predicting and interpreting livestreamers' retail performance using machine learning techniques", "link": "http://arxiv.org/abs/2310.19200", "description": "Livestreaming commerce, a hybrid of e-commerce and self-media, has expanded\nthe broad spectrum of traditional sales performance determinants. 
To\ninvestigate the factors that contribute to the success of livestreaming\ncommerce, we construct a longitudinal firm-level database with 19,175\nobservations, covering an entire livestreaming subsector. By comparing the\nforecasting accuracy of eight machine learning models, we identify a random\nforest model that provides the best prediction of gross merchandise volume\n(GMV). Furthermore, we utilize explainable artificial intelligence to open the\nblack box of the machine learning model, discovering four new facts: 1) variables\nrepresenting the popularity of livestreaming events are crucial features in\npredicting GMV, and voice attributes are more important than appearance; 2)\npopularity is a major determinant of sales for female hosts, while vocal\naesthetics is more decisive for their male counterparts; 3) merits and\ndrawbacks of the voice are not equally valued in the livestreaming market; 4)\nbased on changes in comments, page views, and likes, sales growth can be divided\ninto three stages. Finally, we propose a novel 3D-SHAP diagram that\ndisplays the relationship between feature importance, the target\nvariable, and its predictors. This diagram identifies bottlenecks for both\nbeginner and top livestreamers, providing insights into ways to optimize their\nsales performance."}, "http://arxiv.org/abs/2310.19543": {"title": "Spectral identification and estimation of mixed causal-noncausal invertible-noninvertible models", "link": "http://arxiv.org/abs/2310.19543", "description": "This paper introduces new techniques for estimating, identifying, and\nsimulating mixed causal-noncausal invertible-noninvertible models. We propose a\nframework that integrates high-order cumulants, merging both the spectrum and\nbispectrum into a single estimation function. The model that most adequately\nrepresents the data under the assumption that the error term is i.i.d. is\nselected. Our Monte Carlo study reveals unbiased parameter estimates and a high\nfrequency with which correct models are identified. We illustrate our strategy\nthrough an empirical analysis of returns from 24 Fama-French emerging market\nstock portfolios. The findings suggest that each portfolio displays noncausal\ndynamics, producing white noise residuals devoid of conditional heteroscedastic\neffects."}, "http://arxiv.org/abs/2310.19557": {"title": "A Bayesian Markov-switching SAR model for time-varying cross-price spillovers", "link": "http://arxiv.org/abs/2310.19557", "description": "The spatial autoregressive (SAR) model is extended by introducing Markov-switching\ndynamics for the weight matrix and spatial autoregressive parameter.\nThe framework enables the identification of regime-specific connectivity\npatterns and strengths and the study of the spatiotemporal propagation of\nshocks in a system with a time-varying spatial multiplier matrix. The proposed\nmodel is applied to disaggregated CPI data from 15 EU countries to examine\ncross-price dependencies. 
The analysis identifies distinct connectivity\nstructures and spatial weights across the states, which capture shifts in\nconsumer behaviour, with marked cross-country differences in the spillover from\none price category to another."}, "http://arxiv.org/abs/2310.19747": {"title": "Characteristics of price related fluctuations in Non-Fungible Token (NFT) market", "link": "http://arxiv.org/abs/2310.19747", "description": "The non-fungible token (NFT) market is a new trading invention based on\nblockchain technology that parallels the cryptocurrency market. In the present\nwork we study capitalization, floor price, the number of transactions, the\ninter-transaction times, and the transaction volume value of a few selected\npopular token collections. The results show that the fluctuations of all these\nquantities are characterized by heavy-tailed probability distribution\nfunctions, in most cases well described by stretched exponentials, with a\ntrace of power-law scaling at times, long-range memory, and in several cases\neven a fractal organization of fluctuations, although mostly restricted to the larger\nfluctuations. We conclude that the NFT market - even though young and\ngoverned by somewhat different trading mechanisms - shares several\nstatistical properties with regular financial markets. However, some\ndifferences are visible in the specific quantitative indicators."}, "http://arxiv.org/abs/2310.19788": {"title": "Locally Optimal Best Arm Identification with a Fixed Budget", "link": "http://arxiv.org/abs/2310.19788", "description": "This study investigates the problem of identifying the best treatment arm, the\ntreatment arm with the highest expected outcome. We aim to identify the best\ntreatment arm with a lower probability of misidentification, which has been\nexplored under various names across numerous research fields, including\n\\emph{best arm identification} (BAI) and ordinal optimization. In our\nexperiments, the number of treatment-allocation rounds is fixed. In each round,\na decision-maker allocates a treatment arm to an experimental unit and observes\na corresponding outcome, which follows a Gaussian distribution with a variance\nthat differs across treatment arms. At the end of the experiment, we recommend one\nof the treatment arms as an estimate of the best treatment arm based on the\nobservations. The objective of the decision-maker is to design an experiment\nthat minimizes the probability of misidentifying the best treatment arm. With\nthis objective in mind, we develop lower bounds for the probability of\nmisidentification under the small-gap regime, where the gaps of the expected\noutcomes between the best and suboptimal treatment arms approach zero. Then,\nassuming that the variances are known, we design the\nGeneralized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, which is\nan extension of the Neyman allocation proposed by Neyman (1934) and the\nUniform-EBA strategy proposed by Bubeck et al. (2011). For the GNA-EBA\nstrategy, we show that it is asymptotically optimal because its\nprobability of misidentification aligns with the lower bounds as the sample\nsize approaches infinity under the small-gap regime. 
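To make the allocation idea concrete, here is a minimal sketch that splits a fixed budget across Gaussian arms in proportion to their (assumed known) standard deviations and then recommends the empirical best arm. It is a simplified Neyman-type rule for illustration, not the exact GNA-EBA strategy analyzed in the paper.

```python
# Hedged sketch: fixed-budget best-arm identification with a Neyman-type
# allocation (rounds proportional to known standard deviations), followed by
# an empirical-best-arm recommendation. Means and variances are illustrative.
import numpy as np

rng = np.random.default_rng(3)
means = np.array([1.00, 0.95, 0.90])       # unknown to the experimenter
sigmas = np.array([1.0, 2.0, 0.5])         # treated as known here
budget = 3000

shares = sigmas / sigmas.sum()             # Neyman-type allocation shares
n_k = np.maximum(1, np.floor(shares * budget).astype(int))

sample_means = np.array([
    rng.normal(means[k], sigmas[k], size=n_k[k]).mean() for k in range(len(means))
])
print("allocation:", n_k, "recommended arm:", int(np.argmax(sample_means)))
```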
We refer to such optimal\nstrategies as locally asymptotically optimal because their performance aligns with\nthe lower bounds within restricted situations characterized by the small-gap\nregime."}, "http://arxiv.org/abs/2009.00553": {"title": "A Vector Monotonicity Assumption for Multiple Instruments", "link": "http://arxiv.org/abs/2009.00553", "description": "When a researcher combines multiple instrumental variables for a single\nbinary treatment, the monotonicity assumption of the local average treatment\neffects (LATE) framework can become restrictive: it requires that all units\nshare a common direction of response even when separate instruments are shifted\nin opposing directions. What I call vector monotonicity, by contrast, simply\nassumes treatment uptake to be monotonic in all instruments, representing a\nspecial case of the partial monotonicity assumption introduced by Mogstad et\nal. (2021). I characterize the class of causal parameters that are point\nidentified under vector monotonicity, when the instruments are binary. This\nclass includes, for example, the average treatment effect among units that are\nin any way responsive to the collection of instruments, or those that are\nresponsive to a given subset of them. The identification results are\nconstructive and yield a simple estimator for the identified treatment effect\nparameters. An empirical application revisits the labor market returns to\ncollege."}, "http://arxiv.org/abs/2109.08109": {"title": "Standard Errors for Calibrated Parameters", "link": "http://arxiv.org/abs/2109.08109", "description": "Calibration, the practice of choosing the parameters of a structural model to\nmatch certain empirical moments, can be viewed as minimum distance estimation.\nExisting standard error formulas for such estimators require a consistent\nestimate of the correlation structure of the empirical moments, which is often\nunavailable in practice. Instead, the variances of the individual empirical\nmoments are usually readily estimable. Using only these variances, we derive\nconservative standard errors and confidence intervals for the structural\nparameters that are valid even under the worst-case correlation structure. In\nthe over-identified case, we show that the moment weighting scheme that\nminimizes the worst-case estimator variance amounts to a moment selection\nproblem with a simple solution. Finally, we develop tests of over-identifying\nor parameter restrictions. We apply our methods empirically to a model of menu\ncost pricing for multi-product firms and to a heterogeneous agent New Keynesian\nmodel."}, "http://arxiv.org/abs/2211.16714": {"title": "Incorporating Prior Knowledge of Latent Group Structure in Panel Data Models", "link": "http://arxiv.org/abs/2211.16714", "description": "The assumption of group heterogeneity has become popular in panel data\nmodels. We develop a constrained Bayesian grouped estimator that exploits\nresearchers' prior beliefs on groups in the form of pairwise constraints,\nindicating whether a pair of units is likely to belong to the same group or to\ndifferent groups. We propose a prior to incorporate the pairwise constraints\nwith varying degrees of confidence. The whole framework is built on a\nnonparametric Bayesian method, which implicitly specifies a distribution over\nthe group partitions, and so the posterior analysis takes the uncertainty of\nthe latent group structure into account. 
Monte Carlo experiments reveal that\nadding prior knowledge yields more accurate coefficient estimates and scores\npredictive gains over alternative estimators. We apply our method to two\nempirical applications. In the first application, forecasting U.S. CPI\ninflation, we illustrate that prior knowledge of groups improves density\nforecasts when the data is not entirely informative. A second application\nrevisits the relationship between a country's income and its democratic\ntransition; we identify heterogeneous income effects on democracy with five\ndistinct groups over ninety countries."}, "http://arxiv.org/abs/2307.01357": {"title": "Adaptive Principal Component Regression with Applications to Panel Data", "link": "http://arxiv.org/abs/2307.01357", "description": "Principal component regression (PCR) is a popular technique for fixed-design\nerror-in-variables regression, a generalization of the linear regression\nsetting in which the observed covariates are corrupted with random noise. We\nprovide the first time-uniform finite sample guarantees for online\n(regularized) PCR whenever data is collected adaptively. Since the proof\ntechniques for analyzing PCR in the fixed design setting do not readily extend\nto the online setting, our results rely on adapting tools from modern\nmartingale concentration to the error-in-variables setting. As an application\nof our bounds, we provide a framework for experiment design in panel data\nsettings when interventions are assigned adaptively. Our framework may be\nthought of as a generalization of the synthetic control and synthetic\ninterventions frameworks, where data is collected via an adaptive intervention\nassignment policy."}, "http://arxiv.org/abs/2309.06693": {"title": "Stochastic Learning of Semiparametric Monotone Index Models with Large Sample Size", "link": "http://arxiv.org/abs/2309.06693", "description": "I study the estimation of semiparametric monotone index models in the\nscenario where the number of observation points $n$ is extremely large and\nconventional approaches fail to work due to heavy computational burdens.\nMotivated by the mini-batch gradient descent algorithm (MBGD) that is widely\nused as a stochastic optimization tool in the machine learning field, I\npropose a novel subsample- and iteration-based estimation procedure. In\nparticular, starting from any initial guess of the true parameter, I\nprogressively update the parameter using a sequence of subsamples randomly\ndrawn from the data set, each with a sample size much smaller than $n$. The update\nis based on the gradient of some well-chosen loss function, where the\nnonparametric component is replaced with its Nadaraya-Watson kernel estimator\nbased on subsamples. My proposed algorithm essentially generalizes the MBGD\nalgorithm to the semiparametric setup. Compared with the full-sample-based method,\nthe new method reduces the computational time by a factor of roughly $n$ if the\nsubsample size and the kernel function are chosen properly, and so can be easily\napplied when the sample size $n$ is large. Moreover, I show that if I further\naverage across the estimators produced during the iterations, the\ndifference between the average estimator and the full-sample-based estimator will\nbe of smaller order than $1/\\sqrt{n}$. Consequently, the average estimator is\n$1/\\sqrt{n}$-consistent and asymptotically normally distributed. 
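The subsample-and-average idea described above can be illustrated with plain mini-batch gradient descent on a simple least-squares loss, averaging the iterates at the end; the paper's semiparametric ingredient (a Nadaraya-Watson estimator recomputed on each subsample) is deliberately omitted, so this is only a schematic sketch under simplified assumptions.

```python
# Hedged sketch: mini-batch gradient descent with iterate averaging on a
# least-squares loss; the kernel (semiparametric) component is omitted.
import numpy as np

rng = np.random.default_rng(4)
n, d, batch = 100_000, 5, 200
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = X @ theta_true + rng.normal(size=n)

theta = np.zeros(d)                                   # arbitrary initial guess
iterates = []
for t in range(1, 2001):
    idx = rng.choice(n, size=batch, replace=False)    # random subsample
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch
    theta = theta - (0.5 / np.sqrt(t)) * grad         # decreasing step size
    iterates.append(theta.copy())

theta_avg = np.mean(iterates, axis=0)                 # average across iterations
theta_full = np.linalg.lstsq(X, y, rcond=None)[0]     # full-sample benchmark
print(np.max(np.abs(theta_avg - theta_full)))         # small discrepancy
```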
In other\nwords, the new estimator substantially improves the computational speed, while\nat the same time maintains the estimation accuracy."}, "http://arxiv.org/abs/2310.16945": {"title": "Causal Q-Aggregation for CATE Model Selection", "link": "http://arxiv.org/abs/2310.16945", "description": "Accurate estimation of conditional average treatment effects (CATE) is at the\ncore of personalized decision making. While there is a plethora of models for\nCATE estimation, model selection is a nontrivial task, due to the fundamental\nproblem of causal inference. Recent empirical work provides evidence in favor\nof proxy loss metrics with double robust properties and in favor of model\nensembling. However, theoretical understanding is lacking. Direct application\nof prior theoretical work leads to suboptimal oracle model selection rates due\nto the non-convexity of the model selection problem. We provide regret rates\nfor the major existing CATE ensembling approaches and propose a new CATE model\nensembling approach based on Q-aggregation using the doubly robust loss. Our\nmain result shows that causal Q-aggregation achieves statistically optimal\noracle model selection regret rates of $\\frac{\\log(M)}{n}$ (with $M$ models and\n$n$ samples), with the addition of higher-order estimation error terms related\nto products of errors in the nuisance functions. Crucially, our regret rate\ndoes not require that any of the candidate CATE models be close to the truth.\nWe validate our new method on many semi-synthetic datasets and also provide\nextensions of our work to CATE model selection with instrumental variables and\nunobserved confounding."}, "http://arxiv.org/abs/2310.19992": {"title": "Robust Estimation of Realized Correlation: New Insight about Intraday Fluctuations in Market Betas", "link": "http://arxiv.org/abs/2310.19992", "description": "Time-varying volatility is an inherent feature of most economic time-series,\nwhich causes standard correlation estimators to be inconsistent. The quadrant\ncorrelation estimator is consistent but very inefficient. We propose a novel\nsubsampled quadrant estimator that improves efficiency while preserving\nconsistency and robustness. This estimator is particularly well-suited for\nhigh-frequency financial data and we apply it to a large panel of US stocks.\nOur empirical analysis sheds new light on intra-day fluctuations in market\nbetas by decomposing them into time-varying correlations and relative\nvolatility changes. Our results show that intraday variation in betas is\nprimarily driven by intraday variation in correlations."}, "http://arxiv.org/abs/2006.07691": {"title": "Synthetic Interventions", "link": "http://arxiv.org/abs/2006.07691", "description": "Consider a setting with $N$ heterogeneous units (e.g., individuals,\nsub-populations) and $D$ interventions (e.g., socio-economic policies). Our\ngoal is to learn the expected potential outcome associated with every\nintervention on every unit, totaling $N \\times D$ causal parameters. Towards\nthis, we present a causal framework, synthetic interventions (SI), to infer\nthese $N \\times D$ causal parameters while only observing each of the $N$ units\nunder at most two interventions, independent of $D$. This can be significant as\nthe number of interventions, i.e., level of personalization, grows. 
Under a\nnovel tensor factor model across units, outcomes, and interventions, we prove\nan identification result for each of these $N \\times D$ causal parameters,\nestablish finite-sample consistency of our estimator along with asymptotic\nnormality under additional conditions. Importantly, our estimator also allows\nfor latent confounders that determine how interventions are assigned. The\nestimator is further furnished with data-driven tests to examine its\nsuitability. Empirically, we validate our framework through a large-scale A/B\ntest performed on an e-commerce platform. We believe our results could have\nimplications for the design of data-efficient randomized experiments (e.g.,\nrandomized control trials) with heterogeneous units and multiple interventions."}, "http://arxiv.org/abs/2207.04481": {"title": "Detecting Grouped Local Average Treatment Effects and Selecting True Instruments", "link": "http://arxiv.org/abs/2207.04481", "description": "Under an endogenous binary treatment with heterogeneous effects and multiple\ninstruments, we propose a two-step procedure for identifying complier groups\nwith identical local average treatment effects (LATE) despite relying on\ndistinct instruments, even if several instruments violate the identifying\nassumptions. We use the fact that the LATE is homogeneous for instruments which\n(i) satisfy the LATE assumptions (instrument validity and treatment\nmonotonicity in the instrument) and (ii) generate identical complier groups in\nterms of treatment propensities given the respective instruments. We propose a\ntwo-step procedure, where we first cluster the propensity scores in the first\nstep and find groups of IVs with the same reduced form parameters in the second\nstep. Under the plurality assumption that within each set of instruments with\nidentical treatment propensities, instruments truly satisfying the LATE\nassumptions are the largest group, our procedure permits identifying these true\ninstruments in a data driven way. We show that our procedure is consistent and\nprovides consistent and asymptotically normal estimators of underlying LATEs.\nWe also provide a simulation study investigating the finite sample properties\nof our approach and an empirical application investigating the effect of\nincarceration on recidivism in the US with judge assignments serving as\ninstruments."}, "http://arxiv.org/abs/2304.09078": {"title": "Club coefficients in the UEFA Champions League: Time for shift to an Elo-based formula", "link": "http://arxiv.org/abs/2304.09078", "description": "One of the most popular club football tournaments, the UEFA Champions League,\nwill see a fundamental reform from the 2024/25 season: the traditional group\nstage will be replaced by one league where each of the 36 teams plays eight\nmatches. To guarantee that the opponents of the clubs are of the same strength\nin the new design, it is crucial to forecast the performance of the teams\nbefore the tournament as well as possible. This paper investigates whether the\ncurrently used rating of the teams, the UEFA club coefficient, can be improved\nby taking the games played in the national leagues into account. According to\nour logistic regression models, a variant of the Elo method provides a higher\naccuracy in terms of explanatory power in the Champions League matches. 
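For readers unfamiliar with Elo-type ratings, the sketch below implements a generic Elo update for a sequence of match results; the scale of 400 and the K-factor are conventional defaults, not the specific club-rating variant evaluated in the paper, and the club names are placeholders.

```python
# Hedged sketch: a generic Elo rating update. Constants are the conventional
# chess-style defaults, not the club-rating variant studied in the paper.
def expected_score(r_a: float, r_b: float) -> float:
    """Win expectancy of side A against side B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 20.0):
    """Update both ratings after a match (score_a: 1 win, 0.5 draw, 0 loss)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"Club A": 1600.0, "Club B": 1500.0}
for home, away, result in [("Club A", "Club B", 1.0), ("Club B", "Club A", 0.5)]:
    ratings[home], ratings[away] = elo_update(ratings[home], ratings[away], result)
print(ratings)
```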
The\nUnion of European Football Associations (UEFA) is encouraged to follow the\nexample of the FIFA World Ranking and reform the calculation of the club\ncoefficients in order to avoid unbalanced schedules in the novel tournament\nformat of the Champions League."}, "http://arxiv.org/abs/2308.13564": {"title": "SGMM: Stochastic Approximation to Generalized Method of Moments", "link": "http://arxiv.org/abs/2308.13564", "description": "We introduce a new class of algorithms, Stochastic Generalized Method of\nMoments (SGMM), for estimation and inference on (overidentified) moment\nrestriction models. Our SGMM is a novel stochastic approximation alternative to\nthe popular Hansen (1982) (offline) GMM, and offers fast and scalable\nimplementation with the ability to handle streaming datasets in real time. We\nestablish almost sure convergence and the (functional) central limit\ntheorem for the inefficient online 2SLS and the efficient SGMM. Moreover, we\npropose online versions of the Durbin-Wu-Hausman and Sargan-Hansen tests that\ncan be seamlessly integrated within the SGMM framework. Extensive Monte Carlo\nsimulations show that as the sample size increases, the SGMM matches the\nstandard (offline) GMM in estimation accuracy while gaining in\ncomputational efficiency, indicating its practical value for both large-scale\nand online datasets. We demonstrate the efficacy of our approach by a proof of\nconcept using two well-known empirical examples with large sample sizes."}, "http://arxiv.org/abs/2311.00013": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "http://arxiv.org/abs/2311.00013", "description": "We propose two approaches to estimate semiparametric discrete choice models\nfor bundles. Our first approach is a kernel-weighted rank estimator based on a\nmatching-based identification strategy. We establish its complete asymptotic\nproperties and prove the validity of the nonparametric bootstrap for inference.\nWe then introduce a new multi-index least absolute deviations (LAD) estimator\nas an alternative, whose main advantage is its capacity to estimate\npreference parameters on both alternative- and agent-specific regressors. Both\nmethods can account for arbitrary correlation in disturbances across choices,\nwith the former also allowing for interpersonal heteroskedasticity. We also\ndemonstrate that the identification strategy underlying these procedures can be\nextended naturally to panel data settings, producing an analogous localized\nmaximum score estimator and a LAD estimator for estimating bundle choice models\nwith fixed effects. We derive the limiting distribution of the former and\nverify the validity of the numerical bootstrap as an inference tool. All our\nproposed methods can be applied to general multi-index models. Monte Carlo\nexperiments show that they perform well in finite samples."}, "http://arxiv.org/abs/2311.00439": {"title": "Bounds on Treatment Effects under Stochastic Monotonicity Assumption in Sample Selection Models", "link": "http://arxiv.org/abs/2311.00439", "description": "This paper discusses the partial identification of treatment effects in\nsample selection models when the exclusion restriction fails and the\nmonotonicity assumption in the selection effect does not hold exactly, both of\nwhich are key challenges in applying the existing methodologies. 
Our approach\nbuilds on the procedure of Lee (2009), who considers partial identification under\nthe monotonicity assumption, but we assume only a stochastic (and weaker)\nversion of monotonicity, which depends on a prespecified parameter $\\vartheta$\nthat represents researchers' belief in the plausibility of monotonicity.\nUnder this assumption, we show that we can still obtain useful bounds even when\nthe monotonic behavioral model does not strictly hold. Our procedure is useful\nwhen empirical researchers anticipate that a small fraction of the population\nwill not behave monotonically in selection; it can also be an effective tool\nfor performing sensitivity analysis or examining the identification power of\nthe monotonicity assumption. Our procedure is easily extendable to other\nrelated settings; we also provide an identification result for the marginal\ntreatment effects setting as an important application. Moreover, we show that\nthe bounds can still be obtained even without knowledge of\n$\\vartheta$ under semiparametric models that nest the classical probit and\nlogit selection models."}, "http://arxiv.org/abs/2311.00577": {"title": "Personalized Assignment to One of Many Treatment Arms via Regularized and Clustered Joint Assignment Forests", "link": "http://arxiv.org/abs/2311.00577", "description": "We consider learning personalized assignments to one of many treatment arms\nfrom a randomized controlled trial. Standard methods that estimate\nheterogeneous treatment effects separately for each arm may perform poorly in\nthis case due to excess variance. We instead propose methods that pool\ninformation across treatment arms: First, we consider a regularized\nforest-based assignment algorithm based on greedy recursive partitioning that\nshrinks effect estimates across arms. Second, we augment our algorithm by a\nclustering scheme that combines treatment arms with consistently similar\noutcomes. In a simulation study, we compare the performance of these approaches\nto predicting arm-wise outcomes separately, and document gains of directly\noptimizing the treatment assignment with regularization and clustering. In a\ntheoretical model, we illustrate how a high number of treatment arms makes\nfinding the best arm hard, while we can achieve sizable utility gains from\npersonalization by regularized optimization."}, "http://arxiv.org/abs/2311.00662": {"title": "On Gaussian Process Priors in Conditional Moment Restriction Models", "link": "http://arxiv.org/abs/2311.00662", "description": "This paper studies quasi-Bayesian estimation and uncertainty quantification\nfor an unknown function that is identified by a nonparametric conditional\nmoment restriction model. We derive contraction rates for a class of Gaussian\nprocess priors and provide conditions under which a Bernstein-von Mises theorem\nholds for the quasi-posterior distribution. As a consequence, we show that\noptimally-weighted quasi-Bayes credible sets have exact asymptotic frequentist\ncoverage. This extends the classical result on the frequentist validity of\noptimally weighted quasi-Bayes credible sets for parametric generalized method\nof moments (GMM) models."}, "http://arxiv.org/abs/2209.14502": {"title": "Fast Inference for Quantile Regression with Tens of Millions of Observations", "link": "http://arxiv.org/abs/2209.14502", "description": "Big data analytics has opened new avenues in economic research, but the\nchallenge of analyzing datasets with tens of millions of observations is\nsubstantial. 
Conventional econometric methods based on extreme estimators\nrequire large amounts of computing resources and memory, which are often not\nreadily available. In this paper, we focus on linear quantile regression\napplied to \"ultra-large\" datasets, such as U.S. decennial censuses. A fast\ninference framework is presented, utilizing stochastic subgradient descent\n(S-subGD) updates. The inference procedure handles cross-sectional data\nsequentially: (i) updating the parameter estimate with each incoming \"new\nobservation\", (ii) aggregating it as a $\\textit{Polyak-Ruppert}$ average, and\n(iii) computing a pivotal statistic for inference using only a solution path.\nThe methodology draws from time-series regression to create an asymptotically\npivotal statistic through random scaling. Our proposed test statistic is\ncalculated in a fully online fashion and critical values are obtained without\nresampling. We conduct extensive numerical studies to showcase the\ncomputational merits of our proposed inference. For inference problems as large\nas $(n, d) \\sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the\nnumber of regressors, our method generates new insights, surpassing current\ninference methods in computational performance. Our method specifically reveals trends in the\ngender gap in the U.S. college wage premium using millions of observations,\nwhile controlling for over $10^3$ covariates to mitigate confounding effects."}, "http://arxiv.org/abs/2311.00905": {"title": "Data-Driven Fixed-Point Tuning for Truncated Realized Variations", "link": "http://arxiv.org/abs/2311.00905", "description": "Many methods for estimating integrated volatility and related functionals of\nsemimartingales in the presence of jumps require specification of tuning\nparameters for their use. In much of the available theory, tuning parameters\nare assumed to be deterministic, and their values are specified only up to\nasymptotic constraints. However, in empirical work and in simulation studies,\nthey are typically chosen to be random and data-dependent, with explicit\nchoices in practice relying on heuristics alone. In this paper, we consider\nnovel data-driven tuning procedures for the truncated realized variations of a\nsemimartingale with jumps, which are based on a type of stochastic fixed-point\niteration. Being effectively automated, our approach alleviates the need for\ndelicate decision-making regarding tuning parameters, and can be implemented\nusing information regarding sampling frequency alone. We show our methods can\nlead to asymptotically efficient estimation of integrated volatility and\nexhibit superior finite-sample performance compared to popular alternatives in\nthe literature."}, "http://arxiv.org/abs/2311.01217": {"title": "The learning effects of subsidies to bundled goods: a semiparametric approach", "link": "http://arxiv.org/abs/2311.01217", "description": "Can temporary subsidies to bundles induce long-run changes in demand due to\nlearning about the relative quality of one of its constituent goods? This paper\nprovides theoretical and experimental evidence on the role of this mechanism.\nTheoretically, we introduce a model where an agent learns about the quality of\nan innovation on an essential good through consumption. 
Our results show that\nthe contemporaneous effect of a one-off subsidy to a bundle that contains the\ninnovation may be decomposed into a direct price effect, and an indirect\nlearning motive, whereby an agent leverages the discount to increase the\ninformational bequest left to her future selves. We then assess the predictions\nof our theory in a randomised experiment in a ridesharing platform. The\nexperiment provided two-week discounts for car trips integrating with a train\nor metro station (a bundle). Given the heavy-tailed nature of our data, we\nfollow \\cite{Athey2023} and, motivated by our theory, propose a semiparametric\nmodel for treatment effects that enables the construction of more efficient\nestimators. We introduce a statistically efficient estimator for our model by\nrelying on L-moments, a robust alternative to standard moments. Our estimator\nimmediately yields a specification test for the semiparametric model; moreover,\nin our adopted parametrisation, it can be easily computed through generalized\nleast squares. Our empirical results indicate that a two-week 50\\% discount on\ncar trips integrating with train/metro leads to a contemporaneous increase in\nthe demand for integrated rides, and, consistent with our learning model,\npersistent changes in the mean and dispersion of nonintegrated rides. These\neffects persist for over four months after the discount. A simple calibration\nof our model shows that around 40\\% to 50\\% of the estimated contemporaneous\nincrease in integrated rides may be attributed to a learning motive."}, "http://arxiv.org/abs/2110.10650": {"title": "Attention Overload", "link": "http://arxiv.org/abs/2110.10650", "description": "We introduce an Attention Overload Model that captures the idea that\nalternatives compete for the decision maker's attention, and hence the\nattention that each alternative receives decreases as the choice problem\nbecomes larger. We provide testable implications on the observed choice\nbehavior that can be used to (point or partially) identify the decision maker's\npreference and attention frequency. We then enhance our attention overload\nmodel to accommodate heterogeneous preferences based on the idea of List-based\nAttention Overload, where alternatives are presented to the decision makers as\na list that correlates with both heterogeneous preferences and random\nattention. We show that preference and attention frequencies are (point or\npartially) identifiable under nonparametric assumptions on the list and\nattention formation mechanisms, even when the true underlying list is unknown\nto the researcher. Building on our identification results, we develop\neconometric methods for estimation and inference."}, "http://arxiv.org/abs/2112.13398": {"title": "Long Story Short: Omitted Variable Bias in Causal Machine Learning", "link": "http://arxiv.org/abs/2112.13398", "description": "We derive general, yet simple, sharp bounds on the size of the omitted\nvariable bias for a broad class of causal parameters that can be identified as\nlinear functionals of the conditional expectation function of the outcome. Such\nfunctionals encompass many of the traditional targets of investigation in\ncausal inference studies, such as, for example, (weighted) average of potential\noutcomes, average treatment effects (including subgroup effects, such as the\neffect on the treated), (weighted) average derivatives, and policy effects from\nshifts in covariate distribution -- all for general, nonparametric causal\nmodels. 
Our construction relies on the Riesz-Frechet representation of the\ntarget functional. Specifically, we show how the bound on the bias depends only\non the additional variation that the latent variables create both in the\noutcome and in the Riesz representer for the parameter of interest. Moreover,\nin many important cases (e.g., average treatment effects and average\nderivatives) the bound is shown to depend on easily interpretable quantities\nthat measure the explanatory power of the omitted variables. Therefore, simple\nplausibility judgments on the maximum explanatory power of omitted variables\n(in explaining treatment and outcome variation) are sufficient to place overall\nbounds on the size of the bias. Furthermore, we use debiased machine learning\nto provide flexible and efficient statistical inference on learnable components\nof the bounds. Finally, empirical examples demonstrate the usefulness of the\napproach."}, "http://arxiv.org/abs/2112.03626": {"title": "Phase transitions in nonparametric regressions", "link": "http://arxiv.org/abs/2112.03626", "description": "When the unknown regression function of a single variable is known to have\nderivatives up to the $(\\gamma+1)$th order bounded in absolute values by a\ncommon constant everywhere or a.e. (i.e., $(\\gamma+1)$th degree of smoothness),\nthe minimax optimal rate of the mean integrated squared error (MISE) is stated\nas $\\left(\\frac{1}{n}\\right)^{\\frac{2\\gamma+2}{2\\gamma+3}}$ in the literature.\nThis paper shows that: (i) if $n\\leq\\left(\\gamma+1\\right)^{2\\gamma+3}$, the\nminimax optimal MISE rate is $\\frac{\\log n}{n\\log(\\log n)}$ and the optimal\ndegree of smoothness to exploit is roughly $\\max\\left\\{ \\left\\lfloor \\frac{\\log\nn}{2\\log\\left(\\log n\\right)}\\right\\rfloor ,\\,1\\right\\} $; (ii) if\n$n>\\left(\\gamma+1\\right)^{2\\gamma+3}$, the minimax optimal MISE rate is\n$\\left(\\frac{1}{n}\\right)^{\\frac{2\\gamma+2}{2\\gamma+3}}$ and the optimal degree\nof smoothness to exploit is $\\gamma+1$. The fundamental contribution of this\npaper is a set of metric entropy bounds we develop for smooth function classes.\nSome of our bounds are original, and some of them improve and/or generalize the\nones in the literature (e.g., Kolmogorov and Tikhomirov, 1959). Our metric\nentropy bounds allow us to show phase transitions in the minimax optimal MISE\nrates associated with some commonly seen smoothness classes as well as\nnon-standard smoothness classes, and can also be of independent interest\noutside nonparametric regression problems."}, "http://arxiv.org/abs/2206.04157": {"title": "Inference for Matched Tuples and Fully Blocked Factorial Designs", "link": "http://arxiv.org/abs/2206.04157", "description": "This paper studies inference in randomized controlled trials with multiple\ntreatments, where treatment status is determined according to a \"matched\ntuples\" design. Here, by a matched tuples design, we mean an experimental\ndesign where units are sampled i.i.d. from the population of interest, grouped\ninto \"homogeneous\" blocks with cardinality equal to the number of treatments,\nand finally, within each block, each treatment is assigned exactly once\nuniformly at random. We first study estimation and inference for matched tuples\ndesigns in the general setting where the parameter of interest is a vector of\nlinear contrasts over the collection of average potential outcomes for each\ntreatment. 
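The matched-tuples assignment just described can be sketched as follows: units are ordered by a scalar baseline covariate to form "homogeneous" blocks whose size equals the number of treatments, and each treatment is assigned exactly once within every block, uniformly at random. Sorting on a single covariate is only one simple way of forming homogeneous tuples and is an assumption of this illustration.

```python
# Hedged sketch: a matched-tuples treatment assignment. Units are sorted on a
# scalar covariate, cut into consecutive blocks of size K (one per treatment),
# and each treatment is assigned exactly once per block, uniformly at random.
import numpy as np

def matched_tuples_assignment(x, n_treatments, rng):
    n = len(x)
    assert n % n_treatments == 0, "n must be divisible by the number of treatments"
    order = np.argsort(x)                    # order units by the covariate
    assignment = np.empty(n, dtype=int)
    for block in order.reshape(-1, n_treatments):
        assignment[block] = rng.permutation(n_treatments)   # one of each treatment
    return assignment

rng = np.random.default_rng(5)
x = rng.normal(size=12)                      # baseline covariate
print(matched_tuples_assignment(x, n_treatments=3, rng=rng))
```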
Parameters of this form include standard average treatment effects\nused to compare one treatment relative to another, but also include parameters\nwhich may be of interest in the analysis of factorial designs. We first\nestablish conditions under which a sample analogue estimator is asymptotically\nnormal and construct a consistent estimator of its corresponding asymptotic\nvariance. Combining these results establishes the asymptotic exactness of tests\nbased on these estimators. In contrast, we show that, for two common testing\nprocedures based on t-tests constructed from linear regressions, one test is\ngenerally conservative while the other generally invalid. We go on to apply our\nresults to study the asymptotic properties of what we call \"fully-blocked\" 2^K\nfactorial designs, which are simply matched tuples designs applied to a full\nfactorial experiment. Leveraging our previous results, we establish that our\nestimator achieves a lower asymptotic variance under the fully-blocked design\nthan that under any stratified factorial design which stratifies the\nexperimental sample into a finite number of \"large\" strata. A simulation study\nand empirical application illustrate the practical relevance of our results."}, "http://arxiv.org/abs/2303.02716": {"title": "Deterministic, quenched and annealed parameter estimation for heterogeneous network models", "link": "http://arxiv.org/abs/2303.02716", "description": "At least two, different approaches to define and solve statistical models for\nthe analysis of economic systems exist: the typical, econometric one,\ninterpreting the Gravity Model specification as the expected link weight of an\narbitrary probability distribution, and the one rooted into statistical\nphysics, constructing maximum-entropy distributions constrained to satisfy\ncertain network properties. In a couple of recent, companion papers they have\nbeen successfully integrated within the framework induced by the constrained\nminimisation of the Kullback-Leibler divergence: specifically, two, broad\nclasses of models have been devised, i.e. the integrated and the conditional\nones, defined by different, probabilistic rules to place links, load them with\nweights and turn them into proper, econometric prescriptions. Still, the\nrecipes adopted by the two approaches to estimate the parameters entering into\nthe definition of each model differ. In econometrics, a likelihood that\ndecouples the binary and weighted parts of a model, treating a network as\ndeterministic, is typically maximised; to restore its random character, two\nalternatives exist: either solving the likelihood maximisation on each\nconfiguration of the ensemble and taking the average of the parameters\nafterwards or taking the average of the likelihood function and maximising the\nlatter one. The difference between these approaches lies in the order in which\nthe operations of averaging and maximisation are taken - a difference that is\nreminiscent of the quenched and annealed ways of averaging out the disorder in\nspin glasses. 
The results of the present contribution, devoted to comparing\nthese recipes in the case of continuous, conditional network models, indicate\nthat the annealed estimation recipe represents the best alternative to the\ndeterministic one."}, "http://arxiv.org/abs/2307.01284": {"title": "Does regional variation in wage levels identify the effects of a national minimum wage?", "link": "http://arxiv.org/abs/2307.01284", "description": "This paper examines the identification assumptions underlying two types of\nestimators of the causal effects of minimum wages based on regional variation\nin wage levels: the \"effective minimum wage\" and the \"fraction affected/gap\"\ndesigns. For the effective minimum wage design, I show that the identification\nassumptions emphasized by Lee (1999) are crucial for unbiased estimation but\ndifficult to satisfy in empirical applications for reasons arising from\neconomic theory. For the fraction affected design at the region level, I show\nthat economic factors such as a common trend in the dispersion of worker\nproductivity or regional convergence in GDP per capita may lead to violations\nof the \"parallel trends\" identifying assumption. The paper suggests ways to\nincrease the likelihood of detecting those issues when implementing checks for\nparallel pre-trends. I also show that this design may be subject to biases\narising from the misspecification of the treatment intensity variable,\nespecially when the minimum wage strongly affects employment and wages."}, "http://arxiv.org/abs/2311.02196": {"title": "Pooled Bewley Estimator of Long Run Relationships in Dynamic Heterogenous Panels", "link": "http://arxiv.org/abs/2311.02196", "description": "Using a transformation of the autoregressive distributed lag model due to\nBewley, a novel pooled Bewley (PB) estimator of long-run coefficients for\ndynamic panels with heterogeneous short-run dynamics is proposed. The PB\nestimator is directly comparable to the widely used Pooled Mean Group (PMG)\nestimator, and is shown to be consistent and asymptotically normal. Monte Carlo\nsimulations show good small sample performance of PB compared to the existing\nestimators in the literature, namely PMG, panel dynamic OLS (PDOLS), and panel\nfully-modified OLS (FMOLS). Application of two bias-correction methods and a\nbootstrapping of critical values to conduct inference robust to cross-sectional\ndependence of errors are also considered. The utility of the PB estimator is\nillustrated in an empirical application to the aggregate consumption function."}, "http://arxiv.org/abs/2311.02299": {"title": "The Fragility of Sparsity", "link": "http://arxiv.org/abs/2311.02299", "description": "We show, using three empirical applications, that linear regression estimates\nwhich rely on the assumption of sparsity are fragile in two ways. First, we\ndocument that different choices of the regressor matrix that don't impact\nordinary least squares (OLS) estimates, such as the choice of baseline category\nwith categorical controls, can move sparsity-based estimates two standard\nerrors or more. Second, we develop two tests of the sparsity assumption based\non comparing sparsity-based estimators with OLS. The tests tend to reject the\nsparsity assumption in all three applications. 
Unless the number of regressors\nis comparable to or exceeds the sample size, OLS yields more robust results at\nlittle efficiency cost."}, "http://arxiv.org/abs/2311.02467": {"title": "Individualized Policy Evaluation and Learning under Clustered Network Interference", "link": "http://arxiv.org/abs/2311.02467", "description": "While there now exists a large literature on policy evaluation and learning,\nmuch of prior work assumes that the treatment assignment of one unit does not\naffect the outcome of another unit. Unfortunately, ignoring interference may\nlead to biased policy evaluation and yield ineffective learned policies. For\nexample, treating influential individuals who have many friends can generate\npositive spillover effects, thereby improving the overall performance of an\nindividualized treatment rule (ITR). We consider the problem of evaluating and\nlearning an optimal ITR under clustered network (or partial) interference where\nclusters of units are sampled from a population and units may influence one\nanother within each cluster. Under this model, we propose an estimator that can\nbe used to evaluate the empirical performance of an ITR. We show that this\nestimator is substantially more efficient than the standard inverse probability\nweighting estimator, which does not impose any assumption about spillover\neffects. We derive the finite-sample regret bound for a learned ITR, showing\nthat the use of our efficient evaluation estimator leads to the improved\nperformance of learned policies. Finally, we conduct simulation and empirical\nstudies to illustrate the advantages of the proposed methodology."}, "http://arxiv.org/abs/2311.02789": {"title": "Estimation of Semiparametric Multi-Index Models Using Deep Neural Networks", "link": "http://arxiv.org/abs/2311.02789", "description": "In this paper, we consider estimation and inference for both the multi-index\nparameters and the link function involved in a class of semiparametric\nmulti-index models via deep neural networks (DNNs). We contribute to the design\nof DNN by i) providing more transparency for practical implementation, ii)\ndefining different types of sparsity, iii) showing the differentiability, iv)\npointing out the set of effective parameters, and v) offering a new variant of\nrectified linear activation function (ReLU), etc. Asymptotic properties for the\njoint estimates of both the index parameters and the link functions are\nestablished, and a feasible procedure for the purpose of inference is also\nproposed. We conduct extensive numerical studies to examine the finite-sample\nperformance of the estimation methods, and we also evaluate the empirical\nrelevance and applicability of the proposed models and estimation methods to\nreal data."}, "http://arxiv.org/abs/2201.07880": {"title": "Deep self-consistent learning of local volatility", "link": "http://arxiv.org/abs/2201.07880", "description": "We present an algorithm for the calibration of local volatility from market\noption prices through deep self-consistent learning, by approximating both\nmarket option prices and local volatility using deep neural networks,\nrespectively. Our method uses the initial-boundary value problem of the\nunderlying Dupire's partial differential equation solved by the parameterized\noption prices to bring corrections to the parameterization in a self-consistent\nway. 
By exploiting the differentiability of the neural networks, we can\nevaluate Dupire's equation locally at each strike-maturity pair; while by\nexploiting their continuity, we sample strike-maturity pairs uniformly from a\ngiven domain, going beyond the discrete points where the options are quoted.\nMoreover, the absence of arbitrage opportunities is imposed by penalizing an\nassociated loss function as a soft constraint. For comparison with existing\napproaches, the proposed method is tested on both synthetic and market option\nprices, which shows an improved performance in terms of reduced interpolation\nand reprice errors, as well as the smoothness of the calibrated local\nvolatility. An ablation study has been performed, asserting the robustness and\nsignificance of the proposed method."}, "http://arxiv.org/abs/2204.12023": {"title": "A One-Covariate-at-a-Time Method for Nonparametric Additive Models", "link": "http://arxiv.org/abs/2204.12023", "description": "This paper proposes a one-covariate-at-a-time multiple testing (OCMT)\napproach to choose significant variables in high-dimensional nonparametric\nadditive regression models. Similarly to Chudik, Kapetanios and Pesaran (2018),\nwe consider the statistical significance of individual nonparametric additive\ncomponents one at a time and take into account the multiple testing nature of\nthe problem. One-stage and multiple-stage procedures are both considered. The\nformer works well in terms of the true positive rate only if the marginal\neffects of all signals are strong enough; the latter helps to pick up hidden\nsignals that have weak marginal effects. Simulations demonstrate the good\nfinite sample performance of the proposed procedures. As an empirical\napplication, we use the OCMT procedure on a dataset we extracted from the\nLongitudinal Survey on Rural Urban Migration in China. We find that our\nprocedure works well in terms of the out-of-sample forecast root mean square\nerrors, compared with competing methods."}, "http://arxiv.org/abs/2209.03259": {"title": "A Ridge-Regularised Jackknifed Anderson-Rubin Test", "link": "http://arxiv.org/abs/2209.03259", "description": "We consider hypothesis testing in instrumental variable regression models\nwith few included exogenous covariates but many instruments -- possibly more\nthan the number of observations. We show that a ridge-regularised version of\nthe jackknifed Anderson Rubin (1949, henceforth AR) test controls asymptotic\nsize in the presence of heteroskedasticity, and when the instruments may be\narbitrarily weak. Asymptotic size control is established under weaker\nassumptions than those imposed for recently proposed jackknifed AR tests in the\nliterature. Furthermore, ridge-regularisation extends the scope of jackknifed\nAR tests to situations in which there are more instruments than observations.\nMonte-Carlo simulations indicate that our method has favourable finite-sample\nsize and power properties compared to recently proposed alternative approaches\nin the literature. 
An empirical application on the elasticity of substitution\nbetween immigrants and natives in the US illustrates the usefulness of the\nproposed method for practitioners."}, "http://arxiv.org/abs/2212.09193": {"title": "Identification of time-varying counterfactual parameters in nonlinear panel models", "link": "http://arxiv.org/abs/2212.09193", "description": "We develop a general framework for the identification of counterfactual\nparameters in a class of nonlinear semiparametric panel models with fixed\neffects and time effects. Our method applies to models for discrete outcomes\n(e.g., two-way fixed effects binary choice) or continuous outcomes (e.g.,\ncensored regression), with discrete or continuous regressors. Our results do\nnot require parametric assumptions on the error terms or time-homogeneity on\nthe outcome equation. Our main results focus on static models, with a set of\nresults applying to models without any exogeneity conditions. We show that the\nsurvival distribution of counterfactual outcomes is identified (point or\npartial) in this class of models. This parameter is a building block for most\npartial and marginal effects of interest in applied practice that are based on\nthe average structural function as defined by Blundell and Powell (2003, 2004).\nTo the best of our knowledge, ours are the first results on average partial and\nmarginal effects for binary choice and ordered choice models with two-way fixed\neffects and non-logistic errors."}, "http://arxiv.org/abs/2306.04135": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "http://arxiv.org/abs/2306.04135", "description": "We propose two approaches to estimate semiparametric discrete choice models\nfor bundles. Our first approach is a kernel-weighted rank estimator based on a\nmatching-based identification strategy. We establish its complete asymptotic\nproperties and prove the validity of the nonparametric bootstrap for inference.\nWe then introduce a new multi-index least absolute deviations (LAD) estimator\nas an alternative, of which the main advantage is its capacity to estimate\npreference parameters on both alternative- and agent-specific regressors. Both\nmethods can account for arbitrary correlation in disturbances across choices,\nwith the former also allowing for interpersonal heteroskedasticity. We also\ndemonstrate that the identification strategy underlying these procedures can be\nextended naturally to panel data settings, producing an analogous localized\nmaximum score estimator and a LAD estimator for estimating bundle choice models\nwith fixed effects. We derive the limiting distribution of the former and\nverify the validity of the numerical bootstrap as an inference tool. All our\nproposed methods can be applied to general multi-index models. Monte Carlo\nexperiments show that they perform well in finite samples."}, "http://arxiv.org/abs/2311.03471": {"title": "Optimal Estimation Methodologies for Panel Data Regression Models", "link": "http://arxiv.org/abs/2311.03471", "description": "This survey study discusses the main aspects of optimal estimation methodologies\nfor panel data regression models. In particular, we present current\nmethodological developments for modeling stationary panel data as well as\nrobust methods for estimation and inference in nonstationary panel data\nregression models. 
Some applications from the network econometrics and high\ndimensional statistics literature are also discussed within a stationary time\nseries environment."}, "http://arxiv.org/abs/2311.04073": {"title": "Debiased Fixed Effects Estimation of Binary Logit Models with Three-Dimensional Panel Data", "link": "http://arxiv.org/abs/2311.04073", "description": "Naive maximum likelihood estimation of binary logit models with fixed effects\nleads to unreliable inference due to the incidental parameter problem. We study\nthe case of three-dimensional panel data, where the model includes three sets\nof additive and overlapping unobserved effects. This encompasses models for\nnetwork panel data, where senders and receivers maintain bilateral\nrelationships over time, and fixed effects account for unobserved heterogeneity\nat the sender-time, receiver-time, and sender-receiver levels. In an asymptotic\nframework, where all three panel dimensions grow large at constant relative\nrates, we characterize the leading bias of the naive estimator. The inference\nproblem we identify is particularly severe, as it is not possible to balance\nthe order of the bias and the standard deviation. As a consequence, the naive\nestimator has a degenerating asymptotic distribution, which exacerbates the\ninference problem relative to other fixed effects estimators studied in the\nliterature. To resolve the inference problem, we derive explicit expressions to\ndebias the fixed effects estimator."}, "http://arxiv.org/abs/2207.09246": {"title": "Asymptotic Properties of Endogeneity Corrections Using Nonlinear Transformations", "link": "http://arxiv.org/abs/2207.09246", "description": "This paper considers a linear regression model with an endogenous regressor\nwhich arises from a nonlinear transformation of a latent variable. It is shown\nthat the corresponding coefficient can be consistently estimated without\nexternal instruments by adding a rank-based transformation of the regressor to\nthe model and performing standard OLS estimation. In contrast to other\napproaches, our nonparametric control function approach does not rely on a\nconformably specified copula. Furthermore, the approach allows for the presence\nof additional exogenous regressors which may be (linearly) correlated with the\nendogenous regressor(s). Consistency and asymptotic normality of the estimator\nare proved and the estimator is compared with copula based approaches by means\nof Monte Carlo simulations. An empirical application on wage data of the US\ncurrent population survey demonstrates the usefulness of our method."}, "http://arxiv.org/abs/2301.10643": {"title": "Automatic Locally Robust Estimation with Generated Regressors", "link": "http://arxiv.org/abs/2301.10643", "description": "Many economic and causal parameters of interest depend on generated\nregressors. Examples include structural parameters in models with endogenous\nvariables estimated by control functions and in models with sample selection,\ntreatment effect estimation with propensity score matching, and marginal\ntreatment effects. Inference with generated regressors is complicated by the\nvery complex expression for influence functions and asymptotic variances. To\naddress this problem, we propose Automatic Locally Robust/debiased GMM\nestimators in a general setting with generated regressors. Importantly, we\nallow for the generated regressors to be generated from machine learners, such\nas Random Forest, Neural Nets, Boosting, and many others. 
We use our results to\nconstruct novel Doubly Robust and Locally Robust estimators for the\nCounterfactual Average Structural Function and Average Partial Effects in\nmodels with endogeneity and sample selection, respectively. We provide\nsufficient conditions for the asymptotic normality of our debiased GMM\nestimators and investigate their finite sample performance through Monte Carlo\nsimulations."}, "http://arxiv.org/abs/2303.11399": {"title": "How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on Over 60 Replicated Studies", "link": "http://arxiv.org/abs/2303.11399", "description": "Instrumental variable (IV) strategies are widely used in political science to\nestablish causal relationships. However, the identifying assumptions required\nby an IV design are demanding, and it remains challenging for researchers to\nassess their validity. In this paper, we replicate 67 papers published in three\ntop journals in political science during 2010-2022 and identify several\ntroubling patterns. First, researchers often overestimate the strength of their\nIVs due to non-i.i.d. errors, such as a clustering structure. Second, the most\ncommonly used t-test for the two-stage-least-squares (2SLS) estimates often\nseverely underestimates uncertainty. Using more robust inferential methods, we\nfind that around 19-30% of the 2SLS estimates in our sample are underpowered.\nThird, in the majority of the replicated studies, the 2SLS estimates are much\nlarger than the ordinary-least-squares estimates, and their ratio is negatively\ncorrelated with the strength of the IVs in studies where the IVs are not\nexperimentally generated, suggesting potential violations of unconfoundedness\nor the exclusion restriction. To help researchers avoid these pitfalls, we\nprovide a checklist for better practice."}, "http://arxiv.org/abs/2208.02028": {"title": "Bootstrap inference in the presence of bias", "link": "http://arxiv.org/abs/2208.02028", "description": "We consider bootstrap inference for estimators which are (asymptotically)\nbiased. We show that, even when the bias term cannot be consistently estimated,\nvalid inference can be obtained by proper implementations of the bootstrap.\nSpecifically, we show that the prepivoting approach of Beran (1987, 1988),\noriginally proposed to deliver higher-order refinements, restores bootstrap\nvalidity by transforming the original bootstrap p-value into an asymptotically\nuniform random variable. We propose two different implementations of\nprepivoting (plug-in and double bootstrap), and provide general high-level\nconditions that imply validity of bootstrap inference. To illustrate the\npractical relevance and implementation of our results, we discuss five\nexamples: (i) inference on a target parameter based on model averaging; (ii)\nridge-type regularized estimators; (iii) nonparametric regression; (iv) a\nlocation model for infinite variance data; and (v) dynamic panel data models."}, "http://arxiv.org/abs/2304.01273": {"title": "Heterogeneity-robust granular instruments", "link": "http://arxiv.org/abs/2304.01273", "description": "Granular instrumental variables (GIV) has experienced sharp growth in\nempirical macro-finance. The methodology's rise showcases granularity's\npotential for identification in a wide set of economic environments, like the\nestimation of spillovers and demand systems. 
I propose a new estimator--called\nrobust granular instrumental variables (RGIV)--that allows researchers to study\nunit-level heterogeneity in spillovers within GIV's framework. In contrast to\nGIV, RGIV also allows for unknown shock variances and does not require skewness\nof the size distribution of units. I also develop a test of overidentifying\nrestrictions that evaluates RGIV's compatibility with the data, a parameter\nrestriction test that evaluates the appropriateness of the homogeneous\nspillovers assumption, and extend the framework to allow for observable\nexplanatory variables. Applied to the Euro area, I find strong evidence of\ncountry-level heterogeneity in sovereign yield spillovers. In simulations, I\nshow that RGIV produces reliable and informative confidence intervals."}, "http://arxiv.org/abs/2309.11387": {"title": "Identifying Causal Effects in Information Provision Experiments", "link": "http://arxiv.org/abs/2309.11387", "description": "Information provision experiments are a popular way to study causal effects\nof beliefs on behavior. Researchers estimate these effects using TSLS. I show\nthat existing TSLS specifications do not estimate the average partial effect;\nthey have weights proportional to belief updating in the first-stage. If people\nwhose decisions depend on their beliefs gather information before the\nexperiment, the information treatment may shift beliefs more for people with\nweak belief effects. This attenuates TSLS estimates. I propose researchers use\na local-least-squares (LLS) estimator that I show consistently estimates the\naverage partial effect (APE) under Bayesian updating, and apply it to Settele\n(2022)."}, "http://arxiv.org/abs/2107.13737": {"title": "Design-Robust Two-Way-Fixed-Effects Regression For Panel Data", "link": "http://arxiv.org/abs/2107.13737", "description": "We propose a new estimator for average causal effects of a binary treatment\nwith panel data in settings with general treatment patterns. Our approach\naugments the popular two-way-fixed-effects specification with unit-specific\nweights that arise from a model for the assignment mechanism. We show how to\nconstruct these weights in various settings, including the staggered adoption\nsetting, where units opt into the treatment sequentially but permanently. The\nresulting estimator converges to an average (over units and time) treatment\neffect under the correct specification of the assignment model, even if the\nfixed effect model is misspecified. We show that our estimator is more robust\nthan the conventional two-way estimator: it remains consistent if either the\nassignment mechanism or the two-way regression model is correctly specified. In\naddition, the proposed estimator performs better than the two-way-fixed-effect\nestimator if the outcome model and assignment mechanism are locally\nmisspecified. This strong double robustness property underlines and quantifies\nthe benefits of modeling the assignment process and motivates using our\nestimator in practice. We also discuss an extension of our estimator to handle\ndynamic treatment effects."}, "http://arxiv.org/abs/2302.09756": {"title": "Identification-robust inference for the LATE with high-dimensional covariates", "link": "http://arxiv.org/abs/2302.09756", "description": "This paper presents an inference method for the local average treatment\neffect (LATE) in the presence of high-dimensional covariates, irrespective of\nthe strength of identification. 
We propose a novel high-dimensional conditional\ntest statistic with uniformly correct asymptotic size. We provide an\neasy-to-implement algorithm to infer the high-dimensional LATE by inverting our\ntest statistic and employing the double/debiased machine learning method.\nSimulations indicate that our test is robust against both weak identification\nand high dimensionality concerning size control and power performance,\noutperforming other conventional tests. Applying the proposed method to\nrailroad and population data to study the effect of railroad access on urban\npopulation growth, we observe that our methodology yields confidence intervals\nthat are 49% to 92% shorter than conventional results, depending on\nspecifications."}, "http://arxiv.org/abs/2311.05883": {"title": "Time-Varying Identification of Monetary Policy Shocks", "link": "http://arxiv.org/abs/2311.05883", "description": "We propose a new Bayesian heteroskedastic Markov-switching structural vector\nautoregression with data-driven time-varying identification. The model selects\nalternative exclusion restrictions over time and, as a condition for the\nsearch, allows us to verify identification through heteroskedasticity within each\nregime. Based on four alternative monetary policy rules, we show that a monthly\nsix-variable system supports time variation in US monetary policy shock\nidentification. In the sample-dominating first regime, systematic monetary\npolicy follows a Taylor rule extended by the term spread and is effective in\ncurbing inflation. In the second regime, occurring after 2000 and gaining more\npersistence after the global financial and COVID crises, the Fed acts according\nto a money-augmented Taylor rule. This regime's unconventional monetary policy\nprovides economic stimulus, features the liquidity effect, and is complemented\nby a pure term spread shock. Absent the specific monetary policy of the second\nregime, inflation would be over one percentage point higher on average after\n2008."}, "http://arxiv.org/abs/2309.14160": {"title": "Unified Inference for Dynamic Quantile Predictive Regression", "link": "http://arxiv.org/abs/2309.14160", "description": "This paper develops unified asymptotic distribution theory for dynamic\nquantile predictive regressions which is useful when examining quantile\npredictability in stock returns under possible presence of nonstationarity."}, "http://arxiv.org/abs/2311.06256": {"title": "From Deep Filtering to Deep Econometrics", "link": "http://arxiv.org/abs/2311.06256", "description": "Calculating true volatility is an essential task for option pricing and risk\nmanagement. However, it is made difficult by market microstructure noise.\nParticle filtering has been proposed to solve this problem as it has favorable\nstatistical properties, but relies on assumptions about underlying market\ndynamics. Machine learning methods have also been proposed but lack\ninterpretability, and often lag in performance. In this paper we implement the\nSV-PF-RNN: a hybrid neural network and particle filter architecture. Our\nSV-PF-RNN is designed specifically with stochastic volatility estimation in\nmind. We then show that it can improve on the performance of a basic particle\nfilter."}, "http://arxiv.org/abs/2311.06831": {"title": "Quasi-Bayes in Latent Variable Models", "link": "http://arxiv.org/abs/2311.06831", "description": "Latent variable models are widely used to account for unobserved determinants\nof economic behavior. 
Traditional nonparametric methods to estimate latent\nheterogeneity do not scale well into multidimensional settings. Distributional\nrestrictions alleviate tractability concerns but may impart non-trivial\nmisspecification bias. Motivated by these concerns, this paper introduces a\nquasi-Bayes approach to estimate a large class of multidimensional latent\nvariable models. Our approach to quasi-Bayes is novel in that we center it\naround relating the characteristic function of observables to the distribution\nof unobservables. We propose a computationally attractive class of priors that\nare supported on Gaussian mixtures and derive contraction rates for a variety\nof latent variable models."}, "http://arxiv.org/abs/2311.06891": {"title": "Design-based Estimation Theory for Complex Experiments", "link": "http://arxiv.org/abs/2311.06891", "description": "This paper considers the estimation of treatment effects in randomized\nexperiments with complex experimental designs, including cases with\ninterference between units. We develop a design-based estimation theory for\narbitrary experimental designs. Our theory facilitates the analysis of many\ndesign-estimator pairs that researchers commonly employ in practice and provides\nprocedures to consistently estimate asymptotic variance bounds. We propose new\nclasses of estimators with favorable asymptotic properties from a design-based\npoint of view. In addition, we propose a scalar measure of experimental\ncomplexity which can be linked to the design-based variance of the estimators.\nWe demonstrate the performance of our estimators using simulated datasets based\non an actual network experiment studying the effect of social networks on\ninsurance adoptions."}, "http://arxiv.org/abs/2311.07067": {"title": "High Dimensional Binary Choice Model with Unknown Heteroskedasticity or Instrumental Variables", "link": "http://arxiv.org/abs/2311.07067", "description": "This paper proposes a new method for estimating high-dimensional binary\nchoice models. The model we consider is semiparametric, placing no\ndistributional assumptions on the error term, allowing for heteroskedastic\nerrors, and permitting endogenous regressors. Our proposed approaches extend\nthe special regressor estimator originally proposed by Lewbel (2000). This\nestimator becomes impractical in high-dimensional settings due to the curse of\ndimensionality associated with high-dimensional conditional density estimation.\nTo overcome this challenge, we introduce an innovative data-driven dimension\nreduction method for nonparametric kernel estimators, which constitutes the\nmain innovation of this work. The method combines distance covariance-based\nscreening with cross-validation (CV) procedures, rendering the special\nregressor estimation feasible in high dimensions. Using the new feasible\nconditional density estimator, we address the variable and moment (instrumental\nvariable) selection problems for these models. We apply penalized least squares\n(LS) and Generalized Method of Moments (GMM) estimators with a smoothly clipped\nabsolute deviation (SCAD) penalty. A comprehensive analysis of the oracle and\nasymptotic properties of these estimators is provided. 
Monte Carlo simulations\nare employed to demonstrate the effectiveness of our proposed procedures in\nfinite sample scenarios."}, "http://arxiv.org/abs/2311.07243": {"title": "Optimal Estimation of Large-Dimensional Nonlinear Factor Models", "link": "http://arxiv.org/abs/2311.07243", "description": "This paper studies optimal estimation of large-dimensional nonlinear factor\nmodels. The key challenge is that the observed variables are possibly nonlinear\nfunctions of some latent variables where the functional forms are left\nunspecified. A local principal component analysis method is proposed to\nestimate the factor structure and recover information on latent variables and\nlatent functions, which combines $K$-nearest neighbors matching and principal\ncomponent analysis. Large-sample properties are established, including a sharp\nbound on the matching discrepancy of nearest neighbors, sup-norm error bounds\nfor estimated local factors and factor loadings, and the uniform convergence\nrate of the factor structure estimator. Under mild conditions our estimator of\nthe latent factor structure can achieve the optimal rate of uniform convergence\nfor nonparametric regression. The method is illustrated with a Monte Carlo\nexperiment and an empirical application studying the effect of tax cuts on\neconomic growth."}, "http://arxiv.org/abs/1902.09608": {"title": "On Binscatter", "link": "http://arxiv.org/abs/1902.09608", "description": "Binscatter is a popular method for visualizing bivariate relationships and\nconducting informal specification testing. We study the properties of this\nmethod formally and develop enhanced visualization and econometric binscatter\ntools. These include estimating conditional means with optimal binning and\nquantifying uncertainty. We also highlight a methodological problem related to\ncovariate adjustment that can yield incorrect conclusions. We revisit two\napplications using our methodology and find substantially different results\nrelative to those obtained using prior informal binscatter methods. General\npurpose software in Python, R, and Stata is provided. Our technical work is of\nindependent interest for the nonparametric partition-based estimation\nliterature."}, "http://arxiv.org/abs/2109.09043": {"title": "Composite Likelihood for Stochastic Migration Model with Unobserved Factor", "link": "http://arxiv.org/abs/2109.09043", "description": "We introduce the conditional Maximum Composite Likelihood (MCL) estimation\nmethod for the stochastic factor ordered Probit model of credit rating\ntransitions of firms. This model is recommended for internal credit risk\nassessment procedures in banks and financial institutions under the Basel III\nregulations. Its exact likelihood function involves a high-dimensional\nintegral, which can be approximated numerically before maximization. However,\nthe estimated migration risk and required capital tend to be sensitive to the\nquality of this approximation, potentially leading to statistical regulatory\narbitrage. The proposed conditional MCL estimator circumvents this problem and\nmaximizes the composite log-likelihood of the factor ordered Probit model. We\npresent three conditional MCL estimators of different complexity and examine\ntheir consistency and asymptotic normality when n and T tend to infinity. The\nperformance of these estimators at finite T is examined and compared with a\ngranularity-based approach in a simulation study. 
The use of the MCL estimator\nis also illustrated in an empirical application."}, "http://arxiv.org/abs/2111.01301": {"title": "Asymptotic in a class of network models with an increasing sub-Gamma degree sequence", "link": "http://arxiv.org/abs/2111.01301", "description": "For the differential privacy under the sub-Gamma noise, we derive the\nasymptotic properties of a class of network models with binary values with a\ngeneral link function. In this paper, we release the degree sequences of the\nbinary networks under a general noisy mechanism with the discrete Laplace\nmechanism as a special case. We establish the asymptotic results, including both\nconsistency and asymptotic normality of the parameter estimator when the\nnumber of parameters goes to infinity in a class of network models. Simulations\nand a real data example are provided to illustrate asymptotic results."}, "http://arxiv.org/abs/2309.08982": {"title": "Least squares estimation in nonlinear cohort panels with learning from experience", "link": "http://arxiv.org/abs/2309.08982", "description": "We discuss techniques of estimation and inference for nonlinear cohort panels\nwith learning from experience, showing, inter alia, the consistency and\nasymptotic normality of the nonlinear least squares estimator employed in the\nseminal paper by Malmendier and Nagel (2016, QJE). Potential pitfalls for\nhypothesis testing are identified and solutions proposed. Monte Carlo\nsimulations verify the properties of the estimator and corresponding test\nstatistics in finite samples, while an application to a panel of survey\nexpectations demonstrates the usefulness of the theory developed."}, "http://arxiv.org/abs/2311.08218": {"title": "Estimating Conditional Value-at-Risk with Nonstationary Quantile Predictive Regression Models", "link": "http://arxiv.org/abs/2311.08218", "description": "This paper develops an asymptotic distribution theory for a two-stage\ninstrumentation estimation approach in quantile predictive regressions when\nboth generated covariates and persistent predictors are used. The generated\ncovariates are obtained from an auxiliary quantile regression model and our\nmain interest is the robust estimation and inference of the primary quantile\npredictive regression in which this generated covariate is added to the set of\nnonstationary regressors. We find that the proposed doubly IVX estimator is\nrobust to the abstract degree of persistence regardless of the presence of\ngenerated regressor obtained from the first stage procedure. The asymptotic\nproperties of the two-stage IVX estimator such as mixed Gaussianity are\nestablished while the asymptotic covariance matrix is adjusted to account for\nthe first-step estimation error."}, "http://arxiv.org/abs/2302.00469": {"title": "Regression adjustment in randomized controlled trials with many covariates", "link": "http://arxiv.org/abs/2302.00469", "description": "This paper is concerned with estimation and inference on average treatment\neffects in randomized controlled trials when researchers observe potentially\nmany covariates. By employing Neyman's (1923) finite population perspective, we\npropose a bias-corrected regression adjustment estimator using cross-fitting,\nand show that the proposed estimator has favorable properties over existing\nalternatives. 
For inference, we derive the first and second order terms in the\nstochastic component of the regression adjustment estimators, study higher\norder properties of the existing inference methods, and propose a\nbias-corrected version of the HC3 standard error. The proposed methods readily\nextend to stratified experiments with large strata. Simulation studies show our\ncross-fitted estimator, combined with the bias-corrected HC3, delivers precise\npoint estimates and robust size controls over a wide range of DGPs. To\nillustrate, the proposed methods are applied to a real dataset on randomized\nexperiments of incentives and services for college achievement following\nAngrist, Lang, and Oreopoulos (2009)."}, "http://arxiv.org/abs/2311.08958": {"title": "Locally Asymptotically Minimax Statistical Treatment Rules Under Partial Identification", "link": "http://arxiv.org/abs/2311.08958", "description": "Policymakers often desire a statistical treatment rule (STR) that determines\na treatment assignment rule deployed in a future population from available\ndata. With the true knowledge of the data generating process, the average\ntreatment effect (ATE) is the key quantity characterizing the optimal treatment\nrule. Unfortunately, the ATE is often not point identified but partially\nidentified. Presuming the partial identification of the ATE, this study\nconducts a local asymptotic analysis and develops the locally asymptotically\nminimax (LAM) STR. The analysis does not assume the full differentiability but\nthe directional differentiability of the boundary functions of the\nidentification region of the ATE. Accordingly, the study shows that the LAM STR\ndiffers from the plug-in STR. A simulation study also demonstrates that the LAM\nSTR outperforms the plug-in STR."}, "http://arxiv.org/abs/2311.08963": {"title": "Incorporating Preferences Into Treatment Assignment Problems", "link": "http://arxiv.org/abs/2311.08963", "description": "This study investigates the problem of individualizing treatment allocations\nusing stated preferences for treatments. If individuals know in advance how the\nassignment will be individualized based on their stated preferences, they may\nstate false preferences. We derive an individualized treatment rule (ITR) that\nmaximizes welfare when individuals strategically state their preferences. We\nalso show that the optimal ITR is strategy-proof, that is, individuals do not\nhave a strong incentive to lie even if they know the optimal ITR a priori.\nConstructing the optimal ITR requires information on the distribution of true\npreferences and the average treatment effect conditioned on true preferences.\nIn practice, the information must be identified and estimated from the data. As\ntrue preferences are hidden information, the identification is not\nstraightforward. We discuss two experimental designs that allow the\nidentification: strictly strategy-proof randomized controlled trials and doubly\nrandomized preference trials. Under the presumption that data comes from one of\nthese experiments, we develop data-dependent procedures for determining ITR,\nthat is, statistical treatment rules (STRs). The maximum regret of the proposed\nSTRs converges to zero at a rate of the square root of the sample size. 
An\nempirical application demonstrates our proposed STRs."}, "http://arxiv.org/abs/2007.04267": {"title": "Difference-in-Differences Estimators of Intertemporal Treatment Effects", "link": "http://arxiv.org/abs/2007.04267", "description": "We study treatment-effect estimation using panel data. The treatment may be\nnon-binary, non-absorbing, and the outcome may be affected by treatment lags.\nWe make a parallel-trends assumption, and propose event-study estimators of the\neffect of being exposed to a weakly higher treatment dose for $\\ell$ periods.\nWe also propose normalized estimators, that estimate a weighted average of the\neffects of the current treatment and its lags. We also analyze commonly-used\ntwo-way-fixed-effects regressions. Unlike our estimators, they can be biased in\nthe presence of heterogeneous treatment effects. A local-projection version of\nthose regressions is biased even with homogeneous effects."}, "http://arxiv.org/abs/2007.10432": {"title": "Treatment Effects with Targeting Instruments", "link": "http://arxiv.org/abs/2007.10432", "description": "Multivalued treatments are commonplace in applications. We explore the use of\ndiscrete-valued instruments to control for selection bias in this setting. Our\ndiscussion revolves around the concept of targeting instruments: which\ninstruments target which treatments. It allows us to establish conditions under\nwhich counterfactual averages and treatment effects are point- or\npartially-identified for composite complier groups. We illustrate the\nusefulness of our framework by applying it to data from the Head Start Impact\nStudy. Under a plausible positive selection assumption, we derive informative\nbounds that suggest less beneficial effects of Head Start expansions than the\nparametric estimates of Kline and Walters (2016)."}, "http://arxiv.org/abs/2310.00786": {"title": "Semidiscrete optimal transport with unknown costs", "link": "http://arxiv.org/abs/2310.00786", "description": "Semidiscrete optimal transport is a challenging generalization of the\nclassical transportation problem in linear programming. The goal is to design a\njoint distribution for two random variables (one continuous, one discrete) with\nfixed marginals, in a way that minimizes expected cost. We formulate a novel\nvariant of this problem in which the cost functions are unknown, but can be\nlearned through noisy observations; however, only one function can be sampled\nat a time. We develop a semi-myopic algorithm that couples online learning with\nstochastic approximation, and prove that it achieves optimal convergence rates,\ndespite the non-smoothness of the stochastic gradient and the lack of strong\nconcavity in the objective function."}, "http://arxiv.org/abs/2311.09435": {"title": "Estimating Functionals of the Joint Distribution of Potential Outcomes with Optimal Transport", "link": "http://arxiv.org/abs/2311.09435", "description": "Many causal parameters depend on a moment of the joint distribution of\npotential outcomes. Such parameters are especially relevant in policy\nevaluation settings, where noncompliance is common and accommodated through the\nmodel of Imbens & Angrist (1994). This paper shows that the sharp identified\nset for these parameters is an interval with endpoints characterized by the\nvalue of optimal transport problems. Sample analogue estimators are proposed\nbased on the dual problem of optimal transport. These estimators are root-n\nconsistent and converge in distribution under mild assumptions. 
Inference\nprocedures based on the bootstrap are straightforward and computationally\nconvenient. The ideas and estimators are demonstrated in an application\nrevisiting the National Supported Work Demonstration job training program. I\nfind suggestive evidence that workers who would see below average earnings\nwithout treatment tend to see above average benefits from treatment."}, "http://arxiv.org/abs/2311.09972": {"title": "Inference in Auctions with Many Bidders Using Transaction Prices", "link": "http://arxiv.org/abs/2311.09972", "description": "This paper considers inference in first-price and second-price sealed-bid\nauctions with a large number of symmetric bidders having independent private\nvalues. Given the abundance of bidders in each auction, we propose an\nasymptotic framework in which the number of bidders diverges while the number\nof auctions remains fixed. This framework allows us to perform asymptotically\nexact inference on key model features using only transaction price data.\nSpecifically, we examine inference on the expected utility of the auction\nwinner, the expected revenue of the seller, and the tail properties of the\nvaluation distribution. Simulations confirm the accuracy of our inference\nmethods in finite samples. Finally, we also apply them to Hong Kong car license\nauction data."}, "http://arxiv.org/abs/2212.06080": {"title": "Logs with zeros? Some problems and solutions", "link": "http://arxiv.org/abs/2212.06080", "description": "When studying an outcome $Y$ that is weakly-positive but can equal zero (e.g.\nearnings), researchers frequently estimate an average treatment effect (ATE)\nfor a \"log-like\" transformation that behaves like $\\log(Y)$ for large $Y$ but\nis defined at zero (e.g. $\\log(1+Y)$, $\\mathrm{arcsinh}(Y)$). We argue that\nATEs for log-like transformations should not be interpreted as approximating\npercentage effects, since unlike a percentage, they depend on the units of the\noutcome. In fact, we show that if the treatment affects the extensive margin,\none can obtain a treatment effect of any magnitude simply by re-scaling the\nunits of $Y$ before taking the log-like transformation. This arbitrary\nunit-dependence arises because an individual-level percentage effect is not\nwell-defined for individuals whose outcome changes from zero to non-zero when\nreceiving treatment, and the units of the outcome implicitly determine how much\nweight the ATE for a log-like transformation places on the extensive margin. We\nfurther establish a trilemma: when the outcome can equal zero, there is no\ntreatment effect parameter that is an average of individual-level treatment\neffects, unit-invariant, and point-identified. We discuss several alternative\napproaches that may be sensible in settings with an intensive and extensive\nmargin, including (i) expressing the ATE in levels as a percentage (e.g. using\nPoisson regression), (ii) explicitly calibrating the value placed on the\nintensive and extensive margins, and (iii) estimating separate effects for the\ntwo margins (e.g. using Lee bounds). We illustrate these approaches in three\nempirical applications."}, "http://arxiv.org/abs/2308.05486": {"title": "The Distributional Impact of Money Growth and Inflation Disaggregates: A Quantile Sensitivity Analysis", "link": "http://arxiv.org/abs/2308.05486", "description": "We propose an alternative method to construct a quantile dependence system\nfor inflation and money growth. 
By considering all quantiles, we assess how\nperturbations in one variable's quantile lead to changes in the distribution of\nthe other variable. We demonstrate the construction of this relationship\nthrough a system of linear quantile regressions. The proposed framework is\nexploited to examine the distributional effects of money growth on the\ndistributions of inflation and its disaggregate measures in the United States\nand the Euro area. Our empirical analysis uncovers significant impacts of the\nupper quantile of the money growth distribution on the distribution of\ninflation and its disaggregate measures. Conversely, we find that the lower and\nmedian quantiles of the money growth distribution have a negligible influence\non the distribution of inflation and its disaggregate measures."}, "http://arxiv.org/abs/2311.10685": {"title": "High-Throughput Asset Pricing", "link": "http://arxiv.org/abs/2311.10685", "description": "We use empirical Bayes (EB) to mine for out-of-sample returns among 73,108\nlong-short strategies constructed from accounting ratios, past returns, and\nticker symbols. EB predicts returns are concentrated in accounting and past\nreturn strategies, small stocks, and pre-2004 samples. The cross-section of\nout-of-sample return lines up closely with EB predictions. Data-mined\nportfolios have mean returns comparable with published portfolios, but the\ndata-mined returns are arguably free of data mining bias. In contrast,\ncontrolling for multiple testing following Harvey, Liu, and Zhu (2016) misses\nthe vast majority of returns. This \"high-throughput asset pricing\" provides an\nevidence-based solution for data mining bias."}, "http://arxiv.org/abs/2209.11444": {"title": "Identification of the Marginal Treatment Effect with Multivalued Treatments", "link": "http://arxiv.org/abs/2209.11444", "description": "The multinomial choice model based on utility maximization has been widely\nused to select treatments. In this paper, we establish sufficient conditions\nfor the identification of the marginal treatment effects with multivalued\ntreatments. Our result reveals treatment effects conditioned on the willingness\nto participate in treatments against a specific treatment. Further, our results\ncan identify other parameters such as the marginal distribution of potential\noutcomes."}, "http://arxiv.org/abs/2211.04027": {"title": "Bootstraps for Dynamic Panel Threshold Models", "link": "http://arxiv.org/abs/2211.04027", "description": "This paper develops valid bootstrap inference methods for the dynamic panel\nthreshold regression. For the first-differenced generalized method of moments\n(GMM) estimation for the dynamic short panel, we show that the standard\nnonparametric bootstrap is inconsistent. The inconsistency is due to an\n$n^{1/4}$-consistent non-normal asymptotic distribution for the threshold\nestimate when the parameter resides within the continuity region of the\nparameter space, which stems from the rank deficiency of the approximate\nJacobian of the sample moment conditions on the continuity region. We propose a\ngrid bootstrap to construct confidence sets for the threshold, a residual\nbootstrap to construct confidence intervals for the coefficients, and a\nbootstrap for testing continuity. They are shown to be valid under uncertain\ncontinuity. A set of Monte Carlo experiments demonstrate that the proposed\nbootstraps perform well in the finite samples and improve upon the asymptotic\nnormal approximation even under a large jump at the threshold. 
An empirical\napplication to firms' investment model illustrates our methods."}, "http://arxiv.org/abs/2304.07480": {"title": "Gini-stable Lorenz curves and their relation to the generalised Pareto distribution", "link": "http://arxiv.org/abs/2304.07480", "description": "We introduce an iterative discrete information production process where we\ncan extend ordered normalised vectors by new elements based on a simple affine\ntransformation, while preserving the predefined level of inequality, G, as\nmeasured by the Gini index.\n\nThen, we derive the family of empirical Lorenz curves of the corresponding\nvectors and prove that it is stochastically ordered with respect to both the\nsample size and G which plays the role of the uncertainty parameter. We prove\nthat asymptotically, we obtain all, and only, Lorenz curves generated by a new,\nintuitive parametrisation of the finite-mean Pickands' Generalised Pareto\nDistribution (GPD) that unifies three other families, namely: the Pareto Type\nII, exponential, and scaled beta ones. The family is not only totally ordered\nwith respect to the parameter G, but also, thanks to our derivations, has a\nnice underlying interpretation. Our result may thus shed a new light on the\ngenesis of this family of distributions.\n\nOur model fits bibliometric, informetric, socioeconomic, and environmental\ndata reasonably well. It is quite user-friendly for it only depends on the\nsample size and its Gini index."}, "http://arxiv.org/abs/2311.11637": {"title": "Modeling economies of scope in joint production: Convex regression of input distance function", "link": "http://arxiv.org/abs/2311.11637", "description": "Modeling of joint production has proved a vexing problem. This paper develops\na radial convex nonparametric least squares (CNLS) approach to estimate the\ninput distance function with multiple outputs. We document the correct input\ndistance function transformation and prove that the necessary orthogonality\nconditions can be satisfied in radial CNLS. A Monte Carlo study is performed to\ncompare the finite sample performance of radial CNLS and other deterministic\nand stochastic frontier approaches in terms of the input distance function\nestimation. We apply our novel approach to the Finnish electricity distribution\nnetwork regulation and empirically confirm that the input isoquants become more\ncurved. In addition, we introduce the weight restriction to radial CNLS to\nmitigate the potential overfitting and increase the out-of-sample performance\nin energy regulation."}, "http://arxiv.org/abs/2311.11858": {"title": "Theory coherent shrinkage of Time-Varying Parameters in VARs", "link": "http://arxiv.org/abs/2311.11858", "description": "Time-Varying Parameters Vector Autoregressive (TVP-VAR) models are frequently\nused in economics to capture evolving relationships among the macroeconomic\nvariables. However, TVP-VARs have the tendency of overfitting the data,\nresulting in inaccurate forecasts and imprecise estimates of typical objects of\ninterests such as the impulse response functions. This paper introduces a\nTheory Coherent Time-Varying Parameters Vector Autoregressive Model\n(TC-TVP-VAR), which leverages on an arbitrary theoretical framework derived by\nan underlying economic theory to form a prior for the time varying parameters.\nThis \"theory coherent\" shrinkage prior significantly improves inference\nprecision and forecast accuracy over the standard TVP-VAR. 
Furthermore, the\nTC-TVP-VAR can be used to perform indirect posterior inference on the deep\nparameters of the underlying economic theory. The paper reveals that using the\nclassical 3-equation New Keynesian block to form a prior for the TVP-VAR\nsubstantially enhances forecast accuracy of output growth and of the inflation\nrate in a standard model of monetary policy. Additionally, the paper shows that\nthe TC-TVP-VAR can be used to address the inferential challenges during the\nZero Lower Bound period."}, "http://arxiv.org/abs/2007.07842": {"title": "Persistence in Financial Connectedness and Systemic Risk", "link": "http://arxiv.org/abs/2007.07842", "description": "This paper characterises dynamic linkages arising from shocks with\nheterogeneous degrees of persistence. Using frequency domain techniques, we\nintroduce measures that identify smoothly varying links of a transitory and\npersistent nature. Our approach allows us to test for statistical differences\nin such dynamic links. We document substantial differences in transitory and\npersistent linkages among US financial industry volatilities, argue that they\ntrack heterogeneously persistent sources of systemic risk, and thus may serve\nas a useful tool for market participants."}, "http://arxiv.org/abs/2205.11365": {"title": "Graph-Based Methods for Discrete Choice", "link": "http://arxiv.org/abs/2205.11365", "description": "Choices made by individuals have widespread impacts--for instance, people\nchoose between political candidates to vote for, between social media posts to\nshare, and between brands to purchase--moreover, data on these choices are\nincreasingly abundant. Discrete choice models are a key tool for learning\nindividual preferences from such data. Additionally, social factors like\nconformity and contagion influence individual choice. Traditional methods for\nincorporating these factors into choice models do not account for the entire\nsocial network and require hand-crafted features. To overcome these\nlimitations, we use graph learning to study choice in networked contexts. We\nidentify three ways in which graph learning techniques can be used for discrete\nchoice: learning chooser representations, regularizing choice model parameters,\nand directly constructing predictions from a network. We design methods in each\ncategory and test them on real-world choice datasets, including county-level\n2016 US election results and Android app installation and usage data. We show\nthat incorporating social network structure can improve the predictions of the\nstandard econometric choice model, the multinomial logit. 
We provide evidence\nthat app installations are influenced by social context, but we find no such\neffect on app usage among the same participants, which instead is habit-driven.\nIn the election data, we highlight the additional insights a discrete choice\nframework provides over classification or regression, the typical approaches.\nOn synthetic data, we demonstrate the sample complexity benefit of using social\ninformation in choice models."}, "http://arxiv.org/abs/2306.09287": {"title": "Modelling and Forecasting Macroeconomic Risk with Time Varying Skewness Stochastic Volatility Models", "link": "http://arxiv.org/abs/2306.09287", "description": "Monitoring downside risk and upside risk to the key macroeconomic indicators\nis critical for effective policymaking aimed at maintaining economic stability.\nIn this paper I propose a parametric framework for modelling and forecasting\nmacroeconomic risk based on stochastic volatility models with Skew-Normal and\nSkew-t shocks featuring time varying skewness. Exploiting a mixture stochastic\nrepresentation of the Skew-Normal and Skew-t random variables, in the paper I\ndevelop efficient posterior simulation samplers for Bayesian estimation of both\nunivariate and VAR models of this type. In an application, I use the models to\npredict downside risk to GDP growth in the US and I show that these models\nrepresent a competitive alternative to semi-parametric approaches such as\nquantile regression. Finally, estimating a medium scale VAR on US data I show\nthat time varying skewness is a relevant feature of macroeconomic and financial\nshocks."}, "http://arxiv.org/abs/2309.07476": {"title": "Causal inference in network experiments: regression-based analysis and design-based properties", "link": "http://arxiv.org/abs/2309.07476", "description": "Investigating interference or spillover effects among units is a central task\nin many social science problems. Network experiments are powerful tools for\nthis task, which avoids endogeneity by randomly assigning treatments to units\nover networks. However, it is non-trivial to analyze network experiments\nproperly without imposing strong modeling assumptions. Previously, many\nresearchers have proposed sophisticated point estimators and standard errors\nfor causal effects under network experiments. We further show that\nregression-based point estimators and standard errors can have strong\ntheoretical guarantees if the regression functions and robust standard errors\nare carefully specified to accommodate the interference patterns under network\nexperiments. We first recall a well-known result that the Hajek estimator is\nnumerically identical to the coefficient from the weighted-least-squares fit\nbased on the inverse probability of the exposure mapping. Moreover, we\ndemonstrate that the regression-based approach offers three notable advantages:\nits ease of implementation, the ability to derive standard errors through the\nsame weighted-least-squares fit, and the capacity to integrate covariates into\nthe analysis, thereby enhancing estimation efficiency. 
Furthermore, we analyze\nthe asymptotic bias of the regression-based network-robust standard errors.\nRecognizing that the covariance estimator can be anti-conservative, we propose\nan adjusted covariance estimator to improve the empirical coverage rates.\nAlthough we focus on regression-based point estimators and standard errors, our\ntheory holds under the design-based framework, which assumes that the\nrandomness comes solely from the design of network experiments and allows for\narbitrary misspecification of the regression models."}, "http://arxiv.org/abs/2311.12267": {"title": "Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity", "link": "http://arxiv.org/abs/2311.12267", "description": "This paper studies causal representation learning, the task of recovering\nhigh-level latent variables and their causal relationships from low-level data\nthat we observe, assuming access to observations generated from multiple\nenvironments. While existing works are able to prove full identifiability of\nthe underlying data generating process, they typically assume access to\nsingle-node, hard interventions which is rather unrealistic in practice. The\nmain contribution of this paper is to characterize a notion of identifiability\nwhich is provably the best one can achieve when hard interventions are not\navailable. First, for linear causal models, we provide identifiability\nguarantee for data observed from general environments without assuming any\nsimilarities between them. While the causal graph is shown to be fully\nrecovered, the latent variables are only identified up to an effect-domination\nambiguity (EDA). We then propose an algorithm, LiNGCReL, which is guaranteed to\nrecover the ground-truth model up to EDA, and we demonstrate its effectiveness\nvia numerical experiments. Moving on to general non-parametric causal models,\nwe prove the same identifiability guarantee assuming access to groups of soft\ninterventions. Finally, we provide counterparts of our identifiability results,\nindicating that EDA is basically inevitable in our setting."}, "http://arxiv.org/abs/2311.12671": {"title": "Predictive Density Combination Using a Tree-Based Synthesis Function", "link": "http://arxiv.org/abs/2311.12671", "description": "Bayesian predictive synthesis (BPS) provides a method for combining multiple\npredictive distributions based on agent/expert opinion analysis theory and\nencompasses a range of existing density forecast pooling methods. The key\ningredient in BPS is a ``synthesis'' function. This is typically specified\nparametrically as a dynamic linear regression. In this paper, we develop a\nnonparametric treatment of the synthesis function using regression trees. We\nshow the advantages of our tree-based approach in two macroeconomic forecasting\napplications. The first uses density forecasts for GDP growth from the euro\narea's Survey of Professional Forecasters. The second combines density\nforecasts of US inflation produced by many regression models involving\ndifferent predictors. 
Both applications demonstrate the benefits -- in terms of\nimproved forecast accuracy and interpretability -- of modeling the synthesis\nfunction nonparametrically."}, "http://arxiv.org/abs/1810.00283": {"title": "Proxy Controls and Panel Data", "link": "http://arxiv.org/abs/1810.00283", "description": "We provide new results for nonparametric identification, estimation, and\ninference of causal effects using `proxy controls': observables that are noisy\nbut informative proxies for unobserved confounding factors. Our analysis\napplies to cross-sectional settings but is particularly well-suited to panel\nmodels. Our identification results motivate a simple and `well-posed'\nnonparametric estimator. We derive convergence rates for the estimator and\nconstruct uniform confidence bands with asymptotically correct size. In panel\nsettings, our methods provide a novel approach to the difficult problem of\nidentification with non-separable, general heterogeneity and fixed $T$. In\npanels, observations from different periods serve as proxies for unobserved\nheterogeneity and our key identifying assumptions follow from restrictions on\nthe serial dependence structure. We apply our methods to two empirical\nsettings. We estimate consumer demand counterfactuals using panel data and we\nestimate causal effects of grade retention on cognitive performance."}, "http://arxiv.org/abs/2102.08809": {"title": "Testing for Nonlinear Cointegration under Heteroskedasticity", "link": "http://arxiv.org/abs/2102.08809", "description": "This article discusses tests for nonlinear cointegration in the presence of\nvariance breaks. We build on cointegration test approaches under\nheteroskedasticity (Cavaliere and Taylor, 2006, Journal of Time Series\nAnalysis) and for nonlinearity (Choi and Saikkonen, 2010, Econometric Theory)\nto propose a bootstrap test and prove its consistency. A Monte Carlo study\nshows the approach to have good finite sample properties. We provide an\nempirical application to the environmental Kuznets curves (EKC), finding that\nthe cointegration test provides little evidence for the EKC hypothesis.\nAdditionally, we examine the nonlinear relation between the US money and the\ninterest rate, finding that our test does not reject the null of a smooth\ntransition cointegrating relation."}, "http://arxiv.org/abs/2311.12878": {"title": "Adaptive Bayesian Learning with Action and State-Dependent Signal Variance", "link": "http://arxiv.org/abs/2311.12878", "description": "This manuscript presents an advanced framework for Bayesian learning by\nincorporating action and state-dependent signal variances into decision-making\nmodels. This framework is pivotal in understanding complex data-feedback loops\nand decision-making processes in various economic systems. Through a series of\nexamples, we demonstrate the versatility of this approach in different\ncontexts, ranging from simple Bayesian updating in stable environments to\ncomplex models involving social learning and state-dependent uncertainties. The\npaper uniquely contributes to the understanding of the nuanced interplay\nbetween data, actions, outcomes, and the inherent uncertainty in economic\nmodels."}, "http://arxiv.org/abs/2311.13327": {"title": "Regressions under Adverse Conditions", "link": "http://arxiv.org/abs/2311.13327", "description": "We introduce a new regression method that relates the mean of an outcome\nvariable to covariates, given the \"adverse condition\" that a distress variable\nfalls in its tail. 
This allows us to tailor classical mean regressions to adverse\neconomic scenarios, which receive increasing interest in managing macroeconomic\nand financial risks, among many others. In the terminology of the systemic risk\nliterature, our method can be interpreted as a regression for the Marginal\nExpected Shortfall. We propose a two-step procedure to estimate the new models,\nshow consistency and asymptotic normality of the estimator, and propose\nfeasible inference under weak conditions allowing for cross-sectional and time\nseries applications. The accuracy of the asymptotic approximations of the\ntwo-step estimator is verified in simulations. Two empirical applications show\nthat our regressions under adverse conditions are valuable in such diverse\nfields as the study of the relation between systemic risk and asset price\nbubbles, and dissecting macroeconomic growth vulnerabilities into individual\ncomponents."}, "http://arxiv.org/abs/2311.13575": {"title": "Large-Sample Properties of the Synthetic Control Method under Selection on Unobservables", "link": "http://arxiv.org/abs/2311.13575", "description": "We analyze the properties of the synthetic control (SC) method in settings\nwith a large number of units. We assume that the selection into treatment is\nbased on unobserved permanent heterogeneity and pretreatment information, thus\nallowing for both strictly and sequentially exogenous assignment processes.\nExploiting duality, we interpret the solution of the SC optimization problem as\nan estimator for the underlying treatment probabilities. We use this to derive\nthe asymptotic representation for the SC method and characterize sufficient\nconditions for its asymptotic normality. We show that the critical property\nthat determines the behavior of the SC method is the ability of input features\nto approximate the unobserved heterogeneity. Our results imply that the SC\nmethod delivers asymptotically normal estimators for a large class of linear\npanel data models as long as the number of pretreatment periods is large,\nmaking it a natural alternative to conventional methods built on the\nDifference-in-Differences."}, "http://arxiv.org/abs/2108.00723": {"title": "Partial Identification and Inference for Conditional Distributions of Treatment Effects", "link": "http://arxiv.org/abs/2108.00723", "description": "This paper considers identification and inference for the distribution of\ntreatment effects conditional on observable covariates. Since the conditional\ndistribution of treatment effects is not point identified without strong\nassumptions, we obtain bounds on the conditional distribution of treatment\neffects by using the Makarov bounds. We also consider the case where the\ntreatment is endogenous and propose two stochastic dominance assumptions to\ntighten the bounds. We develop a nonparametric framework to estimate the bounds\nand establish the asymptotic theory that is uniformly valid over the support of\ntreatment effects. An empirical example illustrates the usefulness of the\nmethods."}, "http://arxiv.org/abs/2311.13969": {"title": "Was Javert right to be suspicious? Unpacking treatment effect heterogeneity of alternative sentences on time-to-recidivism in Brazil", "link": "http://arxiv.org/abs/2311.13969", "description": "This paper presents new econometric tools to unpack the treatment effect\nheterogeneity of punishing misdemeanor offenses on time-to-recidivism. 
We show\nhow one can identify, estimate, and make inferences on the distributional,\nquantile, and average marginal treatment effects in setups where the treatment\nselection is endogenous and the outcome of interest, usually a duration\nvariable, is potentially right-censored. We explore our proposed econometric\nmethodology to evaluate the effect of fines and community service sentences as\na form of punishment on time-to-recidivism in the State of S\\~ao Paulo, Brazil,\nbetween 2010 and 2019, leveraging the as-if random assignment of judges to\ncases. Our results highlight substantial treatment effect heterogeneity that\nother tools are not meant to capture. For instance, we find that people whom\nmost judges would punish take longer to recidivate as a consequence of the\npunishment, while people who would be punished only by strict judges recidivate\nat an earlier date than if they were not punished. This result suggests that\ndesigning sentencing guidelines that encourage strict judges to become more\nlenient could reduce recidivism."}, "http://arxiv.org/abs/2311.14032": {"title": "Counterfactual Sensitivity in Equilibrium Models", "link": "http://arxiv.org/abs/2311.14032", "description": "Counterfactuals in equilibrium models are functions of the current state of\nthe world, the exogenous change variables and the model parameters. Current\npractice treats the current state of the world, the observed data, as perfectly\nmeasured, but there is good reason to believe that they are measured with\nerror. The main aim of this paper is to provide tools for quantifying\nuncertainty about counterfactuals, when the current state of the world is\nmeasured with error. I propose two methods, a Bayesian approach and an\nadversarial approach. Both methods are practical and theoretically justified. I\napply the two methods to the application in Adao et al. (2017) and find\nnon-trivial uncertainty about counterfactuals."}, "http://arxiv.org/abs/2311.14204": {"title": "Reproducible Aggregation of Sample-Split Statistics", "link": "http://arxiv.org/abs/2311.14204", "description": "Statistical inference is often simplified by sample-splitting. This\nsimplification comes at the cost of the introduction of randomness that is not\nnative to the data. We propose a simple procedure for sequentially aggregating\nstatistics constructed with multiple splits of the same sample. The user\nspecifies a bound and a nominal error rate. If the procedure is implemented\ntwice on the same data, the nominal error rate approximates the chance that the\nresults differ by more than the bound. We provide a non-asymptotic analysis of\nthe accuracy of the nominal error rate and illustrate the application of the\nprocedure to several widely applied statistical methods."}, "http://arxiv.org/abs/2205.03288": {"title": "Leverage, Influence, and the Jackknife in Clustered Regression Models: Reliable Inference Using summclust", "link": "http://arxiv.org/abs/2205.03288", "description": "We introduce a new Stata package called summclust that summarizes the cluster\nstructure of the dataset for linear regression models with clustered\ndisturbances. The key unit of observation for such a model is the cluster. We\ntherefore propose cluster-level measures of leverage, partial leverage, and\ninfluence and show how to compute them quickly in most cases. The measures of\nleverage and partial leverage can be used as diagnostic tools to identify\ndatasets and regression designs in which cluster-robust inference is likely to\nbe challenging. 
The measures of influence can provide valuable information\nabout how the results depend on the data in the various clusters. We also show\nhow to calculate two jackknife variance matrix estimators efficiently as a\nbyproduct of our other computations. These estimators, which are already\navailable in Stata, are generally more conservative than conventional variance\nmatrix estimators. The summclust package computes all the quantities that we\ndiscuss."}, "http://arxiv.org/abs/2205.10310": {"title": "Treatment Effects in Bunching Designs: The Impact of Mandatory Overtime Pay on Hours", "link": "http://arxiv.org/abs/2205.10310", "description": "The 1938 Fair Labor Standards Act mandates overtime premium pay for most U.S.\nworkers, but it has proven difficult to assess the policy's impact on the labor\nmarket because the rule applies nationally and has varied little over time. I\nuse the extent to which firms bunch workers at the overtime threshold of 40\nhours in a week to estimate the rule's effect on hours, drawing on data from\nindividual workers' weekly paychecks. To do so I generalize a popular\nidentification strategy that exploits bunching at kink points in a\ndecision-maker's choice set. Making only nonparametric assumptions about\npreferences and heterogeneity, I show that the average causal response among\nbunchers to the policy switch at the kink is partially identified. The bounds\nindicate a relatively small elasticity of demand for weekly hours, suggesting\nthat the overtime mandate has a discernible but limited impact on hours and\nemployment."}, "http://arxiv.org/abs/2305.19484": {"title": "A Simple Method for Predicting Covariance Matrices of Financial Returns", "link": "http://arxiv.org/abs/2305.19484", "description": "We consider the well-studied problem of predicting the time-varying\ncovariance matrix of a vector of financial returns. Popular methods range from\nsimple predictors like rolling window or exponentially weighted moving average\n(EWMA) to more sophisticated predictors such as generalized autoregressive\nconditional heteroscedastic (GARCH) type methods. Building on a specific\ncovariance estimator suggested by Engle in 2002, we propose a relatively simple\nextension that requires little or no tuning or fitting, is interpretable, and\nproduces results at least as good as MGARCH, a popular extension of GARCH that\nhandles multiple assets. To evaluate predictors we introduce a novel approach,\nevaluating the regret of the log-likelihood over a time period such as a\nquarter. This metric allows us to see not only how well a covariance predictor\ndoes overall, but also how quickly it reacts to changes in market conditions.\nOur simple predictor outperforms MGARCH in terms of regret. We also test\ncovariance predictors on downstream applications such as portfolio optimization\nmethods that depend on the covariance matrix. For these applications our simple\ncovariance predictor and MGARCH perform similarly."}, "http://arxiv.org/abs/2311.14698": {"title": "Business Policy Experiments using Fractional Factorial Designs: Consumer Retention on DoorDash", "link": "http://arxiv.org/abs/2311.14698", "description": "This paper investigates an approach to both speed up business decision-making\nand lower the cost of learning through experimentation by factorizing business\npolicies and employing fractional factorial experimental designs for their\nevaluation. 
We illustrate how this method integrates with advances in the\nestimation of heterogeneous treatment effects, elaborating on its advantages\nand foundational assumptions. We empirically demonstrate the implementation and\nbenefits of our approach and assess its validity in evaluating consumer\npromotion policies at DoorDash, which is one of the largest delivery platforms\nin the US. Our approach discovers a policy with 5% incremental profit at 67%\nlower implementation cost."}, "http://arxiv.org/abs/2311.14813": {"title": "A Review of Cross-Sectional Matrix Exponential Spatial Models", "link": "http://arxiv.org/abs/2311.14813", "description": "The matrix exponential spatial models exhibit similarities to the\nconventional spatial autoregressive model in spatial econometrics but offer\nanalytical, computational, and interpretive advantages. This paper provides a\ncomprehensive review of the literature on the estimation, inference, and model\nselection approaches for the cross-sectional matrix exponential spatial models.\nWe discuss summary measures for the marginal effects of regressors and detail\nthe matrix-vector product method for efficient estimation. Our aim is not only\nto summarize the main findings from the spatial econometric literature but also\nto make them more accessible to applied researchers. Additionally, we\ncontribute to the literature by introducing some new results. We propose an\nM-estimation approach for models with heteroskedastic error terms and\ndemonstrate that the resulting M-estimator is consistent and has an asymptotic\nnormal distribution. We also consider some new results for model selection\nexercises. In a Monte Carlo study, we examine the finite sample properties of\nvarious estimators from the literature alongside the M-estimator."}, "http://arxiv.org/abs/2311.14892": {"title": "An Identification and Dimensionality Robust Test for Instrumental Variables Models", "link": "http://arxiv.org/abs/2311.14892", "description": "I propose a new identification-robust test for the structural parameter in a\nheteroskedastic linear instrumental variables model. The proposed test\nstatistic is similar in spirit to a jackknife version of the K-statistic and\nthe resulting test has exact asymptotic size so long as an auxiliary parameter\ncan be consistently estimated. This is possible under approximate sparsity even\nwhen the number of instruments is much larger than the sample size. As the\nnumber of instruments is allowed, but not required, to be large, the limiting\nbehavior of the test statistic is difficult to examine via existing central\nlimit theorems. Instead, I derive the asymptotic chi-squared distribution of\nthe test statistic using a direct Gaussian approximation technique. To improve\npower against certain alternatives, I propose a simple combination with the\nsup-score statistic of Belloni et al. (2012) based on a thresholding rule. I\ndemonstrate favorable size control and power properties in a simulation study\nand apply the new methods to revisit the effect of social spillovers in movie\nconsumption."}, "http://arxiv.org/abs/2311.15458": {"title": "Causal Models for Longitudinal and Panel Data: A Survey", "link": "http://arxiv.org/abs/2311.15458", "description": "This survey discusses the recent causal panel data literature. This recent\nliterature has focused on credibly estimating causal effects of binary\ninterventions in settings with longitudinal data, with an emphasis on practical\nadvice for empirical researchers. 
It pays particular attention to heterogeneity\nin the causal effects, often in situations where few units are treated. The\nliterature has extended earlier work on difference-in-differences or\ntwo-way-fixed-effect estimators and more generally incorporated factor models\nor interactive fixed effects. It has also developed novel methods using\nsynthetic control approaches."}, "http://arxiv.org/abs/2311.15829": {"title": "(Frisch-Waugh-Lovell)': On the Estimation of Regression Models by Row", "link": "http://arxiv.org/abs/2311.15829", "description": "We demonstrate that regression models can be estimated by working\nindependently in a row-wise fashion. We document a simple procedure which\nallows for a wide class of econometric estimators to be implemented\ncumulatively, where, in the limit, estimators can be produced without ever\nstoring more than a single line of data in a computer's memory. This result is\nuseful in understanding the mechanics of many common regression models. These\nprocedures can be used to speed up the computation of estimates computed via\nOLS, IV, Ridge regression, LASSO, Elastic Net, and Non-linear models including\nprobit and logit, with all common modes of inference. This has implications for\nestimation and inference with `big data', where memory constraints may imply\nthat working with all data at once is particularly costly. We additionally show\nthat even with moderately sized datasets, this method can reduce computation\ntime compared with traditional estimation routines."}, "http://arxiv.org/abs/2311.15871": {"title": "On Quantile Treatment Effects, Rank Similarity, and Variation of Instrumental Variables", "link": "http://arxiv.org/abs/2311.15871", "description": "This paper investigates how a certain relationship between observed and\ncounterfactual distributions serves as an identifying condition for treatment\neffects when the treatment is endogenous, and shows that this condition holds\nin a range of nonparametric models for treatment effects. To this end, we first\nprovide a novel characterization of the prevalent assumption restricting\ntreatment heterogeneity in the literature, namely rank similarity. Our\ncharacterization demonstrates the stringency of this assumption and allows us\nto relax it in an economically meaningful way, resulting in our identifying\ncondition. It also justifies the quest for richer exogenous variations in the\ndata (e.g., multi-valued or multiple instrumental variables) in exchange for\nweaker identifying conditions. The primary goal of this investigation is to\nprovide empirical researchers with tools that are robust and easy to implement\nbut still yield tight policy evaluations."}, "http://arxiv.org/abs/2311.15878": {"title": "Individualized Treatment Allocations with Distributional Welfare", "link": "http://arxiv.org/abs/2311.15878", "description": "In this paper, we explore optimal treatment allocation policies that target\ndistributional welfare. Most literature on treatment choice has considered\nutilitarian welfare based on the conditional average treatment effect (ATE).\nWhile average welfare is intuitive, it may yield undesirable allocations\nespecially when individuals are heterogeneous (e.g., with outliers) - the very\nreason individualized treatments were introduced in the first place. This\nobservation motivates us to propose an optimal policy that allocates the\ntreatment based on the conditional \\emph{quantile of individual treatment\neffects} (QoTE). 
Depending on the choice of the quantile probability, this\ncriterion can accommodate a policymaker who is either prudent or negligent. The\nchallenge of identifying the QoTE lies in its requirement for knowledge of the\njoint distribution of the counterfactual outcomes, which is generally hard to\nrecover even with experimental data. Therefore, we introduce minimax optimal\npolicies that are robust to model uncertainty. We then propose a range of\nidentifying assumptions under which we can point or partially identify the\nQoTE. We establish the asymptotic bound on the regret of implementing the\nproposed policies. We consider both stochastic and deterministic rules. In\nsimulations and two empirical applications, we compare optimal decisions based\non the QoTE with decisions based on other criteria."}, "http://arxiv.org/abs/2311.15932": {"title": "Valid Wald Inference with Many Weak Instruments", "link": "http://arxiv.org/abs/2311.15932", "description": "This paper proposes three novel test procedures that yield valid inference in\nan environment with many weak instrumental variables (MWIV). It is observed\nthat the t statistic of the jackknife instrumental variable estimator (JIVE)\nhas an asymptotic distribution that is identical to the two-stage-least squares\n(TSLS) t statistic in the just-identified environment. Consequently, test\nprocedures that were valid for TSLS t are also valid for the JIVE t. Two such\nprocedures, i.e., VtF and conditional Wald, are adapted directly. By exploiting\na feature of MWIV environments, a third, more powerful, one-sided VtF-based\ntest procedure can be obtained."}, "http://arxiv.org/abs/2311.15952": {"title": "Robust Conditional Wald Inference for Over-Identified IV", "link": "http://arxiv.org/abs/2311.15952", "description": "For the over-identified linear instrumental variables model, researchers\ncommonly report the 2SLS estimate along with the robust standard error and seek\nto conduct inference with these quantities. If errors are homoskedastic, one\ncan control the degree of inferential distortion using the first-stage F\ncritical values from Stock and Yogo (2005), or use the robust-to-weak\ninstruments Conditional Wald critical values of Moreira (2003). If errors are\nnon-homoskedastic, these methods do not apply. We derive the generalization of\nConditional Wald critical values that is robust to non-homoskedastic errors\n(e.g., heteroskedasticity or clustered variance structures), which can also be\napplied to nonlinear weakly-identified models (e.g. weakly-identified GMM)."}, "http://arxiv.org/abs/1801.00332": {"title": "Confidence set for group membership", "link": "http://arxiv.org/abs/1801.00332", "description": "Our confidence set quantifies the statistical uncertainty from data-driven\ngroup assignments in grouped panel models. It covers the true group memberships\njointly for all units with pre-specified probability and is constructed by\ninverting many simultaneous unit-specific one-sided tests for group membership.\nWe justify our approach under $N, T \\to \\infty$ asymptotics using tools from\nhigh-dimensional statistics, some of which we extend in this paper. 
We provide\nMonte Carlo evidence that the confidence set has adequate coverage in finite\nsamples. An empirical application illustrates the use of our confidence set."}, "http://arxiv.org/abs/2004.05027": {"title": "Direct and spillover effects of a new tramway line on the commercial vitality of peripheral streets", "link": "http://arxiv.org/abs/2004.05027", "description": "In cities, the creation of public transport infrastructure such as light\nrails can cause changes on a very detailed spatial scale, with different\nstories unfolding next to each other within the same urban neighborhood. We study\nthe direct effect of a light rail line built in Florence (Italy) on the retail\ndensity of the street where it was built and its spillover effect on other\nstreets in the treated street's neighborhood. To this aim, we investigate the\nuse of the Synthetic Control Group (SCG) methods in panel comparative case\nstudies where interference between the treated and the untreated units is\nplausible, an issue still little researched in the SCG methodological\nliterature. We frame our discussion in the potential outcomes approach. Under a\npartial interference assumption, we formally define relevant direct and\nspillover causal effects. We also consider the ``unrealized'' spillover effect\non the treated street in the hypothetical scenario that another street in the\ntreated unit's neighborhood had been assigned to the intervention."}, "http://arxiv.org/abs/2004.09458": {"title": "Noise-Induced Randomization in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2004.09458", "description": "Regression discontinuity designs assess causal effects in settings where\ntreatment is determined by whether an observed running variable crosses a\npre-specified threshold. Here we propose a new approach to identification,\nestimation, and inference in regression discontinuity designs that uses\nknowledge about exogenous noise (e.g., measurement error) in the running\nvariable. In our strategy, we weight treated and control units to balance a\nlatent variable of which the running variable is a noisy measure. Our approach\nis explicitly randomization-based and complements standard formal analyses that\nappeal to continuity arguments while ignoring the stochastic nature of the\nassignment mechanism."}, "http://arxiv.org/abs/2111.12258": {"title": "On Recoding Ordered Treatments as Binary Indicators", "link": "http://arxiv.org/abs/2111.12258", "description": "Researchers using instrumental variables to investigate ordered treatments\noften recode treatment into an indicator for any exposure. We investigate this\nestimand under the assumption that the instruments shift compliers from no\ntreatment to some but not from some treatment to more. We show that when there\nare extensive margin compliers only (EMCO) this estimand captures a weighted\naverage of treatment effects that can be partially unbundled into each complier\ngroup's potential outcome means. We also establish an equivalence between EMCO\nand a two-factor selection model and apply our results to study treatment\nheterogeneity in the Oregon Health Insurance Experiment."}, "http://arxiv.org/abs/2311.16260": {"title": "Using Multiple Outcomes to Improve the Synthetic Control Method", "link": "http://arxiv.org/abs/2311.16260", "description": "When there are multiple outcome series of interest, Synthetic Control\nanalyses typically proceed by estimating separate weights for each outcome. 
In\nthis paper, we instead propose estimating a common set of weights across\noutcomes, by balancing either a vector of all outcomes or an index or average\nof them. Under a low-rank factor model, we show that these approaches lead to\nlower bias bounds than separate weights, and that averaging leads to further\ngains when the number of outcomes grows. We illustrate this via simulation and\nin a re-analysis of the impact of the Flint water crisis on educational\noutcomes."}, "http://arxiv.org/abs/2311.16333": {"title": "From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks", "link": "http://arxiv.org/abs/2311.16333", "description": "We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density\nforecasting through a novel neural network architecture with dedicated mean and\nvariance hemispheres. Our architecture features several key ingredients making\nMLE work in this context. First, the hemispheres share a common core at the\nentrance of the network which accommodates for various forms of time variation\nin the error variance. Second, we introduce a volatility emphasis constraint\nthat breaks mean/variance indeterminacy in this class of overparametrized\nnonlinear models. Third, we conduct a blocked out-of-bag reality check to curb\noverfitting in both conditional moments. Fourth, the algorithm utilizes\nstandard deep learning software and thus handles large data sets - both\ncomputationally and statistically. Ergo, our Hemisphere Neural Network (HNN)\nprovides proactive volatility forecasts based on leading indicators when it\ncan, and reactive volatility based on the magnitude of previous prediction\nerrors when it must. We evaluate point and density forecasts with an extensive\nout-of-sample experiment and benchmark against a suite of models ranging from\nclassics to more modern machine learning-based offerings. In all cases, HNN\nfares well by consistently providing accurate mean/variance forecasts for all\ntargets and horizons. Studying the resulting volatility paths reveals its\nversatility, while probabilistic forecasting evaluation metrics showcase its\nenviable reliability. Finally, we also demonstrate how this machinery can be\nmerged with other structured deep learning models by revisiting Goulet Coulombe\n(2022)'s Neural Phillips Curve."}, "http://arxiv.org/abs/2311.16440": {"title": "Inference for Low-rank Models without Estimating the Rank", "link": "http://arxiv.org/abs/2311.16440", "description": "This paper studies the inference about linear functionals of high-dimensional\nlow-rank matrices. While most existing inference methods would require\nconsistent estimation of the true rank, our procedure is robust to rank\nmisspecification, making it a promising approach in applications where rank\nestimation can be unreliable. We estimate the low-rank spaces using\npre-specified weighting matrices, known as diversified projections. A novel\nstatistical insight is that, unlike the usual statistical wisdom that\noverfitting mainly introduces additional variances, the over-estimated low-rank\nspace also gives rise to a non-negligible bias due to an implicit ridge-type\nregularization. We develop a new inference procedure and show that the central\nlimit theorem holds as long as the pre-specified rank is no smaller than the\ntrue rank. Empirically, we apply our method to the U.S. 
federal grants\nallocation data and test the existence of pork-barrel politics."}, "http://arxiv.org/abs/2311.16486": {"title": "On the adaptation of causal forests to manifold data", "link": "http://arxiv.org/abs/2311.16486", "description": "Researchers often hold the belief that random forests are \"the cure to the\nworld's ills\" (Bickel, 2010). But how exactly do they achieve this? Focused on\nthe recently introduced causal forests (Athey and Imbens, 2016; Wager and\nAthey, 2018), this manuscript aims to contribute to an ongoing research trend\ntowards answering this question, proving that causal forests can adapt to the\nunknown covariate manifold structure. In particular, our analysis shows that a\ncausal forest estimator can achieve the optimal rate of convergence for\nestimating the conditional average treatment effect, with the covariate\ndimension automatically replaced by the manifold dimension. These findings\nalign with analogous observations in the realm of deep learning and resonate\nwith the insights presented in Peter Bickel's 2004 Rietz lecture."}, "http://arxiv.org/abs/2311.17021": {"title": "Optimal Categorical Instrumental Variables", "link": "http://arxiv.org/abs/2311.17021", "description": "This paper discusses estimation with a categorical instrumental variable in\nsettings with potentially few observations per category. The proposed\ncategorical instrumental variable estimator (CIV) leverages a regularization\nassumption that implies existence of a latent categorical variable with fixed\nfinite support achieving the same first stage fit as the observed instrument.\nIn asymptotic regimes that allow the number of observations per category to\ngrow at arbitrary small polynomial rate with the sample size, I show that when\nthe cardinality of the support of the optimal instrument is known, CIV is\nroot-n asymptotically normal, achieves the same asymptotic variance as the\noracle IV estimator that presumes knowledge of the optimal instrument, and is\nsemiparametrically efficient under homoskedasticity. Under-specifying the\nnumber of support points reduces efficiency but maintains asymptotic normality."}, "http://arxiv.org/abs/1912.10488": {"title": "Efficient and Convergent Sequential Pseudo-Likelihood Estimation of Dynamic Discrete Games", "link": "http://arxiv.org/abs/1912.10488", "description": "We propose a new sequential Efficient Pseudo-Likelihood (k-EPL) estimator for\ndynamic discrete choice games of incomplete information. k-EPL considers the\njoint behavior of multiple players simultaneously, as opposed to individual\nresponses to other agents' equilibrium play. This, in addition to reframing the\nproblem from conditional choice probability (CCP) space to value function\nspace, yields a computationally tractable, stable, and efficient estimator. We\nshow that each iteration in the k-EPL sequence is consistent and asymptotically\nefficient, so the first-order asymptotic properties do not vary across\niterations. Furthermore, we show the sequence achieves higher-order equivalence\nto the finite-sample maximum likelihood estimator with iteration and that the\nsequence of estimators converges almost surely to the maximum likelihood\nestimator at a nearly-superlinear rate when the data are generated by any\nregular Markov perfect equilibrium, including equilibria that lead to\ninconsistency of other sequential estimators. 
When utility is linear in\nparameters, k-EPL iterations are computationally simple, only requiring that\nthe researcher solve linear systems of equations to generate pseudo-regressors\nwhich are used in a static logit/probit regression. Monte Carlo simulations\ndemonstrate the theoretical results and show k-EPL's good performance in finite\nsamples in both small- and large-scale games, even when the game admits\nspurious equilibria in addition to one that generated the data. We apply the\nestimator to study the role of competition in the U.S. wholesale club industry."}, "http://arxiv.org/abs/2202.07150": {"title": "Asymptotics of Cointegration Tests for High-Dimensional VAR($k$)", "link": "http://arxiv.org/abs/2202.07150", "description": "The paper studies nonstationary high-dimensional vector autoregressions of\norder $k$, VAR($k$). Additional deterministic terms such as trend or\nseasonality are allowed. The number of time periods, $T$, and the number of\ncoordinates, $N$, are assumed to be large and of the same order. Under this\nregime the first-order asymptotics of the Johansen likelihood ratio (LR),\nPillai-Bartlett, and Hotelling-Lawley tests for cointegration are derived: the\ntest statistics converge to nonrandom integrals. For more refined analysis, the\npaper proposes and analyzes a modification of the Johansen test. The new test\nfor the absence of cointegration converges to the partial sum of the Airy$_1$\npoint process. Supporting Monte Carlo simulations indicate that the same\nbehavior persists universally in many situations beyond those considered in our\ntheorems.\n\nThe paper presents empirical implementations of the approach for the analysis\nof S$\\&$P$100$ stocks and of cryptocurrencies. The latter example has a strong\npresence of multiple cointegrating relationships, while the results for the\nformer are consistent with the null of no cointegration."}, "http://arxiv.org/abs/2311.17575": {"title": "Identifying Causal Effects of Nonbinary, Ordered Treatments using Multiple Instrumental Variables", "link": "http://arxiv.org/abs/2311.17575", "description": "This paper addresses the challenge of identifying causal effects of\nnonbinary, ordered treatments with multiple binary instruments. Next to\npresenting novel insights into the widely-applied two-stage least squares\nestimand, I show that a weighted average of local average treatment effects for\ncombined complier populations is identified under the limited monotonicity\nassumption. This novel causal parameter has an intuitive interpretation,\noffering an appealing alternative to two-stage least squares. I employ recent\nadvances in causal machine learning for estimation. I further demonstrate how\ncausal forests can be used to detect local violations of the underlying limited\nmonotonicity assumption. The methodology is applied to study the impact of\ncommunity nurseries on child health outcomes."}, "http://arxiv.org/abs/2311.17858": {"title": "On the Limits of Regression Adjustment", "link": "http://arxiv.org/abs/2311.17858", "description": "Regression adjustment, sometimes known as Controlled-experiment Using\nPre-Experiment Data (CUPED), is an important technique in internet\nexperimentation. It decreases the variance of effect size estimates, often\ncutting confidence interval widths in half or more while never making them\nworse. It does so by carefully regressing the goal metric against\npre-experiment features to reduce the variance. 
The tremendous gains of\nregression adjustment beg the question: How much better can we do by\nengineering better features from pre-experiment data, for example by using\nmachine learning techniques or synthetic controls? Could we even reduce the\nvariance in our effect sizes arbitrarily close to zero with the right\npredictors? Unfortunately, our answer is negative. A simple form of regression\nadjustment, which uses just the pre-experiment values of the goal metric,\ncaptures most of the benefit. Specifically, under a mild assumption that\nobservations closer in time are easier to predict than ones further away in\ntime, we upper bound the potential gains of more sophisticated feature\nengineering, with respect to the gains of this simple form of regression\nadjustment. The maximum reduction in variance is $50\\%$ in Theorem 1, or\nequivalently, the confidence interval width can be reduced by at most an\nadditional $29\\%$."}, "http://arxiv.org/abs/2110.04442": {"title": "A Primer on Deep Learning for Causal Inference", "link": "http://arxiv.org/abs/2110.04442", "description": "This review systematizes the emerging literature for causal inference using\ndeep neural networks under the potential outcomes framework. It provides an\nintuitive introduction on how deep learning can be used to estimate/predict\nheterogeneous treatment effects and extend causal inference to settings where\nconfounding is non-linear, time varying, or encoded in text, networks, and\nimages. To maximize accessibility, we also introduce prerequisite concepts from\ncausal inference and deep learning. The survey differs from other treatments of\ndeep learning and causal inference in its sharp focus on observational causal\nestimation, its extended exposition of key algorithms, and its detailed\ntutorials for implementing, training, and selecting among deep estimators in\nTensorflow 2 available at github.com/kochbj/Deep-Learning-for-Causal-Inference."}, "http://arxiv.org/abs/2111.05243": {"title": "Bounding Treatment Effects by Pooling Limited Information across Observations", "link": "http://arxiv.org/abs/2111.05243", "description": "We provide novel bounds on average treatment effects (on the treated) that\nare valid under an unconfoundedness assumption. Our bounds are designed to be\nrobust in challenging situations, for example, when the conditioning variables\ntake on a large number of different values in the observed sample, or when the\noverlap condition is violated. This robustness is achieved by only using\nlimited \"pooling\" of information across observations. Namely, the bounds are\nconstructed as sample averages over functions of the observed outcomes such\nthat the contribution of each outcome only depends on the treatment status of a\nlimited number of observations. No information pooling across observations\nleads to so-called \"Manski bounds\", while unlimited information pooling leads\nto standard inverse propensity score weighting. We explore the intermediate\nrange between these two extremes and provide corresponding inference methods.\nWe show in Monte Carlo experiments and through an empirical application that\nour bounds are indeed robust and informative in practice."}, "http://arxiv.org/abs/2303.01231": {"title": "Robust Hicksian Welfare Analysis under Individual Heterogeneity", "link": "http://arxiv.org/abs/2303.01231", "description": "Welfare effects of price changes are often estimated with cross-sections;\nthese do not identify demand with heterogeneous consumers. 
We develop a\ntheoretical method addressing this, utilizing uncompensated demand moments to\nconstruct local approximations for compensated demand moments, robust to\nunobserved preference heterogeneity. Our methodological contribution offers\nrobust approximations for average and distributional welfare estimates,\nextending to price indices, taxable income elasticities, and general\nequilibrium welfare. Our methods apply to any cross-section; we demonstrate\nthem via UK household budget survey data. We uncover an insight: simple\nnon-parametric representative agent models might be less biased than complex\nparametric models accounting for heterogeneity."}, "http://arxiv.org/abs/2305.02185": {"title": "Doubly Robust Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences", "link": "http://arxiv.org/abs/2305.02185", "description": "We consider a panel data analysis to examine the heterogeneity in treatment\neffects with respect to a pre-treatment covariate of interest in the staggered\ndifference-in-differences setting of Callaway and Sant'Anna (2021). Under\nstandard identification conditions, a doubly robust estimand conditional on the\ncovariate identifies the group-time conditional average treatment effect given\nthe covariate. Focusing on the case of a continuous covariate, we propose a\nthree-step estimation procedure based on nonparametric local polynomial\nregressions and parametric estimation methods. Using uniformly valid\ndistributional approximation results for empirical processes and multiplier\nbootstrapping, we develop doubly robust inference methods to construct uniform\nconfidence bands for the group-time conditional average treatment effect\nfunction. The accompanying R package didhetero allows for easy implementation\nof the proposed methods."}, "http://arxiv.org/abs/2311.18136": {"title": "Extrapolating Away from the Cutoff in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2311.18136", "description": "Canonical RD designs yield credible local estimates of the treatment effect\nat the cutoff under mild continuity assumptions, but they fail to identify\ntreatment effects away from the cutoff without additional assumptions. The\nfundamental challenge of identifying treatment effects away from the cutoff is\nthat the counterfactual outcome under the alternative treatment status is never\nobserved. This paper aims to provide a methodological blueprint to identify\ntreatment effects away from the cutoff in various empirical settings by\noffering a non-exhaustive list of assumptions on the counterfactual outcome.\nInstead of assuming the exact evolution of the counterfactual outcome, this\npaper bounds its variation using the data and sensitivity parameters. The\nproposed assumptions are weaker than those introduced previously in the\nliterature, resulting in partially identified treatment effects that are less\nsusceptible to assumption violations. This approach accommodates both single\ncutoff and multi-cutoff designs. The specific choice of the extrapolation\nassumption depends on the institutional background of each empirical\napplication. Additionally, researchers are recommended to conduct sensitivity\nanalysis on the chosen parameter and assess resulting shifts in conclusions.\nThe paper compares the proposed identification results with results using\nprevious methods via an empirical application and simulated data. 
It\ndemonstrates that set identification yields a more credible conclusion about\nthe sign of the treatment effect."}, "http://arxiv.org/abs/2311.18555": {"title": "Identification in Endogenous Sequential Treatment Regimes", "link": "http://arxiv.org/abs/2311.18555", "description": "This paper develops a novel nonparametric identification method for treatment\neffects in settings where individuals self-select into treatment sequences. I\npropose an identification strategy which relies on a dynamic version of\nstandard Instrumental Variables (IV) assumptions and builds on a dynamic\nversion of the Marginal Treatment Effects (MTE) as the fundamental building\nblock for treatment effects. The main contribution of the paper is to relax\nassumptions on the support of the observed variables and on unobservable gains\nof treatment that are present in the dynamic treatment effects literature.\nMonte Carlo simulation studies illustrate the desirable finite-sample\nperformance of a sieve estimator for MTEs and Average Treatment Effects (ATEs)\non a close-to-application simulation study."}, "http://arxiv.org/abs/2311.18759": {"title": "Bootstrap Inference on Partially Linear Binary Choice Model", "link": "http://arxiv.org/abs/2311.18759", "description": "The partially linear binary choice model can be used for estimating\nstructural equations where nonlinearity may appear due to diminishing marginal\nreturns, different life cycle regimes, or hectic physical phenomena. The\ninference procedure for this model based on the analytic asymptotic\napproximation could be unreliable in finite samples if the sample size is not\nsufficiently large. This paper proposes a bootstrap inference approach for the\nmodel. Monte Carlo simulations show that the proposed inference method performs\nwell in finite samples compared to the procedure based on the asymptotic\napproximation."}, "http://arxiv.org/abs/2006.01212": {"title": "New Approaches to Robust Inference on Market (Non-)Efficiency, Volatility Clustering and Nonlinear Dependence", "link": "http://arxiv.org/abs/2006.01212", "description": "Many financial and economic variables, including financial returns, exhibit\nnonlinear dependence, heterogeneity and heavy-tailedness. These properties may\nmake problematic the analysis of (non-)efficiency and volatility clustering in\neconomic and financial markets using traditional approaches that appeal to\nasymptotic normality of sample autocorrelation functions of returns and their\nsquares.\n\nThis paper presents new approaches to deal with the above problems. We\nprovide the results that motivate the use of measures of market\n(non-)efficiency and volatility clustering based on (small) powers of absolute\nreturns and their signed versions.\n\nWe further provide new approaches to robust inference on the measures in the\ncase of general time series, including GARCH-type processes. The approaches are\nbased on robust $t-$statistics tests and new results on their applicability are\npresented. In the approaches, parameter estimates (e.g., estimates of measures\nof nonlinear dependence) are computed for groups of data, and the inference is\nbased on $t-$statistics in the resulting group estimates. This results in valid\nrobust inference under heterogeneity and dependence assumptions satisfied in\nreal-world financial markets. 
Numerical results and empirical applications\nconfirm the advantages and wide applicability of the proposed approaches."}, "http://arxiv.org/abs/2312.00282": {"title": "Stochastic volatility models with skewness selection", "link": "http://arxiv.org/abs/2312.00282", "description": "This paper expands traditional stochastic volatility models by allowing for\ntime-varying skewness without imposing it. While dynamic asymmetry may capture\nthe likely direction of future asset returns, it comes at the risk of leading\nto overparameterization. Our proposed approach mitigates this concern by\nleveraging sparsity-inducing priors to automatically select the skewness\nparameter as being dynamic, static or zero in a data-driven framework. We\nconsider two empirical applications. First, in a bond yield application,\ndynamic skewness captures interest rate cycles of monetary easing and\ntightening that are partially explained by central banks' mandates. In a currency\nmodeling framework, our model indicates no skewness in the carry factor after\naccounting for stochastic volatility, which supports the idea of carry crashes\nbeing the result of volatility surges instead of dynamic skewness."}, "http://arxiv.org/abs/2312.00399": {"title": "GMM-lev estimation and individual heterogeneity: Monte Carlo evidence and empirical applications", "link": "http://arxiv.org/abs/2312.00399", "description": "The generalized method of moments (GMM) estimator applied to equations in\nlevels, GMM-lev, has the advantage of being able to estimate the effect of\nmeasurable time-invariant covariates using all available information. This is\nnot possible with GMM-dif, applied to equations in each period transformed into\nfirst differences, while GMM-sys uses little information, as it adds the\nequation in levels for only one period. The GMM-lev, by implying a\ntwo-component error term containing the individual heterogeneity and the shock,\nexposes the explanatory variables to possible double endogeneity. For example,\nthe estimation of true persistence could suffer from bias if instruments were\ncorrelated with the unit-specific error component. We propose to exploit\nMundlak's (1978) approach together with GMM-lev estimation to capture\ninitial conditions and improve inference. Monte Carlo simulations for different\npanel types and under different double endogeneity assumptions show the\nadvantage of our approach."}, "http://arxiv.org/abs/2312.00590": {"title": "Inference on common trends in functional time series", "link": "http://arxiv.org/abs/2312.00590", "description": "This paper studies statistical inference on unit roots and cointegration for\ntime series in a Hilbert space. We develop statistical inference on the number\nof common stochastic trends that are embedded in the time series, i.e., the\ndimension of the nonstationary subspace. We also consider hypotheses on the\nnonstationary subspace itself. The Hilbert space can be of an arbitrarily large\ndimension, and our methods remain asymptotically valid even when the time\nseries of interest takes values in a subspace of possibly unknown dimension.\nThis has wide applicability in practice; for example, in the case of\ncointegrated vector time series of finite dimension, in a high-dimensional\nfactor model that includes a finite number of nonstationary factors, in the\ncase of cointegrated curve-valued (or function-valued) time series, and\nnonstationary dynamic functional factor models. 
We include two empirical\nillustrations to the term structure of interest rates and labor market indices,\nrespectively."}, "http://arxiv.org/abs/2305.17083": {"title": "A Policy Gradient Method for Confounded POMDPs", "link": "http://arxiv.org/abs/2305.17083", "description": "In this paper, we propose a policy gradient method for confounded partially\nobservable Markov decision processes (POMDPs) with continuous state and\nobservation spaces in the offline setting. We first establish a novel\nidentification result to non-parametrically estimate any history-dependent\npolicy gradient under POMDPs using the offline data. The identification enables\nus to solve a sequence of conditional moment restrictions and adopt the min-max\nlearning procedure with general function approximation for estimating the\npolicy gradient. We then provide a finite-sample non-asymptotic bound for\nestimating the gradient uniformly over a pre-specified policy class in terms of\nthe sample size, length of horizon, concentratability coefficient and the\nmeasure of ill-posedness in solving the conditional moment restrictions.\nLastly, by deploying the proposed gradient estimation in the gradient ascent\nalgorithm, we show the global convergence of the proposed algorithm in finding\nthe history-dependent optimal policy under some technical conditions. To the\nbest of our knowledge, this is the first work studying the policy gradient\nmethod for POMDPs under the offline setting."}, "http://arxiv.org/abs/2312.00955": {"title": "Identification and Inference for Synthetic Controls with Confounding", "link": "http://arxiv.org/abs/2312.00955", "description": "This paper studies inference on treatment effects in panel data settings with\nunobserved confounding. We model outcome variables through a factor model with\nrandom factors and loadings. Such factors and loadings may act as unobserved\nconfounders: when the treatment is implemented depends on time-varying factors,\nand who receives the treatment depends on unit-level confounders. We study the\nidentification of treatment effects and illustrate the presence of a trade-off\nbetween time and unit-level confounding. We provide asymptotic results for\ninference for several Synthetic Control estimators and show that different\nsources of randomness should be considered for inference, depending on the\nnature of confounding. We conclude with a comparison of Synthetic Control\nestimators with alternatives for factor models."}, "http://arxiv.org/abs/2312.01162": {"title": "Tests for Many Treatment Effects in Regression Discontinuity Panel Data Models", "link": "http://arxiv.org/abs/2312.01162", "description": "Numerous studies use regression discontinuity design (RDD) for panel data by\nassuming that the treatment effects are homogeneous across all\nindividuals/groups and pooling the data together. It is unclear how to test for\nthe significance of treatment effects when the treatments vary across\nindividuals/groups and the error terms may exhibit complicated dependence\nstructures. This paper examines the estimation and inference of multiple\ntreatment effects when the errors are not independent and identically\ndistributed, and the treatment effects vary across individuals/groups. 
We\nderive a simple analytical expression for approximating the variance-covariance\nstructure of the treatment effect estimators under general dependence\nconditions and propose two test statistics, one to test for the overall\nsignificance of the treatment effect and the other for the homogeneity of the\ntreatment effects. We find that in the Gaussian approximations to the test\nstatistics, the dependence structures in the data can be safely ignored due to\nthe localized nature of the statistics. This has the important implication that\nthe simulated critical values can be easily obtained. Simulations demonstrate that\nour tests have superb size control and reasonable power performance in finite\nsamples regardless of the presence of strong cross-section dependence and/or\nweak serial dependence in the data. We apply our tests to two datasets and find\nsignificant overall treatment effects in each case."}, "http://arxiv.org/abs/2312.01209": {"title": "A Method of Moments Approach to Asymptotically Unbiased Synthetic Controls", "link": "http://arxiv.org/abs/2312.01209", "description": "A common approach to constructing a Synthetic Control unit is to fit on the\noutcome variable and covariates in pre-treatment time periods, but it has been\nshown by Ferman and Pinto (2021) that this approach does not provide asymptotic\nunbiasedness when the fit is imperfect and the number of controls is fixed.\nMany related panel methods have a similar limitation when the number of units\nis fixed. I introduce and evaluate a new method in which the Synthetic Control\nis constructed using a Generalized Method of Moments approach where, if the\nSynthetic Control satisfies the moment conditions, it must have the same\nloadings on latent factors as the treated unit. I show that a Synthetic Control\nEstimator of this form will be asymptotically unbiased as the number of\npre-treatment time periods goes to infinity, even when pre-treatment fit is\nimperfect and the set of controls is fixed. Furthermore, if both the number of\npre-treatment and post-treatment time periods go to infinity, then averages of\ntreatment effects can be consistently estimated and asymptotically valid\ninference can be conducted using a subsampling method. I conduct simulations\nand an empirical application to compare the performance of this method with\nexisting approaches in the literature."}, "http://arxiv.org/abs/2312.01881": {"title": "Bayesian Nonlinear Regression using Sums of Simple Functions", "link": "http://arxiv.org/abs/2312.01881", "description": "This paper proposes a new Bayesian machine learning model that can be applied\nto large datasets arising in macroeconomics. Our framework sums over many\nsimple two-component location mixtures. The transition between components is\ndetermined by a logistic function that depends on a single threshold variable\nand two hyperparameters. Each of these individual models only accounts for a\nminor portion of the variation in the endogenous variables. But many of them\nare capable of capturing arbitrary nonlinear conditional mean relations.\nConjugate priors enable fast and efficient inference. In simulations, we show\nthat our approach produces accurate point and density forecasts. 
In a real-data\nexercise, we forecast US macroeconomic aggregates and consider the nonlinear\neffects of financial shocks in a large-scale nonlinear VAR."}, "http://arxiv.org/abs/1905.05237": {"title": "Sustainable Investing and the Cross-Section of Returns and Maximum Drawdown", "link": "http://arxiv.org/abs/1905.05237", "description": "We use supervised learning to identify factors that predict the cross-section\nof returns and maximum drawdown for stocks in the US equity market. Our data\nrun from January 1970 to December 2019 and our analysis includes ordinary least\nsquares, penalized linear regressions, tree-based models, and neural networks.\nWe find that the most important predictors tended to be consistent across\nmodels, and that non-linear models had better predictive power than linear\nmodels. Predictive power was higher in calm periods than in stressed periods.\nEnvironmental, social, and governance indicators marginally impacted the\npredictive power of non-linear models in our data, despite their negative\ncorrelation with maximum drawdown and positive correlation with returns. Upon\nexploring whether ESG variables are captured by some models, we find that ESG\ndata contribute to the prediction nonetheless."}, "http://arxiv.org/abs/2203.08879": {"title": "A Simple and Computationally Trivial Estimator for Grouped Fixed Effects Models", "link": "http://arxiv.org/abs/2203.08879", "description": "This paper introduces a new fixed effects estimator for linear panel data\nmodels with clustered time patterns of unobserved heterogeneity. The method\navoids non-convex and combinatorial optimization by combining a preliminary\nconsistent estimator of the slope coefficient, an agglomerative\npairwise-differencing clustering of cross-sectional units, and a pooled\nordinary least squares regression. Asymptotic guarantees are established in a\nframework where $T$ can grow at any power of $N$, as both $N$ and $T$ approach\ninfinity. Unlike most existing approaches, the proposed estimator is\ncomputationally straightforward and does not require a known upper bound on the\nnumber of groups. As with existing approaches, this method leads to a consistent\nestimation of well-separated groups and an estimator of common parameters\nasymptotically equivalent to the infeasible regression controlling for the true\ngroups. An application revisits the statistical association between income and\ndemocracy."}, "http://arxiv.org/abs/2204.02346": {"title": "Finitely Heterogeneous Treatment Effect in Event-study", "link": "http://arxiv.org/abs/2204.02346", "description": "Treatment effect estimation strategies in the event-study setup, namely panel\ndata with variation in treatment timing, often use the parallel trend\nassumption that assumes mean independence of potential outcomes across\ndifferent treatment timings. In this paper, we relax the parallel trend\nassumption by assuming a latent type variable and develop a type-specific\nparallel trend assumption. With a finite support assumption on the latent type\nvariable, we show that an extremum classifier consistently estimates the type\nassignment. Based on the classification result, we propose a type-specific\ndiff-in-diff estimator for the type-specific CATT. 
By estimating the CATT with\nregard to the latent type, we study heterogeneity in treatment effect, in\naddition to heterogeneity in baseline outcomes."}, "http://arxiv.org/abs/2204.07672": {"title": "Abadie's Kappa and Weighting Estimators of the Local Average Treatment Effect", "link": "http://arxiv.org/abs/2204.07672", "description": "In this paper we study the finite sample and asymptotic properties of various\nweighting estimators of the local average treatment effect (LATE), each of\nwhich can be motivated by Abadie's (2003) kappa theorem. Our framework presumes\na binary treatment and a binary instrument, which may only be valid after\nconditioning on additional covariates. We argue that two of the estimators\nunder consideration, which are weight normalized, are generally preferable.\nSeveral other estimators, which are unnormalized, do not satisfy the properties\nof scale invariance with respect to the natural logarithm and translation\ninvariance, thereby exhibiting sensitivity to the units of measurement when\nestimating the LATE in logs and the centering of the outcome variable more\ngenerally. We also demonstrate that, when noncompliance is one sided, certain\nestimators have the advantage of being based on a denominator that is strictly\ngreater than zero by construction. This is the case for only one of the two\nnormalized estimators, and we recommend this estimator for wider use. We\nillustrate our findings with a simulation study and three empirical\napplications. The importance of normalization is particularly apparent in\napplications to real data. The simulations also suggest that covariate\nbalancing estimation of instrument propensity scores may be more robust to\nmisspecification. Software for implementing these methods is available in\nStata."}, "http://arxiv.org/abs/2208.03737": {"title": "(Functional)Characterizations vs (Finite)Tests: Partially Unifying Functional and Inequality-Based Approaches to Testing", "link": "http://arxiv.org/abs/2208.03737", "description": "Historically, testing if decision-makers obey certain choice axioms using\nchoice data takes two distinct approaches. The 'functional' approach observes\nand tests the entire 'demand' or 'choice' function, whereas the 'revealed\npreference (RP)' approach constructs inequalities to test finite choices. I\ndemonstrate that a statistical recasting of the revealed preference approach enables uniting both\napproaches. Specifically, I construct a computationally efficient algorithm to\noutput one-sided statistical tests of choice data from functional\ncharacterizations of axiomatic behavior, thus linking statistical and RP\ntesting. An application to weakly separable preferences, where RP\ncharacterizations are provably NP-Hard, demonstrates the approach's merit. I\nalso show that without assuming monotonicity, all restrictions disappear.\nHence, any ability to resolve axiomatic behavior relies on the monotonicity\nassumption."}, "http://arxiv.org/abs/2211.13610": {"title": "Cross-Sectional Dynamics Under Network Structure: Theory and Macroeconomic Applications", "link": "http://arxiv.org/abs/2211.13610", "description": "Many environments in economics feature a cross-section of units linked by\nbilateral ties. I develop a framework for studying dynamics of cross-sectional\nvariables exploiting this network structure. It is a vector autoregression in\nwhich innovations transmit cross-sectionally only via bilateral links and which\ncan accommodate rich patterns of how network effects of higher order accumulate\nover time. 
The model can be used to estimate dynamic network effects, with the\nnetwork given or inferred from dynamic cross-correlations in the data. It also\noffers a dimensionality-reduction technique for modeling high-dimensional\n(cross-sectional) processes, owing to networks' ability to summarize complex\nrelations among variables (units) by relatively few non-zero bilateral links.\nIn a first application, I estimate how sectoral productivity shocks transmit\nalong supply chain linkages and affect dynamics of sectoral prices in the US\neconomy. The analysis suggests that network positions can rationalize not only\nthe strength of a sector's impact on aggregates, but also its timing. In a\nsecond application, I model industrial production growth across 44 countries by\nassuming global business cycles are driven by bilateral links which I estimate.\nThis reduces out-of-sample mean squared errors by up to 23% relative to a\nprincipal components factor model."}, "http://arxiv.org/abs/2303.00083": {"title": "Transition Probabilities and Moment Restrictions in Dynamic Fixed Effects Logit Models", "link": "http://arxiv.org/abs/2303.00083", "description": "Dynamic logit models are popular tools in economics to measure state\ndependence. This paper introduces a new method to derive moment restrictions in\na large class of such models with strictly exogenous regressors and fixed\neffects. We exploit the common structure of logit-type transition probabilities\nand elementary properties of rational fractions, to formulate a systematic\nprocedure that scales naturally with model complexity (e.g the lag order or the\nnumber of observed time periods). We detail the construction of moment\nrestrictions in binary response models of arbitrary lag order as well as\nfirst-order panel vector autoregressions and dynamic multinomial logit models.\nIdentification of common parameters and average marginal effects is also\ndiscussed for the binary response case. Finally, we illustrate our results by\nstudying the dynamics of drug consumption amongst young people inspired by Deza\n(2015)."}, "http://arxiv.org/abs/2306.13362": {"title": "Sparse plus dense MIDAS regressions and nowcasting during the COVID pandemic", "link": "http://arxiv.org/abs/2306.13362", "description": "The common practice for GDP nowcasting in a data-rich environment is to\nemploy either sparse regression using LASSO-type regularization or a dense\napproach based on factor models or ridge regression, which differ in the way\nthey extract information from high-dimensional datasets. This paper aims to\ninvestigate whether sparse plus dense mixed frequency regression methods can\nimprove the nowcasts of the US GDP growth. We propose two novel MIDAS\nregressions and show that these novel sparse plus dense methods greatly improve\nthe accuracy of nowcasts during the COVID pandemic compared to either only\nsparse or only dense approaches. Using monthly macro and weekly financial\nseries, we further show that the improvement is particularly sharp when the\ndense component is restricted to be macro, while the sparse signal stems from\nboth macro and financial series."}, "http://arxiv.org/abs/2312.02288": {"title": "Almost Dominance: Inference and Application", "link": "http://arxiv.org/abs/2312.02288", "description": "This paper proposes a general framework for inference on three types of\nalmost dominances: Almost Lorenz dominance, almost inverse stochastic\ndominance, and almost stochastic dominance. 
We first generalize almost Lorenz\ndominance to almost upward and downward Lorenz dominances. We then provide a\nbootstrap inference procedure for the Lorenz dominance coefficients, which\nmeasure the degrees of almost Lorenz dominances. Furthermore, we propose almost\nupward and downward inverse stochastic dominances and provide inference on the\ninverse stochastic dominance coefficients. We also show that our results can\neasily be extended to almost stochastic dominance. Simulation studies\ndemonstrate the finite sample properties of the proposed estimators and the\nbootstrap confidence intervals. We apply our methods to the inequality growth\nin the United Kingdom and find evidence for almost upward inverse stochastic\ndominance."}, "http://arxiv.org/abs/2206.08052": {"title": "Likelihood ratio test for structural changes in factor models", "link": "http://arxiv.org/abs/2206.08052", "description": "A factor model with a break in its factor loadings is observationally\nequivalent to a model without changes in the loadings but a change in the\nvariance of its factors. This effectively transforms a structural change\nproblem of high dimension into a problem of low dimension. This paper considers\nthe likelihood ratio (LR) test for a variance change in the estimated factors.\nThe LR test implicitly explores a special feature of the estimated factors: the\npre-break and post-break variances can be a singular matrix under the\nalternative hypothesis, making the LR test diverging faster and thus more\npowerful than Wald-type tests. The better power property of the LR test is also\nconfirmed by simulations. We also consider mean changes and multiple breaks. We\napply the procedure to the factor modelling and structural change of the US\nemployment using monthly industry-level-data."}, "http://arxiv.org/abs/2207.03035": {"title": "On the instrumental variable estimation with many weak and invalid instruments", "link": "http://arxiv.org/abs/2207.03035", "description": "We discuss the fundamental issue of identification in linear instrumental\nvariable (IV) models with unknown IV validity. With the assumption of the\n\"sparsest rule\", which is equivalent to the plurality rule but becomes\noperational in computation algorithms, we investigate and prove the advantages\nof non-convex penalized approaches over other IV estimators based on two-step\nselections, in terms of selection consistency and accommodation for\nindividually weak IVs. Furthermore, we propose a surrogate sparsest penalty\nthat aligns with the identification condition and provides oracle sparse\nstructure simultaneously. Desirable theoretical properties are derived for the\nproposed estimator with weaker IV strength conditions compared to the previous\nliterature. Finite sample properties are demonstrated using simulations and the\nselection and estimation method is applied to an empirical study concerning the\neffect of BMI on diastolic blood pressure."}, "http://arxiv.org/abs/2301.07855": {"title": "Digital Divide: Empirical Study of CIUS 2020", "link": "http://arxiv.org/abs/2301.07855", "description": "Canada and other major countries are investigating the implementation of\n``digital money'' or Central Bank Digital Currencies, necessitating answers to\nkey questions about how demographic and geographic factors influence the\npopulation's digital literacy. 
This paper uses the Canadian Internet Use Survey\n(CIUS) 2020 and survey versions of Lasso inference methods to assess the\ndigital divide in Canada and determine the relevant factors that influence it.\nWe find that a significant divide in the use of digital technologies, e.g.,\nonline banking and virtual wallet, continues to exist across different\ndemographic and geographic categories. We also create a digital divide score\nthat measures the survey respondents' digital literacy and provide multiple\ncorrespondence analyses that further corroborate these findings."}, "http://arxiv.org/abs/2312.03165": {"title": "A Theory Guide to Using Control Functions to Instrument Hazard Models", "link": "http://arxiv.org/abs/2312.03165", "description": "I develop the theory around using control functions to instrument hazard\nmodels, allowing the inclusion of endogenous (e.g., mismeasured) regressors.\nSimple discrete-data hazard models can be expressed as binary choice panel data\nmodels, and the widespread Prentice and Gloeckler (1978) discrete-data\nproportional hazards model can specifically be expressed as a complementary\nlog-log model with time fixed effects. This allows me to recast it as GMM\nestimation and its instrumented version as sequential GMM estimation in a\nZ-estimation (non-classical GMM) framework; this framework can then be\nleveraged to establish asymptotic properties and sufficient conditions. Whilst\nthis paper focuses on the Prentice and Gloeckler (1978) model, the methods and\ndiscussion developed here can be applied more generally to other hazard models\nand binary choice models. I also introduce my Stata command for estimating a\ncomplementary log-log model instrumented via control functions (available as\nivcloglog on SSC), which allows practitioners to easily instrument the Prentice\nand Gloeckler (1978) model."}, "http://arxiv.org/abs/2005.05942": {"title": "Moment Conditions for Dynamic Panel Logit Models with Fixed Effects", "link": "http://arxiv.org/abs/2005.05942", "description": "This paper investigates the construction of moment conditions in discrete\nchoice panel data with individual specific fixed effects. We describe how to\nsystematically explore the existence of moment conditions that do not depend on\nthe fixed effects, and we demonstrate how to construct them when they exist.\nOur approach is closely related to the numerical \"functional differencing\"\nconstruction in Bonhomme (2012), but our emphasis is to find explicit analytic\nexpressions for the moment functions. We first explain the construction and\ngive examples of such moment conditions in various models. Then, we focus on\nthe dynamic binary choice logit model and explore the implications of the\nmoment conditions for identification and estimation of the model parameters\nthat are common to all individuals."}, "http://arxiv.org/abs/2104.12909": {"title": "Algorithm as Experiment: Machine Learning, Market Design, and Policy Eligibility Rules", "link": "http://arxiv.org/abs/2104.12909", "description": "Algorithms make a growing portion of policy and business decisions. We\ndevelop a treatment-effect estimator using algorithmic decisions as instruments\nfor a class of stochastic and deterministic algorithms. Our estimator is\nconsistent and asymptotically normal for well-defined causal effects. A special\ncase of our setup is multidimensional regression discontinuity designs with\ncomplex boundaries. 
We apply our estimator to evaluate the Coronavirus Aid,\nRelief, and Economic Security Act, which allocated many billions of dollars'\nworth of relief funding to hospitals via an algorithmic rule. The funding is\nshown to have little effect on COVID-19-related hospital activities. Naive\nestimates exhibit selection bias."}, "http://arxiv.org/abs/2312.03915": {"title": "Alternative models for FX, arbitrage opportunities and efficient pricing of double barrier options in L\\'evy models", "link": "http://arxiv.org/abs/2312.03915", "description": "We analyze the qualitative differences between prices of double barrier\nno-touch options in the Heston model and pure jump KoBoL model calibrated to\nthe same set of the empirical data, and discuss the potential for arbitrage\nopportunities if the correct model is a pure jump model. We explain and\ndemonstrate with numerical examples that accurate and fast calculations of\nprices of double barrier options in jump models are extremely difficult using\nthe numerical methods available in the literature. We develop a new efficient\nmethod (GWR-SINH method) based on the Gaver-Wynn-Rho acceleration applied to\nthe Bromwich integral; the SINH-acceleration and simplified trapezoid rule are\nused to evaluate perpetual double barrier options for each value of the\nspectral parameter in the GWR algorithm. The program in Matlab running on a Mac\nwith moderate characteristics achieves the precision of the order of E-5 and\nbetter in several dozen milliseconds; the precision E-07 is\nachievable in about 0.1 sec. We outline the extension of the GWR-SINH method to\nregime-switching models and models with stochastic parameters and stochastic\ninterest rates."}, "http://arxiv.org/abs/2312.04428": {"title": "A general framework for the generation of probabilistic socioeconomic scenarios: Quantification of national food security risk with application to the cases of Egypt and Ethiopia", "link": "http://arxiv.org/abs/2312.04428", "description": "In this work a general framework for providing detailed probabilistic\nsocioeconomic scenarios as well as estimates concerning country-level food\nsecurity risk is proposed. Our methodology builds on (a) the Bayesian\nprobabilistic version of the world population model and (b) the\ninterdependencies of the minimum food requirements and the national food system\ncapacities on key drivers, such as: population, income, natural resources, and\nother socioeconomic and climate indicators. Model uncertainty plays an\nimportant role in such endeavours. In this perspective, the concept of the\nrecently developed convex risk measures, which mitigate the model uncertainty\neffects, is employed for the development of a framework for assessment, in the\ncontext of food security. The proposed method provides predictions and\nevaluations for food security risk both within and across probabilistic\nscenarios at country level. Our methodology is illustrated through its\nimplementation for the cases of Egypt and Ethiopia, for the time period\n2019-2050, under the combined context of the Shared Socioeconomic Pathways\n(SSPs) and the Representative Concentration Pathways (RCPs)."}, "http://arxiv.org/abs/2108.02196": {"title": "Synthetic Controls for Experimental Design", "link": "http://arxiv.org/abs/2108.02196", "description": "This article studies experimental design in settings where the experimental\nunits are large aggregate entities (e.g., markets), and only one or a small\nnumber of units can be exposed to the treatment. 
In such settings,\nrandomization of the treatment may result in treated and control groups with\nvery different characteristics at baseline, inducing biases. We propose a\nvariety of synthetic control designs (Abadie, Diamond and Hainmueller, 2010,\nAbadie and Gardeazabal, 2003) as experimental designs to select treated units\nin non-randomized experiments with large aggregate units, as well as the\nuntreated units to be used as a control group. Average potential outcomes are\nestimated as weighted averages of treated units, for potential outcomes with\ntreatment -- and control units, for potential outcomes without treatment. We\nanalyze the properties of estimators based on synthetic control designs and\npropose new inferential techniques. We show that in experimental settings with\naggregate units, synthetic control designs can substantially reduce estimation\nbiases in comparison to randomization of the treatment."}, "http://arxiv.org/abs/2108.04852": {"title": "Multiway empirical likelihood", "link": "http://arxiv.org/abs/2108.04852", "description": "This paper develops a general methodology to conduct statistical inference\nfor observations indexed by multiple sets of entities. We propose a novel\nmultiway empirical likelihood statistic that converges to a chi-square\ndistribution under the non-degenerate case, where corresponding Hoeffding type\ndecomposition is dominated by linear terms. Our methodology is related to the\nnotion of jackknife empirical likelihood but the leave-out pseudo values are\nconstructed by leaving columns or rows. We further develop a modified version\nof our multiway empirical likelihood statistic, which converges to a chi-square\ndistribution regardless of the degeneracy, and discover its desirable\nhigher-order property compared to the t-ratio by the conventional Eicker-White\ntype variance estimator. The proposed methodology is illustrated by several\nimportant statistical problems, such as bipartite network, generalized\nestimating equations, and three-way observations."}, "http://arxiv.org/abs/2306.03632": {"title": "Uniform Inference for Cointegrated Vector Autoregressive Processes", "link": "http://arxiv.org/abs/2306.03632", "description": "Uniformly valid inference for cointegrated vector autoregressive processes\nhas so far proven difficult due to certain discontinuities arising in the\nasymptotic distribution of the least squares estimator. We extend asymptotic\nresults from the univariate case to multiple dimensions and show how inference\ncan be based on these results. Furthermore, we show that lag augmentation and a\nrecent instrumental variable procedure can also yield uniformly valid tests and\nconfidence regions. We verify the theoretical findings and investigate finite\nsample properties in simulation experiments for two specific examples."}, "http://arxiv.org/abs/2005.04141": {"title": "Critical Values Robust to P-hacking", "link": "http://arxiv.org/abs/2005.04141", "description": "P-hacking is prevalent in reality but absent from classical hypothesis\ntesting theory. As a consequence, significant results are much more common than\nthey are supposed to be when the null hypothesis is in fact true. In this\npaper, we build a model of hypothesis testing with p-hacking. 
From the model,\nwe construct critical values such that, if the values are used to determine\nsignificance, and if scientists' p-hacking behavior adjusts to the new\nsignificance standards, significant results occur with the desired frequency.\nSuch robust critical values allow for p-hacking so they are larger than\nclassical critical values. To illustrate the amount of correction that\np-hacking might require, we calibrate the model using evidence from the medical\nsciences. In the calibrated model the robust critical value for any test\nstatistic is the classical critical value for the same test statistic with one\nfifth of the significance level."}, "http://arxiv.org/abs/2312.05342": {"title": "Occasionally Misspecified", "link": "http://arxiv.org/abs/2312.05342", "description": "When fitting a particular Economic model on a sample of data, the model may\nturn out to be heavily misspecified for some observations. This can happen\nbecause of unmodelled idiosyncratic events, such as an abrupt but short-lived\nchange in policy. These outliers can significantly alter estimates and\ninferences. A robust estimation is desirable to limit their influence. For\nskewed data, this induces another bias which can also invalidate the estimation\nand inferences. This paper proposes a robust GMM estimator with a simple bias\ncorrection that does not degrade robustness significantly. The paper provides\nfinite-sample robustness bounds, and asymptotic uniform equivalence with an\noracle that discards all outliers. Consistency and asymptotic normality ensue\nfrom that result. An application to the \"Price-Puzzle,\" which finds inflation\nincreases when monetary policy tightens, illustrates the concerns and the\nmethod. The proposed estimator finds the intuitive result: tighter monetary\npolicy leads to a decline in inflation."}, "http://arxiv.org/abs/2312.05373": {"title": "GCov-Based Portmanteau Test", "link": "http://arxiv.org/abs/2312.05373", "description": "We examine finite sample performance of the Generalized Covariance (GCov)\nresidual-based specification test for semiparametric models with i.i.d. errors.\nThe residual-based multivariate portmanteau test statistic follows\nasymptotically a $\\chi^2$ distribution when the model is estimated by the GCov\nestimator. The test is shown to perform well in application to the univariate\nmixed causal-noncausal MAR, double autoregressive (DAR) and multivariate Vector\nAutoregressive (VAR) models. We also introduce a bootstrap procedure that\nprovides the limiting distribution of the test statistic when the specification\ntest is applied to a model estimated by the maximum likelihood, or the\napproximate or quasi-maximum likelihood under a parametric assumption on the\nerror distribution."}, "http://arxiv.org/abs/2312.05593": {"title": "Economic Forecasts Using Many Noises", "link": "http://arxiv.org/abs/2312.05593", "description": "This paper addresses a key question in economic forecasting: does pure noise\ntruly lack predictive power? Economists typically conduct variable selection to\neliminate noises from predictors. Yet, we prove a compelling result that in\nmost economic forecasts, the inclusion of noises in predictions yields greater\nbenefits than its exclusion. Furthermore, if the total number of predictors is\nnot sufficiently large, intentionally adding more noises yields superior\nforecast performance, outperforming benchmark predictors relying on dimension\nreduction. 
The intuition lies in economic predictive signals being densely\ndistributed among regression coefficients, maintaining modest forecast bias\nwhile diversifying away overall variance, even when a significant proportion of\npredictors constitute pure noises. One of our empirical demonstrations shows\nthat intentionally adding 300 to 6,000 pure noises to the Welch and Goyal (2008)\ndataset achieves a noteworthy 10% out-of-sample R-squared in\nforecasting the annual U.S. equity premium. The performance surpasses the\nmajority of sophisticated machine learning models."}, "http://arxiv.org/abs/2312.05700": {"title": "Influence Analysis with Panel Data", "link": "http://arxiv.org/abs/2312.05700", "description": "The presence of units with extreme values in the dependent and/or independent\nvariables (i.e., vertical outliers, leveraged data) has the potential to\nseverely bias regression coefficients and/or standard errors. This is common\nwith short panel data because the researcher cannot appeal to asymptotic theory.\nExamples include cross-country studies, cell-group analyses, and field or\nlaboratory experimental studies, where the researcher is forced to use few\ncross-sectional observations repeated over time due to the structure of the\ndata or research design. Available diagnostic tools may fail to properly detect\nthese anomalies, because they are not designed for panel data. In this paper,\nwe formalise statistical measures for panel data models with fixed effects to\nquantify the degree of leverage and outlyingness of units, and the joint and\nconditional influences of pairs of units. We first develop a method to visually\ndetect anomalous units in a panel data set, and identify their type. Second, we\ninvestigate the effect of these units on LS estimates, and on other units'\ninfluence on the estimated parameters. To illustrate and validate the proposed\nmethod, we use a synthetic data set contaminated with different types of\nanomalous units. We also provide an empirical example."}, "http://arxiv.org/abs/2312.05858": {"title": "The Machine Learning Control Method for Counterfactual Forecasting", "link": "http://arxiv.org/abs/2312.05858", "description": "Without a credible control group, the most widespread methodologies for\nestimating causal effects cannot be applied. To fill this gap, we propose the\nMachine Learning Control Method (MLCM), a new approach for causal panel\nanalysis based on counterfactual forecasting with machine learning. The MLCM\nestimates policy-relevant causal parameters in short- and long-panel settings\nwithout relying on untreated units. We formalize identification in the\npotential outcomes framework and then provide estimation based on supervised\nmachine learning algorithms. To illustrate the advantages of our estimator, we\npresent simulation evidence and an empirical application on the impact of the\nCOVID-19 crisis on educational inequality in Italy. We implement the proposed\nmethod in the companion R package MachineControl."}, "http://arxiv.org/abs/2312.05898": {"title": "Dynamic Spatiotemporal ARCH Models: Small and Large Sample Results", "link": "http://arxiv.org/abs/2312.05898", "description": "This paper explores the estimation of a dynamic spatiotemporal autoregressive\nconditional heteroscedasticity (ARCH) model. 
The log-volatility term in this\nmodel can depend on (i) the spatial lag of the log-squared outcome variable,\n(ii) the time-lag of the log-squared outcome variable, (iii) the spatiotemporal\nlag of the log-squared outcome variable, (iv) exogenous variables, and (v) the\nunobserved heterogeneity across regions and time, i.e., the regional and time\nfixed effects. We examine the small and large sample properties of two\nquasi-maximum likelihood estimators and a generalized method of moments\nestimator for this model. We first summarize the theoretical properties of\nthese estimators and then compare their finite sample properties through Monte\nCarlo simulations."}, "http://arxiv.org/abs/2312.05985": {"title": "Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions", "link": "http://arxiv.org/abs/2312.05985", "description": "To address the bias of the canonical two-way fixed effects estimator for\ndifference-in-differences under staggered adoptions, Wooldridge (2021) proposed\nthe extended two-way fixed effects estimator, which adds many parameters.\nHowever, this reduces efficiency. Restricting some of these parameters to be\nequal helps, but ad hoc restrictions may reintroduce bias. We propose a machine\nlearning estimator with a single tuning parameter, fused extended two-way fixed\neffects (FETWFE), that enables automatic data-driven selection of these\nrestrictions. We prove that under an appropriate sparsity assumption FETWFE\nidentifies the correct restrictions with probability tending to one. We also\nprove the consistency, asymptotic normality, and oracle efficiency of FETWFE\nfor two classes of heterogeneous marginal treatment effect estimators under\neither conditional or marginal parallel trends, and we prove consistency for\ntwo classes of conditional average treatment effects under conditional parallel\ntrends. We demonstrate FETWFE in simulation studies and an empirical\napplication."}, "http://arxiv.org/abs/2312.06379": {"title": "Trends in Temperature Data: Micro-foundations of Their Nature", "link": "http://arxiv.org/abs/2312.06379", "description": "Determining whether Global Average Temperature (GAT) is an integrated process\nof order 1, I(1), or is a stationary process around a trend function is crucial\nfor detection, attribution, impact and forecasting studies of climate change.\nIn this paper, we investigate the nature of trends in GAT building on the\nanalysis of individual temperature grids. Our 'micro-founded' evidence suggests\nthat GAT is stationary around a non-linear deterministic trend in the form of a\nlinear function with a one-period structural break. This break can be\nattributed to a combination of individual grid breaks and the standard\naggregation method under acceleration in global warming. 
We illustrate our\nfindings using simulations."}, "http://arxiv.org/abs/2312.06402": {"title": "Structural Analysis of Vector Autoregressive Models", "link": "http://arxiv.org/abs/2312.06402", "description": "This set of lecture notes discusses key concepts for the structural analysis of\nVector Autoregressive models for the teaching of an Applied Macroeconometrics\nmodule."}, "http://arxiv.org/abs/2110.12722": {"title": "Functional instrumental variable regression with an application to estimating the impact of immigration on native wages", "link": "http://arxiv.org/abs/2110.12722", "description": "Functional linear regression has gained popularity as a statistical tool to\nstudy the relationship between a function-valued response and exogenous\nexplanatory variables. However, in practice, it is hard to expect that the\nexplanatory variables of interest are perfectly exogenous, due to, for example,\nthe presence of omitted variables and measurement error. Despite its empirical\nrelevance, it was not until recently that this issue of endogeneity was studied\nin the literature on functional regression, and the development in this\ndirection does not seem to sufficiently meet practitioners' needs; for example,\nthis issue has been discussed with particular attention paid to consistent\nestimation, and thus the distributional properties of the proposed estimators still\nremain to be further explored. To fill this gap, this paper proposes new\nconsistent FPCA-based instrumental variable estimators and develops their\nasymptotic properties in detail. Simulation experiments under a wide range of\nsettings show that the proposed estimators perform considerably well. We apply\nour methodology to estimate the impact of immigration on native wages."}, "http://arxiv.org/abs/2205.01882": {"title": "Approximating Choice Data by Discrete Choice Models", "link": "http://arxiv.org/abs/2205.01882", "description": "We obtain a necessary and sufficient condition under which random-coefficient\ndiscrete choice models, such as mixed-logit models, are rich enough to\napproximate any nonparametric random utility models arbitrarily well across\nchoice sets. The condition turns out to be the affine-independence of the set\nof characteristic vectors. When the condition fails, resulting in some random\nutility models that cannot be closely approximated, we identify preferences and\nsubstitution patterns that are challenging to approximate accurately. We also\npropose algorithms to quantify the magnitude of approximation errors."}, "http://arxiv.org/abs/2305.18114": {"title": "Identifying Dynamic LATEs with a Static Instrument", "link": "http://arxiv.org/abs/2305.18114", "description": "In many situations, researchers are interested in identifying dynamic effects\nof an irreversible treatment with a static binary instrumental variable (IV).\nFor example, in evaluations of dynamic effects of training programs, with a\nsingle lottery determining eligibility. A common approach in these situations\nis to report per-period IV estimates. Under a dynamic extension of standard IV\nassumptions, we show that such IV estimators identify a weighted sum of\ntreatment effects for different latent groups and treatment exposures. However,\nthere is a possibility of negative weights. 
We consider point and partial\nidentification of dynamic treatment effects in this setting under different\nsets of assumptions."}, "http://arxiv.org/abs/2312.07520": {"title": "Estimating Counterfactual Matrix Means with Short Panel Data", "link": "http://arxiv.org/abs/2312.07520", "description": "We develop a more flexible approach for identifying and estimating average\ncounterfactual outcomes when several but not all possible outcomes are observed\nfor each unit in a large cross section. Such settings include event studies and\nstudies of outcomes of \"matches\" between agents of two types, e.g. workers and\nfirms or people and places. When outcomes are generated by a factor model that\nallows for low-dimensional unobserved confounders, our method yields\nconsistent, asymptotically normal estimates of counterfactual outcome means\nunder asymptotics that fix the number of outcomes as the cross section grows\nand general outcome missingness patterns, including those not accommodated by\nexisting methods. Our method is also computationally efficient, requiring only\na single eigendecomposition of a particular aggregation of any factor estimates\nconstructed using subsets of units with the same observed outcomes. In a\nsemi-synthetic simulation study based on matched employer-employee data, our\nmethod performs favorably compared to a Two-Way-Fixed-Effects-model-based\nestimator."}, "http://arxiv.org/abs/2211.16362": {"title": "Score-based calibration testing for multivariate forecast distributions", "link": "http://arxiv.org/abs/2211.16362", "description": "Calibration tests based on the probability integral transform (PIT) are\nroutinely used to assess the quality of univariate distributional forecasts.\nHowever, PIT-based calibration tests for multivariate distributional forecasts\nface various challenges. We propose two new types of tests based on proper\nscoring rules, which overcome these challenges. They arise from a general\nframework for calibration testing in the multivariate case, introduced in this\nwork. The new tests have good size and power properties in simulations and\nsolve various problems of existing tests. We apply the tests to forecast\ndistributions for macroeconomic and financial time series data."}, "http://arxiv.org/abs/2309.04926": {"title": "Testing for Stationary or Persistent Coefficient Randomness in Predictive Regressions", "link": "http://arxiv.org/abs/2309.04926", "description": "This study considers tests for coefficient randomness in predictive\nregressions. Our focus is on how tests for coefficient randomness are\ninfluenced by the persistence of random coefficient. We find that when the\nrandom coefficient is stationary, or I(0), Nyblom's (1989) LM test loses its\noptimality (in terms of power), which is established against the alternative of\nintegrated, or I(1), random coefficient. We demonstrate this by constructing\ntests that are more powerful than the LM test when random coefficient is\nstationary, although these tests are dominated in terms of power by the LM test\nwhen random coefficient is integrated. This implies that the best test for\ncoefficient randomness differs from context to context, and practitioners\nshould take into account the persistence of potentially random coefficient and\nchoose from several tests accordingly. We apply tests for coefficient constancy\nto real data. 
The results mostly reverse the conclusion of an earlier empirical\nstudy."}, "http://arxiv.org/abs/2312.07683": {"title": "On Rosenbaum's Rank-based Matching Estimator", "link": "http://arxiv.org/abs/2312.07683", "description": "In two influential contributions, Rosenbaum (2005, 2020) advocated for using\nthe distances between component-wise ranks, instead of the original data\nvalues, to measure covariate similarity when constructing matching estimators\nof average treatment effects. While the intuitive benefits of using covariate\nranks for matching estimation are apparent, there is no theoretical\nunderstanding of such procedures in the literature. We fill this gap by\ndemonstrating that Rosenbaum's rank-based matching estimator, when coupled with\na regression adjustment, enjoys the properties of double robustness and\nsemiparametric efficiency without the need to enforce restrictive covariate\nmoment assumptions. Our theoretical findings further emphasize the statistical\nvirtues of employing ranks for estimation and inference, more broadly aligning\nwith the insights put forth by Peter Bickel in his 2004 Rietz lecture (Bickel,\n2004)."}, "http://arxiv.org/abs/2312.07881": {"title": "Efficiency of QMLE for dynamic panel data models with interactive effects", "link": "http://arxiv.org/abs/2312.07881", "description": "This paper derives the efficiency bound for estimating the parameters of\ndynamic panel data models in the presence of an increasing number of incidental\nparameters. We study the efficiency problem by formulating the dynamic panel as\na simultaneous equations system, and show that the quasi-maximum likelihood\nestimator (QMLE) applied to the system achieves the efficiency bound.\nComparison of QMLE with fixed effects estimators is made."}, "http://arxiv.org/abs/2312.08171": {"title": "Individual Updating of Subjective Probability of Homicide Victimization: a \"Natural Experiment\" on Risk Communication", "link": "http://arxiv.org/abs/2312.08171", "description": "We investigate the dynamics of the update of subjective homicide\nvictimization risk after an informational shock by developing two econometric\nmodels able to accommodate both optimal decisions of changing prior\nexpectations and the disregard of new information, which enables us to rationalize\nskeptical Bayesian agents. We apply our models to a unique household\ndataset (N = 4,030) that consists of socioeconomic and victimization expectation\nvariables in Brazil, coupled with an informational ``natural experiment''\nbrought about by the sample design methodology, which randomized interviewers to\ninterviewees. The higher the priors about their own subjective homicide\nvictimization risk are set, the more likely individuals are to change their\ninitial perceptions. In the case of an update, we find that elders and females are\nmore reluctant to change priors and choose the new response level. In addition,\neven though the respondents' level of education is not significant, the\ninterviewers' level of education has a key role in changing and updating\ndecisions. The results show that our econometric approach fits the available\nempirical evidence reasonably well, stressing the salient role that heterogeneity,\nrepresented by individual characteristics of interviewees and interviewers, has\non belief updating and the lack of it, that is, skepticism. 
Furthermore, we can\nrationalize skeptics through an informational quality/credibility argument."}, "http://arxiv.org/abs/2312.08174": {"title": "Double Machine Learning for Static Panel Models with Fixed Effects", "link": "http://arxiv.org/abs/2312.08174", "description": "Machine Learning (ML) algorithms are powerful data-driven tools for\napproximating high-dimensional or non-linear nuisance functions which are\nuseful in practice because the true functional form of the predictors is\nex-ante unknown. In this paper, we develop estimators of policy interventions\nfrom panel data which allow for non-linear effects of the confounding\nregressors, and investigate the performance of these estimators using three\nwell-known ML algorithms, specifically, LASSO, classification and regression\ntrees, and random forests. We use Double Machine Learning (DML) (Chernozhukov\net al., 2018) for the estimation of causal effects of homogeneous treatments\nwith unobserved individual heterogeneity (fixed effects) and no unobserved\nconfounding by extending Robinson (1988)'s partially linear regression model.\nWe develop three alternative approaches for handling unobserved individual\nheterogeneity based on extending the within-group estimator, first-difference\nestimator, and correlated random effect estimator (Mundlak, 1978) for\nnon-linear models. Using Monte Carlo simulations, we find that conventional\nleast squares estimators can perform well even if the data generating process\nis non-linear, but there are substantial performance gains in terms of bias\nreduction under a process where the true effect of the regressors is non-linear\nand discontinuous. However, for the same scenarios, we also find -- despite\nextensive hyperparameter tuning -- inference to be problematic for both\ntree-based learners because these lead to highly non-normal estimator\ndistributions and the estimator variance being severely under-estimated. This\ncontradicts the performance of trees in other circumstances and requires\nfurther investigation. Finally, we provide an illustrative example of DML for\nobservational panel data showing the impact of the introduction of the national\nminimum wage in the UK."}, "http://arxiv.org/abs/2201.11304": {"title": "Standard errors for two-way clustering with serially correlated time effects", "link": "http://arxiv.org/abs/2201.11304", "description": "We propose improved standard errors and an asymptotic distribution theory for\ntwo-way clustered panels. Our proposed estimator and theory allow for arbitrary\nserial dependence in the common time effects, which is excluded by existing\ntwo-way methods, including the popular two-way cluster standard errors of\nCameron, Gelbach, and Miller (2011) and the cluster bootstrap of Menzel (2021).\nOur asymptotic distribution theory is the first which allows for this level of\ninter-dependence among the observations. Under weak regularity conditions, we\ndemonstrate that the least squares estimator is asymptotically normal, our\nproposed variance estimator is consistent, and t-ratios are asymptotically\nstandard normal, permitting conventional inference. We present simulation\nevidence that confidence intervals constructed with our proposed standard\nerrors obtain superior coverage performance relative to existing methods. 
We\nillustrate the relevance of the proposed method in an empirical application to\na standard Fama-French three-factor regression."}, "http://arxiv.org/abs/2303.04416": {"title": "Inference on Optimal Dynamic Policies via Softmax Approximation", "link": "http://arxiv.org/abs/2303.04416", "description": "Estimating optimal dynamic policies from offline data is a fundamental\nproblem in dynamic decision making. In the context of causal inference, the\nproblem is known as estimating the optimal dynamic treatment regime. Even\nthough there exists a plethora of methods for estimation, constructing\nconfidence intervals for the value of the optimal regime and structural\nparameters associated with it is inherently harder, as it involves non-linear\nand non-differentiable functionals of unknown quantities that need to be\nestimated. Prior work resorted to sub-sample approaches that can deteriorate\nthe quality of the estimate. We show that a simple soft-max approximation to\nthe optimal treatment regime, for an appropriately fast growing temperature\nparameter, can achieve valid inference on the truly optimal regime. We\nillustrate our result for a two-period optimal dynamic regime, though our\napproach should directly extend to the finite horizon case. Our work combines\ntechniques from semi-parametric inference and $g$-estimation, together with an\nappropriate triangular array central limit theorem, as well as a novel analysis\nof the asymptotic influence and asymptotic bias of softmax approximations."}, "http://arxiv.org/abs/1904.00111": {"title": "Simple subvector inference on sharp identified set in affine models", "link": "http://arxiv.org/abs/1904.00111", "description": "This paper studies a regularized support function estimator for bounds on\ncomponents of the parameter vector in the case in which the identified set is a\npolygon. The proposed regularized estimator has three important properties: (i)\nit has a uniform asymptotic Gaussian limit in the presence of flat faces in the\nabsence of redundant (or overidentifying) constraints (or vice versa); (ii) the\nbias from regularization does not enter the first-order limiting\ndistribution; (iii) the estimator remains consistent for the sharp identified set\nfor the individual components even in the non-regular case. These properties\nare used to construct uniformly valid confidence sets for an element\n$\\theta_{1}$ of a parameter vector $\\theta\\in\\mathbb{R}^{d}$ that is partially\nidentified by affine moment equality and inequality conditions. The proposed\nconfidence sets can be computed as a solution to a small number of linear and\nconvex quadratic programs, which leads to a substantial decrease in computation\ntime and guarantees a global optimum. As a result, the method provides\nuniformly valid inference in applications in which the dimension of the\nparameter space, $d$, and the number of inequalities, $k$, were previously\ncomputationally infeasible ($d,k=100$). The proposed approach can be extended\nto construct confidence sets for intersection bounds, to construct joint\npolygon-shaped confidence sets for multiple components of $\\theta$, and to find\nthe set of solutions to a linear program. 
Inference for coefficients in the\nlinear IV regression model with an interval outcome is used as an illustrative\nexample."}, "http://arxiv.org/abs/1911.04529": {"title": "Identification in discrete choice models with imperfect information", "link": "http://arxiv.org/abs/1911.04529", "description": "We study identification of preferences in static single-agent discrete choice\nmodels where decision makers may be imperfectly informed about the state of the\nworld. We leverage the notion of one-player Bayes Correlated Equilibrium by\nBergemann and Morris (2016) to provide a tractable characterization of the\nsharp identified set. We develop a procedure to practically construct the sharp\nidentified set following a sieve approach, and provide sharp bounds on\ncounterfactual outcomes of interest. We use our methodology and data on the\n2017 UK general election to estimate a spatial voting model under weak\nassumptions on agents' information about the returns to voting. Counterfactual\nexercises quantify the consequences of imperfect information on the well-being\nof voters and parties."}, "http://arxiv.org/abs/2312.10333": {"title": "Logit-based alternatives to two-stage least squares", "link": "http://arxiv.org/abs/2312.10333", "description": "We propose logit-based IV and augmented logit-based IV estimators that serve\nas alternatives to the traditionally used 2SLS estimator in the model where\nboth the endogenous treatment variable and the corresponding instrument are\nbinary. Our novel estimators are as easy to compute as the 2SLS estimator but\nhave an advantage over the 2SLS estimator in terms of causal interpretability.\nIn particular, in certain cases where the probability limits of both our\nestimators and the 2SLS estimator take the form of weighted-average treatment\neffects, our estimators are guaranteed to yield non-negative weights whereas\nthe 2SLS estimator is not."}, "http://arxiv.org/abs/2312.10487": {"title": "The Dynamic Triple Gamma Prior as a Shrinkage Process Prior for Time-Varying Parameter Models", "link": "http://arxiv.org/abs/2312.10487", "description": "Many current approaches to shrinkage within the time-varying parameter\nframework assume that each state is equipped with only one innovation variance\nfor all time points. Sparsity is then induced by shrinking this variance\ntowards zero. We argue that this is not sufficient if the states display large\njumps or structural changes, something which is often the case in time series\nanalysis. To remedy this, we propose the dynamic triple gamma prior, a\nstochastic process that has a well-known triple gamma marginal form, while\nstill allowing for autocorrelation. Crucially, the triple gamma has many\ninteresting limiting and special cases (including the horseshoe shrinkage\nprior) which can also be chosen as the marginal distribution. 
Not only is the\nmarginal form well understood, but we also derive many interesting properties of\nthe dynamic triple gamma, which showcase its dynamic shrinkage characteristics.\nWe develop an efficient Markov chain Monte Carlo algorithm to sample from the\nposterior and demonstrate the performance through sparse covariance modeling\nand forecasting of the returns of the components of the EURO STOXX 50 index."}, "http://arxiv.org/abs/2312.10558": {"title": "Some Finite-Sample Results on the Hausman Test", "link": "http://arxiv.org/abs/2312.10558", "description": "This paper shows that the endogeneity test using the control function\napproach in linear instrumental variable models is a variant of the Hausman\ntest. Moreover, we find that the test statistics used in these tests can be\nnumerically ordered, indicating their relative power properties in finite\nsamples."}, "http://arxiv.org/abs/2312.10984": {"title": "Predicting Financial Literacy via Semi-supervised Learning", "link": "http://arxiv.org/abs/2312.10984", "description": "Financial literacy (FL) represents a person's ability to turn assets into\nincome, and understanding digital currencies has been added to the modern\ndefinition. FL can be predicted by exploiting unlabelled recorded data in\nfinancial networks via semi-supervised learning (SSL). Measuring and predicting\nFL has not been widely studied, resulting in limited understanding of customer\nfinancial engagement consequences. Previous studies have shown that low FL\nincreases the risk of social harm. Therefore, it is important to accurately\nestimate FL to allocate specific intervention programs to less financially\nliterate groups. This will not only increase company profitability, but will\nalso reduce government spending. Some studies considered predicting FL in\nclassification tasks, whereas others developed FL definitions and impacts. The\ncurrent paper investigated mechanisms to learn customer FL level from their\nfinancial data using sampling by synthetic minority over-sampling techniques\nfor regression with Gaussian noise (SMOGN). We propose the SMOGN-COREG model\nfor semi-supervised regression, applying SMOGN to deal with unbalanced datasets\nand a nonparametric multi-learner co-regression (COREG) algorithm for labeling.\nWe compared the SMOGN-COREG model with six well-known regressors on five\ndatasets to evaluate the proposed model's effectiveness on unbalanced and\nunlabelled financial data. Experimental results confirmed that the proposed\nmethod outperformed the comparator models for unbalanced and unlabelled\nfinancial data. Therefore, SMOGN-COREG is a step towards using unlabelled data\nto estimate FL level."}, "http://arxiv.org/abs/2312.11283": {"title": "The 2010 Census Confidentiality Protections Failed, Here's How and Why", "link": "http://arxiv.org/abs/2312.11283", "description": "Using only 34 published tables, we reconstruct five variables (census block,\nsex, age, race, and ethnicity) in the confidential 2010 Census person records.\nUsing the 38-bin age variable tabulated at the census block level, at most\n20.1% of reconstructed records can differ from their confidential source on\neven a single value for these five variables. Using only published data, an\nattacker can verify that all records in 70% of all census blocks (97 million\npeople) are perfectly reconstructed. The tabular publications in Summary File 1\nthus have prohibited disclosure risk similar to the unreleased confidential\nmicrodata. 
Reidentification studies confirm that an attacker can, within blocks\nwith perfect reconstruction accuracy, correctly infer the actual census\nresponse on race and ethnicity for 3.4 million vulnerable population uniques\n(persons with nonmodal characteristics) with 95% accuracy, the same precision\nas the confidential data achieve and far greater than statistical baselines.\nThe flaw in the 2010 Census framework was the assumption that aggregation\nprevented accurate microdata reconstruction, justifying weaker disclosure\nlimitation methods than were applied to 2010 Census public microdata. The\nframework used for 2020 Census publications defends against attacks that are\nbased on reconstruction, as we also demonstrate here. Finally, we show that\nalternatives to the 2020 Census Disclosure Avoidance System with similar\naccuracy (enhanced swapping) also fail to protect confidentiality, and those\nthat partially defend against reconstruction attacks (incomplete suppression\nimplementations) destroy the primary statutory use case: data for redistricting\nall legislatures in the country in compliance with the 1965 Voting Rights Act."}, "http://arxiv.org/abs/1811.11603": {"title": "Distribution Regression with Sample Selection, with an Application to Wage Decompositions in the UK", "link": "http://arxiv.org/abs/1811.11603", "description": "We develop a distribution regression model under endogenous sample selection.\nThis model is a semi-parametric generalization of the Heckman selection model.\nIt accommodates much richer effects of the covariates on the outcome distribution\nand patterns of heterogeneity in the selection process, and allows for drastic\ndepartures from the Gaussian error structure, while maintaining the same level of\ntractability as the classical model. The model applies to continuous, discrete\nand mixed outcomes. We provide identification, estimation, and inference\nmethods, and apply them to obtain a wage decomposition for the UK. Here we\ndecompose the difference between the male and female wage distributions into\ncomposition, wage structure, selection structure, and selection sorting\neffects. After controlling for endogenous employment selection, we still find a\nsubstantial gender wage gap -- ranging from 21% to 40% throughout the (latent)\noffered wage distribution -- that is not explained by composition. We also uncover\npositive sorting for single men and negative sorting for married women that\naccounts for a substantive fraction of the gender wage gap at the top of the\ndistribution."}, "http://arxiv.org/abs/2204.01884": {"title": "Policy Learning with Competing Agents", "link": "http://arxiv.org/abs/2204.01884", "description": "Decision makers often aim to learn a treatment assignment policy under a\ncapacity constraint on the number of agents that they can treat. When agents\ncan respond strategically to such policies, competition arises, complicating\nestimation of the optimal policy. In this paper, we study capacity-constrained\ntreatment assignment in the presence of such interference. We consider a\ndynamic model where the decision maker allocates treatments at each time step\nand heterogeneous agents myopically best respond to the previous treatment\nassignment policy. When the number of agents is large but finite, we show that\nthe threshold for receiving treatment under a given policy converges to the\npolicy's mean-field equilibrium threshold. Based on this result, we develop a\nconsistent estimator for the policy gradient. 
In simulations and a\nsemi-synthetic experiment with data from the National Education Longitudinal\nStudy of 1988, we demonstrate that this estimator can be used for learning\ncapacity-constrained policies in the presence of strategic behavior."}, "http://arxiv.org/abs/2204.10359": {"title": "Boundary Adaptive Local Polynomial Conditional Density Estimators", "link": "http://arxiv.org/abs/2204.10359", "description": "We begin by introducing a class of conditional density estimators based on\nlocal polynomial techniques. The estimators are boundary adaptive and easy to\nimplement. We then study the (pointwise and) uniform statistical properties of\nthe estimators, offering characterizations of both probability concentration\nand distributional approximation. In particular, we establish uniform\nconvergence rates in probability and valid Gaussian distributional\napproximations for the Studentized t-statistic process. We also discuss\nimplementation issues such as consistent estimation of the covariance function\nfor the Gaussian approximation, optimal integrated mean squared error bandwidth\nselection, and valid robust bias-corrected inference. We illustrate the\napplicability of our results by constructing valid confidence bands and\nhypothesis tests for both parametric specification and shape constraints,\nexplicitly characterizing their approximation errors. A companion R software\npackage implementing our main results is provided."}, "http://arxiv.org/abs/2312.11710": {"title": "Real-time monitoring with RCA models", "link": "http://arxiv.org/abs/2312.11710", "description": "We propose a family of weighted statistics based on the CUSUM process of the\nWLS residuals for the online detection of changepoints in a Random Coefficient\nAutoregressive model, using both the standard CUSUM and the Page-CUSUM process.\nWe derive the asymptotics under the null of no changepoint for all possible\nweighing schemes, including the case of the standardised CUSUM, for which we\nderive a Darling-Erdos-type limit theorem; our results guarantee the\nprocedure-wise size control under both an open-ended and a closed-ended\nmonitoring. In addition to considering the standard RCA model with no\ncovariates, we also extend our results to the case of exogenous regressors. Our\nresults can be applied irrespective of (and with no prior knowledge required as\nto) whether the observations are stationary or not, and irrespective of whether\nthey change into a stationary or nonstationary regime. Hence, our methodology\nis particularly suited to detect the onset, or the collapse, of a bubble or an\nepidemic. Our simulations show that our procedures, especially when\nstandardising the CUSUM process, can ensure very good size control and short\ndetection delays. We complement our theory by studying the online detection of\nbreaks in epidemiological and housing prices series."}, "http://arxiv.org/abs/2304.05805": {"title": "GDP nowcasting with artificial neural networks: How much does long-term memory matter?", "link": "http://arxiv.org/abs/2304.05805", "description": "In our study, we apply artificial neural networks (ANNs) to nowcast quarterly\nGDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare\nthe nowcasting performance of five different ANN architectures: the multilayer\nperceptron (MLP), the one-dimensional convolutional neural network (1D CNN),\nthe Elman recurrent neural network (RNN), the long short-term memory network\n(LSTM), and the gated recurrent unit (GRU). 
The empirical analysis presents the\nresults from two distinctly different evaluation periods. The first (2012:Q1\n-- 2019:Q4) is characterized by balanced economic growth, while the second\n(2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According\nto our results, longer input sequences result in more accurate nowcasts in\nperiods of balanced economic growth. However, this effect ceases above a\nrelatively low threshold value of around six quarters (eighteen months). During\nperiods of economic turbulence (e.g., during the COVID-19 recession), longer\ninput sequences do not help the models' predictive performance; instead, they\nseem to weaken their generalization capability. Combined results from the two\nevaluation periods indicate that architectural features enabling long-term\nmemory do not result in more accurate nowcasts. On the other hand, the 1D CNN\nhas proved to be a highly suitable model for GDP nowcasting. The network has\nshown good nowcasting performance among the competitors during the first\nevaluation period and achieved the overall best accuracy during the second\nevaluation period. Consequently, we are the first in the literature to propose the\napplication of the 1D CNN for economic nowcasting."}, "http://arxiv.org/abs/2312.12741": {"title": "Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances", "link": "http://arxiv.org/abs/2312.12741", "description": "We address the problem of best arm identification (BAI) with a fixed budget\nfor two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the\nbest arm, an arm with the highest expected reward, through an adaptive\nexperiment. Kaufmann et al. (2016) develop a lower bound for the probability\nof misidentifying the best arm. They also propose a strategy, assuming that the\nvariances of rewards are known, and show that it is asymptotically optimal in\nthe sense that its probability of misidentification matches the lower bound as\nthe budget approaches infinity. However, an asymptotically optimal strategy is\nunknown when the variances are unknown. To address this open issue, we propose a\nstrategy that estimates variances during an adaptive experiment and draws arms\nwith a ratio of the estimated standard deviations. We refer to this strategy as\nthe Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)\nstrategy. We then demonstrate that this strategy is asymptotically optimal by\nshowing that its probability of misidentification matches the lower bound when\nthe budget approaches infinity, and the gap between the expected rewards of two\narms approaches zero (small-gap regime). Our results suggest that under the\nworst-case scenario characterized by the small-gap regime, our strategy, which\nemploys estimated variance, is asymptotically optimal even when the variances\nare unknown."}, "http://arxiv.org/abs/2312.13195": {"title": "Principal Component Copulas for Capital Modelling", "link": "http://arxiv.org/abs/2312.13195", "description": "We introduce a class of copulas that we call Principal Component Copulas.\nThis class intends to combine the strong points of copula-based techniques with\nprincipal component-based models, which results in flexibility when modelling\ntail dependence along the most important directions in multivariate data. The\nproposed techniques have conceptual similarities and technical differences with\nthe increasingly popular class of factor copulas. 
Such copulas can generate\ncomplex dependence structures and also perform well in high dimensions. We show\nthat Principal Component Copulas give rise to practical and technical\nadvantages compared to other techniques. We perform a simulation study and\napply the copula to multivariate return data. The copula class offers the\npossibility to avoid the curse of dimensionality when estimating very large\ncopula models and it performs particularly well on aggregate measures of tail\nrisk, which is of importance for capital modeling."}, "http://arxiv.org/abs/2103.07066": {"title": "Finding Subgroups with Significant Treatment Effects", "link": "http://arxiv.org/abs/2103.07066", "description": "Researchers often run resource-intensive randomized controlled trials (RCTs)\nto estimate the causal effects of interventions on outcomes of interest. Yet\nthese outcomes are often noisy, and estimated overall effects can be small or\nimprecise. Nevertheless, we may still be able to produce reliable evidence of\nthe efficacy of an intervention by finding subgroups with significant effects.\nIn this paper, we propose a machine-learning method that is specifically\noptimized for finding such subgroups in noisy data. Unlike available methods\nfor personalized treatment assignment, our tool is fundamentally designed to\ntake significance testing into account: it produces a subgroup that is chosen\nto maximize the probability of obtaining a statistically significant positive\ntreatment effect. We provide a computationally efficient implementation using\ndecision trees and demonstrate its gain over selecting subgroups based on\npositive (estimated) treatment effects. Compared to standard tree-based\nregression and classification tools, this approach tends to yield higher power\nin detecting subgroups affected by the treatment."}, "http://arxiv.org/abs/2208.06729": {"title": "Optimal Recovery for Causal Inference", "link": "http://arxiv.org/abs/2208.06729", "description": "Problems in causal inference can be fruitfully addressed using signal\nprocessing techniques. As an example, it is crucial to successfully quantify\nthe causal effects of an intervention to determine whether the intervention\nachieved desired outcomes. We present a new geometric signal processing\napproach to classical synthetic control called ellipsoidal optimal recovery\n(EOpR), for estimating the unobservable outcome of a treatment unit. EOpR\nprovides policy evaluators with both worst-case and typical outcomes to help in\ndecision making. It is an approximation-theoretic technique that relates to the\ntheory of principal components, which recovers unknown observations given a\nlearned signal class and a set of known observations. We show EOpR can improve\npre-treatment fit and mitigate bias of the post-treatment estimate relative to\nother methods in causal inference. Beyond recovery of the unit of interest, an\nadvantage of EOpR is that it produces worst-case limits over the estimates\nproduced. 
We assess our approach on artificially-generated data, on datasets\ncommonly used in the econometrics literature, and in the context of the\nCOVID-19 pandemic, showing better performance than baseline techniques"}, "http://arxiv.org/abs/2301.01085": {"title": "The Chained Difference-in-Differences", "link": "http://arxiv.org/abs/2301.01085", "description": "This paper studies the identification, estimation, and inference of long-term\n(binary) treatment effect parameters when balanced panel data is not available,\nor consists of only a subset of the available data. We develop a new estimator:\nthe chained difference-in-differences, which leverages the overlapping\nstructure of many unbalanced panel data sets. This approach consists in\naggregating a collection of short-term treatment effects estimated on multiple\nincomplete panels. Our estimator accommodates (1) multiple time periods, (2)\nvariation in treatment timing, (3) treatment effect heterogeneity, (4) general\nmissing data patterns, and (5) sample selection on observables. We establish\nthe asymptotic properties of the proposed estimator and discuss identification\nand efficiency gains in comparison to existing methods. Finally, we illustrate\nits relevance through (i) numerical simulations, and (ii) an application about\nthe effects of an innovation policy in France."}, "http://arxiv.org/abs/2312.13939": {"title": "Binary Endogenous Treatment in Stochastic Frontier Models with an Application to Soil Conservation in El Salvador", "link": "http://arxiv.org/abs/2312.13939", "description": "Improving the productivity of the agricultural sector is part of one of the\nSustainable Development Goals set by the United Nations. To this end, many\ninternational organizations have funded training and technology transfer\nprograms that aim to promote productivity and income growth, fight poverty and\nenhance food security among smallholder farmers in developing countries.\nStochastic production frontier analysis can be a useful tool when evaluating\nthe effectiveness of these programs. However, accounting for treatment\nendogeneity, often intrinsic to these interventions, only recently has received\nany attention in the stochastic frontier literature. In this work, we extend\nthe classical maximum likelihood estimation of stochastic production frontier\nmodels by allowing both the production frontier and inefficiency to depend on a\npotentially endogenous binary treatment. We use instrumental variables to\ndefine an assignment mechanism for the treatment, and we explicitly model the\ndensity of the first and second-stage composite error terms. We provide\nempirical evidence of the importance of controlling for endogeneity in this\nsetting using farm-level data from a soil conservation program in El Salvador."}, "http://arxiv.org/abs/2312.14095": {"title": "RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation", "link": "http://arxiv.org/abs/2312.14095", "description": "Significant research effort has been devoted in recent years to developing\npersonalized pricing, promotions, and product recommendation algorithms that\ncan leverage rich customer data to learn and earn. Systematic benchmarking and\nevaluation of these causal learning systems remains a critical challenge, due\nto the lack of suitable datasets and simulation environments. In this work, we\npropose a multi-stage model for simulating customer shopping behavior that\ncaptures important sources of heterogeneity, including price sensitivity and\npast experiences. 
We embedded this model into a working simulation environment\n-- RetailSynth. RetailSynth was carefully calibrated on publicly available\ngrocery data to create realistic synthetic shopping transactions. Multiple\npricing policies were implemented within the simulator and analyzed for impact\non revenue, category penetration, and customer retention. Applied researchers\ncan use RetailSynth to validate causal demand models for multi-category retail\nand to incorporate realistic price sensitivity into emerging benchmarking\nsuites for personalized pricing, promotions, and product recommendations."}, "http://arxiv.org/abs/2201.06898": {"title": "Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period", "link": "http://arxiv.org/abs/2201.06898", "description": "We propose difference-in-differences estimators for continuous treatments\nwith heterogeneous effects. We assume that between consecutive periods, the\ntreatment of some units, the switchers, changes, while the treatment of other\nunits does not change. We show that under a parallel trends assumption, an\nunweighted and a weighted average of the slopes of switchers' potential\noutcomes can be estimated. While the former parameter may be more intuitive,\nthe latter can be used for cost-benefit analysis, and it can often be estimated\nmore precisely. We generalize our estimators to the instrumental-variable case.\nWe use our results to estimate the price-elasticity of gasoline consumption."}, "http://arxiv.org/abs/2211.14236": {"title": "Strategyproof Decision-Making in Panel Data Settings and Beyond", "link": "http://arxiv.org/abs/2211.14236", "description": "We consider the problem of decision-making using panel data, in which a\ndecision-maker gets noisy, repeated measurements of multiple units (or agents).\nWe consider a setup where there is a pre-intervention period, when the\nprincipal observes the outcomes of each unit, after which the principal uses\nthese observations to assign a treatment to each unit. Unlike this classical\nsetting, we permit the units generating the panel data to be strategic, i.e.\nunits may modify their pre-intervention outcomes in order to receive a more\ndesirable intervention. The principal's goal is to design a strategyproof\nintervention policy, i.e. a policy that assigns units to their\nutility-maximizing interventions despite their potential strategizing. We first\nidentify a necessary and sufficient condition under which a strategyproof\nintervention policy exists, and provide a strategyproof mechanism with a simple\nclosed form when one does exist. Along the way, we prove impossibility results\nfor strategic multiclass classification, which may be of independent interest.\nWhen there are two interventions, we establish that there always exists a\nstrategyproof mechanism, and provide an algorithm for learning such a\nmechanism. For three or more interventions, we provide an algorithm for\nlearning a strategyproof mechanism if there exists a sufficiently large gap in\nthe principal's rewards between different interventions. Finally, we\nempirically evaluate our model using real-world panel data collected from\nproduct sales over 18 months. 
We find that our methods compare favorably to\nbaselines which do not take strategic interactions into consideration, even in\nthe presence of model misspecification."}, "http://arxiv.org/abs/2312.14191": {"title": "Noisy Measurements Are Important, the Design of Census Products Is Much More Important", "link": "http://arxiv.org/abs/2312.14191", "description": "McCartan et al. (2023) call for \"making differential privacy work for census\ndata users.\" This commentary explains why the 2020 Census Noisy Measurement\nFiles (NMFs) are not the best focus for that plea. The August 2021 letter from\n62 prominent researchers asking for production of the direct output of the\ndifferential privacy system deployed for the 2020 Census signaled the\nengagement of the scholarly community in the design of decennial census data\nproducts. NMFs, the raw statistics produced by the 2020 Census Disclosure\nAvoidance System before any post-processing, are one component of that\ndesign--the query strategy output. The more important component is the query\nworkload output--the statistics released to the public. Optimizing the query\nworkload--the Redistricting Data (P.L. 94-171) Summary File,\nspecifically--could allow the privacy-loss budget to be more effectively\nmanaged. There could be fewer noisy measurements, no post-processing bias, and\ndirect estimates of the uncertainty from disclosure avoidance for each\npublished statistic."}, "http://arxiv.org/abs/2312.14325": {"title": "Exploring Distributions of House Prices and House Price Indices", "link": "http://arxiv.org/abs/2312.14325", "description": "We use house prices (HP) and house price indices (HPI) as a proxy to income\ndistribution. Specifically, we analyze sale prices in the 1970-2010 window of\nover 116,000 single-family homes in Hamilton County, Ohio, including Cincinnati\nmetro area of about 2.2 million people. We also analyze HPI, published by\nFederal Housing Finance Agency (FHFA), for nearly 18,000 US ZIP codes that\ncover a period of over 40 years starting in 1980's. If HP can be viewed as a\nfirst derivative of income, HPI can be viewed as its second derivative. We use\ngeneralized beta (GB) family of functions to fit distributions of HP and HPI\nsince GB naturally arises from the models of economic exchange described by\nstochastic differential equations. Our main finding is that HP and multi-year\nHPI exhibit a negative Dragon King (nDK) behavior, wherein power-law\ndistribution tail gives way to an abrupt decay to a finite upper limit value,\nwhich is similar to our recent findings for realized volatility of S\\&P500\nindex in the US stock market. This type of tail behavior is best fitted by a\nmodified GB (mGB) distribution. Tails of single-year HPI appear to show more\nconsistency with power-law behavior, which is better described by a GB Prime\n(GB2) distribution. We supplement full distribution fits by mGB and GB2 with\ndirect linear fits (LF) of the tails. Our numerical procedure relies on\nevaluation of confidence intervals (CI) of the fits, as well as of p-values\nthat give the likelihood that data come from the fitted distributions."}, "http://arxiv.org/abs/2207.09943": {"title": "Efficient Bias Correction for Cross-section and Panel Data", "link": "http://arxiv.org/abs/2207.09943", "description": "Bias correction can often improve the finite sample performance of\nestimators. 
We show that the choice of bias correction method has no effect on\nthe higher-order variance of semiparametrically efficient parametric\nestimators, so long as the estimate of the bias is asymptotically linear. It is\nalso shown that bootstrap, jackknife, and analytical bias estimates are\nasymptotically linear for estimators with higher-order expansions of a standard\nform. In particular, we find that for a variety of estimators the\nstraightforward bootstrap bias correction gives the same higher-order variance\nas more complicated analytical or jackknife bias corrections. In contrast, bias\ncorrections that do not estimate the bias at the parametric rate, such as the\nsplit-sample jackknife, result in larger higher-order variances in the i.i.d.\nsetting we focus on. For both a cross-sectional MLE and a panel model with\nindividual fixed effects, we show that the split-sample jackknife has a\nhigher-order variance term that is twice as large as that of the\n`leave-one-out' jackknife."}, "http://arxiv.org/abs/2312.15119": {"title": "Functional CLTs for subordinated stable L\\'evy models in physics, finance, and econometrics", "link": "http://arxiv.org/abs/2312.15119", "description": "We present a simple unifying treatment of a large class of applications from\nstatistical mechanics, econometrics, mathematical finance, and insurance\nmathematics, where stable (possibly subordinated) L\\'evy noise arises as a\nscaling limit of some form of continuous-time random walk (CTRW). For each\napplication, it is natural to rely on weak convergence results for stochastic\nintegrals on Skorokhod space in Skorokhod's J1 or M1 topologies. As compared to\nearlier and entirely separate works, we are able to give a more streamlined\naccount while also allowing for greater generality and providing important new\ninsights. For each application, we first make clear how the fundamental\nconclusions for J1 convergent CTRWs emerge as special cases of the same general\nprinciples, and we then illustrate how the specific settings give rise to\ndifferent results for strictly M1 convergent CTRWs."}, "http://arxiv.org/abs/2312.15494": {"title": "Variable Selection in High Dimensional Linear Regressions with Parameter Instability", "link": "http://arxiv.org/abs/2312.15494", "description": "This paper is concerned with the problem of variable selection in the\npresence of parameter instability when both the marginal effects of signals on\nthe target variable and the correlations of the covariates in the active set\ncould vary over time. We pose the issue of whether one should use weighted or\nunweighted observations at the variable selection stage in the presence of\nparameter instability, particularly when the number of potential covariates is\nlarge. We allow parameter instability to be continuous or discrete, subject to\ncertain regularity conditions. We discuss the pros and cons of Lasso and the\nOne Covariate at a time Multiple Testing (OCMT) method for variable selection\nand argue that OCMT has important advantages under parameter instability. We\nestablish three main theorems on selection, estimation post selection, and\nin-sample fit. These theorems provide justification for using unweighted\nobservations at the selection stage of OCMT and down-weighting of observations\nonly at the forecasting stage. 
It is shown that OCMT delivers better forecasts,\nin mean squared error sense, as compared to Lasso, Adaptive Lasso and boosting\nboth in Monte Carlo experiments as well as in 3 sets of empirical applications:\nforecasting monthly returns on 28 stocks from Dow Jones , forecasting quarterly\noutput growths across 33 countries, and forecasting euro area output growth\nusing surveys of professional forecasters."}, "http://arxiv.org/abs/2312.15524": {"title": "The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective", "link": "http://arxiv.org/abs/2312.15524", "description": "Large Language Models (LLMs) have demonstrated impressive potential to\nsimulate human behavior. Using a causal inference framework, we empirically and\ntheoretically analyze the challenges of conducting LLM-simulated experiments,\nand explore potential solutions. In the context of demand estimation, we show\nthat variations in the treatment included in the prompt (e.g., price of focal\nproduct) can cause variations in unspecified confounding factors (e.g., price\nof competitors, historical prices, outside temperature), introducing\nendogeneity and yielding implausibly flat demand curves. We propose a\ntheoretical framework suggesting this endogeneity issue generalizes to other\ncontexts and won't be fully resolved by merely improving the training data.\nUnlike real experiments where researchers assign pre-existing units across\nconditions, LLMs simulate units based on the entire prompt, which includes the\ndescription of the treatment. Therefore, due to associations in the training\ndata, the characteristics of individuals and environments simulated by the LLM\ncan be affected by the treatment assignment. We explore two potential\nsolutions. The first specifies all contextual variables that affect both\ntreatment and outcome, which we demonstrate to be challenging for a\ngeneral-purpose LLM. The second explicitly specifies the source of treatment\nvariation in the prompt given to the LLM (e.g., by informing the LLM that the\nstore is running an experiment). While this approach only allows the estimation\nof a conditional average treatment effect that depends on the specific\nexperimental design, it provides valuable directional results for exploratory\nanalysis."}, "http://arxiv.org/abs/2312.15595": {"title": "Zero-Inflated Bandits", "link": "http://arxiv.org/abs/2312.15595", "description": "Many real applications of bandits have sparse non-zero rewards, leading to\nslow learning rates. A careful distribution modeling that utilizes\nproblem-specific structures is known as critical to estimation efficiency in\nthe statistics literature, yet is under-explored in bandits. To fill the gap,\nwe initiate the study of zero-inflated bandits, where the reward is modeled as\na classic semi-parametric distribution called zero-inflated distribution. We\ncarefully design Upper Confidence Bound (UCB) and Thompson Sampling (TS)\nalgorithms for this specific structure. Our algorithms are suitable for a very\ngeneral class of reward distributions, operating under tail assumptions that\nare considerably less stringent than the typical sub-Gaussian requirements.\nTheoretically, we derive the regret bounds for both the UCB and TS algorithms\nfor multi-armed bandit, showing that they can achieve rate-optimal regret when\nthe reward distribution is sub-Gaussian. 
The superior empirical performance of\nthe proposed methods is shown via extensive numerical studies."}, "http://arxiv.org/abs/2312.15624": {"title": "Negative Controls for Instrumental Variable Designs", "link": "http://arxiv.org/abs/2312.15624", "description": "Studies using instrumental variables (IV) often assess the validity of their\nidentification assumptions using falsification tests. However, these tests are\noften carried out in an ad-hoc manner, without theoretical foundations. In this\npaper, we establish a theoretical framework for negative control tests, the\npredominant category of falsification tests for IV designs. These tests are\nconditional independence tests between negative control variables and either\nthe IV or the outcome (e.g., examining the ``effect'' on the lagged outcome).\nWe introduce a formal definition for threats to IV exogeneity (alternative path\nvariables) and characterize the necessary conditions that proxy variables for\nsuch unobserved threats must meet to serve as negative controls. The theory\nhighlights prevalent errors in the implementation of negative control tests and\nhow they could be corrected. Our theory can also be used to design new\nfalsification tests by identifying appropriate negative control variables,\nincluding currently underutilized types, and suggesting alternative statistical\ntests. The theory shows that all negative control tests assess IV exogeneity.\nHowever, some commonly used tests simultaneously evaluate the 2SLS functional\nform assumptions. Lastly, we show that while negative controls are useful for\ndetecting biases in IV designs, their capacity to correct or quantify such\nbiases requires additional non-trivial assumptions."}, "http://arxiv.org/abs/2312.15999": {"title": "Pricing with Contextual Elasticity and Heteroscedastic Valuation", "link": "http://arxiv.org/abs/2312.15999", "description": "We study an online contextual dynamic pricing problem, where customers decide\nwhether to purchase a product based on its features and price. We introduce a\nnovel approach to modeling a customer's expected demand by incorporating\nfeature-based price elasticity, which can be equivalently represented as a\nvaluation with heteroscedastic noise. To solve the problem, we propose a\ncomputationally efficient algorithm called \"Pricing with Perturbation (PwP)\",\nwhich enjoys an $O(\\sqrt{dT\\log T})$ regret while allowing arbitrary\nadversarial input context sequences. We also prove a matching lower bound at\n$\\Omega(\\sqrt{dT})$ to show the optimality regarding $d$ and $T$ (up to $\\log\nT$ factors). Our results shed light on the relationship between contextual\nelasticity and heteroscedastic valuation, providing insights for effective and\npractical pricing strategies."}, "http://arxiv.org/abs/2312.16099": {"title": "Direct Multi-Step Forecast based Comparison of Nested Models via an Encompassing Test", "link": "http://arxiv.org/abs/2312.16099", "description": "We introduce a novel approach for comparing out-of-sample multi-step\nforecasts obtained from a pair of nested models that is based on the forecast\nencompassing principle. Our proposed approach relies on an alternative way of\ntesting the population moment restriction implied by the forecast encompassing\nprinciple and that links the forecast errors from the two competing models in a\nparticular way. Its key advantage is that it is able to bypass the variance\ndegeneracy problem afflicting model based forecast comparisons across nested\nmodels. 
It results in a test statistic whose limiting distribution is standard\nnormal and which is particularly simple to construct and can accommodate both\nsingle period and longer-horizon prediction comparisons. Inferences are also\nshown to be robust to different predictor types, including stationary,\nhighly-persistent and purely deterministic processes. Finally, we illustrate\nthe use of our proposed approach through an empirical application that explores\nthe role of global inflation in enhancing individual country specific inflation\nforecasts."}, "http://arxiv.org/abs/2010.05117": {"title": "Combining Observational and Experimental Data to Improve Efficiency Using Imperfect Instruments", "link": "http://arxiv.org/abs/2010.05117", "description": "Randomized controlled trials generate experimental variation that can\ncredibly identify causal effects, but often suffer from limited scale, while\nobservational datasets are large, but often violate desired identification\nassumptions. To improve estimation efficiency, I propose a method that\nleverages imperfect instruments - pretreatment covariates that satisfy the\nrelevance condition but may violate the exclusion restriction. I show that\nthese imperfect instruments can be used to derive moment restrictions that, in\ncombination with the experimental data, improve estimation efficiency. I\noutline estimators for implementing this strategy, and show that my methods can\nreduce variance by up to 50%; therefore, only half of the experimental sample\nis required to attain the same statistical precision. I apply my method to a\nsearch listing dataset from Expedia that studies the causal effect of search\nrankings on clicks, and show that the method can substantially improve the\nprecision."}, "http://arxiv.org/abs/2105.12891": {"title": "Identification and Estimation of Partial Effects in Nonlinear Semiparametric Panel Models", "link": "http://arxiv.org/abs/2105.12891", "description": "Average partial effects (APEs) are often not point identified in panel models\nwith unrestricted unobserved heterogeneity, such as binary response panel model\nwith fixed effects and logistic errors. This lack of point-identification\noccurs despite the identification of these models' common coefficients. We\nprovide a unified framework to establish the point identification of various\npartial effects in a wide class of nonlinear semiparametric models under an\nindex sufficiency assumption on the unobserved heterogeneity, even when the\nerror distribution is unspecified and non-stationary. This assumption does not\nimpose parametric restrictions on the unobserved heterogeneity and\nidiosyncratic errors. We also present partial identification results when the\nsupport condition fails. We then propose three-step semiparametric estimators\nfor the APE, the average structural function, and average marginal effects, and\nshow their consistency and asymptotic normality. Finally, we illustrate our\napproach in a study of determinants of married women's labor supply."}, "http://arxiv.org/abs/2212.11833": {"title": "Efficient Sampling for Realized Variance Estimation in Time-Changed Diffusion Models", "link": "http://arxiv.org/abs/2212.11833", "description": "This paper analyzes the benefits of sampling intraday returns in intrinsic\ntime for the standard and pre-averaging realized variance (RV) estimators. 
We\ntheoretically show in finite samples and asymptotically that the RV estimator\nis most efficient under the new concept of realized business time, which\nsamples according to a combination of observed trades and estimated tick\nvariance. Our asymptotic results carry over to the pre-averaging RV estimator\nunder market microstructure noise. The analysis builds on the assumption that\nasset prices follow a diffusion that is time-changed with a jump process that\nseparately models the transaction times. This provides a flexible model that\nseparately captures the empirically varying trading intensity and tick variance\nprocesses, which are particularly relevant for disentangling the driving forces\nof the sampling schemes. Extensive simulations confirm our theoretical results\nand show that realized business time remains superior also under more general\nnoise and process specifications. An application to stock data provides\nempirical evidence for the benefits of using realized business time sampling to\nconstruct more efficient RV estimators as well as for an improved forecasting\nperformance."}, "http://arxiv.org/abs/2301.05580": {"title": "Randomization Test for the Specification of Interference Structure", "link": "http://arxiv.org/abs/2301.05580", "description": "This study considers testing the specification of spillover effects in causal\ninference. We focus on experimental settings in which the treatment assignment\nmechanism is known to researchers. We develop a new randomization test\nutilizing a hierarchical relationship between different exposures. Compared\nwith existing approaches, our approach is essentially applicable to any null\nexposure specifications and produces powerful test statistics without a priori\nknowledge of the true interference structure. As empirical illustrations, we\nrevisit two existing social network experiments: one on farmers' insurance\nadoption and the other on anti-conflict education programs."}, "http://arxiv.org/abs/2312.16214": {"title": "Stochastic Equilibrium the Lucas Critique and Keynesian Economics", "link": "http://arxiv.org/abs/2312.16214", "description": "In this paper, a mathematically rigorous solution overturns existing wisdom\nregarding New Keynesian Dynamic Stochastic General Equilibrium. I develop a\nformal concept of stochastic equilibrium. I prove uniqueness and necessity,\nwhen agents are patient, across a wide class of dynamic stochastic models.\nExistence depends on appropriately specified eigenvalue conditions. Otherwise,\nno solution of any kind exists. I construct the equilibrium for the benchmark\nCalvo New Keynesian. I provide novel comparative statics with the\nnon-stochastic model of independent mathematical interest. I uncover a\nbifurcation between neighbouring stochastic systems and approximations taken\nfrom the Zero Inflation Non-Stochastic Steady State (ZINSS). The correct\nPhillips curve agrees with the zero limit from the trend inflation framework.\nIt contains a large lagged inflation coefficient and a small response to\nexpected inflation. The response to the output gap is always muted and is zero\nat standard parameters. A neutrality result is presented to explain why and to\nalign Calvo with Taylor pricing. Present and lagged demand shocks enter the\nPhillips curve so there is no Divine Coincidence and the system is identified\nfrom structural shocks alone. The lagged inflation slope is increasing in the\ninflation response, embodying substantive policy trade-offs. 
The Taylor\nprinciple is reversed, inactive settings are necessary for existence, pointing\ntowards inertial policy. The observational equivalence idea of the Lucas\ncritique is disproven. The bifurcation results from the breakdown of the\nconstraints implied by lagged nominal rigidity, associated with cross-equation\ncancellation possible only at ZINSS. There is a dual relationship between\nrestrictions on the econometrician and constraints on repricing firms. Thus if\nthe model is correct, goodness of fit will jump."}, "http://arxiv.org/abs/2312.16307": {"title": "Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration", "link": "http://arxiv.org/abs/2312.16307", "description": "We consider a panel data setting in which one observes measurements of units\nover time, under different interventions. Our focus is on the canonical family\nof synthetic control methods (SCMs) which, after a pre-intervention time period\nwhen all units are under control, estimate counterfactual outcomes for test\nunits in the post-intervention time period under control by using data from\ndonor units who have remained under control for the entire post-intervention\nperiod. In order for the counterfactual estimate produced by synthetic control\nfor a test unit to be accurate, there must be sufficient overlap between the\noutcomes of the donor units and the outcomes of the test unit. As a result, a\ncanonical assumption in the literature on SCMs is that the outcomes for the\ntest units lie within either the convex hull or the linear span of the outcomes\nfor the donor units. However despite their ubiquity, such overlap assumptions\nmay not always hold, as is the case when, e.g., units select their own\ninterventions and different subpopulations of units prefer different\ninterventions a priori.\n\nWe shed light on this typically overlooked assumption, and we address this\nissue by incentivizing units with different preferences to take interventions\nthey would not normally consider. Specifically, we provide a SCM for\nincentivizing exploration in panel data settings which provides\nincentive-compatible intervention recommendations to units by leveraging tools\nfrom information design and online learning. Using our algorithm, we show how\nto obtain valid counterfactual estimates using SCMs without the need for an\nexplicit overlap assumption on the unit outcomes."}, "http://arxiv.org/abs/2312.16489": {"title": "Best-of-Both-Worlds Linear Contextual Bandits", "link": "http://arxiv.org/abs/2312.16489", "description": "This study investigates the problem of $K$-armed linear contextual bandits,\nan instance of the multi-armed bandit problem, under an adversarial corruption.\nAt each round, a decision-maker observes an independent and identically\ndistributed context and then selects an arm based on the context and past\nobservations. After selecting an arm, the decision-maker incurs a loss\ncorresponding to the selected arm. The decision-maker aims to minimize the\ncumulative loss over the trial. The goal of this study is to develop a strategy\nthat is effective in both stochastic and adversarial environments, with\ntheoretical guarantees. We first formulate the problem by introducing a novel\nsetting of bandits with adversarial corruption, referred to as the contextual\nadversarial regime with a self-bounding constraint. We assume linear models for\nthe relationship between the loss and the context. 
Then, we propose a strategy\nthat extends the RealLinExp3 by Neu & Olkhovskaya (2020) and the\nFollow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is\nshown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}}\n+ \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \\\n\sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of\nrounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and\nsuboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption\nparameter. This regret upper bound implies\n$O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and\n$O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer\nto our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its\ntheoretical guarantees in both stochastic and adversarial regimes."}, "http://arxiv.org/abs/2312.16707": {"title": "Modeling Systemic Risk: A Time-Varying Nonparametric Causal Inference Framework", "link": "http://arxiv.org/abs/2312.16707", "description": "We propose a nonparametric and time-varying directed information graph\n(TV-DIG) framework to estimate the evolving causal structure in time series\nnetworks, thereby addressing the limitations of traditional econometric models\nin capturing high-dimensional, nonlinear, and time-varying interconnections\namong series. This framework employs an information-theoretic measure rooted in\na generalized version of Granger-causality, which is applicable to both linear\nand nonlinear dynamics. Our framework offers advancements in measuring systemic\nrisk and establishes meaningful connections with established econometric\nmodels, including vector autoregression and switching models. We evaluate the\nefficacy of our proposed model through simulation experiments and empirical\nanalysis, reporting promising results in recovering simulated time-varying\nnetworks with nonlinear and multivariate structures. We apply this framework to\nidentify and monitor the evolution of interconnectedness and systemic risk\namong major assets and industrial sectors within the financial network. We\nfocus on cryptocurrencies' potential systemic risks to financial stability,\nincluding spillover effects on other sectors during crises like the COVID-19\npandemic and the Federal Reserve's 2020 emergency response. Our findings\nreveal significant, previously underrecognized pre-2020 influences of\ncryptocurrencies on certain financial sectors, highlighting their potential\nsystemic risks and offering a systematic approach to tracking evolving\ncross-sector interactions within financial networks."}, "http://arxiv.org/abs/2312.16927": {"title": "Development of Choice Model for Brand Evaluation", "link": "http://arxiv.org/abs/2312.16927", "description": "Consumer choice modeling takes center stage as we delve into understanding\nhow personal preferences of decision makers (customers) for products influence\ndemand at the level of the individual. The contemporary choice theory is built\nupon the characteristics of the decision maker, alternatives available for the\nchoice of the decision maker, the attributes of the available alternatives and\ndecision rules that the decision maker uses to make a choice. The choice set in\nour research is represented by six major brands (products) of laundry\ndetergents in the Japanese market. 
We use the panel data of the purchases of 98\nhouseholds to which we apply the hierarchical probit model, facilitated by a\nMarkov Chain Monte Carlo simulation (MCMC) in order to evaluate the brand\nvalues of six brands. The applied model also allows us to evaluate the tangible\nand intangible brand values. These evaluated metrics help us to assess the\nbrands based on their tangible and intangible characteristics. Moreover,\nconsumer choice modeling also provides a framework for assessing the\nenvironmental performance of laundry detergent brands as the model uses the\ninformation on components (physical attributes) of laundry detergents."}, "http://arxiv.org/abs/2312.17061": {"title": "Bayesian Analysis of High Dimensional Vector Error Correction Model", "link": "http://arxiv.org/abs/2312.17061", "description": "Vector Error Correction Model (VECM) is a classic method to analyse\ncointegration relationships amongst multivariate non-stationary time series. In\nthis paper, we focus on high dimensional setting and seek for\nsample-size-efficient methodology to determine the level of cointegration. Our\ninvestigation centres at a Bayesian approach to analyse the cointegration\nmatrix, henceforth determining the cointegration rank. We design two algorithms\nand implement them on simulated examples, yielding promising results\nparticularly when dealing with high number of variables and relatively low\nnumber of observations. Furthermore, we extend this methodology to empirically\ninvestigate the constituents of the S&P 500 index, where low-volatility\nportfolios can be found during both in-sample training and out-of-sample\ntesting periods."}, "http://arxiv.org/abs/1903.08028": {"title": "State-Building through Public Land Disposal? An Application of Matrix Completion for Counterfactual Prediction", "link": "http://arxiv.org/abs/1903.08028", "description": "This paper examines how homestead policies, which opened vast frontier lands\nfor settlement, influenced the development of American frontier states. It uses\na treatment propensity-weighted matrix completion model to estimate the\ncounterfactual size of these states without homesteading. In simulation\nstudies, the method shows lower bias and variance than other estimators,\nparticularly in higher complexity scenarios. The empirical analysis reveals\nthat homestead policies significantly and persistently reduced state government\nexpenditure and revenue. These findings align with continuous\ndifference-in-differences estimates using 1.46 million land patent records.\nThis study's extension of the matrix completion method to include propensity\nscore weighting for causal effect estimation in panel data, especially in\nstaggered treatment contexts, enhances policy evaluation by improving the\nprecision of long-term policy impact assessments."}, "http://arxiv.org/abs/2011.08174": {"title": "Policy design in experiments with unknown interference", "link": "http://arxiv.org/abs/2011.08174", "description": "This paper studies experimental designs for estimation and inference on\npolicies with spillover effects. Units are organized into a finite number of\nlarge clusters and interact in unknown ways within each cluster. First, we\nintroduce a single-wave experiment that, by varying the randomization across\ncluster pairs, estimates the marginal effect of a change in treatment\nprobabilities, taking spillover effects into account. Using the marginal\neffect, we propose a test for policy optimality. 
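The homestead-policy abstract above uses a matrix completion model to impute counterfactual state outcomes. Purely for orientation, here is a minimal soft-impute sketch of the generic matrix completion idea in Python/NumPy; it is not the paper's propensity-weighted estimator, and the function name, penalty, and toy data are illustrative assumptions.

```python
import numpy as np

def soft_impute(Y, observed, lam=0.5, n_iter=200):
    """Fill in unobserved entries of a panel Y by iterative SVD soft-thresholding.
    `observed` is a boolean mask; treated unit-time cells would be marked False."""
    X = np.where(observed, Y, 0.0)                    # initial fill with zeros
    for _ in range(n_iter):
        Z = np.where(observed, Y, X)                  # keep observed cells fixed
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                  # shrink singular values
        X = (U * s) @ Vt                              # low-rank reconstruction
    return X

# toy usage: a rank-2 panel with a block of "treated" (missing) cells
rng = np.random.default_rng(0)
truth = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
mask = np.ones_like(truth, dtype=bool)
mask[:5, 20:] = False                                 # post-treatment cells to impute
estimate = soft_impute(truth, mask)
print("imputation RMSE:", np.sqrt(((estimate - truth)[~mask] ** 2).mean()))
```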
Second, we design a\nmultiple-wave experiment to estimate welfare-maximizing treatment rules. We\nprovide strong theoretical guarantees and an implementation in a large-scale\nfield experiment."}, "http://arxiv.org/abs/2104.13367": {"title": "When should you adjust inferences for multiple hypothesis testing?", "link": "http://arxiv.org/abs/2104.13367", "description": "Multiple hypothesis testing practices vary widely, without consensus on which\nare appropriate when. We provide an economic foundation for these practices. In\nstudies of multiple treatments or sub-populations, adjustments may be\nappropriate depending on scale economies in the research production function,\nwith control of classical notions of compound errors emerging in some but not\nall cases. Studies with multiple outcomes motivate testing using a single\nindex, or adjusted tests of several indices when the intended audience is\nheterogeneous. Data on actual research costs suggest both that some adjustment\nis warranted and that standard procedures are overly conservative."}, "http://arxiv.org/abs/2303.11777": {"title": "Quasi Maximum Likelihood Estimation of High-Dimensional Factor Models: A Critical Review", "link": "http://arxiv.org/abs/2303.11777", "description": "We review Quasi Maximum Likelihood estimation of factor models for\nhigh-dimensional panels of time series. We consider two cases: (1) estimation\nwhen no dynamic model for the factors is specified \\citep{baili12,baili16}; (2)\nestimation based on the Kalman smoother and the Expectation Maximization\nalgorithm, thus allowing us to model the factor dynamics explicitly\n\\citep{DGRqml,BLqml}. Our interest is in approximate factor models, i.e., when\nwe allow for the idiosyncratic components to be mildly cross-sectionally, as\nwell as serially, correlated. Although such a setting apparently makes estimation\nharder, we show, in fact, that factor models do not suffer from the {\\it curse of\ndimensionality} problem, but instead they enjoy a {\\it blessing of\ndimensionality} property. In particular, given an approximate factor structure,\nif the cross-sectional dimension of the data, $N$, grows to infinity, we show\nthat: (i) identification of the model is still possible, and (ii) the\nmis-specification error due to the use of an exact factor model log-likelihood\nvanishes. Moreover, if we also let the sample size, $T$, grow to infinity, we\ncan consistently estimate all parameters of the model and make inference.\nThe same is true for estimation of the latent factors, which can be carried out\nby weighted least-squares, linear projection, or Kalman filtering/smoothing. We\nalso compare the presented approaches with Principal Component analysis and\nthe classical, fixed $N$, exact Maximum Likelihood approach. We conclude with a\ndiscussion on efficiency of the considered estimators."}, "http://arxiv.org/abs/2305.08559": {"title": "Designing Discontinuities", "link": "http://arxiv.org/abs/2305.08559", "description": "Discontinuities can be fairly arbitrary but also cause a significant impact\non outcomes in larger systems. Indeed, their arbitrariness is why they have\nbeen used to infer causal relationships among variables in numerous settings.\nRegression discontinuity from econometrics assumes the existence of a\ndiscontinuous variable that splits the population into distinct partitions to\nestimate the causal effects of a given phenomenon. 
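The factor-model review that concludes just above compares likelihood-based estimation with Principal Component analysis. For orientation only, here is a minimal principal-components factor extraction for an approximate factor model; this is the standard textbook construction, not anything from that paper, and the function name and normalisation are my own.

```python
import numpy as np

def pc_factors(X, r):
    """Principal-components estimates of r factors and loadings from a
    T x N panel X (series should be demeaned/standardised beforehand)."""
    T, N = X.shape
    eigval, eigvec = np.linalg.eigh(X @ X.T / (T * N))   # T x T eigenproblem
    idx = np.argsort(eigval)[::-1][:r]                   # r largest eigenvalues
    F = np.sqrt(T) * eigvec[:, idx]                      # factors with F'F/T = I
    L = X.T @ F / T                                      # loadings from OLS on F
    return F, L

# toy usage: N = 100 series driven by 2 common factors plus idiosyncratic noise
rng = np.random.default_rng(0)
F0, L0 = rng.normal(size=(200, 2)), rng.normal(size=(100, 2))
X = F0 @ L0.T + rng.normal(scale=0.5, size=(200, 100))
F_hat, L_hat = pc_factors(X, r=2)
print("share of variance explained by the common component:",
      1 - ((X - F_hat @ L_hat.T) ** 2).sum() / (X ** 2).sum())
```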
Here we consider the design\nof partitions for a given discontinuous variable to optimize a certain effect\npreviously studied using regression discontinuity. To do so, we propose a\nquantization-theoretic approach to optimize the effect of interest, first\nlearning the causal effect size of a given discontinuous variable and then\napplying dynamic programming for optimal quantization design of discontinuities\nto balance the gain and loss in that effect size. We also develop a\ncomputationally-efficient reinforcement learning algorithm for the dynamic\nprogramming formulation of optimal quantization. We demonstrate our approach by\ndesigning optimal time zone borders for counterfactuals of social capital,\nsocial mobility, and health. This is based on regression discontinuity analyses\nwe perform on novel data, which may be of independent empirical interest."}, "http://arxiv.org/abs/2309.09299": {"title": "Bounds on Average Effects in Discrete Choice Panel Data Models", "link": "http://arxiv.org/abs/2309.09299", "description": "In discrete choice panel data, the estimation of average effects is crucial\nfor quantifying the effect of covariates, and for policy evaluation and\ncounterfactual analysis. This task is challenging in short panels with\nindividual-specific effects due to partial identification and the incidental\nparameter problem. While consistent estimation of the identified set is\npossible, it generally requires very large sample sizes, especially when the\nnumber of support points of the observed covariates is large, such as when the\ncovariates are continuous. In this paper, we propose estimating outer bounds on\nthe identified set of average effects. Our bounds are easy to construct,\nconverge at the parametric rate, and are computationally simple to obtain even\nin moderately large samples, independent of whether the covariates are discrete\nor continuous. We also provide asymptotically valid confidence intervals on the\nidentified set. Simulation studies confirm that our approach works well and is\ninformative in finite samples. We also consider an application to labor force\nparticipation."}, "http://arxiv.org/abs/2312.17623": {"title": "Decision Theory for Treatment Choice Problems with Partial Identification", "link": "http://arxiv.org/abs/2312.17623", "description": "We apply classical statistical decision theory to a large class of treatment\nchoice problems with partial identification, revealing important theoretical\nand practical challenges but also interesting research opportunities. The\nchallenges are: In a general class of problems with Gaussian likelihood, all\ndecision rules are admissible; it is maximin-welfare optimal to ignore all\ndata; and, for severe enough partial identification, there are infinitely many\nminimax-regret optimal decision rules, all of which sometimes randomize the\npolicy recommendation. The opportunities are: We introduce a profiled regret\ncriterion that can reveal important differences between rules and render some\nof them inadmissible; and we uniquely characterize the minimax-regret optimal\nrule that least frequently randomizes. 
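The Designing Discontinuities abstract above starts from an estimated regression-discontinuity effect before optimizing the partition. Purely as a reminder of that first step, and not the paper's estimator or data, here is a bare-bones sharp-RD local linear estimate with a triangular kernel; the bandwidth, function name, and toy data are illustrative assumptions.

```python
import numpy as np

def sharp_rd(y, x, cutoff=0.0, h=1.0):
    """Sharp-RD jump at the cutoff from local linear fits on each side,
    using a triangular kernel with bandwidth h."""
    def side_limit(right):
        sel = (x >= cutoff) if right else (x < cutoff)
        sel = sel & (np.abs(x - cutoff) < h)
        xs, ys = x[sel] - cutoff, y[sel]
        w = 1.0 - np.abs(xs) / h                      # triangular kernel weights
        X = np.column_stack([np.ones_like(xs), xs])
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * ys))
        return beta[0]                                # intercept = boundary limit
    return side_limit(True) - side_limit(False)

# toy usage: the outcome jumps by 2.0 at the cutoff
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(scale=0.3, size=x.size)
print("estimated jump:", sharp_rd(y, x, cutoff=0.0, h=0.8))
```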
We apply our results to aggregation of\nexperimental estimates for policy adoption, to extrapolation of Local Average\nTreatment Effects, and to policy making in the presence of omitted variable\nbias."}, "http://arxiv.org/abs/2312.17676": {"title": "Robust Inference in Panel Data Models: Some Effects of Heteroskedasticity and Leveraged Data in Small Samples", "link": "http://arxiv.org/abs/2312.17676", "description": "When the assumption of homoskedasticity is violated, least squares\nestimators of the variance become inefficient, and statistical inference\nconducted with invalid standard errors leads to misleading rejection rates.\nDespite a vast cross-sectional literature on the downward bias of robust\nstandard errors, the problem is not extensively covered in the panel data\nframework. We investigate the consequences of the simultaneous presence of\nsmall sample size, heteroskedasticity, and data points that exhibit extreme\nvalues in the covariates ('good leverage points') on statistical inference.\nFocusing on one-way linear panel data models, we examine asymptotic and finite\nsample properties of a battery of heteroskedasticity-consistent estimators\nusing Monte Carlo simulations. We also propose a hybrid estimator of the\nvariance-covariance matrix. Results show that conventional standard errors are\nalways dominated by more conservative estimators of the variance, especially in\nsmall samples. In addition, all types of HC standard errors have excellent\nperformance in terms of size and power under homoskedasticity."}, "http://arxiv.org/abs/2401.00249": {"title": "Forecasting CPI inflation under economic policy and geo-political uncertainties", "link": "http://arxiv.org/abs/2401.00249", "description": "Forecasting a key macroeconomic variable, consumer price index (CPI)\ninflation, for BRIC countries using economic policy uncertainty and\ngeopolitical risk is a difficult proposition for policymakers at the central\nbanks. This study proposes a novel filtered ensemble wavelet neural network\n(FEWNet) that can produce reliable long-term forecasts for CPI inflation. The\nproposal applies a maximum overlapping discrete wavelet transform to the CPI\ninflation series to obtain high-frequency and low-frequency signals. All the\nwavelet-transformed series and filtered exogenous variables are fed into\ndownstream autoregressive neural networks to make the final ensemble forecast.\nTheoretically, we show that FEWNet reduces the empirical risk compared to\nsingle, fully connected neural networks. We also demonstrate that the\nrolling-window real-time forecasts obtained from the proposed algorithm are\nsignificantly more accurate than benchmark forecasting methods. Additionally,\nwe use conformal prediction intervals to quantify the uncertainty associated\nwith the forecasts generated by the proposed approach. The excellent\nperformance of FEWNet can be attributed to its capacity to effectively capture\nnon-linearities and long-range dependencies in the data through its adaptable\narchitecture."}, "http://arxiv.org/abs/2401.00264": {"title": "Identification of Dynamic Nonlinear Panel Models under Partial Stationarity", "link": "http://arxiv.org/abs/2401.00264", "description": "This paper studies identification for a wide range of nonlinear panel data\nmodels, including binary choice, ordered response, and other types of limited\ndependent variable models. 
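To make the heteroskedasticity-consistent machinery discussed in the robust-inference abstract above concrete, a minimal sandwich (HC1-type) variance computation for pooled OLS is sketched below. This is the generic textbook estimator, not the hybrid estimator that paper proposes, and the names and toy data are illustrative.

```python
import numpy as np

def ols_hc1(y, X):
    """Pooled OLS with heteroskedasticity-consistent (HC1) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                                  # residuals
    meat = (X * (u ** 2)[:, None]).T @ X              # sum_i u_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv * n / (n - k)        # HC1 finite-sample scaling
    return beta, np.sqrt(np.diag(V))

# toy usage with heteroskedastic errors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=500) * (1 + np.abs(X[:, 1]))
beta, se = ols_hc1(y, X)
print("coefficients:", beta, "robust SEs:", se)
```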
Our approach accommodates dynamic models with any\nnumber of lagged dependent variables as well as other types of (potentially\ncontemporary) endogeneity. Our identification strategy relies on a partial\nstationarity condition, which not only allows for an unknown distribution of\nerrors but also for temporal dependencies in errors. We derive partial\nidentification results under flexible model specifications and provide\nadditional support conditions for point identification. We demonstrate the\nrobust finite-sample performance of our approach using Monte Carlo simulations,\nwith static and dynamic ordered choice models as illustrative examples."}, "http://arxiv.org/abs/2401.00618": {"title": "Generalized Difference-in-Differences for Ordered Choice Models: Too Many \"False Zeros\"?", "link": "http://arxiv.org/abs/2401.00618", "description": "In this paper, we develop a generalized Difference-in-Differences model for\ndiscrete, ordered outcomes, building upon elements from a continuous\nChanges-in-Changes model. We focus on outcomes derived from self-reported\nsurvey data eliciting socially undesirable, illegal, or stigmatized behaviors\nlike tax evasion, substance abuse, or domestic violence, where too many \"false\nzeros\", or more broadly, underreporting are likely. We provide\ncharacterizations for distributional parallel trends, a concept central to our\napproach, within a general threshold-crossing model framework. In cases where\noutcomes are assumed to be reported correctly, we propose a framework for\nidentifying and estimating treatment effects across the entire distribution.\nThis framework is then extended to modeling underreported outcomes, allowing\nthe reporting decision to depend on treatment status. A simulation study\ndocuments the finite sample performance of the estimators. Applying our\nmethodology, we investigate the impact of recreational marijuana legalization\nfor adults in several U.S. states on the short-term consumption behavior of\n8th-grade high-school students. The results indicate small, but significant\nincreases in consumption probabilities at each level. These effects are further\namplified upon accounting for misreporting."}, "http://arxiv.org/abs/2010.08868": {"title": "A Decomposition Approach to Counterfactual Analysis in Game-Theoretic Models", "link": "http://arxiv.org/abs/2010.08868", "description": "Decomposition methods are often used for producing counterfactual predictions\nin non-strategic settings. When the outcome of interest arises from a\ngame-theoretic setting where agents are better off by deviating from their\nstrategies after a new policy, such predictions, despite their practical\nsimplicity, are hard to justify. We present conditions in Bayesian games under\nwhich the decomposition-based predictions coincide with the equilibrium-based\nones. In many games, such coincidence follows from an invariance condition for\nequilibrium selection rules. To illustrate our message, we revisit an empirical\nanalysis in Ciliberto and Tamer (2009) on firms' entry decisions in the airline\nindustry."}, "http://arxiv.org/abs/2308.10138": {"title": "On the Inconsistency of Cluster-Robust Inference and How Subsampling Can Fix It", "link": "http://arxiv.org/abs/2308.10138", "description": "Conventional methods of cluster-robust inference are inconsistent in the\npresence of unignorably large clusters. We formalize this claim by establishing\na necessary and sufficient condition for the consistency of the conventional\nmethods. 
We find that this consistency condition is rejected for a\nmajority of empirical research papers. In this light, we propose a novel score\nsubsampling method which remains robust even under conditions that invalidate the\nconventional methods. Simulation studies support these claims. Using real data\nfrom an empirical paper, we show that the conventional methods conclude\nsignificance while our proposed method concludes insignificance."}, "http://arxiv.org/abs/2401.01064": {"title": "Robust Inference for Multiple Predictive Regressions with an Application on Bond Risk Premia", "link": "http://arxiv.org/abs/2401.01064", "description": "We propose a robust hypothesis testing procedure for the predictability of\nmultiple predictors that could be highly persistent. Our method improves on the\npopular extended instrumental variable (IVX) test (Phillips and Lee, 2013;\nKostakis et al., 2015) in that, besides addressing the two bias effects found\nin Hosseinkouchack and Demetrescu (2021), we find and deal with the\nvariance-enlargement effect. We show that two types of higher-order terms\ninduce these distortion effects in the test statistic, leading to significant\nover-rejection for one-sided tests and tests in multiple predictive\nregressions. Our improved IVX-based test includes three steps to tackle all the\nissues above regarding finite sample bias and variance terms. Thus, the test\nstatistic performs well in size control, while its power performance is\ncomparable with that of the original IVX. Monte Carlo simulations and an empirical\nstudy on the predictability of bond risk premia are provided to demonstrate the\neffectiveness of the newly proposed approach."}, "http://arxiv.org/abs/2401.01565": {"title": "Classification and Treatment Learning with Constraints via Composite Heaviside Optimization: a Progressive MIP Method", "link": "http://arxiv.org/abs/2401.01565", "description": "This paper proposes a Heaviside composite optimization approach and presents\na progressive (mixed) integer programming (PIP) method for solving multi-class\nclassification and multi-action treatment problems with constraints. A\nHeaviside composite function is a composite of a Heaviside function (i.e., the\nindicator function of either the open $( \\, 0,\\infty )$ or closed $[ \\,\n0,\\infty \\, )$ interval) with a possibly nondifferentiable function.\nModeling-wise, we show how Heaviside composite optimization provides a unified\nformulation for learning the optimal multi-class classification and\nmulti-action treatment rules, subject to rule-dependent constraints stipulating\na variety of domain restrictions. A Heaviside composite function has an\nequivalent discrete formulation in terms of integer variables, and the\nresulting optimization problem can in principle be solved by integer\nprogramming (IP) methods. Nevertheless, for constrained learning problems with\nlarge data sets, a straightforward application of\noff-the-shelf IP solvers is usually ineffective in achieving global optimality.\nTo alleviate such a computational burden, our major contribution is the\nproposal of the PIP method by leveraging the effectiveness of state-of-the-art\nIP solvers for problems of modest sizes. We establish the theoretical advantage\nof the PIP method through its connection to continuous optimization and show that\nthe computed solution is locally optimal for a broad class of Heaviside\ncomposite optimization problems. 
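For readers who want the baseline that the cluster-robust-inference abstract above starts from, here is the conventional one-way cluster-robust sandwich in a few lines of Python. The score-subsampling fix that paper proposes is not implemented here, and the helper name and toy data are mine.

```python
import numpy as np

def ols_cluster(y, X, cluster):
    """OLS with conventional one-way cluster-robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(cluster):
        s = X[cluster == c].T @ u[cluster == c]       # within-cluster score
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

# toy usage: 50 clusters with errors correlated within each cluster
rng = np.random.default_rng(0)
cluster = np.repeat(np.arange(50), 20)
X = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=50)[cluster] + rng.normal(size=1000)
print(ols_cluster(y, X, cluster))
```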
The numerical performance of the PIP method is\ndemonstrated by extensive computational experimentation."}, "http://arxiv.org/abs/2401.01645": {"title": "Model Averaging and Double Machine Learning", "link": "http://arxiv.org/abs/2401.01645", "description": "This paper discusses pairing double/debiased machine learning (DDML) with\nstacking, a model averaging method for combining multiple candidate learners,\nto estimate structural parameters. We introduce two new stacking approaches for\nDDML: short-stacking exploits the cross-fitting step of DDML to substantially\nreduce the computational burden and pooled stacking enforces common stacking\nweights over cross-fitting folds. Using calibrated simulation studies and two\napplications estimating gender gaps in citations and wages, we show that DDML\nwith stacking is more robust to partially unknown functional forms than common\nalternative approaches based on single pre-selected learners. We provide Stata\nand R software implementing our proposals."}, "http://arxiv.org/abs/2401.01804": {"title": "Efficient Computation of Confidence Sets Using Classification on Equidistributed Grids", "link": "http://arxiv.org/abs/2401.01804", "description": "Economic models produce moment inequalities, which can be used to form tests\nof the true parameters. Confidence sets (CS) of the true parameters are derived\nby inverting these tests. However, they often lack analytical expressions,\nnecessitating a grid search to obtain the CS numerically by retaining the grid\npoints that pass the test. When the statistic is not asymptotically pivotal,\nconstructing the critical value for each grid point in the parameter space adds\nto the computational burden. In this paper, we convert the computational issue\ninto a classification problem by using a support vector machine (SVM)\nclassifier. Its decision function provides a faster and more systematic way of\ndividing the parameter space into two regions: inside vs. outside of the\nconfidence set. We label those points in the CS as 1 and those outside as -1.\nResearchers can train the SVM classifier on a grid of manageable size and use\nit to determine whether points on denser grids are in the CS or not. We\nestablish certain conditions for the grid so that there is a tuning that allows\nus to asymptotically reproduce the test in the CS. This means that in the\nlimit, a point is classified as belonging to the confidence set if and only if\nit is labeled as 1 by the SVM."}, "http://arxiv.org/abs/2011.03073": {"title": "Bias correction for quantile regression estimators", "link": "http://arxiv.org/abs/2011.03073", "description": "We study the bias of classical quantile regression and instrumental variable\nquantile regression estimators. While being asymptotically first-order\nunbiased, these estimators can have non-negligible second-order biases. We\nderive a higher-order stochastic expansion of these estimators using empirical\nprocess theory. Based on this expansion, we derive an explicit formula for the\nsecond-order bias and propose a feasible bias correction procedure that uses\nfinite-difference estimators of the bias components. The proposed bias\ncorrection method performs well in simulations. 
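A stripped-down version of the grid-classification idea in the confidence-set abstract above can be put together with scikit-learn. The "test" below is a stand-in quadratic statistic rather than an actual moment-inequality test, and every name and tuning value is an assumption for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

def in_cs(theta, crit=5.99):
    """Toy membership test: theta is kept if a quadratic statistic stays
    below a critical value (a stand-in for inverting a real test)."""
    return bool(theta @ theta <= crit)

rng = np.random.default_rng(0)
coarse = rng.uniform(-4, 4, size=(500, 2))            # manageable training grid
labels = np.array([1 if in_cs(t) else -1 for t in coarse])

clf = SVC(kernel="rbf", C=10.0).fit(coarse, labels)   # train the classifier once

dense = rng.uniform(-4, 4, size=(20000, 2))           # much denser evaluation grid
pred_in = clf.predict(dense) == 1                     # fast membership decisions
print("predicted share of the dense grid inside the CS:", pred_in.mean())
```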
We provide an empirical\nillustration using Engel's classical data on household expenditure."}, "http://arxiv.org/abs/2107.02637": {"title": "Difference-in-Differences with a Continuous Treatment", "link": "http://arxiv.org/abs/2107.02637", "description": "This paper analyzes difference-in-differences setups with a continuous\ntreatment. We show that treatment effect on the treated-type parameters can be\nidentified under a generalized parallel trends assumption that is similar to\nthe binary treatment setup. However, interpreting differences in these\nparameters across different values of the treatment can be particularly\nchallenging due to treatment effect heterogeneity. We discuss alternative,\ntypically stronger, assumptions that alleviate these challenges. We also\nprovide a variety of treatment effect decomposition results, highlighting that\nparameters associated with popular linear two-way fixed-effect (TWFE)\nspecifications can be hard to interpret, \\emph{even} when there are only two\ntime periods. We introduce alternative estimation procedures that do not suffer\nfrom these TWFE drawbacks, and show in an application that they can lead to\ndifferent conclusions."}, "http://arxiv.org/abs/2401.02428": {"title": "Cu\\'anto es demasiada inflaci\\'on? Una clasificaci\\'on de reg\\'imenes inflacionarios", "link": "http://arxiv.org/abs/2401.02428", "description": "The classifications of inflationary regimes proposed in the literature have\nmostly been based on arbitrary characterizations, subject to value judgments by\nresearchers. The objective of this study is to propose a new methodological\napproach that reduces subjectivity and improves accuracy in the construction of\nsuch regimes. The method is built upon a combination of clustering techniques\nand classification trees, which allows for an historical periodization of\nArgentina's inflationary history for the period 1943-2022. Additionally, two\nprocedures are introduced to smooth out the classification over time: a measure\nof temporal contiguity of observations and a rolling method based on the simple\nmajority rule. The obtained regimes are compared against the existing\nliterature on the inflation-relative price variability relationship, revealing\na better performance of the proposed regimes."}, "http://arxiv.org/abs/2401.02819": {"title": "Roughness Signature Functions", "link": "http://arxiv.org/abs/2401.02819", "description": "Inspired by the activity signature introduced by Todorov and Tauchen (2010),\nwhich was used to measure the activity of a semimartingale, this paper\nintroduces the roughness signature function. The paper illustrates how it can\nbe used to determine whether a discretely observed process is generated by a\ncontinuous process that is rougher than a Brownian motion, a pure-jump process,\nor a combination of the two. Further, if a continuous rough process is present,\nthe function gives an estimate of the roughness index. This is done through an\nextensive simulation study, where we find that the roughness signature function\nworks as expected on rough processes. We further derive some asymptotic\nproperties of this new signature function. The function is applied empirically\nto three different volatility measures for the S&P500 index. The three measures\nare realized volatility, the VIX, and the option-extracted volatility estimator\nof Todorov (2019). 
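The inflationary-regimes abstract above combines clustering with classification trees. A toy sketch of that combination is given below on synthetic data; the features, tuning choices, and names are arbitrary assumptions, and none of the paper's temporal smoothing rules are reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# toy monthly inflation series with three very different regimes
infl = np.concatenate([rng.normal(0.5, 0.3, 120),
                       rng.normal(3.0, 1.0, 120),
                       rng.normal(15.0, 5.0, 120)])
features = np.column_stack([infl,                                # level
                            np.r_[0.0, np.abs(np.diff(infl))]])  # volatility proxy

# step 1: let a clustering algorithm group the observations into regimes
regimes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# step 2: a shallow tree recovers interpretable thresholds separating the regimes
tree = DecisionTreeClassifier(max_depth=2).fit(features, regimes)
print("tree accuracy on the clustered labels:", tree.score(features, regimes))
```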
The realized volatility and option-extracted volatility show\nsigns of roughness, with the option-extracted volatility appearing smoother\nthan the realized volatility, while the VIX appears to be driven by a\ncontinuous martingale with jumps."}, "http://arxiv.org/abs/2401.03293": {"title": "Counterfactuals in factor models", "link": "http://arxiv.org/abs/2401.03293", "description": "We study a new model where the potential outcomes, corresponding to the\nvalues of a (possibly continuous) treatment, are linked through common factors.\nThe factors can be estimated using a panel of regressors. We propose a\nprocedure to estimate time-specific and unit-specific average marginal effects\nin this context. Our approach can be used either with high-dimensional time\nseries or with large panels. It allows for treatment effects heterogeneous\nacross time and units and is straightforward to implement since it only relies\non principal components analysis and elementary computations. We derive the\nasymptotic distribution of our estimator of the average marginal effect and\nhighlight its solid finite sample performance through a simulation exercise.\nThe approach can also be used to estimate average counterfactuals or adapted to\nan instrumental variables setting, and we discuss these extensions. Finally, we\nillustrate our novel methodology through an empirical application on income\ninequality."}, "http://arxiv.org/abs/2401.03756": {"title": "Contextual Fixed-Budget Best Arm Identification: Adaptive Experimental Design with Policy Learning", "link": "http://arxiv.org/abs/2401.03756", "description": "Individualized treatment recommendation is a crucial task in evidence-based\ndecision-making. In this study, we formulate this task as a fixed-budget best\narm identification (BAI) problem with contextual information. In this setting,\nwe consider an adaptive experiment given multiple treatment arms. At each\nround, a decision-maker observes a context (covariate) that characterizes an\nexperimental unit and assigns the unit to one of the treatment arms. At the end\nof the experiment, the decision-maker recommends a treatment arm estimated to\nyield the highest expected outcome conditioned on a context (best treatment\narm). The effectiveness of this decision is measured in terms of the worst-case\nexpected simple regret (policy regret), which represents the largest difference\nbetween the conditional expected outcomes of the best and recommended treatment\narms given a context. Our initial step is to derive asymptotic lower bounds for\nthe worst-case expected simple regret, which also imply ideal treatment\nassignment rules. Following the lower bounds, we propose the Adaptive Sampling\n(AS)-Policy Learning recommendation (PL) strategy. Under this strategy, we\nrandomly assign a treatment arm according to a target assignment ratio at\neach round. At the end of the experiment, we train a policy, a function that\nrecommends a treatment arm given a context, by maximizing the counterfactual\nempirical policy value. Our results show that the AS-PL strategy is\nasymptotically minimax optimal, with the leading factor of its expected simple\nregret converging to our established worst-case lower bound. 
This research\nhas broad implications in various domains, and in light of existing literature,\nour method can be perceived as an adaptive experimental design tailored for\npolicy learning, on-policy learning, or adaptive welfare maximization."}, "http://arxiv.org/abs/2401.03990": {"title": "Identification with possibly invalid IVs", "link": "http://arxiv.org/abs/2401.03990", "description": "This paper proposes a novel identification strategy relying on\nquasi-instrumental variables (quasi-IVs). A quasi-IV is a relevant but possibly\ninvalid IV because it is not completely exogenous and/or excluded. We show that\na variety of models with discrete or continuous endogenous treatment, which are\nusually identified with an IV - quantile models with rank invariance, additive\nmodels with homogeneous treatment effects, and local average treatment effect\nmodels - can be identified under the joint relevance of two complementary\nquasi-IVs instead. To achieve identification, we complement one excluded but\npossibly endogenous quasi-IV (e.g., ``relevant proxies'' such as previous\ntreatment choice) with one exogenous (conditional on the excluded quasi-IV) but\npossibly included quasi-IV (e.g., random assignment or exogenous market\nshocks). In practice, our identification strategy should be attractive since\ncomplementary quasi-IVs should be easier to find than standard IVs. Our\napproach also holds if either of the two quasi-IVs turns out to be a valid IV."}, "http://arxiv.org/abs/2401.04050": {"title": "Robust Estimation in Network Vector Autoregression with Nonstationary Regressors", "link": "http://arxiv.org/abs/2401.04050", "description": "This article studies identification and estimation for the network vector\nautoregressive model with nonstationary regressors. In particular, network\ndependence is characterized by a nonstochastic adjacency matrix. The\ninformation set includes a stationary regressand and a node-specific vector of\nnonstationary regressors, both observed at the same equally spaced time\nfrequencies. Our proposed econometric specification corresponds to the NVAR\nmodel under time series nonstationarity, which relies on the local-to-unity\nparametrization for capturing the unknown form of persistence of these\nnode-specific regressors. Robust econometric estimation is achieved using an\nIVX-type estimator, and the asymptotic theory for the augmented vector\nof regressors is studied based on a double asymptotic regime where both the\nnetwork size and the time dimension tend to infinity."}, "http://arxiv.org/abs/2103.12374": {"title": "What Do We Get from Two-Way Fixed Effects Regressions? Implications from Numerical Equivalence", "link": "http://arxiv.org/abs/2103.12374", "description": "In any multiperiod panel, a two-way fixed effects (TWFE) regression is\nnumerically equivalent to a first-difference (FD) regression that pools all\npossible between-period gaps. Building on this observation, this paper develops\nnumerical and causal interpretations of the TWFE coefficient. At the sample\nlevel, the TWFE coefficient is a weighted average of FD coefficients with\ndifferent between-period gaps. This decomposition is useful for assessing the\nsource of identifying variation for the TWFE coefficient. At the population\nlevel, a causal interpretation of the TWFE coefficient requires a common trends\nassumption for any between-period gap, and the assumption has to be conditional\non changes in time-varying covariates. 
I propose a natural generalization of\nthe TWFE estimator that can relax these requirements."}, "http://arxiv.org/abs/2107.11869": {"title": "Adaptive Estimation and Uniform Confidence Bands for Nonparametric Structural Functions and Elasticities", "link": "http://arxiv.org/abs/2107.11869", "description": "We introduce two data-driven procedures for optimal estimation and inference\nin nonparametric models using instrumental variables. The first is a\ndata-driven choice of sieve dimension for a popular class of sieve two-stage\nleast squares estimators. When implemented with this choice, estimators of both\nthe structural function $h_0$ and its derivatives (such as elasticities)\nconverge at the fastest possible (i.e., minimax) rates in sup-norm. The second\nis for constructing uniform confidence bands (UCBs) for $h_0$ and its\nderivatives. Our UCBs guarantee coverage over a generic class of\ndata-generating processes and contract at the minimax rate, possibly up to a\nlogarithmic factor. As such, our UCBs are asymptotically more efficient than\nUCBs based on the usual approach of undersmoothing. As an application, we\nestimate the elasticity of the intensive margin of firm exports in a\nmonopolistic competition model of international trade. Simulations illustrate\nthe good performance of our procedures in empirically calibrated designs. Our\nresults provide evidence against common parameterizations of the distribution\nof unobserved firm heterogeneity."}, "http://arxiv.org/abs/2301.09397": {"title": "ddml: Double/debiased machine learning in Stata", "link": "http://arxiv.org/abs/2301.09397", "description": "We introduce the package ddml for Double/Debiased Machine Learning (DDML) in\nStata. Estimators of causal parameters for five different econometric models\nare supported, allowing for flexible estimation of causal effects of endogenous\nvariables in settings with unknown functional forms and/or many exogenous\nvariables. ddml is compatible with many existing supervised machine learning\nprograms in Stata. We recommend using DDML in combination with stacking\nestimation which combines multiple machine learners into a final predictor. We\nprovide Monte Carlo evidence to support our recommendation."}, "http://arxiv.org/abs/2302.13857": {"title": "Multi-cell experiments for marginal treatment effect estimation of digital ads", "link": "http://arxiv.org/abs/2302.13857", "description": "Randomized experiments with treatment and control groups are an important\ntool to measure the impacts of interventions. However, in experimental settings\nwith one-sided noncompliance, extant empirical approaches may not produce the\nestimands a decision-maker needs to solve their problem of interest. For\nexample, these experimental designs are common in digital advertising settings,\nbut typical methods do not yield effects that inform the intensive margin --\nhow many consumers should be reached or how much should be spent on a campaign.\nWe propose a solution that combines a novel multi-cell experimental design with\nmodern estimation techniques that enables decision-makers to recover enough\ninformation to solve problems with an intensive margin. Our design is\nstraightforward to implement and does not require any additional budget to be\ncarried out. 
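The ddml/stacking recommendation in the abstract above refers to Stata software. As a language-agnostic illustration of the same general idea (cross-fitted partialling-out with a stacked learner), here is a hypothetical Python sketch built on scikit-learn; it is not the ddml package, omits most of its features, and all names, learners, and toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold

def make_stack():
    """A small stacking learner combining a linear model, a lasso and a forest."""
    return StackingRegressor(
        estimators=[("ols", LinearRegression()),
                    ("lasso", LassoCV(cv=3)),
                    ("rf", RandomForestRegressor(n_estimators=100, random_state=0))],
        final_estimator=LinearRegression())

def ddml_plm(y, d, X, n_folds=5):
    """Cross-fitted partialling-out estimate of the effect of d on y."""
    y_res, d_res = np.zeros_like(y), np.zeros_like(d)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        y_res[test] = y[test] - make_stack().fit(X[train], y[train]).predict(X[test])
        d_res[test] = d[test] - make_stack().fit(X[train], d[train]).predict(X[test])
    return (d_res @ y_res) / (d_res @ d_res)           # final OLS step on residuals

# toy usage: true linear effect of 0.5 with nonlinear confounding
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
d = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)
y = 0.5 * d + X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)
print("estimated effect:", ddml_plm(y, d, X))
```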
We illustrate our approach through a series of simulations that\nare calibrated using an advertising experiment at Facebook, finding that our\nmethod outperforms standard techniques in generating better decisions."}, "http://arxiv.org/abs/2401.04200": {"title": "Teacher bias or measurement error?", "link": "http://arxiv.org/abs/2401.04200", "description": "In many countries, teachers' track recommendations are used to allocate\nstudents to secondary school tracks. Previous studies have shown that students\nfrom families with low socioeconomic status (SES) receive lower track\nrecommendations than their peers from high SES families, conditional on\nstandardized test scores. It is often argued this indicates teacher bias.\nHowever, this claim is invalid in the presence of measurement error in test\nscores. We discuss how measurement error in test scores generates a biased\ncoefficient of the conditional SES gap, and consider three empirical strategies\nto address this bias. Using administrative data from the Netherlands, we find\nthat measurement error explains 35 to 43% of the conditional SES gap in track\nrecommendations."}, "http://arxiv.org/abs/2401.04512": {"title": "Robust Bayesian Method for Refutable Models", "link": "http://arxiv.org/abs/2401.04512", "description": "We propose a robust Bayesian method for economic models that can be rejected\nunder some data distributions. The econometrician starts with a structural\nassumption which can be written as the intersection of several assumptions, and\nthe joint assumption is refutable. To avoid the model rejection, the\neconometrician first takes a stance on which assumption $j$ is likely to be\nviolated and considers a measurement of the degree of violation of this\nassumption $j$. She then considers a (marginal) prior belief on the degree of\nviolation $(\\pi_{m_j})$: She considers a class of prior distributions $\\pi_s$\non all economic structures such that all $\\pi_s$ have the same marginal\ndistribution $\\pi_m$. Compared to the standard nonparametric Bayesian method\nthat puts a single prior on all economic structures, the robust Bayesian method\nimposes a single marginal prior distribution on the degree of violation. As a\nresult, the robust Bayesian method allows the econometrician to take a stance\nonly on the likeliness of violation of assumption $j$. Compared to the\nfrequentist approach to relax the refutable assumption, the robust Bayesian\nmethod is transparent on the econometrician's stance of choosing models. We\nalso show that many frequentists' ways to relax the refutable assumption can be\nfound equivalent to particular choices of robust Bayesian prior classes. We use\nthe local average treatment effect (LATE) in the potential outcome framework as\nthe leading illustrating example."}, "http://arxiv.org/abs/2005.10314": {"title": "On the Nuisance of Control Variables in Regression Analysis", "link": "http://arxiv.org/abs/2005.10314", "description": "Control variables are included in regression analyses to estimate the causal\neffect of a treatment on an outcome. In this paper, we argue that the estimated\neffect sizes of controls are unlikely to have a causal interpretation\nthemselves, though. This is because even valid controls are possibly endogenous\nand represent a combination of several different causal mechanisms operating\njointly on the outcome, which is hard to interpret theoretically. 
Therefore, we\nrecommend refraining from interpreting marginal effects of controls and\nfocusing on the main variables of interest, for which a plausible\nidentification argument can be established. To prevent erroneous managerial or\npolicy implications, coefficients of control variables should be clearly marked\nas not having a causal interpretation or omitted from regression tables\naltogether. Moreover, we advise against using control variable estimates for\nsubsequent theory building and meta-analyses."}, "http://arxiv.org/abs/2401.04803": {"title": "IV Estimation of Panel Data Tobit Models with Normal Errors", "link": "http://arxiv.org/abs/2401.04803", "description": "Amemiya (1973) proposed a ``consistent initial estimator'' for the parameters\nin a censored regression model with normal errors. This paper demonstrates that\na similar approach can be used to construct moment conditions for\nfixed--effects versions of the model considered by Amemiya. This result\nsuggests estimators for models that have not previously been considered."}, "http://arxiv.org/abs/2401.04849": {"title": "A Deep Learning Representation of Spatial Interaction Model for Resilient Spatial Planning of Community Business Clusters", "link": "http://arxiv.org/abs/2401.04849", "description": "Existing Spatial Interaction Models (SIMs) are limited in capturing the\ncomplex and context-aware interactions between business clusters and trade\nareas. To address the limitation, we propose a SIM-GAT model to predict\nspatiotemporal visitation flows between community business clusters and their\ntrade areas. The model innovatively represents the integrated system of\nbusiness clusters, trade areas, and transportation infrastructure within an\nurban region using a connected graph. Then, a graph-based deep learning model,\ni.e., Graph AttenTion network (GAT), is used to capture the complexity and\ninterdependencies of business clusters. We developed this model with data\ncollected from the Miami metropolitan area in Florida. We then demonstrated its\neffectiveness in capturing varying attractiveness of business clusters to\ndifferent residential neighborhoods and across scenarios with an eXplainable AI\napproach. We contribute a novel method supplementing conventional SIMs to\npredict and analyze the dynamics of inter-connected community business\nclusters. The analysis results can inform data-evidenced and place-specific\nplanning strategies helping community business clusters better accommodate\ntheir customers across scenarios, and hence improve the resilience of community\nbusinesses."}, "http://arxiv.org/abs/2306.14653": {"title": "Optimization of the Generalized Covariance Estimator in Noncausal Processes", "link": "http://arxiv.org/abs/2306.14653", "description": "This paper investigates the performance of the Generalized Covariance\nestimator (GCov) in estimating and identifying mixed causal and noncausal\nmodels. The GCov estimator is a semi-parametric method that minimizes an\nobjective function without making any assumptions about the error distribution\nand is based on nonlinear autocovariances to identify the causal and noncausal\norders. When the number and type of nonlinear autocovariances included in the\nobjective function of a GCov estimator is insufficient/inadequate, or the error\ndensity is too close to the Gaussian, identification issues can arise. These\nissues result in local minima in the objective function, which correspond to\nparameter values associated with incorrect causal and noncausal orders. 
Then,\ndepending on the starting point and the optimization algorithm employed, the\nalgorithm can converge to a local minimum. The paper proposes the use of the\nSimulated Annealing (SA) optimization algorithm as an alternative to\nconventional numerical optimization methods. The results demonstrate that SA\nperforms well when applied to mixed causal and noncausal models, successfully\neliminating the effects of local minima. The proposed approach is illustrated\nby an empirical application involving a bivariate commodity price series."}, "http://arxiv.org/abs/2401.05517": {"title": "On Efficient Inference of Causal Effects with Multiple Mediators", "link": "http://arxiv.org/abs/2401.05517", "description": "This paper provides robust estimators and efficient inference of causal\neffects involving multiple interacting mediators. Most existing works either\nimpose a linear model assumption among the mediators or are restricted to\nhandle conditionally independent mediators given the exposure. To overcome\nthese limitations, we define causal and individual mediation effects in a\ngeneral setting, and employ a semiparametric framework to develop quadruply\nrobust estimators for these causal effects. We further establish the asymptotic\nnormality of the proposed estimators and prove their local semiparametric\nefficiencies. The proposed method is empirically validated via simulated and\nreal datasets concerning psychiatric disorders in trauma survivors."}, "http://arxiv.org/abs/2401.05784": {"title": "Covariance Function Estimation for High-Dimensional Functional Time Series with Dual Factor Structures", "link": "http://arxiv.org/abs/2401.05784", "description": "We propose a flexible dual functional factor model for modelling\nhigh-dimensional functional time series. In this model, a high-dimensional\nfully functional factor parametrisation is imposed on the observed functional\nprocesses, whereas a low-dimensional version (via series approximation) is\nassumed for the latent functional factors. We extend the classic principal\ncomponent analysis technique for the estimation of a low-rank structure to the\nestimation of a large covariance matrix of random functions that satisfies a\nnotion of (approximate) functional \"low-rank plus sparse\" structure; and\ngeneralise the matrix shrinkage method to functional shrinkage in order to\nestimate the sparse structure of functional idiosyncratic components. Under\nappropriate regularity conditions, we derive the large sample theory of the\ndeveloped estimators, including the consistency of the estimated factors and\nfunctional factor loadings and the convergence rates of the estimated matrices\nof covariance functions measured by various (functional) matrix norms.\nConsistent selection of the number of factors and a data-driven rule to choose\nthe shrinkage parameter are discussed. 
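As a generic illustration of why the Simulated Annealing recommendation in the GCov abstract above helps with multimodal objectives, the snippet below contrasts a gradient-based local search with scipy's dual_annealing on a toy function. The objective is not the GCov criterion, and all values are made up for the example.

```python
import numpy as np
from scipy.optimize import dual_annealing, minimize

def objective(theta):
    """A toy multimodal criterion standing in for an objective with local minima."""
    return np.sin(3.0 * theta[0]) ** 2 + 0.05 * (theta[0] - 2.0) ** 2

local = minimize(objective, x0=[-2.0], method="BFGS")        # can stop at a local minimum
best = dual_annealing(objective, bounds=[(-6.0, 6.0)])       # stochastic global search
print("local search:", local.x, objective(local.x))
print("simulated annealing:", best.x, best.fun)
```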
Simulation and empirical studies are\nprovided to demonstrate the finite-sample performance of the developed model\nand estimation methodology."}, "http://arxiv.org/abs/2305.16377": {"title": "Validating a dynamic input-output model for the propagation of supply and demand shocks during the COVID-19 pandemic in Belgium", "link": "http://arxiv.org/abs/2305.16377", "description": "This work validates a dynamic production network model, used to quantify the\nimpact of economic shocks caused by COVID-19 in the UK, using data for Belgium.\nBecause the model was published early during the 2020 COVID-19 pandemic, it\nrelied on several assumptions regarding the magnitude of the observed economic\nshocks, for which more accurate data have become available in the meantime. We\nrefined the propagated shocks to align with observed data collected during the\npandemic and calibrated some less well-informed parameters using 115 economic\ntime series. The refined model effectively captures the evolution of GDP,\nrevenue, and employment during the COVID-19 pandemic in Belgium at both\nindividual economic activity and aggregate levels. However, the reduction in\nbusiness-to-business demand is overestimated, revealing structural shortcomings\nin accounting for businesses' motivations to sustain trade despite the\npandemic's induced shocks. We confirm that the relaxation of the stringent\nLeontief production function by a survey on the criticality of inputs\nsignificantly improved the model's accuracy. However, despite a large dataset,\ndistinguishing between varying degrees of relaxation proved challenging.\nOverall, this work demonstrates the model's validity in assessing the impact of\neconomic shocks caused by an epidemic in Belgium."}, "http://arxiv.org/abs/2309.13251": {"title": "Nonparametric estimation of conditional densities by generalized random forests", "link": "http://arxiv.org/abs/2309.13251", "description": "Considering a continuous random variable Y together with a continuous random\nvector X, I propose a nonparametric estimator f^(.|x) for the conditional\ndensity of Y given X=x. This estimator takes the form of an exponential series\nwhose coefficients T = (T1,...,TJ) are the solution of a system of nonlinear\nequations that depends on an estimator of the conditional expectation\nE[p(Y)|X=x], where p(.) is a J-dimensional vector of basis functions. A key\nfeature is that E[p(Y)|X=x] is estimated by generalized random forest (Athey,\nTibshirani, and Wager, 2019), targeting the heterogeneity of T across x. I show\nthat f^(.|x) is uniformly consistent and asymptotically normal, while allowing\nJ to grow to infinity. I also provide a standard error formula to construct\nasymptotically valid confidence intervals. Results from Monte Carlo experiments\nand an empirical illustration are provided."}, "http://arxiv.org/abs/2401.06264": {"title": "Exposure effects are policy relevant only under strong assumptions about the interference structure", "link": "http://arxiv.org/abs/2401.06264", "description": "Savje (2023) recommends misspecified exposure effects as a way to avoid\nstrong assumptions about interference when analyzing the results of an\nexperiment. In this discussion, we highlight a key limitation of Savje's\nrecommendation. Exposure effects are not generally useful for evaluating social\npolicies without the strong assumptions that Savje seeks to avoid.\n\nOur discussion is organized as follows. 
Section 2 summarizes our position,\nsection 3 provides a concrete example, and section 4 concludes. Proof of claims\nare in an appendix."}, "http://arxiv.org/abs/2401.06611": {"title": "Robust Analysis of Short Panels", "link": "http://arxiv.org/abs/2401.06611", "description": "Many structural econometric models include latent variables on whose\nprobability distributions one may wish to place minimal restrictions. Leading\nexamples in panel data models are individual-specific variables sometimes\ntreated as \"fixed effects\" and, in dynamic models, initial conditions. This\npaper presents a generally applicable method for characterizing sharp\nidentified sets when models place no restrictions on the probability\ndistribution of certain latent variables and no restrictions on their\ncovariation with other variables. In our analysis latent variables on which\nrestrictions are undesirable are removed, leading to econometric analysis\nrobust to misspecification of restrictions on their distributions which are\ncommonplace in the applied panel data literature. Endogenous explanatory\nvariables are easily accommodated. Examples of application to some static and\ndynamic binary, ordered and multiple discrete choice and censored panel data\nmodels are presented."}, "http://arxiv.org/abs/2308.00202": {"title": "Randomization Inference of Heterogeneous Treatment Effects under Network Interference", "link": "http://arxiv.org/abs/2308.00202", "description": "We design randomization tests of heterogeneous treatment effects when units\ninteract on a single connected network. Our modeling strategy allows network\ninterference into the potential outcomes framework using the concept of\nexposure mapping. We consider several null hypotheses representing different\nnotions of homogeneous treatment effects. However, these hypotheses are not\nsharp due to nuisance parameters and multiple potential outcomes. To address\nthe issue of multiple potential outcomes, we propose a conditional\nrandomization method that expands on existing procedures. Our conditioning\napproach permits the use of treatment assignment as a conditioning variable,\nwidening the range of application of the randomization method of inference. In\naddition, we propose techniques that overcome the nuisance parameter issue. We\nshow that our resulting testing methods based on the conditioning procedure and\nthe strategies for handling nuisance parameters are asymptotically valid. We\ndemonstrate the testing methods using a network data set and also present the\nfindings of a Monte Carlo study."}, "http://arxiv.org/abs/2401.06864": {"title": "Deep Learning With DAGs", "link": "http://arxiv.org/abs/2401.06864", "description": "Social science theories often postulate causal relationships among a set of\nvariables or events. Although directed acyclic graphs (DAGs) are increasingly\nused to represent these theories, their full potential has not yet been\nrealized in practice. As non-parametric causal models, DAGs require no\nassumptions about the functional form of the hypothesized relationships.\nNevertheless, to simplify the task of empirical evaluation, researchers tend to\ninvoke such assumptions anyway, even though they are typically arbitrary and do\nnot reflect any theoretical content or prior knowledge. Moreover, functional\nform assumptions can engender bias, whenever they fail to accurately capture\nthe complexity of the causal system under investigation. 
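For context on the randomization machinery referenced in the network-interference abstract above, a plain Fisher-style permutation test of a sharp null under complete randomization is sketched below. The conditional, network-aware procedure in that paper is considerably more involved; the helper name and toy data here are mine.

```python
import numpy as np

def permutation_test(y, d, n_perm=5000, seed=0):
    """Randomization test of the sharp null of no treatment effect:
    re-randomize the treatment labels and compare difference-in-means statistics."""
    rng = np.random.default_rng(seed)
    obs = y[d == 1].mean() - y[d == 0].mean()
    null = np.empty(n_perm)
    for b in range(n_perm):
        dp = rng.permutation(d)                      # re-assign treatment labels
        null[b] = y[dp == 1].mean() - y[dp == 0].mean()
    return (np.abs(null) >= np.abs(obs)).mean()      # two-sided p-value

# toy usage: 100 units, half treated, true effect of 0.5
rng = np.random.default_rng(1)
d = rng.permutation(np.repeat([0, 1], 50))
y = 0.5 * d + rng.normal(size=100)
print("p-value:", permutation_test(y, d))
```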
In this article, we\nintroduce causal-graphical normalizing flows (cGNFs), a novel approach to\ncausal inference that leverages deep neural networks to empirically evaluate\ntheories represented as DAGs. Unlike conventional approaches, cGNFs model the\nfull joint distribution of the data according to a DAG supplied by the analyst,\nwithout relying on stringent assumptions about functional form. In this way,\nthe method allows for flexible, semi-parametric estimation of any causal\nestimand that can be identified from the DAG, including total effects,\nconditional effects, direct and indirect effects, and path-specific effects. We\nillustrate the method with a reanalysis of Blau and Duncan's (1967) model of\nstatus attainment and Zhou's (2019) model of conditional versus controlled\nmobility. To facilitate adoption, we provide open-source software together with\na series of online tutorials for implementing cGNFs. The article concludes with\na discussion of current limitations and directions for future development."}, "http://arxiv.org/abs/2401.07038": {"title": "A simple stochastic nonlinear AR model with application to bubble", "link": "http://arxiv.org/abs/2401.07038", "description": "Economic and financial time series can feature locally explosive behavior\nwhen a bubble is formed. The economic or financial bubble, especially its\ndynamics, is an intriguing topic that has been attracting longstanding\nattention. To illustrate the dynamics of the local explosion itself, the paper\npresents a novel, simple, yet useful time series model, called the stochastic\nnonlinear autoregressive model, which is always strictly stationary and\ngeometrically ergodic and can create long swings or persistence observed in\nmany macroeconomic variables. When a nonlinear autoregressive coefficient is\noutside of a certain range, the model has periodically explosive behaviors and\ncan then be used to portray the bubble dynamics. Further, the quasi-maximum\nlikelihood estimation (QMLE) of our model is considered, and its strong\nconsistency and asymptotic normality are established under minimal assumptions\non innovation. A new model diagnostic checking statistic is developed for model\nfitting adequacy. In addition two methods for bubble tagging are proposed, one\nfrom the residual perspective and the other from the null-state perspective.\nMonte Carlo simulation studies are conducted to assess the performances of the\nQMLE and the two bubble tagging methods in finite samples. Finally, the\nusefulness of the model is illustrated by an empirical application to the\nmonthly Hang Seng Index."}, "http://arxiv.org/abs/2401.07152": {"title": "Inference for Synthetic Controls via Refined Placebo Tests", "link": "http://arxiv.org/abs/2401.07152", "description": "The synthetic control method is often applied to problems with one treated\nunit and a small number of control units. A common inferential task in this\nsetting is to test null hypotheses regarding the average treatment effect on\nthe treated. Inference procedures that are justified asymptotically are often\nunsatisfactory due to (1) small sample sizes that render large-sample\napproximation fragile and (2) simplification of the estimation procedure that\nis implemented in practice. An alternative is permutation inference, which is\nrelated to a common diagnostic called the placebo test. It has provable Type-I\nerror guarantees in finite samples without simplification of the method, when\nthe treatment is uniformly assigned. 
Despite this robustness, the placebo test\nsuffers from low resolution since the null distribution is constructed from\nonly $N$ reference estimates, where $N$ is the sample size. This creates a\nbarrier for statistical inference at a common level like $\\alpha = 0.05$,\nespecially when $N$ is small. We propose a novel leave-two-out procedure that\nbypasses this issue, while still maintaining the same finite-sample Type-I\nerror guarantee under uniform assignment for a wide range of $N$. Unlike the\nplacebo test whose Type-I error always equals the theoretical upper bound, our\nprocedure often achieves a lower unconditional Type-I error than theory\nsuggests; this enables useful inference in the challenging regime when $\\alpha\n< 1/N$. Empirically, our procedure achieves a higher power when the effect size\nis reasonably large and a comparable power otherwise. We generalize our\nprocedure to non-uniform assignments and show how to conduct sensitivity\nanalysis. From a methodological perspective, our procedure can be viewed as a\nnew type of randomization inference different from permutation or rank-based\ninference, which is particularly effective in small samples."}, "http://arxiv.org/abs/2401.07176": {"title": "A Note on Uncertainty Quantification for Maximum Likelihood Parameters Estimated with Heuristic Based Optimization Algorithms", "link": "http://arxiv.org/abs/2401.07176", "description": "Gradient-based solvers risk convergence to local optima, leading to incorrect\nresearcher inference. Heuristic-based algorithms are able to ``break free\" of\nthese local optima to eventually converge to the true global optimum. However,\ngiven that they do not provide the gradient/Hessian needed to approximate the\ncovariance matrix and that the significantly longer computational time they\nrequire for convergence likely precludes resampling procedures for inference,\nresearchers often are unable to quantify uncertainty in the estimates they\nderive with these methods. This note presents a simple and relatively fast\ntwo-step procedure to estimate the covariance matrix for parameters estimated\nwith these algorithms. This procedure relies on automatic differentiation, a\ncomputational means of calculating derivatives that is popular in machine\nlearning applications. A brief empirical example demonstrates the advantages of\nthis procedure relative to bootstrapping and shows the similarity in standard\nerror estimates between this procedure and that which would normally accompany\nmaximum likelihood estimation with a gradient-based algorithm."}, "http://arxiv.org/abs/2401.08290": {"title": "Causal Machine Learning for Moderation Effects", "link": "http://arxiv.org/abs/2401.08290", "description": "It is valuable for any decision maker to know the impact of decisions\n(treatments) on average and for subgroups. The causal machine learning\nliterature has recently provided tools for estimating group average treatment\neffects (GATE) to understand treatment heterogeneity better. This paper\naddresses the challenge of interpreting such differences in treatment effects\nbetween groups while accounting for variations in other covariates. We propose\na new parameter, the balanced group average treatment effect (BGATE), which\nmeasures a GATE with a specific distribution of a priori-determined covariates.\nBy taking the difference of two BGATEs, we can analyse heterogeneity more\nmeaningfully than by comparing two GATEs. 
The estimation strategy for this\nparameter is based on double/debiased machine learning for discrete treatments\nin an unconfoundedness setting, and the estimator is shown to be\n$\\sqrt{N}$-consistent and asymptotically normal under standard conditions.\nAdding additional identifying assumptions allows specific balanced differences\nin treatment effects between groups to be interpreted causally, leading to the\ncausal balanced group average treatment effect. We explore the finite sample\nproperties in a small-scale simulation study and demonstrate the usefulness of\nthese parameters in an empirical example."}, "http://arxiv.org/abs/2401.08442": {"title": "Assessing the impact of forced and voluntary behavioral changes on economic-epidemiological co-dynamics: A comparative case study between Belgium and Sweden during the 2020 COVID-19 pandemic", "link": "http://arxiv.org/abs/2401.08442", "description": "During the COVID-19 pandemic, governments faced the challenge of managing\npopulation behavior to prevent their healthcare systems from collapsing. Sweden\nadopted a strategy centered on voluntary sanitary recommendations while Belgium\nresorted to mandatory measures. Their consequences on pandemic progression and\nassociated economic impacts remain insufficiently understood. This study\nleverages the divergent policies of Belgium and Sweden during the COVID-19\npandemic to relax the unrealistic -- but persistently used -- assumption that\nsocial contacts are not influenced by an epidemic's dynamics. We develop an\nepidemiological-economic co-simulation model where pandemic-induced behavioral\nchanges are a superposition of voluntary actions driven by fear, prosocial\nbehavior or social pressure, and compulsory compliance with government\ndirectives. Our findings emphasize the importance of early responses, which\nreduce the stringency of measures necessary to safeguard healthcare systems and\nminimize ensuing economic damage. Voluntary behavioral changes lead to a\npattern of recurring epidemics, which should be regarded as the natural\nlong-term course of pandemics. Governments should carefully consider prolonging\nlockdown longer than necessary because this leads to higher economic damage and\na potentially higher second surge when measures are released. Our model can aid\npolicymakers in the selection of an appropriate long-term strategy that\nminimizes economic damage."}, "http://arxiv.org/abs/2101.00009": {"title": "Adversarial Estimation of Riesz Representers", "link": "http://arxiv.org/abs/2101.00009", "description": "Many causal and structural parameters are linear functionals of an underlying\nregression. The Riesz representer is a key component in the asymptotic variance\nof a semiparametrically estimated linear functional. We propose an adversarial\nframework to estimate the Riesz representer using general function spaces. We\nprove a nonasymptotic mean square rate in terms of an abstract quantity called\nthe critical radius, then specialize it for neural networks, random forests,\nand reproducing kernel Hilbert spaces as leading cases. Furthermore, we use\ncritical radius theory -- in place of Donsker theory -- to prove asymptotic\nnormality without sample splitting, uncovering a ``complexity-rate robustness''\ncondition. This condition has practical consequences: inference without sample\nsplitting is possible in several machine learning settings, which may improve\nfinite sample performance compared to sample splitting. 
Our estimators achieve\nnominal coverage in highly nonlinear simulations where previous methods break\ndown. They shed new light on the heterogeneous effects of matching grants."}, "http://arxiv.org/abs/2101.00399": {"title": "The Law of Large Numbers for Large Stable Matchings", "link": "http://arxiv.org/abs/2101.00399", "description": "In many empirical studies of a large two-sided matching market (such as in a\ncollege admissions problem), the researcher performs statistical inference\nunder the assumption that they observe a random sample from a large matching\nmarket. In this paper, we consider a setting in which the researcher observes\neither all or a nontrivial fraction of outcomes from a stable matching. We\nestablish a concentration inequality for empirical matching probabilities\nassuming strong correlation among the colleges' preferences while allowing\nstudents' preferences to be fully heterogeneous. Our concentration inequality\nyields laws of large numbers for the empirical matching probabilities and other\nstatistics commonly used in empirical analyses of a large matching market. To\nillustrate the usefulness of our concentration inequality, we prove consistency\nfor estimators of conditional matching probabilities and measures of positive\nassortative matching."}, "http://arxiv.org/abs/2203.08050": {"title": "Pairwise Valid Instruments", "link": "http://arxiv.org/abs/2203.08050", "description": "Finding valid instruments is difficult. We propose Validity Set Instrumental\nVariable (VSIV) estimation, a method for estimating local average treatment\neffects (LATEs) in heterogeneous causal effect models when the instruments are\npartially invalid. We consider settings with pairwise valid instruments, that\nis, instruments that are valid for a subset of instrument value pairs. VSIV\nestimation exploits testable implications of instrument validity to remove\ninvalid pairs and provides estimates of the LATEs for all remaining pairs,\nwhich can be aggregated into a single parameter of interest using\nresearcher-specified weights. We show that the proposed VSIV estimators are\nasymptotically normal under weak conditions and remove or reduce the asymptotic\nbias relative to standard LATE estimators (that is, LATE estimators that do not\nuse testable implications to remove invalid variation). We evaluate the finite\nsample properties of VSIV estimation in application-based simulations and apply\nour method to estimate the returns to college education using parental\neducation as an instrument."}, "http://arxiv.org/abs/2212.07052": {"title": "On LASSO for High Dimensional Predictive Regression", "link": "http://arxiv.org/abs/2212.07052", "description": "This paper examines LASSO, a widely-used $L_{1}$-penalized regression method,\nin high dimensional linear predictive regressions, particularly when the number\nof potential predictors exceeds the sample size and numerous unit root\nregressors are present. The consistency of LASSO is contingent upon two key\ncomponents: the deviation bound of the cross product of the regressors and the\nerror term, and the restricted eigenvalue of the Gram matrix. We present new\nprobabilistic bounds for these components, suggesting that LASSO's rates of\nconvergence are different from those typically observed in cross-sectional\ncases. When applied to a mixture of stationary, nonstationary, and cointegrated\npredictors, LASSO maintains its asymptotic guarantee if predictors are\nscale-standardized. 
Leveraging machine learning and macroeconomic domain\nexpertise, LASSO demonstrates strong performance in forecasting the\nunemployment rate, as evidenced by its application to the FRED-MD database."}, "http://arxiv.org/abs/2303.14226": {"title": "Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions", "link": "http://arxiv.org/abs/2303.14226", "description": "Consider a setting where there are $N$ heterogeneous units and $p$\ninterventions. Our goal is to learn unit-specific potential outcomes for any\ncombination of these $p$ interventions, i.e., $N \\times 2^p$ causal parameters.\nChoosing a combination of interventions is a problem that naturally arises in a\nvariety of applications such as factorial design experiments, recommendation\nengines, combination therapies in medicine, conjoint analysis, etc. Running $N\n\\times 2^p$ experiments to estimate the various parameters is likely expensive\nand/or infeasible as $N$ and $p$ grow. Further, with observational data there\nis likely confounding, i.e., whether or not a unit is seen under a combination\nis correlated with its potential outcome under that combination. To address\nthese challenges, we propose a novel latent factor model that imposes structure\nacross units (i.e., the matrix of potential outcomes is approximately rank\n$r$), and combinations of interventions (i.e., the coefficients in the Fourier\nexpansion of the potential outcomes are approximately $s$-sparse). We establish\nidentification for all $N \\times 2^p$ parameters despite unobserved\nconfounding. We propose an estimation procedure, Synthetic Combinations, and\nestablish that it is finite-sample consistent and asymptotically normal under\nprecise conditions on the observation pattern. Our results imply consistent\nestimation given $\\text{poly}(r) \\times \\left( N + s^2p\\right)$ observations,\nwhile previous methods have sample complexity scaling as\n$\\min(N \\times s^2p, \\ \\text{poly}(r) \\times (N + 2^p))$. We use Synthetic Combinations to propose a\ndata-efficient experimental design. Empirically, Synthetic Combinations\noutperforms competing approaches on a real-world dataset on movie\nrecommendations. Lastly, we extend our analysis to do causal inference where\nthe intervention is a permutation over $p$ items (e.g., rankings)."}, "http://arxiv.org/abs/2308.15062": {"title": "Forecasting with Feedback", "link": "http://arxiv.org/abs/2308.15062", "description": "Systematically biased forecasts are typically interpreted as evidence of\nforecasters' irrationality and/or asymmetric loss. In this paper we propose an\nalternative explanation: when forecasts inform economic policy decisions, and\nthe resulting actions affect the realization of the forecast target itself,\nforecasts may be optimally biased even under quadratic loss. The result arises\nin environments in which the forecaster is uncertain about the decision maker's\nreaction to the forecast, which is presumably the case in most applications. We\nillustrate the empirical relevance of our theory by reviewing some stylized\nproperties of Green Book inflation forecasts and relating them to the\npredictions from our model. 
Our results point out that the presence of policy\nfeedback poses a challenge to traditional tests of forecast rationality."}, "http://arxiv.org/abs/2309.05639": {"title": "Forecasted Treatment Effects", "link": "http://arxiv.org/abs/2309.05639", "description": "We consider estimation and inference of the effects of a policy in the\nabsence of a control group. We obtain unbiased estimators of individual\n(heterogeneous) treatment effects and a consistent and asymptotically normal\nestimator of the average treatment effect. Our estimator averages over unbiased\nforecasts of individual counterfactuals, based on a (short) time series of\npre-treatment data. The paper emphasizes the importance of focusing on forecast\nunbiasedness rather than accuracy when the end goal is estimation of average\ntreatment effects. We show that simple basis function regressions ensure\nforecast unbiasedness for a broad class of data-generating processes for the\ncounterfactuals, even in short panels. In contrast, model-based forecasting\nrequires stronger assumptions and is prone to misspecification and estimation\nbias. We show that our method can replicate the findings of some previous\nempirical studies, but without using a control group."}, "http://arxiv.org/abs/1812.10820": {"title": "A $t$-test for synthetic controls", "link": "http://arxiv.org/abs/1812.10820", "description": "We propose a practical and robust method for making inferences on average\ntreatment effects estimated by synthetic controls. We develop a $K$-fold\ncross-fitting procedure for bias correction. To avoid the difficult estimation\nof the long-run variance, inference is based on a self-normalized\n$t$-statistic, which has an asymptotically pivotal $t$-distribution. Our\n$t$-test is easy to implement, provably robust against misspecification, and\nvalid with stationary and non-stationary data. It demonstrates an excellent\nsmall sample performance in application-based simulations and performs well\nrelative to alternative methods. We illustrate the usefulness of the $t$-test\nby revisiting the effect of carbon taxes on emissions."}, "http://arxiv.org/abs/2108.12419": {"title": "Revisiting Event Study Designs: Robust and Efficient Estimation", "link": "http://arxiv.org/abs/2108.12419", "description": "We develop a framework for difference-in-differences designs with staggered\ntreatment adoption and heterogeneous causal effects. We show that conventional\nregression-based estimators fail to provide unbiased estimates of relevant\nestimands absent strong restrictions on treatment-effect homogeneity. We then\nderive the efficient estimator addressing this challenge, which takes an\nintuitive \"imputation\" form when treatment-effect heterogeneity is\nunrestricted. We characterize the asymptotic behavior of the estimator, propose\ntools for inference, and develop tests for identifying assumptions. Our method\napplies with time-varying controls, in triple-difference designs, and with\ncertain non-binary treatments. We show the practical relevance of our results\nin a simulation study and an application. 
Studying the consumption response to\ntax rebates in the United States, we find that the notional marginal propensity\nto consume is between 8 and 11 percent in the first quarter - about half as\nlarge as benchmark estimates used to calibrate macroeconomic models - and\npredominantly occurs in the first month after the rebate."}, "http://arxiv.org/abs/2301.06720": {"title": "Testing Firm Conduct", "link": "http://arxiv.org/abs/2301.06720", "description": "Evaluating policy in imperfectly competitive markets requires understanding\nfirm behavior. While researchers test conduct via model selection and\nassessment, we present advantages of Rivers and Vuong (2002) (RV) model\nselection under misspecification. However, degeneracy of RV invalidates\ninference. With a novel definition of weak instruments for testing, we connect\ndegeneracy to instrument strength, derive weak instrument properties of RV, and\nprovide a diagnostic for weak instruments by extending the framework of Stock\nand Yogo (2005) to model selection. We test vertical conduct (Villas-Boas,\n2007) using common instrument sets. Some are weak, providing no power. Strong\ninstruments support manufacturers setting retail prices."}, "http://arxiv.org/abs/2307.13686": {"title": "Characteristics and Predictive Modeling of Short-term Impacts of Hurricanes on the US Employment", "link": "http://arxiv.org/abs/2307.13686", "description": "This study examines the short-term employment changes in the US after\nhurricane impacts. An analysis of hurricane events during 1990-2021 suggests\nthat county-level employment changes in the initial month are small on average,\nthough large employment losses (>30%) can occur after extreme cyclones. The\noverall small changes partly result from compensation among opposite changes in\nemployment sectors, such as the construction and leisure and hospitality\nsectors. An analysis of these extreme cases highlights concentrated employment\nlosses in the service-providing industries and delayed, robust employment gains\nrelated to reconstruction activities. The overall employment shock is\nnegatively correlated with the metrics of cyclone hazards (e.g., extreme wind\nand precipitation) and geospatial details of impacts (e.g., cyclone-entity\ndistance). Additionally, non-cyclone factors such as county characteristics\nalso strongly affect short-term employment changes. The findings inform\npredictive modeling of short-term employment changes and help deliver promising\nskills for service-providing industries and high-impact cyclones. Specifically,\nthe Random Forests model, which can account for nonlinear relationships,\ngreatly outperforms the multiple linear regression model commonly used by\neconomics studies. Overall, our findings may help improve post-cyclone aid\nprograms and the modeling of hurricanes socioeconomic impacts in a changing\nclimate."}, "http://arxiv.org/abs/2308.08958": {"title": "Linear Regression with Weak Exogeneity", "link": "http://arxiv.org/abs/2308.08958", "description": "This paper studies linear time series regressions with many regressors. Weak\nexogeneity is the most used identifying assumption in time series. Weak\nexogeneity requires the structural error to have zero conditional expectation\ngiven the present and past regressor values, allowing errors to correlate with\nfuture regressor realizations. We show that weak exogeneity in time series\nregressions with many controls may produce substantial biases and even render\nthe least squares (OLS) estimator inconsistent. 
The bias arises in settings\nwith many regressors because the normalized OLS design matrix remains\nasymptotically random and correlates with the regression error when only weak\n(but not strict) exogeneity holds. This bias's magnitude increases with the\nnumber of regressors and their average autocorrelation. To address this issue,\nwe propose an innovative approach to bias correction that yields a new\nestimator with improved properties relative to OLS. We establish consistency\nand conditional asymptotic Gaussianity of this new estimator and provide a\nmethod for inference."}, "http://arxiv.org/abs/2401.09874": {"title": "A Quantile Nelson-Siegel model", "link": "http://arxiv.org/abs/2401.09874", "description": "A widespread approach to modelling the interaction between macroeconomic\nvariables and the yield curve relies on three latent factors usually\ninterpreted as the level, slope, and curvature (Diebold et al., 2006). This\napproach is inherently focused on the conditional mean of the yields and\npostulates a dynamic linear model where the latent factors smoothly change over\ntime. However, periods of deep crisis, such as the Great Recession and the\nrecent pandemic, have highlighted the importance of statistical models that\naccount for asymmetric shocks and are able to forecast the tails of a\nvariable's distribution. A new version of the dynamic three-factor model is\nproposed to address this issue based on quantile regressions. The novel\napproach leverages the potential of quantile regression to model the entire\n(conditional) distribution of the yields instead of restricting to its mean. An\napplication to US data from the 1970s shows the significant heterogeneity of\nthe interactions between financial and macroeconomic variables across different\nquantiles. Moreover, an out-of-sample forecasting exercise showcases the\nproposed method's advantages in predicting the yield distribution tails\ncompared to the standard conditional mean model. Finally, by inspecting the\nposterior distribution of the three factors during the recent major crises, new\nevidence is found that supports the greater and longer-lasting negative impact\nof the great recession on the yields compared to the COVID-19 pandemic."}, "http://arxiv.org/abs/2401.10054": {"title": "Nowcasting economic activity in European regions using a mixed-frequency dynamic factor model", "link": "http://arxiv.org/abs/2401.10054", "description": "Timely information about the state of regional economies can be essential for\nplanning, implementing and evaluating locally targeted economic policies.\nHowever, European regional accounts for output are published at an annual\nfrequency and with a two-year delay. To obtain robust and more timely measures\nin a computationally efficient manner, we propose a mixed-frequency dynamic\nfactor model that accounts for national information to produce high-frequency\nestimates of the regional gross value added (GVA). We show that our model\nproduces reliable nowcasts of GVA in 162 regions across 12 European countries."}, "http://arxiv.org/abs/2108.13707": {"title": "Wild Bootstrap for Instrumental Variables Regressions with Weak and Few Clusters", "link": "http://arxiv.org/abs/2108.13707", "description": "We study the wild bootstrap inference for instrumental variable regressions\nin the framework of a small number of large clusters in which the number of\nclusters is viewed as fixed and the number of observations for each cluster\ndiverges to infinity. 
We first show that the wild bootstrap Wald test, with or\nwithout using the cluster-robust covariance estimator, controls size\nasymptotically up to a small error as long as the parameters of endogenous\nvariables are strongly identified in at least one of the clusters. Then, we\nestablish the required number of strong clusters for the test to have power\nagainst local alternatives. We further develop a wild bootstrap Anderson-Rubin\ntest for the full-vector inference and show that it controls size\nasymptotically up to a small error even under weak or partial identification in\nall clusters. We illustrate the good finite sample performance of the new\ninference methods using simulations and provide an empirical application to a\nwell-known dataset about US local labor markets."}, "http://arxiv.org/abs/2303.13598": {"title": "Bootstrap-Assisted Inference for Generalized Grenander-type Estimators", "link": "http://arxiv.org/abs/2303.13598", "description": "Westling and Carone (2020) proposed a framework for studying the large sample\ndistributional properties of generalized Grenander-type estimators, a versatile\nclass of nonparametric estimators of monotone functions. The limiting\ndistribution of those estimators is representable as the left derivative of the\ngreatest convex minorant of a Gaussian process whose covariance kernel can be\ncomplicated and whose monomial mean can be of unknown order (when the degree of\nflatness of the function of interest is unknown). The standard nonparametric\nbootstrap is unable to consistently approximate the large sample distribution\nof the generalized Grenander-type estimators even if the monomial order of the\nmean is known, making statistical inference a challenging endeavour in\napplications. To address this inferential problem, we present a\nbootstrap-assisted inference procedure for generalized Grenander-type\nestimators. The procedure relies on a carefully crafted, yet automatic,\ntransformation of the estimator. Moreover, our proposed method can be made\n``flatness robust'' in the sense that it can be made adaptive to the (possibly\nunknown) degree of flatness of the function of interest. The method requires\nonly the consistent estimation of a single scalar quantity, for which we\npropose an automatic procedure based on numerical derivative estimation and the\ngeneralized jackknife. Under random sampling, our inference method can be\nimplemented using a computationally attractive exchangeable bootstrap\nprocedure. We illustrate our methods with examples and we also provide a small\nsimulation study. The development of formal results is made possible by some\ntechnical results that may be of independent interest."}, "http://arxiv.org/abs/2401.10261": {"title": "How industrial clusters influence the growth of the regional GDP: A spatial-approach", "link": "http://arxiv.org/abs/2401.10261", "description": "In this paper, we employ spatial econometric methods to analyze panel data\nfrom German NUTS 3 regions. Our goal is to gain a deeper understanding of the\nsignificance and interdependence of industry clusters in shaping the dynamics\nof GDP. To achieve a more nuanced spatial differentiation, we introduce\nindicator matrices for each industry sector which allows for extending the\nspatial Durbin model to a new version of it. This approach is essential due to\nboth the economic importance of these sectors and the potential issue of\nomitted variables. 
Failing to account for industry sectors can lead to omitted\nvariable bias and estimation problems. To assess the effects of the major\nindustry sectors, we incorporate eight distinct branches of industry into our\nanalysis. According to prevailing economic theory, these clusters should have a\npositive impact on the regions they are associated with. Our findings indeed\nreveal highly significant impacts, which can be either positive or negative, of\nspecific sectors on local GDP growth. Spatially, we observe that direct and\nindirect effects can exhibit opposite signs, indicative of heightened\ncompetitiveness within and between industry sectors. Therefore, we recommend\nthat industry sectors should be taken into consideration when conducting\nspatial analysis of GDP. Doing so allows for a more comprehensive understanding\nof the economic dynamics at play."}, "http://arxiv.org/abs/2111.00822": {"title": "Financial-cycle ratios and medium-term predictions of GDP: Evidence from the United States", "link": "http://arxiv.org/abs/2111.00822", "description": "Using a large quarterly macroeconomic dataset for the period 1960-2017, we\ndocument the ability of specific financial ratios from the housing market and\nfirms' aggregate balance sheets to predict GDP over medium-term horizons in the\nUnited States. A cyclically adjusted house price-to-rent ratio and the\nliabilities-to-income ratio of the non-financial non-corporate business sector\nprovide the best in-sample and out-of-sample predictions of GDP growth over\nhorizons of one to five years, based on a wide variety of rankings. Small\nforecasting models that include these indicators outperform popular\nhigh-dimensional models and forecast combinations. The predictive power of the\ntwo ratios appears strong during both recessions and expansions, stable over\ntime, and consistent with well-established macro-finance theory."}, "http://arxiv.org/abs/2401.11016": {"title": "Bounding Consideration Probabilities in Consider-Then-Choose Ranking Models", "link": "http://arxiv.org/abs/2401.11016", "description": "A common theory of choice posits that individuals make choices in a two-step\nprocess, first selecting some subset of the alternatives to consider before\nmaking a selection from the resulting consideration set. However, inferring\nunobserved consideration sets (or item consideration probabilities) in this\n\"consider then choose\" setting poses significant challenges, because even\nsimple models of consideration with strong independence assumptions are not\nidentifiable, even if item utilities are known. We consider a natural extension\nof consider-then-choose models to a top-$k$ ranking setting, where we assume\nrankings are constructed according to a Plackett-Luce model after sampling a\nconsideration set. While item consideration probabilities remain non-identified\nin this setting, we prove that knowledge of item utilities allows us to infer\nbounds on the relative sizes of consideration probabilities. Additionally,\ngiven a condition on the expected consideration set size, we derive absolute\nupper and lower bounds on item consideration probabilities. We also provide\nalgorithms to tighten those bounds on consideration probabilities by\npropagating inferred constraints. Thus, we show that we can learn useful\ninformation about consideration probabilities despite not being able to\nidentify them precisely. 
We demonstrate our methods on a ranking dataset from a\npsychology experiment with two different ranking tasks (one with fixed\nconsideration sets and one with unknown consideration sets). This combination\nof data allows us to estimate utilities and then learn about unknown\nconsideration probabilities using our bounds."}, "http://arxiv.org/abs/2401.11046": {"title": "Information Based Inference in Models with Set-Valued Predictions and Misspecification", "link": "http://arxiv.org/abs/2401.11046", "description": "This paper proposes an information-based inference method for partially\nidentified parameters in incomplete models that is valid both when the model is\ncorrectly specified and when it is misspecified. Key features of the method\nare: (i) it is based on minimizing a suitably defined Kullback-Leibler\ninformation criterion that accounts for incompleteness of the model and\ndelivers a non-empty pseudo-true set; (ii) it is computationally tractable;\n(iii) its implementation is the same for both correctly and incorrectly\nspecified models; (iv) it exploits all information provided by variation in\ndiscrete and continuous covariates; (v) it relies on Rao's score statistic,\nwhich is shown to be asymptotically pivotal."}, "http://arxiv.org/abs/2401.11229": {"title": "Estimation with Pairwise Observations", "link": "http://arxiv.org/abs/2401.11229", "description": "The paper introduces a new estimation method for the standard linear\nregression model. The procedure is not driven by the optimisation of any\nobjective function rather, it is a simple weighted average of slopes from\nobservation pairs. The paper shows that such estimator is consistent for\ncarefully selected weights. Other properties, such as asymptotic distributions,\nhave also been derived to facilitate valid statistical inference. Unlike\ntraditional methods, such as Least Squares and Maximum Likelihood, among\nothers, the estimated residual of this estimator is not by construction\northogonal to the explanatory variables of the model. This property allows a\nwide range of practical applications, such as the testing of endogeneity,\ni.e.,the correlation between the explanatory variables and the disturbance\nterms, and potentially several others."}, "http://arxiv.org/abs/2401.11422": {"title": "Local Identification in the Instrumental Variable Multivariate Quantile Regression Model", "link": "http://arxiv.org/abs/2401.11422", "description": "The instrumental variable (IV) quantile regression model introduced by\nChernozhukov and Hansen (2005) is a useful tool for analyzing quantile\ntreatment effects in the presence of endogeneity, but when outcome variables\nare multidimensional, it is silent on the joint distribution of different\ndimensions of each variable. To overcome this limitation, we propose an IV\nmodel built on the optimal-transport-based multivariate quantile that takes\ninto account the correlation between the entries of the outcome variable. We\nthen provide a local identification result for the model. Surprisingly, we find\nthat the support size of the IV required for the identification is independent\nof the dimension of the outcome vector, as long as the IV is sufficiently\ninformative. 
Our result follows from a general identification theorem that we\nestablish, which has independent theoretical significance."}, "http://arxiv.org/abs/2401.12050": {"title": "A Bracketing Relationship for Long-Term Policy Evaluation with Combined Experimental and Observational Data", "link": "http://arxiv.org/abs/2401.12050", "description": "Combining short-term experimental data with observational data enables\ncredible long-term policy evaluation. The literature offers two key but\nnon-nested assumptions, namely the latent unconfoundedness (LU; Athey et al.,\n2020) and equi-confounding bias (ECB; Ghassami et al., 2022) conditions, to\ncorrect observational selection. Committing to the wrong assumption leads to\nbiased estimation. To mitigate such risks, we provide a novel bracketing\nrelationship (cf. Angrist and Pischke, 2009) repurposed for the setting with\ndata combination: the LU-based estimand and the ECB-based estimand serve as the\nlower and upper bounds, respectively, with the true causal effect lying in\nbetween if either assumption holds. For researchers further seeking point\nestimates, our Lalonde-style exercise suggests the conservatively more robust\nLU-based lower bounds align closely with the hold-out experimental estimates\nfor educational policy evaluation. We investigate the economic substantives of\nthese findings through the lens of a nonparametric class of selection\nmechanisms and sensitivity analysis. We uncover as key the sub-martingale\nproperty and sufficient-statistics role (Chetty, 2009) of the potential\noutcomes of student test scores (Chetty et al., 2011, 2014)."}, "http://arxiv.org/abs/2401.12084": {"title": "Temporal Aggregation for the Synthetic Control Method", "link": "http://arxiv.org/abs/2401.12084", "description": "The synthetic control method (SCM) is a popular approach for estimating the\nimpact of a treatment on a single unit with panel data. Two challenges arise\nwith higher frequency data (e.g., monthly versus yearly): (1) achieving\nexcellent pre-treatment fit is typically more challenging; and (2) overfitting\nto noise is more likely. Aggregating data over time can mitigate these problems\nbut can also destroy important signal. In this paper, we bound the bias for SCM\nwith disaggregated and aggregated outcomes and give conditions under which\naggregating tightens the bounds. We then propose finding weights that balance\nboth disaggregated and aggregated series."}, "http://arxiv.org/abs/1807.11835": {"title": "The econometrics of happiness: Are we underestimating the returns to education and income?", "link": "http://arxiv.org/abs/1807.11835", "description": "This paper describes a fundamental and empirically conspicuous problem\ninherent to surveys of human feelings and opinions in which subjective\nresponses are elicited on numerical scales. The paper also proposes a solution.\nThe problem is a tendency by some individuals -- particularly those with low\nlevels of education -- to simplify the response scale by considering only a\nsubset of possible responses such as the lowest, middle, and highest. In\nprinciple, this ``focal value rounding'' (FVR) behavior renders invalid even\nthe weak ordinality assumption often used in analysis of such data. 
With\n``happiness'' or life satisfaction data as an example, descriptive methods and\na multinomial logit model both show that the effect is large and that education\nand, to a lesser extent, income level are predictors of FVR behavior.\n\nA model simultaneously accounting for the underlying wellbeing and for the\ndegree of FVR is able to estimate the latent subjective wellbeing, i.e.~the\ncounterfactual full-scale responses for all respondents, the biases associated\nwith traditional estimates, and the fraction of respondents who exhibit FVR.\nAddressing this problem helps to resolve a longstanding puzzle in the life\nsatisfaction literature, namely that the returns to education, after adjusting\nfor income, appear to be small or negative. Due to the same econometric\nproblem, the marginal utility of income in a subjective wellbeing sense has\nbeen consistently underestimated."}, "http://arxiv.org/abs/2209.04329": {"title": "Heterogeneous Treatment Effect Bounds under Sample Selection with an Application to the Effects of Social Media on Political Polarization", "link": "http://arxiv.org/abs/2209.04329", "description": "We propose a method for estimation and inference for bounds for heterogeneous\ncausal effect parameters in general sample selection models where the treatment\ncan affect whether an outcome is observed and no exclusion restrictions are\navailable. The method provides conditional effect bounds as functions of policy\nrelevant pre-treatment variables. It allows for conducting valid statistical\ninference on the unidentified conditional effects. We use a flexible\ndebiased/double machine learning approach that can accommodate non-linear\nfunctional forms and high-dimensional confounders. Easily verifiable high-level\nconditions for estimation, misspecification robust confidence intervals, and\nuniform confidence bands are provided as well. We re-analyze data from a large\nscale field experiment on Facebook on counter-attitudinal news subscription\nwith attrition. Our method yields substantially tighter effect bounds compared\nto conventional methods and suggests depolarization effects for younger users."}, "http://arxiv.org/abs/2303.07287": {"title": "Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm", "link": "http://arxiv.org/abs/2303.07287", "description": "In non-asymptotic learning, variance-type parameters of sub-Gaussian\ndistributions are of paramount importance. However, directly estimating these\nparameters using the empirical moment generating function (MGF) is infeasible.\nTo address this, we suggest using the sub-Gaussian intrinsic moment norm\n[Buldygin and Kozachenko (2000), Theorem 1.3] achieved by maximizing a sequence\nof normalized moments. Significantly, the suggested norm can not only\nreconstruct the exponential moment bounds of MGFs but also provide tighter\nsub-Gaussian concentration inequalities. In practice, we provide an intuitive\nmethod for assessing whether data with a finite sample size is sub-Gaussian,\nutilizing the sub-Gaussian plot. The intrinsic moment norm can be robustly\nestimated via a simple plug-in approach. Our theoretical findings are also\napplicable to reinforcement learning, including the multi-armed bandit\nscenario."}, "http://arxiv.org/abs/2307.10067": {"title": "Weak Factors are Everywhere", "link": "http://arxiv.org/abs/2307.10067", "description": "Factor Sequences are stochastic double sequences indexed in time and\ncross-section which have a so called factor structure. 
The term was coined by\nForni and Lippi (2001), who introduced dynamic factor sequences. We introduce\nthe distinction between dynamic and static factor sequences, which has been\noverlooked in the literature. Static factor sequences, where the static factors\nare modeled by a dynamic system, are the most common model of macro-econometric\nfactor analysis, building on Chamberlain and Rothschild (1983a); Stock and\nWatson (2002a); Bai and Ng (2002).\n\nWe show that there exist two types of common components: a dynamic and a\nstatic common component. The difference between the two is the weak\ncommon component, which is spanned by (potentially infinitely many) weak\nfactors. We also show that the dynamic common component of a dynamic factor\nsequence is causally subordinated to the output under suitable conditions. As a\nconsequence, only the dynamic common component can be interpreted as the\nprojection on the infinite past of the common innovations of the economy, i.e.,\nthe part which is dynamically common. On the other hand, the static common\ncomponent captures only the contemporaneous co-movement."}, "http://arxiv.org/abs/2401.12309": {"title": "Interpreting Event-Studies from Recent Difference-in-Differences Methods", "link": "http://arxiv.org/abs/2401.12309", "description": "This note discusses the interpretation of event-study plots produced by\nrecent difference-in-differences methods. I show that even when specialized to\nthe case of non-staggered treatment timing, the default plots produced by\nsoftware for three of the most popular recent methods (de Chaisemartin and\nD'Haultfoeuille, 2020; Callaway and Sant'Anna, 2021; Borusyak, Jaravel and\nSpiess, 2024) do not match those of traditional two-way fixed effects (TWFE)\nevent-studies: the new methods may show a kink or jump at the time of treatment\neven when the TWFE event-study shows a straight line. This difference stems\nfrom the fact that the new methods construct the pre-treatment coefficients\nasymmetrically from the post-treatment coefficients. As a result, visual\nheuristics for analyzing TWFE event-study plots should not be immediately\napplied to those from these methods. I conclude with practical recommendations\nfor constructing and interpreting event-study plots when using these methods."}, "http://arxiv.org/abs/2104.00655": {"title": "Local Projections vs", "link": "http://arxiv.org/abs/2104.00655", "description": "We conduct a simulation study of Local Projection (LP) and Vector\nAutoregression (VAR) estimators of structural impulse responses across\nthousands of data generating processes, designed to mimic the properties of the\nuniverse of U.S. macroeconomic data. Our analysis considers various\nidentification schemes and several variants of LP and VAR estimators, employing\nbias correction, shrinkage, or model averaging. A clear bias-variance trade-off\nemerges: LP estimators have lower bias than VAR estimators, but they also have\nsubstantially higher variance at intermediate and long horizons. Bias-corrected\nLP is the preferred method if and only if the researcher overwhelmingly\nprioritizes bias. 
For researchers who also care about precision, VAR methods\nare the most attractive: Bayesian VARs at short and long horizons, and\nleast-squares VARs at intermediate and long horizons."}, "http://arxiv.org/abs/2210.13843": {"title": "GLS under Monotone Heteroskedasticity", "link": "http://arxiv.org/abs/2210.13843", "description": "Generalized least squares (GLS) is one of the most basic tools in\nregression analysis. A major issue in implementing the GLS is estimation of the\nconditional variance function of the error term, which typically requires a\nrestrictive functional form assumption for parametric estimation or smoothing\nparameters for nonparametric estimation. In this paper, we propose an\nalternative approach to estimate the conditional variance function under\nnonparametric monotonicity constraints by utilizing the isotonic regression\nmethod. Our GLS estimator is shown to be asymptotically equivalent to the\ninfeasible GLS estimator with knowledge of the conditional error variance, and\ninvolves only some tuning to trim boundary observations, both for point\nestimation and for interval estimation or hypothesis testing. Our analysis\nextends the scope of the isotonic regression method by showing that the\nisotonic estimates, possibly with generated variables, can be employed as first\nstage estimates to be plugged in for semiparametric objects. Simulation studies\nillustrate excellent finite sample performance of the proposed method. As an\nempirical example, we revisit Acemoglu and Restrepo's (2017) study on the\nrelationship between an aging population and economic growth to illustrate how\nour GLS estimator effectively reduces estimation errors."}, "http://arxiv.org/abs/2305.04137": {"title": "Volatility of Volatility and Leverage Effect from Options", "link": "http://arxiv.org/abs/2305.04137", "description": "We propose model-free (nonparametric) estimators of the volatility of\nvolatility and leverage effect using high-frequency observations of short-dated\noptions. At each point in time, we integrate available options into estimates\nof the conditional characteristic function of the price increment until the\noptions' expiration and we use these estimates to recover spot volatility. Our\nvolatility of volatility estimator is then formed from the sample variance and\nfirst-order autocovariance of the spot volatility increments, with the latter\ncorrecting for the bias in the former due to option observation errors. The\nleverage effect estimator is the sample covariance between price increments and\nthe estimated volatility increments. The rate of convergence of the estimators\ndepends on the diffusive innovations in the latent volatility process as well\nas on the observation error in the options with strikes in the vicinity of the\ncurrent spot price. Feasible inference is developed in a way that does not\nrequire prior knowledge of the source of estimation error that is\nasymptotically dominating."}, "http://arxiv.org/abs/2401.13057": {"title": "Inference under partial identification with minimax test statistics", "link": "http://arxiv.org/abs/2401.13057", "description": "We provide a means of computing and estimating the asymptotic distributions\nof test statistics based on an outer minimization of an inner maximization.\nSuch test statistics, which arise frequently in moment models, are of special\ninterest in providing hypothesis tests under partial identification. 
Under\ngeneral conditions, we provide an asymptotic characterization of such test\nstatistics using the minimax theorem, and a means of computing critical values\nusing the bootstrap. Under some light regularity assumptions, our results\nprovide a basis for several asymptotic approximations that have been provided\nfor partially identified hypothesis tests, and extend them by mitigating their\ndependence on local linear approximations of the parameter space. These\nasymptotic results are generally simple to state and straightforward to compute\n(e.g. adversarially)."}, "http://arxiv.org/abs/2401.13179": {"title": "Realized Stochastic Volatility Model with Skew-t Distributions for Improved Volatility and Quantile Forecasting", "link": "http://arxiv.org/abs/2401.13179", "description": "Forecasting volatility and quantiles of financial returns is essential for\naccurately measuring financial tail risks, such as value-at-risk and expected\nshortfall. The critical elements in these forecasts involve understanding the\ndistribution of financial returns and accurately estimating volatility. This\npaper introduces an advancement to the traditional stochastic volatility model,\ntermed the realized stochastic volatility model, which integrates realized\nvolatility as a precise estimator of volatility. To capture the well-known\ncharacteristics of return distribution, namely skewness and heavy tails, we\nincorporate three types of skew-t distributions. Among these, two distributions\ninclude the skew-normal feature, offering enhanced flexibility in modeling the\nreturn distribution. We employ a Bayesian estimation approach using the Markov\nchain Monte Carlo method and apply it to major stock indices. Our empirical\nanalysis, utilizing data from US and Japanese stock indices, indicates that the\ninclusion of both skewness and heavy tails in daily returns significantly\nimproves the accuracy of volatility and quantile forecasts."}, "http://arxiv.org/abs/2401.13370": {"title": "New accessibility measures based on unconventional big data sources", "link": "http://arxiv.org/abs/2401.13370", "description": "In health econometric studies we are often interested in quantifying aspects\nrelated to the accessibility of medical infrastructures. The increasing\navailability of data automatically collected through unconventional sources\n(such as webscraping, crowdsourcing or internet of things) recently opened\npreviously inconceivable opportunities to researchers interested in measuring\naccessibility and in using it as a tool for real-time monitoring, surveillance\nand health policy definition. This paper contributes to this strand of\nliterature by proposing new accessibility measures that can be continuously fed\nby automatic data collection. We present new measures of accessibility and we\nillustrate their use to study the territorial impact of supply-side shocks to\nhealth facilities. 
We also illustrate the potential of our proposal with a case\nstudy based on a huge set of data (related to the Emergency Departments in\nMilan, Italy) that has been webscraped for the purpose of this paper every 5\nminutes from November 2021 to March 2022, amounting to approximately 5 million\nobservations."}, "http://arxiv.org/abs/2401.13665": {"title": "Entrywise Inference for Causal Panel Data: A Simple and Instance-Optimal Approach", "link": "http://arxiv.org/abs/2401.13665", "description": "In causal inference with panel data under staggered adoption, the goal is to\nestimate and derive confidence intervals for potential outcomes and treatment\neffects. We propose a computationally efficient procedure, involving only\nsimple matrix algebra and singular value decomposition. We derive\nnon-asymptotic bounds on the entrywise error, establishing its proximity to a\nsuitably scaled Gaussian variable. Despite its simplicity, our procedure turns\nout to be instance-optimal, in that our theoretical scaling matches a local\ninstance-wise lower bound derived via a Bayesian Cram\\'{e}r-Rao argument. Using\nour insights, we develop a data-driven procedure for constructing entrywise\nconfidence intervals with pre-specified coverage guarantees. Our analysis is\nbased on a general inferential toolbox for the SVD algorithm applied to the\nmatrix denoising model, which might be of independent interest."}, "http://arxiv.org/abs/2307.01033": {"title": "Expected Shortfall LASSO", "link": "http://arxiv.org/abs/2307.01033", "description": "We propose an $\\ell_1$-penalized estimator for high-dimensional models of\nExpected Shortfall (ES). The estimator is obtained as the solution to a\nleast-squares problem for an auxiliary dependent variable, which is defined as\na transformation of the dependent variable and a pre-estimated tail quantile.\nLeveraging a sparsity condition, we derive a nonasymptotic bound on the\nprediction and estimator errors of the ES estimator, accounting for the\nestimation error in the dependent variable, and provide conditions under which\nthe estimator is consistent. Our estimator is applicable to heavy-tailed\ntime-series data and we find that the number of parameters in the model may\ngrow with the sample size at a rate that depends on the dependence and\nheavy-tailedness in the data. In an empirical application, we consider the\nsystemic risk measure CoES and consider a set of regressors that consists of\nnonlinear transformations of a set of state variables. We find that the\nnonlinear model outperforms an unpenalized and untransformed benchmark\nconsiderably."}, "http://arxiv.org/abs/2401.14395": {"title": "Identification of Nonseparable Models with Endogenous Control Variables", "link": "http://arxiv.org/abs/2401.14395", "description": "We study identification of the treatment effects in a class of nonseparable\nmodels with the presence of potentially endogenous control variables. 
We show\nthat given the treatment variable and the controls are measurably separated,\nthe usual conditional independence condition or availability of excluded\ninstrument suffices for identification."}, "http://arxiv.org/abs/2105.08766": {"title": "Trading-off Bias and Variance in Stratified Experiments and in Matching Studies, Under a Boundedness Condition on the Magnitude of the Treatment Effect", "link": "http://arxiv.org/abs/2105.08766", "description": "I consider estimation of the average treatment effect (ATE), in a population\ncomposed of $S$ groups or units, when one has unbiased estimators of each\ngroup's conditional average treatment effect (CATE). These conditions are met\nin stratified experiments and in matching studies. I assume that each CATE is\nbounded in absolute value by $B$ standard deviations of the outcome, for some\nknown $B$. This restriction may be appealing: outcomes are often standardized\nin applied work, so researchers can use available literature to determine a\nplausible value for $B$. I derive, across all linear combinations of the CATEs'\nestimators, the minimax estimator of the ATE. In two stratified experiments, my\nestimator has twice lower worst-case mean-squared-error than the commonly-used\nstrata-fixed effects estimator. In a matching study with limited overlap, my\nestimator achieves 56\\% of the precision gains of a commonly-used trimming\nestimator, and has an 11 times smaller worst-case mean-squared-error."}, "http://arxiv.org/abs/2308.09535": {"title": "Weak Identification with Many Instruments", "link": "http://arxiv.org/abs/2308.09535", "description": "Linear instrumental variable regressions are widely used to estimate causal\neffects. Many instruments arise from the use of ``technical'' instruments and\nmore recently from the empirical strategy of ``judge design''. This paper\nsurveys and summarizes ideas from recent literature on estimation and\nstatistical inferences with many instruments for a single endogenous regressor.\nWe discuss how to assess the strength of the instruments and how to conduct\nweak identification-robust inference under heteroskedasticity. We establish new\nresults for a jack-knifed version of the Lagrange Multiplier (LM) test\nstatistic. Furthermore, we extend the weak-identification-robust tests to\nsettings with both many exogenous regressors and many instruments. We propose a\ntest that properly partials out many exogenous regressors while preserving the\nre-centering property of the jack-knife. The proposed tests have correct size\nand good power properties."}, "http://arxiv.org/abs/2401.14545": {"title": "Structural Periodic Vector Autoregressions", "link": "http://arxiv.org/abs/2401.14545", "description": "While seasonality inherent to raw macroeconomic data is commonly removed by\nseasonal adjustment techniques before it is used for structural inference, this\napproach might distort valuable information contained in the data. As an\nalternative method to commonly used structural vector autoregressions (SVAR)\nfor seasonally adjusted macroeconomic data, this paper offers an approach in\nwhich the periodicity of not seasonally adjusted raw data is modeled directly\nby structural periodic vector autoregressions (SPVAR) that are based on\nperiodic vector autoregressions (PVAR) as the reduced form model. In comparison\nto a VAR, the PVAR does allow not only for periodically time-varying\nintercepts, but also for periodic autoregressive parameters and innovations\nvariances, respectively. 
As this larger flexibility leads also to an increased\nnumber of parameters, we propose linearly constrained estimation techniques.\nOverall, SPVARs allow to capture seasonal effects and enable a direct and more\nrefined analysis of seasonal patterns in macroeconomic data, which can provide\nuseful insights into their dynamics. Moreover, based on such SPVARs, we propose\na general concept for structural impulse response analyses that takes seasonal\npatterns directly into account. We provide asymptotic theory for estimators of\nperiodic reduced form parameters and structural impulse responses under\nflexible linear restrictions. Further, for the construction of confidence\nintervals, we propose residual-based (seasonal) bootstrap methods that allow\nfor general forms of seasonalities in the data and prove its bootstrap\nconsistency. A real data application on industrial production, inflation and\nfederal funds rate is presented, showing that useful information about the data\nstructure can be lost when using common seasonal adjustment methods."}, "http://arxiv.org/abs/2401.14582": {"title": "High-dimensional forecasting with known knowns and known unknowns", "link": "http://arxiv.org/abs/2401.14582", "description": "Forecasts play a central role in decision making under uncertainty. After a\nbrief review of the general issues, this paper considers ways of using\nhigh-dimensional data in forecasting. We consider selecting variables from a\nknown active set, known knowns, using Lasso and OCMT, and approximating\nunobserved latent factors, known unknowns, by various means. This combines both\nsparse and dense approaches. We demonstrate the various issues involved in\nvariable selection in a high-dimensional setting with an application to\nforecasting UK inflation at different horizons over the period 2020q1-2023q1.\nThis application shows both the power of parsimonious models and the importance\nof allowing for global variables."}, "http://arxiv.org/abs/2401.15205": {"title": "csranks: An R Package for Estimation and Inference Involving Ranks", "link": "http://arxiv.org/abs/2401.15205", "description": "This article introduces the R package csranks for estimation and inference\ninvolving ranks. First, we review methods for the construction of confidence\nsets for ranks, namely marginal and simultaneous confidence sets as well as\nconfidence sets for the identities of the tau-best. Second, we review methods\nfor estimation and inference in regressions involving ranks. Third, we describe\nthe implementation of these methods in csranks and illustrate their usefulness\nin two examples: one about the quantification of uncertainty in the PISA\nranking of countries and one about the measurement of intergenerational\nmobility using rank-rank regressions."}, "http://arxiv.org/abs/2401.15253": {"title": "Testing the Exogeneity of Instrumental Variables and Regressors in Linear Regression Models Using Copulas", "link": "http://arxiv.org/abs/2401.15253", "description": "We provide a Copula-based approach to test the exogeneity of instrumental\nvariables in linear regression models. We show that the exogeneity of\ninstrumental variables is equivalent to the exogeneity of their standard normal\ntransformations with the same CDF value. Then, we establish a Wald test for the\nexogeneity of the instrumental variables. We demonstrate the performance of our\ntest using simulation studies. 
Our simulations show that if the instruments are\nactually endogenous, our test rejects the exogeneity hypothesis approximately\n93% of the time at the 5% significance level. Conversely, when instruments are\ntruly exogenous, it dismisses the exogeneity assumption less than 30% of the\ntime on average for data with 200 observations and less than 2% of the time for\ndata with 1,000 observations. Our results demonstrate our test's effectiveness,\noffering significant value to applied econometricians."}, "http://arxiv.org/abs/2401.16275": {"title": "Graph Neural Networks: Theory for Estimation with Application on Network Heterogeneity", "link": "http://arxiv.org/abs/2401.16275", "description": "This paper presents a novel application of graph neural networks for modeling\nand estimating network heterogeneity. Network heterogeneity is characterized by\nvariations in unit's decisions or outcomes that depend not only on its own\nattributes but also on the conditions of its surrounding neighborhood. We\ndelineate the convergence rate of the graph neural networks estimator, as well\nas its applicability in semiparametric causal inference with heterogeneous\ntreatment effects. The finite-sample performance of our estimator is evaluated\nthrough Monte Carlo simulations. In an empirical setting related to\nmicrofinance program participation, we apply the new estimator to examine the\naverage treatment effects and outcomes of counterfactual policies, and to\npropose an enhanced strategy for selecting the initial recipients of program\ninformation in social networks."}, "http://arxiv.org/abs/2103.01280": {"title": "Dynamic covariate balancing: estimating treatment effects over time with potential local projections", "link": "http://arxiv.org/abs/2103.01280", "description": "This paper studies the estimation and inference of treatment histories in\npanel data settings when treatments change dynamically over time.\n\nWe propose a method that allows for (i) treatments to be assigned dynamically\nover time based on high-dimensional covariates, past outcomes and treatments;\n(ii) outcomes and time-varying covariates to depend on treatment trajectories;\n(iii) heterogeneity of treatment effects.\n\nOur approach recursively projects potential outcomes' expectations on past\nhistories. It then controls the bias by balancing dynamically observable\ncharacteristics. We study the asymptotic and numerical properties of the\nestimator and illustrate the benefits of the procedure in an empirical\napplication."}, "http://arxiv.org/abs/2211.00363": {"title": "Reservoir Computing for Macroeconomic Forecasting with Mixed Frequency Data", "link": "http://arxiv.org/abs/2211.00363", "description": "Macroeconomic forecasting has recently started embracing techniques that can\ndeal with large-scale datasets and series with unequal release periods.\nMIxed-DAta Sampling (MIDAS) and Dynamic Factor Models (DFM) are the two main\nstate-of-the-art approaches that allow modeling series with non-homogeneous\nfrequencies. We introduce a new framework called the Multi-Frequency Echo State\nNetwork (MFESN) based on a relatively novel machine learning paradigm called\nreservoir computing. Echo State Networks (ESN) are recurrent neural networks\nformulated as nonlinear state-space systems with random state coefficients\nwhere only the observation map is subject to estimation. 
MFESNs are\nconsiderably more efficient than DFMs and allow for incorporating many series,\nas opposed to MIDAS models, which are prone to the curse of dimensionality. All\nmethods are compared in extensive multistep forecasting exercises targeting US\nGDP growth. We find that our MFESN models achieve superior or comparable\nperformance over MIDAS and DFMs at a much lower computational cost."}, "http://arxiv.org/abs/2306.07619": {"title": "Kernel Choice Matters for Boundary Inference Using Local Polynomial Density: With Application to Manipulation Testing", "link": "http://arxiv.org/abs/2306.07619", "description": "The local polynomial density (LPD) estimator has been a useful tool for\ninference concerning boundary points of density functions. While it is commonly\nbelieved that kernel selection is not crucial for the performance of\nkernel-based estimators, this paper argues that this does not hold true for LPD\nestimators at boundary points. We find that the commonly used kernels with\ncompact support lead to larger asymptotic and finite-sample variances.\nFurthermore, we present theoretical and numerical evidence showing that such\nunfavorable variance properties negatively affect the performance of\nmanipulation testing in regression discontinuity designs, which typically\nsuffer from low power. Notably, we demonstrate that these issues of increased\nvariance and reduced power can be significantly improved just by using a kernel\nfunction with unbounded support. We recommend the use of the spline-type kernel\n(the Laplace density) and illustrate its superior performance."}, "http://arxiv.org/abs/2309.14630": {"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns", "link": "http://arxiv.org/abs/2309.14630", "description": "Discontinuities in regression functions can reveal important insights. In\nmany contexts, like geographic settings, such discontinuities are multivariate\nand unknown a priori. We propose a non-parametric regression method that\nestimates the location and size of discontinuities by segmenting the regression\nsurface. This estimator is based on a convex relaxation of the Mumford-Shah\nfunctional, for which we establish identification and convergence. We use it to\nshow that an internet shutdown in India resulted in a reduction of economic\nactivity by 25--35%, greatly surpassing previous estimates and shedding new\nlight on the true cost of such shutdowns for digital economies globally."}, "http://arxiv.org/abs/2401.16844": {"title": "Congestion Pricing for Efficiency and Equity: Theory and Applications to the San Francisco Bay Area", "link": "http://arxiv.org/abs/2401.16844", "description": "Congestion pricing, while adopted by many cities to alleviate traffic\ncongestion, raises concerns about widening socioeconomic disparities due to its\ndisproportionate impact on low-income travelers. In this study, we address this\nconcern by proposing a new class of congestion pricing schemes that not only\nminimize congestion levels but also incorporate an equity objective to reduce\ncost disparities among travelers with different willingness-to-pay. Our\nanalysis builds on a congestion game model with heterogeneous traveler\npopulations. We present four pricing schemes that account for practical\nconsiderations, such as the ability to charge differentiated tolls to various\ntraveler populations and the option to toll all or only a subset of edges in\nthe network. 
We evaluate our pricing schemes in the calibrated freeway network\nof the San Francisco Bay Area. We demonstrate that the proposed congestion\npricing schemes improve both efficiency (in terms of reduced average travel\ntime) and equity (the disparities of travel costs experienced by different\npopulations) compared to the current pricing scheme. Moreover, our pricing\nschemes also generate a total revenue comparable to the current pricing scheme.\nOur results further show that pricing schemes charging differentiated prices to\ntraveler populations with varying willingness-to-pay lead to a more equitable\ndistribution of travel costs compared to those that charge a homogeneous price\nto all."}, "http://arxiv.org/abs/2401.17137": {"title": "Partial Identification of Binary Choice Models with Misreported Outcomes", "link": "http://arxiv.org/abs/2401.17137", "description": "This paper provides partial identification of various binary choice models\nwith misreported dependent variables. We propose two distinct approaches by\nexploiting different instrumental variables respectively. In the first\napproach, the instrument is assumed to only affect the true dependent variable\nbut not misreporting probabilities. The second approach uses an instrument that\ninfluences misreporting probabilities monotonically while having no effect on\nthe true dependent variable. Moreover, we derive identification results under\nadditional restrictions on misreporting, including bounded/monotone\nmisreporting probabilities. We use simulations to demonstrate the robust\nperformance of our approaches, and apply the method to study educational\nattainment."}, "http://arxiv.org/abs/2209.00197": {"title": "Switchback Experiments under Geometric Mixing", "link": "http://arxiv.org/abs/2209.00197", "description": "The switchback is an experimental design that measures treatment effects by\nrepeatedly turning an intervention on and off for a whole system. Switchback\nexperiments are a robust way to overcome cross-unit spillover effects; however,\nthey are vulnerable to bias from temporal carryovers. In this paper, we\nconsider properties of switchback experiments in Markovian systems that mix at\na geometric rate. We find that, in this setting, standard switchback designs\nsuffer considerably from carryover bias: Their estimation error decays as\n$T^{-1/3}$ in terms of the experiment horizon $T$, whereas in the absence of\ncarryovers a faster rate of $T^{-1/2}$ would have been possible. We also show,\nhowever, that judicious use of burn-in periods can considerably improve the\nsituation, and enables errors that decay as $\\log(T)^{1/2}T^{-1/2}$. Our formal\nresults are mirrored in an empirical evaluation."}, "https://arxiv.org/abs/2401.17137": {"title": "Partial Identification of Binary Choice Models with Misreported Outcomes", "link": "https://arxiv.org/abs/2401.17137", "description": "This paper provides partial identification of various binary choice models with misreported dependent variables. We propose two distinct approaches by exploiting different instrumental variables respectively. In the first approach, the instrument is assumed to only affect the true dependent variable but not misreporting probabilities. The second approach uses an instrument that influences misreporting probabilities monotonically while having no effect on the true dependent variable. Moreover, we derive identification results under additional restrictions on misreporting, including bounded/monotone misreporting probabilities. 
We use simulations to demonstrate the robust performance of our approaches, and apply the method to study educational attainment."}, "https://arxiv.org/abs/2401.16844": {"title": "Congestion Pricing for Efficiency and Equity: Theory and Applications to the San Francisco Bay Area", "link": "https://arxiv.org/abs/2401.16844", "description": "Congestion pricing, while adopted by many cities to alleviate traffic congestion, raises concerns about widening socioeconomic disparities due to its disproportionate impact on low-income travelers. In this study, we address this concern by proposing a new class of congestion pricing schemes that not only minimize congestion levels but also incorporate an equity objective to reduce cost disparities among travelers with different willingness-to-pay. Our analysis builds on a congestion game model with heterogeneous traveler populations. We present four pricing schemes that account for practical considerations, such as the ability to charge differentiated tolls to various traveler populations and the option to toll all or only a subset of edges in the network. We evaluate our pricing schemes in the calibrated freeway network of the San Francisco Bay Area. We demonstrate that the proposed congestion pricing schemes improve both efficiency (in terms of reduced average travel time) and equity (the disparities of travel costs experienced by different populations) compared to the current pricing scheme. Moreover, our pricing schemes also generate a total revenue comparable to the current pricing scheme. Our results further show that pricing schemes charging differentiated prices to traveler populations with varying willingness-to-pay lead to a more equitable distribution of travel costs compared to those that charge a homogeneous price to all."}, "https://arxiv.org/abs/2209.00197": {"title": "Switchback Experiments under Geometric Mixing", "link": "https://arxiv.org/abs/2209.00197", "description": "The switchback is an experimental design that measures treatment effects by repeatedly turning an intervention on and off for a whole system. Switchback experiments are a robust way to overcome cross-unit spillover effects; however, they are vulnerable to bias from temporal carryovers. In this paper, we consider properties of switchback experiments in Markovian systems that mix at a geometric rate. We find that, in this setting, standard switchback designs suffer considerably from carryover bias: Their estimation error decays as $T^{-1/3}$ in terms of the experiment horizon $T$, whereas in the absence of carryovers a faster rate of $T^{-1/2}$ would have been possible. We also show, however, that judicious use of burn-in periods can considerably improve the situation, and enables errors that decay as $\\log(T)^{1/2}T^{-1/2}$. Our formal results are mirrored in an empirical evaluation."}, "https://arxiv.org/abs/2311.15878": {"title": "Policy Learning with Distributional Welfare", "link": "https://arxiv.org/abs/2311.15878", "description": "In this paper, we explore optimal treatment allocation policies that target distributional welfare. Most literature on treatment choice has considered utilitarian welfare based on the conditional average treatment effect (ATE). While average welfare is intuitive, it may yield undesirable allocations especially when individuals are heterogeneous (e.g., with outliers) - the very reason individualized treatments were introduced in the first place. 
This observation motivates us to propose an optimal policy that allocates the treatment based on the conditional quantile of individual treatment effects (QoTE). Depending on the choice of the quantile probability, this criterion can accommodate a policymaker who is either prudent or negligent. The challenge of identifying the QoTE lies in its requirement for knowledge of the joint distribution of the counterfactual outcomes, which is generally hard to recover even with experimental data. Therefore, we introduce minimax policies that are robust to model uncertainty. A range of identifying assumptions can be used to yield more informative policies. For both stochastic and deterministic policies, we establish the asymptotic bound on the regret of implementing the proposed policies. In simulations and two empirical applications, we compare optimal decisions based on the QoTE with decisions based on other criteria. The framework can be generalized to any setting where welfare is defined as a functional of the joint distribution of the potential outcomes."}, "https://arxiv.org/abs/2401.17595": {"title": "Marginal treatment effects in the absence of instrumental variables", "link": "https://arxiv.org/abs/2401.17595", "description": "We propose a method for defining, identifying, and estimating the marginal treatment effect (MTE) without imposing the instrumental variable (IV) assumptions of independence, exclusion, and separability (or monotonicity). Under a new definition of the MTE based on reduced-form treatment error that is statistically independent of the covariates, we find that the relationship between the MTE and standard treatment parameters holds in the absence of IVs. We provide a set of sufficient conditions ensuring the identification of the defined MTE in an environment of essential heterogeneity. The key conditions include a linear restriction on potential outcome regression functions, a nonlinear restriction on the propensity score, and a conditional mean independence restriction that will lead to additive separability. We prove this identification using the notion of semiparametric identification based on functional form. We suggest consistent semiparametric estimation procedures, and provide an empirical application for the Head Start program to illustrate the usefulness of the proposed method in analyzing heterogeneous causal effects when IVs are elusive."}, "https://arxiv.org/abs/2401.17909": {"title": "Regularizing Discrimination in Optimal Policy Learning with Distributional Targets", "link": "https://arxiv.org/abs/2401.17909", "description": "A decision maker typically (i) incorporates training data to learn about the relative effectiveness of the treatments, and (ii) chooses an implementation mechanism that implies an \"optimal\" predicted outcome distribution according to some target functional. Nevertheless, a discrimination-aware decision maker may not be satisfied with achieving said optimality at the cost of heavily discriminating against subgroups of the population, in the sense that the outcome distribution in a subgroup deviates strongly from the overall optimal outcome distribution. We study a framework that allows the decision maker to penalize for such deviations, while allowing for a wide range of target functionals and discrimination measures to be employed. We establish regret and consistency guarantees for empirical success policies with data-driven tuning parameters, and provide numerical results. 
Furthermore, we briefly illustrate the methods in two empirical settings."}, "https://arxiv.org/abs/2311.13969": {"title": "Was Javert right to be suspicious? Unpacking treatment effect heterogeneity of alternative sentences on time-to-recidivism in Brazil", "link": "https://arxiv.org/abs/2311.13969", "description": "This paper presents new econometric tools to unpack the treatment effect heterogeneity of punishing misdemeanor offenses on time-to-recidivism. We show how one can identify, estimate, and make inferences on the distributional, quantile, and average marginal treatment effects in setups where the treatment selection is endogenous and the outcome of interest, usually a duration variable, is potentially right-censored. We explore our proposed econometric methodology to evaluate the effect of fines and community service sentences as a form of punishment on time-to-recidivism in the State of S\\~ao Paulo, Brazil, between 2010 and 2019, leveraging the as-if random assignment of judges to cases. Our results highlight substantial treatment effect heterogeneity that other tools are not meant to capture. For instance, we find that people whom most judges would punish take longer to recidivate as a consequence of the punishment, while people who would be punished only by strict judges recidivate at an earlier date than if they were not punished. This result suggests that designing sentencing guidelines that encourage strict judges to become more lenient could reduce recidivism."}, "https://arxiv.org/abs/2402.00184": {"title": "The Heterogeneous Aggregate Valence Analysis (HAVAN) Model: A Flexible Approach to Modeling Unobserved Heterogeneity in Discrete Choice Analysis", "link": "https://arxiv.org/abs/2402.00184", "description": "This paper introduces the Heterogeneous Aggregate Valence Analysis (HAVAN) model, a novel class of discrete choice models. We adopt the term \"valence'' to encompass any latent quantity used to model consumer decision-making (e.g., utility, regret, etc.). Diverging from traditional models that parameterize heterogeneous preferences across various product attributes, HAVAN models (pronounced \"haven\") instead directly characterize alternative-specific heterogeneous preferences. This innovative perspective on consumer heterogeneity affords unprecedented flexibility and significantly reduces simulation burdens commonly associated with mixed logit models. In a simulation experiment, the HAVAN model demonstrates superior predictive performance compared to state-of-the-art artificial neural networks. This finding underscores the potential for HAVAN models to improve discrete choice modeling capabilities."}, "https://arxiv.org/abs/2402.00192": {"title": "Finite- and Large-Sample Inference for Ranks using Multinomial Data with an Application to Ranking Political Parties", "link": "https://arxiv.org/abs/2402.00192", "description": "It is common to rank different categories by means of preferences that are revealed through data on choices. A prominent example is the ranking of political candidates or parties using the estimated share of support each one receives in surveys or polls about political attitudes. Since these rankings are computed using estimates of the share of support rather than the true share of support, there may be considerable uncertainty concerning the true ranking of the political candidates or parties. In this paper, we consider the problem of accounting for such uncertainty by constructing confidence sets for the rank of each category. 
We consider both the problem of constructing marginal confidence sets for the rank of a particular category as well as simultaneous confidence sets for the ranks of all categories. A distinguishing feature of our analysis is that we exploit the multinomial structure of the data to develop confidence sets that are valid in finite samples. We additionally develop confidence sets using the bootstrap that are valid only approximately in large samples. We use our methodology to rank political parties in Australia using data from the 2019 Australian Election Survey. We find that our finite-sample confidence sets are informative across the entire ranking of political parties, even in Australian territories with few survey respondents and/or with parties that are chosen by only a small share of the survey respondents. In contrast, the bootstrap-based confidence sets may sometimes be considerably less informative. These findings motivate us to compare these methods in an empirically-driven simulation study, in which we conclude that our finite-sample confidence sets often perform better than their large-sample, bootstrap-based counterparts, especially in settings that resemble our empirical application."}, "https://arxiv.org/abs/2402.00567": {"title": "Stochastic convergence in per capita CO$_2$ emissions", "link": "https://arxiv.org/abs/2402.00567", "description": "This paper studies stochastic convergence of per capita CO$_2$ emissions in 28 OECD countries for the 1901-2009 period. The analysis is carried out at two aggregation levels, first for the whole set of countries (joint analysis) and then separately for developed and developing states (group analysis). A powerful time series methodology, adapted to a nonlinear framework that allows for quadratic trends with possibly smooth transitions between regimes, is applied. This approach provides more robust conclusions in convergence path analysis, enabling (a) robust detection of the presence and, if so, the number of changes in the level and/or slope of the trend of the series, (b) inferences on stationarity of relative per capita CO$_2$ emissions, conditionally on the presence of breaks and smooth transitions between regimes, and (c) estimation of change locations in the convergence paths. Finally, as stochastic convergence is attained when both stationarity around a trend and $\\beta$-convergence hold, the linear approach proposed by Tomljanovich and Vogelsang (2002) is extended in order to allow for more general quadratic models. Overall, joint analysis finds some evidence of stochastic convergence in per capita CO$_2$ emissions. Some dispersion in terms of $\\beta$-convergence is detected by group analysis, particularly among developed countries. This is in accordance with per capita GDP not being the sole determinant of convergence in emissions, with factors like the search for more efficient technologies, fossil fuel substitution, innovation, and possibly the outsourcing of industries also having a crucial role."}, "https://arxiv.org/abs/2402.00584": {"title": "Arellano-Bond LASSO Estimator for Dynamic Linear Panel Models", "link": "https://arxiv.org/abs/2402.00584", "description": "The Arellano-Bond estimator can be severely biased when the time series dimension of the data, $T$, is long. The source of the bias is the large degree of overidentification. We propose a simple two-step approach to deal with this problem. The first step applies LASSO to the cross-section data at each time period to select the most informative moment conditions. 
The second step applies a linear instrumental variable estimator using the instruments constructed from the moment conditions selected in the first step. The two stages are combined using sample-splitting and cross-fitting to avoid overfitting bias. Using asymptotic sequences where the two dimensions of the panel grow with the sample size, we show that the new estimator is consistent and asymptotically normal under much weaker conditions on $T$ than the Arellano-Bond estimator. Our theory covers models with high-dimensional covariates including multiple lags of the dependent variable, which are common in modern applications. We illustrate our approach with an application to the short- and long-term effects of the opening of K-12 schools and other policies on the spread of COVID-19 using weekly county-level panel data from the United States."}, "https://arxiv.org/abs/2402.00788": {"title": "EU-28's progress towards the 2020 renewable energy share", "link": "https://arxiv.org/abs/2402.00788", "description": "This paper assesses the convergence of the EU-28 countries towards their common goal of 20% in the renewable energy share indicator by the year 2020. The potential presence of clubs of convergence towards different steady state equilibria is also analyzed from both the standpoints of global convergence to the 20% goal and specific convergence to the various targets assigned to Member States. Two clubs of convergence are detected in the former case, each corresponding to different RES targets. A probit model is also fitted with the aim of better understanding the determinants of club membership, which seemingly include real GDP per capita, expenditure on environmental protection, energy dependence, and nuclear capacity, with all of them having statistically significant effects. Finally, convergence is also analyzed separately for the transport, heating and cooling, and electricity sectors."}, "https://arxiv.org/abs/2402.00172": {"title": "The Fourier-Malliavin Volatility (FMVol) MATLAB library", "link": "https://arxiv.org/abs/2402.00172", "description": "This paper presents the Fourier-Malliavin Volatility (FMVol) estimation library for MATLAB. This library includes functions that implement Fourier-Malliavin estimators (see Malliavin and Mancino (2002, 2009)) of the volatility and co-volatility of continuous stochastic volatility processes and second-order quantities, like the quarticity (the squared volatility), the volatility of volatility and the leverage (the covariance between changes in the process and changes in its volatility). The Fourier-Malliavin method is fully non-parametric, does not require equally-spaced observations and is robust to measurement errors, or noise, without any preliminary bias correction or pre-treatment of the observations. Further, in its multivariate version, it is intrinsically robust to irregular and asynchronous sampling. 
Although originally introduced for a specific application in financial econometrics, namely the estimation of asset volatilities, the Fourier-Malliavin method is a general method that can be applied whenever one is interested in reconstructing the latent volatility and second-order quantities of a continuous stochastic volatility process from discrete observations."}, "https://arxiv.org/abs/2010.03898": {"title": "Consistent Specification Test of the Quantile Autoregression", "link": "https://arxiv.org/abs/2010.03898", "description": "This paper proposes a test for the joint hypothesis of correct dynamic specification and no omitted latent factors for the Quantile Autoregression. If the composite null is rejected we proceed to disentangle the cause of rejection, i.e., dynamic misspecification or an omitted variable. We establish the asymptotic distribution of the test statistics under fairly weak conditions and show that factor estimation error is negligible. A Monte Carlo study shows that the suggested tests have good finite sample properties. Finally, we undertake an empirical illustration of modelling GDP growth and CPI inflation in the United Kingdom, where we find evidence that factor augmented models are correctly specified in contrast with their non-augmented counterparts when it comes to GDP growth, while also exploring the asymmetric behaviour of the growth and inflation distributions."}, "https://arxiv.org/abs/2203.12740": {"title": "Correcting Attrition Bias using Changes-in-Changes", "link": "https://arxiv.org/abs/2203.12740", "description": "Attrition is a common and potentially important threat to internal validity in treatment effect studies. We extend the changes-in-changes approach to identify the average treatment effect for respondents and the entire study population in the presence of attrition. Our method, which exploits baseline outcome data, can be applied to randomized experiments as well as quasi-experimental difference-in-difference designs. A formal comparison highlights that while widely used corrections typically impose restrictions on whether or how response depends on treatment, our proposed attrition correction exploits restrictions on the outcome model. We further show that the conditions required for our correction can accommodate a broad class of response models that depend on treatment in an arbitrary way. We illustrate the implementation of the proposed corrections in an application to a large-scale randomized experiment."}, "https://arxiv.org/abs/2209.11444": {"title": "Treatment Effects with Multidimensional Unobserved Heterogeneity: Identification of the Marginal Treatment Effect", "link": "https://arxiv.org/abs/2209.11444", "description": "This paper establishes sufficient conditions for the identification of the marginal treatment effects with multivalued treatments. Our model is based on a multinomial choice model with utility maximization. Our MTE generalizes the MTE defined in Heckman and Vytlacil (2005) in binary treatment models. As in the binary case, we can interpret the MTE as the treatment effect for persons who are indifferent between two treatments at a particular level. Our MTE enables one to obtain the treatment effects of those with specific preference orders over the choice set. 
Further, our results can identify other parameters such as the marginal distribution of potential outcomes."}, "https://arxiv.org/abs/2310.17496": {"title": "Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach", "link": "https://arxiv.org/abs/2310.17496", "description": "In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods."}, "https://rss.arxiv.org/abs/2402.00184": {"title": "The Heterogeneous Aggregate Valence Analysis (HAVAN) Model: A Flexible Approach to Modeling Unobserved Heterogeneity in Discrete Choice Analysis", "link": "https://rss.arxiv.org/abs/2402.00184", "description": "This paper introduces the Heterogeneous Aggregate Valence Analysis (HAVAN) model, a novel class of discrete choice models. We adopt the term \"valence'' to encompass any latent quantity used to model consumer decision-making (e.g., utility, regret, etc.). Diverging from traditional models that parameterize heterogeneous preferences across various product attributes, HAVAN models (pronounced \"haven\") instead directly characterize alternative-specific heterogeneous preferences. This innovative perspective on consumer heterogeneity affords unprecedented flexibility and significantly reduces simulation burdens commonly associated with mixed logit models. In a simulation experiment, the HAVAN model demonstrates superior predictive performance compared to state-of-the-art artificial neural networks. This finding underscores the potential for HAVAN models to improve discrete choice modeling capabilities."}, "https://rss.arxiv.org/abs/2402.00192": {"title": "Finite- and Large-Sample Inference for Ranks using Multinomial Data with an Application to Ranking Political Parties", "link": "https://rss.arxiv.org/abs/2402.00192", "description": "It is common to rank different categories by means of preferences that are revealed through data on choices. A prominent example is the ranking of political candidates or parties using the estimated share of support each one receives in surveys or polls about political attitudes. Since these rankings are computed using estimates of the share of support rather than the true share of support, there may be considerable uncertainty concerning the true ranking of the political candidates or parties. In this paper, we consider the problem of accounting for such uncertainty by constructing confidence sets for the rank of each category. We consider both the problem of constructing marginal confidence sets for the rank of a particular category as well as simultaneous confidence sets for the ranks of all categories. 
A distinguishing feature of our analysis is that we exploit the multinomial structure of the data to develop confidence sets that are valid in finite samples. We additionally develop confidence sets using the bootstrap that are valid only approximately in large samples. We use our methodology to rank political parties in Australia using data from the 2019 Australian Election Survey. We find that our finite-sample confidence sets are informative across the entire ranking of political parties, even in Australian territories with few survey respondents and/or with parties that are chosen by only a small share of the survey respondents. In contrast, the bootstrap-based confidence sets may sometimes be considerably less informative. These findings motivate us to compare these methods in an empirically-driven simulation study, in which we conclude that our finite-sample confidence sets often perform better than their large-sample, bootstrap-based counterparts, especially in settings that resemble our empirical application."}, "https://rss.arxiv.org/abs/2402.00567": {"title": "Stochastic convergence in per capita CO$_2$ emissions", "link": "https://rss.arxiv.org/abs/2402.00567", "description": "This paper studies stochastic convergence of per capita CO$_2$ emissions in 28 OECD countries for the 1901-2009 period. The analysis is carried out at two aggregation levels, first for the whole set of countries (joint analysis) and then separately for developed and developing states (group analysis). A powerful time series methodology, adapted to a nonlinear framework that allows for quadratic trends with possibly smooth transitions between regimes, is applied. This approach provides more robust conclusions in convergence path analysis, enabling (a) robust detection of the presence, and if so, the number of changes in the level and/or slope of the trend of the series, (b) inferences on stationarity of relative per capita CO$_2$ emissions, conditionally on the presence of breaks and smooth transitions between regimes, and (c) estimation of change locations in the convergence paths. Finally, as stochastic convergence is attained when both stationarity around a trend and $\\beta$-convergence hold, the linear approach proposed by Tomljanovich and Vogelsang (2002) is extended in order to allow for more general quadratic models. Overall, joint analysis finds some evidence of stochastic convergence in per capita CO$_2$ emissions. Some dispersion in terms of $\\beta$-convergence is detected by group analysis, particularly among developed countries. This is in accordance with per capita GDP not being the sole determinant of convergence in emissions, with factors like search for more efficient technologies, fossil fuel substitution, innovation, and possibly outsources of industries, also having a crucial role."}, "https://rss.arxiv.org/abs/2402.00584": {"title": "Arellano-Bond LASSO Estimator for Dynamic Linear Panel Models", "link": "https://rss.arxiv.org/abs/2402.00584", "description": "The Arellano-Bond estimator can be severely biased when the time series dimension of the data, $T$, is long. The source of the bias is the large degree of overidentification. We propose a simple two-step approach to deal with this problem. The first step applies LASSO to the cross-section data at each time period to select the most informative moment conditions. The second step applies a linear instrumental variable estimator using the instruments constructed from the moment conditions selected in the first step. 
The two stages are combined using sample-splitting and cross-fitting to avoid overfitting bias. Using asymptotic sequences where the two dimensions of the panel grow with the sample size, we show that the new estimator is consistent and asymptotically normal under much weaker conditions on $T$ than the Arellano-Bond estimator. Our theory covers models with high dimensional covariates including multiple lags of the dependent variable, which are common in modern applications. We illustrate our approach with an application to the short and long-term effects of the opening of K-12 schools and other policies on the spread of COVID-19 using weekly county-level panel data from the United States."}, "https://rss.arxiv.org/abs/2402.00788": {"title": "EU-28's progress towards the 2020 renewable energy share", "link": "https://rss.arxiv.org/abs/2402.00788", "description": "This paper assesses the convergence of the EU-28 countries towards their common goal of 20% in the renewable energy share indicator by year 2020. The potential presence of clubs of convergence towards different steady state equilibria is also analyzed from both the standpoints of global convergence to the 20% goal and specific convergence to the various targets assigned to Member States. Two clubs of convergence are detected in the former case, each corresponding to different RES targets. A probit model is also fitted with the aim of better understanding the determinants of club membership, that seemingly include real GDP per capita, expenditure on environmental protection, energy dependence, and nuclear capacity, with all of them having statistically significant effects. Finally, convergence is also analyzed separately for the transport, heating and cooling, and electricity sectors."}, "https://rss.arxiv.org/abs/2402.00172": {"title": "The Fourier-Malliavin Volatility (FMVol) MATLAB library", "link": "https://rss.arxiv.org/abs/2402.00172", "description": "This paper presents the Fourier-Malliavin Volatility (FMVol) estimation library for MATLAB. This library includes functions that implement Fourier- Malliavin estimators (see Malliavin and Mancino (2002, 2009)) of the volatility and co-volatility of continuous stochastic volatility processes and second-order quantities, like the quarticity (the squared volatility), the volatility of volatility and the leverage (the covariance between changes in the process and changes in its volatility). The Fourier-Malliavin method is fully non-parametric, does not require equally-spaced observations and is robust to measurement errors, or noise, without any preliminary bias correction or pre-treatment of the observations. Further, in its multivariate version, it is intrinsically robust to irregular and asynchronous sampling. Although originally introduced for a specific application in financial econometrics, namely the estimation of asset volatilities, the Fourier-Malliavin method is a general method that can be applied whenever one is interested in reconstructing the latent volatility and second-order quantities of a continuous stochastic volatility process from discrete observations."}, "https://rss.arxiv.org/abs/2010.03898": {"title": "Consistent Specification Test of the Quantile Autoregression", "link": "https://rss.arxiv.org/abs/2010.03898", "description": "This paper proposes a test for the joint hypothesis of correct dynamic specification and no omitted latent factors for the Quantile Autoregression. 
If the composite null is rejected we proceed to disentangle the cause of rejection, i.e., dynamic misspecification or an omitted variable. We establish the asymptotic distribution of the test statistics under fairly weak conditions and show that factor estimation error is negligible. A Monte Carlo study shows that the suggested tests have good finite sample properties. Finally, we undertake an empirical illustration of modelling GDP growth and CPI inflation in the United Kingdom, where we find evidence that factor augmented models are correctly specified in contrast with their non-augmented counterparts when it comes to GDP growth, while also exploring the asymmetric behaviour of the growth and inflation distributions."}, "https://rss.arxiv.org/abs/2203.12740": {"title": "Correcting Attrition Bias using Changes-in-Changes", "link": "https://rss.arxiv.org/abs/2203.12740", "description": "Attrition is a common and potentially important threat to internal validity in treatment effect studies. We extend the changes-in-changes approach to identify the average treatment effect for respondents and the entire study population in the presence of attrition. Our method, which exploits baseline outcome data, can be applied to randomized experiments as well as quasi-experimental difference-in-difference designs. A formal comparison highlights that while widely used corrections typically impose restrictions on whether or how response depends on treatment, our proposed attrition correction exploits restrictions on the outcome model. We further show that the conditions required for our correction can accommodate a broad class of response models that depend on treatment in an arbitrary way. We illustrate the implementation of the proposed corrections in an application to a large-scale randomized experiment."}, "https://rss.arxiv.org/abs/2209.11444": {"title": "Treatment Effects with Multidimensional Unobserved Heterogeneity: Identification of the Marginal Treatment Effect", "link": "https://rss.arxiv.org/abs/2209.11444", "description": "This paper establishes sufficient conditions for the identification of the marginal treatment effects with multivalued treatments. Our model is based on a multinomial choice model with utility maximization. Our MTE generalizes the MTE defined in Heckman and Vytlacil (2005) in binary treatment models. As in the binary case, we can interpret the MTE as the treatment effect for persons who are indifferent between two treatments at a particular level. Our MTE enables one to obtain the treatment effects of those with specific preference orders over the choice set. Further, our results can identify other parameters such as the marginal distribution of potential outcomes."}, "https://rss.arxiv.org/abs/2310.17496": {"title": "Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach", "link": "https://rss.arxiv.org/abs/2310.17496", "description": "In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. 
This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods."}, "http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. 
In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. 
Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. 
Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. 
We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. 
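Editorial note: the ensemble Kalman entry above compares update implementations that use perturbed observations, square root filtering, and localization. The sketch below shows only the textbook perturbed-observation update with a small ensemble, under an assumed linear observation operator H and noise covariance R; it is a generic illustration, not the paper's non-asymptotic analysis.

```python
import numpy as np

def enkf_update_perturbed_obs(ensemble, y, H, R, rng):
    """One ensemble Kalman update with perturbed observations.

    ensemble: (n, d) array of prior particles, y: (p,) observation,
    H: (p, d) linear observation operator, R: (p, p) noise covariance.
    """
    n = ensemble.shape[0]
    X = ensemble - ensemble.mean(axis=0)              # centred ensemble
    Y = X @ H.T                                       # centred predicted observations
    C_xy = X.T @ Y / (n - 1)                          # state-observation cross-covariance
    C_yy = Y.T @ Y / (n - 1)                          # predicted-observation covariance
    K = C_xy @ np.linalg.inv(C_yy + R)                # Kalman gain
    perturbed = y + rng.multivariate_normal(np.zeros(len(y)), R, size=n)
    innovations = perturbed - ensemble @ H.T
    return ensemble + innovations @ K.T

# Toy usage: ensemble of 50 particles for a 3-d state, scalar observation.
rng = np.random.default_rng(1)
prior = rng.normal(size=(50, 3))
H = np.array([[1.0, 0.0, 0.0]])
R = np.array([[0.1]])
posterior = enkf_update_perturbed_obs(prior, np.array([0.5]), H, R, rng)
```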
As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method under test. Still,\nmuch of the causal inference literature tends to design over-restricted or misspecified\nstudies. In this paper, we elaborate on the problem of improper simulation\ndesign for causal methods and compile a list of desiderata for an effective\nsimulation framework. We then introduce partially randomized causal simulation\n(PARCS), a simulation framework that meets those desiderata. PARCS synthesizes\ndata based on graphical causal models and a wide range of adjustable\nparameters. There is a legible mapping from usual causal assumptions to the\nparameters; thus, users can identify and specify the subset of related\nparameters and randomize the remaining ones to generate a range of complying\ndata-generating processes for their causal method. The result is a more\ncomprehensive and inclusive empirical investigation for causal claims. Using\nPARCS, we reproduce and extend the simulation studies of two well-known causal\ndiscovery and missing data analysis papers to emphasize the necessity of a\nproper simulation design. Our results show that those papers would have\nimproved and extended their findings had they used PARCS for simulation. The\nframework is also implemented as a Python package. 
By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. 
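Editorial note: for the rare-event entry above, a plain Monte Carlo baseline makes clear why more elaborate schemes are needed, since estimating P(Q(X) > q) by simple averaging needs on the order of 1/p samples when the probability p is small. The sketch below is that baseline only, with a hypothetical quantity of interest and sampler; it is not the energy-based approach of the paper.

```python
import numpy as np

def crude_mc_failure_probability(quantity_of_interest, sampler, threshold, n_samples, rng):
    """Plain Monte Carlo estimate of P(Q(X) > threshold) with a binomial
    standard error; inefficient for very small probabilities, which is what
    motivates importance-sampling, subset-simulation, or EBM-style methods."""
    x = sampler(n_samples, rng)
    exceed = quantity_of_interest(x) > threshold
    p_hat = exceed.mean()
    std_err = np.sqrt(p_hat * (1 - p_hat) / n_samples)
    return p_hat, std_err

# Toy example: Q(x) = ||x|| for a 2-d standard normal input.
rng = np.random.default_rng(2)
p_hat, se = crude_mc_failure_probability(
    quantity_of_interest=lambda x: np.linalg.norm(x, axis=1),
    sampler=lambda n, rng: rng.normal(size=(n, 2)),
    threshold=4.0,
    n_samples=200_000,
    rng=rng,
)
```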
In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. 
By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameter estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to the classical\nresult in fixed-dimensional time series that the localization error of the\nchange-point estimator is $O_{p}(1)$, exact recovery of true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the number\nand locations of the change-point estimator under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed networks,\nwhich are frequently encountered in application domains but largely neglected in\nthe literature. The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\naddresses the lack of uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of the confidence\nintervals, as well as the false discovery rate (FDR) control for multiple node\ncomparisons. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. 
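Editorial note: the conformal entry above modifies the ACI recursion of Gibbs and Candès (2021), in which a miscoverage target is updated by gradient steps. The sketch below shows the basic ACI update with a fixed step size gamma, the very quantity the paper proposes to tune over time; the score-quantile interval construction here is a simplified stand-in, not the paper's procedure.

```python
import numpy as np

def adaptive_conformal_intervals(scores, alpha=0.1, gamma=0.01):
    """Basic adaptive conformal inference (ACI) recursion on a stream of
    conformity scores: alpha_{t+1} = alpha_t + gamma * (alpha - err_t),
    so the target is nudged up after coverage and down after a miss."""
    alpha_t = alpha
    radii, errors = [], []
    for t, s in enumerate(scores):
        past = np.asarray(scores[:t])
        if len(past) == 0:
            radius = np.inf                           # no history yet: trivial set
        else:
            q_level = min(1.0, max(0.0, 1.0 - alpha_t))
            radius = np.quantile(past, q_level)       # empirical quantile of past scores
        err = float(s > radius)                       # 1 if the new score is not covered
        alpha_t = alpha_t + gamma * (alpha - err)     # fixed-step ACI update
        radii.append(radius)
        errors.append(err)
    return np.array(radii), np.array(errors)
```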
Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. 
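Editorial note: the spectral clustering entry above refers to binary clustering via principal component analysis. A minimal version of that idea, assuming a plain numeric data matrix, splits observations by the sign of the first principal component score; the sketch below is illustrative and does not reproduce the paper's analysis of the allometric extension model.

```python
import numpy as np

def binary_pca_clustering(X):
    """Split observations into two groups by the sign of the first
    principal component score (PCA-based binary clustering)."""
    Xc = X - X.mean(axis=0)                       # centre the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                           # first principal component scores
    return (scores > 0).astype(int)               # cluster labels in {0, 1}

# Toy usage: two Gaussian clouds separated along every coordinate.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(100, 5)), rng.normal(2, 1, size=(100, 5))])
labels = binary_pca_clustering(X)
```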
To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}, "http://arxiv.org/abs/2310.04452": {"title": "Short text classification with machine learning in the social sciences: The case of climate change on Twitter", "link": "http://arxiv.org/abs/2310.04452", "description": "To analyse large numbers of texts, social science researchers are\nincreasingly confronting the challenge of text classification. When manual\nlabeling is not possible and researchers have to find automatized ways to\nclassify texts, computer science provides a useful toolbox of machine-learning\nmethods whose performance remains understudied in the social sciences. In this\narticle, we compare the performance of the most widely used text classifiers by\napplying them to a typical research scenario in social science research: a\nrelatively small labeled dataset with infrequent occurrence of categories of\ninterest, which is a part of a large unlabeled dataset. As an example case, we\nlook at Twitter communication regarding climate change, a topic of increasing\nscholarly interest in interdisciplinary social science research. 
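Editorial note: for the short-text classification entry above, a typical supervised baseline of the kind being compared is a TF-IDF representation followed by logistic regression or a random forest. The sketch below uses scikit-learn with a tiny hypothetical corpus; the texts, labels, and settings are placeholders, not the study's data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled short texts: 1 = about climate change, 0 = not.
texts = ["climate change threatens coastal cities",
         "new climate policy announced at the summit",
         "quarterly earnings beat expectations",
         "the football season starts next week"]
labels = [1, 1, 0, 0]

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=200))]:
    pipe = make_pipeline(TfidfVectorizer(), clf)   # TF-IDF features + classifier
    pipe.fit(texts, labels)                        # with real data, report cross-validated F1
    print(name, pipe.predict(["heatwaves are linked to climate change"]))
```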
Using a novel\ndataset including 5,750 tweets from various international organizations\nregarding the highly ambiguous concept of climate change, we evaluate the\nperformance of methods in automatically classifying tweets based on whether\nthey are about climate change or not. In this context, we highlight two main\nfindings. First, supervised machine-learning methods perform better than\nstate-of-the-art lexicons, in particular as class balance increases. Second,\ntraditional machine-learning methods, such as logistic regression and random\nforest, perform similarly to sophisticated deep-learning methods, whilst\nrequiring much less training time and computational resources. The results have\nimportant implications for the analysis of short texts in social science\nresearch."}, "http://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "http://arxiv.org/abs/2310.04563", "description": "During the COVID-19 pandemic, implementing in-person indoor instruction in a\nsafe manner was a high priority for universities nationwide. To support this\neffort at the University, we developed a mathematical model for estimating the\nrisk of SARS-CoV-2 transmission in university classrooms. This model was used\nto design a safe classroom environment at the University during the COVID-19\npandemic that supported the higher occupancy levels needed to match\npre-pandemic numbers of in-person courses, despite a limited number of large\nclassrooms. A retrospective analysis at the end of the semester confirmed the\nmodel's assessment that the proposed classroom configuration would be safe. Our\nframework is generalizable and was also used to support reopening decisions at\nStanford University. In addition, our methods are flexible; our modeling\nframework was repurposed to plan for large university events and gatherings. We\nfound that our approach and methods work in a wide range of indoor settings and\ncould be used to support reopening planning across various industries, from\nsecondary schools to movie theaters and restaurants."}, "http://arxiv.org/abs/2310.04578": {"title": "TNDDR: Efficient and doubly robust estimation of COVID-19 vaccine effectiveness under the test-negative design", "link": "http://arxiv.org/abs/2310.04578", "description": "While the test-negative design (TND), which is routinely used for monitoring\nseasonal flu vaccine effectiveness (VE), has recently become integral to\nCOVID-19 vaccine surveillance, it is susceptible to selection bias due to\noutcome-dependent sampling. Some studies have addressed the identifiability and\nestimation of causal parameters under the TND, but efficiency bounds for\nnonparametric estimators of the target parameter under the unconfoundedness\nassumption have not yet been investigated. We propose a one-step doubly robust\nand locally efficient estimator called TNDDR (TND doubly robust), which\nutilizes sample splitting and can incorporate machine learning techniques to\nestimate the nuisance functions. 
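Editorial note: the TNDDR entry above builds a one-step doubly robust estimator with sample splitting. As a hedged illustration of cross-fitting with machine-learned nuisances only, the sketch below computes a generic cross-fitted AIPW estimate of a mean difference; it does not implement the TND-specific efficient influence function derived in the paper, and all model choices are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fitted_aipw(X, A, Y, n_splits=2, seed=0):
    """Cross-fitted augmented inverse probability weighting (AIPW) estimate
    of E[Y(1)] - E[Y(0)]: nuisances are fit on one fold and evaluated on the
    held-out fold, then the influence-function terms are averaged."""
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        ps_model = GradientBoostingClassifier().fit(X[train], A[train])
        m1 = GradientBoostingRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        m0 = GradientBoostingRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        e = np.clip(ps_model.predict_proba(X[test])[:, 1], 0.01, 0.99)  # propensity, clipped
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        psi[test] = (mu1 - mu0
                     + A[test] * (Y[test] - mu1) / e
                     - (1 - A[test]) * (Y[test] - mu0) / (1 - e))
    return psi.mean()

# Toy simulated usage with a confounded binary treatment.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + A * 1.0 + rng.normal(size=n)
print(cross_fitted_aipw(X, A, Y))
```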
We derive the efficient influence function\n(EIF) for the marginal expectation of the outcome under a vaccination\nintervention, explore the von Mises expansion, and establish the conditions for\n$\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR.\nThe proposed TNDDR is supported by both theoretical and empirical\njustifications, and we apply it to estimate COVID-19 VE in an administrative\ndataset of community-dwelling older people (aged $\\geq 60$y) in the province of\nQu\\'ebec, Canada."}, "http://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "http://arxiv.org/abs/2310.04660", "description": "Many scientific questions in biomedical, environmental, and psychological\nresearch involve understanding the impact of multiple factors on outcomes.\nWhile randomized factorial experiments are ideal for this purpose,\nrandomization is infeasible in many empirical studies. Therefore, investigators\noften rely on observational data, where drawing reliable causal inferences for\nmultiple factors remains challenging. As the number of treatment combinations\ngrows exponentially with the number of factors, some treatment combinations can\nbe rare or even missing by chance in observed data, further complicating\nfactorial effects estimation. To address these challenges, we propose a novel\nweighting method tailored to observational studies with multiple factors. Our\napproach uses weighted observational data to emulate a randomized factorial\nexperiment, enabling simultaneous estimation of the effects of multiple factors\nand their interactions. Our investigations reveal a crucial nuance: achieving\nbalance among covariates, as in single-factor scenarios, is necessary but\ninsufficient for unbiasedly estimating factorial effects. Our findings suggest\nthat balancing the factors is also essential in multi-factor settings.\nMoreover, we extend our weighting method to handle missing treatment\ncombinations in observed data. Finally, we study the asymptotic behavior of the\nnew weighting estimators and propose a consistent variance estimator, providing\nreliable inferences on factorial effects in observational studies."}, "http://arxiv.org/abs/2310.04709": {"title": "Time-dependent mediators in survival analysis: Graphical representation of causal assumptions", "link": "http://arxiv.org/abs/2310.04709", "description": "We study time-dependent mediators in survival analysis using a treatment\nseparation approach due to Didelez [2019] and based on earlier work by Robins\nand Richardson [2011]. This approach avoids nested counterfactuals and\ncrossworld assumptions which are otherwise common in mediation analysis. The\ncausal model of treatment, mediators, covariates, confounders and outcome is\nrepresented by causal directed acyclic graphs (DAGs). However, the DAGs tend to\nbe very complex when we have measurements at a large number of time points. We\ntherefore suggest using so-called rolled graphs in which a node represents an\nentire coordinate process instead of a single random variable, leading us to\nfar simpler graphical representations. The rolled graphs are not necessarily\nacyclic; they can be analyzed by $\\delta$-separation which is the appropriate\ngraphical separation criterion in this class of graphs and analogous to\n$d$-separation. 
In particular, $\\delta$-separation is a graphical tool for\nevaluating if the conditions of the mediation analysis are met or if unmeasured\nconfounders influence the estimated effects. We also state a mediational\ng-formula. This is similar to the approach in Vansteelandt et al. [2019]\nalthough that paper has a different conceptual basis. Finally, we apply this\nframework to a statistical model based on a Cox model with an added treatment\neffect.\n\nKeywords: survival analysis; mediation; causal inference; graphical models; local\nindependence graphs"}, "http://arxiv.org/abs/2310.04919": {"title": "The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models", "link": "http://arxiv.org/abs/2310.04919", "description": "In modern scientific research, the objective is often to identify which\nvariables are associated with an outcome among a large class of potential\npredictors. This goal can be achieved by selecting variables in a manner that\ncontrols the false discovery rate (FDR), the proportion of irrelevant\npredictors among the selections. Knockoff filtering is a cutting-edge approach\nto variable selection that provides FDR control. Existing knockoff statistics\nfrequently employ linear models to assess relationships between features and\nthe response, but the linearity assumption is often violated in real-world\napplications. This may result in poor power to detect truly prognostic\nvariables. We introduce a knockoff statistic based on the conditional\nprediction function (CPF), which can pair with state-of-the-art machine learning\npredictive models, such as deep neural networks. The CPF statistics can capture\nthe nonlinear relationships between predictors and outcomes while also\naccounting for correlation between features. We illustrate the capability of\nthe CPF statistics to provide superior power over common knockoff statistics\nwith continuous, categorical, and survival outcomes using repeated simulations.\nKnockoff filtering with the CPF statistics is demonstrated using (1) a\nresidential building dataset to select predictors for the actual sales prices\nand (2) the TCGA dataset to select genes that are correlated with disease\nstaging in lung cancer patients."}, "http://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "http://arxiv.org/abs/2310.04924", "description": "Markov chain Monte Carlo significance tests were first introduced by Besag\nand Clifford in [4]. These methods produce statistically valid p-values in\nproblems where sampling from the null hypotheses is intractable. We give an\noverview of the methods of Besag and Clifford and some recent developments. A\nrange of examples and applications are discussed."}, "http://arxiv.org/abs/2310.04934": {"title": "UBSea: A Unified Community Detection Framework", "link": "http://arxiv.org/abs/2310.04934", "description": "Detecting communities in networks and graphs is an important task across many\ndisciplines such as statistics, social science and engineering. There are\ngenerally three different kinds of mixing patterns for the case of two\ncommunities: assortative mixing, disassortative mixing and core-periphery\nstructure. Modularity optimization is a classical way for fitting network\nmodels with communities. However, it can only deal with assortative mixing and\ndisassortative mixing when the mixing pattern is known and fails to discover\nthe core-periphery structure. 
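Editorial note: the UBSea entry above starts from modularity optimization. For reference, the sketch below computes the standard Newman-Girvan modularity of a given two-community partition with NumPy; the toy graph is hypothetical, and the paper's standardized edge-count extension is not reproduced here.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity of a partition of an undirected graph:
    Q = (1 / 2m) * sum_ij (A_ij - k_i k_j / 2m) * 1{c_i == c_j}."""
    k = A.sum(axis=1)                     # node degrees
    two_m = k.sum()                       # 2m = total degree
    same = labels[:, None] == labels[None, :]
    B = A - np.outer(k, k) / two_m        # modularity matrix
    return (B * same).sum() / two_m

# Toy usage: two 3-node cliques joined by a single edge.
A = np.zeros((6, 6))
A[:3, :3] = 1; A[3:, 3:] = 1; np.fill_diagonal(A, 0)
A[2, 3] = A[3, 2] = 1
labels = np.array([0, 0, 0, 1, 1, 1])
q = modularity(A, labels)
```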
In this paper, we extend modularity in a\nstrategic way and propose a new framework based on Unified Bigroups Standardized\nEdge-count Analysis (UBSea). It can address all of the aforementioned\ncommunity mixing structures. In addition, this new framework is able to\nautomatically choose the mixing type to fit the networks. Simulation studies\nshow that the new framework has superb performance in a wide range of settings\nunder the stochastic block model and the degree-corrected stochastic block\nmodel. We show that the new approach produces consistent estimates of the\ncommunities under a suitable signal-to-noise ratio condition, for the case of a\nblock model with two communities, for both undirected and directed networks.\nThe new method is illustrated through applications to several real-world\ndatasets."}, "http://arxiv.org/abs/2310.05049": {"title": "On Estimation of Optimal Dynamic Treatment Regimes with Multiple Treatments for Survival Data-With Application to Colorectal Cancer Study", "link": "http://arxiv.org/abs/2310.05049", "description": "Dynamic treatment regimes (DTR) are sequential decision rules corresponding\nto several stages of intervention. Each rule maps patients' covariates to\noptional treatments. The optimal dynamic treatment regime is the one that\nmaximizes the mean outcome of interest if followed by the overall population.\nMotivated by a clinical study on advanced colorectal cancer with traditional\nChinese medicine, we propose a censored C-learning (CC-learning) method to\nestimate the dynamic treatment regime with multiple treatments using survival\ndata. To address the challenges of multiple stages with right censoring, we\nmodify the backward recursion algorithm in order to adapt to the flexible\nnumber and timing of treatments. For handling the problem of multiple\ntreatments, we propose a framework from the classification perspective by\ntransferring the problem of optimization with multiple treatment comparisons\ninto an example-dependent cost-sensitive classification problem. With\nclassification and regression tree (CART) as the classifier, the CC-learning\nmethod can produce an estimated optimal DTR with good interpretability. We\ntheoretically prove the optimality of our method and numerically evaluate its\nfinite sample performance through simulation. With the proposed method, we\nidentify the interpretable tree treatment regimes at each stage for the\nadvanced colorectal cancer treatment data from Xiyuan Hospital."}, "http://arxiv.org/abs/2310.05151": {"title": "Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions", "link": "http://arxiv.org/abs/2310.05151", "description": "In clinical trials of longitudinal continuous outcomes, reference-based\nimputation (RBI) has commonly been applied to handle missing outcome data in\nsettings where the estimand incorporates the effects of intercurrent events,\ne.g. treatment discontinuation. RBI was originally developed in the multiple\nimputation framework; however, conditional mean imputation (CMI)\ncombined with the jackknife estimator of the standard error was recently proposed as a\nway to obtain deterministic treatment effect estimates and correct frequentist\ninference. For both multiple imputation and CMI, a mixed model for repeated measures\n(MMRM) is often used for the imputation model, but this can be computationally\nintensive to fit to multiple data sets (e.g. 
the jackknife samples) and lead to\nconvergence issues with complex MMRM models with many parameters. Therefore, a\nstep-wise approach based on sequential linear regression (SLR) of the outcomes\nat each visit was developed for the imputation model in the multiple imputation\nframework, but similar developments in the CMI framework are lacking. In this\narticle, we fill this gap in the literature by proposing a SLR approach to\nimplement RBI in the CMI framework, and justify its validity using theoretical\nresults and simulations. We also illustrate our proposal on a real data\napplication."}, "http://arxiv.org/abs/2310.05398": {"title": "Statistical Inference for Modulation Index in Phase-Amplitude Coupling", "link": "http://arxiv.org/abs/2310.05398", "description": "Phase-amplitude coupling is a phenomenon observed in several neurological\nprocesses, where the phase of one signal modulates the amplitude of another\nsignal with a distinct frequency. The modulation index (MI) is a common\ntechnique used to quantify this interaction by assessing the Kullback-Leibler\ndivergence between a uniform distribution and the empirical conditional\ndistribution of amplitudes with respect to the phases of the observed signals.\nThe uniform distribution is an ideal representation that is expected to appear\nunder the absence of coupling. However, it does not reflect the statistical\nproperties of coupling values caused by random chance. In this paper, we\npropose a statistical framework for evaluating the significance of an observed\nMI value based on a null hypothesis that a MI value can be entirely explained\nby chance. Significance is obtained by comparing the value with a reference\ndistribution derived under the null hypothesis of independence (i.e., no\ncoupling) between signals. We derived a closed-form distribution of this null\nmodel, resulting in a scaled beta distribution. To validate the efficacy of our\nproposed framework, we conducted comprehensive Monte Carlo simulations,\nassessing the significance of MI values under various experimental scenarios,\nincluding amplitude modulation, trains of spikes, and sequences of\nhigh-frequency oscillations. Furthermore, we corroborated the reliability of\nour model by comparing its statistical significance thresholds with reported\nvalues from other research studies conducted under different experimental\nsettings. Our method offers several advantages such as meta-analysis\nreliability, simplicity and computational efficiency, as it provides p-values\nand significance levels without resorting to generating surrogate data through\nsampling procedures."}, "http://arxiv.org/abs/2310.05526": {"title": "Projecting infinite time series graphs to finite marginal graphs using number theory", "link": "http://arxiv.org/abs/2310.05526", "description": "In recent years, a growing number of method and application works have\nadapted and applied the causal-graphical-model framework to time series data.\nMany of these works employ time-resolved causal graphs that extend infinitely\ninto the past and future and whose edges are repetitive in time, thereby\nreflecting the assumption of stationary causal relationships. However, most\nresults and algorithms from the causal-graphical-model framework are not\ndesigned for infinite graphs. In this work, we develop a method for projecting\ninfinite time series graphs with repetitive edges to marginal graphical models\non a finite time window. 
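Editorial note: the phase-amplitude coupling entry above evaluates the modulation index (MI). A common Tort-style computation, sketched below under the assumption that phase and amplitude envelopes have already been extracted (for example via band-pass filtering and a Hilbert transform), bins the phases, averages the amplitude within each bin, and measures the KL divergence from uniformity; the paper's scaled-beta null distribution is not implemented here.

```python
import numpy as np

def modulation_index(phase, amplitude, n_bins=18):
    """Tort-style modulation index: bin phases, average the amplitude per bin,
    normalise into a distribution P, and return KL(P || uniform) / log(n_bins),
    which lies in [0, 1]."""
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    idx = np.clip(np.digitize(phase, bins) - 1, 0, n_bins - 1)
    mean_amp = np.array([amplitude[idx == b].mean() if np.any(idx == b) else 0.0
                         for b in range(n_bins)])
    P = mean_amp / mean_amp.sum()
    P = np.where(P > 0, P, 1e-12)                     # guard against log(0)
    kl = np.sum(P * np.log(P * n_bins))               # KL divergence from uniform
    return kl / np.log(n_bins)

# Toy usage with a synthetic coupled signal: amplitude peaks near phase 0.
rng = np.random.default_rng(4)
phase = rng.uniform(-np.pi, np.pi, size=5000)
amplitude = 1.0 + 0.5 * np.cos(phase) + 0.1 * rng.normal(size=5000)
mi = modulation_index(phase, amplitude)
```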
These finite marginal graphs provide the answers to\n$m$-separation queries with respect to the infinite graph, a task that was\npreviously unresolved. Moreover, we argue that these marginal graphs are useful\nfor causal discovery and causal effect estimation in time series, effectively\nenabling results developed for finite graphs to be applied to the infinite graphs.\nThe projection procedure relies on finding common ancestors in the\nto-be-projected graph and is, by itself, not new. However, the projection\nprocedure has not yet been algorithmically implemented for time series graphs\nsince in these infinite graphs there can be infinite sets of paths that might\ngive rise to common ancestors. We solve the search over these possibly infinite\nsets of paths by an intriguing combination of path-finding techniques for\nfinite directed graphs and solution theory for linear Diophantine equations. By\nproviding an algorithm that carries out the projection, our paper takes an\nimportant step towards a theoretically-grounded and method-agnostic\ngeneralization of a range of causal inference methods and results to time\nseries."}, "http://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "http://arxiv.org/abs/2310.05539", "description": "In response to the unique challenge created by high-dimensional mediators in\nmediation analysis, this paper presents a novel procedure for testing the\nnullity of the mediation effect in the presence of high-dimensional mediators.\nThe procedure incorporates two distinct features. Firstly, the test remains\nvalid under all cases of the composite null hypothesis, including the\nchallenging scenario where both exposure-mediator and mediator-outcome\ncoefficients are zero. Secondly, it does not impose structural assumptions on\nthe exposure-mediator coefficients, thereby allowing for an arbitrarily strong\nexposure-mediator relationship. To the best of our knowledge, the proposed test\nis the first of its kind to provably possess these two features in\nhigh-dimensional mediation analysis. The validity and consistency of the\nproposed test are established, and its numerical performance is showcased\nthrough simulation studies. The application of the proposed test is\ndemonstrated by examining the mediation effect of DNA methylation between\nsmoking status and lung cancer development."}, "http://arxiv.org/abs/2310.05548": {"title": "Cokrig-and-Regress for Spatially Misaligned Environmental Data", "link": "http://arxiv.org/abs/2310.05548", "description": "Spatially misaligned data, where the response and covariates are observed at\ndifferent spatial locations, commonly arise in many environmental studies. Much\nof the statistical literature on handling spatially misaligned data has been\ndevoted to the case of a single covariate and a linear relationship between the\nresponse and this covariate. Motivated by spatially misaligned data collected\non air pollution and weather in China, we propose a cokrig-and-regress (CNR)\nmethod to estimate spatial regression models involving multiple covariates and\npotentially non-linear associations. The CNR estimator is constructed by\nreplacing the unobserved covariates (at the response locations) by their\ncokriging predictor derived from the observed but misaligned covariates under a\nmultivariate Gaussian assumption, where a generalized Kronecker product\ncovariance is used to account for spatial correlations within and between\ncovariates. 
A parametric bootstrap approach is employed to bias-correct the CNR\nestimates of the spatial covariance parameters and for uncertainty\nquantification. Simulation studies demonstrate that CNR outperforms several\nexisting methods for handling spatially misaligned data, such as\nnearest-neighbor interpolation. Applying CNR to the spatially misaligned air\npollution and weather data in China reveals a number of non-linear\nrelationships between PM$_{2.5}$ concentration and several meteorological\ncovariates."}, "http://arxiv.org/abs/2310.05622": {"title": "A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards", "link": "http://arxiv.org/abs/2310.05622", "description": "While well-established methods for time-to-event data are available when the\nproportional hazards assumption holds, there is no consensus on the best\ninferential approach under non-proportional hazards (NPH). However, a wide\nrange of parametric and non-parametric methods for testing and estimation in\nthis scenario have been proposed. To provide recommendations on the statistical\nanalysis of clinical trials where non proportional hazards are expected, we\nconducted a comprehensive simulation study under different scenarios of\nnon-proportional hazards, including delayed onset of treatment effect, crossing\nhazard curves, subgroups with different treatment effect and changing hazards\nafter disease progression. We assessed type I error rate control, power and\nconfidence interval coverage, where applicable, for a wide range of methods\nincluding weighted log-rank tests, the MaxCombo test, summary measures such as\nthe restricted mean survival time (RMST), average hazard ratios, and milestone\nsurvival probabilities as well as accelerated failure time regression models.\nWe found a trade-off between interpretability and power when choosing an\nanalysis strategy under NPH scenarios. While analysis methods based on weighted\nlogrank tests typically were favorable in terms of power, they do not provide\nan easily interpretable treatment effect estimate. Also, depending on the\nweight function, they test a narrow null hypothesis of equal hazard functions\nand rejection of this null hypothesis may not allow for a direct conclusion of\ntreatment benefit in terms of the survival function. In contrast,\nnon-parametric procedures based on well interpretable measures as the RMST\ndifference had lower power in most scenarios. Model based methods based on\nspecific survival distributions had larger power, however often gave biased\nestimates and lower than nominal confidence interval coverage."}, "http://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "http://arxiv.org/abs/2310.05646", "description": "We study transfer learning in the context of estimating piecewise-constant\nsignals when source data, which may be relevant but disparate, are available in\naddition to the target data. We initially investigate transfer learning\nestimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for\nunisource data scenarios and then generalise these estimators to accommodate\nmultisource data. To further reduce estimation errors, especially in scenarios\nwhere some sources significantly differ from the target, we introduce an\ninformative source selection algorithm. 
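Editorial note: the non-proportional-hazards entry above discusses the restricted mean survival time (RMST) as an interpretable summary measure. The sketch below computes the RMST as the area under a Kaplan-Meier curve up to a horizon tau, using plain NumPy on a hypothetical censored sample (ties are handled naively); production analyses would typically rely on a dedicated survival library instead.

```python
import numpy as np

def restricted_mean_survival_time(time, event, tau):
    """RMST: area under the Kaplan-Meier survival curve up to tau.
    time: follow-up times; event: 1 if observed, 0 if censored."""
    order = np.argsort(time)
    time, event = np.asarray(time)[order], np.asarray(event)[order]
    at_risk = len(time)
    surv, knots = 1.0, [(0.0, 1.0)]
    for t, d in zip(time, event):
        if d == 1:
            surv *= 1.0 - 1.0 / at_risk       # KM step at each observed event
            knots.append((t, surv))
        at_risk -= 1                          # leave the risk set (event or censoring)
    # Integrate the step function from 0 to tau.
    rmst, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in knots[1:]:
        if t >= tau:
            break
        rmst += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return rmst + prev_s * (tau - prev_t)

# Toy usage: small censored sample, horizon tau = 10.
rmst = restricted_mean_survival_time([2, 3, 5, 7, 11], [1, 0, 1, 1, 0], tau=10)
```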
We then examine these estimators with\nmultisource selection and establish their minimax optimality under specific\nregularity conditions. It is worth emphasising that, unlike the prevalent\nnarrative in the transfer learning literature that the performance is enhanced\nthrough large source sample sizes, our approaches leverage higher observation\nfrequencies and accommodate diverse frequencies across multiple sources. Our\ntheoretical findings are empirically validated through extensive numerical\nexperiments, with the code available online at\nhttps://github.com/chrisfanwang/transferlearning"}, "http://arxiv.org/abs/2310.05685": {"title": "Post-Selection Inference for Sparse Estimation", "link": "http://arxiv.org/abs/2310.05685", "description": "When the model is not known and parameter testing or interval estimation is\nconducted after model selection, it is necessary to consider selective\ninference. This paper discusses this issue in the context of sparse estimation.\nFirstly, we describe selective inference related to Lasso as per \\cite{lee},\nand then present polyhedra and truncated distributions when applying it to\nmethods such as Forward Stepwise and LARS. Lastly, we discuss the Significance\nTest for Lasso by \\cite{significant} and the Spacing Test for LARS by\n\\cite{ryan_exact}. This paper serves as a review article.\n\nKeywords: post-selective inference, polyhedron, LARS, lasso, forward\nstepwise, significance test, spacing test."}, "http://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "http://arxiv.org/abs/2310.05921", "description": "We introduce Conformal Decision Theory, a framework for producing safe\nautonomous decisions despite imperfect machine learning predictions. Examples\nof such decisions are ubiquitous, from robot planning algorithms that rely on\npedestrian predictions, to calibrating autonomous manufacturing to exhibit high\nthroughput and low error, to the choice of trusting a nominal policy versus\nswitching to a safe backup policy at run-time. The decisions produced by our\nalgorithms are safe in the sense that they come with provable statistical\nguarantees of having low risk without any assumptions on the world model\nwhatsoever; the observations need not be I.I.D. and can even be adversarial.\nThe theory extends results from conformal prediction to calibrate decisions\ndirectly, without requiring the construction of prediction sets. Experiments\ndemonstrate the utility of our approach in robot motion planning around humans,\nautomated stock trading, and robot manufacturing."}, "http://arxiv.org/abs/2101.06950": {"title": "Learning and scoring Gaussian latent variable causal models with unknown additive interventions", "link": "http://arxiv.org/abs/2101.06950", "description": "With observational data alone, causal structure learning is a challenging\nproblem. The task becomes easier when having access to data collected from\nperturbations of the underlying system, even when the nature of these is\nunknown. Existing methods either do not allow for the presence of latent\nvariables or assume that these remain unperturbed. However, these assumptions\nare hard to justify if the nature of the perturbations is unknown. We provide\nresults that enable scoring causal structures in the setting with additive, but\nunknown interventions. 
Specifically, we propose a maximum-likelihood estimator\nin a structural equation model that exploits system-wide invariances to output\nan equivalence class of causal structures from perturbation data. Furthermore,\nunder certain structural assumptions on the population model, we provide a\nsimple graphical characterization of all the DAGs in the interventional\nequivalence class. We illustrate the utility of our framework on synthetic data\nas well as real data involving California reservoirs and protein expressions.\nThe software implementation is available as the Python package \\emph{utlvce}."}, "http://arxiv.org/abs/2107.14151": {"title": "Modern Non-Linear Function-on-Function Regression", "link": "http://arxiv.org/abs/2107.14151", "description": "We introduce a new class of non-linear function-on-function regression models\nfor functional data using neural networks. We propose a framework using a\nhidden layer consisting of continuous neurons, called a continuous hidden\nlayer, for functional response modeling and give two model fitting strategies,\nFunctional Direct Neural Network (FDNN) and Functional Basis Neural Network\n(FBNN). Both are designed explicitly to exploit the structure inherent in\nfunctional data and capture the complex relations existing between the\nfunctional predictors and the functional response. We fit these models by\nderiving functional gradients and implement regularization techniques for more\nparsimonious results. We demonstrate the power and flexibility of our proposed\nmethod in handling complex functional models through extensive simulation\nstudies as well as real data examples."}, "http://arxiv.org/abs/2112.00832": {"title": "On the mixed-model analysis of covariance in cluster-randomized trials", "link": "http://arxiv.org/abs/2112.00832", "description": "In the analyses of cluster-randomized trials, mixed-model analysis of\ncovariance (ANCOVA) is a standard approach for covariate adjustment and\nhandling within-cluster correlations. However, when the normality, linearity,\nor the random-intercept assumption is violated, the validity and efficiency of\nthe mixed-model ANCOVA estimators for estimating the average treatment effect\nremain unclear. Under the potential outcomes framework, we prove that the\nmixed-model ANCOVA estimators for the average treatment effect are consistent\nand asymptotically normal under arbitrary misspecification of its working\nmodel. If the probability of receiving treatment is 0.5 for each cluster, we\nfurther show that the model-based variance estimator under mixed-model ANCOVA1\n(ANCOVA without treatment-covariate interactions) remains consistent,\nclarifying that the confidence interval given by standard software is\nasymptotically valid even under model misspecification. Beyond robustness, we\ndiscuss several insights on precision among classical methods for analyzing\ncluster-randomized trials, including the mixed-model ANCOVA, individual-level\nANCOVA, and cluster-level ANCOVA estimators. These insights may inform the\nchoice of methods in practice. 
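Editorial note: the cluster-randomized-trial entry above concerns mixed-model ANCOVA. A minimal version of that working model, fit to simulated data with hypothetical column names, uses statsmodels' MixedLM with treatment and a baseline covariate as fixed effects and a random intercept per cluster; this illustrates the working model only, not the paper's robustness results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cluster-randomized trial: outcome y, baseline covariate x,
# cluster-level treatment assignment, and a cluster identifier.
rng = np.random.default_rng(5)
n_clusters, m = 20, 25
cluster = np.repeat(np.arange(n_clusters), m)
treat = np.repeat(rng.integers(0, 2, size=n_clusters), m)   # assigned per cluster
x = rng.normal(size=n_clusters * m)
b = rng.normal(scale=0.5, size=n_clusters)                  # random intercepts
y = 0.5 * treat + 0.8 * x + b[cluster] + rng.normal(size=n_clusters * m)
df = pd.DataFrame({"y": y, "x": x, "treat": treat, "cluster": cluster})

# Mixed-model ANCOVA without a treatment-covariate interaction:
# fixed effects for treatment and covariate, random intercept per cluster.
fit = smf.mixedlm("y ~ treat + x", data=df, groups=df["cluster"]).fit()
print(fit.params["treat"], fit.bse["treat"])
```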
Our analytical results and insights are\nillustrated via simulation studies and analyses of three cluster-randomized\ntrials."}, "http://arxiv.org/abs/2201.10770": {"title": "Confidence intervals for the Cox model test error from cross-validation", "link": "http://arxiv.org/abs/2201.10770", "description": "Cross-validation (CV) is one of the most widely used techniques in\nstatistical learning for estimating the test error of a model, but its behavior\nis not yet fully understood. It has been shown that standard confidence\nintervals for test error using estimates from CV may have coverage below\nnominal levels. This phenomenon occurs because each sample is used in both the\ntraining and testing procedures during CV and as a result, the CV estimates of\nthe errors become correlated. Without accounting for this correlation, the\nestimate of the variance is smaller than it should be. One way to mitigate this\nissue is by estimating the mean squared error of the prediction error instead\nusing nested CV. This approach has been shown to achieve superior coverage\ncompared to intervals derived from standard CV. In this work, we generalize the\nnested CV idea to the Cox proportional hazards model and explore various\nchoices of test error for this setting."}, "http://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2202.08419", "description": "In this paper, we develop a novel high-dimensional time-varying coefficient\nestimation method, based on high-dimensional Ito diffusion processes. To\naccount for high-dimensional time-varying coefficients, we first estimate local\n(or instantaneous) coefficients using a time-localized Dantzig selection scheme\nunder a sparsity condition, which results in biased local coefficient\nestimators due to the regularization. To handle the bias, we propose a\ndebiasing scheme, which provides well-performing unbiased local coefficient\nestimators. With the unbiased local coefficient estimators, we estimate the\nintegrated coefficient, and to further account for the sparsity of the\ncoefficient process, we apply thresholding schemes. We call this Thresholding\ndEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED\nestimator. In the empirical analysis, we apply the TED procedure to analyzing\nhigh-dimensional factor models using high-frequency data."}, "http://arxiv.org/abs/2206.12525": {"title": "Causality of Functional Longitudinal Data", "link": "http://arxiv.org/abs/2206.12525", "description": "\"Treatment-confounder feedback\" is the central complication to resolve in\nlongitudinal studies, to infer causality. The existing frameworks for\nidentifying causal effects for longitudinal studies with discrete repeated\nmeasures hinge heavily on assuming that time advances in discrete time steps or\ntreatment changes as a jumping process, rendering the number of \"feedbacks\"\nfinite. However, medical studies nowadays with real-time monitoring involve\nfunctional time-varying outcomes, treatment, and confounders, which leads to an\nuncountably infinite number of feedbacks between treatment and confounders.\nTherefore more general and advanced theory is needed. We generalize the\ndefinition of causal effects under user-specified stochastic treatment regimes\nto longitudinal studies with continuous monitoring and develop an\nidentification framework, allowing right censoring and truncation by death. 
We\nprovide sufficient identification assumptions including a generalized\nconsistency assumption, a sequential randomization assumption, a positivity\nassumption, and a novel \"intervenable\" assumption designed for the\ncontinuous-time case. Under these assumptions, we propose a g-computation\nprocess and an inverse probability weighting process, which suggest a\ng-computation formula and an inverse probability weighting formula for\nidentification. For practical purposes, we also construct two classes of\npopulation estimating equations to identify these two processes, respectively,\nwhich further suggest a doubly robust identification formula with extra\nrobustness against process misspecification. We prove that our framework fully\ngeneralizes the existing frameworks and is nonparametric."}, "http://arxiv.org/abs/2209.08139": {"title": "Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2209.08139", "description": "Bayesian variable selection methods are powerful techniques for fitting and\nperforming inference on sparse high-dimensional linear regression models. However, many\nare computationally intensive or require restrictive prior distributions on\nmodel parameters. In this paper, we propose a computationally efficient and\npowerful Bayesian approach for sparse high-dimensional linear regression.\nMinimal prior assumptions on the parameters are required through the use of\nplug-in empirical Bayes estimates of hyperparameters. Efficient maximum a\nposteriori (MAP) estimation is completed through a Parameter-Expanded\nExpectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in\na robust computationally efficient coordinate-wise optimization which -- when\nupdating the coefficient for a particular predictor -- adjusts for the impact\nof other predictor variables. The completion of the E-step uses an approach\nmotivated by the popular two-group approach to multiple testing. The result is\na PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse\nhigh-dimensional linear regression, which can be completed using one-at-a-time\nor all-at-once type optimization. We compare the empirical properties of PROBE\nto comparable approaches with numerous simulation studies and analyses of\ncancer cell drug responses. The proposed approach is implemented in the R\npackage probe."}, "http://arxiv.org/abs/2212.02709": {"title": "SURE-tuned Bridge Regression", "link": "http://arxiv.org/abs/2212.02709", "description": "Consider the {$\\ell_{\\alpha}$} regularized linear regression, also termed\nBridge regression. For $\\alpha\\in (0,1)$, Bridge regression enjoys several\nstatistical properties of interest such as sparsity and near-unbiasedness of\nthe estimates (Fan and Li, 2001). However, the main difficulty lies in the\nnon-convex nature of the penalty for these values of $\\alpha$, which makes an\noptimization procedure challenging and usually it is only possible to find a\nlocal optimum. To address this issue, Polson et al. (2013) took a sampling-based\nfully Bayesian approach to this problem, using the correspondence between\nthe Bridge penalty and a power exponential prior on the regression\ncoefficients. However, their sampling procedure relies on Markov chain Monte\nCarlo (MCMC) techniques, which are inherently sequential and not scalable to\nlarge problem dimensions. Cross validation approaches are similarly\ncomputation-intensive. 
To this end, our contribution is a novel\n\\emph{non-iterative} method to fit a Bridge regression model. The main\ncontribution lies in an explicit formula for Stein's unbiased risk estimate for\nthe out of sample prediction risk of Bridge regression, which can then be\noptimized to select the desired tuning parameters, allowing us to completely\nbypass MCMC as well as computation-intensive cross validation approaches. Our\nprocedure yields results in a fraction of the computational time compared to\niterative schemes, without any appreciable loss in statistical performance. An\nR implementation is publicly available online at:\nhttps://github.com/loriaJ/Sure-tuned_BridgeRegression ."}, "http://arxiv.org/abs/2212.03122": {"title": "Robust convex biclustering with a tuning-free method", "link": "http://arxiv.org/abs/2212.03122", "description": "Biclustering is widely used in a variety of fields including gene\ninformation analysis, text mining, and recommendation systems by effectively\ndiscovering the local correlation between samples and features. However, many\nbiclustering algorithms will collapse when facing heavy-tailed data. In this\npaper, we propose a robust version of the convex biclustering algorithm with Huber\nloss. Yet, the newly introduced robustification parameter brings an extra\nburden to selecting the optimal parameters. Therefore, we propose a tuning-free\nmethod for automatically selecting the optimal robustification parameter with\nhigh efficiency. The simulation study demonstrates the superior\nperformance of our proposed method compared with traditional biclustering methods when\nencountering heavy-tailed noise. A real-life biomedical application is also\npresented. The R package RcvxBiclustr is available at\nhttps://github.com/YifanChen3/RcvxBiclustr."}, "http://arxiv.org/abs/2301.09661": {"title": "Estimating marginal treatment effects from observational studies and indirect treatment comparisons: When are standardization-based methods preferable to those based on propensity score weighting?", "link": "http://arxiv.org/abs/2301.09661", "description": "In light of newly developed standardization methods, we evaluate, via\nsimulation study, how propensity score weighting and standardization-based\napproaches compare for obtaining estimates of the marginal odds ratio and the\nmarginal hazard ratio. Specifically, we consider how the two approaches compare\nin two different scenarios: (1) in a single observational study, and (2) in an\nanchored indirect treatment comparison (ITC) of randomized controlled trials.\nWe present the material in such a way that the matching-adjusted indirect\ncomparison (MAIC) and the (novel) simulated treatment comparison (STC) methods\nin the ITC setting may be viewed as analogous to the propensity score weighting\nand standardization methods in the single observational study setting. Our\nresults suggest that current recommendations for conducting ITCs can be\nimproved and underscore the importance of adjusting for purely prognostic\nfactors."}, "http://arxiv.org/abs/2302.11746": {"title": "Logistic Regression and Classification with non-Euclidean Covariates", "link": "http://arxiv.org/abs/2302.11746", "description": "We introduce a logistic regression model for data pairs consisting of a\nbinary response and a covariate residing in a non-Euclidean metric space\nwithout vector structures. Based on the proposed model we also develop a binary\nclassifier for non-Euclidean objects. 
We propose a maximum likelihood estimator\nfor the non-Euclidean regression coefficient in the model, and provide upper\nbounds on the estimation error under various metric entropy conditions that\nquantify the complexity of the underlying metric space. Matching lower bounds are\nderived for the important metric spaces commonly seen in statistics,\nestablishing optimality of the proposed estimator in such spaces. Similarly, an\nupper bound on the excess risk of the developed classifier is provided for\ngeneral metric spaces. A finer upper bound and a matching lower bound, and thus\noptimality of the proposed classifier, are established for Riemannian\nmanifolds. We investigate the numerical performance of the proposed estimator\nand classifier via simulation studies, and illustrate their practical merits\nvia an application to task-related fMRI data."}, "http://arxiv.org/abs/2302.13658": {"title": "Robust High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2302.13658", "description": "In this paper, we develop a novel high-dimensional coefficient estimation\nprocedure based on high-frequency data. Unlike usual high-dimensional\nregression procedures such as LASSO, we additionally handle the heavy-tailedness\nof high-frequency observations as well as time variations of coefficient\nprocesses. Specifically, we employ the Huber loss and a truncation scheme to handle\nheavy-tailed observations, while $\\ell_{1}$-regularization is adopted to\novercome the curse of dimensionality. To account for the time-varying\ncoefficient, we estimate local coefficients which are biased due to the\n$\\ell_{1}$-regularization. Thus, when estimating integrated coefficients, we\npropose a debiasing scheme to enjoy the law of large numbers property and employ\na thresholding scheme to further accommodate the sparsity of the coefficients.\nWe call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show\nthat the RED-LASSO estimator can achieve a near-optimal convergence rate. In\nthe empirical study, we apply the RED-LASSO procedure to the high-dimensional\nintegrated coefficient estimation using high-frequency trading data."}, "http://arxiv.org/abs/2307.04754": {"title": "Action-State Dependent Dynamic Model Selection", "link": "http://arxiv.org/abs/2307.04754", "description": "A model among many may only be best under certain states of the world.\nSwitching from one model to another can also be costly. Finding a procedure to\ndynamically choose a model in these circumstances requires solving a complex\nestimation problem and a dynamic programming problem. A reinforcement\nlearning algorithm is used to approximate and estimate from the data the\noptimal solution to this dynamic programming problem. The algorithm is shown to\nconsistently estimate the optimal policy that may choose different models based\non a set of covariates. A typical example is that of switching between\ndifferent portfolio models under rebalancing costs, using macroeconomic\ninformation. 
Using a set of macroeconomic variables and price data, an\nempirical application to the aforementioned portfolio problem shows superior\nperformance to choosing the best portfolio model with hindsight."}, "http://arxiv.org/abs/2307.14828": {"title": "Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley", "link": "http://arxiv.org/abs/2307.14828", "description": "Two-component mixture models have proved to be a powerful tool for modeling\nheterogeneity in several cluster analysis contexts. However, most methods based\non these models assume a constant behavior for the mixture weights, which can\nbe restrictive and unsuitable for some applications. In this paper, we relax\nthis assumption and allow the mixture weights to vary according to the index\n(e.g., time) to make the model more adaptive to a broader range of data sets.\nWe propose an efficient MCMC algorithm to jointly estimate both component\nparameters and dynamic weights from their posterior samples. We evaluate the\nmethod's performance by running Monte Carlo simulation studies under different\nscenarios for the dynamic weights. In addition, we apply the algorithm to a\ntime series that records the level reached by a river in southern Brazil. The\nTaquari River is a water body whose frequent flood inundations have caused\nvarious damage to riverside communities. Implementing a dynamic mixture model\nallows us to properly describe the flood regimes for the areas most affected by\nthese phenomena."}, "http://arxiv.org/abs/2310.06130": {"title": "Statistical inference for radially-stable generalized Pareto distributions and return level-sets in geometric extremes", "link": "http://arxiv.org/abs/2310.06130", "description": "We obtain a functional analogue of the quantile function for probability\nmeasures admitting a continuous Lebesgue density on $\\mathbb{R}^d$, and use it\nto characterize the class of non-trivial limit distributions of radially\nrecentered and rescaled multivariate exceedances in geometric extremes. A new\nclass of multivariate distributions is identified, termed radially stable\ngeneralized Pareto distributions, and is shown to admit certain stability\nproperties that permit extrapolation to extremal sets along any direction in\n$\\mathbb{R}^d$. Based on the limit Poisson point process likelihood of the\nradially renormalized point process of exceedances, we develop parsimonious\nstatistical models that exploit theoretical links between structural\nstar-bodies and are amenable to Bayesian inference. The star-bodies determine\nthe mean measure of the limit Poisson process through a hierarchical structure.\nOur framework sharpens statistical inference by suitably including additional\ninformation from the angular directions of the geometric exceedances and\nfacilitates efficient computations in dimensions $d=2$ and $d=3$. Additionally,\nit naturally leads to the notion of the return level-set, which is a canonical\nquantile set expressed in terms of its average recurrence interval, and a\ngeometric analogue of the uni-dimensional return level. 
We illustrate our\nmethods with a simulation study showing superior predictive performance of\nprobabilities of rare events, and with two case studies, one associated with\nriver flow extremes, and the other with oceanographic extremes."}, "http://arxiv.org/abs/2310.06252": {"title": "Power and sample size calculation of two-sample projection-based testing for sparsely observed functional data", "link": "http://arxiv.org/abs/2310.06252", "description": "Projection-based testing for mean trajectory differences in two groups of\nirregularly and sparsely observed functional data has garnered significant\nattention in the literature because it accommodates a wide spectrum of group\ndifferences and (non-stationary) covariance structures. This article presents\nthe derivation of the theoretical power function and the introduction of a\ncomprehensive power and sample size (PASS) calculation toolkit tailored to the\nprojection-based testing method developed by Wang (2021). Our approach\naccommodates a wide spectrum of group difference scenarios and a broad class of\ncovariance structures governing the underlying processes. Through extensive\nnumerical simulation, we demonstrate the robustness of this testing method by\nshowcasing that its statistical power remains nearly unaffected even when a\ncertain percentage of observations are missing, rendering it 'missing-immune'.\nFurthermore, we illustrate the practical utility of this test through analysis\nof two randomized controlled trials of Parkinson's disease. To facilitate\nimplementation, we provide a user-friendly R package fPASS, complete with a\ndetailed vignette to guide users through its practical application. We\nanticipate that this article will significantly enhance the usability of this\npotent statistical tool across a range of biostatistical applications, with a\nparticular focus on its relevance in the design of clinical trials."}, "http://arxiv.org/abs/2310.06315": {"title": "Ultra-high dimensional confounder selection algorithms comparison with application to radiomics data", "link": "http://arxiv.org/abs/2310.06315", "description": "Radiomics is an emerging area of medical imaging data analysis particularly\nfor cancer. It involves the conversion of digital medical images into mineable\nultra-high dimensional data. Machine learning algorithms are widely used in\nradiomics data analysis to develop powerful decision support model to improve\nprecision in diagnosis, assessment of prognosis and prediction of therapy\nresponse. However, machine learning algorithms for causal inference have not\nbeen previously employed in radiomics analysis. In this paper, we evaluate the\nvalue of machine learning algorithms for causal inference in radiomics. We\nselect three recent competitive variable selection algorithms for causal\ninference: outcome-adaptive lasso (OAL), generalized outcome-adaptive lasso\n(GOAL) and causal ball screening (CBS). We used a sure independence screening\nprocedure to propose an extension of GOAL and OAL for ultra-high dimensional\ndata, SIS + GOAL and SIS + OAL. We compared SIS + GOAL, SIS + OAL and CBS using\nsimulation study and two radiomics datasets in cancer, osteosarcoma and\ngliosarcoma. 
The two radiomics studies and the simulation study identified SIS\n+ GOAL as the optimal variable selection algorithm."}, "http://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "http://arxiv.org/abs/2310.06330", "description": "Markov chain Monte Carlo (MCMC) is a commonly used method for approximating\nexpectations with respect to probability distributions. Uncertainty assessment\nfor MCMC estimators is essential in practical applications. Moreover, for\nmultivariate functions of a Markov chain, it is important to estimate not only\nthe auto-correlation for each component but also to estimate\ncross-correlations, in order to better assess sample quality, improve estimates\nof effective sample size, and use more effective stopping rules. Berg and Song\n[2022] introduced the moment least squares (momentLS) estimator, a\nshape-constrained estimator for the autocovariance sequence from a reversible\nMarkov chain, for univariate functions of the Markov chain. Based on this\nsequence estimator, they proposed an estimator of the asymptotic variance of\nthe sample mean from MCMC samples. In this study, we propose novel\nautocovariance sequence and asymptotic variance estimators for Markov chain\nfunctions with multiple components, based on the univariate momentLS estimators\nfrom Berg and Song [2022]. We demonstrate strong consistency of the proposed\nauto(cross)-covariance sequence and asymptotic variance matrix estimators. We\nconduct empirical comparisons of our method with other state-of-the-art\napproaches on simulated and real-data examples, using popular samplers\nincluding the random-walk Metropolis sampler and the No-U-Turn sampler from\nSTAN."}, "http://arxiv.org/abs/2310.06357": {"title": "Adaptive Storey's null proportion estimator", "link": "http://arxiv.org/abs/2310.06357", "description": "False discovery rate (FDR) is a commonly used criterion in multiple testing\nand the Benjamini-Hochberg (BH) procedure is arguably the most popular approach\nwith FDR guarantee. To improve power, the adaptive BH procedure has been\nproposed by incorporating various null proportion estimators, among which\nStorey's estimator has gained substantial popularity. The performance of\nStorey's estimator hinges on a critical hyper-parameter, where a pre-fixed\nconfiguration lacks power and existing data-driven hyper-parameters compromise\nthe FDR control. In this work, we propose a novel class of adaptive\nhyper-parameters and establish the FDR control of the associated BH procedure\nusing a martingale argument. Within this class of data-driven hyper-parameters,\nwe present a specific configuration designed to maximize the number of\nrejections and characterize the convergence of this proposal to the optimal\nhyper-parameter under a commonly-used mixture model. We evaluate our adaptive\nStorey's null proportion estimator and the associated BH procedure on extensive\nsimulated data and a motivating protein dataset. Our proposal exhibits\nsignificant power gains when dealing with a considerable proportion of weak\nnon-nulls or a conservative null distribution."}, "http://arxiv.org/abs/2310.06467": {"title": "Advances in Kth nearest-neighbour clutter removal", "link": "http://arxiv.org/abs/2310.06467", "description": "We consider the problem of feature detection in the presence of clutter in\nspatial point processes. Classification methods have been developed in previous\nstudies. 
Among these, Byers and Raftery (1998) models the observed Kth nearest\nneighbour distances as a mixture distribution and classifies the clutter and\nfeature points accordingly. In this paper, we enhance this approach in two\nways. First, we propose an automatic procedure for selecting the number of\nnearest neighbours to consider in the classification method by means of\nsegmented regression models. Second, with the aim of applying the procedure\nmultiple times to get a ``better\" end result, we propose a stopping criterion\nthat minimizes the overall entropy measure of cluster separation between\nclutter and feature points. The proposed procedures are suitable for a feature\nwith clutter as two superimposed Poisson processes on any space, including\nlinear networks. We present simulations and two case studies of environmental\ndata to illustrate the method."}, "http://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "http://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated with a class\nof partially observed stochastic differential equations (SDE) driven by jump\nprocesses. Such models can be routinely found in applications; here\nwe focus on the case of neuroscience. The data are assumed to be observed\nregularly in time and driven by the SDE model with unknown parameters. In\npractice the SDE may not have an analytically tractable solution and this leads\nnaturally to a time-discretization. We adapt the multilevel Markov chain Monte\nCarlo method of [11], which works with a hierarchy of time discretizations, and\nshow empirically and theoretically that this is preferable to using a single\ntime discretization. The improvement is in terms of the computational cost\nneeded to obtain a pre-specified numerical error. Our approach is illustrated\non models that are found in neuroscience."}, "http://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation due to adverse events", "link": "http://arxiv.org/abs/2310.06653", "description": "In clinical trials, patients sometimes discontinue study treatments\nprematurely due to reasons such as adverse events. Treatment discontinuation\noccurs after the randomisation as an intercurrent event, making causal\ninference more challenging. The Intention-To-Treat (ITT) analysis provides\nvalid causal estimates of the effect of treatment assignment; still, it does\nnot take into account whether or not patients had to discontinue the treatment\nprematurely. We propose to deal with the problem of treatment discontinuation\nusing principal stratification, recognised in the ICH E9(R1) addendum as a\nstrategy for handling intercurrent events. Under this approach, we can\ndecompose the overall ITT effect into principal causal effects for groups of\npatients defined by their potential discontinuation behaviour in continuous\ntime. In this framework, we must consider that discontinuation happening in\ncontinuous time generates an infinite number of principal strata and that\ndiscontinuation time is not defined for patients who would never discontinue.\nAn additional complication is that discontinuation time and time-to-event\noutcomes are subject to administrative censoring. We employ a flexible\nmodel-based Bayesian approach to deal with such complications. 
We apply the\nBayesian principal stratification framework to analyse synthetic data based on\na recent RCT in Oncology, aiming to assess the causal effects of a new\ninvestigational drug combined with standard of care vs. standard of care alone\non progression-free survival. We simulate data under different assumptions that\nreflect real situations where patients' behaviour depends on critical baseline\ncovariates. Finally, we highlight how such an approach makes it straightforward\nto characterise patients' discontinuation behaviour with respect to the\navailable covariates with the help of a simulation study."}, "http://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "http://arxiv.org/abs/2310.06673", "description": "An assurance calculation is a Bayesian alternative to a power calculation.\nOne may be performed to aid the planning of a clinical trial, specifically\nto set the sample size, or to support decisions about whether or not to perform\na study. Immuno-oncology (IO) is a rapidly evolving area in the development of\nanticancer drugs. A common phenomenon that arises from IO trials is one of\ndelayed treatment effects, that is, there is a delay in the separation of the\nsurvival curves. To calculate assurance for a trial in which a delayed\ntreatment effect is likely to be present, uncertainty about key parameters\nneeds to be considered. If uncertainty is not considered, then the number of\npatients recruited may not be enough to ensure we have adequate statistical\npower to detect a clinically relevant treatment effect. We present a new\nelicitation technique for when a delayed treatment effect is likely to be\npresent and show how to compute assurance using these elicited prior\ndistributions. We provide an example to illustrate how this could be used in\npractice. Open-source software is provided for implementing our methods. Our\nmethodology makes the benefits of assurance methods available for the planning\nof IO trials (and others where a delayed treatment effect is likely to occur)."}, "http://arxiv.org/abs/2310.06696": {"title": "Variable selection with FDR control for noisy data -- an application to screening metabolites that are associated with breast and colorectal cancer", "link": "http://arxiv.org/abs/2310.06696", "description": "The rapidly expanding field of metabolomics presents an invaluable resource\nfor understanding the associations between metabolites and various diseases.\nHowever, the high dimensionality, presence of missing values, and measurement\nerrors associated with metabolomics data can present challenges in developing\nreliable and reproducible methodologies for disease association studies.\nTherefore, there is a compelling need to develop robust statistical methods\nthat can navigate these complexities to achieve reliable and reproducible\ndisease association studies. In this paper, we focus on developing such a\nmethodology with an emphasis on controlling the False Discovery Rate during the\nscreening of mutual metabolomic signals for multiple disease outcomes. We\nillustrate the versatility and performance of this procedure in a variety of\nscenarios, dealing with missing data and measurement errors. As a specific\napplication of this novel methodology, we target two of the most prevalent\ncancers among US women: breast cancer and colorectal cancer. 
By applying our\nmethod to the Women's Health Initiative data, we successfully identify\nmetabolites that are associated with either or both of these cancers,\ndemonstrating the practical utility and potential of our method in identifying\nconsistent risk factors and understanding shared mechanisms between diseases."}, "http://arxiv.org/abs/2310.06708": {"title": "Adjustment with Three Continuous Variables", "link": "http://arxiv.org/abs/2310.06708", "description": "Spurious association between X and Y may be due to a confounding variable W.\nStatisticians may adjust for W using a variety of techniques. This paper\npresents the results of simulations conducted to assess the performance of\nthose techniques under various, elementary, data-generating processes. The\nresults indicate that no technique is best overall and that specific techniques\nshould be selected based on the particulars of the data-generating process.\nHere we show how causal graphs can guide the selection or design of techniques\nfor statistical adjustment. R programs are provided for researchers interested\nin generalization."}, "http://arxiv.org/abs/2310.06720": {"title": "Asymptotic theory for Bayesian inference and prediction: from the ordinary to a conditional Peaks-Over-Threshold method", "link": "http://arxiv.org/abs/2310.06720", "description": "The Peaks Over Threshold (POT) method is the most popular statistical method\nfor the analysis of univariate extremes. Even though there is a rich applied\nliterature on Bayesian inference for the POT method, there is no asymptotic\ntheory for such proposals. Even more importantly, the ambitious and challenging\nproblem of predicting future extreme events according to a proper probabilistic\nforecasting approach has received no attention to date. In this paper we\ndevelop the asymptotic theory (consistency, contraction rates, asymptotic\nnormality and asymptotic coverage of credible intervals) for the Bayesian\ninference based on the POT method. We extend such an asymptotic theory to cover\nthe Bayesian inference on the tail properties of the conditional distribution\nof a response random variable conditionally on a vector of random covariates.\nWith the aim of making accurate predictions of more severe extreme events than those\nthat occurred in the past, we specify the posterior predictive distribution of a\nfuture unobservable excess variable in the unconditional and conditional\napproach, and we prove that it is Wasserstein consistent and derive its contraction\nrates. Simulations show the good performance of the proposed Bayesian\ninferential methods. The analysis of the change in the frequency of financial\ncrises over time shows the utility of our methodology."}, "http://arxiv.org/abs/2310.06730": {"title": "Sparse topic modeling via spectral decomposition and thresholding", "link": "http://arxiv.org/abs/2310.06730", "description": "The probabilistic Latent Semantic Indexing model assumes that the expectation\nof the corpus matrix is low-rank and can be written as the product of a\ntopic-word matrix and a word-document matrix. In this paper, we study the\nestimation of the topic-word matrix under the additional assumption that the\nordered entries of its columns rapidly decay to zero. This sparsity assumption\nis motivated by the empirical observation that the word frequencies in a text\noften adhere to Zipf's law. 
We introduce a new spectral procedure for\nestimating the topic-word matrix that thresholds words based on their corpus\nfrequencies, and show that its $\\ell_1$-error rate under our sparsity\nassumption depends on the vocabulary size $p$ only via a logarithmic term. Our\nerror bound is valid for all parameter regimes and in particular for the\nsetting where $p$ is extremely large; this high-dimensional setting is commonly\nencountered but has not been adequately addressed in prior literature.\nFurthermore, our procedure also accommodates datasets that violate the\nseparability assumption, which is necessary for most prior approaches in topic\nmodeling. Experiments with synthetic data confirm that our procedure is\ncomputationally fast and allows for consistent estimation of the topic-word\nmatrix in a wide variety of parameter regimes. Our procedure also performs well\nrelative to well-established methods when applied to a large corpus of research\npaper abstracts, as well as the analysis of single-cell and microbiome data\nwhere the same statistical model is relevant but the parameter regimes are\nvastly different."}, "http://arxiv.org/abs/2310.06746": {"title": "Causal Rule Learning: Enhancing the Understanding of Heterogeneous Treatment Effect via Weighted Causal Rules", "link": "http://arxiv.org/abs/2310.06746", "description": "Interpretability is a key concern in estimating heterogeneous treatment\neffects using machine learning methods, especially for healthcare applications\nwhere high-stake decisions are often made. Inspired by the Predictive,\nDescriptive, Relevant framework of interpretability, we propose causal rule\nlearning which finds a refined set of causal rules characterizing potential\nsubgroups to estimate and enhance our understanding of heterogeneous treatment\neffects. Causal rule learning involves three phases: rule discovery, rule\nselection, and rule analysis. In the rule discovery phase, we utilize a causal\nforest to generate a pool of causal rules with corresponding subgroup average\ntreatment effects. The selection phase then employs a D-learning method to\nselect a subset of these rules to deconstruct individual-level treatment\neffects as a linear combination of the subgroup-level effects. This helps to\nanswer an ignored question by previous literature: what if an individual\nsimultaneously belongs to multiple groups with different average treatment\neffects? The rule analysis phase outlines a detailed procedure to further\nanalyze each rule in the subset from multiple perspectives, revealing the most\npromising rules for further validation. The rules themselves, their\ncorresponding subgroup treatment effects, and their weights in the linear\ncombination give us more insights into heterogeneous treatment effects.\nSimulation and real-world data analysis demonstrate the superior performance of\ncausal rule learning on the interpretable estimation of heterogeneous treatment\neffect when the ground truth is complex and the sample size is sufficient."}, "http://arxiv.org/abs/2310.06808": {"title": "Odds are the sign is right", "link": "http://arxiv.org/abs/2310.06808", "description": "This article introduces a new condition based on odds ratios for sensitivity\nanalysis. The analysis involves the average effect of a treatment or exposure\non a response or outcome with estimates adjusted for and conditional on a\nsingle, unmeasured, dichotomous covariate. 
Results of statistical simulations\nare displayed to show that the odds ratio condition is as reliable as other\ncommonly used conditions for sensitivity analysis. Other conditions utilize\nquantities reflective of a mediating covariate. The odds ratio condition can be\napplied when the covariate is a confounding variable. As an example application\nwe use the odds ratio condition to analyze and interpret a positive association\nobserved between Zika virus infection and birth defects."}, "http://arxiv.org/abs/2204.06030": {"title": "Variable importance measures for heterogeneous causal effects", "link": "http://arxiv.org/abs/2204.06030", "description": "The recognition that personalised treatment decisions lead to better clinical\noutcomes has sparked recent research activity in the following two domains.\nPolicy learning focuses on finding optimal treatment rules (OTRs), which\nexpress whether an individual would be better off with or without treatment,\ngiven their measured characteristics. OTRs optimize a pre-set population\ncriterion, but do not provide insight into the extent to which treatment\nbenefits or harms individual subjects. Estimates of conditional average\ntreatment effects (CATEs) do offer such insights, but valid inference is\ncurrently difficult to obtain when data-adaptive methods are used. Moreover,\nclinicians are (rightly) hesitant to blindly adopt OTR or CATE estimates, not\nleast since both may represent complicated functions of patient characteristics\nthat provide little insight into the key drivers of heterogeneity. To address\nthese limitations, we introduce novel nonparametric treatment effect variable\nimportance measures (TE-VIMs). TE-VIMs extend recent regression-VIMs, viewed as\nnonparametric analogues to ANOVA statistics. By not being tied to a particular\nmodel, they are amenable to data-adaptive (machine learning) estimation of the\nCATE, itself an active area of research. Estimators for the proposed statistics\nare derived from their efficient influence curves and these are illustrated\nthrough a simulation study and an applied example."}, "http://arxiv.org/abs/2204.07907": {"title": "Just Identified Indirect Inference Estimator: Accurate Inference through Bias Correction", "link": "http://arxiv.org/abs/2204.07907", "description": "An important challenge in statistical analysis lies in controlling the\nestimation bias when handling the ever-increasing data size and model\ncomplexity of modern data settings. In this paper, we propose a reliable\nestimation and inference approach for parametric models based on the Just\nIdentified iNdirect Inference estimator (JINI). The key advantage of our\napproach is that it allows to construct a consistent estimator in a simple\nmanner, while providing strong bias correction guarantees that lead to accurate\ninference. Our approach is particularly useful for complex parametric models,\nas it allows to bypass the analytical and computational difficulties (e.g., due\nto intractable estimating equation) typically encountered in standard\nprocedures. The properties of JINI (including consistency, asymptotic\nnormality, and its bias correction property) are also studied when the\nparameter dimension is allowed to diverge, which provide the theoretical\nfoundation to explain the advantageous performance of JINI in increasing\ndimensional covariates settings. 
Our simulations and an alcohol consumption\ndata analysis highlight the practical usefulness and excellent performance of\nJINI when data present features (e.g., misclassification, rounding) as well as\nin robust estimation."}, "http://arxiv.org/abs/2209.05598": {"title": "Learning domain-specific causal discovery from time series", "link": "http://arxiv.org/abs/2209.05598", "description": "Causal discovery (CD) from time-varying data is important in neuroscience,\nmedicine, and machine learning. Techniques for CD encompass randomized\nexperiments, which are generally unbiased but expensive, and algorithms such as\nGranger causality, conditional-independence-based, structural-equation-based,\nand score-based methods that are only accurate under strong assumptions made by\nhuman designers. However, as demonstrated in other areas of machine learning,\nhuman expertise is often not entirely accurate and tends to be outperformed in\ndomains with abundant data. In this study, we examine whether we can enhance\ndomain-specific causal discovery for time series using a data-driven approach.\nOur findings indicate that this procedure significantly outperforms\nhuman-designed, domain-agnostic causal discovery methods, such as Mutual\nInformation, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor,\nthe NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when\nfeasible, the causality field should consider a supervised approach in which\ndomain-specific CD procedures are learned from extensive datasets with known\ncausal relationships, rather than being designed by human specialists. Our\nfindings promise a new approach toward improving CD in neural and medical data\nand for the broader machine learning community."}, "http://arxiv.org/abs/2209.05795": {"title": "Joint modelling of the body and tail of bivariate data", "link": "http://arxiv.org/abs/2209.05795", "description": "In situations where both extreme and non-extreme data are of interest,\nmodelling the whole data set accurately is important. In a univariate\nframework, modelling the bulk and tail of a distribution has been extensively\nstudied before. However, when more than one variable is of concern, models that\naim specifically at capturing both regions correctly are scarce in the\nliterature. A dependence model that blends two copulas with different\ncharacteristics over the whole range of the data support is proposed. One\ncopula is tailored to the bulk and the other to the tail, with a dynamic\nweighting function employed to transition smoothly between them. Tail\ndependence properties are investigated numerically and simulation is used to\nconfirm that the blended model is sufficiently flexible to capture a wide\nvariety of structures. The model is applied to study the dependence between\ntemperature and ozone concentration at two sites in the UK and compared with a\nsingle copula fit. The proposed model provides a better, more flexible, fit to\nthe data, and is also capable of capturing complex dependence structures."}, "http://arxiv.org/abs/2212.14650": {"title": "Two-step estimators of high dimensional correlation matrices", "link": "http://arxiv.org/abs/2212.14650", "description": "We investigate block diagonal and hierarchical nested stochastic multivariate\nGaussian models by studying their sample cross-correlation matrix on high\ndimensions. 
By performing numerical simulations, we compare a filtered sample\ncross-correlation with the population cross-correlation matrices by using\nseveral rotationally invariant estimators (RIE) and hierarchical clustering\nestimators (HCE) under several loss functions. We show that at large but finite\nsample sizes, sample cross-correlations filtered by RIE estimators are often\noutperformed by HCE estimators for several of the loss functions. We also show\nthat for block models and for hierarchically nested block models, the best\ndetermination of the filtered sample cross-correlation is achieved by\nintroducing two-step estimators combining state-of-the-art non-linear shrinkage\nmodels with hierarchical clustering estimators."}, "http://arxiv.org/abs/2302.02457": {"title": "Scalable inference in functional linear regression with streaming data", "link": "http://arxiv.org/abs/2302.02457", "description": "Traditional static functional data analysis is facing new challenges due to\nstreaming data, where data constantly flow in. A major challenge is that\nstoring such an ever-increasing amount of data in memory is nearly impossible.\nIn addition, existing inferential tools in online learning are mainly developed\nfor finite-dimensional problems, while inference methods for functional data\nare focused on the batch learning setting. In this paper, we tackle these\nissues by developing functional stochastic gradient descent algorithms and\nproposing an online bootstrap resampling procedure to systematically study the\ninference problem for functional linear regression. In particular, the proposed\nestimation and inference procedures use only one pass over the data; thus they\nare easy to implement and suitable for the situation where data arrive in a\nstreaming manner. Furthermore, we establish the convergence rate as well as the\nasymptotic distribution of the proposed estimator. Meanwhile, the proposed\nperturbed estimator from the bootstrap procedure is shown to enjoy the same\ntheoretical properties, which provide the theoretical justification for our\nonline inference tool. As far as we know, this is the first inference result on\nthe functional linear regression model with streaming data. Simulation studies\nare conducted to investigate the finite-sample performance of the proposed\nprocedure. An application is illustrated with the Beijing multi-site\nair-quality data."}, "http://arxiv.org/abs/2303.09598": {"title": "Variational Bayesian analysis of survival data using a log-logistic accelerated failure time model", "link": "http://arxiv.org/abs/2303.09598", "description": "The log-logistic regression model is one of the most commonly used\naccelerated failure time (AFT) models in survival analysis, for which\nstatistical inference methods are mainly established under the frequentist\nframework. Recently, Bayesian inference for log-logistic AFT models using\nMarkov chain Monte Carlo (MCMC) techniques has also been widely developed. In\nthis work, we develop an alternative approach to MCMC methods and infer the\nparameters of the log-logistic AFT model via a mean-field variational Bayes\n(VB) algorithm. A piecewise approximation technique is embedded in deriving the\nVB algorithm to achieve conjugacy. The proposed VB algorithm is evaluated and\ncompared with typical frequentist inferences and MCMC inference using simulated\ndata under various scenarios. A publicly available dataset is employed for\nillustration. 
We demonstrate that the proposed VB algorithm can achieve good\nestimation accuracy and has a lower computational cost compared with MCMC\nmethods."}, "http://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "http://arxiv.org/abs/2304.03853", "description": "StepMix is an open-source Python package for the pseudo-likelihood estimation\n(one-, two- and three-step approaches) of generalized finite mixture models\n(latent profile and latent class analysis) with external variables (covariates\nand distal outcomes). In many applications in social sciences, the main\nobjective is not only to cluster individuals into latent classes, but also to\nuse these classes to develop more complex statistical models. These models\ngenerally divide into a measurement model that relates the latent classes to\nobserved indicators, and a structural model that relates covariates and outcome\nvariables to the latent classes. The measurement and structural models can be\nestimated jointly using the so-called one-step approach or sequentially using\nstepwise methods, which present significant advantages for practitioners\nregarding the interpretability of the estimated latent classes. In addition to\nthe one-step approach, StepMix implements the most important stepwise\nestimation methods from the literature, including the bias-adjusted three-step\nmethods with Bolk-Croon-Hagenaars and maximum likelihood corrections and the\nmore recent two-step approach. These pseudo-likelihood estimators are presented\nin this paper under a unified framework as specific expectation-maximization\nsubroutines. To facilitate and promote their adoption among the data science\ncommunity, StepMix follows the object-oriented design of the scikit-learn\nlibrary and provides an additional R wrapper."}, "http://arxiv.org/abs/2310.06926": {"title": "Bayesian inference and cure rate modeling for event history data", "link": "http://arxiv.org/abs/2310.06926", "description": "Estimating model parameters of a general family of cure models is always a\nchallenging task mainly due to flatness and multimodality of the likelihood\nfunction. In this work, we propose a fully Bayesian approach in order to\novercome these issues. Posterior inference is carried out by constructing a\nMetropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines\nGibbs sampling for the latent cure indicators and Metropolis-Hastings steps\nwith Langevin diffusion dynamics for parameter updates. The main MCMC algorithm\nis embedded within a parallel tempering scheme by considering heated versions\nof the target posterior distribution. It is demonstrated via simulations that\nthe proposed algorithm freely explores the multimodal posterior distribution\nand produces robust point estimates, while it outperforms maximum likelihood\nestimation via the Expectation-Maximization algorithm. A by-product of our\nBayesian implementation is to control the False Discovery Rate when classifying\nitems as cured or not. 
Finally, the proposed method is illustrated in a real\ndataset which refers to recidivism for offenders released from prison; the\nevent of interest is whether the offender was re-incarcerated after probation\nor not."}, "http://arxiv.org/abs/2310.06969": {"title": "Positivity-free Policy Learning with Observational Data", "link": "http://arxiv.org/abs/2310.06969", "description": "Policy learning utilizing observational data is pivotal across various\ndomains, with the objective of learning the optimal treatment assignment policy\nwhile adhering to specific constraints such as fairness, budget, and\nsimplicity. This study introduces a novel positivity-free (stochastic) policy\nlearning framework designed to address the challenges posed by the\nimpracticality of the positivity assumption in real-world scenarios. This\nframework leverages incremental propensity score policies to adjust propensity\nscore values instead of assigning fixed values to treatments. We characterize\nthese incremental propensity score policies and establish identification\nconditions, employing semiparametric efficiency theory to propose efficient\nestimators capable of achieving rapid convergence rates, even when integrated\nwith advanced machine learning algorithms. This paper provides a thorough\nexploration of the theoretical guarantees associated with policy learning and\nvalidates the proposed framework's finite-sample performance through\ncomprehensive numerical experiments, ensuring the identification of causal\neffects from observational data is both robust and reliable."}, "http://arxiv.org/abs/2310.07002": {"title": "Bayesian cross-validation by parallel Markov Chain Monte Carlo", "link": "http://arxiv.org/abs/2310.07002", "description": "Brute force cross-validation (CV) is a method for predictive assessment and\nmodel selection that is general and applicable to a wide range of Bayesian\nmodels. However, in many cases brute force CV is too computationally burdensome\nto form part of interactive modeling workflows, especially when inference\nrelies on Markov chain Monte Carlo (MCMC). In this paper we present a method\nfor conducting fast Bayesian CV by massively parallel MCMC. On suitable\naccelerator hardware, for many applications our approach is about as fast (in\nwall clock time) as a single full-data model fit.\n\nParallel CV is more flexible than existing fast CV approximation methods\nbecause it can easily exploit a wide range of scoring rules and data\npartitioning schemes. This is particularly useful for CV methods designed for\nnon-exchangeable data. Our approach also delivers accurate estimates of Monte\nCarlo and CV uncertainty. 
In addition to parallelizing computations, parallel\nCV speeds up inference by reusing information from earlier MCMC adaptation and\ninference obtained during initial model fitting and checking of the full-data\nmodel.\n\nWe propose MCMC diagnostics for parallel CV applications, including a summary\nof MCMC mixing based on the popular potential scale reduction factor\n($\\hat{R}$) and MCMC effective sample size ($\\widehat{ESS}$) measures.\nFurthermore, we describe a method for determining whether an $\\hat{R}$\ndiagnostic indicates approximate stationarity of the chains, that may be of\nmore general interest for applications beyond parallel CV.\n\nFor parallel CV to work on memory-constrained computing accelerators, we show\nthat parallel CV and associated diagnostics can be implemented using online\n(streaming) algorithms ideal for parallel computing environments with limited\nmemory. Constant memory algorithms allow parallel CV to scale up to very large\nblocking designs."}, "http://arxiv.org/abs/2310.07016": {"title": "Discovering the Unknowns: A First Step", "link": "http://arxiv.org/abs/2310.07016", "description": "This article aims at discovering the unknown variables in the system through\ndata analysis. The main idea is to use the time of data collection as a\nsurrogate variable and try to identify the unknown variables by modeling\ngradual and sudden changes in the data. We use Gaussian process modeling and a\nsparse representation of the sudden changes to efficiently estimate the large\nnumber of parameters in the proposed statistical model. The method is tested on\na realistic dataset generated using a one-dimensional implementation of a\nMagnetized Liner Inertial Fusion (MagLIF) simulation model and encouraging\nresults are obtained."}, "http://arxiv.org/abs/2310.07107": {"title": "Root n consistent extremile regression and its supervised and semi-supervised learning", "link": "http://arxiv.org/abs/2310.07107", "description": "Extremile (Daouia, Gijbels and Stupfler,2019) is a novel and coherent measure\nof risk, determined by weighted expectations rather than tail probabilities. It\nfinds application in risk management, and, in contrast to quantiles, it\nfulfills the axioms of consistency, taking into account the severity of tail\nlosses. However, existing studies (Daouia, Gijbels and Stupfler,2019,2022) on\nextremile involve unknown distribution functions, making it challenging to\nobtain a root n-consistent estimator for unknown parameters in linear extremile\nregression. This article introduces a new definition of linear extremile\nregression and its estimation method, where the estimator is root n-consistent.\nAdditionally, while the analysis of unlabeled data for extremes presents a\nsignificant challenge and is currently a topic of great interest in machine\nlearning for various classification problems, we have developed a\nsemi-supervised framework for the proposed extremile regression using unlabeled\ndata. This framework can also enhance estimation accuracy under model\nmisspecification. Both simulations and real data analyses have been conducted\nto illustrate the finite sample performance of the proposed methods."}, "http://arxiv.org/abs/2310.07124": {"title": "Systematic simulation of age-period-cohort analysis: Demonstrating bias of Bayesian regularization", "link": "http://arxiv.org/abs/2310.07124", "description": "Age-period-cohort (APC) analysis is one of the fundamental time-series\nanalyses used in the social sciences. 
This paper evaluates APC analysis via\nsystematic simulation in terms of how well the artificial parameters are\nrecovered. We consider three models of Bayesian regularization using normal\nprior distributions: the random effects model with reference to multilevel\nanalysis, the ridge regression model equivalent to the intrinsic estimator, and\nthe random walk model referred to as the Bayesian cohort model. The proposed\nsimulation generates artificial data through combinations of the linear\ncomponents, focusing on the fact that the identification problem affects the\nlinear components of the three effects. Among the 13 cases of artificial data,\nthe random walk model recovered the artificial parameters well in 10 cases,\nwhile the random effects model and the ridge regression model did so in 4\ncases. The cases in which the models failed to recover the artificial\nparameters show the estimated linear component of the cohort effects as close\nto zero. In conclusion, the models of Bayesian regularization in APC analysis\nhave a bias: the index weights have a large influence on the cohort effects and\nthese constraints drive the linear component of the cohort effects close to\nzero. However, the random walk model mitigates the underestimation of the linear\ncomponent of the cohort effects."}, "http://arxiv.org/abs/2310.07330": {"title": "Functional Generalized Canonical Correlation Analysis for studying multiple longitudinal variables", "link": "http://arxiv.org/abs/2310.07330", "description": "In this paper, we introduce Functional Generalized Canonical Correlation\nAnalysis (FGCCA), a new framework for exploring associations between multiple\nrandom processes observed jointly. The framework is based on the multiblock\nRegularized Generalized Canonical Correlation Analysis (RGCCA) framework. It is\nrobust to sparsely and irregularly observed data, making it applicable in many\nsettings. We establish the monotonic property of the solving procedure and\nintroduce a Bayesian approach for estimating canonical components. We propose\nan extension of the framework that allows the integration of a univariate or\nmultivariate response into the analysis, paving the way for predictive\napplications. We evaluate the method's efficiency in simulation studies and\npresent a use case on a longitudinal dataset."}, "http://arxiv.org/abs/2310.07364": {"title": "Statistical inference of high-dimensional vector autoregressive time series with non-i", "link": "http://arxiv.org/abs/2310.07364", "description": "Independent or i.i.d. innovations is an essential assumption in the\nliterature for analyzing a vector time series. However, this assumption is\neither too restrictive for a real-life time series to satisfy or is hard to\nverify through a hypothesis test. This paper performs statistical inference on\na sparse high-dimensional vector autoregressive time series, allowing its white\nnoise innovations to be dependent, even non-stationary. To achieve this goal,\nit adopts a post-selection estimator to fit the vector autoregressive model and\nderives the asymptotic distribution of the post-selection estimator. The\ninnovations in the autoregressive time series are not assumed to be\nindependent, thus making the covariance matrices of the autoregressive\ncoefficient estimators complex and difficult to estimate. Our work develops a\nbootstrap algorithm to facilitate practitioners in performing statistical\ninference without having to engage in sophisticated calculations. 
Simulations\nand real-life data experiments reveal the validity of the proposed methods and\ntheoretical results.\n\nReal-life data is rarely considered to exactly satisfy an autoregressive\nmodel with independent or i.i.d. innovations, so our work should better reflect\nthe reality compared to the literature that assumes i.i.d. innovations."}, "http://arxiv.org/abs/2310.07399": {"title": "Randomized Runge-Kutta-Nystr\\\"om", "link": "http://arxiv.org/abs/2310.07399", "description": "We present 5/2- and 7/2-order $L^2$-accurate randomized Runge-Kutta-Nystr\\\"om\nmethods to approximate the Hamiltonian flow underlying various non-reversible\nMarkov chain Monte Carlo chains including unadjusted Hamiltonian Monte Carlo\nand unadjusted kinetic Langevin chains. Quantitative 5/2-order $L^2$-accuracy\nupper bounds are provided under gradient and Hessian Lipschitz assumptions on\nthe potential energy function. The superior complexity of the corresponding\nMarkov chains is numerically demonstrated for a selection of `well-behaved',\nhigh-dimensional target distributions."}, "http://arxiv.org/abs/2310.07456": {"title": "Hierarchical Bayesian Claim Count modeling with Overdispersed Outcome and Mismeasured Covariates in Actuarial Practice", "link": "http://arxiv.org/abs/2310.07456", "description": "The problem of overdispersed claim counts and mismeasured covariates is\ncommon in insurance. On the one hand, the presence of overdispersion in the\ncount data violates the homogeneity assumption, and on the other hand,\nmeasurement errors in covariates highlight the model risk issue in actuarial\npractice. The consequence can be inaccurate premium pricing which would\nnegatively affect business competitiveness. Our goal is to address these two\nmodelling problems simultaneously by capturing the unobservable correlations\nbetween observations that arise from overdispersed outcome and mismeasured\ncovariate in actuarial process. To this end, we establish novel connections\nbetween the count-based generalized linear mixed model (GLMM) and a popular\nerror-correction tool for non-linear modelling - Simulation Extrapolation\n(SIMEX). We consider a modelling framework based on the hierarchical Bayesian\nparadigm. To our knowledge, the approach of combining a hierarchical Bayes with\nSIMEX has not previously been discussed in the literature. We demonstrate the\napplicability of our approach on the workplace absenteeism data. Our results\nindicate that the hierarchical Bayesian GLMM incorporated with the SIMEX\noutperforms naive GLMM / SIMEX in terms of goodness of fit."}, "http://arxiv.org/abs/2310.07567": {"title": "Comparing the effectiveness of k-different treatments through the area under the ROC curve", "link": "http://arxiv.org/abs/2310.07567", "description": "The area under the receiver-operating characteristic curve (AUC) has become a\npopular index not only for measuring the overall prediction capacity of a\nmarker but also the association strength between continuous and binary\nvariables. In the current study, it has been used for comparing the association\nsize of four different interventions involving impulsive decision making,\nstudied through an animal model, in which each animal provides several negative\n(pre-treatment) and positive (post-treatment) measures. The problem of the full\ncomparison of the average AUCs arises therefore in a natural way. 
We construct\nan analysis of variance (ANOVA) type test for testing the equality of the\nimpact of these treatments measured through the respective AUCs, and\nconsidering the random-effect represented by the animal. The use (and\ndevelopment) of a post-hoc Tukey's HSD type test is also considered. We explore\nthe finite-sample behavior of our proposal via Monte Carlo simulations, and\nanalyze the data generated from the original problem. An R package implementing\nthe procedures is provided as supplementary material."}, "http://arxiv.org/abs/2310.07605": {"title": "Split Knockoffs for Multiple Comparisons: Controlling the Directional False Discovery Rate", "link": "http://arxiv.org/abs/2310.07605", "description": "Multiple comparisons in hypothesis testing often encounter structural\nconstraints in various applications. For instance, in structural Magnetic\nResonance Imaging for Alzheimer's Disease, the focus extends beyond examining\natrophic brain regions to include comparisons of anatomically adjacent regions.\nThese constraints can be modeled as linear transformations of parameters, where\nthe sign patterns play a crucial role in estimating directional effects. This\nclass of problems, encompassing total variations, wavelet transforms, fused\nLASSO, trend filtering, and more, presents an open challenge in effectively\ncontrolling the directional false discovery rate. In this paper, we propose an\nextended Split Knockoff method specifically designed to address the control of\ndirectional false discovery rate under linear transformations. Our proposed\napproach relaxes the stringent linear manifold constraint to its neighborhood,\nemploying a variable splitting technique commonly used in optimization. This\nmethodology yields an orthogonal design that benefits both power and\ndirectional false discovery rate control. By incorporating a sample splitting\nscheme, we achieve effective control of the directional false discovery rate,\nwith a notable reduction to zero as the relaxed neighborhood expands. To\ndemonstrate the efficacy of our method, we conduct simulation experiments and\napply it to two real-world scenarios: Alzheimer's Disease analysis and human\nage comparisons."}, "http://arxiv.org/abs/2310.07680": {"title": "Hamiltonian Dynamics of Bayesian Inference Formalised by Arc Hamiltonian Systems", "link": "http://arxiv.org/abs/2310.07680", "description": "This paper makes two theoretical contributions. First, we establish a novel\nclass of Hamiltonian systems, called arc Hamiltonian systems, for saddle\nHamiltonian functions over infinite-dimensional metric spaces. Arc Hamiltonian\nsystems generate a flow that satisfies the law of conservation of energy\neverywhere in a metric space. They are governed by an extension of Hamilton's\nequation formulated based on (i) the framework of arc fields and (ii) an\ninfinite-dimensional gradient, termed the arc gradient, of a Hamiltonian\nfunction. We derive conditions for the existence of a flow generated by an arc\nHamiltonian system, showing that they reduce to local Lipschitz continuity of\nthe arc gradient under sufficient regularity. Second, we present two\nHamiltonian functions, called the cumulant generating functional and the\ncentred cumulant generating functional, over a metric space of log-likelihoods\nand measures. The former characterises the posterior of Bayesian inference as a\npart of the arc gradient that induces a flow of log-likelihoods and\nnon-negative measures. 
The latter characterises the difference of the posterior\nand the prior as a part of the arc gradient that induces a flow of\nlog-likelihoods and probability measures. Our results reveal an implication of\nthe belief updating mechanism from the prior to the posterior as an\ninfinitesimal change of a measure in the infinite-dimensional Hamiltonian\nflows."}, "http://arxiv.org/abs/2009.12217": {"title": "Latent Causal Socioeconomic Health Index", "link": "http://arxiv.org/abs/2009.12217", "description": "This research develops a model-based LAtent Causal Socioeconomic Health\n(LACSH) index at the national level. Motivated by the need for a holistic\nnational well-being index, we build upon the latent health factor index (LHFI)\napproach that has been used to assess the unobservable ecological/ecosystem\nhealth. LHFI integratively models the relationship between metrics, latent\nhealth, and covariates that drive the notion of health. In this paper, the LHFI\nstructure is integrated with spatial modeling and statistical causal modeling.\nOur efforts are focused on developing the integrated framework to facilitate\nthe understanding of how an observational continuous variable might have\ncausally affected a latent trait that exhibits spatial correlation. A novel\nvisualization technique to evaluate covariate balance is also introduced for\nthe case of a continuous policy (treatment) variable. Our resulting LACSH\nframework and visualization tool are illustrated through two global case\nstudies on national socioeconomic health (latent trait), each with various\nmetrics and covariates pertaining to different aspects of societal health, and\nthe treatment variable being mandatory maternity leave days and government\nexpenditure on healthcare, respectively. We validate our model by two\nsimulation studies. All approaches are structured in a Bayesian hierarchical\nframework and results are obtained by Markov chain Monte Carlo techniques."}, "http://arxiv.org/abs/2201.02958": {"title": "Smooth Nested Simulation: Bridging Cubic and Square Root Convergence Rates in High Dimensions", "link": "http://arxiv.org/abs/2201.02958", "description": "Nested simulation concerns estimating functionals of a conditional\nexpectation via simulation. In this paper, we propose a new method based on\nkernel ridge regression to exploit the smoothness of the conditional\nexpectation as a function of the multidimensional conditioning variable.\nAsymptotic analysis shows that the proposed method can effectively alleviate\nthe curse of dimensionality on the convergence rate as the simulation budget\nincreases, provided that the conditional expectation is sufficiently smooth.\nThe smoothness bridges the gap between the cubic root convergence rate (that\nis, the optimal rate for the standard nested simulation) and the square root\nconvergence rate (that is, the canonical rate for the standard Monte Carlo\nsimulation). We demonstrate the performance of the proposed method via\nnumerical examples from portfolio risk management and input uncertainty\nquantification."}, "http://arxiv.org/abs/2204.12635": {"title": "Multivariate and regression models for directional data based on projected P\\'olya trees", "link": "http://arxiv.org/abs/2204.12635", "description": "Projected distributions have proved to be useful in the study of circular and\ndirectional data. Although any multivariate distribution can be used to produce\na projected model, these distributions are typically parametric. 
In this\narticle we consider a multivariate P\\'olya tree on $R^k$ and project it to the\nunit hypersphere $S^k$ to define a new Bayesian nonparametric model for\ndirectional data. We study the properties of the proposed model and in\nparticular, concentrate on the implied conditional distributions of some\ndirections given the others to define a directional-directional regression\nmodel. We also define a multivariate linear regression model with P\\'olya tree\nerror and project it to define a linear-directional regression model. We obtain\nthe posterior characterisation of all models and show their performance with\nsimulated and real datasets."}, "http://arxiv.org/abs/2207.13250": {"title": "Spatio-Temporal Wildfire Prediction using Multi-Modal Data", "link": "http://arxiv.org/abs/2207.13250", "description": "Due to severe societal and environmental impacts, wildfire prediction using\nmulti-modal sensing data has become a highly sought-after data-analytical tool\nby various stakeholders (such as state governments and power utility companies)\nto achieve a more informed understanding of wildfire activities and plan\npreventive measures. A desirable algorithm should precisely predict fire risk\nand magnitude for a location in real time. In this paper, we develop a flexible\nspatio-temporal wildfire prediction framework using multi-modal time series\ndata. We first predict the wildfire risk (the chance of a wildfire event) in\nreal-time, considering the historical events using discrete mutually exciting\npoint process models. Then we further develop a wildfire magnitude prediction\nset method based on the flexible distribution-free time-series conformal\nprediction (CP) approach. Theoretically, we prove a risk model parameter\nrecovery guarantee, as well as coverage and set size guarantees for the CP\nsets. Through extensive real-data experiments with wildfire data in California,\nwe demonstrate the effectiveness of our methods, as well as their flexibility\nand scalability in large regions."}, "http://arxiv.org/abs/2210.13550": {"title": "Regularized Nonlinear Regression with Dependent Errors and its Application to a Biomechanical Model", "link": "http://arxiv.org/abs/2210.13550", "description": "A biomechanical model often requires parameter estimation and selection in a\nknown but complicated nonlinear function. Motivated by observing that data from\na head-neck position tracking system, one of biomechanical models, show\nmultiplicative time dependent errors, we develop a modified penalized weighted\nleast squares estimator. The proposed method can be also applied to a model\nwith non-zero mean time dependent additive errors. Asymptotic properties of the\nproposed estimator are investigated under mild conditions on a weight matrix\nand the error process. A simulation study demonstrates that the proposed\nestimation works well in both parameter estimation and selection with time\ndependent error. The analysis and comparison with an existing method for\nhead-neck position tracking data show better performance of the proposed method\nin terms of the variance accounted for (VAF)."}, "http://arxiv.org/abs/2210.14965": {"title": "Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions", "link": "http://arxiv.org/abs/2210.14965", "description": "This paper adopts a tool from computational topology, the Euler\ncharacteristic curve (ECC) of a sample, to perform one- and two-sample goodness\nof fit tests. We call our procedure TopoTests. 
The presented tests work for\nsamples of arbitrary dimension, having comparable power to the state-of-the-art\ntests in the one-dimensional case. It is demonstrated that the type I error of\nTopoTests can be controlled and their type II error vanishes exponentially with\nincreasing sample size. Extensive numerical simulations of TopoTests are\nconducted to demonstrate their power for samples of various sizes."}, "http://arxiv.org/abs/2211.03860": {"title": "Automatic Change-Point Detection in Time Series via Deep Learning", "link": "http://arxiv.org/abs/2211.03860", "description": "Detecting change-points in data is challenging because of the range of\npossible types of change and types of behaviour of data when there is no\nchange. Statistically efficient methods for detecting a change will depend on\nboth of these features, and it can be difficult for a practitioner to develop\nan appropriate detection method for their application of interest. We show how\nto automatically generate new offline detection methods based on training a\nneural network. Our approach is motivated by many existing tests for the\npresence of a change-point being representable by a simple neural network, and\nthus a neural network trained with sufficient data should have performance at\nleast as good as these methods. We present theory that quantifies the error\nrate for such an approach, and how it depends on the amount of training data.\nEmpirical results show that, even with limited training data, its performance\nis competitive with the standard CUSUM-based classifier for detecting a change\nin mean when the noise is independent and Gaussian, and can substantially\noutperform it in the presence of auto-correlated or heavy-tailed noise. Our\nmethod also shows strong results in detecting and localising changes in\nactivity based on accelerometer data."}, "http://arxiv.org/abs/2211.09099": {"title": "Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2211.09099", "description": "The Brazil Bolsa Familia (BF) program is a conditional cash transfer program\naimed to reduce short-term poverty by direct cash transfers and to fight\nlong-term poverty by increasing human capital among poor Brazilian people.\nEligibility for Bolsa Familia benefits depends on a cutoff rule, which\nclassifies the BF study as a regression discontinuity (RD) design. Extracting\ncausal information from RD studies is challenging. Following Li et al (2015)\nand Branson and Mealli (2019), we formally describe the BF RD design as a local\nrandomized experiment within the potential outcome approach. Under this\nframework, causal effects can be identified and estimated on a subpopulation\nwhere a local overlap assumption, a local SUTVA and a local ignorability\nassumption hold. We first discuss the potential advantages of this framework\nover local regression methods based on continuity assumptions, which concern\nthe definition of the causal estimands, the design and the analysis of the\nstudy, and the interpretation and generalizability of the results. A critical\nissue of this local randomization approach is how to choose subpopulations for\nwhich we can draw valid causal inference. We propose a Bayesian model-based\nfinite mixture approach to clustering to classify observations into\nsubpopulations where the RD assumptions hold and do not hold. 
This approach has\nimportant advantages: a) it allows to account for the uncertainty in the\nsubpopulation membership, which is typically neglected; b) it does not impose\nany constraint on the shape of the subpopulation; c) it is scalable to\nhigh-dimensional settings; e) it allows to target alternative causal estimands\nthan the average treatment effect (ATE); and f) it is robust to a certain\ndegree of manipulation/selection of the running variable. We apply our proposed\napproach to assess causal effects of the Bolsa Familia program on leprosy\nincidence in 2009."}, "http://arxiv.org/abs/2301.08276": {"title": "Cross-validatory model selection for Bayesian autoregressions with exogenous regressors", "link": "http://arxiv.org/abs/2301.08276", "description": "Bayesian cross-validation (CV) is a popular method for predictive model\nassessment that is simple to implement and broadly applicable. A wide range of\nCV schemes is available for time series applications, including generic\nleave-one-out (LOO) and K-fold methods, as well as specialized approaches\nintended to deal with serial dependence such as leave-future-out (LFO),\nh-block, and hv-block.\n\nExisting large-sample results show that both specialized and generic methods\nare applicable to models of serially-dependent data. However, large sample\nconsistency results overlook the impact of sampling variability on accuracy in\nfinite samples. Moreover, the accuracy of a CV scheme depends on many aspects\nof the procedure. We show that poor design choices can lead to elevated rates\nof adverse selection.\n\nIn this paper, we consider the problem of identifying the regression\ncomponent of an important class of models of data with serial dependence,\nautoregressions of order p with q exogenous regressors (ARX(p,q)), under the\nlogarithmic scoring rule. We show that when serial dependence is present,\nscores computed using the joint (multivariate) density have lower variance and\nbetter model selection accuracy than the popular pointwise estimator. In\naddition, we present a detailed case study of the special case of ARX models\nwith fixed autoregressive structure and variance. For this class, we derive the\nfinite-sample distribution of the CV estimators and the model selection\nstatistic. We conclude with recommendations for practitioners."}, "http://arxiv.org/abs/2301.12026": {"title": "G-formula for causal inference via multiple imputation", "link": "http://arxiv.org/abs/2301.12026", "description": "G-formula is a popular approach for estimating treatment or exposure effects\nfrom longitudinal data that are subject to time-varying confounding. G-formula\nestimation is typically performed by Monte-Carlo simulation, with\nnon-parametric bootstrapping used for inference. We show that G-formula can be\nimplemented by exploiting existing methods for multiple imputation (MI) for\nsynthetic data. This involves using an existing modified version of Rubin's\nvariance estimator. In practice missing data is ubiquitous in longitudinal\ndatasets. We show that such missing data can be readily accommodated as part of\nthe MI procedure when using G-formula, and describe how MI software can be used\nto implement the approach. 
We explore its performance using a simulation study\nand an application from cystic fibrosis."}, "http://arxiv.org/abs/2306.01292": {"title": "Alternative Measures of Direct and Indirect Effects", "link": "http://arxiv.org/abs/2306.01292", "description": "There are a number of measures of direct and indirect effects in the\nliterature. They are suitable in some cases and unsuitable in others. We\ndescribe a case where the existing measures are unsuitable and propose new\nsuitable ones. We also show that the new measures can partially handle\nunmeasured treatment-outcome confounding, and bound long-term effects by\ncombining experimental and observational data."}, "http://arxiv.org/abs/2309.11942": {"title": "On the Probability of Immunity", "link": "http://arxiv.org/abs/2309.11942", "description": "This work is devoted to the study of the probability of immunity, i.e. the\neffect occurs whether exposed or not. We derive necessary and sufficient\nconditions for non-immunity and $\\epsilon$-bounded immunity, i.e. the\nprobability of immunity is zero and $\\epsilon$-bounded, respectively. The\nformer allows us to estimate the probability of benefit (i.e., the effect\noccurs if and only if exposed) from a randomized controlled trial, and the\nlatter allows us to produce bounds of the probability of benefit that are\ntighter than the existing ones. We also introduce the concept of indirect\nimmunity (i.e., through a mediator) and repeat our previous analysis for it.\nFinally, we propose a method for sensitivity analysis of the probability of\nimmunity under unmeasured confounding."}, "http://arxiv.org/abs/2309.13441": {"title": "Anytime valid and asymptotically optimal statistical inference driven by predictive recursion", "link": "http://arxiv.org/abs/2309.13441", "description": "Distinguishing two classes of candidate models is a fundamental and\npractically important problem in statistical inference. Error rate control is\ncrucial to the logic but, in complex nonparametric settings, such guarantees\ncan be difficult to achieve, especially when the stopping rule that determines\nthe data collection process is not available. In this paper we develop a novel\ne-process construction that leverages the so-called predictive recursion (PR)\nalgorithm designed to rapidly and recursively fit nonparametric mixture models.\nThe resulting PRe-process affords anytime valid inference uniformly over\nstopping rules and is shown to be efficient in the sense that it achieves the\nmaximal growth rate under the alternative relative to the mixture model being\nfit by PR. In the special case of testing for a log-concave density, the\nPRe-process test is computationally simpler and faster, more stable, and no\nless efficient compared to a recently proposed anytime valid test."}, "http://arxiv.org/abs/2309.16598": {"title": "Cross-Prediction-Powered Inference", "link": "http://arxiv.org/abs/2309.16598", "description": "While reliable data-driven decision-making hinges on high-quality labeled\ndata, the acquisition of quality labels often involves laborious human\nannotations or slow and expensive scientific measurements. Machine learning is\nbecoming an appealing alternative as sophisticated predictive techniques are\nbeing used to quickly and cheaply produce large amounts of predicted labels;\ne.g., predicted protein structures are used to supplement experimentally\nderived structures, predictions of socioeconomic indicators from satellite\nimagery are used to supplement accurate survey data, and so on. 
Since\npredictions are imperfect and potentially biased, this practice brings into\nquestion the validity of downstream inferences. We introduce cross-prediction:\na method for valid inference powered by machine learning. With a small labeled\ndataset and a large unlabeled dataset, cross-prediction imputes the missing\nlabels via machine learning and applies a form of debiasing to remedy the\nprediction inaccuracies. The resulting inferences achieve the desired error\nprobability and are more powerful than those that only leverage the labeled\ndata. Closely related is the recent proposal of prediction-powered inference,\nwhich assumes that a good pre-trained model is already available. We show that\ncross-prediction is consistently more powerful than an adaptation of\nprediction-powered inference in which a fraction of the labeled data is split\noff and used to train the model. Finally, we observe that cross-prediction\ngives more stable conclusions than its competitors; its confidence intervals\ntypically have significantly lower variability."}, "http://arxiv.org/abs/2310.07801": {"title": "Trajectory-aware Principal Manifold Framework for Data Augmentation and Image Generation", "link": "http://arxiv.org/abs/2310.07801", "description": "Data augmentation for deep learning benefits model training, image\ntransformation, medical imaging analysis and many other fields. Many existing\nmethods generate new samples from a parametric distribution, like the Gaussian,\nwith little attention to generate samples along the data manifold in either the\ninput or feature space. In this paper, we verify that there are theoretical and\npractical advantages of using the principal manifold hidden in the feature\nspace than the Gaussian distribution. We then propose a novel trajectory-aware\nprincipal manifold framework to restore the manifold backbone and generate\nsamples along a specific trajectory. On top of the autoencoder architecture, we\nfurther introduce an intrinsic dimension regularization term to make the\nmanifold more compact and enable few-shot image generation. Experimental\nresults show that the novel framework is able to extract more compact manifold\nrepresentation, improve classification accuracy and generate smooth\ntransformation among few samples."}, "http://arxiv.org/abs/2310.07817": {"title": "Nonlinear global Fr\\'echet regression for random objects via weak conditional expectation", "link": "http://arxiv.org/abs/2310.07817", "description": "Random objects are complex non-Euclidean data taking value in general metric\nspace, possibly devoid of any underlying vector space structure. Such data are\ngetting increasingly abundant with the rapid advancement in technology.\nExamples include probability distributions, positive semi-definite matrices,\nand data on Riemannian manifolds. However, except for regression for\nobject-valued response with Euclidean predictors and\ndistribution-on-distribution regression, there has been limited development of\na general framework for object-valued response with object-valued predictors in\nthe literature. To fill this gap, we introduce the notion of a weak conditional\nFr\\'echet mean based on Carleman operators and then propose a global nonlinear\nFr\\'echet regression model through the reproducing kernel Hilbert space (RKHS)\nembedding. Furthermore, we establish the relationships between the conditional\nFr\\'echet mean and the weak conditional Fr\\'echet mean for both Euclidean and\nobject-valued data. 
We also show that the state-of-the-art global Fr\\'echet\nregression developed by Petersen and Mueller, 2019 emerges as a special case of\nour method by choosing a linear kernel. We require that the metric space for\nthe predictor admits a reproducing kernel, while the intrinsic geometry of the\nmetric space for the response is utilized to study the asymptotic properties of\nthe proposed estimates. Numerical studies, including extensive simulations and\na real application, are conducted to investigate the performance of our\nestimator in a finite sample."}, "http://arxiv.org/abs/2310.07850": {"title": "Conformal prediction with local weights: randomization enables local guarantees", "link": "http://arxiv.org/abs/2310.07850", "description": "In this work, we consider the problem of building distribution-free\nprediction intervals with finite-sample conditional coverage guarantees.\nConformal prediction (CP) is an increasingly popular framework for building\nprediction intervals with distribution-free guarantees, but these guarantees\nonly ensure marginal coverage: the probability of coverage is averaged over a\nrandom draw of both the training and test data, meaning that there might be\nsubstantial undercoverage within certain subpopulations. Instead, ideally, we\nwould want to have local coverage guarantees that hold for each possible value\nof the test point's features. While the impossibility of achieving pointwise\nlocal coverage is well established in the literature, many variants of\nconformal prediction algorithm show favorable local coverage properties\nempirically. Relaxing the definition of local coverage can allow for a\ntheoretical understanding of this empirical phenomenon. We aim to bridge this\ngap between theoretical validation and empirical performance by proving\nachievable and interpretable guarantees for a relaxed notion of local coverage.\nBuilding on the localized CP method of Guan (2023) and the weighted CP\nframework of Tibshirani et al. (2019), we propose a new method,\nrandomly-localized conformal prediction (RLCP), which returns prediction\nintervals that are not only marginally valid but also achieve a relaxed local\ncoverage guarantee and guarantees under covariate shift. Through a series of\nsimulations and real data experiments, we validate these coverage guarantees of\nRLCP while comparing it with the other local conformal prediction methods."}, "http://arxiv.org/abs/2310.07852": {"title": "On the Computational Complexity of Private High-dimensional Model Selection via the Exponential Mechanism", "link": "http://arxiv.org/abs/2310.07852", "description": "We consider the problem of model selection in a high-dimensional sparse\nlinear regression model under the differential privacy framework. In\nparticular, we consider the problem of differentially private best subset\nselection and study its utility guarantee. We adopt the well-known exponential\nmechanism for selecting the best model, and under a certain margin condition,\nwe establish its strong model recovery property. However, the exponential\nsearch space of the exponential mechanism poses a serious computational\nbottleneck. To overcome this challenge, we propose a Metropolis-Hastings\nalgorithm for the sampling step and establish its polynomial mixing time to its\nstationary distribution in the problem parameters $n,p$, and $s$. Furthermore,\nwe also establish approximate differential privacy for the final estimates of\nthe Metropolis-Hastings random walk using its mixing property. 
Finally, we also\nperform some illustrative simulations that echo the theoretical findings of our\nmain results."}, "http://arxiv.org/abs/2310.07935": {"title": "Estimating the Likelihood of Arrest from Police Records in Presence of Unreported Crimes", "link": "http://arxiv.org/abs/2310.07935", "description": "Many important policy decisions concerning policing hinge on our\nunderstanding of how likely various criminal offenses are to result in arrests.\nSince many crimes are never reported to law enforcement, estimates based on\npolice records alone must be adjusted to account for the likelihood that each\ncrime would have been reported to the police. In this paper, we present a\nmethodological framework for estimating the likelihood of arrest from police\ndata that incorporates estimates of crime reporting rates computed from a\nvictimization survey. We propose a parametric regression-based two-step\nestimator that (i) estimates the likelihood of crime reporting using logistic\nregression with survey weights; and then (ii) applies a second regression step\nto model the likelihood of arrest. Our empirical analysis focuses on racial\ndisparities in arrests for violent crimes (sex offenses, robbery, aggravated\nand simple assaults) from 2006--2015 police records from the National Incident\nBased Reporting System (NIBRS), with estimates of crime reporting obtained\nusing 2003--2020 data from the National Crime Victimization Survey (NCVS). We\nfind that, after adjusting for unreported crimes, the likelihood of arrest\ncomputed from police records decreases significantly. We also find that, while\nincidents with white offenders on average result in arrests more often than\nthose with black offenders, the disparities tend to be small after accounting\nfor crime characteristics and unreported crimes."}, "http://arxiv.org/abs/2310.07953": {"title": "Enhancing Sample Quality through Minimum Energy Importance Weights", "link": "http://arxiv.org/abs/2310.07953", "description": "Importance sampling is a powerful tool for correcting the distributional\nmismatch in many statistical and machine learning problems, but in practice its\nperformance is limited by the usage of simple proposals whose importance\nweights can be computed analytically. To address this limitation, Liu and Lee\n(2017) proposed a Black-Box Importance Sampling (BBIS) algorithm that computes\nthe importance weights for arbitrary simulated samples by minimizing the\nkernelized Stein discrepancy. However, this requires knowing the score function\nof the target distribution, which is not easy to compute for many Bayesian\nproblems. Hence, in this paper we propose another novel BBIS algorithm using\nminimum energy design, BBIS-MED, that requires only the unnormalized density\nfunction, which can be utilized as a post-processing step to improve the\nquality of Markov Chain Monte Carlo samples. We demonstrate the effectiveness\nand wide applicability of our proposed BBIS-MED algorithm on extensive\nsimulations and a real-world Bayesian model calibration problem where the score\nfunction cannot be derived analytically."}, "http://arxiv.org/abs/2310.07958": {"title": "Towards Causal Deep Learning for Vulnerability Detection", "link": "http://arxiv.org/abs/2310.07958", "description": "Deep learning vulnerability detection has shown promising results in recent\nyears. 
However, an important challenge that still blocks it from being very\nuseful in practice is that the model is not robust under perturbation and it\ncannot generalize well over the out-of-distribution (OOD) data, e.g., applying\na trained model to unseen projects in the real world. We hypothesize that this is\nbecause the model learned non-robust features, e.g., variable names, that have\nspurious correlations with labels. When the perturbed and OOD datasets no\nlonger have the same spurious features, the model prediction fails. To address\nthe challenge, in this paper, we introduced causality into deep learning\nvulnerability detection. Our approach CausalVul consists of two phases. First,\nwe designed novel perturbations to discover spurious features that the model\nmay use to make predictions. Second, we applied the causal learning algorithms,\nspecifically, do-calculus, on top of existing deep learning models to\nsystematically remove the use of spurious features and thus promote causal\nbased prediction. Our results show that CausalVul consistently improved the\nmodel accuracy, robustness and OOD performance for all the state-of-the-art\nmodels and datasets we experimented with. To the best of our knowledge, this is the\nfirst work that introduces do calculus based causal learning to software\nengineering models and shows it's indeed useful for improving the model\naccuracy, robustness and generalization. Our replication package is located at\nhttps://figshare.com/s/0ffda320dcb96c249ef2."}, "http://arxiv.org/abs/2310.07973": {"title": "Statistical Performance Guarantee for Selecting Those Predicted to Benefit Most from Treatment", "link": "http://arxiv.org/abs/2310.07973", "description": "Across a wide array of disciplines, many researchers use machine learning\n(ML) algorithms to identify a subgroup of individuals, called exceptional\nresponders, who are likely to be helped by a treatment the most. A common\napproach consists of two steps. One first estimates the conditional average\ntreatment effect or its proxy using an ML algorithm. They then determine the\ncutoff of the resulting treatment prioritization score to select those\npredicted to benefit most from the treatment. Unfortunately, these estimated\ntreatment prioritization scores are often biased and noisy. Furthermore,\nutilizing the same data to both choose a cutoff value and estimate the average\ntreatment effect among the selected individuals suffers from a multiple testing\nproblem. To address these challenges, we develop a uniform confidence band for\nexperimentally evaluating the sorted average treatment effect (GATES) among the\nindividuals whose treatment prioritization score is at least as high as any\ngiven quantile value, regardless of how the quantile is chosen. This provides a\nstatistical guarantee that the GATES for the selected subgroup exceeds a\ncertain threshold. The validity of the proposed methodology depends solely on\nrandomization of treatment and random sampling of units without requiring\nmodeling assumptions or resampling methods. This widens its applicability\nto include a wide range of other causal quantities. A simulation study shows\nthat the empirical coverage of the proposed uniform confidence bands is close\nto the nominal coverage when the sample is as small as 100. 
We analyze a\nclinical trial of late-stage prostate cancer and find a relatively large\nproportion of exceptional responders with a statistical performance guarantee."}, "http://arxiv.org/abs/2310.08020": {"title": "Assessing Copula Models for Mixed Continuous-Ordinal Variables", "link": "http://arxiv.org/abs/2310.08020", "description": "Vine pair-copula constructions exist for a mix of continuous and ordinal\nvariables. In some steps, this can involve estimating a bivariate copula for a\npair of mixed continuous-ordinal variables. To assess the adequacy of copula\nfits for such a pair, diagnostic and visualization methods based on normal\nscore plots and conditional Q-Q plots are proposed. The former utilizes a\nlatent continuous variable for the ordinal variable. Using the Kullback-Leibler\ndivergence, existing probability models for a mixed continuous-ordinal variable\npair are assessed for the adequacy of fit with simple parametric copula\nfamilies. The effectiveness of the proposed visualization and diagnostic\nmethods is illustrated on simulated and real datasets."}, "http://arxiv.org/abs/2310.08193": {"title": "Are sanctions for losers? A network study of trade sanctions", "link": "http://arxiv.org/abs/2310.08193", "description": "Studies built on dependency and world-system theory using network approaches\nhave shown that international trade is structured into clusters of 'core' and\n'peripheral' countries performing distinct functions. However, few have used\nthese methods to investigate how sanctions affect the position of the countries\ninvolved in the capitalist world-economy. Yet, this topic has acquired pressing\nrelevance due to the emergence of economic warfare as a key geopolitical weapon\nsince the 1950s. And even more so in light of the preeminent role that\nsanctions have played in the US and their allies' response to the\nRussian-Ukrainian war. Applying several clustering techniques designed for\ncomplex and temporal networks, this paper shows a shift in the pattern of\ncommerce away from sanctioning countries and towards neutral or friendly ones.\nAdditionally, there are suggestions that these shifts may lead to the creation\nof an alternative 'core' that interacts with the world-economy's periphery\nbypassing traditional 'core' countries such as EU member States and the US."}, "http://arxiv.org/abs/2310.08268": {"title": "Change point detection in dynamic heterogeneous networks via subspace tracking", "link": "http://arxiv.org/abs/2310.08268", "description": "Dynamic networks consist of a sequence of time-varying networks, and it is of\ngreat importance to detect the network change points. Most existing methods\nfocus on detecting abrupt change points, necessitating the assumption that the\nunderlying network probability matrix remains constant between adjacent change\npoints. This paper introduces a new model that allows the network probability\nmatrix to undergo continuous shifting, while the latent network structure,\nrepresented via the embedding subspace, only changes at certain time points.\nTwo novel statistics are proposed to jointly detect these network subspace\nchange points, followed by a carefully refined detection procedure.\nTheoretically, we show that the proposed method is asymptotically consistent in\nterms of change point detection, and also establish the impossibility region\nfor detecting these network subspace change points. 
The advantage of the\nproposed method is also supported by extensive numerical experiments on both\nsynthetic networks and a UK politician social network."}, "http://arxiv.org/abs/2310.08397": {"title": "Assessing Marine Mammal Abundance: A Novel Data Fusion", "link": "http://arxiv.org/abs/2310.08397", "description": "Marine mammals are increasingly vulnerable to human disturbance and climate\nchange. Their diving behavior leads to limited visual access during data\ncollection, making studying the abundance and distribution of marine mammals\nchallenging. In theory, using data from more than one observation modality\nshould lead to better informed predictions of abundance and distribution. With\nfocus on North Atlantic right whales, we consider the fusion of two data\nsources to inform about their abundance and distribution. The first source is\naerial distance sampling which provides the spatial locations of whales\ndetected in the region. The second source is passive acoustic monitoring (PAM),\nreturning calls received at hydrophones placed on the ocean floor. Due to\nlimited time on the surface and detection limitations arising from sampling\neffort, aerial distance sampling only provides a partial realization of\nlocations. With PAM, we never observe numbers or locations of individuals. To\naddress these challenges, we develop a novel thinned point pattern data fusion.\nOur approach leads to improved inference regarding abundance and distribution\nof North Atlantic right whales throughout Cape Cod Bay, Massachusetts in the\nUS. We demonstrate performance gains of our approach compared to that from a\nsingle source through both simulation and real data."}, "http://arxiv.org/abs/2310.08410": {"title": "Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review and Meta-Analysis", "link": "http://arxiv.org/abs/2310.08410", "description": "Large language models such as ChatGPT are increasingly explored in medical\ndomains. However, the absence of standard guidelines for performance evaluation\nhas led to methodological inconsistencies. This study aims to summarize the\navailable evidence on evaluating ChatGPT's performance in medicine and provide\ndirection for future research. We searched ten medical literature databases on\nJune 15, 2023, using the keyword \"ChatGPT\". A total of 3520 articles were\nidentified, of which 60 were reviewed and summarized in this paper and 17 were\nincluded in the meta-analysis. The analysis showed that ChatGPT displayed an\noverall integrated accuracy of 56% (95% CI: 51%-60%, I2 = 87%) in addressing\nmedical queries. However, the studies varied in question resource,\nquestion-asking process, and evaluation metrics. Moreover, many studies failed\nto report methodological details, including the version of ChatGPT and whether\neach question was used independently or repeatedly. Our findings revealed that\nalthough ChatGPT demonstrated considerable potential for application in\nhealthcare, the heterogeneity of the studies and insufficient reporting may\naffect the reliability of these results. 
Further well-designed studies with\ncomprehensive and transparent reporting are needed to evaluate ChatGPT's\nperformance in medicine."}, "http://arxiv.org/abs/2310.08414": {"title": "Confidence bounds for the true discovery proportion based on the exact distribution of the number of rejections", "link": "http://arxiv.org/abs/2310.08414", "description": "In multiple hypotheses testing it has become widely popular to make inference\non the true discovery proportion (TDP) of a set $\\mathcal{M}$ of null\nhypotheses. This approach is useful for several application fields, such as\nneuroimaging and genomics. Several procedures to compute simultaneous lower\nconfidence bounds for the TDP have been suggested in prior literature.\nSimultaneity allows for post-hoc selection of $\\mathcal{M}$. If sets of\ninterest are specified a priori, it is possible to gain power by removing the\nsimultaneity requirement. We present an approach to compute lower confidence\nbounds for the TDP if the set of null hypotheses is defined a priori. The\nproposed method determines the bounds using the exact distribution of the\nnumber of rejections based on a step-up multiple testing procedure under\nindependence assumptions. We assess robustness properties of our procedure and\napply it to real data from the field of functional magnetic resonance imaging."}, "http://arxiv.org/abs/2310.08426": {"title": "Extensions of Heterogeneity in Integration and Prediction (HIP) with R Shiny Application", "link": "http://arxiv.org/abs/2310.08426", "description": "Multiple data views measured on the same set of participants are becoming more\ncommon and have the potential to deepen our understanding of many complex\ndiseases by analyzing these different views simultaneously. Equally important,\nmany of these complex diseases show evidence of subgroup heterogeneity (e.g.,\nby sex or race). HIP (Heterogeneity in Integration and Prediction) is among the\nfirst methods proposed to integrate multiple data views while also accounting\nfor subgroup heterogeneity to identify common and subgroup-specific markers of\na particular disease. However, HIP is applicable to continuous outcomes and\nrequires programming expertise by the user. Here we propose extensions to HIP\nthat accommodate multi-class, Poisson, and Zero-Inflated Poisson outcomes while\nretaining the benefits of HIP. Additionally, we introduce an R Shiny\napplication, accessible on shinyapps.io at\nhttps://multi-viewlearn.shinyapps.io/HIP_ShinyApp/, that provides an interface\nwith the Python implementation of HIP to allow more researchers to use the\nmethod anywhere and on any device. We applied HIP to identify genes and\nproteins common and specific to males and females that are associated with\nexacerbation frequency. Although some of the identified genes and proteins show\nevidence of a relationship with chronic obstructive pulmonary disease (COPD) in\nexisting literature, others may be candidates for future research investigating\ntheir relationship with COPD. We demonstrate the use of the Shiny application\nwith publicly available data. An R-package for HIP will be made available at\nhttps://github.com/lasandrall/HIP."}, "http://arxiv.org/abs/2310.08479": {"title": "Personalised dynamic super learning: an application in predicting hemodiafiltration's convection volumes", "link": "http://arxiv.org/abs/2310.08479", "description": "Obtaining continuously updated predictions is a major challenge for\npersonalised medicine. 
Leveraging combinations of parametric regressions and\nmachine learning approaches, the personalised online super learner (POSL) can\nachieve such dynamic and personalised predictions. We adapt POSL to predict a\nrepeated continuous outcome dynamically and propose a new way to validate such\npersonalised or dynamic prediction models. We illustrate its performance by\npredicting the convection volume of patients undergoing hemodiafiltration. POSL\noutperformed its candidate learners with respect to median absolute error,\ncalibration-in-the-large, discrimination, and net benefit. We finally discuss\nthe choices and challenges underlying the use of POSL."}, "http://arxiv.org/abs/1903.00037": {"title": "Distance-Based Independence Screening for Canonical Analysis", "link": "http://arxiv.org/abs/1903.00037", "description": "This paper introduces a novel method called Distance-Based Independence\nScreening for Canonical Analysis (DISCA) that performs simultaneous dimension\nreduction for a pair of random variables by optimizing the distance covariance\n(dCov). dCov is a statistic first proposed by Sz\\'ekely et al. [2009] for\nindependence testing. Compared with sufficient dimension reduction (SDR) and\ncanonical correlation analysis (CCA)-based approaches, DISCA is a model-free\napproach that does not impose dimensional or distributional restrictions on\nvariables and is more sensitive to nonlinear relationships. Theoretically, we\nestablish a non-asymptotic error bound to provide a guarantee of our method's\nperformance. Numerically, DISCA performs comparably to or better than other\nstate-of-the-art algorithms and is computationally faster. All code for our\nDISCA method can be found on GitHub https://github.com/Yijin911/DISCA.git,\nincluding an R package named DISCA."}, "http://arxiv.org/abs/2105.13952": {"title": "Generalized Permutation Framework for Testing Model Variable Significance", "link": "http://arxiv.org/abs/2105.13952", "description": "A common problem in machine learning is determining if a variable\nsignificantly contributes to a model's prediction performance. This problem is\naggravated for datasets, such as gene expression datasets, that suffer the\nworst case of dimensionality: a low number of observations along with a high\nnumber of possible explanatory variables. In such scenarios, traditional\nmethods for testing variable statistical significance or constructing variable\nconfidence intervals do not apply. To address these problems, we developed a\nnovel permutation framework for testing the significance of variables in\nsupervised models. Our permutation framework has three main advantages. First,\nit is non-parametric and does not rely on distributional assumptions or\nasymptotic results. Second, it not only ranks model variables in terms of\nrelative importance, but also tests for statistical significance of each\nvariable. Third, it can test for the significance of the interaction between\nmodel variables. 
We applied this permutation framework to multi-class\nclassification of the Iris flower dataset and of brain regions in RNA\nexpression data, and using this framework showed variable-level statistical\nsignificance and interactions."}, "http://arxiv.org/abs/2210.02002": {"title": "Factor Augmented Sparse Throughput Deep ReLU Neural Networks for High Dimensional Regression", "link": "http://arxiv.org/abs/2210.02002", "description": "This paper introduces a Factor Augmented Sparse Throughput (FAST) model that\nutilizes both latent factors and sparse idiosyncratic components for\nnonparametric regression. The FAST model bridges factor models on one end and\nsparse nonparametric models on the other end. It encompasses structured\nnonparametric models such as factor augmented additive models and sparse\nlow-dimensional nonparametric interaction models and covers the cases where the\ncovariates do not admit factor structures. Via diversified projections as\nestimation of latent factor space, we employ truncated deep ReLU networks to\nnonparametric factor regression without regularization and to a more general\nFAST model using nonconvex regularization, resulting in factor augmented\nregression using neural network (FAR-NN) and FAST-NN estimators respectively.\nWe show that FAR-NN and FAST-NN estimators adapt to the unknown low-dimensional\nstructure using hierarchical composition models in nonasymptotic minimax rates.\nWe also study statistical learning for the factor augmented sparse additive\nmodel using a more specific neural network architecture. Our results are\napplicable to the weak dependent cases without factor structures. In proving\nthe main technical result for FAST-NN, we establish a new deep ReLU network\napproximation result that contributes to the foundation of neural network\ntheory. Our theory and methods are further supported by simulation studies and\nan application to macroeconomic data."}, "http://arxiv.org/abs/2210.04482": {"title": "Leave-group-out cross-validation for latent Gaussian models", "link": "http://arxiv.org/abs/2210.04482", "description": "Evaluating the predictive performance of a statistical model is commonly done\nusing cross-validation. Although the leave-one-out method is frequently\nemployed, its application is justified primarily for independent and\nidentically distributed observations. However, this method tends to mimic\ninterpolation rather than prediction when dealing with dependent observations.\nThis paper proposes a modified cross-validation for dependent observations.\nThis is achieved by excluding an automatically determined set of observations\nfrom the training set to mimic a more reasonable prediction scenario. Also,\nwithin the framework of latent Gaussian models, we illustrate a method to\nadjust the joint posterior for this modified cross-validation to avoid model\nrefitting. This new approach is accessible in the R-INLA package\n(www.r-inla.org)."}, "http://arxiv.org/abs/2212.01179": {"title": "Feasibility of using survey data and semi-variogram kriging to obtain bespoke indices of neighbourhood characteristics: a simulation and a case study", "link": "http://arxiv.org/abs/2212.01179", "description": "Data on neighbourhood characteristics are not typically collected in\nepidemiological studies. They are however useful in the study of small-area\nhealth inequalities. Neighbourhood characteristics are collected in some\nsurveys and could be linked to the data of other studies. 
We propose to use\nkriging based on semi-variogram models to predict values at non-observed\nlocations with the aim of constructing bespoke indices of neighbourhood\ncharacteristics to be linked to data from epidemiological studies. We perform a\nsimulation study to assess the feasibility of the method as well as a case\nstudy using data from the RECORD study. Apart from having enough observed data\nat small distances to the non-observed locations, a good fitting\nsemi-variogram, a larger range and the absence of nugget effects for the\nsemi-variogram models are factors leading to a higher reliability."}, "http://arxiv.org/abs/2303.17823": {"title": "An interpretable neural network-based non-proportional odds model for ordinal regression", "link": "http://arxiv.org/abs/2303.17823", "description": "This study proposes an interpretable neural network-based non-proportional\nodds model (N$^3$POM) for ordinal regression. N$^3$POM is different from\nconventional approaches to ordinal regression with non-proportional models in\nseveral ways: (1) N$^3$POM is designed to directly handle continuous responses,\nwhereas standard methods typically treat de facto ordered continuous variables\nas discrete, (2) instead of estimating response-dependent finite coefficients\nof linear models from discrete responses as is done in conventional approaches,\nwe train a non-linear neural network to serve as a coefficient function. Thanks\nto the neural network, N$^3$POM offers flexibility while preserving the\ninterpretability of conventional ordinal regression. We establish a sufficient\ncondition under which the predicted conditional cumulative probability locally\nsatisfies the monotonicity constraint over a user-specified region in the\ncovariate space. Additionally, we provide a monotonicity-preserving stochastic\n(MPS) algorithm for effectively training the neural network. We apply N$^3$POM\nto several real-world datasets."}, "http://arxiv.org/abs/2306.16335": {"title": "Emulating the dynamics of complex systems using autoregressive models on manifolds (mNARX)", "link": "http://arxiv.org/abs/2306.16335", "description": "We propose a novel surrogate modelling approach to efficiently and accurately\napproximate the response of complex dynamical systems driven by time-varying\nexogenous excitations over extended time periods. Our approach, namely manifold\nnonlinear autoregressive modelling with exogenous input (mNARX), involves\nconstructing a problem-specific exogenous input manifold that is optimal for\nconstructing autoregressive surrogates. The manifold, which forms the core of\nmNARX, is constructed incrementally by incorporating the physics of the system,\nas well as prior expert- and domain- knowledge. Because mNARX decomposes the\nfull problem into a series of smaller sub-problems, each with a lower\ncomplexity than the original, it scales well with the complexity of the\nproblem, both in terms of training and evaluation costs of the final surrogate.\nFurthermore, mNARX synergizes well with traditional dimensionality reduction\ntechniques, making it highly suitable for modelling dynamical systems with\nhigh-dimensional exogenous inputs, a class of problems that is typically\nchallenging to solve. Since domain knowledge is particularly abundant in\nphysical systems, such as those found in civil and mechanical engineering,\nmNARX is well suited for these applications. 
We demonstrate that mNARX\noutperforms traditional autoregressive surrogates in predicting the response of\na classical coupled spring-mass system excited by a one-dimensional random\nexcitation. Additionally, we show that mNARX is well suited for emulating very\nhigh-dimensional time- and state-dependent systems, even when affected by\nactive controllers, by surrogating the dynamics of a realistic\naero-servo-elastic onshore wind turbine simulator. In general, our results\ndemonstrate that mNARX offers promising prospects for modelling complex\ndynamical systems, in terms of accuracy and efficiency."}, "http://arxiv.org/abs/2307.02236": {"title": "D-optimal Subsampling Design for Massive Data Linear Regression", "link": "http://arxiv.org/abs/2307.02236", "description": "Data reduction is a fundamental challenge of modern technology, where\nclassical statistical methods are not applicable because of computational\nlimitations. We consider linear regression for an extraordinarily large number\nof observations, but only a few covariates. Subsampling aims at the selection\nof a given percentage of the existing original data. Under distributional\nassumptions on the covariates, we derive D-optimal subsampling designs and\nstudy their theoretical properties. We make use of fundamental concepts of\noptimal design theory and an equivalence theorem from constrained convex\noptimization. The thus obtained subsampling designs provide simple rules for\nwhether to accept or reject a data point, allowing for an easy algorithmic\nimplementation. In addition, we propose a simplified subsampling method that\ndiffers from the D-optimal design but requires lower computing time. We present\na simulation study, comparing both subsampling schemes with the IBOSS method."}, "http://arxiv.org/abs/2310.08726": {"title": "Design-Based RCT Estimators and Central Limit Theorems for Baseline Subgroup and Related Analyses", "link": "http://arxiv.org/abs/2310.08726", "description": "There is a growing literature on design-based methods to estimate average\ntreatment effects (ATEs) for randomized controlled trials (RCTs) for full\nsample analyses. This article extends these methods to estimate ATEs for\ndiscrete subgroups defined by pre-treatment variables, with an application to\nan RCT testing subgroup effects for a school voucher experiment in New York\nCity. We consider ratio estimators for subgroup effects using regression\nmethods, allowing for model covariates to improve precision, and prove a finite\npopulation central limit theorem. We discuss extensions to blocked and\nclustered RCT designs, and to other common estimators with random\ntreatment-control sample sizes (or weights): post-stratification estimators,\nweighted estimators that adjust for data nonresponse, and estimators for\nBernoulli trials. We also develop simple variance estimators that share\nfeatures with robust estimators. Simulations show that the design-based\nsubgroup estimators yield confidence interval coverage near nominal levels,\neven for small subgroups."}, "http://arxiv.org/abs/2310.08798": {"title": "Alteration Detection of Tensor Dependence Structure via Sparsity-Exploited Reranking Algorithm", "link": "http://arxiv.org/abs/2310.08798", "description": "Tensor-valued data arise frequently from a wide variety of scientific\napplications, and many among them can be translated into an alteration\ndetection problem of tensor dependence structures. 
In this article, we\nformulate the problem under the popularly adopted tensor-normal distributions\nand aim at two-sample correlation/partial correlation comparisons of\ntensor-valued observations. Through decorrelation and centralization, a\nseparable covariance structure is employed to pool sample information from\ndifferent tensor modes to enhance the power of the test. Additionally, we\npropose a novel Sparsity-Exploited Reranking Algorithm (SERA) to further\nimprove the multiple testing efficiency. The algorithm is approached through\nreranking of the p-values derived from the primary test statistics, by\nincorporating a carefully constructed auxiliary tensor sequence. Besides the\ntensor framework, SERA is also generally applicable to a wide range of\ntwo-sample large-scale inference problems with sparsity structures, and is of\nindependent interest. The asymptotic properties of the proposed test are\nderived and the algorithm is shown to control the false discovery rate at the\npre-specified level. We demonstrate the efficacy of the proposed method through\nintensive simulations and two scientific applications."}, "http://arxiv.org/abs/2310.08812": {"title": "A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM model", "link": "http://arxiv.org/abs/2310.08812", "description": "Time series forecasting represents a significant and challenging task across\nvarious fields. Recently, methods based on mode decomposition have dominated\nthe forecasting of complex time series because of the advantages of capturing\nlocal characteristics and extracting intrinsic modes from data. Unfortunately,\nmost models fail to capture the implied volatilities that contain significant\ninformation. To enhance the forecasting of current, rapidly evolving, and\nvolatile time series, we propose a novel decomposition-ensemble paradigm, the\nVMD-LSTM-GARCH model. The Variational Mode Decomposition algorithm is employed\nto decompose the time series into K sub-modes. Subsequently, the GARCH model\nextracts the volatility information from these sub-modes, which serve as the\ninput for the LSTM. The numerical and volatility information of each sub-mode\nis utilized to train a Long Short-Term Memory network. This network predicts\nthe sub-mode, and then we aggregate the predictions from all sub-modes to\nproduce the output. By integrating econometric and artificial intelligence\nmethods, and taking into account both the numerical and volatility information\nof the time series, our proposed model demonstrates superior performance in\ntime series forecasting, as evidenced by the significant decrease in MSE, RMSE,\nand MAPE in our comparative experimental results."}, "http://arxiv.org/abs/2310.08867": {"title": "A Survey of Methods for Handling Disk Data Imbalance", "link": "http://arxiv.org/abs/2310.08867", "description": "Class imbalance exists in many classification problems, and since the data is\ndesigned for accuracy, imbalance in data classes can lead to classification\nchallenges with a few classes having higher misclassification costs. The\nBackblaze dataset, a widely used dataset related to hard disks, has a small\namount of failure data and a large amount of health data, which exhibits a\nserious class imbalance. This paper provides a comprehensive overview of\nresearch in the field of imbalanced data classification. The discussion is\norganized into three main aspects: data-level methods, algorithmic-level\nmethods, and hybrid methods. 
For each type of method, we summarize and analyze\nthe existing problems, algorithmic ideas, strengths, and weaknesses.\nAdditionally, the challenges of unbalanced data classification are discussed,\nalong with strategies to address them. It is convenient for researchers to\nchoose the appropriate method according to their needs."}, "http://arxiv.org/abs/2310.08939": {"title": "Fast Screening Rules for Optimal Design via Quadratic Lasso Reformulation", "link": "http://arxiv.org/abs/2310.08939", "description": "The problems of Lasso regression and optimal design of experiments share a\ncritical property: their optimal solutions are typically \\emph{sparse}, i.e.,\nonly a small fraction of the optimal variables are non-zero. Therefore, the\nidentification of the support of an optimal solution reduces the dimensionality\nof the problem and can yield a substantial simplification of the calculations.\nIt has recently been shown that linear regression with a \\emph{squared}\n$\\ell_1$-norm sparsity-inducing penalty is equivalent to an optimal\nexperimental design problem. In this work, we use this equivalence to derive\nsafe screening rules that can be used to discard inessential samples. Compared\nto previously existing rules, the new tests are much faster to compute,\nespecially for problems involving a parameter space of high dimension, and can\nbe used dynamically within any iterative solver, with negligible computational\noverhead. Moreover, we show how an existing homotopy algorithm to compute the\nregularization path of the lasso method can be reparametrized with respect to\nthe squared $\\ell_1$-penalty. This allows the computation of a Bayes\n$c$-optimal design in a finite number of steps and can be several orders of\nmagnitude faster than standard first-order algorithms. The efficiency of the\nnew screening rules and of the homotopy algorithm are demonstrated on different\nexamples based on real data."}, "http://arxiv.org/abs/2310.09100": {"title": "Time-Uniform Self-Normalized Concentration for Vector-Valued Processes", "link": "http://arxiv.org/abs/2310.09100", "description": "Self-normalized processes arise naturally in many statistical tasks. While\nself-normalized concentration has been extensively studied for scalar-valued\nprocesses, there is less work on multidimensional processes outside of the\nsub-Gaussian setting. In this work, we construct a general, self-normalized\ninequality for $\\mathbb{R}^d$-valued processes that satisfy a simple yet broad\n\"sub-$\\psi$\" tail condition, which generalizes assumptions based on cumulant\ngenerating functions. From this general inequality, we derive an upper law of\nthe iterated logarithm for sub-$\\psi$ vector-valued processes, which is tight\nup to small constants. We demonstrate applications in prototypical statistical\ntasks, such as parameter estimation in online linear regression and\nauto-regressive modeling, and bounded mean estimation via a new (multivariate)\nempirical Bernstein concentration inequality."}, "http://arxiv.org/abs/2310.09185": {"title": "Mediation Analysis using Semi-parametric Shape-Restricted Regression with Applications", "link": "http://arxiv.org/abs/2310.09185", "description": "Often linear regression is used to perform mediation analysis. However, in\nmany instances, the underlying relationships may not be linear, as in the case\nof placental-fetal hormones and fetal development. 
Although the exact\nfunctional form of the relationship may be unknown, one may hypothesize the\ngeneral shape of the relationship. For these reasons, we develop a novel\nshape-restricted inference-based methodology for conducting mediation analysis.\nThis work is motivated by an application in fetal endocrinology where\nresearchers are interested in understanding the effects of pesticide\napplication on birth weight, with human chorionic gonadotropin (hCG) as the\nmediator. We assume a practically plausible set of nonlinear effects of hCG on\nthe birth weight and a linear relationship between pesticide exposure and hCG,\nwith both exposure-outcome and exposure-mediator models being linear in the\nconfounding factors. Using the proposed methodology on population-level\nprenatal screening program data, with hCG as the mediator, we discovered that,\nwhile the natural direct effects suggest a positive association between\npesticide application and birth weight, the natural indirect effects were\nnegative."}, "http://arxiv.org/abs/2310.09214": {"title": "An Introduction to the Calibration of Computer Models", "link": "http://arxiv.org/abs/2310.09214", "description": "In the context of computer models, calibration is the process of estimating\nunknown simulator parameters from observational data. Calibration is variously\nreferred to as model fitting, parameter estimation/inference, an inverse\nproblem, and model tuning. The need for calibration occurs in most areas of\nscience and engineering, and has been used to estimate hard to measure\nparameters in climate, cardiology, drug therapy response, hydrology, and many\nother disciplines. Although the statistical method used for calibration can\nvary substantially, the underlying approach is essentially the same and can be\nconsidered abstractly. In this survey, we review the decisions that need to be\ntaken when calibrating a model, and discuss a range of computational methods\nthat can be used to compute Bayesian posterior distributions."}, "http://arxiv.org/abs/2310.09239": {"title": "Estimating weighted quantile treatment effects with missing outcome data by double sampling", "link": "http://arxiv.org/abs/2310.09239", "description": "Causal weighted quantile treatment effects (WQTE) are a useful complement to\nstandard causal contrasts that focus on the mean when interest lies at the\ntails of the counterfactual distribution. To date, however, methods for\nestimation and inference regarding causal WQTEs have assumed complete data on\nall relevant factors. Missing or incomplete data, however, is a widespread\nchallenge in practical settings, particularly when the data are not collected\nfor research purposes such as electronic health records and disease registries.\nFurthermore, such settings may be particularly susceptible to the outcome\ndata being missing-not-at-random (MNAR). In this paper, we consider the use of\ndouble-sampling, through which the otherwise missing data is ascertained on a\nsub-sample of study units, as a strategy to mitigate bias due to MNAR data in\nthe estimation of causal WQTEs. With the additional data in-hand, we present\nidentifying conditions that do not require assumptions regarding missingness in\nthe original data. We then propose a novel inverse-probability weighted\nestimator and derive its asymptotic properties, both pointwise at specific\nquantiles and uniform across a range of quantiles in (0,1), when the propensity\nscore and double-sampling probabilities are estimated. 
For practical inference,\nwe develop a bootstrap method that can be used for both pointwise and uniform\ninference. A simulation study is conducted to examine the finite sample\nperformance of the proposed estimators."}, "http://arxiv.org/abs/2310.09257": {"title": "A SIMPLE Approach to Provably Reconstruct Ising Model with Global Optimality", "link": "http://arxiv.org/abs/2310.09257", "description": "Reconstruction of the interaction network between random events is a critical\nproblem arising in fields from statistical physics and politics to sociology, biology,\nand psychology, and beyond. The Ising model lays the foundation for this\nreconstruction process, but finding the underlying Ising model from the fewest\nobserved samples in a computationally efficient manner has been\nhistorically challenging for half a century. By using the idea of sparsity\nlearning, we present an approach named SIMPLE that has a dominant sample\ncomplexity relative to the theoretical limit. Furthermore, a tuning-free algorithm is\ndeveloped to give a statistically consistent solution of SIMPLE in polynomial\ntime with high probability. On extensive benchmarked cases, the SIMPLE approach\nprovably reconstructs underlying Ising models with global optimality. The\napplication to U.S. senators' voting in the last six congresses reveals that\nboth the Republicans and Democrats noticeably assemble in each congress;\ninterestingly, the assembling of Democrats is particularly pronounced in the\nlatest congress."}, "http://arxiv.org/abs/2208.09817": {"title": "High-Dimensional Composite Quantile Regression: Optimal Statistical Guarantees and Fast Algorithms", "link": "http://arxiv.org/abs/2208.09817", "description": "The composite quantile regression (CQR) was introduced by Zou and Yuan [Ann.\nStatist. 36 (2008) 1108--1126] as a robust regression method for linear models\nwith heavy-tailed errors while achieving high efficiency. Its penalized\ncounterpart for high-dimensional sparse models was recently studied in Gu and\nZou [IEEE Trans. Inf. Theory 66 (2020) 7132--7154], along with a specialized\noptimization algorithm based on the alternating direction method of multipliers\n(ADMM). Compared to the various first-order algorithms for penalized least\nsquares, ADMM-based algorithms are not well-adapted to large-scale problems. To\novercome this computational hardness, in this paper we apply a\nconvolution smoothing technique to CQR, complemented with iteratively reweighted\n$\\ell_1$-regularization. The smoothed composite loss function is convex, twice\ncontinuously differentiable, and locally strongly convex with high probability.\nWe propose a gradient-based algorithm for penalized smoothed CQR via a variant\nof the majorize-minimization principle, which gains substantial computational\nefficiency over ADMM. Theoretically, we show that the iteratively reweighted\n$\\ell_1$-penalized smoothed CQR estimator achieves a near-minimax optimal\nconvergence rate under heavy-tailed errors without any moment constraint, and\nfurther achieves a near-oracle convergence rate under a weaker minimum signal\nstrength condition than needed in Gu and Zou (2020). 
Numerical studies\ndemonstrate that the proposed method exhibits significant computational\nadvantages without compromising statistical performance compared to two\nstate-of-the-art methods that achieve robustness and high efficiency\nsimultaneously."}, "http://arxiv.org/abs/2210.14292": {"title": "Statistical Inference for H\\\"usler-Reiss Graphical Models Through Matrix Completions", "link": "http://arxiv.org/abs/2210.14292", "description": "The severity of multivariate extreme events is driven by the dependence\nbetween the largest marginal observations. The H\\\"usler-Reiss distribution is a\nversatile model for this extremal dependence, and it is usually parameterized\nby a variogram matrix. In order to represent conditional independence relations\nand obtain sparse parameterizations, we introduce the novel H\\\"usler-Reiss\nprecision matrix. Similarly to the Gaussian case, this matrix appears naturally\nin density representations of the H\\\"usler-Reiss Pareto distribution and\nencodes the extremal graphical structure through its zero pattern. For a given,\narbitrary graph we prove the existence and uniqueness of the completion of a\npartially specified H\\\"usler-Reiss variogram matrix so that its precision\nmatrix has zeros on non-edges in the graph. Using suitable estimators for the\nparameters on the edges, our theory provides the first consistent estimator of\ngraph structured H\\\"usler-Reiss distributions. If the graph is unknown, our\nmethod can be combined with recent structure learning algorithms to jointly\ninfer the graph and the corresponding parameter matrix. Based on our\nmethodology, we propose new tools for statistical inference of sparse\nH\\\"usler-Reiss models and illustrate them on large flight delay data in the\nU.S., as well as Danube river flow data."}, "http://arxiv.org/abs/2302.02288": {"title": "Efficient Adaptive Joint Significance Tests and Sobel-Type Confidence Intervals for Mediation Effects", "link": "http://arxiv.org/abs/2302.02288", "description": "Mediation analysis is an important statistical tool in many research fields.\nIts aim is to investigate the mechanism along the causal pathway between an\nexposure and an outcome. The joint significance test is widely utilized as a\nprominent statistical approach for examining mediation effects in practical\napplications. Nevertheless, the limitation of this mediation testing method\nstems from its conservative Type I error, which reduces its statistical power\nand imposes certain constraints on its popularity and utility. The proposed\nsolution to address this gap is the adaptive joint significance test for one\nmediator, a novel data-adaptive test for mediation effect that exhibits\nsignificant advancements compared to traditional joint significance test. The\nproposed method is designed to be user-friendly, eliminating the need for\ncomplicated procedures. We have derived explicit expressions for size and\npower, ensuring the theoretical validity of our approach. Furthermore, we\nextend the proposed adaptive joint significance tests for small-scale mediation\nhypotheses with family-wise error rate (FWER) control. Additionally, a novel\nadaptive Sobel-type approach is proposed for the estimation of confidence\nintervals for the mediation effects, demonstrating significant advancements\nover conventional Sobel's confidence intervals in terms of achieving desirable\ncoverage probabilities. 
Our mediation testing and confidence intervals\nprocedure is evaluated through comprehensive simulations, and compared with\nnumerous existing approaches. Finally, we illustrate the usefulness of our\nmethod by analysing three real-world datasets with continuous, binary and\ntime-to-event outcomes, respectively."}, "http://arxiv.org/abs/2305.15742": {"title": "Counterfactual Generative Models for Time-Varying Treatments", "link": "http://arxiv.org/abs/2305.15742", "description": "Estimating the counterfactual outcome of treatment is essential for\ndecision-making in public health and clinical science, among others. Often,\ntreatments are administered in a sequential, time-varying manner, leading to an\nexponentially increased number of possible counterfactual outcomes.\nFurthermore, in modern applications, the outcomes are high-dimensional and\nconventional average treatment effect estimation fails to capture disparities\nin individuals. To tackle these challenges, we propose a novel conditional\ngenerative framework capable of producing counterfactual samples under\ntime-varying treatment, without the need for explicit density estimation. Our\nmethod carefully addresses the distribution mismatch between the observed and\ncounterfactual distributions via a loss function based on inverse probability\nweighting. We present a thorough evaluation of our method using both synthetic\nand real-world data. Our results demonstrate that our method is capable of\ngenerating high-quality counterfactual samples and outperforms the\nstate-of-the-art baselines."}, "http://arxiv.org/abs/2309.09115": {"title": "Fully Synthetic Data for Complex Surveys", "link": "http://arxiv.org/abs/2309.09115", "description": "When seeking to release public use files for confidential data, statistical\nagencies can generate fully synthetic data. We propose an approach for making\nfully synthetic data from surveys collected with complex sampling designs.\nSpecifically, we generate pseudo-populations by applying the weighted finite\npopulation Bayesian bootstrap to account for survey weights, take simple random\nsamples from those pseudo-populations, estimate synthesis models using these\nsimple random samples, and release simulated data drawn from the models as the\npublic use files. We use the framework of multiple imputation to enable\nvariance estimation using two data generation strategies. In the first, we\ngenerate multiple data sets from each simple random sample, whereas in the\nsecond, we generate a single synthetic data set from each simple random sample.\nWe present multiple imputation combining rules for each setting. We illustrate\neach approach and the repeated sampling properties of the combining rules using\nsimulation studies."}, "http://arxiv.org/abs/2309.09323": {"title": "Answering Layer 3 queries with DiscoSCMs", "link": "http://arxiv.org/abs/2309.09323", "description": "Addressing causal queries across the Pearl Causal Hierarchy (PCH) (i.e.,\nassociational, interventional and counterfactual), which is formalized as\n\\Layer{} Valuations, is a central task in contemporary causal inference\nresearch. Counterfactual questions, in particular, pose a significant challenge\nas they often necessitate a complete knowledge of structural equations. This\npaper identifies \\textbf{the degeneracy problem} caused by the consistency\nrule. 
To tackle this, the \textit{Distribution-consistency Structural Causal\nModels} (DiscoSCMs) are introduced, which extend both the structural causal\nmodels (SCM) and the potential outcome framework. The correlation pattern of\npotential outcomes in personalized incentive scenarios, described by $P(y_x,\ny'_{x'})$, is used as a case study for elucidation. Although counterfactuals\nare no longer degenerate, they remain indeterminable. As a result, the\ncondition of independent potential noise is incorporated into DiscoSCM. It is\nfound that by adeptly using homogeneity, counterfactuals can be identified.\nFurthermore, more refined results are achieved in the unit problem scenario. In\nsimpler terms, when modeling counterfactuals, one should contemplate: \"Consider\na person with average ability who takes a test and, due to good luck, achieves\nan exceptionally high score. If this person were to retake the test under\nidentical external conditions, what score would he obtain? An exceptionally high\nscore or an average score?\" If your choice is to predict an average score, then\nyou are essentially choosing DiscoSCM over the traditional frameworks based on\nthe consistency rule."}, "http://arxiv.org/abs/2310.01748": {"title": "A generative approach to frame-level multi-competitor races", "link": "http://arxiv.org/abs/2310.01748", "description": "Multi-competitor races often feature complicated within-race strategies that\nare difficult to capture when training on race outcome level data.\nFurther, models which do not account for such strategic effects may suffer from\nconfounded inferences and predictions. In this work we develop a general\ngenerative model for multi-competitor races which allows analysts to explicitly\nmodel certain strategic effects such as changing lanes or drafting and separate\nthese impacts from competitor ability. The generative model allows one to\nsimulate full races from any real or created starting position which opens new\navenues for attributing value to within-race actions and to perform\ncounter-factual analyses. This methodology is sufficiently general to apply to\nany track-based multi-competitor races where both tracking data is available\nand competitor movement is well described by simultaneous forward and lateral\nmovements. We apply this methodology to one-mile horse races using data\nprovided by the New York Racing Association (NYRA) and the New York\nThoroughbred Horsemen's Association (NYTHA) for the Big Data Derby 2022 Kaggle\nCompetition. This data features granular tracking data for all horses at the\nframe-level (occurring at approximately 4 Hz). We demonstrate how this model can\nyield new inferences, such as the estimation of horse-specific speed profiles\nwhich vary over phases of the race, and examples of posterior predictive\ncounterfactual simulations to answer questions of interest such as starting\nlane impacts on race outcomes."}, "http://arxiv.org/abs/2310.09345": {"title": "A Unified Bayesian Framework for Modeling Measurement Error in Multinomial Data", "link": "http://arxiv.org/abs/2310.09345", "description": "Measurement error in multinomial data is a well-known and well-studied\ninferential problem that is encountered in many fields, including engineering,\nbiomedical and omics research, ecology, finance, and social sciences.\nSurprisingly, methods developed to accommodate measurement error in multinomial\ndata are typically equipped to handle false negatives or false positives, but\nnot both. 
We provide a unified framework for accommodating both forms of\nmeasurement error using a Bayesian hierarchical approach. We demonstrate the\nproposed method's performance on simulated data and apply it to acoustic bat\nmonitoring data."}, "http://arxiv.org/abs/2310.09384": {"title": "Modeling Missing at Random Neuropsychological Test Scores Using a Mixture of Binomial Product Experts", "link": "http://arxiv.org/abs/2310.09384", "description": "Multivariate bounded discrete data arises in many fields. In the setting of\nlongitudinal dementia studies, such data is collected when individuals complete\nneuropsychological tests. We outline a modeling and inference procedure that\ncan model the joint distribution conditional on baseline covariates, leveraging\nprevious work on mixtures of experts and latent class models. Furthermore, we\nillustrate how the work can be extended when the outcome data is missing at\nrandom using a nested EM algorithm. The proposed model can incorporate\ncovariate information, perform imputation and clustering, and infer latent\ntrajectories. We apply our model on simulated data and an Alzheimer's disease\ndata set."}, "http://arxiv.org/abs/2310.09428": {"title": "Sparse higher order partial least squares for simultaneous variable selection, dimension reduction, and tensor denoising", "link": "http://arxiv.org/abs/2310.09428", "description": "Partial Least Squares (PLS) regression emerged as an alternative to ordinary\nleast squares for addressing multicollinearity in a wide range of scientific\napplications. As multidimensional tensor data is becoming more widespread,\ntensor adaptations of PLS have been developed. Our investigations reveal that\nthe previously established asymptotic result of the PLS estimator for a tensor\nresponse breaks down as the tensor dimensions and the number of features\nincrease relative to the sample size. To address this, we propose Sparse Higher\nOrder Partial Least Squares (SHOPS) regression and an accompanying algorithm.\nSHOPS simultaneously accommodates variable selection, dimension reduction, and\ntensor association denoising. We establish the asymptotic accuracy of the SHOPS\nalgorithm under a high-dimensional regime and verify these results through\ncomprehensive simulation experiments, and applications to two contemporary\nhigh-dimensional biological data analysis."}, "http://arxiv.org/abs/2310.09493": {"title": "Summary Statistics Knockoffs Inference with Family-wise Error Rate Control", "link": "http://arxiv.org/abs/2310.09493", "description": "Testing multiple hypotheses of conditional independence with provable error\nrate control is a fundamental problem with various applications. To infer\nconditional independence with family-wise error rate (FWER) control when only\nsummary statistics of marginal dependence are accessible, we adopt\nGhostKnockoff to directly generate knockoff copies of summary statistics and\npropose a new filter to select features conditionally dependent to the response\nwith provable FWER control. In addition, we develop a computationally efficient\nalgorithm to greatly reduce the computational cost of knockoff copies\ngeneration without sacrificing power and FWER control. 
Experiments on simulated\ndata and a real dataset of Alzheimer's disease genetics demonstrate the\nadvantage of the proposed method over existing alternatives in both statistical\npower and computational efficiency."}, "http://arxiv.org/abs/2310.09646": {"title": "Jackknife empirical likelihood confidence intervals for the categorical Gini correlation", "link": "http://arxiv.org/abs/2310.09646", "description": "The categorical Gini correlation, $\\rho_g$, was proposed by Dang et al. to\nmeasure the dependence between a categorical variable, $Y$, and a numerical\nvariable, $X$. It has been shown that $\\rho_g$ has more appealing properties\nthan existing dependence measurements. In this paper, we develop the\njackknife empirical likelihood (JEL) method for $\\rho_g$. Confidence intervals\nfor the Gini correlation are constructed without estimating the asymptotic\nvariance. Adjusted and weighted JEL are explored to improve the performance of\nthe standard JEL. Simulation studies show that our methods are competitive with\nexisting methods in terms of coverage accuracy and shortness of confidence\nintervals. The proposed methods are illustrated in an application to two real\ndatasets."}, "http://arxiv.org/abs/2310.09673": {"title": "Robust Quickest Change Detection in Non-Stationary Processes", "link": "http://arxiv.org/abs/2310.09673", "description": "Optimal algorithms are developed for robust detection of changes in\nnon-stationary processes. These are processes in which the distribution of the\ndata after change varies with time. The decision-maker does not have access to\nprecise information on the post-change distribution. It is shown that if the\npost-change non-stationary family has a distribution that is least favorable in\na well-defined sense, then the algorithms designed using the least favorable\ndistributions are robust and optimal. Non-stationary processes are encountered\nin public health monitoring and space and military applications. The robust\nalgorithms are applied to real and simulated data to show their effectiveness."}, "http://arxiv.org/abs/2310.09701": {"title": "A powerful empirical Bayes approach for high dimensional replicability analysis", "link": "http://arxiv.org/abs/2310.09701", "description": "Identifying replicable signals across different studies provides stronger\nscientific evidence and more powerful inference. Existing literature on high\ndimensional replicability analysis either imposes strong modeling assumptions\nor has low power. We develop a powerful and robust empirical Bayes approach for\nhigh dimensional replicability analysis. Our method effectively borrows\ninformation from different features and studies while accounting for\nheterogeneity. We show that the proposed method has better power than competing\nmethods while controlling the false discovery rate, both empirically and\ntheoretically. Analyzing datasets from genome-wide association studies\nreveals new biological insights that otherwise cannot be obtained by using\nexisting methods."}, "http://arxiv.org/abs/2310.09702": {"title": "Inference with Mondrian Random Forests", "link": "http://arxiv.org/abs/2310.09702", "description": "Random forests are popular methods for classification and regression, and\nmany different variants have been proposed in recent years. One interesting\nexample is the Mondrian random forest, in which the underlying trees are\nconstructed according to a Mondrian process. 
In this paper we give a central\nlimit theorem for the estimates made by a Mondrian random forest in the\nregression setting. When combined with a bias characterization and a consistent\nvariance estimator, this allows one to perform asymptotically valid statistical\ninference, such as constructing confidence intervals, on the unknown regression\nfunction. We also provide a debiasing procedure for Mondrian random forests\nwhich allows them to achieve minimax-optimal estimation rates with\n$\\beta$-H\\\"older regression functions, for all $\\beta$ and in arbitrary\ndimension, assuming appropriate parameter tuning."}, "http://arxiv.org/abs/2310.09818": {"title": "MCMC for Bayesian nonparametric mixture modeling under differential privacy", "link": "http://arxiv.org/abs/2310.09818", "description": "Estimating the probability density of a population while preserving the\nprivacy of individuals in that population is an important and challenging\nproblem that has received considerable attention in recent years. While the\nprevious literature focused on frequentist approaches, in this paper, we\npropose a Bayesian nonparametric mixture model under differential privacy (DP)\nand present two Markov chain Monte Carlo (MCMC) algorithms for posterior\ninference. One is a marginal approach, resembling Neal's algorithm 5 with a\npseudo-marginal Metropolis-Hastings move, and the other is a conditional\napproach. Although our focus is primarily on local DP, we show that our MCMC\nalgorithms can be easily extended to deal with global differential privacy\nmechanisms. Moreover, for certain classes of mechanisms and mixture kernels, we\nshow how standard algorithms can be employed, resulting in substantial\nefficiency gains. Our approach is general and applicable to any mixture model\nand privacy mechanism. In several simulations and a real case study, we discuss\nthe performance of our algorithms and evaluate different privacy mechanisms\nproposed in the frequentist literature."}, "http://arxiv.org/abs/2310.09955": {"title": "On the Statistical Foundations of H-likelihood for Unobserved Random Variables", "link": "http://arxiv.org/abs/2310.09955", "description": "The maximum likelihood estimation is widely used for statistical inferences.\nIn this study, we reformulate the h-likelihood proposed by Lee and Nelder in\n1996, whose maximization yields maximum likelihood estimators for fixed\nparameters and asymptotically best unbiased predictors for random parameters.\nWe establish the statistical foundations for h-likelihood theories, which\nextend classical likelihood theories to embrace broad classes of statistical\nmodels with random parameters. The maximum h-likelihood estimators\nasymptotically achieve the generalized Cramer-Rao lower bound. Furthermore, we\nexplore asymptotic theory when the consistency of either fixed parameter\nestimation or random parameter prediction is violated. 
The introduction of this\nnew h-likelihood framework enables likelihood theories to cover inferences for\na much broader class of models, while also providing computationally efficient\nfitting algorithms to give asymptotically optimal estimators for fixed\nparameters and predictors for random parameters."}, "http://arxiv.org/abs/2310.09960": {"title": "Point Mass in the Confidence Distribution: Is it a Drawback or an Advantage?", "link": "http://arxiv.org/abs/2310.09960", "description": "Stein's (1959) problem highlights the phenomenon called probability\ndilution in high dimensional cases, which is known as a fundamental deficiency\nin probabilistic inference. The satellite conjunction problem also suffers from\nprobability dilution, in that poor-quality data can lead to a dilution of collision\nprobability. Though various methods have been proposed, such as the generalized\nfiducial distribution and the reference posterior, they could not maintain the\ncoverage probability of confidence intervals (CIs) in both problems. On the\nother hand, the confidence distribution (CD) has a point mass at zero, which\nhas been interpreted as paradoxical. However, we show that this point mass is an\nadvantage rather than a drawback, because it gives a way to maintain the\ncoverage probability of CIs. More recently, the `false confidence theorem' was\npresented as another deficiency in probabilistic inferences, called false\nconfidence. It was further claimed that the use of consonant belief can\nmitigate this deficiency. However, we show that the false confidence theorem\ncannot be applied to the CD in both Stein's and satellite conjunction problems.\nIt is crucial that a confidence feature, not a consonant one, is the key to\novercoming the deficiencies in probabilistic inferences. Our findings reveal that\nthe CD outperforms the other existing methods, including the consonant belief,\nin the context of Stein's and satellite conjunction problems. Additionally, we\ndemonstrate the ambiguity of coverage probability in an observed CI from the\nfrequentist CI procedure, and show that the CD provides valuable information\nregarding this ambiguity."}, "http://arxiv.org/abs/2310.09961": {"title": "Theoretical Evaluation of Asymmetric Shapley Values for Root-Cause Analysis", "link": "http://arxiv.org/abs/2310.09961", "description": "In this work, we examine Asymmetric Shapley Values (ASV), a variant of the\npopular SHAP additive local explanation method. ASV proposes a way to improve\nmodel explanations by incorporating known causal relations between variables, and\nis also considered as a way to test for unfair discrimination in model\npredictions. Unexplored in previous literature, relaxing symmetry in Shapley\nvalues can have counter-intuitive consequences for model explanation. To better\nunderstand the method, we first show how local contributions correspond to\nglobal contributions of variance reduction. Using variance, we demonstrate\nmultiple cases where ASV yields counter-intuitive attributions, arguably\nproducing incorrect results for root-cause analysis. Second, we identify\ngeneralized additive models (GAM) as a restricted class for which ASV exhibits\ndesirable properties. We support our arguments by proving multiple theoretical\nresults about the method. 
Finally, we demonstrate the use of asymmetric\nattributions on multiple real-world datasets, comparing the results with and\nwithout restricted model families using gradient boosting and deep learning\nmodels."}, "http://arxiv.org/abs/2310.10003": {"title": "Conformal Contextual Robust Optimization", "link": "http://arxiv.org/abs/2310.10003", "description": "Data-driven approaches to predict-then-optimize decision-making problems seek\nto mitigate the risk of uncertainty region misspecification in safety-critical\nsettings. Current approaches, however, suffer from considering overly\nconservative uncertainty regions, often resulting in suboptimal decision-making.\nTo this end, we propose Conformal-Predict-Then-Optimize (CPO), a framework for\nleveraging highly informative, nonconvex conformal prediction regions over\nhigh-dimensional spaces based on conditional generative models, which have the\ndesired distribution-free coverage guarantees. Despite guaranteeing robustness,\nsuch black-box optimization procedures alone inspire little confidence owing to\nthe lack of explanation of why a particular decision was found to be optimal.\nWe, therefore, augment CPO to additionally provide semantically meaningful\nvisual summaries of the uncertainty regions to give qualitative intuition for\nthe optimal decision. We highlight the CPO framework by demonstrating results\non a suite of simulation-based inference benchmark tasks and a vehicle routing\ntask based on probabilistic weather prediction."}, "http://arxiv.org/abs/2310.10048": {"title": "Evaluation of transplant benefits with the U", "link": "http://arxiv.org/abs/2310.10048", "description": "Kidney transplantation is the most effective renal replacement therapy for\nend stage renal disease patients. With the severe shortage of kidney supplies\nand for the clinical effectiveness of transplantation, a patient's life\nexpectancy post transplantation is used to prioritize patients for\ntransplantation; however, severe comorbidity conditions and old age are the\nmost dominant factors that negatively impact post-transplantation life\nexpectancy, effectively precluding sick or old patients from receiving\ntransplants. It would be crucial to design objective measures to quantify the\ntransplantation benefit by comparing the mean residual life with and without a\ntransplant, after adjusting for comorbidity and demographic conditions. To\naddress this urgent need, we propose a new class of semiparametric\ncovariate-dependent mean residual life models. Our method estimates covariate\neffects semiparametrically efficiently and the mean residual life function\nnonparametrically, enabling us to predict the residual life increment potential\nfor any given patient. Our method potentially leads to a fairer system that\nprioritizes patients who would have the largest residual life gains. Our\nanalysis of the kidney transplant data from the U.S. Scientific Registry of\nTransplant Recipients also suggests that a single index of covariates summarizes\nwell the impacts of multiple covariates, which may facilitate interpretations\nof each covariate's effect. 
Our subgroup analysis further disclosed\ninequalities in survival gains across groups defined by race, gender and\ninsurance type (reflecting socioeconomic status)."}, "http://arxiv.org/abs/2310.10052": {"title": "Group-Orthogonal Subsampling for Hierarchical Data Based on Linear Mixed Models", "link": "http://arxiv.org/abs/2310.10052", "description": "Hierarchical data analysis is crucial in various fields for making\ndiscoveries. The linear mixed model is often used for training hierarchical\ndata, but its parameter estimation is computationally expensive, especially\nwith big data. Subsampling techniques have been developed to address this\nchallenge. However, most existing subsampling methods assume homogeneous data\nand do not consider the possible heterogeneity in hierarchical data. To address\nthis limitation, we develop a new approach called group-orthogonal subsampling\n(GOSS) for selecting informative subsets of hierarchical data that may exhibit\nheterogeneity. GOSS selects subdata with balanced data size among groups and\ncombinatorial orthogonality within each group, resulting in subdata that are\n$D$- and $A$-optimal for building linear mixed models. Estimators of parameters\ntrained on GOSS subdata are consistent and asymptotically normal. GOSS is shown\nto be numerically appealing via simulations and a real data application.\nTheoretical proofs, R codes, and supplementary numerical results are accessible\nonline as Supplementary Materials."}, "http://arxiv.org/abs/2310.10239": {"title": "Structural transfer learning of non-Gaussian DAG", "link": "http://arxiv.org/abs/2310.10239", "description": "Directed acyclic graph (DAG) has been widely employed to represent\ndirectional relationships among a set of collected nodes. Yet, the available\ndata in one single study is often limited for accurate DAG reconstruction,\nwhereas heterogeneous data may be collected from multiple relevant studies. It\nremains an open question how to pool the heterogeneous data together for better\nDAG structure reconstruction in the target study. In this paper, we first\nintroduce a novel set of structural similarity measures for DAG and then\npresent a transfer DAG learning framework by effectively leveraging information\nfrom auxiliary DAGs of different levels of similarities. Our theoretical\nanalysis shows substantial improvement in terms of DAG reconstruction in the\ntarget study, even when no auxiliary DAG is overall similar to the target DAG,\nwhich is in sharp contrast to most existing transfer learning methods. The\nadvantage of the proposed transfer DAG learning is also supported by extensive\nnumerical experiments on both synthetic data and multi-site brain functional\nconnectivity network data."}, "http://arxiv.org/abs/2310.10271": {"title": "A geometric power analysis for general log-linear models", "link": "http://arxiv.org/abs/2310.10271", "description": "General log-linear models are widely used to express the association in\nmultivariate frequency data on contingency tables. The paper focuses on the\npower analysis for testing the goodness-of-fit hypothesis for these models.\nConventionally, for the power-related sample size calculations a deviation from\nthe null hypothesis, aka effect size, is specified by means of the chi-square\ngoodness-of-fit index. It is argued that the odds ratio is a more natural\nmeasure of effect size, with the advantage of having a data-relevant\ninterpretation. 
Therefore, a class of log-affine models that are specified by\nodds ratios whose values deviate from those of the null by a small amount can\nbe chosen as an alternative. Being expressed as sets of constraints on odds\nratios, both hypotheses are represented by smooth surfaces in the probability\nsimplex, and thus, the power analysis can be given a geometric interpretation\nas well. A concept of geometric power is introduced and a Monte-Carlo algorithm\nfor its estimation is proposed. The framework is applied to the power analysis\nof goodness-of-fit in the context of multinomial sampling. An iterative scaling\nprocedure for generating distributions from a log-affine model is described and\nits convergence is proved. To illustrate, the geometric power analysis is\ncarried out for data from a clinical study."}, "http://arxiv.org/abs/2310.10324": {"title": "Assessing univariate and bivariate risks of late-frost and drought using vine copulas: A historical study for Bavaria", "link": "http://arxiv.org/abs/2310.10324", "description": "In light of climate change's impacts on forests, including extreme drought\nand late-frost, leading to vitality decline and regional forest die-back, we\nassess univariate drought and late-frost risks and perform a joint risk\nanalysis in Bavaria, Germany, from 1952 to 2020. Utilizing a vast dataset with\n26 bioclimatic and topographic variables, we employ vine copula models due to\nthe data's non-Gaussian and asymmetric dependencies. We use D-vine regression\nfor univariate and Y-vine regression for bivariate analysis, and propose\ncorresponding univariate and bivariate conditional probability risk measures.\nWe identify \"at-risk\" regions, emphasizing the need for forest adaptation due\nto climate change."}, "http://arxiv.org/abs/2310.10329": {"title": "Towards Data-Conditional Simulation for ABC Inference in Stochastic Differential Equations", "link": "http://arxiv.org/abs/2310.10329", "description": "We develop a Bayesian inference method for discretely-observed stochastic\ndifferential equations (SDEs). Inference is challenging for most SDEs, due to\nthe analytical intractability of the likelihood function. Nevertheless, forward\nsimulation via numerical methods is straightforward, motivating the use of\napproximate Bayesian computation (ABC). We propose a conditional simulation\nscheme for SDEs that is based on lookahead strategies for sequential Monte\nCarlo (SMC) and particle smoothing using backward simulation. This leads to the\nsimulation of trajectories that are consistent with the observed trajectory,\nthereby increasing the ABC acceptance rate. We additionally employ an invariant\nneural network, previously developed for Markov processes, to learn the summary\nstatistics function required in ABC. The neural network is incrementally\nretrained by exploiting an ABC-SMC sampler, which provides new training data at\neach round. Since the SDE simulation scheme differs from standard forward\nsimulation, we propose a suitable importance sampling correction, which has the\nadded advantage of guiding the parameters towards regions of high posterior\ndensity, especially in the first ABC-SMC round. 
Our approach achieves accurate\ninference and is about three times faster than standard (forward-only) ABC-SMC.\nWe illustrate our method in four simulation studies, including three examples\nfrom the Chan-Karaolyi-Longstaff-Sanders SDE family."}, "http://arxiv.org/abs/2310.10331": {"title": "Specifications tests for count time series models with covariates", "link": "http://arxiv.org/abs/2310.10331", "description": "We propose a goodness-of-fit test for a class of count time series models\nwith covariates which includes the Poisson autoregressive model with covariates\n(PARX) as a special case. The test criteria are derived from a specific\ncharacterization for the conditional probability generating function and the\ntest statistic is formulated as a $L_2$ weighting norm of the corresponding\nsample counterpart. The asymptotic properties of the proposed test statistic\nare provided under the null hypothesis as well as under specific alternatives.\nA bootstrap version of the test is explored in a Monte--Carlo study and\nillustrated on a real data set on road safety."}, "http://arxiv.org/abs/2310.10373": {"title": "False Discovery Proportion control for aggregated Knockoffs", "link": "http://arxiv.org/abs/2310.10373", "description": "Controlled variable selection is an important analytical step in various\nscientific fields, such as brain imaging or genomics. In these high-dimensional\ndata settings, considering too many variables leads to poor models and high\ncosts, hence the need for statistical guarantees on false positives. Knockoffs\nare a popular statistical tool for conditional variable selection in high\ndimension. However, they control for the expected proportion of false\ndiscoveries (FDR) and not their actual proportion (FDP). We present a new\nmethod, KOPI, that controls the proportion of false discoveries for\nKnockoff-based inference. The proposed method also relies on a new type of\naggregation to address the undesirable randomness associated with classical\nKnockoff inference. We demonstrate FDP control and substantial power gains over\nexisting Knockoff-based methods in various simulation settings and achieve good\nsensitivity/specificity tradeoffs on brain imaging and genomic data."}, "http://arxiv.org/abs/2310.10393": {"title": "Statistical and Causal Robustness for Causal Null Hypothesis Tests", "link": "http://arxiv.org/abs/2310.10393", "description": "Prior work applying semiparametric theory to causal inference has primarily\nfocused on deriving estimators that exhibit statistical robustness under a\nprespecified causal model that permits identification of a desired causal\nparameter. However, a fundamental challenge is correct specification of such a\nmodel, which usually involves making untestable assumptions. Evidence factors\nis an approach to combining hypothesis tests of a common causal null hypothesis\nunder two or more candidate causal models. Under certain conditions, this\nyields a test that is valid if at least one of the underlying models is\ncorrect, which is a form of causal robustness. We propose a method of combining\nsemiparametric theory with evidence factors. We develop a causal null\nhypothesis test based on joint asymptotic normality of K asymptotically linear\nsemiparametric estimators, where each estimator is based on a distinct\nidentifying functional derived from each of K candidate causal models. 
We show\nthat this test provides both statistical and causal robustness in the sense\nthat it is valid if at least one of the K proposed causal models is correct,\nwhile also allowing for slower than parametric rates of convergence in\nestimating nuisance functions. We demonstrate the efficacy of our method via\nsimulations and an application to the Framingham Heart Study."}, "http://arxiv.org/abs/2310.10407": {"title": "Ensemble methods for testing a global null", "link": "http://arxiv.org/abs/2310.10407", "description": "Testing a global null is a canonical problem in statistics and has a wide\nrange of applications. In view of the fact that no uniformly most powerful test\nexists, prior and/or domain knowledge are commonly used to focus on a certain\nclass of alternatives to improve the testing power. However, it is generally\nchallenging to develop tests that are particularly powerful against a certain\nclass of alternatives. In this paper, motivated by the success of ensemble\nlearning methods for prediction or classification, we propose an ensemble\nframework for testing that mimics the spirit of random forests to deal with the\nchallenges. Our ensemble testing framework aggregates a collection of weak base\ntests to form a final ensemble test that maintains strong and robust power for\nglobal nulls. We apply the framework to four problems about global testing in\ndifferent classes of alternatives arising from Whole Genome Sequencing (WGS)\nassociation studies. Specific ensemble tests are proposed for each of these\nproblems, and their theoretical optimality is established in terms of Bahadur\nefficiency. Extensive simulations and an analysis of a real WGS dataset are\nconducted to demonstrate the type I error control and/or power gain of the\nproposed ensemble tests."}, "http://arxiv.org/abs/2310.10422": {"title": "A Neural Network-Based Approach to Normality Testing for Dependent Data", "link": "http://arxiv.org/abs/2310.10422", "description": "There is a wide availability of methods for testing normality under the\nassumption of independent and identically distributed data. When data are\ndependent in space and/or time, however, assessing and testing the marginal\nbehavior is considerably more challenging, as the marginal behavior is impacted\nby the degree of dependence. We propose a new approach to assess normality for\ndependent data by non-linearly incorporating existing statistics from normality\ntests as well as sample moments such as skewness and kurtosis through a neural\nnetwork. We calibrate (deep) neural networks by simulated normal and non-normal\ndata with a wide range of dependence structures and we determine the\nprobability of rejecting the null hypothesis. We compare several approaches for\nnormality tests and demonstrate the superiority of our method in terms of\nstatistical power through an extensive simulation study. A real world\napplication to global temperature data further demonstrates how the degree of\nspatio-temporal aggregation affects the marginal normality in the data."}, "http://arxiv.org/abs/2310.10494": {"title": "Multivariate Scalar on Multidimensional Distribution Regression", "link": "http://arxiv.org/abs/2310.10494", "description": "We develop a new method for multivariate scalar on multidimensional\ndistribution regression. Traditional approaches typically analyze isolated\nunivariate scalar outcomes or consider unidimensional distributional\nrepresentations as predictors. 
However, these approaches are sub-optimal\nbecause: i) they fail to utilize the dependence between the distributional\npredictors; and ii) they neglect the correlation structure of the response. To overcome\nthese limitations, we propose a multivariate distributional analysis framework\nthat harnesses the power of multivariate density functions and multitask\nlearning. We develop a computationally efficient semiparametric estimation\nmethod for modelling the effect of the latent joint density on the multivariate\nresponse of interest. Additionally, we introduce a new conformal algorithm for\nquantifying the uncertainty of regression models with multivariate responses\nand distributional predictors, providing valuable insights into the conditional\ndistribution of the response. We have validated the effectiveness of our\nproposed method through comprehensive numerical simulations, clearly\ndemonstrating its superior performance compared to traditional methods. The\napplication of the proposed method is demonstrated on tri-axial accelerometer\ndata from the National Health and Nutrition Examination Survey (NHANES)\n2011-2014 for modelling the association between cognitive scores across various\ndomains and the distributional representation of physical activity among the older\nadult population. Our results highlight the advantages of the proposed\napproach, emphasizing the significance of incorporating complete spatial\ninformation derived from the accelerometer device."}, "http://arxiv.org/abs/2310.10588": {"title": "Max-convolution processes with random shape indicator kernels", "link": "http://arxiv.org/abs/2310.10588", "description": "In this paper, we introduce a new class of models for spatial data obtained\nfrom max-convolution processes based on indicator kernels with random shape. We\nshow that this class of models has appealing dependence properties including\ntail dependence at short distances and independence at long distances. We\nfurther consider max-convolutions between such processes and processes with\ntail independence, in order to separately control the bulk and tail dependence\nbehaviors, and to increase flexibility of the model at longer distances, in\nparticular, to capture intermediate tail dependence. We show how parameters can\nbe estimated using a weighted pairwise likelihood approach, and we conduct an\nextensive simulation study to show that the proposed inference approach is\nfeasible in high dimensions and it yields accurate parameter estimates in most\ncases. We apply the proposed methodology to analyse daily temperature maxima\nmeasured at 100 monitoring stations in the state of Oklahoma, US. Our results\nindicate that our proposed model provides a good fit to the data, and that it\ncaptures both the bulk and the tail dependence structures accurately."}, "http://arxiv.org/abs/1805.07301": {"title": "Enhanced Pricing and Management of Bundled Insurance Risks with Dependence-aware Prediction using Pair Copula Construction", "link": "http://arxiv.org/abs/1805.07301", "description": "We propose a dependence-aware predictive modeling framework for multivariate\nrisks stemming from an insurance contract with bundling features - an important\ntype of policy increasingly offered by major insurance companies. The bundling\nfeature naturally leads to longitudinal measurements of multiple insurance\nrisks, and correct pricing and management of such risks is of fundamental\ninterest to the financial stability of the macroeconomy. 
We build a novel\npredictive model that fully captures the dependence among the multivariate\nrepeated risk measurements. Specifically, the longitudinal measurement of each\nindividual risk is first modeled using pair copula construction with a D-vine\nstructure, and the multiple D-vines are then integrated by a flexible copula.\nThe proposed model provides a unified modeling framework for multivariate\nlongitudinal data that can accommodate different scales of measurements,\nincluding continuous, discrete, and mixed observations, and thus can be\npotentially useful for various economic studies. A computationally efficient\nsequential method is proposed for model estimation and inference, and its\nperformance is investigated both theoretically and via simulation studies. In\nthe application, we examine multivariate bundled risks in multi-peril property\ninsurance using proprietary data from a commercial property insurance provider.\nThe proposed model is found to provide improved decision making for several key\ninsurance operations. For underwriting, we show that the experience rate priced\nby the proposed model leads to a 9% lift in the insurer's net revenue. For\nreinsurance, we show that the insurer underestimates the risk of the retained\ninsurance portfolio by 10% when ignoring the dependence among bundled insurance\nrisks."}, "http://arxiv.org/abs/2005.04721": {"title": "Decision Making in Drug Development via Inference on Power", "link": "http://arxiv.org/abs/2005.04721", "description": "A typical power calculation is performed by replacing unknown\npopulation-level quantities in the power function with what is observed in\nexternal studies. Many authors and practitioners view this as an assumed value\nof power and offer the Bayesian quantity probability of success or assurance as\nan alternative. The claim is that, by averaging over a prior or posterior\ndistribution, probability of success transcends power by capturing the\nuncertainty around the unknown true treatment effect and any other\npopulation-level parameters. We use p-value functions to frame both the\nprobability of success calculation and the typical power calculation as merely\nproducing two different point estimates of power. We demonstrate that Go/No-Go\ndecisions based on either point estimate of power do not adequately quantify\nand control the risk involved, and instead we argue for Go/No-Go decisions that\nutilize inference on power for better risk management and decision making."}, "http://arxiv.org/abs/2103.00674": {"title": "BEAUTY Powered BEAST", "link": "http://arxiv.org/abs/2103.00674", "description": "We study distribution-free goodness-of-fit tests with the proposed Binary\nExpansion Approximation of UniformiTY (BEAUTY) approach. This method\ngeneralizes the renowned Euler's formula, and approximates the characteristic\nfunction of any copula through a linear combination of expectations of binary\ninteractions from marginal binary expansions. This novel theory enables a\nunification of many important tests of independence via approximations from\nspecific quadratic forms of symmetry statistics, where the deterministic weight\nmatrix characterizes the power properties of each test. To achieve a robust\npower, we examine test statistics with data-adaptive weights, referred to as\nthe Binary Expansion Adaptive Symmetry Test (BEAST). 
Using properties of the\nbinary expansion filtration, we demonstrate that the Neyman-Pearson test of\nuniformity can be approximated by an oracle weighted sum of symmetry\nstatistics. The BEAST with this oracle provides a useful benchmark of feasible\npower. To approach this oracle power, we devise the BEAST through a regularized\nresampling approximation of the oracle test. The BEAST improves the empirical\npower of many existing tests against a wide spectrum of common alternatives and\ndelivers a clear interpretation of dependency forms when significant."}, "http://arxiv.org/abs/2103.16159": {"title": "Controlling the False Discovery Rate in Transformational Sparsity: Split Knockoffs", "link": "http://arxiv.org/abs/2103.16159", "description": "Controlling the False Discovery Rate (FDR) in a variable selection procedure\nis critical for reproducible discoveries, and it has been extensively studied\nin sparse linear models. However, it remains largely open in scenarios where\nthe sparsity constraint is not directly imposed on the parameters but on a\nlinear transformation of the parameters to be estimated. Examples of such\nscenarios include total variations, wavelet transforms, fused LASSO, and trend\nfiltering. In this paper, we propose a data-adaptive FDR control method, called\nthe Split Knockoff method, for this transformational sparsity setting. The\nproposed method exploits both variable and data splitting. The linear\ntransformation constraint is relaxed to its Euclidean proximity in a lifted\nparameter space, which yields an orthogonal design that enables the orthogonal\nSplit Knockoff construction. To overcome the challenge that exchangeability\nfails due to the heterogeneous noise brought by the transformation, new inverse\nsupermartingale structures are developed via data splitting for provable FDR\ncontrol without sacrificing power. Simulation experiments demonstrate that the\nproposed methodology achieves the desired FDR and power. We also provide an\napplication to Alzheimer's Disease study, where atrophy brain regions and their\nabnormal connections can be discovered based on a structural Magnetic Resonance\nImaging dataset (ADNI)."}, "http://arxiv.org/abs/2201.05967": {"title": "Uniform Inference for Kernel Density Estimators with Dyadic Data", "link": "http://arxiv.org/abs/2201.05967", "description": "Dyadic data is often encountered when quantities of interest are associated\nwith the edges of a network. As such it plays an important role in statistics,\neconometrics and many other data science disciplines. We consider the problem\nof uniformly estimating a dyadic Lebesgue density function, focusing on\nnonparametric kernel-based estimators taking the form of dyadic empirical\nprocesses. Our main contributions include the minimax-optimal uniform\nconvergence rate of the dyadic kernel density estimator, along with strong\napproximation results for the associated standardized and Studentized\n$t$-processes. A consistent variance estimator enables the construction of\nvalid and feasible uniform confidence bands for the unknown density function.\nWe showcase the broad applicability of our results by developing novel\ncounterfactual density estimation and inference methodology for dyadic data,\nwhich can be used for causal inference and program evaluation. A crucial\nfeature of dyadic distributions is that they may be \"degenerate\" at certain\npoints in the support of the data, a property making our analysis somewhat\ndelicate. 
Nonetheless our methods for uniform inference remain robust to the\npotential presence of such points. For implementation purposes, we discuss\ninference procedures based on positive semi-definite covariance estimators,\nmean squared error optimal bandwidth selectors and robust bias correction\ntechniques. We illustrate the empirical finite-sample performance of our\nmethods both in simulations and with real-world trade data, for which we make\ncomparisons between observed and counterfactual trade distributions in\ndifferent years. Our technical results concerning strong approximations and\nmaximal inequalities are of potential independent interest."}, "http://arxiv.org/abs/2206.01076": {"title": "Likelihood-based Inference for Random Networks with Changepoints", "link": "http://arxiv.org/abs/2206.01076", "description": "Generative, temporal network models play an important role in analyzing the\ndependence structure and evolution patterns of complex networks. Due to the\ncomplicated nature of real network data, it is often naive to assume that the\nunderlying data-generative mechanism itself is invariant with time. Such\nobservation leads to the study of changepoints or sudden shifts in the\ndistributional structure of the evolving network. In this paper, we propose a\nlikelihood-based methodology to detect changepoints in undirected, affine\npreferential attachment networks, and establish a hypothesis testing framework\nto detect a single changepoint, together with a consistent estimator for the\nchangepoint. Such results require establishing consistency and asymptotic\nnormality of the MLE under the changepoint regime, which suffers from long\nrange dependence. The methodology is then extended to the multiple changepoint\nsetting via both a sliding window method and a more computationally efficient\nscore statistic. We also compare the proposed methodology with previously\ndeveloped non-parametric estimators of the changepoint via simulation, and the\nmethods developed herein are applied to modeling the popularity of a topic in a\nTwitter network over time."}, "http://arxiv.org/abs/2301.01616": {"title": "Locally Private Causal Inference for Randomized Experiments", "link": "http://arxiv.org/abs/2301.01616", "description": "Local differential privacy is a differential privacy paradigm in which\nindividuals first apply a privacy mechanism to their data (often by adding\nnoise) before transmitting the result to a curator. The noise for privacy\nresults in additional bias and variance in their analyses. Thus it is of great\nimportance for analysts to incorporate the privacy noise into valid inference.\nIn this article, we develop methodologies to infer causal effects from locally\nprivatized data under randomized experiments. First, we present frequentist\nestimators under various privacy scenarios with their variance estimators and\nplug-in confidence intervals. We show a na\\\"ive debiased estimator results in\ninferior mean-squared error (MSE) compared to minimax lower bounds. In\ncontrast, we show that using a customized privacy mechanism, we can match the\nlower bound, giving minimax optimal inference. We also develop a Bayesian\nnonparametric methodology along with a blocked Gibbs sampling algorithm, which\ncan be applied to any of our proposed privacy mechanisms, and which performs\nespecially well in terms of MSE for tight privacy budgets. 
Finally, we present\nsimulation studies to evaluate the performance of our proposed frequentist and\nBayesian methodologies for various privacy budgets, resulting in useful\nsuggestions for performing causal inference for privatized data."}, "http://arxiv.org/abs/2303.03215": {"title": "Quantile-Quantile Methodology -- Detailed Results", "link": "http://arxiv.org/abs/2303.03215", "description": "The linear quantile-quantile relationship provides an easy-to-implement yet\neffective tool for transformation to and testing for normality. Its good\nperformance is verified in this report."}, "http://arxiv.org/abs/2305.06645": {"title": "Causal Inference for Continuous Multiple Time Point Interventions", "link": "http://arxiv.org/abs/2305.06645", "description": "There are limited options to estimate the treatment effects of variables\nwhich are continuous and measured at multiple time points, particularly if the\ntrue dose-response curve should be estimated as closely as possible. However,\nthese situations may be of relevance: in pharmacology, one may be interested in\nhow outcomes of people living with -- and treated for -- HIV, such as viral\nfailure, would vary for time-varying interventions such as different drug\nconcentration trajectories. A challenge for doing causal inference with\ncontinuous interventions is that the positivity assumption is typically\nviolated. To address positivity violations, we develop projection functions,\nwhich reweigh and redefine the estimand of interest based on functions of the\nconditional support for the respective interventions. With these functions, we\nobtain the desired dose-response curve in areas of enough support, and\notherwise a meaningful estimand that does not require the positivity\nassumption. We develop $g$-computation type plug-in estimators for this case.\nThose are contrasted with g-computation estimators which are applied to\ncontinuous interventions without specifically addressing positivity violations,\nwhich we propose to be presented with diagnostics. The ideas are illustrated\nwith longitudinal data from HIV positive children treated with an\nefavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children\n$<13$ years in Zambia/Uganda. Simulations show in which situations a standard\n$g$-computation approach is appropriate, and in which it leads to bias and how\nthe proposed weighted estimation approach then recovers the alternative\nestimand of interest."}, "http://arxiv.org/abs/2305.14275": {"title": "Amortized Variational Inference with Coverage Guarantees", "link": "http://arxiv.org/abs/2305.14275", "description": "Amortized variational inference produces a posterior approximation that can\nbe rapidly computed given any new observation. Unfortunately, there are few\nguarantees about the quality of these approximate posteriors. We propose\nConformalized Amortized Neural Variational Inference (CANVI), a procedure that\nis scalable, easily implemented, and provides guaranteed marginal coverage.\nGiven a collection of candidate amortized posterior approximators, CANVI\nconstructs conformalized predictors based on each candidate, compares the\npredictors using a metric known as predictive efficiency, and returns the most\nefficient predictor. 
CANVI ensures that the resulting predictor constructs\nregions that contain the truth with a user-specified level of probability.\nCANVI is agnostic to design decisions in formulating the candidate\napproximators and only requires access to samples from the forward model,\npermitting its use in likelihood-free settings. We prove lower bounds on the\npredictive efficiency of the regions produced by CANVI and explore how the\nquality of a posterior approximation relates to the predictive efficiency of\nprediction regions based on that approximation. Finally, we demonstrate the\naccurate calibration and high predictive efficiency of CANVI on a suite of\nsimulation-based inference benchmark tasks and an important scientific task:\nanalyzing galaxy emission spectra."}, "http://arxiv.org/abs/2305.17187": {"title": "Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments", "link": "http://arxiv.org/abs/2305.17187", "description": "From clinical development of cancer therapies to investigations into partisan\nbias, adaptive sequential designs have become an increasingly popular method for\ncausal inference, as they offer the possibility of improved precision over\ntheir non-adaptive counterparts. However, even in simple settings (e.g. two\ntreatments) the extent to which adaptive designs can improve precision is not\nsufficiently well understood. In this work, we study the problem of Adaptive\nNeyman Allocation in a design-based potential outcomes framework, where the\nexperimenter seeks to construct an adaptive design which is nearly as efficient\nas the optimal (but infeasible) non-adaptive Neyman design, which has access to\nall potential outcomes. Motivated by connections to online optimization, we\npropose Neyman Ratio and Neyman Regret as two (equivalent) performance measures\nof adaptive designs for this problem. We present Clip-OGD, an adaptive design\nwhich achieves $\\widetilde{O}(\\sqrt{T})$ expected Neyman regret and thereby\nrecovers the optimal Neyman variance in large samples. Finally, we construct a\nconservative variance estimator which facilitates the development of\nasymptotically valid confidence intervals. To complement our theoretical\nresults, we conduct simulations using data from a microeconomic experiment."}, "http://arxiv.org/abs/2306.15622": {"title": "Biclustering random matrix partitions with an application to classification of forensic body fluids", "link": "http://arxiv.org/abs/2306.15622", "description": "Classification of unlabeled data is usually achieved by supervised learning\nfrom labeled samples. Although there exist many sophisticated supervised\nmachine learning methods that can predict the missing labels with a high level\nof accuracy, they often lack the required transparency in situations where it\nis important to provide interpretable results and meaningful measures of\nconfidence. Body fluid classification of forensic casework data is the case in\npoint. We develop a new Biclustering Dirichlet Process for Class-assignment\nwith Random Matrices (BDP-CaRMa), with a three-level hierarchy of clustering,\nand a model-based approach to classification that adapts to block structure in\nthe data matrix. As the class labels of some observations are missing, the\nnumber of rows in the data matrix for each class is unknown. BDP-CaRMa handles\nthis and extends existing biclustering methods by simultaneously biclustering\nmultiple matrices each having a randomly variable number of rows. 
We\ndemonstrate our method by applying it to the motivating problem, which is the\nclassification of body fluids based on mRNA profiles taken from crime scenes.\nThe analyses of casework-like data show that our method is interpretable and\nproduces well-calibrated posterior probabilities. Our model can be more\ngenerally applied to other types of data with a similar structure to the\nforensic data."}, "http://arxiv.org/abs/2307.05644": {"title": "Lambert W random variables and their applications in loss modelling", "link": "http://arxiv.org/abs/2307.05644", "description": "Several distributions and families of distributions are proposed to model\nskewed data, think, e.g., of skew-normal and related distributions. Lambert W\nrandom variables offer an alternative approach where, instead of constructing a\nnew distribution, a certain transform is proposed (Goerg, 2011). Such an\napproach allows the construction of a Lambert W skewed version from any\ndistribution. We choose Lambert W normal distribution as a natural starting\npoint and also include Lambert W exponential distribution due to the simplicity\nand shape of the exponential distribution, which, after skewing, may produce a\nreasonably heavy tail for loss models. In the theoretical part, we focus on the\nmathematical properties of obtained distributions, including the range of\nskewness. In the practical part, the suitability of corresponding Lambert W\ntransformed distributions is evaluated on real insurance data. The results are\ncompared with those obtained using common loss distributions."}, "http://arxiv.org/abs/2307.06840": {"title": "Ensemble learning for blending gridded satellite and gauge-measured precipitation data", "link": "http://arxiv.org/abs/2307.06840", "description": "Regression algorithms are regularly used for improving the accuracy of\nsatellite precipitation products. In this context, satellite precipitation and\ntopography data are the predictor variables, and gauge-measured precipitation\ndata are the dependent variables. Alongside this, it is increasingly recognised\nin many fields that combinations of algorithms through ensemble learning can\nlead to substantial predictive performance improvements. Still, a sufficient\nnumber of ensemble learners for improving the accuracy of satellite\nprecipitation products and their large-scale comparison are currently missing\nfrom the literature. In this study, we work towards filling in this specific\ngap by proposing 11 new ensemble learners in the field and by extensively\ncomparing them. We apply the ensemble learners to monthly data from the\nPERSIANN (Precipitation Estimation from Remotely Sensed Information using\nArtificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals\nfor GPM) gridded datasets that span over a 15-year period and over the entire\ncontiguous United States (CONUS). We also use gauge-measured precipitation\ndata from the Global Historical Climatology Network monthly database, version 2\n(GHCNm). The ensemble learners combine the predictions of six machine learning\nregression algorithms (base learners), namely the multivariate adaptive\nregression splines (MARS), multivariate adaptive polynomial splines\n(poly-MARS), random forests (RF), gradient boosting machines (GBM), extreme\ngradient boosting (XGBoost) and Bayesian regularized neural networks (BRNN),\nand each of them is based on a different combiner. 
The combiners include the\nequal-weight combiner, the median combiner, two best learners and seven\nvariants of a sophisticated stacking method. The latter stacks a regression\nalgorithm on top of the base learners to combine their independent\npredictions..."}, "http://arxiv.org/abs/2309.12819": {"title": "Doubly Robust Proximal Causal Learning for Continuous Treatments", "link": "http://arxiv.org/abs/2309.12819", "description": "Proximal causal learning is a promising framework for identifying the causal\neffect under the existence of unmeasured confounders. Within this framework,\nthe doubly robust (DR) estimator was derived and has shown its effectiveness in\nestimation, especially when the model assumption is violated. However, the\ncurrent form of the DR estimator is restricted to binary treatments, while the\ntreatment can be continuous in many real-world applications. The primary\nobstacle to continuous treatments resides in the delta function present in the\noriginal DR estimator, making it infeasible in causal effect estimation and\nintroducing a heavy computational burden in nuisance function estimation. To\naddress these challenges, we propose a kernel-based DR estimator that can well\nhandle continuous treatments. Equipped with its smoothness, we show that its\noracle form is a consistent approximation of the influence function. Further,\nwe propose a new approach to efficiently solve the nuisance functions. We then\nprovide a comprehensive convergence analysis in terms of the mean square error.\nWe demonstrate the utility of our estimator on synthetic datasets and\nreal-world applications."}, "http://arxiv.org/abs/2309.17283": {"title": "The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation", "link": "http://arxiv.org/abs/2309.17283", "description": "Assessing causal effects in the presence of unobserved confounding is a\nchallenging problem. Existing studies leveraged proxy variables or multiple\ntreatments to adjust for the confounding bias. In particular, the latter\napproach attributes the impact on a single outcome to multiple treatments,\nallowing estimating latent variables for confounding control. Nevertheless,\nthese methods primarily focus on a single outcome, whereas in many real-world\nscenarios, there is greater interest in studying the effects on multiple\noutcomes. Besides, these outcomes are often coupled with multiple treatments.\nExamples include the intensive care unit (ICU), where health providers evaluate\nthe effectiveness of therapies on multiple health indicators. To accommodate\nthese scenarios, we consider a new setting dubbed as multiple treatments and\nmultiple outcomes. We then show that parallel studies of multiple outcomes\ninvolved in this setting can assist each other in causal identification, in the\nsense that we can exploit other treatments and outcomes as proxies for each\ntreatment effect under study. We proceed with a causal discovery method that\ncan effectively identify such proxies for causal estimation. The utility of our\nmethod is demonstrated in synthetic data and sepsis disease."}, "http://arxiv.org/abs/2310.10740": {"title": "Unbiased Estimation of Structured Prediction Error", "link": "http://arxiv.org/abs/2310.10740", "description": "Many modern datasets, such as those in ecology and geology, are composed of\nsamples with spatial structure and dependence. 
With such data violating the\nusual independent and identically distributed (IID) assumption in machine\nlearning and classical statistics, it is unclear a priori how one should\nmeasure the performance and generalization of models. Several authors have\nempirically investigated cross-validation (CV) methods in this setting,\nreaching mixed conclusions. We provide a class of unbiased estimation methods\nfor general quadratic errors, correlated Gaussian response, and arbitrary\nprediction function $g$, for a noise-elevated version of the error. Our\napproach generalizes the coupled bootstrap (CB) from the normal means problem\nto general normal data, allowing correlation both within and between the\ntraining and test sets. CB relies on creating bootstrap samples that are\nintelligently decoupled, in the sense of being statistically independent.\nSpecifically, the key to CB lies in generating two independent \"views\" of our\ndata and using them as stand-ins for the usual independent training and test\nsamples. Beginning with Mallows' $C_p$, we generalize the estimator to develop\nour generalized $C_p$ estimators (GC). We show that, under only a moment condition\non $g$, this noise-elevated error estimate converges smoothly to the noiseless\nerror estimate. We show that when Stein's unbiased risk estimator (SURE)\napplies, GC converges to SURE as in the normal means problem. Further, we use\nthese same tools to analyze CV and provide some theoretical analysis to help\nunderstand when CV will provide good estimates of error. Simulations align with\nour theoretical results, demonstrating the effectiveness of GC and illustrating\nthe behavior of CV methods. Lastly, we apply our estimator to a model selection\ntask on geothermal data in Nevada."}, "http://arxiv.org/abs/2310.10761": {"title": "Simulation Based Composite Likelihood", "link": "http://arxiv.org/abs/2310.10761", "description": "Inference for high-dimensional hidden Markov models is challenging due to the\nexponential-in-dimension computational cost of the forward algorithm. To\naddress this issue, we introduce an innovative composite likelihood approach\ncalled \"Simulation Based Composite Likelihood\" (SimBa-CL). With SimBa-CL, we\napproximate the likelihood by the product of its marginals, which we estimate\nusing Monte Carlo sampling. In a similar vein to approximate Bayesian\ncomputation (ABC), SimBa-CL requires multiple simulations from the model, but,\nin contrast to ABC, it provides a likelihood approximation that guides the\noptimization of the parameters. Leveraging automatic differentiation libraries,\nit is simple to calculate gradients and Hessians to not only speed-up\noptimization, but also to build approximate confidence sets. We conclude with\nan extensive experimental section, where we empirically validate our\ntheoretical results, conduct a comparative analysis with SMC, and apply\nSimBa-CL to real-world Aphtovirus data."}, "http://arxiv.org/abs/2310.10798": {"title": "Poisson Count Time Series", "link": "http://arxiv.org/abs/2310.10798", "description": "This paper reviews and compares popular methods, some old and some very\nrecent, that produce time series having Poisson marginal distributions. The\npaper begins by narrating ways where time series with Poisson marginal\ndistributions can be produced. 
Modeling nonstationary series with covariates\nmotivates consideration of methods where the Poisson parameter depends on time.\nHere, estimation methods are developed for some of the more flexible methods.\nThe results are used in the analysis of 1) a count sequence of tropical\ncyclones occurring in the North Atlantic Basin since 1970, and 2) the number of\nno-hitter games pitched in major league baseball since 1893. Tests for whether\nthe Poisson marginal distribution is appropriate are included."}, "http://arxiv.org/abs/2310.10915": {"title": "Identifiability of the Multinomial Processing Tree-IRT model for the Philadelphia Naming Test", "link": "http://arxiv.org/abs/2310.10915", "description": "For persons with aphasia, naming tests are used to evaluate the severity of\nthe disease and observe progress toward recovery. The Philadelphia Naming\nTest (PNT) is a leading naming test composed of 175 items. The items are common\nnouns which are one to four syllables in length and with low, medium, and high\nfrequency. Since the target word is known to the administrator, the response\nfrom the patient can be classified as correct or an error. If the patient\ncommits an error, the PNT provides procedures for classifying the type of error\nin the response. Item response theory can be applied to PNT data to provide\nestimates of item difficulty and subject naming ability. Walker et al. (2018)\ndeveloped an IRT multinomial processing tree (IRT-MPT) model to attempt to\nunderstand the pathways through which the different errors are made by patients\nwhen responding to an item. The MPT model expands on existing models by\nconsidering items to be heterogeneous and estimating multiple latent parameters\nfor patients to more precisely determine at which step of word production a\npatient's ability has been affected. These latent parameters represent the\ntheoretical cognitive steps taken in responding to an item. Given the\ncomplexity of the model proposed in Walker et al. (2018), here we investigate\nthe identifiability of the parameters included in the IRT-MPT model."}, "http://arxiv.org/abs/2310.10976": {"title": "Exact nonlinear state estimation", "link": "http://arxiv.org/abs/2310.10976", "description": "The majority of data assimilation (DA) methods in the geosciences are based\non Gaussian assumptions. While these assumptions facilitate efficient\nalgorithms, they cause analysis biases and subsequent forecast degradations.\nNon-parametric, particle-based DA algorithms have superior accuracy, but their\napplication to high-dimensional models still poses operational challenges.\nDrawing inspiration from recent advances in the field of generative artificial\nintelligence (AI), this article introduces a new nonlinear estimation theory\nwhich attempts to bridge the existing gap in DA methodology. Specifically, a\nConjugate Transform Filter (CTF) is derived and shown to generalize the\ncelebrated Kalman filter to arbitrarily non-Gaussian distributions. The new\nfilter has several desirable properties, such as its ability to preserve\nstatistical relationships in the prior state and convergence to highly accurate\nobservations. An ensemble approximation of the new theory (ECTF) is also\npresented and validated using idealized statistical experiments that feature\nbounded quantities with non-Gaussian distributions, a prevalent challenge in\nEarth system models. 
Results from these experiments indicate that the greatest\nbenefits from ECTF occur when observation errors are small relative to the\nforecast uncertainty and when state variables exhibit strong nonlinear\ndependencies. Ultimately, the new filtering theory offers exciting avenues for\nimproving conventional DA algorithms through their principled integration with\nAI techniques."}, "http://arxiv.org/abs/2310.11122": {"title": "Sensitivity-Aware Amortized Bayesian Inference", "link": "http://arxiv.org/abs/2310.11122", "description": "Bayesian inference is a powerful framework for making probabilistic\ninferences and decisions under uncertainty. Fundamental choices in modern\nBayesian workflows concern the specification of the likelihood function and\nprior distributions, the posterior approximator, and the data. Each choice can\nsignificantly influence model-based inference and subsequent decisions, thereby\nnecessitating sensitivity analysis. In this work, we propose a multifaceted\napproach to integrate sensitivity analyses into amortized Bayesian inference\n(ABI, i.e., simulation-based inference with neural networks). First, we utilize\nweight sharing to encode the structural similarities between alternative\nlikelihood and prior specifications in the training process with minimal\ncomputational overhead. Second, we leverage the rapid inference of neural\nnetworks to assess sensitivity to various data perturbations or pre-processing\nprocedures. In contrast to most other Bayesian approaches, both steps\ncircumvent the costly bottleneck of refitting the model(s) for each choice of\nlikelihood, prior, or dataset. Finally, we propose to use neural network\nensembles to evaluate variation in results induced by unreliable approximation\non unseen data. We demonstrate the effectiveness of our method in applied\nmodeling problems, ranging from the estimation of disease outbreak dynamics and\nglobal warming thresholds to the comparison of human decision-making models.\nOur experiments showcase how our approach enables practitioners to effectively\nunveil hidden relationships between modeling choices and inferential\nconclusions."}, "http://arxiv.org/abs/2310.11357": {"title": "A Pseudo-likelihood Approach to Under-5 Mortality Estimation", "link": "http://arxiv.org/abs/2310.11357", "description": "Accurate and precise estimates of under-5 mortality rates (U5MR) are an\nimportant health summary for countries. Full survival curves are additionally\nof interest to better understand the pattern of mortality in children under 5.\nModern demographic methods for estimating a full mortality schedule for\nchildren have been developed for countries with good vital registration and\nreliable census data, but perform poorly in many low- and middle-income\ncountries. In these countries, the need to utilize nationally representative\nsurveys to estimate U5MR requires additional statistical care to mitigate\npotential biases in survey data, acknowledge the survey design, and handle\naspects of survival data (i.e., censoring and truncation). In this paper, we\ndevelop parametric and non-parametric pseudo-likelihood approaches to\nestimating under-5 mortality across time from complex survey data. We argue\nthat the parametric approach is particularly useful in scenarios where data are\nsparse and estimation may require stronger assumptions. The nonparametric\napproach provides an aid to model validation. 
We compare a variety of\nparametric models to three existing methods for obtaining a full survival curve\nfor children under the age of 5, and argue that a parametric pseudo-likelihood\napproach is advantageous in low- and middle-income countries. We apply our\nproposed approaches to survey data from Burkina Faso, Malawi, Senegal, and\nNamibia. All code for fitting the models described in this paper is available\nin the R package pssst."}, "http://arxiv.org/abs/2006.00767": {"title": "Generative Multiple-purpose Sampler for Weighted M-estimation", "link": "http://arxiv.org/abs/2006.00767", "description": "To overcome the computational bottleneck of various data perturbation\nprocedures such as the bootstrap and cross validations, we propose the\nGenerative Multiple-purpose Sampler (GMS), which constructs a generator\nfunction to produce solutions of weighted M-estimators from a set of given\nweights and tuning parameters. The GMS is implemented by a single optimization\nwithout having to repeatedly evaluate the minimizers of weighted losses, and is\nthus capable of significantly reducing the computational time. We demonstrate\nthat the GMS framework enables the implementation of various statistical\nprocedures that would be unfeasible in a conventional framework, such as the\niterated bootstrap, bootstrapped cross-validation for penalized likelihood,\nbootstrapped empirical Bayes with nonparametric maximum likelihood, etc. To\nconstruct a computationally efficient generator function, we also propose a\nnovel form of neural network called the \\emph{weight multiplicative multilayer\nperceptron} to achieve fast convergence. Our numerical results demonstrate that\nthe new neural network structure enjoys a few orders of magnitude speed\nadvantage in comparison to the conventional one. An R package called GMS is\nprovided, which runs under Pytorch to implement the proposed methods and allows\nthe user to provide a customized loss function to tailor to their own models of\ninterest."}, "http://arxiv.org/abs/2012.03593": {"title": "Algebraic geometry of discrete interventional models", "link": "http://arxiv.org/abs/2012.03593", "description": "We investigate the algebra and geometry of general interventions in discrete\nDAG models. To this end, we introduce a theory for modeling soft interventions\nin the more general family of staged tree models and develop the formalism to\nstudy these models as parametrized subvarieties of a product of probability\nsimplices. We then consider the problem of finding their defining equations,\nand we derive a combinatorial criterion for identifying interventional staged\ntree models for which the defining ideal is toric. We apply these results to\nthe class of discrete interventional DAG models and establish a criterion to\ndetermine when these models are toric varieties."}, "http://arxiv.org/abs/2105.12720": {"title": "Marginal structural models with Latent Class Growth Modeling of Treatment Trajectories", "link": "http://arxiv.org/abs/2105.12720", "description": "In a real-life setting, little is known regarding the effectiveness of\nstatins for primary prevention among older adults, and analysis of\nobservational data can add crucial information on the benefits of actual\npatterns of use. Latent class growth models (LCGM) are increasingly proposed as\na solution to summarize the observed longitudinal treatment in a few distinct\ngroups. 
When combined with standard approaches like Cox proportional hazards\nmodels, LCGM can fail to control time-dependent confounding bias because of\ntime-varying covariates that have a double role of confounders and mediators.\nWe propose to use LCGM to classify individuals into a few latent classes based\non their medication adherence pattern, then choose a working marginal\nstructural model (MSM) that relates the outcome to these groups. The parameter\nof interest is nonparametrically defined as the projection of the true MSM onto\nthe chosen working model. The combination of LCGM with MSM is a convenient way\nto describe treatment adherence and can effectively control time-dependent\nconfounding. Simulation studies were used to illustrate our approach and\ncompare it with unadjusted, baseline covariates-adjusted, time-varying\ncovariates adjusted and inverse probability of trajectory groups weighting\nadjusted models. We found that our proposed approach yielded estimators with\nlittle or no bias."}, "http://arxiv.org/abs/2208.07610": {"title": "E-Statistics, Group Invariance and Anytime Valid Testing", "link": "http://arxiv.org/abs/2208.07610", "description": "We study worst-case-growth-rate-optimal (GROW) e-statistics for hypothesis\ntesting between two group models. It is known that under a mild condition on\nthe action of the underlying group G on the data, there exists a maximally\ninvariant statistic. We show that among all e-statistics, invariant or not, the\nlikelihood ratio of the maximally invariant statistic is GROW, both in the\nabsolute and in the relative sense, and that an anytime-valid test can be based\non it. The GROW e-statistic is equal to a Bayes factor with a right Haar prior\non G. Our treatment avoids nonuniqueness issues that sometimes arise for such\npriors in Bayesian contexts. A crucial assumption on the group G is its\namenability, a well-known group-theoretical condition, which holds, for\ninstance, in scale-location families. Our results also apply to\nfinite-dimensional linear regression."}, "http://arxiv.org/abs/2302.03246": {"title": "CDANs: Temporal Causal Discovery from Autocorrelated and Non-Stationary Time Series Data", "link": "http://arxiv.org/abs/2302.03246", "description": "Time series data are found in many areas of healthcare such as medical time\nseries, electronic health records (EHR), measurements of vitals, and wearable\ndevices. Causal discovery, which involves estimating causal relationships from\nobservational data, holds the potential to play a significant role in\nextracting actionable insights about human health. In this study, we present a\nnovel constraint-based causal discovery approach for autocorrelated and\nnon-stationary time series data (CDANs). Our proposed method addresses several\nlimitations of existing causal discovery methods for autocorrelated and\nnon-stationary time series data, such as high dimensionality, the inability to\nidentify lagged causal relationships, and overlooking changing modules. Our\napproach identifies lagged and instantaneous/contemporaneous causal\nrelationships along with changing modules that vary over time. The method\noptimizes the conditioning sets in a constraint-based search by considering\nlagged parents instead of conditioning on the entire past that addresses high\ndimensionality. The changing modules are detected by considering both\ncontemporaneous and lagged parents. 
The approach first detects the lagged\nadjacencies, then identifies the changing modules and contemporaneous\nadjacencies, and finally determines the causal direction. We extensively\nevaluated our proposed method on synthetic and real-world clinical datasets,\nand compared its performance with several baseline approaches. The experimental\nresults demonstrate the effectiveness of the proposed method in detecting\ncausal relationships and changing modules for autocorrelated and non-stationary\ntime series data."}, "http://arxiv.org/abs/2305.07089": {"title": "Hierarchically Coherent Multivariate Mixture Networks", "link": "http://arxiv.org/abs/2305.07089", "description": "Large collections of time series data are often organized into hierarchies\nwith different levels of aggregation; examples include product and geographical\ngroupings. Probabilistic coherent forecasting is tasked to produce forecasts\nconsistent across levels of aggregation. In this study, we propose to augment\nneural forecasting architectures with a coherent multivariate mixture output.\nWe optimize the networks with a composite likelihood objective, allowing us to\ncapture time series' relationships while maintaining high computational\nefficiency. Our approach demonstrates 13.2% average accuracy improvements on\nmost datasets compared to state-of-the-art baselines. We conduct ablation\nstudies of the framework components and provide theoretical foundations for\nthem. To assist related work, the code is available at\nhttps://github.com/Nixtla/neuralforecast."}, "http://arxiv.org/abs/2307.16720": {"title": "The epigraph and the hypograph indexes as useful tools for clustering multivariate functional data", "link": "http://arxiv.org/abs/2307.16720", "description": "The proliferation of data generation has spurred advancements in functional\ndata analysis. With the ability to analyze multiple variables simultaneously,\nthe demand for working with multivariate functional data has increased. This\nstudy proposes a novel formulation of the epigraph and hypograph indexes, as\nwell as their generalized expressions, specifically tailored for the\nmultivariate functional context. These definitions take into account the\ninterrelations between components. Furthermore, the proposed indexes are\nemployed to cluster multivariate functional data. In the clustering process,\nthe indexes are applied to both the data and their first and second\nderivatives. This generates a reduced-dimension dataset from the original\nmultivariate functional data, enabling the application of well-established\nmultivariate clustering techniques which have been extensively studied in the\nliterature. This methodology has been tested through simulated and real\ndatasets, performing comparative analyses against state-of-the-art methods to assess\nits performance."}, "http://arxiv.org/abs/2309.07810": {"title": "Spectrum-Aware Adjustment: A New Debiasing Framework with Applications to Principal Component Regression", "link": "http://arxiv.org/abs/2309.07810", "description": "We introduce a new debiasing framework for high-dimensional linear regression\nthat bypasses the restrictions on covariate distributions imposed by modern\ndebiasing technology. We study the prevalent setting where the number of\nfeatures and samples are both large and comparable. 
In this context,\nstate-of-the-art debiasing technology uses a degrees-of-freedom correction to\nremove the shrinkage bias of regularized estimators and conduct inference.\nHowever, this method requires that the observed samples are i.i.d., the\ncovariates follow a mean zero Gaussian distribution, and reliable covariance\nmatrix estimates for observed features are available. This approach struggles\nwhen (i) covariates are non-Gaussian with heavy tails or asymmetric\ndistributions, (ii) rows of the design exhibit heterogeneity or dependencies,\nand (iii) reliable feature covariance estimates are lacking.\n\nTo address these, we develop a new strategy where the debiasing correction is\na rescaled gradient descent step (suitably initialized) with step size\ndetermined by the spectrum of the sample covariance matrix. Unlike prior work,\nwe assume that eigenvectors of this matrix are uniform draws from the\northogonal group. We show this assumption remains valid in diverse situations\nwhere traditional debiasing fails, including designs with complex row-column\ndependencies, heavy tails, asymmetric properties, and latent low-rank\nstructures. We establish asymptotic normality of our proposed estimator\n(centered and scaled) under various convergence notions. Moreover, we develop a\nconsistent estimator for its asymptotic variance. Lastly, we introduce a\ndebiased Principal Components Regression (PCR) technique using our\nSpectrum-Aware approach. In varied simulations and real data experiments, we\nobserve that our method outperforms degrees-of-freedom debiasing by a margin."}, "http://arxiv.org/abs/2310.11471": {"title": "Modeling lower-truncated and right-censored insurance claims with an extension of the MBBEFD class", "link": "http://arxiv.org/abs/2310.11471", "description": "In general insurance, claims are often lower-truncated and right-censored\nbecause insurance contracts may involve deductibles and maximal covers. Most\nclassical statistical models are not (directly) suited to model lower-truncated\nand right-censored claims. A surprisingly flexible family of distributions that\ncan cope with lower-truncated and right-censored claims is the class of MBBEFD\ndistributions that originally has been introduced by Bernegger (1997) for\nreinsurance pricing, but which has not gained much attention outside the\nreinsurance literature. We derive properties of the class of MBBEFD\ndistributions, and we extend it to a bigger family of distribution functions\nsuitable for modeling lower-truncated and right-censored claims. Interestingly,\nin general insurance, we mainly rely on unimodal skewed densities, whereas the\nreinsurance literature typically proposes monotonically decreasing densities\nwithin the MBBEFD class."}, "http://arxiv.org/abs/2310.11603": {"title": "Group sequential two-stage preference designs", "link": "http://arxiv.org/abs/2310.11603", "description": "The two-stage preference design (TSPD) enables the inference for treatment\nefficacy while allowing for incorporation of patient preference to treatment.\nIt can provide unbiased estimates for selection and preference effects, where a\nselection effect occurs when patients who prefer one treatment respond\ndifferently than those who prefer another, and a preference effect is the\ndifference in response caused by an interaction between the patient's\npreference and the actual treatment they receive. 
One potential barrier to\nadopting TSPD in practice, however, is the relatively large sample size\nrequired to estimate selection and preference effects with sufficient power. To\naddress this concern, we propose a group sequential two-stage preference design\n(GS-TSPD), which combines TSPD with sequential monitoring for early stopping.\nIn the GS-TSPD, pre-planned sequential monitoring allows investigators to\nconduct repeated hypothesis tests on accumulated data prior to full enrollment\nto assess study eligibility for early trial termination without inflating type\nI error rates. Thus, the procedure allows investigators to terminate the study\nwhen there is sufficient evidence of treatment, selection, or preference\neffects during an interim analysis, thereby reducing the design resource in\nexpectation. To formalize such a procedure, we verify the independent\nincrements assumption for testing the selection and preference effects and\napply group sequential stopping boundaries from the approximate sequential\ndensity functions. Simulations are then conducted to investigate the operating\ncharacteristics of our proposed GS-TSPD compared to the traditional TSPD. We\ndemonstrate the applicability of the design using a study of Hepatitis C\ntreatment modality."}, "http://arxiv.org/abs/2310.11620": {"title": "Enhancing modified treatment policy effect estimation with weighted energy distance", "link": "http://arxiv.org/abs/2310.11620", "description": "The effects of continuous treatments are often characterized through the\naverage dose response function, which is challenging to estimate from\nobservational data due to confounding and positivity violations. Modified\ntreatment policies (MTPs) are an alternative approach that aim to assess the\neffect of a modification to observed treatment values and work under relaxed\nassumptions. Estimators for MTPs generally focus on estimating the conditional\ndensity of treatment given covariates and using it to construct weights.\nHowever, weighting using conditional density models has well-documented\nchallenges. Further, MTPs with larger treatment modifications have stronger\nconfounding and no tools exist to help choose an appropriate modification\nmagnitude. This paper investigates the role of weights for MTPs showing that to\ncontrol confounding, weights should balance the weighted data to an unobserved\nhypothetical target population, that can be characterized with observed data.\nLeveraging this insight, we present a versatile set of tools to enhance\nestimation for MTPs. We introduce a distance that measures imbalance of\ncovariate distributions under the MTP and use it to develop new weighting\nmethods and tools to aid in the estimation of MTPs. We illustrate our methods\nthrough an example studying the effect of mechanical power of ventilation on\nin-hospital mortality."}, "http://arxiv.org/abs/2310.11630": {"title": "Adaptive Bootstrap Tests for Composite Null Hypotheses in the Mediation Pathway Analysis", "link": "http://arxiv.org/abs/2310.11630", "description": "Mediation analysis aims to assess if, and how, a certain exposure influences\nan outcome of interest through intermediate variables. This problem has\nrecently gained a surge of attention due to the tremendous need for such\nanalyses in scientific fields. Testing for the mediation effect is greatly\nchallenged by the fact that the underlying null hypothesis (i.e. the absence of\nmediation effects) is composite. 
Most existing mediation tests are overly\nconservative and thus underpowered. To overcome this significant methodological\nhurdle, we develop an adaptive bootstrap testing framework that can accommodate\ndifferent types of composite null hypotheses in the mediation pathway analysis.\nApplied to the product of coefficients (PoC) test and the joint significance\n(JS) test, our adaptive testing procedures provide type I error control under\nthe composite null, resulting in much improved statistical power compared to\nexisting tests. Both theoretical properties and numerical examples of the\nproposed methodology are discussed."}, "http://arxiv.org/abs/2310.11683": {"title": "Are we bootstrapping the right thing? A new approach to quantify uncertainty of Average Treatment Effect Estimate", "link": "http://arxiv.org/abs/2310.11683", "description": "Existing approaches that use the bootstrap method to derive the standard error\nand confidence interval of the average treatment effect estimate have one potential\nissue, which is that they are actually bootstrapping the wrong thing, resulting\nin invalid statistical inference. In this paper, we discuss this important\nissue and propose a new non-parametric bootstrap method that can more precisely\nquantify the uncertainty associated with average treatment effect estimates. We\ndemonstrate the validity of this approach through a simulation study and a\nreal-world example, and highlight the importance of deriving standard error and\nconfidence interval of average treatment effect estimates that both remove\nextra undesired noise and are easy to interpret when applied in real world\nscenarios."}, "http://arxiv.org/abs/2310.11724": {"title": "Simultaneous Nonparametric Inference of M-regression under Complex Temporal Dynamics", "link": "http://arxiv.org/abs/2310.11724", "description": "The paper considers simultaneous nonparametric inference for a wide class of\nM-regression models with time-varying coefficients. The covariates and errors\nof the regression model are tackled as a general class of piece-wise locally\nstationary time series and are allowed to be cross-dependent. We introduce an\nintegration technique to study the M-estimators, whose limiting properties are\ndisclosed using Bahadur representation and Gaussian approximation theory.\nFacilitated by a self-convolved bootstrap proposed in this paper, we introduce\na unified framework to conduct general classes of Exact Function Tests,\nLack-of-fit Tests, and Qualitative Tests for the time-varying coefficient\nM-regression under complex temporal dynamics. As an application, our method is\napplied to studying the anthropogenic warming trend and time-varying structures\nof the ENSO effect using global climate data from 1882 to 2005."}, "http://arxiv.org/abs/2310.11741": {"title": "Graph Sphere: From Nodes to Supernodes in Graphical Models", "link": "http://arxiv.org/abs/2310.11741", "description": "High-dimensional data analysis typically focuses on low-dimensional\nstructure, often to aid interpretation and computational efficiency. Graphical\nmodels provide a powerful methodology for learning the conditional independence\nstructure in multivariate data by representing variables as nodes and\ndependencies as edges. Inference is often focused on individual edges in the\nlatent graph. Nonetheless, there is increasing interest in determining more\ncomplex structures, such as communities of nodes, for multiple reasons,\nincluding more effective information retrieval and better interpretability. 
In\nthis work, we propose a multilayer graphical model where we first cluster nodes\nand then, at the second layer, investigate the relationships among groups of\nnodes. Specifically, nodes are partitioned into \"supernodes\" with a\ndata-coherent size-biased tessellation prior which combines ideas from Bayesian\nnonparametrics and Voronoi tessellations. This construct allows accounting also\nfor dependence of nodes within supernodes. At the second layer, dependence\nstructure among supernodes is modelled through a Gaussian graphical model,\nwhere the focus of inference is on \"superedges\". We provide theoretical\njustification for our modelling choices. We design tailored Markov chain Monte\nCarlo schemes, which also enable parallel computations. We demonstrate the\neffectiveness of our approach for large-scale structure learning in simulations\nand a transcriptomics application."}, "http://arxiv.org/abs/2310.11779": {"title": "A Multivariate Skew-Normal-Tukey-h Distribution", "link": "http://arxiv.org/abs/2310.11779", "description": "We introduce a new family of multivariate distributions by taking the\ncomponent-wise Tukey-h transformation of a random vector following a\nskew-normal distribution. The proposed distribution is named the\nskew-normal-Tukey-h distribution and is an extension of the skew-normal\ndistribution for handling heavy-tailed data. We compare this proposed\ndistribution to the skew-t distribution, which is another extension of the\nskew-normal distribution for modeling tail-thickness, and demonstrate that when\nthere are substantial differences in marginal kurtosis, the proposed\ndistribution is more appropriate. Moreover, we derive many appealing stochastic\nproperties of the proposed distribution and provide a methodology for the\nestimation of the parameters in which the computational requirement increases\nlinearly with the dimension. Using simulations, as well as a wine and a wind\nspeed data application, we illustrate how to draw inferences based on the\nmultivariate skew-normal-Tukey-h distribution."}, "http://arxiv.org/abs/2310.11799": {"title": "Testing for patterns and structures in covariance and correlation matrices", "link": "http://arxiv.org/abs/2310.11799", "description": "Covariance matrices of random vectors contain information that is crucial for\nmodelling. Certain structures and patterns of the covariances (or correlations)\nmay be used to justify parametric models, e.g., autoregressive models. Until\nnow, there have been only few approaches for testing such covariance structures\nsystematically and in a unified way. In the present paper, we propose such a\nunified testing procedure, and we will exemplify the approach with a large\nvariety of covariance structure models. This includes common structures such as\ndiagonal matrices, Toeplitz matrices, and compound symmetry but also the more\ninvolved autoregressive matrices. We propose hypothesis tests for these\nstructures, and we use bootstrap techniques for better small-sample\napproximation. The structures of the proposed tests invite for adaptations to\nother covariance patterns by choosing the hypothesis matrix appropriately. We\nprove their correctness for large sample sizes. 
The proposed methods require\nonly weak assumptions.\n\nWith the help of a simulation study, we assess the small sample properties of\nthe tests.\n\nWe also analyze a real data set to illustrate the application of the\nprocedure."}, "http://arxiv.org/abs/2310.11822": {"title": "Post-clustering Inference under Dependency", "link": "http://arxiv.org/abs/2310.11822", "description": "Recent work by Gao et al. has laid the foundations for post-clustering\ninference. For the first time, the authors established a theoretical framework\nallowing to test for differences between means of estimated clusters.\nAdditionally, they studied the estimation of unknown parameters while\ncontrolling the selective type I error. However, their theory was developed for\nindependent observations identically distributed as $p$-dimensional Gaussian\nvariables with a spherical covariance matrix. Here, we aim at extending this\nframework to a more convenient scenario for practical applications, where\narbitrary dependence structures between observations and features are allowed.\nWe show that a $p$-value for post-clustering inference under general dependency\ncan be defined, and we assess the theoretical conditions allowing the\ncompatible estimation of a covariance matrix. The theory is developed for\nhierarchical agglomerative clustering algorithms with several types of\nlinkages, and for the $k$-means algorithm. We illustrate our method with\nsynthetic data and real data of protein structures."}, "http://arxiv.org/abs/2310.12000": {"title": "Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models", "link": "http://arxiv.org/abs/2310.12000", "description": "Latent Gaussian process (GP) models are flexible probabilistic non-parametric\nfunction models. Vecchia approximations are accurate approximations for GPs to\novercome computational bottlenecks for large data, and the Laplace\napproximation is a fast method with asymptotic convergence guarantees to\napproximate marginal likelihoods and posterior predictive distributions for\nnon-Gaussian likelihoods. Unfortunately, the computational complexity of\ncombined Vecchia-Laplace approximations grows faster than linearly in the\nsample size when used in combination with direct solver methods such as the\nCholesky decomposition. Computations with Vecchia-Laplace approximations thus\nbecome prohibitively slow precisely when the approximations are usually the\nmost accurate, i.e., on large data sets. In this article, we present several\niterative methods for inference with Vecchia-Laplace approximations which make\ncomputations considerably faster compared to Cholesky-based calculations. We\nanalyze our proposed methods theoretically and in experiments with simulated\nand real-world data. In particular, we obtain a speed-up of an order of\nmagnitude compared to Cholesky-based inference and a threefold increase in\nprediction accuracy in terms of the continuous ranked probability score\ncompared to a state-of-the-art method on a large satellite data set. All\nmethods are implemented in a free C++ software library with high-level Python\nand R packages."}, "http://arxiv.org/abs/2310.12010": {"title": "A Note on Improving Variational Estimation for Multidimensional Item Response Theory", "link": "http://arxiv.org/abs/2310.12010", "description": "Survey instruments and assessments are frequently used in many domains of\nsocial science. 
When the constructs that these assessments try to measure\nbecome multifaceted, multidimensional item response theory (MIRT) provides a\nunified framework and convenient statistical tool for item analysis,\ncalibration, and scoring. However, the computational challenge of estimating\nMIRT models prohibits their wide use because many of the extant methods can\nhardly provide results in a realistic time frame when the number of dimensions,\nsample size, and test length are large. Instead, variational estimation\nmethods, such as the Gaussian Variational Expectation Maximization (GVEM)\nalgorithm, have recently been proposed to solve the estimation challenge by\nproviding a fast and accurate solution. However, results have shown that\nvariational estimation methods may produce some bias in discrimination\nparameters during confirmatory model estimation, and this note proposes an\nimportance weighted version of GVEM (i.e., IW-GVEM) to correct for such bias\nunder MIRT models. We also use the adaptive moment estimation method to update\nthe learning rate for gradient descent automatically. Our simulations show that\nIW-GVEM can effectively correct bias with a modest increase in computation time,\ncompared with GVEM. The proposed method may also shed light on improving the\nvariational estimation for other psychometric models."}, "http://arxiv.org/abs/2310.12115": {"title": "MMD-based Variable Importance for Distributional Random Forest", "link": "http://arxiv.org/abs/2310.12115", "description": "Distributional Random Forest (DRF) is a flexible forest-based method to\nestimate the full conditional distribution of a multivariate output of interest\ngiven input variables. In this article, we introduce a variable importance\nalgorithm for DRFs, based on the well-established drop and relearn principle\nand MMD distance. While traditional importance measures only detect variables\nwith an influence on the output mean, our algorithm detects variables impacting\nthe output distribution more generally. We show that the introduced importance\nmeasure is consistent, exhibits high empirical performance on both real and\nsimulated data, and outperforms competitors. In particular, our algorithm is\nhighly efficient at selecting variables through recursive feature elimination, and\ncan therefore provide small sets of variables to build accurate estimates of\nconditional output distributions."}, "http://arxiv.org/abs/2310.12140": {"title": "Online Estimation with Rolling Validation: Adaptive Nonparametric Estimation with Stream Data", "link": "http://arxiv.org/abs/2310.12140", "description": "Online nonparametric estimators are gaining popularity due to their efficient\ncomputation and competitive generalization abilities. Important examples\ninclude variants of stochastic gradient descent. These algorithms often take\none sample point at a time and instantly update the parameter estimate of\ninterest. In this work we consider model selection and hyperparameter tuning\nfor such online algorithms. We propose a weighted rolling-validation procedure,\nan online variant of leave-one-out cross-validation, that costs minimal extra\ncomputation for many typical stochastic gradient descent estimators. Similar to\nbatch cross-validation, it can boost base estimators to achieve a better,\nadaptive convergence rate. Our theoretical analysis is straightforward, relying\nmainly on some general statistical stability assumptions. 
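As a rough illustration of weighted rolling validation for streaming estimators, here is a minimal sketch; the linear-model SGD update, the squared one-step-ahead error, and the polynomially diverging weights t**gamma are placeholder assumptions, not the paper's exact procedure.

import numpy as np

def sgd_step(theta, x, y, lr):
    # one least-squares SGD update for a linear model
    return theta - lr * (theta @ x - y) * x

def rolling_validation(stream, lrs, dim, gamma=0.5):
    # each incoming point first scores every candidate (one-step-ahead error,
    # weighted by t**gamma), and is then used to update all candidates
    thetas = {lr: np.zeros(dim) for lr in lrs}
    scores = {lr: 0.0 for lr in lrs}
    for t, (x, y) in enumerate(stream, start=1):
        for lr in lrs:
            scores[lr] += t ** gamma * (thetas[lr] @ x - y) ** 2
            thetas[lr] = sgd_step(thetas[lr], x, y, lr)
    best = min(lrs, key=scores.get)
    return best, thetas[best]

# usage on a toy stream y = x @ beta + noise
rng = np.random.default_rng(1)
beta = np.array([1.0, -2.0, 0.5])
stream = [(x, x @ beta + 0.1 * rng.normal()) for x in rng.normal(size=(500, 3))]
best_lr, theta_hat = rolling_validation(stream, lrs=[0.01, 0.05, 0.1], dim=3)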
The simulation study\nunderscores the significance of diverging weights in rolling validation in\npractice and demonstrates its sensitivity even when there is only a slim\ndifference between candidate estimators."}, "http://arxiv.org/abs/2010.02968": {"title": "Modelling of functional profiles and explainable shape shifts detection: An approach combining the notion of the Fr\\'echet mean with the shape invariant model", "link": "http://arxiv.org/abs/2010.02968", "description": "A modelling framework suitable for detecting shape shifts in functional\nprofiles combining the notion of Fr\\'echet mean and the concept of deformation\nmodels is developed and proposed. The generalized mean sense offered by the\nFr\\'echet mean notion is employed to capture the typical pattern of the\nprofiles under study, while the concept of deformation models, and in\nparticular of the shape invariant model, allows for interpretable\nparameterizations of a profile's deviations from the typical shape. EWMA-type\ncontrol charts compatible with the functional nature of data and the employed\ndeformation model are built and proposed, exploiting certain shape\ncharacteristics of the profiles under study with respect to the generalized\nmean sense, allowing for the identification of potential shifts concerning the\nshape and/or the deformation process. Potential shifts in the shape deformation\nprocess are further distinguished into significant shifts with respect to\namplitude and/or phase of the profile under study. The proposed modelling\nand shift detection framework is applied to a real-world case study, where\ndaily concentration profiles concerning air pollutants from an area in the city\nof Athens are modelled, while profiles indicating hazardous concentration\nlevels are successfully identified in most cases."}, "http://arxiv.org/abs/2207.07218": {"title": "On the Selection of Tuning Parameters for Patch-Stitching Embedding Methods", "link": "http://arxiv.org/abs/2207.07218", "description": "While classical scaling, just like principal component analysis, is\nparameter-free, other methods for embedding multivariate data require the\nselection of one or several tuning parameters. This tuning can be difficult due\nto the unsupervised nature of the situation. We propose a simple, almost\nobvious, approach to supervise the choice of tuning parameter(s): minimize a\nnotion of stress. We apply this approach to the selection of the patch size in\na prototypical patch-stitching embedding method, both in the multidimensional\nscaling (aka network localization) setting and in the dimensionality reduction\n(aka manifold learning) setting. In our study, we uncover a new bias--variance\ntradeoff phenomenon."}, "http://arxiv.org/abs/2303.17856": {"title": "Bootstrapping multiple systems estimates to account for model selection", "link": "http://arxiv.org/abs/2303.17856", "description": "Multiple systems estimation using a Poisson loglinear model is a standard\napproach to quantifying hidden populations where data sources are based on\nlists of known cases. Information criteria are often used for selecting between\nthe large number of possible models. Confidence intervals are often reported\nconditional on the model selected, providing an over-optimistic impression of\nestimation accuracy. A bootstrap approach is a natural way to account for the\nmodel selection. 
However, because the model selection step has to be carried\nout for every bootstrap replication, there may be a high or even prohibitive\ncomputational burden. We explore the merit of modifying the model selection\nprocedure in the bootstrap to look only among a subset of models, chosen on the\nbasis of their information criterion score on the original data. This provides\nlarge computational gains with little apparent effect on inference. We also\nincorporate rigorous and economical ways of approaching issues of the existence\nof estimators when applying the method to sparse data tables."}, "http://arxiv.org/abs/2308.07319": {"title": "Partial identification for discrete data with nonignorable missing outcomes", "link": "http://arxiv.org/abs/2308.07319", "description": "Nonignorable missing outcomes are common in real world datasets and often\nrequire strong parametric assumptions to achieve identification. These\nassumptions can be implausible or untestable, and so we may forgo them in\nfavour of partially identified models that narrow the set of a priori possible\nvalues to an identification region. Here we propose a new nonparametric Bayes\nmethod that allows for the incorporation of multiple clinically relevant\nrestrictions of the parameter space simultaneously. We focus on two common\nrestrictions, instrumental variables and the direction of missing data bias,\nand investigate how these restrictions narrow the identification region for\nparameters of interest. Additionally, we propose a rejection sampling algorithm\nthat allows us to quantify the evidence for these assumptions in the data. We\ncompare our method to a standard Heckman selection model in both simulation\nstudies and in an applied problem examining the effectiveness of cash-transfers\nfor people experiencing homelessness."}, "http://arxiv.org/abs/2310.12285": {"title": "Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2310.12285", "description": "High-dimensional longitudinal data is increasingly used in a wide range of\nscientific studies. However, there are few statistical methods for\nhigh-dimensional linear mixed models (LMMs), as most Bayesian variable\nselection or penalization methods are designed for independent observations.\nAdditionally, the few available software packages for high-dimensional LMMs\nsuffer from scalability issues. This work presents an efficient and accurate\nBayesian framework for high-dimensional LMMs. We use empirical Bayes estimators\nof hyperparameters for increased flexibility and an\nExpectation-Conditional-Minimization (ECM) algorithm for computationally\nefficient maximum a posteriori probability (MAP) estimation of parameters. The\nnovelty of the approach lies in its partitioning and parameter expansion as\nwell as its fast and scalable computation. We illustrate Linear Mixed Modeling\nwith PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies\nevaluating fixed and random effects estimation along with computation time. 
A\nreal-world example is provided using data from a study of lupus in children,\nwhere we identify genes and clinical factors associated with a new lupus\nbiomarker and predict the biomarker over time."}, "http://arxiv.org/abs/2310.12348": {"title": "Goodness--of--Fit Tests Based on the Min--Characteristic Function", "link": "http://arxiv.org/abs/2310.12348", "description": "We propose tests of fit for classes of distributions that include the\nWeibull, the Pareto, and the Fr\\'echet distributions. The new tests employ the\nnovel tool of the min--characteristic function and are based on an L2--type\nweighted distance between this function and its empirical counterpart applied\nto suitably standardized data. If data--standardization is performed using the\nMLE of the distributional parameters, then the method reduces to testing for the\nstandard member of the family, with parameter values known and set equal to\none. We investigate asymptotic properties of the tests, while a Monte Carlo\nstudy is presented that includes the new procedure as well as competitors for\nthe purpose of specification testing with three extreme value distributions.\nThe new tests are also applied to a few real--data sets."}, "http://arxiv.org/abs/2310.12358": {"title": "causalBETA: An R Package for Bayesian Semiparametric Causal Inference with Event-Time Outcomes", "link": "http://arxiv.org/abs/2310.12358", "description": "Observational studies are often conducted to estimate causal effects of\ntreatments or exposures on event-time outcomes. Since treatments are not\nrandomized in observational studies, techniques from causal inference are\nrequired to adjust for confounding. Bayesian approaches to causal estimation are\ndesirable because they provide 1) prior smoothing that usefully\nregularizes causal effect estimates, 2) flexible models that are robust\nto misspecification, and 3) full inference (i.e. both point and uncertainty\nestimates) for causal estimands. However, Bayesian causal inference is\ndifficult to implement manually and there is a lack of user-friendly software,\npresenting a significant barrier to widespread use. We address this gap by\ndeveloping causalBETA (Bayesian Event Time Analysis) - an open-source R package\nfor estimating causal effects on event-time outcomes using Bayesian\nsemiparametric models. The package provides a familiar front-end to users, with\nsyntax identical to existing survival analysis R packages such as survival. At\nthe same time, it back-ends to Stan - a popular platform for Bayesian modeling\nand high performance statistical computing - for efficient posterior\ncomputation. To improve user experience, the package is built using customized\nS3 class objects and methods to facilitate visualizations and summaries of\nresults using familiar generic functions like plot() and summary(). In this\npaper, we provide the methodological details of the package, a demonstration\nusing publicly-available data, and computational guidance."}, "http://arxiv.org/abs/2310.12391": {"title": "Real-time Semiparametric Regression via Sequential Monte Carlo", "link": "http://arxiv.org/abs/2310.12391", "description": "We develop and describe online algorithms for performing real-time\nsemiparametric regression analyses. Earlier work on this topic is in Luts,\nBroderick & Wand (J. Comput. Graph. Statist., 2014) where online mean field\nvariational Bayes was employed. 
In this article we instead develop sequential\nMonte Carlo approaches to circumvent well-known inaccuracies inherent in\nvariational approaches. Even though sequential Monte Carlo is not as fast as\nonline mean field variational Bayes, it can be a viable alternative for\napplications where the data rate is not overly high. For Gaussian response\nsemiparametric regression models our new algorithms share the online mean field\nvariational Bayes property of only requiring updating and storage of sufficient\nstatistics quantities of streaming data. In the non-Gaussian case accurate\nreal-time semiparametric regression requires the full data to be kept in\nstorage. The new algorithms allow for new options concerning accuracy/speed\ntrade-offs for real-time semiparametric regression."}, "http://arxiv.org/abs/2310.12402": {"title": "Data visualization and dimension reduction for metric-valued response regression", "link": "http://arxiv.org/abs/2310.12402", "description": "As novel data collection becomes increasingly common, traditional dimension\nreduction and data visualization techniques are becoming inadequate to analyze\nthese complex data. A surrogate-assisted sufficient dimension reduction (SDR)\nmethod for regression with a general metric-valued response on Euclidean\npredictors is proposed. The response objects are mapped to a real-valued\ndistance matrix using an appropriate metric and then projected onto a large\nsample of random unit vectors to obtain scalar-valued surrogate responses. An\nensemble estimate of the subspaces for the regression of the surrogate\nresponses versus the predictor is used to estimate the original central space.\nUnder this framework, classical SDR methods such as ordinary least squares and\nsliced inverse regression are extended. The surrogate-assisted method applies\nto responses on compact metric spaces including but not limited to Euclidean,\ndistributional, and functional. An extensive simulation experiment demonstrates\nthe superior performance of the proposed surrogate-assisted method on synthetic\ndata compared to existing competing methods where applicable. The analysis of\nthe distributions and functional trajectories of county-level COVID-19\ntransmission rates in the U.S. as a function of demographic characteristics is\nalso provided. The theoretical justifications are included as well."}, "http://arxiv.org/abs/2310.12424": {"title": "Optimal heteroskedasticity testing in nonparametric regression", "link": "http://arxiv.org/abs/2310.12424", "description": "Heteroskedasticity testing in nonparametric regression is a classic\nstatistical problem with important practical applications, yet fundamental\nlimits are unknown. Adopting a minimax perspective, this article considers the\ntesting problem in the context of an $\\alpha$-H\\\"{o}lder mean and a\n$\\beta$-H\\\"{o}lder variance function. For $\\alpha > 0$ and $\\beta \\in (0,\n\\frac{1}{2})$, the sharp minimax separation rate $n^{-4\\alpha} +\nn^{-\\frac{4\\beta}{4\\beta+1}} + n^{-2\\beta}$ is established. To achieve the\nminimax separation rate, a kernel-based statistic using first-order squared\ndifferences is developed. 
Notably, the statistic estimates a proxy rather than\na natural quadratic functional (the squared distance between the variance\nfunction and its best $L^2$ approximation by a constant) suggested in previous\nwork.\n\nThe setting where no smoothness is assumed on the variance function is also\nstudied; the variance profile across the design points can be arbitrary.\nDespite the lack of structure, consistent testing turns out to still be\npossible by using the Gaussian character of the noise, and the minimax rate is\nshown to be $n^{-4\\alpha} + n^{-1/2}$. Exploiting noise information happens to\nbe a fundamental necessity as consistent testing is impossible if nothing more\nthan zero mean and unit variance is known about the noise distribution.\nFurthermore, in the setting where $V$ is $\\beta$-H\\\"{o}lder but\nheteroskedasticity is measured only with respect to the design points, the\nminimax separation rate is shown to be $n^{-4\\alpha} + n^{-\\left(\\frac{1}{2}\n\\vee \\frac{4\\beta}{4\\beta+1}\\right)}$ when the noise is Gaussian and\n$n^{-4\\alpha} + n^{-\\frac{4\\beta}{4\\beta+1}} + n^{-2\\beta}$ when the noise\ndistribution is unknown."}, "http://arxiv.org/abs/2310.12427": {"title": "Fast Power Curve Approximation for Posterior Analyses", "link": "http://arxiv.org/abs/2310.12427", "description": "Bayesian hypothesis testing leverages posterior probabilities, Bayes factors,\nor credible intervals to assess characteristics that summarize data. We propose\na framework for power curve approximation with such hypothesis tests that\nassumes data are generated using statistical models with fixed parameters for\nthe purposes of sample size determination. We present a fast approach to\nexplore the sampling distribution of posterior probabilities when the\nconditions for the Bernstein-von Mises theorem are satisfied. We extend that\napproach to facilitate targeted sampling from the approximate sampling\ndistribution of posterior probabilities for each sample size explored. These\nsampling distributions are used to construct power curves for various types of\nposterior analyses. Our resulting method for power curve approximation is\norders of magnitude faster than conventional power curve estimation for\nBayesian hypothesis tests. We also prove the consistency of the corresponding\npower estimates and sample size recommendations under certain conditions."}, "http://arxiv.org/abs/2310.12428": {"title": "Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach", "link": "http://arxiv.org/abs/2310.12428", "description": "We initiate a novel approach to explain the out of sample performance of\nrandom forest (RF) models by exploiting the fact that any RF can be formulated\nas an adaptive weighted K nearest-neighbors model. Specifically, we use the\nproximity between points in the feature space learned by the RF to re-write\nrandom forest predictions exactly as a weighted average of the target labels of\ntraining data points. This linearity facilitates a local notion of\nexplainability of RF predictions that generates attributions for any model\nprediction across observations in the training set, and thereby complements\nestablished methods like SHAP, which instead generates attributions for a model\nprediction across dimensions of the feature space. 
We demonstrate this approach\nin the context of a bond pricing model trained on US corporate bond trades, and\ncompare our approach to various existing approaches to model explainability."}, "http://arxiv.org/abs/2310.12460": {"title": "Linear Source Apportionment using Generalized Least Squares", "link": "http://arxiv.org/abs/2310.12460", "description": "Motivated by applications to water quality monitoring using fluorescence\nspectroscopy, we develop the source apportionment model for high dimensional\nprofiles of dissolved organic matter (DOM). We describe simple methods to\nestimate the parameters of a linear source apportionment model, and show how\nthe estimates are related to those of ordinary and generalized least squares.\nUsing this least squares framework, we analyze the variability of the\nestimates, and we propose predictors for missing elements of a DOM profile. We\ndemonstrate the practical utility of our results on fluorescence spectroscopy\ndata collected from the Neuse River in North Carolina."}, "http://arxiv.org/abs/2310.12711": {"title": "Modelling multivariate extremes through angular-radial decomposition of the density function", "link": "http://arxiv.org/abs/2310.12711", "description": "We present a new framework for modelling multivariate extremes, based on an\nangular-radial representation of the probability density function. Under this\nrepresentation, the problem of modelling multivariate extremes is transformed\nto that of modelling an angular density and the tail of the radial variable,\nconditional on angle. Motivated by univariate theory, we assume that the tail\nof the conditional radial distribution converges to a generalised Pareto (GP)\ndistribution. To simplify inference, we also assume that the angular density is\ncontinuous and finite and the GP parameter functions are continuous with angle.\nWe refer to the resulting model as the semi-parametric angular-radial (SPAR)\nmodel for multivariate extremes. We consider the effect of the choice of polar\ncoordinate system and introduce generalised concepts of angular-radial\ncoordinate systems and generalised scalar angles in two dimensions. We show\nthat under certain conditions, the choice of polar coordinate system does not\naffect the validity of the SPAR assumptions. However, some choices of\ncoordinate system lead to simpler representations. In contrast, we show that\nthe choice of margin does affect whether the model assumptions are satisfied.\nIn particular, the use of Laplace margins results in a form of the density\nfunction for which the SPAR assumptions are satisfied for many common families\nof copula, with various dependence classes. We show that the SPAR model\nprovides a more versatile framework for characterising multivariate extremes\nthan provided by existing approaches, and that several commonly-used approaches\nare special cases of the SPAR model. Moreover, the SPAR framework provides a\nmeans of characterising all `extreme regions' of a joint distribution using a\nsingle inference. Applications in which this is useful are discussed."}, "http://arxiv.org/abs/2310.12757": {"title": "Conservative Inference for Counterfactuals", "link": "http://arxiv.org/abs/2310.12757", "description": "In causal inference, the joint law of a set of counterfactual random\nvariables is generally not identified. We show that a conservative version of\nthe joint law - corresponding to the smallest treatment effect - is identified.\nFinding this law uses recent results from optimal transport theory. 
Under this\nconservative law we can bound causal effects and we may construct inferences\nfor each individual's counterfactual dose-response curve. Intuitively, this is\nthe flattest counterfactual curve for each subject that is consistent with the\ndistribution of the observables. If the outcome is univariate then, under mild\nconditions, this curve is simply the quantile function of the counterfactual\ndistribution that passes through the observed point. This curve corresponds to\na nonparametric rank preserving structural model."}, "http://arxiv.org/abs/2310.12882": {"title": "Sequential Gibbs Posteriors with Applications to Principal Component Analysis", "link": "http://arxiv.org/abs/2310.12882", "description": "Gibbs posteriors are proportional to a prior distribution multiplied by an\nexponentiated loss function, with a key tuning parameter weighting information\nin the loss relative to the prior and providing a control of posterior\nuncertainty. Gibbs posteriors provide a principled framework for\nlikelihood-free Bayesian inference, but in many situations, including a single\ntuning parameter inevitably leads to poor uncertainty quantification. In\nparticular, regardless of the value of the parameter, credible regions have coverage far\nfrom the nominal frequentist level even in large samples. We propose a\nsequential extension to Gibbs posteriors to address this problem. We prove the\nproposed sequential posterior exhibits concentration and a Bernstein-von Mises\ntheorem, which holds under easy-to-verify conditions in Euclidean space and on\nmanifolds. As a byproduct, we obtain the first Bernstein-von Mises theorem for\ntraditional likelihood-based Bayesian posteriors on manifolds. All methods are\nillustrated with an application to principal component analysis."}, "http://arxiv.org/abs/2207.06949": {"title": "Seeking the Truth Beyond the Data", "link": "http://arxiv.org/abs/2207.06949", "description": "Clustering is an unsupervised machine learning methodology where unlabeled\nelements/objects are grouped together with the aim of constructing\nwell-established clusters whose elements are classified according to their\nsimilarity. The goal of this process is to provide a useful aid to the\nresearcher that will help her/him identify patterns in the data. When dealing\nwith large databases, such patterns may not be easily detectable without the\nhelp of a clustering algorithm. This article provides a detailed\ndescription of the most widely used clustering methodologies, accompanied by\nguidance on suitable parameter selection and\ninitialization. At the same time, the article is not only a review\nhighlighting the major elements of the examined clustering techniques but also\na comparison of these algorithms' clustering efficiency on 3\ndatasets, revealing their weaknesses and capabilities in terms of accuracy\nand complexity when confronted with discrete and continuous\nobservations. The results help us draw conclusions about\nthe appropriateness of the examined clustering techniques in relation to\nthe dataset's size."}, "http://arxiv.org/abs/2208.07831": {"title": "Structured prior distributions for the covariance matrix in latent factor models", "link": "http://arxiv.org/abs/2208.07831", "description": "Factor models are widely used for dimension reduction in the analysis of\nmultivariate data. This is achieved through decomposition of a p x p covariance\nmatrix into the sum of two components. 
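In symbols, the decomposition referred to here is the usual one, $\Sigma = \Lambda \Lambda^{\top} + \Psi$, where $\Lambda$ is a $p \times k$ factor loadings matrix, $\Psi = \operatorname{diag}(\psi_1, \ldots, \psi_p)$ collects the idiosyncratic variances, and $k \ll p$ gives a low-rank-plus-diagonal (sparse) factorisation of the covariance matrix.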
Through a latent factor representation,\nthey can be interpreted as a diagonal matrix of idiosyncratic variances and a\nshared variation matrix, that is, the product of a p x k factor loadings matrix\nand its transpose. If k << p, this defines a sparse factorisation of the\ncovariance matrix. Historically, little attention has been paid to\nincorporating prior information in Bayesian analyses using factor models where,\nat best, the prior for the factor loadings is order invariant. In this work, a\nclass of structured priors is developed that can encode ideas of dependence\nstructure about the shared variation matrix. The construction allows\ndata-informed shrinkage towards sensible parametric structures while also\nfacilitating inference over the number of factors. Using an unconstrained\nreparameterisation of stationary vector autoregressions, the methodology is\nextended to stationary dynamic factor models. For computational inference,\nparameter-expanded Markov chain Monte Carlo samplers are proposed, including an\nefficient adaptive Gibbs sampler. Two substantive applications showcase the\nscope of the methodology and its inferential benefits."}, "http://arxiv.org/abs/2211.01746": {"title": "Log-density gradient covariance and automatic metric tensors for Riemann manifold Monte Carlo methods", "link": "http://arxiv.org/abs/2211.01746", "description": "A metric tensor for Riemann manifold Monte Carlo particularly suited for\nnon-linear Bayesian hierarchical models is proposed. The metric tensor is built\nfrom symmetric positive semidefinite log-density gradient covariance (LGC)\nmatrices, which are also proposed and further explored here. The LGCs\ngeneralize the Fisher information matrix by measuring the joint information\ncontent and dependence structure of both a random variable and the parameters\nof said variable. Consequently, positive definite Fisher/LGC-based metric\ntensors may be constructed not only from the observation likelihoods as is\ncurrent practice, but also from arbitrarily complicated non-linear prior/latent\nvariable structures, provided the LGC may be derived for each conditional\ndistribution used to construct said structures. The proposed methodology is\nhighly automatic and allows for exploitation of any sparsity associated with\nthe model in question. When implemented in conjunction with a Riemann manifold\nvariant of the recently proposed numerical generalized randomized Hamiltonian\nMonte Carlo processes, the proposed methodology is highly competitive, in\nparticular for the more challenging target distributions associated with\nBayesian hierarchical models."}, "http://arxiv.org/abs/2211.02383": {"title": "Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity", "link": "http://arxiv.org/abs/2211.02383", "description": "Simulation-based calibration checking (SBC) is a practical method to validate\ncomputationally-derived posterior distributions or their approximations. In\nthis paper, we introduce a new variant of SBC to alleviate several known\nproblems. Our variant allows the user to in principle detect any possible issue\nwith the posterior, while previously reported implementations could never\ndetect large classes of problems including when the posterior is equal to the\nprior. This is made possible by including additional data-dependent test\nquantities when running SBC. We argue and demonstrate that the joint likelihood\nof the data is an especially useful test quantity. 
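To make the mechanics concrete, here is a minimal SBC-style rank check on a toy conjugate model; the normal-normal model, the exact posterior draws, and the log-likelihood test quantity are illustrative assumptions, not the SBC package's implementation.

import numpy as np

def sbc_ranks(n_sims=1000, n_draws=99, n_obs=20, seed=0):
    # SBC-style rank check for a conjugate normal-normal model, using the
    # joint log-likelihood as a data-dependent test quantity
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal()                              # draw from the prior N(0, 1)
        y = rng.normal(theta, 1.0, size=n_obs)            # simulate data given theta
        post_var = 1.0 / (1.0 + n_obs)                    # exact conjugate posterior
        post_mean = post_var * y.sum()
        theta_m = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
        ll_true = -0.5 * np.sum((y - theta) ** 2)         # test quantity at the prior draw
        ll_post = -0.5 * np.sum((y[None, :] - theta_m[:, None]) ** 2, axis=1)
        ranks[s] = np.sum(ll_post < ll_true)              # rank among posterior draws
    return ranks                                          # ~uniform on {0, ..., n_draws} if calibrated

print(np.histogram(sbc_ranks(), bins=10)[0])              # roughly flat histogram when all is well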
Some other types of test\nquantities and their theoretical and practical benefits are also investigated.\nWe provide theoretical analysis of SBC, thereby providing a more complete\nunderstanding of the underlying statistical mechanisms. We also bring attention\nto a relatively common mistake in the literature and clarify the difference\nbetween SBC and checks based on the data-averaged posterior. We support our\nrecommendations with numerical case studies on a multivariate normal example\nand a case study in implementing an ordered simplex data type for use with\nHamiltonian Monte Carlo. The SBC variant introduced in this paper is\nimplemented in the $\\mathtt{SBC}$ R package."}, "http://arxiv.org/abs/2310.13081": {"title": "Metastable Hidden Markov Processes: a theory for modeling financial markets", "link": "http://arxiv.org/abs/2310.13081", "description": "The modeling of financial time series by hidden Markov models has been\nperformed successfully in the literature. In this paper, we propose a theory\nthat justifies such a modeling under the assumption that there exists a market\nformed by agents whose states evolve on time as an interacting Markov system\nthat has a metastable behavior described by the hidden Markov chain. This\ntheory is a rare application of metastability outside the modeling of physical\nsystems, and may inspire the development of new interacting Markov systems with\nfinancial constraints. In the context of financial economics and causal factor\ninvestment, the theory implies a new paradigm in which fluctuations in\ninvestment performance are primarily driven by the state of the market, rather\nthan being directly caused by other variables. Even though the usual approach\nto causal factor investment based on causal inference is not completely\ninconsistent with the proposed theory, the latter has the advantage of\naccounting for the non-stationary evolution of the time series through the\nchange between hidden market states. By accounting for this possibility, one\ncan more effectively assess risks and implement mitigation strategies."}, "http://arxiv.org/abs/2310.13162": {"title": "Network Meta-Analysis of Time-to-Event Endpoints with Individual Participant Data using Restricted Mean Survival Time Regression", "link": "http://arxiv.org/abs/2310.13162", "description": "Restricted mean survival time (RMST) models have gained popularity when\nanalyzing time-to-event outcomes because RMST models offer more straightforward\ninterpretations of treatment effects with fewer assumptions than hazard ratios\ncommonly estimated from Cox models. However, few network meta-analysis (NMA)\nmethods have been developed using RMST. In this paper, we propose advanced RMST\nNMA models when individual participant data are available. Our models allow us\nto study treatment effect moderation and provide comprehensive understanding\nabout comparative effectiveness of treatments and subgroup effects. 
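As a reminder, the RMST at horizon tau is the area under the survival curve up to tau, i.e. an estimate of E[min(T, tau)]. Below is a small illustrative sketch using a hand-rolled Kaplan-Meier integration on simulated two-arm data; it is not the proposed NMA model.

import numpy as np

def rmst(times, events, tau):
    # restricted mean survival time up to tau: area under the Kaplan-Meier
    # curve (ties handled crudely for brevity)
    order = np.argsort(times)
    t, d = np.asarray(times, float)[order], np.asarray(events, int)[order]
    at_risk = len(t)
    grid, surv = [0.0], [1.0]            # step function: S(u) = surv[j] on [grid[j], grid[j+1])
    for ti, di in zip(t, d):
        if di == 1:
            surv.append(surv[-1] * (1.0 - 1.0 / at_risk))
            grid.append(ti)
        at_risk -= 1
    grid.append(np.inf)
    area = 0.0
    for j in range(len(surv)):
        lo, hi = grid[j], min(grid[j + 1], tau)
        if hi > lo:
            area += surv[j] * (hi - lo)
    return area

# toy two-arm comparison: difference in RMST at tau = 12
rng = np.random.default_rng(0)
t0, t1 = rng.exponential(10, 200), rng.exponential(14, 200)
c = rng.exponential(25, 200)                               # independent censoring
rmst_diff = rmst(np.minimum(t1, c), t1 <= c, 12) - rmst(np.minimum(t0, c), t0 <= c, 12)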
An\nextensive simulation study and a real data example about treatments for\npatients with atrial fibrillation are presented."}, "http://arxiv.org/abs/2310.13178": {"title": "Exact Inference for Common Odds Ratio in Meta-Analysis with Zero-Total-Event Studies", "link": "http://arxiv.org/abs/2310.13178", "description": "Stemming from the high-profile publication of Nissen and Wolski (2007) and\nsubsequent discussions with divergent views on how to handle observed\nzero-total-event studies, defined to be studies which observe zero events in\nboth treatment and control arms, the research topic concerning the common odds\nratio model with zero-total-event studies remains an unresolved problem\nin meta-analysis. In this article, we address this problem by proposing a novel\nrepro samples method to handle zero-total-event studies and make inference for\nthe common odds ratio parameter. The development explicitly accounts for\nthe sampling scheme and does not rely on large-sample approximations. It is\ntheoretically justified with guaranteed finite-sample performance. The\nempirical performance of the proposed method is demonstrated through simulation\nstudies. The results show that the proposed confidence set achieves the desired\nempirical coverage rate and also that zero-total-event studies contain\ninformation and impact the inference for the common odds ratio. The proposed\nmethod is applied to combine information in the Nissen and Wolski study."}, "http://arxiv.org/abs/2310.13232": {"title": "Interaction Screening and Pseudolikelihood Approaches for Tensor Learning in Ising Models", "link": "http://arxiv.org/abs/2310.13232", "description": "In this paper, we study two well-known methods of Ising structure learning,\nnamely the pseudolikelihood approach and the interaction screening approach, in\nthe context of tensor recovery in $k$-spin Ising models. We show that both\nthese approaches, with proper regularization, retrieve the underlying\nhypernetwork structure using a sample size logarithmic in the number of network\nnodes, and exponential in the maximum interaction strength and maximum\nnode-degree. We also track down the exact dependence of the rate of tensor\nrecovery on the interaction order $k$, that is allowed to grow with the number\nof samples and nodes, for both approaches. Finally, we provide a\ncomparative discussion of the performance of the two approaches based on\nsimulation studies, which also demonstrate the exponential dependence of the\ntensor recovery rate on the maximum coupling strength."}, "http://arxiv.org/abs/2310.13387": {"title": "Assumption violations in causal discovery and the robustness of score matching", "link": "http://arxiv.org/abs/2310.13387", "description": "When domain knowledge is limited and experimentation is restricted by\nethical, financial, or time constraints, practitioners turn to observational\ncausal discovery methods to recover the causal structure, exploiting the\nstatistical properties of their data. Because causal discovery without further\nassumptions is an ill-posed problem, each algorithm comes with its own set of\nusually untestable assumptions, some of which are hard to meet in real\ndatasets. Motivated by these considerations, this paper extensively benchmarks\nthe empirical performance of recent causal discovery methods on observational\ni.i.d. data generated under different background conditions, allowing for\nviolations of the critical assumptions required by each selected approach. 
Our\nexperimental findings show that score matching-based methods demonstrate\nsurprising performance in the false positive and false negative rate of the\ninferred graph in these challenging scenarios, and we provide theoretical\ninsights into their performance. This work is also the first effort to\nbenchmark the stability of causal discovery algorithms with respect to the\nvalues of their hyperparameters. Finally, we hope this paper will set a new\nstandard for the evaluation of causal discovery methods and can serve as an\naccessible entry point for practitioners interested in the field, highlighting\nthe empirical implications of different algorithm choices."}, "http://arxiv.org/abs/2310.13444": {"title": "Testing for the extent of instability in nearly unstable processes", "link": "http://arxiv.org/abs/2310.13444", "description": "This paper deals with unit root issues in time series analysis. It has been\nknown for a long time that unit root tests may be flawed when a series although\nstationary has a root close to unity. That motivated recent papers dedicated to\nautoregressive processes where the bridge between stability and instability is\nexpressed by means of time-varying coefficients. In this vein the process we\nconsider has a companion matrix $A_{n}$ with spectral radius $\\rho(A_{n}) < 1$\nsatisfying $\\rho(A_{n}) \\rightarrow 1$, a situation that we describe as `nearly\nunstable'. The question we investigate is the following: given an observed path\nsupposed to come from a nearly-unstable process, is it possible to test for the\n`extent of instability', \\textit{i.e.} to test how close we are to the unit\nroot? In this regard, we develop a strategy to evaluate $\\alpha$ and to test\nfor $\\mathcal{H}_0 : \"\\alpha = \\alpha_0\"$ against $\\mathcal{H}_1 : \"\\alpha >\n\\alpha_0\"$ when $\\rho(A_{n})$ lies in an inner $O(n^{-\\alpha})$-neighborhood of\nthe unity, for some $0 < \\alpha < 1$. Empirical evidence is given (on\nsimulations and real time series) about the advantages of the flexibility\ninduced by such a procedure compared to the usual unit root tests and their\nbinary responses. As a by-product, we also build a symmetric procedure for the\nusually left out situation where the dominant root lies around $-1$."}, "http://arxiv.org/abs/2310.13446": {"title": "Simple binning algorithm and SimDec visualization for comprehensive sensitivity analysis of complex computational models", "link": "http://arxiv.org/abs/2310.13446", "description": "Models of complex technological systems inherently contain interactions and\ndependencies among their input variables that affect their joint influence on\nthe output. Such models are often computationally expensive and few sensitivity\nanalysis methods can effectively process such complexities. Moreover, the\nsensitivity analysis field as a whole pays limited attention to the nature of\ninteraction effects, whose understanding can prove to be critical for the\ndesign of safe and reliable systems. In this paper, we introduce and\nextensively test a simple binning approach for computing sensitivity indices\nand demonstrate how complementing it with the smart visualization method,\nsimulation decomposition (SimDec), can permit important insights into the\nbehavior of complex engineering models. The simple binning approach computes\nfirst-, second-order effects, and a combined sensitivity index, and is\nconsiderably more computationally efficient than Sobol' indices. 
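A minimal sketch of the binning idea for a first-order index is given below; the quantile bins, bin count, and toy model are assumptions chosen for illustration, and the paper's second-order and combined indices are not reproduced here.

import numpy as np

def first_order_binning(x, y, n_bins=20):
    # binning estimate of the first-order sensitivity index
    # S = Var(E[Y | X]) / Var(Y): bin x, take the variance of bin means of y
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    bin_means = np.array([y[idx == b].mean() for b in range(n_bins) if np.any(idx == b)])
    bin_counts = np.array([np.sum(idx == b) for b in range(n_bins) if np.any(idx == b)])
    cond_mean_var = np.average((bin_means - y.mean()) ** 2, weights=bin_counts)
    return cond_mean_var / y.var()

# toy model: Y = X1 + 0.5 * X2 * X3 (X1 carries essentially all first-order effect)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100_000, 3))
Y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2]
print([round(first_order_binning(X[:, i], Y), 3) for i in range(3)])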
Taken together, the sensitivity analysis framework provides an efficient and intuitive way\nto analyze the behavior of complex systems containing interactions and\ndependencies."}, "http://arxiv.org/abs/2310.13487": {"title": "Two-stage weighted least squares estimator of multivariate discrete-valued observation-driven models", "link": "http://arxiv.org/abs/2310.13487", "description": "In this work, a general semi-parametric multivariate model is introduced in which the first two\nconditional moments are assumed to be multivariate time series.\nThe focus of the estimation is the conditional mean parameter vector for\ndiscrete-valued distributions. Quasi-Maximum Likelihood Estimators (QMLEs)\nbased on the linear exponential family are typically employed for such\nestimation problems when the true multivariate conditional probability\ndistribution is unknown or too complex. Although QMLEs provide consistent\nestimates, they may be inefficient. In this paper, novel two-stage Multivariate\nWeighted Least Square Estimators (MWLSEs) are introduced which enjoy the same\nconsistency property as the QMLEs but can provide improved efficiency with a\nsuitable choice of the covariance matrix of the observations. The proposed\nmethod allows for a more accurate estimation of model parameters, in particular\nfor count and categorical data when maximum likelihood estimation is\ninfeasible. Moreover, consistency and asymptotic normality of MWLSEs are\nderived. The estimation performance of QMLEs and MWLSEs is compared through\nsimulation experiments and a real data application, showing superior accuracy\nof the proposed methodology."}, "http://arxiv.org/abs/2310.13511": {"title": "Dynamic Realized Minimum Variance Portfolio Models", "link": "http://arxiv.org/abs/2310.13511", "description": "This paper introduces a dynamic minimum variance portfolio (MVP) model using\nnonlinear volatility dynamic models, based on high-frequency financial data.\nSpecifically, we impose an autoregressive dynamic structure on MVP processes,\nwhich helps capture the MVP dynamics directly. To evaluate the dynamic MVP\nmodel, we estimate the inverse volatility matrix using the constrained\n$\\ell_1$-minimization for inverse matrix estimation (CLIME) and calculate daily\nrealized non-normalized MVP weights. Based on the realized non-normalized MVP\nweight estimator, we propose the dynamic MVP model, which we call the dynamic\nrealized minimum variance portfolio (DR-MVP) model. To estimate a large number\nof parameters, we employ the least absolute shrinkage and selection operator\n(LASSO) to predict the future MVP and establish its asymptotic properties.\nUsing high-frequency trading data, we apply the proposed method to MVP\nprediction."}, "http://arxiv.org/abs/2310.13580": {"title": "Bayesian Hierarchical Modeling for Bivariate Multiscale Spatial Data with Application to Blood Test Monitoring", "link": "http://arxiv.org/abs/2310.13580", "description": "In public health applications, spatial data collected are often recorded at\ndifferent spatial scales and over different correlated variables. Spatial\nchange of support is a key inferential problem in these applications and has\nbecome standard in univariate settings; however, it is less standard in\nmultivariate settings. There are several existing multivariate spatial models\nthat can be easily combined with a multiscale spatial approach to analyze\nmultivariate multiscale spatial data. 
In this paper, we propose three new\nmodels from such combinations for bivariate multiscale spatial data in a\nBayesian context. In particular, we extend spatial random effects models,\nmultivariate conditional autoregressive models, and ordered hierarchical models\nthrough a multiscale spatial approach. We run simulation studies for the three\nmodels and compare them in terms of prediction performance and computational\nefficiency. We motivate our models through an analysis of 2015 Texas annual\naverage percentage receiving two blood tests from the Dartmouth Atlas Project."}, "http://arxiv.org/abs/2102.13209": {"title": "Wielding Occam's razor: Fast and frugal retail forecasting", "link": "http://arxiv.org/abs/2102.13209", "description": "The algorithms available for retail forecasting have increased in complexity.\nNewer methods, such as machine learning, are inherently complex. The more\ntraditional families of forecasting models, such as exponential smoothing and\nautoregressive integrated moving averages, have expanded to contain multiple\npossible forms and forecasting profiles. We question complexity in forecasting\nand the need to consider such large families of models. Our argument is that\nparsimoniously identifying suitable subsets of models will not decrease\nforecasting accuracy nor will it reduce the ability to estimate forecast\nuncertainty. We propose a framework that balances forecasting performance\nversus computational cost, resulting in the consideration of only a reduced set\nof models. We empirically demonstrate that a reduced set performs well.\nFinally, we translate computational benefits to monetary cost savings and\nenvironmental impact and discuss the implications of our results in the context\nof large retailers."}, "http://arxiv.org/abs/2211.04666": {"title": "Fast and Locally Adaptive Bayesian Quantile Smoothing using Calibrated Variational Approximations", "link": "http://arxiv.org/abs/2211.04666", "description": "Quantiles are useful characteristics of random variables that can provide\nsubstantial information on distributions compared with commonly used summary\nstatistics such as means. In this paper, we propose a Bayesian quantile trend\nfiltering method to estimate non-stationary trend of quantiles. We introduce\ngeneral shrinkage priors to induce locally adaptive Bayesian inference on\ntrends and mixture representation of the asymmetric Laplace likelihood. To\nquickly compute the posterior distribution, we develop calibrated mean-field\nvariational approximations to guarantee that the frequentist coverage of\ncredible intervals obtained from the approximated posterior is a specified\nnominal level. Simulation and empirical studies show that the proposed\nalgorithm is computationally much more efficient than the Gibbs sampler and\ntends to provide stable inference results, especially for high/low quantiles."}, "http://arxiv.org/abs/2305.17631": {"title": "A Bayesian Approach for Clustering Constant-wise Change-point Data", "link": "http://arxiv.org/abs/2305.17631", "description": "Change-point models deal with ordered data sequences. Their primary goal is\nto infer the locations where an aspect of the data sequence changes. In this\npaper, we propose and implement a nonparametric Bayesian model for clustering\nobservations based on their constant-wise change-point profiles via Gibbs\nsampler. 
Our model incorporates a Dirichlet Process on the constant-wise\nchange-point structures to cluster observations while simultaneously performing\nchange-point estimation. Additionally, our approach controls the number of\nclusters in the model without requiring the number of\nclusters to be specified a priori. Our method's performance is evaluated on simulated data\nunder various scenarios and on a real dataset from single-cell genomic\nsequencing."}, "http://arxiv.org/abs/2306.06342": {"title": "Distribution-free inference with hierarchical data", "link": "http://arxiv.org/abs/2306.06342", "description": "This paper studies distribution-free inference in settings where the data set\nhas a hierarchical structure -- for example, groups of observations, or\nrepeated measurements. In such settings, standard notions of exchangeability\nmay not hold. To address this challenge, a hierarchical form of exchangeability\nis derived, facilitating extensions of distribution-free methods, including\nconformal prediction and jackknife+. While the standard theoretical guarantee\nobtained by the conformal prediction framework is a marginal predictive\ncoverage guarantee, in the special case of independent repeated measurements,\nit is possible to achieve a stronger form of coverage -- the \"second-moment\ncoverage\" property -- to provide better control of conditional miscoverage\nrates, and distribution-free prediction sets that achieve this property are\nconstructed. Simulations illustrate that this guarantee indeed leads to\nuniformly small conditional miscoverage rates. Empirically, this stronger\nguarantee comes at the cost of a larger width of the prediction set in\nscenarios where the fitted model is poorly calibrated, but this cost is very\nmild in cases where the fitted model is accurate."}, "http://arxiv.org/abs/2307.15205": {"title": "Robust graph-based methods for overcoming the curse of dimensionality", "link": "http://arxiv.org/abs/2307.15205", "description": "Graph-based two-sample tests and graph-based change-point detection that\nutilize a similarity graph provide a powerful tool for analyzing\nhigh-dimensional and non-Euclidean data as these methods do not impose\ndistributional assumptions on data and have good performance across various\nscenarios. Current graph-based tests that deliver efficacy across a broad\nspectrum of alternatives typically rely on the $K$-nearest neighbor graph or\nthe $K$-minimum spanning tree. However, these graphs can be vulnerable in\nhigh dimensions due to the curse of dimensionality. To mitigate this\nissue, we propose to use a robust graph that is considerably less influenced by\nthe curse of dimensionality. We also establish a theoretical foundation for\ngraph-based methods utilizing this proposed robust graph and demonstrate its\nconsistency under fixed alternatives for both low-dimensional and\nhigh-dimensional data."}, "http://arxiv.org/abs/2310.13764": {"title": "Random Flows of Covariance Operators and their Statistical Inference", "link": "http://arxiv.org/abs/2310.13764", "description": "We develop a statistical framework for conducting inference on collections of\ntime-varying covariance operators (covariance flows) over a general, possibly\ninfinite dimensional, Hilbert space. We model the intrinsically non-linear\nstructure of covariances by means of the Bures-Wasserstein metric geometry. 
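For reference, in the finite-dimensional (matrix) case the Bures-Wasserstein distance between two covariance matrices has a closed form, sketched below with scipy's matrix square root; this is only the metric the framework builds on, not the inference procedure itself.

import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(A, B):
    # d(A, B)^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}) for PSD matrices
    rootA = sqrtm(A)
    cross = sqrtm(rootA @ B @ rootA)
    d2 = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return np.sqrt(max(np.real(d2), 0.0))    # clip tiny negatives from round-off

# two nearby covariance matrices (the matrix case of a covariance flow)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
A = np.cov(X, rowvar=False)
B = A + 0.1 * np.eye(5)
print(bures_wasserstein(A, B))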
We\nmake use of the Riemannian-like structure induced by this metric to define a\nnotion of mean and covariance of a random flow, and develop an associated\nKarhunen-Lo\\`eve expansion. We then treat the problem of estimation and\nconstruction of functional principal components from a finite collection of\ncovariance flows. Our theoretical results are motivated by modern problems in\nfunctional data analysis, where one observes operator-valued random processes\n-- for instance when analysing dynamic functional connectivity and fMRI data,\nor when analysing multiple functional time series in the frequency domain.\nNevertheless, our framework is also novel in finite dimensions (the matrix\ncase), and we demonstrate what simplifications can be afforded then. We\nillustrate our methodology by means of simulations and a data analysis."}, "http://arxiv.org/abs/2310.13796": {"title": "Faithful graphical representations of local independence", "link": "http://arxiv.org/abs/2310.13796", "description": "Graphical models use graphs to represent conditional independence structure\nin the distribution of a random vector. In stochastic processes, graphs may\nrepresent so-called local independence or conditional Granger causality. Under\nsome regularity conditions, a local independence graph implies a set of\nindependences using a graphical criterion known as $\\delta$-separation, or\nusing its generalization, $\\mu$-separation. This is a stochastic process\nanalogue of $d$-separation in DAGs. However, there may be more independences\nthan implied by this graph and this is a violation of so-called faithfulness.\nWe characterize faithfulness in local independence graphs and give a method to\nconstruct a faithful graph from any local independence model such that the\noutput equals the true graph when Markov and faithfulness assumptions hold. We\ndiscuss various assumptions that are weaker than faithfulness, and we explore\ndifferent structure learning algorithms and their properties under varying\nassumptions."}, "http://arxiv.org/abs/2310.13826": {"title": "A p-value for Process Tracing and other N=1 Studies", "link": "http://arxiv.org/abs/2310.13826", "description": "The paper introduces a \\(p\\)-value that summarizes the evidence against a\nrival causal theory that explains an observed outcome in a single case. We show\nhow to represent the probability distribution characterizing a theorized rival\nhypothesis (the null) in the absence of randomization of treatment and when\nrelying on qualitative data, for instance when conducting process tracing. As\nin Fisher's \\autocite*{fisher1935design} original design, our \\(p\\)-value\nindicates how frequently one would find the same observations or even more\nfavorable observations under a theory that is compatible with our observations\nbut antagonistic to the working hypothesis. We also present an extension that\nallows researchers to assess the sensitivity of their results to confirmation\nbias. Finally, we illustrate the application of our hypothesis test using the\nstudy by Snow \\autocite*{Snow1855} about the cause of Cholera in Soho, a\nclassic in Process Tracing, Epidemiology, and Microbiology. 
Our framework suits\nany type of case study and evidence, such as data from interviews, archives,\nor participant observation."}, "http://arxiv.org/abs/2310.13858": {"title": "Likelihood-based surrogate dimension reduction", "link": "http://arxiv.org/abs/2310.13858", "description": "We consider the problem of surrogate sufficient dimension reduction, that is,\nestimating the central subspace of a regression model, when the covariates are\ncontaminated by measurement error. When no measurement error is present, a\nlikelihood-based dimension reduction method that relies on maximizing the\nlikelihood of a Gaussian inverse regression model on the Grassmann manifold is\nwell-known to have superior performance to traditional inverse moment methods.\nWe propose two likelihood-based estimators for the central subspace in\nmeasurement error settings, which make different adjustments to the observed\nsurrogates. Both estimators are computed based on maximizing objective\nfunctions on the Grassmann manifold and are shown to consistently recover the\ntrue central subspace. When the central subspace is assumed to depend on only a\nfew covariates, we further propose to augment the likelihood function with a\npenalty term that induces sparsity on the Grassmann manifold to obtain sparse\nestimators. The resulting objective function has a closed-form Riemann gradient\nwhich facilitates efficient computation of the penalized estimator. We leverage\nthe state-of-the-art trust region algorithm on the Grassmann manifold to\ncompute the proposed estimators efficiently. Simulation studies and a data\napplication demonstrate the proposed likelihood-based estimators perform better\nthan inverse moment-based estimators in terms of both estimation and variable\nselection accuracy."}, "http://arxiv.org/abs/2310.13874": {"title": "A Linear Errors-in-Variables Model with Unknown Heteroscedastic Measurement Errors", "link": "http://arxiv.org/abs/2310.13874", "description": "In the classic measurement error framework, covariates are contaminated by\nindependent additive noise. This paper considers parameter estimation in such a\nlinear errors-in-variables model where the unknown measurement error\ndistribution is heteroscedastic across observations. We propose a new\ngeneralized method of moments (GMM) estimator that combines a moment correction\napproach and a phase function-based approach. The former requires distributions\nto have four finite moments, while the latter relies on covariates having\nasymmetric distributions. The new estimator is shown to be consistent and\nasymptotically normal under appropriate regularity conditions. The asymptotic\ncovariance of the estimator is derived, and the estimated standard error is\ncomputed using a fast bootstrap procedure. The GMM estimator is demonstrated to\nhave strong finite sample performance in numerical studies, especially when the\nmeasurement errors follow non-Gaussian distributions."}, "http://arxiv.org/abs/2310.13911": {"title": "Multilevel Matrix Factor Model", "link": "http://arxiv.org/abs/2310.13911", "description": "Large-scale matrix data has been widely discovered and continuously studied\nin various fields recently. Considering the multi-level factor structure and\nutilizing the matrix structure, we propose a multilevel matrix factor model\nwith both global and local factors. The global factors can affect all matrix\ntime series, whereas the local factors are only allowed to affect within each\nspecific matrix time series. 
The estimation procedures can consistently\nestimate the factor loadings and determine the number of factors. We establish\nthe asymptotic properties of the estimators. A simulation study is presented to\nillustrate the performance of the proposed estimation method. We utilize the\nmodel to analyze eight indicators across 200 stocks from ten distinct\nindustries, demonstrating the empirical utility of our proposed approach."}, "http://arxiv.org/abs/2310.13966": {"title": "Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression", "link": "http://arxiv.org/abs/2310.13966", "description": "In recent years, transfer learning has garnered significant attention in the\nmachine learning community. Its ability to leverage knowledge from related\nstudies to improve generalization performance in a target study has made it\nhighly appealing. This paper focuses on investigating the transfer learning\nproblem within the context of nonparametric regression over a reproducing\nkernel Hilbert space. The aim is to bridge the gap between practical\neffectiveness and theoretical guarantees. We specifically consider two\nscenarios: one where the transferable sources are known and another where they\nare unknown. For the known transferable source case, we propose a two-step\nkernel-based estimator by solely using kernel ridge regression. For the unknown\ncase, we develop a novel method based on an efficient aggregation algorithm,\nwhich can automatically detect and alleviate the effects of negative sources.\nThis paper provides the statistical properties of the desired estimators and\nestablishes the minimax optimal rate. Through extensive numerical experiments\non synthetic data and real examples, we validate our theoretical findings and\ndemonstrate the effectiveness of our proposed method."}, "http://arxiv.org/abs/2310.13973": {"title": "Estimation and convergence rates in the distributional single index model", "link": "http://arxiv.org/abs/2310.13973", "description": "The distributional single index model is a semiparametric regression model in\nwhich the conditional distribution functions $P(Y \\leq y | X = x) =\nF_0(\\theta_0(x), y)$ of a real-valued outcome variable $Y$ depend on\n$d$-dimensional covariates $X$ through a univariate, parametric index function\n$\\theta_0(x)$, and increase stochastically as $\\theta_0(x)$ increases. We\npropose least squares approaches for the joint estimation of $\\theta_0$ and\n$F_0$ in the important case where $\\theta_0(x) = \\alpha_0^{\\top}x$ and obtain\nconvergence rates of $n^{-1/3}$, thereby improving an existing result that\ngives a rate of $n^{-1/6}$. A simulation study indicates that the convergence\nrate for the estimation of $\\alpha_0$ might be faster. Furthermore, we\nillustrate our methods in a real data application that demonstrates the\nadvantages of shape restrictions in single index models."}, "http://arxiv.org/abs/2310.14246": {"title": "Shortcuts for causal discovery of nonlinear models by score matching", "link": "http://arxiv.org/abs/2310.14246", "description": "The use of simulated data in the field of causal discovery is ubiquitous due\nto the scarcity of annotated real data. Recently, Reisach et al., 2021\nhighlighted the emergence of patterns in simulated linear data, which displays\nincreasing marginal variance in the causal direction. 
As an ablation in their\nexperiments, Montagna et al., 2023 found that similar patterns may emerge in\nnonlinear models for the variance of the score vector $\\nabla \\log\np_{\\mathbf{X}}$, and introduced the ScoreSort algorithm. In this work, we\nformally define and characterize this score-sortability pattern of nonlinear\nadditive noise models. We find that it defines a class of identifiable\n(bivariate) causal models overlapping with nonlinear additive noise models. We\ntheoretically demonstrate the advantages of ScoreSort in terms of statistical\nefficiency compared to prior state-of-the-art score matching-based methods and\nempirically show the score-sortability of the most common synthetic benchmarks\nin the literature. Our findings highlight (1) the lack of diversity in the data as\nan important limitation in the evaluation of nonlinear causal discovery\napproaches, (2) the importance of thoroughly testing different settings within\na problem class, and (3) the importance of analyzing statistical properties in\ncausal discovery, where research is often limited to defining identifiability\nconditions of the model."}, "http://arxiv.org/abs/2310.14293": {"title": "Testing exchangeability by pairwise betting", "link": "http://arxiv.org/abs/2310.14293", "description": "In this paper, we address the problem of testing exchangeability of a\nsequence of random variables, $X_1, X_2,\\cdots$. This problem has been studied\nunder the recently popular framework of testing by betting. But the mapping of\ntesting problems to games is not one-to-one: many games can be designed for the\nsame test. Past work established that it is futile to play a single game betting\non every observation: test martingales in the data filtration are powerless.\nTwo avenues have been explored to circumvent this impossibility: betting in a\nreduced filtration (wealth is a test martingale in a coarsened filtration), or\nplaying many games in parallel (wealth is an e-process in the data filtration).\nThe former has proved to be difficult to theoretically analyze, while the\nlatter only works for binary or discrete observation spaces. Here, we introduce\na different approach that circumvents both drawbacks. We design a new (yet\nsimple) game in which we observe the data sequence in pairs. Despite the fact\nthat betting on individual observations is futile, we show that betting on\npairs of observations is not. To elaborate, we prove that our game leads to a\nnontrivial test martingale, which is interesting because it has been obtained\nby shrinking the filtration very slightly. We show that our test controls\ntype-1 error despite continuous monitoring, and achieves power one for both\nbinary and continuous observations, under a broad class of alternatives. Due to\nthe shrunk filtration, optional stopping is only allowed at even stopping\ntimes, not at odd ones: a relatively minor price. 
We provide a wide array of\nsimulations that align with our theoretical findings."}, "http://arxiv.org/abs/2310.14399": {"title": "The role of randomization inference in unraveling individual treatment effects in clinical trials: Application to HIV vaccine trials", "link": "http://arxiv.org/abs/2310.14399", "description": "Randomization inference is a powerful tool in early phase vaccine trials to\nestimate the causal effect of a regimen against a placebo or another regimen.\nTraditionally, randomization-based inference often focuses on testing either\nFisher's sharp null hypothesis of no treatment effect for any unit or Neyman's\nweak null hypothesis of no sample average treatment effect. Many recent efforts\nhave explored conducting exact randomization-based inference for other\nsummaries of the treatment effect profile, for instance, quantiles of the\ntreatment effect distribution function. In this article, we systematically\nreview methods that conduct exact, randomization-based inference for quantiles\nof individual treatment effects (ITEs) and extend some results by incorporating\nauxiliary information often available in a vaccine trial. These methods are\nsuitable for four scenarios: (i) a randomized controlled trial (RCT) where the\npotential outcomes under one regimen are constant; (ii) an RCT with no\nrestriction on any potential outcomes; (iii) an RCT with some user-specified\nbounds on potential outcomes; and (iv) a matched study comparing two\nnon-randomized, possibly confounded treatment arms. We then conduct two\nextensive simulation studies, one comparing the performance of each method in\nmany practical clinical settings and the other evaluating the usefulness of the\nmethods in ranking and advancing experimental therapies. We apply these methods\nto an early-phase clinical trial, HIV Vaccine Trials Network Study 086 (HVTN\n086), to showcase the usefulness of the methods."}, "http://arxiv.org/abs/2310.14419": {"title": "An RKHS Approach for Variable Selection in High-dimensional Functional Linear Models", "link": "http://arxiv.org/abs/2310.14419", "description": "High-dimensional functional data has become increasingly prevalent in modern\napplications such as high-frequency financial data and neuroimaging data\nanalysis. We investigate a class of high-dimensional linear regression models,\nwhere each predictor is a random element in an infinite dimensional function\nspace, and the number of functional predictors p can potentially be much\ngreater than the sample size n. Assuming that each of the unknown coefficient\nfunctions belongs to some reproducing kernel Hilbert space (RKHS), we\nregularize the fitting of the model by imposing a group elastic-net type of\npenalty on the RKHS norms of the coefficient functions. We show that our loss\nfunction is Gateaux sub-differentiable, and our functional elastic-net\nestimator exists uniquely in the product RKHS. Under suitable sparsity\nassumptions and a functional version of the irrepresentable condition, we\nestablish the variable selection consistency property of our approach. The\nproposed method is illustrated through simulation studies and a real-data\napplication from the Human Connectome Project."}, "http://arxiv.org/abs/2310.14448": {"title": "Semiparametrically Efficient Score for the Survival Odds Ratio", "link": "http://arxiv.org/abs/2310.14448", "description": "We consider a general proportional odds model for survival data under binary\ntreatment, where the functional form of the covariates is left unspecified. 
We\nderive the efficient score for the conditional survival odds ratio given the\ncovariates using modern semiparametric theory. The efficient score may be\nuseful in the development of doubly robust estimators, although computational\nchallenges remain."}, "http://arxiv.org/abs/2310.14763": {"title": "Externally Valid Policy Evaluation Combining Trial and Observational Data", "link": "http://arxiv.org/abs/2310.14763", "description": "Randomized trials are widely considered as the gold standard for evaluating\nthe effects of decision policies. Trial data is, however, drawn from a\npopulation which may differ from the intended target population and this raises\na problem of external validity (aka. generalizability). In this paper we seek\nto use trial data to draw valid inferences about the outcome of a policy on the\ntarget population. Additional covariate data from the target population is used\nto model the sampling of individuals in the trial study. We develop a method\nthat yields certifiably valid trial-based policy evaluations under any\nspecified range of model miscalibrations. The method is nonparametric and the\nvalidity is assured even with finite samples. The certified policy evaluations\nare illustrated using both simulated and real data."}, "http://arxiv.org/abs/2310.14922": {"title": "The Complex Network Patterns of Human Migration at Different Geographical Scales: Network Science meets Regression Analysis", "link": "http://arxiv.org/abs/2310.14922", "description": "Migration's influence in shaping population dynamics in times of impending\nclimate and population crises exposes its crucial role in upholding societal\ncohesion. As migration impacts virtually all aspects of life, it continues to\nrequire attention across scientific disciplines. This study delves into two\ndistinctive substrates of Migration Studies: the \"why\" substrate, which deals\nwith identifying the factors driving migration relying primarily on regression\nmodeling, encompassing economic, demographic, geographic, cultural, political,\nand other variables; and the \"how\" substrate, which focuses on identifying\nmigration flows and patterns, drawing from Network Science tools and\nvisualization techniques to depict complex migration networks. Despite the\ngrowing percentage of Network Science studies in migration, the explanations of\nthe identified network traits remain very scarce, highlighting the detachment\nbetween the two research substrates. Our study includes real-world network\nanalyses of human migration across different geographical levels: city,\ncountry, and global. We examine inter-district migration in Vienna at the city\nlevel, review internal migration networks in Austria and Croatia at the country\nlevel, and analyze migration exchange between Croatia and the world at the\nglobal level. By comparing network structures, we demonstrate how distinct\nnetwork traits impact regression modeling. This work not only uncovers\nmigration network patterns in previously unexplored areas but also presents a\ncomprehensive overview of recent research, highlighting gaps in each field and\ntheir interconnectedness. 
Our contribution offers suggestions for integrating\nboth fields to enhance methodological rigor and support future research."}, "http://arxiv.org/abs/2310.15016": {"title": "Impact of Record-Linkage Errors in Covid-19 Vaccine-Safety Analyses using German Health-Care Data: A Simulation Study", "link": "http://arxiv.org/abs/2310.15016", "description": "With unprecedented speed, 192,248,678 doses of Covid-19 vaccines were\nadministered in Germany by July 11, 2023 to combat the pandemic. Limitations of\nclinical trials imply that the safety profile of these vaccines is not fully\nknown before marketing. However, routine health-care data can help address\nthese issues. Despite the high proportion of insured people, the analysis of\nvaccination-related data is challenging in Germany. Generally, the Covid-19\nvaccination status and other health-care data are stored in separate databases,\nwithout persistent and database-independent person identifiers. Error-prone\nrecord-linkage techniques must be used to merge these databases. Our aim was to\nquantify the impact of record-linkage errors on the power and bias of different\nanalysis methods designed to assess Covid-19 vaccine safety when using German\nhealth-care data with a Monte-Carlo simulation study. We used a discrete-time\nsimulation and empirical data to generate realistic data with varying amounts\nof record-linkage errors. Afterwards, we analysed this data using a Cox model\nand the self-controlled case series (SCCS) method. Realistic proportions of\nrandom linkage errors had only a small effect on the power of either method. The\nSCCS method produced unbiased results even with a high percentage of linkage\nerrors, while the Cox model underestimated the true effect."}, "http://arxiv.org/abs/2310.15069": {"title": "Second-order group knockoffs with applications to GWAS", "link": "http://arxiv.org/abs/2310.15069", "description": "Conditional testing via the knockoff framework allows one to identify --\namong a large number of possible explanatory variables -- those that carry unique\ninformation about an outcome of interest, and also provides a false discovery\nrate guarantee on the selection. This approach is particularly well suited to\nthe analysis of genome-wide association studies (GWAS), which have the goal of\nidentifying genetic variants which influence traits of medical relevance.\n\nWhile conditional testing can be both more powerful and precise than\ntraditional GWAS analysis methods, its vanilla implementation encounters a\ndifficulty common to all multivariate analysis methods: it is challenging to\ndistinguish among multiple, highly correlated regressors. This impasse can be\novercome by shifting the object of inference from single variables to groups of\ncorrelated variables. To achieve this, it is necessary to construct \"group\nknockoffs.\" While successful examples are already documented in the literature,\nthis paper substantially expands the set of algorithms and software for group\nknockoffs. We focus in particular on second-order knockoffs, for which we\ndescribe correlation matrix approximations that are appropriate for GWAS data\nand that result in considerable computational savings. 
We illustrate the\neffectiveness of the proposed methods with simulations and with the analysis of\nalbuminuria data from the UK Biobank.\n\nThe described algorithms are implemented in an open-source Julia package\nKnockoffs.jl, for which both R and Python wrappers are available."}, "http://arxiv.org/abs/2310.15070": {"title": "Improving estimation efficiency of case-cohort study with interval-censored failure time data", "link": "http://arxiv.org/abs/2310.15070", "description": "The case-cohort design is a commonly used cost-effective sampling strategy\nfor large cohort studies, where some covariates are expensive to measure or\nobtain. In this paper, we consider regression analysis under a case-cohort\nstudy with interval-censored failure time data, where the failure time is only\nknown to fall within an interval instead of being exactly observed. A common\napproach to analyze data from a case-cohort study is the inverse probability\nweighting approach, where only subjects in the case-cohort sample are used in\nestimation, and the subjects are weighted based on the probability of inclusion\ninto the case-cohort sample. This approach, though consistent, is generally\ninefficient as it does not incorporate information outside the case-cohort\nsample. To improve efficiency, we first develop a sieve maximum weighted\nlikelihood estimator under the Cox model based on the case-cohort sample, and\nthen propose a procedure to update this estimator by using information in the\nfull cohort. We show that the update estimator is consistent, asymptotically\nnormal, and more efficient than the original estimator. The proposed method can\nflexibly incorporate auxiliary variables to further improve estimation\nefficiency. We employ a weighted bootstrap procedure for variance estimation.\nSimulation results indicate that the proposed method works well in practical\nsituations. A real study on diabetes is provided for illustration."}, "http://arxiv.org/abs/2310.15108": {"title": "Evaluating machine learning models in non-standard settings: An overview and new findings", "link": "http://arxiv.org/abs/2310.15108", "description": "Estimating the generalization error (GE) of machine learning models is\nfundamental, with resampling methods being the most common approach. However,\nin non-standard settings, particularly those where observations are not\nindependently and identically distributed, resampling using simple random data\ndivisions may lead to biased GE estimates. This paper strives to present\nwell-grounded guidelines for GE estimation in various such non-standard\nsettings: clustered data, spatial data, unequal sampling probabilities, concept\ndrift, and hierarchically structured outcomes. Our overview combines\nwell-established methodologies with other existing methods that, to our\nknowledge, have not been frequently considered in these particular settings. A\nunifying principle among these techniques is that the test data used in each\niteration of the resampling procedure should reflect the new observations to\nwhich the model will be applied, while the training data should be\nrepresentative of the entire data set used to obtain the final model. Beyond\nproviding an overview, we address literature gaps by conducting simulation\nstudies. These studies assess the necessity of using GE-estimation methods\ntailored to the respective setting. 
Our findings corroborate the concern that\nstandard resampling methods often yield biased GE estimates in non-standard\nsettings, underscoring the importance of tailored GE estimation."}, "http://arxiv.org/abs/2310.15124": {"title": "Mixed-Variable Global Sensitivity Analysis For Knowledge Discovery And Efficient Combinatorial Materials Design", "link": "http://arxiv.org/abs/2310.15124", "description": "Global Sensitivity Analysis (GSA) is the study of the influence of any given\ninputs on the outputs of a model. In the context of engineering design, GSA has\nbeen widely used to understand both individual and collective contributions of\ndesign variables on the design objectives. So far, global sensitivity studies\nhave often been limited to design spaces with only quantitative (numerical)\ndesign variables. However, many engineering systems also contain, if not only,\nqualitative (categorical) design variables in addition to quantitative design\nvariables. In this paper, we integrate Latent Variable Gaussian Process (LVGP)\nwith Sobol' analysis to develop the first metamodel-based mixed-variable GSA\nmethod. Through numerical case studies, we validate and demonstrate the\neffectiveness of our proposed method for mixed-variable problems. Furthermore,\nwhile the proposed GSA method is general enough to benefit various engineering\ndesign applications, we integrate it with multi-objective Bayesian optimization\n(BO) to create a sensitivity-aware design framework in accelerating the Pareto\nfront design exploration for metal-organic framework (MOF) materials with\nmany-level combinatorial design spaces. Although MOFs are constructed only from\nqualitative variables that are notoriously difficult to design, our method can\nutilize sensitivity analysis to navigate the optimization in the many-level\nlarge combinatorial design space, greatly expediting the exploration of novel\nMOF candidates."}, "http://arxiv.org/abs/2003.04433": {"title": "Least Squares Estimation of a Quasiconvex Regression Function", "link": "http://arxiv.org/abs/2003.04433", "description": "We develop a new approach for the estimation of a multivariate function based\non the economic axioms of quasiconvexity (and monotonicity). On the\ncomputational side, we prove the existence of the quasiconvex constrained least\nsquares estimator (LSE) and provide a characterization of the function space to\ncompute the LSE via a mixed integer quadratic programme. On the theoretical\nside, we provide finite sample risk bounds for the LSE via a sharp oracle\ninequality. Our results allow for errors to depend on the covariates and to\nhave only two finite moments. We illustrate the superior performance of the LSE\nagainst some competing estimators via simulation. Finally, we use the LSE to\nestimate the production function for the Japanese plywood industry and the cost\nfunction for hospitals across the US."}, "http://arxiv.org/abs/2008.10296": {"title": "Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison", "link": "http://arxiv.org/abs/2008.10296", "description": "Leave-one-out cross-validation (LOO-CV) is a popular method for comparing\nBayesian models based on their estimated predictive performance on new, unseen,\ndata. As leave-one-out cross-validation is based on finite observed data, there\nis uncertainty about the expected predictive performance on new data. 
By\nmodeling this uncertainty when comparing two models, we can compute the\nprobability that one model has a better predictive performance than the other.\nModeling this uncertainty well is not trivial, and for example, it is known\nthat the commonly used standard error estimate is often too small. We study the\nproperties of the Bayesian LOO-CV estimator and the related uncertainty\nestimates when comparing two models. We provide new results on the properties\nboth theoretically in the linear regression case and empirically for multiple\ndifferent models and discuss the challenges of modeling the uncertainty. We\nshow that problematic cases include: comparing models with similar predictions,\nmisspecified models, and small data. In these cases, there is a weak connection\nbetween the skewness of the individual leave-one-out terms and the distribution of\nthe error of the Bayesian LOO-CV estimator. We show that it is possible that\nthe problematic skewness of the error distribution, which occurs when the\nmodels make similar predictions, does not fade away when the data size grows to\ninfinity in certain situations. Based on the results, we also provide practical\nrecommendations for the users of Bayesian LOO-CV for model comparison."}, "http://arxiv.org/abs/2105.04981": {"title": "Uncovering patterns for adverse pregnancy outcomes with a Bayesian spatial model: Evidence from Philadelphia", "link": "http://arxiv.org/abs/2105.04981", "description": "We introduce a Bayesian conditional autoregressive model for analyzing\npatient-specific and neighborhood risks of stillbirth and preterm birth within\na city. Our fully Bayesian approach automatically learns the amount of spatial\nheterogeneity and spatial dependence between neighborhoods. Our model provides\nmeaningful inferences and uncertainty quantification for both covariate effects\nand neighborhood risk probabilities through their posterior distributions. We\napply our methodology to data from the city of Philadelphia. Using electronic\nhealth records (45,919 deliveries at hospitals within the University of\nPennsylvania Health System) and United States Census Bureau data from 363\ncensus tracts in Philadelphia, we find that both patient-level characteristics\n(e.g. self-identified race/ethnicity) and neighborhood-level characteristics\n(e.g. violent crime) are highly associated with patients' odds of stillbirth or\npreterm birth. Our neighborhood risk analysis further reveals that census\ntracts in West Philadelphia and North Philadelphia are at highest risk of these\noutcomes. Specifically, neighborhoods with higher rates of women in poverty or\non public assistance have greater neighborhood risk for these outcomes, while\nneighborhoods with higher rates of college-educated women or women in the labor\nforce have lower risk. Our findings could be useful for targeted individual and\nneighborhood interventions."}, "http://arxiv.org/abs/2107.07317": {"title": "Nonparametric Statistical Inference via Metric Distribution Function in Metric Spaces", "link": "http://arxiv.org/abs/2107.07317", "description": "The distribution function is essential in statistical inference, and connected\nwith samples to form a directed closed loop by the correspondence theorem in\nmeasure theory and the Glivenko-Cantelli and Donsker properties. This\nconnection creates a paradigm for statistical inference. However, existing\ndistribution functions are defined in Euclidean spaces and no longer convenient\nto use for rapidly evolving data objects of complex nature. 
It is imperative to\ndevelop the concept of distribution function in a more general space to meet\nemerging needs. Note that the linearity allows us to use hypercubes to define\nthe distribution function in a Euclidean space, but without the linearity in a\nmetric space, we must work with the metric to investigate the probability\nmeasure. We introduce a class of metric distribution functions through the\nmetric between random objects and a fixed location in metric spaces. We\novercome this challenging step by proving the correspondence theorem and the\nGlivenko-Cantelli theorem for metric distribution functions in metric spaces\nthat lay the foundation for conducting rational statistical inference for\nmetric space-valued data. Then, we develop homogeneity and mutual\nindependence tests for non-Euclidean random objects, and present comprehensive\nempirical evidence to support the performance of our proposed methods."}, "http://arxiv.org/abs/2109.03694": {"title": "Parameterizing and Simulating from Causal Models", "link": "http://arxiv.org/abs/2109.03694", "description": "Many statistical problems in causal inference involve a probability\ndistribution other than the one from which data are actually observed; as an\nadditional complication, the object of interest is often a marginal quantity of\nthis other probability distribution. This creates many practical complications\nfor statistical inference, even where the problem is non-parametrically\nidentified. In particular, it is difficult to perform likelihood-based\ninference, or even to simulate from the model in a general way.\n\nWe introduce the `frugal parameterization', which places the causal effect of\ninterest at its centre, and then builds the rest of the model around it. We do\nthis in a way that provides a recipe for constructing a regular, non-redundant\nparameterization using causal quantities of interest. In the case of discrete\nvariables we can use odds ratios to complete the parameterization, while in the\ncontinuous case copulas are the natural choice; other possibilities are also\ndiscussed.\n\nOur methods allow us to construct and simulate from models with\nparametrically specified causal distributions, and fit them using\nlikelihood-based methods, including fully Bayesian approaches. Our proposal\nincludes parameterizations for the average causal effect and effect of\ntreatment on the treated, as well as other causal quantities of interest."}, "http://arxiv.org/abs/2202.09534": {"title": "Locally Adaptive Spatial Quantile Smoothing: Application to Monitoring Crime Density in Tokyo", "link": "http://arxiv.org/abs/2202.09534", "description": "Spatial trend estimation under potential heterogeneity is an important\nproblem to extract spatial characteristics and hazards such as criminal\nactivity. By focusing on quantiles, which provide substantial information on\ndistributions compared with commonly used summary statistics such as means, it\nis often useful to estimate not only the average trend but also the high (low)\nrisk trend additionally. In this paper, we propose a Bayesian quantile trend\nfiltering method to estimate the non-stationary trend of quantiles on graphs\nand apply it to crime data in Tokyo between 2013 and 2017. By modeling multiple\nobservation cases, we can estimate the potential heterogeneity of spatial crime\ntrends over multiple years in the application. To induce locally adaptive\nBayesian inference on trends, we introduce general shrinkage priors for graph\ndifferences. 
Introducing so-called shadow priors with multivariate distribution\nfor local scale parameters and mixture representation of the asymmetric Laplace\ndistribution, we provide a simple Gibbs sampling algorithm to generate\nposterior samples. The numerical performance of the proposed method is\ndemonstrated through simulation studies."}, "http://arxiv.org/abs/2203.16710": {"title": "Detecting Treatment Interference under the K-Nearest-Neighbors Interference Model", "link": "http://arxiv.org/abs/2203.16710", "description": "We propose a model of treatment interference where the response of a unit\ndepends only on its treatment status and the statuses of units within its\nK-neighborhood. Current methods for detecting interference include carefully\ndesigned randomized experiments and conditional randomization tests on a set of\nfocal units. We give guidance on how to choose focal units under this model of\ninterference. We then conduct a simulation study to evaluate the efficacy of\nexisting methods for detecting network interference. We show that this choice\nof focal units leads to powerful tests of treatment interference which\noutperform current experimental methods."}, "http://arxiv.org/abs/2206.00646": {"title": "Importance sampling for stochastic reaction-diffusion equations in the moderate deviation regime", "link": "http://arxiv.org/abs/2206.00646", "description": "We develop a provably efficient importance sampling scheme that estimates\nexit probabilities of solutions to small-noise stochastic reaction-diffusion\nequations from scaled neighborhoods of a stable equilibrium. The moderate\ndeviation scaling allows for a local approximation of the nonlinear dynamics by\ntheir linearized version. In addition, we identify a finite-dimensional\nsubspace where exits take place with high probability. Using stochastic control\nand variational methods we show that our scheme performs well both in the zero\nnoise limit and pre-asymptotically. Simulation studies for stochastically\nperturbed bistable dynamics illustrate the theoretical results."}, "http://arxiv.org/abs/2206.12084": {"title": "Functional Mixed Membership Models", "link": "http://arxiv.org/abs/2206.12084", "description": "Mixed membership models, or partial membership models, are a flexible\nunsupervised learning method that allows each observation to belong to multiple\nclusters. In this paper, we propose a Bayesian mixed membership model for\nfunctional data. By using the multivariate Karhunen-Lo\\`eve theorem, we are\nable to derive a scalable representation of Gaussian processes that maintains\ndata-driven learning of the covariance structure. Within this framework, we\nestablish conditional posterior consistency given a known feature allocation\nmatrix. Compared to previous work on mixed membership models, our proposal\nallows for increased modeling flexibility, with the benefit of a directly\ninterpretable mean and covariance structure. Our work is motivated by studies\nin functional brain imaging through electroencephalography (EEG) of children\nwith autism spectrum disorder (ASD). In this context, our work formalizes the\nclinical notion of \"spectrum\" in terms of feature membership proportions."}, "http://arxiv.org/abs/2208.07614": {"title": "Reweighting the RCT for generalization: finite sample error and variable selection", "link": "http://arxiv.org/abs/2208.07614", "description": "Randomized Controlled Trials (RCTs) may suffer from limited scope. 
In\nparticular, samples may be unrepresentative: some RCTs over- or under-sample\nindividuals with certain characteristics compared to the target population, for\nwhich one wants conclusions on treatment effectiveness. Re-weighting trial\nindividuals to match the target population can improve the treatment effect\nestimation. In this work, we establish the exact expressions of the bias and\nvariance of such reweighting procedures -- also called Inverse Propensity of\nSampling Weighting (IPSW) -- in the presence of categorical covariates for any\nsample size. Such results allow us to compare the theoretical performance of\ndifferent versions of IPSW estimates. Besides, our results show how the\nperformance (bias, variance, and quadratic risk) of IPSW estimates depends on\nthe two sample sizes (RCT and target population). A by-product of our work is\nthe proof of consistency of IPSW estimates. Results also reveal that IPSW\nperformance is improved when the trial probability to be treated is estimated\n(rather than using its oracle counterpart). In addition, we study the choice of\nvariables: how including covariates that are not necessary for identifiability\nof the causal effect may impact the asymptotic variance. Including covariates\nthat are shifted between the two samples but are not treatment effect modifiers\nincreases the variance, while covariates that are treatment effect modifiers but not shifted do not.\nWe illustrate all the takeaways in a didactic example, and on a semi-synthetic\nsimulation inspired from critical care medicine."}, "http://arxiv.org/abs/2209.15448": {"title": "Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments", "link": "http://arxiv.org/abs/2209.15448", "description": "As AI becomes more prevalent throughout society, effective methods of\nintegrating humans and AI systems that leverage their respective strengths and\nmitigate risk have become an important priority. In this paper, we introduce\nthe paradigm of super reinforcement learning that takes advantage of Human-AI\ninteraction for data-driven sequential decision making. This approach utilizes\nthe observed action, either from AI or humans, as input for achieving a\nstronger oracle in policy learning for the decision maker (humans or AI). In\nthe decision process with unmeasured confounding, the actions taken by past\nagents can offer valuable insights into undisclosed information. By including\nthis information for the policy search in a novel and legitimate manner, the\nproposed super reinforcement learning will yield a super-policy that is\nguaranteed to outperform both the standard optimal policy and the behavior policy\n(e.g., past agents' actions). We call this stronger oracle a blessing from\nhuman-AI interaction. Furthermore, to address the issue of unmeasured\nconfounding in finding super-policies using the batch data, a number of\nnonparametric and causal identifications are established. Building upon\nthese novel identification results, we develop several super-policy learning\nalgorithms and systematically study their theoretical properties such as\nfinite-sample regret guarantee. 
Finally, we illustrate the effectiveness of our\nproposal through extensive simulations and real-world applications."}, "http://arxiv.org/abs/2212.06906": {"title": "Flexible Regularized Estimation in High-Dimensional Mixed Membership Models", "link": "http://arxiv.org/abs/2212.06906", "description": "Mixed membership models are an extension of finite mixture models, where each\nobservation can partially belong to more than one mixture component. A\nprobabilistic framework for mixed membership models of high-dimensional\ncontinuous data is proposed with a focus on scalability and interpretability.\nThe novel probabilistic representation of mixed membership is based on convex\ncombinations of dependent multivariate Gaussian random vectors. In this\nsetting, scalability is ensured through approximations of a tensor covariance\nstructure through multivariate eigen-approximations with adaptive\nregularization imposed through shrinkage priors. Conditional weak posterior\nconsistency is established on an unconstrained model, allowing for a simple\nposterior sampling scheme while keeping many of the desired theoretical\nproperties of our model. The model is motivated by two biomedical case studies:\na case study on functional brain imaging of children with autism spectrum\ndisorder (ASD) and a case study on gene expression data from breast cancer\ntissue. These applications highlight how the typical assumption made in cluster\nanalysis, that each observation comes from one homogeneous subgroup, may often\nbe restrictive in several applications, leading to unnatural interpretations of\ndata features."}, "http://arxiv.org/abs/2301.09020": {"title": "On the Role of Volterra Integral Equations in Self-Consistent, Product-Limit, Inverse Probability of Censoring Weighted, and Redistribution-to-the-Right Estimators for the Survival Function", "link": "http://arxiv.org/abs/2301.09020", "description": "This paper reconsiders several results of historical and current importance\nto nonparametric estimation of the survival distribution for failure in the\npresence of right-censored observation times, demonstrating in particular how\nVolterra integral equations of the first kind help inter-connect the resulting\nestimators. The paper begins by considering Efron's self-consistency equation,\nintroduced in a seminal 1967 Berkeley symposium paper. Novel insights provided\nin the current work include the observations that (i) the self-consistency\nequation leads directly to an anticipating Volterra integral equation of the\nfirst kind whose solution is given by a product-limit estimator for the\ncensoring survival function; (ii) a definition used in this argument\nimmediately establishes the familiar product-limit estimator for the failure\nsurvival function; (iii) the usual Volterra integral equation for the\nproduct-limit estimator of the failure survival function leads to an immediate\nand simple proof that it can be represented as an inverse probability of\ncensoring weighted estimator (i.e., under appropriate conditions). Finally, we\nshow that the resulting inverse probability of censoring weighted estimators,\nattributed to a highly influential 1992 paper of Robins and Rotnitzky, were\nimplicitly introduced in Efron's 1967 paper in its development of the\nredistribution-to-the-right algorithm. 
All results developed herein allow for\nties between failure and/or censored observations."}, "http://arxiv.org/abs/2302.01576": {"title": "ResMem: Learn what you can and memorize the rest", "link": "http://arxiv.org/abs/2302.01576", "description": "The impressive generalization performance of modern neural networks is\nattributed in part to their ability to implicitly memorize complex training\npatterns. Inspired by this, we explore a novel mechanism to improve model\ngeneralization via explicit memorization. Specifically, we propose the\nresidual-memorization (ResMem) algorithm, a new method that augments an\nexisting prediction model (e.g. a neural network) by fitting the model's\nresiduals with a $k$-nearest neighbor based regressor. The final prediction is\nthen the sum of the original model and the fitted residual regressor. By\nconstruction, ResMem can explicitly memorize the training labels. Empirically,\nwe show that ResMem consistently improves the test set generalization of the\noriginal prediction model across various standard vision and natural language\nprocessing benchmarks. Theoretically, we formulate a stylized linear regression\nproblem and rigorously show that ResMem results in a more favorable test risk\nover the base predictor."}, "http://arxiv.org/abs/2303.05032": {"title": "Sensitivity analysis for principal ignorability violation in estimating complier and noncomplier average causal effects", "link": "http://arxiv.org/abs/2303.05032", "description": "An important strategy for identifying principal causal effects, which are\noften used in settings with noncompliance, is to invoke the principal\nignorability (PI) assumption. As PI is untestable, it is important to gauge how\nsensitive effect estimates are to its violation. We focus on this task for the\ncommon one-sided noncompliance setting where there are two principal strata,\ncompliers and noncompliers. Under PI, compliers and noncompliers share the same\noutcome-mean-given-covariates function under the control condition. For\nsensitivity analysis, we allow this function to differ between compliers and\nnoncompliers in several ways, indexed by an odds ratio, a generalized odds\nratio, a mean ratio, or a standardized mean difference sensitivity parameter.\nWe tailor sensitivity analysis techniques (with any sensitivity parameter\nchoice) to several types of PI-based main analysis methods, including outcome\nregression, influence function (IF) based and weighting methods. We illustrate\nthe proposed sensitivity analyses using several outcome types from the JOBS II\nstudy. This application estimates nuisance functions parametrically -- for\nsimplicity and accessibility. In addition, we establish rate conditions on\nnonparametric nuisance estimation for IF-based estimators to be asymptotically\nnormal -- with a view to inform nonparametric inference."}, "http://arxiv.org/abs/2304.13307": {"title": "A Statistical Interpretation of the Maximum Subarray Problem", "link": "http://arxiv.org/abs/2304.13307", "description": "Maximum subarray is a classical problem in computer science that given an\narray of numbers aims to find a contiguous subarray with the largest sum. We\nfocus on its use for a noisy statistical problem of localizing an interval with\na mean different from background. While a naive application of maximum subarray\nfails at this task, both a penalized and a constrained version can succeed. 
We\nshow that the penalized version can be derived for common exponential family\ndistributions, in a manner similar to the change-point detection literature,\nand we interpret the resulting optimal penalty value. The failure of the naive\nformulation is then explained by an analysis of the estimated interval\nboundaries. Experiments further quantify the effect of deviating from the\noptimal penalty. We also relate the penalized and constrained formulations and\nshow that the solutions to the former lie on the convex hull of the solutions\nto the latter."}, "http://arxiv.org/abs/2305.10637": {"title": "Conformalized matrix completion", "link": "http://arxiv.org/abs/2305.10637", "description": "Matrix completion aims to estimate missing entries in a data matrix, using\nthe assumption of a low-complexity structure (e.g., low rank) so that\nimputation is possible. While many effective estimation algorithms exist in the\nliterature, uncertainty quantification for this problem has proved to be\nchallenging, and existing methods are extremely sensitive to model\nmisspecification. In this work, we propose a distribution-free method for\npredictive inference in the matrix completion problem. Our method adapts the\nframework of conformal prediction, which provides confidence intervals with\nguaranteed distribution-free validity in the setting of regression, to the\nproblem of matrix completion. Our resulting method, conformalized matrix\ncompletion (cmc), offers provable predictive coverage regardless of the\naccuracy of the low-rank model. Empirical results on simulated and real data\ndemonstrate that cmc is robust to model misspecification while matching the\nperformance of existing model-based methods when the model is correct."}, "http://arxiv.org/abs/2305.15027": {"title": "A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods", "link": "http://arxiv.org/abs/2305.15027", "description": "We establish the first mathematically rigorous link between Bayesian,\nvariational Bayesian, and ensemble methods. A key step towards this is to\nreformulate the non-convex optimisation problem typically encountered in deep\nlearning as a convex optimisation in the space of probability measures. On a\ntechnical level, our contribution amounts to studying generalised variational\ninference through the lens of Wasserstein gradient flows. The result is a\nunified theory of various seemingly disconnected approaches that are commonly\nused for uncertainty quantification in deep learning -- including deep\nensembles and (variational) Bayesian methods. This offers a fresh perspective\non the reasons behind the success of deep ensembles over procedures based on\nparameterised variational inference, and allows the derivation of new\nensembling schemes with convergence guarantees. We showcase this by proposing a\nfamily of interacting deep ensembles with direct parallels to the interactions\nof particle systems in thermodynamics, and use our theory to prove the\nconvergence of these algorithms to a well-defined global minimiser on the space\nof probability measures."}, "http://arxiv.org/abs/2308.05858": {"title": "Inconsistency and Acausality of Model Selection in Bayesian Inverse Problems", "link": "http://arxiv.org/abs/2308.05858", "description": "Bayesian inference paradigms are regarded as powerful tools for the solution of\ninverse problems. However, when applied to inverse problems in the physical\nsciences, Bayesian formulations suffer from a number of inconsistencies that\nare often overlooked. 
A well-known, but mostly neglected, difficulty is\nconnected to the notion of conditional probability densities. Borel, and later\nKolmogorov (1933/1956), found that the traditional definition of conditional\ndensities is incomplete: In different parameterizations it leads to different\nresults. We will show an example where two apparently correct procedures\napplied to the same problem lead to two widely different results. Another type\nof inconsistency involves violation of causality. This problem is found in\nmodel selection strategies in Bayesian inversion, such as Hierarchical Bayes\nand Trans-Dimensional Inversion, where so-called hyperparameters are included as\nvariables to control either the number (or type) of unknowns, or the prior\nuncertainties on data or model parameters. For Hierarchical Bayes we\ndemonstrate that the calculated 'prior' distributions of data or model\nparameters are not prior-, but posterior information. In fact, the calculated\n'standard deviations' of the data are a measure of the inability of the forward\nfunction to model the data, rather than uncertainties of the data. For\ntrans-dimensional inverse problems we show that the so-called evidence is, in\nfact, not a measure of the success of fitting the data for the given choice (or\nnumber) of parameters, as often claimed. We also find that the notion of\nNatural Parsimony is ill-defined, because of its dependence on the parameter\nprior. Based on this study, we find that careful rethinking of Bayesian\ninversion practices is required, with special emphasis on ways of avoiding the\nBorel-Kolmogorov inconsistency, and on the way we interpret model selection\nresults."}, "http://arxiv.org/abs/2310.15266": {"title": "Causal progress with imperfect placebo treatments and outcomes", "link": "http://arxiv.org/abs/2310.15266", "description": "In the quest to make defensible causal claims from observational data, it is\nsometimes possible to leverage information from \"placebo treatments\" and\n\"placebo outcomes\" (or \"negative outcome controls\"). Existing approaches\nemploying such information focus largely on point identification and assume (i)\n\"perfect placebos\", meaning placebo treatments have precisely zero effect on\nthe outcome and the real treatment has precisely zero effect on a placebo\noutcome; and (ii) \"equiconfounding\", meaning that the treatment-outcome\nrelationship where one is a placebo suffers the same amount of confounding as\ndoes the real treatment-outcome relationship, on some scale. We instead\nconsider an omitted variable bias framework, in which users can postulate\nnon-zero effects of placebo treatment on real outcomes or of real treatments on\nplacebo outcomes, and the relative strengths of confounding suffered by a\nplacebo treatment/outcome compared to the true treatment-outcome relationship.\nOnce postulated, these assumptions identify or bound the linear estimates of\ntreatment effects. While applicable in many settings, one ubiquitous use-case\nfor this approach is to employ pre-treatment outcomes as (perfect) placebo\noutcomes. In this setting, the parallel trends assumption of\ndifference-in-difference is in fact a strict equiconfounding assumption on a\nparticular scale, which can be relaxed in our framework. 
Finally, we\ndemonstrate the use of our framework with two applications, employing an R\npackage that implements these approaches."}, "http://arxiv.org/abs/2310.15333": {"title": "Estimating Trustworthy and Safe Optimal Treatment Regimes", "link": "http://arxiv.org/abs/2310.15333", "description": "Recent statistical and reinforcement learning methods have significantly\nadvanced patient care strategies. However, these approaches face substantial\nchallenges in high-stakes contexts, including missing data, inherent\nstochasticity, and the critical requirements for interpretability and patient\nsafety. Our work operationalizes a safe and interpretable framework to identify\noptimal treatment regimes. This approach involves matching patients with\nsimilar medical and pharmacological characteristics, allowing us to construct\nan optimal policy via interpolation. We perform a comprehensive simulation\nstudy to demonstrate the framework's ability to identify optimal policies even\nin complex settings. Ultimately, we operationalize our approach to study\nregimes for treating seizures in critically ill patients. Our findings strongly\nsupport personalized treatment strategies based on a patient's medical history\nand pharmacological features. Notably, we identify that reducing medication\ndoses for patients with mild and brief seizure episodes while adopting\naggressive treatment for patients in intensive care unit experiencing intense\nseizures leads to more favorable outcomes."}, "http://arxiv.org/abs/2310.15459": {"title": "Strategies to mitigate bias from time recording errors in pharmacokinetic studies", "link": "http://arxiv.org/abs/2310.15459", "description": "Opportunistic pharmacokinetic (PK) studies have sparse and imbalanced\nclinical measurement data, and the impact of sample time errors is an important\nconcern when seeking accurate estimates of treatment response. We evaluated an\napproximate Bayesian model for individualized pharmacokinetics in the presence\nof time recording errors (TREs), considering both a short and long infusion\ndosing pattern. We found that the long infusion schedule generally had lower\nbias in estimates of the pharmacodynamic (PD) endpoint relative to the short\ninfusion schedule. We investigated three different design strategies for their\nability to mitigate the impact of TREs: (i) shifting blood draws taken during\nan active infusion to the post-infusion period, (ii) identifying the best next\nsample time by minimizing bias in the presence of TREs, and (iii) collecting\nadditional information on a subset of patients based on estimate uncertainty or\nquadrature-estimated variance in the presence of TREs. Generally, the proposed\nstrategies led to a decrease in bias of the PD estimate for the short infusion\nschedule, but had a negligible impact for the long infusion schedule. Dosing\nregimens with periods of high non-linearity may benefit from design\nmodifications, while more stable concentration-time profiles are generally more\nrobust to TREs with no design modifications."}, "http://arxiv.org/abs/2310.15497": {"title": "Generalized Box-Cox method to estimate sample mean and standard deviation for Meta-analysis", "link": "http://arxiv.org/abs/2310.15497", "description": "Meta-analysis is the aggregation of data from multiple studies to find\npatterns across a broad range relating to a particular subject. It is becoming\nincreasingly useful to apply meta-analysis to summarize these studies being\ndone across various fields. 
In meta-analysis, it is common to use the mean and\nstandard deviation from each study as the basis for comparison. While many studies\nreport the mean and standard deviation as their summary statistics, some report\nother values including the minimum, maximum, median, and first and third\nquartiles. Often, the quartiles and median are reported when the data is skewed\nand does not follow a normal distribution. In order to correctly summarize the\ndata and draw conclusions from multiple studies, it is necessary to estimate\nthe mean and standard deviation from each study, considering variation and\nskewness within each study. In past literature, methods have been proposed to\nestimate the mean and standard deviation, but they do not consider negative values.\nData that include negative values are common and would increase the accuracy\nand impact of the meta-analysis. We propose a method that implements a\ngeneralized Box-Cox transformation to estimate the mean and standard deviation\naccounting for such negative values while maintaining similar accuracy."}, "http://arxiv.org/abs/2310.15877": {"title": "Regression analysis of multiplicative hazards model with time-dependent coefficient for sparse longitudinal covariates", "link": "http://arxiv.org/abs/2310.15877", "description": "We study the multiplicative hazards model with intermittently observed\nlongitudinal covariates and time-varying coefficients. For such models, the\nexisting {\\it ad hoc} approach, such as the last value carried forward, is\nbiased. We propose a kernel weighting approach to obtain an unbiased estimate of\nthe non-parametric coefficient function and establish asymptotic normality for\nany fixed time point. Furthermore, we construct the simultaneous confidence\nband to examine the overall magnitude of the variation. Simulation studies\nsupport our theoretical predictions and show favorable performance of the\nproposed method. A data set from cerebral infarction is used to illustrate our\nmethodology."}, "http://arxiv.org/abs/2310.15956": {"title": "Likelihood-Based Inference for Semi-Parametric Transformation Cure Models with Interval Censored Data", "link": "http://arxiv.org/abs/2310.15956", "description": "A simple yet effective way of modeling survival data with a cure fraction is by\nconsidering the Box-Cox transformation cure model (BCTM) that unifies mixture and\npromotion time cure models. In this article, we numerically study the\nstatistical properties of the BCTM when applied to interval censored data.\nTime-to-events associated with susceptible subjects are modeled through a\nproportional hazards structure that allows for non-homogeneity across subjects,\nwhere the baseline hazard function is estimated by a distribution-free piecewise\nlinear function with varied degrees of non-parametricity. Due to missing cured\nstatuses for right censored subjects, maximum likelihood estimates of model\nparameters are obtained by developing an expectation-maximization (EM)\nalgorithm. 
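Editor's note on the generalized Box-Cox summary-statistics entry above (arXiv:2310.15497): as a rough illustration of the idea only, not the paper's estimator, the Python sketch below shifts the reported quartiles so they are positive (which is how negative data can be handled), applies a Box-Cox transform with a user-chosen parameter, uses simple quantile-based normal approximations ((q1 + median + q3)/3 for the location and IQR/1.35 for the scale), and back-transforms by Monte Carlo. The shift, the transformation parameter, and the quantile formulas are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def boxcox(x, lam):
    """Box-Cox transform; natural log when lam == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def inv_boxcox(y, lam):
    """Inverse Box-Cox transform (base clipped so it stays positive)."""
    if lam == 0:
        return np.exp(y)
    return np.maximum(lam * np.asarray(y, dtype=float) + 1.0, 1e-12) ** (1.0 / lam)

def mean_sd_from_quartiles(q1, median, q3, lam=0.5, shift=0.0, n_mc=100_000, seed=0):
    """Rough mean/SD estimate from reported (q1, median, q3) summaries.

    The summaries are shifted by `shift` so all values are positive, Box-Cox
    transformed, summarized with quantile-based normal approximations, and
    back-transformed by Monte Carlo simulation.
    """
    rng = np.random.default_rng(seed)
    tq1, tm, tq3 = (boxcox(v + shift, lam) for v in (q1, median, q3))
    mu_t = (tq1 + tm + tq3) / 3.0                       # location on the transformed scale
    sd_t = (tq3 - tq1) / (2.0 * stats.norm.ppf(0.75))   # IQR-based scale estimate
    draws = inv_boxcox(rng.normal(mu_t, sd_t, n_mc), lam) - shift
    return draws.mean(), draws.std(ddof=1)

if __name__ == "__main__":
    m, s = mean_sd_from_quartiles(q1=-1.2, median=0.4, q3=2.9, lam=0.5, shift=5.0)
    print(f"estimated mean {m:.3f}, sd {s:.3f}")
```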
Under the EM framework, the conditional expectation of the complete\ndata log-likelihood function is maximized by considering all parameters\n(including the Box-Cox transformation parameter $\\alpha$) simultaneously, in\ncontrast to the conventional profile-likelihood technique of estimating $\\alpha$.\nThe robustness and accuracy of the model and estimation method are established\nthrough a detailed simulation study under various parameter settings, and an\nanalysis of real-life data obtained from a smoking cessation study."}, "http://arxiv.org/abs/1901.04916": {"title": "Pairwise accelerated failure time regression models for infectious disease transmission in close-contact groups with external sources of infection", "link": "http://arxiv.org/abs/1901.04916", "description": "Many important questions in infectious disease epidemiology involve the\neffects of covariates (e.g., age or vaccination status) on infectiousness and\nsusceptibility, which can be measured in studies of transmission in households\nor other close-contact groups. Because the transmission of disease produces\ndependent outcomes, these questions are difficult or impossible to address\nusing standard regression models from biostatistics. Pairwise survival analysis\nhandles dependent outcomes by calculating likelihoods in terms of contact\ninterval distributions in ordered pairs of individuals. The contact interval in\nthe ordered pair ij is the time from the onset of infectiousness in i to\ninfectious contact from i to j, where an infectious contact is sufficient to\ninfect j if they are susceptible. Here, we introduce a pairwise accelerated\nfailure time regression model for infectious disease transmission that allows\nthe rate parameter of the contact interval distribution to depend on\ninfectiousness covariates for i, susceptibility covariates for j, and pairwise\ncovariates. This model can simultaneously handle internal infections (caused by\ntransmission between individuals under observation) and external infections\n(caused by environmental or community sources of infection). In a simulation\nstudy, we show that these models produce valid point and interval estimates of\nparameters governing the contact interval distributions. We also explore the\nrole of epidemiologic study design and the consequences of model\nmisspecification. We use this regression model to analyze household data from\nLos Angeles County during the 2009 influenza A (H1N1) pandemic, where we find\nthat the ability to account for external sources of infection is critical to\nestimating the effect of antiviral prophylaxis."}, "http://arxiv.org/abs/2003.06416": {"title": "VCBART: Bayesian trees for varying coefficients", "link": "http://arxiv.org/abs/2003.06416", "description": "The linear varying coefficient model posits a linear relationship between an\noutcome and covariates in which the covariate effects are modeled as functions\nof additional effect modifiers. Despite a long history of study and use in\nstatistics and econometrics, state-of-the-art varying coefficient modeling\nmethods cannot accommodate multivariate effect modifiers without imposing\nrestrictive functional form assumptions or involving computationally intensive\nhyperparameter tuning. In response, we introduce VCBART, which flexibly\nestimates the covariate effect in a varying coefficient model using Bayesian\nAdditive Regression Trees. 
With simple default settings, VCBART outperforms\nexisting varying coefficient methods in terms of covariate effect estimation,\nuncertainty quantification, and outcome prediction. We illustrate the utility\nof VCBART with two case studies: one examining how the association between\nlater-life cognition and measures of socioeconomic position vary with respect\nto age and socio-demographics and another estimating how temporal trends in\nurban crime vary at the neighborhood level. An R package implementing VCBART is\navailable at https://github.com/skdeshpande91/VCBART"}, "http://arxiv.org/abs/2204.05870": {"title": "How much of the past matters? Using dynamic survival models for the monitoring of potassium in heart failure patients using electronic health records", "link": "http://arxiv.org/abs/2204.05870", "description": "Statistical methods to study the association between a longitudinal biomarker\nand the risk of death are very relevant for the long-term care of subjects\naffected by chronic illnesses, such as potassium in heart failure patients.\nParticularly in the presence of comorbidities or pharmacological treatments,\nsudden crises can cause potassium to undergo very abrupt yet transient changes.\nIn the context of the monitoring of potassium, there is a need for a dynamic\nmodel that can be used in clinical practice to assess the risk of death related\nto an observed patient's potassium trajectory. We considered different dynamic\nsurvival approaches, starting from the simple approach considering the most\nrecent measurement, to the joint model. We then propose a novel method based on\nwavelet filtering and landmarking to retrieve the prognostic role of past\nshort-term potassium shifts. We argue that while taking into account past\ninformation is important, not all past information is equally informative.\nState-of-the-art dynamic survival models are prone to give more importance to\nthe mean long-term value of potassium. However, our findings suggest that it is\nessential to take into account also recent potassium instability to capture all\nthe relevant prognostic information. The data used comes from over 2000\nsubjects, with a total of over 80 000 repeated potassium measurements collected\nthrough Administrative Health Records and Outpatient and Inpatient Clinic\nE-charts. A novel dynamic survival approach is proposed in this work for the\nmonitoring of potassium in heart failure. The proposed wavelet landmark method\nshows promising results revealing the prognostic role of past short-term\nchanges, according to their different duration, and achieving higher\nperformances in predicting the survival probability of individuals."}, "http://arxiv.org/abs/2212.09494": {"title": "Optimal Treatment Regimes for Proximal Causal Learning", "link": "http://arxiv.org/abs/2212.09494", "description": "A common concern when a policymaker draws causal inferences from and makes\ndecisions based on observational data is that the measured covariates are\ninsufficiently rich to account for all sources of confounding, i.e., the\nstandard no confoundedness assumption fails to hold. The recently proposed\nproximal causal inference framework shows that proxy variables that abound in\nreal-life scenarios can be leveraged to identify causal effects and therefore\nfacilitate decision-making. Building upon this line of work, we propose a novel\noptimal individualized treatment regime based on so-called outcome and\ntreatment confounding bridges. 
We then show that the value function of this new\noptimal treatment regime is superior to that of existing ones in the\nliterature. Theoretical guarantees, including identification, superiority,\nexcess value bound, and consistency of the estimated regime, are established.\nFurthermore, we demonstrate the proposed optimal regime via numerical\nexperiments and a real data application."}, "http://arxiv.org/abs/2302.07294": {"title": "Derandomized Novelty Detection with FDR Control via Conformal E-values", "link": "http://arxiv.org/abs/2302.07294", "description": "Conformal inference provides a general distribution-free method to rigorously\ncalibrate the output of any machine learning algorithm for novelty detection.\nWhile this approach has many strengths, it has the limitation of being\nrandomized, in the sense that it may lead to different results when analyzing\ntwice the same data, and this can hinder the interpretation of any findings. We\npropose to make conformal inferences more stable by leveraging suitable\nconformal e-values instead of p-values to quantify statistical significance.\nThis solution allows the evidence gathered from multiple analyses of the same\ndata to be aggregated effectively while provably controlling the false\ndiscovery rate. Further, we show that the proposed method can reduce randomness\nwithout much loss of power compared to standard conformal inference, partly\nthanks to an innovative way of weighting conformal e-values based on additional\nside information carefully extracted from the same data. Simulations with\nsynthetic and real data confirm this solution can be effective at eliminating\nrandom noise in the inferences obtained with state-of-the-art alternative\ntechniques, sometimes also leading to higher power."}, "http://arxiv.org/abs/2304.02127": {"title": "A Bayesian Collocation Integral Method for Parameter Estimation in Ordinary Differential Equations", "link": "http://arxiv.org/abs/2304.02127", "description": "Inferring the parameters of ordinary differential equations (ODEs) from noisy\nobservations is an important problem in many scientific fields. Currently, most\nparameter estimation methods that bypass numerical integration tend to rely on\nbasis functions or Gaussian processes to approximate the ODE solution and its\nderivatives. Due to the sensitivity of the ODE solution to its derivatives,\nthese methods can be hindered by estimation error, especially when only sparse\ntime-course observations are available. We present a Bayesian collocation\nframework that operates on the integrated form of the ODEs and also avoids the\nexpensive use of numerical solvers. Our methodology has the capability to\nhandle general nonlinear ODE systems. We demonstrate the accuracy of the\nproposed method through simulation studies, where the estimated parameters and\nrecovered system trajectories are compared with other recent methods. A real\ndata example is also provided."}, "http://arxiv.org/abs/2307.00127": {"title": "Large-scale Bayesian Structure Learning for Gaussian Graphical Models using Marginal Pseudo-likelihood", "link": "http://arxiv.org/abs/2307.00127", "description": "Bayesian methods for learning Gaussian graphical models offer a robust\nframework that addresses model uncertainty and incorporates prior knowledge.\nDespite their theoretical strengths, the applicability of Bayesian methods is\noften constrained by computational needs, especially in modern contexts\ninvolving thousands of variables. 
To overcome this issue, we introduce two\nnovel Markov chain Monte Carlo (MCMC) search algorithms that have a\nsignificantly lower computational cost than leading Bayesian approaches. Our\nproposed MCMC-based search algorithms use the marginal pseudo-likelihood\napproach to bypass the complexities of computing intractable normalizing\nconstants and iterative precision matrix sampling. These algorithms can deliver\nreliable results in mere minutes on standard computers, even for large-scale\nproblems with one thousand variables. Furthermore, our proposed method is\ncapable of addressing model uncertainty by efficiently exploring the full\nposterior graph space. Our simulation study indicates that the proposed\nalgorithms, particularly for large-scale sparse graphs, outperform the leading\nBayesian approaches in terms of computational efficiency and precision. The\nimplementation supporting the new approach is available through the R package\nBDgraph."}, "http://arxiv.org/abs/2307.09302": {"title": "Conformal prediction under ambiguous ground truth", "link": "http://arxiv.org/abs/2307.09302", "description": "Conformal Prediction (CP) allows to perform rigorous uncertainty\nquantification by constructing a prediction set $C(X)$ satisfying $\\mathbb{P}(Y\n\\in C(X))\\geq 1-\\alpha$ for a user-chosen $\\alpha \\in [0,1]$ by relying on\ncalibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\\mathbb{P}=\\mathbb{P}^{X}\n\\otimes \\mathbb{P}^{Y|X}$. It is typically implicitly assumed that\n$\\mathbb{P}^{Y|X}$ is the \"true\" posterior label distribution. However, in many\nreal-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating\nexpert opinions using a voting procedure, resulting in a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus\nw.r.t. $\\mathbb{P}_{vote}=\\mathbb{P}^X \\otimes \\mathbb{P}_{vote}^{Y|X}$ rather\nthan the true distribution $\\mathbb{P}$. In cases with unambiguous ground truth\nlabels, the distinction between $\\mathbb{P}_{vote}$ and $\\mathbb{P}$ is\nirrelevant. However, when experts do not agree because of ambiguous labels,\napproximating $\\mathbb{P}^{Y|X}$ with a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose\nto leverage expert opinions to approximate $\\mathbb{P}^{Y|X}$ using a\nnon-degenerate distribution $\\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP\nprocedures which provide guarantees w.r.t. $\\mathbb{P}_{agg}=\\mathbb{P}^X\n\\otimes \\mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels\nfrom $\\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a\ncase study of skin condition classification with significant disagreement among\nexpert annotators, we show that applying CP w.r.t. $\\mathbb{P}_{vote}$\nunder-covers expert annotations: calibrated for $72\\%$ coverage, it falls short\nby on average $10\\%$; our Monte Carlo CP closes this gap both empirically and\ntheoretically."}, "http://arxiv.org/abs/2310.16203": {"title": "Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework", "link": "http://arxiv.org/abs/2310.16203", "description": "Mediation analysis is an important analytic tool commonly used in a broad\nrange of scientific applications. In this article, we study the problem of\nmediation analysis when there are multivariate and conditionally dependent\nmediators, and when the variables are observed over multiple time points. 
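Editor's note on the conformal prediction under ambiguous ground truth entry above (arXiv:2307.09302): the sketch below illustrates the Monte Carlo idea of sampling several pseudo-labels per calibration point from the aggregated expert distribution, pooling the resulting nonconformity scores, and thresholding test scores. The score function (1 minus the model probability of the label) and the quantile correction are simplified choices for illustration, not the paper's exact procedure.

```python
import numpy as np

def mc_conformal_sets(cal_probs, cal_label_dists, test_probs, alpha=0.1, m=10, seed=0):
    """Monte Carlo split-conformal prediction sets with ambiguous labels.

    cal_probs:       (n, K) classifier probabilities on calibration points
    cal_label_dists: (n, K) aggregated expert label distributions P_agg(y|x)
    test_probs:      (n_test, K) classifier probabilities on test points

    Each calibration point contributes m pseudo-labels drawn from P_agg(y|x);
    the nonconformity score used here is simply 1 - p_model(label).
    """
    rng = np.random.default_rng(seed)
    n, K = cal_probs.shape
    scores = []
    for i in range(n):
        pseudo = rng.choice(K, size=m, p=cal_label_dists[i])
        scores.extend(1.0 - cal_probs[i, pseudo])
    scores = np.asarray(scores)

    # Finite-sample-style quantile of the pooled scores (a simplification of
    # the correction one would need for pooled, non-exchangeable scores).
    level = min(1.0, np.ceil((scores.size + 1) * (1 - alpha)) / scores.size)
    q = np.quantile(scores, level, method="higher")
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```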
The\nproblem is challenging, because the effect of a mediator involves not only the\npath from the treatment to this mediator itself at the current time point, but\nalso all possible paths pointed to this mediator from its upstream mediators,\nas well as the carryover effects from all previous time points. We propose a\nnovel multivariate dynamic mediation analysis approach. Drawing inspiration\nfrom the Markov decision process model that is frequently employed in\nreinforcement learning, we introduce a Markov mediation process paired with a\nsystem of time-varying linear structural equation models to formulate the\nproblem. We then formally define the individual mediation effect, built upon\nthe idea of simultaneous interventions and intervention calculus. We next\nderive the closed-form expression and propose an iterative estimation procedure\nunder the Markov mediation process model. We study both the asymptotic property\nand the empirical performance of the proposed estimator, and further illustrate\nour method with a mobile health application."}, "http://arxiv.org/abs/2310.16207": {"title": "Propensity score weighting plus an adjusted proportional hazards model does not equal doubly robust away from the null", "link": "http://arxiv.org/abs/2310.16207", "description": "Recently it has become common for applied works to combine commonly used\nsurvival analysis modeling methods, such as the multivariable Cox model, and\npropensity score weighting with the intention of forming a doubly robust\nestimator that is unbiased in large samples when either the Cox model or the\npropensity score model is correctly specified. This combination does not, in\ngeneral, produce a doubly robust estimator, even after regression\nstandardization, when there is truly a causal effect. We demonstrate via\nsimulation this lack of double robustness for the semiparametric Cox model, the\nWeibull proportional hazards model, and a simple proportional hazards flexible\nparametric model, with both the latter models fit via maximum likelihood. We\nprovide a novel proof that the combination of propensity score weighting and a\nproportional hazards survival model, fit either via full or partial likelihood,\nis consistent under the null of no causal effect of the exposure on the outcome\nunder particular censoring mechanisms if either the propensity score or the\noutcome model is correctly specified and contains all confounders. Given our\nresults suggesting that double robustness only exists under the null, we\noutline two simple alternative estimators that are doubly robust for the\nsurvival difference at a given time point (in the above sense), provided the\ncensoring mechanism can be correctly modeled, and one doubly robust method of\nestimation for the full survival curve. We provide R code to use these\nestimators for estimation and inference in the supplementary materials."}, "http://arxiv.org/abs/2310.16213": {"title": "Bayes factor functions", "link": "http://arxiv.org/abs/2310.16213", "description": "We describe Bayes factors functions based on z, t, $\\chi^2$, and F statistics\nand the prior distributions used to define alternative hypotheses. The\nnon-local alternative prior distributions are centered on standardized effects,\nwhich index the Bayes factor function. The prior densities include a dispersion\nparameter that models the variation of effect sizes across replicated\nexperiments. We examine the convergence rates of Bayes factor functions under\ntrue null and true alternative hypotheses. 
Several examples illustrate the\napplication of the Bayes factor functions to replicated experimental designs\nand compare the conclusions from these analyses to other default Bayes factor\nmethods."}, "http://arxiv.org/abs/2310.16256": {"title": "A Causal Disentangled Multi-Granularity Graph Classification Method", "link": "http://arxiv.org/abs/2310.16256", "description": "Graph data widely exists in real life, with large amounts of data and complex\nstructures. It is necessary to map graph data to low-dimensional embedding.\nGraph classification, a critical graph task, mainly relies on identifying the\nimportant substructures within the graph. At present, some graph classification\nmethods do not combine the multi-granularity characteristics of graph data.\nThis lack of granularity distinction in modeling leads to a conflation of key\ninformation and false correlations within the model. So, achieving the desired\ngoal of a credible and interpretable model becomes challenging. This paper\nproposes a causal disentangled multi-granularity graph representation learning\nmethod (CDM-GNN) to solve this challenge. The CDM-GNN model disentangles the\nimportant substructures and bias parts within the graph from a\nmulti-granularity perspective. The disentanglement of the CDM-GNN model reveals\nimportant and bias parts, forming the foundation for its classification task,\nspecifically, model interpretations. The CDM-GNN model exhibits strong\nclassification performance and generates explanatory outcomes aligning with\nhuman cognitive patterns. In order to verify the effectiveness of the model,\nthis paper compares the three real-world datasets MUTAG, PTC, and IMDM-M. Six\nstate-of-the-art models, namely GCN, GAT, Top-k, ASAPool, SUGAR, and SAT are\nemployed for comparison purposes. Additionally, a qualitative analysis of the\ninterpretation results is conducted."}, "http://arxiv.org/abs/2310.16260": {"title": "Private Estimation and Inference in High-Dimensional Regression with FDR Control", "link": "http://arxiv.org/abs/2310.16260", "description": "This paper presents novel methodologies for conducting practical\ndifferentially private (DP) estimation and inference in high-dimensional linear\nregression. We start by proposing a differentially private Bayesian Information\nCriterion (BIC) for selecting the unknown sparsity parameter in DP-Lasso,\neliminating the need for prior knowledge of model sparsity, a requisite in the\nexisting literature. Then we propose a differentially private debiased LASSO\nalgorithm that enables privacy-preserving inference on regression parameters.\nOur proposed method enables accurate and private inference on the regression\nparameters by leveraging the inherent sparsity of high-dimensional linear\nregression models. Additionally, we address the issue of multiple testing in\nhigh-dimensional linear regression by introducing a differentially private\nmultiple testing procedure that controls the false discovery rate (FDR). This\nallows for accurate and privacy-preserving identification of significant\npredictors in the regression model. 
Through extensive simulations and real data\nanalysis, we demonstrate the efficacy of our proposed methods in conducting\ninference for high-dimensional linear models while safeguarding privacy and\ncontrolling the FDR."}, "http://arxiv.org/abs/2310.16284": {"title": "Bayesian Image Mediation Analysis", "link": "http://arxiv.org/abs/2310.16284", "description": "Mediation analysis aims to separate the indirect effect through mediators\nfrom the direct effect of the exposure on the outcome. It is challenging to\nperform mediation analysis with neuroimaging data which involves high\ndimensionality, complex spatial correlations, sparse activation patterns and\nrelatively low signal-to-noise ratio. To address these issues, we develop a new\nspatially varying coefficient structural equation model for Bayesian Image\nMediation Analysis (BIMA). We define spatially varying mediation effects within\nthe potential outcome framework, employing the soft-thresholded Gaussian\nprocess prior for functional parameters. We establish the posterior consistency\nfor spatially varying mediation effects along with selection consistency on\nimportant regions that contribute to the mediation effects. We develop an\nefficient posterior computation algorithm scalable to analysis of large-scale\nimaging data. Through extensive simulations, we show that BIMA can improve the\nestimation accuracy and computational efficiency for high-dimensional mediation\nanalysis over the existing methods. We apply BIMA to analyze the behavioral and\nfMRI data in the Adolescent Brain Cognitive Development (ABCD) study with a\nfocus on inferring the mediation effects of the parental education level on the\nchildren's general cognitive ability that are mediated through the working\nmemory brain activities."}, "http://arxiv.org/abs/2310.16294": {"title": "Producer-Side Experiments Based on Counterfactual Interleaving Designs for Online Recommender Systems", "link": "http://arxiv.org/abs/2310.16294", "description": "Recommender systems have become an integral part of online platforms,\nproviding personalized suggestions for purchasing items, consuming contents,\nand connecting with individuals. An online recommender system consists of two\nsides of components: the producer side comprises product sellers, content\ncreators, or service providers, etc., and the consumer side includes buyers,\nviewers, or guests, etc. To optimize an online recommender system, A/B tests\nserve as the golden standard for comparing different ranking models and\nevaluating their impact on both the consumers and producers. While\nconsumer-side experiments are relatively straightforward to design and commonly\nused to gauge the impact of ranking changes on the behavior of consumers\n(buyers, viewers, etc.), designing producer-side experiments presents a\nconsiderable challenge because producer items in the treatment and control\ngroups need to be ranked by different models and then merged into a single\nranking for the recommender to show to each consumer. 
In this paper, we review\nissues with the existing methods, propose new design principles for\nproducer-side experiments, and develop a rigorous solution based on\ncounterfactual interleaving designs for accurately measuring the effects of\nranking changes on the producers (sellers, creators, etc.)."}, "http://arxiv.org/abs/2310.16466": {"title": "Learning Continuous Network Emerging Dynamics from Scarce Observations via Data-Adaptive Stochastic Processes", "link": "http://arxiv.org/abs/2310.16466", "description": "Learning network dynamics from the empirical structure and spatio-temporal\nobservation data is crucial to revealing the interaction mechanisms of complex\nnetworks in a wide range of domains. However, most existing methods only aim at\nlearning network dynamic behaviors generated by a specific ordinary\ndifferential equation instance, which makes them ineffective for new ones, and\ngenerally require dense observations. The observed data, especially from\nnetwork emerging dynamics, are usually difficult to obtain, which makes model\nlearning difficult. Therefore, how to learn accurate network dynamics\nwith sparse, irregularly-sampled, partial, and noisy observations remains a\nfundamental challenge. We introduce Neural ODE Processes for Network Dynamics\n(NDP4ND), a new class of stochastic processes governed by stochastic\ndata-adaptive network dynamics, to overcome the challenge and learn continuous\nnetwork dynamics from scarce observations. Intensive experiments conducted on\nvarious network dynamics in ecological population evolution, phototaxis\nmovement, brain activity, epidemic spreading, and real-world empirical systems,\ndemonstrate that the proposed method has excellent data adaptability and\ncomputational efficiency, and can adapt to unseen network emerging dynamics,\nproducing accurate interpolation and extrapolation while reducing the ratio of\nrequired observation data to only about 6\\% and improving the learning speed\nfor new dynamics by three orders of magnitude."}, "http://arxiv.org/abs/2310.16489": {"title": "Latent event history models for quasi-reaction systems", "link": "http://arxiv.org/abs/2310.16489", "description": "Various processes can be modelled as quasi-reaction systems of stochastic\ndifferential equations, such as cell differentiation and disease spreading.\nSince the underlying data of particle interactions, such as reactions between\nproteins or contacts between people, are typically unobserved, statistical\ninference of the parameters driving these systems is developed from\nconcentration data measuring each unit in the system over time. While observing\nthe continuous time process at a time scale as fine as possible should in\ntheory help with parameter estimation, the existing Local Linear Approximation\n(LLA) methods fail in this case, due to numerical instability caused by small\nchanges of the system at successive time points. On the other hand, one may be\nable to reconstruct the underlying unobserved interactions from the observed\ncount data. Motivated by this, we first formalise the latent event history\nmodel underlying the observed count process. We then propose a computationally\nefficient Expectation-Maximization algorithm for parameter estimation, with an\nextended Kalman filtering procedure for the prediction of the latent states. A\nsimulation study shows the performance of the proposed method and highlights\nthe settings where it is particularly advantageous compared to the existing LLA\napproaches. 
Finally, we present an illustration of the methodology on the\nspreading of the COVID-19 pandemic in Italy."}, "http://arxiv.org/abs/2310.16502": {"title": "Assessing the overall and partial causal well-specification of nonlinear additive noise models", "link": "http://arxiv.org/abs/2310.16502", "description": "We propose a method to detect model misspecifications in nonlinear causal\nadditive and potentially heteroscedastic noise models. We aim to identify\npredictor variables for which we can infer the causal effect even in cases of\nsuch misspecification. We develop a general framework based on knowledge of the\nmultivariate observational data distribution and we then propose an algorithm\nfor finite sample data, discuss its asymptotic properties, and illustrate its\nperformance on simulated and real data."}, "http://arxiv.org/abs/2310.16600": {"title": "Balancing central and marginal rejection when combining independent significance tests", "link": "http://arxiv.org/abs/2310.16600", "description": "A common approach to evaluating the significance of a collection of\n$p$-values combines them with a pooling function, in particular when the\noriginal data are not available. These pooled $p$-values convert a sample of\n$p$-values into a single number which behaves like a univariate $p$-value. To\nclarify discussion of these functions, a telescoping series of alternative\nhypotheses are introduced that communicate the strength and prevalence of\nnon-null evidence in the $p$-values before general pooling formulae are\ndiscussed. A pattern noticed in the UMP pooled $p$-value for a particular\nalternative motivates the definition and discussion of central and marginal\nrejection levels at $\\alpha$. It is proven that central rejection is always\ngreater than or equal to marginal rejection, motivating a quotient to measure\nthe balance between the two for pooled $p$-values. A combining function based\non the $\\chi^2_{\\kappa}$ quantile transformation is proposed to control this\nquotient and shown to be robust to mis-specified parameters relative to the\nUMP. Different powers for different parameter settings motivate a map of\nplausible alternatives based on where this pooled $p$-value is minimized."}, "http://arxiv.org/abs/2310.16626": {"title": "Scalable Causal Structure Learning via Amortized Conditional Independence Testing", "link": "http://arxiv.org/abs/2310.16626", "description": "Controlling false positives (Type I errors) through statistical hypothesis\ntesting is a foundation of modern scientific data analysis. Existing causal\nstructure discovery algorithms either do not provide Type I error control or\ncannot scale to the size of modern scientific datasets. We consider a variant\nof the causal discovery problem with two sets of nodes, where the only edges of\ninterest form a bipartite causal subgraph between the sets. We develop Scalable\nCausal Structure Learning (SCSL), a method for causal structure discovery on\nbipartite subgraphs that provides Type I error control. SCSL recasts the\ndiscovery problem as a simultaneous hypothesis testing problem and uses\ndiscrete optimization over the set of possible confounders to obtain an upper\nbound on the test statistic for each edge. Semi-synthetic simulations\ndemonstrate that SCSL scales to handle graphs with hundreds of nodes while\nmaintaining error control and good power. 
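Editor's note on the pooled p-value entry above (arXiv:2310.16600): it describes a combining function based on the $\chi^2_{\kappa}$ quantile transformation. One natural reading, shown below as a sketch rather than the paper's exact definition, maps each p-value to an upper-tail $\chi^2_{\kappa}$ quantile and refers the sum to a $\chi^2$ distribution with $M\kappa$ degrees of freedom; with $\kappa = 2$ this reduces to Fisher's method.

```python
import numpy as np
from scipy import stats

def chi2_pool(pvalues, kappa=2.0):
    """Pool independent p-values via a chi-square quantile transformation.

    Each p-value is mapped to an upper-tail chi^2_kappa quantile; the sum of
    the transformed values is compared to a chi^2 distribution with
    M * kappa degrees of freedom.  With kappa = 2 this is Fisher's method.
    """
    p = np.asarray(pvalues, dtype=float)
    t = stats.chi2.isf(p, df=kappa).sum()          # sum of upper-tail quantiles
    return stats.chi2.sf(t, df=kappa * p.size)     # pooled p-value

if __name__ == "__main__":
    print(chi2_pool([0.01, 0.20, 0.35, 0.60], kappa=2.0))   # Fisher's method
    print(chi2_pool([0.01, 0.20, 0.35, 0.60], kappa=8.0))   # heavier-tailed transform
```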
We demonstrate the practical\napplicability of the method by applying it to a cancer dataset to reveal\nconnections between somatic gene mutations and metastases to different tissues."}, "http://arxiv.org/abs/2310.16650": {"title": "Data-integration with pseudoweights and survey-calibration: application to developing US-representative lung cancer risk models for use in screening", "link": "http://arxiv.org/abs/2310.16650", "description": "Accurate cancer risk estimation is crucial to clinical decision-making, such\nas identifying high-risk people for screening. However, most existing cancer\nrisk models incorporate data from epidemiologic studies, which usually cannot\nrepresent the target population. While population-based health surveys are\nideal for making inference to the target population, they typically do not\ncollect time-to-cancer incidence data. Instead, time-to-cancer specific\nmortality is often readily available on surveys via linkage to vital\nstatistics. We develop calibrated pseudoweighting methods that integrate\nindividual-level data from a cohort and a survey, and summary statistics of\ncancer incidence from national cancer registries. By leveraging\nindividual-level cancer mortality data in the survey, the proposed methods\nimpute time-to-cancer incidence for survey sample individuals and use survey\ncalibration with auxiliary variables of influence functions generated from Cox\nregression to improve robustness and efficiency of the inverse-propensity\npseudoweighting method in estimating pure risks. We develop a lung cancer\nincidence pure risk model from the Prostate, Lung, Colorectal, and Ovarian\n(PLCO) Cancer Screening Trial using our proposed methods by integrating data\nfrom the National Health Interview Survey (NHIS) and cancer registries."}, "http://arxiv.org/abs/2310.16653": {"title": "Adaptive importance sampling for heavy-tailed distributions via $\\alpha$-divergence minimization", "link": "http://arxiv.org/abs/2310.16653", "description": "Adaptive importance sampling (AIS) algorithms are widely used to approximate\nexpectations with respect to complicated target probability distributions. When\nthe target has heavy tails, existing AIS algorithms can provide inconsistent\nestimators or exhibit slow convergence, as they often neglect the target's tail\nbehaviour. To avoid this pitfall, we propose an AIS algorithm that approximates\nthe target by Student-t proposal distributions. We adapt location and scale\nparameters by matching the escort moments - which are defined even for\nheavy-tailed distributions - of the target and the proposal. These updates\nminimize the $\\alpha$-divergence between the target and the proposal, thereby\nconnecting with variational inference. We then show that the\n$\\alpha$-divergence can be approximated by a generalized notion of effective\nsample size and leverage this new perspective to adapt the tail parameter with\nBayesian optimization. We demonstrate the efficacy of our approach through\napplications to synthetic targets and a Bayesian Student-t regression task on a\nreal example with clinical trial data."}, "http://arxiv.org/abs/2310.16690": {"title": "Dynamic treatment effect phenotyping through functional survival analysis", "link": "http://arxiv.org/abs/2310.16690", "description": "In recent years, research interest in personalised treatments has been\ngrowing. However, treatment effect heterogeneity and possibly time-varying\ntreatment effects are still often overlooked in clinical studies. 
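Editor's note on the adaptive importance sampling entry above (arXiv:2310.16653): the sketch below illustrates the general template of an AIS loop with a multivariate Student-t proposal, but it substitutes ordinary self-normalized importance-weighted moments for the paper's escort moments and keeps the tail (degrees-of-freedom) parameter fixed instead of adapting it via Bayesian optimization. The function names and defaults are illustrative.

```python
import numpy as np
from scipy import stats

def ais_student_t(log_target, dim, n_iter=20, n_samples=2000, df=5.0, seed=0):
    """Toy adaptive importance sampler with a multivariate Student-t proposal.

    Location and scale are updated from self-normalized importance-weighted
    moments (a simplification of the escort-moment matching in the paper).
    Returns the last proposal together with its weighted samples.
    """
    rng = np.random.default_rng(seed)
    loc, scale = np.zeros(dim), np.eye(dim)
    for _ in range(n_iter):
        prop = stats.multivariate_t(loc=loc, shape=scale, df=df, seed=rng)
        x = prop.rvs(size=n_samples)
        logw = log_target(x) - prop.logpdf(x)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        loc = w @ x                                               # weighted mean
        diff = x - loc
        scale = (w[:, None] * diff).T @ diff + 1e-6 * np.eye(dim) # weighted covariance
    return prop, x, w

if __name__ == "__main__":
    # Heavy-tailed 2-D target: product of independent Student-t(3) densities.
    log_target = lambda x: stats.t.logpdf(x, df=3).sum(axis=-1)
    prop, x, w = ais_student_t(log_target, dim=2)
    print("estimated target mean:", w @ x)
```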
Statistical\ntools are needed for the identification of treatment response patterns, taking\ninto account that treatment response is not constant over time. We aim to\nprovide an innovative method to obtain dynamic treatment effect phenotypes on a\ntime-to-event outcome, conditioned on a set of relevant effect modifiers. The\nproposed method does not require the assumption of proportional hazards for the\ntreatment effect, which is rarely realistic. We propose a spline-based survival\nneural network, inspired by the Royston-Parmar survival model, to estimate\ntime-varying conditional treatment effects. We then exploit the functional\nnature of the resulting estimates to apply a functional clustering of the\ntreatment effect curves in order to identify different patterns of treatment\neffects. The application that motivated this work is the discontinuation of\ntreatment with Mineralocorticoid receptor Antagonists (MRAs) in patients with\nheart failure, where there is no clear evidence as to which patients it is the\nsafest choice to discontinue treatment and, conversely, when it leads to a\nhigher risk of adverse events. The data come from an electronic health record\ndatabase. A simulation study was performed to assess the performance of the\nspline-based neural network and the stability of the treatment response\nphenotyping procedure. We provide a novel method to inform individualized\nmedical decisions by characterising subject-specific treatment responses over\ntime."}, "http://arxiv.org/abs/2310.16698": {"title": "Causal Discovery with Generalized Linear Models through Peeling Algorithms", "link": "http://arxiv.org/abs/2310.16698", "description": "This article presents a novel method for causal discovery with generalized\nstructural equation models suited for analyzing diverse types of outcomes,\nincluding discrete, continuous, and mixed data. Causal discovery often faces\nchallenges due to unmeasured confounders that hinder the identification of\ncausal relationships. The proposed approach addresses this issue by developing\ntwo peeling algorithms (bottom-up and top-down) to ascertain causal\nrelationships and valid instruments. This approach first reconstructs a\nsuper-graph to represent ancestral relationships between variables, using a\npeeling algorithm based on nodewise GLM regressions that exploit relationships\nbetween primary and instrumental variables. Then, it estimates parent-child\neffects from the ancestral relationships using another peeling algorithm while\ndeconfounding a child's model with information borrowed from its parents'\nmodels. The article offers a theoretical analysis of the proposed approach,\nwhich establishes conditions for model identifiability and provides statistical\nguarantees for accurately discovering parent-child relationships via the\npeeling algorithms. Furthermore, the article presents numerical experiments\nshowcasing the effectiveness of our approach in comparison to state-of-the-art\nstructure learning methods without confounders. 
Lastly, it demonstrates an\napplication to Alzheimer's disease (AD), highlighting the utility of the method\nin constructing gene-to-gene and gene-to-disease regulatory networks involving\nSingle Nucleotide Polymorphisms (SNPs) for healthy and AD subjects."}, "http://arxiv.org/abs/2310.16813": {"title": "Improving the Aggregation and Evaluation of NBA Mock Drafts", "link": "http://arxiv.org/abs/2310.16813", "description": "Many enthusiasts and experts publish forecasts of the order players are\ndrafted into professional sports leagues, known as mock drafts. Using a novel\ndataset of mock drafts for the National Basketball Association (NBA), we\nanalyze authors' mock draft accuracy over time and ask how we can reasonably\nuse information from multiple authors. To measure how accurate mock drafts are,\nwe assume that both mock drafts and the actual draft are ranked lists, and we\npropose that rank-biased distance (RBD) of Webber et al. (2010) is the\nappropriate error metric for mock draft accuracy. This is because RBD allows\nmock drafts to have a different length than the actual draft, accounts for\nplayers not appearing in both lists, and weights errors early in the draft more\nthan errors later on. We validate that mock drafts, as expected, improve in\naccuracy over the course of a season, and that accuracy of the mock drafts\nproduced right before their drafts is fairly stable across seasons. To be able\nto combine information from multiple mock drafts into a single consensus mock\ndraft, we also propose a ranked-list combination method based on the ideas of\nranked-choice voting. We show that our method provides improved forecasts over\nthe standard Borda count combination method used for most similar analyses in\nsports, and that either combination method provides a more accurate forecast\nover time than any single author."}, "http://arxiv.org/abs/2310.16824": {"title": "Parametric model for post-processing visibility ensemble forecasts", "link": "http://arxiv.org/abs/2310.16824", "description": "Despite the continuous development of the different operational ensemble\nprediction systems over the past decades, ensemble forecasts still might suffer\nfrom lack of calibration and/or display systematic bias, thus require some\npost-processing to improve their forecast skill. Here we focus on visibility,\nwhich quantity plays a crucial role e.g. in aviation and road safety or in ship\nnavigation, and propose a parametric model where the predictive distribution is\na mixture of a gamma and a truncated normal distribution, both right censored\nat the maximal reported visibility value. The new model is evaluated in two\ncase studies based on visibility ensemble forecasts of the European Centre for\nMedium-Range Weather Forecasts covering two distinct domains in Central and\nWestern Europe and two different time periods. The results of the case studies\nindicate that climatology is substantially superior to the raw ensemble;\nnevertheless, the forecast skill can be further improved by post-processing, at\nleast for short lead times. 
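Editor's note on the NBA mock draft entry above (arXiv:2310.16813): it measures forecast accuracy with the rank-biased distance of Webber et al. (2010). The sketch below implements the simple truncated form of rank-biased overlap and takes RBD = 1 - RBO; the paper may rely on the extrapolated variant for lists of unequal length, so treat this as an illustration of the weighting scheme only.

```python
def rank_biased_overlap(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap (RBO) of two ranked lists.

    A_d is the fraction of items shared by the two depth-d prefixes; the
    geometric weights p**(d-1) emphasize agreement at the top of the lists.
    No extrapolation is applied for the unseen tails.
    """
    depth = max(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    rbo = 0.0
    for d in range(1, depth + 1):
        if d <= len(list_a):
            seen_a.add(list_a[d - 1])
        if d <= len(list_b):
            seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        rbo += (1 - p) * p ** (d - 1) * agreement
    return rbo

def rank_biased_distance(list_a, list_b, p=0.9):
    """Rank-biased distance: 1 - RBO, so 0 means identical rankings."""
    return 1.0 - rank_biased_overlap(list_a, list_b, p)

if __name__ == "__main__":
    mock = ["A", "B", "C", "D"]          # hypothetical mock draft order
    actual = ["A", "C", "B", "E", "D"]   # hypothetical actual draft order
    print(rank_biased_distance(mock, actual))
```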
Moreover, the proposed mixture model consistently\noutperforms the Bayesian model averaging approach used as reference\npost-processing technique."}, "http://arxiv.org/abs/2109.09339": {"title": "Improving the accuracy of estimating indexes in contingency tables using Bayesian estimators", "link": "http://arxiv.org/abs/2109.09339", "description": "In contingency table analysis, one is interested in testing whether a model\nof interest (e.g., the independent or symmetry model) holds using\ngoodness-of-fit tests. When the null hypothesis where the model is true is\nrejected, the interest turns to the degree to which the probability structure\nof the contingency table deviates from the model. Many indexes have been\nstudied to measure the degree of the departure, such as the Yule coefficient\nand Cram\\'er coefficient for the independence model, and Tomizawa's symmetry\nindex for the symmetry model. The inference of these indexes is performed using\nsample proportions, which are estimates of cell probabilities, but it is\nwell-known that the bias and mean square error (MSE) values become large\nwithout a sufficient number of samples. To address the problem, this study\nproposes a new estimator for indexes using Bayesian estimators of cell\nprobabilities. Assuming the Dirichlet distribution for the prior of cell\nprobabilities, we asymptotically evaluate the value of MSE when plugging the\nposterior means of cell probabilities into the index, and propose an estimator\nof the index using the Dirichlet hyperparameter that minimizes the value.\nNumerical experiments show that when the number of samples per cell is small,\nthe proposed method has smaller values of bias and MSE than other methods of\ncorrecting estimation accuracy. We also show that the values of bias and MSE\nare smaller than those obtained by using the uniform and Jeffreys priors."}, "http://arxiv.org/abs/2110.01031": {"title": "A general framework for formulating structured variable selection", "link": "http://arxiv.org/abs/2110.01031", "description": "In variable selection, a selection rule that prescribes the permissible sets\nof selected variables (called a \"selection dictionary\") is desirable due to the\ninherent structural constraints among the candidate variables. Such selection\nrules can be complex in real-world data analyses, and failing to incorporate\nsuch restrictions could not only compromise the interpretability of the model\nbut also lead to decreased prediction accuracy. However, no general framework\nhas been proposed to formalize selection rules and their applications, which\nposes a significant challenge for practitioners seeking to integrate these\nrules into their analyses. In this work, we establish a framework for\nstructured variable selection that can incorporate universal structural\nconstraints. We develop a mathematical language for constructing arbitrary\nselection rules, where the selection dictionary is formally defined. We\ndemonstrate that all selection rules can be expressed as combinations of\noperations on constructs, facilitating the identification of the corresponding\nselection dictionary. Once this selection dictionary is derived, practitioners\ncan apply their own user-defined criteria to select the optimal model.\nAdditionally, our framework enhances existing penalized regression methods for\nvariable selection by providing guidance on how to appropriately group\nvariables to achieve the desired selection rule. 
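Editor's note on the contingency-table entry above (arXiv:2109.09339): it plugs Dirichlet posterior means of cell probabilities into association indexes. The minimal sketch below does this for Cramér's V with a user-chosen symmetric Dirichlet hyperparameter; the paper instead derives the hyperparameter that minimizes the asymptotic MSE, which is not reproduced here.

```python
import numpy as np

def cramers_v_bayes(table, alpha=0.5):
    """Cramer's V computed from Dirichlet posterior-mean cell probabilities.

    With a symmetric Dirichlet(alpha) prior on the I*J cell probabilities, the
    posterior mean of cell (i, j) is (n_ij + alpha) / (N + I*J*alpha).  Here
    alpha is a user input rather than the MSE-minimizing choice of the paper.
    """
    n = np.asarray(table, dtype=float)
    I, J = n.shape
    p = (n + alpha) / (n.sum() + alpha * I * J)    # posterior-mean probabilities
    pr, pc = p.sum(axis=1), p.sum(axis=0)          # row and column marginals
    expected = np.outer(pr, pc)
    phi2 = ((p - expected) ** 2 / expected).sum()  # phi^2 on the probability scale
    return np.sqrt(phi2 / (min(I, J) - 1))

if __name__ == "__main__":
    table = [[12, 3, 1],
             [4, 10, 2]]
    print(cramers_v_bayes(table, alpha=0.5))
```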
Furthermore, our innovative\nframework opens the door to establishing new l0 norm-based penalized regression\ntechniques that can be tailored to respect arbitrary selection rules, thereby\nexpanding the possibilities for more robust and tailored model development."}, "http://arxiv.org/abs/2203.14223": {"title": "Identifying Peer Influence in Therapeutic Communities", "link": "http://arxiv.org/abs/2203.14223", "description": "We investigate if there is a peer influence or role model effect on\nsuccessful graduation from Therapeutic Communities (TCs). We analyze anonymized\nindividual-level observational data from 3 TCs that kept records of written\nexchanges of affirmations and corrections among residents, and their precise\nentry and exit dates. The affirmations allow us to form peer networks, and the\nentry and exit dates allow us to define a causal effect of interest. We\nconceptualize the causal role model effect as measuring the difference in the\nexpected outcome of a resident (ego) who can observe one of their social\ncontacts (e.g., peers who gave affirmations), to be successful in graduating\nbefore the ego's exit vs not successfully graduating before the ego's exit.\nSince peer influence is usually confounded with unobserved homophily in\nobservational data, we model the network with a latent variable model to\nestimate homophily and include it in the outcome equation. We provide a\ntheoretical guarantee that the bias of our peer influence estimator decreases\nwith sample size. Our results indicate there is an effect of peers' graduation\non the graduation of residents. The magnitude of peer influence differs based\non gender, race, and the definition of the role model effect. A counterfactual\nexercise quantifies the potential benefits of intervention of assigning a buddy\nto \"at-risk\" individuals directly on the treated resident and indirectly on\ntheir peers through network propagation."}, "http://arxiv.org/abs/2207.03182": {"title": "Chilled Sampling for Uncertainty Quantification: A Motivation From A Meteorological Inverse Problem", "link": "http://arxiv.org/abs/2207.03182", "description": "Atmospheric motion vectors (AMVs) extracted from satellite imagery are the\nonly wind observations with good global coverage. They are important features\nfor feeding numerical weather prediction (NWP) models. Several Bayesian models\nhave been proposed to estimate AMVs. Although critical for correct assimilation\ninto NWP models, very few methods provide a thorough characterization of the\nestimation errors. The difficulty of estimating errors stems from the\nspecificity of the posterior distribution, which is both very high dimensional,\nand highly ill-conditioned due to a singular likelihood. Motivated by this\ndifficult inverse problem, this work studies the evaluation of the (expected)\nestimation errors using gradient-based Markov Chain Monte Carlo (MCMC)\nalgorithms. The main contribution is to propose a general strategy, called here\nchilling, which amounts to sampling a local approximation of the posterior\ndistribution in the neighborhood of a point estimate. From a theoretical point\nof view, we show that under regularity assumptions, the family of chilled\nposterior distributions converges in distribution as temperature decreases to\nan optimal Gaussian approximation at a point estimate given by the Maximum A\nPosteriori, also known as the Laplace approximation. 
Chilled sampling therefore\nprovides access to this approximation generally out of reach in such\nhigh-dimensional nonlinear contexts. From an empirical perspective, we evaluate\nthe proposed approach based on some quantitative Bayesian criteria. Our\nnumerical simulations are performed on synthetic and real meteorological data.\nThey reveal that not only the proposed chilling exhibits a significant gain in\nterms of accuracy of the point estimates and of their associated expected\nerrors, but also a substantial acceleration in the convergence speed of the\nMCMC algorithms."}, "http://arxiv.org/abs/2207.13612": {"title": "Robust Output Analysis with Monte-Carlo Methodology", "link": "http://arxiv.org/abs/2207.13612", "description": "In predictive modeling with simulation or machine learning, it is critical to\naccurately assess the quality of estimated values through output analysis. In\nrecent decades output analysis has become enriched with methods that quantify\nthe impact of input data uncertainty in the model outputs to increase\nrobustness. However, most developments are applicable assuming that the input\ndata adheres to a parametric family of distributions. We propose a unified\noutput analysis framework for simulation and machine learning outputs through\nthe lens of Monte Carlo sampling. This framework provides nonparametric\nquantification of the variance and bias induced in the outputs with\nhigher-order accuracy. Our new bias-corrected estimation from the model outputs\nleverages the extension of fast iterative bootstrap sampling and higher-order\ninfluence functions. For the scalability of the proposed estimation methods, we\ndevise budget-optimal rules and leverage control variates for variance\nreduction. Our theoretical and numerical results demonstrate a clear advantage\nin building more robust confidence intervals from the model outputs with higher\ncoverage probability."}, "http://arxiv.org/abs/2208.06685": {"title": "Adaptive novelty detection with false discovery rate guarantee", "link": "http://arxiv.org/abs/2208.06685", "description": "This paper studies the semi-supervised novelty detection problem where a set\nof \"typical\" measurements is available to the researcher. Motivated by recent\nadvances in multiple testing and conformal inference, we propose AdaDetect, a\nflexible method that is able to wrap around any probabilistic classification\nalgorithm and control the false discovery rate (FDR) on detected novelties in\nfinite samples without any distributional assumption other than\nexchangeability. In contrast to classical FDR-controlling procedures that are\noften committed to a pre-specified p-value function, AdaDetect learns the\ntransformation in a data-adaptive manner to focus the power on the directions\nthat distinguish between inliers and outliers. Inspired by the multiple testing\nliterature, we further propose variants of AdaDetect that are adaptive to the\nproportion of nulls while maintaining the finite-sample FDR control. The\nmethods are illustrated on synthetic datasets and real-world datasets,\nincluding an application in astrophysics."}, "http://arxiv.org/abs/2211.02582": {"title": "Inference for Network Count Time Series with the R Package PNAR", "link": "http://arxiv.org/abs/2211.02582", "description": "We introduce a new R package useful for inference about network count time\nseries. Such data are frequently encountered in statistics and they are usually\ntreated as multivariate time series. 
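Editor's note on the AdaDetect entry above (arXiv:2208.06685): it wraps a probabilistic classifier to produce data-adaptive scores and controls the FDR on detected novelties. The sketch below follows that general recipe (split the typical sample, train a classifier to separate nulls from the test mixture, convert held-out scores to conformal-style p-values, apply Benjamini-Hochberg); the paper's exact data-splitting and score construction differ, so this is only an illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def adadetect_like(typical, test, alpha=0.1, seed=0):
    """Semi-supervised novelty detection with FDR control (AdaDetect-style sketch).

    typical: (n, d) array of "typical" (null) observations
    test:    (n_test, d) array of observations to screen for novelties
    Returns the indices of test points flagged as novelties and their p-values.
    """
    rng = np.random.default_rng(seed)
    typical, test = np.asarray(typical, float), np.asarray(test, float)
    train_null, cal_null = np.array_split(rng.permutation(typical), 2)

    # Train a classifier to separate one half of the nulls from the test sample.
    X = np.vstack([train_null, test])
    y = np.r_[np.zeros(len(train_null)), np.ones(len(test))]
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)

    s_cal = clf.predict_proba(cal_null)[:, 1]   # scores on held-out nulls
    s_test = clf.predict_proba(test)[:, 1]      # scores on test points
    pvals = (1 + (s_cal[None, :] >= s_test[:, None]).sum(axis=1)) / (len(s_cal) + 1)

    # Benjamini-Hochberg at level alpha.
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, len(pvals) + 1) / len(pvals)
    passed = np.nonzero(pvals[order] <= thresh)[0]
    rejected = order[: passed.max() + 1] if passed.size else np.array([], dtype=int)
    return rejected, pvals
```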
Their statistical analysis is based on\nlinear or log linear models. Nonlinear models, which have been applied\nsuccessfully in several research areas, have been neglected from such\napplications mainly because of their computational complexity. We provide R\nusers the flexibility to fit and study nonlinear network count time series\nmodels which include either a drift in the intercept or a regime switching\nmechanism. We develop several computational tools including estimation of\nvarious count Network Autoregressive models and fast computational algorithms\nfor testing linearity in standard cases and when non-identifiable parameters\nhamper the analysis. Finally, we introduce a copula Poisson algorithm for\nsimulating multivariate network count time series. We illustrate the\nmethodology by modeling weekly number of influenza cases in Germany."}, "http://arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "http://arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics\nand often exhibits community-like structures, where each component (node) along\neach different mode has a community membership associated with it. In this\npaper we propose the tensor mixed-membership blockmodel, a generalization of\nthe tensor blockmodel positing that memberships need not be discrete, but\ninstead are convex combinations of latent communities. We establish the\nidentifiability of our model and propose a computationally efficient estimation\nprocedure based on the higher-order orthogonal iteration algorithm (HOOI) for\ntensor SVD composed with a simplex corner-finding algorithm. We then\ndemonstrate the consistency of our estimation procedure by providing a per-node\nerror bound, which showcases the effect of higher-order structures on\nestimation accuracy. To prove our consistency result, we develop the\n$\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent,\npossibly heteroskedastic, subgaussian noise that may be of independent\ninterest. Our analysis uses a novel leave-one-out construction for the\niterates, and our bounds depend only on spectral properties of the underlying\nlow-rank tensor under nearly optimal signal-to-noise ratio conditions such that\ntensor SVD is computationally feasible. Whereas other leave-one-out analyses\ntypically focus on sequences constructed by analyzing the output of a given\nalgorithm with a small part of the noise removed, our leave-one-out analysis\nconstructions use both the previous iterates and the additional tensor\nstructure to eliminate a potential additional source of error. Finally, we\napply our methodology to real and simulated data, including applications to two\nflight datasets and a trade network dataset, demonstrating some effects not\nidentifiable from the model with discrete community memberships."}, "http://arxiv.org/abs/2304.10372": {"title": "Statistical inference for Gaussian Whittle-Mat\\'ern fields on metric graphs", "link": "http://arxiv.org/abs/2304.10372", "description": "Whittle-Mat\\'ern fields are a recently introduced class of Gaussian processes\non metric graphs, which are specified as solutions to a fractional-order\nstochastic differential equation. 
Unlike earlier covariance-based approaches\nfor specifying Gaussian fields on metric graphs, the Whittle-Mat\\'ern fields\nare well-defined for any compact metric graph and can provide Gaussian\nprocesses with differentiable sample paths. We derive the main statistical\nproperties of the model class, particularly the consistency and asymptotic\nnormality of maximum likelihood estimators of model parameters and the\nnecessary and sufficient conditions for asymptotic optimality properties of\nlinear prediction based on the model with misspecified parameters.\n\nThe covariance function of the Whittle-Mat\\'ern fields is generally\nunavailable in closed form, and they have therefore been challenging to use for\nstatistical inference. However, we show that for specific values of the\nfractional exponent, when the fields have Markov properties, likelihood-based\ninference and spatial prediction can be performed exactly and computationally\nefficiently. This facilitates using the Whittle-Mat\\'ern fields in statistical\napplications involving big datasets without the need for any approximations.\nThe methods are illustrated via an application to modeling of traffic data,\nwhere allowing for differentiable processes dramatically improves the results."}, "http://arxiv.org/abs/2305.09282": {"title": "Errors-in-variables Fr\\'echet Regression with Low-rank Covariate Approximation", "link": "http://arxiv.org/abs/2305.09282", "description": "Fr\\'echet regression has emerged as a promising approach for regression\nanalysis involving non-Euclidean response variables. However, its practical\napplicability has been hindered by its reliance on ideal scenarios with\nabundant and noiseless covariate data. In this paper, we present a novel\nestimation method that tackles these limitations by leveraging the low-rank\nstructure inherent in the covariate matrix. Our proposed framework combines the\nconcepts of global Fr\\'echet regression and principal component regression,\naiming to improve the efficiency and accuracy of the regression estimator. By\nincorporating the low-rank structure, our method enables more effective\nmodeling and estimation, particularly in high-dimensional and\nerrors-in-variables regression settings. We provide a theoretical analysis of\nthe proposed estimator's large-sample properties, including a comprehensive\nrate analysis of bias, variance, and additional variations due to measurement\nerrors. Furthermore, our numerical experiments provide empirical evidence that\nsupports the theoretical findings, demonstrating the superior performance of\nour approach. Overall, this work introduces a promising framework for\nregression analysis of non-Euclidean variables, effectively addressing the\nchallenges associated with limited and noisy covariate data, with potential\napplications in diverse fields."}, "http://arxiv.org/abs/2305.19417": {"title": "Model averaging approaches to data subset selection", "link": "http://arxiv.org/abs/2305.19417", "description": "Model averaging is a useful and robust method for dealing with model\nuncertainty in statistical analysis. Often, it is useful to consider data\nsubset selection at the same time, in which model selection criteria are used\nto compare models across different subsets of the data. Two different criteria\nhave been proposed in the literature for how the data subsets should be\nweighted. 
We compare the two criteria closely in a unified treatment based on\nthe Kullback-Leibler divergence, and conclude that one of them is subtly flawed\nand will tend to yield larger uncertainties due to loss of information.\nAnalytical and numerical examples are provided."}, "http://arxiv.org/abs/2309.06053": {"title": "Confounder selection via iterative graph expansion", "link": "http://arxiv.org/abs/2309.06053", "description": "Confounder selection, namely choosing a set of covariates to control for\nconfounding between a treatment and an outcome, is arguably the most important\nstep in the design of observational studies. Previous methods, such as Pearl's\ncelebrated back-door criterion, typically require pre-specifying a causal\ngraph, which can often be difficult in practice. We propose an interactive\nprocedure for confounder selection that does not require pre-specifying the\ngraph or the set of observed variables. This procedure iteratively expands the\ncausal graph by finding what we call \"primary adjustment sets\" for a pair of\npossibly confounded variables. This can be viewed as inverting a sequence of\nlatent projections of the underlying causal graph. Structural information in\nthe form of primary adjustment sets is elicited from the user, bit by bit,\nuntil either a set of covariates are found to control for confounding or it can\nbe determined that no such set exists. Other information, such as the causal\nrelations between confounders, is not required by the procedure. We show that\nif the user correctly specifies the primary adjustment sets in every step, our\nprocedure is both sound and complete."}, "http://arxiv.org/abs/2310.16989": {"title": "Randomization Inference When N Equals One", "link": "http://arxiv.org/abs/2310.16989", "description": "N-of-1 experiments, where a unit serves as its own control and treatment in\ndifferent time windows, have been used in certain medical contexts for decades.\nHowever, due to effects that accumulate over long time windows and\ninterventions that have complex evolution, a lack of robust inference tools has\nlimited the widespread applicability of such N-of-1 designs. This work combines\ntechniques from experiment design in causal inference and system identification\nfrom control theory to provide such an inference framework. We derive a model\nof the dynamic interference effect that arises in linear time-invariant\ndynamical systems. We show that a family of causal estimands analogous to those\nstudied in potential outcomes are estimable via a standard estimator derived\nfrom the method of moments. We derive formulae for higher moments of this\nestimator and describe conditions under which N-of-1 designs may provide faster\nways to estimate the effects of interventions in dynamical systems. We also\nprovide conditions under which our estimator is asymptotically normal and\nderive valid confidence intervals for this setting."}, "http://arxiv.org/abs/2310.17009": {"title": "Simulation based stacking", "link": "http://arxiv.org/abs/2310.17009", "description": "Simulation-based inference has been popular for amortized Bayesian\ncomputation. It is typical to have more than one posterior approximation, from\ndifferent inference algorithms, different architectures, or simply the\nrandomness of initialization and stochastic gradients. With a provable\nasymptotic guarantee, we present a general stacking framework to make use of\nall available posterior approximations. 
Our stacking method is able to combine\ndensities, simulation draws, confidence intervals, and moments, and address the\noverall precision, calibration, coverage, and bias at the same time. We\nillustrate our method on several benchmark simulations and a challenging\ncosmological inference task."}, "http://arxiv.org/abs/2310.17153": {"title": "Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration", "link": "http://arxiv.org/abs/2310.17153", "description": "Semi-implicit variational inference (SIVI) has been introduced to expand the\nanalytical variational families by defining expressive semi-implicit\ndistributions in a hierarchical manner. However, the single-layer architecture\ncommonly used in current SIVI methods can be insufficient when the target\nposterior has complicated structures. In this paper, we propose hierarchical\nsemi-implicit variational inference, called HSIVI, which generalizes SIVI to\nallow more expressive multi-layer construction of semi-implicit distributions.\nBy introducing auxiliary distributions that interpolate between a simple base\ndistribution and the target distribution, the conditional layers can be trained\nby progressively matching these auxiliary distributions one layer after\nanother. Moreover, given pre-trained score networks, HSIVI can be used to\naccelerate the sampling process of diffusion models with the score matching\nobjective. We show that HSIVI significantly enhances the expressiveness of SIVI\non several Bayesian inference problems with complicated target distributions.\nWhen used for diffusion model acceleration, we show that HSIVI can produce high\nquality samples comparable to or better than the existing fast diffusion model\nbased samplers with a small number of function evaluations on various datasets."}, "http://arxiv.org/abs/2310.17165": {"title": "Price Experimentation and Interference in Online Platforms", "link": "http://arxiv.org/abs/2310.17165", "description": "In this paper, we examine the biases arising in A/B tests where a firm\nmodifies a continuous parameter, such as price, to estimate the global\ntreatment effect associated to a given performance metric. Such biases emerge\nfrom canonical designs and estimators due to interference among market\nparticipants. We employ structural modeling and differential calculus to derive\nintuitive structural characterizations of this bias. We then specialize our\ngeneral model to a standard revenue management pricing problem. This setting\nhighlights a key potential pitfall in the use of pricing experiments to guide\nprofit maximization: notably, the canonical estimator for the change in profits\ncan have the {\\em wrong sign}. In other words, following the guidance of the\ncanonical estimator may lead the firm to move prices in the wrong direction,\nand thereby decrease profits relative to the status quo. We apply these results\nto a two-sided market model and show how this ``change of sign\" regime depends\non model parameters, and discuss structural and practical implications for\nplatform operators."}, "http://arxiv.org/abs/2310.17248": {"title": "The observed Fisher information attached to the EM algorithm, illustrated on Shepp and Vardi estimation procedure for positron emission tomography", "link": "http://arxiv.org/abs/2310.17248", "description": "The Shepp & Vardi (1982) implementation of the EM algorithm for PET scan\ntumor estimation provides a point estimate of the tumor. 
The current study\npresents a closed-form formula of the observed Fisher information for Shepp &\nVardi PET scan tumor estimation. Keywords: PET scan, EM algorithm, Fisher\ninformation matrix, standard errors."}, "http://arxiv.org/abs/2310.17308": {"title": "Wild Bootstrap for Counting Process-Based Statistics", "link": "http://arxiv.org/abs/2310.17308", "description": "The wild bootstrap is a popular resampling method in the context of\ntime-to-event data analyses. Previous works established the large sample\nproperties of it for applications to different estimators and test statistics.\nIt can be used to justify the accuracy of inference procedures such as\nhypothesis tests or time-simultaneous confidence bands. This paper consists of\ntwo parts: in Part~I, a general framework is developed in which the large\nsample properties are established in a unified way by using martingale\nstructures. The framework includes most of the well-known non- and\nsemiparametric statistical methods in time-to-event analysis and parametric\napproaches. In Part II, the Fine-Gray proportional sub-hazards model\nexemplifies the theory for inference on cumulative incidence functions given\nthe covariates. The model falls within the framework if the data are\ncensoring-complete. A simulation study demonstrates the reliability of the\nmethod and an application to a data set about hospital-acquired infections\nillustrates the statistical procedure."}, "http://arxiv.org/abs/2310.17334": {"title": "Bayesian Optimization for Personalized Dose-Finding Trials with Combination Therapies", "link": "http://arxiv.org/abs/2310.17334", "description": "Identification of optimal dose combinations in early phase dose-finding\ntrials is challenging, due to the trade-off between precisely estimating the\nmany parameters required to flexibly model the dose-response surface, and the\nsmall sample sizes in early phase trials. Existing methods often restrict the\nsearch to pre-defined dose combinations, which may fail to identify regions of\noptimality in the dose combination space. These difficulties are even more\npertinent in the context of personalized dose-finding, where patient\ncharacteristics are used to identify tailored optimal dose combinations. To\novercome these challenges, we propose the use of Bayesian optimization for\nfinding optimal dose combinations in standard (\"one size fits all\") and\npersonalized multi-agent dose-finding trials. Bayesian optimization is a method\nfor estimating the global optima of expensive-to-evaluate objective functions.\nThe objective function is approximated by a surrogate model, commonly a\nGaussian process, paired with a sequential design strategy to select the next\npoint via an acquisition function. This work is motivated by an\nindustry-sponsored problem, where focus is on optimizing a dual-agent therapy\nin a setting featuring minimal toxicity. To compare the performance of the\nstandard and personalized methods under this setting, simulation studies are\nperformed for a variety of scenarios. Our study concludes that taking a\npersonalized approach is highly beneficial in the presence of heterogeneity."}, "http://arxiv.org/abs/2310.17434": {"title": "The `Why' behind including `Y' in your imputation model", "link": "http://arxiv.org/abs/2310.17434", "description": "Missing data is a common challenge when analyzing epidemiological data, and\nimputation is often used to address this issue. 
Here, we investigate the\nscenario where a covariate used in an analysis has missingness and will be\nimputed. There are recommendations to include the outcome from the analysis\nmodel in the imputation model for missing covariates, but it is not necessarily\nclear whether this recommendation always holds and why this is sometimes true. We\nexamine deterministic imputation (i.e., single imputation where the imputed\nvalues are treated as fixed) and stochastic imputation (i.e., single imputation\nwith a random value or multiple imputation) methods and their implications for\nestimating the relationship between the imputed covariate and the outcome. We\nmathematically demonstrate that including the outcome variable in imputation\nmodels is not just a recommendation but a requirement to achieve unbiased\nresults when using stochastic imputation methods. Moreover, we dispel common\nmisconceptions about deterministic imputation models and demonstrate why the\noutcome should not be included in these models. This paper aims to bridge the\ngap between imputation in theory and in practice, providing mathematical\nderivations to explain common statistical recommendations. We offer a better\nunderstanding of the considerations involved in imputing missing covariates and\nemphasize when it is necessary to include the outcome variable in the\nimputation model."}, "http://arxiv.org/abs/2310.17440": {"title": "Gibbs optimal design of experiments", "link": "http://arxiv.org/abs/2310.17440", "description": "Bayesian optimal design of experiments is a well-established approach to\nplanning experiments. Briefly, a probability distribution, known as a\nstatistical model, for the responses is assumed which is dependent on a vector\nof unknown parameters. A utility function is then specified which gives the\ngain in information for estimating the true value of the parameters using the\nBayesian posterior distribution. A Bayesian optimal design is given by\nmaximising the expectation of the utility with respect to the joint\ndistribution given by the statistical model and prior distribution for the true\nparameter values. The approach takes account of the experimental aim via\nspecification of the utility and of all assumed sources of uncertainty via the\nexpected utility. However, it is predicated on the specification of the\nstatistical model. Recently, a new type of statistical inference, known as\nGibbs (or General Bayesian) inference, has been advanced. This is\nBayesian-like, in that uncertainty on unknown quantities is represented by a\nposterior distribution, but does not necessarily rely on specification of a\nstatistical model. Thus the resulting inference should be less sensitive to\nmisspecification of the statistical model. The purpose of this paper is to\npropose Gibbs optimal design: a framework for optimal design of experiments for\nGibbs inference. The concept behind the framework is introduced along with a\ncomputational approach to find Gibbs optimal designs in practice. The framework\nis demonstrated on exemplars including linear models, and experiments with\ncount and time-to-event responses."}, "http://arxiv.org/abs/2310.17546": {"title": "A changepoint approach to modelling non-stationary soil moisture dynamics", "link": "http://arxiv.org/abs/2310.17546", "description": "Soil moisture dynamics provide an indicator of soil health that scientists\nmodel via soil drydown curves. 
The typical modeling process requires the soil\nmoisture time series to be manually separated into drydown segments and then\nexponential decay models are fitted to them independently. Sensor development\nover recent years means that experiments that were previously conducted over a\nfew field campaigns can now be scaled to months or even years, often at a\nhigher sampling rate. Manual identification of drydown segments is no longer\npractical. To better meet the challenge of increasing data size, this paper\nproposes a novel changepoint-based approach to automatically identify\nstructural changes in the soil drying process, and estimate the parameters\ncharacterizing the drying processes simultaneously. A simulation study is\ncarried out to assess the performance of the method. The results demonstrate\nits ability to identify structural changes and retrieve key parameters of\ninterest to soil scientists. The method is applied to hourly soil moisture time\nseries from the NEON data portal to investigate the temporal dynamics of soil\nmoisture drydown. We recover known relationships previously identified\nmanually, alongside delivering new insights into the temporal variability\nacross soil types and locations."}, "http://arxiv.org/abs/2310.17629": {"title": "Approximate Leave-one-out Cross Validation for Regression with $\\ell_1$ Regularizers (extended version)", "link": "http://arxiv.org/abs/2310.17629", "description": "The out-of-sample error (OO) is the main quantity of interest in risk\nestimation and model selection. Leave-one-out cross validation (LO) offers a\n(nearly) distribution-free yet computationally demanding approach to estimate\nOO. Recent theoretical work showed that approximate leave-one-out cross\nvalidation (ALO) is a computationally efficient and statistically reliable\nestimate of LO (and OO) for generalized linear models with differentiable\nregularizers. For problems involving non-differentiable regularizers, despite\nsignificant empirical evidence, the theoretical understanding of ALO's error\nremains unknown. In this paper, we present a novel theory for a wide class of\nproblems in the generalized linear model family with non-differentiable\nregularizers. We bound the error |ALO - LO| in terms of intuitive metrics such\nas the size of leave-i-out perturbations in active sets, sample size n, number\nof features p and regularization parameters. As a consequence, for the\n$\\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes\nto infinity while n/p and SNR are fixed and bounded."}, "http://arxiv.org/abs/2108.04201": {"title": "Guaranteed Functional Tensor Singular Value Decomposition", "link": "http://arxiv.org/abs/2108.04201", "description": "This paper introduces the functional tensor singular value decomposition\n(FTSVD), a novel dimension reduction framework for tensors with one functional\nmode and several tabular modes. The problem is motivated by high-order\nlongitudinal data analysis. Our model assumes the observed data to be a random\nrealization of an approximate CP low-rank functional tensor measured on a\ndiscrete time grid. Incorporating tensor algebra and the theory of Reproducing\nKernel Hilbert Space (RKHS), we propose a novel RKHS-based constrained power\niteration with spectral initialization. Our method can successfully estimate\nboth singular vectors and functions of the low-rank structure in the observed\ndata. With mild assumptions, we establish the non-asymptotic contractive error\nbounds for the proposed algorithm. 
The superiority of the proposed framework is\ndemonstrated via extensive experiments on both simulated and real data."}, "http://arxiv.org/abs/2202.02146": {"title": "Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net", "link": "http://arxiv.org/abs/2202.02146", "description": "The elastic net combines lasso and ridge regression to fuse the sparsity\nproperty of lasso with the grouping property of ridge regression. The\nconnections between ridge regression and gradient descent and between lasso and\nforward stagewise regression have previously been shown. Similar to how the\nelastic net generalizes lasso and ridge regression, we introduce elastic\ngradient descent, a generalization of gradient descent and forward stagewise\nregression. We theoretically analyze elastic gradient descent and compare it to\nthe elastic net and forward stagewise regression. Parts of the analysis are\nbased on elastic gradient flow, a piecewise analytical construction, obtained\nfor elastic gradient descent with infinitesimal step size. We also compare\nelastic gradient descent to the elastic net on real and simulated data and show\nthat it provides similar solution paths, but is several orders of magnitude\nfaster. Compared to forward stagewise regression, elastic gradient descent\nselects a model that, although still sparse, provides considerably lower\nprediction and estimation errors."}, "http://arxiv.org/abs/2202.03897": {"title": "Inference from Sampling with Response Probabilities Estimated via Calibration", "link": "http://arxiv.org/abs/2202.03897", "description": "A solution to control for nonresponse bias consists of multiplying the design\nweights of respondents by the inverse of estimated response probabilities to\ncompensate for the nonrespondents. Maximum likelihood and calibration are two\napproaches that can be applied to obtain estimated response probabilities. We\nconsider a common framework in which these approaches can be compared. We\ndevelop an asymptotic study of the behavior of the resulting estimator when\ncalibration is applied. A logistic regression model for the response\nprobabilities is postulated. Missing at random and unclustered data are\nsupposed. Three main contributions of this work are: 1) we show that the\nestimators with the response probabilities estimated via calibration are\nasymptotically equivalent to unbiased estimators and that a gain in efficiency\nis obtained when estimating the response probabilities via calibration as\ncompared to the estimator with the true response probabilities, 2) we show that\nthe estimators with the response probabilities estimated via calibration are\ndoubly robust to model misspecification and explain why double robustness is\nnot guaranteed when maximum likelihood is applied, and 3) we discuss and\nillustrate problems related to response probabilities estimation, namely\nexistence of a solution to the estimating equations, problems of convergence,\nand extreme weights. We explain and illustrate why the first aforementioned\nproblem is more likely with calibration than with maximum likelihood\nestimation. 
We present the results of a simulation study in order to illustrate\nthese elements."}, "http://arxiv.org/abs/2208.14951": {"title": "Statistical inference for multivariate extremes via a geometric approach", "link": "http://arxiv.org/abs/2208.14951", "description": "A geometric representation for multivariate extremes, based on the shapes of\nscaled sample clouds in light-tailed margins and their so-called limit sets,\nhas recently been shown to connect several existing extremal dependence\nconcepts. However, these results are purely probabilistic, and the geometric\napproach itself has not been fully exploited for statistical inference. We\noutline a method for parametric estimation of the limit set shape, which\nincludes a useful non/semi-parametric estimate as a pre-processing step. More\nfundamentally, our approach provides a new class of asymptotically-motivated\nstatistical models for the tails of multivariate distributions, and such models\ncan accommodate any combination of simultaneous or non-simultaneous extremes\nthrough appropriate parametric forms for the limit set shape. Extrapolation\nfurther into the tail of the distribution is possible via simulation from the\nfitted model. A simulation study confirms that our methodology is very\ncompetitive with existing approaches, and can successfully allow estimation of\nsmall probabilities in regions where other methods struggle. We apply the\nmethodology to two environmental datasets, with diagnostics demonstrating a\ngood fit."}, "http://arxiv.org/abs/2209.08889": {"title": "Inference of nonlinear causal effects with GWAS summary data", "link": "http://arxiv.org/abs/2209.08889", "description": "Large-scale genome-wide association studies (GWAS) have offered an exciting\nopportunity to discover putative causal genes or risk factors associated with\ndiseases by using SNPs as instrumental variables (IVs). However, conventional\napproaches assume linear causal relations partly for simplicity and partly for\nthe availability of GWAS summary data. In this work, we propose a novel model\n{for transcriptome-wide association studies (TWAS)} to incorporate nonlinear\nrelationships across IVs, an exposure/gene, and an outcome, which is robust\nagainst violations of the valid IV assumptions, permits the use of GWAS summary\ndata, and covers two-stage least squares as a special case. We decouple the\nestimation of a marginal causal effect and a nonlinear transformation, where\nthe former is estimated via sliced inverse regression and a sparse instrumental\nvariable regression, and the latter is estimated by a ratio-adjusted inverse\nregression. On this ground, we propose an inferential procedure. An application\nof the proposed method to the ADNI gene expression data and the IGAP GWAS\nsummary data identifies 18 causal genes associated with Alzheimer's disease,\nincluding APOE and TOMM40, in addition to 7 other genes missed by two-stage\nleast squares considering only linear relationships. Our findings suggest that\nnonlinear modeling is required to unleash the power of IV regression for\nidentifying potentially nonlinear gene-trait associations. 
Accompanying this\npaper is our Python library \\texttt{nl-causal}\n(\\url{https://nonlinear-causal.readthedocs.io/}) that implements the proposed\nmethod."}, "http://arxiv.org/abs/2301.03038": {"title": "Skewed Bernstein-von Mises theorem and skew-modal approximations", "link": "http://arxiv.org/abs/2301.03038", "description": "Gaussian approximations are routinely employed in Bayesian statistics to ease\ninference when the target posterior is intractable. Although these\napproximations are asymptotically justified by Bernstein-von Mises type\nresults, in practice the expected Gaussian behavior may poorly represent the\nshape of the posterior, thus affecting approximation accuracy. Motivated by\nthese considerations, we derive an improved class of closed-form approximations\nof posterior distributions which arise from a new treatment of a third-order\nversion of the Laplace method yielding approximations in a tractable family of\nskew-symmetric distributions. Under general assumptions which account for\nmisspecified models and non-i.i.d. settings, this family of approximations is\nshown to have a total variation distance from the target posterior whose rate\nof convergence improves by at least one order of magnitude the one established\nby the classical Bernstein-von Mises theorem. Specializing this result to the\ncase of regular parametric models shows that the same improvement in\napproximation accuracy can be also derived for polynomially bounded posterior\nfunctionals. Unlike other higher-order approximations, our results prove that\nit is possible to derive closed-form and valid densities which are expected to\nprovide, in practice, a more accurate, yet similarly-tractable, alternative to\nGaussian approximations of the target posterior, while inheriting its limiting\nfrequentist properties. We strengthen such arguments by developing a practical\nskew-modal approximation for both joint and marginal posteriors that achieves\nthe same theoretical guarantees of its theoretical counterpart by replacing the\nunknown model parameters with the corresponding MAP estimate. Empirical studies\nconfirm that our theoretical results closely match the remarkable performance\nobserved in practice, even in finite, possibly small, sample regimes."}, "http://arxiv.org/abs/2303.05878": {"title": "Identification and Estimation of Causal Effects with Confounders Missing Not at Random", "link": "http://arxiv.org/abs/2303.05878", "description": "Making causal inferences from observational studies can be challenging when\nconfounders are missing not at random. In such cases, identifying causal\neffects is often not guaranteed. Motivated by a real example, we consider a\ntreatment-independent missingness assumption under which we establish the\nidentification of causal effects when confounders are missing not at random. We\npropose a weighted estimating equation (WEE) approach for estimating model\nparameters and introduce three estimators for the average causal effect, based\non regression, propensity score weighting, and doubly robust estimation. We\nevaluate the performance of these estimators through simulations, and provide a\nreal data analysis to illustrate our proposed method."}, "http://arxiv.org/abs/2305.12283": {"title": "Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods", "link": "http://arxiv.org/abs/2305.12283", "description": "In this paper, we consider the uncertainty quantification problem for\nregression models. 
Specifically, we consider an individual calibration\nobjective for characterizing the quantiles of the prediction model. While such\nan objective is well-motivated by downstream tasks such as newsvendor cost,\nthe existing methods have been largely heuristic and lack statistical\nguarantees in terms of individual calibration. We show via simple examples that\nthe existing methods focusing on population-level calibration guarantees such\nas average calibration or sharpness can lead to harmful and unexpected results.\nWe propose simple nonparametric calibration methods that are agnostic of the\nunderlying prediction model and enjoy both computational efficiency and\nstatistical consistency. Our approach enables a better understanding of the\npossibility of individual calibration, and we establish matching upper and\nlower bounds for the calibration error of our proposed methods. Technically,\nour analysis combines the nonparametric analysis with a covering number\nargument for parametric analysis, which advances the existing theoretical\nanalyses in the literature of nonparametric density estimation and quantile\nbandit problems. Importantly, the nonparametric perspective sheds new\ntheoretical insights into regression calibration in terms of the curse of\ndimensionality and reconciles the existing results on the impossibility of\nindividual calibration. To our knowledge, we make the first effort to reach\nboth individual calibration and finite-sample guarantee with minimal\nassumptions in terms of conformal prediction. Numerical experiments show the\nadvantage of such a simple approach under various metrics, and also under\ncovariate shift. We hope our work provides a simple benchmark and a starting\npoint of theoretical ground for future research on regression calibration."}, "http://arxiv.org/abs/2305.14943": {"title": "Learning Rate Free Bayesian Inference in Constrained Domains", "link": "http://arxiv.org/abs/2305.14943", "description": "We introduce a suite of new particle-based algorithms for sampling on\nconstrained domains which are entirely learning rate free. Our approach\nleverages coin betting ideas from convex optimisation, and the viewpoint of\nconstrained sampling as a mirrored optimisation problem on the space of\nprobability measures. Based on this viewpoint, we also introduce a unifying\nframework for several existing constrained sampling algorithms, including\nmirrored Langevin dynamics and mirrored Stein variational gradient descent. We\ndemonstrate the performance of our algorithms on a range of numerical examples,\nincluding sampling from targets on the simplex, sampling with fairness\nconstraints, and constrained sampling problems in post-selection inference. Our\nresults indicate that our algorithms achieve competitive performance with\nexisting constrained sampling methods, without the need to tune any\nhyperparameters."}, "http://arxiv.org/abs/2308.07983": {"title": "Monte Carlo guided Diffusion for Bayesian linear inverse problems", "link": "http://arxiv.org/abs/2308.07983", "description": "Ill-posed linear inverse problems arise frequently in various applications,\nfrom computational photography to medical imaging. A recent line of research\nexploits Bayesian inference with informative priors to handle the ill-posedness\nof such problems. Amongst such priors, score-based generative models (SGM) have\nrecently been successfully applied to several different inverse problems. 
In\nthis study, we exploit the particular structure of the prior defined by the SGM\nto define a sequence of intermediate linear inverse problems. As the noise\nlevel decreases, the posteriors of these inverse problems get closer to the\ntarget posterior of the original inverse problem. To sample from this sequence\nof posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The\nproposed algorithm, MCGDiff, is shown to be theoretically grounded and we\nprovide numerical simulations showing that it outperforms competing baselines\nwhen dealing with ill-posed inverse problems in a Bayesian setting."}, "http://arxiv.org/abs/2309.16843": {"title": "A Mean Field Approach to Empirical Bayes Estimation in High-dimensional Linear Regression", "link": "http://arxiv.org/abs/2309.16843", "description": "We study empirical Bayes estimation in high-dimensional linear regression. To\nfacilitate computationally efficient estimation of the underlying prior, we\nadopt a variational empirical Bayes approach, introduced originally in\nCarbonetto and Stephens (2012) and Kim et al. (2022). We establish asymptotic\nconsistency of the nonparametric maximum likelihood estimator (NPMLE) and its\n(computable) naive mean field variational surrogate under mild assumptions on\nthe design and the prior. Assuming, in addition, that the naive mean field\napproximation has a dominant optimizer, we develop a computationally efficient\napproximation to the oracle posterior distribution, and establish its accuracy\nunder the 1-Wasserstein metric. This enables computationally feasible Bayesian\ninference; e.g., construction of posterior credible intervals with an average\ncoverage guarantee, Bayes optimal estimation for the regression coefficients,\nestimation of the proportion of non-nulls, etc. Our analysis covers both\ndeterministic and random designs, and accommodates correlations among the\nfeatures. To the best of our knowledge, this provides the first rigorous\nnonparametric empirical Bayes method in a high-dimensional regression setting\nwithout sparsity."}, "http://arxiv.org/abs/2310.17679": {"title": "Fast Scalable and Accurate Discovery of DAGs Using the Best Order Score Search and Grow-Shrink Trees", "link": "http://arxiv.org/abs/2310.17679", "description": "Learning graphical conditional independence structures is an important\nmachine learning problem and a cornerstone of causal discovery. However, the\naccuracy and execution time of learning algorithms generally struggle to scale\nto problems with hundreds of highly connected variables -- for instance,\nrecovering brain networks from fMRI data. We introduce the best order score\nsearch (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs\n(DAGs) in this paradigm. BOSS greedily searches over permutations of variables,\nusing GSTs to construct and score DAGs from permutations. GSTs efficiently\ncache scores to eliminate redundant calculations. BOSS achieves\nstate-of-the-art performance in accuracy and execution time, comparing\nfavorably to a variety of combinatorial and gradient-based learning algorithms\nunder a broad range of conditions. To demonstrate its practicality, we apply\nBOSS to two sets of resting-state fMRI data: simulated data with\npseudo-empirical noise distributions derived from randomized empirical fMRI\ncortical signals and clinical data from 3T fMRI scans processed into cortical\nparcels. 
BOSS is available for use within the TETRAD project which includes\nPython and R wrappers."}, "http://arxiv.org/abs/2310.17712": {"title": "Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec", "link": "http://arxiv.org/abs/2310.17712", "description": "Embedding the nodes of a large network into a Euclidean space is a common\nobjective in modern machine learning, with a variety of tools available. These\nembeddings can then be used as features for tasks such as community\ndetection/node clustering or link prediction, where they achieve state-of-the-art\nperformance. With the exception of spectral clustering methods, there is\nlittle theoretical understanding of other commonly used approaches to learning\nembeddings. In this work we examine the theoretical properties of the\nembeddings learned by node2vec. Our main result shows that the use of k-means\nclustering on the embedding vectors produced by node2vec gives weakly\nconsistent community recovery for the nodes in (degree corrected) stochastic\nblock models. We also discuss the use of these embeddings for node and link\nprediction tasks. We demonstrate this result empirically, and examine how this\nrelates to other embedding tools for network data."}, "http://arxiv.org/abs/2310.17760": {"title": "Novel Models for Multiple Dependent Heteroskedastic Time Series", "link": "http://arxiv.org/abs/2310.17760", "description": "Functional magnetic resonance imaging or functional MRI (fMRI) is a very\npopular tool used for differentiating brain regions by measuring brain activity. It\nis affected by physiological noise, such as head and brain movement in the\nscanner from breathing, heart beats, or the subject fidgeting. The purpose of\nthis paper is to propose a novel approach to handling fMRI data for infants\nwith high volatility caused by sudden head movements. Another purpose is to\nevaluate the volatility modelling performance of multiple dependent fMRI time\nseries data. The models examined in this paper are AR and GARCH and the\nmodelling performance is evaluated by several statistical performance measures.\nThe conclusions of this paper are that multiple dependent fMRI series data can\nbe fitted with an AR + GARCH model if the multiple fMRI data have many sudden head\nmovements. The GARCH model can capture the shared volatility clustering caused\nby head movements across brain regions. However, for multiple fMRI data without\nmany head movements, the AR + GARCH model fits with varying performance.\nThe conclusions are supported by statistical tests and measures. This paper\nhighlights the difference between the proposed approach and traditional\napproaches when estimating model parameters and modelling conditional variances\non multiple dependent time series. In the future, the proposed approach can be\napplied to other research fields, such as financial economics and signal\nprocessing. Code is available at \\url{https://github.com/13204942/STAT40710}."}, "http://arxiv.org/abs/2310.17766": {"title": "Minibatch Markov chain Monte Carlo Algorithms for Fitting Gaussian Processes", "link": "http://arxiv.org/abs/2310.17766", "description": "Gaussian processes (GPs) are highly flexible, nonparametric statistical\nmodels that are commonly used to fit nonlinear relationships or account for\ncorrelation between observations. However, the computational load of fitting a\nGaussian process is $\\mathcal{O}(n^3)$, making them infeasible for use on large\ndatasets. 
To make GPs more feasible for large datasets, this research focuses\non the use of minibatching to estimate GP parameters. Specifically, we outline\nboth approximate and exact minibatch Markov chain Monte Carlo algorithms that\nsubstantially reduce the computation of fitting a GP by only considering small\nsubsets of the data at a time. We demonstrate and compare this methodology\nusing various simulations and real datasets."}, "http://arxiv.org/abs/2310.17806": {"title": "Transporting treatment effects from difference-in-differences studies", "link": "http://arxiv.org/abs/2310.17806", "description": "Difference-in-differences (DID) is a popular approach to identify the causal\neffects of treatments and policies in the presence of unmeasured confounding.\nDID identifies the sample average treatment effect in the treated (SATT).\nHowever, a goal of such research is often to inform decision-making in target\npopulations outside the treated sample. Transportability methods have been\ndeveloped to extend inferences from study samples to external target\npopulations; these methods have primarily been developed and applied in\nsettings where identification is based on conditional independence between the\ntreatment and potential outcomes, such as in a randomized trial. This paper\ndevelops identification and estimators for effects in a target population,\nbased on DID conducted in a study sample that differs from the target\npopulation. We present a range of assumptions under which one may identify\ncausal effects in the target population and employ causal diagrams to\nillustrate these assumptions. In most realistic settings, results depend\ncritically on the assumption that any unmeasured confounders are not effect\nmeasure modifiers on the scale of the effect of interest. We develop several\nestimators of transported effects, including a doubly robust estimator based on\nthe efficient influence function. Simulation results support the theoretical\nproperties of the proposed estimators. We discuss the potential application of\nour approach to a study of the effects of a US federal smoke-free housing\npolicy, where the original study was conducted in New York City alone and the\ngoal is to extend inferences to other US cities."}, "http://arxiv.org/abs/2310.17816": {"title": "Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs", "link": "http://arxiv.org/abs/2310.17816", "description": "This work addresses the problem of automated covariate selection under\nlimited prior knowledge. Given an exposure-outcome pair {X,Y} and a variable\nset Z of unknown causal structure, the Local Discovery by Partitioning (LDP)\nalgorithm partitions Z into subsets defined by their relation to {X,Y}. We\nenumerate eight exhaustive and mutually exclusive partitions of any arbitrary Z\nand leverage this taxonomy to differentiate confounders from other variable\ntypes. LDP is motivated by valid adjustment set identification, but avoids the\npretreatment assumption commonly made by automated covariate selection methods.\nWe provide theoretical guarantees that LDP returns a valid adjustment set for\nany Z that meets sufficient graphical conditions. Under stronger conditions, we\nprove that partition labels are asymptotically correct. The total number of independence\ntests is worst-case quadratic in |Z|, with sub-quadratic runtimes observed\nempirically. We numerically validate our theoretical guarantees on synthetic\nand semi-synthetic graphs. 
Adjustment sets from LDP yield less biased and more\nprecise average treatment effect estimates than baselines, with LDP\noutperforming on confounder recall, test count, and runtime for valid\nadjustment set discovery."}, "http://arxiv.org/abs/2310.17820": {"title": "Sparse Bayesian Multidimensional Item Response Theory", "link": "http://arxiv.org/abs/2310.17820", "description": "Multivariate Item Response Theory (MIRT) is widely sought after by applied\nresearchers looking for interpretable (sparse) explanations underlying response\npatterns in questionnaire data. There is, however, an unmet demand for such\nsparsity discovery tools in practice. Our paper develops a Bayesian platform\nfor binary and ordinal item MIRT which requires minimal tuning and scales well\non relatively large datasets due to its parallelizable features. Bayesian\nmethodology for MIRT models has traditionally relied on MCMC simulation, which\ncan not only be slow in practice, but also often renders exact sparsity recovery\nimpossible without additional thresholding. In this work, we develop a scalable\nBayesian EM algorithm to estimate sparse factor loadings from binary and\nordinal item responses. We address the seemingly insurmountable problem of\nunknown latent factor dimensionality with tools from Bayesian nonparametrics\nwhich enable estimating the number of factors. Rotations to sparsity through\nparameter expansion further enhance convergence and interpretability without\nidentifiability constraints. In our simulation study, we show that our method\nreliably recovers both the factor dimensionality as well as the latent\nstructure on high-dimensional synthetic data even for small samples. We\ndemonstrate the practical usefulness of our approach on two datasets: an\neducational item response dataset and a quality-of-life measurement dataset.\nBoth demonstrations show that our tool yields interpretable estimates,\nfacilitating interesting discoveries that might otherwise go unnoticed under a\npure confirmatory factor analysis setting. We provide easy-to-use software\nwhich is a useful new addition to the MIRT toolkit and which will hopefully\nserve as the go-to method for practitioners."}, "http://arxiv.org/abs/2310.17845": {"title": "A Unified and Optimal Multiple Testing Framework based on rho-values", "link": "http://arxiv.org/abs/2310.17845", "description": "Multiple testing is an important research direction that has gained major\nattention in recent years. Currently, most multiple testing procedures are\ndesigned with p-values or Local false discovery rate (Lfdr) statistics.\nHowever, p-values obtained by applying the probability integral transform to some\nwell-known test statistics often do not incorporate information from the\nalternatives, resulting in suboptimal procedures. On the other hand, Lfdr based\nprocedures can be asymptotically optimal but their guarantee on false discovery\nrate (FDR) control relies on consistent estimation of Lfdr, which is often\ndifficult in practice, especially when the incorporation of side information is\ndesirable. In this article, we propose a novel and flexibly constructed class\nof statistics, called rho-values, which combines the merits of both p-values\nand Lfdr while enjoying superiority over methods based on these two types of\nstatistics. Specifically, it unifies these two frameworks and operates in two\nsteps, ranking and thresholding. 
The ranking produced by rho-values mimics that\nproduced by Lfdr statistics, and the strategy for choosing the threshold is\nsimilar to that of p-value based procedures. Therefore, the proposed framework\nguarantees FDR control under weak assumptions; it maintains the integrity of\nthe structural information encoded by the summary statistics and the auxiliary\ncovariates and hence can be asymptotically optimal. We demonstrate the efficacy\nof the new framework through extensive simulations and two data applications."}, "http://arxiv.org/abs/2310.17999": {"title": "Automated threshold selection and associated inference uncertainty for univariate extremes", "link": "http://arxiv.org/abs/2310.17999", "description": "Threshold selection is a fundamental problem in any threshold-based extreme\nvalue analysis. While models are asymptotically motivated, selecting an\nappropriate threshold for finite samples can be difficult through standard\nmethods. Inference can also be highly sensitive to the choice of threshold. Too\nlow a threshold choice leads to bias in the fit of the extreme value model,\nwhile too high a choice leads to unnecessary additional uncertainty in the\nestimation of model parameters. In this paper, we develop a novel methodology\nfor automated threshold selection that directly tackles this bias-variance\ntrade-off. We also develop a method to account for the uncertainty in this\nthreshold choice and propagate this uncertainty through to high quantile\ninference. Through a simulation study, we demonstrate the effectiveness of our\nmethod for threshold selection and subsequent extreme quantile estimation. We\napply our method to the well-known, troublesome example of the River Nidd\ndataset."}, "http://arxiv.org/abs/2310.18027": {"title": "Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors", "link": "http://arxiv.org/abs/2310.18027", "description": "Effective and rapid decision-making from randomized controlled trials (RCTs)\nrequires unbiased and precise treatment effect inferences. Two strategies to\naddress this requirement are to adjust for covariates that are highly\ncorrelated with the outcome, and to leverage historical control information via\nBayes' theorem. We propose a new Bayesian prognostic covariate adjustment\nmethodology, referred to as Bayesian PROCOVA, that combines these two\nstrategies. Covariate adjustment is based on generative artificial intelligence\n(AI) algorithms that construct a digital twin generator (DTG) for RCT\nparticipants. The DTG is trained on historical control data and yields a\ndigital twin (DT) probability distribution for each participant's control\noutcome. The expectation of the DT distribution defines the single covariate\nfor adjustment. Historical control information is leveraged via an additive\nmixture prior with two components: an informative prior probability\ndistribution specified based on historical control data, and a non-informative\nprior distribution. The weight parameter in the mixture has a prior\ndistribution as well, so that the entire additive mixture prior distribution is\ncompletely pre-specifiable and does not involve any information from the RCT.\nWe establish an efficient Gibbs algorithm for sampling from the posterior\ndistribution, and derive closed-form expressions for the posterior mean and\nvariance of the treatment effect conditional on the weight parameter of\nBayesian PROCOVA. 
We evaluate the bias control and variance reduction of\nBayesian PROCOVA compared to frequentist prognostic covariate adjustment\n(PROCOVA) via simulation studies that encompass different types of\ndiscrepancies between the historical control and RCT data. Ultimately, Bayesian\nPROCOVA can yield informative treatment effect inferences with fewer control\nparticipants, accelerating effective decision-making."}, "http://arxiv.org/abs/2310.18047": {"title": "Robust Bayesian Inference on Riemannian Submanifold", "link": "http://arxiv.org/abs/2310.18047", "description": "Non-Euclidean spaces routinely arise in modern statistical applications such\nas in medical imaging, robotics, and computer vision, to name a few. While\ntraditional Bayesian approaches are applicable to such settings by considering\nan ambient Euclidean space as the parameter space, we demonstrate the benefits\nof integrating manifold structure into the Bayesian framework, both\ntheoretically and computationally. Moreover, existing Bayesian approaches which\nare designed specifically for manifold-valued parameters are primarily\nmodel-based, which are typically subject to inaccurate uncertainty\nquantification under model misspecification. In this article, we propose a\nrobust model-free Bayesian inference for parameters defined on a Riemannian\nsubmanifold, which is shown to provide valid uncertainty quantification from a\nfrequentist perspective. Computationally, we propose a Markov chain Monte Carlo\nto sample from the posterior on the Riemannian submanifold, where the mixing\ntime, in the large sample regime, is shown to depend only on the intrinsic\ndimension of the parameter space instead of the potentially much larger ambient\ndimension. Our numerical results demonstrate the effectiveness of our approach\non a variety of problems, such as reduced-rank multiple quantile regression,\nprincipal component analysis, and Fr\\'{e}chet mean estimation."}, "http://arxiv.org/abs/2310.18108": {"title": "Transductive conformal inference with adaptive scores", "link": "http://arxiv.org/abs/2310.18108", "description": "Conformal inference is a fundamental and versatile tool that provides\ndistribution-free guarantees for many machine learning tasks. We consider the\ntransductive setting, where decisions are made on a test sample of $m$ new\npoints, giving rise to $m$ conformal $p$-values. {While classical results only\nconcern their marginal distribution, we show that their joint distribution\nfollows a P\\'olya urn model, and establish a concentration inequality for their\nempirical distribution function.} The results hold for arbitrary exchangeable\nscores, including {\\it adaptive} ones that can use the covariates of the\ntest+calibration samples at training stage for increased accuracy. We\ndemonstrate the usefulness of these theoretical results through uniform,\nin-probability guarantees for two machine learning tasks of current interest:\ninterval prediction for transductive transfer learning and novelty detection\nbased on two-class classification."}, "http://arxiv.org/abs/2310.18212": {"title": "Robustness of Algorithms for Causal Structure Learning to Hyperparameter Choice", "link": "http://arxiv.org/abs/2310.18212", "description": "Hyperparameters play a critical role in machine learning. Hyperparameter\ntuning can make the difference between state-of-the-art and poor prediction\nperformance for any algorithm, but it is particularly challenging for structure\nlearning due to its unsupervised nature. 
As a result, hyperparameter tuning is\noften neglected in favour of using the default values provided by a particular\nimplementation of an algorithm. While there have been numerous studies on\nperformance evaluation of causal discovery algorithms, how hyperparameters\naffect individual algorithms, as well as the choice of the best algorithm for a\nspecific problem, has not been studied in depth before. This work addresses\nthis gap by investigating the influence of hyperparameters on causal structure\nlearning tasks. Specifically, we perform an empirical evaluation of\nhyperparameter selection for some seminal learning algorithms on datasets of\nvarying levels of complexity. We find that, while the choice of algorithm\nremains crucial to obtaining state-of-the-art performance, hyperparameter\nselection in ensemble settings strongly influences the choice of algorithm, in\nthat a poor choice of hyperparameters can lead to analysts using algorithms\nwhich do not give state-of-the-art performance for their data."}, "http://arxiv.org/abs/2310.18261": {"title": "Label Shift Estimators for Non-Ignorable Missing Data", "link": "http://arxiv.org/abs/2310.18261", "description": "We consider the problem of estimating the mean of a random variable Y subject\nto non-ignorable missingness, i.e., where the missingness mechanism depends on\nY . We connect the auxiliary proxy variable framework for non-ignorable\nmissingness (West and Little, 2013) to the label shift setting (Saerens et al.,\n2002). Exploiting this connection, we construct an estimator for non-ignorable\nmissing data that uses high-dimensional covariates (or proxies) without the\nneed for a generative model. In synthetic and semi-synthetic experiments, we\nstudy the behavior of the proposed estimator, comparing it to commonly used\nignorable estimators in both well-specified and misspecified settings.\nAdditionally, we develop a score to assess how consistent the data are with the\nlabel shift assumption. We use our approach to estimate disease prevalence\nusing a large health survey, comparing ignorable and non-ignorable approaches.\nWe show that failing to account for non-ignorable missingness can have profound\nconsequences on conclusions drawn from non-representative samples."}, "http://arxiv.org/abs/2102.12698": {"title": "Improving the Hosmer-Lemeshow Goodness-of-Fit Test in Large Models with Replicated Trials", "link": "http://arxiv.org/abs/2102.12698", "description": "The Hosmer-Lemeshow (HL) test is a commonly used global goodness-of-fit (GOF)\ntest that assesses the quality of the overall fit of a logistic regression\nmodel. In this paper, we give results from simulations showing that the type 1\nerror rate (and hence power) of the HL test decreases as model complexity\ngrows, provided that the sample size remains fixed and binary replicates are\npresent in the data. We demonstrate that the generalized version of the HL test\nby Surjanovic et al. (2020) can offer some protection against this power loss.\nWe conclude with a brief discussion explaining the behaviour of the HL test,\nalong with some guidance on how to choose between the two tests."}, "http://arxiv.org/abs/2110.04852": {"title": "Mixture representations and Bayesian nonparametric inference for likelihood ratio ordered distributions", "link": "http://arxiv.org/abs/2110.04852", "description": "In this article, we introduce mixture representations for likelihood ratio\nordered distributions. 
Essentially, the ratio of two probability densities, or\nmass functions, is monotone if and only if one can be expressed as a mixture of\none-sided truncations of the other. To illustrate the practical value of the\nmixture representations, we address the problem of density estimation for\nlikelihood ratio ordered distributions. In particular, we propose a\nnonparametric Bayesian solution which takes advantage of the mixture\nrepresentations. The prior distribution is constructed from Dirichlet process\nmixtures and has large support on the space of pairs of densities satisfying\nthe monotone ratio constraint. Posterior consistency holds under reasonable\nconditions on the prior specification and the true unknown densities. To our\nknowledge, this is the first posterior consistency result in the literature on\norder constrained inference. With a simple modification to the prior\ndistribution, we can test the equality of two distributions against the\nalternative of likelihood ratio ordering. We develop a Markov chain Monte Carlo\nalgorithm for posterior inference and demonstrate the method in a biomedical\napplication."}, "http://arxiv.org/abs/2207.08911": {"title": "Deeply-Learned Generalized Linear Models with Missing Data", "link": "http://arxiv.org/abs/2207.08911", "description": "Deep Learning (DL) methods have dramatically increased in popularity in\nrecent years, with significant growth in their application to supervised\nlearning problems in the biomedical sciences. However, the greater prevalence\nand complexity of missing data in modern biomedical datasets present\nsignificant challenges for DL methods. Here, we provide a formal treatment of\nmissing data in the context of deeply learned generalized linear models, a\nsupervised DL architecture for regression and classification problems. We\npropose a new architecture, \\textit{dlglm}, that is one of the first to be able\nto flexibly account for both ignorable and non-ignorable patterns of\nmissingness in input features and response at training time. We demonstrate\nthrough statistical simulation that our method outperforms existing approaches\nfor supervised learning tasks in the presence of missing not at random (MNAR)\nmissingness. We conclude with a case study of a Bank Marketing dataset from the\nUCI Machine Learning Repository, in which we predict whether clients subscribed\nto a product based on phone survey data. Supplementary materials for this\narticle are available online."}, "http://arxiv.org/abs/2208.04627": {"title": "Causal Effect Identification in Uncertain Causal Networks", "link": "http://arxiv.org/abs/2208.04627", "description": "Causal identification is at the core of the causal inference literature,\nwhere complete algorithms have been proposed to identify causal queries of\ninterest. The validity of these algorithms hinges on the restrictive assumption\nof having access to a correctly specified causal structure. In this work, we\nstudy the setting where a probabilistic model of the causal structure is\navailable. Specifically, the edges in a causal graph exist with uncertainties\nwhich may, for example, represent degree of belief from domain experts.\nAlternatively, the uncertainty about an edge may reflect the confidence of a\nparticular statistical test. The question that naturally arises in this setting\nis: Given such a probabilistic graph and a specific causal effect of interest,\nwhat is the subgraph which has the highest plausibility and for which the\ncausal effect is identifiable? 
We show that answering this question reduces to\nsolving an NP-complete combinatorial optimization problem which we call the\nedge ID problem. We propose efficient algorithms to approximate this problem\nand evaluate them against both real-world networks and randomly generated\ngraphs."}, "http://arxiv.org/abs/2211.00268": {"title": "Stacking designs: designing multi-fidelity computer experiments with target predictive accuracy", "link": "http://arxiv.org/abs/2211.00268", "description": "In an era where scientific experiments can be very costly, multi-fidelity\nemulators provide a useful tool for cost-efficient predictive scientific\ncomputing. For scientific applications, the experimenter is often limited by a\ntight computational budget, and thus wishes to (i) maximize predictive power of\nthe multi-fidelity emulator via a careful design of experiments, and (ii)\nensure this model achieves a desired error tolerance with some notion of\nconfidence. Existing design methods, however, do not jointly tackle objectives\n(i) and (ii). We propose a novel stacking design approach that addresses both\ngoals. A multi-level reproducing kernel Hilbert space (RKHS) interpolator is\nfirst introduced to build the emulator, under which our stacking design\nprovides a sequential approach for designing multi-fidelity runs such that a\ndesired prediction error of $\\epsilon > 0$ is met under regularity assumptions.\nWe then prove a novel cost complexity theorem that, under this multi-level\ninterpolator, establishes a bound on the computation cost (for training data\nsimulation) needed to achieve a prediction bound of $\\epsilon$. This result\nprovides novel insights on conditions under which the proposed multi-fidelity\napproach improves upon a conventional RKHS interpolator which relies on a\nsingle fidelity level. Finally, we demonstrate the effectiveness of stacking\ndesigns in a suite of simulation experiments and an application to finite\nelement analysis."}, "http://arxiv.org/abs/2211.05357": {"title": "Bayesian score calibration for approximate models", "link": "http://arxiv.org/abs/2211.05357", "description": "Scientists continue to develop increasingly complex mechanistic models to\nreflect their knowledge more realistically. Statistical inference using these\nmodels can be challenging since the corresponding likelihood function is often\nintractable and model simulation may be computationally burdensome.\nFortunately, in many of these situations, it is possible to adopt a surrogate\nmodel or approximate likelihood function. It may be convenient to conduct\nBayesian inference directly with the surrogate, but this can result in bias and\npoor uncertainty quantification. In this paper we propose a new method for\nadjusting approximate posterior samples to reduce bias and produce more\naccurate uncertainty quantification. We do this by optimizing a transform of\nthe approximate posterior that maximizes a scoring rule. Our approach requires\nonly a (fixed) small number of complex model simulations and is numerically\nstable. We demonstrate good performance of the new method on several examples\nof increasing complexity."}, "http://arxiv.org/abs/2302.00993": {"title": "Unpaired Multi-Domain Causal Representation Learning", "link": "http://arxiv.org/abs/2302.00993", "description": "The goal of causal representation learning is to find a representation of\ndata that consists of causally related latent variables. 
We consider a setup\nwhere one has access to data from multiple domains that potentially share a\ncausal representation. Crucially, observations in different domains are assumed\nto be unpaired, that is, we only observe the marginal distribution in each\ndomain but not their joint distribution. In this paper, we give sufficient\nconditions for identifiability of the joint distribution and the shared causal\ngraph in a linear setup. Identifiability holds if we can uniquely recover the\njoint distribution and the shared causal representation from the marginal\ndistributions in each domain. We transform our identifiability results into a\npractical method to recover the shared latent causal graph."}, "http://arxiv.org/abs/2303.17277": {"title": "Cross-temporal probabilistic forecast reconciliation: Methodological and practical issues", "link": "http://arxiv.org/abs/2303.17277", "description": "Forecast reconciliation is a post-forecasting process that involves\ntransforming a set of incoherent forecasts into coherent forecasts which\nsatisfy a given set of linear constraints for a multivariate time series. In\nthis paper we extend the current state-of-the-art cross-sectional probabilistic\nforecast reconciliation approach to encompass a cross-temporal framework, where\ntemporal constraints are also applied. Our proposed methodology employs both\nparametric Gaussian and non-parametric bootstrap approaches to draw samples\nfrom an incoherent cross-temporal distribution. To improve the estimation of\nthe forecast error covariance matrix, we propose using multi-step residuals,\nespecially in the time dimension where the usual one-step residuals fail. To\naddress high-dimensionality issues, we present four alternatives for the\ncovariance matrix, where we exploit the two-fold nature (cross-sectional and\ntemporal) of the cross-temporal structure, and introduce the idea of\noverlapping residuals. We assess the effectiveness of the proposed\ncross-temporal reconciliation approaches through a simulation study that\ninvestigates their theoretical and empirical properties and two forecasting\nexperiments, using the Australian GDP and the Australian Tourism Demand\ndatasets. For both applications, the optimal cross-temporal reconciliation\napproaches significantly outperform the incoherent base forecasts in terms of\nthe Continuous Ranked Probability Score and the Energy Score. Overall, the\nresults highlight the potential of the proposed methods to improve the accuracy\nof probabilistic forecasts and to address the challenge of integrating\ndisparate scenarios while coherently taking into account short-term\noperational, medium-term tactical, and long-term strategic planning."}, "http://arxiv.org/abs/2309.07867": {"title": "Beta Diffusion", "link": "http://arxiv.org/abs/2309.07867", "description": "We introduce beta diffusion, a novel generative modeling method that\nintegrates demasking and denoising to generate data within bounded ranges.\nUsing scaled and shifted beta distributions, beta diffusion utilizes\nmultiplicative transitions over time to create both forward and reverse\ndiffusion processes, maintaining beta distributions in both the forward\nmarginals and the reverse conditionals, given the data at any point in time.\nUnlike traditional diffusion-based generative models relying on additive\nGaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is\nmultiplicative and optimized with KL-divergence upper bounds (KLUBs) derived\nfrom the convexity of the KL divergence. 
We demonstrate that the proposed KLUBs\nare more effective for optimizing beta diffusion compared to negative ELBOs,\nwhich can also be derived as the KLUBs of the same KL divergence with its two\narguments swapped. The loss function of beta diffusion, expressed in terms of\nBregman divergence, further supports the efficacy of KLUBs for optimization.\nExperimental results on both synthetic data and natural images demonstrate the\nunique capabilities of beta diffusion in generative modeling of range-bounded\ndata and validate the effectiveness of KLUBs in optimizing diffusion models,\nthereby making them valuable additions to the family of diffusion-based\ngenerative models and the optimization techniques used to train them."}, "http://arxiv.org/abs/2310.18422": {"title": "Inference via Wild Bootstrap and Multiple Imputation under Fine-Gray Models with Incomplete Data", "link": "http://arxiv.org/abs/2310.18422", "description": "Fine-Gray models specify the subdistribution hazards for one out of multiple\ncompeting risks to be proportional. The estimators of parameters and cumulative\nincidence functions under Fine-Gray models have a simpler structure when data\nare censoring-complete than when they are more generally incomplete. This paper\nconsiders the case of incomplete data but it exploits the above-mentioned\nsimpler estimator structure for which there exists a wild bootstrap approach\nfor inferential purposes. The present idea is to link the methodology under\ncensoring-completeness with the more general right-censoring regime with the\nhelp of multiple imputation. In a simulation study, this approach is compared\nto the estimation procedure proposed in the original paper by Fine and Gray\nwhen it is combined with a bootstrap approach. An application to a data set\nabout hospital-acquired infections illustrates the method."}, "http://arxiv.org/abs/2310.18474": {"title": "Robust Bayesian Graphical Regression Models for Assessing Tumor Heterogeneity in Proteomic Networks", "link": "http://arxiv.org/abs/2310.18474", "description": "Graphical models are powerful tools to investigate complex dependency\nstructures in high-throughput datasets. However, most existing graphical models\nmake one of the two canonical assumptions: (i) a homogeneous graph with a\ncommon network for all subjects; or (ii) an assumption of normality especially\nin the context of Gaussian graphical models. Both assumptions are restrictive\nand can fail to hold in certain applications such as proteomic networks in\ncancer. To this end, we propose an approach termed robust Bayesian graphical\nregression (rBGR) to estimate heterogeneous graphs for non-normally distributed\ndata. rBGR is a flexible framework that accommodates non-normality through\nrandom marginal transformations and constructs covariate-dependent graphs to\naccommodate heterogeneity through graphical regression techniques. We formulate\na new characterization of edge dependencies in such models called conditional\nsign independence with covariates along with an efficient posterior sampling\nalgorithm. In simulation studies, we demonstrate that rBGR outperforms existing\ngraphical regression models for data generated under various levels of\nnon-normality in both edge and covariate selection. We use rBGR to assess\nproteomic networks across two cancers: lung and ovarian, to systematically\ninvestigate the effects of immunogenic heterogeneity within tumors. 
Our\nanalyses reveal several important protein-protein interactions that are\ndifferentially impacted by the immune cell abundance; some corroborate existing\nbiological knowledge whereas others are novel findings."}, "http://arxiv.org/abs/2310.18500": {"title": "Designing Randomized Experiments to Predict Unit-Specific Treatment Effects", "link": "http://arxiv.org/abs/2310.18500", "description": "Typically, a randomized experiment is designed to test a hypothesis about the\naverage treatment effect and sometimes hypotheses about treatment effect\nvariation. The results of such a study may then be used to inform policy and\npractice for units not in the study. In this paper, we argue that given this\nuse, randomized experiments should instead be designed to predict unit-specific\ntreatment effects in a well-defined population. We then consider how different\nsampling processes and models affect the bias, variance, and mean squared\nprediction error of these predictions. The results indicate, for example, that\nproblems of generalizability (differences between samples and populations) can\ngreatly affect bias both in predictive models and in measures of error in these\nmodels. We also examine when the average treatment effect estimate outperforms\nunit-specific treatment effect predictive models and implications of this for\nplanning studies."}, "http://arxiv.org/abs/2310.18527": {"title": "Multiple Imputation Method for High-Dimensional Neuroimaging Data", "link": "http://arxiv.org/abs/2310.18527", "description": "Missingness is a common issue for neuroimaging data, and neglecting it in\ndownstream statistical analysis can introduce bias and lead to misguided\ninferential conclusions. It is therefore crucial to conduct appropriate\nstatistical methods to address this issue. While multiple imputation is a\npopular technique for handling missing data, its application to neuroimaging\ndata is hindered by high dimensionality and complex dependence structures of\nmultivariate neuroimaging variables. To tackle this challenge, we propose a\nnovel approach, named High Dimensional Multiple Imputation (HIMA), based on\nBayesian models. HIMA develops a new computational strategy for sampling large\ncovariance matrices based on a robustly estimated posterior mode, which\ndrastically enhances computational efficiency and numerical stability. To\nassess the effectiveness of HIMA, we conducted extensive simulation studies and\nreal-data analysis using neuroimaging data from a Schizophrenia study. HIMA\nshowcases a computational efficiency improvement of over 2000 times when\ncompared to traditional approaches, while also producing imputed datasets with\nimproved precision and stability."}, "http://arxiv.org/abs/2310.18533": {"title": "Evaluating the effects of high-throughput structural neuroimaging predictors on whole-brain functional connectome outcomes via network-based vector-on-matrix regression", "link": "http://arxiv.org/abs/2310.18533", "description": "The joint analysis of multimodal neuroimaging data is critical in the field\nof brain research because it reveals complex interactive relationships between\nneurobiological structures and functions. In this study, we focus on\ninvestigating the effects of structural imaging (SI) features, including white\nmatter micro-structure integrity (WMMI) and cortical thickness, on the whole\nbrain functional connectome (FC) network. 
To achieve this goal, we propose a\nnetwork-based vector-on-matrix regression model to characterize the FC-SI\nassociation patterns. We have developed a novel multi-level dense bipartite and\nclique subgraph extraction method to identify which subsets of spatially\nspecific SI features intensively influence organized FC sub-networks. The\nproposed method can simultaneously identify highly correlated\nstructural-connectomic association patterns and suppress false positive\nfindings while handling millions of potential interactions. We apply our method\nto a multimodal neuroimaging dataset of 4,242 participants from the UK Biobank\nto evaluate the effects of whole-brain WMMI and cortical thickness on the\nresting-state FC. The results reveal that the WMMI on corticospinal tracts and\ninferior cerebellar peduncle significantly affect functional connections of\nsensorimotor, salience, and executive sub-networks with an average correlation\nof 0.81 (p<0.001)."}, "http://arxiv.org/abs/2310.18536": {"title": "Efficient Fully Bayesian Approach to Brain Activity Mapping with Complex-Valued fMRI Data", "link": "http://arxiv.org/abs/2310.18536", "description": "Functional magnetic resonance imaging (fMRI) enables indirect detection of\nbrain activity changes via the blood-oxygen-level-dependent (BOLD) signal.\nConventional analysis methods mainly rely on the real-valued magnitude of these\nsignals. In contrast, research suggests that analyzing both real and imaginary\ncomponents of the complex-valued fMRI (cv-fMRI) signal provides a more holistic\napproach that can increase power to detect neuronal activation. We propose a\nfully Bayesian model for brain activity mapping with cv-fMRI data. Our model\naccommodates temporal and spatial dynamics. Additionally, we propose a\ncomputationally efficient sampling algorithm, which enhances processing speed\nthrough image partitioning. Our approach is shown to be computationally\nefficient via image partitioning and parallel computation while being\ncompetitive with state-of-the-art methods. We support these claims with both\nsimulated numerical studies and an application to real cv-fMRI data obtained\nfrom a finger-tapping experiment."}, "http://arxiv.org/abs/2310.18556": {"title": "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment", "link": "http://arxiv.org/abs/2310.18556", "description": "Design-based causal inference is one of the most widely used frameworks for\ntesting causal null hypotheses or inferring about causal parameters from\nexperimental or observational data. The most significant merit of design-based\ncausal inference is that its statistical validity only comes from the study\ndesign (e.g., randomization design) and does not require assuming any\noutcome-generating distributions or models. Although immune to model\nmisspecification, design-based causal inference can still suffer from other\ndata challenges, among which missingness in outcomes is a significant one.\nHowever, compared with model-based causal inference, outcome missingness in\ndesign-based causal inference is much less studied, largely due to the\nchallenge that design-based causal inference does not assume any outcome\ndistributions/models and, therefore, cannot directly adopt any existing\nmodel-based approaches for missing data. To fill this gap, we systematically\nstudy the missing outcomes problem in design-based causal inference. 
First, we\nuse the potential outcomes framework to clarify the minimal assumption\n(concerning the outcome missingness mechanism) needed for conducting\nfinite-population-exact randomization tests for the null effect (i.e., Fisher's\nsharp null) and that needed for constructing finite-population-exact confidence\nsets with missing outcomes. Second, we propose a general framework called\n``imputation and re-imputation\" for conducting finite-population-exact\nrandomization tests in design-based causal studies with missing outcomes. Our\nframework can incorporate any existing outcome imputation algorithms and\nmeanwhile guarantee finite-population-exact type-I error rate control. Third,\nwe extend our framework to conduct covariate adjustment in an exact\nrandomization test with missing outcomes and to construct\nfinite-population-exact confidence sets with missing outcomes."}, "http://arxiv.org/abs/2310.18611": {"title": "Sequential Kalman filter for fast online changepoint detection in longitudinal health records", "link": "http://arxiv.org/abs/2310.18611", "description": "This article introduces the sequential Kalman filter, a computationally\nscalable approach for online changepoint detection with temporally correlated\ndata. The temporal correlation was not considered in the Bayesian online\nchangepoint detection approach due to the large computational cost. Motivated\nby detecting COVID-19 infections for dialysis patients from massive\nlongitudinal health records with a large number of covariates, we develop a\nscalable approach to detect multiple changepoints from correlated data by\nsequentially stitching Kalman filters of subsequences to compute the joint\ndistribution of the observations, which has linear computational complexity\nwith respect to the number of observations between the last detected\nchangepoint and the current observation at each time point, without\napproximating the likelihood function. Compared to other online changepoint\ndetection methods, simulated experiments show that our approach is more precise\nin detecting single or multiple changes in mean, variance, or correlation for\ntemporally correlated data. Furthermore, we propose a new way to integrate\nclassification and changepoint detection approaches that improve the detection\ndelay and accuracy for detecting COVID-19 infection compared to other\nalternatives."}, "http://arxiv.org/abs/2310.18733": {"title": "Threshold detection under a semiparametric regression model", "link": "http://arxiv.org/abs/2310.18733", "description": "Linear regression models have been extensively considered in the literature.\nHowever, in some practical applications they may not be appropriate all over\nthe range of the covariate. In this paper, a more flexible model is introduced\nby considering a regression model $Y=r(X)+\\varepsilon$ where the regression\nfunction $r(\\cdot)$ is assumed to be linear for large values in the domain of\nthe predictor variable $X$. More precisely, we assume that\n$r(x)=\\alpha_0+\\beta_0 x$ for $x> u_0$, where the value $u_0$ is identified as\nthe smallest value satisfying such a property. A penalized procedure is\nintroduced to estimate the threshold $u_0$. The considered proposal focusses on\na semiparametric approach since no parametric model is assumed for the\nregression function for values smaller than $u_0$. Consistency properties of\nboth the threshold estimator and the estimators of $(\\alpha_0,\\beta_0)$ are\nderived, under mild assumptions. 
Through a numerical study, the small sample\nproperties of the proposed procedure and the importance of introducing a\npenalization are investigated. The analysis of a real data set allows us to\ndemonstrate the usefulness of the penalized estimators."}, "http://arxiv.org/abs/2310.18766": {"title": "Discussion of ''A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks''", "link": "http://arxiv.org/abs/2310.18766", "description": "This review discusses the paper ''A Tale of Two Datasets: Representativeness\nand Generalisability of Inference for Samples of Networks'' by Krivitsky,\nColetti, and Hens, published in the Journal of the American Statistical\nAssociation in 2023."}, "http://arxiv.org/abs/2310.18858": {"title": "Estimating a function of the scale parameter in a gamma distribution with bounded variance", "link": "http://arxiv.org/abs/2310.18858", "description": "Given a gamma population with known shape parameter $\\alpha$, we develop a\ngeneral theory for estimating a function $g(\\cdot)$ of the scale parameter\n$\\beta$ with bounded variance. We begin by defining a sequential sampling\nprocedure with $g(\\cdot)$ satisfying some desired condition in proposing the\nstopping rule, and show the procedure enjoys appealing asymptotic properties.\nAfter these general conditions, we substitute $g(\\cdot)$ with specific\nfunctions including the gamma mean, the gamma variance, the gamma rate\nparameter, and a gamma survival probability as four possible illustrations. For\neach illustration, Monte Carlo simulations are carried out to justify the\nremarkable performance of our proposed sequential procedure. This is further\nsubstantiated with a real data study on weights of newly born babies."}, "http://arxiv.org/abs/2310.18875": {"title": "Feature calibration for computer models", "link": "http://arxiv.org/abs/2310.18875", "description": "Computer model calibration involves using partial and imperfect observations\nof the real world to learn which values of a model's input parameters lead to\noutputs that are consistent with real-world observations. When calibrating\nmodels with high-dimensional output (e.g. a spatial field), it is common to\nrepresent the output as a linear combination of a small set of basis vectors.\nOften, when trying to calibrate to such output, what is important to the\ncredibility of the model is that key emergent physical phenomena are\nrepresented, even if not faithfully or in the right place. In these cases,\ncomparison of model output and data in a linear subspace is inappropriate and\nwill usually lead to poor model calibration. To overcome this, we present\nkernel-based history matching (KHM), generalising the meaning of the technique\nsufficiently to be able to project model outputs and observations into a\nhigher-dimensional feature space, where patterns can be compared without their\nlocation necessarily being fixed. We develop the technical methodology, present\nan expert-driven kernel selection algorithm, and then apply the techniques to\nthe calibration of boundary layer clouds for the French climate model IPSL-CM."}, "http://arxiv.org/abs/2310.18905": {"title": "Incorporating nonparametric methods for estimating causal excursion effects in mobile health with zero-inflated count outcomes", "link": "http://arxiv.org/abs/2310.18905", "description": "In the domain of mobile health, tailoring interventions for real-time\ndelivery is of paramount importance. 
Micro-randomized trials have emerged as\nthe \"gold-standard\" methodology for developing such interventions. Analyzing\ndata from these trials provides insights into the efficacy of interventions and\nthe potential moderation by specific covariates. The \"causal excursion effect\",\na novel class of causal estimands, addresses these inquiries, backed by current\nsemiparametric inference techniques. Yet, existing methods mainly focus on\ncontinuous or binary data, leaving count data largely unexplored. The current\nwork is motivated by the Drink Less micro-randomized trial from the UK, which\nfocuses on a zero-inflated proximal outcome, the number of screen views in the\nsubsequent hour following the intervention decision point. In the current\npaper, we revisit the concept of causal excursion effects, specifically for\nzero-inflated count outcomes, and introduce novel estimation approaches that\nincorporate nonparametric techniques. Bidirectional asymptotics are derived for\nthe proposed estimators. Through extensive simulation studies, we evaluate the\nperformance of the proposed estimators. As an illustration, we also apply the\nproposed methods to the Drink Less trial data."}, "http://arxiv.org/abs/2310.18963": {"title": "Expectile-based conditional tail moments with covariates", "link": "http://arxiv.org/abs/2310.18963", "description": "Expectile, as the minimizer of an asymmetric quadratic loss function, is a\ncoherent risk measure and is helpful for using more information about the\ndistribution of the considered risk. In this paper, we propose a new risk\nmeasure by replacing quantiles by expectiles, called expectile-based\nconditional tail moment, and focus on the estimation of this new risk measure\nas the conditional survival function of the risk, given the risk exceeding the\nexpectile and given a value of the covariates, is heavy-tailed. Under some\nregularity conditions, asymptotic properties of this new estimator are considered.\nThe extrapolated estimation of the conditional tail moments is also\ninvestigated. These results are illustrated both on simulated data and on a\nreal insurance data set."}, "http://arxiv.org/abs/2310.19043": {"title": "Differentially Private Permutation Tests: Applications to Kernel Methods", "link": "http://arxiv.org/abs/2310.19043", "description": "Recent years have witnessed growing concerns about the privacy of sensitive\ndata. In response to these concerns, differential privacy has emerged as a\nrigorous framework for privacy protection, gaining widespread recognition in\nboth academic and industrial circles. While substantial progress has been made\nin private data analysis, existing methods often suffer from impracticality or\na significant loss of statistical efficiency. This paper aims to alleviate\nthese concerns in the context of hypothesis testing by introducing\ndifferentially private permutation tests. The proposed framework extends\nclassical non-private permutation tests to private settings, maintaining both\nfinite-sample validity and differential privacy in a rigorous manner. The power\nof the proposed test depends on the choice of a test statistic, and we\nestablish general conditions for consistency and non-asymptotic uniform power.\nTo demonstrate the utility and practicality of our framework, we focus on\nreproducing kernel-based test statistics and introduce differentially private\nkernel tests for two-sample and independence testing: dpMMD and dpHSIC. 
The\nproposed kernel tests are straightforward to implement, applicable to various\ntypes of data, and attain minimax optimal power across different privacy\nregimes. Our empirical evaluations further highlight their competitive power\nunder various synthetic and real-world scenarios, emphasizing their practical\nvalue. The code is publicly available to facilitate the implementation of our\nframework."}, "http://arxiv.org/abs/2310.19051": {"title": "A Survey of Methods for Estimating Hurst Exponent of Time Sequence", "link": "http://arxiv.org/abs/2310.19051", "description": "The Hurst exponent is a significant indicator for characterizing the\nself-similarity and long-term memory properties of time sequences. It has wide\napplications in physics, technologies, engineering, mathematics, statistics,\neconomics, psychology and so on. Currently, available methods for estimating\nthe Hurst exponent of time sequences can be divided into different categories:\ntime-domain methods and spectrum-domain methods based on the representation of\ntime sequence, linear regression methods and Bayesian methods based on\nparameter estimation methods. Although various methods are discussed in\nliterature, there are still some deficiencies: the descriptions of the\nestimation algorithms are just mathematics-oriented and the pseudo-codes are\nmissing; the effectiveness and accuracy of the estimation algorithms are not\nclear; the classification of estimation methods is not considered and there is\na lack of guidance for selecting the estimation methods. In this work, the\nemphasis is put on thirteen dominant methods for estimating the Hurst exponent.\nFor the purpose of decreasing the difficulty of implementing the estimation\nmethods with computer programs, the mathematical principles are discussed\nbriefly and the pseudo-codes of algorithms are presented with necessary\ndetails. It is expected that the survey could help the researchers to select,\nimplement and apply the estimation algorithms of interest in practical\nsituations in an easy way."}, "http://arxiv.org/abs/2310.19091": {"title": "Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector", "link": "http://arxiv.org/abs/2310.19091", "description": "Machine Learning (ML) systems are becoming instrumental in the public sector,\nwith applications spanning areas like criminal justice, social welfare,\nfinancial fraud detection, and public health. While these systems offer great\npotential benefits to institutional decision-making processes, such as improved\nefficiency and reliability, they still face the challenge of aligning intricate\nand nuanced policy objectives with the precise formalization requirements\nnecessitated by ML models. In this paper, we aim to bridge the gap between ML\nand public sector decision-making by presenting a comprehensive overview of key\ntechnical challenges where disjunctions between policy goals and ML models\ncommonly arise. We concentrate on pivotal points of the ML pipeline that\nconnect the model to its operational environment, delving into the significance\nof representative training data and highlighting the importance of a model\nsetup that facilitates effective decision-making. 
Additionally, we link these\nchallenges with emerging methodological advancements, encompassing causal ML,\ndomain adaptation, uncertainty quantification, and multi-objective\noptimization, illustrating the path forward for harmonizing ML and public\nsector objectives."}, "http://arxiv.org/abs/2310.19114": {"title": "Sparse Fr\\'echet Sufficient Dimension Reduction with Graphical Structure Among Predictors", "link": "http://arxiv.org/abs/2310.19114", "description": "Fr\\'echet regression has received considerable attention to model\nmetric-space valued responses that are complex and non-Euclidean data, such as\nprobability distributions and vectors on the unit sphere. However, existing\nFr\\'echet regression literature focuses on the classical setting where the\npredictor dimension is fixed, and the sample size goes to infinity. This paper\nproposes sparse Fr\\'echet sufficient dimension reduction with graphical\nstructure among high-dimensional Euclidean predictors. In particular, we\npropose a convex optimization problem that leverages the graphical information\namong predictors and avoids inverting the high-dimensional covariance matrix.\nWe also provide the Alternating Direction Method of Multipliers (ADMM)\nalgorithm to solve the optimization problem. Theoretically, the proposed method\nachieves subspace estimation and variable selection consistency under suitable\nconditions. Extensive simulations and a real data analysis are carried out to\nillustrate the finite-sample performance of the proposed method."}, "http://arxiv.org/abs/2310.19246": {"title": "A spectral regularisation framework for latent variable models designed for single channel applications", "link": "http://arxiv.org/abs/2310.19246", "description": "Latent variable models (LVMs) are commonly used to capture the underlying\ndependencies, patterns, and hidden structure in observed data. Source\nduplication is a by-product of the data hankelisation pre-processing step\ncommon to single channel LVM applications, which hinders practical LVM\nutilisation. In this article, a Python package titled\nspectrally-regularised-LVMs is presented. The proposed package addresses the\nsource duplication issue via the addition of a novel spectral regularisation\nterm. This package provides a framework for spectral regularisation in single\nchannel LVM applications, thereby making it easier to investigate and utilise\nLVMs with spectral regularisation. This is achieved via the use of symbolic or\nexplicit representations of potential LVM objective functions which are\nincorporated into a framework that uses spectral regularisation during the LVM\nparameter estimation process. The objective of this package is to provide a\nconsistent linear LVM optimisation framework which incorporates spectral\nregularisation and caters to single channel time-series applications."}, "http://arxiv.org/abs/2310.19253": {"title": "Flow-based Distributionally Robust Optimization", "link": "http://arxiv.org/abs/2310.19253", "description": "We present a computationally efficient framework, called \\texttt{FlowDRO},\nfor solving flow-based distributionally robust optimization (DRO) problems with\nWasserstein uncertainty sets, when requiring the worst-case distribution (also\ncalled the Least Favorable Distribution, LFD) to be continuous so that the\nalgorithm can be scalable to problems with larger sample sizes and achieve\nbetter generalization capability for the induced robust algorithms. 
To tackle\nthe computationally challenging infinite-dimensional optimization problem, we\nleverage flow-based models, continuous-time invertible transport maps between\nthe data distribution and the target distribution, and develop a Wasserstein\nproximal gradient flow type of algorithm. In practice, we parameterize the\ntransport maps by a sequence of neural networks progressively trained in blocks\nby gradient descent. Our computational framework is general, can handle\nhigh-dimensional data with large sample sizes, and can be useful for various\napplications. We demonstrate its usage in adversarial learning,\ndistributionally robust hypothesis testing, and a new mechanism for data-driven\ndistribution perturbation differential privacy, where the proposed method gives\nstrong empirical performance on real high-dimensional data."}, "http://arxiv.org/abs/2310.19343": {"title": "Quantile Super Learning for independent and online settings with application to solar power forecasting", "link": "http://arxiv.org/abs/2310.19343", "description": "Estimating quantiles of an outcome conditional on covariates is of\nfundamental interest in statistics with broad application in probabilistic\nprediction and forecasting. We propose an ensemble method for conditional\nquantile estimation, Quantile Super Learning, that combines predictions from\nmultiple candidate algorithms based on their empirical performance measured\nwith respect to a cross-validated empirical risk of the quantile loss function.\nWe present theoretical guarantees for both iid and online data scenarios. The\nperformance of our approach for quantile estimation and in forming prediction\nintervals is tested in simulation studies. Two case studies related to solar\nenergy are used to illustrate Quantile Super Learning: in an iid setting, we\npredict the physical properties of perovskite materials for photovoltaic cells,\nand in an online setting we forecast ground solar irradiance based on output\nfrom dynamic weather ensemble models."}, "http://arxiv.org/abs/2310.19433": {"title": "Ordinal classification for interval-valued data and interval-valued functional data", "link": "http://arxiv.org/abs/2310.19433", "description": "The aim of ordinal classification is to predict the ordered labels of the\noutput from a set of observed inputs. Interval-valued data refers to data in\nthe form of intervals. For the first time, interval-valued data and\ninterval-valued functional data are considered as inputs in an ordinal\nclassification problem. Six ordinal classifiers for interval data and\ninterval-valued functional data are proposed. Three of them are parametric, one\nof them is based on ordinal binary decompositions and the other two are based\non ordered logistic regression. The other three methods are based on the use of\ndistances between interval data and kernels on interval data. One of the\nmethods uses the weighted $k$-nearest-neighbor technique for ordinal\nclassification. Another method considers kernel principal component analysis\nplus an ordinal classifier. And the sixth method, which is the method that\nperforms best, uses a kernel-induced ordinal random forest. They are compared\nwith na\\\"ive approaches in an extensive experimental study with synthetic and\noriginal real data sets, about human global development, and weather data. The\nresults show that considering ordering and interval-valued information improves\nthe accuracy. 
The source code and data sets are available at\nhttps://github.com/aleixalcacer/OCFIVD."}, "http://arxiv.org/abs/2310.19435": {"title": "A novel characterization of structures in smooth regression curves: from a viewpoint of persistent homology", "link": "http://arxiv.org/abs/2310.19435", "description": "We characterize structures such as monotonicity, convexity, and modality in\nsmooth regression curves using persistent homology. Persistent homology is a\nkey tool in topological data analysis that detects higher dimensional\ntopological features such as connected components and holes (cycles or loops)\nin the data. In other words, persistent homology is a multiscale version of\nhomology that characterizes sets based on the connected components and holes.\nWe use super-level sets of functions to extract geometric features via\npersistent homology. In particular, we explore structures in regression curves\nvia the persistent homology of super-level sets of a function, where the\nfunction of interest is the first derivative of the regression function.\n\nIn the course of this study, we extend an existing procedure of estimating\nthe persistent homology for the first derivative of a regression function and\nestablish its consistency. Moreover, as an application of the proposed\nmethodology, we demonstrate that the persistent homology of the derivative of a\nfunction can reveal hidden structures in the function that are not visible from\nthe persistent homology of the function itself. In addition, we also illustrate\nthat the proposed procedure can be used to compare the shapes of two or more\nregression curves, which is not possible merely from the persistent homology of\nthe function itself."}, "http://arxiv.org/abs/2310.19519": {"title": "A General Neural Causal Model for Interactive Recommendation", "link": "http://arxiv.org/abs/2310.19519", "description": "Survivor bias in observational data leads the optimization of recommender\nsystems towards local optima. Currently, most solutions re-mine existing\nhuman-system collaboration patterns to maximize longer-term satisfaction by\nreinforcement learning. However, from the causal perspective, mitigating\nsurvivor effects requires answering a counterfactual problem, which is\ngenerally unidentifiable and inestimable. In this work, we propose a neural\ncausal model to achieve counterfactual inference. Specifically, we first build\na learnable structural causal model based on its available graphical\nrepresentations which qualitatively characterizes the preference transitions.\nMitigation of the survivor bias is achieved through counterfactual consistency.\nTo identify the consistency, we use the Gumbel-max function as structural\nconstraints. To estimate the consistency, we apply reinforcement optimizations,\nand use Gumbel-Softmax as a trade-off to get a differentiable function. Both\ntheoretical and empirical studies demonstrate the effectiveness of our\nsolution."}, "http://arxiv.org/abs/2310.19621": {"title": "A Bayesian Methodology for Estimation for Sparse Canonical Correlation", "link": "http://arxiv.org/abs/2310.19621", "description": "It can be challenging to perform an integrative statistical analysis of\nmulti-view high-dimensional data acquired from different experiments on each\nsubject who participated in a joint study. Canonical Correlation Analysis (CCA)\nis a statistical procedure for identifying relationships between such data\nsets. 
In that context, Structured Sparse CCA (ScSCCA) is a rapidly emerging\nmethodological area that aims for robust modeling of the interrelations between\nthe different data modalities by assuming the corresponding CCA directional\nvectors to be sparse. Although it is a rapidly growing area of statistical\nmethodology development, there is a need for developing related methodologies\nin the Bayesian paradigm. In this manuscript, we propose a novel ScSCCA\napproach where we employ a Bayesian infinite factor model and aim to achieve\nrobust estimation by encouraging sparsity in two different levels of the\nmodeling framework. Firstly, we utilize a multiplicative Half-Cauchy process\nprior to encourage sparsity at the level of the latent variable loading\nmatrices. Additionally, we promote further sparsity in the covariance matrix by\nusing graphical horseshoe prior or diagonal structure. We conduct multiple\nsimulations to compare the performance of the proposed method with that of\nother frequently used CCA procedures, and we apply the developed procedures to\nanalyze multi-omics data arising from a breast cancer study."}, "http://arxiv.org/abs/2310.19683": {"title": "An Online Bootstrap for Time Series", "link": "http://arxiv.org/abs/2310.19683", "description": "Resampling methods such as the bootstrap have proven invaluable in the field\nof machine learning. However, the applicability of traditional bootstrap\nmethods is limited when dealing with large streams of dependent data, such as\ntime series or spatially correlated observations. In this paper, we propose a\nnovel bootstrap method that is designed to account for data dependencies and\ncan be executed online, making it particularly suitable for real-time\napplications. This method is based on an autoregressive sequence of\nincreasingly dependent resampling weights. We prove the theoretical validity of\nthe proposed bootstrap scheme under general conditions. We demonstrate the\neffectiveness of our approach through extensive simulations and show that it\nprovides reliable uncertainty quantification even in the presence of complex\ndata dependencies. Our work bridges the gap between classical resampling\ntechniques and the demands of modern data analysis, providing a valuable tool\nfor researchers and practitioners in dynamic, data-rich environments."}, "http://arxiv.org/abs/2310.19787": {"title": "$e^{\\text{RPCA}}$: Robust Principal Component Analysis for Exponential Family Distributions", "link": "http://arxiv.org/abs/2310.19787", "description": "Robust Principal Component Analysis (RPCA) is a widely used method for\nrecovering low-rank structure from data matrices corrupted by significant and\nsparse outliers. These corruptions may arise from occlusions, malicious\ntampering, or other causes for anomalies, and the joint identification of such\ncorruptions with low-rank background is critical for process monitoring and\ndiagnosis. However, existing RPCA methods and their extensions largely do not\naccount for the underlying probabilistic distribution for the data matrices,\nwhich in many applications are known and can be highly non-Gaussian. We thus\npropose a new method called Robust Principal Component Analysis for Exponential\nFamily distributions ($e^{\\text{RPCA}}$), which can perform the desired\ndecomposition into low-rank and sparse matrices when such a distribution falls\nwithin the exponential family. 
We present a novel alternating direction method\nof multiplier optimization algorithm for efficient $e^{\\text{RPCA}}$\ndecomposition. The effectiveness of $e^{\\text{RPCA}}$ is then demonstrated in\ntwo applications: the first for steel sheet defect detection, and the second\nfor crime activity monitoring in the Atlanta metropolitan area."}, "http://arxiv.org/abs/2110.00152": {"title": "ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families", "link": "http://arxiv.org/abs/2110.00152", "description": "The empirical Bayes normal means (EBNM) model is important to many areas of\nstatistics, including (but not limited to) multiple testing, wavelet denoising,\nmultiple linear regression, and matrix factorization. There are several\nexisting software packages that can fit EBNM models under different prior\nassumptions and using different algorithms; however, the differences across\ninterfaces complicate direct comparisons. Further, a number of important prior\nassumptions do not yet have implementations. Motivated by these issues, we\ndeveloped the R package ebnm, which provides a unified interface for\nefficiently fitting EBNM models using a variety of prior assumptions, including\nnonparametric approaches. In some cases, we incorporated existing\nimplementations into ebnm; in others, we implemented new fitting procedures\nwith a focus on speed and numerical stability. To demonstrate the capabilities\nof the unified interface, we compare results using different prior assumptions\nin two extended examples: the shrinkage estimation of baseball statistics; and\nthe matrix factorization of genetics data (via the new R package flashier). In\nsummary, ebnm is a convenient and comprehensive package for performing EBNM\nanalyses under a wide range of prior assumptions."}, "http://arxiv.org/abs/2110.02440": {"title": "Inverse Probability Weighting-based Mediation Analysis for Microbiome Data", "link": "http://arxiv.org/abs/2110.02440", "description": "Mediation analysis is an important tool to study causal associations in\nbiomedical and other scientific areas and has recently gained attention in\nmicrobiome studies. Using a microbiome study of acute myeloid leukemia (AML)\npatients, we investigate whether the effect of induction chemotherapy intensity\nlevels on the infection status is mediated by the microbial taxa abundance. The\nunique characteristics of the microbial mediators -- high-dimensionality,\nzero-inflation, and dependence -- call for new methodological developments in\nmediation analysis. The presence of an exposure-induced mediator-outcome\nconfounder, antibiotic use, further requires a delicate treatment in the\nanalysis. To address these unique challenges in our motivating AML microbiome\nstudy, we propose a novel nonparametric identification formula for the\ninterventional indirect effect (IIE), a measure recently developed for studying\nmediation effects. We develop the corresponding estimation algorithm using the\ninverse probability weighting method. We also test the presence of mediation\neffects via constructing the standard normal bootstrap confidence intervals.\nSimulation studies show that the proposed method has good finite-sample\nperformance in terms of the IIE estimation, and type-I error rate and power of\nthe corresponding test. 
In the AML microbiome study, our findings suggest that\nthe effect of induction chemotherapy intensity levels on infection is mainly\nmediated by patients' gut microbiome."}, "http://arxiv.org/abs/2203.03532": {"title": "E-detectors: a nonparametric framework for sequential change detection", "link": "http://arxiv.org/abs/2203.03532", "description": "Sequential change detection is a classical problem with a variety of\napplications. However, the majority of prior work has been parametric, for\nexample, focusing on exponential families. We develop a fundamentally new and\ngeneral framework for sequential change detection when the pre- and post-change\ndistributions are nonparametrically specified (and thus composite). Our\nprocedures come with clean, nonasymptotic bounds on the average run length\n(frequency of false alarms). In certain nonparametric cases (like sub-Gaussian\nor sub-exponential), we also provide near-optimal bounds on the detection delay\nfollowing a changepoint. The primary technical tool that we introduce is called\nan \\emph{e-detector}, which is composed of sums of e-processes -- a fundamental\ngeneralization of nonnegative supermartingales -- that are started at\nconsecutive times. We first introduce simple Shiryaev-Roberts and CUSUM-style\ne-detectors, and then show how to design their mixtures in order to achieve\nboth statistical and computational efficiency. Our e-detector framework can be\ninstantiated to recover classical likelihood-based procedures for parametric\nproblems, as well as yielding the first change detection method for many\nnonparametric problems. As a running example, we tackle the problem of\ndetecting changes in the mean of a bounded random variable without i.i.d.\nassumptions, with an application to tracking the performance of a basketball\nteam over multiple seasons."}, "http://arxiv.org/abs/2208.02942": {"title": "sparsegl: An R Package for Estimating Sparse Group Lasso", "link": "http://arxiv.org/abs/2208.02942", "description": "The sparse group lasso is a high-dimensional regression technique that is\nuseful for problems whose predictors have a naturally grouped structure and\nwhere sparsity is encouraged at both the group and individual predictor level.\nIn this paper we discuss a new R package for computing such regularized models.\nThe intention is to provide highly optimized solution routines enabling\nanalysis of very large datasets, especially in the context of sparse design\nmatrices."}, "http://arxiv.org/abs/2208.06236": {"title": "Differentially Private Kolmogorov-Smirnov-Type Tests", "link": "http://arxiv.org/abs/2208.06236", "description": "Hypothesis testing is a central problem in statistical analysis, and there is\ncurrently a lack of differentially private tests which are both statistically\nvalid and powerful. In this paper, we develop several new differentially\nprivate (DP) nonparametric hypothesis tests. Our tests are based on\nKolmogorov-Smirnov, Kuiper, Cram\\'er-von Mises, and Wasserstein test\nstatistics, which can all be expressed as a pseudo-metric on empirical\ncumulative distribution functions (ecdfs), and can be used to test hypotheses\non goodness-of-fit, two samples, and paired data. We show that these test\nstatistics have low sensitivity, requiring minimal noise to satisfy DP. In\nparticular, we show that the sensitivity of these test statistics can be\nexpressed in terms of the base sensitivity, which is the pseudo-metric distance\nbetween the ecdfs of adjacent databases and is easily calculated. 
The sampling\ndistributions of our test statistics are distribution-free under the null\nhypothesis, enabling easy computation of $p$-values by Monte Carlo methods. We\nshow that in several settings, especially with small privacy budgets or\nheavy-tailed data, our new DP tests outperform alternative nonparametric DP\ntests."}, "http://arxiv.org/abs/2208.10027": {"title": "Learning Invariant Representations under General Interventions on the Response", "link": "http://arxiv.org/abs/2208.10027", "description": "It has become increasingly common nowadays to collect observations of feature\nand response pairs from different environments. As a consequence, one has to\napply learned predictors to data with a different distribution due to\ndistribution shifts. One principled approach is to adopt the structural causal\nmodels to describe training and test models, following the invariance principle\nwhich says that the conditional distribution of the response given its\npredictors remains the same across environments. However, this principle might\nbe violated in practical settings when the response is intervened. A natural\nquestion is whether it is still possible to identify other forms of invariance\nto facilitate prediction in unseen environments. To shed light on this\nchallenging scenario, we focus on linear structural causal models (SCMs) and\nintroduce invariant matching property (IMP), an explicit relation to capture\ninterventions through an additional feature, leading to an alternative form of\ninvariance that enables a unified treatment of general interventions on the\nresponse as well as the predictors. We analyze the asymptotic generalization\nerrors of our method under both the discrete and continuous environment\nsettings, where the continuous case is handled by relating it to the\nsemiparametric varying coefficient models. We present algorithms that show\ncompetitive performance compared to existing methods over various experimental\nsettings including a COVID dataset."}, "http://arxiv.org/abs/2208.11756": {"title": "Testing Many Constraints in Possibly Irregular Models Using Incomplete U-Statistics", "link": "http://arxiv.org/abs/2208.11756", "description": "We consider the problem of testing a null hypothesis defined by equality and\ninequality constraints on a statistical parameter. Testing such hypotheses can\nbe challenging because the number of relevant constraints may be on the same\norder or even larger than the number of observed samples. Moreover, standard\ndistributional approximations may be invalid due to irregularities in the null\nhypothesis. We propose a general testing methodology that aims to circumvent\nthese difficulties. The constraints are estimated by incomplete U-statistics,\nand we derive critical values by Gaussian multiplier bootstrap. We show that\nthe bootstrap approximation of incomplete U-statistics is valid for kernels\nthat we call mixed degenerate when the number of combinations used to compute\nthe incomplete U-statistic is of the same order as the sample size. It follows\nthat our test controls type I error even in irregular settings. Furthermore,\nthe bootstrap approximation covers high-dimensional settings making our testing\nstrategy applicable for problems with many constraints. The methodology is\napplicable, in particular, when the constraints to be tested are polynomials in\nU-estimable parameters. 
As an application, we consider goodness-of-fit tests of\nlatent tree models for multivariate data."}, "http://arxiv.org/abs/2210.05538": {"title": "Estimating optimal treatment regimes in survival contexts using an instrumental variable", "link": "http://arxiv.org/abs/2210.05538", "description": "In survival contexts, substantial literature exists on estimating optimal\ntreatment regimes, where treatments are assigned based on personal\ncharacteristics for the purpose of maximizing the survival probability. These\nmethods assume that a set of covariates is sufficient to deconfound the\ntreatment-outcome relationship. Nevertheless, the assumption can be limited in\nobservational studies or randomized trials in which non-adherence occurs. Thus,\nwe propose a novel approach for estimating the optimal treatment regime when\ncertain confounders are not observable and a binary instrumental variable is\navailable. Specifically, via a binary instrumental variable, we propose two\nsemiparametric estimators for the optimal treatment regime by maximizing\nKaplan-Meier-like estimators within a pre-defined class of regimes, one of\nwhich possesses the desirable property of double robustness. Because the\nKaplan-Meier-like estimators are jagged, we incorporate kernel smoothing\nmethods to enhance their performance. Under appropriate regularity conditions,\nthe asymptotic properties are rigorously established. Furthermore, the finite\nsample performance is assessed through simulation studies. Finally, we\nexemplify our method using data from the National Cancer Institute's (NCI)\nprostate, lung, colorectal, and ovarian cancer screening trial."}, "http://arxiv.org/abs/2302.00878": {"title": "The Contextual Lasso: Sparse Linear Models via Deep Neural Networks", "link": "http://arxiv.org/abs/2302.00878", "description": "Sparse linear models are one of several core tools for interpretable machine\nlearning, a field of emerging importance as predictive models permeate\ndecision-making in many domains. Unfortunately, sparse linear models are far\nless flexible as functions of their input features than black-box models like\ndeep neural networks. With this capability gap in mind, we study a not-uncommon\nsituation where the input features dichotomize into two groups: explanatory\nfeatures, which are candidates for inclusion as variables in an interpretable\nmodel, and contextual features, which select from the candidate variables and\ndetermine their effects. This dichotomy leads us to the contextual lasso, a new\nstatistical estimator that fits a sparse linear model to the explanatory\nfeatures such that the sparsity pattern and coefficients vary as a function of\nthe contextual features. The fitting process learns this function\nnonparametrically via a deep neural network. To attain sparse coefficients, we\ntrain the network with a novel lasso regularizer in the form of a projection\nlayer that maps the network's output onto the space of $\\ell_1$-constrained\nlinear models. 
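The projection layer mentioned in the contextual-lasso entry above can be pictured with the standard sorting-based Euclidean projection onto an $\ell_1$ ball; the sketch below is a generic stand-in under that assumption, not the paper's actual layer or training loop.

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the l1 ball {x : ||x||_1 <= radius}.

    Standard sorting-based algorithm, shown only as a generic stand-in for the kind of
    projection layer described above; the paper's layer and training details differ.
    """
    v = np.asarray(v, float)
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                      # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = k[u - (css - radius) / k > 0].max()         # largest index kept active
    theta = (css[rho - 1] - radius) / rho             # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

coefs = np.array([0.8, -2.5, 0.1, 1.7])
sparse_coefs = project_l1_ball(coefs, radius=2.0)     # some coefficients become exactly zero
```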
An extensive suite of experiments on real and synthetic data\nsuggests that the learned models, which remain highly transparent, can be\nsparser than the regular lasso without sacrificing the predictive power of a\nstandard deep neural network."}, "http://arxiv.org/abs/2302.02560": {"title": "Causal Estimation of Exposure Shifts with Neural Networks: Evaluating the Health Benefits of Stricter Air Quality Standards in the US", "link": "http://arxiv.org/abs/2302.02560", "description": "In policy research, one of the most critical analytic tasks is to estimate\nthe causal effect of a policy-relevant shift to the distribution of a\ncontinuous exposure/treatment on an outcome of interest. We call this problem\nshift-response function (SRF) estimation. Existing neural network methods\ninvolving robust causal-effect estimators lack theoretical guarantees and\npractical implementations for SRF estimation. Motivated by a key\npolicy-relevant question in public health, we develop a neural network method\nand its theoretical underpinnings to estimate SRFs with robustness and\nefficiency guarantees. We then apply our method to data consisting of 68\nmillion individuals and 27 million deaths across the U.S. to estimate the\ncausal effect from revising the US National Ambient Air Quality Standards\n(NAAQS) for PM 2.5 from 12 $\\mu g/m^3$ to 9 $\\mu g/m^3$. This change has been\nrecently proposed by the US Environmental Protection Agency (EPA). Our goal is\nto estimate, for the first time, the reduction in deaths that would result from\nthis anticipated revision using causal methods for SRFs. Our proposed method,\ncalled {T}argeted {R}egularization for {E}xposure {S}hifts with Neural\n{Net}works (TRESNET), contributes to the neural network literature for causal\ninference in two ways: first, it proposes a targeted regularization loss with\ntheoretical properties that ensure double robustness and achieves asymptotic\nefficiency specific for SRF estimation; second, it enables loss functions from\nthe exponential family of distributions to accommodate non-continuous outcome\ndistributions (such as hospitalization or mortality counts). We complement our\napplication with benchmark experiments that demonstrate TRESNET's broad\napplicability and competitiveness."}, "http://arxiv.org/abs/2303.03502": {"title": "A Semi-Parametric Model Simultaneously Handling Unmeasured Confounding, Informative Cluster Size, and Truncation by Death with a Data Application in Medicare Claims", "link": "http://arxiv.org/abs/2303.03502", "description": "Nearly 300,000 older adults experience a hip fracture every year, the\nmajority of which occur following a fall. Unfortunately, recovery after\nfall-related trauma such as hip fracture is poor, where older adults diagnosed\nwith Alzheimer's Disease and Related Dementia (ADRD) spend a particularly long\ntime in hospitals or rehabilitation facilities during the post-operative\nrecuperation period. Because older adults value functional recovery and\nspending time at home versus facilities as key outcomes after hospitalization,\nidentifying factors that influence days spent at home after hospitalization is\nimperative. While several individual-level factors have been identified, the\ncharacteristics of the treating hospital have recently been identified as\ncontributors. 
However, few methodologically rigorous approaches are available to\nhelp overcome potential sources of bias such as hospital-level unmeasured\nconfounders, informative hospital size, and loss to follow-up due to death.\nThis article develops a useful tool equipped with unsupervised learning to\nsimultaneously handle statistical complexities that are often encountered in\nhealth services research, especially when using large administrative claims\ndatabases. The proposed estimator has a closed form, thus requiring only a light\ncomputational load in a large-scale study. We further derive its asymptotic\nproperties, which can be used to make statistical inference in practice.\nExtensive simulation studies demonstrate the superiority of the proposed estimator\ncompared to existing estimators."}, "http://arxiv.org/abs/2305.04113": {"title": "Inferring Covariance Structure from Multiple Data Sources via Subspace Factor Analysis", "link": "http://arxiv.org/abs/2305.04113", "description": "Factor analysis provides a canonical framework for imposing lower-dimensional\nstructure such as sparse covariance in high-dimensional data. High-dimensional\ndata on the same set of variables are often collected under different\nconditions, for instance in reproducing studies across research groups. In such\ncases, it is natural to seek to learn the shared versus condition-specific\nstructure. Existing hierarchical extensions of factor analysis have been\nproposed, but face practical issues including identifiability problems. To\naddress these shortcomings, we propose a class of SUbspace Factor Analysis\n(SUFA) models, which characterize variation across groups at the level of a\nlower-dimensional subspace. We prove that the proposed class of SUFA models\nleads to identifiability of the shared versus group-specific components of the\ncovariance, and study their posterior contraction properties. Taking a Bayesian\napproach, these contributions are developed alongside efficient posterior\ncomputation algorithms. Our sampler fully integrates out latent variables, is\neasily parallelizable and has complexity that does not depend on sample size.\nWe illustrate the methods through application to integration of multiple gene\nexpression datasets relevant to immunology."}, "http://arxiv.org/abs/2305.17570": {"title": "Auditing Fairness by Betting", "link": "http://arxiv.org/abs/2305.17570", "description": "We provide practical, efficient, and nonparametric methods for auditing the\nfairness of deployed classification and regression models. Whereas previous\nwork relies on a fixed sample size, our methods are sequential and allow for\nthe continuous monitoring of incoming data, making them highly amenable to\ntracking the fairness of real-world systems. We also allow the data to be\ncollected by a probabilistic policy as opposed to sampled uniformly from the\npopulation. This enables auditing to be conducted on data gathered for another\npurpose. Moreover, this policy may change over time and different policies may\nbe used on different subpopulations. Finally, our methods can handle\ndistribution shift resulting from either changes to the model or changes in the\nunderlying population. Our approach is based on recent progress in\nanytime-valid inference and game-theoretic statistics -- the \"testing by betting\"\nframework in particular. These connections ensure that our methods are\ninterpretable, fast, and easy to implement. 
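A minimal sketch of the testing-by-betting idea behind the fairness-auditing entry above: a wealth process with a fixed bet on bounded observations, rejecting once the wealth exceeds 1/alpha. The estimand, adaptive betting strategy, and policy weighting used in the paper are richer; everything below is an assumed toy setup.

```python
import numpy as np

def betting_eprocess(x, m0, lam=0.5, alpha=0.05):
    """Minimal testing-by-betting sketch for H0: E[X] <= m0, with X bounded in [0, 1].

    Wealth W_t = prod_{i<=t} (1 + lam * (x_i - m0)) is a nonnegative supermartingale
    under H0 whenever 0 <= lam <= 1/m0, so by Ville's inequality the rule
    "reject once W_t >= 1/alpha" is anytime-valid at level alpha. The paper's
    auditors use richer fairness estimands and adaptive bets.
    """
    assert 0.0 <= lam <= 1.0 / m0
    wealth = np.cumprod(1.0 + lam * (np.asarray(x, float) - m0))
    crossings = np.nonzero(wealth >= 1.0 / alpha)[0]
    stopping_time = int(crossings[0]) + 1 if crossings.size else None
    return wealth, stopping_time

# Example: monitor a group's error indicators against a tolerated rate of 10%
errors = np.random.default_rng(0).binomial(1, 0.25, size=400)  # true rate is 25%
wealth, stop = betting_eprocess(errors, m0=0.10, lam=0.5)
```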
We demonstrate the efficacy of our\napproach on three benchmark fairness datasets."}, "http://arxiv.org/abs/2306.02948": {"title": "Learning under random distributional shifts", "link": "http://arxiv.org/abs/2306.02948", "description": "Many existing approaches for generating predictions in settings with\ndistribution shift model distribution shifts as adversarial or low-rank in\nsuitable representations. In various real-world settings, however, we might\nexpect shifts to arise through the superposition of many small and random\nchanges in the population and environment. Thus, we consider a class of random\ndistribution shift models that capture arbitrary changes in the underlying\ncovariate space, and dense, random shocks to the relationship between the\ncovariates and the outcomes. In this setting, we characterize the benefits and\ndrawbacks of several alternative prediction strategies: the standard approach\nthat directly predicts the long-term outcome of interest, the proxy approach\nthat directly predicts a shorter-term proxy outcome, and a hybrid approach that\nutilizes both the long-term policy outcome and (shorter-term) proxy outcome(s).\nWe show that the hybrid approach is robust to the strength of the distribution\nshift and the proxy relationship. We apply this method to datasets in two\nhigh-impact domains: asylum-seeker assignment and early childhood education. In\nboth settings, we find that the proposed approach results in substantially\nlower mean-squared error than current approaches."}, "http://arxiv.org/abs/2306.05751": {"title": "Advancing Counterfactual Inference through Quantile Regression", "link": "http://arxiv.org/abs/2306.05751", "description": "The capacity to address counterfactual \"what if\" inquiries is crucial for\nunderstanding and making use of causal influences. Traditional counterfactual\ninference usually assumes the availability of a structural causal model. Yet,\nin practice, such a causal model is often unknown and may not be identifiable.\nThis paper aims to perform reliable counterfactual inference based on the\n(learned) qualitative causal structure and observational data, without\nnecessitating a given causal model or even the direct estimation of conditional\ndistributions. We re-cast counterfactual reasoning as an extended quantile\nregression problem, implemented with deep neural networks to capture general\ncausal relationships and data distributions. The proposed approach offers\nsuperior statistical efficiency compared to existing ones, and further, it\nenhances the potential for generalizing the estimated counterfactual outcomes\nto previously unseen data, providing an upper bound on the generalization\nerror. Empirical results conducted on multiple datasets offer compelling\nsupport for our theoretical assertions."}, "http://arxiv.org/abs/2306.06155": {"title": "Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks", "link": "http://arxiv.org/abs/2306.06155", "description": "We present a new representation learning framework, Intensity Profile\nProjection, for continuous-time dynamic network data. Given triples $(i,j,t)$,\neach representing a time-stamped ($t$) interaction between two entities\n($i,j$), our procedure returns a continuous-time trajectory for each node,\nrepresenting its behaviour over time. The framework consists of three stages:\nestimating pairwise intensity functions, e.g. 
via kernel smoothing; learning a\nprojection which minimises a notion of intensity reconstruction error; and\nconstructing evolving node representations via the learned projection. The\ntrajectories satisfy two properties, known as structural and temporal\ncoherence, which we see as fundamental for reliable inference. Moreover, we\ndevelop estimation theory providing tight control on the error of any estimated\ntrajectory, indicating that the representations could even be used in quite\nnoise-sensitive follow-on analyses. The theory also elucidates the role of\nsmoothing as a bias-variance trade-off, and shows how we can reduce the level\nof smoothing as the signal-to-noise ratio increases on account of the algorithm\n`borrowing strength' across the network."}, "http://arxiv.org/abs/2306.08777": {"title": "MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting", "link": "http://arxiv.org/abs/2306.08777", "description": "We propose novel statistics which maximise the power of a two-sample test\nbased on the Maximum Mean Discrepancy (MMD), by adapting over the set of\nkernels used in defining it. For finite sets, this reduces to combining\n(normalised) MMD values under each of these kernels via a weighted soft\nmaximum. Exponential concentration bounds are proved for our proposed\nstatistics under the null and alternative. We further show how these kernels\ncan be chosen in a data-dependent but permutation-independent way, in a\nwell-calibrated test, avoiding data splitting. This technique applies more\nbroadly to general permutation-based MMD testing, and includes the use of deep\nkernels with features learnt using unsupervised models such as auto-encoders.\nWe highlight the applicability of our MMD-FUSE test on both synthetic\nlow-dimensional and real-world high-dimensional data, and compare its\nperformance in terms of power against current state-of-the-art kernel tests."}, "http://arxiv.org/abs/2306.09335": {"title": "Class-Conditional Conformal Prediction with Many Classes", "link": "http://arxiv.org/abs/2306.09335", "description": "Standard conformal prediction methods provide a marginal coverage guarantee,\nwhich means that for a random test point, the conformal prediction set contains\nthe true label with a user-specified probability. In many classification\nproblems, we would like to obtain a stronger guarantee--that for test points of\na specific class, the prediction set contains the true label with the same\nuser-chosen probability. For the latter goal, existing conformal prediction\nmethods do not work well when there is a limited amount of labeled data per\nclass, as is often the case in real applications where the number of classes is\nlarge. We propose a method called clustered conformal prediction that clusters\ntogether classes having \"similar\" conformal scores and performs conformal\nprediction at the cluster level. Based on empirical evaluation across four\nimage data sets with many (up to 1000) classes, we find that clustered\nconformal prediction typically outperforms existing methods in terms of class-conditional\ncoverage and set size metrics."}, "http://arxiv.org/abs/2306.11839": {"title": "Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations", "link": "http://arxiv.org/abs/2306.11839", "description": "Randomized experiments often need to be stopped prematurely due to the\ntreatment having an unintended harmful effect. 
Existing methods that determine\nwhen to stop an experiment early are typically applied to the data in aggregate\nand do not account for treatment effect heterogeneity. In this paper, we study\nthe early stopping of experiments for harm on heterogeneous populations. We\nfirst establish that current methods often fail to stop experiments when the\ntreatment harms a minority group of participants. We then use causal machine\nlearning to develop CLASH, the first broadly-applicable method for\nheterogeneous early stopping. We demonstrate CLASH's performance on simulated\nand real data and show that it yields effective early stopping for both\nclinical trials and A/B tests."}, "http://arxiv.org/abs/2307.02520": {"title": "Conditional independence testing under misspecified inductive biases", "link": "http://arxiv.org/abs/2307.02520", "description": "Conditional independence (CI) testing is a fundamental and challenging task\nin modern statistics and machine learning. Many modern methods for CI testing\nrely on powerful supervised learning methods to learn regression functions or\nBayes predictors as an intermediate step; we refer to this class of tests as\nregression-based tests. Although these methods are guaranteed to control Type-I\nerror when the supervised learning methods accurately estimate the regression\nfunctions or Bayes predictors of interest, their behavior is less understood\nwhen they fail due to misspecified inductive biases; in other words, when the\nemployed models are not flexible enough or when the training algorithm does not\ninduce the desired predictors. Here, we study the performance of\nregression-based CI tests under misspecified inductive biases. Namely, we\npropose new approximations or upper bounds for the testing errors of three\nregression-based tests that depend on misspecification errors. Moreover, we\nintroduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI\ntest robust against misspecified inductive biases. Finally, we conduct\nexperiments with artificial and real data, showcasing the usefulness of our\ntheory and methods."}, "http://arxiv.org/abs/2308.03801": {"title": "On problematic practice of using normalization in Self-modeling/Multivariate Curve Resolution (S/MCR)", "link": "http://arxiv.org/abs/2308.03801", "description": "The paper briefly deals with normalization, which is misused to a greater or lesser extent in\nself-modeling/multivariate curve resolution (S/MCR) practice. The importance of\nthe correct use of ODE solvers is elucidated through apt kinetic illustrations.\nThe new terms, external and internal normalization, are defined and\ninterpreted. The problem of matrix reducibility is touched upon. Improper\ngeneralizations and developments of normalization-based methods are cited as\nexamples. The position of the extreme values of the signal contribution\nfunction is clarified. An Executable Notebook with Matlab Live Editor was\ncreated for algorithmic explanations and depictions."}, "http://arxiv.org/abs/2308.05373": {"title": "Conditional Independence Testing for Discrete Distributions: Beyond $\\chi^2$- and $G$-tests", "link": "http://arxiv.org/abs/2308.05373", "description": "This paper is concerned with the problem of conditional independence testing\nfor discrete data. In recent years, researchers have shed new light on this\nfundamental problem, emphasizing finite-sample optimality. The non-asymptotic\nviewpoint adopted in these works has led to novel conditional independence\ntests that enjoy certain optimality under various regimes. 
Despite their\nattractive theoretical properties, the considered tests are not necessarily\npractical, relying on a Poissonization trick and unspecified constants in their\ncritical values. In this work, we attempt to bridge the gap between theory and\npractice by reproving optimality without Poissonization and calibrating tests\nusing Monte Carlo permutations. Along the way, we also prove that classical\nasymptotic $\\chi^2$- and $G$-tests are notably sub-optimal in a\nhigh-dimensional regime, which justifies the demand for new tools. Our\ntheoretical results are complemented by experiments on both simulated and\nreal-world datasets. Accompanying this paper is an R package UCI that\nimplements the proposed tests."}, "http://arxiv.org/abs/2309.03875": {"title": "Network Sampling Methods for Estimating Social Networks, Population Percentages, and Totals of People Experiencing Unsheltered Homelessness", "link": "http://arxiv.org/abs/2309.03875", "description": "In this article, we propose using network-based sampling strategies to\nestimate the number of unsheltered people experiencing homelessness within a\ngiven administrative service unit, known as a Continuum of Care. We demonstrate\nthe effectiveness of network sampling methods to solve this problem. Here, we\nfocus on Respondent Driven Sampling (RDS), which has been shown to provide\nunbiased or low-biased estimates of totals and proportions for hard-to-reach\npopulations in contexts where a sampling frame (e.g., housing addresses) is not\navailable. To make the RDS estimator work for estimating the total number of\npeople living unsheltered, we introduce a new method that leverages\nadministrative data from the HUD-mandated Homeless Management Information\nSystem (HMIS). The HMIS provides high-quality counts and demographics for\npeople experiencing homelessness who sleep in emergency shelters. We then\ndemonstrate this method using network data collected in Nashville, TN, combined\nwith simulation methods to illustrate the efficacy of this approach and\nintroduce a method for performing a power analysis to find the optimal sample\nsize in this setting. We conclude with the RDS unsheltered PIT count conducted\nby King County Regional Homelessness Authority in 2022 (data publicly available\non the HUD website) and perform a comparative analysis between the 2022 RDS\nestimate of unsheltered people experiencing homelessness and an ARIMA forecast\nof the visual unsheltered PIT count. Finally, we discuss how this method works\nfor estimating the unsheltered population of people experiencing homelessness\nand future areas of research."}, "http://arxiv.org/abs/2310.19973": {"title": "Unified Enhancement of Privacy Bounds for Mixture Mechanisms via $f$-Differential Privacy", "link": "http://arxiv.org/abs/2310.19973", "description": "Differentially private (DP) machine learning algorithms incur many sources of\nrandomness, such as random initialization, random batch subsampling, and\nshuffling. However, such randomness is difficult to take into account when\nproving differential privacy bounds because it induces mixture distributions\nfor the algorithm's output that are difficult to analyze. This paper focuses on\nimproving privacy bounds for shuffling models and one-iteration differentially\nprivate gradient descent (DP-GD) with random initializations using $f$-DP. 
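For readers unfamiliar with $f$-DP, the sketch below shows textbook background only -- the trade-off function of the Gaussian mechanism and the standard $\mu$-GDP to $(\epsilon,\delta)$-DP conversion -- not the paper's closed-form result for shuffling models; both functions are assumed helpers written for this illustration.

```python
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(alpha, mu):
    """Trade-off function of the Gaussian mechanism (mu-GDP): type II error at type I error alpha."""
    return norm.cdf(norm.ppf(1.0 - alpha) - mu)

def gdp_to_dp_delta(mu, eps):
    """delta such that mu-GDP implies (eps, delta)-DP, via the standard f-DP duality formula."""
    return norm.cdf(-eps / mu + mu / 2.0) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2.0)

alphas = np.linspace(0.0, 1.0, 101)
curve = gaussian_tradeoff(alphas, mu=1.0)   # pointwise-higher curves mean stronger privacy
delta_at_eps1 = gdp_to_dp_delta(mu=1.0, eps=1.0)
```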
We\nderive a closed-form expression of the trade-off function for shuffling models\nthat outperforms the most up-to-date results based on $(\\epsilon,\\delta)$-DP.\nMoreover, we investigate the effects of random initialization on the privacy of\none-iteration DP-GD. Our numerical computations of the trade-off function\nindicate that random initialization can enhance the privacy of DP-GD. Our\nanalysis of $f$-DP guarantees for these mixture mechanisms relies on an\ninequality for trade-off functions introduced in this paper. This inequality\nimplies the joint convexity of $F$-divergences. Finally, we study an $f$-DP\nanalog of the advanced joint convexity of the hockey-stick divergence related\nto $(\\epsilon,\\delta)$-DP and apply it to analyze the privacy of mixture\nmechanisms."}, "http://arxiv.org/abs/2310.19985": {"title": "Modeling random directions of changes in simplex-valued data", "link": "http://arxiv.org/abs/2310.19985", "description": "We propose models and algorithms for learning about random directions in\nsimplex-valued data. The models are applied to the study of income level\nproportions and their changes over time in a geostatistical area. There are\nseveral notable challenges in the analysis of simplex-valued data: the\nmeasurements must respect the simplex constraint and the changes exhibit\nspatiotemporal smoothness and may be heterogeneous. To that end, we propose\nBayesian models that draw from and expand upon building blocks in circular and\nspatial statistics by exploiting a suitable transformation for the\nsimplex-valued data. Our models also account for spatial correlation across\nlocations in the simplex and the heterogeneous patterns via mixture modeling.\nWe describe some properties of the models and model fitting via MCMC\ntechniques. Our models and methods are applied to an analysis of movements and\ntrends of income categories using the Home Mortgage Disclosure Act data."}, "http://arxiv.org/abs/2310.19988": {"title": "Counterfactual fairness for small subgroups", "link": "http://arxiv.org/abs/2310.19988", "description": "While methods for measuring and correcting differential performance in risk\nprediction models have proliferated in recent years, most existing techniques\ncan only be used to assess fairness across relatively large subgroups. The\npurpose of algorithmic fairness efforts is often to redress discrimination\nagainst groups that are both marginalized and small, so this sample size\nlimitation often prevents existing techniques from accomplishing their main\naim. We take a three-pronged approach to address the problem of quantifying\nfairness with small subgroups. First, we propose new estimands built on the\n\"counterfactual fairness\" framework that leverage information across groups.\nSecond, we estimate these quantities using a larger volume of data than\nexisting techniques. Finally, we propose a novel data borrowing approach to\nincorporate \"external data\" that lacks outcomes and predictions but contains\ncovariate and group membership information. This less stringent requirement on\nthe external data allows for more possibilities for external data sources. 
We\ndemonstrate practical application of our estimators to a risk prediction model\nused by a major Midwestern health system during the COVID-19 pandemic."}, "http://arxiv.org/abs/2310.20058": {"title": "New Asymptotic Limit Theory and Inference for Monotone Regression", "link": "http://arxiv.org/abs/2310.20058", "description": "Nonparametric regression problems with qualitative constraints such as\nmonotonicity or convexity are ubiquitous in applications. For example, in\npredicting the yield of a factory in terms of the number of labor hours, the\nmonotonicity of the conditional mean function is a natural constraint. One can\nestimate a monotone conditional mean function using nonparametric least squares\nestimation, which involves no tuning parameters. Several interesting properties\nof the isotonic LSE are known including its rate of convergence, adaptivity\nproperties, and pointwise asymptotic distribution. However, we believe that the\nfull richness of the asymptotic limit theory has not been explored in the\nliterature which we do in this paper. Moreover, the inference problem is not\nfully settled. In this paper, we present some new results for monotone\nregression including an extension of existing results to triangular arrays, and\nprovide asymptotically valid confidence intervals that are uniformly valid over\na large class of distributions."}, "http://arxiv.org/abs/2310.20075": {"title": "Meek Separators and Their Applications in Targeted Causal Discovery", "link": "http://arxiv.org/abs/2310.20075", "description": "Learning causal structures from interventional data is a fundamental problem\nwith broad applications across various fields. While many previous works have\nfocused on recovering the entire causal graph, in practice, there are scenarios\nwhere learning only part of the causal graph suffices. This is called\n$targeted$ causal discovery. In our work, we focus on two such well-motivated\nproblems: subset search and causal matching. We aim to minimize the number of\ninterventions in both cases.\n\nTowards this, we introduce the $Meek~separator$, which is a subset of\nvertices that, when intervened, decomposes the remaining unoriented edges into\nsmaller connected components. We then present an efficient algorithm to find\nMeek separators that are of small sizes. Such a procedure is helpful in\ndesigning various divide-and-conquer-based approaches. In particular, we\npropose two randomized algorithms that achieve logarithmic approximation for\nsubset search and causal matching, respectively. Our results provide the first\nknown average-case provable guarantees for both problems. We believe that this\nopens up possibilities to design near-optimal methods for many other targeted\ncausal structure learning problems arising from various applications."}, "http://arxiv.org/abs/2310.20087": {"title": "PAM-HC: A Bayesian Nonparametric Construction of Hybrid Control for Randomized Clinical Trials Using External Data", "link": "http://arxiv.org/abs/2310.20087", "description": "It is highly desirable to borrow information from external data to augment a\ncontrol arm in a randomized clinical trial, especially in settings where the\nsample size for the control arm is limited. However, a main challenge in\nborrowing information from external data is to accommodate potential\nheterogeneous subpopulations across the external and trial data. 
We apply a\nBayesian nonparametric model called the Plaid Atoms Model (PAM) to identify\noverlapping and unique subpopulations across datasets, with which we restrict\nthe information borrowing to the common subpopulations. This forms a hybrid\ncontrol (HC) that leads to more precise estimation of treatment effects.\nSimulation studies demonstrate the robustness of the new method, and an\napplication to an Atopic Dermatitis dataset shows improved treatment effect\nestimation."}, "http://arxiv.org/abs/2310.20088": {"title": "Optimal transport representations and functional principal components for distribution-valued processes", "link": "http://arxiv.org/abs/2310.20088", "description": "We develop statistical models for samples of distribution-valued stochastic\nprocesses through time-varying optimal transport process representations under\nthe Wasserstein metric when the values of the process are univariate\ndistributions. While functional data analysis provides a toolbox for the\nanalysis of samples of real- or vector-valued processes, there is at present no\ncoherent statistical methodology available for samples of distribution-valued\nprocesses, which are increasingly encountered in data analysis. To address the\nneed for such methodology, we introduce a transport model for samples of\ndistribution-valued stochastic processes that implements an intrinsic approach\nwhereby distributions are represented by optimal transports. Substituting\ntransports for distributions addresses the challenge of centering\ndistribution-valued processes and leads to a useful and interpretable\nrepresentation of each realized process by an overall transport and a\nreal-valued trajectory, utilizing a scalar multiplication operation for\ntransports. This representation facilitates a connection to Gaussian processes\nthat proves useful, especially for the case where the distribution-valued\nprocesses are only observed on a sparse grid of time points. We study the\nconvergence of the key components of the proposed representation to their\npopulation targets and demonstrate the practical utility of the proposed\napproach through simulations and application examples."}, "http://arxiv.org/abs/2310.20182": {"title": "Explicit Form of the Asymptotic Variance Estimator for IPW-type Estimators of Certain Estimands", "link": "http://arxiv.org/abs/2310.20182", "description": "Confidence intervals (CIs) for the IPW estimators of the ATT and ATO might not\nalways yield conservative CIs when using the 'robust sandwich variance'\nestimator. In this manuscript, we identify scenarios where this variance\nestimator can be employed to derive conservative CIs. Specifically, for the\nATT, a conservative CI can be derived when there's a homogeneous treatment\neffect or the interaction effect surpasses the effect from the covariates\nalone. For the ATO, conservative CIs can be derived under certain conditions,\nsuch as when there are homogeneous treatment effects, when there exist\nsignificant treatment-confounder interactions, or when there's a large number\nof members in the control groups."}, "http://arxiv.org/abs/2310.20294": {"title": "Robust nonparametric regression based on deep ReLU neural networks", "link": "http://arxiv.org/abs/2310.20294", "description": "In this paper, we consider robust nonparametric regression using deep neural\nnetworks with the ReLU activation function. 
While several existing theoretically\njustified methods are geared towards robustness against identical heavy-tailed\nnoise distributions, the rise of adversarial attacks has emphasized the\nimportance of safeguarding estimation procedures against systematic\ncontamination. We approach this statistical issue by shifting our focus towards\nestimating conditional distributions. To address it robustly, we introduce a\nnovel estimation procedure based on $\\ell$-estimation. Under a mild model\nassumption, we establish general non-asymptotic risk bounds for the resulting\nestimators, showcasing their robustness against contamination, outliers, and\nmodel misspecification. We then delve into the application of our approach\nusing deep ReLU neural networks. When the model is well-specified and the\nregression function belongs to an $\\alpha$-H\\\"older class, employing\n$\\ell$-type estimation on suitable networks enables the resulting estimators to\nachieve the minimax optimal rate of convergence. Additionally, we demonstrate\nthat deep $\\ell$-type estimators can circumvent the curse of dimensionality by\nassuming the regression function closely resembles the composition of several\nH\\\"older functions. To attain this, new deep fully-connected ReLU neural\nnetworks have been designed to approximate this composition class. This\napproximation result can be of independent interest."}, "http://arxiv.org/abs/2310.20376": {"title": "Mixture modeling via vectors of normalized independent finite point processes", "link": "http://arxiv.org/abs/2310.20376", "description": "Statistical modeling in presence of hierarchical data is a crucial task in\nBayesian statistics. The Hierarchical Dirichlet Process (HDP) represents the\nutmost tool to handle data organized in groups through mixture modeling.\nAlthough the HDP is mathematically tractable, its computational cost is\ntypically demanding, and its analytical complexity represents a barrier for\npractitioners. The present paper conceives a mixture model based on a novel\nfamily of Bayesian priors designed for multilevel data and obtained by\nnormalizing a finite point process. A full distribution theory for this new\nfamily and the induced clustering is developed, including tractable expressions\nfor marginal, posterior and predictive distributions. Efficient marginal and\nconditional Gibbs samplers are designed for providing posterior inference. The\nproposed mixture model overcomes the HDP in terms of analytical feasibility,\nclustering discovery, and computational time. The motivating application comes\nfrom the analysis of shot put data, which contains performance measurements of\nathletes across different seasons. In this setting, the proposed model is\nexploited to induce clustering of the observations across seasons and athletes.\nBy linking clusters across seasons, similarities and differences in athlete's\nperformances are identified."}, "http://arxiv.org/abs/2310.20409": {"title": "Detection of nonlinearity, discontinuity and interactions in generalized regression models", "link": "http://arxiv.org/abs/2310.20409", "description": "In generalized regression models the effect of continuous covariates is\ncommonly assumed to be linear. This assumption, however, may be too restrictive\nin applications and may lead to biased effect estimates and decreased\npredictive ability. 
While a multitude of alternatives for the flexible modeling\nof continuous covariates have been proposed, methods that provide guidance for\nchoosing a suitable functional form are still limited. To address this issue,\nwe propose a detection algorithm that evaluates several approaches for modeling\ncontinuous covariates and guides practitioners to choose the most appropriate\nalternative. The algorithm utilizes a unified framework for tree-structured\nmodeling which makes the results easily interpretable. We assessed the\nperformance of the algorithm by conducting a simulation study. To illustrate\nthe proposed algorithm, we analyzed data from patients suffering from chronic\nkidney disease."}, "http://arxiv.org/abs/2310.20450": {"title": "Safe Testing for Large-Scale Experimentation Platforms", "link": "http://arxiv.org/abs/2310.20450", "description": "In the past two decades, AB testing has proliferated to optimise products in\ndigital domains. Traditional AB tests use fixed-horizon testing, determining\nthe sample size of the experiment and continuing until the experiment has\nconcluded. However, due to the feedback provided by modern data infrastructure,\nexperimenters may take incorrect decisions based on preliminary results of the\ntest. For this reason, anytime-valid inference (AVI) is seeing increased\nadoption as the modern experimenter's method for rapid decision making in the\nworld of data streaming.\n\nThis work focuses on Safe Testing, a novel framework for experimentation that\nenables continuous analysis without elevating the risk of incorrect\nconclusions. There exist safe testing equivalents of many common statistical\ntests, including the z-test, the t-test, and the proportion test. We compare\nthe efficacy of safe tests against classical tests and another method for AVI,\nthe mixture sequential probability ratio test (mSPRT). Comparisons are\nconducted first on simulated data and then on real-world data from Vinted, a\nlarge European online marketplace for second-hand\nclothing. Our findings indicate that safe tests require fewer samples to detect\nsignificant effects, supporting their potential for broader adoption."}, "http://arxiv.org/abs/2310.20460": {"title": "Aggregating Dependent Signals with Heavy-Tailed Combination Tests", "link": "http://arxiv.org/abs/2310.20460", "description": "Combining dependent p-values to evaluate the global null hypothesis presents\na longstanding challenge in statistical inference, particularly when\naggregating results from diverse methods to boost signal detection. P-value\ncombination tests using heavy-tailed distribution-based transformations, such\nas the Cauchy combination test and the harmonic mean p-value, have recently\ngarnered significant interest for their potential to efficiently handle\narbitrary p-value dependencies. Despite their growing popularity in practical\napplications, there is a gap in comprehensive theoretical and empirical\nevaluations of these methods. This paper conducts an extensive investigation,\nrevealing that, theoretically, while these combination tests are asymptotically\nvalid for pairwise quasi-asymptotically independent test statistics, such as\nbivariate normal variables, they are also asymptotically equivalent to the\nBonferroni test under the same conditions. However, extensive simulations\nunveil their practical utility, especially in scenarios where stringent type-I\nerror control is not necessary and signals are dense. 
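The Cauchy combination test and harmonic mean p-value discussed in the entry above are easy to state; the sketch below implements only the basic untruncated, equal-weight versions for illustration (the entry goes on to recommend a truncated variant in practice).

```python
import numpy as np

def cauchy_combination(pvals, weights=None):
    """Cauchy combination test: aggregate (possibly dependent) p-values into one p-value.

    T = sum_i w_i * tan((0.5 - p_i) * pi) is compared against the standard Cauchy tail.
    Only the basic untruncated transform is shown here.
    """
    p = np.asarray(pvals, float)
    w = np.full(len(p), 1.0 / len(p)) if weights is None else np.asarray(weights, float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

def harmonic_mean_pvalue(pvals):
    """Unweighted harmonic mean p-value (raw, uncalibrated version)."""
    p = np.asarray(pvals, float)
    return len(p) / np.sum(1.0 / p)

pvals = [0.8, 0.01, 0.3, 0.04, 0.6]
print(cauchy_combination(pvals), harmonic_mean_pvalue(pvals))
```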
Both the heaviness of the\ndistribution and its support substantially impact the tests' non-asymptotic\nvalidity and power, and we recommend using a truncated Cauchy distribution in\npractice. Moreover, we show that under the violation of quasi-asymptotic\nindependence among test statistics, these tests remain valid and, in fact, can\nbe considerably less conservative than the Bonferroni test. We also present two\ncase studies in genetics and genomics, showcasing the potential of the\ncombination tests to significantly enhance statistical power while effectively\ncontrolling type-I errors."}, "http://arxiv.org/abs/2310.20483": {"title": "Measuring multidimensional heterogeneity in emergent social phenomena", "link": "http://arxiv.org/abs/2310.20483", "description": "Measuring inequalities in a multidimensional framework is a challenging\nproblem which is common to most fields of science and engineering. Nevertheless,\ndespite the enormous amount of research illustrating the fields of\napplication of inequality indices, and of the Gini index in particular, very\nfew consider the case of a multidimensional variable. In this paper, we\nconsider in some detail a new inequality index, based on the Fourier\ntransform, that can be fruitfully applied to measure the degree of\ninhomogeneity of multivariate probability distributions. This index exhibits a\nnumber of interesting properties that make it very promising in quantifying the\ndegree of inequality in data sets of complex and multifaceted social phenomena."}, "http://arxiv.org/abs/2310.20537": {"title": "Directed Cyclic Graph for Causal Discovery from Multivariate Functional Data", "link": "http://arxiv.org/abs/2310.20537", "description": "Discovering causal relationships using multivariate functional data has\nreceived a significant amount of attention very recently. In this article, we\nintroduce a functional linear structural equation model for causal structure\nlearning when the underlying graph involving the multivariate functions may\nhave cycles. To enhance interpretability, our model involves a low-dimensional\ncausal embedded space such that all the relevant causal information in the\nmultivariate functional data is preserved in this lower-dimensional subspace.\nWe prove that the proposed model is causally identifiable under standard\nassumptions that are often made in the causal discovery literature. To carry\nout inference of our model, we develop a fully Bayesian framework with suitable\nprior specifications and uncertainty quantification through posterior\nsummaries. We illustrate the superior performance of our method over existing\nmethods in terms of causal graph estimation through extensive simulation\nstudies. We also demonstrate the proposed method using a brain EEG dataset."}, "http://arxiv.org/abs/2310.20697": {"title": "Text-Transport: Toward Learning Causal Effects of Natural Language", "link": "http://arxiv.org/abs/2310.20697", "description": "As language technologies gain prominence in real-world settings, it is\nimportant to understand how changes to language affect reader perceptions. This\ncan be formalized as the causal effect of varying a linguistic attribute (e.g.,\nsentiment) on a reader's response to the text. In this paper, we introduce\nText-Transport, a method for estimation of causal effects from natural language\nunder any text distribution. 
Current approaches for valid causal effect\nestimation require strong assumptions about the data, meaning the data from\nwhich one can estimate valid causal effects often is not representative of the\nactual target domain of interest. To address this issue, we leverage the notion\nof distribution shift to describe an estimator that transports causal effects\nbetween domains, bypassing the need for strong assumptions in the target\ndomain. We derive statistical guarantees on the uncertainty of this estimator,\nand we report empirical results and analyses that support the validity of\nText-Transport across data settings. Finally, we use Text-Transport to study a\nrealistic setting--hate speech on social media--in which causal effects do\nshift significantly between text domains, demonstrating the necessity of\ntransport when conducting causal inference on natural language."}, "http://arxiv.org/abs/2203.15009": {"title": "DAMNETS: A Deep Autoregressive Model for Generating Markovian Network Time Series", "link": "http://arxiv.org/abs/2203.15009", "description": "Generative models for network time series (also known as dynamic graphs) have\ntremendous potential in fields such as epidemiology, biology and economics,\nwhere complex graph-based dynamics are core objects of study. Designing\nflexible and scalable generative models is a very challenging task due to the\nhigh dimensionality of the data, as well as the need to represent temporal\ndependencies and marginal network structure. Here we introduce DAMNETS, a\nscalable deep generative model for network time series. DAMNETS outperforms\ncompeting methods on all of our measures of sample quality, over both real and\nsynthetic data sets."}, "http://arxiv.org/abs/2212.01792": {"title": "Classification by sparse additive models", "link": "http://arxiv.org/abs/2212.01792", "description": "We consider (nonparametric) sparse additive models (SpAM) for classification.\nThe design of a SpAM classifier is based on minimizing the logistic loss with a\nsparse group Lasso and more general sparse group Slope-type penalties on the\ncoefficients of univariate components' expansions in orthonormal series (e.g.,\nFourier or wavelets). The resulting classifiers are inherently adaptive to the\nunknown sparsity and smoothness. We show that under certain sparse group\nrestricted eigenvalue condition the sparse group Lasso classifier is\nnearly-minimax (up to log-factors) within the entire range of analytic, Sobolev\nand Besov classes while the sparse group Slope classifier achieves the exact\nminimax order (without the extra log-factors) for sparse and moderately dense\nsetups. The performance of the proposed classifier is illustrated on the\nreal-data example."}, "http://arxiv.org/abs/2302.11656": {"title": "Confounder-Dependent Bayesian Mixture Model: Characterizing Heterogeneity of Causal Effects in Air Pollution Epidemiology", "link": "http://arxiv.org/abs/2302.11656", "description": "Several epidemiological studies have provided evidence that long-term\nexposure to fine particulate matter (PM2.5) increases mortality risk.\nFurthermore, some population characteristics (e.g., age, race, and\nsocioeconomic status) might play a crucial role in understanding vulnerability\nto air pollution. To inform policy, it is necessary to identify groups of the\npopulation that are more or less vulnerable to air pollution. In causal\ninference literature, the Group Average Treatment Effect (GATE) is a\ndistinctive facet of the conditional average treatment effect. 
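As a concrete but generic illustration of a GATE (not the CDBMM methodology of the entry above), the sketch below averages plug-in individual effect estimates from a simple linear T-learner within each group; the data-generating example and function name are made up for this illustration.

```python
import numpy as np

def gate_t_learner(X, treat, y, groups):
    """Group Average Treatment Effects via a simple linear T-learner (generic plug-in sketch).

    Fits separate linear outcome models for treated and control units, forms individual
    effect estimates, and averages them within each group. This is only a baseline GATE
    estimator, not the Bayesian mixture model (CDBMM) described in the entry above.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta_treated, *_ = np.linalg.lstsq(X1[treat == 1], y[treat == 1], rcond=None)
    beta_control, *_ = np.linalg.lstsq(X1[treat == 0], y[treat == 0], rcond=None)
    ite = X1 @ beta_treated - X1 @ beta_control          # plug-in individual effects
    return {g: float(ite[groups == g].mean()) for g in np.unique(groups)}

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
groups = rng.integers(0, 3, size=1000)                   # e.g., age or income strata
treat = rng.binomial(1, 0.5, size=1000)
y = X @ np.array([1.0, -0.5, 0.2]) + treat * (0.5 + 1.0 * (groups == 2)) + rng.normal(size=1000)
gates = gate_t_learner(X, treat, y, groups)              # group 2 should show a larger effect
```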
This widely\nemployed metric serves to characterize the heterogeneity of a treatment effect\nbased on some population characteristics. In this work, we introduce a novel\nConfounder-Dependent Bayesian Mixture Model (CDBMM) to characterize causal\neffect heterogeneity. More specifically, our method leverages the flexibility\nof the dependent Dirichlet process to model the distribution of the potential\noutcomes conditionally to the covariates and the treatment levels, thus\nenabling us to: (i) identify heterogeneous and mutually exclusive population\ngroups defined by similar GATEs in a data-driven way, and (ii) estimate and\ncharacterize the causal effects within each of the identified groups. Through\nsimulations, we demonstrate the effectiveness of our method in uncovering key\ninsights about treatment effects heterogeneity. We apply our method to claims\ndata from Medicare enrollees in Texas. We found six mutually exclusive groups\nwhere the causal effects of PM2.5 on mortality are heterogeneous."}, "http://arxiv.org/abs/2305.03149": {"title": "A Spectral Method for Identifiable Grade of Membership Analysis with Binary Responses", "link": "http://arxiv.org/abs/2305.03149", "description": "Grade of Membership (GoM) models are popular individual-level mixture models\nfor multivariate categorical data. GoM allows each subject to have mixed\nmemberships in multiple extreme latent profiles. Therefore GoM models have a\nricher modeling capacity than latent class models that restrict each subject to\nbelong to a single profile. The flexibility of GoM comes at the cost of more\nchallenging identifiability and estimation problems. In this work, we propose a\nsingular value decomposition (SVD) based spectral approach to GoM analysis with\nmultivariate binary responses. Our approach hinges on the observation that the\nexpectation of the data matrix has a low-rank decomposition under a GoM model.\nFor identifiability, we develop sufficient and almost necessary conditions for\na notion of expectation identifiability. For estimation, we extract only a few\nleading singular vectors of the observed data matrix, and exploit the simplex\ngeometry of these vectors to estimate the mixed membership scores and other\nparameters. We also establish the consistency of our estimator in the\ndouble-asymptotic regime where both the number of subjects and the number of\nitems grow to infinity. Our spectral method has a huge computational advantage\nover Bayesian or likelihood-based methods and is scalable to large-scale and\nhigh-dimensional data. Extensive simulation studies demonstrate the superior\nefficiency and accuracy of our method. We also illustrate our method by\napplying it to a personality test dataset."}, "http://arxiv.org/abs/2305.05276": {"title": "Causal Discovery from Subsampled Time Series with Proxy Variables", "link": "http://arxiv.org/abs/2305.05276", "description": "Inferring causal structures from time series data is the central interest of\nmany scientific inquiries. A major barrier to such inference is the problem of\nsubsampling, i.e., the frequency of measurement is much lower than that of\ncausal influence. To overcome this problem, numerous methods have been\nproposed, yet either was limited to the linear case or failed to achieve\nidentifiability. In this paper, we propose a constraint-based algorithm that\ncan identify the entire causal structure from subsampled time series, without\nany parametric constraint. 
Our observation is that the challenge of subsampling\narises mainly from hidden variables at the unobserved time steps. Meanwhile,\nevery hidden variable has an observed proxy, which is essentially itself at\nsome observable time in the future, benefiting from the temporal structure.\nBased on these, we can leverage the proxies to remove the bias induced by the\nhidden variables and hence achieve identifiability. Following this intuition,\nwe propose a proxy-based causal discovery algorithm. Our algorithm is\nnonparametric and can achieve full causal identification. Theoretical\nadvantages are reflected in synthetic and real-world experiments."}, "http://arxiv.org/abs/2305.08942": {"title": "Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification", "link": "http://arxiv.org/abs/2305.08942", "description": "Data-driven modeling is useful for reconstructing nonlinear dynamical systems\nwhen the underlying process is unknown or too expensive to compute. Having\nreliable uncertainty assessment of the forecast enables tools to be deployed to\npredict new scenarios unobserved before. In this work, we first extend parallel\npartial Gaussian processes for predicting the vector-valued transition function\nthat links the observations between the current and next time points, and\nquantify the uncertainty of predictions by posterior sampling. Second, we show\nthe equivalence between the dynamic mode decomposition and the maximum\nlikelihood estimator of the linear mapping matrix in the linear state space\nmodel. The connection provides a {probabilistic generative} model of dynamic\nmode decomposition and thus, uncertainty of predictions can be obtained.\nFurthermore, we draw close connections between different data-driven models for\napproximating nonlinear dynamics, through a unified view of generative models.\nWe study two numerical examples, where the inputs of the dynamics are assumed\nto be known in the first example and the inputs are unknown in the second\nexample. The examples indicate that uncertainty of forecast can be properly\nquantified, whereas model or input misspecification can degrade the accuracy of\nuncertainty quantification."}, "http://arxiv.org/abs/2305.16795": {"title": "On Consistent Bayesian Inference from Synthetic Data", "link": "http://arxiv.org/abs/2305.16795", "description": "Generating synthetic data, with or without differential privacy, has\nattracted significant attention as a potential solution to the dilemma between\nmaking data easily available, and the privacy of data subjects. Several works\nhave shown that consistency of downstream analyses from synthetic data,\nincluding accurate uncertainty estimation, requires accounting for the\nsynthetic data generation. There are very few methods of doing so, most of them\nfor frequentist analysis. In this paper, we study how to perform consistent\nBayesian inference from synthetic data. We prove that mixing posterior samples\nobtained separately from multiple large synthetic data sets converges to the\nposterior of the downstream analysis under standard regularity conditions when\nthe analyst's model is compatible with the data provider's model. 
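A toy illustration of the posterior-pooling idea in the entry above, under the stated compatibility assumption and a conjugate normal-mean model chosen purely for convenience; the function and settings below are hypothetical and are not the paper's general procedure.

```python
import numpy as np

def pooled_posterior_from_synthetic(x_real, m_synth=20, synth_size=10_000,
                                    prior_mu=0.0, prior_var=100.0, n_draw=1000, seed=None):
    """Toy normal-mean illustration of mixing posteriors across multiple synthetic data sets.

    A 'provider' with model X ~ N(mu, 1) draws mu from its posterior and releases a large
    synthetic data set; the 'analyst' computes the conjugate posterior on each synthetic
    set and pools the draws. Assumes provider and analyst models are compatible.
    """
    rng = np.random.default_rng(seed)
    n = len(x_real)
    post_var = 1.0 / (1.0 / prior_var + n)                        # provider's posterior (known variance 1)
    post_mean = post_var * (prior_mu / prior_var + np.sum(x_real))
    pooled = []
    for _ in range(m_synth):
        mu_star = rng.normal(post_mean, np.sqrt(post_var))        # provider draws a parameter...
        x_syn = rng.normal(mu_star, 1.0, size=synth_size)         # ...and a large synthetic data set
        v = 1.0 / (1.0 / prior_var + synth_size)                  # analyst's posterior on synthetic data
        m = v * (prior_mu / prior_var + np.sum(x_syn))
        pooled.append(rng.normal(m, np.sqrt(v), size=n_draw))
    return np.concatenate(pooled)

x_real = np.random.default_rng(1).normal(2.0, 1.0, size=200)
draws = pooled_posterior_from_synthetic(x_real, seed=2)           # approximates the posterior on x_real
```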
We also\npresent several examples showing how the theory works in practice, and showing\nhow Bayesian inference can fail when the compatibility assumption is not met,\nor the synthetic data set is not significantly larger than the original."}, "http://arxiv.org/abs/2306.04746": {"title": "Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models", "link": "http://arxiv.org/abs/2306.04746", "description": "In computational social science (CSS), researchers analyze documents to\nexplain social and political phenomena. In most scenarios, CSS researchers\nfirst obtain labels for documents and then explain labels using interpretable\nregression analyses in the second step. One increasingly common way to annotate\ndocuments cheaply at scale is through large language models (LLMs). However,\nlike other scalable ways of producing annotations, such surrogate labels are\noften imperfect and biased. We present a new algorithm for using imperfect\nannotation surrogates for downstream statistical analyses while guaranteeing\nstatistical properties -- like asymptotic unbiasedness and proper uncertainty\nquantification -- which are fundamental to CSS research. We show that direct\nuse of surrogate labels in downstream statistical analyses leads to substantial\nbias and invalid confidence intervals, even with high surrogate accuracy of\n80--90\\%. To address this, we build on debiased machine learning to propose the\ndesign-based supervised learning (DSL) estimator. DSL employs a doubly-robust\nprocedure to combine surrogate labels with a smaller number of high-quality,\ngold-standard labels. Our approach guarantees valid inference for downstream\nstatistical analyses, even when surrogates are arbitrarily biased and without\nrequiring stringent assumptions, by controlling the probability of sampling\ndocuments for gold-standard labeling. Both our theoretical analysis and\nexperimental results show that DSL provides valid statistical inference while\nachieving root mean squared errors comparable to existing alternatives that\nfocus only on prediction without inferential guarantees."}, "http://arxiv.org/abs/2308.13713": {"title": "Causally Sound Priors for Binary Experiments", "link": "http://arxiv.org/abs/2308.13713", "description": "We introduce the BREASE framework for the Bayesian analysis of randomized\ncontrolled trials with a binary treatment and a binary outcome. Approaching the\nproblem from a causal inference perspective, we propose parameterizing the\nlikelihood in terms of the baseline risk, efficacy, and adverse side effects of\nthe treatment, along with a flexible, yet intuitive and tractable jointly\nindependent beta prior distribution on these parameters, which we show to be a\ngeneralization of the Dirichlet prior for the joint distribution of potential\noutcomes. 
Our approach has a number of desirable characteristics when compared\nto current mainstream alternatives: (i) it naturally induces prior dependence\nbetween expected outcomes in the treatment and control groups; (ii) as the\nbaseline risk, efficacy and risk of adverse side effects are quantities\ncommonly present in the clinicians' vocabulary, the hyperparameters of the\nprior are directly interpretable, thus facilitating the elicitation of prior\nknowledge and sensitivity analysis; and (iii) we provide analytical formulae\nfor the marginal likelihood, Bayes factor, and other posterior quantities, as\nwell as exact posterior sampling via simulation, in cases where traditional\nMCMC fails. Empirical examples demonstrate the utility of our methods for\nestimation, hypothesis testing, and sensitivity analysis of treatment effects."}, "http://arxiv.org/abs/2309.01608": {"title": "Supervised dimensionality reduction for multiple imputation by chained equations", "link": "http://arxiv.org/abs/2309.01608", "description": "Multivariate imputation by chained equations (MICE) is one of the most\npopular approaches to address missing values in a data set. This approach\nrequires specifying a univariate imputation model for every variable under\nimputation. The specification of which predictors should be included in these\nunivariate imputation models can be a daunting task. Principal component\nanalysis (PCA) can simplify this process by replacing all of the potential\nimputation model predictors with a few components summarizing their variance.\nIn this article, we extend the use of PCA with MICE to include a supervised\naspect whereby information from the variables under imputation is incorporated\ninto the principal component estimation. We conducted an extensive simulation\nstudy to assess the statistical properties of MICE with different versions of\nsupervised dimensionality reduction and we compared them with the use of\nclassical unsupervised PCA as a simpler dimensionality reduction technique."}, "http://arxiv.org/abs/2311.00118": {"title": "Extracting the Multiscale Causal Backbone of Brain Dynamics", "link": "http://arxiv.org/abs/2311.00118", "description": "The bulk of the research effort on brain connectivity revolves around\nstatistical associations among brain regions, which do not directly relate to\nthe causal mechanisms governing brain dynamics. Here we propose the multiscale\ncausal backbone (MCB) of brain dynamics shared by a set of individuals across\nmultiple temporal scales, and devise a principled methodology to extract it.\n\nOur approach leverages recent advances in multiscale causal structure\nlearning and optimizes the trade-off between the model fitting and its\ncomplexity. Empirical assessment on synthetic data shows the superiority of our\nmethodology over a baseline based on canonical functional connectivity\nnetworks. When applied to resting-state fMRI data, we find sparse MCBs for both\nthe left and right brain hemispheres. Thanks to its multiscale nature, our\napproach shows that at low-frequency bands, causal dynamics are driven by brain\nregions associated with high-level cognitive functions; at higher frequencies\ninstead, nodes related to sensory processing play a crucial role. 
Finally, our\nanalysis of individual multiscale causal structures confirms the existence of a\ncausal fingerprint of brain connectivity, thus supporting from a causal\nperspective the existing extensive research in brain connectivity\nfingerprinting."}, "http://arxiv.org/abs/2311.00122": {"title": "Statistical Network Analysis: Past, Present, and Future", "link": "http://arxiv.org/abs/2311.00122", "description": "This article provides a brief overview of statistical network analysis, a\nrapidly evolving field of statistics, which encompasses statistical models,\nalgorithms, and inferential methods for analyzing data in the form of networks.\nParticular emphasis is given to connecting the historical developments in\nnetwork science to today's statistical network analysis, and outlining\nimportant new areas for future research.\n\nThis invited article is intended as a book chapter for the volume \"Frontiers\nof Statistics and Data Science\" edited by Subhashis Ghoshal and Anindya Roy for\nthe International Indian Statistical Association Series on Statistics and Data\nScience, published by Springer. This review article covers the material from\nthe short course titled \"Statistical Network Analysis: Past, Present, and\nFuture\" taught by the author at the Annual Conference of the International\nIndian Statistical Association, June 6-10, 2023, at Golden, Colorado."}, "http://arxiv.org/abs/2311.00210": {"title": "Broken Adaptive Ridge Method for Variable Selection in Generalized Partly Linear Models with Application to the Coronary Artery Disease Data", "link": "http://arxiv.org/abs/2311.00210", "description": "Motivated by the CATHGEN data, we develop a new statistical learning method\nfor simultaneous variable selection and parameter estimation under the context\nof generalized partly linear models for data with high-dimensional covariates.\nThe method is referred to as the broken adaptive ridge (BAR) estimator, which\nis an approximation of the $L_0$-penalized regression by iteratively performing\nreweighted squared $L_2$-penalized regression. The generalized partly linear\nmodel extends the generalized linear model by including a non-parametric\ncomponent to construct a flexible model for modeling various types of covariate\neffects. We employ the Bernstein polynomials as the sieve space to approximate\nthe non-parametric functions so that our method can be implemented easily using\nthe existing R packages. Extensive simulation studies suggest that the proposed\nmethod performs better than other commonly used penalty-based variable\nselection methods. We apply the method to the CATHGEN data with a binary\nresponse from a coronary artery disease study, which motivated our research,\nand obtained new findings in both high-dimensional genetic and low-dimensional\nnon-genetic covariates."}, "http://arxiv.org/abs/2311.00294": {"title": "Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence", "link": "http://arxiv.org/abs/2311.00294", "description": "To address the difficult problem of multi-step ahead prediction of\nnon-parametric autoregressions, we consider a forward bootstrap approach.\nEmploying a local constant estimator, we can analyze a general type of\nnon-parametric time series model, and show that the proposed point predictions\nare consistent with the true optimal predictor. We construct a quantile\nprediction interval that is asymptotically valid. 
Moreover, using a debiasing\ntechnique, we can asymptotically approximate the distribution of multi-step\nahead non-parametric estimation by bootstrap. As a result, we can build\nbootstrap prediction intervals that are pertinent, i.e., can capture the model\nestimation variability, thus improving upon the standard quantile prediction\nintervals. Simulation studies are given to illustrate the performance of our\npoint predictions and pertinent prediction intervals for finite samples."}, "http://arxiv.org/abs/2311.00528": {"title": "On the Comparative Analysis of Average Treatment Effects Estimation via Data Combination", "link": "http://arxiv.org/abs/2311.00528", "description": "There is growing interest in exploring causal effects in target populations\nby combining multiple datasets. Nevertheless, most approaches are tailored to\nspecific settings and lack comprehensive comparative analyses across different\nsettings. In this article, within the typical scenario of a source dataset and\na target dataset, we establish a unified framework for comparing various\nsettings in causal inference via data combination. We first design six distinct\nsettings, each with different available datasets and identifiability\nassumptions. The six settings cover a wide range of scenarios in the existing\nliterature. We then conduct a comprehensive efficiency comparative analysis\nacross these settings by calculating and comparing the semiparametric\nefficiency bounds for the average treatment effect (ATE) in the target\npopulation. Our findings reveal the key factors contributing to efficiency\ngains or losses across these settings. In addition, we extend our analysis to\nother estimands, including ATE in the source population and the average\ntreatment effect on treated (ATT) in both the source and target populations.\nFurthermore, we empirically validate our findings by constructing locally\nefficient estimators and conducting extensive simulation studies. We\ndemonstrate the proposed approaches using a real application to a MIMIC-III\ndataset as the target population and an eICU dataset as the source population."}, "http://arxiv.org/abs/2311.00541": {"title": "An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek", "link": "http://arxiv.org/abs/2311.00541", "description": "Word meanings change over time, and word senses evolve, emerge or die out in\nthe process. For ancient languages, where the corpora are often small, sparse\nand noisy, modelling such changes accurately proves challenging, and\nquantifying uncertainty in sense-change estimates consequently becomes\nimportant. GASC and DiSC are existing generative models that have been used to\nanalyse sense change for target words from an ancient Greek text corpus, using\nunsupervised learning without the help of any pre-training. These models\nrepresent the senses of a given target word such as \"kosmos\" (meaning\ndecoration, order or world) as distributions over context words, and sense\nprevalence as a distribution over senses. The models are fitted using MCMC\nmethods to measure temporal changes in these representations. In this paper, we\nintroduce EDiSC, an embedded version of DiSC, which combines word embeddings\nwith DiSC to provide superior model performance. We show empirically that EDiSC\noffers improved predictive accuracy, ground-truth recovery and uncertainty\nquantification, as well as better sampling efficiency and scalability\nproperties with MCMC methods. 
We also discuss the challenges of fitting these\nmodels."}, "http://arxiv.org/abs/2311.00553": {"title": "Polynomial Chaos Surrogate Construction for Random Fields with Parametric Uncertainty", "link": "http://arxiv.org/abs/2311.00553", "description": "Engineering and applied science rely on computational experiments to\nrigorously study physical systems. The mathematical models used to probe these\nsystems are highly complex, and sampling-intensive studies often require\nprohibitively many simulations for acceptable accuracy. Surrogate models\nprovide a means of circumventing the high computational expense of sampling\nsuch complex models. In particular, polynomial chaos expansions (PCEs) have\nbeen successfully used for uncertainty quantification studies of deterministic\nmodels where the dominant source of uncertainty is parametric. We discuss an\nextension to conventional PCE surrogate modeling to enable surrogate\nconstruction for stochastic computational models that have intrinsic noise in\naddition to parametric uncertainty. We develop a PCE surrogate on a joint space\nof intrinsic and parametric uncertainty, enabled by Rosenblatt transformations,\nand then extend the construction to random field data via the Karhunen-Loeve\nexpansion. We then take advantage of closed-form solutions for computing PCE\nSobol indices to perform a global sensitivity analysis of the model which\nquantifies the intrinsic noise contribution to the overall model output\nvariance. Additionally, the resulting joint PCE is generative in the sense that\nit allows generating random realizations at any input parameter setting that\nare statistically approximately equivalent to realizations from the underlying\nstochastic model. The method is demonstrated on a chemical catalysis example\nmodel."}, "http://arxiv.org/abs/2311.00568": {"title": "Scalable kernel balancing weights in a nationwide observational study of hospital profit status and heart attack outcomes", "link": "http://arxiv.org/abs/2311.00568", "description": "Weighting is a general and often-used method for statistical adjustment.\nWeighting has two objectives: first, to balance covariate distributions, and\nsecond, to ensure that the weights have minimal dispersion and thus produce a\nmore stable estimator. A recent, increasingly common approach directly\noptimizes the weights toward these two objectives. However, this approach has\nnot yet been feasible in large-scale datasets when investigators wish to\nflexibly balance general basis functions in an extended feature space. For\nexample, many balancing approaches cannot scale to national-level health\nservices research studies. To address this practical problem, we describe a\nscalable and flexible approach to weighting that integrates a basis expansion\nin a reproducing kernel Hilbert space with state-of-the-art convex optimization\ntechniques. Specifically, we use the rank-restricted Nystr\\\"{o}m method to\nefficiently compute a kernel basis for balancing in {nearly} linear time and\nspace, and then use the specialized first-order alternating direction method of\nmultipliers to rapidly find the optimal weights. In an extensive simulation\nstudy, we provide new insights into the performance of weighting estimators in\nlarge datasets, showing that the proposed approach substantially outperforms\nothers in terms of accuracy and speed. 
Finally, we use this weighting approach\nto conduct a national study of the relationship between hospital profit status\nand heart attack outcomes in a comprehensive dataset of 1.27 million patients.\nWe find that for-profit hospitals use interventional cardiology to treat heart\nattacks at similar rates as other hospitals, but have higher mortality and\nreadmission rates."}, "http://arxiv.org/abs/2311.00596": {"title": "Evaluating Binary Outcome Classifiers Estimated from Survey Data", "link": "http://arxiv.org/abs/2311.00596", "description": "Surveys are commonly used to facilitate research in epidemiology, health, and\nthe social and behavioral sciences. Often, these surveys are not simple random\nsamples, and respondents are given weights reflecting their probability of\nselection into the survey. It is well known that analysts can use these survey\nweights to produce unbiased estimates of population quantities like totals. In\nthis article, we show that survey weights also can be beneficial for evaluating\nthe quality of predictive models when splitting data into training and test\nsets. In particular, we characterize model assessment statistics, such as\nsensitivity and specificity, as finite population quantities, and compute\nsurvey-weighted estimates of these quantities with sample test data comprising\na random subset of the original data. Using simulations with data from the\nNational Survey on Drug Use and Health and the National Comorbidity Survey, we\nshow that unweighted metrics estimated with sample test data can misrepresent\npopulation performance, but weighted metrics appropriately adjust for the\ncomplex sampling design. We also show that this conclusion holds for models\ntrained using upsampling for mitigating class imbalance. The results suggest\nthat weighted metrics should be used when evaluating performance on sample test\ndata."}, "http://arxiv.org/abs/2206.10866": {"title": "Nearest Neighbor Classification based on Imbalanced Data: A Statistical Approach", "link": "http://arxiv.org/abs/2206.10866", "description": "When the competing classes in a classification problem are not of comparable\nsize, many popular classifiers exhibit a bias towards larger classes, and the\nnearest neighbor classifier is no exception. To take care of this problem, we\ndevelop a statistical method for nearest neighbor classification based on such\nimbalanced data sets. First, we construct a classifier for the binary\nclassification problem and then extend it for classification problems involving\nmore than two classes. Unlike the existing oversampling or undersampling\nmethods, our proposed classifiers do not need to generate any pseudo\nobservations or remove any existing observations, hence the results are exactly\nreproducible. We establish the Bayes risk consistency of these classifiers\nunder appropriate regularity conditions. Their superior performance over the\nexisting methods is amply demonstrated by analyzing several benchmark data\nsets."}, "http://arxiv.org/abs/2209.08892": {"title": "High-dimensional data segmentation in regression settings permitting temporal dependence and non-Gaussianity", "link": "http://arxiv.org/abs/2209.08892", "description": "We propose a data segmentation methodology for the high-dimensional linear\nregression problem where regression parameters are allowed to undergo multiple\nchanges. 
The proposed methodology, MOSEG, proceeds in two stages: first, the\ndata are scanned for multiple change points using a moving window-based\nprocedure, which is followed by a location refinement stage. MOSEG enjoys\ncomputational efficiency thanks to the adoption of a coarse grid in the first\nstage, and achieves theoretical consistency in estimating both the total number\nand the locations of the change points, under general conditions permitting\nserial dependence and non-Gaussianity. We also propose MOSEG.MS, a multiscale\nextension of MOSEG which, while comparable to MOSEG in terms of computational\ncomplexity, achieves theoretical consistency for a broader parameter space\nwhere large parameter shifts over short intervals and small changes over long\nstretches of stationarity are simultaneously allowed. We demonstrate good\nperformance of the proposed methods in comparative simulation studies and in an\napplication to predicting the equity premium."}, "http://arxiv.org/abs/2210.02341": {"title": "A Distributed Block-Split Gibbs Sampler with Hypergraph Structure for High-Dimensional Inverse Problems", "link": "http://arxiv.org/abs/2210.02341", "description": "Sampling-based algorithms are classical approaches to perform Bayesian\ninference in inverse problems. They provide estimators with the associated\ncredibility intervals to quantify the uncertainty on the estimators. Although\nthese methods hardly scale to high dimensional problems, they have recently\nbeen paired with optimization techniques, such as proximal and splitting\napproaches, to address this issue. Such approaches pave the way to distributed\nsamplers, splitting computations to make inference more scalable and faster. We\nintroduce a distributed Split Gibbs sampler (SGS) to efficiently solve such\nproblems involving distributions with multiple smooth and non-smooth functions\ncomposed with linear operators. The proposed approach leverages a recent\napproximate augmentation technique reminiscent of primal-dual optimization\nmethods. It is further combined with a block-coordinate approach to split the\nprimal and dual variables into blocks, leading to a distributed\nblock-coordinate SGS. The resulting algorithm exploits the hypergraph structure\nof the involved linear operators to efficiently distribute the variables over\nmultiple workers under controlled communication costs. It accommodates several\ndistributed architectures, such as the Single Program Multiple Data and\nclient-server architectures. Experiments on a large image deblurring problem\nshow the performance of the proposed approach to produce high quality estimates\nwith credibility intervals in a small amount of time. Supplementary material to\nreproduce the experiments is available online."}, "http://arxiv.org/abs/2210.14086": {"title": "A Global Wavelet Based Bootstrapped Test of Covariance Stationarity", "link": "http://arxiv.org/abs/2210.14086", "description": "We propose a covariance stationarity test for an otherwise dependent and\npossibly globally non-stationary time series. We work in a generalized version\nof the new setting in Jin, Wang and Wang (2015), who exploit Walsh (1923)\nfunctions in order to compare sub-sample covariances with the full sample\ncounterpart. They impose strict stationarity under the null, only consider\nlinear processes under either hypothesis in order to achieve a parametric\nestimator for an inverted high dimensional asymptotic covariance matrix, and do\nnot consider any other orthonormal basis. 
Conversely, we work with a general\northonormal basis under mild conditions that include Haar wavelet and Walsh\nfunctions; and we allow for linear or nonlinear processes with possibly non-iid\ninnovations. This is important in macroeconomics and finance where nonlinear\nfeedback and random volatility occur in many settings. We completely sidestep\nasymptotic covariance matrix estimation and inversion by bootstrapping a\nmax-correlation difference statistic, where the maximum is taken over the\ncorrelation lag $h$ and basis generated sub-sample counter $k$ (the number of\nsystematic samples). We achieve a higher feasible rate of increase for the\nmaximum lag and counter $\\mathcal{H}_{T}$ and $\\mathcal{K}_{T}$. Of particular\nnote, our test is capable of detecting breaks in variance, and distant, or very\nmild, deviations from stationarity."}, "http://arxiv.org/abs/2211.03031": {"title": "A framework for leveraging machine learning tools to estimate personalized survival curves", "link": "http://arxiv.org/abs/2211.03031", "description": "The conditional survival function of a time-to-event outcome subject to\ncensoring and truncation is a common target of estimation in survival analysis.\nThis parameter may be of scientific interest and also often appears as a\nnuisance in nonparametric and semiparametric problems. In addition to classical\nparametric and semiparametric methods (e.g., based on the Cox proportional\nhazards model), flexible machine learning approaches have been developed to\nestimate the conditional survival function. However, many of these methods are\neither implicitly or explicitly targeted toward risk stratification rather than\noverall survival function estimation. Others apply only to discrete-time\nsettings or require inverse probability of censoring weights, which can be as\ndifficult to estimate as the outcome survival function itself. Here, we employ\na decomposition of the conditional survival function in terms of observable\nregression models in which censoring and truncation play no role. This allows\napplication of an array of flexible regression and classification methods\nrather than only approaches that explicitly handle the complexities inherent to\nsurvival data. We outline estimation procedures based on this decomposition,\nempirically assess their performance, and demonstrate their use on data from an\nHIV vaccine trial."}, "http://arxiv.org/abs/2301.12389": {"title": "On Learning Necessary and Sufficient Causal Graphs", "link": "http://arxiv.org/abs/2301.12389", "description": "The causal revolution has stimulated interest in understanding complex\nrelationships in various fields. Most of the existing methods aim to discover\ncausal relationships among all variables within a complex large-scale graph.\nHowever, in practice, only a small subset of variables in the graph are\nrelevant to the outcomes of interest. Consequently, causal estimation with the\nfull causal graph -- particularly given limited data -- could lead to numerous\nfalsely discovered, spurious variables that exhibit high correlation with, but\nexert no causal impact on, the target outcome. In this paper, we propose\nlearning a class of necessary and sufficient causal graphs (NSCG) that\nexclusively comprises causally relevant variables for an outcome of interest,\nwhich we term causal features. 
The key idea is to employ probabilities of\ncausation to systematically evaluate the importance of features in the causal\ngraph, allowing us to identify a subgraph relevant to the outcome of interest.\nTo learn NSCG from data, we develop a necessary and sufficient causal\nstructural learning (NSCSL) algorithm, by establishing theoretical properties\nand relationships between probabilities of causation and natural causal effects\nof features. Across empirical studies of simulated and real data, we\ndemonstrate that NSCSL outperforms existing algorithms and can reveal crucial\nyeast genes for target heritable traits of interest."}, "http://arxiv.org/abs/2303.18211": {"title": "A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models", "link": "http://arxiv.org/abs/2303.18211", "description": "Additive Noise Models (ANMs) are a common model class for causal discovery\nfrom observational data and are often used to generate synthetic data for\ncausal discovery benchmarking. Specifying an ANM requires choosing all\nparameters, including those not fixed by explicit assumptions. Reisach et al.\n(2021) show that sorting variables by increasing variance often yields an\nordering close to a causal order and introduce var-sortability to quantify this\nalignment. Since increasing variances may be unrealistic and are\nscale-dependent, ANM data are often standardized in benchmarks.\n\nWe show that synthetic ANM data are characterized by another pattern that is\nscale-invariant: the explainable fraction of a variable's variance, as captured\nby the coefficient of determination $R^2$, tends to increase along the causal\norder. The result is high $R^2$-sortability, meaning that sorting the variables\nby increasing $R^2$ yields an ordering close to a causal order. We propose an\nefficient baseline algorithm termed $R^2$-SortnRegress that exploits high\n$R^2$-sortability and that can match and exceed the performance of established\ncausal discovery algorithms. We show analytically that sufficiently high edge\nweights lead to a relative decrease of the noise contributions along causal\nchains, resulting in increasingly deterministic relationships and high $R^2$.\nWe characterize $R^2$-sortability for different simulation parameters and find\nhigh values in common settings. Our findings reveal high $R^2$-sortability as\nan assumption about the data generating process relevant to causal discovery\nand implicit in many ANM sampling schemes. It should be made explicit, as its\nprevalence in real-world data is unknown. For causal discovery benchmarking, we\nimplement $R^2$-sortability, the $R^2$-SortnRegress algorithm, and ANM\nsimulation procedures in our library CausalDisco at\nhttps://causaldisco.github.io/CausalDisco/."}, "http://arxiv.org/abs/2304.11491": {"title": "Bayesian Boundary Trend Filtering", "link": "http://arxiv.org/abs/2304.11491", "description": "Estimating boundary curves has many applications such as economics, climate\nscience, and medicine. Bayesian trend filtering has been developed as one of\nlocally adaptive smoothing methods to estimate the non-stationary trend of\ndata. This paper develops a Bayesian trend filtering for estimating boundary\ntrend. To this end, the truncated multivariate normal working likelihood and\nglobal-local shrinkage priors based on scale mixtures of normal distribution\nare introduced. In particular, well-known horseshoe prior for difference leads\nto locally adaptive shrinkage estimation for boundary trend. 
However, the full\nconditional distributions of the Gibbs sampler involve high-dimensional\ntruncated multivariate normal distribution. To overcome the difficulty of\nsampling, an approximation of truncated multivariate normal distribution is\nemployed. Using the approximation, the proposed models lead to an efficient\nGibbs sampling algorithm via P\\'olya-Gamma data augmentation. The proposed\nmethod is also extended by considering nearly isotonic constraint. The\nperformance of the proposed method is illustrated through some numerical\nexperiments and real data examples."}, "http://arxiv.org/abs/2304.13237": {"title": "An Efficient Doubly-Robust Test for the Kernel Treatment Effect", "link": "http://arxiv.org/abs/2304.13237", "description": "The average treatment effect, which is the difference in expectation of the\ncounterfactuals, is probably the most popular target effect in causal inference\nwith binary treatments. However, treatments may have effects beyond the mean,\nfor instance decreasing or increasing the variance. We propose a new\nkernel-based test for distributional effects of the treatment. It is, to the\nbest of our knowledge, the first kernel-based, doubly-robust test with provably\nvalid type-I error. Furthermore, our proposed algorithm is computationally\nefficient, avoiding the use of permutations."}, "http://arxiv.org/abs/2305.07981": {"title": "Inferring Stochastic Group Interactions within Structured Populations via Coupled Autoregression", "link": "http://arxiv.org/abs/2305.07981", "description": "The internal behaviour of a population is an important feature to take\naccount of when modelling their dynamics. In line with kin selection theory,\nmany social species tend to cluster into distinct groups in order to enhance\ntheir overall population fitness. Temporal interactions between populations are\noften modelled using classical mathematical models, but these sometimes fail to\ndelve deeper into the, often uncertain, relationships within populations. Here,\nwe introduce a stochastic framework that aims to capture the interactions of\nanimal groups and an auxiliary population over time. We demonstrate the model's\ncapabilities, from a Bayesian perspective, through simulation studies and by\nfitting it to predator-prey count time series data. We then derive an\napproximation to the group correlation structure within such a population,\nwhile also taking account of the effect of the auxiliary population. We finally\ndiscuss how this approximation can lead to ecologically realistic\ninterpretations in a predator-prey context. This approximation also serves as\nverification to whether the population in question satisfies our various\nassumptions. Our modelling approach will be useful for empiricists for\nmonitoring groups within a conservation framework and also theoreticians\nwanting to quantify interactions, to study cooperation and other phenomena\nwithin social populations."}, "http://arxiv.org/abs/2306.09520": {"title": "Ensembled Prediction Intervals for Causal Outcomes Under Hidden Confounding", "link": "http://arxiv.org/abs/2306.09520", "description": "Causal inference of exact individual treatment outcomes in the presence of\nhidden confounders is rarely possible. Recent work has extended prediction\nintervals with finite-sample guarantees to partially identifiable causal\noutcomes, by means of a sensitivity model for hidden confounding. In deep\nlearning, predictors can exploit their inductive biases for better\ngeneralization out of sample. 
We argue that the structure inherent to a deep\nensemble should inform a tighter partial identification of the causal outcomes\nthat they predict. We therefore introduce an approach termed Caus-Modens, for\ncharacterizing causal outcome intervals by modulated ensembles. We present a\nsimple approach to partial identification using existing causal sensitivity\nmodels and show empirically that Caus-Modens gives tighter outcome intervals,\nas measured by the necessary interval size to achieve sufficient coverage. The\nlast of our three diverse benchmarks is a novel usage of GPT-4 for\nobservational experiments with unknown but probeable ground truth."}, "http://arxiv.org/abs/2306.16838": {"title": "Solving Kernel Ridge Regression with Gradient-Based Optimization Methods", "link": "http://arxiv.org/abs/2306.16838", "description": "Kernel ridge regression, KRR, is a generalization of linear ridge regression\nthat is non-linear in the data, but linear in the parameters. Here, we\nintroduce an equivalent formulation of the objective function of KRR, opening\nup both for using penalties other than the ridge penalty and for studying\nkernel ridge regression from the perspective of gradient descent. Using a\ncontinuous-time perspective, we derive a closed-form solution for solving\nkernel regression with gradient descent, something we refer to as kernel\ngradient flow, KGF, and theoretically bound the differences between KRR and\nKGF, where, for the latter, regularization is obtained through early stopping.\nWe also generalize KRR by replacing the ridge penalty with the $\\ell_1$ and\n$\\ell_\\infty$ penalties, respectively, and use the fact that analogous to the\nsimilarities between KGF and KRR, $\\ell_1$ regularization and forward stagewise\nregression (also known as coordinate descent), and $\\ell_\\infty$ regularization\nand sign gradient descent, follow similar solution paths. We can thus alleviate\nthe need for computationally heavy algorithms based on proximal gradient\ndescent. We show theoretically and empirically how the $\\ell_1$ and\n$\\ell_\\infty$ penalties, and the corresponding gradient-based optimization\nalgorithms, produce sparse and robust kernel regression solutions,\nrespectively."}, "http://arxiv.org/abs/2311.00820": {"title": "Bayesian inference for generalized linear models via quasi-posteriors", "link": "http://arxiv.org/abs/2311.00820", "description": "Generalized linear models (GLMs) are routinely used for modeling\nrelationships between a response variable and a set of covariates. The simple\nform of a GLM comes with easy interpretability, but also leads to concerns\nabout model misspecification impacting inferential conclusions. A popular\nsemi-parametric solution adopted in the frequentist literature is\nquasi-likelihood, which improves robustness by only requiring correct\nspecification of the first two moments. We develop a robust approach to\nBayesian inference in GLMs through quasi-posterior distributions. We show that\nquasi-posteriors provide a coherent generalized Bayes inference method, while\nalso approximating so-called coarsened posteriors. In so doing, we obtain new\ninsights into the choice of coarsening parameter. Asymptotically, the\nquasi-posterior converges in total variation to a normal distribution and has\nimportant connections with the loss-likelihood bootstrap posterior. 
We\ndemonstrate that it is also well-calibrated in terms of frequentist coverage.\nMoreover, the loss-scale parameter has a clear interpretation as a dispersion,\nand this leads to a consolidated method of moments estimator."}, "http://arxiv.org/abs/2311.00878": {"title": "Backward Joint Model for Dynamic Prediction using Multivariate Longitudinal and Competing Risk Data", "link": "http://arxiv.org/abs/2311.00878", "description": "Joint modeling is a useful approach to dynamic prediction of clinical\noutcomes using longitudinally measured predictors. When the outcomes are\ncompeting risk events, fitting the conventional shared random effects joint\nmodel often involves intensive computation, especially when multiple\nlongitudinal biomarkers are used as predictors, as is often desired in\nprediction problems. Motivated by a longitudinal cohort study of chronic kidney\ndisease, this paper proposes a new joint model for the dynamic prediction of\nend-stage renal disease with the competing risk of death. The model factorizes\nthe likelihood into the distribution of the competing risks data and the\ndistribution of longitudinal data given the competing risks data. The\nestimation with the EM algorithm is efficient, stable and fast, with a\none-dimensional integral in the E-step and convex optimization for most\nparameters in the M-step, regardless of the number of longitudinal predictors.\nThe model also comes with a consistent albeit less efficient estimation method\nthat can be quickly implemented with standard software, ideal for model\nbuilding and diagnostics. This model enables the prediction of future\nlongitudinal data trajectories conditional on being at risk at a future time, a\npractically significant problem that has not been studied in the statistical\nliterature. We study the properties of the proposed method using simulations\nand a real dataset and compare its performance with the shared random effects\njoint model."}, "http://arxiv.org/abs/2311.00885": {"title": "Controlling the number of significant effects in multiple testing", "link": "http://arxiv.org/abs/2311.00885", "description": "In multiple testing several criteria to control for type I errors exist. The\nfalse discovery rate, which evaluates the expected proportion of false\ndiscoveries among the rejected null hypotheses, has become the standard\napproach in this setting. However, false discovery rate control may be too\nconservative when the effects are weak. In this paper we alternatively propose\nto control the number of significant effects, where 'significant' refers to a\npre-specified threshold $\\gamma$. This means that a $(1-\\alpha)$-lower\nconfidence bound $L$ for the number of non-true null hypotheses with p-values\nbelow $\\gamma$ is provided. When one rejects the nulls corresponding to the $L$\nsmallest p-values, the probability that the number of false positives exceeds\nthe number of false negatives among the significant effects is bounded by\n$\\alpha$. Relative merits of the proposed criterion are discussed. Procedures\nto control for the number of significant effects in practice are introduced and\ninvestigated both theoretically and through simulations. 
Illustrative real data\napplications are given."}, "http://arxiv.org/abs/2311.00923": {"title": "A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations", "link": "http://arxiv.org/abs/2311.00923", "description": "The fusion of causal models with deep learning, which introduces increasingly\nintricate data sets such as the causal associations within images or between\ntextual components, has surfaced as a focal research area. Nonetheless, the\nbroadening of original causal concepts and theories to such complex,\nnon-statistical data has been met with serious challenges. In response, our\nstudy proposes redefinitions of causal data into three distinct categories from\nthe standpoint of causal structure and representation: definite data,\nsemi-definite data, and indefinite data. Definite data chiefly pertains to\nstatistical data used in conventional causal scenarios, while semi-definite\ndata refers to a spectrum of data formats germane to deep learning, including\ntime-series, images, text, and others. Indefinite data is an emergent research\nsphere that we infer from the progression of data forms. To comprehensively\npresent these three data paradigms, we elaborate on their formal definitions,\ndifferences manifested in datasets, resolution pathways, and development of\nresearch. We summarize key tasks and achievements pertaining to definite and\nsemi-definite data from myriad research undertakings, and present a roadmap for\nindefinite data, beginning with its current research conundrums. Lastly, we\nclassify and scrutinize the key datasets presently utilized within these three\nparadigms."}, "http://arxiv.org/abs/2311.00927": {"title": "Scalable Counterfactual Distribution Estimation in Multivariate Causal Models", "link": "http://arxiv.org/abs/2311.00927", "description": "We consider the problem of estimating the counterfactual joint distribution\nof multiple quantities of interest (e.g., outcomes) in a multivariate causal\nmodel extended from the classical difference-in-difference design. Existing\nmethods for this task either ignore the correlation structures among dimensions\nof the multivariate outcome by considering univariate causal models on each\ndimension separately and hence produce incorrect counterfactual distributions,\nor poorly scale even for moderate-size datasets when directly dealing with such a\nmultivariate causal model. We propose a method that alleviates both issues\nsimultaneously by leveraging a robust latent one-dimensional subspace of the\noriginal high-dimensional space and exploiting the efficient estimation from the\nunivariate causal model on such space. Since the construction of the\none-dimensional subspace uses information from all the dimensions, our method\ncan capture the correlation structures and produce good estimates of the\ncounterfactual distribution. We demonstrate the advantages of our approach over\nexisting methods on both synthetic and real-world data."}, "http://arxiv.org/abs/2311.01021": {"title": "ABC-based Forecasting in State Space Models", "link": "http://arxiv.org/abs/2311.01021", "description": "Approximate Bayesian Computation (ABC) has gained popularity as a method for\nconducting inference and forecasting in complex models, most notably those\nwhich are intractable in some sense. In this paper we use ABC to produce\nprobabilistic forecasts in state space models (SSMs). 
Whilst ABC-based\nforecasting in correctly-specified SSMs has been studied, the misspecified case\nhas not been investigated, and it is that case which we emphasize. We invoke\nrecent principles of 'focused' Bayesian prediction, whereby Bayesian updates\nare driven by a scoring rule that rewards predictive accuracy; the aim being to\nproduce predictives that perform well in that rule, despite misspecification.\nTwo methods are investigated for producing the focused predictions. In a\nsimulation setting, 'coherent' predictions are in evidence for both methods:\nthe predictive constructed via the use of a particular scoring rule predicts\nbest according to that rule. Importantly, both focused methods typically\nproduce more accurate forecasts than an exact, but misspecified, predictive. An\nempirical application to a truly intractable SSM completes the paper."}, "http://arxiv.org/abs/2311.01147": {"title": "Variational Inference for Sparse Poisson Regression", "link": "http://arxiv.org/abs/2311.01147", "description": "We utilize the non-conjugate VB method for the sparse\nPoisson regression model. To provide approximate conjugacy in the model,\nthe likelihood is approximated by a quadratic function, which provides\nconjugacy of the approximation with the Gaussian prior on the\nregression coefficients. Three sparsity-enforcing priors are used for this\nproblem. The proposed models are compared with each other and with two frequentist\nsparse Poisson methods (LASSO and SCAD) to evaluate the prediction performance\nas well as the sparsity performance of the proposed methods. Through a\nsimulated data example, the accuracy of the VB methods is assessed relative to\nthe corresponding benchmark MCMC methods. The proposed\nVB methods provide a good approximation to the posterior distribution of\nthe parameters, while being much faster than the MCMC ones. Using\nseveral benchmark count response data sets, the prediction performance of the\nproposed methods is evaluated in real-world applications."}, "http://arxiv.org/abs/2311.01287": {"title": "Semiparametric Latent ANOVA Model for Event-Related Potentials", "link": "http://arxiv.org/abs/2311.01287", "description": "Event-related potentials (ERPs) extracted from electroencephalography (EEG)\ndata in response to stimuli are widely used in psychological and neuroscience\nexperiments. A major goal is to link ERP characteristic components to\nsubject-level covariates. Existing methods typically follow two-step\napproaches, first identifying ERP components using peak detection methods and\nthen relating them to the covariates. This approach, however, can lead to loss\nof efficiency due to inaccurate estimates in the initial step, especially\nconsidering the low signal-to-noise ratio of EEG data. To address this\nchallenge, we propose a semiparametric latent ANOVA model (SLAM) that unifies\ninference on ERP components and their association with covariates. SLAM models\nERP waveforms via a structured Gaussian process prior that encodes ERP latency\nin its derivative and links the subject-level latencies to covariates using a\nlatent ANOVA. This unified Bayesian framework provides estimation at both\nthe population and subject levels, improving the efficiency of the inference by\nleveraging information across subjects. 
We automate posterior inference and\nhyperparameter tuning using a Monte Carlo expectation-maximization algorithm.\nWe demonstrate the advantages of SLAM over competing methods via simulations.\nOur method allows us to examine how factors or covariates affect the magnitude\nand/or latency of ERP components, which in turn reflect cognitive,\npsychological or neural processes. We exemplify this via an application to data\nfrom an ERP experiment on speech recognition, where we assess the effect of age\non two components of interest. Our results verify the scientific findings that\nolder people take a longer reaction time to respond to external stimuli because\nof the delay in perception and brain processes."}, "http://arxiv.org/abs/2311.01297": {"title": "Bias correction in multiple-systems estimation", "link": "http://arxiv.org/abs/2311.01297", "description": "If part of a population is hidden but two or more sources are available that\neach cover parts of this population, dual- or multiple-system(s) estimation can\nbe applied to estimate this population. For this it is common to use the\nlog-linear model, estimated with maximum likelihood. These maximum likelihood\nestimates are based on a non-linear model and therefore suffer from\nfinite-sample bias, which can be substantial in case of small samples or a\nsmall population size. This problem was recognised by Chapman, who derived an\nestimator with good small sample properties in case of two available sources.\nHowever, he did not derive an estimator for more than two sources. We propose\nan estimator that is an extension of Chapman's estimator to three or more\nsources and compare this estimator with other bias-reduced estimators in a\nsimulation study. The proposed estimator performs well, and much better than\nthe other estimators. A real data example on homelessness in the Netherlands\nshows that our proposed model can make a substantial difference."}, "http://arxiv.org/abs/2311.01301": {"title": "TRIALSCOPE A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models", "link": "http://arxiv.org/abs/2311.01301", "description": "The rapid digitization of real-world data offers an unprecedented opportunity\nfor optimizing healthcare delivery and accelerating biomedical discovery. In\npractice, however, such data is most abundantly available in unstructured\nforms, such as clinical notes in electronic medical records (EMRs), and it is\ngenerally plagued by confounders. In this paper, we present TRIALSCOPE, a\nunifying framework for distilling real-world evidence from population-level\nobservational data. TRIALSCOPE leverages biomedical language models to\nstructure clinical text at scale, employs advanced probabilistic modeling for\ndenoising and imputation, and incorporates state-of-the-art causal inference\ntechniques to combat common confounders. Using clinical trial specification as\ngeneric representation, TRIALSCOPE provides a turn-key solution to generate and\nreason with clinical hypotheses using observational data. In extensive\nexperiments and analyses on a large-scale real-world dataset with over one\nmillion cancer patients from a large US healthcare network, we show that\nTRIALSCOPE can produce high-quality structuring of real-world data and\ngenerates comparable results to marquee cancer trials. 
In addition to\nfacilitating in silico clinical trial design and optimization, TRIALSCOPE may\nbe used to empower synthetic controls, pragmatic trials, post-market\nsurveillance, as well as support fine-grained patient-like-me reasoning in\nprecision diagnosis and treatment."}, "http://arxiv.org/abs/2311.01303": {"title": "Local differential privacy in survival analysis using private failure indicators", "link": "http://arxiv.org/abs/2311.01303", "description": "We consider the estimation of the cumulative hazard function, and\nequivalently the distribution function, with censored data under a setup that\npreserves the privacy of the survival database. This is done through an\n$\\alpha$-locally differentially private mechanism for the failure indicators\nand by proposing a non-parametric kernel estimator for the cumulative hazard\nfunction that remains consistent under the privatization. Under mild\nconditions, we also prove lower bounds for the minimax rates of convergence\nand show that the estimator is minimax optimal under a well-chosen bandwidth."}, "http://arxiv.org/abs/2311.01341": {"title": "Composite Dyadic Models for Spatio-Temporal Data", "link": "http://arxiv.org/abs/2311.01341", "description": "Mechanistic statistical models are commonly used to study the flow of\nbiological processes. For example, in landscape genetics, the aim is to infer\nmechanisms that govern gene flow in populations. Existing statistical\napproaches in landscape genetics do not account for temporal dependence in the\ndata and may be computationally prohibitive. We infer mechanisms with a\nBayesian hierarchical dyadic model that scales well with large data sets and\nthat accounts for spatial and temporal dependence. We construct a\nfully-connected network comprising spatio-temporal data for the dyadic model\nand use normalized composite likelihoods to account for the dependence\nstructure in space and time. Our motivation for developing a dyadic model was\nto account for physical mechanisms commonly found in physical-statistical\nmodels. However, a numerical solver is not required in our approach because we\nmodel first-order changes directly. We apply our methods to ancient human DNA\ndata to infer the mechanisms that affected human movement in Bronze Age Europe."}, "http://arxiv.org/abs/2311.01412": {"title": "Castor: Causal Temporal Regime Structure Learning", "link": "http://arxiv.org/abs/2311.01412", "description": "The task of uncovering causal relationships among multivariate time series\ndata stands as an essential and challenging objective that cuts across a broad\narray of disciplines ranging from climate science to healthcare. Such data\nentail linear or non-linear relationships, and usually follow multiple a\npriori unknown regimes. Existing causal discovery methods can infer summary\ncausal graphs from heterogeneous data with known regimes, but they fall short\nin comprehensively learning both regimes and the corresponding causal graph. In\nthis paper, we introduce CASTOR, a novel framework designed to learn causal\nrelationships in heterogeneous time series data composed of various regimes,\neach governed by a distinct causal graph. Through the maximization of a score\nfunction via the EM algorithm, CASTOR infers the number of regimes and learns\nlinear or non-linear causal relationships in each regime. We demonstrate the\nrobust convergence properties of CASTOR, specifically highlighting its\nproficiency in accurately identifying unique regimes. 
Empirical evidence,\ngarnered from exhaustive synthetic experiments and two real-world benchmarks,\nconfirms CASTOR's superior performance in causal discovery compared to baseline\nmethods. By learning a full temporal causal graph for each regime, CASTOR\nestablishes itself as a distinctly interpretable method for causal discovery in\nheterogeneous time series."}, "http://arxiv.org/abs/2311.01453": {"title": "PPI++: Efficient Prediction-Powered Inference", "link": "http://arxiv.org/abs/2311.01453", "description": "We present PPI++: a computationally lightweight methodology for estimation\nand inference based on a small labeled dataset and a typically much larger\ndataset of machine-learning predictions. The methods automatically adapt to the\nquality of available predictions, yielding easy-to-compute confidence sets --\nfor parameters of any dimensionality -- that always improve on classical\nintervals using only the labeled data. PPI++ builds on prediction-powered\ninference (PPI), which targets the same problem setting, improving its\ncomputational and statistical efficiency. Real and synthetic experiments\ndemonstrate the benefits of the proposed adaptations."}, "http://arxiv.org/abs/2008.00707": {"title": "Heterogeneous Treatment and Spillover Effects under Clustered Network Interference", "link": "http://arxiv.org/abs/2008.00707", "description": "The bulk of causal inference studies rule out the presence of interference\nbetween units. However, in many real-world scenarios, units are interconnected\nby social, physical, or virtual ties, and the effect of the treatment can spill\nfrom one unit to other connected individuals in the network. In this paper, we\ndevelop a machine learning method that uses tree-based algorithms and a\nHorvitz-Thompson estimator to assess the heterogeneity of treatment and\nspillover effects with respect to individual, neighborhood, and network\ncharacteristics in the context of clustered networks and neighborhood\ninterference within clusters. The proposed Network Causal Tree (NCT) algorithm\nhas several advantages. First, it allows the investigation of the treatment\neffect heterogeneity, avoiding potential bias due to the presence of\ninterference. Second, understanding the heterogeneity of both treatment and\nspillover effects can guide policy-makers in scaling up interventions,\ndesigning targeting strategies, and increasing cost-effectiveness. We\ninvestigate the performance of our NCT method using a Monte Carlo simulation\nstudy, and we illustrate its application to assess the heterogeneous effects of\ninformation sessions on the uptake of a new weather insurance policy in rural\nChina."}, "http://arxiv.org/abs/2107.01773": {"title": "Extending Latent Basis Growth Model to Explore Joint Development in the Framework of Individual Measurement Occasions", "link": "http://arxiv.org/abs/2107.01773", "description": "Longitudinal processes often pose nonlinear change patterns. Latent basis\ngrowth models (LBGMs) provide a versatile solution without requiring specific\nfunctional forms. Building on the LBGM specification for unequally-spaced waves\nand individual occasions proposed by Liu and Perera (2023), we extend LBGMs to\nmultivariate longitudinal outcomes. This provides a unified approach to\nnonlinear, interconnected trajectories. Simulation studies demonstrate that the\nproposed model can provide unbiased and accurate estimates with target coverage\nprobabilities for the parameters of interest. 
Real-world analyses of reading\nand mathematics scores demonstrate its effectiveness in analyzing joint\ndevelopmental processes that vary in temporal patterns. Computational code is\nincluded."}, "http://arxiv.org/abs/2112.03152": {"title": "Bounding Wasserstein distance with couplings", "link": "http://arxiv.org/abs/2112.03152", "description": "Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates\nof intractable posterior expectations as the number of iterations tends to\ninfinity. However, in large data applications, MCMC can be computationally\nexpensive per iteration. This has catalyzed interest in approximating MCMC in a\nmanner that improves computational speed per iteration but does not produce\nasymptotically consistent estimates. In this article, we propose estimators\nbased on couplings of Markov chains to assess the quality of such\nasymptotically biased sampling methods. The estimators give empirical upper\nbounds of the Wasserstein distance between the limiting distribution of the\nasymptotically biased sampling method and the original target distribution of\ninterest. We establish theoretical guarantees for our upper bounds and show\nthat our estimators can remain effective in high dimensions. We apply our\nquality measures to stochastic gradient MCMC, variational Bayes, and Laplace\napproximations for tall data and to approximate MCMC for Bayesian logistic\nregression in 4500 dimensions and Bayesian linear regression in 50000\ndimensions."}, "http://arxiv.org/abs/2204.02954": {"title": "Strongly convergent homogeneous approximations to inhomogeneous Markov jump processes and applications", "link": "http://arxiv.org/abs/2204.02954", "description": "The study of time-inhomogeneous Markov jump processes is a traditional topic\nwithin probability theory that has recently attracted substantial attention in\nvarious applications. However, their flexibility also incurs a substantial\nmathematical burden which is usually circumvented by using well-known generic\ndistributional approximations or simulations. This article provides a novel\napproximation method that tailors the dynamics of a time-homogeneous Markov\njump process to meet those of its time-inhomogeneous counterpart on an\nincreasingly fine Poisson grid. Strong convergence of the processes in terms of\nthe Skorokhod $J_1$ metric is established, and convergence rates are provided.\nUnder traditional regularity assumptions, distributional convergence is\nestablished for unconditional proxies, to the same limit. Special attention is\ndevoted to the case where the target process has one absorbing state and the\nremaining ones transient, for which the absorption times also converge. Some\napplications are outlined, such as univariate hazard-rate density estimation,\nruin probabilities, and multivariate phase-type density evaluation."}, "http://arxiv.org/abs/2301.07210": {"title": "Causal Falsification of Digital Twins", "link": "http://arxiv.org/abs/2301.07210", "description": "Digital twins are virtual systems designed to predict how a real-world\nprocess will evolve in response to interventions. This modelling paradigm holds\nsubstantial promise in many applications, but rigorous procedures for assessing\ntheir accuracy are essential for safety-critical settings. We consider how to\nassess the accuracy of a digital twin using real-world data. 
We formulate this\nas a causal inference problem, which leads to a precise definition of what it\nmeans for a twin to be \"correct\" that is appropriate for many applications.\nUnfortunately, fundamental results from causal inference mean observational\ndata cannot be used to certify that a twin is correct in this sense unless\npotentially tenuous assumptions are made, such as that the data are\nunconfounded. To avoid these assumptions, we propose instead to find situations\nin which the twin is not correct, and present a general-purpose statistical\nprocedure for doing so. Our approach yields reliable and actionable information\nabout the twin under only the assumption of an i.i.d. dataset of observational\ntrajectories, and remains sound even if the data are confounded. We apply our\nmethodology to a large-scale, real-world case study involving sepsis modelling\nwithin the Pulse Physiology Engine, which we assess using the MIMIC-III dataset\nof ICU patients."}, "http://arxiv.org/abs/2301.11472": {"title": "Fast Bayesian Inference for Spatial Mean-Parameterized Conway--Maxwell--Poisson Models", "link": "http://arxiv.org/abs/2301.11472", "description": "Count data with complex features arise in many disciplines, including\necology, agriculture, criminology, medicine, and public health. Zero inflation,\nspatial dependence, and non-equidispersion are common features in count data.\nThere are two classes of models that allow for these features -- the\nmode-parameterized Conway--Maxwell--Poisson (COMP) distribution and the\ngeneralized Poisson model. However, both require the use of either constraints\non the parameter space or a parameterization that leads to challenges in\ninterpretability. We propose a spatial mean-parameterized COMP model that\nretains the flexibility of these models while resolving the above issues. We\nuse a Bayesian spatial filtering approach in order to efficiently handle\nhigh-dimensional spatial data and we use reversible-jump MCMC to automatically\nchoose the basis vectors for spatial filtering. The COMP distribution poses two\nadditional computational challenges -- an intractable normalizing function in\nthe likelihood and no closed-form expression for the mean. We propose a fast\ncomputational approach that addresses these challenges by, respectively,\nintroducing an efficient auxiliary variable algorithm and pre-computing key\napproximations for fast likelihood evaluation. We illustrate the application of\nour methodology to simulated and real datasets, including Texas HPV-cancer data\nand US vaccine refusal data."}, "http://arxiv.org/abs/2305.08529": {"title": "Kernel-based Joint Independence Tests for Multivariate Stationary and Non-stationary Time Series", "link": "http://arxiv.org/abs/2305.08529", "description": "Multivariate time series data that capture the temporal evolution of\ninterconnected systems are ubiquitous in diverse areas. Understanding the\ncomplex relationships and potential dependencies among co-observed variables is\ncrucial for the accurate statistical modelling and analysis of such systems.\nHere, we introduce kernel-based statistical tests of joint independence in\nmultivariate time series by extending the $d$-variable Hilbert-Schmidt\nindependence criterion (dHSIC) to encompass both stationary and non-stationary\nprocesses, thus allowing broader real-world applications. 
By leveraging\nresampling techniques tailored for both single- and multiple-realisation time\nseries, we show how the method robustly uncovers significant higher-order\ndependencies in synthetic examples, including frequency mixing data and logic\ngates, as well as real-world climate, neuroscience, and socioeconomic data. Our\nmethod adds to the mathematical toolbox for the analysis of multivariate time\nseries and can aid in uncovering high-order interactions in data."}, "http://arxiv.org/abs/2306.07769": {"title": "Amortized Simulation-Based Frequentist Inference for Tractable and Intractable Likelihoods", "link": "http://arxiv.org/abs/2306.07769", "description": "High-fidelity simulators that connect theoretical models with observations\nare indispensable tools in many sciences. When coupled with machine learning, a\nsimulator makes it possible to infer the parameters of a theoretical model\ndirectly from real and simulated observations without explicit use of the\nlikelihood function. This is of particular interest when the latter is\nintractable. In this work, we introduce a simple extension of the recently\nproposed likelihood-free frequentist inference (LF2I) approach that has some\ncomputational advantages. Like LF2I, this extension yields provably valid\nconfidence sets in parameter inference problems in which a high-fidelity\nsimulator is available. The utility of our algorithm is illustrated by applying\nit to three pedagogically interesting examples: the first is from cosmology,\nthe second from high-energy physics and astronomy, both with tractable\nlikelihoods, while the third, with an intractable likelihood, is from\nepidemiology."}, "http://arxiv.org/abs/2307.05732": {"title": "Semiparametric Shape-restricted Estimators for Nonparametric Regression", "link": "http://arxiv.org/abs/2307.05732", "description": "Estimating the conditional mean function that relates predictive covariates\nto a response variable of interest is a fundamental task in economics and\nstatistics. In this manuscript, we propose some general nonparametric\nregression approaches that are widely applicable based on a simple yet\nsignificant decomposition of nonparametric functions into a semiparametric\nmodel with shape-restricted components. For instance, we observe that every\nLipschitz function can be expressed as a sum of a monotone function and a\nlinear function. We implement well-established shape-restricted estimation\nprocedures, such as isotonic regression, to handle the ``nonparametric\"\ncomponents of the true regression function and combine them with a simple\nsample-splitting procedure to estimate the parametric components. The resulting\nestimators inherit several favorable properties from the shape-restricted\nregression estimators. Notably, they are practically tuning-parameter free,\nconverge at the minimax optimal rate, and exhibit an adaptive rate when the\ntrue regression function is ``simple\". 
We also confirm these theoretical\nproperties and compare the practical performance with that of existing methods via a\nseries of numerical studies."}, "http://arxiv.org/abs/2311.01470": {"title": "Preliminary Estimators of Population Mean using Ranked Set Sampling in the Presence of Measurement Error and Non-Response Error", "link": "http://arxiv.org/abs/2311.01470", "description": "In order to estimate the population mean in the presence of both non-response\nand measurement errors that are uncorrelated, the paper presents some novel\nestimators employing ranked set sampling by utilizing auxiliary information. Up\nto the first order of approximation, the equations for the bias and mean\nsquared error of the suggested estimators are produced, and it is found that\nthe proposed estimators outperform the other existing estimators analysed in\nthis study. Investigations using simulation studies and numerical examples show\nhow well the suggested estimators perform in the presence of measurement and\nnon-response errors. The relative efficiency of the suggested estimators\ncompared to the existing estimators has been expressed as a percentage, and the\nimpact of measurement errors has likewise been quantified as a percentage."}, "http://arxiv.org/abs/2311.01484": {"title": "Comparison of methods for analyzing environmental mixtures effects on survival outcomes and application to a population-based cohort study", "link": "http://arxiv.org/abs/2311.01484", "description": "The estimation of the effect of environmental exposures and overall mixtures\non a survival time outcome is common in environmental epidemiological studies.\nWhile advanced statistical methods are increasingly being used for mixture\nanalyses, their applicability and performance for survival outcomes have yet to\nbe explored. We identified readily available methods for analyzing an\nenvironmental mixture's effect on a survival outcome and assessed their\nperformance via simulations replicating various real-life scenarios. Using\nprespecified criteria, we selected Bayesian Additive Regression Trees (BART),\nCox Elastic Net, Cox Proportional Hazards (PH) with and without penalized\nsplines, Gaussian Process Regression (GPR) and Multivariate Adaptive Regression\nSplines (MARS) to compare the bias and efficiency produced when estimating\nindividual exposure, overall mixture, and interaction effects on a survival\noutcome. We illustrate the selected methods in a real-world data application.\nWe estimated the effects of arsenic, cadmium, molybdenum, selenium, tungsten,\nand zinc on incidence of cardiovascular disease in American Indians using data\nfrom the Strong Heart Study (SHS). In the simulation study, there was a\nconsistent bias-variance trade-off. The more flexible models (BART, GPR and\nMARS) were found to be most advantageous in the presence of nonproportional\nhazards, where the Cox models often did not capture the true effects due to\ntheir higher bias and lower variance. In the SHS, estimates of the effect of\nselenium and the overall mixture indicated negative effects, but the magnitudes\nof the estimated effects varied across methods. 
In practice, we recommend\nevaluating whether findings are consistent across methods."}, "http://arxiv.org/abs/2311.01485": {"title": "Subgroup identification using individual participant data from multiple trials on low back pain", "link": "http://arxiv.org/abs/2311.01485", "description": "Model-based recursive partitioning (MOB) and its extension, metaMOB, are\npotent tools for identifying subgroups with differential treatment effects. In\nthe metaMOB approach, random effects are used to model heterogeneity of the\ntreatment effects when pooling data from various trials. In situations where\ninterventions offer only small overall benefits and require extensive, costly\ntrials with a large participant enrollment, leveraging individual-participant\ndata (IPD) from multiple trials can help identify individuals who are most\nlikely to benefit from the intervention. We explore the application of MOB and\nmetaMOB in the context of non-specific low back pain treatment, using\nsynthesized data based on a subset of the individual participant data\nmeta-analysis by Patel et al. Our study underscores the need to explore\nheterogeneity in intercepts and treatment effects to identify subgroups with\ndifferential treatment effects in IPD meta-analyses."}, "http://arxiv.org/abs/2311.01538": {"title": "A reluctant additive model framework for interpretable nonlinear individualized treatment rules", "link": "http://arxiv.org/abs/2311.01538", "description": "Individualized treatment rules (ITRs) for treatment recommendation are an\nimportant topic for precision medicine as not all beneficial treatments work\nwell for all individuals. Interpretability is a desirable property of ITRs, as\nit helps practitioners make sense of treatment decisions, yet there is a need\nfor ITRs to be flexible to effectively model complex biomedical data for\ntreatment decision making. Many ITR approaches either focus on linear ITRs,\nwhich may perform poorly when true optimal ITRs are nonlinear, or on black-box\nnonlinear ITRs, which may be hard to interpret and can be overly complex. This\ndilemma indicates a tension between interpretability and accuracy of treatment\ndecisions. Here we propose an additive model-based nonlinear ITR learning\nmethod that balances interpretability and flexibility of the ITR. Our approach\naims to strike this balance by allowing both linear and nonlinear terms of the\ncovariates in the final ITR. Our approach is parsimonious in that the nonlinear\nterm is included in the final ITR only when it substantially improves the ITR\nperformance. To prevent overfitting, we combine cross-fitting and a specialized\ninformation criterion for model selection. Through extensive simulations, we\nshow that our methods are data-adaptive to the degree of nonlinearity and can\nfavorably balance ITR interpretability and flexibility. We further demonstrate\nthe robust performance of our methods with an application to a cancer drug\nsensitivity study."}, "http://arxiv.org/abs/2311.01596": {"title": "Local Bayesian Dirichlet mixing of imperfect models", "link": "http://arxiv.org/abs/2311.01596", "description": "To improve the predictability of complex computational models in the\nexperimentally-unknown domains, we propose a Bayesian statistical machine\nlearning framework utilizing the Dirichlet distribution that combines results\nof several imperfect models. This framework can be viewed as an extension of\nBayesian stacking. 
To illustrate the method, we study the ability of Bayesian\nmodel averaging and mixing techniques to mine nuclear masses. We show that the\nglobal and local mixtures of models reach excellent performance on both\nprediction accuracy and uncertainty quantification and are preferable to\nclassical Bayesian model averaging. Additionally, our statistical analysis\nindicates that improving model predictions through mixing rather than mixing of\ncorrected models leads to more robust extrapolations."}, "http://arxiv.org/abs/2311.01625": {"title": "Topological inference on brain networks across subtypes of post-stroke aphasia", "link": "http://arxiv.org/abs/2311.01625", "description": "Persistent homology (PH) characterizes the shape of brain networks through\nthe persistence features. Group comparison of persistence features from brain\nnetworks can be challenging as they are inherently heterogeneous. A recent\nscale-space representation of persistence diagram (PD) through heat diffusion\nreparameterizes using the finite number of Fourier coefficients with respect to\nthe Laplace-Beltrami (LB) eigenfunction expansion of the domain, which provides\na powerful vectorized algebraic representation for group comparisons of PDs. In\nthis study, we advance a transposition-based permutation test for comparing\nmultiple groups of PDs through the heat-diffusion estimates of the PDs. We\nevaluate the empirical performance of the spectral transposition test in\ncapturing within- and between-group similarity and dissimilarity with respect\nto statistical variation of topological noise and hole location. We also\nillustrate how the method extends naturally into a clustering scheme by\nsubtyping individuals with post-stroke aphasia through the PDs of their\nresting-state functional brain networks."}, "http://arxiv.org/abs/2311.01638": {"title": "Inference on summaries of a model-agnostic longitudinal variable importance trajectory", "link": "http://arxiv.org/abs/2311.01638", "description": "In prediction settings where data are collected over time, it is often of\ninterest to understand both the importance of variables for predicting the\nresponse at each time point and the importance summarized over the time series.\nBuilding on recent advances in estimation and inference for variable importance\nmeasures, we define summaries of variable importance trajectories. These\nmeasures can be estimated and the same approaches for inference can be applied\nregardless of the choice of the algorithm(s) used to estimate the prediction\nfunction. We propose a nonparametric efficient estimation and inference\nprocedure as well as a null hypothesis testing procedure that are valid even\nwhen complex machine learning tools are used for prediction. Through\nsimulations, we demonstrate that our proposed procedures have good operating\ncharacteristics, and we illustrate their use by investigating the longitudinal\nimportance of risk factors for suicide attempt."}, "http://arxiv.org/abs/2311.01681": {"title": "The R", "link": "http://arxiv.org/abs/2311.01681", "description": "We propose a prognostic stratum matching framework that addresses the\ndeficiencies of Randomized trial data subgroup analysis and transforms\nObservAtional Data to be used as if they were randomized, thus paving the road\nfor precision medicine. Our approach counters the effects of unobserved\nconfounding in observational data by correcting the estimated probabilities of\nthe outcome under a treatment through a novel two-step process. 
These\nprobabilities are then used to train Optimal Policy Trees (OPTs), which are\ndecision trees that optimally assign treatments to subgroups of patients based\non their characteristics. This facilitates the creation of clinically intuitive\ntreatment recommendations. We applied our framework to observational data of\npatients with gastrointestinal stromal tumors (GIST) and validated the OPTs in\nan external cohort using the sensitivity and specificity metrics. We show that\nthese recommendations outperformed those of experts in GIST. We further applied\nthe same framework to randomized clinical trial (RCT) data of patients with\nextremity sarcomas. Remarkably, despite the initial trial results suggesting\nthat all patients should receive treatment, our framework, after addressing\nimbalances in patient distribution due to the trial's small sample size,\nidentified through the OPTs a subset of patients with unique characteristics\nwho may not require treatment. Again, we successfully validated our\nrecommendations in an external cohort."}, "http://arxiv.org/abs/2311.01709": {"title": "Causal inference with Machine Learning-Based Covariate Representation", "link": "http://arxiv.org/abs/2311.01709", "description": "Utilizing covariate information has been a powerful approach to improve the\nefficiency and accuracy of causal inference, which supports massive amounts of\nrandomized experiments run on data-driven enterprises. However, state-of-the-art\napproaches can become practically unreliable when the dimension of the covariates\nincreases to just 50, whereas experiments on large platforms can observe even\nhigher-dimensional covariates. We propose a machine-learning-assisted covariate\nrepresentation approach that can effectively make use of historical experimental\nor observational data that are run on the same platform to understand which\nlower dimensions can effectively represent the higher-dimensional covariate. We\nthen propose design and estimation methods with the covariate representation.\nWe prove statistical reliability and performance guarantees for the proposed\nmethods. The empirical performance is demonstrated using numerical experiments."}, "http://arxiv.org/abs/2311.01762": {"title": "Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel", "link": "http://arxiv.org/abs/2311.01762", "description": "Kernel ridge regression, KRR, is a generalization of linear ridge regression\nthat is non-linear in the data, but linear in the parameters. The solution can\nbe obtained either as a closed-form solution, which includes a matrix\ninversion, or iteratively through gradient descent. Using the iterative\napproach opens up the possibility of changing the kernel during training, something that is\ninvestigated in this paper. We theoretically address the effects this has on\nmodel complexity and generalization. Based on our findings, we propose an\nupdate scheme for the bandwidth of translational-invariant kernels, where we\nlet the bandwidth decrease to zero during training, thus circumventing the need\nfor hyper-parameter selection. 
We demonstrate on real and synthetic data how\ndecreasing the bandwidth during training outperforms using a constant\nbandwidth, selected by cross-validation and marginal likelihood maximization.\nWe also show theoretically and empirically that using a decreasing bandwidth,\nwe are able to achieve both zero training error in combination with good\ngeneralization, and a double descent behavior, phenomena that do not occur for\nKRR with constant bandwidth but are known to appear for neural networks."}, "http://arxiv.org/abs/2311.01833": {"title": "Similarity network aggregation for the analysis of glacier ecosystems", "link": "http://arxiv.org/abs/2311.01833", "description": "The synthesis of information deriving from complex networks is a topic\nof increasing relevance in ecology and environmental sciences. In\nparticular, the aggregation of multilayer networks, i.e. network structures\nformed by multiple interacting networks (the layers), constitutes a\nfast-growing field. In several environmental applications, the layers of a\nmultilayer network are modelled as a collection of similarity matrices\ndescribing how similar pairs of biological entities are, based on different\ntypes of features (e.g. biological traits). The present paper first discusses\ntwo main techniques for combining the multi-layered information into a single\nnetwork (the so-called monoplex), i.e. Similarity Network Fusion (SNF) and\nSimilarity Matrix Average (SMA). Then, the effectiveness of the two methods is\ntested on a real-world dataset of the relative abundance of microbial species\nin the ecosystems of nine glaciers (four glaciers in the Alps and five in the\nAndes). A preliminary clustering analysis on the monoplexes obtained with\ndifferent methods shows the emergence of a tightly connected community formed\nby species that are typical of cryoconite holes worldwide. Moreover, the\nweights assigned to different layers by the SMA algorithm suggest that two\nlarge South American glaciers (Exploradores and Perito Moreno) are structurally\ndifferent from the smaller glaciers in both Europe and South America. Overall,\nthese results highlight the importance of integration methods in the discovery\nof the underlying organizational structure of biological entities in multilayer\necological networks."}, "http://arxiv.org/abs/2311.01872": {"title": "The use of restricted mean survival time to estimate treatment effect under model misspecification, a simulation study", "link": "http://arxiv.org/abs/2311.01872", "description": "The use of the non-parametric Restricted Mean Survival Time endpoint (RMST)\nhas grown in popularity as trialists look to analyse time-to-event outcomes\nwithout the restrictions of the proportional hazards assumption. In this paper,\nwe evaluate the power and type I error rate of the parametric and\nnon-parametric RMST estimators when treatment effect is explained by multiple\ncovariates, including an interaction term. Utilising the RMST estimator in this\nway allows the combined treatment effect to be summarised as a one-dimensional\nestimator, which is evaluated using a one-sided hypothesis Z-test. The\nestimators are either fully specified or misspecified, either in terms of\nunaccounted covariates or misspecified knot points (where trials exhibit\ncrossing survival curves). A placebo-controlled trial of Gamma interferon is\nused as a motivating example to simulate associated survival times. 
When\ncorrectly specified, the parametric RMST estimator has the greatest power,\nregardless of the time of analysis. The misspecified RMST estimator generally\nperforms similarly when covariates mirror those of the fitted case study\ndataset. However, as the magnitude of the unaccounted covariate increases, the\nassociated power of the estimator decreases. In all cases, the non-parametric\nRMST estimator has the lowest power, and power remains very reliant on the time\nof analysis (with a later analysis time correlated with greater power)."}, "http://arxiv.org/abs/2311.01902": {"title": "High Precision Causal Model Evaluation with Conditional Randomization", "link": "http://arxiv.org/abs/2311.01902", "description": "The gold standard for causal model evaluation involves comparing model\npredictions with true effects estimated from randomized controlled trials\n(RCT). However, RCTs are not always feasible or ethical to perform. In\ncontrast, conditionally randomized experiments based on inverse probability\nweighting (IPW) offer a more realistic approach but may suffer from high\nestimation variance. To tackle this challenge and enhance causal model\nevaluation in real-world conditional randomization settings, we introduce a\nnovel low-variance estimator for causal error, dubbed the pairs estimator.\nBy applying the same IPW estimator to both the model and true experimental\neffects, our estimator effectively cancels out the variance due to IPW and\nachieves a smaller asymptotic variance. Empirical studies demonstrate the\nimproved performance of our estimator, highlighting its potential for achieving near-RCT\nperformance. Our method offers a simple yet powerful solution to evaluate\ncausal inference models in conditional randomization settings without\ncomplicated modification of the IPW estimator itself, paving the way for more\nrobust and reliable model assessments."}, "http://arxiv.org/abs/2311.01913": {"title": "Extended Relative Power Contribution that Allows to Evaluate the Effect of Correlated Noise", "link": "http://arxiv.org/abs/2311.01913", "description": "We propose an extension of Akaike's relative power contribution that can\nbe applied to data with correlations between noises. This method decomposes the\npower spectrum into a contribution of the terms caused by correlation between\ntwo noises, in addition to the contributions of the independent noises.\nNumerical examples confirm that some of the correlated noise has the effect of\nreducing the power spectrum."}, "http://arxiv.org/abs/2311.02019": {"title": "Reproducible Parameter Inference Using Bagged Posteriors", "link": "http://arxiv.org/abs/2311.02019", "description": "Under model misspecification, it is known that Bayesian posteriors often do\nnot properly quantify uncertainty about true or pseudo-true parameters. Even\nmore fundamentally, misspecification leads to a lack of reproducibility in the\nsense that the same model will yield contradictory posteriors on independent\ndata sets from the true distribution. To define a criterion for reproducible\nuncertainty quantification under misspecification, we consider the probability\nthat two confidence sets constructed from independent data sets have nonempty\noverlap, and we establish a lower bound on this overlap probability that holds\nfor any valid confidence sets. 
We prove that credible sets from the standard\nposterior can strongly violate this bound, particularly in high-dimensional\nsettings (i.e., with dimension increasing with sample size), indicating that it\nis not internally coherent under misspecification. To improve reproducibility\nin an easy-to-use and widely applicable way, we propose to apply bagging to the\nBayesian posterior (\"BayesBag\"); that is, to use the average of posterior\ndistributions conditioned on bootstrapped datasets. We motivate BayesBag from\nfirst principles based on Jeffrey conditionalization and show that the bagged\nposterior typically satisfies the overlap lower bound. Further, we prove a\nBernstein--von Mises theorem for the bagged posterior, establishing its\nasymptotic normal distribution. We demonstrate the benefits of BayesBag via\nsimulation experiments and an application to crime rate prediction."}, "http://arxiv.org/abs/2311.02043": {"title": "Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective", "link": "http://arxiv.org/abs/2311.02043", "description": "Quantile regression is a powerful tool for inferring how covariates affect\nspecific percentiles of the response distribution. Existing methods either\nestimate conditional quantiles separately for each quantile of interest or\nestimate the entire conditional distribution using semi- or non-parametric\nmodels. The former often produce inadequate models for real data and do not\nshare information across quantiles, while the latter are characterized by\ncomplex and constrained models that can be difficult to interpret and\ncomputationally inefficient. Further, neither approach is well-suited for\nquantile-specific subset selection. Instead, we pose the fundamental problems\nof linear quantile estimation, uncertainty quantification, and subset selection\nfrom a Bayesian decision analysis perspective. For any Bayesian regression\nmodel, we derive optimal and interpretable linear estimates and uncertainty\nquantification for each model-based conditional quantile. Our approach\nintroduces a quantile-focused squared error loss, which enables efficient,\nclosed-form computing and maintains a close relationship with Wasserstein-based\ndensity estimation. In an extensive simulation study, our methods demonstrate\nsubstantial gains in quantile estimation accuracy, variable selection, and\ninference over frequentist and Bayesian competitors. We apply these tools to\nidentify the quantile-specific impacts of social and environmental stressors on\neducational outcomes for a large cohort of children in North Carolina."}, "http://arxiv.org/abs/2010.08627": {"title": "Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function", "link": "http://arxiv.org/abs/2010.08627", "description": "Canonical correlation analysis (CCA) is a popular statistical technique for\nexploring relationships between datasets. In recent years, the estimation of\nsparse canonical vectors has emerged as an important but challenging variant of\nthe CCA problem, with widespread applications. Unfortunately, existing\nrate-optimal estimators for sparse canonical vectors have high computational\ncost. We propose a quasi-Bayesian estimation procedure that not only achieves\nthe minimax estimation rate, but also is easy to compute by Markov Chain Monte\nCarlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled\nRayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et\nal. 
(2018), we adopt a Bayesian framework that combines this\nquasi-log-likelihood with a spike-and-slab prior to regularize the inference\nand promote sparsity. We investigate the empirical behavior of the proposed\nmethod on both continuous and truncated data, and we demonstrate that it\noutperforms several state-of-the-art methods. As an application, we use the\nproposed methodology to maximally correlate clinical variables and proteomic\ndata for a better understanding of the Covid-19 disease."}, "http://arxiv.org/abs/2104.08300": {"title": "Semiparametric Sensitivity Analysis: Unmeasured Confounding In Observational Studies", "link": "http://arxiv.org/abs/2104.08300", "description": "Establishing cause-effect relationships from observational data often relies\non untestable assumptions. It is crucial to know whether, and to what extent,\nthe conclusions drawn from non-experimental studies are robust to potential\nunmeasured confounding. In this paper, we focus on the average causal effect\n(ACE) as our target of inference. We generalize the sensitivity analysis\napproach developed by Robins et al. (2000), Franks et al. (2020) and Zhou and\nYao (2023). We use semiparametric theory to derive the non-parametric efficient\ninfluence function of the ACE, for fixed sensitivity parameters. We use this\ninfluence function to construct a one-step bias-corrected estimator of the ACE.\nOur estimator depends on semiparametric models for the distribution of the\nobserved data; importantly, these models do not impose any restrictions on the\nvalues of sensitivity analysis parameters. We establish sufficient conditions\nensuring that our estimator has root-n asymptotics. We use our methodology to\nevaluate the causal effect of smoking during pregnancy on birth weight. We also\nevaluate the performance of the estimation procedure in a simulation study."}, "http://arxiv.org/abs/2207.00100": {"title": "A Bayesian 'sandwich' for variance estimation", "link": "http://arxiv.org/abs/2207.00100", "description": "Large-sample Bayesian analogs exist for many frequentist methods, but are\nless well-known for the widely-used 'sandwich' or 'robust' variance estimates.\nWe review existing approaches to Bayesian analogs of sandwich variance\nestimates and propose a new analog, as the Bayes rule under a form of balanced\nloss function, that combines elements of standard parametric inference with\nfidelity of the data to the model. Our development is general, for essentially\nany regression setting with independent outcomes. Being the large-sample\nequivalent of its frequentist counterpart, we show by simulation that Bayesian\nrobust standard error estimates can faithfully quantify the variability of\nparameter estimates even under model misspecification -- thus retaining the\nmajor attraction of the original frequentist version. We demonstrate our\nBayesian analog of standard error estimates when studying the association\nbetween age and systolic blood pressure in NHANES."}, "http://arxiv.org/abs/2210.17514": {"title": "Cost-aware Generalized $\\alpha$-investing for Multiple Hypothesis Testing", "link": "http://arxiv.org/abs/2210.17514", "description": "We consider the problem of sequential multiple hypothesis testing with\nnontrivial data collection costs. This problem appears, for example, when\nconducting biological experiments to identify differentially expressed genes of\na disease process. 
This work builds on the generalized $\alpha$-investing\nframework which enables control of the false discovery rate in a sequential\ntesting setting. We provide a theoretical analysis of the long-term asymptotic\nbehavior of $\alpha$-wealth which motivates a consideration of sample size in\nthe $\alpha$-investing decision rule. Posing the testing process as a game with\nnature, we construct a decision rule that optimizes the expected\n$\alpha$-wealth reward (ERO) and provides an optimal sample size for each test.\nEmpirical results show that a cost-aware ERO decision rule correctly rejects\nmore false null hypotheses than other methods for $n=1$, where $n$ is the sample\nsize. When the sample size is not fixed, cost-aware ERO uses a prior on the null\nhypothesis to adaptively allocate the sample budget to each test. We extend\ncost-aware ERO investing to finite-horizon testing which enables the decision\nrule to allocate samples in a non-myopic manner. Finally, empirical tests on\nreal data sets from biological experiments show that cost-aware ERO balances\nthe allocation of samples to an individual test against the allocation of\nsamples across multiple tests."}, "http://arxiv.org/abs/2301.01480": {"title": "A new over-dispersed count model", "link": "http://arxiv.org/abs/2301.01480", "description": "A new two-parameter discrete distribution, namely the PoiG distribution, is\nderived by the convolution of a Poisson variate and an independently\ndistributed geometric random variable. This distribution generalizes both the\nPoisson and geometric distributions and can be used for modelling\nover-dispersed as well as equi-dispersed count data. A number of important\nstatistical properties of the proposed count model are derived, such as the probability\ngenerating function, the moment generating function, the moments, the survival\nfunction and the hazard rate function. Monotonic properties, such\nas log concavity and stochastic ordering, are also investigated in\ndetail. Method of moments and maximum likelihood estimators of the\nparameters of the proposed model are presented. It is envisaged that the\nproposed distribution may prove to be useful to practitioners for\nmodelling over-dispersed count data compared to its closest competitors."}, "http://arxiv.org/abs/2305.10050": {"title": "The Impact of Missing Data on Causal Discovery: A Multicentric Clinical Study", "link": "http://arxiv.org/abs/2305.10050", "description": "Causal inference for testing clinical hypotheses from observational data\npresents many difficulties because the underlying data-generating model and the\nassociated causal graph are not usually available. Furthermore, observational\ndata may contain missing values, which impact the recovery of the causal graph\nby causal discovery algorithms: a crucial issue often ignored in clinical\nstudies. In this work, we use data from a multi-centric study on endometrial\ncancer to analyze the impact of different missingness mechanisms on the\nrecovered causal graph. This is achieved by extending state-of-the-art causal\ndiscovery algorithms to exploit expert knowledge without sacrificing\ntheoretical soundness. We validate the recovered graph with expert physicians,\nshowing that our approach finds clinically-relevant solutions. 
Finally, we\ndiscuss the goodness of fit of our graph and its consistency from a clinical\ndecision-making perspective using graphical separation to validate causal\npathways."}, "http://arxiv.org/abs/2309.03952": {"title": "The Causal Roadmap and simulation studies to inform the Statistical Analysis Plan for real-data applications", "link": "http://arxiv.org/abs/2309.03952", "description": "The Causal Roadmap outlines a systematic approach to our research endeavors:\ndefine the quantity of interest, evaluate the needed assumptions, conduct statistical\nestimation, and carefully interpret the results. At the estimation step, it is\nessential that the estimation algorithm be chosen thoughtfully for its\ntheoretical properties and expected performance. Simulations can help\nresearchers gain a better understanding of an estimator's statistical\nperformance under conditions unique to the real-data application. This in turn\ncan inform the rigorous pre-specification of a Statistical Analysis Plan (SAP),\nnot only stating the estimand (e.g., G-computation formula), the estimator\n(e.g., targeted minimum loss-based estimation [TMLE]), and adjustment\nvariables, but also the implementation of the estimator -- including nuisance\nparameter estimation and the approach for variance estimation. Doing so helps\nensure valid inference (e.g., 95% confidence intervals with appropriate\ncoverage). Failing to pre-specify estimation can lead to data dredging and\ninflated Type-I error rates."}, "http://arxiv.org/abs/2311.02273": {"title": "A Sequential Learning Procedure with Applications to Online Sales Examination", "link": "http://arxiv.org/abs/2311.02273", "description": "In this paper, we consider the problem of estimating parameters in a linear\nregression model. We propose a sequential learning procedure to determine the\nsample size for achieving a given small estimation risk, under the widely used\nGauss-Markov setup with independent normal errors. The procedure is proven to\nenjoy the second-order efficiency and risk-efficiency properties, which are\nvalidated through Monte Carlo simulation studies. Using e-commerce data, we\nimplement the procedure to examine the influential factors of online sales."}, "http://arxiv.org/abs/2311.02306": {"title": "Heteroskedastic Tensor Clustering", "link": "http://arxiv.org/abs/2311.02306", "description": "Tensor clustering, which seeks to extract underlying cluster structures from\nnoisy tensor observations, has gained increasing attention. One extensively\nstudied model for tensor clustering is the tensor block model, which postulates\nthe existence of clustering structures along each mode and has found broad\napplications in areas like multi-tissue gene expression analysis and multilayer\nnetwork analysis. However, currently available computationally feasible methods\nfor tensor clustering either are limited to handling i.i.d. sub-Gaussian noise\nor suffer from suboptimal statistical performance, which restrains their\nutility in applications that have to deal with heteroskedastic data and/or low\nsignal-to-noise-ratio (SNR).\n\nTo overcome these challenges, we propose a two-stage method, named\n$\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by\nperforming tensor subspace estimation via a novel spectral algorithm called\n$\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate\n$k$-means to obtain cluster nodes. 
Encouragingly, our algorithm provably\nachieves exact clustering as long as the SNR exceeds the computational limit\n(ignoring logarithmic factors); here, the SNR refers to the ratio of the\npairwise disparity between nodes to the noise level, and the computational\nlimit indicates the lowest SNR that enables exact clustering with polynomial\nruntime. Comprehensive simulation and real-data experiments suggest that our\nalgorithm outperforms existing algorithms across various settings, delivering\nmore reliable clustering performance."}, "http://arxiv.org/abs/2311.02308": {"title": "Kernel-based sensitivity indices for any model behavior and screening", "link": "http://arxiv.org/abs/2311.02308", "description": "Complex models are often used to understand interactions and drivers of\nhuman-induced and/or natural phenomena. It is worth identifying the input\nvariables that drive the model output(s) in a given domain and/or govern\nspecific model behaviors such as contextual indicators based on\nsocio-environmental models. Using the theory of multivariate weighted\ndistributions to characterize specific model behaviors, we propose new measures\nof association between inputs and such behaviors. Our measures rely on\nsensitivity functionals (SFs) and kernel methods, including variance-based\nsensitivity analysis. The proposed $\ell_1$-based kernel indices account for\ninteractions among inputs, higher-order moments of SFs, and their upper bounds\nare somehow equivalent to the Morris-type screening measures, including\ndependent elementary effects. Empirical kernel-based indices are derived,\nincluding their statistical properties for the computational issues, and\nnumerical results are provided."}, "http://arxiv.org/abs/2311.02312": {"title": "Efficient Change Point Detection and Estimation in High-Dimensional Correlation Matrices", "link": "http://arxiv.org/abs/2311.02312", "description": "This paper considers the problems of detecting a change point and estimating\nthe location in the correlation matrices of a sequence of high-dimensional\nvectors, where the dimension is large enough to be comparable to the sample\nsize or even much larger. A new break test is proposed based on signflip\nparallel analysis to detect the existence of change points. Furthermore, a\ntwo-step approach combining a signflip permutation dimension reduction step and\na CUSUM statistic is proposed to estimate the change point's location and\nrecover the support of changes. The consistency of the estimator is\nestablished. Simulation examples and real data applications illustrate the\nsuperior empirical performance of the proposed methods. In particular, the\nproposed methods outperform existing ones for non-Gaussian data and the change\npoint in the extreme tail of a sequence and become more accurate as the\ndimension $p$ increases. Supplementary materials for this article are available\nonline."}, "http://arxiv.org/abs/2311.02450": {"title": "Factor-guided estimation of large covariance matrix function with conditional functional sparsity", "link": "http://arxiv.org/abs/2311.02450", "description": "This paper addresses the fundamental task of estimating covariance matrix\nfunctions for high-dimensional functional data/functional time series. We\nconsider two functional factor structures encompassing either functional\nfactors with scalar loadings or scalar factors with functional loadings, and\npostulate functional sparsity on the covariance of idiosyncratic errors after\ntaking out the common unobserved factors. 
To facilitate estimation, we rely on\nthe spiked matrix model and its functional generalization, and derive some\nnovel asymptotic identifiability results, based on which we develop DIGIT and\nFPOET estimators under two functional factor models, respectively. Both\nestimators involve performing associated eigenanalysis to estimate the\ncovariance of common components, followed by adaptive functional thresholding\napplied to the residual covariance. We also develop functional information\ncriteria for the purpose of model selection. The convergence rates of estimated\nfactors, loadings, and conditional sparse covariance matrix functions under\nvarious functional matrix norms, are respectively established for DIGIT and\nFPOET estimators. Numerical studies including extensive simulations and two\nreal data applications on mortality rates and functional portfolio allocation\nare conducted to examine the finite-sample performance of the proposed\nmethodology."}, "http://arxiv.org/abs/2311.02532": {"title": "Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making", "link": "http://arxiv.org/abs/2311.02532", "description": "A/B testing is critical for modern technological companies to evaluate the\neffectiveness of newly developed products against standard baselines. This\npaper studies optimal designs that aim to maximize the amount of information\nobtained from online experiments to estimate treatment effects accurately. We\npropose three optimal allocation strategies in a dynamic setting where\ntreatments are sequentially assigned over time. These strategies are designed\nto minimize the variance of the treatment effect estimator when data follow a\nnon-Markov decision process or a (time-varying) Markov decision process. We\nfurther develop estimation procedures based on existing off-policy evaluation\n(OPE) methods and conduct extensive experiments in various environments to\ndemonstrate the effectiveness of the proposed methodologies. In theory, we\nprove the optimality of the proposed treatment allocation design and establish\nupper bounds for the mean squared errors of the resulting treatment effect\nestimators."}, "http://arxiv.org/abs/2311.02543": {"title": "Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling", "link": "http://arxiv.org/abs/2311.02543", "description": "This paper discusses estimation and limited information goodness-of-fit test\nstatistics in factor models for binary data using pairwise likelihood\nestimation and sampling weights. The paper extends the applicability of\npairwise likelihood estimation for factor models with binary data to\naccommodate complex sampling designs. Additionally, it introduces two key\nlimited information test statistics: the Pearson chi-squared test and the Wald\ntest. To enhance computational efficiency, the paper introduces modifications\nto both test statistics. The performance of the estimation and the proposed\ntest statistics under simple random sampling and unequal probability sampling\nis evaluated using simulated data."}, "http://arxiv.org/abs/2311.02574": {"title": "Semi-supervised Estimation of Event Rate with Doubly-censored Survival Data", "link": "http://arxiv.org/abs/2311.02574", "description": "Electronic Health Record (EHR) has emerged as a valuable source of data for\ntranslational research. 
To leverage EHR data for risk prediction and\nsubsequent clinical decision support, clinical endpoints are often time to\nonset of a clinical condition of interest. Precise information on clinical\nevent times is often not directly available and requires labor-intensive manual\nchart review to ascertain. In addition, events may occur outside of the\nhospital system, resulting in both left and right censoring, often termed double\ncensoring. On the other hand, proxies such as time to the first diagnostic code\nare readily available yet with varying degrees of accuracy. Using error-prone\nevent times derived from these proxies can lead to biased risk estimates, while\nrelying only on manually annotated event times, which are typically only\navailable for a small subset of patients, can lead to high variability. This\nsignifies the need for semi-supervised estimation methods that can efficiently\ncombine information from both the small subset of labeled observations and a\nlarge set of surrogate proxies. While semi-supervised estimation methods have\nbeen recently developed for binary and right-censored data, no methods\ncurrently exist in the presence of double censoring. This paper fills the gap\nby developing a robust and efficient Semi-supervised Estimation of Event rate\nwith Doubly-censored Survival data (SEEDS) by leveraging a small set of gold\nstandard labels and a large set of surrogate features. Under regularity\nconditions, we demonstrate that the proposed SEEDS estimator is consistent and\nasymptotically normal. Simulation results illustrate that SEEDS performs well\nin finite samples and can be substantially more efficient compared to the\nsupervised counterpart. We apply SEEDS to estimate the age-specific\nsurvival rate of type 2 diabetes using EHR data from Mass General Brigham."}, "http://arxiv.org/abs/2311.02610": {"title": "An adaptive standardisation model for Day-Ahead electricity price forecasting", "link": "http://arxiv.org/abs/2311.02610", "description": "The study of Day-Ahead prices in the electricity market is one of the most\npopular problems in time series forecasting. Previous research has focused on\nemploying increasingly complex learning algorithms to capture the sophisticated\ndynamics of the market. However, there is a threshold where increased\ncomplexity fails to yield substantial improvements. In this work, we propose an\nalternative approach by introducing an adaptive standardisation to mitigate the\neffects of dataset shifts that commonly occur in the market. By doing so,\nlearning algorithms can prioritize uncovering the true relationship between the\ntarget variable and the explanatory variables. We investigate four distinct\nmarkets, including two novel datasets, previously unexplored in the literature.\nThese datasets provide a more realistic representation of the current market\ncontext that conventional datasets do not show. The results demonstrate a\nsignificant improvement across all four markets, using learning algorithms that\nare less complex yet widely accepted in the literature. 
This significant\nadvancement opens up new lines of research in this field, highlighting\nthe potential of adaptive transformations in enhancing the performance of\nforecasting models."}, "http://arxiv.org/abs/2311.02634": {"title": "Pointwise Data Depth for Univariate and Multivariate Functional Outlier Detection", "link": "http://arxiv.org/abs/2311.02634", "description": "Data depth is an efficient tool for robustly summarizing the distribution of\nfunctional data and detecting potential magnitude and shape outliers. Commonly\nused functional data depth notions, such as the modified band depth and\nextremal depth, are estimated from pointwise depth for each observed functional\nobservation. However, these techniques require calculating a single depth\nvalue for each functional observation, which may not be sufficient to\ncharacterize the distribution of the functional data and detect potential\noutliers. This paper presents an innovative approach to make the best use of\npointwise depth. We propose using the pointwise depth distribution for\nmagnitude outlier visualization and the correlation between pairwise depth for\nshape outlier detection. Furthermore, a bootstrap-based testing procedure has\nbeen introduced for the correlation to test whether there is any shape outlier.\nThe proposed univariate methods are then extended to bivariate functional data.\nThe performance of the proposed methods is examined and compared to\nconventional outlier detection techniques by intensive simulation studies. In\naddition, the developed methods are applied to simulated solar energy datasets\nfrom a photovoltaic system. Results revealed that the proposed method offers\nsuperior detection performance over conventional techniques. These findings\nwill benefit engineers and practitioners in monitoring photovoltaic systems by\ndetecting unnoticed anomalies and outliers."}, "http://arxiv.org/abs/2311.02658": {"title": "Nonparametric Estimation and Comparison of Distance Distributions from Censored Data", "link": "http://arxiv.org/abs/2311.02658", "description": "Transportation distance information is a powerful resource, but location\nrecords are often censored due to privacy concerns or regulatory mandates. We\nconsider the problem of transportation event distance distribution\nreconstruction, which aims to handle this obstacle and has applications to\npublic health informatics, logistics, and more. We propose numerical methods to\napproximate, sample from, and compare distributions of distances between\ncensored location pairs. We validate empirically and demonstrate applicability\nto practical geospatial data analysis tasks. Our code is available on GitHub."}, "http://arxiv.org/abs/2311.02766": {"title": "Riemannian Laplace Approximation with the Fisher Metric", "link": "http://arxiv.org/abs/2311.02766", "description": "Laplace's method approximates a target density with a Gaussian\ndistribution at its mode. It is computationally efficient and asymptotically\nexact for Bayesian inference due to the Bernstein-von Mises theorem, but for\ncomplex targets and finite-data posteriors it is often too crude an\napproximation. 
A recent generalization of the Laplace Approximation transforms\nthe Gaussian approximation according to a chosen Riemannian geometry providing\na richer approximation family, while still retaining computational efficiency.\nHowever, as shown here, its properties heavily depend on the chosen metric,\nindeed the metric adopted in previous work results in approximations that are\noverly narrow as well as being biased even at the limit of infinite data. We\ncorrect this shortcoming by developing the approximation family further,\nderiving two alternative variants that are exact at the limit of infinite data,\nextending the theoretical analysis of the method, and demonstrating practical\nimprovements in a range of experiments."}, "http://arxiv.org/abs/2311.02808": {"title": "Nonparametric Estimation of Conditional Copula using Smoothed Checkerboard Bernstein Sieves", "link": "http://arxiv.org/abs/2311.02808", "description": "Conditional copulas are useful tools for modeling the dependence between\nmultiple response variables that may vary with a given set of predictor\nvariables. Conditional dependence measures such as conditional Kendall's tau\nand Spearman's rho that can be expressed as functionals of the conditional\ncopula are often used to evaluate the strength of dependence conditioning on\nthe covariates. In general, semiparametric estimation methods of conditional\ncopulas rely on an assumed parametric copula family where the copula parameter\nis assumed to be a function of the covariates. The functional relationship can\nbe estimated nonparametrically using different techniques but it is required to\nchoose an appropriate copula model from various candidate families. In this\npaper, by employing the empirical checkerboard Bernstein copula (ECBC)\nestimator we propose a fully nonparametric approach for estimating conditional\ncopulas, which doesn't require any selection of parametric copula models.\nClosed-form estimates of the conditional dependence measures are derived\ndirectly from the proposed ECBC-based conditional copula estimator. We provide\nthe large-sample consistency of the proposed estimator as well as the estimates\nof conditional dependence measures. The finite-sample performance of the\nproposed estimator and comparison with semiparametric methods are investigated\nthrough simulation studies. An application to real case studies is also\nprovided."}, "http://arxiv.org/abs/2311.02822": {"title": "Robust estimation of heteroscedastic regression models: a brief overview and new proposals", "link": "http://arxiv.org/abs/2311.02822", "description": "We collect robust proposals given in the field of regression models with\nheteroscedastic errors. Our motivation stems from the fact that the\npractitioner frequently faces the confluence of two phenomena in the context of\ndata analysis: non--linearity and heteroscedasticity. The impact of\nheteroscedasticity on the precision of the estimators is well--known, however\nthe conjunction of these two phenomena makes handling outliers more difficult.\n\nAn iterative procedure to estimate the parameters of a heteroscedastic\nnon--linear model is considered. 
The studied estimators combine weighted\n$MM-$regression estimators, to control the impact of high leverage points, and\na robust method to estimate the parameters of the variance function."}, "http://arxiv.org/abs/2311.03247": {"title": "Multivariate selfsimilarity: Multiscale eigen-structures for selfsimilarity parameter estimation", "link": "http://arxiv.org/abs/2311.03247", "description": "Scale-free dynamics, formalized by selfsimilarity, provides a versatile\nparadigm massively and ubiquitously used to model temporal dynamics in\nreal-world data. However, its practical use has mostly remained univariate so\nfar. By contrast, modern applications often demand multivariate data analysis.\nAccordingly, models for multivariate selfsimilarity were recently proposed.\nNevertheless, they have remained rarely used in practice because of a lack of\navailable robust estimation procedures for the vector of selfsimilarity\nparameters. Building upon recent mathematical developments, the present work\nputs forth an efficient estimation procedure based on the theoretical study of\nthe multiscale eigenstructure of the wavelet spectrum of multivariate\nselfsimilar processes. The estimation performance is studied theoretically in\nthe asymptotic limits of large scale and sample sizes, and computationally for\nfinite-size samples. As a practical outcome, a fully operational and documented\nmultivariate signal processing estimation toolbox is made freely available and\nis ready for practical use on real-world data. Its potential benefits are\nillustrated in epileptic seizure prediction from multi-channel EEG data."}, "http://arxiv.org/abs/2311.03289": {"title": "Batch effect correction with sample remeasurement in highly confounded case-control studies", "link": "http://arxiv.org/abs/2311.03289", "description": "Batch effects are pervasive in biomedical studies. One approach to address\nthe batch effects is repeatedly measuring a subset of samples in each batch.\nThese remeasured samples are used to estimate and correct the batch effects.\nHowever, rigorous statistical methods for batch effect correction with\nremeasured samples are severely under-developed. In this study, we developed a\nframework for batch effect correction using remeasured samples in highly\nconfounded case-control studies. We provided theoretical analyses of the\nproposed procedure, evaluated its power characteristics, and provided a power\ncalculation tool to aid in the study design. We found that the number of\nsamples that need to be remeasured depends strongly on the between-batch\ncorrelation. When the correlation is high, remeasuring a small subset of\nsamples can rescue most of the power."}, "http://arxiv.org/abs/2311.03343": {"title": "Distribution-uniform anytime-valid inference", "link": "http://arxiv.org/abs/2311.03343", "description": "Are asymptotic confidence sequences and anytime $p$-values uniformly valid\nfor a nontrivial class of distributions $\mathcal{P}$? We give a positive\nanswer to this question by deriving distribution-uniform anytime-valid\ninference procedures. Historically, anytime-valid methods -- including\nconfidence sequences, anytime $p$-values, and sequential hypothesis tests that\nenable inference at stopping times -- have been justified nonasymptotically.\nNevertheless, asymptotic procedures such as those based on the central limit\ntheorem occupy an important part of the statistical\ntoolbox due to their\nsimplicity, universality, and weak assumptions. 
While recent work has derived\nasymptotic analogues of anytime-valid methods with the aforementioned benefits,\nthese were not shown to be $\\mathcal{P}$-uniform, meaning that their\nasymptotics are not uniformly valid in a class of distributions $\\mathcal{P}$.\nIndeed, the anytime-valid inference literature currently has no central limit\ntheory to draw from that is both uniform in $\\mathcal{P}$ and in the sample\nsize $n$. This paper fills that gap by deriving a novel $\\mathcal{P}$-uniform\nstrong Gaussian approximation theorem, enabling $\\mathcal{P}$-uniform\nanytime-valid inference for the first time. Along the way, our Gaussian\napproximation also yields a $\\mathcal{P}$-uniform law of the iterated\nlogarithm."}, "http://arxiv.org/abs/2009.10780": {"title": "Independent finite approximations for Bayesian nonparametric inference", "link": "http://arxiv.org/abs/2009.10780", "description": "Completely random measures (CRMs) and their normalizations (NCRMs) offer\nflexible models in Bayesian nonparametrics. But their infinite dimensionality\npresents challenges for inference. Two popular finite approximations are\ntruncated finite approximations (TFAs) and independent finite approximations\n(IFAs). While the former have been well-studied, IFAs lack similarly general\nbounds on approximation error, and there has been no systematic comparison\nbetween the two options. In the present work, we propose a general recipe to\nconstruct practical finite-dimensional approximations for homogeneous CRMs and\nNCRMs, in the presence or absence of power laws. We call our construction the\nautomated independent finite approximation (AIFA). Relative to TFAs, we show\nthat AIFAs facilitate more straightforward derivations and use of parallel\ncomputing in approximate inference. We upper bound the approximation error of\nAIFAs for a wide class of common CRMs and NCRMs -- and thereby develop\nguidelines for choosing the approximation level. Our lower bounds in key cases\nsuggest that our upper bounds are tight. We prove that, for worst-case choices\nof observation likelihoods, TFAs are more efficient than AIFAs. Conversely, we\nfind that in real-data experiments with standard likelihoods, AIFAs and TFAs\nperform similarly. Moreover, we demonstrate that AIFAs can be used for\nhyperparameter estimation even when other potential IFA options struggle or do\nnot apply."}, "http://arxiv.org/abs/2111.07517": {"title": "Correlation Improves Group Testing: Capturing the Dilution Effect", "link": "http://arxiv.org/abs/2111.07517", "description": "Population-wide screening to identify and isolate infectious individuals is a\npowerful tool for controlling COVID-19 and other infectious diseases. Group\ntesting can enable such screening despite limited testing resources. Samples'\nviral loads are often positively correlated, either because prevalence and\nsample collection are both correlated with geography, or through intentional\nenhancement, e.g., by pooling samples from people in similar risk groups. Such\ncorrelation is known to improve test efficiency in mathematical models with\nfixed sensitivity. In reality, however, dilution degrades a pooled test's\nsensitivity by an amount that varies with the number of positives in the pool.\nIn the presence of this dilution effect, we study the impact of correlation on\nthe most widely-used group testing procedure, the Dorfman procedure. We show\nthat correlation's effects are significantly altered by the dilution effect. 
We\nprove that under a general correlation structure, pooling correlated samples\ntogether (called correlated pooling) achieves higher sensitivity but can\ndegrade test efficiency compared to independently pooling the samples (called\nnaive pooling) using the same pool size. We identify an alternative measure of\ntest resource usage, the number of positives found per test consumed, which we\nargue is better aligned with infection control, and show that correlated\npooling outperforms naive pooling on this measure. We build a realistic\nagent-based simulation to contextualize our theoretical results within an\nepidemic control framework. We argue that the dilution effect makes it even\nmore important for policy-makers evaluating group testing protocols for\nlarge-scale screening to incorporate naturally arising correlation and to\nintentionally maximize correlation."}, "http://arxiv.org/abs/2202.05349": {"title": "Robust Parameter Estimation for the Lee-Carter Family: A Probabilistic Principal Component Approach", "link": "http://arxiv.org/abs/2202.05349", "description": "The well-known Lee-Carter model uses a bilinear form\n$\\log(m_{x,t})=a_x+b_xk_t$ to represent the log mortality rate and has been\nwidely researched and developed over the past thirty years. However, little\nattention has been paid to the robustness of the parameters against\noutliers, especially when estimating $b_x$. In response, we propose a robust\nestimation method for a wide family of Lee-Carter-type models, treating the\nproblem as a Probabilistic Principal Component Analysis (PPCA) with\nmultivariate $t$-distributions. An efficient Expectation-Maximization (EM)\nalgorithm is also derived for implementation.\n\nThe benefits of the method are threefold: 1) it produces more robust\nestimates of both $b_x$ and $k_t$, 2) it can be naturally extended to a large\nfamily of Lee-Carter type models, including those for modelling multiple\npopulations, and 3) it can be integrated with other existing time series models\nfor $k_t$. Using numerical studies based on United States mortality data from\nthe Human Mortality Database, we show that the proposed model performs more robustly\nthan conventional methods in the presence of outliers."}, "http://arxiv.org/abs/2204.11979": {"title": "Semi-Parametric Sensitivity Analysis for Trials with Irregular and Informative Assessment Times", "link": "http://arxiv.org/abs/2204.11979", "description": "Many trials are designed to collect outcomes at or around pre-specified times\nafter randomization. In practice, there can be substantial variability in the\ntimes when participants are actually assessed. Such irregular assessment times\npose a challenge to learning the effect of treatment since not all participants\nhave outcome assessments at the times of interest. Furthermore, observed\noutcome values may not be representative of all participants' outcomes at a\ngiven time. This problem, known as informative assessment times, can arise if\nparticipants tend to have assessments when their outcomes are better (or worse)\nthan at other times, or if participants with better outcomes tend to have more\n(or fewer) assessments. Methods have been developed that account for some types\nof informative assessment; however, since these methods rely on untestable\nassumptions, sensitivity analyses are needed. We develop a sensitivity analysis\nmethodology by extending existing weighting methods. 
Our method accounts for\nthe possibility that participants with worse outcomes at a given time are more\n(or less) likely than other participants to have an assessment at that time,\neven after controlling for variables observed earlier in the study. We apply\nour method to a randomized trial of low-income individuals with uncontrolled\nasthma. We illustrate implementation of our influence-function based estimation\nprocedure in detail, and we derive the large-sample distribution of our\nestimator and evaluate its finite-sample performance."}, "http://arxiv.org/abs/2205.13935": {"title": "Detecting hidden confounding in observational data using multiple environments", "link": "http://arxiv.org/abs/2205.13935", "description": "A common assumption in causal inference from observational data is that there\nis no hidden confounding. Yet it is, in general, impossible to verify this\nassumption from a single dataset. Under the assumption of independent causal\nmechanisms underlying the data-generating process, we demonstrate a way to\ndetect unobserved confounders when having multiple observational datasets\ncoming from different environments. We present a theory for testable\nconditional independencies that are only absent when there is hidden\nconfounding and examine cases where we violate its assumptions: degenerate &\ndependent mechanisms, and faithfulness violations. Additionally, we propose a\nprocedure to test these independencies and study its empirical finite-sample\nbehavior using simulation studies and semi-synthetic data based on a real-world\ndataset. In most cases, the proposed procedure correctly predicts the presence\nof hidden confounding, particularly when the confounding bias is large."}, "http://arxiv.org/abs/2206.09444": {"title": "Bayesian non-conjugate regression via variational message passing", "link": "http://arxiv.org/abs/2206.09444", "description": "Variational inference is a popular method for approximating the posterior\ndistribution of hierarchical Bayesian models. It is well-recognized in the\nliterature that the choice of the approximation family and the regularity\nproperties of the posterior strongly influence the efficiency and accuracy of\nvariational methods. While model-specific conjugate approximations offer\nsimplicity, they often converge slowly and may yield poor approximations.\nNon-conjugate approximations instead are more flexible but typically require\nthe calculation of expensive multidimensional integrals. This study focuses on\nBayesian regression models that use possibly non-differentiable loss functions\nto measure prediction misfit. The data behavior is modeled using a linear\npredictor, potentially transformed using a bijective link function. Examples\ninclude generalized linear models, mixed additive models, support vector\nmachines, and quantile regression. To address the limitations of non-conjugate\nsettings, the study proposes an efficient non-conjugate variational message\npassing method for approximate posterior inference, which only requires the\ncalculation of univariate numerical integrals when analytical solutions are not\navailable. The approach does not require differentiability, conjugacy, or\nmodel-specific data-augmentation strategies, thereby naturally extending to\nmodels with non-conjugate likelihood functions. Additionally, a stochastic\nimplementation is provided to handle large-scale data problems. The proposed\nmethod's performances are evaluated through extensive simulations and real data\nexamples. 
Overall, the results highlight the effectiveness of the proposed\nvariational message passing method, demonstrating its computational efficiency\nand approximation accuracy as an alternative to existing methods in Bayesian\ninference for regression models."}, "http://arxiv.org/abs/2206.10143": {"title": "A Contrastive Approach to Online Change Point Detection", "link": "http://arxiv.org/abs/2206.10143", "description": "We suggest a novel procedure for online change point detection. Our approach\nexpands on the idea of maximizing a discrepancy measure between points from\npre-change and post-change distributions. This leads to a flexible procedure\nsuitable for both parametric and nonparametric scenarios. We prove\nnon-asymptotic bounds on the average running length of the procedure and its\nexpected detection delay. The efficiency of the algorithm is illustrated with\nnumerical experiments on synthetic and real-world data sets."}, "http://arxiv.org/abs/2206.15367": {"title": "Targeted learning in observational studies with multi-valued treatments: An evaluation of antipsychotic drug treatment safety", "link": "http://arxiv.org/abs/2206.15367", "description": "We investigate estimation of causal effects of multiple competing\n(multi-valued) treatments in the absence of randomization. Our work is\nmotivated by an intention-to-treat study of the relative cardiometabolic risk\nof assignment to one of six commonly prescribed antipsychotic drugs in a cohort\nof nearly 39,000 adults with serious mental illness. Doubly-robust\nestimators, such as targeted minimum loss-based estimation (TMLE), require\ncorrect specification of either the treatment model or outcome model to ensure\nconsistent estimation; however, common TMLE implementations estimate treatment\nprobabilities using multiple binomial regressions rather than multinomial\nregression. We implement a TMLE estimator that uses multinomial treatment\nassignment and ensemble machine learning to estimate average treatment effects.\nOur multinomial implementation improves coverage, but does not necessarily\nreduce bias, relative to the binomial implementation in simulation experiments\nwith varying treatment propensity overlap and event rates. Evaluating the\ncausal effects of the antipsychotics on 3-year diabetes risk or death, we find\na safety benefit of moving from a second-generation drug considered among the\nsafest of the second-generation drugs to an infrequently prescribed\nfirst-generation drug thought to pose a generally low cardiometabolic risk."}, "http://arxiv.org/abs/2209.04364": {"title": "Evaluating tests for cluster-randomized trials with few clusters under generalized linear mixed models with covariate adjustment: a simulation study", "link": "http://arxiv.org/abs/2209.04364", "description": "Generalized linear mixed models (GLMM) are commonly used to analyze clustered\ndata, but when the number of clusters is small to moderate, standard\nstatistical tests may produce elevated type I error rates. Small-sample\ncorrections have been proposed for continuous or binary outcomes without\ncovariate adjustment. However, appropriate tests to use for count outcomes or\nunder covariate-adjusted models remain unknown. An important setting in which\nthis issue arises is in cluster-randomized trials (CRTs). 
Because many CRTs\nhave just a few clusters (e.g., clinics or health systems), covariate\nadjustment is particularly critical to address potential chance imbalance\nand/or low power (e.g., adjustment following stratified randomization or for\nthe baseline value of the outcome). We conducted simulations to evaluate\nGLMM-based tests of the treatment effect that account for the small (10) or\nmoderate (20) number of clusters under a parallel-group CRT setting across\nscenarios of covariate adjustment (including adjustment for one or more\nperson-level or cluster-level covariates) for both binary and count outcomes.\nWe find that when the intraclass correlation is non-negligible ($\\geq 0.01$)\nand the number of covariates is small ($\\leq 2$), likelihood ratio tests with a\nbetween-within denominator degree of freedom have type I error rates close to\nthe nominal level. When the number of covariates is moderate ($\\geq 5$), across\nour simulation scenarios, the relative performance of the tests varied\nconsiderably and no method performed uniformly well. Therefore, we recommend\nadjusting for no more than a few covariates and using likelihood ratio tests\nwith a between-within denominator degree of freedom."}, "http://arxiv.org/abs/2211.14578": {"title": "Estimation and inference for transfer learning with high-dimensional quantile regression", "link": "http://arxiv.org/abs/2211.14578", "description": "Transfer learning has become an essential technique to exploit information\nfrom the source domain to boost performance of the target task. Despite the\nprevalence in high-dimensional data, heterogeneity and heavy tails are\ninsufficiently accounted for by current transfer learning approaches and thus\nmay undermine the resulting performance. We propose a transfer learning\nprocedure in the framework of high-dimensional quantile regression models to\naccommodate heterogeneity and heavy tails in the source and target domains. We\nestablish error bounds of transfer learning estimator based on delicately\nselected transferable source domains, showing that lower error bounds can be\nachieved for critical selection criterion and larger sample size of source\ntasks. We further propose valid confidence interval and hypothesis test\nprocedures for individual component of high-dimensional quantile regression\ncoefficients by advocating a double transfer learning estimator, which is\none-step debiased estimator for the transfer learning estimator wherein the\ntechnique of transfer learning is designed again. By adopting data-splitting\ntechnique, we advocate a transferability detection approach that guarantees to\ncircumvent negative transfer and identify transferable sources with high\nprobability. Simulation results demonstrate that the proposed method exhibits\nsome favorable and compelling performances and the practical utility is further\nillustrated by analyzing a real example."}, "http://arxiv.org/abs/2306.03302": {"title": "Statistical Inference Under Constrained Selection Bias", "link": "http://arxiv.org/abs/2306.03302", "description": "Large-scale datasets are increasingly being used to inform decision making.\nWhile this effort aims to ground policy in real-world evidence, challenges have\narisen as selection bias and other forms of distribution shifts often plague\nobservational data. 
Previous attempts to provide robust inference have given\nguarantees depending on a user-specified amount of possible distribution shift\n(e.g., the maximum KL divergence between the observed and target\ndistributions). However, decision makers will often have additional knowledge\nabout the target distribution which constrains the kind of possible shifts. To\nleverage such information, we propose a framework that enables statistical\ninference in the presence of selection bias which obeys user-specified\nconstraints in the form of functions whose expectation is known under the\ntarget distribution. The output is high-probability bounds on the value of an\nestimand for the target distribution. Hence, our method leverages domain\nknowledge in order to partially identify a wide class of estimands. We analyze\nthe computational and statistical properties of methods to estimate these\nbounds and show that our method can produce informative bounds on a variety of\nsimulated and semisynthetic tasks, as well as in a real-world use case."}, "http://arxiv.org/abs/2307.15348": {"title": "Stratified principal component analysis", "link": "http://arxiv.org/abs/2307.15348", "description": "This paper investigates a general family of covariance models with repeated\neigenvalues extending probabilistic principal component analysis (PPCA). A\ngeometric interpretation shows that these models are parameterised by flag\nmanifolds and stratify the space of covariance matrices according to the\nsequence of eigenvalue multiplicities. The subsequent analysis sheds light on\nPPCA and answers an important question on the practical identifiability of\nindividual eigenvectors. It notably shows that one rarely has enough samples to\nfit a covariance model with distinct eigenvalues and that block-averaging the\nadjacent sample eigenvalues with small gaps achieves a better\ncomplexity/goodness-of-fit tradeoff."}, "http://arxiv.org/abs/2308.02005": {"title": "Bias Correction for Randomization-Based Estimation in Inexactly Matched Observational Studies", "link": "http://arxiv.org/abs/2308.02005", "description": "Matching has been widely used to mimic a randomized experiment with\nobservational data. Ideally, treated subjects are exactly matched with controls\nfor the covariates, and randomization-based estimation can then be conducted as\nin a randomized experiment (assuming no unobserved covariates). However, when\nthere are continuous covariates or many covariates, matching typically\nhas to be inexact. Previous studies have routinely ignored inexact matching in\nthe downstream randomization-based estimation as long as some covariate balance\ncriteria are satisfied, which can cause severe estimation bias. Building on the\ncovariate-adaptive randomization inference framework, in this research note, we\npropose two new classes of bias-corrected randomization-based estimators to\nreduce estimation bias due to inexact matching: the bias-corrected maximum\n$p$-value estimator for the constant treatment effect and the bias-corrected\ndifference-in-means estimator for the average treatment effect. Our simulation\nresults show that the proposed bias-corrected estimators can effectively reduce\nestimation bias due to inexact matching."}, "http://arxiv.org/abs/2310.01153": {"title": "Online Permutation Tests: $e$-values and Likelihood Ratios for Testing Group Invariance", "link": "http://arxiv.org/abs/2310.01153", "description": "We develop a flexible online version of the permutation test. 
This allows us\nto test exchangeability as the data is arriving, where we can choose to stop or\ncontinue without invalidating the size of the test. Our methods generalize\nbeyond exchangeability to other forms of invariance under a compact group. Our\napproach relies on constructing an $e$-process that is the running product of\nmultiple conditional $e$-values. To construct $e$-values, we first develop an\nessentially complete class of admissible $e$-values in which one can flexibly\n`plug in' almost any desired test statistic. To make the $e$-values\nconditional, we explore the intersection between the concepts of conditional\ninvariance and sequential invariance, and find that the appropriate conditional\ndistribution can be captured by a compact subgroup. To find powerful $e$-values\nfor given alternatives, we develop the theory of likelihood ratios for testing\ngroup invariance, yielding new optimality results for group invariance tests.\nThese statistics turn out to exist in three different flavors, depending on the\nspace on which we specify our alternative. We apply these statistics to test\nagainst a Gaussian location shift, which yields connections to the $t$-test\nwhen testing sphericity, connections to the softmax function and its\ntemperature when testing exchangeability, and an improved version of a\nknown $e$-value for testing sign-symmetry. Moreover, we introduce an impatience\nparameter that allows users to obtain more power now in exchange for less power\nin the long run."}, "http://arxiv.org/abs/2311.03381": {"title": "Separating and Learning Latent Confounders to Enhancing User Preferences Modeling", "link": "http://arxiv.org/abs/2311.03381", "description": "Recommender models aim to capture user preferences from historical feedback\nand then predict user-specific feedback on candidate items. However, the\npresence of various unmeasured confounders causes deviations between the user\npreferences in the historical feedback and the true preferences, resulting in\nmodels not meeting their expected performance. Existing debiasing models either\n(1) are specific to solving one particular bias or (2) directly obtain auxiliary\ninformation from user historical feedback, and thus cannot identify whether the\nlearned preferences are true user preferences or are mixed with unmeasured\nconfounders. Moreover, we find that the former recommender system is not only a\nsuccessor to unmeasured confounders but also acts as an unmeasured confounder\naffecting user preference modeling, which has always been neglected in previous\nstudies. To this end, we incorporate the effect of the former recommender\nsystem and treat it as a proxy for all unmeasured confounders. We propose a\nnovel framework, \\textbf{S}eparating and \\textbf{L}earning Latent Confounders\n\\textbf{F}or \\textbf{R}ecommendation (\\textbf{SLFR}), which obtains the\nrepresentation of unmeasured confounders to identify the counterfactual\nfeedback by disentangling user preferences and unmeasured confounders, then\nguides the target model to capture the true preferences of users. Extensive\nexperiments on five real-world datasets validate the advantages of our method."}, "http://arxiv.org/abs/2311.03382": {"title": "Causal Structure Representation Learning of Confounders in Latent Space for Recommendation", "link": "http://arxiv.org/abs/2311.03382", "description": "Inferring user preferences from the historical feedback of users is a\nvaluable problem in recommender systems. 
Conventional approaches often rely on\nthe assumption that user preferences in the feedback data are equivalent to the\nreal user preferences without additional noise, which simplifies the problem\nmodeling. However, there are various confounders during user-item interactions,\nsuch as weather and even the recommendation system itself. Therefore,\nneglecting the influence of confounders will result in inaccurate user\npreferences and suboptimal performance of the model. Furthermore, the\nunobservability of confounders poses a challenge in further addressing the\nproblem. To address these issues, we refine the problem and propose a more\nrational solution. Specifically, we consider the influence of confounders,\ndisentangle them from user preferences in the latent space, and employ causal\ngraphs to model their interdependencies without specific labels. By cleverly\ncombining local and global causal graphs, we capture the user-specificity of\nconfounders on user preferences. We theoretically demonstrate the\nidentifiability of the obtained causal graph. Finally, we propose our model\nbased on Variational Autoencoders, named Causal Structure representation\nlearning of Confounders in latent space (CSC). We conducted extensive\nexperiments on one synthetic dataset and five real-world datasets,\ndemonstrating the superiority of our model. Furthermore, we demonstrate that\nthe learned causal representations of confounders are controllable, potentially\noffering users fine-grained control over the objectives of their recommendation\nlists with the learned causal graphs."}, "http://arxiv.org/abs/2311.03554": {"title": "Conditional Randomization Tests for Behavioral and Neural Time Series", "link": "http://arxiv.org/abs/2311.03554", "description": "Randomization tests allow simple and unambiguous tests of null hypotheses, by\ncomparing observed data to a null ensemble in which experimentally-controlled\nvariables are randomly resampled. In behavioral and neuroscience experiments,\nhowever, the stimuli presented often depend on the subject's previous actions,\nso simple randomization tests are not possible. We describe how conditional\nrandomization can be used to perform exact hypothesis tests in this situation,\nand illustrate it with two examples. We contrast conditional randomization with\na related approach of tangent randomization, in which stimuli are resampled\nbased only on events occurring in the past, which is not valid for all choices\nof test statistic. We discuss how to design experiments that allow conditional\nrandomization tests to be used."}, "http://arxiv.org/abs/2311.03630": {"title": "Counterfactual Data Augmentation with Contrastive Learning", "link": "http://arxiv.org/abs/2311.03630", "description": "Statistical disparity between distinct treatment groups is one of the most\nsignificant challenges for estimating Conditional Average Treatment Effects\n(CATE). To address this, we introduce a model-agnostic data augmentation method\nthat imputes the counterfactual outcomes for a selected subset of individuals.\nSpecifically, we utilize contrastive learning to learn a representation space\nand a similarity measure such that in the learned representation space close\nindividuals identified by the learned similarity measure have similar potential\noutcomes. 
This property ensures reliable imputation of counterfactual outcomes\nfor the individuals with close neighbors from the alternative treatment group.\nBy augmenting the original dataset with these reliable imputations, we can\neffectively reduce the discrepancy between different treatment groups, while\ninducing minimal imputation error. The augmented dataset is subsequently\nemployed to train CATE estimation models. Theoretical analysis and experimental\nstudies on synthetic and semi-synthetic benchmarks demonstrate that our method\nachieves significant improvements in both performance and robustness to\noverfitting across state-of-the-art models."}, "http://arxiv.org/abs/2311.03644": {"title": "BOB: Bayesian Optimized Bootstrap with Applications to Gaussian Mixture Models", "link": "http://arxiv.org/abs/2311.03644", "description": "Sampling from the joint posterior distribution of Gaussian mixture models\n(GMMs) via standard Markov chain Monte Carlo (MCMC) imposes several\ncomputational challenges, which have prevented a broader full Bayesian\nimplementation of these models. A growing body of literature has introduced the\nWeighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as\nalternatives to MCMC sampling. The core idea of these methods is to repeatedly\ncompute maximum a posteriori (MAP) estimates on many randomly weighted\nposterior densities. These MAP estimates then can be treated as approximate\nposterior draws. Nonetheless, a central question remains unanswered: How to\nselect the distribution of the random weights under arbitrary sample sizes.\nThus, we introduce the Bayesian Optimized Bootstrap (BOB), a computational\nmethod to automatically select the weights distribution by minimizing, through\nBayesian Optimization, a black-box and noisy version of the reverse KL\ndivergence between the Bayesian posterior and an approximate posterior obtained\nvia random weighting. Our proposed method allows for uncertainty\nquantification, approximate posterior sampling, and embraces recent\ndevelopments in parallel computing. We show that BOB outperforms competing\napproaches in recovering the Bayesian posterior, while retaining key\ntheoretical properties from existing methods. BOB's performance is demonstrated\nthrough extensive simulations, along with real-world data analyses."}, "http://arxiv.org/abs/2311.03660": {"title": "Sampling via F\\\"ollmer Flow", "link": "http://arxiv.org/abs/2311.03660", "description": "We introduce a novel unit-time ordinary differential equation (ODE) flow\ncalled the preconditioned F\\\"{o}llmer flow, which efficiently transforms a\nGaussian measure into a desired target measure at time 1. To discretize the\nflow, we apply Euler's method, where the velocity field is calculated either\nanalytically or through Monte Carlo approximation using Gaussian samples. Under\nreasonable conditions, we derive a non-asymptotic error bound in the\nWasserstein distance between the sampling distribution and the target\ndistribution. Through numerical experiments on mixture distributions in 1D, 2D,\nand high-dimensional spaces, we demonstrate that the samples generated by our\nproposed flow exhibit higher quality compared to those obtained by several\nexisting methods. Furthermore, we propose leveraging the F\\\"{o}llmer flow as a\nwarmstart strategy for existing Markov Chain Monte Carlo (MCMC) methods, aiming\nto mitigate mode collapses and enhance their performance. 
Finally, thanks to\nthe deterministic nature of the F\\\"{o}llmer flow, we can leverage deep neural\nnetworks to fit the trajectory of sample evaluations. This allows us to obtain\na generator for one-step sampling."}, "http://arxiv.org/abs/2311.03763": {"title": "Thresholding the higher criticism test statistics for optimality in a heterogeneous setting", "link": "http://arxiv.org/abs/2311.03763", "description": "Donoho and Kipnis (2022) showed that the higher criticism (HC) test\nstatistic has a non-Gaussian phase transition but remarked that it is probably\nnot optimal in the detection of sparse differences between two large frequency\ntables when the counts are low. The setting can be considered to be\nheterogeneous, with cells containing larger total counts more able to detect\nsmaller differences. We provide a general study here of sparse detection\narising from such heterogeneous settings, and show that optimality of the HC\ntest statistic requires thresholding, for example in the case of frequency\ntable comparison, to restrict to p-values of cells with total counts exceeding\na threshold. The use of thresholding also leads to optimality of the HC test\nstatistic when it is applied to the sparse Poisson means model of Arias-Castro\nand Wang (2015). The phase transitions we consider here are non-Gaussian, and\ninvolve an interplay between the rate functions of the response and sample size\ndistributions. We also show, both theoretically and in a numerical study,\nthat applying thresholding to the Bonferroni test statistic results in better\nsparse mixture detection in heterogeneous settings."}, "http://arxiv.org/abs/2311.03769": {"title": "Nonparametric Screening for Additive Quantile Regression in Ultra-high Dimension", "link": "http://arxiv.org/abs/2311.03769", "description": "In practical applications, one often does not know the \"true\" structure of\nthe underlying conditional quantile function, especially in the ultra-high\ndimensional setting. To deal with ultra-high dimensionality, quantile-adaptive\nmarginal nonparametric screening methods have been recently developed. However,\nthese approaches may miss important covariates that are marginally independent\nof the response, or may select unimportant covariates due to their high\ncorrelations with important covariates. To mitigate such shortcomings, we\ndevelop a conditional nonparametric quantile screening procedure (complemented\nby subsequent selection) for nonparametric additive quantile regression models.\nUnder some mild conditions, we show that the proposed screening method can\nidentify all relevant covariates in a small number of steps with probability\napproaching one. The subsequent narrowed best subset (via a modified Bayesian\ninformation criterion) also contains all the relevant covariates with\noverwhelming probability. The advantages of our proposed procedure are\ndemonstrated through simulation studies and a real data example."}, "http://arxiv.org/abs/2311.03829": {"title": "Multilevel mixtures of latent trait analyzers for clustering multi-layer bipartite networks", "link": "http://arxiv.org/abs/2311.03829", "description": "Within network data analysis, bipartite networks represent a particular type\nof network where relationships occur between two disjoint sets of nodes,\nformally called sending and receiving nodes. 
In this context, sending nodes may\nbe organized into layers on the basis of some defined characteristics,\nresulting in a special case of a multilayer bipartite network, where each layer\nincludes a specific set of sending nodes. To perform a clustering of sending\nnodes in a multi-layer bipartite network, we extend the Mixture of Latent Trait\nAnalyzers (MLTA), also taking into account the influence of concomitant\nvariables on clustering formation and the multi-layer structure of the data. To\nthis end, a multilevel approach offers a useful methodological tool to properly\naccount for the hierarchical structure of the data and for the unobserved\nsources of heterogeneity at multiple levels. A simulation study is conducted to\ntest the performance of the proposal in terms of parameter and clustering\nrecovery. Furthermore, the model is applied to the European Social Survey data\n(ESS) to i) perform a clustering of individuals (sending nodes) based on their\ndigital skills (receiving nodes); ii) understand how socio-economic and\ndemographic characteristics influence the individual digitalization level; iii)\naccount for the multilevel structure of the data; iv) obtain a clustering of\ncountries in terms of the baseline attitude to digital technologies of their\nresidents."}, "http://arxiv.org/abs/2311.03989": {"title": "Learned Causal Method Prediction", "link": "http://arxiv.org/abs/2311.03989", "description": "For a given causal question, it is important to efficiently decide which\ncausal inference method to use for a given dataset. This is challenging because\ncausal methods typically rely on complex and difficult-to-verify assumptions,\nand cross-validation is not applicable since ground truth causal quantities are\nunobserved. In this work, we propose CAusal Method Predictor (CAMP), a framework\nfor predicting the best method for a given dataset. To this end, we generate\ndatasets from a diverse set of synthetic causal models, score the candidate\nmethods, and train a model to directly predict the highest-scoring method for\nthat dataset. Next, by formulating a self-supervised pre-training objective\ncentered on dataset assumptions relevant for causal inference, we significantly\nreduce the need for costly labeled data and enhance training efficiency. Our\nstrategy learns to map implicit dataset properties to the best method in a\ndata-driven manner. In our experiments, we focus on method prediction for\ncausal discovery. CAMP outperforms selecting any individual candidate method\nand demonstrates promising generalization to unseen semi-synthetic and\nreal-world benchmarks."}, "http://arxiv.org/abs/2311.04017": {"title": "Multivariate quantile-based permutation tests with application to functional data", "link": "http://arxiv.org/abs/2311.04017", "description": "Permutation tests enable testing statistical hypotheses in situations when\nthe distribution of the test statistic is complicated or not available. In some\nsituations, the test statistic under investigation is multivariate, with the\nmultiple testing problem being an important example. The corresponding\nmultivariate permutation tests are then typically based on a\nsuitable one-dimensional transformation of the vector of partial permutation\np-values via so-called combining functions. This paper proposes a new approach\nthat utilizes the optimal measure transportation concept. The final single\np-value is computed from the empirical center-outward distribution function of\nthe permuted multivariate test statistics. 
This method avoids computation of\nthe partial p-values and is easy to implement. In addition, it allows\none to compute and interpret the contributions of the components of the multivariate\ntest statistic to the non-conformity score and to the rejection of the null\nhypothesis. Apart from this method, measure transportation is also applied\nto the vector of partial p-values as an alternative to the classical combining\nfunctions. Both techniques are compared with the standard approaches using\nvarious practical examples in a Monte Carlo study. An application to a\nfunctional data set is provided as well."}, "http://arxiv.org/abs/2311.04037": {"title": "Causal Discovery Under Local Privacy", "link": "http://arxiv.org/abs/2311.04037", "description": "Differential privacy is a widely adopted framework designed to safeguard the\nsensitive information of data providers within a data set. It is based on the\napplication of controlled noise at the interface between the server that stores\nand processes the data, and the data consumers. Local differential privacy is a\nvariant that allows data providers to apply the privatization mechanism\nthemselves on their data individually. Therefore, it also provides protection in\ncontexts in which the server, or even the data collector, cannot be trusted.\nThe introduction of noise, however, inevitably affects the utility of the data,\nparticularly by distorting the correlations between individual data components.\nThis distortion can prove detrimental to tasks such as causal discovery. In\nthis paper, we consider various well-known locally differentially private\nmechanisms and compare the trade-off between the privacy they provide, and the\naccuracy of the causal structure produced by algorithms for causal learning\nwhen applied to data obfuscated by these mechanisms. Our analysis yields\nvaluable insights for selecting appropriate local differentially private\nprotocols for causal discovery tasks. We foresee that our findings will aid\nresearchers and practitioners in conducting locally private causal discovery."}, "http://arxiv.org/abs/2311.04103": {"title": "Joint modelling of recurrent and terminal events with discretely-distributed non-parametric frailty: application on re-hospitalizations and death in heart failure patients", "link": "http://arxiv.org/abs/2311.04103", "description": "In the context of clinical and biomedical studies, joint frailty models have\nbeen developed to study the joint temporal evolution of recurrent and terminal\nevents, capturing both the heterogeneous susceptibility to experiencing a new\nepisode and the dependence between the two processes. While\ndiscretely-distributed frailty is usually more exploitable by clinicians and\nhealthcare providers, existing literature on joint frailty models predominantly\nassumes continuous distributions for the random effects. In this article, we\npresent a novel joint frailty model that assumes bivariate\ndiscretely-distributed non-parametric frailties, with an unknown finite number\nof mass points. This approach facilitates the identification of latent\nstructures among subjects, grouping them into sub-populations defined by a\nshared frailty value. We propose an estimation routine via an\nExpectation-Maximization algorithm, which not only estimates the number of\nsubgroups but also serves as an unsupervised classification tool. This work is\nmotivated by a study of patients with Heart Failure (HF) receiving ACE\ninhibitor treatment in the Lombardia region of Italy. 
Recurrent events of\ninterest are hospitalizations due to HF and terminal event is death for any\ncause."}, "http://arxiv.org/abs/2311.04159": {"title": "Statistical Inference on Simulation Output: Batching as an Inferential Device", "link": "http://arxiv.org/abs/2311.04159", "description": "We present {batching} as an omnibus device for statistical inference on\nsimulation output. We consider the classical context of a simulationist\nperforming statistical inference on an estimator $\\theta_n$ (of an unknown\nfixed quantity $\\theta$) using only the output data $(Y_1,Y_2,\\ldots,Y_n)$\ngathered from a simulation. By \\emph{statistical inference}, we mean\napproximating the sampling distribution of the error $\\theta_n-\\theta$ toward:\n(A) estimating an ``assessment'' functional $\\psi$, e.g., bias, variance, or\nquantile; or (B) constructing a $(1-\\alpha)$-confidence region on $\\theta$. We\nargue that batching is a remarkably simple and effective inference device that\nis especially suited for handling dependent output data such as what one\nfrequently encounters in simulation contexts. We demonstrate that if the number\nof batches and the extent of their overlap are chosen correctly, batching\nretains bootstrap's attractive theoretical properties of {strong consistency}\nand {higher-order accuracy}. For constructing confidence regions, we\ncharacterize two limiting distributions associated with a Studentized\nstatistic. Our extensive numerical experience confirms theoretical insight,\nespecially about the effects of batch size and batch overlap."}, "http://arxiv.org/abs/2002.03355": {"title": "Scalable Function-on-Scalar Quantile Regression for Densely Sampled Functional Data", "link": "http://arxiv.org/abs/2002.03355", "description": "Functional quantile regression (FQR) is a useful alternative to mean\nregression for functional data as it provides a comprehensive understanding of\nhow scalar predictors influence the conditional distribution of functional\nresponses. In this article, we study the FQR model for densely sampled,\nhigh-dimensional functional data without relying on parametric error or\nindependent stochastic process assumptions, with the focus on statistical\ninference under this challenging regime along with scalable implementation.\nThis is achieved by a simple but powerful distributed strategy, in which we\nfirst perform separate quantile regression to compute $M$-estimators at each\nsampling location, and then carry out estimation and inference for the entire\ncoefficient functions by properly exploiting the uncertainty quantification and\ndependence structure of $M$-estimators. We derive a uniform Bahadur\nrepresentation and a strong Gaussian approximation result for the\n$M$-estimators on the discrete sampling grid, leading to dimension reduction\nand serving as the basis for inference. An interpolation-based estimator with\nminimax optimality is proposed, and large sample properties for point and\nsimultaneous interval estimators are established. The obtained minimax optimal\nrate under the FQR model shows an interesting phase transition phenomenon that\nhas been previously observed in functional mean regression. 
The proposed\nmethods are illustrated via simulations and an application to a mass\nspectrometry proteomics dataset."}, "http://arxiv.org/abs/2201.06110": {"title": "FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series", "link": "http://arxiv.org/abs/2201.06110", "description": "We propose FNETS, a methodology for network estimation and forecasting of\nhigh-dimensional time series exhibiting strong serial- and cross-sectional\ncorrelations. We operate under a factor-adjusted vector autoregressive (VAR)\nmodel which, after accounting for pervasive co-movements of the variables by\n{\\it common} factors, models the remaining {\\it idiosyncratic} dynamic\ndependence between the variables as a sparse VAR process. Network estimation of\nFNETS consists of three steps: (i) factor-adjustment via dynamic principal\ncomponent analysis, (ii) estimation of the latent VAR process via\n$\\ell_1$-regularised Yule-Walker estimator, and (iii) estimation of partial\ncorrelation and long-run partial correlation matrices. In doing so, we learn\nthree networks underpinning the VAR process, namely a directed network\nrepresenting the Granger causal linkages between the variables, an undirected\none embedding their contemporaneous relationships and finally, an undirected\nnetwork that summarises both lead-lag and contemporaneous linkages. In\naddition, FNETS provides a suite of methods for forecasting the factor-driven\nand the idiosyncratic VAR processes. Under general conditions permitting tails\nheavier than the Gaussian one, we derive uniform consistency rates for the\nestimators in both network estimation and forecasting, which hold as the\ndimension of the panel and the sample size diverge. Simulation studies and real\ndata application confirm the good performance of FNETS."}, "http://arxiv.org/abs/2202.01650": {"title": "Exposure Effects on Count Outcomes with Observational Data, with Application to Incarcerated Women", "link": "http://arxiv.org/abs/2202.01650", "description": "Causal inference methods can be applied to estimate the effect of a point\nexposure or treatment on an outcome of interest using data from observational\nstudies. For example, in the Women's Interagency HIV Study, it is of interest\nto understand the effects of incarceration on the number of sexual partners and\nthe number of cigarettes smoked after incarceration. In settings like this\nwhere the outcome is a count, the estimand is often the causal mean ratio,\ni.e., the ratio of the counterfactual mean count under exposure to the\ncounterfactual mean count under no exposure. This paper considers estimators of\nthe causal mean ratio based on inverse probability of treatment weights, the\nparametric g-formula, and doubly robust estimation, each of which can account\nfor overdispersion, zero-inflation, and heaping in the measured outcome.\nMethods are compared in simulations and are applied to data from the Women's\nInteragency HIV Study."}, "http://arxiv.org/abs/2208.14960": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case", "link": "http://arxiv.org/abs/2208.14960", "description": "Gaussian processes are arguably the most important class of spatiotemporal\nmodels within machine learning. They encode prior information about the modeled\nfunction and can be used for exact or approximate Bayesian learning. 
In many\napplications, particularly in physical sciences and engineering, but also in\nareas such as geostatistics and neuroscience, invariance to symmetries is one\nof the most fundamental forms of prior information one can consider. The\ninvariance of a Gaussian process' covariance to such symmetries gives rise to\nthe most natural generalization of the concept of stationarity to such spaces.\nIn this work, we develop constructive and practical techniques for building\nstationary Gaussian processes on a very large class of non-Euclidean spaces\narising in the context of symmetries. Our techniques make it possible to (i)\ncalculate covariance kernels and (ii) sample from prior and posterior Gaussian\nprocesses defined on such spaces, both in a practical manner. This work is\nsplit into two parts, each involving different technical considerations: part I\nstudies compact spaces, while part II studies non-compact spaces possessing\ncertain structure. Our contributions make the non-Euclidean Gaussian process\nmodels we study compatible with well-understood computational techniques\navailable in standard Gaussian process software packages, thereby making them\naccessible to practitioners."}, "http://arxiv.org/abs/2210.14455": {"title": "Asymmetric predictability in causal discovery: an information theoretic approach", "link": "http://arxiv.org/abs/2210.14455", "description": "Causal investigations in observational studies pose a great challenge in\nresearch where randomized trials or intervention-based studies are not\nfeasible. We develop an information geometric causal discovery and inference\nframework of \"predictive asymmetry\". For $(X, Y)$, predictive asymmetry enables\nassessment of whether $X$ is more likely to cause $Y$ or vice-versa. The\nasymmetry between cause and effect becomes particularly simple if $X$ and $Y$\nare deterministically related. We propose a new metric called the Directed\nMutual Information ($DMI$) and establish its key statistical properties. $DMI$\nis not only able to detect complex non-linear association patterns in bivariate\ndata, but also is able to detect and infer causal relations. Our proposed\nmethodology relies on scalable non-parametric density estimation using Fourier\ntransform. The resulting estimation method is manyfold faster than the\nclassical bandwidth-based density estimation. We investigate key asymptotic\nproperties of the $DMI$ methodology and a data-splitting technique is utilized\nto facilitate causal inference using the $DMI$. Through simulation studies and\nan application, we illustrate the performance of $DMI$."}, "http://arxiv.org/abs/2211.01227": {"title": "Conformalized survival analysis with adaptive cutoffs", "link": "http://arxiv.org/abs/2211.01227", "description": "This paper introduces an assumption-lean method that constructs valid and\nefficient lower predictive bounds (LPBs) for survival times with censored data.\nWe build on recent work by Cand\\`es et al. (2021), whose approach first subsets\nthe data to discard any data points with early censoring times, and then uses a\nreweighting technique (namely, weighted conformal inference (Tibshirani et al.,\n2019)) to correct for the distribution shift introduced by this subsetting\nprocedure.\n\nFor our new method, instead of constraining to a fixed threshold for the\ncensoring time when subsetting the data, we allow for a covariate-dependent and\ndata-adaptive subsetting step, which is better able to capture the\nheterogeneity of the censoring mechanism. 
As a result, our method can lead to\nLPBs that are less conservative and give more accurate information. We show\nthat in the Type I right-censoring setting, if either of the censoring\nmechanism or the conditional quantile of survival time is well estimated, our\nproposed procedure achieves nearly exact marginal coverage, where in the latter\ncase we additionally have approximate conditional coverage. We evaluate the\nvalidity and efficiency of our proposed algorithm in numerical experiments,\nillustrating its advantage when compared with other competing methods. Finally,\nour method is applied to a real dataset to generate LPBs for users' active\ntimes on a mobile app."}, "http://arxiv.org/abs/2211.07451": {"title": "Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain", "link": "http://arxiv.org/abs/2211.07451", "description": "Forecasts of regional electricity net-demand, consumption minus embedded\ngeneration, are an essential input for reliable and economic power system\noperation, and energy trading. While such forecasts are typically performed\nregion by region, operations such as managing power flows require spatially\ncoherent joint forecasts, which account for cross-regional dependencies. Here,\nwe forecast the joint distribution of net-demand across the 14 regions\nconstituting Great Britain's electricity network. Joint modelling is\ncomplicated by the fact that the net-demand variability within each region, and\nthe dependencies between regions, vary with temporal, socio-economical and\nweather-related factors. We accommodate for these characteristics by proposing\na multivariate Gaussian model based on a modified Cholesky parametrisation,\nwhich allows us to model each unconstrained parameter via an additive model.\nGiven that the number of model parameters and covariates is large, we adopt a\nsemi-automated approach to model selection, based on gradient boosting. In\naddition to comparing the forecasting performance of several versions of the\nproposed model with that of two non-Gaussian copula-based models, we visually\nexplore the model output to interpret how the covariates affect net-demand\nvariability and dependencies.\n\nThe code for reproducing the results in this paper is available at\nhttps://doi.org/10.5281/zenodo.7315105, while methods for building and fitting\nmultivariate Gaussian additive models are provided by the SCM R package,\navailable at https://github.com/VinGioia90/SCM."}, "http://arxiv.org/abs/2211.16182": {"title": "Residual Permutation Test for High-Dimensional Regression Coefficient Testing", "link": "http://arxiv.org/abs/2211.16182", "description": "We consider the problem of testing whether a single coefficient is equal to\nzero in fixed-design linear models under a moderately high-dimensional regime,\nwhere the dimension of covariates $p$ is allowed to be in the same order of\nmagnitude as sample size $n$. In this regime, to achieve finite-population\nvalidity, existing methods usually require strong distributional assumptions on\nthe noise vector (such as Gaussian or rotationally invariant), which limits\ntheir applications in practice. In this paper, we propose a new method, called\nresidual permutation test (RPT), which is constructed by projecting the\nregression residuals onto the space orthogonal to the union of the column\nspaces of the original and permuted design matrices. 
RPT can be proved to\nachieve finite-population size validity under fixed design with just\nexchangeable noise, whenever $p < n / 2$. Moreover, RPT is shown to be\nasymptotically powerful for heavy-tailed noise with bounded $(1+t)$-th order\nmoment when the true coefficient is at least of order $n^{-t/(1+t)}$ for $t \\in\n[0,1]$. We further prove that this signal size requirement is essentially\nrate-optimal in the minimax sense. Numerical studies confirm that RPT performs\nwell in a wide range of simulation settings with normal and heavy-tailed noise\ndistributions."}, "http://arxiv.org/abs/2306.14351": {"title": "Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions", "link": "http://arxiv.org/abs/2306.14351", "description": "The aim of this paper is to make clear and precise the relationship between\nthe Rubin causal model (RCM) and structural causal model (SCM) frameworks for\ncausal inference. Adopting a neutral logical perspective, and drawing on\nprevious work, we show what is required for an RCM to be representable by an\nSCM. A key result then shows that every RCM -- including those that violate\nalgebraic principles implied by the SCM framework -- emerges as an abstraction\nof some representable RCM. Finally, we illustrate the power of this\nconciliatory perspective by pinpointing an important role for SCM principles in\nclassic applications of RCMs; conversely, we offer a characterization of the\nalgebraic constraints implied by a graph, helping to substantiate further\ncomparisons between the two frameworks."}, "http://arxiv.org/abs/2307.07342": {"title": "Bounded-memory adjusted scores estimation in generalized linear models with large data sets", "link": "http://arxiv.org/abs/2307.07342", "description": "The widespread use of maximum Jeffreys'-prior penalized likelihood in\nbinomial-response generalized linear models, and in logistic regression, in\nparticular, is supported by the results of Kosmidis and Firth (2021,\nBiometrika), who show that the resulting estimates are also always\nfinite-valued, even in cases where the maximum likelihood estimates are not,\nwhich is a practical issue regardless of the size of the data set. In logistic\nregression, the implied adjusted score equations are formally bias-reducing in\nasymptotic frameworks with a fixed number of parameters and appear to deliver a\nsubstantial reduction in the persistent bias of the maximum likelihood\nestimator in high-dimensional settings where the number of parameters grows\nasymptotically linearly and slower than the number of observations. In this\nwork, we develop and present two new variants of iteratively reweighted least\nsquares for estimating generalized linear models with adjusted score equations\nfor mean bias reduction and maximization of the likelihood penalized by a\npositive power of the Jeffreys-prior penalty, which eliminate the requirement\nof storing $O(n)$ quantities in memory, and can operate with data sets that\nexceed computer memory or even hard drive capacity. We achieve that through\nincremental QR decompositions, which enable IWLS iterations to have access only\nto data chunks of predetermined size. 
We assess the procedures through a\nreal-data application with millions of observations."}, "http://arxiv.org/abs/2309.02422": {"title": "Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test", "link": "http://arxiv.org/abs/2309.02422", "description": "Maximum mean discrepancy (MMD) refers to a general class of nonparametric\ntwo-sample tests that are based on maximizing the mean difference over samples\nfrom one distribution $P$ versus another $Q$, over all choices of data\ntransformations $f$ living in some function space $\\mathcal{F}$. Inspired by\nrecent work that connects what are known as functions of $\\textit{Radon bounded\nvariation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study\nthe MMD defined by taking $\\mathcal{F}$ to be the unit ball in the RBV space of\na given smoothness order $k \\geq 0$. This test, which we refer to as the\n$\\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a\ngeneralization of the well-known and classical Kolmogorov-Smirnov (KS) test to\nmultiple dimensions and higher orders of smoothness. It is also intimately\nconnected to neural networks: we prove that the witness in the RKS test -- the\nfunction $f$ achieving the maximum mean difference -- is always a ridge spline\nof degree $k$, i.e., a single neuron in a neural network. This allows us to\nleverage the power of modern deep learning toolkits to (approximately) optimize\nthe criterion that underlies the RKS test. We prove that the RKS test has\nasymptotically full power at distinguishing any distinct pair $P \\not= Q$ of\ndistributions, derive its asymptotic null distribution, and carry out extensive\nexperiments to elucidate the strengths and weaknesses of the RKS test versus\nthe more traditional kernel MMD test."}, "http://arxiv.org/abs/2311.04318": {"title": "Estimation for multistate models subject to reporting delays and incomplete event adjudication", "link": "http://arxiv.org/abs/2311.04318", "description": "Complete observation of event histories is often impossible due to sampling\neffects such as right-censoring and left-truncation, but also due to reporting\ndelays and incomplete event adjudication. This is, for example, the case during\ninterim stages of clinical trials and for health insurance claims. In this\npaper, we develop a parametric method that takes the aforementioned effects\ninto account, treating the latter two as partially exogenous. The method, which\ntakes the form of a two-step M-estimation procedure, is applicable to\nmultistate models in general, including competing risks and recurrent event\nmodels. The effect of reporting delays is derived via thinning, extending\nexisting results for Poisson models. To address incomplete event adjudication,\nwe propose an imputed likelihood approach which, compared to existing methods,\nhas the advantage of allowing for dependencies between the event history and\nadjudication processes as well as allowing for unreported events and multiple\nevent types. We establish consistency and asymptotic normality under standard\nidentifiability, integrability, and smoothness conditions, and we demonstrate\nthe validity of the percentile bootstrap. 
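For readers less familiar with the percentile bootstrap mentioned in the preceding sentence, a generic sketch follows; the data and the estimator here are placeholders rather than the paper's two-step M-estimator.

    import numpy as np

    def percentile_bootstrap_ci(data, estimator, n_boot=2000, alpha=0.05, seed=1):
        # Resample with replacement, re-estimate, and report the empirical
        # alpha/2 and 1 - alpha/2 quantiles of the bootstrap replicates.
        rng = np.random.default_rng(seed)
        n = len(data)
        reps = np.array([estimator(data[rng.integers(0, n, n)]) for _ in range(n_boot)])
        return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

    durations = np.random.default_rng(0).exponential(scale=2.0, size=200)
    print(percentile_bootstrap_ci(durations, np.mean))   # interval for the mean duration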
Finally, a simulation study shows\nfavorable finite sample performance of our method compared to other\nalternatives, while an application to disability insurance data illustrates its\npractical potential."}, "http://arxiv.org/abs/2311.04359": {"title": "Flexibly Estimating and Interpreting Heterogeneous Treatment Effects of Laparoscopic Surgery for Cholecystitis Patients", "link": "http://arxiv.org/abs/2311.04359", "description": "Laparoscopic surgery has been shown through a number of randomized trials to\nbe an effective form of treatment for cholecystitis. Given this evidence, one\nnatural question for clinical practice is: does the effectiveness of\nlaparoscopic surgery vary among patients? It might be the case that, while the\noverall effect is positive, some patients treated with laparoscopic surgery may\nrespond positively to the intervention while others do not or may be harmed. In\nour study, we focus on conditional average treatment effects to understand\nwhether treatment effects vary systematically with patient characteristics.\nRecent methodological work has developed a meta-learner framework for flexible\nestimation of conditional causal effects. In this framework, nonparametric\nestimation methods can be used to avoid bias from model misspecification while\npreserving statistical efficiency. In addition, researchers can flexibly and\neffectively explore whether treatment effects vary with a large number of\npossible effect modifiers. However, these methods have certain limitations. For\nexample, conducting inference can be challenging if black-box models are used.\nFurther, interpreting and visualizing the effect estimates can be difficult\nwhen there are multi-valued effect modifiers. In this paper, we develop new\nmethods that allow for interpretable results and inference from the\nmeta-learner framework for heterogeneous treatment effects estimation. We also\ndemonstrate methods that allow for an exploratory analysis to identify possible\neffect modifiers. We apply our methods to a large database for the use of\nlaparoscopic surgery in treating cholecystitis. We also conduct a series of\nsimulation studies to understand the relative performance of the methods we\ndevelop. Our study provides key guidelines for the interpretation of\nconditional causal effects from the meta-learner framework."}, "http://arxiv.org/abs/2311.04540": {"title": "On the estimation of the number of components in multivariate functional principal component analysis", "link": "http://arxiv.org/abs/2311.04540", "description": "Happ and Greven (2018) developed a methodology for principal components\nanalysis of multivariate functional data for data observed on different\ndimensional domains. Their approach relies on an estimation of univariate\nfunctional principal components for each univariate functional feature. In this\npaper, we present extensive simulations to investigate choosing the number of\nprincipal components to retain. 
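The conventional rule scrutinised in these simulations is a percentage-of-variance-explained threshold applied to each univariate feature; a minimal sketch, assuming the eigenvalues of one feature's covariance operator are available, is:

    import numpy as np

    def n_components_for_pve(eigenvalues, threshold=0.95):
        # Smallest number of components whose cumulative proportion of variance
        # explained reaches the threshold.
        ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
        pve = np.cumsum(ev) / ev.sum()
        return int(np.searchsorted(pve, threshold) + 1)

    print(n_components_for_pve([4.0, 2.0, 1.0, 0.5, 0.25, 0.25]))   # -> 5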
We show empirically that the conventional\napproach of using a percentage-of-variance-explained threshold for each\nunivariate functional feature may be unreliable when aiming to explain an\noverall percentage of variance in the multivariate functional data, and thus we\nadvise practitioners to be careful when using it."}, "http://arxiv.org/abs/2311.04585": {"title": "Goodness-of-Fit Tests for Linear Non-Gaussian Structural Equation Models", "link": "http://arxiv.org/abs/2311.04585", "description": "The field of causal discovery develops model selection methods to infer\ncause-effect relations among a set of random variables. For this purpose,\ndifferent modelling assumptions have been proposed to render cause-effect\nrelations identifiable. One prominent assumption is that the joint distribution\nof the observed variables follows a linear non-Gaussian structural equation\nmodel. In this paper, we develop novel goodness-of-fit tests that assess the\nvalidity of this assumption in the basic setting without latent confounders as\nwell as in extensions to linear models that incorporate latent confounders. Our\napproach involves testing algebraic relations among second and higher moments\nthat hold as a consequence of the linearity of the structural equations.\nSpecifically, we show that the linearity implies rank constraints on matrices\nand tensors derived from moments. For a practical implementation of our tests,\nwe consider a multiplier bootstrap method that uses incomplete U-statistics to\nestimate subdeterminants, as well as asymptotic approximations to the null\ndistribution of singular values. The methods are illustrated, in particular,\nfor the T\\\"ubingen collection of benchmark data sets on cause-effect pairs."}, "http://arxiv.org/abs/2311.04657": {"title": "Long-Term Causal Inference with Imperfect Surrogates using Many Weak Experiments, Proxies, and Cross-Fold Moments", "link": "http://arxiv.org/abs/2311.04657", "description": "Inferring causal effects on long-term outcomes using short-term surrogates is\ncrucial to rapid innovation. However, even when treatments are randomized and\nsurrogates fully mediate their effect on outcomes, it is possible that we get\nthe direction of causal effects wrong due to confounding between surrogates and\noutcomes -- a situation famously known as the surrogate paradox. The\navailability of many historical experiments offers the opportunity to instrument\nfor the surrogate and bypass this confounding. However, even as the number of\nexperiments grows, two-stage least squares has non-vanishing bias if each\nexperiment has a bounded size, and this bias is exacerbated when most\nexperiments barely move metrics, as occurs in practice. We show how to\neliminate this bias using cross-fold procedures, JIVE being one example, and\nconstruct valid confidence intervals for the long-term effect in new\nexperiments where the long-term outcome has not yet been observed. Our methodology\nfurther allows us to proxy for effects not perfectly mediated by the surrogates,\nallowing us to handle both confounding and effect leakage as violations of\nstandard statistical surrogacy conditions."}, "http://arxiv.org/abs/2311.04696": {"title": "Generative causality: using Shannon's information theory to infer underlying asymmetry in causal relations", "link": "http://arxiv.org/abs/2311.04696", "description": "Causal investigations in observational studies pose a great challenge in\nscientific research where randomized trials or intervention-based studies are\nnot feasible. 
Leveraging Shannon's seminal work on information theory, we\nconsider a framework of asymmetry where any causal link between putative cause\nand effect must be explained through a mechanism governing the cause as well as\na generative process yielding an effect of the cause. Under weak assumptions,\nthis framework enables the assessment of whether X is a stronger predictor of Y\nor vice-versa. Under stronger identifiability assumptions our framework is able\nto distinguish between cause and effect using observational data. We establish\nkey statistical properties of this framework. Our proposed methodology relies\non scalable non-parametric density estimation using fast Fourier\ntransformation. The resulting estimation method is manyfold faster than the\nclassical bandwidth-based density estimation while maintaining comparable mean\nintegrated squared error rates. We investigate key asymptotic properties of our\nmethodology and introduce a data-splitting technique to facilitate inference.\nThe key attraction of our framework is its inference toolkit, which allows\nresearchers to quantify uncertainty in causal discovery findings. We illustrate\nthe performance of our methodology through simulation studies as well as\nmultiple real data examples."}, "http://arxiv.org/abs/2311.04812": {"title": "Is it possible to obtain reliable estimates for the prevalence of anemia and childhood stunting among children under 5 in the poorest districts in Peru?", "link": "http://arxiv.org/abs/2311.04812", "description": "In this article we describe and apply the Fay-Herriot model with spatially\ncorrelated random area effects (Pratesi, M., & Salvati, N. (2008)), in order to\npredict the prevalence of anemia and childhood stunting in Peruvian districts,\nbased on the data from the Demographic and Family Health Survey of the year\n2019, which collects data about anemia and childhood stunting for children\nunder the age of 12 years, and the National Census carried out in 2017. Our\nmain objective is to produce reliable predictions for the districts, where\nsample sizes are too small to provide good direct estimates, and for the\ndistricts, which were not included in the sample. The basic Fay-Herriot model\n(Fay & Herriot, 1979) tackles this problem by incorporating auxiliary\ninformation, which is generally available from administrative or census\nrecords. The Fay-Herriot model with spatially correlated random area effects,\nin addition to auxiliary information, incorporates geographic information about\nthe areas, such as latitude and longitude. This permits modeling spatial\nautocorrelations, which are not unusual in socioeconomic and health surveys. To\nevaluate the mean square error of the above-mentioned predictors, we use the\nparametric bootstrap procedure, developed in Molina et al. (2009)."}, "http://arxiv.org/abs/2311.04855": {"title": "Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values", "link": "http://arxiv.org/abs/2311.04855", "description": "Non-negative matrix factorization (NMF) is a dimensionality reduction\ntechnique that has shown promise for analyzing noisy data, especially\nastronomical data. For these datasets, the observed data may contain negative\nvalues due to noise even when the true underlying physical signal is strictly\npositive. Prior NMF work has not treated negative data in a statistically\nconsistent manner, which becomes problematic for low signal-to-noise data with\nmany negative values. 
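For context, the sketch below shows standard multiplicative-update NMF (in the style of Lee and Seung), which assumes a non-negative input matrix; noisy negative entries therefore have to be clipped first, which is precisely the statistically inconsistent treatment described above. It illustrates the baseline behaviour only, not the algorithms introduced next.

    import numpy as np

    def nmf_multiplicative(V, rank, n_iter=500, eps=1e-12, seed=0):
        # Multiplicative updates for || V - W H ||_F^2 with V >= 0.
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.random((n, rank))
        H = rng.random((rank, m))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    rng = np.random.default_rng(1)
    signal = rng.random((50, 4)) @ rng.random((4, 30))            # non-negative truth
    noisy = signal + 0.3 * rng.standard_normal(signal.shape)      # noise creates negatives
    W, H = nmf_multiplicative(np.clip(noisy, 0, None), rank=4)    # clipping biases the fit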
In this paper we present two algorithms, Shift-NMF and\nNearly-NMF, that can handle both the noisiness of the input data and any\nintroduced negativity. Both of these algorithms use the negative data space\nwithout clipping, and correctly recover non-negative signals without any\nintroduced positive offset that occurs when clipping negative data. We\ndemonstrate this numerically on both simple and more realistic examples, and\nprove that both algorithms have monotonically decreasing update rules."}, "http://arxiv.org/abs/2311.04871": {"title": "Integration of Summary Information from External Studies for Semiparametric Models", "link": "http://arxiv.org/abs/2311.04871", "description": "With the development of biomedical science, researchers have increasing\naccess to an abundance of studies focusing on similar research questions. There\nis a growing interest in the integration of summary information from those\nstudies to enhance the efficiency of estimation in their own internal studies.\nIn this work, we present a comprehensive framework for the integration of summary\ninformation from external studies when the data are modeled by semiparametric\nmodels. Our novel framework offers straightforward estimators that update\nconventional estimates with auxiliary information. It addresses computational\nchallenges by capitalizing on the intricate mathematical structure inherent to\nthe problem. We derive the conditions under which the proposed estimators are\ntheoretically more efficient than the initial estimate based solely on internal\ndata. Several special cases, such as the proportional hazards model in survival\nanalysis, are provided with numerical examples."}, "http://arxiv.org/abs/2107.10885": {"title": "Laplace and Saddlepoint Approximations in High Dimensions", "link": "http://arxiv.org/abs/2107.10885", "description": "We examine the behaviour of the Laplace and saddlepoint approximations in the\nhigh-dimensional setting, where the dimension of the model is allowed to\nincrease with the number of observations. Approximations to the joint density,\nthe marginal posterior density and the conditional density are considered. Our\nresults show that under the mildest assumptions on the model, the error of the\njoint density approximation is $O(p^4/n)$ if $p = o(n^{1/4})$ for the Laplace\napproximation and saddlepoint approximation, and $O(p^3/n)$ if $p = o(n^{1/3})$\nunder additional assumptions on the second derivative of the log-likelihood.\nStronger results are obtained for the approximation to the marginal posterior\ndensity."}, "http://arxiv.org/abs/2111.00280": {"title": "Testing semiparametric model-equivalence hypotheses based on the characteristic function", "link": "http://arxiv.org/abs/2111.00280", "description": "We propose three test criteria, each of which is appropriate for testing,\nrespectively, the equivalence hypotheses of symmetry, of homogeneity, and of\nindependence, with multivariate data. All quantities have the common feature of\ninvolving weighted--type distances between characteristic functions and are\nconvenient from the computational point of view if the weight function is\nproperly chosen. 
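To illustrate why a well-chosen weight makes such criteria computationally convenient, the univariate sketch below integrates the squared difference of two empirical characteristic functions against a Gaussian weight exp(-a t^2), which collapses to pairwise Gaussian kernel sums; it is a generic two-sample statistic of this type, not necessarily the authors' exact criteria.

    import numpy as np

    def cf_distance_gaussian_weight(x, y, a=0.5):
        # T = integral of |ecf_x(t) - ecf_y(t)|^2 exp(-a t^2) dt, which has a
        # closed form because int exp(i t d) exp(-a t^2) dt
        #   = sqrt(pi / a) exp(-d^2 / (4 a)).
        def kmean(u, v):
            d = u[:, None] - v[None, :]
            return np.exp(-d ** 2 / (4 * a)).mean()
        return np.sqrt(np.pi / a) * (kmean(x, x) + kmean(y, y) - 2 * kmean(x, y))

    rng = np.random.default_rng(2)
    print(cf_distance_gaussian_weight(rng.normal(0, 1, 200), rng.normal(0.5, 1, 200)))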
The asymptotic behavior of the tests under the null hypothesis\nis investigated, and numerical studies are conducted in order to examine the\nperformance of the criteria in finite samples."}, "http://arxiv.org/abs/2112.11079": {"title": "Data fission: splitting a single data point", "link": "http://arxiv.org/abs/2112.11079", "description": "Suppose we observe a random vector $X$ from some distribution $P$ in a known\nfamily with unknown parameters. We ask the following question: when is it\npossible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part\nis sufficient to reconstruct $X$ by itself, but both together can recover $X$\nfully, and the joint distribution of $(f(X),g(X))$ is tractable? As one\nexample, if $X=(X_1,\\dots,X_n)$ and $P$ is a product distribution, then for any\n$m<n$, we can split the sample to define $f(X)=(X_1,\\dots,X_m)$ and\n$g(X)=(X_{m+1},\\dots,X_n)$. Rasines and Young (2022) offers an alternative\nroute of accomplishing this task through randomization of $X$ with additive\nGaussian noise which enables post-selection inference in finite samples for\nGaussian distributed data and asymptotically for non-Gaussian additive models.\nIn this paper, we offer a more general methodology for achieving such a split\nin finite samples by borrowing ideas from Bayesian inference to yield a\n(frequentist) solution that can be viewed as a continuous analog of data\nsplitting. We call our method data fission, as an alternative to data\nsplitting, data carving and p-value masking. We exemplify the method on a few\nprototypical applications, such as post-selection inference for trend filtering\nand other regression problems."}, "http://arxiv.org/abs/2207.04598": {"title": "Differential item functioning via robust scaling", "link": "http://arxiv.org/abs/2207.04598", "description": "This paper proposes a method for assessing differential item functioning\n(DIF) in item response theory (IRT) models. The method does not require\npre-specification of anchor items, which is its main virtue. It is developed in\ntwo main steps, first by showing how DIF can be re-formulated as a problem of\noutlier detection in IRT-based scaling, then tackling the latter using methods\nfrom robust statistics. The proposal is a redescending M-estimator of IRT\nscaling parameters that is tuned to flag items with DIF at the desired\nasymptotic Type I Error rate. Theoretical results describe the efficiency of\nthe estimator in the absence of DIF and its robustness in the presence of DIF.\nSimulation studies show that the proposed method compares favorably to\ncurrently available approaches for DIF detection, and a real data example\nillustrates its application in a research context where pre-specification of\nanchor items is infeasible. The focus of the paper is the two-parameter\nlogistic model in two independent groups, with extensions to other settings\nconsidered in the conclusion."}, "http://arxiv.org/abs/2209.12345": {"title": "Berry-Esseen bounds for design-based causal inference with possibly diverging treatment levels and varying group sizes", "link": "http://arxiv.org/abs/2209.12345", "description": "Neyman (1923/1990) introduced the randomization model, which contains the\nnotation of potential outcomes to define causal effects and a framework for\nlarge-sample inference based on the design of the experiment. However, the\nexisting theory for this framework is far from complete especially when the\nnumber of treatment levels diverges and the treatment group sizes vary. 
We\nprovide a unified discussion of statistical inference under the randomization\nmodel with general treatment group sizes. We formulate the estimator in terms\nof a linear permutational statistic and use results based on Stein's method to\nderive various Berry--Esseen bounds on the linear and quadratic functions of\nthe estimator. These new Berry--Esseen bounds serve as basis for design-based\ncausal inference with possibly diverging treatment levels and a diverging\nnumber of causal parameters of interest. We also fill an important gap by\nproposing novel variance estimators for experiments with possibly many\ntreatment levels without replications. Equipped with the newly developed\nresults, design-based causal inference in general settings becomes more\nconvenient with stronger theoretical guarantees."}, "http://arxiv.org/abs/2210.02014": {"title": "Doubly Robust Proximal Synthetic Controls", "link": "http://arxiv.org/abs/2210.02014", "description": "To infer the treatment effect for a single treated unit using panel data,\nsynthetic control methods construct a linear combination of control units'\noutcomes that mimics the treated unit's pre-treatment outcome trajectory. This\nlinear combination is subsequently used to impute the counterfactual outcomes\nof the treated unit had it not been treated in the post-treatment period, and\nused to estimate the treatment effect. Existing synthetic control methods rely\non correctly modeling certain aspects of the counterfactual outcome generating\nmechanism and may require near-perfect matching of the pre-treatment\ntrajectory. Inspired by proximal causal inference, we obtain two novel\nnonparametric identifying formulas for the average treatment effect for the\ntreated unit: one is based on weighting, and the other combines models for the\ncounterfactual outcome and the weighting function. We introduce the concept of\ncovariate shift to synthetic controls to obtain these identification results\nconditional on the treatment assignment. We also develop two treatment effect\nestimators based on these two formulas and the generalized method of moments.\nOne new estimator is doubly robust: it is consistent and asymptotically normal\nif at least one of the outcome and weighting models is correctly specified. We\ndemonstrate the performance of the methods via simulations and apply them to\nevaluate the effectiveness of a Pneumococcal conjugate vaccine on the risk of\nall-cause pneumonia in Brazil."}, "http://arxiv.org/abs/2303.01385": {"title": "Hyperlink communities in higher-order networks", "link": "http://arxiv.org/abs/2303.01385", "description": "Many networks can be characterised by the presence of communities, which are\ngroups of units that are closely linked and can be relevant in understanding\nthe system's overall function. Recently, hypergraphs have emerged as a\nfundamental tool for modelling systems where interactions are not limited to\npairs but may involve an arbitrary number of nodes. Using a dual approach to\ncommunity detection, in this study we extend the concept of link communities to\nhypergraphs, allowing us to extract informative clusters of highly related\nhyperedges. We analyze the dendrograms obtained by applying hierarchical\nclustering to distance matrices among hyperedges on a variety of real-world\ndata, showing that hyperlink communities naturally highlight the hierarchical\nand multiscale structure of higher-order networks. 
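A minimal sketch of that pipeline (pairwise distances among hyperedges, hierarchical clustering, and a dendrogram cut) is shown below; Jaccard distance and average linkage are one plausible choice rather than necessarily the paper's.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def hyperlink_communities(hyperedges, cut=0.8):
        # Cluster hyperedges (sets of nodes) by pairwise Jaccard distance.
        sets = [frozenset(e) for e in hyperedges]
        n = len(sets)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                jac = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
                D[i, j] = D[j, i] = 1.0 - jac
        Z = linkage(squareform(D), method="average")   # dendrogram over hyperedges
        return fcluster(Z, t=cut, criterion="distance")

    print(hyperlink_communities([{1, 2, 3}, {2, 3, 4}, {1, 3}, {7, 8}, {7, 8, 9}]))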
Moreover, by using hyperlink\ncommunities, we are able to extract overlapping memberships from nodes,\novercoming limitations of traditional hard clustering methods. Finally, we\nintroduce higher-order network cartography as a practical tool for categorizing\nnodes into different structural roles based on their interaction patterns and\ncommunity participation. This approach helps identify different types of\nindividuals in a variety of real-world social systems. Our work contributes to\na better understanding of the structural organization of real-world\nhigher-order systems."}, "http://arxiv.org/abs/2305.05931": {"title": "Generalised shot noise representations of stochastic systems driven by non-Gaussian L\\'evy processes", "link": "http://arxiv.org/abs/2305.05931", "description": "We consider the problem of obtaining effective representations for the\nsolutions of linear, vector-valued stochastic differential equations (SDEs)\ndriven by non-Gaussian pure-jump L\\'evy processes, and we show how such\nrepresentations lead to efficient simulation methods. The processes considered\nconstitute a broad class of models that find application across the physical\nand biological sciences, mathematics, finance and engineering. Motivated by\nimportant relevant problems in statistical inference, we derive new,\ngeneralised shot-noise simulation methods whenever a normal variance-mean (NVM)\nmixture representation exists for the driving L\\'evy process, including the\ngeneralised hyperbolic, normal-Gamma, and normal tempered stable cases. Simple,\nexplicit conditions are identified for the convergence of the residual of a\ntruncated shot-noise representation to a Brownian motion in the case of the\npure L\\'evy process, and to a Brownian-driven SDE in the case of the\nL\\'evy-driven SDE. These results provide Gaussian approximations to the small\njumps of the process under the NVM representation. The resulting\nrepresentations are of particular importance in state inference and parameter\nestimation for L\\'evy-driven SDE models, since the resulting conditionally\nGaussian structures can be readily incorporated into latent variable inference\nmethods such as Markov chain Monte Carlo (MCMC), Expectation-Maximisation (EM),\nand sequential Monte Carlo."}, "http://arxiv.org/abs/2305.17517": {"title": "Stochastic Nonparametric Estimation of the Density-Flow Curve", "link": "http://arxiv.org/abs/2305.17517", "description": "The fundamental diagram serves as the foundation of traffic flow modeling for\nalmost a century. With the increasing availability of road sensor data,\ndeterministic parametric models have proved inadequate in describing the\nvariability of real-world data, especially in congested area of the\ndensity-flow diagram. In this paper we estimate the stochastic density-flow\nrelation introducing a nonparametric method called convex quantile regression.\nThe proposed method does not depend on any prior functional form assumptions,\nbut thanks to the concavity constraints, the estimated function satisfies the\ntheoretical properties of the density-flow curve. The second contribution is to\ndevelop the new convex quantile regression with bags (CQRb) approach to\nfacilitate practical implementation of CQR to the real-world data. We\nillustrate the CQRb estimation process using the road sensor data from Finland\nin years 2016-2018. 
Our third contribution is to demonstrate the excellent\nout-of-sample predictive power of the proposed CQRb method in comparison to the\nstandard parametric deterministic approach."}, "http://arxiv.org/abs/2305.19180": {"title": "Transfer Learning With Efficient Estimators to Optimally Leverage Historical Data in Analysis of Randomized Trials", "link": "http://arxiv.org/abs/2305.19180", "description": "Although randomized controlled trials (RCTs) are a cornerstone of comparative\neffectiveness, they typically have much smaller sample sizes than observational\nstudies because of financial and ethical considerations. Therefore, there is\ninterest in using plentiful historical data (either observational data or prior\ntrials) to reduce trial sizes. Previous estimators developed for this purpose\nrely on unrealistic assumptions, without which the added data can bias the\ntreatment effect estimate. Recent work proposed an alternative method\n(prognostic covariate adjustment) that imposes no additional assumptions and\nincreases efficiency in trial analyses. The idea is to use historical data to\nlearn a prognostic model: a regression of the outcome onto the covariates. The\npredictions from this model, generated from the RCT subjects' baseline\nvariables, are then used as a covariate in a linear regression analysis of the\ntrial data. In this work, we extend prognostic adjustment to trial analyses\nwith nonparametric efficient estimators, which are more powerful than linear\nregression. We provide theory that explains why prognostic adjustment improves\nsmall-sample point estimation and inference without any possibility of bias.\nSimulations corroborate the theory: efficient estimators with prognostic\nadjustment provide greater power (i.e., smaller standard errors) than those\nwithout it when the trial is small. Population shifts between historical and trial\ndata attenuate benefits but do not introduce bias. We showcase our estimator\nusing clinical trial data provided by Novo Nordisk A/S that evaluates insulin\ntherapy for individuals with type II diabetes."}, "http://arxiv.org/abs/2308.13928": {"title": "A flexible Bayesian tool for CoDa mixed models: logistic-normal distribution with Dirichlet covariance", "link": "http://arxiv.org/abs/2308.13928", "description": "Compositional Data Analysis (CoDa) has gained popularity in recent years.\nThis type of data consists of values from disjoint categories that sum up to a\nconstant. Both Dirichlet regression and logistic-normal regression have become\npopular as CoDa analysis methods. However, fitting this kind of multivariate\nmodel presents challenges, especially when structured random effects are\nincluded in the model, such as temporal or spatial effects.\n\nTo overcome these challenges, we propose the logistic-normal Dirichlet Model\n(LNDM). We seamlessly incorporate this approach into the R-INLA package,\nfacilitating model fitting and model prediction within the framework of Latent\nGaussian Models (LGMs). 
Moreover, we explore metrics like Deviance Information\nCriteria (DIC), Watanabe Akaike information criterion (WAIC), and\ncross-validation measure conditional predictive ordinate (CPO) for model\nselection in R-INLA for CoDa.\n\nIllustrating LNDM through a simple simulated example and with an ecological\ncase study on Arabidopsis thaliana in the Iberian Peninsula, we underscore its\npotential as an effective tool for managing CoDa and large CoDa databases."}, "http://arxiv.org/abs/2310.02968": {"title": "Sampling depth trade-off in function estimation under a two-level design", "link": "http://arxiv.org/abs/2310.02968", "description": "Many modern statistical applications involve a two-level sampling scheme that\nfirst samples subjects from a population and then samples observations on each\nsubject. These schemes often are designed to learn both the population-level\nfunctional structures shared by the subjects and the functional characteristics\nspecific to individual subjects. Common wisdom suggests that learning\npopulation-level structures benefits from sampling more subjects whereas\nlearning subject-specific structures benefits from deeper sampling within each\nsubject. Oftentimes these two objectives compete for limited sampling\nresources, which raises the question of how to optimally sample at the two\nlevels. We quantify such sampling-depth trade-offs by establishing the $L_2$\nminimax risk rates for learning the population-level and subject-specific\nstructures under a hierarchical Gaussian process model framework where we\nconsider a Bayesian and a frequentist perspective on the unknown\npopulation-level structure. These rates provide general lessons for designing\ntwo-level sampling schemes. Interestingly, subject-specific learning\noccasionally benefits more by sampling more subjects than by deeper\nwithin-subject sampling. We also construct estimators that adapt to unknown\nsmoothness and achieve the corresponding minimax rates. We conduct two\nsimulation experiments validating our theory and illustrating the sampling\ntrade-off in practice, and apply these estimators to two real datasets."}, "http://arxiv.org/abs/2311.05025": {"title": "Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients", "link": "http://arxiv.org/abs/2311.05025", "description": "We present an unbiased method for Bayesian posterior means based on kinetic\nLangevin dynamics that combines advanced splitting methods with enhanced\ngradient approximations. Our approach avoids Metropolis correction by coupling\nMarkov chains at different discretization levels in a multilevel Monte Carlo\napproach. Theoretical analysis demonstrates that our proposed estimator is\nunbiased, attains finite variance, and satisfies a central limit theorem. It\ncan achieve accuracy $\\epsilon>0$ for estimating expectations of Lipschitz\nfunctions in $d$ dimensions with $\\mathcal{O}(d^{1/4}\\epsilon^{-2})$ expected\ngradient evaluations, without assuming warm start. We exhibit similar bounds\nusing both approximate and stochastic gradients, and our method's computational\ncost is shown to scale logarithmically with the size of the dataset. The\nproposed method is tested using a multinomial regression problem on the MNIST\ndataset and a Poisson regression model for soccer scores. Experiments indicate\nthat the number of gradient evaluations per effective sample is independent of\ndimension, even when using inexact gradients. For product distributions, we\ngive dimension-independent variance bounds. 
Our results demonstrate that the\nunbiased algorithm we present can be much more efficient than the\n``gold-standard\" randomized Hamiltonian Monte Carlo."}, "http://arxiv.org/abs/2311.05056": {"title": "High-dimensional Newey-Powell Test Via Approximate Message Passing", "link": "http://arxiv.org/abs/2311.05056", "description": "Homoscedastic regression error is a common assumption in many\nhigh-dimensional regression models and theories. Although heteroscedastic error\ncommonly exists in real-world datasets, testing heteroscedasticity remains\nlargely underexplored under high-dimensional settings. We consider the\nheteroscedasticity test proposed in Newey and Powell (1987), whose asymptotic\ntheory has been well-established for the low-dimensional setting. We show that\nthe Newey-Powell test can be developed for high-dimensional data. For\nasymptotic theory, we consider the setting where the number of dimensions grows\nwith the sample size at a linear rate. The asymptotic analysis for the test\nstatistic utilizes the Approximate Message Passing (AMP) algorithm, from which\nwe obtain the limiting distribution of the test. The numerical performance of\nthe test is investigated through an extensive simulation study. As real-data\napplications, we present the analysis based on \"international economic growth\"\ndata (Belloni et al. 2011), which is found to be homoscedastic, and\n\"supermarket\" data (Lan et al., 2016), which is found to be heteroscedastic."}, "http://arxiv.org/abs/2311.05200": {"title": "An efficient Bayesian approach to joint functional principal component analysis for complex sampling designs", "link": "http://arxiv.org/abs/2311.05200", "description": "The analysis of multivariate functional curves has the potential to yield\nimportant scientific discoveries in domains such as healthcare, medicine,\neconomics and social sciences. However it is common for real-world settings to\npresent data that are both sparse and irregularly sampled, and this introduces\nimportant challenges for the current functional data methodology. Here we\npropose a Bayesian hierarchical framework for multivariate functional principal\ncomponent analysis which accommodates the intricacies of such sampling designs\nby flexibly pooling information across subjects and correlated curves. Our\nmodel represents common latent dynamics via shared functional principal\ncomponent scores, thereby effectively borrowing strength across curves while\ncircumventing the computationally challenging task of estimating covariance\nmatrices. These scores also provide a parsimonious representation of the major\nmodes of joint variation of the curves, and constitute interpretable scalar\nsummaries that can be employed in follow-up analyses. We perform inference\nusing a variational message passing algorithm which combines efficiency,\nmodularity and approximate posterior density estimation, enabling the joint\nanalysis of large datasets with parameter uncertainty quantification. We\nconduct detailed simulations to assess the effectiveness of our approach in\nsharing information under complex sampling designs. We also exploit it to\nestimate the molecular disease courses of individual patients with SARS-CoV-2\ninfection and characterise patient heterogeneity in recovery outcomes; this\nstudy reveals key coordinated dynamics across the immune, inflammatory and\nmetabolic systems, which are associated with survival and long-COVID symptoms\nup to one year post disease onset. 
Our approach is implemented in the R package\nbayesFPCA."}, "http://arxiv.org/abs/2311.05248": {"title": "A General Space of Belief Updates for Model Misspecification in Bayesian Networks", "link": "http://arxiv.org/abs/2311.05248", "description": "In an ideal setting for Bayesian agents, a perfect description of the rules\nof the environment (i.e., the objective observation model) is available,\nallowing them to reason through the Bayesian posterior to update their beliefs\nin an optimal way. But such an ideal setting hardly ever exists in the natural\nworld, so agents must simultaneously reason about how they should update\ntheir beliefs. This introduces related challenges\nfor a number of research areas: (1) For Bayesian statistics, this deviation of\nthe subjective model from the true data-generating mechanism is termed model\nmisspecification in the literature. (2) For neuroscience, it introduces the\nneed to model both the agents' belief updates (how they use evidence to\nupdate their beliefs) and how their beliefs change over time. The current paper\naddresses these two challenges by (a) providing a general class of\nposteriors/belief updates, called cut-posteriors of Bayesian networks, that has\nmuch greater expressivity, and (b) parameterizing the space of possible\nposteriors to make meta-learning (i.e., choosing the belief update from this\nspace in a principled manner) possible. For (a), it is noteworthy that any\ncut-posterior has local computation only, making computation tractable for\nhuman or artificial agents. For (b), a Markov Chain Monte Carlo algorithm to\nperform such meta-learning will be sketched here, though it is only an\nillustration and by no means the only meta-learning procedure\npossible for the space of cut-posteriors. Operationally, this work gives a\ngeneral algorithm to take in an arbitrary Bayesian network and output all\npossible cut-posteriors in the space."}, "http://arxiv.org/abs/2311.05272": {"title": "deform: An R Package for Nonstationary Spatial Gaussian Process Models by Deformations and Dimension Expansion", "link": "http://arxiv.org/abs/2311.05272", "description": "Gaussian processes (GPs) are a popular and powerful tool for spatial modelling\nof data, especially data that quantify environmental processes. However, in\nstationary form, whether covariance is isotropic or anisotropic, GPs may lack\nthe flexibility to capture dependence across a continuous spatial process,\nespecially across large domains. The deform package aims to provide users\nwith user-friendly R functions for the fitting and visualization of\nnonstationary spatial Gaussian processes. Users can choose to capture\nnonstationarity with either the spatial deformation approach of Sampson and\nGuttorp (1992) or the dimension expansion approach of Bornn, Shaddick, and\nZidek (2012). Thin plate regression splines are used in both approaches to\ntransform the original locations into a new set of locations under which the\ncovariance is isotropic. Fitted models in deform can be used to predict these new\nlocations and to simulate nonstationary Gaussian processes for an arbitrary set\nof locations."}, "http://arxiv.org/abs/2311.05330": {"title": "Applying a new category association estimator to sentiment analysis on the Web", "link": "http://arxiv.org/abs/2311.05330", "description": "This paper introduces a novel Bayesian method for measuring the degree of\nassociation between categorical variables. 
The method is grounded in the formal\ndefinition of variable independence and was implemented using MCMC techniques.\nUnlike existing methods, this approach does not assume prior knowledge of the\ntotal number of occurrences for any category, making it particularly\nwell-suited for applications like sentiment analysis. We applied the method to\na dataset comprising 4,613 tweets written in Portuguese, each annotated for 30\npossibly overlapping emotional categories. Through this analysis, we identified\npairs of emotions that exhibit associations and mutually exclusive pairs.\nFurthermore, the method identifies hierarchical relations between categories, a\nfeature observed in our data, and was used to cluster emotions into basic-level\ngroups."}, "http://arxiv.org/abs/2311.05339": {"title": "An iterative algorithm for high-dimensional linear models with both sparse and non-sparse structures", "link": "http://arxiv.org/abs/2311.05339", "description": "Many practical medical problems involve data that possess a\ncombination of both sparse and non-sparse structures. Traditional penalized\nregularization techniques, primarily designed for promoting sparsity, are\ninadequate for capturing the optimal solutions in such scenarios. To address these\nchallenges, this paper introduces a novel algorithm named Non-sparse Iteration\n(NSI). The NSI algorithm allows for the existence of both sparse and non-sparse\nstructures and estimates them simultaneously and accurately. We provide\ntheoretical guarantees that the proposed algorithm converges to the oracle\nsolution and achieves the optimal rate for the upper bound of the $l_2$-norm\nerror. Through simulations and practical applications, NSI consistently\nexhibits superior statistical performance in terms of estimation accuracy,\nprediction efficacy, and variable selection compared to several existing\nmethods. The proposed method is also applied to breast cancer data, revealing\nrepeated selection of specific genes for in-depth analysis."}, "http://arxiv.org/abs/2311.05421": {"title": "Diffusion Based Causal Representation Learning", "link": "http://arxiv.org/abs/2311.05421", "description": "Causal reasoning can be considered a cornerstone of intelligent systems.\nHaving access to an underlying causal graph comes with the promise of\ncause-effect estimation and the identification of efficient and safe\ninterventions. However, learning causal representations remains a major\nchallenge, due to the complexity of many real-world systems. Previous works on\ncausal representation learning have mostly focused on Variational Auto-Encoders\n(VAE). These methods only provide representations from a point estimate, and\nthey are unsuitable for handling high dimensions. To overcome these problems, we\npropose a new Diffusion-based Causal Representation Learning (DCRL) algorithm.\nThis algorithm uses diffusion-based representations for causal discovery. DCRL\noffers access to infinite-dimensional latent codes, which encode different\nlevels of information in the latent code. In a first proof of principle, we\ninvestigate the use of DCRL for causal representation learning. 
We further\ndemonstrate experimentally that this approach performs comparably well in\nidentifying the causal structure and causal variables."}, "http://arxiv.org/abs/2311.05532": {"title": "Uncertainty-Aware Bayes' Rule and Its Applications", "link": "http://arxiv.org/abs/2311.05532", "description": "Bayes' rule has enabled innumerable powerful algorithms of statistical signal\nprocessing and statistical machine learning. However, when there exist model\nmisspecifications in prior distributions and/or data distributions, the direct\napplication of Bayes' rule is questionable. Philosophically, the key is to\nbalance the relative importance of prior and data distributions when\ncalculating posterior distributions: if prior (resp. data) distributions are\noverly conservative, we should upweight the prior belief (resp. data evidence);\nif prior (resp. data) distributions are overly opportunistic, we should\ndownweight the prior belief (resp. data evidence). This paper derives a\ngeneralized Bayes' rule, called uncertainty-aware Bayes' rule, to technically\nrealize the above philosophy, i.e., to combat the model uncertainties in prior\ndistributions and/or data distributions. Simulated and real-world experiments\nshowcase the superiority of the presented uncertainty-aware Bayes' rule over\nthe conventional Bayes' rule: In particular, the uncertainty-aware Kalman\nfilter, the uncertainty-aware particle filter, and the uncertainty-aware\ninteractive multiple model filter are suggested and validated."}, "http://arxiv.org/abs/2110.00115": {"title": "Comparing Sequential Forecasters", "link": "http://arxiv.org/abs/2110.00115", "description": "Consider two forecasters, each making a single prediction for a sequence of\nevents over time. We ask a relatively basic question: how might we compare\nthese forecasters, either online or post-hoc, while avoiding unverifiable\nassumptions on how the forecasts and outcomes were generated? In this paper, we\npresent a rigorous answer to this question by designing novel sequential\ninference procedures for estimating the time-varying difference in forecast\nscores. To do this, we employ confidence sequences (CS), which are sequences of\nconfidence intervals that can be continuously monitored and are valid at\narbitrary data-dependent stopping times (\"anytime-valid\"). The widths of our\nCSs are adaptive to the underlying variance of the score differences.\nUnderlying their construction is a game-theoretic statistical framework, in\nwhich we further identify e-processes and p-processes for sequentially testing\na weak null hypothesis -- whether one forecaster outperforms another on average\n(rather than always). Our methods do not make distributional assumptions on the\nforecasts or outcomes; our main theorems apply to any bounded scores, and we\nlater provide alternative methods for unbounded scores. We empirically validate\nour approaches by comparing real-world baseball and weather forecasters."}, "http://arxiv.org/abs/2207.14753": {"title": "Estimating Causal Effects with Hidden Confounding using Instrumental Variables and Environments", "link": "http://arxiv.org/abs/2207.14753", "description": "Recent works have proposed regression models which are invariant across data\ncollection environments. These estimators often have a causal interpretation\nunder conditions on the environments and type of invariance imposed. 
One recent\nexample, the Causal Dantzig (CD), is consistent under hidden confounding and\nrepresents an alternative to classical instrumental variable estimators such as\nTwo Stage Least Squares (TSLS). In this work, we derive the CD as a generalized\nmethod of moments (GMM) estimator. The GMM representation leads to several\npractical results, including: 1) creation of the Generalized Causal Dantzig\n(GCD) estimator, which can be applied to problems with continuous environments\nwhere the CD cannot be fit; 2) a Hybrid (GCD-TSLS combination) estimator, which\nhas properties superior to the GCD or TSLS alone; and 3) straightforward asymptotic\nresults for all methods using GMM theory. We compare the CD, GCD, TSLS, and\nHybrid estimators in simulations and an application to a Flow Cytometry data\nset. The newly proposed GCD and Hybrid estimators have superior performance to\nexisting methods in many settings."}, "http://arxiv.org/abs/2208.02948": {"title": "A Feature Selection Method that Controls the False Discovery Rate", "link": "http://arxiv.org/abs/2208.02948", "description": "Selecting a handful of truly relevant variables in supervised\nmachine learning algorithms is challenging because of the untestable\nassumptions that must hold and the unavailability of theoretical assurances that\nselection errors are under control. We propose a distribution-free feature\nselection method, referred to as Data Splitting Selection (DSS), which controls\nthe False Discovery Rate (FDR) of feature selection while obtaining high power.\nAnother version of DSS with higher power is proposed, which \"almost\" controls\nthe FDR. No assumptions are made on the distribution of the response or on the\njoint distribution of the features. Extensive simulations are performed to\ncompare the performance of the proposed methods with that of existing ones."}, "http://arxiv.org/abs/2209.03474": {"title": "An extension of the Unified Skew-Normal family of distributions and application to Bayesian binary regression", "link": "http://arxiv.org/abs/2209.03474", "description": "We consider the general problem of Bayesian binary regression and we\nintroduce a new class of distributions, the Perturbed Unified Skew Normal\n(pSUN, henceforth), which generalizes the Unified Skew-Normal (SUN) class. We\nshow that the new class is conjugate to any binary regression model, provided\nthat the link function may be expressed as a scale mixture of Gaussian\ndensities. We discuss in detail the popular logit case, and we show that, when\na logistic regression model is combined with a Gaussian prior, posterior\nsummaries such as cumulants and normalizing constants can be easily obtained\nthrough the use of an importance sampling approach, opening the way to\nstraightforward variable selection procedures. For more general priors, the\nproposed methodology is based on a simple Gibbs sampler algorithm. We also\nclaim that, in the p > n case, the proposed methodology shows better\nperformance - both in terms of mixing and accuracy - compared to existing\nmethods. We illustrate the performance through several simulation studies and\ntwo data analyses."}, "http://arxiv.org/abs/2211.15070": {"title": "Online Kernel CUSUM for Change-Point Detection", "link": "http://arxiv.org/abs/2211.15070", "description": "We present a computationally efficient online kernel Cumulative Sum (CUSUM)\nmethod for change-point detection that utilizes the maximum over a set of\nkernel statistics to account for the unknown change-point location. 
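The maximum-over-kernel-statistics idea can be sketched as follows: an unbiased MMD^2 estimate is computed between a reference sample and the most recent window, for several candidate window lengths, and the maximum is taken. The paper's actual CUSUM recursion, recursive updates, and ARL calibration are more refined; all names below are illustrative.

    import numpy as np

    def rbf(a, b, gamma=0.5):
        d = a[:, None] - b[None, :]
        return np.exp(-gamma * d ** 2)

    def mmd2_unbiased(x, y, gamma=0.5):
        # U-statistic estimate of the squared MMD between univariate samples.
        kxx, kyy, kxy = rbf(x, x, gamma), rbf(y, y, gamma), rbf(x, y, gamma)
        n, m = len(x), len(y)
        return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
                + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
                - 2 * kxy.mean())

    def max_kernel_statistic(reference, recent, windows=(10, 20, 40)):
        # Maximum over candidate window lengths of MMD^2(reference, last-B window).
        return max(mmd2_unbiased(reference, recent[-b:]) for b in windows
                   if b <= len(recent))

    rng = np.random.default_rng(3)
    ref = rng.normal(0, 1, 200)
    recent = np.concatenate([rng.normal(0, 1, 30), rng.normal(1.0, 1, 30)])  # shift midway
    print(max_kernel_statistic(ref, recent))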
Our\napproach exhibits increased sensitivity to small changes compared to existing\nkernel-based change-point detection methods, including Scan-B statistic,\ncorresponding to a non-parametric Shewhart chart-type procedure. We provide\naccurate analytic approximations for two key performance metrics: the Average\nRun Length (ARL) and Expected Detection Delay (EDD), which enable us to\nestablish an optimal window length to be on the order of the logarithm of ARL\nto ensure minimal power loss relative to an oracle procedure with infinite\nmemory. Moreover, we introduce a recursive calculation procedure for detection\nstatistics to ensure constant computational and memory complexity, which is\nessential for online implementation. Through extensive experiments on both\nsimulated and real data, we demonstrate the competitive performance of our\nmethod and validate our theoretical results."}, "http://arxiv.org/abs/2301.09633": {"title": "Prediction-Powered Inference", "link": "http://arxiv.org/abs/2301.09633", "description": "Prediction-powered inference is a framework for performing valid statistical\ninference when an experimental dataset is supplemented with predictions from a\nmachine-learning system. The framework yields simple algorithms for computing\nprovably valid confidence intervals for quantities such as means, quantiles,\nand linear and logistic regression coefficients, without making any assumptions\non the machine-learning algorithm that supplies the predictions. Furthermore,\nmore accurate predictions translate to smaller confidence intervals.\nPrediction-powered inference could enable researchers to draw valid and more\ndata-efficient conclusions using machine learning. The benefits of\nprediction-powered inference are demonstrated with datasets from proteomics,\nastronomy, genomics, remote sensing, census analysis, and ecology."}, "http://arxiv.org/abs/2303.03521": {"title": "Bayesian Adaptive Selection of Variables for Function-on-Scalar Regression Models", "link": "http://arxiv.org/abs/2303.03521", "description": "Considering the field of functional data analysis, we developed a new\nBayesian method for variable selection in function-on-scalar regression (FOSR).\nOur approach uses latent variables, allowing an adaptive selection since it can\ndetermine the number of variables and which ones should be selected for a\nfunction-on-scalar regression model. Simulation studies show the proposed\nmethod's main properties, such as its accuracy in estimating the coefficients\nand high capacity to select variables correctly. Furthermore, we conducted\ncomparative studies with the main competing methods, such as the BGLSS method\nas well as the group LASSO, the group MCP and the group SCAD. We also used a\nCOVID-19 dataset and some socioeconomic data from Brazil for real data\napplication. In short, the proposed Bayesian variable selection model is\nextremely competitive, showing significant predictive and selective quality."}, "http://arxiv.org/abs/2305.10564": {"title": "Counterfactually Comparing Abstaining Classifiers", "link": "http://arxiv.org/abs/2305.10564", "description": "Abstaining classifiers have the option to abstain from making predictions on\ninputs that they are unsure about. These classifiers are becoming increasingly\npopular in high-stakes decision-making problems, as they can withhold uncertain\npredictions to improve their reliability and safety. 
When evaluating black-box\nabstaining classifier(s), however, we lack a principled approach that accounts\nfor what the classifier would have predicted on its abstentions. These missing\npredictions matter when they can eventually be utilized, either directly or as\na backup option in a failure mode. In this paper, we introduce a novel approach\nand perspective to the problem of evaluating and comparing abstaining\nclassifiers by treating abstentions as missing data. Our evaluation approach is\ncentered around defining the counterfactual score of an abstaining classifier,\ndefined as the expected performance of the classifier had it not been allowed\nto abstain. We specify the conditions under which the counterfactual score is\nidentifiable: if the abstentions are stochastic, and if the evaluation data is\nindependent of the training data (ensuring that the predictions are missing at\nrandom), then the score is identifiable. Note that, if abstentions are\ndeterministic, then the score is unidentifiable because the classifier can\nperform arbitrarily poorly on its abstentions. Leveraging tools from\nobservational causal inference, we then develop nonparametric and doubly robust\nmethods to efficiently estimate this quantity under identification. Our\napproach is examined in both simulated and real data experiments."}, "http://arxiv.org/abs/2307.01539": {"title": "Implementing measurement error models with mechanistic mathematical models in a likelihood-based framework for estimation, identifiability analysis, and prediction in the life sciences", "link": "http://arxiv.org/abs/2307.01539", "description": "Throughout the life sciences we routinely seek to interpret measurements and\nobservations using parameterised mechanistic mathematical models. A fundamental\nand often overlooked choice in this approach involves relating the solution of\na mathematical model with noisy and incomplete measurement data. This is often\nachieved by assuming that the data are noisy measurements of the solution of a\ndeterministic mathematical model, and that measurement errors are additive and\nnormally distributed. While this assumption of additive Gaussian noise is\nextremely common and simple to implement and interpret, it is often unjustified\nand can lead to poor parameter estimates and non-physical predictions. One way\nto overcome this challenge is to implement a different measurement error model.\nIn this review, we demonstrate how to implement a range of measurement error\nmodels in a likelihood-based framework for estimation, identifiability\nanalysis, and prediction, called Profile-Wise Analysis. This frequentist\napproach to uncertainty quantification for mechanistic models leverages the\nprofile likelihood for targeting parameters and understanding their influence\non predictions. Case studies, motivated by simple caricature models routinely\nused in systems biology and mathematical biology literature, illustrate how the\nsame ideas apply to different types of mathematical models. Open-source Julia\ncode to reproduce results is available on GitHub."}, "http://arxiv.org/abs/2307.04548": {"title": "Beyond the Two-Trials Rule", "link": "http://arxiv.org/abs/2307.04548", "description": "The two-trials rule for drug approval requires \"at least two adequate and\nwell-controlled studies, each convincing on its own, to establish\neffectiveness\". 
This is usually implemented by requiring two significant\npivotal trials and is the standard regulatory requirement to provide evidence\nfor a new drug's efficacy. However, there is need to develop suitable\nalternatives to this rule for a number of reasons, including the possible\navailability of data from more than two trials. I consider the case of up to 3\nstudies and stress the importance to control the partial Type-I error rate,\nwhere only some studies have a true null effect, while maintaining the overall\nType-I error rate of the two-trials rule, where all studies have a null effect.\nSome less-known $p$-value combination methods are useful to achieve this:\nPearson's method, Edgington's method and the recently proposed harmonic mean\n$\\chi^2$-test. I study their properties and discuss how they can be extended to\na sequential assessment of success while still ensuring overall Type-I error\ncontrol. I compare the different methods in terms of partial Type-I error rate,\nproject power and the expected number of studies required. Edgington's method\nis eventually recommended as it is easy to implement and communicate, has only\nmoderate partial Type-I error rate inflation but substantially increased\nproject power."}, "http://arxiv.org/abs/2307.06250": {"title": "Identifiability Guarantees for Causal Disentanglement from Soft Interventions", "link": "http://arxiv.org/abs/2307.06250", "description": "Causal disentanglement aims to uncover a representation of data using latent\nvariables that are interrelated through a causal model. Such a representation\nis identifiable if the latent model that explains the data is unique. In this\npaper, we focus on the scenario where unpaired observational and interventional\ndata are available, with each intervention changing the mechanism of a latent\nvariable. When the causal variables are fully observed, statistically\nconsistent algorithms have been developed to identify the causal model under\nfaithfulness assumptions. We here show that identifiability can still be\nachieved with unobserved causal variables, given a generalized notion of\nfaithfulness. Our results guarantee that we can recover the latent causal model\nup to an equivalence class and predict the effect of unseen combinations of\ninterventions, in the limit of infinite data. We implement our causal\ndisentanglement framework by developing an autoencoding variational Bayes\nalgorithm and apply it to the problem of predicting combinatorial perturbation\neffects in genomics."}, "http://arxiv.org/abs/2311.05794": {"title": "An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits", "link": "http://arxiv.org/abs/2311.05794", "description": "Typically, multi-armed bandit (MAB) experiments are analyzed at the end of\nthe study and thus require the analyst to specify a fixed sample size in\nadvance. However, in many online learning applications, it is advantageous to\ncontinuously produce inference on the average treatment effect (ATE) between\narms as new data arrive and determine a data-driven stopping time for the\nexperiment. Existing work on continuous inference for adaptive experiments\nassumes that the treatment assignment probabilities are bounded away from zero\nand one, thus excluding nearly all standard bandit algorithms. 
In this work, we\ndevelop the Mixture Adaptive Design (MAD), a new experimental design for\nmulti-armed bandits that enables continuous inference on the ATE with\nguarantees on statistical validity and power for nearly any bandit algorithm.\nOn a high level, the MAD \"mixes\" a bandit algorithm of the user's choice with a\nBernoulli design through a tuning parameter $\\delta_t$, where $\\delta_t$ is a\ndeterministic sequence that controls the priority placed on the Bernoulli\ndesign as the sample size grows. We show that for $\\delta_t =\no\\left(1/t^{1/4}\\right)$, the MAD produces a confidence sequence that is\nasymptotically valid and guaranteed to shrink around the true ATE. We\nempirically show that the MAD improves the coverage and power of ATE inference\nin MAB experiments without significant losses in finite-sample reward."}, "http://arxiv.org/abs/2311.05806": {"title": "Likelihood ratio tests in random graph models with increasing dimensions", "link": "http://arxiv.org/abs/2311.05806", "description": "We explore the Wilks phenomena in two random graph models: the $\\beta$-model\nand the Bradley-Terry model. For two increasing dimensional null hypotheses,\nincluding a specified null $H_0: \\beta_i=\\beta_i^0$ for $i=1,\\ldots, r$ and a\nhomogenous null $H_0: \\beta_1=\\cdots=\\beta_r$, we reveal high dimensional\nWilks' phenomena that the normalized log-likelihood ratio statistic,\n$[2\\{\\ell(\\widehat{\\mathbf{\\beta}}) - \\ell(\\widehat{\\mathbf{\\beta}}^0)\\}\n-r]/(2r)^{1/2}$, converges in distribution to the standard normal distribution\nas $r$ goes to infinity. Here, $\\ell( \\mathbf{\\beta})$ is the log-likelihood\nfunction on the model parameter $\\mathbf{\\beta}=(\\beta_1, \\ldots,\n\\beta_n)^\\top$, $\\widehat{\\mathbf{\\beta}}$ is its maximum likelihood estimator\n(MLE) under the full parameter space, and $\\widehat{\\mathbf{\\beta}}^0$ is the\nrestricted MLE under the null parameter space. For the homogenous null with a\nfixed $r$, we establish Wilks-type theorems that\n$2\\{\\ell(\\widehat{\\mathbf{\\beta}}) - \\ell(\\widehat{\\mathbf{\\beta}}^0)\\}$\nconverges in distribution to a chi-square distribution with $r-1$ degrees of\nfreedom, as the total number of parameters, $n$, goes to infinity. When testing\nthe fixed dimensional specified null, we find that its asymptotic null\ndistribution is a chi-square distribution in the $\\beta$-model. However,\nunexpectedly, this is not true in the Bradley-Terry model. By developing\nseveral novel technical methods for asymptotic expansion, we explore Wilks type\nresults in a principled manner; these principled methods should be applicable\nto a class of random graph models beyond the $\\beta$-model and the\nBradley-Terry model. Simulation studies and real network data applications\nfurther demonstrate the theoretical results."}, "http://arxiv.org/abs/2311.05819": {"title": "A flexible framework for synthesizing human activity patterns with application to sequential categorical data", "link": "http://arxiv.org/abs/2311.05819", "description": "The ability to synthesize realistic data in a parametrizable way is valuable\nfor a number of reasons, including privacy, missing data imputation, and\nevaluating the performance of statistical and computational methods. When the\nunderlying data generating process is complex, data synthesis requires\napproaches that balance realism and simplicity. 
In this paper, we address the\nproblem of synthesizing sequential categorical data of the type that is\nincreasingly available from mobile applications and sensors that record\nparticipant status continuously over the course of multiple days and weeks. We\npropose the paired Markov Chain (paired-MC) method, a flexible framework that\nproduces sequences that closely mimic real data while providing a\nstraightforward mechanism for modifying characteristics of the synthesized\nsequences. We demonstrate the paired-MC method on two datasets, one reflecting\ndaily human activity patterns collected via a smartphone application, and one\nencoding the intensities of physical activity measured by wearable\naccelerometers. In both settings, sequences synthesized by paired-MC better\ncapture key characteristics of the real data than alternative approaches."}, "http://arxiv.org/abs/2311.05847": {"title": "Threshold distribution of equal states for quantitative amplitude fluctuations", "link": "http://arxiv.org/abs/2311.05847", "description": "Objective. The distribution of equal states (DES) quantifies amplitude\nfluctuations in biomedical signals. However, under certain conditions, such as\na high resolution of data collection or special signal processing techniques,\nequal states may be very rare, whereupon the DES fails to measure the amplitude\nfluctuations. Approach. To address this problem, we develop a novel threshold\nDES (tDES) that measures the distribution of differential states within a\nthreshold. To evaluate the proposed tDES, we first analyze five sets of\nsynthetic signals generated in different frequency bands. We then analyze sleep\nelectroencephalography (EEG) datasets taken from the public PhysioNet. Main\nresults. Synthetic signals and detrend-filtered sleep EEGs have no neighboring\nequal values; however, tDES can effectively measure the amplitude fluctuations\nwithin these data. The tDES of EEG data increases significantly as the sleep\nstage increases, even with datasets covering very short periods, indicating\ndecreased amplitude fluctuations in sleep EEGs. Generally speaking, the\npresence of more low-frequency components in a physiological series reflects\nsmaller amplitude fluctuations and larger DES. Significance.The tDES provides a\nreliable computing method for quantifying amplitude fluctuations, exhibiting\nthe characteristics of conceptual simplicity and computational robustness. Our\nfindings broaden the application of quantitative amplitude fluctuations and\ncontribute to the classification of sleep stages based on EEG data."}, "http://arxiv.org/abs/2311.05914": {"title": "Efficient Case-Cohort Design using Balanced Sampling", "link": "http://arxiv.org/abs/2311.05914", "description": "A case-cohort design is a two-phase sampling design frequently used to\nanalyze censored survival data in a cost-effective way, where a subcohort is\nusually selected using simple random sampling or stratified simple random\nsampling. In this paper, we propose an efficient sampling procedure based on\nbalanced sampling when selecting a subcohort in a case-cohort design. A sample\nselected via a balanced sampling procedure automatically calibrates auxiliary\nvariables. When fitting a Cox model, calibrating sampling weights has been\nshown to lead to more efficient estimators of the regression coefficients\n(Breslow et al., 2009a, b). The reduced variabilities over its counterpart with\na simple random sampling are shown via extensive simulation experiments. 
The\nproposed design and estimation procedure are also illustrated with the\nwell-known National Wilms Tumor Study dataset."}, "http://arxiv.org/abs/2311.06076": {"title": "Bayesian Tensor Factorisations for Time Series of Counts", "link": "http://arxiv.org/abs/2311.06076", "description": "We propose a flexible nonparametric Bayesian modelling framework for\nmultivariate time series of count data based on tensor factorisations. Our\nmodels can be viewed as infinite state space Markov chains of known maximal\norder with non-linear serial dependence through the introduction of appropriate\nlatent variables. Alternatively, our models can be viewed as Bayesian\nhierarchical models with conditionally independent Poisson distributed\nobservations. Inference about the important lags and their complex interactions\nis achieved via MCMC. When the observed counts are large, we deal with the\nresulting computational complexity of Bayesian inference via a two-step\ninferential strategy based on an initial analysis of a training set of the\ndata. Our methodology is illustrated using simulation experiments and analysis\nof real-world data."}, "http://arxiv.org/abs/2311.06086": {"title": "A three-step approach to production frontier estimation and the Matsuoka's distribution", "link": "http://arxiv.org/abs/2311.06086", "description": "In this work, we introduce a three-step semiparametric methodology for the\nestimation of production frontiers. We consider a model inspired by the\nwell-known Cobb-Douglas production function, wherein input factors operate\nmultiplicatively within the model. Efficiency in the proposed model is assumed\nto follow a continuous univariate uniparametric distribution in $(0,1)$,\nreferred to as Matsuoka's distribution, which is introduced and explored.\nFollowing model linearization, the first step of the procedure is to\nsemiparametrically estimate the regression function through a local linear\nsmoother. The second step focuses on the estimation of the efficiency parameter\nin which the properties of the Matsuoka's distribution are employed. Finally,\nwe estimate the production frontier through a plug-in methodology. We present a\nrigorous asymptotic theory related to the proposed three-step estimation,\nincluding consistency, and asymptotic normality, and derive rates for the\nconvergences presented. Incidentally, we also introduce and study the\nMatsuoka's distribution, deriving its main properties, including quantiles,\nmoments, $\\alpha$-expectiles, entropies, and stress-strength reliability, among\nothers. The Matsuoka's distribution exhibits a versatile array of shapes\ncapable of effectively encapsulating the typical behavior of efficiency within\nproduction frontier models. To complement the large sample results obtained, a\nMonte Carlo simulation study is conducted to assess the finite sample\nperformance of the proposed three-step methodology. An empirical application\nusing a dataset of Danish milk producers is also presented."}, "http://arxiv.org/abs/2311.06220": {"title": "Bayesian nonparametric generative modeling of large multivariate non-Gaussian spatial fields", "link": "http://arxiv.org/abs/2311.06220", "description": "Multivariate spatial fields are of interest in many applications, including\nclimate model emulation. Not only can the marginal spatial fields be subject to\nnonstationarity, but the dependence structure among the marginal fields and\nbetween the fields might also differ substantially. 
Extending a recently\nproposed Bayesian approach to describe the distribution of a nonstationary\nunivariate spatial field using a triangular transport map, we cast the\ninference problem for a multivariate spatial field for a small number of\nreplicates into a series of independent Gaussian process (GP) regression tasks\nwith Gaussian errors. Due to the potential nonlinearity in the conditional\nmeans, the joint distribution modeled can be non-Gaussian. The resulting\nnonparametric Bayesian methodology scales well to high-dimensional spatial\nfields. It is especially useful when only a few training samples are available,\nbecause it employs regularization priors and quantifies uncertainty. Inference\nis conducted in an empirical Bayes setting by a highly scalable stochastic\ngradient approach. The implementation benefits from mini-batching and could be\naccelerated with parallel computing. We illustrate the extended transport-map\nmodel by studying hydrological variables from non-Gaussian climate-model\noutput."}, "http://arxiv.org/abs/2207.12705": {"title": "Efficient shape-constrained inference for the autocovariance sequence from a reversible Markov chain", "link": "http://arxiv.org/abs/2207.12705", "description": "In this paper, we study the problem of estimating the autocovariance sequence\nresulting from a reversible Markov chain. A motivating application for studying\nthis problem is the estimation of the asymptotic variance in central limit\ntheorems for Markov chains. We propose a novel shape-constrained estimator of\nthe autocovariance sequence, which is based on the key observation that the\nrepresentability of the autocovariance sequence as a moment sequence imposes\ncertain shape constraints. We examine the theoretical properties of the\nproposed estimator and provide strong consistency guarantees for our estimator.\nIn particular, for geometrically ergodic reversible Markov chains, we show that\nour estimator is strongly consistent for the true autocovariance sequence with\nrespect to an $\\ell_2$ distance, and that our estimator leads to strongly\nconsistent estimates of the asymptotic variance. Finally, we perform empirical\nstudies to illustrate the theoretical properties of the proposed estimator as\nwell as to demonstrate the effectiveness of our estimator in comparison with\nother current state-of-the-art methods for Markov chain Monte Carlo variance\nestimation, including batch means, spectral variance estimators, and the\ninitial convex sequence estimator."}, "http://arxiv.org/abs/2209.04321": {"title": "Estimating Racial Disparities in Emergency General Surgery", "link": "http://arxiv.org/abs/2209.04321", "description": "Research documents that Black patients experience worse general surgery\noutcomes than white patients in the United States. In this paper, we focus on\nan important but less-examined category: the surgical treatment of emergency\ngeneral surgery (EGS) conditions, which refers to medical emergencies where the\ninjury is \"endogenous,\" such as a burst appendix. Our goal is to assess racial\ndisparities for common outcomes after EGS treatment using an administrative\ndatabase of hospital claims in New York, Florida, and Pennsylvania, and to\nunderstand the extent to which differences are attributable to patient-level\nrisk factors versus hospital-level factors. To do so, we use a class of linear\nweighting estimators that re-weight white patients to have a similar\ndistribution of baseline characteristics as Black patients. 
This framework\nnests many common approaches, including matching and linear regression, but\noffers important advantages over these methods in terms of controlling\nimbalance between groups, minimizing extrapolation, and reducing computation\ntime. Applying this approach to the claims data, we find that disparities\nestimates that adjust for the admitting hospital are substantially smaller than\nestimates that adjust for patient baseline characteristics only, suggesting\nthat hospital-specific factors are important drivers of racial disparities in\nEGS outcomes."}, "http://arxiv.org/abs/2210.12759": {"title": "Robust angle-based transfer learning in high dimensions", "link": "http://arxiv.org/abs/2210.12759", "description": "Transfer learning aims to improve the performance of a target model by\nleveraging data from related source populations, which is known to be\nespecially helpful in cases with insufficient target data. In this paper, we\nstudy the problem of how to train a high-dimensional ridge regression model\nusing limited target data and existing regression models trained in\nheterogeneous source populations. We consider a practical setting where only\nthe parameter estimates of the fitted source models are accessible, instead of\nthe individual-level source data. Under the setting with only one source model,\nwe propose a novel flexible angle-based transfer learning (angleTL) method,\nwhich leverages the concordance between the source and the target model\nparameters. We show that angleTL unifies several benchmark methods by\nconstruction, including the target-only model trained using target data alone,\nthe source model fitted on source data, and distance-based transfer learning\nmethod that incorporates the source parameter estimates and the target data\nunder a distance-based similarity constraint. We also provide algorithms to\neffectively incorporate multiple source models accounting for the fact that\nsome source models may be more helpful than others. Our high-dimensional\nasymptotic analysis provides interpretations and insights regarding when a\nsource model can be helpful to the target model, and demonstrates the\nsuperiority of angleTL over other benchmark methods. We perform extensive\nsimulation studies to validate our theoretical conclusions and show the\nfeasibility of applying angleTL to transfer existing genetic risk prediction\nmodels across multiple biobanks."}, "http://arxiv.org/abs/2211.04370": {"title": "NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation", "link": "http://arxiv.org/abs/2211.04370", "description": "Causal effect estimation from observational data is a central problem in\ncausal inference. Methods based on potential outcomes framework solve this\nproblem by exploiting inductive biases and heuristics from causal inference.\nEach of these methods addresses a specific aspect of causal effect estimation,\nsuch as controlling propensity score, enforcing randomization, etc., by\ndesigning neural network (NN) architectures and regularizers. In this paper, we\npropose an adaptive method called Neurosymbolic Causal Effect Estimator\n(NESTER), a generalized method for causal effect estimation. NESTER integrates\nthe ideas used in existing methods based on multi-head NNs for causal effect\nestimation into one framework. We design a Domain Specific Language (DSL)\ntailored for causal effect estimation based on causal inductive biases used in\nliterature. 
We conduct a theoretical analysis to investigate NESTER's efficacy\nin estimating causal effects. Our comprehensive empirical results show that\nNESTER performs better than state-of-the-art methods on benchmark datasets."}, "http://arxiv.org/abs/2211.13374": {"title": "A Multivariate Non-Gaussian Bayesian Filter Using Power Moments", "link": "http://arxiv.org/abs/2211.13374", "description": "In this paper, we extend our results on the univariate non-Gaussian Bayesian\nfilter using power moments to multivariate systems, which can be either\nlinear or nonlinear. Doing this introduces several challenging problems, for\nexample, a positive parametrization of the density surrogate, which is not only\na problem of filter design but also one related to the multidimensional Hamburger\nmoment problem. We propose a parametrization of the density surrogate with\nproofs of its existence, a Positivstellensatz, and uniqueness. Based on it, we\nanalyze the errors in the moments of the density estimates obtained with the proposed density\nsurrogate. A discussion of continuous and discrete treatments of the\nnon-Gaussian Bayesian filtering problem is presented to motivate research on\ncontinuous parametrization of the system state. Simulation results on\nestimating different types of multivariate density functions are given to\nvalidate our proposed filter. To the best of our knowledge, the proposed filter\nis the first to implement the multivariate Bayesian filter with the system\nstate parameterized as a continuous function, which only requires the true\nstates to be Lebesgue integrable."}, "http://arxiv.org/abs/2303.02637": {"title": "A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks", "link": "http://arxiv.org/abs/2303.02637", "description": "A classic inferential statistical problem is the goodness-of-fit (GOF) test.\nSuch a test can be challenging when the hypothesized parametric model has an\nintractable likelihood and its distributional form is not available. Bayesian\nmethods for GOF can be appealing due to their ability to incorporate expert\nknowledge through prior distributions.\n\nHowever, standard Bayesian methods for this test often require strong\ndistributional assumptions on the data and their relevant parameters. To\naddress this issue, we propose a semi-Bayesian nonparametric (semi-BNP)\nprocedure in the context of the maximum mean discrepancy (MMD) measure that can\nbe applied to the GOF test. Our method introduces a novel Bayesian estimator\nfor the MMD, enabling the development of a measure-based hypothesis test for\nintractable models. Through extensive experiments, we demonstrate that our\nproposed test outperforms frequentist MMD-based methods by achieving a lower\nfalse rejection and acceptance rate of the null hypothesis. Furthermore, we\nshowcase the versatility of our approach by embedding the proposed estimator\nwithin a generative adversarial network (GAN) framework. 
It facilitates a\nrobust BNP learning approach as another significant application of our method.\nWith our BNP procedure, this new GAN approach can enhance sample diversity and\nimprove inferential accuracy compared to traditional techniques."}, "http://arxiv.org/abs/2304.07113": {"title": "Causal inference with a functional outcome", "link": "http://arxiv.org/abs/2304.07113", "description": "This paper presents methods to study the causal effect of a binary treatment\non a functional outcome with observational data. We define a Functional Average\nTreatment Effect and develop an outcome regression estimator. We show how to\nobtain valid inference on the FATE using simultaneous confidence bands, which\ncover the FATE with a given probability over the entire domain. Simulation\nexperiments illustrate how the simultaneous confidence bands take the multiple\ncomparison problem into account. Finally, we use the methods to infer the\neffect of early adult location on subsequent income development for one Swedish\nbirth cohort."}, "http://arxiv.org/abs/2310.00233": {"title": "CausalImages: An R Package for Causal Inference with Earth Observation, Bio-medical, and Social Science Images", "link": "http://arxiv.org/abs/2310.00233", "description": "The causalimages R package enables causal inference with image and image\nsequence data, providing new tools for integrating novel data sources like\nsatellite and bio-medical imagery into the study of cause and effect. One set\nof functions enables image-based causal inference analyses. For example, one\nkey function decomposes treatment effect heterogeneity by images using an\ninterpretable Bayesian framework. This allows for determining which types of\nimages or image sequences are most responsive to interventions. A second\nmodeling function allows researchers to control for confounding using images.\nThe package also allows investigators to produce embeddings that serve as\nvector summaries of the image or video content. Finally, infrastructural\nfunctions are also provided, such as tools for writing large-scale image and\nimage sequence data as sequentialized byte strings for more rapid image\nanalysis. causalimages therefore opens new capabilities for causal inference in\nR, letting researchers use informative imagery in substantive analyses in a\nfast and accessible manner."}, "http://arxiv.org/abs/2311.06409": {"title": "Flexible joint models for multivariate longitudinal and time-to-event data using multivariate functional principal components", "link": "http://arxiv.org/abs/2311.06409", "description": "The joint modeling of multiple longitudinal biomarkers together with a\ntime-to-event outcome is a challenging modeling task of continued scientific\ninterest. In particular, the computational complexity of high dimensional\n(generalized) mixed effects models often restricts the flexibility of shared\nparameter joint models, even when the subject-specific marker trajectories\nfollow highly nonlinear courses. We propose a parsimonious multivariate\nfunctional principal components representation of the shared random effects.\nThis allows better scalability, as the dimension of the random effects does not\ndirectly increase with the number of markers, only with the chosen number of\nprincipal component basis functions used in the approximation of the random\neffects. The functional principal component representation additionally allows\nto estimate highly flexible subject-specific random trajectories without\nparametric assumptions. 
The modeled trajectories can thus be distinctly\ndifferent for each biomarker. We build on the framework of flexible Bayesian\nadditive joint models implemented in the R-package 'bamlss', which also\nsupports estimation of nonlinear covariate effects via Bayesian P-splines. The\nflexible yet parsimonious functional principal components basis used in the\nestimation of the joint model is first estimated in a preliminary step. We\nvalidate our approach in a simulation study and illustrate its advantages by\nanalyzing a study on primary biliary cholangitis."}, "http://arxiv.org/abs/2311.06412": {"title": "Online multiple testing with e-values", "link": "http://arxiv.org/abs/2311.06412", "description": "A scientist tests a continuous stream of hypotheses over time in the course\nof her investigation -- she does not test a predetermined, fixed number of\nhypotheses. The scientist wishes to make as many discoveries as possible while\nensuring the number of false discoveries is controlled -- a well recognized way\nfor accomplishing this is to control the false discovery rate (FDR). Prior\nmethods for FDR control in the online setting have focused on formulating\nalgorithms when specific dependency structures are assumed to exist between the\ntest statistics of each hypothesis. However, in practice, these dependencies\noften cannot be known beforehand or tested after the fact. Our algorithm,\ne-LOND, provides FDR control under arbitrary, possibly unknown, dependence. We\nshow that our method is more powerful than existing approaches to this problem\nthrough simulations. We also formulate extensions of this algorithm to utilize\nrandomization for increased power, and for constructing confidence intervals in\nonline selective inference."}, "http://arxiv.org/abs/2311.06415": {"title": "Long-Term Dagum-PVF Frailty Regression Model: Application in Health Studies", "link": "http://arxiv.org/abs/2311.06415", "description": "Survival models incorporating cure fractions, commonly known as cure fraction\nmodels or long-term survival models, are widely employed in epidemiological\nstudies to account for both immune and susceptible patients in relation to the\nfailure event of interest under investigation. In such studies, there is also a\nneed to estimate the unobservable heterogeneity caused by prognostic factors\nthat cannot be observed. Moreover, the hazard function may exhibit a\nnon-monotonic form, specifically, an unimodal hazard function. In this article,\nwe propose a long-term survival model based on the defective version of the\nDagum distribution, with a power variance function (PVF) frailty term\nintroduced in the hazard function to control for unobservable heterogeneity in\npatient populations, which is useful for accommodating survival data in the\npresence of a cure fraction and with a non-monotone hazard function. The\ndistribution is conveniently reparameterized in terms of the cure fraction, and\nthen associated with the covariates via a logit link function, enabling direct\ninterpretation of the covariate effects on the cure fraction, which is not\nusual in the defective approach. It is also proven a result that generates\ndefective models induced by PVF frailty distribution. We discuss maximum\nlikelihood estimation for model parameters and evaluate its performance through\nMonte Carlo simulation studies. 
Finally, the practicality and benefits of our\nmodel are demonstrated through two health-related datasets, focusing on severe\ncases of COVID-19 in pregnant and postpartum women and on patients with\nmalignant skin neoplasms."}, "http://arxiv.org/abs/2311.06458": {"title": "Conditional Adjustment in a Markov Equivalence Class", "link": "http://arxiv.org/abs/2311.06458", "description": "We consider the problem of identifying a conditional causal effect through\ncovariate adjustment. We focus on the setting where the causal graph is known\nup to one of two types of graphs: a maximally oriented partially directed\nacyclic graph (MPDAG) or a partial ancestral graph (PAG). Both MPDAGs and PAGs\nrepresent equivalence classes of possible underlying causal models. After\ndefining adjustment sets in this setting, we provide a necessary and sufficient\ngraphical criterion -- the conditional adjustment criterion -- for finding\nthese sets under conditioning on variables unaffected by treatment. We further\nprovide explicit sets from the graph that satisfy the conditional adjustment\ncriterion, and therefore, can be used as adjustment sets for conditional causal\neffect identification."}, "http://arxiv.org/abs/2311.06537": {"title": "Is Machine Learning Unsafe and Irresponsible in Social Sciences? Paradoxes and Reconsidering from Recidivism Prediction Tasks", "link": "http://arxiv.org/abs/2311.06537", "description": "The paper addresses some fundamental and hotly debated issues for high-stakes\nevent predictions underpinning the computational approach to social sciences.\nWe question several prevalent views against machine learning and outline a new\nparadigm that highlights the promises and promotes the infusion of\ncomputational methods and conventional social science approaches."}, "http://arxiv.org/abs/2311.06590": {"title": "Optimal resource allocation: Convex quantile regression approach", "link": "http://arxiv.org/abs/2311.06590", "description": "Optimal allocation of resources across sub-units in the context of\ncentralized decision-making systems such as bank branches or supermarket chains\nis a classical application of operations research and management science. In\nthis paper, we develop quantile allocation models to examine how much the\noutput and productivity could potentially increase if the resources were\nefficiently allocated between units. We increase robustness to random noise and\nheteroscedasticity by utilizing the local estimation of multiple production\nfunctions using convex quantile regression. The quantile allocation models then\nrely on the estimated shadow prices instead of detailed data of units and allow\nthe entry and exit of units. Our empirical results on Finland's business sector\nreveal a large potential for productivity gains through better allocation,\nkeeping the current technology and resources fixed."}, "http://arxiv.org/abs/2311.06681": {"title": "SpICE: An interpretable method for spatial data", "link": "http://arxiv.org/abs/2311.06681", "description": "Statistical learning methods are widely utilized in tackling complex problems\ndue to their flexibility, good predictive performance and their ability to\ncapture complex relationships among variables. Additionally, recently developed\nautomatic workflows have provided a standardized approach to implementing\nstatistical learning methods across various applications. However, these tools\nhighlight a main drawback of statistical learning: the lack of interpretability\nof their results. 
In the past few years, a substantial amount of research has\nfocused on methods for interpreting black-box models. Having interpretable\nstatistical learning methods is important for gaining a deeper understanding of the\nmodel. In problems where spatial information is relevant, combining interpretable\nmethods with spatial data can help to achieve a better understanding of the problem\nand interpretation of the results.\n\nThis paper focuses on the individual conditional expectation plot (ICE-plot), a\nmodel-agnostic method for interpreting statistical learning models, and\ncombines it with spatial information. An ICE-plot extension is proposed in which\nspatial information is used as a restriction to define Spatial ICE curves\n(SpICE). Spatial ICE curves are estimated using real data in the context of an\neconomic problem concerning property valuation in Montevideo, Uruguay.\nUnderstanding the key factors that influence property valuation is essential\nfor decision-making, and spatial data plays a relevant role in this regard."}, "http://arxiv.org/abs/2311.06719": {"title": "Efficient Multiple-Robust Estimation for Nonresponse Data Under Informative Sampling", "link": "http://arxiv.org/abs/2311.06719", "description": "Nonresponse after probability sampling is a universal challenge in survey\nsampling, often necessitating adjustments to mitigate sampling and selection\nbias simultaneously. This study explores the removal of bias and effective\nutilization of available information, not just in nonresponse but also in the\nscenario of data integration, where summary statistics from other data sources\nare accessible. We reformulate these settings within a two-step monotone\nmissing data framework, where the first step of missingness arises from\nsampling and the second originates from nonresponse. Subsequently, we derive\nthe semiparametric efficiency bound for the target parameter. We also propose\nadaptive estimators utilizing methods of moments and empirical likelihood\napproaches to attain the lower bound. The proposed estimator exhibits both\nefficiency and double robustness. However, attaining efficiency with an\nadaptive estimator requires the correct specification of certain working\nmodels. To reinforce robustness against the misspecification of working models,\nwe extend the property of double robustness to multiple robustness by proposing\na two-step empirical likelihood method that effectively leverages empirical\nweights. A numerical study is undertaken to investigate the finite-sample\nperformance of the proposed methods. We further applied our methods to a\ndataset from the National Health and Nutrition Examination Survey data by\nefficiently incorporating summary statistics from the National Health Interview\nSurvey data."}, "http://arxiv.org/abs/2311.06840": {"title": "Distribution Re-weighting and Voting Paradoxes", "link": "http://arxiv.org/abs/2311.06840", "description": "We explore a specific type of distribution shift called domain expertise, in\nwhich training is limited to a subset of all possible labels. This setting is\ncommon among specialized human experts, or specific focused studies. We show\nhow the standard approach to distribution shift, which involves re-weighting\ndata, can result in paradoxical disagreements among differing domain expertise.\nWe also demonstrate how standard adjustments for causal inference lead to the\nsame paradox. 
We prove that the characteristics of these paradoxes exactly\nmimic another set of paradoxes which arise among sets of voter preferences."}, "http://arxiv.org/abs/2311.06928": {"title": "Attention for Causal Relationship Discovery from Biological Neural Dynamics", "link": "http://arxiv.org/abs/2311.06928", "description": "This paper explores the potential of transformer models for learning\nGranger causality in networks with complex nonlinear dynamics at every node, as\nin neurobiological and biophysical networks. Our study primarily focuses on a\nproof-of-concept investigation based on simulated neural dynamics, for which\nthe ground-truth causality is known through the underlying connectivity matrix.\nFor transformer models trained to forecast neuronal population dynamics, we\nshow that the cross attention module effectively captures the causal\nrelationship among neurons, with an accuracy equal to or superior to that of the\nmost popular Granger causality analysis method. While we acknowledge that\nreal-world neurobiology data will bring further challenges, including dynamic\nconnectivity and unobserved variability, this research offers an encouraging\npreliminary glimpse into the utility of the transformer model for causal\nrepresentation learning in neuroscience."}, "http://arxiv.org/abs/2311.06945": {"title": "An Efficient Approach for Identifying Important Biomarkers for Biomedical Diagnosis", "link": "http://arxiv.org/abs/2311.06945", "description": "In this paper, we explore the challenges associated with biomarker\nidentification for diagnostic purposes in biomedical experiments, and propose a\nnovel approach to handle this challenging scenario via a generalization\nof the Dantzig selector. To improve the efficiency of the regularization\nmethod, we introduce a transformation that converts the inherently nonlinear program,\narising from the nonlinear link function, into a linear programming framework. We\nillustrate the use of our method on an experiment with binary response,\nshowing superior performance in biomarker identification studies when compared\nto conventional analyses. Our proposed method does not merely serve as a\nvariable/biomarker selection tool; its ranking of variable importance provides\nvaluable reference information for practitioners to reach informed decisions\nregarding the prioritization of factors for further investigation."}, "http://arxiv.org/abs/2311.07034": {"title": "Regularized Halfspace Depth for Functional Data", "link": "http://arxiv.org/abs/2311.07034", "description": "Data depth is a powerful nonparametric tool originally proposed to rank\nmultivariate data from the center outward. In this context, one of the most\narchetypical depth notions is Tukey's halfspace depth. In the last few decades,\nnotions of depth have also been proposed for functional data. However, Tukey's\ndepth cannot be extended to handle functional data because of its degeneracy.\nHere, we propose a new halfspace depth for functional data which avoids\ndegeneracy by regularization. The halfspace projection directions are\nconstrained to have a small reproducing kernel Hilbert space norm. Desirable\ntheoretical properties of the proposed depth, such as isometry invariance,\nmaximality at the center, monotonicity relative to a deepest point, upper\nsemi-continuity, and consistency are established. Moreover, the regularized\nhalfspace depth can rank functional data with varying emphasis in shape or\nmagnitude, depending on the regularization. 
A new outlier detection approach is\nalso proposed, which is capable of detecting both shape and magnitude outliers.\nIt is applicable to trajectories in L2, a very general space of functions that\ninclude non-smooth trajectories. Based on extensive numerical studies, our\nmethods are shown to perform well in terms of detecting outliers of different\ntypes. Three real data examples showcase the proposed depth notion."}, "http://arxiv.org/abs/2311.07156": {"title": "Deep mixture of linear mixed models for complex longitudinal data", "link": "http://arxiv.org/abs/2311.07156", "description": "Mixtures of linear mixed models are widely used for modelling longitudinal\ndata for which observation times differ between subjects. In typical\napplications, temporal trends are described using a basis expansion, with basis\ncoefficients treated as random effects varying by subject. Additional random\neffects can describe variation between mixture components, or other known\nsources of variation in complex experimental designs. A key advantage of these\nmodels is that they provide a natural mechanism for clustering, which can be\nhelpful for interpretation in many applications. Current versions of mixtures\nof linear mixed models are not specifically designed for the case where there\nare many observations per subject and a complex temporal trend, which requires\na large number of basis functions to capture. In this case, the\nsubject-specific basis coefficients are a high-dimensional random effects\nvector, for which the covariance matrix is hard to specify and estimate,\nespecially if it varies between mixture components. To address this issue, we\nconsider the use of recently-developed deep mixture of factor analyzers models\nas the prior for the random effects. The resulting deep mixture of linear mixed\nmodels is well-suited to high-dimensional settings, and we describe an\nefficient variational inference approach to posterior computation. The efficacy\nof the method is demonstrated on both real and simulated data."}, "http://arxiv.org/abs/2311.07371": {"title": "Scalable Estimation for Structured Additive Distributional Regression Through Variational Inference", "link": "http://arxiv.org/abs/2311.07371", "description": "Structured additive distributional regression models offer a versatile\nframework for estimating complete conditional distributions by relating all\nparameters of a parametric distribution to covariates. Although these models\nefficiently leverage information in vast and intricate data sets, they often\nresult in highly-parameterized models with many unknowns. Standard estimation\nmethods, like Bayesian approaches based on Markov chain Monte Carlo methods,\nface challenges in estimating these models due to their complexity and\ncostliness. To overcome these issues, we suggest a fast and scalable\nalternative based on variational inference. Our approach combines a\nparsimonious parametric approximation for the posteriors of regression\ncoefficients, with the exact conditional posterior for hyperparameters. For\noptimization, we use a stochastic gradient ascent method combined with an\nefficient strategy to reduce the variance of estimators. We provide theoretical\nproperties and investigate global and local annealing to enhance robustness,\nparticularly against data outliers. Our implementation is very general,\nallowing us to include various functional effects like penalized splines or\ncomplex tensor product interactions. 
In a simulation study, we demonstrate the\nefficacy of our approach in terms of accuracy and computation time. Lastly, we\npresent two real examples illustrating the modeling of infectious COVID-19\noutbreaks and outlier detection in brain activity."}, "http://arxiv.org/abs/2311.07474": {"title": "A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals", "link": "http://arxiv.org/abs/2311.07474", "description": "Most prognostic methods require a decent amount of data for model training.\nIn reality, however, the amount of historical data owned by a single\norganization might be small or not large enough to train a reliable prognostic\nmodel. To address this challenge, this article proposes a federated prognostic\nmodel that allows multiple users to jointly construct a failure time prediction\nmodel using their multi-stream, high-dimensional, and incomplete data while\nkeeping each user's data local and confidential. The prognostic model first\nemploys multivariate functional principal component analysis to fuse the\nmulti-stream degradation signals. Then, the fused features coupled with the\ntimes-to-failure are utilized to build a (log)-location-scale regression model\nfor failure prediction. To estimate parameters using distributed datasets and\nkeep the data privacy of all participants, we propose a new federated algorithm\nfor feature extraction. Numerical studies indicate that the performance of the\nproposed model is the same as that of classic non-federated prognostic models\nand is better than that of the models constructed by each user itself."}, "http://arxiv.org/abs/2311.07511": {"title": "Machine learning for uncertainty estimation in fusing precipitation observations from satellites and ground-based gauges", "link": "http://arxiv.org/abs/2311.07511", "description": "To form precipitation datasets that are accurate and, at the same time, have\nhigh spatial densities, data from satellites and gauges are often merged in the\nliterature. However, uncertainty estimates for the data acquired in this manner\nare scarcely provided, although the importance of uncertainty quantification in\npredictive modelling is widely recognized. Furthermore, the benefits that\nmachine learning can bring to the task of providing such estimates have not\nbeen broadly realized and properly explored through benchmark experiments. The\npresent study aims at filling in this specific gap by conducting the first\nbenchmark tests on the topic. On a large dataset that comprises 15-year-long\nmonthly data spanning across the contiguous United States, we extensively\ncompared six learners that are, by their construction, appropriate for\npredictive uncertainty quantification. These are the quantile regression (QR),\nquantile regression forests (QRF), generalized random forests (GRF), gradient\nboosting machines (GBM), light gradient boosting machines (LightGBM) and\nquantile regression neural networks (QRNN). The comparison referred to the\ncompetence of the learners in issuing predictive quantiles at nine levels that\nfacilitate a good approximation of the entire predictive probability\ndistribution, and was primarily based on the quantile and continuous ranked\nprobability skill scores. Three types of predictor variables (i.e., satellite\nprecipitation variables, distances between a point of interest and satellite\ngrid points, and elevation at a point of interest) were used in the comparison\nand were additionally compared with each other. 
This additional comparison was\nbased on the explainable machine learning concept of feature importance. The\nresults suggest that the order from the best to the worst of the learners for\nthe task investigated is the following: LightGBM, QRF, GRF, GBM, QRNN and QR..."}, "http://arxiv.org/abs/2311.07524": {"title": "The Link Between Health Insurance Coverage and Citizenship Among Immigrants: Bayesian Unit-Level Regression Modeling of Categorical Survey Data Observed with Measurement Error", "link": "http://arxiv.org/abs/2311.07524", "description": "Social scientists are interested in studying the impact that citizenship\nstatus has on health insurance coverage among immigrants in the United States.\nThis can be done using data from the Survey of Income and Program Participation\n(SIPP); however, two primary challenges emerge. First, statistical models must\naccount for the survey design in some fashion to reduce the risk of bias due to\ninformative sampling. Second, it has been observed that survey respondents\nmisreport citizenship status at nontrivial rates. This too can induce bias\nwithin a statistical model. Thus, we propose the use of a weighted\npseudo-likelihood mixture of categorical distributions, where the mixture\ncomponent is determined by the latent true response variable, in order to model\nthe misreported data. We illustrate through an empirical simulation study that\nthis approach can mitigate the two sources of bias attributable to the sample\ndesign and misreporting. Importantly, our misreporting model can be further\nused as a component in a deeper hierarchical model. With this in mind, we\nconduct an analysis of the relationship between health insurance coverage and\ncitizenship status using data from the SIPP."}, "http://arxiv.org/abs/2110.10195": {"title": "Operator-induced structural variable selection for identifying materials genes", "link": "http://arxiv.org/abs/2110.10195", "description": "In the emerging field of materials informatics, a fundamental task is to\nidentify physicochemically meaningful descriptors, or materials genes, which\nare engineered from primary features and a set of elementary algebraic\noperators through compositions. Standard practice directly analyzes the\nhigh-dimensional candidate predictor space in a linear model; statistical\nanalyses are then substantially hampered by the daunting challenge posed by the\nastronomically large number of correlated predictors with limited sample size.\nWe formulate this problem as variable selection with operator-induced structure\n(OIS) and propose a new method to achieve unconventional dimension reduction by\nutilizing the geometry embedded in OIS. Although the model remains linear, we\niterate nonparametric variable selection for effective dimension reduction.\nThis enables variable selection based on ab initio primary features, leading to\na method that is orders of magnitude faster than existing methods, with\nimproved accuracy. To select the nonparametric module, we discuss a desired\nperformance criterion that is uniquely induced by variable selection with OIS;\nin particular, we propose to employ a Bayesian Additive Regression Trees\n(BART)-based variable selection method. Numerical studies show superiority of\nthe proposed method, which continues to exhibit robust performance when the\ninput dimension is out of reach of existing methods. 
Our analysis of\nsingle-atom catalysis identifies physical descriptors that explain the binding\nenergy of metal-support pairs with high explanatory power, leading to\ninterpretable insights to guide the prevention of a notorious problem called\nsintering and aid catalysis design."}, "http://arxiv.org/abs/2204.12699": {"title": "Randomness of Shapes and Statistical Inference on Shapes via the Smooth Euler Characteristic Transform", "link": "http://arxiv.org/abs/2204.12699", "description": "In this article, we establish the mathematical foundations for modeling the\nrandomness of shapes and conducting statistical inference on shapes using the\nsmooth Euler characteristic transform. Based on these foundations, we propose\ntwo parametric algorithms for testing hypotheses on random shapes. Simulation\nstudies are presented to validate our mathematical derivations and to compare\nour algorithms with state-of-the-art methods to demonstrate the utility of our\nproposed framework. As real applications, we analyze a data set of mandibular\nmolars from four genera of primates and show that our algorithms have the power\nto detect significant shape differences that recapitulate known morphological\nvariation across suborders. Altogether, our discussions bridge the following\nfields: algebraic and computational topology, probability theory and stochastic\nprocesses, Sobolev spaces and functional analysis, statistical inference, and\ngeometric morphometrics."}, "http://arxiv.org/abs/2207.05019": {"title": "Covariate-adaptive randomization inference in matched designs", "link": "http://arxiv.org/abs/2207.05019", "description": "It is common to conduct causal inference in matched observational studies by\nproceeding as though treatment assignments within matched sets are assigned\nuniformly at random and using this distribution as the basis for inference.\nThis approach ignores observed discrepancies in matched sets that may be\nconsequential for the distribution of treatment, which are succinctly captured\nby within-set differences in the propensity score. We address this problem via\ncovariate-adaptive randomization inference, which modifies the permutation\nprobabilities to vary with estimated propensity score discrepancies and avoids\nrequirements to exclude matched pairs or model an outcome variable. We show\nthat the test achieves type I error control arbitrarily close to the nominal\nlevel when large samples are available for propensity score estimation. We\ncharacterize the large-sample behavior of the new randomization test for a\ndifference-in-means estimator of a constant additive effect. We also show that\nexisting methods of sensitivity analysis generalize effectively to\ncovariate-adaptive randomization inference. Finally, we evaluate the empirical\nvalue of covariate-adaptive randomization procedures via comparisons to\ntraditional uniform inference in matched designs with and without propensity\nscore calipers and regression adjustment using simulations and analyses of\ngenetic damage among welders and right-heart catheterization in surgical\npatients."}, "http://arxiv.org/abs/2209.07091": {"title": "A new Kernel Regression approach for Robustified $L_2$ Boosting", "link": "http://arxiv.org/abs/2209.07091", "description": "We investigate $L_2$ boosting in the context of kernel regression. 
Kernel\nsmoothers, in general, lack appealing traits like symmetry and positive\ndefiniteness, which are critical not only for understanding theoretical aspects\nbut also for achieving good practical performance. We consider a\nprojection-based smoother (Huang and Chen, 2008) that is symmetric, positive\ndefinite, and shrinking. Theoretical results based on the orthonormal\ndecomposition of the smoother reveal additional insights into the boosting\nalgorithm. In our asymptotic framework, we may replace the full-rank smoother\nwith a low-rank approximation. We demonstrate that the smoother's low-rank\n($d(n)$) is bounded above by $O(h^{-1})$, where $h$ is the bandwidth. Our\nnumerical findings show that, in terms of prediction accuracy, low-rank\nsmoothers may outperform full-rank smoothers. Furthermore, we show that the\nboosting estimator with low-rank smoother achieves the optimal convergence\nrate. Finally, to improve the performance of the boosting algorithm in the\npresence of outliers, we propose a novel robustified boosting algorithm which\ncan be used with any smoother discussed in the study. We investigate the\nnumerical performance of the proposed approaches using simulations and a\nreal-world case."}, "http://arxiv.org/abs/2210.01757": {"title": "Transportability of model-based estimands in evidence synthesis", "link": "http://arxiv.org/abs/2210.01757", "description": "In evidence synthesis, effect modifiers are typically described as variables\nthat induce treatment effect heterogeneity at the individual level, through\ntreatment-covariate interactions in an outcome model parametrized at such\nlevel. As such, effect modification is defined with respect to a conditional\nmeasure, but marginal effect estimates are required for population-level\ndecisions in health technology assessment. For non-collapsible measures, purely\nprognostic variables that are not determinants of treatment response at the\nindividual level may modify marginal effects, even where there is\nindividual-level treatment effect homogeneity. With heterogeneity, marginal\neffects for measures that are not directly collapsible cannot be expressed in\nterms of marginal covariate moments, and generally depend on the joint\ndistribution of conditional effect measure modifiers and purely prognostic\nvariables. There are implications for recommended practices in evidence\nsynthesis. Unadjusted anchored indirect comparisons can be biased in the\nabsence of individual-level treatment effect heterogeneity, or when marginal\ncovariate moments are balanced across studies. Covariate adjustment may be\nnecessary to account for cross-study imbalances in joint covariate\ndistributions involving purely prognostic variables. In the absence of\nindividual patient data for the target, covariate adjustment approaches are\ninherently limited in their ability to remove bias for measures that are not\ndirectly collapsible. 
Directly collapsible measures would facilitate the\ntransportability of marginal effects between studies by: (1) reducing\ndependence on model-based covariate adjustment where there is individual-level\ntreatment effect homogeneity and marginal covariate moments are balanced; and\n(2) facilitating the selection of baseline covariates for adjustment where\nthere is individual-level treatment effect heterogeneity."}, "http://arxiv.org/abs/2212.01900": {"title": "Bayesian survival analysis with INLA", "link": "http://arxiv.org/abs/2212.01900", "description": "This tutorial shows how various Bayesian survival models can be fitted using\nthe integrated nested Laplace approximation in a clear, legible, and\ncomprehensible manner using the INLA and INLAjoint R-packages. Such models\ninclude accelerated failure time, proportional hazards, mixture cure, competing\nrisks, multi-state, frailty, and joint models of longitudinal and survival\ndata, originally presented in the article \"Bayesian survival analysis with\nBUGS\" (Alvares et al., 2021). In addition, we illustrate the implementation of\na new joint model for a longitudinal semicontinuous marker, recurrent events,\nand a terminal event. Our proposal aims to provide the reader with syntax\nexamples for implementing survival models using a fast and accurate approximate\nBayesian inferential approach."}, "http://arxiv.org/abs/2301.03661": {"title": "Generative Quantile Regression with Variability Penalty", "link": "http://arxiv.org/abs/2301.03661", "description": "Quantile regression and conditional density estimation can reveal structure\nthat is missed by mean regression, such as multimodality and skewness. In this\npaper, we introduce a deep learning generative model for joint quantile\nestimation called Penalized Generative Quantile Regression (PGQR). Our approach\nsimultaneously generates samples from many random quantile levels, allowing us\nto infer the conditional distribution of a response variable given a set of\ncovariates. Our method employs a novel variability penalty to avoid the problem\nof vanishing variability, or memorization, in deep generative models. Further,\nwe introduce a new family of partial monotonic neural networks (PMNN) to\ncircumvent the problem of crossing quantile curves. A major benefit of PGQR is\nthat it can be fit using a single optimization, thus bypassing the need to\nrepeatedly train the model at multiple quantile levels or use computationally\nexpensive cross-validation to tune the penalty parameter. We illustrate the\nefficacy of PGQR through extensive simulation studies and analysis of real\ndatasets. Code to implement our method is available at\nhttps://github.com/shijiew97/PGQR."}, "http://arxiv.org/abs/2302.09526": {"title": "Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators", "link": "http://arxiv.org/abs/2302.09526", "description": "We present a methodology for using unlabeled data to design semi supervised\nlearning (SSL) methods that improve the prediction performance of supervised\nlearning for regression tasks. 
The main idea is to design different mechanisms\nfor integrating the unlabeled data, and include in each of them a mixing\nparameter $\\alpha$, controlling the weight given to the unlabeled data.\nFocusing on Generalized Linear Models (GLM) and linear interpolators classes of\nmodels, we analyze the characteristics of different mixing mechanisms, and\nprove that in all cases, it is invariably beneficial to integrate the unlabeled\ndata with some nonzero mixing ratio $\\alpha>0$, in terms of predictive\nperformance. Moreover, we provide a rigorous framework to estimate the best\nmixing ratio $\\alpha^*$ where mixed SSL delivers the best predictive\nperformance, while using the labeled and unlabeled data on hand.\n\nThe effectiveness of our methodology in delivering substantial improvement\ncompared to the standard supervised models, in a variety of settings, is\ndemonstrated empirically through extensive simulation, in a manner that\nsupports the theoretical analysis. We also demonstrate the applicability of our\nmethodology (with some intuitive modifications) to improve more complex models,\nsuch as deep neural networks, in real-world regression tasks."}, "http://arxiv.org/abs/2305.13421": {"title": "Sequential Estimation using Hierarchically Stratified Domains with Latin Hypercube Sampling", "link": "http://arxiv.org/abs/2305.13421", "description": "Quantifying the effect of uncertainties in systems where only point\nevaluations in the stochastic domain but no regularity conditions are available\nis limited to sampling-based techniques. This work presents an adaptive\nsequential stratification estimation method that uses Latin Hypercube Sampling\nwithin each stratum. The adaptation is achieved through a sequential\nhierarchical refinement of the stratification, guided by previous estimators\nusing local (i.e., stratum-dependent) variability indicators based on\ngeneralized polynomial chaos expansions and Sobol decompositions. For a given\ntotal number of samples $N$, the corresponding hierarchically constructed\nsequence of Stratified Sampling estimators combined with Latin Hypercube\nsampling is adequately averaged to provide a final estimator with reduced\nvariance. Numerical experiments illustrate the procedure's efficiency,\nindicating that it can offer a variance decay proportional to $N^{-2}$ in some\ncases."}, "http://arxiv.org/abs/2306.08794": {"title": "Quantile autoregressive conditional heteroscedasticity", "link": "http://arxiv.org/abs/2306.08794", "description": "This paper proposes a novel conditional heteroscedastic time series model by\napplying the framework of quantile regression processes to the ARCH(\\infty)\nform of the GARCH model. This model can provide varying structures for\nconditional quantiles of the time series across different quantile levels,\nwhile including the commonly used GARCH model as a special case. The strict\nstationarity of the model is discussed. For robustness against heavy-tailed\ndistributions, a self-weighted quantile regression (QR) estimator is proposed.\nWhile QR performs satisfactorily at intermediate quantile levels, its accuracy\ndeteriorates at high quantile levels due to data scarcity. As a remedy, a\nself-weighted composite quantile regression (CQR) estimator is further\nintroduced and, based on an approximate GARCH model with a flexible\nTukey-lambda distribution for the innovations, we can extrapolate the high\nquantile levels by borrowing information from intermediate ones. 
Asymptotic\nproperties for the proposed estimators are established. Simulation experiments\nare carried out to assess the finite sample performance of the proposed\nmethods, and an empirical example is presented to illustrate the usefulness of\nthe new model."}, "http://arxiv.org/abs/2309.00948": {"title": "Marginalised Normal Regression: Unbiased curve fitting in the presence of x-errors", "link": "http://arxiv.org/abs/2309.00948", "description": "The history of the seemingly simple problem of straight line fitting in the\npresence of both $x$ and $y$ errors has been fraught with misadventure, with\nstatistically ad hoc and poorly tested methods abounding in the literature. The\nproblem stems from the emergence of latent variables describing the \"true\"\nvalues of the independent variables, the priors on which have a significant\nimpact on the regression result. By analytic calculation of maximum a\nposteriori values and biases, and comprehensive numerical mock tests, we assess\nthe quality of possible priors. In the presence of intrinsic scatter, the only\nprior that we find to give reliably unbiased results in general is a mixture of\none or more Gaussians with means and variances determined as part of the\ninference. We find that a single Gaussian is typically sufficient and dub this\nmodel Marginalised Normal Regression (MNR). We illustrate the necessity for MNR\nby comparing it to alternative methods on an important linear relation in\ncosmology, and extend it to nonlinear regression and an arbitrary covariance\nmatrix linking $x$ and $y$. We publicly release a Python/Jax implementation of\nMNR and its Gaussian mixture model extension that is coupled to Hamiltonian\nMonte Carlo for efficient sampling, which we call ROXY (Regression and\nOptimisation with X and Y errors)."}, "http://arxiv.org/abs/2311.07733": {"title": "Credible Intervals for Probability of Failure with Gaussian Processes", "link": "http://arxiv.org/abs/2311.07733", "description": "Efficiently approximating the probability of system failure has gained\nincreasing importance as expensive simulations begin to play a larger role in\nreliability quantification tasks in areas such as structural design, power grid\ndesign, and safety certification among others. This work derives credible\nintervals on the probability of failure for a simulation which we assume is a\nrealization of a Gaussian process. We connect these intervals to binary\nclassification error and comment on their applicability to a broad class of\niterative schemes proposed throughout the literature. A novel iterative\nsampling scheme is proposed which can suggest multiple samples per batch for\nsimulations with parallel implementations. 
We empirically test our scalable,\nopen-source implementation on a variety of simulations including a Tsunami model\nwhere failure is quantified in terms of maximum wave height."}, "http://arxiv.org/abs/2311.07736": {"title": "Use of Equivalent Relative Utility (ERU) to Evaluate Artificial Intelligence-Enabled Rule-Out Devices", "link": "http://arxiv.org/abs/2311.07736", "description": "We investigated the use of equivalent relative utility (ERU) to evaluate the\neffectiveness of artificial intelligence (AI)-enabled rule-out devices that use\nAI to identify and autonomously remove non-cancer patient images from\nradiologist review in screening mammography. We reviewed two performance metrics\nthat can be used to compare the diagnostic performance between the\nradiologist-with-rule-out-device and radiologist-without-device workflows:\npositive/negative predictive values (PPV/NPV) and equivalent relative utility\n(ERU). To demonstrate the use of the two evaluation metrics, we applied both\nmethods to a recent US-based study that reported an improved performance of the\nradiologist-with-device workflow compared to the one without the device by\nretrospectively applying their AI algorithm to a large mammography dataset. We\nfurther applied the ERU method to a European study utilizing their reported\nrecall rates and cancer detection rates at different thresholds of their AI\nalgorithm to compare the potential utility among different thresholds. For the\nstudy using US data, neither the PPV/NPV nor the ERU method can conclude a\nsignificant improvement in diagnostic performance for any of the algorithm\nthresholds reported. For the study using European data, ERU values at lower AI\nthresholds are found to be higher than those at a higher threshold because more\nfalse-negative cases would be ruled out at a higher threshold, reducing the\noverall diagnostic performance. Both PPV/NPV and ERU methods can be used to\ncompare the diagnostic performance between the radiologist-with-device workflow\nand that without. One limitation of the ERU method is the need to measure the\nbaseline, standard-of-care relative utility (RU) value for mammography\nscreening in the US. Once the baseline value is known, the ERU method can be\napplied to large US datasets without knowing the true prevalence of the\ndataset."}, "http://arxiv.org/abs/2311.07752": {"title": "Doubly Robust Estimation under Possibly Misspecified Marginal Structural Cox Model", "link": "http://arxiv.org/abs/2311.07752", "description": "In this paper we address the challenges posed by non-proportional hazards and\ninformative censoring, offering a path toward more meaningful causal inference\nconclusions. We start from the marginal structural Cox model, which has been\nwidely used for analyzing observational studies with survival outcomes, and\ntypically relies on the inverse probability weighting method. The latter hinges\nupon a propensity score model for the treatment assignment, and a censoring\nmodel which incorporates both the treatment and the covariates. In such\nsettings, model misspecification can occur quite effortlessly, and the Cox\nregression model's non-collapsibility has historically posed challenges when\nstriving to guard against model misspecification through augmentation. We\nintroduce an augmented inverse probability weighted estimator which, enriched\nwith doubly robust properties, paves the way for integrating machine learning\nand a plethora of nonparametric methods, effectively overcoming the challenges\nof non-collapsibility. 
The estimator extends naturally to estimating a\ntime-average treatment effect when the proportional hazards assumption fails.\nWe closely examine its theoretical and practical performance, showing that it\nsatisfies both the assumption-lean and the well-specification criteria\ndiscussed in the recent literature. Finally, its application to a dataset\nreveals insights into the impact of mid-life alcohol consumption on mortality\nin later life."}, "http://arxiv.org/abs/2311.07762": {"title": "Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data", "link": "http://arxiv.org/abs/2311.07762", "description": "A mixture of multivariate Poisson-log normal factor analyzers is introduced\nby imposing constraints on the covariance matrix, resulting in flexible\nmodels for clustering purposes. In particular, a class of eight parsimonious\nmixture models based on the mixtures of factor analyzers model is introduced.\nVariational Gaussian approximation is used for parameter estimation, and\ninformation criteria are used for model selection. The proposed models are\nexplored in the context of clustering discrete data arising from RNA sequencing\nstudies. Using real and simulated data, the models are shown to give favourable\nclustering performance. The GitHub R package for this work is available at\nhttps://github.com/anjalisilva/mixMPLNFA and is released under the open-source\nMIT license."}, "http://arxiv.org/abs/2311.07793": {"title": "The brain uses renewal points to model random sequences of stimuli", "link": "http://arxiv.org/abs/2311.07793", "description": "It has been classically conjectured that the brain assigns probabilistic\nmodels to sequences of stimuli. An important issue associated with this\nconjecture is the identification of the classes of models used by the brain to\nperform this task. We address this issue by using a new clustering procedure\nfor sets of electroencephalographic (EEG) data recorded from participants\nexposed to a sequence of auditory stimuli generated by a stochastic chain. This\nclustering procedure indicates that the brain uses renewal points in the\nstochastic sequence of auditory stimuli in order to build a model."}, "http://arxiv.org/abs/2311.07906": {"title": "Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects", "link": "http://arxiv.org/abs/2311.07906", "description": "Testing judicial impartiality is a problem of fundamental importance in\nempirical legal studies, for which standard regression methods have been\npopularly used to estimate the extralegal factor effects. However, those\nmethods cannot handle control variables with ultrahigh dimensionality, such as those\nfound in judgment documents recorded in text format. To solve this problem, we\ndevelop a novel mixture conditional regression (MCR) approach, assuming that\nthe whole sample can be classified into a number of latent classes. Within each\nlatent class, a standard linear regression model can be used to model the\nrelationship between the response and a key feature vector, which is assumed to\nbe of a fixed dimension. Meanwhile, ultrahigh dimensional control variables are\nthen used to determine the latent class membership, where a Na\\\"ive Bayes type\nmodel is used to describe the relationship. Hence, the dimension of control\nvariables is allowed to be arbitrarily high. A novel expectation-maximization\nalgorithm is developed for model estimation. 
Therefore, we are able to estimate\nthe key parameters of interest as efficiently as if the true class membership\nwere known in advance. Simulation studies are presented to demonstrate the\nproposed MCR method. A real dataset of Chinese burglary offenses is analyzed\nfor illustration purposes."}, "http://arxiv.org/abs/2311.07951": {"title": "A Fast and Simple Algorithm for computing the MLE of Amplitude Density Function Parameters", "link": "http://arxiv.org/abs/2311.07951", "description": "Over the last decades, the family of $\\alpha$-stable distributions has proven\nto be useful for modelling in telecommunication systems. Particularly, in the\ncase of radar applications, finding a fast and accurate estimation for the\namplitude density function parameters appears to be very important. In this\nwork, the maximum likelihood estimator (MLE) is proposed for parameters of the\namplitude distribution. To do this, the amplitude data are \\emph{projected} on\nthe horizontal and vertical axes using two simple transformations. It is proved\nthat the \\emph{projected} data follow a zero-location symmetric $\\alpha$-stable\ndistribution for which the MLE can be computed quite fast. The average of\ncomputed MLEs based on two \\emph{projections} is considered as an estimator for the\nparameters of the amplitude distribution. Performance of the proposed\n\\emph{projection} method is demonstrated through a simulation study and analysis\nof two sets of real radar data."}, "http://arxiv.org/abs/2311.07972": {"title": "Residual Importance Weighted Transfer Learning For High-dimensional Linear Regression", "link": "http://arxiv.org/abs/2311.07972", "description": "Transfer learning is an emerging paradigm for leveraging multiple sources to\nimprove the statistical inference on a single target. In this paper, we propose\na novel approach named residual importance weighted transfer learning (RIW-TL)\nfor high-dimensional linear models built on penalized likelihood. Compared to\nexisting methods such as Trans-Lasso that selects sources in an all-in-all-out\nmanner, RIW-TL includes samples via importance weighting and thus may permit\nmore effective sample use. To determine the weights, remarkably RIW-TL only\nrequires the knowledge of one-dimensional densities dependent on residuals,\nthus overcoming the curse of dimensionality of having to estimate\nhigh-dimensional densities in naive importance weighting. We show that the\noracle RIW-TL provides a faster rate than its competitors and develop a\ncross-fitting procedure to estimate this oracle. We discuss variants of RIW-TL\nby adopting different choices for residual weighting. The theoretical\nproperties of RIW-TL and its variants are established and compared with those\nof LASSO and Trans-Lasso. Extensive simulation and a real data analysis confirm\nits advantages."}, "http://arxiv.org/abs/2311.08004": {"title": "Nonlinear blind source separation exploiting spatial nonstationarity", "link": "http://arxiv.org/abs/2311.08004", "description": "In spatial blind source separation the observed multivariate random fields\nare assumed to be mixtures of latent spatially dependent random fields. The\nobjective is to recover latent random fields by estimating the unmixing\ntransformation. Currently, the algorithms for spatial blind source separation\ncan only estimate linear unmixing transformations. Nonlinear blind source\nseparation methods for spatial data are scarce. 
In this paper we extend an\nidentifiable variational autoencoder that can estimate nonlinear unmixing\ntransformations to spatially dependent data and demonstrate its performance for\nboth stationary and nonstationary spatial data using simulations. In addition,\nwe introduce scaled mean absolute Shapley additive explanations for\ninterpreting the latent components through nonlinear mixing transformation. The\nspatial identifiable variational autoencoder is applied to a geochemical\ndataset to find the latent random fields, which are then interpreted by using\nthe scaled mean absolute Shapley additive explanations."}, "http://arxiv.org/abs/2311.08050": {"title": "INLA+ -- Approximate Bayesian inference for non-sparse models using HPC", "link": "http://arxiv.org/abs/2311.08050", "description": "The integrated nested Laplace approximations (INLA) method has become a\nwidely utilized tool for researchers and practitioners seeking to perform\napproximate Bayesian inference across various fields of application. To address\nthe growing demand for incorporating more complex models and enhancing the\nmethod's capabilities, this paper introduces a novel framework that leverages\ndense matrices for performing approximate Bayesian inference based on INLA\nacross multiple computing nodes using HPC. When dealing with non-sparse\nprecision or covariance matrices, this new approach scales better compared to\nthe current INLA method, capitalizing on the computational power offered by\nmultiprocessors in shared and distributed memory architectures available in\ncontemporary computing resources and specialized dense matrix algebra. To\nvalidate the efficacy of this approach, we conduct a simulation study then\napply it to analyze cancer mortality data in Spain, employing a three-way\nspatio-temporal interaction model."}, "http://arxiv.org/abs/2311.08139": {"title": "Feedforward neural networks as statistical models: Improving interpretability through uncertainty quantification", "link": "http://arxiv.org/abs/2311.08139", "description": "Feedforward neural networks (FNNs) are typically viewed as pure prediction\nalgorithms, and their strong predictive performance has led to their use in\nmany machine-learning applications. However, their flexibility comes with an\ninterpretability trade-off; thus, FNNs have been historically less popular\namong statisticians. Nevertheless, classical statistical theory, such as\nsignificance testing and uncertainty quantification, is still relevant.\nSupplementing FNNs with methods of statistical inference, and covariate-effect\nvisualisations, can shift the focus away from black-box prediction and make\nFNNs more akin to traditional statistical models. This can allow for more\ninferential analysis, and, hence, make FNNs more accessible within the\nstatistical-modelling context."}, "http://arxiv.org/abs/2311.08168": {"title": "Time-Uniform Confidence Spheres for Means of Random Vectors", "link": "http://arxiv.org/abs/2311.08168", "description": "We derive and study time-uniform confidence spheres - termed confidence\nsphere sequences (CSSs) - which contain the mean of random vectors with high\nprobability simultaneously across all sample sizes. 
Inspired by the original\nwork of Catoni and Giulini, we unify and extend their analysis to cover both\nthe sequential setting and to handle a variety of distributional assumptions.\nMore concretely, our results include an empirical-Bernstein CSS for bounded\nrandom vectors (resulting in a novel empirical-Bernstein confidence interval),\na CSS for sub-$\\psi$ random vectors, and a CSS for heavy-tailed random vectors\nbased on a sequentially valid Catoni-Giulini estimator. Finally, we provide a\nversion of our empirical-Bernstein CSS that is robust to contamination by Huber\nnoise."}, "http://arxiv.org/abs/2311.08181": {"title": "Frame to frame interpolation for high-dimensional data visualisation using the woylier package", "link": "http://arxiv.org/abs/2311.08181", "description": "The woylier package implements tour interpolation paths between frames using\nGivens rotations. This provides an alternative to the geodesic interpolation\nbetween planes currently available in the tourr package. Tours are used to\nvisualise high-dimensional data and models, to detect clustering, anomalies and\nnon-linear relationships. Frame-to-frame interpolation can be useful for\nprojection pursuit guided tours when the index is not rotationally invariant.\nIt also provides a way to specifically reach a given target frame. We\ndemonstrate the method for exploring non-linear relationships between currency\ncross-rates."}, "http://arxiv.org/abs/2311.08254": {"title": "Identifiable and interpretable nonparametric factor analysis", "link": "http://arxiv.org/abs/2311.08254", "description": "Factor models have been widely used to summarize the variability of\nhigh-dimensional data through a set of factors with much lower dimensionality.\nGaussian linear factor models have been particularly popular due to their\ninterpretability and ease of computation. However, in practice, data often\nviolate the multivariate Gaussian assumption. To characterize higher-order\ndependence and nonlinearity, models that include factors as predictors in\nflexible multivariate regression are popular, with GP-LVMs using Gaussian\nprocess (GP) priors for the regression function and VAEs using deep neural\nnetworks. Unfortunately, such approaches lack identifiability and\ninterpretability and tend to produce brittle and non-reproducible results. To\naddress these problems by simplifying the nonparametric factor model while\nmaintaining flexibility, we propose the NIFTY framework, which parsimoniously\ntransforms uniform latent variables using one-dimensional nonlinear mappings\nand then applies a linear generative model. The induced multivariate\ndistribution falls into a flexible class while maintaining simple computation\nand interpretation. We prove that this model is identifiable and empirically\nstudy NIFTY using simulated data, observing good performance in density\nestimation and data visualization. We then apply NIFTY to bird song data in an\nenvironmental monitoring application."}, "http://arxiv.org/abs/2311.08315": {"title": "Total Empiricism: Learning from Data", "link": "http://arxiv.org/abs/2311.08315", "description": "Statistical analysis is an important tool to distinguish systematic from\nchance findings. Current statistical analyses rely on distributional\nassumptions reflecting the structure of some underlying model, which if not met\nlead to problems in the analysis and interpretation of the results. 
Instead of\ntrying to fix the model or \"correct\" the data, we here describe a totally\nempirical statistical approach that does not rely on ad hoc distributional\nassumptions in order to overcome many problems in contemporary statistics.\nStarting from elementary combinatorics, we motivate an information-guided\nformalism to quantify knowledge extracted from the given data. Subsequently, we\nderive model-agnostic methods to identify patterns that are solely evidenced by\nthe data based on our prior knowledge. The data-centric character of empiricism\nallows for its universal applicability, particularly as sample size grows\nlarger. In this comprehensive framework, we re-interpret and extend model\ndistributions, scores and statistical tests used in different schools of\nstatistics."}, "http://arxiv.org/abs/2311.08335": {"title": "Distinguishing immunological and behavioral effects of vaccination", "link": "http://arxiv.org/abs/2311.08335", "description": "The interpretation of vaccine efficacy estimands is subtle, even in\nrandomized trials designed to quantify immunological effects of vaccination. In\nthis article, we introduce terminology to distinguish between different vaccine\nefficacy estimands and clarify their interpretations. This allows us to\nexplicitly consider immunological and behavioural effects of vaccination, and\nestablish that policy-relevant estimands can differ substantially from those\ncommonly reported in vaccine trials. We further show that a conventional\nvaccine trial allows identification and estimation of different vaccine\nestimands under plausible conditions, if one additional post-treatment variable\nis measured. Specifically, we utilize a ``belief variable'' that indicates the\ntreatment an individual believed they had received. The belief variable is\nsimilar to ``blinding assessment'' variables that are occasionally collected in\nplacebo-controlled trials in other fields. We illustrate the relations between\nthe different estimands, and their practical relevance, in numerical examples\nbased on an influenza vaccine trial."}, "http://arxiv.org/abs/2311.08340": {"title": "Causal Message Passing: A Method for Experiments with Unknown and General Network Interference", "link": "http://arxiv.org/abs/2311.08340", "description": "Randomized experiments are a powerful methodology for data-driven evaluation\nof decisions or interventions. Yet, their validity may be undermined by network\ninterference. This occurs when the treatment of one unit impacts not only its\noutcome but also that of connected units, biasing traditional treatment effect\nestimations. Our study introduces a new framework to accommodate complex and\nunknown network interference, moving beyond specialized models in the existing\nliterature. Our framework, which we term causal message-passing, is grounded in\na high-dimensional approximate message passing methodology and is specifically\ntailored to experimental design settings with prevalent network interference.\nUtilizing causal message-passing, we present a practical algorithm for\nestimating the total treatment effect and demonstrate its efficacy in four\nnumerical scenarios, each with its unique interference structure."}, "http://arxiv.org/abs/2104.14987": {"title": "Emulating complex dynamical simulators with random Fourier features", "link": "http://arxiv.org/abs/2104.14987", "description": "A Gaussian process (GP)-based methodology is proposed to emulate complex\ndynamical computer models (or simulators). 
The method relies on emulating the\nnumerical flow map of the system over an initial (short) time step, where the\nflow map is a function that describes the evolution of the system from an\ninitial condition to a subsequent value at the next time step. This yields a\nprobability distribution over the entire flow map function, with each draw\noffering an approximation to the flow map. The model output time series is\nthen predicted (under the Markov assumption) by drawing a sample from the\nemulated flow map (i.e., its posterior distribution) and using it to iterate\nfrom the initial condition ahead in time. Repeating this procedure with\nmultiple such draws creates a distribution over the time series. The mean and\nvariance of this distribution at a specific time point serve as the model\noutput prediction and the associated uncertainty, respectively. However,\ndrawing a GP posterior sample that represents the underlying function across\nits entire domain is computationally infeasible, given the infinite-dimensional\nnature of this object. To overcome this limitation, one can generate such a\nsample in an approximate manner using random Fourier features (RFF). RFF is an\nefficient technique for approximating the kernel and generating GP samples,\noffering both computational efficiency and theoretical guarantees. The proposed\nmethod is applied to emulate several dynamic nonlinear simulators including the\nwell-known Lorenz and van der Pol models. The results suggest that our approach\nhas high predictive performance and that the associated uncertainty can capture\nthe dynamics of the system accurately."}, "http://arxiv.org/abs/2111.12945": {"title": "Low-rank variational Bayes correction to the Laplace method", "link": "http://arxiv.org/abs/2111.12945", "description": "Approximate inference methods like the Laplace method, Laplace approximations\nand variational methods, amongst others, are popular methods when exact\ninference is not feasible due to the complexity of the model or the abundance\nof data. In this paper we propose a hybrid approximate method called Low-Rank\nVariational Bayes correction (VBC), that uses the Laplace method and\nsubsequently a Variational Bayes correction in a lower dimension, to the joint\nposterior mean. The cost is essentially that of the Laplace method which\nensures scalability of the method, in both model complexity and data size.\nModels with fixed and unknown hyperparameters are considered, for simulated and\nreal examples, for small and large datasets."}, "http://arxiv.org/abs/2202.13961": {"title": "Spatio-Causal Patterns of Sample Growth", "link": "http://arxiv.org/abs/2202.13961", "description": "Different statistical samples (e.g., from different locations) offer\npopulations and learning systems observations with distinct statistical\nproperties. Samples under (1) 'Unconfounded' growth preserve systems' ability\nto determine the independent effects of their individual variables on any\noutcome-of-interest (and lead, therefore, to fair and interpretable black-box\npredictions). Samples under (2) 'Externally-Valid' growth preserve their\nability to make predictions that generalize across out-of-sample variation. The\nfirst promotes predictions that generalize over populations, the second over\ntheir shared uncontrolled factors. We illustrate these theoretic patterns in\nthe full American census from 1840 to 1940, and samples ranging from the\nstreet-level all the way to the national. 
This reveals sample requirements for\ngeneralizability over space and time, and new connections among the Shapley\nvalue, counterfactual statistics, and hyperbolic geometry."}, "http://arxiv.org/abs/2211.04958": {"title": "Black-Box Model Confidence Sets Using Cross-Validation with High-Dimensional Gaussian Comparison", "link": "http://arxiv.org/abs/2211.04958", "description": "We derive high-dimensional Gaussian comparison results for the standard\n$V$-fold cross-validated risk estimates. Our results combine a recent\nstability-based argument for the low-dimensional central limit theorem of\ncross-validation with the high-dimensional Gaussian comparison framework for\nsums of independent random variables. These results give new insights into the\njoint sampling distribution of cross-validated risks in the context of model\ncomparison and tuning parameter selection, where the number of candidate models\nand tuning parameters can be larger than the fitting sample size. As a\nconsequence, our results provide theoretical support for a recent\nmethodological development that constructs model confidence sets using\ncross-validation."}, "http://arxiv.org/abs/2311.08427": {"title": "Towards a Transportable Causal Network Model Based on Observational Healthcare Data", "link": "http://arxiv.org/abs/2311.08427", "description": "Over the last decades, many prognostic models based on artificial\nintelligence techniques have been used to provide detailed predictions in\nhealthcare. Unfortunately, the real-world observational data used to train and\nvalidate these models are almost always affected by biases that can strongly\nimpact the outcomes validity: two examples are values missing not-at-random and\nselection bias. Addressing them is a key element in achieving transportability\nand in studying the causal relationships that are critical in clinical decision\nmaking, going beyond simpler statistical approaches based on probabilistic\nassociation.\n\nIn this context, we propose a novel approach that combines selection\ndiagrams, missingness graphs, causal discovery and prior knowledge into a\nsingle graphical model to estimate the cardiovascular risk of adolescent and\nyoung females who survived breast cancer. We learn this model from data\ncomprising two different cohorts of patients. The resulting causal network\nmodel is validated by expert clinicians in terms of risk assessment, accuracy\nand explainability, and provides a prognostic model that outperforms competing\nmachine learning methods."}, "http://arxiv.org/abs/2311.08484": {"title": "Covariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe)", "link": "http://arxiv.org/abs/2311.08484", "description": "We propose a new method for the simultaneous selection and estimation of\nmultivariate sparse additive models with correlated errors. Our method called\nCovariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe)\nsimultaneously selects among null, linear, and smooth non-linear effects for\neach predictor while incorporating joint estimation of the sparse residual\nstructure among responses, with the motivation that accounting for\ninter-response correlation structure can lead to improved accuracy in variable\nselection and estimation efficiency. 
CoMPAdRe is constructed in a\ncomputationally efficient way that allows the selection and estimation of\nlinear and non-linear covariates to be conducted in parallel across responses.\nCompared to single-response approaches that marginally select linear and\nnon-linear covariate effects, we demonstrate in simulation studies that the\njoint multivariate modeling leads to gains in both estimation efficiency and\nselection accuracy, of greater magnitude in settings where signal is moderate\nrelative to the level of noise. We apply our approach to protein-mRNA\nexpression levels from multiple breast cancer pathways obtained from The Cancer\nProteome Atlas and characterize both mRNA-protein associations and\nprotein-protein subnetworks for each pathway. We find non-linear mRNA-protein\nassociations for the Core Reactive, EMT, PIK-AKT, and RTK pathways."}, "http://arxiv.org/abs/2311.08527": {"title": "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments", "link": "http://arxiv.org/abs/2311.08527", "description": "We study inference on the long-term causal effect of a continual exposure to\na novel intervention, which we term a long-term treatment, based on an\nexperiment involving only short-term observations. Key examples include the\nlong-term health effects of regularly-taken medicine or of environmental\nhazards and the long-term effects on users of changes to an online platform.\nThis stands in contrast to short-term treatments or \"shocks,\" whose long-term\neffect can reasonably be mediated by short-term observations, enabling the use\nof surrogate methods. Long-term treatments by definition have direct effects on\nlong-term outcomes via continual exposure so surrogacy cannot reasonably hold.\n\nOur approach instead learns long-term temporal dynamics directly from\nshort-term experimental data, assuming that the initial dynamics observed\npersist but avoiding the need for both surrogacy assumptions and auxiliary data\nwith long-term observations. We connect the problem with offline reinforcement\nlearning, leveraging doubly-robust estimators to estimate long-term causal\neffects for long-term treatments and construct confidence intervals. Finally,\nwe demonstrate the method in simulated experiments."}, "http://arxiv.org/abs/2311.08561": {"title": "Measuring association with recursive rank binning", "link": "http://arxiv.org/abs/2311.08561", "description": "Pairwise measures of dependence are a common tool to map data in the early\nstages of analysis with several modern examples based on maximized partitions\nof the pairwise sample space. Following a short survey of modern measures of\ndependence, we introduce a new measure which recursively splits the ranks of a\npair of variables to partition the sample space and computes the $\\chi^2$\nstatistic on the resulting bins. Splitting logic is detailed for splits\nmaximizing a score function and randomly selected splits. Simulations indicate\nthat random splitting produces a statistic conservatively approximated by the\n$\\chi^2$ distribution without a loss of power to detect numerous different data\npatterns compared to maximized binning. Though it seems to add no power to\ndetect dependence, maximized recursive binning is shown to produce a natural\nvisualization of the data and the measure. 
Applying maximized recursive rank\nbinning to S&P 500 constituent data suggests the automatic detection of tail\ndependence."}, "http://arxiv.org/abs/2311.08604": {"title": "Incremental Cost-Effectiveness Statistical Inference: Calculations and Communications", "link": "http://arxiv.org/abs/2311.08604", "description": "We illustrate use of nonparametric statistical methods to compare alternative\ntreatments for a particular disease or condition on both their relative\neffectiveness and their relative cost. These Incremental Cost Effectiveness\n(ICE) methods are based upon Bootstrapping, i.e. Resampling with Replacement\nfrom observational or clinical-trial data on individual patients. We first show\nhow a reasonable numerical value for the \"Shadow Price of Health\" can be chosen\nusing functions within the ICEinfer R-package when effectiveness is not\nmeasured in \"QALY\"s. We also argue that simple histograms are ideal for\ncommunicating key findings to regulators, while our more detailed graphics may\nwell be more informative and compelling for other health-care stakeholders."}, "http://arxiv.org/abs/2311.08658": {"title": "Structured Estimation of Heterogeneous Time Series", "link": "http://arxiv.org/abs/2311.08658", "description": "How best to model structurally heterogeneous processes is a foundational\nquestion in the social, health and behavioral sciences. Recently, Fisher et\nal., (2022) introduced the multi-VAR approach for simultaneously estimating\nmultiple-subject multivariate time series characterized by common and\nindividualizing features using penalized estimation. This approach differs from\nmany popular modeling approaches for multiple-subject time series in that\nqualitative and quantitative differences in a large number of individual\ndynamics are well-accommodated. The current work extends the multi-VAR\nframework to include new adaptive weighting schemes that greatly improve\nestimation performance. In a small set of simulation studies we compare\nadaptive multi-VAR with these new penalty weights to common alternative\nestimators in terms of path recovery and bias. Furthermore, we provide toy\nexamples and code demonstrating the utility of multi-VAR under different\nheterogeneity regimes using the multivar package for R (Fisher, 2022)."}, "http://arxiv.org/abs/2311.08690": {"title": "Enabling CMF Estimation in Data-Constrained Scenarios: A Semantic-Encoding Knowledge Mining Model", "link": "http://arxiv.org/abs/2311.08690", "description": "Precise estimation of Crash Modification Factors (CMFs) is central to\nevaluating the effectiveness of various road safety treatments and prioritizing\ninfrastructure investment accordingly. While customized study for each\ncountermeasure scenario is desired, the conventional CMF estimation approaches\nrely heavily on the availability of crash data at given sites. This not only\nmakes the estimation costly, but the results are also less transferable, since\nthe intrinsic similarities between different safety countermeasure scenarios\nare not fully explored. Aiming to fill this gap, this study introduces a novel\nknowledge-mining framework for CMF prediction. This framework delves into the\nconnections of existing countermeasures and reduces the reliance of CMF\nestimation on crash data availability and manual data collection. 
Specifically,\nit draws inspiration from human comprehension processes and introduces advanced\nNatural Language Processing (NLP) techniques to extract intricate variations\nand patterns from existing CMF knowledge. It effectively encodes unstructured\ncountermeasure scenarios into machine-readable representations and models the\ncomplex relationships between scenarios and CMF values. This new data-driven\nframework provides a cost-effective and adaptable solution that complements the\ncase-specific approaches for CMF estimation, which is particularly beneficial\nwhen the availability of crash data or time imposes constraints. Experimental\nvalidation using real-world CMF Clearinghouse data demonstrates the\neffectiveness of this new approach, which shows significant accuracy\nimprovements compared to baseline methods. This approach provides insights into\nnew possibilities of harnessing accumulated transportation knowledge in various\napplications."}, "http://arxiv.org/abs/2311.08691": {"title": "On Doubly Robust Estimation with Nonignorable Missing Data Using Instrumental Variables", "link": "http://arxiv.org/abs/2311.08691", "description": "Suppose we are interested in the mean of an outcome that is subject to\nnonignorable nonresponse. This paper develops new semiparametric estimation\nmethods with instrumental variables which affect nonresponse, but not the\noutcome. The proposed estimators remain consistent and asymptotically normal\neven under partial model misspecifications for two variation independent\nnuisance functions. We evaluate the performance of the proposed estimators via\na simulation study, and apply them in adjusting for missing data induced by HIV\ntesting refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana,\nusing interviewer experience as an instrumental variable."}, "http://arxiv.org/abs/2311.08743": {"title": "Kernel-based independence tests for causal structure learning on functional data", "link": "http://arxiv.org/abs/2311.08743", "description": "Measurements of systems taken along a continuous functional dimension, such\nas time or space, are ubiquitous in many fields, from the physical and\nbiological sciences to economics and engineering. Such measurements can be\nviewed as realisations of an underlying smooth process sampled over the\ncontinuum. However, traditional methods for independence testing and causal\nlearning are not directly applicable to such data, as they do not take into\naccount the dependence along the functional dimension. By using specifically\ndesigned kernels, we introduce statistical tests for bivariate, joint, and\nconditional independence for functional variables. Our method not only extends\nthe applicability to functional data of the HSIC and its d-variate version\n(d-HSIC), but also allows us to introduce a test for conditional independence\nby defining a novel statistic for the CPT based on the HSCIC, with optimised\nregularisation strength estimated through an evaluation rejection rate. 
Our\nempirical results on the size and power of these tests on synthetic functional\ndata show good performance, and we then exemplify their application to several\nconstraint- and regression-based causal structure learning problems, including\nboth synthetic examples and real socio-economic data."}, "http://arxiv.org/abs/2311.08752": {"title": "ProSpar-GP: scalable Gaussian process modeling with massive non-stationary datasets", "link": "http://arxiv.org/abs/2311.08752", "description": "Gaussian processes (GPs) are a popular class of Bayesian nonparametric\nmodels, but their training can be computationally burdensome for massive training\ndatasets. While there has been notable work on scaling up these models for big\ndata, existing methods typically rely on a stationary GP assumption for\napproximation, and can thus perform poorly when the underlying response surface\nis non-stationary, i.e., it has some regions of rapid change and other regions\nwith little change. Such non-stationarity is, however, ubiquitous in real-world\nproblems, including our motivating application for surrogate modeling of\ncomputer experiments. We thus propose a new Product of Sparse GP (ProSpar-GP)\nmethod for scalable GP modeling with massive non-stationary data. The\nProSpar-GP makes use of a carefully-constructed product-of-experts formulation\nof sparse GP experts, where different experts are placed within local regions\nof non-stationarity. These GP experts are fit via a novel variational inference\napproach, which capitalizes on mini-batching and GPU acceleration for efficient\noptimization of inducing points and length-scale parameters for each expert. We\nfurther show that the ProSpar-GP is Kolmogorov-consistent, in that its\ngenerative distribution defines a valid stochastic process over the prediction\nspace; such a property provides essential stability for variational inference,\nparticularly in the presence of non-stationarity. We then demonstrate the\nimproved performance of the ProSpar-GP over the state-of-the-art, in a suite of\nnumerical experiments and an application for surrogate modeling of a satellite\ndrag simulator."}, "http://arxiv.org/abs/2311.08812": {"title": "Optimal subsampling algorithm for the marginal model with large longitudinal data", "link": "http://arxiv.org/abs/2311.08812", "description": "Big data is ubiquitous in practice, and it has also led to a heavy computational\nburden. To reduce the calculation cost and ensure the effectiveness of\nparameter estimators, an optimal subset sampling method is proposed to estimate\nthe parameters in marginal models with massive longitudinal data. The optimal\nsubsampling probabilities are derived, and the corresponding asymptotic\nproperties are established to ensure the consistency and asymptotic normality\nof the estimator. Extensive simulation studies are carried out to evaluate the\nperformance of the proposed method for continuous, binary and count data and\nwith four different working correlation matrices. A depression dataset is used to\nillustrate the proposed method."}, "http://arxiv.org/abs/2311.08845": {"title": "Statistical learning by sparse deep neural networks", "link": "http://arxiv.org/abs/2311.08845", "description": "We consider a deep neural network estimator based on empirical risk\nminimization with $l_1$-regularization. 
We derive a general bound for its excess\nrisk in regression and classification (including multiclass), and prove that it\nis adaptively nearly-minimax (up to log-factors) simultaneously across the\nentire range of various function classes."}, "http://arxiv.org/abs/2311.08908": {"title": "Robust Brain MRI Image Classification with SIBOW-SVM", "link": "http://arxiv.org/abs/2311.08908", "description": "The majority of primary Central Nervous System (CNS) tumors in the brain are\namong the most aggressive diseases affecting humans. Early detection of brain\ntumor types, whether benign or malignant, glial or non-glial, is critical for\ncancer prevention and treatment, ultimately improving human life expectancy.\nMagnetic Resonance Imaging (MRI) stands as the most effective technique to\ndetect brain tumors by generating comprehensive brain images through scans.\nHowever, human examination can be error-prone and inefficient due to the\ncomplexity, size, and location variability of brain tumors. Recently, automated\nclassification techniques using machine learning (ML) methods, such as\nConvolutional Neural Network (CNN), have demonstrated significantly higher\naccuracy than manual screening, while maintaining low computational costs.\nNonetheless, deep learning-based image classification methods, including CNN,\nface challenges in estimating class probabilities without proper model\ncalibration. In this paper, we propose a novel brain tumor image classification\nmethod, called SIBOW-SVM, which integrates the Bag-of-Features (BoF) model with\nSIFT feature extraction and weighted Support Vector Machines (wSVMs). This new\napproach effectively captures hidden image features, enabling the\ndifferentiation of various tumor types and accurate label predictions.\nAdditionally, the SIBOW-SVM is able to estimate the probabilities of images\nbelonging to each class, thereby providing high-confidence classification\ndecisions. We have also developed scalable and parallelable algorithms to\nfacilitate the practical implementation of SIBOW-SVM for massive images. As a\nbenchmark, we apply the SIBOW-SVM to a public data set of brain tumor MRI\nimages containing four classes: glioma, meningioma, pituitary, and normal. Our\nresults show that the new method outperforms state-of-the-art methods,\nincluding CNN."}, "http://arxiv.org/abs/2311.09015": {"title": "Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach", "link": "http://arxiv.org/abs/2311.09015", "description": "We consider the task of identifying and estimating a parameter of interest in\nsettings where data is missing not at random (MNAR). In general, such\nparameters are not identified without strong assumptions on the missing data\nmodel. In this paper, we take an alternative approach and introduce a method\ninspired by data fusion, where information in an MNAR dataset is augmented by\ninformation in an auxiliary dataset subject to missingness at random (MAR). We\nshow that even if the parameter of interest cannot be identified given either\ndataset alone, it can be identified given pooled data, under two complementary\nsets of assumptions. 
We derive an inverse probability weighted (IPW) estimator\nfor identified parameters, and evaluate the performance of our estimation\nstrategies via simulation studies."}, "http://arxiv.org/abs/2311.09081": {"title": "Posterior accuracy and calibration under misspecification in Bayesian generalized linear models", "link": "http://arxiv.org/abs/2311.09081", "description": "Generalized linear models (GLMs) are popular for data-analysis in almost all\nquantitative sciences, but the choice of likelihood family and link function is\noften difficult. This motivates the search for likelihoods and links that\nminimize the impact of potential misspecification. We perform a large-scale\nsimulation study on double-bounded and lower-bounded response data where we\nsystematically vary both true and assumed likelihoods and links. In contrast to\nprevious studies, we also study posterior calibration and uncertainty metrics\nin addition to point-estimate accuracy. Our results indicate that certain\nlikelihoods and links can be remarkably robust to misspecification, performing\nalmost on par with their respective true counterparts. Additionally, normal\nlikelihood models with identity link (i.e., linear regression) often achieve\ncalibration comparable to the more structurally faithful alternatives, at least\nin the studied scenarios. On the basis of our findings, we provide practical\nsuggestions for robust likelihood and link choices in GLMs."}, "http://arxiv.org/abs/2311.09107": {"title": "Illness-death model with renewal", "link": "http://arxiv.org/abs/2311.09107", "description": "The illness-death model for chronic conditions is combined with a renewal\nequation for the number of newborns taking into account possibly different\nfertility rates in the healthy and diseased parts of the population. The\nresulting boundary value problem consists of a system of partial differential\nequations with an integral boundary condition. As an application, the boundary\nvalue problem is applied to an example about type 2 diabetes."}, "http://arxiv.org/abs/2311.09137": {"title": "Causal prediction models for medication safety monitoring: The diagnosis of vancomycin-induced acute kidney injury", "link": "http://arxiv.org/abs/2311.09137", "description": "The current best practice approach for the retrospective diagnosis of adverse\ndrug events (ADEs) in hospitalized patients relies on a full patient chart\nreview and a formal causality assessment by multiple medical experts. This\nevaluation serves to qualitatively estimate the probability of causation (PC);\nthe probability that a drug was a necessary cause of an adverse event. This\npractice is manual, resource intensive and prone to human biases, and may thus\nbenefit from data-driven decision support. Here, we pioneer a causal modeling\napproach using observational data to estimate a lower bound of the PC\n(PC$_{low}$). This method includes two key causal inference components: (1) the\ntarget trial emulation framework and (2) estimation of individualized treatment\neffects using machine learning. We apply our method to the clinically relevant\nuse-case of vancomycin-induced acute kidney injury in intensive care patients,\nand compare our causal model-based PC$_{low}$ estimates to qualitative\nestimates of the PC provided by a medical expert. 
Important limitations and\npotential improvements are discussed, and we conclude that future improved\ncausal models could provide essential data-driven support for medication safety\nmonitoring in hospitalized patients."}, "http://arxiv.org/abs/1911.03071": {"title": "Balancing Covariates in Randomized Experiments with the Gram-Schmidt Walk Design", "link": "http://arxiv.org/abs/1911.03071", "description": "The design of experiments involves a compromise between covariate balance and\nrobustness. This paper provides a formalization of this trade-off and describes\nan experimental design that allows experimenters to navigate it. The design is\nspecified by a robustness parameter that bounds the worst-case mean squared\nerror of an estimator of the average treatment effect. Subject to the\nexperimenter's desired level of robustness, the design aims to simultaneously\nbalance all linear functions of potentially many covariates. Less robustness\nallows for more balance. We show that the mean squared error of the estimator\nis bounded in finite samples by the minimum of the loss function of an implicit\nridge regression of the potential outcomes on the covariates. Asymptotically,\nthe design perfectly balances all linear functions of a growing number of\ncovariates with a diminishing reduction in robustness, effectively allowing\nexperimenters to escape the compromise between balance and robustness in large\nsamples. Finally, we describe conditions that ensure asymptotic normality and\nprovide a conservative variance estimator, which facilitate the construction of\nasymptotically valid confidence intervals."}, "http://arxiv.org/abs/2102.07356": {"title": "Asymptotic properties of generalized closed-form maximum likelihood estimators", "link": "http://arxiv.org/abs/2102.07356", "description": "The maximum likelihood estimator (MLE) is pivotal in statistical inference,\nyet its application is often hindered by the absence of closed-form solutions\nfor many models. This poses challenges in real-time computation scenarios,\nparticularly within embedded systems technology, where numerical methods are\nimpractical. This study introduces a generalized form of the MLE that yields\nclosed-form estimators under certain conditions. We derive the asymptotic\nproperties of the proposed estimator and demonstrate that our approach retains\nkey properties such as invariance under one-to-one transformations, strong\nconsistency, and an asymptotic normal distribution. The effectiveness of the\ngeneralized MLE is exemplified through its application to the Gamma, Nakagami,\nand Beta distributions, showcasing improvements over the traditional MLE.\nAdditionally, we extend this methodology to a bivariate gamma distribution,\nsuccessfully deriving closed-form estimators. This advancement presents\nsignificant implications for real-time statistical analysis across various\napplications."}, "http://arxiv.org/abs/2207.13493": {"title": "The Cellwise Minimum Covariance Determinant Estimator", "link": "http://arxiv.org/abs/2207.13493", "description": "The usual Minimum Covariance Determinant (MCD) estimator of a covariance\nmatrix is robust against casewise outliers. These are cases (that is, rows of\nthe data matrix) that behave differently from the majority of cases, raising\nsuspicion that they might belong to a different population. On the other hand,\ncellwise outliers are individual cells in the data matrix. 
When a row contains\none or more outlying cells, the other cells in the same row still contain\nuseful information that we wish to preserve. We propose a cellwise robust\nversion of the MCD method, called cellMCD. Its main building blocks are\nobserved likelihood and a penalty term on the number of flagged cellwise\noutliers. It possesses good breakdown properties. We construct a fast algorithm\nfor cellMCD based on concentration steps (C-steps) that always lower the\nobjective. The method performs well in simulations with cellwise outliers, and\nhas high finite-sample efficiency on clean data. It is illustrated on real data\nwith visualizations of the results."}, "http://arxiv.org/abs/2208.07086": {"title": "Flexible Bayesian Multiple Comparison Adjustment Using Dirichlet Process and Beta-Binomial Model Priors", "link": "http://arxiv.org/abs/2208.07086", "description": "Researchers frequently wish to assess the equality or inequality of groups,\nbut this comes with the challenge of adequately adjusting for multiple\ncomparisons. Statistically, all possible configurations of equality and\ninequality constraints can be uniquely represented as partitions of the groups,\nwhere any number of groups are equal if they are in the same partition. In a\nBayesian framework, one can adjust for multiple comparisons by constructing a\nsuitable prior distribution over all possible partitions. Inspired by work on\nvariable selection in regression, we propose a class of flexible beta-binomial\npriors for Bayesian multiple comparison adjustment. We compare this prior setup\nto the Dirichlet process prior suggested by Gopalan and Berry (1998) and\nmultiple comparison adjustment methods that do not specify a prior over\npartitions directly. Our approach to multiple comparison adjustment not only\nallows researchers to assess all pairwise (in)equalities, but in fact all\npossible (in)equalities among all groups. As a consequence, the space of\npossible partitions grows quickly - for ten groups, there are already 115,975\npossible partitions - and we set up a stochastic search algorithm to\nefficiently explore the space. Our method is implemented in the Julia package\nEqualitySampler, and we illustrate it on examples related to the comparison of\nmeans, variances, and proportions."}, "http://arxiv.org/abs/2208.07959": {"title": "Variable Selection in Latent Regression IRT Models via Knockoffs: An Application to International Large-scale Assessment in Education", "link": "http://arxiv.org/abs/2208.07959", "description": "International large-scale assessments (ILSAs) play an important role in\neducational research and policy making. They collect valuable data on education\nquality and performance development across many education systems, giving\ncountries the opportunity to share techniques, organizational structures, and\npolicies that have proven efficient and successful. To gain insights from ILSA\ndata, we identify non-cognitive variables associated with students' academic\nperformance. This problem has three analytical challenges: 1) academic\nperformance is measured by cognitive items under a matrix sampling design; 2)\nthere are many missing values in the non-cognitive variables; and 3) multiple\ncomparisons due to a large number of non-cognitive variables. We consider an\napplication to the Programme for International Student Assessment (PISA),\naiming to identify non-cognitive variables associated with students'\nperformance in science. 
We formulate it as a variable selection problem under a\ngeneral latent variable model framework and further propose a knockoff method\nthat conducts variable selection with a controlled error rate for false\nselections."}, "http://arxiv.org/abs/2210.06927": {"title": "Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models", "link": "http://arxiv.org/abs/2210.06927", "description": "Bayesian modeling provides a principled approach to quantifying uncertainty\nin model parameters and model structure and has seen a surge of applications in\nrecent years. Within the context of a Bayesian workflow, we are concerned with\nmodel selection for the purpose of finding models that best explain the data,\nthat is, help us understand the underlying data generating process. Since we\nrarely have access to the true process, all we are left with during real-world\nanalyses is incomplete causal knowledge from sources outside of the current\ndata and model predictions of said data. This leads to the important question\nof when the use of prediction as a proxy for explanation for the purpose of\nmodel selection is valid. We approach this question by means of large-scale\nsimulations of Bayesian generalized linear models where we investigate various\ncausal and statistical misspecifications. Our results indicate that the use of\nprediction as proxy for explanation is valid and safe only when the models\nunder consideration are sufficiently consistent with the underlying causal\nstructure of the true data generating process."}, "http://arxiv.org/abs/2212.04550": {"title": "Modern Statistical Models and Methods for Estimating Fatigue-Life and Fatigue-Strength Distributions from Experimental Data", "link": "http://arxiv.org/abs/2212.04550", "description": "Engineers and scientists have been collecting and analyzing fatigue data\nsince the 1800s to ensure the reliability of life-critical structures.\nApplications include (but are not limited to) bridges, building structures,\naircraft and spacecraft components, ships, ground-based vehicles, and medical\ndevices. Engineers need to estimate S-N relationships (Stress or Strain versus\nNumber of cycles to failure), typically with a focus on estimating small\nquantiles of the fatigue-life distribution. Estimates from this kind of model\nare used as input to models (e.g., cumulative damage models) that predict\nfailure-time distributions under varying stress patterns. Also, design\nengineers need to estimate lower-tail quantiles of the closely related\nfatigue-strength distribution. The history of applying incorrect statistical\nmethods is nearly as long and such practices continue to the present. Examples\ninclude treating the applied stress (or strain) as the response and the number\nof cycles to failure as the explanatory variable in regression analyses\n(because of the need to estimate strength distributions) and ignoring or\notherwise mishandling censored observations (known as runouts in the fatigue\nliterature). The first part of the paper reviews the traditional modeling\napproach where a fatigue-life model is specified. We then show how this\nspecification induces a corresponding fatigue-strength model. The second part\nof the paper presents a novel alternative modeling approach where a\nfatigue-strength model is specified and a corresponding fatigue-life model is\ninduced. 
We explain and illustrate the important advantages of this new\nmodeling approach."}, "http://arxiv.org/abs/2303.01186": {"title": "Discrete-time Competing-Risks Regression with or without Penalization", "link": "http://arxiv.org/abs/2303.01186", "description": "Many studies employ the analysis of time-to-event data that incorporates\ncompeting risks and right censoring. Most methods and software packages are\ngeared towards analyzing data that comes from a continuous failure time\ndistribution. However, failure-time data may sometimes be discrete either\nbecause time is inherently discrete or due to imprecise measurement. This paper\nintroduces a novel estimation procedure for discrete-time survival analysis\nwith competing events. The proposed approach offers two key advantages over\nexisting procedures: first, it expedites the estimation process for a large\nnumber of unique failure time points; second, it allows for straightforward\nintegration and application of widely used regularized regression and screening\nmethods. We illustrate the benefits of our proposed approach by conducting a\ncomprehensive simulation study. Additionally, we showcase the utility of our\nprocedure by estimating a survival model for the length of stay of patients\nhospitalized in the intensive care unit, considering three competing events:\ndischarge to home, transfer to another medical facility, and in-hospital death."}, "http://arxiv.org/abs/2306.11281": {"title": "Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models", "link": "http://arxiv.org/abs/2306.11281", "description": "Answering counterfactual queries has many important applications such as\nknowledge discovery and explainability, but is challenging when causal\nvariables are unobserved and we only see a projection onto an observation\nspace, for instance, image pixels. One approach is to recover the latent\nStructural Causal Model (SCM), but this typically needs unrealistic\nassumptions, such as linearity of the causal mechanisms. Another approach is to\nuse na\\\"ive ML approximations, such as generative models, to generate\ncounterfactual samples; however, these lack guarantees of accuracy. In this\nwork, we strive to strike a balance between practicality and theoretical\nguarantees by focusing on a specific type of causal query called domain\ncounterfactuals, which hypothesizes what a sample would have looked like if it\nhad been generated in a different domain (or environment). Concretely, by only\nassuming invertibility, sparse domain interventions and access to observational\ndata from different domains, we aim to improve domain counterfactual estimation\nboth theoretically and practically with less restrictive assumptions. We define\ndomain counterfactually equivalent models and prove necessary and sufficient\nproperties for equivalent models that provide a tight characterization of the\ndomain counterfactual equivalence classes. Building upon this result, we prove\nthat every equivalence class contains a model where all intervened variables\nare at the end when topologically sorted by the causal DAG. This surprising\nresult suggests that a model design that only allows intervention in the last\n$k$ latent variables may improve model estimation for counterfactuals. 
We then\ntest this model design on extensive simulated and image-based experiments which\nshow the sparse canonical model indeed improves counterfactual estimation over\nbaseline non-sparse models."}, "http://arxiv.org/abs/2309.10378": {"title": "Group Spike and Slab Variational Bayes", "link": "http://arxiv.org/abs/2309.10378", "description": "We introduce Group Spike-and-slab Variational Bayes (GSVB), a scalable method\nfor group sparse regression. A fast co-ordinate ascent variational inference\n(CAVI) algorithm is developed for several common model families including\nGaussian, Binomial and Poisson. Theoretical guarantees for our proposed\napproach are provided by deriving contraction rates for the variational\nposterior in grouped linear regression. Through extensive numerical studies, we\ndemonstrate that GSVB provides state-of-the-art performance, offering a\ncomputationally inexpensive substitute to MCMC, whilst performing comparably or\nbetter than existing MAP methods. Additionally, we analyze three real world\ndatasets wherein we highlight the practical utility of our method,\ndemonstrating that GSVB provides parsimonious models with excellent predictive\nperformance, variable selection and uncertainty quantification."}, "http://arxiv.org/abs/2309.12632": {"title": "Are Deep Learning Classification Results Obtained on CT Scans Fair and Interpretable?", "link": "http://arxiv.org/abs/2309.12632", "description": "Following the great success of various deep learning methods in image and\nobject classification, the biomedical image processing society is also\noverwhelmed with their applications to various automatic diagnosis cases.\nUnfortunately, most of the deep learning-based classification attempts in the\nliterature solely focus on the aim of extreme accuracy scores, without\nconsidering interpretability, or patient-wise separation of training and test\ndata. For example, most lung nodule classification papers using deep learning\nrandomly shuffle data and split it into training, validation, and test sets,\ncausing certain images from the CT scan of a person to be in the training set,\nwhile other images of the exact same person to be in the validation or testing\nimage sets. This can result in reporting misleading accuracy rates and the\nlearning of irrelevant features, ultimately reducing the real-life usability of\nthese models. When the deep neural networks trained on the traditional, unfair\ndata shuffling method are challenged with new patient images, it is observed\nthat the trained models perform poorly. In contrast, deep neural networks\ntrained with strict patient-level separation maintain their accuracy rates even\nwhen new patient images are tested. Heat-map visualizations of the activations\nof the deep neural networks trained with strict patient-level separation\nindicate a higher degree of focus on the relevant nodules. We argue that the\nresearch question posed in the title has a positive answer only if the deep\nneural networks are trained with images of patients that are strictly isolated\nfrom the validation and testing patient sets."}, "http://arxiv.org/abs/2311.09388": {"title": "Synthesis estimators for positivity violations with a continuous covariate", "link": "http://arxiv.org/abs/2311.09388", "description": "Research intended to estimate the effect of an action, like in randomized\ntrials, often do not have random samples of the intended target population.\nInstead, estimates can be transported to the desired target population. 
Methods\nfor transporting between populations are often premised on a positivity\nassumption, such that all relevant covariate patterns in one population are\nalso present in the other. However, eligibility criteria, particularly in the\ncase of trials, can result in violations of positivity. To address\nnonpositivity, a synthesis of statistical and mechanistic models was previously\nproposed in the context of violations by a single binary covariate. Here, we\nextend the synthesis approach for positivity violations with a continuous\ncovariate. For estimation, two novel augmented inverse probability weighting\nestimators are proposed, with one based on estimating the parameters of a\nmarginal structural model and the other based on estimating the conditional\naverage causal effect. Both estimators are compared to other common approaches\nto address nonpositivity via a simulation study. Finally, the competing\napproaches are illustrated with an example in the context of two-drug versus\none-drug antiretroviral therapy on CD4 T cell counts among women with HIV."}, "http://arxiv.org/abs/2311.09419": {"title": "Change-point Inference for High-dimensional Heteroscedastic Data", "link": "http://arxiv.org/abs/2311.09419", "description": "We propose a bootstrap-based test to detect a mean shift in a sequence of\nhigh-dimensional observations with unknown time-varying heteroscedasticity. The\nproposed test builds on the U-statistic based approach in Wang et al. (2022),\ntargets a dense alternative, and adopts a wild bootstrap procedure to generate\ncritical values. The bootstrap-based test is free of tuning parameters and is\ncapable of accommodating unconditional time varying heteroscedasticity in the\nhigh-dimensional observations, as demonstrated in our theory and simulations.\nTheoretically, we justify the bootstrap consistency by using the recently\nproposed unconditional approach in Bucher and Kojadinovic (2019). Extensions to\ntesting for multiple change-points and estimation using wild binary\nsegmentation are also presented. Numerical simulations demonstrate the\nrobustness of the proposed testing and estimation procedures with respect to\ndifferent kinds of time-varying heteroscedasticity."}, "http://arxiv.org/abs/2311.09423": {"title": "Orthogonal prediction of counterfactual outcomes", "link": "http://arxiv.org/abs/2311.09423", "description": "Orthogonal meta-learners, such as DR-learner, R-learner and IF-learner, are\nincreasingly used to estimate conditional average treatment effects. They\nimprove convergence rates relative to na\\\"{\\i}ve meta-learners (e.g., T-, S-\nand X-learner) through de-biasing procedures that involve applying standard\nlearners to specifically transformed outcome data. This leads them to disregard\nthe possibly constrained outcome space, which can be particularly problematic\nfor dichotomous outcomes: these typically get transformed to values that are no\nlonger constrained to the unit interval, making it difficult for standard\nlearners to guarantee predictions within the unit interval. To address this, we\nconstruct orthogonal meta-learners for the prediction of counterfactual\noutcomes which respect the outcome space. As such, the obtained i-learner or\nimputation-learner is more generally expected to outperform existing learners,\neven when the outcome is unconstrained, as we confirm empirically in simulation\nstudies and an analysis of critical care data. 
Our development also sheds\nbroader light onto the construction of orthogonal learners for other estimands."}, "http://arxiv.org/abs/2311.09446": {"title": "On simulation-based inference for implicitly defined models", "link": "http://arxiv.org/abs/2311.09446", "description": "In many applications, a stochastic system is studied using a model implicitly\ndefined via a simulator. We develop a simulation-based parameter inference\nmethod for implicitly defined models. Our method differs from traditional\nlikelihood-based inference in that it uses a metamodel for the distribution of\na log-likelihood estimator. The metamodel is built on a local asymptotic\nnormality (LAN) property satisfied by the simulation-based log-likelihood\nestimator under certain conditions. A method for hypothesis test is developed\nunder the metamodel. Our method can enable accurate parameter estimation and\nuncertainty quantification where other Monte Carlo methods for parameter\ninference become highly inefficient due to large Monte Carlo variance. We\ndemonstrate our method using numerical examples including a mechanistic model\nfor the population dynamics of infectious disease."}, "http://arxiv.org/abs/2311.09838": {"title": "Bayesian Inference of Reproduction Number from Epidemiological and Genetic Data Using Particle MCMC", "link": "http://arxiv.org/abs/2311.09838", "description": "Inference of the reproduction number through time is of vital importance\nduring an epidemic outbreak. Typically, epidemiologists tackle this using\nobserved prevalence or incidence data. However, prevalence and incidence data\nalone is often noisy or partial. Models can also have identifiability issues\nwith determining whether a large amount of a small epidemic or a small amount\nof a large epidemic has been observed. Sequencing data however is becoming more\nabundant, so approaches which can incorporate genetic data are an active area\nof research. We propose using particle MCMC methods to infer the time-varying\nreproduction number from a combination of prevalence data reported at a set of\ndiscrete times and a dated phylogeny reconstructed from sequences. We validate\nour approach on simulated epidemics with a variety of scenarios. We then apply\nthe method to a real data set of HIV-1 in North Carolina, USA, between 1957 and\n2019."}, "http://arxiv.org/abs/2311.09875": {"title": "Unbiased and Multilevel Methods for a Class of Diffusions Partially Observed via Marked Point Processes", "link": "http://arxiv.org/abs/2311.09875", "description": "In this article we consider the filtering problem associated to partially\nobserved diffusions, with observations following a marked point process. In the\nmodel, the data form a point process with observation times that have its\nintensity driven by a diffusion, with the associated marks also depending upon\nthe diffusion process. We assume that one must resort to time-discretizing the\ndiffusion process and develop particle and multilevel particle filters to\nrecursively approximate the filter. In particular, we prove that our multilevel\nparticle filter can achieve a mean square error (MSE) of\n$\\mathcal{O}(\\epsilon^2)$ ($\\epsilon>0$ and arbitrary) with a cost of\n$\\mathcal{O}(\\epsilon^{-2.5})$ versus using a particle filter which has a cost\nof $\\mathcal{O}(\\epsilon^{-3})$ to achieve the same MSE. 
We then show how this\nmethodology can be extended to give unbiased (that is with no\ntime-discretization error) estimators of the filter, which are proved to have\nfinite variance and with high-probability have finite cost. Finally, we extend\nour methodology to the problem of online static-parameter estimation."}, "http://arxiv.org/abs/2311.09935": {"title": "Semi-parametric Benchmark Dose Analysis with Monotone Additive Models", "link": "http://arxiv.org/abs/2311.09935", "description": "Benchmark dose analysis aims to estimate the level of exposure to a toxin\nthat results in a clinically-significant adverse outcome and quantifies\nuncertainty using the lower limit of a confidence interval for this level. We\ndevelop a novel framework for benchmark dose analysis based on monotone\nadditive dose-response models. We first introduce a flexible approach for\nfitting monotone additive models via penalized B-splines and\nLaplace-approximate marginal likelihood. A reflective Newton method is then\ndeveloped that employs de Boor's algorithm for computing splines and their\nderivatives for efficient estimation of the benchmark dose. Finally, we develop\nand assess three approaches for calculating benchmark dose lower limits: a\nnaive one based on asymptotic normality of the estimator, one based on an\napproximate pivot, and one using a Bayesian parametric bootstrap. The latter\napproaches improve upon the naive method in terms of accuracy and are\nguaranteed to return a positive lower limit; the approach based on an\napproximate pivot is typically an order of magnitude faster than the bootstrap,\nalthough they are both practically feasible to compute. We apply the new\nmethods to make inferences about the level of prenatal alcohol exposure\nassociated with clinically significant cognitive defects in children using data\nfrom an NIH-funded longitudinal study. Software to reproduce the results in\nthis paper is available at https://github.com/awstringer1/bmd-paper-code."}, "http://arxiv.org/abs/2311.09961": {"title": "Scan statistics for the detection of anomalies in M-dependent random fields with applications to image data", "link": "http://arxiv.org/abs/2311.09961", "description": "Anomaly detection in random fields is an important problem in many\napplications including the detection of cancerous cells in medicine, obstacles\nin autonomous driving and cracks in the construction material of buildings.\nSuch anomalies are often visible as areas with different expected values\ncompared to the background noise. Scan statistics based on local means have the\npotential to detect such local anomalies by enhancing relevant features. We\nderive limit theorems for a general class of such statistics over M-dependent\nrandom fields of arbitrary but fixed dimension. By allowing for a variety of\ncombinations and contrasts of sample means over differently-shaped local\nwindows, this yields a flexible class of scan statistics that can be tailored\nto the particular application of interest. The latter is demonstrated for crack\ndetection in 2D-images of different types of concrete. 
Together with a\nsimulation study this indicates the potential of the proposed methodology for\nthe detection of anomalies in a variety of situations."}, "http://arxiv.org/abs/2311.09989": {"title": "Xputer: Bridging Data Gaps with NMF, XGBoost, and a Streamlined GUI Experience", "link": "http://arxiv.org/abs/2311.09989", "description": "The rapid proliferation of data across diverse fields has accentuated the\nimportance of accurate imputation for missing values. This task is crucial for\nensuring data integrity and deriving meaningful insights. In response to this\nchallenge, we present Xputer, a novel imputation tool that adeptly integrates\nNon-negative Matrix Factorization (NMF) with the predictive strengths of\nXGBoost. One of Xputer's standout features is its versatility: it supports zero\nimputation, enables hyperparameter optimization through Optuna, and allows\nusers to define the number of iterations. For enhanced user experience and\naccessibility, we have equipped Xputer with an intuitive Graphical User\nInterface (GUI) ensuring ease of handling, even for those less familiar with\ncomputational tools. In performance benchmarks, Xputer not only rivals the\ncomputational speed of established tools such as IterativeImputer but also\noften outperforms them in terms of imputation accuracy. Furthermore, Xputer\nautonomously handles a diverse spectrum of data types, including categorical,\ncontinuous, and Boolean, eliminating the need for prior preprocessing. Given\nits blend of performance, flexibility, and user-friendly design, Xputer emerges\nas a state-of-the-art solution in the realm of data imputation."}, "http://arxiv.org/abs/2311.10076": {"title": "A decorrelation method for general regression adjustment in randomized experiments", "link": "http://arxiv.org/abs/2311.10076", "description": "We study regression adjustment with general function class approximations for\nestimating the average treatment effect in the design-based setting. Standard\nregression adjustment involves bias due to sample re-use, and this bias leads\nto behavior that is sub-optimal in the sample size, and/or imposes restrictive\nassumptions. Our main contribution is to introduce a novel decorrelation-based\napproach that circumvents these issues. We prove guarantees, both asymptotic\nand non-asymptotic, relative to the oracle functions that are targeted by a\ngiven regression adjustment procedure. We illustrate our method by applying it\nto various high-dimensional and non-parametric problems, exhibiting improved\nsample complexity and weakened assumptions relative to known approaches."}, "http://arxiv.org/abs/2108.09431": {"title": "Equivariant Variance Estimation for Multiple Change-point Model", "link": "http://arxiv.org/abs/2108.09431", "description": "The variance of noise plays an important role in many change-point detection\nprocedures and the associated inferences. Most commonly used variance\nestimators require strong assumptions on the true mean structure or normality\nof the error distribution, which may not hold in applications. More\nimportantly, the qualities of these estimators have not been discussed\nsystematically in the literature. In this paper, we introduce a framework of\nequivariant variance estimation for multiple change-point models. 
In\nparticular, we characterize the set of all equivariant unbiased quadratic\nvariance estimators for a family of change-point model classes, and develop a\nminimax theory for such estimators."}, "http://arxiv.org/abs/2210.07987": {"title": "Bayesian Learning via Q-Exponential Process", "link": "http://arxiv.org/abs/2210.07987", "description": "Regularization is one of the most fundamental topics in optimization,\nstatistics and machine learning. To get sparsity in estimating a parameter\n$u\\in\\mathbb{R}^d$, an $\\ell_q$ penalty term, $\\Vert u\\Vert_q$, is usually\nadded to the objective function. What is the probabilistic distribution\ncorresponding to such $\\ell_q$ penalty? What is the correct stochastic process\ncorresponding to $\\Vert u\\Vert_q$ when we model functions $u\\in L^q$? This is\nimportant for statistically modeling large dimensional objects, e.g. images,\nwith penalty to preserve certain properties, e.g. edges in the image. In this\nwork, we generalize the $q$-exponential distribution (with density proportional\nto) $\\exp{(- \\frac{1}{2}|u|^q)}$ to a stochastic process named $Q$-exponential\n(Q-EP) process that corresponds to the $L_q$ regularization of functions. The\nkey step is to specify consistent multivariate $q$-exponential distributions by\nchoosing from a large family of elliptic contour distributions. The work is\nclosely related to Besov process which is usually defined by the expanded\nseries. Q-EP can be regarded as a definition of Besov process with explicit\nprobabilistic formulation and direct control on the correlation length. From\nthe Bayesian perspective, Q-EP provides a flexible prior on functions with\nsharper penalty ($q<2$) than the commonly used Gaussian process (GP). We\ncompare GP, Besov and Q-EP in modeling functional data, reconstructing images,\nand solving inverse problems and demonstrate the advantage of our proposed\nmethodology."}, "http://arxiv.org/abs/2302.00354": {"title": "The Spatial Kernel Predictor based on Huge Observation Sets", "link": "http://arxiv.org/abs/2302.00354", "description": "Spatial prediction in an arbitrary location, based on a spatial set of\nobservations, is usually performed by Kriging, being the best linear unbiased\npredictor (BLUP) in a least-square sense. In order to predict a continuous\nsurface over a spatial domain a grid representation is most often used. Kriging\npredictions and prediction variances are computed in the nodes of a grid\ncovering the spatial domain, and the continuous surface is assessed from this\ngrid representation. A precise representation usually requires the number of\ngrid nodes to be considerably larger than the number of observations. For a\nGaussian random field model the Kriging predictor coincides with the\nconditional expectation of the spatial variable given the observation set. An\nalternative expression for this conditional expectation provides a spatial\npredictor on functional form which does not rely on a spatial grid\ndiscretization. This functional predictor, called the Kernel predictor, is\nidentical to the asymptotic grid infill limit of the Kriging-based grid\nrepresentation, and the computational demand is primarily dependent on the\nnumber of observations - not the dimension of the spatial reference domain nor\nany grid discretization. We explore the potential of this Kernel predictor with\nassociated prediction variances. 
The predictor is valid for Gaussian random\nfields with any eligible spatial correlation function, and large computational\nsavings can be obtained by using a finite-range spatial correlation function.\nFor studies with a huge set of observations, localized predictors must be used,\nand the computational advantage relative to Kriging predictors can be very\nlarge. Moreover, model parameter inference based on a huge observation set can\nbe efficiently made. The methodology is demonstrated in a couple of examples."}, "http://arxiv.org/abs/2302.01861": {"title": "Covariance Matrix Estimation for High-Throughput Biomedical Data with Interconnected Communities", "link": "http://arxiv.org/abs/2302.01861", "description": "Estimating a covariance matrix is central to high-dimensional data analysis.\nEmpirical analyses of high-dimensional biomedical data, including genomics,\nproteomics, microbiome, and neuroimaging, among others, consistently reveal\nstrong modularity in the dependence patterns. In these analyses,\nintercorrelated high-dimensional biomedical features often form communities or\nmodules that can be interconnected with others. While the interconnected\ncommunity structure has been extensively studied in biomedical research (e.g.,\ngene co-expression networks), its potential to assist in the estimation of\ncovariance matrices remains largely unexplored. To address this gap, we propose\na procedure that leverages the commonly observed interconnected community\nstructure in high-dimensional biomedical data to estimate large covariance and\nprecision matrices. We derive the uniformly minimum variance unbiased\nestimators for covariance and precision matrices in closed forms and provide\ntheoretical results on their asymptotic properties. Our proposed method\nenhances the accuracy of covariance- and precision-matrix estimation and\ndemonstrates superior performance compared to the competing methods in both\nsimulations and real data analyses."}, "http://arxiv.org/abs/2303.16299": {"title": "Comparison of Methods that Combine Multiple Randomized Trials to Estimate Heterogeneous Treatment Effects", "link": "http://arxiv.org/abs/2303.16299", "description": "Individualized treatment decisions can improve health outcomes, but using\ndata to make these decisions in a reliable, precise, and generalizable way is\nchallenging with a single dataset. Leveraging multiple randomized controlled\ntrials allows for the combination of datasets with unconfounded treatment\nassignment to better estimate heterogeneous treatment effects. This paper\ndiscusses several non-parametric approaches for estimating heterogeneous\ntreatment effects using data from multiple trials. We extend single-study\nmethods to a scenario with multiple trials and explore their performance\nthrough a simulation study, with data generation scenarios that have differing\nlevels of cross-trial heterogeneity. The simulations demonstrate that methods\nthat directly allow for heterogeneity of the treatment effect across trials\nperform better than methods that do not, and that the choice of single-study\nmethod matters based on the functional form of the treatment effect. 
Finally,\nwe discuss which methods perform well in each setting and then apply them to\nfour randomized controlled trials to examine effect heterogeneity of treatments\nfor major depressive disorder."}, "http://arxiv.org/abs/2306.17043": {"title": "How trace plots help interpret meta-analysis results", "link": "http://arxiv.org/abs/2306.17043", "description": "The trace plot is seldom used in meta-analysis, yet it is a very informative\nplot. In this article we define and illustrate what the trace plot is, and\ndiscuss why it is important. The Bayesian version of the plot combines the\nposterior density of tau, the between-study standard deviation, and the\nshrunken estimates of the study effects as a function of tau. With a small or\nmoderate number of studies, tau is not estimated with much precision, and\nparameter estimates and shrunken study effect estimates can vary widely\ndepending on the correct value of tau. The trace plot allows visualization of\nthe sensitivity to tau along with a plot that shows which values of tau are\nplausible and which are implausible. A comparable frequentist or empirical\nBayes version provides similar results. The concepts are illustrated using\nexamples in meta-analysis and meta-regression; implementation in R is\nfacilitated in a Bayesian or frequentist framework using the bayesmeta and\nmetafor packages, respectively."}, "http://arxiv.org/abs/2309.10978": {"title": "Scarcity-Mediated Spillover: An Overlooked Source of Bias in Pragmatic Clinical Trials", "link": "http://arxiv.org/abs/2309.10978", "description": "Pragmatic clinical trials evaluate the effectiveness of health interventions\nin real-world settings. Spillover arises in a pragmatic trial if the study\nintervention affects how scarce resources are allocated between patients in the\nintervention and comparison groups. This can harm patients assigned to the\ncontrol group and lead to overestimation of treatment effect. There is\ncurrently little recognition of this source of bias - which I term\n\"scarcity-mediated spillover\" - in the medical literature. In this article, I\nexamine what causes spillover and how it may have led trial investigators to\noverestimate the effect of patient navigation, AI-based physiological alarms,\nand elective induction of labor. I also suggest ways to detect\nscarcity-mediated spillover, design trials that avoid it, and modify clinical\ntrial guidelines to address this overlooked source of bias."}, "http://arxiv.org/abs/1910.12486": {"title": "Two-stage data segmentation permitting multiscale change points, heavy tails and dependence", "link": "http://arxiv.org/abs/1910.12486", "description": "The segmentation of a time series into piecewise stationary segments, a.k.a.\nmultiple change point analysis, is an important problem both in time series\nanalysis and signal processing. In the presence of multiscale change points\nwith both large jumps over short intervals and small changes over long\nstationary intervals, multiscale methods achieve good adaptivity in their\nlocalisation but at the same time, require the removal of false positives and\nduplicate estimators via a model selection step. In this paper, we propose a\nlocalised application of Schwarz information criterion which, as a generic\nmethodology, is applicable with any multiscale candidate generating procedure\nfulfilling mild assumptions. 
We establish the theoretical consistency of the\nproposed localised pruning method in estimating the number and locations of\nmultiple change points under general assumptions permitting heavy tails and\ndependence. Further, we show that combined with a MOSUM-based candidate\ngenerating procedure, it attains minimax optimality in terms of detection lower\nbound and localisation for i.i.d. sub-Gaussian errors. A careful comparison\nwith the existing methods by means of (a) theoretical properties such as\ngenerality, optimality and algorithmic complexity, (b) performance on simulated\ndatasets and run time, as well as (c) performance on real data applications,\nconfirms the overall competitiveness of the proposed methodology."}, "http://arxiv.org/abs/2101.04651": {"title": "Moving sum data segmentation for stochastic processes based on invariance", "link": "http://arxiv.org/abs/2101.04651", "description": "The segmentation of data into stationary stretches also known as multiple\nchange point problem is important for many applications in time series analysis\nas well as signal processing. Based on strong invariance principles, we analyse\ndata segmentation methodology using moving sum (MOSUM) statistics for a class\nof regime-switching multivariate processes where each switch results in a\nchange in the drift. In particular, this framework includes the data\nsegmentation of multivariate partial sum, integrated diffusion and renewal\nprocesses even if the distance between change points is sublinear. We study the\nasymptotic behaviour of the corresponding change point estimators, show\nconsistency and derive the corresponding localisation rates which are minimax\noptimal in a variety of situations including an unbounded number of changes in\nWiener processes with drift. Furthermore, we derive the limit distribution of\nthe change point estimators for local changes - a result that can in principle\nbe used to derive confidence intervals for the change points."}, "http://arxiv.org/abs/2207.07396": {"title": "Data Segmentation for Time Series Based on a General Moving Sum Approach", "link": "http://arxiv.org/abs/2207.07396", "description": "In this paper we propose new methodology for the data segmentation, also\nknown as multiple change point problem, in a general framework including\nclassic mean change scenarios, changes in linear regression but also changes in\nthe time series structure such as in the parameters of Poisson-autoregressive\ntime series. In particular, we derive a general theory based on estimating\nequations proving consistency for the number of change points as well as rates\nof convergence for the estimators of the locations of the change points. More\nprecisely, two different types of MOSUM (moving sum) statistics are considered:\nA MOSUM-Wald statistic based on differences of local estimators and a\nMOSUM-score statistic based on a global estimator. The latter is usually\ncomputationally less involved in particular in non-linear problems where no\nclosed form of the estimator is known such that numerical methods are required.\nFinally, we evaluate the methodology by means of simulated data as well as\nusing some geophysical well-log data."}, "http://arxiv.org/abs/2311.10263": {"title": "Stable Differentiable Causal Discovery", "link": "http://arxiv.org/abs/2311.10263", "description": "Inferring causal relationships as directed acyclic graphs (DAGs) is an\nimportant but challenging problem. 
Differentiable Causal Discovery (DCD) is a\npromising approach to this problem, framing the search as a continuous\noptimization. But existing DCD methods are numerically unstable, with poor\nperformance beyond tens of variables. In this paper, we propose Stable\nDifferentiable Causal Discovery (SDCD), a new method that improves previous DCD\nmethods in two ways: (1) It employs an alternative constraint for acyclicity;\nthis constraint is more stable, both theoretically and empirically, and fast to\ncompute. (2) It uses a training procedure tailored for sparse causal graphs,\nwhich are common in real-world scenarios. We first derive SDCD and prove its\nstability and correctness. We then evaluate it with both observational and\ninterventional data and on both small-scale and large-scale settings. We find\nthat SDCD outperforms existing methods in both convergence speed and accuracy\nand can scale to thousands of variables."}, "http://arxiv.org/abs/2311.10279": {"title": "Differentially private analysis of networks with covariates via a generalized $\\beta$-model", "link": "http://arxiv.org/abs/2311.10279", "description": "How to achieve the tradeoff between privacy and utility is one of the fundamental\nproblems in private data analysis. In this paper, we give a rigorous\ndifferential privacy analysis of networks in the presence of covariates via a\ngeneralized $\\beta$-model, which has an $n$-dimensional degree parameter\n$\\beta$ and a $p$-dimensional homophily parameter $\\gamma$. Under $(k_n,\n\\epsilon_n)$-edge differential privacy, we use the popular Laplace mechanism to\nrelease the network statistics. The method of moments is used to estimate the\nunknown model parameters. We establish the conditions guaranteeing consistency\nof the differentially private estimators $\\widehat{\\beta}$ and\n$\\widehat{\\gamma}$ as the number of nodes $n$ goes to infinity, which reveal an\ninteresting tradeoff between a privacy parameter and model parameters. The\nconsistency is shown by applying a two-stage Newton's method to obtain the\nupper bound of the error between $(\\widehat{\\beta},\\widehat{\\gamma})$ and its\ntrue value $(\\beta, \\gamma)$ in terms of the $\\ell_\\infty$ distance, which has\na convergence rate of rough order $1/n^{1/2}$ for $\\widehat{\\beta}$ and $1/n$\nfor $\\widehat{\\gamma}$, respectively. Further, we derive the asymptotic\nnormalities of $\\widehat{\\beta}$ and $\\widehat{\\gamma}$, whose asymptotic\nvariances are the same as those of the non-private estimators under some\nconditions. Our paper sheds light on how to explore asymptotic theory under\ndifferential privacy in a principled manner; these principled methods should be\napplicable to a class of network models with covariates beyond the generalized\n$\\beta$-model. Numerical studies and a real data analysis demonstrate our\ntheoretical findings."}, "http://arxiv.org/abs/2311.10282": {"title": "Joint clustering with alignment for temporal data in a one-point-per-trajectory setting", "link": "http://arxiv.org/abs/2311.10282", "description": "Temporal data, obtained in the setting where it is only possible to observe\none time point per trajectory, is widely used in different research fields, yet\nremains insufficiently addressed from the statistical point of view. Such data\noften contain observations of a large number of entities, in which case it is\nof interest to identify a small number of representative behavior types. 
In\nthis paper, we propose a new method performing clustering simultaneously with\nalignment of temporal objects inferred from these data, providing insight into\nthe relationships between the entities. A series of simulations confirm the\nability of the proposed approach to leverage multiple properties of the complex\ndata we target such as accessible uncertainties, correlations and a small\nnumber of time points. We illustrate it on real data encoding cellular response\nto a radiation treatment with high energy, supported with the results of an\nenrichment analysis."}, "http://arxiv.org/abs/2311.10489": {"title": "Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline Approach", "link": "http://arxiv.org/abs/2311.10489", "description": "Overlapping asymmetric datasets are common in data science and pose questions\nof how they can be incorporated together into a predictive analysis. In\nhealthcare datasets there is often a small amount of information that is\navailable for a larger number of patients such as an electronic health record,\nhowever a small number of patients may have had extensive further testing.\nCommon solutions such as missing imputation can often be unwise if the smaller\ncohort is significantly different in scale to the larger sample, therefore the\naim of this research is to develop a new method which can model the smaller\ncohort against a particular response, whilst considering the larger cohort\nalso. Motivated by non-parametric models, and specifically flexible smoothing\ntechniques via generalized additive models, we model a twice penalized P-Spline\napproximation method to firstly prevent over/under-fitting of the smaller\ncohort and secondly to consider the larger cohort. This second penalty is\ncreated through discrepancies in the marginal value of covariates that exist in\nboth the smaller and larger cohorts. Through data simulations, parameter\ntunings and model adaptations to consider a continuous and binary response, we\nfind our twice penalized approach offers an enhanced fit over a linear B-Spline\nand once penalized P-Spline approximation. Applying to a real-life dataset\nrelating to a person's risk of developing Non-Alcoholic Steatohepatitis, we see\nan improved model fit performance of over 65%. Areas for future work within\nthis space include adapting our method to not require dimensionality reduction\nand also consider parametric modelling methods. However, to our knowledge this\nis the first work to propose additional marginal penalties in a flexible\nregression of which we can report a vastly improved model fit that is able to\nconsider asymmetric datasets, without the need for missing data imputation."}, "http://arxiv.org/abs/2311.10638": {"title": "Concept-free Causal Disentanglement with Variational Graph Auto-Encoder", "link": "http://arxiv.org/abs/2311.10638", "description": "In disentangled representation learning, the goal is to achieve a compact\nrepresentation that consists of all interpretable generative factors in the\nobservational data. Learning disentangled representations for graphs becomes\nincreasingly important as graph data rapidly grows. Existing approaches often\nrely on Variational Auto-Encoder (VAE) or its causal structure learning-based\nrefinement, which suffer from sub-optimality in VAEs due to the independence\nfactor assumption and unavailability of concept labels, respectively. 
In this\npaper, we propose an unsupervised solution, dubbed concept-free causal\ndisentanglement, built on a theoretically provable tight upper bound\napproximating the optimal factor. This results in an SCM-like causal structure\nmodeling that directly learns concept structures from data. Based on this idea,\nwe propose Concept-free Causal VGAE (CCVGAE) by incorporating a novel causal\ndisentanglement layer into Variational Graph Auto-Encoder. Furthermore, we\nprove concept consistency under our concept-free causal disentanglement\nframework, hence employing it to enhance the meta-learning framework, called\nconcept-free causal Meta-Graph (CC-Meta-Graph). We conduct extensive\nexperiments to demonstrate the superiority of the proposed models: CCVGAE and\nCC-Meta-Graph, reaching up to $29\\%$ and $11\\%$ absolute improvements over\nbaselines in terms of AUC, respectively."}, "http://arxiv.org/abs/2009.07055": {"title": "Causal Inference of General Treatment Effects using Neural Networks with A Diverging Number of Confounders", "link": "http://arxiv.org/abs/2009.07055", "description": "Semiparametric efficient estimation of various multi-valued causal effects,\nincluding quantile treatment effects, is important in economic, biomedical, and\nother social sciences. Under the unconfoundedness condition, adjustment for\nconfounders requires estimating the nuisance functions relating outcome or\ntreatment to confounders nonparametrically. This paper considers a generalized\noptimization framework for efficient estimation of general treatment effects\nusing artificial neural networks (ANNs) to approximate the unknown nuisance\nfunction of growing-dimensional confounders. We establish a new approximation\nerror bound for the ANNs to the nuisance function belonging to a mixed\nsmoothness class without a known sparsity structure. We show that the ANNs can\nalleviate the \"curse of dimensionality\" under this circumstance. We establish\nthe root-$n$ consistency and asymptotic normality of the proposed general\ntreatment effects estimators, and apply a weighted bootstrap procedure for\nconducting inference. The proposed methods are illustrated via simulation\nstudies and a real data application."}, "http://arxiv.org/abs/2012.08371": {"title": "Limiting laws and consistent estimation criteria for fixed and diverging number of spiked eigenvalues", "link": "http://arxiv.org/abs/2012.08371", "description": "In this paper, we study limiting laws and consistent estimation criteria for\nthe extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly,\nfor fixed $p$, we propose a generalized estimation criterion that can\nconsistently estimate, $k$, the number of spiked eigenvalues. Compared with the\nexisting literature, we show that consistency can be achieved under weaker\nconditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we\nderive limiting distributions of the spiked sample eigenvalues using random\nmatrix theory techniques. Notably, our results do not require the spiked\neigenvalues to be uniformly bounded from above or tending to infinity, as have\nbeen assumed in the existing literature. Based on the above derived results, we\nformulate a generalized estimation criterion and show that it can consistently\nestimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We\nfurther show that the results in our work continue to hold under a general\npopulation distribution without assuming normality. 
The efficacy of the\nproposed estimation criteria is illustrated through comparative simulation\nstudies."}, "http://arxiv.org/abs/2104.06296": {"title": "Count Network Autoregression", "link": "http://arxiv.org/abs/2104.06296", "description": "We consider network autoregressive models for count data with a non-random\nneighborhood structure. The main methodological contribution is the development\nof conditions that guarantee stability and valid statistical inference for such\nmodels. We consider both cases of fixed and increasing network dimension and we\nshow that quasi-likelihood inference provides consistent and asymptotically\nnormally distributed estimators. The work is complemented by simulation results\nand a data example."}, "http://arxiv.org/abs/2110.09115": {"title": "Optimal designs for experiments for scalar-on-function linear models", "link": "http://arxiv.org/abs/2110.09115", "description": "The aim of this work is to extend the usual optimal experimental design\nparadigm to experiments where the settings of one or more factors are\nfunctions. Such factors are known as profile factors, or as dynamic factors.\nFor these new experiments, a design consists of combinations of functions for\neach run of the experiment. After briefly introducing the class of profile\nfactors, basis functions are described with primary focus given on the B-spline\nbasis system, due to its computational efficiency and useful properties. Basis\nfunction expansions are applied to a functional linear model consisting of\nprofile factors, reducing the problem to an optimisation of basis coefficients.\nThe methodology developed comprises special cases, including combinations of\nprofile and non-functional factors, interactions, and polynomial effects. The\nmethod is finally applied to an experimental design problem in a\nBiopharmaceutical study that is performed using the Ambr250 modular bioreactor."}, "http://arxiv.org/abs/2207.13071": {"title": "Missing Values Handling for Machine Learning Portfolios", "link": "http://arxiv.org/abs/2207.13071", "description": "We characterize the structure and origins of missingness for 159\ncross-sectional return predictors and study missing value handling for\nportfolios constructed using machine learning. Simply imputing with\ncross-sectional means performs well compared to rigorous\nexpectation-maximization methods. This stems from three facts about predictor\ndata: (1) missingness occurs in large blocks organized by time, (2)\ncross-sectional correlations are small, and (3) missingness tends to occur in\nblocks organized by the underlying data source. As a result, observed data\nprovide little information about missing data. Sophisticated imputations\nintroduce estimation noise that can lead to underperformance if machine\nlearning is not carefully applied."}, "http://arxiv.org/abs/2209.00105": {"title": "Personalized Biopsy Schedules Using an Interval-censored Cause-specific Joint Model", "link": "http://arxiv.org/abs/2209.00105", "description": "Active surveillance (AS), where biopsies are conducted to detect cancer\nprogression, has been acknowledged as an efficient way to reduce the\novertreatment of prostate cancer. Most AS cohorts use fixed biopsy schedules\nfor all patients. However, the ideal test frequency remains unknown, and the\nroutine use of such invasive tests burdens the patients. An emerging idea is to\ngenerate personalized biopsy schedules based on each patient's\nprogression-specific risk. 
To achieve that, we propose the interval-censored\ncause-specific joint model (ICJM), which models the impact of longitudinal\nbiomarkers on cancer progression while considering the competing event of early\ntreatment initiation. The underlying likelihood function incorporates the\ninterval-censoring of cancer progression, the competing risk of treatment, and\nthe uncertainty about whether cancer progression occurred since the last biopsy\nin patients that are right-censored or experience the competing event. The\nmodel can produce patient-specific risk profiles until a horizon time. If the\nrisk exceeds a certain threshold, a biopsy is conducted. The optimal threshold\ncan be chosen by balancing two indicators of the biopsy schedules: the expected\nnumber of biopsies and expected delay in detection of cancer progression. A\nsimulation study showed that our personalized schedules could considerably\nreduce the number of biopsies per patient by 34%-54% compared to the fixed\nschedules, though at the cost of a slightly longer detection delay."}, "http://arxiv.org/abs/2304.14110": {"title": "A Bayesian Spatio-Temporal Extension to Poisson Auto-Regression: Modeling the Disease Infection Rate of COVID-19 in England", "link": "http://arxiv.org/abs/2304.14110", "description": "The COVID-19 pandemic provided many modeling challenges to investigate the\nevolution of an epidemic process over areal units. A suitable encompassing\nmodel must describe the spatio-temporal variations of the disease infection\nrate of multiple areal processes while adjusting for local and global inputs.\nWe develop an extension to Poisson Auto-Regression that incorporates\nspatio-temporal dependence to characterize the local dynamics while borrowing\ninformation among adjacent areas. The specification includes up to two sets of\nspace-time random effects to capture the spatio-temporal dependence and a\nlinear predictor depending on an arbitrary set of covariates. The proposed\nmodel, adopted in a fully Bayesian framework and implemented through a novel\nsparse-matrix representation in Stan, provides a framework for evaluating local\npolicy changes over the whole spatial and temporal domain of the study. It has\nbeen validated through a substantial simulation study and applied to the weekly\nCOVID-19 cases observed in the English local authority districts between May\n2020 and March 2021. The model detects substantial spatial and temporal\nheterogeneity and allows a full evaluation of the impact of two alternative\nsets of covariates: the level of local restrictions in place and the value of\nthe Google Mobility Indices. The paper also formalizes various novel\nmodel-based investigation methods for assessing additional aspects of disease\nepidemiology."}, "http://arxiv.org/abs/2305.14131": {"title": "Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains", "link": "http://arxiv.org/abs/2305.14131", "description": "We consider the problem of detecting causal relationships between discrete\ntime series, in the presence of potential confounders. A hypothesis test is\nintroduced for identifying the temporally causal influence of $(x_n)$ on\n$(y_n)$, causally conditioned on a possibly confounding third time series\n$(z_n)$. Under natural Markovian modeling assumptions, it is shown that the\nnull hypothesis, corresponding to the absence of temporally causal influence,\nis equivalent to the underlying `causal conditional directed information rate'\nbeing equal to zero. 
The plug-in estimator for this functional is identified\nwith the log-likelihood ratio test statistic for the desired test. This\nstatistic is shown to be asymptotically normal under the alternative hypothesis\nand asymptotically $\\chi^2$ distributed under the null, facilitating the\ncomputation of $p$-values when used on empirical data. The effectiveness of the\nresulting hypothesis test is illustrated on simulated data, validating the\nunderlying theory. The test is also employed in the analysis of spike train\ndata recorded from neurons in the V4 and FEF brain regions of behaving animals\nduring a visual attention task. There, the test results are seen to identify\ninteresting and biologically relevant information."}, "http://arxiv.org/abs/2307.01748": {"title": "Monotone Cubic B-Splines with a Neural-Network Generator", "link": "http://arxiv.org/abs/2307.01748", "description": "We present a method for fitting monotone curves using cubic B-splines, which\nis equivalent to putting a monotonicity constraint on the coefficients. We\nexplore different ways of enforcing this constraint and analyze their\ntheoretical and empirical properties. We propose two algorithms for solving the\nspline fitting problem: one that uses standard optimization techniques and one\nthat trains a Multi-Layer Perceptron (MLP) generator to approximate the\nsolutions under various settings and perturbations. The generator approach can\nspeed up the fitting process when we need to solve the problem repeatedly, such\nas when constructing confidence bands using the bootstrap. We evaluate our method\nagainst several existing methods, some of which do not use the monotonicity\nconstraint, on some monotone curves with varying noise levels. We demonstrate\nthat our method outperforms the other methods, especially in high-noise\nscenarios. We also apply our method to analyze the polarization-hole phenomenon\nduring star formation in astrophysics. The source code is accessible at\n\\texttt{\\url{https://github.com/szcf-weiya/MonotoneSplines.jl}}."}, "http://arxiv.org/abs/2311.10738": {"title": "Approximation of supply curves", "link": "http://arxiv.org/abs/2311.10738", "description": "In this note, we illustrate the computation of the approximation of the\nsupply curves using a one-step basis. We derive the expression for the L2\napproximation and propose a procedure for the selection of nodes of the\napproximation. We illustrate the use of this approach with three large sets of\nbid curves from European electricity markets."}, "http://arxiv.org/abs/2311.10848": {"title": "Addressing Population Heterogeneity for HIV Incidence Estimation Based on Recency Test", "link": "http://arxiv.org/abs/2311.10848", "description": "Cross-sectional HIV incidence estimation leverages recency test results to\ndetermine the HIV incidence of a population of interest, where the recency test\nuses biomarker profiles to infer whether an HIV-positive individual was\n\"recently\" infected. This approach possesses an obvious advantage over the\nconventional cohort follow-up method since it avoids longitudinal follow-up and\nrepeated HIV testing. In this manuscript, we consider the extension of\ncross-sectional incidence estimation to estimate the incidence of a different\ntarget population addressing potential population heterogeneity. 
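A minimal SciPy sketch of fitting a monotone cubic B-spline by constraining the coefficients to be non-decreasing, the sufficient condition mentioned in the abstract above; the knot placement and the cumulative-sum reparameterisation are illustrative choices, not the authors' algorithm or their MLP generator.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import lsq_linear

def fit_monotone_cubic_bspline(x, y, n_interior_knots=8):
    k = 3
    xs, ys = np.asarray(x, float), np.asarray(y, float)
    inner = np.linspace(xs.min(), xs.max(), n_interior_knots + 2)[1:-1]
    t = np.r_[[xs.min()] * (k + 1), inner, [xs.max()] * (k + 1)]
    n_coef = len(t) - k - 1
    # Design matrix: each column is one B-spline basis function evaluated at the data.
    B = np.column_stack([BSpline(t, np.eye(n_coef)[j], k)(xs) for j in range(n_coef)])
    # Non-decreasing coefficients give a non-decreasing spline. Write them as
    # c = L @ theta with theta = (c_1, increments), so the constraint becomes
    # simple bounds for lsq_linear: first entry free, increments >= 0.
    L = np.tril(np.ones((n_coef, n_coef)))
    bounds = (np.r_[-np.inf, np.zeros(n_coef - 1)], np.full(n_coef, np.inf))
    theta = lsq_linear(B @ L, ys, bounds=bounds).x
    return BSpline(t, L @ theta, k)

xs = np.linspace(0.0, 1.0, 60)
ys = 1 / (1 + np.exp(-10 * (xs - 0.5))) + 0.05 * np.random.default_rng(0).normal(size=60)
fit = fit_monotone_cubic_bspline(xs, ys)
print(np.all(np.diff(fit(np.linspace(0.0, 1.0, 200))) >= -1e-9))  # monotone fit
```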
We propose a\ngeneral framework that incorporates two settings: one with the target\npopulation that is a subset of the population with cross-sectional recency\ntesting data, e.g., leveraging recency testing data from screening in\nactive-arm trial design, and the other with an external target population. We\nalso propose a method to incorporate HIV subtype, a special covariate that\nmodifies the properties of recency test, into our framework. Through extensive\nsimulation studies and a data application, we demonstrate the excellent\nperformance of the proposed methods. We conclude with a discussion of\nsensitivity analysis and future work to improve our framework."}, "http://arxiv.org/abs/2311.10877": {"title": "Covariate adjustment in randomized experiments with missing outcomes and covariates", "link": "http://arxiv.org/abs/2311.10877", "description": "Covariate adjustment can improve precision in estimating treatment effects\nfrom randomized experiments. With fully observed data, regression adjustment\nand propensity score weighting are two asymptotically equivalent methods for\ncovariate adjustment in randomized experiments. We show that this equivalence\nbreaks down in the presence of missing outcomes, with regression adjustment no\nlonger ensuring efficiency gain when the true outcome model is not linear in\ncovariates. Propensity score weighting, in contrast, still guarantees\nefficiency over unadjusted analysis, and including more covariates in\nadjustment never harms asymptotic efficiency. Moreover, we establish the value\nof using partially observed covariates to secure additional efficiency. Based\non these findings, we recommend a simple double-weighted estimator for\ncovariate adjustment with incomplete outcomes and covariates: (i) impute all\nmissing covariates by zero, and use the union of the completed covariates and\ncorresponding missingness indicators to estimate the probability of treatment\nand the probability of having observed outcome for all units; (ii) estimate the\naverage treatment effect by the coefficient of the treatment from the\nleast-squares regression of the observed outcome on the treatment, where we\nweight each unit by the inverse of the product of these two estimated\nprobabilities."}, "http://arxiv.org/abs/2311.10900": {"title": "A powerful rank-based correction to multiple testing under positive dependency", "link": "http://arxiv.org/abs/2311.10900", "description": "We develop a novel multiple hypothesis testing correction with family-wise\nerror rate (FWER) control that efficiently exploits positive dependencies\nbetween potentially correlated statistical hypothesis tests. Our proposed\nalgorithm $\\texttt{max-rank}$ is conceptually straight-forward, relying on the\nuse of a $\\max$-operator in the rank domain of computed test statistics. We\ncompare our approach to the frequently employed Bonferroni correction,\ntheoretically and empirically demonstrating its superiority over Bonferroni in\nthe case of existing positive dependency, and its equivalence otherwise. Our\nadvantage over Bonferroni increases as the number of tests rises, and we\nmaintain high statistical power whilst ensuring FWER control. 
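One possible reading of the double-weighted estimator recipe quoted in the covariate-adjustment abstract above, sketched with scikit-learn. Variable names are hypothetical, and weighting by the probability of the treatment actually received is an interpretation of step (ii), not something stated explicitly in the abstract.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def double_weighted_ate(X, Z, Y, R):
    """X: covariates with NaN for missing entries, Z: 0/1 treatment,
    Y: outcome, R: 1 if the outcome is observed."""
    M = np.isnan(X).astype(float)                  # missingness indicators
    W = np.hstack([np.nan_to_num(X, nan=0.0), M])  # step (i): zero-impute + indicators
    e = LogisticRegression(max_iter=1000).fit(W, Z).predict_proba(W)[:, 1]
    r = LogisticRegression(max_iter=1000).fit(W, R).predict_proba(W)[:, 1]
    w = 1.0 / (np.where(Z == 1, e, 1 - e) * r)     # inverse product of the two probabilities
    obs = R == 1                                   # step (ii): weighted regression of Y on Z
    fit = LinearRegression().fit(Z[obs].reshape(-1, 1), Y[obs], sample_weight=w[obs])
    return fit.coef_[0]

rng = np.random.default_rng(0)
n = 500
Xc = rng.normal(size=(n, 3))
Z = rng.integers(0, 2, n)
Y = 1.0 + Z + Xc[:, 0] + rng.normal(size=n)
X = Xc.copy()
X[rng.random((n, 3)) < 0.2] = np.nan               # ~20% of covariate entries missing
R = (rng.random(n) < 0.8).astype(int)              # ~80% of outcomes observed
print(double_weighted_ate(X, Z, Y, R))             # close to the true effect of 1
```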
We specifically\nframe our algorithm in the context of parallel permutation testing, a scenario\nthat arises in our primary application of conformal prediction, a recently\npopularized approach for quantifying uncertainty in complex predictive\nsettings."}, "http://arxiv.org/abs/2311.11050": {"title": "Functional Neural Network Control Chart", "link": "http://arxiv.org/abs/2311.11050", "description": "In many Industry 4.0 data analytics applications, quality characteristic data\nacquired from manufacturing processes are better modeled as functions, often\nreferred to as profiles. In practice, there are situations where a scalar\nquality characteristic, referred to also as the response, is influenced by one\nor more variables in the form of functional data, referred to as functional\ncovariates. To adjust the monitoring of the scalar response by the effect of\nthis additional information, a new profile monitoring strategy is proposed on\nthe residuals obtained from the functional neural network, which is able to\nlearn a possibly nonlinear relationship between the scalar response and the\nfunctional covariates. An extensive Monte Carlo simulation study is performed\nto assess the performance of the proposed method with respect to other control\ncharts that appeared in the literature before. Finally, a case study in the\nrailway industry is presented with the aim of monitoring the heating,\nventilation and air conditioning systems installed onboard passenger trains."}, "http://arxiv.org/abs/2311.11054": {"title": "Modern extreme value statistics for Utopian extremes", "link": "http://arxiv.org/abs/2311.11054", "description": "Capturing the extremal behaviour of data often requires bespoke marginal and\ndependence models which are grounded in rigorous asymptotic theory, and hence\nprovide reliable extrapolation into the upper tails of the data-generating\ndistribution. We present a modern toolbox of four methodological frameworks,\nmotivated by modern extreme value theory, that can be used to accurately\nestimate extreme exceedance probabilities or the corresponding level in either\na univariate or multivariate setting. Our frameworks were used to facilitate\nthe winning contribution of Team Yalla to the data competition organised for\nthe 13th International Conference on Extreme Value Analysis (EVA2023). This\ncompetition comprised seven teams competing across four separate\nsub-challenges, with each requiring the modelling of data simulated from known,\nyet highly complex, statistical distributions, and extrapolation far beyond the\nrange of the available samples in order to predict probabilities of extreme\nevents. Data were constructed to be representative of real environmental data,\nsampled from the fantasy country of \"Utopia\"."}, "http://arxiv.org/abs/2311.11153": {"title": "Biarchetype analysis: simultaneous learning of observations and features based on extremes", "link": "http://arxiv.org/abs/2311.11153", "description": "A new exploratory technique called biarchetype analysis is defined. We extend\narchetype analysis to find the archetypes of both observations and features\nsimultaneously. The idea of this new unsupervised machine learning tool is to\nrepresent observations and features by instances of pure types (biarchetypes)\nthat can be easily interpreted as they are mixtures of observations and\nfeatures. Furthermore, the observations and features are expressed as mixtures\nof the biarchetypes, which also helps understand the structure of the data. 
We\npropose an algorithm to solve biarchetype analysis. We show that biarchetype\nanalysis offers advantages over biclustering, especially in terms of\ninterpretability. This is because biarchetypes are extreme instances as opposed\nto the centroids returned by biclustering, which favors human understanding.\nBiarchetype analysis is applied to several machine learning problems to\nillustrate its usefulness."}, "http://arxiv.org/abs/2311.11216": {"title": "Valid Randomization Tests in Inexactly Matched Observational Studies via Iterative Convex Programming", "link": "http://arxiv.org/abs/2311.11216", "description": "In causal inference, matching is one of the most widely used methods to mimic\na randomized experiment using observational (non-experimental) data. Ideally,\ntreated units are exactly matched with control units for the covariates so that\nthe treatments are as-if randomly assigned within each matched set, and valid\nrandomization tests for treatment effects can then be conducted as in a\nrandomized experiment. However, inexact matching typically exists, especially\nwhen there are continuous or many observed covariates or when unobserved\ncovariates exist. Previous matched observational studies routinely conducted\ndownstream randomization tests as if matching was exact, as long as the matched\ndatasets satisfied some prespecified balance criteria or passed some balance\ntests. Some recent studies showed that this routine practice could render a\nhighly inflated type-I error rate of randomization tests, especially when the\nsample size is large. To handle this problem, we propose an iterative convex\nprogramming framework for randomization tests with inexactly matched datasets.\nUnder some commonly used regularity conditions, we show that our approach can\nproduce valid randomization tests (i.e., robustly controlling the type-I error\nrate) for any inexactly matched datasets, even when unobserved covariates\nexist. Our framework allows the incorporation of flexible machine learning\nmodels to better extract information from covariate imbalance while robustly\ncontrolling the type-I error rate."}, "http://arxiv.org/abs/2311.11236": {"title": "Generalized Linear Models via the Lasso: To Scale or Not to Scale?", "link": "http://arxiv.org/abs/2311.11236", "description": "The Lasso regression is a popular regularization method for feature selection\nin statistics. Prior to computing the Lasso estimator in both linear and\ngeneralized linear models, it is common to conduct a preliminary rescaling of\nthe feature matrix to ensure that all the features are standardized. Without\nthis standardization, it is argued, the Lasso estimate will unfortunately\ndepend on the units used to measure the features. We propose a new type of\niterative rescaling of the features in the context of generalized linear\nmodels. Whilst existing Lasso algorithms perform a single scaling as a\npreprocessing step, the proposed rescaling is applied iteratively throughout\nthe Lasso computation until convergence. 
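For contrast with the iterative rescaling proposed in the Lasso abstract above, a sketch of the conventional single pre-scaling it departs from: standardise the features once, then fit an L1-penalised GLM. The paper's iterative scheme itself is not reproduced here, and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.r_[1.0, 10.0, 100.0, np.ones(7)]  # mixed units
y = (X[:, 0] - 0.05 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Single preliminary standardisation followed by an L1-penalised logistic model,
# so the penalty treats all (rescaled) features comparably.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
model.fit(X, y)
print(model[-1].coef_)
```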
We provide numerical examples, with\nboth real and simulated data, illustrating that the proposed iterative\nrescaling can significantly improve the statistical performance of the Lasso\nestimator without incurring any significant additional computational cost."}, "http://arxiv.org/abs/2311.11256": {"title": "Bayesian Modeling of Incompatible Spatial Data: A Case Study Involving Post-Adrian Storm Forest Damage Assessment", "link": "http://arxiv.org/abs/2311.11256", "description": "Incompatible spatial data modeling is a pervasive challenge in remote sensing\ndata analysis that involves field data. Typical approaches to addressing this\nchallenge aggregate information to a coarser common scale, i.e., compatible\nresolutions. Such pre-processing aggregation to a common resolution simplifies\nanalysis, but potentially causes information loss and hence compromised\ninference and predictive performance. To incorporate finer information to\nenhance prediction performance, we develop a new Bayesian method aimed at\nimproving predictive accuracy and uncertainty quantification. The main\ncontribution of this work is an efficient algorithm that enables full Bayesian\ninference using finer resolution data while optimizing computational and\nstorage costs. The algorithm is developed and applied to a forest damage\nassessment for the 2018 Adrian storm in Carinthia, Austria, which uses field\ndata and high-resolution LiDAR measurements. Simulation studies demonstrate\nthat this approach substantially improves prediction accuracy and stability,\nproviding more reliable inference to support forest management decisions."}, "http://arxiv.org/abs/2311.11290": {"title": "Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias", "link": "http://arxiv.org/abs/2311.11290", "description": "Firth (1993, Biometrika) shows that the maximum Jeffreys' prior penalized\nlikelihood estimator in logistic regression has asymptotic bias decreasing with\nthe square of the number of observations when the number of parameters is\nfixed, which is an order faster than the typical rate from maximum likelihood.\nThe widespread use of that estimator in applied work is supported by the\nresults in Kosmidis and Firth (2021, Biometrika), who show that it takes finite\nvalues, even in cases where the maximum likelihood estimate does not exist.\nKosmidis and Firth (2021, Biometrika) also provide empirical evidence that the\nestimator has good bias properties in high-dimensional settings where the\nnumber of parameters grows asymptotically linearly but slower than the number\nof observations. We design and carry out a large-scale computer experiment\ncovering a wide range of such high-dimensional settings and produce strong\nempirical evidence for a simple rescaling of the maximum Jeffreys' prior\npenalized likelihood estimator that delivers high accuracy in signal recovery\nin the presence of an intercept parameter. The rescaled estimator is effective\neven in cases where estimates from maximum likelihood and other recently\nproposed corrective methods based on approximate message passing do not exist."}, "http://arxiv.org/abs/2311.11445": {"title": "Maximum likelihood inference for a class of discrete-time Markov-switching time series models with multiple delays", "link": "http://arxiv.org/abs/2311.11445", "description": "Autoregressive Markov switching (ARMS) time series models are used to\nrepresent real-world signals whose dynamics may change over time. 
They have\nfound application in many areas of the natural and social sciences, as well as\nin engineering. In general, inference in this kind of system involves two\nproblems: (a) detecting the number of distinct dynamical models that the signal\nmay adopt and (b) estimating any unknown parameters in these models. In this\npaper, we introduce a class of ARMS time series models that includes many\nsystems resulting from the discretisation of stochastic delay differential\nequations (DDEs). Remarkably, this class includes cases in which the\ndiscretisation time grid is not necessarily aligned with the delays of the DDE,\nresulting in discrete-time ARMS models with real (non-integer) delays. We\ndescribe methods for the maximum likelihood detection of the number of\ndynamical modes and the estimation of unknown parameters (including the\npossibly non-integer delays) and illustrate their application with an ARMS\nmodel of the El Ni\~no--Southern Oscillation (ENSO) phenomenon."}, "http://arxiv.org/abs/2311.11487": {"title": "Modeling Insurance Claims using Bayesian Nonparametric Regression", "link": "http://arxiv.org/abs/2311.11487", "description": "The prediction of future insurance claims based on observed risk factors, or\ncovariates, helps the actuary set insurance premiums. Typically, actuaries use\nparametric regression models to predict claims based on the covariate\ninformation. Such models assume the same functional form tying the response to\nthe covariates for each data point. These models are not flexible enough and\ncan fail to accurately capture, at the individual level, the relationship\nbetween the covariates and the claims frequency and severity, which are often\nmultimodal, highly skewed, and heavy-tailed. In this article, we explore the\nuse of Bayesian nonparametric (BNP) regression models to predict claims\nfrequency and severity based on covariates. In particular, we model claims\nfrequency as a mixture of Poisson regression, and the logarithm of claims\nseverity as a mixture of normal regression. We use the Dirichlet process (DP)\nand Pitman-Yor process (PY) as a prior for the mixing distribution over the\nregression parameters. Unlike parametric regression, such models allow each\ndata point to have its individual parameters, making them highly flexible,\nresulting in improved prediction accuracy. We describe model fitting using MCMC\nand illustrate their applicability using French motor insurance claims data."}, "http://arxiv.org/abs/2311.11522": {"title": "A Bayesian two-step multiple imputation approach based on mixed models for the missing in EMA data", "link": "http://arxiv.org/abs/2311.11522", "description": "Ecological Momentary Assessments (EMA) capture real-time thoughts and\nbehaviors in natural settings, producing rich longitudinal data for statistical\nand physiological analyses. However, the robustness of these analyses can be\ncompromised by the large amount of missing data in EMA data sets. To address this,\nmultiple imputation, a method that replaces missing values with several\nplausible alternatives, has become increasingly popular. In this paper, we\nintroduce a two-step Bayesian multiple imputation framework which leverages the\nconfiguration of mixed models. 
We adopt the Random Intercept Linear Mixed\nmodel, the Mixed-effect Location Scale model which accounts for subject\nvariance influenced by covariates and random effects, and the Shared Parameter\nLocation Scale Mixed Effect model which links the missing data to the response\nvariable through a random intercept logistic model, to complete the posterior\ndistribution within the framework. In the simulation study and an application\non data from a study on caregivers of dementia patients, we further adapt this\ntwo-step Bayesian multiple imputation strategy to handle simultaneous missing\nvariables in EMA data sets and compare the effectiveness of multiple\nimputations across different mixed models. The analyses highlight the\nadvantages of multiple imputations over single imputations. Furthermore, we\npropose two pivotal considerations in selecting the optimal mixed model for the\ntwo-step imputation: the influence of covariates as well as random effects on\nthe within-variance, and the nature of missing data in relation to the response\nvariable."}, "http://arxiv.org/abs/2311.11543": {"title": "A Comparison of Parameter Estimation Methods for Shared Frailty Models", "link": "http://arxiv.org/abs/2311.11543", "description": "This paper compares six different parameter estimation methods for shared\nfrailty models via a series of simulation studies. A shared frailty model is a\nsurvival model that incorporates a random effect term, where the frailties are\ncommon or shared among individuals within specific groups. Several parameter\nestimation methods are available for fitting shared frailty models, such as\npenalized partial likelihood (PPL), expectation-maximization (EM), pseudo full\nlikelihood (PFL), hierarchical likelihood (HL), maximum marginal likelihood\n(MML), and maximization penalized likelihood (MPL) algorithms. These estimation\nmethods are implemented in various R packages, providing researchers with\nvarious options for analyzing clustered survival data using shared frailty\nmodels. However, there is a limited amount of research comparing the\nperformance of these parameter estimation methods for fitting shared frailty\nmodels. Consequently, it can be challenging for users to determine the most\nappropriate method for analyzing clustered survival data. To address this gap,\nthis paper aims to conduct a series of simulation studies to compare the\nperformance of different parameter estimation methods implemented in R\npackages. We will evaluate several key aspects, including parameter estimation,\nbias and variance of the parameter estimates, rate of convergence, and\ncomputational time required by each package. Through this systematic\nevaluation, our goal is to provide a comprehensive understanding of the\nadvantages and limitations associated with each estimation method."}, "http://arxiv.org/abs/2311.11563": {"title": "Time-varying effect in the competing risks based on restricted mean time lost", "link": "http://arxiv.org/abs/2311.11563", "description": "Patients with breast cancer tend to die from other diseases, so for studies\nthat focus on breast cancer, a competing risks model is more appropriate.\nConsidering subdistribution hazard ratio, which is used often, limited to model\nassumptions and clinical interpretation, we aimed to quantify the effects of\nprognostic factors by an absolute indicator, the difference in restricted mean\ntime lost (RMTL), which is more intuitive. 
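The restricted mean time lost used above is the area under a cause-specific cumulative incidence function up to a horizon tau; a short NumPy sketch with hypothetical Aalen-Johansen-style estimates follows.

```python
import numpy as np

def rmtl(times, cif, tau):
    """Restricted mean time lost up to tau: the area under a right-continuous
    step CIF with jumps at `times` and post-jump values `cif`."""
    keep = times <= tau
    t = np.concatenate(([0.0], times[keep], [tau]))
    c = np.concatenate(([0.0], cif[keep]))   # the CIF is 0 before the first event time
    return float(np.sum(c * np.diff(t)))

# Hypothetical cumulative incidence estimates for two groups, horizon 5 years.
times = np.array([0.5, 1.2, 2.0, 3.5, 4.8])
cif_treated = np.array([0.02, 0.05, 0.09, 0.15, 0.20])
cif_control = np.array([0.03, 0.08, 0.14, 0.22, 0.28])
print("RMTL difference:", rmtl(times, cif_treated, 5.0) - rmtl(times, cif_control, 5.0))
```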
Additionally, prognostic factors may\nhave dynamic effects (time-varying effects) in long-term follow-up. However,\nexisting competing risks regression models only provide a static view of\ncovariate effects, leading to a distorted assessment of the prognostic factor.\nTo address this issue, we proposed a dynamic effect RMTL regression that can\nexplore the between-group cumulative difference in mean life lost over a period\nof time and obtain the real-time effect by the speed of accumulation, as well\nas personalized predictions on a time scale. Through Monte Carlo simulation, we\nvalidated the dynamic effects estimated by the proposed regression having low\nbias and a coverage rate of around 95%. Applying this model to an elderly\nearly-stage breast cancer cohort, we found that most factors had different\npatterns of dynamic effects, revealing meaningful physiological mechanisms\nunderlying diseases. Moreover, from the perspective of prediction, the mean\nC-index in external validation reached 0.78. Dynamic effect RMTL regression can\nanalyze both dynamic cumulative effects and real-time effects of covariates,\nproviding a more comprehensive prognosis and better prediction when competing\nrisks exist."}, "http://arxiv.org/abs/2311.11575": {"title": "Testing multivariate normality by testing independence", "link": "http://arxiv.org/abs/2311.11575", "description": "We propose a simple multivariate normality test based on Kac-Bernstein's\ncharacterization, which can be conducted by utilising existing statistical\nindependence tests for sums and differences of data samples. We also perform\nits empirical investigation, which reveals that for high-dimensional data, the\nproposed approach may be more efficient than the alternative ones. The\naccompanying code repository is provided at \\url{https://shorturl.at/rtuy5}."}, "http://arxiv.org/abs/2311.11657": {"title": "Minimax Two-Stage Gradient Boosting for Parameter Estimation", "link": "http://arxiv.org/abs/2311.11657", "description": "Parameter estimation is an important sub-field in statistics and system\nidentification. Various methods for parameter estimation have been proposed in\nthe literature, among which the Two-Stage (TS) approach is particularly\npromising, due to its ease of implementation and reliable estimates. Among the\ndifferent statistical frameworks used to derive TS estimators, the min-max\nframework is attractive due to its mild dependence on prior knowledge about the\nparameters to be estimated. However, the existing implementation of the minimax\nTS approach has currently limited applicability, due to its heavy computational\nload. In this paper, we overcome this difficulty by using a gradient boosting\nmachine (GBM) in the second stage of TS approach. We call the resulting\nalgorithm the Two-Stage Gradient Boosting Machine (TSGBM) estimator. Finally,\nwe test our proposed TSGBM estimator on several numerical examples including\nmodels of dynamical systems."}, "http://arxiv.org/abs/2311.11765": {"title": "Making accurate and interpretable treatment decisions for binary outcomes", "link": "http://arxiv.org/abs/2311.11765", "description": "Optimal treatment rules can improve health outcomes on average by assigning a\ntreatment associated with the most desirable outcome to each individual. Due to\nan unknown data generation mechanism, it is appealing to use flexible models to\nestimate these rules. However, such models often lead to complex and\nuninterpretable rules. 
In this article, we introduce an approach aimed at\nestimating optimal treatment rules that have higher accuracy, higher value, and\nlower loss from the same simple model family. We use a flexible model to\nestimate the optimal treatment rules and a simple model to derive interpretable\ntreatment rules. We provide an extensible definition of interpretability and\npresent a method that - given a class of simple models - can be used to select\na preferred model. We conduct a simulation study to evaluate the performance of\nour approach compared to treatment rules obtained by fitting the same simple\nmodel directly to observed data. The results show that our approach has lower\naverage loss, higher average outcome, and greater power in identifying\nindividuals who can benefit from the treatment. We apply our approach to derive\ntreatment rules of adjuvant chemotherapy in colon cancer patients using cancer\nregistry data. The results show that our approach has the potential to improve\ntreatment decisions."}, "http://arxiv.org/abs/2311.11852": {"title": "Statistical Prediction of Peaks Over a Threshold", "link": "http://arxiv.org/abs/2311.11852", "description": "In many applied fields it is desired to make predictions with the aim of\nassessing the plausibility of more severe events than those already recorded to\nsafeguard against calamities that have not yet occurred. This problem can be\nanalysed using extreme value theory. We consider the popular peaks over a\nthreshold method and show that the generalised Pareto approximation of the true\npredictive densities of a future unobservable excess or peak random\nvariable can be very accurate. We propose both a frequentist and a Bayesian\napproach for the estimation of such predictive densities. We show the\nasymptotic accuracy of the corresponding estimators and, more importantly,\nprove that the resulting predictive inference is asymptotically reliable. We\nshow the utility of the proposed predictive tools by analysing extreme\ntemperatures in Milan, Italy."}, "http://arxiv.org/abs/2311.11922": {"title": "Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix", "link": "http://arxiv.org/abs/2311.11922", "description": "Surrogate index approaches have recently become a popular method of\nestimating longer-term impact from shorter-term outcomes. In this paper, we\nleverage 1098 test arms from 200 A/B tests at Netflix to empirically\ninvestigate to what degree decisions made using a surrogate index\nutilizing 14 days of data would align with those made using direct measurement\nof day 63 treatment effects. Focusing specifically on linear \"auto-surrogate\"\nmodels that utilize the shorter-term observations of the long-term outcome of\ninterest, we find that the statistical inferences that we would draw from using\nthe surrogate index are ~95% consistent with those from directly measuring the\nlong-term treatment effect. Moreover, when we restrict ourselves to the set of\ntests that would be \"launched\" (i.e. 
positive and statistically significant)\nbased on the 63-day directly measured treatment effects, we find that relying\ninstead on the surrogate index achieves 79% and 65% recall."}, "http://arxiv.org/abs/2311.12016": {"title": "Bayesian Semiparametric Estimation of Heterogeneous Effects in Matched Case-Control Studies with an Application to Alzheimer's Disease and Heat", "link": "http://arxiv.org/abs/2311.12016", "description": "Epidemiological approaches for examining human health responses to\nenvironmental exposures in observational studies often control for confounding\nby implementing clever matching schemes and using statistical methods based on\nconditional likelihood. Nonparametric regression models have surged in\npopularity in recent years as a tool for estimating individual-level\nheterogeneous effects, which provide a more detailed picture of the\nexposure-response relationship but can also be aggregated to obtain improved\nmarginal estimates at the population level. In this work we incorporate\nBayesian additive regression trees (BART) into the conditional logistic\nregression model to identify heterogeneous effects of environmental exposures\nin a case-crossover design. Conditional logistic BART (CL-BART) utilizes\nreversible jump Markov chain Monte Carlo to bypass the conditional conjugacy\nrequirement of the original BART algorithm. Our work is motivated by the\ngrowing interest in identifying subpopulations more vulnerable to environmental\nexposures. We apply CL-BART to a study of the impact of heatwaves on people\nwith Alzheimer's Disease in California and effect modification by other chronic\nconditions. Through this application, we also describe strategies to examine\nheterogeneous odds ratios through variable importance, partial dependence, and\nlower-dimensional summaries. CL-BART is available in the clbart R package."}, "http://arxiv.org/abs/2311.12020": {"title": "A Heterogeneous Spatial Model for Soil Carbon Mapping of the Contiguous United States Using VNIR Spectra", "link": "http://arxiv.org/abs/2311.12020", "description": "The Rapid Carbon Assessment, conducted by the U.S. Department of Agriculture,\nwas implemented in order to obtain a representative sample of soil organic\ncarbon across the contiguous United States. In conjunction with a statistical\nmodel, the dataset allows for mapping of soil carbon prediction across the\nU.S., however there are two primary challenges to such an effort. First, there\nexists a large degree of heterogeneity in the data, whereby both the first and\nsecond moments of the data generating process seem to vary both spatially and\nfor different land-use categories. Second, the majority of the sampled\nlocations do not actually have lab measured values for soil organic carbon.\nRather, visible and near-infrared (VNIR) spectra were measured at most\nlocations, which act as a proxy to help predict carbon content. Thus, we\ndevelop a heterogeneous model to analyze this data that allows both the mean\nand the variance to vary as a function of space as well as land-use category,\nwhile incorporating VNIR spectra as covariates. After a cross-validation study\nthat establishes the effectiveness of the model, we construct a complete map of\nsoil organic carbon for the contiguous U.S. 
along with uncertainty\nquantification."}, "http://arxiv.org/abs/1909.10678": {"title": "A Bayesian Approach to Directed Acyclic Graphs with a Candidate Graph", "link": "http://arxiv.org/abs/1909.10678", "description": "Directed acyclic graphs represent the dependence structure among variables.\nWhen learning these graphs from data, different amounts of information may be\navailable for different edges. Although many methods have been developed to\nlearn the topology of these graphs, most of them do not provide a measure of\nuncertainty in the inference. We propose a Bayesian method, baycn (BAYesian\nCausal Network), to estimate the posterior probability of three states for each\nedge: present with one direction ($X \\rightarrow Y$), present with the opposite\ndirection ($X \\leftarrow Y$), and absent. Unlike existing Bayesian methods, our\nmethod requires that the prior probabilities of these states be specified, and\ntherefore provides a benchmark for interpreting the posterior probabilities. We\ndevelop a fast Metropolis-Hastings Markov chain Monte Carlo algorithm for the\ninference. Our algorithm takes as input the edges of a candidate graph, which\nmay be the output of another graph inference method and may contain false\nedges. In simulation studies our method achieves high accuracy with small\nvariation across different scenarios and is comparable or better than existing\nBayesian methods. We apply baycn to genomic data to distinguish the direct and\nindirect targets of genetic variants."}, "http://arxiv.org/abs/2007.05748": {"title": "Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-As-Model", "link": "http://arxiv.org/abs/2007.05748", "description": "Note: Published now as a chapter in \"Handbook of the History and Philosophy\nof Mathematical Practice\" (Springer Nature, editor B. Sriraman,\nhttps://doi.org/10.1007/978-3-030-19071-2_105-1).\n\nThe application of mathematical probability theory in statistics is quite\ncontroversial. Controversies regard both the interpretation of probability, and\napproaches to statistical inference. After having given an overview of the main\napproaches, I will propose a re-interpretation of frequentist probability. Most\nstatisticians are aware that probability models interpreted in a frequentist\nmanner are not really true in objective reality, but only idealisations. I\nargue that this is often ignored when actually applying frequentist methods and\ninterpreting the results, and that keeping up the awareness for the essential\ndifference between reality and models can lead to a more appropriate use and\ninterpretation of frequentist models and methods, called\n\"frequentism-as-model\". This is elaborated showing connections to existing\nwork, appreciating the special role of independently and identically\ndistributed observations and subject matter knowledge, giving an account of how\nand under what conditions models that are not true can be useful, giving\ndetailed interpretations of tests and confidence intervals, confronting their\nimplicit compatibility logic with the inverse probability logic of Bayesian\ninference, re-interpreting the role of model assumptions, appreciating\nrobustness, and the role of \"interpretative equivalence\" of models. 
Epistemic\nprobability shares the issue that its models are only idealisations, and an\nanalogous \"epistemic-probability-as-model\" can also be developed."}, "http://arxiv.org/abs/2008.09434": {"title": "Correcting a Nonparametric Two-sample Graph Hypothesis Test for Graphs with Different Numbers of Vertices with Applications to Connectomics", "link": "http://arxiv.org/abs/2008.09434", "description": "Random graphs are statistical models that have many applications, ranging\nfrom neuroscience to social network analysis. Of particular interest in some\napplications is the problem of testing two random graphs for equality of\ngenerating distributions. Tang et al. (2017) propose a test for this setting.\nThis test consists of embedding the graph into a low-dimensional space via the\nadjacency spectral embedding (ASE) and subsequently using a kernel two-sample\ntest based on the maximum mean discrepancy. However, if the two graphs being\ncompared have an unequal number of vertices, the test of Tang et al. (2017) may\nnot be valid. We demonstrate the intuition behind this invalidity and propose a\ncorrection that makes any subsequent kernel- or distance-based test valid. Our\nmethod relies on sampling based on the asymptotic distribution for the ASE. We\ncall these altered embeddings the corrected adjacency spectral embeddings\n(CASE). We also show that CASE remedies the exchangeability problem of the\noriginal test and demonstrate the validity and consistency of the test that\nuses CASE via a simulation study. Lastly, we apply our proposed test to the\nproblem of determining equivalence of generating distributions in human\nconnectomes extracted from diffusion magnetic resonance imaging (dMRI) at\ndifferent scales."}, "http://arxiv.org/abs/2008.10055": {"title": "Multiple Network Embedding for Anomaly Detection in Time Series of Graphs", "link": "http://arxiv.org/abs/2008.10055", "description": "This paper considers the graph signal processing problem of anomaly detection\nin time series of graphs. We examine two related, complementary inference\ntasks: the detection of anomalous graphs within a time series, and the\ndetection of temporally anomalous vertices. We approach these tasks via the\nadaptation of statistically principled methods for joint graph inference,\nspecifically \\emph{multiple adjacency spectral embedding} (MASE). We\ndemonstrate that our method is effective for our inference tasks. Moreover, we\nassess the performance of our method in terms of the underlying nature of\ndetectable anomalies. We further provide the theoretical justification for our\nmethod and insight into its use. Applied to the Enron communication graph and a\nlarge-scale commercial search engine time series of graphs, our approaches\ndemonstrate their applicability and identify the anomalous vertices beyond just\nlarge degree change."}, "http://arxiv.org/abs/2011.06127": {"title": "Generalized Kernel Two-Sample Tests", "link": "http://arxiv.org/abs/2011.06127", "description": "Kernel two-sample tests have been widely used for multivariate data to test\nequality of distributions. However, existing tests based on mapping\ndistributions into a reproducing kernel Hilbert space mainly target specific\nalternatives and do not work well for some scenarios when the dimension of the\ndata is moderate to high due to the curse of dimensionality. 
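The maximum mean discrepancy mentioned in the two-sample testing abstracts above has a standard unbiased estimator; the sketch below computes it with a Gaussian kernel and is the classical statistic, not the new one proposed in the generalized kernel two-sample paper.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth):
    """Unbiased estimate of the squared MMD between samples X and Y under a
    Gaussian kernel with the given bandwidth."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * bandwidth**2))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop the diagonal terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 20))
Y = rng.normal(0.3, 1.0, size=(100, 20))   # mean shift in every coordinate
print(mmd2_unbiased(X, Y, bandwidth=np.sqrt(20.0)))
```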
We propose a new\ntest statistic that makes use of a common pattern under moderate and high\ndimensions and achieves substantial power improvements over existing kernel\ntwo-sample tests for a wide range of alternatives. We also propose alternative\ntesting procedures that maintain high power with low computational cost,\noffering easy off-the-shelf tools for large datasets. The new approaches are\ncompared to other state-of-the-art tests under various settings and show good\nperformance. We showcase the new approaches through two applications: The\ncomparison of musks and non-musks using the shape of molecules, and the\ncomparison of taxi trips starting from John F. Kennedy airport in consecutive\nmonths. All proposed methods are implemented in an R package kerTests."}, "http://arxiv.org/abs/2108.00306": {"title": "A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions", "link": "http://arxiv.org/abs/2108.00306", "description": "With advances in scientific computing and mathematical modeling, complex\nscientific phenomena such as galaxy formations and rocket propulsion can now be\nreliably simulated. Such simulations can however be very time-intensive,\nrequiring millions of CPU hours to perform. One solution is multi-fidelity\nemulation, which uses data of different fidelities to train an efficient\npredictive model which emulates the expensive simulator. For complex scientific\nproblems and with careful elicitation from scientists, such multi-fidelity data\nmay often be linked by a directed acyclic graph (DAG) representing its\nscientific model dependencies. We thus propose a new Graphical Multi-fidelity\nGaussian Process (GMGP) model, which embeds this DAG structure (capturing\nscientific dependencies) within a Gaussian process framework. We show that the\nGMGP has desirable modeling traits via two Markov properties, and admits a\nscalable algorithm for recursive computation of the posterior mean and variance\nalong at each depth level of the DAG. We also present a novel experimental\ndesign methodology over the DAG given an experimental budget, and propose a\nnonlinear extension of the GMGP via deep Gaussian processes. The advantages of\nthe GMGP are then demonstrated via a suite of numerical experiments and an\napplication to emulation of heavy-ion collisions, which can be used to study\nthe conditions of matter in the Universe shortly after the Big Bang. The\nproposed model has broader uses in data fusion applications with graphical\nstructure, which we further discuss."}, "http://arxiv.org/abs/2202.06188": {"title": "Testing the number of common factors by bootstrapped sample covariance matrix in high-dimensional factor models", "link": "http://arxiv.org/abs/2202.06188", "description": "This paper studies the impact of bootstrap procedure on the eigenvalue\ndistributions of the sample covariance matrix under a high-dimensional factor\nstructure. We provide asymptotic distributions for the top eigenvalues of\nbootstrapped sample covariance matrix under mild conditions. After bootstrap,\nthe spiked eigenvalues which are driven by common factors will converge weakly\nto Gaussian limits after proper scaling and centralization. However, the\nlargest non-spiked eigenvalue is mainly determined by the order statistics of\nthe bootstrap resampling weights, and follows extreme value distribution. Based\non the disparate behavior of the spiked and non-spiked eigenvalues, we propose\ninnovative methods to test the number of common factors. 
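A toy illustration of the ingredients described in the bootstrapped-eigenvalue abstract above: simulate a two-factor model, then resample the rows and track the leading eigenvalues of the recomputed sample covariance matrix. The loadings and the plain nonparametric bootstrap are assumptions, and no formal test of the number of factors is constructed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 300, 100, 2                         # observations, dimension, true factors
F = rng.normal(size=(n, k))
Lam = rng.normal(size=(p, k)) * 3.0           # strong loadings -> spiked eigenvalues
X = F @ Lam.T + rng.normal(size=(n, p))

def top_eigs(Z, m=5):
    return np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1][:m]

# Resample observations with replacement and recompute the leading eigenvalues.
boot = np.array([top_eigs(X[rng.integers(0, n, n)]) for _ in range(200)])
print("observed:      ", np.round(top_eigs(X), 1))
print("bootstrap mean:", np.round(boot.mean(axis=0), 1))
print("bootstrap sd:  ", np.round(boot.std(axis=0), 2))
```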
As indicated by extensive\nnumerical and empirical studies, the proposed methods perform reliably and\nconvincingly in the presence of both weak factors and cross-sectionally\ncorrelated errors. Our technical details contribute to random matrix theory on\nspiked covariance models with convexly decaying density and unbounded support,\nor with general elliptical distributions."}, "http://arxiv.org/abs/2203.04689": {"title": "Tensor Completion for Causal Inference with Multivariate Longitudinal Data: A Reevaluation of COVID-19 Mandates", "link": "http://arxiv.org/abs/2203.04689", "description": "We propose a new method that uses tensor completion to estimate causal\neffects with multivariate longitudinal data, data in which multiple outcomes\nare observed for each unit and time period. Our motivation is to estimate the\nnumber of COVID-19 fatalities prevented by government mandates such as travel\nrestrictions, mask-wearing directives, and vaccination requirements. In\naddition to COVID-19 fatalities, we observe related outcomes such as the number\nof fatalities from other diseases and injuries. The proposed method arranges\nthe data as a tensor with three dimensions (unit, time, and outcome) and uses\ntensor completion to impute the missing counterfactual outcomes. We first prove\nthat under general conditions, combining multiple outcomes using the proposed\nmethod improves the accuracy of counterfactual imputations. We then compare the\nproposed method to other approaches commonly used to evaluate COVID-19\nmandates. Our main finding is that other approaches overestimate the effect of\nmask-wearing directives and that mask-wearing directives were not an\neffective alternative to travel restrictions. We conclude that while the\nproposed method can be applied whenever multivariate longitudinal data are\navailable, we believe it is particularly timely as governments increasingly\nrely on longitudinal data to choose among policies such as mandates during\npublic health emergencies."}, "http://arxiv.org/abs/2203.10775": {"title": "Modified Method of Moments for Generalized Laplace Distribution", "link": "http://arxiv.org/abs/2203.10775", "description": "In this note, we consider the performance of the classic method of moments\nfor parameter estimation of symmetric variance-gamma (generalized Laplace)\ndistributions. We do this through both theoretical analysis (multivariate delta\nmethod) and a comprehensive simulation study with comparison to maximum\nlikelihood estimation, finding performance is often unsatisfactory. In\naddition, we modify the method of moments by taking absolute moments to improve\nefficiency; in particular, our simulation studies demonstrate that our modified\nestimators have significantly improved performance for parameter values\ntypically encountered in financial modelling, and are also competitive with\nmaximum likelihood estimation."}, "http://arxiv.org/abs/2208.14989": {"title": "Learning Multiscale Non-stationary Causal Structures", "link": "http://arxiv.org/abs/2208.14989", "description": "This paper addresses a gap in the current state of the art by providing a\nsolution for modeling causal relationships that evolve over time and occur at\ndifferent time scales. Specifically, we introduce the multiscale non-stationary\ndirected acyclic graph (MN-DAG), a framework for modeling multivariate time\nseries data. Our contribution is twofold. 
Firstly, we expose a probabilistic\ngenerative model by leveraging results from spectral and causality theories.\nOur model allows sampling an MN-DAG according to user-specified priors on the\ntime-dependence and multiscale properties of the causal graph. Secondly, we\ndevise a Bayesian method named Multiscale Non-stationary Causal Structure\nLearner (MN-CASTLE) that uses stochastic variational inference to estimate\nMN-DAGs. The method also exploits information from the local partial\ncorrelation between time series over different time resolutions. The data\ngenerated from an MN-DAG reproduces well-known features of time series in\ndifferent domains, such as volatility clustering and serial correlation.\nAdditionally, we show the superior performance of MN-CASTLE on synthetic data\nwith different multiscale and non-stationary properties compared to baseline\nmodels. Finally, we apply MN-CASTLE to identify the drivers of the natural gas\nprices in the US market. Causal relationships have strengthened during the\nCOVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline\nmethods fail to capture. MN-CASTLE identifies the causal impact of critical\neconomic drivers on natural gas prices, such as seasonal factors, economic\nuncertainty, oil prices, and gas storage deviations."}, "http://arxiv.org/abs/2301.06333": {"title": "Functional concurrent regression with compositional covariates and its application to the time-varying effect of causes of death on human longevity", "link": "http://arxiv.org/abs/2301.06333", "description": "Multivariate functional data that are cross-sectionally compositional data\nare attracting increasing interest in the statistical modeling literature, a\nmajor example being trajectories over time of compositions derived from\ncause-specific mortality rates. In this work, we develop a novel functional\nconcurrent regression model in which independent variables are functional\ncompositions. This allows us to investigate the relationship over time between\nlife expectancy at birth and compositions derived from cause-specific mortality\nrates of four distinct age classes, namely 0--4, 5--39, 40--64 and 65+ in 25\ncountries. A penalized approach is developed to estimate the regression\ncoefficients and select the relevant variables. Then an efficient computational\nstrategy based on an augmented Lagrangian algorithm is derived to solve the\nresulting optimization problem. The good performances of the model in\npredicting the response function and estimating the unknown functional\ncoefficients are shown in a simulation study. The results on real data confirm\nthe important role of neoplasms and cardiovascular diseases in determining life\nexpectancy emerged in other studies and reveal several other contributions not\nyet observed."}, "http://arxiv.org/abs/2305.00961": {"title": "DIF Analysis with Unknown Groups and Anchor Items", "link": "http://arxiv.org/abs/2305.00961", "description": "Ensuring fairness in instruments like survey questionnaires or educational\ntests is crucial. One way to address this is by a Differential Item Functioning\n(DIF) analysis, which examines if different subgroups respond differently to a\nparticular item, controlling for their overall latent construct level. DIF\nanalysis is typically conducted to assess measurement invariance at the item\nlevel. 
Traditional DIF analysis methods require knowing the comparison groups\n(reference and focal groups) and anchor items (a subset of DIF-free items).\nSuch prior knowledge may not always be available, and psychometric methods have\nbeen proposed for DIF analysis when one piece of information is unknown. More\nspecifically, when the comparison groups are unknown while anchor items are\nknown, latent DIF analysis methods have been proposed that estimate the unknown\ngroups by latent classes. When anchor items are unknown while comparison groups\nare known, methods have also been proposed, typically under a sparsity\nassumption -- the number of DIF items is not too large. However, DIF analysis\nwhen both pieces of information are unknown has not received much attention.\nThis paper proposes a general statistical framework under this setting. In the\nproposed framework, we model the unknown groups by latent classes and introduce\nitem-specific DIF parameters to capture the DIF effects. Assuming the number of\nDIF items is relatively small, an $L_1$-regularised estimator is proposed to\nsimultaneously identify the latent classes and the DIF items. A computationally\nefficient Expectation-Maximisation (EM) algorithm is developed to solve the\nnon-smooth optimisation problem for the regularised estimator. The performance\nof the proposed method is evaluated by simulation studies and an application to\nitem response data from a real-world educational test."}, "http://arxiv.org/abs/2306.14302": {"title": "Improved LM Test for Robust Model Specification Searches in Covariance Structure Analysis", "link": "http://arxiv.org/abs/2306.14302", "description": "Model specification searches and modifications are commonly employed in\ncovariance structure analysis (CSA) or structural equation modeling (SEM) to\nimprove the goodness-of-fit. However, these practices can be susceptible to\ncapitalizing on chance, as a model that fits one sample may not generalize to\nanother sample from the same population. This paper introduces the improved\nLagrange Multipliers (LM) test, which provides a reliable method for conducting\na thorough model specification search and effectively identifying missing\nparameters. By leveraging the stepwise bootstrap method in the standard LM and\nWald tests, our data-driven approach enhances the accuracy of parameter\nidentification. The results from Monte Carlo simulations and two empirical\napplications in political science demonstrate the effectiveness of the improved\nLM test, particularly when dealing with small sample sizes and models with\nlarge degrees of freedom. This approach contributes to better statistical fit\nand addresses the issue of capitalization on chance in model specification."}, "http://arxiv.org/abs/2308.13069": {"title": "The diachronic Bayesian", "link": "http://arxiv.org/abs/2308.13069", "description": "It is well known that a Bayesian probability forecast for the future\nobservations should form a probability measure in order to satisfy a natural\ncondition of coherence. The topic of this paper is the evolution of the\nBayesian probability measure over time. We model the process of updating the\nBayesian's beliefs in terms of prediction markets. 
The resulting picture is\nadapted to forecasting several steps ahead and making almost optimal decisions."}, "http://arxiv.org/abs/2311.12214": {"title": "Random Fourier Signature Features", "link": "http://arxiv.org/abs/2311.12214", "description": "Tensor algebras give rise to one of the most powerful measures of similarity\nfor sequences of arbitrary length called the signature kernel accompanied with\nattractive theoretical guarantees from stochastic analysis. Previous algorithms\nto compute the signature kernel scale quadratically in terms of the length and\nthe number of the sequences. To mitigate this severe computational bottleneck,\nwe develop a random Fourier feature-based acceleration of the signature kernel\nacting on the inherently non-Euclidean domain of sequences. We show uniform\napproximation guarantees for the proposed unbiased estimator of the signature\nkernel, while keeping its computation linear in the sequence length and number.\nIn addition, combined with recent advances on tensor projections, we derive two\neven more scalable time series features with favourable concentration\nproperties and computational complexity both in time and memory. Our empirical\nresults show that the reduction in computational cost comes at a negligible\nprice in terms of accuracy on moderate-sized datasets, and it enables one to\nscale to large datasets up to a million time series."}, "http://arxiv.org/abs/2311.12252": {"title": "Sensitivity analysis with multiple treatments and multiple outcomes with applications to air pollution mixtures", "link": "http://arxiv.org/abs/2311.12252", "description": "Understanding the health impacts of air pollution is vital in public health\nresearch. Numerous studies have estimated negative health effects of a variety\nof pollutants, but accurately gauging these impacts remains challenging due to\nthe potential for unmeasured confounding bias that is ubiquitous in\nobservational studies. In this study, we develop a framework for sensitivity\nanalysis in settings with both multiple treatments and multiple outcomes\nsimultaneously. This setting is of particular interest because one can identify\nthe strength of association between the unmeasured confounders and both the\ntreatment and outcome, under a factor confounding assumption. This provides\ninformative bounds on the causal effect leading to partial identification\nregions for the effects of multivariate treatments that account for the maximum\npossible bias from unmeasured confounding. We also show that when negative\ncontrols are available, we are able to refine the partial identification\nregions substantially, and in certain cases, even identify the causal effect in\nthe presence of unmeasured confounding. We derive partial identification\nregions for general estimands in this setting, and develop a novel\ncomputational approach to finding these regions."}, "http://arxiv.org/abs/2311.12293": {"title": "Sample size calculation based on the difference in restricted mean time lost for clinical trials with competing risks", "link": "http://arxiv.org/abs/2311.12293", "description": "Computation of sample size is important when designing clinical trials. The\npresence of competing risks makes the design of clinical trials with\ntime-to-event endpoints cumbersome. A model based on the subdistribution hazard\nratio (SHR) is commonly used for trials under competing risks. However, this\napproach has some limitations related to model assumptions and clinical\ninterpretation. 
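The random Fourier feature idea behind the signature-kernel acceleration described above, shown in its standard form for a Gaussian kernel on plain vectors; the signature-kernel construction itself is not reproduced, so this is only the underlying approximation.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """Features z(x) with z(x) @ z(y) approximating the Gaussian kernel
    exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 10))
Z = random_fourier_features(X, n_features=2000, bandwidth=2.0, rng=rng)
approx = Z @ Z.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2.0 * 2.0**2))
print(np.abs(approx - exact).max())   # small approximation error
```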
Considering such limitations, the difference in restricted mean\ntime lost (RMTLd) is recommended as an alternative indicator. In this paper, we\npropose a sample size calculation method based on the RMTLd for the Weibull\ndistribution (RMTLdWeibull) for clinical trials, which considers experimental\nconditions such as equal allocation, uniform accrual, uniform loss to\nfollow-up, and administrative censoring. Simulation results show that sample\nsize calculation based on the RMTLdWeibull can generally achieve a predefined\npower level and maintain relative robustness. Moreover, the performance of the\nsample size calculation based on the RMTLdWeibull is similar or superior to\nthat based on the SHR. Even if the event time does not follow the Weibull\ndistribution, the sample size calculation based on the RMTLdWeibull still\nperforms well. The results also verify the performance of the sample size\ncalculation method based on the RMTLdWeibull. From the perspective of the\nresults of this study, clinical interpretation, application conditions and\nstatistical performance, we recommend that when designing clinical trials in\nthe presence of competing risks, the RMTLd indicator be applied for sample size\ncalculation and subsequent effect size measurement."}, "http://arxiv.org/abs/2311.12347": {"title": "Bayesian Cluster Geographically Weighted Regression for Spatial Heterogeneous Data", "link": "http://arxiv.org/abs/2311.12347", "description": "Spatial statistical models are commonly used in geographical scenarios to\nensure spatial variation is captured effectively. However, spatial models and\ncluster algorithms can be complicated and expensive. This paper pursues three\nmain objectives. First, it introduces covariate effect clustering by\nintegrating a Bayesian Geographically Weighted Regression (BGWR) with a\nGaussian mixture model and the Dirichlet process mixture model. Second, this\npaper examines situations in which a particular covariate holds significant\nimportance in one region but not in another in the Bayesian framework. Lastly,\nit addresses computational challenges present in existing BGWR, leading to\nnotable enhancements in Markov chain Monte Carlo estimation suitable for large\nspatial datasets. The efficacy of the proposed method is demonstrated using\nsimulated data and is further validated in a case study examining children's\ndevelopment domains in Queensland, Australia, using data provided by Children's\nHealth Queensland and Australia's Early Development Census."}, "http://arxiv.org/abs/2311.12380": {"title": "A multivariate adaptation of direct kernel estimator of density ratio", "link": "http://arxiv.org/abs/2311.12380", "description": "\\'Cwik and Mielniczuk (1989) introduced a univariate kernel density ratio\nestimator, which directly estimates the ratio without estimating the two\ndensities of interest. This study presents its straightforward multivariate\nadaptation."}, "http://arxiv.org/abs/2311.12392": {"title": "Individualized Dynamic Latent Factor Model with Application to Mobile Health Data", "link": "http://arxiv.org/abs/2311.12392", "description": "Mobile health has emerged as a major success in tracking individual health\nstatus, due to the popularity and power of smartphones and wearable devices.\nThis has also brought great challenges in handling heterogeneous,\nmulti-resolution data which arise ubiquitously in mobile health due to\nirregular multivariate measurements collected from individuals. 
In this paper,\nwe propose an individualized dynamic latent factor model for irregular\nmulti-resolution time series data to interpolate unsampled measurements of time\nseries with low resolution. One major advantage of the proposed method is the\ncapability to integrate multiple irregular time series and multiple subjects by\nmapping the multi-resolution data to the latent space. In addition, the\nproposed individualized dynamic latent factor model is applicable to capturing\nheterogeneous longitudinal information through individualized dynamic latent\nfactors. In theory, we provide the integrated interpolation error bound of the\nproposed estimator and derive the convergence rate with B-spline approximation\nmethods. Both the simulation studies and the application to smartwatch data\ndemonstrate the superior performance of the proposed method compared to\nexisting methods."}, "http://arxiv.org/abs/2311.12452": {"title": "Multi-indication evidence synthesis in oncology health technology assessment", "link": "http://arxiv.org/abs/2311.12452", "description": "Background: Cancer drugs receive licensing extensions to include additional\nindications as trial evidence on treatment effectiveness accumulates. We\ninvestigate how sharing information across indications can strengthen the\ninferences supporting Health Technology Assessment (HTA). Methods: We applied\nmeta-analytic methods to randomised trial data on bevacizumab to share\ninformation across cancer indications on the treatment effect on overall\nsurvival (OS) or progression-free survival (PFS), and on the surrogate\nrelationship between effects on PFS and OS. Common or random parameters were\nused to facilitate sharing and the further flexibility of mixture models was\nexplored. Results: OS treatment effects lacked precision when pooling data\navailable at present-day within each indication, particularly for indications\nwith few trials. There was no suggestion of heterogeneity across indications.\nSharing information across indications provided more precise inferences on\ntreatment effects, and on surrogacy parameters, with the strength of sharing\ndepending on the model. When a surrogate relationship was used to predict OS\neffects, uncertainty was only reduced with sharing imposed on PFS effects in\naddition to surrogacy parameters. Corresponding analyses using the earlier,\nsparser evidence available for particular HTAs showed that sharing on both\nsurrogacy and PFS effects did not notably reduce uncertainty in OS predictions.\nLimited heterogeneity across indications meant that the added flexibility of\nmixture models was unnecessary. Conclusions: Meta-analysis methods can be\nusefully applied to share information on treatment effectiveness across\nindications to increase the precision of target indication estimates in HTA.\nSharing on surrogate relationships requires caution, as meaningful precision\ngains require larger bodies of evidence and clear support for surrogacy from\nother indications."}, "http://arxiv.org/abs/2311.12597": {"title": "Optimal Functional Bilinear Regression with Two-way Functional Covariates via Reproducing Kernel Hilbert Space", "link": "http://arxiv.org/abs/2311.12597", "description": "Traditional functional linear regression usually takes a one-dimensional\nfunctional predictor as input and estimates the continuous coefficient\nfunction. Modern applications often generate two-dimensional covariates, which\nbecome matrices when observed at grid points. 
To avoid the inefficiency of the\nclassical method involving estimation of a two-dimensional coefficient\nfunction, we propose a functional bilinear regression model, and introduce an\ninnovative three-term penalty to impose roughness penalty in the estimation.\nThe proposed estimator exhibits minimax optimal property for prediction under\nthe framework of reproducing kernel Hilbert space. An iterative generalized\ncross-validation approach is developed to choose tuning parameters, which\nsignificantly improves the computational efficiency over the traditional\ncross-validation approach. The statistical and computational advantages of the\nproposed method over existing methods are further demonstrated via simulated\nexperiments, the Canadian weather data, and a biochemical long-range infrared\nlight detection and ranging data."}, "http://arxiv.org/abs/2311.12685": {"title": "A Graphical Comparison of Screening Designs using Support Recovery Probabilities", "link": "http://arxiv.org/abs/2311.12685", "description": "A screening experiment attempts to identify a subset of important effects\nusing a relatively small number of experimental runs. Given the limited run\nsize and a large number of possible effects, penalized regression is a popular\ntool used to analyze screening designs. In particular, an automated\nimplementation of the Gauss-Dantzig selector has been widely recommended to\ncompare screening design construction methods. Here, we illustrate potential\nreproducibility issues that arise when comparing screening designs via\nsimulation, and recommend a graphical method, based on screening probabilities,\nwhich compares designs by evaluating them along the penalized regression\nsolution path. This method can be implemented using simulation, or, in the case\nof lasso, by using exact local lasso sign recovery probabilities. Our approach\ncircumvents the need to specify tuning parameters associated with\nregularization methods, leading to more reliable design comparisons. This\narticle contains supplementary materials including code to implement the\nproposed methods."}, "http://arxiv.org/abs/2311.12717": {"title": "Phylogenetic least squares estimation without genetic distances", "link": "http://arxiv.org/abs/2311.12717", "description": "Least squares estimation of phylogenies is an established family of methods\nwith good statistical properties. State-of-the-art least squares phylogenetic\nestimation proceeds by first estimating a distance matrix, which is then used\nto determine the phylogeny by minimizing a squared-error loss function. Here,\nwe develop a method for least squares phylogenetic inference that does not rely\non a pre-estimated distance matrix. Our approach allows us to circumvent the\ntypical need to first estimate a distance matrix by forming a new loss function\ninspired by the phylogenetic likelihood score function; in this manner,\ninference is not based on a summary statistic of the sequence data, but\ndirectly on the sequence data itself. We use a Jukes-Cantor substitution model\nto show that our method leads to improvements over ordinary least squares\nphylogenetic inference, and is even observed to rival maximum likelihood\nestimation in terms of topology estimation efficiency. Using a Kimura\n2-parameter model, we show that our method also allows for estimation of the\nglobal transition/transversion ratio simultaneously with the phylogeny and its\nbranch lengths. This is impossible to accomplish with any other distance-based\nmethod as far as we know. 
Our developments pave the way for more optimal\nphylogenetic inference under the least squares framework, particularly in\nsettings under which likelihood-based inference is infeasible, including when\none desires to build a phylogeny based on information provided by only a subset\nof all possible nucleotide substitutions such as synonymous or non-synonymous\nsubstitutions."}, "http://arxiv.org/abs/2311.12726": {"title": "Nonparametric variable importance for time-to-event outcomes with application to prediction of HIV infection", "link": "http://arxiv.org/abs/2311.12726", "description": "In survival analysis, complex machine learning algorithms have been\nincreasingly used for predictive modeling. Given a collection of features\navailable for inclusion in a predictive model, it may be of interest to\nquantify the relative importance of a subset of features for the prediction\ntask at hand. In particular, in HIV vaccine trials, participant baseline\ncharacteristics are used to predict the probability of infection over the\nintended follow-up period, and investigators may wish to understand how much\ncertain types of predictors, such as behavioral factors, contribute toward\noverall predictiveness. Time-to-event outcomes such as time to infection are\noften subject to right censoring, and existing methods for assessing variable\nimportance are typically not intended to be used in this setting. We describe a\nbroad class of algorithm-agnostic variable importance measures for prediction\nin the context of survival data. We propose a nonparametric efficient\nestimation procedure that incorporates flexible learning of nuisance\nparameters, yields asymptotically valid inference, and enjoys\ndouble-robustness. We assess the performance of our proposed procedure via\nnumerical simulations and analyze data from the HVTN 702 study to inform\nenrollment strategies for future HIV vaccine trials."}, "http://arxiv.org/abs/2202.12989": {"title": "Flexible variable selection in the presence of missing data", "link": "http://arxiv.org/abs/2202.12989", "description": "In many applications, it is of interest to identify a parsimonious set of\nfeatures, or panel, from multiple candidates that achieves a desired level of\nperformance in predicting a response. This task is often complicated in\npractice by missing data arising from the sampling design or other random\nmechanisms. Most recent work on variable selection in missing data contexts\nrelies in some part on a finite-dimensional statistical model, e.g., a\ngeneralized or penalized linear model. In cases where this model is\nmisspecified, the selected variables may not all be truly scientifically\nrelevant and can result in panels with suboptimal classification performance.\nTo address this limitation, we propose a nonparametric variable selection\nalgorithm combined with multiple imputation to develop flexible panels in the\npresence of missing-at-random data. We outline strategies based on the proposed\nalgorithm that achieve control of commonly used error rates. Through\nsimulations, we show that our proposal has good operating characteristics and\nresults in panels with higher classification and variable selection performance\ncompared to several existing penalized regression approaches in cases where a\ngeneralized linear model is misspecified. 
Finally, we use the proposed method\nto develop biomarker panels for separating pancreatic cysts with differing\nmalignancy potential in a setting where complicated missingness in the\nbiomarkers arose due to limited specimen volumes."}, "http://arxiv.org/abs/2206.12773": {"title": "Scalable and optimal Bayesian inference for sparse covariance matrices via screened beta-mixture prior", "link": "http://arxiv.org/abs/2206.12773", "description": "In this paper, we propose a scalable Bayesian method for sparse covariance\nmatrix estimation by incorporating a continuous shrinkage prior with a\nscreening procedure. In the first step of the procedure, the off-diagonal\nelements with small correlations are screened based on their sample\ncorrelations. In the second step, the posterior of the covariance with the\nscreened elements fixed at $0$ is computed with the beta-mixture prior. The\nscreened elements of the covariance significantly increase the efficiency of\nthe posterior computation. The simulation studies and real data applications\nshow that the proposed method can be used for the high-dimensional problem with\nthe `large p, small n'. In some examples in this paper, the proposed method can\nbe computed in a reasonable amount of time, while no other existing Bayesian\nmethods can be. The proposed method also has sound theoretical properties. The\nscreening procedure has the sure screening property and the selection\nconsistency, and the posterior has the optimal minimax or nearly minimax\nconvergence rate under the Frobenius norm."}, "http://arxiv.org/abs/2209.05474": {"title": "Consistent Selection of the Number of Groups in Panel Models via Sample-Splitting", "link": "http://arxiv.org/abs/2209.05474", "description": "Group number selection is a key question for group panel data modelling. In\nthis work, we develop a cross validation method to tackle this problem.\nSpecifically, we split the panel data into a training dataset and a testing\ndataset on the time span. We first use the training dataset to estimate the\nparameters and group memberships. We then apply the fitted model to the testing\ndataset, and the group number is estimated by minimizing certain loss\nfunction values on the testing dataset. We design the loss functions for panel\ndata models either with or without fixed effects. The proposed method has two\nadvantages. First, the method is totally data-driven, thus no further tuning\nparameters are involved. Second, the method can be flexibly applied to a wide\nrange of panel data models. Theoretically, we establish the estimation\nconsistency by taking advantage of the optimization property of the estimation\nalgorithm. Experiments on a variety of synthetic and empirical datasets are\ncarried out to further illustrate the advantages of the proposed method."}, "http://arxiv.org/abs/2301.07276": {"title": "Data thinning for convolution-closed distributions", "link": "http://arxiv.org/abs/2301.07276", "description": "We propose data thinning, an approach for splitting an observation into two\nor more independent parts that sum to the original observation, and that follow\nthe same distribution as the original observation, up to a (known) scaling of a\nparameter. This very general proposal is applicable to any convolution-closed\ndistribution, a class that includes the Gaussian, Poisson, negative binomial,\ngamma, and binomial distributions, among others. Data thinning has a number of\napplications to model selection, evaluation, and inference. 
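As a side note on the data-thinning entry above, here is a minimal sketch of the idea for the Poisson case, one of the convolution-closed families listed there. The function name `poisson_thin` and the split fraction `epsilon` are illustrative choices, not taken from the authors' software.

```python
# Minimal sketch of data thinning for a Poisson observation: X ~ Poisson(lam)
# is split into two independent parts that sum to X and are themselves Poisson
# with scaled means, as described in the data-thinning entry above.
import numpy as np

rng = np.random.default_rng(0)

def poisson_thin(x, epsilon, rng):
    """Split a Poisson count x into (x1, x2) with x1 + x2 == x.

    If x ~ Poisson(lam), then x1 ~ Poisson(epsilon * lam),
    x2 ~ Poisson((1 - epsilon) * lam), and x1, x2 are independent.
    """
    x1 = rng.binomial(x, epsilon)  # binomial thinning of the count
    return x1, x - x1

# Toy check: empirical means of the two parts track the scaled parameters.
lam, epsilon, n = 10.0, 0.3, 100_000
x = rng.poisson(lam, size=n)
x1, x2 = poisson_thin(x, epsilon, rng)
print(x1.mean(), x2.mean())       # approx 3.0 and 7.0
print(np.corrcoef(x1, x2)[0, 1])  # approx 0, consistent with independence
```

The same additive split applies, with different thinning mechanisms, to the other convolution-closed families mentioned in that entry.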
For instance,\ncross-validation via data thinning provides an attractive alternative to the\nusual approach of cross-validation via sample splitting, especially in settings\nin which the latter is not applicable. In simulations and in an application to\nsingle-cell RNA-sequencing data, we show that data thinning can be used to\nvalidate the results of unsupervised learning approaches, such as k-means\nclustering and principal components analysis, for which traditional sample\nsplitting is unattractive or unavailable."}, "http://arxiv.org/abs/2303.12401": {"title": "Real-time forecasting within soccer matches through a Bayesian lens", "link": "http://arxiv.org/abs/2303.12401", "description": "This paper employs a Bayesian methodology to predict the results of soccer\nmatches in real-time. Using sequential data of various events throughout the\nmatch, we utilize a multinomial probit regression in a novel framework to\nestimate the time-varying impact of covariates and to forecast the outcome.\nEnglish Premier League data from eight seasons are used to evaluate the\nefficacy of our method. Different evaluation metrics establish that the\nproposed model outperforms potential competitors inspired by existing\nstatistical or machine learning algorithms. Additionally, we apply robustness\nchecks to demonstrate the model's accuracy across various scenarios."}, "http://arxiv.org/abs/2305.10524": {"title": "Dynamic Matrix Recovery", "link": "http://arxiv.org/abs/2305.10524", "description": "Matrix recovery from sparse observations is an extensively studied topic\nemerging in various applications, such as recommendation system and signal\nprocessing, which includes the matrix completion and compressed sensing models\nas special cases. In this work we propose a general framework for dynamic\nmatrix recovery of low-rank matrices that evolve smoothly over time. We start\nfrom the setting that the observations are independent across time, then extend\nto the setting that both the design matrix and noise possess certain temporal\ncorrelation via modified concentration inequalities. By pooling neighboring\nobservations, we obtain sharp estimation error bounds of both settings, showing\nthe influence of the underlying smoothness, the dependence and effective\nsamples. We propose a dynamic fast iterative shrinkage thresholding algorithm\nthat is computationally efficient, and characterize the interplay between\nalgorithmic and statistical convergence. Simulated and real data examples are\nprovided to support such findings."}, "http://arxiv.org/abs/2306.07047": {"title": "Foundations of Causal Discovery on Groups of Variables", "link": "http://arxiv.org/abs/2306.07047", "description": "Discovering causal relationships from observational data is a challenging\ntask that relies on assumptions connecting statistical quantities to graphical\nor algebraic causal models. In this work, we focus on widely employed\nassumptions for causal discovery when objects of interest are (multivariate)\ngroups of random variables rather than individual (univariate) random\nvariables, as is the case in a variety of problems in scientific domains such\nas climate science or neuroscience. If the group-level causal models are\nderived from partitioning a micro-level model into groups, we explore the\nrelationship between micro and group-level causal discovery assumptions. We\ninvestigate the conditions under which assumptions like Causal Faithfulness\nhold or fail to hold. 
Our analysis encompasses graphical causal models that\ncontain cycles and bidirected edges. We also discuss grouped time series causal\ngraphs and variants thereof as special cases of our general theoretical\nframework. Thereby, we aim to provide researchers with a solid theoretical\nfoundation for the development and application of causal discovery methods for\nvariable groups."}, "http://arxiv.org/abs/2309.01721": {"title": "Direct and Indirect Treatment Effects in the Presence of Semi-Competing Risks", "link": "http://arxiv.org/abs/2309.01721", "description": "Semi-competing risks refer to the phenomenon that the terminal event (such as\ndeath) can truncate the non-terminal event (such as disease progression) but\nnot vice versa. The treatment effect on the terminal event can be delivered\neither directly following the treatment or indirectly through the non-terminal\nevent. We consider two strategies to decompose the total effect into a direct\neffect and an indirect effect under the framework of mediation analysis, by\nadjusting the prevalence and hazard of non-terminal events, respectively. They\nrequire slightly different assumptions on cross-world quantities to achieve\nidentifiability. We establish asymptotic properties for the estimated\ncounterfactual cumulative incidences and decomposed treatment effects. Through\nsimulation studies and real-data applications we illustrate the subtle\ndifference between these two decompositions."}, "http://arxiv.org/abs/2311.12825": {"title": "A PSO Based Method to Generate Actionable Counterfactuals for High Dimensional Data", "link": "http://arxiv.org/abs/2311.12825", "description": "Counterfactual explanations (CFE) are methods that explain a machine learning\nmodel by giving an alternate class prediction of a data point with some minimal\nchanges in its features. It helps the users to identify their data attributes\nthat caused an undesirable prediction like a loan or credit card rejection. We\ndescribe an efficient and an actionable counterfactual (CF) generation method\nbased on particle swarm optimization (PSO). We propose a simple objective\nfunction for the optimization of the instance-centric CF generation problem.\nThe PSO brings in a lot of flexibility in terms of carrying out multi-objective\noptimization in large dimensions, capability for multiple CF generation, and\nsetting box constraints or immutability of data attributes. An algorithm is\nproposed that incorporates these features and it enables greater control over\nthe proximity and sparsity properties over the generated CFs. The proposed\nalgorithm is evaluated with a set of action-ability metrics in real-world\ndatasets, and the results were superior compared to that of the\nstate-of-the-arts."}, "http://arxiv.org/abs/2311.12978": {"title": "Physics-Informed Priors with Application to Boundary Layer Velocity", "link": "http://arxiv.org/abs/2311.12978", "description": "One of the most popular recent areas of machine learning predicates the use\nof neural networks augmented by information about the underlying process in the\nform of Partial Differential Equations (PDEs). These physics-informed neural\nnetworks are obtained by penalizing the inference with a PDE, and have been\ncast as a minimization problem currently lacking a formal approach to quantify\nthe uncertainty. In this work, we propose a novel model-based framework which\nregards the PDE as a prior information of a deep Bayesian neural network. 
The\nprior is calibrated without data to resemble the PDE solution in the prior\nmean, while our degree of confidence in the PDE relative to the data is\nexpressed in terms of the prior variance. The information embedded in the PDE\nis then propagated to the posterior, yielding physics-informed forecasts with\nuncertainty quantification. We apply our approach to a simulated viscous fluid\nand to experimentally-obtained turbulent boundary layer velocity in a wind\ntunnel using an appropriately simplified Navier-Stokes equation. Our approach\nrequires very few observations to produce physically-consistent forecasts as\nopposed to non-physical forecasts stemming from non-informed priors, thereby\nallowing forecasting of complex systems where some amount of data as well as some\ncontextual knowledge is available."}, "http://arxiv.org/abs/2311.13017": {"title": "W-kernel and essential subspace for frequentist's evaluation of Bayesian estimators", "link": "http://arxiv.org/abs/2311.13017", "description": "The posterior covariance matrix W defined by the log-likelihood of each\nobservation plays important roles both in the sensitivity analysis and\nfrequentist's evaluation of the Bayesian estimators. This study focused on the\nmatrix W and its principal space; we term the latter an essential subspace.\nFirst, it is shown that they appear in various statistical settings, such as\nthe evaluation of the posterior sensitivity, assessment of the frequentist's\nuncertainty from posterior samples, and stochastic expansion of the loss; a key\ntool to treat frequentist's properties is the recently proposed Bayesian\ninfinitesimal jackknife approximation (Giordano and Broderick (2023)). In the\nfollowing part, we show that the matrix W can be interpreted as a reproducing\nkernel; it is named the W-kernel. Using the W-kernel, the essential subspace is\nexpressed as a principal space given by the kernel PCA. A relation to the\nFisher kernel and neural tangent kernel is established, which elucidates the\nconnection to the classical asymptotic theory; it also leads to a sort of\nBayesian-frequentist's duality. Finally, two applications, selection of a\nrepresentative set of observations and dimensional reduction in the approximate\nbootstrap, are discussed. In the former, incomplete Cholesky decomposition is\nintroduced as an efficient method to compute the essential subspace. In the\nlatter, different implementations of the approximate bootstrap for posterior\nmeans are compared."}, "http://arxiv.org/abs/2311.13048": {"title": "Weighted composite likelihood for linear mixed models in complex samples", "link": "http://arxiv.org/abs/2311.13048", "description": "Fitting mixed models to complex survey data is a challenging problem. Most\nmethods in the literature, including the most widely used one, require a close\nrelationship between the model structure and the survey design. In this paper\nwe present methods for fitting arbitrary mixed models to data from arbitrary\nsurvey designs. We support this with an implementation that allows for\nmultilevel linear models and multistage designs without any assumptions about\nnesting of model and design, and that also allows for correlation structures\nsuch as those resulting from genetic relatedness. 
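For the W-kernel entry above, a hedged sketch of one way to form a posterior covariance matrix W of per-observation log-likelihoods from MCMC output and extract its principal ("essential") subspace. This is my reading of the abstract, not the authors' implementation; `loglik_draws` is an assumed input.

```python
# Sketch: W as the posterior covariance of per-observation log-likelihoods,
# and its leading eigen-directions as a stand-in for the essential subspace.
import numpy as np

def essential_subspace(loglik_draws, k=2):
    """loglik_draws: (S, n) array, entry (s, i) = log-likelihood of
    observation i evaluated at posterior draw s.
    Returns the top-k eigenvalues/eigenvectors of W = Cov over draws."""
    centered = loglik_draws - loglik_draws.mean(axis=0, keepdims=True)
    W = centered.T @ centered / (loglik_draws.shape[0] - 1)  # n x n covariance
    eigvals, eigvecs = np.linalg.eigh(W)                     # ascending order
    return eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]

# Toy usage with synthetic "draws" for n = 5 observations, S = 1000 draws.
rng = np.random.default_rng(1)
ll = rng.normal(size=(1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
vals, vecs = essential_subspace(ll, k=2)
print(vals)  # dominated by the largest-variance directions
```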
The estimation and inference\napproach uses weighted pairwise (composite) likelihood."}, "http://arxiv.org/abs/2311.13131": {"title": "Pair circulas modelling for multivariate circular time series", "link": "http://arxiv.org/abs/2311.13131", "description": "Modelling multivariate circular time series is considered. The\ncross-sectional and serial dependence is described by circulas, which are\nanalogs of copulas for circular distributions. In order to obtain a simple\nexpression of the dependence structure, we decompose a multivariate circula\ndensity to a product of several pair circula densities. Moreover, to reduce the\nnumber of pair circula densities, we consider strictly stationary multi-order\nMarkov processes. The real data analysis, in which the proposed model is fitted\nto multivariate time series wind direction data is also given."}, "http://arxiv.org/abs/2311.13196": {"title": "Optimal Time of Arrival Estimation for MIMO Backscatter Channels", "link": "http://arxiv.org/abs/2311.13196", "description": "In this paper, we propose a novel time of arrival (TOA) estimator for\nmultiple-input-multiple-output (MIMO) backscatter channels in closed form. The\nproposed estimator refines the estimation precision from the topological\nstructure of the MIMO backscatter channels, and can considerably enhance the\nestimation accuracy. Particularly, we show that for the general $M \\times N$\nbistatic topology, the mean square error (MSE) is $\\frac{M+N-1}{MN}\\sigma^2_0$,\nand for the general $M \\times M$ monostatic topology, it is\n$\\frac{2M-1}{M^2}\\sigma^2_0$ for the diagonal subchannels, and\n$\\frac{M-1}{M^2}\\sigma^2_0$ for the off-diagonal subchannels, where\n$\\sigma^2_0$ is the MSE of the conventional least square estimator. In\naddition, we derive the Cramer-Rao lower bound (CRLB) for MIMO backscatter TOA\nestimation which indicates that the proposed estimator is optimal. Simulation\nresults verify that the proposed TOA estimator can considerably improve both\nestimation and positioning accuracy, especially when the MIMO scale is large."}, "http://arxiv.org/abs/2311.13202": {"title": "Robust Multi-Model Subset Selection", "link": "http://arxiv.org/abs/2311.13202", "description": "Modern datasets in biology and chemistry are often characterized by the\npresence of a large number of variables and outlying samples due to measurement\nerrors or rare biological and chemical profiles. To handle the characteristics\nof such datasets we introduce a method to learn a robust ensemble comprised of\na small number of sparse, diverse and robust models, the first of its kind in\nthe literature. The degree to which the models are sparse, diverse and\nresistant to data contamination is driven directly by the data based on a\ncross-validation criterion. We establish the finite-sample breakdown of the\nensembles and the models that comprise them, and we develop a tailored\ncomputing algorithm to learn the ensembles by leveraging recent developments in\nl0 optimization. Our extensive numerical experiments on synthetic and\nartificially contaminated real datasets from genomics and cheminformatics\ndemonstrate the competitive advantage of our method over state-of-the-art\nsparse and robust methods. 
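To make the MSE expressions quoted in the MIMO backscatter TOA entry above concrete, the short tabulation below evaluates them relative to the least-squares baseline $\sigma^2_0 = 1$. It only restates the formulas given in that entry; it is not the estimator itself.

```python
# Tabulate the stated MSE expressions for bistatic and monostatic topologies,
# normalized by the least-squares MSE sigma0^2 = 1.
def bistatic_mse(M, N, sigma2=1.0):
    return (M + N - 1) / (M * N) * sigma2

def monostatic_mse(M, sigma2=1.0):
    diag = (2 * M - 1) / M**2 * sigma2       # diagonal subchannels
    off_diag = (M - 1) / M**2 * sigma2       # off-diagonal subchannels
    return diag, off_diag

for M, N in [(2, 2), (4, 4), (8, 8)]:
    print(M, N, round(bistatic_mse(M, N), 4), monostatic_mse(M))
```

As the table suggests, the relative gain over the least-squares baseline grows with the array sizes M and N.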
We also demonstrate the applicability of our\nproposal on a cardiac allograft vasculopathy dataset."}, "http://arxiv.org/abs/2311.13247": {"title": "A projected nonlinear state-space model for forecasting time series signals", "link": "http://arxiv.org/abs/2311.13247", "description": "Learning and forecasting stochastic time series is essential in various\nscientific fields. However, despite the proposals of nonlinear filters and\ndeep-learning methods, it remains challenging to capture nonlinear dynamics\nfrom a few noisy samples and predict future trajectories with uncertainty\nestimates while maintaining computational efficiency. Here, we propose a fast\nalgorithm to learn and forecast nonlinear dynamics from noisy time series data.\nA key feature of the proposed model is kernel functions applied to projected\nlines, enabling fast and efficient capture of nonlinearities in the latent\ndynamics. Through empirical case studies and benchmarking, the model\ndemonstrates its effectiveness in learning and forecasting complex nonlinear\ndynamics, offering a valuable tool for researchers and practitioners in time\nseries analysis."}, "http://arxiv.org/abs/2311.13291": {"title": "Robust Functional Regression with Discretely Sampled Predictors", "link": "http://arxiv.org/abs/2311.13291", "description": "The functional linear model is an important extension of the classical\nregression model allowing for scalar responses to be modeled as functions of\nstochastic processes. Yet, despite the usefulness and popularity of the\nfunctional linear model in recent years, most treatments, theoretical and\npractical alike, suffer either from (i) lack of resistance towards the many\ntypes of anomalies one may encounter with functional data or (ii) biases\nresulting from the use of discretely sampled functional data instead of\ncompletely observed data. To address these deficiencies, this paper introduces\nand studies the first class of robust functional regression estimators for\npartially observed functional data. The proposed broad class of estimators is\nbased on thin-plate splines with a novel computationally efficient quadratic\npenalty, is easily implementable and enjoys good theoretical properties under\nweak assumptions. We show that, in the incomplete data setting, both the sample\nsize and discretization error of the processes determine the asymptotic rate of\nconvergence of functional regression estimators and the latter cannot be\nignored. These theoretical properties remain valid even with multi-dimensional\nrandom fields acting as predictors and random smoothing parameters. The\neffectiveness of the proposed class of estimators in practice is demonstrated\nby means of a simulation study and a real-data example."}, "http://arxiv.org/abs/2311.13347": {"title": "Loss-based Objective and Penalizing Priors for Model Selection Problems", "link": "http://arxiv.org/abs/2311.13347", "description": "Many Bayesian model selection problems, such as variable selection or cluster\nanalysis, start by setting prior model probabilities on a structured model\nspace. Based on a chosen loss function between models, model selection is often\nperformed with a Bayes estimator that minimizes the posterior expected loss.\nThe prior model probabilities and the choice of loss both highly affect the\nmodel selection results, especially for data with small sample sizes, and their\nproper calibration and careful reflection of no prior model preference are\ncrucial in objective Bayesian analysis. 
We propose risk equilibrium priors as\nan objective choice for prior model probabilities that only depend on the model\nspace and the choice of loss. Under the risk equilibrium priors, the Bayes\naction becomes indifferent before observing data, and the family of the risk\nequilibrium priors includes existing popular objective priors in Bayesian\nvariable selection problems. We generalize the result to the elicitation of\nobjective priors for Bayesian cluster analysis with Binder's loss. We also\npropose risk penalization priors, where the Bayes action chooses the simplest\nmodel before seeing data. The concept of risk equilibrium and penalization\npriors allows us to interpret prior properties in light of the effect of loss\nfunctions, and also provides new insight into the sensitivity of Bayes\nestimators under the same prior but different loss. We illustrate the proposed\nconcepts with variable selection simulation studies and cluster analysis on a\ngalaxy dataset."}, "http://arxiv.org/abs/2311.13410": {"title": "Towards Sensitivity Analysis: A Workflow", "link": "http://arxiv.org/abs/2311.13410", "description": "Establishing causal claims is one of the primary endeavors in sociological\nresearch. Statistical causal inference is a promising way to achieve this\nthrough the potential outcome framework or structural causal models, which are\nbased on a set of identification assumptions. However, identification\nassumptions are often not fully discussed in practice, which harms the validity\nof causal claims. In this article, we focus on the unconfoundedness assumption,\nwhich assumes no unmeasured confounders in the models and is often violated in\npractice. This article reviews a set of papers in two leading sociological\njournals to check the practice of causal inference and relevant identification\nassumptions, indicating the lack of discussion of sensitivity analysis methods\nfor unconfoundedness in practice. We then build a blueprint for conducting\nsensitivity analysis on unconfoundedness, consisting of six steps that guide\npractical choices in sensitivity analysis to evaluate the impacts\nof unmeasured confounders."}, "http://arxiv.org/abs/2311.13481": {"title": "Synergizing Roughness Penalization and Basis Selection in Bayesian Spline Regression", "link": "http://arxiv.org/abs/2311.13481", "description": "Bayesian P-splines and basis determination through Bayesian model selection\nare both commonly employed strategies for nonparametric regression using spline\nbasis expansions within the Bayesian framework. Although both methods are\nwidely employed, they each have particular limitations that may introduce\npotential estimation bias depending on the nature of the target function. To\novercome the limitations associated with each method while capitalizing on\ntheir respective strengths, we propose a new prior distribution that integrates\nthe essentials of both approaches. The proposed prior distribution assesses the\ncomplexity of the spline model based on a penalty term formed by a convex\ncombination of the penalties from both methods. The proposed method exhibits\nadaptability to the unknown level of smoothness while achieving the\nminimax-optimal posterior contraction rate up to a logarithmic factor. We\nprovide an efficient Markov chain Monte Carlo algorithm for implementing the\nproposed approach. Our extensive simulation study reveals that the proposed\nmethod outperforms other competitors in terms of performance metrics or model\ncomplexity. 
An application to a real dataset substantiates the validity of our\nproposed approach."}, "http://arxiv.org/abs/2311.13556": {"title": "Universally Optimal Multivariate Crossover Designs", "link": "http://arxiv.org/abs/2311.13556", "description": "In this article, universally optimal multivariate crossover designs are\nstudied. The multiple response crossover design is motivated by a $3 \\times 3$\ncrossover setup, where the effect of $3$ doses of an oral drug are studied on\ngene expressions related to mucosal inflammation. Subjects are assigned to\nthree treatment sequences and response measurements on 5 different gene\nexpressions are taken from each subject in each of the $3$ time periods. To\nmodel multiple or $g$ responses, where $g>1$, in a crossover setup, a\nmultivariate fixed effect model with both direct and carryover treatment\neffects is considered. It is assumed that there are non zero within response\ncorrelations, while between response correlations are taken to be zero. The\ninformation matrix corresponding to the direct effects is obtained and some\nresults are studied. The information matrix in the multivariate case is shown\nto differ from the univariate case, particularly in the completely symmetric\nproperty. For the $g>1$ case, with $t$ treatments and $p$ periods, for $p=t\n\\geq 3$, the design represented by a Type $I$ orthogonal array of strength $2$\nis proved to be universally optimal over the class of binary designs, for the\ndirect treatment effects."}, "http://arxiv.org/abs/2311.13595": {"title": "Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein", "link": "http://arxiv.org/abs/2311.13595", "description": "Feature alignment methods are used in many scientific disciplines for data\npooling, annotation, and comparison. As an instance of a permutation learning\nproblem, feature alignment presents significant statistical and computational\nchallenges. In this work, we propose the covariance alignment model to study\nand compare various alignment methods and establish a minimax lower bound for\ncovariance alignment that has a non-standard dimension scaling because of the\npresence of a nuisance parameter. This lower bound is in fact minimax optimal\nand is achieved by a natural quasi MLE. However, this estimator involves a\nsearch over all permutations which is computationally infeasible even when the\nproblem has moderate size. To overcome this limitation, we show that the\ncelebrated Gromov-Wasserstein algorithm from optimal transport which is more\namenable to fast implementation even on large-scale problems is also minimax\noptimal. These results give the first statistical justification for the\ndeployment of the Gromov-Wasserstein algorithm in practice."}, "http://arxiv.org/abs/2107.06093": {"title": "A generalized hypothesis test for community structure in networks", "link": "http://arxiv.org/abs/2107.06093", "description": "Researchers theorize that many real-world networks exhibit community\nstructure where within-community edges are more likely than between-community\nedges. While numerous methods exist to cluster nodes into different\ncommunities, less work has addressed this question: given some network, does it\nexhibit statistically meaningful community structure? We answer this question\nin a principled manner by framing it as a statistical hypothesis test in terms\nof a general and model-agnostic community structure parameter. 
Leveraging this\nparameter, we propose a simple and interpretable test statistic used to\nformulate two separate hypothesis testing frameworks. The first is an\nasymptotic test against a baseline value of the parameter while the second\ntests against a baseline model using bootstrap-based thresholds. We prove\ntheoretical properties of these tests and demonstrate how the proposed method\nyields rich insights into real-world data sets."}, "http://arxiv.org/abs/2205.08030": {"title": "Interpretable sensitivity analysis for the Baron-Kenny approach to mediation with unmeasured confounding", "link": "http://arxiv.org/abs/2205.08030", "description": "Mediation analysis assesses the extent to which the exposure affects the\noutcome indirectly through a mediator and the extent to which it operates\ndirectly through other pathways. As the most popular method in empirical\nmediation analysis, the Baron-Kenny approach estimates the indirect and direct\neffects of the exposure on the outcome based on linear structural equation\nmodels. However, when the exposure and the mediator are not randomized, the\nestimates may be biased due to unmeasured confounding among the exposure,\nmediator, and outcome. Building on Cinelli and Hazlett (2020), we derive\ngeneral omitted-variable bias formulas in linear regressions with vector\nresponses and regressors. We then use the formulas to develop a sensitivity\nanalysis method for the Baron-Kenny approach to mediation in the presence of\nunmeasured confounding. To ensure interpretability, we express the sensitivity\nparameters to correspond to the natural factorization of the joint distribution\nof the direct acyclic graph for mediation analysis. They measure the partial\ncorrelation between the unmeasured confounder and the exposure, mediator,\noutcome, respectively. With the sensitivity parameters, we propose a novel\nmeasure called the \"robustness value for mediation\" or simply the \"robustness\nvalue\", to assess the robustness of results based on the Baron-Kenny approach\nwith respect to unmeasured confounding. Intuitively, the robustness value\nmeasures the minimum value of the maximum proportion of variability explained\nby the unmeasured confounding, for the exposure, mediator and outcome, to\noverturn the results of the point estimate or confidence interval for the\ndirect and indirect effects. Importantly, we prove that all our sensitivity\nbounds are attainable and thus sharp."}, "http://arxiv.org/abs/2208.10106": {"title": "Statistics did not prove that the Huanan Seafood Wholesale Market was the early epicenter of the COVID-19 pandemic", "link": "http://arxiv.org/abs/2208.10106", "description": "In a recent prominent study Worobey et al.\\ (2022, Science, 377, pp.\\ 951--9)\npurported to demonstrate statistically that the Huanan Seafood Wholesale Market\nwas the epicenter of the early COVID-19 epidemic. We show that this statistical\nconclusion is invalid on two grounds: (1) The assumption that a centroid of\nearly case locations or another simply constructed point is the origin of an\nepidemic is unproved. (2) A Monte Carlo test used to conclude that no other\nlocation than the seafood market can be the origin is flawed. 
Hence, the\nquestion of the origin of the pandemic has not been answered by their\nstatistical analysis."}, "http://arxiv.org/abs/2209.10433": {"title": "Arithmetic Average Density Fusion -- Part II: Unified Derivation for Unlabeled and Labeled RFS Fusion", "link": "http://arxiv.org/abs/2209.10433", "description": "As a fundamental information fusion approach, the arithmetic average (AA)\nfusion has recently been investigated for various random finite set (RFS)\nfilter fusion in the context of multi-sensor multi-target tracking. It is not a\nstraightforward extension of the ordinary density-AA fusion to the RFS\ndistribution but has to preserve the form of the fusing multi-target density.\nIn this work, we first propose a statistical concept, probability hypothesis\ndensity (PHD) consistency, and explain how it can be achieved by the PHD-AA\nfusion and lead to more accurate and robust detection and localization of the\npresent targets. This provides a theoretically sound and technically\nmeaningful reason for performing inter-filter PHD AA-fusion/consensus, while\npreserving the form of the fusing RFS filter. Then, we derive and analyze the\nproper AA fusion formulations for most existing unlabeled/labeled RFS filters\nbased on the (labeled) PHD-AA/consistency. These derivations are theoretically\nunified, exact, require no approximation, and greatly facilitate heterogeneous unlabeled\nand labeled RFS density fusion, which is demonstrated separately in two\nsubsequent companion papers."}, "http://arxiv.org/abs/2209.14846": {"title": "Factor Modeling of a High-Dimensional Matrix-Variate and Statistical Learning for Matrix-Valued Sequences", "link": "http://arxiv.org/abs/2209.14846", "description": "We propose a new matrix factor model, named RaDFaM, the latent structure of\nwhich is strictly derived based on a hierarchical rank decomposition of a\nmatrix. Hierarchy is in the sense that all basis vectors of the column space of\neach multiplier matrix are assumed to follow the structure of a vector factor model.\nCompared to the most commonly used matrix factor model that takes the latent\nstructure of a bilinear form, RaDFaM involves additional row-wise and\ncolumn-wise matrix latent factors. This yields modest dimension reduction and\nstronger signal intensity from the viewpoint of tensor subspace learning, though\nit poses challenges for the estimation procedure and concomitant inferential theory\nfor a collection of matrix-valued observations. We develop a class of\nestimation procedures that makes use of the separable covariance structure under\nRaDFaM and approximate least squares, and establish its superiority in terms\nof the peak signal-to-noise ratio. 
We also establish the asymptotic theory when\nthe matrix-valued observations are uncorrelated or weakly correlated.\nNumerically, in terms of image/matrix reconstruction, supervised learning, and\nso forth, we demonstrate the excellent performance of RaDFaM through two\nmatrix-valued sequence datasets of independent 2D images and multinational\nmacroeconomic indices time series, respectively."}, "http://arxiv.org/abs/2210.14745": {"title": "Identifying Counterfactual Queries with the R Package cfid", "link": "http://arxiv.org/abs/2210.14745", "description": "In the framework of structural causal models, counterfactual queries describe\nevents that concern multiple alternative states of the system under study.\nCounterfactual queries often take the form of \"what if\" type questions such as\n\"would an applicant have been hired if they had over 10 years of experience,\nwhen in reality they only had 5 years of experience?\" Such questions and\ncounterfactual inference in general are crucial, for example when addressing\nthe problem of fairness in decision-making. Because counterfactual events\ncontain contradictory states of the world, it is impossible to conduct a\nrandomized experiment to address them without making several restrictive\nassumptions. However, it is sometimes possible to identify such queries from\nobservational and experimental data by representing the system under study as a\ncausal model, and the available data as symbolic probability distributions.\nShpitser and Pearl (2007) constructed two algorithms, called ID* and IDC*, for\nidentifying counterfactual queries and conditional counterfactual queries,\nrespectively. These two algorithms are analogous to the ID and IDC algorithms\nby Shpitser and Pearl (2006) for identification of interventional\ndistributions, which were implemented in R by Tikka and Karvanen (2017) in the\ncausaleffect package. We present the R package cfid that implements the ID* and\nIDC* algorithms. Identification of counterfactual queries and the features of\ncfid are demonstrated via examples."}, "http://arxiv.org/abs/2210.17000": {"title": "Ensemble transport smoothing", "link": "http://arxiv.org/abs/2210.17000", "description": "Smoothers are algorithms for Bayesian time series re-analysis. Most\noperational smoothers rely either on affine Kalman-type transformations or on\nsequential importance sampling. These strategies occupy opposite ends of a\nspectrum that trades computational efficiency and scalability for statistical\ngenerality and consistency: non-Gaussianity renders affine Kalman updates\ninconsistent with the true Bayesian solution, while the ensemble size required\nfor successful importance sampling can be prohibitive. This paper revisits the\nsmoothing problem from the perspective of measure transport, which offers the\nprospect of consistent prior-to-posterior transformations for Bayesian\ninference. We leverage this capacity by proposing a general ensemble framework\nfor transport-based smoothing. Within this framework, we derive a comprehensive\nset of smoothing recursions based on nonlinear transport maps and detail how\nthey exploit the structure of state-space models in fully non-Gaussian\nsettings. We also describe how many standard Kalman-type smoothing algorithms\nemerge as special cases of our framework. 
A companion paper (Ramgraber et al.,\n2023) explores the implementation of nonlinear ensemble transport smoothers in\ngreater depth."}, "http://arxiv.org/abs/2210.17435": {"title": "Ensemble transport smoothing", "link": "http://arxiv.org/abs/2210.17435", "description": "Smoothing is a specialized form of Bayesian inference for state-space models\nthat characterizes the posterior distribution of a collection of states given\nan associated sequence of observations. Ramgraber et al. (2023) proposes a\ngeneral framework for transport-based ensemble smoothing, which includes linear\nKalman-type smoothers as special cases. Here, we build on this foundation to\nrealize and demonstrate nonlinear backward ensemble transport smoothers. We\ndiscuss parameterization and regularization of the associated transport maps,\nand then examine the performance of these smoothers for nonlinear and chaotic\ndynamical systems that exhibit non-Gaussian behavior. In these settings, our\nnonlinear transport smoothers yield lower estimation error than conventional\nlinear smoothers and state-of-the-art iterative ensemble Kalman smoothers, for\ncomparable numbers of model evaluations."}, "http://arxiv.org/abs/2305.03634": {"title": "On the use of ordered factors as explanatory variables", "link": "http://arxiv.org/abs/2305.03634", "description": "Consider a regression or some regression-type model for a certain response\nvariable where the linear predictor includes an ordered factor among the\nexplanatory variables. The inclusion of a factor of this type can take place in\na few different ways, as discussed in the pertinent literature. The present\ncontribution proposes a different approach to this problem, constructing\na numeric variable in an alternative way with respect to the current\nmethodology. The proposed technique appears to retain the data-fitting\ncapability of the existing methodology, but with a simpler interpretation of\nthe model components."}, "http://arxiv.org/abs/2306.13257": {"title": "Semiparametric Estimation of the Shape of the Limiting Multivariate Point Cloud", "link": "http://arxiv.org/abs/2306.13257", "description": "We propose a model to flexibly estimate joint tail properties by exploiting\nthe convergence of an appropriately scaled point cloud onto a compact limit\nset. Characteristics of the shape of the limit set correspond to key tail\ndependence properties. We directly model the shape of the limit set using\nBezier splines, which allow flexible and parsimonious specification of shapes\nin two dimensions. We then fit the Bezier splines to data in pseudo-polar\ncoordinates using Markov chain Monte Carlo, utilizing a limiting approximation\nto the conditional likelihood of the radii given angles. 
By imposing\nappropriate constraints on the parameters of the Bezier splines, we guarantee\nthat each posterior sample is a valid limit set boundary, allowing direct\nposterior analysis of any quantity derived from the shape of the curve.\nFurthermore, we obtain interpretable inference on the asymptotic dependence\nclass by using mixture priors with point masses on the corner of the unit box.\nFinally, we apply our model to bivariate datasets of extremes of variables\nrelated to fire risk and air pollution."}, "http://arxiv.org/abs/2307.08370": {"title": "Parameter estimation for contact tracing in graph-based models", "link": "http://arxiv.org/abs/2307.08370", "description": "We adopt a maximum-likelihood framework to estimate parameters of a\nstochastic susceptible-infected-recovered (SIR) model with contact tracing on a\nrooted random tree. Given the number of detectees per index case, our estimator\nallows us to determine the degree distribution of the random tree as well as the\ntracing probability. Since we do not discover all infectees via contact\ntracing, this estimation is non-trivial. To keep things simple and stable, we\ndevelop an approximation suited for realistic situations (contact tracing\nprobability small, or the probability of detecting index cases small).\nIn this approximation, the only epidemiological parameter entering the\nestimator is $R_0$.\n\nThe estimator is tested in a simulation study and is furthermore applied to\nCOVID-19 contact tracing data from India. The simulation study underlines the\nefficiency of the method. For the empirical COVID-19 data, we compare different\ndegree distributions and perform a sensitivity analysis. We find that\nparticularly a power-law and a negative binomial degree distribution fit the\ndata well and that the tracing probability is rather large. The sensitivity\nanalysis shows no strong dependency of the estimates on the reproduction\nnumber. Finally, we discuss the relevance of our findings."}, "http://arxiv.org/abs/2311.13701": {"title": "Reexamining Statistical Significance and P-Values in Nursing Research: Historical Context and Guidance for Interpretation, Alternatives, and Reporting", "link": "http://arxiv.org/abs/2311.13701", "description": "Nurses should rely on the best evidence, but tend to struggle with\nstatistics, impeding research integration into clinical practice. Statistical\nsignificance, a key concept in classical statistics, and its primary metric,\nthe p-value, are frequently misused. This topic has been debated in many\ndisciplines but rarely in nursing. The aim is to present key arguments in the\ndebate surrounding the misuse of p-values, discuss their relevance to nursing,\nand offer recommendations to address them. The literature indicates that the\nconcept of probability in classical statistics is not easily understood,\nleading to misinterpretations of statistical significance. Much of the critique\nconcerning p-values arises from such misunderstandings and imprecise\nterminology. Thus, some scholars have argued for the complete abandonment of\np-values. Instead of discarding p-values, this article provides a comprehensive\naccount of their historical context and the information they convey. This will\nclarify why they are widely used yet often misunderstood. The article also\noffers recommendations for accurate interpretation of statistical significance\nby incorporating other key metrics. 
To mitigate publication bias resulting from\np-value misuse, pre-registering the analysis plan is recommended. The article\nalso explores alternative approaches, particularly Bayes factors, as they may\nresolve several of these issues. P-values serve a purpose in nursing research\nas an initial safeguard against the influence of randomness. Much criticism\ndirected towards p-values arises from misunderstandings and inaccurate\nterminology. Several considerations and measures are recommended, some of which go\nbeyond the conventional, to obtain accurate p-values and to better understand\nstatistical significance. Nurse educators and researchers should consider\nthese in their educational and research reporting practices."}, "http://arxiv.org/abs/2311.13767": {"title": "Hierarchical False Discovery Rate Control for High-dimensional Survival Analysis with Interactions", "link": "http://arxiv.org/abs/2311.13767", "description": "With the development of data collection techniques, analysis with a survival\nresponse and high-dimensional covariates has become routine. Here we consider\nan interaction model, which includes a set of low-dimensional covariates, a set\nof high-dimensional covariates, and their interactions. This model has been\nmotivated by gene-environment (G-E) interaction analysis, where the E variables\nhave a low dimension, and the G variables have a high dimension. For such a\nmodel, there has been extensive research on estimation and variable selection.\nComparatively, inference studies with a valid false discovery rate (FDR)\ncontrol have been very limited. The existing high-dimensional inference tools\ncannot be directly applied to interaction models, as interactions and main\neffects are not ``equal''. In this article, for high-dimensional survival\nanalysis with interactions, we model survival using the Accelerated Failure\nTime (AFT) model and adopt a ``weighted least squares + debiased Lasso''\napproach for estimation and selection. A hierarchical FDR control approach is\ndeveloped for inference that respects the ``main effects, interactions''\nhierarchy. The asymptotic distribution properties of the debiased Lasso\nestimators are rigorously established. Simulation demonstrates the\nsatisfactory performance of the proposed approach, and the analysis of a breast\ncancer dataset further establishes its practical utility."}, "http://arxiv.org/abs/2311.13768": {"title": "Valid confidence intervals for regression with best subset selection", "link": "http://arxiv.org/abs/2311.13768", "description": "Classical confidence intervals after best subset selection are widely\nimplemented in statistical software and are routinely used to guide\npractitioners in scientific fields to conclude significance. However, there are\nincreasing concerns in the recent literature about the validity of these\nconfidence intervals in that the intended frequentist coverage is not attained.\nIn the context of the Akaike information criterion (AIC), recent studies\nobserve an under-coverage phenomenon in terms of overfitting, where the\nestimate of error variance under the selected submodel is smaller than that for\nthe true model. Under-coverage is particularly troubling in selective inference\nas it points to inflated Type I errors that would invalidate significant\nfindings. 
In this article, we delineate a complementary, yet provably more\ndecisive factor behind the incorrect coverage of classical confidence intervals\nunder AIC, in terms of altered conditional sampling distributions of pivotal\nquantities. Resting on selective techniques developed in other settings, our\nfinite-sample characterization of the selection event under AIC uncovers its\ngeometry as a union of finitely many intervals on the real line, based on which\nwe derive new confidence intervals with guaranteed coverage for any sample\nsize. This geometry derived for AIC selection enables exact (and typically less\nthan exact) conditioning, circumventing the need for the excessive conditioning\ncommon in other post-selection methods. The proposed methods are easy to\nimplement and can be broadly applied to other commonly used best subset\nselection criteria. In an application to a classical US consumption dataset,\nthe proposed confidence intervals arrive at different conclusions compared to\nthe conventional ones, even when the selected model is the full model, leading\nto interpretable findings that better align with empirical observations."}, "http://arxiv.org/abs/2311.13815": {"title": "Resampling Methods with Imputed Data", "link": "http://arxiv.org/abs/2311.13815", "description": "Resampling techniques have become increasingly popular for estimation of\nuncertainty in data collected via surveys. Survey data are also frequently\nsubject to missing data which are often imputed. This note addresses the issue\nof using resampling methods such as a jackknife or bootstrap in conjunction\nwith imputations that have been sampled stochastically (e.g., in the vein of\nmultiple imputation). It is illustrated that the imputations must be redrawn\nwithin each replicate group of a jackknife or bootstrap. Further, the number of\nmultiply imputed datasets per replicate group must dramatically exceed the\nnumber of replicate groups for a jackknife. However, this is not the case in a\nbootstrap approach. A brief simulation study is provided to support the theory\nintroduced in this note."}, "http://arxiv.org/abs/2311.13825": {"title": "Online Prediction of Extreme Conditional Quantiles via B-Spline Interpolation", "link": "http://arxiv.org/abs/2311.13825", "description": "Extreme quantiles are critical for understanding the behavior of data in the\ntail region of a distribution. It is challenging to estimate extreme quantiles,\nparticularly when dealing with limited data in the tail. In such cases, extreme\nvalue theory offers a solution by approximating the tail distribution using the\nGeneralized Pareto Distribution (GPD). This allows for extrapolation beyond\nthe range of observed data, making it a valuable tool for various applications.\nHowever, when it comes to conditional cases, where estimation relies on\ncovariates, existing methods may require computationally expensive GPD fitting\nfor different observations. This computational burden becomes even more\nproblematic as the volume of observations increases, sometimes approaching\ninfinity. To address this issue, we propose an interpolation-based algorithm\nnamed EMI. EMI facilitates the online prediction of extreme conditional\nquantiles with finite offline observations. Combining quantile regression and\nGPD-based extrapolation, EMI is formulated as a bilevel programming problem,\nefficiently solvable using classic optimization methods. 
Once estimates for\noffline observations are obtained, EMI employs B-spline interpolation for\ncovariate-dependent variables, enabling estimation for online observations with\nfinite GPD fitting. Simulations and real data analysis demonstrate the\neffectiveness of EMI across various scenarios."}, "http://arxiv.org/abs/2311.13897": {"title": "Super-resolution capacity of variance-based stochastic fluorescence microscopy", "link": "http://arxiv.org/abs/2311.13897", "description": "Improving the resolution of fluorescence microscopy beyond the diffraction\nlimit can be achieved by acquiring and processing multiple images of the sample\nunder different illumination conditions. One of the simplest techniques, Random\nIllumination Microscopy (RIM), forms the super-resolved image from the variance\nof images obtained with random speckled illuminations. However, the validity of\nthis process has not been fully theorized. In this work, we characterize\nmathematically the sample information contained in the variance of\ndiffraction-limited speckled images as a function of the statistical properties\nof the illuminations. We show that an unambiguous two-fold resolution gain is\nobtained when the speckle correlation length coincides with the width of the\nobservation point spread function. Last, we analyze the difference between the\nvariance-based techniques using random speckled illuminations (as in RIM) and\nthose obtained using random fluorophore activation (as in Super-resolution\nOptical Fluctuation Imaging, SOFI)."}, "http://arxiv.org/abs/2311.13911": {"title": "Identifying Important Pairwise Logratios in Compositional Data with Sparse Principal Component Analysis", "link": "http://arxiv.org/abs/2311.13911", "description": "Compositional data are characterized by the fact that their elemental\ninformation is contained in simple pairwise logratios of the parts that\nconstitute the composition. While pairwise logratios are typically easy to\ninterpret, the number of possible pairs to consider quickly becomes (too) large\neven for medium-sized compositions, which might hinder interpretability in\nfurther multivariate analyses. Sparse methods can therefore be useful to\nidentify few, important pairwise logratios (respectively parts contained in\nthem) from the total candidate set. To this end, we propose a procedure based\non the construction of all possible pairwise logratios and employ sparse\nprincipal component analysis to identify important pairwise logratios. The\nperformance of the procedure is demonstrated both with simulated and real-world\ndata. In our empirical analyses, we propose three visual tools showing (i) the\nbalance between sparsity and explained variability, (ii) stability of the\npairwise logratios, and (iii) importance of the original compositional parts to\naid practitioners with their model interpretation."}, "http://arxiv.org/abs/2311.13923": {"title": "Optimal $F$-score Clustering for Bipartite Record Linkage", "link": "http://arxiv.org/abs/2311.13923", "description": "Probabilistic record linkage is often used to match records from two files,\nin particular when the variables common to both files comprise imperfectly\nmeasured identifiers like names and demographic variables. We consider\nbipartite record linkage settings in which each entity appears at most once\nwithin a file, i.e., there are no duplicates within the files, but some\nentities appear in both files. 
In this setting, the analyst desires a point\nestimate of the linkage structure that matches each record to at most one\nrecord from the other file. We propose an approach for obtaining this point\nestimate by maximizing the expected $F$-score for the linkage structure. We\ntarget the approach for record linkage methods that produce either (an\napproximate) posterior distribution of the unknown linkage structure or\nprobabilities of matches for record pairs. Using simulations and applications\nwith genuine data, we illustrate that the $F$-score estimators can lead to\nsensible estimates of the linkage structure."}, "http://arxiv.org/abs/2311.13935": {"title": "An analysis of the fragmentation of observing time at the Muztagh-ata site", "link": "http://arxiv.org/abs/2311.13935", "description": "Cloud cover plays a pivotal role in assessing observational conditions for\nastronomical site-testing. Beyond the fraction of observing time, its\nfragmentation also exerts a significant influence on the quality of nighttime\nsky clarity. In this article, we introduce the function Gamma, designed to\ncomprehensively capture both the fraction of available observing time and its\ncontinuity. Leveraging in situ measurement data gathered at the Muztagh-ata\nsite between 2017 and 2021, we showcase the effectiveness of our approach. The\nstatistical results illustrate that the Muztagh-ata site affords approximately\n122 absolutely clear nights and 205 very good nights annually, corresponding\nto Gamma greater than or equal to 0.9 and Gamma greater than or equal to 0.36,\nrespectively."}, "http://arxiv.org/abs/2311.14042": {"title": "Optimized Covariance Design for AB Test on Social Network under Interference", "link": "http://arxiv.org/abs/2311.14042", "description": "Online A/B tests have become increasingly popular and important for social\nplatforms. However, accurately estimating the global average treatment effect\n(GATE) has proven to be challenging due to network interference, which violates\nthe Stable Unit Treatment Value Assumption (SUTVA) and poses a great challenge\nto experimental design. Existing network experimental design research was\nmostly based on the unbiased Horvitz-Thompson (HT) estimator with substantial\ndata trimming to ensure unbiasedness at the price of high resultant estimation\nvariance. In this paper, we strive to balance the bias and variance in\ndesigning randomized network experiments. Under a potential outcome model with\n1-hop interference, we derive the bias and variance of the standard HT\nestimator and reveal their relation to the network topological structure and\nthe covariance of the treatment assignment vector. We then propose formulating\nthe experimental design problem as optimizing the covariance matrix of the\ntreatment assignment vector to balance bias and variance by\nminimizing a well-crafted upper bound of the mean squared error (MSE) of the\nestimator, which allows us to decouple the unknown interference effect\ncomponent and the experimental design component. An efficient projected\ngradient descent algorithm is presented to implement the desired randomization\nscheme. 
Finally, we carry out extensive simulation studies to demonstrate the\nadvantages of our proposed method over other existing methods in many settings,\nwith different levels of model misspecification."}, "http://arxiv.org/abs/2311.14054": {"title": "Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis", "link": "http://arxiv.org/abs/2311.14054", "description": "Between 2011 and 2014 NHANES collected objectively measured physical activity\ndata using wrist-worn accelerometers for tens of thousands of individuals for\nup to seven days. Here we analyze the minute-level indicators of being active,\nwhich can be viewed as binary (because there is an active indicator at every\nminute), multilevel (because there are multiple days of data for each study\nparticipant), and functional (because within-day data can be viewed as a function\nof time) data. To extract within- and between-participant directions of\nvariation in the data, we introduce Generalized Multilevel Functional Principal\nComponent Analysis (GM-FPCA), an approach based on the dimension reduction of\nthe linear predictor. Scores associated with specific patterns of activity are\nshown to be strongly associated with time to death. In particular, we confirm\nthat increased activity is associated with time to death, a result that has\nbeen reported on other data sets. In addition, our method shows the previously\nunreported finding that maintaining a consistent day-to-day routine is strongly\nassociated with a reduced risk of mortality (p-value $< 0.001$) even after\nadjusting for traditional risk factors. Extensive simulation studies indicate\nthat GM-FPCA provides accurate estimation of model parameters, is\ncomputationally stable, and is scalable in the number of study participants,\nvisits, and observations within visits. R code for implementing the method is\nprovided."}, "http://arxiv.org/abs/2311.14122": {"title": "Decompositions of the mean continuous ranked probability score", "link": "http://arxiv.org/abs/2311.14122", "description": "The continuous ranked probability score (crps) is the most commonly used\nscoring rule in the evaluation of probabilistic forecasts for real-valued\noutcomes. To assess and rank forecasting methods, researchers compute the mean\ncrps over given sets of forecast situations, based on the respective predictive\ndistributions and outcomes. We propose a new, isotonicity-based decomposition\nof the mean crps into interpretable components that quantify miscalibration\n(MSC), discrimination ability (DSC), and uncertainty (UNC), respectively. In a\ndetailed theoretical analysis, we compare the new approach to empirical\ndecompositions proposed earlier, generalize to population versions, analyse\ntheir properties and relationships, and relate to a hierarchy of notions of\ncalibration. The isotonicity-based decomposition guarantees the nonnegativity\nof the components and quantifies calibration in a sense that is stronger than\nfor other types of decompositions, subject to the nondegeneracy of empirical\ndecompositions. 
We illustrate the usage of the isotonicity-based decomposition\nin case studies from weather prediction and machine learning."}, "http://arxiv.org/abs/2311.14212": {"title": "Annotation Sensitivity: Training Data Collection Methods Affect Model Performance", "link": "http://arxiv.org/abs/2311.14212", "description": "When training data are collected from human annotators, the design of the\nannotation instrument, the instructions given to annotators, the\ncharacteristics of the annotators, and their interactions can impact training\ndata. This study demonstrates that design choices made when creating an\nannotation instrument also impact the models trained on the resulting\nannotations.\n\nWe introduce the term annotation sensitivity to refer to the impact of\nannotation data collection methods on the annotations themselves and on\ndownstream model performance and predictions.\n\nWe collect annotations of hate speech and offensive language in five\nexperimental conditions of an annotation instrument, randomly assigning\nannotators to conditions. We then fine-tune BERT models on each of the five\nresulting datasets and evaluate model performance on a holdout portion of each\ncondition. We find considerable differences between the conditions for 1) the\nshare of hate speech/offensive language annotations, 2) model performance, 3)\nmodel predictions, and 4) model learning curves.\n\nOur results emphasize the crucial role played by the annotation instrument,\nwhich has received little attention in the machine learning literature. We call\nfor additional research into how and why the instrument impacts the annotations\nto inform the development of best practices in instrument design."}, "http://arxiv.org/abs/2311.14220": {"title": "Assumption-lean and Data-adaptive Post-Prediction Inference", "link": "http://arxiv.org/abs/2311.14220", "description": "A primary challenge facing modern scientific research is the limited\navailability of gold-standard data which can be both costly and labor-intensive\nto obtain. With the rapid development of machine learning (ML), scientists have\nrelied on ML algorithms to predict these gold-standard outcomes with easily\nobtained covariates. However, these predicted outcomes are often used directly\nin subsequent statistical analyses, ignoring imprecision and heterogeneity\nintroduced by the prediction procedure. This will likely result in false\npositive findings and invalid scientific conclusions. In this work, we\nintroduce an assumption-lean and data-adaptive Post-Prediction Inference\n(POP-Inf) procedure that allows valid and powerful inference based on\nML-predicted outcomes. Its \"assumption-lean\" property guarantees reliable\nstatistical inference without assumptions on the ML-prediction, for a wide\nrange of statistical quantities. Its \"data-adaptive\" feature guarantees an\nefficiency gain over existing post-prediction inference methods, regardless of\nthe accuracy of ML-prediction. We demonstrate the superiority and applicability\nof our method through simulations and large-scale genomic data."}, "http://arxiv.org/abs/2311.14356": {"title": "Lagged coherence: explicit and testable definition", "link": "http://arxiv.org/abs/2311.14356", "description": "Measures of association between cortical regions based on activity signals\nprovide useful information for studying brain functional connectivity.\nDifficulties occur with signals of electric neuronal activity, where an\nobserved signal is a mixture, i.e. 
an instantaneous weighted average of the\ntrue, unobserved signals from all regions, due to volume conduction and low\nspatial resolution. This is why measures of lagged association are of interest,\nsince at least theoretically, \"lagged association\" is of physiological origin.\nIn contrast, the actual physiological instantaneous zero-lag association is\nmasked and confounded by the mixing artifact. A minimum requirement for a\nmeasure of lagged association is that it must not tend to zero with an increase\nof strength of true instantaneous physiological association. Such biased\nmeasures cannot tell whether a change in their value is due to a change in\nlagged association or a change in instantaneous association. An explicit testable\ndefinition for frequency domain lagged connectivity between two multivariate\ntime series is proposed. It is endowed with two important properties: it is\ninvariant to non-singular linear transformations of each vector time series\nseparately, and it is invariant to instantaneous association. As a sanity\ncheck, in the case of two univariate time series, the new definition leads back\nto the bivariate lagged coherence of 2007 (eqs 25 and 26 in\nhttps://doi.org/10.48550/arXiv.0706.1776)."}, "http://arxiv.org/abs/2311.14367": {"title": "Cultural data integration via random graphical modelling", "link": "http://arxiv.org/abs/2311.14367", "description": "Cultural values vary significantly around the world. Despite a large\nheterogeneity, similarities across national cultures are to be expected. This\npaper studies cross-country culture heterogeneity via the joint inference of\ncopula graphical models. To this end, a random graph generative model is\nintroduced, with a latent space that embeds cultural relatedness across\ncountries. Taking world-wide country-specific survey data as the primary source\nof information, the modelling framework allows us to integrate external data, both\nat the level of cultural traits and of their interdependence. In this way, we\nare able to identify several dimensions of culture."}, "http://arxiv.org/abs/2311.14412": {"title": "A Comparison of PDF Projection with Normalizing Flows and SurVAE", "link": "http://arxiv.org/abs/2311.14412", "description": "Normalizing flows (NF) recently gained attention as a way to construct\ngenerative networks with exact likelihood calculation out of composable layers.\nHowever, NF is restricted to dimension-preserving transformations. Surjection\nVAE (SurVAE) has been proposed to extend NF to dimension-altering\ntransformations. Such networks are desirable because they are expressive and\ncan be precisely trained. We show that the approaches are a re-invention of PDF\nprojection, which appeared over twenty years earlier and is much further\ndeveloped."}, "http://arxiv.org/abs/2311.14424": {"title": "Exact confidence intervals for the mixing distribution from binomial mixture distribution samples", "link": "http://arxiv.org/abs/2311.14424", "description": "We present methodology for constructing pointwise confidence intervals for\nthe cumulative distribution function and the quantiles of mixing distributions\non the unit interval from binomial mixture distribution samples. No assumptions\nare made on the shape of the mixing distribution. The confidence intervals are\nconstructed by inverting exact tests of composite null hypotheses regarding the\nmixing distribution. 
Our method may be applied to any deconvolution approach\nthat produces test statistics whose distribution is stochastically monotone for\nstochastic increase of the mixing distribution. We propose a hierarchical Bayes\napproach, which uses finite Polya Trees for modelling the mixing distribution\nand provides stable and accurate deconvolution estimates without the need for\nadditional tuning parameters. Our main technical result establishes the\nstochastic monotonicity property of the test statistics produced by the\nhierarchical Bayes approach. Leveraging the stochastic\nmonotonicity property, we explicitly derive the smallest asymptotic confidence\nintervals that may be constructed using our methodology. This raises the question\nof whether it is possible to construct smaller confidence intervals for the mixing\ndistribution without making parametric assumptions on its shape."}, "http://arxiv.org/abs/2311.14487": {"title": "Reconciliation of expert priors for quantities and events and application within the probabilistic Delphi method", "link": "http://arxiv.org/abs/2311.14487", "description": "We consider the problem of aggregating the judgements of a group of experts\nto form a single prior distribution representing the judgements of the group.\nWe develop a Bayesian hierarchical model to reconcile the judgements of the\ngroup of experts based on elicited quantiles for continuous quantities and\nprobabilities for one-off events. Previous Bayesian reconciliation methods have\nnot been used widely, if at all, in contrast to pooling methods and\nconsensus-based approaches. To address this, we embed Bayesian reconciliation\nwithin the probabilistic Delphi method. The result is to furnish the outcome of\nthe probabilistic Delphi method with a direct probabilistic interpretation,\nwith the resulting prior representing the judgements of the decision maker. We\ncan use the rationales from the Delphi process to group the experts for the\nhierarchical modelling. We illustrate the approach with applications to studies\nevaluating erosion in embankment dams and pump failures in a water pumping\nstation, and assess the properties of the approach using the TU Delft database\nof expert judgement studies. We see that, even using an off-the-shelf\nimplementation of the approach, it out-performs individual experts, equal\nweighting of experts and the classical method based on the log score."}, "http://arxiv.org/abs/2311.14502": {"title": "Informed Random Partition Models with Temporal Dependence", "link": "http://arxiv.org/abs/2311.14502", "description": "Model-based clustering is a powerful tool that is often used to discover\nhidden structure in data by grouping observational units that exhibit similar\nresponse values. Recently, clustering methods have been developed that permit\nincorporating an ``initial'' partition informed by expert opinion. Then, using\nsome similarity criteria, partitions different from the initial one are\ndown-weighted, i.e. they are assigned reduced probabilities. These methods represent\nan exciting new direction of method development in clustering techniques. We\nadd to this literature a method that very flexibly permits assigning varying\nlevels of uncertainty to any subset of the partition. This is particularly\nuseful in practice as there is rarely clear prior information with regard to\nthe entire partition. 
Our approach is not based on partition penalties but\nconsiders individual allocation probabilities for each unit (e.g., locally\nweighted prior information). We illustrate the gains in prior specification\nflexibility via simulation studies and an application to a dataset concerning\nspatio-temporal evolution of ${\\rm PM}_{10}$ measurements in Germany."}, "http://arxiv.org/abs/2311.14655": {"title": "A Sparse Factor Model for Clustering High-Dimensional Longitudinal Data", "link": "http://arxiv.org/abs/2311.14655", "description": "Recent advances in engineering technologies have enabled the collection of a\nlarge number of longitudinal features. This wealth of information presents\nunique opportunities for researchers to investigate the complex nature of\ndiseases and uncover underlying disease mechanisms. However, analyzing this\nkind of data can be difficult due to its high dimensionality and heterogeneity\nand the associated computational challenges. In this paper, we propose a Bayesian nonparametric\nmixture model for clustering high-dimensional mixed-type (e.g., continuous,\ndiscrete and categorical) longitudinal features. We employ a sparse factor\nmodel on the joint distribution of random effects, and the key idea is to induce\nclustering at the latent factor level instead of the original data to escape\nthe curse of dimensionality. The number of clusters is estimated through a\nDirichlet process prior. An efficient Gibbs sampler is developed to estimate\nthe posterior distribution of the model parameters. Analysis of real and\nsimulated data is presented and discussed. Our study demonstrates that the\nproposed model serves as a useful analytical tool for clustering\nhigh-dimensional longitudinal data."}, "http://arxiv.org/abs/2104.10618": {"title": "Multiple conditional randomization tests for lagged and spillover treatment effects", "link": "http://arxiv.org/abs/2104.10618", "description": "We consider the problem of constructing multiple conditional randomization\ntests. They may test different causal hypotheses but always aim to be nearly\nindependent, allowing the randomization p-values to be interpreted individually\nand combined using standard methods. We start with a simple, sequential\nconstruction of such tests, and then discuss its application to three problems:\nevidence factors for observational studies, lagged treatment effect in\nstepped-wedge trials, and spillover effect in randomized trials with\ninterference. We compare the proposed approach with some existing methods using\nsimulated and real datasets. Finally, we establish a general sufficient\ncondition for constructing multiple nearly independent conditional\nrandomization tests."}, "http://arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "http://arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for\ndetecting localized regions in data sequences which each must contain a\nchange-point in the median, at a prescribed global significance level. RNSP\nworks by fitting the postulated constant model over many regions of the data\nusing a new sign-multiresolution sup-norm-type loss, and greedily identifying\nthe shortest intervals on which the constancy is significantly violated. 
By\nworking with the signs of the data around fitted model candidates, RNSP fulfils\nits coverage promises under minimal assumptions, requiring only sign-symmetry\nand serial independence of the signs of the true residuals. In particular, it\npermits their heterogeneity and arbitrarily heavy tails. The intervals of\nsignificance returned by RNSP have a finite-sample character, are unconditional\nin nature and do not rely on any assumptions on the true signal. Code\nimplementing RNSP is available at https://github.com/pfryz/nsp."}, "http://arxiv.org/abs/2111.12720": {"title": "Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator", "link": "http://arxiv.org/abs/2111.12720", "description": "We resurrect the infamous harmonic mean estimator for computing the marginal\nlikelihood (Bayesian evidence) and solve its problematic large variance. The\nmarginal likelihood is a key component of Bayesian model selection to evaluate\nmodel posterior probabilities; however, its computation is challenging. The\noriginal harmonic mean estimator, first proposed by Newton and Raftery in 1994,\ninvolves computing the harmonic mean of the likelihood given samples from the\nposterior. It was immediately realised that the original estimator can fail\ncatastrophically since its variance can become very large (possibly not\nfinite). A number of variants of the harmonic mean estimator have been proposed\nto address this issue although none have proven fully satisfactory. We present\nthe \\emph{learnt harmonic mean estimator}, a variant of the original estimator\nthat solves its large variance problem. This is achieved by interpreting the\nharmonic mean estimator as importance sampling and introducing a new target\ndistribution. The new target distribution is learned to approximate the optimal\nbut inaccessible target, while minimising the variance of the resulting\nestimator. Since the estimator requires samples of the posterior only, it is\nagnostic to the sampling strategy used. We validate the estimator on a variety\nof numerical experiments, including a number of pathological examples where the\noriginal harmonic mean estimator fails catastrophically. We also consider a\ncosmological application, where our approach leads to $\\sim$ 3 to 6 times more\nsamples than current state-of-the-art techniques in 1/3 of the time. In all\ncases our learnt harmonic mean estimator is shown to be highly accurate. The\nestimator is computationally scalable and can be applied to problems of\ndimension $O(10^3)$ and beyond. Code implementing the learnt harmonic mean\nestimator is made publicly available"}, "http://arxiv.org/abs/2205.07378": {"title": "Proximal MCMC for Bayesian Inference of Constrained and Regularized Estimation", "link": "http://arxiv.org/abs/2205.07378", "description": "This paper advocates proximal Markov Chain Monte Carlo (ProxMCMC) as a\nflexible and general Bayesian inference framework for constrained or\nregularized estimation. Originally introduced in the Bayesian imaging\nliterature, ProxMCMC employs the Moreau-Yosida envelope for a smooth\napproximation of the total-variation regularization term, fixes variance and\nregularization strength parameters as constants, and uses the Langevin\nalgorithm for the posterior sampling. We extend ProxMCMC to be fully Bayesian\nby providing data-adaptive estimation of all parameters including the\nregularization strength parameter. 
More powerful sampling algorithms such as\nHamiltonian Monte Carlo are employed to scale ProxMCMC to high-dimensional\nproblems. Analogous to the proximal algorithms in optimization, ProxMCMC offers\na versatile and modularized procedure for conducting statistical inference on\nconstrained and regularized problems. The power of ProxMCMC is illustrated on\nvarious statistical estimation and machine learning tasks, the inference of\nwhich is traditionally considered difficult from both frequentist and Bayesian\nperspectives."}, "http://arxiv.org/abs/2211.10032": {"title": "Modular Regression: Improving Linear Models by Incorporating Auxiliary Data", "link": "http://arxiv.org/abs/2211.10032", "description": "This paper develops a new framework, called modular regression, to utilize\nauxiliary information -- such as variables other than the original features or\nadditional data sets -- in the training process of linear models. At a high\nlevel, our method follows the routine: (i) decomposing the regression task into\nseveral sub-tasks, (ii) fitting the sub-task models, and (iii) using the\nsub-task models to provide an improved estimate for the original regression\nproblem. This routine applies to widely-used low-dimensional (generalized)\nlinear models and high-dimensional regularized linear regression. It also\nnaturally extends to missing-data settings where only partial observations are\navailable. By incorporating auxiliary information, our approach improves the\nestimation efficiency and prediction accuracy upon linear regression or the\nLasso under a conditional independence assumption for predicting the outcome.\nFor high-dimensional settings, we develop an extension of our procedure that is\nrobust to violations of the conditional independence assumption, in the sense\nthat it improves efficiency if this assumption holds and coincides with the\nLasso otherwise. We demonstrate the efficacy of our methods with simulated and\nreal data sets."}, "http://arxiv.org/abs/2211.13478": {"title": "A New Spatio-Temporal Model Exploiting Hamiltonian Equations", "link": "http://arxiv.org/abs/2211.13478", "description": "The solutions of Hamiltonian equations are known to describe the underlying\nphase space of the mechanical system. Hamiltonian Monte Carlo is the sole use\nof the properties of solutions to the Hamiltonian equations in Bayesian\nstatistics. In this article, we propose a novel spatio-temporal model using a\nstrategic modification of the Hamiltonian equations, incorporating appropriate\nstochasticity via Gaussian processes. The resultant spatio-temporal process,\ncontinuously varying with time, turns out to be nonparametric, nonstationary,\nnonseparable and non-Gaussian. Additionally, as the spatio-temporal lag goes to\ninfinity, the lagged correlations converge to zero. We investigate the\ntheoretical properties of the new spatio-temporal process, including its\ncontinuity and smoothness properties. In the Bayesian paradigm, we derive\nmethods for complete Bayesian inference using MCMC techniques. The performance\nof our method has been compared with that of a non-stationary Gaussian process\n(GP) using two simulation studies, where our method shows a significant\nimprovement over the non-stationary GP. 
Further, application of our new model\nto two real data sets revealed encouraging performance."}, "http://arxiv.org/abs/2212.08968": {"title": "Covariate Adjustment in Bayesian Adaptive Randomized Controlled Trials", "link": "http://arxiv.org/abs/2212.08968", "description": "In conventional randomized controlled trials, adjustment for baseline values\nof covariates known to be at least moderately associated with the outcome\nincreases the power of the trial. Recent work has shown particular benefit for\nmore flexible frequentist designs, such as information adaptive and adaptive\nmulti-arm designs. However, covariate adjustment has not been characterized\nwithin the more flexible Bayesian adaptive designs, despite their growing\npopularity. We focus on a subclass of these which allow for early stopping at\nan interim analysis given evidence of treatment superiority. We consider both\ncollapsible and non-collapsible estimands, and show how to obtain posterior\nsamples of marginal estimands from adjusted analyses. We describe several\nestimands for three common outcome types. We perform a simulation study to\nassess the impact of covariate adjustment using a variety of adjustment models\nin several different scenarios. This is followed by a real world application of\nthe compared approaches to a COVID-19 trial with a binary endpoint. For all\nscenarios, it is shown that covariate adjustment increases power and the\nprobability of stopping the trials early, and decreases the expected sample\nsizes as compared to unadjusted analyses."}, "http://arxiv.org/abs/2301.11873": {"title": "A Deep Learning Method for Comparing Bayesian Hierarchical Models", "link": "http://arxiv.org/abs/2301.11873", "description": "Bayesian model comparison (BMC) offers a principled approach for assessing\nthe relative merits of competing computational models and propagating\nuncertainty into model selection decisions. However, BMC is often intractable\nfor the popular class of hierarchical models due to their high-dimensional\nnested parameter structure. To address this intractability, we propose a deep\nlearning method for performing BMC on any set of hierarchical models which can\nbe instantiated as probabilistic programs. Since our method enables amortized\ninference, it allows efficient re-estimation of posterior model probabilities\nand fast performance validation prior to any real-data application. In a series\nof extensive validation studies, we benchmark the performance of our method\nagainst the state-of-the-art bridge sampling method and demonstrate excellent\namortized inference across all BMC settings. We then showcase our method by\ncomparing four hierarchical evidence accumulation models that have previously\nbeen deemed intractable for BMC due to partly implicit likelihoods.\nAdditionally, we demonstrate how transfer learning can be leveraged to enhance\ntraining efficiency. We provide reproducible code for all analyses and an\nopen-source implementation of our method."}, "http://arxiv.org/abs/2304.04124": {"title": "Nonparametric Confidence Intervals for Generalized Lorenz Curve using Modified Empirical Likelihood", "link": "http://arxiv.org/abs/2304.04124", "description": "The Lorenz curve portrays the inequality of income distribution. 
In this\narticle, we develop three modified empirical likelihood (EL) approaches\nincluding adjusted empirical likelihood, transformed empirical likelihood, and\ntransformed adjusted empirical likelihood to construct confidence intervals for\nthe generalized Lorenz ordinate. We show that the limiting distribution\nof the modified EL ratio statistics for the generalized Lorenz ordinate follows\na scaled Chi-Squared distribution with one degree of freedom. The coverage\nprobabilities and mean lengths of the confidence intervals from the proposed\nmethods are compared with those of the traditional EL method through simulations under\nvarious scenarios. Finally, the proposed methods are illustrated using a real\ndata application to construct confidence intervals."}, "http://arxiv.org/abs/2306.04702": {"title": "Efficient sparsity adaptive changepoint estimation", "link": "http://arxiv.org/abs/2306.04702", "description": "We propose a new, computationally efficient, sparsity adaptive changepoint\nestimator for detecting changes in unknown subsets of a high-dimensional data\nsequence. Assuming the data sequence is Gaussian, we prove that the new method\nsuccessfully estimates the number and locations of changepoints with a given\nerror rate and under minimal conditions, for all sparsities of the changing\nsubset. Moreover, our method has computational complexity linear up to\nlogarithmic factors in both the length and number of time series, making it\napplicable to large data sets. Through extensive numerical studies we show that\nthe new methodology is highly competitive in terms of both estimation accuracy\nand computational cost. The practical usefulness of the method is illustrated\nby analysing sensor data from a hydro power plant. An efficient R\nimplementation is available."}, "http://arxiv.org/abs/2306.07119": {"title": "Improving Forecasts for Heterogeneous Time Series by \"Averaging\", with Application to Food Demand Forecast", "link": "http://arxiv.org/abs/2306.07119", "description": "A common forecasting setting in real world applications considers a set of\npossibly heterogeneous time series of the same domain. Due to different\nproperties of each time series such as length, obtaining forecasts for each\nindividual time series in a straightforward way is challenging. This paper\nproposes a general framework utilizing a similarity measure in Dynamic Time\nWarping to find similar time series to build neighborhoods in a k-Nearest\nNeighbor fashion, and improve forecasts of possibly simple models by averaging.\nSeveral ways of performing the averaging are suggested, and theoretical\narguments underline the usefulness of averaging for forecasting. Additionally,\ndiagnostic tools are proposed allowing a deep understanding of the procedure."}, "http://arxiv.org/abs/2307.07898": {"title": "A Graph-Prediction-Based Approach for Debiasing Underreported Data", "link": "http://arxiv.org/abs/2307.07898", "description": "We present a novel Graph-based debiasing Algorithm for Underreported Data\n(GRAUD) aiming at an efficient joint estimation of event counts and discovery\nprobabilities across spatial or graphical structures. This innovative method\nprovides a solution to problems seen in fields such as policing data and\nCOVID-$19$ data analysis. Our approach avoids the need for strong priors\ntypically associated with Bayesian frameworks. 
By leveraging the graph\nstructures on unknown variables $n$ and $p$, our method debiases the\nunderreported data and estimates the discovery probability at the same time. We\nvalidate the effectiveness of our method through simulation experiments and\nillustrate its practicality in one real-world application: police 911\ncalls-to-service data."}, "http://arxiv.org/abs/2307.12832": {"title": "More Power by using Fewer Permutations", "link": "http://arxiv.org/abs/2307.12832", "description": "It is conventionally believed that a permutation test should ideally use all\npermutations. If this is computationally unaffordable, it is believed one\nshould use the largest affordable Monte Carlo sample or (algebraic) subgroup of\npermutations. We challenge this belief by showing we can sometimes obtain\ndramatically more power by using a tiny subgroup. As the subgroup is tiny, this\nsimultaneously comes at a much lower computational cost. We exploit this to\nimprove the popular permutation-based Westfall & Young MaxT multiple testing\nmethod. We study the relative efficiency in a Gaussian location model, and find\nthe largest gain in high dimensions."}, "http://arxiv.org/abs/2309.14156": {"title": "Designing and evaluating an online reinforcement learning agent for physical exercise recommendations in N-of-1 trials", "link": "http://arxiv.org/abs/2309.14156", "description": "Personalized adaptive interventions offer the opportunity to increase patient\nbenefits; however, there are challenges in their planning and implementation.\nOnce implemented, it is an important question whether personalized adaptive\ninterventions are indeed clinically more effective compared to a fixed gold\nstandard intervention. In this paper, we present an innovative N-of-1 trial\nstudy design testing whether implementing a personalized intervention by an\nonline reinforcement learning agent is feasible and effective. Throughout, we\nuse a new study on physical exercise recommendations to reduce pain in\nendometriosis for illustration. We describe the design of a contextual bandit\nrecommendation agent and evaluate the agent in simulation studies. The results\nshow that, first, implementing a personalized intervention by an online\nreinforcement learning agent is feasible. Second, such adaptive interventions\nhave the potential to improve patients' benefits even if only a few observations\nare available. As one challenge, they add complexity to the design and\nimplementation process. In order to quantify the expected benefit, data from\nprevious interventional studies is required. We expect our approach to be\ntransferable to other interventions and clinical interventions."}, "http://arxiv.org/abs/2311.14766": {"title": "Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing", "link": "http://arxiv.org/abs/2311.14766", "description": "Reinforcement Learning from Human Feedback (RLHF) has played a crucial role\nin the success of large models such as ChatGPT. RLHF is a reinforcement\nlearning framework which combines human feedback to improve learning\neffectiveness and performance. However, obtaining preference feedback manually\nis quite expensive in commercial applications. Some statistical commercial\nindicators are often more valuable yet are typically ignored in RLHF. There exists a\ngap between commercial targets and model training. 
In our research, we will\nattempt to fill this gap with statistical business feedback instead of human\nfeedback, using AB testing, a well-established statistical method.\nReinforcement Learning from Statistical Feedback (RLSF) based on AB testing is\nproposed. Statistical inference methods are used to obtain preferences for\ntraining the reward network, which fine-tunes the pre-trained model in a\nreinforcement learning framework, achieving greater business value.\nFurthermore, we extend AB testing with double selections at a single time-point\nto ANT testing with multiple selections at different feedback time points.\nMoreover, we design numerical experiments to validate the effectiveness of our\nalgorithm framework."}, "http://arxiv.org/abs/2311.14846": {"title": "Fast Estimation of the Renshaw-Haberman Model and Its Variants", "link": "http://arxiv.org/abs/2311.14846", "description": "In mortality modelling, cohort effects are often taken into consideration as\nthey add insights about variations in mortality across different generations.\nStatistically speaking, models such as the Renshaw-Haberman model may provide a\nbetter fit to historical data compared to their counterparts that incorporate\nno cohort effects. However, when such models are estimated using an iterative\nmaximum likelihood method in which parameters are updated one at a time,\nconvergence is typically slow and may not even be reached within a reasonably\nestablished maximum number of iterations. Among others, the slow convergence\nproblem hinders the study of parameter uncertainty through bootstrapping\nmethods.\n\nIn this paper, we propose an intuitive estimation method that minimizes the\nsum of squared errors between actual and fitted log central death rates. The\ncomplications arising from the incorporation of cohort effects are overcome by\nformulating part of the optimization as a principal component analysis with\nmissing values. We also show how the proposed method can be generalized to\nvariants of the Renshaw-Haberman model with further computational improvement,\neither with a simplified model structure or an additional constraint. Using\nmortality data from the Human Mortality Database (HMD), we demonstrate that our\nproposed method produces satisfactory estimation results and is significantly\nmore efficient compared to the traditional likelihood-based approach."}, "http://arxiv.org/abs/2311.14867": {"title": "Disaggregating Time-Series with Many Indicators: An Overview of the DisaggregateTS Package", "link": "http://arxiv.org/abs/2311.14867", "description": "Low-frequency time-series (e.g., quarterly data) are often treated as\nbenchmarks for interpolating to higher frequencies, since they generally\nexhibit greater precision and accuracy in contrast to their high-frequency\ncounterparts (e.g., monthly data) reported by governmental bodies. An array of\nregression-based methods have been proposed in the literature which aim to\nestimate a target high-frequency series using higher frequency indicators.\nHowever, in the era of big data and with the prevalence of large volumes of\nadministrative data sources, there is a need to extend traditional methods to\nwork in high-dimensional settings, i.e. where the number of indicators is\nsimilar to or larger than the number of low-frequency samples. The package\nDisaggregateTS includes classical regression-based disaggregation methods\nalongside recent extensions to high-dimensional settings, cf. Mosley et al.\n(2022). 
This paper provides guidance on how to implement these methods via the\npackage in R, and demonstrates their use in an application to disaggregating\nCO2 emissions."}, "http://arxiv.org/abs/2311.14889": {"title": "Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data", "link": "http://arxiv.org/abs/2311.14889", "description": "In this paper we review recent advances in statistical methods for the\nevaluation of the heterogeneity of treatment effects (HTE), including subgroup\nidentification and estimation of individualized treatment regimens, from\nrandomized clinical trials and observational studies. We identify several types\nof approaches using the features introduced in Lipkovich, Dmitrienko and\nD'Agostino (2017) that distinguish the recommended principled methods from\nbasic methods for HTE evaluation that typically rely on rules of thumb and\ngeneral guidelines (the methods are often referred to as common practices). We\ndiscuss the advantages and disadvantages of various principled methods as well\nas common measures for evaluating their performance. We use simulated data and\na case study based on a historical clinical trial to illustrate several new\napproaches to HTE evaluation."}, "http://arxiv.org/abs/2311.14894": {"title": "Kernel-based measures of association between inputs and outputs based on ANOVA", "link": "http://arxiv.org/abs/2311.14894", "description": "The ANOVA decomposition of a function with random input variables provides ANOVA\nfunctionals (AFs), which contain information about the contributions of the\ninput variables to the output variable(s). By embedding AFs into an appropriate\nreproducing kernel Hilbert space regarding their distributions, we propose an\nefficient statistical test of independence between the input variables and\noutput variable(s). The resulting test statistic leads to new dependent\nmeasures of association between inputs and outputs that allow for i) dealing\nwith any distribution of AFs, including the Cauchy distribution, ii) accounting\nfor the necessary or desirable moments of AFs and the interactions among the\ninput variables. In uncertainty quantification for mathematical models, a\nnumber of existing measures are special cases of this framework. We then\nprovide unified and general global sensitivity indices and their consistent\nestimators, including asymptotic distributions. For Gaussian-distributed AFs,\nwe obtain Sobol' indices and dependent generalized sensitivity indices using\nquadratic kernels."}, "http://arxiv.org/abs/2311.15012": {"title": "False Discovery Rate Controlling Procedures with BLOSUM62 substitution matrix and their application to HIV Data", "link": "http://arxiv.org/abs/2311.15012", "description": "Identifying significant sites in sequence data and analogous data is of\nfundamental importance in many biological fields. Fisher's exact test is a\npopular technique; however, this approach is not appropriate for sparse count\ndata because it yields overly conservative decisions. Since count data in HIV data are\ntypically very sparse, it is crucial to incorporate additional information into\nstatistical models to improve testing power. In order to develop new approaches\nto incorporate biological information in the false discovery controlling\nprocedure, we propose two models: one based on an empirical Bayes model under\nindependence of amino acids, and the other based on pairwise associations of amino\nacids via a Markov random field built on the BLOSUM62 substitution matrix. 
We\napply the proposed methods to HIV data and identify significant sites\nby incorporating the BLOSUM62 matrix, while the traditional method based on Fisher's\ntest does not discover any site. These newly developed methods have the\npotential to handle many biological problems in vaccine and drug\ntrials and in phenotype studies."}, "http://arxiv.org/abs/2311.15031": {"title": "Robust and Efficient Semi-supervised Learning for Ising Model", "link": "http://arxiv.org/abs/2311.15031", "description": "In biomedical studies, it is often desirable to characterize the interactive\nmode of multiple disease outcomes beyond their marginal risk. The Ising model is\none of the most popular choices for this purpose. Nevertheless, the\nlearning efficiency of Ising models can be impeded by the scarcity of accurate\ndisease labels, which is a prominent problem in contemporary studies driven by\nelectronic health records (EHR). Semi-supervised learning (SSL) leverages the\nlarge unlabeled sample with auxiliary EHR features to assist the learning with\nlabeled data only and is a potential solution to this issue. In this paper, we\ndevelop a novel SSL method for efficient inference of the Ising model. Our method\nfirst models the outcomes against the auxiliary features, then uses it to\nproject the score function of the supervised estimator onto the EHR features,\nand incorporates the unlabeled sample to augment the supervised estimator for\nvariance reduction without introducing bias. For the key step of conditional\nmodeling, we propose strategies that can effectively leverage the auxiliary EHR\ninformation while maintaining moderate model complexity. In addition, we\nintroduce approaches including intrinsic efficient updates and ensemble, to\novercome the potential misspecification of the conditional model that may cause\nefficiency loss. Our method is justified by asymptotic theory and shown to\noutperform existing SSL methods through simulation studies. We also illustrate\nits utility in a real example about several key phenotypes related to frequent\nICU admission on the MIMIC-III data set."}, "http://arxiv.org/abs/2311.15257": {"title": "Bayesian Imputation of Revolving Doors", "link": "http://arxiv.org/abs/2311.15257", "description": "Political scientists and sociologists study how individuals switch back and\nforth between public and private organizations, for example between regulator\nand lobbyist positions, a phenomenon called \"revolving doors\". However, they\nface an important issue of data missingness, as not all data relevant to this\nquestion is freely available. For example, the nomination of an individual in a\ngiven public-sector position of power might be publicly disclosed, but not\ntheir subsequent positions in the private sector. In this article, we adopt a\nBayesian data augmentation strategy for discrete time series and propose\nmeasures of public-private mobility across the French state at large,\nmobilizing administrative and digital data. We relax homogeneity hypotheses of\ntraditional hidden Markov models and implement a version of a Markov switching\nmodel, which allows for varying parameters across individuals and time and\nauto-correlated behaviors. 
We describe how the revolving doors phenomenon\nvaries across the French state and how it has evolved between 1990 and 2022."}, "http://arxiv.org/abs/2311.15322": {"title": "False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values", "link": "http://arxiv.org/abs/2311.15322", "description": "The effective utilization of structural information in data while ensuring\nstatistical validity poses a significant challenge in false discovery rate\n(FDR) analyses. Conformal inference provides rigorous theory for grounding\ncomplex machine learning methods without relying on strong assumptions or\nhighly idealized models. However, existing conformal methods have limitations\nin handling structured multiple testing. This is because their validity\nrequires the deployment of symmetric rules, which assume the exchangeability of\ndata points and permutation-invariance of fitting algorithms. To overcome these\nlimitations, we introduce the pseudo local index of significance (PLIS)\nprocedure, which is capable of accommodating asymmetric rules and requires only\npairwise exchangeability between the null conformity scores. We demonstrate\nthat PLIS offers finite-sample guarantees in FDR control and the ability to\nassign higher weights to relevant data points. Numerical results confirm the\neffectiveness and robustness of PLIS and show improvements in power compared to\nexisting model-free methods in various scenarios."}, "http://arxiv.org/abs/2311.15359": {"title": "Goodness-of-fit tests for the one-sided L\\'evy distribution based on quantile conditional moments", "link": "http://arxiv.org/abs/2311.15359", "description": "In this paper we introduce a novel statistical framework based on the first\ntwo quantile conditional moments that facilitates effective goodness-of-fit\ntesting for one-sided L\\'evy distributions. The scale-ratio framework\nintroduced in this paper extends our previous results in which we have shown\nhow to extract unique distribution features using conditional variance ratio\nfor the generic class of {\\alpha}-stable distributions. We show that the\nconditional moment-based goodness-of-fit statistics are a good alternative to\nother methods introduced in the literature tailored to the one-sided L\\'evy\ndistributions. The usefulness of our approach is verified using an empirical\ntest power study. For completeness, we also derive the asymptotic distributions\nof the test statistics and show how to apply our framework to real data."}, "http://arxiv.org/abs/2311.15384": {"title": "Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means", "link": "http://arxiv.org/abs/2311.15384", "description": "Clustering stands as one of the most prominent challenges within the realm of\nunsupervised machine learning. Among the array of centroid-based clustering\nalgorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes\ncenter stage as one of the extensively employed techniques in the literature.\nNonetheless, both $k$-means and its variants grapple with noteworthy\nlimitations. These encompass a heavy reliance on initial cluster centroids,\nsusceptibility to converging into local minima of the objective function, and\nsensitivity to outliers and noise in the data. When confronted with data\ncontaining noisy or outlier-laden observations, the Median-of-Means (MoM)\nestimator emerges as a stabilizing force for any centroid-based clustering\nframework. 
On a different note, a prevalent constraint among existing\nclustering methodologies resides in the prerequisite knowledge of the number of\nclusters prior to analysis. Utilizing model-based methodologies, such as\nBayesian nonparametric models, offers the advantage of infinite mixture models,\nthereby circumventing the need for such requirements. Motivated by these facts,\nin this article, we present an efficient and automatic clustering technique by\nintegrating the principles of model-based and centroid-based methodologies that\nmitigates the effect of noise on the quality of clustering while ensuring that\nthe number of clusters need not be specified in advance. Statistical guarantees\non the upper bound of clustering error, and rigorous assessment through\nsimulated and real datasets suggest the advantages of our proposed method over\nexisting state-of-the-art clustering algorithms."}, "http://arxiv.org/abs/2311.15410": {"title": "A Comprehensive Analysis of HIV Treatment Efficacy in the ACTG 175 Trial Through Multiple-Endpoint Approaches", "link": "http://arxiv.org/abs/2311.15410", "description": "In the realm of medical research, the intricate interplay of epidemiological\nrisk, genomic activity, adverse events, and clinical response necessitates a\nnuanced consideration of multiple variables. Clinical trials, designed to\nmeticulously assess the efficacy and safety of interventions, routinely\nincorporate a diverse array of endpoints. While a primary endpoint is\ncustomary, supplemented by key secondary endpoints, the statistical\nsignificance is typically evaluated independently for each. To address the\ninherent challenges in studying multiple endpoints, diverse strategies,\nincluding composite endpoints and global testing, have been proposed. This work\nstands apart by focusing on the evaluation of a clinical trial, deviating from\nthe conventional approach to underscore the efficacy of a multiple-endpoint\nprocedure. A double-blind study was conducted to gauge the treatment efficacy\nin adults infected with human immunodeficiency virus type 1 (HIV-1), featuring\nCD4 cell counts ranging from 200 to 500 per cubic millimeter. A total of 2467\nHIV-1-infected patients (43 percent without prior antiretroviral treatment)\nwere randomly assigned to one of four daily regimens: 600 mg of zidovudine; 600\nmg of zidovudine plus 400 mg of didanosine; 600 mg of zidovudine plus 2.25 mg\nof zalcitabine; or 400 mg of didanosine. The primary endpoint comprised a >50\npercent decline in CD4 cell count, development of acquired immunodeficiency\nsyndrome (AIDS), or death. This study sought to determine the efficacy and\nsafety of zidovudine (AZT) versus didanosine (ddI), AZT plus ddI, and AZT plus\nzalcitabine (ddC) in preventing disease progression in HIV-infected patients\nwith CD4 counts of 200-500 cells/mm3. By jointly considering all endpoints, the\nmultiple-endpoints approach yields results of greater significance than a\nsingle-endpoint approach."}, "http://arxiv.org/abs/2311.15434": {"title": "Structural Discovery with Partial Ordering Information for Time-Dependent Data with Convergence Guarantees", "link": "http://arxiv.org/abs/2311.15434", "description": "Structural discovery amongst a set of variables is of interest in both static\nand dynamic settings. 
In the presence of lead-lag dependencies in the data, the\ndynamics of the system can be represented through a structural equation model\n(SEM) that simultaneously captures the contemporaneous and temporal\nrelationships amongst the variables, with the former encoded through a directed\nacyclic graph (DAG) for model identification. In many real applications, a\npartial ordering amongst the nodes of the DAG is available, which makes it\neither beneficial or imperative to incorporate it as a constraint in the\nproblem formulation. This paper develops an algorithm that can seamlessly\nincorporate a priori partial ordering information for solving a linear SEM\n(also known as Structural Vector Autoregression) under a high-dimensional\nsetting. The proposed algorithm is provably convergent to a stationary point,\nand exhibits competitive performance on both synthetic and real data sets."}, "http://arxiv.org/abs/2311.15485": {"title": "Calibrated Generalized Bayesian Inference", "link": "http://arxiv.org/abs/2311.15485", "description": "We provide a simple and general solution to the fundamental open problem of\ninaccurate uncertainty quantification of Bayesian inference in misspecified or\napproximate models, and of generalized Bayesian posteriors more generally.\nWhile existing solutions are based on explicit Gaussian posterior\napproximations, or computationally onerous post-processing procedures, we\ndemonstrate that correct uncertainty quantification can be achieved by\nsubstituting the usual posterior with an alternative posterior that conveys the\nsame information. This solution applies to both likelihood-based and loss-based\nposteriors, and we formally demonstrate the reliable uncertainty quantification\nof this approach. The new approach is demonstrated through a range of examples,\nincluding generalized linear models, and doubly intractable models."}, "http://arxiv.org/abs/2311.15498": {"title": "Adjusted inference for multiple testing procedure in group sequential designs", "link": "http://arxiv.org/abs/2311.15498", "description": "Adjustment of statistical significance levels for repeated analysis in group\nsequential trials has been understood for some time. Similarly, methods for\nadjustment accounting for testing multiple hypotheses are common. There is\nlimited research on simultaneously adjusting for both multiple hypothesis\ntesting and multiple analyses of one or more hypotheses. We address this gap by\nproposing adjusted-sequential p-values that reject an elementary hypothesis\nwhen its adjusted-sequential p-values are less than or equal to the family-wise\nType I error rate (FWER) in a group sequential design. We also propose\nsequential p-values for intersection hypotheses as a tool to compute adjusted\nsequential p-values for elementary hypotheses. We demonstrate the application\nusing weighted Bonferroni tests and weighted parametric tests, comparing\nadjusted sequential p-values to a desired FWER for inference on each elementary\nhypothesis tested."}, "http://arxiv.org/abs/2311.15598": {"title": "Optimal Clustering of Discrete Mixtures: Binomial, Poisson, Block Models, and Multi-layer Networks", "link": "http://arxiv.org/abs/2311.15598", "description": "In this paper, we first study the fundamental limit of clustering networks\nwhen a multi-layer network is present. 
Under the mixture multi-layer stochastic\nblock model (MMSBM), we show that the minimax optimal network clustering error\nrate takes an exponential form and is characterized by the Renyi\ndivergence between the edge probability distributions of the component\nnetworks. We propose a novel two-stage network clustering method including a\ntensor-based initialization algorithm involving both node and sample splitting\nand a refinement procedure based on a likelihood-based Lloyd algorithm. Network\nclustering must be accompanied by node community detection. Our proposed\nalgorithm achieves the minimax optimal network clustering error rate and allows\nextreme network sparsity under the MMSBM. Numerical simulations and real data\nexperiments both validate that our method outperforms existing methods.\nOftentimes, the edges of networks carry count-type weights. We then extend our\nmethodology and analysis framework to study the minimax optimal clustering\nerror rate for mixtures of discrete distributions including Binomial, Poisson,\nand multi-layer Poisson networks. The minimax optimal clustering error rates in\nthese discrete mixtures all take the same exponential form characterized by the\nRenyi divergences. These optimal clustering error rates in discrete mixtures\ncan also be achieved by our proposed two-stage clustering algorithm."}, "http://arxiv.org/abs/2311.15610": {"title": "Bayesian Approach to Linear Bayesian Networks", "link": "http://arxiv.org/abs/2311.15610", "description": "This study proposes the first Bayesian approach for learning high-dimensional\nlinear Bayesian networks. The proposed approach iteratively estimates each\nelement of the topological ordering from backward and its parent using the\ninverse of a partial covariance matrix. The proposed method successfully\nrecovers the underlying structure when Bayesian regularization for the inverse\ncovariance matrix with unequal shrinkage is applied. Specifically, it is shown\nthat sample sizes of $n = \\Omega( d_M^2 \\log p)$ and $n = \\Omega(d_M^2\np^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian\nnetworks with sub-Gaussian and 4m-th bounded-moment error distributions,\nrespectively, where $p$ is the number of nodes and $d_M$ is the maximum degree\nof the moralized graph. The theoretical findings are supported by extensive\nsimulation studies and real data analysis. Furthermore, the proposed\nmethod is demonstrated to outperform state-of-the-art frequentist approaches,\nsuch as the BHLSM, LISTEN, and TD algorithms, on synthetic data."}, "http://arxiv.org/abs/2311.15715": {"title": "Spatio-temporal insights for wind energy harvesting in South Africa", "link": "http://arxiv.org/abs/2311.15715", "description": "Understanding complex spatial dependency structures is a crucial\nconsideration when attempting to build a modeling framework for wind speeds.\nIdeally, wind speed modeling should be very efficient since the wind speed can\nvary significantly from day to day or even hour to hour. However, complex models\nusually require substantial computational resources. This paper illustrates how to\nconstruct and implement a hierarchical Bayesian model for wind speeds using the\nWeibull density function based on a continuously-indexed spatial field. 
For\nefficient (near real-time) inference, the proposed model is implemented in the R\npackage R-INLA, based on the integrated nested Laplace approximation (INLA).\nSpecific attention is given to the theoretical and practical considerations of\nincluding a spatial component within a Bayesian hierarchical model. The\nproposed model is then applied and evaluated using a large volume of real data\nsourced from the coastal regions of South Africa between 2011 and 2021. By\nprojecting the mean and standard deviation of the Matern field, the results\nshow that the spatial modeling component effectively captures variation in\nwind speeds that cannot be explained by the other model components. The mean\nof the spatial field varies between $\\pm 0.3$ across the domain. These insights\nare valuable for the planning and implementation of green energy resources such as\nwind farms in South Africa. Furthermore, shortcomings in the spatial sampling\ndomain are evident in the analysis, which is important for future sampling\nstrategies. The proposed model, and the conglomerated dataset, can serve as a\nfoundational framework for future investigations into wind energy in South\nAfrica."}, "http://arxiv.org/abs/2311.15860": {"title": "Frequentist Prediction Sets for Species Abundance using Indirect Information", "link": "http://arxiv.org/abs/2311.15860", "description": "Citizen science databases that consist of volunteer-led sampling efforts of\nspecies communities are relied on as essential sources of data in ecology.\nSummarizing such data across counties with frequentist-valid prediction sets\nfor each county provides an interpretable comparison across counties of varying\nsize or composition. As citizen science data often feature unequal sampling\nefforts across a spatial domain, prediction sets constructed with indirect\nmethods that share information across counties may be used to improve\nprecision. In this article, we present a nonparametric framework to obtain\nprecise prediction sets for a multinomial random sample based on indirect\ninformation that maintain frequentist coverage guarantees for each county. We\ndetail a simple algorithm to obtain prediction sets for each county using\nindirect information where the computation time does not depend on the sample\nsize and scales nicely with the number of species considered. The indirect\ninformation may be estimated by a proposed empirical Bayes procedure based on\ninformation from auxiliary data. Our approach makes inference for under-sampled\ncounties more precise, while maintaining area-specific frequentist validity for\neach county. Our method is used to provide a useful description of avian\nspecies abundance in North Carolina, USA based on citizen science data from the\neBird database."}, "http://arxiv.org/abs/2311.15982": {"title": "Stab-GKnock: Controlled variable selection for partially linear models using generalized knockoffs", "link": "http://arxiv.org/abs/2311.15982", "description": "The recently proposed fixed-X knockoff is a powerful variable selection\nprocedure that controls the false discovery rate (FDR) in any finite-sample\nsetting, yet its theoretical insights are difficult to show beyond Gaussian\nlinear models. In this paper, we make the first attempt to extend the fixed-X\nknockoff to partially linear models by using generalized knockoff features, and\npropose a new stability generalized knockoff (Stab-GKnock) procedure by\nincorporating selection probability as the feature importance score. 
We provide FDR\ncontrol and power guarantees under some regularity conditions. In addition, we\npropose a two-stage method under high dimensionality by introducing a new joint\nfeature screening procedure, with a guaranteed sure screening property. Extensive\nsimulation studies are conducted to evaluate the finite-sample performance of\nthe proposed method. A real data example is also provided for illustration."}, "http://arxiv.org/abs/2311.15988": {"title": "A novel CFA+EFA model to detect aberrant respondents", "link": "http://arxiv.org/abs/2311.15988", "description": "Aberrant respondents are common yet extremely detrimental to the quality\nof social surveys or questionnaires. Recently, factor mixture models have been\nemployed to identify individuals providing deceptive or careless responses. We\npropose a comprehensive factor mixture model that combines confirmatory and\nexploratory factor models to represent both the non-aberrant and aberrant\ncomponents of the responses. The flexibility of the proposed solution allows\nfor the identification of two of the most common aberrant response styles,\nnamely faking and careless responding. We validated our approach by means of\ntwo simulations and two case studies. The results indicate the effectiveness of\nthe proposed model in handling aberrant responses in social and behavioral\nsurveys."}, "http://arxiv.org/abs/2311.16025": {"title": "Change Point Detection for Random Objects using Distance Profiles", "link": "http://arxiv.org/abs/2311.16025", "description": "We introduce a new powerful scan statistic and an associated test for\ndetecting the presence and pinpointing the location of a change point within\nthe distribution of a data sequence where the data elements take values in a\ngeneral separable metric space $(\\Omega, d)$. These change points mark abrupt\nshifts in the distribution of the data sequence. Our method hinges on distance\nprofiles, where the distance profile of an element $\\omega \\in \\Omega$ is the\ndistribution of distances from $\\omega$ as dictated by the data. Our approach\nis fully non-parametric and universally applicable to diverse data types,\nincluding distributional and network data, as long as distances between the\ndata objects are available. From a practical point of view, it is nearly\ntuning parameter-free, except for the specification of cut-off intervals near\nthe endpoints where change points are assumed not to occur. Our theoretical\nresults include a precise characterization of the asymptotic distribution of\nthe test statistic under the null hypothesis of no change points and rigorous\nguarantees on the consistency of the test in the presence of change points\nunder contiguous alternatives, as well as for the consistency of the estimated\nchange point location. Through comprehensive simulation studies encompassing\nmultivariate data, bivariate distributional data and sequences of graph\nLaplacians, we demonstrate the effectiveness of our approach in both change\npoint detection power and estimating the location of the change point. We apply\nour method to real datasets, including U.S. 
electricity generation compositions\nand Bluetooth proximity networks, underscoring its practical relevance."}, "http://arxiv.org/abs/2203.02849": {"title": "Variable Selection with the Knockoffs: Composite Null Hypotheses", "link": "http://arxiv.org/abs/2203.02849", "description": "The fixed-X knockoff filter is a flexible framework for variable selection\nwith false discovery rate (FDR) control in linear models with arbitrary design\nmatrices (of full column rank) and it allows for finite-sample selective\ninference via the Lasso estimates. In this paper, we extend the theory of the\nknockoff procedure to tests with composite null hypotheses, which are usually\nmore relevant to real-world problems. The main technical challenge lies in\nhandling composite nulls in tandem with dependent features from arbitrary\ndesigns. We develop two methods for composite inference with the knockoffs,\nnamely, shifted ordinary least-squares (S-OLS) and feature-response product\nperturbation (FRPP), building on new structural properties of test statistics\nunder composite nulls. We also propose two heuristic variants of the S-OLS method\nthat outperform the celebrated Benjamini-Hochberg (BH) procedure for composite\nnulls, which serves as a heuristic baseline under dependent test statistics.\nFinally, we analyze the loss in FDR when the original knockoff procedure is\nnaively applied on composite tests."}, "http://arxiv.org/abs/2205.02617": {"title": "COMBSS: Best Subset Selection via Continuous Optimization", "link": "http://arxiv.org/abs/2205.02617", "description": "The problem of best subset selection in linear regression is considered with\nthe aim to find a fixed size subset of features that best fits the response.\nThis is particularly challenging when the total available number of features is\nvery large compared to the number of data samples. Existing optimal methods for\nsolving this problem tend to be slow while fast methods tend to have low\naccuracy. Ideally, new methods would perform best subset selection faster than\nexisting optimal methods while achieving comparable accuracy, or be more accurate\nthan methods of comparable computational speed. Here, we propose a novel\ncontinuous optimization method that identifies a subset solution path, a small\nset of models of varying size that consists of candidates for the single best\nsubset of features and that is optimal in a specific sense in linear regression.\nOur method turns out to be fast, making best subset selection possible when\nthe number of features is well in excess of thousands. Because of the\noutstanding overall performance, framing the best subset selection challenge as\na continuous optimization problem opens new research directions for feature\nextraction for a large variety of regression models."}, "http://arxiv.org/abs/2210.09339": {"title": "Probability Weighted Clustered Coefficients Regression Models in Complex Survey Sampling", "link": "http://arxiv.org/abs/2210.09339", "description": "Regression analysis is commonly conducted in survey sampling. However,\nexisting methods fail when the relationships vary across different areas or\ndomains. In this paper, we propose a unified framework to study the group-wise\ncovariate effect under complex survey sampling based on pairwise penalties, and\nthe associated objective function is solved by the alternating direction method\nof multipliers. Theoretical properties of the proposed method are investigated\nunder some general conditions. 
Numerical experiments demonstrate the\nsuperiority of the proposed method in terms of identifying groups and\nestimation efficiency for both linear regression models and logistic regression\nmodels."}, "http://arxiv.org/abs/2211.00460": {"title": "Augmentation Invariant Manifold Learning", "link": "http://arxiv.org/abs/2211.00460", "description": "Data augmentation is a widely used technique and an essential ingredient in\nthe recent advance in self-supervised representation learning. By preserving\nthe similarity between augmented data, the resulting data representation can\nimprove various downstream analyses and achieve state-of-the-art performance in\nmany applications. Despite the empirical effectiveness, most existing methods\nlack theoretical understanding under a general nonlinear setting. To fill this\ngap, we develop a statistical framework on a low-dimensional product manifold to\nmodel the data augmentation transformation. Under this framework, we introduce\na new representation learning method called augmentation invariant manifold\nlearning and design a computationally efficient algorithm by reformulating it\nas a stochastic optimization problem. Compared with existing self-supervised\nmethods, the new method simultaneously exploits the manifold's geometric\nstructure and the invariance property of augmented data and has an explicit\ntheoretical guarantee. Our theoretical investigation characterizes the role of\ndata augmentation in the proposed method and reveals why and how the data\nrepresentation learned from augmented data can improve the $k$-nearest neighbor\nclassifier in the downstream analysis, showing that a more complex data\naugmentation leads to more improvement in downstream analysis. Finally,\nnumerical experiments on simulated and real datasets are presented to\ndemonstrate the merit of the proposed method."}, "http://arxiv.org/abs/2212.05831": {"title": "Conditional-mean Multiplicative Operator Models for Count Time Series", "link": "http://arxiv.org/abs/2212.05831", "description": "Multiplicative error models (MEMs) are commonly used for real-valued time\nseries, but they cannot be applied to discrete-valued count time series as the\ninvolved multiplication would not preserve the integer nature of the data.\nThus, the concept of a multiplicative operator for counts is proposed (as well\nas several specific instances thereof), which are then used to develop a class\nof MEMs for count time series (CMEMs). If equipped with a linear conditional\nmean, the resulting CMEMs are closely related to the class of so-called\ninteger-valued generalized autoregressive conditional heteroscedasticity\n(INGARCH) models and might be used as a semi-parametric extension thereof.\nImportant stochastic properties of different types of INGARCH-CMEM, as well as\nrelevant estimation approaches, are derived, namely quasi-maximum\nlikelihood and weighted least squares estimation. The performance and\napplication are demonstrated with simulations as well as with two real-world\ndata examples."}, "http://arxiv.org/abs/2301.01381": {"title": "Testing High-dimensional Multinomials with Applications to Text Analysis", "link": "http://arxiv.org/abs/2301.01381", "description": "Motivated by applications in text mining and discrete distribution inference,\nwe investigate testing for the equality of probability mass functions of $K$\ngroups of high-dimensional multinomial distributions. 
A test statistic, which\nis shown to have an asymptotic standard normal distribution under the null, is\nproposed. The optimal detection boundary is established, and the proposed test\nis shown to achieve this optimal detection boundary across the entire parameter\nspace of interest. The proposed method is demonstrated in simulation studies\nand applied to analyze two real-world datasets to examine variation among\nconsumer reviews of Amazon movies and diversity of statistical paper abstracts."}, "http://arxiv.org/abs/2303.03092": {"title": "Environment Invariant Linear Least Squares", "link": "http://arxiv.org/abs/2303.03092", "description": "This paper considers a multi-environment linear regression model in which\ndata from multiple experimental settings are collected. The joint distribution\nof the response variable and covariates may vary across different environments,\nyet the conditional expectations of $y$ given the unknown set of important\nvariables are invariant. Such a statistical model is related to the problem of\nendogeneity, causal inference, and transfer learning. The motivation behind it\nis illustrated by how the goals of prediction and attribution are inherent in\nestimating the true parameter and the important variable set. We construct a\nnovel environment invariant linear least squares (EILLS) objective function, a\nmulti-environment version of linear least-squares regression that leverages the\nabove conditional expectation invariance structure and heterogeneity among\ndifferent environments to determine the true parameter. Our proposed method is\napplicable without any additional structural knowledge and can identify the\ntrue parameter under a near-minimal identification condition. We establish\nnon-asymptotic $\\ell_2$ error bounds on the estimation error for the EILLS\nestimator in the presence of spurious variables. Moreover, we further show that\nthe $\\ell_0$ penalized EILLS estimator can achieve variable selection\nconsistency in high-dimensional regimes. These non-asymptotic results\ndemonstrate the sample efficiency of the EILLS estimator and its capability to\ncircumvent the curse of endogeneity in an algorithmic manner without any prior\nstructural knowledge. To the best of our knowledge, this paper is the first to\nrealize statistically efficient invariance learning in the general linear\nmodel."}, "http://arxiv.org/abs/2305.07721": {"title": "Designing Optimal Behavioral Experiments Using Machine Learning", "link": "http://arxiv.org/abs/2305.07721", "description": "Computational models are powerful tools for understanding human cognition and\nbehavior. They let us express our theories clearly and precisely, and offer\npredictions that can be subtle and often counter-intuitive. However, this same\nrichness and ability to surprise means our scientific intuitions and\ntraditional tools are ill-suited to designing experiments to test and compare\nthese models. To avoid these pitfalls and realize the full potential of\ncomputational modeling, we require tools to design experiments that provide\nclear answers about what models explain human behavior and the auxiliary\nassumptions those models must make. Bayesian optimal experimental design (BOED)\nformalizes the search for optimal experimental designs by identifying\nexperiments that are expected to yield informative data. 
In this work, we\nprovide a tutorial on leveraging recent advances in BOED and machine learning\nto find optimal experiments for any kind of model that we can simulate data\nfrom, and show how by-products of this procedure allow for quick and\nstraightforward evaluation of models and their parameters against real\nexperimental data. As a case study, we consider theories of how people balance\nexploration and exploitation in multi-armed bandit decision-making tasks. We\nvalidate the presented approach using simulations and a real-world experiment.\nAs compared to experimental designs commonly used in the literature, we show\nthat our optimal designs more efficiently determine which of a set of models\nbest account for individual human behavior, and more efficiently characterize\nbehavior given a preferred model. At the same time, formalizing a scientific\nquestion such that it can be adequately addressed with BOED can be challenging\nand we discuss several potential caveats and pitfalls that practitioners should\nbe aware of. We provide code and tutorial notebooks to replicate all analyses."}, "http://arxiv.org/abs/2307.05251": {"title": "Minimizing robust density power-based divergences for general parametric density models", "link": "http://arxiv.org/abs/2307.05251", "description": "Density power divergence (DPD) is designed to robustly estimate the\nunderlying distribution of observations, in the presence of outliers. However,\nDPD involves an integral of the power of the parametric density models to be\nestimated; the explicit form of the integral term can be derived only for\nspecific densities, such as normal and exponential densities. While we may\nperform a numerical integration for each iteration of the optimization\nalgorithms, the computational complexity has hindered the practical application\nof DPD-based estimation to more general parametric densities. To address the\nissue, this study introduces a stochastic approach to minimize DPD for general\nparametric density models. The proposed approach also can be employed to\nminimize other density power-based $\\gamma$-divergences, by leveraging\nunnormalized models."}, "http://arxiv.org/abs/2307.15176": {"title": "RCT Rejection Sampling for Causal Estimation Evaluation", "link": "http://arxiv.org/abs/2307.15176", "description": "Confounding is a significant obstacle to unbiased estimation of causal\neffects from observational data. For settings with high-dimensional covariates\n-- such as text data, genomics, or the behavioral social sciences --\nresearchers have proposed methods to adjust for confounding by adapting machine\nlearning methods to the goal of causal estimation. However, empirical\nevaluation of these adjustment methods has been challenging and limited. In\nthis work, we build on a promising empirical evaluation strategy that\nsimplifies evaluation design and uses real data: subsampling randomized\ncontrolled trials (RCTs) to create confounded observational datasets while\nusing the average causal effects from the RCTs as ground-truth. We contribute a\nnew sampling algorithm, which we call RCT rejection sampling, and provide\ntheoretical guarantees that causal identification holds in the observational\ndata to allow for valid comparisons to the ground-truth RCT. Using synthetic\ndata, we show our algorithm indeed results in low bias when oracle estimators\nare evaluated on the confounded samples, which is not always the case for a\npreviously proposed algorithm. 
In addition to this identification result, we\nhighlight several finite data considerations for evaluation designers who plan\nto use RCT rejection sampling on their own datasets. As a proof of concept, we\nimplement an example evaluation pipeline and walk through these finite data\nconsiderations with a novel, real-world RCT -- which we release publicly --\nconsisting of approximately 70k observations and text data as high-dimensional\ncovariates. Together, these contributions build towards a broader agenda of\nimproved empirical evaluation for causal estimation."}, "http://arxiv.org/abs/2309.09111": {"title": "Reducing sequential change detection to sequential estimation", "link": "http://arxiv.org/abs/2309.09111", "description": "We consider the problem of sequential change detection, where the goal is to\ndesign a scheme for detecting any changes in a parameter or functional $\\theta$\nof the data stream distribution that has small detection delay, but guarantees\ncontrol on the frequency of false alarms in the absence of changes. In this\npaper, we describe a simple reduction from sequential change detection to\nsequential estimation using confidence sequences: we begin a new\n$(1-\\alpha)$-confidence sequence at each time step, and proclaim a change when\nthe intersection of all active confidence sequences becomes empty. We prove\nthat the average run length is at least $1/\\alpha$, resulting in a change\ndetection scheme with minimal structural assumptions (thus allowing for\npossibly dependent observations and nonparametric distribution classes), but\nstrong guarantees. Our approach bears an interesting parallel with the\nreduction from change detection to sequential testing of Lorden (1971) and the\ne-detector of Shin et al. (2022)."}, "http://arxiv.org/abs/2311.16181": {"title": "mvlearnR and Shiny App for multiview learning", "link": "http://arxiv.org/abs/2311.16181", "description": "The package mvlearnR and the accompanying Shiny App are intended for integrating\ndata from multiple sources or views or modalities (e.g. genomics, proteomics,\nclinical and demographic data). Most existing software packages for multiview\nlearning are decentralized and offer limited capabilities, making it difficult\nfor users to perform comprehensive integrative analysis. The new package wraps\nstatistical and machine learning methods and graphical tools, providing a\nconvenient and easy data integration workflow. For users with limited\nprogramming experience, we provide a Shiny Application to facilitate data\nintegration anywhere and on any device. The methods have the potential to offer\ndeeper insights into complex disease mechanisms.\n\nAvailability and Implementation: mvlearnR is available from the following\nGitHub repository: https://github.com/lasandrall/mvlearnR. The web application\nis hosted on shinyapps.io and available at:\nhttps://multi-viewlearn.shinyapps.io/MultiView_Modeling/"}, "http://arxiv.org/abs/2311.16286": {"title": "A statistical approach to latent dynamic modeling with differential equations", "link": "http://arxiv.org/abs/2311.16286", "description": "Ordinary differential equations (ODEs) can provide mechanistic models of\ntemporally local changes of processes, where parameters are often informed by\nexternal knowledge. While ODEs are popular in systems modeling, they are less\nestablished for statistical modeling of longitudinal cohort data, e.g., in a\nclinical setting. 
Yet, modeling of local changes could also be attractive for\nassessing the trajectory of an individual in a cohort in the immediate future\ngiven its current status, where ODE parameters could be informed by further\ncharacteristics of the individual. However, several hurdles so far limit such\nuse of ODEs, as compared to regression-based function fitting approaches. The\npotentially higher level of noise in cohort data might be detrimental to ODEs,\nas the shape of the ODE solution heavily depends on the initial value. In\naddition, larger numbers of variables multiply such problems and might be\ndifficult to handle for ODEs. To address this, we propose to use each\nobservation in the course of time as the initial value to obtain multiple local\nODE solutions and build a combined estimator of the underlying dynamics. Neural\nnetworks are used for obtaining a low-dimensional latent space for dynamic\nmodeling from a potentially large number of variables, and for obtaining\npatient-specific ODE parameters from baseline variables. Simultaneous\nidentification of dynamic models and of a latent space is enabled by recently\ndeveloped differentiable programming techniques. We illustrate the proposed\napproach in an application with spinal muscular atrophy patients and a\ncorresponding simulation study. In particular, modeling of local changes in\nhealth status at any point in time is contrasted to the interpretation of\nfunctions obtained from a global regression. This more generally highlights how\ndifferent application settings might demand different modeling strategies."}, "http://arxiv.org/abs/2311.16375": {"title": "Testing for a difference in means of a single feature after clustering", "link": "http://arxiv.org/abs/2311.16375", "description": "For many applications, it is critical to interpret and validate groups of\nobservations obtained via clustering. A common validation approach involves\ntesting differences in feature means between observations in two estimated\nclusters. In this setting, classical hypothesis tests lead to an inflated Type\nI error rate. To overcome this problem, we propose a new test for the\ndifference in means in a single feature between a pair of clusters obtained\nusing hierarchical or $k$-means clustering. The test based on the proposed\n$p$-value controls the selective Type I error rate in finite samples and can be\nefficiently computed. We further illustrate the validity and power of our\nproposal in simulation and demonstrate its use on single-cell RNA-sequencing\ndata."}, "http://arxiv.org/abs/2311.16451": {"title": "Variational Inference for the Latent Shrinkage Position Model", "link": "http://arxiv.org/abs/2311.16451", "description": "The latent position model (LPM) is a popular method used in network data\nanalysis where nodes are assumed to be positioned in a $p$-dimensional latent\nspace. The latent shrinkage position model (LSPM) is an extension of the LPM\nwhich automatically determines the number of effective dimensions of the latent\nspace via a Bayesian nonparametric shrinkage prior. However, the LSPM reliance\non Markov chain Monte Carlo for inference, while rigorous, is computationally\nexpensive, making it challenging to scale to networks with large numbers of\nnodes. We introduce a variational inference approach for the LSPM, aiming to\nreduce computational demands while retaining the model's ability to\nintrinsically determine the number of effective latent dimensions. 
The\nperformance of the variational LSPM is illustrated through simulation studies\nand its application to real-world network data. To promote wider adoption and\nease of implementation, we also provide open-source code."}, "http://arxiv.org/abs/2311.16529": {"title": "Efficient and Globally Robust Causal Excursion Effect Estimation", "link": "http://arxiv.org/abs/2311.16529", "description": "The causal excursion effect (CEE) characterizes the effect of an intervention\nunder policies that deviate from the experimental policy. It is widely used to\nstudy the effect of time-varying interventions that have the potential to be\nfrequently adaptive, such as those delivered through smartphones. We study the\nsemiparametric efficient estimation of the CEE and derive a semiparametric\nefficiency bound for the CEE with identity or log link functions under working\nassumptions. We propose a class of two-stage estimators that achieve the\nefficiency bound and are robust to misspecified nuisance models. In deriving\nthe asymptotic properties of the estimators, we establish a general theory for\nglobally robust Z-estimators with either cross-fitted or non-cross-fitted\nnuisance parameters. We demonstrate the substantial efficiency gain of the proposed\nestimator compared to existing ones through simulations and a real data\napplication using the Drink Less micro-randomized trial."}, "http://arxiv.org/abs/2311.16598": {"title": "Rectangular Hull Confidence Regions for Multivariate Parameters", "link": "http://arxiv.org/abs/2311.16598", "description": "We introduce three notions of multivariate median bias, namely, rectilinear,\nTukey, and orthant median bias. Each of these median biases is zero under a\nsuitable notion of multivariate symmetry. We study the coverage probabilities\nof the rectangular hull of $B$ independent multivariate estimators, with special\nattention to the number of estimators $B$ needed to ensure a miscoverage of at\nmost $\\alpha$. It is proved that for estimators with zero orthant median bias,\nwe need $B\\geq c\\log_2(d/\\alpha)$ for some constant $c > 0$. Finally, we show\nthat there exists an asymptotically valid (non-trivial) confidence region for a\nmultivariate parameter $\\theta_0$ if and only if there exists a (non-trivial)\nestimator with an asymptotic orthant median bias of zero."}, "http://arxiv.org/abs/2311.16614": {"title": "A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections", "link": "http://arxiv.org/abs/2311.16614", "description": "Unimodality, pivotal in statistical analysis, offers insights into dataset\nstructures and drives sophisticated analytical procedures. While unimodality's\nconfirmation is straightforward for one-dimensional data using methods like\nSilverman's approach and Hartigans' dip statistic, its generalization to higher\ndimensions remains challenging. By extrapolating one-dimensional unimodality\nprinciples to multi-dimensional spaces through linear random projections and\nleveraging point-to-point distancing, our method, rooted in\n$\\alpha$-unimodality assumptions, presents a novel multivariate unimodality\ntest named mud-pod. 
Both theoretical and empirical studies confirm the efficacy\nof our method in unimodality assessment of multidimensional datasets as well as\nin estimating the number of clusters."}, "http://arxiv.org/abs/2311.16793": {"title": "Mediation pathway selection with unmeasured mediator-outcome confounding", "link": "http://arxiv.org/abs/2311.16793", "description": "Causal mediation analysis aims to investigate how an intermediary factor,\ncalled a mediator, regulates the causal effect of a treatment on an outcome.\nWith the increasing availability of measurements on a large number of potential\nmediators, methods for selecting important mediators have been proposed.\nHowever, these methods often assume the absence of unmeasured mediator-outcome\nconfounding. We allow for such confounding in a linear structural equation\nmodel for the outcome and further propose an approach to tackle the mediator\nselection issue. To achieve this, we firstly identify causal parameters by\nconstructing a pseudo proxy variable for unmeasured confounding. Leveraging\nthis proxy variable, we propose a partially penalized method to identify\nmediators affecting the outcome. The resultant estimates are consistent, and\nthe estimates of nonzero parameters are asymptotically normal. Motivated by\nthese results, we introduce a two-step procedure to consistently select active\nmediation pathways, eliminating the need to test composite null hypotheses for\neach mediator that are commonly required by traditional methods. Simulation\nstudies demonstrate the superior performance of our approach compared to\nexisting methods. Finally, we apply our approach to genomic data, identifying\ngene expressions that potentially mediate the impact of a genetic variant on\nmouse obesity."}, "http://arxiv.org/abs/2311.16941": {"title": "Debiasing Multimodal Models via Causal Information Minimization", "link": "http://arxiv.org/abs/2311.16941", "description": "Most existing debiasing methods for multimodal models, including causal\nintervention and inference methods, utilize approximate heuristics to represent\nthe biases, such as shallow features from early stages of training or unimodal\nfeatures for multimodal tasks like VQA, etc., which may not be accurate. In\nthis paper, we study bias arising from confounders in a causal graph for\nmultimodal data and examine a novel approach that leverages causally-motivated\ninformation minimization to learn the confounder representations. Robust\npredictive features contain diverse information that helps a model generalize\nto out-of-distribution data. Hence, minimizing the information content of\nfeatures obtained from a pretrained biased model helps learn the simplest\npredictive features that capture the underlying data distribution. We treat\nthese features as confounder representations and use them via methods motivated\nby causal theory to remove bias from models. We find that the learned\nconfounder representations indeed capture dataset biases, and the proposed\ndebiasing methods improve out-of-distribution (OOD) performance on multiple\nmultimodal datasets without sacrificing in-distribution performance.\nAdditionally, we introduce a novel metric to quantify the sufficiency of\nspurious features in models' predictions that further demonstrates the\neffectiveness of our proposed methods. 
Our code is available at:\nhttps://github.com/Vaidehi99/CausalInfoMin"}, "http://arxiv.org/abs/2311.16984": {"title": "FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings", "link": "http://arxiv.org/abs/2311.16984", "description": "External control arms (ECA) can inform the early clinical development of\nexperimental drugs and provide efficacy evidence for regulatory approval in\nnon-randomized settings. However, the main challenge of implementing ECA lies\nin accessing real-world data or historical clinical trials. Indeed, data\nsharing is often not feasible due to privacy considerations related to data\nleaving the original collection centers, along with pharmaceutical companies'\ncompetitive motives. In this paper, we leverage a privacy-enhancing technology\ncalled federated learning (FL) to remove some of the barriers to data sharing.\nWe introduce a federated learning inverse probability of treatment weighted\n(IPTW) method for time-to-event outcomes called FedECA which eases the\nimplementation of ECA by limiting patients' data exposure. We show with\nextensive experiments that FedECA outperforms its closest competitor,\nmatching-adjusted indirect comparison (MAIC), in terms of statistical power and\nability to balance the treatment and control groups. To encourage the use of\nsuch methods, we publicly release our code which relies on Substra, an\nopen-source FL software with proven experience in privacy-sensitive contexts."}, "http://arxiv.org/abs/2311.16988": {"title": "A Wasserstein-type Distance for Gaussian Mixtures on Vector Bundles with Applications to Shape Analysis", "link": "http://arxiv.org/abs/2311.16988", "description": "This paper uses sample data to study the problem of comparing populations on\nfinite-dimensional parallelizable Riemannian manifolds and more general trivial\nvector bundles. Utilizing triviality, our framework represents populations as\nmixtures of Gaussians on vector bundles and estimates the population parameters\nusing a mode-based clustering algorithm. We derive a Wasserstein-type metric\nbetween Gaussian mixtures, adapted to the manifold geometry, in order to\ncompare estimated distributions. Our contributions include an identifiability\nresult for Gaussian mixtures on manifold domains and a convenient\ncharacterization of optimal couplings of Gaussian mixtures under the derived\nmetric. We demonstrate these tools on some example domains, including the\npre-shape space of planar closed curves, with applications to the shape space\nof triangles and populations of nanoparticles. In the nanoparticle application,\nwe consider a sequence of populations of particle shapes arising from a\nmanufacturing process, and utilize the Wasserstein-type distance to perform\nchange-point detection."}, "http://arxiv.org/abs/2008.11477": {"title": "Bellman filtering and smoothing for state-space models", "link": "http://arxiv.org/abs/2008.11477", "description": "This paper presents a new filter for state-space models based on Bellman's\ndynamic-programming principle, allowing for nonlinearity, non-Gaussianity and\ndegeneracy in the observation and/or state-transition equations. The resulting\nBellman filter is a direct generalisation of the (iterated and extended) Kalman\nfilter, enabling scalability to higher dimensions while remaining\ncomputationally inexpensive. It can also be extended to enable smoothing. 
Under\nsuitable conditions, the Bellman-filtered states are stable over time and\ncontractive towards a region around the true state at every time step. Static\n(hyper)parameters are estimated by maximising a filter-implied pseudo\nlog-likelihood decomposition. In univariate simulation studies, the Bellman\nfilter performs on par with state-of-the-art simulation-based techniques at a\nfraction of the computational cost. In two empirical applications, involving up\nto 150 spatial dimensions or highly degenerate/nonlinear state dynamics, the\nBellman filter outperforms competing methods in both accuracy and speed."}, "http://arxiv.org/abs/2205.05955": {"title": "Bayesian inference for stochastic oscillatory systems using the phase-corrected Linear Noise Approximation", "link": "http://arxiv.org/abs/2205.05955", "description": "Likelihood-based inference in stochastic non-linear dynamical systems, such\nas those found in chemical reaction networks and biological clock systems, is\ninherently complex and has largely been limited to small and unrealistically\nsimple systems. Recent advances in analytically tractable approximations to the\nunderlying conditional probability distributions enable long-term dynamics to\nbe accurately modelled, and make the large number of model evaluations required\nfor exact Bayesian inference much more feasible. We propose a new methodology\nfor inference in stochastic non-linear dynamical systems exhibiting oscillatory\nbehaviour and show the parameters in these models can be realistically\nestimated from simulated data. Preliminary analyses based on the Fisher\nInformation Matrix of the model can guide the implementation of Bayesian\ninference. We show that this parameter sensitivity analysis can predict which\nparameters are practically identifiable. Several Markov chain Monte Carlo\nalgorithms are compared, with our results suggesting a parallel tempering\nalgorithm consistently gives the best approach for these systems, which are\nshown to frequently exhibit multi-modal posterior distributions."}, "http://arxiv.org/abs/2206.10479": {"title": "Policy Learning with Asymmetric Counterfactual Utilities", "link": "http://arxiv.org/abs/2206.10479", "description": "Data-driven decision making plays an important role even in high stakes\nsettings like medicine and public policy. Learning optimal policies from\nobserved data requires a careful formulation of the utility function whose\nexpected value is maximized across a population. Although researchers typically\nuse utilities that depend on observed outcomes alone, in many settings the\ndecision maker's utility function is more properly characterized by the joint\nset of potential outcomes under all actions. For example, the Hippocratic\nprinciple to \"do no harm\" implies that the cost of causing death to a patient\nwho would otherwise survive without treatment is greater than the cost of\nforgoing life-saving treatment. We consider optimal policy learning with\nasymmetric counterfactual utility functions of this form that consider the\njoint set of potential outcomes. We show that asymmetric counterfactual\nutilities lead to an unidentifiable expected utility function, and so we first\npartially identify it. Drawing on statistical decision theory, we then derive\nminimax decision rules by minimizing the maximum expected utility loss relative\nto different alternative policies. 
We show that one can learn minimax loss\ndecision rules from observed data by solving intermediate classification\nproblems, and establish that the finite sample excess expected utility loss of\nthis procedure is bounded by the regret of these intermediate classifiers. We\napply this conceptual framework and methodology to the decision about whether\nor not to use right heart catheterization for patients with possible pulmonary\nhypertension."}, "http://arxiv.org/abs/2209.04716": {"title": "Extrapolation before imputation reduces bias when imputing censored covariates", "link": "http://arxiv.org/abs/2209.04716", "description": "Modeling symptom progression to identify informative subjects for a new\nHuntington's disease clinical trial is problematic since time to diagnosis, a\nkey covariate, can be heavily censored. Imputation is an appealing strategy\nwhere censored covariates are replaced with their conditional means, but\nexisting methods saw over 200% bias under heavy censoring. Calculating these\nconditional means well requires estimating and then integrating over the\nsurvival function of the censored covariate from the censored value to\ninfinity. To estimate the survival function flexibly, existing methods use the\nsemiparametric Cox model with Breslow's estimator, leaving the integrand for\nthe conditional means (the estimated survival function) undefined beyond the\nobserved data. The integral is then estimated up to the largest observed\ncovariate value, and this approximation can cut off the tail of the survival\nfunction and lead to severe bias, particularly under heavy censoring. We\npropose a hybrid approach that splices together the semiparametric survival\nestimator with a parametric extension, making it possible to approximate the\nintegral up to infinity. In simulation studies, our proposed approach of\nextrapolation then imputation substantially reduces the bias seen with existing\nimputation methods, even when the parametric extension was misspecified. We\nfurther demonstrate how imputing with corrected conditional means helps to\nprioritize patients for future clinical trials."}, "http://arxiv.org/abs/2210.03559": {"title": "Estimation of the Order of Non-Parametric Hidden Markov Models using the Singular Values of an Integral Operator", "link": "http://arxiv.org/abs/2210.03559", "description": "We are interested in assessing the order of a finite-state Hidden Markov\nModel (HMM) under only two assumptions: that the transition matrix of the\nlatent Markov chain has full rank and that the density functions of the\nemission distributions are linearly independent. We introduce a new procedure\nfor estimating this order by investigating the rank of some well-chosen\nintegral operator which relies on the distribution of a pair of consecutive\nobservations. This method circumvents the usual limits of the spectral method\nwhen it is used for estimating the order of an HMM: it avoids the choice of the\nbasis functions; it does not require any knowledge of an upper bound on the\norder of the HMM (for the spectral method, such an upper bound is defined by\nthe number of basis functions); and it makes it easy to handle different types of\ndata (including continuous data, circular data or multivariate continuous data)\nwith a suitable choice of kernel. 
The method relies on the fact that the order\nof the HMM can be identified from the distribution of a pair of consecutive\nobservations and that this order is equal to the rank of some integral operator\n(\\emph{i.e.} the number of its singular values that are non-zero). Since only\nthe empirical counter-part of the singular values of the operator can be\nobtained, we propose a data-driven thresholding procedure. An upper-bound on\nthe probability of overestimating the order of the HMM is established.\nMoreover, sufficient conditions on the bandwidth used for kernel density\nestimation and on the threshold are stated to obtain the consistency of the\nestimator of the order of the HMM. The procedure is easily implemented since\nthe values of all the tuning parameters are determined by the sample size."}, "http://arxiv.org/abs/2306.15088": {"title": "Locally tail-scale invariant scoring rules for evaluation of extreme value forecasts", "link": "http://arxiv.org/abs/2306.15088", "description": "Statistical analysis of extremes can be used to predict the probability of\nfuture extreme events, such as large rainfalls or devastating windstorms. The\nquality of these forecasts can be measured through scoring rules. Locally scale\ninvariant scoring rules give equal importance to the forecasts at different\nlocations regardless of differences in the prediction uncertainty. This is a\nuseful feature when computing average scores but can be an unnecessarily strict\nrequirement when mostly concerned with extremes. We propose the concept of\nlocal weight-scale invariance, describing scoring rules fulfilling local scale\ninvariance in a certain region of interest, and as a special case local\ntail-scale invariance, for large events. Moreover, a new version of the\nweighted Continuous Ranked Probability score (wCRPS) called the scaled wCRPS\n(swCRPS) that possesses this property is developed and studied. The score is a\nsuitable alternative for scoring extreme value models over areas with varying\nscale of extreme events, and we derive explicit formulas of the score for the\nGeneralised Extreme Value distribution. The scoring rules are compared through\nsimulation, and their usage is illustrated in modelling of extreme water\nlevels, annual maximum rainfalls, and in an application to non-extreme forecast\nfor the prediction of air pollution."}, "http://arxiv.org/abs/2307.09850": {"title": "Communication-Efficient Distribution-Free Inference Over Networks", "link": "http://arxiv.org/abs/2307.09850", "description": "Consider a star network where each local node possesses a set of test\nstatistics that exhibit a symmetric distribution around zero when their\ncorresponding null hypothesis is true. This paper investigates statistical\ninference problems in networks concerning the aggregation of this general type\nof statistics and global error rate control under communication constraints in\nvarious scenarios. The study proposes communication-efficient algorithms that\nare built on established non-parametric methods, such as the Wilcoxon and sign\ntests, as well as modern inference methods such as the Benjamini-Hochberg (BH)\nand Barber-Candes (BC) procedures, coupled with sampling and quantization\noperations. 
The proposed methods are evaluated through extensive simulation\nstudies."}, "http://arxiv.org/abs/2308.01747": {"title": "Fusion regression methods with repeated functional data", "link": "http://arxiv.org/abs/2308.01747", "description": "Linear regression and classification methods with repeated functional data\nare considered. For each statistical unit in the sample, a real-valued\nparameter is observed over time under different conditions. Two regression\nmethods based on fusion penalties are presented. The first one is a\ngeneralization of the variable fusion methodology based on the 1-nearest\nneighbor. The second one, called group fusion lasso, assumes some grouping\nstructure of conditions and allows for homogeneity among the regression\ncoefficient functions within groups. A finite sample numerical simulation and\nan application on EEG data are presented."}, "http://arxiv.org/abs/2310.00599": {"title": "Approximate filtering via discrete dual processes", "link": "http://arxiv.org/abs/2310.00599", "description": "We consider the task of filtering a dynamic parameter evolving as a diffusion\nprocess, given data collected at discrete times from a likelihood which is\nconjugate to the marginal law of the diffusion, when a generic dual process on\na discrete state space is available. Recently, it was shown that duality with\nrespect to a death-like process implies that the filtering distributions are\nfinite mixtures, making exact filtering and smoothing feasible through\nrecursive algorithms with polynomial complexity in the number of observations.\nHere we provide general results for the case of duality between the diffusion\nand a regular jump continuous-time Markov chain on a discrete state space,\nwhich typically leads to filtering distribution given by countable mixtures\nindexed by the dual process state space. We investigate the performance of\nseveral approximation strategies on two hidden Markov models driven by\nCox-Ingersoll-Ross and Wright-Fisher diffusions, which admit duals of\nbirth-and-death type, and compare them with the available exact strategies\nbased on death-type duals and with bootstrap particle filtering on the\ndiffusion state space as a general benchmark."}, "http://arxiv.org/abs/2311.17100": {"title": "Automatic cross-validation in structured models: Is it time to leave out leave-one-out?", "link": "http://arxiv.org/abs/2311.17100", "description": "Standard techniques such as leave-one-out cross-validation (LOOCV) might not\nbe suitable for evaluating the predictive performance of models incorporating\nstructured random effects. In such cases, the correlation between the training\nand test sets could have a notable impact on the model's prediction error. To\novercome this issue, an automatic group construction procedure for\nleave-group-out cross validation (LGOCV) has recently emerged as a valuable\ntool for enhancing predictive performance measurement in structured models. The\npurpose of this paper is (i) to compare LOOCV and LGOCV within structured\nmodels, emphasizing model selection and predictive performance, and (ii) to\nprovide real data applications in spatial statistics using complex structured\nmodels fitted with INLA, showcasing the utility of the automatic LGOCV method.\nFirst, we briefly review the key aspects of the recently proposed LGOCV method\nfor automatic group construction in latent Gaussian models. 
We also demonstrate\nthe effectiveness of this method for selecting the model with the highest\npredictive performance by simulating extrapolation tasks in both temporal and\nspatial data analyses. Finally, we provide insights into the effectiveness of\nthe LGOCV method in modelling complex structured data, encompassing\nspatio-temporal multivariate count data, spatial compositional data, and\nspatio-temporal geospatial data."}, "http://arxiv.org/abs/2311.17102": {"title": "Splinets -- Orthogonal Splines and FDA for the Classification Problem", "link": "http://arxiv.org/abs/2311.17102", "description": "This study introduces an efficient workflow for functional data analysis in\nclassification problems, utilizing advanced orthogonal spline bases. The\nmethodology is based on the flexible Splinets package, featuring a novel spline\nrepresentation designed for enhanced data efficiency. Several innovative\nfeatures contribute to this efficiency: 1) Utilization of Orthonormal Spline\nBases, 2) Consideration of Spline Support Sets, 3) Data-Driven Knot Selection.\nIllustrating this approach, we applied the workflow to the Fashion MNIST\ndataset. We demonstrate the classification process and highlight significant\nefficiency gains. Particularly noteworthy are the improvements that can be\nachieved through the 2D generalization of our methodology, especially in\nscenarios where data sparsity and dimension reduction are critical factors. A\nkey advantage of our workflow is the projection operation into the space of\nsplines with arbitrarily chosen knots, allowing for versatile functional data\nanalysis associated with classification problems. Moreover, the study explores\nSplinets package features suited for functional data analysis. The algebra and\ncalculus of splines use Taylor expansions at the knots within the support sets.\nVarious orthonormalization techniques for B-splines are implemented, including\nthe highly recommended dyadic method, which leads to the creation of splinets.\nImportantly, the locality of B-splines concerning support sets is preserved in\nthe corresponding splinet. Using this locality, along with implemented\nalgorithms, provides a powerful computational tool for functional data\nanalysis."}, "http://arxiv.org/abs/2311.17111": {"title": "Design of variable acceptance sampling plan for exponential distribution under uncertainty", "link": "http://arxiv.org/abs/2311.17111", "description": "In an acceptance monitoring system, acceptance sampling techniques are used\nto increase production, enhance control, and deliver higher-quality products at\na lesser cost. It might not always be possible to define the acceptance\nsampling plan parameters as exact values, especially when data has\nuncertainty. In this work, acceptance sampling plans for a large number of\nidentical units with exponential lifetimes are obtained by treating acceptable\nquality life, rejectable quality life, consumer's risk, and producer's risk as\nfuzzy parameters. To obtain plan parameters of sequential sampling plans and\nrepetitive group sampling plans, a fuzzy hypothesis test is considered. To\nvalidate the sampling plans obtained in this work, some examples are presented.\nOur results are compared with existing results in the literature. 
Finally, to\ndemonstrate the application of the resulting sampling plans, a real-life case\nstudy is presented."}, "http://arxiv.org/abs/2311.17246": {"title": "Detecting influential observations in single-index models with metric-valued response objects", "link": "http://arxiv.org/abs/2311.17246", "description": "Regression with random data objects is becoming increasingly common in modern\ndata analysis. Unfortunately, like the traditional regression setting with\nEuclidean data, random response regression is not immune to the trouble caused\nby unusual observations. A metric Cook's distance extending the classical\nCook's distances of Cook (1977) to general metric-valued response objects is\nproposed. The performance of the metric Cook's distance in both Euclidean and\nnon-Euclidean response regression with Euclidean predictors is demonstrated in\nan extensive experimental study. A real data analysis of county-level COVID-19\ntransmission in the United States also illustrates the usefulness of this\nmethod in practice."}, "http://arxiv.org/abs/2311.17271": {"title": "Spatial-Temporal Extreme Modeling for Point-to-Area Random Effects (PARE)", "link": "http://arxiv.org/abs/2311.17271", "description": "One measurement modality for rainfall is a fixed location rain gauge.\nHowever, extreme rainfall, flooding, and other climate extremes often occur at\nlarger spatial scales and affect more than one location in a community. For\nexample, in 2017 Hurricane Harvey impacted all of Houston and the surrounding\nregion causing widespread flooding. Flood risk modeling requires understanding\nof rainfall for hydrologic regions, which may contain one or more rain gauges.\nFurther, policy changes to address the risks and damages of natural hazards\nsuch as severe flooding are usually made at the community/neighborhood level or\nhigher geo-spatial scale. Therefore, spatial-temporal methods which convert\nresults from one spatial scale to another are especially useful in applications\nfor evolving environmental extremes. We develop a point-to-area random effects\n(PARE) modeling strategy for understanding spatial-temporal extreme values at\nthe areal level, when the core information are time series at point locations\ndistributed over the region."}, "http://arxiv.org/abs/2311.17303": {"title": "Enhancing the Performance of Neural Networks Through Causal Discovery and Integration of Domain Knowledge", "link": "http://arxiv.org/abs/2311.17303", "description": "In this paper, we develop a generic methodology to encode hierarchical\ncausality structure among observed variables into a neural network in order to\nimprove its predictive performance. The proposed methodology, called\ncausality-informed neural network (CINN), leverages three coherent steps to\nsystematically map the structural causal knowledge into the layer-to-layer\ndesign of neural network while strictly preserving the orientation of every\ncausal relationship. In the first step, CINN discovers causal relationships\nfrom observational data via directed acyclic graph (DAG) learning, where causal\ndiscovery is recast as a continuous optimization problem to avoid the\ncombinatorial nature. In the second step, the discovered hierarchical causality\nstructure among observed variables is systematically encoded into neural\nnetwork through a dedicated architecture and customized loss function. 
By\ncategorizing variables in the causal DAG as root, intermediate, and leaf nodes,\nthe hierarchical causal DAG is translated into CINN with a one-to-one\ncorrespondence between nodes in the causal DAG and units in the CINN while\nmaintaining the relative order among these nodes. Regarding the loss function,\nboth intermediate and leaf nodes in the DAG graph are treated as target outputs\nduring CINN training so as to drive co-learning of causal relationships among\ndifferent types of nodes. As multiple loss components emerge in CINN, we\nleverage the projection of conflicting gradients to mitigate gradient\ninterference among the multiple learning tasks. Computational experiments\nacross a broad spectrum of UCI data sets demonstrate substantial advantages of\nCINN in predictive performance over other state-of-the-art methods. In\naddition, an ablation study underscores the value of integrating structural and\nquantitative causal knowledge in enhancing the neural network's predictive\nperformance incrementally."}, "http://arxiv.org/abs/2311.17445": {"title": "Interaction tests with covariate-adaptive randomization", "link": "http://arxiv.org/abs/2311.17445", "description": "Treatment-covariate interaction tests are commonly applied by researchers to\nexamine whether the treatment effect varies across patient subgroups defined by\nbaseline characteristics. The objective of this study is to explore\ntreatment-covariate interaction tests involving covariate-adaptive\nrandomization. Without assuming a parametric data generation model, we\ninvestigate usual interaction tests and observe that they tend to be\nconservative: specifically, their limiting rejection probabilities under the\nnull hypothesis do not exceed the nominal level and are typically strictly\nlower than it. To address this problem, we propose modifications to the usual\ntests to obtain corresponding exact tests. Moreover, we introduce a novel class\nof stratified-adjusted interaction tests that are simple, broadly applicable,\nand more powerful than the usual and modified tests. Our findings are relevant\nto two types of interaction tests: one involving stratification covariates and\nthe other involving additional covariates that are not used for randomization."}, "http://arxiv.org/abs/2311.17467": {"title": "Design of platform trials with a change in the control treatment arm", "link": "http://arxiv.org/abs/2311.17467", "description": "Platform trials are a more efficient way of testing multiple treatments\ncompared to running separate trials. In this paper we consider platform trials\nwhere, if a treatment is found to be superior to the control, it will become\nthe new standard of care (and the control in the platform). The remaining\ntreatments are then tested against this new control. In such a setting, one can\neither keep the information on both the new standard of care and the other\nactive treatments before the control is changed or one could discard this\ninformation when testing for benefit of the remaining treatments. We will show\nanalytically and numerically that retaining the information collected before\nthe change in control can be detrimental to the power of the study.\nSpecifically, we consider the overall power, the probability that the active\ntreatment with the greatest treatment effect is found during the trial. We also\nconsider the conditional power of the active treatments, the probability a\ngiven treatment can be found superior against the current control. 
We prove\nwhen, in a multi-arm multi-stage trial where no arms are added, retaining the\ninformation is detrimental to both overall and conditional power of the\nremaining treatments. This loss of power is studied for a motivating example.\nWe then discuss the effect on platform trials in which arms are added later. On\nthe basis of these observations we discuss different aspects to consider when\ndeciding whether to run a continuous platform trial or whether one may be\nbetter running a new trial."}, "http://arxiv.org/abs/2311.17476": {"title": "Inference of Sample Complier Average Causal Effects in Completely Randomized Experiments", "link": "http://arxiv.org/abs/2311.17476", "description": "In randomized experiments with non-compliance scholars have argued that the\ncomplier average causal effect (CACE) ought to be the main causal estimand. The\nliterature on inference of the complier average treatment effect (CACE) has\nfocused on inference about the population CACE. However, in general individuals\nin the experiments are volunteers. This means that there is a risk that\nindividuals partaking in a given experiment differ in important ways from a\npopulation of interest. It is thus of interest to focus on the sample at hand\nand have easy to use and correct procedures for inference about the sample\nCACE. We consider a more general setting than in the previous literature and\nconstruct a confidence interval based on the Wald estimator in the form of a\nfinite closed interval that is familiar to practitioners. Furthermore, with the\naccess of pre-treatment covariates, we propose a new regression adjustment\nestimator and associated methods for constructing confidence intervals. Finite\nsample performance of the methods is examined through a Monte Carlo simulation\nand the methods are used in an application to a job training experiment."}, "http://arxiv.org/abs/2311.17547": {"title": "Risk-based decision making: estimands for sequential prediction under interventions", "link": "http://arxiv.org/abs/2311.17547", "description": "Prediction models are used amongst others to inform medical decisions on\ninterventions. Typically, individuals with high risks of adverse outcomes are\nadvised to undergo an intervention while those at low risk are advised to\nrefrain from it. Standard prediction models do not always provide risks that\nare relevant to inform such decisions: e.g., an individual may be estimated to\nbe at low risk because similar individuals in the past received an intervention\nwhich lowered their risk. Therefore, prediction models supporting decisions\nshould target risks belonging to defined intervention strategies. Previous\nworks on prediction under interventions assumed that the prediction model was\nused only at one time point to make an intervention decision. In clinical\npractice, intervention decisions are rarely made only once: they might be\nrepeated, deferred and re-evaluated. This requires estimated risks under\ninterventions that can be reconsidered at several potential decision moments.\nIn the current work, we highlight key considerations for formulating estimands\nin sequential prediction under interventions that can inform such intervention\ndecisions. We illustrate these considerations by giving examples of estimands\nfor a case study about choosing between vaginal delivery and cesarean section\nfor women giving birth. 
Our formalization of prediction tasks in a sequential,\ncausal, and estimand context provides guidance for future studies to ensure\nthat the right question is answered and appropriate causal estimation\napproaches are chosen to develop sequential prediction models that can inform\nintervention decisions."}, "http://arxiv.org/abs/2311.17564": {"title": "Combining Stochastic Tendency and Distribution Overlap Towards Improved Nonparametric Effect Measures and Inference", "link": "http://arxiv.org/abs/2311.17564", "description": "A fundamental functional in nonparametric statistics is the Mann-Whitney\nfunctional ${\\theta} = P(X < Y)$, which constitutes the basis for the most\npopular nonparametric procedures. The functional ${\\theta}$ measures a location\nor stochastic tendency effect between two distributions. A limitation of\n${\\theta}$ is its inability to capture scale differences. If differences of\nthis nature are to be detected, specific tests for scale or omnibus tests need\nto be employed. However, the latter often suffer from low power, and they do\nnot yield interpretable effect measures. In this manuscript, we extend\n${\\theta}$ by additionally incorporating the recently introduced distribution\noverlap index (nonparametric dispersion measure) $I_2$ that can be expressed in\nterms of the quantile process. We derive the joint asymptotic distribution of\nthe respective estimators of ${\\theta}$ and $I_2$ and construct confidence\nregions. Extending the Wilcoxon-Mann-Whitney test, we introduce a new test\nbased on the joint use of these functionals. It results in much larger\nconsistency regions while maintaining competitive power to the rank sum test\nfor situations in which ${\\theta}$ alone would suffice. Compared with classical\nomnibus tests, the simulated power is much improved. Additionally, the newly\nproposed inference method yields effect measures whose interpretation is\nsurprisingly straightforward."}, "http://arxiv.org/abs/2311.17594": {"title": "Scale Invariant Correspondence Analysis", "link": "http://arxiv.org/abs/2311.17594", "description": "Correspondence analysis is a dimension reduction method for visualization of\nnonnegative data sets, in particular contingency tables; but it depends on the\nmarginals of the data set. Two transformations of the data have been proposed\nto render correspondence analysis row and column scale invariant: These two\nkinds of transformations change the initial form of the data set into a\nbistochastic form. The power transformation applied by Greenacre (2010) has one\npositive parameter, while the transformation applied by Mosteller (1968) and\nGoodman (1996) has (I+J) positive parameters, where the raw data is row and\ncolumn scaled by the Sinkhorn (RAS or ipf) algorithm to render it bistochastic.\nGoodman (1996) named correspondence analysis of a bistochastic matrix\nmarginal-free correspondence analysis. We discuss these two transformations\nand further generalize the Mosteller-Goodman approach."}, "http://arxiv.org/abs/2311.17605": {"title": "Improving the Balance of Unobserved Covariates From Information Theory in Multi-Arm Randomization with Unequal Allocation Ratio", "link": "http://arxiv.org/abs/2311.17605", "description": "Multi-arm randomization has seen increasingly widespread application recently, and\nit is crucial to ensure that the distributions of important observed\ncovariates as well as the potential unobserved covariates are similar and\ncomparable among all the treatments. 
However, the theoretical properties of\nunobserved covariate imbalance in multi-arm randomization with unequal\nallocation ratio remain unknown. In this paper, we give a general framework for\nanalysing the moments and distributions of unobserved covariate imbalance and\napply it to different procedures, including complete randomization (CR),\nstratified permuted block (STR-PB) and covariate-adaptive randomization (CAR).\nThe general procedures of multi-arm STR-PB and CAR with unequal allocation\nratio are also proposed. In addition, we introduce the concept of entropy to\nmeasure the correlation between discrete covariates and verify that we could\nutilize the correlation to select observed covariates to help better balance\nthe unobserved covariates."}, "http://arxiv.org/abs/2311.17685": {"title": "Enhancing efficiency and robustness in high-dimensional linear regression with additional unlabeled data", "link": "http://arxiv.org/abs/2311.17685", "description": "In semi-supervised learning, the prevailing understanding suggests that\nobserving additional unlabeled samples improves estimation accuracy for linear\nparameters only in the case of model misspecification. This paper challenges\nthis notion, demonstrating its inaccuracy in high dimensions. Initially\nfocusing on a dense scenario, we introduce robust semi-supervised estimators\nfor the regression coefficient without relying on sparse structures in the\npopulation slope. Even when the true underlying model is linear, we show that\nleveraging information from large-scale unlabeled data improves both estimation\naccuracy and inference robustness. Moreover, we propose semi-supervised methods\nwith further enhanced efficiency in scenarios with a sparse linear slope.\nDiverging from the standard semi-supervised literature, we also allow for\ncovariate shift. The performance of the proposed methods is illustrated through\nextensive numerical studies, including simulations and a real-data application\nto the AIDS Clinical Trials Group Protocol 175 (ACTG175)."}, "http://arxiv.org/abs/2311.17797": {"title": "Learning to Simulate: Generative Metamodeling via Quantile Regression", "link": "http://arxiv.org/abs/2311.17797", "description": "Stochastic simulation models, while effective in capturing the dynamics of\ncomplex systems, are often too slow to run for real-time decision-making.\nMetamodeling techniques are widely used to learn the relationship between a\nsummary statistic of the outputs (e.g., the mean or quantile) and the inputs of\nthe simulator, so that it can be used in real time. However, this methodology\nrequires knowledge of an appropriate summary statistic in advance, making\nit inflexible for many practical situations. In this paper, we propose a new\nmetamodeling concept, called generative metamodeling, which aims to construct a\n\"fast simulator of the simulator\". This technique can generate random outputs\nsubstantially faster than the original simulation model, while retaining an\napproximately equal conditional distribution given the same inputs. 
Once\nconstructed, a generative metamodel can instantaneously generate a large number\nof random outputs as soon as the inputs are specified, thereby facilitating the\nimmediate computation of any summary statistic for real-time decision-making.\nFurthermore, we propose a new algorithm -- quantile-regression-based generative\nmetamodeling (QRGMM) -- and study its convergence and rate of convergence.\nExtensive numerical experiments are conducted to investigate the empirical\nperformance of QRGMM, compare it with other state-of-the-art generative\nalgorithms, and demonstrate its usefulness in practical real-time\ndecision-making."}, "http://arxiv.org/abs/2311.17808": {"title": "A Heteroscedastic Bayesian Generalized Logistic Regression Model with Application to Scaling Problems", "link": "http://arxiv.org/abs/2311.17808", "description": "Scaling models have been used to explore relationships between urban\nindicators and population and, more recently, have been extended to incorporate rural-urban\nindicator densities and population densities. In the scaling framework, power\nlaws and standard linear regression methods are used to estimate model\nparameters with assumed normality and fixed variance. These assumptions,\ninherited in the scaling field, have recently been demonstrated to be\ninadequate and, if left unaddressed, to lead to model bias. Generalized\nlinear models (GLMs) can accommodate a wider range of distributions, but the\nchosen distribution must meet the assumptions of the data to prevent model\nbias. We present a widely applicable Bayesian generalized logistic regression\n(BGLR) framework to flexibly model a continuous real response, addressing skew\nand heteroscedasticity using Markov Chain Monte Carlo (MCMC) methods. The\nGeneralized Logistic Distribution (GLD) robustly models skewed continuous data\ndue to the additional shape parameter. We compare the BGLR model to standard\nand Bayesian normal methods in fitting power laws to COVID-19 data. The BGLR\nprovides additional useful information beyond previous scaling methods,\nuniquely models variance, including a scedasticity parameter, and reveals\nparameter bias in widely used methods."}, "http://arxiv.org/abs/2311.17867": {"title": "A Class of Directed Acyclic Graphs with Mixed Data Types in Mediation Analysis", "link": "http://arxiv.org/abs/2311.17867", "description": "We propose a unified class of generalized structural equation models (GSEMs)\nwith data of mixed types in mediation analysis, including continuous,\ncategorical, and count variables. Such models substantially extend the\nclassical linear structural equation model to accommodate many data types\narising from the application of mediation analysis. Invoking the hierarchical\nmodeling approach, we specify GSEMs by a copula joint distribution of outcome\nvariable, mediator and exposure variable, in which marginal distributions are\nbuilt upon generalized linear models (GLMs) with confounding factors. We\ndiscuss the identifiability conditions for the causal mediation effects in the\ncounterfactual paradigm as well as the issue of mediation leakage, and develop\nan asymptotically efficient profile maximum likelihood estimation and inference\nfor two key mediation estimands, natural direct effect and natural indirect\neffect, in different scenarios of mixed data types. 
The proposed new\nmethodology is illustrated by a motivating epidemiological study that aims to\ninvestigate whether the tempo of reaching infancy BMI peak (delayed or on time),\nan important early life growth milestone, may mediate the association between\nprenatal exposure to phthalates and pubertal health outcomes."}, "http://arxiv.org/abs/2311.17885": {"title": "Are ensembles getting better all the time?", "link": "http://arxiv.org/abs/2311.17885", "description": "Ensemble methods combine the predictions of several base models. We study\nwhether or not including more models in an ensemble always improves its average\nperformance. Such a question depends on the kind of ensemble considered, as\nwell as the predictive metric chosen. We focus on situations where all members\nof the ensemble are a priori expected to perform equally well, which is the case for\nseveral popular methods like random forests or deep ensembles. In this setting,\nwe essentially show that ensembles are getting better all the time if, and only\nif, the considered loss function is convex. More precisely, in that case, the\naverage loss of the ensemble is a decreasing function of the number of models.\nWhen the loss function is nonconvex, we show a series of results that can be\nsummarised by the insight that ensembles of good models keep getting better,\nand ensembles of bad models keep getting worse. To this end, we prove a new\nresult on the monotonicity of tail probabilities that may be of independent\ninterest. We illustrate our results on a simple machine learning problem\n(diagnosing melanomas using neural nets)."}, "http://arxiv.org/abs/2111.07966": {"title": "Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects", "link": "http://arxiv.org/abs/2111.07966", "description": "There are a number of available methods for selecting whom to prioritize for\ntreatment, including ones based on treatment effect estimation, risk scoring,\nand hand-crafted rules. We propose rank-weighted average treatment effect\n(RATE) metrics as a simple and general family of metrics for comparing and\ntesting the quality of treatment prioritization rules. RATE metrics are\nagnostic as to how the prioritization rules were derived, and only assess how\nwell they identify individuals that benefit the most from treatment. We define\na family of RATE estimators and prove a central limit theorem that enables\nasymptotically exact inference in a wide variety of randomized and\nobservational study settings. RATE metrics subsume a number of existing\nmetrics, including the Qini coefficient, and our analysis directly yields\ninference methods for these metrics. We showcase RATE in the context of a\nnumber of applications, including optimal targeting of aspirin to stroke\npatients."}, "http://arxiv.org/abs/2306.02126": {"title": "A Process of Dependent Quantile Pyramids", "link": "http://arxiv.org/abs/2306.02126", "description": "Despite the practicality of quantile regression (QR), simultaneous estimation\nof multiple QR curves continues to be challenging. We address this problem by\nproposing a Bayesian nonparametric framework that generalizes the quantile\npyramid by replacing each scalar variate in the quantile pyramid with a\nstochastic process on a covariate space. We propose a novel approach to show\nthe existence of a quantile pyramid for all quantiles. The process of dependent\nquantile pyramids allows for non-linear QR and automatically ensures\nnon-crossing of QR curves on the covariate space. 
Simulation studies document\nthe performance and robustness of our approach. An application to cyclone\nintensity data is presented."}, "http://arxiv.org/abs/2306.02244": {"title": "Optimal neighbourhood selection in structural equation models", "link": "http://arxiv.org/abs/2306.02244", "description": "We study the optimal sample complexity of neighbourhood selection in linear\nstructural equation models, and compare this to best subset selection (BSS) for\nlinear models under general design. We show by example that -- even when the\nstructure is \\emph{unknown} -- the existence of underlying structure can reduce\nthe sample complexity of neighbourhood selection. This result is complicated by\nthe possibility of path cancellation, which we study in detail, and show that\nimprovements are still possible in the presence of path cancellation. Finally,\nwe support these theoretical observations with experiments. The proof\nintroduces a modified BSS estimator, called klBSS, and compares its performance\nto BSS. The analysis of klBSS may also be of independent interest since it\napplies to arbitrary structured models, not necessarily those induced by a\nstructural equation model. Our results have implications for structure learning\nin graphical models, which often relies on neighbourhood selection as a\nsubroutine."}, "http://arxiv.org/abs/2307.13868": {"title": "Learning sources of variability from high-dimensional observational studies", "link": "http://arxiv.org/abs/2307.13868", "description": "Causal inference studies whether the presence of a variable influences an\nobserved outcome. As measured by quantities such as the \"average treatment\neffect,\" this paradigm is employed across numerous biological fields, from\nvaccine and drug development to policy interventions. Unfortunately, the\nmajority of these methods are often limited to univariate outcomes. Our work\ngeneralizes causal estimands to outcomes with any number of dimensions or any\nmeasurable space, and formulates traditional causal estimands for nominal\nvariables as causal discrepancy tests. We propose a simple technique for\nadjusting universally consistent conditional independence tests and prove that\nthese tests are universally consistent causal discrepancy tests. Numerical\nexperiments illustrate that our method, Causal CDcorr, leads to improvements in\nboth finite sample validity and power when compared to existing strategies. Our\nmethods are all open source and available at github.com/ebridge2/cdcorr."}, "http://arxiv.org/abs/2307.16353": {"title": "Single Proxy Synthetic Control", "link": "http://arxiv.org/abs/2307.16353", "description": "Synthetic control methods are widely used to estimate the treatment effect on\na single treated unit in time-series settings. A common approach for estimating\nsynthetic controls is to regress the treated unit's pre-treatment outcome on\nthose of untreated units via ordinary least squares. However, this approach can\nperform poorly if the pre-treatment fit is not near perfect, whether the\nweights are normalized or not. In this paper, we introduce a single proxy\nsynthetic control approach, which views the outcomes of untreated units as\nproxies of the treatment-free potential outcome of the treated unit, a\nperspective we leverage to construct a valid synthetic control. 
Under this\nframework, we establish alternative identification and estimation methodologies\nfor synthetic controls and for the treatment effect on the treated unit.\nNotably, unlike a proximal synthetic control approach which requires two types\nof proxies for identification, ours relies on a single type of proxy, thus\nfacilitating its practical relevance. Additionally, we adapt a conformal\ninference approach to perform inference about the treatment effect, obviating\nthe need for a large amount of post-treatment data. Lastly, our framework can\naccommodate time-varying covariates and nonlinear models. We demonstrate the\nproposed approach in a simulation study and a real-world application."}, "http://arxiv.org/abs/2206.12346": {"title": "A fast and stable approximate maximum-likelihood method for template fits", "link": "http://arxiv.org/abs/2206.12346", "description": "Barlow and Beeston presented an exact likelihood for the problem of fitting a\ncomposite model consisting of binned templates obtained from Monte-Carlo\nsimulation which are fitted to equally binned data. Solving the exact\nlikelihood is technically challenging, and therefore Conway proposed an\napproximate likelihood to address these challenges. In this paper, a new\napproximate likelihood is derived from the exact Barlow-Beeston one. The new\napproximate likelihood and Conway's likelihood are generalized to problems of\nfitting weighted data with weighted templates. The performance of estimates\nobtained with all three likelihoods is studied on two toy examples: a simple\none and a challenging one. The performance of the approximate likelihoods is\ncomparable to the exact Barlow-Beeston likelihood, while the performance in\nfits with weighted templates is better. The approximate likelihoods evaluate\nfaster than the Barlow-Beeston one when the number of bins is large."}, "http://arxiv.org/abs/2307.16370": {"title": "Inference for Low-rank Completion without Sample Splitting with Application to Treatment Effect Estimation", "link": "http://arxiv.org/abs/2307.16370", "description": "This paper studies the inferential theory for estimating low-rank matrices.\nIt also provides an inference method for the average treatment effect as an\napplication. We show that the least squares estimation of eigenvectors following\nthe nuclear norm penalization attains asymptotic normality. The key\ncontribution of our method is that it does not require sample splitting. In\naddition, this paper allows dependent observation patterns and heterogeneous\nobservation probabilities. Empirically, we apply the proposed procedure to\nestimating the impact of the presidential vote on allocating the U.S. federal\nbudget to the states."}, "http://arxiv.org/abs/2311.14679": {"title": "\"Medium-n studies\" in computing education conferences", "link": "http://arxiv.org/abs/2311.14679", "description": "Good (Frequentist) statistical practice requires that statistical tests be\nperformed in order to determine if the phenomenon being observed could\nplausibly occur by chance if the null hypothesis is true. Good practice also\nrequires that a test is not performed if the study is underpowered: if the\nnumber of observations is not sufficiently large to be able to reliably detect\nthe effect one hypothesizes, even if the effect exists. Running underpowered\nstudies runs the risk of false negative results. 
This creates tension in the\nguidelines and expectations for computer science education conferences: while\nthings are clear for studies with a large number of observations, researchers\nshould in fact not compute p-values and perform statistical tests if the number\nof observations is too small. The issue is particularly live in CSed venues,\nsince class sizes where those issues are salient are common. We outline the\nconsiderations for when to compute and when not to compute p-values in\ndifferent settings encountered by computer science education researchers. We\nsurvey the author and reviewer guidelines in different computer science\neducation conferences (ICER, SIGCSE TS, ITiCSE, EAAI, CompEd, Koli Calling). We\npresent summary data and make several preliminary observations about reviewer\nguidelines: guidelines vary from conference to conference; guidelines allow for\nqualitative studies, and, in some cases, experience reports, but guidelines do\nnot generally explicitly indicate that a paper should have at least one of (1)\nan appropriately-powered statistical analysis or (2) rich qualitative\ndescriptions. We present preliminary ideas for addressing the tension in the\nguidelines between small-n and large-n studies."}, "http://arxiv.org/abs/2311.17962": {"title": "In search of the perfect fit: interpretation, flexible modelling, and the existing generalisations of the normal distribution", "link": "http://arxiv.org/abs/2311.17962", "description": "Many generalised distributions exist for modelling data with vastly diverse\ncharacteristics. However, very few of these generalisations of the normal\ndistribution have shape parameters with clear roles that determine, for\ninstance, skewness and tail shape. In this chapter, we review existing skewing\nmechanisms and their properties in detail. Using the knowledge acquired, we add\na skewness parameter to the body-tail generalised normal (BTGN) distribution,\nwhich yields the FIN distribution with parameters for location, scale,\nbody-shape, skewness, and tail weight. Basic statistical properties of the\nFIN are provided, such as the probability density function (PDF), cumulative distribution function,\nmoments, and likelihood equations. Additionally, the FIN PDF is\nextended to a multivariate setting using a Student t-copula, yielding the\nMFIN. The MFIN is applied to stock returns data, where it outperforms\nthe t-copula multivariate generalised hyperbolic, Azzalini skew-t, hyperbolic,\nand normal inverse Gaussian distributions."}, "http://arxiv.org/abs/2311.18039": {"title": "Restricted Regression in Networks", "link": "http://arxiv.org/abs/2311.18039", "description": "Network regression with additive node-level random effects can be problematic\nwhen the primary interest is estimating unconditional regression coefficients\nand some covariates are exactly or nearly in the vector space of node-level\neffects. We introduce the Restricted Network Regression model, which removes the\ncollinearity between fixed and random effects in network regression by\northogonalizing the random effects against the covariates. 
We discuss the\nchange in the interpretation of the regression coefficients in Restricted\nNetwork Regression and analytically characterize the effect of Restricted\nNetwork Regression on the regression coefficients for continuous response data.\nWe show through simulation with continuous and binary response data that\nRestricted Network Regression mitigates, but does not alleviate, network\nconfounding, by providing improved estimation of the regression coefficients.\nWe apply the Restricted Network Regression model in an analysis of 2015\nEurovision Song Contest voting data and show how the choice of regression model\naffects inference."}, "http://arxiv.org/abs/2311.18048": {"title": "An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis", "link": "http://arxiv.org/abs/2311.18048", "description": "We investigate the relationship between system identification and\nintervention design in dynamical systems. While previous research demonstrated\nhow identifiable representation learning methods, such as Independent Component\nAnalysis (ICA), can reveal cause-effect relationships, it relied on a passive\nperspective without considering how to collect data. Our work shows that in\nGaussian Linear Time-Invariant (LTI) systems, the system parameters can be\nidentified by introducing diverse intervention signals in a multi-environment\nsetting. By harnessing appropriate diversity assumptions motivated by the ICA\nliterature, our findings connect experiment design and representational\nidentifiability in dynamical systems. We corroborate our findings on synthetic\nand (simulated) physical data. Additionally, we show that Hidden Markov Models,\nin general, and (Gaussian) LTI systems, in particular, fulfil a generalization\nof the Causal de Finetti theorem with continuous parameters."}, "http://arxiv.org/abs/2311.18053": {"title": "On Non- and Weakly-Informative Priors for the Conway-Maxwell-Poisson (COM-Poisson) Distribution", "link": "http://arxiv.org/abs/2311.18053", "description": "Previous Bayesian evaluations of the Conway-Maxwell-Poisson (COM-Poisson)\ndistribution have little discussion of non- and weakly-informative priors for\nthe model. While only considering priors with such limited information\nrestricts potential analyses, these priors serve an important first step in the\nmodeling process and are useful when performing sensitivity analyses. We\ndevelop and derive several weakly- and non-informative priors using both the\nestablished conjugate prior and Jeffreys' prior. Our evaluation of each prior\ninvolves an empirical study under varying dispersion types and sample sizes. In\ngeneral, we find the weakly informative priors tend to perform better than the\nnon-informative priors. We also consider several data examples for illustration\nand provide code for implementation of each resulting posterior."}, "http://arxiv.org/abs/2311.18093": {"title": "Shared Control Individuals in Health Policy Evaluations with Application to Medical Cannabis Laws", "link": "http://arxiv.org/abs/2311.18093", "description": "Health policy researchers often have questions about the effects of a policy\nimplemented at some cluster-level unit, e.g., states, counties, hospitals, etc.\non individual-level outcomes collected over multiple time periods. Stacked\ndifference-in-differences is an increasingly popular way to estimate these\neffects. 
This approach involves estimating treatment effects for each\npolicy-implementing unit, then, if scientifically appropriate, aggregating them\nto an average effect estimate. However, when individual-level data are\navailable and non-implementing units are used as comparators for multiple\npolicy-implementing units, data from untreated individuals may be used across\nmultiple analyses, thereby inducing correlation between effect estimates.\nExisting methods do not quantify or account for this sharing of controls. Here,\nwe describe a stacked difference-in-differences study investigating the effects\nof state medical cannabis laws on treatment for chronic pain management that\nmotivated this work, discuss a framework for estimating and managing this\ncorrelation due to shared control individuals, and show how accounting for it\naffects the substantive results."}, "http://arxiv.org/abs/2311.18146": {"title": "Co-Active Subspace Methods for the Joint Analysis of Adjacent Computer Models", "link": "http://arxiv.org/abs/2311.18146", "description": "Active subspace (AS) methods are a valuable tool for understanding the\nrelationship between the inputs and outputs of a Physics simulation. In this\npaper, an elegant generalization of the traditional ASM is developed to assess\nthe co-activity of two computer models. This generalization, which we refer to\nas a Co-Active Subspace (C-AS) Method, allows for the joint analysis of two or\nmore computer models allowing for thorough exploration of the alignment (or\nnon-alignment) of the respective gradient spaces. We define co-active\ndirections, co-sensitivity indices, and a scalar ``concordance\" metric (and\ncomplementary ``discordance\" pseudo-metric) and we demonstrate that these are\npowerful tools for understanding the behavior of a class of computer models,\nespecially when used to supplement traditional AS analysis. Details for\nefficient estimation of the C-AS and an accompanying R package\n(github.com/knrumsey/concordance) are provided. Practical application is\ndemonstrated through analyzing a set of simulated rate stick experiments for\nPBX 9501, a high explosive, offering insights into complex model dynamics."}, "http://arxiv.org/abs/2311.18274": {"title": "Semiparametric Efficient Inference in Adaptive Experiments", "link": "http://arxiv.org/abs/2311.18274", "description": "We consider the problem of efficient inference of the Average Treatment\nEffect in a sequential experiment where the policy governing the assignment of\nsubjects to treatment or control can change over time. We first provide a\ncentral limit theorem for the Adaptive Augmented Inverse-Probability Weighted\nestimator, which is semiparametric efficient, under weaker assumptions than\nthose previously made in the literature. This central limit theorem enables\nefficient inference at fixed sample sizes. We then consider a sequential\ninference setting, deriving both asymptotic and nonasymptotic confidence\nsequences that are considerably tighter than previous methods. These\nanytime-valid methods enable inference under data-dependent stopping times\n(sample sizes). Additionally, we use propensity score truncation techniques\nfrom the recent off-policy estimation literature to reduce the finite sample\nvariance of our estimator without affecting the asymptotic variance. 
Empirical\nresults demonstrate that our methods yield narrower confidence sequences than\nthose previously developed in the literature while maintaining time-uniform\nerror control."}, "http://arxiv.org/abs/2311.18294": {"title": "Multivariate Unified Skew-t Distributions And Their Properties", "link": "http://arxiv.org/abs/2311.18294", "description": "The unified skew-t (SUT) is a flexible parametric multivariate distribution\nthat accounts for skewness and heavy tails in the data. A few of its properties\ncan be found scattered in the literature or in a parameterization that does not\nfollow the original one for unified skew-normal (SUN) distributions, yet a\nsystematic study is lacking. In this work, explicit properties of the\nmultivariate SUT distribution are presented, such as its stochastic\nrepresentations, moments, SUN-scale mixture representation, linear\ntransformation, additivity, marginal distribution, canonical form, quadratic\nform, conditional distribution, change of latent dimensions, Mardia measures of\nmultivariate skewness and kurtosis, and non-identifiability issue. These\nresults are given in a parametrization that reduces to the original SUN\ndistribution as a sub-model, hence facilitating the use of the SUT for\napplications. Several models based on the SUT distribution are provided for\nillustration."}, "http://arxiv.org/abs/2311.18446": {"title": "Length-of-stay times in hospital for COVID-19 patients using the smoothed Beran's estimator with bootstrap bandwidth selection", "link": "http://arxiv.org/abs/2311.18446", "description": "The survival function of length-of-stay in hospital ward and ICU for COVID-19\npatients is studied in this paper. Flexible statistical methods are used to\nestimate this survival function given relevant covariates such as age, sex,\nobesity and chronic obstructive pulmonary disease (COPD). A doubly-smoothed\nBeran's estimator has been considered to this aim. The bootstrap method has\nbeen used to produce new smoothing parameter selectors and to construct\nconfidence regions for the conditional survival function. Some simulation\nstudies show the good performance of the proposed methods."}, "http://arxiv.org/abs/2311.18460": {"title": "Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework", "link": "http://arxiv.org/abs/2311.18460", "description": "Fairness for machine learning predictions is widely required in practice for\nlegal, ethical, and societal reasons. Existing work typically focuses on\nsettings without unobserved confounding, even though unobserved confounding can\nlead to severe violations of causal fairness and, thus, unfair predictions. In\nthis work, we analyze the sensitivity of causal fairness to unobserved\nconfounding. Our contributions are three-fold. First, we derive bounds for\ncausal fairness metrics under different sources of unobserved confounding. This\nenables practitioners to examine the sensitivity of their machine learning\nmodels to unobserved confounding in fairness-critical applications. Second, we\npropose a novel neural framework for learning fair predictions, which allows us\nto offer worst-case guarantees of the extent to which causal fairness can be\nviolated due to unobserved confounding. Third, we demonstrate the effectiveness\nof our framework in a series of experiments, including a real-world case study\nabout predicting prison sentences. To the best of our knowledge, ours is the\nfirst work to study causal fairness under unobserved confounding. 
To this end,\nour work is of direct practical value as a refutation strategy to ensure the\nfairness of predictions in high-stakes applications."}, "http://arxiv.org/abs/2311.18477": {"title": "Intraday foreign exchange rate volatility forecasting: univariate and multilevel functional GARCH models", "link": "http://arxiv.org/abs/2311.18477", "description": "This paper seeks to predict conditional intraday volatility in foreign\nexchange (FX) markets using functional Generalized AutoRegressive Conditional\nHeteroscedasticity (GARCH) models. We contribute to the existing functional\nGARCH-type models by accounting for the stylised features of long-range and\ncross-dependence through estimating the models with long-range dependent and\nmulti-level functional principal component basis functions. Remarkably, we find\nthat taking account of cross-dependency dynamics between the major currencies\nsignificantly improves intraday conditional volatility forecasting.\nAdditionally, incorporating intraday bid-ask spread using a functional GARCH-X\nmodel adds explainability of long-range dependence and further enhances\npredictability. Intraday risk management applications are presented to\nhighlight the practical economic benefits of our proposed approaches."}, "http://arxiv.org/abs/2311.18501": {"title": "Perturbation-based Analysis of Compositional Data", "link": "http://arxiv.org/abs/2311.18501", "description": "Existing statistical methods for compositional data analysis are inadequate\nfor many modern applications for two reasons. First, modern compositional\ndatasets, for example in microbiome research, display traits such as\nhigh-dimensionality and sparsity that are poorly modelled with traditional\napproaches. Second, assessing -- in an unbiased way -- how summary statistics\nof a composition (e.g., racial diversity) affect a response variable is not\nstraightforward. In this work, we propose a framework based on hypothetical\ndata perturbations that addresses both issues. Unlike existing methods for\ncompositional data, we do not transform the data and instead use perturbations\nto define interpretable statistical functionals on the compositions themselves,\nwhich we call average perturbation effects. These average perturbation effects,\nwhich can be employed in many applications, naturally account for confounding\nthat biases frequently used marginal dependence analyses. We show how average\nperturbation effects can be estimated efficiently by deriving a\nperturbation-dependent reparametrization and applying semiparametric estimation\ntechniques. We analyze the proposed estimators empirically on simulated data\nand demonstrate advantages over existing techniques on US census and microbiome\ndata. For all proposed estimators, we provide confidence intervals with uniform\nasymptotic coverage guarantees."}, "http://arxiv.org/abs/2311.18532": {"title": "Local causal effects with continuous exposures: A matching estimator for the average causal derivative effect", "link": "http://arxiv.org/abs/2311.18532", "description": "The estimation of causal effects is a fundamental goal in the field of causal\ninference. However, it is challenging for various reasons. One reason is that\nthe exposure (or treatment) is naturally continuous in many real-world\nscenarios. When dealing with continuous exposure, dichotomizing the exposure\nvariable based on a pre-defined threshold may result in a biased understanding\nof causal relationships. 
In this paper, we propose a novel causal inference\nframework that can measure the causal effect of continuous exposure. We define\nthe expectation of a derivative of potential outcomes at a specific exposure\nlevel as the average causal derivative effect. Additionally, we propose a\nmatching method for this estimator and propose a permutation approach to test\nthe hypothesis of no local causal effect. We also investigate the asymptotic\nproperties of the proposed estimator and examine its performance through\nsimulation studies. Finally, we apply this causal framework in a real data\nexample of Chronic Obstructive Pulmonary Disease (COPD) patients."}, "http://arxiv.org/abs/2311.18584": {"title": "First-order multivariate integer-valued autoregressive model with multivariate mixture distributions", "link": "http://arxiv.org/abs/2311.18584", "description": "The univariate integer-valued time series has been extensively studied, but\nthe literature on multivariate integer-valued time series models is quite limited\nand the complex correlation structure among the multivariate integer-valued\ntime series is barely discussed. In this study, we propose a first-order\nmultivariate integer-valued autoregressive model to characterize the\ncorrelation among multivariate integer-valued time series with higher\nflexibility. Under general conditions, we establish the stationarity and\nergodicity of the proposed model. With the proposed method, we discuss models\nwith the multivariate Poisson-lognormal distribution and the multivariate\ngeometric-logitnormal distribution and their corresponding properties. An\nestimation method based on the EM algorithm is developed for the model\nparameters, and extensive simulation studies are performed to evaluate the\neffectiveness of the proposed estimation method. Finally, a real crime dataset is analyzed to\ndemonstrate the advantage of the proposed model in comparison with other\nmodels."}, "http://arxiv.org/abs/2311.18699": {"title": "Gaussian processes Correlated Bayesian Additive Regression Trees", "link": "http://arxiv.org/abs/2311.18699", "description": "In recent years, Bayesian Additive Regression Trees (BART) has garnered\nincreased attention, leading to the development of various extensions for\ndiverse applications. However, there has been limited exploration of its\nutility in analyzing correlated data. This paper introduces a novel extension\nof BART, named Correlated BART (CBART). Unlike the original BART with\nindependent errors, CBART is specifically designed to handle correlated\n(dependent) errors. Additionally, we propose the integration of CBART with\nGaussian processes (GP) to create a new model termed GP-CBART. This innovative\nmodel combines the strengths of Gaussian processes and CBART, making it\nparticularly well-suited for analyzing time series or spatial data. In the\nGP-CBART framework, CBART captures the nonlinearity in the mean regression\n(covariates) function, while the Gaussian process adeptly models the\ncorrelation structure within the response. Additionally, given the high\nflexibility of both CBART and GP models, their combination may lead to\nidentification issues. We provide methods to address these challenges. 
To\ndemonstrate the effectiveness of CBART and GP-CBART, we present corresponding\nsimulated and real-world examples."}, "http://arxiv.org/abs/2311.18725": {"title": "AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities", "link": "http://arxiv.org/abs/2311.18725", "description": "In the pharmaceutical industry, the use of artificial intelligence (AI) has\nseen consistent growth over the past decade. This rise is attributed to major\nadvancements in statistical machine learning methodologies, computational\ncapabilities and the increased availability of large datasets. AI techniques\nare applied throughout different stages of drug development, ranging from drug\ndiscovery to post-marketing benefit-risk assessment. Kolluri et al. provided a\nreview of several case studies that span these stages, featuring key\napplications such as protein structure prediction, success probability\nestimation, subgroup identification, and AI-assisted clinical trial monitoring.\nFrom a regulatory standpoint, there was a notable uptick in submissions\nincorporating AI components in 2021. The most prevalent therapeutic areas\nleveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%),\nand neurology (11%). The paradigm of personalized or precision medicine has\ngained significant traction in recent research, partly due to advancements in\nAI techniques \\cite{hamburg2010path}. This shift has had a transformative\nimpact on the pharmaceutical industry. Departing from the traditional\n\"one-size-fits-all\" model, personalized medicine incorporates various\nindividual factors, such as environmental conditions, lifestyle choices, and\nhealth histories, to formulate customized treatment plans. By utilizing\nsophisticated machine learning algorithms, clinicians and researchers are\nbetter equipped to make informed decisions in areas such as disease prevention,\ndiagnosis, and treatment selection, thereby optimizing health outcomes for each\nindividual."}, "http://arxiv.org/abs/2311.18807": {"title": "Pre-registration for Predictive Modeling", "link": "http://arxiv.org/abs/2311.18807", "description": "Amid rising concerns of reproducibility and generalizability in predictive\nmodeling, we explore the possibility and potential benefits of introducing\npre-registration to the field. Despite notable advancements in predictive\nmodeling, spanning core machine learning tasks to various scientific\napplications, challenges such as overlooked contextual factors, data-dependent\ndecision-making, and unintentional re-use of test data have raised questions\nabout the integrity of results. To address these issues, we propose adapting\npre-registration practices from explanatory modeling to predictive modeling. We\ndiscuss current best practices in predictive modeling and their limitations,\nintroduce a lightweight pre-registration template, and present a qualitative\nstudy with machine learning researchers to gain insight into the effectiveness\nof pre-registration in preventing biased estimates and promoting more reliable\nresearch outcomes. 
We conclude by exploring the scope of problems that\npre-registration can address in predictive modeling and acknowledging its\nlimitations within this context."}, "http://arxiv.org/abs/1910.12431": {"title": "Multilevel Dimension-Independent Likelihood-Informed MCMC for Large-Scale Inverse Problems", "link": "http://arxiv.org/abs/1910.12431", "description": "We present a non-trivial integration of dimension-independent\nlikelihood-informed (DILI) MCMC (Cui, Law, Marzouk, 2016) and the multilevel\nMCMC (Dodwell et al., 2015) to explore the hierarchy of posterior\ndistributions. This integration offers several advantages: First, DILI-MCMC\nemploys an intrinsic likelihood-informed subspace (LIS) (Cui et al., 2014) --\nwhich involves a number of forward and adjoint model simulations -- to design\naccelerated operator-weighted proposals. By exploiting the multilevel structure\nof the discretised parameters and discretised forward models, we design a\nRayleigh-Ritz procedure to significantly reduce the computational effort in\nbuilding the LIS and operating with DILI proposals. Second, the resulting\nDILI-MCMC can drastically improve the sampling efficiency of MCMC at each\nlevel, and hence reduce the integration error of the multilevel algorithm for\nfixed CPU time. Numerical results confirm the improved computational efficiency\nof the multilevel DILI approach."}, "http://arxiv.org/abs/2107.01306": {"title": "The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models", "link": "http://arxiv.org/abs/2107.01306", "description": "Here, we investigate whether (and how) experimental design could aid in the\nestimation of the precision matrix in a Gaussian chain graph model, especially\nthe interplay between the design, the effect of the experiment and prior\nknowledge about the effect. Estimation of the precision matrix is a fundamental\ntask to infer biological graphical structures like microbial networks. We\ncompare the marginal posterior precision of the precision matrix under four\npriors: flat, conjugate Normal-Wishart, Normal-MGIG and a general independent.\nUnder the flat and conjugate priors, the Laplace-approximated posterior\nprecision is not a function of the design matrix rendering useless any efforts\nto find an optimal experimental design to infer the precision matrix. In\ncontrast, the Normal-MGIG and general independent priors do allow for the\nsearch of optimal experimental designs, yet there is a sharp upper bound on the\ninformation that can be extracted from a given experiment. We confirm our\ntheoretical findings via a simulation study comparing i) the KL divergence\nbetween prior and posterior and ii) the Stein's loss difference of MAPs between\nrandom and no experiment. Our findings provide practical advice for domain\nscientists conducting experiments to better infer the precision matrix as a\nrepresentation of a biological network."}, "http://arxiv.org/abs/2206.02508": {"title": "Tucker tensor factor models: matricization and mode-wise PCA estimation", "link": "http://arxiv.org/abs/2206.02508", "description": "High-dimensional, higher-order tensor data are gaining prominence in a\nvariety of fields, including but not limited to computer vision and network\nanalysis. 
Tensor factor models, induced from noisy versions of tensor\ndecomposition or factorization, are natural potent instruments to study a\ncollection of tensor-variate objects that may be dependent or independent.\nHowever, the development of statistical inferential theories for the estimation\nof various low-rank structures, which customarily play the role of signals in\ntensor factor models, is still at an early stage. In this paper, starting from\ntensor matricization, we aim to \"decode\" the estimation of a higher-order tensor\nfactor model in the sense that we recast it into mode-wise traditional\nhigh-dimensional vector/fiber factor models so as to deploy the conventional\nestimation of principal component analysis (PCA). Demonstrated by the Tucker\ntensor factor model (TuTFaM), which is induced from the most popular Tucker\ndecomposition, we show that estimation of the signal components is essentially\na mode-wise PCA technique, and that the involvement of projection and\niteration enhances the signal-to-noise ratio to various extents. We\nestablish the inferential theory of the proposed estimators, conduct rich\nsimulation experiments under TuTFaM, and illustrate how the proposed\nestimators can work in tensor reconstruction and in clustering of video and\neconomic datasets, respectively."}, "http://arxiv.org/abs/2211.13959": {"title": "Testing Homological Equivalence Using Betti Numbers", "link": "http://arxiv.org/abs/2211.13959", "description": "In this article, we propose a one-sample test to check whether the support of\nthe unknown distribution generating the data is homologically equivalent to the\nsupport of some specified distribution; alternatively, using the corresponding\ntwo-sample test, one can test whether the supports of two unknown distributions\nare homologically equivalent. In the course of this study, test\nstatistics based on the Betti numbers are formulated, and the consistency of\nthe tests is established under the critical and the supercritical regimes.\nMoreover, some simulation studies are conducted and the results are compared with\nexisting methodologies such as Robinson's permutation test and the test based\non mean persistent landscape functions. Furthermore, the practicability of the\ntests is demonstrated on two well-known real data sets."}, "http://arxiv.org/abs/2302.08893": {"title": "Active learning for data streams: a survey", "link": "http://arxiv.org/abs/2302.08893", "description": "Online active learning is a paradigm in machine learning that aims to select\nthe most informative data points to label from a data stream. The problem of\nminimizing the cost associated with collecting labeled observations has gained\na lot of attention in recent years, particularly in real-world applications\nwhere data is only available in an unlabeled form. Annotating each observation\ncan be time-consuming and costly, making it difficult to obtain large amounts\nof labeled data. To overcome this issue, many active learning strategies have\nbeen proposed in recent decades, aiming to select the most informative\nobservations for labeling in order to improve the performance of machine\nlearning models. These approaches can be broadly divided into two categories:\nstatic pool-based and stream-based active learning. 
Pool-based active learning\ninvolves selecting a subset of observations from a closed pool of unlabeled\ndata, and it has been the focus of many surveys and literature reviews.\nHowever, the growing availability of data streams has led to an increase in the\nnumber of approaches that focus on online active learning, which involves\ncontinuously selecting and labeling observations as they arrive in a stream.\nThis work aims to provide an overview of the most recently proposed approaches\nfor selecting the most informative observations from data streams in real time.\nWe review the various techniques that have been proposed and discuss their\nstrengths and limitations, as well as the challenges and opportunities that\nexist in this area of research."}, "http://arxiv.org/abs/2304.14750": {"title": "Bayesian Testing of Scientific Expectations Under Exponential Random Graph Models", "link": "http://arxiv.org/abs/2304.14750", "description": "The exponential random graph (ERGM) model is a commonly used statistical\nframework for studying the determinants of tie formation from social network\ndata. To test scientific theories under the ERGM framework, statistical\ninference is generally based on traditional significance\ntesting using p-values. This methodology has certain limitations, however, such\nas its inconsistent behavior when the null hypothesis is true, its inability to\nquantify evidence in favor of a null hypothesis, and its inability to test\nmultiple hypotheses with competing equality and/or order constraints on the\nparameters of interest in a direct manner. To tackle these shortcomings, this\npaper presents Bayes factors and posterior probabilities for testing scientific\nexpectations under a Bayesian framework. The methodology is implemented in the\nR package 'BFpack'. The applicability of the methodology is illustrated using\nempirical collaboration networks and policy networks."}, "http://arxiv.org/abs/2312.00130": {"title": "Sparse Projected Averaged Regression for High-Dimensional Data", "link": "http://arxiv.org/abs/2312.00130", "description": "We examine the linear regression problem in a challenging high-dimensional\nsetting with correlated predictors to explain and predict relevant quantities,\nwhile explicitly allowing the regression coefficients to vary from sparse to\ndense. Most classical high-dimensional regression estimators require some\ndegree of sparsity. We discuss the more recent concepts of variable screening\nand random projection as computationally fast dimension reduction tools, and\npropose a new random projection matrix tailored to the linear regression\nproblem with a theoretical bound on the gain in expected prediction error over\nconventional random projections.\n\nAround this new random projection, we build the Sparse Projected Averaged\nRegression (SPAR) method, which combines probabilistic variable screening steps\nwith the random projection steps to obtain an ensemble of small linear models. In\ncontrast to existing methods, we introduce a thresholding parameter to obtain\nsome degree of sparsity.\n\nIn extensive simulations and two real data applications, we walk through the\nelements of this method and compare prediction and variable selection\nperformance to various competitors. 
For prediction, our method performs at\nleast as well as the best competitors in most settings with a high number of\ntruly active variables, while variable selection remains a hard task for all\nmethods in high dimensions."}, "http://arxiv.org/abs/2312.00185": {"title": "On the variance of the Least Mean Square squared-error sample curve", "link": "http://arxiv.org/abs/2312.00185", "description": "Most studies of adaptive algorithm behavior consider performance measures\nbased on mean values such as the mean-square error. The derived models are\nuseful for understanding the algorithm behavior under different environments\nand can be used for design. Nevertheless, from a practical point of view, the\nadaptive filter user has only one realization of the algorithm to obtain the\ndesired result. This letter derives a model for the variance of the\nsquared-error sample curve of the least-mean-square (LMS) adaptive algorithm,\nso that the achievable cancellation level can be predicted based on the\nproperties of the steady-state squared error. The derived results provide the\nuser with useful design guidelines."}, "http://arxiv.org/abs/2312.00219": {"title": "The Functional Average Treatment Effect", "link": "http://arxiv.org/abs/2312.00219", "description": "This paper establishes the functional average as an important estimand for\ncausal inference. The significance of the estimand lies in its robustness\nagainst traditional issues of confounding. We prove that this robustness holds\neven when the probability distribution of the outcome, conditional on treatment\nor some other vector of adjusting variables, differs almost arbitrarily from\nits counterfactual analogue. This paper also examines possible estimators of\nthe functional average, including the sample mid-range, and proposes a new type\nof bootstrap for robust statistical inference: the Hoeffding bootstrap. After\nthis, the paper explores a new class of variables, the $\\mathcal{U}$ class of\nvariables, which simplifies the estimation of functional averages. This class of\nvariables is also used to establish mean exchangeability in some cases and to\nprovide the results of elementary statistical procedures, such as linear\nregression and the analysis of variance, with causal interpretations.\nSimulation evidence is provided. The methods of this paper are also applied to\na National Health and Nutrition Survey data set to investigate the causal\neffect of exercise on the blood pressure of adult smokers."}, "http://arxiv.org/abs/2312.00225": {"title": "Eliminating confounder-induced bias in the statistics of intervention", "link": "http://arxiv.org/abs/2312.00225", "description": "Experimental and observational studies often lead to spurious associations\nbetween the outcome and independent variables describing the intervention,\nbecause of confounding by third-party factors. Even in randomized clinical\ntrials, confounding might be unavoidable due to small sample sizes.\nPractically, this poses a problem, because it is either expensive to re-design\nand conduct a new study or even impossible to alleviate the contribution of\nsome confounders due to, e.g., ethical concerns. Here, we propose a method to\nconsistently derive hypothetical studies that retain as many of the\ndependencies in the original study as mathematically possible, while removing\nany association of observed confounders to the independent variables. Using\nhistoric studies, we illustrate how the confounding-free scenario re-estimates\nthe effect size of the intervention. 
The new effect size estimate represents a\nconcise prediction in the hypothetical scenario, which paves the way from the\noriginal data towards the design of future studies."}, "http://arxiv.org/abs/2312.00294": {"title": "aeons: approximating the end of nested sampling", "link": "http://arxiv.org/abs/2312.00294", "description": "This paper presents analytic results on the anatomy of nested sampling, from\nwhich a technique is developed to estimate the run-time of the algorithm that\nworks for any nested sampling implementation. We test these methods on both toy\nmodels and true cosmological nested sampling runs. The method gives an\norder-of-magnitude prediction of the end point at all times, forecasting the\ntrue endpoint within standard error around the halfway point."}, "http://arxiv.org/abs/2312.00305": {"title": "Multiple Testing of Linear Forms for Noisy Matrix Completion", "link": "http://arxiv.org/abs/2312.00305", "description": "Many important tasks of large-scale recommender systems can be naturally cast\nas testing multiple linear forms for noisy matrix completion. These problems,\nhowever, present unique challenges because of the subtle bias-and-variance\ntradeoff of, and an intricate dependence among, the estimated entries induced by\nthe low-rank structure. In this paper, we develop a general approach to\novercome these difficulties by introducing new statistics for individual tests\nwith sharp asymptotics both marginally and jointly, and utilizing them to\ncontrol the false discovery rate (FDR) via a data splitting and symmetric\naggregation scheme. We show that valid FDR control can be achieved with\nguaranteed power under nearly optimal sample size requirements using the\nproposed methodology. Extensive numerical simulations and real data examples\nare also presented to further illustrate its practical merits."}, "http://arxiv.org/abs/2312.00346": {"title": "Supervised Factor Modeling for High-Dimensional Linear Time Series", "link": "http://arxiv.org/abs/2312.00346", "description": "Motivated by Tucker tensor decomposition, this paper imposes low-rank\nstructures on the column and row spaces of coefficient matrices in a\nmultivariate infinite-order vector autoregression (VAR), which leads to a\nsupervised factor model in which factor modeling is conducted on responses\nand predictors simultaneously. Interestingly, the stationarity condition\nimplies an intrinsic weak group sparsity mechanism of infinite-order VAR, and\nhence a rank-constrained group Lasso estimation is considered for\nhigh-dimensional linear time series. Its non-asymptotic properties are\ndiscussed thoroughly by balancing the estimation, approximation and\ntruncation errors. Moreover, an alternating gradient descent algorithm with\nthresholding is designed to search for high-dimensional estimates, and its\ntheoretical justifications, including statistical and convergence analysis, are\nalso provided. Theoretical and computational properties of the proposed\nmethodology are verified by simulation experiments, and the advantages over\nexisting methods are demonstrated by two real examples."}, "http://arxiv.org/abs/2312.00373": {"title": "Streaming Bayesian Modeling for predicting Fat-Tailed Customer Lifetime Value", "link": "http://arxiv.org/abs/2312.00373", "description": "We develop an online learning MCMC approach applicable to hierarchical\nBayesian models and GLMs. We also develop a fat-tailed LTV model that\ngeneralizes over several kinds of fat and thin tails. 
We demonstrate both\ndevelopments on commercial LTV data from a large mobile app."}, "http://arxiv.org/abs/2312.00439": {"title": "Modeling the Ratio of Correlated Biomarkers Using Copula Regression", "link": "http://arxiv.org/abs/2312.00439", "description": "Modeling the ratio of two dependent components as a function of covariates is\na frequently pursued objective in observational research. Despite the high\nrelevance of this topic in medical studies, where biomarker ratios are often\nused as surrogate endpoints for specific diseases, existing models are based on\noversimplified assumptions, assuming e.g.\\@ independence or strictly positive\nassociations between the components. In this paper, we close this gap in the\nliterature and propose a regression model where the marginal distributions of\nthe two components are linked by Frank copula. A key feature of our model is\nthat it allows for both positive and negative correlations between the\ncomponents, with one of the model parameters being directly interpretable in\nterms of Kendall's rank correlation coefficient. We study our method\ntheoretically, evaluate finite sample properties in a simulation study and\ndemonstrate its efficacy in an application to diagnosis of Alzheimer's disease\nvia ratios of amyloid-beta and total tau protein biomarkers."}, "http://arxiv.org/abs/2312.00494": {"title": "Applying the estimands framework to non-inferiority trials: guidance on choice of hypothetical estimands for non-adherence and comparison of estimation methods", "link": "http://arxiv.org/abs/2312.00494", "description": "A common concern in non-inferiority (NI) trials is that non adherence due,\nfor example, to poor study conduct can make treatment arms artificially\nsimilar. Because intention to treat analyses can be anti-conservative in this\nsituation, per protocol analyses are sometimes recommended. However, such\nadvice does not consider the estimands framework, nor the risk of bias from per\nprotocol analyses. We therefore sought to update the above guidance using the\nestimands framework, and compare estimators to improve on the performance of\nper protocol analyses. We argue the main threat to validity of NI trials is the\noccurrence of trial specific intercurrent events (IEs), that is, IEs which\noccur in a trial setting, but would not occur in practice. To guard against\nerroneous conclusions of non inferiority, we suggest an estimand using a\nhypothetical strategy for trial specific IEs should be employed, with handling\nof other non trial specific IEs chosen based on clinical considerations. We\nprovide an overview of estimators that could be used to estimate a hypothetical\nestimand, including inverse probability weighting (IPW), and two instrumental\nvariable approaches (one using an informative Bayesian prior on the effect of\nstandard treatment, and one using a treatment by covariate interaction as an\ninstrument). 
We compare them, using simulation in the setting of all or nothing\ncompliance in two active treatment arms, and conclude both IPW and the\ninstrumental variable method using a Bayesian prior are potentially useful\napproaches, with the choice between them depending on which assumptions are\nmost plausible for a given trial."}, "http://arxiv.org/abs/2312.00501": {"title": "Cautionary Tales on Synthetic Controls in Survival Analyses", "link": "http://arxiv.org/abs/2312.00501", "description": "Synthetic control (SC) methods have gained rapid popularity in economics\nrecently, where they have been applied in the context of inferring the effects\nof treatments on standard continuous outcomes assuming linear input-output\nrelations. In medical applications, conversely, survival outcomes are often of\nprimary interest, a setup in which both commonly assumed data-generating\nprocesses (DGPs) and target parameters are different. In this paper, we\ntherefore investigate whether and when SCs could serve as an alternative to\nmatching methods in survival analyses. We find that, because SCs rely on a\nlinearity assumption, they will generally be biased for the true expected\nsurvival time in commonly assumed survival DGPs -- even when taking into\naccount the possibility of linearity on another scale as in accelerated failure\ntime models. Additionally, we find that, because SC units follow distributions\nwith lower variance than real control units, summaries of their distributions,\nsuch as survival curves, will be biased for the parameters of interest in many\nsurvival analyses. Nonetheless, we also highlight that using SCs can still\nimprove upon matching whenever the biases described above are outweighed by\nextrapolation biases exhibited by imperfect matches, and investigate the use of\nregularization to trade off the shortcomings of both approaches."}, "http://arxiv.org/abs/2312.00509": {"title": "Bayesian causal discovery from unknown general interventions", "link": "http://arxiv.org/abs/2312.00509", "description": "We consider the problem of learning causal Directed Acyclic Graphs (DAGs)\nusing combinations of observational and interventional experimental data.\nCurrent methods tailored to this setting assume that interventions either\ndestroy parent-child relations of the intervened (target) nodes or only alter\nsuch relations without modifying the parent sets, even when the intervention\ntargets are unknown. We relax this assumption by proposing a Bayesian method\nfor causal discovery from general interventions, which allow for modifications\nof the parent sets of the unknown targets. Even in this framework, DAGs and\ngeneral interventions may be identifiable only up to some equivalence classes.\nWe provide graphical characterizations of such interventional Markov\nequivalence and devise compatible priors for Bayesian inference that guarantee\nscore equivalence of indistinguishable structures. We then develop a Markov\nChain Monte Carlo (MCMC) scheme to approximate the posterior distribution over\nDAGs, intervention targets and induced parent sets. 
Finally, we evaluate the\nproposed methodology on both simulated and real protein expression data."}, "http://arxiv.org/abs/2312.00530": {"title": "New tools for network time series with an application to COVID-19 hospitalisations", "link": "http://arxiv.org/abs/2312.00530", "description": "Network time series are becoming increasingly important across many areas in\nscience and medicine and are often characterised by a known or inferred\nunderlying network structure, which can be exploited to make sense of dynamic\nphenomena that are often high-dimensional. For example, the Generalised Network\nAutoregressive (GNAR) models exploit such structure parsimoniously. We use the\nGNAR framework to introduce two association measures: the network and partial\nnetwork autocorrelation functions, and introduce Corbit (correlation-orbit)\nplots for visualisation. As with regular autocorrelation plots, Corbit plots\npermit interpretation of underlying correlation structures and, crucially, aid\nmodel selection more rapidly than using other tools such as AIC or BIC. We\nadditionally interpret GNAR processes as generalised graphical models, which\nconstrain the processes' autoregressive structure and exhibit interesting\ntheoretical connections to graphical models via utilization of higher-order\ninteractions. We demonstrate how incorporation of prior information is related\nto performing variable selection and shrinkage in the GNAR context. We\nillustrate the usefulness of the GNAR formulation, network autocorrelations and\nCorbit plots by modelling a COVID-19 network time series of the number of\nadmissions to mechanical ventilation beds at 140 NHS Trusts in England & Wales.\nWe introduce the Wagner plot that can analyse correlations over different time\nperiods or with respect to external covariates. In addition, we introduce plots\nthat quantify the relevance and influence of individual nodes. Our modelling\nprovides insight on the underlying dynamics of the COVID-19 series, highlights\ntwo groups of geographically co-located `influential' NHS Trusts and\ndemonstrates superior prediction abilities when compared to existing\ntechniques."}, "http://arxiv.org/abs/2312.00616": {"title": "Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry", "link": "http://arxiv.org/abs/2312.00616", "description": "In a longitudinal clinical registry, different measurement instruments might\nhave been used for assessing individuals at different time points. To combine\nthem, we investigate deep learning techniques for obtaining a joint latent\nrepresentation, to which the items of different measurement instruments are\nmapped. This corresponds to domain adaptation, an established concept in\ncomputer science for image data. Using the proposed approach as an example, we\nevaluate the potential of domain adaptation in a longitudinal cohort setting\nwith a rather small number of time points, motivated by an application with\ndifferent motor function measurement instruments in a registry of spinal\nmuscular atrophy (SMA) patients. There, we model trajectories in the latent\nrepresentation by ordinary differential equations (ODEs), where person-specific\nODE parameters are inferred from baseline characteristics. The goodness of fit\nand complexity of the ODE solutions then allows to judge the measurement\ninstrument mappings. We subsequently explore how alignment can be improved by\nincorporating corresponding penalty terms into model fitting. 
To systematically\ninvestigate the effect of differences between measurement instruments, we\nconsider several scenarios based on modified SMA data, including scenarios\nwhere a mapping should be feasible in principle and scenarios where no perfect\nmapping is available. While misalignment increases in more complex scenarios,\nsome structure is still recovered, even if the availability of measurement\ninstruments depends on patient state. A reasonable mapping is feasible also in\nthe more complex real SMA dataset. These results indicate that domain\nadaptation might be more generally useful in statistical modeling for\nlongitudinal registry data."}, "http://arxiv.org/abs/2312.00622": {"title": "Practical Path-based Bayesian Optimization", "link": "http://arxiv.org/abs/2312.00622", "description": "There has been a surge in interest in data-driven experimental design with\napplications to chemical engineering and drug manufacturing. Bayesian\noptimization (BO) has proven to be adaptable to such cases, since we can model\nthe reactions of interest as expensive black-box functions. Sometimes, the cost\nof this black-box functions can be separated into two parts: (a) the cost of\nthe experiment itself, and (b) the cost of changing the input parameters. In\nthis short paper, we extend the SnAKe algorithm to deal with both types of\ncosts simultaneously. We further propose extensions to the case of a maximum\nallowable input change, as well as to the multi-objective setting."}, "http://arxiv.org/abs/2312.00710": {"title": "SpaCE: The Spatial Confounding Environment", "link": "http://arxiv.org/abs/2312.00710", "description": "Spatial confounding poses a significant challenge in scientific studies\ninvolving spatial data, where unobserved spatial variables can influence both\ntreatment and outcome, possibly leading to spurious associations. To address\nthis problem, we introduce SpaCE: The Spatial Confounding Environment, the\nfirst toolkit to provide realistic benchmark datasets and tools for\nsystematically evaluating causal inference methods designed to alleviate\nspatial confounding. Each dataset includes training data, true counterfactuals,\na spatial graph with coordinates, and smoothness and confounding scores\ncharacterizing the effect of a missing spatial confounder. It also includes\nrealistic semi-synthetic outcomes and counterfactuals, generated using\nstate-of-the-art machine learning ensembles, following best practices for\ncausal inference benchmarks. The datasets cover real treatment and covariates\nfrom diverse domains, including climate, health and social sciences. SpaCE\nfacilitates an automated end-to-end pipeline, simplifying data loading,\nexperimental setup, and evaluating machine learning and causal inference\nmodels. The SpaCE project provides several dozens of datasets of diverse sizes\nand spatial complexity. It is publicly available as a Python package,\nencouraging community feedback and contributions."}, "http://arxiv.org/abs/2312.00728": {"title": "Soft computing for the posterior of a new matrix t graphical network", "link": "http://arxiv.org/abs/2312.00728", "description": "Modelling noisy data in a network context remains an unavoidable obstacle;\nfortunately, random matrix theory may comprehensively describe network\nenvironments effectively. Thus it necessitates the probabilistic\ncharacterisation of these networks (and accompanying noisy data) using matrix\nvariate models. Denoising network data using a Bayes approach is not common in\nsurveyed literature. 
This paper adopts the Bayesian viewpoint and introduces a\nnew matrix variate t-model in a prior sense by relying on the matrix variate\ngamma distribution for the noise process, following the Gaussian graphical\nnetwork for the cases when the normality assumption is violated. From a\nstatistical learning viewpoint, such a theoretical consideration indubitably\nbenefits the real-world comprehension of structures causing noisy data with\nnetwork-based attributes as part of machine learning in data science. A full\nstructural learning procedure is provided for calculating and approximating the\nresulting posterior of interest to assess the considered model's network\ncentrality measures. Experiments with synthetic and real-world stock price data\nare performed not only to validate the proposed algorithm's capabilities but\nalso to show that this model has wider flexibility than originally implied in\nBillio et al. (2021)."}, "http://arxiv.org/abs/2312.00770": {"title": "Random Forest for Dynamic Risk Prediction or Recurrent Events: A Pseudo-Observation Approach", "link": "http://arxiv.org/abs/2312.00770", "description": "Recurrent events are common in clinical, healthcare, social and behavioral\nstudies. A recent analysis framework for potentially censored recurrent event\ndata is to construct a censored longitudinal data set consisting of times to\nthe first recurrent event in multiple prespecified follow-up windows of length\n$\\tau$. With the staggering number of potential predictors being generated from\ngenetic, -omic, and electronic health records sources, machine learning\napproaches such as the random forest are growing in popularity, as they can\nincorporate information from highly correlated predictors with non-standard\nrelationships. In this paper, we bridge this gap by developing a random forest\napproach for dynamically predicting probabilities of remaining event-free\nduring a subsequent $\\tau$-duration follow-up period from a reconstructed\ncensored longitudinal data set. We demonstrate the increased ability of our\nrandom forest algorithm for predicting the probability of remaining event-free\nover a $\\tau$-duration follow-up period when compared to the recurrent event\nmodeling framework of Xia et al. (2020) in settings where association between\npredictors and recurrent event outcomes is complex in nature. The proposed\nrandom forest algorithm is demonstrated using recurrent exacerbation data from\nthe Azithromycin for the Prevention of Exacerbations of Chronic Obstructive\nPulmonary Disease (Albert et al., 2011)."}, "http://arxiv.org/abs/2109.00160": {"title": "Novel Bayesian method for simultaneous detection of activation signatures and background connectivity for task fMRI data", "link": "http://arxiv.org/abs/2109.00160", "description": "In this paper, we introduce a new Bayesian approach for analyzing task fMRI\ndata that simultaneously detects activation signatures and background\nconnectivity. Our modeling involves a new hybrid tensor spatial-temporal basis\nstrategy that enables scalable computing yet captures nearby and distant\nintervoxel correlation and long-memory temporal correlation. The spatial basis\ninvolves a composite hybrid transform with two levels: the first accounts for\nwithin-ROI correlation, and second between-ROI distant correlation. 
We\ndemonstrate in simulations how our basis space regression modeling strategy\nincreases sensitivity for identifying activation signatures, partly driven by\nthe induced background connectivity that itself can be summarized to reveal\nbiological insights. This strategy leads to computationally scalable fully\nBayesian inference at the voxel or ROI level that adjusts for multiple testing.\nWe apply this model to Human Connectome Project data to reveal insights into\nbrain activation patterns and background connectivity related to working memory\ntasks."}, "http://arxiv.org/abs/2202.06117": {"title": "Metric Statistics: Exploration and Inference for Random Objects With Distance Profiles", "link": "http://arxiv.org/abs/2202.06117", "description": "This article provides an overview of the statistical modeling of complex data\nas increasingly encountered in modern data analysis. It is argued that such\ndata can often be described as elements of a metric space that satisfies\ncertain structural conditions and features a probability measure. We refer to\nthe random elements of such spaces as random objects and to the emerging field\nthat deals with their statistical analysis as metric statistics. Metric\nstatistics provides methodology, theory and visualization tools for the\nstatistical description, quantification of variation, centrality and quantiles,\nregression and inference for populations of random objects for which samples\nare available. In addition to a brief review of current concepts, we focus on\ndistance profiles as a major tool for object data in conjunction with the\npairwise Wasserstein transports of the underlying one-dimensional distance\ndistributions. These pairwise transports lead to the definition of intuitive\nand interpretable notions of transport ranks and transport quantiles as well as\ntwo-sample inference. An associated profile metric complements the original\nmetric of the object space and may reveal important features of the object data\nin data analysis. We demonstrate these tools for the analysis of complex data\nthrough various examples and visualizations."}, "http://arxiv.org/abs/2205.11956": {"title": "Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control", "link": "http://arxiv.org/abs/2205.11956", "description": "Most machine learning methods require tuning of hyper-parameters. For kernel\nridge regression with the Gaussian kernel, the hyper-parameter is the\nbandwidth. The bandwidth specifies the length scale of the kernel and has to be\ncarefully selected to obtain a model with good generalization. The default\nmethods for bandwidth selection, cross-validation and marginal likelihood\nmaximization, often yield good results, albeit at high computational costs.\nInspired by Jacobian regularization, we formulate an approximate expression for\nhow the derivatives of the functions inferred by kernel ridge regression with\nthe Gaussian kernel depend on the kernel bandwidth. We use this expression to\npropose a closed-form, computationally feather-light, bandwidth selection\nheuristic, based on controlling the Jacobian. In addition, the Jacobian\nexpression illuminates how the bandwidth selection is a trade-off between the\nsmoothness of the inferred function and the conditioning of the training data\nkernel matrix. 
We show on real and synthetic data that, compared to\ncross-validation and marginal likelihood maximization, our method is on par in\nterms of model performance, but up to six orders of magnitude faster."}, "http://arxiv.org/abs/2209.00102": {"title": "Bayesian Mixed Multidimensional Scaling for Auditory Processing", "link": "http://arxiv.org/abs/2209.00102", "description": "The human brain distinguishes speech sound categories by representing\nacoustic signals in a latent multidimensional auditory-perceptual space. This\nspace can be statistically constructed using multidimensional scaling, a\ntechnique that can compute lower-dimensional latent features representing the\nspeech signals in such a way that their pairwise distances in the latent space\nclosely resemble the corresponding distances in the observation space. The\ninter-individual and inter-population (e.g., native versus non-native\nlisteners) heterogeneity in such representations is, however, not well\nunderstood. These questions have often been examined using joint analyses that\nignore individual heterogeneity or using separate analyses that cannot\ncharacterize human similarities. Neither extreme, therefore, allows for\nprincipled comparisons between populations and individuals. The focus of the\ncurrent literature has also often been on inference on latent distances between\nthe categories and not on the latent features themselves that make up these\ndistances, which are crucial for our applications. Motivated by these problems, we\ndevelop a novel Bayesian mixed multidimensional scaling method, taking into\naccount the heterogeneity across populations and subjects. We design a Markov\nchain Monte Carlo algorithm for posterior computation. We then recover the\nlatent features using a post-processing scheme applied to the posterior\nsamples. We evaluate the method's empirical performance through synthetic\nexperiments. Applied to a motivating auditory neuroscience study, the method\nprovides novel insights into how biologically interpretable lower-dimensional\nlatent features reconstruct the observed distances between the stimuli and vary\nbetween individuals and their native language experiences."}, "http://arxiv.org/abs/2209.01297": {"title": "Assessing treatment effect heterogeneity in the presence of missing effect modifier data in cluster-randomized trials", "link": "http://arxiv.org/abs/2209.01297", "description": "Understanding whether and how treatment effects vary across subgroups is\ncrucial to inform clinical practice and recommendations. Accordingly, the\nassessment of heterogeneous treatment effects (HTE) based on pre-specified\npotential effect modifiers has become a common goal in modern randomized\ntrials. However, when one or more potential effect modifiers are missing,\ncomplete-case analysis may lead to bias and under-coverage. While statistical\nmethods for handling missing data have been proposed and compared for\nindividually randomized trials with missing effect modifier data, few\nguidelines exist for the cluster-randomized setting, where intracluster\ncorrelations in the effect modifiers, outcomes, or even missingness mechanisms\nmay introduce further threats to accurate assessment of HTE. In this article,\nthe performance of several missing data methods is compared through a\nsimulation study of cluster-randomized trials with continuous outcome and\nmissing binary effect modifier data, and further illustrated using real data\nfrom the Work, Family, and Health Study. 
Our results suggest that multilevel\nmultiple imputation (MMI) and Bayesian MMI have better performance than other\navailable methods, and that Bayesian MMI has lower bias and closer to nominal\ncoverage than standard MMI when there are model specification or compatibility\nissues."}, "http://arxiv.org/abs/2307.06996": {"title": "Spey: smooth inference for reinterpretation studies", "link": "http://arxiv.org/abs/2307.06996", "description": "Statistical models serve as the cornerstone for hypothesis testing in\nempirical studies. This paper introduces a new cross-platform Python-based\npackage designed to utilise different likelihood prescriptions via a flexible\nplug-in system. This framework empowers users to propose, examine, and publish\nnew likelihood prescriptions without developing software infrastructure,\nultimately unifying and generalising different ways of constructing likelihoods\nand employing them for hypothesis testing within a unified platform. We propose\na new simplified likelihood prescription, surpassing previous approximation\naccuracies by incorporating asymmetric uncertainties. Moreover, our package\nfacilitates the integration of various likelihood combination routines, thereby\nbroadening the scope of independent studies through a meta-analysis. By\nremaining agnostic to the source of the likelihood prescription and the signal\nhypothesis generator, our platform allows for the seamless implementation of\npackages with different likelihood prescriptions, fostering compatibility and\ninteroperability."}, "http://arxiv.org/abs/2309.11472": {"title": "Optimizing Dynamic Predictions from Joint Models using Super Learning", "link": "http://arxiv.org/abs/2309.11472", "description": "Joint models for longitudinal and time-to-event data are often employed to\ncalculate dynamic individualized predictions used in numerous applications of\nprecision medicine. Two components of joint models that influence the accuracy\nof these predictions are the shape of the longitudinal trajectories and the\nfunctional form linking the longitudinal outcome history to the hazard of the\nevent. Finding a single well-specified model that produces accurate predictions\nfor all subjects and follow-up times can be challenging, especially when\nconsidering multiple longitudinal outcomes. In this work, we use the concept of\nsuper learning and avoid selecting a single model. In particular, we specify a\nweighted combination of the dynamic predictions calculated from a library of\njoint models with different specifications. The weights are selected to\noptimize a predictive accuracy metric using V-fold cross-validation. We use as\npredictive accuracy measures the expected quadratic prediction error and the\nexpected predictive cross-entropy. In a simulation study, we found that the\nsuper learning approach produces results very similar to the Oracle model,\nwhich was the model with the best performance in the test datasets. All\nproposed methodology is implemented in the freely available R package JMbayes2."}, "http://arxiv.org/abs/2312.00963": {"title": "Spatiotemporal Transformer for Imputing Sparse Data: A Deep Learning Approach", "link": "http://arxiv.org/abs/2312.00963", "description": "Effective management of environmental resources and agricultural\nsustainability heavily depends on accurate soil moisture data. 
However,\ndatasets like the SMAP/Sentinel-1 soil moisture product often contain missing\nvalues across their spatiotemporal grid, which poses a significant challenge.\nThis paper introduces a novel Spatiotemporal Transformer model (ST-Transformer)\nspecifically designed to address the issue of missing values in sparse\nspatiotemporal datasets, particularly focusing on soil moisture data. The\nST-Transformer employs multiple spatiotemporal attention layers to capture the\ncomplex spatiotemporal correlations in the data and can integrate additional\nspatiotemporal covariates during the imputation process, thereby enhancing its\naccuracy. The model is trained using a self-supervised approach, enabling it to\nautonomously predict missing values from observed data points. Our model's\nefficacy is demonstrated through its application to the SMAP 1km soil moisture\ndata over a 36 x 36 km grid in Texas. It showcases superior accuracy compared\nto well-known imputation methods. Additionally, our simulation studies on other\ndatasets highlight the model's broader applicability in various spatiotemporal\nimputation tasks."}, "http://arxiv.org/abs/2312.01146": {"title": "Bayesian models are better than frequentist models in identifying differences in small datasets comprising phonetic data", "link": "http://arxiv.org/abs/2312.01146", "description": "While many studies have previously conducted direct comparisons between\nresults obtained from frequentist and Bayesian models, our research introduces\na novel perspective by examining these models in the context of a small dataset\ncomprising phonetic data. Specifically, we employed mixed-effects models and\nBayesian regression models to explore differences between monolingual and\nbilingual populations in the acoustic values of produced vowels. Our findings\nrevealed that Bayesian hypothesis testing exhibited superior accuracy in\nidentifying evidence for differences compared to the posthoc test, which tended\nto underestimate the existence of such differences. These results align with a\nsubstantial body of previous research highlighting the advantages of Bayesian\nover frequentist models, thereby emphasizing the need for methodological\nreform. In conclusion, our study supports the assertion that Bayesian models\nare more suitable for investigating differences in small datasets of phonetic\nand/or linguistic data, suggesting that researchers in these fields may find\ngreater reliability in utilizing such models for their analyses."}, "http://arxiv.org/abs/2312.01168": {"title": "MacroPARAFAC for handling rowwise and cellwise outliers in incomplete multi-way data", "link": "http://arxiv.org/abs/2312.01168", "description": "Multi-way data extend two-way matrices into higher-dimensional tensors, often\nexplored through dimensional reduction techniques. In this paper, we study the\nParallel Factor Analysis (PARAFAC) model for handling multi-way data,\nrepresenting it more compactly through a concise set of loading matrices and\nscores. We assume that the data may be incomplete and could contain both\nrowwise and cellwise outliers, signifying cases that deviate from the majority\nand outlying cells dispersed throughout the data array. To address these\nchallenges, we present a novel algorithm designed to robustly estimate both\nloadings and scores. Additionally, we introduce an enhanced outlier map to\ndistinguish various patterns of outlying behavior. 
Through simulations and the\nanalysis of fluorescence Excitation-Emission Matrix (EEM) data, we demonstrate\nthe robustness of our approach. Our results underscore the effectiveness of\ndiagnostic tools in identifying and interpreting unusual patterns within the\ndata."}, "http://arxiv.org/abs/2312.01210": {"title": "When accurate prediction models yield harmful self-fulfilling prophecies", "link": "http://arxiv.org/abs/2312.01210", "description": "Prediction models are popular in medical research and practice. By predicting\nan outcome of interest for specific patients, these models may help inform\ndifficult treatment decisions, and are often hailed as the poster children for\npersonalized, data-driven healthcare.\n\nWe show, however, that using prediction models for decision making can lead to\nharmful decisions, even when the predictions exhibit good discrimination after\ndeployment. These models are harmful self-fulfilling prophecies: their\ndeployment harms a group of patients but the worse outcome of these patients\ndoes not invalidate the predictive power of the model. Our main result is a\nformal characterization of a set of such prediction models. Next, we show that\nmodels that are well calibrated before and after deployment are useless for\ndecision making as they make no change in the data distribution. These results\npoint to the need to revise standard practices for validation, deployment and\nevaluation of prediction models that are used in medical decisions."}, "http://arxiv.org/abs/2312.01238": {"title": "A deep learning pipeline for cross-sectional and longitudinal multiview data integration", "link": "http://arxiv.org/abs/2312.01238", "description": "Biomedical research now commonly integrates diverse data types or views from\nthe same individuals to better understand the pathobiology of complex diseases,\nbut the challenge lies in meaningfully integrating these diverse views.\nExisting methods often require the same type of data from all views\n(cross-sectional data only or longitudinal data only) or do not consider any\nclass outcome in the integration method, presenting limitations. To overcome\nthese limitations, we have developed a pipeline that harnesses the power of\nstatistical and deep learning methods to integrate cross-sectional and\nlongitudinal data from multiple sources. Additionally, it identifies key\nvariables contributing to the association between views and the separation\namong classes, providing deeper biological insights. This pipeline includes\nvariable selection/ranking using linear and nonlinear methods, feature\nextraction using functional principal component analysis and Euler\ncharacteristics, and joint integration and classification using dense\nfeed-forward networks and recurrent neural networks. We applied this pipeline\nto cross-sectional and longitudinal multi-omics data (metagenomics,\ntranscriptomics, and metabolomics) from an inflammatory bowel disease (IBD)\nstudy, and we identified microbial pathways, metabolites, and genes that\ndiscriminate by IBD status, providing information on the etiology of IBD. We\nconducted simulations to compare the two feature extraction methods. 
The\nproposed pipeline is available from the following GitHub repository:\nhttps://github.com/lasandrall/DeepIDA-GRU."}, "http://arxiv.org/abs/2312.01265": {"title": "Concentration of Randomized Functions of Uniformly Bounded Variation", "link": "http://arxiv.org/abs/2312.01265", "description": "A sharp, distribution free, non-asymptotic result is proved for the\nconcentration of a random function around the mean function, when the\nrandomization is generated by a finite sequence of independent data and the\nrandom functions satisfy uniform bounded variation assumptions. The specific\nmotivation for the work comes from the need for inference on the distributional\nimpacts of social policy intervention. However, the family of randomized\nfunctions that we study is broad enough to cover wide-ranging applications. For\nexample, we provide a Kolmogorov-Smirnov like test for randomized functions\nthat are almost surely Lipschitz continuous, and novel tools for inference with\nheterogeneous treatment effects. A Dvoretzky-Kiefer-Wolfowitz like inequality\nis also provided for the sum of almost surely monotone random functions,\nextending the famous non-asymptotic work of Massart for empirical cumulative\ndistribution functions generated by i.i.d. data, to settings without\nmicro-clusters proposed by Canay, Santos, and Shaikh. We illustrate the\nrelevance of our theoretical results for applied work via empirical\napplications. Notably, the proof of our main concentration result relies on a\nnovel stochastic rendition of the fundamental result of Debreu, generally\ndubbed the \"gap lemma,\" that transforms discontinuous utility representations\nof preorders into continuous utility representations, and on an envelope\ntheorem of an infinite dimensional optimisation problem that we carefully\nconstruct."}, "http://arxiv.org/abs/2312.01266": {"title": "A unified framework for covariate adjustment under stratified randomization", "link": "http://arxiv.org/abs/2312.01266", "description": "Randomization, as a key technique in clinical trials, can eliminate sources\nof bias and produce comparable treatment groups. In randomized experiments, the\ntreatment effect is a parameter of general interest. Researchers have explored\nthe validity of using linear models to estimate the treatment effect and\nperform covariate adjustment and thus improve the estimation efficiency.\nHowever, the relationship between covariates and outcomes is not necessarily\nlinear, and is often intricate. Advances in statistical theory and related\ncomputer technology allow us to use nonparametric and machine learning methods\nto better estimate the relationship between covariates and outcomes and thus\nobtain further efficiency gains. However, theoretical studies on how to draw\nvalid inferences when using nonparametric and machine learning methods under\nstratified randomization are yet to be conducted. In this paper, we discuss a\nunified framework for covariate adjustment and corresponding statistical\ninference under stratified randomization and present a detailed proof of the\nvalidity of using local linear kernel-weighted least squares regression for\ncovariate adjustment in treatment effect estimators as a special case. In the\ncase of high-dimensional data, we additionally propose an algorithm for\nstatistical inference using machine learning methods under stratified\nrandomization, which makes use of sample splitting to alleviate the\nrequirements on the asymptotic properties of machine learning methods. 
Finally,\nwe compare the performances of treatment effect estimators using different\nmachine learning methods by considering various data generation scenarios, to\nguide practical research."}, "http://arxiv.org/abs/2312.01379": {"title": "Relation between PLS and OLS regression in terms of the eigenvalue distribution of the regressor covariance matrix", "link": "http://arxiv.org/abs/2312.01379", "description": "Partial least squares (PLS) is a dimensionality reduction technique\nintroduced in the field of chemometrics and successfully employed in many other\nareas. The PLS components are obtained by maximizing the covariance between\nlinear combinations of the regressors and of the target variables. In this\nwork, we focus on its application to scalar regression problems. PLS regression\nconsists in finding the least squares predictor that is a linear combination of\na subset of the PLS components. Alternatively, PLS regression can be formulated\nas a least squares problem restricted to a Krylov subspace. This equivalent\nformulation is employed to analyze the distance between\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$, the PLS\nestimator of the vector of coefficients of the linear regression model based on\n$L$ PLS components, and $\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$, the one\nobtained by ordinary least squares (OLS), as a function of $L$. Specifically,\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$ is the\nvector of coefficients in the aforementioned Krylov subspace that is closest to\n$\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$ in terms of the Mahalanobis distance\nwith respect to the covariance matrix of the OLS estimate. We provide a bound\non this distance that depends only on the distribution of the eigenvalues of\nthe regressor covariance matrix. Numerical examples on synthetic and real-world\ndata are used to illustrate how the distance between\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$ and\n$\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$ depends on the number of clusters in\nwhich the eigenvalues of the regressor covariance matrix are grouped."}, "http://arxiv.org/abs/2312.01411": {"title": "Bayesian inference on Cox regression models using catalytic prior distributions", "link": "http://arxiv.org/abs/2312.01411", "description": "The Cox proportional hazards model (Cox model) is a popular model for\nsurvival data analysis. When the sample size is small relative to the dimension\nof the model, the standard maximum partial likelihood inference is often\nproblematic. In this work, we propose the Cox catalytic prior distributions for\nBayesian inference on Cox models, which is an extension of a general class of\nprior distributions originally designed for stabilizing complex parametric\nmodels. The Cox catalytic prior is formulated as a weighted likelihood of the\nregression coefficients based on synthetic data and a surrogate baseline hazard\nconstant. This surrogate hazard can be either provided by the user or estimated\nfrom the data, and the synthetic data are generated from the predictive\ndistribution of a fitted simpler model. For point estimation, we derive an\napproximation of the marginal posterior mode, which can be computed\nconveniently as a regularized log partial likelihood estimator. We prove that\nour prior distribution is proper and the resulting estimator is consistent\nunder mild conditions. 
In simulation studies, our proposed method outperforms\nstandard maximum partial likelihood inference and is on par with existing\nshrinkage methods. We further illustrate the application of our method to a\nreal dataset."}, "http://arxiv.org/abs/2312.01457": {"title": "Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits", "link": "http://arxiv.org/abs/2312.01457", "description": "Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing\nnew policies using existing data without costly experimentation. However,\ncurrent OPE methods, such as Inverse Probability Weighting (IPW) and Doubly\nRobust (DR) estimators, suffer from high variance, particularly in cases of low\noverlap between target and behavior policies or large action and context\nspaces. In this paper, we introduce a new OPE estimator for contextual bandits,\nthe Marginal Ratio (MR) estimator, which focuses on the shift in the marginal\ndistribution of outcomes $Y$ instead of the policies themselves. Through\nrigorous theoretical analysis, we demonstrate the benefits of the MR estimator\ncompared to conventional methods like IPW and DR in terms of variance\nreduction. Additionally, we establish a connection between the MR estimator and\nthe state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator,\nproving that MR achieves lower variance among a generalized family of MIPS\nestimators. We further illustrate the utility of the MR estimator in causal\ninference settings, where it exhibits enhanced performance in estimating\nAverage Treatment Effects (ATE). Our experiments on synthetic and real-world\ndatasets corroborate our theoretical findings and highlight the practical\nadvantages of the MR estimator in OPE for contextual bandits."}, "http://arxiv.org/abs/2312.01496": {"title": "Large-Scale Correlation Screening under Dependence for Brain Functional Connectivity Network Inference", "link": "http://arxiv.org/abs/2312.01496", "description": "Data produced by resting-state functional Magnetic Resonance Imaging are\nwidely used to infer brain functional connectivity networks. Such networks\ncorrelate neural signals to connect brain regions, which consist of groups of\ndependent voxels. Previous work has focused on aggregating data across voxels\nwithin predefined regions. However, the presence of within-region correlations\nhas noticeable impacts on inter-regional correlation detection, and thus edge\nidentification. To alleviate these impacts, we propose to leverage techniques from the\nlarge-scale correlation screening literature, and derive simple and practical\ncharacterizations of the mean number of correlation discoveries that flexibly\nincorporate intra-regional dependence structures. A connectivity network\ninference framework is then presented. First, inter-regional correlation\ndistributions are estimated. Then, correlation thresholds that can be tailored\nto one's application are constructed for each edge. Finally, the proposed\nframework is implemented on synthetic and real-world datasets. This novel\napproach for handling arbitrary intra-regional correlation is shown to limit\nfalse positives while improving true positive rates."}, "http://arxiv.org/abs/2312.01692": {"title": "Risk-Controlling Model Selection via Guided Bayesian Optimization", "link": "http://arxiv.org/abs/2312.01692", "description": "Adjustable hyperparameters of machine learning models typically impact\nvarious key trade-offs such as accuracy, fairness, robustness, or inference\ncost. 
Our goal in this paper is to find a configuration that adheres to\nuser-specified limits on certain risks while being useful with respect to other\nconflicting metrics. We solve this by combining Bayesian Optimization (BO) with\nrigorous risk-controlling procedures, where our core idea is to steer BO\ntowards an efficient testing strategy. Our BO method identifies a set of Pareto\noptimal configurations residing in a designated region of interest. The\nresulting candidates are statistically verified and the best-performing\nconfiguration is selected with guaranteed risk levels. We demonstrate the\neffectiveness of our approach on a range of tasks with multiple desiderata,\nincluding low error rates, equitable predictions, handling spurious\ncorrelations, managing rate and distortion in generative models, and reducing\ncomputational costs."}, "http://arxiv.org/abs/2312.01723": {"title": "Group Sequential Design Under Non-proportional Hazards", "link": "http://arxiv.org/abs/2312.01723", "description": "Non-proportional hazards (NPH) are often observed in clinical trials with\ntime-to-event endpoints. A common example is a long-term clinical trial with a\ndelayed treatment effect in immunotherapy for cancer. When designing clinical\ntrials with time-to-event endpoints, it is crucial to consider NPH scenarios to\ngain a complete understanding of design operating characteristics. In this\npaper, we focus on group sequential design for three NPH methods: the average\nhazard ratio, the weighted logrank test, and the MaxCombo combination test. For\neach of these approaches, we provide analytic forms of design characteristics\nthat facilitate sample size calculation and bound derivation for group\nsequential designs. Examples are provided to illustrate the proposed methods.\nTo facilitate statisticians in designing and comparing group sequential designs\nunder NPH, we have implemented the group sequential design methodology in the\ngsDesign2 R package at https://cran.r-project.org/web/packages/gsDesign2/."}, "http://arxiv.org/abs/2312.01735": {"title": "Weighted Q-learning for optimal dynamic treatment regimes with MNAR covariates", "link": "http://arxiv.org/abs/2312.01735", "description": "Dynamic treatment regimes (DTRs) formalize medical decision-making as a\nsequence of rules for different stages, mapping patient-level information to\nrecommended treatments. In practice, estimating an optimal DTR using\nobservational data from electronic medical record (EMR) databases can be\ncomplicated by covariates that are missing not at random (MNAR) due to\ninformative monitoring of patients. Since complete case analysis can result in\nconsistent estimation of outcome model parameters under the assumption of\noutcome-independent missingness \citep{Yang_Wang_Ding_2019}, Q-learning is a\nnatural approach to accommodating MNAR covariates. However, the backward\ninduction algorithm used in Q-learning can introduce complications, as MNAR\ncovariates at later stages can result in MNAR pseudo-outcomes at earlier\nstages, leading to suboptimal DTRs, even if outcome variables are fully\nobserved. To address this unique missing data problem in DTR settings, we\npropose two weighted Q-learning approaches where inverse probability weights\nfor missingness of the pseudo-outcomes are obtained through estimating\nequations with valid nonresponse instrumental variables or sensitivity\nanalysis. 
Asymptotic properties of the weighted Q-learning estimators are\nderived and the finite-sample performance of the proposed methods is evaluated\nand compared with alternative methods through extensive simulation studies.\nUsing EMR data from the Medical Information Mart for Intensive Care database,\nwe apply the proposed methods to investigate the optimal fluid strategy for\nsepsis patients in intensive care units."}, "http://arxiv.org/abs/2312.01815": {"title": "Hypothesis Testing in Gaussian Graphical Models: Novel Goodness-of-Fit Tests and Conditional Randomization Tests", "link": "http://arxiv.org/abs/2312.01815", "description": "We introduce novel hypothesis testing methods for Gaussian graphical models,\nwhose foundation is an innovative algorithm that generates exchangeable copies\nfrom these models. We utilize the exchangeable copies to formulate a\ngoodness-of-fit test, which is valid in both low and high-dimensional settings\nand flexible in choosing the test statistic. This test exhibits superior power\nperformance, especially in scenarios where the true precision matrix violates\nthe null hypothesis with many small entries. Furthermore, we adapt the sampling\nalgorithm for constructing a new conditional randomization test for the\nconditional independence between a response $Y$ and a vector of covariates $X$\ngiven some other variables $Z$. Thanks to the model-X framework, this test does\nnot require any modeling assumption about $Y$ and can utilize test statistics\nfrom advanced models. It also relaxes the assumptions of conditional\nrandomization tests by allowing the number of unknown parameters of the\ndistribution of $X$ to be much larger than the sample size. For both of our\ntesting procedures, we propose several test statistics and conduct\ncomprehensive simulation studies to demonstrate their superior performance in\ncontrolling the Type-I error and achieving high power. The usefulness of our\nmethods is further demonstrated through three real-world applications."}, "http://arxiv.org/abs/2312.01870": {"title": "Extreme-value modelling of migratory bird arrival dates: Insights from citizen science data", "link": "http://arxiv.org/abs/2312.01870", "description": "Citizen science mobilises many observers and gathers huge datasets but often\nwithout strict sampling protocols, which results in observation biases due to\nheterogeneity in sampling effort that can lead to biased statistical\ninferences. We develop a spatiotemporal Bayesian hierarchical model for\nbias-corrected estimation of arrival dates of the first migratory bird\nindividuals at a breeding site. Higher sampling effort could be correlated with\nearlier observed dates. We implement data fusion of two citizen-science\ndatasets with sensibly different protocols (BBS, eBird) and map posterior\ndistributions of the latent process, which contains four spatial components\nwith Gaussian process priors: species niche; sampling effort; position and\nscale parameters of annual first date of arrival. The data layer includes four\nresponse variables: counts of observed eBird locations (Poisson);\npresence-absence at observed eBird locations (Binomial); BBS occurrence counts\n(Poisson); first arrival dates (Generalized Extreme-Value). We devise a Markov\nChain Monte Carlo scheme and check by simulation that the latent process\ncomponents are identifiable. We apply our model to several migratory bird\nspecies in the northeastern US for 2001--2021. 
The sampling effort is shown to\nsignificantly modulate the observed first arrival date. We exploit this\nrelationship to effectively debias predictions of the true first arrival dates."}, "http://arxiv.org/abs/2312.01925": {"title": "Coefficient Shape Alignment in Multivariate Functional Regression", "link": "http://arxiv.org/abs/2312.01925", "description": "In multivariate functional data analysis, different functional covariates can\nbe homogeneous in some sense. The hidden homogeneity structure is informative\nabout the connectivity or association of different covariates. The covariates\nwith pronounced homogeneity can be analyzed jointly in the same group and this\ngives rise to a way of parsimoniously modeling multivariate functional data. In\nthis paper, we develop a multivariate functional regression technique by a new\nregularization approach termed \"coefficient shape alignment\" to tackle the\npotential homogeneity of different functional covariates. The modeling\nprocedure includes two main steps: first the unknown grouping structure is\ndetected with a new regularization approach to aggregate covariates into\ndisjoint groups; and then a grouped multivariate functional regression model is\nestablished based on the detected grouping structure. In this new grouped\nmodel, the coefficient functions of covariates in the same homogeneous group\nshare the same shape invariant to scaling. The new regularization approach\nbuilds on penalizing the discrepancy of coefficient shape. The consistency\nproperty of the detected grouping structure is thoroughly investigated, and the\nconditions that guarantee uncovering the underlying true grouping structure are\ndeveloped. The asymptotic properties of the model estimates are also developed.\nExtensive simulation studies are conducted to investigate the finite-sample\nproperties of the developed methods. The practical utility of the proposed\nmethods is illustrated in an analysis of sugar quality evaluation. This work\nprovides a novel means for analyzing the underlying homogeneity of functional\ncovariates and developing parsimonious model structures for multivariate\nfunctional data."}, "http://arxiv.org/abs/2312.01944": {"title": "New Methods for Network Count Time Series", "link": "http://arxiv.org/abs/2312.01944", "description": "The original generalized network autoregressive models are poor for modelling\ncount data as they are based on the additive and constant noise assumptions,\nwhich are usually inappropriate for count data. We introduce two new models\n(GNARI and NGNAR) for count network time series by adapting and extending\nexisting count-valued time series models. We present results on the statistical\nand asymptotic properties of our new models and their estimates obtained by\nconditional least squares and maximum likelihood. We conduct two simulation\nstudies that verify successful parameter estimation for both models and conduct\na further study that shows, for negative network parameters, that our NGNAR\nmodel outperforms existing models and our other GNARI model in terms of\npredictive performance. 
We model a network time series constructed from\nCOVID-positive counts for counties in New York State during 2020-22 and show\nthat our new models perform considerably better than existing methods for this\nproblem."}, "http://arxiv.org/abs/2312.01969": {"title": "FDR Control for Online Anomaly Detection", "link": "http://arxiv.org/abs/2312.01969", "description": "The goal of anomaly detection is to identify observations generated by a\nprocess that is different from a reference one. An accurate anomaly detector\nmust ensure low false positive and false negative rates. However, in the online\ncontext, such a constraint remains highly challenging due to the usual lack of\ncontrol of the False Discovery Rate (FDR). In particular, the online framework\nmakes it impossible to use classical multiple testing approaches such as the\nBenjamini-Hochberg (BH) procedure. Our strategy overcomes this difficulty by\nexploiting a local control of the ``modified FDR'' (mFDR). An important\ningredient in this control is the cardinality of the calibration set used for\ncomputing empirical $p$-values, which turns out to be an influential parameter.\nThis leads to a new strategy for tuning this parameter, which yields the desired\nFDR control over the whole time series. The statistical performance of this\nstrategy is analyzed by theoretical guarantees and its practical behavior is\nassessed by simulation experiments which support our conclusions."}, "http://arxiv.org/abs/2312.02110": {"title": "Fourier Methods for Sufficient Dimension Reduction in Time Series", "link": "http://arxiv.org/abs/2312.02110", "description": "Dimensionality reduction has always been one of the most significant and\nchallenging problems in the analysis of high-dimensional data. In the context\nof time series analysis, our focus is on the estimation and inference of\nconditional mean and variance functions. By using central mean and variance\ndimension reduction subspaces that preserve sufficient information about the\nresponse, one can effectively estimate the unknown mean and variance functions\nof the time series. While the literature presents several approaches to\nestimate the time series central mean and variance subspaces (TS-CMS and\nTS-CVS), these methods tend to be computationally intensive and infeasible for\npractical applications. By employing the Fourier transform, we derive explicit\nestimators for TS-CMS and TS-CVS. These proposed estimators are demonstrated to\nbe consistent, asymptotically normal, and efficient. Simulation studies have\nbeen conducted to evaluate the performance of the proposed method. The results\nshow that our method is significantly more accurate and computationally\nefficient than existing methods. Furthermore, the method has been applied to\nthe Canadian Lynx dataset."}, "http://arxiv.org/abs/2202.03852": {"title": "Nonlinear Network Autoregression", "link": "http://arxiv.org/abs/2202.03852", "description": "We study general nonlinear models for time series networks of integer and\ncontinuous valued data. The vector of high dimensional responses, measured on\nthe nodes of a known network, is regressed non-linearly on its lagged value and\non lagged values of the neighboring nodes by employing a smooth link function.\nWe study stability conditions for such a multivariate process and develop quasi\nmaximum likelihood inference when the network dimension is increasing. In\naddition, we study linearity score tests by treating separately the cases of\nidentifiable and non-identifiable parameters. 
In the case of identifiability,\nthe test statistic converges to a chi-square distribution. When the parameters\nare non-identifiable, we develop a supremum-type test whose p-values are\napproximated adequately by employing a feasible bound and bootstrap\nmethodology. Simulations and data examples further support our findings."}, "http://arxiv.org/abs/2202.10887": {"title": "Policy Evaluation for Temporal and/or Spatial Dependent Experiments", "link": "http://arxiv.org/abs/2202.10887", "description": "The aim of this paper is to establish a causal link between the policies\nimplemented by technology companies and the outcomes they yield within\nintricate temporal and/or spatial dependent experiments. We propose a novel\ntemporal/spatio-temporal Varying Coefficient Decision Process (VCDP) model,\ncapable of effectively capturing the evolving treatment effects in situations\ncharacterized by temporal and/or spatial dependence. Our methodology\nencompasses the decomposition of the Average Treatment Effect (ATE) into the\nDirect Effect (DE) and the Indirect Effect (IE). We subsequently devise\ncomprehensive procedures for estimating and making inferences about both DE and\nIE. Additionally, we provide a rigorous analysis of the statistical properties\nof these procedures, such as asymptotic power. To substantiate the\neffectiveness of our approach, we carry out extensive simulations and real data\nanalyses."}, "http://arxiv.org/abs/2206.13091": {"title": "Informed censoring: the parametric combination of data and expert information", "link": "http://arxiv.org/abs/2206.13091", "description": "The statistical censoring setup is extended to the situation when random\nmeasures can be assigned to the realization of datapoints, leading to a new way\nof incorporating expert information into the usual parametric estimation\nprocedures. The asymptotic theory is provided for the resulting estimators, and\nsome special cases of practical relevance are studied in more detail. Although\nthe proposed framework mathematically generalizes censoring and coarsening at\nrandom, and borrows techniques from M-estimation theory, it provides a novel\nand transparent methodology which enjoys significant practical applicability in\nsituations where expert information is present. The potential of the approach\nis illustrated by a concrete actuarial application of tail parameter estimation\nfor a heavy-tailed MTPL dataset with limited available expert information."}, "http://arxiv.org/abs/2208.11665": {"title": "Statistical exploration of the Manifold Hypothesis", "link": "http://arxiv.org/abs/2208.11665", "description": "The Manifold Hypothesis is a widely accepted tenet of Machine Learning which\nasserts that nominally high-dimensional data are in fact concentrated near a\nlow-dimensional manifold, embedded in high-dimensional space. This phenomenon\nis observed empirically in many real world situations, has led to the development\nof a wide range of statistical methods in the last few decades, and has been\nsuggested as a key factor in the success of modern AI technologies. We show\nthat rich and sometimes intricate manifold structure in data can emerge from a\ngeneric and remarkably simple statistical model -- the Latent Metric Model --\nvia elementary concepts such as latent variables, correlation and stationarity.\nThis establishes a general statistical explanation for why the Manifold\nHypothesis seems to hold in so many situations. 
Informed by the Latent Metric\nModel, we derive procedures to discover and interpret the geometry of\nhigh-dimensional data, and explore hypotheses about the data generating\nmechanism. These procedures operate under minimal assumptions and make use of\nwell-known, scalable graph-analytic algorithms."}, "http://arxiv.org/abs/2210.10852": {"title": "BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models", "link": "http://arxiv.org/abs/2210.10852", "description": "Two linearly uncorrelated binary variables must also be independent because\nnon-linear dependence cannot manifest with only two possible states. This\ninherent linearity is the atom of dependency constituting any complex form of\nrelationship. Inspired by this observation, we develop a framework called\nbinary expansion linear effect (BELIEF) for understanding arbitrary\nrelationships with a binary outcome. Models from the BELIEF framework are\neasily interpretable because they describe the association of binary variables\nin the language of linear models, yielding convenient theoretical insight and\nstriking Gaussian parallels. With BELIEF, one may study generalized linear\nmodels (GLM) through transparent linear models, providing insight into how the\nchoice of link affects modeling. For example, setting a GLM interaction\ncoefficient to zero does not necessarily lead to the kind of no-interaction\nmodel assumption as understood under their linear model counterparts.\nFurthermore, for a binary response, maximum likelihood estimation for GLMs\nparadoxically fails under complete separation, when the data are most\ndiscriminative, whereas BELIEF estimation automatically reveals the perfect\npredictor in the data that is responsible for complete separation. We explore\nthese phenomena and provide related theoretical results. We also provide\npreliminary empirical demonstration of some theoretical results."}, "http://arxiv.org/abs/2302.03750": {"title": "Linking convolutional kernel size to generalization bias in face analysis CNNs", "link": "http://arxiv.org/abs/2302.03750", "description": "Training dataset biases are by far the most scrutinized factors when\nexplaining algorithmic biases of neural networks. In contrast, hyperparameters\nrelated to the neural network architecture have largely been ignored even\nthough different network parameterizations are known to induce different\nimplicit biases over learned features. For example, convolutional kernel size\nis known to affect the frequency content of features learned in CNNs. In this\nwork, we present a causal framework for linking an architectural hyperparameter\nto out-of-distribution algorithmic bias. Our framework is experimental, in that\nwe train several versions of a network with an intervention to a specific\nhyperparameter, and measure the resulting causal effect of this choice on\nperformance bias when a particular out-of-distribution image perturbation is\napplied. In our experiments, we focused on measuring the causal relationship\nbetween convolutional kernel size and face analysis classification bias across\ndifferent subpopulations (race/gender), with respect to high-frequency image\ndetails. 
We show that modifying kernel size, even in one layer of a CNN,\nchanges the frequency content of learned features significantly across data\nsubgroups, leading to biased generalization performance even in the presence of\na balanced dataset."}, "http://arxiv.org/abs/2303.10808": {"title": "Dimension-agnostic Change Point Detection", "link": "http://arxiv.org/abs/2303.10808", "description": "Change point testing for high-dimensional data has attracted a lot of\nattention in statistics and machine learning owing to the emergence of\nhigh-dimensional data with structural breaks from many fields. In practice,\nwhen the dimension is less than the sample size but is not small, it is often\nunclear whether a method that is tailored to high-dimensional data or simply a\nclassical method that is developed and justified for low-dimensional data is\npreferred. In addition, the methods designed for low-dimensional data may not\nwork well in the high-dimensional environment and vice versa. In this paper, we\npropose a dimension-agnostic testing procedure targeting a single change point\nin the mean of a multivariate time series. Specifically, we can show that the\nlimiting null distribution for our test statistic is the same regardless of the\ndimensionality and the magnitude of cross-sectional dependence. The power\nanalysis is also conducted to understand the large sample behavior of the\nproposed test. Through Monte Carlo simulations and a real data illustration, we\ndemonstrate that the finite sample results strongly corroborate the theory and\nsuggest that the proposed test can be used as a benchmark for change-point\ndetection of time series of low, medium, and high dimensions."}, "http://arxiv.org/abs/2306.13870": {"title": "Post-Selection Inference for the Cox Model with Interval-Censored Data", "link": "http://arxiv.org/abs/2306.13870", "description": "We develop a post-selection inference method for the Cox proportional hazards\nmodel with interval-censored data, which provides asymptotically valid p-values\nand confidence intervals conditional on the model selected by lasso. The method\nis based on a pivotal quantity that is shown to converge to a uniform\ndistribution under local alternatives. The proof can be adapted to many other\nregression models, which is illustrated by the extension to generalized linear\nmodels and the Cox model with right-censored data. Our method involves\nestimation of the efficient information matrix, for which several approaches\nare proposed with proof of their consistency. Thorough simulation studies show\nthat our method has satisfactory performance in samples of modest sizes. The\nutility of the method is illustrated via an application to an Alzheimer's\ndisease study."}, "http://arxiv.org/abs/2307.15330": {"title": "Group integrative dynamic factor models with application to multiple subject brain connectivity", "link": "http://arxiv.org/abs/2307.15330", "description": "This work introduces a novel framework for dynamic factor model-based data\nintegration of multiple subjects' time series data, called GRoup Integrative\nDYnamic factor (GRIDY) models. 
The framework identifies and characterizes\ninter-subject differences between two pre-labeled groups by considering a\ncombination of group spatial information and individual temporal dependence.\nFurthermore, it enables the identification of intra-subject differences over\ntime by employing different model configurations for each subject.\nMethodologically, the framework combines a novel principal angle-based rank\nselection algorithm and a non-iterative integrative analysis framework.\nInspired by simultaneous component analysis, this approach also reconstructs\nidentifiable latent factor series with flexible covariance structures. The\nperformance of the GRIDY models is evaluated through simulations conducted\nunder various scenarios. An application is also presented to compare\nresting-state functional MRI data collected from multiple subjects in the\nAutism Spectrum Disorder and control groups."}, "http://arxiv.org/abs/2308.11138": {"title": "NLP-based detection of systematic anomalies among the narratives of consumer complaints", "link": "http://arxiv.org/abs/2308.11138", "description": "We develop an NLP-based procedure for detecting systematic nonmeritorious\nconsumer complaints, simply called systematic anomalies, among complaint\nnarratives. While classification algorithms are used to detect pronounced\nanomalies, in the case of smaller and frequent systematic anomalies, the\nalgorithms may falter due to a variety of reasons, including technical ones as\nwell as natural limitations of human analysts. Therefore, as the next step\nafter classification, we convert the complaint narratives into quantitative\ndata, which are then analyzed using an algorithm for detecting systematic\nanomalies. We illustrate the entire procedure using complaint narratives from\nthe Consumer Complaint Database of the Consumer Financial Protection Bureau."}, "http://arxiv.org/abs/2310.00864": {"title": "Multi-Label Residual Weighted Learning for Individualized Combination Treatment Rule", "link": "http://arxiv.org/abs/2310.00864", "description": "Individualized treatment rules (ITRs) have been widely applied in many fields\nsuch as precision medicine and personalized marketing. Beyond the extensive\nstudies on ITR for binary or multiple treatments, there is considerable\ninterest in applying combination treatments. This paper introduces a novel ITR\nestimation method for combination treatments incorporating interaction effects\namong treatments. Specifically, we propose the generalized $\\psi$-loss as a\nnon-convex surrogate in the residual weighted learning framework, offering\ndesirable statistical and computational properties. Statistically, the\nminimizer of the proposed surrogate loss is Fisher-consistent with the optimal\ndecision rules, incorporating interaction effects at any intensity level - a\nsignificant improvement over existing methods. Computationally, the proposed\nmethod applies the difference-of-convex algorithm for efficient computation.\nThrough simulation studies and real-world data applications, we demonstrate the\nsuperior performance of the proposed method in recommending combination\ntreatments."}, "http://arxiv.org/abs/2312.02167": {"title": "Uncertainty Quantification in Machine Learning Based Segmentation: A Post-Hoc Approach for Left Ventricle Volume Estimation in MRI", "link": "http://arxiv.org/abs/2312.02167", "description": "Recent studies have confirmed that cardiovascular diseases remain responsible for\nthe highest death toll amongst non-communicable diseases. 
Accurate left ventricular\n(LV) volume estimation is critical for valid diagnosis and management of\nvarious cardiovascular conditions, but poses a significant challenge due to\ninherent uncertainties associated with segmentation algorithms in magnetic\nresonance imaging (MRI). Recent machine learning advancements, particularly\nU-Net-like convolutional networks, have facilitated automated segmentation for\nmedical images, but struggle under certain pathologies and/or different\nscanner vendors and imaging protocols. This study proposes a novel methodology\nfor post-hoc uncertainty estimation in LV volume prediction using It\^{o}\nstochastic differential equations (SDEs) to model the path-wise behavior of the\nprediction error. The model describes the area of the left ventricle along the\nheart's long axis. The method is agnostic to the underlying segmentation\nalgorithm, facilitating its use with various existing and future segmentation\ntechnologies. The proposed approach provides a mechanism for quantifying\nuncertainty, enabling medical professionals to intervene for unreliable\npredictions. This is of utmost importance in critical applications such as\nmedical diagnosis, where prediction accuracy and reliability can directly\nimpact patient outcomes. The method is also robust to dataset changes, enabling\napplication for medical centers with limited access to labeled data. Our\nfindings highlight the proposed uncertainty estimation methodology's potential\nto enhance automated segmentation robustness and generalizability, paving the\nway for more reliable and accurate LV volume estimation in clinical settings as\nwell as opening new avenues for uncertainty quantification in biomedical image\nsegmentation, providing promising directions for future research."}, "http://arxiv.org/abs/2312.02177": {"title": "Entropy generating function for past lifetime and its properties", "link": "http://arxiv.org/abs/2312.02177", "description": "The past entropy is considered as an uncertainty measure for the past\nlifetime distribution. The generating function approach to entropy has become popular\nin recent times as it generates several well-known entropy measures. In this\npaper, we introduce the past entropy-generating function. We study certain\nproperties of this measure. It is shown that the past entropy-generating\nfunction uniquely determines the distribution. Further, we present\ncharacterizations for some lifetime models using the relationship between\nreliability concepts and the past entropy-generating function."}, "http://arxiv.org/abs/2312.02404": {"title": "Addressing Unmeasured Confounders in Cox Proportional Hazards Models Using Nonparametric Bayesian Approaches", "link": "http://arxiv.org/abs/2312.02404", "description": "In observational studies, unmeasured confounders present a crucial challenge\nin accurately estimating desired causal effects. To calculate the hazard ratio\n(HR) in Cox proportional hazard models, which are relevant for time-to-event\noutcomes, methods such as Two-Stage Residual Inclusion and Limited Information\nMaximum Likelihood are typically employed. However, these methods raise\nconcerns, including the potential for biased HR estimates and issues with\nparameter identification. This manuscript introduces a novel nonparametric\nBayesian method designed to estimate an unbiased HR, addressing concerns\nrelated to parameter identification. 
Our proposed method consists of two\nphases: 1) detecting clusters based on the likelihood of the exposure variable,\nand 2) estimating the hazard ratio within each cluster. Although it is\nimplicitly assumed that unmeasured confounders affect outcomes through cluster\neffects, our algorithm is well-suited for such data structures."}, "http://arxiv.org/abs/2312.02482": {"title": "Treatment heterogeneity with right-censored outcomes using grf", "link": "http://arxiv.org/abs/2312.02482", "description": "This article walks through how to estimate conditional average treatment\neffects (CATEs) with right-censored time-to-event outcomes using the function\ncausal_survival_forest (Cui et al., 2023) in the R package grf (Athey et al.,\n2019, Tibshirani et al., 2023)."}, "http://arxiv.org/abs/2312.02513": {"title": "Asymptotic Theory of the Best-Choice Rerandomization using the Mahalanobis Distance", "link": "http://arxiv.org/abs/2312.02513", "description": "Rerandomization, a design that utilizes pretreatment covariates and improves\ntheir balance between different treatment groups, has received attention\nrecently in both theory and practice. There are at least two types of\nrerandomization that are used in practice: the first rerandomizes the treatment\nassignment until covariate imbalance is below a prespecified threshold; the\nsecond randomizes the treatment assignment multiple times and chooses the one\nwith the best covariate balance. In this paper we will consider the second type\nof rerandomization, namely the best-choice rerandomization, whose theory and\ninference are still lacking in the literature. In particular, we will focus on\nthe best-choice rerandomization that uses the Mahalanobis distance to measure\ncovariate imbalance, which is one of the most commonly used imbalance measures\nfor multivariate covariates and is invariant to affine transformations of\ncovariates. We will study the large-sample repeated sampling properties of\nthe best-choice rerandomization, allowing both the number of covariates and the\nnumber of tried complete randomizations to increase with the sample size. We\nshow that the asymptotic distribution of the difference-in-means estimator is\nmore concentrated around the true average treatment effect under\nrerandomization than under the complete randomization, and propose large-sample\naccurate confidence intervals for rerandomization that are shorter than those\nfor the completely randomized experiment. We further demonstrate that, with a\nmoderate number of covariates and with the number of tried randomizations\nincreasing polynomially with the sample size, the best-choice rerandomization\ncan achieve the ideally optimal precision that one can expect even with\nperfectly balanced covariates. The developed theory and methods for\nrerandomization are also illustrated using real field experiments."}, "http://arxiv.org/abs/2312.02518": {"title": "The general linear hypothesis testing problem for multivariate functional data with applications", "link": "http://arxiv.org/abs/2312.02518", "description": "As technology continues to advance at a rapid pace, the prevalence of\nmultivariate functional data (MFD) has expanded across diverse disciplines,\nspanning biology, climatology, finance, and numerous other fields of study.\nAlthough MFD are encountered in various fields, the development of methods for\ntesting hypotheses on mean functions, especially the general linear hypothesis testing\n(GLHT) problem for such data, has been limited. 
In this study, we propose and\nstudy a new global test for the GLHT problem for MFD, which includes the\none-way FMANOVA, post hoc, and contrast analysis as special cases. The\nasymptotic null distribution of the test statistic is shown to be a\nchi-squared-type mixture dependent on the eigenvalues of the heteroscedastic\ncovariance functions. The distribution of the chi-squared-type mixture can be\nwell approximated by a three-cumulant matched chi-squared approximation with\nits approximation parameters estimated from the data. By incorporating an\nadjustment coefficient, the proposed test performs effectively irrespective of\nthe correlation structure in the functional data, even when dealing with a\nrelatively small sample size. Additionally, the proposed test is shown to be\nroot-n consistent, that is, it has nontrivial power against a local\nalternative. Simulation studies and a real data example demonstrate\nfinite-sample performance and broad applicability of the proposed test."}, "http://arxiv.org/abs/2312.02591": {"title": "General Spatio-Temporal Factor Models for High-Dimensional Random Fields on a Lattice", "link": "http://arxiv.org/abs/2312.02591", "description": "Motivated by the need for analysing large spatio-temporal panel data, we\nintroduce a novel dimensionality reduction methodology for $n$-dimensional\nrandom fields observed across a number $S$ of spatial locations and $T$ time\nperiods. We call it the General Spatio-Temporal Factor Model (GSTFM). First, we\nprovide the probabilistic and mathematical underpinning needed for the\nrepresentation of a random field as the sum of two components: the common\ncomponent (driven by a small number $q$ of latent factors) and the\nidiosyncratic component (mildly cross-correlated). We show that the two\ncomponents are identified as $n\\to\\infty$. Second, we propose an estimator of\nthe common component and derive its statistical guarantees (consistency and\nrate of convergence) as $\\min(n, S, T )\\to\\infty$. Third, we propose an\ninformation criterion to determine the number of factors. Estimation makes use\nof Fourier analysis in the frequency domain and thus we fully exploit the\ninformation on the spatio-temporal covariance structure of the whole panel.\nSynthetic data examples illustrate the applicability of GSTFM and its\nadvantages over the extant generalized dynamic factor model that ignores the\nspatial correlations."}, "http://arxiv.org/abs/2312.02717": {"title": "A Graphical Approach to Treatment Effect Estimation with Observational Network Data", "link": "http://arxiv.org/abs/2312.02717", "description": "We propose an easy-to-use adjustment estimator for the effect of a treatment\nbased on observational data from a single (social) network of units. The\napproach allows for interactions among units within the network, called\ninterference, and for observed confounding. We define a simplified causal graph\nthat does not differentiate between units, called generic graph. Using valid\nadjustment sets determined in the generic graph, we can identify the treatment\neffect and build a corresponding estimator. We establish the estimator's\nconsistency and its convergence to a Gaussian limiting distribution at the\nparametric rate under certain regularity conditions that restrict the growth of\ndependencies among units. 
We empirically verify the theoretical properties of\nour estimator through a simulation study and apply it to estimate the effect of\na strict facial-mask policy on the spread of COVID-19 in Switzerland."}, "http://arxiv.org/abs/2312.02807": {"title": "Online Change Detection in SAR Time-Series with Kronecker Product Structured Scaled Gaussian Models", "link": "http://arxiv.org/abs/2312.02807", "description": "We develop the information geometry of scaled Gaussian distributions for\nwhich the covariance matrix exhibits a Kronecker product structure. This model\nand its geometry are then used to propose an online change detection (CD)\nalgorithm for multivariate image time series (MITS). The proposed approach\nrelies mainly on the online estimation of the structured covariance matrix\nunder the null hypothesis, which is performed through a recursive (natural)\nRiemannian gradient descent. This approach is of practical interest\ncompared to the corresponding offline version, as its computational cost\nremains constant for each new image added to the time series. Simulations show\nthat the proposed recursive estimators reach the Intrinsic Cram\'er-Rao bound.\nThe interest of the proposed online CD approach is demonstrated on both\nsimulated and real data."}, "http://arxiv.org/abs/2312.02850": {"title": "A Kernel-Based Neural Network Test for High-dimensional Sequencing Data Analysis", "link": "http://arxiv.org/abs/2312.02850", "description": "The recent development of artificial intelligence (AI) technology, especially\nthe advance of deep neural network (DNN) technology, has revolutionized many\nfields. While DNN plays a central role in modern AI technology, it has been\nrarely used in sequencing data analysis due to challenges brought by\nhigh-dimensional sequencing data (e.g., overfitting). Moreover, due to the\ncomplexity of neural networks and their unknown limiting distributions,\nbuilding association tests on neural networks for genetic association analysis\nremains a great challenge. To address these challenges and fill the important\ngap of using AI in high-dimensional sequencing data analysis, we introduce a\nnew kernel-based neural network (KNN) test for complex association analysis of\nsequencing data. The test is built on our previously developed KNN framework,\nwhich uses random effects to model the overall effects of high-dimensional\ngenetic data and adopts kernel-based neural network structures to model complex\ngenotype-phenotype relationships. Based on KNN, a Wald-type test is then\nintroduced to evaluate the joint association of high-dimensional genetic data\nwith a disease phenotype of interest, considering non-linear and non-additive\neffects (e.g., interaction effects). Through simulations, we demonstrated that\nour proposed method attained higher power compared to the sequence kernel\nassociation test (SKAT), especially in the presence of non-linear and\ninteraction effects. Finally, we apply the methods to the whole genome\nsequencing (WGS) dataset from the Alzheimer's Disease Neuroimaging Initiative\n(ADNI) study, investigating new genes associated with the hippocampal volume\nchange over time."}, "http://arxiv.org/abs/2312.02858": {"title": "Towards Causal Representations of Climate Model Data", "link": "http://arxiv.org/abs/2312.02858", "description": "Climate models, such as Earth system models (ESMs), are crucial for\nsimulating future climate change based on projected Shared Socioeconomic\nPathways (SSP) greenhouse gas emissions scenarios. 
While ESMs are sophisticated\nand invaluable, machine learning-based emulators trained on existing simulation\ndata can project additional climate scenarios much faster and are\ncomputationally efficient. However, they often lack generalizability and\ninterpretability. This work delves into the potential of causal representation\nlearning, specifically the \emph{Causal Discovery with Single-parent Decoding}\n(CDSD) method, which could render climate model emulation efficient\n\textit{and} interpretable. We evaluate CDSD on multiple climate datasets,\nfocusing on emissions, temperature, and precipitation. Our findings shed light\non the challenges, limitations, and promise of using CDSD as a stepping stone\ntowards more interpretable and robust climate model emulation."}, "http://arxiv.org/abs/2312.02860": {"title": "Spectral Deconfounding for High-Dimensional Sparse Additive Models", "link": "http://arxiv.org/abs/2312.02860", "description": "Many high-dimensional data sets suffer from hidden confounding. When hidden\nconfounders affect both the predictors and the response in a high-dimensional\nregression problem, standard methods lead to biased estimates. This paper\nsubstantially extends previous work on spectral deconfounding for\nhigh-dimensional linear models to the nonlinear setting and with this,\nestablishes a proof of concept that spectral deconfounding is valid for general\nnonlinear models. Concretely, we propose an algorithm to estimate\nhigh-dimensional additive models in the presence of hidden dense confounding:\narguably, this is a simple yet practically useful nonlinear scope. We prove\nconsistency and convergence rates for our method and evaluate it on synthetic\ndata and a genetic data set."}, "http://arxiv.org/abs/2312.02867": {"title": "Semi-Supervised Health Index Monitoring with Feature Generation and Fusion", "link": "http://arxiv.org/abs/2312.02867", "description": "The Health Index (HI) is crucial for evaluating system health, aiding tasks\nlike anomaly detection and predicting remaining useful life for systems\ndemanding high safety and reliability. Tight monitoring is crucial for\nachieving high precision at a lower cost, with applications such as spray\ncoating. Obtaining HI labels in real-world applications is often\ncost-prohibitive, requiring continuous, precise health measurements. Therefore,\nit is more convenient to leverage run-to-failure datasets that may provide\npotential indications of machine wear condition, making it necessary to apply\nsemi-supervised tools for HI construction. In this study, we adapt the Deep\nSemi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use\nthe DeepSAD embedding as condition indicators to address interpretability\nchallenges and sensitivity to system-specific factors. Then, we introduce a\ndiversity loss to enrich condition indicators. We employ an alternating\nprojection algorithm with isotonic constraints to transform the DeepSAD\nembedding into a normalized HI with an increasing trend. Validation on the PHME\n2010 milling dataset, a recognized benchmark with ground truth HIs, demonstrates\nmeaningful HI estimation. Our methodology is then applied to monitor wear\nstates of thermal spray coatings using high-frequency voltage. 
Our\ncontributions create opportunities for more accessible and reliable HI\nestimation, particularly in cases where obtaining ground truth HI labels is\nunfeasible."}, "http://arxiv.org/abs/2312.02870": {"title": "Replica analysis of overfitting in regression models for time to event data: the impact of censoring", "link": "http://arxiv.org/abs/2312.02870", "description": "We use statistical mechanics techniques, viz. the replica method, to model\nthe effect of censoring on overfitting in Cox's proportional hazards model, the\ndominant regression method for time-to-event data. In the overfitting regime,\nMaximum Likelihood parameter estimators are known to be biased already for\nsmall values of the ratio of the number of covariates over the number of\nsamples. The inclusion of censoring was avoided in previous overfitting\nanalyses for mathematical convenience, but is vital to make any theory\napplicable to real-world medical data, where censoring is ubiquitous. Upon\nconstructing efficient algorithms for solving the new (and more complex) RS\nequations and comparing the solutions with numerical simulation data, we find\nexcellent agreement, even for large censoring rates. We then address the\npractical problem of using the theory to correct the biased ML estimators\nwithout knowledge of the data-generating distribution. This is achieved via a\nnovel numerical algorithm that self-consistently approximates all relevant\nparameters of the data generating distribution while simultaneously solving the\nRS equations. We investigate numerically the statistics of the corrected\nestimators, and show that the proposed new algorithm indeed succeeds in\nremoving the bias of the ML estimators, for both the association parameters and\nfor the cumulative hazard."}, "http://arxiv.org/abs/2312.02905": {"title": "E-values, Multiple Testing and Beyond", "link": "http://arxiv.org/abs/2312.02905", "description": "We discover a connection between the Benjamini-Hochberg (BH) procedure and\nthe recently proposed e-BH procedure [Wang and Ramdas, 2022] with a suitably\ndefined set of e-values. This insight extends to a generalized version of the\nBH procedure and the model-free multiple testing procedure in Barber and\nCand\`es [2015] (BC) with a general form of rejection rules. The connection\nprovides an effective way of developing new multiple testing procedures by\naggregating or assembling e-values resulting from the BH and BC procedures and\ntheir use in different subsets of the data. In particular, we propose new\nmultiple testing methodologies in three applications, including a hybrid\napproach that integrates the BH and BC procedures, a multiple testing procedure\naimed at ensuring a new notion of fairness by controlling both the group-wise\nand overall false discovery rates (FDR), and a structure adaptive multiple\ntesting procedure that can incorporate external covariate information to boost\ndetection power. One notable feature of the proposed methods is that we use a\ndata-dependent approach for assigning weights to e-values, significantly\nenhancing the efficiency of the resulting e-BH procedure. The construction of\nthe weights is non-trivial and is motivated by the leave-one-out analysis for\nthe BH and BC procedures. In theory, we prove that the proposed e-BH procedures\nwith data-dependent weights in the three applications ensure finite sample FDR\ncontrol. 
Furthermore, we demonstrate the efficiency of the proposed methods\nthrough numerical studies in the three applications."}, "http://arxiv.org/abs/2112.06000": {"title": "Multiply robust estimators in longitudinal studies with missing data under control-based imputation", "link": "http://arxiv.org/abs/2112.06000", "description": "Longitudinal studies are often subject to missing data. The ICH E9(R1)\naddendum addresses the importance of defining a treatment effect estimand with\nthe consideration of intercurrent events. Jump-to-reference (J2R) is one\nclassically envisioned control-based scenario for the treatment effect\nevaluation using the hypothetical strategy, where the participants in the\ntreatment group after intercurrent events are assumed to have the same disease\nprogress as those with identical covariates in the control group. We establish\nnew estimators to assess the average treatment effect based on a proposed\npotential outcomes framework under J2R. Various identification formulas are\nconstructed under the assumptions addressed by J2R, motivating estimators that\nrely on different parts of the observed data distribution. Moreover, we obtain\na novel estimator inspired by the efficient influence function, with multiple\nrobustness in the sense that it achieves $n^{1/2}$-consistency if any pairs of\nmultiple nuisance functions are correctly specified, or if the nuisance\nfunctions converge at a rate not slower than $n^{-1/4}$ when using flexible\nmodeling approaches. The finite-sample performance of the proposed estimators\nis validated in simulation studies and an antidepressant clinical trial."}, "http://arxiv.org/abs/2207.13480": {"title": "On Selecting and Conditioning in Multiple Testing and Selective Inference", "link": "http://arxiv.org/abs/2207.13480", "description": "We investigate a class of methods for selective inference that condition on a\nselection event. Such methods follow a two-stage process. First, a data-driven\n(sub)collection of hypotheses is chosen from some large universe of hypotheses.\nSubsequently, inference takes place within this data-driven collection,\nconditioned on the information that was used for the selection. Examples of\nsuch methods include basic data splitting, as well as modern data carving\nmethods and post-selection inference methods for lasso coefficients based on\nthe polyhedral lemma. In this paper, we adopt a holistic view on such methods,\nconsidering the selection, conditioning, and final error control steps together\nas a single method. From this perspective, we demonstrate that multiple testing\nmethods defined directly on the full universe of hypotheses are always at least\nas powerful as selective inference methods based on selection and conditioning.\nThis result holds true even when the universe is potentially infinite and only\nimplicitly defined, such as in the case of data splitting. 
We provide a\ncomprehensive theoretical framework, along with insights, and delve into\nseveral case studies to illustrate instances where a shift to a non-selective\nor unconditional perspective can yield a power gain."}, "http://arxiv.org/abs/2306.11302": {"title": "A Two-Stage Bayesian Small Area Estimation Approach for Proportions", "link": "http://arxiv.org/abs/2306.11302", "description": "With the rise in popularity of digital Atlases to communicate spatial\nvariation, there is an increasing need for robust small-area estimates.\nHowever, current small-area estimation methods suffer from various modeling\nproblems when data are very sparse or when estimates are required for areas\nwith very small populations. These issues are particularly heightened when\nmodeling proportions. Additionally, recent work has shown significant benefits\nin modeling at both the individual and area levels. We propose a two-stage\nBayesian hierarchical small area estimation approach for proportions that can:\naccount for survey design; reduce direct estimate instability; and generate\nprevalence estimates for small areas with no survey data. Using a simulation\nstudy we show that, compared with existing Bayesian small area estimation\nmethods, our approach can provide optimal predictive performance (Bayesian mean\nrelative root mean squared error, mean absolute relative bias and coverage) of\nproportions under a variety of data conditions, including very sparse and\nunstable data. To assess the model in practice, we compare modeled estimates of\ncurrent smoking prevalence for 1,630 small areas in Australia using the\n2017-2018 National Health Survey data combined with 2016 census data."}, "http://arxiv.org/abs/2308.04368": {"title": "Multiple Testing of Local Extrema for Detection of Structural Breaks in Piecewise Linear Models", "link": "http://arxiv.org/abs/2308.04368", "description": "In this paper, we propose a new generic method for detecting the number and\nlocations of structural breaks or change points in piecewise linear models\nunder stationary Gaussian noise. Our method transforms the change point\ndetection problem into identifying local extrema (local maxima and local\nminima) through kernel smoothing and differentiation of the data sequence. By\ncomputing p-values for all local extrema based on peak height distributions of\nsmooth Gaussian processes, we utilize the Benjamini-Hochberg procedure to\nidentify significant local extrema as the detected change points. Our method\ncan distinguish between two types of change points: continuous breaks (Type I)\nand jumps (Type II). We study three scenarios of piecewise linear signals,\nnamely pure Type I, pure Type II and a mixture of Type I and Type II change\npoints. The results demonstrate that our proposed method ensures asymptotic\ncontrol of the False Discovery Rate (FDR) and power consistency, as sequence\nlength, slope changes, and jump size increase. Furthermore, compared to\ntraditional change point detection methods based on recursive segmentation, our\napproach only requires a single test for all candidate local extrema, thereby\nachieving the smallest computational complexity proportionate to the data\nsequence length. Additionally, numerical studies illustrate that our method\nmaintains FDR control and power consistency, even in non-asymptotic cases when\nthe size of slope changes or jumps is not large. 
We have implemented our method\nin the R package \"dSTEM\" (available from\nhttps://cran.r-project.org/web/packages/dSTEM)."}, "http://arxiv.org/abs/2308.05484": {"title": "Filtering Dynamical Systems Using Observations of Statistics", "link": "http://arxiv.org/abs/2308.05484", "description": "We consider the problem of filtering dynamical systems, possibly stochastic,\nusing observations of statistics. Thus the computational task is to estimate a\ntime-evolving density $\\rho(v, t)$ given noisy observations of the true density\n$\\rho^\\dagger$; this contrasts with the standard filtering problem based on\nobservations of the state $v$. The task is naturally formulated as an\ninfinite-dimensional filtering problem in the space of densities $\\rho$.\nHowever, for the purposes of tractability, we seek algorithms in state space;\nspecifically we introduce a mean field state space model and, using interacting\nparticle system approximations to this model, we propose an ensemble method. We\nrefer to the resulting methodology as the ensemble Fokker-Planck filter\n(EnFPF).\n\nUnder certain restrictive assumptions we show that the EnFPF approximates the\nKalman-Bucy filter for the Fokker-Planck equation, which is the exact solution\nof the infinite-dimensional filtering problem; our numerical experiments show\nthat the methodology is useful beyond this restrictive setting. Specifically\nthe experiments show that the EnFPF is able to correct ensemble statistics, to\naccelerate convergence to the invariant density for autonomous systems, and to\naccelerate convergence to time-dependent invariant densities for non-autonomous\nsystems. We discuss possible applications of the EnFPF to climate ensembles and\nto turbulence modelling."}, "http://arxiv.org/abs/2311.05649": {"title": "Bayesian Image-on-Image Regression via Deep Kernel Learning based Gaussian Processes", "link": "http://arxiv.org/abs/2311.05649", "description": "In neuroimaging studies, it becomes increasingly important to study\nassociations between different imaging modalities using image-on-image\nregression (IIR), which faces challenges in interpretation, statistical\ninference, and prediction. Our motivating problem is how to predict task-evoked\nfMRI activity using resting-state fMRI data in the Human Connectome Project\n(HCP). The main difficulty lies in effectively combining different types of\nimaging predictors with varying resolutions and spatial domains in IIR. To\naddress these issues, we develop Bayesian Image-on-image Regression via Deep\nKernel Learning Gaussian Processes (BIRD-GP) and develop efficient posterior\ncomputation methods through Stein variational gradient descent. We demonstrate\nthe advantages of BIRD-GP over state-of-the-art IIR methods using simulations.\nFor HCP data analysis using BIRD-GP, we combine the voxel-wise fALFF maps and\nregion-wise connectivity matrices to predict fMRI contrast maps for language\nand social recognition tasks. We show that fALFF is less predictive than the\nconnectivity matrix for both tasks, but combining both yields improved results.\nAngular Gyrus Right emerges as the most predictable region for the language\ntask (75.9% predictable voxels), while Superior Parietal Gyrus Right tops for\nthe social recognition task (48.9% predictable voxels). 
Additionally, we\nidentify features from the resting-state fMRI data that are important for task\nfMRI prediction."}, "http://arxiv.org/abs/2312.03139": {"title": "A Bayesian Skew-heavy-tailed modelling for loss reserving", "link": "http://arxiv.org/abs/2312.03139", "description": "This paper focuses on modelling loss reserving to pay outstanding claims. As\nthe amount liable on any given claim is not known until settlement, we propose\na flexible model via heavy-tailed and skewed distributions to deal with\noutstanding liabilities. The inference relies on Markov chain Monte Carlo via\nGibbs sampler with adaptive Metropolis algorithm steps allowing for fast\ncomputations and providing efficient algorithms. An illustrative example\nemulates a typical dataset based on a runoff triangle and investigates the\nproperties of the proposed models. Also, a case study is considered and shows\nthat the proposed model outperforms the usual loss reserving models well\nestablished in the literature in the presence of skewness and heavy tails."}, "http://arxiv.org/abs/2312.03192": {"title": "Modeling Structure and Country-specific Heterogeneity in Misclassification Matrices of Verbal Autopsy-based Cause of Death Classifiers", "link": "http://arxiv.org/abs/2312.03192", "description": "Verbal autopsy (VA) algorithms are routinely used to determine\nindividual-level causes of death (COD) in many low-and-middle-income countries,\nwhich are then aggregated to derive population-level cause-specific mortality\nfractions (CSMF), essential to informing public health policies. However, VA\nalgorithms frequently misclassify COD and introduce bias in CSMF estimates. A\nrecent method, VA-calibration, can correct for this bias using a VA\nmisclassification matrix estimated from paired data on COD from both VA and\nminimally invasive tissue sampling (MITS) from the Child Health and Mortality\nPrevention Surveillance (CHAMPS) Network. Due to the limited sample size,\nCHAMPS data are pooled across all countries, implicitly assuming that the\nmisclassification rates are homogeneous.\n\nIn this research, we show that the VA misclassification matrices are\nsubstantially heterogeneous across countries, thereby biasing the\nVA-calibration. We develop a coherent framework for modeling country-specific\nmisclassification matrices in data-scarce settings. We first introduce a novel\nbase model based on two latent mechanisms: intrinsic accuracy and systematic\npreference to parsimoniously characterize misclassifications. We prove that\nthey are identifiable from the data and manifest as a form of invariance in\ncertain misclassification odds, a pattern evident in the CHAMPS data. Then we\nexpand from this base model, adding higher complexity and country-specific\nheterogeneity via interpretable effect sizes. Shrinkage priors balance the\nbias-variance tradeoff by adaptively favoring simpler models. We publish\nuncertainty-quantified estimates of VA misclassification rates for 6 countries.\nThis effort broadens VA-calibration's future applicability and strengthens\nongoing efforts of using VA for mortality surveillance."}, "http://arxiv.org/abs/2312.03254": {"title": "Efficiency of Terrestrial Laser Scanning in Survey Works: Assessment, Modelling, and Monitoring", "link": "http://arxiv.org/abs/2312.03254", "description": "Nowadays, static, mobile, terrestrial, and airborne laser scanning\ntechnologies have become familiar data sources for engineering work, especially\nin the area of land surveying. 
The diversity of Light Detection and Ranging\n(LiDAR) data applications thanks to the accuracy and the high point density in\naddition to the 3D data processing high speed allow laser scanning to occupy an\nadvanced position among other spatial data acquisition technologies. Moreover,\nthe unmanned aerial vehicle drives the airborne scanning progress by solving\nthe flying complexity issues. However, before the employment of the laser\nscanning technique, it is unavoidable to assess the accuracy of the scanner\nbeing used under different circumstances. The key to success is determined by\nthe correct selection of suitable scanning tools for the project. In this\npaper, the terrestrial LiDAR data is tested and used for several laser scanning\nprojects having diverse goals and typology, e.g., road deformation monitoring,\nbuilding facade modelling, road modelling, and stockpile modelling and volume\nmeasuring. The accuracy of direct measurement on the LiDAR point cloud is\nestimated as 4mm which may open the door widely for LiDAR data to play an\nessential role in survey work applications."}, "http://arxiv.org/abs/2312.03257": {"title": "Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes", "link": "http://arxiv.org/abs/2312.03257", "description": "Untargeted metabolomics based on liquid chromatography-mass spectrometry\ntechnology is quickly gaining widespread application given its ability to\ndepict the global metabolic pattern in biological samples. However, the data is\nnoisy and plagued by the lack of clear identity of data features measured from\nsamples. Multiple potential matchings exist between data features and known\nmetabolites, while the truth can only be one-to-one matches. Some existing\nmethods attempt to reduce the matching uncertainty, but are far from being able\nto remove the uncertainty for most features. The existence of the uncertainty\ncauses major difficulty in downstream functional analysis. To address these\nissues, we develop a novel approach for Bayesian Analysis of Untargeted\nMetabolomics data (BAUM) to integrate previously separate tasks into a single\nframework, including matching uncertainty inference, metabolite selection, and\nfunctional analysis. By incorporating the knowledge graph between variables and\nusing relatively simple assumptions, BAUM can analyze datasets with small\nsample sizes. By allowing different confidence levels of feature-metabolite\nmatching, the method is applicable to datasets in which feature identities are\npartially known. Simulation studies demonstrate that, compared with other\nexisting methods, BAUM achieves better accuracy in selecting important\nmetabolites that tend to be functionally consistent and assigning confidence\nscores to feature-metabolite matches. We analyze a COVID-19 metabolomics\ndataset and a mouse brain metabolomics dataset using BAUM. Even with a very\nsmall sample size of 16 mice per group, BAUM is robust and stable. It finds\npathways that conform to existing knowledge, as well as novel pathways that are\nbiologically plausible."}, "http://arxiv.org/abs/2312.03268": {"title": "Design-based inference for generalized network experiments with stochastic interventions", "link": "http://arxiv.org/abs/2312.03268", "description": "A growing number of scholars and data scientists are conducting randomized\nexperiments to analyze causal relationships in network settings where units\ninfluence one another. 
A dominant methodology for analyzing these network\nexperiments has been design-based, leveraging randomization of treatment\nassignment as the basis for inference. In this paper, we generalize this\ndesign-based approach so that it can be applied to more complex experiments\nwith a variety of causal estimands with different target populations. An\nimportant special case of such generalized network experiments is a bipartite\nnetwork experiment, in which the treatment assignment is randomized among one\nset of units and the outcome is measured for a separate set of units. We\npropose a broad class of causal estimands based on stochastic intervention for\ngeneralized network experiments. Using a design-based approach, we show how to\nestimate the proposed causal quantities without bias, and develop conservative\nvariance estimators. We apply our methodology to a randomized experiment in\neducation where a group of selected students in middle schools are eligible for\nthe anti-conflict promotion program, and the program participation is\nrandomized within this group. In particular, our analysis estimates the causal\neffects of treating each student or his/her close friends, for different target\npopulations in the network. We find that while the treatment improves the\noverall awareness against conflict among students, it does not significantly\nreduce the total number of conflicts."}, "http://arxiv.org/abs/2312.03274": {"title": "Empirical Bayes Covariance Decomposition, and a solution to the Multiple Tuning Problem in Sparse PCA", "link": "http://arxiv.org/abs/2312.03274", "description": "Sparse Principal Components Analysis (PCA) has been proposed as a way to\nimprove both interpretability and reliability of PCA. However, use of sparse\nPCA in practice is hindered by the difficulty of tuning the multiple\nhyperparameters that control the sparsity of different PCs (the \"multiple\ntuning problem\", MTP). Here we present a solution to the MTP using Empirical\nBayes methods. We first introduce a general formulation for penalized PCA of a\ndata matrix $\\mathbf{X}$, which includes some existing sparse PCA methods as\nspecial cases. We show that this formulation also leads to a penalized\ndecomposition of the covariance (or Gram) matrix, $\\mathbf{X}^T\\mathbf{X}$. We\nintroduce empirical Bayes versions of these penalized problems, in which the\npenalties are determined by prior distributions that are estimated from the\ndata by maximum likelihood rather than cross-validation. The resulting\n\"Empirical Bayes Covariance Decomposition\" provides a principled and efficient\nsolution to the MTP in sparse PCA, and one that can be immediately extended to\nincorporate other structural assumptions (e.g. non-negative PCA). We illustrate\nthe effectiveness of this approach on both simulated and real data examples."}, "http://arxiv.org/abs/2312.03538": {"title": "Bayesian variable selection in sample selection models using spike-and-slab priors", "link": "http://arxiv.org/abs/2312.03538", "description": "Sample selection models represent a common methodology for correcting bias\ninduced by data missing not at random. It is well known that these models are\nnot empirically identifiable without exclusion restrictions. In other words,\nsome variables predictive of missingness do not affect the outcome model of\ninterest. The drive to establish this requirement often leads to the inclusion\nof irrelevant variables in the model. 
A recent proposal uses adaptive LASSO to\ncircumvent this problem, but its performance depends on the so-called\ncovariance assumption, which can be violated in small to moderate samples.\nAdditionally, there are no tools yet for post-selection inference for this\nmodel. To address these challenges, we propose two families of spike-and-slab\npriors to conduct Bayesian variable selection in sample selection models. These\nprior structures allow for constructing a Gibbs sampler with tractable\nconditionals, which is scalable to the dimensions of practical interest. We\nillustrate the performance of the proposed methodology through a simulation\nstudy and present a comparison against adaptive LASSO and stepwise selection.\nWe also provide two applications using publicly available real data. An\nimplementation and code to reproduce the results in this paper can be found at\nhttps://github.com/adam-iqbal/selection-spike-slab"}, "http://arxiv.org/abs/2312.03561": {"title": "Blueprinting the Future: Automatic Item Categorization using Hierarchical Zero-Shot and Few-Shot Classifiers", "link": "http://arxiv.org/abs/2312.03561", "description": "In testing industry, precise item categorization is pivotal to align exam\nquestions with the designated content domains outlined in the assessment\nblueprint. Traditional methods either entail manual classification, which is\nlaborious and error-prone, or utilize machine learning requiring extensive\ntraining data, often leading to model underfit or overfit issues. This study\nunveils a novel approach employing the zero-shot and few-shot Generative\nPretrained Transformer (GPT) classifier for hierarchical item categorization,\nminimizing the necessity for training data, and instead, leveraging human-like\nlanguage descriptions to define categories. Through a structured python\ndictionary, the hierarchical nature of examination blueprints is navigated\nseamlessly, allowing for a tiered classification of items across multiple\nlevels. An initial simulation with artificial data demonstrates the efficacy of\nthis method, achieving an average accuracy of 92.91% measured by the F1 score.\nThis method was further applied to real exam items from the 2022 In-Training\nExamination (ITE) conducted by the American Board of Family Medicine (ABFM),\nreclassifying 200 items according to a newly formulated blueprint swiftly in 15\nminutes, a task that traditionally could span several days among editors and\nphysicians. This innovative approach not only drastically cuts down\nclassification time but also ensures a consistent, principle-driven\ncategorization, minimizing human biases and discrepancies. The ability to\nrefine classifications by adjusting definitions adds to its robustness and\nsustainability."}, "http://arxiv.org/abs/2312.03643": {"title": "Propagating moments in probabilistic graphical models for decision support systems", "link": "http://arxiv.org/abs/2312.03643", "description": "Probabilistic graphical models are widely used to model complex systems with\nuncertainty. Traditionally, Gaussian directed graphical models are applied for\nanalysis of large networks with continuous variables since they can provide\nconditional and marginal distributions in closed form simplifying the\ninferential task. 
The Gaussianity and linearity assumptions are often adequate,\nyet can lead to poor performance when dealing with some practical applications.\nIn this paper, we model each variable in graph G as a polynomial regression of\nits parents to capture complex relationships between individual variables and\nwith utility function of polynomial form. Since the marginal posterior\ndistributions of individual variables can become analytically intractable, we\ndevelop a message-passing algorithm to propagate information throughout the\nnetwork solely using moments which enables the expected utility scores to be\ncalculated exactly. We illustrate how the proposed methodology works in a\ndecision problem in energy systems planning."}, "http://arxiv.org/abs/2112.05623": {"title": "Smooth test for equality of copulas", "link": "http://arxiv.org/abs/2112.05623", "description": "A smooth test to simultaneously compare $K$ copulas, where $K \\geq 2$ is\nproposed. The $K$ observed populations can be paired, and the test statistic is\nconstructed based on the differences between moment sequences, called copula\ncoefficients. These coefficients characterize the copulas, even when the copula\ndensities may not exist. The procedure employs a two-step data-driven\nprocedure. In the initial step, the most significantly different coefficients\nare selected for all pairs of populations. The subsequent step utilizes these\ncoefficients to identify populations that exhibit significant differences. To\ndemonstrate the effectiveness of the method, we provide illustrations through\nnumerical studies and application to two real datasets."}, "http://arxiv.org/abs/2112.12909": {"title": "Optimal Variable Clustering for High-Dimensional Matrix Valued Data", "link": "http://arxiv.org/abs/2112.12909", "description": "Matrix valued data has become increasingly prevalent in many applications.\nMost of the existing clustering methods for this type of data are tailored to\nthe mean model and do not account for the dependence structure of the features,\nwhich can be very informative, especially in high-dimensional settings or when\nmean information is not available. To extract the information from the\ndependence structure for clustering, we propose a new latent variable model for\nthe features arranged in matrix form, with some unknown membership matrices\nrepresenting the clusters for the rows and columns. Under this model, we\nfurther propose a class of hierarchical clustering algorithms using the\ndifference of a weighted covariance matrix as the dissimilarity measure.\nTheoretically, we show that under mild conditions, our algorithm attains\nclustering consistency in the high-dimensional setting. While this consistency\nresult holds for our algorithm with a broad class of weighted covariance\nmatrices, the conditions for this result depend on the choice of the weight. To\ninvestigate how the weight affects the theoretical performance of our\nalgorithm, we establish the minimax lower bound for clustering under our latent\nvariable model in terms of some cluster separation metric. Given these results,\nwe identify the optimal weight in the sense that using this weight guarantees\nour algorithm to be minimax rate-optimal. The practical implementation of our\nalgorithm with the optimal weight is also discussed. Simulation studies show\nthat our algorithm performs better than existing methods in terms of the\nadjusted Rand index (ARI). 
The method is applied to a genomic dataset and\nyields meaningful interpretations."}, "http://arxiv.org/abs/2302.09392": {"title": "Extended Excess Hazard Models for Spatially Dependent Survival Data", "link": "http://arxiv.org/abs/2302.09392", "description": "Relative survival represents the preferred framework for the analysis of\npopulation cancer survival data. The aim is to model the survival probability\nassociated to cancer in the absence of information about the cause of death.\nRecent data linkage developments have allowed for incorporating the place of\nresidence into the population cancer data bases; however, modeling this spatial\ninformation has received little attention in the relative survival setting. We\npropose a flexible parametric class of spatial excess hazard models (along with\ninference tools), named \"Relative Survival Spatial General Hazard\" (RS-SGH),\nthat allows for the inclusion of fixed and spatial effects in both time-level\nand hazard-level components. We illustrate the performance of the proposed\nmodel using an extensive simulation study, and provide guidelines about the\ninterplay of sample size, censoring, and model misspecification. We present a\ncase study using real data from colon cancer patients in England. This case\nstudy illustrates how a spatial model can be used to identify geographical\nareas with low cancer survival, as well as how to summarize such a model\nthrough marginal survival quantities and spatial effects."}, "http://arxiv.org/abs/2312.03857": {"title": "Population Monte Carlo with Normalizing Flow", "link": "http://arxiv.org/abs/2312.03857", "description": "Adaptive importance sampling (AIS) methods provide a useful alternative to\nMarkov Chain Monte Carlo (MCMC) algorithms for performing inference of\nintractable distributions. Population Monte Carlo (PMC) algorithms constitute a\nfamily of AIS approaches which adapt the proposal distributions iteratively to\nimprove the approximation of the target distribution. Recent work in this area\nprimarily focuses on ameliorating the proposal adaptation procedure for\nhigh-dimensional applications. However, most of the AIS algorithms use simple\nproposal distributions for sampling, which might be inadequate in exploring\ntarget distributions with intricate geometries. In this work, we construct\nexpressive proposal distributions in the AIS framework using normalizing flow,\nan appealing approach for modeling complex distributions. We use an iterative\nparameter update rule to enhance the approximation of the target distribution.\nNumerical experiments show that in high-dimensional settings, the proposed\nalgorithm offers significantly improved performance compared to the existing\ntechniques."}, "http://arxiv.org/abs/2312.03911": {"title": "Improving Gradient-guided Nested Sampling for Posterior Inference", "link": "http://arxiv.org/abs/2312.03911", "description": "We present a performant, general-purpose gradient-guided nested sampling\nalgorithm, ${\\tt GGNS}$, combining the state of the art in differentiable\nprogramming, Hamiltonian slice sampling, clustering, mode separation, dynamic\nnested sampling, and parallelization. This unique combination allows ${\\tt\nGGNS}$ to scale well with dimensionality and perform competitively on a variety\nof synthetic and real-world problems. We also show the potential of combining\nnested sampling with generative flow networks to obtain large amounts of\nhigh-quality samples from the posterior distribution. 
This combination leads to\nfaster mode discovery and more accurate estimates of the partition function."}, "http://arxiv.org/abs/2312.03967": {"title": "Test-negative designs with various reasons for testing: statistical bias and solution", "link": "http://arxiv.org/abs/2312.03967", "description": "Test-negative designs are widely used for post-market evaluation of vaccine\neffectiveness. Different from classical test-negative designs where only\nhealthcare-seekers with symptoms are included, recent test-negative designs\nhave involved individuals with various reasons for testing, especially in an\noutbreak setting. While including these data can increase sample size and hence\nimprove precision, concerns have been raised about whether they will introduce\nbias into the current framework of test-negative designs, thereby demanding a\nformal statistical examination of this modified design. In this article, using\nstatistical derivations, causal graphs, and numerical simulations, we show that\nthe standard odds ratio estimator may be biased if various reasons for testing\nare not accounted for. To eliminate this bias, we identify three categories of\nreasons for testing, including symptoms, disease-unrelated reasons, and case\ncontact tracing, and characterize associated statistical properties and\nestimands. Based on our characterization, we propose stratified estimators that\ncan incorporate multiple reasons for testing to achieve consistent estimation\nand improve precision by maximizing the use of data. The performance of our\nproposed method is demonstrated through simulation studies."}, "http://arxiv.org/abs/2312.04026": {"title": "Independent-Set Design of Experiments for Estimating Treatment and Spillover Effects under Network Interference", "link": "http://arxiv.org/abs/2312.04026", "description": "Interference is ubiquitous when conducting causal experiments over networks.\nExcept for certain network structures, causal inference on the network in the\npresence of interference is difficult due to the entanglement between the\ntreatment assignments and the interference levels. In this article, we conduct\ncausal inference under interference on an observed, sparse but connected\nnetwork, and we propose a novel design of experiments based on an independent\nset. Compared to conventional designs, the independent-set design focuses on an\nindependent subset of data and controls their interference exposures through\nthe assignments to the rest (auxiliary set). We provide a lower bound on the\nsize of the independent set from a greedy algorithm , and justify the\ntheoretical performance of estimators under the proposed design. Our approach\nis capable of estimating both spillover effects and treatment effects. We\njustify its superiority over conventional methods and illustrate the empirical\nperformance through simulations."}, "http://arxiv.org/abs/2312.04064": {"title": "DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design", "link": "http://arxiv.org/abs/2312.04064", "description": "The discovery of therapeutics to treat genetically-driven pathologies relies\non identifying genes involved in the underlying disease mechanisms. Existing\napproaches search over the billions of potential interventions to maximize the\nexpected influence on the target phenotype. However, to reduce the risk of\nfailure in future stages of trials, practical experiment design aims to find a\nset of interventions that maximally change a target phenotype via diverse\nmechanisms. 
We propose DiscoBAX, a sample-efficient method for maximizing the\nrate of significant discoveries per experiment while simultaneously probing for\na wide range of diverse mechanisms during a genomic experiment campaign. We\nprovide theoretical guarantees of approximate optimality under standard\nassumptions, and conduct a comprehensive experimental evaluation covering both\nsynthetic as well as real-world experimental design tasks. DiscoBAX outperforms\nexisting state-of-the-art methods for experimental design, selecting effective\nand diverse perturbations in biological systems."}, "http://arxiv.org/abs/2312.04077": {"title": "When is Plasmode simulation superior to parametric simulation when estimating the MSE of the least squares estimator in linear regression?", "link": "http://arxiv.org/abs/2312.04077", "description": "Simulation is a crucial tool for the evaluation and comparison of statistical\nmethods. How to design fair and neutral simulation studies is therefore of\ngreat interest for both researchers developing new methods and practitioners\nconfronted with the choice of the most suitable method. The term simulation\nusually refers to parametric simulation, that is, computer experiments using\nartificial data made up of pseudo-random numbers. Plasmode simulation, that is,\ncomputer experiments using the combination of resampling feature data from a\nreal-life dataset and generating the target variable with a user-selected\noutcome-generating model (OGM), is an alternative that is often claimed to\nproduce more realistic data. We compare parametric and Plasmode simulation for\nthe example of estimating the mean squared error of the least squares estimator\nin linear regression. If the true underlying data-generating process (DGP) and\nthe OGM were known, parametric simulation would be the best choice in terms of\nestimating the MSE well. However, in reality, both are usually unknown, so\nresearchers have to make assumptions: in Plasmode simulation studies for the\nOGM, in parametric simulation for both DGP and OGM. Most likely, these\nassumptions do not reflect the truth. Here, we aim to find out how assumptions\ndeviating from the true DGP and the true OGM affect the performance of\nparametric simulation and Plasmode simulations in the context of MSE estimation\nfor the least squares estimator and in which situations which simulation type\nis preferable. Our results suggest that the preferable simulation method\ndepends on many factors, including the number of features, and how the\nassumptions of a parametric simulation differ from the true DGP. Also, the\nresampling strategy used for Plasmode influences the results. In particular,\nsubsampling with a small sampling proportion can be recommended."}, "http://arxiv.org/abs/2312.04078": {"title": "A Review and Taxonomy of Methods for Quantifying Dataset Similarity", "link": "http://arxiv.org/abs/2312.04078", "description": "In statistics and machine learning, measuring the similarity between two or\nmore datasets is important for several purposes. The performance of a\npredictive model on novel datasets, referred to as generalizability, critically\ndepends on how similar the dataset used for fitting the model is to the novel\ndatasets. Exploiting or transferring insights between similar datasets is a key\naspect of meta-learning and transfer-learning. 
In two-sample testing, it is\nchecked, whether the underlying (multivariate) distributions of two datasets\ncoincide or not.\n\nExtremely many approaches for quantifying dataset similarity have been\nproposed in the literature. A structured overview is a crucial first step for\ncomparisons of approaches. We examine more than 100 methods and provide a\ntaxonomy, classifying them into ten classes, including (i) comparisons of\ncumulative distribution functions, density functions, or characteristic\nfunctions, (ii) methods based on multivariate ranks, (iii) discrepancy measures\nfor distributions, (iv) graph-based methods, (v) methods based on inter-point\ndistances, (vi) kernel-based methods, (vii) methods based on binary\nclassification, (viii) distance and similarity measures for datasets, (ix)\ncomparisons based on summary statistics, and (x) different testing approaches.\nHere, we present an extensive review of these methods. We introduce the main\nunderlying ideas, formal definitions, and important properties."}, "http://arxiv.org/abs/2312.04150": {"title": "A simple sensitivity analysis method for unmeasured confounders via linear programming with estimating equation constraints", "link": "http://arxiv.org/abs/2312.04150", "description": "In estimating the average treatment effect in observational studies, the\ninfluence of confounders should be appropriately addressed. To this end, the\npropensity score is widely used. If the propensity scores are known for all the\nsubjects, bias due to confounders can be adjusted by using the inverse\nprobability weighting (IPW) by the propensity score. Since the propensity score\nis unknown in general, it is usually estimated by the parametric logistic\nregression model with unknown parameters estimated by solving the score\nequation under the strongly ignorable treatment assignment (SITA) assumption.\nViolation of the SITA assumption and/or misspecification of the propensity\nscore model can cause serious bias in estimating the average treatment effect.\nTo relax the SITA assumption, the IPW estimator based on the outcome-dependent\npropensity score has been successfully introduced. However, it still depends on\nthe correctly specified parametric model and its identification. In this paper,\nwe propose a simple sensitivity analysis method for unmeasured confounders. In\nthe standard practice, the estimating equation is used to estimate the unknown\nparameters in the parametric propensity score model. Our idea is to make\ninference on the average causal effect by removing restrictive parametric model\nassumptions while still utilizing the estimating equation. Using estimating\nequations as constraints, which the true propensity scores asymptotically\nsatisfy, we construct the worst-case bounds for the average treatment effect\nwith linear programming. Different from the existing sensitivity analysis\nmethods, we construct the worst-case bounds with minimal assumptions. We\nillustrate our proposal by simulation studies and a real-world example."}, "http://arxiv.org/abs/2312.04444": {"title": "Parameter Inference for Hypo-Elliptic Diffusions under a Weak Design Condition", "link": "http://arxiv.org/abs/2312.04444", "description": "We address the problem of parameter estimation for degenerate diffusion\nprocesses defined via the solution of Stochastic Differential Equations (SDEs)\nwith diffusion matrix that is not full-rank. 
For this class of hypo-elliptic\ndiffusions recent works have proposed contrast estimators that are\nasymptotically normal, provided that the step-size in-between observations\n$\\Delta=\\Delta_n$ and their total number $n$ satisfy $n \\to \\infty$, $n\n\\Delta_n \\to \\infty$, $\\Delta_n \\to 0$, and additionally $\\Delta_n = o\n(n^{-1/2})$. This latter restriction places a requirement for a so-called\n`rapidly increasing experimental design'. In this paper, we overcome this\nlimitation and develop a general contrast estimator satisfying asymptotic\nnormality under the weaker design condition $\\Delta_n = o(n^{-1/p})$ for\ngeneral $p \\ge 2$. Such a result has been obtained for elliptic SDEs in the\nliterature, but its derivation in a hypo-elliptic setting is highly\nnon-trivial. We provide numerical results to illustrate the advantages of the\ndeveloped theory."}, "http://arxiv.org/abs/2312.04481": {"title": "Wasserstein complexity penalization priors: a new class of penalizing complexity priors", "link": "http://arxiv.org/abs/2312.04481", "description": "Penalizing complexity (PC) priors is a principled framework for designing\npriors that reduce model complexity. PC priors penalize the Kullback-Leibler\nDivergence (KLD) between the distributions induced by a ``simple'' model and\nthat of a more complex model. However, in many common cases, it is impossible\nto construct a prior in this way because the KLD is infinite. Various\napproximations are used to mitigate this problem, but the resulting priors then\nfail to follow the designed principles. We propose a new class of priors, the\nWasserstein complexity penalization (WCP) priors, by replacing KLD with the\nWasserstein distance in the PC prior framework. These priors avoid the infinite\nmodel distance issues and can be derived by following the principles exactly,\nmaking them more interpretable. Furthermore, principles and recipes to\nconstruct joint WCP priors for multiple parameters analytically and numerically\nare proposed and we show that they can be easily obtained, either numerically\nor analytically, for a general class of models. The methods are illustrated\nthrough several examples for which PC priors have previously been applied."}, "http://arxiv.org/abs/2201.08502": {"title": "Curved factor analysis with the Ellipsoid-Gaussian distribution", "link": "http://arxiv.org/abs/2201.08502", "description": "There is a need for new models for characterizing dependence in multivariate\ndata. The multivariate Gaussian distribution is routinely used, but cannot\ncharacterize nonlinear relationships in the data. Most non-linear extensions\ntend to be highly complex; for example, involving estimation of a non-linear\nregression model in latent variables. In this article, we propose a relatively\nsimple class of Ellipsoid-Gaussian multivariate distributions, which are\nderived by using a Gaussian linear factor model involving latent variables\nhaving a von Mises-Fisher distribution on a unit hyper-sphere. We show that the\nEllipsoid-Gaussian distribution can flexibly model curved relationships among\nvariables with lower-dimensional structures. Taking a Bayesian approach, we\npropose a hybrid of gradient-based geodesic Monte Carlo and adaptive Metropolis\nfor posterior sampling. We derive basic properties and illustrate the utility\nof the Ellipsoid-Gaussian distribution on a variety of simulated and real data\napplications. 
An accompanying R package is also available."}, "http://arxiv.org/abs/2212.12822": {"title": "Simultaneous false discovery proportion bounds via knockoffs and closed testing", "link": "http://arxiv.org/abs/2212.12822", "description": "We propose new methods to obtain simultaneous false discovery proportion\nbounds for knockoff-based approaches. We first investigate an approach based on\nJanson and Su's $k$-familywise error rate control method and interpolation. We\nthen generalize it by considering a collection of $k$ values, and show that the\nbound of Katsevich and Ramdas is a special case of this method and can be\nuniformly improved. Next, we further generalize the method by using closed\ntesting with a multi-weighted-sum local test statistic. This allows us to\nobtain a further uniform improvement and other generalizations over previous\nmethods. We also develop an efficient shortcut for its implementation. We\ncompare the performance of our proposed methods in simulations and apply them\nto a data set from the UK Biobank."}, "http://arxiv.org/abs/2302.06054": {"title": "Single Proxy Control", "link": "http://arxiv.org/abs/2302.06054", "description": "Negative control variables are sometimes used in non-experimental studies to\ndetect the presence of confounding by hidden factors. A negative control\noutcome (NCO) is an outcome that is influenced by unobserved confounders of the\nexposure effects on the outcome in view, but is not causally impacted by the\nexposure. Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration\nApproach (COCA) as a formal NCO counterfactual method to detect and correct for\nresidual confounding bias. For identification, COCA treats the NCO as an\nerror-prone proxy of the treatment-free counterfactual outcome of interest, and\ninvolves regressing the NCO on the treatment-free counterfactual, together with\na rank-preserving structural model which assumes a constant individual-level\ncausal effect. In this work, we establish nonparametric COCA identification for\nthe average causal effect for the treated, without requiring rank-preservation,\ntherefore accommodating unrestricted effect heterogeneity across units. This\nnonparametric identification result has important practical implications, as it\nprovides single proxy confounding control, in contrast to recently proposed\nproximal causal inference, which relies for identification on a pair of\nconfounding proxies. For COCA estimation we propose three separate strategies:\n(i) an extended propensity score approach, (ii) an outcome bridge function\napproach, and (iii) a doubly-robust approach. Finally, we illustrate the\nproposed methods in an application evaluating the causal impact of a Zika virus\noutbreak on birth rate in Brazil."}, "http://arxiv.org/abs/2302.11363": {"title": "lqmix: an R package for longitudinal data analysis via linear quantile mixtures", "link": "http://arxiv.org/abs/2302.11363", "description": "The analysis of longitudinal data gives the chance to observe how unit\nbehaviors change over time, but it also poses series of issues. These have been\nthe focus of a huge literature in the context of linear and generalized linear\nregression moving also, in the last ten years or so, to the context of linear\nquantile regression for continuous responses. 
In this paper, we present lqmix,\na novel R package that helps estimate a class of linear quantile regression\nmodels for longitudinal data, in the presence of time-constant and/or\ntime-varying, unit-specific, random coefficients, with unspecified\ndistribution. Model parameters are estimated in a maximum likelihood framework,\nvia an extended EM algorithm, and parameters' standard errors are estimated via\na block-bootstrap procedure. The analysis of a benchmark dataset is used to\ngive details on the package functions."}, "http://arxiv.org/abs/2303.04408": {"title": "Principal Component Analysis of Two-dimensional Functional Data with Serial Correlation", "link": "http://arxiv.org/abs/2303.04408", "description": "In this paper, we propose a novel model to analyze serially correlated\ntwo-dimensional functional data observed sparsely and irregularly on a domain\nwhich may not be a rectangle. Our approach employs a mixed effects model that\nspecifies the principal component functions as bivariate splines on\ntriangulations and the principal component scores as random effects which\nfollow an auto-regressive model. We apply the thin-plate penalty for\nregularizing the bivariate function estimation and develop an effective EM\nalgorithm along with Kalman filter and smoother for calculating the penalized\nlikelihood estimates of the parameters. Our approach was applied to simulated\ndatasets and to Texas monthly average temperature data from January 1915 to\nDecember 2014."}, "http://arxiv.org/abs/2309.15973": {"title": "Application of data acquisition methods in the field of scientific research of public policy", "link": "http://arxiv.org/abs/2309.15973", "description": "Public policy also represents a special subdiscipline within political\nscience. It is given increasing importance in the context of scientific\nresearch and scientific approaches. Public policy as a discipline of political\nscience has its own special subject and method of research. A particularly\nimportant aspect of the scientific approach to public policy is the application\nof research methods as one of the stages of designing scientific research. In\nthis sense, the goal of this research is to present the application of\nscientific research methods in the field of public policy. These methods are\nbased on scientific achievements developed within the framework of the modern\nmethodology of the social sciences. Scientific research methods represent an\nimportant functional part of the research project as a model of the scientific\nresearch system, predominantly of an empirical character, which is applicable\nto all types of research. This is precisely what imposes the need to develop a\nproject as a prerequisite for applying scientific methods and conducting\nscientific research, and therefore for a more complete understanding of public\npolicy. 
The\nconclusions reached point to the fact that scientific research on public policy\ncannot be carried out without the creation of a scientific research project, as\na complex scientific and operational document, and the application of\nappropriate methods and techniques developed within the framework of the\nscientific achievements of modern social science methodology."}, "http://arxiv.org/abs/2312.04601": {"title": "Estimating Fr\\'echet bounds for validating programmatic weak supervision", "link": "http://arxiv.org/abs/2312.04601", "description": "We develop methods for estimating Fr\\'echet bounds on (possibly\nhigh-dimensional) distribution classes in which some variables are\ncontinuous-valued. We establish the statistical correctness of the computed\nbounds under uncertainty in the marginal constraints and demonstrate the\nusefulness of our algorithms by evaluating the performance of machine learning\n(ML) models trained with programmatic weak supervision (PWS). PWS is a\nframework for principled learning from weak supervision inputs (e.g.,\ncrowdsourced labels, knowledge bases, pre-trained models on related tasks,\netc.), and it has achieved remarkable success in many areas of science and\nengineering. Unfortunately, it is generally difficult to validate the\nperformance of ML models trained with PWS due to the absence of labeled data.\nOur algorithms address this issue by estimating sharp lower and upper bounds\nfor performance metrics such as accuracy/recall/precision/F1 score."}, "http://arxiv.org/abs/2312.04661": {"title": "Robust Elastic Net Estimators for High Dimensional Generalized Linear Models", "link": "http://arxiv.org/abs/2312.04661", "description": "Robust estimators for Generalized Linear Models (GLMs) are not easy to\ndevelop because of the nature of the distributions involved. Recently, there\nhas been an increasing interest in this topic, especially in the presence of a\npossibly large number of explanatory variables. Transformed M-estimators (MT)\nare a natural way to extend the methodology of M-estimators to the class of\nGLMs and to obtain robust methods. We introduce a penalized version of\nMT-estimators in order to deal with high-dimensional data. We prove, under\nappropriate assumptions, consistency and asymptotic normality of this new class\nof estimators. The theory is developed for redescending $\\rho$-functions and\nElastic Net penalization. An iterative re-weighted least squares algorithm is\ngiven, together with a procedure to initialize it. The latter is of particular\nimportance, since the estimating equations might have multiple roots. We\nillustrate the performance of this new method for the Poisson family under\nseveral types of contamination in a Monte Carlo experiment and in an example\nbased on a real dataset."}, "http://arxiv.org/abs/2312.04717": {"title": "A kinetic Monte Carlo Approach for Boolean Logic Functionality in Gold Nanoparticle Networks", "link": "http://arxiv.org/abs/2312.04717", "description": "Nanoparticles interconnected by insulating organic molecules exhibit\nnonlinear switching behavior at low temperatures. By assembling these switches\ninto a network and manipulating charge transport dynamics through surrounding\nelectrodes, the network can be reconfigurably functionalized to act as any\nBoolean logic gate. This work introduces a kinetic Monte Carlo-based simulation\ntool, applying established principles of single electronics to model charge\ntransport dynamics in nanoparticle networks. 
We functionalize nanoparticle\nnetworks as Boolean logic gates and assess their quality using a fitness\nfunction. Based on the definition of fitness, we derive new metrics to quantify\nessential nonlinear properties of the network, including negative differential\nresistance and nonlinear separability. These nonlinear properties are crucial\nnot only for functionalizing the network as Boolean logic gates but also when\nour networks are functionalized for brain-inspired computing applications in\nthe future. We address fundamental questions about the dependence of fitness\nand nonlinear properties on system size, number of surrounding electrodes, and\nelectrode positioning. We assert the overall benefit of having more electrodes,\nwith proximity to the network's output being pivotal for functionality and\nnonlinearity. Additionally, we demonstrate an optimal system size and argue for\nbreaking symmetry in electrode positioning to favor nonlinear properties."}, "http://arxiv.org/abs/2312.04747": {"title": "MetaDetect: Metamorphic Testing Based Anomaly Detection for Multi-UAV Wireless Networks", "link": "http://arxiv.org/abs/2312.04747", "description": "The reliability of wireless Ad Hoc Network (WANET) communication is much\nlower than that of wired networks. A WANET is impacted by node overload,\nrouting protocols, weather, obstacle blockage, and many other factors, and such\nanomalies cannot be avoided. Accurately predicting in advance that the network\nwill stop entirely is essential, so that operators can re-route the network or\nswitch to different bands. The present study has two primary goals: first, to\ndesign anomaly event detection patterns based on the Metamorphic Testing (MT)\nmethodology; second, to compare the performance of evaluation metrics such as\ntransfer rate, occupancy rate, and the number of packets received. Compared to\nother studies, the most significant advantages are mathematical\ninterpretability and the absence of any dependence on physical environmental\ninformation; the method relies only on physical-layer and MAC-layer networking\ndata. The analysis of the results demonstrates that the proposed MT detection\nmethod is helpful for automatically identifying incident/accident events on a\nWANET. The physical-layer transfer rate metric achieved the best performance."}, "http://arxiv.org/abs/2312.04924": {"title": "Sparse Anomaly Detection Across Referentials: A Rank-Based Higher Criticism Approach", "link": "http://arxiv.org/abs/2312.04924", "description": "Detecting anomalies in large sets of observations is crucial in various\napplications, such as epidemiological studies, gene expression studies, and\nsystems monitoring. We consider settings where the units of interest result in\nmultiple independent observations from potentially distinct referentials. Scan\nstatistics and related methods are commonly used in such settings, but rely on\nstringent modeling assumptions for proper calibration. We instead propose a\nrank-based variant of the higher criticism statistic that only requires\nindependent observations originating from ordered spaces. We show under what\nconditions the resulting methodology is able to detect the presence of\nanomalies. These conditions are stated in a general, non-parametric manner, and\ndepend solely on the probabilities of anomalous observations exceeding nominal\nobservations. The analysis requires a refined understanding of the distribution\nof the ranks under the presence of anomalies, and in particular of the\nrank-induced dependencies. 
The methodology is robust against heavy-tailed\ndistributions through the use of ranks. Within the exponential family and a\nfamily of convolutional models, we analytically quantify the asymptotic\nperformance of our methodology and the performance of the oracle, and show the\ndifference is small for many common models. Simulations confirm these results.\nWe show the applicability of the methodology through an analysis of quality\ncontrol data of a pharmaceutical manufacturing process."}, "http://arxiv.org/abs/2312.04950": {"title": "Sequential inductive prediction intervals", "link": "http://arxiv.org/abs/2312.04950", "description": "In this paper we explore the concept of sequential inductive prediction\nintervals using theory from sequential testing. We furthermore introduce a\n3-parameter PAC definition of prediction intervals that allows us via\nsimulation to achieve almost sharp bounds with high probability."}, "http://arxiv.org/abs/2312.04972": {"title": "Comparison of Probabilistic Structural Reliability Methods for Ultimate Limit State Assessment of Wind Turbines", "link": "http://arxiv.org/abs/2312.04972", "description": "The probabilistic design of offshore wind turbines aims to ensure structural\nsafety in a cost-effective way. This involves conducting structural reliability\nassessments for different design options and considering different structural\nresponses. There are several structural reliability methods, and this paper\nwill apply and compare different approaches in some simplified case studies. In\nparticular, the well known environmental contour method will be compared to a\nmore novel approach based on sequential sampling and Gaussian processes\nregression for an ultimate limit state case study. For one of the case studies,\nresults will also be compared to results from a brute force simulation\napproach. Interestingly, the comparison is very different from the two case\nstudies. In one of the cases the environmental contours method agrees well with\nthe sequential sampling method but in the other, results vary considerably.\nProbably, this can be explained by the violation of some of the assumptions\nassociated with the environmental contour approach, i.e. that the short-term\nvariability of the response is large compared to the long-term variability of\nthe environmental conditions. Results from this simple comparison study\nsuggests that the sequential sampling method can be a robust and\ncomputationally effective approach for structural reliability assessment."}, "http://arxiv.org/abs/2312.05077": {"title": "Computation of least squares trimmed regression--an alternative to least trimmed squares regression", "link": "http://arxiv.org/abs/2312.05077", "description": "The least squares of depth trimmed (LST) residuals regression, proposed in\nZuo and Zuo (2023) \\cite{ZZ23}, serves as a robust alternative to the classic\nleast squares (LS) regression as well as a strong competitor to the famous\nleast trimmed squares (LTS) regression of Rousseeuw (1984) \\cite{R84}.\nTheoretical properties of the LST were thoroughly studied in \\cite{ZZ23}.\n\nThe article aims to promote the implementation and computation of the LST\nresiduals regression for a broad group of statisticians in statistical practice\nand demonstrates that (i) the LST is as robust as the benchmark of robust\nregression, the LTS regression, and much more efficient than the latter. 
(ii)\nIt can be as efficient as (or even more efficient than) the LS in the scenario\nwhere the errors are uncorrelated with mean zero and homoscedastic with finite\nvariance. (iii) It can be computed as fast as (or even faster than) the LTS\nbased on a newly proposed algorithm."}, "http://arxiv.org/abs/2312.05127": {"title": "Weighted least squares regression with the best robustness and high computability", "link": "http://arxiv.org/abs/2312.05127", "description": "A novel regression method is introduced and studied. The procedure weights\nsquared residuals based on their magnitude. Unlike the classic least squares,\nwhich treats every squared residual as equally important, the new procedure\nexponentially down-weights squared residuals that lie far away from the cloud\nof all residuals and assigns a constant weight (one) to squared residuals that\nlie close to the center of the squared-residual cloud.\n\nThe new procedure keeps a good balance between robustness and efficiency: it\npossesses the highest breakdown-point robustness attainable by any regression\nequivariant procedure, is much more robust than the classic least squares, yet\nis much more efficient than the benchmark robust method, the least trimmed\nsquares (LTS) of Rousseeuw (1984).\n\nWith a smooth weight function, the new procedure can be computed very fast\nby either the first-order (first-derivative) method or the second-order\n(second-derivative) method.\n\nAssertions and other theoretical findings are verified in simulated and real\ndata examples."}, "http://arxiv.org/abs/2208.07898": {"title": "Collaborative causal inference on distributed data", "link": "http://arxiv.org/abs/2208.07898", "description": "In recent years, the development of technologies for causal inference with\nprivacy preservation of distributed data has gained considerable attention.\nMany existing methods for distributed data focus on resolving the lack of\nsubjects (samples) and can only reduce random errors in estimating treatment\neffects. In this study, we propose a data collaboration quasi-experiment\n(DC-QE) that resolves the lack of both subjects and covariates, reducing random\nerrors and biases in the estimation. Our method involves constructing\ndimensionality-reduced intermediate representations from private data from\nlocal parties, sharing intermediate representations instead of private data for\nprivacy preservation, estimating propensity scores from the shared intermediate\nrepresentations, and finally, estimating the treatment effects from propensity\nscores. Through numerical experiments on both artificial and real-world data,\nwe confirm that our method leads to better estimation results than individual\nanalyses. While dimensionality reduction loses some information in the private\ndata and causes performance degradation, we observe that sharing intermediate\nrepresentations with many parties to resolve the lack of subjects and\ncovariates sufficiently improves performance to overcome the degradation caused\nby dimensionality reduction. Although external validity is not necessarily\nguaranteed, our results suggest that DC-QE is a promising method. 
With the\nwidespread use of our method, intermediate representations can be published as\nopen data to help researchers find causalities and accumulate a knowledge base."}, "http://arxiv.org/abs/2304.07034": {"title": "Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata", "link": "http://arxiv.org/abs/2304.07034", "description": "The optimum sample allocation in stratified sampling is one of the basic\nissues of survey methodology. It is a procedure of dividing the overall sample\nsize into strata sample sizes in such a way that for given sampling designs in\nstrata the variance of the stratified $\\pi$ estimator of the population total\n(or mean) for a given study variable assumes its minimum. In this work, we\nconsider the optimum allocation of a sample, under lower and upper bounds\nimposed jointly on sample sizes in strata. We are concerned with the variance\nfunction of some generic form that, in particular, covers the case of the\nsimple random sampling without replacement in strata. The goal of this paper is\ntwofold. First, we establish (using the Karush-Kuhn-Tucker conditions) a\ngeneric form of the optimal solution, the so-called optimality conditions.\nSecond, based on the established optimality conditions, we derive an efficient\nrecursive algorithm, named RNABOX, which solves the allocation problem under\nstudy. The RNABOX can be viewed as a generalization of the classical recursive\nNeyman allocation algorithm, a popular tool for optimum allocation when only\nupper bounds are imposed on sample strata-sizes. We implement RNABOX in R as a\npart of our package stratallo which is available from the Comprehensive R\nArchive Network (CRAN) repository."}, "http://arxiv.org/abs/2305.12366": {"title": "A Quantile Shift Approach To Main Effects And Interactions In A 2-By-2 Design", "link": "http://arxiv.org/abs/2305.12366", "description": "When comparing two independent groups, shift functions are basically\ntechniques that compare multiple quantiles rather than a single measure of\nlocation, the goal being to get a more detailed understanding of how the\ndistributions differ. Various versions have been proposed and studied. This\npaper deals with extensions of these methods to main effects and interactions\nin a between-by-between, 2-by-2 design. Two approaches are studied, one that\ncompares the deciles of the distributions, and one that has a certain\nconnection to the Wilcoxon-Mann-Whitney method. For both methods, we propose an\nimplementation using the Harrell-Davis quantile estimator, used in conjunction\nwith a percentile bootstrap approach. We report results of simulations of false\nand true positive rates."}, "http://arxiv.org/abs/2306.05415": {"title": "Causal normalizing flows: from theory to practice", "link": "http://arxiv.org/abs/2306.05415", "description": "In this work, we deepen on the use of normalizing flows for causal reasoning.\nSpecifically, we first leverage recent results on non-linear ICA to show that\ncausal models are identifiable from observational data given a causal ordering,\nand thus can be recovered using autoregressive normalizing flows (NFs). Second,\nwe analyze different design and learning choices for causal normalizing flows\nto capture the underlying causal data-generating process. Third, we describe\nhow to implement the do-operator in causal NFs, and thus, how to answer\ninterventional and counterfactual questions. 
Finally, in our experiments, we\nvalidate our design and training choices through a comprehensive ablation\nstudy; compare causal NFs to other approaches for approximating causal models;\nand empirically demonstrate that causal NFs can be used to address real-world\nproblems, where the presence of mixed discrete-continuous data and partial\nknowledge of the causal graph is the norm. The code for this work can be found\nat https://github.com/psanch21/causal-flows."}, "http://arxiv.org/abs/2307.13917": {"title": "BayesDAG: Gradient-Based Posterior Inference for Causal Discovery", "link": "http://arxiv.org/abs/2307.13917", "description": "Bayesian causal discovery aims to infer the posterior distribution over\ncausal models from observed data, quantifying epistemic uncertainty and\nbenefiting downstream tasks. However, computational challenges arise due to\njoint inference over the combinatorial space of Directed Acyclic Graphs (DAGs)\nand nonlinear functions. Despite recent progress towards efficient posterior\ninference over DAGs, existing methods are either limited to variational\ninference on node permutation matrices for linear causal models, leading to\ncompromised inference accuracy, or to continuous relaxations of adjacency\nmatrices constrained by a DAG regularizer, which cannot ensure that the\nresulting graphs are DAGs. In this work, we introduce a scalable Bayesian\ncausal discovery framework based on a combination of stochastic gradient\nMarkov Chain Monte Carlo (SG-MCMC) and Variational Inference (VI) that\novercomes these limitations. Our approach directly samples DAGs from the\nposterior without requiring any DAG regularization, simultaneously draws\nfunction parameter samples, and is applicable to both linear and nonlinear\ncausal models. To enable our approach, we derive a novel equivalence to\npermutation-based DAG learning, which opens up possibilities of using any\nrelaxed gradient estimator defined over permutations. To our knowledge, this is\nthe first framework applying gradient-based MCMC sampling for causal discovery.\nEmpirical evaluation on synthetic and real-world datasets demonstrates our\napproach's effectiveness compared to state-of-the-art baselines."}, "http://arxiv.org/abs/2312.05319": {"title": "Hyperbolic Network Latent Space Model with Learnable Curvature", "link": "http://arxiv.org/abs/2312.05319", "description": "Network data is ubiquitous in various scientific disciplines, including\nsociology, economics, and neuroscience. Latent space models are often employed\nin network data analysis, but the geometric effect of latent space curvature\nremains a significant, unresolved issue. In this work, we propose a hyperbolic\nnetwork latent space model with a learnable curvature parameter. We\ntheoretically justify that learning the optimal curvature is essential to\nminimizing the embedding error across all hyperbolic embedding methods beyond\nnetwork latent space models. A maximum-likelihood estimation strategy,\nemploying manifold gradient optimization, is developed, and we establish the\nconsistency and convergence rates for the maximum-likelihood estimators, both\nof which are technically challenging due to the non-linearity and non-convexity\nof the hyperbolic distance metric. 
We further demonstrate the geometric effect\nof latent space curvature and the superior performance of the proposed model\nthrough extensive simulation studies and an application using a Facebook\nfriendship network."}, "http://arxiv.org/abs/2312.05345": {"title": "A General Estimation Framework for Multi-State Markov Processes with Flexible Specification of the Transition Intensities", "link": "http://arxiv.org/abs/2312.05345", "description": "When interest lies in the progression of a disease rather than on a single\noutcome, non-homogeneous multi-state Markov models constitute a natural and\npowerful modelling approach. Constant monitoring of a phenomenon of interest is\noften unfeasible, hence leading to an intermittent observation scheme. This\nsetting is challenging and existing models and their implementations do not yet\nallow for flexible enough specifications that can fully exploit the information\ncontained in the data. To widen significantly the scope of multi-state Markov\nmodels, we propose a closed-form expression for the local curvature information\nof a key quantity, the transition probability matrix. Such development allows\none to model any type of multi-state Markov process, where the transition\nintensities are flexibly specified as functions of additive predictors.\nParameter estimation is carried out through a carefully structured, stable\npenalised likelihood approach. The methodology is exemplified via two case\nstudies that aim at modelling the onset of cardiac allograft vasculopathy, and\ncognitive decline. To support applicability and reproducibility, all developed\ntools are implemented in the R package flexmsm."}, "http://arxiv.org/abs/2312.05365": {"title": "Product Centered Dirichlet Processes for Dependent Clustering", "link": "http://arxiv.org/abs/2312.05365", "description": "While there is an immense literature on Bayesian methods for clustering, the\nmultiview case has received little attention. This problem focuses on obtaining\ndistinct but statistically dependent clusterings in a common set of entities\nfor different data types. For example, clustering patients into subgroups with\nsubgroup membership varying according to the domain of the patient variables. A\nchallenge is how to model the across-view dependence between the partitions of\npatients into subgroups. The complexities of the partition space make standard\nmethods to model dependence, such as correlation, infeasible. In this article,\nwe propose CLustering with Independence Centering (CLIC), a clustering prior\nthat uses a single parameter to explicitly model dependence between clusterings\nacross views. CLIC is induced by the product centered Dirichlet process (PCDP),\na novel hierarchical prior that bridges between independent and equivalent\npartitions. We show appealing theoretic properties, provide a finite\napproximation and prove its accuracy, present a marginal Gibbs sampler for\nposterior computation, and derive closed form expressions for the marginal and\njoint partition distributions for the CLIC model. On synthetic data and in an\napplication to epidemiology, CLIC accurately characterizes view-specific\npartitions while providing inference on the dependence level."}, "http://arxiv.org/abs/2312.05372": {"title": "Rational Kriging", "link": "http://arxiv.org/abs/2312.05372", "description": "This article proposes a new kriging that has a rational form. 
It is shown\nthat the generalized least squares estimate of the mean from rational kriging\nis much better behaved than that from ordinary kriging. Parameter estimation\nand uncertainty quantification for rational kriging are proposed using a\nGaussian process framework. Its potential applications in emulation and\ncalibration of computer models are also discussed."}, "http://arxiv.org/abs/2312.05400": {"title": "Generalized difference-in-differences", "link": "http://arxiv.org/abs/2312.05400", "description": "We propose a new method for estimating causal effects in longitudinal/panel\ndata settings that we call generalized difference-in-differences. Our approach\nunifies two alternative approaches in these settings: ignorability estimators\n(e.g., synthetic controls) and difference-in-differences (DiD) estimators. We\npropose a new identifying assumption -- a stable bias assumption -- which\ngeneralizes the conditional parallel trends assumption in DiD, leading to the\nproposed generalized DiD framework. This change gives generalized DiD\nestimators the flexibility of ignorability estimators while maintaining DiD's\nrobustness to unobserved confounding. We also show how ignorability and\nDiD estimators are special cases of generalized DiD. We then propose\ninfluence-function-based estimators of the observed data functional, allowing\nthe use of double/debiased machine learning for estimation. We also show how\ngeneralized DiD easily extends to include clustered treatment assignment and\nstaggered adoption settings, and we discuss how the framework can facilitate\nestimation of other treatment effects beyond the average treatment effect on\nthe treated. Finally, we provide simulations which show that generalized DiD\noutperforms ignorability and DiD estimators when their identifying assumptions\nare not met, while being competitive with these special cases when their\nidentifying assumptions are met."}, "http://arxiv.org/abs/2312.05404": {"title": "Disentangled Latent Representation Learning for Tackling the Confounding M-Bias Problem in Causal Inference", "link": "http://arxiv.org/abs/2312.05404", "description": "In causal inference, it is a fundamental task to estimate the causal effect\nfrom observational data. However, latent confounders pose major challenges in\ncausal inference in observational data, for example, confounding bias and\nM-bias. Recent data-driven causal effect estimators tackle the confounding bias\nproblem via balanced representation learning, but assume no M-bias in the\nsystem and thus fail to handle M-bias. In this paper, we identify a\nchallenging and unsolved problem caused by a variable that leads to confounding\nbias and M-bias simultaneously. To address this problem with co-occurring\nM-bias and confounding bias, we propose a novel Disentangled Latent\nRepresentation learning framework for learning latent representations from\nproxy variables for unbiased Causal effect Estimation (DLRCE) from\nobservational data. Specifically, DLRCE learns three sets of latent\nrepresentations from the measured proxy variables to adjust for the confounding\nbias and M-bias. 
Extensive experiments on synthetic data and three real-world\ndatasets demonstrate that DLRCE significantly outperforms state-of-the-art\nestimators when both confounding bias and M-bias are present."}, "http://arxiv.org/abs/2312.05411": {"title": "Deep Bayes Factors", "link": "http://arxiv.org/abs/2312.05411", "description": "There is no other model or hypothesis verification tool in Bayesian\nstatistics that is as widely used as the Bayes factor. We focus on generative\nmodels that are likelihood-free and, therefore, render the computation of Bayes\nfactors (marginal likelihood ratios) far from obvious. We propose a deep\nlearning estimator of the Bayes factor based on simulated data from two\ncompeting models using the likelihood ratio trick. This estimator is devoid of\nsummary statistics and obviates some of the difficulties with ABC model choice.\nWe establish sufficient conditions for consistency of our Deep Bayes Factor\nestimator as well as its consistency as a model selection tool. We investigate\nthe performance of our estimator on various examples using a wide range of\nquality metrics related to estimation and model decision accuracy. After\ntraining, our deep learning approach enables rapid evaluations of the Bayes\nfactor estimator at any fictional data arriving from either hypothesized model,\nnot just the observed data $Y_0$. This allows us to inspect entire Bayes factor\ndistributions under the two models and to quantify the relative location of the\nBayes factor evaluated at $Y_0$ in light of these distributions. Such tail area\nevaluations are not possible for Bayes factor estimators tailored to $Y_0$. We\nfind the performance of our Deep Bayes Factors competitive with existing MCMC\ntechniques that require knowledge of the likelihood function. We also\nconsider variants for posterior or intrinsic Bayes factor estimation. We\ndemonstrate the usefulness of our approach on a relatively high-dimensional\nreal data example about determining cognitive biases."}, "http://arxiv.org/abs/2312.05523": {"title": "Functional Data Analysis: An Introduction and Recent Developments", "link": "http://arxiv.org/abs/2312.05523", "description": "Functional data analysis (FDA) is a statistical framework that allows for the\nanalysis of curves, images, or functions on higher dimensional domains. The\ngoals of FDA, such as descriptive analyses, classification, and regression, are\ngenerally the same as for statistical analyses of scalar-valued or multivariate\ndata, but FDA brings additional challenges due to the high- and infinite\ndimensionality of observations and parameters, respectively. This paper\nprovides an introduction to FDA, including a description of the most common\nstatistical analysis techniques, their respective software implementations, and\nsome recent developments in the field. The paper covers fundamental concepts\nsuch as descriptives and outliers, smoothing, amplitude and phase variation,\nand functional principal component analysis. It also discusses functional\nregression, statistical inference with functional data, functional\nclassification and clustering, and machine learning approaches for functional\ndata analysis. The methods discussed in this paper are widely applicable in\nfields such as medicine, biophysics, neuroscience, and chemistry, and are\nincreasingly relevant due to the widespread use of technologies that allow for\nthe collection of functional data. 
Sparse functional data methods are also\nrelevant for longitudinal data analysis. All presented methods are demonstrated\nusing available software in R by analyzing a data set on human motion and motor\ncontrol. To facilitate the understanding of the methods, their implementation,\nand hands-on application, the code for these practical examples is made\navailable on GitHub: https://github.com/davidruegamer/FDA_tutorial ."}, "http://arxiv.org/abs/2312.05590": {"title": "Gradient Tracking for High Dimensional Federated Optimization", "link": "http://arxiv.org/abs/2312.05590", "description": "In this paper, we study the (decentralized) distributed optimization problem\nwith high-dimensional sparse structure. Building upon the FedDA algorithm, we\npropose a (Decentralized) FedDA-GT algorithm, which incorporates the\n\\textbf{gradient tracking} technique. It is able to eliminate the heterogeneity\namong different clients' objective functions while ensuring a dimension-free\nconvergence rate. Compared to the vanilla FedDA approach, (D)FedDA-GT can\nsignificantly reduce the communication complexity, from ${O}(s^2\\log\nd/\\varepsilon^{3/2})$ to a more efficient ${O}(s^2\\log d/\\varepsilon)$. In\ncases where strong convexity is applicable, we introduce a multistep mechanism\nresulting in the Multistep ReFedDA-GT algorithm, a slightly modified version of\nFedDA-GT. This approach achieves an impressive communication complexity of\n${O}\\left(s\\log d \\log \\frac{1}{\\varepsilon}\\right)$ through repeated calls to\nthe ReFedDA-GT algorithm. Finally, we conduct numerical experiments,\nillustrating that our proposed algorithms enjoy the dual advantage of being\ndimension-free and heterogeneity-free."}, "http://arxiv.org/abs/2312.05682": {"title": "On Valid Multivariate Generalizations of the Confluent Hypergeometric Covariance Function", "link": "http://arxiv.org/abs/2312.05682", "description": "Modeling of multivariate random fields through Gaussian processes calls for\nthe construction of valid cross-covariance functions describing the dependence\nbetween any two component processes at different spatial locations. The\nrequired validity conditions often present challenges that lead to complicated\nrestrictions on the parameter space. The purpose of this paper is to present\nsimplified techniques for establishing multivariate validity for the\nrecently-introduced Confluent Hypergeometric (CH) class of covariance\nfunctions. Specifically, we use multivariate mixtures to present both\nsimplified and comprehensive conditions for validity, based on results on\nconditionally negative semidefinite matrices and the Schur product theorem. In\naddition, we establish the spectral density of the CH covariance and use this\nto construct valid multivariate models as well as propose new\ncross-covariances. We show that our proposed approach leads to valid\nmultivariate cross-covariance models that inherit the desired marginal\nproperties of the CH model and outperform the multivariate Mat\\'ern model in\nout-of-sample prediction under slowly-decaying correlation of the underlying\nmultivariate random field. 
We also establish properties of multivariate CH\nmodels, including equivalence of Gaussian measures, and demonstrate their use\nin modeling a multivariate oceanography data set consisting of temperature,\nsalinity and oxygen, as measured by autonomous floats in the Southern Ocean."}, "http://arxiv.org/abs/2312.05718": {"title": "Feasible contact tracing", "link": "http://arxiv.org/abs/2312.05718", "description": "Contact tracing is one of the most important tools for preventing the spread\nof infectious diseases, but as the experience of COVID-19 showed, it is also\nnext-to-impossible to implement when the disease is spreading rapidly. We show\nhow to substantially improve the efficiency of contact tracing by combining\nstandard microeconomic tools that measure heterogeneity in how infectious a\nsick person is with ideas from machine learning about sequential optimization.\nOur contributions are twofold. First, we incorporate heterogeneity in\nindividual infectiousness in a multi-armed bandit to establish optimal\nalgorithms. At the heart of this strategy is a focus on learning. In the\ntypical conceptualization of contact tracing, contacts of an infected person\nare tested to find more infections. Under a learning-first framework, however,\ncontacts of infected persons are tested to ascertain whether the infected\nperson is likely to be a \"high infector\" and to find additional infections only\nif it is likely to be highly fruitful. Second, we demonstrate using three\nadministrative contact tracing datasets from India and Pakistan during COVID-19\nthat this strategy improves efficiency. Using our algorithm, we find 80% of\ninfections with just 40% of contacts while current approaches test twice as\nmany contacts to identify the same number of infections. We further show that a\nsimple strategy that can be easily implemented in the field performs at nearly\noptimal levels, allowing for, what we call, feasible contact tracing. These\nresults are immediately transferable to contact tracing in any epidemic."}, "http://arxiv.org/abs/2312.05756": {"title": "A quantitative fusion strategy of stock picking and timing based on Particle Swarm Optimized-Back Propagation Neural Network and Multivariate Gaussian-Hidden Markov Model", "link": "http://arxiv.org/abs/2312.05756", "description": "In recent years, machine learning (ML) has brought effective approaches and\nnovel techniques to economic decision making, investment forecasting, and risk\nmanagement, coping with the variable and intricate nature of economic and\nfinancial environments. For investment in the stock market, this research\nintroduces a pioneering quantitative fusion model combining stock timing and\nstock picking strategies by leveraging the Multivariate Gaussian-Hidden Markov\nModel (MGHMM) and a Back Propagation Neural Network optimized by Particle\nSwarm (PSO-BPNN). After the information coefficients (IC) between fifty-two\nwinsorized, neutralized and standardized factors and the return of the CSI 300\nindex are calculated, the top-ranking factors are chosen as candidate inputs to\nthe PSO-BPNN after dimension reduction by Principal Component Analysis (PCA),\nwhich in turn outputs a set of constituent stocks. Subsequently, we conduct\nprediction and trading on the basis of the screened stocks and the stock market\nstate output by the MGHMM, which is trained on Box-Cox transformed CSI 300\nindex data, showing excellent performance over the past four years. Ultimately,\nseveral conventional forecasting and trading methods are compared with our\nstrategy in the Chinese stock market. The fusion strategy incorporating stock\npicking and timing presented in this article provides an innovative technique\nfor financial analysis."}, "http://arxiv.org/abs/2312.05757": {"title": "Towards Human-like Perception: Learning Structural Causal Model in Heterogeneous Graph", "link": "http://arxiv.org/abs/2312.05757", "description": "Heterogeneous graph neural networks have become popular in various domains.\nHowever, their generalizability and interpretability are limited due to the\ndiscrepancy between their inherent inference flows and human reasoning logic or\nunderlying causal relationships for the learning problem. This study introduces\na novel solution, HG-SCM (Heterogeneous Graph as Structural Causal Model). It\ncan mimic the human perception and decision process through two key steps:\nconstructing intelligible variables based on semantics derived from the graph\nschema and automatically learning task-level causal relationships among these\nvariables by incorporating advanced causal discovery techniques. We compared\nHG-SCM to seven state-of-the-art baseline models on three real-world datasets,\nunder three distinct and ubiquitous out-of-distribution settings. HG-SCM\nachieved the highest average performance rank with minimal standard deviation,\nsubstantiating its effectiveness and superiority in terms of both predictive\npower and generalizability. Additionally, the visualization and analysis of the\nauto-learned causal diagrams for the three tasks aligned well with domain\nknowledge and human cognition, demonstrating prominent interpretability.\nHG-SCM's human-like nature and its enhanced generalizability and\ninterpretability make it a promising solution for special scenarios where\ntransparency and trustworthiness are paramount."}, "http://arxiv.org/abs/2312.05802": {"title": "Enhancing Scalability in Bayesian Nonparametric Factor Analysis of Spatiotemporal Data", "link": "http://arxiv.org/abs/2312.05802", "description": "This manuscript puts forward novel, practicable spatiotemporal Bayesian\nfactor analysis frameworks that are computationally feasible for moderate to\nlarge data. Our models exhibit significantly enhanced computational scalability\nand storage efficiency, deliver high overall modeling performance, and possess\npowerful inferential capabilities for adequately predicting outcomes at future\ntime points or new spatial locations and for satisfactorily clustering spatial\nlocations into regions with similar temporal trajectories, a frequently\nencountered crucial task. On top of a baseline separable factor model with\ntemporally dependent latent factors and spatially dependent factor loadings\nunder a probit stick breaking process (PSBP) prior, we integrate a new slice\nsampling algorithm that permits unknown and varying numbers of spatial mixture\ncomponents across all factors and guarantees them to be non-increasing through\nthe MCMC iterations, thus considerably enhancing model flexibility, efficiency,\nand scalability. We further introduce a novel spatial latent nearest-neighbor\nGaussian process (NNGP) prior and new sequential updating algorithms for the\nspatially varying latent variables in the PSBP prior, thereby attaining high\nspatial scalability. 
The markedly accelerated posterior sampling and spatial\nprediction as well as the great modeling and inferential performances of our\nmodels are substantiated by our simulation experiments."}, "http://arxiv.org/abs/2312.05974": {"title": "Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise", "link": "http://arxiv.org/abs/2312.05974", "description": "This paper considers learning the hidden causal network of a linear networked\ndynamical system (NDS) from the time series data at some of its nodes --\npartial observability. The dynamics of the NDS are driven by colored noise that\ngenerates spurious associations across pairs of nodes, rendering the problem\nmuch harder. To address the challenge of noise correlation and partial\nobservability, we assign to each pair of nodes a feature vector computed from\nthe time series data of observed nodes. The feature embedding is engineered to\nyield structural consistency: there exists an affine hyperplane that\nconsistently partitions the set of features, separating the feature vectors\ncorresponding to connected pairs of nodes from those corresponding to\ndisconnected pairs. The causal inference problem is thus addressed via\nclustering the designed features. We demonstrate with simple baseline\nsupervised methods the competitive performance of the proposed causal inference\nmechanism under broad connectivity regimes and noise correlation levels,\nincluding a real world network. Further, we devise novel technical guarantees\nof structural consistency for linear NDS under the considered regime."}, "http://arxiv.org/abs/2312.06018": {"title": "A Multivariate Polya Tree Model for Meta-Analysis with Event Time Distributions", "link": "http://arxiv.org/abs/2312.06018", "description": "We develop a non-parametric Bayesian prior for a family of random probability\nmeasures by extending the Polya tree ($PT$) prior to a joint prior for a set of\nprobability measures $G_1,\\dots,G_n$, suitable for meta-analysis with event\ntime outcomes. In the application to meta-analysis $G_i$ is the event time\ndistribution specific to study $i$. The proposed model defines a regression on\nstudy-specific covariates by introducing increased correlation for any pair of\nstudies with similar characteristics. The desired multivariate $PT$ model is\nconstructed by introducing a hierarchical prior on the conditional splitting\nprobabilities in the $PT$ construction for each of the $G_i$. The hierarchical\nprior replaces the independent beta priors for the splitting probability in the\n$PT$ construction with a Gaussian process prior for corresponding (logit)\nsplitting probabilities across all studies. The Gaussian process is indexed by\nstudy-specific covariates, introducing the desired dependence with increased\ncorrelation for similar studies. The main feature of the proposed construction\nis (conditionally) conjugate posterior updating with commonly reported\ninference summaries for event time data. 
The construction is motivated by a\nmeta-analysis of cancer immunotherapy studies."}, "http://arxiv.org/abs/2312.06028": {"title": "Rejoinder to Discussion of \"A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks''", "link": "http://arxiv.org/abs/2312.06028", "description": "This rejoinder responds to the discussions by Caimo, Niezink, and\nSchweinberger and Fritz of ''A Tale of Two Datasets: Representativeness and\nGeneralisability of Inference for Samples of Networks'' by Krivitsky, Coletti,\nand Hens, all published in the Journal of the American Statistical Association\nin 2023."}, "http://arxiv.org/abs/2312.06098": {"title": "Mixture Matrix-valued Autoregressive Model", "link": "http://arxiv.org/abs/2312.06098", "description": "Time series of matrix-valued data are increasingly available in various areas\nincluding economics, finance, social science, etc. These data may shed light on\nthe inter-dynamical relationships between two sets of attributes, for instance\ncountries and economic indices. The matrix autoregressive (MAR) model provides\na parsimonious approach for analyzing such data. However, the MAR model, being\na linear model with parametric constraints, cannot capture the nonlinear\npatterns in the data, such as regime shifts in the dynamics. We propose a\nmixture matrix autoregressive (MMAR) model for analyzing potential regime\nshifts in the dynamics between two attributes, for instance, due to recession\nvs. blooming, or quiet period vs. pandemic. We propose an EM algorithm for\nmaximum likelihood estimation. We derive some theoretical properties of the\nproposed method including consistency and asymptotic distribution, and\nillustrate its performance via simulations and real applications."}, "http://arxiv.org/abs/2312.06155": {"title": "Illustrating the structures of bias from immortal time using directed acyclic graphs", "link": "http://arxiv.org/abs/2312.06155", "description": "Immortal time is a period of follow-up during which death or the study\noutcome cannot occur by design. Bias from immortal time has been increasingly\nrecognized in epidemiologic studies. However, it remains unclear how immortal\ntime arises and what the structures of bias from immortal time are. Here, we\nuse an example \"Do Nobel Prize winners live longer than less recognized\nscientists?\" to illustrate that immortal time arises from using postbaseline\ninformation to define exposure or eligibility. We use time-varying directed\nacyclic graphs (DAGs) to present the structures of bias from immortal time as\nthe key sources of bias, that is, confounding and selection bias. We explain\nthat excluding immortal time from the follow-up does not fully address the\nbias, and that the presence of competing risks can worsen the bias. We also\ndiscuss how the structures of bias from immortal time are shared by different\nstudy designs in pharmacoepidemiology and provide solutions, where possible, to\naddress the bias."}, "http://arxiv.org/abs/2312.06159": {"title": "Could dropping a few cells change the takeaways from differential expression?", "link": "http://arxiv.org/abs/2312.06159", "description": "Differential expression (DE) plays a fundamental role toward illuminating the\nmolecular mechanisms driving a difference between groups (e.g., due to\ntreatment or disease). 
While any analysis is run on particular cells/samples,\nthe intent is to generalize to future occurrences of the treatment or disease.\nImplicitly, this step is justified by assuming that present and future samples\nare independent and identically distributed from the same population. Though\nthis assumption is always false, we hope that any deviation from the assumption\nis small enough that A) conclusions of the analysis still hold and B) standard\ntools like standard error, significance, and power still reflect\ngeneralizability. Conversely, we might worry about these deviations, and\nreliance on standard tools, if conclusions could be substantively changed by\ndropping a very small fraction of data. While checking every small fraction is\ncomputationally intractable, recent work develops an approximation to identify\nwhen such an influential subset exists. Building on this work, we develop a\nmetric for dropping-data robustness of DE; namely, we cast the analysis in a\nform suitable to the approximation, extend the approximation to models with\ndata-dependent hyperparameters, and extend the notion of a data point from a\nsingle cell to a pseudobulk observation. We then overcome the inherent\nnon-differentiability of gene set enrichment analysis to develop an additional\napproximation for the robustness of top gene sets. We assess robustness of DE\nfor published single-cell RNA-seq data and discover that 1000s of genes can\nhave their results flipped by dropping <1% of the data, including 100s that are\nsensitive to dropping a single cell (0.07%). Surprisingly, this non-robustness\nextends to high-level takeaways; half of the top 10 gene sets can be changed by\ndropping 1-2% of cells, and 2/10 can be changed by dropping a single cell."}, "http://arxiv.org/abs/2312.06204": {"title": "Multilayer Network Regression with Eigenvector Centrality and Community Structure", "link": "http://arxiv.org/abs/2312.06204", "description": "Centrality measures and community structures play a pivotal role in the\nanalysis of complex networks. To effectively model the impact of the network on\nour variable of interest, it is crucial to integrate information from the\nmultilayer network, including the interlayer correlations of network data. In\nthis study, we introduce a two-stage regression model that leverages the\neigenvector centrality and network community structure of fourth-order\ntensor-like multilayer networks. Initially, we utilize the eigenvector\ncentrality of multilayer networks, a method that has found extensive\napplication in prior research. Subsequently, we amalgamate the network\ncommunity structure to construct the community component centrality and\nindividual component centrality of nodes, which are then incorporated into the\nregression model. Furthermore, we establish the asymptotic properties of the\nleast squares estimates of the regression model coefficients. Our proposed\nmethod is employed to analyze data from the European airport network and The\nWorld Input-Output Database (WIOD), demonstrating its practical applicability\nand effectiveness."}, "http://arxiv.org/abs/2312.06265": {"title": "Type I Error Rates are Not Usually Inflated", "link": "http://arxiv.org/abs/2312.06265", "description": "The inflation of Type I error rates is thought to be one of the causes of the\nreplication crisis. 
Questionable research practices such as p-hacking are\nthought to inflate Type I error rates above their nominal level, leading to\nunexpectedly high levels of false positives in the literature and,\nconsequently, unexpectedly low replication rates. In this article, I offer an\nalternative view. I argue that questionable and other research practices do not\nusually inflate relevant Type I error rates. I begin with an introduction to\nType I error rates that distinguishes them from theoretical errors. I then\nillustrate my argument with respect to model misspecification, multiple\ntesting, selective inference, forking paths, exploratory analyses, p-hacking,\noptional stopping, double dipping, and HARKing. In each case, I demonstrate\nthat relevant Type I error rates are not usually inflated above their nominal\nlevel, and in the rare cases that they are, the inflation is easily identified\nand resolved. I conclude that the replication crisis may be explained, at least\nin part, by researchers' misinterpretation of statistical errors and their\nunderestimation of theoretical errors."}, "http://arxiv.org/abs/2312.06289": {"title": "A graphical framework for interpretable correlation matrix models", "link": "http://arxiv.org/abs/2312.06289", "description": "In this work, we present a new approach for constructing models for\ncorrelation matrices with a user-defined graphical structure. The graphical\nstructure makes correlation matrices interpretable and avoids the quadratic\nincrease of parameters as a function of the dimension. We suggest an automatic\napproach to define a prior using a natural sequence of simpler models within\nthe Penalized Complexity framework for the unknown parameters in these models.\n\nWe illustrate this approach with three applications: a multivariate linear\nregression of four biomarkers, a multivariate disease mapping, and a\nmultivariate longitudinal joint modelling. Each application underscores our\nmethod's intuitive appeal, signifying a substantial advancement toward a more\ncohesive and enlightening model that facilitates a meaningful interpretation of\ncorrelation matrices."}, "http://arxiv.org/abs/2312.06334": {"title": "Scoring multilevel regression and postratification based population and subpopulation estimates", "link": "http://arxiv.org/abs/2312.06334", "description": "Multilevel regression and poststratification (MRP) has been used extensively\nto adjust convenience and low-response surveys to make population and\nsubpopulation estimates. For this method, model validation is particularly\nimportant, but recent work has suggested that simple aggregation of individual\nprediction errors does not give a good measure of the error of the population\nestimate. In this manuscript we provide a clear explanation for why this\noccurs, propose two scoring metrics designed specifically for this problem, and\ndemonstrate their use in three different ways. We demonstrate that these\nscoring metrics correctly order models when compared to true goodness of\nestimate, although they do underestimate the magnitude of the score."}, "http://arxiv.org/abs/2312.06415": {"title": "Practicable Power Curve Approximation for Bioequivalence with Unequal Variances", "link": "http://arxiv.org/abs/2312.06415", "description": "Two-group (bio)equivalence tests assess whether two drug formulations provide\nsimilar therapeutic effects. 
These studies are often conducted using two\none-sided t-tests, where the test statistics jointly follow a bivariate\nt-distribution with singular covariance matrix. Unless the two groups of data\nare assumed to have equal variances, the degrees of freedom for this bivariate\nt-distribution are noninteger and unknown a priori. This makes it difficult to\nanalytically find sample sizes that yield desired power for the study using an\nautomated process. Popular free software for bioequivalence study design does\nnot accommodate the comparison of two groups with unequal variances, and\ncertain paid software solutions that make this accommodation produce unstable\nresults. We propose a novel simulation-based method that uses Sobol' sequences\nand root-finding algorithms to quickly and accurately approximate the power\ncurve for two-group bioequivalence tests with unequal variances. We also\nillustrate that caution should be exercised when assuming automated methods for\npower estimation are robust to arbitrary bioequivalence designs. Our methods\nfor sample size determination mitigate this lack of robustness and are widely\napplicable to equivalence and noninferiority tests facilitated via parallel and\ncrossover designs. All methods proposed in this work can be implemented using\nthe dent package in R."}, "http://arxiv.org/abs/2312.06437": {"title": "Posterior Ramifications of Prior Dependence Structures", "link": "http://arxiv.org/abs/2312.06437", "description": "In fully Bayesian analyses, prior distributions are specified before\nobserving data. Prior elicitation methods transfigure prior information into\nquantifiable prior distributions. Recently, methods that leverage copulas have\nbeen proposed to accommodate more flexible dependence structures when eliciting\nmultivariate priors. The resulting priors have been framed as suitable\ncandidates for Bayesian analysis. We prove that under broad conditions, the\nposterior cannot retain many of these flexible prior dependence structures as\ndata are observed. However, these flexible copula-based priors are useful for\ndesign purposes. Because correctly specifying the dependence structure a priori\ncan be difficult, we consider how the choice of prior copula impacts the\nposterior distribution in terms of convergence of the posterior mode. We also\nmake recommendations regarding prior dependence specification for posterior\nanalyses that streamline the prior elicitation process."}, "http://arxiv.org/abs/2312.06465": {"title": "A New Projection Pursuit Index for Big Data", "link": "http://arxiv.org/abs/2312.06465", "description": "Visualization of extremely large datasets in static or dynamic form is a huge\nchallenge because most traditional methods cannot deal with big data problems.\nA new visualization method for big data is proposed based on Projection\nPursuit, Guided Tour and Data Nuggets methods, that will help display\ninteresting hidden structures such as clusters, outliers, and other nonlinear\nstructures in big data. The Guided Tour is a dynamic graphical tool for\nhigh-dimensional data combining Projection Pursuit and Grand Tour methods. It\ndisplays a dynamic sequence of low-dimensional projections obtained by using\nProjection Pursuit (PP) index functions to navigate the data space. Different\nPP indices have been developed to detect interesting structures of multivariate\ndata but there are computational problems for big data using the original\nguided tour with these indices. 
A new PP index is developed to be computable\nfor big data, with the help of a data compression method called Data Nuggets\nthat reduces large datasets while maintaining the original data structure.\nSimulation studies are conducted and a real large dataset is used to illustrate\nthe proposed methodology. Static and dynamic graphical tools for big data can\nbe developed based on the proposed PP index to detect nonlinear structures."}, "http://arxiv.org/abs/2312.06478": {"title": "Prediction De-Correlated Inference", "link": "http://arxiv.org/abs/2312.06478", "description": "Leveraging machine-learning methods to predict outcomes on some unlabeled\ndatasets and then using these pseudo-outcomes in subsequent statistical\ninference is common in modern data analysis. Inference in this setting is often\ncalled post-prediction inference. We propose a novel, assumption-lean framework\nfor inference under post-prediction setting, called \\emph{Prediction\nDe-Correlated inference} (PDC). Our approach can automatically adapt to any\nblack-box machine-learning model and consistently outperforms supervised\nmethods. The PDC framework also offers easy extensibility for accommodating\nmultiple predictive models. Both numerical results and real-world data analysis\nsupport our theoretical results."}, "http://arxiv.org/abs/2312.06547": {"title": "KF-PLS: Optimizing Kernel Partial Least-Squares (K-PLS) with Kernel Flows", "link": "http://arxiv.org/abs/2312.06547", "description": "Partial Least-Squares (PLS) Regression is a widely used tool in chemometrics\nfor performing multivariate regression. PLS is a bi-linear method that has a\nlimited capacity of modelling non-linear relations between the predictor\nvariables and the response. Kernel PLS (K-PLS) has been introduced for\nmodelling non-linear predictor-response relations. In K-PLS, the input data is\nmapped via a kernel function to a Reproducing Kernel Hilbert space (RKH), where\nthe dependencies between the response and the input matrix are assumed to be\nlinear. K-PLS is performed in the RKH space between the kernel matrix and the\ndependent variable. Most available studies use fixed kernel parameters. Only a\nfew studies have been conducted on optimizing the kernel parameters for K-PLS.\nIn this article, we propose a methodology for the kernel function optimization\nbased on Kernel Flows (KF), a technique developed for Gaussian process\nregression (GPR). The results are illustrated with four case studies. The case\nstudies represent both numerical examples and real data used in classification\nand regression tasks. K-PLS optimized with KF, called KF-PLS in this study, is\nshown to yield good results in all illustrated scenarios. The paper presents\ncross-validation studies and hyperparameter analysis of the KF methodology when\napplied to K-PLS."}, "http://arxiv.org/abs/2312.06605": {"title": "Statistical Inference on Latent Space Models for Network Data", "link": "http://arxiv.org/abs/2312.06605", "description": "Latent space models are powerful statistical tools for modeling and\nunderstanding network data. While the importance of accounting for uncertainty\nin network analysis has been well recognized, the current literature\npredominantly focuses on point estimation and prediction, leaving the\nstatistical inference of latent space models an open question. This work aims\nto fill this gap by providing a general framework to analyze the theoretical\nproperties of the maximum likelihood estimators. 
In particular, we establish\nthe uniform consistency and asymptotic distribution results for the latent\nspace models under different edge types and link functions. Furthermore, the\nproposed framework enables us to generalize our results to the dependent-edge\nand sparse scenarios. Our theories are supported by simulation studies and have\nthe potential to be applied in downstream inferences, such as link prediction\nand network testing problems."}, "http://arxiv.org/abs/2312.06616": {"title": "The built environment and induced transport CO2 emissions: A double machine learning approach to account for residential self-selection", "link": "http://arxiv.org/abs/2312.06616", "description": "Understanding why travel behavior differs between residents of urban centers\nand suburbs is key to sustainable urban planning. Especially in light of rapid\nurban growth, identifying housing locations that minimize travel demand and\ninduced CO2 emissions is crucial to mitigate climate change. While the built\nenvironment plays an important role, the precise impact on travel behavior is\nobfuscated by residential self-selection. To address this issue, we propose a\ndouble machine learning approach to obtain unbiased, spatially-explicit\nestimates of the effect of the built environment on travel-related CO2\nemissions for each neighborhood by controlling for residential self-selection.\nWe examine how socio-demographics and travel-related attitudes moderate the\neffect and how it decomposes across the 5Ds of the built environment. Based on\na case study for Berlin and the travel diaries of 32,000 residents, we find\nthat the built environment causes household travel-related CO2 emissions to\ndiffer by a factor of almost two between central and suburban neighborhoods in\nBerlin. To highlight the practical importance for urban climate mitigation, we\nevaluate current plans for 64,000 new residential units in terms of total\ninduced transport CO2 emissions. Our findings underscore the significance of\nspatially differentiated compact development to decarbonize the transport\nsector."}, "http://arxiv.org/abs/2111.10715": {"title": "Confidences in Hypotheses", "link": "http://arxiv.org/abs/2111.10715", "description": "This article introduces a broadly-applicable new method of statistical\nanalysis called hypotheses assessment. It is a frequentist procedure designed\nto answer the question: Given the sample evidence and assuming one of two\nhypotheses is true, what is the relative plausibility of each hypothesis? Our\naim is to determine frequentist confidences in the hypotheses that are relevant\nto the data at hand and are as powerful as the particular application allows.\nHypotheses assessments complement hypothesis tests because providing\nconfidences in the hypotheses in addition to test results can better inform\napplied researchers about the strength of evidence provided by the data. For\nsimple hypotheses, the method produces minimum and maximum confidences in each\nhypothesis. The composite case is more complex, and we introduce two\nconventions to aid with understanding the strength of evidence. 
Assessments are\nqualitatively different from hypothesis testing and confidence interval\noutcomes, and thus fill a gap in the statistician's toolkit."}, "http://arxiv.org/abs/2204.03343": {"title": "Binary Spatial Random Field Reconstruction from Non-Gaussian Inhomogeneous Time-series Observations", "link": "http://arxiv.org/abs/2204.03343", "description": "We develop a new model for spatial random field reconstruction of a\nbinary-valued spatial phenomenon. In our model, sensors are deployed in a\nwireless sensor network across a large geographical region. Each sensor\nmeasures a non-Gaussian inhomogeneous temporal process which depends on the\nspatial phenomenon. Two types of sensors are employed: one collects point\nobservations at specific time points, while the other collects integral\nobservations over time intervals. Subsequently, the sensors transmit these\ntime-series observations to a Fusion Center (FC), and the FC infers the spatial\nphenomenon from these observations. We show that the resulting posterior\npredictive distribution is intractable and develop a tractable two-step\nprocedure to perform inference. Firstly, we develop algorithms to perform\napproximate Likelihood Ratio Tests on the time-series observations, compressing\nthem to a single bit for both point sensors and integral sensors. Secondly,\nonce the compressed observations are transmitted to the FC, we utilize a\nSpatial Best Linear Unbiased Estimator (S-BLUE) to reconstruct the binary\nspatial random field at any desired spatial location. The performance of the\nproposed approach is studied using simulation. We further illustrate the\neffectiveness of our method using a weather dataset from the National\nEnvironment Agency (NEA) of Singapore with fields including temperature and\nrelative humidity."}, "http://arxiv.org/abs/2210.08964": {"title": "PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting", "link": "http://arxiv.org/abs/2210.08964", "description": "This paper presents a new perspective on time series forecasting. In existing\ntime series forecasting methods, the models take a sequence of numerical values\nas input and yield numerical values as output. The existing SOTA models are\nlargely based on the Transformer architecture, modified with multiple encoding\nmechanisms to incorporate the context and semantics around the historical data.\nInspired by the successes of pre-trained language foundation models, we pose a\nquestion about whether these models can also be adapted to solve time-series\nforecasting. Thus, we propose a new forecasting paradigm: prompt-based time\nseries forecasting (PromptCast). In this novel task, the numerical input and\noutput are transformed into prompts and the forecasting task is framed in a\nsentence-to-sentence manner, making it possible to directly apply language\nmodels for forecasting purposes. To support and facilitate the research of this\ntask, we also present a large-scale dataset (PISA) that includes three\nreal-world forecasting scenarios. We evaluate different SOTA numerical-based\nforecasting methods and language generation models. The benchmark results with\nvarious forecasting settings demonstrate the proposed PromptCast with language\ngeneration models is a promising research direction. 
Additionally, in\ncomparison to conventional numerical-based forecasting, PromptCast shows a much\nbetter generalization ability under the zero-shot setting."}, "http://arxiv.org/abs/2211.13715": {"title": "Trust Your $\\nabla$: Gradient-based Intervention Targeting for Causal Discovery", "link": "http://arxiv.org/abs/2211.13715", "description": "Inferring causal structure from data is a challenging task of fundamental\nimportance in science. Observational data are often insufficient to identify a\nsystem's causal structure uniquely. While conducting interventions (i.e.,\nexperiments) can improve the identifiability, such samples are usually\nchallenging and expensive to obtain. Hence, experimental design approaches for\ncausal discovery aim to minimize the number of interventions by estimating the\nmost informative intervention target. In this work, we propose a novel\nGradient-based Intervention Targeting method, abbreviated GIT, that 'trusts'\nthe gradient estimator of a gradient-based causal discovery framework to\nprovide signals for the intervention acquisition function. We provide extensive\nexperiments in simulated and real-world datasets and demonstrate that GIT\nperforms on par with competitive baselines, surpassing them in the low-data\nregime."}, "http://arxiv.org/abs/2303.02951": {"title": "The (Surprising) Sample Optimality of Greedy Procedures for Large-Scale Ranking and Selection", "link": "http://arxiv.org/abs/2303.02951", "description": "Ranking and selection (R&S) aims to select the best alternative with the\nlargest mean performance from a finite set of alternatives. Recently,\nconsiderable attention has turned towards the large-scale R&S problem which\ninvolves a large number of alternatives. Ideal large-scale R&S procedures\nshould be sample optimal, i.e., the total sample size required to deliver an\nasymptotically non-zero probability of correct selection (PCS) grows at the\nminimal order (linear order) in the number of alternatives, $k$. Surprisingly,\nwe discover that the na\\\"ive greedy procedure, which keeps sampling the\nalternative with the largest running average, performs strikingly well and\nappears sample optimal. To understand this discovery, we develop a new\nboundary-crossing perspective and prove that the greedy procedure is sample\noptimal for the scenarios where the best mean maintains at least a positive\nconstant away from all other means as $k$ increases. We further show that the\nderived PCS lower bound is asymptotically tight for the slippage configuration\nof means with a common variance. For other scenarios, we consider the\nprobability of good selection and find that the result depends on the growth\nbehavior of the number of good alternatives: if it remains bounded as $k$\nincreases, the sample optimality still holds; otherwise, the result may change.\nMoreover, we propose the explore-first greedy procedures by adding an\nexploration phase to the greedy procedure. The procedures are proven to be\nsample optimal and consistent under the same assumptions. Last, we numerically\ninvestigate the performance of our greedy procedures in solving large-scale R&S\nproblems."}, "http://arxiv.org/abs/2303.07490": {"title": "Comparing the Robustness of Simple Network Scale-Up Method (NSUM) Estimators", "link": "http://arxiv.org/abs/2303.07490", "description": "The network scale-up method (NSUM) is a cost-effective approach to estimating\nthe size or prevalence of a group of people that is hard to reach through a\nstandard survey. 
The basic NSUM involves two steps: estimating respondents'\ndegrees by one of various methods (in this paper we focus on the probe group\nmethod which uses the number of people a respondent knows in various groups of\nknown size), and estimating the prevalence of the hard-to-reach population of\ninterest using respondents' estimated degrees and the number of people they\nreport knowing in the hard-to-reach group. Each of these two steps involves\ntaking either an average of ratios or a ratio of averages. Using the ratio of\naverages for each step has so far been the most common approach. However, we\npresent theoretical arguments that using the average of ratios at the second,\nprevalence-estimation step often has lower mean squared error when the random\nmixing assumption is violated, which seems likely in practice; this estimator\nwhich uses the ratio of averages for degree estimates and the average of ratios\nfor prevalence was proposed early in NSUM development but has largely been\nunexplored and unused. Simulation results using an example network data set\nalso support these findings. Based on this theoretical and empirical evidence,\nwe suggest that future surveys that use a simple estimator may want to use this\nmixed estimator, and estimation methods based on this estimator may produce new\nimprovements."}, "http://arxiv.org/abs/2307.11084": {"title": "GeoCoDA: Recognizing and Validating Structural Processes in Geochemical Data", "link": "http://arxiv.org/abs/2307.11084", "description": "Geochemical data are compositional in nature and are subject to the problems\ntypically associated with data that are restricted to the real non-negative\nnumber space with constant-sum constraint, that is, the simplex. Geochemistry\ncan be considered a proxy for mineralogy, comprised of atomically ordered\nstructures that define the placement and abundance of elements in the mineral\nlattice structure. Based on the innovative contributions of John Aitchison, who\nintroduced the logratio transformation into compositional data analysis, this\ncontribution provides a systematic workflow for assessing geochemical data in a\nsimple and efficient way, such that significant geochemical (mineralogical)\nprocesses can be recognized and validated. This workflow, called GeoCoDA and\npresented here in the form of a tutorial, enables the recognition of processes\nfrom which models can be constructed based on the associations of elements that\nreflect mineralogy. Both the original compositional values and their\ntransformation to logratios are considered. These models can reflect\nrock-forming processes, metamorphism, alteration and ore mineralization.\nMoreover, machine learning methods, both unsupervised and supervised, applied\nto an optimized set of subcompositions of the data, provide a systematic,\naccurate, efficient and defensible approach to geochemical data analysis. 
The\nworkflow is illustrated on lithogeochemical data from exploration of the Star\nkimberlite, consisting of a series of eruptions with five recognized phases."}, "http://arxiv.org/abs/2307.15681": {"title": "A Continuous-Time Dynamic Factor Model for Intensive Longitudinal Data Arising from Mobile Health Studies", "link": "http://arxiv.org/abs/2307.15681", "description": "Intensive longitudinal data (ILD) collected in mobile health (mHealth)\nstudies contain rich information on multiple outcomes measured frequently over\ntime that have the potential to capture short-term and long-term dynamics.\nMotivated by an mHealth study of smoking cessation in which participants\nself-report the intensity of many emotions multiple times per day, we describe\na dynamic factor model that summarizes the ILD as a low-dimensional,\ninterpretable latent process. This model consists of two submodels: (i) a\nmeasurement submodel--a factor model--that summarizes the multivariate\nlongitudinal outcome as lower-dimensional latent variables and (ii) a\nstructural submodel--an Ornstein-Uhlenbeck (OU) stochastic process--that\ncaptures the temporal dynamics of the multivariate latent process in continuous\ntime. We derive a closed-form likelihood for the marginal distribution of the\noutcome and the computationally-simpler sparse precision matrix for the OU\nprocess. We propose a block coordinate descent algorithm for estimation.\nFinally, we apply our method to the mHealth data to summarize the dynamics of\n18 different emotions as two latent processes. These latent processes are\ninterpreted by behavioral scientists as the psychological constructs of\npositive and negative affect and are key in understanding vulnerability to\nlapsing back to tobacco use among smokers attempting to quit."}, "http://arxiv.org/abs/2308.09562": {"title": "Outlier detection for mixed-type data: A novel approach", "link": "http://arxiv.org/abs/2308.09562", "description": "Outlier detection can serve as an extremely important tool for researchers\nfrom a wide range of fields. From the sectors of banking and marketing to the\nsocial sciences and healthcare sectors, outlier detection techniques are very\nuseful for identifying subjects that exhibit different and sometimes peculiar\nbehaviours. When the data set available to the researcher consists of both\ndiscrete and continuous variables, outlier detection presents unprecedented\nchallenges. In this paper we propose a novel method that detects outlying\nobservations in settings of mixed-type data, while reducing the required user\ninteraction and providing general guidelines for selecting suitable\nhyperparameter values. The methodology developed is being assessed through a\nseries of simulations on data sets with varying characteristics and achieves\nvery good performance levels. Our method demonstrates a high capacity for\ndetecting the majority of outliers while minimising the number of falsely\ndetected non-outlying observations. 
The ideas and techniques outlined in the\npaper can be used either as a pre-processing step or in tandem with other data\nmining and machine learning algorithms for developing novel approaches to\nchallenging research problems."}, "http://arxiv.org/abs/2308.15986": {"title": "Sensitivity Analysis for Causal Effects in Observational Studies with Multivalued Treatments", "link": "http://arxiv.org/abs/2308.15986", "description": "One of the fundamental challenges in drawing causal inferences from\nobservational studies is that the assumption of no unmeasured confounding is\nnot testable from observed data. Therefore, assessing sensitivity to this\nassumption's violation is important to obtain valid causal conclusions in\nobservational studies. Although several sensitivity analysis frameworks are\navailable in the causal inference literature, very few of them are applicable\nto observational studies with multivalued treatments. To address this issue, we\npropose a framework for performing sensitivity analysis in\nmultivalued treatment settings. Within this framework, a general class of\nadditive causal estimands has been proposed. We demonstrate that the estimation\nof the causal estimands under the proposed sensitivity model can be performed\nvery efficiently. Simulation results show that the proposed framework performs\nwell in terms of bias of the point estimates and coverage of the confidence\nintervals when there is sufficient overlap in the covariate distributions. We\nillustrate the application of our proposed method by conducting an\nobservational study that estimates the causal effect of fish consumption on\nblood mercury levels."}, "http://arxiv.org/abs/2309.01536": {"title": "perms: Likelihood-free estimation of marginal likelihoods for binary response data in Python and R", "link": "http://arxiv.org/abs/2309.01536", "description": "In Bayesian statistics, the marginal likelihood (ML) is the key ingredient\nneeded for model comparison and model averaging. Unfortunately, estimating MLs\naccurately is notoriously difficult, especially for models where posterior\nsimulation is not possible. Recently, Christensen (2023) introduced the concept\nof permutation counting, which can accurately estimate MLs of models for\nexchangeable binary responses. Such data arise in a multitude of statistical\nproblems, including binary classification, bioassay and sensitivity testing.\nPermutation counting is entirely likelihood-free and works for any model from\nwhich a random sample can be generated, including nonparametric models. Here we\npresent perms, a package implementing permutation counting. As a result of\nextensive optimisation efforts, perms is computationally efficient and able to\nhandle large data problems. It is available as both an R package and a Python\nlibrary. A broad gallery of examples illustrating its usage is provided, which\nincludes both standard parametric binary classification and novel applications\nof nonparametric models, such as changepoint analysis. 
We also cover the\ndetails of the implementation of perms and illustrate its computational speed\nvia a simple simulation study."}, "http://arxiv.org/abs/2312.06669": {"title": "An Association Test Based on Kernel-Based Neural Networks for Complex Genetic Association Analysis", "link": "http://arxiv.org/abs/2312.06669", "description": "The advent of artificial intelligence, especially the progress of deep neural\nnetworks, is expected to revolutionize genetic research and offer unprecedented\npotential to decode the complex relationships between genetic variants and\ndisease phenotypes, which could mark a significant step toward improving our\nunderstanding of the disease etiology. While deep neural networks hold great\npromise for genetic association analysis, limited research has been focused on\ndeveloping neural-network-based tests to dissect complex genotype-phenotype\nassociations. This complexity arises from the opaque nature of neural networks\nand the absence of defined limiting distributions. We have previously developed\na kernel-based neural network model (KNN) that synergizes the strengths of\nlinear mixed models with conventional neural networks. KNN adopts a\ncomputationally efficient minimum norm quadratic unbiased estimator (MINQUE)\nalgorithm and uses KNN structure to capture the complex relationship between\nlarge-scale sequencing data and a disease phenotype of interest. In the KNN\nframework, we introduce a MINQUE-based test to assess the joint association of\ngenetic variants with the phenotype, which considers non-linear and\nnon-additive effects and follows a mixture of chi-square distributions. We also\nconstruct two additional tests to evaluate and interpret linear and\nnon-linear/non-additive genetic effects, including interaction effects. Our\nsimulations show that our method consistently controls the type I error rate\nunder various conditions and achieves greater power than a commonly used\nsequence kernel association test (SKAT), especially when involving non-linear\nand interaction effects. When applied to real data from the UK Biobank, our\napproach identified genes associated with hippocampal volume, which can be\nfurther replicated and evaluated for their role in the pathogenesis of\nAlzheimer's disease."}, "http://arxiv.org/abs/2312.06820": {"title": "Extracting Self-Consistent Causal Insights from Users Feedback with LLMs and In-context Learning", "link": "http://arxiv.org/abs/2312.06820", "description": "Microsoft Windows Feedback Hub is designed to receive customer feedback on a\nwide variety of subjects including critical topics such as power and battery.\nFeedback is one of the most effective ways to have a grasp of users' experience\nwith Windows and its ecosystem. However, the sheer volume of feedback received\nby Feedback Hub makes it immensely challenging to diagnose the actual cause of\nreported issues. To better understand and triage issues, we leverage Double\nMachine Learning (DML) to associate users' feedback with telemetry signals. One\nof the main challenges we face in the DML pipeline is the necessity of domain\nknowledge for model design (e.g., causal graph), which sometimes is either not\navailable or hard to obtain. In this work, we take advantage of reasoning\ncapabilities in Large Language Models (LLMs) to generate a prior model which,\nto some extent, compensates for the lack of domain knowledge and could be\nused as a heuristic for measuring feedback informativeness. 
Our LLM-based\napproach is able to extract previously known issues, uncover new bugs, and\nidentify sequences of events that lead to a bug, while minimizing out-of-domain\noutputs."}, "http://arxiv.org/abs/2312.06883": {"title": "Adaptive Experiments Toward Learning Treatment Effect Heterogeneity", "link": "http://arxiv.org/abs/2312.06883", "description": "Understanding treatment effect heterogeneity has become an increasingly\npopular task in various fields, as it helps design personalized advertisements\nin e-commerce or targeted treatment in biomedical studies. However, most of the\nexisting work in this research area focused on either analyzing observational\ndata based on strong causal assumptions or conducting post hoc analyses of\nrandomized controlled trial data, and there has been limited effort dedicated\nto the design of randomized experiments specifically for uncovering treatment\neffect heterogeneity. In the manuscript, we develop a framework for designing\nand analyzing response adaptive experiments toward better learning treatment\neffect heterogeneity. Concretely, we provide response adaptive experimental\ndesign frameworks that sequentially revise the data collection mechanism\naccording to the accrued evidence during the experiment. Such design strategies\nallow for the identification of subgroups with the largest treatment effects\nwith enhanced statistical efficiency. The proposed frameworks not only unify\nadaptive enrichment designs and response-adaptive randomization designs but\nalso complement A/B test designs in e-commerce and randomized trial designs in\nclinical settings. We demonstrate the merit of our design with theoretical\njustifications and in simulation studies with synthetic e-commerce and clinical\ntrial data."}, "http://arxiv.org/abs/2312.07175": {"title": "Instrumental Variable Estimation for Causal Inference in Longitudinal Data with Time-Dependent Latent Confounders", "link": "http://arxiv.org/abs/2312.07175", "description": "Causal inference from longitudinal observational data is a challenging\nproblem due to the difficulty in correctly identifying the time-dependent\nconfounders, especially in the presence of latent time-dependent confounders.\nInstrumental variable (IV) is a powerful tool for addressing the latent\nconfounders issue, but the traditional IV technique cannot deal with latent\ntime-dependent confounders in longitudinal studies. In this work, we propose a\nnovel Time-dependent Instrumental Factor Model (TIFM) for time-varying causal\neffect estimation from data with latent time-dependent confounders. At each\ntime-step, the proposed TIFM method employs the Recurrent Neural Network (RNN)\narchitecture to infer latent IV, and then uses the inferred latent IV factor\nfor addressing the confounding bias caused by the latent time-dependent\nconfounders. We provide a theoretical analysis for the proposed TIFM method\nregarding causal effect estimation in longitudinal data. Extensive evaluation\nwith synthetic datasets demonstrates the effectiveness of TIFM in addressing\ncausal effect estimation over time. 
We further apply TIFM to a climate dataset\nto showcase the potential of the proposed method in tackling real-world\nproblems."}, "http://arxiv.org/abs/2312.07177": {"title": "Fast Meta-Analytic Approximations for Relational Event Models: Applications to Data Streams and Multilevel Data", "link": "http://arxiv.org/abs/2312.07177", "description": "Large relational-event history data stemming from large networks are becoming\nincreasingly available due to recent technological developments (e.g. digital\ncommunication, online databases, etc). This opens many new doors to learning\nabout complex interaction behavior between actors in temporal social networks.\nThe relational event model has become the gold standard for relational event\nhistory analysis. Currently, however, the main bottleneck to fit relational\nevents models is of computational nature in the form of memory storage\nlimitations and computational complexity. Relational event models are therefore\nmainly used for relatively small data sets while larger, more interesting\ndatasets, including multilevel data structures and relational event data\nstreams, cannot be analyzed on standard desktop computers. This paper addresses\nthis problem by developing approximation algorithms based on meta-analysis\nmethods that can fit relational event models significantly faster while\navoiding the computational issues. In particular, meta-analytic approximations\nare proposed for analyzing streams of relational event data and multilevel\nrelational event data and potentially of combinations thereof. The accuracy and\nthe statistical properties of the methods are assessed using numerical\nsimulations. Furthermore, real-world data are used to illustrate the potential\nof the methodology to study social interaction behavior in an organizational\nnetwork and interaction behavior among political actors. The algorithms are\nimplemented in a publicly available R package 'remx'."}, "http://arxiv.org/abs/2312.07262": {"title": "Robust Bayesian graphical modeling using $\\gamma$-divergence", "link": "http://arxiv.org/abs/2312.07262", "description": "Gaussian graphical model is one of the powerful tools to analyze conditional\nindependence between two variables for multivariate Gaussian-distributed\nobservations. When the dimension of data is moderate or high, penalized\nlikelihood methods such as the graphical lasso are useful to detect significant\nconditional independence structures. However, the estimates are affected by\noutliers due to the Gaussian assumption. This paper proposes a novel robust\nposterior distribution for inference of Gaussian graphical models using the\n$\\gamma$-divergence which is one of the robust divergences. In particular, we\nfocus on the Bayesian graphical lasso by assuming the Laplace-type prior for\nelements of the inverse covariance matrix. The proposed posterior distribution\nmatches its maximum a posteriori estimate with the minimum $\\gamma$-divergence\nestimate provided by the frequentist penalized method. We show that the\nproposed method satisfies the posterior robustness which is a kind of measure\nof robustness in the Bayesian analysis. The property means that the information\nof outliers is automatically ignored in the posterior distribution as long as\nthe outliers are extremely large, which also provides theoretical robustness of\npoint estimate for the existing frequentist method. 
A sufficient condition for\nthe posterior propriety of the proposed posterior distribution is also shown.\nFurthermore, an efficient posterior computation algorithm via the weighted\nBayesian bootstrap method is proposed. The performance of the proposed method\nis illustrated through simulation studies and real data analysis."}, "http://arxiv.org/abs/2312.07320": {"title": "Convergence rates of non-stationary and deep Gaussian process regression", "link": "http://arxiv.org/abs/2312.07320", "description": "The focus of this work is the convergence of non-stationary and deep Gaussian\nprocess regression. More precisely, we follow a Bayesian approach to regression\nor interpolation, where the prior placed on the unknown function $f$ is a\nnon-stationary or deep Gaussian process, and we derive convergence rates of the\nposterior mean to the true function $f$ in terms of the number of observed\ntraining points. In some cases, we also show convergence of the posterior\nvariance to zero. The only assumption imposed on the function $f$ is that it is\nan element of a certain reproducing kernel Hilbert space, which we in\nparticular cases show to be norm-equivalent to a Sobolev space. Our analysis\nincludes the case of estimated hyper-parameters in the covariance kernels\nemployed, both in an empirical Bayes' setting and the particular hierarchical\nsetting constructed through deep Gaussian processes. We consider the settings\nof noise-free or noisy observations on deterministic or random training points.\nWe establish general assumptions sufficient for the convergence of deep\nGaussian process regression, along with explicit examples demonstrating the\nfulfilment of these assumptions. Specifically, our examples require that the\nH\\\"older or Sobolev norms of the penultimate layer are bounded almost surely."}, "http://arxiv.org/abs/2312.07479": {"title": "Convex Parameter Estimation of Perturbed Multivariate Generalized Gaussian Distributions", "link": "http://arxiv.org/abs/2312.07479", "description": "The multivariate generalized Gaussian distribution (MGGD), also known as the\nmultivariate exponential power (MEP) distribution, is widely used in signal and\nimage processing. However, estimating MGGD parameters, which is required in\npractical applications, still faces specific theoretical challenges. In\nparticular, establishing convergence properties for the standard fixed-point\napproach when both the distribution mean and the scatter (or the precision)\nmatrix are unknown is still an open problem. In robust estimation, imposing\nclassical constraints on the precision matrix, such as sparsity, has been\nlimited by the non-convexity of the resulting cost function. This paper tackles\nthese issues from an optimization viewpoint by proposing a convex formulation\nwith well-established convergence properties. We embed our analysis in a noisy\nscenario where robustness is induced by modelling multiplicative perturbations.\nThe resulting framework is flexible as it combines a variety of regularizations\nfor the precision matrix, the mean and model perturbations. This paper presents\nproof of the desired theoretical properties, specifies the conditions\npreserving these properties for different regularization choices and designs a\ngeneral proximal primal-dual optimization strategy. The experiments show a more\naccurate precision and covariance matrix estimation with similar performance\nfor the mean vector parameter compared to Tyler's M-estimator. 
In a\nhigh-dimensional setting, the proposed method outperforms the classical GLASSO,\none of its robust extensions, and the regularized Tyler's estimator."}, "http://arxiv.org/abs/2205.04324": {"title": "On a wider class of prior distributions for graphical models", "link": "http://arxiv.org/abs/2205.04324", "description": "Gaussian graphical models are useful tools for conditional independence\nstructure inference of multivariate random variables. Unfortunately, Bayesian\ninference of latent graph structures is challenging due to exponential growth\nof $\\mathcal{G}_n$, the set of all graphs in $n$ vertices. One approach that\nhas been proposed to tackle this problem is to limit search to subsets of\n$\\mathcal{G}_n$. In this paper, we study subsets that are vector subspaces with\nthe cycle space $\\mathcal{C}_n$ as main example. We propose a novel prior on\n$\\mathcal{C}_n$ based on linear combinations of cycle basis elements and\npresent its theoretical properties. Using this prior, we implement a Markov\nchain Monte Carlo algorithm, and show that (i) posterior edge inclusion\nestimates computed with our technique are comparable to estimates from the\nstandard technique despite searching a smaller graph space, and (ii) the vector\nspace perspective enables straightforward implementation of MCMC algorithms."}, "http://arxiv.org/abs/2301.10468": {"title": "Model selection-based estimation for generalized additive models using mixtures of g-priors: Towards systematization", "link": "http://arxiv.org/abs/2301.10468", "description": "We consider the estimation of generalized additive models using basis\nexpansions coupled with Bayesian model selection. Although Bayesian model\nselection is an intuitively appealing tool for regression splines, its use has\ntraditionally been limited to Gaussian additive regression because of the\navailability of a tractable form of the marginal model likelihood. We extend\nthe method to encompass the exponential family of distributions using the\nLaplace approximation to the likelihood. Although the approach exhibits success\nwith any Gaussian-type prior distribution, there remains a lack of consensus\nregarding the best prior distribution for nonparametric regression through\nmodel selection. We observe that the classical unit information prior\ndistribution for variable selection may not be well-suited for nonparametric\nregression using basis expansions. Instead, our investigation reveals that\nmixtures of g-priors are more suitable. We consider various mixtures of\ng-priors to evaluate the performance in estimating generalized additive models.\nFurthermore, we conduct a comparative analysis of several priors for knots to\nidentify the most practically effective strategy. Our extensive simulation\nstudies demonstrate the superiority of model selection-based approaches over\nother Bayesian methods."}, "http://arxiv.org/abs/2302.02482": {"title": "Continuously Indexed Graphical Models", "link": "http://arxiv.org/abs/2302.02482", "description": "Let $X = \\{X_{u}\\}_{u \\in U}$ be a real-valued Gaussian process indexed by a\nset $U$. It can be thought of as an undirected graphical model with every\nrandom variable $X_{u}$ serving as a vertex. We characterize this graph in\nterms of the covariance of $X$ through its reproducing kernel property. 
Unlike\nother characterizations in the literature, our characterization does not\nrestrict the index set $U$ to be finite or countable, and hence can be used to\nmodel the intrinsic dependence structure of stochastic processes in continuous\ntime/space. Consequently, this characterization is not in terms of the zero\nentries of an inverse covariance. This poses novel challenges for the problem\nof recovery of the dependence structure from a sample of independent\nrealizations of $X$, also known as structure estimation. We propose a\nmethodology that circumvents these issues, by targeting the recovery of the\nunderlying graph up to a finite resolution, which can be arbitrarily fine and\nis limited only by the available sample size. The recovery is shown to be\nconsistent so long as the graph is sufficiently regular in an appropriate\nsense. We derive corresponding convergence rates and finite sample guarantees.\nOur methodology is illustrated by means of a simulation study and two data\nanalyses."}, "http://arxiv.org/abs/2305.04634": {"title": "Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods", "link": "http://arxiv.org/abs/2305.04634", "description": "In spatial statistics, fast and accurate parameter estimation, coupled with a\nreliable means of uncertainty quantification, can be challenging when fitting a\nspatial process to real-world data because the likelihood function might be\nslow to evaluate or wholly intractable. In this work, we propose using\nconvolutional neural networks to learn the likelihood function of a spatial\nprocess. Through a specifically designed classification task, our neural\nnetwork implicitly learns the likelihood function, even in situations where the\nexact likelihood is not explicitly available. Once trained on the\nclassification task, our neural network is calibrated using Platt scaling which\nimproves the accuracy of the neural likelihood surfaces. To demonstrate our\napproach, we compare neural likelihood surfaces and the resulting maximum\nlikelihood estimates and approximate confidence regions with the equivalent for\nexact or approximate likelihood for two different spatial processes: a Gaussian\nprocess and a Brown-Resnick process which have computationally intensive and\nintractable likelihoods, respectively. We conclude that our method provides\nfast and accurate parameter estimation with a reliable method of uncertainty\nquantification in situations where standard methods are either undesirably slow\nor inaccurate. The method is applicable to any spatial process on a grid from\nwhich fast simulations are available."}, "http://arxiv.org/abs/2308.09112": {"title": "REACT to NHST: Sensible conclusions to meaningful hypotheses", "link": "http://arxiv.org/abs/2308.09112", "description": "While Null Hypothesis Significance Testing (NHST) remains a widely used\nstatistical tool, it suffers from several shortcomings, such as conflating\nstatistical and practical significance, sensitivity to sample size, and the\ninability to distinguish between accepting the null hypothesis and failing to\nreject it. Recent efforts have focused on developing alternatives to NHST to\naddress these issues. Despite these efforts, conventional NHST remains dominant\nin scientific research due to its simplicity and perceived ease of\ninterpretation. Our work presents a novel alternative to NHST that is just as\naccessible and intuitive: REACT. 
It not only tackles the shortcomings of NHST\nbut also offers additional advantages over existing alternatives. For instance,\nREACT is easily applicable to multiparametric hypotheses and does not require\nstringent significance-level corrections when conducting multiple tests. We\nillustrate the practical utility of REACT through real-world data examples,\nusing criteria aligned with common research practices to distinguish between\nthe absence of evidence and evidence of absence."}, "http://arxiv.org/abs/2309.05482": {"title": "A conformal test of linear models via permutation-augmented regressions", "link": "http://arxiv.org/abs/2309.05482", "description": "Permutation tests are widely recognized as robust alternatives to tests based\non normal theory. Random permutation tests have been frequently employed to\nassess the significance of variables in linear models. Despite their widespread\nuse, existing random permutation tests lack finite-sample and assumption-free\nguarantees for controlling type I error in partial correlation tests. To\naddress this ongoing challenge, we have developed a conformal test through\npermutation-augmented regressions, which we refer to as PALMRT. PALMRT not only\nachieves power competitive with conventional methods but also provides reliable\ncontrol of type I errors at no more than $2\\alpha$, given any targeted level\n$\\alpha$, for arbitrary fixed designs and error distributions. We have\nconfirmed this through extensive simulations.\n\nCompared to the cyclic permutation test (CPT) and residual permutation test\n(RPT), which also offer theoretical guarantees, PALMRT does not compromise as\nmuch on power or set stringent requirements on the sample size, making it\nsuitable for diverse biomedical applications. We further illustrate the\ndifferences in a long-Covid study where PALMRT validated key findings\npreviously identified using the t-test after multiple corrections, while both\nCPT and RPT suffered from a drastic loss of power and failed to identify any\ndiscoveries. We endorse PALMRT as a robust and practical hypothesis test in\nscientific research for its superior error control, power preservation, and\nsimplicity. An R package for PALMRT is available at\n\\url{https://github.com/LeyingGuan/PairedRegression}."}, "http://arxiv.org/abs/2309.13001": {"title": "Joint $p$-Values for Higher-Powered Bayesian Model Checking with Frequentist Guarantees", "link": "http://arxiv.org/abs/2309.13001", "description": "We introduce a joint posterior $p$-value, an extension of the posterior\npredictive $p$-value for multiple test statistics, designed to address\nlimitations of existing Bayesian $p$-values in the setting of continuous model\nexpansion. In particular, we show that the posterior predictive $p$-value, as\nwell as its sampled variant, become more conservative as the parameter\ndimension grows, and we demonstrate the ability of the joint $p$-value to\novercome this problem in cases where we can select test statistics that are\nnegatively associated under the posterior. We validate these conclusions with a\npair of simulation examples in which the joint $p$-value achieves substantial\ngains to power with only a modest increase in computational cost."}, "http://arxiv.org/abs/2312.07610": {"title": "Interpretational errors in statistical causal inference", "link": "http://arxiv.org/abs/2312.07610", "description": "We formalize an interpretational error that is common in statistical causal\ninference, termed identity slippage. 
This formalism is used to describe\nhistorically-recognized fallacies, and analyse a fast-growing literature in\nstatistics and applied fields. We conducted a systematic review of natural\nlanguage claims in the literature on stochastic mediation parameters, and\ndocumented extensive evidence of identity slippage in applications. This\nframework for error detection is applicable whenever policy decisions depend on\nthe accurate interpretation of statistical results, which is nearly always the\ncase. Therefore, broad awareness of identity slippage will aid statisticians in\nthe successful translation of data into public good."}, "http://arxiv.org/abs/2312.07616": {"title": "Evaluating the Alignment of a Data Analysis between Analyst and Audience", "link": "http://arxiv.org/abs/2312.07616", "description": "A challenge that data analysts face is building a data analysis that is\nuseful for a given consumer. Previously, we defined a set of principles for\ndescribing data analyses that can be used to create a data analysis and to\ncharacterize the variation between analyses. Here, we introduce a concept that\nwe call the alignment of a data analysis between the data analyst and a\nconsumer. We define a successfully aligned data analysis as the matching of\nprinciples between the analyst and the consumer for whom the analysis is\ndeveloped. In this paper, we propose a statistical model for evaluating the\nalignment of a data analysis and describe some of its properties. We argue that\nthis framework provides a language for characterizing alignment and can be used\nas a guide for practicing data scientists and students in data science courses\nfor how to build better data analyses."}, "http://arxiv.org/abs/2312.07619": {"title": "Estimating Causal Impacts of Scaling a Voluntary Policy Intervention", "link": "http://arxiv.org/abs/2312.07619", "description": "Evaluations often inform future program implementation decisions. However,\nthe implementation context may differ, sometimes substantially, from the\nevaluation study context. This difference leads to uncertainty regarding the\nrelevance of evaluation findings to future decisions. Voluntary interventions\npose another challenge to generalizability, as we do not know precisely who\nwill volunteer for the intervention in the future. We present a novel approach\nfor estimating target population average treatment effects among the treated by\ngeneralizing results from an observational study to projected volunteers within\nthe target population (the treated group). Our estimation approach can\naccommodate flexible outcome regression estimators such as Bayesian Additive\nRegression Trees (BART) and Bayesian Causal Forests (BCF). Our generalizability\napproach incorporates uncertainty regarding target population treatment status\ninto the posterior credible intervals to better reflect the uncertainty of\nscaling a voluntary intervention. In a simulation based on real data, we\ndemonstrate that these flexible estimators (BCF and BART) improve performance\nover estimators that rely on parametric regressions. 
We use our approach to\nestimate impacts of scaling up Comprehensive Primary Care Plus, a health care\npayment model intended to improve quality and efficiency of primary care, and\nwe demonstrate the promise of scaling to a targeted subgroup of practices."}, "http://arxiv.org/abs/2312.07697": {"title": "A Class of Computational Methods to Reduce Selection Bias when Designing Phase 3 Clinical Trials", "link": "http://arxiv.org/abs/2312.07697", "description": "When designing confirmatory Phase 3 studies, one usually evaluates one or\nmore efficacious and safe treatment option(s) based on data from previous\nstudies. However, several retrospective research articles reported the\nphenomenon of ``diminished treatment effect in Phase 3'' based on many case\nstudies. Even under basic assumptions, it was shown that the commonly used\nestimator could substantially overestimate the efficacy of selected group(s).\nAs alternatives, we propose a class of computational methods to reduce\nestimation bias and mean squared error (MSE) with a broader scope of multiple\ntreatment groups and flexibility to accommodate summary results by group as\ninput. Based on simulation studies and a real data example, we provide\npractical implementation guidance for this class of methods under different\nscenarios. For more complicated problems, our framework can serve as a starting\npoint with additional layers built in. Proposed methods can also be widely\napplied to other selection problems."}, "http://arxiv.org/abs/2312.07704": {"title": "Distribution of the elemental regression weights with t-distributed co-variate measurement errors", "link": "http://arxiv.org/abs/2312.07704", "description": "In this article, a heuristic approach is used to determine the best\napproximate distribution of $\\dfrac{Y_1}{Y_1 + Y_2}$, given that $Y_1,Y_2$ are\nindependent, and each of $Y_1$ and $Y_2$ is distributed as the\n$\\mathcal{F}$-distribution with common denominator degrees of freedom. The\nproposed approximate distribution is subject to graphical comparisons and\ndistributional tests. The proposed distribution is used to derive the\ndistribution of the elemental regression weight $\\omega_E$, where $E$ is the\nelemental regression set."}, "http://arxiv.org/abs/2312.07727": {"title": "Two-sample inference for sparse functional data", "link": "http://arxiv.org/abs/2312.07727", "description": "We propose a novel test procedure for comparing mean functions across two\ngroups within the reproducing kernel Hilbert space (RKHS) framework. Our\nproposed method is adept at handling sparsely and irregularly sampled\nfunctional data when observation times are random for each subject.\nConventional approaches, which are built upon functional principal components\nanalysis, usually assume a homogeneous covariance structure across groups.\nNonetheless, justifying this assumption in real-world scenarios can be\nchallenging. To eliminate the need for a homogeneous covariance structure, we\nfirst develop the functional Bahadur representation for the mean estimator\nunder the RKHS framework; this representation naturally leads to the desirable\npointwise limiting distributions. Moreover, we establish weak convergence for\nthe mean estimator, allowing us to construct a test statistic for the mean\ndifference. Our method is easily implementable and outperforms some\nconventional tests in controlling type I errors across various settings. 
We\ndemonstrate the finite sample performance of our approach through extensive\nsimulations and two real-world applications."}, "http://arxiv.org/abs/2312.07741": {"title": "Robust Functional Principal Component Analysis for Non-Euclidean Random Objects", "link": "http://arxiv.org/abs/2312.07741", "description": "Functional data analysis offers a diverse toolkit of statistical methods\ntailored for analyzing samples of real-valued random functions. Recently,\nsamples of time-varying random objects, such as time-varying networks, have\nbeen increasingly encountered in modern data analysis. These data structures\nrepresent elements within general metric spaces that lack local or global\nlinear structures, rendering traditional functional data analysis methods\ninapplicable. Moreover, the existing methodology for time-varying random\nobjects does not work well in the presence of outlying objects. In this paper,\nwe propose a robust method for analysing time-varying random objects. Our\nmethod employs pointwise Fr\\'{e}chet medians and then constructs pointwise\ndistance trajectories between the individual time courses and the sample\nFr\\'{e}chet medians. This representation effectively transforms time-varying\nobjects into functional data. A novel robust approach to functional principal\ncomponent analysis based on a Winsorized U-statistic estimator of the\ncovariance structure is introduced. The proposed robust analysis of these\ndistance trajectories is able to identify key features of time-varying objects\nand is useful for downstream analysis. To illustrate the efficacy of our\napproach, numerical studies focusing on dynamic networks are conducted. The\nresults indicate that the proposed method exhibits good all-round performance\nand surpasses the existing approach in terms of robustness, showcasing its\nsuperior performance in handling time-varying objects data."}, "http://arxiv.org/abs/2312.07775": {"title": "On the construction of stationary processes and random fields", "link": "http://arxiv.org/abs/2312.07775", "description": "We propose a new method to construct a stationary process and random field\nwith a given convex, decreasing covariance function and any one-dimensional\nmarginal distribution. The result is a new class of stationary processes and\nrandom fields. The construction method provides a simple, unified approach for\na wide range of covariance functions and any one-dimensional marginal\ndistributions, and it allows a new way to model dependence structures in a\nstationary process/random field as its dependence structure is induced by the\ncorrelation structure of a few disjoint sets in the support set of the marginal\ndistribution."}, "http://arxiv.org/abs/2312.07792": {"title": "Differentially private projection-depth-based medians", "link": "http://arxiv.org/abs/2312.07792", "description": "We develop $(\\epsilon,\\delta)$-differentially private projection-depth-based\nmedians using the propose-test-release (PTR) and exponential mechanisms. Under\ngeneral conditions on the input parameters and the population measure, (e.g. we\ndo not assume any moment bounds), we quantify the probability the test in PTR\nfails, as well as the cost of privacy via finite sample deviation bounds. We\ndemonstrate our main result on the canonical projection-depth-based median. 
In\nthe Gaussian setting, we show that the resulting deviation bound matches the\nknown lower bound for private Gaussian mean estimation, up to a polynomial\nfunction of the condition number of the covariance matrix. In the Cauchy\nsetting, we show that the ``outlier error amplification'' effect resulting from\nthe heavy tails outweighs the cost of privacy. This result is then verified via\nnumerical simulations. Additionally, we present results on general PTR\nmechanisms and a uniform concentration result on the projected spacings of\norder statistics."}, "http://arxiv.org/abs/2312.07829": {"title": "How to Select Covariates for Imputation-Based Regression Calibration Method -- A Causal Perspective", "link": "http://arxiv.org/abs/2312.07829", "description": "In this paper, we identify the criteria for the selection of the minimal and\nmost efficient covariate adjustment sets for the regression calibration method\ndeveloped by Carroll, Rupert and Stefanski (CRS, 1992), used to correct bias\ndue to continuous exposure measurement error. We utilize directed acyclic\ngraphs to illustrate how subject matter knowledge can aid in the selection of\nsuch adjustment sets. Valid measurement error correction requires the\ncollection of data on any (1) common causes of true exposure and outcome and\n(2) common causes of measurement error and outcome, in both the main study and\nvalidation study. For the CRS regression calibration method to be valid,\nresearchers need to minimally adjust for covariate set (1) in both the\nmeasurement error model (MEM) and the outcome model and adjust for covariate\nset (2) at least in the MEM. In practice, we recommend including the minimal\ncovariate adjustment set in both the MEM and the outcome model. In contrast\nwith the regression calibration method developed by Rosner, Spiegelman and\nWillet, it is valid and more efficient to adjust for correlates of the true\nexposure or of measurement error that are not risk factors in the MEM only\nunder CRS method. We applied the proposed covariate selection approach to the\nHealth Professional Follow-up Study, examining the effect of fiber intake on\ncardiovascular incidence. In this study, we demonstrated potential issues with\na data-driven approach to building the MEM that is agnostic to the structural\nassumptions. We extend the originally proposed estimators to settings where\neffect modification by a covariate is allowed. Finally, we caution against the\nuse of the regression calibration method to calibrate the true nutrition intake\nusing biomarkers."}, "http://arxiv.org/abs/2312.07873": {"title": "Causal Integration of Multiple Cancer Cohorts with High-Dimensional Confounders: Bayesian Propensity Score Estimation", "link": "http://arxiv.org/abs/2312.07873", "description": "Comparative meta-analyses of patient groups by integrating multiple\nobservational studies rely on estimated propensity scores (PSs) to mitigate\nconfounder imbalances. However, PS estimation grapples with the theoretical and\npractical challenges posed by high-dimensional confounders. Motivated by an\nintegrative analysis of breast cancer patients across seven medical centers,\nthis paper tackles the challenges associated with integrating multiple\nobservational datasets and offering nationally interpretable results. The\nproposed inferential technique, called Bayesian Motif Submatrices for\nConfounders (B-MSMC), addresses the curse of dimensionality by a hybrid of\nBayesian and frequentist approaches. 
B-MSMC uses nonparametric Bayesian\n``Chinese restaurant\" processes to eliminate redundancy in the high-dimensional\nconfounders and discover latent motifs or lower-dimensional structure. With\nthese motifs as potential predictors, standard regression techniques can be\nutilized to accurately infer the PSs and facilitate causal group comparisons.\nSimulations and meta-analysis of the motivating cancer investigation\ndemonstrate the efficacy of our proposal in high-dimensional causal inference\nby integrating multiple observational studies; using different weighting\nmethods, we apply the B-MSMC approach to efficiently address confounding when\nintegrating observational health studies with high-dimensional confounders."}, "http://arxiv.org/abs/2312.07882": {"title": "A non-parametric approach for estimating consumer valuation distributions using second price auctions", "link": "http://arxiv.org/abs/2312.07882", "description": "We focus on online second price auctions, where bids are made sequentially,\nand the winning bidder pays the maximum of the second-highest bid and a seller\nspecified reserve price. For many such auctions, the seller does not see all\nthe bids or the total number of bidders accessing the auction, and only\nobserves the current selling prices throughout the course of the auction. We\ndevelop a novel non-parametric approach to estimate the underlying consumer\nvaluation distribution based on this data. Previous non-parametric approaches\nin the literature only use the final selling price and assume knowledge of the\ntotal number of bidders. The resulting estimate, in particular, can be used by\nthe seller to compute the optimal profit-maximizing price for the product. Our\napproach is free of tuning parameters, and we demonstrate its computational and\nstatistical efficiency in a variety of simulation settings, and also on an Xbox\n7-day auction dataset on eBay."}, "http://arxiv.org/abs/2312.08040": {"title": "Markov's Equality and Post-hoc (Anytime) Valid Inference", "link": "http://arxiv.org/abs/2312.08040", "description": "We present Markov's equality: a tight version of Markov's inequality that\ndoes not impose further assumptions on the random variable. We show that\nthis equality, as well as Markov's inequality and its randomized improvement,\nare directly implied by a set of deterministic inequalities. We apply Markov's\nequality to show that standard tests based on $e$-values and $e$-processes are\npost-hoc (anytime) valid: the tests remain valid, even if the level $\\alpha$ is\nselected after observing the data. In fact, we show that this property\ncharacterizes $e$-values and $e$-processes."}, "http://arxiv.org/abs/2312.08150": {"title": "Active learning with biased non-response to label requests", "link": "http://arxiv.org/abs/2312.08150", "description": "Active learning can improve the efficiency of training prediction models by\nidentifying the most informative new labels to acquire. However, non-response\nto label requests can impact active learning's effectiveness in real-world\ncontexts. We conceptualise this degradation by considering the type of\nnon-response present in the data, demonstrating that biased non-response is\nparticularly detrimental to model performance. We argue that this sort of\nnon-response is particularly likely in contexts where the labelling process, by\nnature, relies on user interactions. 
To mitigate the impact of biased\nnon-response, we propose a cost-based correction to the sampling strategy--the\nUpper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly,\nbe applied to any active learning algorithm. Through experiments, we\ndemonstrate that our method successfully reduces the harm from labelling\nnon-response in many settings. However, we also characterise settings where the\nnon-response bias in the annotations remains detrimental under UCB-EU for\nparticular sampling methods and data generating processes. Finally, we evaluate\nour method on a real-world dataset from e-commerce platform Taobao. We show\nthat UCB-EU yields substantial performance improvements to conversion models\nthat are trained on clicked impressions. Most generally, this research serves\nto both better conceptualise the interplay between types of non-response and\nmodel improvements via active learning, and to provide a practical, easy to\nimplement correction that helps mitigate model degradation."}, "http://arxiv.org/abs/2312.08169": {"title": "Efficiency of Multivariate Tests in Trials in Progressive Supranuclear Palsy", "link": "http://arxiv.org/abs/2312.08169", "description": "Measuring disease progression in clinical trials for testing novel treatments\nfor multifaceted diseases such as Progressive Supranuclear Palsy (PSP) remains\nchallenging. In this study, we assess a range of statistical approaches to\ncompare outcomes measured by the items of the Progressive Supranuclear Palsy\nRating Scale (PSPRS). We consider several statistical approaches, including sum\nscores (as in an FDA-recommended version of the PSPRS), multivariate tests, and\nanalysis approaches based on multiple comparisons of the individual items. We\npropose two novel approaches which measure disease status based on Item\nResponse Theory models. We assess the performance of these tests in an\nextensive simulation study and illustrate their use with a re-analysis of the\nABBV-8E12 clinical trial. Furthermore, we discuss the impact of the\nFDA-recommended scoring of item scores on the power of the statistical tests.\nWe find that classical approaches such as the PSPRS sum score demonstrate moderate\nto high power when treatment effects are consistent across the individual\nitems. The tests based on Item Response Theory models yield the highest power\nwhen the simulated data are generated from an IRT model. The multiple testing\nbased approaches have a higher power in settings where the treatment effect is\nlimited to certain domains or items. The FDA-recommended item rescoring tends\nto decrease the simulated power. The study shows that there is no\none-size-fits-all testing procedure for evaluating treatment effects using\nPSPRS items; the optimal method varies based on the specific effect size\npatterns. The efficiency of the PSPRS sum score, while generally robust and\nstraightforward to apply, varies depending on the patterns of effect sizes\nencountered, and more powerful alternatives are available in specific settings.\nThese findings can have important implications for the design of future\nclinical trials in PSP."}, "http://arxiv.org/abs/2011.06663": {"title": "Patient Recruitment Using Electronic Health Records Under Selection Bias: a Two-phase Sampling Framework", "link": "http://arxiv.org/abs/2011.06663", "description": "Electronic health records (EHRs) are increasingly recognized as a\ncost-effective resource for patient recruitment in clinical research. 
However,\nhow to optimally select a cohort from millions of individuals to answer a\nscientific question of interest remains unclear. Consider a study to estimate\nthe mean or mean difference of an expensive outcome. Inexpensive auxiliary\ncovariates predictive of the outcome may often be available in patients' health\nrecords, presenting an opportunity to recruit patients selectively which may\nimprove efficiency in downstream analyses. In this paper, we propose a\ntwo-phase sampling design that leverages available information on auxiliary\ncovariates in EHR data. A key challenge in using EHR data for multi-phase\nsampling is the potential selection bias, because EHR data are not necessarily\nrepresentative of the target population. Extending existing literature on\ntwo-phase sampling design, we derive an optimal two-phase sampling method that\nimproves efficiency over random sampling while accounting for the potential\nselection bias in EHR data. We demonstrate the efficiency gain from our\nsampling design via simulation studies and an application to evaluating the\nprevalence of hypertension among US adults leveraging data from the Michigan\nGenomics Initiative, a longitudinal biorepository in Michigan Medicine."}, "http://arxiv.org/abs/2201.05102": {"title": "Space-time extremes of severe US thunderstorm environments", "link": "http://arxiv.org/abs/2201.05102", "description": "Severe thunderstorms cause substantial economic and human losses in the\nUnited States. Simultaneous high values of convective available potential\nenergy (CAPE) and storm relative helicity (SRH) are favorable to severe\nweather, and both they and the composite variable\n$\\mathrm{PROD}=\\sqrt{\\mathrm{CAPE}} \\times \\mathrm{SRH}$ can be used as\nindicators of severe thunderstorm activity. Their extremal spatial dependence\nexhibits temporal non-stationarity due to seasonality and large-scale\natmospheric signals such as El Ni\\~no-Southern Oscillation (ENSO). In order to\ninvestigate this, we introduce a space-time model based on a max-stable,\nBrown--Resnick, field whose range depends on ENSO and on time through a tensor\nproduct spline. We also propose a max-stability test based on empirical\nlikelihood and the bootstrap. The marginal and dependence parameters must be\nestimated separately owing to the complexity of the model, and we develop a\nbootstrap-based model selection criterion that accounts for the marginal\nuncertainty when choosing the dependence model. In the case study, the\nout-sample performance of our model is good. We find that extremes of PROD,\nCAPE and SRH are generally more localized in summer and, in some regions, less\nlocalized during El Ni\\~no and La Ni\\~na events, and give meteorological\ninterpretations of these phenomena."}, "http://arxiv.org/abs/2202.07277": {"title": "Exploiting deterministic algorithms to perform global sensitivity analysis for continuous-time Markov chain compartmental models with application to epidemiology", "link": "http://arxiv.org/abs/2202.07277", "description": "In this paper, we propose a generic approach to perform global sensitivity\nanalysis (GSA) for compartmental models based on continuous-time Markov chains\n(CTMC). This approach enables a complete GSA for epidemic models, in which not\nonly the effects of uncertain parameters such as epidemic parameters\n(transmission rate, mean sojourn duration in compartments) are quantified, but\nalso those of intrinsic randomness and interactions between the two. 
The main\nstep in our approach is to build a deterministic representation of the\nunderlying continuous-time Markov chain by controlling the latent variables\nmodeling intrinsic randomness. Then, model output can be written as a\ndeterministic function of both uncertain parameters and controlled latent\nvariables, so that it becomes possible to compute standard variance-based\nsensitivity indices, e.g. the so-called Sobol' indices. However, different\nsimulation algorithms lead to different representations. We exhibit in this\nwork three different representations for CTMC stochastic compartmental models\nand discuss the results obtained by implementing and comparing GSAs based on\neach of these representations on a SARS-CoV-2 epidemic model."}, "http://arxiv.org/abs/2210.08589": {"title": "Anytime-Valid Linear Models and Regression Adjusted Causal Inference in Randomized Experiments", "link": "http://arxiv.org/abs/2210.08589", "description": "Linear regression adjustment is commonly used to analyse randomised\ncontrolled experiments due to its efficiency and robustness against model\nmisspecification. Current testing and interval estimation procedures leverage\nthe asymptotic distribution of such estimators to provide Type-I error and\ncoverage guarantees that hold only at a single sample size. Here, we develop\nthe theory for the anytime-valid analogues of such procedures, enabling linear\nregression adjustment in the sequential analysis of randomised experiments. We\nfirst provide sequential $F$-tests and confidence sequences for the parametric\nlinear model, which provide time-uniform Type-I error and coverage guarantees\nthat hold for all sample sizes. We then relax all linear model parametric\nassumptions in randomised designs and provide nonparametric model-free\nsequential tests and confidence sequences for treatment effects. This formally\nallows experiments to be continuously monitored for significance and stopped\nearly, and safeguards against statistical malpractices in data collection. A\nparticular feature of our results is their simplicity. Our test statistics and\nconfidence sequences all emit closed-form expressions, which are functions of\nstatistics directly available from a standard linear regression table. We\nillustrate our methodology with the sequential analysis of software A/B\nexperiments at Netflix, performing regression adjustment with pre-treatment\noutcomes."}, "http://arxiv.org/abs/2212.02505": {"title": "Shared Differential Clustering across Single-cell RNA Sequencing Datasets with the Hierarchical Dirichlet Process", "link": "http://arxiv.org/abs/2212.02505", "description": "Single-cell RNA sequencing (scRNA-seq) is a powerful technology that allows\nresearchers to understand gene expression patterns at the single-cell level.\nHowever, analysing scRNA-seq data is challenging due to issues and biases in\ndata collection. In this work, we construct an integrated Bayesian model that\nsimultaneously addresses normalization, imputation and batch effects and also\nnonparametrically clusters cells into groups across multiple datasets. A Gibbs\nsampler based on a finite-dimensional approximation of the HDP is developed for\nposterior inference."}, "http://arxiv.org/abs/2305.14118": {"title": "Notes on Causation, Comparison, and Regression", "link": "http://arxiv.org/abs/2305.14118", "description": "Comparison and contrast are the basic means to unveil causation and learn\nwhich treatments work. 
To build good comparison groups, randomized\nexperimentation is key, yet often infeasible. In such non-experimental\nsettings, we illustrate and discuss diagnostics to assess how well the common\nlinear regression approach to causal inference approximates desirable features\nof randomized experiments, such as covariate balance, study representativeness,\ninterpolated estimation, and unweighted analyses. We also discuss alternative\nregression modeling, weighting, and matching approaches and argue they should\nbe given strong consideration in empirical work."}, "http://arxiv.org/abs/2312.08391": {"title": "Performance of capture-recapture population size estimators under covariate information", "link": "http://arxiv.org/abs/2312.08391", "description": "Capture-recapture methods for estimating the total size of elusive\npopulations are widely-used, however, due to the choice of estimator impacting\nupon the results and conclusions made, the question of performance of each\nestimator is raised. Motivated by an application of the estimators which allow\ncovariate information to meta-analytic data focused on the prevalence rate of\ncompleted suicide after bariatric surgery, where studies with no completed\nsuicides did not occur, this paper explores the performance of the estimators\nthrough use of a simulation study. The simulation study addresses the\nperformance of the Horvitz-Thompson, generalised Chao and generalised Zelterman\nestimators, in addition to performance of the analytical approach to variance\ncomputation. Given that the estimators vary in their dependence on\ndistributional assumptions, additional simulations are utilised to address the\nquestion of the impact outliers have on performance and inference."}, "http://arxiv.org/abs/2312.08530": {"title": "Using Model-Assisted Calibration Methods to Improve Efficiency of Regression Analyses with Two-Phase Samples under Complex Survey Designs", "link": "http://arxiv.org/abs/2312.08530", "description": "Two-phase sampling designs are frequently employed in epidemiological studies\nand large-scale health surveys. In such designs, certain variables are\nexclusively collected within a second-phase random subsample of the initial\nfirst-phase sample, often due to factors such as high costs, response burden,\nor constraints on data collection or measurement assessment. Consequently,\nsecond-phase sample estimators can be inefficient due to the diminished sample\nsize. Model-assisted calibration methods have been widely used to improve the\nefficiency of second-phase estimators. However, none of the existing methods\nhave considered the complexities arising from the intricate sample designs\npresent in both first- and second-phase samples in regression analyses. This\npaper proposes to calibrate the sample weights for the second-phase subsample\nto the weighted first-phase sample based on influence functions of regression\ncoefficients for a prediction of the covariate of interest, which can be\ncomputed for the entire first-phase sample. We establish the consistency of the\nproposed calibration estimation and provide variance estimation. Empirical\nevidence underscores the robustness of calibration on influence functions\ncompared to the imputation method, which can be sensitive to misspecified\nprediction models for the variable only collected in the second phase. 
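A minimal numerical sketch of the calibration idea in the two-phase entry above: a generic GREG-type (chi-square distance) calibration of second-phase weights to first-phase totals, with plain auxiliary covariates standing in for the influence-function values. The function name and toy data are hypothetical; this is not the paper's estimator.

```python
import numpy as np

def greg_calibrate(d, X, totals):
    """Chi-square (GREG-type) calibration: rescale design weights d so that the
    weighted totals of the auxiliary matrix X match the target vector `totals`.
    Closed form: w = d * (1 + X @ lam), lam = (X' diag(d) X)^{-1} (totals - X' d)."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    A = X.T @ (d[:, None] * X)
    lam = np.linalg.solve(A, totals - X.T @ d)
    return d * (1.0 + X @ lam)

# Toy two-phase setting: auxiliaries are observed for the full first-phase
# sample; the second-phase subsample's weights are calibrated so that the
# weighted auxiliary totals reproduce the first-phase totals.
rng = np.random.default_rng(0)
n1, n2 = 2000, 300
X1 = rng.normal(size=(n1, 3))                  # first-phase auxiliary variables
idx = rng.choice(n1, size=n2, replace=False)   # second-phase subsample
d = np.full(n2, n1 / n2)                       # second-phase design weights
w = greg_calibrate(d, X1[idx], totals=X1.sum(axis=0))
print(np.allclose(w @ X1[idx], X1.sum(axis=0)))  # calibration constraints hold
```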
Examples\nusing data from the National Health and Nutrition Examination Survey are\nprovided."}, "http://arxiv.org/abs/2312.08570": {"title": "(Re-)reading Sklar (1959) -- A personal view on Sklar's theorem", "link": "http://arxiv.org/abs/2312.08570", "description": "Some personal thoughts on Sklar's theorem after reading the original paper\n(Sklar, 1959) in French."}, "http://arxiv.org/abs/2312.08587": {"title": "Bayesian Tensor Modeling for Image-based Classification of Alzheimer's Disease", "link": "http://arxiv.org/abs/2312.08587", "description": "Tensor-based representations are being increasingly used to represent complex\ndata types such as imaging data, due to their appealing properties such as\ndimension reduction and the preservation of spatial information. Recently,\nthere is a growing literature on using Bayesian scalar-on-tensor regression\ntechniques that use tensor-based representations for high-dimensional and\nspatially distributed covariates to predict continuous outcomes. However,\nsurprisingly, there is limited development on corresponding Bayesian\nclassification methods relying on tensor-valued covariates. Standard approaches\nthat vectorize the image are not desirable due to the loss of spatial\nstructure, and alternate methods that use extracted features from the image in\nthe predictive model may suffer from information loss. We propose a novel data\naugmentation-based Bayesian classification approach relying on tensor-valued\ncovariates, with a focus on imaging predictors. We propose two data\naugmentation schemes, one resulting in a support vector machine (SVM)\nclassifier, and another yielding a logistic regression classifier. While both\ntypes of classifiers have been proposed independently in the literature, our\ncontribution is to extend such existing methodology to accommodate\nhigh-dimensional tensor-valued predictors that involve low-rank decompositions\nof the coefficient matrix while preserving the spatial information in the\nimage. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for\nimplementing these methods. Simulation studies show significant improvements in\nclassification accuracy and parameter estimation compared to routinely used\nclassification methods. We further illustrate our method in a neuroimaging\napplication using cortical thickness MRI data from Alzheimer's Disease\nNeuroimaging Initiative, with results displaying better classification accuracy\nthroughout several classification tasks."}, "http://arxiv.org/abs/2312.08670": {"title": "Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation", "link": "http://arxiv.org/abs/2312.08670", "description": "In the field of intracity freight transportation, changes in order volume are\nsignificantly influenced by temporal and spatial factors. When building subsidy\nand pricing strategies, predicting the causal effects of these strategies on\norder volume is crucial. In the process of calculating causal effects,\nconfounding variables can have an impact. Traditional methods to control\nconfounding variables handle data from a holistic perspective, which cannot\nensure the precision of causal effects in specific temporal and spatial\ndimensions. However, temporal and spatial dimensions are extremely critical in\nthe logistics field, and this limitation may directly affect the precision of\nsubsidy and pricing strategies. To address these issues, this study proposes a\ntechnique based on flexible temporal-spatial grid partitioning. 
Furthermore,\nbased on the flexible grid partitioning technique, we further propose a\ncontinuous entropy balancing method in the temporal-spatial domain, which is named\nTS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continuous Treatments).\nThe method proposed in this paper has been tested on two simulation datasets\nand two real datasets, achieving excellent performance on all of them. In\nfact, after applying the TS-EBCT method to the intracity freight transportation\nfield, the prediction accuracy of the causal effect has been significantly\nimproved. It brings good business benefits to the company's subsidy and pricing\nstrategies."}, "http://arxiv.org/abs/2312.08838": {"title": "Bayesian Fused Lasso Modeling for Binary Data", "link": "http://arxiv.org/abs/2312.08838", "description": "L1-norm regularized logistic regression models are widely used for analyzing\ndata with binary response. In those analyses, fusing regression coefficients is\nuseful for detecting groups of variables. This paper proposes a binomial\nlogistic regression model with Bayesian fused lasso. Assuming a Laplace prior\non regression coefficients and differences between adjacent regression\ncoefficients enables us to perform variable selection and variable fusion\nsimultaneously in the Bayesian framework. We also propose assuming a horseshoe\nprior on the differences to improve the flexibility of variable fusion. The\nGibbs sampler is derived to estimate the parameters by a hierarchical\nexpression of priors and a data-augmentation method. Using simulation studies\nand real data analysis, we compare the proposed methods with the existing\nmethod."}, "http://arxiv.org/abs/2206.00560": {"title": "Learning common structures in a collection of networks", "link": "http://arxiv.org/abs/2206.00560", "description": "Let a collection of networks represent interactions within several (social or\necological) systems. We pursue two objectives: identifying similarities in the\ntopological structures that are held in common between the networks and\nclustering the collection into sub-collections of structurally homogeneous\nnetworks. We tackle these two questions with a probabilistic model-based\napproach. We propose an extension of the Stochastic Block Model (SBM) adapted\nto the joint modeling of a collection of networks. The networks in the\ncollection are assumed to be independent realizations of SBMs. The common\nconnectivity structure is imposed through the equality of some parameters. The\nmodel parameters are estimated with a variational Expectation-Maximization (EM)\nalgorithm. We derive an ad-hoc penalized likelihood criterion to select the\nnumber of blocks and to assess the adequacy of the consensus found between the\nstructures of the different networks. This same criterion can also be used to\ncluster networks on the basis of their connectivity structure. It thus provides\na partition of the collection into subsets of structurally homogeneous\nnetworks. The relevance of our proposition is assessed on two collections of\necological networks. First, an application to three stream food webs reveals\nthe homogeneity of their structures and the correspondence between groups of\nspecies in different ecosystems playing equivalent ecological roles. Moreover,\nthe joint analysis allows a finer analysis of the structure of smaller\nnetworks. 
Second, we cluster 67 food webs according to their connectivity\nstructures and demonstrate that five mesoscale structures are sufficient to\ndescribe this collection."}, "http://arxiv.org/abs/2207.08933": {"title": "Change point detection in high dimensional data with U-statistics", "link": "http://arxiv.org/abs/2207.08933", "description": "We consider the problem of detecting distributional changes in a sequence of\nhigh dimensional data. Our approach combines two separate statistics stemming\nfrom $L_p$ norms whose behavior is similar under $H_0$ but potentially\ndifferent under $H_A$, leading to a testing procedure that is flexible\nagainst a variety of alternatives. We establish the asymptotic distribution of\nour proposed test statistics separately in cases of weakly dependent and\nstrongly dependent coordinates as $\\min\\{N,d\\}\\to\\infty$, where $N$ denotes\nsample size and $d$ is the dimension, and establish consistency of testing and\nestimation procedures in high dimensions under one-change alternative settings.\nComputational studies in single and multiple change point scenarios demonstrate\nour method can outperform other nonparametric approaches in the literature for\ncertain alternatives in high dimensions. We illustrate our approach through an\napplication to Twitter data concerning the mentions of U.S. Governors."}, "http://arxiv.org/abs/2303.17642": {"title": "Change Point Detection on A Separable Model for Dynamic Networks", "link": "http://arxiv.org/abs/2303.17642", "description": "This paper studies the change point detection problem in time series of\nnetworks, with the Separable Temporal Exponential-family Random Graph Model\n(STERGM). We consider a sequence of networks generated from a piecewise\nconstant distribution that is altered at unknown change points in time.\nDetection of the change points can identify the discrepancies in the underlying\ndata generating processes and facilitate downstream dynamic network analysis.\nMoreover, the STERGM that focuses on network statistics is a flexible model to\nfit dynamic networks with both dyadic and temporal dependence. We propose a new\nestimator derived from the Alternating Direction Method of Multipliers (ADMM)\nand the Group Fused Lasso to simultaneously detect multiple time points at which\nthe parameters of STERGM have changed. We also provide a Bayesian information\ncriterion for model selection to assist the detection. Our experiments show\ngood performance of the proposed method on both simulated and real data.\nLastly, we develop an R package CPDstergm to implement our method."}, "http://arxiv.org/abs/2310.01198": {"title": "Likelihood Based Inference for ARMA Models", "link": "http://arxiv.org/abs/2310.01198", "description": "Autoregressive moving average (ARMA) models are frequently used to analyze\ntime series data. Despite the popularity of these models, algorithms for\nfitting ARMA models have weaknesses that are not well known. We provide a\nsummary of parameter estimation via maximum likelihood and discuss common\npitfalls that may lead to sub-optimal parameter estimates. We propose a random\nrestart algorithm for parameter estimation that frequently yields higher\nlikelihoods than traditional maximum likelihood estimation procedures. We then\ninvestigate the parameter uncertainty of maximum likelihood estimates, and\npropose the use of profile confidence intervals as a superior alternative to\nintervals derived from the Fisher information matrix. 
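A minimal sketch of the random-restart idea from the ARMA entry just above, not the authors' implementation: the same ARMA(p, q) model is refit from several randomized starting values with statsmodels and the fit with the highest log-likelihood is kept. The assumed start_params layout (constant, AR coefficients, MA coefficients, innovation variance) follows the statsmodels ARIMA convention with trend="c".

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def fit_arma_random_restarts(y, p, q, n_restarts=20, seed=0):
    """Fit ARMA(p, q) from several randomized initializations and keep the
    fit with the highest log-likelihood. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Assumed start_params layout for trend="c": [const, ar..., ma..., sigma2]
        start = np.concatenate([
            [y.mean()],
            rng.uniform(-0.5, 0.5, size=p),   # AR coefficients
            rng.uniform(-0.5, 0.5, size=q),   # MA coefficients
            [y.var()],
        ])
        try:
            res = ARIMA(y, order=(p, 0, q), trend="c").fit(start_params=start)
        except Exception:
            continue  # skip initializations where the optimizer fails
        if best is None or res.llf > best.llf:
            best = res
    return best

# Usage on simulated ARMA(1,1) data
rng = np.random.default_rng(1)
e = rng.standard_normal(500)
y = np.empty(500)
y[0] = e[0]
for t in range(1, 500):
    y[t] = 0.6 * y[t - 1] + e[t] + 0.3 * e[t - 1]
fit = fit_arma_random_restarts(y, p=1, q=1)
print(fit.llf, fit.params)
```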
Through a series of\nsimulation studies, we demonstrate the efficacy of our proposed algorithm and\nthe improved nominal coverage of profile confidence intervals compared to the\nnormal approximation based on the Fisher information."}, "http://arxiv.org/abs/2312.09303": {"title": "A Physics Based Surrogate Model in Bayesian Uncertainty Quantification involving Elliptic PDEs", "link": "http://arxiv.org/abs/2312.09303", "description": "The paper addresses Bayesian inference in inverse problems with uncertainty\nquantification involving a computationally expensive forward map associated\nwith solving a partial differential equation. To mitigate the computational\ncost, the paper proposes a new surrogate model informed by the physics of the\nproblem, specifically when the forward map involves solving a linear elliptic\npartial differential equation. The study establishes the consistency of the\nposterior distribution for this surrogate model and demonstrates its\neffectiveness through numerical examples with synthetic data. The results\nindicate a substantial improvement in computational speed, reducing the\nprocessing time from several months with the exact forward map to a few\nminutes, while maintaining negligible loss of accuracy in the posterior\ndistribution."}, "http://arxiv.org/abs/2312.09356": {"title": "Sparsity meets correlation in Gaussian sequence model", "link": "http://arxiv.org/abs/2312.09356", "description": "We study estimation of an $s$-sparse signal in the $p$-dimensional Gaussian\nsequence model with equicorrelated observations and derive the minimax rate. A\nnew phenomenon emerges from correlation, namely the rate scales with respect to\n$p-2s$ and exhibits a phase transition at $p-2s \\asymp \\sqrt{p}$. Correlation\nis shown to be a blessing provided it is sufficiently strong, and the critical\ncorrelation level exhibits a delicate dependence on the sparsity level. Due to\ncorrelation, the minimax rate is driven by two subproblems: estimation of a\nlinear functional (the average of the signal) and estimation of the signal's\n$(p-1)$-dimensional projection onto the orthogonal subspace. The\nhigh-dimensional projection is estimated via sparse regression and the linear\nfunctional is cast as a robust location estimation problem. Existing robust\nestimators turn out to be suboptimal, and we show a kernel mode estimator with\na widening bandwidth exploits the Gaussian character of the data to achieve the\noptimal estimation rate."}, "http://arxiv.org/abs/2312.09422": {"title": "Joint Alignment of Multivariate Quasi-Periodic Functional Data Using Deep Learning", "link": "http://arxiv.org/abs/2312.09422", "description": "The joint alignment of multivariate functional data plays an important role\nin various fields such as signal processing, neuroscience and medicine,\nincluding the statistical analysis of data from wearable devices. Traditional\nmethods often ignore the phase variability and instead focus on the variability\nin the observed amplitude. We present a novel method for joint alignment of\nmultivariate quasi-periodic functions using deep neural networks, decomposing,\nbut retaining all the information in the data by preserving both phase and\namplitude variability. Our proposed neural network uses a special activation of\nthe output that builds on the unit simplex transformation, and we utilize a\nloss function based on the Fisher-Rao metric to train our model. 
Furthermore,\nour method is unsupervised and can provide an optimal common template function\nas well as subject-specific templates. We demonstrate our method on two\nsimulated datasets and one real example, comprising data from 12-lead 10s\nelectrocardiogram recordings."}, "http://arxiv.org/abs/2312.09604": {"title": "Inferring Causality from Time Series data based on Structural Causal Model and its application to Neural Connectomics", "link": "http://arxiv.org/abs/2312.09604", "description": "Inferring causation from time series data is of scientific interest in\ndifferent disciplines, particularly in neural connectomics. While different\napproaches exist in the literature with parametric modeling assumptions, we\nfocus on a non-parametric model for time series satisfying a Markovian\nstructural causal model with stationary distribution and without concurrent\neffects. We show that the model structure can be used to its advantage to\nobtain an elegant algorithm for causal inference from time series based on\nconditional dependence tests, coined Causal Inference in Time Series (CITS)\nalgorithm. We describe Pearson's partial correlation and Hilbert-Schmidt\ncriterion as candidates for such conditional dependence tests that can be used\nin CITS for the Gaussian and non-Gaussian settings, respectively. We prove the\nmathematical guarantee of the CITS algorithm in recovering the true causal\ngraph, under standard mixing conditions on the underlying time series. We also\nconduct a comparative evaluation of the performance of CITS with other existing\nmethodologies in simulated datasets. We then describe the utility of the\nmethodology in neural connectomics -- in inferring causal functional\nconnectivity from time series of neural activity, and demonstrate its\napplication to a real neurobiological dataset of electro-physiological\nrecordings from the mouse visual cortex recorded by Neuropixel probes."}, "http://arxiv.org/abs/2312.09607": {"title": "Variational excess risk bound for general state space models", "link": "http://arxiv.org/abs/2312.09607", "description": "In this paper, we consider variational autoencoders (VAE) for general state\nspace models. We consider a backward factorization of the variational\ndistributions to analyze the excess risk associated with VAE. Such backward\nfactorizations were recently proposed to perform online variational learning\nand to obtain upper bounds on the variational estimation error. When\nindependent trajectories of sequences are observed and under strong mixing\nassumptions on the state space model and on the variational distribution, we\nprovide an oracle inequality explicit in the number of samples and in the\nlength of the observation sequences. We then derive consequences of this\ntheoretical result. In particular, when the data distribution is given by a\nstate space model, we provide an upper bound for the Kullback-Leibler\ndivergence between the data distribution and its estimator and between the\nvariational posterior and the estimated state space posterior\ndistributions. Under classical assumptions, we prove that our results can be\napplied to Gaussian backward kernels built with dense and recurrent neural\nnetworks."}, "http://arxiv.org/abs/2312.09633": {"title": "Natural gradient Variational Bayes without matrix inversion", "link": "http://arxiv.org/abs/2312.09633", "description": "This paper presents an approach for efficiently approximating the inverse of\nFisher information, a key component in variational Bayes inference. 
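As generic background for the inverse-free iterative approximation mentioned in the entry above, the sketch below uses the classical Newton-Schulz iteration for a matrix inverse. This is a stand-in for illustration only; it is not the recursion proposed in the paper.

```python
import numpy as np

def newton_schulz_inverse(A, n_iter=30):
    """Newton-Schulz iteration X_{k+1} = X_k (2I - A X_k), which converges to
    A^{-1} when the initial residual I - A X_0 has spectral radius below one."""
    n = A.shape[0]
    # Classical initialization guaranteeing convergence for invertible A
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(n_iter):
        X = X @ (2 * I - A @ X)
    return X

# Toy symmetric positive-definite matrix standing in for a Fisher information
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
F = B @ B.T + 5 * np.eye(5)
print(np.max(np.abs(newton_schulz_inverse(F) @ F - np.eye(5))))  # close to 0
```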
A notable\naspect of our approach is the avoidance of analytically computing the Fisher\ninformation matrix and its explicit inversion. Instead, we introduce an\niterative procedure for generating a sequence of matrices that converge to the\ninverse of Fisher information. The natural gradient variational Bayes algorithm\nwithout matrix inversion is provably convergent and achieves a convergence rate\nof order $O(\\log s/s)$, with $s$ the number of iterations. We also obtain a central\nlimit theorem for the iterates. Our algorithm exhibits versatility, making it\napplicable across a diverse array of variational Bayes domains, including\nGaussian approximation and normalizing flow Variational Bayes. We offer a range\nof numerical examples to demonstrate the efficiency and reliability of the\nproposed variational Bayes method."}, "http://arxiv.org/abs/2312.09698": {"title": "Smoothing for age-period-cohort models: a comparison between splines and random process", "link": "http://arxiv.org/abs/2312.09698", "description": "Age-Period-Cohort (APC) models are widely used in the context of modelling\nhealth and demographic data to produce smooth estimates of each time trend.\nWhen smoothing in the context of APC models, there are two main schools:\nfrequentist, using penalised smoothing splines, and Bayesian, using random\nprocesses, with little crossover between them. In this article, we clearly lay\nout the theoretical link between the two schools, provide examples using\nsimulated and real data to highlight similarities and differences, and help a\ngeneral APC user understand potentially inaccessible theory from functional\nanalysis. As intuition suggests, both approaches lead to comparable and almost\nidentical in-sample predictions, but random processes within a Bayesian\napproach might be beneficial for out-of-sample prediction as the sources of\nuncertainty are captured in a more complete way."}, "http://arxiv.org/abs/2312.09758": {"title": "Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach", "link": "http://arxiv.org/abs/2312.09758", "description": "Invariant representation learning (IRL) encourages the prediction from\ninvariant causal features to labels de-confounded from the environments,\nadvancing the technical roadmap of out-of-distribution (OOD) generalization.\nDespite the attention it has received, recent theoretical results verified that some causal\nfeatures recovered by IRLs merely appear domain-invariant in the training\nenvironments but fail in unseen domains. This \\emph{fake invariance} severely\nendangers OOD generalization, since the seemingly trustworthy objective cannot be diagnosed\nand existing causal surgeries are invalid as a remedy. In this paper, we review\nan IRL family (InvRat) under the Partially and Fully Informative Invariant\nFeature Structural Causal Models (PIIF SCM/FIIF SCM) respectively, to certify\ntheir weaknesses in representing fake invariant features, and then unify their\ncausal diagrams to propose ReStructured SCM (RS-SCM). RS-SCM can ideally\nrebuild the spurious and the fake invariant features simultaneously. Given\nthis, we further develop an approach based on conditional mutual information\nwith respect to RS-SCM, then rigorously rectify the spurious and fake invariant\neffects. It can be easily implemented by a small feature selection subnet\nintroduced in the IRL family, which is alternately optimized to achieve our\ngoal. 
Experiments verified the superiority of our approach to fight against the\nfake invariant issue across a variety of OOD generalization benchmarks."}, "http://arxiv.org/abs/2312.09777": {"title": "Weyl formula and thermodynamics of geometric flow", "link": "http://arxiv.org/abs/2312.09777", "description": "We study the Weyl formula for the asymptotic number of eigenvalues of the\nLaplace-Beltrami operator with Dirichlet boundary condition on a Riemannian\nmanifold in the context of geometric flows. Assuming the eigenvalues to be the\nenergies of some associated statistical system, we show that geometric flows\nare directly related with the direction of increasing entropy chosen. For a\nclosed Riemannian manifold we obtain a volume preserving flow of geometry being\nequivalent to the increment of Gibbs entropy function derived from the spectrum\nof Laplace-Beltrami operator. Resemblance with Arnowitt, Deser, and Misner\n(ADM) formalism of gravity is also noted by considering open Riemannian\nmanifolds, directly equating the geometric flow parameter and the direction of\nincreasing entropy as time direction."}, "http://arxiv.org/abs/2312.09825": {"title": "Extreme value methods for estimating rare events in Utopia", "link": "http://arxiv.org/abs/2312.09825", "description": "To capture the extremal behaviour of complex environmental phenomena in\npractice, flexible techniques for modelling tail behaviour are required. In\nthis paper, we introduce a variety of such methods, which were used by the\nLancopula Utopiversity team to tackle the data challenge of the 2023 Extreme\nValue Analysis Conference. This data challenge was split into four sections,\nlabelled C1-C4. Challenges C1 and C2 comprise univariate problems, where the\ngoal is to estimate extreme quantiles for a non-stationary time series\nexhibiting several complex features. We propose a flexible modelling technique,\nbased on generalised additive models, with diagnostics indicating generally\ngood performance for the observed data. Challenges C3 and C4 concern\nmultivariate problems where the focus is on estimating joint extremal\nprobabilities. For challenge C3, we propose an extension of available models in\nthe multivariate literature and use this framework to estimate extreme\nprobabilities in the presence of non-stationary dependence. Finally, for\nchallenge C4, which concerns a 50 dimensional random vector, we employ a\nclustering technique to achieve dimension reduction and use a conditional\nmodelling approach to estimate extremal probabilities across independent groups\nof variables."}, "http://arxiv.org/abs/2312.09862": {"title": "Wasserstein-based Minimax Estimation of Dependence in Multivariate Regularly Varying Extremes", "link": "http://arxiv.org/abs/2312.09862", "description": "We study minimax risk bounds for estimators of the spectral measure in\nmultivariate linear factor models, where observations are linear combinations\nof regularly varying latent factors. Non-asymptotic convergence rates are\nderived for the multivariate Peak-over-Threshold estimator in terms of the\n$p$-th order Wasserstein distance, and information-theoretic lower bounds for\nthe minimax risks are established. The convergence rate of the estimator is\nshown to be minimax optimal under a class of Pareto-type models analogous to\nthe standard class used in the setting of one-dimensional observations known as\nthe Hall-Welsh class. 
When the estimator is minimax inefficient, a novel\ntwo-step estimator is introduced and demonstrated to attain the minimax lower\nbound. Our analysis bridges the gaps in understanding trade-offs between\nestimation bias and variance in multivariate extreme value theory."}, "http://arxiv.org/abs/2312.09884": {"title": "Investigating the heterogeneity of \"study twins\"", "link": "http://arxiv.org/abs/2312.09884", "description": "Meta-analyses are commonly performed based on random-effects models, while in\ncertain cases one might also argue in favour of a common-effect model. One such\ncase may be given by the example of two \"study twins\" that are performed\naccording to a common (or at least very similar) protocol. Here we investigate\nthe particular case of meta-analysis of a pair of studies, e.g. summarizing the\nresults of two confirmatory clinical trials in phase III of a clinical\ndevelopment programme. Thereby, we focus on the question of to what extent\nhomogeneity or heterogeneity may be discernible, and include an empirical\ninvestigation of published (\"twin\") pairs of studies. A pair of estimates from\ntwo studies only provides very little evidence on homogeneity or heterogeneity\nof effects, and ad-hoc decision criteria may often be misleading."}, "http://arxiv.org/abs/2312.09900": {"title": "Integral Fractional Ornstein-Uhlenbeck Process Model for Animal Movement", "link": "http://arxiv.org/abs/2312.09900", "description": "Modeling the trajectories of animals is challenging due to the complexity of\ntheir behaviors, the influence of unpredictable environmental factors,\nindividual variability, and the lack of detailed data on their movements.\nAdditionally, factors such as migration, hunting, reproduction, and social\ninteractions add further layers of complexity when attempting to accurately\nforecast their movements. In the literature, various models exist that aim to\nstudy animal telemetry by modeling the velocity of the telemetry, the\ntelemetry itself or both processes jointly through a Markovian process. In this\nwork, we propose to model the velocity of each coordinate axis for animal\ntelemetry data as a fractional Ornstein-Uhlenbeck (fOU) process. Then, the\nintegral fOU process models position data in animal telemetry. Compared to\ntraditional methods, the proposed model is flexible in modeling long-range\nmemory. The Hurst parameter $H \\in (0,1)$ is a crucial parameter in the integral\nfOU process, as it determines the degree of dependence or long-range memory.\nThe integral fOU process is a nonstationary process. In addition, a higher Hurst\nparameter ($H > 0.5$) indicates a stronger memory, leading to trajectories with\ntransient trends, while a lower Hurst parameter ($H < 0.5$) implies a weaker\nmemory, resulting in trajectories with recurring trends. When $H = 0.5$, the\nprocess reduces to a standard integral Ornstein-Uhlenbeck process. We develop a\nfast simulation algorithm of telemetry trajectories using an approach via\nfinite-dimensional distributions. 
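A rough sketch of simulating one coordinate of an integral fOU movement model like the one described above: fractional Gaussian noise is generated naively through a Cholesky factor of its exact covariance, the velocity is advanced with an Euler step, and the position is obtained by integration. This is illustrative only, not the paper's fast finite-dimensional-distribution algorithm, and all parameter values are hypothetical.

```python
import numpy as np

def fgn(n, H, dt, rng):
    """Fractional Gaussian noise increments via the exact covariance (Cholesky);
    adequate for short paths, not a fast simulation scheme."""
    k = np.arange(n)
    g = 0.5 * (np.abs(k + 1) ** (2 * H) + np.abs(k - 1) ** (2 * H) - 2 * np.abs(k) ** (2 * H))
    cov = g[np.abs(k[:, None] - k[None, :])] * dt ** (2 * H)
    return np.linalg.cholesky(cov + 1e-12 * np.eye(n)) @ rng.standard_normal(n)

def integral_fou_path(n=500, dt=0.1, H=0.7, theta=1.0, mu=0.0, sigma=1.0, seed=0):
    """Velocity follows a fractional Ornstein-Uhlenbeck process (Euler scheme);
    the simulated coordinate is its cumulative integral."""
    rng = np.random.default_rng(seed)
    dB = fgn(n, H, dt, rng)
    v = np.zeros(n)
    for t in range(1, n):
        v[t] = v[t - 1] - theta * (v[t - 1] - mu) * dt + sigma * dB[t - 1]
    return np.cumsum(v) * dt

print(integral_fou_path()[:5])
```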
We also develop a maximum likelihood method\nfor parameter estimation and its performance is examined by simulation studies.\nFinally, we present a telemetry application of Fin Whales that disperse over\nthe Gulf of Mexico."}, "http://arxiv.org/abs/2312.10002": {"title": "On the Invertibility of Euler Integral Transforms with Hyperplanes and Quadric Hypersurfaces", "link": "http://arxiv.org/abs/2312.10002", "description": "The Euler characteristic transform (ECT) is an integral transform used widely\nin topological data analysis. Previous efforts by Curry et al. and Ghrist et\nal. have independently shown that the ECT is injective on all compact definable\nsets. In this work, we study the invertibility of the ECT on definable sets\nthat aren't necessarily compact, resulting in a complete classification of\nconstructible functions that the Euler characteristic transform is not\ninjective on. We then introduce the quadric Euler characteristic transform\n(QECT) as a natural generalization of the ECT by detecting definable shapes\nwith quadric hypersurfaces rather than hyperplanes. We also discuss some\ncriteria for the invertibility of QECT."}, "http://arxiv.org/abs/2108.01327": {"title": "Distributed Inference for Tail Risk", "link": "http://arxiv.org/abs/2108.01327", "description": "For measuring tail risk with scarce extreme events, extreme value analysis is\noften invoked as the statistical tool to extrapolate to the tail of a\ndistribution. The presence of large datasets benefits tail risk analysis by\nproviding more observations for conducting extreme value analysis. However,\nlarge datasets can be stored distributedly preventing the possibility of\ndirectly analyzing them. In this paper, we introduce a comprehensive set of\ntools for examining the asymptotic behavior of tail empirical and quantile\nprocesses in the setting where data is distributed across multiple sources, for\ninstance, when data are stored on multiple machines. Utilizing these tools, one\ncan establish the oracle property for most distributed estimators in extreme\nvalue statistics in a straightforward way. The main theoretical challenge\narises when the number of machines diverges to infinity. The number of machines\nresembles the role of dimensionality in high dimensional statistics. We provide\nvarious examples to demonstrate the practicality and value of our proposed\ntoolkit."}, "http://arxiv.org/abs/2206.04133": {"title": "Bayesian multivariate logistic regression for superiority and inferiority decision-making under observable treatment heterogeneity", "link": "http://arxiv.org/abs/2206.04133", "description": "The effects of treatments may differ between persons with different\ncharacteristics. Addressing such treatment heterogeneity is crucial to\ninvestigate whether patients with specific characteristics are likely to\nbenefit from a new treatment. The current paper presents a novel Bayesian\nmethod for superiority decision-making in the context of randomized controlled\ntrials with multivariate binary responses and heterogeneous treatment effects.\nThe framework is based on three elements: a) Bayesian multivariate logistic\nregression analysis with a P\\'olya-Gamma expansion; b) a transformation\nprocedure to transfer obtained regression coefficients to a more intuitive\nmultivariate probability scale (i.e., success probabilities and the differences\nbetween them); and c) a compatible decision procedure for treatment comparison\nwith prespecified decision error rates. 
Procedures for a priori sample size\nestimation under a non-informative prior distribution are included. A numerical\nevaluation demonstrated that decisions based on a priori sample size estimation\nresulted in anticipated error rates among the trial population as well as\nsubpopulations. Further, average and conditional treatment effect parameters\ncould be estimated unbiasedly when the sample was large enough. Illustration\nwith the International Stroke Trial dataset revealed a trend towards\nheterogeneous effects among stroke patients: Something that would have remained\nundetected when analyses were limited to average treatment effects."}, "http://arxiv.org/abs/2210.06448": {"title": "Debiased inference for a covariate-adjusted regression function", "link": "http://arxiv.org/abs/2210.06448", "description": "In this article, we study nonparametric inference for a covariate-adjusted\nregression function. This parameter captures the average association between a\ncontinuous exposure and an outcome after adjusting for other covariates. In\nparticular, under certain causal conditions, this parameter corresponds to the\naverage outcome had all units been assigned to a specific exposure level, known\nas the causal dose-response curve. We propose a debiased local linear estimator\nof the covariate-adjusted regression function, and demonstrate that our\nestimator converges pointwise to a mean-zero normal limit distribution. We use\nthis result to construct asymptotically valid confidence intervals for function\nvalues and differences thereof. In addition, we use approximation results for\nthe distribution of the supremum of an empirical process to construct\nasymptotically valid uniform confidence bands. Our methods do not require\nundersmoothing, permit the use of data-adaptive estimators of nuisance\nfunctions, and our estimator attains the optimal rate of convergence for a\ntwice differentiable function. We illustrate the practical performance of our\nestimator using numerical studies and an analysis of the effect of air\npollution exposure on cardiovascular mortality."}, "http://arxiv.org/abs/2308.12506": {"title": "General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays", "link": "http://arxiv.org/abs/2308.12506", "description": "We present a general central limit theorem with simple, easy-to-check\ncovariance-based sufficient conditions for triangular arrays of random vectors\nwhen all variables could be interdependent. The result is constructed from\nStein's method, but the conditions are distinct from related work. We show that\nthese covariance conditions nest standard assumptions studied in the literature\nsuch as $M$-dependence, mixing random fields, non-mixing autoregressive\nprocesses, and dependency graphs, which themselves need not imply each other.\nThis permits researchers to work with high-level but intuitive conditions based\non overall correlation instead of more complicated and restrictive conditions\nsuch as strong mixing in random fields that may not have any obvious\nmicro-foundation. 
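As a toy illustration of one setting nested by the covariance-based CLT conditions described above (M-dependence), the short Monte Carlo check below verifies that suitably scaled sums of a 2-dependent moving-average array look approximately standard normal. It is a sanity check under assumed coefficients, not part of the paper.

```python
import numpy as np
from scipy.stats import kstest

# Row sums of a 2-dependent triangular array (a short moving average of iid
# noise), scaled by the long-run standard deviation (1 + 0.5 + 0.25 = 1.75
# times sqrt(n)), should be approximately N(0, 1).
rng = np.random.default_rng(0)
n, reps = 2000, 4000
sums = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n + 2)
    x = e[2:] + 0.5 * e[1:-1] + 0.25 * e[:-2]   # 2-dependent sequence
    sums[r] = x.sum() / (1.75 * np.sqrt(n))
print(kstest(sums, "norm"))  # large p-value: no evidence against normality
```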
As examples of the implications, we show how the theorem\nimplies asymptotic normality in estimating: treatment effects with spillovers\nin more settings than previously admitted, covariance matrices, processes with\nglobal dependencies such as epidemic spread and information diffusion, and\nspatial process with Mat\\'{e}rn dependencies."}, "http://arxiv.org/abs/2312.10176": {"title": "Spectral estimation for spatial point processes and random fields", "link": "http://arxiv.org/abs/2312.10176", "description": "Spatial data can come in a variety of different forms, but two of the most\ncommon generating models for such observations are random fields and point\nprocesses. Whilst it is known that spectral analysis can unify these two\ndifferent data forms, specific methodology for the related estimation is yet to\nbe developed. In this paper, we solve this problem by extending multitaper\nestimation, to estimate the spectral density matrix function for multivariate\nspatial data, where processes can be any combination of either point processes\nor random fields. We discuss finite sample and asymptotic theory for the\nproposed estimators, as well as specific details on the implementation,\nincluding how to perform estimation on non-rectangular domains and the correct\nimplementation of multitapering for processes sampled in different ways, e.g.\ncontinuously vs on a regular grid."}, "http://arxiv.org/abs/2312.10234": {"title": "Targeted Machine Learning for Average Causal Effect Estimation Using the Front-Door Functional", "link": "http://arxiv.org/abs/2312.10234", "description": "Evaluating the average causal effect (ACE) of a treatment on an outcome often\ninvolves overcoming the challenges posed by confounding factors in\nobservational studies. A traditional approach uses the back-door criterion,\nseeking adjustment sets to block confounding paths between treatment and\noutcome. However, this method struggles with unmeasured confounders. As an\nalternative, the front-door criterion offers a solution, even in the presence\nof unmeasured confounders between treatment and outcome. This method relies on\nidentifying mediators that are not directly affected by these confounders and\nthat completely mediate the treatment's effect. Here, we introduce novel\nestimation strategies for the front-door criterion based on the targeted\nminimum loss-based estimation theory. Our estimators work across diverse\nscenarios, handling binary, continuous, and multivariate mediators. They\nleverage data-adaptive machine learning algorithms, minimizing assumptions and\nensuring key statistical properties like asymptotic linearity,\ndouble-robustness, efficiency, and valid estimates within the target parameter\nspace. We establish conditions under which the nuisance functional estimations\nensure the root n-consistency of ACE estimators. Our numerical experiments show\nthe favorable finite sample performance of the proposed estimators. We\ndemonstrate the applicability of these estimators to analyze the effect of\nearly stage academic performance on future yearly income using data from the\nFinnish Social Science Data Archive."}, "http://arxiv.org/abs/2312.10388": {"title": "The Causal Impact of Credit Lines on Spending Distributions", "link": "http://arxiv.org/abs/2312.10388", "description": "Consumer credit services offered by e-commerce platforms provide customers\nwith convenient loan access during shopping and have the potential to stimulate\nsales. 
To understand the causal impact of credit lines on spending, previous\nstudies have employed causal estimators, based on direct regression (DR),\ninverse propensity weighting (IPW), and double machine learning (DML) to\nestimate the treatment effect. However, these estimators do not consider the\nnotion that an individual's spending can be understood and represented as a\ndistribution, which captures the range and pattern of amounts spent across\ndifferent orders. By disregarding the outcome as a distribution, valuable\ninsights embedded within the outcome distribution might be overlooked. This\npaper develops a distribution-valued estimator framework that extends existing\nreal-valued DR-, IPW-, and DML-based estimators to distribution-valued\nestimators within Rubin's causal framework. We establish their consistency and\napply them to a real dataset from a large e-commerce platform. Our findings\nreveal that credit lines positively influence spending across all quantiles;\nhowever, as credit lines increase, consumers allocate more to luxuries (higher\nquantiles) than necessities (lower quantiles)."}, "http://arxiv.org/abs/2312.10435": {"title": "Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model", "link": "http://arxiv.org/abs/2312.10435", "description": "Estimating heterogeneous treatment effects across individuals has attracted\ngrowing attention as a statistical tool for performing critical\ndecision-making. We propose a Bayesian inference framework that quantifies the\nuncertainty in treatment effect estimation to support decision-making in a\nrelatively small sample size setting. Our proposed model places Gaussian\nprocess priors on the nonparametric components of a semiparametric model called\na partially linear model. This model formulation has three advantages. First,\nwe can analytically compute the posterior distribution of a treatment effect\nwithout relying on the computationally demanding posterior approximation.\nSecond, we can guarantee that the posterior distribution concentrates around\nthe true one as the sample size goes to infinity. Third, we can incorporate\nprior knowledge about a treatment effect into the prior distribution, improving\nthe estimation efficiency. Our experimental results show that even in the small\nsample size setting, our method can accurately estimate the heterogeneous\ntreatment effects and effectively quantify its estimation uncertainty."}, "http://arxiv.org/abs/2312.10499": {"title": "Censored extreme value estimation", "link": "http://arxiv.org/abs/2312.10499", "description": "A novel and comprehensive methodology designed to tackle the challenges posed\nby extreme values in the context of random censorship is introduced. The main\nfocus is the analysis of integrals based on the product-limit estimator of\nnormalized top-order statistics, denoted extreme Kaplan--Meier integrals. These\nintegrals allow for transparent derivation of various important asymptotic\ndistributional properties, offering an alternative approach to conventional\nplug-in estimation methods. Notably, this methodology demonstrates robustness\nand wide applicability within the scope of max-domains of attraction. 
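For orientation on censored tail estimation, the sketch below implements the classical censoring-adjusted Hill estimator (the ordinary Hill statistic on the observed minima, divided by the proportion of uncensored observations among the top order statistics), a simpler estimator from this literature. It is not the extreme Kaplan-Meier integral construction proposed in the entry above, and the toy Pareto censoring setup is assumed.

```python
import numpy as np

def hill_censored(z, delta, k):
    """Censoring-adjusted Hill estimator of the extreme value index."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)      # 1 = uncensored, 0 = censored
    order = np.argsort(z)
    z_sorted, d_sorted = z[order], delta[order]
    top, threshold = z_sorted[-k:], z_sorted[-k - 1]
    hill = np.mean(np.log(top / threshold))   # ordinary Hill on observed data
    p_uncensored = d_sorted[-k:].mean()       # share of uncensored exceedances
    return hill / p_uncensored

# Toy example: Pareto(2) lifetimes censored by independent Pareto(3) times,
# so the true extreme value index of the lifetime is 1/2.
rng = np.random.default_rng(0)
n = 5000
x = rng.pareto(2.0, n) + 1.0
c = rng.pareto(3.0, n) + 1.0
z, delta = np.minimum(x, c), (x <= c).astype(int)
print(hill_censored(z, delta, k=200))   # roughly 0.5
```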
An\nadditional noteworthy by-product is the extension of residual estimation of\nextremes to encompass all max-domains of attraction, which is of independent\ninterest."}, "http://arxiv.org/abs/2312.10541": {"title": "Random Measures, ANOVA Models and Quantifying Uncertainty in Randomized Controlled Trials", "link": "http://arxiv.org/abs/2312.10541", "description": "This short paper introduces a novel approach to global sensitivity analysis,\ngrounded in the variance-covariance structure of random variables derived from\nrandom measures. The proposed methodology facilitates the application of\ninformation-theoretic rules for uncertainty quantification, offering several\nadvantages. Specifically, the approach provides valuable insights into the\ndecomposition of variance within discrete subspaces, similar to the standard\nANOVA analysis. To illustrate this point, the method is applied to datasets\nobtained from the analysis of randomized controlled trials on evaluating the\nefficacy of the COVID-19 vaccine and assessing clinical endpoints in a lung\ncancer study."}, "http://arxiv.org/abs/2312.10548": {"title": "Analysis of composition on the original scale of measurement", "link": "http://arxiv.org/abs/2312.10548", "description": "In current applied research the most-used route to an analysis of composition\nis through log-ratios -- that is, contrasts among log-transformed measurements.\nHere we argue instead for a more direct approach, using a statistical model for\nthe arithmetic mean on the original scale of measurement. Central to the\napproach is a general variance-covariance function, derived by assuming\nmultiplicative measurement error. Quasi-likelihood analysis of logit models for\ncomposition is then a general alternative to the use of multivariate linear\nmodels for log-ratio transformed measurements, and it has important advantages.\nThese include robustness to secondary aspects of model specification, stability\nwhen there are zero-valued or near-zero measurements in the data, and more\ndirect interpretation. The usual efficiency property of quasi-likelihood\nestimation applies even when the error covariance matrix is unspecified. We\nalso indicate how the derived variance-covariance function can be used, instead\nof the variance-covariance matrix of log-ratios, with more general multivariate\nmethods for the analysis of composition. A specific feature is that the notion\nof `null correlation' -- for compositional measurements on their original scale\n-- emerges naturally."}, "http://arxiv.org/abs/2312.10563": {"title": "Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration", "link": "http://arxiv.org/abs/2312.10563", "description": "Mediation analysis is a powerful tool for studying causal pathways between\nexposure, mediator, and outcome variables of interest. While classical\nmediation analysis using observational data often requires strong and sometimes\nunrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR)\navoids unmeasured confounding bias by employing genetic variants as\ninstrumental variables. We develop a novel MR framework for mediation analysis\nwith genome-wide associate study (GWAS) summary data, and provide solid\nstatistical guarantees. Our framework efficiently integrates information stored\nin three independent GWAS summary data and mitigates the commonly encountered\nwinner's curse and measurement error bias (a.k.a. instrument selection and weak\ninstrument bias) in MR. 
As a result, our framework provides valid statistical\ninference for both direct and mediation effects with enhanced statistical\nefficiency. As part of this endeavor, we also demonstrate that the concept of\nwinner's curse bias in mediation analysis with MR and summary data is more\ncomplex than previously documented in the classical two-sample MR literature,\nrequiring special treatments to address such a bias issue. Through our\ntheoretical investigations, we show that the proposed method delivers\nconsistent and asymptotically normally distributed causal effect estimates. We\nillustrate the finite-sample performance of our approach through simulation\nexperiments and a case study."}, "http://arxiv.org/abs/2312.10569": {"title": "Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data", "link": "http://arxiv.org/abs/2312.10569", "description": "Many modern causal questions ask how treatments affect complex outcomes that\nare measured using wearable devices and sensors. Current analysis approaches\nrequire summarizing these data into scalar statistics (e.g., the mean), but\nthese summaries can be misleading. For example, disparate distributions can\nhave the same means, variances, and other statistics. Researchers can overcome\nthe loss of information by instead representing the data as distributions. We\ndevelop an interpretable method for distributional data analysis that ensures\ntrustworthy and robust decision-making: Analyzing Distributional Data via\nMatching After Learning to Stretch (ADD MALTS). We (i) provide analytical\nguarantees of the correctness of our estimation strategy, (ii) demonstrate via\nsimulation that ADD MALTS outperforms other distributional data analysis\nmethods at estimating treatment effects, and (iii) illustrate ADD MALTS'\nability to verify whether there is enough cohesion between treatment and\ncontrol units within subpopulations to trustworthily estimate treatment\neffects. We demonstrate ADD MALTS' utility by studying the effectiveness of\ncontinuous glucose monitors in mitigating diabetes risks."}, "http://arxiv.org/abs/2312.10570": {"title": "Adversarially Balanced Representation for Continuous Treatment Effect Estimation", "link": "http://arxiv.org/abs/2312.10570", "description": "Individual treatment effect (ITE) estimation requires adjusting for the\ncovariate shift between populations with different treatments, and deep\nrepresentation learning has shown great promise in learning a balanced\nrepresentation of covariates. However the existing methods mostly consider the\nscenario of binary treatments. In this paper, we consider the more practical\nand challenging scenario in which the treatment is a continuous variable (e.g.\ndosage of a medication), and we address the two main challenges of this setup.\nWe propose the adversarial counterfactual regression network (ACFR) that\nadversarially minimizes the representation imbalance in terms of KL divergence,\nand also maintains the impact of the treatment value on the outcome prediction\nby leveraging an attention mechanism. Theoretically we demonstrate that ACFR\nobjective function is grounded in an upper bound on counterfactual outcome\nprediction error. 
Our experimental evaluation on semi-synthetic datasets\ndemonstrates the empirical superiority of ACFR over a range of state-of-the-art\nmethods."}, "http://arxiv.org/abs/2312.10573": {"title": "Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem", "link": "http://arxiv.org/abs/2312.10573", "description": "Random Forest is a machine learning method that offers many advantages,\nincluding the ability to easily measure variable importance. Class balancing\ntechnique is a well-known solution to deal with class imbalance problem.\nHowever, it has not been actively studied on RF variable importance. In this\npaper, we study the effect of class balancing on RF variable importance. Our\nsimulation results show that over-sampling is effective in correctly measuring\nvariable importance in class imbalanced situations with small sample size,\nwhile under-sampling fails to differentiate important and non-informative\nvariables. We then propose a variable selection algorithm that utilizes RF\nvariable importance and its confidence interval. Through an experimental study\nusing many real and artificial datasets, we demonstrate that our proposed\nalgorithm efficiently selects an optimal feature set, leading to improved\nprediction performance in class imbalance problem."}, "http://arxiv.org/abs/2312.10596": {"title": "A maximin optimal approach for model-free sampling designs in two-phase studies", "link": "http://arxiv.org/abs/2312.10596", "description": "Data collection costs can vary widely across variables in data science tasks.\nTwo-phase designs are often employed to save data collection costs. In\ntwo-phase studies, inexpensive variables are collected for all subjects in the\nfirst phase, and expensive variables are measured for a subset of subjects in\nthe second phase based on a predetermined sampling rule. The estimation\nefficiency under two-phase designs relies heavily on the sampling rule.\nExisting literature primarily focuses on designing sampling rules for\nestimating a scalar parameter in some parametric models or some specific\nestimating problems. However, real-world scenarios are usually model-unknown\nand involve two-phase designs for model-free estimation of a scalar or\nmulti-dimensional parameter. This paper proposes a maximin criterion to design\nan optimal sampling rule based on semiparametric efficiency bounds. The\nproposed method is model-free and applicable to general estimating problems.\nThe resulting sampling rule can minimize the semiparametric efficiency bound\nwhen the parameter is scalar and improve the bound for every component when the\nparameter is multi-dimensional. Simulation studies demonstrate that the\nproposed designs reduce the variance of the resulting estimator in various\nsettings. The implementation of the proposed design is illustrated in a real\ndata analysis."}, "http://arxiv.org/abs/2312.10607": {"title": "Bayesian Model Selection via Mean-Field Variational Approximation", "link": "http://arxiv.org/abs/2312.10607", "description": "This article considers Bayesian model selection via mean-field (MF)\nvariational approximation. Towards this goal, we study the non-asymptotic\nproperties of MF inference under the Bayesian framework that allows latent\nvariables and model mis-specification. 
Concretely, we show a Bernstein-von\nMises (BvM) theorem for the variational distribution from MF under possible\nmodel mis-specification, which implies the distributional convergence of MF\nvariational approximation to a normal distribution centered at the maximum\nlikelihood estimator (within the specified model). Motivated by the BvM\ntheorem, we propose a model selection criterion using the evidence lower bound\n(ELBO), and demonstrate that the model selected by ELBO tends to asymptotically\nagree with the one selected by the commonly used Bayesian information criterion\n(BIC) as sample size tends to infinity. Compared to BIC, ELBO tends to incur\nsmaller approximation error to the log-marginal likelihood (a.k.a. model\nevidence) due to a better dimension dependence and full incorporation of the\nprior information. Moreover, we show the geometric convergence of the\ncoordinate ascent variational inference (CAVI) algorithm under the parametric\nmodel framework, which provides practical guidance on how many iterations one\ntypically needs to run when approximating the ELBO. These findings demonstrate\nthat variational inference is capable of providing a computationally efficient\nalternative to conventional approaches in tasks beyond obtaining point\nestimates, which is also empirically demonstrated by our extensive numerical\nexperiments."}, "http://arxiv.org/abs/2312.10618": {"title": "Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines", "link": "http://arxiv.org/abs/2312.10618", "description": "Classification and probability estimation have broad applications in modern\nmachine learning and data science, including biology, medicine,\nengineering, and computer science. The recent development of a class of\nweighted Support Vector Machines (wSVMs) has shown great value in robustly\npredicting the class probability and classification for various problems with\nhigh accuracy. The current framework is based on the $\\ell^2$-norm regularized\nbinary wSVMs optimization problem, which only works with dense features and has\npoor performance with sparse features and redundant noise in most real\napplications. The sparse learning process requires prescreening the important\nvariables for each binary wSVM to accurately estimate pairwise conditional\nprobabilities. In this paper, we propose novel wSVMs frameworks that incorporate\nautomatic variable selection with accurate probability estimation for sparse\nlearning problems. We develop efficient algorithms for effective variable\nselection for solving either the $\\ell^1$-norm or elastic net regularized\nbinary wSVMs optimization problems. The binary class probability is then\nestimated either by the $\\ell^2$-norm regularized wSVMs framework with selected\nvariables or by elastic net regularized wSVMs directly. The two-step approach\nof $\\ell^1$-norm followed by $\\ell^2$-norm wSVMs shows a great advantage in both\nautomatic variable selection and reliable probability estimation with the least\ncomputation time. The elastic net regularized wSVMs offer the best performance in\nterms of variable selection and probability estimation with the additional\nadvantage of variable grouping, at the cost of more computation time for\nhigh-dimensional problems. 
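A rough scikit-learn analogue of the two-step $\ell^1$-then-$\ell^2$ idea described above, offered as a sketch rather than the paper's wSVM framework: an l1-penalized linear SVM screens variables, then an l2 linear SVM with Platt-style calibration estimates binary class probabilities on the selected set. The weighting scheme of wSVMs is not reproduced here, and the toy data are hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Step 1: l1-penalized linear SVM screens variables.
# Step 2: l2 linear SVM, wrapped in cross-validated calibration, returns
#         class probabilities using only the selected variables.
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           random_state=0)
two_step = Pipeline([
    ("select", SelectFromModel(
        LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000))),
    ("prob", CalibratedClassifierCV(
        LinearSVC(C=1.0, penalty="l2", dual=True, max_iter=5000), cv=5)),
])
two_step.fit(X, y)
print(two_step.predict_proba(X[:5]))
```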
The proposed wSVMs-based sparse learning methods\nhave wide applications and can be further extended to $K$-class problems\nthrough ensemble learning."}, "http://arxiv.org/abs/2312.10675": {"title": "Visualization and Assessment of Copula Symmetry", "link": "http://arxiv.org/abs/2312.10675", "description": "Visualization and assessment of copula structures are crucial for accurately\nunderstanding and modeling the dependencies in multivariate data analysis. In\nthis paper, we introduce an innovative method that employs functional boxplots\nand rank-based testing procedures to evaluate copula symmetry. This approach is\nspecifically designed to assess key characteristics such as reflection\nsymmetry, radial symmetry, and joint symmetry. We first construct test\nfunctions for each specific property and then investigate the asymptotic\nproperties of their empirical estimators. We demonstrate that the functional\nboxplot of these sample test functions serves as an informative visualization\ntool of a given copula structure, effectively measuring the departure from zero\nof the test function. Furthermore, we introduce a nonparametric testing\nprocedure to assess the significance of deviations from symmetry, ensuring the\naccuracy and reliability of our visualization method. Through extensive\nsimulation studies involving various copula models, we demonstrate the\neffectiveness of our testing approach. Finally, we apply our visualization and\ntesting techniques to two real-world datasets: a nutritional habits survey with\nfive variables and wind speed data from three locations in Saudi Arabia."}, "http://arxiv.org/abs/2312.10690": {"title": "M-Estimation in Censored Regression Model using Instrumental Variables under Endogeneity", "link": "http://arxiv.org/abs/2312.10690", "description": "We propose and study M-estimation to estimate the parameters in the censored\nregression model in the presence of endogeneity, i.e., the Tobit model. In the\ncourse of this study, we follow two-stage procedures: the first stage consists\nof applying control function procedures to address the issue of endogeneity\nusing instrumental variables, and the second stage applies the M-estimation\ntechnique to estimate the unknown parameters involved in the model. The large\nsample properties of the proposed estimators are derived and analyzed. The\nfinite sample properties of the estimators are studied through Monte Carlo\nsimulation and a real data application related to women's labor force\nparticipation."}, "http://arxiv.org/abs/2312.10695": {"title": "Nonparametric Strategy Test", "link": "http://arxiv.org/abs/2312.10695", "description": "We present a nonparametric statistical test for determining whether an agent\nis following a given mixed strategy in a repeated strategic-form game given\nsamples of the agent's play. This involves two components: determining whether\nthe agent's frequencies of pure strategies are sufficiently close to the target\nfrequencies, and determining whether the pure strategies selected are\nindependent between different game iterations. Our integrated test involves\napplying a chi-squared goodness of fit test for the first component and a\ngeneralized Wald-Wolfowitz runs test for the second component. The results from\nboth tests are combined using Bonferroni correction to produce a complete test\nfor a given significance level $\\alpha.$ We applied the test to publicly\navailable data of human rock-paper-scissors play. The data consists of 50\niterations of play for 500 human players. 
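A minimal sketch of the two-component test structure described above; note that the paper's independence component is a generalized Wald-Wolfowitz runs test, for which a chi-squared test on consecutive-play transition counts is used here as a simple stand-in, so this is only an approximation of the procedure:

# Sketch: (1) chi-squared goodness of fit of pure-strategy frequencies,
# (2) a serial-independence check on the table of consecutive-play pairs
# (stand-in for the generalized runs test), combined via Bonferroni.
import numpy as np
from scipy.stats import chisquare, chi2_contingency

def strategy_test(plays, target_freqs, alpha=0.05):
    plays = np.asarray(plays)                  # e.g. 0=rock, 1=paper, 2=scissors
    k, n = len(target_freqs), len(plays)

    # Component 1: observed vs. target pure-strategy frequencies.
    observed = np.bincount(plays, minlength=k)
    p_freq = chisquare(observed, f_exp=n * np.asarray(target_freqs)).pvalue

    # Component 2: independence across iterations via transition counts.
    trans = np.zeros((k, k))
    for a, b in zip(plays[:-1], plays[1:]):
        trans[a, b] += 1
    p_indep = chi2_contingency(trans)[1]

    # Bonferroni: reject the null (target strategy, i.i.d. play)
    # if either p-value falls below alpha / 2.
    return min(p_freq, p_indep) < alpha / 2

rng = np.random.default_rng(0)
print(strategy_test(rng.integers(0, 3, size=50), [1/3, 1/3, 1/3]))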
We test with a null hypothesis that\nthe players are following a uniform random strategy independently at each game\niteration. Using a significance level of $\\alpha = 0.05$, we conclude that 305\n(61%) of the subjects are following the target strategy."}, "http://arxiv.org/abs/2312.10706": {"title": "Margin-closed regime-switching multivariate time series models", "link": "http://arxiv.org/abs/2312.10706", "description": "A regime-switching multivariate time series model which is closed under\nmargins is built. The model imposes a restriction on all lower-dimensional\nsub-processes to follow a regime-switching process sharing the same latent\nregime sequence and having the same Markov order as the original process. The\nmargin-closed regime-switching model is constructed by considering the\nmultivariate margin-closed Gaussian VAR($k$) dependence as a copula within each\nregime, and builds dependence between observations in different regimes by\nrequiring the first observation in the new regime to depend on the last\nobservation in the previous regime. The property of closure under margins\nallows inference on the latent regimes based on lower-dimensional selected\nsub-processes and estimation of univariate parameters from univariate\nsub-processes, and enables the use of multi-stage estimation procedure for the\nmodel. The parsimonious dependence structure of the model also avoids a large\nnumber of parameters under the regime-switching setting. The proposed model is\napplied to a macroeconomic data set to infer the latent business cycle and\ncompared with the relevant benchmark."}, "http://arxiv.org/abs/2312.10796": {"title": "Two sample test for covariance matrices in ultra-high dimension", "link": "http://arxiv.org/abs/2312.10796", "description": "In this paper, we propose a new test for testing the equality of two\npopulation covariance matrices in the ultra-high dimensional setting that the\ndimension is much larger than the sizes of both of the two samples. Our\nproposed methodology relies on a data splitting procedure and a comparison of a\nset of well selected eigenvalues of the sample covariance matrices on the split\ndata sets. Compared to the existing methods, our methodology is adaptive in the\nsense that (i). it does not require specific assumption (e.g., comparable or\nbalancing, etc.) on the sizes of two samples; (ii). it does not need\nquantitative or structural assumptions of the population covariance matrices;\n(iii). it does not need the parametric distributions or the detailed knowledge\nof the moments of the two populations. Theoretically, we establish the\nasymptotic distributions of the statistics used in our method and conduct the\npower analysis. We justify that our method is powerful under very weak\nalternatives. We conduct extensive numerical simulations and show that our\nmethod significantly outperforms the existing ones both in terms of size and\npower. Analysis of two real data sets is also carried out to demonstrate the\nusefulness and superior performance of our proposed methodology. 
An\n$\\texttt{R}$ package $\\texttt{UHDtst}$ is developed for easy implementation of\nour proposed methodology."}, "http://arxiv.org/abs/2312.10814": {"title": "Scalable Design with Posterior-Based Operating Characteristics", "link": "http://arxiv.org/abs/2312.10814", "description": "To design trustworthy Bayesian studies, criteria for posterior-based\noperating characteristics - such as power and the type I error rate - are often\ndefined in clinical, industrial, and corporate settings. These posterior-based\noperating characteristics are typically assessed by exploring sampling\ndistributions of posterior probabilities via simulation. In this paper, we\npropose a scalable method to determine optimal sample sizes and decision\ncriteria that leverages large-sample theory to explore sampling distributions\nof posterior probabilities in a targeted manner. This targeted exploration\napproach prompts consistent sample size recommendations with fewer simulation\nrepetitions than standard methods. We repurpose the posterior probabilities\ncomputed in that approach to efficiently investigate various sample sizes and\ndecision criteria using contour plots."}, "http://arxiv.org/abs/2312.10894": {"title": "Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference", "link": "http://arxiv.org/abs/2312.10894", "description": "In this paper, we study the effectiveness of using a constant stepsize in\nstatistical inference via linear stochastic approximation (LSA) algorithms with\nMarkovian data. After establishing a Central Limit Theorem (CLT), we outline an\ninference procedure that uses averaged LSA iterates to construct confidence\nintervals (CIs). Our procedure leverages the fast mixing property of\nconstant-stepsize LSA for better covariance estimation and employs\nRichardson-Romberg (RR) extrapolation to reduce the bias induced by constant\nstepsize and Markovian data. We develop theoretical results for guiding\nstepsize selection in RR extrapolation, and identify several important settings\nwhere the bias provably vanishes even without extrapolation. We conduct\nextensive numerical experiments and compare against classical inference\napproaches. Our results show that using a constant stepsize enjoys easy\nhyperparameter tuning, fast convergence, and consistently better CI coverage,\nespecially when data is limited."}, "http://arxiv.org/abs/2312.10920": {"title": "Domain adaption and physical constrains transfer learning for shale gas production", "link": "http://arxiv.org/abs/2312.10920", "description": "Effective prediction of shale gas production is crucial for strategic\nreservoir development. However, in new shale gas blocks, two main challenges\nare encountered: (1) the occurrence of negative transfer due to insufficient\ndata, and (2) the limited interpretability of deep learning (DL) models. To\ntackle these problems, we propose a novel transfer learning methodology that\nutilizes domain adaptation and physical constraints. This methodology\neffectively employs historical data from the source domain to reduce negative\ntransfer from the data distribution perspective, while also using physical\nconstraints to build a robust and reliable prediction model that integrates\nvarious types of data. The methodology starts by dividing the production data\nfrom the source domain into multiple subdomains, thereby enhancing data\ndiversity. It then uses Maximum Mean Discrepancy (MMD) and global average\ndistance measures to decide on the feasibility of transfer. 
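A minimal sketch of a squared-MMD computation of the kind used for the transferability decision above; the RBF kernel and median-heuristic bandwidth are common defaults assumed here, not choices stated in the abstract:

# Sketch: squared Maximum Mean Discrepancy between two samples with an RBF
# kernel, a common way to quantify how far apart two data distributions are.
import numpy as np
from scipy.spatial.distance import cdist

def mmd2_rbf(X, Y, bandwidth=None):
    Z = np.vstack([X, Y])
    if bandwidth is None:                          # median heuristic
        bandwidth = np.median(cdist(Z, Z)) + 1e-12
    k = lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / (2 * bandwidth**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 5))          # source-domain features
tgt = rng.normal(0.5, 1.0, size=(200, 5))          # target-domain features
print(mmd2_rbf(src, tgt))                          # larger value = less transferable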
Through domain\nadaptation, we integrate all transferable knowledge, resulting in a more\ncomprehensive target model. Lastly, by incorporating drilling, completion, and\ngeological data as physical constraints, we develop a hybrid model. This model,\na combination of a multi-layer perceptron (MLP) and a Transformer\n(Transformer-MLP), is designed to maximize interpretability. Experimental\nvalidation in China's southwestern region confirms the method's effectiveness."}, "http://arxiv.org/abs/2312.10926": {"title": "A Random Effects Model-based Method of Moments Estimation of Causal Effect in Mendelian Randomization Studies", "link": "http://arxiv.org/abs/2312.10926", "description": "Recent advances in genotyping technology have delivered a wealth of genetic\ndata, which is rapidly advancing our understanding of the underlying genetic\narchitecture of complex diseases. Mendelian Randomization (MR) leverages such\ngenetic data to estimate the causal effect of an exposure factor on an outcome\nfrom observational studies. In this paper, we utilize genetic correlations to\nsummarize information on a large set of genetic variants associated with the\nexposure factor. Our proposed approach is a generalization of the MR-inverse\nvariance weighting (IVW) approach where we can accommodate many weak and\npleiotropic effects. Our approach quantifies the variation explained by all\nvalid instrumental variables (IVs) instead of estimating the individual effects\nand thus could accommodate weak IVs. This is particularly useful for performing\nMR estimation in small studies, or minority populations where the selection of\nvalid IVs is unreliable and thus has a large influence on the MR estimation.\nThrough simulation and real data analysis, we demonstrate that our approach\nprovides a robust alternative to the existing MR methods. We illustrate the\nrobustness of our proposed approach under the violation of MR assumptions and\ncompare the performance with several existing approaches."}, "http://arxiv.org/abs/2312.10958": {"title": "Large-sample properties of multiple imputation estimators for parameters of logistic regression with covariates missing at random separately or simultaneously", "link": "http://arxiv.org/abs/2312.10958", "description": "We consider logistic regression including two sets of discrete or categorical\ncovariates that are missing at random (MAR) separately or simultaneously. We\nexamine the asymptotic properties of two multiple imputation (MI) estimators,\ngiven in the study of Lee at al. (2023), for the parameters of the logistic\nregression model with both sets of discrete or categorical covariates that are\nMAR separately or simultaneously. The proposed estimated asymptotic variances\nof the two MI estimators address a limitation observed with Rubin's type\nestimated variances, which lead to underestimate the variances of the two MI\nestimators (Rubin, 1987). Simulation results demonstrate that our two proposed\nMI methods outperform the complete-case, semiparametric inverse probability\nweighting, random forest MI using chained equations, and stochastic\napproximation of expectation-maximization methods. 
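For context, Rubin's classical combining rules, whose variance component the work above argues can understate uncertainty in this setting, take the following form (a generic sketch, not the authors' proposed variance estimator):

# Sketch of Rubin's rules for pooling M multiply-imputed estimates:
# pooled estimate = mean of the per-imputation estimates,
# total variance = within-imputation variance + (1 + 1/M) * between variance.
import numpy as np

def rubin_pool(estimates, variances):
    estimates = np.asarray(estimates)          # length-M point estimates
    variances = np.asarray(variances)          # length-M squared standard errors
    M = len(estimates)
    qbar = estimates.mean()
    W = variances.mean()                       # within-imputation variance
    B = estimates.var(ddof=1)                  # between-imputation variance
    T = W + (1 + 1 / M) * B                    # total variance
    return qbar, T

est, var = rubin_pool([0.42, 0.47, 0.40, 0.45, 0.44],
                      [0.010, 0.012, 0.011, 0.010, 0.013])
print(est, np.sqrt(var))                       # pooled estimate and standard error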
To illustrate the\nmethodology's practical application, we provide a real data example from a\nsurvey conducted in the Feng Chia night market in Taichung City, Taiwan."}, "http://arxiv.org/abs/2312.11001": {"title": "A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables", "link": "http://arxiv.org/abs/2312.11001", "description": "Most existing causal discovery methods rely on the assumption of no latent\nconfounders, limiting their applicability in solving real-life problems. In\nthis paper, we introduce a novel, versatile framework for causal discovery that\naccommodates the presence of causally-related hidden variables almost\neverywhere in the causal network (for instance, they can be effects of observed\nvariables), based on rank information of covariance matrix over observed\nvariables. We start by investigating the efficacy of rank in comparison to\nconditional independence and, theoretically, establish necessary and sufficient\nconditions for the identifiability of certain latent structural patterns.\nFurthermore, we develop a Rank-based Latent Causal Discovery algorithm, RLCD,\nthat can efficiently locate hidden variables, determine their cardinalities,\nand discover the entire causal structure over both measured and hidden ones. We\nalso show that, under certain graphical conditions, RLCD correctly identifies\nthe Markov Equivalence Class of the whole latent causal graph asymptotically.\nExperimental results on both synthetic and real-world personality data sets\ndemonstrate the efficacy of the proposed approach in finite-sample cases."}, "http://arxiv.org/abs/2312.11054": {"title": "Detection of Model-based Planted Pseudo-cliques in Random Dot Product Graphs by the Adjacency Spectral Embedding and the Graph Encoder Embedding", "link": "http://arxiv.org/abs/2312.11054", "description": "In this paper, we explore the capability of both the Adjacency Spectral\nEmbedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded\npseudo-clique structure in the random dot product graph setting. In both theory\nand experiments, we demonstrate that this pairing of model and methods can\nyield worse results than the best existing spectral clique detection methods,\ndemonstrating at once the methods' potential inability to capture even modestly\nsized pseudo-cliques and the methods' robustness to the model contamination\ngiving rise to the pseudo-clique structure. To further enrich our analysis, we\nalso consider the Variational Graph Auto-Encoder (VGAE) model in our simulation\nand real data experiments."}, "http://arxiv.org/abs/2312.11108": {"title": "Multiple change point detection in functional data with applications to biomechanical fatigue data", "link": "http://arxiv.org/abs/2312.11108", "description": "Injuries to the lower extremity joints are often debilitating, particularly\nfor professional athletes. Understanding the onset of stressful conditions on\nthese joints is therefore important in order to ensure prevention of injuries\nas well as individualised training for enhanced athletic performance. We study\nthe biomechanical joint angles from the hip, knee and ankle for runners who are\nexperiencing fatigue. The data is cyclic in nature and densely collected by\nbody worn sensors, which makes it ideal to work with in the functional data\nanalysis (FDA) framework.\n\nWe develop a new method for multiple change point detection for functional\ndata, which improves the state of the art with respect to at least two novel\naspects. 
First, the curves are compared with respect to their maximum absolute\ndeviation, which leads to a better interpretation of local changes in the\nfunctional data compared to classical $L^2$-approaches. Secondly, as slight\naberrations are to be often expected in a human movement data, our method will\nnot detect arbitrarily small changes but hunts for relevant changes, where\nmaximum absolute deviation between the curves exceeds a specified threshold,\nsay $\\Delta >0$. We recover multiple changes in a long functional time series\nof biomechanical knee angle data, which are larger than the desired threshold\n$\\Delta$, allowing us to identify changes purely due to fatigue. In this work,\nwe analyse data from both controlled indoor as well as from an uncontrolled\noutdoor (marathon) setting."}, "http://arxiv.org/abs/2312.11136": {"title": "Identification of complier and noncomplier average causal effects in the presence of latent missing-at-random (LMAR) outcomes: a unifying view and choices of assumptions", "link": "http://arxiv.org/abs/2312.11136", "description": "The study of treatment effects is often complicated by noncompliance and\nmissing data. In the one-sided noncompliance setting where of interest are the\ncomplier and noncomplier average causal effects (CACE and NACE), we address\noutcome missingness of the \\textit{latent missing at random} type (LMAR, also\nknown as \\textit{latent ignorability}). That is, conditional on covariates and\ntreatment assigned, the missingness may depend on compliance type. Within the\ninstrumental variable (IV) approach to noncompliance, methods have been\nproposed for handling LMAR outcome that additionally invoke an exclusion\nrestriction type assumption on missingness, but no solution has been proposed\nfor when a non-IV approach is used. This paper focuses on effect identification\nin the presence of LMAR outcome, with a view to flexibly accommodate different\nprincipal identification approaches. We show that under treatment assignment\nignorability and LMAR only, effect nonidentifiability boils down to a set of\ntwo connected mixture equations involving unidentified stratum-specific\nresponse probabilities and outcome means. This clarifies that (except for a\nspecial case) effect identification generally requires two additional\nassumptions: a \\textit{specific missingness mechanism} assumption and a\n\\textit{principal identification} assumption. This provides a template for\nidentifying effects based on separate choices of these assumptions. We consider\na range of specific missingness assumptions, including those that have appeared\nin the literature and some new ones. Incidentally, we find an issue in the\nexisting assumptions, and propose a modification of the assumptions to avoid\nthe issue. Results under different assumptions are illustrated using data from\nthe Baltimore Experience Corps Trial."}, "http://arxiv.org/abs/2312.11137": {"title": "Random multiplication versus random sum: auto-regressive-like models with integer-valued random inputs", "link": "http://arxiv.org/abs/2312.11137", "description": "A common approach to analyze count time series is to fit models based on\nrandom sum operators. As an alternative, this paper introduces time series\nmodels based on a random multiplication operator, which is simply the\nmultiplication of a variable operand by an integer-valued random coefficient,\nwhose mean is the constant operand. 
This operation is embedded in\nauto-regressive-like models with integer-valued random inputs, referred to as\nRMINAR. Two special variants are studied, namely the N0-valued random\ncoefficient auto-regressive model and the N0-valued random coefficient\nmultiplicative error model. Furthermore, Z-valued extensions are considered.\nThe dynamic structure of the proposed models is studied in detail. In\nparticular, their corresponding solutions are everywhere strictly stationary\nand ergodic, a fact that is uncommon both in the literature on\ninteger-valued time series models and in that on real-valued random coefficient\nauto-regressive models. Therefore, the parameters of the RMINAR model are\nestimated using a four-stage weighted least squares estimator, with consistency\nand asymptotic normality established everywhere in the parameter space.\nFinally, the new RMINAR models are illustrated with some simulated and\nempirical examples."}, "http://arxiv.org/abs/2312.11178": {"title": "Deinterleaving RADAR emitters with optimal transport distances", "link": "http://arxiv.org/abs/2312.11178", "description": "Detection and identification of emitters provide vital information for\ndefensive strategies in electronic intelligence. Based on a received signal\ncontaining pulses from an unknown number of emitters, this paper introduces an\nunsupervised methodology for deinterleaving RADAR signals based on a\ncombination of clustering algorithms and optimal transport distances. The first\nstep involves separating the pulses with a clustering algorithm under the\nconstraint that the pulses of two different emitters cannot belong to the same\ncluster. Then, as the emitters exhibit complex behavior and can be represented\nby several clusters, we propose a hierarchical clustering algorithm based on an\noptimal transport distance to merge these clusters. A variant is also\ndeveloped, capable of handling more complex signals. Finally, the proposed\nmethodology is evaluated on simulated data provided through a realistic\nsimulator. Results show that the proposed methods are capable of deinterleaving\ncomplex RADAR signals."}, "http://arxiv.org/abs/2312.11319": {"title": "Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation", "link": "http://arxiv.org/abs/2312.11319", "description": "Accurately detecting multiple change-points is critical for various\napplications, but determining the optimal number of change-points remains a\nchallenge. Existing approaches based on information criteria attempt to balance\ngoodness-of-fit and model complexity, but their performance varies depending on\nthe model. Recently, data-driven selection criteria based on cross-validation\nhave been proposed, but these methods can be prone to slight overfitting in\nfinite samples. In this paper, we introduce a method that controls the\nprobability of overestimation and provides uncertainty quantification for\nlearning multiple change-points via cross-validation. We frame this problem as\na sequence of model comparison problems and leverage high-dimensional\ninferential procedures. We demonstrate the effectiveness of our approach\nthrough experiments on finite-sample data, showing superior uncertainty\nquantification for overestimation compared to existing methods. 
Our approach\nhas broad applicability and can be used in diverse change-point models."}, "http://arxiv.org/abs/2312.11323": {"title": "UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters", "link": "http://arxiv.org/abs/2312.11323", "description": "Estimating the number of clusters k while clustering the data is a\nchallenging task. An incorrect cluster assumption indicates that the number of\nclusters k gets wrongly estimated. Consequently, the model fitting becomes less\nimportant. In this work, we focus on the concept of unimodality and propose a\nflexible cluster definition called locally unimodal cluster. A locally unimodal\ncluster extends for as long as unimodality is locally preserved across pairs of\nsubclusters of the data. Then, we propose the UniForCE method for locally\nunimodal clustering. The method starts with an initial overclustering of the\ndata and relies on the unimodality graph that connects subclusters forming\nunimodal pairs. Such pairs are identified using an appropriate statistical\ntest. UniForCE identifies maximal locally unimodal clusters by computing a\nspanning forest in the unimodality graph. Experimental results on both real and\nsynthetic datasets illustrate that the proposed methodology is particularly\nflexible and robust in discovering regular and highly complex cluster shapes.\nMost importantly, it automatically provides an adequate estimation of the\nnumber of clusters."}, "http://arxiv.org/abs/2312.11393": {"title": "Assessing Estimation Uncertainty under Model Misspecification", "link": "http://arxiv.org/abs/2312.11393", "description": "Model misspecification is ubiquitous in data analysis because the\ndata-generating process is often complex and mathematically intractable.\nTherefore, assessing estimation uncertainty and conducting statistical\ninference under a possibly misspecified working model is unavoidable. In such a\ncase, classical methods such as bootstrap and asymptotic theory-based inference\nfrequently fail since they rely heavily on the model assumptions. In this\narticle, we provide a new bootstrap procedure, termed local residual bootstrap,\nto assess estimation uncertainty under model misspecification for generalized\nlinear models. By resampling the residuals from the neighboring observations,\nwe can approximate the sampling distribution of the statistic of interest\naccurately. Instead of relying on the score equations, the proposed method\ndirectly recreates the response variables so that we can easily conduct\nstandard error estimation, confidence interval construction, hypothesis\ntesting, and model evaluation and selection. It performs similarly to classical\nbootstrap when the model is correctly specified and provides a more accurate\nassessment of uncertainty under model misspecification, offering data analysts\nan easy way to guard against the impact of misspecified models. We establish\ndesirable theoretical properties, such as the bootstrap validity, for the\nproposed method using the surrogate residuals. Numerical results and real data\nanalysis further demonstrate the superiority of the proposed method."}, "http://arxiv.org/abs/2312.11437": {"title": "Clustering Consistency of General Nonparametric Classification Methods in Cognitive Diagnosis", "link": "http://arxiv.org/abs/2312.11437", "description": "Cognitive diagnosis models have been popularly used in fields such as\neducation, psychology, and social sciences. 
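A minimal sketch of the local residual bootstrap described in the entry above (arXiv:2312.11393), shown for ordinary least squares with a k-nearest-neighbour notion of "neighboring observations"; the linear working model, k=10, and B=500 are illustrative assumptions rather than the authors' specification:

# Sketch: local residual bootstrap -- resample each observation's residual
# from its k nearest neighbours in covariate space, rebuild responses, refit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def local_residual_bootstrap(X, y, B=500, k=10, seed=0):
    rng = np.random.default_rng(seed)
    fit = LinearRegression().fit(X, y)
    resid = y - fit.predict(X)
    nbr_idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[1]
    coefs = np.empty((B, X.shape[1]))
    for b in range(B):
        # draw each observation's residual from its own neighbourhood
        draw = nbr_idx[np.arange(len(y)), rng.integers(0, k, size=len(y))]
        y_star = fit.predict(X) + resid[draw]
        coefs[b] = LinearRegression().fit(X, y_star).coef_
    return coefs.std(axis=0)                    # bootstrap standard errors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=np.abs(X[:, 0]) + 0.5)
print(local_residual_bootstrap(X, y))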
While parametric likelihood\nestimation is a prevailing method for fitting cognitive diagnosis models,\nnonparametric methodologies are attracting increasing attention due to their\nease of implementation and robustness, particularly when sample sizes are\nrelatively small. However, existing clustering consistency results of the\nnonparametric estimation methods often rely on certain restrictive conditions,\nwhich may not be easily satisfied in practice. In this article, the clustering\nconsistency of the general nonparametric classification method is reestablished\nunder weaker and more practical conditions."}, "http://arxiv.org/abs/2301.03747": {"title": "Semiparametric Regression for Spatial Data via Deep Learning", "link": "http://arxiv.org/abs/2301.03747", "description": "In this work, we propose a deep learning-based method to perform\nsemiparametric regression analysis for spatially dependent data. To be\nspecific, we use a sparsely connected deep neural network with rectified linear\nunit (ReLU) activation function to estimate the unknown regression function\nthat describes the relationship between response and covariates in the presence\nof spatial dependence. Under some mild conditions, the estimator is proven to\nbe consistent, and the rate of convergence is determined by three factors: (1)\nthe architecture of neural network class, (2) the smoothness and (intrinsic)\ndimension of true mean function, and (3) the magnitude of spatial dependence.\nOur method can handle well large data set owing to the stochastic gradient\ndescent optimization algorithm. Simulation studies on synthetic data are\nconducted to assess the finite sample performance, the results of which\nindicate that the proposed method is capable of picking up the intricate\nrelationship between response and covariates. Finally, a real data analysis is\nprovided to demonstrate the validity and effectiveness of the proposed method."}, "http://arxiv.org/abs/2301.10059": {"title": "Oncology clinical trial design planning based on a multistate model that jointly models progression-free and overall survival endpoints", "link": "http://arxiv.org/abs/2301.10059", "description": "When planning an oncology clinical trial, the usual approach is to assume\nproportional hazards and even an exponential distribution for time-to-event\nendpoints. Often, besides the gold-standard endpoint overall survival (OS),\nprogression-free survival (PFS) is considered as a second confirmatory\nendpoint. We use a survival multistate model to jointly model these two\nendpoints and find that neither exponential distribution nor proportional\nhazards will typically hold for both endpoints simultaneously. The multistate\nmodel provides a stochastic process approach to model the dependency of such\nendpoints neither requiring latent failure times nor explicit dependency\nmodelling such as copulae. We use the multistate model framework to simulate\nclinical trials with endpoints OS and PFS and show how design planning\nquestions can be answered using this approach. In particular, non-proportional\nhazards for at least one of the endpoints are naturally modelled as well as\ntheir dependency to improve planning. We consider an oncology trial on\nnon-small-cell lung cancer as a motivating example from which we derive\nrelevant trial design questions. We then illustrate how clinical trial design\ncan be based on simulations from a multistate model. Key applications are\nco-primary endpoints and group-sequential designs. 
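A minimal sketch of simulating the two endpoints from a three-state illness-death model with constant transition hazards (the hazard values are arbitrary placeholders, and the paper's framework also covers non-constant hazards):

# Sketch: simulate one arm from an illness-death multistate model with states
# 0 = stable, 1 = progression, 2 = death and constant transition hazards.
# PFS is the time of leaving state 0; OS is the time of reaching state 2.
import numpy as np

def simulate_arm(n, h01, h02, h12, rng):
    # time of leaving the stable state, and whether that move was a progression
    pfs = rng.exponential(1.0 / (h01 + h02), n)
    progressed = rng.random(n) < h01 / (h01 + h02)
    post = rng.exponential(1.0 / h12, n)        # progression -> death sojourn time
    os_t = np.where(progressed, pfs + post, pfs)
    return pfs, os_t

rng = np.random.default_rng(0)
pfs, os_t = simulate_arm(10_000, h01=0.08, h02=0.02, h12=0.12, rng=rng)
print(np.median(pfs), np.median(os_t))          # dependent endpoints by construction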
Simulations for these\napplications show that the standard simplifying approach may very well lead to\nunderpowered or overpowered clinical trials. Our approach is quite general and\ncan be extended to more complex trial designs, further endpoints, and other\ntherapeutic areas. An R package is available on CRAN."}, "http://arxiv.org/abs/2302.09694": {"title": "Disentangled Representation for Causal Mediation Analysis", "link": "http://arxiv.org/abs/2302.09694", "description": "Estimating direct and indirect causal effects from observational data is\ncrucial to understanding the causal mechanisms and predicting the behaviour\nunder different interventions. Causal mediation analysis is a method that is\noften used to reveal direct and indirect effects. Deep learning shows promise\nin mediation analysis, but the current methods only assume latent confounders\nthat affect treatment, mediator and outcome simultaneously, and fail to\nidentify different types of latent confounders (e.g., confounders that only\naffect the mediator or outcome). Furthermore, current methods are based on the\nsequential ignorability assumption, which is not feasible for dealing with\nmultiple types of latent confounders. This work aims to circumvent the\nsequential ignorability assumption and applies the piecemeal deconfounding\nassumption as an alternative. We propose the Disentangled Mediation Analysis\nVariational AutoEncoder (DMAVAE), which disentangles the representations of\nlatent confounders into three types to accurately estimate the natural direct\neffect, natural indirect effect and total effect. Experimental results show\nthat the proposed method outperforms existing methods and has strong\ngeneralisation ability. We further apply the method to a real-world dataset to\nshow its potential application."}, "http://arxiv.org/abs/2302.13511": {"title": "Extrapolated cross-validation for randomized ensembles", "link": "http://arxiv.org/abs/2302.13511", "description": "Ensemble methods such as bagging and random forests are ubiquitous in various\nfields, from finance to genomics. Despite their prevalence, the question of the\nefficient tuning of ensemble parameters has received relatively little\nattention. This paper introduces a cross-validation method, ECV (Extrapolated\nCross-Validation), for tuning the ensemble and subsample sizes in randomized\nensembles. Our method builds on two primary ingredients: initial estimators for\nsmall ensemble sizes using out-of-bag errors and a novel risk extrapolation\ntechnique that leverages the structure of prediction risk decomposition. By\nestablishing uniform consistency of our risk extrapolation technique over\nensemble and subsample sizes, we show that ECV yields $\\delta$-optimal (with\nrespect to the oracle-tuned risk) ensembles for squared prediction risk. Our\ntheory accommodates general ensemble predictors, only requires mild moment\nassumptions, and allows for high-dimensional regimes where the feature\ndimension grows with the sample size. As a practical case study, we employ ECV\nto predict surface protein abundances from gene expressions in single-cell\nmultiomics using random forests. In comparison to sample-split cross-validation\nand $K$-fold cross-validation, ECV achieves higher accuracy avoiding sample\nsplitting. At the same time, its computational cost is considerably lower owing\nto the use of the risk extrapolation technique. 
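The extrapolation step can be illustrated with the affine-in-1/M structure of the squared prediction risk of an M-member ensemble; in this sketch a plain validation split stands in for the out-of-bag risk estimates that ECV actually uses, so it is schematic rather than the authors' procedure:

# Sketch: extrapolate the squared prediction risk of an M-tree ensemble using
# risk(M) ~= a + b/M, with a and b fitted from risk estimates at M = 1 and 2.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

def risk(M):
    rf = RandomForestRegressor(n_estimators=M, random_state=0).fit(X_tr, y_tr)
    return np.mean((y_va - rf.predict(X_va)) ** 2)

r1, r2 = risk(1), risk(2)
a, b = 2 * r2 - r1, 2 * (r1 - r2)        # solves risk(1)=a+b, risk(2)=a+b/2
for M in (5, 10, 50):
    print(M, a + b / M)                   # extrapolated risk without fitting M trees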
Additional numerical results\nvalidate the finite-sample accuracy of ECV for several common ensemble\npredictors under a computational constraint on the maximum ensemble size."}, "http://arxiv.org/abs/2304.01944": {"title": "A Statistical Approach to Ecological Modeling by a New Similarity Index", "link": "http://arxiv.org/abs/2304.01944", "description": "Similarity index is an important scientific tool frequently used to determine\nwhether different pairs of entities are similar with respect to some prefixed\ncharacteristics. Some standard measures of similarity index include Jaccard\nindex, S{\\o}rensen-Dice index, and Simpson's index. Recently, a better index\n($\\hat{\\alpha}$) for the co-occurrence and/or similarity has been developed,\nand this measure really outperforms and gives theoretically supported\nreasonable predictions. However, the measure $\\hat{\\alpha}$ is not data\ndependent. In this article we propose a new measure of similarity which depends\nstrongly on the data before introducing randomness in prevalence. Then, we\npropose a new method of randomization which changes the whole pattern of\nresults. Before randomization our measure is similar to the Jaccard index,\nwhile after randomization it is close to $\\hat{\\alpha}$. We consider the\npopular ecological dataset from the Tuscan Archipelago, Italy; and compare the\nperformance of the proposed index to other measures. Since our proposed index\nis data dependent, it has some interesting properties which we illustrate in\nthis article through numerical studies."}, "http://arxiv.org/abs/2305.04587": {"title": "Replication of \"null results\" -- Absence of evidence or evidence of absence?", "link": "http://arxiv.org/abs/2305.04587", "description": "In several large-scale replication projects, statistically non-significant\nresults in both the original and the replication study have been interpreted as\na \"replication success\". Here we discuss the logical problems with this\napproach: Non-significance in both studies does not ensure that the studies\nprovide evidence for the absence of an effect and \"replication success\" can\nvirtually always be achieved if the sample sizes are small enough. In addition,\nthe relevant error rates are not controlled. We show how methods, such as\nequivalence testing and Bayes factors, can be used to adequately quantify the\nevidence for the absence of an effect and how they can be applied in the\nreplication setting. Using data from the Reproducibility Project: Cancer\nBiology, the Experimental Philosophy Replicability Project, and the\nReproducibility Project: Psychology we illustrate that many original and\nreplication studies with \"null results\" are in fact inconclusive. We conclude\nthat it is important to also replicate studies with statistically\nnon-significant results, but that they should be designed, analyzed, and\ninterpreted appropriately."}, "http://arxiv.org/abs/2306.02235": {"title": "Learning Linear Causal Representations from Interventions under General Nonlinear Mixing", "link": "http://arxiv.org/abs/2306.02235", "description": "We study the problem of learning causal representations from unknown, latent\ninterventions in a general setting, where the latent distribution is Gaussian\nbut the mixing function is completely general. We prove strong identifiability\nresults given unknown single-node interventions, i.e., without having access to\nthe intervention targets. 
This generalizes prior works which have focused on\nweaker classes, such as linear maps or paired counterfactual data. This is also\nthe first instance of causal identifiability from non-paired interventions for\ndeep neural network embeddings. Our proof relies on carefully uncovering the\nhigh-dimensional geometric structure present in the data distribution after a\nnon-linear density transformation, which we capture by analyzing quadratic\nforms of precision matrices of the latent distributions. Finally, we propose a\ncontrastive algorithm to identify the latent variables in practice and evaluate\nits performance on various tasks."}, "http://arxiv.org/abs/2309.09367": {"title": "ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors", "link": "http://arxiv.org/abs/2309.09367", "description": "In this paper, we address the problem of designing an experiment with both\ndiscrete and continuous factors under fairly general parametric statistical\nmodels. We propose a new algorithm, named ForLion, to search for optimal\ndesigns under the D-criterion. The algorithm performs an exhaustive search in a\ndesign space with mixed factors while keeping high efficiency and reducing the\nnumber of distinct experimental settings. Its optimality is guaranteed by the\ngeneral equivalence theorem. We demonstrate its superiority over\nstate-of-the-art design algorithms using real-life experiments under\nmultinomial logistic models (MLM) and generalized linear models (GLM). Our\nsimulation studies show that the ForLion algorithm could reduce the number of\nexperimental settings by 25% or improve the relative efficiency of the designs\nby 17.5% on average. Our algorithm can help the experimenters reduce the time\ncost, the usage of experimental devices, and thus the total cost of their\nexperiments while preserving high efficiencies of the designs."}, "http://arxiv.org/abs/2312.11573": {"title": "Estimation of individual causal effects in network setup for multiple treatments", "link": "http://arxiv.org/abs/2312.11573", "description": "We study the problem of estimation of Individual Treatment Effects (ITE) in\nthe context of multiple treatments and networked observational data. Leveraging\nthe network information, we aim to utilize hidden confounders that may not be\ndirectly accessible in the observed data, thereby enhancing the practical\napplicability of the strong ignorability assumption. To achieve this, we first\nemploy Graph Convolutional Networks (GCN) to learn a shared representation of\nthe confounders. Then, our approach utilizes separate neural networks to infer\npotential outcomes for each treatment. We design a loss function as a weighted\ncombination of two components: representation loss and Mean Squared Error (MSE)\nloss on the factual outcomes. To measure the representation loss, we extend\nexisting metrics such as Wasserstein and Maximum Mean Discrepancy (MMD) from\nthe binary treatment setting to the multiple treatments scenario. To validate\nthe effectiveness of our proposed methodology, we conduct a series of\nexperiments on the benchmark datasets such as BlogCatalog and Flickr. 
The\nexperimental results consistently demonstrate the superior performance of our\nmodels when compared to baseline methods."}, "http://arxiv.org/abs/2312.11582": {"title": "Shapley-PC: Constraint-based Causal Structure Learning with Shapley Values", "link": "http://arxiv.org/abs/2312.11582", "description": "Causal Structure Learning (CSL), amounting to extracting causal relations\namong the variables in a dataset, is widely perceived as an important step\ntowards robust and transparent models. Constraint-based CSL leverages\nconditional independence tests to perform causal discovery. We propose\nShapley-PC, a novel method to improve constraint-based CSL algorithms by using\nShapley values over the possible conditioning sets to decide which variables\nare responsible for the observed conditional (in)dependences. We prove\nsoundness and asymptotic consistency and demonstrate that it can outperform\nstate-of-the-art constraint-based, search-based and functional causal\nmodel-based methods, according to standard metrics in CSL."}, "http://arxiv.org/abs/2312.11926": {"title": "Big Learning Expectation Maximization", "link": "http://arxiv.org/abs/2312.11926", "description": "Mixture models serve as one fundamental tool with versatile applications.\nHowever, their training techniques, like the popular Expectation Maximization\n(EM) algorithm, are notoriously sensitive to parameter initialization and often\nsuffer from bad local optima that could be arbitrarily worse than the optimal.\nTo address the long-lasting bad-local-optima challenge, we draw inspiration\nfrom the recent ground-breaking foundation models and propose to leverage their\nunderlying big learning principle to upgrade the EM. Specifically, we present\nthe Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs\njoint, marginal, and orthogonally transformed marginal matchings between data\nand model distributions. Through simulated experiments, we empirically show\nthat the BigLearn-EM is capable of delivering the optimal with high\nprobability; comparisons on benchmark clustering datasets further demonstrate\nits effectiveness and advantages over existing techniques. The code is\navailable at\nhttps://github.com/YulaiCong/Big-Learning-Expectation-Maximization."}, "http://arxiv.org/abs/2312.11927": {"title": "Empowering Dual-Level Graph Self-Supervised Pretraining with Motif Discovery", "link": "http://arxiv.org/abs/2312.11927", "description": "While self-supervised graph pretraining techniques have shown promising\nresults in various domains, their application still experiences challenges of\nlimited topology learning, human knowledge dependency, and incompetent\nmulti-level interactions. To address these issues, we propose a novel solution,\nDual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which\nintroduces a unique dual-level pretraining structure that orchestrates\nnode-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM\nautonomously uncovers significant graph motifs through an edge pooling module,\naligning learned motif similarities with graph kernel-based similarities. A\ncross-matching task enables sophisticated node-motif interactions and novel\nrepresentation learning. Extensive experiments on 15 datasets validate DGPM's\neffectiveness and generalizability, outperforming state-of-the-art methods in\nunsupervised representation learning and transfer learning settings. 
The\nautonomously discovered motifs demonstrate the potential of DGPM to enhance\nrobustness and interpretability."}, "http://arxiv.org/abs/2312.11934": {"title": "Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants", "link": "http://arxiv.org/abs/2312.11934", "description": "Causal discovery with latent variables is a crucial but challenging task.\nDespite the emergence of numerous methods aimed at addressing this challenge,\nexisting approaches cannot fully identify the structure in which two observed\nvariables are influenced by one latent variable and there might be a directed\nedge between them. Interestingly, we notice that this structure can be identified through\nthe utilization of higher-order cumulants. By leveraging the higher-order\ncumulants of non-Gaussian data, we provide an analytical solution for\nestimating the causal coefficients or their ratios. With the estimated (ratios\nof) causal coefficients, we propose a novel approach to identify the existence\nof a causal edge between two observed variables subject to latent variable\ninfluence. In the case where such a causal edge exists, we introduce an asymmetry\ncriterion to determine the causal direction. The experimental results\ndemonstrate the effectiveness of our proposed method."}, "http://arxiv.org/abs/2312.11991": {"title": "Outcomes truncated by death in RCTs: a simulation study on the survivor average causal effect", "link": "http://arxiv.org/abs/2312.11991", "description": "Continuous outcome measurements truncated by death present a challenge for\nthe estimation of unbiased treatment effects in randomized controlled trials\n(RCTs). One way to deal with such situations is to estimate the survivor\naverage causal effect (SACE), but this requires making non-testable\nassumptions. Motivated by an ongoing RCT in very preterm infants with\nintraventricular hemorrhage, we performed a simulation study to compare a SACE\nestimator with complete case analysis (CCA, benchmark for a biased analysis)\nand an analysis after multiple imputation of missing outcomes. We set up 9\nscenarios combining positive, negative and no treatment effect on the outcome\n(cognitive development) and on survival at 2 years of age. Treatment effect\nestimates from all methods were compared in terms of bias, mean squared error\nand coverage with regard to two estimands: the treatment effect on the outcome\nused in the simulation and the SACE, which was derived by simulation of both\npotential outcomes per patient. Despite targeting different estimands\n(principal stratum estimand, hypothetical estimand), the SACE estimator and\nmultiple imputation gave similar estimates of the treatment effect and\nefficiently reduced the bias compared to CCA. Also, both methods were\nrelatively robust to omission of one covariate in the analysis, and thus\nviolation of relevant assumptions. Although the SACE is not without\ncontroversy, we find it useful if mortality is inherent to the study\npopulation. Some degree of violation of the required assumptions is almost\ncertain, but may be acceptable in practice."}, "http://arxiv.org/abs/2312.12008": {"title": "How to develop, externally validate, and update multinomial prediction models", "link": "http://arxiv.org/abs/2312.12008", "description": "Multinomial prediction models (MPMs) have a range of potential applications\nacross healthcare where the primary outcome of interest has multiple nominal or\nordinal categories. 
However, the application of MPMs is scarce, which may be\ndue to the added methodological complexities that they bring. This article\nprovides a guide of how to develop, externally validate, and update MPMs. Using\na previously developed and validated MPM for treatment outcomes in rheumatoid\narthritis as an example, we outline guidance and recommendations for producing\na clinical prediction model, using multinomial logistic regression. This\narticle is intended to supplement existing general guidance on prediction model\nresearch. This guide is split into three parts: 1) Outcome definition and\nvariable selection, 2) Model development, and 3) Model evaluation (including\nperformance assessment, internal and external validation, and model\nrecalibration). We outline how to evaluate and interpret the predictive\nperformance of MPMs. R code is provided. We recommend the application of MPMs\nin clinical settings where the prediction of a nominal polytomous outcome is of\ninterest. Future methodological research could focus on MPM-specific\nconsiderations for variable selection and sample size criteria for external\nvalidation."}, "http://arxiv.org/abs/2312.12106": {"title": "Conditional autoregressive models fused with random forests to improve small-area spatial prediction", "link": "http://arxiv.org/abs/2312.12106", "description": "In areal unit data with missing or suppressed data, it desirable to create\nmodels that are able to predict observations that are not available.\nTraditional statistical methods achieve this through Bayesian hierarchical\nmodels that can capture the unexplained residual spatial autocorrelation\nthrough conditional autoregressive (CAR) priors, such that they can make\npredictions at geographically related spatial locations. In contrast, typical\nmachine learning approaches such as random forests ignore this residual\nautocorrelation, and instead base predictions on complex non-linear\nfeature-target relationships. In this paper, we propose CAR-Forest, a novel\nspatial prediction algorithm that combines the best features of both approaches\nby fusing them together. By iteratively refitting a random forest combined with\na Bayesian CAR model in one algorithm, CAR-Forest can incorporate flexible\nfeature-target relationships while still accounting for the residual spatial\nautocorrelation. Our results, based on a Scottish housing price data set, show\nthat CAR-Forest outperforms Bayesian CAR models, random forests, and the\nstate-of-the-art hybrid approach, geographically weighted random forest,\nproviding a state-of-the-art framework for small-area spatial prediction."}, "http://arxiv.org/abs/2312.12149": {"title": "Bayesian and minimax estimators of loss", "link": "http://arxiv.org/abs/2312.12149", "description": "We study the problem of loss estimation that involves for an observable $X\n\\sim f_{\\theta}$ the choice of a first-stage estimator $\\hat{\\gamma}$ of\n$\\gamma(\\theta)$, incurred loss $L=L(\\theta, \\hat{\\gamma})$, and the choice of\na second-stage estimator $\\hat{L}$ of $L$. We consider both: (i) a sequential\nversion where the first-stage estimate and loss are fixed and optimization is\nperformed at the second-stage level, and (ii) a simultaneous version with a\nRukhin-type loss function designed for the evaluation of $(\\hat{\\gamma},\n\\hat{L})$ as an estimator of $(\\gamma, L)$.\n\nWe explore various Bayesian solutions and provide minimax estimators for both\nsituations (i) and (ii). 
The analysis is carried out for several probability\nmodels, including multivariate normal models $N_d(\\theta, \\sigma^2 I_d)$ with\nboth known and unknown $\\sigma^2$, Gamma, univariate and multivariate Poisson,\nand negative binomial models, and relates to different choices of the\nfirst-stage and second-stage losses. The minimax findings are achieved by\nidentifying least favourable sequences of priors and depend critically on\nparticular Bayesian solution properties, namely situations where the\nsecond-stage estimator $\\hat{L}(x)$ is constant as a function of $x$."}, "http://arxiv.org/abs/2312.12206": {"title": "Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model", "link": "http://arxiv.org/abs/2312.12206", "description": "Missing data are an unavoidable complication frequently encountered in many\ncausal discovery tasks. When a missing process depends on the missing values\nthemselves (known as self-masking missingness), the recovery of the joint\ndistribution becomes unattainable, and detecting the presence of such\nself-masking missingness remains a perplexing challenge. Consequently, due to\nthe inability to reconstruct the original distribution and to discern the\nunderlying missingness mechanism, simply applying existing causal discovery\nmethods would lead to wrong conclusions. In this work, we find that recent\nadvances in additive noise models have the potential for learning causal structure\nin the presence of self-masking missingness. With this observation, we\naim to investigate the identification problem of learning causal structure from\nmissing data under an additive noise model with different missingness\nmechanisms, where the `no self-masking missingness' assumption can be\neliminated appropriately. Specifically, we first elegantly extend the scope of\nidentifiability of the causal skeleton to the case with weak self-masking\nmissingness (i.e., no variable other than itself can be the cause of a\nself-masking indicator). We further provide the sufficient and necessary\nidentification conditions of the causal direction under the additive noise model\nand show that the causal structure can be identified up to an IN-equivalent\npattern. We finally propose a practical algorithm for learning the causal skeleton\nand causal direction based on the above theoretical results.\nExtensive experiments on synthetic and real data demonstrate the efficiency and\neffectiveness of the proposed algorithms."}, "http://arxiv.org/abs/2312.12287": {"title": "A Criterion for Multivariate Regionalization of Spatial Data", "link": "http://arxiv.org/abs/2312.12287", "description": "The modifiable areal unit problem in geography or the change-of-support (COS)\nproblem in statistics demonstrates that the interpretation of spatial (or\nspatio-temporal) data analysis is affected by the choice of resolutions or\ngeographical units used in the study. The ecological fallacy is one famous\nexample of this phenomenon. Here we investigate the ecological fallacy\nassociated with the COS problem for multivariate spatial data with the goal of\nproviding a data-driven discretization criterion for the domain of interest\nthat minimizes aggregation errors. The discretization is based on a novel\nmultiscale metric, called the Multivariate Criterion for Aggregation Error\n(MVCAGE). Such multi-scale representations of an underlying multivariate\nprocess are often formulated in terms of basis expansions. 
We show that a\nparticularly useful basis expansion in this context is the multivariate\nKarhunen-Loève expansion (MKLE). We use the MKLE to build the MVCAGE loss\nfunction and use it within the framework of spatial clustering algorithms to\nperform optimal spatial aggregation. We demonstrate the effectiveness of our\napproach through simulation and through regionalization of county-level income\nand hospital quality data over the United States and prediction of ocean color\nin the coastal Gulf of Alaska."}, "http://arxiv.org/abs/2312.12357": {"title": "Modeling non-linear Effects with Neural Networks in Relational Event Models", "link": "http://arxiv.org/abs/2312.12357", "description": "Dynamic networks offer insight into how relational systems evolve. However,\nmodeling these networks efficiently remains a challenge, primarily due to\ncomputational constraints, especially as the number of observed events grows.\nThis paper addresses this issue by introducing the Deep Relational Event\nAdditive Model (DREAM) as a solution to the computational challenges presented\nby modeling non-linear effects in Relational Event Models (REMs). DREAM relies\non Neural Additive Models to model non-linear effects, allowing each effect to\nbe captured by an independent neural network. By strategically trading\ncomputational complexity for improved memory management and leveraging the\ncomputational capabilities of Graphics Processing Units (GPUs), DREAM efficiently\ncaptures complex non-linear relationships within data. This approach\ndemonstrates the capability of DREAM to model dynamic networks and scale\nto larger networks. Comparisons with traditional REM approaches showcase DREAM's\nsuperior computational efficiency. The model's potential is further demonstrated\nby an examination of the patent citation network, which contains nearly 8\nmillion nodes and 100 million events."}, "http://arxiv.org/abs/2312.12361": {"title": "Improved multifidelity Monte Carlo estimators based on normalizing flows and dimensionality reduction techniques", "link": "http://arxiv.org/abs/2312.12361", "description": "We study the problem of multifidelity uncertainty propagation for\ncomputationally expensive models. In particular, we consider the general\nsetting where the high-fidelity and low-fidelity models have a dissimilar\nparameterization both in terms of the number of random inputs and their probability\ndistributions, which can be either known in closed form or provided through\nsamples. We derive novel multifidelity Monte Carlo estimators which rely on a\nshared subspace between the high-fidelity and low-fidelity models where the\nparameters follow the same probability distribution, i.e., a standard Gaussian.\nWe build the shared space employing normalizing flows to map different\nprobability distributions into a common one, together with linear and nonlinear\ndimensionality reduction techniques, active subspaces and autoencoders,\nrespectively, which capture the subspaces where the models vary the most. We\nthen compose the existing low-fidelity model with these transformations and\nconstruct modified models with an increased correlation with the high-fidelity model,\nwhich therefore yield multifidelity Monte Carlo estimators with reduced\nvariance. 
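For orientation, a classical two-fidelity control-variate Monte Carlo estimator, into which modified low-fidelity models of the kind constructed above would be plugged, can be sketched as follows; the toy models, sample sizes, and regression-based weight are assumptions, not the paper's specific estimator:

# Sketch: two-fidelity control-variate Monte Carlo estimator. The cheap
# low-fidelity model corrects a small set of high-fidelity evaluations; a
# higher HF/LF correlation gives more variance reduction, which is what the
# shared-subspace construction aims to improve.
import numpy as np

rng = np.random.default_rng(0)
f_hi = lambda z: np.exp(0.3 * z) + 0.05 * z**2       # expensive model (toy)
f_lo = lambda z: 1.0 + 0.3 * z                       # cheap surrogate (toy)

n_hi, n_lo = 100, 10_000
z_hi = rng.standard_normal(n_hi)                     # shared standard-Gaussian inputs
z_lo = rng.standard_normal(n_lo)

y_hi, y_lo_paired = f_hi(z_hi), f_lo(z_hi)
alpha = np.cov(y_hi, y_lo_paired)[0, 1] / np.var(y_lo_paired, ddof=1)

est = y_hi.mean() + alpha * (f_lo(z_lo).mean() - y_lo_paired.mean())
print(est, y_hi.mean())     # control-variate estimate vs. plain high-fidelity mean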
A series of numerical experiments illustrate the properties and\nadvantages of our approaches."}, "http://arxiv.org/abs/2312.12396": {"title": "A change-point random partition model for large spatio-temporal datasets", "link": "http://arxiv.org/abs/2312.12396", "description": "Spatio-temporal areal data can be seen as a collection of time series which\nare spatially correlated, according to a specific neighboring structure.\nMotivated by a dataset on mobile phone usage in the Metropolitan area of Milan,\nItaly, we propose a semi-parametric hierarchical Bayesian model allowing for\ntime-varying as well as spatial model-based clustering. To accommodate for\nchanging patterns over work hours and weekdays/weekends, we incorporate a\ntemporal change-point component that allows the specification of different\nhierarchical structures across time points. The model features a random\npartition prior that incorporates the desired spatial features and encourages\nco-clustering based on areal proximity. We explore properties of the model by\nway of extensive simulation studies from which we collect valuable information.\nFinally, we discuss the application to the motivating data, where the main goal\nis to spatially cluster population patterns of mobile phone usage."}, "http://arxiv.org/abs/2009.04710": {"title": "Robust Clustering with Normal Mixture Models: A Pseudo $\\beta$-Likelihood Approach", "link": "http://arxiv.org/abs/2009.04710", "description": "As in other estimation scenarios, likelihood based estimation in the normal\nmixture set-up is highly non-robust against model misspecification and presence\nof outliers (apart from being an ill-posed optimization problem). A robust\nalternative to the ordinary likelihood approach for this estimation problem is\nproposed which performs simultaneous estimation and data clustering and leads\nto subsequent anomaly detection. To invoke robustness, the methodology based on\nthe minimization of the density power divergence (or alternatively, the\nmaximization of the $\\beta$-likelihood) is utilized under suitable constraints.\nAn iteratively reweighted least squares approach has been followed in order to\ncompute the proposed estimators for the component means (or equivalently\ncluster centers) and component dispersion matrices which leads to simultaneous\ndata clustering. Some exploratory techniques are also suggested for anomaly\ndetection, a problem of great importance in the domain of statistics and\nmachine learning. The proposed method is validated with simulation studies\nunder different set-ups; it performs competitively or better compared to the\npopular existing methods like K-medoids, TCLUST, trimmed K-means and MCLUST,\nespecially when the mixture components (i.e., the clusters) share regions with\nsignificant overlap or outlying clusters exist with small but non-negligible\nweights (particularly in higher dimensions). Two real datasets are also used to\nillustrate the performance of the newly proposed method in comparison with\nothers along with an application in image processing. The proposed method\ndetects the clusters with lower misclassification rates and successfully points\nout the outlying (anomalous) observations from these datasets."}, "http://arxiv.org/abs/2202.02416": {"title": "Generalized Causal Tree for Uplift Modeling", "link": "http://arxiv.org/abs/2202.02416", "description": "Uplift modeling is crucial in various applications ranging from marketing and\npolicy-making to personalized recommendations. 
The main objective is to learn\noptimal treatment allocations for a heterogeneous population. A primary line of\nexisting work modifies the loss function of the decision tree algorithm to\nidentify cohorts with heterogeneous treatment effects. Another line of work\nestimates the individual treatment effects separately for the treatment group\nand the control group using off-the-shelf supervised learning algorithms. The\nformer approach, which directly models the heterogeneous treatment effect, is\nknown to outperform the latter in practice. However, the existing tree-based\nmethods are mostly limited to a single treatment and a single control use case,\nexcept for a handful of extensions to multiple discrete treatments. In this\npaper, we propose a generalization of tree-based approaches to tackle multiple\ndiscrete and continuous-valued treatments. We focus on a generalization of the\nwell-known causal tree algorithm due to its desirable statistical properties,\nbut our generalization technique can be applied to other tree-based approaches\nas well. The efficacy of our proposed method is demonstrated using experiments\nand real data examples."}, "http://arxiv.org/abs/2204.10426": {"title": "Marginal Structural Illness-Death Models for Semi-Competing Risks Data", "link": "http://arxiv.org/abs/2204.10426", "description": "The three-state illness-death model has been established as a general\napproach for regression analysis of semi-competing risks data. For\nobservational data, marginal structural models (MSM) are a useful tool\nunder the potential outcomes framework to define and estimate parameters with\ncausal interpretations. In this paper we introduce a class of marginal\nstructural illness-death models for the analysis of observational semi-competing\nrisks data. We consider two specific such models, the Markov illness-death\nMSM and the frailty-based Markov illness-death MSM. For interpretation\npurposes, risk contrasts under the MSMs are defined. Inference under the\nillness-death MSM can be carried out using estimating equations with inverse\nprobability weighting, while inference under the frailty-based illness-death\nMSM requires a weighted EM algorithm. We study the inference procedures under\nboth MSMs using extensive simulations, and apply them to the analysis of mid-life\nalcohol exposure on late-life cognitive impairment as well as mortality\nusing the Honolulu Asia Aging Study data set. The R code developed in this\nwork has been implemented in the R package semicmprskcoxmsm, which is publicly\navailable on CRAN."}, "http://arxiv.org/abs/2301.06098": {"title": "A novel method and comparison of methods for constructing Markov bridges", "link": "http://arxiv.org/abs/2301.06098", "description": "In this study, we address the central issue of statistical inference for\nMarkov jump processes using discrete-time observations. The primary problem at\nhand is to accurately estimate the infinitesimal generator of a Markov jump\nprocess, a critical task in various applications. To tackle this problem, we\nbegin by reviewing established methods for generating sample paths from a\nMarkov jump process conditioned on endpoints, known as Markov bridges.\nAdditionally, we introduce a novel algorithm grounded in the concept of\ntime-reversal, which serves as our main contribution. 
Our proposed method is\nthen employed to estimate the infinitesimal generator of a Markov jump process.\nTo achieve this, we use a combination of Markov Chain Monte Carlo techniques\nand the Monte Carlo Expectation-Maximization algorithm. The results obtained\nfrom our approach demonstrate its effectiveness in providing accurate parameter\nestimates. To assess the efficacy of our proposed method, we conduct a\ncomprehensive comparative analysis with existing techniques (Bisection,\nUniformization, Direct, Rejection, and Modified Rejection), taking into\nconsideration both speed and accuracy. Notably, our method stands out as the\nfastest among the alternatives while maintaining high levels of precision."}, "http://arxiv.org/abs/2305.06262": {"title": "Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease", "link": "http://arxiv.org/abs/2305.06262", "description": "We propose a Bayesian model selection approach that allows medical\npractitioners to select among predictor variables while taking their respective\ncosts into account. Medical procedures almost always incur costs in time and/or\nmoney. These costs might exceed their usefulness for modeling the outcome of\ninterest. We develop Bayesian model selection that uses flexible model priors\nto penalize costly predictors a priori and select a subset of predictors useful\nrelative to their costs. Our approach (i) gives the practitioner control over\nthe magnitude of cost penalization, (ii) enables the prior to scale well with\nsample size, and (iii) enables the creation of our proposed inclusion path\nvisualization, which can be used to make decisions about individual candidate\npredictors using both probabilistic and visual tools. We demonstrate the\neffectiveness of our inclusion path approach and the importance of being able\nto adjust the magnitude of the prior's cost penalization through a dataset\npertaining to heart disease diagnosis in patients at the Cleveland Clinic\nFoundation, where several candidate predictors with various costs were recorded\nfor patients, and through simulated data."}, "http://arxiv.org/abs/2305.19139": {"title": "Estimating excess mortality in high-income countries during the COVID-19 pandemic", "link": "http://arxiv.org/abs/2305.19139", "description": "Quantifying the number of deaths caused by the COVID-19 crisis has been an\nongoing challenge for scientists, and no golden standard to do so has yet been\nestablished. We propose a principled approach to calculate age-adjusted yearly\nexcess mortality, and apply it to obtain estimates and uncertainty bounds for\n30 countries with publicly available data. The results uncover remarkable\nvariation in pandemic outcomes across different countries. 
We further compare\nour findings with existing estimates published in other major scientific\noutlets, highlighting the importance of proper age adjustment to obtain\nunbiased figures."}, "http://arxiv.org/abs/2312.12477": {"title": "Survey on Trustworthy Graph Neural Networks: From A Causal Perspective", "link": "http://arxiv.org/abs/2312.12477", "description": "Graph Neural Networks (GNNs) have emerged as powerful representation learning\ntools for capturing complex dependencies within diverse graph-structured data.\nDespite their success in a wide range of graph mining tasks, GNNs have raised\nserious concerns regarding their trustworthiness, including susceptibility to\ndistribution shift, biases towards certain populations, and lack of\nexplainability. Recently, integrating causal learning techniques into GNNs has\nsparked numerous ground-breaking studies since most of the trustworthiness\nissues can be alleviated by capturing the underlying data causality rather than\nsuperficial correlations. In this survey, we provide a comprehensive review of\nrecent research efforts on causality-inspired GNNs. Specifically, we first\npresent the key trustworthy risks of existing GNN models through the lens of\ncausality. Moreover, we introduce a taxonomy of Causality-Inspired GNNs\n(CIGNNs) based on the type of causal learning capability they are equipped\nwith, i.e., causal reasoning and causal representation learning. Besides, we\nsystematically discuss typical methods within each category and demonstrate how\nthey mitigate trustworthiness risks. Finally, we summarize useful resources and\ndiscuss several future directions, hoping to shed light on new research\nopportunities in this emerging field. The representative papers, along with\nopen-source data and codes, are available in\nhttps://github.com/usail-hkust/Causality-Inspired-GNNs."}, "http://arxiv.org/abs/2312.12638": {"title": "Using Exact Tests from Algebraic Statistics in Sparse Multi-way Analyses: An Application to Analyzing Differential Item Functioning", "link": "http://arxiv.org/abs/2312.12638", "description": "Asymptotic goodness-of-fit methods in contingency table analysis can struggle\nwith sparse data, especially in multi-way tables where it can be infeasible to\nmeet sample size requirements for a robust application of distributional\nassumptions. However, algebraic statistics provides exact alternatives to these\nclassical asymptotic methods that remain viable even with sparse data. We apply\nthese methods to a context in psychometrics and education research that leads\nnaturally to multi-way contingency tables: the analysis of differential item\nfunctioning (DIF). We explain concretely how to apply the exact methods of\nalgebraic statistics to DIF analysis using the R package algstat, and we\ncompare their performance to that of classical asymptotic methods."}, "http://arxiv.org/abs/2312.12641": {"title": "Matching via Distance Profiles", "link": "http://arxiv.org/abs/2312.12641", "description": "In this paper, we introduce and study matching methods based on distance\nprofiles. For the matching of point clouds, the proposed method is easily\nimplementable by solving a linear program, circumventing the computational\nobstacles of quadratic matching. Also, we propose and analyze a flexible way to\nexecute location-to-location matching using distance profiles. Moreover, we\nprovide a statistical estimation error analysis in the context of\nlocation-to-location matching using empirical process theory. 
Furthermore, we\napply our method to a certain model and show its noise stability by\ncharacterizing conditions on the noise level for the matching to be successful.\nLastly, we demonstrate the performance of the proposed method and compare it\nwith some existing methods using synthetic and real data."}, "http://arxiv.org/abs/2312.12645": {"title": "Revisiting the effect of greediness on the efficacy of exchange algorithms for generating exact optimal experimental designs", "link": "http://arxiv.org/abs/2312.12645", "description": "Coordinate exchange (CEXCH) is a popular algorithm for generating exact\noptimal experimental designs. The authors of CEXCH advocated for a highly\ngreedy implementation - one that exchanges and optimizes single-element\ncoordinates of the design matrix. We revisit the effect of greediness on CEXCH's\nefficacy for generating highly efficient designs. We implement the\nsingle-element CEXCH (most greedy), a design-row (medium greedy) optimization\nexchange, and particle swarm optimization (PSO; least greedy) on 21 exact\nresponse surface design scenarios, under the $D$- and $I$-criteria, which have\nwell-known optimal designs that have been reproduced by several researchers. We\nfound essentially no difference in performance between the most greedy CEXCH and the\nmedium greedy CEXCH. PSO did exhibit better efficacy than CEXCH for generating $D$-optimal\ndesigns, and for most $I$-optimal designs, but not to a strong\ndegree under our parametrization. This work suggests that further investigation\nof the greediness dimension and its effect on CEXCH efficacy on a wider suite\nof models and criteria is warranted."}, "http://arxiv.org/abs/2312.12678": {"title": "Causal Discovery for fMRI data: Challenges, Solutions, and a Case Study", "link": "http://arxiv.org/abs/2312.12678", "description": "Designing studies that apply causal discovery requires navigating many\nresearcher degrees of freedom. This complexity is exacerbated when the study\ninvolves fMRI data. In this paper we (i) describe nine challenges that occur\nwhen applying causal discovery to fMRI data, (ii) discuss the space of\ndecisions that need to be made, (iii) review how a recent case study made those\ndecisions, and (iv) identify existing gaps that could potentially be solved by\nthe development of new methods. Overall, causal discovery is a promising\napproach for analyzing fMRI data, and multiple successful applications have\nindicated that it is superior to traditional fMRI functional connectivity\nmethods, but current causal discovery methods for fMRI leave room for\nimprovement."}, "http://arxiv.org/abs/2312.12708": {"title": "Gradient flows for empirical Bayes in high-dimensional linear models", "link": "http://arxiv.org/abs/2312.12708", "description": "Empirical Bayes provides a powerful approach to learning and adapting to\nlatent structure in data. Theory and algorithms for empirical Bayes have a rich\nliterature for sequence models, but are less understood in settings where\nlatent variables and data interact through more complex designs. In this work,\nwe study empirical Bayes estimation of an i.i.d. prior in Bayesian linear\nmodels, via the nonparametric maximum likelihood estimator (NPMLE). We\nintroduce and study a system of gradient flow equations for optimizing the\nmarginal log-likelihood, jointly over the prior and posterior measures in its\nGibbs variational representation using a smoothed reparametrization of the\nregression coefficients. 
A diffusion-based implementation yields a Langevin\ndynamics MCEM algorithm, where the prior law evolves continuously over time to\noptimize a sequence-model log-likelihood defined by the coordinates of the\ncurrent Langevin iterate. We show consistency of the NPMLE as $n, p \\rightarrow\n\\infty$ under mild conditions, including settings of random sub-Gaussian\ndesigns when $n \\asymp p$. In high noise, we prove a uniform log-Sobolev\ninequality for the mixing of Langevin dynamics, for possibly misspecified\npriors and non-log-concave posteriors. We then establish polynomial-time\nconvergence of the joint gradient flow to a near-NPMLE if the marginal negative\nlog-likelihood is convex in a sub-level set of the initialization."}, "http://arxiv.org/abs/2312.12710": {"title": "Semiparametric Copula Estimation for Spatially Correlated Multivariate Mixed Outcomes: Analyzing Visual Sightings of Fin Whales from Line Transect Survey", "link": "http://arxiv.org/abs/2312.12710", "description": "Multivariate data having both continuous and discrete variables is known as\nmixed outcomes and has widely appeared in a variety of fields such as ecology,\nepidemiology, and climatology. In order to understand the probability structure\nof multivariate data, the estimation of the dependence structure among mixed\noutcomes is very important. However, when multivariate data are equipped with\nlocation information, the spatial correlation should be adequately taken into\naccount; otherwise, the estimation of the dependence structure would be\nseverely biased. To solve this issue, we propose a semiparametric Bayesian\ninference for the dependence structure among mixed outcomes while eliminating\nspatial correlation. To this end, we consider a hierarchical spatial model\nbased on the rank likelihood and a latent multivariate Gaussian process. We\ndevelop an efficient algorithm for computing the posterior using Markov\nchain Monte Carlo. We also provide a scalable implementation of the model using\nthe nearest-neighbor Gaussian process for large spatial datasets. We conduct\na simulation study to validate our proposed procedure and demonstrate that the\nprocedure successfully accounts for spatial correlation and correctly infers\nthe dependence structure among outcomes. Furthermore, the procedure is applied\nto a real example collected during an international synoptic krill survey in\nthe Scotia Sea of the Antarctic Peninsula, which includes sighting data of fin\nwhales (Balaenoptera physalus), and the relevant oceanographic data."}, "http://arxiv.org/abs/2312.12786": {"title": "Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets", "link": "http://arxiv.org/abs/2312.12786", "description": "The development of comprehensive prediction models is often of great interest in\nmany disciplines of science, but datasets with information on all desired\nfeatures typically have small sample sizes. In this article, we describe a\ntransfer learning approach for building high-dimensional generalized linear\nmodels using data from a main study that has detailed information on all\npredictors, and from one or more external studies that have ascertained a more\nlimited set of predictors. We propose using the external dataset(s) to build\nreduced model(s) and then transfer the information on underlying parameters for\nthe analysis of the main study through a set of calibration equations, while\naccounting for the study-specific effects of certain design variables. 
We then\nuse the generalized method of moments (GMM) with penalization for parameter\nestimation and develop highly scalable algorithms for fitting models taking\nadvantage of the popular glmnet package. We further show that the use of the\nadaptive-Lasso penalty leads to the oracle property of underlying parameter\nestimates and thus leads to convenient post-selection inference procedures. We\nconduct extensive simulation studies to investigate both predictive performance\nand post-selection inference properties of the proposed method. Finally, we\nillustrate a timely application of the proposed method for the development of\nrisk prediction models for five common diseases using the UK Biobank study,\ncombining baseline information from all study participants (500K) and recently\nreleased high-throughput proteomic data (# protein = 1500) on a subset (50K) of\nthe participants."}, "http://arxiv.org/abs/2312.12823": {"title": "Detecting Multiple Change-Points in Distributional Sequences Derived from Structural Health Monitoring Data: An Application to Bridge Damage Detection", "link": "http://arxiv.org/abs/2312.12823", "description": "Detecting damage in important structures using monitored data is a\nfundamental task of structural health monitoring, which is very important for\nthe structures' safety and life-cycle management. Based on the statistical\npattern recognition paradigm, damage detection can be achieved by detecting\nchanges in the distribution of properly extracted damage-sensitive features (DSFs).\nThis can be naturally formulated as a distributional change-point detection\nproblem. A good change-point detector for damage detection should be scalable\nto large DSF datasets, applicable to different types of changes and able to\ncontrol the false-positive indication rate. To address these challenges, we\npropose a new distributional change-point detection method for damage\ndetection. We embed the elements of a DSF distributional sequence into the\nWasserstein space and develop a MOSUM-type multiple change-point detector based\non Fr\\'echet statistics. Theoretical properties are also established. Extensive\nsimulation studies demonstrate the superiority of our proposal against other\ncompetitors in addressing the aforementioned practical requirements. We apply\nour method to the cable-tension measurements monitored from a long-span\ncable-stayed bridge for cable damage detection. We conduct a comprehensive\nchange-point analysis for the extracted DSF data, and find some interesting\npatterns from the detected changes, which provides important insights into the\ndamage of the cable system."}, "http://arxiv.org/abs/2312.12844": {"title": "Causal Discovery under Identifiable Heteroscedastic Noise Model", "link": "http://arxiv.org/abs/2312.12844", "description": "Capturing the underlying structural causal relations represented by Directed\nAcyclic Graphs (DAGs) has been a fundamental task in various AI disciplines.\nCausal DAG learning via the continuous optimization framework has recently\nachieved promising performance in terms of both accuracy and efficiency.\nHowever, most methods make strong assumptions of homoscedastic noise, i.e.,\nexogenous noises have equal variances across variables, observations, or even\nboth. The noises in real data usually violate both assumptions due to the\nbiases introduced by different data collection processes. 
To address the issue\nof heteroscedastic noise, we introduce relaxed and implementable sufficient\nconditions, proving the identifiability of a general class of SEM subject to\nthese conditions. Based on the identifiable general SEM, we propose a novel\nformulation for DAG learning that accounts for the variation in noise variance\nacross variables and observations. We then propose an effective two-phase\niterative DAG learning algorithm to address the increasing optimization\ndifficulties and to learn a causal DAG from data with heteroscedastic variable\nnoise under varying variance. We show significant empirical gains of the\nproposed approaches over state-of-the-art methods on both synthetic data and\nreal data."}, "http://arxiv.org/abs/2312.12952": {"title": "High-dimensional sparse classification using exponential weighting with empirical hinge loss", "link": "http://arxiv.org/abs/2312.12952", "description": "In this study, we address the problem of high-dimensional binary\nclassification. Our proposed solution involves employing an aggregation\ntechnique founded on exponential weights and empirical hinge loss. Through the\nemployment of a suitable sparsity-inducing prior distribution, we demonstrate\nthat our method yields favorable theoretical results on predictions and\nmisclassification error. The efficiency of our procedure is achieved through\nthe utilization of Langevin Monte Carlo, a gradient-based sampling approach. To\nillustrate the effectiveness of our approach, we conduct comparisons with the\nlogistic Lasso on both simulated and a real dataset. Our method frequently\ndemonstrates superior performance compared to the logistic Lasso."}, "http://arxiv.org/abs/2312.12966": {"title": "Rank-based Bayesian clustering via covariate-informed Mallows mixtures", "link": "http://arxiv.org/abs/2312.12966", "description": "Data in the form of rankings, ratings, pair comparisons or clicks are\nfrequently collected in diverse fields, from marketing to politics, to\nunderstand assessors' individual preferences. Combining such preference data\nwith features associated with the assessors can lead to a better understanding\nof the assessors' behaviors and choices. The Mallows model is a popular model\nfor rankings, as it flexibly adapts to different types of preference data, and\nthe previously proposed Bayesian Mallows Model (BMM) offers a computationally\nefficient framework for Bayesian inference, also allowing capturing the users'\nheterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite\nmixture model that performs clustering while also accounting for\nassessor-related features, called the Bayesian Mallows model with covariates\n(BMMx). BMMx is based on a similarity function that a priori favours the\naggregation of assessors into a cluster when their covariates are similar,\nusing the Product Partition models (PPMx) proposal. We present two approaches\nto measure the covariate similarity: one based on a novel deterministic\nfunction measuring the covariates' goodness-of-fit to the cluster, and one\nbased on an augmented model as in PPMx. 
We investigate the performance of BMMx\nin both simulation experiments and real-data examples, showing the method's\npotential for advancing the understanding of assessor preferences and behaviors\nin different applications."}, "http://arxiv.org/abs/2312.13018": {"title": "Sample Design and Cross-sectional Weights of the Brazilian PCSVDF-Mulher Study (Waves 2016 and 2017): Integrating a Refreshment Sample with an Ongoing Longitudinal Sample to Calculate IPV Prevalence", "link": "http://arxiv.org/abs/2312.13018", "description": "Addressing unit non-response between waves of longitudinal studies\n(attrition) by means of sampling design in weighting has moved from an approach\nfocused on participant retention or modern missing data analysis procedures to\nan approach based on the availability of supplemental samples, either\ncollecting refreshment or replacement samples on an ongoing larger sample. We\nimplement a strategy for calculating individual cross-sectional weights and\napply them to the 2016 and 2017 waves of the PCSVDF-Mulher (Pesquisa de\nCondi\\c{c}\\~oes Socioecon\\^omicas e Viol\\^encia Dom\\'estica e Familiar contra a\nMulher - Survey of Socioeconomic Conditions and Domestic and Family Violence\nagainst Women), a large ($\\approx 10,000$), household interdisciplinary\nlongitudinal data set in Brazil to study intimate partner violence (IPV), its\ncauses and consequences. We developed a set of weights that combines a\nrefreshment sample collected in 2017 with the ongoing longitudinal sample\nstarted in 2016. Armed with this set of individual weights, we calculated IPV\nprevalence for nine capital cities in Brazil for the years 2016 and 2017. As\nfar as we know, this is the first attempt to calculate cross-sectional weights\nwith the aid of supplemental samples applied to a population representative\nsample study focused on IPV. Our analysis produced a set of weights whose\ncomparison to unweighted designs shows clearly neglected trends in the\nliterature on IPV measurement. Indeed, one of our key findings pointed out to\nthe fact that, even in well-designed longitudinal household surveys, the\nindiscriminate use of unweighted designs to calculate IPV prevalence might\nartificially and inadvertently inflate their values, which in turn might bring\ndistortions and considerable political, social, budgetary, and scientific\nimplications."}, "http://arxiv.org/abs/2312.13044": {"title": "Particle Gibbs for Likelihood-Free Inference of State Space Models with Application to Stochastic Volatility", "link": "http://arxiv.org/abs/2312.13044", "description": "State space models (SSMs) are widely used to describe dynamic systems.\nHowever, when the likelihood of the observations is intractable, parameter\ninference for SSMs cannot be easily carried out using standard Markov chain\nMonte Carlo or sequential Monte Carlo methods. In this paper, we propose a\nparticle Gibbs sampler as a general strategy to handle SSMs with intractable\nlikelihoods in the approximate Bayesian computation (ABC) setting. The proposed\nsampler incorporates a conditional auxiliary particle filter, which can help\nmitigate the weight degeneracy often encountered in ABC. To illustrate the\nmethodology, we focus on a classic stochastic volatility model (SVM) used in\nfinance and econometrics for analyzing and interpreting volatility. Simulation\nstudies demonstrate the accuracy of our sampler for SVM parameter inference,\ncompared to existing particle Gibbs samplers based on the conditional bootstrap\nfilter. 
As a real data application, we apply the proposed sampler for fitting\nan SVM to S&P 500 Index time-series data during the 2008 financial crisis."}, "http://arxiv.org/abs/2312.13097": {"title": "Power calculation for cross-sectional stepped wedge cluster randomized trials with a time-to-event endpoint", "link": "http://arxiv.org/abs/2312.13097", "description": "A popular design choice in public health and implementation science research,\nstepped wedge cluster randomized trials (SW-CRTs) are a form of randomized\ntrial whereby clusters are progressively transitioned from control to\nintervention, and the timing of transition is randomized for each cluster. An\nimportant task at the design stage is to ensure that the planned trial has\nsufficient power to observe a clinically meaningful effect size. While methods\nfor determining study power have been well-developed for SW-CRTs with\ncontinuous and binary outcomes, limited methods for power calculation are\navailable for SW-CRTs with censored time-to-event outcomes. In this article, we\npropose a stratified marginal Cox model to account for secular trend in\ncross-sectional SW-CRTs, and derive an explicit expression of the robust\nsandwich variance to facilitate power calculations without the need for\ncomputationally intensive simulations. Power formulas based on both the Wald\nand robust score tests are developed and compared via simulation, generally\ndemonstrating superiority of robust score procedures in different finite-sample\nscenarios. Finally, we illustrate our methods using a SW-CRT testing the effect\nof a new electronic reminder system on time to catheter removal in hospital\nsettings. We also offer an R Shiny application to facilitate sample size and\npower calculations using our proposed methods."}, "http://arxiv.org/abs/2312.13148": {"title": "Partially factorized variational inference for high-dimensional mixed models", "link": "http://arxiv.org/abs/2312.13148", "description": "While generalized linear mixed models (GLMMs) are a fundamental tool in\napplied statistics, many specifications -- such as those involving categorical\nfactors with many levels or interaction terms -- can be computationally\nchallenging to estimate due to the need to compute or approximate\nhigh-dimensional integrals. Variational inference (VI) methods are a popular\nway to perform such computations, especially in the Bayesian context. However,\nnaive VI methods can provide unreliable uncertainty quantification. We show\nthat this is indeed the case in the GLMM context, proving that standard VI\n(i.e. mean-field) dramatically underestimates posterior uncertainty in\nhigh-dimensions. We then show how appropriately relaxing the mean-field\nassumption leads to VI methods whose uncertainty quantification does not\ndeteriorate in high-dimensions, and whose total computational cost scales\nlinearly with the number of parameters and observations. Our theoretical and\nnumerical results focus on GLMMs with Gaussian or binomial likelihoods, and\nrely on connections to random graph theory to obtain sharp high-dimensional\nasymptotic analysis. We also provide generic results, which are of independent\ninterest, relating the accuracy of variational inference to the convergence\nrate of the corresponding coordinate ascent variational inference (CAVI)\nalgorithm for Gaussian targets. Our proposed partially-factorized VI (PF-VI)\nmethodology for GLMMs is implemented in the R package vglmer, see\nhttps://github.com/mgoplerud/vglmer . 
Numerical results with simulated and real\ndata examples illustrate the favourable computation cost versus accuracy\ntrade-off of PF-VI."}, "http://arxiv.org/abs/2312.13168": {"title": "Learning Bayesian networks: a copula approach for mixed-type data", "link": "http://arxiv.org/abs/2312.13168", "description": "Estimating dependence relationships between variables is a crucial issue in\nmany applied domains, such as medicine, social sciences and psychology. When\nseveral variables are entertained, these can be organized into a network which\nencodes their set of conditional dependence relations. Typically however, the\nunderlying network structure is completely unknown or can be partially drawn\nonly; accordingly it should be learned from the available data, a process known\nas structure learning. In addition, data arising from social and psychological\nstudies are often of different types, as they can include categorical, discrete\nand continuous measurements. In this paper we develop a novel Bayesian\nmethodology for structure learning of directed networks which applies to mixed\ndata, i.e. possibly containing continuous, discrete, ordinal and binary\nvariables simultaneously. Whenever available, our method can easily incorporate\nknown dependence structures among variables represented by paths or edge\ndirections that can be postulated in advance based on the specific problem\nunder consideration. We evaluate the proposed method through extensive\nsimulation studies, with appreciable performances in comparison with current\nstate-of-the-art alternative methods. Finally, we apply our methodology to\nwell-being data from a social survey promoted by the United Nations, and mental\nhealth data collected from a cohort of medical students."}, "http://arxiv.org/abs/2202.02249": {"title": "Functional Mixtures-of-Experts", "link": "http://arxiv.org/abs/2202.02249", "description": "We consider the statistical analysis of heterogeneous data for prediction in\nsituations where the observations include functions, typically time series. We\nextend the modeling with Mixtures-of-Experts (ME), as a framework of choice in\nmodeling heterogeneity in data for prediction with vectorial observations, to\nthis functional data analysis context. We first present a new family of ME\nmodels, named functional ME (FME) in which the predictors are potentially noisy\nobservations, from entire functions. Furthermore, the data generating process\nof the predictor and the real response, is governed by a hidden discrete\nvariable representing an unknown partition. Second, by imposing sparsity on\nderivatives of the underlying functional parameters via Lasso-like\nregularizations, we provide sparse and interpretable functional representations\nof the FME models called iFME. We develop dedicated expectation--maximization\nalgorithms for Lasso-like (EM-Lasso) regularized maximum-likelihood parameter\nestimation strategies to fit the models. 
The proposed models and algorithms are\nstudied in simulated scenarios and in applications to two real data sets, and\nthe obtained results demonstrate their performance in accurately capturing\ncomplex nonlinear relationships and in clustering the heterogeneous regression\ndata."}, "http://arxiv.org/abs/2203.14511": {"title": "Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments", "link": "http://arxiv.org/abs/2203.14511", "description": "Researchers are increasingly turning to machine learning (ML) algorithms to\ninvestigate causal heterogeneity in randomized experiments. Despite their\npromise, ML algorithms may fail to accurately ascertain heterogeneous treatment\neffects under practical settings with many covariates and small sample size. In\naddition, the quantification of estimation uncertainty remains a challenge. We\ndevelop a general approach to statistical inference for heterogeneous treatment\neffects discovered by a generic ML algorithm. We apply the Neyman's repeated\nsampling framework to a common setting, in which researchers use an ML\nalgorithm to estimate the conditional average treatment effect and then divide\nthe sample into several groups based on the magnitude of the estimated effects.\nWe show how to estimate the average treatment effect within each of these\ngroups, and construct a valid confidence interval. In addition, we develop\nnonparametric tests of treatment effect homogeneity across groups, and\nrank-consistency of within-group average treatment effects. The validity of our\nmethodology does not rely on the properties of ML algorithms because it is\nsolely based on the randomization of treatment assignment and random sampling\nof units. Finally, we generalize our methodology to the cross-fitting procedure\nby accounting for the additional uncertainty induced by the random splitting of\ndata."}, "http://arxiv.org/abs/2206.10323": {"title": "What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?", "link": "http://arxiv.org/abs/2206.10323", "description": "Estimation of heterogeneous treatment effects (HTE) is of prime importance in\nmany disciplines, ranging from personalized medicine to economics among many\nothers. Random forests have been shown to be a flexible and powerful approach\nto HTE estimation in both randomized trials and observational studies. In\nparticular \"causal forests\", introduced by Athey, Tibshirani and Wager (2019),\nalong with the R implementation in package grf were rapidly adopted. A related\napproach, called \"model-based forests\", that is geared towards randomized\ntrials and simultaneously captures effects of both prognostic and predictive\nvariables, was introduced by Seibold, Zeileis and Hothorn (2018) along with a\nmodular implementation in the R package model4you.\n\nHere, we present a unifying view that goes beyond the theoretical motivations\nand investigates which computational elements make causal forests so successful\nand how these can be blended with the strengths of model-based forests. To do\nso, we show that both methods can be understood in terms of the same parameters\nand model assumptions for an additive model under L2 loss. 
This theoretical\ninsight allows us to implement several flavors of \"model-based causal forests\"\nand dissect their different elements in silico.\n\nThe original causal forests and model-based forests are compared with the new\nblended versions in a benchmark study exploring both randomized trials and\nobservational settings. In the randomized setting, both approaches performed\nsimilarly. If confounding was present in the data generating process, we found local\ncentering of the treatment indicator with the corresponding propensities to be\nthe main driver for good performance. Local centering of the outcome was less\nimportant, and might be replaced or enhanced by simultaneous split selection\nwith respect to both prognostic and predictive effects."}, "http://arxiv.org/abs/2303.13616": {"title": "Estimating Maximal Symmetries of Regression Functions via Subgroup Lattices", "link": "http://arxiv.org/abs/2303.13616", "description": "We present a method for estimating the maximal symmetry of a continuous\nregression function. Knowledge of such a symmetry can be used to significantly\nimprove modelling by removing the modes of variation resulting from the\nsymmetries. Symmetry estimation is carried out using hypothesis testing for\ninvariance strategically over the subgroup lattice of a search group G acting\non the feature space. We show that the estimation of the unique maximal\ninvariant subgroup of G generalises useful tools from linear dimension\nreduction to a non-linear context. We show that the estimation is consistent\nwhen the subgroup lattice chosen is finite, even when some of the subgroups\nthemselves are infinite. We demonstrate the performance of this estimator in\nsynthetic settings and apply the methods to two data sets: satellite\nmeasurements of the earth's magnetic field intensity; and the distribution of\nsunspots."}, "http://arxiv.org/abs/2305.12043": {"title": "SF-SFD: Stochastic Optimization of Fourier Coefficients to Generate Space-Filling Designs", "link": "http://arxiv.org/abs/2305.12043", "description": "Due to the curse of dimensionality, it is often prohibitively expensive to\ngenerate deterministic space-filling designs. On the other hand, when using\nna{\\\"i}ve uniform random sampling to generate designs cheaply, design points\ntend to concentrate in a small region of the design space. Although it is\npreferable in these cases to utilize quasi-random techniques such as Sobol\nsequences and Latin hypercube designs over uniform random sampling in many\nsettings, these methods have their own caveats, especially in high-dimensional\nspaces. In this paper, we propose a technique that addresses the fundamental\nissue of measure concentration by updating high-dimensional distribution\nfunctions to produce better space-filling designs. Then, we show that our\ntechnique can outperform Latin hypercube sampling and Sobol sequences by the\ndiscrepancy metric while generating moderately-sized space-filling samples for\nhigh-dimensional problems."}, "http://arxiv.org/abs/2306.03625": {"title": "Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning", "link": "http://arxiv.org/abs/2306.03625", "description": "We propose a simple and general framework for nonparametric estimation of\nheterogeneous treatment effects under fairness constraints. Under standard\nregularity conditions, we show that the resulting estimators possess the double\nrobustness property. 
We use this framework to characterize the trade-off\nbetween fairness and the maximum welfare achievable by the optimal policy. We\nevaluate the methods in a simulation study and illustrate them in a real-world\ncase study."}, "http://arxiv.org/abs/2312.13331": {"title": "A Bayesian Spatial Berkson error approach to estimate small area opioid mortality rates accounting for population-at-risk uncertainty", "link": "http://arxiv.org/abs/2312.13331", "description": "Monitoring small-area geographical population trends in opioid mortality has\nlarge-scale implications for informing preventative resource allocation. A\ncommon approach to obtain small area estimates of opioid mortality is to use a\nstandard disease mapping approach in which population-at-risk estimates are\ntreated as fixed and known. Assuming fixed populations ignores the uncertainty\nsurrounding small area population estimates, which may bias risk estimates and\nunder-estimate their associated uncertainties. We present a Bayesian Spatial\nBerkson Error (BSBE) model to incorporate population-at-risk uncertainty within\na disease mapping model. We compare the BSBE approach to the naive approach (treating\ndenominators as fixed) using simulation studies to illustrate potential bias\nresulting from this assumption. We show the application of the BSBE model to\nobtain 2020 opioid mortality risk estimates for 159 counties in GA, accounting\nfor population-at-risk uncertainty. Utilizing our proposed approach will help\nto inform interventions in opioid-related public health responses, policies,\nand resource allocation. Additionally, we provide a general framework to\nimprove the estimation and mapping of health indicators."}, "http://arxiv.org/abs/2312.13416": {"title": "A new criterion for interpreting acoustic emission damage signals in condition monitoring based on the distribution of cluster onsets", "link": "http://arxiv.org/abs/2312.13416", "description": "Structural Health Monitoring (SHM) relies on non-destructive techniques such\nas Acoustic Emission (AE) which provide a large amount of data over the life of\nthe systems. The analysis of these data is often based on clustering in order\nto get insights about damage evolution. In order to evaluate clustering\nresults, current approaches include Clustering Validity Indices (CVI) which\nfavor compact and separable clusters. However, these shape-based criteria are\nnot specific to AE data and SHM. This paper proposes a new approach based on\nthe sequentiality of cluster onsets. For monitoring purposes, onsets indicate\nwhen potential damage occurs for the first time and allow the detection of\ndefect initiation. The proposed CVI relies on the Kullback-Leibler\ndivergence and enables the incorporation of prior knowledge on damage onsets when available.\nThree experiments on real-world data sets demonstrate the relevance of the\nproposed approach. The first benchmark concerns the detection of the loosening\nof bolted plates under vibration. The proposed onset-based CVI outperforms the\nstandard approach in terms of both cluster quality and accuracy in detecting\nchanges in loosening. The second application involves micro-drilling of hard\nmaterials using Electrical Discharge Machining. In this industrial application,\nit is demonstrated that the proposed CVI can be used to evaluate the electrode\nprogression until the reference depth, which is essential to ensure structural\nintegrity. Lastly, the third application concerns damage monitoring in a\ncomposite/metal hybrid joint structure. 
As an important result, the timeline of\nclusters generated by the proposed CVI is used to draw a scenario that accounts\nfor the occurrence of slippage leading to a critical failure."}, "http://arxiv.org/abs/2312.13430": {"title": "Debiasing Sample Loadings and Scores in Exponential Family PCA for Sparse Count Data", "link": "http://arxiv.org/abs/2312.13430", "description": "Multivariate count data with many zeros frequently occur in a variety of\napplication areas such as text mining with a document-term matrix and cluster\nanalysis with microbiome abundance data. Exponential family PCA (Collins et\nal., 2001) is a widely used dimension reduction tool to understand and capture\nthe underlying low-rank structure of count data. It produces principal\ncomponent scores by fitting Poisson regression models with estimated loadings\nas covariates. This tends to result in extreme scores for sparse count data\nsignificantly deviating from true scores. We consider two major sources of bias\nin this estimation procedure and propose ways to reduce their effects. First,\nthe discrepancy between true loadings and their estimates under a limited\nsample size largely degrades the quality of score estimates. By treating\nestimated loadings as covariates with bias and measurement errors, we debias\nscore estimates, using the iterative bootstrap method for loadings and\nconsidering classical measurement error models. Second, the existence of MLE\nbias is often ignored in score estimation, but this bias could be removed\nthrough well-known MLE bias reduction methods. We demonstrate the effectiveness\nof the proposed bias correction procedure through experiments on both simulated\ndata and real data."}, "http://arxiv.org/abs/2312.13450": {"title": "Precise FWER Control for Gaussian Related Fields: Riding the SuRF to continuous land -- Part 1", "link": "http://arxiv.org/abs/2312.13450", "description": "The Gaussian Kinematic Formula (GKF) is a powerful and computationally\nefficient tool to perform statistical inference on random fields and became a\nwell-established tool in the analysis of neuroimaging data. Using realistic\nerror models, recent articles show that GKF based methods for \\emph{voxelwise\ninference} lead to conservative control of the familywise error rate (FWER) and\nfor cluster-size inference lead to inflated false positive rates. In this\nseries of articles we identify and resolve the main causes of these\nshortcomings in the traditional usage of the GKF for voxelwise inference. This\nfirst part removes the \\textit{good lattice assumption} and allows the data to\nbe non-stationary, yet still assumes the data to be Gaussian. 
The latter\nassumption is resolved in part 2, where we also demonstrate that our GKF-based\nmethodology is non-conservative under realistic error models."}, "http://arxiv.org/abs/2312.13454": {"title": "MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records", "link": "http://arxiv.org/abs/2312.13454", "description": "Objective: To improve survival analysis using EHR data, we aim to develop a\nsupervised topic model called MixEHR-SurG to simultaneously integrate\nheterogeneous EHR data and model survival hazard.\n\nMaterials and Methods: Our technical contributions are threefold: (1)\nintegrating EHR topic inference with Cox proportional hazards likelihood; (2)\ninferring patient-specific topic hyperparameters using the PheCode concepts\nsuch that each topic can be identified with exactly one PheCode-associated\nphenotype; (3) multi-modal survival topic inference. This leads to a highly\ninterpretable survival and guided topic model that can infer PheCode-specific\nphenotype topics associated with patient mortality. We evaluated MixEHR-G using\na simulated dataset and two real-world EHR datasets: the Quebec Congenital\nHeart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient\nclaim records of 1,767 unique ICD codes; and the MIMIC-III dataset consisting of 1,458\nsubjects with multi-modal EHR records.\n\nResults: Compared to the baselines, MixEHR-G achieved a superior dynamic\nAUROC for mortality prediction, with a mean AUROC score of 0.89 in the\nsimulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively,\nMixEHR-G associates severe cardiac conditions with high mortality risk among\nthe CHD patients after the first heart failure hospitalization and critical\nbrain injuries with increased mortality among the MIMIC-III patients after\ntheir ICU discharge.\n\nConclusion: The integration of the Cox proportional hazards model and EHR\ntopic inference in MixEHR-SurG led to not only competitive mortality prediction\nbut also meaningful phenotype topics for systematic survival analysis. The\nsoftware is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG."}, "http://arxiv.org/abs/2312.13460": {"title": "Hierarchical selection of genetic and gene by environment interaction effects in high-dimensional mixed models", "link": "http://arxiv.org/abs/2312.13460", "description": "Interactions between genes and environmental factors may play a key role in\nthe etiology of many common disorders. Several regularized generalized linear\nmodels (GLMs) have been proposed for hierarchical selection of gene by\nenvironment interaction (GEI) effects, where a GEI effect is selected only if\nthe corresponding genetic main effect is also selected in the model. However,\nnone of these methods allows the inclusion of random effects to account for population\nstructure, subject relatedness and shared environmental exposure. In this\npaper, we develop a unified approach based on regularized penalized\nquasi-likelihood (PQL) estimation to perform hierarchical selection of GEI\neffects in sparse regularized mixed models. We compare the selection and\nprediction accuracy of our proposed model with existing methods through\nsimulations in the presence of population structure and shared environmental\nexposure. 
We show that for all simulation scenarios, compared to other\npenalized methods, our proposed method enforced sparsity by controlling the\nnumber of false positives in the model while having the best predictive\nperformance. Finally, we apply our method to a real data application using the\nOrofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, and\nfound that our method retrieves previously reported significant loci."}, "http://arxiv.org/abs/2312.13482": {"title": "Spatially Adaptive Variable Screening in Presurgical fMRI Data Analysis", "link": "http://arxiv.org/abs/2312.13482", "description": "Accurate delineation of tumor-adjacent functional brain regions is essential\nfor planning function-preserving neurosurgery. Functional magnetic resonance\nimaging (fMRI) is increasingly used for presurgical counseling and planning.\nWhen analyzing presurgical fMRI data, false negatives are more dangerous to the\npatients than false positives because patients are more likely to experience\nsignificant harm from failing to identify functional regions and subsequently\nresecting critical tissues. In this paper, we propose a novel spatially\nadaptive variable screening procedure to enable effective control of false\nnegatives while leveraging the spatial structure of fMRI data. Compared to\nexisting statistical methods in fMRI data analysis, the new procedure directly\ncontrols false negatives at a desirable level and is completely data-driven.\nThe new method is also substantially different from existing false-negative\ncontrol procedures which do not take spatial information into account.\nNumerical examples show that the new method outperforms several\nstate-of-the-art methods in retaining signal voxels, especially the subtle ones\nat the boundaries of functional regions, while providing cleaner separation of\nfunctional regions from background noise. Such results could be valuable to\npreserve critical tissues in neurosurgery."}, "http://arxiv.org/abs/2312.13517": {"title": "An utopic adventure in the modelling of conditional univariate and multivariate extremes", "link": "http://arxiv.org/abs/2312.13517", "description": "The EVA 2023 data competition consisted of four challenges, ranging from\ninterval estimation for very high quantiles of univariate extremes conditional\non covariates, point estimation of unconditional return levels under a custom\nloss function, to estimation of the probabilities of tail events for low and\nhigh-dimensional multivariate data. We tackle these tasks by revisiting the\ncurrent and existing literature on conditional univariate and multivariate\nextremes. We propose new cross-validation methods for covariate-dependent\nmodels, validation metrics for exchangeable multivariate models, formulae for\nthe joint probability of exceedance for multivariate generalized Pareto vectors\nand a composition sampling algorithm for generating multivariate tail events\nfor the latter. We highlight overarching themes ranging from model validation\nat extremely high quantile levels to building custom estimation strategies that\nleverage model assumptions."}, "http://arxiv.org/abs/2312.13643": {"title": "Debiasing Welch's Method for Spectral Density Estimation", "link": "http://arxiv.org/abs/2312.13643", "description": "Welch's method provides an estimator of the power spectral density that is\nstatistically consistent. This is achieved by averaging over periodograms\ncalculated from overlapping segments of a time series. 
For a finite-length time\nseries, while the variance of the estimator decreases as the number of segments\nincreases, the magnitude of the estimator's bias increases: a bias-variance\ntrade-off ensues when setting the segment number. We address this issue by\nproviding a novel method for debiasing Welch's method which maintains the\ncomputational complexity and asymptotic consistency, and leads to improved\nfinite-sample performance. Theoretical results are given for fourth-order\nstationary processes with finite fourth-order moments and absolutely continuous\nfourth-order cumulant spectrum. The significant bias reduction is demonstrated\nwith numerical simulation and an application to real-world data, where several\nempirical metrics indicate our debiased estimator compares favourably to\nWelch's. Our estimator also permits irregular spacing over frequency and we\ndemonstrate how this may be employed for signal compression and further\nvariance reduction. Code accompanying this work is available in the R and\nPython languages."}, "http://arxiv.org/abs/2312.13725": {"title": "Extreme Value Statistics for Analysing Simulated Environmental Extremes", "link": "http://arxiv.org/abs/2312.13725", "description": "We present the methods employed by team `Uniofbathtopia' as part of the Data\nChallenge organised for the 13th International Conference on Extreme Value\nAnalysis (EVA2023), including our winning entry for the third sub-challenge.\nOur approaches unite ideas from extreme value theory, which provides a\nstatistical framework for the estimation of probabilities/return levels\nassociated with rare events, with techniques from unsupervised statistical\nlearning, such as clustering and support identification. The methods are\ndemonstrated on the data provided for the Data Challenge -- environmental data\nsampled from the fantasy country of `Utopia' -- but the underlying assumptions\nand frameworks should apply in more general settings and applications."}, "http://arxiv.org/abs/2312.13875": {"title": "Best Arm Identification in Batched Multi-armed Bandit Problems", "link": "http://arxiv.org/abs/2312.13875", "description": "Recently, the multi-armed bandit problem has arisen in many real-life scenarios where\narms must be sampled in batches, due to the limited time the agent can wait for\nfeedback. Such applications include biological experimentation and online\nmarketing. The problem is further complicated when the number of arms is large\nand the number of batches is small. We consider pure exploration in a batched\nmulti-armed bandit problem. We introduce a general linear programming framework\nthat can incorporate objectives of different theoretical settings in best arm\nidentification. The linear program leads to a two-stage algorithm that can\nachieve good theoretical properties. We demonstrate by numerical studies that\nthe algorithm also has good performance compared to certain UCB-type or\nThompson sampling methods."}, "http://arxiv.org/abs/2312.13992": {"title": "Bayesian nonparametric boundary detection for income areal data", "link": "http://arxiv.org/abs/2312.13992", "description": "Recent discussions on the future of metropolitan cities underscore the\npivotal role of (social) equity, driven by demographic and economic trends.\nMore equal policies can foster and contribute to a city's economic success and\nsocial stability. 
In this work, we focus on identifying metropolitan areas with\ndistinct economic and social levels in the greater Los Angeles area, one of the\nmost diverse yet unequal areas in the United States. Utilizing American\nCommunity Survey data, we propose a Bayesian model for boundary detection based\non income distributions. The model identifies areas with significant income\ndisparities, offering actionable insights for policymakers to address social\nand economic inequalities. Our approach formalized as a Bayesian structural\nlearning framework, models areal densities through finite mixture models.\nEfficient posterior computation is facilitated by a transdimensional Markov\nChain Monte Carlo sampler. The methodology is validated via extensive\nsimulations and applied to the income distributions in the greater Los Angeles\narea. We identify several boundaries in the income distributions which can be\nexplained in light of other social dynamics such as crime rates and healthcare,\nshowing the usefulness of such an analysis to policymakers."}, "http://arxiv.org/abs/2312.14013": {"title": "Two-Stage Pseudo Maximum Likelihood Estimation of Semiparametric Copula-based Regression Models for Semi-Competing Risks Data", "link": "http://arxiv.org/abs/2312.14013", "description": "We propose a two-stage estimation procedure for a copula-based model with\nsemi-competing risks data, where the non-terminal event is subject to dependent\ncensoring by the terminal event, and both events are subject to independent\ncensoring. Under a copula-based model, the marginal survival functions of\nindividual event times are specified by semiparametric transformation models,\nand the dependence between the bivariate event times is specified by a\nparametric copula function. For the estimation procedure, in the first stage,\nthe parameters associated with the marginal of the terminal event are estimated\nonly using the corresponding observed outcomes, and in the second stage, the\nmarginal parameters for the non-terminal event time and the copula parameter\nare estimated via maximizing a pseudo-likelihood function based on the joint\ndistribution of the bivariate event times. We derived the asymptotic properties\nof the proposed estimator and provided an analytic variance estimator for\ninference. Through simulation studies, we showed that our approach leads to\nconsistent estimates with less computational cost and more robustness compared\nto the one-stage procedure developed in Chen (2012), where all parameters were\nestimated simultaneously. In addition, our approach demonstrates more desirable\nfinite-sample performances over another existing two-stage estimation method\nproposed in Zhu et al. (2021)."}, "http://arxiv.org/abs/2312.14086": {"title": "A Bayesian approach to functional regression: theory and computation", "link": "http://arxiv.org/abs/2312.14086", "description": "We propose a novel Bayesian methodology for inference in functional linear\nand logistic regression models based on the theory of reproducing kernel\nHilbert spaces (RKHS's). These models build upon the RKHS associated with the\ncovariance function of the underlying stochastic process, and can be viewed as\na finite-dimensional approximation to the classical functional regression\nparadigm. The corresponding functional model is determined by a function living\non a dense subspace of the RKHS of interest, which has a tractable parametric\nform based on linear combinations of the kernel. 
By imposing a suitable prior\ndistribution on this functional space, we can naturally perform data-driven\ninference via standard Bayes methodology, estimating the posterior distribution\nthrough Markov chain Monte Carlo (MCMC) methods. In this context, our\ncontribution is two-fold. First, we derive a theoretical result that guarantees\nposterior consistency in these models, based on an application of a classic\ntheorem of Doob to our RKHS setting. Second, we show that several prediction\nstrategies stemming from our Bayesian formulation are competitive against other\nusual alternatives in both simulations and real data sets, including a\nBayesian-motivated variable selection procedure."}, "http://arxiv.org/abs/2312.14130": {"title": "Adaptation using spatially distributed Gaussian Processes", "link": "http://arxiv.org/abs/2312.14130", "description": "We consider the accuracy of an approximate posterior distribution in\nnonparametric regression problems by combining posterior distributions computed\non subsets of the data defined by the locations of the independent variables.\nWe show that this approximate posterior retains the rate of recovery of the\nfull data posterior distribution, where the rate of recovery adapts to the\nsmoothness of the true regression function. As particular examples we consider\nGaussian process priors based on integrated Brownian motion and the Mat\\'ern\nkernel augmented with a prior on the length scale. Besides theoretical\nguarantees we present a numerical study of the methods both on synthetic and\nreal world data. We also propose a new aggregation technique, which numerically\noutperforms previous approaches."}, "http://arxiv.org/abs/2202.00824": {"title": "KSD Aggregated Goodness-of-fit Test", "link": "http://arxiv.org/abs/2202.00824", "description": "We investigate properties of goodness-of-fit tests based on the Kernel Stein\nDiscrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg,\nwhich aggregates multiple tests with different kernels. KSDAgg avoids splitting\nthe data to perform kernel selection (which leads to a loss in test power), and\nrather maximises the test power over a collection of kernels. We provide\nnon-asymptotic guarantees on the power of KSDAgg: we show it achieves the\nsmallest uniform separation rate of the collection, up to a logarithmic term.\nFor compactly supported densities with bounded model score function, we derive\nthe rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the\nminimax optimal rate over unrestricted Sobolev balls, up to an iterated\nlogarithmic term. KSDAgg can be computed exactly in practice as it relies\neither on a parametric bootstrap or on a wild bootstrap to estimate the\nquantiles and the level corrections. In particular, for the crucial choice of\nbandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such\nas median or standard deviation) or to data splitting. We find on both\nsynthetic and real-world data that KSDAgg outperforms other state-of-the-art\nquadratic-time adaptive KSD-based goodness-of-fit testing procedures."}, "http://arxiv.org/abs/2209.06101": {"title": "Evaluating individualized treatment effect predictions: a model-based perspective on discrimination and calibration assessment", "link": "http://arxiv.org/abs/2209.06101", "description": "In recent years, there has been a growing interest in the prediction of\nindividualized treatment effects. 
While there is a rapidly growing literature\non the development of such models, there is little literature on the evaluation\nof their performance. In this paper, we aim to facilitate the validation of\nprediction models for individualized treatment effects. The estimands of\ninterest are defined based on the potential outcomes framework, which\nfacilitates a comparison of existing and novel measures. In particular, we\nexamine existing measures of discrimination for benefit (variations\nof the c-for-benefit), and propose model-based extensions to the treatment\neffect setting for discrimination and calibration metrics that have a strong\nbasis in outcome risk prediction. The main focus is on randomized trial data\nwith binary endpoints and on models that provide individualized treatment\neffect predictions and potential outcome predictions. We use simulated data to\nprovide insight into the characteristics of the discrimination and\ncalibration statistics under consideration, and further illustrate all methods\nin a trial of acute ischemic stroke treatment. The results show that the\nproposed model-based statistics had the best characteristics in terms of bias\nand accuracy. While resampling methods adjusted for the optimism of performance\nestimates in the development data, they had a high variance across replications\nthat limited their accuracy. Therefore, individualized treatment effect models\nare best validated in independent data. To aid implementation, a software\nimplementation of the proposed methods was made available in R."}, "http://arxiv.org/abs/2211.11400": {"title": "The Online Closure Principle", "link": "http://arxiv.org/abs/2211.11400", "description": "The closure principle is fundamental in multiple testing and has been used to\nderive many efficient procedures with familywise error rate control. However,\nit is often unsuitable for modern research, which involves flexible multiple\ntesting settings where not all hypotheses are known at the beginning of the\nevaluation. In this paper, we focus on online multiple testing where a possibly\ninfinite sequence of hypotheses is tested over time. At each step, a decision must be\nmade on the current hypothesis without having any information about the\nhypotheses that have not been tested yet. Our main contribution is a general\nand stringent mathematical definition of online multiple testing and a new\nonline closure principle which ensures that the resulting closed procedure can\nbe applied in the online setting. We prove that any familywise error rate\ncontrolling online procedure can be derived by this online closure principle\nand provide admissibility results. In addition, we demonstrate how short-cuts\nof these online closed procedures can be obtained under a suitable consonance\nproperty."}, "http://arxiv.org/abs/2305.12616": {"title": "Conformal Prediction With Conditional Guarantees", "link": "http://arxiv.org/abs/2305.12616", "description": "We consider the problem of constructing distribution-free prediction sets\nwith finite-sample conditional guarantees. Prior work has shown that it is\nimpossible to provide exact conditional coverage universally in finite samples.\nThus, most popular methods only provide marginal coverage over the covariates.\nThis paper bridges this gap by defining a spectrum of problems that interpolate\nbetween marginal and conditional validity. We motivate these problems by\nreformulating conditional coverage as coverage over a class of covariate\nshifts. 
When the target class of shifts is finite dimensional, we show how to\nsimultaneously obtain exact finite sample coverage over all possible shifts.\nFor example, given a collection of protected subgroups, our algorithm outputs\nintervals with exact coverage over each group. For more flexible, infinite\ndimensional classes where exact coverage is impossible, we provide a simple\nprocedure for quantifying the gap between the coverage of our algorithm and the\ntarget level. Moreover, by tuning a single hyperparameter, we allow the\npractitioner to control the size of this gap across shifts of interest. Our\nmethods can be easily incorporated into existing split conformal inference\npipelines, and thus can be used to quantify the uncertainty of modern black-box\nalgorithms without distributional assumptions."}, "http://arxiv.org/abs/2305.16842": {"title": "Accounting statement analysis at industry level", "link": "http://arxiv.org/abs/2305.16842", "description": "Compositional data are contemporarily defined as positive vectors, the ratios\namong whose elements are of interest to the researcher. Financial statement\nanalysis by means of accounting ratios fulfils this definition to the letter.\nCompositional data analysis solves the major problems in statistical analysis\nof standard financial ratios at industry level, such as skewness,\nnon-normality, non-linearity and dependence of the results on the choice of\nwhich accounting figure goes to the numerator and to the denominator of the\nratio. In spite of this, compositional applications to financial statement\nanalysis are still rare. In this article, we present some transformations\nwithin compositional data analysis that are particularly useful for financial\nstatement analysis. We show how to compute industry or sub-industry means of\nstandard financial ratios from a compositional perspective. We show how to\nvisualise firms in an industry with a compositional biplot, to classify them\nwith compositional cluster analysis and to relate financial and non-financial\nindicators with compositional regression models. We show an application to the\naccounting statements of Spanish wineries using DuPont analysis, and a\nstep-by-step tutorial to the compositional freeware CoDaPack."}, "http://arxiv.org/abs/2306.12865": {"title": "Estimating dynamic treatment regimes for ordinal outcomes with household interference: Application in household smoking cessation", "link": "http://arxiv.org/abs/2306.12865", "description": "The focus of precision medicine is on decision support, often in the form of\ndynamic treatment regimes (DTRs), which are sequences of decision rules. At\neach decision point, the decision rules determine the next treatment according\nto the patient's baseline characteristics, the information on treatments and\nresponses accrued by that point, and the patient's current health status,\nincluding symptom severity and other measures. However, DTR estimation with\nordinal outcomes is rarely studied, and rarer still in the context of\ninterference - where one patient's treatment may affect another's outcome. In\nthis paper, we introduce the weighted proportional odds model (WPOM): a\nregression-based, approximate doubly-robust approach to single-stage DTR\nestimation for ordinal outcomes. This method also accounts for the possibility\nof interference between individuals sharing a household through the use of\ncovariate balancing weights derived from joint propensity scores. 
Examining\ndifferent types of balancing weights, we verify the approximate double\nrobustness of WPOM with our adjusted weights via simulation studies. We further\nextend WPOM to multi-stage DTR estimation with household interference, namely\ndWPOM (dynamic WPOM). Lastly, we demonstrate our proposed methodology in the\nanalysis of longitudinal survey data from the Population Assessment of Tobacco\nand Health study, which motivates this work. Furthermore, considering\ninterference, we provide optimal treatment strategies for households to achieve\nsmoking cessation of the pair in the household."}, "http://arxiv.org/abs/2307.16502": {"title": "Percolated stochastic block model via EM algorithm and belief propagation with non-backtracking spectra", "link": "http://arxiv.org/abs/2307.16502", "description": "Whereas Laplacian and modularity based spectral clustering is apt to dense\ngraphs, recent results show that for sparse ones, the non-backtracking spectrum\nis the best candidate to find assortative clusters of nodes. Here belief\npropagation in the sparse stochastic block model is derived with arbitrary\ngiven model parameters that results in a non-linear system of equations; with\nlinear approximation, the spectrum of the non-backtracking matrix is able to\nspecify the number $k$ of clusters. Then the model parameters themselves can be\nestimated by the EM algorithm. Bond percolation in the assortative model is\nconsidered in the following two senses: the within- and between-cluster edge\nprobabilities decrease with the number of nodes and edges coming into existence\nin this way are retained with probability $\\beta$. As a consequence, the\noptimal $k$ is the number of the structural real eigenvalues (greater than\n$\\sqrt{c}$, where $c$ is the average degree) of the non-backtracking matrix of\nthe graph. Assuming, these eigenvalues $\\mu_1 >\\dots > \\mu_k$ are distinct, the\nmultiple phase transitions obtained for $\\beta$ are $\\beta_i\n=\\frac{c}{\\mu_i^2}$; further, at $\\beta_i$ the number of detectable clusters is\n$i$, for $i=1,\\dots ,k$. Inflation-deflation techniques are also discussed to\nclassify the nodes themselves, which can be the base of the sparse spectral\nclustering."}, "http://arxiv.org/abs/2310.02679": {"title": "Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization", "link": "http://arxiv.org/abs/2310.02679", "description": "We tackle the problem of sampling from intractable high-dimensional density\nfunctions, a fundamental task that often appears in machine learning and\nstatistics. We extend recent sampling-based approaches that leverage controlled\nstochastic processes to model approximate samples from these target densities.\nThe main drawback of these approaches is that the training objective requires\nfull trajectories to compute, resulting in sluggish credit assignment issues\ndue to use of entire trajectories and a learning signal present only at the\nterminal time. In this work, we present Diffusion Generative Flow Samplers\n(DGFS), a sampling-based framework where the learning process can be tractably\nbroken down into short partial trajectory segments, via parameterizing an\nadditional \"flow function\". Our method takes inspiration from the theory\ndeveloped for generative flow networks (GFlowNets), allowing us to make use of\nintermediate learning signals. 
Through various challenging experiments, we\ndemonstrate that DGFS achieves more accurate estimates of the normalization\nconstant than closely-related prior methods."}, "http://arxiv.org/abs/2312.14333": {"title": "Behaviour Modelling of Social Animals via Causal Structure Discovery and Graph Neural Networks", "link": "http://arxiv.org/abs/2312.14333", "description": "Better understanding the natural world is a crucial task with a wide range of\napplications. In environments with close proximity between humans and animals,\nsuch as zoos, it is essential to better understand the causes behind animal\nbehaviour and what interventions are responsible for changes in their\nbehaviours. This can help to predict unusual behaviours, mitigate detrimental\neffects and increase the well-being of animals. There has been work on\nmodelling the dynamics behind swarms of birds and insects but the complex\nsocial behaviours of mammalian groups remain less explored. In this work, we\npropose a method to build behavioural models using causal structure discovery\nand graph neural networks for time series. We apply this method to a mob of\nmeerkats in a zoo environment and study its ability to predict future actions\nand model the behaviour distribution at an individual level and at a group\nlevel. We show that our method can match and outperform standard deep learning\narchitectures and generate more realistic data, while using fewer parameters\nand providing increased interpretability."}, "http://arxiv.org/abs/2312.14416": {"title": "Joint Semi-Symmetric Tensor PCA for Integrating Multi-modal Populations of Networks", "link": "http://arxiv.org/abs/2312.14416", "description": "Multi-modal populations of networks arise in many scenarios, including\nlarge-scale multi-modal neuroimaging studies that capture both functional and\nstructural neuroimaging data for thousands of subjects. A major research\nquestion in such studies is how functional and structural brain connectivity\nare related and how they vary across the population. We develop a novel\nPCA-type framework for integrating multi-modal undirected networks measured on\nmany subjects. Specifically, we arrange these networks as semi-symmetric\ntensors, where each tensor slice is a symmetric matrix representing a network\nfrom an individual subject. We then propose a novel Joint, Integrative\nSemi-Symmetric Tensor PCA (JisstPCA) model, associated with an efficient\niterative algorithm, for jointly finding low-rank representations of two or\nmore networks across the same population of subjects. We establish one-step\nstatistical convergence of our separate low-rank network factors as well as the\nshared population factors to the true factors, with finite sample statistical\nerror bounds. Through simulation studies and a real data example for\nintegrating multi-subject functional and structural brain connectivity, we\nillustrate the advantages of our method for finding joint low-rank structures\nin multi-modal populations of networks."}, "http://arxiv.org/abs/2312.14420": {"title": "On eigenvalues of sample covariance matrices based on high dimensional compositional data", "link": "http://arxiv.org/abs/2312.14420", "description": "This paper studies the asymptotic spectral properties of the sample\ncovariance matrix for high dimensional compositional data, including the\nlimiting spectral distribution, the limit of extreme eigenvalues, and the\ncentral limit theorem for linear spectral statistics. 
All asymptotic results\nare derived under the high-dimensional regime where the data dimension\nincreases to infinity proportionally with the sample size. The findings reveal\nthat the limiting spectral distribution is the well-known Marchenko-Pastur law.\nThe largest (or smallest non-zero) eigenvalue converges almost surely to the\nright (or left) endpoint of the limiting spectral distribution, respectively.\nMoreover, the linear spectral statistics demonstrate a Gaussian limit.\nSimulation experiments demonstrate the accuracy of the theoretical results."}, "http://arxiv.org/abs/2312.14534": {"title": "Global Rank Sum Test: An Efficient Rank-Based Nonparametric Test for Large Scale Online Experiment", "link": "http://arxiv.org/abs/2312.14534", "description": "Online experiments are widely used for improving online services. In online\nexperiments, Student's t-test is the most widely used hypothesis\ntesting technique. In practice, however, the normality assumption on which the\nt-test depends may fail, resulting in untrustworthy results. In this\npaper, we first discuss the question of when the t-test fails, and then\nintroduce the rank-sum test. Next, to address the difficulties of\nimplementing the rank-sum test in large online experiment platforms, we propose a\nglobal-rank-sum test method as an improvement over the traditional one. Finally,\nwe demonstrate that the global-rank-sum test is not only more accurate and has\nhigher statistical power than the t-test, but is also more time efficient than the\ntraditional rank-sum test, which makes it practical for large online\nexperiment platforms."}, "http://arxiv.org/abs/2312.14549": {"title": "A machine learning approach based on survival analysis for IBNR frequencies in non-life reserving", "link": "http://arxiv.org/abs/2312.14549", "description": "We introduce new approaches for forecasting IBNR (Incurred But Not Reported)\nfrequencies by leveraging individual claims data, which includes accident date,\nreporting delay, and possibly additional features for every reported claim. A\nkey element of our proposal involves computing development factors, which may\nbe influenced by both the accident date and other features. These development\nfactors serve as the basis for predictions. While we assume close to continuous\nobservations of accident date and reporting delay, the development factors can\nbe expressed at any level of granularity, such as months, quarters, or years, and\npredictions across different granularity levels exhibit coherence. The\ncalculation of development factors relies on the estimation of a hazard\nfunction in reverse development time, and we present three distinct methods for\nestimating this function: the Cox proportional hazard model, a feed-forward\nneural network, and xgboost (eXtreme gradient boosting). In all three cases,\nestimation is based on the same partial likelihood that accommodates left\ntruncation and ties in the data. While the first case is a semi-parametric\nmodel that assumes in parts a log linear structure, the two machine learning\napproaches only assume that the baseline and the other factors are\nmultiplicatively separable. Through an extensive simulation study and\nreal-world data application, our approach demonstrates promising results. 
This\npaper comes with an accompanying R-package, $\\texttt{ReSurv}$, which can be\naccessed at \\url{https://github.com/edhofman/ReSurv}"}, "http://arxiv.org/abs/2312.14583": {"title": "Inference on the state process of periodically inhomogeneous hidden Markov models for animal behavior", "link": "http://arxiv.org/abs/2312.14583", "description": "Over the last decade, hidden Markov models (HMMs) have become increasingly\npopular in statistical ecology, where they constitute natural tools for\nstudying animal behavior based on complex sensor data. Corresponding analyses\nsometimes explicitly focus on - and in any case need to take into account -\nperiodic variation, for example by quantifying the activity distribution over\nthe daily cycle or seasonal variation such as migratory behavior. For HMMs\nincluding periodic components, we establish important mathematical properties\nthat allow for comprehensive statistical inference related to periodic\nvariation, thereby also providing guidance for model building and model\nchecking. Specifically, we derive the periodically varying unconditional state\ndistribution as well as the time-varying and overall state dwell-time\ndistributions - all of which are of key interest when the inferential focus\nlies on the dynamics of the state process. We use the associated novel\ninference and model-checking tools to investigate changes in the diel activity\npatterns of fruit flies in response to changing light conditions."}, "http://arxiv.org/abs/2312.14601": {"title": "Generalized Moment Estimators based on Stein Identities", "link": "http://arxiv.org/abs/2312.14601", "description": "For parameter estimation of continuous and discrete distributions, we propose\na generalization of the method of moments (MM), where Stein identities are\nutilized for improved estimation performance. The construction of these\nStein-type MM-estimators makes use of a weight function as implied by an\nappropriate form of the Stein identity. Our general approach as well as\npotential benefits thereof are first illustrated by the simple example of the\nexponential distribution. Afterward, we investigate the more sophisticated\ntwo-parameter inverse Gaussian distribution and the two-parameter\nnegative-binomial distribution in great detail, together with illustrative\nreal-world data examples. Given an appropriate choice of the respective weight\nfunctions, their Stein-MM estimators, which are defined by simple closed-form\nformulas and allow for closed-form asymptotic computations, exhibit a better\nperformance regarding bias and mean squared error than competing estimators."}, "http://arxiv.org/abs/2312.14689": {"title": "Mistaken identities lead to missed opportunities: Testing for mean differences in partially matched data", "link": "http://arxiv.org/abs/2312.14689", "description": "It is increasingly common to collect pre-post data with pseudonyms or\nself-constructed identifiers. On survey responses from sensitive populations,\nidentifiers may be made optional to encourage higher response rates. The\nability to match responses between pre- and post-intervention phases for every\nparticipant may be impossible in such applications, leaving practitioners with\na choice between the paired t-test on the matched samples and the two-sample\nt-test on all samples for evaluating mean differences. We demonstrate the\ninadequacies with both approaches, as the former test requires discarding\nunmatched data, while the latter test ignores correlation and assumes\nindependence. 
In cases with a subset of matched samples, an opportunity to\nachieve limited inference about the correlation exists. We propose a novel\ntechnique for such `partially matched' data, which we refer to as the\nQuantile-based t-test for correlated samples, to assess mean differences using\na conservative estimate of the correlation between responses based on the\nmatched subset. Critically, our approach does not discard unmatched samples,\nnor does it assume independence. Our results demonstrate that the proposed\nmethod yields nominal Type I error probability while affording more power than\nexisting approaches. Practitioners can readily adopt our approach with basic\nstatistical programming software."}, "http://arxiv.org/abs/2312.14719": {"title": "Nonhomogeneous hidden semi-Markov models for toroidal data", "link": "http://arxiv.org/abs/2312.14719", "description": "A nonhomogeneous hidden semi-Markov model is proposed to segment toroidal\ntime series according to a finite number of latent regimes and, simultaneously,\nestimate the influence of time-varying covariates on the process' survival\nunder each regime. The model is a mixture of toroidal densities, whose\nparameters depend on the evolution of a semi-Markov chain, which is in turn\nmodulated by time-varying covariates through a proportional hazards assumption.\nParameter estimates are obtained using an EM algorithm that relies on an\nefficient augmentation of the latent process. The proposal is illustrated on a\ntime series of wind and wave directions recorded during winter."}, "http://arxiv.org/abs/2312.14810": {"title": "Accelerating Bayesian Optimal Experimental Design with Derivative-Informed Neural Operators", "link": "http://arxiv.org/abs/2312.14810", "description": "We consider optimal experimental design (OED) for nonlinear Bayesian inverse\nproblems governed by large-scale partial differential equations (PDEs). For the\noptimality criteria of Bayesian OED, we consider both expected information gain\nand summary statistics including the trace and determinant of the information\nmatrix that involves the evaluation of the parameter-to-observable (PtO) map\nand its derivatives. However, it is prohibitive to compute and optimize these\ncriteria when the PDEs are very expensive to solve, the parameters to estimate\nare high-dimensional, and the optimization problem is combinatorial,\nhigh-dimensional, and non-convex. To address these challenges, we develop an\naccurate, scalable, and efficient computational framework to accelerate the\nsolution of Bayesian OED. In particular, the framework is developed based on\nderivative-informed neural operator (DINO) surrogates with proper dimension\nreduction techniques and a modified swapping greedy algorithm. We demonstrate\nthe high accuracy of the DINO surrogates in the computation of the PtO map and\nthe optimality criteria compared to high-fidelity finite element\napproximations. We also show that the proposed method is scalable with\nincreasing parameter dimensions. 
Moreover, we demonstrate that it achieves high\nefficiency with over 1000X speedup compared to a high-fidelity Bayesian OED\nsolution for a three-dimensional PDE example with tens of thousands of\nparameters, including both online evaluation and offline construction costs of\nthe surrogates."}, "http://arxiv.org/abs/2108.13010": {"title": "Piecewise monotone estimation in one-parameter exponential family", "link": "http://arxiv.org/abs/2108.13010", "description": "The problem of estimating a piecewise monotone sequence of normal means is\ncalled the nearly isotonic regression. For this problem, an efficient algorithm\nhas been devised by modifying the pool adjacent violators algorithm (PAVA). In\nthis study, we investigate estimation of a piecewise monotone parameter\nsequence for general one-parameter exponential families such as binomial,\nPoisson and chi-square. We develop an efficient algorithm based on the modified\nPAVA, which utilizes the duality between the natural and expectation\nparameters. We also provide a method for selecting the regularization parameter\nby using an information criterion. Simulation results demonstrate that the\nproposed method detects change-points in piecewise monotone parameter sequences\nin a data-driven manner. Applications to spectrum estimation, causal inference\nand discretization error quantification of ODE solvers are also presented."}, "http://arxiv.org/abs/2212.03131": {"title": "Explainability as statistical inference", "link": "http://arxiv.org/abs/2212.03131", "description": "A wide variety of model explanation approaches have been proposed in recent\nyears, all guided by very different rationales and heuristics. In this paper,\nwe take a new route and cast interpretability as a statistical inference\nproblem. We propose a general deep probabilistic model designed to produce\ninterpretable predictions. The model parameters can be learned via maximum\nlikelihood, and the method can be adapted to any predictor network architecture\nand any type of prediction problem. Our method is a case of amortized\ninterpretability models, where a neural network is used as a selector to allow\nfor fast interpretation at inference time. Several popular interpretability\nmethods are shown to be particular cases of regularised maximum likelihood for\nour general model. We propose new datasets with ground truth selection which\nallow for the evaluation of the features importance map. Using these datasets,\nwe show experimentally that using multiple imputation provides more reasonable\ninterpretations."}, "http://arxiv.org/abs/2302.08151": {"title": "Towards a universal representation of statistical dependence", "link": "http://arxiv.org/abs/2302.08151", "description": "Dependence is undoubtedly a central concept in statistics. Though, it proves\ndifficult to locate in the literature a formal definition which goes beyond the\nself-evident 'dependence = non-independence'. This absence has allowed the term\n'dependence' and its declination to be used vaguely and indiscriminately for\nqualifying a variety of disparate notions, leading to numerous incongruities.\nFor example, the classical Pearson's, Spearman's or Kendall's correlations are\nwidely regarded as 'dependence measures' of major interest, in spite of\nreturning 0 in some cases of deterministic relationships between the variables\nat play, evidently not measuring dependence at all. 
Arguing that research on\nsuch a fundamental topic would benefit from a slightly more rigid framework,\nthis paper suggests a general definition of the dependence between two random\nvariables defined on the same probability space. Natural enough for aligning\nwith intuition, that definition is still sufficiently precise for allowing\nunequivocal identification of a 'universal' representation of the dependence\nstructure of any bivariate distribution. Links between this representation and\nfamiliar concepts are highlighted, and ultimately, the idea of a dependence\nmeasure based on that universal representation is explored and shown to satisfy\nRenyi's postulates."}, "http://arxiv.org/abs/2305.00349": {"title": "Causal effects of intervening variables in settings with unmeasured confounding", "link": "http://arxiv.org/abs/2305.00349", "description": "We present new results on average causal effects in settings with unmeasured\nexposure-outcome confounding. Our results are motivated by a class of\nestimands, e.g., frequently of interest in medicine and public health, that are\ncurrently not targeted by standard approaches for average causal effects. We\nrecognize these estimands as queries about the average causal effect of an\nintervening variable. We anchor our introduction of these estimands in an\ninvestigation of the role of chronic pain and opioid prescription patterns in\nthe opioid epidemic, and illustrate how conventional approaches will lead to\nunreplicable estimates with ambiguous policy implications. We argue that our\nalternative effects are replicable and have clear policy implications, and\nfurthermore are non-parametrically identified by the classical frontdoor\nformula. As an independent contribution, we derive a new semiparametric\nefficient estimator of the frontdoor formula with a uniform sample boundedness\nguarantee. This property is unique among previously-described estimators in its\nclass, and we demonstrate superior performance in finite-sample settings.\nTheoretical results are applied with data from the National Health and\nNutrition Examination Survey."}, "http://arxiv.org/abs/2305.05330": {"title": "Point and probabilistic forecast reconciliation for general linearly constrained multiple time series", "link": "http://arxiv.org/abs/2305.05330", "description": "Forecast reconciliation is the post-forecasting process aimed at revising a set\nof incoherent base forecasts into coherent forecasts in line with given data\nstructures. Most of the point and probabilistic regression-based forecast\nreconciliation results are grounded on the so-called \"structural representation\" and\non the related unconstrained generalized least squares reconciliation formula.\nHowever, the structural representation naturally applies to genuine\nhierarchical/grouped time series, where the top- and bottom-level variables are\nuniquely identified. When a general linearly constrained multiple time series\nis considered, the forecast reconciliation is naturally expressed according to\na projection approach. While it is well known that the classic structural\nreconciliation formula is equivalent to its projection approach counterpart, so\nfar it is not completely understood if and how a structural-like reconciliation\nformula may be derived for a general linearly constrained multiple time series.\nSuch an expression would permit extending reconciliation definitions, theorems\nand results in a straightforward manner. 
In this paper, we show that for\ngeneral linearly constrained multiple time series it is possible to express the\nreconciliation formula according to a \"structural-like\" approach that keeps\ndistinct free and constrained, instead of bottom and upper (aggregated),\nvariables, establish the probabilistic forecast reconciliation framework, and\napply these findings to obtain fully reconciled point and probabilistic\nforecasts for the aggregates of the Australian GDP from income and expenditure\nsides, and for the European Area GDP disaggregated by income, expenditure and\noutput sides and by 19 countries."}, "http://arxiv.org/abs/2312.15032": {"title": "Combining support for hypotheses over heterogeneous studies with Bayesian Evidence Synthesis: A simulation study", "link": "http://arxiv.org/abs/2312.15032", "description": "Scientific claims gain credibility by replicability, especially if\nreplication under different circumstances and varying designs yields equivalent\nresults. Aggregating results over multiple studies is, however, not\nstraightforward, and when the heterogeneity between studies increases,\nconventional methods such as (Bayesian) meta-analysis and Bayesian sequential\nupdating become infeasible. *Bayesian Evidence Synthesis*, built upon the\nfoundations of the Bayes factor, allows to aggregate support for conceptually\nsimilar hypotheses over studies, regardless of methodological differences. We\nassess the performance of Bayesian Evidence Synthesis over multiple effect and\nsample sizes, with a broad set of (inequality-constrained) hypotheses using\nMonte Carlo simulations, focusing explicitly on the complexity of the\nhypotheses under consideration. The simulations show that this method can\nevaluate complex (informative) hypotheses regardless of methodological\ndifferences between studies, and performs adequately if the set of studies\nconsidered has sufficient statistical power. Additionally, we pinpoint\nchallenging conditions that can lead to unsatisfactory results, and provide\nsuggestions on handling these situations. Ultimately, we show that Bayesian\nEvidence Synthesis is a promising tool that can be used when traditional\nresearch synthesis methods are not applicable due to insurmountable\nbetween-study heterogeneity."}, "http://arxiv.org/abs/2312.15055": {"title": "Deep Learning for Efficient GWAS Feature Selection", "link": "http://arxiv.org/abs/2312.15055", "description": "Genome-Wide Association Studies (GWAS) face unique challenges in the era of\nbig genomics data, particularly when dealing with ultra-high-dimensional\ndatasets where the number of genetic features significantly exceeds the\navailable samples. This paper introduces an extension to the feature selection\nmethodology proposed by Mirzaei et al. (2020), specifically tailored to tackle\nthe intricacies associated with ultra-high-dimensional GWAS data. Our extended\napproach enhances the original method by introducing a Frobenius norm penalty\ninto the student network, augmenting its capacity to adapt to scenarios\ncharacterized by a multitude of features and limited samples. Operating\nseamlessly in both supervised and unsupervised settings, our method employs two\nkey neural networks. The first leverages an autoencoder or supervised\nautoencoder for dimension reduction, extracting salient features from the\nultra-high-dimensional genomic data. The second network, a regularized\nfeed-forward model with a single hidden layer, is designed for precise feature\nselection. 
The introduction of the Frobenius norm penalty in the student\nnetwork significantly boosts the method's resilience to the challenges posed by\nultra-high-dimensional GWAS datasets. Experimental results showcase the\nefficacy of our approach in feature selection for GWAS data. The method not\nonly handles the inherent complexities of ultra-high-dimensional settings but\nalso demonstrates superior adaptability to the nuanced structures present in\ngenomics data. The flexibility and versatility of our proposed methodology are\nunderscored by its successful performance across a spectrum of experiments."}, "http://arxiv.org/abs/2312.15079": {"title": "Invariance-based Inference in High-Dimensional Regression with Finite-Sample Guarantees", "link": "http://arxiv.org/abs/2312.15079", "description": "In this paper, we develop invariance-based procedures for testing and\ninference in high-dimensional regression models. These procedures, also known\nas randomization tests, provide several important advantages. First, for the\nglobal null hypothesis of significance, our test is valid in finite samples. It\nis also simple to implement and comes with finite-sample guarantees on\nstatistical power. Remarkably, despite its simplicity, this testing idea has\nescaped the attention of earlier analytical work, which mainly concentrated on\ncomplex high-dimensional asymptotic methods. Under an additional assumption of\nGaussian design, we show that this test also achieves the minimax optimal rate\nagainst certain nonsparse alternatives, a type of result that is rare in the\nliterature. Second, for partial null hypotheses, we propose residual-based\ntests and derive theoretical conditions for their validity. These tests can be\nmade powerful by constructing the test statistic in a way that, first, selects\nthe important covariates (e.g., through Lasso) and then orthogonalizes the\nnuisance parameters. We illustrate our results through extensive simulations\nand applied examples. One consistent finding is that the strong finite-sample\nguarantees associated with our procedures result in added robustness when it\ncomes to handling multicollinearity and heavy-tailed covariates."}, "http://arxiv.org/abs/2312.15179": {"title": "Evaluating District-based Election Surveys with Synthetic Dirichlet Likelihood", "link": "http://arxiv.org/abs/2312.15179", "description": "In district-based multi-party elections, electors cast votes in their\nrespective districts. In each district, the party with maximum votes wins the\ncorresponding seat in the governing body. Election Surveys try to predict the\nelection outcome (vote shares and seat shares of parties) by querying a random\nsample of electors. However, the survey results are often inconsistent with the\nactual results, which could be due to multiple reasons. The aim of this work is\nto estimate a posterior distribution over the possible outcomes of the\nelection, given one or more survey results. This is achieved using a prior\ndistribution over vote shares, election models to simulate the complete\nelection from the vote share, and survey models to simulate survey results from\na complete election. The desired posterior distribution over the space of\npossible outcomes is constructed using Synthetic Dirichlet Likelihoods, whose\nparameters are estimated from Monte Carlo sampling of elections using the\nelection models. 
We further show that the same approach can also be used to\nevaluate the surveys - whether they were biased or not - based on the true\noutcome once it is known. Our work offers the first-ever probabilistic model to\nanalyze district-based election surveys. We illustrate our approach with\nextensive experiments on real and simulated data of district-based political\nelections in India."}, "http://arxiv.org/abs/2312.15205": {"title": "X-Vine Models for Multivariate Extremes", "link": "http://arxiv.org/abs/2312.15205", "description": "Regular vine sequences permit the organisation of variables in a random\nvector along a sequence of trees. Regular vine models have become greatly\npopular in dependence modelling as a way to combine arbitrary bivariate copulas\ninto higher-dimensional ones, offering flexibility, parsimony, and\ntractability. In this project, we use regular vine structures to decompose and\nconstruct the exponent measure density of a multivariate extreme value\ndistribution, or, equivalently, the tail copula density. Although these\ndensities pose theoretical challenges due to their infinite mass, their\nhomogeneity property offers simplifications. The theory sheds new light on\nexisting parametric families and facilitates the construction of new ones,\ncalled X-vines. Computations proceed via recursive formulas in terms of\nbivariate model components. We develop simulation algorithms for X-vine\nmultivariate Pareto distributions as well as methods for parameter estimation\nand model selection on the basis of threshold exceedances. The methods are\nillustrated by Monte Carlo experiments and a case study on US flight delay\ndata."}, "http://arxiv.org/abs/2312.15217": {"title": "Constructing a T-test for Value Function Comparison of Individualized Treatment Regimes in the Presence of Multiple Imputation for Missing Data", "link": "http://arxiv.org/abs/2312.15217", "description": "Optimal individualized treatment decision-making has improved health outcomes\nin recent years. The value function is commonly used to evaluate the goodness\nof an individualized treatment decision rule. Despite recent advances,\ncomparing value functions between different treatment decision rules or\nconstructing confidence intervals around value functions remains difficult. We\npropose a t-test based method applied to a test set that generates valid\np-values to compare value functions between a given pair of treatment decision\nrules when some of the data are missing. We demonstrate the ease of use of this\nmethod and evaluate its performance via simulation studies and apply it to the\nChina Health and Nutrition Survey data."}, "http://arxiv.org/abs/2312.15222": {"title": "Towards reaching a consensus in Bayesian trial designs: the case of 2-arm trials", "link": "http://arxiv.org/abs/2312.15222", "description": "Practical employment of Bayesian trial designs has been rare, in part due to\nthe regulators' requirement to calibrate such designs with an upper bound for\nType 1 error rate. This has led to an internally inconsistent hybrid\nmethodology, where important advantages from applying the Bayesian principles\nare lost. To present an alternative approach, we consider the prototype case of\na 2-arm superiority trial with binary outcomes. The design is adaptive,\napplying block randomization for treatment assignment and using error tolerance\ncriteria based on sequentially updated posterior probabilities, to conclude\nefficacy of the experimental treatment or futility of the trial. 
The method\nalso contains the option of applying a decision rule for terminating the trial\nearly if the predicted costs from continuing would exceed the corresponding\ngains. We propose that the control of Type 1 error rate be replaced by a\ncontrol of false discovery rate (FDR), a concept that lends itself to both\nfrequentist and Bayesian interpretations. Importantly, if the regulators and\nthe investigators can agree on a consensus distribution to represent their\nshared understanding on the effect sizes, the selected level of risk tolerance\nagainst false conclusions during the data analysis will also be a valid bound\nfor the FDR. The approach can lower the ultimately unnecessary barriers from\nthe practical application of Bayesian trial designs. This can lead to more\nflexible experimental protocols and more efficient use of trial data while\nstill effectively guarding against falsely claimed discoveries."}, "http://arxiv.org/abs/2312.15373": {"title": "A Multi-day Needs-based Modeling Approach for Activity and Travel Demand Analysis", "link": "http://arxiv.org/abs/2312.15373", "description": "This paper proposes a multi-day needs-based model for activity and travel\ndemand analysis. The model captures the multi-day dynamics in activity\ngeneration, which enables the modeling of activities with increased flexibility\nin time and space (e.g., e-commerce and remote working). As an enhancement to\nactivity-based models, the proposed model captures the underlying\ndecision-making process of activity generation by accounting for psychological\nneeds as the drivers of activities. The level of need satisfaction is modeled\nas a psychological inventory, whose utility is optimized via decisions on\nactivity participation, location, and duration. The utility includes both the\nbenefit in the inventory gained and the cost in time, monetary expense as well\nas maintenance of safety stock. The model includes two sub-models, a\nDeterministic Model that optimizes the utility of the inventory, and an\nEmpirical Model that accounts for heterogeneity and stochasticity. Numerical\nexperiments are conducted to demonstrate model scalability. A maximum\nlikelihood estimator is proposed, the properties of the log-likelihood function\nare examined and the recovery of true parameters is tested. This research\ncontributes to the literature on transportation demand models in the following\nthree aspects. First, it is arguably better grounded in psychological theory\nthan traditional models and allows the generation of activity patterns to be\npolicy-sensitive (while avoiding the need for ad hoc utility definitions).\nSecond, it contributes to the development of needs-based models with a\nnon-myopic approach to model multi-day activity patterns. Third, it proposes a\ntractable model formulation via problem reformulation and computational\nenhancements, which allows for maximum likelihood parameter estimation."}, "http://arxiv.org/abs/2312.15376": {"title": "Geodesic Optimal Transport Regression", "link": "http://arxiv.org/abs/2312.15376", "description": "Classical regression models do not cover non-Euclidean data that reside in a\ngeneral metric space, while the current literature on non-Euclidean regression\nby and large has focused on scenarios where either predictors or responses are\nrandom objects, i.e., non-Euclidean, but not both. 
In this paper we propose\ngeodesic optimal transport regression models for the case where both predictors\nand responses lie in a common geodesic metric space and predictors may include\nnot only one but also several random objects. This provides an extension of\nclassical multiple regression to the case where both predictors and responses\nreside in non-Euclidean metric spaces, a scenario that has not been considered\nbefore. It is based on the concept of optimal geodesic transports, which we\ndefine as an extension of the notion of optimal transports in distribution\nspaces to more general geodesic metric spaces, where we characterize optimal\ntransports as transports along geodesics. The proposed regression models cover\nthe relation between non-Euclidean responses and vectors of non-Euclidean\npredictors in many spaces of practical statistical interest. These include\none-dimensional distributions viewed as elements of the 2-Wasserstein space and\nmultidimensional distributions with the Fisher-Rao metric that are represented\nas data on the Hilbert sphere. Also included are data on finite-dimensional\nRiemannian manifolds, with an emphasis on spheres, covering directional and\ncompositional data, as well as data that consist of symmetric positive definite\nmatrices. We illustrate the utility of geodesic optimal transport regression\nwith data on summer temperature distributions and human mortality."}, "http://arxiv.org/abs/2312.15396": {"title": "PKBOIN-12: A Bayesian optimal interval Phase I/II design incorporating pharmacokinetics outcomes to find the optimal biological dose", "link": "http://arxiv.org/abs/2312.15396", "description": "Immunotherapies and targeted therapies have gained popularity due to their\npromising therapeutic effects across multiple treatment areas. The focus of\nearly phase dose-finding clinical trials has shifted from finding the maximum\ntolerated dose (MTD) to identifying the optimal biological dose (OBD), which\naims to balance the toxicity and efficacy outcomes, thereby optimizing the\nrisk-benefit trade-off. These trials often collect multiple pharmacokinetics\n(PK) outcomes to assess drug exposure, which has shown correlations with\ntoxicity and efficacy outcomes but has not been utilized in the current\ndose-finding designs for OBD selection. Moreover, PK outcomes are usually\navailable within days after initial treatment, much faster than toxicity and\nefficacy outcomes. To bridge this gap, we introduce the innovative\nmodel-assisted PKBOIN-12 design, which enhances the BOIN12 design by\nintegrating PK information into both the dose-finding algorithm and the final\nOBD determination process. We further extend PKBOIN-12 to the TITE-PKBOIN-12\ndesign to address the challenges of late-onset toxicity and efficacy outcomes.\nSimulation results demonstrate that the PKBOIN-12 design more effectively\nidentifies the OBD and allocates a greater number of patients to it than\nBOIN12. Additionally, PKBOIN-12 decreases the probability of selecting\ninefficacious doses as the OBD by excluding those with low drug exposure.\nComprehensive simulation studies and sensitivity analysis confirm the\nrobustness of both PKBOIN-12 and TITE-PKBOIN-12 designs in various scenarios."}, "http://arxiv.org/abs/2312.15469": {"title": "Efficient Estimation of the Central Mean Subspace via Smoothed Gradient Outer Products", "link": "http://arxiv.org/abs/2312.15469", "description": "We consider the problem of sufficient dimension reduction (SDR) for\nmulti-index models. 
The estimators of the central mean subspace in prior works\neither have slow (non-parametric) convergence rates, or rely on stringent\ndistributional conditions (e.g., the covariate distribution $P_{\\mathbf{X}}$\nbeing elliptical symmetric). In this paper, we show that a fast parametric\nconvergence rate of form $C_d \\cdot n^{-1/2}$ is achievable via estimating the\n\\emph{expected smoothed gradient outer product}, for a general class of\ndistribution $P_{\\mathbf{X}}$ admitting Gaussian or heavier distributions. When\nthe link function is a polynomial with a degree of at most $r$ and\n$P_{\\mathbf{X}}$ is the standard Gaussian, we show that the prefactor depends\non the ambient dimension $d$ as $C_d \\propto d^r$."}, "http://arxiv.org/abs/2312.15496": {"title": "A Simple Bias Reduction for Chatterjee's Correlation", "link": "http://arxiv.org/abs/2312.15496", "description": "Chatterjee's rank correlation coefficient $\\xi_n$ is an empirical index for\ndetecting functional dependencies between two variables $X$ and $Y$. It is an\nestimator for a theoretical quantity $\\xi$ that is zero for independence and\none if $Y$ is a measurable function of $X$. Based on an equivalent\ncharacterization of sorted numbers, we derive an upper bound for $\\xi_n$ and\nsuggest a simple normalization aimed at reducing its bias for small sample size\n$n$. In Monte Carlo simulations of various cases, the normalization reduced the\nbias in all cases. The mean squared error was reduced, too, for values of $\\xi$\ngreater than about 0.4. Moreover, we observed that confidence intervals for\n$\\xi$ based on bootstrapping $\\xi_n$ in the usual n-out-of-n way have a\ncoverage probability close to zero. This is remedied by an m-out-of-n bootstrap\nwithout replacement in combination with our normalization method."}, "http://arxiv.org/abs/2312.15611": {"title": "Inference of Dependency Knowledge Graph for Electronic Health Records", "link": "http://arxiv.org/abs/2312.15611", "description": "The effective analysis of high-dimensional Electronic Health Record (EHR)\ndata, with substantial potential for healthcare research, presents notable\nmethodological challenges. Employing predictive modeling guided by a knowledge\ngraph (KG), which enables efficient feature selection, can enhance both\nstatistical efficiency and interpretability. While various methods have emerged\nfor constructing KGs, existing techniques often lack statistical certainty\nconcerning the presence of links between entities, especially in scenarios\nwhere the utilization of patient-level EHR data is limited due to privacy\nconcerns. In this paper, we propose the first inferential framework for\nderiving a sparse KG with statistical guarantee based on the dynamic log-linear\ntopic model proposed by \\cite{arora2016latent}. Within this model, the KG\nembeddings are estimated by performing singular value decomposition on the\nempirical pointwise mutual information matrix, offering a scalable solution. We\nthen establish entrywise asymptotic normality for the KG low-rank estimator,\nenabling the recovery of sparse graph edges with controlled type I error. Our\nwork uniquely addresses the under-explored domain of statistical inference\nabout non-linear statistics under the low-rank temporal dependent models, a\ncritical gap in existing research. 
We validate our approach through extensive\nsimulation studies and then apply the method to real-world EHR data in\nconstructing clinical KGs and generating clinical feature embeddings."}, "http://arxiv.org/abs/2312.15781": {"title": "A Computational Note on the Graphical Ridge in High-dimension", "link": "http://arxiv.org/abs/2312.15781", "description": "This article explores the estimation of precision matrices in\nhigh-dimensional Gaussian graphical models. We address the challenge of\nimproving the accuracy of maximum likelihood-based precision estimation through\npenalization. Specifically, we consider an elastic net penalty, which\nincorporates both L1 and Frobenius norm penalties while accounting for the\ntarget matrix during estimation. To enhance precision matrix estimation, we\npropose a novel two-step estimator that combines the strengths of ridge and\ngraphical lasso estimators. Through this approach, we aim to improve overall\nestimation performance. Our empirical analysis demonstrates the superior\nefficiency of our proposed method compared to alternative approaches. We\nvalidate the effectiveness of our proposal through numerical experiments and\napplication on three real datasets. These examples illustrate the practical\napplicability and usefulness of our proposed estimator."}, "http://arxiv.org/abs/2312.15919": {"title": "Review on Causality Detection Based on Empirical Dynamic Modeling", "link": "http://arxiv.org/abs/2312.15919", "description": "In contemporary scientific research, understanding the distinction between\ncorrelation and causation is crucial. While correlation is a widely used\nanalytical standard, it does not inherently imply causation. This paper\naddresses the potential for misinterpretation in relying solely on correlation,\nespecially in the context of nonlinear dynamics. Despite the rapid development\nof various correlation research methodologies, including machine learning, the\nexploration into mining causal correlations between variables remains ongoing.\nEmpirical Dynamic Modeling (EDM) emerges as a data-driven framework for\nmodeling dynamic systems, distinguishing itself by eschewing traditional\nformulaic methods in data analysis. Instead, it reconstructs dynamic system\nbehavior directly from time series data. The fundamental premise of EDM is that\ndynamic systems can be conceptualized as processes where a set of states,\ngoverned by specific rules, evolve over time in a high-dimensional space. By\nreconstructing these evolving states, dynamic systems can be effectively\nmodeled. Using EDM, this paper explores the detection of causal relationships\nbetween variables within dynamic systems through their time series data. It\nposits that if variable X causes variable Y, then the information about X is\ninherent in Y and can be extracted from Y's data. 
This study begins by\nexamining the dialectical relationship between correlation and causation,\nemphasizing that correlation does not equate to causation, and the absence of\ncorrelation does not necessarily indicate a lack of causation."}, "http://arxiv.org/abs/2312.16037": {"title": "Critical nonlinear aspects of hopping transport for reconfigurable logic in disordered dopant networks", "link": "http://arxiv.org/abs/2312.16037", "description": "Nonlinear behavior in the hopping transport of interacting charges enables\nreconfigurable logic in disordered dopant network devices, where voltages\napplied at control electrodes tune the relation between voltages applied at\ninput electrodes and the current measured at an output electrode. From kinetic\nMonte Carlo simulations we analyze the critical nonlinear aspects of\nvariable-range hopping transport for realizing Boolean logic gates in these\ndevices on three levels. First, we quantify the occurrence of individual gates\nfor random choices of control voltages. We find that linearly inseparable gates\nsuch as the XOR gate are less likely to occur than linearly separable gates\nsuch as the AND gate, despite the fact that the number of different regions in\nthe multidimensional control voltage space for which AND or XOR gates occur is\ncomparable. Second, we use principal component analysis to characterize the\ndistribution of the output current vectors for the (00,10,01,11) logic input\ncombinations in terms of eigenvectors and eigenvalues of the output covariance\nmatrix. This allows a simple and direct comparison of the behavior of different\nsimulated devices and a comparison to experimental devices. Third, we quantify\nthe nonlinearity in the distribution of the output current vectors necessary\nfor realizing Boolean functionality by introducing three nonlinearity\nindicators. The analysis provides a physical interpretation of the effects of\nchanging the hopping distance and temperature and is used in a comparison with\ndata generated by a deep neural network trained on a physical device."}, "http://arxiv.org/abs/2312.16139": {"title": "Anomaly component analysis", "link": "http://arxiv.org/abs/2312.16139", "description": "At the crossway of machine learning and data analysis, anomaly detection aims\nat identifying observations that exhibit abnormal behaviour. Be it measurement\nerrors, disease development, severe weather, production quality default(s)\n(items) or failed equipment, financial frauds or crisis events, their on-time\nidentification and isolation constitute an important task in almost any area of\nindustry and science. While a substantial body of literature is devoted to\ndetection of anomalies, little attention is paid to their explanation. This is\nthe case mostly due to the intrinsically non-supervised nature of the task and\nnon-robustness of the exploratory methods like principal component analysis\n(PCA).\n\nWe introduce a new statistical tool dedicated to exploratory analysis of\nabnormal observations using data depth as a score. Anomaly component analysis\n(shortly ACA) is a method that searches a low-dimensional data representation\nthat best visualises and explains anomalies. This low-dimensional\nrepresentation not only allows one to distinguish groups of anomalies better than\nthe methods of the state of the art, but also provides a -- linear in\nvariables and thus easily interpretable -- explanation for anomalies. 
In a\ncomparative simulation and real-data study, ACA also proves advantageous for\nanomaly analysis with respect to methods present in the literature."}, "http://arxiv.org/abs/2312.16160": {"title": "SymmPI: Predictive Inference for Data with Group Symmetries", "link": "http://arxiv.org/abs/2312.16160", "description": "Quantifying the uncertainty of predictions is a core problem in modern\nstatistics. Methods for predictive inference have been developed under a\nvariety of assumptions, often -- for instance, in standard conformal prediction\n-- relying on the invariance of the distribution of the data under special\ngroups of transformations such as permutation groups. Moreover, many existing\nmethods for predictive inference aim to predict unobserved outcomes in\nsequences of feature-outcome observations. Meanwhile, there is interest in\npredictive inference under more general observation models (e.g., for partially\nobserved features) and for data satisfying more general distributional\nsymmetries (e.g., rotationally invariant or coordinate-independent observations\nin physics). Here we propose SymmPI, a methodology for predictive inference\nwhen data distributions have general group symmetries in arbitrary observation\nmodels. Our methods leverage the novel notion of distributional equivariant\ntransformations, which process the data while preserving their distributional\ninvariances. We show that SymmPI has valid coverage under distributional\ninvariance and characterize its performance under distribution shift,\nrecovering recent results as special cases. We apply SymmPI to predict\nunobserved values associated to vertices in a network, where the distribution\nis unchanged under relabelings that keep the network structure unchanged. In\nseveral simulations in a two-layer hierarchical model, and in an empirical data\nanalysis example, SymmPI performs favorably compared to existing methods."}, "http://arxiv.org/abs/2312.16162": {"title": "Properties of Test Statistics for Nonparametric Cointegrating Regression Functions Based on Subsamples", "link": "http://arxiv.org/abs/2312.16162", "description": "Nonparametric cointegrating regression models have been extensively used in\nfinancial markets, stock prices, heavy traffic, climate data sets, and energy\nmarkets. Models with parametric regression functions can be more appealing in\npractice compared to non-parametric forms, but do result in potential\nfunctional misspecification. Thus, there exists a vast literature on developing\na model specification test for parametric forms of regression functions. In\nthis paper, we develop two test statistics which are applicable for the\nendogenous regressors driven by long memory and semi-long memory input shocks\nin the regression model. The limit distributions of the test statistics under\nthese two scenarios are complicated and cannot be effectively used in practice.\nTo overcome this difficulty, we use the subsampling method and compute the test\nstatistics on smaller blocks of the data to construct their empirical\ndistributions. Throughout, Monte Carlo simulation studies are used to\nillustrate the properties of test statistics. 
We also provide an empirical\nexample of relating gross domestic product to total output of carbon dioxide in\ntwo European countries."}, "http://arxiv.org/abs/1909.06307": {"title": "Multiscale Jump Testing and Estimation Under Complex Temporal Dynamics", "link": "http://arxiv.org/abs/1909.06307", "description": "We consider the problem of detecting jumps in an otherwise smoothly evolving\ntrend whilst the covariance and higher-order structures of the system can\nexperience both smooth and abrupt changes over time. The number of jump points\nis allowed to diverge to infinity with the jump sizes possibly shrinking to\nzero. The method is based on a multiscale application of an optimal jump-pass\nfilter to the time series, where the scales are dense between admissible lower\nand upper bounds. For a wide class of non-stationary time series models and\ntrend functions, the proposed method is shown to be able to detect all jump\npoints within a nearly optimal range with a prescribed probability\nasymptotically under mild conditions. For a time series of length $n$, the\ncomputational complexity of the proposed method is $O(n)$ for each scale and\n$O(n\\log^{1+\\epsilon} n)$ overall, where $\\epsilon$ is an arbitrarily small\npositive constant. Numerical studies show that the proposed jump testing and\nestimation method performs robustly and accurately under complex temporal\ndynamics."}, "http://arxiv.org/abs/1911.06583": {"title": "GET: Global envelopes in R", "link": "http://arxiv.org/abs/1911.06583", "description": "This work describes the R package GET that implements global envelopes for a\ngeneral set of $d$-dimensional vectors $T$ in various applications. A\n$100(1-\\alpha)$% global envelope is a band bounded by two vectors such that the\nprobability that $T$ falls outside this envelope in any of the $d$ points is\nequal to $\\alpha$. The term 'global' means that this probability is controlled\nsimultaneously for all the $d$ elements of the vectors. The global envelopes\ncan be employed for central regions of functional or multivariate data, for\ngraphical Monte Carlo and permutation tests where the test statistic is\nmultivariate or functional, and for global confidence and prediction bands.\nIntrinsic graphical interpretation property is introduced for global envelopes.\nThe global envelopes included in the GET package have this property, which\nparticularly helps to interpret test results, by providing a graphical\ninterpretation that shows the reasons of rejection of the tested hypothesis.\nExamples of different uses of global envelopes and their implementation in the\nGET package are presented, including global envelopes for single and several\none- or two-dimensional functions, Monte Carlo goodness-of-fit tests for simple\nand composite hypotheses, comparison of distributions, functional analysis of\nvariance, functional linear model, and confidence bands in polynomial\nregression."}, "http://arxiv.org/abs/2007.02192": {"title": "Tail-adaptive Bayesian shrinkage", "link": "http://arxiv.org/abs/2007.02192", "description": "Modern genomic studies are increasingly focused on discovering more and more\ninteresting genes associated with a health response. Traditional shrinkage\npriors are primarily designed to detect a handful of signals from tens of\nthousands of predictors in the so-called ultra-sparsity domain. However, they\nmay fail to identify signals when the degree of sparsity is moderate. 
Robust\nsparse estimation under diverse sparsity regimes relies on a tail-adaptive\nshrinkage property. In this property, the tail-heaviness of the prior adjusts\nadaptively, becoming larger or smaller as the sparsity level increases or\ndecreases, respectively, to accommodate more or fewer signals. In this study,\nwe propose a global-local-tail (GLT) Gaussian mixture distribution that ensures\nthis property. We examine the role of the tail-index of the prior in relation\nto the underlying sparsity level and demonstrate that the GLT posterior\ncontracts at the minimax optimal rate for sparse normal mean models. We apply\nboth the GLT prior and the Horseshoe prior to real data problems and simulation\nexamples. Our findings indicate that the varying tail rule based on the GLT\nprior offers advantages over a fixed tail rule based on the Horseshoe prior in\ndiverse sparsity regimes."}, "http://arxiv.org/abs/2011.04833": {"title": "Handling time-dependent exposures and confounders when estimating attributable fractions -- bridging the gap between multistate and counterfactual modeling", "link": "http://arxiv.org/abs/2011.04833", "description": "The population-attributable fraction (PAF) expresses the proportion of events\nthat can be ascribed to a certain exposure in a certain population. It can be\nstrongly time-dependent because either exposure incidence or excess risk may\nchange over time. Competing events may moreover hinder the outcome of interest\nfrom being observed. Occurrence of either of these events may, in turn, prevent\nthe exposure of interest. Estimation approaches therefore need to carefully\naccount for the timing of events in such highly dynamic settings. The use of\nmultistate models has been widely encouraged to eliminate preventable yet\ncommon types of time-dependent bias. Even so, it has been pointed out that\nproposed multistate modeling approaches for PAF estimation fail to fully\neliminate such bias. In addition, assessing whether patients die from rather\nthan with a certain exposure not only requires adequate modeling of the timing\nof events but also of their confounding factors. While proposed multistate\nmodeling approaches for confounding adjustment may adequately accommodate\nbaseline imbalances, unlike g-methods, these proposals are not generally\nequipped to handle time-dependent confounding. However, the connection between\nmultistate modeling and g-methods (e.g. inverse probability of censoring\nweighting) for PAF estimation is not readily apparent. In this paper, we\nprovide a weighting-based characterization of both approaches to illustrate\nthis connection, to pinpoint current shortcomings of multistate modeling, and\nto enhance intuition into simple modifications to overcome these. R code is\nmade available to foster the uptake of g-methods for PAF estimation."}, "http://arxiv.org/abs/2211.00338": {"title": "Typical Yet Unlikely and Normally Abnormal: The Intuition Behind High-Dimensional Statistics", "link": "http://arxiv.org/abs/2211.00338", "description": "Normality, in the colloquial sense, has historically been considered an\naspirational trait, synonymous with ideality. The arithmetic average and, by\nextension, statistics including linear regression coefficients, have often been\nused to characterize normality, and are often used as a way to summarize\nsamples and identify outliers. 
We provide intuition behind the behavior of such\nstatistics in high dimensions, and demonstrate that even for datasets with a\nrelatively low number of dimensions, data start to exhibit a number of\npeculiarities which become severe as the number of dimensions increases. Whilst\nour main goal is to familiarize researchers with these peculiarities, we also\nshow that normality can be better characterized with `typicality', an\ninformation theoretic concept relating to entropy. An application of typicality\nto both synthetic and real-world data concerning political values reveals that\nin multi-dimensional space, to be `normal' is actually to be atypical. We\nbriefly explore the ramifications for outlier detection, demonstrating how\ntypicality, in contrast with the popular Mahalanobis distance, represents a\nviable method for outlier detection."}, "http://arxiv.org/abs/2212.12940": {"title": "Exact Selective Inference with Randomization", "link": "http://arxiv.org/abs/2212.12940", "description": "We introduce a pivot for exact selective inference with randomization. Not\nonly does our pivot lead to exact inference in Gaussian regression models, but\nit is also available in closed form. We reduce the problem of exact selective\ninference to a bivariate truncated Gaussian distribution. By doing so, we give\nup some power that is achieved with approximate maximum likelihood estimation\nin Panigrahi and Taylor (2022). Yet our pivot always produces narrower\nconfidence intervals than a closely related data splitting procedure. We\ninvestigate the trade-off between power and exact selective inference on\nsimulated datasets and an HIV drug resistance dataset."}, "http://arxiv.org/abs/2309.05025": {"title": "Simulating data from marginal structural models for a survival time outcome", "link": "http://arxiv.org/abs/2309.05025", "description": "Marginal structural models (MSMs) are often used to estimate causal effects\nof treatments on survival time outcomes from observational data when\ntime-dependent confounding may be present. They can be fitted using, e.g.,\ninverse probability of treatment weighting (IPTW). It is important to evaluate\nthe performance of statistical methods in different scenarios, and simulation\nstudies are a key tool for such evaluations. In such simulation studies, it is\ncommon to generate data in such a way that the model of interest is correctly\nspecified, but this is not always straightforward when the model of interest is\nfor potential outcomes, as is an MSM. Methods have been proposed for simulating\nfrom MSMs for a survival outcome, but these methods impose restrictions on the\ndata-generating mechanism. Here we propose a method that overcomes these\nrestrictions. The MSM can be a marginal structural logistic model for a\ndiscrete survival time or a Cox or additive hazards MSM for a continuous\nsurvival time. The hazard of the potential survival time can be conditional on\nbaseline covariates, and the treatment variable can be discrete or continuous.\nWe illustrate the use of the proposed simulation algorithm by carrying out a\nbrief simulation study. 
This study compares the coverage of confidence\nintervals calculated in two different ways for causal effect estimates obtained\nby fitting an MSM via IPTW."}, "http://arxiv.org/abs/2312.16177": {"title": "Learning to Infer Unobserved Behaviors: Estimating User's Preference for a Site over Other Sites", "link": "http://arxiv.org/abs/2312.16177", "description": "A site's recommendation system relies on knowledge of its users' preferences\nto offer relevant recommendations to them. These preferences are for attributes\nthat comprise items and content shown on the site, and are estimated from the\ndata of users' interactions with the site. Another form of users' preferences\nis material too, namely, users' preferences for the site over other sites,\nsince that shows users' base level propensities to engage with the site.\nEstimating users' preferences for the site, however, faces major obstacles\nbecause (a) the focal site usually has no data of its users' interactions with\nother sites; these interactions are users' unobserved behaviors for the focal\nsite; and (b) the Machine Learning literature in recommendation does not offer\na model of this situation. Even if (b) is resolved, the problem in (a) persists\nsince without access to data of its users' interactions with other sites, there\nis no ground truth for evaluation. Moreover, it is most useful when (c) users'\npreferences for the site can be estimated at the individual level, since the\nsite can then personalize recommendations to individual users. We offer a\nmethod to estimate individual user's preference for a focal site, under this\npremise. In particular, we compute the focal site's share of a user's online\nengagements without any data from other sites. We show an evaluation framework\nfor the model using only the focal site's data, allowing the site to test the\nmodel. We rely upon a Hierarchical Bayes Method and perform estimation in two\ndifferent ways - Markov Chain Monte Carlo and Stochastic Gradient with Langevin\nDynamics. Our results find good support for the approach to computing\npersonalized share of engagement and for its evaluation."}, "http://arxiv.org/abs/2312.16188": {"title": "The curious case of the test set AUROC", "link": "http://arxiv.org/abs/2312.16188", "description": "Whilst the size and complexity of ML models have rapidly and significantly\nincreased over the past decade, the methods for assessing their performance\nhave not kept pace. In particular, among the many potential performance\nmetrics, the ML community stubbornly continues to use (a) the area under the\nreceiver operating characteristic curve (AUROC) for a validation and test\ncohort (distinct from training data) or (b) the sensitivity and specificity for\nthe test data at an optimal threshold determined from the validation ROC.\nHowever, we argue that considering scores derived from the test ROC curve alone\ngives only a narrow insight into how a model performs and its ability to\ngeneralise."}, "http://arxiv.org/abs/2312.16241": {"title": "Analysis of Pleiotropy for Testosterone and Lipid Profiles in Males and Females", "link": "http://arxiv.org/abs/2312.16241", "description": "In modern scientific studies, it is often imperative to determine whether a\nset of phenotypes is affected by a single factor. 
If such an influence is\nidentified, it becomes essential to discern whether this effect is contingent\nupon categories such as sex or age group, and importantly, to understand\nwhether this dependence is rooted in purely non-environmental reasons. The\nexploration of such dependencies often involves studying pleiotropy, a\nphenomenon wherein a single genetic locus impacts multiple traits. This\nheightened interest in uncovering dependencies by pleiotropy is fueled by the\ngrowing accessibility of summary statistics from genome-wide association\nstudies (GWAS) and the establishment of thoroughly phenotyped sample\ncollections. This advancement enables a systematic and comprehensive\nexploration of the genetic connections among various traits and diseases.\nAdditive genetic correlation illuminates the genetic connection between two\ntraits, providing valuable insights into the shared biological pathways and\nunderlying causal relationships between them. In this paper, we present a novel\nmethod to analyze such dependencies by studying additive genetic correlations\nbetween pairs of traits under consideration. Subsequently, we employ matrix\ncomparison techniques to discern and elucidate sex-specific or\nage-group-specific associations, contributing to a deeper understanding of the\nnuanced dependencies within the studied traits. Our proposed method is\ncomputationally handy and requires only GWAS summary statistics. We validate\nour method by applying it to the UK Biobank data and present the results."}, "http://arxiv.org/abs/2312.16260": {"title": "Multinomial Link Models", "link": "http://arxiv.org/abs/2312.16260", "description": "We propose a unified multinomial link model for analyzing categorical\nresponses. It not only covers the existing multinomial logistic models and\ntheir extensions as a special class, but also allows the observations with NA\nor Unknown responses to be incorporated as a special category in the data\nanalysis. We provide explicit formulae for computing the likelihood gradient\nand Fisher information matrix, as well as detailed algorithms for finding the\nmaximum likelihood estimates of the model parameters. Our algorithms solve the\ninfeasibility issue of existing statistical software on estimating parameters\nof cumulative link models. The applications to real datasets show that the\nproposed multinomial link models can fit the data significantly better, and the\ncorresponding data analysis may correct the misleading conclusions due to\nmissing data."}, "http://arxiv.org/abs/2312.16357": {"title": "Statistical monitoring of European cross-border physical electricity flows using novel temporal edge network processes", "link": "http://arxiv.org/abs/2312.16357", "description": "Conventional modelling of networks evolving in time focuses on capturing\nvariations in the network structure. However, the network might be static from\nthe origin or experience only deterministic, regulated changes in its\nstructure, providing either a physical infrastructure or a specified connection\narrangement for some other processes. Thus, to detect change in its\nexploitation, we need to focus on the processes happening on the network. In\nthis work, we present the concept of monitoring random Temporal Edge Network\n(TEN) processes that take place on the edges of a graph having a fixed\nstructure. Our framework is based on the Generalized Network Autoregressive\nstatistical models with time-dependent exogenous variables (GNARX models) and\nCumulative Sum (CUSUM) control charts. 
To demonstrate its effective detection\nof various types of change, we conduct a simulation study and monitor the\nreal-world data of cross-border physical electricity flows in Europe."}, "http://arxiv.org/abs/2312.16439": {"title": "Inferring the Effect of a Confounded Treatment by Calibrating Resistant Population's Variance", "link": "http://arxiv.org/abs/2312.16439", "description": "In a general set-up that allows unmeasured confounding, we show that the\nconditional average treatment effect on the treated can be identified as one of\ntwo possible values. Unlike existing causal inference methods, we do not\nrequire an exogenous source of variability in the treatment, e.g., an\ninstrument or another outcome unaffected by the treatment. Instead, we require\n(a) a nondeterministic treatment assignment, (b) that conditional variances of\nthe two potential outcomes are equal in the treatment group, and (c) a\nresistant population that was not exposed to the treatment or, if exposed, is\nunaffected by the treatment. Assumption (a) is commonly assumed in theoretical\nwork, while (b) holds under fairly general outcome models. For (c), which is a\nnew assumption, we show that a resistant population is often available in\npractice. We develop a large sample inference methodology and demonstrate our\nproposed method in a study of the effect of surface mining in central\nAppalachia on birth weight that finds a harmful effect."}, "http://arxiv.org/abs/2312.16512": {"title": "Degrees-of-freedom penalized piecewise regression", "link": "http://arxiv.org/abs/2312.16512", "description": "Many popular piecewise regression models rely on minimizing a cost function\non the model fit with a linear penalty on the number of segments. However, this\npenalty does not take into account varying complexities of the model functions\non the segments potentially leading to overfitting when models with varying\ncomplexities, such as polynomials of different degrees, are used. In this work,\nwe enhance on this approach by instead using a penalty on the sum of the\ndegrees of freedom over all segments, called degrees-of-freedom penalized\npiecewise regression (DofPPR). We show that the solutions of the resulting\nminimization problem are unique for almost all input data in a least squares\nsetting. We develop a fast algorithm which does not only compute a minimizer\nbut also determines an optimal hyperparameter -- in the sense of rolling cross\nvalidation with the one standard error rule -- exactly. This eliminates manual\nhyperparameter selection. Our method supports optional user parameters for\nincorporating domain knowledge. We provide an open-source Python/Rust code for\nthe piecewise polynomial least squares case which can be extended to further\nmodels. We demonstrate the practical utility through a simulation study and by\napplications to real data. A constrained variant of the proposed method gives\nstate-of-the-art results in the Turing benchmark for unsupervised changepoint\ndetection."}, "http://arxiv.org/abs/2312.16544": {"title": "Hierarchical variable clustering based on the predictive strength between random vectors", "link": "http://arxiv.org/abs/2312.16544", "description": "A rank-invariant clustering of variables is introduced that is based on the\npredictive strength between groups of variables, i.e., two groups are assigned\na high similarity if the variables in the first group contain high predictive\ninformation about the behaviour of the variables in the other group and/or vice\nversa. 
The method presented here is model-free, dependence-based and does not\nrequire any distributional assumptions. Various general invariance and\ncontinuity properties are investigated, with special attention to those that\nare beneficial for the agglomerative hierarchical clustering procedure. A fully\nnon-parametric estimator is considered whose excellent performance is\ndemonstrated in several simulation studies and by means of real-data examples."}, "http://arxiv.org/abs/2312.16656": {"title": "Clustering Sets of Functional Data by Similarity in Law", "link": "http://arxiv.org/abs/2312.16656", "description": "We introduce a new clustering method for the classification of functional\ndata sets by their probabilistic law, that is, a procedure that aims to assign\ndata sets to the same cluster if and only if the data were generated with the\nsame underlying distribution. This method has the nice virtue of being\nnon-supervised and non-parametric, allowing for exploratory investigation with\nfew assumptions about the data. Rigorous finite bounds on the classification\nerror are given along with an objective heuristic that consistently selects the\nbest partition in a data-driven manner. Simulated data has been clustered with\nthis procedure to show the performance of the method with different parametric\nmodel classes of functional data."}, "http://arxiv.org/abs/2312.16734": {"title": "Selective Inference for Sparse Graphs via Neighborhood Selection", "link": "http://arxiv.org/abs/2312.16734", "description": "Neighborhood selection is a widely used method for estimating the\nsupport set of sparse precision matrices, which helps determine the conditional\ndependence structure in undirected graphical models. However, reporting only\npoint estimates for the estimated graph can result in poor replicability\nwithout accompanying uncertainty estimates. In fields such as psychology, where\nthe lack of replicability is a major concern, there is a growing need for\nmethods that can address this issue. In this paper, we focus on the Gaussian\ngraphical model. We introduce a selective inference method to attach\nuncertainty estimates to the selected (nonzero) entries of the precision matrix\nand decide which of the estimated edges must be included in the graph. Our\nmethod provides an exact adjustment for the selection of edges, which when\nmultiplied with the Wishart density of the random matrix, results in valid\nselective inferences. Through the use of externally added randomization\nvariables, our adjustment is easy to compute, requiring us to calculate the\nprobability of a selection event, that is equivalent to a few sign constraints\nand that decouples across the nodewise regressions. Through simulations and an\napplication to a mobile health trial designed to study mental health, we\ndemonstrate that our selective inference method results in higher power and\nimproved estimation accuracy."}, "http://arxiv.org/abs/2312.16739": {"title": "A Bayesian functional PCA model with multilevel partition priors for group studies in neuroscience", "link": "http://arxiv.org/abs/2312.16739", "description": "The statistical analysis of group studies in neuroscience is particularly\nchallenging due to the complex spatio-temporal nature of the data, its multiple\nlevels and the inter-individual variability in brain responses. 
In this\nrespect, traditional ANOVA-based studies and linear mixed effects models\ntypically provide only limited exploration of the dynamics of the group brain\nactivity and variability of the individual responses, potentially leading to\noverly simplistic conclusions and/or missing more intricate patterns. In this\nstudy we propose a novel method based on functional Principal Components\nAnalysis and Bayesian model-based clustering to simultaneously assess group\neffects and individual deviations over the most important temporal features in\nthe data. This method provides a thorough exploration of group differences and\nindividual deviations in neuroscientific group studies without compromising on\nthe spatio-temporal nature of the data. By means of a simulation study we\ndemonstrate that the proposed model returns correct classification in different\nclustering scenarios under low and high noise levels in the data. Finally we\nconsider a case study using Electroencephalogram data recorded during an object\nrecognition task where our approach provides new insights into the underlying\nbrain mechanisms generating the data and their variability."}, "http://arxiv.org/abs/2312.16769": {"title": "Estimation and Inference for High-dimensional Multi-response Growth Curve Model", "link": "http://arxiv.org/abs/2312.16769", "description": "A growth curve model (GCM) aims to characterize how an outcome variable\nevolves, develops and grows as a function of time, along with other predictors.\nIt provides a particularly useful framework to model growth trend in\nlongitudinal data. However, the estimation and inference of GCM with a large\nnumber of response variables faces numerous challenges, and remains\nunderdeveloped. In this article, we study the high-dimensional\nmultivariate-response linear GCM, and develop the corresponding estimation and\ninference procedures. Our proposal is far from a straightforward extension, and\ninvolves several innovative components. Specifically, we introduce a Kronecker\nproduct structure, which allows us to effectively decompose a very large\ncovariance matrix, and to pool the correlated samples to improve the estimation\naccuracy. We devise a highly non-trivial multi-step estimation approach to\nestimate the individual covariance components separately and effectively. We\nalso develop rigorous statistical inference procedures to test both the global\neffects and the individual effects, and establish the size and power\nproperties, as well as the proper false discovery control. We demonstrate the\neffectiveness of the new method through both intensive simulations, and the\nanalysis of longitudinal neuroimaging data for Alzheimer's disease."}, "http://arxiv.org/abs/2312.16887": {"title": "Automatic Scoring of Cognition Drawings: Assessing the Quality of Machine-Based Scores Against a Gold Standard", "link": "http://arxiv.org/abs/2312.16887", "description": "Figure drawing is often used as part of dementia screening protocols. The\nSurvey of Health Aging and Retirement in Europe (SHARE) has adopted three\ndrawing tests from Addenbrooke's Cognitive Examination III as part of its\nquestionnaire module on cognition. While the drawings are usually scored by\ntrained clinicians, SHARE uses the face-to-face interviewers who conduct the\ninterviews to score the drawings during fieldwork. This may pose a risk to data\nquality, as interviewers may be less consistent in their scoring and more\nlikely to make errors due to their lack of clinical training. 
This paper\ntherefore reports a first proof of concept and evaluates the feasibility of\nautomating scoring using deep learning. We train several different\nconvolutional neural network (CNN) models using about 2,000 drawings from the\n8th wave of the SHARE panel in Germany and the corresponding interviewer\nscores, as well as self-developed 'gold standard' scores. The results suggest\nthat this approach is indeed feasible. Compared to training on interviewer\nscores, models trained on the gold standard data improve prediction accuracy by\nabout 10 percentage points. The best performing model, ConvNeXt Base, achieves\nan accuracy of about 85%, which is 5 percentage points higher than the accuracy\nof the interviewers. While this is a promising result, the models still\nstruggle to score partially correct drawings, which are also problematic for\ninterviewers. This suggests that more and better training data is needed to\nachieve production-level prediction accuracy. We therefore discuss possible\nnext steps to improve the quality and quantity of training examples."}, "http://arxiv.org/abs/2312.16953": {"title": "Super Ensemble Learning Using the Highly-Adaptive-Lasso", "link": "http://arxiv.org/abs/2312.16953", "description": "We consider estimation of a functional parameter of a realistically modeled\ndata distribution based on independent and identically distributed\nobservations. Suppose that the true function is defined as the minimizer of the\nexpectation of a specified loss function over its parameter space. Estimators\nof the true function are provided, viewed as a data-adaptive coordinate\ntransformation for the true function. For any $J$-dimensional real valued\ncadlag function with finite sectional variation norm, we define a candidate\nensemble estimator as the mapping from the data into the composition of the\ncadlag function and the $J$ estimated functions. Using $V$-fold\ncross-validation, we define the cross-validated empirical risk of each cadlag\nfunction specific ensemble estimator. We then define the Meta Highly Adaptive\nLasso Minimum Loss Estimator (M-HAL-MLE) as the cadlag function that minimizes\nthis cross-validated empirical risk over all cadlag functions with a uniform\nbound on the sectional variation norm. For each of the $V$ training samples,\nthis yields a composition of the M-HAL-MLE ensemble and the $J$ estimated\nfunctions trained on the training sample. We can estimate the true function\nwith the average of these $V$ estimated functions, which we call the M-HAL\nsuper-learner. The M-HAL super-learner converges to the oracle estimator at a\nrate $n^{-2/3}$ (up till $\\log n$-factor) w.r.t. excess risk, where the oracle\nestimator minimizes the excess risk among all considered ensembles. The excess\nrisk of the oracle estimator and true function is generally second order. Under\nweak conditions on the $J$ candidate estimators, target features of the\nundersmoothed M-HAL super-learner are asymptotically linear estimators of the\ncorresponding target features of true function, with influence curve either the\nefficient influence curve, or potentially, a super-efficient influence curve."}, "http://arxiv.org/abs/2312.17015": {"title": "Regularized Exponentially Tilted Empirical Likelihood for Bayesian Inference", "link": "http://arxiv.org/abs/2312.17015", "description": "Bayesian inference with empirical likelihood faces a challenge as the\nposterior domain is a proper subset of the original parameter space due to the\nconvex hull constraint. 
We propose a regularized exponentially tilted empirical\nlikelihood to address this issue. Our method removes the convex hull constraint\nusing a novel regularization technique, incorporating a continuous exponential\nfamily distribution to satisfy a Kullback--Leibler divergence criterion. The\nregularization arises as a limiting procedure where pseudo-data are added to\nthe formulation of exponentially tilted empirical likelihood in a structured\nfashion. We show that this regularized exponentially tilted empirical\nlikelihood retains certain desirable asymptotic properties of (exponentially\ntilted) empirical likelihood and has improved finite sample performance.\nSimulation and data analysis demonstrate that the proposed method provides a\nsuitable pseudo-likelihood for Bayesian inference. The implementation of our\nmethod is available as the R package retel. Supplementary materials for this\narticle are available online."}, "http://arxiv.org/abs/2312.17047": {"title": "Inconsistency of cross-validation for structure learning in Gaussian graphical models", "link": "http://arxiv.org/abs/2312.17047", "description": "Despite numerous years of research into the merits and trade-offs of various\nmodel selection criteria, obtaining robust results that elucidate the behavior\nof cross-validation remains a challenging endeavor. In this paper, we highlight\nthe inherent limitations of cross-validation when employed to discern the\nstructure of a Gaussian graphical model. We provide finite-sample bounds on the\nprobability that the Lasso estimator for the neighborhood of a node within a\nGaussian graphical model, optimized using a prediction oracle, misidentifies\nthe neighborhood. Our results pertain to both undirected and directed acyclic\ngraphs, encompassing general, sparse covariance structures. To support our\ntheoretical findings, we conduct an empirical investigation of this\ninconsistency by contrasting our outcomes with other commonly used information\ncriteria through an extensive simulation study. Given that many algorithms\ndesigned to learn the structure of graphical models require hyperparameter\nselection, the precise calibration of this hyperparameter is paramount for\naccurately estimating the inherent structure. Consequently, our observations\nshed light on this widely recognized practical challenge."}, "http://arxiv.org/abs/2312.17065": {"title": "CluBear: A Subsampling Package for Interactive Statistical Analysis with Massive Data on A Single Machine", "link": "http://arxiv.org/abs/2312.17065", "description": "This article introduces CluBear, a Python-based open-source package for\ninteractive massive data analysis. The key feature of CluBear is that it\nenables users to conduct convenient and interactive statistical analysis of\nmassive data with only a traditional single-computer system. Thus, CluBear\nprovides a cost-effective solution when mining large-scale datasets. 
In\naddition, the CluBear package integrates many commonly used statistical and\ngraphical tools, which are useful for most commonly encountered data analysis\ntasks."}, "http://arxiv.org/abs/2312.17111": {"title": "Online Tensor Inference", "link": "http://arxiv.org/abs/2312.17111", "description": "Recent technological advances have led to contemporary applications that\ndemand real-time processing and analysis of sequentially arriving tensor data.\nTraditional offline learning, involving the storage and utilization of all data\nin each computational iteration, becomes impractical for high-dimensional\ntensor data due to its voluminous size. Furthermore, existing low-rank tensor\nmethods lack the capability for statistical inference in an online fashion,\nwhich is essential for real-time predictions and informed decision-making. This\npaper addresses these challenges by introducing a novel online inference\nframework for low-rank tensor learning. Our approach employs Stochastic\nGradient Descent (SGD) to enable efficient real-time data processing without\nextensive memory requirements, thereby significantly reducing computational\ndemands. We establish a non-asymptotic convergence result for the online\nlow-rank SGD estimator, which nearly matches the minimax optimal rate of estimation\nerror in offline models that store all historical data. Building upon this\nfoundation, we propose a simple yet powerful online debiasing approach for\nsequential statistical inference in low-rank tensor learning. The entire online\nprocedure, covering both estimation and inference, eliminates the need for data\nsplitting or storing historical data, making it suitable for on-the-fly\nhypothesis testing. Given the sequential nature of our data collection,\ntraditional analyses relying on offline methods and sample splitting are\ninadequate. In our analysis, we control the sum of constructed\nsuper-martingales to ensure estimates along the entire solution path remain\nwithin the benign region. Additionally, a novel spectral representation tool is\nemployed to address statistical dependencies among iterative estimates,\nestablishing the desired asymptotic normality."}, "http://arxiv.org/abs/2312.17230": {"title": "Variable Neighborhood Searching Rerandomization", "link": "http://arxiv.org/abs/2312.17230", "description": "Rerandomization discards undesired treatment assignments to ensure covariate\nbalance in randomized experiments. However, rerandomization based on\nacceptance-rejection sampling is computationally inefficient, especially when\nnumerous independent assignments are required to perform randomization-based\nstatistical inference. Existing acceleration methods are suboptimal and are not\napplicable in structured experiments, including stratified experiments and\nexperiments with clusters. Based on metaheuristics in combinatorial\noptimization, we propose a novel variable neighborhood searching\nrerandomization (VNSRR) method to draw balanced assignments in various\nexperiments efficiently. We derive the unbiasedness and a lower bound for the\nvariance reduction of the treatment effect estimator under VNSRR. 
Simulation\nstudies and a real data example indicate that our method maintains the\nappealing statistical properties of rerandomization and can sample thousands of\ntreatment assignments within seconds, even in cases where existing methods\nrequire an hour to complete the task."}, "http://arxiv.org/abs/2003.13119": {"title": "Statistical Quantile Learning for Large, Nonlinear, and Additive Latent Variable Models", "link": "http://arxiv.org/abs/2003.13119", "description": "The studies of large-scale, high-dimensional data in fields such as genomics\nand neuroscience have injected new insights into science. Yet, despite\nadvances, they are confronting several challenges often simultaneously:\nnon-linearity, slow computation, inconsistency and uncertain convergence, and\nsmall sample sizes compared to high feature dimensions. Here, we propose a\nrelatively simple, scalable, and consistent nonlinear dimension reduction\nmethod that can potentially address these issues in unsupervised settings. We\ncall this method Statistical Quantile Learning (SQL) because, methodologically,\nit leverages a quantile approximation of the latent variables and standard\nnonparametric techniques (sieve or penalized methods). By doing so, we show\nthat estimating the model originates from a convex assignment matching problem.\nTheoretically, we provide the asymptotic properties of SQL and its rates of\nconvergence. Operationally, SQL overcomes both the parametric restriction in\nnonlinear factor models in statistics and the difficulty of specifying\nhyperparameters and vanishing gradients in deep learning. Simulation studies\nsupport the theory and reveal that SQL outperforms state-of-the-art statistical\nand machine learning methods. Compared to its linear competitors, SQL explains\nmore variance, yields better separation and explanation, and delivers more\naccurate outcome prediction when latent factors are used as predictors;\ncompared to its nonlinear competitors, SQL shows considerable advantage in\ninterpretability, ease of use and computations in high-dimensional\nsettings. Finally, we apply SQL to high-dimensional gene expression data\n(consisting of 20263 genes from 801 subjects), where the proposed method\nidentified latent factors predictive of five cancer types. The SQL package is\navailable at https://github.com/jbodelet/SQL."}, "http://arxiv.org/abs/2102.06197": {"title": "Estimating a Directed Tree for Extremes", "link": "http://arxiv.org/abs/2102.06197", "description": "We propose a new method to estimate a root-directed spanning tree from\nextreme data. A prominent example is a river network, to be discovered from\nextreme flow measured at a set of stations. Our new algorithm utilizes\nqualitative aspects of a max-linear Bayesian network, which has been designed\nfor modelling causality in extremes. The algorithm estimates bivariate scores\nand returns a root-directed spanning tree. It performs extremely well on\nbenchmark data and new data. We prove that the new estimator is consistent\nunder a max-linear Bayesian network model with noise. We also assess its\nstrengths and limitations in a small simulation study."}, "http://arxiv.org/abs/2206.10108": {"title": "A Bayesian Nonparametric Approach for Identifying Differentially Abundant Taxa in Multigroup Microbiome Data with Covariates", "link": "http://arxiv.org/abs/2206.10108", "description": "Scientific studies in the last two decades have established the central role\nof the microbiome in disease and health. 
Differential abundance analysis seeks\nto identify microbial taxa associated with sample groups defined by a factor\nsuch as disease subtype, geographical region, or environmental condition. The\nresults, in turn, help clinical practitioners and researchers diagnose disease\nand develop treatments more effectively. However, microbiome data analysis is\nuniquely challenging due to high-dimensionality, sparsity, compositionality, and\ncollinearity. There is a critical need for unified statistical approaches for\ndifferential analysis in the presence of covariates. We develop a zero-inflated\nBayesian nonparametric (ZIBNP) methodology that meets these multipronged\nchallenges. The proposed technique flexibly adapts to the unique data\ncharacteristics, casts the high proportion of zeros in a censoring framework,\nand mitigates high-dimensionality and collinearity by utilizing the\ndimension-reducing property of the semiparametric Chinese restaurant process.\nAdditionally, the ZIBNP approach relates the microbiome sampling depths to\ninferential precision while accommodating the compositional nature of\nmicrobiome data. Through simulation studies and analyses of the CAnine\nMicrobiome during Parasitism (CAMP) and Global Gut microbiome datasets, we\ndemonstrate the accuracy of ZIBNP compared to established methods for\ndifferential abundance analysis in the presence of covariates."}, "http://arxiv.org/abs/2208.10910": {"title": "A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression", "link": "http://arxiv.org/abs/2208.10910", "description": "We introduce a new empirical Bayes approach for large-scale multiple linear\nregression. Our approach combines two key ideas: (i) the use of flexible\n\"adaptive shrinkage\" priors, which approximate the nonparametric family of\nscale mixture of normal distributions by a finite mixture of normal\ndistributions; and (ii) the use of variational approximations to efficiently\nestimate prior hyperparameters and compute approximate posteriors. Combining\nthese two ideas results in fast and flexible methods, with computational speed\ncomparable to fast penalized regression methods such as the Lasso, and with\nsuperior prediction accuracy across a wide range of scenarios. Furthermore, we\nshow that the posterior mean from our method can be interpreted as solving a\npenalized regression problem, with the precise form of the penalty function\nbeing learned from the data by directly solving an optimization problem (rather\nthan being tuned by cross-validation). Our methods are implemented in an R\npackage, mr.ash.alpha, available from\nhttps://github.com/stephenslab/mr.ash.alpha"}, "http://arxiv.org/abs/2209.13117": {"title": "Consistent Covariance estimation for stratum imbalances under minimization method for covariate-adaptive randomization", "link": "http://arxiv.org/abs/2209.13117", "description": "Pocock and Simon's minimization method is a popular approach for\ncovariate-adaptive randomization in clinical trials. Valid statistical\ninference with data collected under the minimization method requires the\nknowledge of the limiting covariance matrix of within-stratum imbalances, whose\nexistence is only recently established. In this work, we propose a\nbootstrap-based estimator for this limit and establish its consistency, in\nparticular, by Le Cam's third lemma. 
As an application, we consider in\nsimulation studies adjustments to existing robust tests for treatment effects\nwith survival data by the proposed estimator. It shows that the adjusted tests\nachieve a size close to the nominal level, and unlike other designs, the robust\ntests without adjustment may have an asymptotic size inflation issue under the\nminimization method."}, "http://arxiv.org/abs/2209.15224": {"title": "Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models", "link": "http://arxiv.org/abs/2209.15224", "description": "Unsupervised learning has been widely used in many real-world applications.\nOne of the simplest and most important unsupervised learning models is the\nGaussian mixture model (GMM). In this work, we study the multi-task learning\nproblem on GMMs, which aims to leverage potentially similar GMM parameter\nstructures among tasks to obtain improved learning performance compared to\nsingle-task learning. We propose a multi-task GMM learning procedure based on\nthe EM algorithm that not only can effectively utilize unknown similarity\nbetween related tasks but is also robust against a fraction of outlier tasks\nfrom arbitrary distributions. The proposed procedure is shown to achieve\nminimax optimal rate of convergence for both parameter estimation error and the\nexcess mis-clustering error, in a wide range of regimes. Moreover, we\ngeneralize our approach to tackle the problem of transfer learning for GMMs,\nwhere similar theoretical results are derived. Finally, we demonstrate the\neffectiveness of our methods through simulations and real data examples. To the\nbest of our knowledge, this is the first work studying multi-task and transfer\nlearning on GMMs with theoretical guarantees."}, "http://arxiv.org/abs/2212.06228": {"title": "LRD spectral analysis of multifractional functional time series on manifolds", "link": "http://arxiv.org/abs/2212.06228", "description": "This paper addresses the estimation of the second-order structure of a\nmanifold cross-time random field (RF) displaying spatially varying Long Range\nDependence (LRD), adopting the functional time series framework introduced in\nRuiz-Medina (2022). Conditions for the asymptotic unbiasedness of the\nintegrated periodogram operator in the Hilbert-Schmidt operator norm are\nderived beyond structural assumptions. Weak-consistent estimation of the\nlong-memory operator is achieved under a semiparametric functional spectral\nframework in the Gaussian context. The case where the projected manifold\nprocess can display Short Range Dependence (SRD) and LRD at different manifold\nscales is also analyzed. The performance of both estimation procedures is\nillustrated in the simulation study, in the context of multifractionally\nintegrated spherical functional autoregressive-moving average (SPHARMA(p,q))\nprocesses."}, "http://arxiv.org/abs/2301.01854": {"title": "Solving The Ordinary Least Squares in Closed Form, Without Inversion or Normalization", "link": "http://arxiv.org/abs/2301.01854", "description": "By connecting the LU factorization and the Gram-Schmidt orthogonalization\nwithout any normalization, closed-forms for the coefficients of the ordinary\nleast squares estimates are presented. 
Instead of using matrix inversion\nexplicitly, each of the coefficients is expressed and computed directly as a\nlinear combination of non-normalized Gram-Schmidt vectors and the original data\nmatrix and also in terms of the upper triangular factor from LU factorization.\nThe coefficients may be computed iteratively using the backward or forward algorithms\ngiven."}, "http://arxiv.org/abs/2312.17420": {"title": "Exact Consistency Tests for Gaussian Mixture Filters using Normalized Deviation Squared Statistics", "link": "http://arxiv.org/abs/2312.17420", "description": "We consider the problem of evaluating dynamic consistency in discrete time\nprobabilistic filters that approximate stochastic system state densities with\nGaussian mixtures. Dynamic consistency means that the estimated probability\ndistributions correctly describe the actual uncertainties. As such, the problem\nof consistency testing naturally arises in applications with regards to\nestimator tuning and validation. However, due to the general complexity of the\ndensity functions involved, straightforward approaches for consistency testing\nof mixture-based estimators have remained challenging to define and implement.\nThis paper derives a new exact result for Gaussian mixture consistency testing\nwithin the framework of normalized deviation squared (NDS) statistics. It is\nshown that NDS test statistics for generic multivariate Gaussian mixture models\nexactly follow mixtures of generalized chi-square distributions, for which\nefficient computational tools are available. The accuracy and utility of the\nresulting consistency tests are numerically demonstrated on static and dynamic\nmixture estimation examples."}, "http://arxiv.org/abs/2312.17480": {"title": "Detection of evolutionary shifts in variance under an Ornstein-Uhlenbeck model", "link": "http://arxiv.org/abs/2312.17480", "description": "1. Abrupt environmental changes can lead to evolutionary shifts in not only\nmean (optimal value), but also variance of descendants in trait evolution.\nThere are some methods to detect shifts in optimal value but few studies\nconsider shifts in variance. 2. We use a multi-optima and multi-variance OU\nprocess model to describe the trait evolution process with shifts in both\noptimal value and variance and provide analysis of how the covariance between\nspecies changes when shifts in variance occur along the path. 3. We propose a\nnew method to detect the shifts in both variance and optimal values based on\nminimizing the loss function with L1 penalty. We implement our method in a new\nR package, ShiVa (Detection of evolutionary shifts in variance). 4. We conduct\nsimulations to compare our method with the two methods considering only shifts\nin optimal values (l1ou; PhylogeneticEM). Our method shows strength in\npredictive ability and includes far fewer false positive shifts in optimal\nvalue compared to other methods when shifts in variance actually exist. When\nthere are only shifts in optimal value, our method performs similarly to other\nmethods. 
We applied our method to the cordylid data, where ShiVa outperformed l1ou\nand phyloEM, exhibiting the highest log-likelihood and lowest BIC."}, "http://arxiv.org/abs/2312.17566": {"title": "Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing", "link": "http://arxiv.org/abs/2312.17566", "description": "Bayesian model-averaged hypothesis testing is an important technique in\nregression because it addresses the problem that the evidence that one variable\ndirectly affects an outcome often depends on which other variables are included\nin the model. This problem is caused by confounding and mediation, and is\npervasive in big data settings with thousands of variables. However,\nmodel-averaging is under-utilized in fields, like epidemiology, where classical\nstatistical approaches dominate. Here we show that simultaneous Bayesian and\nfrequentist model-averaged hypothesis testing is possible in large samples, for\na family of priors. We show that Bayesian model-averaged regression is a closed\ntesting procedure, and use the theory of regular variation to derive\ninterchangeable posterior odds and $p$-values that jointly control the Bayesian\nfalse discovery rate (FDR), the frequentist type I error rate, and the\nfrequentist familywise error rate (FWER). These results arise from an\nasymptotic chi-squared distribution for the model-averaged deviance, under the\nnull hypothesis. We call the approach 'Doublethink'. In a related manuscript\n(Arning, Fryer and Wilson, 2024), we apply it to discovering direct risk\nfactors for COVID-19 hospitalization in UK Biobank, and we discuss its broader\nimplications for bridging the differences between Bayesian and frequentist\nhypothesis testing."}, "http://arxiv.org/abs/2312.17716": {"title": "Dependent Random Partitions by Shrinking Toward an Anchor", "link": "http://arxiv.org/abs/2312.17716", "description": "Although exchangeable processes from Bayesian nonparametrics have been used\nas a generating mechanism for random partition models, we deviate from this\nparadigm to explicitly incorporate clustering information in the formulation of\nour random partition model. Our shrinkage partition distribution takes any\npartition distribution and shrinks its probability mass toward an anchor\npartition. We show how this provides a framework to model\nhierarchically-dependent and temporally-dependent random partitions. The\nshrinkage parameters control the degree of dependence, accommodating at its\nextremes both independence and complete equality. Since a priori knowledge of\nitems may vary, our formulation allows the degree of shrinkage toward the\nanchor to be item-specific. Our random partition model has a tractable\nnormalizing constant, which allows for standard Markov chain Monte Carlo\nalgorithms for posterior sampling. We prove intuitive theoretical properties\nfor our distribution and compare it to related partition distributions. We show\nthat our model provides better out-of-sample fit in a real data application."}, "http://arxiv.org/abs/2210.00697": {"title": "A flexible model for correlated count data, with application to multi-condition differential expression analyses of single-cell RNA sequencing data", "link": "http://arxiv.org/abs/2210.00697", "description": "Detecting differences in gene expression is an important part of single-cell\nRNA sequencing experiments, and many statistical methods have been developed\nfor this aim. 
Most differential expression analyses focus on comparing\nexpression between two groups (e.g., treatment vs. control). But there is\nincreasing interest in multi-condition differential expression analyses in\nwhich expression is measured in many conditions, and the aim is to accurately\ndetect and estimate expression differences in all conditions. We show that\ndirectly modeling single-cell RNA-seq counts in all conditions simultaneously,\nwhile also inferring how expression differences are shared across conditions,\nleads to greatly improved performance for detecting and estimating expression\ndifferences compared to existing methods. We illustrate the potential of this\nnew approach by analyzing data from a single-cell experiment studying the\neffects of cytokine stimulation on gene expression. We call our new method\n\"Poisson multivariate adaptive shrinkage\", and it is implemented in an R\npackage available online at https://github.com/stephenslab/poisson.mash.alpha."}, "http://arxiv.org/abs/2211.13383": {"title": "A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments", "link": "http://arxiv.org/abs/2211.13383", "description": "In this paper, we aim to propose a consistent non-Gaussian Bayesian filter of\nwhich the system state is a continuous function. The distributions of the true\nsystem states, and those of the system and observation noises, are only assumed\nLebesgue integrable with no prior constraints on what function classes they\nfall within. This type of filter has significant merits in both theory and\npractice, which is able to ameliorate the curse of dimensionality for the\nparticle filter, a popular non-Gaussian Bayesian filter of which the system\nstate is parameterized by discrete particles and the corresponding weights. We\nfirst propose a new type of statistics, called the generalized logarithmic\nmoments. Together with the power moments, they are used to form a density\nsurrogate, parameterized as an analytic function, to approximate the true\nsystem state. The map from the parameters of the proposed density surrogate to\nboth the power moments and the generalized logarithmic moments is proved to be\na diffeomorphism, establishing the fact that there exists a unique density\nsurrogate which satisfies both moment conditions. This diffeomorphism also\nallows us to use gradient methods to treat the convex optimization problem in\ndetermining the parameters. Last but not least, simulation results reveal the\nadvantage of using both sets of moments for estimating mixtures of complicated\ntypes of functions. A robot localization simulation is also given, as an\nengineering application to validate the proposed filtering scheme."}, "http://arxiv.org/abs/2304.03476": {"title": "Generalizing the intention-to-treat effect of an active control against placebo from historical placebo-controlled trials to an active-controlled trial: A case study of the efficacy of daily oral TDF/FTC in the HPTN 084 study", "link": "http://arxiv.org/abs/2304.03476", "description": "In many clinical settings, an active-controlled trial design (e.g., a\nnon-inferiority or superiority design) is often used to compare an experimental\nmedicine to an active control (e.g., an FDA-approved, standard therapy). 
One\nprominent example is a recent phase 3 efficacy trial, HIV Prevention Trials\nNetwork Study 084 (HPTN 084), comparing long-acting cabotegravir, a new HIV\npre-exposure prophylaxis (PrEP) agent, to the FDA-approved daily oral tenofovir\ndisoproxil fumarate plus emtricitabine (TDF/FTC) in a population of\nheterosexual women in 7 African countries. One key complication of interpreting\nstudy results in an active-controlled trial like HPTN 084 is that the placebo\narm is not present and the efficacy of the active control (and hence the\nexperimental drug) compared to the placebo can only be inferred by leveraging\nother data sources. In this article, we study statistical inference for the\nintention-to-treat (ITT) effect of the active control using relevant historical\nplacebo-controlled trials data under the potential outcomes (PO) framework. We\nhighlight the role of adherence and unmeasured confounding, discuss in detail\nidentification assumptions and two modes of inference (point versus partial\nidentification), propose estimators under identification assumptions permitting\npoint identification, and lay out sensitivity analyses needed to relax\nidentification assumptions. We applied our framework to estimating the\nintention-to-treat effect of daily oral TDF/FTC versus placebo in HPTN 084\nusing data from an earlier Phase 3, placebo-controlled trial of daily oral\nTDF/FTC (Partners PrEP)."}, "http://arxiv.org/abs/2305.08284": {"title": "Model-based standardization using multiple imputation", "link": "http://arxiv.org/abs/2305.08284", "description": "When studying the association between treatment and a clinical outcome, a\nparametric multivariable model of the conditional outcome expectation is often\nused to adjust for covariates. The treatment coefficient of the outcome model\ntargets a conditional treatment effect. Model-based standardization is\ntypically applied to average the model predictions over the target covariate\ndistribution, and generate a covariate-adjusted estimate of the marginal\ntreatment effect. The standard approach to model-based standardization involves\nmaximum-likelihood estimation and use of the non-parametric bootstrap. We\nintroduce a novel, general-purpose, model-based standardization method based on\nmultiple imputation that is easily applicable when the outcome model is a\ngeneralized linear model. We term our proposed approach multiple imputation\nmarginalization (MIM). MIM consists of two main stages: the generation of\nsynthetic datasets and their analysis. MIM accommodates a Bayesian statistical\nframework, which naturally allows for the principled propagation of\nuncertainty, integrates the analysis into a probabilistic framework, and allows\nfor the incorporation of prior evidence. We conduct a simulation study to\nbenchmark the finite-sample performance of MIM in conjunction with a parametric\noutcome model. The simulations provide proof-of-principle in scenarios with\nbinary outcomes, continuous-valued covariates, a logistic outcome model and the\nmarginal log odds ratio as the target effect measure. 
When parametric modeling\nassumptions hold, MIM yields unbiased estimation in the target covariate\ndistribution, valid coverage rates, and similar precision and efficiency to\nthe standard approach to model-based standardization."}, "http://arxiv.org/abs/2401.00097": {"title": "Recursive identification with regularization and on-line hyperparameters estimation", "link": "http://arxiv.org/abs/2401.00097", "description": "This paper presents a regularized recursive identification algorithm with\nsimultaneous on-line estimation of both the model parameters and the algorithm's\nhyperparameters. A new kernel is proposed to facilitate the algorithm\ndevelopment. The performance of this novel scheme is compared with that of the\nrecursive least-squares algorithm in simulation."}, "http://arxiv.org/abs/2401.00104": {"title": "Causal State Distillation for Explainable Reinforcement Learning", "link": "http://arxiv.org/abs/2401.00104", "description": "Reinforcement learning (RL) is a powerful technique for training intelligent\nagents, but understanding why these agents make specific decisions can be quite\nchallenging. This lack of transparency in RL models has been a long-standing\nproblem, making it difficult for users to grasp the reasons behind an agent's\nbehaviour. Various approaches have been explored to address this problem, with\none promising avenue being reward decomposition (RD). RD is appealing as it\nsidesteps some of the concerns associated with other methods that attempt to\nrationalize an agent's behaviour in a post-hoc manner. RD works by exposing\nvarious facets of the rewards that contribute to the agent's objectives during\ntraining. However, RD alone has limitations as it primarily offers insights\nbased on sub-rewards and does not delve into the intricate cause-and-effect\nrelationships that occur within an RL agent's neural model. In this paper, we\npresent an extension of RD that goes beyond sub-rewards to provide more\ninformative explanations. Our approach is centred on a causal learning\nframework that leverages information-theoretic measures for explanation\nobjectives that encourage three crucial properties of causal factors:\n\\emph{causal sufficiency}, \\emph{sparseness}, and \\emph{orthogonality}. These\nproperties help us distill the cause-and-effect relationships between the\nagent's states and actions or rewards, allowing for a deeper understanding of\nits decision-making processes. Our framework is designed to generate local\nexplanations and can be applied to a wide range of RL tasks with multiple\nreward channels. Through a series of experiments, we demonstrate that our\napproach offers more meaningful and insightful explanations for the agent's\naction selections."}, "http://arxiv.org/abs/2401.00139": {"title": "Is Knowledge All Large Language Models Needed for Causal Reasoning?", "link": "http://arxiv.org/abs/2401.00139", "description": "This paper explores the causal reasoning of large language models (LLMs) to\nenhance their interpretability and reliability in advancing artificial\nintelligence. Despite the proficiency of LLMs in a range of tasks, their\npotential for understanding causality requires further exploration. We propose\na novel causal attribution model that utilizes \"do-operators\" for constructing\ncounterfactual scenarios, allowing us to systematically quantify the influence\nof input numerical data and LLMs' pre-existing knowledge on their causal\nreasoning processes. 
Our newly developed experimental setup assesses LLMs'\nreliance on contextual information and inherent knowledge across various\ndomains. Our evaluation reveals that LLMs' causal reasoning ability depends on\nthe context and domain-specific knowledge provided, and supports the argument\nthat \"knowledge is, indeed, what LLMs principally require for sound causal\nreasoning\". On the contrary, in the absence of knowledge, LLMs still maintain a\ndegree of causal reasoning using the available numerical data, albeit with\nlimitations in the calculations."}, "http://arxiv.org/abs/2401.00196": {"title": "Bayesian principal stratification with longitudinal data and truncation by death", "link": "http://arxiv.org/abs/2401.00196", "description": "In many causal studies, outcomes are censored by death, in the sense that\nthey are neither observed nor defined for units who die. In such studies, the\nfocus is usually on the stratum of always survivors up to a single fixed time\ns. Building on a recent strand of the literature, we propose an extended\nframework for the analysis of longitudinal studies, where units can die at\ndifferent time points, and the main endpoints are observed and well defined\nonly up to the death time. We develop a Bayesian longitudinal principal\nstratification framework, where units are cross classified according to the\nlongitudinal death status. Under this framework, the focus is on causal effects\nfor the principal strata of units that would be alive up to a time point s\nirrespective of their treatment assignment, where these strata may vary as a\nfunction of s. We can get precious insights into the effects of treatment by\ninspecting the distribution of baseline characteristics within each\nlongitudinal principal stratum, and by investigating the time trend of both\nprincipal stratum membership and survivor-average causal effects. We illustrate\nour approach for the analysis of a longitudinal observational study aimed to\nassess, under the assumption of strong ignorability of treatment assignment,\nthe causal effects of a policy promoting start ups on firms survival and hiring\npolicy, where firms hiring status is censored by death."}, "http://arxiv.org/abs/2401.00245": {"title": "Alternative Approaches for Computing Highest-Density Regions", "link": "http://arxiv.org/abs/2401.00245", "description": "Many statistical problems require estimating a density function, say $f$,\nfrom data samples. In this work, for example, we are interested in\nhighest-density regions (HDRs), i.e., minimum volume sets that contain a given\nprobability. HDRs are typically computed using a density quantile approach,\nwhich, in the case of unknown densities, involves their estimation. This task\nturns out to be far from trivial, especially over increased dimensions and when\ndata are sparse and exhibit complex structures (e.g., multimodalities or\nparticular dependencies). We address this challenge by exploring alternative\napproaches to build HDRs that overcome direct (multivariate) density\nestimation. First, we generalize the density quantile method, currently\nimplementable on the basis of a consistent estimator of the density, to\n$neighbourhood$ measures, i.e., measures that preserve the order induced in the\nsample by $f$. Second, we discuss a number of suitable probabilistic- and\ndistance-based measures such as the $k$-nearest neighbourhood Euclidean\ndistance. 
Third, motivated by the ubiquitous role of $copula$ modeling in\nmodern statistics, we explore its use in the context of probabilistic-based\nmeasures. An extensive comparison among the introduced measures is provided,\nand their implications for computing HDRs in real-world problems are discussed."}, "http://arxiv.org/abs/2401.00255": {"title": "Adaptive Rank-based Tests for High Dimensional Mean Problems", "link": "http://arxiv.org/abs/2401.00255", "description": "The Wilcoxon signed-rank test and the Wilcoxon-Mann-Whitney test are commonly\nemployed in one sample and two sample mean tests for one-dimensional hypothesis\nproblems. For high-dimensional mean test problems, we calculate the asymptotic\ndistribution of the maximum of rank statistics for each variable and suggest a\nmax-type test. This max-type test is then merged with a sum-type test, based on\ntheir asymptotic independence offered by stationary and strong mixing\nassumptions. Our numerical studies reveal that this combined test demonstrates\nrobustness and superiority over other methods, especially for heavy-tailed\ndistributions."}, "http://arxiv.org/abs/2401.00257": {"title": "Assessing replication success via skeptical mixture priors", "link": "http://arxiv.org/abs/2401.00257", "description": "There is a growing interest in the analysis of replication studies of\noriginal findings across many disciplines. When testing a hypothesis for an\neffect size, two Bayesian approaches stand out for their principled use of the\nBayes factor (BF), namely the replication BF and the skeptical BF. In\nparticular, the latter BF is based on the skeptical prior, which represents the\nopinion of an investigator who is unconvinced by the original findings and\nwants to challenge them. We embrace the skeptical perspective, and elaborate a\nnovel mixture prior which incorporates skepticism while at the same time\ncontrolling for prior-data conflict within the original data. Consistency\nproperties of the resulting skeptical mixture BF are provided together with an\nextensive analysis of the main features of our proposal. Finally, we apply our\nmethodology to data from the Social Sciences Replication Project. In particular\nwe show that, for some case studies where prior-data conflict is an issue, our\nmethod uses a more realistic prior and leads to evidence-classification for\nreplication success which differs from the standard skeptical approach."}, "http://arxiv.org/abs/2401.00324": {"title": "Stratified distance space improves the efficiency of sequential samplers for approximate Bayesian computation", "link": "http://arxiv.org/abs/2401.00324", "description": "Approximate Bayesian computation (ABC) methods are standard tools for\ninferring parameters of complex models when the likelihood function is\nanalytically intractable. A popular approach to improving the poor acceptance\nrate of the basic rejection sampling ABC algorithm is to use sequential Monte\nCarlo (ABC SMC) to produce a sequence of proposal distributions adapting\ntowards the posterior, instead of generating values from the prior distribution\nof the model parameters. Proposal distribution for the subsequent iteration is\ntypically obtained from a weighted set of samples, often called particles, of\nthe current iteration of this sequence. 
Current methods for constructing these\nproposal distributions treat all the particles equivalently, regardless of the\ncorresponding value generated by the sampler, which may lead to inefficiency\nwhen propagating the information across iterations of the algorithm. To improve\nsampler efficiency, we introduce a modified approach called stratified distance\nABC SMC. Our algorithm stratifies particles based on the distance between the\ncorresponding synthetic and observed data, and then constructs distinct\nproposal distributions for all the strata. Taking into account the distribution\nof distances across the particle space leads to a substantially improved\nacceptance rate of the rejection sampling. We further show that efficiency can\nbe gained by introducing a novel stopping rule for the sequential process based\non the stratified posterior samples and demonstrate these advances by several\nexamples."}, "http://arxiv.org/abs/2401.00354": {"title": "Estimation of the Emax model", "link": "http://arxiv.org/abs/2401.00354", "description": "This study focuses on the estimation of the Emax dose-response model, a\nwidely utilized framework in clinical trials, agriculture, and environmental\nexperiments. Existing challenges in obtaining maximum likelihood estimates\n(MLE) for model parameters are often ascribed to computational issues but, in\nreality, stem from the absence of MLE. Our contribution provides a new\nunderstanding and control of all the experimental situations that practitioners\nmight face, guiding them in the estimation process. We derive the exact MLE for\na three-point experimental design and we identify the two scenarios where the\nMLE fails. To address these challenges, we propose utilizing Firth's modified\nscore, providing its analytical expression as a function of the experimental\ndesign. Through a simulation study, we demonstrate that, in one of the\nproblematic cases, the Firth modification yields a finite estimate. For the\nremaining case, we introduce a design-correction strategy akin to a hypothesis\ntest."}, "http://arxiv.org/abs/2401.00395": {"title": "Energetic Variational Gaussian Process Regression for Computer Experiments", "link": "http://arxiv.org/abs/2401.00395", "description": "The Gaussian process (GP) regression model is a widely employed surrogate\nmodeling technique for computer experiments, offering precise predictions and\nstatistical inference for the computer simulators that generate experimental\ndata. Estimation and inference for GP can be performed in both frequentist and\nBayesian frameworks. In this chapter, we construct the GP model through\nvariational inference, particularly employing the recently introduced energetic\nvariational inference method by Wang et al. (2021). Adhering to the GP model\nassumptions, we derive posterior distributions for its parameters. The\nenergetic variational inference approach bridges the Bayesian sampling and\noptimization and enables approximation of the posterior distributions and\nidentification of the posterior mode. By incorporating a normal prior on the\nmean component of the GP model, we also apply shrinkage estimation to the\nparameters, facilitating mean function variable selection. 
To showcase the\neffectiveness of our proposed GP model, we present results from three benchmark\nexamples."}, "http://arxiv.org/abs/2401.00461": {"title": "A Penalized Functional Linear Cox Regression Model for Spatially-defined Environmental Exposure with an Estimated Buffer Distance", "link": "http://arxiv.org/abs/2401.00461", "description": "In environmental health research, it is of interest to understand the effect\nof the neighborhood environment on health. Researchers have shown a protective\nassociation between green space around a person's residential address and\ndepression outcomes. In measuring exposure to green space, distance buffers are\noften used. However, buffer distances differ across studies. Typically, the\nbuffer distance is determined by researchers a priori. It is unclear how to\nidentify an appropriate buffer distance for exposure assessment. To address the\ngeographic uncertainty problem for exposure assessment, we present a domain\nselection algorithm based on the penalized functional linear Cox regression\nmodel. The theoretical properties of our proposed method are studied and\nsimulation studies are conducted to evaluate the finite sample performance of our\nmethod. The proposed method is illustrated in a study of associations of green\nspace exposure with depression and/or antidepressant use in the Nurses' Health\nStudy."}, "http://arxiv.org/abs/2401.00517": {"title": "Detecting Imprinting and Maternal Effects Using Monte Carlo Expectation Maximization Algorithm", "link": "http://arxiv.org/abs/2401.00517", "description": "Numerous statistical methods have been developed to explore genomic\nimprinting and maternal effects, which are causes of parent-of-origin patterns\nin complex human diseases. However, most of them either only model one of these\ntwo confounded epigenetic effects, or make strong yet unrealistic assumptions\nabout the population to avoid over-parameterization. A recent partial\nlikelihood method (LIME) can identify both epigenetic effects based on\ncase-control family data without those assumptions. Theoretical and empirical\nstudies have shown its validity and robustness. However, because LIME obtains\nparameter estimation by maximizing partial likelihood, it is interesting to\ncompare its efficiency with the full likelihood maximizer. To overcome the\ndifficulty in over-parameterization when using full likelihood, in this study\nwe propose a Monte Carlo Expectation Maximization (MCEM) method to detect\nimprinting and maternal effects jointly. Those unknown mating type\nprobabilities, the nuisance parameters, can be considered as latent variables\nin the EM algorithm. Monte Carlo samples are used to numerically approximate the\nexpectation function that cannot be solved algebraically. Our simulation\nresults show that though this MCEM algorithm takes longer computational time,\nand can give higher bias in some simulations compared to LIME, it can generally\ndetect both epigenetic effects with higher power and smaller standard error,\nwhich demonstrates that it can be a good complement to the LIME method."}, "http://arxiv.org/abs/2401.00520": {"title": "Monte Carlo Expectation-Maximization algorithm to detect imprinting and maternal effects for discordant sib-pair data", "link": "http://arxiv.org/abs/2401.00520", "description": "Numerous statistical methods have been developed to explore genomic\nimprinting and maternal effects, which are causes of parent-of-origin patterns\nin complex human diseases. 
Most of the methods, however, either only model one\nof these two confounded epigenetic effects, or make strong yet unrealistic\nassumptions about the population to avoid over-parameterization. A recent\npartial likelihood method (LIMEDSP) can identify both epigenetic effects based\non discordant sib-pair family data without those assumptions. Theoretical and\nempirical studies have shown its validity and robustness. As the LIMEDSP method\nobtains parameter estimation by maximizing partial likelihood, it is\ninteresting to compare its efficiency with the full likelihood maximizer. To\novercome the difficulty in over-parameterization when using the full likelihood,\nthis study proposes a discordant sib-pair design-based Monte Carlo Expectation\nMaximization (MCEMDSP) method to detect imprinting and maternal effects\njointly. Those unknown mating type probabilities, the nuisance parameters, are\nconsidered as latent variables in the EM algorithm. Monte Carlo samples are used to\nnumerically approximate the expectation function that cannot be solved\nalgebraically. Our simulation results show that though this MCEMDSP algorithm\ntakes longer computation time, it can generally detect both epigenetic effects\nwith higher power, which demonstrates that it can be a good complement to the\nLIMEDSP method."}, "http://arxiv.org/abs/2401.00540": {"title": "Study Duration Prediction for Clinical Trials with Time-to-Event Endpoints Using Mixture Distributions Accounting for Heterogeneous Population", "link": "http://arxiv.org/abs/2401.00540", "description": "In the era of precision medicine, more and more clinical trials are now\ndriven or guided by biomarkers, which are patient characteristics objectively\nmeasured and evaluated as indicators of normal biological processes, pathogenic\nprocesses, or pharmacologic responses to therapeutic interventions. With the\noverarching objective to optimize and personalize disease management,\nbiomarker-guided clinical trials increase the efficiency by appropriately\nutilizing prognostic or predictive biomarkers in the design. However, the\nefficiency gain is often not quantitatively compared to the traditional\nall-comers design, in which a faster enrollment rate is expected (e.g. due to\nno restriction to biomarker-positive patients) potentially leading to a shorter\nduration. To accurately predict biomarker-guided trial duration, we propose a\ngeneral framework using mixture distributions accounting for a heterogeneous\npopulation. Extensive simulations are performed to evaluate the impact of a\nheterogeneous population and the dynamics of biomarker characteristics and\ndisease on the study duration. Several influential parameters including median\nsurvival time, enrollment rate, biomarker prevalence and effect size are\nidentified. Re-assessments of two publicly available trials are conducted to\nempirically validate the prediction accuracy and to demonstrate the practical\nutility. The R package \\emph{detest} is developed to implement the proposed\nmethod and is publicly available on CRAN."}, "http://arxiv.org/abs/2401.00566": {"title": "Change point analysis -- the empirical Hankel transform approach", "link": "http://arxiv.org/abs/2401.00566", "description": "In this study, we introduce the first-of-its-kind class of tests for\ndetecting change points in the distribution of a sequence of independent\nmatrix-valued random variables. The tests are constructed using the weighted\nsquare integral difference of the empirical orthogonal Hankel transforms. 
The\ntest statistics have a convenient closed-form expression, making them easy to\nimplement in practice. We present their limiting properties and demonstrate\ntheir quality through an extensive simulation study. We utilize these tests for\nchange point detection in cryptocurrency markets to showcase their practical\nuse. The detection of change points in this context can have various\napplications in constructing and analyzing novel trading systems."}, "http://arxiv.org/abs/2401.00568": {"title": "Extrapolation of Relative Treatment Effects using Change-point Survival Models", "link": "http://arxiv.org/abs/2401.00568", "description": "Introduction: Modelling of relative treatment effects is an important aspect\nto consider when extrapolating the long-term survival outcomes of treatments.\nFlexible parametric models offer the ability to accurately model the observed\ndata; however, the extrapolated relative treatment effects and subsequent\nsurvival function may lack face validity. Methods: We investigate the ability\nof change-point survival models to estimate changes in the relative treatment\neffects, specifically treatment delay, loss of treatment effects and converging\nhazards. These models are implemented using standard Bayesian statistical\nsoftware and propagate the uncertainty associated with all model parameters,\nincluding the change-point location. A simulation study was conducted to assess\nthe predictive performance of these models compared with other parametric\nsurvival models. Change-point survival models were applied to three datasets,\ntwo of which were used in previous health technology assessments. Results:\nChange-point survival models typically provided improved extrapolated survival\npredictions, particularly when the changes in relative treatment effects are\nlarge. When applied to the real-world examples, they provided a good fit to the\nobserved data and in some situations produced more clinically plausible\nextrapolations than those generated by flexible spline models. Change-point\nmodels also provided support to a previously implemented modelling approach,\nwhich was justified by visual inspection only and not by goodness of fit to the\nobserved data. Conclusions: We believe change-point survival models offer the\nability to flexibly model observed data while also modelling and investigating\nclinically plausible scenarios with respect to the relative treatment effects."}, "http://arxiv.org/abs/2401.00624": {"title": "Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures", "link": "http://arxiv.org/abs/2401.00624", "description": "We propose a novel data-driven semi-confirmatory factor analysis (SCFA) model\nthat addresses the absence of model specification and handles the estimation\nand inference tasks with high-dimensional data. Confirmatory factor analysis\n(CFA) is a prevalent and pivotal technique for statistically validating the\ncovariance structure of latent common factors derived from multiple observed\nvariables. In contrast to other factor analysis methods, CFA offers a flexible\ncovariance modeling approach for common factors, enhancing the interpretability\nof relationships between the common factors, as well as between common factors\nand observations. 
However, the application of classic CFA models faces dual\nbarriers: the lack of a prerequisite specification of \"non-zero loadings\" or\nfactor membership (i.e., categorizing the observations into distinct common\nfactors), and the formidable computational burden in high-dimensional scenarios\nwhere the number of observed variables surpasses the sample size. To bridge\nthese two gaps, we propose the SCFA model by integrating the underlying\nhigh-dimensional covariance structure of observed variables into the CFA model.\nAdditionally, we offer computationally efficient solutions (i.e., closed-form\nuniformly minimum variance unbiased estimators) and ensure accurate statistical\ninference through closed-form exact variance estimators for all model\nparameters and factor scores. Through an extensive simulation analysis\nbenchmarking against standard computational packages, SCFA exhibits superior\nperformance in estimating model parameters and recovering factor scores, while\nsubstantially reducing the computational load, across both low- and\nhigh-dimensional scenarios. It exhibits moderate robustness to model\nmisspecification. We illustrate the practical application of the SCFA model by\nconducting factor analysis on a high-dimensional gene expression dataset."}, "http://arxiv.org/abs/2401.00634": {"title": "A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology", "link": "http://arxiv.org/abs/2401.00634", "description": "Accounting for exposure measurement errors has been recognized as a crucial\nproblem in environmental epidemiology for over two decades. Bayesian\nhierarchical models offer a coherent probabilistic framework for evaluating\nassociations between environmental exposures and health effects, which take\ninto account exposure measurement errors introduced by uncertainty in the\nestimated exposure as well as spatial misalignment between the exposure and\nhealth outcome data. While two-stage Bayesian analyses are often regarded as a\ngood alternative to fully Bayesian analyses when joint estimation is not\nfeasible, there has been minimal research on how to properly propagate\nuncertainty from the first-stage exposure model to the second-stage health\nmodel, especially in the case of a large number of participant locations along\nwith spatially correlated exposures. We propose a scalable two-stage Bayesian\napproach, called a sparse multivariate normal (sparse MVN) prior approach,\nbased on the Vecchia approximation for assessing associations between exposure\nand health outcomes in environmental epidemiology. We compare its performance\nwith existing approaches through simulation. Our sparse MVN prior approach\nshows comparable performance with the fully Bayesian approach, which is a gold\nstandard but is impossible to implement in some cases. We investigate the\nassociation between source-specific exposures and pollutant (nitrogen dioxide\n(NO$_2$))-specific exposures and birth outcomes for 2012 in Harris County,\nTexas, using several approaches, including the newly developed method."}, "http://arxiv.org/abs/2401.00649": {"title": "Linear Model and Extensions", "link": "http://arxiv.org/abs/2401.00649", "description": "I developed the lecture notes based on my ``Linear Model'' course at the\nUniversity of California Berkeley over the past seven years. This book provides\nan intermediate-level introduction to the linear model. It balances rigorous\nproofs and heuristic arguments. 
This book provides R code to replicate all\nsimulation studies and case studies."}, "http://arxiv.org/abs/2401.00667": {"title": "Channelling Multimodality Through a Unimodalizing Transport: Warp-U Sampler and Stochastic Bridge Sampling", "link": "http://arxiv.org/abs/2401.00667", "description": "Monte Carlo integration is fundamental in scientific and statistical\ncomputation, but requires reliable samples from the target distribution, which\nposes a substantial challenge in the case of multi-modal distributions.\nExisting methods often involve time-consuming tuning, and typically lack\ntailored estimators for efficient use of the samples. This paper adapts the\nWarp-U transformation [Wang et al., 2022] to form a multi-modal sampling strategy\ncalled Warp-U sampling. It constructs a stochastic map to transport a\nmulti-modal density into a uni-modal one, and subsequently inverts the\ntransport but with new stochasticity injected. For efficient use of the samples\nfor normalising constant estimation, we propose (i) an unbiased estimation\nscheme based on coupled chains, where the Warp-U sampling is used to reduce the\ncoupling time; and (ii) a stochastic Warp-U bridge sampling estimator, which\nimproves on its deterministic counterpart given in Wang et al. [2022]. Our overall\napproach requires less tuning and is easier to apply than common alternatives.\nTheoretically, we establish the ergodicity of our sampling algorithm and that\nour stochastic Warp-U bridge sampling estimator has greater (asymptotic)\nprecision per CPU second compared to the Warp-U bridge estimator of Wang et al.\n[2022] under practical conditions. The advantages and current limitations of\nour approach are demonstrated through simulation studies and an application to\nexoplanet detection."}, "http://arxiv.org/abs/2401.00800": {"title": "Factor Importance Ranking and Selection using Total Indices", "link": "http://arxiv.org/abs/2401.00800", "description": "Factor importance measures the impact of each feature on output prediction\naccuracy. Many existing works focus on model-based importance, but an\nimportant feature in one learning algorithm may hold little significance in\nanother model. Hence, a factor importance measure ought to characterize the\nfeature's predictive potential without relying on a specific prediction\nalgorithm. Such algorithm-agnostic importance is termed intrinsic importance\nin Williamson et al. (2023), but their estimator again requires model fitting.\nTo bypass the modeling step, we present the equivalence between predictiveness\npotential and total Sobol' indices from global sensitivity analysis, and\nintroduce a novel consistent estimator that can be directly estimated from\nnoisy data. Integrating with forward selection and backward elimination gives\nrise to FIRST, Factor Importance Ranking and Selection using Total (Sobol')\nindices. Extensive simulations are provided to demonstrate the effectiveness of\nFIRST on regression and binary classification problems, and a clear advantage\nover the state-of-the-art methods."}, "http://arxiv.org/abs/2401.00840": {"title": "Bayesian Effect Selection in Additive Models with an Application to Time-to-Event Data", "link": "http://arxiv.org/abs/2401.00840", "description": "Accurately selecting and estimating smooth functional effects in additive\nmodels with potentially many functions is a challenging task. 
We introduce a\nnovel Demmler-Reinsch basis expansion to model the functional effects that\nallows us to orthogonally decompose an effect into its linear and nonlinear\nparts. We show that our representation allows to consistently estimate both\nparts as opposed to commonly employed mixed model representations. Equipping\nthe reparameterized regression coefficients with normal beta prime spike and\nslab priors allows us to determine whether a continuous covariate has a linear,\na nonlinear or no effect at all. We provide new theoretical results for the\nprior and a compelling explanation for its superior Markov chain Monte Carlo\nmixing performance compared to the spike-and-slab group lasso. We establish an\nefficient posterior estimation scheme and illustrate our approach along effect\nselection on the hazard rate of a time-to-event response in the geoadditive Cox\nregression model in simulations and data on survival with leukemia."}, "http://arxiv.org/abs/2205.00605": {"title": "Cluster-based Regression using Variational Inference and Applications in Financial Forecasting", "link": "http://arxiv.org/abs/2205.00605", "description": "This paper describes an approach to simultaneously identify clusters and\nestimate cluster-specific regression parameters from the given data. Such an\napproach can be useful in learning the relationship between input and output\nwhen the regression parameters for estimating output are different in different\nregions of the input space. Variational Inference (VI), a machine learning\napproach to obtain posterior probability densities using optimization\ntechniques, is used to identify clusters of explanatory variables and\nregression parameters for each cluster. From these results, one can obtain both\nthe expected value and the full distribution of predicted output. Other\nadvantages of the proposed approach include the elegant theoretical solution\nand clear interpretability of results. The proposed approach is well-suited for\nfinancial forecasting where markets have different regimes (or clusters) with\ndifferent patterns and correlations of market changes in each regime. In\nfinancial applications, knowledge about such clusters can provide useful\ninsights about portfolio performance and identify the relative importance of\nvariables in different market regimes. An illustrative example of predicting\none-day S&P change is considered to illustrate the approach and compare the\nperformance of the proposed approach with standard regression without clusters.\nDue to the broad applicability of the problem, its elegant theoretical\nsolution, and the computational efficiency of the proposed algorithm, the\napproach may be useful in a number of areas extending beyond the financial\ndomain."}, "http://arxiv.org/abs/2209.02008": {"title": "Parallel sampling of decomposable graphs using Markov chain on junction trees", "link": "http://arxiv.org/abs/2209.02008", "description": "Bayesian inference for undirected graphical models is mostly restricted to\nthe class of decomposable graphs, as they enjoy a rich set of properties making\nthem amenable to high-dimensional problems. While parameter inference is\nstraightforward in this setup, inferring the underlying graph is a challenge\ndriven by the computational difficulty in exploring the space of decomposable\ngraphs. This work makes two contributions to address this problem. First, we\nprovide sufficient and necessary conditions for when multi-edge perturbations\nmaintain decomposability of the graph. 
Using these, we characterize a simple\nclass of partitions that efficiently classify all edge perturbations by whether\nthey maintain decomposability. Second, we propose a novel parallel\nnon-reversible Markov chain Monte Carlo sampler for distributions over junction\ntree representations of the graph. At every step, the parallel sampler simultaneously executes\nall edge perturbations within a partition. Through simulations,\nwe demonstrate the efficiency of our new edge perturbation conditions and class\nof partitions. We find that our parallel sampler yields improved mixing\nproperties in comparison to the single-move variant, and outperforms current\nstate-of-the-art methods in terms of accuracy and computational efficiency.\nThe implementation of our work is available in the Python package parallelDG."}, "http://arxiv.org/abs/2302.00293": {"title": "A Survey of Methods, Challenges and Perspectives in Causality", "link": "http://arxiv.org/abs/2302.00293", "description": "Deep Learning models have shown success in a large variety of tasks by\nextracting correlation patterns from high-dimensional data but still struggle\nwhen generalizing out of their initial distribution. As causal engines aim to\nlearn mechanisms independent from a data distribution, combining Deep Learning\nwith Causality can have a great impact on the two fields. In this paper, we\nfurther motivate this assumption. We perform an extensive overview of the\ntheories and methods for Causality from different perspectives, with an\nemphasis on Deep Learning and the challenges met by the two domains. We show\nearly attempts to bring the fields together and the possible perspectives for\nthe future. We finish by providing a large variety of applications for\ntechniques from Causality."}, "http://arxiv.org/abs/2304.14954": {"title": "A Class of Dependent Random Distributions Based on Atom Skipping", "link": "http://arxiv.org/abs/2304.14954", "description": "We propose the Plaid Atoms Model (PAM), a novel Bayesian nonparametric model\nfor grouped data. Founded on an idea of `atom skipping', PAM is part of a\nwell-established category of models that generate dependent random\ndistributions and clusters across multiple groups. Atom skipping refers to\nstochastically assigning 0 weights to atoms in an infinite mixture. Deploying\natom skipping across groups, PAM produces a dependent clustering pattern with\noverlapping and non-overlapping clusters across groups. As a result,\ninterpretable posterior inference is possible, such as reporting the posterior\nprobability of a cluster being exclusive to a single group or shared among a\nsubset of groups. We discuss the theoretical properties of the proposed and\nrelated models. Minor extensions of the proposed model for multivariate or\ncount data are presented. Simulation studies and applications using real-world\ndatasets illustrate the performance of the new models with comparison to\nexisting models."}, "http://arxiv.org/abs/2305.09126": {"title": "Transfer Learning for Causal Effect Estimation", "link": "http://arxiv.org/abs/2305.09126", "description": "We present a Transfer Causal Learning (TCL) framework when target and source\ndomains share the same covariate/feature spaces, aiming to improve causal\neffect estimation accuracy in limited data. Limited data is very common in\nmedical applications, where some rare medical conditions, such as sepsis, are\nof interest. 
Our proposed method, named \\texttt{$\\ell_1$-TCL}, incorporates\n$\\ell_1$ regularized TL for nuisance models (e.g., propensity score model); the\nTL estimator of the nuisance parameters is plugged into downstream average\ncausal/treatment effect estimators (e.g., inverse probability weighted\nestimator). We establish non-asymptotic recovery guarantees for the\n\\texttt{$\\ell_1$-TCL} with generalized linear model (GLM) under the sparsity\nassumption in the high-dimensional setting, and demonstrate the empirical\nbenefits of \\texttt{$\\ell_1$-TCL} through extensive numerical simulation for\nGLM and recent neural network nuisance models. Our method is subsequently\nextended to real data and generates meaningful insights consistent with medical\nliterature, a case where all baseline methods fail."}, "http://arxiv.org/abs/2305.12789": {"title": "The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data", "link": "http://arxiv.org/abs/2305.12789", "description": "In real-world scenarios, data collection limitations often result in\npartially labeled datasets, leading to difficulties in drawing reliable causal\ninferences. Traditional approaches in the semi-supervised (SS) and missing data\nliterature may not adequately handle these complexities, leading to biased\nestimates. To address these challenges, our paper introduces a novel decaying\nmissing-at-random (decaying MAR) framework. This framework tackles missing\noutcomes in high-dimensional settings and accounts for selection bias arising\nfrom the dependence of labeling probability on covariates. Notably, we relax\nthe need for a positivity condition, commonly required in the missing data\nliterature, and allow uniform decay of labeling propensity scores with sample\nsize, accommodating faster growth of unlabeled data. Our decaying MAR framework\nenables easy rate double-robust (DR) estimation of average treatment effects,\nsucceeding where other methods fail, even with correctly specified nuisance\nmodels. Additionally, it facilitates asymptotic normality under model\nmisspecification. To achieve this, we propose adaptive new targeted\nbias-reducing nuisance estimators and asymmetric cross-fitting, along with a\nnovel semi-parametric approach that fully leverages large volumes of unlabeled\ndata. Our approach requires weak sparsity conditions. Numerical results confirm\nour estimators' efficacy and versatility, addressing selection bias and model\nmisspecification."}, "http://arxiv.org/abs/2401.00872": {"title": "On discriminating between Libby-Novick generalized beta and Kumaraswamy distributions: theory and methods", "link": "http://arxiv.org/abs/2401.00872", "description": "In fitting a continuous bounded data, the generalized beta (and several\nvariants of this distribution) and the two-parameter Kumaraswamy (KW)\ndistributions are the two most prominent univariate continuous distributions\nthat come to our mind. There are some common features between these two rival\nprobability models and to select one of them in a practical situation can be of\ngreat interest. Consequently, in this paper, we discuss various methods of\nselection between the generalized beta proposed by Libby and Novick (1982)\n(LNGB) and the KW distributions, such as the criteria based on probability of\ncorrect selection which is an improvement over the likelihood ratio statistic\napproach, and also based on pseudo-distance measures. 
We obtain an\napproximation for the probability of correct selection under the hypotheses\nHLNGB and HKW , and select the model that maximizes it. However, our proposal\nis more appealing in the sense that we provide the comparison study for the\nLNGB distribution that subsumes both types of classical beta and exponentiated\ngenerators (see, for details, Cordeiro et al. 2014; Libby and Novick 1982)\nwhich can be a natural competitor of a two-parameter KW distribution in an\nappropriate scenario."}, "http://arxiv.org/abs/2401.00945": {"title": "A review of Monte Carlo-based versions of the EM algorithm", "link": "http://arxiv.org/abs/2401.00945", "description": "The EM algorithm is a powerful tool for maximum likelihood estimation with\nmissing data. In practice, the calculations required for the EM algorithm are\noften intractable. We review numerous methods to circumvent this\nintractability, all of which are based on Monte Carlo simulation. We focus our\nattention on the Monte Carlo EM (MCEM) algorithm and its various\nimplementations. We also discuss some related methods like stochastic\napproximation and Monte Carlo maximum likelihood. Generating the Monte Carlo\nsamples necessary for these methods is, in general, a hard problem. As such, we\nreview several simulation strategies which can be used to address this\nchallenge.\n\nGiven the wide range of methods available for approximating the EM, it can be\nchallenging to select which one to use. We review numerous comparisons between\nthese methods from a wide range of sources, and offer guidance on synthesizing\nthe findings. Finally, we give some directions for future research to fill\nimportant gaps in the existing literature on the MCEM algorithm and related\nmethods."}, "http://arxiv.org/abs/2401.00987": {"title": "Inverting estimating equations for causal inference on quantiles", "link": "http://arxiv.org/abs/2401.00987", "description": "The causal inference literature frequently focuses on estimating the mean of\nthe potential outcome, whereas the quantiles of the potential outcome may carry\nimportant additional information. We propose a universal approach, based on the\ninverse estimating equations, to generalize a wide class of causal inference\nsolutions from estimating the mean of the potential outcome to its quantiles.\nWe assume that an identifying moment function is available to identify the mean\nof the threshold-transformed potential outcome, based on which a convenient\nconstruction of the estimating equation of quantiles of potential outcome is\nproposed. In addition, we also give a general construction of the efficient\ninfluence functions of the mean and quantiles of potential outcomes, and\nidentify their connection. We motivate estimators for the quantile estimands\nwith the efficient influence function, and develop their asymptotic properties\nwhen either parametric models or data-adaptive machine learners are used to\nestimate the nuisance functions. A broad implication of our results is that one\ncan rework the existing result for mean causal estimands to facilitate causal\ninference on quantiles, rather than starting from scratch. 
Our results are\nillustrated by several examples."}, "http://arxiv.org/abs/2401.01234": {"title": "Mixture cure semiparametric additive hazard models under partly interval censoring -- a penalized likelihood approach", "link": "http://arxiv.org/abs/2401.01234", "description": "Survival analysis can sometimes involve individuals who will not experience\nthe event of interest, forming what is known as the cured group. Identifying\nsuch individuals is not always possible beforehand, as they provide only\nright-censored data. Ignoring the presence of the cured group can introduce\nbias in the final model. This paper presents a method for estimating a\nsemiparametric additive hazards model that accounts for the cured fraction.\nUnlike regression coefficients in a hazard ratio model, those in an additive\nhazard model measure hazard differences. The proposed method uses a primal-dual\ninterior point algorithm to obtain constrained maximum penalized likelihood\nestimates of the model parameters, including the regression coefficients and\nthe baseline hazard, subject to certain non-negativity constraints."}, "http://arxiv.org/abs/2401.01264": {"title": "Multiple Randomization Designs: Estimation and Inference with Interference", "link": "http://arxiv.org/abs/2401.01264", "description": "Classical designs of randomized experiments, going back to Fisher and Neyman\nin the 1930s still dominate practice even in online experimentation. However,\nsuch designs are of limited value for answering standard questions in settings,\ncommon in marketplaces, where multiple populations of agents interact\nstrategically, leading to complex patterns of spillover effects. In this paper,\nwe discuss new experimental designs and corresponding estimands to account for\nand capture these complex spillovers. We derive the finite-sample properties of\ntractable estimators for main effects, direct effects, and spillovers, and\npresent associated central limit theorems."}, "http://arxiv.org/abs/2401.01294": {"title": "Efficient Sparse Least Absolute Deviation Regression with Differential Privacy", "link": "http://arxiv.org/abs/2401.01294", "description": "In recent years, privacy-preserving machine learning algorithms have\nattracted increasing attention because of their important applications in many\nscientific fields. However, in the literature, most privacy-preserving\nalgorithms demand learning objectives to be strongly convex and Lipschitz\nsmooth, which thus cannot cover a wide class of robust loss functions (e.g.,\nquantile/least absolute loss). In this work, we aim to develop a fast\nprivacy-preserving learning solution for a sparse robust regression problem.\nOur learning loss consists of a robust least absolute loss and an $\\ell_1$\nsparse penalty term. To fast solve the non-smooth loss under a given privacy\nbudget, we develop a Fast Robust And Privacy-Preserving Estimation (FRAPPE)\nalgorithm for least absolute deviation regression. Our algorithm achieves a\nfast estimation by reformulating the sparse LAD problem as a penalized least\nsquare estimation problem and adopts a three-stage noise injection to guarantee\nthe $(\\epsilon,\\delta)$-differential privacy. We show that our algorithm can\nachieve better privacy and statistical accuracy trade-off compared with the\nstate-of-the-art privacy-preserving regression algorithms. 
In the end, we\nconduct experiments to verify the efficiency of our proposed FRAPPE algorithm."}, "http://arxiv.org/abs/2112.07145": {"title": "Linear Discriminant Analysis with High-dimensional Mixed Variables", "link": "http://arxiv.org/abs/2112.07145", "description": "Datasets containing both categorical and continuous variables are frequently\nencountered in many areas, and with the rapid development of modern measurement\ntechnologies, the dimensions of these variables can be very high. Despite the\nrecent progress made in modelling high-dimensional data for continuous\nvariables, there is a scarcity of methods that can deal with a mixed set of\nvariables. To fill this gap, this paper develops a novel approach for\nclassifying high-dimensional observations with mixed variables. Our framework\nbuilds on a location model, in which the distributions of the continuous\nvariables conditional on categorical ones are assumed Gaussian. We overcome the\nchallenge of having to split data into exponentially many cells, or\ncombinations of the categorical variables, by kernel smoothing, and provide new\nperspectives for its bandwidth choice to ensure an analogue of Bochner's Lemma,\nwhich is different to the usual bias-variance tradeoff. We show that the two\nsets of parameters in our model can be separately estimated and provide\npenalized likelihood for their estimation. Results on the estimation accuracy\nand the misclassification rates are established, and the competitive\nperformance of the proposed classifier is illustrated by extensive simulation\nand real data studies."}, "http://arxiv.org/abs/2206.02164": {"title": "Estimating and Mitigating the Congestion Effect of Curbside Pick-ups and Drop-offs: A Causal Inference Approach", "link": "http://arxiv.org/abs/2206.02164", "description": "Curb space is one of the busiest areas in urban road networks. Especially in\nrecent years, the rapid increase of ride-hailing trips and commercial\ndeliveries has induced massive pick-ups/drop-offs (PUDOs), which occupy the\nlimited curb space that was designed and built decades ago. These PUDOs could\njam curbside utilization and disturb the mainline traffic flow, evidently\nleading to significant negative societal externalities. However, there is a\nlack of an analytical framework that rigorously quantifies and mitigates the\ncongestion effect of PUDOs in the system view, particularly with little data\nsupport and involvement of confounding effects. To bridge this research gap,\nthis paper develops a rigorous causal inference approach to estimate the\ncongestion effect of PUDOs on general regional networks. A causal graph is set\nto represent the spatio-temporal relationship between PUDOs and traffic speed,\nand a double and separated machine learning (DSML) method is proposed to\nquantify how PUDOs affect traffic congestion. Additionally, a re-routing\nformulation is developed and solved to encourage passenger walking and traffic\nflow re-routing to achieve system optimization. Numerical experiments are\nconducted using real-world data in the Manhattan area. On average, 100\nadditional units of PUDOs in a region could reduce the traffic speed by 3.70\nand 4.54 mph on weekdays and weekends, respectively. Re-routing trips with\nPUDOs on curb space could respectively reduce the system-wide total travel time\nby 2.44% and 2.12% in Midtown and Central Park on weekdays. 
Sensitivity\nanalysis is also conducted to demonstrate the effectiveness and robustness of\nthe proposed framework."}, "http://arxiv.org/abs/2208.00124": {"title": "Physical Parameter Calibration", "link": "http://arxiv.org/abs/2208.00124", "description": "Computer simulation models are widely used to study complex physical systems.\nA related fundamental topic is the inverse problem, also called calibration,\nwhich aims at learning about the values of parameters in the model based on\nobservations. In most real applications, the parameters have specific physical\nmeanings, and we call them physical parameters. To recognize the true\nunderlying physical system, we need to effectively estimate such parameters.\nHowever, existing calibration methods cannot do this well due to the model\nidentifiability problem. This paper proposes a semi-parametric model, called\nthe discrepancy decomposition model, to describe the discrepancy between the\nphysical system and the computer model. The proposed model possesses a clear\ninterpretation, and more importantly, it is identifiable under mild conditions.\nUnder this model, we present estimators of the physical parameters and the\ndiscrepancy, and then establish their asymptotic properties. Numerical examples\nshow that the proposed method can better estimate the physical parameters than\nexisting methods."}, "http://arxiv.org/abs/2302.01974": {"title": "Conic Sparsity: Estimation of Regression Parameters in Closed Convex Polyhedral Cones", "link": "http://arxiv.org/abs/2302.01974", "description": "Statistical problems often involve linear equality and inequality constraints\non model parameters. Direct estimation of parameters restricted to general\npolyhedral cones, particularly when one is interested in estimating low\ndimensional features, may be challenging. We use a dual form parameterization\nto characterize parameter vectors restricted to lower dimensional faces of\npolyhedral cones and use the characterization to define a notion of 'sparsity'\non such cones. We show that the proposed notion agrees with the usual notion of\nsparsity in the unrestricted case and prove the validity of the proposed\ndefinition as a measure of sparsity. The identifiable parameterization of the\nlower dimensional faces allows a generalization of popular spike-and-slab\npriors to a closed convex polyhedral cone. The prior measure utilizes the\ngeometry of the cone by defining a Markov random field over the adjacency graph\nof the extreme rays of the cone. We describe an efficient way of computing the\nposterior of the parameters in the restricted case. We illustrate the\nusefulness of the proposed methodology for imposing linear equality and\ninequality constraints by using wearables data from the National Health and\nNutrition Examination Survey (NHANES) actigraph study where the daily average\nactivity profiles of participants exhibit patterns that seem to obey such\nconstraints."}, "http://arxiv.org/abs/2401.01426": {"title": "Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference", "link": "http://arxiv.org/abs/2401.01426", "description": "Pearl's causal hierarchy establishes a clear separation between\nobservational, interventional, and counterfactual questions. Researchers\nproposed sound and complete algorithms to compute identifiable causal queries\nat a given level of the hierarchy using the causal structure and data from the\nlower levels of the hierarchy. 
However, most of these algorithms assume that we\ncan accurately estimate the probability distribution of the data, which is an\nimpractical assumption for high-dimensional variables such as images. On the\nother hand, modern generative deep learning architectures can be trained to\nlearn how to accurately sample from such high-dimensional distributions.\nEspecially with the recent rise of foundation models for images, it is\ndesirable to leverage pre-trained models to answer causal queries with such\nhigh-dimensional data. To address this, we propose a sequential training\nalgorithm that, given the causal structure and a pre-trained conditional\ngenerative model, can train a deep causal generative model, which utilizes the\npre-trained model and can provably sample from identifiable interventional and\ncounterfactual distributions. Our algorithm, called Modular-DCM, uses\nadversarial training to learn the network weights, and to the best of our\nknowledge, is the first algorithm that can make use of pre-trained models and\nprovably sample from any identifiable causal query in the presence of latent\nconfounders with high-dimensional data. We demonstrate the utility of our\nalgorithm using semi-synthetic and real-world datasets containing images as\nvariables in the causal structure."}, "http://arxiv.org/abs/2401.01500": {"title": "Log-concave Density Estimation with Independent Components", "link": "http://arxiv.org/abs/2401.01500", "description": "We propose a method for estimating a log-concave density on $\\mathbb R^d$\nfrom samples, under the assumption that there exists an orthogonal\ntransformation that makes the components of the random vector independent.\nWhile log-concave density estimation is hard both computationally and\nstatistically, the independent components assumption alleviates both issues,\nwhile still maintaining a large non-parametric class. We prove that under mild\nconditions, at most $\\tilde{\\mathcal{O}}(\\epsilon^{-4})$ samples (suppressing\nconstants and log factors) suffice for our proposed estimator to be within\n$\\epsilon$ of the original density in squared Hellinger distance. On the\ncomputational front, while the usual log-concave maximum likelihood estimate\ncan be obtained via a finite-dimensional convex program, it is slow to compute\n-- especially in higher dimensions. We demonstrate through numerical\nexperiments that our estimator can be computed efficiently, making it more\npractical to use."}, "http://arxiv.org/abs/2401.01665": {"title": "A Wild Bootstrap Procedure for the Identification of Optimal Groups in Singular Spectrum Analysis", "link": "http://arxiv.org/abs/2401.01665", "description": "A key step in separating signal from noise using Singular Spectrum Analysis\n(SSA) is grouping, which is often done subjectively. In this article, a method\nwhich enables the identification of statistically significant groups for the\ngrouping step in SSA is presented. The proposed procedure provides a more\nobjective and reliable approach for separating noise from the main signal in\nSSA. We utilize the w-correlation and test whether it is close or equal to zero. A\nwild bootstrap approach is used to determine the distribution of the\nw-correlation. To identify an ideal number of groupings which leads to almost\nperfect separation of the noise and signal, a given number of groups are\ntested, which necessitates accounting for multiplicity. 
The effectiveness of our\nmethod in identifying the best group is demonstrated through a simulation\nstudy; furthermore, we have applied the approach to real-world data in the\ncontext of neuroimaging. This research provides a valuable contribution to the\nfield of SSA and offers important insights into the statistical properties of\nthe w-correlation distribution. The results obtained from the simulation\nstudies and analysis of real-world data demonstrate the effectiveness of the\nproposed approach in identifying the best groupings for SSA."}, "http://arxiv.org/abs/2401.01713": {"title": "Multiple testing of interval composite null hypotheses using randomized p-values", "link": "http://arxiv.org/abs/2401.01713", "description": "One class of statistical hypothesis testing procedures comprises the\nequivalence tests, whose main objective is to establish practical equivalence\nrather than the usual statistically significant difference. These hypothesis\ntests are prominent in bioequivalence studies, where one would wish to show that,\nfor example, an existing drug and a new one under development have the same\ntherapeutic effect. In this article, we consider a two-stage randomized (RAND2)\np-value utilizing the uniformly most powerful (UMP) p-value in the first stage\nwhen multiple two-one-sided hypotheses are of interest. We investigate the\nbehavior of the distribution functions of the two p-values when there are\nchanges in the boundaries of the null or alternative hypothesis or when the\nchosen parameters are too close to these boundaries. We also consider the\nbehavior of the power functions with respect to an increase in sample size. Specifically, we\ninvestigate the level of conservativity across sample sizes to see if we\ncontrol the type I error rate when using either of the two p-values for any\nsample size. In multiple tests, we evaluate the performance of the two p-values\nin estimating the proportion of true null hypotheses. We conduct a family-wise\nerror rate control using an adaptive Bonferroni procedure with a plug-in\nestimator to account for the multiplicity that arises from the multiple\nhypotheses under consideration. We verify the various claims in this research\nusing a simulation study and real-world data analysis."}, "http://arxiv.org/abs/2401.01806": {"title": "A complex meta-regression model to identify effective features of interventions from multi-arm, multi-follow-up trials", "link": "http://arxiv.org/abs/2401.01806", "description": "Network meta-analysis (NMA) combines evidence from multiple trials to compare\nthe effectiveness of a set of interventions. In public health research,\ninterventions are often complex, made up of multiple components or features.\nThis makes it difficult to define a common set of interventions on which to\nperform the analysis. One approach to this problem is component network\nmeta-analysis (CNMA), which uses a meta-regression framework to define each\nintervention as a subset of components whose individual effects combine\nadditively. In this paper, we are motivated by a systematic review of complex\ninterventions to prevent obesity in children. Due to considerable heterogeneity\nacross the trials, these interventions cannot be expressed as a subset of\ncomponents but instead are coded against a framework of characteristic\nfeatures. To analyse these data, we develop a bespoke CNMA-inspired model that\nallows us to identify the most important features of interventions. 
We define a\nmeta-regression model with covariates on three levels: intervention, study, and\nfollow-up time, as well as flexible interaction terms. By specifying different\nregression structures for trials with and without a control arm, we relax the\nassumption from previous CNMA models that a control arm is the absence of\nintervention components. Furthermore, we derive a correlation structure that\naccounts for trials with multiple intervention arms and multiple follow-up\ntimes. Although our model was developed for the specifics of the obesity data\nset, it has wider applicability to any set of complex interventions that can be\ncoded according to a set of shared features."}, "http://arxiv.org/abs/2401.01833": {"title": "Credible Distributions of Overall Ranking of Entities", "link": "http://arxiv.org/abs/2401.01833", "description": "Inference on overall ranking of a set of entities, such as chess players,\nsubpopulations or hospitals, is an important problem. Estimation of ranks based\non point estimates of means does not account for the uncertainty in those\nestimates. Treating estimated ranks without regard for uncertainty is\nproblematic. We propose a Bayesian solution. It is competitive with recent\nfrequentist methods, more effective and informative, and as easy to\nimplement as it is to compute the posterior means and variances of the entity\nmeans. Using credible sets, we created novel credible distributions for the\nrank vector of the entities. We evaluate the Bayesian procedure in terms of\naccuracy and stability in two applications and a simulation study. Frequentist\napproaches cannot take account of covariates, but the Bayesian method handles\nthem easily."}, "http://arxiv.org/abs/2401.01872": {"title": "Multiple Imputation of Hierarchical Nonlinear Time Series Data with an Application to School Enrollment Data", "link": "http://arxiv.org/abs/2401.01872", "description": "International comparisons of hierarchical time series data sets based on\nsurvey data, such as annual country-level estimates of school enrollment rates,\ncan suffer from large amounts of missing data due to differing coverage of\nsurveys across countries and across times. A popular approach to handling\nmissing data in these settings is through multiple imputation, which can be\nespecially effective when there is an auxiliary variable that is strongly\npredictive of and has a smaller amount of missing data than the variable of\ninterest. However, standard methods for multiple imputation of hierarchical\ntime series data can perform poorly when the auxiliary variable and the\nvariable of interest have a nonlinear relationship. Performance of standard\nmultiple imputation methods can also suffer if the substantive analysis model\nof interest is uncongenial to the imputation model, which can be a common\noccurrence for social science data if the imputation phase is conducted\nindependently of the analysis phase. We propose a Bayesian method for multiple\nimputation of hierarchical nonlinear time series data that uses a sequential\ndecomposition of the joint distribution and incorporates smoothing splines to\naccount for nonlinear relationships between variables. We compare the proposed\nmethod with existing multiple imputation methods through a simulation study and\nan application to secondary school enrollment data. 
We find that the proposed\nmethod can lead to substantial performance increases for estimation of\nparameters in uncongenial analysis models and for prediction of individual\nmissing values."}, "http://arxiv.org/abs/2208.10962": {"title": "Prediction of good reaction coordinates and future evolution of MD trajectories using Regularized Sparse Autoencoders: A novel deep learning approach", "link": "http://arxiv.org/abs/2208.10962", "description": "Identifying reaction coordinates (RCs) is an active area of research, given\nthe crucial role RCs play in determining the progress of a chemical reaction.\nThe choice of the reaction coordinate is often based on heuristic knowledge.\nHowever, an essential criterion for the choice is that the coordinate should\ncapture both the reactant and product states unequivocally. Also, the\ncoordinate should be the slowest one so that all the other degrees of freedom\ncan easily equilibrate along the reaction coordinate. We used a regularised sparse\nautoencoder, an energy-based model, to discover a crucial set of reaction\ncoordinates. Along with discovering reaction coordinates, our model also\npredicts the evolution of a molecular dynamics (MD) trajectory. We showcased\nthat including sparsity-enforcing regularisation helps in choosing a small but\nimportant set of reaction coordinates. We used two model systems to demonstrate\nour approach: the alanine dipeptide system and a proflavine-DNA system, which\nexhibited intercalation of proflavine into the DNA minor groove in an aqueous\nenvironment. We model the MD trajectory as a multivariate time series, and our\nlatent variable model performs the task of multi-step time series prediction.\nThis idea is inspired by the popular sparse coding approach - to represent each\ninput sample as a linear combination of a few elements taken from a set of\nrepresentative patterns."}, "http://arxiv.org/abs/2308.01704": {"title": "Similarity-based Random Partition Distribution for Clustering Functional Data", "link": "http://arxiv.org/abs/2308.01704", "description": "Random partition distribution is a crucial tool for model-based clustering.\nThis study advances the field of random partition in the context of functional\nspatial data, focusing on the challenges posed by hourly population data across\nvarious regions and dates. We propose an extended generalized Dirichlet\nprocess, named the similarity-based generalized Dirichlet process (SGDP), to\naddress the limitations of simple random partition distributions (e.g., those\ninduced by the Dirichlet process), such as an overabundance of clusters. This\nmodel prevents the production of excess clusters and incorporates pairwise\nsimilarity information to ensure a more accurate and meaningful grouping. The\ntheoretical properties of SGDP are studied. Then, SGDP is applied to a\nreal-world dataset of hourly population flows in 500$\\rm{m}^2$ meshes in the\ncentral part of Tokyo. In this empirical context, SGDP excelled at detecting\nmeaningful patterns in the data while accounting for spatial nuances. The\nresults underscore the adaptability and utility of the method, showcasing its\nprowess in revealing intricate spatiotemporal dynamics. 
This study's findings\ncontribute significantly to urban planning, transportation, and policy-making\nby providing a helpful tool for understanding population dynamics and their\nimplications."}, "http://arxiv.org/abs/2401.01949": {"title": "Adjacency Matrix Decomposition Clustering for Human Activity Data", "link": "http://arxiv.org/abs/2401.01949", "description": "Mobile apps and wearable devices accurately and continuously measure human\nactivity; patterns within this data can provide a wealth of information\napplicable to fields such as transportation and health. Despite the potential\nutility of this data, there has been limited development of analysis methods\nfor sequences of daily activities. In this paper, we propose a novel clustering\nmethod and cluster evaluation metric for human activity data that leverages an\nadjacency matrix representation to cluster the data without the calculation of\na distance matrix. Our technique is substantially faster than conventional\nmethods based on computing pairwise distances via sequence alignment algorithms\nand also enhances interpretability of results. We compare our method to\ndistance-based hierarchical clustering through simulation studies and an\napplication to data collected by Daynamica, an app that turns raw sensor data\ninto a daily summary of a user's activities. Among days that contain a large\nportion of time spent at home, our method distinguishes days that also contain\nfull-time work or multiple hours of travel, while hierarchical clustering\ngroups these days together. We also illustrate how the computational advantage\nof our method enables the analysis of longer sequences by clustering full weeks\nof activity data."}, "http://arxiv.org/abs/2401.01966": {"title": "Ethical considerations for data involving human gender and sex variables", "link": "http://arxiv.org/abs/2401.01966", "description": "The inclusion of human sex and gender data in statistical analysis invokes\nmultiple considerations for data collection, combination, analysis, and\ninterpretation. These considerations are not unique to variables representing\nsex and gender. However, considering the relevance of the ethical practice\nstandards for statistics and data science to sex and gender variables is\ntimely, with results that can be applied to other sociocultural variables.\nHistorically, human gender and sex have been categorized with a binary system.\nThis tradition persists mainly because it is easy, and not because it produces\nthe best scientific information. Binary classification simplifies combinations\nof older and newer data sets. However, this classification system eliminates\nthe ability for respondents to articulate their gender identity, conflates\ngender and sex, and also obscures potentially important differences by\ncollapsing across valid and authentic categories. This approach perpetuates\nhistorical inaccuracy, simplicity, and bias, while also limiting the\ninformation that emerges from analyses of human data. The approach also\nviolates multiple elements in the American Statistical Association (ASA)\nEthical Guidelines for Statistical Practice. Information that would be captured\nwith a nonbinary classification could be relevant to decisions about analysis\nmethods and to decisions based on otherwise expert statistical work.\nStatistical practitioners are increasingly concerned with inconsistent,\nuninformative, and even unethical data collection and analysis practices. 
This\npaper presents a historical introduction to the collection and analysis of\nhuman gender and sex data, offers a critique of a few common survey questioning\nmethods based on alignment with the ASA Ethical Guidelines, and considers the\nscope of ethical considerations for human gender and sex data from design\nthrough analysis and interpretation."}, "http://arxiv.org/abs/2401.01977": {"title": "Conformal causal inference for cluster randomized trials: model-robust inference without asymptotic approximations", "link": "http://arxiv.org/abs/2401.01977", "description": "In the analysis of cluster randomized trials, two typical features are that\nindividuals within a cluster are correlated and that the total number of\nclusters can sometimes be limited. While model-robust treatment effect\nestimators have been recently developed, their asymptotic theory requires the\nnumber of clusters to approach infinity, and one often has to empirically\nassess the applicability of those methods in finite samples. To address this\nchallenge, we propose a conformal causal inference framework that achieves the\ntarget coverage probability of treatment effects in finite samples without the\nneed for asymptotic approximations. Meanwhile, we prove that this framework is\ncompatible with arbitrary working models, including machine learning algorithms\nleveraging baseline covariates, possesses robustness against arbitrary\nmisspecification of working models, and accommodates a variety of\nwithin-cluster correlations. Under this framework, we offer efficient\nalgorithms to make inferences on treatment effects at both the cluster and\nindividual levels, applicable to user-specified covariate subgroups and two\ntypes of test data. Finally, we demonstrate our methods via simulations and a\nreal data application based on a cluster randomized trial for treating chronic\npain."}, "http://arxiv.org/abs/2401.02048": {"title": "Random Effect Restricted Mean Survival Time Model", "link": "http://arxiv.org/abs/2401.02048", "description": "The restricted mean survival time (RMST) model has been garnering attention\nas a way to provide a clinically intuitive measure: the mean survival time.\nRMST models, which use methods based on pseudo time-to-event values and inverse\nprobability censoring weighting, can adjust covariates. However, no approach\nhas yet been introduced that considers random effects for clusters. In this\npaper, we propose a new random-effect RMST. We present two methods of analysis\nthat consider variable effects by i) using a generalized mixed model with\npseudo-values and ii) integrating the estimated results from the inverse\nprobability censoring weighting estimating equations for each cluster. We\nevaluate our proposed methods through computer simulations. In addition, we\nanalyze the effect of a mother's age at birth on under-five deaths in India\nusing states as clusters."}, "http://arxiv.org/abs/2401.02154": {"title": "Disentangle Estimation of Causal Effects from Cross-Silo Data", "link": "http://arxiv.org/abs/2401.02154", "description": "Estimating causal effects among different events is of great importance to\ncritical fields such as drug development. Nevertheless, the data features\nassociated with events may be distributed across various silos and remain\nprivate within respective parties, impeding direct information exchange between\nthem. This, in turn, can result in biased estimations of local causal effects,\nwhich rely on the characteristics of only a subset of the covariates. 
To tackle\nthis challenge, we introduce an innovative disentangle architecture designed to\nfacilitate the seamless cross-silo transmission of model parameters, enriched\nwith causal mechanisms, through a combination of shared and private branches.\nBesides, we introduce global constraints into the equation to effectively\nmitigate bias within the various missing domains, thereby elevating the\naccuracy of our causal effect estimation. Extensive experiments conducted on\nnew semi-synthetic datasets show that our method outperforms state-of-the-art\nbaselines."}, "http://arxiv.org/abs/2401.02387": {"title": "Assessing Time Series Correlation Significance: A Parametric Approach with Application to Physiological Signals", "link": "http://arxiv.org/abs/2401.02387", "description": "Correlation coefficients play a pivotal role in quantifying linear\nrelationships between random variables. Yet, their application to time series\ndata is very challenging due to temporal dependencies. This paper introduces a\nnovel approach to estimate the statistical significance of correlation\ncoefficients in time series data, addressing the limitations of traditional\nmethods based on the concept of effective degrees of freedom (or effective\nsample size, ESS). These effective degrees of freedom represent the independent\nsample size that would yield comparable test statistics under the assumption of\nno temporal correlation. We propose to assume a parametric Gaussian form for\nthe autocorrelation function. We show that this assumption, motivated by a\nLaplace approximation, enables a simple estimator of the ESS that depends only\non the temporal derivatives of the time series. Through numerical experiments,\nwe show that the proposed approach yields accurate statistics while\nsignificantly reducing computational overhead. In addition, we evaluate the\nadequacy of our approach on real physiological signals, for assessing the\nconnectivity measures in electrophysiology and detecting correlated arm\nmovements in motion capture data. Our methodology provides a simple tool for\nresearchers working with time series data, enabling robust hypothesis testing\nin the presence of temporal dependencies."}, "http://arxiv.org/abs/1712.02195": {"title": "Fast approximations in the homogeneous Ising model for use in scene analysis", "link": "http://arxiv.org/abs/1712.02195", "description": "The Ising model is important in statistical modeling and inference in many\napplications, however its normalizing constant, mean number of active vertices\nand mean spin interaction -- quantities needed in inference -- are\ncomputationally intractable. We provide accurate approximations that make it\npossible to numerically calculate these quantities in the homogeneous case.\nSimulation studies indicate good performance of our approximation formulae that\nare scalable and unfazed by the size (number of nodes, degree of graph) of the\nMarkov Random Field. 
The practical import of our approximation formulae is\nillustrated in performing Bayesian inference in a functional Magnetic Resonance\nImaging activation detection experiment, and also in likelihood ratio testing\nfor anisotropy in the spatial patterns of yearly increases in pistachio tree\nyields."}, "http://arxiv.org/abs/2203.12179": {"title": "Targeted Function Balancing", "link": "http://arxiv.org/abs/2203.12179", "description": "This paper introduces Targeted Function Balancing (TFB), a covariate\nbalancing weights framework for estimating the average treatment effect of a\nbinary intervention. TFB first regresses an outcome on covariates, and then\nselects weights that balance functions (of the covariates) that are\nprobabilistically near the resulting regression function. This yields balance\nin the regression function's predicted values and the covariates, with the\nregression function's estimated variance determining how much balance in the\ncovariates is sufficient. Notably, TFB demonstrates that intentionally leaving\nimbalance in some covariates can increase efficiency without introducing bias,\nchallenging traditions that warn against imbalance in any variable.\nAdditionally, TFB is entirely defined by a regression function and its\nestimated variance, turning the problem of how best to balance the covariates\ninto how best to model the outcome. Kernel regularized least squares and the\nLASSO are considered as regression estimators. With the former, TFB contributes\nto the literature of kernel-based weights. As for the LASSO, TFB uses the\nregression function's estimated variance to prioritize balance in certain\ndimensions of the covariates, a feature that can be greatly exploited by\nchoosing a sparse regression estimator. This paper also introduces a balance\ndiagnostic, Targeted Function Imbalance, that may have useful applications."}, "http://arxiv.org/abs/2303.01572": {"title": "Transportability without positivity: a synthesis of statistical and simulation modeling", "link": "http://arxiv.org/abs/2303.01572", "description": "When estimating an effect of an action with a randomized or observational\nstudy, that study is often not a random sample of the desired target\npopulation. Instead, estimates from that study can be transported to the target\npopulation. However, transportability methods generally rely on a positivity\nassumption, such that all relevant covariate patterns in the target population\nare also observed in the study sample. Strict eligibility criteria,\nparticularly in the context of randomized trials, may lead to violations of\nthis assumption. Two common approaches to address positivity violations are\nrestricting the target population and restricting the relevant covariate set.\nAs neither of these restrictions are ideal, we instead propose a synthesis of\nstatistical and simulation models to address positivity violations. We propose\ncorresponding g-computation and inverse probability weighting estimators. The\nrestriction and synthesis approaches to addressing positivity violations are\ncontrasted with a simulation experiment and an illustrative example in the\ncontext of sexually transmitted infection testing uptake. In both cases, the\nproposed synthesis approach accurately addressed the original research question\nwhen paired with a thoughtfully selected simulation model. 
Neither of the\nrestriction approaches were able to accurately address the motivating question.\nAs public health decisions must often be made with imperfect target population\ninformation, model synthesis is a viable approach given a combination of\nempirical data and external information based on the best available knowledge."}, "http://arxiv.org/abs/2401.02529": {"title": "Simulation-based transition density approximation for the inference of SDE models", "link": "http://arxiv.org/abs/2401.02529", "description": "Stochastic Differential Equations (SDEs) serve as a powerful modeling tool in\nvarious scientific domains, including systems science, engineering, and\necological science. While the specific form of SDEs is typically known for a\ngiven problem, certain model parameters remain unknown. Efficiently inferring\nthese unknown parameters based on observations of the state in discrete time\nseries represents a vital practical subject. The challenge arises in nonlinear\nSDEs, where maximum likelihood estimation of parameters is generally unfeasible\ndue to the absence of closed-form expressions for transition and stationary\nprobability density functions of the states. In response to this limitation, we\npropose a novel two-step parameter inference mechanism. This approach involves\na global-search phase followed by a local-refining procedure. The global-search\nphase is dedicated to identifying the domain of high-value likelihood\nfunctions, while the local-refining procedure is specifically designed to\nenhance the surrogate likelihood within this localized domain. Additionally, we\npresent two simulation-based approximations for the transition density, aiming\nto efficiently or accurately approximate the likelihood function. Numerical\nexamples illustrate the efficacy of our proposed methodology in achieving\nposterior parameter estimation."}, "http://arxiv.org/abs/2401.02557": {"title": "Multivariate Functional Clustering with Variable Selection and Application to Sensor Data from Engineering Systems", "link": "http://arxiv.org/abs/2401.02557", "description": "Multi-sensor data that track system operating behaviors are widely available\nnowadays from various engineering systems. Measurements from each sensor over\ntime form a curve and can be viewed as functional data. Clustering of these\nmultivariate functional curves is important for studying the operating patterns\nof systems. One complication in such applications is the possible presence of\nsensors whose data do not contain relevant information. Hence it is desirable\nfor the clustering method to equip with an automatic sensor selection\nprocedure. Motivated by a real engineering application, we propose a functional\ndata clustering method that simultaneously removes noninformative sensors and\ngroups functional curves into clusters using informative sensors. Functional\nprincipal component analysis is used to transform multivariate functional data\ninto a coefficient matrix for data reduction. We then model the transformed\ndata by a Gaussian mixture distribution to perform model-based clustering with\nvariable selection. Three types of penalties, the individual, variable, and\ngroup penalties, are considered to achieve automatic variable selection.\nExtensive simulations are conducted to assess the clustering and variable\nselection performance of the proposed methods. 
The application of the proposed\nmethods to an engineering system with multiple sensors shows the promise of the\nmethods and reveals interesting patterns in the sensor data."}, "http://arxiv.org/abs/2401.02694": {"title": "Nonconvex High-Dimensional Time-Varying Coefficient Estimation for Noisy High-Frequency Observations", "link": "http://arxiv.org/abs/2401.02694", "description": "In this paper, we propose a novel high-dimensional time-varying coefficient\nestimator for noisy high-frequency observations. In high-frequency finance, we\noften observe that noises dominate a signal of an underlying true process.\nThus, we cannot apply usual regression procedures to analyze noisy\nhigh-frequency observations. To handle this issue, we first employ a smoothing\nmethod for the observed variables. However, the smoothed variables still\ncontain non-negligible noises. To manage these non-negligible noises and the\nhigh dimensionality, we propose a nonconvex penalized regression method for\neach local coefficient. This method produces consistent but biased local\ncoefficient estimators. To estimate the integrated coefficients, we propose a\ndebiasing scheme and obtain a debiased integrated coefficient estimator using\ndebiased local coefficient estimators. Then, to further account for the\nsparsity structure of the coefficients, we apply a thresholding scheme to the\ndebiased integrated coefficient estimator. We call this scheme the Thresholded\ndEbiased Nonconvex LASSO (TEN-LASSO) estimator. Furthermore, this paper\nestablishes the concentration properties of the TEN-LASSO estimator and\ndiscusses a nonconvex optimization algorithm."}, "http://arxiv.org/abs/2401.02735": {"title": "Shared active subspace for multivariate vector-valued functions", "link": "http://arxiv.org/abs/2401.02735", "description": "This paper proposes several approaches as baselines to compute a shared\nactive subspace for multivariate vector-valued functions. The goal is to\nminimize the deviation between the function evaluations on the original space\nand those on the reconstructed one. This is done either by manipulating the\ngradients or the symmetric positive (semi-)definite (SPD) matrices computed\nfrom the gradients of each component function so as to get a single structure\ncommon to all component functions. These approaches can be applied to any data\nirrespective of the underlying distribution unlike the existing vector-valued\napproach that is constrained to a normal distribution. We test the\neffectiveness of these methods on five optimization problems. The experiments\nshow that, in general, the SPD-level methods are superior to the gradient-level\nones, and are close to the vector-valued approach in the case of a normal\ndistribution. Interestingly, in most cases it suffices to take the sum of the\nSPD matrices to identify the best shared active subspace."}, "http://arxiv.org/abs/2401.02828": {"title": "Optimal prediction of positive-valued spatial processes: asymmetric power-divergence loss", "link": "http://arxiv.org/abs/2401.02828", "description": "This article studies the use of asymmetric loss functions for the optimal\nprediction of positive-valued spatial processes. We focus on the family of\npower-divergence loss functions due to its many convenient properties, such as\nits continuity, convexity, relationship to well known divergence measures, and\nthe ability to control the asymmetry and behaviour of the loss function via a\npower parameter. 
The properties of power-divergence loss functions, optimal\npower-divergence (OPD) spatial predictors, and related measures of uncertainty\nquantification are examined. In addition, we examine the notion of asymmetry in\nloss functions defined for positive-valued spatial processes and define an\nasymmetry measure that is applied to the power-divergence loss function and\nother common loss functions. The paper concludes with a spatial statistical\nanalysis of zinc measurements in the soil of a floodplain of the Meuse River,\nNetherlands, using OPD spatial prediction."}, "http://arxiv.org/abs/2401.02917": {"title": "Bayesian changepoint detection via logistic regression and the topological analysis of image series", "link": "http://arxiv.org/abs/2401.02917", "description": "We present a Bayesian method for multivariate changepoint detection that\nallows for simultaneous inference on the location of a changepoint and the\ncoefficients of a logistic regression model for distinguishing pre-changepoint\ndata from post-changepoint data. In contrast to many methods for multivariate\nchangepoint detection, the proposed method is applicable to data of mixed type\nand avoids strict assumptions regarding the distribution of the data and the\nnature of the change. The regression coefficients provide an interpretable\ndescription of a potentially complex change. For posterior inference, the model\nadmits a simple Gibbs sampling algorithm based on P\\'olya-gamma data\naugmentation. We establish conditions under which the proposed method is\nguaranteed to recover the true underlying changepoint. As a testing ground for\nour method, we consider the problem of detecting topological changes in time\nseries of images. We demonstrate that the proposed method, combined with a\nnovel topological feature embedding, performs well on both simulated and real\nimage data."}, "http://arxiv.org/abs/2401.02930": {"title": "Dagma-DCE: Interpretable, Non-Parametric Differentiable Causal Discovery", "link": "http://arxiv.org/abs/2401.02930", "description": "We introduce Dagma-DCE, an interpretable and model-agnostic scheme for\ndifferentiable causal discovery. Current non- or over-parametric methods in\ndifferentiable causal discovery use opaque proxies of ``independence'' to\njustify the inclusion or exclusion of a causal relationship. We show\ntheoretically and empirically that these proxies may be arbitrarily different\nthan the actual causal strength. Juxtaposed to existing differentiable causal\ndiscovery algorithms, \\textsc{Dagma-DCE} uses an interpretable measure of\ncausal strength to define weighted adjacency matrices. In a number of simulated\ndatasets, we show our method achieves state-of-the-art level performance. We\nadditionally show that \\textsc{Dagma-DCE} allows for principled thresholding\nand sparsity penalties by domain-experts. The code for our method is available\nopen-source at https://github.com/DanWaxman/DAGMA-DCE, and can easily be\nadapted to arbitrary differentiable models."}, "http://arxiv.org/abs/2401.02939": {"title": "Penalized Distributed Lag Interaction Model: Air Pollution, Birth Weight and Neighborhood Vulnerability", "link": "http://arxiv.org/abs/2401.02939", "description": "Maternal exposure to air pollution during pregnancy has a substantial public\nhealth impact. Epidemiological evidence supports an association between\nmaternal exposure to air pollution and low birth weight. 
A popular method to\nestimate this association while identifying windows of susceptibility is a\ndistributed lag model (DLM), which regresses an outcome onto exposure history\nobserved at multiple time points. However, the standard DLM framework does not\nallow for modification of the association between repeated measures of exposure\nand the outcome. We propose a distributed lag interaction model that allows\nmodification of the exposure-time-response associations across individuals by\nincluding an interaction between a continuous modifying variable and the\nexposure history. Our model framework is an extension of a standard DLM that\nuses a cross-basis, or bi-dimensional function space, to simultaneously\ndescribe both the modification of the exposure-response relationship and the\ntemporal structure of the exposure data. Through simulations, we showed that\nour model with penalization out-performs a standard DLM when the true\nexposure-time-response associations vary by a continuous variable. Using a\nColorado, USA birth cohort, we estimated the association between birth weight\nand ambient fine particulate matter air pollution modified by an area-level\nmetric of health and social adversities from Colorado EnviroScreen."}, "http://arxiv.org/abs/2401.02953": {"title": "Linked factor analysis", "link": "http://arxiv.org/abs/2401.02953", "description": "Factor models are widely used in the analysis of high-dimensional data in\nseveral fields of research. Estimating a factor model, in particular its\ncovariance matrix, from partially observed data vectors is very challenging. In\nthis work, we show that when the data are structurally incomplete, the factor\nmodel likelihood function can be decomposed into the product of the likelihood\nfunctions of multiple partial factor models relative to different subsets of\ndata. If these multiple partial factor models are linked together by common\nparameters, then we can obtain complete maximum likelihood estimates of the\nfactor model parameters and thereby the full covariance matrix. We call this\nframework Linked Factor Analysis (LINFA). LINFA can be used for covariance\nmatrix completion, dimension reduction, data completion, and graphical\ndependence structure recovery. We propose an efficient Expectation-Maximization\nalgorithm for maximum likelihood estimation, accelerated by a novel group\nvertex tessellation (GVT) algorithm which identifies a minimal partition of the\nvertex set to implement an efficient optimization in the maximization steps. We\nillustrate our approach in an extensive simulation study and in the analysis of\ncalcium imaging data collected from mouse visual cortex."}, "http://arxiv.org/abs/2203.12808": {"title": "Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning", "link": "http://arxiv.org/abs/2203.12808", "description": "We discuss causal inference for observational studies with possibly invalid\ninstrumental variables. We propose a novel methodology called two-stage\ncurvature identification (TSCI) by exploring the nonlinear treatment model with\nmachine learning. 
The first-stage machine learning improves the\ninstrumental variable's strength and adjusts for different forms of violation of\nthe instrumental variable assumptions. The success of TSCI requires the\ninstrumental variable's effect on treatment to differ from its violation form.\nA novel bias correction step is implemented to remove bias resulting from the\npotentially high complexity of machine learning. Our proposed \\texttt{TSCI}\nestimator is shown to be asymptotically unbiased and Gaussian even if the\nmachine learning algorithm does not consistently estimate the treatment model.\nFurthermore, we design a data-dependent method to choose the best among several\ncandidate violation forms. We apply TSCI to study the effect of education on\nearnings."}, "http://arxiv.org/abs/2208.14011": {"title": "Robust and Efficient Estimation in Ordinal Response Models using the Density Power Divergence", "link": "http://arxiv.org/abs/2208.14011", "description": "In real life, we frequently come across data sets that involve some\nindependent explanatory variable(s) generating a set of ordinal responses.\nThese ordinal responses may correspond to an underlying continuous latent\nvariable, which is linearly related to the covariate(s), and takes a particular\n(ordinal) label depending on whether this latent variable takes a value in some\nsuitable interval specified by a pair of (unknown) cut-offs. The most efficient\nway of estimating the unknown parameters (i.e., the regression coefficients and\nthe cut-offs) is the method of maximum likelihood (ML). However, contamination\nin the data set, either in the form of misspecification of ordinal responses or\nthe unboundedness of the covariate(s), might destabilize the likelihood\nfunction to such an extent that the ML-based methodology might lead to\ncompletely unreliable inferences. In this paper, we explore a minimum distance\nestimation procedure based on the popular density power divergence (DPD) to\nyield robust parameter estimates for the ordinal response model. This paper\nhighlights how the resulting estimator, namely the minimum DPD estimator\n(MDPDE), can be used as a practical robust alternative to the classical\nprocedures based on the ML. We rigorously develop several theoretical\nproperties of this estimator, and provide extensive simulations to substantiate\nthe theory developed."}, "http://arxiv.org/abs/2303.10018": {"title": "Efficient nonparametric estimation of Toeplitz covariance matrices", "link": "http://arxiv.org/abs/2303.10018", "description": "A new nonparametric estimator for Toeplitz covariance matrices is proposed.\nThis estimator is based on a data transformation that translates the problem of\nToeplitz covariance matrix estimation to the problem of mean estimation in an\napproximate Gaussian regression. The resulting Toeplitz covariance matrix\nestimator is positive definite by construction, fully data-driven and\ncomputationally very fast. Moreover, this estimator is shown to be minimax\noptimal under the spectral norm for a large class of Toeplitz matrices. These\nresults are readily extended to estimation of inverses of Toeplitz covariance\nmatrices. Also, an alternative version of the Whittle likelihood for the\nspectral density based on the Discrete Cosine Transform (DCT) is proposed. 
The\nmethod is implemented in the R package vstdct that accompanies the paper."}, "http://arxiv.org/abs/2308.11260": {"title": "A simplified spatial+ approach to mitigate spatial confounding in multivariate spatial areal models", "link": "http://arxiv.org/abs/2308.11260", "description": "Spatial areal models encounter the well-known and challenging problem of\nspatial confounding. This issue makes it arduous to distinguish between the\nimpacts of observed covariates and spatial random effects. Despite previous\nresearch and various proposed methods to tackle this problem, finding a\ndefinitive solution remains elusive. In this paper, we propose a simplified\nversion of the spatial+ approach that involves dividing the covariate into two\ncomponents. One component captures large-scale spatial dependence, while the\nother accounts for short-scale dependence. This approach eliminates the need to\nseparately fit spatial models for the covariates. We apply this method to\nanalyse two forms of crimes against women, namely rapes and dowry deaths, in\nUttar Pradesh, India, exploring their relationship with socio-demographic\ncovariates. To evaluate the performance of the new approach, we conduct\nextensive simulation studies under different spatial confounding scenarios. The\nresults demonstrate that the proposed method provides reliable estimates of\nfixed effects and posterior correlations between different responses."}, "http://arxiv.org/abs/2308.15596": {"title": "Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes", "link": "http://arxiv.org/abs/2308.15596", "description": "The assessment of regression models with discrete outcomes is challenging and\nhas many fundamental issues. With discrete outcomes, standard regression model\nassessment tools such as Pearson and deviance residuals do not follow the\nconventional reference distribution (normal) under the true model, calling into\nquestion the legitimacy of model assessment based on these tools. To fill this\ngap, we construct a new type of residuals for general discrete outcomes,\nincluding ordinal and count outcomes. The proposed residuals are based on two\nlayers of probability integral transformation. When at least one continuous\ncovariate is available, the proposed residuals closely follow a uniform\ndistribution (or a normal distribution after transformation) under the\ncorrectly specified model. One can construct visualizations such as QQ plots to\ncheck the overall fit of a model straightforwardly, and the shape of QQ plots\ncan further help identify possible causes of misspecification such as\noverdispersion. We provide theoretical justification for the proposed residuals\nby establishing their asymptotic properties. Moreover, in order to assess the\nmean structure and identify potential covariates, we develop an ordered curve\nas a supplementary tool, which is based on the comparison between the partial\nsum of outcomes and of fitted means. Through simulation, we demonstrate\nempirically that the proposed tools outperform commonly used residuals for\nvarious model assessment tasks. We also illustrate the workflow of model\nassessment using the proposed tools in data analysis."}, "http://arxiv.org/abs/2401.03072": {"title": "Optimal Nonparametric Inference on Network Effects with Dependent Edges", "link": "http://arxiv.org/abs/2401.03072", "description": "Testing network effects in weighted directed networks is a foundational\nproblem in econometrics, sociology, and psychology. 
Yet, the prevalent edge\ndependency poses a significant methodological challenge. Most existing methods\nare model-based and come with stringent assumptions, limiting their\napplicability. In response, we introduce a novel, fully nonparametric framework\nthat requires only minimal regularity assumptions. While inspired by recent\ndevelopments in $U$-statistic literature (arXiv:1712.00771, arXiv:2004.06615),\nour approach notably broadens their scopes. Specifically, we identified and\ncarefully addressed the challenge of indeterminate degeneracy in the test\nstatistics $-$ a problem that aforementioned tools do not handle. We\nestablished Berry-Esseen type bound for the accuracy of type-I error rate\ncontrol. Using original analysis, we also proved the minimax optimality of our\ntest's power. Simulations underscore the superiority of our method in\ncomputation speed, accuracy, and numerical robustness compared to competing\nmethods. We also applied our method to the U.S. faculty hiring network data and\ndiscovered intriguing findings."}, "http://arxiv.org/abs/2401.03081": {"title": "Shrinkage Estimation and Prediction for Joint Type-II Censored Data from Two Burr-XII Populations", "link": "http://arxiv.org/abs/2401.03081", "description": "The main objective of this paper is to apply linear and pretest shrinkage\nestimation techniques to estimating the parameters of two 2-parameter Burr-XII\ndistributions. Further more, predictions for future observations are made using\nboth classical and Bayesian methods within a joint type-II censoring scheme.\nThe efficiency of shrinkage estimates is compared to maximum likelihood and\nBayesian estimates obtained through the expectation-maximization algorithm and\nimportance sampling method, as developed by Akbari Bargoshadi et al. (2023) in\n\"Statistical inference under joint type-II censoring data from two Burr-XII\npopulations\" published in Communications in Statistics-Simulation and\nComputation\". For Bayesian estimations, both informative and non-informative\nprior distributions are considered. Additionally, various loss functions\nincluding squared error, linear-exponential, and generalized entropy are taken\ninto account. Approximate confidence, credible, and highest probability density\nintervals are calculated. To evaluate the performance of the estimation\nmethods, a Monte Carlo simulation study is conducted. Additionally, two real\ndatasets are utilized to illustrate the proposed methods."}, "http://arxiv.org/abs/2401.03084": {"title": "Finite sample performance of optimal treatment rule estimators with right-censored outcomes", "link": "http://arxiv.org/abs/2401.03084", "description": "Patient care may be improved by recommending treatments based on patient\ncharacteristics when there is treatment effect heterogeneity. Recently, there\nhas been a great deal of attention focused on the estimation of optimal\ntreatment rules that maximize expected outcomes. However, there has been\ncomparatively less attention given to settings where the outcome is\nright-censored, especially with regard to the practical use of estimators. In\nthis study, simulations were undertaken to assess the finite-sample performance\nof estimators for optimal treatment rules and estimators for the expected\noutcome under treatment rules. 
The simulations were motivated by the common\nsetting in biomedical and public health research where the data is\nobservational, survival times may be right-censored, and there is interest in\nestimating baseline treatment decisions to maximize survival probability. A\nvariety of outcome regression and direct search estimation methods were\ncompared for optimal treatment rule estimation across a range of simulation\nscenarios. Methods that flexibly model the outcome performed comparatively\nwell, including in settings where the treatment rule was non-linear. R code to\nreproduce this study's results is available on GitHub."}, "http://arxiv.org/abs/2401.03106": {"title": "Contrastive linear regression", "link": "http://arxiv.org/abs/2401.03106", "description": "Contrastive dimension reduction methods have been developed for case-control\nstudy data to identify variation that is enriched in the foreground (case) data\nX relative to the background (control) data Y. Here, we develop contrastive\nregression for the setting when there is a response variable r associated with\neach foreground observation. This situation occurs frequently when, for\nexample, the unaffected controls do not have a disease grade or intervention\ndosage but the affected cases have a disease grade or intervention dosage, as\nin autism severity, solid tumor stages, polyp sizes, or warfarin dosages. Our\ncontrastive regression model captures shared low-dimensional variation between\nthe predictors in the cases and control groups, and then explains the\ncase-specific response variables through the variance that remains in the\npredictors after shared variation is removed. We show that, in one\nsingle-nucleus RNA sequencing dataset on autism severity in postmortem brain\nsamples from donors with and without autism and in another single-cell RNA\nsequencing dataset on cellular differentiation in chronic rhinosinusitis with\nand without nasal polyps, our contrastive linear regression performs feature\nranking and identifies biologically-informative predictors associated with\nresponse that cannot be identified using other approaches."}, "http://arxiv.org/abs/2401.03123": {"title": "A least distance estimator for a multivariate regression model using deep neural networks", "link": "http://arxiv.org/abs/2401.03123", "description": "We propose a deep neural network (DNN) based least distance (LD) estimator\n(DNN-LD) for a multivariate regression problem, addressing the limitations of\nthe conventional methods. Due to the flexibility of a DNN structure, both\nlinear and nonlinear conditional mean functions can be easily modeled, and a\nmultivariate regression model can be realized by simply adding extra nodes at\nthe output layer. The proposed method is more efficient in capturing the\ndependency structure among responses than the least squares loss, and is robust to\noutliers. In addition, we consider $L_1$-type penalization for variable\nselection, crucial in analyzing high-dimensional data. Namely, we propose what\nwe call the (A)GDNN-LD estimator that enjoys variable selection and model\nestimation simultaneously, by applying the (adaptive) group Lasso penalty to\nweight parameters in the DNN structure. For the computation, we propose a\nquadratic smoothing approximation method to facilitate optimizing the\nnon-smooth objective function based on the least distance loss. 
The simulation\nstudies and a real data analysis demonstrate the promising performance of the\nproposed method."}, "http://arxiv.org/abs/2401.03206": {"title": "A Robbins--Monro Sequence That Can Exploit Prior Information For Faster Convergence", "link": "http://arxiv.org/abs/2401.03206", "description": "We propose a new method to improve the convergence speed of the Robbins-Monro\nalgorithm by introducing prior information about the target point into the\nRobbins-Monro iteration. We achieve the incorporation of prior information\nwithout the need of a -- potentially wrong -- regression model, which would\nalso entail additional constraints. We show that this prior-information\nRobbins-Monro sequence is convergent for a wide range of prior distributions,\neven wrong ones, such as Gaussian, weighted sum of Gaussians, e.g., in a kernel\ndensity estimate, as well as bounded arbitrary distribution functions greater\nthan zero. We furthermore analyse the sequence numerically to understand its\nperformance and the influence of parameters. The results demonstrate that the\nprior-information Robbins-Monro sequence converges faster than the standard\none, especially during the first steps, which are particularly important for\napplications where the number of function measurements is limited, and when the\nnoise of observing the underlying function is large. We finally propose a rule\nto select the parameters of the sequence."}, "http://arxiv.org/abs/2401.03268": {"title": "Adaptive Randomization Methods for Sequential Multiple Assignment Randomized Trials (SMARTs) via Thompson Sampling", "link": "http://arxiv.org/abs/2401.03268", "description": "Response-adaptive randomization (RAR) has been studied extensively in\nconventional, single-stage clinical trials, where it has been shown to yield\nethical and statistical benefits, especially in trials with many treatment\narms. However, RAR and its potential benefits are understudied in sequential\nmultiple assignment randomized trials (SMARTs), which are the gold-standard\ntrial design for evaluation of multi-stage treatment regimes. We propose a\nsuite of RAR algorithms for SMARTs based on Thompson Sampling (TS), a widely\nused RAR method in single-stage trials in which treatment randomization\nprobabilities are aligned with the estimated probability that the treatment is\noptimal. We focus on two common objectives in SMARTs: (i) comparison of the\nregimes embedded in the trial, and (ii) estimation of an optimal embedded\nregime. We develop valid post-study inferential procedures for treatment\nregimes under the proposed algorithms. This is nontrivial, as (even in\nsingle-stage settings) RAR can lead to nonnormal limiting distributions of\nestimators. Our algorithms are the first for RAR in multi-stage trials that\naccount for nonregularity in the estimand. Empirical studies based on\nreal-world SMARTs show that TS can improve in-trial subject outcomes without\nsacrificing efficiency for post-trial comparisons."}, "http://arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "http://arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with\npotential confounding by time trends. 
Traditional frequentist methods can fail\nto provide adequate coverage of the intervention's true effect using confidence\nintervals, whereas Bayesian approaches show potential for better coverage of\nintervention effects. However, Bayesian methods have seen limited development\nin SWCRTs. We propose two novel Bayesian hierarchical penalized spline models\nfor SWCRTs. The first model is for SWCRTs involving many clusters and time\nperiods, focusing on immediate intervention effects. To evaluate its efficacy,\nwe compared this model to traditional frequentist methods. We further developed\nthe model to estimate time-varying intervention effects. We conducted a\ncomparative analysis of this Bayesian spline model against an existing Bayesian\nmonotone effect curve model. The proposed models are applied in the Primary\nPalliative Care for Emergency Medicine stepped wedge trial to evaluate the\neffectiveness of primary palliative care intervention. Extensive simulations\nand a real-world application demonstrate the strengths of the proposed Bayesian\nmodels. The Bayesian immediate effect model consistently achieves near the\nfrequentist nominal coverage probability for true intervention effect,\nproviding more reliable interval estimations than traditional frequentist\nmodels, while maintaining high estimation accuracy. The proposed Bayesian\ntime-varying effect model exhibits advancements over the existing Bayesian\nmonotone effect curve model in terms of improved accuracy and reliability. To\nthe best of our knowledge, this is the first development of Bayesian\nhierarchical spline modeling for SWCRTs. The proposed models offer an accurate\nand robust analysis of intervention effects. Their application could lead to\neffective adjustments in intervention strategies."}, "http://arxiv.org/abs/2401.03326": {"title": "Optimal Adaptive SMART Designs with Binary Outcomes", "link": "http://arxiv.org/abs/2401.03326", "description": "In a sequential multiple-assignment randomized trial (SMART), a sequence of\ntreatments is given to a patient over multiple stages. In each stage,\nrandomization may be done to allocate patients to different treatment groups.\nEven though SMART designs are getting popular among clinical researchers, the\nmethodologies for adaptive randomization at different stages of a SMART are few\nand not sophisticated enough to handle the complexity of optimal allocation of\ntreatments at every stage of a trial. Lack of optimal allocation methodologies\ncan raise serious concerns about SMART designs from an ethical point of view.\nIn this work, we develop an optimal adaptive allocation procedure to minimize\nthe expected number of treatment failures for a SMART with a binary primary\noutcome. Issues related to optimal adaptive allocations are explored\ntheoretically with supporting simulations. 
The applicability of the proposed\nmethodology is demonstrated using a recently conducted SMART study named\nM-Bridge for developing universal and resource-efficient dynamic treatment\nregimes (DTRs) for incoming first-year college students as a bridge to\ndesirable treatments to address alcohol-related risks."}, "http://arxiv.org/abs/2401.03338": {"title": "Modelling pathwise uncertainty of Stochastic Differential Equations samplers via Probabilistic Numerics", "link": "http://arxiv.org/abs/2401.03338", "description": "Probabilistic ordinary differential equation (ODE) solvers have been\nintroduced over the past decade as uncertainty-aware numerical integrators.\nThey typically proceed by assuming a functional prior to the ODE solution,\nwhich is then queried on a grid to form a posterior distribution over the ODE\nsolution. As the queries span the integration interval, the approximate\nposterior solution then converges to the true deterministic one. Gaussian ODE\nfilters, in particular, have enjoyed a lot of attention due to their\ncomputational efficiency, the simplicity of their implementation, as well as\ntheir provable fast convergence rates. In this article, we extend the\nmethodology to stochastic differential equations (SDEs) and propose a\nprobabilistic simulator for SDEs. Our approach involves transforming the SDE\ninto a sequence of random ODEs using piecewise differentiable approximations of\nthe Brownian motion. We then apply probabilistic ODE solvers to the individual\nODEs, resulting in a pathwise probabilistic solution to the SDE\\@. We establish\nworst-case strong $1.5$ local and $1.0$ global convergence orders for a\nspecific instance of our method. We further show how we can marginalise the\nBrownian approximations, by incorporating its coefficients as part of the prior\nODE model, allowing for computing exact transition densities under our model.\nFinally, we numerically validate the theoretical findings, showcasing\nreasonable weak convergence properties in the marginalised version."}, "http://arxiv.org/abs/2401.03554": {"title": "False Discovery Rate and Localizing Power", "link": "http://arxiv.org/abs/2401.03554", "description": "False discovery rate (FDR) is commonly used for correction for multiple\ntesting in neuroimaging studies. However, when using two-tailed tests, making\ndirectional inferences about the results can lead to vastly inflated error\nrate, even approaching 100\\% in some cases. This happens because FDR only\nprovides weak control over the error rate, meaning that the proportion of error\nis guaranteed only globally over all tests, not within subsets, such as among\nthose in only one or another direction. Here we consider and evaluate different\nstrategies for FDR control with two-tailed tests, using both synthetic and real\nimaging data. Approaches that separate the tests by direction of the hypothesis\ntest, or by the direction of the resulting test statistic, more properly\ncontrol the directional error rate and preserve FDR benefits, albeit with a\ndoubled risk of errors under complete absence of signal. Strategies that\ncombine tests in both directions, or that use simple two-tailed p-values, can\nlead to invalid directional conclusions, even if these tests remain globally\nvalid. To enable valid thresholding for directional inference, we suggest that\nimaging software should allow the possibility that the user sets asymmetrical\nthresholds for the two sides of the statistical map. 
While FDR continues to be\na valid, powerful procedure for multiple testing correction, care is needed\nwhen making directional inferences for two-tailed tests, or more broadly, when\nmaking any localized inference."}, "http://arxiv.org/abs/2401.03633": {"title": "A Spatial-statistical model to analyse historical rutting data", "link": "http://arxiv.org/abs/2401.03633", "description": "Pavement rutting poses a significant challenge in flexible pavements,\nnecessitating costly asphalt resurfacing. To address this issue\ncomprehensively, we propose an advanced Bayesian hierarchical framework of\nlatent Gaussian models with spatial components. Our model provides a thorough\ndiagnostic analysis, pinpointing areas exhibiting unexpectedly high rutting\nrates. Incorporating spatial and random components, and important explanatory\nvariables like annual average daily traffic (traffic intensity), pavement type,\nrut depth and lane width, our proposed models account for and estimate the\ninfluence of these variables on rutting. This approach not only quantifies\nuncertainties and discerns locations at the highest risk of requiring\nmaintenance, but also uncovers spatial dependencies in rutting\n(millimetre/year). We apply our models to a data set spanning eleven years\n(2010-2020). Our findings emphasize the systematic unexpected spatial rutting\neffect, where more of the rutting variability is explained by including spatial\ncomponents. Pavement type, in conjunction with traffic intensity, is also found\nto be the primary driver of rutting. Furthermore, the spatial dependencies\nuncovered reveal road sections experiencing more than 1 millimeter of rutting\nbeyond annual expectations. This leads to a halving of the expected pavement\nlifespan in these areas. Our study offers valuable insights, presenting maps\nindicating expected rutting, and identifying locations with accelerated rutting\nrates, resulting in a reduction in pavement life expectancy of at least 10\nyears."}, "http://arxiv.org/abs/2401.03649": {"title": "Bayes Factor of Zero Inflated Models under Jeffereys Prior", "link": "http://arxiv.org/abs/2401.03649", "description": "Microbiome omics data including 16S rRNA reveal intriguing dynamic\nassociations between the human microbiome and various disease states. Drastic\nchanges in microbiota can be associated with factors like diet, hormonal\ncycles, diseases, and medical interventions. Along with the identification of\nspecific bacteria taxa associated with diseases, recent advancements give\nevidence that metabolism, genetics, and environmental factors can model these\nmicrobial effects. However, the current analytic methods for integrating\nmicrobiome data are not fully developed to address the main challenges of\nlongitudinal metagenomics data, such as high-dimensionality, intra-sample\ndependence, and zero-inflation of observed counts. Hence, we propose the Bayes\nfactor approach for model selection based on negative binomial, Poisson,\nzero-inflated negative binomial, and zero-inflated Poisson models with\nnon-informative Jeffreys prior. We find that both in simulation studies and\nreal data analysis, our Bayes factor remarkably outperforms the traditional Akaike\ninformation criterion and Vuong's test. 
A new R package, BFZINBZIP, has been\nintroduced to perform the simulation study and real data analysis and to facilitate\nBayesian model selection based on the Bayes factor."}, "http://arxiv.org/abs/2401.03693": {"title": "TAD-SIE: Sample Size Estimation for Clinical Randomized Controlled Trials using a Trend-Adaptive Design with a Synthetic-Intervention-Based Estimator", "link": "http://arxiv.org/abs/2401.03693", "description": "Phase-3 clinical trials provide the highest level of evidence on drug safety\nand effectiveness needed for market approval by implementing large randomized\ncontrolled trials (RCTs). However, 30-40% of these trials fail mainly because\nsuch studies have inadequate sample sizes, stemming from the inability to\nobtain accurate initial estimates of average treatment effect parameters. To\nremove this obstacle from the drug development cycle, we present a new\nalgorithm called Trend-Adaptive Design with a Synthetic-Intervention-Based\nEstimator (TAD-SIE) that appropriately powers a parallel-group trial, a\nstandard RCT design, by leveraging a state-of-the-art hypothesis testing\nstrategy and a novel trend-adaptive design (TAD). Specifically, TAD-SIE uses\nSECRETS (Subject-Efficient Clinical Randomized Controlled Trials using\nSynthetic Intervention) for hypothesis testing, which simulates a cross-over\ntrial in order to boost power; doing so makes it easier for a trial to reach\ntarget power within trial constraints (e.g., sample size limits). To estimate\nsample sizes, TAD-SIE implements a new TAD tailored to SECRETS given that\nSECRETS violates assumptions under standard TADs. In addition, our TAD\novercomes the ineffectiveness of standard TADs by allowing sample sizes to be\nincreased across iterations without any condition while controlling\nsignificance level with futility stopping. On a real-world Phase-3 clinical RCT\n(i.e., a two-arm parallel-group superiority trial with an equal number of\nsubjects per arm), TAD-SIE reaches typical target operating points of 80% or\n90% power and 5% significance level in contrast to baseline algorithms that\nonly achieve at best 59% power and a 4% significance level."}, "http://arxiv.org/abs/2401.03820": {"title": "Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices", "link": "http://arxiv.org/abs/2401.03820", "description": "Estimating a covariance matrix and its associated principal components is a\nfundamental problem in contemporary statistics. While optimal estimation\nprocedures have been developed with well-understood properties, the increasing\ndemand for privacy preservation introduces new complexities to this classical\nproblem. In this paper, we study optimal differentially private Principal\nComponent Analysis (PCA) and covariance estimation within the spiked covariance\nmodel.\n\nWe precisely characterize the sensitivity of eigenvalues and eigenvectors\nunder this model and establish the minimax rates of convergence for estimating\nboth the principal components and covariance matrix. These rates hold up to\nlogarithmic factors and encompass general Schatten norms, including spectral\nnorm, Frobenius norm, and nuclear norm as special cases.\n\nWe introduce computationally efficient differentially private estimators and\nprove their minimax optimality, up to logarithmic factors. Additionally,\nmatching minimax lower bounds are established. 
Notably, in comparison with\nexisting literature, our results accommodate a diverging rank, necessitate no\neigengap condition between distinct principal components, and remain valid even\nif the sample size is much smaller than the dimension."}, "http://arxiv.org/abs/2401.03834": {"title": "Simultaneous false discovery bounds for invariant causal prediction", "link": "http://arxiv.org/abs/2401.03834", "description": "Invariant causal prediction (ICP, Peters et al. (2016)) provides a novel way\nto identify causal predictors of a response by utilizing heterogeneous data\nfrom different environments. One advantage of ICP is that it guarantees to make\nno false causal discoveries with high probability. Such a guarantee, however,\ncan be too conservative in some applications, resulting in few or no\ndiscoveries. To address this, we propose simultaneous false discovery bounds\nfor ICP, which provide users with extra flexibility in exploring causal\npredictors and can extract more informative results. These additional\ninferences come for free, in the sense that they do not require additional\nassumptions, and the same information obtained by the original ICP is retained.\nWe demonstrate the practical usage of our method through simulations and a real\ndataset."}, "http://arxiv.org/abs/2401.03881": {"title": "Density regression via Dirichlet process mixtures of normal structured additive regression models", "link": "http://arxiv.org/abs/2401.03881", "description": "Within Bayesian nonparametrics, dependent Dirichlet process mixture models\nprovide a highly flexible approach for conducting inference about the\nconditional density function. However, several formulations of this class make\neither rather restrictive modelling assumptions or involve intricate algorithms\nfor posterior inference, thus preventing their widespread use. In response to\nthese challenges, we present a flexible, versatile, and computationally\ntractable model for density regression based on a single-weights dependent\nDirichlet process mixture of normal distributions model for univariate\ncontinuous responses. We assume an additive structure for the mean of each\nmixture component and incorporate the effects of continuous covariates through\nsmooth nonlinear functions. The key components of our modelling approach are\npenalised B-splines and their bivariate tensor product extension. Our proposed\nmethod also seamlessly accommodates parametric effects of categorical\ncovariates, linear effects of continuous covariates, interactions between\ncategorical and/or continuous covariates, varying coefficient terms, and random\neffects, which is why we refer to our model as a Dirichlet process mixture of\nnormal structured additive regression models. A noteworthy feature of our\nmethod is its efficiency in posterior simulation through Gibbs sampling, as\nclosed-form full conditional distributions for all model parameters are\navailable. Results from a simulation study demonstrate that our approach\nsuccessfully recovers true conditional densities and other regression\nfunctionals in various challenging scenarios. Applications to a toxicology,\ndisease diagnosis, and agricultural study are provided and further underpin the\nbroad applicability of our modelling framework. 
An R package, \\texttt{DDPstar},\nimplementing the proposed method is publicly available at\n\\url{https://bitbucket.org/mxrodriguez/ddpstar}."}, "http://arxiv.org/abs/2401.03891": {"title": "Radius selection using kernel density estimation for the computation of nonlinear measures", "link": "http://arxiv.org/abs/2401.03891", "description": "When nonlinear measures are estimated from sampled temporal signals with\nfinite length, a radius parameter must be carefully selected to avoid a poor\nestimation. These measures are generally derived from the correlation integral\nwhich quantifies the probability of finding neighbors, i.e. pairs of points\nspaced by less than the radius parameter. While each nonlinear measure comes\nwith several specific empirical rules to select a radius value, we provide a\nsystematic selection method. We show that the optimal radius for nonlinear\nmeasures can be approximated by the optimal bandwidth of a Kernel Density\nEstimator (KDE) related to the correlation sum. The KDE framework provides\nnon-parametric tools to approximate a density function from finite samples\n(e.g. histograms) and optimal methods to select a smoothing parameter, the\nbandwidth (e.g. bin width in histograms). We use results from KDE to derive a\nclosed-form expression for the optimal radius. The latter is used to compute\nthe correlation dimension and to construct recurrence plots yielding an\nestimate of Kolmogorov-Sinai entropy. We assess our method through numerical\nexperiments on signals generated by nonlinear systems and experimental\nelectroencephalographic time series."}, "http://arxiv.org/abs/2401.04036": {"title": "A regularized MANOVA test for semicontinuous high-dimensional data", "link": "http://arxiv.org/abs/2401.04036", "description": "We propose a MANOVA test for semicontinuous data that is applicable also when\nthe dimensionality exceeds the sample size. The test statistic is obtained as a\nlikelihood ratio, where numerator and denominator are computed at the maxima of\npenalized likelihood functions under each hypothesis. Closed form solutions for\nthe regularized estimators allow us to avoid computational overheads. We derive\nthe null distribution using a permutation scheme. The power and level of the\nresulting test are evaluated in a simulation study. We illustrate the new\nmethodology with two original data analyses, one regarding microRNA expression\nin human blastocyst cultures, and another regarding alien plant species\ninvasion in the island of Socotra (Yemen)."}, "http://arxiv.org/abs/2401.04086": {"title": "A Priori Determination of the Pretest Probability", "link": "http://arxiv.org/abs/2401.04086", "description": "In this manuscript, we present various proposed methods to estimate the\nprevalence of disease, a critical prerequisite for the adequate interpretation\nof screening tests. To address the limitations of these approaches, which\nrevolve primarily around their a posteriori nature, we introduce a novel method\nto estimate the pretest probability of disease, a priori, utilizing the Logit\nfunction from the logistic regression model. This approach is a modification of\nMcGee's heuristic, originally designed for estimating the posttest probability\nof disease. 
In a patient presenting with $n_\\theta$ signs or symptoms, the\nminimal bound of the pretest probability, $\\phi$, can be approximated by:\n\n$\\phi \\approx\n\\frac{1}{5}{ln\\left[\\displaystyle\\prod_{\\theta=1}^{i}\\kappa_\\theta\\right]}$\n\nwhere $ln$ is the natural logarithm, and $\\kappa_\\theta$ is the likelihood\nratio associated with the sign or symptom in question."}, "http://arxiv.org/abs/2107.00118": {"title": "Do we need to estimate the variance in robust mean estimation?", "link": "http://arxiv.org/abs/2107.00118", "description": "In this paper, we propose self-tuned robust estimators for estimating the\nmean of heavy-tailed distributions, which refer to distributions with only\nfinite variances. Our approach introduces a new loss function that considers\nboth the mean parameter and a robustification parameter. By jointly optimizing\nthe empirical loss function with respect to both parameters, the\nrobustification parameter estimator can automatically adapt to the unknown data\nvariance, and thus the self-tuned mean estimator can achieve optimal\nfinite-sample performance. Our method outperforms previous approaches in terms\nof both computational and asymptotic efficiency. Specifically, it does not\nrequire cross-validation or Lepski's method to tune the robustification\nparameter, and the variance of our estimator achieves the Cram\\'er-Rao lower\nbound. Project source code is available at\n\\url{https://github.com/statsle/automean}."}, "http://arxiv.org/abs/2201.06606": {"title": "Drift vs Shift: Decoupling Trends and Changepoint Analysis", "link": "http://arxiv.org/abs/2201.06606", "description": "We introduce a new approach for decoupling trends (drift) and changepoints\n(shifts) in time series. Our locally adaptive model-based approach for robustly\ndecoupling combines Bayesian trend filtering and machine learning based\nregularization. An over-parameterized Bayesian dynamic linear model (DLM) is\nfirst applied to characterize drift. Then a weighted penalized likelihood\nestimator is paired with the estimated DLM posterior distribution to identify\nshifts. We show how Bayesian DLMs specified with so-called shrinkage priors can\nprovide smooth estimates of underlying trends in the presence of complex noise\ncomponents. However, their inability to shrink exactly to zero inhibits direct\nchangepoint detection. In contrast, penalized likelihood methods are highly\neffective in locating changepoints. However, they require data with simple\npatterns in both signal and noise. The proposed decoupling approach combines\nthe strengths of both, i.e. the flexibility of Bayesian DLMs with the hard\nthresholding property of penalized likelihood estimators, to provide\nchangepoint analysis in complex, modern settings. The proposed framework is\noutlier robust and can identify a variety of changes, including in mean and\nslope. It is also easily extended for analysis of parameter shifts in\ntime-varying parameter models like dynamic regressions. 
We illustrate the\nflexibility and contrast the performance and robustness of our approach with\nseveral alternative methods across a wide range of simulations and application\nexamples."}, "http://arxiv.org/abs/2303.00471": {"title": "E-values for k-Sample Tests With Exponential Families", "link": "http://arxiv.org/abs/2303.00471", "description": "We develop and compare e-variables for testing whether $k$ samples of data\nare drawn from the same distribution, the alternative being that they come from\ndifferent elements of an exponential family. We consider the GRO (growth-rate\noptimal) e-variables for (1) a `small' null inside the same exponential family,\nand (2) a `large' nonparametric null, as well as (3) an e-variable arrived at\nby conditioning on the sum of the sufficient statistics. (2) and (3) are\nefficiently computable, and extend ideas from Turner et al. [2021] and Wald\n[1947] respectively from Bernoulli to general exponential families. We provide\ntheoretical and simulation-based comparisons of these e-variables in terms of\ntheir logarithmic growth rate, and find that for small effects all four\ne-variables behave surprisingly similarly; for the Gaussian location and\nPoisson families, e-variables (1) and (3) coincide; for Bernoulli, (1) and (2)\ncoincide; but in general, whether (2) or (3) grows faster under the alternative\nis family-dependent. We furthermore discuss algorithms for numerically\napproximating (1)."}, "http://arxiv.org/abs/2303.13850": {"title": "Towards Learning and Explaining Indirect Causal Effects in Neural Networks", "link": "http://arxiv.org/abs/2303.13850", "description": "Recently, there has been a growing interest in learning and explaining causal\neffects within Neural Network (NN) models. By virtue of NN architectures,\nprevious approaches consider only direct and total causal effects assuming\nindependence among input variables. We view an NN as a structural causal model\n(SCM) and extend our focus to include indirect causal effects by introducing\nfeedforward connections among input neurons. We propose an ante-hoc method that\ncaptures and maintains direct, indirect, and total causal effects during NN\nmodel training. We also propose an algorithm for quantifying learned causal\neffects in an NN model and efficient approximation strategies for quantifying\ncausal effects in high-dimensional data. Extensive experiments conducted on\nsynthetic and real-world datasets demonstrate that the causal effects learned\nby our ante-hoc method better approximate the ground truth effects compared to\nexisting methods."}, "http://arxiv.org/abs/2304.08242": {"title": "The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges", "link": "http://arxiv.org/abs/2304.08242", "description": "Numerical interactions leading to users sharing textual content published by\nothers are naturally represented by a network where the individuals are\nassociated with the nodes and the exchanged texts with the edges. To understand\nthose heterogeneous and complex data structures, clustering nodes into\nhomogeneous groups as well as rendering a comprehensible visualisation of the\ndata is mandatory. To address both issues, we introduce Deep-LPTM, a\nmodel-based clustering strategy relying on a variational graph auto-encoder\napproach as well as a probabilistic model to characterise the topics of\ndiscussion. Deep-LPTM allows us to build a joint representation of the nodes and\nof the edges in two embedding spaces. 
The parameters are inferred using a\nvariational inference algorithm. We also introduce IC2L, a model selection\ncriterion specifically designed to choose models with relevant clustering and\nvisualisation properties. An extensive benchmark study on synthetic data is\nprovided. In particular, we find that Deep-LPTM better recovers the partitions\nof the nodes than the state-of-the-art ETSBM and STBM. Finally, the emails\nof the Enron company are analysed and visualisations of the results are\npresented, with meaningful highlights of the graph structure."}, "http://arxiv.org/abs/2305.05126": {"title": "Comparing Foundation Models using Data Kernels", "link": "http://arxiv.org/abs/2305.05126", "description": "Recent advances in self-supervised learning and neural network scaling have\nenabled the creation of large models, known as foundation models, which can be\neasily adapted to a wide range of downstream tasks. The current paradigm for\ncomparing foundation models involves evaluating them with aggregate metrics on\nvarious benchmark datasets. This method of model comparison is heavily\ndependent on the chosen evaluation metric, which makes it unsuitable for\nsituations where the ideal metric is either not obvious or unavailable. In this\nwork, we present a methodology for directly comparing the embedding space\ngeometry of foundation models, which facilitates model comparison without the\nneed for an explicit evaluation metric. Our methodology is grounded in random\ngraph theory and enables valid hypothesis testing of embedding similarity on a\nper-datum basis. Further, we demonstrate how our methodology can be extended to\nfacilitate population level model comparison. In particular, we show how our\nframework can induce a manifold of models equipped with a distance function\nthat correlates strongly with several downstream metrics. We remark on the\nutility of this population level model comparison as a first step towards a\ntaxonomic science of foundation models."}, "http://arxiv.org/abs/2305.18044": {"title": "Bayesian estimation of clustered dependence structures in functional neuroconnectivity", "link": "http://arxiv.org/abs/2305.18044", "description": "Motivated by the need to model the dependence between regions of interest in\nfunctional neuroconnectivity for efficient inference, we propose a new\nsampling-based Bayesian clustering approach for covariance structures of\nhigh-dimensional Gaussian outcomes. The key technique is based on a Dirichlet\nprocess that clusters covariance sub-matrices into independent groups of\noutcomes, thereby naturally inducing sparsity in the whole brain connectivity\nmatrix. A new split-merge algorithm is employed to achieve convergence of the\nMarkov chain that is shown empirically to recover both uniform and Dirichlet\npartitions with high accuracy. We investigate the empirical performance of the\nproposed method through extensive simulations. Finally, the proposed approach\nis used to group regions of interest into functionally independent groups in\nthe Autism Brain Imaging Data Exchange participants with autism spectrum\ndisorder and co-occurring attention-deficit/hyperactivity disorder."}, "http://arxiv.org/abs/2306.12528": {"title": "Structured Learning in Time-dependent Cox Models", "link": "http://arxiv.org/abs/2306.12528", "description": "Cox models with time-dependent coefficients and covariates are widely used in\nsurvival analysis. 
In high-dimensional settings, sparse regularization\ntechniques are employed for variable selection, but existing methods for\ntime-dependent Cox models lack flexibility in enforcing specific sparsity\npatterns (i.e., covariate structures). We propose a flexible framework for\nvariable selection in time-dependent Cox models, accommodating complex\nselection rules. Our method can adapt to arbitrary grouping structures,\nincluding interaction selection, temporal, spatial, tree, and directed acyclic\ngraph structures. It achieves accurate estimation with low false alarm rates.\nWe develop the sox package, implementing a network flow algorithm for\nefficiently solving models with complex covariate structures. sox offers a\nuser-friendly interface for specifying grouping structures and delivers fast\ncomputation. Through examples, including a case study on identifying predictors\nof time to all-cause death in atrial fibrillation patients, we demonstrate the\npractical application of our method with specific selection rules."}, "http://arxiv.org/abs/2401.04156": {"title": "LASPATED: a Library for the Analysis of SPAtio-TEmporal Discrete data", "link": "http://arxiv.org/abs/2401.04156", "description": "We describe methods, tools, and a software library called LASPATED, available\non GitHub (at https://github.com/vguigues/) to fit models using spatio-temporal\ndata and space-time discretization. A video tutorial for this library is\navailable on YouTube. We consider two types of methods to estimate a\nnon-homogeneous Poisson process in space and time. The methods approximate the\narrival intensity function of the Poisson process by discretizing space and\ntime, and estimating arrival intensity as a function of subregion and time\ninterval. With such methods, it is typical that the dimension of the estimator\nis large relative to the amount of data, and therefore the performance of the\nestimator can be improved by using additional data. The first method uses\nadditional data to add a regularization term to the likelihood function for\ncalibrating the intensity of the Poisson process. The second method uses\nadditional data to estimate arrival intensity as a function of covariates. We\ndescribe a Python package to perform various types of space and time\ndiscretization. We also describe two packages for the calibration of the\nmodels, one in Matlab and one in C++. We demonstrate the advantages of our\nmethods compared to basic maximum likelihood estimation with simulated and real\ndata. The experiments with real data calibrate models of the arrival process of\nemergencies to be handled by the Rio de Janeiro emergency medical service."}, "http://arxiv.org/abs/2401.04190": {"title": "Is it possible to know cosmological fine-tuning?", "link": "http://arxiv.org/abs/2401.04190", "description": "Fine-tuning studies whether some physical parameters, or relevant ratios\nbetween them, are located within so-called life-permitting intervals of small\nprobability outside of which carbon-based life would not be possible. Recent\ndevelopments have found estimates of these probabilities that circumvent\nprevious concerns of measurability and selection bias. However, the question\nremains if fine-tuning can indeed be known. Using a mathematization of the\nepistemological concepts of learning and knowledge acquisition, we argue that\nmost examples that have been touted as fine-tuned cannot be formally assessed\nas such. 
Nevertheless, fine-tuning can be known when the physical parameter is\nseen as a random variable and it is supported on the nonnegative real line,\nprovided the size of the life-permitting interval is small in relation to the\nobserved value of the parameter."}, "http://arxiv.org/abs/2401.04240": {"title": "A New Cure Rate Model with Discrete and Multiple Exposures", "link": "http://arxiv.org/abs/2401.04240", "description": "Cure rate models are mostly used to study data arising from cancer clinical\ntrials. Their use in the context of infectious diseases has not been explored\nwell. In 2008, Tournoud and Ecochard first proposed a mechanistic formulation\nof a cure rate model in the context of infectious diseases with multiple\nexposures to infection. However, they assumed a simple Poisson distribution to\ncapture the unobserved pathogens at each exposure time. In this paper, we\npropose a new cure rate model to study infectious diseases with discrete\nmultiple exposures to infection. Our formulation captures both over-dispersion\nand under-dispersion with respect to the count of pathogens at each time of\nexposure. We also propose a new estimation method based on the expectation\nmaximization algorithm to calculate the maximum likelihood estimates of the\nmodel parameters. We carry out a detailed Monte Carlo simulation study to\ndemonstrate the performance of the proposed model and estimation algorithm. The\nflexibility of our proposed model also allows us to carry out a model\ndiscrimination. For this purpose, we use both the likelihood ratio test and\ninformation-based criteria. Finally, we illustrate our proposed model using a\nrecently collected dataset on COVID-19."}, "http://arxiv.org/abs/2401.04263": {"title": "Two-Step Targeted Minimum-Loss Based Estimation for Non-Negative Two-Part Outcomes", "link": "http://arxiv.org/abs/2401.04263", "description": "Non-negative two-part outcomes are defined as outcomes with a density\nfunction that has a zero point mass but is otherwise positive. Examples, such\nas healthcare expenditure and hospital length of stay, are common in healthcare\nutilization research. Despite the practical relevance of non-negative two-part\noutcomes, very few methods exist to leverage knowledge of their semicontinuity\nto achieve improved performance in estimating causal effects. In this paper, we\ndevelop a nonparametric two-step targeted minimum-loss based estimator (denoted\nas hTMLE) for non-negative two-part outcomes. We present methods for a general\nclass of interventions referred to as modified treatment policies, which can\naccommodate continuous, categorical, and binary exposures. The two-step TMLE\nuses a targeted estimate of the intensity component of the outcome to produce a\ntargeted estimate of the binary component of the outcome that may improve\nfinite sample efficiency. We demonstrate the efficiency gains achieved by the\ntwo-step TMLE with simulated examples and then apply it to a cohort of Medicaid\nbeneficiaries to estimate the effect of chronic pain and physical disability on\ndays' supply of opioids."}, "http://arxiv.org/abs/2401.04297": {"title": "Staged trees for discrete longitudinal data", "link": "http://arxiv.org/abs/2401.04297", "description": "In this paper we investigate the use of staged tree models for discrete\nlongitudinal data. Staged trees are a type of probabilistic graphical model for\nfinite sample space processes. 
They are a natural fit for longitudinal data\nbecause a temporal ordering is often implicitly assumed and standard methods\ncan be used for model selection and probability estimation. However, model\nselection methods perform poorly when the sample size is small relative to the\nsize of the graph and model interpretation is tricky with larger graphs. This\nis exacerbated by longitudinal data which is characterised by repeated\nobservations. To address these issues we propose two approaches: the\nlongitudinal staged tree with Markov assumptions which makes some initial\nconditional independence assumptions represented by a directed acyclic graph\nand marginal longitudinal staged trees which model certain margins of the data."}, "http://arxiv.org/abs/2401.04352": {"title": "Non-Deterministic Extension of Plasma Wind Tunnel Data Calibrated Model Predictions to Flight Conditions", "link": "http://arxiv.org/abs/2401.04352", "description": "This work proposes a novel approach for non-deterministic extension of\nexperimental data that considers structural model inadequacy for conditions\nother than the calibration scenario while simultaneously resolving any\nsignificant prior-data discrepancy with information extracted from flight\nmeasurements. This functionality is achieved through methodical utilization of\nmodel error emulators and Bayesian model averaging studies with available\nresponse data. The outlined approach does not require prior flight data\navailability and introduces straightforward mechanisms for their assimilation\nin future predictions. Application of the methodology is demonstrated herein by\nextending material performance data captured at the HyMETS facility to the MSL\nscenario, where the described process yields results that exhibit significantly\nimproved capacity for predictive uncertainty quantification studies. This work\nalso investigates limitations associated with straightforward uncertainty\npropagation procedures onto calibrated model predictions for the flight\nscenario and manages computational requirements with sensitivity analysis and\nsurrogate modeling techniques."}, "http://arxiv.org/abs/2401.04450": {"title": "Recanting twins: addressing intermediate confounding in mediation analysis", "link": "http://arxiv.org/abs/2401.04450", "description": "The presence of intermediate confounders, also called recanting witnesses, is\na fundamental challenge to the investigation of causal mechanisms in mediation\nanalysis, preventing the identification of natural path-specific effects.\nProposed alternative parameters (such as randomizational interventional\neffects) are problematic because they can be non-null even when there is no\nmediation for any individual in the population; i.e., they are not an average\nof underlying individual-level mechanisms. In this paper we develop a novel\nmethod for mediation analysis in settings with intermediate confounding, with\nguarantees that the causal parameters are summaries of the individual-level\nmechanisms of interest. The method is based on recently proposed ideas that\nview causality as the transfer of information, and thus replace recanting\nwitnesses by draws from their conditional distribution, what we call \"recanting\ntwins\". We show that, in the absence of intermediate confounding, recanting\ntwin effects recover natural path-specific effects. 
We present the assumptions\nrequired for identification of recanting twin effects under a standard\nstructural causal model, as well as the assumptions under which the recanting\ntwin identification formulas can be interpreted in the context of the recently\nproposed separable effects models. To estimate recanting-twin effects, we\ndevelop efficient semi-parametric estimators that allow the use of data driven\nmethods in the estimation of the nuisance parameters. We present numerical\nstudies of the methods using synthetic data, as well as an application to\nevaluate the role of new-onset anxiety and depressive disorder in explaining\nthe relationship between gabapentin/pregabalin prescription and incident opioid\nuse disorder among Medicaid beneficiaries with chronic pain."}, "http://arxiv.org/abs/2401.04490": {"title": "Testing similarity of parametric competing risks models for identifying potentially similar pathways in healthcare", "link": "http://arxiv.org/abs/2401.04490", "description": "The identification of similar patient pathways is a crucial task in\nhealthcare analytics. A flexible tool to address this issue is provided by parametric\ncompeting risks models, where transition intensities may be specified by a\nvariety of parametric distributions, thus in particular being possibly\ntime-dependent. We assess the similarity between two such models by examining\nthe transitions between different health states. This research introduces a\nmethod to measure the maximum differences in transition intensities over time,\nleading to the development of a test procedure for assessing similarity. We\npropose a parametric bootstrap approach for this purpose and provide a proof to\nconfirm the validity of this procedure. The performance of our proposed method\nis evaluated through a simulation study, considering a range of sample sizes,\ndiffering amounts of censoring, and various thresholds for similarity. Finally,\nwe demonstrate the practical application of our approach with a case study from\nurological clinical routine practice, which inspired this research."}, "http://arxiv.org/abs/2401.04498": {"title": "Efficient designs for multivariate crossover trials", "link": "http://arxiv.org/abs/2401.04498", "description": "This article aims to study efficient/trace optimal designs for crossover\ntrials, with multiple responses recorded from each subject in each time period.\nA multivariate fixed effect model is proposed with direct and carryover effects\ncorresponding to the multiple responses and the error dispersion matrix\nallowing for correlations to exist between and within responses. Two\ncorrelation structures, namely the proportional and the generalized Markov\ncovariances, are studied. The corresponding information matrices for direct\neffects under the two covariances are used to determine efficient designs.\nEfficiency of orthogonal array designs of Type $I$ and strength $2$ is\ninvestigated for the two covariance forms. To motivate the multivariate\ncrossover designs, a gene expression dataset in a $3 \times 3$ framework is\nutilized."}, "http://arxiv.org/abs/2401.04603": {"title": "Skewed Pivot-Blend Modeling with Applications to Semicontinuous Outcomes", "link": "http://arxiv.org/abs/2401.04603", "description": "Skewness is a common occurrence in statistical applications. In recent years,\nvarious distribution families have been proposed to model skewed data by\nintroducing unequal scales based on the median or mode. 
However, we argue that\nthe point at which unbalanced scales occur may be at any quantile and cannot be\nreparametrized as an ordinary shift parameter in the presence of skewness. In\nthis paper, we introduce a novel skewed pivot-blend technique to create a\nskewed density family based on any continuous density, even those that are\nasymmetric and nonunimodal. Our framework enables the simultaneous estimation\nof scales, the pivotal point, and other location parameters, along with various\nextensions. We also introduce a skewed two-part model tailored for\nsemicontinuous outcomes, which identifies relevant variables across the entire\npopulation and mitigates the additional skewness induced by commonly used\ntransformations. Our theoretical analysis reveals the influence of skewness\nwithout assuming asymptotic conditions. Experiments on synthetic and real-life\ndata demonstrate the excellent performance of the proposed method."}, "http://arxiv.org/abs/2401.04686": {"title": "Weighted likelihood methods for robust fitting of wrapped models for $p$-torus data", "link": "http://arxiv.org/abs/2401.04686", "description": "We consider robust estimation of wrapped models for multivariate circular data\nthat are points on the surface of a $p$-torus, based on the weighted likelihood\nmethodology. Robust model fitting is achieved by a set of weighted likelihood\nestimating equations, based on the computation of data dependent weights aimed\nto down-weight anomalous values, such as unexpected directions that do not\nshare the main pattern of the bulk of the data. Weighted likelihood estimating\nequations with weights evaluated on the torus or obtained after unwrapping the\ndata onto the Euclidean space are proposed and compared. Asymptotic properties\nand robustness features of the estimators under consideration have been studied,\nwhereas their finite sample behavior has been investigated by Monte Carlo\nnumerical experiments and real data examples."}, "http://arxiv.org/abs/2401.04689": {"title": "Efficient estimation for ergodic diffusion processes sampled at high frequency", "link": "http://arxiv.org/abs/2401.04689", "description": "A general theory of efficient estimation for ergodic diffusion processes\nsampled at high frequency with an infinite time horizon is presented. High\nfrequency sampling is common in many applications, with finance as a prominent\nexample. The theory is formulated in terms of approximate martingale estimating\nfunctions and covers a large class of estimators including most of the\npreviously proposed estimators for diffusion processes. Easily checked\nconditions ensuring that an estimating function is an approximate martingale\nare derived, and general conditions ensuring consistency and asymptotic\nnormality of estimators are given. Most importantly, simple conditions are\ngiven that ensure rate optimality and efficiency. Rate optimal estimators of\nparameters in the diffusion coefficient converge faster than estimators of\ndrift coefficient parameters because they take advantage of the information in\nthe quadratic variation. The conditions facilitate the choice among the\nmultitude of estimators that have been proposed for diffusion models. Optimal\nmartingale estimating functions in the sense of Godambe and Heyde and their\nhigh frequency approximations are, under weak conditions, shown to satisfy the\nconditions for rate optimality and efficiency. 
This provides a natural, feasible\nmethod of constructing explicit rate optimal and efficient estimating functions\nby solving a linear equation."}, "http://arxiv.org/abs/2401.04693": {"title": "Co-Clustering Multi-View Data Using the Latent Block Model", "link": "http://arxiv.org/abs/2401.04693", "description": "The Latent Block Model (LBM) is a prominent model-based co-clustering method,\nreturning parametric representations of each block cluster and allowing the use\nof well-grounded model selection methods. The LBM, while adapted in the literature\nto handle different feature types, cannot be applied to datasets consisting of\nmultiple disjoint sets of features, termed views, for a common set of\nobservations. In this work, we introduce the multi-view LBM, extending the LBM\nmethod to multi-view data, where each view marginally follows an LBM. In the\ncase of two views, the dependence between them is captured by a cluster\nmembership matrix, and we aim to learn the structure of this matrix. We develop\na likelihood-based approach in which parameter estimation uses a stochastic EM\nalgorithm integrating a Gibbs sampler, and an ICL criterion is derived to\ndetermine the number of row and column clusters in each view. To motivate the\napplication of multi-view methods, we extend recent work developing hypothesis\ntests for the null hypothesis that clusters of observations in each view are\nindependent of each other. The testing procedure is integrated into the model\nestimation strategy. Furthermore, we introduce a penalty scheme to generate\nsparse row clusterings. We verify the performance of the developed algorithm\nusing synthetic datasets, and provide guidance for optimal parameter selection.\nFinally, the multi-view co-clustering method is applied to a complex genomics\ndataset, and is shown to provide new insights for high-dimensional multi-view\nproblems."}, "http://arxiv.org/abs/2401.04723": {"title": "Spatio-temporal data fusion for the analysis of in situ and remote sensing data using the INLA-SPDE approach", "link": "http://arxiv.org/abs/2401.04723", "description": "We propose a Bayesian hierarchical model to address the challenge of spatial\nmisalignment in spatio-temporal data obtained from in situ and satellite\nsources. The model is fit using the INLA-SPDE approach, which provides\nefficient computation. Our methodology combines the different data sources in a\n\"fusion\" model via the construction of projection matrices in both spatial and\ntemporal domains. Through simulation studies, we demonstrate that the fusion\nmodel has superior performance in prediction accuracy across space and time\ncompared to standalone \"in situ\" and \"satellite\" models based on only in situ\nor satellite data, respectively. The fusion model also generally outperforms\nthe standalone models in terms of parameter inference. Such a modeling approach\nis motivated by environmental problems, and our specific focus is on the\nanalysis and prediction of harmful algae bloom (HAB) events, where the\nconvention is to conduct separate analyses based on either in situ samples or\nsatellite images. 
A real data analysis shows that the proposed model is a\nnecessary step towards a unified characterization of bloom dynamics and\nidentifying the key drivers of HAB events."}, "http://arxiv.org/abs/1807.00347": {"title": "Robust Inference Under Heteroskedasticity via the Hadamard Estimator", "link": "http://arxiv.org/abs/1807.00347", "description": "Drawing statistical inferences from large datasets in a model-robust way is\nan important problem in statistics and data science. In this paper, we propose\nmethods that are robust to large and unequal noise in different observational\nunits (i.e., heteroskedasticity) for statistical inference in linear\nregression. We leverage the Hadamard estimator, which is unbiased for the\nvariances of ordinary least-squares regression. This is in contrast to the\npopular White's sandwich estimator, which can be substantially biased in high\ndimensions. We propose to estimate the signal strength, noise level,\nsignal-to-noise ratio, and mean squared error via the Hadamard estimator. We\ndevelop a new degrees of freedom adjustment that gives more accurate confidence\nintervals than variants of White's sandwich estimator. Moreover, we provide\nconditions ensuring the estimator is well-defined, by studying a new random\nmatrix ensemble in which the entries of a random orthogonal projection matrix\nare squared. We also show approximate normality, using the second-order\nPoincare inequality. Our work provides improved statistical theory and methods\nfor linear regression in high dimensions."}, "http://arxiv.org/abs/2110.12235": {"title": "Adjusting for indirectly measured confounding using large-scale propensity scores", "link": "http://arxiv.org/abs/2110.12235", "description": "Confounding remains one of the major challenges to causal inference with\nobservational data. This problem is paramount in medicine, where we would like\nto answer causal questions from large observational datasets like electronic\nhealth records (EHRs) and administrative claims. Modern medical data typically\ncontain tens of thousands of covariates. Such a large set carries hope that\nmany of the confounders are directly measured, and further hope that others are\nindirectly measured through their correlation with measured covariates. How can\nwe exploit these large sets of covariates for causal inference? To help answer\nthis question, this paper examines the performance of the large-scale\npropensity score (LSPS) approach on causal analysis of medical data. We\ndemonstrate that LSPS may adjust for indirectly measured confounders by\nincluding tens of thousands of covariates that may be correlated with them. We\npresent conditions under which LSPS removes bias due to indirectly measured\nconfounders, and we show that LSPS may avoid bias when inadvertently adjusting\nfor variables (like colliders) that otherwise can induce bias. We demonstrate\nthe performance of LSPS with both simulated medical data and real medical data."}, "http://arxiv.org/abs/2208.07014": {"title": "Proximal Survival Analysis to Handle Dependent Right Censoring", "link": "http://arxiv.org/abs/2208.07014", "description": "Many epidemiological and clinical studies aim at analyzing a time-to-event\nendpoint. A common complication is right censoring. In some cases, it arises\nbecause subjects are still surviving after the study terminates or move out of\nthe study area, in which case right censoring is typically treated as\nindependent or non-informative. 
Such an assumption can be further relaxed to\nconditionally independent censoring by leveraging possibly time-varying covariate\ninformation, if available, assuming censoring and failure time are independent\namong covariate strata. In yet other instances, events may be censored by other\ncompeting events like death and are associated with censoring possibly through\nprognoses. Realistically, measured covariates can rarely capture all such\nassociations with certainty. For such dependent censoring, often covariate\nmeasurements are at best proxies of underlying prognoses. In this paper, we\nestablish a nonparametric identification framework by formally admitting that\nconditionally independent censoring may fail in practice and accounting for\ncovariate measurements as imperfect proxies of underlying association. The\nframework suggests adaptive estimators for which we give generic assumptions under\nwhich they are consistent, asymptotically normal, and doubly robust. We\nillustrate our framework with concrete settings, where we examine the\nfinite-sample performance of our proposed estimators via a Monte-Carlo\nsimulation and apply them to the SEER-Medicare dataset."}, "http://arxiv.org/abs/2306.03384": {"title": "A Calibrated Data-Driven Approach for Small Area Estimation using Big Data", "link": "http://arxiv.org/abs/2306.03384", "description": "Where the response variable in a big data set is consistent with the variable\nof interest for small area estimation, the big data by itself can provide the\nestimates for small areas. These estimates are often subject to the coverage\nand measurement error bias inherited from the big data. However, if a\nprobability survey of the same variable of interest is available, the survey\ndata can be used as a training data set to develop an algorithm to impute for\nthe data missed by the big data and adjust for measurement errors. In this\npaper, we outline a methodology for such imputations based on a kNN algorithm\ncalibrated to an asymptotically design-unbiased estimate of the national total\nand illustrate the use of a training data set to estimate the imputation bias\nand the fixed-asymptotic bootstrap to estimate the variance of the small area\nhybrid estimator. We illustrate the methodology of this paper using a public\nuse data set and use it to compare the accuracy and precision of our hybrid\nestimator with the Fay-Herriot (FH) estimator. Finally, we also examine\nnumerically the accuracy and precision of the FH estimator when the auxiliary\nvariables used in the linking models are subject to under-coverage errors."}, "http://arxiv.org/abs/2306.06270": {"title": "Markov bases: a 25 year update", "link": "http://arxiv.org/abs/2306.06270", "description": "In this paper, we evaluate the challenges and best practices associated with\nthe Markov bases approach to sampling from conditional distributions. We\nprovide insights and clarifications 25 years after the publication of the\nfundamental theorem for Markov bases by Diaconis and Sturmfels. 
In addition to\na literature review, we prove three new results on the complexity of Markov\nbases in hierarchical models, relaxations of the fibers in log-linear models,\nand limitations of partial sets of moves in providing an irreducible Markov\nchain."}, "http://arxiv.org/abs/2309.02281": {"title": "s-ID: Causal Effect Identification in a Sub-Population", "link": "http://arxiv.org/abs/2309.02281", "description": "Causal inference in a sub-population involves identifying the causal effect\nof an intervention on a specific subgroup, which is distinguished from the\nwhole population through the influence of systematic biases in the sampling\nprocess. However, ignoring the subtleties introduced by sub-populations can\neither lead to erroneous inference or limit the applicability of existing\nmethods. We introduce and advocate for a causal inference problem in\nsub-populations (henceforth called s-ID), in which we merely have access to\nobservational data of the targeted sub-population (as opposed to the entire\npopulation). Existing inference problems in sub-populations operate on the\npremise that the given data distributions originate from the entire population\nand thus cannot tackle the s-ID problem. To address this gap, we provide necessary\nand sufficient conditions that must hold in the causal graph for a causal\neffect in a sub-population to be identifiable from the observational\ndistribution of that sub-population. Given these conditions, we present a sound\nand complete algorithm for the s-ID problem."}, "http://arxiv.org/abs/2401.04753": {"title": "Dynamic Models Augmented by Hierarchical Data: An Application Of Estimating HIV Epidemics At Sub-National And Sub-Population Level", "link": "http://arxiv.org/abs/2401.04753", "description": "Dynamic models have been successfully used in producing estimates of HIV\nepidemics at the national level due to their epidemiological nature and their\nability to estimate prevalence, incidence, and mortality rates simultaneously.\nRecently, HIV interventions and policies have required more information at\nsub-national levels to support local planning, decision making and resource\nallocation. Unfortunately, many areas lack sufficient data for deriving stable\nand reliable results, and this is a critical technical barrier to more\nstratified estimates. One solution is to borrow information from other areas\nwithin the same country. However, directly assuming hierarchical structures\nwithin the HIV dynamic models is complicated and computationally\ntime-consuming. In this paper, we propose a simple and innovative way to\nincorporate hierarchical information into the dynamical systems by using\nauxiliary data. The proposed method efficiently uses information from multiple\nareas within each country without increasing the computational burden. As a\nresult, the new model improves predictive ability and uncertainty assessment."}, "http://arxiv.org/abs/2401.04771": {"title": "Network Layout Algorithm with Covariate Smoothing", "link": "http://arxiv.org/abs/2401.04771", "description": "Network science explores intricate connections among objects, employed in\ndiverse domains like social interactions, fraud detection, and disease spread.\nVisualization of networks facilitates conceptualizing research questions and\nforming scientific hypotheses. Networks, as mathematical high-dimensional\nobjects, require dimensionality reduction for (planar) visualization.\nVisualizing empirical networks presents additional challenges. 
They often\ncontain false positive (spurious) and false negative (missing) edges.\nTraditional visualization methods don't account for errors in observation,\npotentially biasing interpretations. Moreover, contemporary network data\nincludes rich nodal attributes. However, traditional methods neglect these\nattributes when computing node locations. Our visualization approach aims to\nleverage nodal attribute richness to compensate for network data limitations.\nWe employ a statistical model estimating the probability of edge connections\nbetween nodes based on their covariates. We enhance the Fruchterman-Reingold\nalgorithm to incorporate estimated dyad connection probabilities, allowing\npractitioners to balance reliance on observed versus estimated edges. We\nexplore optimal smoothing levels, offering a natural way to include relevant\nnodal information in layouts. Results demonstrate the effectiveness of our\nmethod in achieving robust network visualization, providing insights for\nimproved analysis."}, "http://arxiv.org/abs/2401.04775": {"title": "Approximate Inference for Longitudinal Mechanistic HIV Contact Networks", "link": "http://arxiv.org/abs/2401.04775", "description": "Network models are increasingly used to study infectious disease spread.\nExponential Random Graph models have a history in this area, with scalable\ninference methods now available. An alternative approach uses mechanistic\nnetwork models. Mechanistic network models directly capture individual\nbehaviors, making them suitable for studying sexually transmitted diseases.\nCombining mechanistic models with Approximate Bayesian Computation allows\nflexible modeling using domain-specific interaction rules among agents,\navoiding network model oversimplifications. These models are ideal for\nlongitudinal settings as they explicitly incorporate network evolution over\ntime. We implemented a discrete-time version of a previously published\ncontinuous-time model of evolving contact networks for men who have sex with\nmen (MSM) and proposed an ABC-based approximate inference scheme for it. As\nexpected, we found that a two-wave longitudinal study design improves the\naccuracy of inference compared to a cross-sectional design. However, the gains\nin precision in collecting data twice, up to 18%, depend on the spacing of the\ntwo waves and are sensitive to the choice of summary statistics. In addition to\nmethodological developments, our results inform the design of future\nlongitudinal network studies in sexually transmitted diseases, specifically in\nterms of what data to collect from participants and when to do so."}, "http://arxiv.org/abs/2401.04778": {"title": "Generative neural networks for characteristic functions", "link": "http://arxiv.org/abs/2401.04778", "description": "In this work, we provide a simulation algorithm to simulate from a\n(multivariate) characteristic function, which is only accessible in a black-box\nformat. We construct a generative neural network, whose loss function exploits\na specific representation of the Maximum-Mean-Discrepancy metric to directly\nincorporate the targeted characteristic function. The construction is universal\nin the sense that it is independent of the dimension and that it does not\nrequire any assumptions on the given characteristic function. Furthermore,\nfinite sample guarantees on the approximation quality in terms of the\nMaximum-Mean Discrepancy metric are derived. 
The method is illustrated in a\nshort simulation study."}, "http://arxiv.org/abs/2401.04797": {"title": "Principal Component Analysis for Equation Discovery", "link": "http://arxiv.org/abs/2401.04797", "description": "Principal Component Analysis (PCA) is one of the most commonly used\nstatistical methods for data exploration, and for dimensionality reduction\nwherein the first few principal components account for an appreciable\nproportion of the variability in the data. Less commonly, attention is paid to\nthe last principal components because they do not account for an appreciable\nproportion of variability. However, this defining characteristic of the last\nprincipal components also qualifies them as combinations of variables that are\nconstant across the cases. Such constant-combinations are important because\nthey may reflect underlying laws of nature. In situations involving a large\nnumber of noisy covariates, the underlying law may not correspond to the last\nprincipal component, but rather to one of the last. Consequently, a criterion\nis required to identify the relevant eigenvector. In this paper, two examples\nare employed to demonstrate the proposed methodology; one from Physics,\ninvolving a small number of covariates, and another from Meteorology wherein\nthe number of covariates is in the thousands. It is shown that with an\nappropriate selection criterion, PCA can be employed to ``discover\" Kepler's\nthird law (in the former), and the hypsometric equation (in the latter)."}, "http://arxiv.org/abs/2401.04832": {"title": "Group lasso priors for Bayesian accelerated failure time models with left-truncated and interval-censored data", "link": "http://arxiv.org/abs/2401.04832", "description": "An important task in health research is to characterize time-to-event\noutcomes such as disease onset or mortality in terms of a potentially\nhigh-dimensional set of risk factors. For example, prospective cohort studies\nof Alzheimer's disease typically enroll older adults for observation over\nseveral decades to assess the long-term impact of genetic and other factors on\ncognitive decline and mortality. The accelerated failure time model is\nparticularly well-suited to such studies, structuring covariate effects as\n`horizontal' changes to the survival quantiles that conceptually reflect shifts\nin the outcome distribution due to lifelong exposures. However, this modeling\ntask is complicated by the enrollment of adults at differing ages, and\nintermittent followup visits leading to interval censored outcome information.\nMoreover, genetic and clinical risk factors are not only high-dimensional, but\ncharacterized by underlying grouping structure, such as by function or gene\nlocation. Such grouped high-dimensional covariates require shrinkage methods\nthat directly acknowledge this structure to facilitate variable selection and\nestimation. In this paper, we address these considerations directly by\nproposing a Bayesian accelerated failure time model with a group-structured\nlasso penalty, designed for left-truncated and interval-censored time-to-event\ndata. We develop a custom Markov chain Monte Carlo sampler for efficient\nestimation, and investigate the impact of various methods of penalty tuning and\nthresholding for variable selection. 
We present a simulation study examining\nthe performance of this method relative to models with an ordinary lasso\npenalty, and apply the proposed method to identify groups of predictive genetic\nand clinical risk factors for Alzheimer's disease in the Religious Orders Study\nand Memory and Aging Project (ROSMAP) prospective cohort studies of AD and\ndementia."}, "http://arxiv.org/abs/2401.04841": {"title": "Analysis of Compositional Data with Positive Correlations among Components using a Nested Dirichlet Distribution with Application to a Morris Water Maze Experiment", "link": "http://arxiv.org/abs/2401.04841", "description": "In a typical Morris water maze experiment, a mouse is placed in a circular\nwater tank and allowed to swim freely until it finds a platform, triggering a\nroute of escape from the tank. For reference purposes, the tank is divided into\nfour quadrants: the target quadrant where the trigger to escape resides, the\nopposite quadrant to the target, and two adjacent quadrants. Several response\nvariables can be measured: the amount of time that a mouse spends in different\nquadrants of the water tank, the number of times the mouse crosses from one\nquadrant to another, or how quickly a mouse triggers an escape from the tank.\nWhen considering time within each quadrant, it is hypothesized that normal mice\nwill spend smaller amounts of time in quadrants that do not contain the escape\nroute, while mice with an acquired or induced mental deficiency will spend\nequal time in all quadrants of the tank. Clearly, the proportions of time in the\nquadrants must sum to one and are therefore statistically dependent; however,\nmost analyses of data from this experiment treat time in quadrants as\nstatistically independent. A recent paper introduced a hypothesis testing\nmethod that involves fitting such data to a Dirichlet distribution. While an\nimprovement over studies that ignore the compositional structure of the data,\nwe show that this methodology is flawed. We introduce a two-sample test to detect\ndifferences in the proportions of components for two independent groups where both\ngroups are from either a Dirichlet or nested Dirichlet distribution. This new\ntest is used to reanalyze the data from a previous study and come to a\ndifferent conclusion."}, "http://arxiv.org/abs/2401.04863": {"title": "Estimands and cumulative incidence function regression in clinical trials: some new results on interpretability and robustness", "link": "http://arxiv.org/abs/2401.04863", "description": "Regression analyses based on transformations of cumulative incidence\nfunctions are often adopted when modeling and testing for treatment effects in\nclinical trial settings involving competing and semi-competing risks. Common\nframeworks include the Fine-Gray model and models based on direct binomial\nregression. Using large sample theory we derive the limiting values of\ntreatment effect estimators based on such models when the data are generated\naccording to multiplicative intensity-based models, and show that the estimand\nis sensitive to several process features. The rejection rates of hypothesis\ntests based on cumulative incidence function regression models are also\nexamined for null hypotheses of different types, based on which a robustness\nproperty is established. In such settings supportive secondary analyses of\ntreatment effects are essential to ensure a full understanding of the nature of\ntreatment effects. 
An application to a palliative study of individuals with\nbreast cancer metastatic to bone is provided for illustration."}, "http://arxiv.org/abs/2401.05088": {"title": "Hybrid of node and link communities for graphon estimation", "link": "http://arxiv.org/abs/2401.05088", "description": "Networks serve as a tool used to examine the large-scale connectivity\npatterns in complex systems. Modelling their generative mechanism\nnonparametrically is often based on step-functions, such as the stochastic\nblock models. These models are capable of addressing two prominent topics in\nnetwork science: link prediction and community detection. However, such methods\noften have a resolution limit, making it difficult to separate small-scale\nstructures from noise. To arrive at a smoother representation of the network's\ngenerative mechanism, we explicitly trade variance for bias by smoothing blocks\nof edges based on stochastic equivalence. As such, we propose a different\nestimation method using a new model, which we call the stochastic shape model.\nTypically, analysis methods are based on modelling node or link communities. In\ncontrast, we take a hybrid approach, bridging the two notions of community.\nConsequently, we obtain a more parsimonious representation, enabling a more\ninterpretable and multiscale summary of the network structure. By considering\nmultiple resolutions, we trade bias and variance to ensure that our estimator\nis rate-optimal. We also examine the performance of our model through\nsimulations and applications to real network data."}, "http://arxiv.org/abs/2401.05124": {"title": "Nonparametric worst-case bounds for publication bias on the summary receiver operating characteristic curve", "link": "http://arxiv.org/abs/2401.05124", "description": "The summary receiver operating characteristic (SROC) curve has been\nrecommended as one important meta-analytical summary to represent the accuracy\nof a diagnostic test in the presence of heterogeneous cutoff values. However,\nselective publication of diagnostic studies for meta-analysis can induce\npublication bias (PB) on the estimate of the SROC curve. Several sensitivity\nanalysis methods have been developed to quantify PB on the SROC curve, and all\nthese methods utilize parametric selection functions to model the selective\npublication mechanism. The main contribution of this article is to propose a\nnew sensitivity analysis approach that derives the worst-case bounds for the\nSROC curve by adopting nonparametric selection functions under minimal\nassumptions. The estimation procedures of the worst-case bounds use the Monte\nCarlo method to obtain the SROC curves along with the corresponding area under\nthe curves in the worst case where the maximum possible PB under a range of\nmarginal selection probabilities is considered. We apply the proposed method to\na real-world meta-analysis to show that the worst-case bounds of the SROC\ncurves can provide useful insights for discussing the robustness of\nmeta-analytical findings on diagnostic test accuracy."}, "http://arxiv.org/abs/2401.05256": {"title": "Tests of Missing Completely At Random based on sample covariance matrices", "link": "http://arxiv.org/abs/2401.05256", "description": "We study the problem of testing whether the missing values of a potentially\nhigh-dimensional dataset are Missing Completely at Random (MCAR). 
We relax the\nproblem of testing MCAR to the problem of testing the compatibility of a\nsequence of covariance matrices, motivated by the fact that this procedure is\nfeasible when the dimension grows with the sample size. Tests of compatibility\ncan be used to test the feasibility of positive semi-definite matrix completion\nproblems with noisy observations, and thus our results may be of independent\ninterest. Our first contributions are to define a natural measure of the\nincompatibility of a sequence of correlation matrices, which can be\ncharacterised as the optimal value of a Semi-definite Programming (SDP)\nproblem, and to establish a key duality result allowing its practical\ncomputation and interpretation. By studying the concentration properties of the\nnatural plug-in estimator of this measure, we introduce novel hypothesis tests\nthat we prove have power against all distributions with incompatible covariance\nmatrices. The choice of critical values for our tests relies on a new\nconcentration inequality for the Pearson sample correlation matrix, which may\nbe of interest more widely. By considering key examples of missingness\nstructures, we demonstrate that our procedures are minimax rate optimal in\ncertain cases. We further validate our methodology with numerical simulations\nthat provide evidence of validity and power, even when data are heavy-tailed."}, "http://arxiv.org/abs/2401.05281": {"title": "Asymptotic expected sensitivity function and its applications to nonparametric correlation estimators", "link": "http://arxiv.org/abs/2401.05281", "description": "We introduce a new type of influence function, the asymptotic expected\nsensitivity function, which is often equivalent to but mathematically more\ntractable than the traditional one based on the Gateaux derivative. To\nillustrate, we study the robustness of some important rank correlations,\nincluding Spearman's and Kendall's correlations, and the recently developed\nChatterjee's correlation."}, "http://arxiv.org/abs/2401.05315": {"title": "Multi-resolution filters via linear projection for large spatio-temporal datasets", "link": "http://arxiv.org/abs/2401.05315", "description": "Advances in compact sensing devices mounted on satellites have facilitated\nthe collection of large spatio-temporal datasets with coordinates. Since such\ndatasets are often incomplete and noisy, it is useful to create the prediction\nsurface of a spatial field. To this end, we consider an online filtering\ninference by using the Kalman filter based on linear Gaussian state-space\nmodels. However, the Kalman filter is impractically time-consuming when the\nnumber of locations in spatio-temporal datasets is large. To address this\nproblem, we propose a multi-resolution filter via linear projection (MRF-lp), a\nfast computation method for online filtering inference. In the MRF-lp, by\ncarrying out a multi-resolution approximation via linear projection (MRA-lp),\nthe forecast covariance matrix can be approximated while capturing both the\nlarge- and small-scale spatial variations. As a result of this approximation,\nour proposed MRF-lp preserves a block-sparse structure of some matrices\nappearing in the MRF-lp through time, which leads to the scalability of this\nalgorithm. Additionally, we discuss extensions of the MRF-lp to a nonlinear and\nnon-Gaussian case. 
Simulation studies and real data analysis for total\nprecipitable water vapor demonstrate that our proposed approach performs well\ncompared with the related methods."}, "http://arxiv.org/abs/2401.05330": {"title": "Hierarchical Causal Models", "link": "http://arxiv.org/abs/2401.05330", "description": "Scientists often want to learn about cause and effect from hierarchical data,\ncollected from subunits nested inside units. Consider students in schools,\ncells in patients, or cities in states. In such settings, unit-level variables\n(e.g. each school's budget) may affect subunit-level variables (e.g. the test\nscores of each student in each school) and vice versa. To address causal\nquestions with hierarchical data, we propose hierarchical causal models, which\nextend structural causal models and causal graphical models by adding inner\nplates. We develop a general graphical identification technique for\nhierarchical causal models that extends do-calculus. We find many situations in\nwhich hierarchical data can enable causal identification even when it would be\nimpossible with non-hierarchical data, that is, if we had only unit-level\nsummaries of subunit-level variables (e.g. the school's average test score,\nrather than each student's score). We develop estimation techniques for\nhierarchical causal models, using methods including hierarchical Bayesian\nmodels. We illustrate our results in simulation and via a reanalysis of the\nclassic \"eight schools\" study."}, "http://arxiv.org/abs/2010.02848": {"title": "Unified Robust Estimation", "link": "http://arxiv.org/abs/2010.02848", "description": "Robust estimation is primarily concerned with providing reliable parameter\nestimates in the presence of outliers. Numerous robust loss functions have been\nproposed in regression and classification, along with various computing\nalgorithms. In modern penalised generalised linear models (GLM), however, there\nis limited research on robust estimation that can provide weights to determine\nthe outlier status of the observations. This article proposes a unified\nframework based on a large family of loss functions, a composite of concave and\nconvex functions (CC-family). Properties of the CC-family are investigated, and\nCC-estimation is innovatively conducted via the iteratively reweighted convex\noptimisation (IRCO), which is a generalisation of the iteratively reweighted\nleast squares in robust linear regression. For robust GLM, the IRCO becomes the\niteratively reweighted GLM. The unified framework contains penalised estimation\nand robust support vector machine and is demonstrated with a variety of data\napplications."}, "http://arxiv.org/abs/2010.09335": {"title": "Statistical Models for Repeated Categorical Ratings: The R Package rater", "link": "http://arxiv.org/abs/2010.09335", "description": "A common problem in many disciplines is the need to assign a set of items\ninto categories or classes with known labels. This is often done by one or more\nexpert raters, or sometimes by an automated process. If these assignments or\n`ratings' are difficult to make accurately, a common tactic is to repeat them\nby different raters, or even by the same rater multiple times on different\noccasions. We present an R package `rater`, available on CRAN, that implements\nBayesian versions of several statistical models for analysis of repeated\ncategorical rating data. Inference is possible for the true underlying (latent)\nclass of each item, as well as the accuracy of each rater. 
The models are\nextensions of, and include, the Dawid-Skene model, and we implemented them\nusing the Stan probabilistic programming language. We illustrate the use of\n`rater` through a few examples. We also discuss in detail the techniques of\nmarginalisation and conditioning, which are necessary for these models but also\napply more generally to other models implemented in Stan."}, "http://arxiv.org/abs/2204.10969": {"title": "Combining Doubly Robust Methods and Machine Learning for Estimating Average Treatment Effects for Observational Real-world Data", "link": "http://arxiv.org/abs/2204.10969", "description": "Observational cohort studies are increasingly being used for comparative\neffectiveness research to assess the safety of therapeutics. Recently, various\ndoubly robust methods have been proposed for average treatment effect\nestimation by combining the treatment model and the outcome model via different\nvehicles, such as matching, weighting, and regression. The key advantage of\ndoubly robust estimators is that they require either the treatment model or the\noutcome model to be correctly specified to obtain a consistent estimator of\naverage treatment effects, and therefore lead to a more accurate and often more\nprecise inference. However, little work has been done to understand how doubly\nrobust estimators differ due to their unique strategies of using the treatment\nand outcome models and how machine learning techniques can be combined to boost\ntheir performance. Here we examine multiple popular doubly robust methods and\ncompare their performance using different treatment and outcome modeling via\nextensive simulations and a real-world application. We found that incorporating\nmachine learning with doubly robust estimators such as the targeted maximum\nlikelihood estimator gives the best overall performance. Practical guidance on\nhow to apply doubly robust estimators is provided."}, "http://arxiv.org/abs/2205.00259": {"title": "cubble: An R Package for Organizing and Wrangling Multivariate Spatio-temporal Data", "link": "http://arxiv.org/abs/2205.00259", "description": "Multivariate spatio-temporal data refers to multiple measurements taken\nacross space and time. For many analyses, spatial and time components can be\nseparately studied: for example, to explore the temporal trend of one variable\nfor a single spatial location, or to model the spatial distribution of one\nvariable at a given time. However, for some studies, it is important to analyse\ndifferent aspects of the spatio-temporal data simultaneously, for instance\ntemporal trends of multiple variables across locations. In order to facilitate\nthe study of different portions or combinations of spatio-temporal data, we\nintroduce a new data structure, cubble, with a suite of functions enabling easy\nslicing and dicing of the different spatio-temporal components. The\nproposed cubble structure ensures that all the components of the data are easy\nto access and manipulate while providing flexibility for data analysis. In\naddition, cubble facilitates visual and numerical explorations of the data\nwhile easing data wrangling and modelling. The cubble structure and the\nfunctions provided in the cubble R package equip users with the capability to\nhandle hierarchical spatial and temporal structures. 
The cubble structure and\nthe tools implemented in the package are illustrated with different examples of\nAustralian climate data."}, "http://arxiv.org/abs/2301.08836": {"title": "Scalable Gaussian Process Inference with Stan", "link": "http://arxiv.org/abs/2301.08836", "description": "Gaussian processes (GPs) are sophisticated distributions to model functional\ndata. Whilst theoretically appealing, they are computationally cumbersome\nexcept for small datasets. We implement two methods for scaling GP inference in\nStan: First, a general sparse approximation using a directed acyclic dependency\ngraph; second, a fast, exact method for regularly spaced data modeled by GPs\nwith stationary kernels using the fast Fourier transform. Based on benchmark\nexperiments, we offer guidance for practitioners to decide between different\nmethods and parameterizations. We consider two real-world examples to\nillustrate the package. The implementation follows Stan's design and exposes\nperformant inference through a familiar interface. Full posterior inference for\nten thousand data points is feasible on a laptop in less than 20 seconds.\nDetails on how to get started using the popular interfaces cmdstanpy for Python\nand cmdstanr for R are provided."}, "http://arxiv.org/abs/2302.09103": {"title": "Multiple change-point detection for Poisson processes", "link": "http://arxiv.org/abs/2302.09103", "description": "The aim of change-point detection is to discover the changes in behavior that\nlie behind time sequence data. In this article, we study the case where the\ndata comes from an inhomogeneous Poisson process or a marked Poisson process.\nWe present a methodology for detecting multiple offline change-points based on\na minimum contrast estimator. In particular, we explain how to handle the\ncontinuous nature of the process with the available discrete observations. In\naddition, we select the appropriate number of regimes via a cross-validation\nprocedure which is really handy here due to the nature of the Poisson process.\nThrough experiments on simulated and real data sets, we demonstrate the\ninterest of the proposed method. The proposed method has been implemented in\nthe R package \\texttt{CptPointProcess}."}, "http://arxiv.org/abs/2303.00531": {"title": "Parameter estimation for a hidden linear birth and death process with immigration", "link": "http://arxiv.org/abs/2303.00531", "description": "In this paper, we use a linear birth and death process with immigration to\nmodel infectious disease propagation when contamination stems from both\nperson-to-person contact and contact with the environment. Our aim is to\nestimate the parameters of the process. The main originality and difficulty\ncome from the observation scheme. Counts of infected population are hidden.\nThe only data available are periodic cumulated new retired counts. Although\nvery common in epidemiology, this observation scheme is mathematically\nchallenging even for such a standard stochastic process. We first derive an\nanalytic expression of the unknown parameters as functions of well-chosen\ndiscrete time transition probabilities. Second, we extend and adapt the\nstandard Baum-Welch algorithm in order to estimate the said discrete time\ntransition probabilities in our hidden data framework. 
The performance of our\nestimators is illustrated both on synthetic data and real data of typhoid fever\nin Mayotte."}, "http://arxiv.org/abs/2306.04836": {"title": "$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control", "link": "http://arxiv.org/abs/2306.04836", "description": "In this paper, we propose a novel $K$-nearest neighbor resampling procedure\nfor estimating the performance of a policy from historical data containing\nrealized episodes of a decision process generated under a different policy. We\nprovide statistical consistency results under weak conditions. In particular,\nwe avoid the common assumption of identically and independently distributed\ntransitions and rewards. Instead, our analysis allows for the sampling of\nentire episodes, as is common practice in most applications. To establish the\nconsistency in this setting, we generalize Stone's Theorem, a well-known result\nin nonparametric statistics on local averaging, to include episodic data and\nthe counterfactual estimation underlying off-policy evaluation (OPE). By\nfocusing on feedback policies that depend deterministically on the current\nstate in environments with continuous state-action spaces and system-inherent\nstochasticity effected by chosen actions, and relying on trajectory simulation\nsimilar to Monte Carlo methods, the proposed method is particularly well suited\nfor stochastic control environments. Compared to other OPE methods, our\nalgorithm does not require optimization, can be efficiently implemented via\ntree-based nearest neighbor search and parallelization, and does not explicitly\nassume a parametric model for the environment's dynamics. Numerical experiments\ndemonstrate the effectiveness of the algorithm compared to existing baselines\nin a variety of stochastic control settings, including a linear quadratic\nregulator, trade execution in limit order books, and online stochastic bin\npacking."}, "http://arxiv.org/abs/2401.05343": {"title": "Spectral Topological Data Analysis of Brain Signals", "link": "http://arxiv.org/abs/2401.05343", "description": "Topological data analysis (TDA) has become a powerful approach over the last\ntwenty years, mainly due to its ability to capture the shape and the geometry\ninherent in the data. Persistence homology, which is a particular tool in TDA,\nhas been demonstrated to be successful in analyzing functional brain\nconnectivity. One limitation of standard approaches is that they use\narbitrarily chosen threshold values for analyzing connectivity matrices. To\novercome this weakness, TDA provides a filtration of the weighted brain network\nacross a range of threshold values. However, current analyses of the\ntopological structure of functional brain connectivity primarily rely on overly\nsimplistic connectivity measures, such as the Pearson correlation. These\nmeasures do not provide information about the specific oscillators that drive\ndependence within the brain network. Here, we develop a frequency-specific\napproach that utilizes coherence, a measure of dependence in the spectral\ndomain, to evaluate the functional connectivity of the brain. Our approach, the\nspectral TDA (STDA), has the ability to capture more nuanced and detailed\ninformation about the underlying brain networks. The proposed STDA method leads\nto a novel topological summary, the spectral landscape, which is a\n2D-generalization of the persistence landscape. 
Using the novel spectral\nlandscape, we analyze the EEG brain connectivity of patients with attention\ndeficit hyperactivity disorder (ADHD) and shed light on the frequency-specific\ndifferences in the topology of brain connectivity between the controls and ADHD\npatients."}, "http://arxiv.org/abs/2401.05414": {"title": "On the Three Demons in Causality in Finance: Time Resolution, Nonstationarity, and Latent Factors", "link": "http://arxiv.org/abs/2401.05414", "description": "Financial data is generally time series in essence and thus suffers from\nthree fundamental issues: the mismatch in time resolution, the time-varying\nproperty of the distribution - nonstationarity, and causal factors that are\nimportant but unknown/unobserved. In this paper, we follow a causal perspective\nto systematically look into these three demons in finance. Specifically, we\nreexamine these issues in the context of causality, which gives rise to a novel\nand inspiring understanding of how the issues can be addressed. Following this\nperspective, we provide systematic solutions to these problems, which hopefully\nwould serve as a foundation for future research in the area."}, "http://arxiv.org/abs/2401.05466": {"title": "Multidimensional Scaling for Interval Data: INTERSCAL", "link": "http://arxiv.org/abs/2401.05466", "description": "Standard multidimensional scaling takes as input a dissimilarity matrix of\ngeneral term $\\delta _{ij}$ which is a numerical value. In this paper we input\n$\\delta _{ij}=[\\underline{\\delta _{ij}},\\overline{\\delta _{ij}}]$ where\n$\\underline{\\delta _{ij}}$ and $\\overline{\\delta _{ij}}$ are the lower bound\nand the upper bound of the ``dissimilarity'' between the stimulus/object $S_i$\nand the stimulus/object $S_j$ respectively. As output, instead of representing\neach stimulus/object on a factorial plane by a point, as in other\nmultidimensional scaling methods, in the proposed method each stimulus/object\nis visualized by a rectangle, in order to represent dissimilarity variation. We\ngeneralize the classical scaling method looking for a method that produces\nresults similar to those obtained by Tops Principal Components Analysis. Two\nexamples are presented to illustrate the effectiveness of the proposed method."}, "http://arxiv.org/abs/2401.05471": {"title": "Shrinkage linear regression for symbolic interval-valued variables", "link": "http://arxiv.org/abs/2401.05471", "description": "This paper proposes a new approach to fit a linear regression for symbolic\ninterval-valued variables, which improves both the Center Method suggested by\nBillard and Diday in \\cite{BillardDiday2000} and the Center and Range Method\nsuggested by Lima-Neto, E.A. and De Carvalho, F.A.T. in \\cite{Lima2008,\nLima2010}. As in the Center Method and the Center and Range Method, the new\nmethods proposed fit the linear regression model on the midpoints and, as an\nadditional variable, on the half-lengths of the intervals (ranges) assumed\nby the predictor variables in the training data set; however, these fits are\nobtained with the Ridge Regression, Lasso, and Elastic Net methods\nproposed by Tibshirani, R., Hastie, T., and Zou, H. in \\cite{Tib1996,\nHastieZou2005}. The prediction of the lower and upper bounds of the interval\nresponse (dependent) variable is carried out from their midpoints and ranges,\nwhich are estimated from the linear regression models with shrinkage generated\nin the midpoints and the ranges of the interval-valued predictors. 
Methods\npresented in this document are applied to three real data sets: the cardiologic\ninterval data set, the Prostate interval data set and the US Murder interval data set,\nin order to compare their performance and ease of interpretation with those of the\nCenter Method and the Center and Range Method. For this evaluation, the\nroot-mean-squared error and the correlation coefficient are used. In addition, the\nreader may use all the methods presented herein and verify the results using\nthe {\\tt RSDA} package written in the {\\tt R} language, which can be downloaded and\ninstalled directly from {\\tt CRAN} \\cite{Rod2014}."}, "http://arxiv.org/abs/2401.05472": {"title": "INTERSTATIS: The STATIS method for interval valued data", "link": "http://arxiv.org/abs/2401.05472", "description": "The STATIS method, proposed by L'Hermier des Plantes and Escoufier, is used\nto analyze multiple data tables in which it is very common that each of the tables\nhas information concerning the same set of individuals. The differences and\nsimilitudes between said tables are analyzed by means of a structure called the\n\\emph{compromise}. In this paper we present a new algorithm for applying the\nSTATIS method when the input consists of interval data. This proposal is based\non Moore's interval arithmetic and the Centers Method for Principal Component\nAnalysis with interval data, proposed by Cazes et al. \\cite{cazes1997}. In\naddition to presenting the INTERSTATIS method in an algorithmic way, an\nexecution example is shown, alongside the interpretation of its results."}, "http://arxiv.org/abs/2401.05473": {"title": "Pyramidal Clustering Algorithms in ISO-3D Project", "link": "http://arxiv.org/abs/2401.05473", "description": "The pyramidal clustering method generalizes hierarchies by allowing non-disjoint\nclasses at a given level instead of a partition. Moreover, the clusters of the\npyramid are intervals of a total order on the set being clustered. [Diday\n1984], [Bertrand, Diday 1990] and [Mfoumoune 1998] proposed algorithms to build\na pyramid starting with an arbitrary order of the individuals. In this paper we\npresent two new algorithms named {\\tt CAPS} and {\\tt CAPSO}. {\\tt CAPSO} builds\na pyramid starting with an order given on the set of the individuals (or\nsymbolic objects) while {\\tt CAPS} finds this order. Moreover, these two algorithms\nallow the clustering of more complex data than the tabular model can\nprocess by considering variation in the values taken by the variables; in this\nway, our method produces a symbolic pyramid. Each cluster thus formed is\ndefined not only by the set of its elements (i.e. its extent) but also by a\nsymbolic object, which describes its properties (i.e. its intent). These two\nalgorithms were implemented in C++ and Java for the ISO-3D project."}, "http://arxiv.org/abs/2401.05556": {"title": "Assessing High-Order Links in Cardiovascular and Respiratory Networks via Static and Dynamic Information Measures", "link": "http://arxiv.org/abs/2401.05556", "description": "The network representation is becoming increasingly popular for the\ndescription of cardiovascular interactions based on the analysis of multiple\nsimultaneously collected variables. However, the traditional methods to assess\nnetwork links based on pairwise interaction measures cannot reveal high-order\neffects involving more than two nodes, and are not appropriate to infer the\nunderlying network topology. 
To address these limitations, here we introduce a\nframework which combines the assessment of high-order interactions with\nstatistical inference for the characterization of the functional links\nsustaining physiological networks. The framework develops information-theoretic\nmeasures quantifying how two nodes interact in a redundant or synergistic way\nwith the rest of the network, and employs these measures for reconstructing the\nfunctional structure of the network. The measures are implemented for both\nstatic and dynamic networks mapped respectively by random variables and random\nprocesses using plug-in and model-based entropy estimators. The validation on\ntheoretical and numerically simulated networks documents the ability of the\nframework to represent high-order interactions as networks and to detect\nstatistical structures associated with cascade, common drive and common target\neffects. The application to cardiovascular networks mapped by the beat-to-beat\nvariability of heart rate, respiration, arterial pressure, cardiac output and\nvascular resistance allowed noninvasive characterization of several mechanisms\nof cardiovascular control operating in resting state and during orthostatic\nstress. Our approach leads to a new comprehensive assessment of physiological\ninteractions and complements existing strategies for the classification of\npathophysiological states."}, "http://arxiv.org/abs/2401.05728": {"title": "A General Method for Resampling Autocorrelated Spatial Data", "link": "http://arxiv.org/abs/2401.05728", "description": "Comparing spatial data sets is a ubiquitous task in data analysis; however,\nthe presence of spatial autocorrelation means that standard estimates of\nvariance will be wrong and tend to over-estimate the statistical significance\nof correlations and other observations. While there are a number of existing\napproaches to this problem, none are ideal, requiring either detailed analytical\ncalculations, which are hard to generalise, or detailed knowledge of the data\ngenerating process, which may not be available. In this work we propose a\nresampling approach based on Tobler's Law. By resampling the data with fixed\nspatial autocorrelation, measured by Moran's I, we generate a more realistic\nnull model. Testing on real and synthetic data, we find that, as long as the\nspatial autocorrelation is not too strong, this approach works just as well as\nif we knew the data generating process."}, "http://arxiv.org/abs/2401.05817": {"title": "Testing for similarity of multivariate mixed outcomes using generalised joint regression models with application to efficacy-toxicity responses", "link": "http://arxiv.org/abs/2401.05817", "description": "A common problem in clinical trials is to test whether the effect of an\nexplanatory variable on a response of interest is similar between two groups,\ne.g. patient or treatment groups. In this regard, similarity is defined as\nequivalence up to a pre-specified threshold that denotes an acceptable\ndeviation between the two groups. This issue is typically tackled by assessing\nif the explanatory variable's effect on the response is similar. This\nassessment is based on, for example, confidence intervals of differences or a\nsuitable distance between two parametric regression models. Typically, these\napproaches build on the assumption of a univariate continuous or binary outcome\nvariable. However, multivariate outcomes, especially beyond the case of\nbivariate binary response, remain underexplored. 
This paper introduces an\napproach based on a generalised joint regression framework exploiting the\nGaussian copula. Compared to existing methods, our approach accommodates\nvarious outcome variable scales, such as continuous, binary, categorical, and\nordinal, including mixed outcomes in multi-dimensional spaces. We demonstrate\nthe validity of this approach through a simulation study and an\nefficacy-toxicity case study, hence highlighting its practical relevance."}, "http://arxiv.org/abs/2401.05839": {"title": "Modelling physical activity profiles in COPD patients: a fully functional approach to variable domain functional regression models", "link": "http://arxiv.org/abs/2401.05839", "description": "Physical activity plays a significant role in the well-being of individuals\nwith Chronic Obstructive Pulmonary Disease (COPD). Specifically, it has been\ndirectly associated with changes in hospitalization rates for these patients.\nHowever, previous investigations have primarily been conducted in a\ncross-sectional or longitudinal manner and have not considered a continuous\nperspective. Using telemonitoring data from the telEPOC program, we analyze\nthe impact of physical activity adopting a functional data approach.\nTraditional functional data methods, including functional regression models,\ntypically assume a consistent data domain. However, the data in the telEPOC\nprogram exhibit variable domains, presenting a challenge since the majority of\nfunctional data methods are based on the fact that data are observed in the\nsame domain. To address this challenge, we introduce a novel fully functional\nmethodology tailored to variable domain functional data, eliminating the need\nfor data alignment, which can be computationally taxing. Although models\ndesigned for variable domain data are relatively scarce and may have inherent\nlimitations in their estimation methods, our approach circumvents these issues.\nWe substantiate the effectiveness of our methodology through a simulation\nstudy, comparing our results with those obtained using established\nmethodologies. Finally, we apply our methodology to analyze the impact of\nphysical activity in COPD patients using the telEPOC program's data. Software\nfor our method is available in the form of R code on request at\n\\url{https://github.com/Pavel-Hernadez-Amaro/V.D.F.R.M-new-estimation-approach.git}."}, "http://arxiv.org/abs/2401.05905": {"title": "Feasible pairwise pseudo-likelihood inference on spatial regressions in irregular lattice grids: the KD-T PL algorithm", "link": "http://arxiv.org/abs/2401.05905", "description": "Spatial regression models are central to the field of spatial statistics.\nNevertheless, their estimation in the case of large and irregularly gridded spatial\ndatasets presents considerable computational challenges. To tackle these\ncomputational problems, Arbia \\citep{arbia_2014_pairwise} introduced a\npseudo-likelihood approach (called pairwise likelihood, say PL) which required\nthe identification of pairs of observations that are internally correlated, but\nmutually conditionally uncorrelated. However, while the PL estimators enjoy\noptimal theoretical properties, their practical implementation when dealing\nwith data observed on irregular grids suffers from dramatic computational\nissues (connected with the identification of the pairs of observations) that,\nin most empirical cases, negatively counter-balance its advantages. 
In this\npaper we introduce an algorithm specifically designed to streamline the\ncomputation of the PL in large and irregularly gridded spatial datasets,\ndramatically simplifying the estimation phase. In particular, we focus on the\nestimation of Spatial Error models (SEM). Our proposed approach efficiently\npairs spatial couples by exploiting the KD tree data structure and uses it to\nderive closed-form expressions for fast parameter approximation. To\nshowcase the efficiency of our method, we provide an illustrative example using\nsimulated data, demonstrating that the computational advantages compared to a\nfull likelihood inference do not come at the expense of accuracy."}, "http://arxiv.org/abs/2401.06082": {"title": "Borrowing from historical control data in a Bayesian time-to-event model with flexible baseline hazard function", "link": "http://arxiv.org/abs/2401.06082", "description": "There is currently a focus on statistical methods which can use historical\ntrial information to help accelerate the discovery, development and delivery of\nmedicine. Bayesian methods can be constructed so that the borrowing is\n\"dynamic\" in the sense that the similarity of the data helps to determine how\nmuch information is used. In the time-to-event setting with one historical data\nset, a popular model for a range of baseline hazards is the piecewise\nexponential model where the time points are fixed and a borrowing structure is\nimposed on the model. Although convenient for implementation, this approach\naffects the borrowing capability of the model. We propose a Bayesian model\nwhich allows the time points to vary and a dependency to be placed between the\nbaseline hazards. This serves to smooth the posterior baseline hazard improving\nboth model estimation and borrowing characteristics. We explore a variety of\nprior structures for the borrowing within our proposed model and assess their\nperformance against established approaches. We demonstrate that this leads to\nimproved type I error in the presence of prior data conflict and increased\npower. We have developed accompanying software which is freely available and\nenables easy implementation of the approach."}, "http://arxiv.org/abs/2401.06091": {"title": "A Closer Look at AUROC and AUPRC under Class Imbalance", "link": "http://arxiv.org/abs/2401.06091", "description": "In machine learning (ML), a widespread adage is that the area under the\nprecision-recall curve (AUPRC) is a superior metric for model comparison to the\narea under the receiver operating characteristic (AUROC) for binary\nclassification tasks with class imbalance. This paper challenges this notion\nthrough novel mathematical analysis, illustrating that AUROC and AUPRC can be\nconcisely related in probabilistic terms. We demonstrate that AUPRC, contrary\nto popular belief, is not superior in cases of class imbalance and might even\nbe a harmful metric, given its inclination to unduly favor model improvements\nin subpopulations with more frequent positive labels. This bias can\ninadvertently heighten algorithmic disparities. Prompted by these insights, a\nthorough review of existing ML literature was conducted, utilizing large\nlanguage models to analyze over 1.5 million papers from arXiv. Our\ninvestigation focused on the prevalence and substantiation of the purported\nAUPRC superiority. The results expose a significant deficit in empirical\nbacking and a trend of misattributions that have fuelled the widespread\nacceptance of AUPRC's supposed advantages. 
Our findings represent a dual\ncontribution: a significant technical advancement in understanding metric\nbehaviors and a stark warning about unchecked assumptions in the ML community.\nAll experiments are accessible at\nhttps://github.com/mmcdermott/AUC_is_all_you_need."}, "http://arxiv.org/abs/2011.04168": {"title": "Likelihood Inference for Possibly Non-Stationary Processes via Adaptive Overdifferencing", "link": "http://arxiv.org/abs/2011.04168", "description": "We make an observation that facilitates exact likelihood-based inference for\nthe parameters of the popular ARFIMA model without requiring stationarity by\nallowing the upper bound $\\bar{d}$ for the memory parameter $d$ to exceed\n$0.5$. We observe that estimating the parameters of a single non-stationary\nARFIMA model is equivalent to estimating the parameters of a sequence of\nstationary ARFIMA models, which allows for the use of existing methods for\nevaluating the likelihood for an invertible and stationary ARFIMA model. This\nenables improved inference because many standard methods perform poorly when\nestimates are close to the boundary of the parameter space. It also allows us\nto leverage the wealth of likelihood approximations that have been introduced\nfor estimating the parameters of a stationary process. We explore how\nestimation of the memory parameter $d$ depends on the upper bound $\\bar{d}$ and\nintroduce adaptive procedures for choosing $\\bar{d}$. Via simulations, we\nexamine the performance of our adaptive procedures for estimating the memory\nparameter when the true value is as large as $2.5$. Our adaptive procedures\nestimate the memory parameter well, can be used to obtain confidence intervals\nfor the memory parameter that achieve nominal coverage rates, and perform\nfavorably relative to existing alternatives."}, "http://arxiv.org/abs/2203.10118": {"title": "Bayesian Structural Learning with Parametric Marginals for Count Data: An Application to Microbiota Systems", "link": "http://arxiv.org/abs/2203.10118", "description": "High dimensional and heterogeneous count data are collected in various\napplied fields. In this paper, we look closely at high-resolution sequencing\ndata on the microbiome, which have enabled researchers to study the genomes of\nentire microbial communities. Revealing the underlying interactions between\nthese communities is of vital importance to learn how microbes influence human\nhealth. To perform structural learning from multivariate count data such as\nthese, we develop a novel Gaussian copula graphical model with two key\nelements. Firstly, we employ parametric regression to characterize the marginal\ndistributions. This step is crucial for accommodating the impact of external\ncovariates. Neglecting this adjustment could potentially introduce distortions\nin the inference of the underlying network of dependences. Secondly, we advance\na Bayesian structure learning framework, based on a computationally efficient\nsearch algorithm that is suited to high dimensionality. The approach returns\nsimultaneous inference of the marginal effects and of the dependence structure,\nincluding graph uncertainty estimates. A simulation study and a real data\nanalysis of microbiome data highlight the applicability of the proposed\napproach at inferring networks from multivariate count data in general, and its\nrelevance to microbiome analyses in particular. 
The proposed method is\nimplemented in the R package BDgraph."}, "http://arxiv.org/abs/2207.02986": {"title": "fabisearch: A Package for Change Point Detection in and Visualization of the Network Structure of Multivariate High-Dimensional Time Series in R", "link": "http://arxiv.org/abs/2207.02986", "description": "Change point detection is a commonly used technique in time series analysis,\ncapturing the dynamic nature in which many real-world processes function. With\nthe ever-increasing troves of multivariate high-dimensional time series data,\nespecially in neuroimaging and finance, there is a clear need for scalable and\ndata-driven change point detection methods. Currently, change point detection\nmethods for multivariate high-dimensional data are scarce, with even fewer\navailable in high-level, easily accessible software packages. To this end, we\nintroduce the R package fabisearch, available on the Comprehensive R Archive\nNetwork (CRAN), which implements the factorized binary search (FaBiSearch)\nmethodology. FaBiSearch is a novel statistical method for detecting change\npoints in the network structure of multivariate high-dimensional time series\nwhich employs non-negative matrix factorization (NMF), an unsupervised\ndimension reduction and clustering technique. Given the high computational cost\nof NMF, we implement the method in C++ code and use parallelization to reduce\ncomputation time. Further, we utilize a new binary search algorithm to\nefficiently identify multiple change points and provide a new method for\nnetwork estimation for data between change points. We show the functionality of\nthe package and the practicality of the method by applying it to a neuroimaging\nand a finance data set. Lastly, we provide an interactive, 3-dimensional,\nbrain-specific network visualization capability in a flexible, stand-alone\nfunction. This function can be conveniently used with any node coordinate\natlas, and nodes can be color coded according to community membership (if\napplicable). The output is an elegantly displayed network laid over a cortical\nsurface, which can be rotated in the 3-dimensional space."}, "http://arxiv.org/abs/2302.02718": {"title": "A Log-Linear Non-Parametric Online Changepoint Detection Algorithm based on Functional Pruning", "link": "http://arxiv.org/abs/2302.02718", "description": "Online changepoint detection aims to detect anomalies and changes in\nreal-time in high-frequency data streams, sometimes with limited available\ncomputational resources. This is an important task that is rooted in many\nreal-world applications, including but not limited to cybersecurity, medicine,\nand astrophysics. While fast and efficient online algorithms have been recently\nintroduced, these rely on parametric assumptions which are often violated in\npractical applications. Motivated by data streams from the telecommunications\nsector, we build a flexible nonparametric approach to detect a change in the\ndistribution of a sequence. Our procedure, NP-FOCuS, builds a sequential\nlikelihood ratio test for a change in a set of points of the empirical\ncumulative distribution function of our data. This is achieved by keeping track of\nthe number of observations above or below those points. Thanks to functional\npruning ideas, NP-FOCuS has a computational cost that is log-linear in the\nnumber of observations and is suitable for high-frequency data streams. 
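To convey the flavour of monitoring counts above and below a fixed set of reference points of the empirical distribution, here is a deliberately simplified sketch (ours, not the NP-FOCuS algorithm: it has no functional pruning and fixes a post-change alternative rather than maximising over it) that runs a two-sided Bernoulli CUSUM at each tracked quantile.

```python
# Simplified sketch of the "track counts above/below reference points" idea:
# a Bernoulli CUSUM at each tracked quantile of a calibration sample.
# Illustration only; the assumed shift size `delta` and threshold are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 2000)                  # pre-change calibration data
probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])   # tracked ECDF levels
qs = np.quantile(train, probs)                  # corresponding reference points
delta = 0.1                                     # assumed shift in P(X <= q)

def llr(b, p0, p1):
    # log-likelihood ratio of Bernoulli(p1) vs Bernoulli(p0) for outcome b
    return b * np.log(p1 / p0) + (1 - b) * np.log((1 - p1) / (1 - p0))

p_up = np.clip(probs + delta, 1e-6, 1 - 1e-6)
p_dn = np.clip(probs - delta, 1e-6, 1 - 1e-6)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])
S_up = np.zeros(len(qs))
S_dn = np.zeros(len(qs))
threshold = 10.0
for t, x in enumerate(stream, start=1):
    b = (x <= qs).astype(float)                 # indicators "below each point"
    S_up = np.maximum(0.0, S_up + llr(b, probs, p_up))
    S_dn = np.maximum(0.0, S_dn + llr(b, probs, p_dn))
    if max(S_up.max(), S_dn.max()) > threshold:
        print(f"change flagged at observation {t}")
        break
```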
In\nterms of detection power, NP-FOCuS is seen to outperform current nonparametric\nonline changepoint techniques in a variety of settings. We demonstrate the\nutility of the procedure on both simulated and real data."}, "http://arxiv.org/abs/2304.10005": {"title": "Prediction under interventions: evaluation of counterfactual performance using longitudinal observational data", "link": "http://arxiv.org/abs/2304.10005", "description": "Predictions under interventions are estimates of what a person's risk of an\noutcome would be if they were to follow a particular treatment strategy, given\ntheir individual characteristics. Such predictions can give important input to\nmedical decision making. However, evaluating predictive performance of\ninterventional predictions is challenging. Standard ways of evaluating\npredictive performance do not apply when using observational data, because\nprediction under interventions involves obtaining predictions of the outcome\nunder conditions that are different to those that are observed for a subset of\nindividuals in the validation dataset. This work describes methods for\nevaluating counterfactual performance of predictions under interventions for\ntime-to-event outcomes. This means we aim to assess how well predictions would\nmatch the validation data if all individuals had followed the treatment\nstrategy under which predictions are made. We focus on counterfactual\nperformance evaluation using longitudinal observational data, and under\ntreatment strategies that involve sustaining a particular treatment regime over\ntime. We introduce an estimation approach using artificial censoring and\ninverse probability weighting which involves creating a validation dataset that\nmimics the treatment strategy under which predictions are made. We extend\nmeasures of calibration, discrimination (c-index and cumulative/dynamic AUCt)\nand overall prediction error (Brier score) to allow assessment of\ncounterfactual performance. The methods are evaluated using a simulation study,\nincluding scenarios in which the methods should detect poor performance.\nApplying our methods in the context of liver transplantation shows that our\nprocedure allows quantification of the performance of predictions supporting\ncrucial decisions on organ allocation."}, "http://arxiv.org/abs/2309.01334": {"title": "Average treatment effect on the treated, under lack of positivity", "link": "http://arxiv.org/abs/2309.01334", "description": "The use of propensity score (PS) methods has become ubiquitous in causal\ninference. At the heart of these methods is the positivity assumption.\nViolation of the positivity assumption leads to the presence of extreme PS\nweights when estimating average causal effects of interest, such as the average\ntreatment effect (ATE) or the average treatment effect on the treated (ATT),\nwhich renders the related statistical inference invalid. To circumvent this issue,\ntrimming or truncating the extreme estimated PSs has been widely used.\nHowever, these methods require that we specify a priori a threshold and\nsometimes an additional smoothing parameter. While there are a number of\nmethods dealing with the lack of positivity when estimating the ATE,\nsurprisingly little effort has been devoted to the same issue for the ATT. In this paper, we first\nreview widely used methods, such as trimming and truncation, for the ATT. We\nemphasize the underlying intuition behind these methods to better understand\ntheir applications and highlight their main limitations. 
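To make the trimming discussion concrete, a small self-contained sketch of ATT estimation by propensity-score weighting with a hard trimming rule follows (our toy illustration on simulated data with an arbitrarily chosen threshold; it is not code from the paper).

```python
# Toy sketch of ATT estimation by propensity-score weighting with trimming.
# Treated units get weight 1; controls get e(x)/(1 - e(x)); units with extreme
# estimated propensity scores are discarded (the a priori threshold the text
# refers to). Illustration only, on simulated data where the true ATT is 2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 3))
ps_true = 1 / (1 + np.exp(-(1.5 * x[:, 0] - x[:, 1])))
a = rng.binomial(1, ps_true)                                        # treatment
y = x @ np.array([1.0, 0.5, -0.5]) + 2.0 * a + rng.normal(size=n)   # outcome

e_hat = LogisticRegression(max_iter=1000).fit(x, a).predict_proba(x)[:, 1]

alpha = 0.05                                       # trimming threshold (a priori)
keep = (e_hat > alpha) & (e_hat < 1 - alpha)
w = np.where(a == 1, 1.0, e_hat / (1 - e_hat))     # ATT weights
w, yk, ak = w[keep], y[keep], a[keep]

att = (np.average(yk[ak == 1], weights=w[ak == 1])
       - np.average(yk[ak == 0], weights=w[ak == 0]))
print("ATT estimate after trimming:", round(att, 3))
```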
Then, we argue that\nthe current methods simply target estimands that are scaled ATT (and thus move\nthe goalpost to a different target of interest), where we specify the scale and\nthe target populations. We further propose a PS weight-based alternative for\nthe average causal effect on the treated, called overlap weighted average\ntreatment effect on the treated (OWATT). The appeal of our proposed method lies\nin its ability to obtain similar or even better results than trimming and\ntruncation while relaxing the constraint to choose a priori a threshold (or\neven specify a smoothing parameter). The performance of the proposed method is\nillustrated via a series of Monte Carlo simulations and a data analysis on\nracial disparities in health care expenditures."}, "http://arxiv.org/abs/2401.06261": {"title": "Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables", "link": "http://arxiv.org/abs/2401.06261", "description": "Multivariate Mendelian randomization (MVMR) is a statistical technique that\nuses sets of genetic instruments to estimate the direct causal effects of\nmultiple exposures on an outcome of interest. At genomic loci with pleiotropic\ngene regulatory effects, that is, loci where the same genetic variants are\nassociated to multiple nearby genes, MVMR can potentially be used to predict\ncandidate causal genes. However, consensus in the field dictates that the\ngenetic instruments in MVMR must be independent, which is usually not possible\nwhen considering a group of candidate genes from the same locus.\n\nWe used causal inference theory to show that MVMR with correlated instruments\nsatisfies the instrumental set condition. This is a classical result by Brito\nand Pearl (2002) for structural equation models that guarantees the\nidentifiability of causal effects in situations where multiple exposures\ncollectively, but not individually, separate a set of instrumental variables\nfrom an outcome variable. Extensive simulations confirmed the validity and\nusefulness of these theoretical results even at modest sample sizes.\nImportantly, the causal effect estimates remain unbiased and their variance\nsmall when instruments are highly correlated.\n\nWe applied MVMR with correlated instrumental variable sets at risk loci from\ngenome-wide association studies (GWAS) for coronary artery disease using eQTL\ndata from the STARNET study. Our method predicts causal genes at twelve loci,\neach associated with multiple colocated genes in multiple tissues. However, the\nextensive degree of regulatory pleiotropy across tissues and the limited number\nof causal variants in each locus still require that MVMR is run on a\ntissue-by-tissue basis, and testing all gene-tissue pairs at a given locus in a\nsingle model to predict causal gene-tissue combinations remains infeasible."}, "http://arxiv.org/abs/2401.06347": {"title": "Diagnostics for Regression Models with Semicontinuous Outcomes", "link": "http://arxiv.org/abs/2401.06347", "description": "Semicontinuous outcomes commonly arise in a wide variety of fields, such as\ninsurance claims, healthcare expenditures, rainfall amounts, and alcohol\nconsumption. Regression models, including Tobit, Tweedie, and two-part models,\nare widely employed to understand the relationship between semicontinuous\noutcomes and covariates. 
Given the potential detrimental consequences of model\nmisspecification, after fitting a regression model, it is of prime importance\nto check the adequacy of the model. However, due to the point mass at zero,\nstandard diagnostic tools for regression models (e.g., deviance and Pearson\nresiduals) are not informative for semicontinuous data. To bridge this gap, we\npropose a new type of residuals for semicontinuous outcomes that are applicable\nto general regression models. Under the correctly specified model, the proposed\nresiduals converge to being uniformly distributed, and when the model is\nmisspecified, they significantly depart from this pattern. In addition to\nin-sample validation, the proposed methodology can also be employed to evaluate\npredictive distributions. We demonstrate the effectiveness of the proposed tool\nusing health expenditure data from the US Medical Expenditure Panel Survey."}, "http://arxiv.org/abs/2401.06348": {"title": "A Fully Bayesian Approach for Comprehensive Mapping of Magnitude and Phase Brain Activation in Complex-Valued fMRI Data", "link": "http://arxiv.org/abs/2401.06348", "description": "Functional magnetic resonance imaging (fMRI) plays a crucial role in\nneuroimaging, enabling the exploration of brain activity through complex-valued\nsignals. These signals, composed of magnitude and phase, offer a rich source of\ninformation for understanding brain functions. Traditional fMRI analyses have\nlargely focused on magnitude information, often overlooking the potential\ninsights offered by phase data. In this paper, we propose a novel fully\nBayesian model designed for analyzing single-subject complex-valued fMRI\n(cv-fMRI) data. Our model, which we refer to as the CV-M&P model, is\ndistinctive in its comprehensive utilization of both magnitude and phase\ninformation in fMRI signals, allowing for independent prediction of different\ntypes of activation maps. We incorporate Gaussian Markov random fields (GMRFs)\nto capture spatial correlations within the data, and employ image partitioning\nand parallel computation to enhance computational efficiency. Our model is\nrigorously tested through simulation studies, and then applied to a real\ndataset from a unilateral finger-tapping experiment. The results demonstrate\nthe model's effectiveness in accurately identifying brain regions activated in\nresponse to specific tasks, distinguishing between magnitude and phase\nactivation."}, "http://arxiv.org/abs/2401.06350": {"title": "Optimal estimation of the null distribution in large-scale inference", "link": "http://arxiv.org/abs/2401.06350", "description": "The advent of large-scale inference has spurred reexamination of conventional\nstatistical thinking. In a Gaussian model for $n$ many $z$-scores with at most\n$k < \\frac{n}{2}$ nonnulls, Efron suggests estimating the location and scale\nparameters of the null distribution. Placing no assumptions on the nonnull\neffects, the statistical task can be viewed as a robust estimation problem.\nHowever, the best known robust estimators fail to be consistent in the regime\n$k \\asymp n$ which is especially relevant in large-scale inference. The failure\nof estimators which are minimax rate-optimal with respect to other formulations\nof robustness (e.g. Huber's contamination model) might suggest the\nimpossibility of consistent estimation in this regime and, consequently, a\nmajor weakness of Efron's suggestion. A sound evaluation of Efron's model thus\nrequires a complete understanding of consistency. 
We sharply characterize the\nregime of $k$ for which consistent estimation is possible and further establish\nthe minimax estimation rates. It is shown that consistent estimation of the location\nparameter is possible if and only if $\\frac{n}{2} - k = \\omega(\\sqrt{n})$, and\nconsistent estimation of the scale parameter is possible in the entire regime\n$k < \\frac{n}{2}$. Faster rates than those in Huber's contamination model are\nachievable by exploiting the Gaussian character of the data. The minimax upper\nbound is obtained by considering estimators based on the empirical\ncharacteristic function. The minimax lower bound involves constructing two\nmarginal distributions whose characteristic functions match on a wide interval\ncontaining zero. The construction notably differs from those in the literature\nby sharply capturing a scaling of $n-2k$ in the minimax estimation rate of the\nlocation."}, "http://arxiv.org/abs/2401.06383": {"title": "Decomposition with Monotone B-splines: Fitting and Testing", "link": "http://arxiv.org/abs/2401.06383", "description": "A univariate continuous function can always be decomposed as the sum of a\nnon-increasing function and a non-decreasing one. Based on this property, we\npropose a non-parametric regression method that combines two spline-fitted\nmonotone curves. We demonstrate by extensive simulations that, compared to\nstandard spline-fitting methods, the proposed approach is particularly\nadvantageous in high-noise scenarios. Several theoretical guarantees are\nestablished for the proposed approach. Additionally, we present statistics to\ntest the monotonicity of a function based on monotone decomposition, which can\nbetter control the Type I error and achieve comparable (if not higher) power\nrelative to existing methods. Finally, we apply the proposed fitting and\ntesting approaches to analyze single-cell pseudotime trajectory datasets,\nidentifying significant biological insights for non-monotonically expressed\ngenes through Gene Ontology enrichment analysis. The source code implementing\nthe methodology and producing all results is accessible at\nhttps://github.com/szcf-weiya/MonotoneDecomposition.jl."}, "http://arxiv.org/abs/2401.06403": {"title": "Fourier analysis of spatial point processes", "link": "http://arxiv.org/abs/2401.06403", "description": "In this article, we develop comprehensive frequency domain methods for\nestimating and inferring the second-order structure of spatial point processes.\nThe main element is the use of the discrete Fourier transform (DFT) of\nthe point pattern and its tapered counterpart. Under second-order stationarity,\nwe show that both the DFTs and the tapered DFTs are asymptotically jointly\nindependent Gaussian even when the DFTs share the same limiting frequencies.\nBased on these results, we establish an $\\alpha$-mixing central limit theorem\nfor a statistic formulated as a quadratic form of the tapered DFT. As\napplications, we derive the asymptotic distribution of the kernel spectral\ndensity estimator and establish a frequency domain inferential method for\nparametric stationary point processes. For the latter, the resulting model\nparameter estimator is computationally tractable and yields meaningful\ninterpretations even in the case of model misspecification. We investigate the\nfinite sample performance of our estimator through simulations, considering\nscenarios of both correctly specified and misspecified models. 
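For readers less familiar with the frequency-domain view of point patterns, the following toy sketch (ours; untapered, without the edge corrections that are the actual subject of the paper, and using one of several possible normalisations) computes a Bartlett-type periodogram of a simulated homogeneous Poisson pattern on the unit square.

```python
# Toy sketch: DFT and raw periodogram of a spatial point pattern on [0, 1]^2.
# Untapered, no edge-effect correction; normalisation conventions vary.
import numpy as np

rng = np.random.default_rng(0)
lam = 200                                   # intensity of a homogeneous Poisson process
n = rng.poisson(lam)
pts = rng.uniform(size=(n, 2))              # points in the unit square, |W| = 1

# a few Fourier frequencies 2*pi*(k1, k2), omitting the origin
ks = [(k1, k2) for k1 in range(-2, 3) for k2 in range(-2, 3) if (k1, k2) != (0, 0)]
for k1, k2 in ks[:6]:
    omega = 2 * np.pi * np.array([k1, k2])
    J = np.exp(-1j * pts @ omega).sum()                 # DFT of the pattern (|W| = 1)
    periodogram = np.abs(J) ** 2 / (2 * np.pi) ** 2     # Bartlett-type periodogram
    print(f"k = ({k1:+d},{k2:+d})  periodogram = {periodogram:.3f}")
```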
Furthermore, we\nextend our proposed DFT-based frequency domain methods to a class of\nnon-stationary spatial point processes."}, "http://arxiv.org/abs/2401.06446": {"title": "Increasing dimension asymptotics for two-way crossed mixed effect models", "link": "http://arxiv.org/abs/2401.06446", "description": "This paper presents asymptotic results for the maximum likelihood and\nrestricted maximum likelihood (REML) estimators within a two-way crossed mixed\neffect model as the sizes of the rows, columns, and cells tend to infinity.\nUnder very mild conditions which do not require the assumption of normality,\nthe estimators are proven to be asymptotically normal, possessing a structured\ncovariance matrix. The growth rate for the number of rows, columns, and cells\nis unrestricted, whether considered pairwise or collectively."}, "http://arxiv.org/abs/2401.06447": {"title": "A comprehensive framework for multi-fidelity surrogate modeling with noisy data: a gray-box perspective", "link": "http://arxiv.org/abs/2401.06447", "description": "Computer simulations (a.k.a. white-box models) are more indispensable than\never to model intricate engineering systems. However, computational models\nalone often fail to fully capture the complexities of reality. When physical\nexperiments are accessible, though, it is of interest to enhance the incomplete\ninformation offered by computational models. Gray-box modeling is concerned\nwith the problem of merging information from data-driven (a.k.a. black-box)\nmodels and white-box (i.e., physics-based) models. In this paper, we propose to\nperform this task by using multi-fidelity surrogate models (MFSMs). An MFSM\nintegrates information from models with varying computational fidelity into a\nnew surrogate model. The multi-fidelity surrogate modeling framework we propose\nhandles noise-contaminated data and is able to estimate the underlying\nnoise-free high-fidelity function. Our methodology emphasizes delivering\nprecise estimates of the uncertainty in its predictions in the form of\nconfidence and prediction intervals, by quantitatively incorporating the\ndifferent types of uncertainty that affect the problem, arising from\nmeasurement noise and from lack of knowledge due to the limited experimental\ndesign budget on both the high- and low-fidelity models. Applied to gray-box\nmodeling, our MFSM framework treats noisy experimental data as the\nhigh-fidelity and the white-box computational models as their low-fidelity\ncounterparts. The effectiveness of our methodology is showcased through\nsynthetic examples and a wind turbine application."}, "http://arxiv.org/abs/2401.06465": {"title": "Sanity Checks Revisited: An Exploration to Repair the Model Parameter Randomisation Test", "link": "http://arxiv.org/abs/2401.06465", "description": "The Model Parameter Randomisation Test (MPRT) is widely acknowledged in the\neXplainable Artificial Intelligence (XAI) community for its well-motivated\nevaluative principle: that the explanation function should be sensitive to\nchanges in the parameters of the model function. However, recent works have\nidentified several methodological caveats for the empirical interpretation of\nMPRT. 
To address these caveats, we introduce two adaptations to the original\nMPRT -- Smooth MPRT and Efficient MPRT, where the former minimises the impact\nthat noise has on the evaluation results through sampling and the latter\ncircumvents the need for biased similarity measurements by re-interpreting the\ntest through the explanation's rise in complexity, after full parameter\nrandomisation. Our experimental results demonstrate that these proposed\nvariants lead to improved metric reliability, thus enabling a more trustworthy\napplication of XAI methods."}, "http://arxiv.org/abs/2401.06557": {"title": "Treatment-Aware Hyperbolic Representation Learning for Causal Effect Estimation with Social Networks", "link": "http://arxiv.org/abs/2401.06557", "description": "Estimating the individual treatment effect (ITE) from observational data is a\ncrucial research topic that holds significant value across multiple domains.\nHow to identify hidden confounders poses a key challenge in ITE estimation.\nRecent studies have incorporated the structural information of social networks\nto tackle this challenge, achieving notable advancements. However, these\nmethods utilize graph neural networks to learn the representation of hidden\nconfounders in Euclidean space, disregarding two critical issues: (1) the\nsocial networks often exhibit a scalefree structure, while Euclidean embeddings\nsuffer from high distortion when used to embed such graphs, and (2) each\nego-centric network within a social network manifests a treatment-related\ncharacteristic, implying significant patterns of hidden confounders. To address\nthese issues, we propose a novel method called Treatment-Aware Hyperbolic\nRepresentation Learning (TAHyper). Firstly, TAHyper employs the hyperbolic\nspace to encode the social networks, thereby effectively reducing the\ndistortion of confounder representation caused by Euclidean embeddings.\nSecondly, we design a treatment-aware relationship identification module that\nenhances the representation of hidden confounders by identifying whether an\nindividual and her neighbors receive the same treatment. Extensive experiments\non two benchmark datasets are conducted to demonstrate the superiority of our\nmethod."}, "http://arxiv.org/abs/2401.06564": {"title": "Valid causal inference with unobserved confounding in high-dimensional settings", "link": "http://arxiv.org/abs/2401.06564", "description": "Various methods have recently been proposed to estimate causal effects with\nconfidence intervals that are uniformly valid over a set of data generating\nprocesses when high-dimensional nuisance models are estimated by\npost-model-selection or machine learning estimators. These methods typically\nrequire that all the confounders are observed to ensure identification of the\neffects. We contribute by showing how valid semiparametric inference can be\nobtained in the presence of unobserved confounders and high-dimensional\nnuisance models. We propose uncertainty intervals which allow for unobserved\nconfounding, and show that the resulting inference is valid when the amount of\nunobserved confounding is small relative to the sample size; the latter is\nformalized in terms of convergence rates. Simulation experiments illustrate the\nfinite sample properties of the proposed intervals and investigate an\nalternative procedure that improves the empirical coverage of the intervals\nwhen the amount of unobserved confounding is large. 
Finally, a case study on\nthe effect of smoking during pregnancy on birth weight is used to illustrate\nthe use of the methods introduced to perform a sensitivity analysis to\nunobserved confounding."}, "http://arxiv.org/abs/2401.06575": {"title": "A Weibull Mixture Cure Frailty Model for High-dimensional Covariates", "link": "http://arxiv.org/abs/2401.06575", "description": "A novel mixture cure frailty model is introduced for handling censored\nsurvival data. Mixture cure models are preferable when the existence of a cured\nfraction among patients can be assumed. However, such models are heavily\nunderexplored: frailty structures within cure models remain largely\nundeveloped, and furthermore, most existing methods do not work for\nhigh-dimensional datasets, when the number of predictors is significantly\nlarger than the number of observations. In this study, we introduce a novel\nextension of the Weibull mixture cure model that incorporates a frailty\ncomponent, employed to model an underlying latent population heterogeneity with\nrespect to the outcome risk. Additionally, high-dimensional covariates are\nintegrated into both the cure rate and survival part of the model, providing a\ncomprehensive approach to employ the model in the context of high-dimensional\nomics data. We also perform variable selection via an adaptive elastic net\npenalization, and propose a novel approach to inference using the\nexpectation-maximization (EM) algorithm. Extensive simulation studies are\nconducted across various scenarios to demonstrate the performance of the model,\nand results indicate that our proposed method outperforms competitor models. We\napply the novel approach to analyze RNAseq gene expression data from bulk\nbreast cancer patients included in The Cancer Genome Atlas (TCGA) database. A\nset of prognostic biomarkers is then derived from selected genes, and\nsubsequently validated via both functional enrichment analysis and comparison\nto the existing biological literature. Finally, a prognostic risk score index\nbased on the identified biomarkers is proposed and validated by exploring the\npatients' survival."}, "http://arxiv.org/abs/2401.06687": {"title": "Proximal Causal Inference With Text Data", "link": "http://arxiv.org/abs/2401.06687", "description": "Recent text-based causal methods attempt to mitigate confounding bias by\nincluding unstructured text data as proxies of confounding variables that are\npartially or imperfectly measured. These approaches assume analysts have\nsupervised labels of the confounders given text for a subset of instances, a\nconstraint that is not always feasible due to data privacy or cost. Here, we\naddress settings in which an important confounding variable is completely\nunobserved. We propose a new causal inference method that splits pre-treatment\ntext data, infers two proxies from two zero-shot models on the separate splits,\nand applies these proxies in the proximal g-formula. We prove that our\ntext-based proxy method satisfies identification conditions required by the\nproximal g-formula while other seemingly reasonable proposals do not. We\nevaluate our method in synthetic and semi-synthetic settings and find that it\nproduces estimates with low bias. 
This combination of proximal causal inference\nand zero-shot classifiers is novel (to our knowledge) and expands the set of\ntext-specific causal methods available to practitioners."}, "http://arxiv.org/abs/1905.11232": {"title": "Efficient posterior sampling for high-dimensional imbalanced logistic regression", "link": "http://arxiv.org/abs/1905.11232", "description": "High-dimensional data are routinely collected in many areas. We are\nparticularly interested in Bayesian classification models in which one or more\nvariables are imbalanced. Current Markov chain Monte Carlo algorithms for\nposterior computation are inefficient as $n$ and/or $p$ increase due to\nworsening time per step and mixing rates. One strategy is to use a\ngradient-based sampler to improve mixing while using data sub-samples to reduce\nper-step computational complexity. However, usual sub-sampling breaks down when\napplied to imbalanced data. Instead, we generalize piece-wise deterministic\nMarkov chain Monte Carlo algorithms to include importance-weighted and\nmini-batch sub-sampling. These approaches maintain the correct stationary\ndistribution with arbitrarily small sub-samples, and substantially outperform\ncurrent competitors. We provide theoretical support and illustrate gains in\nsimulated and real data applications."}, "http://arxiv.org/abs/2305.19481": {"title": "Bayesian Image Analysis in Fourier Space", "link": "http://arxiv.org/abs/2305.19481", "description": "Bayesian image analysis has played a large role over the last 40+ years in\nsolving problems in image noise-reduction, de-blurring, feature enhancement,\nand object detection. However, these problems can be complex and lead to\ncomputational difficulties, due to the modeled interdependence between spatial\nlocations. The Bayesian image analysis in Fourier space (BIFS) approach\nproposed here reformulates the conventional Bayesian image analysis paradigm\nfor continuous valued images as a large set of independent (but heterogeneous)\nprocesses over Fourier space. The original high-dimensional estimation problem\nin image space is thereby broken down into (trivially parallelizable)\nindependent one-dimensional problems in Fourier space. The BIFS approach leads\nto easy model specification with fast and direct computation, a wide range of\npossible prior characteristics, easy modeling of isotropy into the prior, and\nmodels that are effectively invariant to changes in image resolution."}, "http://arxiv.org/abs/2307.09404": {"title": "Continuous-time multivariate analysis", "link": "http://arxiv.org/abs/2307.09404", "description": "The starting point for much of multivariate analysis (MVA) is an $n\\times p$\ndata matrix whose $n$ rows represent observations and whose $p$ columns\nrepresent variables. Some multivariate data sets, however, may be best\nconceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves\nor functions defined on a common time interval. We introduce a framework for\nextending techniques of multivariate analysis to such settings. The proposed\nframework rests on the assumption that the curves can be represented as linear\ncombinations of basis functions such as B-splines. This is formally identical\nto the Ramsay-Silverman representation of functional data; but whereas\nfunctional data analysis extends MVA to the case of observations that are\ncurves rather than vectors -- heuristically, $n\\times p$ data with $p$ infinite\n-- we are instead concerned with what happens when $n$ is infinite. 
We describe\nhow to translate the classical MVA methods of covariance and correlation\nestimation, principal component analysis, Fisher's linear discriminant\nanalysis, and $k$-means clustering to the continuous-time setting. We\nillustrate the methods with a novel perspective on a well-known Canadian\nweather data set, and with applications to neurobiological and environmetric\ndata. The methods are implemented in the publicly available R package\n\\texttt{ctmva}."}, "http://arxiv.org/abs/2401.06904": {"title": "Non-collapsibility and Built-in Selection Bias of Hazard Ratio in Randomized Controlled Trials", "link": "http://arxiv.org/abs/2401.06904", "description": "Background: The hazard ratio of the Cox proportional hazards model is widely\nused in randomized controlled trials to assess treatment effects. However, two\nproperties of the hazard ratio, namely non-collapsibility and built-in\nselection bias, need to be further investigated. Methods: We conduct simulations\nto differentiate the non-collapsibility effect and built-in selection bias from\nthe difference between the marginal and the conditional hazard ratio.\nMeanwhile, we explore the performance of the Cox model with inverse probability\nof treatment weighting for covariate adjustment when estimating the marginal\nhazard ratio. The built-in selection bias is further assessed in the\nperiod-specific hazard ratio. Results: The conditional hazard ratio is a biased\nestimate of the marginal effect due to the non-collapsibility property. In\ncontrast, the hazard ratio estimated from the inverse probability of treatment\nweighting Cox model provides an unbiased estimate of the true marginal hazard\nratio. The built-in selection bias only manifests in the period-specific hazard\nratios even when the proportional hazards assumption is satisfied. The Cox\nmodel with inverse probability of treatment weighting can be used to account\nfor confounding bias and provide an unbiased effect under the randomized\ncontrolled trials setting when the parameter of interest is the marginal\neffect. Conclusions: We propose that the period-specific hazard ratios should\nalways be avoided due to the profound effects of built-in selection bias."}, "http://arxiv.org/abs/2401.06909": {"title": "Sensitivity Analysis for Matched Observational Studies with Continuous Exposures and Binary Outcomes", "link": "http://arxiv.org/abs/2401.06909", "description": "Matching is one of the most widely used study designs for adjusting for\nmeasured confounders in observational studies. However, unmeasured confounding\nmay exist and cannot be removed by matching. Therefore, a sensitivity analysis\nis typically needed to assess a causal conclusion's sensitivity to unmeasured\nconfounding. Sensitivity analysis frameworks for binary exposures have been\nwell-established for various matching designs and are commonly used in various\nstudies. However, unlike the binary exposure case, valid and\ngeneral sensitivity analysis methods for continuous exposures are still lacking, except in some\nspecial cases such as pair matching. To fill this gap in the binary outcome\ncase, we develop a sensitivity analysis framework for general matching designs\nwith continuous exposures and binary outcomes. First, we use probabilistic\nlattice theory to show that our sensitivity analysis approach is\nfinite-population-exact under Fisher's sharp null. 
Second, we prove a novel\ndesign sensitivity formula as a powerful tool for asymptotically evaluating the\nperformance of our sensitivity analysis approach. Third, to allow effect\nheterogeneity with binary outcomes, we introduce a framework for conducting\nasymptotically exact inference and sensitivity analysis on generalized\nattributable effects with binary outcomes via mixed-integer programming.\nFourth, for the continuous outcomes case, we show that conducting an\nasymptotically exact sensitivity analysis in matched observational studies when\nboth the exposures and outcomes are continuous is generally NP-hard, except in\nsome special cases such as pair matching. As a real data application, we apply\nour new methods to study the effect of early-life lead exposure on juvenile\ndelinquency. We also develop a publicly available R package for implementation\nof the methods in this work."}, "http://arxiv.org/abs/2401.06919": {"title": "Pseudo-Empirical Likelihood Methods for Causal Inference", "link": "http://arxiv.org/abs/2401.06919", "description": "Causal inference problems have remained an important research topic over the\npast several decades due to their general applicability in assessing a\ntreatment effect in many different real-world settings. In this paper, we\npropose two inferential procedures on the average treatment effect (ATE)\nthrough a two-sample pseudo-empirical likelihood (PEL) approach. The first\nprocedure uses the estimated propensity scores for the formulation of the PEL\nfunction, and the resulting maximum PEL estimator of the ATE is equivalent to\nthe inverse probability weighted estimator discussed in the literature. Our\nfocus in this scenario is on the PEL ratio statistic and establishing its\ntheoretical properties. The second procedure incorporates outcome regression\nmodels for PEL inference through model-calibration constraints, and the\nresulting maximum PEL estimator of the ATE is doubly robust. Our main\ntheoretical result in this case is the establishment of the asymptotic\ndistribution of the PEL ratio statistic. We also propose a bootstrap method for\nconstructing PEL ratio confidence intervals for the ATE to bypass the scaling\nconstant which is involved in the asymptotic distribution of the PEL ratio\nstatistic but is very difficult to calculate. Finite sample performances of our\nproposed methods with comparisons to existing ones are investigated through\nsimulation studies."}, "http://arxiv.org/abs/2401.06925": {"title": "Modeling Latent Selection with Structural Causal Models", "link": "http://arxiv.org/abs/2401.06925", "description": "Selection bias is ubiquitous in real-world data, and can lead to misleading\nresults if not dealt with properly. We introduce a conditioning operation on\nStructural Causal Models (SCMs) to model latent selection from a causal\nperspective. We show that the conditioning operation transforms an SCM with the\npresence of an explicit latent selection mechanism into an SCM without such\nselection mechanism, which partially encodes the causal semantics of the\nselected subpopulation according to the original SCM. Furthermore, we show that\nthis conditioning operation preserves the simplicity, acyclicity, and linearity\nof SCMs, and commutes with marginalization. Thanks to these properties,\ncombined with marginalization and intervention, the conditioning operation\noffers a valuable tool for conducting causal reasoning tasks within causal\nmodels where latent details have been abstracted away. 
We demonstrate by\nexample how classical results of causal inference can be generalized to include\nselection bias and how the conditioning operation helps with modeling of\nreal-world problems."}, "http://arxiv.org/abs/2401.06990": {"title": "Graphical Principal Component Analysis of Multivariate Functional Time Series", "link": "http://arxiv.org/abs/2401.06990", "description": "In this paper, we consider multivariate functional time series with a two-way\ndependence structure: a serial dependence across time points and a graphical\ninteraction among the multiple functions within each time point. We develop the\nnotion of dynamic weak separability, a more general condition than those\nassumed in literature, and use it to characterize the two-way structure in\nmultivariate functional time series. Based on the proposed weak separability,\nwe develop a unified framework for functional graphical models and dynamic\nprincipal component analysis, and further extend it to optimally reconstruct\nsignals from contaminated functional data using graphical-level information. We\ninvestigate asymptotic properties of the resulting estimators and illustrate\nthe effectiveness of our proposed approach through extensive simulations. We\napply our method to hourly air pollution data that were collected from a\nmonitoring network in China."}, "http://arxiv.org/abs/2401.07000": {"title": "Counterfactual Slope and Its Applications to Social Stratification", "link": "http://arxiv.org/abs/2401.07000", "description": "This paper addresses two prominent theses in social stratification research,\nthe great equalizer thesis and Mare's (1980) school transition thesis. Both\ntheses are premised on a descriptive regularity: the association between\nsocioeconomic background and an outcome variable changes when conditioning on\nan intermediate treatment. The interpretation of this descriptive regularity is\ncomplicated by social actors' differential selection into treatment based on\ntheir potential outcomes under treatment. In particular, if the descriptive\nregularity is driven by selection, then the theses do not have a substantive\ninterpretation. We propose a set of novel counterfactual slope estimands, which\ncapture the two theses under the hypothetical scenario where differential\nselection into treatment is eliminated. Thus, we use the counterfactual slopes\nto construct selection-free tests for the two theses. Compared with the\nexisting literature, we are the first to provide explicit, nonparametric, and\ncausal estimands, which enable us to conduct principled selection-free tests.\nWe develop efficient and robust estimators by deriving the efficient influence\nfunctions of the estimands. We apply our framework to a nationally\nrepresentative dataset in the United States and re-evaluate the two theses.\nFindings from our selection-free tests show that the descriptive regularity of\nthe two theses is misleading for substantive interpretations."}, "http://arxiv.org/abs/2401.07018": {"title": "Graphical models for cardinal paired comparisons data", "link": "http://arxiv.org/abs/2401.07018", "description": "Graphical models for cardinal paired comparison data with and without\ncovariates are rigorously analyzed. Novel, graph--based, necessary and\nsufficient conditions which guarantee strong consistency, asymptotic normality\nand the exponential convergence of the estimated ranks are emphasized. A\ncomplete theory for models with covariates is laid out. 
In particular,\nconditions under which covariates can be safely omitted from the model are\nprovided. The methodology is employed in the analysis of both finite and\ninfinite sets of ranked items, specifically in the case of large sparse\ncomparison graphs. The proposed methods are explored by simulation and applied\nto the ranking of teams in the National Basketball Association (NBA)."}, "http://arxiv.org/abs/2401.07221": {"title": "Type I multivariate P\\'olya-Aeppli distributions with applications", "link": "http://arxiv.org/abs/2401.07221", "description": "An extensive body of literature exists that specifically addresses the\nunivariate case of zero-inflated count models. In contrast, research pertaining\nto multivariate models is notably less developed. We propose two new\nparsimonious multivariate models which can be used to model correlated\nmultivariate overdispersed count data. Furthermore, for different parameter\nsettings and sample sizes, various simulations are performed. In conclusion, we\ndemonstrate the performance of the newly proposed multivariate candidates on\ntwo benchmark datasets, which surpasses that of several alternative approaches."}, "http://arxiv.org/abs/2401.07231": {"title": "Use of Prior Knowledge to Discover Causal Additive Models with Unobserved Variables and its Application to Time Series Data", "link": "http://arxiv.org/abs/2401.07231", "description": "This paper proposes two methods for causal additive models with unobserved\nvariables (CAM-UV). CAM-UV assumes that the causal functions take the form of\ngeneralized additive models and that latent confounders are present. First, we\npropose a method that leverages prior knowledge for efficient causal discovery.\nThen, we propose an extension of this method for inferring causality in time\nseries data. The original CAM-UV algorithm differs from other existing causal\nfunction models in that it does not seek the causal order between observed\nvariables, but rather aims to identify the causes for each observed variable.\nTherefore, the first proposed method in this paper utilizes prior knowledge,\nsuch as understanding that certain variables cannot be causes of specific\nothers. Moreover, by incorporating the prior knowledge that causes precede\ntheir effects in time, we extend the first algorithm to the second method for\ncausal discovery in time series data. We validate the first proposed method by\nusing simulated data to demonstrate that the accuracy of causal discovery\nincreases as more prior knowledge is accumulated. Additionally, we test the\nsecond proposed method by comparing it with existing time series causal\ndiscovery methods, using both simulated data and real-world data."}, "http://arxiv.org/abs/2401.07259": {"title": "Inference for multivariate extremes via a semi-parametric angular-radial model", "link": "http://arxiv.org/abs/2401.07259", "description": "The modelling of multivariate extreme events is important in a wide variety\nof applications, including flood risk analysis, metocean engineering and\nfinancial modelling. A broad range of statistical techniques have been\nproposed in the literature; however, many such methods are limited in the forms\nof dependence they can capture, or make strong parametric assumptions about\ndata structures. In this article, we introduce a novel inference framework for\nmultivariate extremes based on a semi-parametric angular-radial model. 
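The angular-radial (polar) decomposition that such a model builds on can be sketched in a few lines; the toy example below is ours, uses an arbitrary heavy-tailed sample, and does not reproduce the paper's semi-parametric model for the conditional radial tail.

```python
# Toy sketch of an angular-radial decomposition for bivariate data:
# R = ||x||_2 and W = x / R, after which joint tail behaviour can be studied
# through the upper tail of R conditionally on the angular component W.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=(10000, 2))      # heavy-tailed toy sample

r = np.linalg.norm(x, axis=1)                  # radial component
w = x / r[:, None]                             # angular component (unit vectors)
angle = np.arctan2(w[:, 1], w[:, 0])           # angles in (-pi, pi]

# crude look at the radial tail within angular sectors
bins = np.linspace(-np.pi, np.pi, 9)
for lo, hi in zip(bins[:-1], bins[1:]):
    sel = (angle > lo) & (angle <= hi)
    q = np.quantile(r[sel], 0.99)
    print(f"sector ({lo:+.2f}, {hi:+.2f}]  n = {sel.sum():5d}  99% radial quantile = {q:.2f}")
```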
This\nmodel overcomes the limitations of many existing approaches and provides a\nunified paradigm for assessing joint tail behaviour. Alongside inferential\ntools, we introduce techniques for assessing uncertainty and goodness of\nfit. Our proposed technique is tested on simulated data sets alongside observed\nmetocean time series, with results indicating generally good performance."}, "http://arxiv.org/abs/2401.07267": {"title": "Inference for high-dimensional linear expectile regression with de-biased method", "link": "http://arxiv.org/abs/2401.07267", "description": "In this paper, we address the inference problem in high-dimensional linear\nexpectile regression. We transform the expectile loss into a\nweighted-least-squares form and apply a de-biased strategy to establish\nWald-type tests for multiple constraints within a regularized framework.\nSimultaneously, we construct an estimator for the pseudo-inverse of the\ngeneralized Hessian matrix in high dimension with general amenable regularizers\nincluding Lasso and SCAD, and demonstrate its consistency through a new proof\ntechnique. We conduct simulation studies and real data applications to\ndemonstrate the efficacy of our proposed test statistic in both homoscedastic\nand heteroscedastic scenarios."}, "http://arxiv.org/abs/2401.07294": {"title": "Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies", "link": "http://arxiv.org/abs/2401.07294", "description": "Metamodels, or the regression analysis of Monte Carlo simulation (MCS)\nresults, provide a powerful tool to summarize MCS findings. However, an as-yet\nunexplored approach is the use of multilevel metamodels (MLMM) that better\naccount for the dependent data structure of MCS results that arises from\nfitting multiple models to the same simulated data set. In this study, we\narticulate the theoretical rationale for the MLMM and illustrate how it can\ndramatically improve efficiency over the traditional regression approach,\nbetter account for complex MCS designs, and provide new insights into the\ngeneralizability of MCS findings."}, "http://arxiv.org/abs/2401.07344": {"title": "Robust Genomic Prediction and Heritability Estimation using Density Power Divergence", "link": "http://arxiv.org/abs/2401.07344", "description": "This manuscript delves into the intersection of genomics and phenotypic\nprediction, focusing on the statistical innovation required to navigate the\ncomplexities introduced by noisy covariates and confounders. The primary\nemphasis is on the development of advanced robust statistical models tailored\nfor genomic prediction from single nucleotide polymorphism (SNP) data collected\nfrom genome-wide association studies (GWAS) in plant and animal breeding and\nmulti-field trials. The manuscript explores the limitations of traditional\nmarker-assisted recurrent selection, highlighting the significance of\nincorporating all estimated effects of marker loci into the statistical\nframework and aiming to reduce the high dimensionality of GWAS data while\npreserving critical information. This paper introduces a new robust statistical\nframework for genomic prediction, employing one-stage and two-stage linear\nmixed model analyses and utilizing the popular robust minimum density\npower divergence estimator (MDPDE) to estimate genetic effects on phenotypic\ntraits. 
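As background for readers unfamiliar with the MDPDE, the next sketch (ours, for a simple normal location-scale model rather than the paper's mixed-model setting; the tuning parameter value is an arbitrary choice) minimises the density power divergence objective of Basu et al. (1998) on contaminated data and contrasts it with the usual sample mean and standard deviation.

```python
# Toy MDPDE sketch for a normal location-scale model on contaminated data.
# DPD objective with tuning a > 0:
#   H_n(mu, sigma) = \int f^{1+a} dx - (1 + 1/a) * mean_i f(x_i)^a,
# where, for N(mu, sigma^2), \int f^{1+a} dx = (2*pi*sigma^2)^(-a/2) / sqrt(1+a).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 450), rng.normal(8, 1, 50)])  # 10% outliers

def dpd_objective(theta, x, a):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                       # keep sigma positive
    integral = (2 * np.pi * sigma**2) ** (-a / 2) / np.sqrt(1 + a)
    return integral - (1 + 1 / a) * np.mean(norm.pdf(x, mu, sigma) ** a)

fit = minimize(dpd_objective, x0=[np.median(x), 0.0], args=(x, 0.5))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print("sample mean / sd:", x.mean(), x.std())       # pulled towards the outliers
print("MDPDE mu / sigma:", mu_hat, sigma_hat)       # close to the clean component
```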
The study illustrates the superior performance of the proposed\nMDPDE-based genomic prediction and associated heritability estimation\nprocedures over existing competitors through extensive empirical experiments on\nartificial datasets and application to a real-life maize breeding dataset. The\nresults showcase the robustness and accuracy of the proposed MDPDE-based\napproaches, especially in the presence of data contamination, emphasizing their\npotential applications in improving breeding programs and advancing genomic\nprediction of phenotypic traits."}, "http://arxiv.org/abs/2401.07365": {"title": "Sequential permutation testing by betting", "link": "http://arxiv.org/abs/2401.07365", "description": "We develop an anytime-valid permutation test, where the dataset is fixed and\nthe permutations are sampled sequentially one by one, with the objective of\nsaving computational resources by sampling fewer permutations and stopping\nearly. The core technical advance is the development of new test martingales\n(nonnegative martingales with initial value one) for testing exchangeability\nagainst a very particular alternative. These test martingales are constructed\nusing new and simple betting strategies that smartly bet on the relative ranks\nof permuted test statistics. The betting strategy is guided by the derivation\nof a simple log-optimal betting strategy, and displays excellent power in\npractice. In contrast to a well-known method by Besag and Clifford, our method\nyields a valid e-value or a p-value at any stopping time, and with particular\nstopping rules, it yields computational gains under both the null and the\nalternative without compromising power."}, "http://arxiv.org/abs/2401.07400": {"title": "Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data", "link": "http://arxiv.org/abs/2401.07400", "description": "Investigating the relationship, particularly the lead-lag effect, between\ntime series is a common question across various disciplines, especially when\nuncovering biological processes. However, analyzing time series presents several\nchallenges. Firstly, due to technical reasons, the time points at which\nobservations are made are not at uniform intervals. Secondly, some lead-lag\neffects are transient, necessitating time-lag estimation based on a limited\nnumber of time points. Thirdly, external factors also impact these time series,\nrequiring a similarity metric to assess the lead-lag relationship. To counter\nthese issues, we introduce a model grounded in the Gaussian process, affording\nthe flexibility to estimate lead-lag effects for irregular time series. In\naddition, our method outputs dissimilarity scores, thereby broadening its\napplications to include tasks such as ranking or clustering multiple pair-wise\ntime series when considering their strength of lead-lag effects with external\nfactors. 
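The following simplified sketch conveys the flavour of lead-lag estimation for irregularly sampled series (our illustration only; the paper instead builds the lag into the GP kernel itself): smooth each series with Gaussian process regression and pick the time shift that best aligns the two posterior means.

```python
# Simplified lead-lag illustration for irregularly sampled series: fit a GP to
# each series, then grid-search the shift s that minimises the squared distance
# between the posterior means. Not the paper's joint-kernel model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
true_lag = 0.7
t1 = np.sort(rng.uniform(0, 10, 60))               # irregular sampling times
t2 = np.sort(rng.uniform(0, 10, 45))
y1 = np.sin(t1) + 0.1 * rng.normal(size=t1.size)
y2 = np.sin(t2 - true_lag) + 0.1 * rng.normal(size=t2.size)   # lagged copy

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gp1 = GaussianProcessRegressor(kernel=kernel).fit(t1[:, None], y1)
gp2 = GaussianProcessRegressor(kernel=kernel).fit(t2[:, None], y2)

grid = np.linspace(2, 8, 200)                      # interior grid, away from the edges
m1 = gp1.predict(grid[:, None])
lags = np.linspace(-2, 2, 81)
mse = [np.mean((m1 - gp2.predict((grid + s)[:, None])) ** 2) for s in lags]
print("estimated lag:", lags[int(np.argmin(mse))])  # should be close to 0.7
```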
Crucially, we offer a series of theoretical proofs to substantiate the\nvalidity of our proposed kernels and the identifiability of kernel parameters.\nOur model demonstrates advances in various simulations and real-world\napplications, particularly in the study of dynamic chromatin interactions,\ncompared to other leading methods."}, "http://arxiv.org/abs/2401.07401": {"title": "Design-Based Estimation and Central Limit Theorems for Local Average Treatment Effects for RCTs", "link": "http://arxiv.org/abs/2401.07401", "description": "There is a growing literature on design-based methods to estimate average\ntreatment effects for randomized controlled trials (RCTs) using the\nunderpinnings of experiments. In this article, we build on these methods to\nconsider design-based regression estimators for the local average treatment\neffect (LATE) estimand for RCTs with treatment noncompliance. We prove new\nfinite-population central limit theorems for a range of designs, including\nblocked and clustered RCTs, allowing for baseline covariates to improve\nprecision. We discuss consistent variance estimators based on model residuals\nand conduct simulations that show the estimators yield confidence interval\ncoverage near nominal levels. We demonstrate the methods using data from a\nprivate school voucher RCT in New York City USA."}, "http://arxiv.org/abs/2401.07421": {"title": "A Bayesian Approach to Modeling Variance of Intensive Longitudinal Biomarker Data as a Predictor of Health Outcomes", "link": "http://arxiv.org/abs/2401.07421", "description": "Intensive longitudinal biomarker data are increasingly common in scientific\nstudies that seek temporally granular understanding of the role of behavioral\nand physiological factors in relation to outcomes of interest. Intensive\nlongitudinal biomarker data, such as those obtained from wearable devices, are\noften obtained at a high frequency typically resulting in several hundred to\nthousand observations per individual measured over minutes, hours, or days.\nOften in longitudinal studies, the primary focus is on relating the means of\nbiomarker trajectories to an outcome, and the variances are treated as nuisance\nparameters, although they may also be informative for the outcomes. In this\npaper, we propose a Bayesian hierarchical model to jointly model a\ncross-sectional outcome and the intensive longitudinal biomarkers. To model the\nvariability of biomarkers and deal with the high intensity of data, we develop\nsubject-level cubic B-splines and allow the sharing of information across\nindividuals for both the residual variability and the random effects\nvariability. Then different levels of variability are extracted and\nincorporated into an outcome submodel for inferential and predictive purposes.\nWe demonstrate the utility of the proposed model via an application involving\nbio-monitoring of hertz-level heart rate information from a study on social\nstress."}, "http://arxiv.org/abs/2401.07445": {"title": "GACE: Learning Graph-Based Cross-Page Ads Embedding For Click-Through Rate Prediction", "link": "http://arxiv.org/abs/2401.07445", "description": "Predicting click-through rate (CTR) is the core task of many ads online\nrecommendation systems, which helps improve user experience and increase\nplatform revenue. In this type of recommendation system, we often encounter two\nmain problems: the joint usage of multi-page historical advertising data and\nthe cold start of new ads. 
In this paper, we proposed GACE, a graph-based\ncross-page ads embedding generation method. It can warm up and generate the\nrepresentation embedding of cold-start and existing ads across various pages.\nSpecifically, we carefully build linkages and a weighted undirected graph model\nconsidering semantic and page-type attributes to guide the direction of feature\nfusion and generation. We designed a variational auto-encoding task as\npre-training module and generated embedding representations for new and old ads\nbased on this task. The results evaluated in the public dataset AliEC from\nRecBole and the real-world industry dataset from Alipay show that our GACE\nmethod is significantly superior to the SOTA method. In the online A/B test,\nthe click-through rate on three real-world pages from Alipay has increased by\n3.6%, 2.13%, and 3.02%, respectively. Especially in the cold-start task, the\nCTR increased by 9.96%, 7.51%, and 8.97%, respectively."}, "http://arxiv.org/abs/2401.07522": {"title": "Balancing the edge effect and dimension of spectral spatial statistics under irregular sampling with applications to isotropy testing", "link": "http://arxiv.org/abs/2401.07522", "description": "We investigate distributional properties of a class of spectral spatial\nstatistics under irregular sampling of a random field that is defined on\n$\\mathbb{R}^d$, and use this to obtain a test for isotropy. Within this\ncontext, edge effects are well-known to create a bias in classical estimators\ncommonly encountered in the analysis of spatial data. This bias increases with\ndimension $d$ and, for $d>1$, can become non-negligible in the limiting\ndistribution of such statistics to the extent that a nondegenerate distribution\ndoes not exist. We provide a general theory for a class of (integrated)\nspectral statistics that enables to 1) significantly reduce this bias and 2)\nthat ensures that asymptotically Gaussian limits can be derived for $d \\le 3$\nfor appropriately tapered versions of such statistics. We use this to address\nsome crucial gaps in the literature, and demonstrate that tapering with a\nsufficiently smooth function is necessary to achieve such results. Our findings\nspecifically shed a new light on a recent result in Subba Rao (2018a). Our\ntheory then is used to propose a novel test for isotropy. In contrast to most\nof the literature, which validates this assumption on a finite number of\nspatial locations (or a finite number of Fourier frequencies), we develop a\ntest for isotropy on the full spatial domain by means of its characterization\nin the frequency domain. More precisely, we derive an explicit expression for\nthe minimum $L^2$-distance between the spectral density of the random field and\nits best approximation by a spectral density of an isotropic process. We prove\nasymptotic normality of an estimator of this quantity in the mixed increasing\ndomain framework and use this result to derive an asymptotic level\n$\\alpha$-test."}, "http://arxiv.org/abs/2401.07562": {"title": "Probabilistic Richardson Extrapolation", "link": "http://arxiv.org/abs/2401.07562", "description": "For over a century, extrapolation methods have provided a powerful tool to\nimprove the convergence order of a numerical method. However, these tools are\nnot well-suited to modern computer codes, where multiple continua are\ndiscretised and convergence orders are not easily analysed. 
To address this\nchallenge we present a probabilistic perspective on Richardson extrapolation, a\npoint of view that unifies classical extrapolation methods with modern\nmulti-fidelity modelling, and handles uncertain convergence orders by allowing\nthese to be statistically estimated. The approach is developed using Gaussian\nprocesses, leading to Gauss-Richardson Extrapolation (GRE). Conditions are\nestablished under which extrapolation using the conditional mean achieves a\npolynomial (or even an exponential) speed-up compared to the original numerical\nmethod. Further, the probabilistic formulation unlocks the possibility of\nexperimental design, casting the selection of fidelities as a continuous\noptimisation problem which can then be (approximately) solved. A case-study\ninvolving a computational cardiac model demonstrates that practical gains in\naccuracy can be achieved using the GRE method."}, "http://arxiv.org/abs/2401.07625": {"title": "Statistics in Survey Sampling", "link": "http://arxiv.org/abs/2401.07625", "description": "Survey sampling theory and methods are introduced. Sampling designs and\nestimation methods are carefully discussed as a textbook for survey sampling.\nTopics includes Horvitz-Thompson estimation, simple random sampling, stratified\nsampling, cluster sampling, ratio estimation, regression estimation, variance\nestimation, two-phase sampling, and nonresponse adjustment methods."}, "http://arxiv.org/abs/2401.07724": {"title": "A non-parametric estimator for Archimedean copulas under flexible censoring scenarios and an application to claims reserving", "link": "http://arxiv.org/abs/2401.07724", "description": "With insurers benefiting from ever-larger amounts of data of increasing\ncomplexity, we explore a data-driven method to model dependence within\nmultilevel claims in this paper. More specifically, we start from a\nnon-parametric estimator for Archimedean copula generators introduced by Genest\nand Rivest (1993), and we extend it to diverse flexible censoring scenarios\nusing techniques derived from survival analysis. We implement a graphical\nselection procedure for copulas that we validate using goodness-of-fit methods\napplied to complete, single-censored, and double-censored bivariate data. We\nillustrate the performance of our model with multiple simulation studies. We\nthen apply our methodology to a recent Canadian automobile insurance dataset\nwhere we seek to model the dependence between the activation delays of\ncorrelated coverages. We show that our model performs quite well in selecting\nthe best-fitted copula for the data at hand, especially when the dataset is\nlarge, and that the results can then be used as part of a larger claims\nreserving methodology."}, "http://arxiv.org/abs/2401.07767": {"title": "Estimation of the genetic Gaussian network using GWAS summary data", "link": "http://arxiv.org/abs/2401.07767", "description": "Genetic Gaussian network of multiple phenotypes constructed through the\ngenetic correlation matrix is informative for understanding their biological\ndependencies. However, its interpretation may be challenging because the\nestimated genetic correlations are biased due to estimation errors and\nhorizontal pleiotropy inherent in GWAS summary statistics. Here we introduce a\nnovel approach called Estimation of Genetic Graph (EGG), which eliminates the\nestimation error bias and horizontal pleiotropy bias with the same techniques\nused in multivariable Mendelian randomization. 
The genetic network estimated by\nEGG can be interpreted as representing shared biological contributions\nbetween phenotypes, conditional on others, and even as indicating the causal\ncontributions. We use both simulations and real data to demonstrate the\nsuperior efficacy of our novel method in comparison with the traditional\nnetwork estimators. The R package EGG is available at\nhttps://github.com/harryyiheyang/EGG."}, "http://arxiv.org/abs/2401.07820": {"title": "Posterior shrinkage towards linear subspaces", "link": "http://arxiv.org/abs/2401.07820", "description": "It is common to hold prior beliefs that are not characterized by points in\nthe parameter space but instead are relational in nature and can be described\nby a linear subspace. While some previous work has been done to account for\nsuch prior beliefs, the focus has primarily been on point estimators within a\nregression framework. We argue, however, that prior beliefs about parameters\nought to be encoded into the prior distribution rather than in the formation of\na point estimator. In this way, the prior beliefs help shape \textit{all}\ninference. Through exponential tilting, we propose a fully generalizable method\nof taking existing prior information from, e.g., a pilot study, and combining\nit with additional prior beliefs represented by parameters lying on a linear\nsubspace. We provide computationally efficient algorithms for posterior\ninference that, once inference is made using a non-tilted prior, do not\ndepend on the sample size. We illustrate our proposed approach on an\nantihypertensive clinical trial dataset where we shrink towards a power law\ndose-response relationship, and on monthly influenza and pneumonia data where\nwe shrink moving average lag parameters towards smoothness. Software to\nimplement the proposed approach is provided in the R package \verb+SUBSET+\navailable on GitHub."}, "http://arxiv.org/abs/2401.08159": {"title": "Reluctant Interaction Modeling in Generalized Linear Models", "link": "http://arxiv.org/abs/2401.08159", "description": "While including pairwise interactions in a regression model can better\napproximate the response surface, fitting such an interaction model is a well-known\ndifficult problem. In particular, analyzing contemporary high-dimensional\ndatasets often leads to extremely large-scale interaction modeling problems,\nwhere the challenge is to identify important interactions among millions\nor even billions of candidate interactions. While several methods have recently\nbeen proposed to tackle this challenge, they are mostly designed by (1)\nimposing the hierarchy assumption among the important interactions and/or (2)\nfocusing on linear models with interactions and (sub)Gaussian\nerrors. In practice, however, neither of these two building blocks has to hold.\nIn this paper, we propose an interaction modeling framework in generalized\nlinear models (GLMs) which is free of any assumptions on hierarchy. We develop\na non-trivial extension of the reluctance interaction selection principle to\nthe GLMs setting, where a main effect is preferred over an interaction if all\nelse is equal. Our proposed method is easy to implement, and is highly scalable\nto large-scale datasets. Theoretically, we demonstrate that it possesses\nscreening consistency under the high-dimensional setting. 
Numerical studies on\nsimulated datasets and a real dataset show that the proposed method does not\nsacrifice statistical performance while achieving significant computational\ngains."}, "http://arxiv.org/abs/2401.08172": {"title": "On GEE for Mean-Variance-Correlation Models: Variance Estimation and Model Selection", "link": "http://arxiv.org/abs/2401.08172", "description": "Generalized estimating equations (GEE) are of great importance in analyzing\nclustered data without full specification of multivariate distributions. A\nrecent approach jointly models the mean, variance, and correlation coefficients\nof clustered data through three sets of regressions (Luo and Pan, 2022). We\nobserve that these estimating equations, however, are a special case of those\nof Yan and Fine (2004) which further allow the variance to depend on the mean\nthrough a variance function. The proposed variance estimators may be incorrect\nfor the variance and correlation parameters because of a subtle dependence\ninduced by the nested structure of the estimating equations. We characterize\nmodel settings where their variance estimation is invalid and show the variance\nestimators in Yan and Fine (2004) correctly account for such dependence. In\naddition, we introduce a novel model selection criterion that enables the\nsimultaneous selection of the mean-scale-correlation model. The sandwich\nvariance estimator and the proposed model selection criterion are tested by\nseveral simulation studies and real data analysis, which validate their\neffectiveness in variance estimation and model selection. Our work also extends\nthe R package geepack with the flexibility to apply different working\ncovariance matrices for the variance and correlation structures."}, "http://arxiv.org/abs/2401.08173": {"title": "Simultaneous Change Point Detection and Identification for High Dimensional Linear Models", "link": "http://arxiv.org/abs/2401.08173", "description": "In this article, we consider change point inference for high dimensional\nlinear models. For change point detection, given any subgroup of variables, we\npropose a new method for testing the homogeneity of corresponding regression\ncoefficients across the observations. Under some regularity conditions, the\nproposed new testing procedure controls the type I error asymptotically, is\npowerful against sparse alternatives, and enjoys certain optimality. For change\npoint identification, an argmax based change point estimator is proposed which\nis shown to be consistent for the true change point location. Moreover, by\ncombining it with the binary segmentation technique, we further extend our new\nmethod to detect and identify multiple change points. Extensive\nnumerical studies justify the validity of our new method and an application to\nthe Alzheimer's disease data analysis further demonstrates its competitive\nperformance."}, "http://arxiv.org/abs/2401.08175": {"title": "Bayesian Kriging Approaches for Spatial Functional Data", "link": "http://arxiv.org/abs/2401.08175", "description": "Functional kriging approaches have been developed to predict the curves at\nunobserved spatial locations. However, most existing approaches are based on\nvariogram fittings rather than constructing hierarchical statistical models.\nTherefore, it is challenging to analyze the relationships between functional\nvariables, and uncertainty quantification of the model is not trivial. In this\nmanuscript, we propose a Bayesian framework for spatial function-on-function\nregression. 
However, inference for the proposed model poses computational and\ninferential challenges because the model needs to account for within and\nbetween-curve dependencies. Furthermore, high-dimensional and spatially\ncorrelated parameters can lead to the slow mixing of Markov chain Monte Carlo\nalgorithms. To address these issues, we first utilize a basis transformation\napproach to simplify the covariance and apply projection methods for dimension\nreduction. We also develop a simultaneous band score for the proposed model to\ndetect the significant region in the regression function. We apply the methods\nto simulated and real datasets, including data on particulate matter in Japan\nand mobility data in South Korea. The proposed method is computationally\nefficient and provides accurate estimates and predictions."}, "http://arxiv.org/abs/2401.08224": {"title": "Differentially Private Estimation of CATE in Adaptive Experiment", "link": "http://arxiv.org/abs/2401.08224", "description": "Adaptive experiments are widely adopted to estimate the conditional average\ntreatment effect (CATE) in clinical trials and many other scenarios. While the\nprimary goal of the experiment is to maximize estimation accuracy, due to the\nimperative of social welfare, it is also crucial to provide treatments with\nsuperior outcomes to patients, which is measured by regret in the contextual bandit\nframework. These two objectives often lead to contrasting optimal allocation\nmechanisms. Furthermore, privacy concerns arise in clinical scenarios containing\nsensitive data such as patients' health records. Therefore, it is essential for the\ntreatment allocation mechanism to incorporate robust privacy protection\nmeasures. In this paper, we investigate the tradeoff between loss of social\nwelfare and statistical power in the contextual bandit experiment. We propose\nmatched upper and lower bounds for the multi-objective optimization problem, and\nthen adopt the concept of Pareto optimality to mathematically characterize the\noptimality condition. Furthermore, we propose differentially private algorithms\nwhich still match the lower bound, showing that privacy is \"almost free\".\nAdditionally, we derive the asymptotic normality of the estimator, which is\nessential in statistical inference and hypothesis testing."}, "http://arxiv.org/abs/2401.08303": {"title": "A Bayesian multivariate model with temporal dependence on random partition of areal data", "link": "http://arxiv.org/abs/2401.08303", "description": "More than half of the world's population is exposed to the risk of\nmosquito-borne diseases, which leads to millions of cases and hundreds of\nthousands of deaths every year. Analyzing this type of data is often complex\nand poses several interesting challenges, mainly due to the vast geographic\narea, the peculiar temporal behavior, and the potential correlation between\ninfections. Motivation stems from the analysis of tropical disease data,\nnamely, the number of cases of two arboviruses, dengue and chikungunya,\ntransmitted by the same mosquito, for all the 145 microregions in Southeast\nBrazil from 2018 to 2022. As a contribution to the literature on multivariate\ndisease data, we develop a flexible Bayesian multivariate spatio-temporal model\nwhere temporal dependence is defined for areal clusters. The model features a\nprior distribution for the random partition of areal data that incorporates\nneighboring information, thus encouraging maps with few contiguous clusters and\ndiscouraging clusters with disconnected areas. 
The model also incorporates an\nautoregressive structure and terms related to seasonal patterns into temporal\ncomponents that are disease- and cluster-specific. It also considers a\nmultivariate directed acyclic graph autoregressive structure to accommodate\nspatial and inter-disease dependence, facilitating the interpretation of\nspatial correlation. We explore properties of the model by way of simulation\nstudies and show that our proposal compares well with competing\nalternatives. Finally, we apply the model to the motivating dataset with a\ntwofold goal: clustering areas where the temporal trends of certain diseases are\nsimilar, and exploring the potential existence of temporal and/or spatial\ncorrelation between two diseases transmitted by the same mosquito."}, "http://arxiv.org/abs/2006.13850": {"title": "Global Sensitivity and Domain-Selective Testing for Functional-Valued Responses: An Application to Climate Economy Models", "link": "http://arxiv.org/abs/2006.13850", "description": "Understanding the dynamics and evolution of climate change and associated\nuncertainties is key for designing robust policy actions. Computer models, which\nhave now reached a high level of sophistication and complexity, are key tools in\nthis scientific effort. Model auditing is needed in order to better\nunderstand their results, and to deal with the fact that such models are\nincreasingly opaque with respect to their inner workings. Current techniques\nsuch as Global Sensitivity Analysis (GSA) are limited to dealing with\nmultivariate outputs, stochastic ones, or finite-change inputs. This limits\ntheir applicability to time-varying variables such as future pathways of\ngreenhouse gases. To provide additional semantics in the analysis of a model\nensemble, we extend GSA methodologies to the case of\nstochastic functional outputs with finite-change inputs. To deal with finite-change\ninputs and functional outputs, we build on currently\navailable GSA methodologies, while we handle the stochastic part by\nintroducing a novel, domain-selective inferential technique for sensitivity\nindices. Our method is explored via a simulation study that shows its\nrobustness and efficacy in detecting sensitivity patterns. 
We apply it to\nreal-world data, where it can provide practitioners and\npolicymakers with additional information about the time dynamics of sensitivity\npatterns, as well as information about robustness."}, "http://arxiv.org/abs/2107.00527": {"title": "Distribution-Free Prediction Bands for Multivariate Functional Time Series: an Application to the Italian Gas Market", "link": "http://arxiv.org/abs/2107.00527", "description": "Uncertainty quantification in forecasting represents a topic of great\nimportance in energy trading, as understanding the status of the energy market\nwould enable traders to directly evaluate the impact of their own offers/bids.\nTo this end, we propose a scalable procedure that outputs closed-form\nsimultaneous prediction bands for multivariate functional response variables in\na time series setting, which is able to guarantee performance bounds in terms\nof unconditional coverage and asymptotic exactness, both under some conditions.\nAfter evaluating its performance on synthetic data, the method is used to build\nmultivariate prediction bands for daily demand and offer curves in the Italian\ngas market."}, "http://arxiv.org/abs/2110.13017": {"title": "Nested $\hat R$: Assessing the convergence of Markov chain Monte Carlo when running many short chains", "link": "http://arxiv.org/abs/2110.13017", "description": "Recent developments in Markov chain Monte Carlo (MCMC) algorithms allow us to\nrun thousands of chains in parallel almost as quickly as a single chain, using\nhardware accelerators such as GPUs. While each chain still needs to forget its\ninitial point during a warmup phase, the subsequent sampling phase can be\nshorter than in classical settings, where we run only a few chains. To\ndetermine if the resulting short chains are reliable, we need to assess how\nclose the Markov chains are to their stationary distribution after warmup. The\npotential scale reduction factor $\widehat R$ is a popular convergence\ndiagnostic but unfortunately can require a long sampling phase to work well. We\npresent a nested design to overcome this challenge and a generalization called\nnested $\widehat R$. This new diagnostic works under conditions similar to\n$\widehat R$ and completes the workflow for GPU-friendly samplers. In addition,\nthe proposed nesting provides theoretical insights into the utility of\n$\widehat R$, in both classical and short-chains regimes."}, "http://arxiv.org/abs/2111.10718": {"title": "The R2D2 Prior for Generalized Linear Mixed Models", "link": "http://arxiv.org/abs/2111.10718", "description": "In Bayesian analysis, the selection of a prior distribution is typically done\nby considering each parameter in the model. While this can be convenient, in\nmany scenarios it may be desirable to place a prior on a summary measure of the\nmodel instead. In this work, we propose a prior on the model fit, as measured\nby a Bayesian coefficient of determination ($R^2$), which then induces a prior\non the individual parameters. We achieve this by placing a beta prior on $R^2$\nand then deriving the induced prior on the global variance parameter for\ngeneralized linear mixed models. We derive closed-form expressions in many\nscenarios and present several approximation strategies when an analytic form is\nnot possible and/or to allow for easier computation. In these situations, we\nsuggest approximating the prior by using a generalized beta prime distribution\nand provide a simple default prior construction scheme. 
This approach is quite\nflexible and can be easily implemented in standard Bayesian software. Lastly,\nwe demonstrate the performance of the method on simulated and real-world data,\nwhere the method particularly shines in high-dimensional settings, as well as in\nmodeling random effects."}, "http://arxiv.org/abs/2206.05337": {"title": "Integrating complex selection rules into the latent overlapping group Lasso for constructing coherent prediction models", "link": "http://arxiv.org/abs/2206.05337", "description": "The construction of coherent prediction models holds great importance in\nmedical research as such models enable health researchers to gain deeper\ninsights into disease epidemiology and clinicians to identify patients at\nhigher risk of adverse outcomes. One commonly employed approach to developing\nprediction models is variable selection through penalized regression\ntechniques. Integrating natural variable structures into this process not only\nenhances model interpretability but can also increase the likelihood of\nrecovering the true underlying model and boost prediction accuracy. However, a\nchallenge lies in determining how to effectively integrate potentially complex\nselection dependencies into the penalized regression. In this work, we\ndemonstrate how to represent selection dependencies mathematically, provide\nalgorithms for deriving the complete set of potential models, and offer a\nstructured approach for integrating complex rules into variable selection\nthrough the latent overlapping group Lasso. To illustrate our methodology, we\napplied these techniques to construct a coherent prediction model for major\nbleeding in hypertensive patients recently hospitalized for atrial fibrillation\nand subsequently prescribed oral anticoagulants. In this application, we\naccount for a proxy of anticoagulant adherence and its interaction with dosage\nand the type of oral anticoagulants in addition to drug-drug interactions."}, "http://arxiv.org/abs/2206.08756": {"title": "Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay", "link": "http://arxiv.org/abs/2206.08756", "description": "We study the tensor-on-tensor regression, where the goal is to connect tensor\nresponses to tensor covariates with a low Tucker rank parameter tensor/matrix\nwithout prior knowledge of its intrinsic rank. We propose the Riemannian\ngradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with\nthe challenge of unknown rank by studying the effect of rank\nover-parameterization. We provide the first convergence guarantee for the\ngeneral tensor-on-tensor regression by showing that RGD and RGN respectively\nconverge linearly and quadratically to a statistically optimal estimate in both\nrank correctly-parameterized and over-parameterized settings. Our theory\nreveals an intriguing phenomenon: Riemannian optimization methods naturally\nadapt to over-parameterization without modifications to their implementation.\nWe also prove the statistical-computational gap in scalar-on-tensor regression\nby a direct low-degree polynomial argument. 
Our theory demonstrates a \"blessing\nof statistical-computational gap\" phenomenon: in a wide range of scenarios in\ntensor-on-tensor regression for tensors of order three or higher, the\ncomputationally required sample size matches what is needed by moderate rank\nover-parameterization when considering computationally feasible estimators,\nwhile there are no such benefits in the matrix settings. This shows moderate\nrank over-parameterization is essentially \"cost-free\" in terms of sample size\nin tensor-on-tensor regression of order three or higher. Finally, we conduct\nsimulation studies to show the advantages of our proposed methods and to\ncorroborate our theoretical findings."}, "http://arxiv.org/abs/2209.01396": {"title": "Small Study Regression Discontinuity Designs: Density Inclusive Study Size Metric and Performance", "link": "http://arxiv.org/abs/2209.01396", "description": "Regression discontinuity (RD) designs are popular quasi-experimental studies\nin which treatment assignment depends on whether the value of a running\nvariable exceeds a cutoff. RD designs are increasingly popular in educational\napplications due to the prevalence of cutoff-based interventions. In such\napplications sample sizes can be relatively small or there may be sparsity\naround the cutoff. We propose a metric, density inclusive study size (DISS),\nthat characterizes the size of an RD study better than overall sample size by\nincorporating the density of the running variable. We show the usefulness of\nthis metric in a Monte Carlo simulation study that compares the operating\ncharacteristics of popular nonparametric RD estimation methods in small\nstudies. We also apply the DISS metric and RD estimation methods to school\naccountability data from the state of Indiana."}, "http://arxiv.org/abs/2212.06108": {"title": "Tandem clustering with invariant coordinate selection", "link": "http://arxiv.org/abs/2212.06108", "description": "For multivariate data, tandem clustering is a well-known technique aiming to\nimprove cluster identification through initial dimension reduction.\nNevertheless, the usual approach using principal component analysis (PCA) has\nbeen criticized for focusing solely on inertia so that the first components do\nnot necessarily retain the structure of interest for clustering. To address\nthis limitation, a new tandem clustering approach based on invariant coordinate\nselection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS\nis designed to find structure in the data while providing affine invariant\ncomponents. Certain theoretical results have been previously derived and\nguarantee that under some elliptical mixture models, the group structure can be\nhighlighted on a subset of the first and/or last components. However, ICS has\ngarnered minimal attention within the context of clustering. Two challenges\nassociated with ICS include choosing the pair of scatter matrices and selecting\nthe components to retain. For effective clustering purposes, it is demonstrated\nthat the best scatter pairs consist of one scatter matrix capturing the\nwithin-cluster structure and another capturing the global structure. For the\nformer, local shape or pairwise scatters are of great interest, as is the\nminimum covariance determinant (MCD) estimator based on a carefully chosen\nsubset size that is smaller than usual. The performance of ICS as a dimension\nreduction method is evaluated in terms of preserving the cluster structure in\nthe data. 
In an extensive simulation study and empirical applications with\nbenchmark data sets, various combinations of scatter matrices as well as\ncomponent selection criteria are compared in situations with and without\noutliers. Overall, the new approach of tandem clustering with ICS shows\npromising results and clearly outperforms the PCA-based approach."}, "http://arxiv.org/abs/2212.07687": {"title": "Networks of reinforced stochastic processes: probability of asymptotic polarization and related general results", "link": "http://arxiv.org/abs/2212.07687", "description": "In a network of reinforced stochastic processes, for certain values of the\nparameters, all the agents' inclinations synchronize and converge almost surely\ntoward a certain random variable. The present work aims at clarifying when the\nagents can asymptotically polarize, i.e. when the common limit inclination can\ntake the extreme values, 0 or 1, with probability zero, strictly positive, or\nequal to one. Moreover, we present a suitable technique to estimate this\nprobability that, along with the theoretical results, has been framed in the\nmore general setting of a class of martingales taking values in [0, 1] and\nfollowing a specific dynamics."}, "http://arxiv.org/abs/2303.17478": {"title": "A Bayesian Dirichlet Auto-Regressive Moving Average Model for Forecasting Lead Times", "link": "http://arxiv.org/abs/2303.17478", "description": "Lead time data is compositional data found frequently in the hospitality\nindustry. Hospitality businesses earn fees each day, however these fees cannot\nbe recognized until later. For business purposes, it is important to understand\nand forecast the distribution of future fees for the allocation of resources,\nfor business planning, and for staffing. Motivated by 5 years of daily fees\ndata, we propose a new class of Bayesian time series models, a Bayesian\nDirichlet Auto-Regressive Moving Average (B-DARMA) model for compositional time\nseries, modeling the proportion of future fees that will be recognized in 11\nconsecutive 30 day windows and 1 last consecutive 35 day window. Each day's\ncompositional datum is modeled as Dirichlet distributed given the mean and a\nscale parameter. The mean is modeled with a Vector Autoregressive Moving\nAverage process after transforming with an additive log ratio link function and\ndepends on previous compositional data, previous compositional parameters and\ndaily covariates. The B-DARMA model offers solutions to data analyses of large\ncompositional vectors and short or long time series, offers efficiency gains\nthrough choice of priors, provides interpretable parameters for inference, and\nmakes reasonable forecasts."}, "http://arxiv.org/abs/2306.17361": {"title": "iSCAN: Identifying Causal Mechanism Shifts among Nonlinear Additive Noise Models", "link": "http://arxiv.org/abs/2306.17361", "description": "Structural causal models (SCMs) are widely used in various disciplines to\nrepresent causal relationships among variables in complex systems.\nUnfortunately, the underlying causal structure is often unknown, and estimating\nit from data remains a challenging task. In many situations, however, the end\ngoal is to localize the changes (shifts) in the causal mechanisms between\nrelated datasets instead of learning the full causal structure of the\nindividual datasets. Some applications include root cause analysis, analyzing\ngene regulatory network structure changes between healthy and cancerous\nindividuals, or explaining distribution shifts. 
This paper focuses on\nidentifying the causal mechanism shifts in two or more related datasets over\nthe same set of variables -- without estimating the entire DAG structure of\neach SCM. Prior work under this setting assumed linear models with Gaussian\nnoises; instead, in this work we assume that each SCM belongs to the more\ngeneral class of nonlinear additive noise models (ANMs). A key technical\ncontribution of this work is to show that the Jacobian of the score function\nfor the mixture distribution allows for the identification of shifts under\ngeneral non-parametric functional mechanisms. Once the shifted variables are\nidentified, we leverage recent work to estimate the structural differences, if\nany, for the shifted variables. Experiments on synthetic and real-world data\nare provided to showcase the applicability of this approach. Code implementing\nthe proposed method is open-source and publicly available at\nhttps://github.com/kevinsbello/iSCAN."}, "http://arxiv.org/abs/2307.10841": {"title": "A criterion and incremental design construction for simultaneous kriging predictions", "link": "http://arxiv.org/abs/2307.10841", "description": "In this paper, we further investigate the problem of selecting a set of\ndesign points for universal kriging, which is a widely used technique for\nspatial data analysis. Our goal is to select the design points in order to make\nsimultaneous predictions of the random variable of interest at a finite number\nof unsampled locations with maximum precision. Specifically, we consider as\nresponse a correlated random field given by a linear model with an unknown\nparameter vector and a spatial error correlation structure. We propose a new\ndesign criterion that aims at simultaneously minimizing the variation of the\nprediction errors at various points. We also present various efficient\ntechniques for incrementally building designs for that criterion scaling well\nfor high dimensions. Thus the method is particularly suitable for big data\napplications in areas of spatial data analysis such as mining, hydrogeology,\nnatural resource monitoring, and environmental sciences or equivalently for any\ncomputer simulation experiments. We have demonstrated the effectiveness of the\nproposed designs through two illustrative examples: one by simulation and\nanother based on real data from Upper Austria."}, "http://arxiv.org/abs/2307.15404": {"title": "Information-based Preprocessing of PLC Data for Automatic Behavior Modeling", "link": "http://arxiv.org/abs/2307.15404", "description": "Cyber-physical systems (CPS) offer immense optimization potential for\nmanufacturing processes through the availability of multivariate time series\ndata of actors and sensors. Based on automated analysis software, the\ndeployment of adaptive and responsive measures is possible for time series\ndata. Due to the complex and dynamic nature of modern manufacturing, analysis\nand modeling often cannot be entirely automated. Even machine- or deep learning\napproaches often depend on a priori expert knowledge and labelling. In this\npaper, an information-based data preprocessing approach is proposed. By\napplying statistical methods including variance and correlation analysis, an\napproximation of the sampling rate in event-based systems and the utilization\nof spectral analysis, knowledge about the underlying manufacturing processes\ncan be gained prior to modeling. 
The paper presents, how statistical analysis\nenables the pruning of a dataset's least important features and how the\nsampling rate approximation approach sets the base for further data analysis\nand modeling. The data's underlying periodicity, originating from the cyclic\nnature of an automated manufacturing process, will be detected by utilizing the\nfast Fourier transform. This information-based preprocessing method will then\nbe validated for process time series data of cyber-physical systems'\nprogrammable logic controllers (PLC)."}, "http://arxiv.org/abs/2308.05205": {"title": "Dynamic survival analysis: modelling the hazard function via ordinary differential equations", "link": "http://arxiv.org/abs/2308.05205", "description": "The hazard function represents one of the main quantities of interest in the\nanalysis of survival data. We propose a general approach for parametrically\nmodelling the dynamics of the hazard function using systems of autonomous\nordinary differential equations (ODEs). This modelling approach can be used to\nprovide qualitative and quantitative analyses of the evolution of the hazard\nfunction over time. Our proposal capitalises on the extensive literature of\nODEs which, in particular, allow for establishing basic rules or laws on the\ndynamics of the hazard function via the use of autonomous ODEs. We show how to\nimplement the proposed modelling framework in cases where there is an analytic\nsolution to the system of ODEs or where an ODE solver is required to obtain a\nnumerical solution. We focus on the use of a Bayesian modelling approach, but\nthe proposed methodology can also be coupled with maximum likelihood\nestimation. A simulation study is presented to illustrate the performance of\nthese models and the interplay of sample size and censoring. Two case studies\nusing real data are presented to illustrate the use of the proposed approach\nand to highlight the interpretability of the corresponding models. We conclude\nwith a discussion on potential extensions of our work and strategies to include\ncovariates into our framework."}, "http://arxiv.org/abs/2309.14658": {"title": "Improvements on Scalable Stochastic Bayesian Inference Methods for Multivariate Hawkes Process", "link": "http://arxiv.org/abs/2309.14658", "description": "Multivariate Hawkes Processes (MHPs) are a class of point processes that can\naccount for complex temporal dynamics among event sequences. In this work, we\nstudy the accuracy and computational efficiency of three classes of algorithms\nwhich, while widely used in the context of Bayesian inference, have rarely been\napplied in the context of MHPs: stochastic gradient expectation-maximization,\nstochastic gradient variational inference and stochastic gradient Langevin\nMonte Carlo. An important contribution of this paper is a novel approximation\nto the likelihood function that allows us to retain the computational\nadvantages associated with conjugate settings while reducing approximation\nerrors associated with the boundary effects. 
The comparisons are based on\nvarious simulated scenarios as well as an application to the study of the risk\ndynamics in the Standard & Poor's 500 intraday index prices among its 11\nsectors."}, "http://arxiv.org/abs/2401.08606": {"title": "Forking paths in financial economics", "link": "http://arxiv.org/abs/2401.08606", "description": "We argue that spanning large numbers of degrees of freedom in empirical\nanalysis allows better characterizations of effects and thus improves the\ntrustworthiness of conclusions. Our ideas are illustrated in three studies:\nequity premium prediction, asset pricing anomalies and risk premia estimation.\nIn the first, we find that each additional degree of freedom in the protocol\nexpands the average range of $t$-statistics by at least 30%. In the second, we\nshow that resorting to forking paths instead of bootstrapping in multiple\ntesting raises the bar of significance for anomalies: at the 5% confidence\nlevel, the threshold for bootstrapped statistics is 4.5, whereas with paths, it\nis at least 8.2, a bar much higher than those currently used in the literature.\nIn our third application, we reveal the importance of particular steps in the\nestimation of premia. In addition, we use paths to corroborate prior findings\nin the three topics. We document heterogeneity in our ability to replicate\nprior studies: some conclusions seem robust, others do not align with the paths\nwe were able to generate."}, "http://arxiv.org/abs/2401.08626": {"title": "Validation and Comparison of Non-Stationary Cognitive Models: A Diffusion Model Application", "link": "http://arxiv.org/abs/2401.08626", "description": "Cognitive processes undergo various fluctuations and transient states across\ndifferent temporal scales. Superstatistics are emerging as a flexible framework\nfor incorporating such non-stationary dynamics into existing cognitive model\nclasses. In this work, we provide the first experimental validation of\nsuperstatistics and formal comparison of four non-stationary diffusion decision\nmodels in a specifically designed perceptual decision-making task. Task\ndifficulty and speed-accuracy trade-off were systematically manipulated to\ninduce expected changes in model parameters. To validate our models, we assess\nwhether the inferred parameter trajectories align with the patterns and\nsequences of the experimental manipulations. To address computational\nchallenges, we present novel deep learning techniques for amortized Bayesian\nestimation and comparison of models with time-varying parameters. Our findings\nindicate that transition models incorporating both gradual and abrupt parameter\nshifts provide the best fit to the empirical data. Moreover, we find that the\ninferred parameter trajectories closely mirror the sequence of experimental\nmanipulations. Posterior re-simulations further underscore the ability of the\nmodels to faithfully reproduce critical data patterns. Accordingly, our results\nsuggest that the inferred non-stationary dynamics may reflect actual changes in\nthe targeted psychological constructs. We argue that our initial experimental\nvalidation paves the way for the widespread application of superstatistics in\ncognitive modeling and beyond."}, "http://arxiv.org/abs/2401.08702": {"title": "Do We Really Even Need Data?", "link": "http://arxiv.org/abs/2401.08702", "description": "As artificial intelligence and machine learning tools become more accessible,\nand scientists face new obstacles to data collection (e.g. 
rising costs,\ndeclining survey response rates), researchers increasingly use predictions from\npre-trained algorithms as outcome variables. Though appealing for financial and\nlogistical reasons, using standard tools for inference can misrepresent the\nassociation between independent variables and the outcome of interest when the\ntrue, unobserved outcome is replaced by a predicted value. In this paper, we\ncharacterize the statistical challenges inherent to this so-called\n``post-prediction inference'' problem and elucidate three potential sources of\nerror: (i) the relationship between predicted outcomes and their true,\nunobserved counterparts, (ii) robustness of the machine learning model to\nresampling or uncertainty about the training data, and (iii) appropriately\npropagating not just bias but also uncertainty from predictions into the\nultimate inference procedure. We also contrast the framework for\npost-prediction inference with classical work spanning several related fields,\nincluding survey sampling, missing data, and semi-supervised learning. This\ncontrast elucidates the role of design in both classical and modern inference\nproblems."}, "http://arxiv.org/abs/2401.08875": {"title": "DCRMTA: Unbiased Causal Representation for Multi-touch Attribution", "link": "http://arxiv.org/abs/2401.08875", "description": "Multi-touch attribution (MTA) currently plays a pivotal role in achieving a\nfair estimation of the contributions of each advertising touchpoint towards\nconversion behavior, deeply influencing budget allocation and advertising\nrecommendation. Traditional multi-touch attribution methods initially build a\nconversion prediction model, aiming to learn the inherent relationship\nbetween touchpoint sequences and user purchasing behavior through historical\ndata. Based on this, counterfactual touchpoint sequences are constructed from\nthe original sequence subset, and conversions are estimated using the\nprediction model, thus calculating advertising contributions. A covert\nassumption of these methods is the unbiased nature of conversion prediction\nmodels. However, due to confounding factors arising from user\npreferences and internet recommendation mechanisms such as homogenization of\nad recommendations resulting from past shopping records, bias can easily occur\nin conversion prediction models trained on observational data. This paper\nredefines the causal effect of user features on conversions and proposes a\nnovel end-to-end approach, Deep Causal Representation for MTA (DCRMTA). Our\nmodel, while eliminating confounding variables, extracts user features with causal\nrelations to conversions. Furthermore, extensive experiments on\nboth synthetic and real-world Criteo data demonstrate DCRMTA's superior\nperformance in conversion prediction across varying data distributions, while\nalso effectively attributing value across different advertising channels."}, "http://arxiv.org/abs/2401.08941": {"title": "A Powerful and Precise Feature-level Filter using Group Knockoffs", "link": "http://arxiv.org/abs/2401.08941", "description": "Selecting important features that have substantial effects on the response\nwith provable type-I error rate control is a fundamental concern in statistics,\nwith wide-ranging practical applications. 
Existing knockoff filters, although\nshown to provide theoretical guarantee on false discovery rate (FDR) control,\noften struggle to strike a balance between high power and precision in\npinpointing important features when there exist large groups of strongly\ncorrelated features. To address this challenge, we develop a new filter using\ngroup knockoffs to achieve both powerful and precise selection of important\nfeatures. Via experiments of simulated data and analysis of a real Alzheimer's\ndisease genetic dataset, it is found that the proposed filter can not only\ncontrol the proportion of false discoveries but also identify important\nfeatures with comparable power and greater precision than the existing group\nknockoffs filter."}, "http://arxiv.org/abs/2401.09379": {"title": "Merging uncertainty sets via majority vote", "link": "http://arxiv.org/abs/2401.09379", "description": "Given $K$ uncertainty sets that are arbitrarily dependent -- for example,\nconfidence intervals for an unknown parameter obtained with $K$ different\nestimators, or prediction sets obtained via conformal prediction based on $K$\ndifferent algorithms on shared data -- we address the question of how to\nefficiently combine them in a black-box manner to produce a single uncertainty\nset. We present a simple and broadly applicable majority vote procedure that\nproduces a merged set with nearly the same error guarantee as the input sets.\nWe then extend this core idea in a few ways: we show that weighted averaging\ncan be a powerful way to incorporate prior information, and a simple\nrandomization trick produces strictly smaller merged sets without altering the\ncoverage guarantee. Along the way, we prove an intriguing result that R\\\"uger's\ncombination rules (eg: twice the median of dependent p-values is a p-value) can\nbe strictly improved with randomization. When deployed in online settings, we\nshow how the exponential weighted majority algorithm can be employed in order\nto learn a good weighting over time. We then combine this method with adaptive\nconformal inference to deliver a simple conformal online model aggregation\n(COMA) method for nonexchangeable data."}, "http://arxiv.org/abs/2401.09381": {"title": "Modelling clusters in network time series with an application to presidential elections in the USA", "link": "http://arxiv.org/abs/2401.09381", "description": "Network time series are becoming increasingly relevant in the study of\ndynamic processes characterised by a known or inferred underlying network\nstructure. Generalised Network Autoregressive (GNAR) models provide a\nparsimonious framework for exploiting the underlying network, even in the\nhigh-dimensional setting. We extend the GNAR framework by introducing the\n$\\textit{community}$-$\\alpha$ GNAR model that exploits prior knowledge and/or\nexogenous variables for identifying and modelling dynamic interactions across\ncommunities in the underlying network. We further analyse the dynamics of\n$\\textit{Red, Blue}$ and $\\textit{Swing}$ states throughout presidential\nelections in the USA. Our analysis shows that dynamics differ among the\nstate-wise clusters."}, "http://arxiv.org/abs/2401.09401": {"title": "PERMUTOOLS: A MATLAB Package for Multivariate Permutation Testing", "link": "http://arxiv.org/abs/2401.09401", "description": "Statistical hypothesis testing and effect size measurement are routine parts\nof quantitative research. 
Advancements in computer processing power have\ngreatly improved the capability of statistical inference through the\navailability of resampling methods. However, many of the statistical practices\nused today are based on traditional, parametric methods that rely on\nassumptions about the underlying population. These assumptions may not always\nbe valid, leading to inaccurate results and misleading interpretations.\nPermutation testing, on the other hand, generates the sampling distribution\nempirically by permuting the observed data, providing distribution-free\nhypothesis testing. Furthermore, this approach lends itself to a powerful\nmethod for multiple comparison correction - known as max correction - which is\nless prone to type II errors than conventional correction methods. Parametric\nmethods have also traditionally been utilized for estimating the confidence\ninterval of various test statistics and effect size measures. However, these\ntoo can be estimated empirically using permutation or bootstrapping techniques.\nWhilst resampling methods are generally considered preferable, many popular\nprogramming languages and statistical software packages lack efficient\nimplementations. Here, we introduce PERMUTOOLS, a MATLAB package for\nmultivariate permutation testing and effect size measurement."}, "http://arxiv.org/abs/2008.03073": {"title": "From the power law to extreme value mixture distributions", "link": "http://arxiv.org/abs/2008.03073", "description": "The power law is useful in describing count phenomena such as network degrees\nand word frequencies. With a single parameter, it captures the main feature\nthat the frequencies are linear on the log-log scale. Nevertheless, there have\nbeen criticisms of the power law, for example that a threshold needs to be\npre-selected without its uncertainty quantified, that the power law is simply\ninadequate, and that subsequent hypothesis tests are required to determine\nwhether the data could have come from the power law. We propose a modelling\nframework that combines two different generalisations of the power law, namely\nthe generalised Pareto distribution and the Zipf-polylog distribution, to\nresolve these issues. The proposed mixture distributions are shown to fit the\ndata well and quantify the threshold uncertainty in a natural way. A model\nselection step embedded in the Bayesian inference algorithm further answers the\nquestion whether the power law is adequate."}, "http://arxiv.org/abs/2211.11884": {"title": "Parameter Estimation in Nonlinear Multivariate Stochastic Differential Equations Based on Splitting Schemes", "link": "http://arxiv.org/abs/2211.11884", "description": "Surprisingly, general estimators for nonlinear continuous time models based\non stochastic differential equations are yet lacking. Most applications still\nuse the Euler-Maruyama discretization, despite many proofs of its bias. More\nsophisticated methods, such as Kessler's Gaussian approximation, Ozaki's Local\nLinearization, A\\"it-Sahalia's Hermite expansions, or MCMC methods, lack a\nstraightforward implementation, do not scale well with increasing model\ndimension or can be numerically unstable. We propose two efficient and\neasy-to-implement likelihood-based estimators based on the Lie-Trotter (LT) and\nthe Strang (S) splitting schemes. We prove that S has an $L^p$ convergence rate of\norder 1, a property already known for LT. 
We show that the estimators are\nconsistent and asymptotically efficient under the less restrictive one-sided\nLipschitz assumption. A numerical study on the 3-dimensional stochastic Lorenz\nsystem complements our theoretical findings. The simulation shows that the S\nestimator performs the best when measured on precision and computational speed\ncompared to the state-of-the-art."}, "http://arxiv.org/abs/2212.00703": {"title": "Data Integration Via Analysis of Subspaces (DIVAS)", "link": "http://arxiv.org/abs/2212.00703", "description": "Modern data collection in many data paradigms, including bioinformatics,\noften incorporates multiple traits derived from different data types (i.e.\nplatforms). We call this data multi-block, multi-view, or multi-omics data. The\nemergent field of data integration develops and applies new methods for\nstudying multi-block data and identifying how different data types relate and\ndiffer. One major frontier in contemporary data integration research is\nmethodology that can identify partially-shared structure between\nsub-collections of data types. This work presents a new approach: Data\nIntegration Via Analysis of Subspaces (DIVAS). DIVAS combines new insights in\nangular subspace perturbation theory with recent developments in matrix signal\nprocessing and convex-concave optimization into one algorithm for exploring\npartially-shared structure. Based on principal angles between subspaces, DIVAS\nprovides built-in inference on the results of the analysis, and is effective\neven in high-dimension-low-sample-size (HDLSS) situations."}, "http://arxiv.org/abs/2301.05636": {"title": "Improving Power by Conditioning on Less in Post-selection Inference for Changepoints", "link": "http://arxiv.org/abs/2301.05636", "description": "Post-selection inference has recently been proposed as a way of quantifying\nuncertainty about detected changepoints. The idea is to run a changepoint\ndetection algorithm, and then re-use the same data to perform a test for a\nchange near each of the detected changes. By defining the p-value for the test\nappropriately, so that it is conditional on the information used to choose the\ntest, this approach will produce valid p-values. We show how to improve the\npower of these procedures by conditioning on less information. This gives rise\nto an ideal selective p-value that is intractable but can be approximated by\nMonte Carlo. We show that for any Monte Carlo sample size, this procedure\nproduces valid p-values, and empirically that noticeable increase in power is\npossible with only very modest Monte Carlo sample sizes. Our procedure is easy\nto implement given existing post-selection inference methods, as we just need\nto generate perturbations of the data set and re-apply the post-selection\nmethod to each of these. On genomic data consisting of human GC content, our\nprocedure increases the number of significant changepoints that are detected\nfrom e.g. 17 to 27, when compared to existing methods."}, "http://arxiv.org/abs/2309.03969": {"title": "Estimating the prevalance of indirect effects and other spillovers", "link": "http://arxiv.org/abs/2309.03969", "description": "In settings where interference between units is possible, we define the\nprevalence of indirect effects to be the number of units who are affected by\nthe treatment of others. 
This quantity does not fully identify an indirect\neffect, but may be used to show whether such effects are widely prevalent.\nGiven a randomized experiment with binary-valued outcomes, methods are\npresented for conservative point estimation and one-sided interval estimation.\nNo assumptions beyond randomization of treatment are required, allowing for\nusage in settings where models or assumptions on interference might be\nquestionable. To show asymptotic coverage of our intervals in settings not\ncovered by existing results, we provide a central limit theorem that combines\nlocal dependence and sampling without replacement. Consistency and minimax\nproperties of the point estimator are shown as well. The approach is\ndemonstrated on an experiment in which students were treated for a highly\ntransmissible parasitic infection, for which we find that a significant\nfraction of students were affected by the treatment of schools other than their\nown."}, "http://arxiv.org/abs/2401.09559": {"title": "Asymptotic Online FWER Control for Dependent Test Statistics", "link": "http://arxiv.org/abs/2401.09559", "description": "In online multiple testing, an a priori unknown number of hypotheses are\ntested sequentially, i.e. at each time point a test decision for the current\nhypothesis has to be made using only the data available so far. Although many\npowerful test procedures have been developed for online error control in recent\nyears, most of them are designed solely for independent or at most locally\ndependent test statistics. In this work, we provide a new framework for\nderiving online multiple test procedures which ensure asymptotic (with\nrespect to the sample size) control of the familywise error rate (FWER),\nregardless of the dependence structure between test statistics. In this\ncontext, we give a few concrete examples of such test procedures and discuss\ntheir properties. Furthermore, we conduct a simulation study in which the type\nI error control of these test procedures is also confirmed for a finite sample\nsize and a gain in power is indicated."}, "http://arxiv.org/abs/2401.09641": {"title": "Functional Linear Non-Gaussian Acyclic Model for Causal Discovery", "link": "http://arxiv.org/abs/2401.09641", "description": "In causal discovery, non-Gaussianity has been used to characterize the\ncomplete configuration of a Linear Non-Gaussian Acyclic Model (LiNGAM),\nencompassing both the causal ordering of variables and their respective\nconnection strengths. However, LiNGAM can only deal with the finite-dimensional\ncase. To expand this concept, we extend the notion of variables to encompass\nvectors and even functions, leading to the Functional Linear Non-Gaussian\nAcyclic Model (Func-LiNGAM). Our motivation stems from the desire to identify\ncausal relationships in brain-effective connectivity tasks involving, for\nexample, fMRI and EEG datasets. We demonstrate why the original LiNGAM fails to\nhandle these inherently infinite-dimensional datasets and explain the\napplicability of functional data analysis from both empirical and theoretical\nperspectives. We establish theoretical guarantees of the identifiability of\nthe causal relationship among non-Gaussian random vectors and even random\nfunctions in infinite-dimensional Hilbert spaces. To address the issue of\nsparsity in discrete time points within intrinsic infinite-dimensional\nfunctional data, we propose optimizing the coordinates of the vectors using\nfunctional principal component analysis. 
Experimental results on synthetic data\nverify the ability of the proposed framework to identify causal relationships\namong multivariate functions using the observed samples. For real data, we\nfocus on analyzing the brain connectivity patterns derived from fMRI data."}, "http://arxiv.org/abs/2401.09696": {"title": "Rejection Sampling with Vertical Weighted Strips", "link": "http://arxiv.org/abs/2401.09696", "description": "A number of distributions that arise in statistical applications can be\nexpressed in the form of a weighted density: the product of a base density and\na nonnegative weight function. Generating variates from such a distribution may\nbe nontrivial and can involve an intractable normalizing constant. Rejection\nsampling may be used to generate exact draws, but requires formulation of a\nsuitable proposal distribution. To be practically useful, the proposal must\nboth be convenient to sample from and not reject candidate draws too\nfrequently. A well-known approach to design a proposal involves decomposing the\ntarget density into a finite mixture, whose components may correspond to a\npartition of the support. This work considers such a construction that focuses\non majorization of the weight function. This approach may be applicable when\nassumptions for adaptive rejection sampling and related algorithms are not met.\nAn upper bound for the rejection probability based on this construction can be\nexpressed to evaluate the efficiency of the proposal before sampling. A method\nto partition the support is considered where regions are bifurcated based on\ntheir contribution to the bound. Examples based on the von Mises Fisher\ndistribution and Gaussian Process regression are provided to illustrate the\nmethod."}, "http://arxiv.org/abs/2401.09715": {"title": "Fast Variational Inference of Latent Space Models for Dynamic Networks Using Bayesian P-Splines", "link": "http://arxiv.org/abs/2401.09715", "description": "Latent space models (LSMs) are often used to analyze dynamic (time-varying)\nnetworks that evolve in continuous time. Existing approaches to Bayesian\ninference for these models rely on Markov chain Monte Carlo algorithms, which\ncannot handle modern large-scale networks. To overcome this limitation, we\nintroduce a new prior for continuous-time LSMs based on Bayesian P-splines that\nallows the posterior to adapt to the dimension of the latent space and the\ntemporal variation in each latent position. We propose a stochastic variational\ninference algorithm to estimate the model parameters. We use stochastic\noptimization to subsample both dyads and observed time points to design a fast\nalgorithm that is linear in the number of edges in the dynamic network.\nFurthermore, we establish non-asymptotic error bounds for point estimates\nderived from the variational posterior. To our knowledge, this is the first\nsuch result for Bayesian estimators of continuous-time LSMs. Lastly, we use the\nmethod to analyze a large data set of international conflicts consisting of\n4,456,095 relations from 2018 to 2022."}, "http://arxiv.org/abs/2401.09719": {"title": "Kernel-based multi-marker tests of association based on the accelerated failure time model", "link": "http://arxiv.org/abs/2401.09719", "description": "Kernel-based multi-marker tests for survival outcomes use primarily the Cox\nmodel to adjust for covariates. The proportional hazards assumption made by the\nCox model could be unrealistic, especially in the long-term follow-up. 
We\ndevelop a suite of novel multi-marker survival tests for genetic association\nbased on the accelerated failure time model, which is a popular alternative to\nthe Cox model due to its direct physical interpretation. The tests are based on\nthe asymptotic distributions of their test statistics and are thus\ncomputationally efficient. The association tests can account for the\nheterogeneity of genetic effects across sub-populations/individuals to increase\nthe power. All the new tests can deal with competing risks and left truncation.\nMoreover, we develop small-sample corrections to the tests to improve their\naccuracy under small samples. Extensive numerical experiments show that the new\ntests perform very well in various scenarios. An application to a genetic\ndataset of Alzheimer's disease illustrates the tests' practical utility."}, "http://arxiv.org/abs/2401.09816": {"title": "Jackknife empirical likelihood ratio test for testing the equality of semivariance", "link": "http://arxiv.org/abs/2401.09816", "description": "Semivariance is a measure of the dispersion of all observations that fall\nabove the mean or target value of a random variable and it plays an important\nrole in life-length, actuarial and income studies. In this paper, we develop a\nnew non-parametric test for equality of upper semi-variance. We use the\nU-statistic theory to derive the test statistic and then study the asymptotic\nproperties of the test statistic. We also develop a jackknife empirical\nlikelihood (JEL) ratio test for equality of upper Semivariance. Extensive Monte\nCarlo simulation studies are carried out to validate the performance of the\nproposed JEL-based test. We illustrate the test procedure using real data."}, "http://arxiv.org/abs/2401.09994": {"title": "Bayesian modeling of spatial ordinal data from health surveys", "link": "http://arxiv.org/abs/2401.09994", "description": "Health surveys allow exploring health indicators that are of great value from\na public health point of view and that cannot normally be studied from regular\nhealth registries. These indicators are usually coded as ordinal variables and\nmay depend on covariates associated with individuals. In this paper, we propose\na Bayesian individual-level model for small-area estimation of survey-based\nhealth indicators. A categorical likelihood is used at the first level of the\nmodel hierarchy to describe the ordinal data, and spatial dependence among\nsmall areas is taken into account by using a conditional autoregressive (CAR)\ndistribution. Post-stratification of the results of the proposed\nindividual-level model allows extrapolating the results to any administrative\nareal division, even for small areas. We apply this methodology to the analysis\nof the Health Survey of the Region of Valencia (Spain) of 2016 to describe the\ngeographical distribution of a self-perceived health indicator of interest in\nthis region."}, "http://arxiv.org/abs/2401.10010": {"title": "A global kernel estimator for partially linear varying coefficient additive hazards models", "link": "http://arxiv.org/abs/2401.10010", "description": "In biomedical studies, we are often interested in the association between\ndifferent types of covariates and the times to disease events. Because the\nrelationship between the covariates and event times is often complex, standard\nsurvival models that assume a linear covariate effect are inadequate. 
A\nflexible class of models for capturing complex interaction effects among types\nof covariates is the varying coefficient models, where the effects of a type of\ncovariates can be modified by another type of covariates. In this paper, we\nstudy kernel-based estimation methods for varying coefficient additive hazards\nmodels. Unlike many existing kernel-based methods that use a local neighborhood\nof subjects for the estimation of the varying coefficient function, we propose\na novel global approach that is generally more efficient. We establish\ntheoretical properties of the proposed estimators and demonstrate their\nsuperior performance compared with existing local methods through large-scale\nsimulation studies. To illustrate the proposed method, we provide an\napplication to a motivating cancer genomic study."}, "http://arxiv.org/abs/2401.10057": {"title": "A method for characterizing disease emergence curves from paired pathogen detection and serology data", "link": "http://arxiv.org/abs/2401.10057", "description": "Wildlife disease surveillance programs and research studies track infection\nand identify risk factors for wild populations, humans, and agriculture. Often,\nseveral types of samples are collected from individuals to provide more\ncomplete information about an animal's infection history. Methods that jointly\nanalyze multiple data streams to study disease emergence and drivers of\ninfection via epidemiological process models remain underdeveloped.\nJoint-analysis methods can more thoroughly analyze all available data, more\nprecisely quantifying epidemic processes, outbreak status, and risks. We\ncontribute a paired data modeling approach that analyzes multiple samples from\nindividuals. We use \"characterization maps\" to link paired data to\nepidemiological processes through a hierarchical statistical observation model.\nOur approach can provide both Bayesian and frequentist estimates of\nepidemiological parameters and state. We motivate our approach through the need\nto use paired pathogen and antibody detection tests to estimate parameters and\ninfection trajectories for the widely applicable susceptible, infectious,\nrecovered (SIR) model. We contribute general formulas to link characterization\nmaps to arbitrary process models and datasets and an extended SIR model that\nbetter accommodates paired data. We find via simulation that paired data can\nmore efficiently estimate SIR parameters than unpaired data, requiring samples\nfrom 5-10 times fewer individuals. We then study SARS-CoV-2 in wild\nWhite-tailed deer (Odocoileus virginianus) from three counties in the United\nStates. Estimates for average infectious times corroborate captive animal\nstudies. Our methods use general statistical theory to let applications extend\nbeyond the SIR model we consider, and to more complicated examples of paired\ndata."}, "http://arxiv.org/abs/2401.10124": {"title": "Lower Ricci Curvature for Efficient Community Detection", "link": "http://arxiv.org/abs/2401.10124", "description": "This study introduces the Lower Ricci Curvature (LRC), a novel, scalable, and\nscale-free discrete curvature designed to enhance community detection in\nnetworks. Addressing the computational challenges posed by existing\ncurvature-based methods, LRC offers a streamlined approach with linear\ncomputational complexity, making it well-suited for large-scale network\nanalysis. We further develop an LRC-based preprocessing method that effectively\naugments popular community detection algorithms. 
Through comprehensive\nsimulations and applications on real-world datasets, including the NCAA\nfootball league network, the DBLP collaboration network, the Amazon product\nco-purchasing network, and the YouTube social network, we demonstrate the\nefficacy of our method in significantly improving the performance of various\ncommunity detection algorithms."}, "http://arxiv.org/abs/2401.10180": {"title": "Generalized Decomposition Priors on R2", "link": "http://arxiv.org/abs/2401.10180", "description": "The adoption of continuous shrinkage priors in high-dimensional linear models\nhas gained momentum, driven by their theoretical and practical advantages. One\nof these shrinkage priors is the R2D2 prior, which comes with intuitive\nhyperparameters and well understood theoretical properties. The core idea is to\nspecify a prior on the percentage of explained variance $R^2$ and to conduct a\nDirichlet decomposition to distribute the explained variance among all the\nregression terms of the model. Due to the properties of the Dirichlet\ndistribution, the competition among variance components tends to gravitate\ntowards negative dependence structures, fully determined by the individual\ncomponents' means. Yet, in reality, specific coefficients or groups may compete\ndifferently for the total variability than the Dirichlet would allow for. In\nthis work we address this limitation by proposing a generalization of the R2D2\nprior, which we term the Generalized Decomposition R2 (GDR2) prior.\n\nOur new prior provides great flexibility in expressing dependency structures\nas well as enhanced shrinkage properties. Specifically, we explore the\ncapabilities of variance decomposition via logistic normal distributions.\nThrough extensive simulations and real-world case studies, we demonstrate that\nGDR2 priors yield strongly improved out-of-sample predictive performance and\nparameter recovery compared to R2D2 priors with similar hyper-parameter\nchoices."}, "http://arxiv.org/abs/2401.10193": {"title": "tinyVAST: R package with an expressive interface to specify lagged and simultaneous effects in multivariate spatio-temporal models", "link": "http://arxiv.org/abs/2401.10193", "description": "Multivariate spatio-temporal models are widely applicable, but specifying\ntheir structure is complicated and may inhibit wider use. We introduce the R\npackage tinyVAST from two viewpoints: the software user and the statistician.\nFrom the user viewpoint, tinyVAST adapts a widely used formula interface to\nspecify generalized additive models, and combines this with arguments to\nspecify spatial and spatio-temporal interactions among variables. These\ninteractions are specified using arrow notation (from structural equation\nmodels), or an extended arrow-and-lag notation that allows simultaneous,\nlagged, and recursive dependencies among variables over time. The user also\nspecifies a spatial domain for areal (gridded), continuous (point-count), or\nstream-network data. From the statistician viewpoint, tinyVAST constructs\nsparse precision matrices representing multivariate spatio-temporal variation,\nand parameters are estimated by specifying a generalized linear mixed model\n(GLMM). This expressive interface encompasses vector autoregressive, empirical\northogonal functions, spatial factor analysis, and ARIMA models. To\ndemonstrate, we fit to data from two survey platforms sampling corals, sponges,\nrockfishes, and flatfishes in the Gulf of Alaska and Aleutian Islands. 
We then\ncompare eight alternative model structures using different assumptions about\nhabitat drivers and survey detectability. Model selection suggests that\ntowed-camera and bottom trawl gears have spatial variation in detectability but\nsample the same underlying density of flatfishes and rockfishes, and that\nrockfishes are positively associated with sponges while flatfishes are\nnegatively associated with corals. We conclude that tinyVAST can be used to\ntest complicated dependencies representing alternative structural assumptions\nfor research and real-world policy evaluation."}, "http://arxiv.org/abs/2401.10196": {"title": "Functional Conditional Gaussian Graphical Models", "link": "http://arxiv.org/abs/2401.10196", "description": "Functional data has become a commonly encountered data type. In this paper,\nwe contribute to the literature on functional graphical modelling by extending\nthe notion of conditional Gaussian Graphical models and proposing a\ndouble-penalized estimator by which to recover the edge-set of the\ncorresponding graph. Penalty parameters play a crucial role in determining the\nprecision matrices for the response variables and the regression matrices. The\nperformance and model selection process in the proposed framework are\ninvestigated using information criteria. Moreover, we propose a novel version\nof the Kullback-Leibler cross-validation designed for conditional joint\nGaussian Graphical Models. The evaluation of model performance is done in terms\nof Kullback-Leibler divergence and graph recovery power."}, "http://arxiv.org/abs/2206.02250": {"title": "Frequency Domain Statistical Inference for High-Dimensional Time Series", "link": "http://arxiv.org/abs/2206.02250", "description": "Analyzing time series in the frequency domain enables the development of\npowerful tools for investigating the second-order characteristics of\nmultivariate processes. Parameters like the spectral density matrix and its\ninverse, the coherence or the partial coherence, encode comprehensively the\ncomplex linear relations between the component processes of the multivariate\nsystem. In this paper, we develop inference procedures for such parameters in a\nhigh-dimensional, time series setup. Towards this goal, we first focus on the\nderivation of consistent estimators of the coherence and, more importantly, of\nthe partial coherence which possess manageable limiting distributions that are\nsuitable for testing purposes. Statistical tests of the hypothesis that the\nmaximum over frequencies of the coherence, respectively, of the partial\ncoherence, do not exceed a prespecified threshold value are developed. Our\napproach allows for testing hypotheses for individual coherences and/or partial\ncoherences as well as for multiple testing of large sets of such parameters. In\nthe latter case, a consistent procedure to control the false discovery rate is\ndeveloped. The finite sample performance of the inference procedures introduced\nis investigated by means of simulations and applications to the construction of\ngraphical interaction models for brain connectivity based on EEG data are\npresented."}, "http://arxiv.org/abs/2206.09754": {"title": "Guided structure learning of DAGs for count data", "link": "http://arxiv.org/abs/2206.09754", "description": "In this paper, we tackle structure learning of Directed Acyclic Graphs\n(DAGs), with the idea of exploiting available prior knowledge of the domain at\nhand to guide the search of the best structure. 
In particular, we assume to\nknow the topological ordering of variables in addition to the given data. We\nstudy a new algorithm for learning the structure of DAGs, proving its\ntheoretical consistency in the limit of infinite observations. Furthermore, we\nexperimentally compare the proposed algorithm to a number of popular\ncompetitors, in order to study its behavior in finite samples."}, "http://arxiv.org/abs/2211.08784": {"title": "The robusTest package: two-sample tests revisited", "link": "http://arxiv.org/abs/2211.08784", "description": "The R package robusTest offers corrected versions of several common tests in\nbivariate statistics. We point out the limitations of these tests in their\nclassical versions, some of which are well known, such as robustness or\ncalibration problems, and provide simple alternatives that can be easily used\ninstead. The classical tests and their robust alternatives are compared\nthrough a small simulation study. The latter emphasizes the superiority of\nrobust versions of the test of interest. Finally, an illustration of\ncorrelation tests on a real data set is also provided."}, "http://arxiv.org/abs/2304.05527": {"title": "Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box", "link": "http://arxiv.org/abs/2304.05527", "description": "Automatic differentiation variational inference (ADVI) offers fast and\neasy-to-use posterior approximation in multiple modern probabilistic\nprogramming languages. However, its stochastic optimizer lacks clear\nconvergence criteria and requires tuning parameters. Moreover, ADVI inherits\nthe poor posterior uncertainty estimates of mean-field variational Bayes\n(MFVB). We introduce \"deterministic ADVI\" (DADVI) to address these issues.\nDADVI replaces the intractable MFVB objective with a fixed Monte Carlo\napproximation, a technique known in the stochastic optimization literature as\nthe \"sample average approximation\" (SAA). By optimizing an approximate but\ndeterministic objective, DADVI can use off-the-shelf second-order optimization,\nand, unlike standard mean-field ADVI, is amenable to more accurate posterior\ncovariances via linear response (LR). In contrast to existing worst-case\ntheory, we show that, on certain classes of common statistical problems, DADVI\nand the SAA can perform well with relatively few samples even in very high\ndimensions, though we also show that such favorable results cannot extend to\nvariational approximations that are too expressive relative to mean-field ADVI.\nWe show on a variety of real-world problems that DADVI reliably finds good\nsolutions with default settings (unlike ADVI) and, together with LR\ncovariances, is typically faster and more accurate than standard ADVI."}, "http://arxiv.org/abs/2401.10233": {"title": "Likelihood-ratio inference on differences in quantiles", "link": "http://arxiv.org/abs/2401.10233", "description": "Quantiles can represent key operational and business metrics, but the\ncomputational challenges associated with inference have hampered their adoption\nin online experimentation. One-sample confidence intervals are trivial to\nconstruct; however, two-sample inference has traditionally required\nbootstrapping or a density estimator. This paper presents a new two-sample\ndifference-in-quantile hypothesis test and confidence interval based on a\nlikelihood-ratio test statistic. 
A conservative version of the test does not\ninvolve a density estimator; a second version of the test, which uses a density\nestimator, yields confidence intervals very close to the nominal coverage\nlevel. It can be computed using only four order statistics from each sample."}, "http://arxiv.org/abs/2401.10235": {"title": "Semi-parametric local variable selection under misspecification", "link": "http://arxiv.org/abs/2401.10235", "description": "Local variable selection aims to discover localized effects by assessing the\nimpact of covariates on outcomes within specific regions defined by other\ncovariates. We outline some challenges of local variable selection in the\npresence of non-linear relationships and model misspecification. Specifically,\nwe highlight a potential drawback of common semi-parametric methods: even\nslight model misspecification can result in a high rate of false positives. To\naddress these shortcomings, we propose a methodology based on orthogonal cut\nsplines that achieves consistent local variable selection in high-dimensional\nscenarios. Our approach offers simplicity, handles both continuous and discrete\ncovariates, and provides theory for high-dimensional covariates and model\nmisspecification. We discuss settings with either independent or dependent\ndata. Our proposal allows including adjustment covariates that do not undergo\nselection, enhancing flexibility in modeling complex scenarios. We illustrate\nits application in simulation studies with both independent and functional\ndata, as well as with two real datasets. One dataset evaluates salary gaps\nassociated with discrimination factors at different ages, while the other\nexamines the effects of covariates on brain activation over time. The approach\nis implemented in the R package mombf."}, "http://arxiv.org/abs/2401.10269": {"title": "Robust Multi-Sensor Multi-Target Tracking Using Possibility Labeled Multi-Bernoulli Filter", "link": "http://arxiv.org/abs/2401.10269", "description": "With the increasing complexity of multiple target tracking scenes, a single\nsensor may not be able to effectively monitor a large number of targets.\nTherefore, it is imperative to extend the single-sensor technique to\nMulti-Sensor Multi-Target Tracking (MSMTT) for enhanced functionality. Typical\nMSMTT methods presume complete randomness of all uncertain components, and\ntherefore effective solutions such as the random finite set filter and\ncovariance intersection method have been derived to conduct the MSMTT task.\nHowever, the presence of epistemic uncertainty, arising from incomplete\ninformation, is often disregarded within the context of MSMTT. This paper\ndevelops an innovative possibility Labeled Multi-Bernoulli (LMB) Filter based\non the labeled Uncertain Finite Set (UFS) theory. The LMB filter inherits the\nhigh robustness of the possibility generalized labeled multi-Bernoulli filter\nwith simplified computational complexity. The fusion of LMB UFSs is derived and\nadapted to develop a robust MSMTT scheme. 
Simulation results corroborate the\nsuperior performance exhibited by the proposed approach in comparison to\ntypical probabilistic methods."}, "http://arxiv.org/abs/2401.10275": {"title": "Symbolic principal plane with Duality Centers Method", "link": "http://arxiv.org/abs/2401.10275", "description": "In \\cite{ref11} and \\cite{ref3}, the authors proposed the Centers and the\nVertices Methods to extend the well known principal components analysis method\nto a particular kind of symbolic objects characterized by multi--valued\nvariables of interval type. Nevertheless the authors use the classical circle\nof correlation to represent the variables. The correlation between the\nvariables and the principal components are not symbolic, because they compute\nthe standard correlations between the midpoints of variables and the midpoints\nof the principal components. It is well known that in standard principal\ncomponent analysis we may compute the correlation between the variables and the\nprincipal components using the duality relations starting from the coordinates\nof the individuals in the principal plane, also we can compute the coordinates\nof the individuals in the principal plane using duality relations starting from\nthe correlation between the variables and the principal components. In this\npaper we propose a new method to compute the symbolic correlation circle using\nduality relations in the case of interval-valued variables. Besides, the reader\nmay use all the methods presented herein and verify the results using the {\\tt\nRSDA} package written in {\\tt R} language, that can be downloaded and installed\ndirectly from {\\tt CRAN} \\cite{Rod2014}."}, "http://arxiv.org/abs/2401.10276": {"title": "Correspondence Analysis for Symbolic Multi--Valued Variables", "link": "http://arxiv.org/abs/2401.10276", "description": "This paper sets a proposal of a new method and two new algorithms for\nCorrespondence Analysis when we have Symbolic Multi--Valued Variables (SymCA).\nIn our method, there are two multi--valued variables $X$ and $Y$, that is to\nsay, the modality that takes the variables for a given individual is a finite\nset formed by the possible modalities taken for the variables in a given\nindividual, that which allows to apply the Correspondence Analysis to multiple\nselection questionnaires. Then, starting from all the possible classic\ncontingency tables an interval contingency table can be built, which will be\nthe point of departure of the proposed method."}, "http://arxiv.org/abs/2401.10495": {"title": "Causal Layering via Conditional Entropy", "link": "http://arxiv.org/abs/2401.10495", "description": "Causal discovery aims to recover information about an unobserved causal graph\nfrom the observable data it generates. Layerings are orderings of the variables\nwhich place causes before effects. In this paper, we provide ways to recover\nlayerings of a graph by accessing the data via a conditional entropy oracle,\nwhen distributions are discrete. Our algorithms work by repeatedly removing\nsources or sinks from the graph. Under appropriate assumptions and\nconditioning, we can separate the sources or sinks from the remainder of the\nnodes by comparing their conditional entropy to the unconditional entropy of\ntheir noise. Our algorithms are provably correct and run in worst-case\nquadratic time. The main assumptions are faithfulness and injective noise, and\neither known noise entropies or weakly monotonically increasing noise entropies\nalong directed paths. 
In addition, we require one of either a very mild\nextension of faithfulness, or strictly monotonically increasing noise\nentropies, or expanding noise injectivity to include an additional single\nargument in the structural functions."}, "http://arxiv.org/abs/2401.10592": {"title": "Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights", "link": "http://arxiv.org/abs/2401.10592", "description": "Randomized controlled clinical trials provide the gold standard for evidence\ngeneration in relation to the effectiveness of a new treatment in medical\nresearch. Relevant information from previous studies may be desirable to\nincorporate in the design of a new trial, with the Bayesian paradigm providing\na coherent framework to formally incorporate prior knowledge. Many established\nmethods involve the use of a discounting factor, sometimes related to a measure\nof `similarity' between historical sources and the new trial. However, it is\noften the case that the sample size is highly nonlinear in those discounting\nfactors. This hinders communication with subject-matter experts to elicit\nsensible values for borrowing strength at the trial design stage.\n\nFocusing on a sample size formula that can incorporate historical data from\nmultiple sources, we propose a linearization technique such that the sample\nsize changes evenly over values of the discounting factors (hereafter referred\nto as `weights'). Our approach leads to interpretable weights that directly\nrepresent the dissimilarity between historical and new trial data on the\nprobability scale, and could therefore facilitate easier elicitation of expert\nopinion on their values.\n\nInclusion of historical data in the design of clinical trials is not common\npractice. Part of the reason might be difficulty in interpretability of\ndiscrepancy parameters. We hope our work will help to bridge this gap and\nencourage uptake of these innovative methods.\n\nKeywords: Bayesian sample size determination; Commensurate priors; Historical\nborrowing; Prior aggregation; Uniform shrinkage."}, "http://arxiv.org/abs/2401.10796": {"title": "Reliability analysis for data-driven noisy models using active learning", "link": "http://arxiv.org/abs/2401.10796", "description": "Reliability analysis aims at estimating the failure probability of an\nengineering system. It often requires multiple runs of a limit-state function,\nwhich usually relies on computationally intensive simulations. Traditionally,\nthese simulations have been considered deterministic, i.e., running them\nmultiple times for a given set of input parameters always produces the same\noutput. However, this assumption does not always hold, as many studies in the\nliterature report non-deterministic computational simulations (also known as\nnoisy models). In such cases, running the simulations multiple times with the\nsame input will result in different outputs. Similarly, data-driven models that\nrely on real-world data may also be affected by noise. This characteristic\nposes a challenge when performing reliability analysis, as many classical\nmethods, such as FORM and SORM, are tailored to deterministic models. To bridge\nthis gap, this paper provides a novel methodology to perform reliability\nanalysis on models contaminated by noise. In such cases, noise introduces\nlatent uncertainty into the reliability estimator, leading to an incorrect\nestimation of the real underlying reliability index, even when using Monte\nCarlo simulation. 
To overcome this challenge, we propose the use of denoising\nregression-based surrogate models within an active learning reliability\nanalysis framework. Specifically, we combine Gaussian process regression with a\nnoise-aware learning function to efficiently estimate the probability of\nfailure of the underlying noise-free model. We showcase the effectiveness of\nthis methodology on standard benchmark functions and a finite element model of\na realistic structural frame."}, "http://arxiv.org/abs/2401.10824": {"title": "The trivariate wrapped Cauchy copula -- a multi-purpose model for angular data", "link": "http://arxiv.org/abs/2401.10824", "description": "In this paper, we will present a new flexible distribution for\nthree-dimensional angular data, or data on the three-dimensional torus. Our\ntrivariate wrapped Cauchy copula has the following benefits: (i) simple form of\ndensity, (ii) adjustable degree of dependence between every pair of variables,\n(iii) interpretable and well-estimable parameters, (iv) well-known conditional\ndistributions, (v) a simple data generating mechanism, (vi) unimodality.\nMoreover, our construction allows for linear marginals, implying that our\ncopula can also model cylindrical data. Parameter estimation via maximum\nlikelihood is explained, a comparison with the competitors in the existing\nliterature is given, and two real datasets are considered, one concerning\nprotein dihedral angles and another about data obtained by a buoy in the\nAdriatic Sea."}, "http://arxiv.org/abs/2401.10867": {"title": "Learning Optimal Dynamic Treatment Regimes from Longitudinal Data", "link": "http://arxiv.org/abs/2401.10867", "description": "Studies often report estimates of the average treatment effect. While the ATE\nsummarizes the effect of a treatment on average, it does not provide any\ninformation about the effect of treatment within any individual. A treatment\nstrategy that uses an individual's information to tailor treatment to maximize\nbenefit is known as an optimal dynamic treatment rule. Treatment, however, is\ntypically not limited to a single point in time; consequently, learning an\noptimal rule for a time-varying treatment may involve not just learning the\nextent to which the comparative treatments' benefits vary across the\ncharacteristics of individuals, but also learning the extent to which the\ncomparative treatments' benefits vary as relevant circumstances evolve within\nan individual. The goal of this paper is to provide a tutorial for estimating\nODTR from longitudinal observational and clinical trial data for applied\nresearchers. We describe an approach that uses a doubly-robust unbiased\ntransformation of the conditional average treatment effect. We then learn a\ntime-varying ODTR for when to increase buprenorphine-naloxone dose to minimize\nreturn-to-regular-opioid-use among patients with opioid use disorder. Our\nanalysis highlights the utility of ODTRs in the context of sequential decision\nmaking: the learned ODTR outperforms a clinically defined strategy."}, "http://arxiv.org/abs/2401.10869": {"title": "Variable selection for partially linear additive models", "link": "http://arxiv.org/abs/2401.10869", "description": "Among semiparametric regression models, partially linear additive models\nprovide a useful tool to include additive nonparametric components as well as a\nparametric component, when explaining the relationship between the response and\na set of explanatory variables. 
This paper concerns such models under sparsity\nassumptions for the covariates included in the linear component. Sparse\ncovariates are frequent in regression problems where the task of variable\nselection is usually of interest. As in other settings, outliers either in the\nresiduals or in the covariates involved in the linear component have a harmful\neffect. To simultaneously achieve model selection for the parametric component\nof the model and resistance to outliers, we combine preliminary robust\nestimators of the additive component, robust linear $MM-$regression estimators\nwith a penalty such as SCAD on the coefficients in the parametric part. Under\nmild assumptions, consistency results and rates of convergence for the proposed\nestimators are derived. A Monte Carlo study is carried out to compare, under\ndifferent models and contamination schemes, the performance of the robust\nproposal with its classical counterpart. The obtained results show the\nadvantage of using the robust approach. Through the analysis of a real data\nset, we also illustrate the benefits of the proposed procedure."}, "http://arxiv.org/abs/2103.06347": {"title": "Factorized Binary Search: change point detection in the network structure of multivariate high-dimensional time series", "link": "http://arxiv.org/abs/2103.06347", "description": "Functional magnetic resonance imaging (fMRI) time series data presents a\nunique opportunity to understand the behavior of temporal brain connectivity,\nand models that uncover the complex dynamic workings of this organ are of keen\ninterest in neuroscience. We are motivated to develop accurate change point\ndetection and network estimation techniques for high-dimensional whole-brain\nfMRI data. To this end, we introduce factorized binary search (FaBiSearch), a\nnovel change point detection method in the network structure of multivariate\nhigh-dimensional time series in order to understand the large-scale\ncharacterizations and dynamics of the brain. FaBiSearch employs non-negative\nmatrix factorization, an unsupervised dimension reduction technique, and a new\nbinary search algorithm to identify multiple change points. In addition, we\npropose a new method for network estimation for data between change points. We\nseek to understand the dynamic mechanism of the brain, particularly for two\nfMRI data sets. The first is a resting-state fMRI experiment, where subjects\nare scanned over three visits. The second is a task-based fMRI experiment,\nwhere subjects read Chapter 9 of Harry Potter and the Sorcerer's Stone. For the\nresting-state data set, we examine the test-retest behavior of dynamic\nfunctional connectivity, while for the task-based data set, we explore network\ndynamics during the reading and whether change points across subjects coincide\nwith key plot twists in the story. Further, we identify hub nodes in the brain\nnetwork and examine their dynamic behavior. Finally, we make all the methods\ndiscussed available in the R package fabisearch on CRAN."}, "http://arxiv.org/abs/2207.07020": {"title": "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO", "link": "http://arxiv.org/abs/2207.07020", "description": "The Gaussian chain graph model simultaneously parametrizes (i) the direct\neffects of $p$ predictors on $q$ outcomes and (ii) the residual partial\ncovariances between pairs of outcomes. We introduce a new method for fitting\nsparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. 
We\ndevelop an Expectation Conditional Maximization algorithm to obtain sparse\nestimates of the $p \\times q$ matrix of direct effects and the $q \\times q$\nresidual precision matrix. Our algorithm iteratively solves a sequence of\npenalized maximum likelihood problems with self-adaptive penalties that\ngradually filter out negligible regression coefficients and partial\ncovariances. Because it adaptively penalizes individual model parameters, our\nmethod is seen to outperform fixed-penalty competitors on simulated data. We\nestablish the posterior contraction rate for our model, buttressing our\nmethod's excellent empirical performance with strong theoretical guarantees.\nUsing our method, we estimated the direct effects of diet and residence type on\nthe composition of the gut microbiome of elderly adults."}, "http://arxiv.org/abs/2301.00040": {"title": "Optimization-based Sensitivity Analysis for Unmeasured Confounding using Partial Correlations", "link": "http://arxiv.org/abs/2301.00040", "description": "Causal inference necessarily relies upon untestable assumptions; hence, it is\ncrucial to assess the robustness of obtained results to violations of\nidentification assumptions. However, such sensitivity analysis is only\noccasionally undertaken in practice, as many existing methods only apply to\nrelatively simple models and their results are often difficult to interpret. We\ntake a more flexible approach to sensitivity analysis and view it as a\nconstrained stochastic optimization problem. This work focuses on sensitivity\nanalysis for a linear causal effect when an unmeasured confounder and a\npotential instrument are present. We show how the bias of the OLS and TSLS\nestimands can be expressed in terms of partial correlations. Leveraging the\nalgebraic rules that relate different partial correlations, practitioners can\nspecify intuitive sensitivity models which bound the bias. We further show that\nthe heuristic \"plug-in\" sensitivity interval may not have any confidence\nguarantees; instead, we propose a bootstrap approach to construct sensitivity\nintervals which perform well in numerical simulations. We illustrate the\nproposed methods with a real study on the causal effect of education on\nearnings and provide user-friendly visualization tools."}, "http://arxiv.org/abs/2307.00048": {"title": "Learned harmonic mean estimation of the marginal likelihood with normalizing flows", "link": "http://arxiv.org/abs/2307.00048", "description": "Computing the marginal likelihood (also called the Bayesian model evidence)\nis an important task in Bayesian model selection, providing a principled\nquantitative way to compare models. The learned harmonic mean estimator solves\nthe exploding variance problem of the original harmonic mean estimation of the\nmarginal likelihood. The learned harmonic mean estimator learns an importance\nsampling target distribution that approximates the optimal distribution. While\nthe approximation need not be highly accurate, it is critical that the\nprobability mass of the learned distribution is contained within the posterior\nin order to avoid the exploding variance problem. In previous work a bespoke\noptimization problem is introduced when training models in order to ensure this\nproperty is satisfied. In the current article we introduce the use of\nnormalizing flows to represent the importance sampling target distribution. A\nflow-based model is trained on samples from the posterior by maximum likelihood\nestimation. 
Then, the probability density of the flow is concentrated by\nlowering the variance of the base distribution, i.e. by lowering its\n\"temperature\", ensuring its probability mass is contained within the posterior.\nThis approach avoids the need for a bespoke optimization problem and careful\nfine-tuning of parameters, resulting in a more robust method. Moreover, the use\nof normalizing flows has the potential to scale to high-dimensional settings.\nWe present preliminary experiments demonstrating the effectiveness of the use\nof flows for the learned harmonic mean estimator. The harmonic code\nimplementing the learned harmonic mean, which is publicly available, has been\nupdated to now support normalizing flows."}, "http://arxiv.org/abs/2401.11001": {"title": "Reply to Comment by Schilling on their paper \"Optimal and fast confidence intervals for hypergeometric successes\"", "link": "http://arxiv.org/abs/2401.11001", "description": "A response to a letter to the editor by Schilling regarding Bartroff, Lorden,\nand Wang (\"Optimal and fast confidence intervals for hypergeometric successes\",\n2022, arXiv:2109.05624)."}, "http://arxiv.org/abs/2401.11070": {"title": "Efficient Data Reduction Strategies for Big Data and High-Dimensional LASSO Regressions", "link": "http://arxiv.org/abs/2401.11070", "description": "The IBOSS approach proposed by Wang et al. (2019) selects the most\ninformative subset of n points. It assumes that the ordinary least squares\nmethod is used and requires that the number of variables, p, is not large.\nHowever, in many practical problems, p is very large and penalty-based model\nfitting methods such as the LASSO are used. We study big data problems, in which\nboth n and p are large. In the first part, we focus on reduction in data\npoints. We develop theoretical results showing that the IBOSS type of approach\ncan be applicable to penalty-based regressions such as LASSO. In the second\npart, we consider the situations where p is extremely large. We propose a\ntwo-step approach that involves first reducing the number of variables and then\nreducing the number of data points. Two separate algorithms are developed,\nwhose performances are studied through extensive simulation studies. Compared\nto existing methods including the well-known split-and-conquer approach, the\nproposed methods enjoy advantages in terms of estimation accuracy, prediction\naccuracy, and computation time."}, "http://arxiv.org/abs/2401.11075": {"title": "Estimating the Hawkes process from a discretely observed sample path", "link": "http://arxiv.org/abs/2401.11075", "description": "The Hawkes process is a widely used model in many areas, such as\nfinance, seismology, neuroscience, epidemiology, and social\nsciences. Estimation of the Hawkes process from continuous\nobservations of a sample path is relatively straightforward using\neither the maximum likelihood or other methods. However, estimating\nthe parameters of a Hawkes process from observations of a sample\npath at discrete time points only is challenging due to the\nintractability of the likelihood with such data. In this work, we\nintroduce a method to estimate the Hawkes process from a discretely\nobserved sample path. The method takes advantage of a state-space\nrepresentation of the incomplete data problem and uses sequential\nMonte Carlo (aka particle filtering) to approximate the likelihood\nfunction. 
As an estimator of the likelihood function, the SMC\napproximation is unbiased, and therefore it can be used together\nwith the Metropolis-Hastings algorithm to construct Markov chains to\napproximate the likelihood distribution, or more generally, the\nposterior distribution of model parameters. The performance of the\nmethodology is assessed using simulation experiments and compared\nwith other recently published methods. The proposed estimator is\nfound to have a smaller mean square error than the two benchmark\nestimators. The proposed method has the additional advantage that\nconfidence intervals for the parameters are easily available. We\napply the proposed estimator to the analysis of weekly count data on\nmeasles cases in Tokyo, Japan and compare the results to those obtained by\none of the benchmark methods."}, "http://arxiv.org/abs/2401.11119": {"title": "Constraint-based measures of shift and relative shift for discrete frequency distributions", "link": "http://arxiv.org/abs/2401.11119", "description": "Comparisons of frequency distributions often invoke the concept of shift to\ndescribe directional changes in properties such as the mean. In the present\nstudy, we sought to define shift as a property in and of itself. Specifically,\nwe define distributional shift (DS) as the concentration of frequencies away\nfrom the discrete class having the greatest value (e.g., the right-most bin of\na histogram). We derive a measure of DS using the normalized sum of\nexponentiated cumulative frequencies. We then define relative distributional\nshift (RDS) as the difference in DS between two distributions, revealing the\nmagnitude and direction by which one distribution is concentrated to lesser or\ngreater discrete classes relative to another. We find that RDS is highly\nrelated to popular measures that, while based on the comparison of frequency\ndistributions, do not explicitly consider shift. While RDS provides a useful\ncomplement to other comparative measures, DS allows shift to be quantified as a\nproperty of individual distributions, similar in concept to a statistical\nmoment."}, "http://arxiv.org/abs/2401.11128": {"title": "Regularized Estimation of Sparse Spectral Precision Matrices", "link": "http://arxiv.org/abs/2401.11128", "description": "The spectral precision matrix, the inverse of a spectral density matrix, is an\nobject of central interest in frequency-domain analysis of multivariate time\nseries. Estimation of the spectral precision matrix is a key step in calculating\npartial coherency and graphical model selection of stationary time series. When\nthe dimension of a multivariate time series is moderate to large, traditional\nestimators of spectral density matrices such as averaged periodograms tend to\nbe severely ill-conditioned, and one needs to resort to suitable regularization\nstrategies involving optimization over complex variables.\n\nIn this work, we propose complex graphical Lasso (CGLASSO), an\n$\\ell_1$-penalized estimator of spectral precision matrix based on local\nWhittle likelihood maximization. We develop fast $\\textit{pathwise coordinate\ndescent}$ algorithms for implementing CGLASSO on large dimensional time series\ndata sets. At its core, our algorithmic development relies on a ring\nisomorphism between complex and real matrices that helps map a number of\noptimization problems over complex variables to similar optimization problems\nover real variables. 
This finding may be of independent interest and more\nbroadly applicable for high-dimensional statistical analysis with\ncomplex-valued data. We also present a complete non-asymptotic theory of our\nproposed estimator, which shows that consistent estimation is possible in the\nhigh-dimensional regime as long as the underlying spectral precision matrix is\nsuitably sparse. We compare the performance of CGLASSO with competing\nalternatives on simulated data sets, and use it to construct a partial coherence\nnetwork among brain regions from a real fMRI data set."}, "http://arxiv.org/abs/2401.11263": {"title": "Estimating heterogeneous treatment effect from survival outcomes via (orthogonal) censoring unbiased learning", "link": "http://arxiv.org/abs/2401.11263", "description": "Methods for estimating heterogeneous treatment effects (HTE) from\nobservational data have largely focused on continuous or binary outcomes, with\nless attention paid to survival outcomes and almost none to settings with\ncompeting risks. In this work, we develop censoring unbiased transformations\n(CUTs) for survival outcomes both with and without competing risks. After\nconverting time-to-event outcomes using these CUTs, direct application of HTE\nlearners for continuous outcomes yields consistent estimates of heterogeneous\ncumulative incidence effects, total effects, and separable direct effects. Our\nCUTs enable application of a much larger set of state-of-the-art HTE learners\nfor censored outcomes than had previously been available, especially in\ncompeting risks settings. We provide generic model-free learner-specific oracle\ninequalities bounding the finite-sample excess risk. The oracle efficiency\nresults depend on the oracle selector and estimated nuisance functions from all\nsteps involved in the transformation. We demonstrate the empirical performance\nof the proposed methods in simulation studies."}, "http://arxiv.org/abs/2401.11265": {"title": "Assessing the Competitiveness of Matrix-Free Block Likelihood Estimation in Spatial Models", "link": "http://arxiv.org/abs/2401.11265", "description": "In geostatistics, block likelihood offers a balance between statistical\naccuracy and computational efficiency when estimating covariance functions.\nThis balance is reached by dividing the sample into blocks and computing a\nweighted sum of (sub) log-likelihoods corresponding to pairs of blocks.\nPractitioners often choose block sizes ranging from hundreds to a few thousand\nobservations, inherently involving matrix-based implementations. An\nalternative, residing at the opposite end of this methodological spectrum,\ntreats each observation as a block, resulting in the matrix-free pairwise\nlikelihood method. We propose an additional alternative within this broad\nmethodological landscape, systematically constructing blocks of size two and\nmerging pairs of blocks through conditioning. Importantly, our method\nstrategically avoids large-sized blocks, facilitating explicit calculations\nthat ultimately do not rely on matrix computations. 
Studies with both simulated\nand real data validate the effectiveness of our approach, on one hand\ndemonstrating its superiority over pairwise likelihood, and on the other,\nchallenging the intuitive notion that employing matrix-based versions\nuniversally lead to better statistical performance."}, "http://arxiv.org/abs/2401.11272": {"title": "Asymptotics for non-degenerate multivariate $U$-statistics with estimated nuisance parameters under the null and local alternative hypotheses", "link": "http://arxiv.org/abs/2401.11272", "description": "The large-sample behavior of non-degenerate multivariate $U$-statistics of\narbitrary degree is investigated under the assumption that their kernel depends\non parameters that can be estimated consistently. Mild regularity conditions\nare given which guarantee that once properly normalized, such statistics are\nasymptotically multivariate Gaussian both under the null hypothesis and\nsequences of local alternatives. The work of Randles (1982, Ann. Statist.) is\nextended in three ways: the data and the kernel values can be multivariate\nrather than univariate, the limiting behavior under local alternatives is\nstudied for the first time, and the effect of knowing some of the nuisance\nparameters is quantified. These results can be applied to a broad range of\ngoodness-of-fit testing contexts, as shown in one specific example."}, "http://arxiv.org/abs/2401.11278": {"title": "Handling incomplete outcomes and covariates in cluster-randomized trials: doubly-robust estimation, efficiency considerations, and sensitivity analysis", "link": "http://arxiv.org/abs/2401.11278", "description": "In cluster-randomized trials, missing data can occur in various ways,\nincluding missing values in outcomes and baseline covariates at the individual\nor cluster level, or completely missing information for non-participants. Among\nthe various types of missing data in CRTs, missing outcomes have attracted the\nmost attention. However, no existing method comprehensively addresses all the\naforementioned types of missing data simultaneously due to their complexity.\nThis gap in methodology may lead to confusion and potential pitfalls in the\nanalysis of CRTs. In this article, we propose a doubly-robust estimator for a\nvariety of estimands that simultaneously handles missing outcomes under a\nmissing-at-random assumption, missing covariates with the missing-indicator\nmethod (with no constraint on missing covariate distributions), and missing\ncluster-population sizes via a uniform sampling framework. Furthermore, we\nprovide three approaches to improve precision by choosing the optimal weights\nfor intracluster correlation, leveraging machine learning, and modeling the\npropensity score for treatment assignment. To evaluate the impact of violated\nmissing data assumptions, we additionally propose a sensitivity analysis that\nmeasures when missing data alter the conclusion of treatment effect estimation.\nSimulation studies and data applications both show that our proposed method is\nvalid and superior to the existing methods."}, "http://arxiv.org/abs/2401.11327": {"title": "Measuring hierarchically-organized interactions in dynamic networks through spectral entropy rates: theory, estimation, and illustrative application to physiological networks", "link": "http://arxiv.org/abs/2401.11327", "description": "Recent advances in signal processing and information theory are boosting the\ndevelopment of new approaches for the data-driven modelling of complex network\nsystems. 
In the fields of Network Physiology and Network Neuroscience, where the\nsignals of interest are often rich in oscillatory content, the spectral\nrepresentation of network systems is essential to ascribe the analyzed\ninteractions to specific oscillations with physiological meaning. In this\ncontext, the present work formalizes a coherent framework that integrates\nseveral information dynamics approaches to quantify node-specific, pairwise and\nhigher-order interactions in network systems. The framework establishes a\nhierarchical organization of interactions of different order using measures of\nentropy rate, mutual information rate and O-information rate, to quantify\nrespectively the dynamics of individual nodes, the links between pairs of\nnodes, and the redundant/synergistic hyperlinks between groups of nodes. All\nmeasures are formulated in the time domain, and then expanded to the spectral\ndomain to obtain frequency-specific information. The practical computation of\nall measures is supported by a toolbox that implements their parametric\nand non-parametric estimation, and includes approaches to assess their\nstatistical significance. The framework is illustrated first using theoretical\nexamples where the properties of the measures are displayed in benchmark\nsimulated network systems, and then applied to representative examples of\nmultivariate time series in the context of Network Neuroscience and Network\nPhysiology."}, "http://arxiv.org/abs/2401.11328": {"title": "A Hierarchical Decision-Based Maintenance for a Complex Modular System Driven by the MoMA Algorithm", "link": "http://arxiv.org/abs/2401.11328", "description": "This paper presents a maintenance policy for a modular system formed by K\nindependent modules (n-subsystems) subjected to environmental conditions\n(shocks). For the modeling of this complex system, the use of the\nMatrix-Analytical Method (MAM) is proposed under a layered approach according\nto its hierarchical structure. Thus, the operational state of the system (top\nlayer) depends on the states of the modules (middle layer), which in turn\ndepend on the states of their components (bottom layer). This allows a detailed\ndescription of the system operation to plan maintenance actions appropriately\nand optimally. We propose a hierarchical decision-based maintenance strategy\nwith periodic inspections as follows: at the time of the inspection, the\ncondition of the system is first evaluated. If intervention is necessary, the\nmodules are then checked to make individual decisions based on their states,\nand so on. Replacement or repair will be carried out as appropriate. An\noptimization problem is formulated as a function of the length of the\ninspection period and the intervention cost incurred over the useful life of\nthe system. Our method shows clear advantages, providing compact and\nimplementable expressions. The model is illustrated on a submarine Electrical\nControl Unit (ECU)."}, "http://arxiv.org/abs/2401.11346": {"title": "Estimating Default Probability and Correlation using Stan", "link": "http://arxiv.org/abs/2401.11346", "description": "This work has the objective of estimating default probabilities and\ncorrelations of credit portfolios given default rate information through a\nBayesian framework using Stan. We use Vasicek's single factor credit model to\nestablish the theoretical framework for the behavior of the default rates, and\nuse NUTS Markov Chain Monte Carlo to estimate the parameters. 
We compare the\nBayesian estimates with classical estimates such as moment estimators and\nmaximum likelihood estimates. We apply the methodology both to simulated data\nand to corporate default rates, and perform inferences through Bayesian methods\nin order to exhibit the advantages of such a framework. We perform default\nforecasting, highlight the importance of an adequate estimation of default\ncorrelations, and show the advantage of using Stan for sampling with respect to\nprior choice."}, "http://arxiv.org/abs/2401.11352": {"title": "Geometric Insights and Empirical Observations on Covariate Adjustment and Stratified Randomization in Randomized Clinical Trials", "link": "http://arxiv.org/abs/2401.11352", "description": "The statistical efficiency of randomized clinical trials can be improved by\nincorporating information from baseline covariates (i.e., pre-treatment patient\ncharacteristics). This can be done in the design stage using a\ncovariate-adaptive randomization scheme such as stratified (permuted block)\nrandomization, or in the analysis stage through covariate adjustment. This\narticle provides a geometric perspective on covariate adjustment and stratified\nrandomization in a unified framework where all regular, asymptotically linear\nestimators are identified as augmented estimators. From this perspective,\ncovariate adjustment can be viewed as an effort to approximate the optimal\naugmentation function, and stratified randomization aims to improve a given\napproximation by projecting it into an affine subspace containing the optimal\naugmentation function. The efficiency benefit of stratified randomization is\nasymptotically equivalent to making full use of stratum information in\ncovariate adjustment, which can be achieved using a simple calibration\nprocedure. Simulation results indicate that stratified randomization is clearly\nbeneficial to unadjusted estimators and much less so to adjusted ones, and that\ncalibration is an effective way to recover the efficiency benefit of stratified\nrandomization without actually performing stratified randomization. These\ninsights and observations are illustrated using real clinical trial data."}, "http://arxiv.org/abs/2401.11354": {"title": "Squared Wasserstein-2 Distance for Efficient Reconstruction of Stochastic Differential Equations", "link": "http://arxiv.org/abs/2401.11354", "description": "We provide an analysis of the squared Wasserstein-2 ($W_2$) distance between\ntwo probability distributions associated with two stochastic differential\nequations (SDEs). Based on this analysis, we propose the use of a squared $W_2$\ndistance-based loss function in the \\textit{reconstruction} of SDEs from noisy\ndata. To demonstrate the practicality of our Wasserstein distance-based loss\nfunction, we performed numerical experiments that show the efficiency\nof our method in reconstructing SDEs that arise across a number of\napplications."}, "http://arxiv.org/abs/2401.11359": {"title": "The Exact Risks of Reference Panel-based Regularized Estimators", "link": "http://arxiv.org/abs/2401.11359", "description": "Reference panel-based estimators have become widely used in genetic\nprediction of complex traits due to their ability to address data privacy\nconcerns and reduce computational and communication costs. These estimators\nestimate the covariance matrix of predictors using an external reference panel,\ninstead of relying solely on the original training data. 
In this paper, we\ninvestigate the performance of reference panel-based $L_1$ and $L_2$\nregularized estimators within a unified framework based on approximate message\npassing (AMP). We uncover several key factors that influence the accuracy of\nreference panel-based estimators, including the sample sizes of the training\ndata and reference panels, the signal-to-noise ratio, the underlying sparsity\nof the signal, and the covariance matrix among predictors. Our findings reveal\nthat, even when the sample size of the reference panel matches that of the\ntraining data, reference panel-based estimators tend to exhibit lower accuracy\ncompared to traditional regularized estimators. Furthermore, we observe that\nthis performance gap widens as the amount of training data increases,\nhighlighting the importance of constructing large-scale reference panels to\nmitigate this issue. To support our theoretical analysis, we develop a novel\nnon-separable matrix AMP framework capable of handling the complexities\nintroduced by a general covariance matrix and the additional randomness\nassociated with a reference panel. We validate our theoretical results through\nextensive simulation studies and real data analyses using the UK Biobank\ndatabase."}, "http://arxiv.org/abs/2401.11368": {"title": "When exposure affects subgroup membership: Framing relevant causal questions in perinatal epidemiology and beyond", "link": "http://arxiv.org/abs/2401.11368", "description": "Perinatal epidemiology often aims to evaluate exposures on infant outcomes.\nWhen the exposure affects the composition of people who give birth to live\ninfants (e.g., by affecting fertility, behavior, or birth outcomes), this \"live\nbirth process\" mediates the exposure effect on infant outcomes. Causal\nestimands previously proposed for this setting include the total exposure\neffect on composite birth and infant outcomes, controlled direct effects (e.g.,\nenforcing birth), and principal stratum direct effects. Using perinatal HIV\ntransmission in the SEARCH Study as a motivating example, we present two\nalternative causal estimands: 1) conditional total effects; and 2) conditional\nstochastic direct effects, formulated under a hypothetical intervention to draw\nmediator values from some distribution (possibly conditional on covariates).\nThe proposed conditional total effect includes impacts of an intervention that\noperate by changing the types of people who have a live birth and the timing of\nbirths. The proposed conditional stochastic direct effects isolate the effect\nof an exposure on infant outcomes excluding any impacts through this live birth\nprocess. In SEARCH, this approach quantifies the impact of a universal testing\nand treatment intervention on infant HIV-free survival absent any effect of the\nintervention on the live birth process, within a clearly defined target\npopulation of women of reproductive age with HIV at study baseline. Our\napproach has implications for the evaluation of intervention effects in\nperinatal epidemiology broadly, and whenever causal effects within a subgroup\nare of interest and exposure affects membership in the subgroup."}, "http://arxiv.org/abs/2401.11380": {"title": "MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning", "link": "http://arxiv.org/abs/2401.11380", "description": "Model-based offline reinforcement learning methods (RL) have achieved\nstate-of-the-art performance in many decision-making problems thanks to their\nsample efficiency and generalizability. 
Despite these advancements, existing\nmodel-based offline RL approaches either focus on theoretical studies without\ndeveloping practical algorithms or rely on a restricted parametric policy\nspace, thus not fully leveraging the advantages of an unrestricted policy space\ninherent to model-based methods. To address this limitation, we develop MoMA, a\nmodel-based mirror ascent algorithm with general function approximations under\npartial coverage of offline data. MoMA distinguishes itself from the existing\nliterature by employing an unrestricted policy class. In each iteration, MoMA\nconservatively estimates the value function by a minimization procedure within\na confidence set of transition models in the policy evaluation step, then\nupdates the policy with general function approximations instead of\ncommonly-used parametric policy classes in the policy improvement step. Under\nsome mild assumptions, we establish theoretical guarantees of MoMA by proving\nan upper bound on the suboptimality of the returned policy. We also provide a\npractically implementable, approximate version of the algorithm. The\neffectiveness of MoMA is demonstrated via numerical studies."}, "http://arxiv.org/abs/2401.11507": {"title": "Redundant multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses", "link": "http://arxiv.org/abs/2401.11507", "description": "During multiple testing, researchers often adjust their alpha level to\ncontrol the familywise error rate for a statistical inference about a joint\nunion alternative hypothesis (e.g., \"H1 or H2\"). However, in some cases, they\ndo not make this inference and instead make separate inferences about each of\nthe individual hypotheses that comprise the joint hypothesis (e.g., H1 and H2).\nFor example, a researcher might use a Bonferroni correction to adjust their\nalpha level from the conventional level of 0.050 to 0.025 when testing H1 and\nH2, find a significant result for H1 (p < 0.025) and not for H2 (p > 0.025),\nand so claim support for H1 and not for H2. However, these separate individual\ninferences do not require an alpha adjustment. Only a statistical inference\nabout the union alternative hypothesis \"H1 or H2\" requires an alpha adjustment\nbecause it is based on \"at least one\" significant result among the two tests,\nand so it depends on the familywise error rate. When a researcher corrects\ntheir alpha level during multiple testing but does not make an inference about\nthe union alternative hypothesis, their correction is redundant. In the present\narticle, I discuss this redundant correction problem, including its associated\nloss of statistical power and its potential causes vis-\\`a-vis error rate\nconfusions and the alpha adjustment ritual. I also provide three illustrations\nof redundant corrections from recent psychology studies. I conclude that\nredundant corrections represent a symptom of statisticism, and I call for a\nmore nuanced and context-specific approach to multiple testing corrections."}, "http://arxiv.org/abs/2401.11515": {"title": "Geometry-driven Bayesian Inference for Ultrametric Covariance Matrices", "link": "http://arxiv.org/abs/2401.11515", "description": "Ultrametric matrices arise as covariance matrices in latent tree models for\nmultivariate data with hierarchically correlated components. 
As a parameter\nspace in a model, the set of ultrametric matrices is neither convex nor a\nsmooth manifold, and focus in the literature has hitherto mainly been restricted to\nestimation through projections and relaxation-based techniques. Leveraging the\nlink between an ultrametric matrix and a rooted tree, we equip the set of\nultrametric matrices with a convenient geometry based on the well-known\ngeometry of phylogenetic trees, whose attractive properties (e.g. unique\ngeodesics and Fr\\'{e}chet means) the set of ultrametric matrices inherits. This\nresults in a novel representation of an ultrametric matrix by coordinates of\nthe tree space, which we then use to define a class of Markovian and consistent\nprior distributions on the set of ultrametric matrices in a Bayesian model, and\ndevelop an efficient algorithm to sample from the posterior distribution that\ngenerates updates by making intrinsic local moves along geodesics within the\nset of ultrametric matrices. In simulation studies, our proposed algorithm\nrestores the underlying matrices with posterior samples that recover the true\ntree topology with high frequency and generate element-wise\ncredible intervals with a high nominal coverage rate. We use the proposed\nalgorithm on pre-clinical cancer data to investigate mechanism\nsimilarity by constructing the underlying treatment tree, and find that\ntreatments with high mechanism similarity also target correlated pathways in\nthe biological literature."}, "http://arxiv.org/abs/2401.11537": {"title": "Addressing researcher degrees of freedom through minP adjustment", "link": "http://arxiv.org/abs/2401.11537", "description": "When different researchers study the same research question using the same\ndataset, they may obtain different and potentially even conflicting results.\nThis is because there is often substantial flexibility in researchers'\nanalytical choices, an issue also referred to as ''researcher degrees of\nfreedom''. Combined with selective reporting of the smallest p-value or largest\neffect, researcher degrees of freedom may lead to an increased rate of false\npositive and overoptimistic results. In this paper, we address this issue by\nformalizing the multiplicity of analysis strategies as a multiple testing\nproblem. As the test statistics of different analysis strategies are usually\nhighly dependent, a naive approach such as the Bonferroni correction is\ninappropriate because it leads to an unacceptable loss of power. Instead, we\npropose using the ''minP'' adjustment method, which takes potential test\ndependencies into account and approximates the underlying null distribution of\nthe minimal p-value through a permutation-based procedure. This procedure is\nknown to achieve more power than simpler approaches while ensuring a weak\ncontrol of the family-wise error rate. We illustrate our approach for\naddressing researcher degrees of freedom by applying it to a study on the\nimpact of perioperative paO2 on post-operative complications after\nneurosurgery. A total of 48 analysis strategies are considered and adjusted\nusing the minP procedure. 
This approach allows one to selectively report the result\nof the analysis strategy yielding the most convincing evidence, while\ncontrolling the type 1 error -- and thus the risk of publishing false positive\nresults that may not be replicable."}, "http://arxiv.org/abs/2401.11540": {"title": "A new flexible class of kernel-based tests of independence", "link": "http://arxiv.org/abs/2401.11540", "description": "Spherical and hyperspherical data are commonly encountered in diverse applied\nresearch domains, underscoring the vital task of assessing independence within\nsuch data structures. In this context, we investigate the properties of test\nstatistics relying on distance correlation measures originally introduced for\nthe energy distance, and generalize the concept to strongly negative definite\nkernel-based distances. An important benefit of employing this method lies in\nits versatility across diverse forms of directional data, enabling the\nexamination of independence among vectors of varying types. The applicability\nof the tests is demonstrated on several real datasets."}, "http://arxiv.org/abs/2401.11646": {"title": "Nonparametric Estimation via Variance-Reduced Sketching", "link": "http://arxiv.org/abs/2401.11646", "description": "Nonparametric models are of great interest in various scientific and\nengineering disciplines. Classical kernel methods, while numerically robust and\nstatistically sound in low-dimensional settings, become inadequate in\nhigher-dimensional settings due to the curse of dimensionality. In this paper,\nwe introduce a new framework called Variance-Reduced Sketching (VRS),\nspecifically designed to estimate density functions and nonparametric\nregression functions in higher dimensions with a reduced curse of\ndimensionality. Our framework conceptualizes multivariable functions as\ninfinite-size matrices, and facilitates a new sketching technique motivated by\nthe numerical linear algebra literature to reduce the variance in estimation\nproblems. We demonstrate the robust numerical performance of VRS through a\nseries of simulated experiments and real-world data applications. Notably, VRS\nshows remarkable improvement over existing neural network estimators and\nclassical kernel methods in numerous density estimation and nonparametric\nregression models. Additionally, we offer theoretical justifications for VRS to\nsupport its ability to deliver nonparametric estimation with a reduced curse of\ndimensionality."}, "http://arxiv.org/abs/2401.11789": {"title": "Stein EWMA Control Charts for Count Processes", "link": "http://arxiv.org/abs/2401.11789", "description": "The monitoring of serially independent or autocorrelated count processes is\nconsidered, having a Poisson or (negative) binomial marginal distribution under\nin-control conditions. Utilizing the corresponding Stein identities,\nexponentially weighted moving-average (EWMA) control charts are constructed,\nwhich can be flexibly adapted to uncover zero inflation, over- or\nunderdispersion. The proposed Stein EWMA charts' performance is investigated by\nsimulations, and their usefulness is demonstrated by a real-world data example\nfrom health surveillance."}, "http://arxiv.org/abs/2401.11804": {"title": "Regression Copulas for Multivariate Responses", "link": "http://arxiv.org/abs/2401.11804", "description": "We propose a novel distributional regression model for a multivariate\nresponse vector based on a copula process over the covariate space. 
It uses the\nimplicit copula of a Gaussian multivariate regression, which we call a\n``regression copula''. To allow for large covariate vectors, their coefficients\nare regularized using a novel multivariate extension of the horseshoe prior.\nBayesian inference and distributional predictions are evaluated using efficient\nvariational inference methods, allowing application to large datasets. An\nadvantage of the approach is that the marginal distributions of the response\nvector can be estimated separately and accurately, resulting in predictive\ndistributions that are marginally-calibrated. Two substantive applications of\nthe methodology highlight its efficacy in multivariate modeling. The first is\nthe econometric modeling and prediction of half-hourly regional Australian\nelectricity prices. Here, our approach produces more accurate distributional\nforecasts than leading benchmark methods. The second is the evaluation of\nmultivariate posteriors in likelihood-free inference (LFI) of a model for tree\nspecies abundance data, extending a previous univariate regression copula LFI\nmethod. In both applications, we demonstrate that our new approach exhibits a\ndesirable marginal calibration property."}, "http://arxiv.org/abs/2401.11827": {"title": "Flexible Models for Simple Longitudinal Data", "link": "http://arxiv.org/abs/2401.11827", "description": "We propose a new method for estimating subject-specific mean functions from\nlongitudinal data. We aim to do this in a flexible manner (without restrictive\nassumptions about the shape of the subject-specific mean functions), while\nexploiting similarities in the mean functions between different subjects.\nFunctional principal components analysis fulfils both requirements, and methods\nfor functional principal components analysis have been developed for\nlongitudinal data. However, we find that these existing methods sometimes give\nfitted mean functions which are more complex than needed to provide a good fit\nto the data. We develop a new penalised likelihood approach to flexibly model\nlongitudinal data, with a penalty term to control the balance between fit to\nthe data and smoothness of the subject-specific mean curves. We run simulation\nstudies to demonstrate that the new method substantially improves the quality\nof inference relative to existing methods across a range of examples, and apply\nthe method to data on changes in body composition in adolescent girls."}, "http://arxiv.org/abs/2401.11842": {"title": "Subgroup analysis methods for time-to-event outcomes in heterogeneous randomized controlled trials", "link": "http://arxiv.org/abs/2401.11842", "description": "Non-significant randomized controlled trials can hide subgroups of good\nresponders to experimental drugs, thus hindering subsequent development.\nIdentifying such heterogeneous treatment effects is key for precision medicine,\nand many post-hoc analysis methods have been developed for that purpose. While\nseveral benchmarks have been carried out to identify the strengths and\nweaknesses of these methods, notably for binary and continuous endpoints,\nsimilar systematic empirical evaluations of subgroup analysis for time-to-event\nendpoints are lacking. This work aims to fill this gap by evaluating several\nsubgroup analysis algorithms in the context of time-to-event outcomes, by means\nof three different research questions: Is there heterogeneity? What are the\nbiomarkers responsible for such heterogeneity? Who are the good responders to\ntreatment? 
In this context, we propose a new synthetic and semi-synthetic data\ngeneration process that allows one to explore a wide range of heterogeneity\nscenarios with precise control over the level of heterogeneity. We provide an\nopen source Python package, available on GitHub, containing our generation\nprocess and our comprehensive benchmark framework. We hope this package will be\nuseful to the research community for future investigations of heterogeneity of\ntreatment effects and the benchmarking of subgroup analysis methods."}, "http://arxiv.org/abs/2401.11885": {"title": "Bootstrap prediction regions for daily curves of electricity demand and price using functional data", "link": "http://arxiv.org/abs/2401.11885", "description": "The aim of this paper is to compute one-day-ahead prediction regions for\ndaily curves of electricity demand and price. Three model-based procedures to\nconstruct general prediction regions are proposed, all of them using bootstrap\nalgorithms. The first proposed method considers any $L_p$ norm for functional\ndata to measure the distance between curves, the second one is designed to take\ndifferent variabilities along the curve into account, and the third one takes\nadvantage of the notion of depth for functional data. The regression model\nwith functional response on which our proposed prediction regions are based is\nrather general: it allows one to include both endogenous and exogenous functional\nvariables, as well as exogenous scalar variables; in addition, the effect of\nsuch variables on the response is modeled in a parametric, nonparametric or\nsemi-parametric way. A comparative study is carried out to analyse the\nperformance of these prediction regions for the electricity market of mainland\nSpain, in the year 2012. This work extends and complements the methods and results\nin Aneiros et al. (2016) (focused on curve prediction) and Vilar et al. (2018)\n(focused on prediction intervals), which use the same database as here."}, "http://arxiv.org/abs/2401.11948": {"title": "The Ensemble Kalman Filter for Dynamic Inverse Problems", "link": "http://arxiv.org/abs/2401.11948", "description": "In inverse problems, the goal is to estimate unknown model parameters from\nnoisy observational data. Traditionally, inverse problems are solved under the\nassumption of a fixed forward operator describing the observation model. In\nthis article, we consider the extension of this approach to situations where we\nhave a dynamic forward model, motivated by applications in scientific\ncomputation and engineering. We specifically consider this extension for a\nderivative-free optimizer, the ensemble Kalman inversion (EKI). We introduce\nand justify a new methodology called dynamic-EKI, which is a particle-based\nmethod with a changing forward operator. We analyze our new method, presenting\nresults related to the control of our particle system through its covariance\nstructure. This analysis includes moment bounds and an ensemble collapse, which\nare essential for demonstrating a convergence result. 
We establish convergence\nin expectation and validate our theoretical findings through experiments with\ndynamic-EKI applied to a 2D Darcy flow partial differential equation."}, "http://arxiv.org/abs/2401.12031": {"title": "Multi-objective optimisation using expected quantile improvement for decision making in disease outbreaks", "link": "http://arxiv.org/abs/2401.12031", "description": "Optimization under uncertainty is important in many applications,\nparticularly to inform policy and decision making in areas such as public\nhealth. A key source of uncertainty arises from the incorporation of\nenvironmental variables as inputs into computational models or simulators. Such\nvariables represent uncontrollable features of the optimization problem and\nreliable decision making must account for the uncertainty they propagate to the\nsimulator outputs. Often, multiple, competing objectives are defined from these\noutputs such that the final optimal decision is a compromise between different\ngoals.\n\nHere, we present emulation-based optimization methodology for such problems\nthat extends expected quantile improvement (EQI) to address multi-objective\noptimization. Focusing on the practically important case of two objectives, we\nuse a sequential design strategy to identify the Pareto front of optimal\nsolutions. Uncertainty from the environmental variables is integrated out using\nMonte Carlo samples from the simulator. Interrogation of the expected output\nfrom the simulator is facilitated by use of (Gaussian process) emulators. The\nmethodology is demonstrated on an optimization problem from public health\ninvolving the dispersion of anthrax spores across a spatial terrain.\nEnvironmental variables include meteorological features that impact the\ndispersion, and the methodology identifies the Pareto front even when there is\nconsiderable input uncertainty."}, "http://arxiv.org/abs/2401.12126": {"title": "Biological species delimitation based on genetic and spatial dissimilarity: a comparative study", "link": "http://arxiv.org/abs/2401.12126", "description": "The delimitation of biological species, i.e., deciding which individuals\nbelong to the same species and whether and how many different species are\nrepresented in a data set, is key to the conservation of biodiversity. Much\nexisting work uses only genetic data for species delimitation, often employing\nsome kind of cluster analysis. This can be misleading, because geographically\ndistant groups of individuals can be genetically quite different even if they\nbelong to the same species. This paper investigates the problem of testing\nwhether two potentially separated groups of individuals can belong to a single\nspecies or not based on genetic and spatial data. Various approaches are\ncompared (some of which already exist in the literature) based on simulated\nmetapopulations generated with SLiM and GSpace - two software packages that can\nsimulate spatially-explicit genetic data at an individual level. Approaches\ninvolve partial Mantel testing, maximum likelihood mixed-effects models with a\npopulation effect, and jackknife-based homogeneity tests. A key challenge is\nthat most tests perform on genetic and geographical distance data, violating\nstandard independence assumptions. 
Simulations showed that partial Mantel tests\nand mixed-effects models have larger power than jackknife-based methods, but\ntend to display type-I-error rates slightly above the significance level.\nMoreover, a multiple regression model neglecting the dependence in the\ndissimilarities did not show inflated type-I-error rate. An application on\nbrassy ringlets concludes the paper."}, "http://arxiv.org/abs/1412.0367": {"title": "Bayesian nonparametric modeling for mean residual life regression", "link": "http://arxiv.org/abs/1412.0367", "description": "The mean residual life function is a key functional for a survival\ndistribution. It has practically useful interpretation as the expected\nremaining lifetime given survival up to a particular time point, and it also\ncharacterizes the survival distribution. However, it has received limited\nattention in terms of inference methods under a probabilistic modeling\nframework. In this paper, we seek to provide general inference methodology for\nmean residual life regression. Survival data often include a set of predictor\nvariables for the survival response distribution, and in many cases it is\nnatural to include the covariates as random variables into the modeling. We\nthus propose a Dirichlet process mixture modeling approach for the joint\nstochastic mechanism of the covariates and survival responses. This approach\nimplies a flexible model structure for the mean residual life of the\nconditional response distribution, allowing general shapes for mean residual\nlife as a function of covariates given a specific time point, as well as a\nfunction of time given particular values of the covariate vector. To expand the\nscope of the modeling framework, we extend the mixture model to incorporate\ndependence across experimental groups, such as treatment and control groups.\nThis extension is built from a dependent Dirichlet process prior for the\ngroup-specific mixing distributions, with common locations and weights that\nvary across groups through latent bivariate beta distributed random variables.\nWe develop properties of the proposed regression models, and discuss methods\nfor prior specification and posterior inference. The different components of\nthe methodology are illustrated with simulated data sets. Moreover, the\nmodeling approach is applied to a data set comprising right censored survival\ntimes of patients with small cell lung cancer."}, "http://arxiv.org/abs/2110.00314": {"title": "Confounder importance learning for treatment effect inference", "link": "http://arxiv.org/abs/2110.00314", "description": "We address modelling and computational issues for multiple treatment effect\ninference under many potential confounders. A primary issue relates to\npreventing harmful effects from omitting relevant covariates (under-selection),\nwhile not running into over-selection issues that introduce substantial\nvariance and a bias related to the non-random over-inclusion of covariates. We\npropose a novel empirical Bayes framework for Bayesian model averaging that\nlearns from data the extent to which the inclusion of key covariates should be\nencouraged, specifically those highly associated to the treatments. A key\nchallenge is computational. We develop fast algorithms, including an\nExpectation-Propagation variational approximation and simple stochastic\ngradient optimization algorithms, to learn the hyper-parameters from data. 
Our\nframework uses widely-used ingredients and largely existing software, and it is\nimplemented within the R package mombf featured on CRAN. This work is motivated\nby and is illustrated in two applications. The first is the association between\nsalary variation and discriminatory factors. The second, that has been debated\nin previous works, is the association between abortion policies and crime. Our\napproach provides insights that differ from previous analyses especially in\nsituations with weaker treatment effects."}, "http://arxiv.org/abs/2201.12865": {"title": "Extremal Random Forests", "link": "http://arxiv.org/abs/2201.12865", "description": "Classical methods for quantile regression fail in cases where the quantile of\ninterest is extreme and only few or no training data points exceed it.\nAsymptotic results from extreme value theory can be used to extrapolate beyond\nthe range of the data, and several approaches exist that use linear regression,\nkernel methods or generalized additive models. Most of these methods break down\nif the predictor space has more than a few dimensions or if the regression\nfunction of extreme quantiles is complex. We propose a method for extreme\nquantile regression that combines the flexibility of random forests with the\ntheory of extrapolation. Our extremal random forest (ERF) estimates the\nparameters of a generalized Pareto distribution, conditional on the predictor\nvector, by maximizing a local likelihood with weights extracted from a quantile\nrandom forest. We penalize the shape parameter in this likelihood to regularize\nits variability in the predictor space. Under general domain of attraction\nconditions, we show consistency of the estimated parameters in both the\nunpenalized and penalized case. Simulation studies show that our ERF\noutperforms both classical quantile regression methods and existing regression\napproaches from extreme value theory. We apply our methodology to extreme\nquantile prediction for U.S. wage data."}, "http://arxiv.org/abs/2203.00144": {"title": "The Concordance Index decomposition: A measure for a deeper understanding of survival prediction models", "link": "http://arxiv.org/abs/2203.00144", "description": "The Concordance Index (C-index) is a commonly used metric in Survival\nAnalysis for evaluating the performance of a prediction model. In this paper,\nwe propose a decomposition of the C-index into a weighted harmonic mean of two\nquantities: one for ranking observed events versus other observed events, and\nthe other for ranking observed events versus censored cases. This decomposition\nenables a finer-grained analysis of the relative strengths and weaknesses\nbetween different survival prediction methods. The usefulness of this\ndecomposition is demonstrated through benchmark comparisons against classical\nmodels and state-of-the-art methods, together with the new variational\ngenerative neural-network-based method (SurVED) proposed in this paper. The\nperformance of the models is assessed using four publicly available datasets\nwith varying levels of censoring. Using the C-index decomposition and synthetic\ncensoring, the analysis shows that deep learning models utilize the observed\nevents more effectively than other models. This allows them to keep a stable\nC-index in different censoring levels. 
In contrast to such deep learning\nmethods, classical machine learning models deteriorate when the censoring level\ndecreases due to their inability to improve on ranking the events versus other\nevents."}, "http://arxiv.org/abs/2211.01799": {"title": "Statistical Inference for Scale Mixture Models via Mellin Transform Approach", "link": "http://arxiv.org/abs/2211.01799", "description": "This paper deals with statistical inference for scale mixture models. We\nstudy an estimation approach based on the Mellin -- Stieltjes transform that\ncan be applied to both discrete and absolutely continuous mixing distributions.\nThe accuracy of the corresponding estimate is analysed in terms of its expected\npointwise error. As an important technical result, we prove the analogue of the\nBerry -- Esseen inequality for the Mellin transforms. The proposed statistical\napproach is illustrated by numerical examples."}, "http://arxiv.org/abs/2211.16155": {"title": "High-Dimensional Block Diagonal Covariance Structure Detection Using Singular Vectors", "link": "http://arxiv.org/abs/2211.16155", "description": "The assumption of independent subvectors arises in many aspects of\nmultivariate analysis. In most real-world applications, however, we lack prior\nknowledge about the number of subvectors and the specific variables within each\nsubvector. Yet, testing all these combinations is not feasible. For example,\nfor a data matrix containing 15 variables, there are already 1 382 958 545\npossible combinations. Given that zero correlation is a necessary condition for\nindependence, independent subvectors exhibit a block diagonal covariance\nmatrix. This paper focuses on the detection of such block diagonal covariance\nstructures in high-dimensional data and therefore also identifies uncorrelated\nsubvectors. Our nonparametric approach exploits the fact that the structure of\nthe covariance matrix is mirrored by the structure of its eigenvectors.\nHowever, the true block diagonal structure is masked by noise in the sample\ncase. To address this problem, we propose to use sparse approximations of the\nsample eigenvectors to reveal the sparse structure of the population\neigenvectors. Notably, the right singular vectors of a data matrix with an\noverall mean of zero are identical to the sample eigenvectors of its covariance\nmatrix. Using sparse approximations of these singular vectors instead of the\neigenvectors makes the estimation of the covariance matrix obsolete. We\ndemonstrate the performance of our method through simulations and provide real\ndata examples. Supplementary materials for this article are available online."}, "http://arxiv.org/abs/2303.05659": {"title": "A marginal structural model for normal tissue complication probability", "link": "http://arxiv.org/abs/2303.05659", "description": "The goal of radiation therapy for cancer is to deliver the prescribed radiation\ndose to the tumor while minimizing dose to the surrounding healthy tissues. To\nevaluate treatment plans, the dose distribution to healthy organs is commonly\nsummarized as dose-volume histograms (DVHs). Normal tissue complication\nprobability (NTCP) modelling has centered around making patient-level risk\npredictions with features extracted from the DVHs, but few have considered\nadapting a causal framework to evaluate the safety of alternative treatment\nplans. 
We propose causal estimands for NTCP based on deterministic and\nstochastic interventions, as well as propose estimators based on marginal\nstructural models that impose bivariable monotonicity between dose, volume, and\ntoxicity risk. The properties of these estimators are studied through\nsimulations, and their use is illustrated in the context of radiotherapy\ntreatment of anal canal cancer patients."}, "http://arxiv.org/abs/2305.04141": {"title": "Geostatistical capture-recapture models", "link": "http://arxiv.org/abs/2305.04141", "description": "Methods for population estimation and inference have evolved over the past\ndecade to allow for the incorporation of spatial information when using\ncapture-recapture study designs. Traditional approaches to specifying spatial\ncapture-recapture (SCR) models often rely on an individual-based detection\nfunction that decays as a detection location is farther from an individual's\nactivity center. Traditional SCR models are intuitive because they incorporate\nmechanisms of animal space use based on their assumptions about activity\ncenters. We modify the SCR model to accommodate a wide range of space use\npatterns, including for those individuals that may exhibit traditional\nelliptical utilization distributions. Our approach uses underlying Gaussian\nprocesses to characterize the space use of individuals. This allows us to\naccount for multimodal and other complex space use patterns that may arise due\nto movement. We refer to this class of models as geostatistical\ncapture-recapture (GCR) models. We adapt a recursive computing strategy to fit\nGCR models to data in stages, some of which can be parallelized. This technique\nfacilitates implementation and leverages modern multicore and distributed\ncomputing environments. We demonstrate the application of GCR models by\nanalyzing both simulated data and a data set involving capture histories of\nsnowshoe hares in central Colorado, USA."}, "http://arxiv.org/abs/2308.03355": {"title": "Nonparametric Bayes multiresolution testing for high-dimensional rare events", "link": "http://arxiv.org/abs/2308.03355", "description": "In a variety of application areas, there is interest in assessing evidence of\ndifferences in the intensity of event realizations between groups. For example,\nin cancer genomic studies collecting data on rare variants, the focus is on\nassessing whether and how the variant profile changes with the disease subtype.\nMotivated by this application, we develop multiresolution nonparametric Bayes\ntests for differential mutation rates across groups. The multiresolution\napproach yields fast and accurate detection of spatial clusters of rare\nvariants, and our nonparametric Bayes framework provides great flexibility for\nmodeling the intensities of rare variants. Some theoretical properties are also\nassessed, including weak consistency of our Dirichlet Process-Poisson-Gamma\nmixture over multiple resolutions. 
Simulation studies illustrate excellent\nsmall sample properties relative to competitors, and we apply the method to\ndetect rare variants related to common variable immunodeficiency from whole\nexome sequencing data on 215 patients and over 60,027 control subjects."}, "http://arxiv.org/abs/2309.15316": {"title": "Leveraging Neural Networks to Profile Health Care Providers with Application to Medicare Claims", "link": "http://arxiv.org/abs/2309.15316", "description": "Encompassing numerous nationwide, statewide, and institutional initiatives in\nthe United States, provider profiling has evolved into a major health care\nundertaking with ubiquitous applications, profound implications, and\nhigh-stakes consequences. In line with such a significant profile, the\nliterature has accumulated a number of developments dedicated to enhancing the\nstatistical paradigm of provider profiling. Tackling wide-ranging profiling\nissues, these methods typically adjust for risk factors using linear\npredictors. While this approach is simple, it can be too restrictive to\ncharacterize complex and dynamic factor-outcome associations in certain\ncontexts. One such example arises from evaluating dialysis facilities treating\nMedicare beneficiaries with end-stage renal disease. It is of primary interest\nto consider how the coronavirus disease (COVID-19) affected 30-day unplanned\nreadmissions in 2020. The impact of COVID-19 on the risk of readmission varied\ndramatically across pandemic phases. To efficiently capture the variation while\nprofiling facilities, we develop a generalized partially linear model (GPLM)\nthat incorporates a neural network. Considering provider-level clustering, we\nimplement the GPLM as a stratified sampling-based stochastic optimization\nalgorithm that features accelerated convergence. Furthermore, an exact test is\ndesigned to identify under- and over-performing facilities, with an\naccompanying funnel plot to visualize profiles. The advantages of the proposed\nmethods are demonstrated through simulation experiments and profiling dialysis\nfacilities using 2020 Medicare claims from the United States Renal Data System."}, "http://arxiv.org/abs/2401.12369": {"title": "SubgroupTE: Advancing Treatment Effect Estimation with Subgroup Identification", "link": "http://arxiv.org/abs/2401.12369", "description": "Precise estimation of treatment effects is crucial for evaluating\nintervention effectiveness. While deep learning models have exhibited promising\nperformance in learning counterfactual representations for treatment effect\nestimation (TEE), a major limitation in most of these models is that they treat\nthe entire population as a homogeneous group, overlooking the diversity of\ntreatment effects across potential subgroups that have varying treatment\neffects. This limitation restricts the ability to precisely estimate treatment\neffects and provide subgroup-specific treatment recommendations. In this paper,\nwe propose a novel treatment effect estimation model, named SubgroupTE, which\nincorporates subgroup identification in TEE. SubgroupTE identifies\nheterogeneous subgroups with different treatment responses and more precisely\nestimates treatment effects by considering subgroup-specific causal effects. 
In\naddition, SubgroupTE iteratively optimizes subgrouping and treatment effect\nestimation networks to enhance both estimation and subgroup identification.\nComprehensive experiments on the synthetic and semi-synthetic datasets exhibit\nthe outstanding performance of SubgroupTE compared with the state-of-the-art\nmodels on treatment effect estimation. Additionally, a real-world study\ndemonstrates the capabilities of SubgroupTE in enhancing personalized treatment\nrecommendations for patients with opioid use disorder (OUD) by advancing\ntreatment effect estimation with subgroup identification."}, "http://arxiv.org/abs/2401.12420": {"title": "Rank-based estimators of global treatment effects for cluster randomized trials with multiple endpoints", "link": "http://arxiv.org/abs/2401.12420", "description": "Cluster randomization trials commonly employ multiple endpoints. When a\nsingle summary of treatment effects across endpoints is of primary interest,\nglobal hypothesis testing/effect estimation methods represent a common analysis\nstrategy. However, specification of the joint distribution required by these\nmethods is non-trivial, particularly when endpoint properties differ. We\ndevelop rank-based interval estimators for a global treatment effect referred\nto as the \"global win probability,\" or the probability that a treatment\nindividual responds better than a control individual on average. Using\nendpoint-specific ranks among the combined sample and within each arm, each\nindividual-level observation is converted to a \"win fraction\" which quantifies\nthe proportion of wins experienced over every observation in the comparison\narm. An individual's multiple observations are then replaced by a single\n\"global win fraction,\" constructed by averaging win fractions across endpoints.\nA linear mixed model is applied directly to the global win fractions to recover\npoint, variance, and interval estimates of the global win probability adjusted\nfor clustering. Simulation demonstrates our approach performs well concerning\ncoverage and type I error, and methods are easily implemented using standard\nsoftware. A case study using publicly available data is provided with\ncorresponding R and SAS code."}, "http://arxiv.org/abs/2401.12640": {"title": "Multilevel network meta-regression for general likelihoods: synthesis of individual and aggregate data with applications to survival analysis", "link": "http://arxiv.org/abs/2401.12640", "description": "Network meta-analysis combines aggregate data (AgD) from multiple randomised\ncontrolled trials, assuming that any effect modifiers are balanced across\npopulations. Individual patient data (IPD) meta-regression is the ``gold\nstandard'' method to relax this assumption, however IPD are frequently only\navailable in a subset of studies. Multilevel network meta-regression (ML-NMR)\nextends IPD meta-regression to incorporate AgD studies whilst avoiding\naggregation bias, but currently requires the aggregate-level likelihood to have\na known closed form. Notably, this prevents application to time-to-event\noutcomes.\n\nWe extend ML-NMR to individual-level likelihoods of any form, by integrating\nthe individual-level likelihood function over the AgD covariate distributions\nto obtain the respective marginal likelihood contributions. 
We illustrate with\ntwo examples of time-to-event outcomes, showing the performance of ML-NMR in a\nsimulated comparison with little loss of precision from a full IPD analysis,\nand demonstrating flexible modelling of baseline hazards using cubic M-splines\nwith synthetic data on newly diagnosed multiple myeloma.\n\nML-NMR is a general method for synthesising individual and aggregate level\ndata in networks of all sizes. Extension to general likelihoods, including for\nsurvival outcomes, greatly increases the applicability of the method. R and\nStan code is provided, and the methods are implemented in the multinma R\npackage."}, "http://arxiv.org/abs/2401.12697": {"title": "A Computationally Efficient Approach to False Discovery Rate Control and Power Maximisation via Randomisation and Mirror Statistic", "link": "http://arxiv.org/abs/2401.12697", "description": "Simultaneously performing variable selection and inference in\nhigh-dimensional regression models is an open challenge in statistics and\nmachine learning. The increasing availability of vast numbers of variables\nrequires the adoption of specific statistical procedures to accurately select\nthe most important predictors in a high-dimensional space, while controlling\nthe False Discovery Rate (FDR) arising from the underlying multiple hypothesis\ntesting. In this paper we propose the joint adoption of the Mirror Statistic\napproach to FDR control, coupled with outcome randomisation to maximise the\nstatistical power of the variable selection procedure. Through extensive\nsimulations we show how our proposed strategy allows us to combine the benefits of\nthe two techniques. The Mirror Statistic is a flexible method to control FDR,\nwhich only requires mild model assumptions, but requires two sets of\nindependent regression coefficient estimates, usually obtained after splitting\nthe original dataset. Outcome randomisation is an alternative to Data\nSplitting that allows one to generate two independent outcomes, which can then be\nused to estimate the coefficients that go into the construction of the Mirror\nStatistic. The combination of these two approaches provides increased testing\npower in a number of scenarios, such as highly correlated covariates and high\npercentages of active variables. Moreover, it is scalable to very\nhigh-dimensional problems, since the algorithm has a low memory footprint and\nonly requires a single run on the full dataset, as opposed to iterative\nalternatives such as Multiple Data Splitting."}, "http://arxiv.org/abs/2401.12753": {"title": "Optimal Confidence Bands for Shape-restricted Regression in Multidimensions", "link": "http://arxiv.org/abs/2401.12753", "description": "In this paper, we propose and study the construction of confidence bands for\nshape-constrained regression functions when the predictor is multivariate. In\nparticular, we consider the continuous multidimensional white noise model given\nby $d Y(\\mathbf{t}) = n^{1/2} f(\\mathbf{t}) \\,d\\mathbf{t} + d W(\\mathbf{t})$,\nwhere $Y$ is the observed stochastic process on $[0,1]^d$ ($d\\ge 1$), $W$ is\nthe standard Brownian sheet on $[0,1]^d$, and $f$ is the unknown function of\ninterest assumed to belong to a (shape-constrained) function class, e.g.,\ncoordinate-wise monotone functions or convex functions. The constructed\nconfidence bands are based on local kernel averaging with bandwidth chosen\nautomatically via a multivariate multiscale statistic. 
The confidence bands\nhave guaranteed coverage for every $n$ and for every member of the underlying\nfunction class. Under monotonicity/convexity constraints on $f$, the proposed\nconfidence bands automatically adapt (in terms of width) to the global and\nlocal (H\\\"{o}lder) smoothness and intrinsic dimensionality of the unknown $f$;\nthe bands are also shown to be optimal in a certain sense. These bands have\n(almost) parametric ($n^{-1/2}$) widths when the underlying function has\n``low-complexity'' (e.g., piecewise constant/affine)."}, "http://arxiv.org/abs/2401.12776": {"title": "Sub-model aggregation for scalable eigenvector spatial filtering: Application to spatially varying coefficient modeling", "link": "http://arxiv.org/abs/2401.12776", "description": "This study proposes a method for aggregating/synthesizing global and local\nsub-models for fast and flexible spatial regression modeling. Eigenvector\nspatial filtering (ESF) was used to model spatially varying coefficients and\nspatial dependence in the residuals by sub-model, while the generalized\nproduct-of-experts method was used to aggregate these sub-models. The major\nadvantages of the proposed method are as follows: (i) it is highly scalable for\nlarge samples in terms of accuracy and computational efficiency; (ii) it is\neasily implemented by estimating sub-models independently first and\naggregating/averaging them thereafter; and (iii) likelihood-based inference is\navailable because the marginal likelihood is available in closed-form. The\naccuracy and computational efficiency of the proposed method are confirmed\nusing Monte Carlo simulation experiments. This method was then applied to\nresidential land price analysis in Japan. The results demonstrate the\nusefulness of this method for improving the interpretability of spatially\nvarying coefficients. The proposed method is implemented in an R package\nspmoran (version 0.3.0 or later)."}, "http://arxiv.org/abs/2401.12827": {"title": "Distributed Empirical Likelihood Inference With or Without Byzantine Failures", "link": "http://arxiv.org/abs/2401.12827", "description": "Empirical likelihood is a very important nonparametric approach which is of\nwide application. However, it is hard and even infeasible to calculate the\nempirical log-likelihood ratio statistic with massive data. The main challenge\nis the calculation of the Lagrange multiplier. This motivates us to develop a\ndistributed empirical likelihood method by calculating the Lagrange multiplier\nin a multi-round distributed manner. It is shown that the distributed empirical\nlog-likelihood ratio statistic is asymptotically standard chi-squared under\nsome mild conditions. The proposed algorithm is communication-efficient and\nachieves the desired accuracy in a few rounds. Further, the distributed\nempirical likelihood method is extended to the case of Byzantine failures. A\nmachine selection algorithm is developed to identify the worker machines\nwithout Byzantine failures such that the distributed empirical likelihood\nmethod can be applied. 
The proposed methods are evaluated by numerical\nsimulations and illustrated with an analysis of airline on-time performance\nstudy and a surface climate analysis of Yangtze River Economic Belt."}, "http://arxiv.org/abs/2401.12836": {"title": "Empirical Likelihood Inference over Decentralized Networks", "link": "http://arxiv.org/abs/2401.12836", "description": "As a nonparametric statistical inference approach, empirical likelihood has\nbeen found very useful in numerous occasions. However, it encounters serious\ncomputational challenges when applied directly to the modern massive dataset.\nThis article studies empirical likelihood inference over decentralized\ndistributed networks, where the data are locally collected and stored by\ndifferent nodes. To fully utilize the data, this article fuses Lagrange\nmultipliers calculated in different nodes by employing a penalization\ntechnique. The proposed distributed empirical log-likelihood ratio statistic\nwith Lagrange multipliers solved by the penalized function is asymptotically\nstandard chi-squared under regular conditions even for a divergent machine\nnumber. Nevertheless, the optimization problem with the fused penalty is still\nhard to solve in the decentralized distributed network. To address the problem,\ntwo alternating direction method of multipliers (ADMM) based algorithms are\nproposed, which both have simple node-based implementation schemes.\nTheoretically, this article establishes convergence properties for proposed\nalgorithms, and further proves the linear convergence of the second algorithm\nin some specific network structures. The proposed methods are evaluated by\nnumerical simulations and illustrated with analyses of census income and Ford\ngobike datasets."}, "http://arxiv.org/abs/2401.12865": {"title": "Gridsemble: Selective Ensembling for False Discovery Rates", "link": "http://arxiv.org/abs/2401.12865", "description": "In this paper, we introduce Gridsemble, a data-driven selective ensembling\nalgorithm for estimating local false discovery rates (fdr) in large-scale\nmultiple hypothesis testing. Existing methods for estimating fdr often yield\ndifferent conclusions, yet the unobservable nature of fdr values prevents the\nuse of traditional model selection. There is limited guidance on choosing a\nmethod for a given dataset, making this an arbitrary decision in practice.\nGridsemble circumvents this challenge by ensembling a subset of methods with\nweights based on their estimated performances, which are computed on synthetic\ndatasets generated to mimic the observed data while including ground truth. We\ndemonstrate through simulation studies and an experimental application that\nthis method outperforms three popular R software packages with their default\nparameter values$\\unicode{x2014}$common choices given the current landscape.\nWhile our applications are in the context of high throughput transcriptomics,\nwe emphasize that Gridsemble is applicable to any use of large-scale multiple\nhypothesis testing, an approach that is utilized in many fields. We believe\nthat Gridsemble will be a useful tool for computing reliable estimates of fdr\nand for improving replicability in the presence of multiple hypotheses by\neliminating the need for an arbitrary choice of method. 
Gridsemble is\nimplemented in an open-source R software package available on GitHub at\njennalandy/gridsemblefdr."}, "http://arxiv.org/abs/2401.12905": {"title": "Estimating the construct validity of Principal Components Analysis", "link": "http://arxiv.org/abs/2401.12905", "description": "In many scientific disciplines, the features of interest cannot be observed\ndirectly, so must instead be inferred from observed behaviour. Latent variable\nanalyses are increasingly employed to systematise these inferences, and\nPrincipal Components Analysis (PCA) is perhaps the simplest and most popular of\nthese methods. Here, we examine how the assumptions that we are prepared to\nentertain, about the latent variable system, mediate the likelihood that\nPCA-derived components will capture the true sources of variance underlying\ndata. As expected, we find that this likelihood is excellent in the best case,\nand robust to empirically reasonable levels of measurement noise, but best-case\nperformance is also: (a) not robust to violations of the method's more\nprominent assumptions, of linearity and orthogonality; and also (b) requires\nthat other subtler assumptions be made, such as that the latent variables\nshould have varying importance, and that weights relating latent variables to\nobserved data have zero mean. Neither variance explained, nor replication in\nindependent samples, could reliably predict which (if any) PCA-derived\ncomponents will capture true sources of variance in data. We conclude by\ndescribing a procedure to fit these inferences more directly to empirical data,\nand use it to find that components derived via PCA from two different empirical\nneuropsychological datasets, are less likely to have meaningful referents in\nthe brain than we hoped."}, "http://arxiv.org/abs/2401.12911": {"title": "Pretraining and the Lasso", "link": "http://arxiv.org/abs/2401.12911", "description": "Pretraining is a popular and powerful paradigm in machine learning. As an\nexample, suppose one has a modest-sized dataset of images of cats and dogs, and\nplans to fit a deep neural network to classify them from the pixel features.\nWith pretraining, we start with a neural network trained on a large corpus of\nimages, consisting of not just cats and dogs but hundreds of other image types.\nThen we fix all of the network weights except for the top layer (which makes\nthe final classification) and train (or \"fine tune\") those weights on our\ndataset. This often results in dramatically better performance than the network\ntrained solely on our smaller dataset.\n\nIn this paper, we ask the question \"Can pretraining help the lasso?\". We\ndevelop a framework for the lasso in which an overall model is fit to a large\nset of data, and then fine-tuned to a specific task on a smaller dataset. This\nlatter dataset can be a subset of the original dataset, but does not need to\nbe. We find that this framework has a wide variety of applications, including\nstratified models, multinomial targets, multi-response models, conditional\naverage treatment estimation and even gradient boosting.\n\nIn the stratified model setting, the pretrained lasso pipeline estimates the\ncoefficients common to all groups at the first stage, and then group specific\ncoefficients at the second \"fine-tuning\" stage. We show that under appropriate\nassumptions, the support recovery rate of the common coefficients is superior\nto that of the usual lasso trained only on individual groups. 
This separate\nidentification of common and individual coefficients can also be useful for\nscientific understanding."}, "http://arxiv.org/abs/2401.12924": {"title": "Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection", "link": "http://arxiv.org/abs/2401.12924", "description": "This article delves into the analysis of performance and utilization of\nSupport Vector Machines (SVMs) for the critical task of forest fire detection\nusing image datasets. With the increasing threat of forest fires to ecosystems\nand human settlements, the need for rapid and accurate detection systems is of\nutmost importance. SVMs, renowned for their strong classification capabilities,\nexhibit proficiency in recognizing patterns associated with fire within images.\nBy training on labeled data, SVMs acquire the ability to identify distinctive\nattributes associated with fire, such as flames, smoke, or alterations in the\nvisual characteristics of the forest area. The document thoroughly examines the\nuse of SVMs, covering crucial elements like data preprocessing, feature\nextraction, and model training. It rigorously evaluates parameters such as\naccuracy, efficiency, and practical applicability. The knowledge gained from\nthis study aids in the development of efficient forest fire detection systems,\nenabling prompt responses and improving disaster management. Moreover, the\ncorrelation between SVM accuracy and the difficulties presented by\nhigh-dimensional datasets is carefully investigated, demonstrated through a\nrevealing case study. The relationship between accuracy scores and the\ndifferent resolutions used for resizing the training datasets has also been\ndiscussed in this article. These comprehensive studies result in a definitive\noverview of the difficulties faced and the potential sectors requiring further\nimprovement and focus."}, "http://arxiv.org/abs/2401.12937": {"title": "Are the Signs of Factor Loadings Arbitrary in Confirmatory Factor Analysis? Problems and Solutions", "link": "http://arxiv.org/abs/2401.12937", "description": "The replication crisis in social and behavioral sciences has raised concerns\nabout the reliability and validity of empirical studies. While research in the\nliterature has explored contributing factors to this crisis, the issues related\nto analytical tools have received less attention. This study focuses on a\nwidely used analytical tool - confirmatory factor analysis (CFA) - and\ninvestigates one issue that is typically overlooked in practice: accurately\nestimating factor-loading signs. Incorrect loading signs can distort the\nrelationship between observed variables and latent factors, leading to\nunreliable or invalid results in subsequent analyses. Our study aims to\ninvestigate and address the estimation problem of factor-loading signs in CFA\nmodels. Based on an empirical demonstration and Monte Carlo simulation studies,\nwe found current methods have drawbacks in estimating loading signs. To address\nthis problem, three solutions are proposed and proven to work effectively. The\napplications of these solutions are discussed and elaborated."}, "http://arxiv.org/abs/2401.12967": {"title": "Measure transport with kernel mean embeddings", "link": "http://arxiv.org/abs/2401.12967", "description": "Kalman filters constitute a scalable and robust methodology for approximate\nBayesian inference, matching first and second order moments of the target\nposterior. 
To improve the accuracy in nonlinear and non-Gaussian settings, we\nextend this principle to include more or different characteristics, based on\nkernel mean embeddings (KMEs) of probability measures into their corresponding\nHilbert spaces. Focusing on the continuous-time setting, we develop a family of\ninteracting particle systems (termed $\\textit{KME-dynamics}$) that bridge\nbetween the prior and the posterior, and that include the Kalman-Bucy filter as\na special case. A variant of KME-dynamics has recently been derived from an\noptimal transport perspective by Maurais and Marzouk, and we expose further\nconnections to (kernelised) diffusion maps, leading to a variational\nformulation of regression type. Finally, we conduct numerical experiments on\ntoy examples and the Lorenz-63 model, the latter of which show particular\npromise for a hybrid modification (called Kalman-adjusted KME-dynamics)."}, "http://arxiv.org/abs/2110.05475": {"title": "Bayesian hidden Markov models for latent variable labeling assignments in conflict research: application to the role ceasefires play in conflict dynamics", "link": "http://arxiv.org/abs/2110.05475", "description": "A crucial challenge for solving problems in conflict research is in\nleveraging the semi-supervised nature of the data that arise. Observed response\ndata such as counts of battle deaths over time indicate latent processes of\ninterest such as intensity and duration of conflicts, but defining and labeling\ninstances of these unobserved processes requires nuance and imprecision. The\navailability of such labels, however, would make it possible to study the\neffect of intervention-related predictors - such as ceasefires - directly on\nconflict dynamics (e.g., latent intensity) rather than through an intermediate\nproxy like observed counts of battle deaths. Motivated by this problem and the\nnew availability of the ETH-PRIO Civil Conflict Ceasefires data set, we propose\na Bayesian autoregressive (AR) hidden Markov model (HMM) framework as a\nsufficiently flexible machine learning approach for semi-supervised regime\nlabeling with uncertainty quantification. We motivate our approach by\nillustrating the way it can be used to study the role that ceasefires play in\nshaping conflict dynamics. This ceasefires data set is the first systematic and\nglobally comprehensive data on ceasefires, and our work is the first to analyze\nthis new data and to explore the effect of ceasefires on conflict dynamics in a\ncomprehensive and cross-country manner."}, "http://arxiv.org/abs/2112.14946": {"title": "A causal inference framework for spatial confounding", "link": "http://arxiv.org/abs/2112.14946", "description": "Recently, addressing spatial confounding has become a major topic in spatial\nstatistics. However, the literature has provided conflicting definitions, and\nmany proposed definitions do not address the issue of confounding as it is\nunderstood in causal inference. We define spatial confounding as the existence\nof an unmeasured causal confounder with a spatial structure. We present a\ncausal inference framework for nonparametric identification of the causal\neffect of a continuous exposure on an outcome in the presence of spatial\nconfounding. 
We propose double machine learning (DML), a procedure in which\nflexible models are used to regress both the exposure and outcome variables on\nconfounders to arrive at a causal estimator with favorable robustness\nproperties and convergence rates, and we prove that this approach is consistent\nand asymptotically normal under spatial dependence. As far as we are aware,\nthis is the first approach to spatial confounding that does not rely on\nrestrictive parametric assumptions (such as linearity, effect homogeneity, or\nGaussianity) for both identification and estimation. We demonstrate the\nadvantages of the DML approach analytically and in simulations. We apply our\nmethods and reasoning to a study of the effect of fine particulate matter\nexposure during pregnancy on birthweight in California."}, "http://arxiv.org/abs/2210.08228": {"title": "Nonparametric Estimation of Mediation Effects with A General Treatment", "link": "http://arxiv.org/abs/2210.08228", "description": "To investigate causal mechanisms, causal mediation analysis decomposes the\ntotal treatment effect into the natural direct and indirect effects. This paper\nexamines the estimation of the direct and indirect effects in a general\ntreatment effect model, where the treatment can be binary, multi-valued,\ncontinuous, or a mixture. We propose generalized weighting estimators with\nweights estimated by solving an expanding set of equations. Under some\nsufficient conditions, we show that the proposed estimators are consistent and\nasymptotically normal. Specifically, when the treatment is discrete, the\nproposed estimators attain the semiparametric efficiency bounds. Meanwhile,\nwhen the treatment is continuous, the convergence rates of the proposed\nestimators are slower than $N^{-1/2}$; however, they are still more efficient\nthan that constructed from the true weighting function. A simulation study\nreveals that our estimators exhibit a satisfactory finite-sample performance,\nwhile an application shows their practical value"}, "http://arxiv.org/abs/2212.11398": {"title": "Grace periods in comparative effectiveness studies of sustained treatments", "link": "http://arxiv.org/abs/2212.11398", "description": "Researchers are often interested in estimating the effect of sustained use of\na treatment on a health outcome. However, adherence to strict treatment\nprotocols can be challenging for individuals in practice and, when\nnon-adherence is expected, estimates of the effect of sustained use may not be\nuseful for decision making. As an alternative, more relaxed treatment protocols\nwhich allow for periods of time off treatment (i.e. grace periods) have been\nconsidered in pragmatic randomized trials and observational studies. In this\narticle, we consider the interpretation, identification, and estimation of\ntreatment strategies which include grace periods. We contrast natural grace\nperiod strategies which allow individuals the flexibility to take treatment as\nthey would naturally do, with stochastic grace period strategies in which the\ninvestigator specifies the distribution of treatment utilization. 
We estimate\nthe effect of initiation of a thiazide diuretic or an angiotensin-converting\nenzyme inhibitor in hypertensive individuals under various strategies which\ninclude grace periods."}, "http://arxiv.org/abs/2301.02739": {"title": "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values", "link": "http://arxiv.org/abs/2301.02739", "description": "Many testing problems are readily amenable to randomised tests such as those\nemploying data splitting. However despite their usefulness in principle,\nrandomised tests have obvious drawbacks. Firstly, two analyses of the same\ndataset may lead to different results. Secondly, the test typically loses power\nbecause it does not fully utilise the entire sample. As a remedy to these\ndrawbacks, we study how to combine the test statistics or p-values resulting\nfrom multiple random realisations such as through random data splits. We\ndevelop rank-transformed subsampling as a general method for delivering large\nsample inference about the combined statistic or p-value under mild\nassumptions. We apply our methodology to a wide range of problems, including\ntesting unimodality in high-dimensional data, testing goodness-of-fit of\nparametric quantile regression models, testing no direct effect in a\nsequentially randomised trial and calibrating cross-fit double machine learning\nconfidence intervals. In contrast to existing p-value aggregation schemes that\ncan be highly conservative, our method enjoys type-I error control that\nasymptotically approaches the nominal level. Moreover, compared to using the\nordinary subsampling, we show that our rank transform can remove the\nfirst-order bias in approximating the null under alternatives and greatly\nimprove power."}, "http://arxiv.org/abs/2303.07706": {"title": "On the Utility of Equal Batch Sizes for Inference in Stochastic Gradient Descent", "link": "http://arxiv.org/abs/2303.07706", "description": "Stochastic gradient descent (SGD) is an estimation tool for large data\nemployed in machine learning and statistics. Due to the Markovian nature of the\nSGD process, inference is a challenging problem. An underlying asymptotic\nnormality of the averaged SGD (ASGD) estimator allows for the construction of a\nbatch-means estimator of the asymptotic covariance matrix. Instead of the usual\nincreasing batch-size strategy employed in ASGD, we propose a memory efficient\nequal batch-size strategy and show that under mild conditions, the estimator is\nconsistent. A key feature of the proposed batching technique is that it allows\nfor bias-correction of the variance, at no cost to memory. Since joint\ninference for high dimensional problems may be undesirable, we present\nmarginal-friendly simultaneous confidence intervals, and show through an\nexample how covariance estimators of ASGD can be employed in improved\npredictions."}, "http://arxiv.org/abs/2307.04225": {"title": "Copula-like inference for discrete bivariate distributions with rectangular supports", "link": "http://arxiv.org/abs/2307.04225", "description": "After reviewing a large body of literature on the modeling of bivariate\ndiscrete distributions with finite support, \\cite{Gee20} made a compelling case\nfor the use of $I$-projections in the sense of \\cite{Csi75} as a sound way to\nattempt to decompose a bivariate probability mass function (p.m.f.) into its\ntwo univariate margins and a bivariate p.m.f.\\ with uniform margins playing the\nrole of a discrete copula. 
From a practical perspective, the necessary\n$I$-projections on Fr\\'echet classes can be carried out using the iterative\nproportional fitting procedure (IPFP), also known as Sinkhorn's algorithm or\nmatrix scaling in the literature. After providing conditions under which a\nbivariate p.m.f.\\ can be decomposed in the aforementioned sense, we\ninvestigate, for starting bivariate p.m.f.s with rectangular supports,\nnonparametric and parametric estimation procedures as well as goodness-of-fit\ntests for the underlying discrete copula. Related asymptotic results are\nprovided and build upon a differentiability result for $I$-projections on\nFr\\'echet classes which can be of independent interest. Theoretical results are\ncomplemented by finite-sample experiments and a data example."}, "http://arxiv.org/abs/2309.08783": {"title": "Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression", "link": "http://arxiv.org/abs/2309.08783", "description": "Sparse linear regression methods for high-dimensional data commonly assume\nthat residuals have constant variance, which can be violated in practice. For\nexample, Aphasia Quotient (AQ) is a critical measure of language impairment and\ninforms treatment decisions, but it is challenging to measure in stroke\npatients. It is of interest to use high-resolution T2 neuroimages of brain\ndamage to predict AQ. However, sparse regression models show marked evidence of\nheteroscedastic error even after transformations are applied. This violation of\nthe homoscedasticity assumption can lead to bias in estimated coefficients,\nprediction intervals (PI) with improper length, and increased type I errors.\nBayesian heteroscedastic linear regression models relax the homoscedastic error\nassumption but can enforce restrictive prior assumptions on parameters, and\nmany are computationally infeasible in the high-dimensional setting. This paper\nproposes estimating high-dimensional heteroscedastic linear regression models\nusing a heteroscedastic partitioned empirical Bayes Expectation Conditional\nMaximization (H-PROBE) algorithm. H-PROBE is a computationally efficient\nmaximum a posteriori estimation approach that requires minimal prior\nassumptions and can incorporate covariates hypothesized to impact\nheterogeneity. We apply this method by using high-dimensional neuroimages to\npredict and provide PIs for AQ that accurately quantify predictive uncertainty.\nOur analysis demonstrates that H-PROBE can provide narrower PI widths than\nstandard methods without sacrificing coverage. Narrower PIs are clinically\nimportant for determining the risk of moderate to severe aphasia. Additionally,\nthrough extensive simulation studies, we exhibit that H-PROBE results in\nsuperior prediction, variable selection, and predictive inference compared to\nalternative methods."}, "http://arxiv.org/abs/2401.13009": {"title": "Comparative Study of Causal Discovery Methods for Cyclic Models with Hidden Confounders", "link": "http://arxiv.org/abs/2401.13009", "description": "Nowadays, the need for causal discovery is ubiquitous. A better understanding\nof not just the stochastic dependencies between parts of a system, but also the\nactual cause-effect relations, is essential for all parts of science. Thus, the\nneed for reliable methods to detect causal directions is growing constantly. 
In\nthe last 50 years, many causal discovery algorithms have emerged, but most of\nthem are applicable only under the assumption that the systems have no feedback\nloops and that they are causally sufficient, i.e. that there are no unmeasured\nsubsystems that can affect multiple measured variables. This is unfortunate\nsince those restrictions often cannot be presumed in practice. Feedback is an\nintegral feature of many processes, and real-world systems are rarely\ncompletely isolated and fully measured. Fortunately, in recent years, several\ntechniques that can cope with cyclic, causally insufficient systems have been\ndeveloped. With multiple methods available, a practical application of\nthose algorithms now requires knowledge of the respective strengths and\nweaknesses. Here, we focus on the problem of causal discovery for sparse linear\nmodels which are allowed to have cycles and hidden confounders. We have\nprepared a comprehensive and thorough comparative study of four causal\ndiscovery techniques: two versions of the LLC method [10] and two variants of\nthe ASP-based algorithm [11]. The evaluation investigates the performance of\nthose techniques for various experiments with multiple interventional setups\nand different dataset sizes."}, "http://arxiv.org/abs/2401.13010": {"title": "Bartholomew's trend test -- approximated by a multiple contrast test", "link": "http://arxiv.org/abs/2401.13010", "description": "Bartholomew's trend test belongs to the broad class of isotonic regression\nmodels, specifically with a single qualitative factor, e.g. dose levels. Using\nthe approximation of the ANOVA F-test by the maximum contrast test against the\ngrand mean and pool-adjacent-violator estimates under order restriction, an\neasier-to-use approximation is proposed."}, "http://arxiv.org/abs/2401.13045": {"title": "Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?", "link": "http://arxiv.org/abs/2401.13045", "description": "Over the past decade, the intricacies of sports-related concussions among\nfemale athletes have become readily apparent. Traditional clinical methods for\ndiagnosing concussions suffer limitations when applied to female athletes,\noften failing to capture subtle changes in brain structure and function.\nAdvanced neuroinformatics techniques and machine learning models have become\ninvaluable assets in this endeavor. While these technologies have been\nextensively employed in understanding concussion in male athletes, there\nremains a significant gap in our comprehension of their effectiveness for\nfemale athletes. With its remarkable data analysis capacity, machine learning\noffers a promising avenue to bridge this deficit. By harnessing the power of\nmachine learning, researchers can link observed phenotypic neuroimaging data to\nsex-specific biological mechanisms, unraveling the mysteries of concussions in\nfemale athletes. Furthermore, embedding methods within machine learning enable\nexamining brain architecture and its alterations beyond the conventional\nanatomical reference frame. This, in turn, allows researchers to gain deeper insights\ninto the dynamics of concussions, treatment responses, and recovery processes.\nTo guarantee that female athletes receive the optimal care they deserve,\nresearchers must employ advanced neuroimaging techniques and sophisticated\nmachine-learning models. 
These tools enable an in-depth investigation of the\nunderlying mechanisms responsible for concussion symptoms stemming from\nneuronal dysfunction in female athletes. This paper endeavors to address the\ncrucial issue of sex differences in multimodal neuroimaging experimental design\nand machine learning approaches within female athlete populations, ultimately\nensuring that they receive the tailored care they require when facing the\nchallenges of concussions."}, "http://arxiv.org/abs/2401.13090": {"title": "Variational Estimation for Multidimensional Generalized Partial Credit Model", "link": "http://arxiv.org/abs/2401.13090", "description": "Multidimensional item response theory (MIRT) models have generated increasing\ninterest in the psychometrics literature. Efficient approaches for estimating\nMIRT models with dichotomous responses have been developed, but constructing an\nequally efficient and robust algorithm for polytomous models has received\nlimited attention. To address this gap, this paper presents a novel Gaussian\nvariational estimation algorithm for the multidimensional generalized partial\ncredit model (MGPCM). The proposed algorithm demonstrates both fast and\naccurate performance, as illustrated through a series of simulation studies and\ntwo real data analyses."}, "http://arxiv.org/abs/2401.13094": {"title": "On cross-validated estimation of skew normal model", "link": "http://arxiv.org/abs/2401.13094", "description": "Skew normal model suffers from inferential drawbacks, namely singular Fisher\ninformation in the vicinity of symmetry and diverging of maximum likelihood\nestimation. To address the above drawbacks, Azzalini and Arellano-Valle (2013)\nintroduced maximum penalised likelihood estimation (MPLE) by subtracting a\npenalty function from the log-likelihood function with a pre-specified penalty\ncoefficient. Here, we propose a cross-validated MPLE to improve its performance\nwhen the underlying model is close to symmetry. We develop a theory for MPLE,\nwhere an asymptotic rate for the cross-validated penalty coefficient is\nderived. We further show that the proposed cross-validated MPLE is\nasymptotically efficient under certain conditions. In simulation studies and a\nreal data application, we demonstrate that the proposed estimator can\noutperform the conventional MPLE when the model is close to symmetry."}, "http://arxiv.org/abs/2401.13208": {"title": "Assessing Influential Observations in Pain Prediction using fMRI Data", "link": "http://arxiv.org/abs/2401.13208", "description": "Influential diagnosis is an integral part of data analysis, of which most\nexisting methodological frameworks presume a deterministic submodel and are\ndesigned for low-dimensional data (i.e., the number of predictors p smaller\nthan the sample size n). However, the stochastic selection of a submodel from\nhigh-dimensional data where p exceeds n has become ubiquitous. Thus, methods\nfor identifying observations that could exert undue influence on the choice of\na submodel can play an important role in this setting. To date, discussion of\nthis topic has been limited, falling short in two domains: (i) constrained\nability to detect multiple influential points, and (ii) applicability only in\nrestrictive settings. After describing the problem, we characterize and\nformalize the concept of influential observations on variable selection. 
Then,\nwe propose a generalized diagnostic measure, extended from an available metric\naccommodating different model selectors and multiple influential observations,\nthe asymptotic distribution of which is subsequently established for large p, thus\nproviding guidelines to ascertain influential observations. A high-dimensional\nclustering procedure is further incorporated into our proposed scheme to detect\nmultiple influential points. Simulations are conducted to assess the performance\nof various diagnostic approaches. The proposed procedure further demonstrates\nits value in improving predictive power when analyzing thermal-stimulated pain\nbased on fMRI data."}, "http://arxiv.org/abs/2401.13379": {"title": "An Ising Similarity Regression Model for Modeling Multivariate Binary Data", "link": "http://arxiv.org/abs/2401.13379", "description": "Understanding the dependence structure between response variables is an\nimportant component in the analysis of correlated multivariate data. This\narticle focuses on modeling dependence structures in multivariate binary data,\nmotivated by a study aiming to understand how patterns in different U.S.\nsenators' votes are determined by similarities (or lack thereof) in their\nattributes, e.g., political parties and social network profiles. To address\nsuch a research question, we propose a new Ising similarity regression model\nwhich regresses pairwise interaction coefficients in the Ising model against a\nset of similarity measures available/constructed from covariates. Model\nselection approaches are further developed through regularizing the\npseudo-likelihood function with an adaptive lasso penalty to enable the\nselection of relevant similarity measures. We establish estimation and\nselection consistency of the proposed estimator under a general setting where\nthe number of similarity measures and responses tend to infinity. A simulation\nstudy demonstrates the strong finite sample performance of the proposed\nestimator in terms of parameter estimation and similarity selection. Applying\nthe Ising similarity regression model to a dataset of roll call voting records\nof 100 U.S. senators, we are able to quantify how similarities in senators'\nparties, businessman occupations and social network profiles drive their voting\nassociations."}, "http://arxiv.org/abs/2201.05828": {"title": "Adaptive procedures for directional false discovery rate control", "link": "http://arxiv.org/abs/2201.05828", "description": "In multiple hypothesis testing, it is well known that adaptive procedures can\nenhance power via incorporating information about the number of true nulls\npresent. Under independence, we establish that two adaptive false discovery\nrate (FDR) methods, upon augmenting sign declarations, also offer directional\nfalse discovery rate (FDR$_\\text{dir}$) control in the strong sense. Such\nFDR$_\\text{dir}$ controlling properties are appealing because adaptive\nprocedures have the greatest potential to reap substantial gain in power when\nthe underlying parameter configurations contain little to no true nulls, which\nare precisely settings where the FDR$_\\text{dir}$ is an arguably more\nmeaningful error rate to be controlled than the FDR."}, "http://arxiv.org/abs/2212.02306": {"title": "Robust multiple method comparison and transformation", "link": "http://arxiv.org/abs/2212.02306", "description": "A generalization of Passing-Bablok regression is proposed for comparing\nmultiple measurement methods simultaneously. 
Possible applications include\nassay migration studies or interlaboratory trials. When comparing only two\nmethods, the method reduces to the usual Passing-Bablok estimator. It is close\nin spirit to reduced major axis regression, which is, however, not robust. To\nobtain a robust estimator, the major axis is replaced by the (hyper-)spherical\nmedian axis. The method is shown to reduce to the usual Passing-Bablok\nestimator if only two methods are compared. This technique has been applied to\ncompare SARS-CoV-2 serological tests, bilirubin in neonates, and an in vitro\ndiagnostic test using different instruments, sample preparations, and reagent\nlots. In addition, plots similar to the well-known Bland-Altman plots have been\ndeveloped to represent the variance structure."}, "http://arxiv.org/abs/2305.15936": {"title": "Learning DAGs from Data with Few Root Causes", "link": "http://arxiv.org/abs/2305.15936", "description": "We present a novel perspective and algorithm for learning directed acyclic\ngraphs (DAGs) from data generated by a linear structural equation model (SEM).\nFirst, we show that a linear SEM can be viewed as a linear transform that, in\nprior work, computes the data from a dense input vector of random valued root\ncauses (as we will call them) associated with the nodes. Instead, we consider\nthe case of (approximately) few root causes and also introduce noise in the\nmeasurement of the data. Intuitively, this means that the DAG data is produced\nby few data-generating events whose effect percolates through the DAG. We prove\nidentifiability in this new setting and show that the true DAG is the global\nminimizer of the $L^0$-norm of the vector of root causes. For data with few\nroot causes, with and without noise, we show superior performance compared to\nprior DAG learning methods."}, "http://arxiv.org/abs/2307.02096": {"title": "Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo", "link": "http://arxiv.org/abs/2307.02096", "description": "Hamiltonian Monte Carlo (HMC) is a powerful tool for Bayesian statistical\ninference due to its potential to rapidly explore high dimensional state space,\navoiding the random walk behavior typical of many Markov Chain Monte Carlo\nsamplers. The proper choice of the integrator of the Hamiltonian dynamics is\nkey to the efficiency of HMC. It is becoming increasingly clear that\nmulti-stage splitting integrators are a good alternative to the Verlet method,\ntraditionally used in HMC. Here we propose a principled way of finding optimal,\nproblem-specific integration schemes (in terms of the best conservation of\nenergy for harmonic forces/Gaussian targets) within the families of 2- and\n3-stage splitting integrators. The method, which we call Adaptive Integration\nApproach for statistics, or s-AIA, uses a multivariate Gaussian model and\nsimulation data obtained at the HMC burn-in stage to identify a system-specific\ndimensional stability interval and assigns the most appropriate 2-/3-stage\nintegrator for any user-chosen simulation step size within that interval. s-AIA\nhas been implemented in the in-house software package HaiCS without introducing\ncomputational overheads in the simulations. The efficiency of the s-AIA\nintegrators and their impact on the HMC accuracy, sampling performance and\nconvergence are discussed in comparison with known fixed-parameter multi-stage\nsplitting integrators (including Verlet). 
Numerical experiments on well-known\nstatistical models show that the adaptive schemes reach the best possible\nperformance within the family of 2- and 3-stage splitting schemes."}, "http://arxiv.org/abs/2401.13777": {"title": "Revisiting the memoryless property -- testing for the Pareto type I distribution", "link": "http://arxiv.org/abs/2401.13777", "description": "We propose new goodness-of-fit tests for the Pareto type I distribution.\nThese tests are based on a multiplicative version of the memoryless property\nwhich characterises this distribution. We present the results of a Monte Carlo\npower study demonstrating that the proposed tests are powerful compared to\nexisting tests. As a result of independent interest, we demonstrate that tests\nspecifically developed for the Pareto type I distribution substantially\noutperform tests for exponentiality applied to log-transformed data (since\nPareto type I distributed values can be transformed to exponentiality via a\nsimple log-transformation). Specifically, the newly proposed tests based on the\nmultiplicative memoryless property of the Pareto distribution substantially\noutperform a test based on the memoryless property of the exponential\ndistribution. The practical use of the tests is illustrated by testing the\nhypothesis that two sets of observed golfers' earnings (those of the PGA and\nLIV tours) are realised from Pareto distributions."}, "http://arxiv.org/abs/2401.13787": {"title": "Bayesian Analysis of the Beta Regression Model Subject to Linear Inequality Restrictions with Application", "link": "http://arxiv.org/abs/2401.13787", "description": "Recent studies in machine learning are based on models in which parameters\nor state variables are bounded or restricted. These restrictions arise from prior\ninformation to ensure the validity of scientific theories or structural\nconsistency based on physical phenomena. The valuable information contained in\nthe restrictions must be considered during the estimation process to improve\nestimation accuracy. Many researchers have focused on linear regression models\nsubject to linear inequality restrictions, but generalized linear models have\nreceived little attention. In this paper, the parameters of Bayesian beta\nregression models subject to linear inequality restrictions are estimated.\nThe proposed Bayesian restricted estimator, as demonstrated by simulation\nstudies, outperforms ordinary estimators. Even in the presence of\nmulticollinearity, it outperforms the ridge estimator in terms of the standard\ndeviation and the mean squared error. The results confirm that the proposed\nBayesian restricted estimator induces sparsity in parameter estimation without\nusing a regularization penalty. Finally, a real data set is analyzed by the\nnewly proposed Bayesian estimation method."}, "http://arxiv.org/abs/2401.13820": {"title": "A Bayesian hierarchical mixture cure modelling framework to utilize multiple survival datasets for long-term survivorship estimates: A case study from previously untreated metastatic melanoma", "link": "http://arxiv.org/abs/2401.13820", "description": "Time to an event of interest over a lifetime is a central measure of the\nclinical benefit of an intervention used in a health technology assessment\n(HTA). Within the same trial, multiple end-points may also be considered. For\nexample, overall and progression-free survival time for different drugs in\noncology studies. 
A common challenge arises when an intervention is only effective\nfor some proportion of the population, who are not clinically identifiable.\nTherefore, latent group membership as well as separate survival models for the\nidentified groups need to be estimated. However, follow-up in trials may be\nrelatively short, leading to substantial censoring. We present a general\nBayesian hierarchical framework that can handle this complexity by exploiting\nthe similarity of cure fractions between end-points, accounting for the\ncorrelation between them and improving the extrapolation beyond the observed\ndata. Assuming exchangeability between cure fractions facilitates the borrowing\nof information between end-points. We show the benefits of using our approach\nwith a motivating example, the CheckMate 067 phase 3 trial consisting of\npatients with metastatic melanoma treated with first-line therapy."}, "http://arxiv.org/abs/2401.13890": {"title": "Discrete Hawkes process with flexible residual distribution and filtered historical simulation", "link": "http://arxiv.org/abs/2401.13890", "description": "We introduce a new model which can be considered as an extended version of the\nHawkes process in a discrete sense. This model enables the integration of\nvarious residual distributions while preserving the fundamental properties of\nthe original Hawkes process. The rich nature of this model enables a filtered\nhistorical simulation which incorporates the properties of the original time series\nmore accurately. The process naturally extends to multi-variate models with\neasy implementations of estimation and simulation. We investigate the effect of\nflexible residual distributions on the estimation of high-frequency financial data\ncompared with the Hawkes process."}, "http://arxiv.org/abs/2401.13929": {"title": "Reinforcement Learning with Hidden Markov Models for Discovering Decision-Making Dynamics", "link": "http://arxiv.org/abs/2401.13929", "description": "Major depressive disorder (MDD) presents challenges in diagnosis and\ntreatment due to its complex and heterogeneous nature. Emerging evidence\nindicates that reward processing abnormalities may serve as a behavioral marker\nfor MDD. To measure reward processing, patients perform computer-based\nbehavioral tasks that involve making choices or responding to stimuli that\nare associated with different outcomes. Reinforcement learning (RL) models are\nfitted to extract parameters that measure various aspects of reward processing\nto characterize how patients make decisions in behavioral tasks. Recent\nfindings suggest the inadequacy of characterizing reward learning solely based\non a single RL model; instead, there may be a switching of decision-making\nprocesses between multiple strategies. An important scientific question is how\nthe dynamics of learning strategies in decision-making affect the reward\nlearning ability of individuals with MDD. Motivated by the probabilistic reward\ntask (PRT) within the EMBARC study, we propose a novel RL-HMM framework for\nanalyzing reward-based decision-making. Our model accommodates learning\nstrategy switching between two distinct approaches under a hidden Markov model\n(HMM): subjects making decisions based on the RL model or opting for random\nchoices. We account for a continuous RL state space and allow time-varying\ntransition probabilities in the HMM. We introduce a computationally efficient\nEM algorithm for parameter estimation and employ a nonparametric bootstrap for\ninference. 
We apply our approach to the EMBARC study to show that MDD patients\nare less engaged in RL compared to healthy controls, and that engagement is\nassociated with brain activity in the negative affect circuitry during an\nemotional conflict task."}, "http://arxiv.org/abs/2401.13943": {"title": "Is the age pension in Australia sustainable and fair? Evidence from forecasting the old-age dependency ratio using the Hamilton-Perry model", "link": "http://arxiv.org/abs/2401.13943", "description": "The age pension aims to assist eligible elderly Australians who meet specific age\nand residency criteria in maintaining basic living standards. In designing\nefficient pension systems, government policymakers seek to satisfy the\nexpectations of the overall aging population in Australia. However, the\npopulation's unique demographic characteristics at the state and territory\nlevel are often overlooked due to the lack of available data. We use the\nHamilton-Perry model, which requires minimal input, to model and forecast the\nevolution of age-specific populations at the state level. We also integrate the\nobtained sub-national demographic information to determine sustainable pension\nages up to 2051. We further investigate pension welfare distribution in all states\nand territories to identify disadvantaged residents under the current pension\nsystem. Using the sub-national mortality data for Australia from 1971 to 2021\nobtained from AHMD (2023), we implement the Hamilton-Perry model with the help\nof functional time series forecasting techniques. With forecasts of\nage-specific population sizes for each state and territory, we compute the old-age\ndependency ratio to determine the nationwide sustainable pension age."}, "http://arxiv.org/abs/2401.13975": {"title": "Sparse signal recovery and source localization via covariance learning", "link": "http://arxiv.org/abs/2401.13975", "description": "In the Multiple Measurements Vector (MMV) model, measurement vectors are\nconnected to unknown, jointly sparse signal vectors through a linear regression\nmodel employing a single known measurement matrix (or dictionary). Typically,\nthe number of atoms (columns of the dictionary) is greater than the number of\nmeasurements and the sparse signal recovery problem is generally ill-posed. In\nthis paper, we treat the signals and measurement noise as independent Gaussian\nrandom vectors with unknown signal covariance matrix and noise variance,\nrespectively, and derive a fixed point (FP) equation for solving the likelihood\nequation for signal powers, thereby enabling the recovery of the sparse signal\nsupport (sources with non-zero variances). Two practical algorithms, a block\ncoordinate descent (BCD) algorithm and a cyclic coordinate descent (CCD) algorithm, which\nleverage the FP characterization of the likelihood equation, are then\nproposed. Additionally, a greedy pursuit method, analogous to the popular\nsimultaneous orthogonal matching pursuit (OMP), is introduced. 
Our numerical\nexamples demonstrate effectiveness of the proposed covariance learning (CL)\nalgorithms both in classic sparse signal recovery as well as in\ndirection-of-arrival (DOA) estimation problems where they perform favourably\ncompared to the state-of-the-art algorithms under a broad variety of settings."}, "http://arxiv.org/abs/2401.14052": {"title": "Testing Alpha in High Dimensional Linear Factor Pricing Models with Dependent Observations", "link": "http://arxiv.org/abs/2401.14052", "description": "In this study, we introduce three distinct testing methods for testing alpha\nin high dimensional linear factor pricing model that deals with dependent data.\nThe first method is a sum-type test procedure, which exhibits high performance\nwhen dealing with dense alternatives. The second method is a max-type test\nprocedure, which is particularly effective for sparse alternatives. For a\nbroader range of alternatives, we suggest a Cauchy combination test procedure.\nThis is predicated on the asymptotic independence of the sum-type and max-type\ntest statistics. Both simulation studies and practical data application\ndemonstrate the effectiveness of our proposed methods when handling dependent\nobservations."}, "http://arxiv.org/abs/2401.14094": {"title": "ODC and ROC curves, comparison curves, and stochastic dominance", "link": "http://arxiv.org/abs/2401.14094", "description": "We discuss two novel approaches to the classical two-sample problem. Our\nstarting point are properly standardized and combined, very popular in several\nareas of statistics and data analysis, ordinal dominance and receiver\ncharacteristic curves, denoted by ODC and ROC, respectively. The proposed new\ncurves are termed the comparison curves. Their estimates, being weighted rank\nprocesses on (0,1), form the basis of inference. These weighted processes are\nintuitive, well-suited for visual inspection of data at hand, and are also\nuseful for constructing some formal inferential procedures. They can be applied\nto several variants of two-sample problem. Their use can help to improve some\nexisting procedures both in terms of power and the ability to identify the\nsources of departures from the postulated model. To simplify interpretation of\nfinite sample results we restrict attention to values of the processes on a\nfinite grid of points. This results in the so-called bar plots (B-plots) which\nreadably summarize the information contained in the data. What is more, we show\nthat B-plots along with adjusted simultaneous acceptance regions provide\nprincipled information about where the model departs from the data. This leads\nto a framework which facilitates identification of regions with locally\nsignificant differences.\n\nWe show an implementation of the considered techniques to a standard\nstochastic dominance testing problem. Some min-type statistics are introduced\nand investigated. A simulation study compares two tests pertinent to the\ncomparison curves to well-established tests in the literature and demonstrates\nthe strong and competitive performance of the former in many typical\nsituations. Some real data applications illustrate simplicity and practical\nusefulness of the proposed approaches. 
A range of other applications of the\nconsidered weighted processes is briefly discussed too."}, "http://arxiv.org/abs/2401.14122": {"title": "On a Novel Skewed Generalized t Distribution: Properties, Estimations and its Applications", "link": "http://arxiv.org/abs/2401.14122", "description": "With the progress of information technology, large amounts of asymmetric,\nleptokurtic and heavy-tailed data are arising in various fields, such as\nfinance, engineering, genetics and medicine. It is very challenging to model\nthose kinds of data, especially for extremely skewed data accompanied by very\nhigh kurtosis or heavy tails. In this paper, we propose a novel class of skewed\ngeneralized t distributions (SkeGTD) as a scale mixture of the skewed generalized\nnormal distribution. The proposed SkeGTD has excellent adaptiveness to various data, because\nof its capability to allow for a large range of skewness and kurtosis and\nits separate location, scale, skewness and shape\nparameters. We investigate some important properties of this family of\ndistributions. Maximum likelihood estimation, L-moments estimation and\ntwo-step estimation for the SkeGTD are explored. To illustrate the usefulness\nof the proposed methodology, we present simulation studies and analyze two real\ndatasets."}, "http://arxiv.org/abs/2401.14294": {"title": "Heteroscedasticity-aware stratified sampling to improve uplift modeling", "link": "http://arxiv.org/abs/2401.14294", "description": "In many business applications, including online marketing and customer churn\nprevention, randomized controlled trials (RCTs) are conducted to investigate\nthe effect of specific treatments (coupon offers, advertisement\nmailings, ...). Such RCTs allow for the estimation of average treatment effects\nas well as the training of (uplift) models for the heterogeneity of treatment\neffects between individuals. The problem with these RCTs is that they are\ncostly, and this cost increases with the number of individuals included in the\nRCT. For this reason, there is research on how to conduct experiments involving a\nsmall number of individuals while still obtaining precise treatment effect\nestimates. We contribute to this literature a heteroskedasticity-aware\nstratified sampling (HS) scheme, which leverages the fact that different\nindividuals have different noise levels in their outcome and that precise treatment\neffect estimation requires more observations from the \"high-noise\" individuals\nthan from the \"low-noise\" individuals. By theory as well as by empirical\nexperiments, we demonstrate that our HS-sampling yields significantly more\nprecise estimates of the ATE, improves uplift models and makes their evaluation\nmore reliable compared to RCT data sampled completely randomly. Due to the\nrelative ease of application and the significant benefits, we expect\nHS-sampling to be valuable in many real-world applications."}, "http://arxiv.org/abs/2401.14338": {"title": "Case-crossover designs and overdispersion with application in air pollution epidemiology", "link": "http://arxiv.org/abs/2401.14338", "description": "Over the last three decades, case-crossover designs have found many\napplications in health sciences, especially in air pollution epidemiology. They\nare typically used, in combination with partial likelihood techniques, to\ndefine a conditional logistic model for the responses, usually health outcomes,\nconditional on the exposures. 
Despite the fact that conditional logistic models\nhave been shown equivalent, in typical air pollution epidemiology setups, to\nspecific instances of the well-known Poisson time series model, it is often\nclaimed that they cannot allow for overdispersion. This paper clarifies the\nrelationship between case-crossover designs, the models that ensue from their\nuse, and overdispersion. In particular, we propose to relax the assumption of\nindependence between individuals traditionally made in case-crossover analyses,\nin order to explicitly introduce overdispersion in the conditional logistic\nmodel. As we show, the resulting overdispersed conditional logistic model\ncoincides with the overdispersed, conditional Poisson model, in the sense that\ntheir likelihoods are simple re-expressions of one another. We further provide\nthe technical details of a Bayesian implementation of the proposed\ncase-crossover model, which we use to demonstrate, by means of a large\nsimulation study, that standard case-crossover models can lead to dramatically\nunderestimated coverage probabilities, while the proposed models do not. We\nalso perform an illustrative analysis of the association between air pollution\nand morbidity in Toronto, Canada, which shows that the proposed models are more\nrobust than standard ones to outliers such as those associated with public\nholidays."}, "http://arxiv.org/abs/2401.14345": {"title": "Uncovering Heterogeneity of Solar Flare Mechanism With Mixture Models", "link": "http://arxiv.org/abs/2401.14345", "description": "The physics of solar flares occurring on the Sun is highly complex and far\nfrom fully understood. However, observations show that solar eruptions are\nassociated with the intense kilogauss fields of active regions, where free\nenergies are stored with field-aligned electric currents. With the advent of\nhigh-quality data sources such as the Geostationary Operational Environmental\nSatellites (GOES) and Solar Dynamics Observatory (SDO)/Helioseismic and\nMagnetic Imager (HMI), recent works on solar flare forecasting have been\nfocusing on data-driven methods. In particular, black box machine learning and\ndeep learning models are increasingly adopted in which underlying data\nstructures are not modeled explicitly. If the active regions indeed follow the\nsame laws of physics, there should be similar patterns shared among them,\nreflected by the observations. Yet, these black box models currently used in\nthe literature do not explicitly characterize the heterogeneous nature of the\nsolar flare data, within and between active regions. In this paper, we propose\ntwo finite mixture models designed to capture the heterogeneous patterns of\nactive regions and their associated solar flare events. With extensive\nnumerical studies, we demonstrate the usefulness of our proposed method for\nboth resolving the sample imbalance issue and modeling the heterogeneity for\nrare energetic solar flare events."}, "http://arxiv.org/abs/2401.14355": {"title": "Multiply Robust Estimation of Causal Effect Curves for Difference-in-Differences Designs", "link": "http://arxiv.org/abs/2401.14355", "description": "Researchers commonly use difference-in-differences (DiD) designs to evaluate\npublic policy interventions. While established methodologies exist for\nestimating effects in the context of binary interventions, policies often\nresult in varied exposures across regions implementing the policy. 
Yet,\nexisting approaches for incorporating continuous exposures face substantial\nlimitations in addressing confounding variables associated with intervention\nstatus, exposure levels, and outcome trends. These limitations significantly\nconstrain policymakers' ability to fully comprehend policy impacts and design\nfuture interventions. In this study, we propose innovative estimators for\ncausal effect curves within the DiD framework, accounting for multiple sources\nof confounding. Our approach accommodates misspecification of a subset of\ntreatment, exposure, and outcome models while avoiding any parametric\nassumptions on the effect curve. We present the statistical properties of the\nproposed methods and illustrate their application through simulations and a\nstudy investigating the diverse effects of a nutritional excise tax."}, "http://arxiv.org/abs/2401.14359": {"title": "Minimum Covariance Determinant: Spectral Embedding and Subset Size Determination", "link": "http://arxiv.org/abs/2401.14359", "description": "This paper introduces several ideas to the minimum covariance determinant\nproblem for outlier detection and robust estimation of means and covariances.\nWe leverage the principal component transform to achieve dimension reduction,\npaving the way for improved analyses. Our best subset selection algorithm\nstrategically combines statistical depth and concentration steps. To ascertain\nthe appropriate subset size and number of principal components, we introduce a\nnovel bootstrap procedure that estimates the instability of the best subset\nalgorithm. The parameter combination exhibiting minimal instability proves\nideal for the purposes of outlier detection and robust estimation. Rigorous\nbenchmarking against prominent MCD variants showcases our approach's superior\ncapability in outlier detection and computational speed in high dimensions.\nApplication to a fruit spectra data set and a cancer genomics data set\nillustrates our claims."}, "http://arxiv.org/abs/2401.14393": {"title": "Clustering-based spatial interpolation of parametric post-processing models", "link": "http://arxiv.org/abs/2401.14393", "description": "Since the start of the operational use of ensemble prediction systems,\nensemble-based probabilistic forecasting has become the most advanced approach\nin weather prediction. However, despite the persistent development of the last\nthree decades, ensemble forecasts still often suffer from the lack of\ncalibration and might exhibit systematic bias, which calls for some form of\nstatistical post-processing. Nowadays, one can choose from a large variety of\npost-processing approaches, where parametric methods provide full predictive\ndistributions of the investigated weather quantity. Parameter estimation in\nthese models is based on training data consisting of past forecast-observation\npairs, thus post-processed forecasts are usually available only at those\nlocations where training data are accessible. We propose a general\nclustering-based interpolation technique of extending calibrated predictive\ndistributions from observation stations to any location in the ensemble domain\nwhere there are ensemble forecasts at hand. 
Focusing on the ensemble model\noutput statistics (EMOS) post-processing technique, in a case study based on\nwind speed ensemble forecasts of the European Centre for Medium-Range Weather\nForecasts, we demonstrate the predictive performance of various versions of the\nsuggested method and show its superiority over the regionally estimated and\ninterpolated EMOS models and the raw ensemble forecasts as well."}, "http://arxiv.org/abs/2208.09344": {"title": "A note on incorrect inferences in non-binary qualitative probabilistic networks", "link": "http://arxiv.org/abs/2208.09344", "description": "Qualitative probabilistic networks (QPNs) combine the conditional\nindependence assumptions of Bayesian networks with the qualitative properties\nof positive and negative dependence. They formalise various intuitive\nproperties of positive dependence to allow inferences over a large network of\nvariables. However, we will demonstrate in this paper that, due to an incorrect\nsymmetry property, many inferences obtained in non-binary QPNs are not\nmathematically true. We will provide examples of such incorrect inferences and\nbriefly discuss possible resolutions."}, "http://arxiv.org/abs/2210.14080": {"title": "Learning Individual Treatment Effects under Heterogeneous Interference in Networks", "link": "http://arxiv.org/abs/2210.14080", "description": "Estimates of individual treatment effects from networked observational data\nare attracting increasing attention these days. One major challenge in network\nscenarios is the violation of the stable unit treatment value assumption\n(SUTVA), which assumes that the treatment assignment of a unit does not\ninfluence others' outcomes. In network data, due to interference, the outcome\nof a unit is influenced not only by its treatment (i.e., direct effects) but\nalso by others' treatments (i.e., spillover effects). Furthermore, the\ninfluences from other units are always heterogeneous (e.g., friends with\nsimilar interests affect a person differently than friends with different\ninterests). In this paper, we focus on the problem of estimating individual\ntreatment effects (both direct and spillover effects) under heterogeneous\ninterference. To address this issue, we propose a novel Dual Weighting\nRegression (DWR) algorithm by simultaneously learning attention weights that\ncapture the heterogeneous interference and sample weights to eliminate the\ncomplex confounding bias in networks. We formulate the entire learning process\nas a bi-level optimization problem. In theory, we present generalization error\nbounds for individual treatment effect estimation. Extensive experiments on\nfour benchmark datasets demonstrate that the proposed DWR algorithm outperforms\nstate-of-the-art methods for estimating individual treatment effects under\nheterogeneous interference."}, "http://arxiv.org/abs/2302.10836": {"title": "nlive: an R Package to facilitate the application of the sigmoidal and random changepoint mixed models", "link": "http://arxiv.org/abs/2302.10836", "description": "Background: The use of mixed effect models with a specific functional form\nsuch as the Sigmoidal Mixed Model and the Piecewise Mixed Model (or Changepoint\nMixed Model) with abrupt or smooth random change allows the interpretation of\nthe defined parameters to understand longitudinal trajectories. 
Currently,\nthere are no interface R packages that can easily fit the Sigmoidal Mixed Model\nwhile allowing the inclusion of covariates, or that incorporate recent developments to\nfit the Piecewise Mixed Model with random change. Results: To facilitate the\nmodeling of the Sigmoidal Mixed Model, and Piecewise Mixed Model with abrupt or\nsmooth random change, we have created an R package called nlive. All needed\npieces, such as functions, covariance matrices, and the generation of initial values, were\nprogrammed. The package was implemented with recent developments such as the\npolynomial smooth transition of the piecewise mixed model, which has improved\nproperties over Bacon-Watts, and the stochastic approximation\nexpectation-maximization (SAEM) algorithm for efficient estimation. It was designed to\nhelp with the interpretation of the output by providing features such as annotated\noutput, warnings, and graphs. Functionality, including computation time and convergence,\nwas tested using simulations. We provide a data example to illustrate the use of the\npackage, its output features, and their interpretation. The package, implemented in\nthe R software, is available from the Comprehensive R Archive Network (CRAN) at\nhttps://CRAN.R-project.org/package=nlive. Conclusions: The nlive package for R\nfits the Sigmoidal Mixed Model and the Piecewise Mixed Model with abrupt or smooth random change. The\npackage allows fitting these models with only five mandatory arguments that are\nintuitive enough for less experienced users."}, "http://arxiv.org/abs/2304.12500": {"title": "Environmental Justice Implications of Power Plant Emissions Control Policies: Heterogeneous Causal Effect Estimation under Bipartite Network Interference", "link": "http://arxiv.org/abs/2304.12500", "description": "Emissions generators, such as coal-fired power plants, are key contributors\nto air pollution, and thus environmental policies to reduce their emissions have\nbeen proposed. Furthermore, marginalized groups are exposed to\ndisproportionately high levels of this pollution and have heightened\nsusceptibility to its adverse health impacts. As a result, robust evaluations\nof the heterogeneous impacts of air pollution regulations are key to justifying\nand designing maximally protective interventions. However, such evaluations are\ncomplicated in that much of air pollution regulatory policy intervenes on large\nemissions generators while resulting impacts are measured in potentially\ndistant populations. Such a scenario can be described as that of bipartite\nnetwork interference (BNI). To our knowledge, no literature to date has\nconsidered estimation of heterogeneous causal effects with BNI. In this paper,\nwe contribute to the literature in a three-fold manner. First, we propose\nBNI-specific estimators for subgroup-specific causal effects and design an\nempirical Monte Carlo simulation approach for BNI to evaluate their\nperformance. Second, we demonstrate how these estimators can be combined with\nsubgroup discovery approaches to identify subgroups benefiting most from air\npollution policies without a priori specification. Finally, we apply the\nproposed methods to estimate the effects of coal-fired power plant emissions\ncontrol interventions on ischemic heart disease (IHD) among 27,312,190 US\nMedicare beneficiaries. 
Though we find no statistically significant effect of\nthe interventions in the full population, we do find significant IHD\nhospitalization decreases in communities with high poverty and smoking rates."}, "http://arxiv.org/abs/2306.00686": {"title": "A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience", "link": "http://arxiv.org/abs/2306.00686", "description": "In this paper, we will outline a novel data-driven method for estimating\nfunctions in a multivariate nonparametric regression model based on an adaptive\nknot selection for B-splines. The underlying idea of our approach for selecting\nknots is to apply the generalized lasso, since the knots of the B-spline basis\ncan be seen as changes in the derivatives of the function to be estimated. This\nmethod was then extended to functions depending on several variables by\nprocessing each dimension independently, thus reducing the problem to a\nunivariate setting. The regularization parameters were chosen by means of a\ncriterion based on EBIC. The nonparametric estimator was obtained using a\nmultivariate B-spline regression with the corresponding selected knots. Our\nprocedure was validated through numerical experiments by varying the number of\nobservations and the level of noise to investigate its robustness. The\ninfluence of observation sampling was also assessed and our method was applied\nto a chemical system commonly used in geoscience. For each different framework\nconsidered in this paper, our approach performed better than state-of-the-art\nmethods. Our completely data-driven method is implemented in the glober R\npackage which is available on the Comprehensive R Archive Network (CRAN)."}, "http://arxiv.org/abs/2401.14426": {"title": "M$^3$TN: Multi-gate Mixture-of-Experts based Multi-valued Treatment Network for Uplift Modeling", "link": "http://arxiv.org/abs/2401.14426", "description": "Uplift modeling is a technique used to predict the effect of a treatment\n(e.g., discounts) on an individual's response. Although several methods have\nbeen proposed for multi-valued treatment, they are extended from binary\ntreatment methods. There are still some limitations. Firstly, existing methods\ncalculate uplift based on predicted responses, which may not guarantee a\nconsistent uplift distribution between treatment and control groups. Moreover,\nthis may cause cumulative errors for multi-valued treatment. Secondly, the\nmodel parameters become numerous with many prediction heads, leading to reduced\nefficiency. To address these issues, we propose a novel \\underline{M}ulti-gate\n\\underline{M}ixture-of-Experts based \\underline{M}ulti-valued\n\\underline{T}reatment \\underline{N}etwork (M$^3$TN). M$^3$TN consists of two\ncomponents: 1) a feature representation module with Multi-gate\nMixture-of-Experts to improve the efficiency; 2) a reparameterization module by\nmodeling uplift explicitly to improve the effectiveness. We also conduct\nextensive experiments to demonstrate the effectiveness and efficiency of our\nM$^3$TN."}, "http://arxiv.org/abs/2401.14512": {"title": "Who Are We Missing? 
A Principled Approach to Characterizing the Underrepresented Population", "link": "http://arxiv.org/abs/2401.14512", "description": "Randomized controlled trials (RCTs) serve as the cornerstone for\nunderstanding causal effects, yet extending inferences to target populations\npresents challenges due to effect heterogeneity and underrepresentation. Our\npaper addresses the critical issue of identifying and characterizing\nunderrepresented subgroups in RCTs, proposing a novel framework for refining\ntarget populations to improve generalizability. We introduce an\noptimization-based approach, Rashomon Set of Optimal Trees (ROOT), to\ncharacterize underrepresented groups. ROOT optimizes the target subpopulation\ndistribution by minimizing the variance of the target average treatment effect\nestimate, ensuring more precise treatment effect estimations. Notably, ROOT\ngenerates interpretable characteristics of the underrepresented population,\naiding researchers in effective communication. Our approach demonstrates\nimproved precision and interpretability compared to alternatives, as\nillustrated with synthetic data experiments. We apply our methodology to extend\ninferences from the Starting Treatment with Agonist Replacement Therapies\n(START) trial -- investigating the effectiveness of medication for opioid use\ndisorder -- to the real-world population represented by the Treatment Episode\nDataset: Admissions (TEDS-A). By refining target populations using ROOT, our\nframework offers a systematic approach to enhance decision-making accuracy and\ninform future trials in diverse populations."}, "http://arxiv.org/abs/2401.14515": {"title": "Martingale Posterior Distributions for Log-concave Density Functions", "link": "http://arxiv.org/abs/2401.14515", "description": "The family of log-concave density functions contains various kinds of common\nprobability distributions. Due to the shape restriction, it is possible to find\nthe nonparametric estimate of the density, for example, the nonparametric\nmaximum likelihood estimate (NPMLE). However, the associated uncertainty\nquantification of the NPMLE is less well developed. The current techniques for\nuncertainty quantification are Bayesian, using a Dirichlet process prior\ncombined with the use of Markov chain Monte Carlo (MCMC) to sample from the\nposterior. In this paper, we start with the NPMLE and use a version of the\nmartingale posterior distribution to establish uncertainty about the NPMLE. The\nalgorithm can be implemented in parallel and hence is fast. We prove the\nconvergence of the algorithm by constructing suitable submartingales. We also\nillustrate results with different models and settings and some real data, and\ncompare our method with that within the literature."}, "http://arxiv.org/abs/2401.14535": {"title": "CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process", "link": "http://arxiv.org/abs/2401.14535", "description": "Identifying the underlying time-delayed latent causal processes in sequential\ndata is vital for grasping temporal dynamics and making downstream reasoning.\nWhile some recent methods can robustly identify these latent causal variables,\nthey rely on strict assumptions about the invertible generation process from\nlatent variables to observed data. However, these assumptions are often hard to\nsatisfy in real-world applications containing information loss. 
For instance,\nthe visual perception process translates a 3D space into 2D images, or the\nphenomenon of persistence of vision incorporates historical data into current\nperceptions. To address this challenge, we establish an identifiability theory\nthat allows for the recovery of independent latent components even when they\ncome from a nonlinear and non-invertible mix. Using this theory as a\nfoundation, we propose a principled approach, CaRiNG, to learn the CAusal\nRepresentatIon of Non-invertible Generative temporal data with identifiability\nguarantees. Specifically, we utilize temporal context to recover lost latent\ninformation and apply the conditions in our theory to guide the training\nprocess. Through experiments conducted on synthetic datasets, we validate that\nour CaRiNG method reliably identifies the causal process, even when the\ngeneration process is non-invertible. Moreover, we demonstrate that our\napproach considerably improves temporal understanding and reasoning in\npractical applications."}, "http://arxiv.org/abs/2401.14549": {"title": "Privacy-preserving Quantile Treatment Effect Estimation for Randomized Controlled Trials", "link": "http://arxiv.org/abs/2401.14549", "description": "In accordance with the principle of \"data minimization\", many internet\ncompanies are opting to record less data. However, this is often at odds with\nA/B testing efficacy. For experiments with units with multiple observations,\none popular data minimizing technique is to aggregate data for each unit.\nHowever, exact quantile estimation requires the full observation-level data. In\nthis paper, we develop a method for approximate Quantile Treatment Effect (QTE)\nanalysis using histogram aggregation. In addition, we can also achieve formal\nprivacy guarantees using differential privacy."}, "http://arxiv.org/abs/2401.14558": {"title": "Simulation Model Calibration with Dynamic Stratification and Adaptive Sampling", "link": "http://arxiv.org/abs/2401.14558", "description": "Calibrating simulation models that take large quantities of multi-dimensional\ndata as input is a hard simulation optimization problem. Existing adaptive\nsampling strategies offer a methodological solution. However, they may not\nsufficiently reduce the computational cost for estimation and solution\nalgorithm's progress within a limited budget due to extreme noise levels and\nheteroskedasticity of system responses. We propose integrating stratification\nwith adaptive sampling for the purpose of efficiency in optimization.\nStratification can exploit local dependence in the simulation inputs and\noutputs. Yet, the state-of-the-art does not provide a full capability to\nadaptively stratify the data as different solution alternatives are evaluated.\nWe devise two procedures for data-driven calibration problems that involve a\nlarge dataset with multiple covariates to calibrate models within a fixed\noverall simulation budget. The first approach dynamically stratifies the input\ndata using binary trees, while the second approach uses closed-form solutions\nbased on linearity assumptions between the objective function and concomitant\nvariables. We find that dynamical adjustment of stratification structure\naccelerates optimization and reduces run-to-run variability in generated\nsolutions. 
Our case study for calibrating a wind power simulation model, widely\nused in the wind industry, using the proposed stratified adaptive sampling,\nshows better-calibrated parameters under a limited budget."}, "http://arxiv.org/abs/2401.14562": {"title": "Properties of the Mallows Model Depending on the Number of Alternatives: A Warning for an Experimentalist", "link": "http://arxiv.org/abs/2401.14562", "description": "The Mallows model is a popular distribution for ranked data. We empirically\nand theoretically analyze how the properties of rankings sampled from the\nMallows model change when increasing the number of alternatives. We find that\nreal-world data behaves differently than the Mallows model, yet is in line with\nits recent variant proposed by Boehmer et al. [2021]. As part of our study, we\nissue several warnings about using the model."}, "http://arxiv.org/abs/2401.14593": {"title": "Robust Estimation of Pareto's Scale Parameter from Grouped Data", "link": "http://arxiv.org/abs/2401.14593", "description": "Numerous robust estimators exist as alternatives to the maximum likelihood\nestimator (MLE) when a completely observed ground-up loss severity sample\ndataset is available. However, the options for robust alternatives to MLE\nbecome significantly limited when dealing with grouped loss severity data, with\nonly a handful of methods like least squares, minimum Hellinger distance, and\noptimal bounded influence function available. This paper introduces a novel\nrobust estimation technique, the Method of Truncated Moments (MTuM),\nspecifically designed to estimate the tail index of a Pareto distribution from\ngrouped data. Inferential justification of MTuM is established by employing the\ncentral limit theorem and validating them through a comprehensive simulation\nstudy."}, "http://arxiv.org/abs/2401.14655": {"title": "Distributionally Robust Optimization and Robust Statistics", "link": "http://arxiv.org/abs/2401.14655", "description": "We review distributionally robust optimization (DRO), a principled approach\nfor constructing statistical estimators that hedge against the impact of\ndeviations in the expected loss between the training and deployment\nenvironments. Many well-known estimators in statistics and machine learning\n(e.g. AdaBoost, LASSO, ridge regression, dropout training, etc.) are\ndistributionally robust in a precise sense. We hope that by discussing the DRO\ninterpretation of well-known estimators, statisticians who may not be too\nfamiliar with DRO may find a way to access the DRO literature through the\nbridge between classical results and their DRO equivalent formulation. On the\nother hand, the topic of robustness in statistics has a rich tradition\nassociated with removing the impact of contamination. Thus, another objective\nof this paper is to clarify the difference between DRO and classical\nstatistical robustness. As we will see, these are two fundamentally different\nphilosophies leading to completely different types of estimators. In DRO, the\nstatistician hedges against an environment shift that occurs after the decision\nis made; thus DRO estimators tend to be pessimistic in an adversarial setting,\nleading to a min-max type formulation. 
In classical robust statistics, the\nstatistician seeks to correct contamination that occurred before a decision is\nmade; thus robust statistical estimators tend to be optimistic leading to a\nmin-min type formulation."}, "http://arxiv.org/abs/2401.14684": {"title": "Inference for Cumulative Incidences and Treatment Effects in Randomized Controlled Trials with Time-to-Event Outcomes under ICH E9 (E1)", "link": "http://arxiv.org/abs/2401.14684", "description": "In randomized controlled trials (RCT) with time-to-event outcomes,\nintercurrent events occur as semi-competing/competing events, and they could\naffect the hazard of outcomes or render outcomes ill-defined. Although five\nstrategies have been proposed in ICH E9 (R1) addendum to address intercurrent\nevents in RCT, they did not readily extend to the context of time-to-event data\nfor studying causal effects with rigorously stated implications. In this study,\nwe show how to define, estimate, and infer the time-dependent cumulative\nincidence of outcome events in such contexts for obtaining causal\ninterpretations. Specifically, we derive the mathematical forms of the\nscientific objective (i.e., causal estimands) under the five strategies and\nclarify the required data structure to identify these causal estimands.\nFurthermore, we summarize estimation and inference methods for these causal\nestimands by adopting methodologies in survival analysis, including analytic\nformulas for asymptotic analysis and hypothesis testing. We illustrate our\nmethods with the LEADER Trial on investigating the effect of liraglutide on\ncardiovascular outcomes. Studies of multiple endpoints and combining strategies\nto address multiple intercurrent events can help practitioners understand\ntreatment effects more comprehensively."}, "http://arxiv.org/abs/2401.14722": {"title": "A Nonparametric Bayes Approach to Online Activity Prediction", "link": "http://arxiv.org/abs/2401.14722", "description": "Accurately predicting the onset of specific activities within defined\ntimeframes holds significant importance in several applied contexts. In\nparticular, accurate prediction of the number of future users that will be\nexposed to an intervention is an important piece of information for\nexperimenters running online experiments (A/B tests). In this work, we propose\na novel approach to predict the number of users that will be active in a given\ntime period, as well as the temporal trajectory needed to attain a desired user\nparticipation threshold. We model user activity using a Bayesian nonparametric\napproach which allows us to capture the underlying heterogeneity in user\nengagement. We derive closed-form expressions for the number of new users\nexpected in a given period, and a simple Monte Carlo algorithm targeting the\nposterior distribution of the number of days needed to attain a desired number\nof users; the latter is important for experimental planning. We illustrate the\nperformance of our approach via several experiments on synthetic and real world\ndata, in which we show that our novel method outperforms existing competitors."}, "http://arxiv.org/abs/2401.14827": {"title": "Clustering Longitudinal Ordinal Data via Finite Mixture of Matrix-Variate Distributions", "link": "http://arxiv.org/abs/2401.14827", "description": "In social sciences, studies are often based on questionnaires asking\nparticipants to express ordered responses several times over a study period. 
We\npresent a model-based clustering algorithm for such longitudinal ordinal data.\nAssuming that an ordinal variable is the discretization of an underlying latent\ncontinuous variable, the model relies on a mixture of matrix-variate normal\ndistributions, accounting simultaneously for within- and between-time\ndependence structures. The model is thus able to concurrently model the\nheterogeneity, the association among the responses, and the temporal dependence\nstructure. An EM algorithm is developed and presented for parameter\nestimation. An evaluation of the model through synthetic data shows its\nestimation abilities and its advantages when compared to competitors. A\nreal-world application concerning changes in eating behaviours during the\nCovid-19 pandemic period in France is also presented."}, "http://arxiv.org/abs/2401.14836": {"title": "Automatic and location-adaptive estimation in functional single-index regression", "link": "http://arxiv.org/abs/2401.14836", "description": "This paper develops a new automatic and location-adaptive procedure for\nestimating the regression in a Functional Single-Index Model (FSIM). This procedure\nis based on $k$-Nearest Neighbours ($k$NN) ideas. The asymptotic study includes\nresults for an automatic data-driven selection of the number of neighbours, making the\nprocedure directly usable in practice. The local feature of the $k$NN approach\nensures higher predictive power compared with usual kernel estimates, as\nillustrated in a finite sample analysis. As a by-product, we state, as\npreliminary tools, some new uniform asymptotic results for kernel estimates in\nthe FSIM model."}, "http://arxiv.org/abs/2401.14841": {"title": "Sparse semiparametric regression when predictors are mixture of functional and high-dimensional variables", "link": "http://arxiv.org/abs/2401.14841", "description": "This paper addresses dimensionality reduction in a regression setting\nwhen the predictors are a mixture of a functional variable and a high-dimensional\nvector. A flexible model, combining sparse linear ideas with\nsemiparametrics, is proposed. A wide range of asymptotic results is provided:\nthis covers rates of convergence of the estimators as well as the asymptotic\nbehaviour of the variable selection procedure. Practical issues are analysed\nthrough finite sample simulated experiments, while an application to Tecator's\ndata illustrates the usefulness of our methodology."}, "http://arxiv.org/abs/2401.14848": {"title": "A $k$NN procedure in semiparametric functional data analysis", "link": "http://arxiv.org/abs/2401.14848", "description": "A fast and flexible $k$NN procedure is developed for dealing with a\nsemiparametric functional regression model involving both partial-linear and\nsingle-index components. Rates of uniform consistency are presented. Simulated\nexperiments highlight the advantages of the $k$NN procedure. A real data\nanalysis is also shown."}, "http://arxiv.org/abs/2401.14864": {"title": "Fast and efficient algorithms for sparse semiparametric bi-functional regression", "link": "http://arxiv.org/abs/2401.14864", "description": "A new sparse semiparametric model is proposed, which incorporates the\ninfluence of two functional random variables on a scalar response in a flexible\nand interpretable manner. One of the functional covariates is included through\na single-index structure, while the other is included linearly through the\nhigh-dimensional vector formed by its discretised observations. 
For this model,\ntwo new algorithms are presented for selecting relevant variables in the linear\npart and estimating the model. Both procedures utilise the functional origin of\nthe linear covariates. Finite sample experiments demonstrated the scope of\napplication of both algorithms: the first method is a fast algorithm that\navoids, without loss in predictive ability, the significant\ncomputational time required by standard variable selection methods for\nestimating this model, and the second algorithm completes the set of relevant\nlinear covariates provided by the first, thus improving its predictive\nefficiency. Some asymptotic results theoretically support both procedures. A\nreal data application demonstrated the applicability of the presented\nmethodology from a predictive perspective, offering interpretable\noutputs at a low computational cost."}, "http://arxiv.org/abs/2401.14867": {"title": "Variable selection in functional regression models: a review", "link": "http://arxiv.org/abs/2401.14867", "description": "Despite sharing various features, Functional Data Analysis and\nHigh-Dimensional Data Analysis are two major fields in Statistics that have grown\nrecently almost independently of each other. The aim of this paper is to\npropose a survey on methodological advances for variable selection in\nfunctional regression, a question where functional\nand multivariate ideas typically cross. More than a simple survey, this paper aims\nto promote new links between the two areas."}, "http://arxiv.org/abs/2401.14902": {"title": "Model-assisted survey sampling with Bayesian optimization", "link": "http://arxiv.org/abs/2401.14902", "description": "Survey sampling plays an important role in the efficient allocation and\nmanagement of resources. The essence of survey sampling lies in acquiring a\nsample of data points from a population and subsequently using this sample to\nestimate the population parameters of the targeted response variable, such as\nenvironment-related metrics or other pertinent factors. Practical limitations\nimposed on survey sampling necessitate prudent consideration of the number of\nsamples attainable from the study areas, given the constraints of a fixed\nbudget. To this end, researchers are compelled to employ sampling designs that\noptimize sample allocations to the best of their ability. Generally,\nprobability sampling serves as the preferred method, ensuring an unbiased\nestimation of population parameters. Evaluating the efficiency of estimators\ninvolves assessing their variances and benchmarking them against alternative\nbaseline approaches, such as simple random sampling. In this study, we propose\na novel model-assisted unbiased probability sampling method that leverages\nBayesian optimization for the determination of sampling designs. As a result,\nthis approach can yield estimators with more efficient variance outcomes\ncompared to conventional estimators such as the Horvitz-Thompson estimator.\nFurthermore, we test the proposed method in a simulation study using an\nempirical dataset covering plot-level tree volume from central Finland. 
The\nresults demonstrate statistically significantly improved performance for the\nproposed method when compared to the baseline."}, "http://arxiv.org/abs/2401.14910": {"title": "Modeling Extreme Events: Univariate and Multivariate Data-Driven Approaches", "link": "http://arxiv.org/abs/2401.14910", "description": "Modern inference in extreme value theory faces numerous complications, such\nas missing data, hidden covariates or design problems. Some of those\ncomplications were exemplified in the EVA 2023 data challenge. The challenge\ncomprises multiple individual problems which cover a variety of univariate and\nmultivariate settings. This note presents the contribution of team genEVA in\nsaid competition, with particular focus on a detailed presentation of\nmethodology and inference."}, "http://arxiv.org/abs/2401.15014": {"title": "A Robust Bayesian Method for Building Polygenic Risk Scores using Projected Summary Statistics and Bridge Prior", "link": "http://arxiv.org/abs/2401.15014", "description": "Polygenic Risk Scores (PRS) developed from genome-wide association studies\n(GWAS) are of increasing interest for various clinical and research\napplications. Bayesian methods have been particularly popular for building PRS\non a genome-wide scale because of their natural ability to regularize the model and\nborrow information in high dimensions. In this article, we present new\ntheoretical results, methods, and extensive numerical studies to advance\nBayesian methods for PRS applications. We conduct theoretical studies to\nidentify causes of convergence issues of some Bayesian methods when the required\ninput GWAS summary-statistics and linkage disequilibrium (LD) (genetic\ncorrelation) data are derived from distinct samples. We propose a remedy to the\nproblem by projecting the summary-statistics data onto the column space\nof the genetic correlation matrix. We further implement a PRS development\nalgorithm under the Bayesian Bridge prior, which allows a more flexible\nspecification of the effect-size distribution than popular\nalternative methods. Finally, we conduct careful benchmarking studies of\nalternative Bayesian methods using both simulation studies and real datasets,\nwhere we carefully investigate both the effect of prior specification and\nestimation strategies for LD parameters. These studies show that the proposed\nalgorithm, equipped with the projection approach, the flexible prior\nspecification, and an efficient numerical algorithm, leads to the development of\nthe most robust PRS across a wide variety of scenarios."}, "http://arxiv.org/abs/2401.15063": {"title": "Graph fission and cross-validation", "link": "http://arxiv.org/abs/2401.15063", "description": "We introduce a technique called graph fission which takes in a graph which\npotentially contains only one observation per node (whose distribution lies in\na known class) and produces two (or more) independent graphs with the same\nnode/edge set in a way that splits the original graph's information amongst\nthem in any desired proportion. Our proposal builds on data fission/thinning, a\nmethod that uses external randomization to create independent copies of an\nunstructured dataset. We extend this idea to the graph setting where there may be\nlatent structure between observations. 
We demonstrate the utility of this\nframework via two applications: inference after structural trend estimation on\ngraphs and a model selection procedure we term ``graph cross-validation''."}, "http://arxiv.org/abs/2401.15076": {"title": "Comparative Analysis of Practical Identifiability Methods for an SEIR Model", "link": "http://arxiv.org/abs/2401.15076", "description": "Identifiability of a mathematical model plays a crucial role in\nparameterization of the model. In this study, we establish the structural\nidentifiability of a Susceptible-Exposed-Infected-Recovered (SEIR) model given\ndifferent combinations of input data and investigate practical identifiability\nwith respect to different observable data, data frequency, and noise\ndistributions. The practical identifiability is explored by both Monte Carlo\nsimulations and a Correlation Matrix approach. Our results show that practical\nidentifiability benefits from higher data frequency and data from the peak of\nan outbreak. The incidence data gives the best practical identifiability\nresults compared to prevalence and cumulative data. In addition, we compare and\ndistinguish the practical identifiability by Monte Carlo simulations and a\nCorrelation Matrix approach, providing insights for when to use which method\nfor other applications."}, "http://arxiv.org/abs/2105.02487": {"title": "High-dimensional Functional Graphical Model Structure Learning via Neighborhood Selection Approach", "link": "http://arxiv.org/abs/2105.02487", "description": "Undirected graphical models are widely used to model the conditional\nindependence structure of vector-valued data. However, in many modern\napplications, for example those involving EEG and fMRI data, observations are\nmore appropriately modeled as multivariate random functions rather than\nvectors. Functional graphical models have been proposed to model the\nconditional independence structure of such functional data. We propose a\nneighborhood selection approach to estimate the structure of Gaussian\nfunctional graphical models, where we first estimate the neighborhood of each\nnode via a function-on-function regression and subsequently recover the entire\ngraph structure by combining the estimated neighborhoods. Our approach only\nrequires assumptions on the conditional distributions of random functions, and\nwe estimate the conditional independence structure directly. We thus circumvent\nthe need for a well-defined precision operator that may not exist when the\nfunctions are infinite dimensional. Additionally, the neighborhood selection\napproach is computationally efficient and can be easily parallelized. The\nstatistical consistency of the proposed method in the high-dimensional setting\nis supported by both theory and experimental results. In addition, we study the\neffect of the choice of the function basis used for dimensionality reduction in\nan intermediate step. 
We give a heuristic criterion for choosing a function\nbasis and motivate two practically useful choices, which we justify by both\ntheory and experiments."}, "http://arxiv.org/abs/2203.11469": {"title": "A new class of composite GBII regression models with varying threshold for modelling heavy-tailed data", "link": "http://arxiv.org/abs/2203.11469", "description": "The four-parameter generalized beta distribution of the second kind (GBII)\nhas been proposed for modelling insurance losses with heavy-tailed features.\nThe aim of this paper is to present a parametric composite GBII regression\nmodel built by splicing two GBII distributions using the mode matching method. It is\ndesigned for the simultaneous modeling of small and large claims and for capturing\npolicyholder heterogeneity by introducing the covariates into the location\nparameter. In such cases, the threshold that splits the two GBII distributions\nvaries across individual policyholders based on their risk features. The\nproposed regression model also accommodates a wide range of insurance loss\ndistributions as the head and the tail, respectively, and provides\nclosed-form expressions for parameter estimation and model prediction. A\nsimulation study is conducted to show the accuracy of the proposed estimation\nmethod and the flexibility of the regressions. Some illustrations of the\napplicability of the new class of distributions and regressions are provided\nwith a Danish fire losses data set and a Chinese medical insurance claims data\nset, comparing the results with those of competing models from the literature."}, "http://arxiv.org/abs/2206.14674": {"title": "Signature Methods in Machine Learning", "link": "http://arxiv.org/abs/2206.14674", "description": "Signature-based techniques give mathematical insight into the interactions\nbetween complex streams of evolving data. These insights can be quite naturally\ntranslated into numerical approaches to understanding streamed data, and\nperhaps because of their mathematical precision, have proved useful in\nanalysing streamed data in situations where the data is irregular, not\nstationary, and the dimension of the data and the sample sizes are both\nmoderate. Understanding streamed multi-modal data is exponential: a word in $n$\nletters from an alphabet of size $d$ can be any one of $d^n$ messages.\nSignatures remove the exponential amount of noise that arises from sampling\nirregularity, but an exponential amount of information still remains. This\nsurvey aims to stay in the domain where that exponential scaling can be managed\ndirectly. Scalability issues are an important challenge in many problems but\nwould require another survey article and further ideas. This survey describes a\nrange of contexts where the data sets are small enough to remove the\npossibility of massive machine learning, and where small sets of\ncontext-free and principled features can be used effectively. The mathematical\nnature of the tools can make their use intimidating to non-mathematicians. The\nexamples presented in this article are intended to bridge this communication\ngap and provide tractable working examples drawn from the machine learning\ncontext. Notebooks are available online for several of these examples. This\nsurvey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin,\nwhich had broadly similar aims at an earlier point in the development of this\nmachinery. 
This article illustrates how the theoretical insights offered by\nsignatures are simply realised in the analysis of application data in a way\nthat is largely agnostic to the data type."}, "http://arxiv.org/abs/2208.08925": {"title": "Efficiency of nonparametric e-tests", "link": "http://arxiv.org/abs/2208.08925", "description": "The notion of an e-value has been recently proposed as a possible alternative\nto critical regions and p-values in statistical hypothesis testing. In this\npaper we consider testing the nonparametric hypothesis of symmetry, introduce\nanalogues for e-values of three popular nonparametric tests, define an analogue\nfor e-values of Pitman's asymptotic relative efficiency, and apply it to the\nthree nonparametric tests. We discuss limitations of our simple definition of\nasymptotic relative efficiency and list directions of further research."}, "http://arxiv.org/abs/2211.16468": {"title": "Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs", "link": "http://arxiv.org/abs/2211.16468", "description": "Causal effect estimation from observational data is a fundamental task in\nempirical sciences. It becomes particularly challenging when unobserved\nconfounders are involved in a system. This paper focuses on front-door\nadjustment -- a classic technique which, using observed mediators allows to\nidentify causal effects even in the presence of unobserved confounding. While\nthe statistical properties of the front-door estimation are quite well\nunderstood, its algorithmic aspects remained unexplored for a long time. In\n2022, Jeong, Tian, and Bareinboim presented the first polynomial-time algorithm\nfor finding sets satisfying the front-door criterion in a given directed\nacyclic graph (DAG), with an $O(n^3(n+m))$ run time, where $n$ denotes the\nnumber of variables and $m$ the number of edges of the causal graph. In our\nwork, we give the first linear-time, i.e., $O(n+m)$, algorithm for this task,\nwhich thus reaches the asymptotically optimal time complexity. This result\nimplies an $O(n(n+m))$ delay enumeration algorithm of all front-door adjustment\nsets, again improving previous work by a factor of $n^3$. Moreover, we provide\nthe first linear-time algorithm for finding a minimal front-door adjustment\nset. We offer implementations of our algorithms in multiple programming\nlanguages to facilitate practical usage and empirically validate their\nfeasibility, even for large graphs."}, "http://arxiv.org/abs/2305.13221": {"title": "Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data", "link": "http://arxiv.org/abs/2305.13221", "description": "Additive spatial statistical models with weakly stationary process\nassumptions have become standard in spatial statistics. However, one\ndisadvantage of such models is the computation time, which rapidly increases\nwith the number of data points. The goal of this article is to apply an\nexisting subsampling strategy to standard spatial additive models and to derive\nthe spatial statistical properties. We call this strategy the \"spatial data\nsubset model\" (SDSM) approach, which can be applied to big datasets in a\ncomputationally feasible way. Our approach has the advantage that one does not\nrequire any additional restrictive model assumptions. That is, computational\ngains increase as model assumptions are removed when using our model framework.\nThis provides one solution to the computational bottlenecks that occur when\napplying methods such as Kriging to \"big data\". 
We provide several properties\nof this new spatial data subset model approach in terms of moments, sill,\nnugget, and range under several sampling designs. An advantage of our approach\nis that it subsamples without throwing away data, and can be implemented using\ndatasets of any size that can be stored. We present the results of the spatial\ndata subset model approach on simulated datasets, and on a large dataset\nconsisting of 150,000 observations of daytime land surface temperatures measured\nby the MODIS instrument onboard the Terra satellite."}, "http://arxiv.org/abs/2305.13818": {"title": "A Rank-Based Sequential Test of Independence", "link": "http://arxiv.org/abs/2305.13818", "description": "We consider the problem of independence testing for two univariate random\nvariables in a sequential setting. By leveraging recent developments on safe,\nanytime-valid inference, we propose a test with time-uniform type I error\ncontrol and derive explicit bounds on the finite sample performance of the\ntest. We demonstrate the empirical performance of the procedure in comparison\nto existing sequential and non-sequential independence tests. Furthermore,\nsince the proposed test is distribution free under the null hypothesis, we\nempirically simulate the gap due to Ville's inequality, the supermartingale\nanalogue of Markov's inequality, that is commonly applied to control type I\nerror in anytime-valid inference, and apply this to construct a truncated\nsequential test."}, "http://arxiv.org/abs/2307.08594": {"title": "Tight Distribution-Free Confidence Intervals for Local Quantile Regression", "link": "http://arxiv.org/abs/2307.08594", "description": "It is well known that it is impossible to construct useful confidence\nintervals (CIs) about the mean or median of a response $Y$ conditional on\nfeatures $X = x$ without making strong assumptions about the joint distribution\nof $X$ and $Y$. This paper introduces a new framework for reasoning about\nproblems of this kind by casting the conditional problem at different levels of\nresolution, ranging from coarse to fine localization. In each of these\nproblems, we consider local quantiles defined as the marginal quantiles of $Y$\nwhen $(X,Y)$ is resampled in such a way that samples $X$ near $x$ are\nup-weighted while the conditional distribution $Y \\mid X$ does not change. We\nthen introduce the Weighted Quantile method, which asymptotically produces the\nuniformly most accurate confidence intervals for these local quantiles no\nmatter the (unknown) underlying distribution. Another method, namely, the\nQuantile Rejection method, achieves finite sample validity under no assumption\nwhatsoever. We conduct extensive numerical studies demonstrating that both of\nthese methods are valid. In particular, we show that the Weighted Quantile\nprocedure achieves nominal coverage as soon as the effective sample size is in\nthe range of 10 to 20."}, "http://arxiv.org/abs/2307.08685": {"title": "Evaluating Climate Models with Sliced Elastic Distance", "link": "http://arxiv.org/abs/2307.08685", "description": "The validation of global climate models plays a crucial role in ensuring the\naccuracy of climatological predictions. However, existing statistical methods\nfor evaluating differences between climate fields often overlook time\nmisalignment and therefore fail to distinguish between sources of variability.\nTo more comprehensively measure differences between climate fields, we\nintroduce a new vector-valued metric, the sliced elastic distance. 
This new\nmetric simultaneously accounts for spatial and temporal variability while\ndecomposing the total distance into shape differences (amplitude), timing\nvariability (phase), and bias (translation). We compare the sliced elastic\ndistance against a classical metric and a newly developed Wasserstein-based\napproach through a simulation study. Our results demonstrate that the sliced\nelastic distance outperforms previous methods by capturing a broader range of\nfeatures. We then apply our metric to evaluate the historical model outputs of\nthe Coupled Model Intercomparison Project (CMIP) members, focusing on monthly\naverage surface temperatures and monthly total precipitation. By comparing\nthese model outputs with quasi-observational ERA5 Reanalysis data products, we\nrank the CMIP models and assess their performance. Additionally, we investigate\nthe progression from CMIP phase 5 to phase 6 and find modest improvements in\nthe phase 6 models regarding their ability to produce realistic climate\ndynamics."}, "http://arxiv.org/abs/2401.15139": {"title": "FDR-Controlled Portfolio Optimization for Sparse Financial Index Tracking", "link": "http://arxiv.org/abs/2401.15139", "description": "In high-dimensional data analysis, such as financial index tracking or\nbiomedical applications, it is crucial to select the few relevant variables\nwhile maintaining control over the false discovery rate (FDR). In these\napplications, strong dependencies often exist among the variables (e.g., stock\nreturns), which can undermine the FDR control property of existing methods like\nthe model-X knockoff method or the T-Rex selector. To address this issue, we\nhave expanded the T-Rex framework to accommodate overlapping groups of highly\ncorrelated variables. This is achieved by integrating a nearest neighbors\npenalization mechanism into the framework, which provably controls the FDR at\nthe user-defined target level. A real-world example of sparse index tracking\ndemonstrates the proposed method's ability to accurately track the S&P 500\nindex over the past 20 years based on a small number of stocks. An open-source\nimplementation is provided within the R package TRexSelector on CRAN."}, "http://arxiv.org/abs/2401.15225": {"title": "A bivariate two-state Markov modulated Poisson process for failure modelling", "link": "http://arxiv.org/abs/2401.15225", "description": "Motivated by a real failure dataset in a two-dimensional context, this paper\npresents an extension of the Markov modulated Poisson process (MMPP) to two\ndimensions. The one-dimensional MMPP has been proposed for the modeling of\ndependent and non-exponential inter-failure times (in contexts as queuing, risk\nor reliability, among others). The novel two-dimensional MMPP allows for\ndependence between the two sequences of inter-failure times, while at the same\ntime preserves the MMPP properties, marginally. The generalization is based on\nthe Marshall-Olkin exponential distribution. Inference is undertaken for the\nnew model through a method combining a matching moments approach with an\nApproximate Bayesian Computation (ABC) algorithm. The performance of the method\nis shown on simulated and real datasets representing times and distances\ncovered between consecutive failures in a public transport company. 
For the\nreal dataset, some quantities of importance associated with the reliability of\nthe system are estimated, such as the probabilities and expected number of failures\nat different times and distances covered by trains until the occurrence of a\nfailure."}, "http://arxiv.org/abs/2401.15259": {"title": "Estimating lengths-of-stay of hospitalised COVID-19 patients using a non-parametric model: a case study in Galicia (Spain)", "link": "http://arxiv.org/abs/2401.15259", "description": "Estimating the lengths-of-stay (LoS) of hospitalised COVID-19 patients is key\nfor predicting the demand for hospital beds and planning mitigation strategies, as\noverwhelming healthcare systems has critical consequences for disease\nmortality. However, accurately mapping the time-to-event of hospital outcomes,\nsuch as the LoS in the intensive care unit (ICU), requires understanding\npatient trajectories while adjusting for covariates and observation bias, such\nas incomplete data. Standard methods, such as the Kaplan-Meier estimator,\nrequire prior assumptions that are untenable given current knowledge. Using\nreal-time surveillance data from the first weeks of the COVID-19 epidemic in\nGalicia (Spain), we aimed to model the time-to-event and event probabilities of\nhospitalised patients, without parametric priors and adjusting for individual\ncovariates. We applied a non-parametric mixture cure model and compared its\nperformance in estimating hospital ward (HW)/ICU LoS to the performances of\ncommonly used methods to estimate survival. We showed that the proposed model\noutperformed standard approaches, providing more accurate ICU and HW LoS\nestimates. Finally, we applied our model estimates to simulate COVID-19\nhospital demand using a Monte Carlo algorithm. We provided evidence that\nadjusting for sex, generally overlooked in prediction models, together with age\nis key for accurately forecasting HW and ICU occupancy, as well as discharge or\ndeath outcomes."}, "http://arxiv.org/abs/2401.15262": {"title": "Asymptotic Behavior of Adversarial Training Estimator under $\\ell_\\infty$-Perturbation", "link": "http://arxiv.org/abs/2401.15262", "description": "Adversarial training has been proposed to hedge against adversarial attacks\nin machine learning and statistical models. This paper focuses on adversarial\ntraining under $\\ell_\\infty$-perturbation, which has recently attracted much\nresearch attention. The asymptotic behavior of the adversarial training\nestimator is investigated in the generalized linear model. The results imply\nthat the limiting distribution of the adversarial training estimator under\n$\\ell_\\infty$-perturbation could put a positive probability mass at $0$ when\nthe true parameter is $0$, providing a theoretical guarantee of the associated\nsparsity-recovery ability. Alternatively, a two-step procedure is proposed --\nadaptive adversarial training, which could further improve the performance of\nadversarial training under $\\ell_\\infty$-perturbation. Specifically, the\nproposed procedure could achieve asymptotic unbiasedness and variable-selection\nconsistency. 
Numerical experiments are conducted to show the sparsity-recovery\nability of adversarial training under $\\ell_\\infty$-perturbation and to compare\nthe empirical performance between classic adversarial training and adaptive\nadversarial training."}, "http://arxiv.org/abs/2401.15281": {"title": "Improved confidence intervals for nonlinear mixed-effects and nonparametric regression models", "link": "http://arxiv.org/abs/2401.15281", "description": "Statistical inference for high dimensional parameters (HDPs) can be based on\ntheir intrinsic correlation; that is, parameters that are close spatially or\ntemporally tend to have more similar values. This is why nonlinear\nmixed-effects models (NMMs) are commonly (and appropriately) used for models\nwith HDPs. Conversely, in many practical applications of NMM, the random\neffects (REs) are actually correlated HDPs that should remain constant during\nrepeated sampling for frequentist inference. In both scenarios, the inference\nshould be conditional on REs, instead of marginal inference by integrating out\nREs. In this paper, we first summarize recent theory of conditional inference\nfor NMM, and then propose a bias-corrected RE predictor and confidence interval\n(CI). We also extend this methodology to accommodate the case where some REs\nare not associated with data. Simulation studies indicate that this new\napproach leads to substantial improvement in the conditional coverage rate of\nRE CIs, including CIs for smooth functions in generalized additive models, as\ncompared to the existing method based on marginal inference."}, "http://arxiv.org/abs/2401.15309": {"title": "Zero-inflated Smoothing Spline (ZISS) Models for Individual-level Single-cell Temporal Data", "link": "http://arxiv.org/abs/2401.15309", "description": "Recent advancements in single-cell RNA-sequencing (scRNA-seq) have enhanced\nour understanding of cell heterogeneity at a high resolution. With the ability\nto sequence over 10,000 cells per hour, researchers can collect large scRNA-seq\ndatasets for different participants, offering an opportunity to study the\ntemporal progression of individual-level single-cell data. However, the\npresence of excessive zeros, a common issue in scRNA-seq, significantly impacts\nregression/association analysis, potentially leading to biased estimates in\ndownstream analysis. Addressing these challenges, we introduce the Zero\nInflated Smoothing Spline (ZISS) method, specifically designed to model\nsingle-cell temporal data. The ZISS method encompasses two components for\nmodeling gene expression patterns over time and handling excessive zeros. Our\napproach employs the smoothing spline ANOVA model, providing robust estimates\nof mean functions and zero probabilities for irregularly observed single-cell\ntemporal data compared to existing methods in our simulation studies and real\ndata analysis."}, "http://arxiv.org/abs/2401.15382": {"title": "Inference on an heteroscedastic Gompertz tumor growth model", "link": "http://arxiv.org/abs/2401.15382", "description": "We consider a non homogeneous Gompertz diffusion process whose parameters are\nmodified by generally time-dependent exogenous factors included in the\ninfinitesimal moments. The proposed model is able to describe tumor dynamics\nunder the effect of anti-proliferative and/or cell death-induced therapies. We\nassume that such therapies can modify also the infinitesimal variance of the\ndiffusion process. 
An estimation procedure, based on a control group and two\ntreated groups, is proposed to infer the model by estimating the constant\nparameters and the time-dependent terms. Moreover, several concatenated\nhypothesis tests are considered in order to confirm or reject the need to\ninclude time-dependent functions in the infinitesimal moments. Simulations are\nprovided to evaluate the efficiency of the suggested procedures and to validate\nthe testing hypothesis. Finally, an application to real data is considered."}, "http://arxiv.org/abs/2401.15461": {"title": "Anytime-Valid Tests of Group Invariance through Conformal Prediction", "link": "http://arxiv.org/abs/2401.15461", "description": "The assumption that data are invariant under the action of a compact group is\nimplicit in many statistical modeling assumptions such as normality, or the\nassumption of independence and identical distributions. Hence, testing for the\npresence of such invariances offers a principled way to falsify various\nstatistical models. In this article, we develop sequential, anytime-valid tests\nof distributional symmetry under the action of general compact groups. The\ntests that are developed allow for the continuous monitoring of data as it is\ncollected while keeping type-I error guarantees, and include tests for\nexchangeability and rotational symmetry as special examples. The main tool to\nthis end is the machinery developed for conformal prediction. The resulting\ntest statistic, called a conformal martingale, can be interpreted as a\nlikelihood ratio. We use this interpretation to show that the test statistics\nare optimal -- in a specific log-optimality sense -- against certain\nalternatives. Furthermore, we draw a connection between conformal prediction,\nanytime-valid tests of distributional invariance, and current developments on\nanytime-valid testing. In particular, we extend existing anytime-valid tests of\nindependence, which leverage exchangeability, to work under general group\ninvariances. Additionally, we discuss testing for invariance under subgroups of\nthe permutation group and orthogonal group, the latter of which corresponds to\ntesting the assumptions behind linear regression models."}, "http://arxiv.org/abs/2401.15514": {"title": "Validity of Complete Case Analysis Depends on the Target Population", "link": "http://arxiv.org/abs/2401.15514", "description": "Missing data is a pernicious problem in epidemiologic research. Research on\nthe validity of complete case analysis for missing data has typically focused\non estimating the average treatment effect (ATE) in the whole population.\nHowever, other target populations like the treated (ATT) or external targets\ncan be of substantive interest. In such cases, whether missing covariate data\noccurs within or outside the target population may impact the validity of\ncomplete case analysis. We sought to assess bias in complete case analysis when\ncovariate data is missing outside the target (e.g., missing covariate data\namong the untreated when estimating the ATT). We simulated a study of the\neffect of a binary treatment X on a binary outcome Y in the presence of 3\nconfounders C1-C3 that modified the risk difference (RD). We induced\nmissingness in C1 only among the untreated under 4 scenarios: completely\nrandomly (similar to MCAR); randomly based on C2 and C3 (similar to MAR);\nrandomly based on C1 (similar to MNAR); or randomly based on Y (similar to\nMAR). 
We estimated the ATE and ATT using weighting and averaged results across\nthe replicates. We conducted a parallel simulation transporting trial results\nto a target population in the presence of missing covariate data in the trial.\nIn the complete case analysis, estimated ATE was unbiased only when C1 was MCAR\namong the untreated. The estimated ATT, on the other hand, was unbiased in all\nscenarios except when Y caused missingness. The parallel simulation of\ngeneralizing and transporting trial results saw similar bias patterns. If\nmissing covariate data is only present outside the target population, complete\ncase analysis is unbiased except when missingness is associated with the\noutcome."}, "http://arxiv.org/abs/2401.15519": {"title": "Large Deviation Analysis of Score-based Hypothesis Testing", "link": "http://arxiv.org/abs/2401.15519", "description": "Score-based statistical models play an important role in modern machine\nlearning, statistics, and signal processing. For hypothesis testing, a\nscore-based hypothesis test is proposed in \\cite{wu2022score}. We analyze the\nperformance of this score-based hypothesis testing procedure and derive upper\nbounds on the probabilities of its Type I and II errors. We prove that the\nexponents of our error bounds are asymptotically (in the number of samples)\ntight for the case of simple null and alternative hypotheses. We calculate\nthese error exponents explicitly in specific cases and provide numerical\nstudies for various other scenarios of interest."}, "http://arxiv.org/abs/2401.15567": {"title": "Matrix Supermartingales and Randomized Matrix Concentration Inequalities", "link": "http://arxiv.org/abs/2401.15567", "description": "We present new concentration inequalities for either martingale dependent or\nexchangeable random symmetric matrices under a variety of tail conditions,\nencompassing standard Chernoff bounds to self-normalized heavy-tailed settings.\nThese inequalities are often randomized in a way that renders them strictly\ntighter than existing deterministic results in the literature, are typically\nexpressed in the Loewner order, and are sometimes valid at arbitrary\ndata-dependent stopping times.\n\nAlong the way, we explore the theory of matrix supermartingales and maximal\ninequalities, potentially of independent interest."}, "http://arxiv.org/abs/2401.15623": {"title": "GT-PCA: Effective and Interpretable Dimensionality Reduction with General Transform-Invariant Principal Component Analysis", "link": "http://arxiv.org/abs/2401.15623", "description": "Data analysis often requires methods that are invariant with respect to\nspecific transformations, such as rotations in case of images or shifts in case\nof images and time series. While principal component analysis (PCA) is a\nwidely-used dimension reduction technique, it lacks robustness with respect to\nthese transformations. Modern alternatives, such as autoencoders, can be\ninvariant with respect to specific transformations but are generally not\ninterpretable. We introduce General Transform-Invariant Principal Component\nAnalysis (GT-PCA) as an effective and interpretable alternative to PCA and\nautoencoders. 
We propose a neural network that efficiently estimates the\ncomponents and show that GT-PCA significantly outperforms alternative methods\nin experiments based on synthetic and real data."}, "http://arxiv.org/abs/2401.15680": {"title": "How to achieve model-robust inference in stepped wedge trials with model-based methods?", "link": "http://arxiv.org/abs/2401.15680", "description": "A stepped wedge design is a unidirectional crossover design where clusters\nare randomized to distinct treatment sequences defined by calendar time. While\nmodel-based analysis of stepped wedge designs -- via linear mixed models or\ngeneralized estimating equations -- is standard practice to evaluate treatment\neffects accounting for clustering and adjusting for baseline covariates, formal\nresults on their model-robustness properties remain unavailable. In this\narticle, we study when a potentially misspecified multilevel model can offer\nconsistent estimators for treatment effect estimands that are functions of\ncalendar time and/or exposure time. We describe a super-population potential\noutcomes framework to define treatment effect estimands of interest in stepped\nwedge designs, and adapt linear mixed models and generalized estimating\nequations to achieve estimand-aligned inference. We prove a central result\nthat, as long as the treatment effect structure is correctly specified in each\nworking model, our treatment effect estimator is robust to arbitrary\nmisspecification of all remaining model components. The theoretical results are\nillustrated via simulation experiments and re-analysis of a cardiovascular\nstepped wedge cluster randomized trial."}, "http://arxiv.org/abs/2401.15694": {"title": "Constrained Markov decision processes for response-adaptive procedures in clinical trials with binary outcomes", "link": "http://arxiv.org/abs/2401.15694", "description": "A constrained Markov decision process (CMDP) approach is developed for\nresponse-adaptive procedures in clinical trials with binary outcomes. The\nresulting CMDP class of Bayesian response-adaptive procedures can be used to\ntarget a certain objective, e.g., patient benefit or power, while using\nconstraints to keep other operating characteristics under control. In the CMDP\napproach, the constraints can be formulated under different priors, which can\ninduce a certain behaviour of the policy under a given statistical hypothesis,\nor given that the parameters lie in a specific part of the parameter space. A\nsolution method is developed to find the optimal policy, as well as a more\nefficient method, based on backward recursion, which often yields a\nnear-optimal solution with an available optimality gap. Three applications are\nconsidered, involving type I error and power constraints, constraints on the\nmean squared error, and a constraint on prior robustness. While the CMDP\napproach slightly outperforms the constrained randomized dynamic programming\n(CRDP) procedure known from the literature when focussing on type I and II error\nand mean squared error, showing the general quality of CRDP, CMDP significantly\noutperforms CRDP when the focus is on type I and II error only."}, "http://arxiv.org/abs/2401.15703": {"title": "A Bayesian multivariate extreme value mixture model", "link": "http://arxiv.org/abs/2401.15703", "description": "Impact assessment of natural hazards requires the consideration of both\nextreme and non-extreme events. 
Extensive research has been conducted on the\njoint modeling of bulk and tail in univariate settings; however, the\ncorresponding body of research in the context of multivariate analysis is\ncomparatively scant. This study extends the univariate joint modeling of bulk\nand tail to the multivariate framework. Specifically, it pertains to cases\nwhere multivariate observations exceed a high threshold in at least one\ncomponent. We propose a multivariate mixture model that assumes a parametric\nmodel to capture the bulk of the distribution, which is in the max-domain of\nattraction (MDA) of a multivariate extreme value distribution (mGEVD). The tail\nis described by the multivariate generalized Pareto distribution, which is\nasymptotically justified to model multivariate threshold exceedances. We show\nthat if all components exceed the threshold, our mixture model is in the MDA of\nan mGEVD. Bayesian inference based on multivariate random-walk\nMetropolis-Hastings and the automated factor slice sampler allows us to\nincorporate uncertainty from the threshold selection easily. Due to\ncomputational limitations, simulations and data applications are provided for\ndimension $d=2$, but a discussion is provided with views toward scalability\nbased on pairwise likelihood."}, "http://arxiv.org/abs/2401.15730": {"title": "Statistical analysis and first-passage-time applications of a lognormal diffusion process with multi-sigmoidal logistic mean", "link": "http://arxiv.org/abs/2401.15730", "description": "We consider a lognormal diffusion process having a multisigmoidal logistic\nmean, useful to model the evolution of a population which reaches the maximum\nlevel of the growth after many stages. Referring to the problem of statistical\ninference, two procedures to find the maximum likelihood estimates of the\nunknown parameters are described. One is based on the resolution of the system\nof the critical points of the likelihood function, and the other on the\nmaximization of the likelihood function with the simulated annealing algorithm.\nA simulation study to validate the described strategies for finding the\nestimates is also presented, with a real application to epidemiological data.\nSpecial attention is also devoted to the first-passage-time problem of the\nconsidered diffusion process through a fixed boundary."}, "http://arxiv.org/abs/2401.15778": {"title": "On the partial autocorrelation function for locally stationary time series: characterization, estimation and inference", "link": "http://arxiv.org/abs/2401.15778", "description": "For stationary time series, it is common to use the plots of partial\nautocorrelation function (PACF) or PACF-based tests to explore the temporal\ndependence structure of such processes. To the best of our knowledge, such analogs for\nnon-stationary time series have not been fully established yet. In this paper,\nwe fill this gap for locally stationary time series with short-range\ndependence. First, we characterize the PACF locally in the time domain and\nshow that the $j$th PACF, denoted as $\\rho_{j}(t),$ decays with $j$ whose rate\nis adaptive to the temporal dependence of the time series $\\{x_{i,n}\\}$.\nSecond, at time $i,$ we justify that the PACF $\\rho_j(i/n)$ can be efficiently\napproximated by the best linear prediction coefficients via the Yule-Walker's\nequations. This allows us to study the PACF via ordinary least squares (OLS)\nlocally. Third, we show that the PACF is smooth in time for locally stationary\ntime series. 
We use the sieve method with OLS to estimate $\\rho_j(\\cdot)$ and\nconstruct some statistics to test the PACFs and infer the structures of the\ntime series. These tests generalize and modify those used for stationary time\nseries. Finally, a multiplier bootstrap algorithm is proposed for practical\nimplementation and an $\\mathtt R$ package $\\mathtt {Sie2nts}$ is provided to\nimplement our algorithm. Numerical simulations and real data analysis also\nconfirm the usefulness of our results."}, "http://arxiv.org/abs/2401.15793": {"title": "Doubly regularized generalized linear models for spatial observations with high-dimensional covariates", "link": "http://arxiv.org/abs/2401.15793", "description": "A discrete spatial lattice can be cast as a network structure over which\nspatially-correlated outcomes are observed. A second network structure may also\ncapture similarities among measured features, when such information is\navailable. Incorporating the network structures when analyzing such\ndoubly-structured data can improve predictive power, and lead to better\nidentification of important features in the data-generating process. Motivated\nby applications in spatial disease mapping, we develop a new doubly regularized\nregression framework to incorporate these network structures for analyzing\nhigh-dimensional datasets. Our estimators can easily be implemented with\nstandard convex optimization algorithms. In addition, we describe a procedure\nto obtain asymptotically valid confidence intervals and hypothesis tests for\nour model parameters. We show empirically that our framework provides improved\npredictive accuracy and inferential power compared to existing high-dimensional\nspatial methods. These advantages hold given fully accurate network\ninformation, and also with networks which are partially misspecified or\nuninformative. The application of the proposed method to modeling COVID-19\nmortality data suggests that it can improve prediction of deaths beyond\nstandard spatial models, and that it selects relevant covariates more often."}, "http://arxiv.org/abs/2401.15796": {"title": "High-Dimensional False Discovery Rate Control for Dependent Variables", "link": "http://arxiv.org/abs/2401.15796", "description": "Algorithms that ensure reproducible findings from large-scale,\nhigh-dimensional data are pivotal in numerous signal processing applications.\nIn recent years, multivariate false discovery rate (FDR) controlling methods\nhave emerged, providing guarantees even in high-dimensional settings where the\nnumber of variables surpasses the number of samples. However, these methods\noften fail to reliably control the FDR in the presence of highly dependent\nvariable groups, a common characteristic in fields such as genomics and\nfinance. To tackle this critical issue, we introduce a novel framework that\naccounts for general dependency structures. Our proposed dependency-aware T-Rex\nselector integrates hierarchical graphical models within the T-Rex framework to\neffectively harness the dependency structure among variables. Leveraging\nmartingale theory, we prove that our variable penalization mechanism ensures\nFDR control. We further generalize the FDR-controlling framework by stating and\nproving a clear condition necessary for designing both graphical and\nnon-graphical models that capture dependencies. 
Additionally, we formulate a\nfully integrated optimal calibration algorithm that concurrently determines the\nparameters of the graphical model and the T-Rex framework, such that the FDR is\ncontrolled while maximizing the number of selected variables. Numerical\nexperiments and a breast cancer survival analysis use-case demonstrate that the\nproposed method is the only one among the state-of-the-art benchmark methods\nthat controls the FDR and reliably detects genes that have been previously\nidentified to be related to breast cancer. An open-source implementation is\navailable within the R package TRexSelector on CRAN."}, "http://arxiv.org/abs/2401.15806": {"title": "Continuous-time structural failure time model for intermittent treatment", "link": "http://arxiv.org/abs/2401.15806", "description": "The intermittent intake of treatment is commonly seen in patients with\nchronic disease. For example, patients with atrial fibrillation may need to\ndiscontinue the oral anticoagulants when they experience a certain surgery and\nre-initiate the treatment after the surgery. As another example, patients may\nskip a few days before they refill a treatment as planned. This treatment\ndispensation information (i.e., the time at which a patient initiates and\nrefills a treatment) is recorded in the electronic healthcare records or claims\ndatabase, and each patient has a different treatment dispensation. Current\nmethods to estimate the effects of such treatments censor the patients who\nre-initiate the treatment, which results in information loss or biased\nestimation. In this work, we present methods to estimate the effect of\ntreatments on failure time outcomes by taking all the treatment dispensation\ninformation. The developed methods are based on the continuous-time structural\nfailure time model, where the dependent censoring is tackled by inverse\nprobability of censoring weighting. The estimators are doubly robust and\nlocally efficient."}, "http://arxiv.org/abs/2401.15811": {"title": "Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms", "link": "http://arxiv.org/abs/2401.15811", "description": "Two-sided platforms are central to modern commerce and content sharing and\noften utilize A/B testing for developing new features. While user-side\nexperiments are common, seller-side experiments become crucial for specific\ninterventions and metrics. This paper investigates the effects of interference\ncaused by feedback loops on seller-side experiments in two-sided platforms,\nwith a particular focus on the counterfactual interleaving design, proposed in\n\\citet{ha2020counterfactual,nandy2021b}. These feedback loops, often generated\nby pacing algorithms, cause outcomes from earlier sessions to influence\nsubsequent ones. This paper contributes by creating a mathematical framework to\nanalyze this interference, theoretically estimating its impact, and conducting\nempirical evaluations of the counterfactual interleaving design in real-world\nscenarios. Our research shows that feedback loops can result in misleading\nconclusions about the treatment effects."}, "http://arxiv.org/abs/2401.15903": {"title": "Toward the Identifiability of Comparative Deep Generative Models", "link": "http://arxiv.org/abs/2401.15903", "description": "Deep Generative Models (DGMs) are versatile tools for learning data\nrepresentations while adequately incorporating domain knowledge such as the\nspecification of conditional probability distributions. 
Recently proposed DGMs\ntackle the important task of comparing data sets from different sources. One\nsuch example is the setting of contrastive analysis that focuses on describing\npatterns that are enriched in a target data set compared to a background data\nset. The practical deployment of those models often assumes that DGMs naturally\ninfer interpretable and modular latent representations, which is known to be an\nissue in practice. Consequently, existing methods often rely on ad-hoc\nregularization schemes, although without any theoretical grounding. Here, we\npropose a theory of identifiability for comparative DGMs by extending recent\nadvances in the field of non-linear independent component analysis. We show\nthat, while these models lack identifiability across a general class of mixing\nfunctions, they surprisingly become identifiable when the mixing function is\npiece-wise affine (e.g., parameterized by a ReLU neural network). We also\ninvestigate the impact of model misspecification, and empirically show that\npreviously proposed regularization techniques for fitting comparative DGMs help\nwith identifiability when the number of latent variables is not known in\nadvance. Finally, we introduce a novel methodology for fitting comparative DGMs\nthat improves the treatment of multiple data sources via multi-objective\noptimization and that helps adjust the hyperparameter for the regularization in\nan interpretable manner, using constrained optimization. We empirically\nvalidate our theory and new methodology using simulated data as well as a\nrecent data set of genetic perturbations in cells profiled via single-cell RNA\nsequencing."}, "http://arxiv.org/abs/2401.16099": {"title": "A Ridgelet Approach to Poisson Denoising", "link": "http://arxiv.org/abs/2401.16099", "description": "This paper introduces a novel ridgelet transform-based method for Poisson\nimage denoising. Our work focuses on harnessing the Poisson noise's unique\nnon-additive and signal-dependent properties, distinguishing it from Gaussian\nnoise. The core of our approach is a new thresholding scheme informed by\ntheoretical insights into the ridgelet coefficients of Poisson-distributed\nimages and adaptive thresholding guided by Stein's method.\n\nWe verify our theoretical model through numerical experiments and demonstrate\nthe potential of ridgelet thresholding across assorted scenarios. Our findings\nrepresent a significant step in enhancing the understanding of Poisson noise\nand offer an effective denoising method for images corrupted with it."}, "http://arxiv.org/abs/2401.16286": {"title": "Robust Functional Data Analysis for Stochastic Evolution Equations in Infinite Dimensions", "link": "http://arxiv.org/abs/2401.16286", "description": "This article addresses the robust measurement of covariations in the context\nof solutions to stochastic evolution equations in Hilbert spaces using\nfunctional data analysis. For such equations, standard techniques for\nfunctional data based on cross-sectional covariances are often inadequate for\nidentifying statistically relevant random drivers and detecting outliers since\nthey overlook the interplay between cross-sectional and temporal structures.\nTherefore, we develop an estimation theory for the continuous quadratic\ncovariation of the latent random driver of the equation instead of a static\ncovariance of the observable solution process. 
We derive identifiability\nresults under weak conditions, establish rates of convergence and a central\nlimit theorem based on infill asymptotics, and provide long-time asymptotics\nfor estimation of a static covariation of the latent driver. Applied to term\nstructure data, our approach uncovers a fundamental alignment with scaling\nlimits of covariations of specific short-term trading strategies, and an\nempirical study detects several jumps and indicates high-dimensional and\ntime-varying covariations."}, "http://arxiv.org/abs/2401.16396": {"title": "Ovarian Cancer Diagnostics using Wavelet Packet Scaling Descriptors", "link": "http://arxiv.org/abs/2401.16396", "description": "Detecting early-stage ovarian cancer accurately and efficiently is crucial\nfor timely treatment. Various methods for early diagnosis have been explored,\nincluding a focus on features derived from protein mass spectra, but these tend\nto overlook the complex interplay across protein expression levels. We propose\nan innovative method to automate the search for diagnostic features in these\nspectra by analyzing their inherent scaling characteristics. We compare two\ntechniques for estimating the self-similarity in a signal using the scaling\nbehavior of its wavelet packet decomposition. The methods are applied to the\nmass spectra using a rolling window approach, yielding a collection of\nself-similarity indexes that capture protein interactions, potentially\nindicative of ovarian cancer. Then, the most discriminatory scaling descriptors\nfrom this collection are selected for use in classification algorithms. To\nassess their effectiveness for early diagnosis of ovarian cancer, the\ntechniques are applied to two datasets from the American National Cancer\nInstitute. Comparative evaluation against an existing wavelet-based method\nshows that one wavelet packet-based technique led to improved diagnostic\nperformance for one of the analyzed datasets (95.67% vs. 96.78% test accuracy,\nrespectively). This highlights the potential of wavelet packet-based methods to\ncapture novel diagnostic information related to ovarian cancer. This innovative\napproach offers promise for better early detection and improved patient\noutcomes in ovarian cancer."}, "http://arxiv.org/abs/2010.16271": {"title": "View selection in multi-view stacking: Choosing the meta-learner", "link": "http://arxiv.org/abs/2010.16271", "description": "Multi-view stacking is a framework for combining information from different\nviews (i.e. different feature sets) describing the same set of objects. In this\nframework, a base-learner algorithm is trained on each view separately, and\ntheir predictions are then combined by a meta-learner algorithm. In a previous\nstudy, stacked penalized logistic regression, a special case of multi-view\nstacking, has been shown to be useful in identifying which views are most\nimportant for prediction. In this article we expand this research by\nconsidering seven different algorithms to use as the meta-learner, and\nevaluating their view selection and classification performance in simulations\nand two applications on real gene-expression data sets. Our results suggest\nthat if both view selection and classification accuracy are important to the\nresearch at hand, then the nonnegative lasso, nonnegative adaptive lasso and\nnonnegative elastic net are suitable meta-learners. Exactly which among these\nthree is to be preferred depends on the research context. 
The remaining four\nmeta-learners, namely nonnegative ridge regression, nonnegative forward\nselection, stability selection and the interpolating predictor, show little\nadvantage over the other three."}, "http://arxiv.org/abs/2201.00409": {"title": "Global convergence of optimized adaptive importance samplers", "link": "http://arxiv.org/abs/2201.00409", "description": "We analyze the optimized adaptive importance sampler (OAIS) for performing\nMonte Carlo integration with general proposals. We leverage a classical result\nwhich shows that the bias and the mean-squared error (MSE) of importance\nsampling scale with the $\\chi^2$-divergence between the target and the\nproposal and develop a scheme which performs global optimization of\n$\\chi^2$-divergence. While it is known that this quantity is convex for\nexponential family proposals, the case of general proposals has been an\nopen problem. We close this gap by utilizing the nonasymptotic bounds for\nstochastic gradient Langevin dynamics (SGLD) for the global optimization of\n$\\chi^2$-divergence and derive nonasymptotic bounds for the MSE by leveraging\nrecent results from non-convex optimization literature. The resulting AIS\nschemes have explicit theoretical guarantees that are uniform-in-time."}, "http://arxiv.org/abs/2205.13469": {"title": "Proximal Estimation and Inference", "link": "http://arxiv.org/abs/2205.13469", "description": "We build a unifying convex analysis framework characterizing the statistical\nproperties of a large class of penalized estimators, both under a regular and\nan irregular design. Our framework interprets penalized estimators as proximal\nestimators, defined by a proximal operator applied to a corresponding initial\nestimator. We characterize the asymptotic properties of proximal estimators,\nshowing that their asymptotic distribution follows a closed-form formula\ndepending only on (i) the asymptotic distribution of the initial estimator,\n(ii) the estimator's limit penalty subgradient and (iii) the inner product\ndefining the associated proximal operator. In parallel, we characterize the\nOracle features of proximal estimators from the properties of their penalty's\nsubgradients. We exploit our approach to systematically cover linear regression\nsettings with a regular or irregular design. For these settings, we build new\n$\\sqrt{n}-$consistent, asymptotically normal Ridgeless-type proximal\nestimators, which feature the Oracle property and are shown to perform\nsatisfactorily in practically relevant Monte Carlo settings."}, "http://arxiv.org/abs/2206.05581": {"title": "Federated Offline Reinforcement Learning", "link": "http://arxiv.org/abs/2206.05581", "description": "Evidence-based or data-driven dynamic treatment regimes are essential for\npersonalized medicine, which can benefit from offline reinforcement learning\n(RL). Although massive healthcare data are available across medical\ninstitutions, they are prohibited from being shared due to privacy constraints.\nMoreover, heterogeneity exists across different sites. As a result, federated\noffline RL algorithms are necessary and promising to deal with the problems. In\nthis paper, we propose a multi-site Markov decision process model that allows\nfor both homogeneous and heterogeneous effects across sites. The proposed model\nmakes the analysis of the site-level features possible. 
We design the first\nfederated policy optimization algorithm for offline RL with sample complexity guarantees.\nThe proposed algorithm is communication-efficient, requiring only a single\nround of communication in which summary statistics are exchanged. We give a\ntheoretical guarantee for the proposed algorithm, where the suboptimality for\nthe learned policies is comparable to the rate as if the data were not distributed.\nExtensive simulations demonstrate the effectiveness of the proposed algorithm.\nThe method is applied to a sepsis dataset in multiple sites to illustrate its\nuse in clinical settings."}, "http://arxiv.org/abs/2210.10171": {"title": "Doubly-robust and heteroscedasticity-aware sample trimming for causal inference", "link": "http://arxiv.org/abs/2210.10171", "description": "A popular method for variance reduction in observational causal inference is\npropensity-based trimming, the practice of removing units with extreme\npropensities from the sample. This practice has theoretical grounding when the\ndata are homoscedastic and the propensity model is parametric (Yang and Ding,\n2018; Crump et al. 2009), but in modern settings where heteroscedastic data are\nanalyzed with non-parametric models, existing theory fails to support current\npractice. In this work, we address this challenge by developing new methods and\ntheory for sample trimming. Our contributions are three-fold: first, we\ndescribe novel procedures for selecting which units to trim. Our procedures\ndiffer from previous work in that we trim not only units with small\npropensities, but also units with extreme conditional variances. Second, we\ngive new theoretical guarantees for inference after trimming. In particular, we\nshow how to perform inference on the trimmed subpopulation without requiring\nthat our regressions converge at parametric rates. Instead, we make only\nfourth-root rate assumptions like those in the double machine learning\nliterature. This result applies to conventional propensity-based trimming as\nwell and thus may be of independent interest. Finally, we propose a\nbootstrap-based method for constructing simultaneously valid confidence\nintervals for multiple trimmed sub-populations, which are valuable for\nnavigating the trade-off between sample size and variance reduction inherent in\ntrimming. We validate our methods in simulation, on the 2007-2008 National\nHealth and Nutrition Examination Survey, and on a semi-synthetic Medicare\ndataset and find promising results in all settings."}, "http://arxiv.org/abs/2211.16384": {"title": "Parameter Estimation with Increased Precision for Elliptic and Hypo-elliptic Diffusions", "link": "http://arxiv.org/abs/2211.16384", "description": "This work aims at making a comprehensive contribution in the general area of\nparametric inference for discretely observed diffusion processes. Established\napproaches for likelihood-based estimation invoke a time-discretisation scheme\nfor the approximation of the intractable transition dynamics of the Stochastic\nDifferential Equation (SDE) model over finite time periods. The scheme is\napplied for a step-size that is either user-selected or determined by the data.\nRecent research has highlighted the critical effect of the choice of numerical\nscheme on the behaviour of derived parameter estimates in the setting of\nhypo-elliptic SDEs. 
In brief, in our work, first, we develop two weak second\norder sampling schemes (to cover both hypo-elliptic and elliptic SDEs) and\nproduce a small time expansion for the density of the schemes to form a proxy\nfor the true intractable SDE transition density. Then, we establish a\ncollection of analytic results for likelihood-based parameter estimates\nobtained via the formed proxies, thus providing a theoretical framework that\nshowcases advantages from the use of the developed methodology for SDE\ncalibration. We present numerical results from carrying out classical or\nBayesian inference, for both elliptic and hypo-elliptic SDEs."}, "http://arxiv.org/abs/2212.09009": {"title": "Locally Simultaneous Inference", "link": "http://arxiv.org/abs/2212.09009", "description": "Selective inference is the problem of giving valid answers to statistical\nquestions chosen in a data-driven manner. A standard solution to selective\ninference is simultaneous inference, which delivers valid answers to the set of\nall questions that could possibly have been asked. However, simultaneous\ninference can be unnecessarily conservative if this set includes many questions\nthat were unlikely to be asked in the first place. We introduce a less\nconservative solution to selective inference that we call locally simultaneous\ninference, which only answers those questions that could plausibly have been\nasked in light of the observed data, all the while preserving rigorous type I\nerror guarantees. For example, if the objective is to construct a confidence\ninterval for the \"winning\" treatment effect in a clinical trial with multiple\ntreatments, and it is obvious in hindsight that only one treatment had a chance\nto win, then our approach will return an interval that is nearly the same as\nthe uncorrected, standard interval. Locally simultaneous inference is\nimplemented by refining any method for simultaneous inference of interest.\nUnder mild conditions satisfied by common confidence intervals, locally\nsimultaneous inference strictly dominates its underlying simultaneous inference\nmethod, meaning it can never yield less statistical power but only more.\nCompared to conditional selective inference, which demands stronger guarantees,\nlocally simultaneous inference is more easily applicable in nonparametric\nsettings and is more numerically stable."}, "http://arxiv.org/abs/2302.00230": {"title": "Revisiting the Effects of Maternal Education on Adolescents' Academic Performance: Doubly Robust Estimation in a Network-Based Observational Study", "link": "http://arxiv.org/abs/2302.00230", "description": "In many contexts, particularly when study subjects are adolescents, peer\neffects can invalidate typical statistical requirements in the data. For\ninstance, it is plausible that a student's academic performance is influenced\nboth by their own mother's educational level as well as that of their peers.\nSince the underlying social network is measured, the Add Health study provides\na unique opportunity to examine the impact of maternal college education on\nadolescent school performance, both direct and indirect. However, causal\ninference on populations embedded in social networks poses technical\nchallenges, since the typical no interference assumption no longer holds. While\ninverse probability-of-treatment weighted (IPW) estimators have been developed\nfor this setting, they are often highly unstable. 
Motivated by the question of\nmaternal education, we propose doubly robust (DR) estimators combining models\nfor treatment and outcome that are consistent and asymptotically normal if\neither model is correctly specified. We present empirical results that\nillustrate the DR property and the efficiency gain of DR over IPW estimators\neven when the treatment model is misspecified. Contrary to previous studies,\nour robust analysis does not provide evidence of an indirect effect of maternal\neducation on academic performance within adolescents' social circles in Add\nHealth."}, "http://arxiv.org/abs/2304.04374": {"title": "Partial Identification of Causal Effects Using Proxy Variables", "link": "http://arxiv.org/abs/2304.04374", "description": "Proximal causal inference is a recently proposed framework for evaluating\ncausal effects in the presence of unmeasured confounding. For point\nidentification of causal effects, it leverages a pair of so-called treatment\nand outcome confounding proxy variables, to identify a bridge function that\nmatches the dependence of potential outcomes or treatment variables on the\nhidden factors to corresponding functions of observed proxies. Unique\nidentification of a causal effect via a bridge function crucially requires that\nproxies are sufficiently relevant for hidden factors, a requirement that has\npreviously been formalized as a completeness condition. However, completeness\nis well-known not to be empirically testable, and although a bridge function\nmay be well-defined, lack of completeness, sometimes manifested by availability\nof a single type of proxy, may severely limit prospects for identification of a\nbridge function and thus a causal effect; therefore, potentially restricting\nthe application of the proximal causal framework. In this paper, we propose\npartial identification methods that do not require completeness and obviate the\nneed for identification of a bridge function. That is, we establish that\nproxies of unobserved confounders can be leveraged to obtain bounds on the\ncausal effect of the treatment on the outcome even if available information\ndoes not suffice to identify either a bridge function or a corresponding causal\neffect of interest. Our bounds are non-smooth functionals of the observed data\ndistribution. As a consequence, in the context of inference, we initially\nprovide a smooth approximation of our bounds. Subsequently, we leverage\nbootstrap confidence intervals on the approximated bounds. We further establish\nanalogous partial identification results in related settings where\nidentification hinges upon hidden mediators for which proxies are available."}, "http://arxiv.org/abs/2306.16858": {"title": "Methods for non-proportional hazards in clinical trials: A systematic review", "link": "http://arxiv.org/abs/2306.16858", "description": "For the analysis of time-to-event data, frequently used methods such as the\nlog-rank test or the Cox proportional hazards model are based on the\nproportional hazards assumption, which is often debatable. Although a wide\nrange of parametric and non-parametric methods for non-proportional hazards\n(NPH) has been proposed, there is no consensus on the best approaches. To close\nthis gap, we conducted a systematic literature search to identify statistical\nmethods and software appropriate under NPH. Our literature search identified\n907 abstracts, out of which we included 211 articles, mostly methodological\nones. Review articles and applications were less frequently identified. 
The\narticles discuss effect measures, effect estimation and regression approaches,\nhypothesis tests, and sample size calculation approaches, which are often\ntailored to specific NPH situations. Using a unified notation, we provide an\noverview of methods available. Furthermore, we derive some guidance from the\nidentified articles. We summarize the contents from the literature review in a\nconcise way in the main text and provide more detailed explanations in the\nsupplement."}, "http://arxiv.org/abs/2308.13827": {"title": "An exhaustive ADDIS principle for online FWER control", "link": "http://arxiv.org/abs/2308.13827", "description": "In this paper we consider online multiple testing with familywise error rate\n(FWER) control, where the probability of committing at least one type I error\nshall remain under control while testing a possibly infinite sequence of\nhypotheses over time. Currently, Adaptive-Discard (ADDIS) procedures seem to be\nthe most promising online procedures with FWER control in terms of power. Now,\nour main contribution is a uniform improvement of the ADDIS principle and thus\nof all ADDIS procedures. This means that the methods we propose reject at least as\nmany hypotheses as ADDIS procedures and in some cases even more, while\nmaintaining FWER control. In addition, we show that there is no other FWER\ncontrolling procedure that enlarges the event of rejecting any hypothesis.\nFinally, we apply the new principle to derive uniform improvements of the\nADDIS-Spending and ADDIS-Graph."}, "http://arxiv.org/abs/2401.16447": {"title": "The Hubbert diffusion process: Estimation via simulated annealing and variable neighborhood search procedures", "link": "http://arxiv.org/abs/2401.16447", "description": "Accurately charting the progress of oil production is a problem of great\ncurrent interest. Oil production is widely known to be cyclical: in any given\nsystem, after it reaches its peak, a decline will begin. With this in mind,\nMarion King Hubbert developed his peak theory in 1956 based on the bell-shaped\ncurve that bears his name. In the present work, we consider a stochastic model\nbased on the theory of diffusion processes and associated with the Hubbert\ncurve. The problem of the maximum likelihood estimation of the parameters for\nthis process is also considered. Since a complex system of equations appears,\nwith a solution that cannot be guaranteed by classical numerical procedures, we\nsuggest the use of metaheuristic optimization algorithms such as simulated\nannealing and variable neighborhood search. Some strategies are suggested for\nbounding the space of solutions, and a description is provided for the\napplication of the algorithms selected. In the case of the variable\nneighborhood search algorithm, a hybrid method is proposed in which it is\ncombined with simulated annealing. In order to validate the theory developed\nhere, we also carry out some studies based on simulated data and consider two\nreal crude oil production scenarios from Norway and Kazakhstan."}, "http://arxiv.org/abs/2401.16563": {"title": "Topological Detection of Phenomenological Bifurcations with Unreliable Kernel Densities", "link": "http://arxiv.org/abs/2401.16563", "description": "Phenomenological (P-type) bifurcations are qualitative changes in stochastic\ndynamical systems whereby the stationary probability density function (PDF)\nchanges its topology. 
The current state of the art for detecting these\nbifurcations requires reliable kernel density estimates computed from an\nensemble of system realizations. However, in several real world signals such as\nBig Data, only a single system realization is available -- making it impossible\nto estimate a reliable kernel density. This study presents an approach for\ndetecting P-type bifurcations using unreliable density estimates. The approach\ncreates an ensemble of objects from Topological Data Analysis (TDA) called\npersistence diagrams from the system's sole realization and statistically\nanalyzes the resulting set. We compare several methods for replicating the\noriginal persistence diagram including Gibbs point process modelling, Pairwise\nInteraction Point Modelling, and subsampling. We show that for the purpose of\npredicting a bifurcation, the simple method of subsampling exceeds the other\ntwo methods of point process modelling in performance."}, "http://arxiv.org/abs/2401.16567": {"title": "Parallel Affine Transformation Tuning of Markov Chain Monte Carlo", "link": "http://arxiv.org/abs/2401.16567", "description": "The performance of Markov chain Monte Carlo samplers strongly depends on the\nproperties of the target distribution such as its covariance structure, the\nlocation of its probability mass and its tail behavior. We explore the use of\nbijective affine transformations of the sample space to improve the properties\nof the target distribution and thereby the performance of samplers running in\nthe transformed space. In particular, we propose a flexible and user-friendly\nscheme for adaptively learning the affine transformation during sampling.\nMoreover, the combination of our scheme with Gibbsian polar slice sampling is\nshown to produce samples of high quality at comparatively low computational\ncost in several settings based on real-world data."}, "http://arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "http://arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in\nprecision medicine. Specific interests lie in identifying the differential\neffect of different treatments based on some external covariates. We propose a\nnovel non-parametric treatment effect estimation method in a multi-treatment\nsetting. Our non-parametric modeling of the response curves relies on radial\nbasis function (RBF)-nets with shared hidden neurons. Our model thus\nfacilitates modeling commonality among the treatment outcomes. The estimation\nand inference schemes are developed under a Bayesian framework and implemented\nvia an efficient Markov chain Monte Carlo algorithm, appropriately\naccommodating uncertainty in all aspects of the analysis. The numerical\nperformance of the method is demonstrated through simulation experiments.\nApplying our proposed method to MIMIC data, we obtain several interesting\nfindings related to the impact of different treatment strategies on the length\nof ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "http://arxiv.org/abs/2401.16596": {"title": "PrIsing: Privacy-Preserving Peer Effect Estimation via Ising Model", "link": "http://arxiv.org/abs/2401.16596", "description": "The Ising model, originally developed as a spin-glass model for ferromagnetic\nelements, has gained popularity as a network-based model for capturing\ndependencies in agents' outputs. 
Its increasing adoption in healthcare and the\nsocial sciences has raised privacy concerns regarding the confidentiality of\nagents' responses. In this paper, we present a novel\n$(\\varepsilon,\\delta)$-differentially private algorithm specifically designed\nto protect the privacy of individual agents' outcomes. Our algorithm allows for\nprecise estimation of the natural parameter using a single network through an\nobjective perturbation technique. Furthermore, we establish regret bounds for\nthis algorithm and assess its performance on synthetic datasets and two\nreal-world networks: one involving HIV status in a social network and the other\nconcerning the political leaning of online blogs."}, "http://arxiv.org/abs/2401.16598": {"title": "Probabilistic Context Neighborhood Model for Lattices", "link": "http://arxiv.org/abs/2401.16598", "description": "We present the Probabilistic Context Neighborhood model designed for\ntwo-dimensional lattices as a variation of a Markov Random Field assuming\ndiscrete values. In this model, the neighborhood structure has a fixed geometry\nbut a variable order, depending on the neighbors' values. Our model extends the\nProbabilistic Context Tree model, originally applicable to one-dimensional\nspace. It retains advantageous properties, such as representing the dependence\nneighborhood structure as a graph in a tree format, facilitating an\nunderstanding of model complexity. Furthermore, we adapt the algorithm used to\nestimate the Probabilistic Context Tree to estimate the parameters of the\nproposed model. We illustrate the accuracy of our estimation methodology\nthrough simulation studies. Additionally, we apply the Probabilistic Context\nNeighborhood model to spatial real-world data, showcasing its practical\nutility."}, "http://arxiv.org/abs/2401.16651": {"title": "A constructive approach to selective risk control", "link": "http://arxiv.org/abs/2401.16651", "description": "Many modern applications require the use of data to both select the\nstatistical tasks and make valid inference after selection. In this article, we\nprovide a unifying approach to control for a class of selective risks. Our\nmethod is motivated by a reformulation of the celebrated Benjamini-Hochberg\n(BH) procedure for multiple hypothesis testing as the iterative limit of the\nBenjamini-Yekutieli (BY) procedure for constructing post-selection confidence\nintervals. Although several earlier authors have made noteworthy observations\nrelated to this, our discussion highlights that (1) the BH procedure is\nprecisely the fixed-point iteration of the BY procedure; (2) the fact that the\nBH procedure controls the false discovery rate is almost an immediate corollary\nof the fact that the BY procedure controls the false coverage-statement rate.\nBuilding on this observation, we propose a constructive approach to control\nextra-selection risk (selection made after decision) by iterating decision\nstrategies that control the post-selection risk (decision made after\nselection), and show that many previous methods and results are special cases\nof this general framework. We further extend this approach to problems with\nmultiple selective risks and demonstrate how new methods can be developed. 
Our\ndevelopment leads to two surprising results about the BH procedure: (1) in the\ncontext of one-sided location testing, the BH procedure not only controls the\nfalse discovery rate at the null but also at other locations for free; (2) in\nthe context of permutation tests, the BH procedure with exact permutation\np-values can be well approximated by a procedure which only requires a total\nnumber of permutations that is almost linear in the total number of hypotheses."}, "http://arxiv.org/abs/2401.16660": {"title": "A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information", "link": "http://arxiv.org/abs/2401.16660", "description": "The effective sample size (ESS) measures the informational value of a\nprobability distribution in terms of an equivalent number of study\nparticipants. The ESS plays a crucial role in estimating the Expected Value of\nSample Information (EVSI) through the Gaussian approximation approach. Despite\nthe significance of ESS, existing ESS estimation methods within the Gaussian\napproximation framework are either computationally expensive or potentially\ninaccurate. To address these limitations, we propose a novel approach that\nestimates the ESS using the summary statistics of generated datasets and\nnonparametric regression methods. The simulation results suggest that the\nproposed method provides accurate ESS estimates at a low computational cost,\nmaking it an efficient and practical way to quantify the information contained\nin the probability distribution of a parameter. Overall, determining the ESS\ncan help analysts understand the uncertainty levels in complex prior\ndistributions in the probability analyses of decision models and perform\nefficient EVSI calculations."}, "http://arxiv.org/abs/2401.16749": {"title": "Bayesian scalar-on-network regression with applications to brain functional connectivity", "link": "http://arxiv.org/abs/2401.16749", "description": "This study presents a Bayesian regression framework to model the relationship\nbetween scalar outcomes and brain functional connectivity represented as\nsymmetric positive definite (SPD) matrices. Unlike many proposals that simply\nvectorize the connectivity predictors thereby ignoring their matrix structures,\nour method respects the Riemannian geometry of SPD matrices by modelling them\nin a tangent space. We perform dimension reduction in the tangent space,\nrelating the resulting low-dimensional representations with the responses. The\ndimension reduction matrix is learnt in a supervised manner with a\nsparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting.\nOur method yields a parsimonious regression model that allows uncertainty\nquantification of the estimates and identification of key brain regions that\npredict the outcomes. We demonstrate the performance of our approach in\nsimulation settings and through a case study to predict Picture Vocabulary\nscores using data from the Human Connectome Project."}, "http://arxiv.org/abs/2401.16828": {"title": "Simulating signed mixtures", "link": "http://arxiv.org/abs/2401.16828", "description": "Simulating mixtures of distributions with signed weights proves a challenge\nas standard simulation algorithms are inefficient in handling the negative\nweights. In particular, the natural representation of mixture variates as\nassociated with latent component indicators is no longer available. 
We propose\nhere an exact accept-reject algorithm in the general case of finite signed\nmixtures that relies on optimally pairing positive and negative components and\ndesigning a stratified sampling scheme on pairs. We analyze the performance of\nour approach, relative to the inverse cdf approach, since the cdf of the\ndistribution remains available for standard signed mixtures."}, "http://arxiv.org/abs/2401.16943": {"title": "Dynamical System Identification, Model Selection and Model Uncertainty Quantification by Bayesian Inference", "link": "http://arxiv.org/abs/2401.16943", "description": "This study presents a Bayesian maximum \\textit{a~posteriori} (MAP) framework\nfor dynamical system identification from time-series data. This is shown to be\nequivalent to a generalized zeroth-order Tikhonov regularization, providing a\nrational justification for the choice of the residual and regularization terms,\nrespectively, from the negative logarithms of the likelihood and prior\ndistributions. In addition to the estimation of model coefficients, the\nBayesian interpretation gives access to the full apparatus for Bayesian\ninference, including the ranking of models, the quantification of model\nuncertainties and the estimation of unknown (nuisance) hyperparameters. Two\nBayesian algorithms, joint maximum \\textit{a~posteriori} (JMAP) and variational\nBayesian approximation (VBA), are compared to the popular SINDy algorithm for\nthresholded least-squares regression, by application to several dynamical\nsystems with added noise. For multivariate Gaussian likelihood and prior\ndistributions, the Bayesian formulation gives Gaussian posterior and evidence\ndistributions, in which the numerator terms can be expressed in terms of the\nMahalanobis distance or ``Gaussian norm'' $||\\vy-\\hat{\\vy}||^2_{M^{-1}} =\n(\\vy-\\hat{\\vy})^\\top {M^{-1}} (\\vy-\\hat{\\vy})$, where $\\vy$ is a vector\nvariable, $\\hat{\\vy}$ is its estimator and $M$ is the covariance matrix. The\nposterior Gaussian norm is shown to provide a robust metric for quantitative\nmodel selection."}, "http://arxiv.org/abs/2401.16954": {"title": "Nonparametric latency estimation for mixture cure models", "link": "http://arxiv.org/abs/2401.16954", "description": "A nonparametric latency estimator for mixture cure models is studied in this\npaper. An i.i.d. representation is obtained, the asymptotic mean squared error\nof the latency estimator is found, and its asymptotic normality is proven. A\nbootstrap bandwidth selection method is introduced and its efficiency is\nevaluated in a simulation study. The proposed methods are applied to a dataset\nof colorectal cancer patients in the University Hospital of A Coru\\~na (CHUAC)."}, "http://arxiv.org/abs/2401.16990": {"title": "Recovering target causal effects from post-exposure selection induced by missing outcome data", "link": "http://arxiv.org/abs/2401.16990", "description": "Confounding bias and selection bias are two significant challenges to the\nvalidity of conclusions drawn from applied causal inference. The latter can\narise through informative missingness, wherein relevant information about units\nin the target population is missing, censored, or coarsened due to factors\nrelated to the exposure, the outcome, or their consequences. We extend existing\ngraphical criteria to address selection bias induced by missing outcome data by\nleveraging post-exposure variables. 
We introduce the Sequential Adjustment\nCriteria (SAC), which support recovering causal effects through sequential\nregressions. A refined estimator is further developed by applying Targeted\nMinimum-Loss Estimation (TMLE). Under certain regularity conditions, this\nestimator is multiply-robust, ensuring consistency even in scenarios where the\nInverse Probability Weighting (IPW) and the sequential regressions approaches\nfall short. A simulation exercise featuring various toy scenarios compares the\nrelative bias and robustness of the two proposed solutions against other\nestimators. As a motivating application case, we study the effects of\npharmacological treatment for Attention-Deficit/Hyperactivity Disorder (ADHD)\nupon the scores obtained by diagnosed Norwegian schoolchildren in national\ntests using observational data ($n=9\\,352$). Our findings support the\naccumulated clinical evidence affirming a positive but small effect of\nstimulant medication on school performance. A small positive selection bias was\nidentified, indicating that the treatment effect may be even more modest for\nthose who were exempted from or abstained from the tests."}, "http://arxiv.org/abs/2401.17008": {"title": "A Unified Three-State Model Framework for Analysis of Treatment Crossover in Survival Trials", "link": "http://arxiv.org/abs/2401.17008", "description": "We present a unified three-state model (TSM) framework for evaluating\ntreatment effects in clinical trials in the presence of treatment crossover.\nResearchers have proposed diverse methodologies to estimate the treatment\neffect that would have hypothetically been observed if treatment crossover had\nnot occurred. However, there is little work on understanding the connections\nbetween these different approaches from a statistical point of view. Our\nproposed TSM framework unifies existing methods, effectively identifying\npotential biases, model assumptions, and inherent limitations for each method.\nThis can guide researchers in understanding when these methods are appropriate\nand choosing a suitable approach for their data. The TSM framework also\nfacilitates the creation of new methods to adjust for confounding effects from\ntreatment crossover. To illustrate this capability, we introduce a new\nimputation method that falls under its scope. Using a piecewise constant prior\nfor the hazard, our proposed method directly estimates the hazard function with\nincreased flexibility. Through simulation experiments, we demonstrate the\nperformance of different approaches for estimating the treatment effects."}, "http://arxiv.org/abs/2401.17041": {"title": "Gower's similarity coefficients with automatic weight selection", "link": "http://arxiv.org/abs/2401.17041", "description": "Nearest-neighbor methods have become popular in statistics and play a key\nrole in statistical learning. Important decisions in nearest-neighbor methods\nconcern the variables to use (when many potential candidates exist) and how to\nmeasure the dissimilarity between units. The first decision depends on the\nscope of the application while the second depends mainly on the type of variables.\nUnfortunately, relatively few options permit handling mixed-type variables, a\nsituation frequently encountered in practical applications. The most popular\ndissimilarity for mixed-type variables is derived as the complement to one of\nGower's similarity coefficient. 
It is appealing because it ranges between 0\nand 1, being an average of the scaled dissimilarities calculated variable by\nvariable, handles missing values and allows for a user-defined weighting scheme\nwhen averaging dissimilarities. The discussion on the weighting schemes is\nsometimes misleading since it often ignores that the unweighted \"standard\"\nsetting hides an unbalanced contribution of the single variables to the overall\ndissimilarity. We address this drawback following the recent idea of\nintroducing a weighting scheme that minimizes the differences in the\ncorrelation between each contributing dissimilarity and the resulting weighted\nGower's dissimilarity. In particular, this note proposes different approaches\nfor measuring the correlation depending on the type of variables. The\nperformance of the proposed approaches is evaluated in simulation studies\nrelated to classification and imputation of missing values."}, "http://arxiv.org/abs/2401.17110": {"title": "Nonparametric covariate hypothesis tests for the cure rate in mixture cure models", "link": "http://arxiv.org/abs/2401.17110", "description": "In lifetime data, like cancer studies, there may be long-term survivors, which\nlead to heavy censoring at the end of the follow-up period. Since a standard\nsurvival model is not appropriate to handle these data, a cure model is needed.\nIn the literature, covariate hypothesis tests for cure models are limited to\nparametric and semiparametric methods. We fill this important gap by proposing a\nnonparametric covariate hypothesis test for the probability of cure in mixture\ncure models. A bootstrap method is proposed to approximate the null\ndistribution of the test statistic. The procedure can be applied to any type of\ncovariate, and could be extended to the multivariate setting. Its efficiency is\nevaluated in a Monte Carlo simulation study. Finally, the method is applied to\na colorectal cancer dataset."}, "http://arxiv.org/abs/2401.17143": {"title": "Test for high-dimensional mean vectors via the weighted $L_2$-norm", "link": "http://arxiv.org/abs/2401.17143", "description": "In this paper, we propose a novel approach to test the equality of\nhigh-dimensional mean vectors of several populations via the weighted\n$L_2$-norm. We establish the asymptotic normality of the test statistics under\nthe null hypothesis. We also explain theoretically why our test statistics can\nbe highly useful in weakly dense cases when the nonzero signal in mean vectors\nis present. Furthermore, we compare the proposed test with existing tests using\nsimulation results, demonstrating that the weighted $L_2$-norm-based test\nstatistic exhibits favorable properties in terms of both size and power."}, "http://arxiv.org/abs/2401.17152": {"title": "Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models", "link": "http://arxiv.org/abs/2401.17152", "description": "A completely nonparametric method for the estimation of mixture cure models\nis proposed. A nonparametric estimator of the incidence is extensively studied\nand a nonparametric estimator of the latency is presented. These estimators,\nwhich are based on the Beran estimator of the conditional survival function,\nare proved to be the local maximum likelihood estimators. An i.i.d.\nrepresentation is obtained for the nonparametric incidence estimator. As a\nconsequence, an asymptotically optimal bandwidth is found. 
Moreover, a\nbootstrap bandwidth selection method for the nonparametric incidence estimator\nis proposed. The introduced nonparametric estimators are compared with existing\nsemiparametric approaches in a simulation study, in which the performance of\nthe bootstrap bandwidth selector is also assessed. Finally, the method is\napplied to a database of colorectal cancer from the University Hospital of A\nCoru\\~na (CHUAC)."}, "http://arxiv.org/abs/2401.17249": {"title": "Joint model with latent disease age: overcoming the need for reference time", "link": "http://arxiv.org/abs/2401.17249", "description": "Introduction: Heterogeneity of the progression of neurodegenerative diseases\nis one of the main challenges faced in developing effective therapies. With the\nincreasing number of large clinical databases, disease progression models have\nled to a better understanding of this heterogeneity. Nevertheless, these\ndiseases may have no clear onset and biological underlying processes may start\nbefore the first symptoms. Such an ill-defined disease reference time is an\nissue for current joint models, which have proven their effectiveness by\ncombining longitudinal and survival data. Objective In this work, we propose a\njoint non-linear mixed effect model with a latent disease age, to overcome this\nneed for a precise reference time.\n\nMethod: To do so, we utilized an existing longitudinal model with a latent\ndisease age as a longitudinal sub-model and associated it with a survival\nsub-model that estimates a Weibull distribution from the latent disease age. We\nthen validated our model on different simulated scenarios. Finally, we\nbenchmarked our model with a state-of-the-art joint model and reference\nsurvival and longitudinal models on simulated and real data in the context of\nAmyotrophic Lateral Sclerosis (ALS).\n\nResults: On real data, our model got significantly better results than the\nstate-of-the-art joint model for absolute bias (4.21(4.41) versus\n4.24(4.14)(p-value=1.4e-17)), and mean cumulative AUC for right censored events\n(0.67(0.07) versus 0.61(0.09)(p-value=1.7e-03)).\n\nConclusion: We showed that our approach is better suited than the\nstate-of-the-art in the context where the reference time is not reliable. This\nwork opens up the perspective to design predictive and personalized therapeutic\nstrategies."}, "http://arxiv.org/abs/2204.07105": {"title": "Nonresponse Bias Analysis in Longitudinal Studies: A Comparative Review with an Application to the Early Childhood Longitudinal Study", "link": "http://arxiv.org/abs/2204.07105", "description": "Longitudinal studies are subject to nonresponse when individuals fail to\nprovide data for entire waves or particular questions of the survey. We compare\napproaches to nonresponse bias analysis (NRBA) in longitudinal studies and\nillustrate them on the Early Childhood Longitudinal Study, Kindergarten Class\nof 2010-11 (ECLS-K:2011). Wave nonresponse with attrition often yields a\nmonotone missingness pattern, and the missingness mechanism can be missing at\nrandom (MAR) or missing not at random (MNAR). We discuss weighting, multiple\nimputation (MI), incomplete data modeling, and Bayesian approaches to NRBA for\nmonotone patterns. Weighting adjustments are effective when the constructed\nweights are correlated to the survey outcome of interest. MI allows for\nvariables with missing values to be included in the imputation model, yielding\npotentially less biased and more efficient estimates. 
Multilevel models with\nmaximum likelihood estimation and marginal models estimated using generalized\nestimating equations can also handle incomplete longitudinal data. Bayesian\nmethods introduce prior information and potentially stabilize model estimation.\nWe add offsets in the MAR results to provide sensitivity analyses to assess\nMNAR deviations. We conduct NRBA for descriptive summaries and analytic model\nestimates and find that in the ECLS-K:2011 application, NRBA yields minor\nchanges to the substantive conclusions. The strength of evidence about our NRBA\ndepends on the strength of the relationship between the characteristics in the\nnonresponse adjustment and the key survey outcomes, so the key to a successful\nNRBA is to include strong predictors."}, "http://arxiv.org/abs/2206.01900": {"title": "Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios", "link": "http://arxiv.org/abs/2206.01900", "description": "Evaluation of intervention in a multi-agent system, e.g., when humans should\nintervene in autonomous driving systems and when a player should pass to\nteammates for a good shot, is challenging in various engineering and scientific\nfields. Estimating the individual treatment effect (ITE) using counterfactual\nlong-term prediction is practical to evaluate such interventions. However, most\nof the conventional frameworks did not consider the time-varying complex\nstructure of multi-agent relationships and covariate counterfactual prediction.\nThis may lead to erroneous assessments of ITE and difficulty in interpretation.\nHere we propose an interpretable, counterfactual recurrent network in\nmulti-agent systems to estimate the effect of the intervention. Our model\nleverages graph variational recurrent neural networks and theory-based\ncomputation with domain knowledge for the ITE estimation framework based on\nlong-term prediction of multi-agent covariates and outcomes, which can confirm\nthe circumstances under which the intervention is effective. On simulated\nmodels of an automated vehicle and biological agents with time-varying\nconfounders, we show that our methods achieved lower estimation errors in\ncounterfactual covariates and the most effective treatment timing than the\nbaselines. Furthermore, using real basketball data, our methods performed\nrealistic counterfactual predictions and evaluated the counterfactual passes in\nshot scenarios."}, "http://arxiv.org/abs/2210.06934": {"title": "On the potential benefits of entropic regularization for smoothing Wasserstein estimators", "link": "http://arxiv.org/abs/2210.06934", "description": "This paper is focused on the study of entropic regularization in optimal\ntransport as a smoothing method for Wasserstein estimators, through the prism\nof the classical tradeoff between approximation and estimation errors in\nstatistics. Wasserstein estimators are defined as solutions of variational\nproblems whose objective function involves the use of an optimal transport cost\nbetween probability measures. Such estimators can be regularized by replacing\nthe optimal transport cost by its regularized version using an entropy penalty\non the transport plan. The use of such a regularization has a potentially\nsignificant smoothing effect on the resulting estimators. In this work, we\ninvestigate its potential benefits on the approximation and estimation\nproperties of regularized Wasserstein estimators. 
Our main contribution is to\ndiscuss how entropic regularization may reach, at a lower computational cost,\nstatistical performances that are comparable to those of un-regularized\nWasserstein estimators in statistical learning problems involving\ndistributional data analysis. To this end, we present new theoretical results\non the convergence of regularized Wasserstein estimators. We also study their\nnumerical performances using simulated and real data in the supervised learning\nproblem of proportions estimation in mixture models using optimal transport."}, "http://arxiv.org/abs/2304.09157": {"title": "Neural networks for geospatial data", "link": "http://arxiv.org/abs/2304.09157", "description": "Analysis of geospatial data has traditionally been model-based, with a mean\nmodel, customarily specified as a linear regression on the covariates, and a\ncovariance model, encoding the spatial dependence. We relax the strong\nassumption of linearity and propose embedding neural networks directly within\nthe traditional geostatistical models to accommodate non-linear mean functions\nwhile retaining all other advantages including use of Gaussian Processes to\nexplicitly model the spatial covariance, enabling inference on the covariate\neffect through the mean and on the spatial dependence through the covariance,\nand offering predictions at new locations via kriging. We propose NN-GLS, a new\nneural network estimation algorithm for the non-linear mean in GP models that\nexplicitly accounts for the spatial covariance through generalized least\nsquares (GLS), the same loss used in the linear case. We show that NN-GLS\nadmits a representation as a special type of graph neural network (GNN). This\nconnection facilitates use of standard neural network computational techniques\nfor irregular geospatial data, enabling novel and scalable mini-batching,\nbackpropagation, and kriging schemes. Theoretically, we show that NN-GLS will\nbe consistent for irregularly observed spatially correlated data processes. To\nour knowledge this is the first asymptotic consistency result for any neural\nnetwork algorithm for spatial data. We demonstrate the methodology through\nsimulated and real datasets."}, "http://arxiv.org/abs/2304.14604": {"title": "Deep Neural-network Prior for Orbit Recovery from Method of Moments", "link": "http://arxiv.org/abs/2304.14604", "description": "Orbit recovery problems are a class of problems that often arise in practice\nand various forms. In these problems, we aim to estimate an unknown function\nafter being distorted by a group action and observed via a known operator.\nTypically, the observations are contaminated with a non-trivial level of noise.\nTwo particular orbit recovery problems of interest in this paper are\nmultireference alignment and single-particle cryo-EM modelling. In order to\nsuppress the noise, we suggest using the method of moments approach for both\nproblems while introducing deep neural network priors. In particular, our\nneural networks should output the signals and the distribution of group\nelements, with moments being the input. In the multireference alignment case,\nwe demonstrate the advantage of using the NN to accelerate the convergence for\nthe reconstruction of signals from the moments. 
Finally, we use our method to\nreconstruct simulated and biological volumes in the cryo-EM setting."}, "http://arxiv.org/abs/2306.13538": {"title": "A brief introduction on latent variable based ordinal regression models with an application to survey data", "link": "http://arxiv.org/abs/2306.13538", "description": "The analysis of survey data is a frequently arising issue in clinical trials,\nparticularly when capturing quantities which are difficult to measure. Typical\nexamples are questionnaires about patient's well-being, pain, or consent to an\nintervention. In these, data is captured on a discrete scale containing only a\nlimited number of possible answers, from which the respondent has to pick the\nanswer which fits best his/her personal opinion. This data is generally located\non an ordinal scale as answers can usually be arranged in an ascending order,\ne.g., \"bad\", \"neutral\", \"good\" for well-being. Since responses are usually\nstored numerically for data processing purposes, analysis of survey data using\nordinary linear regression models are commonly applied. However, assumptions of\nthese models are often not met as linear regression requires a constant\nvariability of the response variable and can yield predictions out of the range\nof response categories. By using linear models, one only gains insights about\nthe mean response which may affect representativeness. In contrast, ordinal\nregression models can provide probability estimates for all response categories\nand yield information about the full response scale beyond the mean. In this\nwork, we provide a concise overview of the fundamentals of latent variable\nbased ordinal models, applications to a real data set, and outline the use of\nstate-of-the-art-software for this purpose. Moreover, we discuss strengths,\nlimitations and typical pitfalls. This is a companion work to a current\nvignette-based structured interview study in paediatric anaesthesia."}, "https://arxiv.org/abs/2401.16447": {"title": "The Hubbert diffusion process: Estimation via simulated annealing and variable neighborhood search procedures", "link": "https://arxiv.org/abs/2401.16447", "description": "Accurately charting the progress of oil production is a problem of great current interest. Oil production is widely known to be cyclical: in any given system, after it reaches its peak, a decline will begin. With this in mind, Marion King Hubbert developed his peak theory in 1956 based on the bell-shaped curve that bears his name. In the present work, we consider a stochastic model based on the theory of diffusion processes and associated with the Hubbert curve. The problem of the maximum likelihood estimation of the parameters for this process is also considered. Since a complex system of equations appears, with a solution that cannot be guaranteed by classical numerical procedures, we suggest the use of metaheuristic optimization algorithms such as simulated annealing and variable neighborhood search. Some strategies are suggested for bounding the space of solutions, and a description is provided for the application of the algorithms selected. In the case of the variable neighborhood search algorithm, a hybrid method is proposed in which it is combined with simulated annealing. 
In order to validate the theory developed here, we also carry out some studies based on simulated data and consider 2 real crude oil production scenarios from Norway and Kazakhstan."}, "https://arxiv.org/abs/2401.16567": {"title": "Parallel Affine Transformation Tuning of Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2401.16567", "description": "The performance of Markov chain Monte Carlo samplers strongly depends on the properties of the target distribution such as its covariance structure, the location of its probability mass and its tail behavior. We explore the use of bijective affine transformations of the sample space to improve the properties of the target distribution and thereby the performance of samplers running in the transformed space. In particular, we propose a flexible and user-friendly scheme for adaptively learning the affine transformation during sampling. Moreover, the combination of our scheme with Gibbsian polar slice sampling is shown to produce samples of high quality at comparatively low computational cost in several settings based on real-world data."}, "https://arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "https://arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in precision medicine. Specific interests lie in identifying the differential effect of different treatments based on some external covariates. We propose a novel non-parametric treatment effect estimation method in a multi-treatment setting. Our non-parametric modeling of the response curves relies on radial basis function (RBF)-nets with shared hidden neurons. Our model thus facilitates modeling commonality among the treatment outcomes. The estimation and inference schemes are developed under a Bayesian framework and implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of the analysis. The numerical performance of the method is demonstrated through simulation experiments. Applying our proposed method to MIMIC data, we obtain several interesting findings related to the impact of different treatment strategies on the length of ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "https://arxiv.org/abs/2401.16596": {"title": "PrIsing: Privacy-Preserving Peer Effect Estimation via Ising Model", "link": "https://arxiv.org/abs/2401.16596", "description": "The Ising model, originally developed as a spin-glass model for ferromagnetic elements, has gained popularity as a network-based model for capturing dependencies in agents' outputs. Its increasing adoption in healthcare and the social sciences has raised privacy concerns regarding the confidentiality of agents' responses. In this paper, we present a novel $(\\varepsilon,\\delta)$-differentially private algorithm specifically designed to protect the privacy of individual agents' outcomes. Our algorithm allows for precise estimation of the natural parameter using a single network through an objective perturbation technique. 
Furthermore, we establish regret bounds for this algorithm and assess its performance on synthetic datasets and two real-world networks: one involving HIV status in a social network and the other concerning the political leaning of online blogs."}, "https://arxiv.org/abs/2401.16598": {"title": "Probabilistic Context Neighborhood Model for Lattices", "link": "https://arxiv.org/abs/2401.16598", "description": "We present the Probabilistic Context Neighborhood model designed for two-dimensional lattices as a variation of a Markov Random Field assuming discrete values. In this model, the neighborhood structure has a fixed geometry but a variable order, depending on the neighbors' values. Our model extends the Probabilistic Context Tree model, originally applicable to one-dimensional space. It retains advantageous properties, such as representing the dependence neighborhood structure as a graph in a tree format, facilitating an understanding of model complexity. Furthermore, we adapt the algorithm used to estimate the Probabilistic Context Tree to estimate the parameters of the proposed model. We illustrate the accuracy of our estimation methodology through simulation studies. Additionally, we apply the Probabilistic Context Neighborhood model to spatial real-world data, showcasing its practical utility."}, "https://arxiv.org/abs/2401.16651": {"title": "A constructive approach to selective risk control", "link": "https://arxiv.org/abs/2401.16651", "description": "Many modern applications require the use of data to both select the statistical tasks and make valid inference after selection. In this article, we provide a unifying approach to control for a class of selective risks. Our method is motivated by a reformulation of the celebrated Benjamini-Hochberg (BH) procedure for multiple hypothesis testing as the iterative limit of the Benjamini-Yekutieli (BY) procedure for constructing post-selection confidence intervals. Although several earlier authors have made noteworthy observations related to this, our discussion highlights that (1) the BH procedure is precisely the fixed-point iteration of the BY procedure; (2) the fact that the BH procedure controls the false discovery rate is almost an immediate corollary of the fact that the BY procedure controls the false coverage-statement rate. Building on this observation, we propose a constructive approach to control extra-selection risk (selection made after decision) by iterating decision strategies that control the post-selection risk (decision made after selection), and show that many previous methods and results are special cases of this general framework. We further extend this approach to problems with multiple selective risks and demonstrate how new methods can be developed. 
Our development leads to two surprising results about the BH procedure: (1) in the context of one-sided location testing, the BH procedure not only controls the false discovery rate at the null but also at other locations for free; (2) in the context of permutation tests, the BH procedure with exact permutation p-values can be well approximated by a procedure which only requires a total number of permutations that is almost linear in the total number of hypotheses."}, "https://arxiv.org/abs/2401.16660": {"title": "A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information", "link": "https://arxiv.org/abs/2401.16660", "description": "The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probability analyses of decision models and perform efficient EVSI calculations."}, "https://arxiv.org/abs/2401.16749": {"title": "Bayesian scalar-on-network regression with applications to brain functional connectivity", "link": "https://arxiv.org/abs/2401.16749", "description": "This study presents a Bayesian regression framework to model the relationship between scalar outcomes and brain functional connectivity represented as symmetric positive definite (SPD) matrices. Unlike many proposals that simply vectorize the connectivity predictors thereby ignoring their matrix structures, our method respects the Riemannian geometry of SPD matrices by modelling them in a tangent space. We perform dimension reduction in the tangent space, relating the resulting low-dimensional representations with the responses. The dimension reduction matrix is learnt in a supervised manner with a sparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting. Our method yields a parsimonious regression model that allows uncertainty quantification of the estimates and identification of key brain regions that predict the outcomes. We demonstrate the performance of our approach in simulation settings and through a case study to predict Picture Vocabulary scores using data from the Human Connectome Project."}, "https://arxiv.org/abs/2401.16943": {"title": "Dynamical System Identification, Model Selection and Model Uncertainty Quantification by Bayesian Inference", "link": "https://arxiv.org/abs/2401.16943", "description": "This study presents a Bayesian maximum \\textit{a~posteriori} (MAP) framework for dynamical system identification from time-series data. 
This is shown to be equivalent to a generalized zeroth-order Tikhonov regularization, providing a rational justification for the choice of the residual and regularization terms, respectively, from the negative logarithms of the likelihood and prior distributions. In addition to the estimation of model coefficients, the Bayesian interpretation gives access to the full apparatus for Bayesian inference, including the ranking of models, the quantification of model uncertainties and the estimation of unknown (nuisance) hyperparameters. Two Bayesian algorithms, joint maximum \\textit{a~posteriori} (JMAP) and variational Bayesian approximation (VBA), are compared to the popular SINDy algorithm for thresholded least-squares regression, by application to several dynamical systems with added noise. For multivariate Gaussian likelihood and prior distributions, the Bayesian formulation gives Gaussian posterior and evidence distributions, in which the numerator terms can be expressed in terms of the Mahalanobis distance or ``Gaussian norm'' $||\\vy-\\hat{\\vy}||^2_{M^{-1}} = (\\vy-\\hat{\\vy})^\\top {M^{-1}} (\\vy-\\hat{\\vy})$, where $\\vy$ is a vector variable, $\\hat{\\vy}$ is its estimator and $M$ is the covariance matrix. The posterior Gaussian norm is shown to provide a robust metric for quantitative model selection."}, "https://arxiv.org/abs/2401.16954": {"title": "Nonparametric latency estimation for mixture cure models", "link": "https://arxiv.org/abs/2401.16954", "description": "A nonparametric latency estimator for mixture cure models is studied in this paper. An i.i.d. representation is obtained, the asymptotic mean squared error of the latency estimator is found, and its asymptotic normality is proven. A bootstrap bandwidth selection method is introduced and its efficiency is evaluated in a simulation study. The proposed methods are applied to a dataset of colorectal cancer patients in the University Hospital of A Coru\\~na (CHUAC)."}, "https://arxiv.org/abs/2401.16990": {"title": "Recovering target causal effects from post-exposure selection induced by missing outcome data", "link": "https://arxiv.org/abs/2401.16990", "description": "Confounding bias and selection bias are two significant challenges to the validity of conclusions drawn from applied causal inference. The latter can arise through informative missingness, wherein relevant information about units in the target population is missing, censored, or coarsened due to factors related to the exposure, the outcome, or their consequences. We extend existing graphical criteria to address selection bias induced by missing outcome data by leveraging post-exposure variables. We introduce the Sequential Adjustment Criteria (SAC), which support recovering causal effects through sequential regressions. A refined estimator is further developed by applying Targeted Minimum-Loss Estimation (TMLE). Under certain regularity conditions, this estimator is multiply-robust, ensuring consistency even in scenarios where the Inverse Probability Weighting (IPW) and the sequential regressions approaches fall short. A simulation exercise featuring various toy scenarios compares the relative bias and robustness of the two proposed solutions against other estimators. As a motivating application case, we study the effects of pharmacological treatment for Attention-Deficit/Hyperactivity Disorder (ADHD) upon the scores obtained by diagnosed Norwegian schoolchildren in national tests using observational data ($n=9\\,352$). 
Our findings support the accumulated clinical evidence affirming a positive but small effect of stimulant medication on school performance. A small positive selection bias was identified, indicating that the treatment effect may be even more modest for those exempted from or abstaining from the tests."}, "https://arxiv.org/abs/2401.17008": {"title": "A Unified Three-State Model Framework for Analysis of Treatment Crossover in Survival Trials", "link": "https://arxiv.org/abs/2401.17008", "description": "We present a unified three-state model (TSM) framework for evaluating treatment effects in clinical trials in the presence of treatment crossover. Researchers have proposed diverse methodologies to estimate the treatment effect that would have hypothetically been observed if treatment crossover had not occurred. However, there is little work on understanding the connections between these different approaches from a statistical point of view. Our proposed TSM framework unifies existing methods, effectively identifying potential biases, model assumptions, and inherent limitations for each method. This can guide researchers in understanding when these methods are appropriate and choosing a suitable approach for their data. The TSM framework also facilitates the creation of new methods to adjust for confounding effects from treatment crossover. To illustrate this capability, we introduce a new imputation method that falls under its scope. Using a piecewise constant prior for the hazard, our proposed method directly estimates the hazard function with increased flexibility. Through simulation experiments, we demonstrate the performance of different approaches for estimating the treatment effects."}, "https://arxiv.org/abs/2401.17041": {"title": "Gower's similarity coefficients with automatic weight selection", "link": "https://arxiv.org/abs/2401.17041", "description": "Nearest-neighbor methods have become popular in statistics and play a key role in statistical learning. Important decisions in nearest-neighbor methods concern the variables to use (when many potential candidates exist) and how to measure the dissimilarity between units. The first decision depends on the scope of the application while the second depends mainly on the type of variables. Unfortunately, relatively few options permit handling mixed-type variables, a situation frequently encountered in practical applications. The most popular dissimilarity for mixed-type variables is derived as the complement to one of Gower's similarity coefficient. It is appealing because it ranges between 0 and 1, being an average of the scaled dissimilarities calculated variable by variable, handles missing values and allows for a user-defined weighting scheme when averaging dissimilarities. The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted \"standard\" setting hides an unbalanced contribution of the single variables to the overall dissimilarity. We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity. In particular, this note proposes different approaches for measuring the correlation depending on the type of variables. 
The performances of the proposed approaches are evaluated in simulation studies related to classification and imputation of missing values."}, "https://arxiv.org/abs/2401.17110": {"title": "Nonparametric covariate hypothesis tests for the cure rate in mixture cure models", "link": "https://arxiv.org/abs/2401.17110", "description": "In lifetime data, like cancer studies, there may be long term survivors, which lead to heavy censoring at the end of the follow-up period. Since a standard survival model is not appropriate to handle these data, a cure model is needed. In the literature, covariate hypothesis tests for cure models are limited to parametric and semiparametric methods. We fill this important gap by proposing a nonparametric covariate hypothesis test for the probability of cure in mixture cure models. A bootstrap method is proposed to approximate the null distribution of the test statistic. The procedure can be applied to any type of covariate, and could be extended to the multivariate setting. Its efficiency is evaluated in a Monte Carlo simulation study. Finally, the method is applied to a colorectal cancer dataset."}, "https://arxiv.org/abs/2401.17152": {"title": "Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models", "link": "https://arxiv.org/abs/2401.17152", "description": "A completely nonparametric method for the estimation of mixture cure models is proposed. A nonparametric estimator of the incidence is extensively studied and a nonparametric estimator of the latency is presented. These estimators, which are based on the Beran estimator of the conditional survival function, are proved to be the local maximum likelihood estimators. An i.i.d. representation is obtained for the nonparametric incidence estimator. As a consequence, an asymptotically optimal bandwidth is found. Moreover, a bootstrap bandwidth selection method for the nonparametric incidence estimator is proposed. The introduced nonparametric estimators are compared with existing semiparametric approaches in a simulation study, in which the performance of the bootstrap bandwidth selector is also assessed. Finally, the method is applied to a database of colorectal cancer from the University Hospital of A Coru\~na (CHUAC)."}, "https://arxiv.org/abs/2401.17249": {"title": "Joint model with latent disease age: overcoming the need for reference time", "link": "https://arxiv.org/abs/2401.17249", "description": "Introduction: Heterogeneity of the progression of neurodegenerative diseases is one of the main challenges faced in developing effective therapies. With the increasing number of large clinical databases, disease progression models have led to a better understanding of this heterogeneity. Nevertheless, these diseases may have no clear onset and biological underlying processes may start before the first symptoms. Such an ill-defined disease reference time is an issue for current joint models, which have proven their effectiveness by combining longitudinal and survival data. Objective In this work, we propose a joint non-linear mixed effect model with a latent disease age, to overcome this need for a precise reference time.\n Method: To do so, we utilized an existing longitudinal model with a latent disease age as a longitudinal sub-model and associated it with a survival sub-model that estimates a Weibull distribution from the latent disease age. We then validated our model on different simulated scenarios. 
Finally, we benchmarked our model with a state-of-the-art joint model and reference survival and longitudinal models on simulated and real data in the context of Amyotrophic Lateral Sclerosis (ALS).\n Results: On real data, our model got significantly better results than the state-of-the-art joint model for absolute bias (4.21(4.41) versus 4.24(4.14)(p-value=1.4e-17)), and mean cumulative AUC for right censored events (0.67(0.07) versus 0.61(0.09)(p-value=1.7e-03)).\n Conclusion: We showed that our approach is better suited than the state-of-the-art in the context where the reference time is not reliable. This work opens up the perspective to design predictive and personalized therapeutic strategies."}, "https://arxiv.org/abs/2401.16563": {"title": "Topological Detection of Phenomenological Bifurcations with Unreliable Kernel Densities", "link": "https://arxiv.org/abs/2401.16563", "description": "Phenomenological (P-type) bifurcations are qualitative changes in stochastic dynamical systems whereby the stationary probability density function (PDF) changes its topology. The current state of the art for detecting these bifurcations requires reliable kernel density estimates computed from an ensemble of system realizations. However, in several real world signals such as Big Data, only a single system realization is available -- making it impossible to estimate a reliable kernel density. This study presents an approach for detecting P-type bifurcations using unreliable density estimates. The approach creates an ensemble of objects from Topological Data Analysis (TDA) called persistence diagrams from the system's sole realization and statistically analyzes the resulting set. We compare several methods for replicating the original persistence diagram including Gibbs point process modelling, Pairwise Interaction Point Modelling, and subsampling. We show that for the purpose of predicting a bifurcation, the simple method of subsampling exceeds the other two methods of point process modelling in performance."}, "https://arxiv.org/abs/2401.16828": {"title": "Simulating signed mixtures", "link": "https://arxiv.org/abs/2401.16828", "description": "Simulating mixtures of distributions with signed weights proves a challenge as standard simulation algorithms are inefficient in handling the negative weights. In particular, the natural representation of mixture variates as associated with latent component indicators is no longer available. We propose here an exact accept-reject algorithm in the general case of finite signed mixtures that relies on optimally pairing positive and negative components and designing a stratified sampling scheme on pairs. We analyze the performances of our approach, relative to the inverse cdf approach, since the cdf of the distribution remains available for standard signed mixtures."}, "https://arxiv.org/abs/2401.17143": {"title": "Test for high-dimensional mean vectors via the weighted $L_2$-norm", "link": "https://arxiv.org/abs/2401.17143", "description": "In this paper, we propose a novel approach to test the equality of high-dimensional mean vectors of several populations via the weighted $L_2$-norm. We establish the asymptotic normality of the test statistics under the null hypothesis. We also explain theoretically why our test statistics can be highly useful in weakly dense cases when the nonzero signal in mean vectors is present. 
Furthermore, we compare the proposed test with existing tests using simulation results, demonstrating that the weighted $L_2$-norm-based test statistic exhibits favorable properties in terms of both size and power."}, "https://arxiv.org/abs/2204.07105": {"title": "Nonresponse Bias Analysis in Longitudinal Studies: A Comparative Review with an Application to the Early Childhood Longitudinal Study", "link": "https://arxiv.org/abs/2204.07105", "description": "Longitudinal studies are subject to nonresponse when individuals fail to provide data for entire waves or particular questions of the survey. We compare approaches to nonresponse bias analysis (NRBA) in longitudinal studies and illustrate them on the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011). Wave nonresponse with attrition often yields a monotone missingness pattern, and the missingness mechanism can be missing at random (MAR) or missing not at random (MNAR). We discuss weighting, multiple imputation (MI), incomplete data modeling, and Bayesian approaches to NRBA for monotone patterns. Weighting adjustments are effective when the constructed weights are correlated to the survey outcome of interest. MI allows for variables with missing values to be included in the imputation model, yielding potentially less biased and more efficient estimates. Multilevel models with maximum likelihood estimation and marginal models estimated using generalized estimating equations can also handle incomplete longitudinal data. Bayesian methods introduce prior information and potentially stabilize model estimation. We add offsets in the MAR results to provide sensitivity analyses to assess MNAR deviations. We conduct NRBA for descriptive summaries and analytic model estimates and find that in the ECLS-K:2011 application, NRBA yields minor changes to the substantive conclusions. The strength of evidence about our NRBA depends on the strength of the relationship between the characteristics in the nonresponse adjustment and the key survey outcomes, so the key to a successful NRBA is to include strong predictors."}, "https://arxiv.org/abs/2304.14604": {"title": "Deep Neural-network Prior for Orbit Recovery from Method of Moments", "link": "https://arxiv.org/abs/2304.14604", "description": "Orbit recovery problems are a class of problems that often arise in practice and various forms. In these problems, we aim to estimate an unknown function after being distorted by a group action and observed via a known operator. Typically, the observations are contaminated with a non-trivial level of noise. Two particular orbit recovery problems of interest in this paper are multireference alignment and single-particle cryo-EM modelling. In order to suppress the noise, we suggest using the method of moments approach for both problems while introducing deep neural network priors. In particular, our neural networks should output the signals and the distribution of group elements, with moments being the input. In the multireference alignment case, we demonstrate the advantage of using the NN to accelerate the convergence for the reconstruction of signals from the moments. 
Finally, we use our method to reconstruct simulated and biological volumes in the cryo-EM setting."}, "https://arxiv.org/abs/2306.07119": {"title": "Improving Forecasts for Heterogeneous Time Series by \"Averaging\", with Application to Food Demand Forecast", "link": "https://arxiv.org/abs/2306.07119", "description": "A common forecasting setting in real world applications considers a set of possibly heterogeneous time series of the same domain. Due to different properties of each time series such as length, obtaining forecasts for each individual time series in a straight-forward way is challenging. This paper proposes a general framework utilizing a similarity measure in Dynamic Time Warping to find similar time series to build neighborhoods in a k-Nearest Neighbor fashion, and improve forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostics tools are proposed allowing a deep understanding of the procedure."}, "https://arxiv.org/abs/2306.13538": {"title": "A brief introduction on latent variable based ordinal regression models with an application to survey data", "link": "https://arxiv.org/abs/2306.13538", "description": "The analysis of survey data is a frequently arising issue in clinical trials, particularly when capturing quantities which are difficult to measure. Typical examples are questionnaires about patient's well-being, pain, or consent to an intervention. In these, data is captured on a discrete scale containing only a limited number of possible answers, from which the respondent has to pick the answer which fits best his/her personal opinion. This data is generally located on an ordinal scale as answers can usually be arranged in an ascending order, e.g., \"bad\", \"neutral\", \"good\" for well-being. Since responses are usually stored numerically for data processing purposes, analysis of survey data using ordinary linear regression models are commonly applied. However, assumptions of these models are often not met as linear regression requires a constant variability of the response variable and can yield predictions out of the range of response categories. By using linear models, one only gains insights about the mean response which may affect representativeness. In contrast, ordinal regression models can provide probability estimates for all response categories and yield information about the full response scale beyond the mean. In this work, we provide a concise overview of the fundamentals of latent variable based ordinal models, applications to a real data set, and outline the use of state-of-the-art-software for this purpose. Moreover, we discuss strengths, limitations and typical pitfalls. This is a companion work to a current vignette-based structured interview study in paediatric anaesthesia."}, "https://arxiv.org/abs/2311.09388": {"title": "Synthesis estimators for positivity violations with a continuous covariate", "link": "https://arxiv.org/abs/2311.09388", "description": "Studies intended to estimate the effect of a treatment, like randomized trials, often consist of a biased sample of the desired target population. To correct for this bias, estimates can be transported to the desired target population. Methods for transporting between populations are often premised on a positivity assumption, such that all relevant covariate patterns in one population are also present in the other. 
However, eligibility criteria, particularly in the case of trials, can result in violations of positivity. To address nonpositivity, a synthesis of statistical and mathematical models can be considered. This approach integrates multiple data sources (e.g. trials, observational, pharmacokinetic studies) to estimate treatment effects, leveraging mathematical models to handle positivity violations. This approach was previously demonstrated for positivity violations by a single binary covariate. Here, we extend the synthesis approach for positivity violations with a continuous covariate. For estimation, two novel augmented inverse probability weighting estimators are proposed. Both estimators are contrasted with other common approaches for addressing nonpositivity. Empirical performance is compared via Monte Carlo simulation. Finally, the competing approaches are illustrated with an example in the context of two-drug versus one-drug antiretroviral therapy on CD4 T cell counts among women with HIV."}, "https://arxiv.org/abs/2401.15796": {"title": "High-Dimensional False Discovery Rate Control for Dependent Variables", "link": "https://arxiv.org/abs/2401.15796", "description": "Algorithms that ensure reproducible findings from large-scale, high-dimensional data are pivotal in numerous signal processing applications. In recent years, multivariate false discovery rate (FDR) controlling methods have emerged, providing guarantees even in high-dimensional settings where the number of variables surpasses the number of samples. However, these methods often fail to reliably control the FDR in the presence of highly dependent variable groups, a common characteristic in fields such as genomics and finance. To tackle this critical issue, we introduce a novel framework that accounts for general dependency structures. Our proposed dependency-aware T-Rex selector integrates hierarchical graphical models within the T-Rex framework to effectively harness the dependency structure among variables. Leveraging martingale theory, we prove that our variable penalization mechanism ensures FDR control. We further generalize the FDR-controlling framework by stating and proving a clear condition necessary for designing both graphical and non-graphical models that capture dependencies. Additionally, we formulate a fully integrated optimal calibration algorithm that concurrently determines the parameters of the graphical model and the T-Rex framework, such that the FDR is controlled while maximizing the number of selected variables. Numerical experiments and a breast cancer survival analysis use-case demonstrate that the proposed method is the only one among the state-of-the-art benchmark methods that controls the FDR and reliably detects genes that have been previously identified to be related to breast cancer. An open-source implementation is available within the R package TRexSelector on CRAN."}, "https://arxiv.org/abs/2206.01900": {"title": "Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios", "link": "https://arxiv.org/abs/2206.01900", "description": "Evaluation of intervention in a multi-agent system, e.g., when humans should intervene in autonomous driving systems and when a player should pass to teammates for a good shot, is challenging in various engineering and scientific fields. Estimating the individual treatment effect (ITE) using counterfactual long-term prediction is practical to evaluate such interventions. 
However, most of the conventional frameworks did not consider the time-varying complex structure of multi-agent relationships and covariate counterfactual prediction. This may lead to erroneous assessments of ITE and difficulty in interpretation. Here we propose an interpretable, counterfactual recurrent network in multi-agent systems to estimate the effect of the intervention. Our model leverages graph variational recurrent neural networks and theory-based computation with domain knowledge for the ITE estimation framework based on long-term prediction of multi-agent covariates and outcomes, which can confirm the circumstances under which the intervention is effective. On simulated models of an automated vehicle and biological agents with time-varying confounders, we show that our methods achieved lower estimation errors in counterfactual covariates and the most effective treatment timing than the baselines. Furthermore, using real basketball data, our methods performed realistic counterfactual predictions and evaluated the counterfactual passes in shot scenarios."}, "https://arxiv.org/abs/2210.06934": {"title": "On the potential benefits of entropic regularization for smoothing Wasserstein estimators", "link": "https://arxiv.org/abs/2210.06934", "description": "This paper is focused on the study of entropic regularization in optimal transport as a smoothing method for Wasserstein estimators, through the prism of the classical tradeoff between approximation and estimation errors in statistics. Wasserstein estimators are defined as solutions of variational problems whose objective function involves the use of an optimal transport cost between probability measures. Such estimators can be regularized by replacing the optimal transport cost by its regularized version using an entropy penalty on the transport plan. The use of such a regularization has a potentially significant smoothing effect on the resulting estimators. In this work, we investigate its potential benefits on the approximation and estimation properties of regularized Wasserstein estimators. Our main contribution is to discuss how entropic regularization may reach, at a lower computational cost, statistical performances that are comparable to those of un-regularized Wasserstein estimators in statistical learning problems involving distributional data analysis. To this end, we present new theoretical results on the convergence of regularized Wasserstein estimators. We also study their numerical performances using simulated and real data in the supervised learning problem of proportions estimation in mixture models using optimal transport."}, "https://arxiv.org/abs/2304.09157": {"title": "Neural networks for geospatial data", "link": "https://arxiv.org/abs/2304.09157", "description": "Analysis of geospatial data has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a covariance model, encoding the spatial dependence. We relax the strong assumption of linearity and propose embedding neural networks directly within the traditional geostatistical models to accommodate non-linear mean functions while retaining all other advantages including use of Gaussian Processes to explicitly model the spatial covariance, enabling inference on the covariate effect through the mean and on the spatial dependence through the covariance, and offering predictions at new locations via kriging. 
We propose NN-GLS, a new neural network estimation algorithm for the non-linear mean in GP models that explicitly accounts for the spatial covariance through generalized least squares (GLS), the same loss used in the linear case. We show that NN-GLS admits a representation as a special type of graph neural network (GNN). This connection facilitates use of standard neural network computational techniques for irregular geospatial data, enabling novel and scalable mini-batching, backpropagation, and kriging schemes. Theoretically, we show that NN-GLS will be consistent for irregularly observed spatially correlated data processes. To our knowledge this is the first asymptotic consistency result for any neural network algorithm for spatial data. We demonstrate the methodology through simulated and real datasets."}, "https://arxiv.org/abs/2311.17246": {"title": "Detecting influential observations in single-index models with metric-valued response objects", "link": "https://arxiv.org/abs/2311.17246", "description": "Regression with random data objects is becoming increasingly common in modern data analysis. Unfortunately, like the traditional regression setting with Euclidean data, random response regression is not immune to the trouble caused by unusual observations. A metric Cook's distance extending the classical Cook's distances of Cook (1977) to general metric-valued response objects is proposed. The performance of the metric Cook's distance in both Euclidean and non-Euclidean response regression with Euclidean predictors is demonstrated in an extensive experimental study. A real data analysis of county-level COVID-19 transmission in the United States also illustrates the usefulness of this method in practice."}, "https://arxiv.org/abs/2401.15139": {"title": "FDR-Controlled Portfolio Optimization for Sparse Financial Index Tracking", "link": "https://arxiv.org/abs/2401.15139", "description": "In high-dimensional data analysis, such as financial index tracking or biomedical applications, it is crucial to select the few relevant variables while maintaining control over the false discovery rate (FDR). In these applications, strong dependencies often exist among the variables (e.g., stock returns), which can undermine the FDR control property of existing methods like the model-X knockoff method or the T-Rex selector. To address this issue, we have expanded the T-Rex framework to accommodate overlapping groups of highly correlated variables. This is achieved by integrating a nearest neighbors penalization mechanism into the framework, which provably controls the FDR at the user-defined target level. A real-world example of sparse index tracking demonstrates the proposed method's ability to accurately track the S&P 500 index over the past 20 years based on a small number of stocks. An open-source implementation is provided within the R package TRexSelector on CRAN."}, "https://arxiv.org/abs/2401.17346": {"title": "npcure: An R Package for Nonparametric Inference in Mixture Cure Models", "link": "https://arxiv.org/abs/2401.17346", "description": "Mixture cure models have been widely used to analyze survival data with a cure fraction. They assume that a subgroup of the individuals under study will never experience the event (cured subjects). So, the goal is twofold: to study both the cure probability and the failure time of the uncured individuals through a proper survival function (latency). 
The R package npcure implements a completely nonparametric approach for estimating these functions in mixture cure models, considering right-censored survival times. Nonparametric estimators for the cure probability and the latency as functions of a covariate are provided. Bootstrap bandwidth selectors for the estimators are included. The package also implements a nonparametric covariate significance test for the cure probability, which can be applied with a continuous, discrete, or qualitative covariate."}, "https://arxiv.org/abs/2401.17347": {"title": "Cure models to estimate time until hospitalization due to COVID-19", "link": "https://arxiv.org/abs/2401.17347", "description": "A short introduction to survival analysis and censored data is included in this paper. A thorough literature review in the field of cure models has been done. An overview on the most important and recent approaches on parametric, semiparametric and nonparametric mixture cure models is also included. The main nonparametric and semiparametric approaches were applied to a real time dataset of COVID-19 patients from the first weeks of the epidemic in Galicia (NW Spain). The aim is to model the elapsed time from diagnosis to hospital admission. The main conclusions, as well as the limitations of both the cure models and the dataset, are presented, illustrating the usefulness of cure models in this kind of studies, where the influence of age and sex on the time to hospital admission is shown."}, "https://arxiv.org/abs/2401.17385": {"title": "Causal Analysis of Air Pollution Mixtures: Estimands, Positivity, and Extrapolation", "link": "https://arxiv.org/abs/2401.17385", "description": "Causal inference for air pollution mixtures is an increasingly important issue with appreciable challenges. When the exposure is a multivariate mixture, there are many exposure contrasts that may be of nominal interest for causal effect estimation, but the complex joint mixture distribution often renders observed data extremely limited in their ability to inform estimates of many commonly-defined causal effects. We use potential outcomes to 1) define causal effects of air pollution mixtures, 2) formalize the key assumption of mixture positivity required for estimation and 3) offer diagnostic metrics for positivity violations in the mixture setting that allow researchers to assess the extent to which data can actually support estimation of mixture effects of interest. For settings where there is limited empirical support, we redefine causal estimands that apportion causal effects according to whether they can be directly informed by observed data versus rely entirely on model extrapolation, isolating key sources of information on the causal effect of an air pollution mixture. The ideas are deployed to assess the ability of a national United States data set on the chemical components of ambient particulate matter air pollution to support estimation of a variety of causal mixture effects."}, "https://arxiv.org/abs/2401.17389": {"title": "A review of statistical models used to characterize species-habitat associations with animal movement data", "link": "https://arxiv.org/abs/2401.17389", "description": "Understanding species-habitat associations is fundamental to ecological sciences and for species conservation. Consequently, various statistical approaches have been designed to infer species-habitat associations. Due to their conceptual and mathematical differences, these methods can yield contrasting results. 
In this paper, we describe and compare commonly used statistical models that relate animal movement data to environmental data. Specifically, we examined selection functions which include resource selection function (RSF) and step-selection function (SSF), as well as hidden Markov models (HMMs) and related methods such as state-space models. We demonstrate differences in assumptions of each method while highlighting advantages and limitations. Additionally, we provide guidance on selecting the most appropriate statistical method based on research objectives and intended inference. To demonstrate the varying ecological insights derived from each statistical model, we apply them to the movement track of a single ringed seal in a case study. For example, the RSF indicated selection of areas with high prey diversity, whereas the SSFs indicated no discernable relationship with prey diversity. Furthermore, the HMM reveals variable associations with prey diversity across different behaviors. Notably, the three models identified different important areas. This case study highlights the critical significance of selecting the appropriate model to identify species-habitat relationships and specific areas of importance. Our comprehensive review provides the foundational information required for making informed decisions when choosing the most suitable statistical methods to address specific questions, such as identifying expansive corridors or protected zones, understanding movement patterns, or studying behaviours."}, "https://arxiv.org/abs/2401.17393": {"title": "Estimating the EVSI with Gaussian Approximations and Spline-Based Series Methods", "link": "https://arxiv.org/abs/2401.17393", "description": "Background. The Expected Value of Sample Information (EVSI) measures the expected benefits that could be obtained by collecting additional data. Estimating EVSI using the traditional nested Monte Carlo method is computationally expensive but the recently developed Gaussian approximation (GA) approach can efficiently estimate EVSI across different sample sizes. However, the conventional GA may result in biased EVSI estimates if the decision models are highly nonlinear. This bias may lead to suboptimal study designs when GA is used to optimize the value of different studies. Therefore, we extend the conventional GA approach to improve its performance for nonlinear decision models. Methods. Our method provides accurate EVSI estimates by approximating the conditional benefit based on two steps. First, a Taylor series approximation is applied to estimate the conditional benefit as a function of the conditional moments of the parameters of interest using a spline, which is fitted to the samples of the parameters and the corresponding benefits. Next, the conditional moments of parameters are approximated by the conventional GA and Fisher information. The proposed approach is applied to several data collection exercises involving non-Gaussian parameters and nonlinear decision models. Its performance is compared with the nested Monte Carlo method, the conventional GA approach, and the nonparametric regression-based method for EVSI calculation. Results. The proposed approach provides accurate EVSI estimates across different sample sizes when the parameters of interest are non-Gaussian and the decision models are nonlinear. The computational cost of the proposed method is similar to other novel methods. Conclusions. 
The proposed approach can estimate EVSI across sample sizes accurately and efficiently, which may support researchers in determining an economically optimal study design using EVSI."}, "https://arxiv.org/abs/2401.17430": {"title": "Modeling of spatial extremes in environmental data science: Time to move away from max-stable processes", "link": "https://arxiv.org/abs/2401.17430", "description": "Environmental data science for spatial extremes has traditionally relied heavily on max-stable processes. Even though the popularity of these models has perhaps peaked with statisticians, they are still perceived and considered as the `state-of-the-art' in many applied fields. However, while the asymptotic theory supporting the use of max-stable processes is mathematically rigorous and comprehensive, we think that it has also been overused, if not misused, in environmental applications, to the detriment of more purposeful and meticulously validated models. In this paper, we review the main limitations of max-stable process models, and strongly argue against their systematic use in environmental studies. Alternative solutions based on more flexible frameworks using the exceedances of variables above appropriately chosen high thresholds are discussed, and an outlook on future research is given, highlighting recommendations moving forward and the opportunities offered by hybridizing machine learning with extreme-value statistics."}, "https://arxiv.org/abs/2401.17452": {"title": "Group-Weighted Conformal Prediction", "link": "https://arxiv.org/abs/2401.17452", "description": "Conformal prediction (CP) is a method for constructing a prediction interval around the output of a fitted model, whose validity does not rely on the model being correct--the CP interval offers a coverage guarantee that is distribution-free, but relies on the training data being drawn from the same distribution as the test data. A recent variant, weighted conformal prediction (WCP), reweights the method to allow for covariate shift between the training and test distributions. However, WCP requires knowledge of the nature of the covariate shift, specifically the likelihood ratio between the test and training covariate distributions. In practice, since this likelihood ratio is estimated rather than known exactly, the coverage guarantee may degrade due to the estimation error. In this paper, we consider a special scenario where observations belong to a finite number of groups, and these groups determine the covariate shift between the training and test distributions; for instance, this may arise if the training set is collected via stratified sampling. Our results demonstrate that in this special case, the predictive coverage guarantees of WCP can be drastically improved beyond the bounds given by existing estimation error bounds."}, "https://arxiv.org/abs/2401.17473": {"title": "Adaptive Matrix Change Point Detection: Leveraging Structured Mean Shifts", "link": "https://arxiv.org/abs/2401.17473", "description": "In high-dimensional time series, the component processes are often assembled into a matrix to display their interrelationship. We focus on detecting mean shifts with unknown change point locations in these matrix time series. Series that are activated by a change may cluster along certain rows (columns), which forms mode-specific change point alignment. Leveraging mode-specific change point alignments may substantially enhance the power for change point detection.
Yet, there may be no mode-specific alignments in the change point structure. We propose a test that is powerful for detecting mode-specific change points, yet robust to non-mode-specific changes. We show the validity of using the multiplier bootstrap to compute the p-value of the proposed methods, and derive non-asymptotic bounds on the size and power of the tests. We also propose a parallel bootstrap, a computationally efficient approach for computing the p-value of the proposed adaptive test. In particular, we show the consistency of the proposed test under mild regularity conditions. To obtain the theoretical results, we derive new, sharp bounds on Gaussian approximation and multiplier bootstrap approximation, which are of independent interest for high dimensional problems with diverging sparsity."}, "https://arxiv.org/abs/2401.17518": {"title": "Model Uncertainty and Selection of Risk Models for Left-Truncated and Right-Censored Loss Data", "link": "https://arxiv.org/abs/2401.17518", "description": "Insurance loss data are usually in the form of left-truncation and right-censoring due to deductibles and policy limits, respectively. This paper investigates the model uncertainty and selection procedure when various parametric models are constructed to accommodate such left-truncated and right-censored data. The joint asymptotic properties of the estimators have been established using the Delta method along with Maximum Likelihood Estimation when the model is specified. We conduct simulation studies using Fisk, Lognormal, Lomax, Paralogistic, and Weibull distributions with various proportions of loss data below deductibles and above policy limits. A variety of graphic tools, hypothesis tests, and penalized likelihood criteria are employed to validate the models, and their performance on model selection is evaluated through the probability of each parent distribution being correctly selected. The effectiveness of each tool on model selection is also illustrated using well-studied data that represent Wisconsin property losses in the United States from 2007 to 2010."}, "https://arxiv.org/abs/2401.17646": {"title": "From Sparse to Dense Functional Data: Phase Transitions from a Simultaneous Inference Perspective", "link": "https://arxiv.org/abs/2401.17646", "description": "We aim to develop simultaneous inference tools for the mean function of functional data from sparse to dense. First, we derive a unified Gaussian approximation to construct simultaneous confidence bands of mean functions based on the B-spline estimator. Then, we investigate the conditions of phase transitions by decomposing the asymptotic variance of the approximated Gaussian process. As an extension, we also consider the orthogonal series estimator and show the corresponding conditions of phase transitions. Extensive simulation results strongly corroborate the theoretical results, and also illustrate the variation of the asymptotic distribution via the asymptotic variance decomposition we obtain. The developed method is further applied to body fat data and traffic data."}, "https://arxiv.org/abs/2401.17735": {"title": "The impact of coarsening an exposure on partial identifiability in instrumental variable settings", "link": "https://arxiv.org/abs/2401.17735", "description": "In instrumental variable (IV) settings, such as in imperfect randomized trials and observational studies with Mendelian randomization, one may encounter a continuous exposure, the causal effect of which is not of true interest.
Instead, scientific interest may lie in a coarsened version of this exposure. Although there is a lengthy literature on the impact of coarsening an exposure, with several works focusing specifically on IV settings, all methods proposed in this literature require parametric assumptions. Instead, just as in the standard IV setting, one can consider partial identification via bounds making no parametric assumptions. This was first pointed out in Alexander Balke's PhD dissertation. We extend and clarify his work and derive novel bounds in several settings, including for a three-level IV, which will most likely be the case in Mendelian randomization. We demonstrate our findings in two real data examples, a randomized trial for peanut allergy in infants and a Mendelian randomization setting investigating the effect of homocysteine on cardiovascular disease."}, "https://arxiv.org/abs/2401.17737": {"title": "Hierarchical Bias-Driven Stratification for Interpretable Causal Effect Estimation", "link": "https://arxiv.org/abs/2401.17737", "description": "Interpretability and transparency are essential for incorporating causal effect models from observational data into policy decision-making. They can provide trust for the model in the absence of ground truth labels to evaluate the accuracy of such models. To date, attempts at transparent causal effect estimation consist of applying post hoc explanation methods to black-box models, which are not interpretable. Here, we present BICauseTree: an interpretable balancing method that identifies clusters where natural experiments occur locally. Our approach builds on decision trees with a customized objective function to improve balancing and reduce treatment allocation bias. Consequently, it can additionally detect subgroups presenting positivity violations, exclude them, and provide a covariate-based definition of the target population we can infer from and generalize to. We evaluate the method's performance using synthetic and realistic datasets, explore its bias-interpretability tradeoff, and show that it is comparable with existing approaches."}, "https://arxiv.org/abs/2401.17754": {"title": "Nonparametric estimation of circular trend surfaces with application to wave directions", "link": "https://arxiv.org/abs/2401.17754", "description": "In oceanography, modeling wave fields requires the use of statistical tools capable of handling the circular nature of the data measurements. An important issue in ocean wave analysis is the study of wave height and direction, with direction values recorded as angles or, equivalently, as points on the unit circle. Hence, reconstruction of a wave direction field on the sea surface can be approached by the use of a linear-circular regression model, viewing wave directions as a realization of a circular spatial process whose trend should be estimated. In this paper, we consider a spatial regression model with a circular response and several real-valued predictors. Nonparametric estimators of the circular trend surface are proposed, accounting for the (unknown) spatial correlation. Some asymptotic results about these estimators as well as some guidelines for their practical implementation are also given. The performance of the proposed estimators is investigated in a simulation study.
An application to wave directions in the Adriatic Sea is provided for illustration."}, "https://arxiv.org/abs/2401.17770": {"title": "Nonparametric geostatistical risk mapping", "link": "https://arxiv.org/abs/2401.17770", "description": "In this work, a fully nonparametric geostatistical approach to estimate threshold exceedance probabilities is proposed. To estimate the large-scale variability (spatial trend) of the process, the nonparametric local linear regression estimator, with the bandwidth selected by a method that takes the spatial dependence into account, is used. A bias-corrected nonparametric estimator of the variogram, obtained from the nonparametric residuals, is proposed to estimate the small-scale variability. Finally, a bootstrap algorithm is designed to estimate the unconditional probabilities of exceeding a threshold value at any location. The behavior of this approach is evaluated through simulation and with an application to a real data set."}, "https://arxiv.org/abs/2401.17966": {"title": "XGBoostPP: Tree-based Estimation of Point Process Intensity Functions", "link": "https://arxiv.org/abs/2401.17966", "description": "We propose a novel tree-based ensemble method, named XGBoostPP, to nonparametrically estimate the intensity of a point process as a function of covariates. It extends the use of gradient-boosted regression trees (Chen & Guestrin, 2016) to the point process literature via two carefully designed loss functions. The first loss is based on the Poisson likelihood, working for general point processes. The second loss is based on the weighted Poisson likelihood, where spatially dependent weights are introduced to further improve the estimation efficiency for clustered processes. An efficient greedy search algorithm is developed for model estimation, and the effectiveness of the proposed method is demonstrated through extensive simulation studies and two real data analyses. In particular, we report that XGBoostPP achieves superior performance to existing approaches when the dimension of the covariate space is high, revealing the advantages of tree-based ensemble methods in estimating complex intensity functions."}, "https://arxiv.org/abs/2401.17987": {"title": "Bagging cross-validated bandwidths with application to Big Data", "link": "https://arxiv.org/abs/2401.17987", "description": "Hall and Robinson (2009) proposed and analyzed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise inherent in ordinary cross-validation, and hence leads to a more efficient bandwidth selector. The asymptotic theory of Hall and Robinson (2009) assumes that $N$, the number of bagged subsamples, is $\\infty$. We expand upon their theoretical results by allowing $N$ to be finite, as it is in practice. Our results indicate an important difference in the rate of convergence of the bagged cross-validation bandwidth for the cases $N=\\infty$ and $N<\\infty$. Simulations quantify the improvement in statistical efficiency and computational speed that can result from using bagged cross-validation as opposed to a binned implementation of ordinary cross-validation. The performance of the bagged bandwidth is also illustrated on a real, very large data set.
Finally, a byproduct of our study is the correction of errors appearing in the Hall and Robinson (2009) expression for the asymptotic mean squared error of the bagging selector."}, "https://arxiv.org/abs/2401.17993": {"title": "Robust Inference for Generalized Linear Mixed Models: An Approach Based on Score Sign Flipping", "link": "https://arxiv.org/abs/2401.17993", "description": "Despite the versatility of generalized linear mixed models in handling complex experimental designs, they often suffer from misspecification and convergence problems. This makes inference on the values of coefficients problematic. To address these challenges, we propose a robust extension of the score-based statistical test using sign-flipping transformations. Our approach efficiently handles within-variance structure and heteroscedasticity, ensuring accurate regression coefficient testing. The approach is illustrated by analyzing the reduction of health issues over time for newly adopted children. The model is characterized by a binomial response with unbalanced frequencies and several categorical and continuous predictors. The proposed approach efficiently deals with critical problems related to longitudinal nonlinear models, surpassing common statistical approaches such as generalized estimating equations and generalized linear mixed models."}, "https://arxiv.org/abs/2401.18014": {"title": "Bayesian regularization for flexible baseline hazard functions in Cox survival models", "link": "https://arxiv.org/abs/2401.18014", "description": "Fully Bayesian methods for Cox models specify a model for the baseline hazard function. Parametric approaches generally provide monotone estimations. Semi-parametric choices allow for more flexible patterns but they can suffer from overfitting and instability. Regularization methods through prior distributions with correlated structures usually give reasonable answers to these types of situations. We discuss Bayesian regularization for Cox survival models defined via flexible baseline hazards specified by a mixture of piecewise constant functions and by a cubic B-spline function. For those \"semiparametric\" proposals, different prior scenarios ranging from prior independence to particular correlated structures are discussed in a real study with micro-virulence data and in an extensive simulation scenario that includes different data sample and time axis partition sizes in order to capture risk variations. The posterior distribution of the parameters was approximated using Markov chain Monte Carlo methods. Model selection was performed in accordance with the Deviance Information Criteria and the Log Pseudo-Marginal Likelihood. The results obtained reveal that, in general, Cox models present great robustness in covariate effects and survival estimates independent of the baseline hazard specification. In relation to the \"semi-parametric\" baseline hazard specification, the B-splines hazard function is less dependent on the regularization process than the piecewise specification because it demands a smaller time axis partition to estimate a similar behaviour of the risk."}, "https://arxiv.org/abs/2401.18023": {"title": "A cost-sensitive constrained Lasso", "link": "https://arxiv.org/abs/2401.18023", "description": "The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. 
Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the prediction accuracy for certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered."}, "https://arxiv.org/abs/2401.16667": {"title": "Sharp variance estimator and causal bootstrap in stratified randomized experiments", "link": "https://arxiv.org/abs/2401.16667", "description": "The finite-population asymptotic theory provides a normal approximation for the sampling distribution of the average treatment effect estimator in stratified randomized experiments. The asymptotic variance is often estimated by a Neyman-type conservative variance estimator. However, the variance estimator can be overly conservative, and the asymptotic theory may fail in small samples. To solve these issues, we propose a sharp variance estimator for the difference-in-means estimator weighted by the proportion of stratum sizes in stratified randomized experiments. Furthermore, we propose two causal bootstrap procedures to more accurately approximate the sampling distribution of the weighted difference-in-means estimator. The first causal bootstrap procedure is based on rank-preserving imputation, and we show that it has a second-order refinement over the normal approximation. The second causal bootstrap procedure is based on sharp null imputation and is applicable in paired experiments. Our analysis is randomization-based or design-based by conditioning on the potential outcomes, with treatment assignment being the sole source of randomness. Numerical studies and real data analyses demonstrate the advantages of our proposed methods in finite samples."}, "https://arxiv.org/abs/2401.17334": {"title": "Efficient estimation of parameters in marginals in semiparametric multivariate models", "link": "https://arxiv.org/abs/2401.17334", "description": "We consider a general multivariate model where univariate marginal distributions are known up to a parameter vector and we are interested in estimating that parameter vector without specifying the joint distribution, except for the marginals. If we assume independence between the marginals and maximize the resulting quasi-likelihood, we obtain a consistent but inefficient QMLE estimator. If we assume a parametric copula (other than independence), we obtain a full MLE, which is efficient but only under a correct copula specification and may be biased if the copula is misspecified. Instead, we propose a sieve MLE estimator (SMLE), which improves over QMLE but does not have the drawbacks of full MLE. We model the unknown part of the joint distribution using the Bernstein-Kantorovich polynomial copula and assess the resulting improvement over QMLE and over misspecified FMLE in terms of relative efficiency and robustness.
We derive the asymptotic distribution of the new estimator and show that it reaches the relevant semiparametric efficiency bound. Simulations suggest that the sieve MLE can be almost as efficient as FMLE relative to QMLE provided there is enough dependence between the marginals. We demonstrate practical value of the new estimator with several applications. First, we apply SMLE in an insurance context where we build a flexible semi-parametric claim loss model for a scenario where one of the variables is censored. As in simulations, the use of SMLE leads to tighter parameter estimates. Next, we consider financial risk management examples and show how the use of SMLE leads to superior Value-at-Risk predictions. The paper comes with an online archive which contains all codes and datasets."}, "https://arxiv.org/abs/2401.17422": {"title": "River flow modelling using nonparametric functional data analysis", "link": "https://arxiv.org/abs/2401.17422", "description": "Time series and extreme value analyses are two statistical approaches usually applied to study hydrological data. Classical techniques, such as ARIMA models (in the case of mean flow predictions), and parametric generalised extreme value (GEV) fits and nonparametric extreme value methods (in the case of extreme value theory) have been usually employed in this context. In this paper, nonparametric functional data methods are used to perform mean monthly flow predictions and extreme value analysis, which are important for flood risk management. These are powerful tools that take advantage of both, the functional nature of the data under consideration and the flexibility of nonparametric methods, providing more reliable results. Therefore, they can be useful to prevent damage caused by floods and to reduce the likelihood and/or the impact of floods in a specific location. The nonparametric functional approaches are applied to flow samples of two rivers in the U.S. In this way, monthly mean flow is predicted and flow quantiles in the extreme value framework are estimated using the proposed methods. Results show that the nonparametric functional techniques work satisfactorily, generally outperforming the behaviour of classical parametric and nonparametric estimators in both settings."}, "https://arxiv.org/abs/2401.17504": {"title": "CaMU: Disentangling Causal Effects in Deep Model Unlearning", "link": "https://arxiv.org/abs/2401.17504", "description": "Machine unlearning requires removing the information of forgetting data while keeping the necessary information of remaining data. Despite recent advancements in this area, existing methodologies mainly focus on the effect of removing forgetting data without considering the negative impact this can have on the information of the remaining data, resulting in significant performance degradation after data removal. Although some methods try to repair the performance of remaining data after removal, the forgotten information can also return after repair. Such an issue is due to the intricate intertwining of the forgetting and remaining data. Without adequately differentiating the influence of these two kinds of data on the model, existing algorithms take the risk of either inadequate removal of the forgetting data or unnecessary loss of valuable information from the remaining data. To address this shortcoming, the present study undertakes a causal analysis of the unlearning and introduces a novel framework termed Causal Machine Unlearning (CaMU). 
This framework adds an intervention on the information of the remaining data to disentangle the causal effects between the forgetting data and the remaining data. Then CaMU eliminates the causal impact associated with forgetting data while concurrently preserving the causal relevance of the remaining data. Comprehensive empirical results on various datasets and models suggest that CaMU enhances performance on the remaining data and effectively minimizes the influences of forgetting data. Notably, this work is the first to interpret deep model unlearning tasks from a new perspective of causality and provide a solution based on causal analysis, which opens up new possibilities for future research in deep model unlearning."}, "https://arxiv.org/abs/2401.17585": {"title": "Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks", "link": "https://arxiv.org/abs/2401.17585", "description": "Current approaches to knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers six common reasoning schemes in the real world. We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit. We find that all model editing methods show notably low performance on this dataset, especially in certain reasoning schemes. Our analysis of the chain-of-thought generation of edited models further uncovers key reasons behind the inadequacy of existing knowledge editing methods from a reasoning standpoint, involving aspects of fact-wise editing, fact recall ability, and coherence in generation. We will make our benchmark publicly available."}, "https://arxiv.org/abs/2210.00822": {"title": "Combinatorial and algebraic perspectives on the marginal independence structure of Bayesian networks", "link": "https://arxiv.org/abs/2210.00822", "description": "We consider the problem of estimating the marginal independence structure of a Bayesian network from observational data, learning an undirected graph we call the unconditional dependence graph. We show that unconditional dependence graphs of Bayesian networks correspond to the graphs having equal independence and intersection numbers. Using this observation, a Gr\\\"obner basis for a toric ideal associated to unconditional dependence graphs of Bayesian networks is given and then extended by additional binomial relations to connect the space of all such graphs. An MCMC method, called GrUES (Gr\\\"obner-based Unconditional Equivalence Search), is implemented based on the resulting moves and applied to synthetic Gaussian data.
GrUES recovers the true marginal independence structure via a penalized maximum likelihood or MAP estimate at a higher rate than simple independence tests while also yielding an estimate of the posterior, for which the $20\\%$ HPD credible sets include the true structure at a high rate for data-generating graphs with density at least $0.5$."}, "https://arxiv.org/abs/2306.04866": {"title": "Computational methods for fast Bayesian model assessment via calibrated posterior p-values", "link": "https://arxiv.org/abs/2306.04866", "description": "Posterior predictive p-values (ppps) have become popular tools for Bayesian model assessment, being general-purpose and easy to use. However, interpretation can be difficult because their distribution is not uniform under the hypothesis that the model did generate the data. Calibrated ppps (cppps) can be obtained via a bootstrap-like procedure, yet remain unavailable in practice due to high computational cost. This paper introduces methods to enable efficient approximation of cppps and their uncertainty for fast model assessment. We first investigate the computational trade-off between the number of calibration replicates and the number of MCMC samples per replicate. Provided that the MCMC chain from the real data has converged, using short MCMC chains per calibration replicate can save significant computation time compared to naive implementations, without significant loss in accuracy. We propose different variance estimators for the cppp approximation, which can be used to confirm quickly the lack of evidence against model misspecification. As variance estimation uses effective sample sizes of many short MCMC chains, we show these can be approximated well from the real-data MCMC chain. The procedure for cppp is implemented in NIMBLE, a flexible framework for hierarchical modeling that supports many models and discrepancy measures."}, "https://arxiv.org/abs/2306.11267": {"title": "Model-assisted analysis of covariance estimators for stepped wedge cluster randomized experiments", "link": "https://arxiv.org/abs/2306.11267", "description": "Stepped wedge cluster randomized experiments represent a class of unidirectional crossover designs increasingly adopted for comparative effectiveness and implementation science research. Although stepped wedge cluster randomized experiments have become popular, definitions of estimands and robust methods to target clearly-defined estimands remain insufficient. To address this gap, we describe a class of estimands that explicitly acknowledge the multilevel data structure in stepped wedge cluster randomized experiments, and highlight three typical members of the estimand class that are interpretable and are of practical interest. We then introduce four possible formulations of analysis of covariance (ANCOVA) working models to achieve estimand-aligned analyses. By exploiting baseline covariates, each ANCOVA model can potentially improve the estimation efficiency over the unadjusted estimators. In addition, each ANCOVA estimator is model-assisted in the sense that its point estimator is consistent with the target estimand even when the working model is misspecified. Under the stepped wedge randomization scheme, we establish the finite population Central Limit Theorem for each estimator, which motivates design-based variance estimators. Through simulations, we study the finite-sample operating characteristics of the ANCOVA estimators under different data-generating processes. 
We illustrate their applications via the analysis of the Washington State Expedited Partner Therapy study."}, "https://arxiv.org/abs/2401.10869": {"title": "Robust variable selection for partially linear additive models", "link": "https://arxiv.org/abs/2401.10869", "description": "Among semiparametric regression models, partially linear additive models provide a useful tool to include additive nonparametric components as well as a parametric component, when explaining the relationship between the response and a set of explanatory variables. This paper concerns such models under sparsity assumptions for the covariates included in the linear component. Sparse covariates are frequent in regression problems where the task of variable selection is usually of interest. As in other settings, outliers either in the residuals or in the covariates involved in the linear component have a harmful effect. To simultaneously achieve model selection for the parametric component of the model and resistance to outliers, we combine preliminary robust estimators of the additive component with robust linear $MM$-regression estimators that include a penalty, such as SCAD, on the coefficients in the parametric part. Under mild assumptions, consistency results and rates of convergence for the proposed estimators are derived. A Monte Carlo study is carried out to compare, under different models and contamination schemes, the performance of the robust proposal with its classical counterpart. The obtained results show the advantage of using the robust approach. Through the analysis of a real data set, we also illustrate the benefits of the proposed procedure."}, "https://arxiv.org/abs/2209.04193": {"title": "Correcting inferences for volunteer-collected data with geospatial sampling bias", "link": "https://arxiv.org/abs/2209.04193", "description": "Citizen science projects in which volunteers collect data are increasingly popular due to their ability to engage the public with scientific questions. The scientific value of these data is, however, hampered by several biases. In this paper, we deal with geospatial sampling bias by enriching the volunteer-collected data with geographical covariates, and then using regression-based models to correct for bias. We show that night sky brightness estimates change substantially after correction, and that the corrected inferences better represent an external satellite-derived measure of skyglow. We conclude that geospatial bias correction can greatly increase the scientific value of citizen science projects."}, "https://arxiv.org/abs/2302.06935": {"title": "Reference prior for Bayesian estimation of seismic fragility curves", "link": "https://arxiv.org/abs/2302.06935", "description": "One of the central quantities of probabilistic seismic risk assessment studies is the fragility curve, which represents the probability of failure of a mechanical structure conditional on a scalar measure derived from the seismic ground motion. Estimating such curves is a difficult task because, for many structures of interest, few data are available and the data are only binary; i.e., they indicate the state of the structure (failure or non-failure). This framework concerns complex equipment such as electrical devices encountered in industrial installations. To address this challenging framework, a wide range of methods in the literature rely on a parametric log-normal model. Bayesian approaches allow for efficient learning of the model parameters.
However, the choice of the prior distribution has a non-negligible influence on the posterior distribution and, therefore, on any resulting estimate. We propose a thorough study of this parametric Bayesian estimation problem when the data are limited and binary. Using the reference prior theory as a support, we suggest an objective approach for the prior choice. This approach leads to the Jeffreys prior which is explicitly derived for this problem for the first time. The posterior distribution is proven to be proper (i.e., it integrates to unity) with the Jeffreys prior and improper with some classical priors from the literature. The posterior distribution with the Jeffreys prior is also shown to vanish at the boundaries of the parameters domain, so sampling the posterior distribution of the parameters does not produce anomalously small or large values. Therefore, this does not produce degenerate fragility curves such as unit-step functions and the Jeffreys prior leads to robust credibility intervals. The numerical results obtained on two different case studies, including an industrial case, illustrate the theoretical predictions."}, "https://arxiv.org/abs/2307.02096": {"title": "Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo", "link": "https://arxiv.org/abs/2307.02096", "description": "Hamiltonian Monte Carlo (HMC) is a powerful tool for Bayesian statistical inference due to its potential to rapidly explore high dimensional state space, avoiding the random walk behavior typical of many Markov Chain Monte Carlo samplers. The proper choice of the integrator of the Hamiltonian dynamics is key to the efficiency of HMC. It is becoming increasingly clear that multi-stage splitting integrators are a good alternative to the Verlet method, traditionally used in HMC. Here we propose a principled way of finding optimal, problem-specific integration schemes (in terms of the best conservation of energy for harmonic forces/Gaussian targets) within the families of 2- and 3-stage splitting integrators. The method, which we call Adaptive Integration Approach for statistics, or s-AIA, uses a multivariate Gaussian model and simulation data obtained at the HMC burn-in stage to identify a system-specific dimensional stability interval and assigns the most appropriate 2-/3-stage integrator for any user-chosen simulation step size within that interval. s-AIA has been implemented in the in-house software package HaiCS without introducing computational overheads in the simulations. The efficiency of the s-AIA integrators and their impact on the HMC accuracy, sampling performance and convergence are discussed in comparison with known fixed-parameter multi-stage splitting integrators (including Verlet). Numerical experiments on well-known statistical models show that the adaptive schemes reach the best possible performance within the family of 2-, 3-stage splitting schemes."}, "https://arxiv.org/abs/2307.15176": {"title": "RCT Rejection Sampling for Causal Estimation Evaluation", "link": "https://arxiv.org/abs/2307.15176", "description": "Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. 
In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation."}, "https://arxiv.org/abs/2311.11922": {"title": "Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix", "link": "https://arxiv.org/abs/2311.11922", "description": "Surrogate index approaches have recently become a popular method of estimating longer-term impact from shorter-term outcomes. In this paper, we leverage 1098 test arms from 200 A/B tests at Netflix to empirically investigate to what degree decisions made using a surrogate index utilizing 14 days of data would align with those made using direct measurement of day 63 treatment effects. Focusing specifically on linear \"auto-surrogate\" models that utilize the shorter-term observations of the long-term outcome of interest, we find that the statistical inferences that we would draw from using the surrogate index are ~95% consistent with those from directly measuring the long-term treatment effect. Moreover, when we restrict ourselves to the set of tests that would be \"launched\" (i.e., positive and statistically significant) based on the 63-day directly measured treatment effects, we find that relying instead on the surrogate index achieves 79% and 65% recall."}, "https://arxiv.org/abs/2401.15778": {"title": "On the partial autocorrelation function for locally stationary time series: characterization, estimation and inference", "link": "https://arxiv.org/abs/2401.15778", "description": "For stationary time series, it is common to use plots of the partial autocorrelation function (PACF) or PACF-based tests to explore the temporal dependence structure of such processes. To the best of our knowledge, such analogs for non-stationary time series have not been fully established yet. In this paper, we fill this gap for locally stationary time series with short-range dependence. First, we characterize the PACF locally in the time domain and show that the $j$th PACF, denoted as $\\rho_{j}(t)$, decays with $j$ at a rate that adapts to the temporal dependence of the time series $\\{x_{i,n}\\}$. Second, at time $i$, we justify that the PACF $\\rho_j(i/n)$ can be efficiently approximated by the best linear prediction coefficients via the Yule-Walker equations. This allows us to study the PACF via ordinary least squares (OLS) locally.
Third, we show that the PACF is smooth in time for locally stationary time series. We use the sieve method with OLS to estimate $\\rho_j(\\cdot)$ and construct some statistics to test the PACFs and infer the structures of the time series. These tests generalize and modify those used for stationary time series. Finally, a multiplier bootstrap algorithm is proposed for practical implementation, and an $\\mathtt R$ package $\\mathtt {Sie2nts}$ is provided to implement our algorithm. Numerical simulations and real data analysis also confirm the usefulness of our results."}, "https://arxiv.org/abs/2402.00154": {"title": "Penalized G-estimation for effect modifier selection in the structural nested mean models for repeated outcomes", "link": "https://arxiv.org/abs/2402.00154", "description": "Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates, known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are a priori unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\\'e de Montr\\'eal."}, "https://arxiv.org/abs/2402.00164": {"title": "De-Biased Two-Sample U-Statistics With Application To Conditional Distribution Testing", "link": "https://arxiv.org/abs/2402.00164", "description": "In some high-dimensional and semiparametric inference problems involving two populations, the parameter of interest can be characterized by two-sample U-statistics involving some nuisance parameters. In this work we first extend the framework of one-step estimation with cross-fitting to two-sample U-statistics, showing that using an orthogonalized influence function can effectively remove the first-order bias, resulting in asymptotically normal estimates of the parameter of interest. As an example, we apply this method and theory to the problem of testing two-sample conditional distributions, also known as strong ignorability.
When this method is combined with a conformal-based rank-sum test, we discover that the nuisance parameters can be divided into two categories, where in one category the nuisance estimation accuracy does not affect the testing validity, whereas in the other the nuisance estimation accuracy must satisfy the usual requirement for the test to be valid. We believe these findings provide further insights into and enhance the conformal inference toolbox."}, "https://arxiv.org/abs/2402.00183": {"title": "A review of regularised estimation methods and cross-validation in spatiotemporal statistics", "link": "https://arxiv.org/abs/2402.00183", "description": "This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \\geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g., for the selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix."}, "https://arxiv.org/abs/2402.00202": {"title": "Anytime-Valid Generalized Universal Inference on Risk Minimizers", "link": "https://arxiv.org/abs/2402.00202", "description": "A common goal in statistics and machine learning is estimation of unknowns. Point estimates alone are of little value without an accompanying measure of uncertainty, but traditional uncertainty quantification methods, such as confidence sets and p-values, often require strong distributional or structural assumptions that may not be justified in modern problems. The present paper considers a very common case in machine learning, where the quantity of interest is the minimizer of a given risk (expected loss) function. For such cases, we propose a generalized universal procedure for inference on risk minimizers that features a finite-sample, frequentist validity property under mild distributional assumptions. One version of the proposed procedure is shown to be anytime-valid in the sense that it maintains validity properties regardless of the stopping rule used for the data collection process. We show how this anytime-validity property offers protection against certain factors contributing to the replication crisis in science."}, "https://arxiv.org/abs/2402.00239": {"title": "Publication bias adjustment in network meta-analysis: an inverse probability weighting approach using clinical trial registries", "link": "https://arxiv.org/abs/2402.00239", "description": "Network meta-analysis (NMA) is a useful tool to compare multiple interventions simultaneously in a single meta-analysis, and it can be very helpful for medical decision making when the study aims to find the best therapy among several active candidates. However, the validity of its results is threatened by the publication bias issue.
Existing methods for handling the publication bias issue in standard pairwise meta-analysis are hard to extend to this setting because of the complicated data structure and the underlying assumptions for pooling the data. In this paper, we aim to provide a flexible inverse probability weighting (IPW) framework along with several t-type selection functions to deal with the publication bias problem in the NMA context. To solve these proposed selection functions, we recommend making use of the additional information from the unpublished studies in multiple clinical trial registries. A comprehensive numerical study and a real example showed that our methodology can help obtain more accurate estimates and higher coverage probabilities, and improve other properties of an NMA (e.g., ranking the interventions)."}, "https://arxiv.org/abs/2402.00307": {"title": "Debiased Multivariable Mendelian Randomization", "link": "https://arxiv.org/abs/2402.00307", "description": "Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. However, MVMR faces greater challenges with weak instruments -- genetic variants that are weakly associated with some exposures conditional on the other exposures. This article focuses on MVMR using summary data from genome-wide association studies (GWAS). We provide a new asymptotic regime to analyze MVMR estimators with many weak instruments, allowing for linear combinations of exposures to have different degrees of instrument strength, and formally show that the popular multivariable inverse-variance weighted (MV-IVW) estimator's asymptotic behavior is highly sensitive to instruments' strength. We then propose a multivariable debiased IVW (MV-dIVW) estimator, which effectively reduces the asymptotic bias from weak instruments in MV-IVW, and introduce an adjusted version, MV-adIVW, for improved finite-sample robustness. We establish the theoretical properties of our proposed estimators and extend them to handle balanced horizontal pleiotropy. We conclude by demonstrating the performance of our proposed methods in simulated and real datasets. We implement this method in the R package mr.divw."}, "https://arxiv.org/abs/2402.00335": {"title": "Regression-Based Proximal Causal Inference", "link": "https://arxiv.org/abs/2402.00335", "description": "In observational studies, identification of causal effects is threatened by the potential for unmeasured confounding. Negative controls have become widely used to evaluate the presence of potential unmeasured confounding, thus enhancing the credibility of reported causal effect estimates. Going beyond simply testing for residual confounding, proximal causal inference (PCI) was recently developed to debias causal effect estimates subject to confounding by hidden factors, by leveraging a pair of negative control variables, also known as treatment and outcome confounding proxies. While formal statistical inference has been developed for PCI, these methods can be challenging to implement in practice as they involve solving complex integral equations that are typically ill-posed.
In this paper, we develop a regression-based PCI approach, employing a two-stage regression via familiar generalized linear models to implement the PCI framework, which completely obviates the need to solve difficult integral equations. In the first stage, one fits a generalized linear model (GLM) for the outcome confounding proxy in terms of the treatment confounding proxy and the primary treatment. In the second stage, one fits a GLM for the primary outcome in terms of the primary treatment, using the predicted value of the first-stage regression model as a regressor, which, as we establish, accounts for any residual confounding for which the proxies are relevant. The proposed approach has merit in that (i) it is applicable to continuous, count, and binary outcomes, making it relevant to a wide range of real-world applications, and (ii) it is easy to implement using off-the-shelf software for GLMs. We establish the statistical properties of regression-based PCI and illustrate its performance in both synthetic and real-world empirical applications."}, "https://arxiv.org/abs/2402.00512": {"title": "A goodness-of-fit test for regression models with spatially correlated errors", "link": "https://arxiv.org/abs/2402.00512", "description": "The problem of assessing a parametric regression model in the presence of spatial correlation is addressed in this work. For that purpose, a goodness-of-fit test based on an $L_2$-distance comparing a parametric and a nonparametric regression estimator is proposed. Asymptotic properties of the test statistic, both under the null hypothesis and under local alternatives, are derived. Additionally, a bootstrap procedure is designed to calibrate the test in practice. Finite sample performance of the test is analyzed through a simulation study, and its applicability is illustrated using a real data example."}, "https://arxiv.org/abs/2402.00597": {"title": "An efficient multivariate volatility model for many assets", "link": "https://arxiv.org/abs/2402.00597", "description": "This paper develops a flexible and computationally efficient multivariate volatility model, which allows for dynamic conditional correlations and volatility spillover effects among financial assets. The new model has desirable properties such as identifiability and computational tractability for many assets. A sufficient condition for strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new model with and without low-rank constraints on the coefficient matrices, respectively, and the asymptotic properties for both estimators are established. Moreover, a Bayesian information criterion with selection consistency is developed for order selection, and the testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for small and moderate dimensions. The usefulness of the new model and its inference tools is illustrated by two empirical examples for 5 stock markets and 17 industry portfolios, respectively."}, "https://arxiv.org/abs/2402.00610": {"title": "Multivariate ordinal regression for multiple repeated measurements", "link": "https://arxiv.org/abs/2402.00610", "description": "In this paper we propose a multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects.
This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, where we distinguish between the correlations at a single point in time and the persistence over time. The error distribution is assumed to be normal or Student t distributed. The estimation is performed using composite likelihood methods. We perform several simulation exercises to investigate the quality of the estimates in different settings as well as in comparison with a Bayesian approach. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. We also introduce R package mvordflex and illustrate how this implementation can be used to estimate the proposed model in a user-friendly, convenient way. Finally, we illustrate the framework on a data set containing firm failure and credit ratings information from the rating agencies S&P and Moody's for US listed companies."}, "https://arxiv.org/abs/2402.00651": {"title": "Skew-elliptical copula based mixed models for non-Gaussian longitudinal data with application to HIV-AIDS study", "link": "https://arxiv.org/abs/2402.00651", "description": "This work has been motivated by a longitudinal data set on HIV CD4 T+ cell counts from Livingstone district, Zambia. The corresponding histogram plots indicate lack of symmetry in the marginal distributions and the pairwise scatter plots show non-elliptical dependence patterns. The standard linear mixed model for longitudinal data fails to capture these features. Thus it seems appropriate to consider a more general framework for modeling such data. In this article, we consider generalized linear mixed models (GLMM) for the marginals (e.g. Gamma mixed model), and temporal dependency of the repeated measurements is modeled by the copula corresponding to some skew-elliptical distributions (like skew-normal/skew-t). Our proposed class of copula based mixed models simultaneously takes into account asymmetry, between-subject variability and non-standard temporal dependence, and hence can be considered extensions to the standard linear mixed model based on multivariate normality. We estimate the model parameters using the IFM (inference function of margins) method, and also describe how to obtain standard errors of the parameter estimates. We investigate the finite sample performance of our procedure with extensive simulation studies involving skewed and symmetric marginal distributions and several choices of the copula. We finally apply our models to the HIV data set and report the findings."}, "https://arxiv.org/abs/2402.00668": {"title": "Factor copula models for non-Gaussian longitudinal data", "link": "https://arxiv.org/abs/2402.00668", "description": "This article presents factor copula approaches to model temporal dependency of non- Gaussian (continuous/discrete) longitudinal data. Factor copula models are canonical vine copulas which explain the underlying dependence structure of a multivariate data through latent variables, and therefore can be easily interpreted and implemented to unbalanced longitudinal data. We develop regression models for continuous, binary and ordinal longitudinal data including covariates, by using factor copula constructions with subject-specific latent variables. 
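As a toy illustration of the copula-plus-non-Gaussian-margins construction used in the two entries above, the sketch below simulates longitudinal responses with Gamma margins linked through a Gaussian copula with AR(1) temporal correlation. The Gaussian copula is a simplified stand-in for the skew-normal/skew-t copulas the papers actually study, and all parameter values are made up.

```python
# Simplified sketch: Gaussian copula (AR(1) correlation) + Gamma margins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_times, rho = 500, 6, 0.7

# AR(1) correlation matrix governing within-subject temporal dependence.
lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
corr = rho ** lags

# Step 1: correlated latent normals -> uniforms via the probability transform.
z = rng.multivariate_normal(np.zeros(n_times), corr, size=n_subjects)
u = stats.norm.cdf(z)

# Step 2: skewed Gamma margins, with a subject-specific random effect on the scale
# (loosely mimicking a Gamma mixed model for the marginals).
subject_effect = np.exp(rng.normal(0.0, 0.3, size=(n_subjects, 1)))
y = stats.gamma.ppf(u, a=2.0, scale=1.5 * subject_effect)

print(y.shape, y.mean(), stats.skew(y, axis=None))
```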
Considering homogeneous within-subject dependence, our proposed models allow for feasible parametric inference in moderate to high dimensional situations, using a two-stage (IFM) estimation method. We assess the finite sample performance of the proposed models with extensive simulation studies. In the empirical analysis, the proposed models are applied to analyse different longitudinal responses from two real-world data sets. Moreover, we compare the performance of these models with some widely used random effect models using standard model selection techniques and find substantial improvements. Our studies suggest that factor copula models can be good alternatives to random effect models and can provide better insights into the temporal dependency of longitudinal data of arbitrary nature."}, "https://arxiv.org/abs/2402.00778": {"title": "Robust Sufficient Dimension Reduction via $\\alpha$-Distance Covariance", "link": "https://arxiv.org/abs/2402.00778", "description": "We introduce a novel sufficient dimension-reduction (SDR) method that is robust against outliers, using $\\alpha$-distance covariance (dCov) in dimension-reduction problems. Under very mild conditions on the predictors, the central subspace is estimated effectively and in a model-free manner, without estimating the link function, via projection onto the Stiefel manifold. We establish the convergence property of the proposed estimator under some regularity conditions. We compare the performance of our method with existing SDR methods through simulation and real data analysis and show that our algorithm improves computational efficiency and effectiveness."}, "https://arxiv.org/abs/2402.00814": {"title": "Optimal monotone conditional error functions", "link": "https://arxiv.org/abs/2402.00814", "description": "This note presents a method that provides optimal monotone conditional error functions for a large class of adaptive two-stage designs. The method builds on a previously developed general theory for optimal adaptive two-stage designs in which sample sizes are reassessed for a specific conditional power and the goal is to minimize the expected sample size. The previous theory can easily lead to a non-monotone conditional error function, which is highly undesirable for logical reasons and can harm type I error rate control for composite null hypotheses. The method presented here extends the existing theory by introducing intermediate monotonising steps that can easily be implemented."}, "https://arxiv.org/abs/2402.00072": {"title": "Explainable AI for survival analysis: a median-SHAP approach", "link": "https://arxiv.org/abs/2402.00072", "description": "With the adoption of machine learning into routine clinical practice comes the need for Explainable AI methods tailored to medical applications. Shapley values have sparked wide interest for locally explaining models. Here, we demonstrate that their interpretation strongly depends on both the summary statistic and the estimator for it, which in turn define what we identify as an 'anchor point'. 
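A small, self-contained building block for the alpha-distance covariance entry above: the squared sample alpha-dCov computed from doubly centred pairwise-distance matrices. The full SDR method additionally optimises a projection over the Stiefel manifold, which is not shown here; the data and alpha value are illustrative.

```python
# Sample alpha-distance covariance (squared), alpha in (0, 2).
import numpy as np

def alpha_distance_covariance(x, y, alpha=1.0):
    """Squared sample alpha-distance covariance between samples x and y."""
    x = np.atleast_2d(x.T).T   # ensure 2-D arrays of shape (n, p)
    y = np.atleast_2d(y.T).T
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** alpha
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1) ** alpha
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centring
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=300)            # nonlinear dependence
print(alpha_distance_covariance(x, y))                      # clearly positive
print(alpha_distance_covariance(x, rng.normal(size=300)))   # near zero
```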
We show that the convention of using a mean anchor point may generate misleading interpretations for survival analysis and introduce median-SHAP, a method for explaining black-box models predicting individual survival times."}, "https://arxiv.org/abs/2402.00077": {"title": "Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions", "link": "https://arxiv.org/abs/2402.00077", "description": "Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging such multi-institutional sequencing data presents significant challenges. Variations in gene panels result in loss of information when the analysis is conducted on common gene sets. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features to preserve information beyond common genes and maximize the utilization of all available data while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data."}, "https://arxiv.org/abs/2402.00168": {"title": "Continuous Treatment Effects with Surrogate Outcomes", "link": "https://arxiv.org/abs/2402.00168", "description": "In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully-observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. 
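To illustrate the 'anchor point' idea from the median-SHAP entry further above, the toy sketch below computes exact Shapley values for a three-feature model, where the value of a coalition is either the mean or the median of predictions over a background sample. This is a simplified illustration of how the choice of summary statistic changes attributions, not the authors' median-SHAP estimator; the model and data are made up.

```python
# Exact Shapley values with mean- vs median-summarised coalition values.
import numpy as np
from itertools import combinations
from math import comb

def shapley(f, x, background, summary=np.mean):
    d = x.size
    phi = np.zeros(d)
    def value(S):
        Xb = background.copy()
        cols = list(S)
        if cols:
            Xb[:, cols] = x[cols]        # fix coalition features at the point x
        return summary(f(Xb))            # summarise predictions over the background
    for j in range(d):
        for size in range(d):
            for S in combinations([k for k in range(d) if k != j], size):
                w = 1.0 / (d * comb(d - 1, size))     # Shapley kernel weight
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

rng = np.random.default_rng(3)
X = rng.lognormal(size=(1000, 3))                 # skewed background, as with survival times
f = lambda Z: Z[:, 0] + 2.0 * Z[:, 1] * Z[:, 2]   # toy prediction function
x = np.array([1.0, 2.0, 0.5])
print("mean anchor:  ", shapley(f, x, X, np.mean))
print("median anchor:", shapley(f, x, X, np.median))
```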
Extensive simulations show our methods enjoy appealing empirical performance."}, "https://arxiv.org/abs/2402.00358": {"title": "nhppp: Simulating Nonhomogeneous Poisson Point Processes in R", "link": "https://arxiv.org/abs/2402.00358", "description": "We introduce the `nhppp' package for simulating events from one dimensional non-homogeneous Poisson point processes (NHPPPs) in R. Its functions are based on three algorithms that provably sample from a target NHPPP: the time-transformation of a homogeneous Poisson process (of intensity one) via the inverse of the integrated intensity function; the generation of a Poisson number of order statistics from a fixed density function; and the thinning of a majorizing NHPPP via an acceptance-rejection scheme. We present a study of numerical accuracy and time performance of the algorithms and advice on which algorithm to prefer in each situation. Functions available in the package are illustrated with simple reproducible examples."}, "https://arxiv.org/abs/2402.00396": {"title": "Efficient Exploration for LLMs", "link": "https://arxiv.org/abs/2402.00396", "description": "We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles."}, "https://arxiv.org/abs/2402.00715": {"title": "Intent Assurance using LLMs guided by Intent Drift", "link": "https://arxiv.org/abs/2402.00715", "description": "Intent-Based Networking (IBN) presents a paradigm shift for network management, by promising to align intents and business objectives with network operations--in an automated manner. However, its practical realization is challenging: 1) processing intents, i.e., translate, decompose and identify the logic to fulfill the intent, and 2) intent conformance, that is, considering dynamic networks, the logic should be adequately adapted to assure intents. To address the latter, intent assurance is tasked with continuous verification and validation, including taking the necessary actions to align the operational and target states. In this paper, we define an assurance framework that allows us to detect and act when intent drift occurs. To do so, we leverage AI-driven policies, generated by Large Language Models (LLMs) which can quickly learn the necessary in-context requirements, and assist with the fulfillment and assurance of intents."}, "https://arxiv.org/abs/1911.12430": {"title": "Propensity score matching for estimating a marginal hazard ratio", "link": "https://arxiv.org/abs/1911.12430", "description": "Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. 
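Of the three sampling algorithms listed in the nhppp entry above, thinning is the simplest to sketch. The Python version below (the package itself is in R) accepts any intensity bounded by a constant majorizer; the intensity function and horizon are chosen purely for illustration.

```python
# Thinning (acceptance-rejection) sampler for a one-dimensional NHPPP.
import numpy as np

def nhppp_thinning(intensity, t_max, lambda_max, rng):
    """Event times on [0, t_max], thinning a homogeneous process of rate
    lambda_max >= sup intensity on the interval."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lambda_max)            # candidate from the majorizing HPP
        if t > t_max:
            break
        if rng.uniform() <= intensity(t) / lambda_max:    # accept with prob intensity(t)/lambda_max
            times.append(t)
    return np.array(times)

rng = np.random.default_rng(7)
intensity = lambda t: 10.0 * (1.0 + np.sin(t))            # bounded by 20 on [0, 50]
events = nhppp_thinning(intensity, t_max=50.0, lambda_max=20.0, rng=rng)
print(len(events), "events (integrated intensity is roughly 500)")
```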
We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching."}, "https://arxiv.org/abs/2008.08176": {"title": "New Goodness-of-Fit Tests for Time Series Models", "link": "https://arxiv.org/abs/2008.08176", "description": "This article proposes omnibus portmanteau tests for contrasting adequacy of time series models. The test statistics are based on combining the autocorrelation function of the conditional residuals, the autocorrelation function of the conditional squared residuals, and the cross-correlation function between these residuals and their squares. The maximum likelihood estimator is used to derive the asymptotic distribution of the proposed test statistics under a general class of time series models, including ARMA, GARCH, and other nonlinear structures. An extensive Monte Carlo simulation study shows that the proposed tests successfully control the type I error probability and tend to have more power than other competitor tests in many scenarios. Two applications to a set of weekly stock returns for 92 companies from the S&P 500 demonstrate the practical use of the proposed tests."}, "https://arxiv.org/abs/2010.06103": {"title": "Quasi-maximum Likelihood Inference for Linear Double Autoregressive Models", "link": "https://arxiv.org/abs/2010.06103", "description": "This paper investigates the quasi-maximum likelihood inference including estimation, model selection and diagnostic checking for linear double autoregressive (DAR) models, where all asymptotic properties are established under only fractional moment of the observed process. We propose a Gaussian quasi-maximum likelihood estimator (G-QMLE) and an exponential quasi-maximum likelihood estimator (E-QMLE) for the linear DAR model, and establish the consistency and asymptotic normality for both estimators. Based on the G-QMLE and E-QMLE, two Bayesian information criteria are proposed for model selection, and two mixed portmanteau tests are constructed to check the adequacy of fitted models. Moreover, we compare the proposed G-QMLE and E-QMLE with the existing doubly weighted quantile regression estimator in terms of the asymptotic efficiency and numerical performance. Simulation studies illustrate the finite-sample performance of the proposed inference tools, and a real example on the Bitcoin return series shows the usefulness of the proposed inference tools."}, "https://arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "https://arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for detecting localized regions in data sequences which each must contain a change-point in the median, at a prescribed global significance level. RNSP works by fitting the postulated constant model over many regions of the data using a new sign-multiresolution sup-norm-type loss, and greedily identifying the shortest intervals on which the constancy is significantly violated. By working with the signs of the data around fitted model candidates, RNSP fulfils its coverage promises under minimal assumptions, requiring only sign-symmetry and serial independence of the signs of the true residuals. In particular, it permits their heterogeneity and arbitrarily heavy tails. 
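A back-of-the-envelope sketch of the ingredients of the omnibus portmanteau tests described above: Ljung-Box-type sums of the autocorrelations of residuals, of squared residuals, and of their cross-correlations. The exact way these pieces are combined and their joint asymptotic distribution are the paper's contribution and are not reproduced here; the chi-square reference below is only a naive placeholder, and the simulated residuals are a stand-in for residuals from a fitted model.

```python
# Ingredients of a combined portmanteau diagnostic (illustrative only).
import numpy as np
from scipy import stats

def acf(x, max_lag):
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, max_lag + 1)])

def crosscorr(x, y, max_lag):
    x, y = x - x.mean(), y - y.mean()
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    return np.array([np.sum(x[k:] * y[:-k]) / denom for k in range(1, max_lag + 1)])

def ljung_box(r, n):
    lags = np.arange(1, len(r) + 1)
    return n * (n + 2) * np.sum(r ** 2 / (n - lags))

rng = np.random.default_rng(11)
resid = rng.standard_t(df=8, size=2000)                   # stand-in for model residuals
n, m = resid.size, 10
q1 = ljung_box(acf(resid, m), n)                          # serial correlation in residuals
q2 = ljung_box(acf(resid ** 2, m), n)                     # ARCH-type effects in squares
q3 = ljung_box(crosscorr(resid, resid ** 2, m), n)        # leverage-type cross effects
print([float(stats.chi2.sf(q, df=m)) for q in (q1, q2, q3)])   # naive chi-square p-values
```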
The intervals of significance returned by RNSP have a finite-sample character, are unconditional in nature and do not rely on any assumptions on the true signal. Code implementing RNSP is available at https://github.com/pfryz/nsp."}, "https://arxiv.org/abs/2301.10387": {"title": "Mesh-clustered Gaussian process emulator for partial differential equation boundary value problems", "link": "https://arxiv.org/abs/2301.10387", "description": "Partial differential equations (PDEs) have become an essential tool for modeling complex physical systems. Such equations are typically solved numerically via mesh-based methods, such as finite element methods, with solutions over the spatial domain. However, obtaining these solutions is often prohibitively costly, limiting the feasibility of exploring parameters in PDEs. In this paper, we propose an efficient emulator that simultaneously predicts the solutions over the spatial domain, with theoretical justification of its uncertainty quantification. The novelty of the proposed method lies in the incorporation of the mesh node coordinates into the statistical model. In particular, the proposed method segments the mesh nodes into multiple clusters via a Dirichlet process prior and fits Gaussian process models with the same hyperparameters in each of them. Most importantly, by revealing the underlying clustering structures, the proposed method can provide valuable insights into qualitative features of the resulting dynamics that can be used to guide further investigations. Real examples demonstrate that our proposed method has smaller prediction errors than its main competitors, with competitive computation time, and identifies interesting clusters of mesh nodes that possess physical significance, such as satisfying boundary conditions. An R package for the proposed methodology is provided in an open repository."}, "https://arxiv.org/abs/2304.07312": {"title": "Stochastic Actor Oriented Model with Random Effects", "link": "https://arxiv.org/abs/2304.07312", "description": "The stochastic actor oriented model (SAOM) is a method for modelling social interactions and social behaviour over time. It can be used to model drivers of dynamic interactions using both exogenous covariates and endogenous network configurations, but also the co-evolution of behaviour and social interactions. In its standard implementations, it assumes that all individuals have the same interaction evaluation function. This lack of heterogeneity is one of its limitations. The aim of this paper is to extend the inference framework for the SAOM to include random effects, so that the heterogeneity of individuals can be modeled more accurately.\n We decompose the linear evaluation function that models the probability of forming or removing a tie from the network, into a homogeneous fixed part and a random, individual-specific part. We extend the Robbins-Monro algorithm to the estimation of the variance of the random parameters. Our method is applicable to general random effect formulations. We illustrate the method with a random out-degree model and show the parameter estimation of the random components, significance tests and model evaluation. We apply the method to Kapferer's Tailor shop study. 
It is shown that a random out-degree constitutes a serious alternative to including transitivity and higher-order dependency effects."}, "https://arxiv.org/abs/2305.03780": {"title": "Boldness-Recalibration for Binary Event Predictions", "link": "https://arxiv.org/abs/2305.03780", "description": "Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening hockey predictions range from .26-.78 to .10-.91)."}, "https://arxiv.org/abs/2308.10231": {"title": "Static and Dynamic BART for Rank-Order Data", "link": "https://arxiv.org/abs/2308.10231", "description": "Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams."}, "https://arxiv.org/abs/2309.08494": {"title": "Modeling Data Analytic Iteration With Probabilistic Outcome Sets", "link": "https://arxiv.org/abs/2309.08494", "description": "In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. 
In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria--expected information gain and anomaly information gain--to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis."}, "https://arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "https://arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with potential confounding by time trends. Traditional frequentist methods can fail to provide adequate coverage of the intervention's true effect using confidence intervals, whereas Bayesian approaches show potential for better coverage of intervention effects. However, Bayesian methods have seen limited development in SWCRTs. We propose two novel Bayesian hierarchical penalized spline models for SWCRTs. The first model is for SWCRTs involving many clusters and time periods, focusing on immediate intervention effects. To evaluate its efficacy, we compared this model to traditional frequentist methods. We further developed the model to estimate time-varying intervention effects. We conducted a comparative analysis of this Bayesian spline model against an existing Bayesian monotone effect curve model. The proposed models are applied in the Primary Palliative Care for Emergency Medicine stepped wedge trial to evaluate the effectiveness of primary palliative care intervention. Extensive simulations and a real-world application demonstrate the strengths of the proposed Bayesian models. The Bayesian immediate effect model consistently achieves near the frequentist nominal coverage probability for true intervention effect, providing more reliable interval estimations than traditional frequentist models, while maintaining high estimation accuracy. The proposed Bayesian time-varying effect model exhibits advancements over the existing Bayesian monotone effect curve model in terms of improved accuracy and reliability. To the best of our knowledge, this is the first development of Bayesian hierarchical spline modeling for SWCRTs. The proposed models offer an accurate and robust analysis of intervention effects. Their application could lead to effective adjustments in intervention strategies."}, "https://arxiv.org/abs/2401.08224": {"title": "Privacy Preserving Adaptive Experiment Design", "link": "https://arxiv.org/abs/2401.08224", "description": "Adaptive experiment is widely adopted to estimate conditional average treatment effect (CATE) in clinical trials and many other scenarios. 
While the primary goal of the experiment is to maximize estimation accuracy, the imperative of social welfare also makes it crucial to provide treatments with superior outcomes to patients, which is measured by regret in the contextual bandit framework. These two objectives often lead to contrasting optimal allocation mechanisms. Furthermore, privacy concerns arise in clinical scenarios containing sensitive data such as patients' health records. Therefore, it is essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the tradeoff between loss of social welfare and statistical power in contextual bandit experiments. We establish matching upper and lower bounds for the multi-objective optimization problem, and then adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms that still match the lower bound, showing that privacy is \"almost free\". Additionally, we derive the asymptotic normality of the estimator, which is essential for statistical inference and hypothesis testing."}, "https://arxiv.org/abs/2211.01345": {"title": "Generative machine learning methods for multivariate ensemble post-processing", "link": "https://arxiv.org/abs/2211.01345", "description": "Ensemble weather forecasts based on multiple runs of numerical weather prediction models typically show systematic errors and require post-processing to obtain reliable forecasts. Accurately modeling multivariate dependencies is crucial in many practical applications, and various approaches to multivariate post-processing have been proposed where ensemble predictions are first post-processed separately in each margin and multivariate dependencies are then restored via copulas. These two-step methods share common key limitations, in particular the difficulty to include additional predictors in modeling the dependencies. We propose a novel multivariate post-processing method based on generative machine learning to address these challenges. In this new class of nonparametric data-driven distributional regression models, samples from the multivariate forecast distribution are directly obtained as output of a generative neural network. The generative model is trained by optimizing a proper scoring rule which measures the discrepancy between the generated and observed data, conditional on exogenous input variables. Our method does not require parametric assumptions on univariate distributions or multivariate dependencies and allows for incorporating arbitrary predictors. In two case studies on multivariate temperature and wind speed forecasting at weather stations over Germany, our generative model shows significant improvements over state-of-the-art methods and particularly improves the representation of spatial dependencies."}, "https://arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "https://arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this paper we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. 
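As a concrete example of a proper scoring rule of the kind the generative post-processing entry above optimizes, here is the sample energy score for a multivariate ensemble forecast. Whether this particular rule is the one used in that paper is an assumption on our part; the sizes and data below are simulated for illustration.

```python
# Sample energy score for a multivariate ensemble forecast (lower is better).
import numpy as np

def energy_score(samples, obs):
    """Energy score for one observation `obs` (shape (d,)) given an
    ensemble/forecast sample `samples` (shape (m, d))."""
    m = samples.shape[0]
    term1 = np.mean(np.linalg.norm(samples - obs, axis=1))
    pair_diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.sum(np.linalg.norm(pair_diffs, axis=-1)) / (2 * m * m)
    return term1 - term2

rng = np.random.default_rng(5)
obs = rng.normal(size=4)                          # e.g. temperature at 4 stations
sharp = obs + 0.3 * rng.normal(size=(50, 4))      # well-centred ensemble
biased = obs + 2.0 + 0.3 * rng.normal(size=(50, 4))
print(energy_score(sharp, obs), energy_score(biased, obs))   # sharp ensemble scores lower
```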
We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent, heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. Finally, we apply our methodology to real and simulated data, demonstrating some effects not identifiable from the model with discrete community memberships."}, "https://arxiv.org/abs/2212.11880": {"title": "Parameter Inference based on Gaussian Processes Informed by Nonlinear Partial Differential Equations", "link": "https://arxiv.org/abs/2212.11880", "description": "Partial differential equations (PDEs) are widely used for the description of physical and engineering phenomena. Some key parameters involved in PDEs, which represent certain physical properties with important scientific interpretations, are difficult or even impossible to measure directly. Estimating these parameters from noisy and sparse experimental data of related physical quantities is an important task. Many methods for PDE parameter inference involve a large number of evaluations for numerical solutions to PDE through algorithms such as the finite element method, which can be time-consuming, especially for nonlinear PDEs. In this paper, we propose a novel method for the inference of unknown parameters in PDEs, called the PDE-Informed Gaussian Process (PIGP) based parameter inference method. Through modeling the PDE solution as a Gaussian process (GP), we derive the manifold constraints induced by the (linear) PDE structure such that, under the constraints, the GP satisfies the PDE. For nonlinear PDEs, we propose an augmentation method that transforms the nonlinear PDE into an equivalent PDE system linear in all derivatives, which our PIGP-based method can handle. The proposed method can be applied to a broad spectrum of nonlinear PDEs. The PIGP-based method can be applied to multi-dimensional PDE systems and PDE systems with unobserved components. Like conventional Bayesian approaches, the method can provide uncertainty quantification for both the unknown parameters and the PDE solution. The PIGP-based method also completely bypasses the numerical solver for PDEs. The proposed method is demonstrated through several application examples from different areas."}, "https://rss.arxiv.org/abs/2402.00154": {"title": "Penalized G-estimation for effect modifier selection in the structural nested mean models for repeated outcomes", "link": "https://rss.arxiv.org/abs/2402.00154", "description": "Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. 
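The estimation procedure in the tensor mixed-membership entry above composes HOOI (a truncated Tucker/tensor-SVD fit) with a simplex corner-finding step. The sketch below implements only the generic HOOI part with NumPy; the tensor sizes, ranks, noise level, and iteration count are illustrative choices, and the corner-finding step is omitted.

```python
# Higher-order orthogonal iteration (HOOI) for a truncated Tucker fit.
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization of a tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def hooi(T, ranks, n_iter=25):
    # HOSVD initialization: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    for _ in range(n_iter):
        for k in range(T.ndim):
            Y = T
            for j in range(T.ndim):
                if j != k:
                    Y = mode_multiply(Y, U[j].T, j)   # project onto other modes' subspaces
            U[k] = np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, :ranks[k]]
    core = T
    for j in range(T.ndim):
        core = mode_multiply(core, U[j].T, j)
    return core, U

rng = np.random.default_rng(0)
ranks = (3, 3, 3)
G = rng.normal(size=ranks)
U_true = [np.linalg.qr(rng.normal(size=(20, r)))[0] for r in ranks]
T = G
for j, Uj in enumerate(U_true):
    T = mode_multiply(T, Uj, j)                       # low multilinear-rank signal
T_noisy = T + 0.01 * rng.normal(size=T.shape)

core, U_est = hooi(T_noisy, ranks)
T_hat = core
for j, Uj in enumerate(U_est):
    T_hat = mode_multiply(T_hat, Uj, j)
print("relative reconstruction error:", np.linalg.norm(T_hat - T) / np.linalg.norm(T))
```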
Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are \\textit{a priori} unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\\'e de Montr\\'eal."}, "https://rss.arxiv.org/abs/2402.00164": {"title": "De-Biased Two-Sample U-Statistics With Application To Conditional Distribution Testing", "link": "https://rss.arxiv.org/abs/2402.00164", "description": "In some high-dimensional and semiparametric inference problems involving two populations, the parameter of interest can be characterized by two-sample U-statistics involving some nuisance parameters. In this work we first extend the framework of one-step estimation with cross-fitting to two-sample U-statistics, showing that using an orthogonalized influence function can effectively remove the first order bias, resulting in asymptotically normal estimates of the parameter of interest. As an example, we apply this method and theory to the problem of testing two-sample conditional distributions, also known as strong ignorability. When combined with a conformal-based rank-sum test, we discover that the nuisance parameters can be divided into two categories, where in one category the nuisance estimation accuracy does not affect the testing validity, whereas in the other the nuisance estimation accuracy must satisfy the usual requirement for the test to be valid. We believe these findings provide further insights into and enhance the conformal inference toolbox."}, "https://rss.arxiv.org/abs/2402.00183": {"title": "A review of regularised estimation methods and cross-validation in spatiotemporal statistics", "link": "https://rss.arxiv.org/abs/2402.00183", "description": "This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \\geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. 
Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix."}, "https://rss.arxiv.org/abs/2402.00202": {"title": "Anytime-Valid Generalized Universal Inference on Risk Minimizers", "link": "https://rss.arxiv.org/abs/2402.00202", "description": "A common goal in statistics and machine learning is estimation of unknowns. Point estimates alone are of little value without an accompanying measure of uncertainty, but traditional uncertainty quantification methods, such as confidence sets and p-values, often require strong distributional or structural assumptions that may not be justified in modern problems. The present paper considers a very common case in machine learning, where the quantity of interest is the minimizer of a given risk (expected loss) function. For such cases, we propose a generalized universal procedure for inference on risk minimizers that features a finite-sample, frequentist validity property under mild distributional assumptions. One version of the proposed procedure is shown to be anytime-valid in the sense that it maintains validity properties regardless of the stopping rule used for the data collection process. We show how this anytime-validity property offers protection against certain factors contributing to the replication crisis in science."}, "https://rss.arxiv.org/abs/2402.00239": {"title": "Publication bias adjustment in network meta-analysis: an inverse probability weighting approach using clinical trial registries", "link": "https://rss.arxiv.org/abs/2402.00239", "description": "Network meta-analysis (NMA) is a useful tool to compare multiple interventions simultaneously in a single meta-analysis, it can be very helpful for medical decision making when the study aims to find the best therapy among several active candidates. However, the validity of its results is threatened by the publication bias issue. Existing methods to handle the publication bias issue in the standard pairwise meta-analysis are hard to extend to this area with the complicated data structure and the underlying assumptions for pooling the data. In this paper, we aimed to provide a flexible inverse probability weighting (IPW) framework along with several t-type selection functions to deal with the publication bias problem in the NMA context. To solve these proposed selection functions, we recommend making use of the additional information from the unpublished studies from multiple clinical trial registries. A comprehensive numerical study and a real example showed that our methodology can help obtain more accurate estimates and higher coverage probabilities, and improve other properties of an NMA (e.g., ranking the interventions)."}, "https://rss.arxiv.org/abs/2402.00307": {"title": "Debiased Multivariable Mendelian Randomization", "link": "https://rss.arxiv.org/abs/2402.00307", "description": "Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. 
However, MVMR faces greater challenges with weak instruments -- genetic variants that are weakly associated with some exposures conditional on the other exposures. This article focuses on MVMR using summary data from genome-wide association studies (GWAS). We provide a new asymptotic regime to analyze MVMR estimators with many weak instruments, allowing for linear combinations of exposures to have different degrees of instrument strength, and formally show that the popular multivariable inverse-variance weighted (MV-IVW) estimator's asymptotic behavior is highly sensitive to instruments' strength. We then propose a multivariable debiased IVW (MV-dIVW) estimator, which effectively reduces the asymptotic bias from weak instruments in MV-IVW, and introduce an adjusted version, MV-adIVW, for improved finite-sample robustness. We establish the theoretical properties of our proposed estimators and extend them to handle balanced horizontal pleiotropy. We conclude by demonstrating the performance of our proposed methods in simulated and real datasets. We implement this method in the R package mr.divw."}, "https://rss.arxiv.org/abs/2402.00335": {"title": "Regression-Based Proximal Causal Inference", "link": "https://rss.arxiv.org/abs/2402.00335", "description": "In observational studies, identification of causal effects is threatened by the potential for unmeasured confounding. Negative controls have become widely used to evaluate the presence of potential unmeasured confounding thus enhancing credibility of reported causal effect estimates. Going beyond simply testing for residual confounding, proximal causal inference (PCI) was recently developed to debias causal effect estimates subject to confounding by hidden factors, by leveraging a pair of negative control variables, also known as treatment and outcome confounding proxies. While formal statistical inference has been developed for PCI, these methods can be challenging to implement in practice as they involve solving complex integral equations that are typically ill-posed. In this paper, we develop a regression-based PCI approach, employing a two-stage regression via familiar generalized linear models to implement the PCI framework, which completely obviates the need to solve difficult integral equations. In the first stage, one fits a generalized linear model (GLM) for the outcome confounding proxy in terms of the treatment confounding proxy and the primary treatment. In the second stage, one fits a GLM for the primary outcome in terms of the primary treatment, using the predicted value of the first-stage regression model as a regressor which as we establish accounts for any residual confounding for which the proxies are relevant. The proposed approach has merit in that (i) it is applicable to continuous, count, and binary outcomes cases, making it relevant to a wide range of real-world applications, and (ii) it is easy to implement using off-the-shelf software for GLMs. We establish the statistical properties of regression-based PCI and illustrate their performance in both synthetic and real-world empirical applications."}, "https://rss.arxiv.org/abs/2402.00512": {"title": "A goodness-of-fit test for regression models with spatially correlated errors", "link": "https://rss.arxiv.org/abs/2402.00512", "description": "The problem of assessing a parametric regression model in the presence of spatial correlation is addressed in this work. 
For that purpose, a goodness-of-fit test based on a $L_2$-distance comparing a parametric and a nonparametric regression estimators is proposed. Asymptotic properties of the test statistic, both under the null hypothesis and under local alternatives, are derived. Additionally, a bootstrap procedure is designed to calibrate the test in practice. Finite sample performance of the test is analyzed through a simulation study, and its applicability is illustrated using a real data example."}, "https://rss.arxiv.org/abs/2402.00597": {"title": "An efficient multivariate volatility model for many assets", "link": "https://rss.arxiv.org/abs/2402.00597", "description": "This paper develops a flexible and computationally efficient multivariate volatility model, which allows for dynamic conditional correlations and volatility spillover effects among financial assets. The new model has desirable properties such as identifiability and computational tractability for many assets. A sufficient condition of the strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new model with and without low-rank constraints on the coefficient matrices respectively, and the asymptotic properties for both estimators are established. Moreover, a Bayesian information criterion with selection consistency is developed for order selection, and the testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for small and moderate dimensions. The usefulness of the new model and its inference tools is illustrated by two empirical examples for 5 stock markets and 17 industry portfolios, respectively."}, "https://rss.arxiv.org/abs/2402.00610": {"title": "Multivariate ordinal regression for multiple repeated measurements", "link": "https://rss.arxiv.org/abs/2402.00610", "description": "In this paper we propose a multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects. This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, where we distinguish between the correlations at a single point in time and the persistence over time. The error distribution is assumed to be normal or Student t distributed. The estimation is performed using composite likelihood methods. We perform several simulation exercises to investigate the quality of the estimates in different settings as well as in comparison with a Bayesian approach. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. We also introduce R package mvordflex and illustrate how this implementation can be used to estimate the proposed model in a user-friendly, convenient way. Finally, we illustrate the framework on a data set containing firm failure and credit ratings information from the rating agencies S&P and Moody's for US listed companies."}, "https://rss.arxiv.org/abs/2402.00651": {"title": "Skew-elliptical copula based mixed models for non-Gaussian longitudinal data with application to HIV-AIDS study", "link": "https://rss.arxiv.org/abs/2402.00651", "description": "This work has been motivated by a longitudinal data set on HIV CD4 T+ cell counts from Livingstone district, Zambia. 
The corresponding histogram plots indicate lack of symmetry in the marginal distributions and the pairwise scatter plots show non-elliptical dependence patterns. The standard linear mixed model for longitudinal data fails to capture these features. Thus it seems appropriate to consider a more general framework for modeling such data. In this article, we consider generalized linear mixed models (GLMM) for the marginals (e.g. Gamma mixed model), and temporal dependency of the repeated measurements is modeled by the copula corresponding to some skew-elliptical distributions (like skew-normal/skew-t). Our proposed class of copula based mixed models simultaneously takes into account asymmetry, between-subject variability and non-standard temporal dependence, and hence can be considered extensions to the standard linear mixed model based on multivariate normality. We estimate the model parameters using the IFM (inference function of margins) method, and also describe how to obtain standard errors of the parameter estimates. We investigate the finite sample performance of our procedure with extensive simulation studies involving skewed and symmetric marginal distributions and several choices of the copula. We finally apply our models to the HIV data set and report the findings."}, "https://rss.arxiv.org/abs/2402.00668": {"title": "Factor copula models for non-Gaussian longitudinal data", "link": "https://rss.arxiv.org/abs/2402.00668", "description": "This article presents factor copula approaches to model temporal dependency of non- Gaussian (continuous/discrete) longitudinal data. Factor copula models are canonical vine copulas which explain the underlying dependence structure of a multivariate data through latent variables, and therefore can be easily interpreted and implemented to unbalanced longitudinal data. We develop regression models for continuous, binary and ordinal longitudinal data including covariates, by using factor copula constructions with subject-specific latent variables. Considering homogeneous within-subject dependence, our proposed models allow for feasible parametric inference in moderate to high dimensional situations, using two-stage (IFM) estimation method. We assess the finite sample performance of the proposed models with extensive simulation studies. In the empirical analysis, the proposed models are applied for analysing different longitudinal responses of two real world data sets. Moreover, we compare the performances of these models with some widely used random effect models using standard model selection techniques and find substantial improvements. Our studies suggest that factor copula models can be good alternatives to random effect models and can provide better insights to temporal dependency of longitudinal data of arbitrary nature."}, "https://rss.arxiv.org/abs/2402.00778": {"title": "Robust Sufficient Dimension Reduction via $\\alpha$-Distance Covariance", "link": "https://rss.arxiv.org/abs/2402.00778", "description": "We introduce a novel sufficient dimension-reduction (SDR) method which is robust against outliers using $\\alpha$-distance covariance (dCov) in dimension-reduction problems. Under very mild conditions on the predictors, the central subspace is effectively estimated and model-free advantage without estimating link function based on the projection on the Stiefel manifold. We establish the convergence property of the proposed estimation under some regularity conditions. 
We compare the performance of our method with existing SDR methods by simulation and real data analysis and show that our algorithm improves the computational efficiency and effectiveness."}, "https://rss.arxiv.org/abs/2402.00814": {"title": "Optimal monotone conditional error functions", "link": "https://rss.arxiv.org/abs/2402.00814", "description": "This note presents a method that provides optimal monotone conditional error functions for a large class of adaptive two stage designs. The presented method builds on a previously developed general theory for optimal adaptive two stage designs where sample sizes are reassessed for a specific conditional power and the goal is to minimize the expected sample size. The previous theory can easily lead to a non-monotonous conditional error function which is highly undesirable for logical reasons and can harm type I error rate control for composite null hypotheses. The here presented method extends the existing theory by introducing intermediate monotonising steps that can easily be implemented."}, "https://rss.arxiv.org/abs/2402.00072": {"title": "Explainable AI for survival analysis: a median-SHAP approach", "link": "https://rss.arxiv.org/abs/2402.00072", "description": "With the adoption of machine learning into routine clinical practice comes the need for Explainable AI methods tailored to medical applications. Shapley values have sparked wide interest for locally explaining models. Here, we demonstrate their interpretation strongly depends on both the summary statistic and the estimator for it, which in turn define what we identify as an 'anchor point'. We show that the convention of using a mean anchor point may generate misleading interpretations for survival analysis and introduce median-SHAP, a method for explaining black-box models predicting individual survival times."}, "https://rss.arxiv.org/abs/2402.00077": {"title": "Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions", "link": "https://rss.arxiv.org/abs/2402.00077", "description": "Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging such multi-institutional sequencing data presents significant challenges. Variations in gene panels result in loss of information when the analysis is conducted on common gene sets. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features to preserve information beyond common genes and maximize the utilization of all available data while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. 
We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data."}, "https://rss.arxiv.org/abs/2402.00168": {"title": "Continuous Treatment Effects with Surrogate Outcomes", "link": "https://rss.arxiv.org/abs/2402.00168", "description": "In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully-observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance."}, "https://rss.arxiv.org/abs/2402.00358": {"title": "nhppp: Simulating Nonhomogeneous Poisson Point Processes in R", "link": "https://rss.arxiv.org/abs/2402.00358", "description": "We introduce the `nhppp' package for simulating events from one dimensional non-homogeneous Poisson point processes (NHPPPs) in R. Its functions are based on three algorithms that provably sample from a target NHPPP: the time-transformation of a homogeneous Poisson process (of intensity one) via the inverse of the integrated intensity function; the generation of a Poisson number of order statistics from a fixed density function; and the thinning of a majorizing NHPPP via an acceptance-rejection scheme. We present a study of numerical accuracy and time performance of the algorithms and advice on which algorithm to prefer in each situation. Functions available in the package are illustrated with simple reproducible examples."}, "https://rss.arxiv.org/abs/2402.00396": {"title": "Efficient Exploration for LLMs", "link": "https://rss.arxiv.org/abs/2402.00396", "description": "We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles."}, "https://rss.arxiv.org/abs/2402.00715": {"title": "Intent Assurance using LLMs guided by Intent Drift", "link": "https://rss.arxiv.org/abs/2402.00715", "description": "Intent-Based Networking (IBN) presents a paradigm shift for network management, by promising to align intents and business objectives with network operations--in an automated manner. 
However, its practical realization is challenging: 1) intent processing, i.e., translating, decomposing, and identifying the logic needed to fulfill the intent, and 2) intent conformance, that is, adapting the logic to dynamic networks so that intents remain assured. To address the latter, intent assurance is tasked with continuous verification and validation, including taking the necessary actions to align the operational and target states. In this paper, we define an assurance framework that allows us to detect and act when intent drift occurs. To do so, we leverage AI-driven policies generated by Large Language Models (LLMs), which can quickly learn the necessary in-context requirements and assist with the fulfillment and assurance of intents."}, "https://rss.arxiv.org/abs/1911.12430": {"title": "Propensity score matching for estimating a marginal hazard ratio", "link": "https://rss.arxiv.org/abs/1911.12430", "description": "Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching."}, "https://rss.arxiv.org/abs/2008.08176": {"title": "New Goodness-of-Fit Tests for Time Series Models", "link": "https://rss.arxiv.org/abs/2008.08176", "description": "This article proposes omnibus portmanteau tests for contrasting the adequacy of time series models. The test statistics are based on combining the autocorrelation function of the conditional residuals, the autocorrelation function of the conditional squared residuals, and the cross-correlation function between these residuals and their squares. The maximum likelihood estimator is used to derive the asymptotic distribution of the proposed test statistics under a general class of time series models, including ARMA, GARCH, and other nonlinear structures. An extensive Monte Carlo simulation study shows that the proposed tests successfully control the type I error probability and tend to have more power than other competitor tests in many scenarios. Two applications to a set of weekly stock returns for 92 companies from the S&P 500 demonstrate the practical use of the proposed tests."}, "https://rss.arxiv.org/abs/2010.06103": {"title": "Quasi-maximum Likelihood Inference for Linear Double Autoregressive Models", "link": "https://rss.arxiv.org/abs/2010.06103", "description": "This paper investigates quasi-maximum likelihood inference, including estimation, model selection, and diagnostic checking, for linear double autoregressive (DAR) models, where all asymptotic properties are established under only a fractional moment condition on the observed process. We propose a Gaussian quasi-maximum likelihood estimator (G-QMLE) and an exponential quasi-maximum likelihood estimator (E-QMLE) for the linear DAR model, and establish the consistency and asymptotic normality for both estimators. Based on the G-QMLE and E-QMLE, two Bayesian information criteria are proposed for model selection, and two mixed portmanteau tests are constructed to check the adequacy of fitted models. 
Moreover, we compare the proposed G-QMLE and E-QMLE with the existing doubly weighted quantile regression estimator in terms of asymptotic efficiency and numerical performance. Simulation studies illustrate the finite-sample performance of the proposed inference tools, and a real example on the Bitcoin return series shows their usefulness."}, "https://rss.arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "https://rss.arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for detecting localized regions in data sequences, each of which must contain a change-point in the median, at a prescribed global significance level. RNSP works by fitting the postulated constant model over many regions of the data using a new sign-multiresolution sup-norm-type loss, and greedily identifying the shortest intervals on which the constancy is significantly violated. By working with the signs of the data around fitted model candidates, RNSP fulfils its coverage promises under minimal assumptions, requiring only sign-symmetry and serial independence of the signs of the true residuals. In particular, it permits their heterogeneity and arbitrarily heavy tails. The intervals of significance returned by RNSP have a finite-sample character, are unconditional in nature and do not rely on any assumptions on the true signal. Code implementing RNSP is available at https://github.com/pfryz/nsp."}, "https://rss.arxiv.org/abs/2301.10387": {"title": "Mesh-clustered Gaussian process emulator for partial differential equation boundary value problems", "link": "https://rss.arxiv.org/abs/2301.10387", "description": "Partial differential equations (PDEs) have become an essential tool for modeling complex physical systems. Such equations are typically solved numerically via mesh-based methods, such as finite element methods, with solutions over the spatial domain. However, obtaining these solutions is often prohibitively costly, limiting the feasibility of exploring parameters in PDEs. In this paper, we propose an efficient emulator that simultaneously predicts the solutions over the spatial domain, with theoretical justification of its uncertainty quantification. The novelty of the proposed method lies in the incorporation of the mesh node coordinates into the statistical model. In particular, the proposed method segments the mesh nodes into multiple clusters via a Dirichlet process prior and fits Gaussian process models with the same hyperparameters in each of them. Most importantly, by revealing the underlying clustering structures, the proposed method can provide valuable insights into qualitative features of the resulting dynamics that can be used to guide further investigations. Real examples demonstrate that our proposed method has smaller prediction errors than its main competitors, with competitive computation time, and identifies interesting clusters of mesh nodes that possess physical significance, such as satisfying boundary conditions. An R package for the proposed methodology is provided in an open repository."}, "https://rss.arxiv.org/abs/2304.07312": {"title": "Stochastic Actor Oriented Model with Random Effects", "link": "https://rss.arxiv.org/abs/2304.07312", "description": "The stochastic actor oriented model (SAOM) is a method for modelling social interactions and social behaviour over time. 
It can be used to model drivers of dynamic interactions using both exogenous covariates and endogenous network configurations, as well as the co-evolution of behaviour and social interactions. In its standard implementations, it assumes that all individuals have the same interaction evaluation function. This lack of heterogeneity is one of its limitations. The aim of this paper is to extend the inference framework for the SAOM to include random effects, so that the heterogeneity of individuals can be modeled more accurately.\n We decompose the linear evaluation function that models the probability of forming or removing a tie from the network into a homogeneous fixed part and a random, individual-specific part. We extend the Robbins-Monro algorithm to the estimation of the variance of the random parameters. Our method is applicable to general random effect formulations. We illustrate the method with a random out-degree model and show parameter estimation of the random components, significance tests, and model evaluation. We apply the method to Kapferer's Tailor shop study. It is shown that a random out-degree constitutes a serious alternative to including transitivity and higher-order dependency effects."}, "https://rss.arxiv.org/abs/2305.03780": {"title": "Boldness-Recalibration for Binary Event Predictions", "link": "https://rss.arxiv.org/abs/2305.03780", "description": "Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening the range of hockey predictions from .26-.78 to .10-.91)."}, "https://rss.arxiv.org/abs/2308.10231": {"title": "Static and Dynamic BART for Rank-Order Data", "link": "https://rss.arxiv.org/abs/2308.10231", "description": "Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. 
These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams."}, "https://rss.arxiv.org/abs/2309.08494": {"title": "Modeling Data Analytic Iteration With Probabilistic Outcome Sets", "link": "https://rss.arxiv.org/abs/2309.08494", "description": "In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria--expected information gain and anomaly information gain--to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis."}, "https://rss.arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "https://rss.arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with potential confounding by time trends. Traditional frequentist methods can fail to provide adequate coverage of the intervention's true effect using confidence intervals, whereas Bayesian approaches show potential for better coverage of intervention effects. However, Bayesian methods have seen limited development in SWCRTs. We propose two novel Bayesian hierarchical penalized spline models for SWCRTs. The first model is for SWCRTs involving many clusters and time periods, focusing on immediate intervention effects. To evaluate its efficacy, we compared this model to traditional frequentist methods. We further developed the model to estimate time-varying intervention effects. We conducted a comparative analysis of this Bayesian spline model against an existing Bayesian monotone effect curve model. The proposed models are applied in the Primary Palliative Care for Emergency Medicine stepped wedge trial to evaluate the effectiveness of primary palliative care intervention. Extensive simulations and a real-world application demonstrate the strengths of the proposed Bayesian models. 
The Bayesian immediate effect model consistently achieves coverage probabilities near the frequentist nominal level for the true intervention effect, providing more reliable interval estimates than traditional frequentist models while maintaining high estimation accuracy. The proposed Bayesian time-varying effect model exhibits advancements over the existing Bayesian monotone effect curve model in terms of improved accuracy and reliability. To the best of our knowledge, this is the first development of Bayesian hierarchical spline modeling for SWCRTs. The proposed models offer an accurate and robust analysis of intervention effects. Their application could lead to effective adjustments in intervention strategies."}, "https://rss.arxiv.org/abs/2401.08224": {"title": "Privacy Preserving Adaptive Experiment Design", "link": "https://rss.arxiv.org/abs/2401.08224", "description": "Adaptive experiments are widely adopted to estimate the conditional average treatment effect (CATE) in clinical trials and many other scenarios. While the primary goal of the experiment is to maximize estimation accuracy, the imperative of social welfare also makes it crucial to provide treatments with superior outcomes to patients, which is measured by regret in the contextual bandit framework. These two objectives often lead to contrasting optimal allocation mechanisms. Furthermore, privacy concerns arise in clinical scenarios containing sensitive data like patients' health records. Therefore, it is essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the tradeoff between loss of social welfare and statistical power in contextual bandit experiments. We propose matching upper and lower bounds for the multi-objective optimization problem, and then adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms which still match the lower bound, showing that privacy is \"almost free\". Additionally, we derive the asymptotic normality of the estimator, which is essential in statistical inference and hypothesis testing."}, "https://rss.arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "https://rss.arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in precision medicine. Specific interests lie in identifying the differential effect of different treatments based on some external covariates. We propose a novel non-parametric treatment effect estimation method in a multi-treatment setting. Our non-parametric modeling of the response curves relies on radial basis function (RBF)-nets with shared hidden neurons. Our model thus facilitates modeling commonality among the treatment outcomes. The estimation and inference schemes are developed under a Bayesian framework and implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of the analysis. The numerical performance of the method is demonstrated through simulation experiments. 
Applying our proposed method to MIMIC data, we obtain several interesting findings related to the impact of different treatment strategies on the length of ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "https://rss.arxiv.org/abs/2211.01345": {"title": "Generative machine learning methods for multivariate ensemble post-processing", "link": "https://rss.arxiv.org/abs/2211.01345", "description": "Ensemble weather forecasts based on multiple runs of numerical weather prediction models typically show systematic errors and require post-processing to obtain reliable forecasts. Accurately modeling multivariate dependencies is crucial in many practical applications, and various approaches to multivariate post-processing have been proposed where ensemble predictions are first post-processed separately in each margin and multivariate dependencies are then restored via copulas. These two-step methods share common key limitations, in particular the difficulty to include additional predictors in modeling the dependencies. We propose a novel multivariate post-processing method based on generative machine learning to address these challenges. In this new class of nonparametric data-driven distributional regression models, samples from the multivariate forecast distribution are directly obtained as output of a generative neural network. The generative model is trained by optimizing a proper scoring rule which measures the discrepancy between the generated and observed data, conditional on exogenous input variables. Our method does not require parametric assumptions on univariate distributions or multivariate dependencies and allows for incorporating arbitrary predictors. In two case studies on multivariate temperature and wind speed forecasting at weather stations over Germany, our generative model shows significant improvements over state-of-the-art methods and particularly improves the representation of spatial dependencies."}, "https://rss.arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "https://rss.arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this paper we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent, heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. 
Finally, we apply our methodology to real and simulated data, demonstrating some effects not identifiable from the model with discrete community memberships."}, "https://rss.arxiv.org/abs/2212.11880": {"title": "Parameter Inference based on Gaussian Processes Informed by Nonlinear Partial Differential Equations", "link": "https://rss.arxiv.org/abs/2212.11880", "description": "Partial differential equations (PDEs) are widely used for the description of physical and engineering phenomena. Some key parameters involved in PDEs, which represent certain physical properties with important scientific interpretations, are difficult or even impossible to measure directly. Estimating these parameters from noisy and sparse experimental data of related physical quantities is an important task. Many methods for PDE parameter inference involve a large number of evaluations for numerical solutions to PDE through algorithms such as the finite element method, which can be time-consuming, especially for nonlinear PDEs. In this paper, we propose a novel method for the inference of unknown parameters in PDEs, called the PDE-Informed Gaussian Process (PIGP) based parameter inference method. Through modeling the PDE solution as a Gaussian process (GP), we derive the manifold constraints induced by the (linear) PDE structure such that, under the constraints, the GP satisfies the PDE. For nonlinear PDEs, we propose an augmentation method that transforms the nonlinear PDE into an equivalent PDE system linear in all derivatives, which our PIGP-based method can handle. The proposed method can be applied to a broad spectrum of nonlinear PDEs. The PIGP-based method can be applied to multi-dimensional PDE systems and PDE systems with unobserved components. Like conventional Bayesian approaches, the method can provide uncertainty quantification for both the unknown parameters and the PDE solution. The PIGP-based method also completely bypasses the numerical solver for PDEs. The proposed method is demonstrated through several application examples from different areas."}, "https://arxiv.org/abs/2402.01005": {"title": "The prices of renewable commodities: A robust stationarity analysis", "link": "https://arxiv.org/abs/2402.01005", "description": "This paper addresses the problem of testing for persistence in the effects of the shocks affecting the prices of renewable commodities, which have potential implications on stabilization policies and economic forecasting, among other areas. A robust methodology is employed that enables the determination of the potential presence and number of instant/gradual structural changes in the series, stationarity testing conditional on the number of changes detected, and the detection of change points. This procedure is applied to the annual real prices of eighteen renewable commodities over the period of 1900-2018. Results indicate that most of the series display non-linear features, including quadratic patterns and regime transitions that often coincide with well-known political and economic episodes. The conclusions of stationarity testing suggest that roughly half of the series are integrated. Stationarity fails to be rejected for grains, whereas most livestock and textile commodities do reject stationarity. Evidence is mixed in all soft commodities and tropical crops, where stationarity can be rejected in approximately half of the cases. 
The implication would be that for these commodities, stabilization schemes would not be recommended."}, "https://arxiv.org/abs/2402.01069": {"title": "Data-driven model selection within the matrix completion method for causal panel data models", "link": "https://arxiv.org/abs/2402.01069", "description": "Matrix completion estimators are employed in causal panel data models to regulate the rank of the underlying factor model using nuclear norm minimization. This convex optimization problem enables concurrent regularization of a potentially high-dimensional set of covariates to shrink the model size. For valid finite sample inference, we adopt a permutation-based approach and prove its validity for any treatment assignment mechanism. Simulations illustrate the consistency of the proposed estimator in parameter estimation and variable selection. An application to public health policies in Germany demonstrates the data-driven model selection feature on empirical data and finds no effect of travel restrictions on the containment of severe Covid-19 infections."}, "https://arxiv.org/abs/2402.01112": {"title": "Gerontologic Biostatistics 2", "link": "https://arxiv.org/abs/2402.01112", "description": "Background: Introduced in 2010, the sub-discipline of gerontologic biostatistics (GBS) was conceptualized to address the specific challenges in analyzing data from research studies involving older adults. However, the evolving technological landscape has catalyzed data science and statistical advancements since the original GBS publication, greatly expanding the scope of gerontologic research. There is a need to describe how these advancements enhance the analysis of multi-modal data and complex phenotypes that are hallmarks of gerontologic research. Methods: This paper introduces GBS 2.0, an updated and expanded set of analytical methods reflective of the practice of gerontologic biostatistics in contemporary and future research. Results: GBS 2.0 topics and relevant software resources include cutting-edge methods in experimental design; analytical techniques that include adaptations of machine learning, quantifying deep phenotypic measurements, high-dimensional -omics analysis; the integration of information from multiple studies, and strategies to foster reproducibility, replicability, and open science. Discussion: The methodological topics presented here seek to update and expand GBS. By facilitating the synthesis of biostatistics and data science in gerontology, we aim to foster the next generation of gerontologic researchers."}, "https://arxiv.org/abs/2402.01121": {"title": "Non-linear Mendelian randomization with Two-stage prediction estimation and Control function estimation", "link": "https://arxiv.org/abs/2402.01121", "description": "Most of the existing Mendelian randomization (MR) methods are limited by the assumption of linear causality between exposure and outcome, and the development of new non-linear MR methods is highly desirable. We introduce two-stage prediction estimation and control function estimation from econometrics to MR and extend them to non-linear causality. We give conditions for parameter identification and theoretically prove the consistency and asymptotic normality of the estimates. We compare the two methods theoretically under both linear and non-linear causality. We also extend the control function estimation to a more flexible semi-parametric framework without detailed parametric specifications of causality. 
Extensive simulations numerically corroborate our theoretical results. Application to UK Biobank data reveals non-linear causal relationships between sleep duration and systolic/diastolic blood pressure."}, "https://arxiv.org/abs/2402.01381": {"title": "Spatial-Sign based Maxsum Test for High Dimensional Location Parameters", "link": "https://arxiv.org/abs/2402.01381", "description": "In this study, we explore a robust testing procedure for the high-dimensional location parameters testing problem. Initially, we introduce a spatial-sign based max-type test statistic, which exhibits excellent performance for sparse alternatives. Subsequently, we demonstrate the asymptotic independence between this max-type test statistic and the spatial-sign based sum-type test statistic (Feng and Sun, 2016). Building on this, we propose a spatial-sign based max-sum type testing procedure, which shows remarkable performance under varying signal sparsity. Our simulation studies underscore the superior performance of the procedures we propose."}, "https://arxiv.org/abs/2402.01398": {"title": "penalizedclr: an R package for penalized conditional logistic regression for integration of multiple omics layers", "link": "https://arxiv.org/abs/2402.01398", "description": "The matched case-control design, up until recently mostly pertinent to epidemiological studies, is becoming customary in biomedical applications as well. For instance, in omics studies, it is quite common to compare cancer and healthy tissue from the same patient. Furthermore, researchers today routinely collect data from various and variable sources that they wish to relate to the case-control status. This highlights the need to develop and implement statistical methods that can take these tendencies into account. We present an R package penalizedclr, that provides an implementation of the penalized conditional logistic regression model for analyzing matched case-control studies. It allows for different penalties for different blocks of covariates, and it is therefore particularly useful in the presence of multi-source omics data. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection for variable selection in the considered regression model. The proposed method fills a gap in the available software for fitting high-dimensional conditional logistic regression model accounting for the matched design and block structure of predictors/features. The output consists of a set of selected variables that are significantly associated with case-control status. These features can then be investigated in terms of functional interpretation or validation in further, more targeted studies."}, "https://arxiv.org/abs/2402.01491": {"title": "Moving Aggregate Modified Autoregressive Copula-Based Time Series Models (MAGMAR-Copulas) Without Markov Restriction", "link": "https://arxiv.org/abs/2402.01491", "description": "Copula-based time series models implicitly assume a finite Markov order. In reality a time series may not follow the Markov property. We modify the copula-based time series models by introducing a moving aggregate (MAG) part into the model updating equation. The functional form of the MAG-part is given as the inverse of a conditional copula. The resulting MAG-modified Autoregressive Copula-Based Time Series model (MAGMAR-Copula) is discussed in detail and distributional properties are derived in a D-vine framework. The model nests the classical ARMA model as well as the copula-based time series model. 
The modeling performance is compared with that of the model from \\cite{mcneil2022time} for US inflation. Our model is competitive in terms of information criteria. It is a generalization of ARMA and also of copula-based time series models and is in spirit similar to other moving average time series models such as ARMA and GARCH."}, "https://arxiv.org/abs/2402.01635": {"title": "kNN Algorithm for Conditional Mean and Variance Estimation with Automated Uncertainty Quantification and Variable Selection", "link": "https://arxiv.org/abs/2402.01635", "description": "In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach incorporates a robust uncertainty quantification mechanism, leveraging our prior estimation work on conditional mean and variance. The employment of kNN ensures scalable computational efficiency in predicting intervals and statistical accuracy in line with optimal non-parametric rates. Additionally, we introduce a new kNN semi-parametric algorithm for estimating ROC curves, accounting for covariates. For selecting the smoothing parameter k, we propose an algorithm with theoretical guarantees. Incorporation of variable selection enhances the performance of the method significantly over conventional kNN techniques in various modeling tasks. We validate the approach through simulations in low, moderate, and high-dimensional covariate spaces. The algorithm's effectiveness is particularly notable in biomedical applications as demonstrated in two case studies. Concluding with a theoretical analysis, we highlight the consistency and convergence rate of our method over traditional kNN models, particularly when the underlying regression model takes values in a low-dimensional space."}, "https://arxiv.org/abs/2402.00907": {"title": "AlphaRank: An Artificial Intelligence Approach for Ranking and Selection Problems", "link": "https://arxiv.org/abs/2402.00907", "description": "We introduce AlphaRank, an artificial intelligence approach to address the fixed-budget ranking and selection (R&S) problems. We formulate the sequential sampling decision as a Markov decision process and propose a Monte Carlo simulation-based rollout policy that utilizes classic R&S procedures as base policies for efficiently learning the value function of stochastic dynamic programming. We accelerate online sample-allocation by using deep reinforcement learning to pre-train a neural network model offline based on a given prior. We also propose a parallelizable computing framework for large-scale problems, effectively combining \"divide and conquer\" and \"recursion\" for enhanced scalability and efficiency. 
Numerical experiments demonstrate that the performance of AlphaRank is significantly improved over the base policies, which could be attributed to AlphaRank's superior capability on the trade-off among mean, variance, and induced correlation overlooked by many existing policies."}, "https://arxiv.org/abs/2402.01003": {"title": "Practical challenges in mediation analysis: A guide for applied researchers", "link": "https://arxiv.org/abs/2402.01003", "description": "Mediation analysis is a statistical approach that can provide insights regarding the intermediary processes by which an intervention or exposure affects a given outcome. Mediation analyses rose to prominence, particularly in social science research, with the publication of the seminal paper by Baron and Kenny and is now commonly applied in many research disciplines, including health services research. Despite the growth in popularity, applied researchers may still encounter challenges in terms of conducting mediation analyses in practice. In this paper, we provide an overview of conceptual and methodological challenges that researchers face when conducting mediation analyses. Specifically, we discuss the following key challenges: (1) Conceptually differentiating mediators from other third variables, (2) Extending beyond the single mediator context, (3) Identifying appropriate datasets in which measurement and temporal ordering supports the hypothesized mediation model, (4) Selecting mediation effects that reflect the scientific question of interest, (5) Assessing the validity of underlying assumptions of no omitted confounders, (6) Addressing measurement error regarding the mediator, and (7) Clearly reporting results from mediation analyses. We discuss each challenge and highlight ways in which the applied researcher can approach these challenges."}, "https://arxiv.org/abs/2402.01139": {"title": "Online conformal prediction with decaying step sizes", "link": "https://arxiv.org/abs/2402.01139", "description": "We introduce a method for online conformal prediction with decaying step sizes. Like previous methods, ours possesses a retrospective guarantee of coverage for arbitrary sequences. However, unlike previous methods, we can simultaneously estimate a population quantile when it exists. Our theory and experiments indicate substantially improved practical properties: in particular, when the distribution is stable, the coverage is close to the desired level for every time point, not just on average over the observed sequence."}, "https://arxiv.org/abs/2402.01207": {"title": "Efficient Causal Graph Discovery Using Large Language Models", "link": "https://arxiv.org/abs/2402.01207", "description": "We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. 
The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains."}, "https://arxiv.org/abs/2402.01320": {"title": "On the mean-field limit for Stein variational gradient descent: stability and multilevel approximation", "link": "https://arxiv.org/abs/2402.01320", "description": "In this paper we propose and analyze a novel multilevel version of Stein variational gradient descent (SVGD). SVGD is a recent particle based variational inference method. For Bayesian inverse problems with computationally expensive likelihood evaluations, the method can become prohibitive as it requires evolving a discrete dynamical system over many time steps, each of which requires likelihood evaluations at all particle locations. To address this, we introduce a multilevel variant that involves running several interacting particle dynamics in parallel corresponding to different approximation levels of the likelihood. By carefully tuning the number of particles at each level, we prove that a significant reduction in computational complexity can be achieved. As an application we provide a numerical experiment for a PDE driven inverse problem, which confirms the speed up suggested by our theoretical results."}, "https://arxiv.org/abs/2402.01454": {"title": "Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach", "link": "https://arxiv.org/abs/2402.01454", "description": "In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is widely accepted as significant for creating consistent meaningful causal models, despite the recognized challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through \"statistical causal prompting (SCP)\" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can bring the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI closer to the ground truth, and that the SCD result can be further improved if GPT-4 undergoes SCP. Furthermore, it has been shown that an LLM can improve SCD with its background knowledge, even if the LLM does not contain information on the dataset. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains."}, "https://arxiv.org/abs/2402.01607": {"title": "Natural Counterfactuals With Necessary Backtracking", "link": "https://arxiv.org/abs/2402.01607", "description": "Counterfactual reasoning is pivotal in human cognition and especially important for providing explanations and making decisions. While Judea Pearl's influential approach is theoretically elegant, its generation of a counterfactual scenario often requires interventions that are too detached from the real scenarios to be feasible. In response, we propose a framework of natural counterfactuals and a method for generating counterfactuals that are natural with respect to the actual world's data distribution. 
Our methodology refines counterfactual reasoning, allowing changes in causally preceding variables to minimize deviations from realistic scenarios. To generate natural counterfactuals, we introduce an innovative optimization framework that permits but controls the extent of backtracking with a naturalness criterion. Empirical experiments indicate the effectiveness of our method."}, "https://arxiv.org/abs/2106.15436": {"title": "Topo-Geometric Analysis of Variability in Point Clouds using Persistence Landscapes", "link": "https://arxiv.org/abs/2106.15436", "description": "Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst the tools is persistent homology, which summarizes birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enables the use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistent homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies."}, "https://arxiv.org/abs/2203.06225": {"title": "Semiparametric Mixed-effects Model for Longitudinal Data with Non-normal Errors", "link": "https://arxiv.org/abs/2203.06225", "description": "Difficulties may arise when analyzing longitudinal data using mixed-effects models if there are nonparametric functions present in the linear predictor component. This study extends the use of semiparametric mixed-effects modeling in cases where the response variable does not always follow a normal distribution and the nonparametric component is structured as an additive model. A novel approach is proposed to identify significant linear and non-linear components using a double-penalized generalized estimating equation with two penalty terms. Furthermore, the proposed iterative approach aims to enhance the efficiency of estimating regression coefficients by incorporating the calculation of the working covariance matrix. The oracle properties of the resulting estimators are established under certain regularity conditions, where the dimensions of both the parametric and nonparametric components increase as the sample size grows. 
We perform numerical studies to demonstrate the efficacy of our proposal."}, "https://arxiv.org/abs/2208.07573": {"title": "Higher-order accurate two-sample network inference and network hashing", "link": "https://arxiv.org/abs/2208.07573", "description": "Two-sample hypothesis testing for network comparison presents many significant challenges, including: leveraging repeated network observations and known node registration, but without requiring them to operate; relaxing strong structural assumptions; achieving finite-sample higher-order accuracy; handling different network sizes and sparsity levels; fast computation and memory parsimony; controlling false discovery rate (FDR) in multiple testing; and theoretical understandings, particularly regarding finite-sample accuracy and minimax optimality. In this paper, we develop a comprehensive toolbox, featuring a novel main method and its variants, all accompanied by strong theoretical guarantees, to address these challenges. Our method outperforms existing tools in speed and accuracy, and it is proved power-optimal. Our algorithms are user-friendly and versatile in handling various data structures (single or repeated network observations; known or unknown node registration). We also develop an innovative framework for offline hashing and fast querying as a very useful tool for large network databases. We showcase the effectiveness of our method through comprehensive simulations and applications to two real-world datasets, which revealed intriguing new structures."}, "https://arxiv.org/abs/2208.14983": {"title": "Improved information criteria for Bayesian model averaging in lattice field theory", "link": "https://arxiv.org/abs/2208.14983", "description": "Bayesian model averaging is a practical method for dealing with uncertainty due to model specification. Use of this technique requires the estimation of model probability weights. In this work, we revisit the derivation of estimators for these model weights. Use of the Kullback-Leibler divergence as a starting point leads naturally to a number of alternative information criteria suitable for Bayesian model weight estimation. We explore three such criteria, known to the statistics literature before, in detail: a Bayesian analogue of the Akaike information criterion which we call the BAIC, the Bayesian predictive information criterion (BPIC), and the posterior predictive information criterion (PPIC). We compare the use of these information criteria in numerical analysis problems common in lattice field theory calculations. We find that the PPIC has the most appealing theoretical properties and can give the best performance in terms of model-averaging uncertainty, particularly in the presence of noisy data, while the BAIC is a simple and reliable alternative."}, "https://arxiv.org/abs/2210.12968": {"title": "Overlap, matching, or entropy weights: what are we weighting for?", "link": "https://arxiv.org/abs/2210.12968", "description": "There has been a recent surge in statistical methods for handling the lack of adequate positivity when using inverse probability weights (IPW). However, these nascent developments have raised a number of questions. Thus, we demonstrate the ability of equipoise estimators (overlap, matching, and entropy weights) to handle the lack of positivity. Compared to IPW, the equipoise estimators have been shown to be flexible and easy to interpret. 
However, promoting their wide use requires that researchers clearly understand why and when to apply them and what to expect.\n In this paper, we provide the rationale to use these estimators to achieve robust results. We specifically look into the impact that imbalances in treatment allocation can have on positivity and, ultimately, on the estimates of the treatment effect. We zero in on the typical pitfalls of the IPW estimator and its relationship with the estimators of the average treatment effect on the treated (ATT) and on the controls (ATC). Furthermore, we also compare IPW trimming to the equipoise estimators. We focus particularly on two key points: What fundamentally distinguishes their estimands? When should we expect similar results? Our findings are illustrated through Monte-Carlo simulation studies and a data example on healthcare expenditure."}, "https://arxiv.org/abs/2212.13574": {"title": "Weak Signal Inclusion Under Dependence and Applications in Genome-wide Association Study", "link": "https://arxiv.org/abs/2212.13574", "description": "Motivated by inquiries into weak signals in underpowered genome-wide association studies (GWASs), we consider the problem of retaining true signals that are not strong enough to be individually separable from a large amount of noise. We address the challenge from the perspective of false negative control and present false negative control (FNC) screening, a data-driven method to efficiently regulate false negative proportion at a user-specified level. FNC screening is developed in a realistic setting with arbitrary covariance dependence between variables. We calibrate the overall dependence through a parameter whose scale is compatible with the existing phase diagram in high-dimensional sparse inference. Utilizing the new calibration, we asymptotically explicate the joint effect of covariance dependence, signal sparsity, and signal intensity on the proposed method. We interpret the results using a new phase diagram, which shows that FNC screening can efficiently select a set of candidate variables to retain a high proportion of signals even when the signals are not individually separable from noise. Finite sample performance of FNC screening is compared to those of several existing methods in simulation studies. The proposed method outperforms the others in adapting to a user-specified false negative control level. We implement FNC screening to empower a two-stage GWAS procedure, which demonstrates substantial power gain when working with limited sample sizes in real applications."}, "https://arxiv.org/abs/2303.03502": {"title": "Analyzing Risk Factors for Post-Acute Recovery in Older Adults with Alzheimer's Disease and Related Dementia: A New Semi-Parametric Model for Large-Scale Medicare Claims", "link": "https://arxiv.org/abs/2303.03502", "description": "Nearly 300,000 older adults experience a hip fracture every year, the majority of which occur following a fall. Unfortunately, recovery after fall-related trauma such as hip fracture is poor, where older adults diagnosed with Alzheimer's Disease and Related Dementia (ADRD) spend a particularly long time in hospitals or rehabilitation facilities during the post-operative recuperation period. Because older adults value functional recovery and spending time at home versus facilities as key outcomes after hospitalization, identifying factors that influence days spent at home after hospitalization is imperative. 
While several individual-level factors have been identified, the characteristics of the treating hospital have also recently been recognized as contributors. However, few methodologically rigorous approaches are available to help overcome potential sources of bias such as hospital-level unmeasured confounders, informative hospital size, and loss to follow-up due to death. This article develops a useful tool equipped with unsupervised learning to simultaneously handle statistical complexities that are often encountered in health services research, especially when using large administrative claims databases. The proposed estimator has a closed form, thus requiring only a light computational load in a large-scale study. We further derive its asymptotic properties, which can be used to make statistical inference in practice. Extensive simulation studies demonstrate the superiority of the proposed estimator compared to existing estimators."}, "https://arxiv.org/abs/2303.08218": {"title": "Spatial causal inference in the presence of unmeasured confounding and interference", "link": "https://arxiv.org/abs/2303.08218", "description": "This manuscript bridges the divide between causal inference and spatial statistics, presenting novel insights for causal inference in spatial data analysis, and establishing how tools from spatial statistics can be used to draw causal inferences. We introduce spatial causal graphs to highlight that spatial confounding and interference can be entangled, in that investigating the presence of one can lead to wrongful conclusions in the presence of the other. Moreover, we show that spatial dependence in the exposure variable can render standard analyses invalid, which can lead to erroneous conclusions. To remedy these issues, we propose a Bayesian parametric approach based on tools commonly-used in spatial statistics. This approach simultaneously accounts for interference and mitigates bias resulting from local and neighborhood unmeasured spatial confounding. From a Bayesian perspective, we show that incorporating an exposure model is necessary, and we theoretically prove that all model parameters are identifiable, even in the presence of unmeasured confounding. To illustrate the approach's effectiveness, we provide results from a simulation study and a case study involving the impact of sulfur dioxide emissions from power plants on cardiovascular mortality."}, "https://arxiv.org/abs/2306.12671": {"title": "Feature screening for clustering analysis", "link": "https://arxiv.org/abs/2306.12671", "description": "In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions and unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate the homogeneity. Under general parametric settings, we establish the tail probability bounds of the EM-test statistic for the homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independent screening and even the consistency in selection properties. Limiting distribution of the EM-test statistic is also obtained for general parametric distributions. 
The proposed method is computationally efficient, can accurately screen for important cluster-relevant features and help to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses."}, "https://arxiv.org/abs/2312.03257": {"title": "Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes", "link": "https://arxiv.org/abs/2312.03257", "description": "Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application given its ability to depict the global metabolic pattern in biological samples. However, the data is noisy and plagued by the lack of clear identity of data features measured from samples. Multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches. Some existing methods attempt to reduce the matching uncertainty, but are far from being able to remove the uncertainty for most features. The existence of the uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection, and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible."}, "https://arxiv.org/abs/2306.06534": {"title": "K-Tensors: Clustering Positive Semi-Definite Matrices", "link": "https://arxiv.org/abs/2306.06534", "description": "This paper introduces $K$-Tensors, a novel self-consistent clustering algorithm designed to cluster positive semi-definite (PSD) matrices by their eigenstructures. Clustering PSD matrices is crucial across various fields, including computer and biomedical sciences. Traditional clustering methods, which often involve matrix vectorization, tend to overlook the inherent PSD characteristics, thereby discarding valuable shape and eigenstructural information. To preserve this essential shape and eigenstructural information, our approach incorporates a unique distance metric that respects the PSD nature of the data. We demonstrate that $K$-Tensors is not only self-consistent but also reliably converges to a local optimum. 
Through numerical studies, we further validate the algorithm's effectiveness and explore its properties in detail."}, "https://arxiv.org/abs/2307.10276": {"title": "A test for counting sequences of integer-valued autoregressive models", "link": "https://arxiv.org/abs/2307.10276", "description": "The integer autoregressive (INAR) model is one of the most commonly used models in nonnegative integer-valued time series analysis and is a counterpart to the traditional autoregressive model for continuous-valued time series. To guarantee the integer-valued nature, the binomial thinning operator or more generally the generalized Steutel and van Harn operator is used to define the INAR model. However, the distributions of the counting sequences used in the operators have so far been determined by the analyst's preference without statistical verification. In this paper, we propose a test based on the mean and variance relationships for distributions of counting sequences and a disturbance process to check if the operator is reasonable. We show that our proposed test has asymptotically correct size and is consistent. Numerical simulation is carried out to evaluate the finite sample performance of our test. As a real data application, we apply our test to the monthly number of anorexia cases in animals submitted to animal health laboratories in New Zealand and we conclude that the binomial thinning operator is not appropriate."}, "https://arxiv.org/abs/2402.01827": {"title": "Extracting Scalar Measures from Curves", "link": "https://arxiv.org/abs/2402.01827", "description": "The ability to order outcomes is necessary to make comparisons, which is complicated when there is no natural ordering on the space of outcomes, as in the case of functional outcomes. This paper examines methods for extracting a scalar summary from functional or longitudinal outcomes based on an average rate of change which can be used to compare curves. Approaches commonly used in practice rely on a change score or an analysis of covariance (ANCOVA) to make comparisons. However, these standard approaches only use a fraction of the available data and are inefficient. We derive measures of performance of an averaged rate of change of a functional outcome and compare this measure to standard measures. Simulations and data from a depression clinical trial are used to illustrate results."}, "https://arxiv.org/abs/2402.01861": {"title": "Depth for samples of sets with applications to testing equality in distribution of two samples of random sets", "link": "https://arxiv.org/abs/2402.01861", "description": "This paper introduces several depths for random sets with possibly non-convex realisations, proposes ways to estimate the depths based on the samples and compares them with existing ones. The depths are further applied for the comparison between two samples of random sets using a visual method of DD-plots and statistical testing. The advantage of this approach is identifying sets within the sample that are responsible for rejecting the null hypothesis of equality in distribution and providing clues on differences between distributions. 
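For readers unfamiliar with the thinning construction referenced in the INAR testing abstract above, the following Python sketch simulates an INAR(1) process with binomial thinning and Poisson innovations, the default choice whose adequacy the proposed test is meant to check; the test statistic itself is not reproduced here, and all parameter names are illustrative.

    import numpy as np

    def simulate_inar1(n, alpha, lam, seed=0):
        # X_t = alpha o X_{t-1} + eps_t, with binomial thinning and Poisson innovations.
        rng = np.random.default_rng(seed)
        x = np.empty(n, dtype=int)
        x[0] = rng.poisson(lam / (1.0 - alpha))        # start near the stationary mean
        for t in range(1, n):
            survivors = rng.binomial(x[t - 1], alpha)  # binomial thinning of the past count
            x[t] = survivors + rng.poisson(lam)        # add the innovation
        return x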
The method is justified using a simulation study and applied to real data consisting of histological images of mastopathy and mammary cancer tissue."}, "https://arxiv.org/abs/2402.01866": {"title": "Parametric Bootstrap on Networks with Non-Exchangeable Nodes", "link": "https://arxiv.org/abs/2402.01866", "description": "This paper studies the parametric bootstrap method for networks to quantify the uncertainty of statistics of interest. While existing network resampling methods primarily focus on count statistics under node-exchangeable (graphon) models, we consider more general network statistics (including local statistics) under the Chung-Lu model without node-exchangeability. We show that the natural network parametric bootstrap that first estimates the network generating model and then draws bootstrap samples from the estimated model generally suffers from bootstrap bias. As a general recipe for addressing this problem, we show that a two-level bootstrap procedure provably reduces the bias. This essentially extends the classical idea of iterative bootstrap to the network case with growing number of parameters. Moreover, for many network statistics, the second-level bootstrap also provides a way to construct confidence intervals with higher accuracy. As a byproduct of our effort to construct confidence intervals, we also prove the asymptotic normality of subgraph counts under the Chung-Lu model."}, "https://arxiv.org/abs/2402.01914": {"title": "Predicting Batting Averages in Specific Matchups Using Generalized Linked Matrix Factorization", "link": "https://arxiv.org/abs/2402.01914", "description": "Predicting batting averages for specific batters against specific pitchers is a challenging problem in baseball. Previous methods for estimating batting averages in these matchups have used regression models that can incorporate the pitcher's and batter's individual batting averages. However, these methods are limited in their flexibility to include many additional parameters because of the challenges of high-dimensional data in regression. Dimension reduction methods can be used to incorporate many predictors into the model by finding a lower rank set of patterns among them, providing added flexibility. This paper illustrates that dimension reduction methods can be useful for predicting batting averages. To incorporate binomial data (batting averages) as well as additional data about each batter and pitcher, this paper proposes a novel dimension reduction method that uses alternating generalized linear models to estimate shared patterns across three data sources related to batting averages. While the novel method slightly outperforms existing methods for imputing batting averages based on simulations and a cross-validation study, the biggest advantage is that it can easily incorporate other sources of data. As data-collection technology continues to improve, more variables will be available, and this method will be more accurate with more informative data in the future."}, "https://arxiv.org/abs/2402.01951": {"title": "Sparse spanning portfolios and under-diversification with second-order stochastic dominance", "link": "https://arxiv.org/abs/2402.01951", "description": "We develop and implement methods for determining whether relaxing sparsity constraints on portfolios improves the investment opportunity set for risk-averse investors. We formulate a new estimation procedure for sparse second-order stochastic spanning based on a greedy algorithm and Linear Programming. 
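The first level of the network parametric bootstrap described in the Chung-Lu abstract above can be sketched as follows; this is a simplified Python illustration assuming the networkx library, and the paper's two-level, bias-correcting bootstrap would repeat the same resampling step on each bootstrap graph.

    import numpy as np
    import networkx as nx

    def chung_lu_bootstrap(G, statistic=nx.transitivity, B=200, seed=0):
        # Plug-in Chung-Lu fit: expected-degree weights taken from the observed degrees.
        rng = np.random.default_rng(seed)
        weights = [deg for _, deg in G.degree()]
        reps = []
        for _ in range(B):
            Gb = nx.expected_degree_graph(weights, seed=int(rng.integers(2**31)),
                                          selfloops=False)
            reps.append(statistic(Gb))   # recompute the statistic of interest
        return np.array(reps)            # first-level bootstrap distribution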
We show the optimal recovery of the sparse solution asymptotically whether spanning holds or not. From large equity datasets, we estimate the expected utility loss due to possible under-diversification, and find that there is no benefit from expanding a sparse opportunity set beyond 45 assets. The optimal sparse portfolio invests in 10 industry sectors and cuts tail risk when compared to a sparse mean-variance portfolio. On a rolling-window basis, the number of assets shrinks to 25 assets in crisis periods, while standard factor models cannot explain the performance of the sparse portfolios."}, "https://arxiv.org/abs/2402.01966": {"title": "The general solution to an autoregressive law of motion", "link": "https://arxiv.org/abs/2402.01966", "description": "In this article we provide a complete description of the set of all solutions to an autoregressive law of motion in a finite-dimensional complex vector space. Every solution is shown to be the sum of three parts, each corresponding to a directed flow of time. One part flows forward from the arbitrarily distant past; one flows backward from the arbitrarily distant future; and one flows outward from time zero. The three parts are obtained by applying three complementary spectral projections to the solution, these corresponding to a separation of the eigenvalues of the autoregressive operator according to whether they are inside, outside or on the unit circle. We provide a finite-dimensional parametrization of the set of all solutions."}, "https://arxiv.org/abs/2402.02016": {"title": "Applying different methods to model dry and wet spells at daily scale in a large range of rainfall regimes across Europe", "link": "https://arxiv.org/abs/2402.02016", "description": "The modelling of the occurrence of rainfall dry and wet spells (ds and ws, respectively) can be jointly conveyed using the inter-arrival times (it). While the modelling of it has the advantage of requiring a single fitting for the description of all rainfall time characteristics (including wet and dry chains, an extension of the concept of spells), the assumption on the independence and identical distribution of the renewal times it implicitly imposes a memoryless property on the derived ws, which may not be true in some cases. In this study, two different methods for the modelling of rainfall time characteristics at station scale have been applied: i) a direct method (DM) that fits the discrete Lerch distribution to it records, and then derives ws and ds (as well as the corresponding chains) from the it distribution; and ii) an indirect method (IM) that fits the Lerch distribution to the ws and ds records separately, relaxing the assumptions of the renewal process. The results of this application over six stations in Europe, characterized by a wide range of rainfall regimes, highlight how the geometric distribution does not always reasonably reproduce the ws frequencies, even when it are modelled by the Lerch distribution well. Improved performances are obtained with the IM, thanks to the relaxation of the assumption on the independence and identical distribution of the renewal times. 
A further improvement in the fits is obtained when the datasets are separated into two periods, suggesting that the inferences may benefit from accounting for local seasonality."}, "https://arxiv.org/abs/2402.02064": {"title": "Prediction modelling with many correlated and zero-inflated predictors: assessing a nonnegative garrote approach", "link": "https://arxiv.org/abs/2402.02064", "description": "Building prediction models from mass-spectrometry data is challenging due to the abundance of correlated features with varying degrees of zero-inflation, leading to a common interest in reducing the features to a concise predictor set with good predictive performance. In this study, we formally established and examined regularized regression approaches designed to address zero-inflated and correlated predictors. In particular, we describe a novel two-stage regularized regression approach (ridge-garrote) explicitly modelling zero-inflated predictors using two component variables, comprising a ridge estimator in the first stage followed by a nonnegative garrote estimator in the second stage. We contrasted ridge-garrote with one-stage methods (ridge, lasso) and other two-stage regularized regression approaches (lasso-ridge, ridge-lasso) for zero-inflated predictors. We assessed the predictive performance and predictor selection properties of these methods in a comparative simulation study and a real-data case study to predict kidney function using peptidomic features derived from mass-spectrometry. In the simulation study, the predictive performance of all assessed approaches was comparable, yet the ridge-garrote approach consistently selected more parsimonious models compared to its competitors in most scenarios. While lasso-ridge achieved higher predictive accuracy than its competitors, it exhibited high variability in the number of selected predictors. Ridge-lasso exhibited slightly higher predictive accuracy than ridge-garrote, but at the expense of selecting more noise predictors. Overall, ridge emerged as a favourable option when variable selection is not a primary concern, while ridge-garrote demonstrated notable practical utility in selecting a parsimonious set of predictors, with only minimal compromise in predictive accuracy."}, "https://arxiv.org/abs/2402.02068": {"title": "Modeling local predictive ability using power-transformed Gaussian processes", "link": "https://arxiv.org/abs/2402.02068", "description": "A Gaussian process is proposed as a model for the posterior distribution of the local predictive ability of a model or expert, conditional on a vector of covariates, from historical predictions in the form of log predictive scores. Assuming Gaussian expert predictions and a Gaussian data generating process, a linear transformation of the predictive score follows a noncentral chi-squared distribution with one degree of freedom. Motivated by this, we develop a non-central chi-squared Gaussian process regression to flexibly model local predictive ability, with the posterior distribution of the latent GP function and kernel hyperparameters sampled by Hamiltonian Monte Carlo. We show that a cube-root transformation of the log scores is approximately Gaussian with homoscedastic variance, which makes it possible to estimate the model much faster by marginalizing the latent GP function analytically. 
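The two-stage ridge-garrote estimator described in the prediction-modelling abstract above can be sketched compactly. The Python stand-in below uses a ridge fit followed by nonnegative, L1-penalized shrinkage factors (the garrote step); it omits the paper's explicit two-component coding of zero-inflated predictors, and the penalty values are placeholders to be tuned.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    def ridge_garrote(X, y, alpha_ridge=1.0, alpha_garrote=0.05):
        beta0 = Ridge(alpha=alpha_ridge).fit(X, y).coef_               # stage 1: ridge fit
        Z = X * beta0                                                  # garrote design matrix
        c = Lasso(alpha=alpha_garrote, positive=True).fit(Z, y).coef_  # stage 2: factors c_j >= 0
        return c * beta0                                               # shrunken coefficients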
Linear pools based on learned local predictive ability are applied to predict daily bike usage in Washington DC."}, "https://arxiv.org/abs/2402.02128": {"title": "Adaptive Accelerated Failure Time modeling with a Semiparametric Skewed Error Distribution", "link": "https://arxiv.org/abs/2402.02128", "description": "The accelerated failure time (AFT) model is widely used to analyze relationships between variables in the presence of censored observations. However, this model relies on some assumptions such as the error distribution, which can lead to biased or inefficient estimates if these assumptions are violated. In order to overcome this challenge, we propose a novel approach that incorporates a semiparametric skew-normal scale mixture distribution for the error term in the AFT model. By allowing for more flexibility and robustness, this approach reduces the risk of misspecification and improves the accuracy of parameter estimation. We investigate the identifiability and consistency of the proposed model and develop a practical estimation algorithm. To evaluate the performance of our approach, we conduct extensive simulation studies and real data analyses. The results demonstrate the effectiveness of our method in providing robust and accurate estimates in various scenarios."}, "https://arxiv.org/abs/2402.02187": {"title": "Graphical models for multivariate extremes", "link": "https://arxiv.org/abs/2402.02187", "description": "Graphical models in extremes have emerged as a diverse and quickly expanding research area in extremal dependence modeling. They allow for parsimonious statistical methodology and are particularly suited for enforcing sparsity in high-dimensional problems. In this work, we provide the fundamental concepts of extremal graphical models and discuss recent advances in the field. Different existing perspectives on graphical extremes are presented in a unified way through graphical models for exponent measures. We discuss the important cases of nonparametric extremal graphical models on simple graph structures, and the parametric class of H\\\"usler--Reiss models on arbitrary undirected graphs. In both cases, we describe model properties, methods for statistical inference on known graph structures, and structure learning algorithms when the graph is unknown. We illustrate different methods in an application to flight delay data at US airports."}, "https://arxiv.org/abs/2402.02196": {"title": "Sample-Efficient Clustering and Conquer Procedures for Parallel Large-Scale Ranking and Selection", "link": "https://arxiv.org/abs/2402.02196", "description": "We propose novel \"clustering and conquer\" procedures for the parallel large-scale ranking and selection (R&S) problem, which leverage correlation information for clustering to break the bottleneck of sample efficiency. In parallel computing environments, correlation-based clustering can achieve an $\\mathcal{O}(p)$ sample complexity reduction rate, which is the optimal reduction rate theoretically attainable. Our proposed framework is versatile, allowing for seamless integration of various prevalent R&S methods under both fixed-budget and fixed-precision paradigms. It can achieve improvements without the necessity of highly accurate correlation estimation and precise clustering. In large-scale AI applications such as neural architecture search, a screening-free version of our procedure surprisingly surpasses fully-sequential benchmarks in terms of sample efficiency. 
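The fast approximation mentioned in the power-transformed Gaussian process abstract above (cube-root log scores treated as approximately Gaussian) is easy to prototype. The sketch below fits an ordinary GP regression to the transformed scores and is only a stand-in for the paper's noncentral chi-squared GP model; the kernel choice is an assumption.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def fit_local_predictive_ability(Z, log_scores):
        # Cube-root transform: approximately Gaussian with roughly constant variance.
        y = np.cbrt(log_scores)
        kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        return gp.fit(Z, y)   # predictions live on the cube-root score scale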
This suggests that leveraging valuable structural information, such as correlation, is a viable path to bypassing the traditional need for screening via pairwise comparison--a step previously deemed essential for high sample efficiency but problematic for parallelization. Additionally, we propose a parallel few-shot clustering algorithm tailored for large-scale problems."}, "https://arxiv.org/abs/2402.02272": {"title": "One-inflated zero-truncated count regression models", "link": "https://arxiv.org/abs/2402.02272", "description": "We find that in zero-truncated count data (y=1,2,...), individuals often gain information at first observation (y=1), leading to a common but unaddressed phenomenon of \"one-inflation\". The current standard, the zero-truncated negative binomial (ZTNB) model, is misspecified under one-inflation, causing bias and inconsistency. To address this, we introduce the one-inflated zero-truncated negative binomial (OIZTNB) regression model. The importance of our model is highlighted through simulation studies, and through the discovery of one-inflation in four datasets that have traditionally championed ZTNB. We recommended OIZTNB over ZTNB for most data, and provide estimation, marginal effects, and testing in the accompanying R package oneinfl."}, "https://arxiv.org/abs/2402.02306": {"title": "A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding", "link": "https://arxiv.org/abs/2402.02306", "description": "In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator. This estimator facilitates both longitudinal predictive and causal inference. It incorporates Bayesian additive regression trees in the modeling of the time-evolving generative components, aiming to mitigate bias due to model misspecification. Specifically, we introduce a more general class of g-formulas for discrete survival data. These formulas can incorporate the longitudinal balancing scores, which serve as an effective method for dimension reduction and are vital when dealing with an expanding array of time-varying confounders. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment regimes, whether static or dynamic. For each type of treatment regime, we provide posterior sampling algorithms, which are grounded in the Bayesian additive regression trees framework. We have conducted simulation studies to illustrate the empirical performance of our proposed Bayesian g-formula estimators, and to compare them with existing parametric estimators. We further demonstrate the practical utility of our methods in real-world scenarios using data from the Yale New Haven Health System's electronic health records."}, "https://arxiv.org/abs/2402.02329": {"title": "Leveraging Local Distributions in Mendelian Randomization: Uncertain Opinions are Invalid", "link": "https://arxiv.org/abs/2402.02329", "description": "Mendelian randomization (MR) considers using genetic variants as instrumental variables (IVs) to infer causal effects in observational studies. However, the validity of causal inference in MR can be compromised when the IVs are potentially invalid. 
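The one-inflated zero-truncated construction in the OIZTNB abstract above amounts to mixing a point mass at one with a zero-truncated negative binomial. A minimal Python negative log-likelihood is sketched below, with an illustrative size/probability/weight parameterization that is not the oneinfl package's.

    import numpy as np
    from scipy.stats import nbinom
    from scipy.optimize import minimize

    def oiztnb_negloglik(params, y):
        # params are unconstrained; map to size n > 0, prob p in (0,1), weight w in (0,1).
        n = np.exp(params[0])
        p = 1.0 / (1.0 + np.exp(-params[1]))
        w = 1.0 / (1.0 + np.exp(-params[2]))
        pz = nbinom.pmf(y, n, p) / (1.0 - nbinom.pmf(0, n, p))      # zero-truncated pmf
        lik = np.where(y == 1, w + (1.0 - w) * pz, (1.0 - w) * pz)  # inflate the ones
        return -np.sum(np.log(lik))

    # Example: fit = minimize(oiztnb_negloglik, x0=np.zeros(3), args=(y,), method="Nelder-Mead")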
In this work, we propose a new method, MR-Local, to infer the causal effect in the presence of possibly invalid IVs. By leveraging the distribution of ratio estimates around the true causal effect, MR-Local selects the cluster of ratio estimates with the least uncertainty and performs causal inference within it. We establish the asymptotic normality of our estimator in the two-sample summary-data setting under either the plurality rule or the balanced pleiotropy assumption. Extensive simulations and analyses of real datasets demonstrate the reliability of our approach."}, "https://arxiv.org/abs/2402.02482": {"title": "Global bank network connectedness revisited: What is common, idiosyncratic and when?", "link": "https://arxiv.org/abs/2402.02482", "description": "We revisit the problem of estimating high-dimensional global bank network connectedness. Instead of directly regularizing the high-dimensional vector of realized volatilities as in Demirer et al. (2018), we estimate a dynamic factor model with sparse VAR idiosyncratic components. This allows us to disentangle: (I) the part of system-wide connectedness (SWC) due to the common component shocks (what we call the \"banking market\"), and (II) the part due to the idiosyncratic shocks (the single banks). We employ both the original dataset as in Demirer et al. (2018) (daily data, 2003-2013) and a more recent vintage (2014-2023). For both, we compute SWC due to (I), (II), (I+II) and provide bootstrap confidence bands. In accordance with the literature, we find SWC to spike during global crises. However, our method minimizes the risk of SWC underestimation in high-dimensional datasets where episodes of systemic risk can be both pervasive and idiosyncratic. In fact, we are able to disentangle how in normal times $\approx$60-80% of SWC is due to idiosyncratic variation and only $\approx$20-40% to market variation. However, in crisis periods such as the 2008 financial crisis and the Covid19 outbreak in 2019, the situation is completely reversed: SWC is comparatively more driven by a market dynamic and less by an idiosyncratic one."}, "https://arxiv.org/abs/2402.02535": {"title": "Data-driven Policy Learning for a Continuous Treatment", "link": "https://arxiv.org/abs/2402.02535", "description": "This paper studies policy learning under the condition of unconfoundedness with a continuous treatment variable. Our research begins by employing kernel-based inverse propensity-weighted (IPW) methods to estimate policy welfare. We aim to approximate the optimal policy within a global policy class characterized by infinite Vapnik-Chervonenkis (VC) dimension. This is achieved through the utilization of a sequence of sieve policy classes, each with finite VC dimension. Preliminary analysis reveals that welfare regret comprises three components: global welfare deficiency, variance, and bias. This leads to the necessity of simultaneously selecting the optimal bandwidth for estimation and the optimal policy class for welfare approximation. To tackle this challenge, we introduce a semi-data-driven strategy that employs penalization techniques. This approach yields oracle inequalities that adeptly balance the three components of welfare regret without prior knowledge of the welfare deficiency. By utilizing precise maximal and concentration inequalities, we derive sharper regret bounds than those currently available in the literature. 
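The kernel-based IPW welfare estimate that the continuous-treatment policy-learning abstract above starts from can be written in a few lines. The Python sketch below assumes the generalized propensity density gps(t, x) and the candidate policy are supplied as callables by the user, and uses a Gaussian kernel as one possible choice.

    import numpy as np

    def kernel_ipw_welfare(y, t, x, policy, gps, h=0.5):
        # Smooth the event {observed dose = policy dose} with a kernel of bandwidth h,
        # and reweight by the conditional density of the treatment given covariates.
        u = (t - policy(x)) / h
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        return np.mean(k * y / (h * gps(t, x)))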
In instances where the propensity score is unknown, we adopt the doubly robust (DR) moment condition tailored to the continuous treatment setting. In alignment with the binary-treatment case, the DR welfare regret closely parallels the IPW welfare regret, given the fast convergence of nuisance estimators."}, "https://arxiv.org/abs/2402.02664": {"title": "Statistical Inference for Generalized Integer Autoregressive Processes", "link": "https://arxiv.org/abs/2402.02664", "description": "A popular and flexible time series model for counts is the generalized integer autoregressive process of order $p$, GINAR($p$). These Markov processes are defined using thinning operators evaluated on past values of the process along with a discretely-valued innovation process. This class includes the commonly used INAR($p$) process, defined with binomial thinning and Poisson innovations. GINAR processes can be used in a variety of settings, including modeling time series with low counts, and allow for more general mean-variance relationships, capturing both over- or under-dispersion. While there are many thinning operators and innovation processes given in the literature, less focus has been spent on comparing statistical inference and forecasting procedures over different choices of GINAR process. We provide an extensive study of exact and approximate inference and forecasting methods that can be applied to a wide class of GINAR($p$) processes with general thinning and innovation parameters. We discuss the challenges of exact estimation when $p$ is larger. We summarize and extend asymptotic results for estimators of process parameters, and present simulations to compare small sample performance, highlighting how different methods compare. We illustrate this methodology by fitting GINAR processes to a disease surveillance series."}, "https://arxiv.org/abs/2402.02666": {"title": "Using child-woman ratios to infer demographic rates in historical populations with limited data", "link": "https://arxiv.org/abs/2402.02666", "description": "Data on historical populations often extends no further than numbers of people by broad age-sex group, with nothing on numbers of births or deaths. Demographers studying these populations have experimented with methods that use the data on numbers of people to infer birth and death rates. These methods have, however, received little attention since they were first developed in the 1960s. We revisit the problem of inferring demographic rates from population structure, spelling out the assumptions needed, and specialising the methods to the case where only child-woman ratios are available. We apply the methods to the case of Maori populations in nineteenth-century Aotearoa New Zealand. We find that, in this particular case, the methods reveal as much about the nature of the data as they do about historical demographic conditions."}, "https://arxiv.org/abs/2402.02672": {"title": "Estimation of conditional average treatment effects on distributed data: A privacy-preserving approach", "link": "https://arxiv.org/abs/2402.02672", "description": "Estimation of conditional average treatment effects (CATEs) is an important topic in various fields such as medical and social sciences. CATEs can be estimated with high accuracy if distributed data across multiple parties can be centralized. However, it is difficult to aggregate such data if they contain privacy information. 
To address this issue, we proposed data collaboration double machine learning (DC-DML), a method that can estimate CATE models with privacy preservation of distributed data, and evaluated the method through numerical experiments. Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric or non-parametric CATE models enable estimation and testing that is more robust to model mis-specification than parametric models. However, to our knowledge, no communication-efficient method has been proposed for estimating and testing semi-parametric or non-parametric CATE models on distributed data. Second, our method enables collaborative estimation between different parties as well as multiple time points because the dimensionality-reduced intermediate representations can be accumulated. Third, our method performed as well or better than other methods in evaluation experiments using synthetic, semi-synthetic and real-world datasets."}, "https://arxiv.org/abs/2402.02684": {"title": "Efficient estimation of subgroup treatment effects using multi-source data", "link": "https://arxiv.org/abs/2402.02684", "description": "Investigators often use multi-source data (e.g., multi-center trials, meta-analyses of randomized trials, pooled analyses of observational cohorts) to learn about the effects of interventions in subgroups of some well-defined target population. Such a target population can correspond to one of the data sources of the multi-source data or an external population in which the treatment and outcome information may not be available. We develop and evaluate methods for using multi-source data to estimate subgroup potential outcome means and treatment effects in a target population. We consider identifiability conditions and propose doubly robust estimators that, under mild conditions, are non-parametrically efficient and allow for nuisance functions to be estimated using flexible data-adaptive methods (e.g., machine learning techniques). We also show how to construct confidence intervals and simultaneous confidence bands for the estimated subgroup treatment effects. We examine the properties of the proposed estimators in simulation studies and compare performance against alternative estimators. We also conclude that our methods work well when the sample size of the target population is much larger than the sample size of the multi-source data. We illustrate the proposed methods in a meta-analysis of randomized trials for schizophrenia."}, "https://arxiv.org/abs/2402.02702": {"title": "Causal inference under transportability assumptions for conditional relative effect measures", "link": "https://arxiv.org/abs/2402.02702", "description": "When extending inferences from a randomized trial to a new target population, an assumption of transportability of difference effect measures (e.g., conditional average treatment effects) -- or even stronger assumptions of transportability in expectation or distribution of potential outcomes -- is invoked to identify the marginal causal mean difference in the target population. However, many clinical investigators believe that relative effect measures conditional on covariates, such as conditional risk ratios and mean ratios, are more likely to be ``transportable'' across populations compared with difference effect measures. 
Here, we examine the identification and estimation of the marginal counterfactual mean difference and ratio under a transportability assumption for conditional relative effect measures. We obtain identification results for two scenarios that often arise in practice when individuals in the target population (1) only have access to the control treatment, or (2) have access to the control and other treatments but not necessarily the experimental treatment evaluated in the trial. We then propose multiply robust and nonparametric efficient estimators that allow for the use of data-adaptive methods (e.g., machine learning techniques) to model the nuisance parameters. We examine the performance of the methods in simulation studies and illustrate their use with data from two trials of paliperidone for patients with schizophrenia. We conclude that the proposed methods are attractive when background knowledge suggests that the transportability assumption for conditional relative effect measures is more plausible than alternative assumptions."}, "https://arxiv.org/abs/2402.02867": {"title": "A new robust approach for the polytomous logistic regression model based on R\\'enyi's pseudodistances", "link": "https://arxiv.org/abs/2402.02867", "description": "This paper presents a robust alternative to the Maximum Likelihood Estimator (MLE) for the Polytomous Logistic Regression Model (PLRM), known as the family of minimum R\\`enyi Pseudodistance (RP) estimators. The proposed minimum RP estimators are parametrized by a tuning parameter $\\alpha\\geq0$, and include the MLE as a special case when $\\alpha=0$. These estimators, along with a family of RP-based Wald-type tests, are shown to exhibit superior performance in the presence of misclassification errors. The paper includes an extensive simulation study and a real data example to illustrate the robustness of these proposed statistics."}, "https://arxiv.org/abs/2402.03004": {"title": "Construction and evaluation of optimal diagnostic tests with application to hepatocellular carcinoma diagnosis", "link": "https://arxiv.org/abs/2402.03004", "description": "Accurate diagnostic tests are crucial to ensure effective treatment, screening, and surveillance of diseases. However, the limited accuracy of individual biomarkers often hinders comprehensive screening. The heterogeneity of many diseases, particularly cancer, calls for the use of several biomarkers together into a composite diagnostic test. In this paper, we present a novel multivariate model that optimally combines multiple biomarkers using the likelihood ratio function. The model's parameters directly translate into computationally simple diagnostic accuracy measures. Additionally, our method allows for reliable predictions even in scenarios where specific biomarker measurements are unavailable and can guide the selection of biomarker combinations under resource constraints. We conduct simulation studies to compare the performance to popular classification and discriminant analysis methods. We utilize the approach to construct an optimal diagnostic test for hepatocellular carcinoma, a cancer type known for the absence of a single ideal marker. 
An accompanying R implementation is made available for reproducing all results."}, "https://arxiv.org/abs/2402.03035": {"title": "An optimality property of the Bayes-Kelly algorithm", "link": "https://arxiv.org/abs/2402.03035", "description": "This note states a simple property of optimality of the Bayes-Kelly algorithm for conformal testing and poses a related open problem."}, "https://arxiv.org/abs/2402.03192": {"title": "Multiple testing using uniform filtering of ordered p-values", "link": "https://arxiv.org/abs/2402.03192", "description": "We investigate the multiplicity model with m values of some test statistic independently drawn from a mixture of no effect (null) and positive effect (alternative), where we seek to identify the alternative test results with a controlled error rate. We are interested in the case where the alternatives are rare. A number of multiple testing procedures filter the set of ordered p-values in order to eliminate the nulls. Such an approach can only work if the p-values originating from the alternatives form one or several identifiable clusters. The Benjamini and Hochberg (BH) method, for example, assumes that this cluster occurs in a small interval $(0,\\Delta)$ and filters out all or most of the ordered p-values $p_{(r)}$ above a linear threshold $s \\times r$. In repeated applications this filter controls the false discovery rate via the slope s. We propose a new adaptive filter that deletes the p-values from regions of uniform distribution. In cases where a single cluster remains, the p-values in an interval are declared alternatives, with the mid-point and the length of the interval chosen by controlling the data-dependent FDR at a desired level."}, "https://arxiv.org/abs/2402.03231": {"title": "Improved prediction of future user activity in online A/B testing", "link": "https://arxiv.org/abs/2402.03231", "description": "In online randomized experiments or A/B tests, accurate predictions of participant inclusion rates are of paramount importance. These predictions not only guide experimenters in optimizing the experiment's duration but also enhance the precision of treatment effect estimates. In this paper we present a novel, straightforward, and scalable Bayesian nonparametric approach for predicting the rate at which individuals will be exposed to interventions within the realm of online A/B testing. Our approach stands out by offering dual prediction capabilities: it forecasts both the quantity of new customers expected in future time windows and, unlike available alternative methods, the number of times they will be observed. We derive closed-form expressions for the posterior distributions of the quantities needed to form predictions about future user activity, thereby bypassing the need for numerical algorithms such as Markov chain Monte Carlo. After a comprehensive exposition of our model, we test its performance on experiments on real and simulated data, where we show its superior performance with respect to existing alternatives in the literature."}, "https://arxiv.org/abs/2402.01785": {"title": "DoubleMLDeep: Estimation of Causal Effects with Multimodal Data", "link": "https://arxiv.org/abs/2402.01785", "description": "This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. 
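For context on the partially linear double machine learning target mentioned at the end of the DoubleMLDeep abstract above, the following Python sketch shows the standard cross-fitted residual-on-residual estimator, with generic gradient-boosting nuisance models standing in for the paper's multimodal networks.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import KFold

    def dml_plr(y, d, X, n_splits=5):
        # Cross-fitted residuals for the outcome y and the treatment d.
        ry = np.zeros(len(y))
        rd = np.zeros(len(d))
        for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
            ry[test] = y[test] - GradientBoostingRegressor().fit(X[train], y[train]).predict(X[test])
            rd[test] = d[test] - GradientBoostingRegressor().fit(X[train], d[train]).predict(X[test])
        return np.sum(rd * ry) / np.sum(rd * rd)   # estimate of the treatment coefficient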
An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data."}, "https://arxiv.org/abs/2402.01897": {"title": "Comparative analysis of two new wind speed T-X models using Weibull and log-logistic distributions for wind energy potential estimation in Tabriz, Iran", "link": "https://arxiv.org/abs/2402.01897", "description": "To assess the potential of wind energy in a specific area, statistical distribution functions are commonly used to characterize wind speed distributions. The selection of an appropriate wind speed model is crucial in minimizing wind power estimation errors. In this paper, we propose a novel method that utilizes the T-X family of continuous distributions to generate two new wind speed distribution functions, which have not been previously explored in the wind energy literature. These two statistical distributions, namely the Weibull-three parameters-log-logistic (WE3-LL3) and log-logistic-three parameters-Weibull (LL3-WE3) are compared with four other probability density functions (PDFs) to analyze wind speed data collected in Tabriz, Iran. The parameters of the considered distributions are estimated using maximum likelihood estimators with the Nelder-Mead numerical method. The suitability of the proposed distributions for the actual wind speed data is evaluated based on criteria such as root mean square errors, coefficient of determination, Kolmogorov-Smirnov test, and chi-square test. The analysis results indicate that the LL3-WE3 distribution demonstrates generally superior performance in capturing seasonal and annual wind speed data, except for summer, while the WE3-LL3 distribution exhibits the best fit for summer. It is also observed that both the LL3-WE3 and WE3-LL3 distributions effectively describe wind speed data in terms of the wind power density error criterion. Overall, the LL3-WE3 and WE3-LL3 models offer a highly accurate fit compared to other PDFs for estimating wind energy potential."}, "https://arxiv.org/abs/2402.01972": {"title": "Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts", "link": "https://arxiv.org/abs/2402.01972", "description": "We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that (i) their practical applicability can be hindered by loss function non-convexity; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. 
To avoid these drawbacks, EP-learner constructs an efficient plug-in estimator of the population risk function for the causal contrast, thereby inheriting the stability and robustness properties of plug-in estimation strategies like T-learning. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we illustrate that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our R package hte3."}, "https://arxiv.org/abs/2402.02111": {"title": "Accelerating Look-ahead in Bayesian Optimization: Multilevel Monte Carlo is All you Need", "link": "https://arxiv.org/abs/2402.02111", "description": "We leverage multilevel Monte Carlo (MLMC) to improve the performance of multi-step look-ahead Bayesian optimization (BO) methods that involve nested expectations and maximizations. The complexity rate of naive Monte Carlo degrades for nested operations, whereas MLMC is capable of achieving the canonical Monte Carlo convergence rate for this type of problem, independently of dimension and without any smoothness assumptions. Our theoretical study focuses on the approximation improvements for one- and two-step look-ahead acquisition functions, but, as we discuss, the approach is generalizable in various ways, including beyond the context of BO. Findings are verified numerically and the benefits of MLMC for BO are illustrated on several benchmark examples. Code is available here https://github.com/Shangda-Yang/MLMCBO."}, "https://arxiv.org/abs/2402.02190": {"title": "Continuous Tensor Relaxation for Finding Diverse Solutions in Combinatorial Optimization Problems", "link": "https://arxiv.org/abs/2402.02190", "description": "Finding the best solution is the most common objective in combinatorial optimization (CO) problems. However, a single solution may not be suitable in practical scenarios, as the objective functions and constraints are only approximations of original real-world situations. To tackle this, finding (i) \"heterogeneous solutions\", diverse solutions with distinct characteristics, and (ii) \"penalty-diversified solutions\", variations in constraint severity, are natural directions. This strategy provides the flexibility to select a suitable solution during post-processing. However, discovering these diverse solutions is more challenging than identifying a single solution. To overcome this challenge, this study introduces Continual Tensor Relaxation Annealing (CTRA) for unsupervised-learning-based CO solvers. CTRA addresses various problems simultaneously by extending the continual relaxation approach, which transforms discrete decision variables into continual tensors. This method finds heterogeneous and penalty-diversified solutions through mutual interactions, where the choice of one solution affects the other choices. Numerical experiments show that CTRA enables UL-based solvers to find heterogeneous and penalty-diversified solutions much faster than existing UL-based solvers. 
Moreover, these experiments reveal that CTRA enhances the exploration ability."}, "https://arxiv.org/abs/2402.02303": {"title": "Bootstrapping Fisher Market Equilibrium and First-Price Pacing Equilibrium", "link": "https://arxiv.org/abs/2402.02303", "description": "The linear Fisher market (LFM) is a basic equilibrium model from economics, which also has applications in fair and efficient resource allocation. First-price pacing equilibrium (FPPE) is a model capturing budget-management mechanisms in first-price auctions. In certain practical settings such as advertising auctions, there is an interest in performing statistical inference over these models. A popular methodology for general statistical inference is the bootstrap procedure. Yet, for LFM and FPPE there is no existing theory for the valid application of bootstrap procedures. In this paper, we introduce and devise several statistically valid bootstrap inference procedures for LFM and FPPE. The most challenging part is to bootstrap general FPPE, which reduces to bootstrapping constrained M-estimators, a largely unexplored problem. We devise a bootstrap procedure for FPPE under mild degeneracy conditions by using the powerful tool of epi-convergence theory. Experiments with synthetic and semi-real data verify our theory."}, "https://arxiv.org/abs/2402.02459": {"title": "On Minimum Trace Factor Analysis -- An Old Song Sung to a New Tune", "link": "https://arxiv.org/abs/2402.02459", "description": "Dimensionality reduction methods, such as principal component analysis (PCA) and factor analysis, are central to many problems in data science. There are, however, serious and well-understood challenges to finding robust low dimensional approximations for data with significant heteroskedastic noise. This paper introduces a relaxed version of Minimum Trace Factor Analysis (MTFA), a convex optimization method with roots dating back to the work of Ledermann in 1940. This relaxation is particularly effective at not overfitting to heteroskedastic perturbations and addresses the commonly cited Heywood cases in factor analysis and the recently identified \"curse of ill-conditioning\" for existing spectral methods. We provide theoretical guarantees on the accuracy of the resulting low rank subspace and the convergence rate of the proposed algorithm to compute that matrix. We develop a number of interesting connections to existing methods, including HeteroPCA, Lasso, and Soft-Impute, to fill an important gap in the already large literature on low rank matrix estimation. Numerical experiments benchmark our results against several recent proposals for dealing with heteroskedastic noise."}, "https://arxiv.org/abs/2402.02489": {"title": "Bivariate change point detection in movement direction and speed", "link": "https://arxiv.org/abs/2402.02489", "description": "Biological movement patterns can sometimes be quasi linear with abrupt changes in direction and speed, as in plastids in root cells investigated here. For the analysis of such changes we propose a new stochastic model for movement along linear structures. Maximum likelihood estimators are provided, and due to serial dependencies of increments, the classical MOSUM statistic is replaced by a moving kernel estimator. Convergence of the resulting difference process and strong consistency of the variance estimator are shown. 
We estimate the change points and propose a graphical technique to distinguish between change points in movement direction and speed."}, "https://arxiv.org/abs/2402.02773": {"title": "Series ridge regression for spatial data on $\\mathbb{R}^d$", "link": "https://arxiv.org/abs/2402.02773", "description": "This paper develops a general asymptotic theory of series ridge estimators for spatial data observed at irregularly spaced locations in a sampling region $R_n \\subset \\mathbb{R}^d$. We adopt a stochastic sampling design that can generate irregularly spaced sampling sites in a flexible manner including both pure increasing and mixed increasing domain frameworks. Specifically, we consider a spatial trend regression model and a nonparametric regression model with spatially dependent covariates. For these models, we investigate the $L^2$-penalized series estimation of the trend and regression functions and establish (i) uniform and $L^2$ convergence rates and (ii) multivariate central limit theorems for general series estimators, (iii) optimal uniform and $L^2$ convergence rates for spline and wavelet series estimators, and (iv) show that our dependence structure conditions on the underlying spatial processes cover a wide class of random fields including L\\'evy-driven continuous autoregressive and moving average random fields."}, "https://arxiv.org/abs/2402.02859": {"title": "Importance sampling for online variational learning", "link": "https://arxiv.org/abs/2402.02859", "description": "This article addresses online variational estimation in state-space models. We focus on learning the smoothing distribution, i.e. the joint distribution of the latent states given the observations, using a variational approach together with Monte Carlo importance sampling. We propose an efficient algorithm for computing the gradient of the evidence lower bound (ELBO) in the context of streaming data, where observations arrive sequentially. Our contributions include a computationally efficient online ELBO estimator, demonstrated performance in offline and true online settings, and adaptability for computing general expectations under joint smoothing distributions."}, "https://arxiv.org/abs/2402.02898": {"title": "Bayesian Federated Inference for regression models with heterogeneous multi-center populations", "link": "https://arxiv.org/abs/2402.02898", "description": "To estimate accurately the parameters of a regression model, the sample size must be large enough relative to the number of possible predictors for the model. In practice, sufficient data is often lacking, which can lead to overfitting of the model and, as a consequence, unreliable predictions of the outcome of new patients. Pooling data from different data sets collected in different (medical) centers would alleviate this problem, but is often not feasible due to privacy regulation or logistic problems. An alternative route would be to analyze the local data in the centers separately and combine the statistical inference results with the Bayesian Federated Inference (BFI) methodology. The aim of this approach is to compute from the inference results in separate centers what would have been found if the statistical analysis was performed on the combined data. We explain the methodology under homogeneity and heterogeneity across the populations in the separate centers, and give real life examples for better understanding. Excellent performance of the proposed methodology is shown. 
An R-package to do all the calculations has been developed and is illustrated in this paper. The mathematical details are given in the Appendix."}, "https://arxiv.org/abs/2402.03197": {"title": "Heavy-tailed $p$-value combinations from the perspective of extreme value theory", "link": "https://arxiv.org/abs/2402.03197", "description": "Handling multiplicity without losing much power has been a persistent challenge in various fields that often face the necessity of managing numerous statistical tests simultaneously. Recently, $p$-value combination methods based on heavy-tailed distributions, such as a Cauchy distribution, have received much attention for their ability to handle multiplicity without the prescribed knowledge of the dependence structure. This paper delves into these types of $p$-value combinations through the lens of extreme value theory. Distributions with regularly varying tails, a subclass of heavy-tailed distributions, are found to be useful in constructing such $p$-value combinations. Three $p$-value combination statistics (sum, max cumulative sum, and max) are introduced, whose left tail probabilities are shown to be approximately uniform when the global null is true. The primary objective of this paper is to bridge the gap between current developments in $p$-value combination methods and the literature on extreme value theory, while also offering guidance on selecting the calibrator and its associated parameters."}, "https://arxiv.org/abs/2402.03203": {"title": "Right-censored models by the expectile method", "link": "https://arxiv.org/abs/2402.03203", "description": "Based on the expectile loss function and the adaptive LASSO penalty, the paper proposes and studies estimation methods for the accelerated failure time (AFT) model. In this approach, we need to estimate the survival function of the censoring variable by the Kaplan-Meier estimator. The AFT model parameters are first estimated by the expectile method and afterwards, when the number of explanatory variables can be large, by the adaptive LASSO expectile method, which directly carries out the automatic selection of variables. We also obtain the convergence rate and asymptotic normality for the two estimators, while showing the sparsity property for the censored adaptive LASSO expectile estimator. A numerical study using Monte Carlo simulations confirms the theoretical results and demonstrates the competitive performance of the two proposed estimators. The usefulness of these estimators is illustrated by applying them to three survival data sets."}, "https://arxiv.org/abs/2402.03229": {"title": "Disentangling high order effects in the transfer entropy", "link": "https://arxiv.org/abs/2402.03229", "description": "Transfer Entropy (TE), the main approach to determine the directed information flow within a network system, can be biased (downward or upward), both in the pairwise and conditioned calculation, due to high order dependencies among the two dynamic processes under consideration and the remaining processes in the system which are used in conditioning. Here we propose a novel approach which, instead of conditioning the TE on all the network processes other than the driver and target, as in its fully conditioned version, or not conditioning at all, as in the pairwise approach, searches both for the multiplet of variables leading to the maximum information flow and for those minimizing it, providing a decomposition of the TE into unique, redundant and synergistic atoms. 
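A concrete member of the heavy-tailed family discussed in the $p$-value combination abstract above is the familiar Cauchy combination statistic; a minimal Python version is given below (the abstract's sum, max-cumulative-sum, and max statistics are not reproduced, and equal weights are only a default).

    import numpy as np
    from scipy.stats import cauchy

    def cauchy_combination(pvals, weights=None):
        p = np.asarray(pvals, dtype=float)
        w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights, dtype=float)
        stat = np.sum(w * np.tan((0.5 - p) * np.pi))  # heavy-tailed transform of each p-value
        return float(cauchy.sf(stat))                 # combined p-value under the global null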
Our approach allows us to quantify the relative importance of high order effects, with respect to pure two-body effects, in the information transfer between two processes, and to highlight those processes that accompany the driver in building these high order effects. We report an application of the proposed approach in climatology, analyzing data from El Ni\~{n}o and the Southern Oscillation."}, "https://arxiv.org/abs/2012.14708": {"title": "Adaptive Estimation for Non-stationary Factor Models And A Test for Static Factor Loadings", "link": "https://arxiv.org/abs/2012.14708", "description": "This paper considers the estimation and testing of a class of locally stationary time series factor models with evolutionary temporal dynamics. In particular, the entries and the dimension of the factor loading matrix are allowed to vary with time while the factors and the idiosyncratic noise components are locally stationary. We propose an adaptive sieve estimator for the span of the varying loading matrix and the locally stationary factor processes. A uniformly consistent estimator of the effective number of factors is investigated via eigenanalysis of a non-negative definite time-varying matrix. A possibly high-dimensional bootstrap-assisted test for the hypothesis of static factor loadings is proposed by comparing the kernels of the covariance matrices of the whole time series with their local counterparts. We examine our estimator and test via simulation studies and real data analysis. Finally, all our results hold under the following popular but distinct assumptions: (a) white noise idiosyncratic errors with either fixed or diverging dimension, and (b) correlated idiosyncratic errors with diverging dimension."}, "https://arxiv.org/abs/2106.16124": {"title": "Robust inference for geographic regression discontinuity designs: assessing the impact of police precincts", "link": "https://arxiv.org/abs/2106.16124", "description": "We study variation in policing outcomes attributable to differential policing practices in New York City (NYC) using geographic regression discontinuity designs (GeoRDDs). By focusing on small geographic windows near police precinct boundaries, we can estimate local average treatment effects of police precincts on arrest rates. We propose estimands and develop estimators for the GeoRDD when the data come from a spatial point process. Additionally, standard GeoRDDs rely on continuity assumptions of the potential outcome surface or a local randomization assumption within a window around the boundary. These assumptions, however, can easily be violated in realistic applications. We develop a novel and robust approach to testing whether there are differences in policing outcomes that are caused by differences in police precincts across NYC. Importantly, this approach is applicable to standard regression discontinuity designs with both numeric and point process data. This approach is robust to violations of the traditional assumptions and is valid under weaker assumptions. We use a unique form of resampling to provide a valid estimate of our test statistic's null distribution even under violations of standard assumptions. 
This procedure gives substantially different results in the analysis of NYC arrest rates than those that rely on standard assumptions."}, "https://arxiv.org/abs/2110.12510": {"title": "Post-Regularization Confidence Bands for Ordinary Differential Equations", "link": "https://arxiv.org/abs/2110.12510", "description": "Ordinary differential equation (ODE) is an important tool to study the dynamics of a system of biological and physical processes. A central question in ODE modeling is to infer the significance of individual regulatory effect of one signal variable on another. However, building confidence band for ODE with unknown regulatory relations is challenging, and it remains largely an open question. In this article, we construct post-regularization confidence band for individual regulatory function in ODE with unknown functionals and noisy data observations. Our proposal is the first of its kind, and is built on two novel ingredients. The first is a new localized kernel learning approach that combines reproducing kernel learning with local Taylor approximation, and the second is a new de-biasing method that tackles infinite-dimensional functionals and additional measurement errors. We show that the constructed confidence band has the desired asymptotic coverage probability, and the recovered regulatory network approaches the truth with probability tending to one. We establish the theoretical properties when the number of variables in the system can be either smaller or larger than the number of sampling time points, and we study the regime-switching phenomenon. We demonstrate the efficacy of the proposed method through both simulations and illustrations with two data applications."}, "https://arxiv.org/abs/2111.02299": {"title": "Cross-validated risk scores adaptive enrichment (CADEN) design", "link": "https://arxiv.org/abs/2111.02299", "description": "We propose a Cross-validated ADaptive ENrichment design (CADEN) in which a trial population is enriched with a subpopulation of patients who are predicted to benefit from the treatment more than an average patient (the sensitive group). This subpopulation is found using a risk score constructed from the baseline (potentially high-dimensional) information about patients. The design incorporates an early stopping rule for futility. Simulation studies are used to assess the properties of CADEN against the original (non-enrichment) cross-validated risk scores (CVRS) design that constructs a risk score at the end of the trial. We show that when there exists a sensitive group of patients, CADEN achieves a higher power and a reduction in the expected sample size, in comparison to the CVRS design. We illustrate the application of the design in two real clinical trials. We conclude that the new design offers improved statistical efficiency in comparison to the existing non-enrichment method, as well as increased benefit to patients. The method has been implemented in an R package caden."}, "https://arxiv.org/abs/2208.05553": {"title": "Exploiting Neighborhood Interference with Low Order Interactions under Unit Randomized Design", "link": "https://arxiv.org/abs/2208.05553", "description": "Network interference, where the outcome of an individual is affected by the treatment assignment of those in their social network, is pervasive in real-world settings. However, it poses a challenge to estimating causal effects. 
We consider the task of estimating the total treatment effect (TTE), or the difference between the average outcomes of the population when everyone is treated versus when no one is, under network interference. Under a Bernoulli randomized design, we provide an unbiased estimator for the TTE when network interference effects are constrained to low order interactions among neighbors of an individual. We make no assumptions on the graph other than bounded degree, allowing for well-connected networks that may not be easily clustered. We derive a bound on the variance of our estimator and show in simulated experiments that it performs well compared with standard estimators for the TTE. We also derive a minimax lower bound on the mean squared error of our estimator which suggests that the difficulty of estimation can be characterized by the degree of interactions in the potential outcomes model. We also prove that our estimator is asymptotically normal under boundedness conditions on the network degree and potential outcomes model. Central to our contribution is a new framework for balancing model flexibility and statistical complexity as captured by this low order interactions structure."}, "https://arxiv.org/abs/2307.02236": {"title": "D-optimal Subsampling Design for Massive Data Linear Regression", "link": "https://arxiv.org/abs/2307.02236", "description": "Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given percentage of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The thus obtained subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that differs from the D-optimal design. We present a simulation study, comparing both subsampling schemes with the IBOSS method in the case of a fixed size of the subsample."}, "https://arxiv.org/abs/2307.09700": {"title": "The Connection Between R-Learning and Inverse-Variance Weighting for Estimation of Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2307.09700", "description": "Many methods for estimating conditional average treatment effects (CATEs) can be expressed as weighted pseudo-outcome regressions (PORs). Previous comparisons of POR techniques have paid careful attention to the choice of pseudo-outcome transformation. However, we argue that the dominant driver of performance is actually the choice of weights. For example, we point out that R-Learning implicitly performs a POR with inverse-variance weights (IVWs). In the CATE setting, IVWs mitigate the instability associated with inverse-propensity weights, and lead to convenient simplifications of bias terms. 
We demonstrate the superior performance of IVWs in simulations, and derive convergence rates for IVWs that are, to our knowledge, the fastest yet shown without assuming knowledge of the covariate distribution."}, "https://arxiv.org/abs/2311.02467": {"title": "Individualized Policy Evaluation and Learning under Clustered Network Interference", "link": "https://arxiv.org/abs/2311.02467", "description": "While there now exists a large literature on policy evaluation and learning, much of prior work assumes that the treatment assignment of one unit does not affect the outcome of another unit. Unfortunately, ignoring interference may lead to biased policy evaluation and ineffective learned policies. For example, treating influential individuals who have many friends can generate positive spillover effects, thereby improving the overall performance of an individualized treatment rule (ITR). We consider the problem of evaluating and learning an optimal ITR under clustered network interference (also known as partial interference) where clusters of units are sampled from a population and units may influence one another within each cluster. Unlike previous methods that impose strong restrictions on spillover effects, the proposed methodology only assumes a semiparametric structural model where each unit's outcome is an additive function of individual treatments within the cluster. Under this model, we propose an estimator that can be used to evaluate the empirical performance of an ITR. We show that this estimator is substantially more efficient than the standard inverse probability weighting estimator, which does not impose any assumption about spillover effects. We derive the finite-sample regret bound for a learned ITR, showing that the use of our efficient evaluation estimator leads to the improved performance of learned policies. Finally, we conduct simulation and empirical studies to illustrate the advantages of the proposed methodology."}, "https://arxiv.org/abs/2311.14220": {"title": "Assumption-lean and Data-adaptive Post-Prediction Inference", "link": "https://arxiv.org/abs/2311.14220", "description": "A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce an assumption-lean and data-adaptive Post-Prediction Inference (POP-Inf) procedure that allows valid and powerful inference based on ML-predicted outcomes. Its \"assumption-lean\" property guarantees reliable statistical inference without assumptions on the ML-prediction, for a wide range of statistical quantities. Its \"data-adaptive'\" feature guarantees an efficiency gain over existing post-prediction inference methods, regardless of the accuracy of ML-prediction. 
We demonstrate the superiority and applicability of our method through simulations and large-scale genomic data."}, "https://arxiv.org/abs/2401.07365": {"title": "Sequential permutation testing by betting", "link": "https://arxiv.org/abs/2401.07365", "description": "We develop an anytime-valid permutation test, where the dataset is fixed and the permutations are sampled sequentially one by one, with the objective of saving computational resources by sampling fewer permutations and stopping early. The core technical advance is the development of new test martingales (nonnegative martingales with initial value one) for testing exchangeability against a very particular alternative. These test martingales are constructed using new and simple betting strategies that smartly bet on the relative ranks of permuted test statistics. The betting strategies are guided by the derivation of a simple log-optimal betting strategy, and display excellent power in practice. In contrast to a well-known method by Besag and Clifford, our method yields a valid e-value or a p-value at any stopping time, and with particular stopping rules, it yields computational gains under both the null and the alternative without compromising power."}, "https://arxiv.org/abs/2401.08702": {"title": "Do We Really Even Need Data?", "link": "https://arxiv.org/abs/2401.08702", "description": "As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called ``inference with predicted data'' problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure."}, "https://arxiv.org/abs/2105.09254": {"title": "Multiply Robust Causal Mediation Analysis with Continuous Treatments", "link": "https://arxiv.org/abs/2105.09254", "description": "In many applications, researchers are interested in the direct and indirect causal effects of a treatment or exposure on an outcome of interest. Mediation analysis offers a rigorous framework for identifying and estimating these causal effects. For binary treatments, efficient estimators for the direct and indirect effects are presented in Tchetgen Tchetgen and Shpitser (2012) based on the influence function of the parameter of interest. These estimators possess desirable properties, such as multiple-robustness and asymptotic normality, while allowing for slower than root-n rates of convergence for the nuisance parameters. However, in settings involving continuous treatments, these influence function-based estimators are not readily applicable without making strong parametric assumptions. 
In this work, utilizing a kernel-smoothing approach, we propose an estimator suitable for settings with continuous treatments inspired by the influence function-based estimator of Tchetgen Tchetgen and Shpitser (2012). Our proposed approach employs cross-fitting, relaxing the smoothness requirements on the nuisance functions, and allowing them to be estimated at slower rates than the target parameter. Additionally, similar to influence function-based estimators, our proposed estimator is multiply robust and asymptotically normal, making it applicable for inference in settings where a parametric model cannot be assumed."}, "https://arxiv.org/abs/2207.12382": {"title": "On Confidence Sequences for Bounded Random Processes via Universal Gambling Strategies", "link": "https://arxiv.org/abs/2207.12382", "description": "This paper considers the problem of constructing a confidence sequence, which is a sequence of confidence intervals that hold uniformly over time, for estimating the mean of bounded real-valued random processes. This paper revisits the gambling-based approach established in the recent literature from a natural \\emph{two-horse race} perspective, and demonstrates new properties of the resulting algorithm induced by Cover (1991)'s universal portfolio. The main result of this paper is a new algorithm based on a mixture of lower bounds, which closely approximates the performance of Cover's universal portfolio with constant per-round time complexity. A higher-order generalization of a lower bound on a logarithmic function in (Fan et al., 2015), which is developed as a key technique for the proposed algorithm, may be of independent interest."}, "https://arxiv.org/abs/2305.11857": {"title": "Computing high-dimensional optimal transport by flow neural networks", "link": "https://arxiv.org/abs/2305.11857", "description": "Flow-based models are widely used in generative tasks, including normalizing flow, where a neural network transports from a data distribution $P$ to a normal distribution. This work develops a flow-based model that transports from $P$ to an arbitrary $Q$ where both distributions are only accessible via finite samples. We propose to learn the dynamic optimal transport between $P$ and $Q$ by training a flow neural network. The model is trained to optimally find an invertible transport map between $P$ and $Q$ by minimizing the transport cost. The trained optimal transport flow subsequently allows for performing many downstream tasks, including infinitesimal density ratio estimation (DRE) and distribution interpolation in the latent space for generative models. The effectiveness of the proposed model on high-dimensional data is demonstrated by strong empirical performance on high-dimensional DRE, OT baselines, and image-to-image translation."}, "https://arxiv.org/abs/2305.18435": {"title": "Statistically Efficient Bayesian Sequential Experiment Design via Reinforcement Learning with Cross-Entropy Estimators", "link": "https://arxiv.org/abs/2305.18435", "description": "Reinforcement learning can learn amortised design policies for designing sequences of experiments. However, current amortised methods rely on estimators of expected information gain (EIG) that require an exponential number of samples on the magnitude of the EIG to achieve an unbiased estimation. We propose the use of an alternative estimator based on the cross-entropy of the joint model distribution and a flexible proposal distribution. 
This proposal distribution approximates the true posterior of the model parameters given the experimental history and the design policy. Our method overcomes the exponential-sample complexity of previous approaches and provides more accurate estimates of high EIG values. More importantly, it allows learning of superior design policies, and is compatible with continuous and discrete design spaces, non-differentiable likelihoods and even implicit probabilistic models."}, "https://arxiv.org/abs/2309.16965": {"title": "Controlling Continuous Relaxation for Combinatorial Optimization", "link": "https://arxiv.org/abs/2309.16965", "description": "Motivated by developments in machine learning technologies, unsupervised learning (UL)-based solvers for combinatorial optimization (CO) problems have recently been proposed. These solvers train a neural network that outputs a solution by optimizing the CO objective directly. UL-based solvers have several advantages over traditional methods. However, various studies have shown that these solvers underperform compared to greedy algorithms for complex CO problems. In addition, these solvers employ a continuous relaxation strategy; thus, post-learning rounding from the continuous space back to the original discrete space is required, undermining the robustness of the results. To address these problems, we propose the continuous relaxation annealing (CRA) strategy. The CRA introduces a penalty term to control the continuity and discreteness of the relaxed variables and eliminate local optima. In addition, the CRA implements an annealing process for the penalty term that initially prioritizes continuous solutions and progressively transitions towards discrete solutions until the relaxed variables become nearly discrete, eliminating the artificial rounding. Experimental results demonstrate that the CRA significantly enhances the UL-based solvers, outperforming both existing UL-based solvers and greedy algorithms for complex CO problems."}, "https://arxiv.org/abs/2310.12424": {"title": "Optimal heteroskedasticity testing in nonparametric regression", "link": "https://arxiv.org/abs/2310.12424", "description": "Heteroskedasticity testing in nonparametric regression is a classic statistical problem with important practical applications, yet fundamental limits are unknown. Adopting a minimax perspective, this article considers the testing problem in the context of an $\\alpha$-H\\\"{o}lder mean and a $\\beta$-H\\\"{o}lder variance function. For $\\alpha > 0$ and $\\beta \\in (0, 1/2)$, the sharp minimax separation rate $n^{-4\\alpha} + n^{-4\\beta/(4\\beta+1)} + n^{-2\\beta}$ is established. To achieve the minimax separation rate, a kernel-based statistic using first-order squared differences is developed. Notably, the statistic estimates a proxy rather than a natural quadratic functional (the squared distance between the variance function and its best $L^2$ approximation by a constant) suggested in previous work.\n The setting where no smoothness is assumed on the variance function is also studied; the variance profile across the design points can be arbitrary. Despite the lack of structure, consistent testing turns out to still be possible by using the Gaussian character of the noise, and the minimax rate is shown to be $n^{-4\\alpha} + n^{-1/2}$. Exploiting noise information happens to be a fundamental necessity as consistent testing is impossible if nothing more than zero mean and unit variance is known about the noise distribution. 
Furthermore, in the setting where the variance function is $\\beta$-H\\\"{o}lder but heteroskedasticity is measured only with respect to the design points, the minimax separation rate is shown to be $n^{-4\\alpha} + n^{-\\left((1/2) \\vee (4\\beta/(4\\beta+1))\\right)}$ when the noise is Gaussian and $n^{-4\\alpha} + n^{-4\\beta/(4\\beta+1)} + n^{-2\\beta}$ when the noise distribution is unknown."}, "https://arxiv.org/abs/2311.12267": {"title": "Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity", "link": "https://arxiv.org/abs/2311.12267", "description": "We study causal representation learning, the task of recovering high-level latent variables and their causal relationships in the form of a causal graph from low-level observed data (such as text and images), assuming access to observations generated from multiple environments. Prior results on the identifiability of causal representations typically assume access to single-node interventions, which is rather unrealistic in practice, since the latent variables are unknown in the first place. In this work, we provide the first identifiability results based on data that stem from general environments. We show that for linear causal models, while the causal graph can be fully recovered, the latent variables are only identified up to the surrounded-node ambiguity (SNA) \\citep{varici2023score}. We provide a counterpart of our guarantee, showing that SNA is basically unavoidable in our setting. We also propose an algorithm, \\texttt{LiNGCReL}, which provably recovers the ground-truth model up to SNA, and we demonstrate its effectiveness via numerical experiments. Finally, we consider general non-parametric causal models and show that the same identification barrier holds when assuming access to groups of soft single-node interventions."}, "https://arxiv.org/abs/2401.08875": {"title": "DCRMTA: Unbiased Causal Representation for Multi-touch Attribution", "link": "https://arxiv.org/abs/2401.08875", "description": "Multi-touch attribution (MTA) currently plays a pivotal role in achieving a fair estimation of the contributions of each advertising touchpoint towards conversion behavior, deeply influencing budget allocation and advertising recommendation. Previous works attempted to eliminate the bias caused by user preferences to achieve the unbiased assumption of the conversion model. The multi-model collaboration method is not efficient, and the complete elimination of user influence also eliminates the causal effect of user features on conversion, resulting in limited performance of the conversion model. This paper redefines the causal effect of user features on conversions and proposes a novel end-to-end approach, Deep Causal Representation for MTA (DCRMTA). Our model focuses on extracting causal features between conversions and users while eliminating confounding variables. Furthermore, extensive experiments demonstrate DCRMTA's superior performance in conversion prediction across varying data distributions, while also effectively attributing value across different advertising channels."}, "https://arxiv.org/abs/2401.15519": {"title": "Large Deviation Analysis of Score-based Hypothesis Testing", "link": "https://arxiv.org/abs/2401.15519", "description": "Score-based statistical models play an important role in modern machine learning, statistics, and signal processing. For hypothesis testing, a score-based hypothesis test is proposed in \\cite{wu2022score}. 
We analyze the performance of this score-based hypothesis testing procedure and derive upper bounds on the probabilities of its Type I and II errors. We prove that the exponents of our error bounds are asymptotically (in the number of samples) tight for the case of simple null and alternative hypotheses. We calculate these error exponents explicitly in specific cases and provide numerical studies for various other scenarios of interest."}, "https://arxiv.org/abs/2402.03416": {"title": "A hyperbolastic type-I diffusion process: Parameter estimation by means of the firefly algorithm", "link": "https://arxiv.org/abs/2402.03416", "description": "A stochastic diffusion process, whose mean function is a hyperbolastic curve of type I, is presented. The main characteristics of the process are studied and the problem of maximum likelihood estimation for the parameters of the process is considered. To this end, the firefly metaheuristic optimization algorithm is applied after bounding the parametric space by a stagewise procedure. Some examples based on simulated sample paths and real data illustrate this development."}, "https://arxiv.org/abs/2402.03459": {"title": "Hybrid Smoothing for Anomaly Detection in Time Series", "link": "https://arxiv.org/abs/2402.03459", "description": "Many industrial and engineering processes monitored as time series have smooth trends that indicate normal behavior and occasionally anomalous patterns that can indicate a problem. This kind of behavior can be modeled by a smooth trend such as a spline or Gaussian process and a disruption based on a sparser representation. Our approach is to expand the process signal into two sets of basis functions: one set uses $L_2$ penalties on the coefficients and the other set uses $L_1$ penalties to control sparsity. From a frequentist perspective, this results in a hybrid smoother that combines cubic smoothing splines and the LASSO, and as a Bayesian hierarchical model (BHM), this is equivalent to priors giving a Gaussian process and a Laplace distribution for anomaly coefficients. For the hybrid smoother we propose two new ways of determining the penalty parameters that use effective degrees of freedom and contrast this with the BHM that uses loosely informative inverse gamma priors. Several reformulations are used to make sampling the BHM posterior more efficient, including some novel features in orthogonalizing and regularizing the model basis functions. This methodology is motivated by a substantive application, monitoring the water treatment process for the Denver Metropolitan area. We also test the methods with a Monte Carlo study designed around the kind of anomalies expected in this application. Both the hybrid smoother and the full BHM give comparable results with small false positive and false negative rates. Besides being successful in the water treatment application, this work can be easily extended to other Gaussian process models and other features that represent process disruptions."}, "https://arxiv.org/abs/2402.03491": {"title": "Periodically Correlated Time Series and the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2402.03491", "description": "This research introduces a novel approach to resampling periodically correlated (PC) time series using bandpass filters for frequency separation, called the Variable Bandpass Periodic Block Bootstrap (VBPBB), and then examines the significant advantages of this new method. 
While bootstrapping allows estimation of a statistic's sampling distribution by resampling the original data with replacement, and block bootstrapping is a model-free resampling strategy for correlated time series data, both fail to preserve correlations in PC time series. Existing extensions of the block bootstrap help preserve the correlation structures of PC processes but suffer from flaws and inefficiencies. Analyses of time series data containing cyclic, seasonal, or PC principal components often seen in annual, daily, or other cyclostationary processes benefit from separating these components. The VBPBB uses bandpass filters to separate a PC component from interference such as noise at other uncorrelated frequencies. A simulation study is presented, demonstrating near universal improvements obtained from the VBPBB when compared with prior block bootstrapping methods for periodically correlated time series."}, "https://arxiv.org/abs/2402.03538": {"title": "Forecasting Adversarial Actions Using Judgment Decomposition-Recomposition", "link": "https://arxiv.org/abs/2402.03538", "description": "In domains such as homeland security, cybersecurity and competitive marketing, it is frequently the case that analysts need to forecast adversarial actions that impact the problem of interest. Standard structured expert judgement elicitation techniques may fall short as they do not explicitly take into account intentionality. We present a decomposition technique based on adversarial risk analysis followed by a behavioral recomposition using discrete choice models that facilitate such elicitation process and illustrate its performance through behavioral experiments."}, "https://arxiv.org/abs/2402.03565": {"title": "Breakpoint based online anomaly detection", "link": "https://arxiv.org/abs/2402.03565", "description": "The goal of anomaly detection is to identify observations that are generated by a distribution that differs from the reference distribution that qualifies normal behavior. When examining a time series, the reference distribution may evolve over time. The anomaly detector must therefore be able to adapt to such changes. In the online context, it is particularly difficult to adapt to abrupt and unpredictable changes. Our solution to this problem is based on the detection of breakpoints in order to adapt in real time to the new reference behavior of the series and to increase the accuracy of the anomaly detection. This solution also provides a control of the False Discovery Rate by extending methods developed for stationary series."}, "https://arxiv.org/abs/2402.03683": {"title": "Gambling-Based Confidence Sequences for Bounded Random Vectors", "link": "https://arxiv.org/abs/2402.03683", "description": "A confidence sequence (CS) is a sequence of confidence sets that contains a target parameter of an underlying stochastic process at any time step with high probability. This paper proposes a new approach to constructing CSs for means of bounded multivariate stochastic processes using a general gambling framework, extending the recently established coin toss framework for bounded random processes. The proposed gambling framework provides a general recipe for constructing CSs for categorical and probability-vector-valued observations, as well as for general bounded multidimensional observations through a simple reduction. 
This paper specifically explores the use of the mixture portfolio, akin to Cover's universal portfolio, in the proposed framework and investigates the properties of the resulting CSs. Simulations demonstrate the tightness of these confidence sequences compared to existing methods. When applied to the sampling without-replacement setting for finite categorical data, it is shown that the resulting CS based on a universal gambling strategy is provably tighter than that of the posterior-prior ratio martingale proposed by Waudby-Smith and Ramdas."}, "https://arxiv.org/abs/2402.03954": {"title": "Mixed Matrix Completion in Complex Survey Sampling under Heterogeneous Missingness", "link": "https://arxiv.org/abs/2402.03954", "description": "Modern surveys with large sample sizes and growing mixed-type questionnaires require robust and scalable analysis methods. In this work, we consider recovering a mixed dataframe matrix, obtained by complex survey sampling, with entries following different canonical exponential distributions and subject to heterogeneous missingness. To tackle this challenging task, we propose a two-stage procedure: in the first stage, we model the entry-wise missing mechanism by logistic regression, and in the second stage, we complete the target parameter matrix by maximizing a weighted log-likelihood with a low-rank constraint. We propose a fast and scalable estimation algorithm that achieves sublinear convergence, and the upper bound for the estimation error of the proposed method is rigorously derived. Experimental results support our theoretical claims, and the proposed estimator shows its merits compared to other existing methods. The proposed method is applied to analyze the National Health and Nutrition Examination Survey data."}, "https://arxiv.org/abs/2402.04165": {"title": "Monthly GDP nowcasting with Machine Learning and Unstructured Data", "link": "https://arxiv.org/abs/2402.04165", "description": "In the dynamic landscape of continuous change, Machine Learning (ML) \"nowcasting\" models offer a distinct advantage for informed decision-making in both public and private sectors. This study introduces ML-based GDP growth projection models for monthly rates in Peru, integrating structured macroeconomic indicators with high-frequency unstructured sentiment variables. Analyzing data from January 2007 to May 2023, encompassing 91 leading economic indicators, the study evaluates six ML algorithms to identify optimal predictors. Findings highlight the superior predictive capability of ML models using unstructured data, particularly Gradient Boosting Machine, LASSO, and Elastic Net, exhibiting a 20% to 25% reduction in prediction errors compared to traditional AR and Dynamic Factor Models (DFM). This enhanced performance is attributed to better handling of data of ML models in high-uncertainty periods, such as economic crises."}, "https://arxiv.org/abs/2402.03447": {"title": "Challenges in Variable Importance Ranking Under Correlation", "link": "https://arxiv.org/abs/2402.03447", "description": "Variable importance plays a pivotal role in interpretable machine learning as it helps measure the impact of factors on the output of the prediction model. Model agnostic methods based on the generation of \"null\" features via permutation (or related approaches) can be applied. Such analysis is often utilized in pharmaceutical applications due to its ability to interpret black-box models, including tree-based ensembles. 
A major challenge and significant confounder in variable importance estimation however is the presence of between-feature correlation. Recently, several adjustments to marginal permutation utilizing feature knockoffs were proposed to address this issue, such as the variable importance measure known as conditional predictive impact (CPI). Assessment and evaluation of such approaches is the focus of our work. We first present a comprehensive simulation study investigating the impact of feature correlation on the assessment of variable importance. We then theoretically prove the limitation that highly correlated features pose for the CPI through the knockoff construction. While we expect that there is always no correlation between knockoff variables and its corresponding predictor variables, we prove that the correlation increases linearly beyond a certain correlation threshold between the predictor variables. Our findings emphasize the absence of free lunch when dealing with high feature correlation, as well as the necessity of understanding the utility and limitations behind methods in variable importance estimation."}, "https://arxiv.org/abs/2402.03527": {"title": "Consistent Validation for Predictive Methods in Spatial Settings", "link": "https://arxiv.org/abs/2402.03527", "description": "Spatial prediction tasks are key to weather forecasting, studying air pollution, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We instead propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data."}, "https://arxiv.org/abs/2402.03941": {"title": "Discovery of the Hidden World with Large Language Models", "link": "https://arxiv.org/abs/2402.03941", "description": "Science originates with discovering new causal knowledge from a combination of known facts and observations. Traditional causal discovery approaches mainly rely on high-quality measured variables, usually given by human experts, to find causal relations. However, the causal variables are usually unavailable in a wide range of real-world applications. The rise of large language models (LLMs) that are trained to learn rich knowledge from the massive observations of the world, provides a new opportunity to assist with discovering high-level hidden variables from the raw observational data. Therefore, we introduce COAT: Causal representatiOn AssistanT. COAT incorporates LLMs as a factor proposer that extracts the potential causal factors from unstructured data. 
Moreover, LLMs can also be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to further parse the raw unstructured data into structured data. The annotated data will be fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data, as well as useful feedback to further improve the extraction of causal factors by LLMs. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies of review rating analysis and neuropathic diagnosis."}, "https://arxiv.org/abs/2402.04082": {"title": "An Optimal House Price Prediction Algorithm: XGBoost", "link": "https://arxiv.org/abs/2402.04082", "description": "An accurate prediction of house prices is a fundamental requirement for various sectors including real estate and mortgage lending. It is widely recognized that a property value is not solely determined by its physical attributes but is significantly influenced by its surrounding neighbourhood. Meeting the diverse housing needs of individuals while balancing budget constraints is a primary concern for real estate developers. To this end, we addressed the house price prediction problem as a regression task and thus employed various machine learning techniques capable of expressing the significance of independent variables. We made use of the housing dataset of Ames City in Iowa, USA to compare support vector regressor, random forest regressor, XGBoost, multilayer perceptron and multiple linear regression algorithms for house price prediction. Afterwards, we identified the key factors that influence housing costs. Our results show that XGBoost is the best performing model for house price prediction."}, "https://arxiv.org/abs/1801.07683": {"title": "Discovering the Signal Subgraph: An Iterative Screening Approach on Graphs", "link": "https://arxiv.org/abs/1801.07683", "description": "Supervised learning on graphs is a challenging task due to the high dimensionality and inherent structural dependencies in the data, where each edge depends on a pair of vertices. Existing conventional methods designed for Euclidean data do not account for this graph dependency structure. To address this issue, this paper proposes an iterative vertex screening method to identify the signal subgraph that is most informative for the given graph attributes. The method screens the rows and columns of the adjacency matrix concurrently and stops when the resulting distance correlation is maximized. We establish the theoretical foundation of our method by proving that it estimates the true signal subgraph with high probability. Additionally, we establish the convergence rate of classification error under the Erdos-Renyi random graph model and prove that the subsequent classification can be asymptotically optimal, outperforming the entire graph under high-dimensional conditions. Our method is evaluated on various simulated datasets and real-world human and murine graphs derived from functional and structural magnetic resonance images. 
The results demonstrate its excellent performance in estimating the ground-truth signal subgraph and achieving superior classification accuracy."}, "https://arxiv.org/abs/2006.09587": {"title": "Adaptive, Rate-Optimal Hypothesis Testing in Nonparametric IV Models", "link": "https://arxiv.org/abs/2006.09587", "description": "We propose a new adaptive hypothesis test for inequality (e.g., monotonicity, convexity) and equality (e.g., parametric, semiparametric) restrictions on a structural function in a nonparametric instrumental variables (NPIV) model. Our test statistic is based on a modified leave-one-out sample analog of a quadratic distance between the restricted and unrestricted sieve NPIV estimators. We provide computationally simple, data-driven choices of sieve tuning parameters and Bonferroni adjusted chi-squared critical values. Our test adapts to the unknown smoothness of alternative functions in the presence of unknown degree of endogeneity and unknown strength of the instruments. It attains the adaptive minimax rate of testing in $L^2$.\n That is, the sum of its type I error uniformly over the composite null and its type II error uniformly over nonparametric alternative models cannot be improved by any other hypothesis test for NPIV models of unknown regularities. Confidence sets in $L^2$ are obtained by inverting the adaptive test. Simulations confirm that our adaptive test controls size and its finite-sample power greatly exceeds existing non-adaptive tests for monotonicity and parametric restrictions in NPIV models. Empirical applications to test for shape restrictions of differentiated products demand and of Engel curves are presented."}, "https://arxiv.org/abs/2007.01680": {"title": "Developing a predictive signature for two trial endpoints using the cross-validated risk scores method", "link": "https://arxiv.org/abs/2007.01680", "description": "The existing cross-validated risk scores (CVRS) design has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using high-dimensional data (such as genetic data). The design is based on computing a risk score for each patient and dividing them into clusters using a non-parametric clustering procedure. In some settings it is desirable to consider the trade-off between two outcomes, such as efficacy and toxicity, or cost and effectiveness. With this motivation, we extend the CVRS design (CVRS2) to consider two outcomes. The design employs bivariate risk scores that are divided into clusters. We assess the properties of the CVRS2 using simulated data and illustrate its application on a randomised psychiatry trial. We show that CVRS2 is able to reliably identify the sensitive group (the group for which the new treatment provides benefit on both outcomes) in the simulated data. We apply the CVRS2 design to a psychology clinical trial that had offender status and substance use status as two outcomes and collected a large number of baseline covariates. 
The CVRS2 design yields a significant treatment effect for both outcomes, while the CVRS approach identified a significant effect for the offender status only after pre-filtering the covariates."}, "https://arxiv.org/abs/2208.05121": {"title": "Locally Adaptive Bayesian Isotonic Regression using Half Shrinkage Priors", "link": "https://arxiv.org/abs/2208.05121", "description": "Isotonic regression or monotone function estimation is a problem of estimating function values under monotonicity constraints, which appears naturally in many scientific fields. This paper proposes a new Bayesian method with global-local shrinkage priors for estimating monotone function values. Specifically, we introduce half shrinkage priors for positive valued random variables and assign them for the first-order differences of function values. We also develop fast and simple Gibbs sampling algorithms for full posterior analysis. By incorporating advanced shrinkage priors, the proposed method is adaptive to local abrupt changes or jumps in target functions. We show this adaptive property theoretically by proving that the posterior mean estimators are robust to large differences and that asymptotic risk for unchanged points can be improved. Finally, we demonstrate the proposed methods through simulations and applications to a real data set."}, "https://arxiv.org/abs/2303.03497": {"title": "Integrative data analysis where partial covariates have complex non-linear effects by using summary information from an external data", "link": "https://arxiv.org/abs/2303.03497", "description": "A full parametric and linear specification may be insufficient to capture complicated patterns in studies exploring complex features, such as those investigating age-related changes in brain functional abilities. Alternatively, a partially linear model (PLM) consisting of both parametric and non-parametric elements may have a better fit. This model has been widely applied in economics, environmental science, and biomedical studies. In this paper, we introduce a novel statistical inference framework that equips PLM with high estimation efficiency by effectively synthesizing summary information from external data into the main analysis. Such an integrative scheme is versatile in assimilating various types of reduced models from the external study. The proposed method is shown to be theoretically valid and numerically convenient, and it ensures a high-efficiency gain compared to classic methods in PLM. Our method is further validated using two data applications by evaluating the risk factors of brain imaging measures and blood pressure."}, "https://arxiv.org/abs/2304.08184": {"title": "Adjustment with Many Regressors Under Covariate-Adaptive Randomizations", "link": "https://arxiv.org/abs/2304.08184", "description": "Our paper discovers a new trade-off of using regression adjustments (RAs) in causal inference under covariate-adaptive randomizations (CARs). On one hand, RAs can improve the efficiency of causal estimators by incorporating information from covariates that are not used in the randomization. On the other hand, RAs can degrade estimation efficiency due to their estimation errors, which are not asymptotically negligible when the number of regressors is of the same order as the sample size. Ignoring the estimation errors of RAs may result in serious over-rejection of causal inference under the null hypothesis. 
To address the issue, we construct a new ATE estimator by optimally linearly combining the adjusted and unadjusted estimators. We then develop a unified inference theory for this estimator under CARs. It has two features: (1) the Wald test based on it achieves the exact asymptotic size under the null hypothesis, regardless of whether the number of covariates is fixed or diverges no faster than the sample size; and (2) it guarantees weak efficiency improvement over both the adjusted and unadjusted estimators."}, "https://arxiv.org/abs/2305.01435": {"title": "Transfer Estimates for Causal Effects across Heterogeneous Sites", "link": "https://arxiv.org/abs/2305.01435", "description": "We consider the problem of extrapolating treatment effects across heterogeneous populations (\"sites\"/\"contexts\"). We consider an idealized scenario in which the researcher observes cross-sectional data for a large number of units across several \"experimental\" sites in which an intervention has already been implemented to a new \"target\" site for which a baseline survey of unit-specific, pre-treatment outcomes and relevant attributes is available. We propose a transfer estimator that exploits cross-sectional variation between individuals and sites to predict treatment outcomes using baseline outcome data for the target location. We consider the problem of determining the optimal finite-dimensional feature space in which to solve that prediction problem. Our approach is design-based in the sense that the performance of the predictor is evaluated given the specific, finite selection of experimental and target sites. Our approach is nonparametric, and our formal results concern the construction of an optimal basis of predictors as well as convergence rates for the estimated conditional average treatment effect relative to the constrained-optimal population predictor for the target site. We illustrate our approach using a combined data set of five multi-site randomized controlled trials (RCTs) to evaluate the effect of conditional cash transfers on school attendance."}, "https://arxiv.org/abs/2305.10054": {"title": "Functional Adaptive Double-Sparsity Estimator for Functional Linear Regression Model with Multiple Functional Covariates", "link": "https://arxiv.org/abs/2305.10054", "description": "Sensor devices have been increasingly used in engineering and health studies recently, and the captured multi-dimensional activity and vital sign signals can be studied in association with health outcomes to inform public health. The common approach is the scalar-on-function regression model, in which health outcomes are the scalar responses while high-dimensional sensor signals are the functional covariates, but how to effectively interpret results becomes difficult. In this study, we propose a new Functional Adaptive Double-Sparsity (FadDoS) estimator based on functional regularization of sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and exhibit excellent performances compared to existing approaches. 
Application to a Kinect sensor study that utilized an advanced motion sensing device tracking human multiple joint movements and conducted among community-dwelling elderly demonstrates how the FadDoS estimator can effectively characterize the detailed association between joint movements and physical health assessments. The proposed method is not only effective in Kinect sensor analysis but also applicable to broader fields, where multi-dimensional sensor signals are collected simultaneously, to expand the use of sensor devices in health studies and facilitate sensor data analysis."}, "https://arxiv.org/abs/2305.15671": {"title": "Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data", "link": "https://arxiv.org/abs/2305.15671", "description": "We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on time series of matrices with observations distributed on a fixed 2-D spatial grid, i.e., the spatio-temporal data, and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially-smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducible Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC."}, "https://arxiv.org/abs/2307.15181": {"title": "On the Efficiency of Finely Stratified Experiments", "link": "https://arxiv.org/abs/2307.15181", "description": "This paper studies the efficient estimation of a large class of treatment effect parameters that arise in the analysis of experiments. Here, efficiency is understood to be with respect to a broad class of treatment assignment schemes for which the marginal probability that any unit is assigned to treatment equals a pre-specified value, e.g., one half. Importantly, we do not require that treatment status is assigned in an i.i.d. fashion, thereby accommodating complicated treatment assignment schemes that are used in practice, such as stratified block randomization and matched pairs. The class of parameters considered are those that can be expressed as the solution to a set of moment conditions involving a known function of the observed data, including possibly the pre-specified value for the marginal probability of treatment assignment. We show that this class of parameters includes, among other things, average treatment effects, quantile treatment effects, local average treatment effects as well as the counterparts to these quantities in experiments in which the unit is itself a cluster. In this setting, we establish two results. 
First, we derive a lower bound on the asymptotic variance of estimators of the parameter of interest in the form of a convolution theorem. Second, we show that the naive method of moments estimator achieves this bound on the asymptotic variance quite generally if treatment is assigned using a \"finely stratified\" design. By a \"finely stratified\" design, we mean experiments in which units are divided into groups of a fixed size and a proportion within each group is assigned to treatment uniformly at random so that it respects the restriction on the marginal probability of treatment assignment. In this sense, \"finely stratified\" experiments lead to efficient estimators of treatment effect parameters \"by design\" rather than through ex post covariate adjustment."}, "https://arxiv.org/abs/2310.13785": {"title": "Bayesian Estimation of Panel Models under Potentially Sparse Heterogeneity", "link": "https://arxiv.org/abs/2310.13785", "description": "We incorporate a version of a spike and slab prior, comprising a pointmass at zero (\"spike\") and a Normal distribution around zero (\"slab\") into a dynamic panel data framework to model coefficient heterogeneity. In addition to homogeneity and full heterogeneity, our specification can also capture sparse heterogeneity, that is, there is a core group of units that share common parameters and a set of deviators with idiosyncratic parameters. We fit a model with unobserved components to income data from the Panel Study of Income Dynamics. We find evidence for sparse heterogeneity for balanced panels composed of individuals with long employment histories."}, "https://arxiv.org/abs/2401.12420": {"title": "Rank-based estimators of global treatment effects for cluster randomized trials with multiple endpoints", "link": "https://arxiv.org/abs/2401.12420", "description": "Cluster randomization trials commonly employ multiple endpoints. When a single summary of treatment effects across endpoints is of primary interest, global hypothesis testing/effect estimation methods represent a common analysis strategy. However, specification of the joint distribution required by these methods is non-trivial, particularly when endpoint properties differ. We develop rank-based interval estimators for a global treatment effect referred to as the \"global win probability,\" or the probability that a treatment individual responds better than a control individual on average. Using endpoint-specific ranks among the combined sample and within each arm, each individual-level observation is converted to a \"win fraction\" which quantifies the proportion of wins experienced over every observation in the comparison arm. An individual's multiple observations are then replaced by a single \"global win fraction,\" constructed by averaging win fractions across endpoints. A linear mixed model is applied directly to the global win fractions to recover point, variance, and interval estimates of the global win probability adjusted for clustering. Simulation demonstrates our approach performs well concerning coverage and type I error, and methods are easily implemented using standard software. 
A case study using publicly available data is provided with corresponding R and SAS code."}, "https://arxiv.org/abs/2401.14684": {"title": "Inference for Cumulative Incidences and Treatment Effects in Randomized Controlled Trials with Time-to-Event Outcomes under ICH E9 (E1)", "link": "https://arxiv.org/abs/2401.14684", "description": "In randomized controlled trials (RCTs) with time-to-event outcomes, intercurrent events occur as semi-competing/competing events, and they could affect the hazard of outcomes or render outcomes ill-defined. Although five strategies have been proposed in the ICH E9 (R1) addendum to address intercurrent events in RCTs, they do not readily extend to the context of time-to-event data for studying causal effects with rigorously stated implications. In this study, we show how to define, estimate, and draw inference on the time-dependent cumulative incidence of outcome events in such contexts to obtain causal interpretations. Specifically, we derive the mathematical forms of the scientific objectives (i.e., causal estimands) under the five strategies and clarify the data structure required to identify these causal estimands. Furthermore, we summarize estimation and inference methods for these causal estimands by adopting methodologies from survival analysis, including analytic formulas for asymptotic analysis and hypothesis testing. We illustrate our methods with the LEADER Trial, investigating the effect of liraglutide on cardiovascular outcomes. Studies of multiple endpoints and combining strategies to address multiple intercurrent events can help practitioners understand treatment effects more comprehensively."}, "https://arxiv.org/abs/1908.06486": {"title": "Independence Testing for Temporal Data", "link": "https://arxiv.org/abs/1908.06486", "description": "Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in an invalid test. To address these challenges, this paper introduces the temporal dependence statistic with block permutation to test independence between temporal data. Under proper assumptions, the proposed procedure is asymptotically valid and universally consistent for testing independence between stationary time series, and capable of estimating the optimal dependence lag that maximizes the dependence. Notably, it is compatible with a rich family of distance- and kernel-based dependence measures, eliminates the need for multiple testing, and demonstrates superior power in multivariate, low-sample-size, and nonlinear settings. An analysis of neural connectivity with fMRI data reveals various temporal dependencies among signals within the visual network and default mode network."}, "https://arxiv.org/abs/2001.01095": {"title": "High-Dimensional Independence Testing via Maximum and Average Distance Correlations", "link": "https://arxiv.org/abs/2001.01095", "description": "This paper introduces and investigates the utilization of maximum and average distance correlations for multivariate independence testing.
We characterize their consistency properties in high-dimensional settings with respect to the number of marginally dependent dimensions, assess the advantages of each test statistic, examine their respective null distributions, and present a fast chi-square-based testing procedure. The resulting tests are non-parametric and applicable to both Euclidean distance and the Gaussian kernel as the underlying metric. To better understand the practical use cases of the proposed tests, we evaluate the empirical performance of the maximum distance correlation, average distance correlation, and the original distance correlation across various multivariate dependence scenarios, as well as conduct a real data experiment to test the presence of various cancer types and peptide levels in human plasma."}, "https://arxiv.org/abs/2204.10275": {"title": "Do t-Statistic Hurdles Need to be Raised?", "link": "https://arxiv.org/abs/2204.10275", "description": "Many scholars have called for raising statistical hurdles to guard against false discoveries in academic publications. I show these calls may be difficult to justify empirically. Published data exhibit bias: results that fail to meet existing hurdles are often unobserved. These unobserved results must be extrapolated, which can lead to weak identification of revised hurdles. In contrast, statistics that can target only published findings (e.g. empirical Bayes shrinkage and the FDR) can be strongly identified, as data on published findings is plentiful. I demonstrate these results theoretically and in an empirical analysis of the cross-sectional return predictability literature."}, "https://arxiv.org/abs/2310.07852": {"title": "On the Computational Complexity of Private High-dimensional Model Selection", "link": "https://arxiv.org/abs/2310.07852", "description": "We consider the problem of model selection in a high-dimensional sparse linear regression model under privacy constraints. We propose a differentially private best subset selection method with strong utility properties by adopting the well-known exponential mechanism for selecting the best model. We propose an efficient Metropolis-Hastings algorithm and establish that it enjoys polynomial mixing time to its stationary distribution. Furthermore, we also establish approximate differential privacy for the final estimates of the Metropolis-Hastings random walk using its mixing property. Finally, we perform some illustrative experiments that show the strong utility of our algorithm."}, "https://arxiv.org/abs/2311.01303": {"title": "Local differential privacy in survival analysis using private failure indicators", "link": "https://arxiv.org/abs/2311.01303", "description": "We consider the estimation of the cumulative hazard function, and equivalently the distribution function, with censored data under a setup that preserves the privacy of the survival database. This is done through a $\\alpha$-locally differentially private mechanism for the failure indicators and by proposing a non-parametric kernel estimator for the cumulative hazard function that remains consistent under the privatization. 
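For intuition on the privatization step, a standard way to make a binary failure indicator α-locally differentially private is randomized response; the sketch below is a generic illustration of that mechanism and its unbiased inversion, not necessarily the mechanism analyzed in the paper, and the privatized kernel estimator of the cumulative hazard itself is not reproduced here.

```python
# Generic randomized-response sketch for alpha-LDP privatization of binary failure indicators.
import numpy as np

def privatize_indicators(delta, alpha, seed=None):
    """Report each indicator truthfully with probability exp(alpha)/(1+exp(alpha)), else flip it."""
    rng = np.random.default_rng(seed)
    p_keep = np.exp(alpha) / (1.0 + np.exp(alpha))
    keep = rng.random(len(delta)) < p_keep
    return np.where(keep, delta, 1 - delta)

def debias(delta_priv, alpha):
    """Unbiased reconstruction of E[delta] from the privatized indicators (valid for alpha > 0)."""
    p = np.exp(alpha) / (1.0 + np.exp(alpha))
    return (delta_priv - (1.0 - p)) / (2.0 * p - 1.0)
```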
Under mild conditions, we also prove lower bounds for the minimax rates of convergence and show that the estimator is minimax optimal under a well-chosen bandwidth."}, "https://arxiv.org/abs/2311.08527": {"title": "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments", "link": "https://arxiv.org/abs/2311.08527", "description": "We study inference on the long-term causal effect of a continual exposure to a novel intervention, which we term a long-term treatment, based on an experiment involving only short-term observations. Key examples include the long-term health effects of regularly-taken medicine or of environmental hazards and the long-term effects on users of changes to an online platform. This stands in contrast to short-term treatments or ``shocks,\" whose long-term effect can reasonably be mediated by short-term observations, enabling the use of surrogate methods. Long-term treatments by definition have direct effects on long-term outcomes via continual exposure, so surrogacy conditions cannot reasonably hold. We connect the problem with offline reinforcement learning, leveraging doubly-robust estimators to estimate long-term causal effects for long-term treatments and construct confidence intervals."}, "https://arxiv.org/abs/2402.04321": {"title": "Homogeneity problem for basis expansion of functional data with applications to resistive memories", "link": "https://arxiv.org/abs/2402.04321", "description": "The homogeneity problem of testing whether more than two different samples come from the same population is considered for the case of functional data. The methodological results are motivated by the study of homogeneity of electronic devices fabricated with different materials and active layer thicknesses. When the stochastic processes associated with each sample are normally distributed, this problem is known as the Functional ANOVA (FANOVA) problem and reduces to testing the equality of the mean group functions. The problem is that the current/voltage curves associated with Resistive Random Access Memories (RRAM) are not generated by a Gaussian process, so a different approach is necessary for testing homogeneity. To solve this problem, two different parametric and nonparametric approaches based on basis expansion of the sample curves are proposed. The first consists of applying multivariate homogeneity tests to a vector of basis coefficients of the sample curves. The second is based on dimension reduction by using functional principal component analysis (FPCA) of the sample curves and testing multivariate homogeneity on a vector of principal component scores. Different numerical approximation techniques are employed to adapt the experimental data for the statistical study. An extensive simulation study is developed to analyze the performance of both approaches in the parametric and non-parametric cases. Finally, the proposed methodologies are applied to three samples of experimental reset curves measured in three different RRAM technologies."}, "https://arxiv.org/abs/2402.04327": {"title": "On the estimation and interpretation of effect size metrics", "link": "https://arxiv.org/abs/2402.04327", "description": "Effect size estimates are thought to capture the collective, two-way response to an intervention or exposure in a three-way problem among the intervention/exposure, various confounders and the outcome.
For meaningful causal inference from the estimated effect size, the joint distribution of observed confounders must be identical across all intervention/exposure groups. However, real-world observational studies and even randomized clinical trials often lack such structural symmetry. To address this issue, various methods have been proposed and widely utilized. Recently, elementary combinatorics and information theory have motivated a consistent way to completely eliminate observed confounding in any given study. In this work, we leverage these new techniques to evaluate conventional methods based on their ability to (a) consistently differentiate between collective and individual responses to intervention/exposure and (b) establish the desired structural parity for sensible effect size estimation. Our findings reveal that a straightforward application of logistic regression homogenizes the three-way stratified analysis, but fails to restore structural symmetry leaving in particular the two-way effect size estimate unadjusted. Conversely, the Mantel-Haenszel estimator struggles to separate three-way effects from the two-way effect of intervention/exposure, leading to inconsistencies in interpreting pooled estimates as two-way risk metrics."}, "https://arxiv.org/abs/2402.04341": {"title": "CausalMetaR: An R package for performing causally interpretable meta-analyses", "link": "https://arxiv.org/abs/2402.04341", "description": "Researchers would often like to leverage data from a collection of sources (e.g., primary studies in a meta-analysis) to estimate causal effects in a target population of interest. However, traditional meta-analytic methods do not produce causally interpretable estimates for a well-defined target population. In this paper, we present the CausalMetaR R package, which implements efficient and robust methods to estimate causal effects in a given internal or external target population using multi-source data. The package includes estimators of average and subgroup treatment effects for the entire target population. To produce efficient and robust estimates of causal effects, the package implements doubly robust and non-parametric efficient estimators and supports using flexible data-adaptive (e.g., machine learning techniques) methods and cross-fitting techniques to estimate the nuisance models (e.g., the treatment model, the outcome model). We describe the key features of the package and demonstrate how to use the package through an example."}, "https://arxiv.org/abs/2402.04345": {"title": "A Framework of Zero-Inflated Bayesian Negative Binomial Regression Models For Spatiotemporal Data", "link": "https://arxiv.org/abs/2402.04345", "description": "Spatiotemporal data analysis with massive zeros is widely used in many areas such as epidemiology and public health. We use a Bayesian framework to fit zero-inflated negative binomial models and employ a set of latent variables from P\\'olya-Gamma distributions to derive an efficient Gibbs sampler. The proposed model accommodates varying spatial and temporal random effects through Gaussian process priors, which have both the simplicity and flexibility in modeling nonlinear relationships through a covariance function. To conquer the computation bottleneck that GPs may suffer when the sample size is large, we adopt the nearest-neighbor GP approach that approximates the covariance matrix using local experts. 
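To make the nearest-neighbor GP approximation concrete, here is a rough sketch that writes the joint GP density as a product of local conditionals, each conditioned only on a small set of nearest previously ordered neighbors; the exponential covariance, the crude coordinate-sum ordering, and the parameter names are illustrative assumptions rather than the authors' implementation.

```python
# Rough Vecchia/NNGP-style sketch: the GP density is approximated by a product of
# univariate conditionals, each using at most m nearest previously ordered neighbors.
import numpy as np
from scipy.spatial.distance import cdist

def nngp_log_density(y, coords, sigma2=1.0, phi=1.0, m=10):
    """Log of prod_i N(y_i | B_i y_{N(i)}, F_i) under an exponential covariance (assumed)."""
    order = np.argsort(coords[:, 0] + coords[:, 1])       # simple stand-in for a max-min ordering
    y, coords = y[order], coords[order]
    cov = sigma2 * np.exp(-cdist(coords, coords) / phi)
    logdens = 0.0
    for i in range(len(y)):
        if i == 0:
            mean, var = 0.0, cov[0, 0]
        else:
            dists = cdist(coords[i:i + 1], coords[:i]).ravel()
            nbrs = np.argsort(dists)[:m]                  # the "local experts" for location i
            C_nn = cov[np.ix_(nbrs, nbrs)]
            C_in = cov[i, nbrs]
            b = np.linalg.solve(C_nn, C_in)               # conditioning weights
            mean = b @ y[nbrs]
            var = cov[i, i] - C_in @ b
        logdens += -0.5 * (np.log(2.0 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return logdens
```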
For the simulation study, we adopt multiple settings with varying numbers of spatial locations to evaluate the performance of the proposed model, such as the estimation of spatial and temporal random effects, and compare the results to other methods. We also apply the proposed model to the COVID-19 death counts in the state of Florida, USA from 3/25/2020 through 7/29/2020 to examine relationships between social vulnerability and COVID-19 deaths."}, "https://arxiv.org/abs/2402.04425": {"title": "Linear-Phase-Type probability modelling of functional PCA with applications to resistive memories", "link": "https://arxiv.org/abs/2402.04425", "description": "Functional principal component analysis based on the Karhunen-Loeve expansion allows us to describe the stochastic evolution of the main characteristics associated with multiple systems and devices. Identifying the probability distribution of the principal component scores is fundamental for characterizing the whole process. The aim of this work is to consider a family of statistical distributions that can be accurately adjusted to a previous transformation. Then, a new class of distributions, the linear-phase-type, is introduced to model the principal components. This class is studied in detail in order to prove, through the KL expansion, that certain linear transformations of the process at each time point are phase-type distributed. This way, the one-dimensional distributions of the process are in the same linear-phase-type class. Finally, an application to model the reset process associated with resistive memories is developed and explained."}, "https://arxiv.org/abs/2402.04433": {"title": "Fast Online Changepoint Detection", "link": "https://arxiv.org/abs/2402.04433", "description": "We study online changepoint detection in the context of a linear regression model. We propose a class of heavily weighted statistics based on the CUSUM process of the regression residuals, which are specifically designed to ensure timely detection of breaks occurring early on during the monitoring horizon. We subsequently propose a class of composite statistics, constructed using different weighting schemes; the decision rule to mark a changepoint is based on the largest statistic across the various weights, thus effectively working like a veto-based voting mechanism, which ensures fast detection irrespective of the location of the changepoint. Our theory is derived under a very general form of weak dependence, which allows our tests to be applied to virtually all time series encountered in economics, medicine, and other applied sciences. Monte Carlo simulations show that our methodologies are able to control the procedure-wise Type I error, and have short detection delays in the presence of breaks."}, "https://arxiv.org/abs/2402.04461": {"title": "On Data Analysis Pipelines and Modular Bayesian Modeling", "link": "https://arxiv.org/abs/2402.04461", "description": "Data analysis pipelines, where quantities estimated in upstream modules are used as inputs to downstream ones, are common in many application areas. The most common approach to implementing data analysis pipelines involves obtaining point estimates from the upstream module(s) and then treating these as known quantities when working with the downstream module(s). This approach is straightforward, but is likely to underestimate the overall uncertainty associated with any final estimates.
An alternative approach involves estimating parameters from the modules jointly using a Bayesian hierarchical model, which has the advantage of propagating upstream uncertainty into the downstream estimates. However in the case where one of the modules suffers from misspecification, such a joint model can result in the misspecification from one module corrupting the estimates from the remaining modules. Furthermore, hierarchical models require the development of ad-hoc implementations that can be time consuming to create and require large amounts of computational effort. So-called cut inference modifies the posterior distribution in such a way that prevents information flow between certain modules and provides a third alternative for statistical inference in data analysis pipelines. This paper presents a unified framework that encompasses all three modeling approaches (two-step, cut, and joint) in the context of data analysis pipelines with two modules and uses two examples to illustrate the trade offs associated with these three approaches. Our work shows that cut inference provides both robustness and ease of implementation for data analysis pipelines at a lower cost in terms of statistical inference than two-step procedures."}, "https://arxiv.org/abs/2402.04487": {"title": "Item-Level Heterogeneous Treatment Effects of Selective Serotonin Reuptake Inhibitors (SSRIs) on Depression: Implications for Inference, Generalizability, and Identification", "link": "https://arxiv.org/abs/2402.04487", "description": "In analysis of randomized controlled trials (RCTs) with patient-reported outcome measures (PROMs), Item Response Theory (IRT) models that allow for heterogeneity in the treatment effect at the item level merit consideration. These models for ``item-level heterogeneous treatment effects'' (IL-HTE) can provide more accurate statistical inference, allow researchers to better generalize their results, and resolve critical identification problems in the estimation of interaction effects. In this study, we extend the IL-HTE model to polytomous data and apply the model to determine how the effect of selective serotonin reuptake inhibitors (SSRIs) on depression varies across the items on a depression rating scale. We first conduct a Monte Carlo simulation study to assess the performance of the polytomous IL-HTE model under a range of conditions. We then apply the IL-HTE model to item-level data from 28 RCTs measuring the effect of SSRIs on depression using the 17-item Hamilton Depression Rating Scale (HDRS-17) and estimate potential heterogeneity by subscale (HDRS-6). Our results show that the IL-HTE model provides more accurate statistical inference, allows for generalizability of results to out-of-sample items, and resolves identification problems in the estimation of interaction effects. Our empirical application shows that while the average effect of SSRIs on depression is beneficial (i.e., negative) and statistically significant, there is substantial IL-HTE, with estimates of the standard deviation of item-level effects nearly as large as the average effect. We show that this substantial IL-HTE is driven primarily by systematically larger effects on the HDRS-6 subscale items. 
The IL-HTE model has the potential to provide new insights for the inference, generalizability, and identification of treatment effects in clinical trials using patient reported outcome measures."}, "https://arxiv.org/abs/2402.04593": {"title": "Spatial autoregressive model with measurement error in covariates", "link": "https://arxiv.org/abs/2402.04593", "description": "The Spatial AutoRegressive model (SAR) is commonly used in studies involving spatial and network data to estimate the spatial or network peer influence and the effects of covariates on the response, taking into account the spatial or network dependence. While the model can be efficiently estimated with a Quasi maximum likelihood approach (QMLE), the detrimental effect of covariate measurement error on the QMLE and how to remedy it is currently unknown. If covariates are measured with error, then the QMLE may not have the $\\sqrt{n}$ convergence and may even be inconsistent even when a node is influenced by only a limited number of other nodes or spatial units. We develop a measurement error-corrected ML estimator (ME-QMLE) for the parameters of the SAR model when covariates are measured with error. The ME-QMLE possesses statistical consistency and asymptotic normality properties. We consider two types of applications. The first is when the true covariate cannot be measured directly, and a proxy is observed instead. The second one involves including latent homophily factors estimated with error from the network for estimating peer influence. Our numerical results verify the bias correction property of the estimator and the accuracy of the standard error estimates in finite samples. We illustrate the method on a real dataset related to county-level death rates from the COVID-19 pandemic."}, "https://arxiv.org/abs/2402.04674": {"title": "Hyperparameter Tuning for Causal Inference with Double Machine Learning: A Simulation Study", "link": "https://arxiv.org/abs/2402.04674", "description": "Proper hyperparameter tuning is essential for achieving optimal performance of modern machine learning (ML) methods in predictive tasks. While there is an extensive literature on tuning ML learners for prediction, there is only little guidance available on tuning ML learners for causal machine learning and how to select among different ML learners. In this paper, we empirically assess the relationship between the predictive performance of ML methods and the resulting causal estimation based on the Double Machine Learning (DML) approach by Chernozhukov et al. (2018). DML relies on estimating so-called nuisance parameters by treating them as supervised learning problems and using them as plug-in estimates to solve for the (causal) parameter. We conduct an extensive simulation study using data from the 2019 Atlantic Causal Inference Conference Data Challenge. We provide empirical insights on the role of hyperparameter tuning and other practical decisions for causal estimation with DML. First, we assess the importance of data splitting schemes for tuning ML learners within Double Machine Learning. Second, we investigate how the choice of ML methods and hyperparameters, including recent AutoML frameworks, impacts the estimation performance for a causal parameter of interest. 
Third, we assess to what extent the choice of a particular causal model, as characterized by its incorporated parametric assumptions, can be based on predictive performance metrics."}, "https://arxiv.org/abs/2402.04727": {"title": "Data-driven Bayesian estimation of Monod kinetics", "link": "https://arxiv.org/abs/2402.04727", "description": "In this paper, we consider the well-known problem of non-linear identification of the rates of the reactions involved in cells with Monod functions. In bioprocesses, generating data is very expensive and time-consuming, so it is important to incorporate prior knowledge on the Monod kinetic parameters. Bayesian estimation is an elegant estimation technique that deals with parameter estimation with prior knowledge modeled as probability density functions. However, we might not have accurate knowledge of the kinetic parameters, such as interval bounds, especially for newly developed cell lines. Hence, we consider the case where there is no accurate prior information on the kinetic parameters except qualitative knowledge such as their non-negativity. A log-Gaussian prior distribution is considered for the parameters, and the means and variances of these distributions are tuned using the Expectation-Maximization algorithm. The algorithm requires the use of Metropolis-Hastings within Gibbs sampling, which can be computationally expensive. We develop a novel variant of the Metropolis-Hastings-within-Gibbs sampling scheme in order to accelerate and improve the hyperparameter tuning. We show that it can give better modeling performance on a relatively large-scale simulation example compared to available methods in the literature."}, "https://arxiv.org/abs/2402.04808": {"title": "Basis expansion approaches for functional analysis of variance with repeated measures", "link": "https://arxiv.org/abs/2402.04808", "description": "The methodological contribution in this paper is motivated by biomechanical studies where data characterizing human movement are waveform curves representing joint measures such as flexion angles, velocity, acceleration, and so on. In many cases the aim consists of detecting differences in gait patterns when several independent samples of subjects walk or run under different conditions (repeated measures). Classic kinematic studies often analyse discrete summaries of the sample curves, discarding important information and providing biased results. As the sample data are obviously curves, a Functional Data Analysis approach is proposed to solve the problem of testing the equality of the mean curves of a functional variable observed on several independent groups under different treatments or time periods. A novel approach for Functional Analysis of Variance (FANOVA) for repeated measures that takes into account the complete curves is introduced. By assuming a basis expansion for each sample curve, the two-way FANOVA problem is reduced to Multivariate ANOVA for the multivariate response of basis coefficients. Then, two different approaches for MANOVA with repeated measures are considered. Besides, an extensive simulation study is developed to check their performance. Finally, two applications with gait data are developed."}, "https://arxiv.org/abs/2402.04828": {"title": "What drives the European carbon market?
Macroeconomic factors and forecasts", "link": "https://arxiv.org/abs/2402.04828", "description": "Putting a price on carbon -- with taxes or developing carbon markets -- is a widely used policy measure to achieve the target of net-zero emissions by 2050. This paper tackles the issue of producing point, direction-of-change, and density forecasts for the monthly real price of carbon within the EU Emissions Trading Scheme (EU ETS). We aim to uncover supply- and demand-side forces that can contribute to improving the prediction accuracy of models at short- and medium-term horizons. We show that a simple Bayesian Vector Autoregressive (BVAR) model, augmented with either one or two factors capturing a set of predictors affecting the price of carbon, provides substantial accuracy gains over a wide set of benchmark forecasts, including survey expectations and forecasts made available by data providers. We extend the study to verified emissions and demonstrate that, in this case, adding stochastic volatility can further improve the forecasting performance of a single-factor BVAR model. We rely on emissions and price forecasts to build market monitoring tools that track demand and price pressure in the EU ETS market. Our results are relevant for policymakers and market practitioners interested in quantifying the desired and unintended macroeconomic effects of monitoring the carbon market dynamics."}, "https://arxiv.org/abs/2402.04952": {"title": "Metrics on Markov Equivalence Classes for Evaluating Causal Discovery Algorithms", "link": "https://arxiv.org/abs/2402.04952", "description": "Many state-of-the-art causal discovery methods aim to generate an output graph that encodes the graphical separation and connection statements of the causal graph that underlies the data-generating process. In this work, we argue that an evaluation of a causal discovery method against synthetic data should include an analysis of how well this explicit goal is achieved by measuring how closely the separations/connections of the method's output align with those of the ground truth. We show that established evaluation measures do not accurately capture the difference in separations/connections of two causal graphs, and we introduce three new measures of distance called s/c-distance, Markov distance and Faithfulness distance that address this shortcoming. We complement our theoretical analysis with toy examples, empirical experiments and pseudocode."}, "https://arxiv.org/abs/2402.05030": {"title": "Inference for Two-Stage Extremum Estimators", "link": "https://arxiv.org/abs/2402.05030", "description": "We present a simulation-based approach to approximate the asymptotic variance and asymptotic distribution function of two-stage estimators. We focus on extremum estimators in the second stage and consider a large class of estimators in the first stage. This class includes extremum estimators, high-dimensional estimators, and other types of estimators (e.g., Bayesian estimators). We accommodate scenarios where the asymptotic distributions of both the first- and second-stage estimators are non-normal. We also allow for the second-stage estimator to exhibit a significant bias due to the first-stage sampling error. We introduce a debiased plug-in estimator and establish its limiting distribution. Our method is readily implementable with complex models. Unlike resampling methods, we eliminate the need for multiple computations of the plug-in estimator. 
Monte Carlo simulations confirm the effectiveness of our approach in finite samples. We present an empirical application with peer effects on adolescent fast-food consumption habits, where we employ the proposed method to address the issue of biased instrumental variable estimates resulting from the presence of many weak instruments."}, "https://arxiv.org/abs/2402.05065": {"title": "logitFD: an R package for functional principal component logit regression", "link": "https://arxiv.org/abs/2402.05065", "description": "The functional logit regression model was proposed by Escabias et al. (2004) with the objective of modeling a scalar binary response variable from a functional predictor. The model estimation proposed in that case was performed in a subspace of L2(T) of squared integrable functions of finite dimension, generated by a finite set of basis functions. For that estimation it was assumed that the curves of the functional predictor and the functional parameter of the model belong to the same finite subspace. The estimation so obtained was affected by high multicollinearity problems and the solution given to these problems was based on different functional principal component analysis. The logitFD package introduced here provides a toolbox for the fit of these models by implementing the different proposed solutions and by generalizing the model proposed in 2004 to the case of several functional and non-functional predictors. The performance of the functions is illustrated by using data sets of functional data included in the fda.usc package from R-CRAN."}, "https://arxiv.org/abs/2402.04355": {"title": "PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation", "link": "https://arxiv.org/abs/2402.04355", "description": "We propose a comprehensive sample-based method for assessing the quality of generative models. The proposed approach enables the estimation of the probability that two sets of samples are drawn from the same distribution, providing a statistically rigorous method for assessing the performance of a single generative model or the comparison of multiple competing models trained on the same dataset. This comparison can be conducted by dividing the space into non-overlapping regions and comparing the number of data samples in each region. The method only requires samples from the generative model and the test data. It is capable of functioning directly on high-dimensional data, obviating the need for dimensionality reduction. Significantly, the proposed method does not depend on assumptions regarding the density of the true distribution, and it does not rely on training or fitting any auxiliary models. Instead, it focuses on approximating the integral of the density (probability mass) across various sub-regions within the data space."}, "https://arxiv.org/abs/2402.04489": {"title": "De-amplifying Bias from Differential Privacy in Language Model Fine-tuning", "link": "https://arxiv.org/abs/2402.04489", "description": "Fairness and privacy are two important values machine learning (ML) practitioners often seek to operationalize in models. Fairness aims to reduce model bias for social/demographic sub-groups. Privacy via differential privacy (DP) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. The trade-offs between privacy and fairness goals of trustworthy ML pose a challenge to those wishing to address both. 
We show that DP amplifies gender, racial, and religious bias when fine-tuning large language models (LLMs), producing models more biased than ones fine-tuned without DP. We find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. Through the case of binary gender bias, we demonstrate that Counterfactual Data Augmentation (CDA), a known method for addressing bias, also mitigates bias amplification by DP. As a consequence, DP and CDA together can be used to fine-tune models while maintaining both fairness and privacy."}, "https://arxiv.org/abs/2402.04579": {"title": "Collective Counterfactual Explanations via Optimal Transport", "link": "https://arxiv.org/abs/2402.04579", "description": "Counterfactual explanations provide individuals with cost-optimal actions that can alter their labels to desired classes. However, if substantial instances seek state modification, such individual-centric methods can lead to new competitions and unanticipated costs. Furthermore, these recommendations, disregarding the underlying data distribution, may suggest actions that users perceive as outliers. To address these issues, our work proposes a collective approach for formulating counterfactual explanations, with an emphasis on utilizing the current density of the individuals to inform the recommended actions. Our problem naturally casts as an optimal transport problem. Leveraging the extensive literature on optimal transport, we illustrate how this collective method improves upon the desiderata of classical counterfactual explanations. We support our proposal with numerical simulations, illustrating the effectiveness of the proposed approach and its relation to classic methods."}, "https://arxiv.org/abs/2402.04602": {"title": "Online Quantile Regression", "link": "https://arxiv.org/abs/2402.04602", "description": "This paper tackles the challenge of integrating sequentially arriving data within the quantile regression framework, where the number of covariates is allowed to grow with the number of observations, the horizon is unknown, and memory is limited. We employ stochastic sub-gradient descent to minimize the empirical check loss and study its statistical properties and regret performance. In our analysis, we unveil the delicate interplay between updating iterates based on individual observations versus batches of observations, revealing distinct regularity properties in each scenario. Our method ensures long-term optimal estimation irrespective of the chosen update strategy. Importantly, our contributions go beyond prior works by achieving exponential-type concentration inequalities and attaining optimal regret and error rates that exhibit only short-term sensitivity to initial errors. A key insight from our study is the delicate statistical analyses and the revelation that appropriate stepsize schemes significantly mitigate the impact of initial errors on subsequent errors and regrets. This underscores the robustness of stochastic sub-gradient descent in handling initial uncertainties, emphasizing its efficacy in scenarios where the sequential arrival of data introduces uncertainties regarding both the horizon and the total number of observations. Additionally, when the initial error rate is well controlled, there is a trade-off between short-term error rate and long-term optimality. Due to the lack of delicate statistical analysis for square loss, we also briefly discuss its properties and proper schemes. 
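To picture the update rule behind the procedure just described, here is a minimal sketch of online quantile regression by stochastic sub-gradient descent on the check (pinball) loss; the 1/sqrt(t) step-size schedule, the zero initialization, and the simulated stream are illustrative assumptions rather than the schemes analyzed in the paper.

```python
# Minimal sketch: online quantile regression via stochastic sub-gradient descent on the check loss.
import numpy as np

def online_quantile_regression(stream, dim, tau=0.5, eta0=0.5):
    """Consume (x, y) pairs one at a time and yield the running coefficient estimate."""
    theta = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        resid = y - x @ theta
        psi = tau - float(resid < 0)                   # sub-gradient of the check loss in the residual
        theta = theta + (eta0 / np.sqrt(t)) * psi * x  # descent step on the per-observation loss
        yield theta.copy()

# Hypothetical usage with a simulated data stream:
# rng = np.random.default_rng(0)
# stream = ((x, x @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal())
#           for x in rng.standard_normal((10_000, 3)))
# for theta_t in online_quantile_regression(stream, dim=3, tau=0.9):
#     pass   # theta_t holds the running coefficient estimate after each observation
```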
Extensive simulations support our theoretical findings."}, "https://arxiv.org/abs/2402.04859": {"title": "Conditionality principle under unconstrained randomness", "link": "https://arxiv.org/abs/2402.04859", "description": "A very simple example demonstrates that Fisher's application of the conditionality principle to regression (\"fixed $x$ regression\"), endorsed by Sprott and many other followers, makes prediction impossible in the context of statistical learning theory. On the other hand, relaxing the requirement of conditionality makes it possible via, e.g., conformal prediction."}, "https://arxiv.org/abs/2110.00314": {"title": "Confounder importance learning for treatment effect inference", "link": "https://arxiv.org/abs/2110.00314", "description": "We address modelling and computational issues for multiple treatment effect inference under many potential confounders. A primary issue relates to preventing harmful effects from omitting relevant covariates (under-selection), while not running into over-selection issues that introduce substantial variance and a bias related to the non-random over-inclusion of covariates. We propose a novel empirical Bayes framework for Bayesian model averaging that learns from data the extent to which the inclusion of key covariates should be encouraged, specifically those highly associated to the treatments. A key challenge is computational. We develop fast algorithms, including an Expectation-Propagation variational approximation and simple stochastic gradient optimization algorithms, to learn the hyper-parameters from data. Our framework uses widely-used ingredients and largely existing software, and it is implemented within the R package mombf featured on CRAN. This work is motivated by and is illustrated in two applications. The first is the association between salary variation and discriminatory factors. The second, that has been debated in previous works, is the association between abortion policies and crime. Our approach provides insights that differ from previous analyses especially in situations with weaker treatment effects."}, "https://arxiv.org/abs/2202.12062": {"title": "Semiparametric Estimation of Dynamic Binary Choice Panel Data Models", "link": "https://arxiv.org/abs/2202.12062", "description": "We propose a new approach to the semiparametric analysis of panel data binary choice models with fixed effects and dynamics (lagged dependent variables). The model we consider has the same random utility framework as in Honore and Kyriazidou (2000). We demonstrate that, with additional serial dependence conditions on the process of deterministic utility and tail restrictions on the error distribution, the (point) identification of the model can proceed in two steps, and only requires matching the value of an index function of explanatory variables over time, as opposed to that of each explanatory variable. Our identification approach motivates an easily implementable, two-step maximum score (2SMS) procedure -- producing estimators whose rates of convergence, in contrast to Honore and Kyriazidou's (2000) methods, are independent of the model dimension. We then derive the asymptotic properties of the 2SMS procedure and propose bootstrap-based distributional approximations for inference. 
Monte Carlo evidence indicates that our procedure performs adequately in finite samples."}, "https://arxiv.org/abs/2206.08216": {"title": "Minimum Density Power Divergence Estimation for the Generalized Exponential Distribution", "link": "https://arxiv.org/abs/2206.08216", "description": "Statistical modeling of rainfall data is an active research area in agro-meteorology. The most common models fitted to such datasets are exponential, gamma, log-normal, and Weibull distributions. As an alternative to some of these models, the generalized exponential (GE) distribution was proposed by Gupta and Kundu (2001, Exponentiated Exponential Family: An Alternative to Gamma and Weibull Distributions, Biometrical Journal). Rainfall (specifically for short periods) datasets often include outliers, and thus, a proper robust parameter estimation procedure is necessary. Here, we use the popular minimum density power divergence estimation (MDPDE) procedure developed by Basu et al. (1998, Robust and Efficient Estimation by Minimising a Density Power Divergence, Biometrika) for estimating the GE parameters. We derive the analytical expressions for the estimating equations and asymptotic distributions. We analytically compare MDPDE with maximum likelihood estimation in terms of robustness, through an influence function analysis. Besides, we study the asymptotic relative efficiency of MDPDE analytically for different parameter settings. We apply the proposed technique to some simulated datasets and two rainfall datasets from Texas, United States. The results indicate superior performance of MDPDE compared to the other existing estimation techniques in most of the scenarios."}, "https://arxiv.org/abs/2210.08589": {"title": "Anytime-Valid Linear Models and Regression Adjusted Causal Inference in Randomized Experiments", "link": "https://arxiv.org/abs/2210.08589", "description": "Linear regression adjustment is commonly used to analyse randomised controlled experiments due to its efficiency and robustness against model misspecification. Current testing and interval estimation procedures leverage the asymptotic distribution of such estimators to provide Type-I error and coverage guarantees that hold only at a single sample size. Here, we develop the theory for the anytime-valid analogues of such procedures, enabling linear regression adjustment in the sequential analysis of randomised experiments. We first provide sequential $F$-tests and confidence sequences for the parametric linear model, which provide time-uniform Type-I error and coverage guarantees that hold for all sample sizes. We then relax all linear model parametric assumptions in randomised designs and provide nonparametric model-free sequential tests and confidence sequences for treatment effects. This formally allows experiments to be continuously monitored for significance, stopped early, and safeguards against statistical malpractices in data collection. A particular feature of our results is their simplicity. Our test statistics and confidence sequences all emit closed-form expressions, which are functions of statistics directly available from a standard linear regression table. 
We illustrate our methodology with the sequential analysis of software A/B experiments at Netflix, performing regression adjustment with pre-treatment outcomes."}, "https://arxiv.org/abs/2302.07663": {"title": "Testing for causal effect for binary data when propensity scores are estimated through Bayesian Networks", "link": "https://arxiv.org/abs/2302.07663", "description": "This paper proposes a new statistical approach for assessing treatment effects using Bayesian Networks (BNs). The goal is to draw causal inferences from observational data with a binary outcome and discrete covariates. The BNs are here used to estimate the propensity score, which enables flexible modeling and ensures maximum likelihood properties, including asymptotic efficiency. When the propensity score is estimated by BNs, two point estimators based on inverse probability weighting are considered - H\\'ajek and Horvitz-Thompson - and their main distributional properties are derived for constructing confidence intervals and testing hypotheses about the absence of the treatment effect. Empirical evidence is presented to show the effectiveness of the proposed methodology in a simulation study mimicking the characteristics of a real dataset of prostate cancer patients from Milan San Raffaele Hospital."}, "https://arxiv.org/abs/2306.01211": {"title": "Priming bias versus post-treatment bias in experimental designs", "link": "https://arxiv.org/abs/2306.01211", "description": "Conditioning on variables affected by treatment can induce post-treatment bias when estimating causal effects. Although this suggests that researchers should measure potential moderators before administering the treatment in an experiment, doing so may also bias causal effect estimation if the covariate measurement primes respondents to react differently to the treatment. This paper formally analyzes this trade-off between post-treatment and priming biases in three experimental designs that vary when moderators are measured: pre-treatment, post-treatment, or a randomized choice between the two. We derive nonparametric bounds for interactions between the treatment and the moderator in each design and show how to use substantive assumptions to narrow these bounds. These bounds allow researchers to assess the sensitivity of their empirical findings to either source of bias. We then apply these methods to a survey experiment on electoral messaging."}, "https://arxiv.org/abs/2307.01037": {"title": "Vector Quantile Regression on Manifolds", "link": "https://arxiv.org/abs/2307.01037", "description": "Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on a Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate and geological phenomena) and tori (dihedral angles in proteins). By leveraging optimal transport theory and c-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets and likelihoods.
We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through synthetic and real data experiments."}, "https://arxiv.org/abs/2303.10019": {"title": "Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices", "link": "https://arxiv.org/abs/2303.10019", "description": "This paper presents a new method for combining (or aggregating or ensembling) multivariate probabilistic forecasts, considering dependencies between quantiles and marginals through a smoothing procedure that allows for online learning. We discuss two smoothing methods: dimensionality reduction using Basis matrices and penalized smoothing. The new online learning algorithm generalizes the standard CRPS learning framework into multivariate dimensions. It is based on Bernstein Online Aggregation (BOA) and yields optimal asymptotic learning properties. The procedure uses horizontal aggregation, i.e., aggregation across quantiles. We provide an in-depth discussion on possible extensions of the algorithm and several nested cases related to the existing literature on online forecast combination. We apply the proposed methodology to forecasting day-ahead electricity prices, which are 24-dimensional distributional forecasts. The proposed method yields significant improvements over uniform combination in terms of continuous ranked probability score (CRPS). We discuss the temporal evolution of the weights and hyperparameters and present the results of reduced versions of the preferred model. A fast C++ implementation of the proposed algorithm is provided in the open-source R-Package profoc on CRAN."}, "https://arxiv.org/abs/2402.05194": {"title": "Multi-class classification of biomechanical data: A functional LDA approach based on multi-class penalized functional PLS", "link": "https://arxiv.org/abs/2402.05194", "description": "A functional linear discriminant analysis approach to classify a set of kinematic data (human movement curves of individuals performing different physical activities) is performed. Kinematic data, usually collected in linear acceleration or angular rotation format, can be identified with functions in a continuous domain (time, percentage of gait cycle, etc.). Since kinematic curves are measured in the same sample of individuals performing different activities, they are a clear example of functional data with repeated measures. On the other hand, the sample curves are observed with noise. Then, a roughness penalty might be necessary in order to provide a smooth estimation of the discriminant functions, which would make them more interpretable. Moreover, because of the infinite dimension of functional data, a reduction dimension technique should be considered. To solve these problems, we propose a multi-class approach for penalized functional partial least squares (FPLS) regression. Then linear discriminant analysis (LDA) will be performed on the estimated FPLS components. This methodology is motivated by two case studies. The first study considers the linear acceleration recorded every two seconds in 30 subjects, related to three different activities (walking, climbing stairs and down stairs). The second study works with the triaxial angular rotation, for each joint, in 51 children when they completed a cycle walking under three conditions (walking, carrying a backpack and pulling a trolley). 
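To make the two-stage idea concrete, the rough sketch below extracts PLS components from discretized curves and then runs LDA on the component scores; the basis expansion, the roughness penalty, and the repeated-measures structure of the actual proposal are omitted, and scikit-learn is used purely for illustration.

```python
# Rough sketch: PLS dimension reduction on discretized curves, then LDA on the PLS scores.
# The functional basis expansion and roughness penalty of the proposal are omitted here.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pls_lda(X_curves, labels, n_components=5):
    """Fit multi-class PLS on one-hot targets, then LDA on the resulting scores."""
    Y = (labels.reshape(-1, 1) == np.unique(labels)).astype(float)   # one-hot class indicators
    pls = PLSRegression(n_components=n_components).fit(X_curves, Y)
    lda = LinearDiscriminantAnalysis().fit(pls.transform(X_curves), labels)
    return pls, lda

def predict_activity(pls, lda, X_new):
    """Classify new curves from their PLS scores."""
    return lda.predict(pls.transform(X_new))
```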
A simulation study is also developed to compare the performance of the proposed functional LDA with the corresponding multivariate and non-penalized approaches."}, "https://arxiv.org/abs/2402.05209": {"title": "Stochastic modeling of Random Access Memories reset transitions", "link": "https://arxiv.org/abs/2402.05209", "description": "Resistive Random Access Memories (RRAMs) are being studied by industry and academia because it is widely accepted that they are promising candidates for the next generation of high-density nonvolatile memories. Taking into account the stochastic nature of the mechanisms behind resistive switching, a new technique based on the use of functional data analysis has been developed to accurately model resistive memory device characteristics. Functional principal component analysis (FPCA) based on the Karhunen-Loeve expansion is applied to obtain an orthogonal decomposition of the reset process in terms of uncorrelated scalar random variables. The device current can then be accurately described using just one variable, yielding a modeling approach that is very attractive from the circuit simulation viewpoint. The new method allows a comprehensive description of the stochastic variability of these devices by introducing a probability distribution that allows the simulation of the main parameter employed in the model implementation. A rigorous description of the mathematical theory behind the technique is given and its application to a broad set of experimental measurements is explained."}, "https://arxiv.org/abs/2402.05231": {"title": "Estimating Fold Changes from Partially Observed Outcomes with Applications in Microbial Metagenomics", "link": "https://arxiv.org/abs/2402.05231", "description": "We consider the problem of estimating fold-changes in the expected value of a multivariate outcome that is observed subject to unknown sample-specific and category-specific perturbations. We are motivated by high-throughput sequencing studies of the abundance of microbial taxa, in which microbes are systematically over- and under-detected relative to their true abundances. Our log-linear model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of parameter estimates in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test, and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated via a meta-analysis of microbial associations with colorectal cancer."}, "https://arxiv.org/abs/2402.05329": {"title": "Selective linear segmentation for detecting relevant parameter changes", "link": "https://arxiv.org/abs/2402.05329", "description": "Change-point processes are one flexible approach to model long time series. We propose a method to uncover which model parameters truly vary when a change-point is detected. Given a set of breakpoints, we use a penalized likelihood approach to select the best set of parameters that change over time, and we prove that the penalty function leads to a consistent selection of the true model.
Estimation is carried out via the deterministic annealing expectation-maximization algorithm. Our method accounts for model selection uncertainty and associates a probability to all the possible time-varying parameter specifications. Monte Carlo simulations highlight that the method works well for many time series models including heteroskedastic processes. For a sample of 14 Hedge funds (HF) strategies, using an asset based style pricing model, we shed light on the promising ability of our method to detect the time-varying dynamics of risk exposures as well as to forecast HF returns."}, "https://arxiv.org/abs/2402.05342": {"title": "Nonlinear Regression Analysis", "link": "https://arxiv.org/abs/2402.05342", "description": "Nonlinear regression analysis is a popular and important tool for scientists and engineers. In this article, we introduce theories and methods of nonlinear regression and its statistical inferences using the frequentist and Bayesian statistical modeling and computation. Least squares with the Gauss-Newton method is the most widely used approach to parameters estimation. Under the assumption of normally distributed errors, maximum likelihood estimation is equivalent to least squares estimation. The Wald confidence regions for parameters in a nonlinear regression model are affected by the curvatures in the mean function. Furthermore, we introduce the Newton-Raphson method and the generalized least squares method to deal with variance heterogeneity. Examples of simulation data analysis are provided to illustrate important properties of confidence regions and the statistical inferences using the nonlinear least squares estimation and Bayesian inference."}, "https://arxiv.org/abs/2402.05384": {"title": "Efficient Nonparametric Inference of Causal Mediation Effects with Nonignorable Missing Confounders", "link": "https://arxiv.org/abs/2402.05384", "description": "We consider causal mediation analysis with confounders subject to nonignorable missingness in a nonparametric framework. Our approach relies on shadow variables that are associated with the missing confounders but independent of the missingness mechanism. The mediation effect of interest is shown to be a weighted average of an iterated conditional expectation, which motivates our Sieve-based Iterative Outward (SIO) estimator. We derive the rate of convergence and asymptotic normality of the SIO estimator, which do not suffer from the ill-posed inverse problem. Essentially, we show that the asymptotic normality is not affected by the slow convergence rate of nonparametric estimators of nuisance functions. Moreover, we demonstrate that our estimator is locally efficient and attains the semiparametric efficiency bound under certain conditions. We accurately depict the efficiency loss attributable to missingness and identify scenarios in which efficiency loss is absent. We also propose a stable and easy-to-implement approach to estimate asymptotic variance and construct confidence intervals for the mediation effects. 
Finally, we evaluate the finite-sample performance of our proposed approach through simulation studies, and apply it to the CFPS data to show its practical applicability."}, "https://arxiv.org/abs/2402.05395": {"title": "Efficient Estimation for Functional Accelerated Failure Time Model", "link": "https://arxiv.org/abs/2402.05395", "description": "We propose a functional accelerated failure time model to characterize effects of both functional and scalar covariates on the time to event of interest, and provide regularity conditions to guarantee model identifiability. For efficient estimation of model parameters, we develop a sieve maximum likelihood approach where parametric and nonparametric coefficients are bundled with an unknown baseline hazard function in the likelihood function. Not only do the bundled parameters cause immense numerical difficulties, but they also result in new challenges in theoretical development. By developing a general theoretical framework, we overcome the challenges arising from the bundled parameters and derive the convergence rate of the proposed estimator. Furthermore, we prove that the finite-dimensional estimator is $\\sqrt{n}$-consistent, asymptotically normal and achieves the semiparametric information bound. The proposed inference procedures are evaluated by extensive simulation studies and illustrated with an application to the sequential organ failure assessment data from the Improving Care of Acute Lung Injury Patients study."}, "https://arxiv.org/abs/2402.05432": {"title": "Difference-in-Differences Estimators with Continuous Treatments and no Stayers", "link": "https://arxiv.org/abs/2402.05432", "description": "Many treatments or policy interventions are continuous in nature. Examples include prices, taxes or temperatures. Empirical researchers have usually relied on two-way fixed effect regressions to estimate treatment effects in such cases. However, such estimators are not robust to heterogeneous treatment effects in general; they also rely on the linearity of treatment effects. We propose estimators for continuous treatments that do not impose those restrictions, and that can be used when there are no stayers: the treatment of all units changes from one period to the next. We start by extending the nonparametric results of de Chaisemartin et al. (2023) to cases without stayers. We also present a parametric estimator, and use it to revisit Desch\\^enes and Greenstone (2012)."}, "https://arxiv.org/abs/2402.05479": {"title": "A comparison of the effects of different methodologies on the statistics learning profiles of prospective primary education teachers from a gender perspective", "link": "https://arxiv.org/abs/2402.05479", "description": "Over the last decades, it has been shown that teaching and learning statistics is complex, regardless of the teaching methodology. This research presents the different learning profiles identified in a group of future Primary Education (PE) teachers during the study of the Statistics block depending on the methodology used and gender, where the sample consists of 132 students in the third year of the PE undergraduate degree in the University of the Basque Country (Universidad del Pa\\'is Vasco/Euskal Herriko Unibertsitatea, UPV/EHU). To determine the profiles, a cluster analysis technique has been used, where the main variables to determine them are, on the one hand, their statistical competence development and, on the other hand, the evolution of their attitude towards statistics. 
In order to better understand the nature of the profiles obtained, the type of teaching methodology used to work on the Statistics block has been taken into account. This comparison is based on the fact that the sample is divided into two groups: one has worked with a Project Based Learning (PBL) methodology, while the other has worked with a methodology in which theoretical explanations and typically decontextualized exercises predominate. Among the results obtained, three differentiated profiles are observed, highlighting the proportion of students with an advantageous profile in the group where PBL is included. With regard to gender, the results show that women's attitudes toward statistics evolved more positively than men's after the sessions devoted to statistics in the PBL group."}, "https://arxiv.org/abs/2402.05633": {"title": "Full Law Identification under Missing Data in the Categorical Colluder Model", "link": "https://arxiv.org/abs/2402.05633", "description": "Missing data may be disastrous for the identifiability of causal and statistical estimands. In graphical missing data models, colluders are dependence structures that have a special importance for identification considerations. It has been shown that the presence of a colluder makes the full law, i.e., the joint distribution of variables and response indicators, non-parametrically non-identifiable. However, with additional mild assumptions regarding the variables involved with the colluder structure, identifiability is regained. We present a necessary and sufficient condition for the identification of the full law in the presence of a colluder structure with arbitrary categorical variables."}, "https://arxiv.org/abs/2402.05767": {"title": "Covariance matrix completion via auxiliary information", "link": "https://arxiv.org/abs/2402.05767", "description": "Covariance matrix estimation is an important task in the analysis of multivariate data in disparate scientific fields, including neuroscience, genomics, and astronomy. However, modern scientific data are often incomplete due to factors beyond the control of researchers, and data missingness may prohibit the use of traditional covariance estimation methods. Some existing methods address this problem by completing the data matrix, or by filling the missing entries of an incomplete sample covariance matrix by assuming a low-rank structure. We propose a novel approach that exploits auxiliary variables to complete covariance matrix estimates. An example of an auxiliary variable is the distance between neurons, which is usually inversely related to the strength of neuronal correlation. Our method extracts auxiliary information via regression, and involves a single tuning parameter that can be selected empirically. We compare our method with other matrix completion approaches via simulations, and apply it to the analysis of large-scale neuroscience data."}, "https://arxiv.org/abs/2402.05789": {"title": "High Dimensional Factor Analysis with Weak Factors", "link": "https://arxiv.org/abs/2402.05789", "description": "This paper studies the principal components (PC) estimator for high dimensional approximate factor models with weak factors in that the factor loading ($\\boldsymbol{\\Lambda}^0$) scales sublinearly in the number $N$ of cross-section units, i.e., $\\boldsymbol{\\Lambda}^{0\\top} \\boldsymbol{\\Lambda}^0 / N^\\alpha$ is positive definite in the limit for some $\\alpha \\in (0,1)$. 
While the consistency and asymptotic normality of these estimates are by now well known when the factors are strong, i.e., $\\alpha=1$, the statistical properties for weak factors remain less explored. Here, we show that the PC estimator maintains consistency and asymptotic normality for any $\\alpha\\in(0,1)$, provided suitable conditions regarding the dependence structure in the noise are met. This complements an earlier result by Onatski (2012) that the PC estimator is inconsistent when $\\alpha=0$, and the more recent work by Bai and Ng (2023) who established the asymptotic normality of the PC estimator when $\\alpha \\in (1/2,1)$. Our proof strategy integrates the traditional eigendecomposition-based approach for factor models with leave-one-out analysis similar in spirit to those used in matrix completion and other settings. This combination allows us to deal with factors weaker than the former and at the same time relax the incoherence and independence assumptions often associated with the latter."}, "https://arxiv.org/abs/2402.05844": {"title": "The CATT SATT on the MATT: semiparametric inference for sample treatment effects on the treated", "link": "https://arxiv.org/abs/2402.05844", "description": "We study variants of the average treatment effect on the treated with population parameters replaced by their sample counterparts. For each estimand, we derive the limiting distribution with respect to a semiparametric efficient estimator of the population effect and provide guidance on variance estimation. Included in our analysis is the well-known sample average treatment effect on the treated, for which we obtain some unexpected results. Unlike for the ordinary sample average treatment effect, we find that the asymptotic variance for the sample average treatment effect on the treated is point-identified and consistently estimable, but it potentially exceeds that of the population estimand. To address this shortcoming, we propose a modification that yields a new estimand, the mixed average treatment effect on the treated, which is always estimated more precisely than both the population and sample effects. We also introduce a second new estimand that arises from an alternative interpretation of the treatment effect on the treated with which all individuals are weighted by the propensity score."}, "https://arxiv.org/abs/2402.05377": {"title": "Association between Sitting Time and Urinary Incontinence in the US Population: data from the National Health and Nutrition Examination Survey (NHANES) 2007 to 2018", "link": "https://arxiv.org/abs/2402.05377", "description": "Background Urinary incontinence (UI) is a common health problem that affects the life and health quality of millions of people in the US. We aimed to investigate the association between sitting time and UI. Methods A cross-sectional survey of adult participants of the National Health and Nutrition Examination Survey 2007-2018 was performed. Weighted multivariable logistic and regression models were conducted to assess the association between sitting time and UI. Results A total of 22916 participants were enrolled. Prolonged sitting time was associated with urgent UI (UUI, Odds ratio [OR] = 1.184, 95% Confidence interval [CI] = 1.076 to 1.302, P = 0.001). Compared with patients with sitting time shorter than 7 hours, moderate activity increased the risk of prolonged sitting time over 7 hours in the fully-adjusted model (OR = 2.537, 95% CI = 1.419 to 4.536, P = 0.002). 
Sitting time over 7 hours was related to male mixed UI (MUI, OR = 1.581, 95% CI = 1.129 to 2.213, P = 0.010), and female stress UI (SUI, OR = 0.884, 95% CI = 0.795 to 0.983, P = 0.026) in the fully-adjusted model. Conclusions Prolonged sedentary sitting time (> 7 hours) indicated a high risk of UUI in all populations, female SUI and male MUI. Compared with sitting time shorter than 7 hours, moderate activity could not reverse the risk of prolonged sitting, which warrants further studies for confirmation."}, "https://arxiv.org/abs/2402.05438": {"title": "Penalized spline estimation of principal components for sparse functional data: rates of convergence", "link": "https://arxiv.org/abs/2402.05438", "description": "This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functions motivated by the matrix Bregman divergence, and the penalty term is the integrated squared derivative. The theory reveals that the asymptotic behavior of penalized spline estimators depends on the interesting interplay between several factors, i.e., the smoothness of the unknown functions, the spline degree, the spline knot number, the penalty order, and the penalty parameter. The theory also classifies the asymptotic behavior into seven scenarios and characterizes whether and how the minimax optimal rates of convergence are achievable in each scenario."}, "https://arxiv.org/abs/2201.01793": {"title": "Spectral Clustering with Variance Information for Group Structure Estimation in Panel Data", "link": "https://arxiv.org/abs/2201.01793", "description": "Consider a panel data setting where repeated observations on individuals are available. Often it is reasonable to assume that there exist groups of individuals that share similar effects of observed characteristics, but the grouping is typically unknown in advance. We first conduct a local analysis which reveals that the variances of the individual coefficient estimates contain useful information for the estimation of group structure. We then propose a method to estimate unobserved groupings for general panel data models that explicitly account for the variance information. Our proposed method remains computationally feasible with a large number of individuals and/or repeated measurements on each individual. The developed ideas can also be applied even when individual-level data are not available and only parameter estimates together with some quantification of estimation uncertainty are given to the researcher. A thorough simulation study demonstrates the superior performance of our method over existing methods, and we apply the method to two empirical applications."}, "https://arxiv.org/abs/2205.08586": {"title": "Treatment Choice with Nonlinear Regret", "link": "https://arxiv.org/abs/2205.08586", "description": "The literature focuses on the mean of welfare regret, which can lead to undesirable treatment choice due to sensitivity to sampling uncertainty. We propose to minimize the mean of a nonlinear transformation of regret and show that singleton rules are not essentially complete for nonlinear regret. Focusing on mean square regret, we derive closed-form fractions for finite-sample Bayes and minimax optimal rules. 
Our approach is grounded in decision theory and extends to limit experiments. The treatment fractions can be viewed as the strength of evidence favoring treatment. We apply our framework to a normal regression model and sample size calculation."}, "https://arxiv.org/abs/2205.08644": {"title": "Benefits and costs of matching prior to a Difference in Differences analysis when parallel trends does not hold", "link": "https://arxiv.org/abs/2205.08644", "description": "The Difference in Difference (DiD) estimator is a popular estimator built on the \"parallel trends\" assumption, which is an assertion that the treatment group, absent treatment, would change \"similarly\" to the control group over time. To bolster such a claim, one might generate a comparison group, via matching, that is similar to the treated group with respect to pre-treatment outcomes and/or pre-treatment covariates. Unfortunately, as has been previously pointed out, this intuitively appealing approach also has a cost in terms of bias. To assess the trade-offs of matching in our application, we first characterize the bias of matching prior to a DiD analysis under a linear structural model that allows for time-invariant observed and unobserved confounders with time-varying effects on the outcome. Given our framework, we verify that matching on baseline covariates generally reduces bias. We further show how additionally matching on pre-treatment outcomes has both cost and benefit. First, matching on pre-treatment outcomes partially balances unobserved confounders, which mitigates some bias. This reduction is proportional to the outcome's reliability, a measure of how coupled the outcomes are with the latent covariates. Offsetting these gains, matching also injects bias into the final estimate by undermining the second difference in the DiD via a regression-to-the-mean effect. Consequently, we provide heuristic guidelines for determining to what degree the bias reduction of matching is likely to outweigh the bias cost. We illustrate our guidelines by reanalyzing a principal turnover study that used matching prior to a DiD analysis and find that matching on both the pre-treatment outcomes and observed covariates makes the estimated treatment effect more credible."}, "https://arxiv.org/abs/2211.15128": {"title": "A New Formula for Faster Computation of the K-Fold Cross-Validation and Good Regularisation Parameter Values in Ridge Regression", "link": "https://arxiv.org/abs/2211.15128", "description": "In the present paper, we prove a new theorem resulting in an update formula for linear regression model residuals that calculates the exact k-fold cross-validation residuals for any choice of cross-validation strategy without model refitting. The required matrix inversions are limited by the cross-validation segment sizes and can be executed with high efficiency in parallel. The well-known formula for leave-one-out cross-validation follows as a special case of the theorem. In situations where the cross-validation segments consist of small groups of repeated measurements, we suggest a heuristic strategy for fast serial approximations of the cross-validated residuals and associated Predicted Residual Sum of Squares (PRESS) statistic. We also suggest strategies for efficient estimation of the minimum PRESS value and full PRESS function over a selected interval of regularisation values. 
The computational effectiveness of the parameter selection for Ridge- and Tikhonov regression modelling resulting from our theoretical findings and heuristic arguments is demonstrated in several applications with real and highly multivariate datasets."}, "https://arxiv.org/abs/2305.08172": {"title": "Fast Signal Region Detection with Application to Whole Genome Association Studies", "link": "https://arxiv.org/abs/2305.08172", "description": "Research on the localization of the genetic basis associated with diseases or traits has been widely conducted in the last few decades. Scan methods have been developed for region-based analysis in whole-genome association studies, helping us better understand how genetics influences human diseases or traits, especially when the aggregated effects of multiple causal variants are present. In this paper, we propose a fast and effective algorithm coupled with a high-dimensional test for simultaneously detecting multiple signal regions, which is distinct from existing methods using scan or knockoff statistics. The idea is to conduct binary splitting with re-search and arrangement based on a sequence of dynamic critical values to increase detection accuracy and reduce computation. Theoretical and empirical studies demonstrate that our approach enjoys favorable theoretical guarantees with fewer restrictions and exhibits superior numerical performance with faster computation. Utilizing the UK Biobank data to identify the genetic regions related to breast cancer, we confirm previous findings and, meanwhile, identify a number of new regions which suggest a strong association with the risk of breast cancer and deserve further investigation."}, "https://arxiv.org/abs/2305.15951": {"title": "Distributed model building and recursive integration for big spatial data modeling", "link": "https://arxiv.org/abs/2305.15951", "description": "Motivated by the need for computationally tractable spatial methods in neuroimaging studies, we develop a distributed and integrated framework for estimation and inference of Gaussian process model parameters with ultra-high-dimensional likelihoods. We propose a shift in viewpoint from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights on autism spectrum disorder from the Autism Brain Imaging Data Exchange."}, "https://arxiv.org/abs/2307.02388": {"title": "Multi-Task Learning with Summary Statistics", "link": "https://arxiv.org/abs/2307.02388", "description": "Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, the application of multi-task learning to real-world settings is hindered by data-sharing constraints, especially in healthcare settings. To address this challenge, we propose a flexible multi-task learning framework utilizing summary statistics from various sources. 
Additionally, we present an adaptive parameter selection approach based on a variant of Lepski's method, allowing for data-driven tuning parameter selection when only summary statistics are available. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the sample complexity and overlap. We demonstrate our theoretical findings and the performance of the method through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields."}, "https://arxiv.org/abs/2307.03748": {"title": "Incentive-Theoretic Bayesian Inference for Collaborative Science", "link": "https://arxiv.org/abs/2307.03748", "description": "Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior -- their choice to run a trial or not. In particular, we show how the principal can design a policy to elucidate partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful."}, "https://arxiv.org/abs/2307.05251": {"title": "Minimizing robust density power-based divergences for general parametric density models", "link": "https://arxiv.org/abs/2307.05251", "description": "Density power divergence (DPD) is designed to robustly estimate the underlying distribution of observations, in the presence of outliers. However, DPD involves an integral of the power of the parametric density models to be estimated; the explicit form of the integral term can be derived only for specific densities, such as normal and exponential densities. While we may perform a numerical integration for each iteration of the optimization algorithms, the computational complexity has hindered the practical application of DPD-based estimation to more general parametric densities. To address the issue, this study introduces a stochastic approach to minimize DPD for general parametric density models. The proposed approach can also be employed to minimize other density power-based $\\gamma$-divergences, by leveraging unnormalized models. 
We provide an \\verb|R| package for implementation of the proposed approach at \\url{https://github.com/oknakfm/sgdpd}."}, "https://arxiv.org/abs/2307.06137": {"title": "Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case", "link": "https://arxiv.org/abs/2307.06137", "description": "Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes."}, "https://arxiv.org/abs/2312.01210": {"title": "When accurate prediction models yield harmful self-fulfilling prophecies", "link": "https://arxiv.org/abs/2312.01210", "description": "Objective: Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. Many prediction models are deployed for decision support based on their prediction accuracy in validation studies. We investigate whether this is a safe and valid approach.\n Materials and Methods: We show that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model.\n Results: Our main result is a formal characterization of a set of such prediction models. 
Next, we show that models that are well calibrated before and after deployment are useless for decision making, as they make no change in the data distribution.\n Discussion: Our results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.\n Conclusion: Outcome prediction models can yield harmful self-fulfilling prophecies when used for decision making; a new perspective on prediction model development, deployment and monitoring is needed."}, "https://arxiv.org/abs/2401.11119": {"title": "Constraint-based measures of shift and relative shift for discrete frequency distributions", "link": "https://arxiv.org/abs/2401.11119", "description": "The comparison of frequency distributions is a common statistical task with broad applications and a long history of methodological development. However, existing measures do not quantify the magnitude and direction by which one distribution is shifted relative to another. In the present study, we define distributional shift (DS) as the concentration of frequencies away from the greatest discrete class, e.g., a histogram's right-most bin. We derive a measure of DS based on the sum of cumulative frequencies, intuitively quantifying shift as a statistical moment. We then define relative distributional shift (RDS) as the difference in DS between distributions. Using simulated random sampling, we demonstrate that RDS is highly related to measures that are popularly used to compare frequency distributions. Focusing on a specific use case, i.e., simulated healthcare Evaluation and Management coding profiles, we show how RDS can be used to examine many pairs of empirical and expected distributions via shift-significance plots. In comparison to other measures, RDS has the unique advantage of being a signed (directional) measure based on a simple difference in an intuitive property."}, "https://arxiv.org/abs/2109.12006": {"title": "An overview of variable selection procedures using regularization paths in high-dimensional Gaussian linear regression", "link": "https://arxiv.org/abs/2109.12006", "description": "Current high-throughput technologies provide a large number of variables to describe a phenomenon. Only a few variables are generally sufficient to answer the question. Identifying them in a high-dimensional Gaussian linear regression model is one of the most-used statistical methods. In this article, we describe step-by-step the variable selection procedures built upon regularization paths. Regularization paths are obtained by combining a regularization function and an algorithm. Then, they are combined either with a model selection procedure using penalty functions or with a sampling strategy to obtain the final selected variables. We perform a comparison study by considering three simulation settings with various dependency structures on variables. In all the settings, we evaluate (i) the ability to discriminate between the active variables and the non-active variables along the regularization path (pROC-AUC), (ii) the prediction performance of the selected variable subset (MSE) and (iii) the relevance of the selected variables (recall, specificity, FDR). From the results, we provide recommendations on strategies to be favored depending on the characteristics of the problem at hand. 
We find that the Elastic-net regularization function provides better results than the $\\ell_1$ one most of the time, and that the lars algorithm is to be preferred over the GD one. ESCV provides the best prediction performance. Bolasso and the knockoffs method are judicious choices to limit the selection of non-active variables while ensuring selection of enough active variables. Conversely, the data-driven penalties considered in this review are not to be favored. As for Tigress and LinSelect, they are conservative methods."}, "https://arxiv.org/abs/2309.09323": {"title": "Answering Causal Queries at Layer 3 with DiscoSCMs-Embracing Heterogeneity", "link": "https://arxiv.org/abs/2309.09323", "description": "In the realm of causal inference, Potential Outcomes (PO) and Structural Causal Models (SCM) are recognized as the principal frameworks. However, when it comes to Layer 3 valuations -- counterfactual queries deeply entwined with individual-level semantics -- both frameworks encounter limitations due to the degenerative issues brought forth by the consistency rule. This paper advocates for the Distribution-consistency Structural Causal Models (DiscoSCM) framework as a pioneering approach to counterfactual inference, skillfully integrating the strengths of both PO and SCM. The DiscoSCM framework distinctively incorporates a unit selection variable $U$ and embraces the concept of uncontrollable exogenous noise realization. Through personalized incentive scenarios, we demonstrate the inadequacies of PO and SCM frameworks in representing the probability of a user being a complier (a Layer 3 event) without degeneration, an issue adeptly resolved by adopting the assumption of independent counterfactual noises within DiscoSCM. This innovative assumption broadens the foundational counterfactual theory, facilitating the extension of numerous theoretical results regarding the probability of causation to an individual granularity level and leading to a comprehensive set of theories on heterogeneous counterfactual bounds. Ultimately, our paper posits that if one acknowledges and wishes to leverage the ubiquitous heterogeneity, understanding causality as invariance across heterogeneous units, then DiscoSCM stands as a significant advancement in the methodology of counterfactual inference."}, "https://arxiv.org/abs/2401.03756": {"title": "Adaptive Experimental Design for Policy Learning", "link": "https://arxiv.org/abs/2401.03756", "description": "Evidence-based targeting has been a topic of growing interest among the practitioners of policy and business. Formulating the decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study an optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner assigns treatment arms adaptively over sequentially arriving experimental units upon observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule to the population. 
Setting the worst-case expected regret as the performance criterion of adaptive sampling and recommended policies, we derive its asymptotic lower bounds, and propose a strategy, Adaptive Sampling-Policy Learning strategy (PLAS), whose leading factor of the regret upper bound aligns with the lower bound as the size of experimental units increases."}, "https://arxiv.org/abs/2402.06058": {"title": "Mathematical programming tools for randomization purposes in small two-arm clinical trials: A case study with real data", "link": "https://arxiv.org/abs/2402.06058", "description": "Modern randomization methods in clinical trials are invariably adaptive, meaning that the assignment of the next subject to a treatment group uses the accumulated information in the trial. Some of the recent adaptive randomization methods use mathematical programming to construct attractive clinical trials that balance the group features, such as their sizes and covariate distributions of their subjects. We review some of these methods and compare their performance with common covariate-adaptive randomization methods for small clinical trials. We introduce an energy distance measure that compares the discrepancy between the two groups using the joint distribution of the subjects' covariates. This metric is more appealing than evaluating the discrepancy between the groups using their marginal covariate distributions. Using numerical experiments, we demonstrate the advantages of the mathematical programming methods under the new measure. In the supplementary material, we provide R codes to reproduce our study results and facilitate comparisons of different randomization procedures."}, "https://arxiv.org/abs/2402.06122": {"title": "Peeking with PEAK: Sequential, Nonparametric Composite Hypothesis Tests for Means of Multiple Data Streams", "link": "https://arxiv.org/abs/2402.06122", "description": "We propose a novel nonparametric sequential test for composite hypotheses for means of multiple data streams. Our proposed method, \\emph{peeking with expectation-based averaged capital} (PEAK), builds upon the testing-as-betting framework and provides a non-asymptotic $\\alpha$-level test across any stopping time. PEAK is computationally tractable and efficiently rejects hypotheses that are incorrect across all potential distributions that satisfy our nonparametric assumption, enabling joint composite hypothesis testing on multiple streams of data. We numerically validate our theoretical findings under the best arm identification and threshold identification in the bandit setting, illustrating the computational efficiency of our method against state-of-the-art testing methods."}, "https://arxiv.org/abs/2402.06133": {"title": "Leveraging Quadratic Polynomials in Python for Advanced Data Analysis", "link": "https://arxiv.org/abs/2402.06133", "description": "The article provides a comprehensive overview of using quadratic polynomials in Python for modeling and analyzing data. It starts by explaining the basic concept of a quadratic polynomial, its general form, and its significance in capturing the curvature in data indicative of natural phenomena. The paper highlights key features of quadratic polynomials, their applications in regression analysis, and the process of fitting these polynomials to data using Python's `numpy` and `matplotlib` libraries. It also discusses the calculation of the coefficient of determination (R-squared) to quantify the fit of the polynomial model. 
Practical examples, including Python scripts, are provided to demonstrate how to apply these concepts in data analysis. The document serves as a bridge between theoretical knowledge and applied analytics, aiding in understanding and communicating data patterns."}, "https://arxiv.org/abs/2402.06228": {"title": "Towards participatory multi-modeling for policy support across domains and scales: a systematic procedure for integral multi-model design", "link": "https://arxiv.org/abs/2402.06228", "description": "Policymaking for complex challenges such as pandemics necessitates the consideration of intricate implications across multiple domains and scales. Computational models can support policymaking, but a single model is often insufficient for such multidomain and scale challenges. Multi-models comprising several interacting computational models at different scales or relying on different modeling paradigms offer a potential solution. Such multi-models can be assembled from existing computational models (i.e., integrated modeling) or be designed conceptually as a whole before their computational implementation (i.e., integral modeling). Integral modeling is particularly valuable for novel policy problems, such as those faced in the early stages of a pandemic, where relevant models may be unavailable or lack standard documentation. Designing such multi-models through an integral approach is, however, a complex task requiring the collaboration of modelers and experts from various domains. In this collaborative effort, modelers must precisely define the domain knowledge needed from experts and establish a systematic procedure for translating such knowledge into a multi-model. Yet, these requirements and systematic procedures are currently lacking for multi-models that are both multiscale and multi-paradigm. We address this challenge by introducing a procedure for developing multi-models with an integral approach based on clearly defined domain knowledge requirements derived from literature. We illustrate this procedure using the case of school closure policies in the Netherlands during the COVID-19 pandemic, revealing their potential implications in the short and long term and across the healthcare and educational domains. The requirements and procedure provided in this article advance the application of integral multi-modeling for policy support in multiscale and multidomain contexts."}, "https://arxiv.org/abs/2402.06382": {"title": "Robust Rao-type tests for step-stress accelerated life-tests under interval-monitoring and Weibull lifetime distributions", "link": "https://arxiv.org/abs/2402.06382", "description": "Many products in engineering are highly reliable with large mean lifetimes to failure. Performing lifetests under normal operations conditions would thus require long experimentation times and high experimentation costs. Alternatively, accelerated lifetests shorten the experimentation time by running the tests at higher than normal stress conditions, thus inducing more failures. Additionally, a log-linear regression model can be used to relate the lifetime distribution of the product to the level of stress it experiences. After estimating the parameters of this relationship, results can be extrapolated to normal operating conditions. On the other hand, censored data is common in reliability analysis. Interval-censored data arise when continuous inspection is difficult or infeasible due to technical or budgetary constraints. 
In this paper, we develop robust restricted estimators based on the density power divergence for step-stress accelerated life-tests under Weibull distributions with interval-censored data. We present theoretical asymptotic properties of the estimators and develop robust Rao-type test statistics based on the proposed robust estimators for testing composite null hypothesis on the model parameters."}, "https://arxiv.org/abs/2402.06410": {"title": "Manifold-valued models for analysis of EEG time series data", "link": "https://arxiv.org/abs/2402.06410", "description": "We propose a model for time series taking values on a Riemannian manifold and fit it to time series of covariance matrices derived from EEG data for patients suffering from epilepsy. The aim of the study is two-fold: to develop a model with interpretable parameters for different possible modes of EEG dynamics, and to explore the extent to which modelling results are affected by the choice of manifold and its associated geometry. The model specifies a distribution for the tangent direction vector at any time point, combining an autoregressive term, a mean reverting term and a form of Gaussian noise. Parameter inference is carried out by maximum likelihood estimation, and we compare modelling results obtained using the standard Euclidean geometry on covariance matrices and the affine invariant geometry. Results distinguish between epileptic seizures and interictal periods between seizures in patients: between seizures the dynamics have a strong mean reverting component and the autoregressive component is missing, while for the majority of seizures there is a significant autoregressive component and the mean reverting effect is weak. The fitted models are also used to compare seizures within and between patients. The affine invariant geometry is advantageous and it provides a better fit to the data."}, "https://arxiv.org/abs/2402.06428": {"title": "Smooth Transformation Models for Survival Analysis: A Tutorial Using R", "link": "https://arxiv.org/abs/2402.06428", "description": "Over the last five decades, we have seen strong methodological advances in survival analysis, mainly in two separate strands: One strand is based on a parametric approach that assumes some response distribution. More prominent, however, is the strand of flexible methods which rely mainly on non-/semi-parametric estimation. As the methodological landscape continues to evolve, the task of navigating through the multitude of methods and identifying corresponding available software resources is becoming increasingly difficult. This task becomes particularly challenging in more complex scenarios, such as when dealing with interval-censored or clustered survival data, non-proportionality, or dependent censoring.\n In this tutorial, we explore the potential of using smooth transformation models for survival analysis in the R system for statistical computing. These models provide a unified maximum likelihood framework that covers a range of survival models, including well-established ones such as the Weibull model and a fully parameterised version of the famous Cox proportional hazards model, as well as extensions to more complex scenarios. 
We explore smooth transformation models for non-proportional/crossing hazards, dependent censoring, clustered observations and extensions towards personalised medicine within this framework.\n By fitting these models to survival data from a two-arm randomised controlled trial on rectal cancer therapy, we demonstrate how survival analysis tasks can be seamlessly navigated within the smooth transformation model framework in R. This is achieved by the implementation provided by the \"tram\" package and few related packages."}, "https://arxiv.org/abs/2402.06480": {"title": "Optimal Forecast Reconciliation with Uncertainty Quantification", "link": "https://arxiv.org/abs/2402.06480", "description": "We propose to estimate the weight matrix used for forecast reconciliation as parameters in a general linear model in order to quantify its uncertainty. This implies that forecast reconciliation can be formulated as an orthogonal projection from the space of base-forecast errors into a coherent linear subspace. We use variance decomposition together with the Wishart distribution to derive the central estimator for the forecast-error covariance matrix. In addition, we prove that distance-reducing properties apply to the reconciled forecasts at all levels of the hierarchy as well as to the forecast-error covariance. A covariance matrix for the reconciliation weight matrix is derived, which leads to improved estimates of the forecast-error covariance matrix. We show how shrinkage can be introduced in the formulated model by imposing specific priors on the weight matrix and the forecast-error covariance matrix. The method is illustrated in a simulation study that shows consistent improvements in the log-score. Finally, standard errors for the weight matrix and the variance-separation formula are illustrated using a case study of forecasting electricity load in Sweden."}, "https://arxiv.org/abs/2402.05940": {"title": "Causal Relationship Network of Risk Factors Impacting Workday Loss in Underground Coal Mines", "link": "https://arxiv.org/abs/2402.05940", "description": "This study aims to establish the causal relationship network between various factors leading to workday loss in underground coal mines using a novel causal artificial intelligence (AI) method. The analysis utilizes data obtained from the National Institute for Occupational Safety and Health (NIOSH). A total of 101,010 injury records from 3,982 unique underground coal mines spanning the years from 1990 to 2020 were extracted from the NIOSH database. Causal relationships were analyzed and visualized using a novel causal AI method called Grouped Greedy Equivalence Search (GGES). The impact of each variable on workday loss was assessed through intervention do-calculus adjustment (IDA) scores. Model training and validation were performed using the 10-fold cross-validation technique. Performance metrics, including adjacency precision (AP), adjacency recall (AR), arrowhead precision (AHP), and arrowhead recall (AHR), were utilized to evaluate the models. Findings revealed that after 2006, key direct causes of workday loss among mining employees included total mining experience, mean office employees, mean underground employees, county, and total mining experience (years). Total mining experience emerged as the most influential factor, whereas mean employees per mine exhibited the least influence. The analyses emphasized the significant role of total mining experience in determining workday loss. 
The models achieved optimal performance, with AP, AR, AHP, and AHR values measuring 0.694, 0.653, 0.386, and 0.345, respectively. This study demonstrates the feasibility of utilizing the new GGES method to clarify the causal factors behind workday loss by analyzing employment demographics and injury records, and to establish their causal relationship network."}, "https://arxiv.org/abs/2402.06048": {"title": "Coherence-based Input Design for Sparse System Identification", "link": "https://arxiv.org/abs/2402.06048", "description": "The maximum absolute correlation between regressors, which is called mutual coherence, plays an essential role in sparse estimation. A regressor matrix whose columns are highly correlated may result from optimal input design, since there is no constraint on the mutual coherence, so when this regressor is used to estimate sparse parameter vectors of a system, it may yield a large estimation error. This paper aims to tackle this issue for fixed denominator models, which include Laguerre, Kautz, and generalized orthonormal basis function expansion models, for example. The paper proposes an optimal input design method where the achieved Fisher information matrix is fitted to the desired Fisher matrix, together with a coordinate transformation designed to make the regressors in the transformed coordinates have low mutual coherence. The method can be used together with any sparse estimation method and in a numerical study we show its potential for alleviating the problem of model order selection when used in conjunction with, for example, classical methods such as AIC and BIC."}, "https://arxiv.org/abs/2402.06571": {"title": "Weighted cumulative residual Entropy Generating Function and its properties", "link": "https://arxiv.org/abs/2402.06571", "description": "The generating function approach to entropy has become popular as it generates several well-known entropy measures discussed in the literature. In this work, we define the weighted cumulative residual entropy generating function (WCREGF) and study its properties. We then introduce the dynamic weighted cumulative residual entropy generating function (DWCREGF). It is shown that the DWCREGF determines the distribution uniquely. We study some characterization results using the relationship between the DWCREGF and the hazard rate and/or the mean residual life function. Using a characterization based on DWCREGF, we develop a new goodness-of-fit test for the Rayleigh distribution. A Monte Carlo simulation study is conducted to evaluate the proposed test. Finally, the test is illustrated using two real data sets."}, "https://arxiv.org/abs/2402.06574": {"title": "Prediction of air pollutants PM10 by ARBX(1) processes", "link": "https://arxiv.org/abs/2402.06574", "description": "This work adopts a Banach-valued time series framework for component-wise estimation and prediction, from temporally correlated functional data, in the presence of exogenous variables. The strong consistency of the proposed functional estimator and associated plug-in predictor is formulated. The simulation study undertaken illustrates their large-sample properties. 
Air pollutant PM10 curve forecasting in the Haute-Normandie region (France) is addressed by implementing the functional time series approach presented."}, "https://arxiv.org/abs/1409.2709": {"title": "Convergence of hybrid slice sampling via spectral gap", "link": "https://arxiv.org/abs/1409.2709", "description": "It is known that the simple slice sampler has robust convergence properties; however, the class of problems where it can be implemented is limited. In contrast, we consider hybrid slice samplers which are easily implementable and where another Markov chain approximately samples the uniform distribution on each slice. Under appropriate assumptions on the Markov chain on the slice we show a lower bound and an upper bound of the spectral gap of the hybrid slice sampler in terms of the spectral gap of the simple slice sampler. An immediate consequence of this is that spectral gap and geometric ergodicity of the hybrid slice sampler can be concluded from spectral gap and geometric ergodicity of its simple version, which is very well understood. These results indicate that robustness properties of the simple slice sampler are inherited by (appropriately designed) easily implementable hybrid versions. We apply the developed theory and analyse a number of specific algorithms such as the stepping-out shrinkage slice sampling, hit-and-run slice sampling on a class of multivariate targets and an easily implementable combination of both procedures on multidimensional bimodal densities."}, "https://arxiv.org/abs/2101.07934": {"title": "Meta-analysis of Censored Adverse Events", "link": "https://arxiv.org/abs/2101.07934", "description": "Meta-analysis is a powerful tool for assessing drug safety by combining treatment-related toxicological findings across multiple studies, as clinical trials are typically underpowered for detecting adverse drug effects. However, incomplete reporting of adverse events (AEs) in published clinical studies is a frequent issue, especially if the observed number of AEs is below a pre-specified study-dependent threshold. Ignoring the censored AE information, often found in lower frequency, can significantly bias the estimated incidence rate of AEs. Despite its importance, this common meta-analysis problem has received little statistical or analytic attention in the literature. To address this challenge, we propose a Bayesian approach to accommodating the censored and possibly rare AEs for meta-analysis of safety data. Through simulation studies, we demonstrate that the proposed method can improve accuracy in point and interval estimation of incidence probabilities, particularly in the presence of censored data. Overall, the proposed method provides a practical solution that can facilitate better-informed decisions regarding drug safety."}, "https://arxiv.org/abs/2208.11665": {"title": "Statistical exploration of the Manifold Hypothesis", "link": "https://arxiv.org/abs/2208.11665", "description": "The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space. This phenomenon is observed empirically in many real world situations, has led to development of a wide range of statistical methods in the last few decades, and has been suggested as a key factor in the success of modern AI technologies. 
We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model we derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism. These procedures operate under minimal assumptions and make use of well known, scaleable graph-analytic algorithms."}, "https://arxiv.org/abs/2305.13462": {"title": "Robust heavy-tailed versions of generalized linear models with applications in actuarial science", "link": "https://arxiv.org/abs/2305.13462", "description": "Generalized linear models (GLMs) form one of the most popular classes of models in statistics. The gamma variant is used, for instance, in actuarial science for the modelling of claim amounts in insurance. A flaw of GLMs is that they are not robust against outliers (i.e., against erroneous or extreme data points). A difference in trends in the bulk of the data and the outliers thus yields skewed inference and predictions. To address this problem, robust methods have been introduced. The most commonly applied robust method is frequentist and consists in an estimator which is derived from a modification of the derivative of the log-likelihood. We propose an alternative approach which is modelling-based and thus fundamentally different. It allows for an understanding and interpretation of the modelling, and it can be applied for both frequentist and Bayesian statistical analyses. The approach possesses appealing theoretical and empirical properties."}, "https://arxiv.org/abs/2306.08598": {"title": "Kernel Debiased Plug-in Estimation: Simultaneous, Automated Debiasing without Influence Functions for Many Target Parameters", "link": "https://arxiv.org/abs/2306.08598", "description": "In the problem of estimating target parameters in nonparametric models with nuisance parameters, substituting the unknown nuisances with nonparametric estimators can introduce \"plug-in bias.\" Traditional methods addressing this sub-optimal bias-variance trade-offs rely on the influence function (IF) of the target parameter. When estimating multiple target parameters, these methods require debiasing the nuisance parameter multiple times using the corresponding IFs, posing analytical and computational challenges. In this work, we leverage the targeted maximum likelihood estimation framework to propose a novel method named kernel debiased plug-in estimation (KDPE). KDPE refines an initial estimate through regularized likelihood maximization steps, employing a nonparametric model based on reproducing kernel Hilbert spaces. We show that KDPE (i) simultaneously debiases all pathwise differentiable target parameters that satisfy our regularity conditions, (ii) does not require the IF for implementation, and (iii) remains computationally tractable. 
We numerically illustrate the use of KDPE and validate our theoretical results."}, "https://arxiv.org/abs/2306.16593": {"title": "Autoregressive with Slack Time Series Model for Forecasting a Partially-Observed Dynamical Time Series", "link": "https://arxiv.org/abs/2306.16593", "description": "This study delves into the domain of dynamical systems, specifically the forecasting of dynamical time series defined through an evolution function. Traditional approaches in this area predict the future behavior of dynamical systems by inferring the evolution function. However, these methods may confront obstacles due to the presence of missing variables, which are usually attributed to challenges in measurement and a partial understanding of the system of interest. To overcome this obstacle, we introduce the autoregressive with slack time series (ARS) model, that simultaneously estimates the evolution function and imputes missing variables as a slack time series. Assuming time-invariance and linearity in the (underlying) entire dynamical time series, our experiments demonstrate the ARS model's capability to forecast future time series. From a theoretical perspective, we prove that a 2-dimensional time-invariant and linear system can be reconstructed by utilizing observations from a single, partially observed dimension of the system."}, "https://arxiv.org/abs/2310.11683": {"title": "Treatment bootstrapping: A new approach to quantify uncertainty of average treatment effect estimates", "link": "https://arxiv.org/abs/2310.11683", "description": "This paper proposes a new non-parametric bootstrap method to quantify the uncertainty of average treatment effect estimate for the treated from matching estimators. More specifically, it seeks to quantify the uncertainty associated with the average treatment effect estimate for the treated by bootstrapping the treatment group only and finding the counterpart control group by matching on estimated propensity score. We demonstrate the validity of this approach and compare it with existing bootstrap approaches through Monte Carlo simulation and real world data set. The results indicate that the proposed approach constructs confidence interval and standard error that have at least comparable precision and coverage rate as existing bootstrap approaches, while the variance estimates can fluctuate depending on the proportion of treatment group units in the sample data and the specific matching method used."}, "https://arxiv.org/abs/2401.15811": {"title": "Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms", "link": "https://arxiv.org/abs/2401.15811", "description": "Two-sided platforms are central to modern commerce and content sharing and often utilize A/B testing for developing new features. While user-side experiments are common, seller-side experiments become crucial for specific interventions and metrics. This paper investigates the effects of interference caused by feedback loops on seller-side experiments in two-sided platforms, with a particular focus on the counterfactual interleaving design, proposed in \\citet{ha2020counterfactual,nandy2021b}. These feedback loops, often generated by pacing algorithms, cause outcomes from earlier sessions to influence subsequent ones. This paper contributes by creating a mathematical framework to analyze this interference, theoretically estimating its impact, and conducting empirical evaluations of the counterfactual interleaving design in real-world scenarios. 
Our research shows that feedback loops can result in misleading conclusions about the treatment effects."}, "https://arxiv.org/abs/2212.06693": {"title": "Transfer Learning with Large-Scale Quantile Regression", "link": "https://arxiv.org/abs/2212.06693", "description": "Quantile regression is increasingly encountered in modern big data applications due to its robustness and flexibility. We consider the scenario of learning the conditional quantiles of a specific target population when the available data may go beyond the target and be supplemented from other sources that possibly share similarities with the target. A crucial question is how to properly distinguish and utilize useful information from other sources to improve the quantile estimation and inference at the target. We develop transfer learning methods for high-dimensional quantile regression by detecting informative sources whose models are similar to the target and utilizing them to improve the target model. We show that under reasonable conditions, the detection of the informative sources based on sample splitting is consistent. Compared to the naive estimator with only the target data, the transfer learning estimator achieves a much lower error rate as a function of the sample sizes, the signal-to-noise ratios, and the similarity measures among the target and the source models. Extensive simulation studies demonstrate the superiority of our proposed approach. We apply our methods to tackle the problem of detecting hard-landing risk for flight safety and show the benefits and insights gained from transfer learning of three different types of airplanes: Boeing 737, Airbus A320, and Airbus A380."}, "https://arxiv.org/abs/2304.09872": {"title": "Depth Functions for Partial Orders with a Descriptive Analysis of Machine Learning Algorithms", "link": "https://arxiv.org/abs/2304.09872", "description": "We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies of depth functions in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we analyze the distribution of different classifier performances over a sample of standard benchmark data sets. Our results promisingly demonstrate that our approach differs substantially from existing benchmarking approaches and, therefore, adds a new perspective to the vivid debate on the comparison of classifiers."}, "https://arxiv.org/abs/2305.15984": {"title": "Dynamic Inter-treatment Information Sharing for Individualized Treatment Effects Estimation", "link": "https://arxiv.org/abs/2305.15984", "description": "Estimation of individualized treatment effects (ITE) from observational studies is a fundamental problem in causal inference and holds significant importance across domains, including healthcare. However, limited observational datasets pose challenges in reliable ITE estimation as data have to be split among treatment groups to train an ITE learner. While information sharing among treatment groups can partially alleviate the problem, there is currently no general framework for end-to-end information sharing in ITE estimation. 
To tackle this problem, we propose a deep learning framework based on `\\textit{soft weight sharing}' to train ITE learners, enabling \\textit{dynamic end-to-end} information sharing among treatment groups. The proposed framework complements existing ITE learners, and introduces a new class of ITE learners, referred to as \\textit{HyperITE}. We extend state-of-the-art ITE learners with \\textit{HyperITE} versions and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework reduces ITE estimation error, with increasing effectiveness for smaller datasets."}, "https://arxiv.org/abs/2402.06915": {"title": "Detection and inference of changes in high-dimensional linear regression with non-sparse structures", "link": "https://arxiv.org/abs/2402.06915", "description": "For the data segmentation problem in high-dimensional linear regression settings, a commonly made assumption is that the regression parameters are segment-wise sparse, which enables many existing methods to estimate the parameters locally via $\\ell_1$-regularised maximum likelihood-type estimation and contrast them for change point detection. Contrary to the common belief, we show that the sparsity of neither regression parameters nor their differences, a.k.a.\\ differential parameters, is necessary for achieving consistency in multiple change point detection. In fact, both statistically and computationally, better efficiency is attained by a simple strategy that scans for large discrepancies in local covariance between the regressors and the response. We go a step further and propose a suite of tools for conducting inference directly on the differential parameters post-segmentation, which are applicable even when the regression parameters themselves are non-sparse. Theoretical investigations are conducted under general conditions permitting non-Gaussianity, temporal dependence and ultra-high dimensionality. Numerical experiments demonstrate the competitiveness of the proposed methodologies."}, "https://arxiv.org/abs/2402.07022": {"title": "A product-limit estimator of the conditional survival function when cure status is partially known", "link": "https://arxiv.org/abs/2402.07022", "description": "We introduce a nonparametric estimator of the conditional survival function in the mixture cure model for right censored data when cure status is partially known. The estimator is developed for the setting of a single continuous covariate but it can be extended to multiple covariates. It extends the estimator of Beran (1981), which ignores cure status information. We obtain an almost sure representation, from which the strong consistency and asymptotic normality of the estimator are derived. Asymptotic expressions of the bias and variance demonstrate a reduction in the variance with respect to Beran's estimator. A simulation study shows that, if the bandwidth parameter is suitably chosen, our estimator performs better than others for a wide range of covariate values. A bootstrap bandwidth selector is proposed. 
Finally, the proposed estimator is applied to a real dataset studying survival of sarcoma patients."}, "https://arxiv.org/abs/2402.07048": {"title": "Logistic-beta processes for modeling dependent random probabilities with beta marginals", "link": "https://arxiv.org/abs/2402.07048", "description": "The beta distribution serves as a canonical tool for modeling probabilities and is extensively used in statistics and machine learning, especially in the field of Bayesian nonparametrics. Despite its widespread use, there is limited work on flexible and computationally convenient stochastic process extensions for modeling dependent random probabilities. We propose a novel stochastic process called the logistic-beta process, whose logistic transformation yields a stochastic process with common beta marginals. Similar to the Gaussian process, the logistic-beta process can model dependence on both discrete and continuous domains, such as space or time, and has a highly flexible dependence structure through correlation kernels. Moreover, its normal variance-mean mixture representation leads to highly effective posterior inference algorithms. The flexibility and computational benefits of logistic-beta processes are demonstrated through nonparametric binary regression simulation studies. Furthermore, we apply the logistic-beta process in modeling dependent Dirichlet processes, and illustrate its application and benefits through Bayesian density regression problems in a toxicology study."}, "https://arxiv.org/abs/2402.07247": {"title": "The Pairwise Matching Design is Optimal under Extreme Noise and Assignments", "link": "https://arxiv.org/abs/2402.07247", "description": "We consider the general performance of the difference-in-means estimator in an equally-allocated two-arm randomized experiment under common experimental endpoints such as continuous (regression), incidence, proportion, count and uncensored survival. We consider two sources of randomness: the subject-specific assignments and the contribution of unobserved subject-specific measurements. We then examine mean squared error (MSE) performance under a new, more realistic \"simultaneous tail criterion\". We prove that the pairwise matching design of Greevy et al. (2004) performs best asymptotically under this criterion when compared to other blocking designs. We also prove that the optimal design must be less random than complete randomization and more random than any deterministic, optimized allocation. Theoretical results are supported by simulations in all five response types."}, "https://arxiv.org/abs/2402.07297": {"title": "On Robust Measures of Spatial Correlation", "link": "https://arxiv.org/abs/2402.07297", "description": "As a rule, statistical measures are often vulnerable to the presence of outliers, and spatial correlation coefficients, which are critical in the assessment of spatial data, remain susceptible to this inherent flaw. In contexts where data originates from a variety of domains (such as, e.g., socio-economic, environmental or epidemiological disciplines) it is quite common to encounter not just anomalous data points, but also non-normal distributions. These irregularities can significantly distort the broader analytical landscape, often masking significant spatial attributes. 
This paper embarks on a mission to enhance the resilience of traditional spatial correlation metrics, specifically the Moran Coefficient (MC), Geary's Contiguity ratio (GC), and the Approximate Profile Likelihood Estimator (APLE), and to propose a series of alternative measures. Drawing inspiration from established analytical paradigms, our research harnesses the power of influence function studies to examine the robustness of the proposed novel measures in the presence of different outlier scenarios."}, "https://arxiv.org/abs/2402.07349": {"title": "Control Variates for MCMC", "link": "https://arxiv.org/abs/2402.07349", "description": "This chapter describes several control variate methods for improving estimates of expectations from MCMC."}, "https://arxiv.org/abs/2402.07373": {"title": "A Functional Coefficients Network Autoregressive Model", "link": "https://arxiv.org/abs/2402.07373", "description": "The paper introduces a flexible model for the analysis of multivariate nonlinear time series data. The proposed Functional Coefficients Network Autoregressive (FCNAR) model considers the response of each node in the network to depend in a nonlinear fashion on its own past values (autoregressive component), as well as on past values of each neighbor (network component). Key issues of model stability/stationarity, together with model parameter identifiability, estimation and inference are addressed for error processes that can be heavier-tailed than Gaussian, for both a fixed and a growing number of network nodes. The performance of the estimators for the FCNAR model is assessed on synthetic data and the applicability of the model is illustrated on multiple indicators of air pollution data."}, "https://arxiv.org/abs/2402.07439": {"title": "Joint estimation of the predictive ability of experts using a multi-output Gaussian process", "link": "https://arxiv.org/abs/2402.07439", "description": "A multi-output Gaussian process (GP) is introduced as a model for the joint posterior distribution of the local predictive ability of a set of models and/or experts, conditional on a vector of covariates, from historical predictions in the form of log predictive scores. Following a power transformation of the log scores, a GP with Gaussian noise can be used, which allows faster computation by first using Hamiltonian Monte Carlo to sample the hyper-parameters of the GP from a model where the latent GP surface has been marginalized out, and then using these draws to generate draws of joint predictive ability conditional on a new vector of covariates. Linear pools based on learned joint local predictive ability are applied to predict daily bike usage in Washington DC."}, "https://arxiv.org/abs/2402.07521": {"title": "A step towards the integration of machine learning and small area estimation", "link": "https://arxiv.org/abs/2402.07521", "description": "The use of machine-learning techniques has grown in numerous research areas. Currently, it is also widely used in statistics, including official statistics, for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) but also for data analysis. However, the usage of these methods in survey sampling, including small area estimation, is still very limited. Therefore, we propose a predictor supported by these algorithms which can be used to predict any population or subpopulation characteristics based on cross-sectional and longitudinal data. 
Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between the variables, which means that they have very good properties in the case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up which, in our opinion, is of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with optimal methods under the model. What is more, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods, where accuracy is measured as in survey sampling practice. The solution of this problem is indicated in the literature as one of the key issues in the integration of these approaches. The simulation studies are based on a real, longitudinal dataset, freely available from the Polish Local Data Bank, where the prediction problem of subpopulation characteristics in the last period, with \"borrowing strength\" from other subpopulations and time periods, is considered."}, "https://arxiv.org/abs/2402.07569": {"title": "EM Estimation of the B-spline Copula with Penalized Log-Likelihood Function", "link": "https://arxiv.org/abs/2402.07569", "description": "The B-spline copula function is defined by a linear combination of elements of the normalized B-spline basis. We develop a modified EM algorithm to maximize the penalized log-likelihood function, wherein we use the smoothly clipped absolute deviation (SCAD) penalty function for the penalization term. We conduct simulation studies to demonstrate the stability of the proposed numerical procedure, show that penalization yields estimates with smaller mean-square errors when the true parameter matrix is sparse, and provide methods for determining tuning parameters and for model selection. We analyze as an example a data set consisting of birth and death rates from 237 countries, available at the website ''Our World in Data,'' and we estimate the marginal density and distribution functions of those rates together with all parameters of our B-spline copula model."}, "https://arxiv.org/abs/2402.07629": {"title": "Logistic Multidimensional Data Analysis for Ordinal Response Variables using a Cumulative Link function", "link": "https://arxiv.org/abs/2402.07629", "description": "We present a multidimensional data analysis framework for the analysis of ordinal response variables. Underlying the ordinal variables, we assume a continuous latent variable, leading to cumulative logit models. The framework includes unsupervised methods, when no predictor variables are available, and supervised methods, when predictor variables are available. We distinguish between dominance variables and proximity variables, where dominance variables are analyzed using inner product models, whereas the proximity variables are analyzed using distance models. An expectation-majorization-minimization algorithm is derived for estimation of the parameters of the models. 
We illustrate our methodology with data from the International Social Survey Programme."}, "https://arxiv.org/abs/2402.07634": {"title": "Loglinear modeling with mixed numerical and categorical predictor variables through an Extended Stereotype Model", "link": "https://arxiv.org/abs/2402.07634", "description": "Loglinear analysis is most useful when we have two or more categorical response variables. Loglinear analysis, however, requires categorical predictor variables, such that the data can be represented in a contingency table. Researchers often have a mix of categorical and numerical predictors. We present a new statistical methodology for the analysis of multiple categorical response variables with a mix of numeric and categorical predictor variables. Therefore, the stereotype model, a reduced rank regression model for multinomial outcome variables, is extended with a design matrix for the profile scores and one for the dependencies among the responses. An MM algorithm is presented for estimation of the model parameters. Three examples are presented. The first shows that our method is equivalent to loglinear analysis when we only have categorical variables. With the second example, we show the differences between marginal logit models and our extended stereotype model, which is a conditional model. The third example is more extensive, and shows how to analyze a data set, how to select a model, and how to interpret the final model."}, "https://arxiv.org/abs/2402.07743": {"title": "Beyond Sparsity: Local Projections Inference with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2402.07743", "description": "Impulse response analysis studies how the economy responds to shocks, such as changes in interest rates, and helps policymakers manage these effects. While Vector Autoregression Models (VARs) with structural assumptions have traditionally dominated the estimation of impulse responses, local projections, the projection of future responses on current shock, have recently gained attention for their robustness and interpretability. Including many lags as controls is proposed as a means of robustness, and including a richer set of controls helps in its interpretation as a causal parameter. In both cases, an extensive number of controls leads to the consideration of high-dimensional techniques. While methods like LASSO exist, they mostly rely on sparsity assumptions - most of the parameters are exactly zero, which has limitations in dense data generation processes. This paper proposes a novel approach that incorporates high-dimensional covariates in local projections without relying on sparsity constraints. Adopting the Orthogonal Greedy Algorithm with a high-dimensional AIC (OGA+HDAIC) model selection method, this approach offers advantages including robustness in both sparse and dense scenarios, improved interpretability by prioritizing cross-sectional explanatory power, and more reliable causal inference in local projections."}, "https://arxiv.org/abs/2402.07811": {"title": "PageRank and the Bradley-Terry model", "link": "https://arxiv.org/abs/2402.07811", "description": "PageRank and the Bradley-Terry model are competing approaches to ranking entities such as teams in sports tournaments or journals in citation networks. The Bradley-Terry model is a classical statistical method for ranking based on paired comparisons. The PageRank algorithm ranks nodes according to their importance in a network. 
Whereas Bradley-Terry scores are computed via maximum likelihood estimation, PageRanks are derived from the stationary distribution of a Markov chain. More recent work has shown maximum likelihood estimates for the Bradley-Terry model may be approximated from such a limiting distribution, an interesting connection that has been discovered and rediscovered over the decades. Here we show - through relatively simple mathematics - a connection between paired comparisons and PageRank that exploits the quasi-symmetry property of the Bradley-Terry model. This motivates a novel interpretation of Bradley-Terry scores as 'scaled' PageRanks, and vice versa, with direct implications for citation-based journal ranking metrics."}, "https://arxiv.org/abs/2402.07837": {"title": "Quantile Least Squares: A Flexible Approach for Robust Estimation and Validation of Location-Scale Families", "link": "https://arxiv.org/abs/2402.07837", "description": "In this paper, the problem of robust estimation and validation of location-scale families is revisited. The proposed methods exploit the joint asymptotic normality of sample quantiles (of i.i.d random variables) to construct the ordinary and generalized least squares estimators of location and scale parameters. These quantile least squares (QLS) estimators are easy to compute because they have explicit expressions, their robustness is achieved by excluding extreme quantiles from the least-squares estimation, and efficiency is boosted by using as many non-extreme quantiles as practically relevant. The influence functions of the QLS estimators are specified and plotted for several location-scale families. They closely resemble the shapes of some well-known influence functions yet those shapes emerge automatically (i.e., do not need to be specified). The joint asymptotic normality of the proposed estimators is established, and their finite-sample properties are explored using simulations. Also, computational costs of these estimators, as well as those of MLE, are evaluated for sample sizes n = 10^6, 10^7, 10^8, 10^9. For model validation, two goodness-of-fit tests are constructed and their performance is studied using simulations and real data. In particular, for the daily stock returns of Google over the last four years, both tests strongly support the logistic distribution assumption and reject other bell-shaped competitors."}, "https://arxiv.org/abs/2402.06920": {"title": "The power of forgetting in statistical hypothesis testing", "link": "https://arxiv.org/abs/2402.06920", "description": "This paper places conformal testing in a general framework of statistical hypothesis testing. A standard approach to testing a composite null hypothesis $H$ is to test each of its elements and to reject $H$ when each of its elements is rejected. It turns out that we can fully cover conformal testing using this approach only if we allow forgetting some of the data. However, we will see that the standard approach covers conformal testing in a weak asymptotic sense and under restrictive assumptions. 
I will also list several possible directions of further research, including developing a general scheme of online testing."}, "https://arxiv.org/abs/2402.07066": {"title": "Differentially Private Range Queries with Correlated Input Perturbation", "link": "https://arxiv.org/abs/2402.07066", "description": "This work proposes a class of locally differentially private mechanisms for linear queries, in particular range queries, that leverages correlated input perturbation to simultaneously achieve unbiasedness, consistency, statistical transparency, and control over utility requirements in terms of accuracy targets expressed either in certain query margins or as implied by the hierarchical database structure. The proposed Cascade Sampling algorithm instantiates the mechanism exactly and efficiently. Our bounds show that we obtain near-optimal utility while being empirically competitive against output perturbation methods."}, "https://arxiv.org/abs/2402.07131": {"title": "Resampling methods for Private Statistical Inference", "link": "https://arxiv.org/abs/2402.07131", "description": "We consider the task of constructing confidence intervals with differential privacy. We propose two private variants of the non-parametric bootstrap, which privately compute the median of the results of multiple ``little'' bootstraps run on partitions of the data and give asymptotic bounds on the coverage error of the resulting confidence intervals. For a fixed differential privacy parameter $\\epsilon$, our methods enjoy the same error rates as those of the non-private bootstrap to within logarithmic factors in the sample size $n$. We empirically validate the performance of our methods for mean estimation, median estimation, and logistic regression with both real and synthetic data. Our methods achieve similar coverage accuracy to existing methods (and non-private baselines) while providing notably shorter ($\\gtrsim 10$ times) confidence intervals than previous approaches."}, "https://arxiv.org/abs/2402.07160": {"title": "PASOA- PArticle baSed Bayesian Optimal Adaptive design", "link": "https://arxiv.org/abs/2402.07160", "description": "We propose a new procedure named PASOA, for Bayesian experimental design, that performs sequential design optimization by simultaneously providing accurate estimates of successive posterior distributions for parameter inference. The sequential design process is carried out via a contrastive estimation principle, using stochastic optimization and Sequential Monte Carlo (SMC) samplers to maximise the Expected Information Gain (EIG). As larger information gains are obtained for larger distances between successive posterior distributions, this EIG objective may worsen classical SMC performance. To handle this issue, tempering is proposed to achieve both a large information gain and accurate SMC sampling, which we show is crucial for performance. This novel combination of stochastic optimization and tempered SMC allows us to jointly handle design optimization and parameter inference. We provide a proof that the obtained optimal design estimators benefit from a consistency property. 
Numerical experiments confirm the potential of the approach, which outperforms other recent procedures."}, "https://arxiv.org/abs/2402.07170": {"title": "Research on the multi-stage impact of digital economy on rural revitalization in Hainan Province based on GPM model", "link": "https://arxiv.org/abs/2402.07170", "description": "The rapid development of the digital economy has had a profound impact on the implementation of the rural revitalization strategy. Based on this, this study takes Hainan Province as the research object to deeply explore the impact of digital economic development on rural revitalization. The study collected panel data from 2003 to 2022 to construct an evaluation index system for the digital economy and rural revitalization and used panel regression analysis and other methods to explore the promotion effect of the digital economy on rural revitalization. Research results show that the digital economy has a significant positive impact on rural revitalization, and this impact increases as the level of fiscal expenditure increases. The issuance of digital RMB has further exerted a regulatory effect and promoted the development of the digital economy and the process of rural revitalization. At the same time, the establishment of the Hainan Free Trade Port has also played a positive role in promoting the development of the digital economy and rural revitalization. In predicting the optimal strategy for rural revitalization based on the development levels of the primary, secondary, and tertiary industries (Rate1, Rate2, and Rate3), it was found that Rate1 can encourage Hainan Province to implement digital economic innovation, encourage Rate3 to implement promotion behaviors, and increase Rate2. At the level of sustainable development, when Rate3 promotes Rate2's digital economic innovation behavior, this can standardize Rate2's production behavior to the greatest extent, accelerate the application of the digital economy to the rural revitalization industry, and promote the technological advancement of enterprises."}, "https://arxiv.org/abs/2402.07307": {"title": "Self-Consistent Conformal Prediction", "link": "https://arxiv.org/abs/2402.07307", "description": "In decision-making guided by machine learning, decision-makers often take identical actions in contexts with identical predicted outcomes. Conformal prediction helps decision-makers quantify outcome uncertainty for actions, allowing for better risk management. Inspired by this perspective, we introduce self-consistent conformal prediction, which yields both Venn-Abers calibrated predictions and conformal prediction intervals that are valid conditional on actions prompted by model predictions. Our procedure can be applied post-hoc to any black-box predictor to provide rigorous, action-specific decision-making guarantees. Numerical experiments show our approach strikes a balance between interval efficiency and conditional validity."}, "https://arxiv.org/abs/2402.07322": {"title": "Interference Among First-Price Pacing Equilibria: A Bias and Variance Analysis", "link": "https://arxiv.org/abs/2402.07322", "description": "Online A/B testing is widely used in the internet industry to inform decisions on new feature roll-outs. For online marketplaces (such as advertising markets), standard approaches to A/B testing may lead to biased results when buyers operate under a budget constraint, as budget consumption in one arm of the experiment impacts performance of the other arm. 
To counteract this interference, one can use a budget-split design where the budget constraint operates on a per-arm basis and each arm receives an equal fraction of the budget, leading to ``budget-controlled A/B testing.'' Despite clear advantages of budget-controlled A/B testing, performance degrades when budgets are split too thinly, limiting the overall throughput of such systems. In this paper, we propose a parallel budget-controlled A/B testing design where we use market segmentation to identify submarkets in the larger market, and we run parallel experiments on each submarket.\n Our contributions are as follows: First, we introduce and demonstrate the effectiveness of the parallel budget-controlled A/B test design with submarkets in a large online marketplace environment. Second, we formally define market interference in first-price auction markets using the first price pacing equilibrium (FPPE) framework. Third, we propose a debiased surrogate that eliminates the first-order bias of FPPE, drawing upon the principles of sensitivity analysis in mathematical programs. Fourth, we derive a plug-in estimator for the surrogate and establish its asymptotic normality. Fifth, we provide an estimation procedure for submarket parallel budget-controlled A/B tests. Finally, we present numerical examples on semi-synthetic data, confirming that the debiasing technique achieves the desired coverage properties."}, "https://arxiv.org/abs/2402.07406": {"title": "Asymptotic Equivalency of Two Different Approaches of L-statistics", "link": "https://arxiv.org/abs/2402.07406", "description": "There are several ways to establish the asymptotic normality of $L$-statistics, depending upon the selection of the weights generating function and the cumulative distribution function of the underlying model. Here, it is shown that two of these asymptotic approaches are equivalent for a particular choice of the weights generating function."}, "https://arxiv.org/abs/2402.07419": {"title": "Conditional Generative Models are Sufficient to Sample from Any Causal Effect Estimand", "link": "https://arxiv.org/abs/2402.07419", "description": "Causal inference from observational data has recently found many applications in machine learning. While sound and complete algorithms exist to compute causal effects, many of these algorithms require explicit access to conditional likelihoods over the observational distribution, which is difficult to estimate in the high-dimensional regime, such as with images. To alleviate this issue, researchers have approached the problem by simulating causal relations with neural models and obtained impressive results. However, none of these existing approaches can be applied to generic scenarios such as causal graphs on image data with latent confounders, or obtain conditional interventional samples. In this paper, we show that any identifiable causal effect given an arbitrary causal graph can be computed through push-forward computations of conditional generative models. Based on this result, we devise a diffusion-based approach to sample from any (conditional) interventional distribution on image data. To showcase our algorithm's performance, we conduct experiments on a Colored MNIST dataset having both the treatment ($X$) and the target variables ($Y$) as images and obtain interventional samples from $P(y|do(x))$. 
As an application of our algorithm, we evaluate two large conditional generative models that are pre-trained on the CelebA dataset by analyzing the strength of spurious correlations and the level of disentanglement they achieve."}, "https://arxiv.org/abs/2402.07717": {"title": "Efficient reductions between some statistical models", "link": "https://arxiv.org/abs/2402.07717", "description": "We study the problem of approximately transforming a sample from a source statistical model to a sample from a target statistical model without knowing the parameters of the source model, and construct several computationally efficient such reductions between statistical experiments. In particular, we provide computationally efficient procedures that approximately reduce uniform, Erlang, and Laplace location models to general target families. We illustrate our methodology by establishing nonasymptotic reductions between some canonical high-dimensional problems, spanning mixtures of experts, phase retrieval, and signal denoising. Notably, the reductions are structure preserving and can accommodate missing data. We also point to a possible application in transforming one differentially private mechanism to another."}, "https://arxiv.org/abs/2402.07806": {"title": "A comparison of six linear and non-linear mixed models for longitudinal data: application to late-life cognitive trajectories", "link": "https://arxiv.org/abs/2402.07806", "description": "Longitudinal characterization of cognitive change in late-life has received increasing attention to better understand age-related cognitive aging and cognitive changes reflecting pathology-related and mortality-related processes. Several mixed-effects models have been proposed to accommodate the non-linearity of cognitive decline and assess the putative influence of covariates on it. In this work, we examine the standard linear mixed model (LMM) with a linear function of time and five alternative models capturing non-linearity of change over time, including the LMM with a quadratic term, LMM with splines, the functional mixed model, the piecewise linear mixed model and the sigmoidal mixed model. We first theoretically describe the models. Next, using data from deceased participants from two prospective cohorts with annual cognitive testing, we compared the interpretation of the models by investigating the association of education on cognitive change before death. Finally, we performed a simulation study to empirically evaluate the models and provide practical recommendations. In particular, models were challenged by increasing follow-up spacing, increasing missing data, and decreasing sample size. With the exception of the LMM with a quadratic term, the fit of all models was generally adequate to capture non-linearity of cognitive change and models were relatively robust. Although spline-based models do not have interpretable nonlinearity parameters, their convergence was easier to achieve and they allow for graphical interpretation. In contrast the piecewise and the sigmoidal models, with interpretable non-linear parameters may require more data to achieve convergence."}, "https://arxiv.org/abs/2008.04522": {"title": "Bayesian Analysis on Limiting the Student-$t$ Linear Regression Model", "link": "https://arxiv.org/abs/2008.04522", "description": "For the outlier problem in linear regression models, the Student-$t$ linear regression model is one of the common methods for robust modeling and is widely adopted in the literature. 
However, most of these works apply it without careful theoretical consideration. This study provides practically useful and quite simple conditions to ensure that the Student-$t$ linear regression model is robust against an outlier in the $y$-direction, using regular variation theory."}, "https://arxiv.org/abs/2103.01412": {"title": "Some Finite Sample Properties of the Sign Test", "link": "https://arxiv.org/abs/2103.01412", "description": "This paper contains two finite-sample results concerning the sign test. First, we show that the sign-test is unbiased with independent, non-identically distributed data for both one-sided and two-sided hypotheses. The proof for the two-sided case is based on a novel argument that relates the derivatives of the power function to a regular bipartite graph. Unbiasedness then follows from the existence of perfect matchings on such graphs. Second, we provide a simple theoretical counterexample to show that the sign test over-rejects when the data exhibits correlation. Our results can be useful for understanding the properties of approximate randomization tests in settings with few clusters."}, "https://arxiv.org/abs/2105.12852": {"title": "Identifying Brexit voting patterns in the British House of Commons: an analysis based on Bayesian mixture models with flexible concomitant covariate effects", "link": "https://arxiv.org/abs/2105.12852", "description": "Brexit and its implications have been an ongoing topic of interest since the Brexit referendum in 2016. In 2019 the House of Commons held a number of \"indicative\" and \"meaningful\" votes as part of the Brexit approval process. The voting behaviour of members of parliament in these votes is investigated to gain insight into the Brexit approval process. In particular, a mixture model with concomitant covariates is developed to identify groups of members of parliament who share similar voting behaviour while also considering characteristics of the members of parliament. The novelty of the method lies in the flexible structure used to model the effect of concomitant covariates on the component weights of the mixture, with the (potentially nonlinear) terms represented as a smooth function of the covariates. Results show this approach allows one to quantify the effect of the age of members of parliament, as well as preferences and competitiveness in the constituencies they represent, on their position towards Brexit. This helps group the aforementioned politicians into homogeneous clusters, whose composition departs appreciably from that of the parties."}, "https://arxiv.org/abs/2112.03220": {"title": "Cross-validation for change-point regression: pitfalls and solutions", "link": "https://arxiv.org/abs/2112.03220", "description": "Cross-validation is the standard approach for tuning parameter selection in many non-parametric regression problems. However, its use is less common in change-point regression, perhaps as its prediction error-based criterion may appear to permit small spurious changes and hence be less well-suited to estimation of the number and location of change-points. We show that in fact the problems of cross-validation with squared error loss are more severe and can lead to systematic under- or over-estimation of the number of change-points, and highly suboptimal estimation of the mean function in simple settings where changes are easily detectable. 
We propose two simple approaches to remedy these issues, the first involving the use of absolute error rather than squared error loss, and the second involving modifying the holdout sets used. For the latter, we provide conditions that permit consistent estimation of the number of change-points for a general change-point estimation procedure. We show these conditions are satisfied for least squares estimation using new results on its performance when supplied with the incorrect number of change-points. Numerical experiments show that our new approaches are competitive with common change-point methods using classical tuning parameter choices when error distributions are well-specified, but can substantially outperform these in misspecified models. An implementation of our methodology is available in the R package crossvalidationCP on CRAN."}, "https://arxiv.org/abs/2203.07342": {"title": "A Bayesian Nonparametric Approach to Species Sampling Problems with Ordering", "link": "https://arxiv.org/abs/2203.07342", "description": "Species-sampling problems (SSPs) refer to a vast class of statistical problems calling for the estimation of (discrete) functionals of the unknown species composition of an unobservable population. A common feature of SSPs is their invariance with respect to species labeling, which is at the core of the Bayesian nonparametric (BNP) approach to SSPs under the popular Pitman-Yor process (PYP) prior. In this paper, we develop a BNP approach to SSPs that are not \"invariant\" to species labeling, in the sense that an ordering or ranking is assigned to species' labels. Inspired by the population genetics literature on age-ordered alleles' compositions, we study the following SSP with ordering: given an observable sample from an unknown population of individuals belonging to species (alleles), with species' labels being ordered according to weights (ages), estimate the frequencies of the first $r$ order species' labels in an enlarged sample obtained by including additional unobservable samples. By relying on an ordered PYP prior, we obtain an explicit posterior distribution of the first $r$ order frequencies, with estimates being of easy implementation and computationally efficient. We apply our approach to the analysis of genetic variation, showing its effectiveness in estimating the frequency of the oldest allele, and then we discuss other potential applications."}, "https://arxiv.org/abs/2210.04100": {"title": "Doubly robust estimation and sensitivity analysis for marginal structural quantile models", "link": "https://arxiv.org/abs/2210.04100", "description": "The marginal structure quantile model (MSQM) provides a unique lens to understand the causal effect of a time-varying treatment on the full distribution of potential outcomes. Under the semiparametric framework, we derive the efficiency influence function for the MSQM, from which a new doubly robust estimator is proposed for point estimation and inference. We show that the doubly robust estimator is consistent if either of the models associated with treatment assignment or the potential outcome distributions is correctly specified, and is semiparametric efficient if both models are correct. To implement the doubly robust MSQM estimator, we propose to solve a smoothed estimating equation to facilitate efficient computation of the point and variance estimates. 
In addition, we develop a confounding function approach to investigate the sensitivity of several MSQM estimators when the sequential ignorability assumption is violated. Extensive simulations are conducted to examine the finite-sample performance characteristics of the proposed methods. We apply the proposed methods to the Yale New Haven Health System Electronic Health Record data to study the effect of antihypertensive medications to patients with severe hypertension and assess the robustness of findings to unmeasured baseline and time-varying confounding."}, "https://arxiv.org/abs/2304.00231": {"title": "Using Overlap Weights to Address Extreme Propensity Scores in Estimating Restricted Mean Counterfactual Survival Times", "link": "https://arxiv.org/abs/2304.00231", "description": "While the inverse probability of treatment weighting (IPTW) is a commonly used approach for treatment comparisons in observational data, the resulting estimates may be subject to bias and excessively large variance when there is lack of overlap in the propensity score distributions. By smoothly down-weighting the units with extreme propensity scores, overlap weighting (OW) can help mitigate the bias and variance issues associated with IPTW. Although theoretical and simulation results have supported the use of OW with continuous and binary outcomes, its performance with right-censored survival outcomes remains to be further investigated, especially when the target estimand is defined based on the restricted mean survival time (RMST)-a clinically meaningful summary measure free of the proportional hazards assumption. In this article, we combine propensity score weighting and inverse probability of censoring weighting to estimate the restricted mean counterfactual survival times, and propose computationally-efficient variance estimators. We conduct simulations to compare the performance of IPTW, trimming, and OW in terms of bias, variance, and 95% confidence interval coverage, under various degrees of covariate overlap. Regardless of overlap, we demonstrate the advantage of OW over IPTW and trimming methods in bias, variance, and coverage when the estimand is defined based on RMST."}, "https://arxiv.org/abs/2304.04221": {"title": "Maximum Agreement Linear Prediction via the Concordance Correlation Coefficient", "link": "https://arxiv.org/abs/2304.04221", "description": "This paper examines distributional properties and predictive performance of the estimated maximum agreement linear predictor (MALP) introduced in Bottai, Kim, Lieberman, Luta, and Pena (2022) paper in The American Statistician, which is the linear predictor maximizing Lin's concordance correlation coefficient (CCC) between the predictor and the predictand. It is compared and contrasted, theoretically and through computer experiments, with the estimated least-squares linear predictor (LSLP). Finite-sample and asymptotic properties are obtained, and confidence intervals are also presented. The predictors are illustrated using two real data sets: an eye data set and a bodyfat data set. 
The results indicate that the estimated MALP is a viable alternative to the estimated LSLP if one desires a predictor whose predicted values possess higher agreement with the predictand values, as measured by the CCC."}, "https://arxiv.org/abs/2305.16464": {"title": "Flexible Variable Selection for Clustering and Classification", "link": "https://arxiv.org/abs/2305.16464", "description": "The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that utilizes the Manly transformation mixture model to select variables based on their ability to separate clusters, and is effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared to existing variable selection methods -- including one that can account for cluster skewness -- using simulated and real datasets."}, "https://arxiv.org/abs/2306.15616": {"title": "Network-Adjusted Covariates for Community Detection", "link": "https://arxiv.org/abs/2306.15616", "description": "Community detection is a crucial task in network analysis that can be significantly improved by incorporating subject-level information, i.e. covariates. However, current methods often struggle with selecting tuning parameters and analyzing low-degree nodes. In this paper, we introduce a novel method that addresses these challenges by constructing network-adjusted covariates, which leverage the network connections and covariates with a unique weight to each node based on the node's degree. Spectral clustering on network-adjusted covariates yields an exact recovery of community labels under certain conditions, which is tuning-free and computationally efficient. We present novel theoretical results about the strong consistency of our method under degree-corrected stochastic blockmodels with covariates, even in the presence of mis-specification and sparse communities with bounded degrees. Additionally, we establish a general lower bound for the community detection problem when both network and covariates are present, and it shows our method is optimal up to a constant factor. Our method outperforms existing approaches in simulations and a LastFM app user network, and provides interpretable community structures in a statistics publication citation network where $30\\%$ of nodes are isolated."}, "https://arxiv.org/abs/2308.01062": {"title": "A new multivariate and non-parametric association measure based on paired orthants", "link": "https://arxiv.org/abs/2308.01062", "description": "Multivariate correlation analysis plays a key role in various fields such as statistics and big data analytics. In this paper, a new non-parametric association measure between more than two variables, based on the concept of paired orthants, is presented. In order to evaluate the proposed methodology, different N-tuple sets (from two to six variables) have been considered. 
The presented rank correlation analysis not only evaluates the inter-relatedness of multiple variables, but also determines the specific tendency of these variables."}, "https://arxiv.org/abs/2310.12711": {"title": "Modelling multivariate extremes through angular-radial decomposition of the density function", "link": "https://arxiv.org/abs/2310.12711", "description": "We present a new framework for modelling multivariate extremes, based on an angular-radial representation of the probability density function. Under this representation, the problem of modelling multivariate extremes is transformed to that of modelling an angular density and the tail of the radial variable, conditional on angle. Motivated by univariate theory, we assume that the tail of the conditional radial distribution converges to a generalised Pareto (GP) distribution. To simplify inference, we also assume that the angular density is continuous and finite and the GP parameter functions are continuous with angle. We refer to the resulting model as the semi-parametric angular-radial (SPAR) model for multivariate extremes. We consider the effect of the choice of polar coordinate system and introduce generalised concepts of angular-radial coordinate systems and generalised scalar angles in two dimensions. We show that under certain conditions, the choice of polar coordinate system does not affect the validity of the SPAR assumptions. However, some choices of coordinate system lead to simpler representations. In contrast, we show that the choice of margin does affect whether the model assumptions are satisfied. In particular, the use of Laplace margins results in a form of the density function for which the SPAR assumptions are satisfied for many common families of copula, with various dependence classes. We show that the SPAR model provides a more versatile framework for characterising multivariate extremes than provided by existing approaches, and that several commonly-used approaches are special cases of the SPAR model."}, "https://arxiv.org/abs/2310.17334": {"title": "Bayesian Optimization for Personalized Dose-Finding Trials with Combination Therapies", "link": "https://arxiv.org/abs/2310.17334", "description": "Identification of optimal dose combinations in early phase dose-finding trials is challenging, due to the trade-off between precisely estimating the many parameters required to flexibly model the possibly non-monotonic dose-response surface, and the small sample sizes in early phase trials. This difficulty is even more pertinent in the context of personalized dose-finding, where patient characteristics are used to identify tailored optimal dose combinations. To overcome these challenges, we propose the use of Bayesian optimization for finding optimal dose combinations in standard (\"one size fits all\") and personalized multi-agent dose-finding trials. Bayesian optimization is a method for estimating the global optima of expensive-to-evaluate objective functions. The objective function is approximated by a surrogate model, commonly a Gaussian process, paired with a sequential design strategy to select the next point via an acquisition function. This work is motivated by an industry-sponsored problem, where focus is on optimizing a dual-agent therapy in a setting featuring minimal toxicity. To compare the performance of the standard and personalized methods under this setting, simulation studies are performed for a variety of scenarios. 
Our study concludes that taking a personalized approach is highly beneficial in the presence of heterogeneity."}, "https://arxiv.org/abs/2401.09379": {"title": "Merging uncertainty sets via majority vote", "link": "https://arxiv.org/abs/2401.09379", "description": "Given $K$ uncertainty sets that are arbitrarily dependent -- for example, confidence intervals for an unknown parameter obtained with $K$ different estimators, or prediction sets obtained via conformal prediction based on $K$ different algorithms on shared data -- we address the question of how to efficiently combine them in a black-box manner to produce a single uncertainty set. We present a simple and broadly applicable majority vote procedure that produces a merged set with nearly the same error guarantee as the input sets. We then extend this core idea in a few ways: we show that weighted averaging can be a powerful way to incorporate prior information, and a simple randomization trick produces strictly smaller merged sets without altering the coverage guarantee. Further improvements can be obtained inducing exchangeability within the sets. When deployed in online settings, we show how the exponential weighted majority algorithm can be employed in order to learn a good weighting over time. We then combine this method with adaptive conformal inference to deliver a simple conformal online model aggregation (COMA) method for nonexchangeable data."}, "https://arxiv.org/abs/2105.02569": {"title": "Machine Collaboration", "link": "https://arxiv.org/abs/2105.02569", "description": "We propose a new ensemble framework for supervised learning, called machine collaboration (MaC), using a collection of base machines for prediction tasks. Unlike bagging/stacking (a parallel & independent framework) and boosting (a sequential & top-down framework), MaC is a type of circular & interactive learning framework. The circular & interactive feature helps the base machines to transfer information circularly and update their structures and parameters accordingly. The theoretical result on the risk bound of the estimator from MaC reveals that the circular & interactive feature can help MaC reduce risk via a parsimonious ensemble. We conduct extensive experiments on MaC using both simulated data and 119 benchmark real datasets. The results demonstrate that in most cases, MaC performs significantly better than several other state-of-the-art methods, including classification and regression trees, neural networks, stacking, and boosting."}, "https://arxiv.org/abs/2211.03846": {"title": "Federated Causal Discovery From Interventions", "link": "https://arxiv.org/abs/2211.03846", "description": "Causal discovery serves a pivotal role in mitigating model uncertainty through recovering the underlying causal mechanisms among variables. In many practical domains, such as healthcare, access to the data gathered by individual entities is limited, primarily for privacy and regulatory constraints. However, the majority of existing causal discovery methods require the data to be available in a centralized location. In response, researchers have introduced federated causal discovery. While previous federated methods consider distributed observational data, the integration of interventional data remains largely unexplored. We propose FedCDI, a federated framework for inferring causal structures from distributed data containing interventional samples. 
In line with the federated learning framework, FedCDI improves privacy by exchanging belief updates rather than raw samples. Additionally, it introduces a novel intervention-aware method for aggregating individual updates. We analyze scenarios with shared or disjoint intervened covariates, and mitigate the adverse effects of interventional data heterogeneity. The performance and scalability of FedCDI is rigorously tested across a variety of synthetic and real-world graphs."}, "https://arxiv.org/abs/2309.00943": {"title": "iCOS: Option-Implied COS Method", "link": "https://arxiv.org/abs/2309.00943", "description": "This paper proposes the option-implied Fourier-cosine method, iCOS, for non-parametric estimation of risk-neutral densities, option prices, and option sensitivities. The iCOS method leverages the Fourier-based COS technique, proposed by Fang and Oosterlee (2008), by utilizing the option-implied cosine series coefficients. Notably, this procedure does not rely on any model assumptions about the underlying asset price dynamics, it is fully non-parametric, and it does not involve any numerical optimization. These features make it rather general and computationally appealing. Furthermore, we derive the asymptotic properties of the proposed non-parametric estimators and study their finite-sample behavior in Monte Carlo simulations. Our empirical analysis using S&P 500 index options and Amazon equity options illustrates the effectiveness of the iCOS method in extracting valuable information from option prices under different market conditions. Additionally, we apply our methodology to dissect and quantify observation and discretization errors in the VIX index."}, "https://arxiv.org/abs/2310.12863": {"title": "A remark on moment-dependent phase transitions in high-dimensional Gaussian approximations", "link": "https://arxiv.org/abs/2310.12863", "description": "In this article, we study the critical growth rates of dimension below which Gaussian critical values can be used for hypothesis testing but beyond which they cannot. We are particularly interested in how these growth rates depend on the number of moments that the observations possess."}, "https://arxiv.org/abs/2312.05974": {"title": "Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise", "link": "https://arxiv.org/abs/2312.05974", "description": "This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes -- partial observability. The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real world network. 
Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime."}, "https://arxiv.org/abs/2402.08051": {"title": "On Bayesian Filtering for Markov Regime Switching Models", "link": "https://arxiv.org/abs/2402.08051", "description": "This paper presents a framework for empirical analysis of dynamic macroeconomic models using Bayesian filtering, with a specific focus on the state-space formulation of Dynamic Stochastic General Equilibrium (DSGE) models with multiple regimes. We outline the theoretical foundations of model estimation, provide the details of two families of powerful multiple-regime filters, IMM and GPB, and construct corresponding multiple-regime smoothers. A simulation exercise, based on a prototypical New Keynesian DSGE model, is used to demonstrate the computational robustness of the proposed filters and smoothers and evaluate their accuracy and speed for a selection of filters from each family. We show that the canonical IMM filter is faster and is no less, and often more, accurate than its competitors within IMM and GPB families, the latter including the commonly used Kim and Nelson (1999) filter. Using it with the matching smoother improves the precision in recovering unobserved variables by about 25 percent. Furthermore, applying it to the U.S. 1947-2023 macroeconomic time series, we successfully identify significant past policy shifts including those related to the post-Covid-19 period. Our results demonstrate the practical applicability and potential of the proposed routines in macroeconomic analysis."}, "https://arxiv.org/abs/2402.08069": {"title": "Interrater agreement statistics under the two-rater dichotomous-response case with correlated decisions", "link": "https://arxiv.org/abs/2402.08069", "description": "Measurement of the interrater agreement (IRA) is critical in various disciplines. To correct for potential confounding chance agreement in IRA, Cohen's kappa and many other methods have been proposed. However, owing to the varied strategies and assumptions across these methods, there is a lack of practical guidelines on how these methods should be preferred even for the common two-rater dichotomous rating. To fill the gaps in the literature, we systematically review nine IRA methods and propose a generalized framework that can simulate the correlated decision processes behind the two raters to compare those reviewed methods under comprehensive practical scenarios. Based on the new framework, an estimand of \"true\" chance-corrected IRA is defined by accounting for the \"probabilistic certainty\" and serves as the comparison benchmark. We carry out extensive simulations to evaluate the performance of the reviewed IRA measures, and an agglomerative hierarchical clustering analysis is conducted to assess the inter-relationships among the included methods and the benchmark metric. Recommendations for selecting appropriate IRA statistics in different practical conditions are provided and the needs for further advancements in IRA estimation methodologies are emphasized."}, "https://arxiv.org/abs/2402.08108": {"title": "Finding Moving-Band Statistical Arbitrages via Convex-Concave Optimization", "link": "https://arxiv.org/abs/2402.08108", "description": "We propose a new method for finding statistical arbitrages that can contain more assets than just the traditional pair. We formulate the problem as seeking a portfolio with the highest volatility, subject to its price remaining in a band and a leverage limit. 
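For the two-rater dichotomous-response setting discussed in the interrater-agreement abstract above, Cohen's kappa is the classic chance-corrected measure; a small self-contained computation with made-up counts is sketched below (it is not the paper's simulation framework or benchmark estimand).

import numpy as np

def cohens_kappa(table):
    # table[i][j]: number of subjects rated i by rater A and j by rater B (i, j in {0, 1}).
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_observed = np.trace(table) / n                              # observed agreement
    p_chance = (table.sum(axis=1) @ table.sum(axis=0)) / n**2     # agreement expected by chance
    return (p_observed - p_chance) / (1.0 - p_chance)

print(cohens_kappa([[40, 10], [5, 45]]))   # 0.70 for these illustrative counts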
This optimization problem is not convex, but can be approximately solved using the convex-concave procedure, a specific sequential convex programming method. We show how the method generalizes to finding moving-band statistical arbitrages, where the price band midpoint varies over time."}, "https://arxiv.org/abs/2402.08151": {"title": "Gradient-flow adaptive importance sampling for Bayesian leave one out cross-validation for sigmoidal classification models", "link": "https://arxiv.org/abs/2402.08151", "description": "We introduce a set of gradient-flow-guided adaptive importance sampling (IS) transformations to stabilize Monte-Carlo approximations of point-wise leave one out cross-validated (LOO) predictions for Bayesian classification models. One can leverage this methodology for assessing model generalizability by for instance computing a LOO analogue to the AIC or computing LOO ROC/PRC curves and derived metrics like the AUROC and AUPRC. By the calculus of variations and gradient flow, we derive two simple nonlinear single-step transformations that utilize gradient information to shift a model's pre-trained full-data posterior closer to the target LOO posterior predictive distributions. In doing so, the transformations stabilize importance weights. Because the transformations involve the gradient of the likelihood function, the resulting Monte Carlo integral depends on Jacobian determinants with respect to the model Hessian. We derive closed-form exact formulae for these Jacobian determinants in the cases of logistic regression and shallow ReLU-activated artificial neural networks, and provide a simple approximation that sidesteps the need to compute full Hessian matrices and their spectra. We test the methodology on an $n\\ll p$ dataset that is known to produce unstable LOO IS weights."}, "https://arxiv.org/abs/2402.08222": {"title": "Integration of multiview microbiome data for deciphering microbiome-metabolome-disease pathways", "link": "https://arxiv.org/abs/2402.08222", "description": "The intricate interplay between host organisms and their gut microbiota has catalyzed research into the microbiome's role in disease, shedding light on novel aspects of disease pathogenesis. However, the mechanisms through which the microbiome exerts its influence on disease remain largely unclear. In this study, we first introduce a structural equation model to delineate the pathways connecting the microbiome, metabolome, and disease processes, utilizing a target multiview microbiome data. To mitigate the challenges posed by hidden confounders, we further propose an integrative approach that incorporates data from an external microbiome cohort. This method also supports the identification of disease-specific and microbiome-associated metabolites that are missing in the target cohort. We provide theoretical underpinnings for the estimations derived from our integrative approach, demonstrating estimation consistency and asymptotic normality. The effectiveness of our methodologies is validated through comprehensive simulation studies and an empirical application to inflammatory bowel disease, highlighting their potential to unravel the complex relationships between the microbiome, metabolome, and disease."}, "https://arxiv.org/abs/2402.08239": {"title": "Interaction Decomposition of prediction function", "link": "https://arxiv.org/abs/2402.08239", "description": "This paper discusses the foundation of methods for accurately grasping the interaction effects. 
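The quantity that the gradient-flow transformations in the LOO abstract above aim to stabilize is the ordinary importance-sampling estimate of leave-one-out predictive densities. A bare-bones version for a toy logistic model is sketched below; the posterior draws are faked by a Gaussian around the true coefficients rather than obtained by MCMC, and no stabilizing transformation is applied.

import numpy as np

rng = np.random.default_rng(0)
n, S = 100, 2000
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x))))
theta = rng.normal([0.5, 1.2], 0.1, size=(S, 2))      # stand-in for posterior draws

def pointwise_lik(theta, x, y):
    # p(y_i | x_i, theta_s) for every draw s and observation i.
    p = 1.0 / (1.0 + np.exp(-(theta[:, [0]] + theta[:, [1]] * x)))
    return np.where(y == 1, p, 1.0 - p)

L = pointwise_lik(theta, x, y)                        # shape (S, n)
w = 1.0 / L                                           # raw LOO importance ratios
w /= w.sum(axis=0, keepdims=True)                     # self-normalize per observation
loo_density = (w * L).sum(axis=0)                     # estimate of p(y_i | y_{-i})
print(np.log(loo_density).sum())                      # IS-LOO estimate of the out-of-sample log score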
Among the existing methods that capture the interaction effects as terms, PD and ALE are known as global model-agnostic methods in the IML field. Of the two, ALE can theoretically provide a functional decomposition of the prediction function, and this study focuses on functional decomposition. Specifically, we mathematically formalize what we consider to be the requirements that must always be met by a decomposition (interaction decomposition, hereafter, ID) that decomposes the prediction function into main and interaction effect terms. We also present a theorem about how to produce a decomposition that meets these requirements. Furthermore, we confirm that while ALE is ID, PD is not, and we present examples of decompositions that meet the requirements of ID using methods other than existing ones (i.e., new methods)."}, "https://arxiv.org/abs/2402.08283": {"title": "Classification Using Global and Local Mahalanobis Distances", "link": "https://arxiv.org/abs/2402.08283", "description": "We propose a novel semi-parametric classifier based on Mahalanobis distances of an observation from the competing classes. Our tool is a generalized additive model with the logistic link function that uses these distances as features to estimate the posterior probabilities of the different classes. While popular parametric classifiers like linear and quadratic discriminant analyses are mainly motivated by the normality of the underlying distributions, the proposed classifier is more flexible and free from such parametric assumptions. Since the densities of elliptic distributions are functions of Mahalanobis distances, this classifier works well when the competing classes are (nearly) elliptic. In such cases, it often outperforms popular nonparametric classifiers, especially when the sample size is small compared to the dimension of the data. To cope with non-elliptic and possibly multimodal distributions, we propose a local version of the Mahalanobis distance. Subsequently, we propose another classifier based on a generalized additive model that uses the local Mahalanobis distances as features. This nonparametric classifier usually performs like the Mahalanobis distance based semiparametric classifier when the underlying distributions are elliptic, but outperforms it for several non-elliptic and multimodal distributions. We also investigate the behaviour of these two classifiers in high dimension, low sample size situations. A thorough numerical study involving several simulated and real datasets demonstrates the usefulness of the proposed classifiers in comparison to many state-of-the-art methods."}, "https://arxiv.org/abs/2402.08328": {"title": "Dual Likelihood for Causal Inference under Structure Uncertainty", "link": "https://arxiv.org/abs/2402.08328", "description": "Knowledge of the underlying causal relations is essential for inferring the effect of interventions in complex systems. In a widely studied approach, structural causal models postulate noisy functional relations among interacting variables, where the underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In the typical application, this underlying causal structure must be learned from data, and thus, the remaining structure uncertainty needs to be incorporated into causal inference in order to draw reliable conclusions.
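A toy sketch of the "Mahalanobis distances as classifier features" construction from the classification abstract above, with the paper's generalized additive model replaced by plain logistic regression and with simulated Gaussian classes (everything below is illustrative, not the authors' implementation).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 1], [[1.0, -0.3], [-0.3, 1.0]], size=200)
X, y = np.vstack([X0, X1]), np.r_[np.zeros(200), np.ones(200)]

def mahalanobis_features(X_new, X_train, y_train):
    # One feature per class: the Mahalanobis distance to that class's centre.
    feats = []
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
        d2 = np.einsum('ij,jk,ik->i', X_new - mu, S_inv, X_new - mu)
        feats.append(np.sqrt(d2))
    return np.column_stack(feats)

clf = LogisticRegression().fit(mahalanobis_features(X, X, y), y)
print(clf.score(mahalanobis_features(X, X, y), y))    # in-sample accuracy of the sketch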
In recent work, test inversions provide an ansatz to account for this data-driven model choice and, therefore, combine structure learning with causal inference. In this article, we propose the use of dual likelihood to greatly simplify the treatment of the involved testing problem. Indeed, dual likelihood leads to a closed-form solution for constructing confidence regions for total causal effects that rigorously capture both sources of uncertainty: causal structure and numerical size of nonzero effects. The proposed confidence regions can be computed with a bottom-up procedure starting from sink nodes. To render the causal structure identifiable, we develop our ideas in the context of linear causal relations with equal error variances."}, "https://arxiv.org/abs/2402.08335": {"title": "Joint Modeling of Multivariate Longitudinal and Survival Outcomes with the R package INLAjoint", "link": "https://arxiv.org/abs/2402.08335", "description": "This paper introduces the R package INLAjoint, designed as a toolbox for fitting a diverse range of regression models addressing both longitudinal and survival outcomes. INLAjoint relies on the computational efficiency of the integrated nested Laplace approximations methodology, an efficient alternative to Markov chain Monte Carlo for Bayesian inference, ensuring both speed and accuracy in parameter estimation and uncertainty quantification. The package facilitates the construction of complex joint models by treating individual regression models as building blocks, which can be assembled to address specific research questions. Joint models are relevant in biomedical studies where the collection of longitudinal markers alongside censored survival times is common. They have gained significant interest in recent literature, demonstrating the ability to rectify biases present in separate modeling approaches such as informative censoring by a survival event or confusion bias due to population heterogeneity. We provide a comprehensive overview of the joint modeling framework embedded in INLAjoint with illustrative examples. Through these examples, we demonstrate the practical utility of INLAjoint in handling complex data scenarios encountered in biomedical research."}, "https://arxiv.org/abs/2402.08336": {"title": "A two-step approach for analyzing time to event data under non-proportional hazards", "link": "https://arxiv.org/abs/2402.08336", "description": "The log-rank test and the Cox proportional hazards model are commonly used to compare time-to-event data in clinical trials, as they are most powerful under proportional hazards. But there is a loss of power if this assumption is violated, which is the case for some new oncology drugs like immunotherapies. We consider a two-stage test procedure, in which the weighting of the log-rank test statistic depends on a pre-test of the proportional hazards assumption. I.e., depending on the pre-test either the log-rank or an alternative test is used to compare the survival probabilities. We show that if naively implemented this can lead to a substantial inflation of the type-I error rate. To address this, we embed the two-stage test in a permutation test framework to keep the nominal level alpha. 
We compare the operating characteristics of the two-stage test with the log-rank test and other tests by clinical trial simulations."}, "https://arxiv.org/abs/2402.08376": {"title": "The generalized Hausman test for detecting non-normality in the latent variable distribution of the two-parameter IRT model", "link": "https://arxiv.org/abs/2402.08376", "description": "This paper introduces the generalized Hausman test as a novel method for detecting non-normality of the latent variable distribution of unidimensional Item Response Theory (IRT) models for binary data. The test utilizes the pairwise maximum likelihood estimator obtained for the parameters of the classical two-parameter IRT model, which assumes normality of the latent variable, and the quasi-maximum likelihood estimator obtained under a semi-nonparametric framework, allowing for a more flexible distribution of the latent variable. The performance of the generalized Hausman test is evaluated through a simulation study and it is compared with the likelihood-ratio and the M2 test statistics. Additionally, various information criteria are computed. The simulation results show that the generalized Hausman test outperforms the other tests under most conditions. However, the results obtained from the information criteria are somewhat contradictory under certain conditions, suggesting a need for further investigation and interpretation."}, "https://arxiv.org/abs/2402.08500": {"title": "Covariate selection for the estimation of marginal hazard ratios in high-dimensional data", "link": "https://arxiv.org/abs/2402.08500", "description": "Hazard ratios are frequently reported in time-to-event and epidemiological studies to assess treatment effects. In observational studies, the combination of propensity score weights with the Cox proportional hazards model facilitates the estimation of the marginal hazard ratio (MHR). The methods for estimating MHR are analogous to those employed for estimating common causal parameters, such as the average treatment effect. However, MHR estimation in the context of high-dimensional data remains unexplored. This paper seeks to address this gap through a simulation study that considers variable selection methods from causal inference combined with a recently proposed multiply robust approach for MHR estimation. Additionally, a case study utilizing stroke register data is conducted to demonstrate the application of these methods. The results from the simulation study indicate that the double selection covariate selection method is preferable to several other strategies when estimating MHR. Nevertheless, the estimation can be further improved by employing the multiply robust approach to the set of propensity score models obtained during the double selection process."}, "https://arxiv.org/abs/2402.08575": {"title": "Heterogeneity, Uncertainty and Learning: Semiparametric Identification and Estimation", "link": "https://arxiv.org/abs/2402.08575", "description": "We provide semiparametric identification results for a broad class of learning models in which continuous outcomes depend on three types of unobservables: i) known heterogeneity, ii) initially unknown heterogeneity that may be revealed over time, and iii) transitory uncertainty. We consider a common environment where the researcher only has access to a short panel on choices and realized outcomes.
We establish identification of the outcome equation parameters and the distribution of the three types of unobservables, under the standard assumption that unknown heterogeneity and uncertainty are normally distributed. We also show that, absent known heterogeneity, the model is identified without making any distributional assumption. We then derive the asymptotic properties of a sieve MLE estimator for the model parameters, and devise a tractable profile likelihood based estimation procedure. Monte Carlo simulation results indicate that our estimator exhibits good finite-sample properties."}, "https://arxiv.org/abs/2402.08110": {"title": "Estimating Lagged (Cross-)Covariance Operators of $L^p$-$m$-approximable Processes in Cartesian Product Hilbert Spaces", "link": "https://arxiv.org/abs/2402.08110", "description": "When estimating the parameters in functional ARMA, GARCH and invertible, linear processes, covariance and lagged cross-covariance operators of processes in Cartesian product spaces appear. Such operators have been consistently estimated in recent years, either less generally or under a strong condition. This article extends the existing literature by deriving explicit upper bounds for estimation errors for lagged covariance and lagged cross-covariance operators of processes in general Cartesian product Hilbert spaces, based on the mild weak dependence condition $L^p$-$m$-approximability. The upper bounds are stated for each lag, Cartesian power(s) and sample size, where the two processes in the context of lagged cross-covariance operators can take values in different spaces. General consequences of our results are also mentioned."}, "https://arxiv.org/abs/2402.08229": {"title": "Causal Discovery under Off-Target Interventions", "link": "https://arxiv.org/abs/2402.08229", "description": "Causal graph discovery is a significant problem with applications across various disciplines. However, with observational data alone, the underlying causal graph can only be recovered up to its Markov equivalence class, and further assumptions or interventions are necessary to narrow down the true graph. This work addresses the causal discovery problem under the setting of stochastic interventions with the natural goal of minimizing the number of interventions performed. We propose the following stochastic intervention model which subsumes existing adaptive noiseless interventions in the literature while capturing scenarios such as fat-hand interventions and CRISPR gene knockouts: any intervention attempt results in an actual intervention on a random subset of vertices, drawn from a distribution dependent on the attempted action. Under this model, we study the two fundamental problems in causal discovery of verification and search and provide approximation algorithms with polylogarithmic competitive ratios, along with some preliminary experimental results."}, "https://arxiv.org/abs/2402.08285": {"title": "Theoretical properties of angular halfspace depth", "link": "https://arxiv.org/abs/2402.08285", "description": "The angular halfspace depth (ahD) is a natural modification of the celebrated halfspace (or Tukey) depth to the setup of directional data. It allows us to define elements of nonparametric inference, such as the median, the inter-quantile regions, or the rank statistics, for datasets supported on the unit sphere. Despite being introduced in 1987, ahD has never received ample recognition in the literature, mainly due to the lack of efficient algorithms for its computation.
With the recent progress on the computational front, ahD however exhibits the potential for developing viable nonparametric statistics techniques for directional datasets. In this paper, we thoroughly treat the theoretical properties of ahD. We show that similarly to the classical halfspace depth for multivariate data, also ahD satisfies many desirable properties of a statistical depth function. Further, we derive uniform continuity/consistency results for the associated set of directional medians, and the central regions of ahD, the latter representing a depth-based analogue of the quantiles for directional data."}, "https://arxiv.org/abs/2402.08602": {"title": "Globally-Optimal Greedy Experiment Selection for Active Sequential Estimation", "link": "https://arxiv.org/abs/2402.08602", "description": "Motivated by modern applications such as computerized adaptive testing, sequential rank aggregation, and heterogeneous data source selection, we study the problem of active sequential estimation, which involves adaptively selecting experiments for sequentially collected data. The goal is to design experiment selection rules for more accurate model estimation. Greedy information-based experiment selection methods, optimizing the information gain for one-step ahead, have been employed in practice thanks to their computational convenience, flexibility to context or task changes, and broad applicability. However, statistical analysis is restricted to one-dimensional cases due to the problem's combinatorial nature and the seemingly limited capacity of greedy algorithms, leaving the multidimensional problem open.\n In this study, we close the gap for multidimensional problems. In particular, we propose adopting a class of greedy experiment selection methods and provide statistical analysis for the maximum likelihood estimator following these selection rules. This class encompasses both existing methods and introduces new methods with improved numerical efficiency. We prove that these methods produce consistent and asymptotically normal estimators. Additionally, within a decision theory framework, we establish that the proposed methods achieve asymptotic optimality when the risk measure aligns with the selection rule. We also conduct extensive numerical studies on both simulated and real data to illustrate the efficacy of the proposed methods.\n From a technical perspective, we devise new analytical tools to address theoretical challenges. These analytical tools are of independent theoretical interest and may be reused in related problems involving stochastic approximation and sequential designs."}, "https://arxiv.org/abs/2402.08616": {"title": "Adjustment Identification Distance: A gadjid for Causal Structure Learning", "link": "https://arxiv.org/abs/2402.08616", "description": "Evaluating graphs learned by causal discovery algorithms is difficult: The number of edges that differ between two graphs does not reflect how the graphs differ with respect to the identifying formulas they suggest for causal effects. We introduce a framework for developing causal distances between graphs which includes the structural intervention distance for directed acyclic graphs as a special case. We use this framework to develop improved adjustment-based distances as well as extensions to completed partially directed acyclic graphs and causal orders. We develop polynomial-time reachability algorithms to compute the distances efficiently. 
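As a hedged illustration of the greedy, information-based selection rules analyzed in the active sequential estimation abstract above, the sketch below repeatedly picks, from a hypothetical candidate pool, the design point that most increases the log-determinant of a linear-model information matrix (a D-optimality-style one-step-ahead criterion, not the paper's general procedure).

import numpy as np

rng = np.random.default_rng(2)
candidates = rng.normal(size=(500, 4))        # pool of candidate experiments (rows = design points)
M = 1e-6 * np.eye(4)                          # regularized information matrix
chosen = []
for _ in range(20):
    # One-step-ahead gain in log det of the information matrix for each candidate.
    gains = [np.linalg.slogdet(M + np.outer(x, x))[1] for x in candidates]
    best = int(np.argmax(gains))
    chosen.append(best)
    M += np.outer(candidates[best], candidates[best])
print(chosen[:5])                             # indices of the first few selected experiments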
In our package gadjid (open source at https://github.com/CausalDisco/gadjid), we provide implementations of our distances; they are orders of magnitude faster than the structural intervention distance and thereby provide a success metric for causal discovery that scales to graph sizes that were previously prohibitive."}, "https://arxiv.org/abs/2402.08672": {"title": "Model Assessment and Selection under Temporal Distribution Shift", "link": "https://arxiv.org/abs/2402.08672", "description": "We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidate models by estimating the difference of their generalization errors. We further integrate pairwise comparisons into a single-elimination tournament, achieving near-optimal model selection from a collection of candidates. Theoretical analyses and numerical experiments demonstrate the adaptivity of our proposed methods to the non-stationarity in data."}, "https://arxiv.org/abs/1711.00564": {"title": "Sophisticated and small versus simple and sizeable: When does it pay off to introduce drifting coefficients in Bayesian VARs?", "link": "https://arxiv.org/abs/1711.00564", "description": "We assess the relationship between model size and complexity in the time-varying parameter VAR framework via thorough predictive exercises for the Euro Area, the United Kingdom and the United States. It turns out that sophisticated dynamics through drifting coefficients are important in small data sets, while simpler models tend to perform better in sizeable data sets. To combine the best of both worlds, novel shrinkage priors help to mitigate the curse of dimensionality, resulting in competitive forecasts for all scenarios considered. Furthermore, we discuss dynamic model selection to improve upon the best performing individual model for each point in time."}, "https://arxiv.org/abs/2204.02346": {"title": "Finitely Heterogeneous Treatment Effect in Event-study", "link": "https://arxiv.org/abs/2204.02346", "description": "The key assumption of the differences-in-differences approach in the event-study design is that untreated potential outcome differences are mean independent of treatment timing: the parallel trend assumption. In this paper, we relax the parallel trend assumption by assuming a latent type variable and developing a type-specific parallel trend assumption. With a finite support assumption on the latent type variable, we show that an extremum classifier consistently estimates the type assignment. Based on the classification result, we propose a type-specific diff-in-diff estimator for type-specific CATT. By estimating the CATT with regard to the latent type, we study heterogeneity in treatment effect, in addition to heterogeneity in baseline outcomes."}, "https://arxiv.org/abs/2209.14846": {"title": "Modeling and Learning on High-Dimensional Matrix-Variate Sequences", "link": "https://arxiv.org/abs/2209.14846", "description": "We propose a new matrix factor model, named RaDFaM, which is strictly derived based on the general rank decomposition and assumes a structure of a high-dimensional vector factor model for each basis vector. 
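A very small sketch of the rolling-window idea behind the model-assessment abstract above: estimate a model's current generalization error by averaging its most recent held-out losses over several candidate window lengths. The adaptive choice of the window, which is the paper's contribution, is not reproduced, and the loss sequence is simulated.

import numpy as np

def rolling_window_errors(losses, windows=(20, 50, 100, 200)):
    # Average the most recent w per-period losses for each candidate window length w.
    losses = np.asarray(losses, dtype=float)          # oldest first
    return {w: losses[-w:].mean() for w in windows if w <= len(losses)}

rng = np.random.default_rng(3)
losses = rng.normal(loc=1.0, scale=0.2, size=300)
losses[200:] += 0.5                                   # a temporal distribution shift near the end
print(rolling_window_errors(losses))                  # short windows track the shifted regime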
RaDFaM contributes a novel class of low-rank latent structure that makes tradeoff between signal intensity and dimension reduction from the perspective of tensor subspace. Based on the intrinsic separable covariance structure of RaDFaM, for a collection of matrix-valued observations, we derive a new class of PCA variants for estimating loading matrices, and sequentially the latent factor matrices. The peak signal-to-noise ratio of RaDFaM is proved to be superior in the category of PCA-type estimations. We also establish the asymptotic theory including the consistency, convergence rates, and asymptotic distributions for components in the signal part. Numerically, we demonstrate the performance of RaDFaM in applications such as matrix reconstruction, supervised learning, and clustering, on uncorrelated and correlated data, respectively."}, "https://arxiv.org/abs/2212.12822": {"title": "Simultaneous false discovery proportion bounds via knockoffs and closed testing", "link": "https://arxiv.org/abs/2212.12822", "description": "We propose new methods to obtain simultaneous false discovery proportion bounds for knockoff-based approaches. We first investigate an approach based on Janson and Su's $k$-familywise error rate control method and interpolation. We then generalize it by considering a collection of $k$ values, and show that the bound of Katsevich and Ramdas is a special case of this method and can be uniformly improved. Next, we further generalize the method by using closed testing with a multi-weighted-sum local test statistic. This allows us to obtain a further uniform improvement and other generalizations over previous methods. We also develop an efficient shortcut for its implementation. We compare the performance of our proposed methods in simulations and apply them to a data set from the UK Biobank."}, "https://arxiv.org/abs/2305.19749": {"title": "Forecasting high-dimensional functional time series: Application to sub-national age-specific mortality", "link": "https://arxiv.org/abs/2305.19749", "description": "We study the modeling and forecasting of high-dimensional functional time series (HDFTS), which can be cross-sectionally correlated and temporally dependent. We introduce a decomposition of the HDFTS into two distinct components: a deterministic component and a residual component that varies over time. The decomposition is derived through the estimation of two-way functional analysis of variance. A functional time series forecasting method, based on functional principal component analysis, is implemented to produce forecasts for the residual component. By combining the forecasts of the residual component with the deterministic component, we obtain forecast curves for multiple populations. We apply the model to age- and sex-specific mortality rates in the United States, France, and Japan, in which there are 51 states, 95 departments, and 47 prefectures, respectively. The proposed method is capable of delivering more accurate point and interval forecasts in forecasting multi-population mortality than several benchmark methods considered."}, "https://arxiv.org/abs/2306.01521": {"title": "Uncertainty Quantification in Bayesian Reduced-Rank Sparse Regressions", "link": "https://arxiv.org/abs/2306.01521", "description": "Reduced-rank regression recognises the possibility of a rank-deficient matrix of coefficients. 
We propose a novel Bayesian model for estimating the rank of the coefficient matrix, which obviates the need for post-processing steps and allows for uncertainty quantification. Our method employs a mixture prior on the regression coefficient matrix along with a global-local shrinkage prior on its low-rank decomposition. Then, we rely on the Signal Adaptive Variable Selector to perform sparsification and define two novel tools: the Posterior Inclusion Probability uncertainty index and the Relevance Index. The validity of the method is assessed in a simulation study, and then its advantages and usefulness are shown in real-data applications on the chemical composition of tobacco and on the photometry of galaxies."}, "https://arxiv.org/abs/2306.04177": {"title": "Semiparametric Efficiency Gains From Parametric Restrictions on Propensity Scores", "link": "https://arxiv.org/abs/2306.04177", "description": "We explore how much knowing a parametric restriction on propensity scores improves semiparametric efficiency bounds in the potential outcome framework. For stratified propensity scores, considered as a parametric model, we derive explicit formulas for the efficiency gain from knowing how the covariate space is split. Based on these, we find that the efficiency gain decreases as the partition of the stratification becomes finer. For general parametric models, where it is hard to obtain explicit representations of efficiency bounds, we propose a novel framework that enables us to see whether knowing a parametric model is valuable in terms of efficiency even when it is very high-dimensional. In addition to the intuitive fact that knowing the parametric model does not help much if it is sufficiently flexible, we reveal that the efficiency gain can be nearly zero even though the parametric assumption significantly restricts the space of possible propensity scores."}, "https://arxiv.org/abs/2311.04696": {"title": "Quantification and cross-fitting inference of asymmetric relations under generative exposure mapping models", "link": "https://arxiv.org/abs/2311.04696", "description": "In many practical studies, learning directionality between a pair of variables is of great interest while notoriously hard when their underlying relation is nonlinear. This paper presents a method that examines asymmetry in exposure-outcome pairs when a priori assumptions about their relative ordering are unavailable. Our approach utilizes a framework of generative exposure mapping (GEM) to study asymmetric relations in continuous exposure-outcome pairs, through which we can capture distributional asymmetries with no prefixed variable ordering. We propose a coefficient of asymmetry to quantify relational asymmetry using Shannon's entropy analytics as well as statistical estimation and inference for such an estimand of directionality. Large-sample theoretical guarantees are established for cross-fitting inference techniques. 
The proposed methodology is extended to allow both measured confounders and contamination in outcome measurements, which is evaluated through extensive simulation studies and real data applications."}, "https://arxiv.org/abs/2304.08242": {"title": "The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges", "link": "https://arxiv.org/abs/2304.08242", "description": "Numerical interactions leading to users sharing textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualisation of the data is mandatory. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM builds a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM better recovers the partitions of the nodes than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure."}, "https://arxiv.org/abs/2310.17816": {"title": "Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs", "link": "https://arxiv.org/abs/2310.17816", "description": "Causal discovery is crucial for causal inference in observational studies: it can enable the identification of valid adjustment sets (VAS) for unbiased effect estimation. However, global causal discovery is notoriously hard in the nonparametric setting, with exponential time and sample complexity in the worst case. To address this, we propose local discovery by partitioning (LDP), a novel nonparametric local discovery algorithm that is tailored for downstream inference tasks while avoiding the pretreatment assumption. LDP is a constraint-based procedure that partitions variables into subsets defined solely by their causal relation to an exposure-outcome pair. Further, LDP returns a VAS for the exposure-outcome pair under causal insufficiency and mild sufficient conditions. The total number of independence tests is worst-case quadratic in the variable count. Asymptotic theoretical guarantees are numerically validated on synthetic graphs. Adjustment sets from LDP yield less biased and more precise average treatment effect estimates than baseline discovery algorithms, with LDP outperforming on confounder recall, runtime, and test count for VAS discovery. Further, LDP ran at least 1300x faster than baselines on a benchmark."}, "https://arxiv.org/abs/2311.06840": {"title": "Omitted Labels in Causality: A Study of Paradoxes", "link": "https://arxiv.org/abs/2311.06840", "description": "We explore what we call ``omitted label contexts,'' in which training data is limited to a subset of the possible labels. This setting is common among specialized human experts or specific focused studies.
We lean on well-studied paradoxes (Simpson's and Condorcet) to illustrate the more general difficulties of causal inference in omitted label contexts. Contrary to the fundamental principles on which much of causal inference is built, we show that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. These pitfalls lead us to study networks of conclusions drawn from different contexts and the structures they form, proving an interesting connection between these networks and social choice theory."}, "https://arxiv.org/abs/2402.08707": {"title": "Data Analytics for Intermodal Freight Transportation Applications", "link": "https://arxiv.org/abs/2402.08707", "description": "arXiv:2402.08707v1 Announce Type: new\nAbstract: With the growth of intermodal freight transportation, it is important that transportation planners and decision makers are knowledgeable about freight flow data to make informed decisions. This is particularly true with Intelligent Transportation Systems (ITS) offering new capabilities to intermodal freight transportation. Specifically, ITS enables access to multiple different data sources, but these sources have different formats, resolutions, and time scales. Thus, knowledge of data science is essential to be successful in future ITS-enabled intermodal freight transportation systems. This chapter discusses the commonly used descriptive and predictive data analytic techniques in intermodal freight transportation applications. These techniques cover the entire spectrum of univariate, bivariate, and multivariate analyses. In addition to illustrating how to apply these techniques through relatively simple examples, this chapter will also show how to apply them using the statistical software R. Additional exercises are provided for those who wish to apply the described techniques to more complex problems."}, "https://arxiv.org/abs/2402.08828": {"title": "Fusing Individualized Treatment Rules Using Secondary Outcomes", "link": "https://arxiv.org/abs/2402.08828", "description": "arXiv:2402.08828v1 Announce Type: new\nAbstract: An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practical settings, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method.
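Since the omitted-labels abstract above leans on Simpson's paradox, a standard numerical instance (the textbook kidney-stone counts) is reproduced below: group A does better within every stratum of Z yet worse in the aggregate. This is only a classical illustration, not an example from the paper.

# Success counts / totals for two groups, split by a stratum variable Z.
group_a = {"Z=0": (81, 87), "Z=1": (192, 263)}
group_b = {"Z=0": (234, 270), "Z=1": (55, 80)}

def rate(successes, total):
    return successes / total

for z in ("Z=0", "Z=1"):
    print(z, rate(*group_a[z]) > rate(*group_b[z]))    # True in each stratum

total_a = tuple(map(sum, zip(*group_a.values())))
total_b = tuple(map(sum, zip(*group_b.values())))
print("aggregate", rate(*total_a) > rate(*total_b))    # False once strata are pooled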
Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method."}, "https://arxiv.org/abs/2402.08873": {"title": "Balancing Method for Non-monotone Missing Data", "link": "https://arxiv.org/abs/2402.08873", "description": "arXiv:2402.08873v1 Announce Type: new\nAbstract: Covariate balancing methods have been widely applied to single or monotone missing patterns and have certain advantages over likelihood-based methods and inverse probability weighting approaches based on standard logistic regression. In this paper, we consider non-monotone missing data under the complete-case missing variable condition (CCMV), which is a case of missing not at random (MNAR). Using relationships between each missing pattern and the complete-case subsample, a weighted estimator can be used for estimation, where the weight is a sum of ratios of conditional probability of observing a particular missing pattern versus that of observing the complete-case pattern, given the variables observed in the corresponding missing pattern. Plug-in estimators of the propensity ratios, however, can be unbounded and lead to unstable estimation. Using further relations between propensity ratios and balancing of moments across missing patterns, we employ tailored loss functions each encouraging empirical balance across patterns to estimate propensity ratios flexibly using functional basis expansion. We propose two penalizations to separately control propensity ratio model complexity and covariate imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions. Asymptotic normality and efficiency are also developed. Numerical simulation results show that the proposed method achieves smaller mean squared errors than other methods."}, "https://arxiv.org/abs/2402.08877": {"title": "Computational Considerations for the Linear Model of Coregionalization", "link": "https://arxiv.org/abs/2402.08877", "description": "arXiv:2402.08877v1 Announce Type: new\nAbstract: In the last two decades, the linear model of coregionalization (LMC) has been widely used to model multivariate spatial processes. From a computational standpoint, the LMC is a substantially easier model to work with than other multidimensional alternatives. Up to now, this fact has been largely overlooked in the literature. Starting from an analogy with matrix normal models, we propose a reformulation of the LMC likelihood that highlights the linear, rather than cubic, computational complexity as a function of the dimension of the response vector. Further, we describe in detail how those simplifications can be included in a Gaussian hierarchical model. In addition, we demonstrate in two examples how the disentangled version of the likelihood we derive can be exploited to improve Markov chain Monte Carlo (MCMC) based computations when conducting Bayesian inference. The first is an interwoven approach that combines samples from centered and whitened parametrizations of the latent LMC distributed random fields. The second is a sparsity-inducing method that introduces structural zeros in the coregionalization matrix in an attempt to reduce the number of parameters in a principled way. It also provides a new way to investigate the strength of the correlation among the components of the outcome vector. 
Both approaches come at virtually no additional cost and are shown to significantly improve MCMC performance and predictive performance, respectively. We apply our methodology to a dataset comprising air pollutant measurements in the state of California."}, "https://arxiv.org/abs/2402.08879": {"title": "Inference for an Algorithmic Fairness-Accuracy Frontier", "link": "https://arxiv.org/abs/2402.08879", "description": "arXiv:2402.08879v1 Announce Type: new\nAbstract: Decision-making processes increasingly rely on the use of algorithms. Yet, algorithms' predictive ability frequently exhibits systematic variation across subgroups of the population. While both fairness and accuracy are desirable properties of an algorithm, they often come at the cost of one another. What should a fairness-minded policymaker do then, when confronted with finite data? In this paper, we provide a consistent estimator for a theoretical fairness-accuracy frontier put forward by Liang, Lu and Mu (2023) and propose inference methods to test hypotheses that have received much attention in the fairness literature, such as (i) whether fully excluding a covariate from use in training the algorithm is optimal and (ii) whether there are less discriminatory alternatives to an existing algorithm. We also provide an estimator for the distance between a given algorithm and the fairest point on the frontier, and characterize its asymptotic distribution. We leverage the fact that the fairness-accuracy frontier is part of the boundary of a convex set that can be fully represented by its support function. We show that the estimated support function converges to a tight Gaussian process as the sample size increases, and then express policy-relevant hypotheses as restrictions on the support function to construct valid test statistics."}, "https://arxiv.org/abs/2402.08941": {"title": "Local-Polynomial Estimation for Multivariate Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2402.08941", "description": "arXiv:2402.08941v1 Announce Type: new\nAbstract: We introduce a multivariate local-linear estimator for multivariate regression discontinuity designs in which treatment is assigned by crossing a boundary in the space of running variables. The dominant approach uses the Euclidean distance from a boundary point as the scalar running variable; hence, multivariate designs are handled as univariate designs. However, the distance running variable is incompatible with the assumption for asymptotic validity. We handle multivariate designs as multivariate. In this study, we develop a novel asymptotic normality result for multivariate local-polynomial estimators. Our estimator is asymptotically valid and can capture heterogeneous treatment effects over the boundary. We demonstrate the effectiveness of our estimator through numerical simulations. Our empirical illustration of a Colombian scholarship study reveals a richer heterogeneity (including its absence) of the treatment effect that is hidden in the original estimates."}, "https://arxiv.org/abs/2402.09033": {"title": "Cross-Temporal Forecast Reconciliation at Digital Platforms with Machine Learning", "link": "https://arxiv.org/abs/2402.09033", "description": "arXiv:2402.09033v1 Announce Type: new\nAbstract: Platform businesses operate on a digital core and their decision making requires high-dimensional accurate forecast streams at different levels of cross-sectional (e.g., geographical regions) and temporal aggregation (e.g., minutes to days).
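The building block behind the regression-discontinuity estimators in the abstract above is local-linear smoothing at a point; a univariate sketch with a triangular kernel is given below (illustrative only, not the multivariate boundary estimator of the paper).

import numpy as np

def local_linear_fit(x0, x, y, h):
    # Weighted least squares of y on (1, x - x0) with triangular kernel weights;
    # the fitted intercept is the local-linear estimate of E[Y | X = x0].
    u = (x - x0) / h
    w = np.clip(1.0 - np.abs(u), 0.0, None)
    X = np.column_stack([np.ones_like(x), x - x0])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=500)
y = np.sin(2.0 * x) + 0.3 * rng.normal(size=500)
print(local_linear_fit(0.0, x, y, h=0.3), np.sin(0.0))   # estimate vs. truth at x0 = 0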
It also necessitates coherent forecasts across all levels of the hierarchy to ensure aligned decision making across different planning units such as pricing, product, controlling and strategy. Given that platform data streams feature complex characteristics and interdependencies, we introduce a non-linear hierarchical forecast reconciliation method that produces cross-temporal reconciled forecasts in a direct and automated way through the use of popular machine learning methods. The method is sufficiently fast to allow forecast-based high-frequency decision making that platforms require. We empirically test our framework on a unique, large-scale streaming dataset from a leading on-demand delivery platform in Europe."}, "https://arxiv.org/abs/2402.09086": {"title": "Impact of Non-Informative Censoring on Propensity Score Based Estimation of Marginal Hazard Ratios", "link": "https://arxiv.org/abs/2402.09086", "description": "arXiv:2402.09086v1 Announce Type: new\nAbstract: In medical and epidemiological studies, one of the most common settings is studying the effect of a treatment on a time-to-event outcome, where the time-to-event might be censored before end of study. A common parameter of interest in such a setting is the marginal hazard ratio (MHR). When a study is based on observational data, propensity score (PS) based methods are often used, in an attempt to make the treatment groups comparable despite having a non-randomized treatment. Previous studies have shown censoring to be a factor that induces bias when using PS based estimators. In this paper we study the magnitude of the bias under different rates of non-informative censoring when estimating MHR using PS weighting or PS matching. A bias correction involving the probability of event is suggested and compared to conventional PS based methods."}, "https://arxiv.org/abs/2402.09332": {"title": "Nonparametric identification and efficient estimation of causal effects with instrumental variables", "link": "https://arxiv.org/abs/2402.09332", "description": "arXiv:2402.09332v1 Announce Type: new\nAbstract: Instrumental variables are widely used in econometrics and epidemiology for identifying and estimating causal effects when an exposure of interest is confounded by unmeasured factors. Despite this popularity, the assumptions invoked to justify the use of instruments differ substantially across the literature. Similarly, statistical approaches for estimating the resulting causal quantities vary considerably, and often rely on strong parametric assumptions. In this work, we compile and organize structural conditions that nonparametrically identify conditional average treatment effects, average treatment effects among the treated, and local average treatment effects, with a focus on identification formulae invoking the conditional Wald estimand. Moreover, we build upon existing work and propose nonparametric efficient estimators of functionals corresponding to marginal and conditional causal contrasts resulting from the various identification paradigms. 
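A tiny simulated illustration of the (unconditional) Wald estimand referenced in the instrumental-variables abstract above: the ratio of instrument contrasts in the outcome and in the exposure recovers the treatment effect despite unmeasured confounding. The data-generating process is made up, with a homogeneous effect of 1.0.

import numpy as np

rng = np.random.default_rng(5)
n = 50000
z = rng.integers(0, 2, size=n)                                  # binary instrument
u = rng.normal(size=n)                                          # unmeasured confounder
d = (0.5 * z + u + rng.normal(size=n) > 0).astype(float)        # confounded exposure
y = 1.0 * d + u + rng.normal(size=n)                            # true effect of d is 1.0

naive = y[d == 1].mean() - y[d == 0].mean()                     # biased by u
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(naive, wald)                                              # naive is too large, Wald is near 1.0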
We illustrate the proposed methods on an observational study examining the effects of operative care on adverse events for cholecystitis patients, and a randomized trial assessing the effects of market participation on political views."}, "https://arxiv.org/abs/2402.08845": {"title": "Feature Attribution with Necessity and Sufficiency via Dual-stage Perturbation Test for Causal Explanation", "link": "https://arxiv.org/abs/2402.08845", "description": "arXiv:2402.08845v1 Announce Type: cross\nAbstract: We investigate the problem of explainability in machine learning. To address this problem, Feature Attribution Methods (FAMs) measure the contribution of each feature through a perturbation test, where the difference in prediction is compared under different perturbations. However, such perturbation tests may not accurately distinguish the contributions of different features, when their change in prediction is the same after perturbation. In order to enhance the ability of FAMs to distinguish different features' contributions in this challenging setting, we propose to utilize the probability (PNS) that perturbing a feature is a necessary and sufficient cause for the prediction to change as a measure of feature importance. Our approach, Feature Attribution with Necessity and Sufficiency (FANS), computes the PNS via a perturbation test involving two stages (factual and interventional). In practice, to generate counterfactual samples, we use a resampling-based approach on the observed samples to approximate the required conditional distribution. Finally, we combine FANS and gradient-based optimization to extract the subset with the largest PNS. We demonstrate that FANS outperforms existing feature attribution methods on six benchmarks."}, "https://arxiv.org/abs/2402.09032": {"title": "Efficient spatial designs for targeted regions or level set detection", "link": "https://arxiv.org/abs/2402.09032", "description": "arXiv:2402.09032v1 Announce Type: cross\nAbstract: Acquiring information on spatial phenomena can be costly and time-consuming. In this context, to obtain reliable global knowledge, the choice of measurement location is a crucial issue. Space-filling designs are often used to control variability uniformly across the whole space. However, in a monitoring context, it is more relevant to focus on crucial regions, especially when dealing with sensitive areas such as the environment, climate or public health. It is therefore important to choose a relevant optimality criterion to build models adapted to the purpose of the experiment. In this article, we propose two new optimality criteria: the first aims to focus on areas where the response exceeds a given threshold, while the second is suitable for estimating sets of levels. We introduce several algorithms for constructing optimal designs. We also focus on cost-effective algorithms that produce non-optimal but efficient designs. For both sequential and non-sequential contexts, we compare our designs with existing ones through extensive simulation studies."}, "https://arxiv.org/abs/2402.09328": {"title": "Connecting Algorithmic Fairness to Quality Dimensions in Machine Learning in Official Statistics and Survey Production", "link": "https://arxiv.org/abs/2402.09328", "description": "arXiv:2402.09328v1 Announce Type: cross\nAbstract: National Statistical Organizations (NSOs) increasingly draw on Machine Learning (ML) to improve the timeliness and cost-effectiveness of their products.
When introducing ML solutions, NSOs must ensure that high standards with respect to robustness, reproducibility, and accuracy are upheld as codified, e.g., in the Quality Framework for Statistical Algorithms (QF4SA; Yung et al. 2022). At the same time, a growing body of research focuses on fairness as a pre-condition of a safe deployment of ML to prevent disparate social impacts in practice. However, fairness has not yet been explicitly discussed as a quality aspect in the context of the application of ML at NSOs. We employ Yung et al. (2022)'s QF4SA quality framework and present a mapping of its quality dimensions to algorithmic fairness. We thereby extend the QF4SA framework in several ways: we argue for fairness as its own quality dimension, we investigate the interaction of fairness with other dimensions, and we explicitly address data, both on its own and its interaction with applied methodology. In parallel with empirical illustrations, we show how our mapping can contribute to methodology in the domains of official statistics, algorithmic fairness, and trustworthy machine learning."}, "https://arxiv.org/abs/2402.09356": {"title": "On the Impact of Spatial Covariance Matrix Ordering on Tile Low-Rank Estimation of Mat\\'ern Parameters", "link": "https://arxiv.org/abs/2402.09356", "description": "arXiv:2402.09356v1 Announce Type: cross\nAbstract: Spatial statistical modeling and prediction involve generating and manipulating an n*n symmetric positive definite covariance matrix, where n denotes the number of spatial locations. However, when n is large, processing this covariance matrix using traditional methods becomes prohibitive. Thus, coupling parallel processing with approximation can be an elegant solution to this challenge by relying on parallel solvers that deal with the matrix as a set of small tiles instead of the full structure. Each processing unit can process a single tile, allowing better performance. The approximation can also be performed at the tile level for better compression and faster execution. The Tile Low-Rank (TLR) approximation, a tile-based approximation algorithm, has recently been used in spatial statistics applications. However, the quality of TLR algorithms mainly relies on ordering the matrix elements. This order can impact the compression quality and, therefore, the efficiency of the underlying linear solvers, which highly depends on the individual ranks of each tile. Thus, herein, we aim to investigate the accuracy and performance of some existing ordering algorithms that are used to order the geospatial locations before generating the spatial covariance matrix. Furthermore, we highlight the pros and cons of each ordering algorithm in the context of spatial statistics applications and give hints to practitioners on how to choose the ordering algorithm carefully. We assess the quality of the compression and the accuracy of the statistical parameter estimates of the Mat\\'ern covariance function using TLR approximation under various ordering algorithms and settings of correlations."}, "https://arxiv.org/abs/2110.14465": {"title": "Unbiased Statistical Estimation and Valid Confidence Intervals Under Differential Privacy", "link": "https://arxiv.org/abs/2110.14465", "description": "arXiv:2110.14465v2 Announce Type: replace\nAbstract: We present a method for producing unbiased parameter estimates and valid confidence intervals under the constraints of differential privacy, a formal framework for limiting individual information leakage from sensitive data. 
Prior work in this area is limited in that it is tailored to calculating confidence intervals for specific statistical procedures, such as mean estimation or simple linear regression. While other recent work can produce confidence intervals for more general sets of procedures, they either yield only approximately unbiased estimates, are designed for one-dimensional outputs, or assume significant user knowledge about the data-generating distribution. Our method induces distributions of mean and covariance estimates via the bag of little bootstraps (BLB) and uses them to privately estimate the parameters' sampling distribution via a generalized version of the CoinPress estimation algorithm. If the user can bound the parameters of the BLB-induced parameters and provide heavier-tailed families, the algorithm produces unbiased parameter estimates and valid confidence intervals which hold with arbitrarily high probability. These results hold in high dimensions and for any estimation procedure which behaves nicely under the bootstrap."}, "https://arxiv.org/abs/2212.00984": {"title": "Fully Data-driven Normalized and Exponentiated Kernel Density Estimator with Hyv\\\"arinen Score", "link": "https://arxiv.org/abs/2212.00984", "description": "arXiv:2212.00984v2 Announce Type: replace\nAbstract: We introduce a new deal of kernel density estimation using an exponentiated form of kernel density estimators. The density estimator has two hyperparameters flexibly controlling the smoothness of the resulting density. We tune them in a data-driven manner by minimizing an objective function based on the Hyv\\\"arinen score to avoid the optimization involving the intractable normalizing constant due to the exponentiation. We show the asymptotic properties of the proposed estimator and emphasize the importance of including the two hyperparameters for flexible density estimation. Our simulation studies and application to income data show that the proposed density estimator is appealing when the underlying density is multi-modal or observations contain outliers."}, "https://arxiv.org/abs/2301.11399": {"title": "Distributional outcome regression via quantile functions and its application to modelling continuously monitored heart rate and physical activity", "link": "https://arxiv.org/abs/2301.11399", "description": "arXiv:2301.11399v3 Announce Type: replace\nAbstract: Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations resulting in novel regression settings. Motivated by these modelling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients. 
The method is motivated by and applied to Actiheart component of Baltimore Longitudinal Study of Aging that collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults to gain deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights in epidemiology of daily life heart rate reserve."}, "https://arxiv.org/abs/2302.11746": {"title": "Binary Regression and Classification with Covariates in Metric Spaces", "link": "https://arxiv.org/abs/2302.11746", "description": "arXiv:2302.11746v3 Announce Type: replace\nAbstract: Inspired by logistic regression, we introduce a regression model for data tuples consisting of a binary response and a set of covariates residing in a metric space without vector structures. Based on the proposed model we also develop a binary classifier for metric-space valued data. We propose a maximum likelihood estimator for the metric-space valued regression coefficient in the model, and provide upper bounds on the estimation error under various metric entropy conditions that quantify complexity of the underlying metric space. Matching lower bounds are derived for the important metric spaces commonly seen in statistics, establishing optimality of the proposed estimator in such spaces. Similarly, an upper bound on the excess risk of the developed classifier is provided for general metric spaces. A finer upper bound and a matching lower bound, and thus optimality of the proposed classifier, are established for Riemannian manifolds. To the best of our knowledge, the proposed regression model and the above minimax bounds are the first of their kind for analyzing a binary response with covariates residing in general metric spaces. We also investigate the numerical performance of the proposed estimator and classifier via simulation studies, and illustrate their practical merits via an application to task-related fMRI data."}, "https://arxiv.org/abs/2305.02685": {"title": "Testing for no effect in regression problems: a permutation approach", "link": "https://arxiv.org/abs/2305.02685", "description": "arXiv:2305.02685v2 Announce Type: replace\nAbstract: Often the question arises whether $Y$ can be predicted based on $X$ using a certain model. Especially for highly flexible models such as neural networks one may ask whether a seemingly good prediction is actually better than fitting pure noise or whether it has to be attributed to the flexibility of the model. This paper proposes a rigorous permutation test to assess whether the prediction is better than the prediction of pure noise. The test avoids any sample splitting and is based instead on generating new pairings of $(X_i, Y_j)$. It introduces a new formulation of the null hypothesis and rigorous justification for the test, which distinguishes it from previous literature. The theoretical findings are applied both to simulated data and to sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test. 
It shows that the less informative the predictors, the lower the probability of rejecting the null hypothesis of fitting pure noise and emphasizes that detecting weaker dependence between variables requires a sufficient sample size."}, "https://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "https://arxiv.org/abs/2310.05539", "description": "arXiv:2310.05539v2 Announce Type: replace\nAbstract: In response to the unique challenge created by high-dimensional mediators in mediation analysis, this paper presents a novel procedure for testing the nullity of the mediation effect in the presence of high-dimensional mediators. The procedure incorporates two distinct features. Firstly, the test remains valid under all cases of the composite null hypothesis, including the challenging scenario where both exposure-mediator and mediator-outcome coefficients are zero. Secondly, it does not impose structural assumptions on the exposure-mediator coefficients, thereby allowing for an arbitrarily strong exposure-mediator relationship. To the best of our knowledge, the proposed test is the first of its kind to provably possess these two features in high-dimensional mediation analysis. The validity and consistency of the proposed test are established, and its numerical performance is showcased through simulation studies. The application of the proposed test is demonstrated by examining the mediation effect of DNA methylation between smoking status and lung cancer development."}, "https://arxiv.org/abs/2312.16307": {"title": "Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration", "link": "https://arxiv.org/abs/2312.16307", "description": "arXiv:2312.16307v2 Announce Type: replace\nAbstract: We consider the setting of synthetic control methods (SCMs), a canonical approach used to estimate the treatment effect on the treated in a panel data setting. We shed light on a frequently overlooked but ubiquitous assumption made in SCMs of \"overlap\": a treated unit can be written as some combination -- typically, convex or linear combination -- of the units that remain under control. We show that if units select their own interventions, and there is sufficiently large heterogeneity between units that prefer different interventions, overlap will not hold. We address this issue by proposing a framework which incentivizes units with different preferences to take interventions they would not normally consider. Specifically, leveraging tools from information design and online learning, we propose a SCM that incentivizes exploration in panel data settings by providing incentive-compatible intervention recommendations to units. We establish this estimator obtains valid counterfactual estimates without the need for an a priori overlap assumption. We extend our results to the setting of synthetic interventions, where the goal is to produce counterfactual outcomes under all interventions, not just control. Finally, we provide two hypothesis tests for determining whether unit overlap holds for a given panel dataset."}, "https://arxiv.org/abs/2211.11700": {"title": "High-Dimensional Undirected Graphical Models for Arbitrary Mixed Data", "link": "https://arxiv.org/abs/2211.11700", "description": "arXiv:2211.11700v2 Announce Type: replace-cross\nAbstract: Graphical models are an important tool in exploring relationships between variables in complex, multivariate data. 
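Editorial note on the "Testing for no effect in regression problems" abstract above: the test compares the fit on the observed pairs with fits obtained after re-pairing (X_i, Y_j). The sketch below only illustrates that general idea with a naive training-loss comparison; the function name, the random-forest learner, and the Monte Carlo p-value are our assumptions and do not reproduce the paper's null formulation or theory.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_no_effect_test(X, y, n_perm=200, seed=0):
    """Approximate p-value for 'the fit is no better than fitting pure noise', obtained by
    comparing the loss on the observed pairing with losses on random re-pairings of X and Y."""
    rng = np.random.default_rng(seed)

    def training_mse(y_vec):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X, y_vec)
        return np.mean((model.predict(X) - y_vec) ** 2)

    observed = training_mse(y)
    permuted = np.array([training_mse(rng.permutation(y)) for _ in range(n_perm)])
    # p-value: how often a random re-pairing fits at least as well as the observed pairing
    return (1 + np.sum(permuted <= observed)) / (n_perm + 1)

# toy usage: a genuinely predictive covariate should yield a small p-value
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)
print(permutation_no_effect_test(X, y, n_perm=50))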
Methods for learning such graphical models are well developed in the case where all variables are either continuous or discrete, including in high-dimensions. However, in many applications data span variables of different types (e.g. continuous, count, binary, ordinal, etc.), whose principled joint analysis is nontrivial. Latent Gaussian copula models, in which all variables are modeled as transformations of underlying jointly Gaussian variables, represent a useful approach. Recent advances have shown how the binary-continuous case can be tackled, but the general mixed variable type regime remains challenging. In this work, we make the simple yet useful observation that classical ideas concerning polychoric and polyserial correlations can be leveraged in a latent Gaussian copula framework. Building on this observation we propose flexible and scalable methodology for data with variables of entirely general mixed type. We study the key properties of the approaches theoretically and empirically, via extensive simulations as well an illustrative application to data from the UK Biobank concerning COVID-19 risk factors."}, "https://arxiv.org/abs/2303.15135": {"title": "Properties of the reconciled distributions for Gaussian and count forecasts", "link": "https://arxiv.org/abs/2303.15135", "description": "arXiv:2303.15135v2 Announce Type: replace-cross\nAbstract: Reconciliation enforces coherence between hierarchical forecasts, in order to satisfy a set of linear constraints. While most works focus on the reconciliation of the point forecasts, we consider probabilistic reconciliation and we analyze the properties of the distributions reconciled via conditioning. We provide a formal analysis of the variance of the reconciled distribution, treating separately the case of Gaussian forecasts and count forecasts. We also study the reconciled upper mean in the case of 1-level hierarchies; also in this case we analyze separately the case of Gaussian forecasts and count forecasts. We then show experiments on the reconciliation of intermittent time series related to the count of extreme market events. The experiments confirm our theoretical results and show that reconciliation largely improves the performance of probabilistic forecasting."}, "https://arxiv.org/abs/2305.15742": {"title": "Counterfactual Generative Models for Time-Varying Treatments", "link": "https://arxiv.org/abs/2305.15742", "description": "arXiv:2305.15742v3 Announce Type: replace-cross\nAbstract: Estimating the counterfactual outcome of treatment is essential for decision-making in public health and clinical science, among others. Often, treatments are administered in a sequential, time-varying manner, leading to an exponentially increased number of possible counterfactual outcomes. Furthermore, in modern applications, the outcomes are high-dimensional and conventional average treatment effect estimation fails to capture disparities in individuals. To tackle these challenges, we propose a novel conditional generative framework capable of producing counterfactual samples under time-varying treatment, without the need for explicit density estimation. Our method carefully addresses the distribution mismatch between the observed and counterfactual distributions via a loss function based on inverse probability re-weighting, and supports integration with state-of-the-art conditional generative models such as the guided diffusion and conditional variational autoencoder. 
We present a thorough evaluation of our method using both synthetic and real-world data. Our results demonstrate that our method is capable of generating high-quality counterfactual samples and outperforms the state-of-the-art baselines."}, "https://arxiv.org/abs/2306.10816": {"title": "$\\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery", "link": "https://arxiv.org/abs/2306.10816", "description": "arXiv:2306.10816v2 Announce Type: replace-cross\nAbstract: Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and associated ground truth information to build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests in order to flexibly estimate and represent conditional distributions that may be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms."}, "https://arxiv.org/abs/2310.12115": {"title": "MMD-based Variable Importance for Distributional Random Forest", "link": "https://arxiv.org/abs/2310.12115", "description": "arXiv:2310.12115v2 Announce Type: replace-cross\nAbstract: Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions."}, "https://arxiv.org/abs/2401.03633": {"title": "A Spatial-statistical model to analyse historical rutting data", "link": "https://arxiv.org/abs/2401.03633", "description": "arXiv:2401.03633v2 Announce Type: replace-cross\nAbstract: Pavement rutting poses a significant challenge in flexible pavements, necessitating costly asphalt resurfacing. To address this issue comprehensively, we propose an advanced Bayesian hierarchical framework of latent Gaussian models with spatial components. 
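Editorial note on the "MMD-based Variable Importance for Distributional Random Forest" abstract above: the importance measure rests on the maximum mean discrepancy between conditional distributions before and after dropping a variable. The sketch below shows only the distance ingredient, the standard unbiased MMD^2 estimator with a Gaussian kernel; the bandwidth choice and names are our assumptions, and the drop-and-relearn DRF pipeline itself is not reproduced here.

import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples X and Y under a Gaussian kernel."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * bandwidth**2))

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

# a drop-and-relearn importance for variable j would compare, via this distance, samples from
# the conditional distribution estimated with all variables against samples from a refit without j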
Our model provides a thorough diagnostic analysis, pinpointing areas exhibiting unexpectedly high rutting rates. Incorporating spatial and random components, and important explanatory variables like annual average daily traffic (traffic intensity), pavement type, rut depth and lane width, our proposed models account for and estimate the influence of these variables on rutting. This approach not only quantifies uncertainties and discerns locations at the highest risk of requiring maintenance, but also uncover spatial dependencies in rutting (millimetre/year). We apply our models to a data set spanning eleven years (2010-2020). Our findings emphasize the systematic unexpected spatial rutting effect, where more of the rutting variability is explained by including spatial components. Pavement type, in conjunction with traffic intensity, is also found to be the primary driver of rutting. Furthermore, the spatial dependencies uncovered reveal road sections experiencing more than 1 millimeter of rutting beyond annual expectations. This leads to a halving of the expected pavement lifespan in these areas. Our study offers valuable insights, presenting maps indicating expected rutting, and identifying locations with accelerated rutting rates, resulting in a reduction in pavement life expectancy of at least 10 years."}, "https://arxiv.org/abs/2402.09472": {"title": "Identifying Intended Effects with Causal Models", "link": "https://arxiv.org/abs/2402.09472", "description": "arXiv:2402.09472v1 Announce Type: new \nAbstract: The aim of this paper is to extend the framework of causal inference, in particular as it has been developed by Judea Pearl, in order to model actions and identify their intended effects, in the direction opened by Elisabeth Anscombe. We show how intentions can be inferred from a causal model and its implied correlations observable in data. The paper defines confounding effects as the reasons why teleological inference may fail and introduces interference as a way to control for them. The ''fundamental problem'' of teleological inference is presented, explaining why causal analysis needs an extension in order to take intentions into account."}, "https://arxiv.org/abs/2402.09583": {"title": "Horseshoe Priors for Sparse Dirichlet-Multinomial Models", "link": "https://arxiv.org/abs/2402.09583", "description": "arXiv:2402.09583v1 Announce Type: new \nAbstract: Bayesian inference for Dirichlet-Multinomial (DM) models has a long and important history. The concentration parameter $\\alpha$ is pivotal in smoothing category probabilities within the multinomial distribution and is crucial for the inference afterward. Due to the lack of a tractable form of its marginal likelihood, $\\alpha$ is often chosen ad-hoc, or estimated using approximation algorithms. A constant $\\alpha$ often leads to inadequate smoothing of probabilities, particularly for sparse compositional count datasets. In this paper, we introduce a novel class of prior distributions facilitating conjugate updating of the concentration parameter, allowing for full Bayesian inference for DM models. Our methodology is based on fast residue computation and admits closed-form posterior moments in specific scenarios. Additionally, our prior provides continuous shrinkage with its heavy tail and substantial mass around zero, ensuring adaptability to the sparsity or quasi-sparsity of the data. We demonstrate the usefulness of our approach on both simulated examples and on a real-world human microbiome dataset. 
Finally, we conclude with directions for future research."}, "https://arxiv.org/abs/2402.09698": {"title": "Combining Evidence Across Filtrations", "link": "https://arxiv.org/abs/2402.09698", "description": "arXiv:2402.09698v1 Announce Type: new \nAbstract: In anytime-valid sequential inference, it is known that any admissible inference procedure must be based on test martingales and their composite generalization, called e-processes, which are nonnegative processes whose expectation at any arbitrary stopping time is upper-bounded by one. An e-process quantifies the accumulated evidence against a composite null hypothesis over a sequence of outcomes. This paper studies methods for combining e-processes that are computed using different information sets, i.e., filtrations, for a null hypothesis. Even though e-processes constructed on the same filtration can be combined effortlessly (e.g., by averaging), e-processes constructed on different filtrations cannot be combined as easily because their validity in a coarser filtration does not translate to validity in a finer filtration. We discuss three concrete examples of such e-processes in the literature: exchangeability tests, independence tests, and tests for evaluating and comparing forecasts with lags. Our main result establishes that these e-processes can be lifted into any finer filtration using adjusters, which are functions that allow betting on the running maximum of the accumulated wealth (thereby insuring against the loss of evidence). We also develop randomized adjusters that can improve the power of the resulting sequential inference procedure."}, "https://arxiv.org/abs/2402.09744": {"title": "Quantile Granger Causality in the Presence of Instability", "link": "https://arxiv.org/abs/2402.09744", "description": "arXiv:2402.09744v1 Announce Type: new \nAbstract: We propose a new framework for assessing Granger causality in quantiles in unstable environments, for a fixed quantile or over a continuum of quantile levels. Our proposed test statistics are consistent against fixed alternatives, they have nontrivial power against local alternatives, and they are pivotal in certain important special cases. In addition, we show the validity of a bootstrap procedure when asymptotic distributions depend on nuisance parameters. Monte Carlo simulations reveal that the proposed test statistics have correct empirical size and high power, even in absence of structural breaks. Finally, two empirical applications in energy economics and macroeconomics highlight the applicability of our method as the new tests provide stronger evidence of Granger causality."}, "https://arxiv.org/abs/2402.09758": {"title": "Extrapolation-Aware Nonparametric Statistical Inference", "link": "https://arxiv.org/abs/2402.09758", "description": "arXiv:2402.09758v1 Announce Type: new \nAbstract: We define extrapolation as any type of statistical inference on a conditional function (e.g., a conditional expectation or conditional quantile) evaluated outside of the support of the conditioning variable. This type of extrapolation occurs in many data analysis applications and can invalidate the resulting conclusions if not taken into account. While extrapolating is straightforward in parametric models, it becomes challenging in nonparametric models. 
In this work, we extend the nonparametric statistical model to explicitly allow for extrapolation and introduce a class of extrapolation assumptions that can be combined with existing inference techniques to draw extrapolation-aware conclusions. The proposed class of extrapolation assumptions stipulate that the conditional function attains its minimal and maximal directional derivative, in each direction, within the observed support. We illustrate how the framework applies to several statistical applications including prediction and uncertainty quantification. We furthermore propose a consistent estimation procedure that can be used to adjust existing nonparametric estimates to account for extrapolation by providing lower and upper extrapolation bounds. The procedure is empirically evaluated on both simulated and real-world data."}, "https://arxiv.org/abs/2402.09788": {"title": "An extension of sine-skewed circular distributions", "link": "https://arxiv.org/abs/2402.09788", "description": "arXiv:2402.09788v1 Announce Type: new \nAbstract: Sine-skewed circular distributions are identifiable and have easily-computable trigonometric moments and a simple random number generation algorithm, whereas they are known to have relatively low levels of asymmetry. This study proposes a new family of circular distributions that can be skewed more significantly than that of existing models. It is shown that a subfamily of the proposed distributions is identifiable with respect to parameters and all distributions in the subfamily have explicit trigonometric moments and a simple random number generation algorithm. The maximum likelihood estimation for model parameters is considered and its finite sample performances are investigated by numerical simulations. Some real data applications are illustrated for practical purposes."}, "https://arxiv.org/abs/2402.09789": {"title": "Identification with Posterior-Separable Information Costs", "link": "https://arxiv.org/abs/2402.09789", "description": "arXiv:2402.09789v1 Announce Type: new \nAbstract: I provide a model of rational inattention with heterogeneity and prove it is observationally equivalent to a state-dependent stochastic choice model subject to attention costs. I demonstrate that additive separability of unobservable heterogeneity, together with an independence assumption, suffice for the empirical model to admit a representative agent. Using conditional probabilities, I show how to identify: how covariates affect the desirability of goods, (a measure of) welfare, factual changes in welfare, and bounds on counterfactual market shares."}, "https://arxiv.org/abs/2402.09837": {"title": "Conjugacy properties of multivariate unified skew-elliptical distributions", "link": "https://arxiv.org/abs/2402.09837", "description": "arXiv:2402.09837v1 Announce Type: new \nAbstract: The broad class of multivariate unified skew-normal (SUN) distributions has been recently shown to possess fundamental conjugacy properties. When used as priors for the vector of parameters in general probit, tobit, and multinomial probit models, these distributions yield posteriors that still belong to the SUN family. Although such a core result has led to important advancements in Bayesian inference and computation, its applicability beyond likelihoods associated with fully-observed, discretized, or censored realizations from multivariate Gaussian models remains yet unexplored. 
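Editorial note on the sine-skewed circular distributions abstract above: as background, the standard sine-skewed construction that the paper extends perturbs a symmetric circular base density f0 into f0(theta) * (1 + lam * sin(theta - xi)), and admits a simple reflection-based sampler. The sketch below implements that standard construction only, with a von Mises base and illustrative parameter values as our assumptions; it is not the proposed extension.

import numpy as np

def rsine_skewed_vonmises(n, xi=0.0, kappa=2.0, lam=0.7, seed=0):
    """Sample from the sine-skewed von Mises density f0(theta) * (1 + lam*sin(theta - xi)),
    where f0 is von Mises with location xi and concentration kappa, and |lam| <= 1."""
    rng = np.random.default_rng(seed)
    theta0 = rng.vonmises(mu=xi, kappa=kappa, size=n)      # draw from the symmetric base
    u = rng.uniform(size=n)
    accept = u < 0.5 * (1.0 + lam * np.sin(theta0 - xi))    # keep theta0 with this probability
    theta = np.where(accept, theta0, 2.0 * xi - theta0)     # otherwise reflect about xi
    return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi       # wrap back to (-pi, pi]

samples = rsine_skewed_vonmises(10_000)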
This article covers such an important gap by proving that the wider family of multivariate unified skew-elliptical (SUE) distributions, which extends SUNs to more general perturbations of elliptical densities, guarantees conjugacy for broader classes of models, beyond those relying on fully-observed, discretized or censored Gaussians. Such a result leverages the closure under linear combinations, conditioning and marginalization of SUE to prove that such a family is conjugate to the likelihood induced by general multivariate regression models for fully-observed, censored or dichotomized realizations from skew-elliptical distributions. This advancement substantially enlarges the set of models that enable conjugate Bayesian inference to general formulations arising from elliptical and skew-elliptical families, including the multivariate Student's t and skew-t, among others."}, "https://arxiv.org/abs/2402.09858": {"title": "An Approximation Based Theory of Linear Regression", "link": "https://arxiv.org/abs/2402.09858", "description": "arXiv:2402.09858v1 Announce Type: new \nAbstract: The goal of this paper is to provide a theory linear regression based entirely on approximations. It will be argued that the standard linear regression model based theory whether frequentist or Bayesian has failed and that this failure is due to an 'assumed (revealed?) truth' (John Tukey) attitude to the models. This is reflected in the language of statistical inference which involves a concept of truth, for example efficiency, consistency and hypothesis testing. The motivation behind this paper was to remove the word `true' from the theory and practice of linear regression and to replace it by approximation. The approximations considered are the least squares approximations. An approximation is called valid if it contains no irrelevant covariates. This is operationalized using the concept of a Gaussian P-value which is the probability that pure Gaussian noise is better in term of least squares than the covariate. The precise definition given in the paper is intuitive and requires only four simple equations. Given this a valid approximation is one where all the Gaussian P-values are less than a threshold $p0$ specified by the statistician, in this paper with the default value 0.01. This approximations approach is not only much simpler it is overwhelmingly better than the standard model based approach. This will be demonstrated using six real data sets, four from high dimensional regression and two from vector autoregression. Both the simplicity and the superiority of Gaussian P-values derive from their universal exactness and validity. This is in complete contrast to standard F P-values which are valid only for carefully designed simulations.\n The paper contains excerpts from an unpublished paper by John Tukey entitled `Issues relevant to an honest account of data-based inference partially in the light of Laurie Davies's paper'."}, "https://arxiv.org/abs/2402.09888": {"title": "Multinomial mixture for spatial data", "link": "https://arxiv.org/abs/2402.09888", "description": "arXiv:2402.09888v1 Announce Type: new \nAbstract: The purpose of this paper is to extend standard finite mixture models in the context of multinomial mixtures for spatial data, in order to cluster geographical units according to demographic characteristics. The spatial information is incorporated on the model through the mixing probabilities of each component. To be more specific, a Gibbs distribution is assumed for prior probabilities. 
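Editorial note on the "Approximation Based Theory of Linear Regression" abstract above: a Gaussian P-value is described there as the probability that pure Gaussian noise does better, in least-squares terms, than a given covariate. The sketch below is a Monte Carlo illustration of that definition only; the paper gives a simple closed form, and the function names, learner-free setup, and simulated example here are our assumptions.

import numpy as np

def gaussian_p_value_mc(X, y, j, n_sim=2000, seed=0):
    """Monte Carlo illustration of a Gaussian P-value for covariate j: the fraction of i.i.d.
    N(0,1) noise columns that reduce the residual sum of squares of the model without
    covariate j by at least as much as covariate j itself does."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X_rest = np.delete(X, j, axis=1)

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return resid @ resid

    base = rss(X_rest)
    gain_j = base - rss(np.column_stack([X_rest, X[:, j]]))
    gains_noise = np.array([
        base - rss(np.column_stack([X_rest, rng.normal(size=n)]))
        for _ in range(n_sim)
    ])
    return np.mean(gains_noise >= gain_j)

# toy usage: the informative first covariate should receive a small Gaussian P-value
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] + rng.normal(size=100)
print([round(gaussian_p_value_mc(X, y, j, n_sim=500), 3) for j in range(4)])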
In this way, assignment of each observation is affected by neighbors' cluster and spatial dependence is included in the model. Estimation is based on a modified EM algorithm which is enriched by an extra, initial step for approximating the field. The simulated field algorithm is used in this initial step. The presented model will be used for clustering municipalities of Attica with respect to age distribution of residents."}, "https://arxiv.org/abs/2402.09895": {"title": "Spatial Data Analysis", "link": "https://arxiv.org/abs/2402.09895", "description": "arXiv:2402.09895v1 Announce Type: new \nAbstract: This handbook chapter provides an essential introduction to the field of spatial econometrics, offering a comprehensive overview of techniques and methodologies for analysing spatial data in the social sciences. Spatial econometrics addresses the unique challenges posed by spatially dependent observations, where spatial relationships among data points can significantly impact statistical analyses. The chapter begins by exploring the fundamental concepts of spatial dependence and spatial autocorrelation, and highlighting their implications for traditional econometric models. It then introduces a range of spatial econometric models, particularly spatial lag, spatial error, and spatial lag of X models, illustrating how these models accommodate spatial relationships and yield accurate and insightful results about the underlying spatial processes. The chapter provides an intuitive understanding of these models compare to each other. A practical example on London house prices demonstrates the application of spatial econometrics, emphasising its relevance in uncovering hidden spatial patterns, addressing endogeneity, and providing robust estimates in the presence of spatial dependence."}, "https://arxiv.org/abs/2402.09928": {"title": "When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators", "link": "https://arxiv.org/abs/2402.09928", "description": "arXiv:2402.09928v1 Announce Type: new \nAbstract: The conventional Two-Way Fixed-Effects (TWFE) estimator has been under strain lately. Recent literature has revealed potential shortcomings of TWFE when the treatment effects are heterogeneous. Scholars have developed new advanced dynamic Difference-in-Differences (DiD) estimators to tackle these potential shortcomings. However, confusion remains in applied research as to when the conventional TWFE is biased and what issues the novel estimators can and cannot address. In this study, we first provide an intuitive explanation of the problems of TWFE and elucidate the key features of the novel alternative DiD estimators. We then systematically demonstrate the conditions under which the conventional TWFE is inconsistent. We employ Monte Carlo simulations to assess the performance of dynamic DiD estimators under violations of key assumptions, which likely happens in applied cases. While the new dynamic DiD estimators offer notable advantages in capturing heterogeneous treatment effects, we show that the conventional TWFE performs generally well if the model specifies an event-time function. All estimators are equally sensitive to violations of the parallel trends assumption, anticipation effects or violations of time-varying exogeneity. Despite their advantages, the new dynamic DiD estimators tackle a very specific problem and they do not serve as a universal remedy for violations of the most critical assumptions. 
We finally derive, based on our simulations, recommendations for how and when to use TWFE and the new DiD estimators in applied research."}, "https://arxiv.org/abs/2402.09938": {"title": "Optimal Bayesian stepped-wedge cluster randomised trial designs for binary outcome data", "link": "https://arxiv.org/abs/2402.09938", "description": "arXiv:2402.09938v1 Announce Type: new \nAbstract: Under a generalised estimating equation analysis approach, approximate design theory is used to determine Bayesian D-optimal designs. For two examples, considering simple exchangeable and exponential decay correlation structures, we compare the efficiency of identified optimal designs to balanced stepped-wedge designs and corresponding stepped-wedge designs determined by optimising using a normal approximation approach. The dependence of the Bayesian D-optimal designs on the assumed correlation structure is explored; for the considered settings, smaller decay in the correlation between outcomes across time periods, along with larger values of the intra-cluster correlation, leads to designs closer to a balanced design being optimal. Unlike for normal data, it is shown that the optimal design need not be centro-symmetric in the binary outcome case. The efficiency of the Bayesian D-optimal design relative to a balanced design can be large, but situations are demonstrated in which the advantages are small. Similarly, the optimal design from a normal approximation approach is often not much less efficient than the Bayesian D-optimal design. Bayesian D-optimal designs can be readily identified for stepped-wedge cluster randomised trials with binary outcome data. In certain circumstances, principally ones with strong time period effects, they will indicate that a design unlikely to have been identified by previous methods may be substantially more efficient. However, they require a larger number of assumptions than existing optimal designs, and in many situations existing theory under a normal approximation will provide an easier means of identifying an efficient design for binary outcome data."}, "https://arxiv.org/abs/2402.10156": {"title": "Empirically assessing the plausibility of unconfoundedness in observational studies", "link": "https://arxiv.org/abs/2402.10156", "description": "arXiv:2402.10156v1 Announce Type: new \nAbstract: The possibility of unmeasured confounding is one of the main limitations for causal inference from observational studies. There are different methods for partially empirically assessing the plausibility of unconfoundedness. However, most currently available methods require (at least partial) assumptions about the confounding structure, which may be difficult to know in practice. In this paper we describe a simple strategy for empirically assessing the plausibility of conditional unconfoundedness (i.e., whether the candidate set of covariates suffices for confounding adjustment) which does not require any assumptions about the confounding structure, requiring instead assumptions related to temporal ordering between covariates, exposure and outcome (which can be guaranteed by design), measurement error and selection into the study. The proposed method essentially relies on testing the association between a subset of covariates (those associated with the exposure given all other covariates) and the outcome conditional on the remaining covariates and the exposure. 
We describe the assumptions underlying the method, provide proofs, use simulations to corroborate the theory and illustrate the method with an applied example assessing the causal effect of length-for-age measured in childhood and intelligence quotient measured in adulthood using data from the 1982 Pelotas (Brazil) birth cohort. We also discuss the implications of measurement error and some important limitations."}, "https://arxiv.org/abs/2106.05024": {"title": "Contamination Bias in Linear Regressions", "link": "https://arxiv.org/abs/2106.05024", "description": "arXiv:2106.05024v4 Announce Type: replace \nAbstract: We study regressions with multiple treatments and a set of controls that is flexible enough to purge omitted variable bias. We show that these regressions generally fail to estimate convex averages of heterogeneous treatment effects -- instead, estimates of each treatment's effect are contaminated by non-convex averages of the effects of other treatments. We discuss three estimation approaches that avoid such contamination bias, including the targeting of easiest-to-estimate weighted average effects. A re-analysis of nine empirical applications finds economically and statistically meaningful contamination bias in observational studies; contamination bias in experimental studies is more limited due to idiosyncratic effect heterogeneity."}, "https://arxiv.org/abs/2301.00277": {"title": "Higher-order Refinements of Small Bandwidth Asymptotics for Density-Weighted Average Derivative Estimators", "link": "https://arxiv.org/abs/2301.00277", "description": "arXiv:2301.00277v2 Announce Type: replace \nAbstract: The density weighted average derivative (DWAD) of a regression function is a canonical parameter of interest in economics. Classical first-order large sample distribution theory for kernel-based DWAD estimators relies on tuning parameter restrictions and model assumptions that imply an asymptotic linear representation of the point estimator. These conditions can be restrictive, and the resulting distributional approximation may not be representative of the actual sampling distribution of the statistic of interest. In particular, the approximation is not robust to bandwidth choice. Small bandwidth asymptotics offers an alternative, more general distributional approximation for kernel-based DWAD estimators that allows for, but does not require, asymptotic linearity. The resulting inference procedures based on small bandwidth asymptotics were found to exhibit superior finite sample performance in simulations, but no formal theory justifying that empirical success is available in the literature. Employing Edgeworth expansions, this paper shows that small bandwidth asymptotic approximations lead to inference procedures with higher-order distributional properties that are demonstrably superior to those of procedures based on asymptotic linear approximations."}, "https://arxiv.org/abs/2301.04804": {"title": "A Generalized Estimating Equation Approach to Network Regression", "link": "https://arxiv.org/abs/2301.04804", "description": "arXiv:2301.04804v2 Announce Type: replace \nAbstract: Regression models applied to network data where node attributes are the dependent variables poses a methodological challenge. As has been well studied, naive regression neither properly accounts for community structure, nor does it account for the dependent variable acting as both model outcome and covariate. 
To address this methodological gap, we propose a network regression model motivated by the important observation that controlling for community structure can, when a network is modular, significantly account for meaningful correlation between observations induced by network connections. We propose a generalized estimating equation (GEE) approach to learn model parameters based on clusters defined through any single-membership community detection algorithm applied to the observed network. We provide a necessary condition on the network size and edge formation probabilities to establish the asymptotic normality of the model parameters under the assumption that the graph structure is a stochastic block model. We evaluate the performance of our approach through simulations and apply it to estimate the joint impact of baseline covariates and network effects on COVID-19 incidence rate among countries connected by a network of commercial airline traffic. We find that during the beginning of the pandemic the network effect has some influence, the percentage of urban population has more influence on the incidence rate compared to the network effect after the travel ban was in effect."}, "https://arxiv.org/abs/2301.09754": {"title": "Prioritizing Variables for Observational Study Design using the Joint Variable Importance Plot", "link": "https://arxiv.org/abs/2301.09754", "description": "arXiv:2301.09754v3 Announce Type: replace \nAbstract: Observational studies of treatment effects require adjustment for confounding variables. However, causal inference methods typically cannot deliver perfect adjustment on all measured baseline variables, and there is often ambiguity about which variables should be prioritized. Standard prioritization methods based on treatment imbalance alone neglect variables' relationships with the outcome. We propose the joint variable importance plot to guide variable prioritization for observational studies. Since not all variables are equally relevant to the outcome, the plot adds outcome associations to quantify the potential confounding jointly with the standardized mean difference. To enhance comparisons on the plot between variables with different confounding relationships, we also derive and plot bias curves. Variable prioritization using the plot can produce recommended values for tuning parameters in many existing matching and weighting methods. We showcase the use of the joint variable importance plots in the design of a balance-constrained matched study to evaluate whether taking an antidiabetic medication, glyburide, increases the incidence of C-section delivery among pregnant individuals with gestational diabetes."}, "https://arxiv.org/abs/2302.11487": {"title": "Improving Model Choice in Classification: An Approach Based on Clustering of Covariance Matrices", "link": "https://arxiv.org/abs/2302.11487", "description": "arXiv:2302.11487v2 Announce Type: replace \nAbstract: This work introduces a refinement of the Parsimonious Model for fitting a Gaussian Mixture. The improvement is based on the consideration of clusters of the involved covariance matrices according to a criterion, such as sharing Principal Directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. We show that such groupings of covariance matrices can be achieved through simple modifications of the CEM (Classification Expectation Maximization) algorithm. 
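Editorial note on the GEE network-regression abstract above: the recipe is to define clusters via any single-membership community detection algorithm applied to the observed network and then fit a generalized estimating equation over those clusters. The sketch below assembles that pipeline from off-the-shelf pieces; the greedy modularity communities, the exchangeable working correlation, and the simulated block-model data are our illustrative choices, not necessarily the authors'.

import numpy as np
import networkx as nx
import statsmodels.api as sm

rng = np.random.default_rng(3)

# toy network and node-level data
G = nx.stochastic_block_model([30, 30, 30], [[0.30, 0.02, 0.02],
                                             [0.02, 0.30, 0.02],
                                             [0.02, 0.02, 0.30]], seed=3)
n = G.number_of_nodes()
x = rng.normal(size=n)                     # node covariate
y = 1.0 + 0.5 * x + rng.normal(size=n)     # node outcome

# single-membership community detection defines the GEE clusters
communities = nx.algorithms.community.greedy_modularity_communities(G)
labels = np.empty(n, dtype=int)
for c, nodes in enumerate(communities):
    labels[list(nodes)] = c

# GEE with an exchangeable within-community working correlation
exog = sm.add_constant(x)
model = sm.GEE(y, exog, groups=labels,
               cov_struct=sm.cov_struct.Exchangeable(),
               family=sm.families.Gaussian())
print(model.fit().summary())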
Our approach leads to propose Gaussian Mixture Models for model-based clustering and discriminant analysis, in which covariance matrices are clustered according to a parsimonious criterion, creating intermediate steps between the fourteen widely known parsimonious models. The added versatility not only allows us to obtain models with fewer parameters for fitting the data, but also provides greater interpretability. We show its usefulness for model-based clustering and discriminant analysis, providing algorithms to find approximate solutions verifying suitable size, shape and orientation constraints, and applying them to both simulation and real data examples."}, "https://arxiv.org/abs/2308.07248": {"title": "Maintaining the validity of inference from linear mixed models in stepped-wedge cluster randomized trials under misspecified random-effects structures", "link": "https://arxiv.org/abs/2308.07248", "description": "arXiv:2308.07248v3 Announce Type: replace \nAbstract: Linear mixed models are commonly used in analyzing stepped-wedge cluster randomized trials (SW-CRTs). A key consideration for analyzing a SW-CRT is accounting for the potentially complex correlation structure, which can be achieved by specifying a random effects structure. Common random effects structures for a SW-CRT include random intercept, random cluster-by-period, and discrete-time decay. Recently, more complex structures, such as the random intervention structure, have been proposed. In practice, specifying appropriate random effects can be challenging. Robust variance estimators (RVE) may be applied to linear mixed models to provide consistent estimators of standard errors of fixed effect parameters in the presence of random-effects misspecification. However, there has been no empirical investigation of RVE for SW-CRT. In this paper, we first review five RVEs (both standard and small-sample bias-corrected RVEs) that are available for linear mixed models. We then describe a comprehensive simulation study to examine the performance of these RVEs for SW-CRTs with a continuous outcome under different data generators. For each data generator, we investigate whether the use of a RVE with either the random intercept model or the random cluster-by-period model is sufficient to provide valid statistical inference for fixed effect parameters, when these working models are subject to misspecification. Our results indicate that the random intercept and random cluster-by-period models with RVEs performed similarly. The CR3 RVE estimator, coupled with the number of clusters minus two degrees of freedom correction, consistently gave the best coverage results, but could be slightly anti-conservative when the number of clusters was below 16. We summarize the implications of our results for linear mixed model analysis of SW-CRTs in practice."}, "https://arxiv.org/abs/2308.12260": {"title": "Estimating Causal Effects for Binary Outcomes Using Per-Decision Inverse Probability Weighting", "link": "https://arxiv.org/abs/2308.12260", "description": "arXiv:2308.12260v2 Announce Type: replace \nAbstract: Micro-randomized trials are commonly conducted for optimizing mobile health interventions such as push notifications for behavior change. In analyzing such trials, causal excursion effects are often of primary interest, and their estimation typically involves inverse probability weighting (IPW). 
However, in a micro-randomized trial additional treatments can often occur during the time window over which an outcome is defined, and this can greatly inflate the variance of the causal effect estimator because IPW would involve a product of numerous weights. To reduce variance and improve estimation efficiency, we propose a new estimator using a modified version of IPW, which we call \"per-decision IPW\". It is applicable when the outcome is binary and can be expressed as the maximum of a series of sub-outcomes defined over sub-intervals of time. We establish the estimator's consistency and asymptotic normality. Through simulation studies and real data applications, we demonstrate substantial efficiency improvement of the proposed estimator over existing estimators (relative efficiency up to 1.45 and sample size savings up to 31% in realistic settings). The new estimator can be used to improve the precision of primary and secondary analyses for micro-randomized trials with binary outcomes."}, "https://arxiv.org/abs/2309.06988": {"title": "A basket trial design based on power priors", "link": "https://arxiv.org/abs/2309.06988", "description": "arXiv:2309.06988v2 Announce Type: replace \nAbstract: In basket trials a treatment is investigated in several subgroups. They are primarily used in oncology in early clinical phases as single-arm trials with a binary endpoint. For their analysis primarily Bayesian methods have been suggested, as they allow partial sharing of information based on the observed similarity between subgroups. Fujikawa et al. (2020) suggested an approach using empirical Bayes methods that allows flexible sharing based on easily interpretable weights derived from the Jensen-Shannon divergence between the subgroup-wise posterior distributions. We show that this design is closely related to the method of power priors and investigate several modifications of Fujikawa's design using methods from the power prior literature. While in Fujikawa's design, the amount of information that is shared between two baskets is only determined by their pairwise similarity, we also discuss extensions where the outcomes of all baskets are considered in the computation of the sharing weights. The results of our comparison study show that the power prior design has comparable performance to fully Bayesian designs in a range of different scenarios. At the same time, the power prior design is computationally cheap and even allows analytical computation of operating characteristics in some settings."}, "https://arxiv.org/abs/2202.00190": {"title": "Sketching stochastic valuation functions", "link": "https://arxiv.org/abs/2202.00190", "description": "arXiv:2202.00190v3 Announce Type: replace-cross \nAbstract: We consider the problem of sketching a set valuation function, which is defined as the expectation of a valuation function of independent random item values. We show that for monotone subadditive or submodular valuation functions satisfying a weak homogeneity condition, or certain other conditions, there exist discretized distributions of item values with $O(k\\log(k))$ support sizes that yield a sketch valuation function which is a constant-factor approximation, for any value query for a set of items of cardinality less than or equal to $k$. The discretized distributions can be efficiently computed by an algorithm for each item's value distribution separately. 
Our results hold under conditions that accommodate a wide range of valuation functions arising in applications, such as the value of a team corresponding to the best performance of a team member, constant elasticity of substitution production functions exhibiting diminishing returns used in economics and consumer theory, and others. Sketch valuation functions are particularly valuable for finding approximate solutions to optimization problems such as best set selection and welfare maximization. They enable computationally efficient evaluation of approximate value oracle queries and provide an approximation guarantee for the underlying optimization problem."}, "https://arxiv.org/abs/2305.16842": {"title": "Accounting statement analysis at industry level", "link": "https://arxiv.org/abs/2305.16842", "description": "arXiv:2305.16842v4 Announce Type: replace-cross \nAbstract: Compositional data are contemporarily defined as positive vectors, the ratios among whose elements are of interest to the researcher. Financial statement analysis by means of accounting ratios fulfils this definition to the letter. Compositional data analysis solves the major problems in statistical analysis of standard financial ratios at industry level, such as skewness, non-normality, non-linearity and dependence of the results on the choice of which accounting figure goes to the numerator and to the denominator of the ratio. In spite of this, compositional applications to financial statement analysis are still rare. In this article, we present some transformations within compositional data analysis that are particularly useful for financial statement analysis. We show how to compute industry or sub-industry means of standard financial ratios from a compositional perspective. We show how to visualise firms in an industry with a compositional biplot, to classify them with compositional cluster analysis and to relate financial and non-financial indicators with compositional regression models. We show an application to the accounting statements of Spanish wineries using DuPont analysis, and a step-by-step tutorial to the compositional freeware CoDaPack."}, "https://arxiv.org/abs/2402.10537": {"title": "Quantifying Individual Risk for Binary Outcome: Bounds and Inference", "link": "https://arxiv.org/abs/2402.10537", "description": "arXiv:2402.10537v1 Announce Type: new \nAbstract: Understanding treatment heterogeneity is crucial for reliable decision-making in treatment evaluation and selection. While the conditional average treatment effect (CATE) is commonly used to capture treatment heterogeneity induced by covariates and design individualized treatment policies, it remains an averaging metric within subpopulations. This limitation prevents it from unveiling individual-level risks, potentially leading to misleading results. This article addresses this gap by examining individual risk for binary outcomes, specifically focusing on the fraction negatively affected (FNA) conditional on covariates -- a metric assessing the percentage of individuals experiencing worse outcomes with treatment compared to control. Under the strong ignorability assumption, FNA is unidentifiable, and we find that previous bounds are wide and practically unattainable except in certain degenerate cases. By introducing a plausible positive correlation assumption for the potential outcomes, we obtain significantly improved bounds compared to previous studies. 
We show that even with a positive and statistically significant CATE, the lower bound on FNA can be positive, i.e., in the best-case scenario many units will be harmed if receiving treatment. We establish a nonparametric sensitivity analysis framework for FNA using the Pearson correlation coefficient as the sensitivity parameter, thereby exploring the relationships among the correlation coefficient, FNA, and CATE. We also present a practical and tractable method for selecting the range of correlation coefficients. Furthermore, we propose flexible estimators for refined FNA bounds and prove their consistency and asymptotic normality."}, "https://arxiv.org/abs/2402.10545": {"title": "Spatial quantile clustering of climate data", "link": "https://arxiv.org/abs/2402.10545", "description": "arXiv:2402.10545v1 Announce Type: new \nAbstract: In the era of climate change, the distribution of climate variables evolves with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial and/or temporal patterns. We present a novel approach to spatial clustering of time series based on quantiles using a Bayesian framework that incorporates a spatial dependence layer based on a Markov random field. A series of simulations tested the proposal, then applied to the sea surface temperature of the Mediterranean Sea, one of the first seas to be affected by the effects of climate change."}, "https://arxiv.org/abs/2402.10574": {"title": "Nowcasting with mixed frequency data using Gaussian processes", "link": "https://arxiv.org/abs/2402.10574", "description": "arXiv:2402.10574v1 Announce Type: new \nAbstract: We propose and discuss Bayesian machine learning methods for mixed data sampling (MIDAS) regressions. This involves handling frequency mismatches with restricted and unrestricted MIDAS variants and specifying functional relationships between many predictors and the dependent variable. We use Gaussian processes (GP) and Bayesian additive regression trees (BART) as flexible extensions to linear penalized estimation. In a nowcasting and forecasting exercise we focus on quarterly US output growth and inflation in the GDP deflator. The new models leverage macroeconomic Big Data in a computationally efficient way and offer gains in predictive accuracy along several dimensions."}, "https://arxiv.org/abs/2402.10624": {"title": "Functional principal component analysis as an alternative to mixed-effect models for describing sparse repeated measures in presence of missing data", "link": "https://arxiv.org/abs/2402.10624", "description": "arXiv:2402.10624v1 Announce Type: new \nAbstract: Analyzing longitudinal data in health studies is challenging due to sparse and error-prone measurements, strong within-individual correlation, missing data and various trajectory shapes. While mixed-effect models (MM) effectively address these challenges, they remain parametric models and may incur computational costs. In contrast, Functional Principal Component Analysis (FPCA) is a non-parametric approach developed for regular and dense functional data that flexibly describes temporal trajectories at a lower computational cost. This paper presents an empirical simulation study evaluating the behaviour of FPCA with sparse and error-prone repeated measures and its robustness under different missing data schemes in comparison with MM. 
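Editorial note on the "Quantifying Individual Risk for Binary Outcome" abstract above: with 1 coded as the good outcome, FNA is the probability of being harmed by treatment, P(Y(1)=0, Y(0)=1). The sketch below computes only the classical, assumption-free Frechet-Hoeffding bounds implied by the two marginal success probabilities, i.e. the wide bounds that the paper's positive-correlation assumption is designed to tighten; it does not reproduce the paper's improved bounds or estimators.

def fna_frechet_bounds(p_treated, p_control):
    """Classical bounds on FNA = P(Y(1)=0, Y(0)=1) for a binary outcome (1 = good),
    given only the marginals p_treated = P(Y(1)=1) and p_control = P(Y(0)=1)."""
    lower = max(0.0, p_control - p_treated)
    upper = min(1.0 - p_treated, p_control)
    return lower, upper

# example: a positive average effect (0.7 vs 0.6) is still compatible with up to 30% of
# units being harmed, while the assumption-free lower bound is zero
print(fna_frechet_bounds(p_treated=0.7, p_control=0.6))   # (0.0, 0.3)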
The results show that FPCA is well-suited in the presence of missing at random data caused by dropout, except in scenarios involving the most frequent and systematic dropout. Like MM, FPCA fails under a missing-not-at-random mechanism. FPCA was applied to describe the trajectories of four cognitive functions before clinical dementia and contrast them with those of matched controls in a case-control study nested in a population-based aging cohort. The average cognitive declines of future dementia cases showed a sudden divergence from those of their matched controls with a sharp acceleration 5 to 2.5 years prior to diagnosis."}, "https://arxiv.org/abs/2402.10836": {"title": "Manipulation Test for Multidimensional RDD", "link": "https://arxiv.org/abs/2402.10836", "description": "arXiv:2402.10836v1 Announce Type: new \nAbstract: The causal inference model proposed by Lee (2008) for the regression discontinuity design (RDD) relies on assumptions that imply the continuity of the density of the assignment (running) variable. The test for this implication is commonly referred to as the manipulation test and is regularly reported in applied research to strengthen the design's validity. The multidimensional RDD (MRDD) extends the RDD to contexts where treatment assignment depends on several running variables. This paper introduces a manipulation test for the MRDD. First, it develops a theoretical model for causal inference with the MRDD, used to derive a testable implication on the conditional marginal densities of the running variables. Then, it constructs the test for the implication based on a quadratic form of a vector of statistics separately computed for each marginal density. Finally, the proposed test is compared with alternative procedures commonly employed in applied research."}, "https://arxiv.org/abs/2402.10242": {"title": "Signed Diverse Multiplex Networks: Clustering and Inference", "link": "https://arxiv.org/abs/2402.10242", "description": "arXiv:2402.10242v1 Announce Type: cross \nAbstract: The paper introduces a Signed Generalized Random Dot Product Graph (SGRDPG) model, which is a variant of the Generalized Random Dot Product Graph (GRDPG), where, in addition, edges can be positive or negative. The setting is extended to a multiplex version, where all layers have the same collection of nodes and follow the SGRDPG. The only common feature of the layers of the network is that they can be partitioned into groups with common subspace structures, while otherwise the matrices of connection probabilities may all be different. The setting above is extremely flexible and includes a variety of existing multiplex network models as its particular cases. The paper fulfills two objectives. First, it shows that keeping the signs of the edges in the process of network construction leads to better precision of estimation and clustering and, hence, is beneficial for tackling real-world problems such as the analysis of brain networks. Second, by employing novel algorithms, our paper ensures accuracy equivalent or superior to that achieved in simpler multiplex network models.
In addition to theoretical guarantees, both of those features are demonstrated using numerical simulations and a real data example."}, "https://arxiv.org/abs/2402.10255": {"title": "Benchmarking the Operation of Quantum Heuristics and Ising Machines: Scoring Parameter Setting Strategies on Optimization Applications", "link": "https://arxiv.org/abs/2402.10255", "description": "arXiv:2402.10255v1 Announce Type: cross \nAbstract: We discuss guidelines for evaluating the performance of parameterized stochastic solvers for optimization problems, with particular attention to systems that employ novel hardware, such as digital quantum processors running variational algorithms, analog processors performing quantum annealing, or coherent Ising Machines. We illustrate through an example a benchmarking procedure grounded in the statistical analysis of the expectation of a given performance metric measured in a test environment. In particular, we discuss the necessity and cost of setting parameters that affect the algorithm's performance. The optimal value of these parameters could vary significantly between instances of the same target problem. We present an open-source software package that facilitates the design, evaluation, and visualization of practical parameter tuning strategies for complex use of the heterogeneous components of the solver. We examine in detail an example using parallel tempering and a simulator of a photonic Coherent Ising Machine computing and display the scoring of an illustrative baseline family of parameter-setting strategies that feature an exploration-exploitation trade-off."}, "https://arxiv.org/abs/2402.10456": {"title": "Generative Modeling for Tabular Data via Penalized Optimal Transport Network", "link": "https://arxiv.org/abs/2402.10456", "description": "arXiv:2402.10456v1 Announce Type: cross \nAbstract: The task of precisely learning the probability distribution of rows within tabular data and producing authentic synthetic samples is both crucial and non-trivial. Wasserstein generative adversarial network (WGAN) marks a notable improvement in generative modeling, addressing the challenges faced by its predecessor, generative adversarial network. However, due to the mixed data types and multimodalities prevalent in tabular data, the delicate equilibrium between the generator and discriminator, as well as the inherent instability of Wasserstein distance in high dimensions, WGAN often fails to produce high-fidelity samples. To this end, we propose POTNet (Penalized Optimal Transport Network), a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can effectively model tabular data containing both categorical and continuous features. Moreover, it offers the flexibility to condition on a subset of features. We provide theoretical justifications for the motivation behind the MPW loss. We also empirically demonstrate the effectiveness of our proposed method on four different benchmarks across a variety of real-world and simulated datasets. 
Our proposed model achieves an orders-of-magnitude speedup during the sampling stage compared to state-of-the-art generative models for tabular data, thereby enabling efficient large-scale synthetic data generation."}, "https://arxiv.org/abs/2402.10592": {"title": "Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification", "link": "https://arxiv.org/abs/2402.10592", "description": "arXiv:2402.10592v1 Announce Type: cross \nAbstract: Practitioners conducting adaptive experiments often encounter two competing priorities: reducing the cost of experimentation by effectively assigning treatments during the experiment itself, and gathering information swiftly to conclude the experiment and implement a treatment across the population. Currently, the literature is divided, with studies on regret minimization addressing the former priority in isolation, and research on best-arm identification focusing solely on the latter. This paper proposes a unified model that accounts for both within-experiment performance and post-experiment outcomes. We then provide a sharp theory of optimal performance in large populations that unifies canonical results in the literature. This unification also uncovers novel insights. For example, the theory reveals that familiar algorithms, like the recently proposed top-two Thompson sampling algorithm, can be adapted to optimize a broad class of objectives by simply adjusting a single scalar parameter. In addition, the theory reveals that enormous reductions in experiment duration can sometimes be achieved with minimal impact on both within-experiment and post-experiment regret."}, "https://arxiv.org/abs/2402.10608": {"title": "A maximum likelihood estimation of L\\'evy-driven stochastic systems for univariate and multivariate time series of observations", "link": "https://arxiv.org/abs/2402.10608", "description": "arXiv:2402.10608v1 Announce Type: cross \nAbstract: The literature is full of inference techniques developed to estimate the parameters of stochastic dynamical systems driven by the well-known Brownian noise. Such diffusion models are often inappropriate for properly describing the dynamics reflected in many real-world data, which are dominated by jump discontinuities of various sizes and frequencies. To account for the presence of jumps, jump-diffusion models are introduced and some inference techniques are developed. Jump-diffusion models are also inadequate since they fail to reflect the frequent occurrence as well as the continuous spectrum of natural jumps. It is, therefore, crucial to depart from the classical stochastic systems like diffusion and jump-diffusion models and resort to stochastic systems where the regime of stochasticity is governed by the stochastic fluctuations of L\\'evy type. Reconstruction of L\\'evy-driven dynamical systems, however, has been a major challenge. The literature on the reconstruction of L\\'evy-driven systems is rather poor: only a few reconstruction algorithms have been developed, and they suffer from one or several problems such as being data-hungry, failing to provide a full reconstruction of noise parameters, tackling only some specific systems, failing to cope with multivariate data in practice, lacking proper validation mechanisms, and many more. This letter introduces a maximum likelihood estimation procedure that grants a full reconstruction of the system, requires less data, and is straightforward to implement for multivariate data.
To the best of our knowledge, this contribution is the first to tackle all the mentioned shortcomings. We apply our algorithm to simulated data as well as an ice-core dataset spanning the last glaciation. In particular, we find new insights about the dynamics of the climate in the course of the last glaciation that were not found in previous studies."}, "https://arxiv.org/abs/2402.10859": {"title": "Spatio-temporal point process modelling of fires in Sicily exploring human and environmental factors", "link": "https://arxiv.org/abs/2402.10859", "description": "arXiv:2402.10859v1 Announce Type: cross \nAbstract: In 2023, Sicily faced an escalating issue of uncontrolled fires, necessitating a thorough investigation into their spatio-temporal dynamics. Our study addresses this concern through point process theory. Each wildfire is treated as a unique point in both space and time, allowing us to assess the influence of environmental and anthropogenic factors by fitting a spatio-temporal separable Poisson point process model, with a particular focus on the role of land usage. First, a spatial log-linear Poisson model is applied to investigate the influence of land use types on wildfire distribution, controlling for other environmental covariates. The results highlight the significant effect of human activities, altitude, and slope on spatial fire occurrence. Then, a Generalized Additive Model with Poisson-distributed response further explores the temporal dynamics of wildfire occurrences, confirming their dependence on various environmental variables, including the maximum daily temperature, wind speed, surface pressure, and total precipitation."}, "https://arxiv.org/abs/2103.05161": {"title": "The Efficient Shrinkage Path: Maximum Likelihood of Minimum MSE Risk", "link": "https://arxiv.org/abs/2103.05161", "description": "arXiv:2103.05161v5 Announce Type: replace \nAbstract: A new generalized ridge regression shrinkage path is proposed that is as short as possible under the restriction that it must pass through the vector of regression coefficient estimators that make the overall Optimal Variance-Bias Trade-Off under Normal distribution-theory. Five distinct types of ridge TRACE displays plus other graphics for this efficient path are motivated and illustrated here. These visualizations provide invaluable data-analytic insights and improved self-confidence to researchers and data scientists fitting linear models to ill-conditioned (confounded) data."}, "https://arxiv.org/abs/2208.11756": {"title": "Testing Many Constraints in Possibly Irregular Models Using Incomplete U-Statistics", "link": "https://arxiv.org/abs/2208.11756", "description": "arXiv:2208.11756v3 Announce Type: replace \nAbstract: We consider the problem of testing a null hypothesis defined by equality and inequality constraints on a statistical parameter. Testing such hypotheses can be challenging because the number of relevant constraints may be of the same order as, or even larger than, the number of observed samples. Moreover, standard distributional approximations may be invalid due to irregularities in the null hypothesis. We propose a general testing methodology that aims to circumvent these difficulties. The constraints are estimated by incomplete U-statistics, and we derive critical values by Gaussian multiplier bootstrap.
We show that the bootstrap approximation of incomplete U-statistics is valid for kernels that we call mixed degenerate when the number of combinations used to compute the incomplete U-statistic is of the same order as the sample size. It follows that our test controls type I error even in irregular settings. Furthermore, the bootstrap approximation covers high-dimensional settings making our testing strategy applicable for problems with many constraints. The methodology is applicable, in particular, when the constraints to be tested are polynomials in U-estimable parameters. As an application, we consider goodness-of-fit tests of latent tree models for multivariate data."}, "https://arxiv.org/abs/2305.03149": {"title": "A Spectral Method for Identifiable Grade of Membership Analysis with Binary Responses", "link": "https://arxiv.org/abs/2305.03149", "description": "arXiv:2305.03149v3 Announce Type: replace \nAbstract: Grade of Membership (GoM) models are popular individual-level mixture models for multivariate categorical data. GoM allows each subject to have mixed memberships in multiple extreme latent profiles. Therefore GoM models have a richer modeling capacity than latent class models that restrict each subject to belong to a single profile. The flexibility of GoM comes at the cost of more challenging identifiability and estimation problems. In this work, we propose a singular value decomposition (SVD) based spectral approach to GoM analysis with multivariate binary responses. Our approach hinges on the observation that the expectation of the data matrix has a low-rank decomposition under a GoM model. For identifiability, we develop sufficient and almost necessary conditions for a notion of expectation identifiability. For estimation, we extract only a few leading singular vectors of the observed data matrix, and exploit the simplex geometry of these vectors to estimate the mixed membership scores and other parameters. We also establish the consistency of our estimator in the double-asymptotic regime where both the number of subjects and the number of items grow to infinity. Our spectral method has a huge computational advantage over Bayesian or likelihood-based methods and is scalable to large-scale and high-dimensional data. Extensive simulation studies demonstrate the superior efficiency and accuracy of our method. We also illustrate our method by applying it to a personality test dataset."}, "https://arxiv.org/abs/2305.17517": {"title": "Stochastic Nonparametric Estimation of the Density-Flow Curve", "link": "https://arxiv.org/abs/2305.17517", "description": "arXiv:2305.17517v3 Announce Type: replace \nAbstract: Recent advances in operations research and machine learning have revived interest in solving complex real-world, large-size traffic control problems. With the increasing availability of road sensor data, deterministic parametric models have proved inadequate in describing the variability of real-world data, especially in congested area of the density-flow diagram. In this paper we estimate the stochastic density-flow relation introducing a nonparametric method called convex quantile regression. The proposed method does not depend on any prior functional form assumptions, but thanks to the concavity constraints, the estimated function satisfies the theoretical properties of the density-flow curve. The second contribution is to develop the new convex quantile regression with bags (CQRb) approach to facilitate practical implementation of CQR to the real-world data. 
We illustrate the CQRb estimation process using the road sensor data from Finland in years 2016-2018. Our third contribution is to demonstrate the excellent out-of-sample predictive power of the proposed CQRb method in comparison to the standard parametric deterministic approach."}, "https://arxiv.org/abs/2309.01889": {"title": "The Local Projection Residual Bootstrap for AR(1) Models", "link": "https://arxiv.org/abs/2309.01889", "description": "arXiv:2309.01889v3 Announce Type: replace \nAbstract: This paper proposes a local projection residual bootstrap method to construct confidence intervals for impulse response coefficients of AR(1) models. Our bootstrap method is based on the local projection (LP) approach and involves a residual bootstrap procedure applied to AR(1) models. We present theoretical results for our bootstrap method and proposed confidence intervals. First, we prove the uniform consistency of the LP-residual bootstrap over a large class of AR(1) models that allow for a unit root. Then, we prove the asymptotic validity of our confidence intervals over the same class of AR(1) models. Finally, we show that the LP-residual bootstrap provides asymptotic refinements for confidence intervals on a restricted class of AR(1) models relative to those required for the uniform consistency of our bootstrap."}, "https://arxiv.org/abs/2312.00501": {"title": "Cautionary Tales on Synthetic Controls in Survival Analyses", "link": "https://arxiv.org/abs/2312.00501", "description": "arXiv:2312.00501v2 Announce Type: replace \nAbstract: Synthetic control (SC) methods have gained rapid popularity in economics recently, where they have been applied in the context of inferring the effects of treatments on standard continuous outcomes assuming linear input-output relations. In medical applications, conversely, survival outcomes are often of primary interest, a setup in which both commonly assumed data-generating processes (DGPs) and target parameters are different. In this paper, we therefore investigate whether and when SCs could serve as an alternative to matching methods in survival analyses. We find that, because SCs rely on a linearity assumption, they will generally be biased for the true expected survival time in commonly assumed survival DGPs -- even when taking into account the possibility of linearity on another scale as in accelerated failure time models. Additionally, we find that, because SC units follow distributions with lower variance than real control units, summaries of their distributions, such as survival curves, will be biased for the parameters of interest in many survival analyses. Nonetheless, we also highlight that using SCs can still improve upon matching whenever the biases described above are outweighed by extrapolation biases exhibited by imperfect matches, and investigate the use of regularization to trade off the shortcomings of both approaches."}, "https://arxiv.org/abs/2312.12966": {"title": "Rank-based Bayesian clustering via covariate-informed Mallows mixtures", "link": "https://arxiv.org/abs/2312.12966", "description": "arXiv:2312.12966v2 Announce Type: replace \nAbstract: Data in the form of rankings, ratings, pair comparisons or clicks are frequently collected in diverse fields, from marketing to politics, to understand assessors' individual preferences. Combining such preference data with features associated with the assessors can lead to a better understanding of the assessors' behaviors and choices. 
The Mallows model is a popular model for rankings, as it flexibly adapts to different types of preference data, and the previously proposed Bayesian Mallows Model (BMM) offers a computationally efficient framework for Bayesian inference, also allowing capturing the users' heterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite mixture model that performs clustering while also accounting for assessor-related features, called the Bayesian Mallows model with covariates (BMMx). BMMx is based on a similarity function that a priori favours the aggregation of assessors into a cluster when their covariates are similar, using the Product Partition models (PPMx) proposal. We present two approaches to measure the covariate similarity: one based on a novel deterministic function measuring the covariates' goodness-of-fit to the cluster, and one based on an augmented model as in PPMx. We investigate the performance of BMMx in both simulation experiments and real-data examples, showing the method's potential for advancing the understanding of assessor preferences and behaviors in different applications."}, "https://arxiv.org/abs/2401.04200": {"title": "Teacher bias or measurement error?", "link": "https://arxiv.org/abs/2401.04200", "description": "arXiv:2401.04200v2 Announce Type: replace \nAbstract: In many countries, teachers' track recommendations are used to allocate students to secondary school tracks. Previous studies have shown that students from families with low socioeconomic status (SES) receive lower track recommendations than their peers from high SES families, conditional on standardized test scores. It is often argued that this indicates teacher bias. However, this claim is invalid in the presence of measurement error in test scores. We discuss how measurement error in test scores generates a biased coefficient of the conditional SES gap, and consider three empirical strategies to address this bias. Using administrative data from the Netherlands, we find that measurement error explains 35 to 43% of the conditional SES gap in track recommendations."}, "https://arxiv.org/abs/2401.04512": {"title": "Robust Bayesian Method for Refutable Models", "link": "https://arxiv.org/abs/2401.04512", "description": "arXiv:2401.04512v2 Announce Type: replace \nAbstract: We propose a robust Bayesian method for economic models that can be rejected under some data distributions. The econometrician starts with a structural assumption which can be written as the intersection of several assumptions, and the joint assumption is refutable. To avoid the model rejection, the econometrician first takes a stance on which assumption $j$ is likely to be violated and considers a measurement of the degree of violation of this assumption $j$. She then considers a (marginal) prior belief on the degree of violation $(\\pi_{m_j})$: She considers a class of prior distributions $\\pi_s$ on all economic structures such that all $\\pi_s$ have the same marginal distribution $\\pi_m$. Compared to the standard nonparametric Bayesian method that puts a single prior on all economic structures, the robust Bayesian method imposes a single marginal prior distribution on the degree of violation. As a result, the robust Bayesian method allows the econometrician to take a stance only on the likeliness of violation of assumption $j$. Compared to the frequentist approach to relax the refutable assumption, the robust Bayesian method is transparent on the econometrician's stance of choosing models. 
We also show that many frequentists' ways to relax the refutable assumption can be shown to be equivalent to particular choices of robust Bayesian prior classes. We use the local average treatment effect (LATE) in the potential outcome framework as the leading illustrative example."}, "https://arxiv.org/abs/2401.14593": {"title": "Robust Estimation of Pareto's Scale Parameter from Grouped Data", "link": "https://arxiv.org/abs/2401.14593", "description": "arXiv:2401.14593v2 Announce Type: replace \nAbstract: Numerous robust estimators exist as alternatives to the maximum likelihood estimator (MLE) when a completely observed ground-up loss severity sample dataset is available. However, the options for robust alternatives to MLE become significantly limited when dealing with grouped loss severity data, with only a handful of methods like least squares, minimum Hellinger distance, and optimal bounded influence function available. This paper introduces a novel robust estimation technique, the Method of Truncated Moments (MTuM), specifically designed to estimate the tail index of a Pareto distribution from grouped data. Inferential justification of MTuM is established by employing the central limit theorem and validating it through a comprehensive simulation study."}, "https://arxiv.org/abs/1906.01741": {"title": "Fr\\'echet random forests for metric space valued regression with non euclidean predictors", "link": "https://arxiv.org/abs/1906.01741", "description": "arXiv:1906.01741v3 Announce Type: replace-cross \nAbstract: Random forests are a statistical learning method widely used in many areas of scientific research because of their ability to learn complex relationships between input and output variables and their capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fr\\'echet trees and Fr\\'echet random forests, which make it possible to handle data for which input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. Then, the random forest out-of-bag error and variable importance score are naturally adapted. A consistency theorem for the Fr\\'echet regressogram predictor using data-driven partitions is given and applied to Fr\\'echet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, one real dataset about air quality is used to illustrate the use of the proposed method in practice."}, "https://arxiv.org/abs/2302.07930": {"title": "Interpretable Deep Learning Methods for Multiview Learning", "link": "https://arxiv.org/abs/2302.07930", "description": "arXiv:2302.07930v2 Announce Type: replace-cross \nAbstract: Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines the flexibility of deep learning with the statistical benefits of data- and knowledge-driven feature selection, giving interpretable results.
Deep neural networks are used to learn a view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging the selection of related variables. iDeepViewLearn is tested on simulated data and two real-world datasets, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning."}, "https://arxiv.org/abs/2311.18048": {"title": "An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis", "link": "https://arxiv.org/abs/2311.18048", "description": "arXiv:2311.18048v2 Announce Type: replace-cross \nAbstract: We investigate the relationship between system identification and intervention design in dynamical systems. While previous research demonstrated how identifiable representation learning methods, such as Independent Component Analysis (ICA), can reveal cause-effect relationships, it relied on a passive perspective without considering how to collect data. Our work shows that in Gaussian Linear Time-Invariant (LTI) systems, the system parameters can be identified by introducing diverse intervention signals in a multi-environment setting. By harnessing appropriate diversity assumptions motivated by the ICA literature, our findings connect experiment design and representational identifiability in dynamical systems. We corroborate our findings on synthetic and (simulated) physical data. Additionally, we show that Hidden Markov Models, in general, and (Gaussian) LTI systems, in particular, fulfil a generalization of the Causal de Finetti theorem with continuous parameters."}, "https://arxiv.org/abs/2312.02867": {"title": "Semi-Supervised Health Index Monitoring with Feature Generation and Fusion", "link": "https://arxiv.org/abs/2312.02867", "description": "arXiv:2312.02867v2 Announce Type: replace-cross \nAbstract: The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to-failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as a condition indicator to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend.
Validation on the PHME 2010 milling dataset, a recognized benchmark with ground-truth HIs, demonstrates meaningful HI estimation. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground-truth HI labels is unfeasible."}, "https://arxiv.org/abs/2402.11020": {"title": "Proximal Causal Inference for Conditional Separable Effects", "link": "https://arxiv.org/abs/2402.11020", "description": "arXiv:2402.11020v1 Announce Type: new \nAbstract: Scientists often pose questions about treatment effects on outcomes conditional on a post-treatment event. However, defining, identifying, and estimating causal effects conditional on post-treatment events requires care, even in perfectly executed randomized experiments. Recently, the conditional separable effect (CSE) was proposed as an interventionist estimand, corresponding to scientifically meaningful questions in these settings. However, although the CSE is a single-world estimand that can be queried experimentally, existing identification results for it require no unmeasured confounding between the outcome and the post-treatment event. This assumption can be violated in many applications. In this work, we address this concern by developing new identification and estimation results for the CSE in the presence of unmeasured confounding. We establish nonparametric identification of the CSE in both observational and experimental settings when certain proxy variables are available for hidden common causes of the post-treatment event and outcome. We characterize the efficient influence function for the CSE under a semiparametric model of the observed data law in which nuisance functions are a priori unrestricted. Moreover, we develop a consistent, asymptotically linear, and locally semiparametric efficient estimator of the CSE using modern machine learning theory. We illustrate our framework with simulation studies and a real-world cancer therapy trial."}, "https://arxiv.org/abs/2402.11052": {"title": "Building Trees for Probabilistic Prediction via Scoring Rules", "link": "https://arxiv.org/abs/2402.11052", "description": "arXiv:2402.11052v1 Announce Type: new \nAbstract: Decision trees built with data remain in widespread use for nonparametric prediction. Predicting probability distributions is preferred over point predictions when uncertainty plays a prominent role in analysis and decision-making. We study modifying a tree to produce nonparametric predictive distributions. We find that the standard method for building trees may not result in good predictive distributions and propose changing the splitting criterion for trees to one based on proper scoring rules. Analysis of both simulated data and several real datasets demonstrates that using these new splitting criteria results in trees with improved predictive properties considering the entire predictive distribution."}, "https://arxiv.org/abs/2402.11070": {"title": "Scalable Analysis of Bipartite Experiments", "link": "https://arxiv.org/abs/2402.11070", "description": "arXiv:2402.11070v1 Announce Type: new \nAbstract: Bipartite Experiments are randomized experiments where the treatment is applied to a set of units (randomization units) that is different from the units of analysis, and randomization units and analysis units are connected through a bipartite graph.
The scale of experimentation at large online platforms necessitates both accurate inference in the presence of a large bipartite interference graph, as well as a highly scalable implementation. In this paper, we describe new methods for inference that enable practical, scalable analysis of bipartite experiments: (1) We propose CA-ERL, a covariate-adjusted variant of the exposure-reweighted-linear (ERL) estimator [9], which empirically yields 60-90% variance reduction. (2) We introduce a randomization-based method for inference and prove asymptotic validity of a Wald-type confidence interval under graph sparsity assumptions. (3) We present a linear-time algorithm for randomization inference of the CA-ERL estimator, which can be easily implemented in query engines like Presto or Spark. We evaluate our methods both on a real experiment at Meta that randomized treatment on Facebook Groups and analyzed user-level metrics, as well as simulations on synthetic data. The real-world data shows that our CA-ERL estimator reduces the confidence interval (CI) width by 60-90% (compared to ERL) in a practical setting. The simulations using synthetic data show that our randomization inference procedure achieves correct coverage across instances, while the ERL estimator has incorrectly small CI widths for instances with large true effect sizes and is overly conservative when the bipartite graph is dense."}, "https://arxiv.org/abs/2402.11092": {"title": "Adaptive Weight Learning for Multiple Outcome Optimization With Continuous Treatment", "link": "https://arxiv.org/abs/2402.11092", "description": "arXiv:2402.11092v1 Announce Type: new \nAbstract: To promote precision medicine, individualized treatment regimes (ITRs) are crucial for optimizing the expected clinical outcome based on patient-specific characteristics. However, existing ITR research has primarily focused on scenarios with categorical treatment options and a single outcome. In reality, clinicians often encounter scenarios with continuous treatment options and multiple, potentially competing outcomes, such as medicine efficacy and unavoidable toxicity. To balance these outcomes, a proper weight is necessary, which should be learned in a data-driven manner that considers both patient preference and clinician expertise. In this paper, we present a novel algorithm for developing individualized treatment regimes (ITRs) that incorporate continuous treatment options and multiple outcomes, utilizing observational data. Our approach assumes that clinicians are optimizing individualized patient utilities with sub-optimal treatment decisions that are at least better than random assignment. Treatment assignment is assumed to directly depend on the true underlying utility of the treatment rather than patient characteristics. The proposed method simultaneously estimates the weighting of composite outcomes and the decision-making process, allowing for construction of individualized treatment regimes with continuous doses. The proposed estimators can be used for inference and variable selection, facilitating the identification of informative treatment assignments and preference-associated variables. 
We evaluate the finite sample performance of our proposed method via simulation studies and apply it to a real-data application in radiation oncology."}, "https://arxiv.org/abs/2402.11133": {"title": "Two-Sample Hypothesis Testing for Large Random Graphs of Unequal Size", "link": "https://arxiv.org/abs/2402.11133", "description": "arXiv:2402.11133v1 Announce Type: new \nAbstract: Two-sample hypothesis testing for large graphs is popular in cognitive science, probabilistic machine learning and artificial intelligence. While numerous methods have been proposed in the literature to address this problem, less attention has been devoted to scenarios involving graphs of unequal size or situations where there are only one or a few samples of graphs. In this article, we propose a Frobenius test statistic tailored for small sample sizes and unequal-sized random graphs to test whether they are generated from the same model or not. Our approach involves an algorithm for generating bootstrapped adjacency matrices from estimated community-wise edge probability matrices, forming the basis of the Frobenius test statistic. We derive the asymptotic distribution of the proposed test statistic and validate its stability and efficiency in detecting minor differences in underlying models through simulations. Furthermore, we explore its application to fMRI data, where we are able to distinguish brain activity patterns when subjects are exposed to two different stimuli (sentences and pictures) and in the control group."}, "https://arxiv.org/abs/2402.11292": {"title": "Semi-functional partial linear regression with measurement error: An approach based on $k$NN estimation", "link": "https://arxiv.org/abs/2402.11292", "description": "arXiv:2402.11292v1 Announce Type: new \nAbstract: This paper focuses on a semiparametric regression model in which the response variable is explained by the sum of two components. One of them is parametric (linear); the corresponding explanatory variable is measured with additive error and its dimension is finite ($p$). The other component models, in a nonparametric way, the effect of a functional variable (infinite dimension) on the response. $k$-NN based estimators are proposed for each component, and some asymptotic results are obtained. A simulation study illustrates the behaviour of such estimators for finite sample sizes, while an application to real data shows the usefulness of our proposal."}, "https://arxiv.org/abs/2402.11336": {"title": "Conditionally Affinely Invariant Rerandomization and its Admissibility", "link": "https://arxiv.org/abs/2402.11336", "description": "arXiv:2402.11336v1 Announce Type: new \nAbstract: Rerandomization utilizes modern computing ability to search for experimental designs with improved covariate balance while adhering to the randomization principle originally advocated by RA Fisher. Conditionally affinely invariant rerandomization has the ``Equal Percent Variance Reducing'' property on subsets of conditionally ellipsoidally symmetric covariates. It is suitable for dealing with covariates of varying importance or mixed types and usually produces multiple balance scores. ``Unified'' and ``intersection'' methods are common ways of deciding on multiple scores. In general, ``intersection'' methods are computationally more efficient but asymptotically inadmissible.
As computational cost is not a major concern in experimental design, we recommend ``unified'' methods to build admissible criteria for rerandomization."}, "https://arxiv.org/abs/2402.11341": {"title": "Between- and Within-Cluster Spearman Rank Correlations", "link": "https://arxiv.org/abs/2402.11341", "description": "arXiv:2402.11341v1 Announce Type: new \nAbstract: Clustered data are common in practice. Clustering arises when subjects are measured repeatedly, or subjects are nested in groups (e.g., households, schools). It is often of interest to evaluate the correlation between two variables with clustered data. There are three commonly used Pearson correlation coefficients (total, between-, and within-cluster), which together provide an enriched perspective of the correlation. However, these Pearson correlation coefficients are sensitive to extreme values and skewed distributions. They also depend on the scale of the data and are not applicable to ordered categorical data. Current non-parametric measures for clustered data are only for the total correlation. Here we define population parameters for the between- and within-cluster Spearman rank correlations. The definitions are natural extensions of the Pearson between- and within-cluster correlations to the rank scale. We show that the total Spearman rank correlation approximates a weighted sum of the between- and within-cluster Spearman rank correlations, where the weights are functions of rank intraclass correlations of the two random variables. We also discuss the equivalence between the within-cluster Spearman rank correlation and the covariate-adjusted partial Spearman rank correlation. Furthermore, we describe estimation and inference for the three Spearman rank correlations, conduct simulations to evaluate the performance of our estimators, and illustrate their use with data from a longitudinal biomarker study and a clustered randomized trial."}, "https://arxiv.org/abs/2402.11425": {"title": "Online Local False Discovery Rate Control: A Resource Allocation Approach", "link": "https://arxiv.org/abs/2402.11425", "description": "arXiv:2402.11425v1 Announce Type: new \nAbstract: We consider the problem of online local false discovery rate (FDR) control where multiple tests are conducted sequentially, with the goal of maximizing the total expected number of discoveries. We formulate the problem as an online resource allocation problem with accept/reject decisions, which from a high level can be viewed as an online knapsack problem, with the additional uncertainty of random budget replenishment. We start with general arrival distributions and propose a simple policy that achieves an $O(\\sqrt{T})$ regret. We complement the result by showing that such a regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, although they achieve bounded loss in canonical settings, may incur an $\\Omega(\\sqrt{T})$ or even an $\\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over-accept arrivals, we propose a novel policy that incorporates budget buffers. We show that small additional logarithmic buffers suffice to reduce the regret from $\\Omega(\\sqrt{T})$ or even $\\Omega(T)$ to $O(\\ln^2 T)$. Numerical experiments are conducted to validate our theoretical findings.
Our formulation may have wider applications beyond the problem considered in this paper, and our results emphasize how effective policies should be designed to reach a balance between avoiding wrong acceptances and reducing wrong rejections in online resource allocation problems with uncertain budgets."}, "https://arxiv.org/abs/2402.11466": {"title": "Nonparametric assessment of regimen response curve estimators", "link": "https://arxiv.org/abs/2402.11466", "description": "arXiv:2402.11466v1 Announce Type: new \nAbstract: Marginal structural models have been widely used in causal inference to estimate mean outcomes under either a static or a prespecified set of treatment decision rules. This approach requires imposing a working model for the mean outcome given a sequence of treatments and possibly baseline covariates. In this paper, we introduce a dynamic marginal structural model that can be used to estimate an optimal decision rule within a class of parametric rules. Specifically, we will estimate the mean outcome as a function of the parameters in the class of decision rules, referred to as a regimen-response curve. In general, misspecification of the working model may lead to a biased estimate with questionable causal interpretability. To mitigate this issue, we will leverage risk to assess \"goodness-of-fit\" of the imposed working model. We consider the counterfactual risk as our target parameter and derive inverse probability weighting and canonical gradients to map it to the observed data. We provide asymptotic properties of the resulting risk estimators, considering both fixed and data-dependent target parameters. We will show that the inverse probability weighting estimator can be efficient and asymptotically linear when the weight functions are estimated using a sieve-based estimator. The proposed method is implemented on the LS1 study to estimate a regimen-response curve for patients with Parkinson's disease."}, "https://arxiv.org/abs/2402.11563": {"title": "A Gibbs Sampling Scheme for a Generalised Poisson-Kingman Class", "link": "https://arxiv.org/abs/2402.11563", "description": "arXiv:2402.11563v1 Announce Type: new \nAbstract: A Bayesian nonparametric method of James, Lijoi \\& Prunster (2009) used to predict future values of observations from normalized random measures with independent increments is modified to a class of models based on negative binomial processes for which the increments are not independent, but are independent conditional on an underlying gamma variable. As in James et al., the new algorithm is formulated in terms of two variables, one a function of the past observations, and the other an updating by means of a new observation. We outline an application of the procedure to population genetics, for the construction of realisations of genealogical trees and coalescents from samples of alleles."}, "https://arxiv.org/abs/2402.11609": {"title": "Risk-aware product decisions in A/B tests with multiple metrics", "link": "https://arxiv.org/abs/2402.11609", "description": "arXiv:2402.11609v1 Announce Type: new \nAbstract: In the past decade, AB tests have become the standard method for making product decisions in tech companies. They offer a scientific approach to product development, using statistical hypothesis testing to control the risks of incorrect decisions. Typically, multiple metrics are used in AB tests to serve different purposes, such as establishing evidence of success, guarding against regressions, or verifying test validity.
To mitigate risks in AB tests with multiple outcomes, it is crucial to adapt the design and analysis to the varied roles of these outcomes. This paper introduces the theoretical framework for decision rules guiding the evaluation of experiments at Spotify. First, we show that if guardrail metrics with non-inferiority tests are used, the significance level does not need to be multiplicity-adjusted for those tests. Second, if the decision rule includes non-inferiority tests, deterioration tests, or tests for quality, the type II error rate must be corrected to guarantee the desired power level for the decision. We propose a decision rule encompassing success, guardrail, deterioration, and quality metrics, employing diverse tests. This is accompanied by a design and analysis plan that mitigates risks across any data-generating process. The theoretical results are demonstrated using Monte Carlo simulations."}, "https://arxiv.org/abs/2402.11640": {"title": "Protocols for Observational Studies: An Application to Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2402.11640", "description": "arXiv:2402.11640v1 Announce Type: new \nAbstract: In his 2022 IMS Medallion Lecture delivered at the Joint Statistical Meetings, Prof. Dylan S. Small eloquently advocated for the use of protocols in observational studies. We discuss his proposal and, inspired by his ideas, we develop a protocol for the regression discontinuity design."}, "https://arxiv.org/abs/2402.11652": {"title": "Doubly Robust Inference in Causal Latent Factor Models", "link": "https://arxiv.org/abs/2402.11652", "description": "arXiv:2402.11652v1 Announce Type: new \nAbstract: This article introduces a new framework for estimating average treatment effects under unobserved confounding in modern data-rich environments featuring large numbers of units and outcomes. The proposed estimator is doubly robust, combining outcome imputation, inverse probability weighting, and a novel cross-fitting procedure for matrix completion. We derive finite-sample and asymptotic guarantees, and show that the error of the new estimator converges to a mean-zero Gaussian distribution at a parametric rate. Simulation results demonstrate the practical relevance of the formal properties of the estimators analyzed in this article."}, "https://arxiv.org/abs/2402.11659": {"title": "Credible causal inference beyond toy models", "link": "https://arxiv.org/abs/2402.11659", "description": "arXiv:2402.11659v1 Announce Type: new \nAbstract: Causal inference with observational data critically relies on untestable and extra-statistical assumptions that have (sometimes) testable implications. Well-known sets of assumptions that are sufficient to justify the causal interpretation of certain estimators are called identification strategies. These templates for causal analysis, however, do not perfectly map into empirical research practice. Researchers are often left with the dilemma of either abstracting away from their particular setting to fit the templates, risking erroneous inferences, or avoiding situations in which the templates cannot be applied, missing valuable opportunities for conducting empirical analysis. In this article, I show how directed acyclic graphs (DAGs) can help researchers to conduct empirical research and assess the quality of evidence without excessively relying on research templates. First, I offer a concise introduction to causal inference frameworks.
Second, I survey the arguments in the methodological literature in favor of using research templates, while either avoiding or limiting the use of causal graphical models. Third, I discuss the problems with the template model, arguing for a more flexible approach to DAGs that helps illuminate common problems in empirical settings and improve the credibility of causal claims. I demonstrate this approach in a series of worked examples, showing the gap between identification strategies as invoked by researchers and their actual applications. Finally, I conclude by highlighting the benefits that routinely incorporating causal graphical models in our scientific discussions would have in terms of transparency, testability, and generativity."}, "https://arxiv.org/abs/2402.11936": {"title": "Relative Jump Distance: a diagnostic for Nested Sampling", "link": "https://arxiv.org/abs/2402.11936", "description": "arXiv:2402.11936v1 Announce Type: new \nAbstract: Nested sampling is widely used in astrophysics for reliably inferring model parameters and comparing models within a Bayesian framework. To address models with many parameters, Markov Chain Monte Carlo (MCMC) random walks are incorporated within nested sampling to advance a live point population. Diagnostic tools for nested sampling are crucial to ensure the reliability of astrophysical conclusions. We develop a diagnostic to identify problematic random walks that fail to meet the requirements of nested sampling. The distance from the start to the end of the random walk, the jump distance, is divided by the typical neighbor distance between live points, computed robustly with the MLFriends algorithm, to obtain a relative jump distance (RJD). We propose the geometric mean RJD and the fraction of RJD>1 as new summary diagnostics. Relative jump distances are investigated with mock and real-world inference applications, including inferring the distance to gravitational wave event GW170817. Problematic nested sampling runs are identified based on significant differences from reruns with much longer MCMC chains. These consistently exhibit low average RJDs and f(RJD>1) values below 50 percent. The RJD is more sensitive than previous tests based on the live point insertion order. The RJD is thus proposed as a widely applicable diagnostic to verify inference with nested sampling. It is implemented in the UltraNest package in version 4.1."}, "https://arxiv.org/abs/2402.12009": {"title": "Moduli of Continuity in Metric Models and Extension of Liveability Indices", "link": "https://arxiv.org/abs/2402.12009", "description": "arXiv:2402.12009v1 Announce Type: new \nAbstract: Index spaces serve as valuable metric models for studying properties relevant to various applications, such as social science or economics. These properties are represented by real Lipschitz functions that describe the degree of association with each element within the underlying metric space. After determining the index value within a given sample subset, the classic McShane and Whitney formulas allow a Lipschitz regression procedure to be performed to extend the index values over the entire metric space. To improve the adaptability of the metric model to specific scenarios, this paper introduces the concept of a composition metric, which involves composing a metric with an increasing, positive and subadditive function $\\phi$. The results presented here extend well-established results for Lipschitz indices on metric spaces to composition metrics.
In addition, we establish the corresponding approximation properties that facilitate the use of this functional structure. To illustrate the power and simplicity of this mathematical framework, we provide a concrete application involving the modelling of livability indices in North American cities."}, "https://arxiv.org/abs/2402.12083": {"title": "TrialEmulation: An R Package to Emulate Target Trials for Causal Analysis of Observational Time-to-event Data", "link": "https://arxiv.org/abs/2402.12083", "description": "arXiv:2402.12083v1 Announce Type: new \nAbstract: Randomised controlled trials (RCTs) are regarded as the gold standard for estimating causal treatment effects on health outcomes. However, RCTs are not always feasible, because of time, budget or ethical constraints. Observational data such as those from electronic health records (EHRs) offer an alternative way to estimate the causal effects of treatments. Recently, the `target trial emulation' framework was proposed by Hernan and Robins (2016) to provide a formal structure for estimating causal treatment effects from observational data. To promote more widespread implementation of target trial emulation in practice, we develop the R package TrialEmulation to emulate a sequence of target trials using observational time-to-event data, where individuals who start to receive treatment and those who have not been on the treatment at the baseline of the emulated trials are compared in terms of their risks of an outcome event. Specifically, TrialEmulation provides (1) data preparation for emulating a sequence of target trials, (2) calculation of the inverse probability of treatment and censoring weights to handle treatment switching and dependent censoring, (3) fitting of marginal structural models for the time-to-event outcome given baseline covariates, (4) estimation and inference of marginal intention to treat and per-protocol effects of the treatment in terms of marginal risk differences between treated and untreated for a user-specified target trial population. In particular, TrialEmulation can accommodate large data sets (e.g., from EHRs) within memory constraints of R by processing data in chunks and applying case-control sampling. We demonstrate the functionality of TrialEmulation using a simulated data set that mimics typical observational time-to-event data in practice."}, "https://arxiv.org/abs/2402.12171": {"title": "A frequentist test of proportional colocalization after selecting relevant genetic variants", "link": "https://arxiv.org/abs/2402.12171", "description": "arXiv:2402.12171v1 Announce Type: new \nAbstract: Colocalization analyses assess whether two traits are affected by the same or distinct causal genetic variants in a single gene region. A class of Bayesian colocalization tests are now routinely used in practice; for example, for genetic analyses in drug development pipelines. In this work, we consider an alternative frequentist approach to colocalization testing that examines the proportionality of genetic associations with each trait. The proportional colocalization approach uses markedly different assumptions to Bayesian colocalization tests, and therefore can provide valuable complementary evidence in cases where Bayesian colocalization results are inconclusive or sensitive to priors. We propose a novel conditional test of proportional colocalization, prop-coloc-cond, that aims to account for the uncertainty in variant selection, in order to recover accurate type I error control. 
The test can be implemented straightforwardly, requiring only summary data on genetic associations. Simulation evidence and an empirical investigation into GLP1R gene expression demonstrates how tests of proportional colocalization can offer important insights in conjunction with Bayesian colocalization tests."}, "https://arxiv.org/abs/2402.12323": {"title": "Expressing and visualizing model uncertainty in Bayesian variable selection using Cartesian credible sets", "link": "https://arxiv.org/abs/2402.12323", "description": "arXiv:2402.12323v1 Announce Type: new \nAbstract: Modern regression applications can involve hundreds or thousands of variables which motivates the use of variable selection methods. Bayesian variable selection defines a posterior distribution on the possible subsets of the variables (which are usually termed models) to express uncertainty about which variables are strongly linked to the response. This can be used to provide Bayesian model averaged predictions or inference, and to understand the relative importance of different variables. However, there has been little work on meaningful representations of this uncertainty beyond first order summaries. We introduce Cartesian credible sets to address this gap. The elements of these sets are formed by concatenating sub-models defined on each block of a partition of the variables. Investigating these sub-models allow us to understand whether the models in the Cartesian credible set always/never/sometimes include a particular variable or group of variables and provide a useful summary of model uncertainty. We introduce methods to find these sets that emphasize ease of understanding. The potential of the method is illustrated on regression problems with both small and large numbers of variables."}, "https://arxiv.org/abs/2402.12349": {"title": "An optimal replacement policy under variable shocks and self-healing patterns", "link": "https://arxiv.org/abs/2402.12349", "description": "arXiv:2402.12349v1 Announce Type: new \nAbstract: We study a system that experiences damaging external shocks at stochastic intervals, continuous degradation, and self-healing. The motivation for such a system comes from real-life applications based on micro-electro-mechanical systems (MEMS). The system fails if the cumulative damage exceeds a time-dependent threshold. We develop a preventive maintenance policy to replace the system such that its lifetime is prudently utilized. Further, three variations on the healing pattern have been considered: (i) shocks heal for a fixed duration $\\tau$; (ii) a fixed proportion of shocks are non-healable (that is, $\\tau=0$); (iii) there are two types of shocks -- self healable shocks heal for a finite duration, and nonhealable shocks inflict a random system degradation. We implement a proposed preventive maintenance policy and compare the optimal replacement times in these new cases to that of the original case where all shocks heal indefinitely and thereby enable the system manager to take necessary decisions in generalized system set-ups."}, "https://arxiv.org/abs/2306.11380": {"title": "A Bayesian Take on Gaussian Process Networks", "link": "https://arxiv.org/abs/2306.11380", "description": "arXiv:2306.11380v4 Announce Type: cross \nAbstract: Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. 
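To make the Gaussian Process Network construction in the "A Bayesian Take on Gaussian Process Networks" entry concrete, here is a minimal forward-sampling sketch on a known toy DAG: each node is drawn as a Gaussian-process function of its parents plus independent noise. The DAG, RBF kernel, and noise level are arbitrary illustrative choices, not the paper's implementation.

```python
import numpy as np

# Toy Gaussian Process Network (GPN) sampler on a known DAG: each node is a
# GP draw over its parents' values plus independent Gaussian noise.

def rbf_kernel(X, lengthscale=1.0, variance=1.0, jitter=1e-8):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2) + jitter * np.eye(len(X))

def sample_gpn(n, dag, noise_sd=0.1, rng=None):
    """dag maps each node to its list of parents, given in topological order."""
    rng = rng or np.random.default_rng(0)
    data = {}
    for node, parents in dag.items():
        if not parents:                        # root node: standard normal
            data[node] = rng.normal(size=n)
        else:                                  # GP draw indexed by the parents' values
            X = np.column_stack([data[p] for p in parents])
            f = rng.multivariate_normal(np.zeros(n), rbf_kernel(X))
            data[node] = f + noise_sd * rng.normal(size=n)
    return data

# X1 -> X2 -> X3, with X1 also a direct parent of X3
draws = sample_gpn(200, {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]})
print({k: v[:3].round(2) for k, v in draws.items()})
```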
The model allows the description of continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution."}, "https://arxiv.org/abs/2402.00623": {"title": "Bayesian Causal Inference with Gaussian Process Networks", "link": "https://arxiv.org/abs/2402.00623", "description": "arXiv:2402.00623v1 Announce Type: cross \nAbstract: Causal discovery and inference from observational data is an essential problem in statistics posing both modeling and computational challenges. These are typically addressed by imposing strict assumptions on the joint distribution such as linearity. We consider the problem of the Bayesian estimation of the effects of hypothetical interventions in the Gaussian Process Network (GPN) model, a flexible causal framework which allows describing the causal relationships nonparametrically. We detail how to perform causal inference on GPNs by simulating the effect of an intervention across the whole network and propagating the effect of the intervention on downstream variables. We further derive a simpler computational approximation by estimating the intervention distribution as a function of local variables only, modeling the conditional distributions via additive Gaussian processes. We extend both frameworks beyond the case of a known causal graph, incorporating uncertainty about the causal structure via Markov chain Monte Carlo methods. Simulation studies show that our approach is able to identify the effects of hypothetical interventions with non-Gaussian, non-linear observational data and accurately reflect the posterior uncertainty of the causal estimates. Finally we compare the results of our GPN-based causal inference approach to existing methods on a dataset of $A.~thaliana$ gene expressions."}, "https://arxiv.org/abs/2402.10982": {"title": "mshw, a forecasting library to predict short-term electricity demand based on multiple seasonal Holt-Winters", "link": "https://arxiv.org/abs/2402.10982", "description": "arXiv:2402.10982v1 Announce Type: cross \nAbstract: Transmission system operators have a growing need for more accurate forecasting of electricity demand. Current electricity systems largely require demand forecasting so that the electricity market establishes electricity prices as well as the programming of production units. The companies that are part of the electrical system use exclusive software to obtain predictions, based on the use of time series and prediction tools, whether statistical or artificial intelligence. However, the most common form of prediction is based on hybrid models that use both technologies. In any case, it is software with a complicated structure, with a large number of associated variables and that requires a high computational load to make predictions. 
The predictions they offer are not much better than those of simpler models. In this paper we present a MATLAB toolbox created for the prediction of electrical demand. The toolbox implements multiple seasonal Holt-Winters exponential smoothing models and neural network models. The models used include discrete interval mobile seasonalities (DIMS) to improve forecasting on special days. Additionally, results from its application to various electrical systems in Europe are shown. The use of this library opens a new avenue of research for the use of models with discrete and complex seasonalities in other fields of application."}, "https://arxiv.org/abs/2402.11134": {"title": "Functional Partial Least-Squares: Optimal Rates and Adaptation", "link": "https://arxiv.org/abs/2402.11134", "description": "arXiv:2402.11134v1 Announce Type: cross \nAbstract: We consider the functional linear regression model with a scalar response and a Hilbert space-valued predictor, a well-known ill-posed inverse problem. We propose a new formulation of the functional partial least-squares (PLS) estimator related to the conjugate gradient method. We shall show that the estimator achieves the (nearly) optimal convergence rate on a class of ellipsoids and we introduce an early stopping rule which adapts to the unknown degree of ill-posedness. Some theoretical and simulation comparison between the estimator and the principal component regression estimator is provided."}, "https://arxiv.org/abs/2402.11219": {"title": "Estimators for multivariate allometric regression model", "link": "https://arxiv.org/abs/2402.11219", "description": "arXiv:2402.11219v1 Announce Type: cross \nAbstract: In a regression model with multiple response variables and multiple explanatory variables, if the difference of the mean vectors of the response variables for different values of explanatory variables is always in the direction of the first principal eigenvector of the covariance matrix of the response variables, then it is called a multivariate allometric regression model. This paper studies the estimation of the first principal eigenvector in the multivariate allometric regression model. A class of estimators that includes conventional estimators is proposed based on weighted sum-of-squares matrices of regression sum-of-squares matrix and residual sum-of-squares matrix. We establish an upper bound of the mean squared error of the estimators contained in this class, and the weight value minimizing the upper bound is derived. Sufficient conditions for the consistency of the estimators are discussed in weak identifiability regimes under which the difference of the largest and second largest eigenvalues of the covariance matrix decays asymptotically and in ``large $p$, large $n$\" regimes, where $p$ is the number of response variables and $n$ is the sample size. Several numerical results are also presented."}, "https://arxiv.org/abs/2402.11228": {"title": "Adaptive Split Balancing for Optimal Random Forest", "link": "https://arxiv.org/abs/2402.11228", "description": "arXiv:2402.11228v1 Announce Type: cross \nAbstract: While random forests are commonly used for regression problems, existing methods often lack adaptability in complex situations or lose optimality under simple, smooth scenarios.
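The following sketch illustrates the estimator class described in the "Estimators for multivariate allometric regression model" entry: a weighted combination of the regression and residual sum-of-squares matrices whose leading eigenvector estimates the first principal eigenvector. The weight `w` is left as a free parameter here; the paper derives a weight minimizing an MSE bound, which this sketch does not implement.

```python
import numpy as np

# Sketch: leading-eigenvector estimator built from a weighted combination of
# the regression and residual sum-of-squares matrices (weight w is a placeholder).

def leading_eigvec_estimator(Y, X, w=0.5):
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    B = np.linalg.lstsq(Xc, Yc, rcond=None)[0]   # multivariate OLS coefficients
    fitted = Xc @ B
    resid = Yc - fitted
    H = fitted.T @ fitted                        # regression sum-of-squares matrix
    E = resid.T @ resid                          # residual sum-of-squares matrix
    M = w * H + (1.0 - w) * E
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, -1]                           # eigenvector of the largest eigenvalue

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
direction = np.array([3.0, 2.0, 1.0]) / np.sqrt(14.0)   # true principal direction
Y = np.outer(X @ np.array([1.0, -0.5]), direction) + rng.normal(scale=0.3, size=(300, 3))
print(leading_eigvec_estimator(Y, X))
```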
In this study, we introduce the adaptive split balancing forest (ASBF), capable of learning tree representations from data while simultaneously achieving minimax optimality under the Lipschitz class. To exploit higher-order smoothness levels, we further propose a localized version that attains the minimax rate under the H\\\"older class $\\mathcal{H}^{q,\\beta}$ for any $q\\in\\mathbb{N}$ and $\\beta\\in(0,1]$. Rather than relying on the widely-used random feature selection, we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal results. Conversely, a less random, more balanced approach demonstrates optimality. Additionally, we establish uniform upper bounds and explore the application of random forests in average treatment effect estimation problems. Through simulation studies and real-data applications, we demonstrate the superior empirical performance of the proposed methods over existing random forests."}, "https://arxiv.org/abs/2402.11394": {"title": "Maximal Inequalities for Empirical Processes under General Mixing Conditions with an Application to Strong Approximations", "link": "https://arxiv.org/abs/2402.11394", "description": "arXiv:2402.11394v1 Announce Type: cross \nAbstract: This paper provides a bound for the supremum of sample averages over a class of functions for a general class of mixing stochastic processes with arbitrary mixing rates. Regardless of the speed of mixing, the bound is comprised of a concentration rate and a novel measure of complexity. The speed of mixing, however, affects the former quantity implying a phase transition. Fast mixing leads to the standard root-n concentration rate, while slow mixing leads to a slower concentration rate, its speed depends on the mixing structure. Our findings are applied to derive strong approximation results for a general class of mixing processes with arbitrary mixing rates."}, "https://arxiv.org/abs/2402.11771": {"title": "Evaluating the Effectiveness of Index-Based Treatment Allocation", "link": "https://arxiv.org/abs/2402.11771", "description": "arXiv:2402.11771v1 Announce Type: cross \nAbstract: When resources are scarce, an allocation policy is needed to decide who receives a resource. This problem occurs, for instance, when allocating scarce medical resources and is often solved using modern ML methods. This paper introduces methods to evaluate index-based allocation policies -- that allocate a fixed number of resources to those who need them the most -- by using data from a randomized control trial. Such policies create dependencies between agents, which render the assumptions behind standard statistical tests invalid and limit the effectiveness of estimators. Addressing these challenges, we translate and extend recent ideas from the statistics literature to present an efficient estimator and methods for computing asymptotically correct confidence intervals. This enables us to effectively draw valid statistical conclusions, a critical gap in previous work. Our extensive experiments validate our methodology in practical settings, while also showcasing its statistical power. 
We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial to evaluate different ML allocation policies in the context of a mHealth program, drawing previously invisible conclusions."}, "https://arxiv.org/abs/2109.08351": {"title": "Regression Discontinuity Design with Potentially Many Covariates", "link": "https://arxiv.org/abs/2109.08351", "description": "arXiv:2109.08351v4 Announce Type: replace \nAbstract: This paper studies the case of possibly high-dimensional covariates in the regression discontinuity design (RDD) analysis. In particular, we propose estimation and inference methods for the RDD models with covariate selection which perform stably regardless of the number of covariates. The proposed methods combine the local approach using kernel weights with $\\ell_{1}$-penalization to handle high-dimensional covariates. We provide theoretical and numerical results which illustrate the usefulness of the proposed methods. Theoretically, we present risk and coverage properties for our point estimation and inference methods, respectively. Under certain special case, the proposed estimator becomes more efficient than the conventional covariate adjusted estimator at the cost of an additional sparsity condition. Numerically, our simulation experiments and empirical example show the robust behaviors of the proposed methods to the number of covariates in terms of bias and variance for point estimation and coverage probability and interval length for inference."}, "https://arxiv.org/abs/2110.13761": {"title": "Regime-Switching Density Forecasts Using Economists' Scenarios", "link": "https://arxiv.org/abs/2110.13761", "description": "arXiv:2110.13761v2 Announce Type: replace \nAbstract: We propose an approach for generating macroeconomic density forecasts that incorporate information on multiple scenarios defined by experts. We adopt a regime-switching framework in which sets of scenarios (\"views\") are used as Bayesian priors on economic regimes. Predictive densities coming from different views are then combined by optimizing objective functions of density forecasting. We illustrate the approach with an empirical application to quarterly real-time forecasts of U.S. GDP growth, in which we exploit the Fed's macroeconomic scenarios used for bank stress tests. We show that the approach achieves good accuracy in terms of average predictive scores and good calibration of forecast distributions. Moreover, it can be used to evaluate the contribution of economists' scenarios to density forecast performance."}, "https://arxiv.org/abs/2203.08014": {"title": "Non-Existent Moments of Earnings Growth", "link": "https://arxiv.org/abs/2203.08014", "description": "arXiv:2203.08014v3 Announce Type: replace \nAbstract: The literature often employs moment-based earnings risk measures like variance, skewness, and kurtosis. However, under heavy-tailed distributions, these moments may not exist in the population. Our empirical analysis reveals that population kurtosis, skewness, and variance often do not exist for the conditional distribution of earnings growth. This challenges moment-based analyses. We propose robust conditional Pareto exponents as novel earnings risk measures, developing estimation and inference methods. 
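As a simple reference point for the tail-index-based risk measures proposed in the "Non-Existent Moments of Earnings Growth" entry, the snippet below shows the classical Hill estimator of a Pareto tail exponent on simulated heavy-tailed data. This is a textbook stand-in only; the paper's robust conditional Pareto-exponent estimator differs in its construction.

```python
import numpy as np

# Classical Hill estimator of a Pareto tail exponent (a stand-in illustration,
# not the paper's robust conditional estimator).

def hill_estimator(x, k):
    """Hill estimate of the tail exponent alpha from the k largest observations."""
    x = np.sort(np.asarray(x, dtype=float))
    tail = x[-k:]                 # k largest observations
    threshold = x[-k - 1]         # (k+1)-th largest observation
    return 1.0 / np.mean(np.log(tail / threshold))

rng = np.random.default_rng(0)
alpha_true = 2.5                  # variance exists, but the fourth moment does not
pareto_sample = (1.0 - rng.uniform(size=50_000)) ** (-1.0 / alpha_true)
print(hill_estimator(pareto_sample, k=1_000))   # should be close to 2.5
```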
Using the UK New Earnings Survey Panel Dataset (NESPD) and US Panel Study of Income Dynamics (PSID), we find: 1) Moments often fail to exist; 2) Earnings risk increases over the life cycle; 3) Job stayers face higher earnings risk; 4) These patterns persist during the 2007--2008 recession and the 2015--2016 positive growth period."}, "https://arxiv.org/abs/2206.08051": {"title": "Voronoi Density Estimator for High-Dimensional Data: Computation, Compactification and Convergence", "link": "https://arxiv.org/abs/2206.08051", "description": "arXiv:2206.08051v2 Announce Type: replace \nAbstract: The Voronoi Density Estimator (VDE) is an established density estimation technique that adapts to the local geometry of data. However, its applicability has been so far limited to problems in two and three dimensions. This is because Voronoi cells rapidly increase in complexity as dimensions grow, making the necessary explicit computations infeasible. We define a variant of the VDE deemed Compactified Voronoi Density Estimator (CVDE), suitable for higher dimensions. We propose computationally efficient algorithms for numerical approximation of the CVDE and formally prove convergence of the estimated density to the original one. We implement and empirically validate the CVDE through a comparison with the Kernel Density Estimator (KDE). Our results indicate that the CVDE outperforms the KDE on sound and image data."}, "https://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "https://arxiv.org/abs/2302.13066", "description": "arXiv:2302.13066v4 Announce Type: replace \nAbstract: Different proxy variables used in fiscal policy SVARs lead to contradicting conclusions regarding the size of fiscal multipliers. We show that the conflicting results are due to violations of the exogeneity assumptions, i.e. the commonly used proxies are endogenously related to the structural shocks. We propose a novel approach to include proxy variables into a Bayesian non-Gaussian SVAR, tailored to accommodate for potentially endogenous proxy variables. Using our model, we show that increasing government spending is a more effective tool to stimulate the economy than reducing taxes."}, "https://arxiv.org/abs/2304.09046": {"title": "On clustering levels of a hierarchical categorical risk factor", "link": "https://arxiv.org/abs/2304.09046", "description": "arXiv:2304.09046v2 Announce Type: replace \nAbstract: Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. 
In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies."}, "https://arxiv.org/abs/2306.12003": {"title": "Difference-in-Differences with Interference", "link": "https://arxiv.org/abs/2306.12003", "description": "arXiv:2306.12003v4 Announce Type: replace \nAbstract: In many scenarios, such as the evaluation of place-based policies, potential outcomes are not only dependent upon the unit's own treatment but also its neighbors' treatment. Despite this, \"difference-in-differences\" (DID) type estimators typically ignore such interference among neighbors. I show in this paper that the canonical DID estimators generally fail to identify interesting causal effects in the presence of neighborhood interference. To incorporate interference structure into DID estimation, I propose doubly robust estimators for the direct average treatment effect on the treated as well as the average spillover effects under a modified parallel trends assumption. The approach in this paper relaxes common restrictions in the literature, such as partial interference and correctly specified spillover functions. Moreover, robust inference is discussed based on the asymptotic distribution of the proposed estimators."}, "https://arxiv.org/abs/2309.16129": {"title": "Counterfactual Density Estimation using Kernel Stein Discrepancies", "link": "https://arxiv.org/abs/2309.16129", "description": "arXiv:2309.16129v2 Announce Type: replace \nAbstract: Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance."}, "https://arxiv.org/abs/2311.14054": {"title": "Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2311.14054", "description": "arXiv:2311.14054v2 Announce Type: replace \nAbstract: Between 2011 and 2014 NHANES collected objectively measured physical activity data using wrist-worn accelerometers for tens of thousands of individuals for up to seven days. Here we analyze the minute-level indicators of being active, which can be viewed as binary (because there is an active indicator at every minute), multilevel (because there are multiple days of data for each study participant), functional (because within-day data can be viewed as a function of time) data. To extract within- and between-participant directions of variation in the data, we introduce Generalized Multilevel Functional Principal Component Analysis (GM-FPCA), an approach based on the dimension reduction of the linear predictor. 
Scores associated with specific patterns of activity are shown to be strongly associated with time to death. In particular, we confirm that increased activity is associated with time to death, a result that has been reported on other data sets. In addition, our method shows the previously unreported finding that maintaining a consistent day-to-day routine is strongly associated with a reduced risk of mortality (p-value $< 0.001$) even after adjusting for traditional risk factors. Extensive simulation studies indicate that GM-FPCA provides accurate estimation of model parameters, is computationally stable, and is scalable in the number of study participants, visits, and observations within visits. R code for implementing the method is provided."}, "https://arxiv.org/abs/2401.10235": {"title": "Semi-parametric local variable selection under misspecification", "link": "https://arxiv.org/abs/2401.10235", "description": "arXiv:2401.10235v2 Announce Type: replace \nAbstract: Local variable selection aims to discover localized effects by assessing the impact of covariates on outcomes within specific regions defined by other covariates. We outline some challenges of local variable selection in the presence of non-linear relationships and model misspecification. Specifically, we highlight a potential drawback of common semi-parametric methods: even slight model misspecification can result in a high rate of false positives. To address these shortcomings, we propose a methodology based on orthogonal cut splines that achieves consistent local variable selection in high-dimensional scenarios. Our approach offers simplicity, handles both continuous and discrete covariates, and provides theory for high-dimensional covariates and model misspecification. We discuss settings with either independent or dependent data. Our proposal allows including adjustment covariates that do not undergo selection, enhancing flexibility in modeling complex scenarios. We illustrate its application in simulation studies with both independent and functional data, as well as with two real datasets. One dataset evaluates salary gaps associated with discrimination factors at different ages, while the other examines the effects of covariates on brain activation over time. The approach is implemented in the R package mombf."}, "https://arxiv.org/abs/1910.08597": {"title": "Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic", "link": "https://arxiv.org/abs/1910.08597", "description": "arXiv:1910.08597v5 Announce Type: replace-cross \nAbstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, the iterates are likely to bounce around in the vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy to implement and essentially incurs no additional computational cost compared to standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance comparing favorably to other stochastic optimization methods.
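The following is a loose sketch of the splitting diagnostic described in the "Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic" entry, on a toy quadratic objective: two SGD threads are run from the current iterate and a negative inner product between their averaged gradients is treated as evidence of a stationary phase, triggering a learning-rate decrease. The objective, thread length, merge rule, and halving factor are all illustrative assumptions; this is not the SplitSGD algorithm as specified in the paper.

```python
import numpy as np

# Loose sketch of the splitting-based stationarity diagnostic: two SGD threads,
# negative inner product of their averaged gradients => shrink the learning rate.

rng = np.random.default_rng(0)

def stoch_grad(theta):
    # noisy gradient of f(theta) = 0.5 * ||theta||^2
    return theta + rng.normal(scale=1.0, size=theta.shape)

def run_thread(theta, lr, steps):
    grads = []
    for _ in range(steps):
        g = stoch_grad(theta)
        grads.append(g)
        theta = theta - lr * g
    return theta, np.mean(grads, axis=0)

theta, lr = np.full(10, 5.0), 0.5
for epoch in range(20):
    theta1, g1 = run_thread(theta.copy(), lr, steps=25)
    theta2, g2 = run_thread(theta.copy(), lr, steps=25)
    theta = 0.5 * (theta1 + theta2)          # merge the two threads
    if np.dot(g1, g2) < 0:                   # negative inner product: stationary phase
        lr *= 0.5                            # decrease the learning rate
print(lr, np.linalg.norm(theta))
```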
Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam."}, "https://arxiv.org/abs/2007.02192": {"title": "Tail-adaptive Bayesian shrinkage", "link": "https://arxiv.org/abs/2007.02192", "description": "arXiv:2007.02192v4 Announce Type: replace-cross \nAbstract: Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation method under diverse sparsity regimes, which has a tail-adaptive shrinkage property. In this property, the tail-heaviness of the prior adjusts adaptively, becoming larger or smaller as the sparsity level increases or decreases, respectively, to accommodate more or fewer signals, a posteriori. We propose a global-local-tail (GLT) Gaussian mixture distribution that ensures this property. We examine the role of the tail-index of the prior in relation to the underlying sparsity level and demonstrate that the GLT posterior contracts at the minimax optimal rate for sparse normal mean models. We apply both the GLT prior and the Horseshoe prior to a real data problem and simulation examples. Our findings indicate that the varying tail rule based on the GLT prior offers advantages over a fixed tail rule based on the Horseshoe prior in diverse sparsity regimes."}, "https://arxiv.org/abs/2109.02726": {"title": "Screening the Discrepancy Function of a Computer Model", "link": "https://arxiv.org/abs/2109.02726", "description": "arXiv:2109.02726v2 Announce Type: replace-cross \nAbstract: Screening traditionally refers to the problem of detecting active inputs in the computer model. In this paper, we develop methodology that applies to screening, but the main focus is on detecting active inputs not in the computer model itself but rather on the discrepancy function that is introduced to account for model inadequacy when linking the computer model with field observations. We contend this is an important problem as it informs the modeler which are the inputs that are potentially being mishandled in the model, but also along which directions it may be less recommendable to use the model for prediction. The methodology is Bayesian and is inspired by the continuous spike and slab prior popularized by the literature on Bayesian variable selection. In our approach, and in contrast with previous proposals, a single MCMC sample from the full model allows us to compute the posterior probabilities of all the competing models, resulting in a methodology that is computationally very fast. The approach hinges on the ability to obtain posterior inclusion probabilities of the inputs, which are very intuitive and easy to interpret quantities, as the basis for selecting active inputs. 
For that reason, we name the methodology PIPS -- posterior inclusion probability screening."}, "https://arxiv.org/abs/2210.02171": {"title": "A uniform kernel trick for high-dimensional two-sample problems", "link": "https://arxiv.org/abs/2210.02171", "description": "arXiv:2210.02171v2 Announce Type: replace-cross \nAbstract: We use a suitable version of the so-called \"kernel trick\" to devise two-sample (homogeneity) tests, especially focussed on high-dimensional and functional data. Our proposal entails a simplification related to the important practical problem of selecting an appropriate kernel function. Specifically, we apply a uniform variant of the kernel trick which involves the supremum within a class of kernel-based distances. We obtain the asymptotic distribution (under the null and alternative hypotheses) of the test statistic. The proofs rely on empirical processes theory, combined with the delta method and Hadamard (directional) differentiability techniques, and functional Karhunen-Lo\\`eve-type expansions of the underlying processes. This methodology has some advantages over other standard approaches in the literature. We also give some experimental insight into the performance of our proposal compared to the original kernel-based approach \\cite{Gretton2007} and the test based on energy distances \\cite{Szekely-Rizzo-2017}."}, "https://arxiv.org/abs/2303.01385": {"title": "Hyperlink communities in higher-order networks", "link": "https://arxiv.org/abs/2303.01385", "description": "arXiv:2303.01385v3 Announce Type: replace-cross \nAbstract: Many networks can be characterised by the presence of communities, which are groups of units that are closely linked. Identifying these communities can be crucial for understanding the system's overall function. Recently, hypergraphs have emerged as a fundamental tool for modelling systems where interactions are not limited to pairs but may involve an arbitrary number of nodes. In this study, we adopt a dual approach to community detection and extend the concept of link communities to hypergraphs. This extension allows us to extract informative clusters of highly related hyperedges. We analyze the dendrograms obtained by applying hierarchical clustering to distance matrices among hyperedges across a variety of real-world data, showing that hyperlink communities naturally highlight the hierarchical and multiscale structure of higher-order networks. Moreover, hyperlink communities enable us to extract overlapping memberships from nodes, overcoming limitations of traditional hard clustering methods. Finally, we introduce higher-order network cartography as a practical tool for categorizing nodes into different structural roles based on their interaction patterns and community participation. This approach aids in identifying different types of individuals in a variety of real-world social systems. Our work contributes to a better understanding of the structural organization of real-world higher-order systems."}, "https://arxiv.org/abs/2308.07480": {"title": "Order-based Structure Learning with Normalizing Flows", "link": "https://arxiv.org/abs/2308.07480", "description": "arXiv:2308.07480v2 Announce Type: replace-cross \nAbstract: Estimating the causal structure of observational data is a challenging combinatorial search problem that scales super-exponentially with graph size. 
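Related to the "A uniform kernel trick for high-dimensional two-sample problems" entry, here is a simplified two-sample statistic built as the supremum of (biased) squared-MMD estimates over a family of Gaussian kernels indexed by bandwidth. The bandwidth grid and the biased MMD estimator are assumptions of this sketch; the paper's statistic and its calibration are more involved and are not reproduced here.

```python
import numpy as np

# Supremum over a Gaussian-kernel family of biased squared-MMD two-sample statistics.

def sq_dists(A, B):
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def mmd2_biased(X, Y, bandwidth):
    k = lambda D: np.exp(-D / (2.0 * bandwidth**2))
    return (k(sq_dists(X, X)).mean()
            + k(sq_dists(Y, Y)).mean()
            - 2.0 * k(sq_dists(X, Y)).mean())

def sup_mmd2(X, Y, bandwidths):
    # uniform (supremum) version over the kernel class
    return max(mmd2_biased(X, Y, h) for h in bandwidths)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(loc=0.3, size=(200, 5))        # mean-shifted alternative
print(sup_mmd2(X, Y, bandwidths=[0.5, 1.0, 2.0, 4.0]))
```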
Existing methods use continuous relaxations to make this problem computationally tractable but often restrict the data-generating process to additive noise models (ANMs) through explicit or implicit assumptions. We present Order-based Structure Learning with Normalizing Flows (OSLow), a framework that relaxes these assumptions using autoregressive normalizing flows. We leverage the insight that searching over topological orderings is a natural way to enforce acyclicity in structure discovery and propose a novel, differentiable permutation learning method to find such orderings. Through extensive experiments on synthetic and real-world data, we demonstrate that OSLow outperforms prior baselines and improves performance on the observational Sachs and SynTReN datasets as measured by structural hamming distance and structural intervention distance, highlighting the importance of relaxing the ANM assumption made by existing methods."}, "https://arxiv.org/abs/2309.02468": {"title": "Ab initio uncertainty quantification in scattering analysis of microscopy", "link": "https://arxiv.org/abs/2309.02468", "description": "arXiv:2309.02468v2 Announce Type: replace-cross \nAbstract: Estimating parameters from data is a fundamental problem in physics, customarily done by minimizing a loss function between a model and observed statistics. In scattering-based analysis, researchers often employ their domain expertise to select a specific range of wavevectors for analysis, a choice that can vary depending on the specific case. We introduce another paradigm that defines a probabilistic generative model from the beginning of data processing and propagates the uncertainty for parameter estimation, termed ab initio uncertainty quantification (AIUQ). As an illustrative example, we demonstrate this approach with differential dynamic microscopy (DDM) that extracts dynamical information through Fourier analysis at a selected range of wavevectors. We first show that DDM is equivalent to fitting a temporal variogram in the reciprocal space using a latent factor model as the generative model. Then we derive the maximum marginal likelihood estimator, which optimally weighs information at all wavevectors, therefore eliminating the need to select the range of wavevectors. Furthermore, we substantially reduce the computational cost by utilizing the generalized Schur algorithm for Toeplitz covariances without approximation. Simulated studies validate that AIUQ significantly improves estimation accuracy and enables model selection with automated analysis. The utility of AIUQ is also demonstrated by three distinct sets of experiments: first in an isotropic Newtonian fluid, pushing limits of optically dense systems compared to multiple particle tracking; next in a system undergoing a sol-gel transition, automating the determination of gelling points and critical exponent; and lastly, in discerning anisotropic diffusive behavior of colloids in a liquid crystal. These outcomes collectively underscore AIUQ's versatility to capture system dynamics in an efficient and automated manner."}, "https://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "https://arxiv.org/abs/2310.03114", "description": "arXiv:2310.03114v2 Announce Type: replace-cross \nAbstract: In this article we consider Bayesian parameter inference for a type of partially observed stochastic Volterra equation (SVE). SVEs are found in many areas such as physics and mathematical finance. 
In the latter field they can be used to represent long memory in unobserved volatility processes. In many cases of practical interest, SVEs must be time-discretized and then parameter inference is based upon the posterior associated to this time-discretized process. Based upon recent studies on time-discretization of SVEs (e.g. Richard et al. 2021), we use Euler-Maruyama methods for the afore-mentioned discretization. We then show how multilevel Markov chain Monte Carlo (MCMC) methods (Jasra et al. 2018) can be applied in this context. In the examples we study, we give a proof that shows that the cost to achieve a mean square error (MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is $\\mathcal{O}(\\epsilon^{-\\tfrac{4}{2H+1}})$, where $H$ is the Hurst parameter. If one uses a single level MCMC method then the cost is $\\mathcal{O}(\\epsilon^{-\\tfrac{2(2H+3)}{2H+1}})$ to achieve the same MSE. We illustrate these results in the context of state-space and stochastic volatility models, with the latter applied to real data."}, "https://arxiv.org/abs/2310.11122": {"title": "Sensitivity-Aware Amortized Bayesian Inference", "link": "https://arxiv.org/abs/2310.11122", "description": "arXiv:2310.11122v4 Announce Type: replace-cross \nAbstract: Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions."}, "https://arxiv.org/abs/2402.12548": {"title": "Composite likelihood inference for space-time point processes", "link": "https://arxiv.org/abs/2402.12548", "description": "arXiv:2402.12548v1 Announce Type: new \nAbstract: The dynamics of a rain forest is extremely complex involving births, deaths and growth of trees with complex interactions between trees, animals, climate, and environment. We consider the patterns of recruits (new trees) and dead trees between rain forest censuses. For a current census we specify regression models for the conditional intensity of recruits and the conditional probabilities of death given the current trees and spatial covariates. We estimate regression parameters using conditional composite likelihood functions that only involve the conditional first order properties of the data.
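As a purely arithmetic illustration of the cost rates quoted in the "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations" entry, the snippet below evaluates the multilevel exponent $4/(2H+1)$ against the single-level exponent $2(2H+3)/(2H+1)$ for a few values of the Hurst parameter $H$; for $H = 0.5$ these are $\epsilon^{-2}$ versus $\epsilon^{-4}$.

```python
# Cost exponents to reach MSE of order eps^2, as quoted in the abstract above.

def cost_exponents(H):
    multilevel = 4.0 / (2.0 * H + 1.0)
    single_level = 2.0 * (2.0 * H + 3.0) / (2.0 * H + 1.0)
    return multilevel, single_level

for H in (0.1, 0.25, 0.5, 0.75):
    ml, sl = cost_exponents(H)
    print(f"H={H}: multilevel eps^-{ml:.2f} vs single level eps^-{sl:.2f}")
```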
When constructing assumption lean estimators of covariance matrices of parameter estimates we only need mild assumptions of decaying conditional correlations in space while assumptions regarding correlations over time are avoided by exploiting conditional centering of composite likelihood score functions. Time series of point patterns from rain forest censuses are quite short while each point pattern covers a fairly big spatial region. To obtain asymptotic results we therefore use a central limit theorem for the fixed timespan - increasing spatial domain asymptotic setting. This also allows us to handle the challenge of using stochastic covariates constructed from past point patterns. Conveniently, it suffices to impose weak dependence assumptions on the innovations of the space-time process. We investigate the proposed methodology by simulation studies and applications to rain forest data."}, "https://arxiv.org/abs/2402.12555": {"title": "Optimal Dynamic Treatment Regime Estimation in the Presence of Nonadherence", "link": "https://arxiv.org/abs/2402.12555", "description": "arXiv:2402.12555v1 Announce Type: new \nAbstract: Dynamic treatment regimes (DTRs) are sequences of functions that formalize the process of precision medicine. DTRs take as input patient information and output treatment recommendations. A major focus of the DTR literature has been on the estimation of optimal DTRs, the sequences of decision rules that result in the best outcome in expectation, across the complete population were they to be applied. While there is a rich literature on optimal DTR estimation, to date there has been minimal consideration of the impacts of nonadherence on these estimation procedures. Nonadherence refers to any process through which an individual's prescribed treatment does not match their true treatment. We explore the impacts of nonadherence and demonstrate that generally, when nonadherence is ignored, suboptimal regimes will be estimated. In light of these findings we propose a method for estimating optimal DTRs in the presence of nonadherence. The resulting estimators are consistent and asymptotically normal, with a double robustness property. Using simulations we demonstrate the reliability of these results, and illustrate comparable performance between the proposed estimation procedure adjusting for the impacts of nonadherence and estimators that are computed on data without nonadherence."}, "https://arxiv.org/abs/2402.12576": {"title": "Understanding Difference-in-differences methods to evaluate policy effects with staggered adoption: an application to Medicaid and HIV", "link": "https://arxiv.org/abs/2402.12576", "description": "arXiv:2402.12576v1 Announce Type: new \nAbstract: While a randomized control trial is considered the gold standard for estimating causal treatment effects, there are many research settings in which randomization is infeasible or unethical. In such cases, researchers rely on analytical methods for observational data to explore causal relationships. Difference-in-differences (DID) is one such method that, most commonly, estimates a difference in some mean outcome in a group before and after the implementation of an intervention or policy and compares this with a control group followed over the same time (i.e., a group that did not implement the intervention or policy).
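To fix ideas for the canonical comparison just described in the "Understanding Difference-in-differences methods" entry, here is a minimal two-group, two-period DID estimate on simulated data. The data-generating values are arbitrary; the staggered-adoption extensions discussed next require more careful estimators than this.

```python
import numpy as np

# Canonical 2x2 difference-in-differences on simulated data: the change in mean
# outcome for the treated group minus the change for the control group.

rng = np.random.default_rng(0)
n = 1_000
treated = rng.integers(0, 2, size=n)            # group indicator
effect = 2.0                                    # true treatment effect
y_pre = 1.0 + 0.5 * treated + rng.normal(size=n)
y_post = 1.5 + 0.5 * treated + effect * treated + rng.normal(size=n)

did = ((y_post[treated == 1].mean() - y_pre[treated == 1].mean())
       - (y_post[treated == 0].mean() - y_pre[treated == 0].mean()))
print(f"DID estimate of the treatment effect: {did:.2f}")   # close to 2.0
```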
Although DID modeling approaches have been gaining popularity in public health research, the majority of these approaches and their extensions are developed and shared within the economics literature. While extensions of DID modeling approaches may be straightforward to apply to observational data in any field, the complexities and assumptions involved in newer approaches are often misunderstood. In this paper, we focus on recent extensions of the DID method and their relationships to linear models in the setting of staggered treatment adoption over multiple years. We detail the identification and estimation of the average treatment effect among the treated using potential outcomes notation, highlighting the assumptions necessary to produce valid estimates. These concepts are described within the context of Medicaid expansion and retention in care among people living with HIV (PWH) in the United States. While each DID approach is potentially valid, understanding their different assumptions and choosing an appropriate method can have important implications for policy-makers, funders, and public health as a whole."}, "https://arxiv.org/abs/2402.12583": {"title": "Non-linear Triple Changes Estimator for Targeted Policies", "link": "https://arxiv.org/abs/2402.12583", "description": "arXiv:2402.12583v1 Announce Type: new \nAbstract: The renowned difference-in-differences (DiD) estimator relies on the assumption of 'parallel trends,' which does not hold in many practical applications. To address this issue, the econometrics literature has turned to the triple difference estimator. Both DiD and triple difference are limited to assessing average effects exclusively. An alternative avenue is offered by the changes-in-changes (CiC) estimator, which provides an estimate of the entire counterfactual distribution at the cost of relying on (stronger) distributional assumptions. In this work, we extend the triple difference estimator to accommodate the CiC framework, presenting the `triple changes estimator' and its identification assumptions, thereby expanding the scope of the CiC paradigm. Subsequently, we empirically evaluate the proposed framework and apply it to a study examining the impact of Medicaid expansion on children's preventive care."}, "https://arxiv.org/abs/2402.12607": {"title": "Inference on LATEs with covariates", "link": "https://arxiv.org/abs/2402.12607", "description": "arXiv:2402.12607v1 Announce Type: new \nAbstract: In theory, two-stage least squares (TSLS) identifies a weighted average of covariate-specific local average treatment effects (LATEs) from a saturated specification without making parametric assumptions on how available covariates enter the model. In practice, TSLS is severely biased when saturation leads to a number of control dummies that is of the same order of magnitude as the sample size, and the use of many, arguably weak, instruments. This paper derives asymptotically valid tests and confidence intervals for an estimand that identifies the weighted average of LATEs targeted by saturated TSLS, even when the number of control dummies and instrument interactions is large. 
The proposed inference procedure is robust against four key features of saturated economic data: treatment effect heterogeneity, covariates with rich support, weak identification strength, and conditional heteroskedasticity."}, "https://arxiv.org/abs/2402.12710": {"title": "Integrating Active Learning in Causal Inference with Interference: A Novel Approach in Online Experiments", "link": "https://arxiv.org/abs/2402.12710", "description": "arXiv:2402.12710v1 Announce Type: new \nAbstract: In the domain of causal inference research, the prevalent potential outcomes framework, notably the Rubin Causal Model (RCM), often overlooks individual interference and assumes independent treatment effects. This assumption, however, is frequently misaligned with the intricate realities of real-world scenarios, where interference is not merely a possibility but a common occurrence. Our research endeavors to address this discrepancy by focusing on the estimation of direct and spillover treatment effects under two assumptions: (1) network-based interference, where treatments on neighbors within connected networks affect one's outcomes, and (2) non-random treatment assignments influenced by confounders. To improve the efficiency of estimating potentially complex effects functions, we introduce a novel active learning approach: Active Learning in Causal Inference with Interference (ACI). This approach uses Gaussian processes to flexibly model the direct and spillover treatment effects as a function of a continuous measure of neighbors' treatment assignment. The ACI framework sequentially identifies the experimental settings that demand further data. It further optimizes the treatment assignments under the network interference structure using genetic algorithms to achieve efficient learning outcomes. By applying our method to simulation data and a Tencent game dataset, we demonstrate its feasibility in achieving accurate effects estimations with reduced data requirements. This ACI approach marks a significant advancement in the realm of data efficiency for causal inference, offering a robust and efficient alternative to traditional methodologies, particularly in scenarios characterized by complex interference patterns."}, "https://arxiv.org/abs/2402.12719": {"title": "Restricted maximum likelihood estimation in generalized linear mixed models", "link": "https://arxiv.org/abs/2402.12719", "description": "arXiv:2402.12719v1 Announce Type: new \nAbstract: Restricted maximum likelihood (REML) estimation is a widely accepted and frequently used method for fitting linear mixed models, with its principal advantage being that it produces less biased estimates of the variance components. However, the concept of REML does not immediately generalize to the setting of non-normally distributed responses, and it is not always clear the extent to which, either asymptotically or in finite samples, such generalizations reduce the bias of variance component estimates compared to standard unrestricted maximum likelihood estimation. In this article, we review various attempts that have been made over the past four decades to extend REML estimation in generalized linear mixed models. We establish four major classes of approaches, namely approximate linearization, integrated likelihood, modified profile likelihoods, and direct bias correction of the score function, and show that while these four classes may have differing motivations and derivations, they often arrive at a similar if not the same REML estimate.
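The starting point of the "Restricted maximum likelihood estimation in generalized linear mixed models" entry, that REML reduces the downward bias of variance-component estimates relative to unrestricted ML, can be seen in the plain Gaussian linear mixed model. The sketch below compares REML and ML fits of a random-intercept model on simulated grouped data using statsmodels; it is a minimal illustration only and does not touch the generalized (non-normal) extensions the paper reviews.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Compare REML and unrestricted ML estimates of the random-intercept variance
# in a Gaussian linear mixed model on simulated grouped data.

rng = np.random.default_rng(0)
n_groups, n_per = 15, 5
group = np.repeat(np.arange(n_groups), n_per)
u = rng.normal(scale=1.0, size=n_groups)        # true random-intercept variance = 1
x = rng.normal(size=n_groups * n_per)
y = 2.0 + 0.5 * x + u[group] + rng.normal(scale=1.0, size=n_groups * n_per)
df = pd.DataFrame({"y": y, "x": x, "g": group})

model = smf.mixedlm("y ~ x", df, groups=df["g"])
for reml in (True, False):
    res = model.fit(reml=reml)
    print("REML" if reml else "ML  ",
          "random-intercept variance:", round(float(res.cov_re.iloc[0, 0]), 3),
          "residual variance:", round(float(res.scale), 3))
```

With few groups, the ML estimate of the random-intercept variance will typically come out smaller than the REML estimate, which is the bias the abstract refers to.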
We compare the finite sample performance of these four classes through a numerical study involving binary and count data, with results demonstrating that they perform similarly well in reducing the finite sample bias of variance components."}, "https://arxiv.org/abs/2402.12724": {"title": "Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression", "link": "https://arxiv.org/abs/2402.12724", "description": "arXiv:2402.12724v1 Announce Type: new \nAbstract: Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power."}, "https://arxiv.org/abs/2402.12803": {"title": "Joint Mean and Correlation Regression Models for Multivariate Data", "link": "https://arxiv.org/abs/2402.12803", "description": "arXiv:2402.12803v1 Announce Type: new \nAbstract: We propose a new joint mean and correlation regression model for correlated multivariate discrete responses, that simultaneously regresses the mean of each response against a set of covariates, and the correlations between responses against a set of similarity/distance measures. A set of joint estimating equations are formulated to construct an estimator of both the mean regression coefficients and the correlation regression parameters. Under a general setting where the number of responses can tend to infinity, the joint estimator is demonstrated to be consistent and asymptotically normally distributed, with differing rates of convergence due to the mean regression coefficients being heterogeneous across responses. An iterative estimation procedure is developed to obtain parameter estimates in the required, constrained parameter space. We apply the proposed model to a multivariate abundance dataset comprising overdispersed counts of 38 Carabidae ground beetle species sampled throughout Scotland, along with information about the environmental conditions of each site and the traits of each species. Results show in particular that the relationships between the mean abundances of various beetle species and environmental covariates are different and that beetle total length has statistically important effect in driving the correlations between the species. Simulations demonstrate the strong finite sample performance of the proposed estimator in terms of point estimation and inference."}, "https://arxiv.org/abs/2402.12825": {"title": "On scalable ARMA models", "link": "https://arxiv.org/abs/2402.12825", "description": "arXiv:2402.12825v1 Announce Type: new \nAbstract: This paper considers both the least squares and quasi-maximum likelihood estimation for the recently proposed scalable ARMA model, a parametric infinite-order vector AR model, and their asymptotic normality is also established. 
It makes feasible the inference on this computationally efficient model, especially for financial time series. An efficient block coordinate descent algorithm is further introduced to search for estimates, and a Bayesian information criterion is suggested for model selection. Simulation experiments are conducted to illustrate their finite sample performance, and a real application on six macroeconomic indicators illustrates the usefulness of the proposed methodology."}, "https://arxiv.org/abs/2402.12838": {"title": "Extending the Scope of Inference About Predictive Ability to Machine Learning Methods", "link": "https://arxiv.org/abs/2402.12838", "description": "arXiv:2402.12838v1 Announce Type: new \nAbstract: Though out-of-sample forecast evaluation is systematically employed with modern machine learning methods and there exists a well-established classic inference theory for predictive ability, see, e.g., West (1996, Asymptotic Inference About Predictive Ability, \\textit{Econometrica}, 64, 1067-1084), such theory is not directly applicable to modern machine learners such as the Lasso in the high dimensional setting. We investigate under which conditions such extensions are possible. Two key properties for standard out-of-sample asymptotic inference to be valid with machine learning are (i) a zero-mean condition for the score of the prediction loss function; and (ii) a fast rate of convergence for the machine learner. Monte Carlo simulations confirm our theoretical findings. For accurate finite sample inferences with machine learning, we recommend a small out-of-sample vs in-sample size ratio. We illustrate the wide applicability of our results with a new out-of-sample test for the Martingale Difference Hypothesis (MDH). We obtain the asymptotic null distribution of our test and use it to evaluate"}, "https://arxiv.org/abs/2402.12866": {"title": "On new tests for the Poisson distribution based on empirical weight functions", "link": "https://arxiv.org/abs/2402.12866", "description": "arXiv:2402.12866v1 Announce Type: new \nAbstract: We propose new goodness-of-fit tests for the Poisson distribution. The testing procedure entails fitting a weighted Poisson distribution, which has the Poisson as a special case, to observed data. Based on sample data, we calculate an empirical weight function which is compared to its theoretical counterpart under the Poisson assumption. Weighted Lp distances between these empirical and theoretical functions are proposed as test statistics and closed form expressions are derived for L1, L2 and L1 distances. A Monte Carlo study is included in which the newly proposed tests are shown to be powerful when compared to existing tests, especially in the case of overdispersed alternatives. We demonstrate the use of the tests with two practical examples."}, "https://arxiv.org/abs/2402.12901": {"title": "Bi-invariant Dissimilarity Measures for Sample Distributions in Lie Groups", "link": "https://arxiv.org/abs/2402.12901", "description": "arXiv:2402.12901v1 Announce Type: new \nAbstract: Data sets sampled in Lie groups are widespread, and as with multivariate data, it is important for many applications to assess the differences between the sets in terms of their distributions. Indices for this task are usually derived by considering the Lie group as a Riemannian manifold. Then, however, compatibility with the group operation is guaranteed only if a bi-invariant metric exists, which is not the case for most non-compact and non-commutative groups. 
We show here that if one considers an affine connection structure instead, one obtains bi-invariant generalizations of well-known dissimilarity measures: a Hotelling $T^2$ statistic, Bhattacharyya distance and Hellinger distance. Each of the dissimilarity measures matches its multivariate counterpart for Euclidean data and is translation-invariant, so that biases, e.g., through an arbitrary choice of reference, are avoided. We further derive non-parametric two-sample tests that are bi-invariant and consistent. We demonstrate the potential of these dissimilarity measures by performing group tests on data of knee configurations and epidemiological shape data. Significant differences are revealed in both cases."}, "https://arxiv.org/abs/2402.13023": {"title": "Bridging Methodologies: Angrist and Imbens' Contributions to Causal Identification", "link": "https://arxiv.org/abs/2402.13023", "description": "arXiv:2402.13023v1 Announce Type: new \nAbstract: In the 1990s, Joshua Angrist and Guido Imbens studied the causal interpretation of Instrumental Variable estimates (a widespread methodology in economics) through the lens of potential outcomes (a classical framework to formalize causality in statistics). Bridging a gap between those two strands of literature, they stress the importance of treatment effect heterogeneity and show that, under defendable assumptions in various applications, this method recovers an average causal effect for a specific subpopulation of individuals whose treatment is affected by the instrument. They were awarded the Nobel Prize primarily for this Local Average Treatment Effect (LATE). The first part of this article presents that methodological contribution in-depth: the origination in earlier applied articles, the different identification results and extensions, and related debates on the relevance of LATEs for public policy decisions. The second part reviews the main contributions of the authors beyond the LATE. J. Angrist has pursued the search for informative and varied empirical research designs in several fields, particularly in education. G. Imbens has complemented the toolbox for treatment effect estimation in many ways, notably through propensity score reweighting, matching, and, more recently, adapting machine learning procedures."}, "https://arxiv.org/abs/2402.13042": {"title": "Not all distributional shifts are equal: Fine-grained robust conformal inference", "link": "https://arxiv.org/abs/2402.13042", "description": "arXiv:2402.13042v1 Announce Type: new \nAbstract: We introduce a fine-grained framework for uncertainty quantification of predictive models under distributional shifts. This framework distinguishes the shift in covariate distributions from that in the conditional relationship between the outcome (Y) and the covariates (X). We propose to reweight the training samples to adjust for an identifiable covariate shift while protecting against worst-case conditional distribution shift bounded in an $f$-divergence ball. Based on ideas from conformal inference and distributionally robust learning, we present an algorithm that outputs (approximately) valid and efficient prediction intervals in the presence of distributional shifts. As a use case, we apply the framework to sensitivity analysis of individual treatment effects with hidden confounding. 
The proposed methods are evaluated in simulation studies and three real data applications, demonstrating superior robustness and efficiency compared with existing benchmarks."}, "https://arxiv.org/abs/2402.12384": {"title": "Comparing MCMC algorithms in Stochastic Volatility Models using Simulation Based Calibration", "link": "https://arxiv.org/abs/2402.12384", "description": "arXiv:2402.12384v1 Announce Type: cross \nAbstract: Simulation Based Calibration (SBC) is applied to analyse two commonly used, competing Markov chain Monte Carlo algorithms for estimating the posterior distribution of a stochastic volatility model. In particular, the bespoke 'off-set mixture approximation' algorithm proposed by Kim, Shephard, and Chib (1998) is explored together with a Hamiltonian Monte Carlo algorithm implemented through Stan. The SBC analysis involves a simulation study to assess whether each sampling algorithm has the capacity to produce valid inference for the correctly specified model, while also characterising statistical efficiency through the effective sample size. Results show that Stan's No-U-Turn sampler, an implementation of Hamiltonian Monte Carlo, produces a well-calibrated posterior estimate while the celebrated off-set mixture approach is less efficient and poorly calibrated, though model parameterisation also plays a role. Limitations and restrictions of generality are discussed."}, "https://arxiv.org/abs/2402.12785": {"title": "Stochastic Graph Heat Modelling for Diffusion-based Connectivity Retrieval", "link": "https://arxiv.org/abs/2402.12785", "description": "arXiv:2402.12785v1 Announce Type: cross \nAbstract: Heat diffusion describes the process by which heat flows from areas with higher temperatures to ones with lower temperatures. This concept was previously adapted to graph structures, whereby heat flows between nodes of a graph depending on the graph topology. Here, we combine the graph heat equation with the stochastic heat equation, which ultimately yields a model for multivariate time signals on a graph. We show theoretically how the model can be used to directly compute the diffusion-based connectivity structure from multivariate signals. Unlike other connectivity measures, our heat model-based approach is inherently multivariate and yields an absolute scaling factor, namely the graph thermal diffusivity, which captures the extent of heat-like graph propagation in the data. On two datasets, we show how the graph thermal diffusivity can be used to characterise Alzheimer's disease. We find that the graph thermal diffusivity is lower for Alzheimer's patients than healthy controls and correlates with dementia scores, suggesting structural impairment in patients in line with previous findings."}, "https://arxiv.org/abs/2402.12980": {"title": "Efficient adjustment for complex covariates: Gaining efficiency with DOPE", "link": "https://arxiv.org/abs/2402.12980", "description": "arXiv:2402.12980v1 Announce Type: cross \nAbstract: Covariate adjustment is a ubiquitous method used to estimate the average treatment effect (ATE) from observational data. Assuming a known graphical structure of the data generating model, recent results give graphical criteria for optimal adjustment, which enables efficient estimation of the ATE. However, graphical approaches are challenging for high-dimensional and complex data, and it is not straightforward to specify a meaningful graphical model of non-Euclidean data such as texts. 
We propose a general framework that accommodates adjustment for any subset of information expressed by the covariates. We generalize prior works and leverage these results to identify the optimal covariate information for efficient adjustment. This information is minimally sufficient for prediction of the outcome conditionally on treatment.\n Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the ATE, and we provide asymptotic results for the DOPE under general conditions. Compared to the augmented inverse propensity weighted (AIPW) estimator, the DOPE can retain its efficiency even when the covariates are highly predictive of treatment. We illustrate this with a single-index model, and with an implementation of the DOPE based on neural networks, we demonstrate its performance on simulated and real data. Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings."}, "https://arxiv.org/abs/2202.12540": {"title": "Familial inference: tests for hypotheses on a family of centres", "link": "https://arxiv.org/abs/2202.12540", "description": "arXiv:2202.12540v3 Announce Type: replace \nAbstract: Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions, often concerning their centre. Tests that assess statistical hypotheses of centre implicitly assume a specific centre, e.g., the mean or median. Yet, scientific hypotheses do not always specify a particular centre. This ambiguity leaves the possibility for a gap between scientific theory and statistical practice that can lead to rejection of a true null. In the face of replicability crises in many scientific disciplines, significant results of this kind are concerning. Rather than testing a single centre, this paper proposes testing a family of plausible centres, such as that induced by the Huber loss function (the Huber family). Each centre in the family generates a testing problem, and the resulting family of hypotheses constitutes a familial hypothesis. A Bayesian nonparametric procedure is devised to test familial hypotheses, enabled by a novel pathwise optimization routine to fit the Huber family. The favourable properties of the new test are demonstrated theoretically and experimentally. Two examples from psychology serve as real-world case studies."}, "https://arxiv.org/abs/2204.10508": {"title": "Identification enhanced generalised linear model estimation with nonignorable missing outcomes", "link": "https://arxiv.org/abs/2204.10508", "description": "arXiv:2204.10508v2 Announce Type: replace \nAbstract: Missing data often result in undesirable bias and loss of efficiency. These become substantial problems when the response mechanism is nonignorable, such that the response model depends on unobserved variables. It is necessary to estimate the joint distribution of unobserved variables and response indicators to manage nonignorable nonresponse. However, model misspecification and identification issues prevent robust estimates despite careful estimation of the target joint distribution. In this study, we modelled the distribution of the observed parts and derived sufficient conditions for model identifiability, assuming a logistic regression model as the response mechanism and generalised linear models as the main outcome model of interest. 
More importantly, the derived sufficient conditions are testable with the observed data and do not require any instrumental variables, which are often assumed to guarantee model identifiability but cannot be practically determined beforehand. To analyse missing data, we propose a new imputation method which incorporates verifiable identifiability using only observed data. Furthermore, we present the performance of the proposed estimators in numerical studies and apply the proposed method to two sets of real data: exit polls for the 19th South Korean election and public data collected from the Korean Survey of Household Finances and Living Conditions."}, "https://arxiv.org/abs/2206.08541": {"title": "Ensemble distributional forecasting for insurance loss reserving", "link": "https://arxiv.org/abs/2206.08541", "description": "arXiv:2206.08541v4 Announce Type: replace \nAbstract: Loss reserving generally focuses on identifying a single model that can generate superior predictive performance. However, different loss reserving models specialise in capturing different aspects of loss data. This is recognised in practice in the sense that results from different models are often considered, and sometimes combined. For instance, actuaries may take a weighted average of the prediction outcomes from various loss reserving models, often based on subjective assessments.\n In this paper, we propose a systematic framework to objectively combine (i.e. ensemble) multiple _stochastic_ loss reserving models such that the strengths offered by different models can be utilised effectively. Our framework contains two main innovations compared to existing literature and practice. Firstly, our criteria for model combination consider the full distributional properties of the ensemble and not just the central estimate - which is of particular importance in the reserving context. Secondly, our framework is tailored for the features inherent to reserving data. These include, for instance, accident, development, calendar, and claim maturity effects. Crucially, the relative importance and scarcity of data across accident periods renders the problem distinct from the traditional ensembling techniques in statistical learning.\n Our framework is illustrated with a complex synthetic dataset. In the results, the optimised ensemble outperforms both (i) traditional model selection strategies, and (ii) an equally weighted ensemble. In particular, the improvement occurs not only with central estimates but also with relevant quantiles, such as the 75th percentile of reserves (typically of interest to both insurers and regulators)."}, "https://arxiv.org/abs/2207.08964": {"title": "Sensitivity analysis for constructing optimal regimes in the presence of treatment non-compliance", "link": "https://arxiv.org/abs/2207.08964", "description": "arXiv:2207.08964v2 Announce Type: replace \nAbstract: The current body of research on developing optimal treatment strategies often places emphasis on intention-to-treat analyses, which fail to take into account the compliance behavior of individuals. Methods based on instrumental variables have been developed to determine optimal treatment strategies in the presence of endogeneity. However, these existing methods are not applicable when there are two active treatment options and the average causal effects of the treatments cannot be identified using a binary instrument. 
In order to address this limitation, we present a procedure that can identify an optimal treatment strategy and the corresponding value function as a function of a vector of sensitivity parameters. Additionally, we derive the canonical gradient of the target parameter and propose a multiply robust classification-based estimator for the optimal treatment strategy. Through simulations, we demonstrate the practical need for and usefulness of our proposed method. We apply our method to a randomized trial on Adaptive Treatment for Alcohol and Cocaine Dependence."}, "https://arxiv.org/abs/2211.07506": {"title": "Type I Tobit Bayesian Additive Regression Trees for Censored Outcome Regression", "link": "https://arxiv.org/abs/2211.07506", "description": "arXiv:2211.07506v4 Announce Type: replace \nAbstract: Censoring occurs when an outcome is unobserved beyond some threshold value. Methods that do not account for censoring produce biased predictions of the unobserved outcome. This paper introduces Type I Tobit Bayesian Additive Regression Tree (TOBART-1) models for censored outcomes. Simulation results and real data applications demonstrate that TOBART-1 produces accurate predictions of censored outcomes. TOBART-1 provides posterior intervals for the conditional expectation and other quantities of interest. The error term distribution can have a large impact on the expectation of the censored outcome. Therefore the error is flexibly modeled as a Dirichlet process mixture of normal distributions."}, "https://arxiv.org/abs/2305.01464": {"title": "Large Global Volatility Matrix Analysis Based on Observation Structural Information", "link": "https://arxiv.org/abs/2305.01464", "description": "arXiv:2305.01464v3 Announce Type: replace \nAbstract: In this paper, we develop a novel large volatility matrix estimation procedure for analyzing global financial markets. Practitioners often use lower-frequency data, such as weekly or monthly returns, to address the issue of different trading hours in the international financial market. However, this approach can lead to inefficiency due to information loss. To mitigate this problem, our proposed method, called Structured Principal Orthogonal complEment Thresholding (Structured-POET), incorporates observation structural information for both global and national factor models. We establish the asymptotic properties of the Structured-POET estimator, and also demonstrate the drawbacks of conventional covariance matrix estimation procedures when using lower-frequency data. Finally, we apply the Structured-POET estimator to an out-of-sample portfolio allocation study using international stock market data."}, "https://arxiv.org/abs/2307.07090": {"title": "Choice Models and Permutation Invariance: Demand Estimation in Differentiated Products Markets", "link": "https://arxiv.org/abs/2307.07090", "description": "arXiv:2307.07090v2 Announce Type: replace \nAbstract: Choice modeling is at the core of understanding how changes to the competitive landscape affect consumer choices and reshape market equilibria. In this paper, we propose a fundamental characterization of choice functions that encompasses a wide variety of extant choice models. We demonstrate how non-parametric estimators like neural nets can easily approximate such functionals and overcome the curse of dimensionality that is inherent in the non-parametric estimation of choice functions. 
We demonstrate through extensive simulations that our proposed functionals can flexibly capture underlying consumer behavior in a completely data-driven fashion and outperform traditional parametric models. As demand settings often exhibit endogenous features, we extend our framework to incorporate estimation under endogenous features. Further, we also describe a formal inference procedure to construct valid confidence intervals on objects of interest like price elasticity. Finally, to assess the practical applicability of our estimator, we utilize a real-world dataset from S. Berry, Levinsohn, and Pakes (1995). Our empirical analysis confirms that the estimator generates realistic and comparable own- and cross-price elasticities that are consistent with the observations reported in the existing literature."}, "https://arxiv.org/abs/2307.15681": {"title": "A Continuous-Time Dynamic Factor Model for Intensive Longitudinal Data Arising from Mobile Health Studies", "link": "https://arxiv.org/abs/2307.15681", "description": "arXiv:2307.15681v3 Announce Type: replace \nAbstract: Intensive longitudinal data (ILD) collected in mobile health (mHealth) studies contain rich information on multiple outcomes measured frequently over time that have the potential to capture short-term and long-term dynamics. Motivated by an mHealth study of smoking cessation in which participants self-report the intensity of many emotions multiple times per day, we describe a dynamic factor model that summarizes the ILD as a low-dimensional, interpretable latent process. This model consists of two submodels: (i) a measurement submodel--a factor model--that summarizes the multivariate longitudinal outcome as lower-dimensional latent variables and (ii) a structural submodel--an Ornstein-Uhlenbeck (OU) stochastic process--that captures the temporal dynamics of the multivariate latent process in continuous time. We derive a closed-form likelihood for the marginal distribution of the outcome and the computationally-simpler sparse precision matrix for the OU process. We propose a block coordinate descent algorithm for estimation. Finally, we apply our method to the mHealth data to summarize the dynamics of 18 different emotions as two latent processes. These latent processes are interpreted by behavioral scientists as the psychological constructs of positive and negative affect and are key in understanding vulnerability to lapsing back to tobacco use among smokers attempting to quit."}, "https://arxiv.org/abs/2312.17015": {"title": "Regularized Exponentially Tilted Empirical Likelihood for Bayesian Inference", "link": "https://arxiv.org/abs/2312.17015", "description": "arXiv:2312.17015v3 Announce Type: replace \nAbstract: Bayesian inference with empirical likelihood faces a challenge as the posterior domain is a proper subset of the original parameter space due to the convex hull constraint. We propose a regularized exponentially tilted empirical likelihood to address this issue. Our method removes the convex hull constraint using a novel regularization technique, incorporating a continuous exponential family distribution to satisfy a Kullback-Leibler divergence criterion. The regularization arises as a limiting procedure where pseudo-data are added to the formulation of exponentially tilted empirical likelihood in a structured fashion. 
We show that this regularized exponentially tilted empirical likelihood retains certain desirable asymptotic properties of (exponentially tilted) empirical likelihood and has improved finite sample performance. Simulation and data analysis demonstrate that the proposed method provides a suitable pseudo-likelihood for Bayesian inference. The implementation of our method is available as the R package retel. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2401.11278": {"title": "Handling incomplete outcomes and covariates in cluster-randomized trials: doubly-robust estimation, efficiency considerations, and sensitivity analysis", "link": "https://arxiv.org/abs/2401.11278", "description": "arXiv:2401.11278v2 Announce Type: replace \nAbstract: In cluster-randomized trials (CRTs), missing data can occur in various ways, including missing values in outcomes and baseline covariates at the individual or cluster level, or completely missing information for non-participants. Among the various types of missing data in CRTs, missing outcomes have attracted the most attention. However, no existing methods can simultaneously address all aforementioned types of missing data in CRTs. To fill in this gap, we propose a new doubly-robust estimator for the average treatment effect on a variety of scales. The proposed estimator simultaneously handles missing outcomes under missingness at random, missing covariates without constraining the missingness mechanism, and missing cluster-population sizes via a uniform sampling mechanism. Furthermore, we detail key considerations to improve precision by specifying the optimal weights, leveraging machine learning, and modeling the treatment assignment mechanism. Finally, to evaluate the impact of violating missing data assumptions, we contribute a new sensitivity analysis framework tailored to CRTs. Simulation studies and a real data application both demonstrate that our proposed methods are effective in handling missing data in CRTs and superior to the existing methods."}, "https://arxiv.org/abs/2211.14221": {"title": "Learning Large Causal Structures from Inverse Covariance Matrix via Sparse Matrix Decomposition", "link": "https://arxiv.org/abs/2211.14221", "description": "arXiv:2211.14221v3 Announce Type: replace-cross \nAbstract: Learning causal structures from observational data is a fundamental problem facing important computational challenges when the number of variables is large. In the context of linear structural equation models (SEMs), this paper focuses on learning causal structures from the inverse covariance matrix. The proposed method, called ICID for Independence-preserving Decomposition from Inverse Covariance matrix, is based on continuous optimization of a matrix decomposition model that preserves the nonzero patterns of the inverse covariance matrix. Through theoretical and empirical evidence, we show that ICID efficiently identifies the sought directed acyclic graph (DAG) assuming the knowledge of noise variances. Moreover, ICID is shown empirically to be robust under bounded misspecification of noise variances in the case where the noise variances are non-equal. 
The proposed method enjoys a low complexity, as reflected by its time efficiency in the experiments, and also enables a novel regularization scheme that yields highly accurate solutions on the Simulated fMRI data (Smith et al., 2011) in comparison with state-of-the-art algorithms."}, "https://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "https://arxiv.org/abs/2310.04563", "description": "arXiv:2310.04563v2 Announce Type: replace-cross \nAbstract: During the COVID-19 pandemic, safely implementing in-person indoor instruction was a high priority for universities nationwide. To support this effort at the University, we developed a mathematical model for estimating the risk of SARS-CoV-2 transmission in university classrooms. This model was used to evaluate combinations of feasible interventions for classrooms at the University during the pandemic and optimize the set of interventions that would allow higher occupancy levels, matching the pre-pandemic numbers of in-person courses. Importantly, we determined that requiring masking in dense classrooms with unrestricted seating with more than 90% of students vaccinated was easy to implement, incurred little logistical or financial cost, and allowed classes to be held at full capacity. A retrospective analysis at the end of the semester confirmed the model's assessment that the proposed classroom configuration would be safe. Our framework is generalizable and was used to support reopening decisions at Stanford University. In addition, our framework is flexible and applies to a wide range of indoor settings. It was repurposed for large university events and gatherings and could be used to support planning indoor space use to avoid transmission of infectious diseases across various industries, from secondary schools to movie theaters and restaurants."}, "https://arxiv.org/abs/2310.18212": {"title": "Robustness of Algorithms for Causal Structure Learning to Hyperparameter Choice", "link": "https://arxiv.org/abs/2310.18212", "description": "arXiv:2310.18212v2 Announce Type: replace-cross \nAbstract: Hyperparameters play a critical role in machine learning. Hyperparameter tuning can make the difference between state-of-the-art and poor prediction performance for any algorithm, but it is particularly challenging for structure learning due to its unsupervised nature. As a result, hyperparameter tuning is often neglected in favour of using the default values provided by a particular implementation of an algorithm. While there have been numerous studies on performance evaluation of causal discovery algorithms, how hyperparameters affect individual algorithms, as well as the choice of the best algorithm for a specific problem, has not been studied in depth before. This work addresses this gap by investigating the influence of hyperparameters on causal structure learning tasks. Specifically, we perform an empirical evaluation of hyperparameter selection for some seminal learning algorithms on datasets of varying levels of complexity. 
We find that, while the choice of algorithm remains crucial to obtaining state-of-the-art performance, hyperparameter selection in ensemble settings strongly influences the choice of algorithm, in that a poor choice of hyperparameters can lead to analysts using algorithms which do not give state-of-the-art performance for their data."}, "https://arxiv.org/abs/2402.10870": {"title": "Best of Three Worlds: Adaptive Experimentation for Digital Marketing in Practice", "link": "https://arxiv.org/abs/2402.10870", "description": "arXiv:2402.10870v2 Announce Type: replace-cross \nAbstract: Adaptive experimental design (AED) methods are increasingly being used in industry as a tool to boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing methods. However, the behavior and guarantees of such methods are not well-understood beyond idealized stationary settings. This paper shares lessons learned regarding the challenges of naively using AED systems in industrial settings where non-stationarity is prevalent, while also providing perspectives on the proper objectives and system specifications in such settings. We developed an AED framework for counterfactual inference based on these experiences, and tested it in a commercial environment."}, "https://arxiv.org/abs/2402.13259": {"title": "Fast Discrete-Event Simulation of Markovian Queueing Networks through Euler Approximation", "link": "https://arxiv.org/abs/2402.13259", "description": "arXiv:2402.13259v1 Announce Type: new \nAbstract: The efficient management of large-scale queueing networks is critical for a variety of sectors, including healthcare, logistics, and customer service, where system performance has profound implications for operational effectiveness and cost management. To address this key challenge, our paper introduces simulation techniques tailored for complex, large-scale Markovian queueing networks. We develop two simulation schemes based on Euler approximation, namely the backward and forward schemes. These schemes can accommodate time-varying dynamics and are optimized for efficient implementation using vectorization. Assuming a feedforward queueing network structure, we establish that the two schemes provide stochastic upper and lower bounds for the system state, while the approximation error remains bounded over the simulation horizon. With the recommended choice of time step, we show that our approximation schemes exhibit diminishing asymptotic relative error as the system scales up, while maintaining much lower computational complexity compared to traditional discrete-event simulation and achieving speedups of up to tens of thousands of times. This study highlights the substantial potential of Euler approximation in simulating large-scale discrete systems."}, "https://arxiv.org/abs/2402.13375": {"title": "A Strategic Model of Software Dependency Networks", "link": "https://arxiv.org/abs/2402.13375", "description": "arXiv:2402.13375v1 Announce Type: new \nAbstract: Modern software development involves collaborative efforts and reuse of existing code, which reduces the cost of developing new software. However, reusing code from existing packages exposes coders to vulnerabilities in these dependencies. We study the formation of dependency networks among software packages and libraries, guided by a structural model of network formation with observable and unobservable heterogeneity. 
We estimate costs, benefits, and link externalities of the network of 696,790 directed dependencies between 35,473 repositories of the Rust programming language using a novel scalable algorithm. We find evidence of a positive externality exerted on other coders when coders create dependencies. Furthermore, we show that coders are likely to link to more popular packages of the same software type but less popular packages of other types. We adopt models for the spread of infectious diseases to measure a package's systemicness as the number of downstream packages a vulnerability would affect. Systemicness is highly skewed with the most systemic repository affecting almost 90% of all repositories only two steps away. Lastly, we show that protecting only the ten most important repositories reduces vulnerability contagion by nearly 40%."}, "https://arxiv.org/abs/2402.13384": {"title": "Allowing Growing Dimensional Binary Outcomes via the Multivariate Probit Indian Buffet Process", "link": "https://arxiv.org/abs/2402.13384", "description": "arXiv:2402.13384v1 Announce Type: new \nAbstract: There is a rich literature on infinite latent feature models, with much of the focus being on the Indian Buffet Process (IBP). The current literature focuses on the case in which latent features are generated independently, so that there is no within-sample dependence in which features are selected. Motivated by ecology applications in which latent features correspond to which species are discovered in a sample, we propose a new class of dependent infinite latent feature models. Our construction starts with a probit Indian buffet process, and then incorporates dependence among features through a correlation matrix in a latent multivariate Gaussian random variable. We show that the proposed approach preserves many appealing theoretical properties of the IBP. To address the challenge of modeling and inference for a growing-dimensional correlation matrix, we propose Bayesian modeling approaches to reduce effective dimensionality. We additionally describe extensions to incorporate covariates and hierarchical structure to enable borrowing of information. For posterior inference, we describe Markov chain Monte Carlo sampling algorithms and efficient approximations. Simulation studies and applications to fungal biodiversity data provide compelling support for the new modeling class relative to competitors."}, "https://arxiv.org/abs/2402.13391": {"title": "De-Biasing the Bias: Methods for Improving Disparity Assessments with Noisy Group Measurements", "link": "https://arxiv.org/abs/2402.13391", "description": "arXiv:2402.13391v1 Announce Type: new \nAbstract: Health care decisions are increasingly informed by clinical decision support algorithms, but these algorithms may perpetuate or increase racial and ethnic disparities in access to and quality of health care. Further complicating the problem, clinical data often have missing or poor quality racial and ethnic information, which can lead to misleading assessments of algorithmic bias. We present novel statistical methods that allow for the use of probabilities of racial/ethnic group membership in assessments of algorithm performance and quantify the statistical bias that results from error in these imputed group probabilities. We propose a sensitivity analysis approach to estimating the statistical bias that allows practitioners to assess disparities in algorithm performance under a range of assumed levels of group probability error. 
We also prove theoretical bounds on the statistical bias for a set of commonly used fairness metrics and describe real-world scenarios where our theoretical results are likely to apply. We present a case study using imputed race and ethnicity from the Bayesian Improved Surname Geocoding (BISG) algorithm for estimation of disparities in a clinical decision support algorithm used to inform osteoporosis treatment. Our novel methods allow policy makers to understand the range of potential disparities under a given algorithm even when race and ethnicity information is missing and to make informed decisions regarding the implementation of machine learning for clinical decision support."}, "https://arxiv.org/abs/2402.13472": {"title": "Generalized linear models with spatial dependence and a functional covariate", "link": "https://arxiv.org/abs/2402.13472", "description": "arXiv:2402.13472v1 Announce Type: new \nAbstract: We extend generalized functional linear models under independence to a situation in which a functional covariate is related to a scalar response variable that exhibits spatial dependence. For estimation, we apply basis expansion and truncation for dimension reduction of the covariate process followed by a composite likelihood estimating equation to handle the spatial dependency. We develop asymptotic results for the proposed model under a repeating lattice asymptotic context, allowing us to construct a confidence interval for the spatial dependence parameter and a confidence band for the parameter function. A binary conditionals model is presented as a concrete illustration and is used in simulation studies to verify the applicability of the asymptotic inferential results."}, "https://arxiv.org/abs/2402.13719": {"title": "Informative Simultaneous Confidence Intervals for Graphical Test Procedures", "link": "https://arxiv.org/abs/2402.13719", "description": "arXiv:2402.13719v1 Announce Type: new \nAbstract: Simultaneous confidence intervals (SCIs) that are compatible with a given closed test procedure are often non-informative. More precisely, for a one-sided null hypothesis, the bound of the SCI can stick to the border of the null hypothesis, irrespective of how far the point estimate deviates from the null hypothesis. This has been illustrated for the Bonferroni-Holm and fall-back procedures, for which alternative SCIs have been suggested, that are free of this deficiency. These informative SCIs are not fully compatible with the initial multiple test, but are close to it and hence provide similar power advantages. They provide a multiple hypothesis test with strong family-wise error rate control that can be used in replacement of the initial multiple test. The current paper extends previous work for informative SCIs to graphical test procedures. The information gained from the newly suggested SCIs is shown to be always increasing with increasing evidence against a null hypothesis. The new SCIs provide a compromise between information gain and the goal to reject as many hypotheses as possible. The SCIs are defined via a family of dual graphs and the projection method. A simple iterative algorithm for the computation of the intervals is provided. 
A simulation study illustrates the results for a complex graphical test procedure."}, "https://arxiv.org/abs/2402.13890": {"title": "A unified Bayesian framework for interval hypothesis testing in clinical trials", "link": "https://arxiv.org/abs/2402.13890", "description": "arXiv:2402.13890v1 Announce Type: new \nAbstract: The American Statistical Association (ASA) statement on statistical significance and P-values (Wasserstein and Lazar, 2016) cautioned statisticians against making scientific decisions solely on the basis of traditional P-values. The statement delineated key issues with P-values, including a lack of transparency, an inability to quantify evidence in support of the null hypothesis, and an inability to measure the size of an effect or the importance of a result. In this article, we demonstrate that the interval null hypothesis framework (instead of the point null hypothesis framework), when used in tandem with Bayes factor-based tests, is instrumental in circumnavigating the key issues of P-values. Further, we note that specifying prior densities for Bayes factors is challenging and has been a reason for criticism of Bayesian hypothesis testing in existing literature. We address this by adapting Bayes factors directly based on common test statistics. We demonstrate, through numerical experiments and real data examples, that the proposed Bayesian interval hypothesis testing procedures can be calibrated to ensure frequentist error control while retaining their inherent interpretability. Finally, we illustrate the improved flexibility and applicability of the proposed methods by providing coherent frameworks for competitive landscape analysis and end-to-end Bayesian hypothesis tests in the context of reporting clinical trial outcomes."}, "https://arxiv.org/abs/2402.13907": {"title": "Quadratic inference with dense functional responses", "link": "https://arxiv.org/abs/2402.13907", "description": "arXiv:2402.13907v1 Announce Type: new \nAbstract: We address the challenge of estimation in the context of constant linear effect models with dense functional responses. In this framework, the conditional expectation of the response curve is represented by a linear combination of functional covariates with constant regression parameters. In this paper, we present an alternative solution by employing the quadratic inference approach, a well-established method for analyzing correlated data, to estimate the regression coefficients. Our approach leverages non-parametrically estimated basis functions, eliminating the need for choosing working correlation structures. Furthermore, we demonstrate that our method achieves a parametric $\\sqrt{n}$-convergence rate, contingent on an appropriate choice of bandwidth. This convergence is observed when the number of repeated measurements per trajectory exceeds a certain threshold, specifically, when it surpasses $n^{a_{0}}$, with $n$ representing the number of trajectories. Additionally, we establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies, where our proposed method outperforms them. 
Real data analysis is also conducted to demonstrate the proposed method."}, "https://arxiv.org/abs/2402.13933": {"title": "Powerful Large-scale Inference in High Dimensional Mediation Analysis", "link": "https://arxiv.org/abs/2402.13933", "description": "arXiv:2402.13933v1 Announce Type: new \nAbstract: In genome-wide epigenetic studies, exposures (e.g., Single Nucleotide Polymorphisms) affect outcomes (e.g., gene expression) through intermediate variables such as DNA methylation. Mediation analysis offers a way to study these intermediate variables and identify the presence or absence of causal mediation effects. Testing for mediation effects leads to a composite null hypothesis. Existing methods like Sobel's test or the Max-P test are often underpowered because 1) statistical inference is often conducted based on distributions determined under a subset of the null and 2) they are not designed to shoulder the multiple testing burden. To tackle these issues, we introduce a technique called MLFDR (Mediation Analysis using Local False Discovery Rates) for high dimensional mediation analysis, which uses the local False Discovery Rates based on the coefficients of the structural equation model specifying the mediation relationship to construct a rejection region. We have shown theoretically as well as through simulation studies that in the high-dimensional setting, the new method of identifying the mediating variables controls the FDR asymptotically and performs better with respect to power than several existing methods such as DACT (Liu et al.) and JS-mixture (Dai et al.)."}, "https://arxiv.org/abs/2402.13332": {"title": "Double machine learning for causal hybrid modeling -- applications in the Earth sciences", "link": "https://arxiv.org/abs/2402.13332", "description": "arXiv:2402.13332v1 Announce Type: cross \nAbstract: Hybrid modeling integrates machine learning with scientific knowledge with the goal of enhancing interpretability, generalization, and adherence to natural laws. Nevertheless, equifinality and regularization biases pose challenges in hybrid modeling to achieve these purposes. This paper introduces a novel approach to estimating hybrid models via a causal inference framework, specifically employing Double Machine Learning (DML) to estimate causal effects. We showcase its use for the Earth sciences on two problems related to carbon dioxide fluxes. In the $Q_{10}$ model, we demonstrate that DML-based hybrid modeling is superior in estimating causal parameters over end-to-end deep neural network (DNN) approaches, proving efficiency, robustness to bias from regularization methods, and circumventing equifinality. Our approach, applied to carbon flux partitioning, exhibits flexibility in accommodating heterogeneous causal effects. The study emphasizes the necessity of explicitly defining causal graphs and relationships, advocating for this as a general best practice. We encourage the continued exploration of causality in hybrid models for more interpretable and trustworthy results in knowledge-guided machine learning."}, "https://arxiv.org/abs/2402.13604": {"title": "Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE", "link": "https://arxiv.org/abs/2402.13604", "description": "arXiv:2402.13604v1 Announce Type: cross \nAbstract: This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. 
The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We finetune a preexisting language model (CANINE) to do this automatically thereby performing in seconds and minutes what previously took days and weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. Our approach is shown to have accuracy, recall and precision above 90 percent. Our tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history and various related disciplines."}, "https://arxiv.org/abs/2402.13678": {"title": "Weak Poincar\\'e inequality comparisons for ideal and hybrid slice sampling", "link": "https://arxiv.org/abs/2402.13678", "description": "arXiv:2402.13678v1 Announce Type: cross \nAbstract: Using the framework of weak Poincar{\\'e} inequalities, we provide a general comparison between the Hybrid and Ideal Slice Sampling Markov chains in terms of their Dirichlet forms. In particular, under suitable assumptions Hybrid Slice Sampling will inherit fast convergence from Ideal Slice Sampling and conversely. We apply our results to analyse the convergence of the Independent Metropolis--Hastings, Slice Sampling with Stepping-Out and Shrinkage, and Hit-and-Run-within-Slice Sampling algorithms."}, "https://arxiv.org/abs/1508.01167": {"title": "The Divergence Index: A Decomposable Measure of Segregation and Inequality", "link": "https://arxiv.org/abs/1508.01167", "description": "arXiv:1508.01167v3 Announce Type: replace \nAbstract: Decomposition analysis is a critical tool for understanding the social and spatial dimensions of segregation and diversity. In this paper, I highlight the conceptual, mathematical, and empirical distinctions between segregation and diversity and introduce the Divergence Index as a decomposable measure of segregation. Scholars have turned to the Information Theory Index as the best alternative to the Dissimilarity Index in decomposition studies, however it measures diversity rather than segregation. I demonstrate the importance of preserving this conceptual distinction with a decomposition analysis of segregation and diversity in U.S. metropolitan areas from 1990 to 2010, which shows that the Information Theory Index has tended to decrease, particularly within cities, while the Divergence Index has tended to increase, particularly within suburbs. Rather than being a substitute for measures of diversity, the Divergence Index complements existing measures by enabling the analysis and decomposition of segregation alongside diversity."}, "https://arxiv.org/abs/2007.12448": {"title": "A (tight) upper bound for the length of confidence intervals with conditional coverage", "link": "https://arxiv.org/abs/2007.12448", "description": "arXiv:2007.12448v3 Announce Type: replace \nAbstract: We show that two popular selective inference procedures, namely data carving (Fithian et al., 2017) and selection with a randomized response (Tian et al., 2018b), when combined with the polyhedral method (Lee et al., 2016), result in confidence intervals whose length is bounded. This contrasts results for confidence intervals based on the polyhedral method alone, whose expected length is typically infinite (Kivaranovic and Leeb, 2020). 
Moreover, we show that these two procedures always dominate corresponding sample-splitting methods in terms of interval length."}, "https://arxiv.org/abs/2101.05135": {"title": "A Latent Variable Model for Relational Events with Multiple Receivers", "link": "https://arxiv.org/abs/2101.05135", "description": "arXiv:2101.05135v3 Announce Type: replace \nAbstract: Directional relational event data, such as email data, often contain unicast messages (i.e., messages of one sender towards one receiver) and multicast messages (i.e., messages of one sender towards multiple receivers). The Enron email data that is the focus in this paper consists of 31% multicast messages. Multicast messages contain important information about the roles of actors in the network, which is needed for better understanding social interaction dynamics. In this paper a multiplicative latent factor model is proposed to analyze such relational data. For a given message, all potential receiver actors are placed on a suitability scale, and the actors are included in the receiver set whose suitability score exceeds a threshold value. Unobserved heterogeneity in the social interaction behavior is captured using a multiplicative latent factor structure with latent variables for actors (which differ for actors as senders and receivers) and latent variables for individual messages. A Bayesian computational algorithm, which relies on Gibbs sampling, is proposed for model fitting. Model assessment is done using posterior predictive checks. Based on our analyses of the Enron email data, a mc-amen model with a 2 dimensional latent variable can accurately capture the empirical distribution of the cardinality of the receiver set and the composition of the receiver sets for commonly observed messages. Moreover the results show that actors have a comparable (but not identical) role as a sender and as a receiver in the network."}, "https://arxiv.org/abs/2102.10154": {"title": "Truncated, Censored, and Actuarial Payment-type Moments for Robust Fitting of a Single-parameter Pareto Distribution", "link": "https://arxiv.org/abs/2102.10154", "description": "arXiv:2102.10154v3 Announce Type: replace \nAbstract: With some regularity conditions maximum likelihood estimators (MLEs) always produce asymptotically optimal (in the sense of consistency, efficiency, sufficiency, and unbiasedness) estimators. But in general, the MLEs lead to non-robust statistical inference, for example, pricing models and risk measures. Actuarial claim severity is continuous, right-skewed, and frequently heavy-tailed. The data sets that such models are usually fitted to contain outliers that are difficult to identify and separate from genuine data. Moreover, due to commonly used actuarial \"loss control strategies\" in financial and insurance industries, the random variables we observe and wish to model are affected by truncation (due to deductibles), censoring (due to policy limits), scaling (due to coinsurance proportions) and other transformations. To alleviate the lack of robustness of MLE-based inference in risk modeling, here in this paper, we propose and develop a new method of estimation - method of truncated moments (MTuM) and generalize it for different scenarios of loss control mechanism. Various asymptotic properties of those estimates are established by using central limit theory. New connections between different estimators are found. A comparative study of newly-designed methods with the corresponding MLEs is performed. 
A detailed investigation, including a simulation study, has been carried out for a single-parameter Pareto loss model."}, "https://arxiv.org/abs/2103.02089": {"title": "Robust Estimation of Loss Models for Lognormal Insurance Payment Severity Data", "link": "https://arxiv.org/abs/2103.02089", "description": "arXiv:2103.02089v4 Announce Type: replace \nAbstract: The primary objective of this scholarly work is to develop two estimation procedures - maximum likelihood estimator (MLE) and method of trimmed moments (MTM) - for the mean and variance of lognormal insurance payment severity data sets affected by different loss control mechanisms, for example, truncation (due to deductibles), censoring (due to policy limits), and scaling (due to coinsurance proportions), in insurance and financial industries. Maximum likelihood estimating equations for both payment-per-payment and payment-per-loss data sets are derived which can be solved readily by any existing iterative numerical methods. The asymptotic distributions of those estimators are established via Fisher information matrices. Further, with a goal of balancing efficiency and robustness and to remove point masses at certain data points, we develop a dynamic MTM estimation procedure for lognormal claim severity models for the above-mentioned transformed data scenarios. The asymptotic distributional properties and the comparison with the corresponding MLEs of those MTM estimators are established along with extensive simulation studies. Purely for illustrative purposes, numerical examples for 1500 US indemnity losses are provided which illustrate the practical performance of the established results in this paper."}, "https://arxiv.org/abs/2202.13000": {"title": "Robust Estimation of Loss Models for Truncated and Censored Severity Data", "link": "https://arxiv.org/abs/2202.13000", "description": "arXiv:2202.13000v2 Announce Type: replace \nAbstract: In this paper, we consider robust estimation of claim severity models in insurance, when data are affected by truncation (due to deductibles), censoring (due to policy limits), and scaling (due to coinsurance). In particular, robust estimators based on the methods of trimmed moments (T-estimators) and winsorized moments (W-estimators) are pursued and fully developed. The general definitions of such estimators are formulated and their asymptotic properties are investigated. For illustrative purposes, specific formulas for T- and W-estimators of the tail parameter of a single-parameter Pareto distribution are derived. The practical performance of these estimators is then explored using the well-known Norwegian fire claims data. Our results demonstrate that T- and W-estimators offer a robust and computationally efficient alternative to the likelihood-based inference for models that are affected by deductibles, policy limits, and coinsurance."}, "https://arxiv.org/abs/2204.02477": {"title": "Method of Winsorized Moments for Robust Fitting of Truncated and Censored Lognormal Distributions", "link": "https://arxiv.org/abs/2204.02477", "description": "arXiv:2204.02477v2 Announce Type: replace \nAbstract: When constructing parametric models to predict the cost of future claims, several important details have to be taken into account: (i) models should be designed to accommodate deductibles, policy limits, and coinsurance factors, (ii) parameters should be estimated robustly to control the influence of outliers on model predictions, and (iii) all point predictions should be augmented with estimates of their uncertainty. 
The methodology proposed in this paper provides a framework for addressing all these aspects simultaneously. Using payment-per-payment and payment-per-loss variables, we construct the adaptive version of method of winsorized moments (MWM) estimators for the parameters of truncated and censored lognormal distribution. Further, the asymptotic distributional properties of this approach are derived and compared with those of the maximum likelihood estimator (MLE) and method of trimmed moments (MTM) estimators, the latter being a primary competitor to MWM. Moreover, the theoretical results are validated with extensive simulation studies and risk measure sensitivity analysis. Finally, the practical performance of these methods is illustrated using the well-studied data set of 1500 U.S. indemnity losses. With this real data set, it is also demonstrated that the composite models do not provide much improvement in the quality of predictive models compared to a stand-alone fitted distribution, especially for truncated and censored sample data."}, "https://arxiv.org/abs/2211.16298": {"title": "Double Robust Bayesian Inference on Average Treatment Effects", "link": "https://arxiv.org/abs/2211.16298", "description": "arXiv:2211.16298v4 Announce Type: replace \nAbstract: We propose a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. Our robust Bayesian approach involves two important modifications: first, we adjust the prior distributions of the conditional mean function; second, we correct the posterior distribution of the resulting ATE. Both adjustments make use of pilot estimators motivated by the semiparametric influence function for ATE estimation. We prove asymptotic equivalence of our Bayesian procedure and efficient frequentist ATE estimators by establishing a new semiparametric Bernstein-von Mises theorem under double robustness; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, our double robust Bayesian procedure leads to significant bias reduction of point estimation over conventional Bayesian methods and more accurate coverage of confidence intervals compared to existing frequentist methods. We illustrate our method in an application to the National Supported Work Demonstration."}, "https://arxiv.org/abs/2302.12627": {"title": "Cox reduction and confidence sets of models: a theoretical elucidation", "link": "https://arxiv.org/abs/2302.12627", "description": "arXiv:2302.12627v2 Announce Type: replace \nAbstract: For sparse high-dimensional regression problems, Cox and Battey [1, 9] emphasised the need for confidence sets of models: an enumeration of those small sets of variables that fit the data equivalently well in a suitable statistical sense. This is to be contrasted with the single model returned by penalised regression procedures, effective for prediction but potentially misleading for subject-matter understanding. The proposed construction of such sets relied on preliminary reduction of the full set of variables, and while various possibilities could be considered for this, [9] proposed a succession of regression fits based on incomplete block designs. The purpose of the present paper is to provide insight on both aspects of that work. 
For an unspecified reduction strategy, we begin by characterising models that are likely to be retained in the model confidence set, emphasising geometric aspects. We then evaluate possible reduction schemes based on penalised regression or marginal screening, before theoretically elucidating the reduction of [9]. We identify features of the covariate matrix that may reduce its efficacy, and indicate improvements to the original proposal. An advantage of the approach is its ability to reveal its own stability or fragility for the data at hand."}, "https://arxiv.org/abs/2303.17642": {"title": "Change Point Detection on A Separable Model for Dynamic Networks", "link": "https://arxiv.org/abs/2303.17642", "description": "arXiv:2303.17642v3 Announce Type: replace \nAbstract: This paper studies change point detection in time series of networks, with the Separable Temporal Exponential-family Random Graph Model (STERGM). Dynamic network patterns can be inherently complex due to dyadic and temporal dependence. Detection of the change points can identify the discrepancies in the underlying data generating processes and facilitate downstream analysis. The STERGM that utilizes network statistics to represent the structural patterns is a flexible model to fit dynamic networks. We propose a new estimator derived from the Alternating Direction Method of Multipliers (ADMM) and Group Fused Lasso to simultaneously detect multiple time points, where the parameters of a time-heterogeneous STERGM have changed. We also provide a Bayesian information criterion for model selection and an R package CPDstergm to implement the proposed method. Experiments on simulated and real data show good performance of the proposed framework."}, "https://arxiv.org/abs/2306.15088": {"title": "Locally tail-scale invariant scoring rules for evaluation of extreme value forecasts", "link": "https://arxiv.org/abs/2306.15088", "description": "arXiv:2306.15088v3 Announce Type: replace \nAbstract: Statistical analysis of extremes can be used to predict the probability of future extreme events, such as large rainfalls or devastating windstorms. The quality of these forecasts can be measured through scoring rules. Locally scale invariant scoring rules give equal importance to the forecasts at different locations regardless of differences in the prediction uncertainty. This is a useful feature when computing average scores but can be an unnecessarily strict requirement when mostly concerned with extremes. We propose the concept of local weight-scale invariance, describing scoring rules fulfilling local scale invariance in a certain region of interest, and as a special case local tail-scale invariance, for large events. Moreover, a new version of the weighted Continuous Ranked Probability score (wCRPS) called the scaled wCRPS (swCRPS) that possesses this property is developed and studied. The score is a suitable alternative for scoring extreme value models over areas with varying scale of extreme events, and we derive explicit formulas of the score for the Generalised Extreme Value distribution. 
The scoring rules are compared through simulation, and their usage is illustrated in modelling of extreme water levels, annual maximum rainfalls, and in an application to non-extreme forecast for the prediction of air pollution."}, "https://arxiv.org/abs/2309.06429": {"title": "Efficient Inference on High-Dimensional Linear Models with Missing Outcomes", "link": "https://arxiv.org/abs/2309.06429", "description": "arXiv:2309.06429v2 Announce Type: replace \nAbstract: This paper is concerned with inference on the regression function of a high-dimensional linear model when outcomes are missing at random. We propose an estimator which combines a Lasso pilot estimate of the regression function with a bias correction term based on the weighted residuals of the Lasso regression. The weights depend on estimates of the missingness probabilities (propensity scores) and solve a convex optimization program that trades off bias and variance optimally. Provided that the propensity scores can be pointwise consistently estimated at in-sample data points, our proposed estimator for the regression function is asymptotically normal and semi-parametrically efficient among all asymptotically linear estimators. Furthermore, the proposed estimator keeps its asymptotic properties even if the propensity scores are estimated by modern machine learning techniques. We validate the finite-sample performance of the proposed estimator through comparative simulation studies and the real-world problem of inferring the stellar masses of galaxies in the Sloan Digital Sky Survey."}, "https://arxiv.org/abs/2401.02939": {"title": "Penalized Distributed Lag Interaction Model: Air Pollution, Birth Weight and Neighborhood Vulnerability", "link": "https://arxiv.org/abs/2401.02939", "description": "arXiv:2401.02939v2 Announce Type: replace \nAbstract: Maternal exposure to air pollution during pregnancy has a substantial public health impact. Epidemiological evidence supports an association between maternal exposure to air pollution and low birth weight. A popular method to estimate this association while identifying windows of susceptibility is a distributed lag model (DLM), which regresses an outcome onto exposure history observed at multiple time points. However, the standard DLM framework does not allow for modification of the association between repeated measures of exposure and the outcome. We propose a distributed lag interaction model that allows modification of the exposure-time-response associations across individuals by including an interaction between a continuous modifying variable and the exposure history. Our model framework is an extension of a standard DLM that uses a cross-basis, or bi-dimensional function space, to simultaneously describe both the modification of the exposure-response relationship and the temporal structure of the exposure data. Through simulations, we showed that our model with penalization out-performs a standard DLM when the true exposure-time-response associations vary by a continuous variable. 
Using a Colorado, USA birth cohort, we estimated the association between birth weight and ambient fine particulate matter air pollution modified by an area-level metric of health and social adversities from Colorado EnviroScreen."}, "https://arxiv.org/abs/2205.12112": {"title": "Stereographic Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2205.12112", "description": "arXiv:2205.12112v2 Announce Type: replace-cross \nAbstract: High-dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed ``stickiness'' and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high-dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of the Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality'' that the convergence is faster in higher dimensions."}, "https://arxiv.org/abs/2306.09507": {"title": "Winsorized Robust Credibility Models", "link": "https://arxiv.org/abs/2306.09507", "description": "arXiv:2306.09507v3 Announce Type: replace-cross \nAbstract: The B\\\"uhlmann model, a branch of classical credibility theory, has been successively applied to the premium estimation for group insurance contracts and other insurance specifications. In this paper, we develop a robust B\\\"uhlmann credibility via the censored version of loss data, or the censored mean (a robust alternative to traditional individual mean). This framework yields explicit formulas of structural parameters in credibility estimation for both scale-shape distribution families, location-scale distribution families, and their variants, which are commonly used to model insurance risks. The asymptotic properties of the proposed method are provided and corroborated through simulations, and their performance is compared to that of credibility based on the trimmed mean. By varying the censoring/trimming threshold level in several parametric models, we find all structural parameters via censoring are less volatile compared to the corresponding quantities via trimming, and using censored mean as a robust risk measure will reduce the influence of parametric loss assumptions on credibility estimation. Besides, the non-parametric estimations in credibility are discussed using the theory of $L-$estimators. And a numerical illustration from Wisconsin Local Government Property Insurance Fund indicates that the proposed robust credibility can prevent the effect caused by model mis-specification and capture the risk behavior of loss data in a broader viewpoint."}, "https://arxiv.org/abs/2401.12126": {"title": "Biological species delimitation based on genetic and spatial dissimilarity: a comparative study", "link": "https://arxiv.org/abs/2401.12126", "description": "arXiv:2401.12126v2 Announce Type: replace-cross \nAbstract: The delimitation of biological species, i.e., deciding which individuals belong to the same species and whether and how many different species are represented in a data set, is key to the conservation of biodiversity. 
Much existing work uses only genetic data for species delimitation, often employing some kind of cluster analysis. This can be misleading, because geographically distant groups of individuals can be genetically quite different even if they belong to the same species. This paper investigates the problem of testing whether two potentially separated groups of individuals can belong to a single species or not based on genetic and spatial data. Various approaches are compared (some of which already exist in the literature) based on simulated metapopulations generated with SLiM and GSpace - two software packages that can simulate spatially-explicit genetic data at an individual level. Approaches involve partial Mantel testing, maximum likelihood mixed-effects models with a population effect, and jackknife-based homogeneity tests. A key challenge is that most tests perform on genetic and geographical distance data, violating standard independence assumptions. Simulations showed that partial Mantel tests and mixed-effects models have larger power than jackknife-based methods, but tend to display type-I-error rates slightly above the significance level. Moreover, a multiple regression model neglecting the dependence in the dissimilarities did not show inflated type-I-error rate. An application on brassy ringlets concludes the paper."}, "https://arxiv.org/abs/2401.15567": {"title": "Positive Semidefinite Supermartingales and Randomized Matrix Concentration Inequalities", "link": "https://arxiv.org/abs/2401.15567", "description": "arXiv:2401.15567v2 Announce Type: replace-cross \nAbstract: We present new concentration inequalities for either martingale dependent or exchangeable random symmetric matrices under a variety of tail conditions, encompassing now-standard Chernoff bounds to self-normalized heavy-tailed settings. These inequalities are often randomized in a way that renders them strictly tighter than existing deterministic results in the literature, are typically expressed in the Loewner order, and are sometimes valid at arbitrary data-dependent stopping times. Along the way, we explore the theory of positive semidefinite supermartingales and maximal inequalities, a natural matrix analog of scalar nonnegative supermartingales that is potentially of independent interest."}, "https://arxiv.org/abs/2402.14133": {"title": "Maximum likelihood estimation for aggregate current status data: Simulation study using the illness-death model for chronic diseases with duration dependency", "link": "https://arxiv.org/abs/2402.14133", "description": "arXiv:2402.14133v1 Announce Type: new \nAbstract: We use the illness-death model (IDM) for chronic conditions to derive a new analytical relation between the transition rates between the states of the IDM. The transition rates are the incidence rate (i) and the mortality rates of people without disease (m0) and with disease (m1). For the most generic case, the rates depend on age, calendar time and in case of m1 also on the duration of the disease. In this work, we show that the prevalence-odds can be expressed as a convolution-like product of the incidence rate and an exponentiated linear combination of i, m0 and m1. The analytical expression can be used as the basis for a maximum likelihood estimation (MLE) and associated large sample asymptotics. In a simulation study where a cross-sectional trial about a chronic condition is mimicked, we estimate the duration dependency of the mortality rate m1 based on aggregated current status data using the ML estimator. 
For this, the number of study participants and the number of diseased people in eleven age groups are considered. The ML estimator provides reasonable estimates for the parameters, including their large-sample confidence bounds."}, "https://arxiv.org/abs/2402.14206": {"title": "The impact of Facebook-Cambridge Analytica data scandal on the USA tech stock market: An event study based on clustering method", "link": "https://arxiv.org/abs/2402.14206", "description": "arXiv:2402.14206v1 Announce Type: new \nAbstract: This study delves into the intra-industry effects following a firm-specific scandal, with a particular focus on the Facebook data leakage scandal and its associated events within the U.S. tech industry and two additional relevant groups. We employ various metrics including daily spread, volatility, volume-weighted return, and CAPM-beta for the pre-analysis clustering, and subsequently utilize CAR (Cumulative Abnormal Return) to evaluate the impact on firms grouped within these clusters. From a broader industry viewpoint, significant positive CAARs are observed across U.S. sample firms over the three days post-scandal announcement, indicating no adverse impact on the tech sector overall. Conversely, after Facebook's initial quarterly earnings report, it showed a notable negative effect despite reported positive performance. The clustering principle should aid in identifying directly related companies and thus reducing the influence of randomness. This was indeed achieved for the effect of the key event, namely \"The Effect of Congressional Hearing on Certain Clusters across U.S. Tech Stock Market,\" which was identified as delayed and significantly negative. Therefore, we recommend applying the clustering method when conducting such or similar event studies."}, "https://arxiv.org/abs/2402.14260": {"title": "Linear Discriminant Regularized Regression", "link": "https://arxiv.org/abs/2402.14260", "description": "arXiv:2402.14260v1 Announce Type: new \nAbstract: Linear Discriminant Analysis (LDA) is an important classification approach. Its simple linear form makes it easy to interpret and it is capable of handling multi-class responses. It is closely related to other classical multivariate statistical techniques, such as Fisher's discriminant analysis, canonical correlation analysis and linear regression. In this paper we strengthen its connection to multivariate response regression by characterizing the explicit relationship between the discriminant directions and the regression coefficient matrix. This key characterization leads to a new regression-based multi-class classification procedure that is flexible enough to deploy any existing structured, regularized, and even non-parametric, regression methods. Moreover, our new formulation is generically easy to analyze compared to existing regression-based LDA procedures. In particular, we provide complete theoretical guarantees for using the widely used $\\ell_1$-regularization that has not yet been fully analyzed in the LDA context. Our theoretical findings are corroborated by extensive simulation studies and real data analysis."}, "https://arxiv.org/abs/2402.14282": {"title": "Extention of Bagging MARS with Group LASSO for Heterogeneous Treatment Effect Estimation", "link": "https://arxiv.org/abs/2402.14282", "description": "arXiv:2402.14282v1 Announce Type: new \nAbstract: In recent years, large-scale clinical data like patient surveys and medical record data have been playing an increasing role in medical data science. 
These large-scale clinical data are collectively referred to as \"real-world data\" (RWD). RWD are expected to be widely used in large-scale observational studies of specific diseases, in personalized or precision medicine, and in identifying responders to drugs or treatments. Applying RWD to estimate heterogeneous treatment effects (HTE) has already become a trending topic. HTE has the potential to considerably impact the development of precision medicine by helping doctors make more informed and precise treatment decisions and provide more personalized medical care. The statistical models used to estimate HTE are called treatment effect models. Powers et al. proposed several treatment effect models for observational studies and pointed out that the bagging causal MARS (BCM) performs outstandingly well compared to other models. While BCM has excellent performance, it still has room for improvement. In this paper, we propose a new treatment effect model, the shrinkage causal bagging MARS method, which improves their shared-basis conditional mean regression framework in the following ways: first, we estimate basis functions using the transformed outcome and then apply the group LASSO method to optimize the model and estimate parameters. In addition, we focus on pursuing better model interpretability to improve ethical acceptance. We design simulations to verify the performance of the proposed method, which is superior in mean squared error and bias in most simulation settings. We also apply it to the real data set ACTG 175 to verify its usability, and our results are supported by previous studies."}, "https://arxiv.org/abs/2402.14322": {"title": "Estimation of Spectral Risk Measure for Left Truncated and Right Censored Data", "link": "https://arxiv.org/abs/2402.14322", "description": "arXiv:2402.14322v1 Announce Type: new \nAbstract: Left truncated and right censored data are encountered frequently in insurance loss data due to deductibles and policy limits. Risk estimation is an important task in insurance as it is a necessary step for determining premiums under various policy terms. Spectral risk measures are inherently coherent and have the benefit of connecting the risk measure to the user's risk aversion. In this paper we study the estimation of spectral risk measures based on left truncated and right censored data. We propose a non-parametric estimator of the spectral risk measure using the product limit estimator and establish the asymptotic normality of our proposed estimator. We also develop an Edgeworth expansion of our proposed estimator. The bootstrap is employed to approximate the distribution of our proposed estimator and is shown to be second order ``accurate''. Monte Carlo studies are conducted to compare the proposed spectral risk measure estimator with the existing parametric and non-parametric estimators for left truncated and right censored data. Based on our simulation study, we estimate the exponential spectral risk measure for three data sets, viz., the Norwegian fire claims data set, Spanish automobile insurance claims, and French marine losses."}, "https://arxiv.org/abs/2402.14377": {"title": "Ratio of two independent xgamma random variables and some distributional properties", "link": "https://arxiv.org/abs/2402.14377", "description": "arXiv:2402.14377v1 Announce Type: new \nAbstract: In this investigation, the distribution of the ratio of two independently distributed xgamma (Sen et al. 2016) random variables X and Y, with different parameters, is proposed and studied. 
Related distributional properties, such as moments and entropy measures, are investigated. We also show a unique characterization of the proposed distribution based on truncated incomplete moments."}, "https://arxiv.org/abs/2402.14438": {"title": "Efficiency-improved doubly robust estimation with non-confounding predictive covariates", "link": "https://arxiv.org/abs/2402.14438", "description": "arXiv:2402.14438v1 Announce Type: new \nAbstract: In observational studies, covariates with substantial missing data are often omitted, despite their strong predictive capabilities. These excluded covariates are generally believed not to simultaneously affect both treatment and outcome, indicating that they are not genuine confounders and do not impact the identification of the average treatment effect (ATE). In this paper, we introduce an alternative doubly robust (DR) estimator that fully leverages non-confounding predictive covariates to enhance efficiency, while also allowing missing values in such covariates. Beyond the double robustness property, our proposed estimator is designed to be more efficient than the standard DR estimator. Specifically, when the propensity score model is correctly specified, it achieves the smallest asymptotic variance among the class of DR estimators, and brings additional efficiency gains by further integrating predictive covariates. Simulation studies demonstrate the notable performance of the proposed estimator over current popular methods. An illustrative example is provided to assess the effectiveness of right heart catheterization (RHC) for critically ill patients."}, "https://arxiv.org/abs/2402.14506": {"title": "Enhancing Rolling Horizon Production Planning Through Stochastic Optimization Evaluated by Means of Simulation", "link": "https://arxiv.org/abs/2402.14506", "description": "arXiv:2402.14506v1 Announce Type: new \nAbstract: Production planning must account for uncertainty in a production system, arising from fluctuating demand forecasts. Therefore, this article focuses on the integration of updated customer demand into the rolling horizon planning cycle. We use scenario-based stochastic programming to solve capacitated lot sizing problems under stochastic demand in a rolling horizon environment. This environment is replicated using a discrete event simulation-optimization framework, where the optimization problem is periodically solved, leveraging the latest demand information to continually adjust the production plan. We evaluate the stochastic optimization approach and compare its performance to solving a deterministic lot sizing model, using expected demand figures as input, as well as to standard Material Requirements Planning (MRP). In the simulation study, we analyze three different customer behaviors related to forecasting, along with four levels of shop load, within a multi-item and multi-stage production system. We test a range of significant parameter values for the three planning methods and compute the overall costs to benchmark them. The results show that the production plans obtained by MRP are outperformed by deterministic and stochastic optimization. 
Particularly, when facing tight resource restrictions and rising uncertainty in customer demand, the use of stochastic optimization becomes preferable compared to deterministic optimization."}, "https://arxiv.org/abs/2402.14562": {"title": "Recoverability of Causal Effects in a Longitudinal Study under Presence of Missing Data", "link": "https://arxiv.org/abs/2402.14562", "description": "arXiv:2402.14562v1 Announce Type: new \nAbstract: Missing data in multiple variables is a common issue. We investigate the applicability of the framework of graphical models for handling missing data to a complex longitudinal pharmacological study of HIV-positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial. Specifically, we examine whether the causal effects of interest, defined through static interventions on multiple continuous variables, can be recovered (estimated consistently) from the available data only. So far, no general algorithms are available to decide on recoverability, and decisions have to be made on a case-by-case basis. We emphasize sensitivity of recoverability to even the smallest changes in the graph structure, and present recoverability results for three plausible missingness directed acyclic graphs (m-DAGs) in the CHAPAS-3 study, informed by clinical knowledge. Furthermore, we propose the concept of ''closed missingness mechanisms'' and show that under these mechanisms an available case analysis is admissible for consistent estimation for any type of statistical and causal query, even if the underlying missingness mechanism is of missing not at random (MNAR) type. Both simulations and theoretical considerations demonstrate how, in the assumed MNAR setting of our study, a complete or available case analysis can be superior to multiple imputation, and estimation results vary depending on the assumed missingness DAG. Our analyses are possibly the first to show the applicability of missingness DAGs (m-DAGs) to complex longitudinal real-world data, while highlighting the sensitivity with respect to the assumed causal model."}, "https://arxiv.org/abs/2402.14574": {"title": "A discussion of the paper \"Safe testing\" by Gr\\\"unwald, de Heide, and Koolen", "link": "https://arxiv.org/abs/2402.14574", "description": "arXiv:2402.14574v1 Announce Type: new \nAbstract: This is a discussion of the paper \"Safe testing\" by Gr\\\"unwald, de Heide, and Koolen, Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, 24 January, 2024"}, "https://arxiv.org/abs/2402.14763": {"title": "Functional Spatial Autoregressive Models", "link": "https://arxiv.org/abs/2402.14763", "description": "arXiv:2402.14763v1 Announce Type: new \nAbstract: This study introduces a novel spatial autoregressive model in which the dependent variable is a function that may exhibit functional autocorrelation with the outcome functions of nearby units. This model can be characterized as a simultaneous integral equation system, which, in general, does not necessarily have a unique solution. For this issue, we provide a simple condition on the magnitude of the spatial interaction to ensure the uniqueness in data realization. For estimation, to account for the endogeneity caused by the spatial interaction, we propose a regularized two-stage least squares estimator based on a basis approximation for the functional parameter. 
The asymptotic properties of the estimator including the consistency and asymptotic normality are investigated under certain conditions. Additionally, we propose a simple Wald-type test for detecting the presence of spatial effects. As an empirical illustration, we apply the proposed model and method to analyze age distributions in Japanese cities."}, "https://arxiv.org/abs/2402.14775": {"title": "Localised Natural Causal Learning Algorithms for Weak Consistency Conditions", "link": "https://arxiv.org/abs/2402.14775", "description": "arXiv:2402.14775v1 Announce Type: new \nAbstract: By relaxing conditions for natural structure learning algorithms, a family of constraint-based algorithms containing all exact structure learning algorithms under the faithfulness assumption, we define localised natural structure learning algorithms (LoNS). We also provide a set of necessary and sufficient assumptions for consistency of LoNS, which can be thought of as a strict relaxation of the restricted faithfulness assumption. We provide a practical LoNS algorithm that runs in exponential time, which is then compared with related existing structure learning algorithms, namely PC/SGS and the relatively recent Sparsest Permutation algorithm. Simulation studies are also provided."}, "https://arxiv.org/abs/2402.14145": {"title": "Multiply Robust Estimation for Local Distribution Shifts with Multiple Domains", "link": "https://arxiv.org/abs/2402.14145", "description": "arXiv:2402.14145v1 Announce Type: cross \nAbstract: Distribution shifts are ubiquitous in real-world machine learning applications, posing a challenge to the generalization of models trained on one data distribution to another. We focus on scenarios where data distributions vary across multiple segments of the entire population and only make local assumptions about the differences between training and test (deployment) distributions within each segment. We propose a two-stage multiply robust estimation method to improve model performance on each individual segment for tabular data analysis. The method involves fitting a linear combination of the based models, learned using clusters of training data from multiple segments, followed by a refinement step for each segment. Our method is designed to be implemented with commonly used off-the-shelf machine learning models. We establish theoretical guarantees on the generalization bound of the method on the test risk. With extensive experiments on synthetic and real datasets, we demonstrate that the proposed method substantially improves over existing alternatives in prediction accuracy and robustness on both regression and classification tasks. We also assess its effectiveness on a user city prediction dataset from a large technology company."}, "https://arxiv.org/abs/2402.14220": {"title": "Estimating Unknown Population Sizes Using the Hypergeometric Distribution", "link": "https://arxiv.org/abs/2402.14220", "description": "arXiv:2402.14220v1 Announce Type: cross \nAbstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. 
We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data."}, "https://arxiv.org/abs/2402.14264": {"title": "Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation", "link": "https://arxiv.org/abs/2402.14264", "description": "arXiv:2402.14264v1 Announce Type: cross \nAbstract: Average treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, recently also incorporating generic machine learning estimators, the statistical optimality of these methods has still remained an open area of investigation. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that attain small errors; which is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as a black-box sub-process. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATTE), as well as weighted variants of the former, which arise in policy evaluation."}, "https://arxiv.org/abs/2402.14390": {"title": "Composite likelihood inference for the Poisson log-normal model", "link": "https://arxiv.org/abs/2402.14390", "description": "arXiv:2402.14390v1 Announce Type: cross \nAbstract: Inferring parameters of a latent variable model can be a daunting task when the conditional distribution of the latent variables given the observed ones is intractable. Variational approaches prove to be computationally efficient but, possibly, lack theoretical guarantees on the estimates, while sampling based solutions are quite the opposite. Starting from already available variational approximations, we define a first Monte Carlo EM algorithm to obtain maximum likelihood estimators, focusing on the Poisson log-normal model which provides a generic framework for the analysis of multivariate count data. We then extend this algorithm to the case of a composite likelihood in order to be able to handle higher dimensional count data."}, "https://arxiv.org/abs/2402.14481": {"title": "Towards Automated Causal Discovery: a case study on 5G telecommunication data", "link": "https://arxiv.org/abs/2402.14481", "description": "arXiv:2402.14481v1 Announce Type: cross \nAbstract: We introduce the concept of Automated Causal Discovery (AutoCD), defined as any system that aims to fully automate the application of causal discovery and causal reasoning methods. 
AutoCD's goal is to deliver all causal information that an expert human analyst would and answer a user's causal queries. We describe the architecture of such a platform, and illustrate its performance on synthetic data sets. As a case study, we apply it on temporal telecommunication data. The system is general and can be applied to a plethora of causal discovery problems."}, "https://arxiv.org/abs/2402.14538": {"title": "Interference Produces False-Positive Pricing Experiments", "link": "https://arxiv.org/abs/2402.14538", "description": "arXiv:2402.14538v1 Announce Type: cross \nAbstract: It is standard practice in online retail to run pricing experiments by randomizing at the article-level, i.e. by changing prices of different products to identify treatment effects. Due to customers' cross-price substitution behavior, such experiments suffer from interference bias: the observed difference between treatment groups in the experiment is typically significantly larger than the global effect that could be expected after a roll-out decision of the tested pricing policy. We show in simulations that such bias can be as large as 100%, and report experimental data implying bias of similar magnitude. Finally, we discuss approaches for de-biased pricing experiments, suggesting observational methods as a potentially attractive alternative to clustering."}, "https://arxiv.org/abs/2402.14764": {"title": "A Combinatorial Central Limit Theorem for Stratified Randomization", "link": "https://arxiv.org/abs/2402.14764", "description": "arXiv:2402.14764v1 Announce Type: cross \nAbstract: This paper establishes a combinatorial central limit theorem for stratified randomization that holds under Lindeberg-type conditions and allows for a growing number of large and small strata. The result is then applied to derive the asymptotic distributions of two test statistics proposed in a finite population setting with randomly assigned instruments and a super population instrumental variables model, both having many strata."}, "https://arxiv.org/abs/2402.14781": {"title": "Rao-Blackwellising Bayesian Causal Inference", "link": "https://arxiv.org/abs/2402.14781", "description": "arXiv:2402.14781v1 Announce Type: cross \nAbstract: Bayesian causal inference, i.e., inferring a posterior over causal models for the use in downstream causal reasoning tasks, poses a hard computational inference problem that is little explored in literature. In this work, we combine techniques from order-based MCMC structure learning with recent advances in gradient-based graph learning into an effective Bayesian causal inference framework. Specifically, we decompose the problem of inferring the causal structure into (i) inferring a topological order over variables and (ii) inferring the parent sets for each variable. When limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time. We further use Gaussian processes to model the unknown causal mechanisms, which also allows their exact marginalisation. This introduces a Rao-Blackwellization scheme, where all components are eliminated from the model, except for the causal order, for which we learn a distribution via gradient-based optimisation. 
The combination of Rao-Blackwellization with our sequential inference procedure for causal orders yields state-of-the-art on linear and non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures."}, "https://arxiv.org/abs/2203.03020": {"title": "Optimal regimes for algorithm-assisted human decision-making", "link": "https://arxiv.org/abs/2203.03020", "description": "arXiv:2203.03020v3 Announce Type: replace \nAbstract: We consider optimal regimes for algorithm-assisted human decision-making. Such regimes are decision functions of measured pre-treatment variables and, by leveraging natural treatment values, enjoy a \"superoptimality\" property whereby they are guaranteed to outperform conventional optimal regimes. When there is unmeasured confounding, the benefit of using superoptimal regimes can be considerable. When there is no unmeasured confounding, superoptimal regimes are identical to conventional optimal regimes. Furthermore, identification of the expected outcome under superoptimal regimes in non-experimental studies requires the same assumptions as identification of value functions under conventional optimal regimes when the treatment is binary. To illustrate the utility of superoptimal regimes, we derive new identification and estimation results in a common instrumental variable setting. We use these derivations to analyze examples from the optimal regimes literature, including a case study of the effect of prompt intensive care treatment on survival."}, "https://arxiv.org/abs/2211.02609": {"title": "How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests", "link": "https://arxiv.org/abs/2211.02609", "description": "arXiv:2211.02609v2 Announce Type: replace \nAbstract: There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard `point-form null' significance tests consider only within-experiment but ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This `distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because this approach addresses between-experiment variation, it gives mathematically coherent estimates for the probability of replication of significant results. Using a large-scale replication dataset (the first `Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account in this approach. Further, grouping experiments in this dataset into `predictor-target' pairs we show that the predicted replication probabilities for target experiments produced in this approach (given predictor experiment results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. 
Distributional null hypothesis testing thus gives researchers a statistical tool for identifying statistically significant and reliably replicable results."}, "https://arxiv.org/abs/2303.09680": {"title": "Bootstrap based asymptotic refinements for high-dimensional nonlinear models", "link": "https://arxiv.org/abs/2303.09680", "description": "arXiv:2303.09680v2 Announce Type: replace \nAbstract: We consider penalized extremum estimation of a high-dimensional, possibly nonlinear model that is sparse in the sense that most of its parameters are zero but some are not. We use the SCAD penalty function, which provides model selection consistent and oracle efficient estimates under suitable conditions. However, asymptotic approximations based on the oracle model can be inaccurate with the sample sizes found in many applications. This paper gives conditions under which the bootstrap, based on estimates obtained through SCAD penalization with thresholding, provides asymptotic refinements of size \\(O \\left( n^{- 2} \\right)\\) for the error in the rejection (coverage) probability of a symmetric hypothesis test (confidence interval) and \\(O \\left( n^{- 1} \\right)\\) for the error in the rejection (coverage) probability of a one-sided or equal tailed test (confidence interval). The results of Monte Carlo experiments show that the bootstrap can provide large reductions in errors in rejection and coverage probabilities. The bootstrap is consistent, though it does not necessarily provide asymptotic refinements, even if some parameters are close but not equal to zero. Random-coefficients logit and probit models and nonlinear moment models are examples of models to which the procedure applies."}, "https://arxiv.org/abs/2303.13960": {"title": "Demystifying estimands in cluster-randomised trials", "link": "https://arxiv.org/abs/2303.13960", "description": "arXiv:2303.13960v3 Announce Type: replace \nAbstract: Estimands can help clarify the interpretation of treatment effects and ensure that estimators are aligned to the study's objectives. Cluster randomised trials require additional attributes to be defined within the estimand compared to individually randomised trials, including whether treatment effects are marginal or cluster specific, and whether they are participant or cluster average. In this paper, we provide formal definitions of estimands encompassing both these attributes using potential outcomes notation and describe differences between them. We then provide an overview of estimators for each estimand, describe their assumptions, and show consistency (i.e. asymptotically unbiased estimation) for a series of analyses based on cluster level summaries. Then, through a reanalysis of a published cluster randomised trial, we demonstrate that the choice of both estimand and estimator can affect interpretation. For instance, the estimated odds ratio ranged from 1.38 (p=0.17) to 1.83 (p=0.03) depending on the target estimand, and for some estimands, the choice of estimator affected the conclusions by leading to smaller treatment effect estimates. 
We conclude that careful specification of the estimand, along with an appropriate choice of estimator, are essential to ensuring that cluster randomised trials address the right question."}, "https://arxiv.org/abs/2306.08946": {"title": "Bootstrap aggregation and confidence measures to improve time series causal discovery", "link": "https://arxiv.org/abs/2306.08946", "description": "arXiv:2306.08946v2 Announce Type: replace \nAbstract: Learning causal graphs from multivariate time series is a ubiquitous challenge in all application domains dealing with time-dependent systems, such as in Earth sciences, biology, or engineering, to name a few. Recent developments for this causal discovery learning task have shown considerable skill, notably the specific time-series adaptations of the popular conditional independence-based learning framework. However, uncertainty estimation is challenging for conditional independence-based methods. Here, we introduce a novel bootstrap approach designed for time series causal discovery that preserves the temporal dependencies and lag structure. It can be combined with a range of time series causal discovery methods and provides a measure of confidence for the links of the time series graphs. Furthermore, next to confidence estimation, an aggregation, also called bagging, of the bootstrapped graphs by majority voting results in bagged causal discovery methods. In this work, we combine this approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves in precision and recall as compared to its base algorithm PCMCI+, at the cost of higher computational demands. These statistical performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications."}, "https://arxiv.org/abs/2307.14973": {"title": "Insufficient Gibbs Sampling", "link": "https://arxiv.org/abs/2307.14973", "description": "arXiv:2307.14973v2 Announce Type: replace \nAbstract: In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns; only aggregated, robust and inefficient statistics derived from the data are made accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. We consider a parametric framework and propose a method to sample from the posterior distribution of parameters conditioned on various robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or a collection of quantiles. Our approach leverages a Gibbs sampler and simulates latent augmented data, which facilitates simulation from the posterior distribution of parameters belonging to specific families of distributions. A by-product of these samples from the joint posterior distribution of parameters and data given the observed statistics is that we can estimate Bayes factors based on observed statistics via bridge sampling. 
We validate and outline the limitations of the proposed methods through toy examples and an application to real-world income data."}, "https://arxiv.org/abs/2308.08644": {"title": "Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons", "link": "https://arxiv.org/abs/2308.08644", "description": "arXiv:2308.08644v2 Announce Type: replace \nAbstract: Many applications, e.g. in content recommendation, sports, or recruitment, leverage the comparisons of alternatives to score those alternatives. The classical Bradley-Terry model and its variants have been widely used to do so. The historical model considers binary comparisons (victory or defeat) between alternatives, while more recent developments allow finer comparisons to be taken into account. In this article, we introduce a probabilistic model encompassing a broad variety of paired comparisons that can take discrete or continuous values. We do so by considering a well-behaved subset of the exponential family, which we call the family of generalized Bradley-Terry (GBT) models, as it includes the classical Bradley-Terry model and many of its variants. Remarkably, we prove that all GBT models are guaranteed to yield a strictly convex negative log-likelihood. Moreover, assuming a Gaussian prior on alternatives' scores, we prove that the maximum a posteriori (MAP) of GBT models, whose existence, uniqueness and fast computation are thus guaranteed, varies monotonically with respect to comparisons (the more A beats B, the better the score of A) and is Lipschitz-resilient with respect to each new comparison (a single new comparison can only have a bounded effect on all the estimated scores). These desirable properties make GBT models appealing for practical use. We illustrate some features of GBT models on simulations."}, "https://arxiv.org/abs/2309.05092": {"title": "Adaptive conformal classification with noisy labels", "link": "https://arxiv.org/abs/2309.05092", "description": "arXiv:2309.05092v2 Announce Type: replace \nAbstract: This paper develops novel conformal prediction methods for classification tasks that can automatically adapt to random label contamination in the calibration sample, leading to more informative prediction sets with stronger coverage guarantees compared to state-of-the-art approaches. This is made possible by a precise characterization of the effective coverage inflation (or deflation) suffered by standard conformal inferences in the presence of label contamination, which is then made actionable through new calibration algorithms. Our solution is flexible and can leverage different modeling assumptions about the label contamination process, while requiring no knowledge of the underlying data distribution or of the inner workings of the machine-learning classifier. The advantages of the proposed methods are demonstrated through extensive simulations and an application to object classification with the CIFAR-10H image data set."}, "https://arxiv.org/abs/2310.14763": {"title": "Externally Valid Policy Evaluation Combining Trial and Observational Data", "link": "https://arxiv.org/abs/2310.14763", "description": "arXiv:2310.14763v2 Announce Type: replace \nAbstract: Randomized trials are widely considered as the gold standard for evaluating the effects of decision policies. Trial data is, however, drawn from a population which may differ from the intended target population and this raises a problem of external validity (aka. generalizability). 
In this paper we seek to use trial data to draw valid inferences about the outcome of a policy on the target population. Additional covariate data from the target population is used to model the sampling of individuals in the trial study. We develop a method that yields certifiably valid trial-based policy evaluations under any specified range of model miscalibrations. The method is nonparametric and the validity is assured even with finite samples. The certified policy evaluations are illustrated using both simulated and real data."}, "https://arxiv.org/abs/2401.11422": {"title": "Local Identification in Instrumental Variable Multivariate Quantile Regression Models", "link": "https://arxiv.org/abs/2401.11422", "description": "arXiv:2401.11422v2 Announce Type: replace \nAbstract: The instrumental variable (IV) quantile regression model introduced by Chernozhukov and Hansen (2005) is a useful tool for analyzing quantile treatment effects in the presence of endogeneity, but when outcome variables are multidimensional, it is silent on the joint distribution of different dimensions of each variable. To overcome this limitation, we propose an IV model built on the optimal-transport-based multivariate quantile that takes into account the correlation between the entries of the outcome variable. We then provide a local identification result for the model. Surprisingly, we find that the support size of the IV required for the identification is independent of the dimension of the outcome vector, as long as the IV is sufficiently informative. Our result follows from a general identification theorem that we establish, which has independent theoretical significance."}, "https://arxiv.org/abs/2401.12911": {"title": "Pretraining and the Lasso", "link": "https://arxiv.org/abs/2401.12911", "description": "arXiv:2401.12911v2 Announce Type: replace \nAbstract: Pretraining is a popular and powerful paradigm in machine learning. As an example, suppose one has a modest-sized dataset of images of cats and dogs, and plans to fit a deep neural network to classify them from the pixel features. With pretraining, we start with a neural network trained on a large corpus of images, consisting of not just cats and dogs but hundreds of other image types. Then we fix all of the network weights except for the top layer (which makes the final classification) and train (or \"fine tune\") those weights on our dataset. This often results in dramatically better performance than the network trained solely on our smaller dataset.\n In this paper, we ask the question \"Can pretraining help the lasso?\". We develop a framework for the lasso in which an overall model is fit to a large set of data, and then fine-tuned to a specific task on a smaller dataset. This latter dataset can be a subset of the original dataset, but does not need to be. We find that this framework has a wide variety of applications, including stratified models, multinomial targets, multi-response models, conditional average treatment estimation and even gradient boosting.\n In the stratified model setting, the pretrained lasso pipeline estimates the coefficients common to all groups at the first stage, and then group specific coefficients at the second \"fine-tuning\" stage. We show that under appropriate assumptions, the support recovery rate of the common coefficients is superior to that of the usual lasso trained only on individual groups. 
This separate identification of common and individual coefficients can also be useful for scientific understanding."}, "https://arxiv.org/abs/2310.19253": {"title": "Flow-based Distributionally Robust Optimization", "link": "https://arxiv.org/abs/2310.19253", "description": "arXiv:2310.19253v3 Announce Type: replace-cross \nAbstract: We present a computationally efficient framework, called $\\texttt{FlowDRO}$, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD) and sample from it. The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution and develop a Wasserstein proximal gradient flow type algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on high-dimensional real data."}, "https://arxiv.org/abs/2402.14942": {"title": "On Identification of Dynamic Treatment Regimes with Proxies of Hidden Confounders", "link": "https://arxiv.org/abs/2402.14942", "description": "arXiv:2402.14942v1 Announce Type: new \nAbstract: We consider identification of optimal dynamic treatment regimes in a setting where time-varying treatments are confounded by hidden time-varying confounders, but proxy variables of the unmeasured confounders are available. We show that, with two independent proxy variables at each time point that are sufficiently relevant for the hidden confounders, identification of the joint distribution of counterfactuals is possible, thereby facilitating identification of an optimal treatment regime."}, "https://arxiv.org/abs/2402.15004": {"title": "Repro Samples Method for a Performance Guaranteed Inference in General and Irregular Inference Problems", "link": "https://arxiv.org/abs/2402.15004", "description": "arXiv:2402.15004v1 Announce Type: new \nAbstract: Rapid advancements in data science require us to have fundamentally new frameworks to tackle prevalent but highly non-trivial \"irregular\" inference problems, to which the large sample central limit theorem does not apply. Typical examples are those involving discrete or non-numerical parameters and those involving non-numerical data, etc. In this article, we present an innovative, wide-reaching, and effective approach, called \"repro samples method,\" to conduct statistical inference for these irregular problems plus more. The development relates to but improves several existing simulation-inspired inference approaches, and we provide both exact and approximate theories to support our development. 
Moreover, the proposed approach is broadly applicable and subsumes the classical Neyman-Pearson framework as a special case. For the often-seen irregular inference problems that involve both discrete/non-numerical and continuous parameters, we propose an effective three-step procedure to make inferences for all parameters. We also develop a unique matching scheme that turns the discreteness of discrete/non-numerical parameters from an obstacle for forming inferential theories into a beneficial attribute for improving computational efficiency. We demonstrate the effectiveness of the proposed general methodology using various examples, including a case study example on a Gaussian mixture model with unknown number of components. This case study example provides a solution to a long-standing open inference question in statistics on how to quantify the estimation uncertainty for the unknown number of components and other associated parameters. Real data and simulation studies, with comparisons to existing approaches, demonstrate the far superior performance of the proposed method."}, "https://arxiv.org/abs/2402.15030": {"title": "Adjusting for Ascertainment Bias in Meta-Analysis of Penetrance for Cancer Risk", "link": "https://arxiv.org/abs/2402.15030", "description": "arXiv:2402.15030v1 Announce Type: new \nAbstract: Multi-gene panel testing allows efficient detection of pathogenic variants in cancer susceptibility genes including moderate-risk genes such as ATM and PALB2. A growing number of studies examine the risk of breast cancer (BC) conferred by pathogenic variants of such genes. A meta-analysis combining the reported risk estimates can provide an overall age-specific risk of developing BC, i.e., penetrance for a gene. However, estimates reported by case-control studies often suffer from ascertainment bias. Currently there are no methods available to adjust for such ascertainment bias in this setting. We consider a Bayesian random-effects meta-analysis method that can synthesize different types of risk measures and extend it to incorporate studies with ascertainment bias. This is achieved by introducing a bias term in the model and assigning appropriate priors. We validate the method through a simulation study and apply it to estimate BC penetrance for carriers of pathogenic variants of ATM and PALB2 genes. Our simulations show that the proposed method results in more accurate and precise penetrance estimates compared to when no adjustment is made for ascertainment bias or when such biased studies are discarded from the analysis. The estimated overall BC risk for individuals with pathogenic variants in (1) ATM is 5.77% (3.22%-9.67%) by age 50 and 26.13% (20.31%-32.94%) by age 80; (2) PALB2 is 12.99% (6.48%-22.23%) by age 50 and 44.69% (34.40%-55.80%) by age 80. The proposed method allows for meta-analyses to include studies with ascertainment bias resulting in a larger number of studies included and thereby more robust estimates."}, "https://arxiv.org/abs/2402.15060": {"title": "A uniformly ergodic Gibbs sampler for Bayesian survival analysis", "link": "https://arxiv.org/abs/2402.15060", "description": "arXiv:2402.15060v1 Announce Type: new \nAbstract: Finite sample inference for Cox models is an important problem in many settings, such as clinical trials. Bayesian procedures provide a means for finite sample inference and incorporation of prior information if MCMC algorithms and posteriors are well behaved. 
On the other hand, estimation procedures should also retain inferential properties in high dimensional settings. In addition, estimation procedures should be able to incorporate constraints and multilevel modeling such as cure models and frailty models in a straightforward manner. In order to tackle these modeling challenges, we propose a uniformly ergodic Gibbs sampler for a broad class of convex set constrained multilevel Cox models. We develop two key strategies. First, we exploit a connection between Cox models and negative binomial processes through the Poisson process to reduce Bayesian computation to iterative Gaussian sampling. Next, we appeal to sufficient dimension reduction to address the difficult computation of nonparametric baseline hazards, allowing for the collapse of the Markov transition operator within the Gibbs sampler based on sufficient statistics. We demonstrate our approach using open source data and simulations."}, "https://arxiv.org/abs/2402.15071": {"title": "High-Dimensional Covariate-Augmented Overdispersed Poisson Factor Model", "link": "https://arxiv.org/abs/2402.15071", "description": "arXiv:2402.15071v1 Announce Type: new \nAbstract: The current Poisson factor models often assume that the factors are unknown, which overlooks the explanatory potential of certain observable covariates. This study focuses on high dimensional settings, where the number of the count response variables and/or covariates can diverge as the sample size increases. A covariate-augmented overdispersed Poisson factor model is proposed to jointly perform a high-dimensional Poisson factor analysis and estimate a large coefficient matrix for overdispersed count data. A group of identifiability conditions are provided to theoretically guarantee computational identifiability. We incorporate the interdependence of both response variables and covariates by imposing a low-rank constraint on the large coefficient matrix. To address the computation challenges posed by nonlinearity, two high-dimensional latent matrices, and the low-rank constraint, we propose a novel variational estimation scheme that combines Laplace and Taylor approximations. We also develop a criterion based on a singular value ratio to determine the number of factors and the rank of the coefficient matrix. Comprehensive simulation studies demonstrate that the proposed method outperforms the state-of-the-art methods in estimation accuracy and computational efficiency. The practical merit of our method is demonstrated by an application to the CITE-seq dataset. A flexible implementation of our proposed method is available in the R package \\emph{COAP}."}, "https://arxiv.org/abs/2402.15086": {"title": "A modified debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization", "link": "https://arxiv.org/abs/2402.15086", "description": "arXiv:2402.15086v1 Announce Type: new \nAbstract: Mendelian randomization uses genetic variants as instrumental variables to make causal inferences about the effects of modifiable risk factors on diseases from observational data. One of the major challenges in Mendelian randomization is that many genetic variants are only modestly or even weakly associated with the risk factor of interest, a setting known as many weak instruments. Many existing methods, such as the popular inverse-variance weighted (IVW) method, could be biased when the instrument strength is weak. 
To address this issue, the debiased IVW (dIVW) estimator, which is shown to be robust to many weak instruments, was recently proposed. However, this estimator still has non-ignorable bias when the effective sample size is small. In this paper, we propose a modified debiased IVW (mdIVW) estimator by multiplying the original dIVW estimator by a modification factor. After this simple correction, we show that the bias of the mdIVW estimator converges to zero at a faster rate than that of the dIVW estimator under some regularity conditions. Moreover, the mdIVW estimator has smaller variance than the dIVW estimator. We further extend the proposed method to account for the presence of instrumental variable selection and balanced horizontal pleiotropy. We demonstrate the improvement of the mdIVW estimator over the dIVW estimator through extensive simulation studies and real data analysis."}, "https://arxiv.org/abs/2402.15137": {"title": "Benchmarking Observational Studies with Experimental Data under Right-Censoring", "link": "https://arxiv.org/abs/2402.15137", "description": "arXiv:2402.15137v1 Announce Type: new \nAbstract: Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women's Health Initiative study."}, "https://arxiv.org/abs/2402.15292": {"title": "adjustedCurves: Estimating Confounder-Adjusted Survival Curves in R", "link": "https://arxiv.org/abs/2402.15292", "description": "arXiv:2402.15292v1 Announce Type: new \nAbstract: Kaplan-Meier curves stratified by treatment allocation are the most popular way to depict causal effects in studies with right-censored time-to-event endpoints. If the treatment is randomly assigned and the sample size of the study is adequate, this method produces unbiased estimates of the population-averaged counterfactual survival curves. However, in the presence of confounding, this is no longer the case. Instead, specific methods that allow adjustment for confounding must be used. We present the adjustedCurves R package, which can be used to estimate and plot these confounder-adjusted survival curves using a variety of methods from the literature. It provides a convenient wrapper around existing R packages on the topic and adds additional methods and functionality on top of it, uniting the sometimes vastly different methods under one consistent framework. Among the additional features are the estimation of confidence intervals, confounder-adjusted restricted mean survival times and confounder-adjusted survival time quantiles. 
After giving a brief overview of the implemented methods, we illustrate the package using publicly available data from an observational study including 2982 breast cancer patients."}, "https://arxiv.org/abs/2402.15357": {"title": "Rapid Bayesian identification of sparse nonlinear dynamics from scarce and noisy data", "link": "https://arxiv.org/abs/2402.15357", "description": "arXiv:2402.15357v1 Announce Type: new \nAbstract: We propose a fast probabilistic framework for identifying differential equations governing the dynamics of observed data. We recast the SINDy method within a Bayesian framework and use Gaussian approximations for the prior and likelihood to speed up computation. The resulting method, Bayesian-SINDy, not only quantifies uncertainty in the parameters estimated but also is more robust when learning the correct model from limited and noisy data. Using both synthetic and real-life examples such as Lynx-Hare population dynamics, we demonstrate the effectiveness of the new framework in learning correct model equations and compare its computational and data efficiency with existing methods. Because Bayesian-SINDy can quickly assimilate data and is robust against noise, it is particularly suitable for biological data and real-time system identification in control. Its probabilistic framework also enables the calculation of information entropy, laying the foundation for an active learning strategy."}, "https://arxiv.org/abs/2402.15489": {"title": "On inference for modularity statistics in structured networks", "link": "https://arxiv.org/abs/2402.15489", "description": "arXiv:2402.15489v1 Announce Type: new \nAbstract: This paper revisits the classical concept of network modularity and its spectral relaxations used throughout graph data analysis. We formulate and study several modularity statistic variants for which we establish asymptotic distributional results in the large-network limit for networks exhibiting nodal community structure. Our work facilitates testing for network differences and can be used in conjunction with existing theoretical guarantees for stochastic blockmodel random graphs. Our results are enabled by recent advances in the study of low-rank truncations of large network adjacency matrices. We provide confirmatory simulation studies and real data analysis pertaining to the network neuroscience study of psychosis, specifically schizophrenia. Collectively, this paper contributes to the limited existing literature to date on statistical inference for modularity-based network analysis. Supplemental materials for this article are available online."}, "https://arxiv.org/abs/2402.14966": {"title": "Smoothness Adaptive Hypothesis Transfer Learning", "link": "https://arxiv.org/abs/2402.14966", "description": "arXiv:2402.14966v1 Announce Type: cross \nAbstract: Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression (KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. 
Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results."}, "https://arxiv.org/abs/2402.14979": {"title": "Optimizing Language Models for Human Preferences is a Causal Inference Problem", "link": "https://arxiv.org/abs/2402.14979", "description": "arXiv:2402.14979v1 Announce Type: cross \nAbstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions."}, "https://arxiv.org/abs/2402.15053": {"title": "Nonlinear Bayesian optimal experimental design using logarithmic Sobolev inequalities", "link": "https://arxiv.org/abs/2402.15053", "description": "arXiv:2402.15053v1 Announce Type: cross \nAbstract: We study the problem of selecting $k$ experiments from a larger candidate pool, where the goal is to maximize mutual information (MI) between the selected subset and the underlying parameters. Finding the exact solution to this combinatorial optimization problem is computationally costly, not only due to the complexity of the combinatorial search but also the difficulty of evaluating MI in nonlinear/non-Gaussian settings. We propose greedy approaches based on new computationally inexpensive lower bounds for MI, constructed via log-Sobolev inequalities. We demonstrate that our method outperforms random selection strategies, Gaussian approximations, and nested Monte Carlo (NMC) estimators of MI in various settings, including optimal design for nonlinear models with non-additive noise."}, "https://arxiv.org/abs/2402.15301": {"title": "Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models", "link": "https://arxiv.org/abs/2402.15301", "description": "arXiv:2402.15301v1 Announce Type: cross \nAbstract: Causal graph recovery is essential in the field of causal inference. 
Traditional methods are typically knowledge-based or statistical estimation-based, which are limited by data collection biases and individuals' knowledge about factors affecting the relations between variables of interest. The advance of large language models (LLMs) provides opportunities to address these problems. We propose a novel method that utilizes the extensive knowledge contained within a large corpus of scientific literature to deduce causal relationships in general causal graph recovery tasks. This method leverages Retrieval-Augmented Generation (RAG) based LLMs to systematically analyze and extract pertinent information from a comprehensive collection of research papers. Our method first retrieves relevant text chunks from the aggregated literature. Then, the LLM is tasked with identifying and labelling potential associations between factors. Finally, we give a method to aggregate the associational relationships to build a causal graph. We demonstrate our method is able to construct high-quality causal graphs on the well-known SACHS dataset solely from literature."}, "https://arxiv.org/abs/2402.15460": {"title": "Potential outcome simulation for efficient head-to-head comparison of adaptive dose-finding designs", "link": "https://arxiv.org/abs/2402.15460", "description": "arXiv:2402.15460v1 Announce Type: cross \nAbstract: Dose-finding trials are a key component of the drug development process and rely on a statistical design to help inform dosing decisions. Triallists wishing to choose a design require knowledge of operating characteristics of competing methods. This is often assessed using a large-scale simulation study with multiple designs and configurations investigated, which can be time-consuming and therefore limits the scope of the simulation.\n We introduce a new approach to the design of simulation studies of dose-finding trials. The approach simulates all potential outcomes that individuals could experience at each dose level in the trial. Datasets are simulated in advance and then the same datasets are applied to each of the competing methods to enable a more efficient head-to-head comparison.\n In two case studies we show sizeable reductions in Monte Carlo error for comparing a performance metric between two competing designs. Efficiency gains depend on the similarity of the designs. Comparing two Phase I/II design variants, with high correlation of recommending the same optimal biologic dose, we show that the new approach requires a simulation study that is approximately 30 times smaller than the conventional approach. Furthermore, advance-simulated trial datasets can be reused to assess the performance of designs across multiple configurations.\n We recommend researchers consider this more efficient simulation approach in their dose-finding studies and we have updated the R package escalation to help facilitate implementation."}, "https://arxiv.org/abs/2010.02848": {"title": "Unified Robust Estimation", "link": "https://arxiv.org/abs/2010.02848", "description": "arXiv:2010.02848v5 Announce Type: replace \nAbstract: Robust estimation is primarily concerned with providing reliable parameter estimates in the presence of outliers. Numerous robust loss functions have been proposed in regression and classification, along with various computing algorithms. In modern penalised generalised linear models (GLM), however, there is limited research on robust estimation that can provide weights to determine the outlier status of the observations. 
This article proposes a unified framework based on a large family of loss functions, a composite of concave and convex functions (CC-family). Properties of the CC-family are investigated, and CC-estimation is innovatively conducted via the iteratively reweighted convex optimisation (IRCO), which is a generalisation of the iteratively reweighted least squares in robust linear regression. For robust GLM, the IRCO becomes the iteratively reweighted GLM. The unified framework contains penalised estimation and robust support vector machines and is demonstrated with a variety of data applications."}, "https://arxiv.org/abs/2206.14275": {"title": "Dynamic CoVaR Modeling", "link": "https://arxiv.org/abs/2206.14275", "description": "arXiv:2206.14275v3 Announce Type: replace \nAbstract: The popular systemic risk measure CoVaR (conditional Value-at-Risk) is widely used in economics and finance. Formally, it is defined as a large quantile of one variable (e.g., losses in the financial system) conditional on some other variable (e.g., losses in a bank's shares) being in distress. In this article, we propose joint dynamic forecasting models for the Value-at-Risk (VaR) and CoVaR. We also introduce a two-step M-estimator for the model parameters drawing on recently proposed bivariate scoring functions for the pair (VaR, CoVaR). We prove consistency and asymptotic normality of our parameter estimator and analyze its finite-sample properties in simulations. Finally, we apply a specific subclass of our dynamic forecasting models, which we call CoCAViaR models, to log-returns of large US banks. It is shown that our CoCAViaR models generate CoVaR predictions that are superior to forecasts issued from current benchmark models."}, "https://arxiv.org/abs/2301.10640": {"title": "Adaptive enrichment trial designs using joint modeling of longitudinal and time-to-event data", "link": "https://arxiv.org/abs/2301.10640", "description": "arXiv:2301.10640v2 Announce Type: replace \nAbstract: Adaptive enrichment allows for pre-defined patient subgroups of interest to be investigated throughout the course of a clinical trial. Many trials which measure a long-term time-to-event endpoint often also routinely collect repeated measures on biomarkers which may be predictive of the primary endpoint. Although these data may not be leveraged directly to support subgroup selection decisions and early stopping decisions, we aim to make greater use of these data to increase efficiency and improve interim decision making. In this work, we present a joint model for longitudinal and time-to-event data and two methods for creating standardised statistics based on this joint model. We can use the estimates to define enrichment rules and efficacy and futility early stopping rules for a flexible efficient clinical trial with possible enrichment. Under this framework, we show asymptotically that the familywise error rate is protected in the strong sense. To assess the results, we consider a trial for the treatment of metastatic breast cancer where repeated ctDNA measurements are available and the subgroup criteria are defined by patients' ER and HER2 status. 
Using simulation, we show that incorporating biomarker information leads to accurate subgroup identification and increases in power."}, "https://arxiv.org/abs/2303.17478": {"title": "A Bayesian Dirichlet Auto-Regressive Moving Average Model for Forecasting Lead Times", "link": "https://arxiv.org/abs/2303.17478", "description": "arXiv:2303.17478v3 Announce Type: replace \nAbstract: Lead time data is compositional data found frequently in the hospitality industry. Hospitality businesses earn fees each day; however, these fees cannot be recognized until later. For business purposes, it is important to understand and forecast the distribution of future fees for the allocation of resources, for business planning, and for staffing. Motivated by 5 years of daily fees data, we propose a new class of Bayesian time series models, a Bayesian Dirichlet Auto-Regressive Moving Average (B-DARMA) model for compositional time series, modeling the proportion of future fees that will be recognized in 11 consecutive 30-day windows and 1 last consecutive 35-day window. Each day's compositional datum is modeled as Dirichlet distributed given the mean and a scale parameter. The mean is modeled with a Vector Autoregressive Moving Average process after transforming with an additive log ratio link function and depends on previous compositional data, previous compositional parameters and daily covariates. The B-DARMA model offers solutions to data analyses of large compositional vectors and short or long time series, offers efficiency gains through choice of priors, provides interpretable parameters for inference, and makes reasonable forecasts."}, "https://arxiv.org/abs/2304.01363": {"title": "Sufficient and Necessary Conditions for the Identifiability of DINA Models with Polytomous Responses", "link": "https://arxiv.org/abs/2304.01363", "description": "arXiv:2304.01363v3 Announce Type: replace \nAbstract: Cognitive Diagnosis Models (CDMs) provide a powerful statistical and psychometric tool for researchers and practitioners to learn fine-grained diagnostic information about respondents' latent attributes. There has been a growing interest in the use of CDMs for polytomous response data, as more and more items with multiple response options become widely used. Similar to many latent variable models, the identifiability of CDMs is critical for accurate parameter estimation and valid statistical inference. However, the existing identifiability results are primarily focused on binary response models and have not adequately addressed the identifiability of CDMs with polytomous responses. This paper addresses this gap by presenting sufficient and necessary conditions for the identifiability of the widely used DINA model with polytomous responses, with the aim of providing a comprehensive understanding of the identifiability of CDMs with polytomous responses and informing future research in this field."}, "https://arxiv.org/abs/2312.01735": {"title": "Weighted Q-learning for optimal dynamic treatment regimes with MNAR covariates", "link": "https://arxiv.org/abs/2312.01735", "description": "arXiv:2312.01735v3 Announce Type: replace \nAbstract: Dynamic treatment regimes (DTRs) formalize medical decision-making as a sequence of rules for different stages, mapping patient-level information to recommended treatments. 
In practice, estimating an optimal DTR using observational data from electronic medical record (EMR) databases can be complicated by covariates that are missing not at random (MNAR) due to informative monitoring of patients. Since complete case analysis can result in consistent estimation of outcome model parameters under the assumption of outcome-independent missingness, Q-learning is a natural approach to accommodating MNAR covariates. However, the backward induction algorithm used in Q-learning can introduce challenges, as MNAR covariates at later stages can result in MNAR pseudo-outcomes at earlier stages, leading to suboptimal DTRs, even if the longitudinal outcome variables are fully observed. To address this unique missing data problem in DTR settings, we propose two weighted Q-learning approaches where inverse probability weights for missingness of the pseudo-outcomes are obtained through estimating equations with valid nonresponse instrumental variables or sensitivity analysis. Asymptotic properties of the weighted Q-learning estimators are derived and the finite-sample performance of the proposed methods is evaluated and compared with alternative methods through extensive simulation studies. Using EMR data from the Medical Information Mart for Intensive Care database, we apply the proposed methods to investigate the optimal fluid strategy for sepsis patients in intensive care units."}, "https://arxiv.org/abs/2401.06082": {"title": "Borrowing from historical control data in a Bayesian time-to-event model with flexible baseline hazard function", "link": "https://arxiv.org/abs/2401.06082", "description": "arXiv:2401.06082v2 Announce Type: replace \nAbstract: There is currently a focus on statistical methods which can use historical trial information to help accelerate the discovery, development and delivery of medicine. Bayesian methods can be constructed so that the borrowing is \"dynamic\" in the sense that the similarity of the data helps to determine how much information is used. In the time-to-event setting with one historical data set, a popular model for a range of baseline hazards is the piecewise exponential model where the time points are fixed and a borrowing structure is imposed on the model. Although convenient for implementation, this approach affects the borrowing capability of the model. We propose a Bayesian model which allows the time points to vary and a dependency to be placed between the baseline hazards. This serves to smooth the posterior baseline hazard, improving both model estimation and borrowing characteristics. We explore a variety of prior structures for the borrowing within our proposed model and assess their performance against established approaches. We demonstrate that this leads to improved type I error in the presence of prior data conflict and increased power. We have developed accompanying software which is freely available and enables easy implementation of the approach."}, "https://arxiv.org/abs/2307.11390": {"title": "Fast spatial simulation of extreme high-resolution radar precipitation data using INLA", "link": "https://arxiv.org/abs/2307.11390", "description": "arXiv:2307.11390v2 Announce Type: replace-cross \nAbstract: Aiming to deliver improved precipitation simulations for hydrological impact assessment studies, we develop a methodology for modelling and simulating high-dimensional spatial precipitation extremes, focusing on both their marginal distributions and tail dependence structures. 
Tail dependence is a crucial property for assessing the consequences of an extreme precipitation event, yet most stochastic weather generators do not attempt to capture this property. We model extreme precipitation using a latent Gaussian version of the spatial conditional extremes model. This requires data with Laplace marginal distributions, but precipitation distributions contain point masses at zero that complicate necessary standardisation procedures. We therefore employ two separate models, one for describing extremes of nonzero precipitation and one for describing the probability of precipitation occurrence. Extreme precipitation is simulated by combining simulations from the two models. Nonzero precipitation marginals are modelled using latent Gaussian models with gamma and generalised Pareto likelihoods, and four different precipitation occurrence models are investigated. Fast inference is achieved using integrated nested Laplace approximations (INLA). We model and simulate spatial precipitation extremes in Central Norway, using high-density radar data. Inference on a 6000-dimensional data set is achieved within hours, and the simulations capture the main trends of the observed precipitation well."}, "https://arxiv.org/abs/2308.01054": {"title": "Simulation-based inference using surjective sequential neural likelihood estimation", "link": "https://arxiv.org/abs/2308.01054", "description": "arXiv:2308.01054v2 Announce Type: replace-cross \nAbstract: We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel method for simulation-based inference in models where the evaluation of the likelihood function is not tractable and only a simulator that can generate synthetic data is available. SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function which allows for conventional Bayesian inference using either Markov chain Monte Carlo methods or variational inference. By embedding the data in a low-dimensional space, SSNL solves several issues previous likelihood-based methods had when applied to high-dimensional data sets that, for instance, contain non-informative data dimensions or lie along a lower-dimensional manifold. We evaluate SSNL on a wide variety of experiments and show that it generally outperforms contemporary methods used in simulation-based inference, for instance, on a challenging real-world example from astrophysics which models the magnetic field strength of the sun using a solar dynamo model."}, "https://arxiv.org/abs/2402.15585": {"title": "Inference for Regression with Variables Generated from Unstructured Data", "link": "https://arxiv.org/abs/2402.15585", "description": "arXiv:2402.15585v1 Announce Type: new \nAbstract: The leading strategy for analyzing unstructured data uses two steps. First, latent variables of economic interest are estimated with an upstream information retrieval model. Second, the estimates are treated as \"data\" in a downstream econometric model. We establish theoretical arguments for why this two-step strategy leads to biased inference in empirically plausible settings. More constructively, we propose a one-step strategy for valid inference that uses the upstream and downstream models jointly. 
The one-step strategy (i) substantially reduces bias in simulations; (ii) has quantitatively important effects in a leading application using CEO time-use data; and (iii) can be readily adapted by applied researchers."}, "https://arxiv.org/abs/2402.15600": {"title": "A Graph-based Approach to Estimating the Number of Clusters", "link": "https://arxiv.org/abs/2402.15600", "description": "arXiv:2402.15600v1 Announce Type: new \nAbstract: We consider the problem of estimating the number of clusters ($k$) in a dataset. We propose a non-parametric approach to the problem that is based on maximizing a statistic constructed from similarity graphs. This graph-based statistic is a robust summary measure of the similarity information among observations and is applicable even if the number of dimensions or number of clusters is possibly large. The approach is straightforward to implement, computationally fast, and can be paired with any kind of clustering technique. Asymptotic theory is developed to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating $k$. We illustrate its utility on a high-dimensional image dataset and RNA-seq dataset."}, "https://arxiv.org/abs/2402.15705": {"title": "A Variational Approach for Modeling High-dimensional Spatial Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2402.15705", "description": "arXiv:2402.15705v1 Announce Type: new \nAbstract: Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields such as public health, ecology, geosciences, and social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but SGLMMs do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (i.e., basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be prohibitive for large datasets. While variational Bayes (VB) have been extended to SGLMMs, their focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB method embeds semi-parametric approximations for the latent spatial random processes and parallel computing offered by modern high-performance computing systems. Our approaches deliver nearly identical inferential and predictive performance compared to 'gold standard' methods but achieve computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations using a standard laptop within a short timeframe."}, "https://arxiv.org/abs/2402.15772": {"title": "Mean-preserving rounding integer-valued ARMA models", "link": "https://arxiv.org/abs/2402.15772", "description": "arXiv:2402.15772v1 Announce Type: new \nAbstract: In the past four decades, research on count time series has made significant progress, but research on $\\mathbb{Z}$-valued time series is relatively rare. Existing $\\mathbb{Z}$-valued models are mainly of autoregressive structure, where the use of the rounding operator is very natural. 
Because of the discontinuity of the rounding operator, the formulation of the corresponding model identifiability conditions and the computation of parameter estimators need special attention. It is also difficult to derive closed-form formulae for crucial stochastic properties. We rediscover a stochastic rounding operator, referred to as mean-preserving rounding, which overcomes the above drawbacks. Then, a novel class of $\\mathbb{Z}$-valued ARMA models based on the new operator is proposed, and the existence of stationary solutions of the models is established. Stochastic properties including closed-form formulae for (conditional) moments, autocorrelation function, and conditional distributions are obtained. The advantages of our novel model class compared to existing ones are demonstrated. In particular, our model construction avoids identifiability issues such that maximum likelihood estimation is possible. A simulation study is provided, and the appealing performance of the new models is shown by several real-world data sets."}, "https://arxiv.org/abs/2402.16053": {"title": "Reducing multivariate independence testing to two bivariate means comparisons", "link": "https://arxiv.org/abs/2402.16053", "description": "arXiv:2402.16053v1 Announce Type: new \nAbstract: Testing for independence between two random vectors is a fundamental problem in statistics. It is observed from empirical studies that many existing omnibus consistent tests may not work well for some strongly nonmonotonic and nonlinear relationships. To explore the reasons behind this issue, we equivalently transform the multivariate independence testing problem into checking the equality of two bivariate means. An important observation we made is that the power loss is mainly due to cancellation of positive and negative terms in dependence metrics, making them very close to zero. Motivated by this observation, we propose a class of consistent metrics with a positive integer $\\gamma$ that exactly characterize independence. Theoretically, we show that the metrics with even and infinite $\\gamma$ can effectively avoid the cancellation, and have high powers under the alternatives that two mean differences offset each other. Since we target a wide range of dependence scenarios in practice, we further suggest combining the p-values of test statistics with different $\\gamma$'s through Fisher's method. We illustrate the advantages of our proposed tests through extensive numerical studies."}, "https://arxiv.org/abs/2402.16322": {"title": "Estimating Stochastic Block Models in the Presence of Covariates", "link": "https://arxiv.org/abs/2402.16322", "description": "arXiv:2402.16322v1 Announce Type: new \nAbstract: In the standard stochastic block model for networks, the probability of a connection between two nodes, often referred to as the edge probability, depends on the unobserved communities each of these nodes belongs to. We consider a flexible framework in which each edge probability, together with the probability of community assignment, are also impacted by observed covariates. We propose a computationally tractable two-step procedure to estimate the conditional edge probabilities as well as the community assignment probabilities. The first step relies on a spectral clustering algorithm applied to a localized adjacency matrix of the network. In the second step, k-nearest neighbor regression estimates are computed on the extracted communities. 
We study the statistical properties of these estimators by providing non-asymptotic bounds."}, "https://arxiv.org/abs/2402.16362": {"title": "Estimation of complex carryover effects in crossover designs with repeated measures", "link": "https://arxiv.org/abs/2402.16362", "description": "arXiv:2402.16362v1 Announce Type: new \nAbstract: It has been argued that the models used to analyze data from crossover designs are not appropriate when simple carryover effects are assumed. In this paper, the estimability conditions of the carryover effects are found, along with a theoretical result that supports them; additionally, two simulation examples are developed for a non-linear dose-response in a repeated measures crossover trial with two designs: the traditional AB/BA design and a Williams square. Both show that a semiparametric model can detect complex carryover effects and that this estimation improves the precision of treatment effect estimators. We concluded that when there are at least five replicates in each observation period per individual, semiparametric statistical models provide a good estimator of the treatment effect and reduce bias with respect to models that assume either the absence of carryover or simple carryover effects. In addition, an application of the methodology is shown and the richness in analysis that is gained by being able to estimate complex carryover effects is evident."}, "https://arxiv.org/abs/2402.16520": {"title": "Sequential design for surrogate modeling in Bayesian inverse problems", "link": "https://arxiv.org/abs/2402.16520", "description": "arXiv:2402.16520v1 Announce Type: new \nAbstract: Sequential design is a highly active field of research in active learning which provides a general framework for the design of computer experiments to make the most of a low computational budget. It has been widely used to generate efficient surrogate models able to replace complex computer codes, most notably for uncertainty quantification, Bayesian optimization, reliability analysis or model calibration tasks. In this work, a sequential design strategy is developed for Bayesian inverse problems, in which a Gaussian process surrogate model serves as an emulator for a costly computer code. The proposed strategy is based on a goal-oriented I-optimal criterion adapted to the Stepwise Uncertainty Reduction (SUR) paradigm. In SUR strategies, a new design point is chosen by minimizing the expectation of an uncertainty metric with respect to the yet unknown new data point. These methods have attracted increasing interest as they provide an accessible framework for the sequential design of experiments while including almost-sure convergence for the most widely used metrics. In this paper, a weighted integrated mean square prediction error is introduced and serves as a metric of uncertainty for the newly proposed IP-SUR (Inverse Problem Stepwise Uncertainty Reduction) sequential design strategy derived from SUR methods. This strategy is shown to be tractable for both scalar and multi-output Gaussian process surrogate models with continuous sample paths, and comes with a theoretical guarantee for the almost-sure convergence of the metric of uncertainty. 
The premises of this work are highlighted on various test cases in which the newly derived strategy is compared to other naive and sequential designs (D-optimal designs, Bayes risk minimization)."}, "https://arxiv.org/abs/2402.16580": {"title": "Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso", "link": "https://arxiv.org/abs/2402.16580", "description": "arXiv:2402.16580v1 Announce Type: new \nAbstract: We propose a novel approach to elicit the weight of a potentially non-stationary regressor in the consistent and oracle-efficient estimation of autoregressive models using the adaptive Lasso. The enhanced weight builds on a statistic that exploits distinct orders in probability of the OLS estimator in time series regressions when the degree of integration differs. We provide theoretical results on the benefit of our approach for detecting stationarity when a tuning criterion selects the $\\ell_1$ penalty parameter. Monte Carlo evidence shows that our proposal is superior to using OLS-based weights, as suggested by Kock [Econom. Theory, 32, 2016, 243-259]. We apply the modified estimator to model selection for German inflation rates after the introduction of the Euro. The results indicate that energy commodity price inflation and headline inflation are best described by stationary autoregressions."}, "https://arxiv.org/abs/2402.16693": {"title": "Fast Algorithms for Quantile Regression with Selection", "link": "https://arxiv.org/abs/2402.16693", "description": "arXiv:2402.16693v1 Announce Type: new \nAbstract: This paper addresses computational challenges in estimating Quantile Regression with Selection (QRS). The estimation of the parameters that model self-selection requires the estimation of the entire quantile process several times. Moreover, closed-form expressions of the asymptotic variance are too cumbersome, making the bootstrap more convenient for performing inference. Taking advantage of recent advancements in the estimation of quantile regression, along with some specific characteristics of the QRS estimation problem, I propose streamlined algorithms for the QRS estimator. These algorithms significantly reduce computation time through preprocessing techniques and quantile grid reduction for the estimation of the copula and slope parameters. I show the optimization enhancements with some simulations. Lastly, I show how preprocessing methods can improve the precision of the estimates without sacrificing computational efficiency. Hence, they constitute practical solutions for estimators with non-differentiable and non-convex criterion functions such as those based on copulas."}, "https://arxiv.org/abs/2402.16725": {"title": "Inference on the proportion of variance explained in principal component analysis", "link": "https://arxiv.org/abs/2402.16725", "description": "arXiv:2402.16725v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored.\n In this paper, we consider inference on the PVE. 
We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset."}, "https://arxiv.org/abs/2402.15619": {"title": "Towards Improved Uncertainty Quantification of Stochastic Epidemic Models Using Sequential Monte Carlo", "link": "https://arxiv.org/abs/2402.15619", "description": "arXiv:2402.15619v1 Announce Type: cross \nAbstract: Sequential Monte Carlo (SMC) algorithms represent a suite of robust computational methodologies utilized for state estimation and parameter inference within dynamical systems, particularly in real-time or online environments where data arrives sequentially over time. In this research endeavor, we propose an integrated framework that combines a stochastic epidemic simulator with a sequential importance sampling (SIS) scheme to dynamically infer model parameters, which evolve due to social as well as biological processes throughout the progression of an epidemic outbreak and are also influenced by evolving data measurement bias. Through iterative updates of a set of weighted simulated trajectories based on observed data, this framework enables the estimation of posterior distributions for these parameters, thereby capturing their temporal variability and associated uncertainties. Through simulation studies, we showcase the efficacy of SMC in accurately tracking the evolving dynamics of epidemics while appropriately accounting for uncertainties. Moreover, we delve into practical considerations and challenges inherent in implementing SMC for parameter estimation within dynamic epidemiological settings, areas where the substantial computational capabilities of high-performance computing resources can be usefully brought to bear."}, "https://arxiv.org/abs/2402.16131": {"title": "A VAE-based Framework for Learning Multi-Level Neural Granger-Causal Connectivity", "link": "https://arxiv.org/abs/2402.16131", "description": "arXiv:2402.16131v1 Announce Type: cross \nAbstract: Granger causality has been widely used in various application domains to capture lead-lag relationships amongst the components of complex dynamical systems, and the focus in extant literature has been on a single dynamical system. In certain applications in macroeconomics and neuroscience, one has access to data from a collection of related such systems, wherein the modeling task of interest is to extract the shared common structure that is embedded across them, as well as to identify the idiosyncrasies within individual ones. This paper introduces a Variational Autoencoder (VAE) based framework that jointly learns Granger-causal relationships amongst components in a collection of related-yet-heterogeneous dynamical systems, and handles the aforementioned task in a principled way. 
The performance of the proposed framework is evaluated on several synthetic data settings and benchmarked against existing approaches designed for individual system learning. The method is further illustrated on a real dataset involving time series data from a neurophysiological experiment and produces interpretable results."}, "https://arxiv.org/abs/2402.16225": {"title": "Discrete Fourier Transform Approximations Based on the Cooley-Tukey Radix-2 Algorithm", "link": "https://arxiv.org/abs/2402.16225", "description": "arXiv:2402.16225v1 Announce Type: cross \nAbstract: This report elaborates on approximations for the discrete Fourier transform by means of replacing the exact Cooley-Tukey algorithm twiddle-factors by low-complexity integers, such as $0, \\pm \\frac{1}{2}, \\pm 1$."}, "https://arxiv.org/abs/2402.16375": {"title": "Valuing insurance against small probability risks: A meta-analysis", "link": "https://arxiv.org/abs/2402.16375", "description": "arXiv:2402.16375v1 Announce Type: cross \nAbstract: The demand for voluntary insurance against low-probability, high-impact risks is lower than expected. To assess the magnitude of the demand, we conduct a meta-analysis of contingent valuation studies using a dataset of experimentally elicited and survey-based estimates. We find that the average stated willingness to pay (WTP) for insurance is 87% of expected losses. We perform a meta-regression analysis to examine the heterogeneity in aggregate WTP across these studies. The meta-regression reveals that information about loss probability and probability levels positively influence relative willingness to pay, whereas respondents' average income and age have a negative effect. Moreover, we identify cultural sub-factors, such as power distance and uncertainty avoidance, that provided additional explanations for differences in WTP across international samples. Methodological factors related to the sampling and data collection process significantly influence the stated WTP. Our results, robust to model specification and publication bias, are relevant to current debates on stated preferences for low-probability risks management."}, "https://arxiv.org/abs/2402.16661": {"title": "Penalized Generative Variable Selection", "link": "https://arxiv.org/abs/2402.16661", "description": "arXiv:2402.16661v1 Announce Type: cross \nAbstract: Deep networks are increasingly applied to a wide variety of data, including data with high-dimensional predictors. In such analysis, variable selection can be needed along with estimation/model building. Many of the existing deep network studies that incorporate variable selection have been limited to methodological and numerical developments. In this study, we consider modeling/estimation using the conditional Wasserstein Generative Adversarial networks. Group Lasso penalization is applied for variable selection, which may improve model estimation/prediction, interpretability, stability, etc. Significantly advancing from the existing literature, the analysis of censored survival data is also considered. We establish the convergence rate for variable selection while considering the approximation error, and obtain a more efficient distribution estimation. 
Simulations and the analysis of real experimental data demonstrate satisfactory practical utility of the proposed analysis."}, "https://arxiv.org/abs/2106.10770": {"title": "A Neural Frequency-Severity Model and Its Application to Insurance Claims", "link": "https://arxiv.org/abs/2106.10770", "description": "arXiv:2106.10770v2 Announce Type: replace \nAbstract: This paper proposes a flexible and analytically tractable class of frequency and severity models for predicting insurance claims. The proposed model is able to capture nonlinear relationships in explanatory variables by characterizing the logarithmic mean functions of frequency and severity distributions as neural networks. Moreover, a potential dependence between the claim frequency and severity can be incorporated. In particular, the paper provides analytic formulas for mean and variance of the total claim cost, making our model ideal for many applications such as pricing insurance contracts and the pure premium. A simulation study demonstrates that our method successfully recovers nonlinear features of explanatory variables as well as the dependency between frequency and severity. Then, this paper uses a French auto insurance claim dataset to illustrate that the proposed model is superior to the existing methods in fitting and predicting the claim frequency, severity, and the total claim loss. Numerical results indicate that the proposed model helps in maintaining the competitiveness of an insurer by accurately predicting insurance claims and avoiding adverse selection."}, "https://arxiv.org/abs/2107.06238": {"title": "GENIUS-MAWII: For Robust Mendelian Randomization with Many Weak Invalid Instruments", "link": "https://arxiv.org/abs/2107.06238", "description": "arXiv:2107.06238v3 Announce Type: replace \nAbstract: Mendelian randomization (MR) has become a popular approach to study causal effects by using genetic variants as instrumental variables. We propose a new MR method, GENIUS-MAWII, which simultaneously addresses the two salient phenomena that adversely affect MR analyses: many weak instruments and widespread horizontal pleiotropy. Similar to MR GENIUS (Tchetgen Tchetgen et al., 2021), we achieve identification of the treatment effect by leveraging heteroscedasticity of the exposure. We then derive the class of influence functions of the treatment effect, based on which, we construct a continuous updating estimator and establish its consistency and asymptotic normality under a many weak invalid instruments asymptotic regime by developing novel semiparametric theory. We also provide a measure of weak identification, an overidentification test, and a graphical diagnostic tool. We demonstrate in simulations that GENIUS-MAWII has clear advantages in the presence of directional or correlated horizontal pleiotropy compared to other methods. We apply our method to study the effect of body mass index on systolic blood pressure using UK Biobank."}, "https://arxiv.org/abs/2205.06989": {"title": "lsirm12pl: An R package for latent space item response modeling", "link": "https://arxiv.org/abs/2205.06989", "description": "arXiv:2205.06989v2 Announce Type: replace \nAbstract: The latent space item response model (LSIRM; Jeon et al., 2021) allows us to show interactions between respondents and items in item response data by embedding both items and respondents in a shared and unobserved metric space. 
The R package lsirm12pl implements Bayesian estimation of the LSIRM and its extensions for different response types, base model specifications, and missing data. Further, the lsirm12pl offers methods to improve model utilization and interpretation, such as clustering of item positions in an estimated interaction map. lsirm12pl also provides convenient summary and plotting options to assess and process estimated results. In this paper, we give an overview of the methodological basis of LSIRM and describe the LSIRM extensions considered in the package. We then present the utilization of the package lsirm12pl with real data examples that are contained in the package."}, "https://arxiv.org/abs/2208.02925": {"title": "Factor Network Autoregressions", "link": "https://arxiv.org/abs/2208.02925", "description": "arXiv:2208.02925v4 Announce Type: replace \nAbstract: We propose a factor network autoregressive (FNAR) model for time series with complex network structures. The coefficients of the model reflect many different types of connections between economic agents (``multilayer network\"), which are summarized into a smaller number of network matrices (``network factors\") through a novel tensor-based principal component approach. We provide consistency and asymptotic normality results for the estimation of the factors and the coefficients of the FNAR. Our approach combines two different dimension-reduction techniques and can be applied to ultra-high-dimensional datasets. Simulation results show the goodness of our approach in finite samples. In an empirical application, we use the FNAR to investigate the cross-country interdependence of GDP growth rates based on a variety of international trade and financial linkages. The model provides a rich characterization of macroeconomic network effects."}, "https://arxiv.org/abs/2209.01172": {"title": "An Interpretable and Efficient Infinite-Order Vector Autoregressive Model for High-Dimensional Time Series", "link": "https://arxiv.org/abs/2209.01172", "description": "arXiv:2209.01172v4 Announce Type: replace \nAbstract: As a special infinite-order vector autoregressive (VAR) model, the vector autoregressive moving average (VARMA) model can capture much richer temporal patterns than the widely used finite-order VAR model. However, its practicality has long been hindered by its non-identifiability, computational intractability, and difficulty of interpretation, especially for high-dimensional time series. This paper proposes a novel sparse infinite-order VAR model for high-dimensional time series, which avoids all above drawbacks while inheriting essential temporal patterns of the VARMA model. As another attractive feature, the temporal and cross-sectional structures of the VARMA-type dynamics captured by this model can be interpreted separately, since they are characterized by different sets of parameters. This separation naturally motivates the sparsity assumption on the parameters determining the cross-sectional dependence. As a result, greater statistical efficiency and interpretability can be achieved with little loss of temporal information. We introduce two $\\ell_1$-regularized estimation methods for the proposed model, which can be efficiently implemented via block coordinate descent algorithms, and derive the corresponding nonasymptotic error bounds. A consistent model order selection method based on the Bayesian information criteria is also developed. 
The merit of the proposed approach is supported by simulation studies and a real-world macroeconomic data analysis."}, "https://arxiv.org/abs/2211.04697": {"title": "$L^{\\infty}$- and $L^2$-sensitivity analysis for causal inference with unmeasured confounding", "link": "https://arxiv.org/abs/2211.04697", "description": "arXiv:2211.04697v4 Announce Type: replace \nAbstract: Sensitivity analysis for the unconfoundedness assumption is crucial in observational studies. For this purpose, the marginal sensitivity model (MSM) gained popularity recently due to its good interpretability and mathematical properties. However, as a quantification of confounding strength, the $L^{\\infty}$-bound it puts on the logit difference between the observed and full data propensity scores may render the analysis conservative. In this article, we propose a new sensitivity model that restricts the $L^2$-norm of the propensity score ratio, requiring only the average strength of unmeasured confounding to be bounded. By characterizing sensitivity analysis as an optimization problem, we derive closed-form sharp bounds of the average potential outcomes under our model. We propose efficient one-step estimators for these bounds based on the corresponding efficient influence functions. Additionally, we apply multiplier bootstrap to construct simultaneous confidence bands to cover the sensitivity curve that consists of bounds at different sensitivity parameters. Through a real-data study, we illustrate how the new $L^2$-sensitivity analysis can improve calibration using observed confounders and provide tighter bounds when the unmeasured confounder is additionally assumed to be independent of the measured confounders and only have an additive effect on the potential outcomes."}, "https://arxiv.org/abs/2306.16591": {"title": "Nonparametric Causal Decomposition of Group Disparities", "link": "https://arxiv.org/abs/2306.16591", "description": "arXiv:2306.16591v2 Announce Type: replace \nAbstract: We propose a nonparametric framework that decomposes the causal contributions of a treatment variable to an outcome disparity between two groups. We decompose the causal contributions of treatment into group differences in 1) treatment prevalence, 2) average treatment effects, and 3) selection into treatment based on individual-level treatment effects. Our framework reformulates the classic Kitagawa-Blinder-Oaxaca decomposition nonparametrically in causal terms, complements causal mediation analysis by explaining group disparities instead of group effects, and distinguishes more mechanisms than recent random equalization decomposition. In contrast to all prior approaches, our framework isolates the causal contribution of differential selection into treatment as a novel mechanism for explaining and ameliorating group disparities. We develop nonparametric estimators based on efficient influence functions that are $\\sqrt{n}$-consistent, asymptotically normal, semiparametrically efficient, and multiply robust to misspecification. We apply our framework to decompose the causal contributions of education to the disparity in adult income between parental income groups (intergenerational income persistence). 
We find that both differential prevalence of, and differential selection into, college graduation significantly contribute to intergenerational income persistence."}, "https://arxiv.org/abs/2308.05484": {"title": "Filtering Dynamical Systems Using Observations of Statistics", "link": "https://arxiv.org/abs/2308.05484", "description": "arXiv:2308.05484v3 Announce Type: replace \nAbstract: We consider the problem of filtering dynamical systems, possibly stochastic, using observations of statistics. Thus, the computational task is to estimate a time-evolving density $\\rho(v, t)$ given noisy observations of the true density $\\rho^\\dagger$; this contrasts with the standard filtering problem based on observations of the state $v$. The task is naturally formulated as an infinite-dimensional filtering problem in the space of densities $\\rho$. However, for the purposes of tractability, we seek algorithms in state space; specifically, we introduce a mean-field state-space model, and using interacting particle system approximations to this model, we propose an ensemble method. We refer to the resulting methodology as the ensemble Fokker-Planck filter (EnFPF).\n Under certain restrictive assumptions, we show that the EnFPF approximates the Kalman-Bucy filter for the Fokker-Planck equation, which is the exact solution to the infinite-dimensional filtering problem. Furthermore, our numerical experiments show that the methodology is useful beyond this restrictive setting. Specifically, the experiments show that the EnFPF is able to correct ensemble statistics, to accelerate convergence to the invariant density for autonomous systems, and to accelerate convergence to time-dependent invariant densities for non-autonomous systems. We discuss possible applications of the EnFPF to climate ensembles and to turbulence modeling."}, "https://arxiv.org/abs/2309.09115": {"title": "Fully Synthetic Data for Complex Surveys", "link": "https://arxiv.org/abs/2309.09115", "description": "arXiv:2309.09115v3 Announce Type: replace \nAbstract: When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. 
We apply the proposed methods to a subset of data from the American Community Survey."}, "https://arxiv.org/abs/2310.11724": {"title": "Simultaneous Nonparametric Inference of M-regression under Complex Temporal Dynamics", "link": "https://arxiv.org/abs/2310.11724", "description": "arXiv:2310.11724v2 Announce Type: replace \nAbstract: The paper considers simultaneous nonparametric inference for a wide class of M-regression models with time-varying coefficients. The covariates and errors of the regression model are tackled as a general class of nonstationary time series and are allowed to be cross-dependent. We construct $\\sqrt{n}$-consistent inference for the cumulative regression function, whose limiting properties are disclosed using Bahadur representation and Gaussian approximation theory. A simple and unified self-convolved bootstrap procedure is proposed. With only one tuning parameter, the bootstrap consistently simulates the desired limiting behavior of the M-estimators under complex temporal dynamics, even under the possible presence of breakpoints in time series. Our methodology leads to a unified framework to conduct general classes of Exact Function Tests, Lack-of-fit Tests, and Qualitative Tests for the time-varying coefficients under complex temporal dynamics. These tests enable one to, among many others, conduct variable selection procedures, check for constancy and linearity, as well as verify shape assumptions, including monotonicity and convexity. As applications, our method is utilized to study the time-varying properties of global climate data and Microsoft stock return, respectively."}, "https://arxiv.org/abs/2310.14399": {"title": "The role of randomization inference in unraveling individual treatment effects in early phase vaccine trials", "link": "https://arxiv.org/abs/2310.14399", "description": "arXiv:2310.14399v2 Announce Type: replace \nAbstract: Randomization inference is a powerful tool in early phase vaccine trials when estimating the causal effect of a regimen against a placebo or another regimen. Randomization-based inference often focuses on testing either Fisher's sharp null hypothesis of no treatment effect for any participant or Neyman's weak null hypothesis of no sample average treatment effect. Many recent efforts have explored conducting exact randomization-based inference for other summaries of the treatment effect profile, for instance, quantiles of the treatment effect distribution function. In this article, we systematically review methods that conduct exact, randomization-based inference for quantiles of individual treatment effects (ITEs) and extend some results to a special case where na\\\"ive participants are expected not to exhibit responses to highly specific endpoints. These methods are suitable for completely randomized trials, stratified completely randomized trials, and a matched study comparing two non-randomized arms from possibly different trials. We evaluate the usefulness of these methods using synthetic data in simulation studies. Finally, we apply these methods to HIV Vaccine Trials Network Study 086 (HVTN 086) and HVTN 205 and showcase a wide range of application scenarios of the methods. 
R code that replicates all analyses in this article can be found on the first author's GitHub page at https://github.com/Zhe-Chen-1999/ITE-Inference."}, "https://arxiv.org/abs/2401.02529": {"title": "Simulation-based transition density approximation for the inference of SDE models", "link": "https://arxiv.org/abs/2401.02529", "description": "arXiv:2401.02529v2 Announce Type: replace \nAbstract: Stochastic Differential Equations (SDEs) serve as a powerful modeling tool in various scientific domains, including systems science, engineering, and ecological science. While the specific form of SDEs is typically known for a given problem, certain model parameters remain unknown. Efficiently inferring these unknown parameters based on observations of the state in discrete time series represents a vital practical subject. The challenge arises in nonlinear SDEs, where maximum likelihood estimation of parameters is generally infeasible due to the absence of closed-form expressions for transition and stationary probability density functions of the states. In response to this limitation, we propose a novel two-step parameter inference mechanism. This approach involves a global-search phase followed by a local-refining procedure. The global-search phase is dedicated to identifying the domain of high-value likelihood functions, while the local-refining procedure is specifically designed to enhance the surrogate likelihood within this localized domain. Additionally, we present two simulation-based approximations for the transition density, aiming to efficiently or accurately approximate the likelihood function. Numerical examples illustrate the efficacy of our proposed methodology in achieving posterior parameter estimation."}, "https://arxiv.org/abs/2401.11229": {"title": "Estimation with Pairwise Observations", "link": "https://arxiv.org/abs/2401.11229", "description": "arXiv:2401.11229v2 Announce Type: replace \nAbstract: The paper introduces a new estimation method for the standard linear regression model. The procedure is not driven by the optimisation of any objective function; rather, it is a simple weighted average of slopes from observation pairs. The paper shows that such an estimator is consistent for carefully selected weights. Other properties, such as asymptotic distributions, have also been derived to facilitate valid statistical inference. Unlike traditional methods, such as Least Squares and Maximum Likelihood, among others, the estimated residual of this estimator is not by construction orthogonal to the explanatory variables of the model. This property allows a wide range of practical applications, such as the testing of endogeneity, i.e., the correlation between the explanatory variables and the disturbance terms."}, "https://arxiv.org/abs/2401.15076": {"title": "Comparative Analysis of Practical Identifiability Methods for an SEIR Model", "link": "https://arxiv.org/abs/2401.15076", "description": "arXiv:2401.15076v2 Announce Type: replace \nAbstract: Identifiability of a mathematical model plays a crucial role in parameterization of the model. In this study, we establish the structural identifiability of a Susceptible-Exposed-Infected-Recovered (SEIR) model given different combinations of input data and investigate practical identifiability with respect to different observable data, data frequency, and noise distributions. The practical identifiability is explored by both Monte Carlo simulations and a Correlation Matrix approach. 
Our results show that practical identifiability benefits from higher data frequency and data from the peak of an outbreak. The incidence data gives the best practical identifiability results compared to prevalence and cumulative data. In addition, we compare and distinguish the practical identifiability by Monte Carlo simulations and a Correlation Matrix approach, providing insights for when to use which method for other applications."}, "https://arxiv.org/abs/1908.08600": {"title": "Online Causal Inference for Advertising in Real-Time Bidding Auctions", "link": "https://arxiv.org/abs/1908.08600", "description": "arXiv:1908.08600v4 Announce Type: replace-cross \nAbstract: Real-time bidding (RTB) systems, which utilize auctions to allocate user impressions to competing advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of such advertising remains a challenge in research and practice. This paper proposes a new approach to perform causal inference on advertising bought through such mechanisms. Leveraging the economic structure of first- and second-price auctions, we first show that the effects of advertising are identified by the optimal bids. Hence, since these optimal bids are the only objects that need to be recovered, we introduce an adapted Thompson sampling (TS) algorithm to solve a multi-armed bandit problem that succeeds in recovering such bids and, consequently, the effects of advertising while minimizing the costs of experimentation. We derive a regret bound for our algorithm which is order optimal and use data from RTB auctions to show that it outperforms commonly used methods that estimate the effects of advertising."}, "https://arxiv.org/abs/2009.10303": {"title": "On the representation and learning of monotone triangular transport maps", "link": "https://arxiv.org/abs/2009.10303", "description": "arXiv:2009.10303v3 Announce Type: replace-cross \nAbstract: Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps$\\unicode{x2014}$approximations of the Knothe$\\unicode{x2013}$Rosenblatt (KR) rearrangement$\\unicode{x2014}$are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. 
We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes."}, "https://arxiv.org/abs/2103.14726": {"title": "Random line graphs and edge-attributed network inference", "link": "https://arxiv.org/abs/2103.14726", "description": "arXiv:2103.14726v2 Announce Type: replace-cross \nAbstract: We extend the latent position random graph model to the line graph of a random graph, which is formed by creating a vertex for each edge in the original random graph, and connecting each pair of edges incident to a common vertex in the original graph. We prove concentration inequalities for the spectrum of a line graph, as well as limiting distribution results for the largest eigenvalue and the empirical spectral distribution in certain settings. For the stochastic blockmodel, we establish that although naive spectral decompositions can fail to extract necessary signal for edge clustering, there exist signal-preserving singular subspaces of the line graph that can be recovered through a carefully-chosen projection. Moreover, we can consistently estimate edge latent positions in a random line graph, even though such graphs are of a random size, typically have high rank, and possess no spectral gap. Our results demonstrate that the line graph of a stochastic block model exhibits underlying block structure, and in simulations, we synthesize and test our methods against several commonly-used techniques, including tensor decompositions, for cluster recovery and edge covariate inference. By naturally incorporating information encoded in both vertices and edges, the random line graph improves network inference."}, "https://arxiv.org/abs/2106.14077": {"title": "The Role of Contextual Information in Best Arm Identification", "link": "https://arxiv.org/abs/2106.14077", "description": "arXiv:2106.14077v3 Announce Type: replace-cross \nAbstract: We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the \"Track-and-Stop\" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). 
We experimentally confirm that context information contributes to faster best-arm identification."}, "https://arxiv.org/abs/2306.14142": {"title": "Estimating Policy Effects in a Social Network with Independent Set Sampling", "link": "https://arxiv.org/abs/2306.14142", "description": "arXiv:2306.14142v3 Announce Type: replace-cross \nAbstract: Evaluating the impact of policy interventions on respondents who are embedded in a social network is often challenging due to the presence of network interference within the treatment groups, as well as between treatment and non-treatment groups throughout the network. In this paper, we propose a modeling strategy that combines existing work on stochastic actor-oriented models (SAOM) with a novel network sampling method based on the identification of independent sets. By assigning respondents from an independent set to the treatment, we are able to block any spillover of the treatment and network influence, thereby allowing us to isolate the direct effect of the treatment from the indirect network-induced effects, in the immediate term. As a result, our method allows for the estimation of both the \\textit{direct} as well as the \\textit{net effect} of a chosen policy intervention, in the presence of network effects in the population. We perform a comparative simulation analysis to show that our proposed sampling technique leads to distinct direct and net effects of the policy, as well as significant network effects driven by policy-linked homophily. This study highlights the importance of network sampling techniques in improving policy evaluation studies and has the potential to help researchers and policymakers with better planning, designing, and anticipating policy responses in a networked society."}, "https://arxiv.org/abs/2306.16838": {"title": "Solving Kernel Ridge Regression with Gradient-Based Optimization Methods", "link": "https://arxiv.org/abs/2306.16838", "description": "arXiv:2306.16838v5 Announce Type: replace-cross \nAbstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\\ell_1$ and $\\ell_\\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\\ell_\\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. 
We show theoretically and empirically how the $\\ell_1$ and $\\ell_\\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively."}, "https://arxiv.org/abs/2310.02461": {"title": "Optimization-based frequentist confidence intervals for functionals in constrained inverse problems: Resolving the Burrus conjecture", "link": "https://arxiv.org/abs/2310.02461", "description": "arXiv:2310.02461v2 Announce Type: replace-cross \nAbstract: We present an optimization-based framework to construct confidence intervals for functionals in constrained inverse problems, ensuring valid one-at-a-time frequentist coverage guarantees. Our approach builds upon the now-called strict bounds intervals, originally pioneered by Burrus (1965) and Rust and Burrus (1972), which offer ways to directly incorporate any side information about the parameters during inference without introducing external biases. This family of methods allows for uncertainty quantification in ill-posed inverse problems without needing to select a regularizing prior. By tying optimization-based intervals to an inversion of a constrained likelihood ratio test, we translate interval coverage guarantees into type I error control and characterize the resulting interval via solutions to optimization problems. Along the way, we refute the Burrus conjecture, which posited that, for possibly rank-deficient linear Gaussian models with positivity constraints, a correction based on the quantile of the chi-squared distribution with one degree of freedom suffices to shorten intervals while maintaining frequentist coverage guarantees. Our framework provides a novel approach to analyzing the conjecture, and we construct a counterexample employing a stochastic dominance argument, which we also use to disprove a general form of the conjecture. We illustrate our framework with several numerical examples and provide directions for extensions beyond the Rust-Burrus method for nonlinear, non-Gaussian settings with general constraints."}, "https://arxiv.org/abs/2310.16281": {"title": "Improving Robust Decisions with Data", "link": "https://arxiv.org/abs/2310.16281", "description": "arXiv:2310.16281v3 Announce Type: replace-cross \nAbstract: A decision-maker faces uncertainty governed by a data-generating process (DGP), which is only known to belong to a set of sequences of independent but possibly non-identical distributions. A robust decision maximizes the expected payoff against the worst possible DGP in this set. This paper characterizes when and how such robust decisions can be improved with data, measured by the expected payoff under the true DGP, no matter which possible DGP is the truth. 
It further develops novel and simple inference methods to achieve it, as common methods (e.g., maximum likelihood or Bayesian) often fail to deliver such an improvement."}, "https://arxiv.org/abs/2312.02482": {"title": "Treatment heterogeneity with right-censored outcomes using grf", "link": "https://arxiv.org/abs/2312.02482", "description": "arXiv:2312.02482v3 Announce Type: replace-cross \nAbstract: This article walks through how to estimate conditional average treatment effects (CATEs) with right-censored time-to-event outcomes using the function causal_survival_forest (Cui et al., 2023) in the R package grf (Athey et al., 2019, Tibshirani et al., 2024) using data from the National Job Training Partnership Act."}, "https://arxiv.org/abs/2401.06091": {"title": "A Closer Look at AUROC and AUPRC under Class Imbalance", "link": "https://arxiv.org/abs/2401.06091", "description": "arXiv:2401.06091v2 Announce Type: replace-cross \nAbstract: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need."}, "https://arxiv.org/abs/2402.16917": {"title": "Bayesian Chain Ladder For Cumulative Run-Off Triangle Under Half-Normal Distribution Assumption", "link": "https://arxiv.org/abs/2402.16917", "description": "arXiv:2402.16917v1 Announce Type: new \nAbstract: An insurance company is required to prepare a certain amount of money, called a reserve, as a means to pay its policy holders' claims in the future. There are several types of reserve; one of them is the IBNR reserve, for which payments are not made in the same calendar year as the events that trigger the claims. Wuthrich and Merz (2015) developed a Bayesian Chain Ladder method for the gamma distribution that applies Bayesian theory to the usual Chain Ladder method. 
In this article, we modify the previous Bayesian Chain Ladder for the half-normal distribution, as this distribution is better suited to lighter-tailed claims."}, "https://arxiv.org/abs/2402.16969": {"title": "Robust Evaluation of Longitudinal Surrogate Markers with Censored Data", "link": "https://arxiv.org/abs/2402.16969", "description": "arXiv:2402.16969v1 Announce Type: new \nAbstract: The development of statistical methods to evaluate surrogate markers is an active area of research. In many clinical settings, the surrogate marker is not simply a single measurement but is instead a longitudinal trajectory of measurements over time, e.g., fasting plasma glucose measured every 6 months for 3 years. In general, available methods developed for the single-surrogate setting cannot accommodate a longitudinal surrogate marker. Furthermore, many of the methods have not been developed for use with primary outcomes that are time-to-event outcomes and/or subject to censoring. In this paper, we propose robust methods to evaluate a longitudinal surrogate marker in a censored time-to-event outcome setting. Specifically, we propose a method to define and estimate the proportion of the treatment effect on a censored primary outcome that is explained by the treatment effect on a longitudinal surrogate marker measured up to time $t_0$. We accommodate both potential censoring of the primary outcome and of the surrogate marker. A simulation study demonstrates good finite-sample performance of our proposed methods. We illustrate our procedures by examining repeated measures of fasting plasma glucose, a surrogate marker for diabetes diagnosis, using data from the Diabetes Prevention Program (DPP)."}, "https://arxiv.org/abs/2402.17042": {"title": "Towards Generalizing Inferences from Trials to Target Populations", "link": "https://arxiv.org/abs/2402.17042", "description": "arXiv:2402.17042v1 Announce Type: new \nAbstract: Randomized Controlled Trials (RCTs) are pivotal in generating internally valid estimates with minimal assumptions, serving as a cornerstone for researchers dedicated to advancing causal inference methods. However, extending these findings beyond the experimental cohort to achieve externally valid estimates is crucial for broader scientific inquiry. This paper delves into the forefront of addressing these external validity challenges, encapsulating the essence of a multidisciplinary workshop held at the Institute for Computational and Experimental Research in Mathematics (ICERM), Brown University, in Fall 2023. The workshop congregated experts from diverse fields including social science, medicine, public health, statistics, computer science, and education, to tackle the unique obstacles each discipline faces in extrapolating experimental findings. Our study presents three key contributions: we integrate ongoing efforts, highlighting methodological synergies across fields; provide an exhaustive review of generalizability and transportability based on the workshop's discourse; and identify persistent hurdles while suggesting avenues for future research. 
By doing so, this paper aims to enhance the collective understanding of the generalizability and transportability of causal effects, fostering cross-disciplinary collaboration and offering valuable insights for researchers working on refining and applying causal inference methods."}, "https://arxiv.org/abs/2402.17070": {"title": "Dempster-Shafer P-values: Thoughts on an Alternative Approach for Multinomial Inference", "link": "https://arxiv.org/abs/2402.17070", "description": "arXiv:2402.17070v1 Announce Type: new \nAbstract: In this paper, we demonstrate that a new measure of evidence we developed, called the Dempster-Shafer p-value, allows for insights and interpretations that retain most of the structure of the p-value while addressing some of the disadvantages that traditional p-values face. Moreover, we show through classical large-sample bounds and simulations that there exists a close connection between our form of DS hypothesis testing and the classical frequentist testing paradigm. We also demonstrate how our approach gives unique insights into the dimensionality of a hypothesis test, as well as models the effects of adversarial attacks on multinomial data. Finally, we demonstrate how these insights can be used to analyze text data for public health through an analysis of the Population Health Metrics Research Consortium dataset for verbal autopsies."}, "https://arxiv.org/abs/2402.17103": {"title": "Causal Orthogonalization: Multicollinearity, Economic Interpretability, and the Gram-Schmidt Process", "link": "https://arxiv.org/abs/2402.17103", "description": "arXiv:2402.17103v1 Announce Type: new \nAbstract: This paper considers the problem of interpreting orthogonalization model coefficients. We derive a causal economic interpretation of the Gram-Schmidt orthogonalization process and provide the conditions for its equivalence to total effects from a recursive Directed Acyclic Graph. We extend the Gram-Schmidt process to groups of simultaneous regressors common in economic data sets and derive its finite sample properties, finding its coefficients to be unbiased, stable, and more efficient than those from Ordinary Least Squares. Finally, we apply the estimator to childhood reading comprehension scores, controlling for such highly collinear characteristics as race, education, and income. The model expands Bohren et al.'s decomposition of systemic discrimination into channel-specific effects and improves its coefficient significance levels."}, "https://arxiv.org/abs/2402.17274": {"title": "Sequential Change-point Detection for Binomial Time Series with Exogenous Variables", "link": "https://arxiv.org/abs/2402.17274", "description": "arXiv:2402.17274v1 Announce Type: new \nAbstract: Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used in data monitoring in practice. Meanwhile, binomial time series, which depicts independent binary individual behaviors within a group when the individual behaviors are dependent on past observations of the whole group, is an important type of model in practice but has not been well developed. 
We first propose a Binomial AR($1$) model, and then consider a method for sequential change-point detection for the Binomial AR(1)."}, "https://arxiv.org/abs/2402.17297": {"title": "Combined Quantile Forecasting for High-Dimensional Non-Gaussian Data", "link": "https://arxiv.org/abs/2402.17297", "description": "arXiv:2402.17297v1 Announce Type: new \nAbstract: This study proposes a novel method for forecasting a scalar variable based on high-dimensional predictors that is applicable to various data distributions. In the literature, one of the popular approaches for forecasting with many predictors is to use factor models. However, these traditional methods are ineffective when the data exhibit non-Gaussian characteristics such as skewness or heavy tails. In this study, we newly utilize a quantile factor model to extract quantile factors that describe specific quantiles of the data beyond the mean factor. We then build a quantile-based forecast model using the estimated quantile factors at different quantile levels as predictors. Finally, the predicted values at the various quantile levels are combined into a single forecast as a weighted average with weights determined by a Markov chain based on past trends of the target variable. The main idea of the proposed method is to incorporate a quantile approach to a forecasting method to handle non-Gaussian characteristics effectively. The performance of the proposed method is evaluated through a simulation study and real data analysis of PM2.5 data in South Korea, where the proposed method outperforms other existing methods in most cases."}, "https://arxiv.org/abs/2402.17366": {"title": "Causal blind spots when using prediction models for treatment decisions", "link": "https://arxiv.org/abs/2402.17366", "description": "arXiv:2402.17366v1 Announce Type: new \nAbstract: Prediction models are increasingly proposed for guiding treatment decisions, but most fail to address the special role of treatments, leading to inappropriate use. This paper highlights the limitations of using standard prediction models for treatment decision support. We identify 'causal blind spots' in three common approaches to handling treatments in prediction modelling and illustrate potential harmful consequences in several medical applications. We advocate for an extension of guidelines for development, reporting, clinical evaluation and monitoring of prediction models to ensure that the intended use of the model is matched to an appropriate risk estimand. For decision support this requires a shift towards developing predictions under the specific treatment options under consideration ('predictions under interventions'). We argue that this will improve the efficacy of prediction models in guiding treatment decisions and prevent potential negative effects on patient outcomes."}, "https://arxiv.org/abs/2402.17374": {"title": "Quasi-Bayesian Estimation and Inference with Control Functions", "link": "https://arxiv.org/abs/2402.17374", "description": "arXiv:2402.17374v1 Announce Type: new \nAbstract: We consider a quasi-Bayesian method that combines a frequentist estimation in the first stage and a Bayesian estimation/inference approach in the second stage. The study is motivated by structural discrete choice models that use the control function methodology to correct for endogeneity bias. In this scenario, the first stage estimates the control function using some frequentist parametric or nonparametric approach. 
The structural equation in the second stage, associated with certain complicated likelihood functions, can be more conveniently dealt with using a Bayesian approach. This paper studies the asymptotic properties of the quasi-posterior distributions obtained from the second stage. We prove that the corresponding quasi-Bayesian credible set does not have the desired coverage in large samples. Nonetheless, the quasi-Bayesian point estimator remains consistent and is asymptotically equivalent to a frequentist two-stage estimator. We show that one can obtain valid inference by bootstrapping the quasi-posterior that takes into account the first-stage estimation uncertainty."}, "https://arxiv.org/abs/2402.17425": {"title": "Semi-parametric goodness-of-fit testing for INAR models", "link": "https://arxiv.org/abs/2402.17425", "description": "arXiv:2402.17425v1 Announce Type: new \nAbstract: Among the various models designed for dependent count data, integer-valued autoregressive (INAR) processes enjoy great popularity. Typically, statistical inference for INAR models uses asymptotic theory that relies on rather stringent (parametric) assumptions on the innovations such as Poisson or negative binomial distributions. In this paper, we present a novel semi-parametric goodness-of-fit test tailored for the INAR model class. Relying on the INAR-specific shape of the joint probability generating function, our approach allows for model validation of INAR models without specifying the (family of the) innovation distribution. We derive the limiting null distribution of our proposed test statistic, prove consistency under fixed alternatives and discuss its asymptotic behavior under local alternatives. By manifold Monte Carlo simulations, we illustrate the overall good performance of our testing procedure in terms of power and size properties. In particular, it turns out that the power can be considerably improved by using higher-order test statistics. We conclude the article with the application on three real-world economic data sets."}, "https://arxiv.org/abs/2402.17637": {"title": "Learning the Covariance of Treatment Effects Across Many Weak Experiments", "link": "https://arxiv.org/abs/2402.17637", "description": "arXiv:2402.17637v1 Announce Type: new \nAbstract: When primary objectives are insensitive or delayed, experimenters may instead focus on proxy metrics derived from secondary outcomes. For example, technology companies often infer long-term impacts of product interventions from their effects on weighted indices of short-term user engagement signals. We consider meta-analysis of many historical experiments to learn the covariance of treatment effects on different outcomes, which can support the construction of such proxies. Even when experiments are plentiful and large, if treatment effects are weak, the sample covariance of estimated treatment effects across experiments can be highly biased and remains inconsistent even as more experiments are considered. We overcome this by using techniques inspired by weak instrumental variable analysis, which we show can reliably estimate parameters of interest, even without a structural model. 
We show the Limited Information Maximum Likelihood (LIML) estimator learns a parameter that is equivalent to fitting total least squares to a transformation of the scatterplot of estimated treatment effects, and that Jackknife Instrumental Variables Estimation (JIVE) learns another parameter that can be computed from the average of Jackknifed covariance matrices across experiments. We also present a total-covariance-based estimator for the latter estimand under homoskedasticity, which we show is equivalent to a $k$-class estimator. We show how these parameters relate to causal quantities and can be used to construct unbiased proxy metrics under a structural model with both direct and indirect effects subject to the INstrument Strength Independent of Direct Effect (INSIDE) assumption of Mendelian randomization. Lastly, we discuss the application of our methods at Netflix."}, "https://arxiv.org/abs/2402.16909": {"title": "Impact of Physical Activity on Quality of Life During Pregnancy: A Causal ML Approach", "link": "https://arxiv.org/abs/2402.16909", "description": "arXiv:2402.16909v1 Announce Type: cross \nAbstract: The concept of Quality of Life (QoL) refers to a holistic measurement of an individual's well-being, incorporating psychological and social aspects. Pregnant women, especially those with obesity and stress, often experience lower QoL. Physical activity (PA) has shown the potential to enhance the QoL. However, pregnant women who are overweight and obese rarely meet the recommended level of PA. Studies have investigated the relationship between PA and QoL during pregnancy using correlation-based approaches. These methods aim to discover spurious correlations between variables rather than causal relationships. Besides, the existing methods mainly rely on physical activity parameters and neglect the use of different factors such as maternal (medical) history and context data, leading to biased estimates. Furthermore, the estimations lack an understanding of mediators and counterfactual scenarios that might affect them. In this paper, we investigate the causal relationship between being physically active (treatment variable) and the QoL (outcome) during pregnancy and postpartum. To estimate the causal effect, we develop a Causal Machine Learning method, integrating causal discovery and causal inference components. The data for our investigation is derived from a long-term wearable-based health monitoring study focusing on overweight and obese pregnant women. The machine learning (meta-learner) estimation technique is used to estimate the causal effect. Our result shows that performing adequate physical activity during pregnancy and postpartum improves the QoL by units of 7.3 and 3.4 on average in physical health and psychological domains, respectively. In the final step, four refutation analysis techniques are employed to validate our estimation."}, "https://arxiv.org/abs/2402.17233": {"title": "Hybrid Square Neural ODE Causal Modeling", "link": "https://arxiv.org/abs/2402.17233", "description": "arXiv:2402.17233v1 Announce Type: cross \nAbstract: Hybrid models combine mechanistic ODE-based dynamics with flexible and expressive neural network components. Such models have grown rapidly in popularity, especially in scientific domains where such ODE-based modeling offers important interpretability and validated causal grounding (e.g., for counterfactual reasoning). 
The incorporation of mechanistic models also provides inductive bias in standard blackbox modeling approaches, critical when learning from small datasets or partially observed, complex systems. Unfortunately, as hybrid models become more flexible, the causal grounding provided by the mechanistic model can quickly be lost. We address this problem by leveraging another common source of domain knowledge: ranking of treatment effects for a set of interventions, even if the precise treatment effect is unknown. We encode this information in a causal loss that we combine with the standard predictive loss to arrive at a hybrid loss that biases our learning towards causally valid hybrid models. We demonstrate our ability to achieve a win-win -- state-of-the-art predictive performance and causal validity -- in the challenging task of modeling glucose dynamics during exercise."}, "https://arxiv.org/abs/2402.17570": {"title": "Sparse Variational Contaminated Noise Gaussian Process Regression for Forecasting Geomagnetic Perturbations", "link": "https://arxiv.org/abs/2402.17570", "description": "arXiv:2402.17570v1 Announce Type: cross \nAbstract: Gaussian Processes (GP) have become popular machine learning methods for kernel based learning on datasets with complicated covariance structures. In this paper, we present a novel extension to the GP framework using a contaminated normal likelihood function to better account for heteroscedastic variance and outlier noise. We propose a scalable inference algorithm based on the Sparse Variational Gaussian Process (SVGP) method for fitting sparse Gaussian process regression models with contaminated normal noise on large datasets. We examine an application to geomagnetic ground perturbations, where the state-of-art prediction model is based on neural networks. We show that our approach yields shorter predictions intervals for similar coverage and accuracy when compared to an artificial dense neural network baseline."}, "https://arxiv.org/abs/1809.04347": {"title": "High-dimensional Bayesian Fourier Analysis For Detecting Circadian Gene Expressions", "link": "https://arxiv.org/abs/1809.04347", "description": "arXiv:1809.04347v2 Announce Type: replace \nAbstract: In genomic applications, there is often interest in identifying genes whose time-course expression trajectories exhibit periodic oscillations with a period of approximately 24 hours. Such genes are usually referred to as circadian, and their identification is a crucial step toward discovering physiological processes that are clock-controlled. It is natural to expect that the expression of gene i at time j might depend to some degree on the expression of the other genes measured at the same time. However, widely-used rhythmicity detection techniques do not accommodate for the potential dependence across genes. We develop a Bayesian approach for periodicity identification that explicitly takes into account the complex dependence structure across time-course trajectories in gene expressions. We employ a latent factor representation to accommodate dependence, while representing the true trajectories in the Fourier domain allows for inference on period, phase, and amplitude of the signal. Identification of circadian genes is allowed through a carefully chosen variable selection prior on the Fourier basis coefficients. The methodology is applied to a novel mouse liver circadian dataset. 
Although motivated by time-course gene expression array data, the proposed approach is applicable to the analysis of dependent functional data more broadly."}, "https://arxiv.org/abs/2202.06117": {"title": "Metric Statistics: Exploration and Inference for Random Objects With Distance Profiles", "link": "https://arxiv.org/abs/2202.06117", "description": "arXiv:2202.06117v4 Announce Type: replace \nAbstract: This article provides an overview of the statistical modeling of complex data as increasingly encountered in modern data analysis. It is argued that such data can often be described as elements of a metric space that satisfies certain structural conditions and features a probability measure. We refer to the random elements of such spaces as random objects and to the emerging field that deals with their statistical analysis as metric statistics. Metric statistics provides methodology, theory and visualization tools for the statistical description, quantification of variation, centrality and quantiles, regression and inference for populations of random objects, inferring these quantities from available data and samples. In addition to a brief review of current concepts, we focus on distance profiles as a major tool for object data in conjunction with the pairwise Wasserstein transports of the underlying one-dimensional distance distributions. These pairwise transports lead to the definition of intuitive and interpretable notions of transport ranks and transport quantiles as well as two-sample inference. An associated profile metric complements the original metric of the object space and may reveal important features of the object data in data analysis. We demonstrate these tools for the analysis of complex data through various examples and visualizations."}, "https://arxiv.org/abs/2206.11117": {"title": "Causal inference in multi-cohort studies using the target trial framework to identify and minimize sources of bias", "link": "https://arxiv.org/abs/2206.11117", "description": "arXiv:2206.11117v4 Announce Type: replace \nAbstract: Longitudinal cohort studies, which follow a group of individuals over time, provide the opportunity to examine causal effects of complex exposures on long-term health outcomes. Utilizing data from multiple cohorts has the potential to add further benefit by improving precision of estimates through data pooling and by allowing examination of effect heterogeneity through replication of analyses across cohorts. However, the interpretation of findings can be complicated by biases that may be compounded when pooling data, or contribute to discrepant findings when analyses are replicated. The 'target trial' is a powerful tool for guiding causal inference in single-cohort studies. Here we extend this conceptual framework to address the specific challenges that can arise in the multi-cohort setting. By representing a clear definition of the target estimand, the target trial provides a central point of reference against which biases arising in each cohort and from data pooling can be systematically assessed. Consequently, analyses can be designed to reduce these biases and the resulting findings appropriately interpreted in light of potential remaining biases. 
We use a case study to demonstrate the framework and its potential to strengthen causal inference in multi-cohort studies through improved analysis design and clarity in the interpretation of findings."}, "https://arxiv.org/abs/2301.03805": {"title": "Asymptotic Theory for Two-Way Clustering", "link": "https://arxiv.org/abs/2301.03805", "description": "arXiv:2301.03805v2 Announce Type: replace \nAbstract: This paper proves a new central limit theorem for a sample that exhibits two-way dependence and heterogeneity across clusters. Statistical inference for situations with both two-way dependence and cluster heterogeneity has thus far been an open issue. The existing theory for two-way clustering inference requires identical distributions across clusters (implied by the so-called separate exchangeability assumption). Yet no such homogeneity requirement is needed in the existing theory for one-way clustering. The new result therefore theoretically justifies the view that two-way clustering is a more robust version of one-way clustering, consistent with applied practice. The result is applied to linear regression, where it is shown that a standard plug-in variance estimator is valid for inference."}, "https://arxiv.org/abs/2304.00074": {"title": "Bayesian Clustering via Fusing of Localized Densities", "link": "https://arxiv.org/abs/2304.00074", "description": "arXiv:2304.00074v2 Announce Type: replace \nAbstract: Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favours a small number of distinct clusters. We provide theoretical support for FOLD including clustering optimality under kernel misspecification. In simulated experiments and real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure."}, "https://arxiv.org/abs/2305.19675": {"title": "Testing Truncation Dependence: The Gumbel-Barnett Copula", "link": "https://arxiv.org/abs/2305.19675", "description": "arXiv:2305.19675v2 Announce Type: replace \nAbstract: In studies on lifetimes, occasionally, the population contains statistical units that are born before the data collection has started. Left-truncated are units that deceased before this start. For all other units, the age at the study start often is recorded and we aim at testing whether this second measurement is independent of the genuine measure of interest, the lifetime. Our basic model of dependence is the one-parameter Gumbel-Barnett copula. 
For simplicity, the marginal distribution of the lifetime is assumed to be Exponential, and for the age at study start, namely the distribution of birth dates, we assume a Uniform distribution. Also for simplicity, and to fit our application, we assume that units that die later than our study period are also truncated. By a result from point process theory, we can approximate the truncated sample by a Poisson process and thereby derive its likelihood. Identification, consistency and asymptotic distribution of the maximum-likelihood estimator are derived. Testing for positive truncation dependence must include the hypothesis of independence, which coincides with the boundary of the copula's parameter space. By non-standard theory, the maximum likelihood estimator of the exponential and the copula parameter is distributed as a mixture of a two- and a one-dimensional normal distribution. The application is to 55 thousand double-truncated lifetimes of German businesses that closed down over the period 2014 to 2016. The likelihood attains its maximum for the copula parameter at the boundary of the parameter space, so that the $p$-value of the test is $0.5$. The life expectancy does not increase relative to the year of foundation. Using a Farlie-Gumbel-Morgenstern copula, which models both positive and negative dependence, we find that the life expectancy of German enterprises even decreases significantly over time."}, "https://arxiv.org/abs/2310.18504": {"title": "Doubly Robust Identification of Causal Effects of a Continuous Treatment using Discrete Instruments", "link": "https://arxiv.org/abs/2310.18504", "description": "arXiv:2310.18504v3 Announce Type: replace \nAbstract: Many empirical applications estimate causal effects of a continuous endogenous variable (treatment) using a binary instrument. Estimation is typically done through linear 2SLS. This approach requires a mean treatment change, and causal interpretation requires LATE-type monotonicity in the first stage. An alternative approach is to explore distributional changes in the treatment, where the first-stage restriction is treatment rank similarity. We propose causal estimands that are doubly robust in that they are valid under either of these two restrictions. We apply the doubly robust approach to estimate the impacts of sleep on well-being. Our new estimates corroborate the usual 2SLS estimates."}, "https://arxiv.org/abs/2311.00439": {"title": "Robustify and Tighten the Lee Bounds: A Sample Selection Model under Stochastic Monotonicity and Symmetry Assumptions", "link": "https://arxiv.org/abs/2311.00439", "description": "arXiv:2311.00439v2 Announce Type: replace \nAbstract: In the presence of sample selection, Lee's (2009) nonparametric bounds are a popular tool for estimating a treatment effect. However, the Lee bounds rely on the monotonicity assumption, whose empirical validity is sometimes unclear, and the bounds are often regarded as too wide and less informative even under monotonicity. To address these practical issues, this paper introduces a stochastic version of the monotonicity assumption alongside a nonparametric distributional shape constraint. The former enhances the robustness of the Lee bounds with respect to monotonicity, while the latter serves to tighten these bounds. The obtained bounds do not rely on the exclusion restriction and can be root-$n$ consistently estimable, making them practically viable.
The potential usefulness of the proposed methods is illustrated by their application to experimental data from the after-school instruction program studied by Muralidharan, Singh, and Ganimian (2019)."}, "https://arxiv.org/abs/2312.07177": {"title": "Fast Meta-Analytic Approximations for Relational Event Models: Applications to Data Streams and Multilevel Data", "link": "https://arxiv.org/abs/2312.07177", "description": "arXiv:2312.07177v2 Announce Type: replace \nAbstract: Large relational-event history data stemming from large networks are becoming increasingly available due to recent technological developments (e.g. digital communication, online databases, etc.). This opens many new doors to learning about complex interaction behavior between actors in temporal social networks. The relational event model has become the gold standard for relational event history analysis. Currently, however, the main bottleneck in fitting relational event models is computational, in the form of memory storage limitations and computational complexity. Relational event models are therefore mainly used for relatively small data sets, while larger, more interesting datasets, including multilevel data structures and relational event data streams, cannot be analyzed on standard desktop computers. This paper addresses this problem by developing approximation algorithms based on meta-analysis methods that can fit relational event models significantly faster while avoiding the computational issues. In particular, meta-analytic approximations are proposed for analyzing streams of relational event data and multilevel relational event data and potentially combinations thereof. The accuracy and the statistical properties of the methods are assessed using numerical simulations. Furthermore, real-world data are used to illustrate the potential of the methodology to study social interaction behavior in an organizational network and interaction behavior among political actors. The algorithms are implemented in a publicly available R package 'remx'."}, "https://arxiv.org/abs/2103.09603": {"title": "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R", "link": "https://arxiv.org/abs/2103.09603", "description": "arXiv:2103.09603v5 Announce Type: replace-cross \nAbstract: The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML.
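As a rough companion to the double machine learning abstract above, the sketch below shows cross-fitted estimation of a partially linear model with generic scikit-learn learners; it is not the DoubleML package API, and the learners and fold count are arbitrary illustrative choices.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    def dml_plr(y, d, X, n_folds=5, seed=0):
        """Cross-fitted estimate of theta in Y = theta*D + g(X) + error (partialling-out score)."""
        y, d, X = np.asarray(y, float), np.asarray(d, float), np.asarray(X, float)
        y_res, d_res = np.zeros_like(y), np.zeros_like(d)
        for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
            ml_l = RandomForestRegressor(random_state=seed).fit(X[train], y[train])  # learns E[Y|X]
            ml_m = RandomForestRegressor(random_state=seed).fit(X[train], d[train])  # learns E[D|X]
            y_res[test] = y[test] - ml_l.predict(X[test])   # partial out the outcome nuisance
            d_res[test] = d[test] - ml_m.predict(X[test])   # partial out the treatment nuisance
        return np.sum(d_res * y_res) / np.sum(d_res * d_res)  # Neyman-orthogonal estimate of theta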
In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods."}, "https://arxiv.org/abs/2111.10275": {"title": "Composite Goodness-of-fit Tests with Kernels", "link": "https://arxiv.org/abs/2111.10275", "description": "arXiv:2111.10275v4 Announce Type: replace-cross \nAbstract: Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this paper, we propose one such method. More precisely, we propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on the maximum mean discrepancy and the kernel Stein discrepancy. They are widely applicable, including whenever the density of the parametric model is known up to normalisation constant, or if the model takes the form of a simulator. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of an unnormalised non-parametric density model, and an intractable generative model of a biological cellular network."}, "https://arxiv.org/abs/2203.12572": {"title": "Post-selection inference for e-value based confidence intervals", "link": "https://arxiv.org/abs/2203.12572", "description": "arXiv:2203.12572v4 Announce Type: replace-cross \nAbstract: Suppose that one can construct a valid $(1-\\delta)$-confidence interval (CI) for each of $K$ parameters of potential interest. If a data analyst uses an arbitrary data-dependent criterion to select some subset $S$ of parameters, then the aforementioned CIs for the selected parameters are no longer valid due to selection bias. We design a new method to adjust the intervals in order to control the false coverage rate (FCR). The main established method is the \"BY procedure\" by Benjamini and Yekutieli (JASA, 2005). The BY guarantees require certain restrictions on the selection criterion and on the dependence between the CIs. We propose a new simple method which, in contrast, is valid under any dependence structure between the original CIs, and any (unknown) selection criterion, but which only applies to a special, yet broad, class of CIs that we call e-CIs. To elaborate, our procedure simply reports $(1-\\delta|S|/K)$-CIs for the selected parameters, and we prove that it controls the FCR at $\\delta$ for confidence intervals that implicitly invert e-values; examples include those constructed via supermartingale methods, via universal inference, or via Chernoff-style bounds, among others. The e-BY procedure is admissible, and recovers the BY procedure as a special case via a particular calibrator. Our work also has implications for post-selection inference in sequential settings, since it applies at stopping times, to continuously-monitored confidence sequences, and under bandit sampling. 
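A minimal sketch of the adjustment rule stated above for the e-BY procedure: after an arbitrary data-dependent selection of |S| out of K parameters, each selected e-CI is simply reported at level 1 - delta*|S|/K (the construction of the e-CIs themselves is not shown here).

    def e_by_level(K, selected, delta=0.05):
        """Adjusted confidence level 1 - delta*|S|/K reported for every selected e-CI."""
        return 1.0 - delta * len(set(selected)) / K

    # Example: 100 candidate parameters, 4 selected by an arbitrary data-dependent rule.
    print(e_by_level(K=100, selected=[3, 17, 42, 99]))  # 0.998, i.e. each selected parameter gets a 99.8% e-CI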
We demonstrate the efficacy of our procedure using numerical simulations and real A/B testing data from Twitter."}, "https://arxiv.org/abs/2206.09107": {"title": "Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data", "link": "https://arxiv.org/abs/2206.09107", "description": "arXiv:2206.09107v2 Announce Type: replace-cross \nAbstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk."}, "https://arxiv.org/abs/2301.08994": {"title": "How to Measure Evidence and Its Strength: Bayes Factors or Relative Belief Ratios?", "link": "https://arxiv.org/abs/2301.08994", "description": "arXiv:2301.08994v2 Announce Type: replace-cross \nAbstract: Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. Certainly Bayes factors are regularly employed. The question then is: which of these measures of evidence is more appropriate? It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor based on a mixture prior and, when all is considered, the relative belief ratio has better properties as a measure of evidence. It is further shown that, when a natural restriction on the mixture prior is imposed, the Bayes factor equals the relative belief ratio obtained without using the mixture prior. Even with this restriction, this still leaves open the question of how the strength of evidence is to be measured. It is argued here that the current practice of using the size of the Bayes factor to measure strength is not correct and a solution to this issue is presented. 
Several general criticisms of these measures of evidence are also discussed and addressed."}, "https://arxiv.org/abs/2306.03816": {"title": "Parametrization, Prior Independence, and the Semiparametric Bernstein-von Mises Theorem for the Partially Linear Model", "link": "https://arxiv.org/abs/2306.03816", "description": "arXiv:2306.03816v4 Announce Type: replace-cross \nAbstract: I prove a semiparametric Bernstein-von Mises theorem for a partially linear regression model with independent priors for the low-dimensional parameter of interest and the infinite-dimensional nuisance parameters. My result avoids a prior invariance condition that arises from a loss of information in not knowing the nuisance parameter. The key idea is a feasible reparametrization of the regression function that mimics the Gaussian profile likelihood. This allows a researcher to assume independent priors for the model parameters while automatically accounting for the loss of information associated with not knowing the nuisance parameter. As these prior stability conditions often impose strong restrictions on the underlying data-generating process, my results provide a more robust asymptotic normality theorem than the original parametrization of the partially linear model."}, "https://arxiv.org/abs/2308.11050": {"title": "Optimal Dorfman Group Testing for Symmetric Distributions", "link": "https://arxiv.org/abs/2308.11050", "description": "arXiv:2308.11050v2 Announce Type: replace-cross \nAbstract: We study Dorfman's classical group testing protocol in a novel setting where individual specimen statuses are modeled as exchangeable random variables. We are motivated by infectious disease screening. In that case, specimens which arrive together for testing often originate from the same community and so their statuses may exhibit positive correlation. Dorfman's protocol screens a population of n specimens for a binary trait by partitioning it into non-overlapping groups, testing these, and only individually retesting the specimens of each positive group. The partition is chosen to minimize the expected number of tests under a probabilistic model of specimen statuses. We relax the typical assumption that these are independent and identically distributed and instead model them as exchangeable random variables. In this case, their joint distribution is symmetric in the sense that it is invariant under permutations. We give a characterization of such distributions in terms of a function q where q(h) is the marginal probability that any group of size h tests negative. We use this interpretable representation to show that the set partitioning problem arising in Dorfman's protocol can be reduced to an integer partitioning problem and efficiently solved. We apply these tools to an empirical dataset from the COVID-19 pandemic. The methodology helps explain the unexpectedly high empirical efficiency reported by the original investigators."}, "https://arxiv.org/abs/2309.15326": {"title": "Structural identifiability analysis of linear reaction-advection-diffusion processes in mathematical biology", "link": "https://arxiv.org/abs/2309.15326", "description": "arXiv:2309.15326v3 Announce Type: replace-cross \nAbstract: Effective application of mathematical models to interpret biological data and make accurate predictions often requires that model parameters are identifiable. 
Approaches to assess the so-called structural identifiability of models are well-established for ordinary differential equation models, yet there are no commonly adopted approaches that can be applied to assess the structural identifiability of the partial differential equation (PDE) models that are requisite to capture spatial features inherent to many phenomena. The differential algebra approach to structural identifiability has recently been demonstrated to be applicable to several specific PDE models. In this brief article, we present general methodology for performing structural identifiability analysis on partially observed reaction-advection-diffusion (RAD) PDE models that are linear in the unobserved quantities. We show that the differential algebra approach can always, in theory, be applied to such models. Moreover, despite the perceived complexity introduced by the addition of advection and diffusion terms, identifiability of spatial analogues of non-spatial models cannot decrease in structural identifiability. We conclude by discussing future possibilities and the computational cost of performing structural identifiability analysis on more general PDE models."}, "https://arxiv.org/abs/2310.19683": {"title": "An Online Bootstrap for Time Series", "link": "https://arxiv.org/abs/2310.19683", "description": "arXiv:2310.19683v2 Announce Type: replace-cross \nAbstract: Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments."}, "https://arxiv.org/abs/2311.07511": {"title": "Uncertainty estimation in satellite precipitation interpolation with machine learning", "link": "https://arxiv.org/abs/2311.07511", "description": "arXiv:2311.07511v2 Announce Type: replace-cross \nAbstract: Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We address this gap by benchmarking six algorithms, mostly novel for this task, for quantifying predictive uncertainty in spatial interpolation. On 15 years of monthly data over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). 
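For the precipitation benchmark above, a minimal sketch of the standard quantile (pinball) scoring function used to evaluate the algorithms in the comparison that continues below, under the usual definition:

    import numpy as np

    def pinball_loss(y_true, y_pred, level):
        """Quantile (pinball) score at one quantile level; lower average values are better."""
        diff = np.asarray(y_true, float) - np.asarray(y_pred, float)
        return np.mean(np.maximum(level * diff, (level - 1.0) * diff))

    # Average the per-level scores over the quantile levels of interest to compare
    # algorithms across the whole predictive distribution.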
Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Feature importance analysis revealed satellite precipitation (PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals) datasets) as the most informative predictor, followed by gauge elevation and distance to satellite grid points. Compared to QR, LightGBM showed improved performance with respect to the quantile scoring rule by 11.10%, followed by QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial interpolation with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in interpolating spatial data, supported with a formal evaluation framework based on scoring functions and scoring rules."}, "https://arxiv.org/abs/2311.18460": {"title": "Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework", "link": "https://arxiv.org/abs/2311.18460", "description": "arXiv:2311.18460v2 Announce Type: replace-cross \nAbstract: Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications."}, "https://arxiv.org/abs/2402.17915": {"title": "Generation and analysis of synthetic data via Bayesian networks: a robust approach for uncertainty quantification via Bayesian paradigm", "link": "https://arxiv.org/abs/2402.17915", "description": "arXiv:2402.17915v1 Announce Type: new \nAbstract: Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach considers the generation of synthetic data, to be disclosed instead of the original data. Efficient approaches ought to deal with the trade-off between reliability and confidentiality of the released data. Ultimately, the aim is to be able to reproduce as accurately as possible statistical analysis of the original data using the synthetic one. 
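As a toy illustration of the general idea behind the synthetic-data abstract above (not the paper's algorithm): once a Bayesian network has been estimated, a synthetic dataset can be drawn by ancestral sampling; the two-node network and its probabilities below are invented for the example.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_synthetic(n):
        """Ancestral sampling from a hand-specified two-node network A -> B (illustrative only)."""
        p_a = 0.3                               # P(A = 1), illustrative value
        p_b_given_a = np.array([0.2, 0.7])      # P(B = 1 | A = 0) and P(B = 1 | A = 1), illustrative
        a = rng.binomial(1, p_a, size=n)
        b = rng.binomial(1, p_b_given_a[a])
        return np.column_stack([a, b])

    synthetic = sample_synthetic(1000)          # released in place of the confidential records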
Bayesian networks offer a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and generate synthetic datasets. These ought not only to approximate the results of analyses with the original data but also robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generate and analyze synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology makes use of probability properties of the model to devise a computationally efficient algorithm to obtain the target predictive distributions via Monte Carlo. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples."}, "https://arxiv.org/abs/2402.17920": {"title": "Generalized Bayesian Additive Regression Trees for Restricted Mean Survival Time Inference", "link": "https://arxiv.org/abs/2402.17920", "description": "arXiv:2402.17920v1 Announce Type: new \nAbstract: Prediction methods for time-to-event outcomes often utilize survival models that rely on strong assumptions about noninformative censoring or on how individual-level covariates and survival functions are related. When the main interest is in predicting individual-level restricted mean survival times (RMST), reliance on such assumptions can lead to poor predictive performance if these assumptions are not satisfied. We propose a generalized Bayes framework that avoids full probability modeling of all survival outcomes by using an RMST-targeted loss function that depends on a collection of inverse probability of censoring weights (IPCW). In our generalized Bayes formulation, we utilize a flexible additive tree regression model for the RMST function, and the posterior distribution of interest is obtained through model-averaging IPCW-conditional loss function-based pseudo-Bayesian posteriors. Because informative censoring can be captured by the IPCW-dependent loss function, our approach only requires one to specify a model for the censoring distribution, thereby obviating the need for complex joint modeling to handle informative censoring. We evaluate the performance of our method through a series of simulations that compare it with several well-known survival machine learning methods, and we illustrate the application of our method using a multi-site cohort of breast cancer patients with clinical and genomic covariates."}, "https://arxiv.org/abs/2402.17984": {"title": "Sampling low-fidelity outputs for estimation of high-fidelity density and its tails", "link": "https://arxiv.org/abs/2402.17984", "description": "arXiv:2402.17984v1 Announce Type: new \nAbstract: In a multifidelity setting, data are available under the same conditions from two (or more) sources, e.g. computer codes, one being lower-fidelity but computationally cheaper, and the other higher-fidelity and more expensive. This work studies for which low-fidelity outputs one should obtain high-fidelity outputs, if the goal is to estimate the probability density function of the latter, especially when it comes to the distribution tails and extremes.
It is suggested to approach this problem from the perspective of the importance sampling of low-fidelity outputs according to some proposal distribution, combined with special considerations for the distribution tails based on extreme value theory. The notion of an optimal proposal distribution is introduced and investigated, in both theory and simulations. The approach is motivated and illustrated with an application to estimate the probability density function of record extremes of ship motions, obtained through two computer codes of different fidelities."}, "https://arxiv.org/abs/2402.18105": {"title": "JEL ratio test for independence between a continuous and a categorical random variable", "link": "https://arxiv.org/abs/2402.18105", "description": "arXiv:2402.18105v1 Announce Type: new \nAbstract: The categorical Gini covariance is a dependence measure between a numerical variable and a categorical variable. The Gini covariance measures dependence by quantifying the difference between the conditional and unconditional distributional functions. A value of zero for the categorical Gini covariance implies independence of the numerical variable and the categorical variable. We propose a non-parametric test for testing the independence between a numerical and a categorical variable using the categorical Gini covariance. We use the theory of U-statistics to derive the test statistic and study its properties. The test has an asymptotic normal distribution. As the implementation of a normal-based test is difficult, we develop a jackknife empirical likelihood (JEL) ratio test for testing independence. Extensive Monte Carlo simulation studies are carried out to validate the performance of the proposed JEL-based test. We illustrate the test procedure using a real data set."}, "https://arxiv.org/abs/2402.18130": {"title": "Sequential Change-point Detection for Compositional Time Series with Exogenous Variables", "link": "https://arxiv.org/abs/2402.18130", "description": "arXiv:2402.18130v1 Announce Type: new \nAbstract: Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used in data monitoring in practice. In this work, we consider sequential change-point detection for compositional time series, time series in which the observations are proportions. For fitting compositional time series, we propose a generalized Beta AR(1) model, which can incorporate exogenous variables upon which the time series observations are dependent. We show that the compositional time series are strictly stationary and geometrically ergodic, and we consider maximum likelihood estimation for the model parameters. We show that the partial MLEs are consistent and asymptotically normal, and we propose a parametric sequential change-point detection method for the compositional time series model. The change-point detection method is illustrated using a time series of Covid-19 positivity rates."}, "https://arxiv.org/abs/2402.18185": {"title": "Quantile Outcome Adaptive Lasso: Covariate Selection for Inverse Probability Weighting Estimator of Quantile Treatment Effects", "link": "https://arxiv.org/abs/2402.18185", "description": "arXiv:2402.18185v1 Announce Type: new \nAbstract: When using the propensity score method to estimate treatment effects, it is important to select the covariates to be included in the propensity score model.
The inclusion of covariates unrelated to the outcome in the propensity score model leads to bias and large variance in the estimator of treatment effects. Many data-driven covariate selection methods have been proposed for selecting covariates related to outcomes. However, most of them are designed for average treatment effect estimation and may not be suited to estimating quantile treatment effects (QTE), that is, the effect of treatment on the quantiles of the outcome distribution. In QTE estimation, covariates can be related to the outcome through its expected value or through its quantiles. To accommodate both, we propose a data-driven covariate selection method for propensity score models that allows for the selection of covariates related to the expected value and quantiles of the outcome for QTE estimation. Assuming a quantile regression model as the outcome regression model, covariate selection is performed using a regularization method with the partial regression coefficients of the quantile regression model as weights. The proposed method is applied to artificial data and a dataset of mothers and children born in King County, Washington, to compare its performance with that of existing methods and QTE estimators. As a result, the proposed method performs well in the presence of covariates related to both the expected value and quantiles of the outcome."}, "https://arxiv.org/abs/2402.18186": {"title": "Bayesian Geographically Weighted Regression using Fused Lasso Prior", "link": "https://arxiv.org/abs/2402.18186", "description": "arXiv:2402.18186v1 Announce Type: new \nAbstract: A main purpose of spatial data analysis is to predict the objective variable for the unobserved locations. Although Geographically Weighted Regression (GWR) is often used for this purpose, estimation instability proves to be an issue. To address this issue, Bayesian Geographically Weighted Regression (BGWR) has been proposed. In BGWR, setting the same prior distribution for all locations improves the stability of the coefficient estimates. However, when the density of observation locations varies over space, these methods do not sufficiently consider the similarity of coefficients among locations, and their prediction accuracy deteriorates. To solve these issues, we propose Bayesian Geographically Weighted Sparse Regression (BGWSR) that uses the Bayesian Fused Lasso for the prior distribution of the BGWR coefficients. Constraining the parameters to have the same values at adjacent locations is expected to improve the prediction accuracy at locations with a small number of adjacent locations. Furthermore, from the predictive distribution, it is also possible to evaluate the uncertainty of the predicted value of the objective variable. Numerical studies confirm that BGWSR has better prediction performance than the existing methods (GWR and BGWR) when the density of observation locations varies over space. Finally, BGWSR is applied to land price data in Tokyo; the results suggest that BGWSR has better prediction performance and smaller uncertainty than the existing methods."}, "https://arxiv.org/abs/2402.18298": {"title": "Mapping between measurement scales in meta-analysis, with application to measures of body mass index in children", "link": "https://arxiv.org/abs/2402.18298", "description": "arXiv:2402.18298v1 Announce Type: new \nAbstract: Quantitative evidence synthesis methods aim to combine data from multiple medical trials to infer relative effects of different interventions.
A challenge arises when trials report continuous outcomes on different measurement scales. To include all evidence in one coherent analysis, we require methods to `map' the outcomes onto a single scale. This is particularly challenging when trials report aggregate rather than individual data. We are motivated by a meta-analysis of interventions to prevent obesity in children. Trials report aggregate measurements of body mass index (BMI) either expressed as raw values or standardised for age and sex. We develop three methods for mapping between aggregate BMI data using known relationships between individual measurements on different scales. The first is an analytical method based on the mathematical definitions of z-scores and percentiles. The other two approaches involve sampling individual participant data on which to perform the conversions. One method is a straightforward sampling routine, while the other involves optimization with respect to the reported outcomes. In contrast to the analytical approach, these methods also have wider applicability for mapping between any pair of measurement scales with known or estimable individual-level relationships. We verify and contrast our methods using trials from our data set which report outcomes on multiple scales. We find that all methods recreate mean values with reasonable accuracy, but for standard deviations, optimization outperforms the other methods. However, the optimization method is more likely to underestimate standard deviations and is vulnerable to non-convergence."}, "https://arxiv.org/abs/2402.18450": {"title": "Copula Approximate Bayesian Computation Using Distribution Random Forests", "link": "https://arxiv.org/abs/2402.18450", "description": "arXiv:2402.18450v1 Announce Type: new \nAbstract: This paper introduces a novel Approximate Bayesian Computation (ABC) framework for estimating the posterior distribution and the maximum likelihood estimate (MLE) of the parameters of models defined by intractable likelihood functions. This framework can describe the possibly skewed and high dimensional posterior distribution by a novel multivariate copula-based distribution, based on univariate marginal posterior distributions which can account for skewness and be accurately estimated by Distribution Random Forests (DRF) while performing automatic summary statistics (covariates) selection, and on robustly estimated copula dependence parameters. The framework employs a novel multivariate mode estimator to perform for MLE and posterior mode estimation, and provides an optional step to perform model selection from a given set of models with posterior probabilities estimated by DRF. The posterior distribution estimation accuracy of the ABC framework is illustrated through simulation studies involving models with analytically computable posterior distributions, and involving exponential random graph and mechanistic network models which are each defined by an intractable likelihood from which it is costly to simulate large network datasets. 
Also, the framework is illustrated through analyses of large real-life networks of sizes ranging between 28,000 to 65.6 million nodes (between 3 million to 1.8 billion edges), including a large multilayer network with weighted directed edges."}, "https://arxiv.org/abs/2402.18533": {"title": "Constructing Bayesian Optimal Designs for Discrete Choice Experiments by Simulated Annealing", "link": "https://arxiv.org/abs/2402.18533", "description": "arXiv:2402.18533v1 Announce Type: new \nAbstract: Discrete Choice Experiments (DCEs) investigate the attributes that influence individuals' choices when selecting among various options. To enhance the quality of the estimated choice models, researchers opt for Bayesian optimal designs that utilize existing information about the attributes' preferences. Given the nonlinear nature of choice models, the construction of an appropriate design requires efficient algorithms. Among these, the Coordinate-Exchange (CE) algorithm is most commonly employed for constructing designs based on the multinomial logit model. Since this is a hill-climbing algorithm, obtaining better designs necessitates multiple random starting designs. This approach increases the algorithm's run-time, but may not lead to a significant improvement in results. We propose the use of a Simulated Annealing (SA) algorithm to construct Bayesian D-optimal designs. This algorithm accepts both superior and inferior solutions, avoiding premature convergence and allowing a more thorough exploration of potential designs. Consequently, it ultimately obtains higher-quality choice designs within the same time-frame. Our work represents the first application of an SA algorithm in constructing Bayesian optimal designs for DCEs. Through computational experiments and a real-life case study, we demonstrate that the SA designs consistently outperform the CE designs in terms of Bayesian D-efficiency, especially when the prior preference information is highly uncertain."}, "https://arxiv.org/abs/2402.17886": {"title": "Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion", "link": "https://arxiv.org/abs/2402.17886", "description": "arXiv:2402.17886v1 Announce Type: cross \nAbstract: This paper considers the problem of sampling from non-logconcave distribution, based on queries of its unnormalized density. It first describes a framework, Diffusion Monte Carlo (DMC), based on the simulation of a denoising diffusion process with its score function approximated by a generic Monte Carlo estimator. DMC is an oracle-based meta-algorithm, where its oracle is the assumed access to samples that generate a Monte Carlo score estimator. Then we provide an implementation of this oracle, based on rejection sampling, and this turns DMC into a true algorithm, termed Zeroth-Order Diffusion Monte Carlo (ZOD-MC). We provide convergence analyses by first constructing a general framework, i.e. a performance guarantee for DMC, without assuming the target distribution to be log-concave or satisfying any isoperimetric inequality. Then we prove that ZOD-MC admits an inverse polynomial dependence on the desired sampling accuracy, albeit still suffering from the curse of dimensionality. Consequently, for low dimensional distributions, ZOD-MC is a very efficient sampler, with performance exceeding latest samplers, including also-denoising-diffusion-based RDMC and RS-DMC. 
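To make the accept/reject step of the simulated annealing design search described in the discrete choice experiment abstract above concrete, here is a generic SA skeleton for maximizing a design efficiency criterion; the neighbour move, the criterion, and the cooling schedule are placeholders rather than the paper's settings.

    import math, random

    def simulated_annealing(design, efficiency, neighbour, t0=1.0, cooling=0.99, n_iter=5000):
        """Maximize an efficiency criterion; inferior moves are accepted with probability exp(delta/t)."""
        current = best = design
        t = t0
        for _ in range(n_iter):
            candidate = neighbour(current)                      # e.g. change one attribute level
            delta = efficiency(candidate) - efficiency(current)
            if delta >= 0 or random.random() < math.exp(delta / t):
                current = candidate                             # accept superior and some inferior moves
                if efficiency(current) > efficiency(best):
                    best = current
            t *= cooling                                        # gradually lower the temperature
        return best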
Last, we experimentally demonstrate the insensitivity of ZOD-MC to increasingly higher barriers between modes or discontinuity in non-convex potential."}, "https://arxiv.org/abs/2402.18242": {"title": "A network-constrain Weibull AFT model for biomarkers discovery", "link": "https://arxiv.org/abs/2402.18242", "description": "arXiv:2402.18242v1 Announce Type: cross \nAbstract: We propose AFTNet, a novel network-constraint survival analysis method based on the Weibull accelerated failure time (AFT) model solved by a penalized likelihood approach for variable selection and estimation. When using the log-linear representation, the inference problem becomes a structured sparse regression problem for which we explicitly incorporate the correlation patterns among predictors using a double penalty that promotes both sparsity and grouping effect. Moreover, we establish the theoretical consistency for the AFTNet estimator and present an efficient iterative computational algorithm based on the proximal gradient descent method. Finally, we evaluate AFTNet performance both on synthetic and real data examples."}, "https://arxiv.org/abs/2402.18392": {"title": "Unveiling the Potential of Robustness in Evaluating Causal Inference Models", "link": "https://arxiv.org/abs/2402.18392", "description": "arXiv:2402.18392v1 Announce Type: cross \nAbstract: The growing demand for personalized decision-making has led to a surge of interest in estimating the Conditional Average Treatment Effect (CATE). The intersection of machine learning and causal inference has yielded various effective CATE estimators. However, deploying these estimators in practice is often hindered by the absence of counterfactual labels, making it challenging to select the desirable CATE estimator using conventional model selection procedures like cross-validation. Existing approaches for CATE estimator selection, such as plug-in and pseudo-outcome metrics, face two inherent challenges. Firstly, they are required to determine the metric form and the underlying machine learning models for fitting nuisance parameters or plug-in learners. Secondly, they lack a specific focus on selecting a robust estimator. To address these challenges, this paper introduces a novel approach, the Distributionally Robust Metric (DRM), for CATE estimator selection. The proposed DRM not only eliminates the need to fit additional models but also excels at selecting a robust CATE estimator. Experimental studies demonstrate the efficacy of the DRM method, showcasing its consistent effectiveness in identifying superior estimators while mitigating the risk of selecting inferior ones."}, "https://arxiv.org/abs/2108.00306": {"title": "A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions", "link": "https://arxiv.org/abs/2108.00306", "description": "arXiv:2108.00306v5 Announce Type: replace \nAbstract: With advances in scientific computing and mathematical modeling, complex scientific phenomena such as galaxy formations and rocket propulsion can now be reliably simulated. Such simulations can however be very time-intensive, requiring millions of CPU hours to perform. One solution is multi-fidelity emulation, which uses data of different fidelities to train an efficient predictive model which emulates the expensive simulator. For complex scientific problems and with careful elicitation from scientists, such multi-fidelity data may often be linked by a directed acyclic graph (DAG) representing its scientific model dependencies. 
We thus propose a new Graphical Multi-fidelity Gaussian Process (GMGP) model, which embeds this DAG structure (capturing scientific dependencies) within a Gaussian process framework. We show that the GMGP has desirable modeling traits via two Markov properties, and admits a scalable algorithm for recursive computation of the posterior mean and variance along at each depth level of the DAG. We also present a novel experimental design methodology over the DAG given an experimental budget, and propose a nonlinear extension of the GMGP via deep Gaussian processes. The advantages of the GMGP are then demonstrated via a suite of numerical experiments and an application to emulation of heavy-ion collisions, which can be used to study the conditions of matter in the Universe shortly after the Big Bang. The proposed model has broader uses in data fusion applications with graphical structure, which we further discuss."}, "https://arxiv.org/abs/2112.02712": {"title": "Interpretable discriminant analysis for functional data supported on random nonlinear domains with an application to Alzheimer's disease", "link": "https://arxiv.org/abs/2112.02712", "description": "arXiv:2112.02712v2 Announce Type: replace \nAbstract: We introduce a novel framework for the classification of functional data supported on nonlinear, and possibly random, manifold domains. The motivating application is the identification of subjects with Alzheimer's disease from their cortical surface geometry and associated cortical thickness map. The proposed model is based upon a reformulation of the classification problem as a regularized multivariate functional linear regression model. This allows us to adopt a direct approach to the estimation of the most discriminant direction while controlling for its complexity with appropriate differential regularization. Our approach does not require prior estimation of the covariance structure of the functional predictors, which is computationally prohibitive in our application setting. We provide a theoretical analysis of the out-of-sample prediction error of the proposed model and explore the finite sample performance in a simulation setting. We apply the proposed method to a pooled dataset from the Alzheimer's Disease Neuroimaging Initiative and the Parkinson's Progression Markers Initiative. Through this application, we identify discriminant directions that capture both cortical geometric and thickness predictive features of Alzheimer's disease that are consistent with the existing neuroscience literature."}, "https://arxiv.org/abs/2202.03960": {"title": "Continuous permanent unobserved heterogeneity in dynamic discrete choice models", "link": "https://arxiv.org/abs/2202.03960", "description": "arXiv:2202.03960v3 Announce Type: replace \nAbstract: In dynamic discrete choice (DDC) analysis, it is common to use mixture models to control for unobserved heterogeneity. However, consistent estimation typically requires both restrictions on the support of unobserved heterogeneity and a high-level injectivity condition that is difficult to verify. This paper provides primitive conditions for point identification of a broad class of DDC models with multivariate continuous permanent unobserved heterogeneity. The results apply to both finite- and infinite-horizon DDC models, do not require a full support assumption, nor a long panel, and place no parametric restriction on the distribution of unobserved heterogeneity. 
In addition, I propose a seminonparametric estimator that is computationally attractive and can be implemented using familiar parametric methods."}, "https://arxiv.org/abs/2211.06755": {"title": "The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis", "link": "https://arxiv.org/abs/2211.06755", "description": "arXiv:2211.06755v3 Announce Type: replace \nAbstract: The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the `chiPower' transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are each identified with single compositional parts, not ratios."}, "https://arxiv.org/abs/2301.11057": {"title": "Empirical Bayes factors for common hypothesis tests", "link": "https://arxiv.org/abs/2301.11057", "description": "arXiv:2301.11057v4 Announce Type: replace \nAbstract: Bayes factors for composite hypotheses have difficulty in encoding vague prior knowledge, as improper priors cannot be used and objective priors may be subjectively unreasonable. To address these issues I revisit the posterior Bayes factor, in which the posterior distribution from the data at hand is re-used in the Bayes factor for the same data. I argue that this is biased when calibrated against proper Bayes factors, but propose adjustments to allow interpretation on the same scale. In the important case of a regular normal model, the bias in log scale is half the number of parameters. The resulting empirical Bayes factor is closely related to the widely applicable information criterion. I develop test-based empirical Bayes factors for several standard tests and propose an extension to multiple testing closely related to the optimal discovery procedure. When only a P-value is available, an approximate empirical Bayes factor is 10p. I propose interpreting the strength of Bayes factors on a logarithmic scale with base 3.73, reflecting the sharpest distinction between weaker and stronger belief.
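Taking the 10p approximation and the base-3.73 scale quoted above at face value, a minimal sketch of the corresponding computation is:

    import math

    def approx_empirical_bayes_factor(p_value):
        """Approximate empirical Bayes factor from a P-value via the 10p rule quoted above."""
        return 10.0 * p_value

    def strength_units(bayes_factor, base=3.73):
        """Express a Bayes factor on the proposed logarithmic scale with base 3.73."""
        return math.log(bayes_factor, base)

    bf = approx_empirical_bayes_factor(0.01)
    print(bf, strength_units(bf))  # 0.1 and roughly -1.75 units on the base-3.73 scale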
This provides an objective framework for interpreting statistical evidence, and realises a Bayesian/frequentist compromise."}, "https://arxiv.org/abs/2303.16599": {"title": "Difference-based covariance matrix estimate in time series nonparametric regression with applications to specification tests", "link": "https://arxiv.org/abs/2303.16599", "description": "arXiv:2303.16599v2 Announce Type: replace \nAbstract: Long-run covariance matrix estimation is the building block of time series inference. The corresponding difference-based estimator, which avoids detrending, has attracted considerable interest due to its robustness to both smooth and abrupt structural breaks and its competitive finite sample performance. However, existing methods mainly focus on estimators for the univariate process while their direct and multivariate extensions for most linear models are asymptotically biased. We propose a novel difference-based and debiased long-run covariance matrix estimator for functional linear models with time-varying regression coefficients, allowing time series non-stationarity, long-range dependence, state-heteroscedasticity and their mixtures. We apply the new estimator to (i) the structural stability test, overcoming the notorious non-monotonic power phenomena caused by piecewise smooth alternatives for regression coefficients, and (ii) the nonparametric residual-based tests for long memory, improving the performance via the residual-free formula of the proposed estimator. The effectiveness of the proposed method is justified theoretically and demonstrated by superior performance in simulation studies, while its usefulness is elaborated via real data analysis. Our method is implemented in the R package mlrv."}, "https://arxiv.org/abs/2307.11705": {"title": "Small Sample Inference for Two-way Capture Recapture Experiments", "link": "https://arxiv.org/abs/2307.11705", "description": "arXiv:2307.11705v2 Announce Type: replace \nAbstract: The properties of the generalized Waring distribution defined on the non-negative integers are reviewed. Formulas for its moments and its mode are given. A construction as a mixture of negative binomial distributions is also presented. Then we turn to the Petersen model for estimating the population size $N$ in a two-way capture recapture experiment. We construct a Bayesian model for $N$ by combining a Waring prior with the hypergeometric distribution for the number of units caught twice in the experiment. Credible intervals for $N$ are obtained using quantiles of the posterior, a generalized Waring distribution. The standard confidence interval for the population size, constructed using the asymptotic variance of the Petersen estimator, and the 0.5 logit transformed interval are shown to be special cases of the generalized Waring credible interval. The true coverage of this interval is shown to be greater than or equal to its nominal coverage in small populations, regardless of the capture probabilities. In addition, its length is substantially smaller than that of the 0.5 logit transformed interval.
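For orientation, a minimal sketch of the classical Petersen estimator and a Wald-type interval based on what is assumed here to be the usual asymptotic variance formula n1*n2*(n1-m)*(n2-m)/m^3 (the generalized Waring credible interval itself is not reproduced):

    import math

    def petersen(n1, n2, m, z=1.96):
        """Petersen estimate of population size with a Wald interval from the assumed variance formula."""
        n_hat = n1 * n2 / m
        se = math.sqrt(n1 * n2 * (n1 - m) * (n2 - m) / m**3)
        return n_hat, (n_hat - z * se, n_hat + z * se)

    # 200 units marked on the first occasion, 150 caught on the second, 30 caught twice.
    print(petersen(200, 150, 30))  # estimate 1000 with a symmetric interval of roughly (705, 1295)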
Thus the proposed generalized Waring credible interval appears to be the best way to quantify the uncertainty of the Petersen estimator for population size."}, "https://arxiv.org/abs/2310.18027": {"title": "Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors", "link": "https://arxiv.org/abs/2310.18027", "description": "arXiv:2310.18027v4 Announce Type: replace \nAbstract: Effective and rapid decision-making from randomized controlled trials (RCTs) requires unbiased and precise treatment effect inferences. Two strategies to address this requirement are to adjust for covariates that are highly correlated with the outcome, and to leverage historical control information via Bayes' theorem. We propose a new Bayesian prognostic covariate adjustment methodology, referred to as Bayesian PROCOVA, that combines these two strategies. Covariate adjustment in Bayesian PROCOVA is based on generative artificial intelligence (AI) algorithms that construct a digital twin generator (DTG) for RCT participants. The DTG is trained on historical control data and yields a digital twin (DT) probability distribution for each RCT participant's outcome under the control treatment. The expectation of the DT distribution, referred to as the prognostic score, defines the covariate for adjustment. Historical control information is leveraged via an additive mixture prior with two components: an informative prior probability distribution specified based on historical control data, and a weakly informative prior distribution. The mixture weight determines the extent to which posterior inferences are drawn from the informative component, versus the weakly informative component. This weight has a prior distribution as well, and so the entire additive mixture prior is completely pre-specifiable without involving any RCT information. We establish an efficient Gibbs algorithm for sampling from the posterior distribution, and derive closed-form expressions for the posterior mean and variance of the treatment effect parameter conditional on the weight, in Bayesian PROCOVA. We evaluate efficiency gains of Bayesian PROCOVA via its bias control and variance reduction compared to frequentist PROCOVA in simulation studies that encompass different discrepancies. These gains translate to smaller RCTs."}, "https://arxiv.org/abs/2311.09015": {"title": "Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach", "link": "https://arxiv.org/abs/2311.09015", "description": "arXiv:2311.09015v2 Announce Type: replace \nAbstract: We consider the task of identifying and estimating a parameter of interest in settings where data is missing not at random (MNAR). In general, such parameters are not identified without strong assumptions on the missing data model. In this paper, we take an alternative approach and introduce a method inspired by data fusion, where information in an MNAR dataset is augmented by information in an auxiliary dataset subject to missingness at random (MAR). We show that even if the parameter of interest cannot be identified given either dataset alone, it can be identified given pooled data, under two complementary sets of assumptions.
We derive an inverse probability weighted (IPW) estimator for identified parameters, and evaluate the performance of our estimation strategies via simulation studies and a data application."}, "https://arxiv.org/abs/2401.08941": {"title": "A Powerful and Precise Feature-level Filter using Group Knockoffs", "link": "https://arxiv.org/abs/2401.08941", "description": "arXiv:2401.08941v2 Announce Type: replace \nAbstract: Selecting important features that have substantial effects on the response with provable type-I error rate control is a fundamental concern in statistics, with wide-ranging practical applications. Existing knockoff filters, although shown to provide theoretical guarantees on false discovery rate (FDR) control, often struggle to strike a balance between high power and precision in pinpointing important features when there exist large groups of strongly correlated features. To address this challenge, we develop a new filter using group knockoffs to achieve both powerful and precise selection of important features. Via experiments on simulated data and analysis of a real Alzheimer's disease genetic dataset, it is found that the proposed filter can not only control the proportion of false discoveries but also identify important features with comparable power and greater precision than the existing group knockoffs filter."}, "https://arxiv.org/abs/2107.12365": {"title": "Inference for Heteroskedastic PCA with Missing Data", "link": "https://arxiv.org/abs/2107.12365", "description": "arXiv:2107.12365v2 Announce Type: replace-cross \nAbstract: This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels."}, "https://arxiv.org/abs/2306.05751": {"title": "Advancing Counterfactual Inference through Nonlinear Quantile Regression", "link": "https://arxiv.org/abs/2306.05751", "description": "arXiv:2306.05751v3 Announce Type: replace-cross \nAbstract: The capacity to address counterfactual \"what if\" inquiries is crucial for understanding and making use of causal influences. Traditional counterfactual inference, under Pearl's counterfactual framework, typically depends on having access to or estimating a structural causal model. Yet, in practice, this causal model is often unknown and might be challenging to identify. Hence, this paper aims to perform reliable counterfactual inference based solely on observational data and the (learned) qualitative causal structure, without necessitating a predefined causal model or even direct estimations of conditional distributions. 
To this end, we establish a novel connection between counterfactual inference and quantile regression and show that counterfactual inference can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual inference implemented with neural networks under a bi-level optimization scheme. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data, thereby providing an upper bound on the generalization error. Furthermore, empirical evidence demonstrates its superior statistical efficiency in comparison to existing methods. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions."}, "https://arxiv.org/abs/2307.16485": {"title": "Parameter Inference for Degenerate Diffusion Processes", "link": "https://arxiv.org/abs/2307.16485", "description": "arXiv:2307.16485v2 Announce Type: replace-cross \nAbstract: We study parametric inference for ergodic diffusion processes with a degenerate diffusion matrix. Existing research focuses on a particular class of hypo-elliptic SDEs, with components split into `rough'/`smooth' and noise from rough components propagating directly onto smooth ones, but some critical model classes arising in applications have yet to be explored. We aim to cover this gap, thus analyse the highly degenerate class of SDEs, where components split into further sub-groups. Such models include e.g. the notable case of generalised Langevin equations. We propose a tailored time-discretisation scheme and provide asymptotic results supporting our scheme in the context of high-frequency, full observations. The proposed discretisation scheme is applicable in much more general data regimes and is shown to overcome biases via simulation studies also in the practical case when only a smooth component is observed. Joint consideration of our study for highly degenerate SDEs and existing research provides a general `recipe' for the development of time-discretisation schemes to be used within statistical methods for general classes of hypo-elliptic SDEs."}, "https://arxiv.org/abs/2402.18612": {"title": "Understanding random forests and overfitting: a visualization and simulation study", "link": "https://arxiv.org/abs/2402.18612", "description": "arXiv:2402.18612v1 Announce Type: new \nAbstract: Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models trained with minimum node size 2 or 20 using ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a bigger peak, isolated events local peaks. 
In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models."}, "https://arxiv.org/abs/2402.18741": {"title": "Spectral Extraction of Unique Latent Variables", "link": "https://arxiv.org/abs/2402.18741", "description": "arXiv:2402.18741v1 Announce Type: new \nAbstract: Multimodal datasets contain observations generated by multiple types of sensors. Most works to date focus on uncovering latent structures in the data that appear in all modalities. However, important aspects of the data may appear in only one modality due to the differences between the sensors. Uncovering modality-specific attributes may provide insights into the sources of the variability of the data. For example, certain clusters may appear in the analysis of genetics but not in epigenetic markers. Another example is hyper-spectral satellite imaging, where various atmospheric and ground phenomena are detectable using different parts of the spectrum. In this paper, we address the problem of uncovering latent structures that are unique to a single modality. Our approach is based on computing a graph representation of datasets from two modalities and analyzing the differences between their connectivity patterns. We provide an asymptotic analysis of the convergence of our approach based on a product manifold model. To evaluate the performance of our method, we test its ability to uncover latent structures in multiple types of artificial and real datasets."}, "https://arxiv.org/abs/2402.18745": {"title": "Degree-heterogeneous Latent Class Analysis for High-dimensional Discrete Data", "link": "https://arxiv.org/abs/2402.18745", "description": "arXiv:2402.18745v1 Announce Type: new \nAbstract: The latent class model is a widely used mixture model for multivariate discrete data. Besides the existence of qualitatively heterogeneous latent classes, real data often exhibit additional quantitative heterogeneity nested within each latent class. The modern latent class analysis also faces extra challenges, including the high-dimensionality, sparsity, and heteroskedastic noise inherent in discrete data. Motivated by these phenomena, we introduce the Degree-heterogeneous Latent Class Model and propose a spectral approach to clustering and statistical inference in the challenging high-dimensional sparse data regime. We propose an easy-to-implement HeteroClustering algorithm. It uses heteroskedastic PCA with L2 normalization to remove degree effects and perform clustering in the top singular subspace of the data matrix. We establish an exponential error rate for HeteroClustering, leading to exact clustering under minimal signal-to-noise conditions. 
We further investigate the estimation and inference of the high-dimensional continuous item parameters in the model, which are crucial to interpreting and finding useful markers for latent classes. We provide comprehensive procedures for global testing and multiple testing of these parameters with valid error controls. The superior performance of our methods is demonstrated through extensive simulations and applications to three diverse real-world datasets from political voting records, genetic variations, and single-cell sequencing."}, "https://arxiv.org/abs/2402.18748": {"title": "Fast Bootstrapping Nonparametric Maximum Likelihood for Latent Mixture Models", "link": "https://arxiv.org/abs/2402.18748", "description": "arXiv:2402.18748v1 Announce Type: new \nAbstract: Estimating the mixing density of a latent mixture model is an important task in signal processing. Nonparametric maximum likelihood estimation is one popular approach to this problem. If the latent variable distribution is assumed to be continuous, then bootstrapping can be used to approximate it. However, traditional bootstrapping requires repeated evaluations on resampled data and is not scalable. In this letter, we construct a generative process to rapidly produce nonparametric maximum likelihood bootstrap estimates. Our method requires only a single evaluation of a novel two-stage optimization algorithm. Simulations and real data analyses demonstrate that our procedure accurately estimates the mixing density with little computational cost even when there are a hundred thousand observations."}, "https://arxiv.org/abs/2402.18900": {"title": "Prognostic Covariate Adjustment for Logistic Regression in Randomized Controlled Trials", "link": "https://arxiv.org/abs/2402.18900", "description": "arXiv:2402.18900v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) with binary primary endpoints introduce novel challenges for inferring the causal effects of treatments. The most significant challenge is non-collapsibility, in which the conditional odds ratio estimand under covariate adjustment differs from the unconditional estimand in the logistic regression analysis of RCT data. This issue gives rise to apparent paradoxes, such as the variance of the estimator for the conditional odds ratio from a covariate-adjusted model being greater than the variance of the estimator from the unadjusted model. We address this challenge in the context of adjustment based on predictions of control outcomes from generative artificial intelligence (AI) algorithms, which are referred to as prognostic scores. We demonstrate that prognostic score adjustment in logistic regression increases the power of the Wald test for the conditional odds ratio under a fixed sample size, or alternatively reduces the necessary sample size to achieve a desired power, compared to the unadjusted analysis. We derive formulae for prospective calculations of the power gain and sample size reduction that can result from adjustment for the prognostic score. Furthermore, we utilize g-computation to expand the scope of prognostic score adjustment to inferences on the marginal risk difference, relative risk, and odds ratio estimands. We demonstrate the validity of our formulae via extensive simulation studies that encompass different types of logistic regression model specifications. 
Our simulation studies also indicate how prognostic score adjustment can reduce the variance of g-computation estimators for the marginal estimands while maintaining frequentist properties such as asymptotic unbiasedness and Type I error rate control. Our methodology can ultimately enable more definitive and conclusive analyses for RCTs with binary primary endpoints."}, "https://arxiv.org/abs/2402.18904": {"title": "False Discovery Rate Control for Confounder Selection Using Mirror Statistics", "link": "https://arxiv.org/abs/2402.18904", "description": "arXiv:2402.18904v1 Announce Type: new \nAbstract: While data-driven confounder selection requires careful consideration, it is frequently employed in observational studies to adjust for confounding factors. Widely recognized criteria for confounder selection include the minimal set approach, which involves selecting variables relevant to both treatment and outcome, and the union set approach, which involves selecting variables for either treatment or outcome. These approaches are often implemented using heuristics and off-the-shelf statistical methods, where the degree of uncertainty may not be clear. In this paper, we focus on the false discovery rate (FDR) to measure uncertainty in confounder selection. We define the FDR specific to confounder selection and propose methods based on the mirror statistic, a recently developed approach for FDR control that does not rely on p-values. The proposed methods are free from p-values and require only the assumption of some symmetry in the distribution of the mirror statistic. They can be easily combined with sparse estimation and other methods for which deriving p-values is difficult. The properties of the proposed methods are investigated through exhaustive numerical experiments. Particularly in high-dimensional data scenarios, our method outperforms conventional methods."}, "https://arxiv.org/abs/2402.19021": {"title": "Enhancing the Power of Gaussian Graphical Model Inference by Modeling the Graph Structure", "link": "https://arxiv.org/abs/2402.19021", "description": "arXiv:2402.19021v1 Announce Type: new \nAbstract: For the problem of inferring a Gaussian graphical model (GGM), this work explores the application of a recent approach from the multiple testing literature for graph inference. The main idea of the method by Rebafka et al. (2022) is to model the data by a latent variable model, the so-called noisy stochastic block model (NSBM), and then use the associated ${\\ell}$-values to infer the graph. The inferred graph controls the false discovery rate, which means that the proportion of falsely declared edges does not exceed a user-defined nominal level. Here it is shown that any test statistic from the GGM literature can be used as input for the NSBM approach to perform GGM inference. To make the approach feasible in practice, a new, computationally efficient inference algorithm for the NSBM is developed relying on a greedy approach to maximize the integrated complete-data likelihood. Then an extensive numerical study illustrates that the NSBM approach outperforms the state of the art for any of the GGM test statistics considered here. 
In particular, in sparse settings and on real datasets, a significant gain in power is observed."}, "https://arxiv.org/abs/2402.19029": {"title": "Essential Properties of Type III* Methods", "link": "https://arxiv.org/abs/2402.19029", "description": "arXiv:2402.19029v1 Announce Type: new \nAbstract: Type III methods, introduced by SAS in 1976, formulate estimable functions that substitute, somehow, for classical ANOVA effects in multiple linear regression models. They have been controversial ever since, provoking wide use and satisfied users on the one hand and skepticism and scorn on the other. Their essential mathematical properties have not been established, although they are widely thought to be known: what those functions are, to what extent they coincide with classical ANOVA effects, and how they are affected by cell sample sizes, empty cells, and covariates. Those properties are established here."}, "https://arxiv.org/abs/2402.19046": {"title": "On the Improvement of Predictive Modeling Using Bayesian Stacking and Posterior Predictive Checking", "link": "https://arxiv.org/abs/2402.19046", "description": "arXiv:2402.19046v1 Announce Type: new \nAbstract: Model uncertainty is pervasive in real world analysis situations and is an often-neglected issue in applied statistics. However, standard approaches to the research process do not address the inherent uncertainty in model building and, thus, can lead to overconfident and misleading analysis interpretations. One strategy to incorporate more flexible models is to base inferences on predictive modeling. This approach provides an alternative to existing explanatory models, as inference is focused on the posterior predictive distribution of the response variable. Predictive modeling can advance explanatory ambitions in the social sciences and in addition enrich the understanding of social phenomena under investigation. Bayesian stacking is a methodological approach rooted in Bayesian predictive modeling. In this paper, we outline the method of Bayesian stacking but add to it the approach of posterior predictive checking (PPC) as a means of assessing the predictive quality of those elements of the stacking ensemble that are important to the research question. Thus, we introduce a viable workflow for incorporating PPC into predictive modeling using Bayesian stacking without presuming the existence of a true model. We apply these tools to the PISA 2018 data to investigate potential inequalities in reading competency with respect to gender and socio-economic background. Our empirical example serves as a rough guideline for practitioners who want to implement the concepts of predictive modeling and model uncertainty in their work on similar research questions."}, "https://arxiv.org/abs/2402.19109": {"title": "Confidence and Assurance of Percentiles", "link": "https://arxiv.org/abs/2402.19109", "description": "arXiv:2402.19109v1 Announce Type: new \nAbstract: The confidence interval of the mean is often used when quoting statistics. The same rigor is often missing when quoting percentiles and tolerance or percentile intervals. This article derives the expression for confidence in percentiles of a sample population. Confidence intervals of the median are compared to those of the mean for a few sample distributions. The concept of assurance from reliability engineering is then extended to percentiles. The assurance level of sorted samples simply matches the confidence and percentile levels. 
A numerical method to compute assurance using Brent's optimization method is provided as an open-source Python package."}, "https://arxiv.org/abs/2402.19346": {"title": "Recanting witness and natural direct effects: Violations of assumptions or definitions?", "link": "https://arxiv.org/abs/2402.19346", "description": "arXiv:2402.19346v1 Announce Type: new \nAbstract: There have been numerous publications on the advantages and disadvantages of estimating natural (pure) effects compared to controlled effects. One of the main criticisms of natural effects is that they require an additional assumption for identifiability, namely that the exposure does not cause a confounder of the mediator-outcome relationship. However, every analysis in every study should begin with a research question expressed in ordinary language. Researchers then develop/use mathematical expressions or estimators to best answer these ordinary language questions. When a recanting witness is present, the paper illustrates that there are no violations of assumptions. Rather, using directed acyclic graphs, the typical estimators for natural effects are simply no longer answering any meaningful question. Although some might view this as semantics, the proposed approach illustrates why the more recent methods of path-specific effects and separable effects are more valid and transparent compared to previous methods for decomposition analysis."}, "https://arxiv.org/abs/2402.19425": {"title": "Testing Information Ordering for Strategic Agents", "link": "https://arxiv.org/abs/2402.19425", "description": "arXiv:2402.19425v1 Announce Type: new \nAbstract: A key primitive of a strategic environment is the information available to players. Specifying a priori an information structure is often difficult for empirical researchers. We develop a test of information ordering that allows researchers to examine if the true information structure is at least as informative as a proposed baseline. We construct a computationally tractable test statistic by utilizing the notion of Bayes Correlated Equilibrium (BCE) to translate the ordering of information structures into an ordering of functions. We apply our test to examine whether hubs provide informational advantages to certain airlines in addition to market power."}, "https://arxiv.org/abs/2402.18651": {"title": "Quantifying Human Priors over Social and Navigation Networks", "link": "https://arxiv.org/abs/2402.18651", "description": "arXiv:2402.18651v1 Announce Type: cross \nAbstract: Human knowledge is largely implicit and relational -- do we have a friend in common? can I walk from here to there? In this work, we leverage the combinatorial structure of graphs to quantify human priors over such relational data. Our experiments focus on two domains that have been continuously relevant over evolutionary timescales: social interaction and spatial navigation. We find that some features of the inferred priors are remarkably consistent, such as the tendency for sparsity as a function of graph size. Other features are domain-specific, such as the propensity for triadic closure in social interactions. 
More broadly, our work demonstrates how nonclassical statistical analysis of indirect behavioral experiments can be used to efficiently model latent biases in the data."}, "https://arxiv.org/abs/2402.18689": {"title": "The VOROS: Lifting ROC curves to 3D", "link": "https://arxiv.org/abs/2402.18689", "description": "arXiv:2402.18689v1 Announce Type: cross \nAbstract: The area under the ROC curve is a common measure that is often used to rank the relative performance of different binary classifiers. However, as has been also previously noted, it can be a measure that ill-captures the benefits of different classifiers when either the true class values or misclassification costs are highly unbalanced between the two classes. We introduce a third dimension to capture these costs, and lift the ROC curve to a ROC surface in a natural way. We study both this surface and introduce the VOROS, the volume over this ROC surface, as a 3D generalization of the 2D area under the ROC curve. For problems where there are only bounds on the expected costs or class imbalances, we restrict consideration to the volume of the appropriate subregion of the ROC surface. We show how the VOROS can better capture the costs of different classifiers on both a classical and a modern example dataset."}, "https://arxiv.org/abs/2402.18810": {"title": "The numeraire e-variable", "link": "https://arxiv.org/abs/2402.18810", "description": "arXiv:2402.18810v1 Announce Type: cross \nAbstract: We consider testing a composite null hypothesis $\\mathcal{P}$ against a point alternative $\\mathbb{Q}$. This paper establishes a powerful and general result: under no conditions whatsoever on $\\mathcal{P}$ or $\\mathbb{Q}$, we show that there exists a special e-variable $X^*$ that we call the numeraire. It is strictly positive and for every $\\mathbb{P} \\in \\mathcal{P}$, $\\mathbb{E}_\\mathbb{P}[X^*] \\le 1$ (the e-variable property), while for every other e-variable $X$, we have $\\mathbb{E}_\\mathbb{Q}[X/X^*] \\le 1$ (the numeraire property). In particular, this implies $\\mathbb{E}_\\mathbb{Q}[\\log(X/X^*)] \\le 0$ (log-optimality). $X^*$ also identifies a particular sub-probability measure $\\mathbb{P}^*$ via the density $d \\mathbb{P}^*/d \\mathbb{Q} = 1/X^*$. As a result, $X^*$ can be seen as a generalized likelihood ratio of $\\mathbb{Q}$ against $\\mathcal{P}$. We show that $\\mathbb{P}^*$ coincides with the reverse information projection (RIPr) when additional assumptions are made that are required for the latter to exist. Thus $\\mathbb{P}^*$ is a natural definition of the RIPr in the absence of any assumptions on $\\mathcal{P}$ or $\\mathbb{Q}$. In addition to the abstract theory, we provide several tools for finding the numeraire in concrete cases. We discuss several nonparametric examples where we can indeed identify the numeraire, despite not having a reference measure. We end with a more general optimality theory that goes beyond the ubiquitous logarithmic utility. We focus on certain power utilities, leading to reverse R\\'enyi projections in place of the RIPr, which also always exists."}, "https://arxiv.org/abs/2402.18910": {"title": "DIGIC: Domain Generalizable Imitation Learning by Causal Discovery", "link": "https://arxiv.org/abs/2402.18910", "description": "arXiv:2402.18910v1 Announce Type: cross \nAbstract: Causality has been combined with machine learning to produce robust representations for domain generalization. 
Most existing methods of this type require massive data from multiple domains to identify causal features by cross-domain variations, which can be expensive or even infeasible and may lead to misidentification in some cases. In this work, we make a different attempt by leveraging the demonstration data distribution to discover the causal features for a domain generalizable policy. We design a novel framework, called DIGIC, to identify the causal features by finding the direct cause of the expert action from the demonstration data distribution via causal discovery. Our framework can achieve domain generalizable imitation learning with only single-domain data and serve as a complement for cross-domain variation-based methods under non-structural assumptions on the underlying causal models. Our empirical study in various control tasks shows that the proposed framework evidently improves the domain generalization performance and has comparable performance to the expert in the original domain simultaneously."}, "https://arxiv.org/abs/2402.18921": {"title": "Semi-Supervised U-statistics", "link": "https://arxiv.org/abs/2402.18921", "description": "arXiv:2402.18921v1 Announce Type: cross \nAbstract: Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework."}, "https://arxiv.org/abs/2402.19162": {"title": "A Bayesian approach to uncover spatio-temporal determinants of heterogeneity in repeated cross-sectional health surveys", "link": "https://arxiv.org/abs/2402.19162", "description": "arXiv:2402.19162v1 Announce Type: cross \nAbstract: In several countries, including Italy, a prominent approach to population health surveillance involves conducting repeated cross-sectional surveys at short intervals of time. These surveys gather information on the health status of individual respondents, including details on their behaviors, risk factors, and relevant socio-demographic information. While the collected data undoubtedly provides valuable information, modeling such data presents several challenges. For instance, in health risk models, it is essential to consider behavioral information, spatio-temporal dynamics, and disease co-occurrence. In response to these challenges, our work proposes a multivariate spatio-temporal logistic model for chronic disease diagnoses. 
Predictors are modeled using individual risk factor covariates and a latent individual propensity to the disease.\n Leveraging a state space formulation of the model, we construct a framework in which spatio-temporal heterogeneity in regression parameters is informed by exogenous spatial information, corresponding to different spatial contextual risk factors that may affect health and the occurrence of chronic diseases in different ways. To explore the utility and the effectiveness of our method, we analyze behavioral and risk factor surveillance data collected in Italy (PASSI), which is well-known as a country characterized by high peculiar administrative, social and territorial diversities reflected on high variability in morbidity among population subgroups."}, "https://arxiv.org/abs/2402.19214": {"title": "A Bayesian approach with Gaussian priors to the inverse problem of source identification in elliptic PDEs", "link": "https://arxiv.org/abs/2402.19214", "description": "arXiv:2402.19214v1 Announce Type: cross \nAbstract: We consider the statistical linear inverse problem of making inference on an unknown source function in an elliptic partial differential equation from noisy observations of its solution. We employ nonparametric Bayesian procedures based on Gaussian priors, leading to convenient conjugate formulae for posterior inference. We review recent results providing theoretical guarantees on the quality of the resulting posterior-based estimation and uncertainty quantification, and we discuss the application of the theory to the important classes of Gaussian series priors defined on the Dirichlet-Laplacian eigenbasis and Mat\\'ern process priors. We provide an implementation of posterior inference for both classes of priors, and investigate its performance in a numerical simulation study."}, "https://arxiv.org/abs/2402.19268": {"title": "Extremal quantiles of intermediate orders under two-way clustering", "link": "https://arxiv.org/abs/2402.19268", "description": "arXiv:2402.19268v1 Announce Type: cross \nAbstract: This paper investigates extremal quantiles under two-way cluster dependence. We demonstrate that the limiting distribution of the unconditional intermediate order quantiles in the tails converges to a Gaussian distribution. This is remarkable as two-way cluster dependence entails potential non-Gaussianity in general, but extremal quantiles do not suffer from this issue. Building upon this result, we extend our analysis to extremal quantile regressions of intermediate order."}, "https://arxiv.org/abs/2402.19399": {"title": "An Empirical Analysis of Scam Token on Ethereum Blockchain: Counterfeit tokens on Uniswap", "link": "https://arxiv.org/abs/2402.19399", "description": "arXiv:2402.19399v1 Announce Type: cross \nAbstract: This article presents an empirical investigation into the determinants of total revenue generated by counterfeit tokens on Uniswap. It offers a detailed overview of the counterfeit token fraud process, along with a systematic summary of characteristics associated with such fraudulent activities observed in Uniswap. The study primarily examines the relationship between revenue from counterfeit token scams and their defining characteristics, and analyzes the influence of market economic factors such as return on market capitalization and price return on Ethereum. 
Key findings include a significant increase in overall transactions of counterfeit tokens on their first day of fraud, and a rise in upfront fraud costs leading to corresponding increases in revenue. Furthermore, a negative correlation is identified between the total revenue of counterfeit tokens and the volatility of Ethereum market capitalization return, while price return volatility on Ethereum is found to have a positive impact on counterfeit token revenue, albeit requiring further investigation for a comprehensive understanding. Additionally, the number of subscribers for the real token correlates positively with the realized volume of scam tokens, indicating that a larger community following the legitimate token may inadvertently contribute to the visibility and success of counterfeit tokens. Conversely, the number of Telegram subscribers exhibits a negative impact on the realized volume of scam tokens, suggesting that a higher level of scrutiny or awareness within Telegram communities may act as a deterrent to fraudulent activities. Finally, the timing of when the scam token is introduced on the Ethereum blockchain may have a negative impact on its success. Notably, the cumulative amount scammed by only 42 counterfeit tokens amounted to almost 11214 Ether."}, "https://arxiv.org/abs/1701.07078": {"title": "Measurement-to-Track Association and Finite-Set Statistics", "link": "https://arxiv.org/abs/1701.07078", "description": "arXiv:1701.07078v2 Announce Type: replace \nAbstract: This is a shortened, clarified, and mathematically more rigorous version of the original arXiv version. Its first four findings remain unchanged from the original: 1) measurement-to-track associations (MTAs) in multitarget tracking (MTT) are heuristic and physically erroneous multitarget state models; 2) MTAs occur in the labeled random finite set (LRFS) approach only as purely mathematical abstractions that do not occur singly; 3) the LRFS approach is not a mathematically obfuscated replication of multi-hypothesis tracking (MHT); and 4) the conventional interpretation of MHT is more consistent with classical than Bayesian statistics. This version goes beyond the original in including the following additional main finding: 5) a generalized, RFS-like interpretation results in a correct Bayesian formulation of MHT, based on MTA likelihood functions and MTA Markov transitions."}, "https://arxiv.org/abs/1805.10869": {"title": "Tilting Approximate Models", "link": "https://arxiv.org/abs/1805.10869", "description": "arXiv:1805.10869v4 Announce Type: replace \nAbstract: Model approximations are common practice when estimating structural or quasi-structural models. The paper considers the econometric properties of estimators that utilize projections to reimpose information about the exact model in the form of conditional moments. The resulting estimator efficiently combines the information provided by the approximate law of motion and the moment conditions. The paper develops the corresponding asymptotic theory and provides simulation evidence that tilting substantially reduces the mean squared error for parameter estimates. It applies the methodology to pricing long-run risks in aggregate consumption in the US, where the model is solved using the Campbell and Shiller (1988) approximation. 
Tilting improves empirical fit and results suggest that approximation error is a source of upward bias in estimates of risk aversion and downward bias in the elasticity of intertemporal substitution."}, "https://arxiv.org/abs/2204.07672": {"title": "Abadie's Kappa and Weighting Estimators of the Local Average Treatment Effect", "link": "https://arxiv.org/abs/2204.07672", "description": "arXiv:2204.07672v4 Announce Type: replace \nAbstract: Recent research has demonstrated the importance of flexibly controlling for covariates in instrumental variables estimation. In this paper we study the finite sample and asymptotic properties of various weighting estimators of the local average treatment effect (LATE), motivated by Abadie's (2003) kappa theorem and offering the requisite flexibility relative to standard practice. We argue that two of the estimators under consideration, which are weight normalized, are generally preferable. Several other estimators, which are unnormalized, do not satisfy the properties of scale invariance with respect to the natural logarithm and translation invariance, thereby exhibiting sensitivity to the units of measurement when estimating the LATE in logs and the centering of the outcome variable more generally. We also demonstrate that, when noncompliance is one sided, certain weighting estimators have the advantage of being based on a denominator that is strictly greater than zero by construction. This is the case for only one of the two normalized estimators, and we recommend this estimator for wider use. We illustrate our findings with a simulation study and three empirical applications, which clearly document the sensitivity of unnormalized estimators to how the outcome variable is coded. We implement the proposed estimators in the Stata package kappalate."}, "https://arxiv.org/abs/2212.06669": {"title": "A scale of interpretation for likelihood ratios and Bayes factors", "link": "https://arxiv.org/abs/2212.06669", "description": "arXiv:2212.06669v3 Announce Type: replace \nAbstract: Several subjective proposals have been made for interpreting the strength of evidence in likelihood ratios and Bayes factors. I identify a more objective scaling by modelling the effect of evidence on belief. The resulting scale with base 3.73 aligns with previous proposals and may partly explain intuitions."}, "https://arxiv.org/abs/2304.05805": {"title": "GDP nowcasting with artificial neural networks: How much does long-term memory matter?", "link": "https://arxiv.org/abs/2304.05805", "description": "arXiv:2304.05805v3 Announce Type: replace \nAbstract: We apply artificial neural networks (ANNs) to nowcast quarterly GDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare the nowcasting performance of five different ANN architectures: the multilayer perceptron (MLP), the one-dimensional convolutional neural network (1D CNN), the Elman recurrent neural network (RNN), the long short-term memory network (LSTM), and the gated recurrent unit (GRU). The empirical analysis presents results from two distinctively different evaluation periods. The first (2012:Q1 -- 2019:Q4) is characterized by balanced economic growth, while the second (2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According to our results, longer input sequences result in more accurate nowcasts in periods of balanced economic growth. However, this effect ceases above a relatively low threshold value of around six quarters (eighteen months). 
During periods of economic turbulence (e.g., during the COVID-19 recession), longer input sequences do not help the models' predictive performance; instead, they seem to weaken their generalization capability. Combined results from the two evaluation periods indicate that architectural features enabling long-term memory do not result in more accurate nowcasts. Comparing network architectures, the 1D CNN has proved to be a highly suitable model for GDP nowcasting. The network has shown good nowcasting performance among the competitors during the first evaluation period and achieved the overall best accuracy during the second evaluation period. Consequently, we are the first in the literature to propose the application of the 1D CNN for economic nowcasting."}, "https://arxiv.org/abs/2305.01849": {"title": "Semiparametric Discovery and Estimation of Interaction in Mixed Exposures using Stochastic Interventions", "link": "https://arxiv.org/abs/2305.01849", "description": "arXiv:2305.01849v2 Announce Type: replace \nAbstract: This study introduces a nonparametric definition of interaction and provides an approach to both interaction discovery and efficient estimation of this parameter. Using stochastic shift interventions and ensemble machine learning, our approach identifies and quantifies interaction effects through a model-independent target parameter, estimated via targeted maximum likelihood and cross-validation. This method contrasts the expected outcomes of joint interventions with those of individual interventions. Validation through simulation and application to the National Institute of Environmental Health Sciences Mixtures Workshop data demonstrate the efficacy of our method in detecting true interaction directions and its consistency in identifying significant impacts of furan exposure on leukocyte telomere length. Our method, called SuperNOVA, advances the ability to analyze multiexposure interactions within high-dimensional data, offering significant methodological improvements to understand complex exposure dynamics in health research. We provide peer-reviewed open-source software that employs our proposed methodology in the \\texttt{SuperNOVA} R package."}, "https://arxiv.org/abs/2305.04634": {"title": "Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods", "link": "https://arxiv.org/abs/2305.04634", "description": "arXiv:2305.04634v3 Announce Type: replace \nAbstract: In spatial statistics, fast and accurate parameter estimation, coupled with a reliable means of uncertainty quantification, can be challenging when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or wholly intractable. In this work, we propose using convolutional neural networks to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling, which improves the accuracy of the neural likelihood surfaces. 
To demonstrate our approach, we compare neural likelihood surfaces and the resulting maximum likelihood estimates and approximate confidence regions with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian process and a Brown-Resnick process which have computationally intensive and intractable likelihoods, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate. The method is applicable to any spatial process on a grid from which fast simulations are available."}, "https://arxiv.org/abs/2306.10405": {"title": "A semi-parametric estimation method for quantile coherence with an application to bivariate financial time series clustering", "link": "https://arxiv.org/abs/2306.10405", "description": "arXiv:2306.10405v3 Announce Type: replace \nAbstract: In multivariate time series analysis, spectral coherence measures the linear dependency between two time series at different frequencies. However, real data applications often exhibit nonlinear dependency in the frequency domain. Conventional coherence analysis fails to capture such dependency. The quantile coherence, on the other hand, characterizes nonlinear dependency by defining the coherence at a set of quantile levels based on trigonometric quantile regression. This paper introduces a new estimation technique for quantile coherence. The proposed method is semi-parametric, which uses the parametric form of the spectrum of a vector autoregressive (VAR) model to approximate the quantile coherence, combined with nonparametric smoothing across quantiles. At a given quantile level, we compute the quantile autocovariance function (QACF) by performing the Fourier inverse transform of the quantile periodograms. Subsequently, we utilize the multivariate Durbin-Levinson algorithm to estimate the VAR parameters and derive the estimate of the quantile coherence. Finally, we smooth the preliminary estimate of quantile coherence across quantiles using a nonparametric smoother. Numerical results show that the proposed estimation method outperforms nonparametric methods. We show that quantile coherence-based bivariate time series clustering has advantages over the ordinary VAR coherence. For applications, the identified clusters of financial stocks by quantile coherence with a market benchmark are shown to have an intriguing and more informative structure of diversified investment portfolios that may be used by investors to make better decisions."}, "https://arxiv.org/abs/2202.08370": {"title": "CAREER: A Foundation Model for Labor Sequence Data", "link": "https://arxiv.org/abs/2202.08370", "description": "arXiv:2202.08370v4 Announce Type: replace-cross \nAbstract: Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a foundation model for job sequences. 
CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it on small longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use."}, "https://arxiv.org/abs/2210.14484": {"title": "Imputation of missing values in multi-view data", "link": "https://arxiv.org/abs/2210.14484", "description": "arXiv:2210.14484v3 Announce Type: replace-cross \nAbstract: Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible."}, "https://arxiv.org/abs/2309.16598": {"title": "Cross-Prediction-Powered Inference", "link": "https://arxiv.org/abs/2309.16598", "description": "arXiv:2309.16598v3 Announce Type: replace-cross \nAbstract: While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. 
We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability."}, "https://arxiv.org/abs/2311.08168": {"title": "Time-Uniform Confidence Spheres for Means of Random Vectors", "link": "https://arxiv.org/abs/2311.08168", "description": "arXiv:2311.08168v2 Announce Type: replace-cross \nAbstract: We derive and study time-uniform confidence spheres -- confidence sphere sequences (CSSs) -- which contain the mean of random vectors with high probability simultaneously across all sample sizes. Inspired by the original work of Catoni and Giulini, we unify and extend their analysis to both cover the sequential setting and handle a variety of distributional assumptions. Our results include an empirical-Bernstein CSS for bounded random vectors (resulting in a novel empirical-Bernstein confidence interval with asymptotic width scaling proportionally to the true unknown variance), CSSs for sub-$\\psi$ random vectors (which includes sub-gamma, sub-Poisson, and sub-exponential), and CSSs for heavy-tailed random vectors (two moments only). Finally, we provide two CSSs that are robust to contamination by Huber noise. The first is a robust version of our empirical-Bernstein CSS, and the second extends recent work in the univariate setting to heavy-tailed multivariate distributions."}, "https://arxiv.org/abs/2403.00080": {"title": "Spatio-temporal modeling for record-breaking temperature events in Spain", "link": "https://arxiv.org/abs/2403.00080", "description": "arXiv:2403.00080v1 Announce Type: new \nAbstract: Record-breaking temperature events now appear very frequently in the news, viewed as evidence of climate change. With this as motivation, we undertake the first substantial spatial modeling investigation of temperature record-breaking across years for any given day within the year. We work with a dataset consisting of over sixty years (1960-2021) of daily maximum temperatures across peninsular Spain. Formal statistical analysis of record-breaking events is an area that has received attention primarily within the probability community, dominated by results for the stationary record-breaking setting with some additional work addressing trends. Such effort is inadequate for analyzing actual record-breaking data. Effective analysis requires rich modeling of the indicator events which define record-breaking sequences. Resulting from novel and detailed exploratory data analysis, we propose hierarchical conditional models for the indicator events. After suitable model selection, we discover explicit trend behavior, necessary autoregression, significance of distance to the coast, useful interactions, helpful spatial random effects, and very strong daily random effects. 
Illustratively, the model estimates that global warming trends have increased the number of records expected in the past decade almost two-fold, 1.93 (1.89,1.98), but also estimates highly differentiated climate warming rates in space and by season."}, "https://arxiv.org/abs/2403.00140": {"title": "Estimating the linear relation between variables that are never jointly observed: an application in in vivo experiments", "link": "https://arxiv.org/abs/2403.00140", "description": "arXiv:2403.00140v1 Announce Type: new \nAbstract: This work is motivated by in vivo experiments in which measurements are destructive, so that the variables of interest can never be observed simultaneously when the aim is to estimate the regression coefficients of a linear regression. Assuming that the global experiment can be decomposed into sub-experiments (corresponding for example to different doses) with distinct first moments, we propose different estimators of the linear regression which take account of that additional information. We consider estimators based on moments as well as estimators based on optimal transport theory. These estimators are proved to be consistent as well as asymptotically Gaussian under weak hypotheses. The asymptotic variance has no explicit expression, except in some particular cases, and specific bootstrap approaches are developed to build confidence intervals for the estimated parameter. A Monte Carlo study is conducted to assess and compare the finite sample performances of the different approaches."}, "https://arxiv.org/abs/2403.00224": {"title": "Tobit models for count time series", "link": "https://arxiv.org/abs/2403.00224", "description": "arXiv:2403.00224v1 Announce Type: new \nAbstract: Several models for count time series have been developed during the last decades, often inspired by traditional autoregressive moving average (ARMA) models for real-valued time series, including integer-valued ARMA (INARMA) and integer-valued generalized autoregressive conditional heteroscedasticity (INGARCH) models. Both INARMA and INGARCH models exhibit an ARMA-like autocorrelation function (ACF). To achieve negative ACF values within the class of INGARCH models, log and softplus link functions are suggested in the literature, where the softplus approach leads to conditional linearity in good approximation. However, the softplus approach is limited to the INGARCH family for unbounded counts, i.e. it can neither be used for bounded counts, nor for count processes from the INARMA family. In this paper, we present an alternative solution, named the Tobit approach, for achieving approximate linearity together with negative ACF values, which is more generally applicable than the softplus approach. A Skellam--Tobit INGARCH model for unbounded counts is studied in detail, including stationarity, approximate computation of moments, maximum likelihood and censored least absolute deviations estimation for unknown parameters and corresponding simulations. Extensions of the Tobit approach to other situations are also discussed, including underlying discrete distributions, INAR models, and bounded counts. 
Three real-data examples are considered to illustrate the usefulness of the new approach."}, "https://arxiv.org/abs/2403.00237": {"title": "Stable Reduced-Rank VAR Identification", "link": "https://arxiv.org/abs/2403.00237", "description": "arXiv:2403.00237v1 Announce Type: new \nAbstract: The vector autoregression (VAR) has been widely used in system identification, econometrics, natural science, and many other areas. However, when the state dimension becomes large the parameter dimension explodes. So rank reduced modelling is attractive and is well developed. But a fundamental requirement in almost all applications is stability of the fitted model. And this has not been addressed in the rank reduced case. Here, we develop, for the first time, a closed-form formula for an estimator of a rank reduced transition matrix which is guaranteed to be stable. We show that our estimator is consistent and asymptotically statistically efficient and illustrate it in comparative simulations."}, "https://arxiv.org/abs/2403.00281": {"title": "Wavelet Based Periodic Autoregressive Moving Average Models", "link": "https://arxiv.org/abs/2403.00281", "description": "arXiv:2403.00281v1 Announce Type: new \nAbstract: This paper proposes a wavelet-based method for analysing periodic autoregressive moving average (PARMA) time series. Even though Fourier analysis provides an effective method for analysing periodic time series, it requires the estimation of a large number of Fourier parameters when the PARMA parameters do not vary smoothly. The wavelet-based analysis helps us to obtain a parsimonious model with a reduced number of parameters. We have illustrated this with simulated and actual data sets."}, "https://arxiv.org/abs/2403.00304": {"title": "Coherent forecasting of NoGeAR(1) model", "link": "https://arxiv.org/abs/2403.00304", "description": "arXiv:2403.00304v1 Announce Type: new \nAbstract: This article focuses on the coherent forecasting of the recently introduced novel geometric AR(1) (NoGeAR(1)) model, an INAR model based on the inflated-parameter binomial thinning approach. Various techniques are available to achieve h-step-ahead coherent forecasts of count time series, like median and mode forecasting. However, there is little literature addressing coherent forecasting in the context of overdispersed count time series. Here, we study the forecasting distribution corresponding to the NoGeAR(1) process using the Monte Carlo (MC) approximation method. Accordingly, several forecasting measures are employed in the simulation study to facilitate a thorough comparison of the forecasting capability of NoGeAR(1) with other models. The methodology is also demonstrated using real-life data, specifically the data on CW{\\ss} TeXpert downloads and Barbados COVID-19 data."}, "https://arxiv.org/abs/2403.00347": {"title": "Set-Valued Control Functions", "link": "https://arxiv.org/abs/2403.00347", "description": "arXiv:2403.00347v1 Announce Type: new \nAbstract: The control function approach allows the researcher to identify various causal effects of interest. While powerful, it requires a strong invertibility assumption, which limits its applicability. This paper expands the scope of the nonparametric control function approach by allowing the control function to be set-valued and derives sharp bounds on structural parameters. 
The proposed generalization accommodates a wide range of selection processes involving discrete endogenous variables, random coefficients, treatment selections with interference, and dynamic treatment selections."}, "https://arxiv.org/abs/2403.00383": {"title": "The Mollified (Discrete) Uniform Distribution and its Applications", "link": "https://arxiv.org/abs/2403.00383", "description": "arXiv:2403.00383v1 Announce Type: new \nAbstract: The mollified uniform distribution, a ``soft'' version of the continuous uniform distribution, is rediscovered. Important stochastic properties are derived and used to demonstrate potential fields of applications. For example, it constitutes a model covering platykurtic, mesokurtic and leptokurtic shapes. Its cumulative distribution function may also serve as the soft-clipping response function for defining generalized linear models with approximately linear dependence. Furthermore, it might be considered for teaching, as an appealing example of the convolution of random variables. Finally, a discrete type of mollified uniform distribution is briefly discussed as well."}, "https://arxiv.org/abs/2403.00422": {"title": "Inference for Interval-Identified Parameters Selected from an Estimated Set", "link": "https://arxiv.org/abs/2403.00422", "description": "arXiv:2403.00422v1 Announce Type: new \nAbstract: Interval identification of parameters such as average treatment effects, average partial effects and welfare is particularly common when using observational data and experimental data with imperfect compliance due to the endogeneity of individuals' treatment uptake. In this setting, a treatment or policy will typically become an object of interest to the researcher when it is either selected from the estimated set of best-performers or arises from a data-dependent selection rule. In this paper, we develop new inference tools for interval-identified parameters chosen via these forms of selection. We develop three types of confidence intervals for data-dependent and interval-identified parameters, discuss how they apply to several examples of interest and prove their uniform asymptotic validity under weak assumptions."}, "https://arxiv.org/abs/2403.00429": {"title": "Population Power Curves in ASCA with Permutation Testing", "link": "https://arxiv.org/abs/2403.00429", "description": "arXiv:2403.00429v1 Announce Type: new \nAbstract: In this paper, we revisit the Power Curves in ANOVA Simultaneous Component Analysis (ASCA) based on permutation testing, and introduce the Population Curves derived from population parameters describing the relative effect among factors and interactions. We distinguish Relative from Absolute Population Curves, where the former represent statistical power in terms of the normalized effect size between structure and noise, and the latter in terms of the sample size. Relative Population Curves are useful to find the optimal ASCA model (e.g., fixed/random factors, crossed/nested relationships, interactions, the test statistic, transformations, etc.) for the analysis of an experimental design at hand. Absolute Population Curves are useful to determine the sample size and the optimal number of levels for each factor during the planning phase of an experiment. We illustrate both types of curves through simulation.
We expect Population Curves to become the go-to approach to plan the optimal analysis pipeline and the required sample size in an omics study analyzed with ASCA."}, "https://arxiv.org/abs/2403.00458": {"title": "Prices and preferences in the electric vehicle market", "link": "https://arxiv.org/abs/2403.00458", "description": "arXiv:2403.00458v1 Announce Type: new \nAbstract: Although electric vehicles are less polluting than gasoline powered vehicles, adoption is challenged by higher procurement prices. Existing discourse emphasizes EV battery costs as being principally responsible for this price differential and widespread adoption is routinely conditioned upon battery costs declining. We scrutinize such reasoning by sourcing data on EV attributes and market conditions between 2011 and 2023. Our findings are fourfold. First, EV prices are influenced principally by the number of amenities, additional features, and dealer-installed accessories sold as standard on an EV, and to a lesser extent, by EV horsepower. Second, EV range is negatively correlated with EV price, implying that range anxiety concerns may be less consequential than existing discourse suggests. Third, battery capacity is positively correlated with EV price, due to more capacity being synonymous with the delivery of more horsepower. Collectively, this suggests that higher procurement prices for EVs reflect consumer preference for vehicles that are feature-dense and more powerful. Fourth and finally, accommodating these preferences has produced vehicles with lower fuel economy, a shift that reduces envisioned lifecycle emissions benefits by at least 3.26 percent, subject to the battery pack chemistry leveraged and the carbon intensity of the electrical grid. These findings warrant attention as decarbonization efforts increasingly emphasize electrification as a pathway for complying with domestic and international climate agreements."}, "https://arxiv.org/abs/2403.00508": {"title": "Changepoint problem with angular data using a measure of variation based on the intrinsic geometry of torus", "link": "https://arxiv.org/abs/2403.00508", "description": "arXiv:2403.00508v1 Announce Type: new \nAbstract: In many temporally ordered data sets, it is observed that the parameters of the underlying distribution change abruptly at unknown times. The detection of such changepoints is important for many applications. While this problem has been studied substantially in the linear data setup, not much work has been done for angular data. In this article, we utilize the intrinsic geometry of a torus to introduce the notion of the `square of an angle' and use it to propose a new measure of variation, called the `curved variance', of an angular random variable. Using the above ideas, we propose new tests for the existence of changepoint(s) in the concentration, mean direction, and/or both of these. The limiting distributions of the test statistics are derived and their powers are obtained using extensive simulation. It is seen that the tests have better power than the corresponding existing tests. The proposed methods have been implemented on three real-life data sets revealing interesting insights.
In particular, our method when used to detect simultaneous changes in mean direction and concentration for hourly wind direction measurements of the cyclonic storm `Amphan' identified changepoints that could be associated with important meteorological events."}, "https://arxiv.org/abs/2403.00600": {"title": "Random Interval Distillation for Detecting Multiple Changes in General Dependent Data", "link": "https://arxiv.org/abs/2403.00600", "description": "arXiv:2403.00600v1 Announce Type: new \nAbstract: We propose a new and generic approach for detecting multiple change-points in general dependent data, termed random interval distillation (RID). By collecting random intervals with sufficient strength of signals and reassembling them into a sequence of informative short intervals, our new approach captures the shifts in signal characteristics across diverse dependent data forms including locally stationary high-dimensional time series and dynamic networks with Markov formation. We further propose a range of secondary refinements tailored to various data types to enhance the localization precision. Notably, for univariate time series and low-rank autoregressive networks, our methods achieve the minimax optimality as their independent counterparts. For practical applications, we introduce a clustering-based and data-driven procedure to determine the optimal threshold for signal strength, which is adaptable to a wide array of dependent data scenarios utilizing the connection between RID and clustering. Additionally, our method has been extended to identify kinks and changes in signals characterized by piecewise polynomial trends. We examine the effectiveness and usefulness of our methodology via extensive simulation studies and a real data example, implementing it in the R-package rid."}, "https://arxiv.org/abs/2403.00617": {"title": "Distortion in Correspondence Analysis and in Taxicab Correspondence Analysis: A Comparison", "link": "https://arxiv.org/abs/2403.00617", "description": "arXiv:2403.00617v1 Announce Type: new \nAbstract: Distortion is a fundamental well-studied topic in dimension reduction papers, and intimately related with the underlying intrinsic dimension of a mapping of a high dimensional data set onto a lower dimension. In this paper, we study embedding distortions produced by Correspondence Analysis and its robust l1 variant Taxicab Correspondence analysis, which are visualization methods for contingency tables. For high dimensional data, distortions in Correspondence Analysis are contractions; while distortions in Taxicab Correspondence Analysis could be contractions or stretchings. This shows that Euclidean geometry is quite rigid, because of the orthogonality property; while Taxicab geometry is quite flexible, because the orthogonality property is replaced by the conjugacy property."}, "https://arxiv.org/abs/2403.00639": {"title": "Hierarchical Bayesian Models to Mitigate Systematic Disparities in Prediction with Proxy Outcomes", "link": "https://arxiv.org/abs/2403.00639", "description": "arXiv:2403.00639v1 Announce Type: new \nAbstract: Label bias occurs when the outcome of interest is not directly observable and instead modeling is performed with proxy labels. When the difference between the true outcome and the proxy label is correlated with predictors, this can yield systematic disparities in predictions for different groups of interest. We propose Bayesian hierarchical measurement models to address these issues. 
Through practical examples, we demonstrate how our approach improves accuracy and helps with algorithmic fairness."}, "https://arxiv.org/abs/2403.00687": {"title": "Structurally Aware Robust Model Selection for Mixtures", "link": "https://arxiv.org/abs/2403.00687", "description": "arXiv:2403.00687v1 Announce Type: new \nAbstract: Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. Thus, as the number of observations increases, rather than obtaining better inferences, the opposite occurs: the data is explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. However, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice-versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations."}, "https://arxiv.org/abs/2403.00701": {"title": "Bayesian Model Averaging for Partial Ordering Continual Reassessment Methods", "link": "https://arxiv.org/abs/2403.00701", "description": "arXiv:2403.00701v1 Announce Type: new \nAbstract: Phase I clinical trials are essential to bringing novel therapies from chemical development to widespread use. Traditional approaches to dose-finding in Phase I trials, such as the '3+3' method and the Continual Reassessment Method (CRM), provide a principled approach for escalating across dose levels. However, these methods lack the ability to incorporate uncertainty regarding the dose-toxicity ordering as found in combination drug trials. Under this setting, dose-levels vary across multiple drugs simultaneously, leading to multiple possible dose-toxicity orderings. The Partial Ordering CRM (POCRM) extends to these settings by allowing for multiple dose-toxicity orderings. In this work, it is shown that the POCRM is vulnerable to 'estimation incoherency' whereby toxicity estimates shift in an illogical way, threatening patient safety and undermining clinician trust in dose-finding models. To this end, the Bayesian model averaged POCRM (BMA-POCRM) is proposed. BMA-POCRM uses Bayesian model averaging to take into account all possible orderings simultaneously, reducing the frequency of estimation incoherencies. The effectiveness of BMA-POCRM in drug combination settings is demonstrated through a specific instance of estimation incoherency of the POCRM and through simulation studies.
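A minimal sketch of the model-averaging idea behind BMA-POCRM follows, assuming a toy one-parameter power-model CRM, two hypothetical orderings, made-up trial data, and crude grid integration; it is not the paper's implementation.

```python
import numpy as np

# Two hypothetical orderings of prior toxicity guesses (skeletons) for the
# same four drug combinations; real applications would have more of each.
skeletons = [np.array([0.05, 0.12, 0.25, 0.40]),   # ordering 1
             np.array([0.05, 0.25, 0.12, 0.40])]   # ordering 2 swaps combos 2 and 3

# Hypothetical trial data so far: patients treated / toxicities per combination
n_pat = np.array([3, 3, 3, 0])
n_tox = np.array([0, 1, 2, 0])

a_grid = np.linspace(-4.0, 4.0, 2001)              # grid over the CRM parameter
da = a_grid[1] - a_grid[0]
prior_a = np.exp(-a_grid**2 / (2 * 1.34)) / np.sqrt(2 * np.pi * 1.34)  # N(0, 1.34)

evidence = np.zeros(len(skeletons))
post_tox = np.zeros((len(skeletons), n_pat.size))
for m, skel in enumerate(skeletons):
    # one-parameter power working model: P(tox at combo d) = skeleton[d] ** exp(a)
    p = skel[None, :] ** np.exp(a_grid)[:, None]
    loglik = (n_tox * np.log(p) + (n_pat - n_tox) * np.log(1.0 - p)).sum(axis=1)
    integrand = np.exp(loglik) * prior_a
    evidence[m] = integrand.sum() * da               # marginal likelihood of ordering m
    post_tox[m] = (integrand[:, None] * p).sum(axis=0) * da / evidence[m]

weights = evidence / evidence.sum()                  # equal prior over orderings
print("posterior ordering weights:", np.round(weights, 3))
print("model-averaged toxicity estimates:", np.round(weights @ post_tox, 3))
```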
The results highlight the improved safety, accuracy and reduced occurrence of estimate incoherency in trials applying the BMA-POCRM relative to the POCRM model."}, "https://arxiv.org/abs/2403.00158": {"title": "Automated Efficient Estimation using Monte Carlo Efficient Influence Functions", "link": "https://arxiv.org/abs/2403.00158", "description": "arXiv:2403.00158v1 Announce Type: cross \nAbstract: Many practical problems involve estimating low dimensional statistical quantities with high-dimensional models and datasets. Several approaches address these estimation tasks based on the theory of influence functions, such as debiased/double ML or targeted minimum loss estimation. This paper introduces \\textit{Monte Carlo Efficient Influence Functions} (MC-EIF), a fully automated technique for approximating efficient influence functions that integrates seamlessly with existing differentiable probabilistic programming systems. MC-EIF automates efficient statistical estimation for a broad class of models and target functionals that would previously require rigorous custom analysis. We prove that MC-EIF is consistent, and that estimators using MC-EIF achieve optimal $\\sqrt{N}$ convergence rates. We show empirically that estimators using MC-EIF are at parity with estimators using analytic EIFs. Finally, we demonstrate a novel capstone example using MC-EIF for optimal portfolio selection."}, "https://arxiv.org/abs/2403.00694": {"title": "Defining Expertise: Applications to Treatment Effect Estimation", "link": "https://arxiv.org/abs/2403.00694", "description": "arXiv:2403.00694v1 Announce Type: cross \nAbstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and \"expertise\" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection."}, "https://arxiv.org/abs/2403.00749": {"title": "Shrinkage estimators in zero-inflated Bell regression model with application", "link": "https://arxiv.org/abs/2403.00749", "description": "arXiv:2403.00749v1 Announce Type: cross \nAbstract: We propose Stein-type estimators for zero-inflated Bell regression models by incorporating information on model parameters. These estimators combine the advantages of unrestricted and restricted estimators. 
We derive the asymptotic distributional properties, including bias and mean squared error, for the proposed shrinkage estimators. Monte Carlo simulations demonstrate the superior performance of our shrinkage estimators across various scenarios. Furthermore, we apply the proposed estimators to analyze a real dataset, showcasing their practical utility."}, "https://arxiv.org/abs/2107.08112": {"title": "Hamiltonian Monte Carlo for Regression with High-Dimensional Categorical Data", "link": "https://arxiv.org/abs/2107.08112", "description": "arXiv:2107.08112v2 Announce Type: replace \nAbstract: Latent variable models are increasingly used in economics for high-dimensional categorical data like text and surveys. We demonstrate the effectiveness of Hamiltonian Monte Carlo (HMC) with parallelized automatic differentiation for analyzing such data in a computationally efficient and methodologically sound manner. Our new model, Supervised Topic Model with Covariates, shows that carefully modeling this type of data can have significant implications on conclusions compared to a simpler, frequently used, yet methodologically problematic, two-step approach. A simulation study and revisiting Bandiera et al. (2020)'s study of executive time use demonstrate these results. The approach accommodates thousands of parameters and doesn't require custom algorithms specific to each model, making it accessible for applied researchers"}, "https://arxiv.org/abs/2111.12258": {"title": "On Recoding Ordered Treatments as Binary Indicators", "link": "https://arxiv.org/abs/2111.12258", "description": "arXiv:2111.12258v4 Announce Type: replace \nAbstract: Researchers using instrumental variables to investigate ordered treatments often recode treatment into an indicator for any exposure. We investigate this estimand under the assumption that the instruments shift compliers from no treatment to some but not from some treatment to more. We show that when there are extensive margin compliers only (EMCO) this estimand captures a weighted average of treatment effects that can be partially unbundled into each complier group's potential outcome means. We also establish an equivalence between EMCO and a two-factor selection model and apply our results to study treatment heterogeneity in the Oregon Health Insurance Experiment."}, "https://arxiv.org/abs/2211.16364": {"title": "Disentangling the structure of ecological bipartite networks from observation processes", "link": "https://arxiv.org/abs/2211.16364", "description": "arXiv:2211.16364v2 Announce Type: replace \nAbstract: The structure of a bipartite interaction network can be described by providing a clustering for each of the two types of nodes. Such clusterings are outputted by fitting a Latent Block Model (LBM) on an observed network that comes from a sampling of species interactions in the field. However, the sampling is limited and possibly uneven. This may jeopardize the fit of the LBM and then the description of the structure of the network by detecting structures which result from the sampling and not from actual underlying ecological phenomena. If the observed interaction network consists of a weighted bipartite network where the number of observed interactions between two species is available, the sampling efforts for all species can be estimated and used to correct the LBM fit. We propose to combine an observation model that accounts for sampling and an LBM for describing the structure of underlying possible ecological interactions. 
We develop an original inference procedure for this model, the efficiency of which is demonstrated in simulation studies. The practical interest of our model in ecology is highlighted on a large plant-pollinator network dataset."}, "https://arxiv.org/abs/2212.07602": {"title": "Modifying Survival Models To Accommodate Thresholding Behavior", "link": "https://arxiv.org/abs/2212.07602", "description": "arXiv:2212.07602v2 Announce Type: replace \nAbstract: Survival models capture the relationship between an accumulating hazard and the occurrence of a singular event stimulated by that accumulation. When the model for the hazard is sufficiently flexible, survival models can accommodate a wide range of behaviors. If the hazard model is less flexible, for example when it is constrained by an external physical process, then the resulting survival model can be much too rigid. In this paper I introduce a modified survival model that generalizes the relationship between accumulating hazard and event occurrence with particular emphasis on capturing thresholding behavior. Finally, I demonstrate the utility of this approach on a physiological application."}, "https://arxiv.org/abs/2212.13294": {"title": "Multivariate Bayesian variable selection with application to multi-trait genetic fine mapping", "link": "https://arxiv.org/abs/2212.13294", "description": "arXiv:2212.13294v3 Announce Type: replace \nAbstract: Variable selection has played a critical role in modern statistical learning and scientific discoveries. Numerous regularization and Bayesian variable selection methods have been developed in the past two decades for variable selection, but most of these methods consider selecting variables for only one response. As more data is being collected nowadays, it is common to analyze multiple related responses from the same study. Existing multivariate variable selection methods select variables for all responses without considering the possible heterogeneity across different responses, i.e. some features may only predict a subset of responses but not the rest. Motivated by the multi-trait fine mapping problem in genetics to identify the causal variants for multiple related traits, we developed a novel multivariate Bayesian variable selection method to select critical predictors from a large number of grouped predictors that target multiple correlated and possibly heterogeneous responses. Our new method features selection at multiple levels, incorporation of prior biological knowledge to guide selection, and identification of the best subset of responses that the predictors target. We showed the advantage of our method via extensive simulations and a real fine mapping example to identify causal variants associated with different subsets of addictive behaviors."}, "https://arxiv.org/abs/2302.07976": {"title": "Discovery of Critical Thresholds in Mixed Exposures and Estimation of Policy Intervention Effects using Targeted Learning", "link": "https://arxiv.org/abs/2302.07976", "description": "arXiv:2302.07976v2 Announce Type: replace \nAbstract: Traditional regulations of chemical exposure tend to focus on single exposures, overlooking the potential amplified toxicity due to multiple concurrent exposures. We are interested in understanding the average outcome if exposures were limited to fall under a multivariate threshold. Because threshold levels are often unknown \\textit{a priori}, we provide an algorithm that finds exposure threshold levels where the expected outcome is maximized or minimized.
Because both identifying thresholds and estimating policy effects on the same data would lead to overfitting bias, we also provide a data-adaptive estimation framework, which allows for both threshold discovery and policy estimation. Simulation studies show asymptotic convergence to the optimal exposure region and to the true effect of an intervention. We demonstrate how our method identifies true interactions in a public synthetic mixture data set. Finally, we applied our method to NHANES data to discover metal exposures that have the most harmful effects on telomere length. We provide an implementation in the \\texttt{CVtreeMLE} R package."}, "https://arxiv.org/abs/2304.00059": {"title": "Resolving power: A general approach to compare the distinguishing ability of threshold-free evaluation metrics", "link": "https://arxiv.org/abs/2304.00059", "description": "arXiv:2304.00059v2 Announce Type: replace \nAbstract: Selecting an evaluation metric is fundamental to model development, but uncertainty remains about when certain metrics are preferable and why. This paper introduces the concept of resolving power to describe the ability of an evaluation metric to distinguish between binary classifiers of similar quality. This ability depends on two attributes: 1. The metric's response to improvements in classifier quality (its signal), and 2. The metric's sampling variability (its noise). The paper defines resolving power generically as a metric's sampling uncertainty scaled by its signal. The primary application of resolving power is to assess threshold-free evaluation metrics, such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation study compares the AUROC and the AUPRC in a variety of contexts. It finds that the AUROC generally has greater resolving power, but that the AUPRC is better when searching among high-quality classifiers applied to low prevalence outcomes. The paper concludes by proposing an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model."}, "https://arxiv.org/abs/2306.08559": {"title": "Inference in IV models with clustered dependence, many instruments and weak identification", "link": "https://arxiv.org/abs/2306.08559", "description": "arXiv:2306.08559v2 Announce Type: replace \nAbstract: Data clustering reduces the effective sample size from the number of observations towards the number of clusters. For instrumental variable models I show that this reduced effective sample size makes the instruments more likely to be weak, in the sense that they contain little information about the endogenous regressor, and many, in the sense that their number is large compared to the sample size. Clustered data therefore increases the need for many and weak instrument robust tests. However, none of the previously developed many and weak instrument robust tests can be applied to this type of data as they all require independent observations. I therefore adapt two types of such tests to clustered data. First, I derive cluster jackknife Anderson-Rubin and score tests by removing clusters rather than individual observations from the statistics. Second, I propose a cluster many instrument Anderson-Rubin test which improves on the first type of tests by using a more optimal, but more complex, weighting matrix. I show that if the clusters satisfy an invariance assumption the higher complexity poses no problems. 
By revisiting a study on the effect of queenly reign on war, I show the empirical relevance of the new tests."}, "https://arxiv.org/abs/2306.10635": {"title": "Finite Population Survey Sampling: An Unapologetic Bayesian Perspective", "link": "https://arxiv.org/abs/2306.10635", "description": "arXiv:2306.10635v2 Announce Type: replace \nAbstract: This article attempts to offer some perspectives on Bayesian inference for finite population quantities when the units in the population are assumed to exhibit complex dependencies. Beginning with an overview of Bayesian hierarchical models, including some that yield design-based Horvitz-Thompson estimators, the article proceeds to introduce dependence in finite populations and sets out inferential frameworks for ignorable and nonignorable responses. Multivariate dependencies using graphical models and spatial processes are discussed and some salient features of two recent analyses for spatial finite populations are presented."}, "https://arxiv.org/abs/2307.05234": {"title": "CR-Lasso: Robust cellwise regularized sparse regression", "link": "https://arxiv.org/abs/2307.05234", "description": "arXiv:2307.05234v2 Announce Type: replace \nAbstract: Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. We propose CR-Lasso, a robust Lasso-type cellwise regularization procedure that performs feature selection in the presence of cellwise outliers by minimising a regression loss and cell deviation measure simultaneously. To evaluate the approach, we conduct empirical studies comparing its selection and prediction performance with several sparse regression methods. We show that CR-Lasso is competitive under the settings considered. We illustrate the effectiveness of the proposed method on real data through an analysis of a bone mineral density dataset."}, "https://arxiv.org/abs/2308.02918": {"title": "Spectral Ranking Inferences based on General Multiway Comparisons", "link": "https://arxiv.org/abs/2308.02918", "description": "arXiv:2308.02918v3 Announce Type: replace \nAbstract: This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a general and more realistic setup. Specifically, the comparison graph consists of hyper-edges of possible heterogeneous sizes, and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in scenarios where the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings.
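The spectral estimator discussed in the ranking abstract above can be illustrated in its simplest special case, pairwise BTL-type comparisons, using a rank-centrality-style Markov chain whose stationary distribution recovers the preference scores. The data are synthetic, and this vanilla equal-weighting version omits the paper's general hyper-edges and two-step optimal weighting.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic pairwise (BTL) comparisons among four items with known scores
true_scores = np.array([1.0, 2.0, 4.0, 8.0])
n_items = true_scores.size
wins = np.zeros((n_items, n_items))                # wins[i, j]: times i beat j
for _ in range(2000):
    i, j = rng.choice(n_items, size=2, replace=False)
    p_ij = true_scores[i] / (true_scores[i] + true_scores[j])
    winner, loser = (i, j) if rng.random() < p_ij else (j, i)
    wins[winner, loser] += 1

# Vanilla spectral estimator: a random walk moves from i to j in proportion to
# how often j beat i; its stationary distribution recovers the scores up to scale.
n_comp = wins + wins.T
frac_lost = wins.T / np.maximum(n_comp, 1.0)       # share of (i, j) comparisons that i lost
P = frac_lost / n_items
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))           # lazy self-loops make rows sum to one

eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
pi /= pi.sum()
print("spectral estimate:", np.round(pi, 3))
print("true (normalized):", np.round(true_scores / true_scores.sum(), 3))
```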
It is noteworthy that this is the first time effective two-sample rank testing methods have been proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences for statistical journals and movie rankings."}, "https://arxiv.org/abs/2310.17009": {"title": "Simulation-based stacking", "link": "https://arxiv.org/abs/2310.17009", "description": "arXiv:2310.17009v2 Announce Type: replace \nAbstract: Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task."}, "https://arxiv.org/abs/2310.17434": {"title": "The `Why' behind including `Y' in your imputation model", "link": "https://arxiv.org/abs/2310.17434", "description": "arXiv:2310.17434v2 Announce Type: replace \nAbstract: Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. Here, we investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not necessarily clear if this recommendation always holds and why this is sometimes true. We examine deterministic imputation (i.e., single imputation with fixed values) and stochastic imputation (i.e., single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This paper aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model."}, "https://arxiv.org/abs/2204.10375": {"title": "lpcde: Estimation and Inference for Local Polynomial Conditional Density Estimators", "link": "https://arxiv.org/abs/2204.10375", "description": "arXiv:2204.10375v2 Announce Type: replace-cross \nAbstract: This paper discusses the R package lpcde, which stands for local polynomial conditional density estimation. It implements the kernel-based local polynomial smoothing methods introduced in Cattaneo, Chandak, Jansson, Ma (2024) for statistical estimation and inference of conditional distributions, densities, and derivatives thereof.
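The claim examined in the imputation abstract above can be checked with a short simulation, shown below; the bivariate-normal data-generating values are arbitrary illustrations, and the conditional draw of X given Y uses the closed form available in this toy setup.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 200_000, 0.5

# Outcome Y depends on covariate X; X is missing completely at random for half the units
x = rng.normal(size=n)
y = beta * x + rng.normal(size=n)
miss = rng.random(n) < 0.5

def slope(x_filled):
    return np.polyfit(x_filled, y, 1)[0]   # OLS slope of Y on the filled-in covariate

# Stochastic imputation WITHOUT the outcome: draw X from its marginal distribution
x_no_y = x.copy()
x_no_y[miss] = rng.normal(size=miss.sum())

# Stochastic imputation WITH the outcome: draw X from X | Y, which for this
# bivariate normal setup is N(beta*Y/(beta^2+1), 1/(beta^2+1))
x_with_y = x.copy()
x_with_y[miss] = (beta / (beta**2 + 1)) * y[miss] + rng.normal(
    scale=np.sqrt(1 / (beta**2 + 1)), size=miss.sum())

print("true slope of Y on X:       ", beta)
print("imputation ignoring Y gives:", round(slope(x_no_y), 3))   # attenuated toward zero
print("imputation using Y gives:   ", round(slope(x_with_y), 3)) # approximately unbiased
```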
The package offers mean square error optimal bandwidth selection and associated point estimators, as well as uncertainty quantification based on robust bias correction both pointwise (e.g., confidence intervals) and uniformly (e.g., confidence bands) over evaluation points. The methods implemented are boundary adaptive whenever the data is compactly supported. The package also implements regularized conditional density estimation methods, ensuring the resulting density estimate is non-negative and integrates to one. We contrast the functionalities of lpcde with existing R packages for conditional density estimation, and showcase its main features using simulated data."}, "https://arxiv.org/abs/2210.01282": {"title": "Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees", "link": "https://arxiv.org/abs/2210.01282", "description": "arXiv:2210.01282v3 Announce Type: replace-cross \nAbstract: We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks."}, "https://arxiv.org/abs/2304.09126": {"title": "An improved BISG for inferring race from surname and geolocation", "link": "https://arxiv.org/abs/2304.09126", "description": "arXiv:2304.09126v3 Announce Type: replace-cross \nAbstract: Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity using an individual's geolocation and surname. Here we demonstrate that statistical dependence of surname and geolocation within racial/ethnic categories in the United States results in biases for minority subpopulations, and we introduce a raking-based improvement. Our method augments the data used by BISG--distributions of race by geolocation and race by surname--with the distribution of surname by geolocation obtained from state voter files. 
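The raking step referred to in the improved-BISG abstract above is iterative proportional fitting of a table to fixed margins; a generic sketch with made-up numbers (unrelated to any voter file) follows.

```python
import numpy as np

def rake(table, row_targets, col_targets, n_iter=100, tol=1e-10):
    """Iterative proportional fitting: rescale rows and columns of a
    nonnegative table until its margins match the target margins."""
    t = table.astype(float).copy()
    for _ in range(n_iter):
        t *= (row_targets / t.sum(axis=1))[:, None]   # match row margins
        t *= (col_targets / t.sum(axis=0))[None, :]   # match column margins
        if np.allclose(t.sum(axis=1), row_targets, atol=tol):
            break
    return t

# Made-up 2x3 seed table and target margins (both margins share the same total)
seed = np.array([[10.0, 20.0, 30.0],
                 [30.0, 20.0, 10.0]])
row_targets = np.array([70.0, 50.0])
col_targets = np.array([40.0, 40.0, 40.0])
fitted = rake(seed, row_targets, col_targets)
print(np.round(fitted, 2))
print("row sums:", fitted.sum(axis=1), "col sums:", fitted.sum(axis=0))
```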
We validate our algorithm on state voter registration lists that contain self-identified race/ethnicity."}, "https://arxiv.org/abs/2305.14916": {"title": "Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting", "link": "https://arxiv.org/abs/2305.14916", "description": "arXiv:2305.14916v2 Announce Type: replace-cross \nAbstract: We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning."}, "https://arxiv.org/abs/2401.04190": {"title": "Is it possible to know cosmological fine-tuning?", "link": "https://arxiv.org/abs/2401.04190", "description": "arXiv:2401.04190v2 Announce Type: replace-cross \nAbstract: Fine-tuning studies whether some physical parameters, or relevant ratios between them, are located within so-called life-permitting intervals of small probability outside of which carbon-based life would not be possible. Recent developments have found estimates of these probabilities that circumvent previous concerns of measurability and selection bias. However, the question remains if fine-tuning can indeed be known. Using a mathematization of the epistemological concepts of learning and knowledge acquisition, we argue that most examples that have been touted as fine-tuned cannot be formally assessed as such. Nevertheless, fine-tuning can be known when the physical parameter is seen as a random variable and it is supported in the nonnegative real line, provided the size of the life-permitting interval is small in relation to the observed value of the parameter."}, "https://arxiv.org/abs/2403.00957": {"title": "Resolution of Simpson's paradox via the common cause principle", "link": "https://arxiv.org/abs/2403.00957", "description": "arXiv:2403.00957v1 Announce Type: new \nAbstract: Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given the third (lurking) random variable $B$. We focus on scenarios when the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. For such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This set-up generalizes the original Simpson's paradox. 
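A compact numerical reminder of the aggregation-versus-conditioning tension that the Simpson's-paradox abstract above starts from, using stylized counts and without implementing the paper's common-cause resolution:

```python
import numpy as np

# Stylized (success, failure) counts for option a1 vs. its complement,
# within two strata of a lurking variable B
strata = {
    "B=0": {"a1": (81, 6),   "not_a1": (234, 36)},
    "B=1": {"a1": (192, 71), "not_a1": (55, 25)},
}

def rate(counts):
    success, failure = counts
    return success / (success + failure)

totals = {"a1": np.zeros(2), "not_a1": np.zeros(2)}
for name, s in strata.items():
    print(name, " P(a2|a1) =", round(rate(s["a1"]), 3),
          " P(a2|not a1) =", round(rate(s["not_a1"]), 3))
    totals["a1"] += s["a1"]
    totals["not_a1"] += s["not_a1"]

# Marginalizing over B reverses the direction of the association
print("aggregated", " P(a2|a1) =", round(rate(totals["a1"]), 3),
      " P(a2|not a1) =", round(rate(totals["not_a1"]), 3))
```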
Now its two contradicting options simply refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and the most widespread situation for valid Simpson's paradox), the conditioning over any binary common cause $C$ establishes the same direction of the association between $a_1$ and $a_2$ as the conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ and not its marginalization. For tertiary (unobserved) common causes $C$ all three options of Simpson's paradox become possible (i.e. marginalized, conditional, and none of them), and one needs prior information on $C$ to choose the right option."}, "https://arxiv.org/abs/2403.00968": {"title": "The Bridged Posterior: Optimization, Profile Likelihood and a New Approach to Generalized Bayes", "link": "https://arxiv.org/abs/2403.00968", "description": "arXiv:2403.00968v1 Announce Type: new \nAbstract: Optimization is widely used in statistics, thanks to its efficiency for delivering point estimates on useful spaces, such as those satisfying low cardinality or combinatorial structure. To quantify uncertainty, Gibbs posterior exponentiates the negative loss function to form a posterior density. Nevertheless, Gibbs posteriors are supported in a high-dimensional space, and do not inherit the computational efficiency or constraint formulations from optimization. In this article, we explore a new generalized Bayes approach, viewing the likelihood as a function of data, parameters, and latent variables conditionally determined by an optimization sub-problem. Marginally, the latent variable given the data remains stochastic, and is characterized by its posterior distribution. This framework, coined ``bridged posterior'', conforms to the Bayesian paradigm. Besides providing a novel generative model, we obtain a positively surprising theoretical finding that under mild conditions, the $\\sqrt{n}$-adjusted posterior distribution of the parameters under our model converges to the same normal distribution as that of the canonical integrated posterior. Therefore, our result formally dispels a long-held belief that partial optimization of latent variables may lead to under-estimation of parameter uncertainty. We demonstrate the practical advantages of our approach under several settings, including maximum-margin classification, latent normal models, and harmonization of multiple networks."}, "https://arxiv.org/abs/2403.01000": {"title": "A linear mixed model approach for measurement error adjustment: applications to sedentary behavior assessment from wearable devices", "link": "https://arxiv.org/abs/2403.01000", "description": "arXiv:2403.01000v1 Announce Type: new \nAbstract: In recent years, wearable devices have become more common to capture a wide range of health behaviors, especially for physical activity and sedentary behavior. These sensor-based measures are deemed to be objective and thus less prone to self-reported biases, inherent in questionnaire assessments. While this is undoubtedly a major advantage, there can still be measurement errors from the device recordings, which pose serious challenges for conducting statistical analysis and obtaining unbiased risk estimates. There is a vast literature proposing statistical methods for adjusting for measurement errors in self-reported behaviors, such as in dietary intake. 
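The two-stage BLUP correction described in the measurement-error abstract above can be sketched for the simplest balanced random-intercept case, with variance components estimated by one-way ANOVA moments; all numbers are simulated and illustrative, not from any cohort.

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj, n_days, beta = 5000, 7, 0.3

# Simulated truth: each subject's long-run behavior, noisy daily device
# measures, and a health outcome driven by the truth
truth = rng.normal(0.0, 1.0, n_subj)
device = truth[:, None] + rng.normal(0.0, 1.5, (n_subj, n_days))
outcome = beta * truth + rng.normal(0.0, 1.0, n_subj)

# Stage 1: random-intercept variance components by one-way ANOVA moments,
# then BLUPs shrink each subject's device mean toward the overall mean
xbar = device.mean(axis=1)
sigma2_e = device.var(axis=1, ddof=1).mean()               # within-subject variance
sigma2_b = max(xbar.var(ddof=1) - sigma2_e / n_days, 0.0)  # between-subject variance
shrink = sigma2_b / (sigma2_b + sigma2_e / n_days)
blup = xbar.mean() + shrink * (xbar - xbar.mean())

# Stage 2: regress the outcome on the BLUP instead of the raw device mean
naive = np.polyfit(xbar, outcome, 1)[0]
corrected = np.polyfit(blup, outcome, 1)[0]
print("true effect:", beta,
      " naive:", round(naive, 3),
      " BLUP-corrected:", round(corrected, 3))
```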
However, there is much less research on error correction for sensor-based device measures, especially sedentary behavior. In this paper, we address this gap. Exploiting the excessive multiple-day assessments typically collected when sensor devices are deployed, we propose a two-stage linear mixed effect model (LME) based approach to correct bias caused by measurement errors. We provide theoretical proof of the debiasing process using the Best Linear Unbiased Predictors (BLUP), and use both simulation and real data from a cohort study to demonstrate the performance of the proposed approach while comparing to the na\\\"ive plug-in approach that directly uses device measures without appropriately adjusting measurement errors. Our results indicate that employing our easy-to-implement BLUP correction method can greatly reduce biases in disease risk estimates and thus enhance the validity of study findings."}, "https://arxiv.org/abs/2403.01150": {"title": "Error Analysis of a Simple Quaternion Estimator: the Gaussian Case", "link": "https://arxiv.org/abs/2403.01150", "description": "arXiv:2403.01150v1 Announce Type: new \nAbstract: Reference [1] introduces a novel closed-form quaternion estimator from two vector observations. The simplicity of the estimator enables clear physical insights and a closed-form expression for the bias as a function of the quaternion error covariance matrix. The latter could be approximated up to second order with respect to the underlying measurement noise assuming arbitrary probability distribution. The current note relaxes the second-order assumption and provides an expression for the error covariance that is exact to the fourth order, under the assumption of Gaussian distribution. This not only provides increased accuracy but also alleviates issues related to singularity. This technical note presents a comprehensive derivation of the individual components of the quaternion additive error covariance matrix."}, "https://arxiv.org/abs/2403.01330": {"title": "Re-evaluating the impact of hormone replacement therapy on heart disease using match-adaptive randomization inference", "link": "https://arxiv.org/abs/2403.01330", "description": "arXiv:2403.01330v1 Announce Type: new \nAbstract: Matching is an appealing way to design observational studies because it mimics the data structure produced by stratified randomized trials, pairing treated individuals with similar controls. After matching, inference is often conducted using methods tailored for stratified randomized trials in which treatments are permuted within matched pairs. However, in observational studies, matched pairs are not predetermined before treatment; instead, they are constructed based on observed treatment status. This introduces a challenge as the permutation distributions used in standard inference methods do not account for the possibility that permuting treatments might lead to a different selection of matched pairs ($Z$-dependence). To address this issue, we propose a novel and computationally efficient algorithm that characterizes and enables sampling from the correct conditional distribution of treatment after an optimal propensity score matching, accounting for $Z$-dependence. 
We show how this new procedure, called match-adaptive randomization inference, corrects for an anticonservative result in a well-known observational study investigating the impact of hormone replacement therapy (HRT) on coronary heart disease and corroborates experimental findings about heterogeneous effects of HRT across different ages of initiation in women. Keywords: matching, causal inference, propensity score, permutation test, Type I error, graphs."}, "https://arxiv.org/abs/2403.01386": {"title": "Minimax-Regret Sample Selection in Randomized Experiments", "link": "https://arxiv.org/abs/2403.01386", "description": "arXiv:2403.01386v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are often run in settings with many subpopulations that may have differential benefits from the treatment being evaluated. We consider the problem of sample selection, i.e., whom to enroll in an RCT, so as to optimize welfare in a heterogeneous population. We formalize this problem within the minimax-regret framework, and derive optimal sample-selection schemes under a variety of conditions. We also highlight how different objectives and decisions can lead to notably different guidance regarding optimal sample allocation through a synthetic experiment leveraging historical COVID-19 trial data."}, "https://arxiv.org/abs/2403.01477": {"title": "Two-phase rejective sampling", "link": "https://arxiv.org/abs/2403.01477", "description": "arXiv:2403.01477v1 Announce Type: new \nAbstract: Rejective sampling improves design and estimation efficiency of single-phase sampling when auxiliary information in a finite population is available. When such auxiliary information is unavailable, we propose to use two-phase rejective sampling (TPRS), which involves measuring auxiliary variables for the sample of units in the first phase, followed by the implementation of rejective sampling for the outcome in the second phase. We explore the asymptotic design properties of double expansion and regression estimators under TPRS. We show that TPRS enhances the efficiency of the double expansion estimator, rendering it comparable to a regression estimator. We further refine the design to accommodate varying importance of covariates and extend it to multi-phase sampling."}, "https://arxiv.org/abs/2403.01585": {"title": "Calibrating doubly-robust estimators with unbalanced treatment assignment", "link": "https://arxiv.org/abs/2403.01585", "description": "arXiv:2403.01585v1 Announce Type: new \nAbstract: Machine learning methods, particularly the double machine learning (DML) estimator (Chernozhukov et al., 2018), are increasingly popular for the estimation of the average treatment effect (ATE). However, datasets often exhibit unbalanced treatment assignments where only a few observations are treated, leading to unstable propensity score estimations. We propose a simple extension of the DML estimator which undersamples data for propensity score modeling and calibrates scores to match the original distribution. The paper provides theoretical results showing that the estimator retains the DML estimator's asymptotic properties.
A simulation study illustrates the finite sample performance of the estimator."}, "https://arxiv.org/abs/2403.01684": {"title": "Dendrogram of mixing measures: Learning latent hierarchy and model selection for finite mixture models", "link": "https://arxiv.org/abs/2403.01684", "description": "arXiv:2403.01684v1 Announce Type: new \nAbstract: We present a new way to summarize and select mixture models via the hierarchical clustering tree (dendrogram) of an overfitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering and mixture modeling. The dendrogram's construction is derived from the theory of convergence of the mixing measures, and as a result, we can both consistently select the true number of mixing components and obtain the pointwise optimal convergence rate for parameter estimation from the tree, even when the model parameters are only weakly identifiable. In theory, it explicates the choice of the optimal number of clusters in hierarchical clustering. In practice, the dendrogram reveals more information on the hierarchy of subpopulations compared to traditional ways of summarizing mixture models. Several simulation studies are carried out to support our theory. We also illustrate the methodology with an application to single-cell RNA sequence analysis."}, "https://arxiv.org/abs/2403.01838": {"title": "Graphical n-sample tests of correspondence of distributions", "link": "https://arxiv.org/abs/2403.01838", "description": "arXiv:2403.01838v1 Announce Type: new \nAbstract: Classical tests are available for the two-sample test of correspondence of distribution functions. From these, the Kolmogorov-Smirnov test provides also the graphical interpretation of the test results, in different forms. Here, we propose modifications of the Kolmogorov-Smirnov test with higher power. The proposed tests are based on the so-called global envelope test which allows for graphical interpretation, similarly as the Kolmogorov-Smirnov test. The tests are based on rank statistics and are suitable also for the comparison of $n$ samples, with $n \\geq 2$. We compare the alternatives for the two-sample case through an extensive simulation study and discuss their interpretation. Finally, we apply the tests to real data. Specifically, we compare the height distributions between boys and girls at different ages, as well as sepal length distributions of different flower species using the proposed methodologies."}, "https://arxiv.org/abs/2403.01948": {"title": "On Fractional Moment Estimation from Polynomial Chaos Expansion", "link": "https://arxiv.org/abs/2403.01948", "description": "arXiv:2403.01948v1 Announce Type: new \nAbstract: Fractional statistical moments are utilized for various tasks of uncertainty quantification, including the estimation of probability distributions. However, an estimation of fractional statistical moments of costly mathematical models by statistical sampling is challenging since it is typically not possible to create a large experimental design due to limitations in computing capacity. This paper presents a novel approach for the analytical estimation of fractional moments, directly from polynomial chaos expansions. Specifically, the first four statistical moments obtained from the deterministic PCE coefficients are used for an estimation of arbitrary fractional moments via H\\\"{o}lder's inequality. 
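The envelope idea behind the graphical n-sample tests described above can be sketched in its simplest two-sample form: compare empirical distribution functions on a grid and wrap a permutation envelope around the observed difference. The sketch below uses a plain pointwise envelope rather than the paper's global (rank) envelope construction, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 150)          # sample 1
y = rng.normal(0.4, 1.0, 150)          # sample 2 (shifted mean)

grid = np.linspace(-4.0, 5.0, 200)
def ecdf_diff(a, b):
    """Difference of the two empirical distribution functions on the grid."""
    return (a[:, None] <= grid).mean(axis=0) - (b[:, None] <= grid).mean(axis=0)

observed = ecdf_diff(x, y)

pooled = np.concatenate([x, y])
n_perm = 999
perm_curves = np.empty((n_perm, grid.size))
for i in range(n_perm):
    z = rng.permutation(pooled)        # relabel observations under the null
    perm_curves[i] = ecdf_diff(z[:x.size], z[x.size:])

# 95% pointwise permutation envelope; a global envelope would widen the bands
# so the whole curve stays inside with 95% probability under the null.
lo, hi = np.quantile(perm_curves, [0.025, 0.975], axis=0)
outside = (observed < lo) | (observed > hi)
print("curve leaves the envelope at", outside.sum(), "grid points ->",
      "reject" if outside.any() else "do not reject")
```

The grid points where the observed curve exits the envelope indicate where the two distributions differ, which is the graphical interpretation the abstract refers to.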
The proposed approach is utilized for an estimation of statistical moments and probability distributions in three numerical examples of increasing complexity. Obtained results show that the proposed approach achieves a superior performance in estimating the distribution of the response, in comparison to a standard Latin hypercube sampling in the presented examples."}, "https://arxiv.org/abs/2403.02058": {"title": "Utility-based optimization of Fujikawa's basket trial design - Pre-specified protocol of a comparison study", "link": "https://arxiv.org/abs/2403.02058", "description": "arXiv:2403.02058v1 Announce Type: new \nAbstract: Basket trial designs are a type of master protocol in which the same therapy is tested in several strata of the patient cohort. Many basket trial designs implement borrowing mechanisms. These allow sharing information between similar strata with the goal of increasing power in responsive strata while at the same time constraining type-I error inflation to a bearable threshold. These borrowing mechanisms can be tuned using numerical tuning parameters. The optimal choice of these tuning parameters is subject to research. In a comparison study using simulations and numerical calculations, we are planning to investigate the use of utility functions for quantifying the compromise between power and type-I error inflation and the use of numerical optimization algorithms for optimizing these functions. The present document is the protocol of this comparison study, defining each step of the study in accordance with the ADEMP scheme for pre-specification of simulation studies."}, "https://arxiv.org/abs/2403.02060": {"title": "Expectile Periodograms", "link": "https://arxiv.org/abs/2403.02060", "description": "arXiv:2403.02060v1 Announce Type: new \nAbstract: In this paper, we introduce a periodogram-like function, called expectile periodograms, for detecting and estimating hidden periodicity from observations with asymmetrically distributed noise. The expectile periodograms are constructed from trigonometric expectile regression where a specially designed objective function is used to substitute the squared $l_2$ norm that leads to the ordinary periodograms. The expectile periodograms have properties which are analogous to quantile periodograms, which provide a broader view of the time series by examining different expectile levels, but are much faster to calculate. The asymptotic properties are discussed and simulations show its efficiency and robustness in the presence of hidden periodicities with asymmetric or heavy-tailed noise. Finally, we leverage the inherent two-dimensional characteristics of the expectile periodograms and train a deep-learning (DL) model to classify the earthquake waveform data. Remarkably, our approach achieves heightened classification testing accuracy when juxtaposed with alternative periodogram-based methodologies."}, "https://arxiv.org/abs/2403.02144": {"title": "Improved Tests for Mediation", "link": "https://arxiv.org/abs/2403.02144", "description": "arXiv:2403.02144v1 Announce Type: new \nAbstract: Testing for a mediation effect is important in many disciplines, but is made difficult - even asymptotically - by the influence of nuisance parameters. Classical tests such as likelihood ratio (LR) and Wald (Sobel) tests have very poor power properties in parts of the parameter space, and many attempts have been made to produce improved tests, with limited success. 
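For orientation, the classical Sobel (Wald-type) test that the mediation abstract just above takes as its point of departure can be computed in a few lines; the simulated data and coefficient values are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
m = 0.4 * x + rng.normal(size=n)             # mediator model:  M = a*X + error
y = 0.3 * m + 0.1 * x + rng.normal(size=n)   # outcome model:   Y = b*M + c*X + error

def ols(design, response):
    """OLS coefficients and their standard errors (intercept added)."""
    X = np.column_stack([np.ones(len(response)), design])
    beta = np.linalg.lstsq(X, response, rcond=None)[0]
    resid = response - X @ beta
    sigma2 = resid @ resid / (len(response) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

coef_m, se_m = ols(x, m)
coef_y, se_y = ols(np.column_stack([m, x]), y)
a_hat, se_a = coef_m[1], se_m[1]             # effect of X on M
b_hat, se_b = coef_y[1], se_y[1]             # effect of M on Y, adjusting for X

# Sobel statistic for the indirect effect a*b (first-order delta method)
sobel_z = (a_hat * b_hat) / np.sqrt(a_hat**2 * se_b**2 + b_hat**2 * se_a**2)
p_value = 2 * stats.norm.sf(abs(sobel_z))
print("indirect effect:", round(a_hat * b_hat, 3),
      " Sobel z:", round(sobel_z, 2), " p-value:", round(p_value, 4))
```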
In this paper we show that augmenting the critical region of the LR test can produce a test with much improved behavior everywhere. In fact, we first show that there exists a test of this type that is (asymptotically) exact for certain test levels $\\alpha $, including the common choices $\\alpha =.01,.05,.10.$ The critical region of this exact test has some undesirable properties. We go on to show that there is a very simple class of augmented LR critical regions which provides tests that are nearly exact, and avoid the issues inherent in the exact test. We suggest an optimal and coherent member of this class, provide the table needed to implement the test and to report p-values if desired. Simulation confirms validity with non-Gaussian disturbances, under heteroskedasticity, and in a nonlinear (logit) model. A short application of the method to an entrepreneurial attitudes study is included for illustration."}, "https://arxiv.org/abs/2403.02154": {"title": "Double trouble: Predicting new variant counts across two heterogeneous populations", "link": "https://arxiv.org/abs/2403.02154", "description": "arXiv:2403.02154v1 Announce Type: new \nAbstract: Collecting genomics data across multiple heterogeneous populations (e.g., across different cancer types) has the potential to improve our understanding of disease. Despite sequencing advances, though, resources often remain a constraint when gathering data. So it would be useful for experimental design if experimenters with access to a pilot study could predict the number of new variants they might expect to find in a follow-up study: both the number of new variants shared between the populations and the total across the populations. While many authors have developed prediction methods for the single-population case, we show that these predictions can fare poorly across multiple populations that are heterogeneous. We prove that, surprisingly, a natural extension of a state-of-the-art single-population predictor to multiple populations fails for fundamental reasons. We provide the first predictor for the number of new shared variants and new total variants that can handle heterogeneity in multiple populations. We show that our proposed method works well empirically using real cancer and population genetics data."}, "https://arxiv.org/abs/2403.02194": {"title": "Boosting Distributional Copula Regression for Bivariate Binary, Discrete and Mixed Responses", "link": "https://arxiv.org/abs/2403.02194", "description": "arXiv:2403.02194v1 Announce Type: new \nAbstract: Motivated by challenges in the analysis of biomedical data and observational studies, we develop statistical boosting for the general class of bivariate distributional copula regression with arbitrary marginal distributions, which is suited to model binary, count, continuous or mixed outcomes. In our framework, the joint distribution of arbitrary, bivariate responses is modelled through a parametric copula. To arrive at a model for the entire conditional distribution, not only the marginal distribution parameters but also the copula parameters are related to covariates through additive predictors. We suggest efficient and scalable estimation by means of an adapted component-wise gradient boosting algorithm with statistical models as base-learners. A key benefit of boosting as opposed to classical likelihood or Bayesian estimation is the implicit data-driven variable selection mechanism as well as shrinkage without additional input or assumptions from the analyst. 
To the best of our knowledge, our implementation is the only one that combines a wide range of covariate effects, marginal distributions, copula functions, and implicit data-driven variable selection. We showcase the versatility of our approach on data from genetic epidemiology, healthcare utilization and childhood undernutrition. Our developments are implemented in the R package gamboostLSS, fostering transparent and reproducible research."}, "https://arxiv.org/abs/2403.02245": {"title": "Dynamic programming principle in cost-efficient sequential design: application to switching measurements", "link": "https://arxiv.org/abs/2403.02245", "description": "arXiv:2403.02245v1 Announce Type: new \nAbstract: We study sequential cost-efficient design in a situation where each update of covariates involves a fixed time cost typically considerable compared to a single measurement time. The problem arises from parameter estimation in switching measurements on superconducting Josephson junctions which are components needed in quantum computers and other superconducting electronics. In switching measurements, a sequence of current pulses is applied to the junction and a binary voltage response is observed. The measurement requires a very low temperature that can be kept stable only for a relatively short time, and therefore it is essential to use an efficient design. We use the dynamic programming principle from the mathematical theory of optimal control to solve the optimal update times. Our simulations demonstrate the cost-efficiency compared to the previously used methods."}, "https://arxiv.org/abs/2403.00776": {"title": "A framework for understanding data science", "link": "https://arxiv.org/abs/2403.00776", "description": "arXiv:2403.00776v1 Announce Type: cross \nAbstract: The objective of this research is to provide a framework with which the data science community can understand, define, and develop data science as a field of inquiry. The framework is based on the classical reference framework (axiology, ontology, epistemology, methodology) used for 200 years to define knowledge discovery paradigms and disciplines in the humanities, sciences, algorithms, and now data science. I augmented it for automated problem-solving with (methods, technology, community). The resulting data science reference framework is used to define the data science knowledge discovery paradigm in terms of the philosophy of data science addressed in previous papers and the data science problem-solving paradigm, i.e., the data science method, and the data science problem-solving workflow, both addressed in this paper. The framework is a much called for unifying framework for data science as it contains the components required to define data science. For insights to better understand data science, this paper uses the framework to define the emerging, often enigmatic, data science problem-solving paradigm and workflow, and to compare them with their well-understood scientific counterparts, scientific problem-solving paradigm and workflow."}, "https://arxiv.org/abs/2403.00967": {"title": "Asymptotic expansion of the drift estimator for the fractional Ornstein-Uhlenbeck process", "link": "https://arxiv.org/abs/2403.00967", "description": "arXiv:2403.00967v1 Announce Type: cross \nAbstract: We present an asymptotic expansion formula of an estimator for the drift coefficient of the fractional Ornstein-Uhlenbeck process. 
As the machinery, we apply the general expansion scheme for Wiener functionals recently developed by the authors [26]. The central limit theorem in the principal part of the expansion has the classical scaling $T^{1/2}$. However, the asymptotic expansion formula is complex in that the order of the correction term becomes the classical $T^{-1/2}$ for $H \\in (1/2,5/8)$, but $T^{4H-3}$ for $H \\in [5/8, 3/4)$."}, "https://arxiv.org/abs/2403.01208": {"title": "The Science of Data Collection: Insights from Surveys can Improve Machine Learning Models", "link": "https://arxiv.org/abs/2403.01208", "description": "arXiv:2403.01208v1 Announce Type: cross \nAbstract: Whether future AI models make the world safer or less safe for humans rests in part on our ability to efficiently collect accurate data from people about what they want the models to do. However, collecting high quality data is difficult, and most AI/ML researchers are not trained in data collection methods. The growing emphasis on data-centric AI highlights the potential of data to enhance model performance. It also reveals an opportunity to gain insights from survey methodology, the science of collecting high-quality survey data.\n In this position paper, we summarize lessons from the survey methodology literature and discuss how they can improve the quality of training and feedback data, which in turn improve model performance. Based on the cognitive response process model, we formulate specific hypotheses about the aspects of label collection that may impact training data quality. We also suggest collaborative research ideas into how possible biases in data collection can be mitigated, making models more accurate and human-centric."}, "https://arxiv.org/abs/2403.01318": {"title": "High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media", "link": "https://arxiv.org/abs/2403.01318", "description": "arXiv:2403.01318v1 Announce Type: cross \nAbstract: Motivated by the empirical power law of the distributions of credits (e.g., the number of \"likes\") of viral posts in social media, we introduce the high-dimensional tail index regression and methods of estimation and inference for its parameters. We propose a regularized estimator, establish its consistency, and derive its convergence rate. To conduct inference, we propose to debias the regularized estimate, and establish the asymptotic normality of the debiased estimator. Simulation studies support our theory. These methods are applied to text analyses of viral posts in X (formerly Twitter) concerning LGBTQ+."}, "https://arxiv.org/abs/2403.01403": {"title": "Greedy selection of optimal location of sensors for uncertainty reduction in seismic moment tensor inversion", "link": "https://arxiv.org/abs/2403.01403", "description": "arXiv:2403.01403v1 Announce Type: cross \nAbstract: We address an optimal sensor placement problem through Bayesian experimental design for seismic full waveform inversion for the recovery of the associated moment tensor. The objective is that of optimally choosing the location of the sensors (stations) from which to collect the observed data. The Shannon expected information gain is used as the objective function to search for the optimal network of sensors. A closed form for such objective is available due to the linear structure of the forward problem, as well as the Gaussian modeling of the observational errors and prior distribution. 
The resulting problem being inherently combinatorial, a greedy algorithm is deployed to sequentially select the sensor locations that form the best network for learning the moment tensor. Numerical results are presented and analyzed under several instances of the problem, including: use of full three-dimensional velocity-models, cases in which the earthquake-source location is unknown, as well as moment tensor inversion under model misspecification"}, "https://arxiv.org/abs/2403.01865": {"title": "Improving generalisation via anchor multivariate analysis", "link": "https://arxiv.org/abs/2403.01865", "description": "arXiv:2403.01865v1 Announce Type: cross \nAbstract: We introduce a causal regularisation extension to anchor regression (AR) for improved out-of-distribution (OOD) generalisation. We present anchor-compatible losses, aligning with the anchor framework to ensure robustness against distribution shifts. Various multivariate analysis (MVA) algorithms, such as (Orthonormalized) PLS, RRR, and MLR, fall within the anchor framework. We observe that simple regularisation enhances robustness in OOD settings. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with MVA approaches and its role in enhancing replicability while guarding against distribution shifts. The extended AR framework advances causal inference methodologies, addressing the need for reliable OOD generalisation."}, "https://arxiv.org/abs/2107.08075": {"title": "Kpop: A kernel balancing approach for reducing specification assumptions in survey weighting", "link": "https://arxiv.org/abs/2107.08075", "description": "arXiv:2107.08075v2 Announce Type: replace \nAbstract: With the precipitous decline in response rates, researchers and pollsters have been left with highly non-representative samples, relying on constructed weights to make these samples representative of the desired target population. Though practitioners employ valuable expert knowledge to choose what variables, $X$ must be adjusted for, they rarely defend particular functional forms relating these variables to the response process or the outcome. Unfortunately, commonly-used calibration weights -- which make the weighted mean $X$ in the sample equal that of the population -- only ensure correct adjustment when the portion of the outcome and the response process left unexplained by linear functions of $X$ are independent. To alleviate this functional form dependency, we describe kernel balancing for population weighting (kpop). This approach replaces the design matrix $\\mathbf{X}$ with a kernel matrix, $\\mathbf{K}$ encoding high-order information about $\\mathbf{X}$. Weights are then found to make the weighted average row of $\\mathbf{K}$ among sampled units approximately equal that of the target population. This produces good calibration on a wide range of smooth functions of $X$, without relying on the user to decide which $X$ or what functions of them to include. We describe the method and illustrate it by application to polling data from the 2016 U.S. 
presidential election."}, "https://arxiv.org/abs/2202.10030": {"title": "Multivariate Tie-breaker Designs", "link": "https://arxiv.org/abs/2202.10030", "description": "arXiv:2202.10030v4 Announce Type: replace \nAbstract: In a tie-breaker design (TBD), subjects with high values of a running variable are given some (usually desirable) treatment, subjects with low values are not, and subjects in the middle are randomized. TBDs are intermediate between regression discontinuity designs (RDDs) and randomized controlled trials (RCTs) by allowing a tradeoff between the resource allocation efficiency of an RDD and the statistical efficiency of an RCT. We study a model where the expected response is one multivariate regression for treated subjects and another for control subjects. For given covariates, we show how to use convex optimization to choose treatment probabilities that optimize a D-optimality criterion. We can incorporate a variety of constraints motivated by economic and ethical considerations. In our model, D-optimality for the treatment effect coincides with D-optimality for the whole regression, and without economic constraints, an RCT is globally optimal. We show that a monotonicity constraint favoring more deserving subjects induces sparsity in the number of distinct treatment probabilities and this is different from preexisting sparsity results for constrained designs. We also study a prospective D-optimality, analogous to Bayesian optimal design, to understand design tradeoffs without reference to a specific data set. We apply the convex optimization solution to a semi-synthetic example involving triage data from the MIMIC-IV-ED database."}, "https://arxiv.org/abs/2206.09883": {"title": "Policy Learning under Endogeneity Using Instrumental Variables", "link": "https://arxiv.org/abs/2206.09883", "description": "arXiv:2206.09883v3 Announce Type: replace \nAbstract: This paper studies the identification and estimation of individualized intervention policies in observational data settings characterized by endogenous treatment selection and the availability of instrumental variables. We introduce encouragement rules that manipulate an instrument. Incorporating the marginal treatment effects (MTE) as policy invariant structural parameters, we establish the identification of the social welfare criterion for the optimal encouragement rule. Focusing on binary encouragement rules, we propose to estimate the optimal policy via the Empirical Welfare Maximization (EWM) method and derive convergence rates of the regret (welfare loss). We consider extensions to accommodate multiple instruments and budget constraints. Using data from the Indonesian Family Life Survey, we apply the EWM encouragement rule to advise on the optimal tuition subsidy assignment. Our framework offers interpretability regarding why a certain subpopulation is targeted."}, "https://arxiv.org/abs/2303.03520": {"title": "The Effect of Alcohol Consumption on Brain Ageing: A New Causal Inference Framework for Incomplete and Massive Phenomic Data", "link": "https://arxiv.org/abs/2303.03520", "description": "arXiv:2303.03520v2 Announce Type: replace \nAbstract: Although substance use, such as alcohol consumption, is known to be associated with cognitive decline during ageing, its direct influence on the central nervous system remains unclear. 
In this study, we aim to investigate the potential influence of alcohol intake frequency on accelerated brain ageing by estimating the mean potential brain-age gap (BAG) index, the difference between brain age and actual age, under different alcohol intake frequencies in a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive life-style profile. We face two major challenges: (1) a large number of phenomic variables as potential confounders and (2) a small proportion of participants with complete phenomic data. To address these challenges, we first develop a new ensemble learning framework to establish robust estimation of mean potential outcome in the presence of many confounders. We then construct a data integration step to borrow information from UKB participants with incomplete phenomic data to improve efficiency. Our analysis results reveal that daily intake or even a few times a week may have significant effects on accelerating brain ageing. Moreover, extensive numerical studies demonstrate the superiority of our method over competing methods, in terms of smaller estimation bias and variability."}, "https://arxiv.org/abs/2305.17615": {"title": "Estimating overidentified linear models with heteroskedasticity and outliers", "link": "https://arxiv.org/abs/2305.17615", "description": "arXiv:2305.17615v3 Announce Type: replace \nAbstract: Overidentified two-stage least square (TSLS) is commonly adopted by applied economists to address endogeneity. Though it potentially gives more efficient or informative estimate, overidentification comes with a cost. The bias of TSLS is severe when the number of instruments is large. Hence, Jackknife Instrumental Variable Estimator (JIVE) has been proposed to reduce bias of overidentified TSLS. A conventional heuristic rule to assess the performance of TSLS and JIVE is approximate bias. This paper formalizes this concept and applies the new definition of approximate bias to three classes of estimators that bridge between OLS, TSLS and a variant of JIVE, namely, JIVE1. Three new approximately unbiased estimators are proposed. They are called AUK, TSJI1 and UOJIVE. Interestingly, a previously proposed approximately unbiased estimator UIJIVE can be viewed as a special case of UOJIVE. While UIJIVE is approximately unbiased asymptotically, UOJIVE is approximately unbiased even in finite sample. Moreover, UOJIVE estimates parameters for both endogenous and control variables whereas UIJIVE only estimates the parameter of the endogenous variables. TSJI1 and UOJIVE are consistent and asymptotically normal under fixed number of instruments. They are also consistent under many-instrument asymptotics. This paper characterizes a series of moment existence conditions to establish all asymptotic results. In addition, the new estimators demonstrate good performances with simulated and empirical datasets."}, "https://arxiv.org/abs/2306.10976": {"title": "Empirical sandwich variance estimator for iterated conditional expectation g-computation", "link": "https://arxiv.org/abs/2306.10976", "description": "arXiv:2306.10976v2 Announce Type: replace \nAbstract: Iterated conditional expectation (ICE) g-computation is an estimation approach for addressing time-varying confounding for both longitudinal and time-to-event data. Unlike other g-computation implementations, ICE avoids the need to specify models for each time-varying covariate. For variance estimation, previous work has suggested the bootstrap. 
However, bootstrapping can be computationally intense and sensitive to the number of resamples used. Here, we present ICE g-computation as a set of stacked estimating equations. Therefore, the variance for the ICE g-computation estimator can be consistently estimated using the empirical sandwich variance estimator. Performance of the variance estimator was evaluated empirically with a simulation study. The proposed approach is also demonstrated with an illustrative example on the effect of cigarette smoking on the prevalence of hypertension. In the simulation study, the empirical sandwich variance estimator appropriately estimated the variance. When comparing runtimes between the sandwich variance estimator and the bootstrap for the applied example, the sandwich estimator was substantially faster, even when bootstraps were run in parallel. The empirical sandwich variance estimator is a viable option for variance estimation with ICE g-computation."}, "https://arxiv.org/abs/2309.03952": {"title": "The Causal Roadmap and Simulations to Improve the Rigor and Reproducibility of Real-Data Applications", "link": "https://arxiv.org/abs/2309.03952", "description": "arXiv:2309.03952v4 Announce Type: replace \nAbstract: The Causal Roadmap outlines a systematic approach to asking and answering questions of cause-and-effect: define quantity of interest, evaluate needed assumptions, conduct statistical estimation, and carefully interpret results. It is paramount that the algorithm for statistical estimation and inference be carefully pre-specified to optimize its expected performance for the specific real-data application. Simulations that realistically reflect the application, including key characteristics such as strong confounding and dependent or missing outcomes, can help us gain a better understanding of an estimator's applied performance. We illustrate this with two examples, using the Causal Roadmap and realistic simulations to inform estimator selection and full specification of the Statistical Analysis Plan. First, in an observational longitudinal study, outcome-blind simulations are used to inform nuisance parameter estimation and variance estimation for longitudinal targeted maximum likelihood estimation (TMLE). Second, in a cluster-randomized controlled trial with missing outcomes, treatment-blind simulations are used to ensure control for Type-I error in Two-Stage TMLE. In both examples, realistic simulations empower us to pre-specify an estimator that is expected to have strong finite sample performance and also yield quality-controlled computing code for the actual analysis. Together, this process helps to improve the rigor and reproducibility of our research."}, "https://arxiv.org/abs/2309.08982": {"title": "Least squares estimation in nonstationary nonlinear cohort panels with learning from experience", "link": "https://arxiv.org/abs/2309.08982", "description": "arXiv:2309.08982v3 Announce Type: replace \nAbstract: We discuss techniques of estimation and inference for nonstationary nonlinear cohort panels with learning from experience, showing, inter alia, the consistency and asymptotic normality of the nonlinear least squares estimator used in empirical practice. Potential pitfalls for hypothesis testing are identified and solutions proposed. 
Monte Carlo simulations verify the properties of the estimator and corresponding test statistics in finite samples, while an application to a panel of survey expectations demonstrates the usefulness of the theory developed."}, "https://arxiv.org/abs/2310.15069": {"title": "Second-order group knockoffs with applications to GWAS", "link": "https://arxiv.org/abs/2310.15069", "description": "arXiv:2310.15069v2 Announce Type: replace \nAbstract: Conditional testing via the knockoff framework allows one to identify -- among large number of possible explanatory variables -- those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome wide association studies (GWAS), which have the goal of identifying genetic variants which influence traits of medical relevance.\n While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct \"group knockoffs.\" While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.\n The described algorithms are implemented in an open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available."}, "https://arxiv.org/abs/2102.06202": {"title": "Private Prediction Sets", "link": "https://arxiv.org/abs/2102.06202", "description": "arXiv:2102.06202v3 Announce Type: replace-cross \nAbstract: In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals' privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification -- they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately-trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately, this is not the case. To remedy this key problem, we develop a method that takes any pre-trained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. 
We evaluate the method on large-scale computer vision datasets."}, "https://arxiv.org/abs/2210.00362": {"title": "Yurinskii's Coupling for Martingales", "link": "https://arxiv.org/abs/2210.00362", "description": "arXiv:2210.00362v2 Announce Type: replace-cross \nAbstract: Yurinskii's coupling is a popular theoretical tool for non-asymptotic distributional analysis in mathematical statistics and applied probability, offering a Gaussian strong approximation with an explicit error bound under easily verified conditions. Originally stated in $\\ell^2$-norm for sums of independent random vectors, it has recently been extended both to the $\\ell^p$-norm, for $1 \\leq p \\leq \\infty$, and to vector-valued martingales in $\\ell^2$-norm, under some strong conditions. We present as our main result a Yurinskii coupling for approximate martingales in $\\ell^p$-norm, under substantially weaker conditions than those previously imposed. Our formulation further allows for the coupling variable to follow a more general Gaussian mixture distribution, and we provide a novel third-order coupling method which gives tighter approximations in certain settings. We specialize our main result to mixingales, martingales, and independent data, and derive uniform Gaussian mixture strong approximations for martingale empirical processes. Applications to nonparametric partitioning-based and local polynomial regression procedures are provided."}, "https://arxiv.org/abs/2302.14186": {"title": "Approximately optimal domain adaptation with Fisher's Linear Discriminant", "link": "https://arxiv.org/abs/2302.14186", "description": "arXiv:2302.14186v3 Announce Type: replace-cross \nAbstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propose a computable approximation, and study the effect of various parameter settings on the relative risks between the optimal hypothesis, hypothesis i), and hypothesis ii). We demonstrate the effectiveness of the proposed optimal classifier in the context of EEG- and ECG-based classification settings and argue that the optimal classifier can be computed without access to direct information from any of the individual source tasks. We conclude by discussing further applications, limitations, and possible future directions."}, "https://arxiv.org/abs/2303.05328": {"title": "Simulation-based, Finite-sample Inference for Privatized Data", "link": "https://arxiv.org/abs/2303.05328", "description": "arXiv:2303.05328v4 Announce Type: replace-cross \nAbstract: Privacy protection methods, such as differentially private mechanisms, introduce noise into resulting statistics which often produces complex and intractable sampling distributions. In this paper, we propose a simulation-based \"repro sample\" approach to produce statistically valid confidence intervals and hypothesis tests, which builds on the work of Xie and Wang (2022). 
We show that this methodology is applicable to a wide variety of private inference problems, appropriately accounts for biases introduced by privacy mechanisms (such as by clamping), and improves over other state-of-the-art inference methods such as the parametric bootstrap in terms of the coverage and type I error of the private inference. We also develop significant improvements and extensions for the repro sample methodology for general models (not necessarily related to privacy), including 1) modifying the procedure to ensure guaranteed coverage and type I errors, even accounting for Monte Carlo error, and 2) proposing efficient numerical algorithms to implement the confidence intervals and $p$-values."}, "https://arxiv.org/abs/2306.03928": {"title": "Designing Decision Support Systems Using Counterfactual Prediction Sets", "link": "https://arxiv.org/abs/2306.03928", "description": "arXiv:2306.03928v2 Announce Type: replace-cross \nAbstract: Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets."}, "https://arxiv.org/abs/2306.06342": {"title": "Distribution-free inference with hierarchical data", "link": "https://arxiv.org/abs/2306.06342", "description": "arXiv:2306.06342v3 Announce Type: replace-cross \nAbstract: This paper studies distribution-free inference in settings where the data set has a hierarchical structure -- for example, groups of observations, or repeated measurements. In such settings, standard notions of exchangeability may not hold. To address this challenge, a hierarchical form of exchangeability is derived, facilitating extensions of distribution-free methods, including conformal prediction and jackknife+. 
While the standard theoretical guarantee obtained by the conformal prediction framework is a marginal predictive coverage guarantee, in the special case of independent repeated measurements, it is possible to achieve a stronger form of coverage -- the \"second-moment coverage\" property -- to provide better control of conditional miscoverage rates, and distribution-free prediction sets that achieve this property are constructed. Simulations illustrate that this guarantee indeed leads to uniformly small conditional miscoverage rates. Empirically, this stronger guarantee comes at the cost of a larger width of the prediction set in scenarios where the fitted model is poorly calibrated, but this cost is very mild in cases where the fitted model is accurate."}, "https://arxiv.org/abs/2310.01153": {"title": "Post-hoc and Anytime Valid Permutation and Group Invariance Testing", "link": "https://arxiv.org/abs/2310.01153", "description": "arXiv:2310.01153v3 Announce Type: replace-cross \nAbstract: We study post-hoc ($e$-value-based) and post-hoc anytime valid inference for testing exchangeability and general group invariance. Our methods satisfy a generalized Type I error control that permits a data-dependent selection of both the number of observations $n$ and the significance level $\\alpha$. We derive a simple analytical expression for all exact post-hoc valid $p$-values for group invariance, which allows for a flexible plug-in of the test statistic. For post-hoc anytime validity, we derive sequential $p$-processes by multiplying post-hoc $p$-values. In sequential testing, it is key to specify how the number of observations may depend on the data. We propose two approaches, and show how they nest existing efforts. To construct good post-hoc $p$-values, we develop the theory of likelihood ratios for group invariance, and generalize existing optimality results. These likelihood ratios turn out to exist in different flavors depending on which space we specify our alternative. We illustrate our methods by testing against a Gaussian location shift, which yields an improved optimality result for the $t$-test when testing sphericity, connections to the softmax function when testing exchangeability, and an improved method for testing sign-symmetry."}, "https://arxiv.org/abs/2311.18274": {"title": "Semiparametric Efficient Inference in Adaptive Experiments", "link": "https://arxiv.org/abs/2311.18274", "description": "arXiv:2311.18274v3 Announce Type: replace-cross \nAbstract: We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. 
Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control."}, "https://arxiv.org/abs/2403.02467": {"title": "Applied Causal Inference Powered by ML and AI", "link": "https://arxiv.org/abs/2403.02467", "description": "arXiv:2403.02467v1 Announce Type: new \nAbstract: An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclical graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools."}, "https://arxiv.org/abs/2403.02539": {"title": "Addressing the Influence of Unmeasured Confounding in Observational Studies with Time-to-Event Outcomes: A Semiparametric Sensitivity Analysis Approach", "link": "https://arxiv.org/abs/2403.02539", "description": "arXiv:2403.02539v1 Announce Type: new \nAbstract: In this paper, we develop a semiparametric sensitivity analysis approach designed to address unmeasured confounding in observational studies with time-to-event outcomes. We target estimation of the marginal distributions of potential outcomes under competing exposures using influence function-based techniques. We derived the non-parametric influence function for uncensored data and mapped the uncensored data influence function to the observed data influence function. Our methodology is motivated by and applied to an observational study evaluating the effectiveness of radical prostatectomy (RP) versus external beam radiotherapy with androgen deprivation (EBRT+AD) for the treatment of prostate cancer. We also present a simulation study to evaluate the statistical properties of our methodology."}, "https://arxiv.org/abs/2403.02591": {"title": "Matrix-based Prediction Approach for Intraday Instantaneous Volatility Vector", "link": "https://arxiv.org/abs/2403.02591", "description": "arXiv:2403.02591v1 Announce Type: new \nAbstract: In this paper, we introduce a novel method for predicting intraday instantaneous volatility based on Ito semimartingale models using high-frequency financial data. Several studies have highlighted stylized volatility time series features, such as interday auto-regressive dynamics and the intraday U-shaped pattern. To accommodate these volatility features, we propose an interday-by-intraday instantaneous volatility matrix process that can be decomposed into low-rank conditional expected instantaneous volatility and noise matrices. To predict the low-rank conditional expected instantaneous volatility matrix, we propose the Two-sIde Projected-PCA (TIP-PCA) procedure. We establish asymptotic properties of the proposed estimators and conduct a simulation study to assess the finite sample performance of the proposed prediction method. Finally, we apply the TIP-PCA method to an out-of-sample instantaneous volatility vector prediction study using high-frequency data from the S&P 500 index and 11 sector index funds."}, "https://arxiv.org/abs/2403.02625": {"title": "Determining the Number of Common Functional Factors with Twice Cross-Validation", "link": "https://arxiv.org/abs/2403.02625", "description": "arXiv:2403.02625v1 Announce Type: new \nAbstract: The semiparametric factor model serves as a vital tool to describe the dependence patterns in the data. 
It recognizes that the common features observed in the data are actually explained by functions of specific exogenous variables. Unlike traditional factor models, where the focus is on selecting the number of factors, our objective here is to identify the appropriate number of common functions, a crucial parameter in this model. In this paper, we develop a novel data-driven method to determine the number of functional factors using cross validation (CV). Our proposed method employs a two-step CV process that ensures the orthogonality of functional factors, which we refer to as Functional Twice Cross-Validation (FTCV). Extensive simulations demonstrate that FTCV accurately selects the number of common functions and outperforms existing methods in most cases. Furthermore, by specifying market volatility as the exogenous force, we provide real data examples that illustrate the interpretability of selected common functions in characterizing the influence on U.S. Treasury Yields and the cross correlations between Dow30 returns."}, "https://arxiv.org/abs/2403.02979": {"title": "Regularised Canonical Correlation Analysis: graphical lasso, biplots and beyond", "link": "https://arxiv.org/abs/2403.02979", "description": "arXiv:2403.02979v1 Announce Type: new \nAbstract: Recent developments in regularized Canonical Correlation Analysis (CCA) promise powerful methods for high-dimensional, multiview data analysis. However, justifying the structural assumptions behind many popular approaches remains a challenge, and features of realistic biological datasets pose practical difficulties that are seldom discussed. We propose a novel CCA estimator rooted in an assumption of conditional independencies and based on the Graphical Lasso. Our method has desirable theoretical guarantees and good empirical performance, demonstrated through extensive simulations and real-world biological datasets. Recognizing the difficulties of model selection in high dimensions and other practical challenges of applying CCA in real-world settings, we introduce a novel framework for evaluating and interpreting regularized CCA models in the context of Exploratory Data Analysis (EDA), which we hope will empower researchers and pave the way for wider adoption."}, "https://arxiv.org/abs/2403.03058": {"title": "Machine Learning Assisted Adjustment Boosts Inferential Efficiency of Randomized Controlled Trials", "link": "https://arxiv.org/abs/2403.03058", "description": "arXiv:2403.03058v1 Announce Type: new \nAbstract: In this work, we proposed a novel inferential procedure assisted by machine learning based adjustment for randomized control trials. The method was developed under Rosenbaum's framework of exact tests in randomized experiments with covariate adjustments. Through extensive simulation experiments, we showed the proposed method can robustly control the type I error and can boost the inference efficiency for a randomized controlled trial (RCT). This advantage was further demonstrated in a real world example. The simplicity and robustness of the proposed method make it a competitive candidate as a routine inference procedure for RCTs, especially when the number of baseline covariates is large, and when nonlinear association or interaction among covariates is expected. 
Its application may remarkably reduce the required sample size and cost of RCTs, such as phase III clinical trials."}, "https://arxiv.org/abs/2403.03099": {"title": "Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure", "link": "https://arxiv.org/abs/2403.03099", "description": "arXiv:2403.03099v1 Announce Type: new \nAbstract: Big data, with NxP dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order PxN(N-1). To circumvent this problem, typically, the clustering technique is applied to a random sample drawn from the dataset: however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of \"data nuggets\", which reduce a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2403.03138": {"title": "A novel methodological framework for the analysis of health trajectories and survival outcomes in heart failure patients", "link": "https://arxiv.org/abs/2403.03138", "description": "arXiv:2403.03138v1 Announce Type: new \nAbstract: Heart failure (HF) contributes to circa 200,000 annual hospitalizations in France. With the increasing age of HF patients, elucidating the specific causes of inpatient mortality became a public health problematic. We introduce a novel methodological framework designed to identify prevalent health trajectories and investigate their impact on death. The initial step involves applying sequential pattern mining to characterize patients' trajectories, followed by an unsupervised clustering algorithm based on a new metric for measuring the distance between hospitalization diagnoses. Finally, a survival analysis is conducted to assess survival outcomes. The application of this framework to HF patients from a representative sample of the French population demonstrates its methodological significance in enhancing the analysis of healthcare trajectories."}, "https://arxiv.org/abs/2403.02663": {"title": "The Influence of Validation Data on Logical and Scientific Interpretations of Forensic Expert Opinions", "link": "https://arxiv.org/abs/2403.02663", "description": "arXiv:2403.02663v1 Announce Type: cross \nAbstract: Forensic experts use specialized training and knowledge to enable other members of the judicial system to make better informed and more just decisions. Factfinders, in particular, are tasked with judging how much weight to give to experts' reports and opinions. Many references describe assessing evidential weight from the perspective of a forensic expert. 
Some recognize that stakeholders are each responsible for evaluating their own weight of evidence. Morris (1971, 1974, 1977) provided a general framework for recipients to update their own uncertainties after learning an expert's opinion. Although this framework is normative under Bayesian axioms and several forensic scholars advocate the use of Bayesian reasoning, few resources describe its application in forensic science. This paper addresses this gap by examining how recipients can combine principles of science and Bayesian reasoning to evaluate their own likelihood ratios for expert opinions. This exercise helps clarify how an expert's role depends on whether one envisions recipients to be logical and scientific or deferential. Illustrative examples with an expert's opinion expressed as a categorical conclusion, likelihood ratio, or range of likelihood ratios, or with likelihood ratios from multiple experts, each reveal the importance and influence of validation data for logical recipients' interpretations."}, "https://arxiv.org/abs/2403.02696": {"title": "Low-rank matrix estimation via nonconvex spectral regularized methods in errors-in-variables matrix regression", "link": "https://arxiv.org/abs/2403.02696", "description": "arXiv:2403.02696v1 Announce Type: cross \nAbstract: High-dimensional matrix regression has been studied in various aspects, such as statistical properties, computational efficiency and application to specific instances including multivariate regression, system identification and matrix compressed sensing. Current studies mainly consider the idealized case that the covariate matrix is obtained without noise, while the more realistic scenario that the covariates may always be corrupted with noise or missing data has received little attention. We consider the general errors-in-variables matrix regression model and proposed a unified framework for low-rank estimation based on nonconvex spectral regularization. Then in the statistical aspect, recovery bounds for any stationary points are provided to achieve statistical consistency. In the computational aspect, the proximal gradient method is applied to solve the nonconvex optimization problem and is proved to converge in polynomial time. Consequences for specific matrix compressed sensing models with additive noise and missing data are obtained via verifying corresponding regularity conditions. Finally, the performance of the proposed nonconvex estimation method is illustrated by numerical experiments."}, "https://arxiv.org/abs/2403.02840": {"title": "On a theory of martingales for censoring", "link": "https://arxiv.org/abs/2403.02840", "description": "arXiv:2403.02840v1 Announce Type: cross \nAbstract: A theory of martingales for censoring is developed. The Doob-Meyer martingale is shown to be inadequate in general, and a repaired martingale is proposed with a non-predictable centering term. Associated martingale transforms, variation processes, and covariation processes are developed based on a measure of half-predictability that generalizes predictability. 
The development is applied to study the Kaplan Meier estimator."}, "https://arxiv.org/abs/2403.03007": {"title": "Scalable Bayesian inference for the generalized linear mixed model", "link": "https://arxiv.org/abs/2403.03007", "description": "arXiv:2403.03007v1 Announce Type: cross \nAbstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in applications areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference, that leverages the scalability of modern AI algorithms with guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method in a large electronic health records database."}, "https://arxiv.org/abs/2403.03208": {"title": "Active Statistical Inference", "link": "https://arxiv.org/abs/2403.03208", "description": "arXiv:2403.03208v1 Announce Type: cross \nAbstract: Inspired by the concept of active learning, we propose active inference$\\unicode{x2013}$a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. 
We evaluate active inference on datasets from public opinion research, census analysis, and proteomics."}, "https://arxiv.org/abs/2103.12900": {"title": "Loss based prior for the degrees of freedom of the Wishart distribution", "link": "https://arxiv.org/abs/2103.12900", "description": "arXiv:2103.12900v2 Announce Type: replace \nAbstract: Motivated by the proliferation of extensive macroeconomic and health datasets necessitating accurate forecasts, a novel approach is introduced to address Vector Autoregressive (VAR) models. This approach employs the global-local shrinkage-Wishart prior. Unlike conventional VAR models, where degrees of freedom are predetermined to be equivalent to the size of the variable plus one or equal to zero, the proposed method integrates a hyperprior for the degrees of freedom to account for the uncertainty about the parameter values. Specifically, a loss-based prior is derived to leverage information regarding the data-inherent degrees of freedom. The efficacy of the proposed prior is demonstrated in a multivariate setting for forecasting macroeconomic data, as well as Dengue infection data."}, "https://arxiv.org/abs/2107.13737": {"title": "Design-Robust Two-Way-Fixed-Effects Regression For Panel Data", "link": "https://arxiv.org/abs/2107.13737", "description": "arXiv:2107.13737v3 Announce Type: replace \nAbstract: We propose a new estimator for average causal effects of a binary treatment with panel data in settings with general treatment patterns. Our approach augments the popular two-way-fixed-effects specification with unit-specific weights that arise from a model for the assignment mechanism. We show how to construct these weights in various settings, including the staggered adoption setting, where units opt into the treatment sequentially but permanently. The resulting estimator converges to an average (over units and time) treatment effect under the correct specification of the assignment model, even if the fixed effect model is misspecified. We show that our estimator is more robust than the conventional two-way estimator: it remains consistent if either the assignment mechanism or the two-way regression model is correctly specified. In addition, the proposed estimator performs better than the two-way-fixed-effect estimator if the outcome model and assignment mechanism are locally misspecified. This strong double robustness property underlines and quantifies the benefits of modeling the assignment process and motivates using our estimator in practice. We also discuss an extension of our estimator to handle dynamic treatment effects."}, "https://arxiv.org/abs/2108.04376": {"title": "Counterfactual Effect Generalization: A Combinatorial Definition", "link": "https://arxiv.org/abs/2108.04376", "description": "arXiv:2108.04376v4 Announce Type: replace \nAbstract: The widely used 'Counterfactual' definition of Causal Effects was derived for unbiasedness and accuracy - and not generalizability. We propose a Combinatorial definition for the External Validity (EV) of intervention effects. We first define the concept of an effect observation 'background'. We then formulate conditions for effect generalization based on their sets of (observable and unobservable) backgrounds. This reveals two limits for effect generalization: (1) when effects are observed under all their enumerable backgrounds, or, (2) when backgrounds have become sufficiently randomized. 
We use the resulting combinatorial framework to re-examine several issues in the original counterfactual formulation: out-of-sample validity, concurrent estimation of multiple effects, bias-variance tradeoffs, statistical power, and connections to current predictive and explaining techniques.\n Methodologically, the definitions also allow us to replace the parametric estimation problems that followed the counterfactual definition by combinatorial enumeration and randomization problems in non-experimental samples. We use this non-parametric framework to demonstrate (External Validity, Unconfoundedness and Precision) tradeoffs in the performance of popular supervised, explaining, and causal-effect estimators. We demonstrate that the approach also allows for the use of these methods in non-i.i.d. samples. The COVID-19 pandemic highlighted the need for learning solutions to provide predictions in severely incomplete samples. We demonstrate applications in this pressing problem."}, "https://arxiv.org/abs/2212.14411": {"title": "Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations", "link": "https://arxiv.org/abs/2212.14411", "description": "arXiv:2212.14411v3 Announce Type: replace \nAbstract: Sequential tests and their implied confidence sequences, which are valid at arbitrary stopping times, promise flexible statistical inference and on-the-fly decision making. However, strong guarantees are limited to parametric sequential tests that under-cover in practice or concentration-bound-based sequences that over-cover and have suboptimal rejection times. In this work, we consider \\cite{robbins1970boundary}'s delayed-start normal-mixture sequential probability ratio tests, and we provide the first asymptotic type-I-error and expected-rejection-time guarantees under general non-parametric data generating processes, where the asymptotics are indexed by the test's burn-in time. The type-I-error results primarily leverage a martingale strong invariance principle and establish that these tests (and their implied confidence sequences) have type-I error rates approaching a desired $\\alpha$-level. The expected-rejection-time results primarily leverage an identity inspired by It\\^o's lemma and imply that, in certain asymptotic regimes, the expected rejection time approaches the minimum possible among $\\alpha$-level tests. We show how to apply our results to sequential inference on parameters defined by estimating equations, such as average treatment effects. Together, our results establish these (ostensibly parametric) tests as general-purpose, non-parametric, and near-optimal. We illustrate this via numerical experiments."}, "https://arxiv.org/abs/2307.12754": {"title": "Nonparametric Linear Feature Learning in Regression Through Regularisation", "link": "https://arxiv.org/abs/2307.12754", "description": "arXiv:2307.12754v3 Announce Type: replace \nAbstract: Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. 
To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternating minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments."}, "https://arxiv.org/abs/2311.10877": {"title": "Covariate adjustment in randomized experiments with missing outcomes and covariates", "link": "https://arxiv.org/abs/2311.10877", "description": "arXiv:2311.10877v2 Announce Type: replace \nAbstract: Covariate adjustment can improve precision in analyzing randomized experiments. With fully observed data, regression adjustment and propensity score weighting are asymptotically equivalent in improving efficiency over unadjusted analysis. When some outcomes are missing, we consider combining these two adjustment methods with inverse probability of observation weighting for handling missing outcomes, and show that the equivalence between the two methods breaks down. Regression adjustment no longer ensures efficiency gain over unadjusted analysis unless the true outcome model is linear in covariates or the outcomes are missing completely at random. Propensity score weighting, in contrast, still guarantees efficiency over unadjusted analysis, and including more covariates in adjustment never harms asymptotic efficiency. Moreover, we establish the value of using partially observed covariates to secure additional efficiency by the missingness indicator method, which imputes all missing covariates by zero and uses the union of the completed covariates and corresponding missingness indicators as the new, fully observed covariates. Based on these findings, we recommend using regression adjustment in combination with the missingness indicator method if the linear outcome model or missing completely at random assumption is plausible and using propensity score weighting with the missingness indicator method otherwise."}, "https://arxiv.org/abs/2401.11804": {"title": "Regression Copulas for Multivariate Responses", "link": "https://arxiv.org/abs/2401.11804", "description": "arXiv:2401.11804v2 Announce Type: replace \nAbstract: We propose a novel distributional regression model for a multivariate response vector based on a copula process over the covariate space. It uses the implicit copula of a Gaussian multivariate regression, which we call a ``regression copula''. To allow for large covariate vectors, their coefficients are regularized using a novel multivariate extension of the horseshoe prior. Bayesian inference and distributional predictions are evaluated using efficient variational inference methods, allowing application to large datasets. An advantage of the approach is that the marginal distributions of the response vector can be estimated separately and accurately, resulting in predictive distributions that are marginally-calibrated. 
Two substantive applications of the methodology highlight its efficacy in multivariate modeling. The first is the econometric modeling and prediction of half-hourly regional Australian electricity prices. Here, our approach produces more accurate distributional forecasts than leading benchmark methods. The second is the evaluation of multivariate posteriors in likelihood-free inference (LFI) of a model for tree species abundance data, extending a previous univariate regression copula LFI method. In both applications, we demonstrate that our new approach exhibits a desirable marginal calibration property."}, "https://arxiv.org/abs/2211.13715": {"title": "Trust Your $\\nabla$: Gradient-based Intervention Targeting for Causal Discovery", "link": "https://arxiv.org/abs/2211.13715", "description": "arXiv:2211.13715v4 Announce Type: replace-cross \nAbstract: Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime."}, "https://arxiv.org/abs/2302.01246": {"title": "Behavioral Carry-Over Effect and Power Consideration in Crossover Trials", "link": "https://arxiv.org/abs/2302.01246", "description": "arXiv:2302.01246v2 Announce Type: replace-cross \nAbstract: A crossover trial is an efficient trial design when there is no carry-over effect. To reduce the impact of the biological carry-over effect, a washout period is often designed. However, the carry-over effect remains an outstanding concern when a washout period is unethical or cannot sufficiently diminish the impact of the carry-over effect. The latter can occur in comparative effectiveness research where the carry-over effect is often non-biological but behavioral. In this paper, we investigate the crossover design under a potential outcomes framework with and without the carry-over effect. We find that when the carry-over effect exists and satisfies a sign condition, the basic estimator underestimates the treatment effect, which does not inflate the type I error of one-sided tests but negatively impacts the power. This leads to a power trade-off between the crossover design and the parallel-group design, and we derive the condition under which the crossover design does not lead to type I error inflation and is still more powerful than the parallel-group design. We also develop covariate adjustment methods for crossover trials. 
We evaluate the performance of cross-over design and covariate adjustment using data from the MTN-034/REACH study."}, "https://arxiv.org/abs/2306.14318": {"title": "Regularization of the ensemble Kalman filter using a non-parametric, non-stationary spatial model", "link": "https://arxiv.org/abs/2306.14318", "description": "arXiv:2306.14318v3 Announce Type: replace-cross \nAbstract: The sample covariance matrix of a random vector is a good estimate of the true covariance matrix if the sample size is much larger than the length of the vector. In high-dimensional problems, this condition is never met. As a result, in high dimensions the EnKF ensemble does not contain enough information to specify the prior covariance matrix accurately. This necessitates the need for regularization of the analysis (observation update) problem. We propose a regularization technique based on a new spatial model on the sphere. The model is a constrained version of the general Gaussian process convolution model. The constraints on the location-dependent convolution kernel include local isotropy, positive definiteness as a function of distance, and smoothness as a function of location. The model allows for a rigorous definition of the local spectrum, which, in addition, is required to be a smooth function of spatial wavenumber. We regularize the ensemble Kalman filter by postulating that its prior covariances obey this model. The model is estimated online in a two-stage procedure. First, ensemble perturbations are bandpass filtered in several wavenumber bands to extract aggregated local spatial spectra. Second, a neural network recovers the local spectra from sample variances of the filtered fields. We show that with the growing ensemble size, the estimator is capable of extracting increasingly detailed spatially non-stationary structures. In simulation experiments, the new technique led to substantially better EnKF performance than several existing techniques."}, "https://arxiv.org/abs/2403.03228": {"title": "Multiwinner Elections and the Spoiler Effect", "link": "https://arxiv.org/abs/2403.03228", "description": "arXiv:2403.03228v1 Announce Type: new \nAbstract: In the popular debate over the use of ranked-choice voting, it is often claimed that the method of single transferable vote (STV) is immune or mostly immune to the so-called ``spoiler effect,'' where the removal of a losing candidate changes the set of winners. This claim has previously been studied only in the single-winner case. We investigate how susceptible STV is to the spoiler effect in multiwinner elections, where the output of the voting method is a committee of size at least two. To evaluate STV we compare it to numerous other voting methods including single non-transferable vote, $k$-Borda, and the Chamberlin-Courant rule. We provide simulation results under three different random models and empirical results using a large database of real-world multiwinner political elections from Scotland. 
Our results show that STV is not spoiler-proof in any meaningful sense in the multiwinner context, but it tends to perform well relative to other methods, especially when using real-world ballot data."}, "https://arxiv.org/abs/2403.03240": {"title": "Triple/Debiased Lasso for Statistical Inference of Conditional Average Treatment Effects", "link": "https://arxiv.org/abs/2403.03240", "description": "arXiv:2403.03240v1 Announce Type: new \nAbstract: This study investigates the estimation and the statistical inference about Conditional Average Treatment Effects (CATEs), which have garnered attention as a metric representing individualized causal effects. In our data-generating process, we assume linear models for the outcomes associated with binary treatments and define the CATE as a difference between the expected outcomes of these linear models. This study allows the linear models to be high-dimensional, and our interest lies in consistent estimation and statistical inference for the CATE. In high-dimensional linear regression, one typical approach is to assume sparsity. However, in our study, we do not assume sparsity directly. Instead, we consider sparsity only in the difference of the linear models. We first use a doubly robust estimator to approximate this difference and then regress the difference on covariates with Lasso regularization. Although this regression estimator is consistent for the CATE, we further reduce the bias using the techniques in double/debiased machine learning (DML) and debiased Lasso, leading to $\\sqrt{n}$-consistency and confidence intervals. We refer to the debiased estimator as the triple/debiased Lasso (TDL), applying both DML and debiased Lasso techniques. We confirm the soundness of our proposed method through simulation studies."}, "https://arxiv.org/abs/2403.03299": {"title": "Understanding and avoiding the \"weights of regression\": Heterogeneous effects, misspecification, and longstanding solutions", "link": "https://arxiv.org/abs/2403.03299", "description": "arXiv:2403.03299v1 Announce Type: new \nAbstract: Researchers in many fields endeavor to estimate treatment effects by regressing outcome data (Y) on a treatment (D) and observed confounders (X). Even absent unobserved confounding, the regression coefficient on the treatment reports a weighted average of strata-specific treatment effects (Angrist, 1998). Where heterogeneous treatment effects cannot be ruled out, the resulting coefficient is thus not generally equal to the average treatment effect (ATE), and is unlikely to be the quantity of direct scientific or policy interest. The difference between the coefficient and the ATE has led researchers to propose various interpretational, bounding, and diagnostic aids (Humphreys, 2009; Aronow and Samii, 2016; Sloczynski, 2022; Chattopadhyay and Zubizarreta, 2023). We note that the linear regression of Y on D and X can be misspecified when the treatment effect is heterogeneous in X. The \"weights of regression\", for which we provide a new (more general) expression, simply characterize how the OLS coefficient will depart from the ATE under the misspecification resulting from unmodeled treatment effect heterogeneity. Consequently, a natural alternative to suffering these weights is to address the misspecification that gives rise to them. For investigators committed to linear approaches, we propose relying on the slightly weaker assumption that the potential outcomes are linear in X. 
Numerous well-known estimators are unbiased for the ATE under this assumption, namely regression-imputation/g-computation/T-learner, regression with an interaction of the treatment and covariates (Lin, 2013), and balancing weights. Any of these approaches avoid the apparent weighting problem of the misspecified linear regression, at an efficiency cost that will be small when there are few covariates relative to sample size. We demonstrate these lessons using simulations in observational and experimental settings."}, "https://arxiv.org/abs/2403.03349": {"title": "A consensus-constrained parsimonious Gaussian mixture model for clustering hyperspectral images", "link": "https://arxiv.org/abs/2403.03349", "description": "arXiv:2403.03349v1 Announce Type: new \nAbstract: The use of hyperspectral imaging to investigate food samples has grown due to the improved performance and lower cost of spectroscopy instrumentation. Food engineers use hyperspectral images to classify the type and quality of a food sample, typically using classification methods. In order to train these methods, every pixel in each training image needs to be labelled. Typically, computationally cheap threshold-based approaches are used to label the pixels, and classification methods are trained based on those labels. However, threshold-based approaches are subjective and cannot be generalized across hyperspectral images taken in different conditions and of different foods. Here a consensus-constrained parsimonious Gaussian mixture model (ccPGMM) is proposed to label pixels in hyperspectral images using a model-based clustering approach. The ccPGMM utilizes available information on the labels of a small number of pixels and the relationship between those pixels and neighbouring pixels as constraints when clustering the rest of the pixels in the image. A latent variable model is used to represent the high-dimensional data in terms of a small number of underlying latent factors. To ensure computational feasibility, a consensus clustering approach is employed, where the data are divided into multiple randomly selected subsets of variables and constrained clustering is applied to each data subset; the clustering results are then consolidated across all data subsets to provide a consensus clustering solution. The ccPGMM approach is applied to simulated datasets and real hyperspectral images of three types of puffed cereal, corn, rice, and wheat. Improved clustering performance and computational efficiency are demonstrated when compared to other current state-of-the-art approaches."}, "https://arxiv.org/abs/2403.03389": {"title": "Exploring Spatial Generalized Functional Linear Models: A Comparative Simulation Study and Analysis of COVID-19", "link": "https://arxiv.org/abs/2403.03389", "description": "arXiv:2403.03389v1 Announce Type: new \nAbstract: Implementation of spatial generalized linear models with a functional covariate can be accomplished through the use of a truncated basis expansion of the covariate process. In practice, one must select a truncation level for use. We compare five criteria for the selection of an appropriate truncation level, including AIC and BIC based on a log composite likelihood, a fraction of variance explained criterion, a fitted mean squared error, and a prediction error with one standard error rule. 
Based on the use of extensive simulation studies, we propose that BIC constitutes a reasonable default criterion for the selection of the truncation level for use in a spatial functional generalized linear model. In addition, we demonstrate that the spatial model with a functional covariate outperforms other models when the data contain spatial structure and response variables are in fact influenced by a functional covariate process. We apply the spatial functional generalized linear model to a problem in which the objective is to relate COVID-19 vaccination rates in counties of states in the Midwestern United States to the number of new cases from previous weeks in those same geographic regions."}, "https://arxiv.org/abs/2403.03589": {"title": "Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices", "link": "https://arxiv.org/abs/2403.03589", "description": "arXiv:2403.03589v1 Announce Type: new \nAbstract: This study designs an adaptive experiment for efficiently estimating average treatment effects (ATEs). We consider an adaptive experiment where an experimenter sequentially samples an experimental unit from a covariate density decided by the experimenter and assigns a treatment. After assigning a treatment, the experimenter observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using gathered samples. The objective of the experimenter is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose a framework under which an experimenter optimizes the covariate density, as well as the propensity score, and find that optimizing both covariate density and propensity score reduces the asymptotic variance more than optimizing only the propensity score. Based on this idea, in each round of our experiment, the experimenter optimizes the covariate density and propensity score based on past observations. To design an adaptive experiment, we first derive the efficient covariate density and propensity score that minimize the semiparametric efficiency bound, a lower bound for the asymptotic variance given a fixed covariate density and a fixed propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound."}, "https://arxiv.org/abs/2403.03613": {"title": "Reducing the dimensionality and granularity in hierarchical categorical variables", "link": "https://arxiv.org/abs/2403.03613", "description": "arXiv:2403.03613v1 Announce Type: new \nAbstract: Hierarchical categorical variables often exhibit many levels (high granularity) and many classes within each level (high dimensionality). This may cause overfitting and estimation issues when including such covariates in a predictive model. In current literature, a hierarchical covariate is often incorporated via nested random effects. However, this does not facilitate the assumption of classes having the same effect on the response variable. In this paper, we propose a methodology to obtain a reduced representation of a hierarchical categorical variable. We show how entity embedding can be applied in a hierarchical setting. 
Subsequently, we propose a top-down clustering algorithm which leverages the information encoded in the embeddings to reduce both the within-level dimensionality as well as the overall granularity of the hierarchical categorical variable. In simulation experiments, we show that our methodology can effectively approximate the true underlying structure of a hierarchical covariate in terms of the effect on a response variable, and find that incorporating the reduced hierarchy improves model fit. We apply our methodology on a real dataset and find that the reduced hierarchy is an improvement over the original hierarchical structure and reduced structures proposed in the literature."}, "https://arxiv.org/abs/2403.03646": {"title": "Bayesian Generalized Distributed Lag Regression with Variable Selection", "link": "https://arxiv.org/abs/2403.03646", "description": "arXiv:2403.03646v1 Announce Type: new \nAbstract: Distributed Lag Models (DLMs) and similar regression approaches such as MIDAS have been used for many decades in econometrics, and more recently in the study of air quality and its impact on human health. They are useful not only for quantifying accumulating and delayed effects, but also for estimating the lags that are most susceptible to these effects. Among other things, they have been used to infer the period of exposure to poor air quality which might negatively impact child birth weight. The increased attention DLMs have received in recent years is reflective of their potential to help us understand a great many issues, particularly in the investigation of how the environment affects human health. In this paper we describe how to expand the utility of these models for Bayesian inference by leveraging latent-variables. In particular we explain how to perform binary regression to better handle imbalanced data, how to incorporate negative binomial regression, and how to estimate the probability of predictor inclusion. Extra parameters introduced through the DLM framework may require calibration for the MCMC algorithm, but this will not be the case in DLM-based analyses often seen in pollution exposure literature. In these cases, the parameters are inferred through a fully automatic Gibbs sampling procedure."}, "https://arxiv.org/abs/2403.03722": {"title": "Is Distance Correlation Robust?", "link": "https://arxiv.org/abs/2403.03722", "description": "arXiv:2403.03722v1 Announce Type: new \nAbstract: Distance correlation is a popular measure of dependence between random variables. It has some robustness properties, but not all. We prove that the influence function of the usual distance correlation is bounded, but that its breakdown value is zero. Moreover, it has an unbounded sensitivity function, converging to the bounded influence function for increasing sample size. To address this sensitivity to outliers we construct a more robust version of distance correlation, which is based on a new data transformation. Simulations indicate that the resulting method is quite robust, and has good power in the presence of outliers. We illustrate the method on genetic data. Comparing the classical distance correlation with its more robust version provides additional insight."}, "https://arxiv.org/abs/2403.03778": {"title": "Ancestor regression in structural vector autoregressive models", "link": "https://arxiv.org/abs/2403.03778", "description": "arXiv:2403.03778v1 Announce Type: new \nAbstract: We present a new method for causal discovery in linear structural vector autoregressive models. 
We adapt an idea designed for independent observations to the case of time series while retaining its favorable properties, i.e., explicit error control for false causal discovery, at least asymptotically. We apply our method to several real-world bivariate time series datasets and discuss its findings which mostly agree with common understanding. The arrow of time in a model can be interpreted as background knowledge on possible causal mechanisms. Hence, our ideas could be extended to incorporating different background knowledge, even for independent observations."}, "https://arxiv.org/abs/2403.03802": {"title": "Inequalities and bounds for expected order statistics from transform-ordered families", "link": "https://arxiv.org/abs/2403.03802", "description": "arXiv:2403.03802v1 Announce Type: new \nAbstract: We introduce a comprehensive method for establishing stochastic orders among order statistics in the i.i.d. case. This approach relies on the assumption that the underlying distribution is linked to a reference distribution through a transform order. Notably, this method exhibits broad applicability, particularly since several well-known nonparametric distribution families can be defined using relevant transform orders, including the convex and the star transform orders. In the context of convex-ordered families, we demonstrate that applying Jensen's inequality enables the derivation of bounds for the probability that a random variable exceeds the expected value of its corresponding order statistic."}, "https://arxiv.org/abs/2403.03837": {"title": "An Adaptive Multivariate Functional EWMA Control Chart", "link": "https://arxiv.org/abs/2403.03837", "description": "arXiv:2403.03837v1 Announce Type: new \nAbstract: In many modern industrial scenarios, the measurements of the quality characteristics of interest are often required to be represented as functional data or profiles. This motivates the growing interest in extending traditional univariate statistical process monitoring (SPM) schemes to the functional data setting. This article proposes a new SPM scheme, which is referred to as adaptive multivariate functional EWMA (AMFEWMA), to extend the well-known exponentially weighted moving average (EWMA) control chart from the univariate scalar to the multivariate functional setting. The favorable performance of the AMFEWMA control chart over existing methods is assessed via an extensive Monte Carlo simulation. Its practical applicability is demonstrated through a case study in the monitoring of the quality of a resistance spot welding process in the automotive industry through the online observations of dynamic resistance curves, which are associated with multiple spot welds on the same car body and recognized as the full technological signature of the process."}, "https://arxiv.org/abs/2403.03868": {"title": "Confidence on the Focal: Conformal Prediction with Selection-Conditional Coverage", "link": "https://arxiv.org/abs/2403.03868", "description": "arXiv:2403.03868v1 Announce Type: new \nAbstract: Conformal prediction builds marginally valid prediction intervals which cover the unknown outcome of a randomly drawn new test point with a prescribed probability. In practice, a common scenario is that, after seeing the test unit(s), practitioners decide which test unit(s) to focus on in a data-driven manner, and wish to quantify the uncertainty for the focal unit(s). In such cases, marginally valid prediction intervals for these focal units can be misleading due to selection bias. 
This paper presents a general framework for constructing a prediction set with finite-sample exact coverage conditional on the unit being selected. Its general form works for arbitrary selection rules, and generalizes Mondrian Conformal Prediction to multiple test units and non-equivariant classifiers. We then work out a computationally efficient implementation of our framework for a number of realistic selection rules, including top-K selection, optimization-based selection, selection based on conformal p-values, and selection based on properties of preliminary conformal prediction sets. The performance of our methods is demonstrated via applications in drug discovery and health risk prediction."}, "https://arxiv.org/abs/2403.03948": {"title": "Estimating the household secondary attack rate with the Incomplete Chain Binomial model", "link": "https://arxiv.org/abs/2403.03948", "description": "arXiv:2403.03948v1 Announce Type: new \nAbstract: The Secondary Attack Rate (SAR) is a measure of how infectious a communicable disease is, and is often estimated based on studies of disease transmission in households. The Chain Binomial model is a simple model for disease outbreaks, and the final size distribution derived from it can be used to estimate the SAR using simple summary statistics. The final size distribution of the Chain Binomial model assumes that the outbreaks have concluded, which in some instances may require a long follow-up time. We develop a way to compute the probability distribution of the number of infected before the outbreak has concluded, which we call the Incomplete Chain Binomial distribution. We study a few theoretical properties of the model. We develop Maximum Likelihood estimation routines for inference on the SAR and explore the model by analyzing two real-world data sets."}, "https://arxiv.org/abs/1612.06040": {"title": "Monte Carlo goodness-of-fit tests for degree corrected and related stochastic blockmodels", "link": "https://arxiv.org/abs/1612.06040", "description": "arXiv:1612.06040v4 Announce Type: replace \nAbstract: We construct Bayesian and frequentist finite-sample goodness-of-fit tests for three different variants of the stochastic blockmodel for network data. Since all of the stochastic blockmodel variants are log-linear in form when block assignments are known, the tests for the \\emph{latent} block model versions combine a block membership estimator with the algebraic statistics machinery for testing goodness-of-fit in log-linear models. We describe Markov bases and marginal polytopes of the variants of the stochastic blockmodel, and discuss how both facilitate the development of goodness-of-fit tests and understanding of model behavior.\n The general testing methodology developed here extends to any finite mixture of log-linear models on discrete data, and as such is the first application of the algebraic statistics machinery for latent-variable models."}, "https://arxiv.org/abs/2212.06108": {"title": "Tandem clustering with invariant coordinate selection", "link": "https://arxiv.org/abs/2212.06108", "description": "arXiv:2212.06108v4 Announce Type: replace \nAbstract: For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. 
To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach."}, "https://arxiv.org/abs/2302.11322": {"title": "Causal inference with misspecified network interference structure", "link": "https://arxiv.org/abs/2302.11322", "description": "arXiv:2302.11322v2 Announce Type: replace \nAbstract: Under interference, the potential outcomes of a unit depend on treatments assigned to other units. A network interference structure is typically assumed to be given and accurate. In this paper, we study the problems resulting from misspecifying these networks. First, we derive bounds on the bias arising from estimating causal effects under a misspecified network. We show that the maximal possible bias depends on the divergence between the assumed network and the true one with respect to the induced exposure probabilities. Then, we propose a novel estimator that leverages multiple networks simultaneously and is unbiased if one of the networks is correct, thus providing robustness to network specification. Additionally, we develop a probabilistic bias analysis that quantifies the impact of a postulated misspecification mechanism on the causal estimates. We illustrate key issues in simulations and demonstrate the utility of the proposed methods in a social network field experiment and a cluster-randomized trial with suspected cross-cluster contamination."}, "https://arxiv.org/abs/2305.00484": {"title": "Sequential Markov Chain Monte Carlo for Lagrangian Data Assimilation with Applications to Unknown Data Locations", "link": "https://arxiv.org/abs/2305.00484", "description": "arXiv:2305.00484v3 Announce Type: replace \nAbstract: We consider a class of high-dimensional spatial filtering problems, where the spatial locations of observations are unknown and driven by the partially observed hidden signal. This problem is exceptionally challenging as not only is it high-dimensional, but the model for the signal yields longer-range time dependencies through the observation locations. 
Motivated by this model we revisit a lesser-known and \\emph{provably convergent} computational methodology from \\cite{berzuini, cent, martin} that uses sequential Markov Chain Monte Carlo (MCMC) chains. We extend this methodology for data filtering problems with unknown observation locations. We benchmark our algorithms on Linear Gaussian state space models against competing ensemble methods and demonstrate a significant improvement in both execution speed and accuracy. Finally, we implement a realistic case study on a high-dimensional rotating shallow water model (of about $10^4-10^5$ dimensions) with real and synthetic data. The data is provided by the National Oceanic and Atmospheric Administration (NOAA) and contains observations from ocean drifters in a domain of the Atlantic Ocean restricted to the longitude and latitude intervals $[-51^{\\circ}, -41^{\\circ}]$, $[17^{\\circ}, 27^{\\circ}]$ respectively."}, "https://arxiv.org/abs/2306.05829": {"title": "A reduced-rank approach to predicting multiple binary responses through machine learning", "link": "https://arxiv.org/abs/2306.05829", "description": "arXiv:2306.05829v2 Announce Type: replace \nAbstract: This paper investigates the problem of simultaneously predicting multiple binary responses by utilizing a shared set of covariates. Our approach incorporates machine learning techniques for binary classification, without making assumptions about the underlying observations. Instead, our focus lies on a group of predictors, aiming to identify the one that minimizes prediction error. Unlike previous studies that primarily address estimation error, we directly analyze the prediction error of our method using PAC-Bayesian bounds techniques. In this paper, we introduce a pseudo-Bayesian approach capable of handling incomplete response data. Our strategy is efficiently implemented using the Langevin Monte Carlo method. Through simulation studies and a practical application using real data, we demonstrate the effectiveness of our proposed method, producing comparable or sometimes superior results compared to the current state-of-the-art method."}, "https://arxiv.org/abs/2309.12380": {"title": "Methods for generating and evaluating synthetic longitudinal patient data: a systematic review", "link": "https://arxiv.org/abs/2309.12380", "description": "arXiv:2309.12380v2 Announce Type: replace \nAbstract: The proliferation of data in recent years has led to the advancement and utilization of various statistical and deep learning techniques, thus expediting research and development activities. However, not all industries have benefited equally from the surge in data availability, partly due to legal restrictions on data usage and privacy regulations, such as in medicine. To address this issue, various statistical disclosure and privacy-preserving methods have been proposed, including the use of synthetic data generation. Synthetic data are generated based on some existing data, with the aim of replicating them as closely as possible and acting as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. 
The collected information includes, but is not limited to, method type, source code availability, and approaches used to assess resemblance, utility, and privacy. Furthermore, the paper discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods."}, "https://arxiv.org/abs/2312.01209": {"title": "A Method of Moments Approach to Asymptotically Unbiased Synthetic Controls", "link": "https://arxiv.org/abs/2312.01209", "description": "arXiv:2312.01209v2 Announce Type: replace \nAbstract: A common approach to constructing a Synthetic Control unit is to fit on the outcome variable and covariates in pre-treatment time periods, but it has been shown by Ferman and Pinto (2019) that this approach does not provide asymptotic unbiasedness when the fit is imperfect and the number of controls is fixed. Many related panel methods have a similar limitation when the number of units is fixed. I introduce and evaluate a new method in which the Synthetic Control is constructed using a General Method of Moments approach where units not being included in the Synthetic Control are used as instruments. I show that a Synthetic Control Estimator of this form will be asymptotically unbiased as the number of pre-treatment time periods goes to infinity, even when pre-treatment fit is imperfect and the number of units is fixed. Furthermore, if both the number of pre-treatment and post-treatment time periods go to infinity, then averages of treatment effects can be consistently estimated. I conduct simulations and an empirical application to compare the performance of this method with existing approaches in the literature."}, "https://arxiv.org/abs/2312.12952": {"title": "High-dimensional sparse classification using exponential weighting with empirical hinge loss", "link": "https://arxiv.org/abs/2312.12952", "description": "arXiv:2312.12952v2 Announce Type: replace \nAbstract: In this study, we address the problem of high-dimensional binary classification. Our proposed solution involves employing an aggregation technique founded on exponential weights and empirical hinge loss. Through the employment of a suitable sparsity-inducing prior distribution, we demonstrate that our method yields favorable theoretical results on prediction error. The efficiency of our procedure is achieved through the utilization of Langevin Monte Carlo, a gradient-based sampling approach. To illustrate the effectiveness of our approach, we conduct comparisons with the logistic Lasso on simulated data and a real dataset. Our method frequently demonstrates superior performance compared to the logistic Lasso."}, "https://arxiv.org/abs/2211.12200": {"title": "Fast Computer Model Calibration using Annealed and Transformed Variational Inference", "link": "https://arxiv.org/abs/2211.12200", "description": "arXiv:2211.12200v2 Announce Type: replace-cross \nAbstract: Computer models play a crucial role in numerous scientific and engineering domains. To ensure the accuracy of simulations, it is essential to properly calibrate the input parameters of these models through statistical inference. While Bayesian inference is the standard approach for this task, employing Markov Chain Monte Carlo methods often encounters computational hurdles due to the costly evaluation of likelihood functions and slow mixing rates. Although variational inference (VI) can be a fast alternative to traditional Bayesian approaches, VI has limited applicability due to boundary issues and local optima problems. 
To address these challenges, we propose flexible VI methods based on deep generative models that do not require parametric assumptions on the variational distribution. We embed a surjective transformation in our framework to avoid posterior truncation at the boundary. Additionally, we provide theoretical conditions that guarantee the success of the algorithm. Furthermore, our temperature annealing scheme can prevent being trapped in local optima through a series of intermediate posteriors. We apply our method to infectious disease models and a geophysical model, illustrating that the proposed method can provide fast and accurate inference compared to its competitors."}, "https://arxiv.org/abs/2403.03975": {"title": "Robust covariance estimation and explainable outlier detection for matrix-valued data", "link": "https://arxiv.org/abs/2403.03975", "description": "arXiv:2403.03975v1 Announce Type: new \nAbstract: The minimum covariance determinant (MCD) estimator is a popular method for robustly estimating the mean and covariance of multivariate data. We extend the MCD to the setting where the observations are matrices rather than vectors and introduce the matrix minimum covariance determinant (MMCD) estimators for robust parameter estimation. These estimators hold equivariance properties, achieve a high breakdown point, and are consistent under elliptical matrix-variate distributions. We have also developed an efficient algorithm with convergence guarantees to compute the MMCD estimators. Using the MMCD estimators, we can compute robust Mahalanobis distances that can be used for outlier detection. Those distances can be decomposed into outlyingness contributions from each cell, row, or column of a matrix-variate observation using Shapley values, a concept for outlier explanation recently introduced in the multivariate setting. Simulations and examples reveal the excellent properties and usefulness of the robust estimators."}, "https://arxiv.org/abs/2403.04058": {"title": "Plant-Capture Methods for Estimating Population Size from Uncertain Plant Captures", "link": "https://arxiv.org/abs/2403.04058", "description": "arXiv:2403.04058v1 Announce Type: new \nAbstract: Plant-capture is a variant of classical capture-recapture methods used to estimate the size of a population. In this method, decoys referred to as \"plants\" are introduced into the population in order to estimate the capture probability. The method has shown considerable success in estimating population sizes from limited samples in many epidemiological, ecological, and demographic studies. However, previous plant-recapture studies have not systematically accounted for uncertainty in the capture status of each individual plant. In this work, we propose various approaches to formally incorporate uncertainty into the plant-capture model arising from (i) the capture status of plants and (ii) the heterogeneity between multiple survey sites. We present two inference methods and compare their performance in simulation studies. 
We then apply our methods to estimate the size of the homeless population in several US cities using the large-scale \"S-night\" study conducted by the US Census Bureau."}, "https://arxiv.org/abs/2403.04131": {"title": "Extract Mechanisms from Heterogeneous Effects: Identification Strategy for Mediation Analysis", "link": "https://arxiv.org/abs/2403.04131", "description": "arXiv:2403.04131v1 Announce Type: new \nAbstract: Understanding causal mechanisms is essential for explaining and generalizing empirical phenomena. Causal mediation analysis offers statistical techniques to quantify mediation effects. However, existing methods typically require strong identification assumptions or sophisticated research designs. We develop a new identification strategy that simplifies these assumptions, enabling the simultaneous estimation of causal and mediation effects. The strategy is based on a novel decomposition of total treatment effects, which transforms the challenging mediation problem into a simple linear regression problem. The new method establishes a new link between causal mediation and causal moderation. We discuss several research designs and estimators to increase the usability of our identification strategy for a variety of empirical studies. We demonstrate the application of our method by estimating the causal mediation effect in experiments concerning common pool resource governance and voting information. Additionally, we have created statistical software to facilitate the implementation of our method."}, "https://arxiv.org/abs/2403.04345": {"title": "A Novel Theoretical Framework for Exponential Smoothing", "link": "https://arxiv.org/abs/2403.04345", "description": "arXiv:2403.04345v1 Announce Type: new \nAbstract: Simple Exponential Smoothing is a classical technique used for smoothing time series data by assigning exponentially decreasing weights to past observations through a recursive equation; it is sometimes presented as a rule of thumb procedure. We introduce a novel theoretical perspective where the recursive equation that defines simple exponential smoothing occurs naturally as a stochastic gradient ascent scheme to optimize a sequence of Gaussian log-likelihood functions. Under this lens of analysis, our main theorem shows that, in a general setting, simple exponential smoothing converges to a neighborhood of the trend of a trend-stationary stochastic process. This offers a novel theoretical assurance that the exponential smoothing procedure yields reliable estimators of the underlying trend, shedding light on long-standing observations in the literature regarding the robustness of simple exponential smoothing."}, "https://arxiv.org/abs/2403.04354": {"title": "A Logarithmic Mean Divisia Index Decomposition of CO$_2$ Emissions from Energy Use in Romania", "link": "https://arxiv.org/abs/2403.04354", "description": "arXiv:2403.04354v1 Announce Type: new \nAbstract: Carbon emissions have become an alarming indicator and an intricate challenge, fueling an extended debate about climate change. The growing use of fossil fuels for economic progress, together with the need to simultaneously reduce carbon output, has turned into a substantial global challenge. 
The aim of this paper is to examine the driving factors of CO$_2$ emissions from the energy sector in Romania over the period 2008-2022 using the log mean Divisia index (LMDI) method. The analysis takes into account five items: CO$_2$ emissions, primary energy resources, energy consumption, gross domestic product, and population, from which the contributions of carbon intensity, energy mix, generating efficiency, the economy, and population to the change in emissions are calculated. The results indicate that the generating efficiency effect (-90968.57) is the largest inhibiting factor, while the economic effect (69084.04) is the largest positive factor, acting to increase CO$_2$ emissions."}, "https://arxiv.org/abs/2403.04564": {"title": "Estimating hidden population size from a single respondent-driven sampling survey", "link": "https://arxiv.org/abs/2403.04564", "description": "arXiv:2403.04564v1 Announce Type: new \nAbstract: This work is concerned with the estimation of hard-to-reach population sizes using a single respondent-driven sampling (RDS) survey, a variant of chain-referral sampling that leverages social relationships to reach members of a hidden population. The popularity of RDS as a standard approach for surveying hidden populations brings theoretical and methodological challenges regarding the estimation of population sizes, mainly for public health purposes. This paper proposes a frequentist, model-based framework for estimating the size of a hidden population using a network-based approach. An optimization algorithm is proposed for obtaining the identification region of the target parameter when model assumptions are violated. We characterize the asymptotic behavior of our proposed methodology and assess its finite sample performance under departures from model assumptions."}, "https://arxiv.org/abs/2403.04613": {"title": "Simultaneous Conformal Prediction of Missing Outcomes with Propensity Score $\\epsilon$-Discretization", "link": "https://arxiv.org/abs/2403.04613", "description": "arXiv:2403.04613v1 Announce Type: new \nAbstract: We study the problem of simultaneous predictive inference on multiple outcomes missing at random. We consider a suite of possible simultaneous coverage properties, conditionally on the missingness pattern and on the -- possibly discretized/binned -- feature values. For data with discrete feature distributions, we develop a procedure which attains feature- and missingness-conditional coverage, and further improve it via pooling its results after partitioning the unobserved outcomes. To handle general continuous feature distributions, we introduce methods based on discretized feature values. To mitigate the issue that feature-discretized data may fail to remain missing at random, we propose propensity score $\\epsilon$-discretization. This approach is inspired by the balancing property of the propensity score, namely that the missing data mechanism is independent of the outcome conditional on the propensity [Rosenbaum and Rubin (1983)]. We show that the resulting pro-CP method achieves propensity score discretized feature- and missingness-conditional coverage, when the propensity score is known exactly or is estimated sufficiently accurately. Furthermore, we consider a stronger inferential target, the squared-coverage guarantee, which penalizes the spread of the coverage proportion. We propose methods -- termed pro-CP2 -- to achieve it with similar conditional properties as we have shown for usual coverage. 
A key novel technical contribution in our results is that propensity score discretization leads to a notion of approximate balancing, which we formalize and characterize precisely. In extensive empirical experiments on simulated data and on a job search intervention dataset, we illustrate that our procedures provide informative prediction sets with valid conditional coverage."}, "https://arxiv.org/abs/2403.04766": {"title": "Nonparametric Regression under Cluster Sampling", "link": "https://arxiv.org/abs/2403.04766", "description": "arXiv:2403.04766v1 Announce Type: new \nAbstract: This paper develops a general asymptotic theory for nonparametric kernel regression in the presence of cluster dependence. We examine nonparametric density estimation, Nadaraya-Watson kernel regression, and local linear estimation. Our theory accommodates growing and heterogeneous cluster sizes. We derive asymptotic conditional bias and variance, establish uniform consistency, and prove asymptotic normality. Our findings reveal that under heterogeneous cluster sizes, the asymptotic variance includes a new term reflecting within-cluster dependence, which is overlooked when cluster sizes are presumed to be bounded. We propose valid approaches for bandwidth selection and inference, introduce estimators of the asymptotic variance, and demonstrate their consistency. In simulations, we verify the effectiveness of the cluster-robust bandwidth selection and show that the derived cluster-robust confidence interval improves the coverage ratio. We illustrate the application of these methods using a policy-targeting dataset in development economics."}, "https://arxiv.org/abs/2403.04039": {"title": "Sample size planning for conditional counterfactual mean estimation with a K-armed randomized experiment", "link": "https://arxiv.org/abs/2403.04039", "description": "arXiv:2403.04039v1 Announce Type: cross \nAbstract: We cover how to determine a sufficiently large sample size for a $K$-armed randomized experiment in order to estimate conditional counterfactual expectations in data-driven subgroups. The sub-groups can be output by any feature space partitioning algorithm, including as defined by binning users having similar predictive scores or as defined by a learned policy tree. After carefully specifying the inference target, a minimum confidence level, and a maximum margin of error, the key is to turn the original goal into a simultaneous inference problem where the recommended sample size to offset an increased possibility of estimation error is directly related to the number of inferences to be conducted. Given a fixed sample size budget, our result allows us to invert the question to one about the feasible number of treatment arms or partition complexity (e.g. number of decision tree leaves). Using policy trees to learn sub-groups, we evaluate our nominal guarantees on a large publicly-available randomized experiment test data set."}, "https://arxiv.org/abs/2403.04236": {"title": "Regularized DeepIV with Model Selection", "link": "https://arxiv.org/abs/2403.04236", "description": "arXiv:2403.04236v1 Announce Type: cross \nAbstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. 
While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring minimax computation oracle, which is highly unstable in practice; (3) absence of model selection procedure. In this paper, we present the first method and analysis that can avoid all three limitations, while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that can converge to the least-norm IV solution. Our method consists of two stages: first, we learn the conditional distribution of covariates, and by utilizing the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that can achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov regularized variant of the popular DeepIV method with a non-parametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization which was absent from the original work."}, "https://arxiv.org/abs/2403.04328": {"title": "A dual approach to nonparametric characterization for random utility models", "link": "https://arxiv.org/abs/2403.04328", "description": "arXiv:2403.04328v1 Announce Type: cross \nAbstract: This paper develops a novel characterization for random utility models (RUM), which turns out to be a dual representation of the characterization by Kitamura and Stoye (2018, ECMA). For a given family of budgets and its \"patch\" representation \\'a la Kitamura and Stoye, we construct a matrix $\\Xi$ of which each row vector indicates the structure of possible revealed preference relations in each subfamily of budgets. Then, it is shown that a stochastic demand system on the patches of budget lines, say $\\pi$, is consistent with a RUM, if and only if $\\Xi\\pi \\geq \\mathbb{1}$. In addition to providing a concise closed form characterization, especially when $\\pi$ is inconsistent with RUMs, the vector $\\Xi\\pi$ also contains information concerning (1) sub-families of budgets in which cyclical choices must occur with positive probabilities, and (2) the maximal possible weights on rational choice patterns in a population. The notion of Chv\\'atal rank of polytopes and the duality theorem in linear programming play key roles to obtain these results."}, "https://arxiv.org/abs/2211.00329": {"title": "Weak Identification in Low-Dimensional Factor Models with One or Two Factors", "link": "https://arxiv.org/abs/2211.00329", "description": "arXiv:2211.00329v2 Announce Type: replace \nAbstract: This paper describes how to reparameterize low-dimensional factor models with one or two factors to fit weak identification theory developed for generalized method of moments models. Some identification-robust tests, here called \"plug-in\" tests, require a reparameterization to distinguish weakly identified parameters from strongly identified parameters. The reparameterizations in this paper make plug-in tests available for subvector hypotheses in low-dimensional factor models with one or two factors. 
Simulations show that the plug-in tests are less conservative than identification-robust tests that use the original parameterization. An empirical application to a factor model of parental investments in children is included."}, "https://arxiv.org/abs/2301.13368": {"title": "Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference", "link": "https://arxiv.org/abs/2301.13368", "description": "arXiv:2301.13368v2 Announce Type: replace \nAbstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL) avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method, which through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL."}, "https://arxiv.org/abs/2302.09211": {"title": "Bayesian Covariance Estimation for Multi-group Matrix-variate Data", "link": "https://arxiv.org/abs/2302.09211", "description": "arXiv:2302.09211v2 Announce Type: replace \nAbstract: Multi-group covariance estimation for matrix-variate data with small within group sample sizes is a key part of many data analysis tasks in modern applications. To obtain accurate group-specific covariance estimates, shrinkage estimation methods which shrink an unstructured, group-specific covariance either across groups towards a pooled covariance or within each group towards a Kronecker structure have been developed. However, in many applications, it is unclear which approach will result in more accurate covariance estimates. In this article, we present a hierarchical prior distribution which flexibly allows for both types of shrinkage. The prior linearly combines shrinkage across groups towards a shared pooled covariance and shrinkage within groups towards a group-specific Kronecker covariance. We illustrate the utility of the proposed prior in speech recognition and an analysis of chemical exposure data."}, "https://arxiv.org/abs/2307.10651": {"title": "Distributional Regression for Data Analysis", "link": "https://arxiv.org/abs/2307.10651", "description": "arXiv:2307.10651v2 Announce Type: replace \nAbstract: Flexible modeling of the entire distribution as a function of covariates is an important generalization of mean-based regression that has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. 
Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices."}, "https://arxiv.org/abs/2308.00014": {"title": "A new mapping of technological interdependence", "link": "https://arxiv.org/abs/2308.00014", "description": "arXiv:2308.00014v2 Announce Type: replace \nAbstract: How does technological interdependence affect a sector's ability to innovate? This paper answers this question by looking at knowledge interdependence (knowledge spillovers and technological complementarities) and structural interdependence (intersectoral network linkages). We examine these two dimensions of technological interdependence by applying novel methods of text mining and network analysis to the documents of 6.5 million patents granted by the United States Patent and Trademark Office (USPTO) between 1976 and 2021. We show that both dimensions positively affect sector innovation. While the impact of knowledge interdependence is slightly larger in the long-term horizon, positive shocks affecting the network linkages (structural interdependence) produce greater and more enduring effects on innovation performance in a relatively short run. Our analysis also highlights that patent text contains a wealth of information often not captured by traditional innovation metrics, such as patent citations."}, "https://arxiv.org/abs/2308.05564": {"title": "Large Skew-t Copula Models and Asymmetric Dependence in Intraday Equity Returns", "link": "https://arxiv.org/abs/2308.05564", "description": "arXiv:2308.05564v2 Announce Type: replace \nAbstract: Skew-t copula models are attractive for the modeling of financial data because they allow for asymmetric and extreme tail dependence. We show that the copula implicit in the skew-t distribution of Azzalini and Capitanio (2003) allows for a higher level of pairwise asymmetric dependence than two popular alternative skew-t copulas. Estimation of this copula in high dimensions is challenging, and we propose a fast and accurate Bayesian variational inference (VI) approach to do so. The method uses a conditionally Gaussian generative representation of the skew-t distribution to define an augmented posterior that can be approximated accurately. A fast stochastic gradient ascent algorithm is used to solve the variational optimization. The new methodology is used to estimate skew-t factor copula models for intraday returns from 2017 to 2021 on 93 U.S. equities. The copula captures substantial heterogeneity in asymmetric dependence over equity pairs, in addition to the variability in pairwise correlations. 
We show that intraday predictive densities from the skew-t copula are more accurate than from some other copula models, while portfolio selection strategies based on the estimated pairwise tail dependencies improve performance relative to the benchmark index."}, "https://arxiv.org/abs/2310.00864": {"title": "Multi-Label Residual Weighted Learning for Individualized Combination Treatment Rule", "link": "https://arxiv.org/abs/2310.00864", "description": "arXiv:2310.00864v3 Announce Type: replace \nAbstract: Individualized treatment rules (ITRs) have been widely applied in many fields such as precision medicine and personalized marketing. Beyond the extensive studies on ITR for binary or multiple treatments, there is considerable interest in applying combination treatments. This paper introduces a novel ITR estimation method for combination treatments incorporating interaction effects among treatments. Specifically, we propose the generalized $\\psi$-loss as a non-convex surrogate in the residual weighted learning framework, offering desirable statistical and computational properties. Statistically, the minimizer of the proposed surrogate loss is Fisher-consistent with the optimal decision rules, incorporating interaction effects at any intensity level - a significant improvement over existing methods. Computationally, the proposed method applies the difference-of-convex algorithm for efficient computation. Through simulation studies and real-world data applications, we demonstrate the superior performance of the proposed method in recommending combination treatments."}, "https://arxiv.org/abs/2310.01748": {"title": "A generative approach to frame-level multi-competitor races", "link": "https://arxiv.org/abs/2310.01748", "description": "arXiv:2310.01748v3 Announce Type: replace \nAbstract: Multi-competitor races often feature complicated within-race strategies that are difficult to capture when training on race outcome level data. Further, models which do not account for such strategic effects may suffer from confounded inferences and predictions. In this work we develop a general generative model for multi-competitor races which allows analysts to explicitly model certain strategic effects such as changing lanes or drafting and separate these impacts from competitor ability. The generative model allows one to simulate full races from any real or created starting position which opens new avenues for attributing value to within-race actions and to perform counter-factual analyses. This methodology is sufficiently general to apply to any track based multi-competitor races where both tracking data is available and competitor movement is well described by simultaneous forward and lateral movements. We apply this methodology to one-mile horse races using data provided by the New York Racing Association (NYRA) and the New York Thoroughbred Horsemen's Association (NYTHA) for the Big Data Derby 2022 Kaggle Competition. This data features granular tracking data for all horses at the frame-level (occurring at approximately 4 Hz).
We demonstrate how this model can yield new inferences, such as the estimation of horse-specific speed profiles which vary over phases of the race, and examples of posterior predictive counterfactual simulations to answer questions of interest such as starting lane impacts on race outcomes."}, "https://arxiv.org/abs/2311.17100": {"title": "Automatic cross-validation in structured models: Is it time to leave out leave-one-out?", "link": "https://arxiv.org/abs/2311.17100", "description": "arXiv:2311.17100v2 Announce Type: replace \nAbstract: Standard techniques such as leave-one-out cross-validation (LOOCV) might not be suitable for evaluating the predictive performance of models incorporating structured random effects. In such cases, the correlation between the training and test sets could have a notable impact on the model's prediction error. To overcome this issue, an automatic group construction procedure for leave-group-out cross validation (LGOCV) has recently emerged as a valuable tool for enhancing predictive performance measurement in structured models. The purpose of this paper is (i) to compare LOOCV and LGOCV within structured models, emphasizing model selection and predictive performance, and (ii) to provide real data applications in spatial statistics using complex structured models fitted with INLA, showcasing the utility of the automatic LGOCV method. First, we briefly review the key aspects of the recently proposed LGOCV method for automatic group construction in latent Gaussian models. We also demonstrate the effectiveness of this method for selecting the model with the highest predictive performance by simulating extrapolation tasks in both temporal and spatial data analyses. Finally, we provide insights into the effectiveness of the LGOCV method in modelling complex structured data, encompassing spatio-temporal multivariate count data, spatial compositional data, and spatio-temporal geospatial data."}, "https://arxiv.org/abs/2401.14512": {"title": "Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population", "link": "https://arxiv.org/abs/2401.14512", "description": "arXiv:2401.14512v2 Announce Type: replace \nAbstract: Randomized controlled trials (RCTs) serve as the cornerstone for understanding causal effects, yet extending inferences to target populations presents challenges due to effect heterogeneity and underrepresentation. Our paper addresses the critical issue of identifying and characterizing underrepresented subgroups in RCTs, proposing a novel framework for refining target populations to improve generalizability. We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups. ROOT optimizes the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate, ensuring more precise treatment effect estimations. Notably, ROOT generates interpretable characteristics of the underrepresented population, aiding researchers in effective communication. Our approach demonstrates improved precision and interpretability compared to alternatives, as illustrated with synthetic data experiments. We apply our methodology to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial -- investigating the effectiveness of medication for opioid use disorder -- to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). 
By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations."}, "https://arxiv.org/abs/2208.07581": {"title": "Regression modelling of spatiotemporal extreme U", "link": "https://arxiv.org/abs/2208.07581", "description": "arXiv:2208.07581v4 Announce Type: replace-cross \nAbstract: Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neural networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The \"black box\" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear, and additive, regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques."}, "https://arxiv.org/abs/2403.04912": {"title": "Bayesian Level-Set Clustering", "link": "https://arxiv.org/abs/2403.04912", "description": "arXiv:2403.04912v1 Announce Type: new \nAbstract: Broadly, the goal when clustering data is to separate observations into meaningful subgroups. The rich variety of methods for clustering reflects the fact that the relevant notion of meaningful clusters varies across applications. The classical Bayesian approach clusters observations by their association with components of a mixture model; the choice in class of components allows flexibility to capture a range of meaningful cluster notions. However, in practice the range is somewhat limited as difficulties with computation and cluster identifiability arise as components are made more flexible. Instead of mixture component attribution, we consider clusterings that are functions of the data and the density $f$, which allows us to separate flexible density estimation from clustering. Within this framework, we develop a method to cluster data into connected components of a level set of $f$. Under mild conditions, we establish that our Bayesian level-set (BALLET) clustering methodology yields consistent estimates, and we highlight its performance in a variety of toy and simulated data examples.
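The level-set clustering idea summarized above can be sketched in a few lines without the Bayesian machinery: estimate the density, keep points above a level, and take connected components of a neighborhood graph. The kernel density estimate, the level choice, and the connection radius below are illustrative assumptions, not the BALLET methodology itself.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
# Two well-separated Gaussian blobs plus diffuse background noise.
data = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(200, 2)),
    rng.normal([3.0, 3.0], 0.3, size=(200, 2)),
    rng.uniform(-2.0, 5.0, size=(50, 2)),
])

# Plug-in kernel density estimate (a stand-in for a draw of f from a posterior).
f_hat = gaussian_kde(data.T)(data.T)

# Keep points whose estimated density exceeds the chosen level lambda.
lam = np.quantile(f_hat, 0.4)
high = data[f_hat >= lam]

# Connect high-density points within radius r; connected components are the clusters.
r = 0.5
pairs = cKDTree(high).query_pairs(r, output_type="ndarray")
n_high = high.shape[0]
adjacency = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                       shape=(n_high, n_high))
n_clusters, labels = connected_components(adjacency, directed=False)
print("connected components above the level set:", n_clusters)
```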
Finally, through an application to astronomical data we show the method performs favorably relative to the popular level-set clustering algorithm DBSCAN in terms of accuracy, insensitivity to tuning parameters, and quantification of uncertainty."}, "https://arxiv.org/abs/2403.04915": {"title": "Bayesian Inference for High-dimensional Time Series by Latent Process Modeling", "link": "https://arxiv.org/abs/2403.04915", "description": "arXiv:2403.04915v1 Announce Type: new \nAbstract: Time series data arising in many applications nowadays are high-dimensional. A large number of parameters describe features of these time series. We propose a novel approach to modeling a high-dimensional time series through several independent univariate time series, which are then orthogonally rotated and sparsely linearly transformed. With this approach, any specified intrinsic relations among component time series given by a graphical structure can be maintained at all time snapshots. We call the resulting process an Orthogonally-rotated Univariate Time series (OUT). Key structural properties of time series such as stationarity and causality can be easily accommodated in the OUT model. For Bayesian inference, we put suitable prior distributions on the spectral densities of the independent latent time series, the orthogonal rotation matrix, and the common precision matrix of the component time series at every time point. A likelihood is constructed using the Whittle approximation for univariate latent time series. An efficient Markov Chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We study the convergence of the pseudo-posterior distribution based on the Whittle likelihood for the model's parameters upon developing a new general posterior convergence theorem for pseudo-posteriors. We find that the posterior contraction rate for independent observations essentially prevails in the OUT model under very mild conditions on the temporal dependence described in terms of the smoothness of the corresponding spectral densities. Through a simulation study, we compare the accuracy of estimating the parameters and identifying the graphical structure with other approaches. We apply the proposed methodology to analyze a dataset on different industrial components of the US gross domestic product between 2010 and 2019 and predict future observations."}, "https://arxiv.org/abs/2403.05080": {"title": "On an Empirical Likelihood based Solution to the Approximate Bayesian Computation Problem", "link": "https://arxiv.org/abs/2403.05080", "description": "arXiv:2403.05080v1 Announce Type: new \nAbstract: Approximate Bayesian Computation (ABC) methods are applicable to statistical models specified by generative processes with analytically intractable likelihoods. These methods try to approximate the posterior density of a model parameter by comparing the observed data with additional process-generated simulated datasets. For computational benefit, only the values of certain well-chosen summary statistics are usually compared, instead of the whole dataset. Most ABC procedures are computationally expensive, justified only heuristically, and have poor asymptotic properties. In this article, we introduce a new empirical likelihood-based approach to the ABC paradigm called ABCel.
The proposed procedure is computationally tractable and approximates the target log posterior of the parameter as a sum of two functions of the data -- namely, the mean of the optimal log-empirical likelihood weights and the estimated differential entropy of the summary functions. We rigorously justify the procedure via direct and reverse information projections onto appropriate classes of probability densities. Past applications of empirical likelihood in ABC demanded constraints based on analytically tractable estimating functions that involve both the data and the parameter; although by the nature of the ABC problem such functions may not be available in general. In contrast, we use constraints that are functions of the summary statistics only. Equally importantly, we show that our construction directly connects to the reverse information projection. We show that ABCel is posterior consistent and has highly favourable asymptotic properties. Its construction justifies the use of simple summary statistics like moments, quantiles, etc, which in practice produce an accurate approximation of the posterior density. We illustrate the performance of the proposed procedure in a range of applications."}, "https://arxiv.org/abs/2403.05331": {"title": "Causality and extremes", "link": "https://arxiv.org/abs/2403.05331", "description": "arXiv:2403.05331v1 Announce Type: new \nAbstract: In this work, we summarize the state-of-the-art methods in causal inference for extremes. In a non-exhaustive way, we start by describing an extremal approach to quantile treatment effect where the treatment has an impact on the tail of the outcome. Then, we delve into two primary causal structures for extremes, offering in-depth insights into their identifiability. Additionally, we discuss causal structure learning in relation to these two models as well as in a model-agnostic framework. To illustrate the practicality of the approaches, we apply and compare these different methods using a Seine network dataset. This work concludes with a summary and outlines potential directions for future research."}, "https://arxiv.org/abs/2403.05336": {"title": "Estimating time-varying exposure effects through continuous-time modelling in Mendelian randomization", "link": "https://arxiv.org/abs/2403.05336", "description": "arXiv:2403.05336v1 Announce Type: new \nAbstract: Mendelian randomization is an instrumental variable method that utilizes genetic information to investigate the causal effect of a modifiable exposure on an outcome. In most cases, the exposure changes over time. Understanding the time-varying causal effect of the exposure can yield detailed insights into mechanistic effects and the potential impact of public health interventions. Recently, a growing number of Mendelian randomization studies have attempted to explore time-varying causal effects. However, the proposed approaches oversimplify temporal information and rely on overly restrictive structural assumptions, limiting their reliability in addressing time-varying causal problems. This paper considers a novel approach to estimate time-varying effects through continuous-time modelling by combining functional principal component analysis and weak-instrument-robust techniques. Our method effectively utilizes available data without making strong structural assumptions and can be applied in general settings where the exposure measurements occur at different timepoints for different individuals. 
We demonstrate through simulations that our proposed method performs well in estimating time-varying effects and provides reliable inference results when the time-varying effect form is correctly specified. The method could theoretically be used to estimate arbitrarily complex time-varying effects. However, there is a trade-off between model complexity and instrument strength. Estimating complex time-varying effects requires instruments that are unrealistically strong. We illustrate the application of this method in a case study examining the time-varying effects of systolic blood pressure on urea levels."}, "https://arxiv.org/abs/2403.05343": {"title": "Disentangling the Timescales of a Complex System: A Bayesian Approach to Temporal Network Analysis", "link": "https://arxiv.org/abs/2403.05343", "description": "arXiv:2403.05343v1 Announce Type: new \nAbstract: Changes in the timescales at which complex systems evolve are essential to predicting critical transitions and catastrophic failures. Disentangling the timescales of the dynamics governing complex systems remains a key challenge. With this study, we introduce an integrated Bayesian framework based on temporal network models to address this challenge. We focus on two methodologies: change point detection for identifying shifts in system dynamics, and a spectrum analysis for inferring the distribution of timescales. Applied to synthetic and empirical datasets, these methodologies robustly identify critical transitions and comprehensively map the dominant and subsidiary timescales in complex systems. This dual approach offers a powerful tool for analyzing temporal networks, significantly enhancing our understanding of dynamic behaviors in complex systems."}, "https://arxiv.org/abs/2403.05359": {"title": "Applying Non-negative Matrix Factorization with Covariates to the Longitudinal Data as Growth Curve Model", "link": "https://arxiv.org/abs/2403.05359", "description": "arXiv:2403.05359v1 Announce Type: new \nAbstract: Using Non-negative Matrix Factorization (NMF), the observed matrix can be approximated by the product of the basis and coefficient matrices. Moreover, if the coefficient vectors are explained by the covariates for each individual, the coefficient matrix can be written as the product of the parameter matrix and the covariate matrix, and additionally described in the framework of Non-negative Matrix tri-Factorization (tri-NMF) with covariates. Consequently, this is equal to the mean structure of the Growth Curve Model (GCM). The difference is that the basis matrix for GCM is given by the analyst, whereas that for NMF with covariates is unknown and optimized. In this study, we applied NMF with covariates to longitudinal data and compared it with GCM. We have also published an R package that implements this method, and we show how to use it through examples of data analyses including longitudinal measurement, spatiotemporal data and text data. In particular, we demonstrate the usefulness of Gaussian kernel functions as covariates."}, "https://arxiv.org/abs/2403.05373": {"title": "Regularized Principal Spline Functions to Mitigate Spatial Confounding", "link": "https://arxiv.org/abs/2403.05373", "description": "arXiv:2403.05373v1 Announce Type: new \nAbstract: This paper proposes a new approach to address the problem of unmeasured confounding in spatial designs.
Spatial confounding occurs when some confounding variables are unobserved and not included in the model, leading to distorted inferential results about the effect of an exposure on an outcome. We show the relationship existing between the confounding bias of a non-spatial model and that of a semi-parametric model that includes a basis matrix to represent the unmeasured confounder conditional on the exposure. This relationship holds for any basis expansion, however it is shown that using the semi-parametric approach guarantees a reduction in the confounding bias only under certain circumstances, which are related to the spatial structures of the exposure and the unmeasured confounder, the type of basis expansion utilized, and the regularization mechanism. To adjust for spatial confounding, and therefore try to recover the effect of interest, we propose a Bayesian semi-parametric regression model, where an expansion matrix of principal spline basis functions is used to approximate the unobserved factor, and spike-and-slab priors are imposed on the respective expansion coefficients in order to select the most important bases. From the results of an extensive simulation study, we conclude that our proposal is able to reduce the confounding bias with respect to the non-spatial model, and it also seems more robust to bias amplification than competing approaches."}, "https://arxiv.org/abs/2403.05503": {"title": "Linear Model Estimators and Consistency under an Infill Asymptotic Domain", "link": "https://arxiv.org/abs/2403.05503", "description": "arXiv:2403.05503v1 Announce Type: new \nAbstract: Functional data present as functions or curves possessing a spatial or temporal component. These components by nature have a fixed observational domain. Consequently, any asymptotic investigation requires modelling the increased correlation among observations as density increases due to this fixed domain constraint. One such appropriate stochastic process is the Ornstein-Uhlenbeck process. Utilizing this spatial autoregressive process, we demonstrate that parameter estimators for a simple linear regression model display inconsistency in an infill asymptotic domain. Such results are contrary to those expected under the customary increasing domain asymptotics. Although none of these estimator variances approach zero, they do display a pattern of diminishing return regarding decreasing estimator variance as sample size increases. This may prove invaluable to a practitioner as this indicates perhaps an optimal sample size to cease data collection. This in turn reduces time and data collection cost because little information is gained in sampling beyond a certain sample size."}, "https://arxiv.org/abs/2403.04793": {"title": "A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series", "link": "https://arxiv.org/abs/2403.04793", "description": "arXiv:2403.04793v1 Announce Type: cross \nAbstract: Causal inference is a fundamental research topic for discovering the cause-effect relationships in many disciplines. However, not all algorithms are equally well-suited for a given dataset. For instance, some approaches may only be able to identify linear relationships, while others are applicable for non-linearities. Algorithms further vary in their sensitivity to noise and their ability to infer causal information from coupled vs. non-coupled time series. Therefore, different algorithms often generate different causal relationships for the same input. 
To achieve a more robust causal inference result, this publication proposes a novel data-driven two-phase multi-split causal ensemble model to combine the strengths of different causality base algorithms. In comparison to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase. To achieve this, the data are initially divided into several partitions and the base algorithms are applied to each partition. Subsequently, Gaussian mixture models are used to identify the causal relationships derived from the different partitions that are likely to be valid. In the second phase, the identified relationships from each base algorithm are then merged based on three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments using three synthetic datasets with different volumes and complexity, which are specifically designed to test causality detection methods under different circumstances while knowing the ground truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the use of the proposed method could hence lead to more robust and reliable causality results."}, "https://arxiv.org/abs/2403.04919": {"title": "Identifying Causal Effects Under Functional Dependencies", "link": "https://arxiv.org/abs/2403.04919", "description": "arXiv:2403.04919v1 Announce Type: cross \nAbstract: We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects."}, "https://arxiv.org/abs/2403.05006": {"title": "Provable Multi-Party Reinforcement Learning with Diverse Human Feedback", "link": "https://arxiv.org/abs/2403.05006", "description": "arXiv:2403.05006v1 Announce Type: cross \nAbstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \\textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. 
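The social welfare aggregation step mentioned above can be illustrated with a toy comparison of utilitarian, Nash, and leximin rules applied to per-party reward estimates. The policy set and the numbers are invented purely to show how the three rules can disagree; this is not the paper's learning algorithm.

```python
import numpy as np

# Hypothetical estimated expected rewards: rows are candidate policies, columns are parties.
rewards = np.array([
    [0.90, 0.90, 0.05],   # policy 0: great for two parties, terrible for the third
    [0.45, 0.45, 0.45],   # policy 1: perfectly even
    [0.80, 0.50, 0.30],   # policy 2: moderately uneven
])

utilitarian = rewards.mean(axis=1)    # average welfare across parties
nash = np.prod(rewards, axis=1)       # product of utilities (Nash welfare)
# Leximin: maximize the worst-off party, breaking ties by the next worst-off, and so on.
leximin_rank = sorted(range(rewards.shape[0]),
                      key=lambda i: tuple(np.sort(rewards[i])), reverse=True)

print("utilitarian choice:", int(np.argmax(utilitarian)))
print("Nash choice:       ", int(np.argmax(nash)))
print("leximin choice:    ", leximin_rank[0])
```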
We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity."}, "https://arxiv.org/abs/2403.05425": {"title": "An Adaptive Dimension Reduction Estimation Method for High-dimensional Bayesian Optimization", "link": "https://arxiv.org/abs/2403.05425", "description": "arXiv:2403.05425v1 Announce Type: cross \nAbstract: Bayesian optimization (BO) has shown impressive results in a variety of applications within low-to-moderate dimensional Euclidean spaces. However, extending BO to high-dimensional settings remains a significant challenge. We address this challenge by proposing a two-step optimization framework. Initially, we identify the effective dimension reduction (EDR) subspace for the objective function using the minimum average variance estimation (MAVE) method. Subsequently, we construct a Gaussian process model within this EDR subspace and optimize it using the expected improvement criterion. Our algorithm offers the flexibility to operate these steps either concurrently or in sequence. In the sequential approach, we meticulously balance the exploration-exploitation trade-off by distributing the sampling budget between subspace estimation and function optimization, and the convergence rate of our algorithm in high-dimensional contexts has been established. Numerical experiments validate the efficacy of our method in challenging scenarios."}, "https://arxiv.org/abs/2403.05461": {"title": "On varimax asymptotics in network models and spectral methods for dimensionality reduction", "link": "https://arxiv.org/abs/2403.05461", "description": "arXiv:2403.05461v1 Announce Type: cross \nAbstract: Varimax factor rotations, while popular among practitioners in psychology and statistics since being introduced by H. Kaiser, have historically been viewed with skepticism and suspicion by some theoreticians and mathematical statisticians. Now, work by K. Rohe and M. Zeng provides new, fundamental insight: varimax rotations provably perform statistical estimation in certain classes of latent variable models when paired with spectral-based matrix truncations for dimensionality reduction. We build on this newfound understanding of varimax rotations by developing further connections to network analysis and spectral methods rooted in entrywise matrix perturbation analysis. Concretely, this paper establishes the asymptotic multivariate normality of vectors in varimax-transformed Euclidean point clouds that represent low-dimensional node embeddings in certain latent space random graph models. We address related concepts including network sparsity, data denoising, and the role of matrix rank in latent variable parameterizations. Collectively, these findings, at the confluence of classical and contemporary multivariate analysis, reinforce methodology and inference procedures grounded in matrix factorization-based techniques. 
Numerical examples illustrate our findings and supplement our discussion."}, "https://arxiv.org/abs/2105.12081": {"title": "Group selection and shrinkage: Structured sparsity for semiparametric additive models", "link": "https://arxiv.org/abs/2105.12081", "description": "arXiv:2105.12081v3 Announce Type: replace \nAbstract: Sparse regression and classification estimators that respect group structures have application to an assortment of statistical and machine learning problems, from multitask learning to sparse additive modeling to hierarchical selection. This work introduces structured sparse estimators that combine group subset selection with shrinkage. To accommodate sophisticated structures, our estimators allow for arbitrary overlap between groups. We develop an optimization framework for fitting the nonconvex regularization surface and present finite-sample error bounds for estimation of the regression function. As an application requiring structure, we study sparse semiparametric additive modeling, a procedure that allows the effect of each predictor to be zero, linear, or nonlinear. For this task, the new estimators improve across several metrics on synthetic data compared to alternatives. Finally, we demonstrate their efficacy in modeling supermarket foot traffic and economic recessions using many predictors. These demonstrations suggest sparse semiparametric additive models, fit using the new estimators, are an excellent compromise between fully linear and fully nonparametric alternatives. All of our algorithms are made available in the scalable implementation grpsel."}, "https://arxiv.org/abs/2107.02739": {"title": "Shapes as Product Differentiation: Neural Network Embedding in the Analysis of Markets for Fonts", "link": "https://arxiv.org/abs/2107.02739", "description": "arXiv:2107.02739v2 Announce Type: replace \nAbstract: Many differentiated products have key attributes that are unstructured and thus high-dimensional (e.g., design, text). Instead of treating unstructured attributes as unobservables in economic models, quantifying them can be important to answer interesting economic questions. To propose an analytical framework for these types of products, this paper considers one of the simplest design products-fonts-and investigates merger and product differentiation using an original dataset from the world's largest online marketplace for fonts. We quantify font shapes by constructing embeddings from a deep convolutional neural network. Each embedding maps a font's shape onto a low-dimensional vector. In the resulting product space, designers are assumed to engage in Hotelling-type spatial competition. From the image embeddings, we construct two alternative measures that capture the degree of design differentiation. We then study the causal effects of a merger on the merging firm's creative decisions using the constructed measures in a synthetic control method. We find that the merger causes the merging firm to increase the visual variety of font design. 
Notably, such effects are not captured when using traditional measures for product offerings (e.g., specifications and the number of products) constructed from structured data."}, "https://arxiv.org/abs/2108.05858": {"title": "An Optimal Transport Approach to Estimating Causal Effects via Nonlinear Difference-in-Differences", "link": "https://arxiv.org/abs/2108.05858", "description": "arXiv:2108.05858v2 Announce Type: replace \nAbstract: We propose a nonlinear difference-in-differences method to estimate multivariate counterfactual distributions in classical treatment and control study designs with observational data. Our approach sheds a new light on existing approaches like the changes-in-changes and the classical semiparametric difference-in-differences estimator and generalizes them to settings with multivariate heterogeneity in the outcomes. The main benefit of this extension is that it allows for arbitrary dependence and heterogeneity in the joint outcomes. We demonstrate its utility both on synthetic and real data. In particular, we revisit the classical Card \\& Krueger dataset, examining the effect of a minimum wage increase on employment in fast food restaurants; a reanalysis with our method reveals that restaurants tend to substitute full-time with part-time labor after a minimum wage increase at a faster pace. A previous version of this work was entitled \"An optimal transport approach to causal inference."}, "https://arxiv.org/abs/2112.13738": {"title": "Evaluation of binary classifiers for asymptotically dependent and independent extremes", "link": "https://arxiv.org/abs/2112.13738", "description": "arXiv:2112.13738v3 Announce Type: replace \nAbstract: Machine learning classification methods usually assume that all possible classes are sufficiently present within the training set. Due to their inherent rarities, extreme events are always under-represented and classifiers tailored for predicting extremes need to be carefully designed to handle this under-representation. In this paper, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This is also related to the topic of scoring rules used in forecasting literature. In this context, we propose and study a risk function adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. A simulation study compares different classifiers and indicates their performance with respect to our risk function. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and test their capacity at forecasting river discharges from other river stations."}, "https://arxiv.org/abs/2211.01492": {"title": "Fast, effective, and coherent time series modeling using the sparsity-ranked lasso", "link": "https://arxiv.org/abs/2211.01492", "description": "arXiv:2211.01492v2 Announce Type: replace \nAbstract: The sparsity-ranked lasso (SRL) has been developed for model selection and estimation in the presence of interactions and polynomials. The main tenet of the SRL is that an algorithm should be more skeptical of higher-order polynomials and interactions *a priori* compared to main effects, and hence the inclusion of these more complex terms should require a higher level of evidence. 
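One simple way to realize the ranked skepticism described above is a weighted lasso in which higher-order terms receive larger penalties, implemented here by rescaling columns before a standard lasso fit. The weight rule (square root of the term order) and the tuning value are assumptions made for illustration; this is a sketch of the idea, not the SRL/fastTS implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 4))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)   # only main effects are truly active

# Expand to main effects, squares, and pairwise interactions.
poly = PolynomialFeatures(degree=2, include_bias=False)
Z = poly.fit_transform(X)
order = poly.powers_.sum(axis=1)      # 1 for main effects, 2 for squares/interactions

# Ranked penalties: higher-order terms face a larger effective penalty (weight rule is an assumption).
weights = np.sqrt(order)
Z_scaled = Z / weights                # a weighted lasso, implemented via column rescaling

fit = Lasso(alpha=0.05).fit(Z_scaled, y)
coef = fit.coef_ / weights            # map coefficients back to the original feature scale
for name, b, d in zip(poly.get_feature_names_out(), coef, order):
    if abs(b) > 1e-8:
        print(f"{name} (order {d}): {b:+.3f}")
```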
In time series, the same idea of ranked prior skepticism can be applied to the possibly seasonal autoregressive (AR) structure of the series during the model fitting process, becoming especially useful in settings with uncertain or multiple modes of seasonality. The SRL can naturally incorporate exogenous variables, with streamlined options for inference and/or feature selection. The fitting process is quick even for large series with a high-dimensional feature set. In this work, we discuss both the formulation of this procedure and the software we have developed for its implementation via the **fastTS** R package. We explore the performance of our SRL-based approach in a novel application involving the autoregressive modeling of hourly emergency room arrivals at the University of Iowa Hospitals and Clinics. We find that the SRL is considerably faster than its competitors, while producing more accurate predictions."}, "https://arxiv.org/abs/2306.16406": {"title": "Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift", "link": "https://arxiv.org/abs/2306.16406", "description": "arXiv:2306.16406v3 Announce Type: replace \nAbstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \\emph{dataset shift} conditions are known as \\emph{domain adaptation} or \\emph{transfer learning}. Despite extensive literature on dataset shift, limited works address how to efficiently use the auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population.\n In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions."}, "https://arxiv.org/abs/2312.00130": {"title": "Sparse Data-Driven Random Projection in Regression for High-Dimensional Data", "link": "https://arxiv.org/abs/2312.00130", "description": "arXiv:2312.00130v2 Announce Type: replace \nAbstract: We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors where the vector of coefficients can vary from sparse to dense. In this setting, we propose a combination of probabilistic variable screening with random projection tools as a viable approach. More specifically, we introduce a new data-driven random projection tailored to the problem at hand and derive a theoretical bound on the gain in expected prediction error over conventional random projections. The variables to enter the projection are screened by accounting for predictor correlation. To reduce the dependence on fine-tuning choices, we aggregate over an ensemble of linear models. 
A thresholding parameter is introduced to obtain a higher degree of sparsity. Both this parameter and the number of models in the ensemble can be chosen by cross-validation. In extensive simulations, we compare the proposed method with other random projection tools and with classical sparse and dense methods and show that it is competitive in terms of prediction across a variety of scenarios with different sparsity and predictor covariance settings. We also show that the method with cross-validation is able to rank the variables satisfactorily. Finally, we showcase the method on two real data applications."}, "https://arxiv.org/abs/2104.14737": {"title": "Automatic Debiased Machine Learning via Riesz Regression", "link": "https://arxiv.org/abs/2104.14737", "description": "arXiv:2104.14737v2 Announce Type: replace-cross \nAbstract: A variety of interesting parameters may depend on high dimensional regressions. Machine learning can be used to estimate such parameters. However, estimators based on machine learners can be severely biased by regularization and/or model selection. Debiased machine learning uses Neyman orthogonal estimating equations to reduce such biases. Debiased machine learning generally requires estimation of unknown Riesz representers. A primary innovation of this paper is to provide Riesz regression estimators of Riesz representers that depend on the parameter of interest, rather than explicit formulae, and that can employ any machine learner, including neural nets and random forests. End-to-end algorithms emerge where the researcher chooses the parameter of interest and the machine learner and the debiasing follows automatically. Another innovation here is debiased machine learners of parameters depending on generalized regressions, including high-dimensional generalized linear models. An empirical example of automatic debiased machine learning using neural nets is given. We find in Monte Carlo examples that automatic debiasing sometimes performs better than debiasing via inverse propensity scores and never worse. Finite sample mean square error bounds for Riesz regression estimators and asymptotic theory are also given."}, "https://arxiv.org/abs/2312.13454": {"title": "MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records", "link": "https://arxiv.org/abs/2312.13454", "description": "arXiv:2312.13454v2 Announce Type: replace-cross \nAbstract: Existing survival models either do not scale to high dimensional and multi-modal data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458 subjects with multi-modal EHR records.
Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG."}, "https://arxiv.org/abs/2401.12924": {"title": "Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection", "link": "https://arxiv.org/abs/2401.12924", "description": "arXiv:2401.12924v2 Announce Type: replace-cross \nAbstract: This article delves into the analysis of performance and utilization of Support Vector Machines (SVMs) for the critical task of forest fire detection using image datasets. With the increasing threat of forest fires to ecosystems and human settlements, the need for rapid and accurate detection systems is of utmost importance. SVMs, renowned for their strong classification capabilities, exhibit proficiency in recognizing patterns associated with fire within images. By training on labeled data, SVMs acquire the ability to identify distinctive attributes associated with fire, such as flames, smoke, or alterations in the visual characteristics of the forest area. The document thoroughly examines the use of SVMs, covering crucial elements like data preprocessing, feature extraction, and model training. It rigorously evaluates parameters such as accuracy, efficiency, and practical applicability. The knowledge gained from this study aids in the development of efficient forest fire detection systems, enabling prompt responses and improving disaster management. Moreover, the correlation between SVM accuracy and the difficulties presented by high-dimensional datasets is carefully investigated, demonstrated through a revealing case study. The relationship between accuracy scores and the different resolutions used for resizing the training datasets has also been discussed in this article. These comprehensive studies result in a definitive overview of the difficulties faced and the potential sectors requiring further improvement and focus."}, "https://arxiv.org/abs/2403.05644": {"title": "TSSS: A Novel Triangulated Spherical Spline Smoothing for Surface-based Data", "link": "https://arxiv.org/abs/2403.05644", "description": "arXiv:2403.05644v1 Announce Type: new \nAbstract: Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery. The proposed method offers several advantages over existing methods, including superior handling of \"leakage\" or \"boundary effects\" over complex domains, enhanced computational efficiency, and potential applications in analyzing sparse and irregularly distributed data on complex objects.
We provide rigorous theoretical guarantees for the proposed method, including convergence rates of the estimator in both the $L_2$ and supremum norms, as well as the asymptotic normality of the estimator. We also demonstrate that the convergence rates achieved by our estimation method are optimal within the framework of nonparametric estimation. Furthermore, we introduce a bootstrap method to quantify the uncertainty associated with the proposed estimators accurately. The superior performance of the proposed method is demonstrated through simulation experiments and data applications on cortical surface functional magnetic resonance imaging data and oceanic near-surface atmospheric data."}, "https://arxiv.org/abs/2403.05647": {"title": "Minor Issues Escalated to Critical Levels in Large Samples: A Permutation-Based Fix", "link": "https://arxiv.org/abs/2403.05647", "description": "arXiv:2403.05647v1 Announce Type: new \nAbstract: In the big data era, the need to reevaluate traditional statistical methods is paramount due to the challenges posed by vast datasets. While larger samples theoretically enhance accuracy and hypothesis testing power without increasing false positives, practical concerns about inflated Type-I errors persist. The prevalent belief is that larger samples can uncover subtle effects, necessitating dual consideration of p-value and effect size. Yet, the reliability of p-values from large samples remains debated.\n This paper warns that larger samples can exacerbate minor issues into significant errors, leading to false conclusions. Through our simulation study, we demonstrate how growing sample sizes amplify issues arising from two commonly encountered violations of model assumptions in real-world data and lead to incorrect decisions. This underscores the need for vigilant analytical approaches in the era of big data. In response, we introduce a permutation-based test to counterbalance the effects of sample size and assumption discrepancies by neutralizing them between actual and permuted data. We demonstrate that this approach effectively stabilizes nominal Type I error rates across various sample sizes, thereby ensuring robust statistical inferences even amidst breached conventional assumptions in big data.\n For reproducibility, our R codes are publicly available at: \\url{https://github.com/ubcxzhang/bigDataIssue}."}, "https://arxiv.org/abs/2403.05655": {"title": "PROTEST: Nonparametric Testing of Hypotheses Enhanced by Experts' Utility Judgements", "link": "https://arxiv.org/abs/2403.05655", "description": "arXiv:2403.05655v1 Announce Type: new \nAbstract: Instead of testing solely a precise hypothesis, it is often useful to enlarge it with alternatives that are deemed to differ from it negligibly. For instance, in a bioequivalence study one might consider the hypothesis that the concentration of an ingredient is exactly the same in two drugs. In such a context, it might be more relevant to test the enlarged hypothesis that the difference in concentration between the drugs is of no practical significance. While this concept is not alien to Bayesian statistics, applications remain confined to parametric settings and strategies on how to effectively harness experts' intuitions are often scarce or nonexistent. To resolve both issues, we introduce PROTEST, an accessible nonparametric testing framework that seamlessly integrates with Markov Chain Monte Carlo (MCMC) methods. We develop expanded versions of the model adherence, goodness-of-fit, quantile and two-sample tests. 
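The permutation-based large-sample fix summarized above neutralizes assumption violations by comparing actual and permuted data. The sketch below shows only the basic permutation mechanic for a two-sample difference in means; it is not the paper's calibrated procedure, whose R code is linked in the abstract.

```python
# Hedged sketch: a generic two-sample permutation test of a difference in means.
# The labels are repeatedly shuffled and the observed statistic is compared with
# the permutation distribution; this is the mechanic, not the paper's full fix.
import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                                   # permute group labels
        stat = pooled[: len(x)].mean() - pooled[len(x):].mean()
        count += abs(stat) >= abs(observed)
    return (count + 1) / (n_perm + 1)                         # two-sided permutation p-value
```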
To demonstrate how PROTEST operates, we make use of examples, simulated studies - such as testing link functions in a binary regression setting, as well as a comparison between the performance of PROTEST and the PTtest (Holmes et al., 2015) - and an application with data on neuron spikes. Furthermore, we address the crucial issue of selecting the threshold - which controls how much a hypothesis is to be expanded - even when intuitions are limited or challenging to quantify."}, "https://arxiv.org/abs/2403.05679": {"title": "Debiased Projected Two-Sample Comparisons for Single-Cell Expression Data", "link": "https://arxiv.org/abs/2403.05679", "description": "arXiv:2403.05679v1 Announce Type: new \nAbstract: We study several variants of the high-dimensional mean inference problem motivated by modern single-cell genomics data. By taking advantage of low-dimensional and localized signal structures commonly seen in such data, our proposed methods not only have the usual frequentist validity but also provide useful information on the potential locations of the signal if the null hypothesis is rejected. Our method adaptively projects the high-dimensional vector onto a low-dimensional space, followed by a debiasing step using the semiparametric double-machine learning framework. Our analysis shows that debiasing is unnecessary under the global null, but necessary under a ``projected null'' that is of scientific interest. We also propose an ``anchored projection'' to maximize the power while avoiding the degeneracy issue under the null. Experiments on synthetic data and a real single-cell sequencing dataset demonstrate the effectiveness and interpretability of our methods."}, "https://arxiv.org/abs/2403.05704": {"title": "Non-robustness of diffusion estimates on networks with measurement error", "link": "https://arxiv.org/abs/2403.05704", "description": "arXiv:2403.05704v1 Announce Type: new \nAbstract: Network diffusion models are used to study things like disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models. We show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the location of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks, and real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China."}, "https://arxiv.org/abs/2403.05756": {"title": "Model-Free Local Recalibration of Neural Networks", "link": "https://arxiv.org/abs/2403.05756", "description": "arXiv:2403.05756v1 Announce Type: new \nAbstract: Artificial neural networks (ANNs) are highly flexible predictive models.
However, reliably quantifying uncertainty for their predictions is a continuing challenge. There has been much recent work on \"recalibration\" of predictive distributions for ANNs, so that forecast probabilities for events of interest are consistent with certain frequency evaluations of them. Uncalibrated probabilistic forecasts are of limited use for many important decision-making tasks. To address this issue, we propose a localized recalibration of ANN predictive distributions using the dimension-reduced representation of the input provided by the ANN hidden layers. Our novel method draws inspiration from recalibration techniques used in the literature on approximate Bayesian computation and likelihood-free inference methods. Most existing calibration methods for ANNs can be thought of as calibrating either on the input layer, which is difficult when the input is high-dimensional, or the output layer, which may not be sufficiently flexible. Through a simulation study, we demonstrate that our method has good performance compared to alternative approaches, and explore the benefits that can be achieved by localizing the calibration based on different layers of the network. Finally, we apply our proposed method to a diamond price prediction problem, demonstrating the potential of our approach to improve prediction and uncertainty quantification in real-world applications."}, "https://arxiv.org/abs/2403.05792": {"title": "Distributed Conditional Feature Screening via Pearson Partial Correlation with FDR Control", "link": "https://arxiv.org/abs/2403.05792", "description": "arXiv:2403.05792v1 Announce Type: new \nAbstract: This paper studies the distributed conditional feature screening for massive data with ultrahigh-dimensional features. Specifically, three distributed partial correlation feature screening methods (SAPS, ACPS and JDPS methods) are firstly proposed based on Pearson partial correlation. The corresponding consistency of distributed estimation and the sure screening property of feature screening methods are established. Secondly, because using a hard threshold in feature screening will lead to a high false discovery rate (FDR), this paper develops a two-step distributed feature screening method based on knockoff technique to control the FDR. It is shown that the proposed method can control the FDR in the finite sample, and also enjoys the sure screening property under some conditions. Different from the existing screening methods, this paper not only considers the influence of a conditional variable on both the response variable and feature variables in variable screening, but also studies the FDR control issue. Finally, the effectiveness of the proposed methods is confirmed by numerical simulations and a real data analysis."}, "https://arxiv.org/abs/2403.05803": {"title": "Semiparametric Inference for Regression-Discontinuity Designs", "link": "https://arxiv.org/abs/2403.05803", "description": "arXiv:2403.05803v1 Announce Type: new \nAbstract: Treatment effects in regression discontinuity designs (RDDs) are often estimated using local regression methods. However, global approximation methods are generally deemed inefficient. In this paper, we propose a semiparametric framework tailored for estimating treatment effects in RDDs. Our global approach conceptualizes the identification of treatment effects within RDDs as a partially linear modeling problem, with the linear component capturing the treatment effect. 
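The distributed conditional feature screening entry above ranks features by their Pearson partial correlation with the response given a conditioning variable. The sketch below is a single-machine version of that ranking step for numpy arrays X, y, and Z; the distributed aggregation (SAPS, ACPS, JDPS) and the knockoff-based FDR control from the paper are not shown.

```python
# Hedged sketch: conditional screening by absolute Pearson partial correlation.
# Z (the conditioning variable) is regressed out of y and of every feature,
# then the residuals are correlated and the top-d features are kept.
import numpy as np

def partial_corr_screen(X, y, Z, d):
    Z1 = np.column_stack([np.ones(len(y)), Z])            # design with intercept
    H = Z1 @ np.linalg.pinv(Z1)                            # hat (projection) matrix onto span(Z)
    ry = y - H @ y                                         # residual of y given Z
    RX = X - H @ X                                         # residuals of each feature given Z
    num = ry @ RX
    denom = np.linalg.norm(ry) * np.linalg.norm(RX, axis=0)
    pcorr = np.abs(num / np.maximum(denom, 1e-12))         # |partial correlation| per feature
    return np.argsort(pcorr)[::-1][:d]                     # indices of the d retained features
```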
Furthermore, we utilize the P-spline method to approximate the nonparametric function and develop procedures for inferring treatment effects within this framework. We demonstrate through Monte Carlo simulations that the proposed method performs well across various scenarios. Furthermore, we illustrate using real-world datasets that our global approach may result in more reliable inference."}, "https://arxiv.org/abs/2403.05850": {"title": "Estimating Causal Effects of Discrete and Continuous Treatments with Binary Instruments", "link": "https://arxiv.org/abs/2403.05850", "description": "arXiv:2403.05850v1 Announce Type: new \nAbstract: We propose an instrumental variable framework for identifying and estimating average and quantile effects of discrete and continuous treatments with binary instruments. The basis of our approach is a local copula representation of the joint distribution of the potential outcomes and unobservables determining treatment assignment. This representation allows us to introduce an identifying assumption, so-called copula invariance, that restricts the local dependence of the copula with respect to the treatment propensity. We show that copula invariance identifies treatment effects for the entire population and other subpopulations such as the treated. The identification results are constructive and lead to straightforward semiparametric estimation procedures based on distribution regression. An application to the effect of sleep on well-being uncovers interesting patterns of heterogeneity."}, "https://arxiv.org/abs/2403.05899": {"title": "Online Identification of Stochastic Continuous-Time Wiener Models Using Sampled Data", "link": "https://arxiv.org/abs/2403.05899", "description": "arXiv:2403.05899v1 Announce Type: new \nAbstract: It is well known that ignoring the presence of stochastic disturbances in the identification of stochastic Wiener models leads to asymptotically biased estimators. On the other hand, optimal statistical identification, via likelihood-based methods, is sensitive to the assumptions on the data distribution and is usually based on relatively complex sequential Monte Carlo algorithms. We develop a simple recursive online estimation algorithm based on an output-error predictor, for the identification of continuous-time stochastic parametric Wiener models through stochastic approximation. The method is applicable to generic model parameterizations and, as demonstrated in the numerical simulation examples, it is robust with respect to the assumptions on the spectrum of the disturbance process."}, "https://arxiv.org/abs/2403.05969": {"title": "Sample Size Selection under an Infill Asymptotic Domain", "link": "https://arxiv.org/abs/2403.05969", "description": "arXiv:2403.05969v1 Announce Type: new \nAbstract: Experimental studies often fail to appropriately account for the number of collected samples within a fixed time interval for functional responses. Data of this nature appropriately falls under an Infill Asymptotic domain that is constrained by time and not considered infinite. Therefore, the sample size should account for this infill asymptotic domain. This paper provides general guidance on selecting an appropriate size for an experimental study for various simple linear regression models and tuning parameter values of the covariance structure used under an asymptotic domain, an Ornstein-Uhlenbeck process. 
The appropriate sample size is determined based on the percent of total variation that is captured at any given sample size for each parameter. Additionally, guidance on the selection of the tuning parameter is given by linking this value to the signal-to-noise ratio utilized for power calculations under design of experiments."}, "https://arxiv.org/abs/2403.05999": {"title": "Locally Regular and Efficient Tests in Non-Regular Semiparametric Models", "link": "https://arxiv.org/abs/2403.05999", "description": "arXiv:2403.05999v1 Announce Type: new \nAbstract: This paper considers hypothesis testing in semiparametric models which may be non-regular. I show that C($\\alpha$) style tests are locally regular under mild conditions, including in cases where locally regular estimators do not exist, such as models which are (semi-parametrically) weakly identified. I characterise the appropriate limit experiment in which to study local (asymptotic) optimality of tests in the non-regular case, permitting the generalisation of classical power bounds to this case. I give conditions under which these power bounds are attained by the proposed C($\\alpha$) style tests. The application of the theory to a single index model and an instrumental variables model is worked out in detail."}, "https://arxiv.org/abs/2403.06238": {"title": "Quantifying the Uncertainty of Imputed Demographic Disparity Estimates: The Dual-Bootstrap", "link": "https://arxiv.org/abs/2403.06238", "description": "arXiv:2403.06238v1 Announce Type: new \nAbstract: Measuring average differences in an outcome across racial or ethnic groups is a crucial first step for equity assessments, but researchers often lack access to data on individuals' races and ethnicities to calculate them. A common solution is to impute the missing race or ethnicity labels using proxies, then use those imputations to estimate the disparity. Conventional standard errors mischaracterize the resulting estimate's uncertainty because they treat the imputation model as given and fixed, instead of as an unknown object that must be estimated with uncertainty. We propose a dual-bootstrap approach that explicitly accounts for measurement uncertainty and thus enables more accurate statistical inference, which we demonstrate via simulation. In addition, we adapt our approach to the commonly used Bayesian Improved Surname Geocoding (BISG) imputation algorithm, where direct bootstrapping is infeasible because the underlying Census Bureau data are unavailable. In simulations, we find that measurement uncertainty is generally insignificant for BISG except in particular circumstances; bias, not variance, is likely the predominant source of error. We apply our method to quantify the uncertainty of prevalence estimates of common health conditions by race using data from the American Family Cohort."}, "https://arxiv.org/abs/2403.06246": {"title": "Estimating Factor-Based Spot Volatility Matrices with Noisy and Asynchronous High-Frequency Data", "link": "https://arxiv.org/abs/2403.06246", "description": "arXiv:2403.06246v1 Announce Type: new \nAbstract: We propose a new estimator of high-dimensional spot volatility matrices satisfying a low-rank plus sparse structure from noisy and asynchronous high-frequency data collected for an ultra-large number of assets. The noise processes are allowed to be temporally correlated, heteroskedastic, asymptotically vanishing and dependent on the efficient prices.
We define a kernel-weighted pre-averaging method to jointly tackle the microstructure noise and asynchronicity issues, and we obtain uniformly consistent estimates for latent prices. We impose a continuous-time factor model with time-varying factor loadings on the price processes, and estimate the common factors and loadings via a local principal component analysis. Assuming a uniform sparsity condition on the idiosyncratic volatility structure, we combine the POET and kernel-smoothing techniques to estimate the spot volatility matrices for both the latent prices and idiosyncratic errors. Under some mild restrictions, the estimated spot volatility matrices are shown to be uniformly consistent under various matrix norms. We provide Monte-Carlo simulation and empirical studies to examine the numerical performance of the developed estimation methodology."}, "https://arxiv.org/abs/2403.06657": {"title": "Data-Driven Tuning Parameter Selection for High-Dimensional Vector Autoregressions", "link": "https://arxiv.org/abs/2403.06657", "description": "arXiv:2403.06657v1 Announce Type: new \nAbstract: Lasso-type estimators are routinely used to estimate high-dimensional time series models. The theoretical guarantees established for Lasso typically require the penalty level to be chosen in a suitable fashion often depending on unknown population quantities. Furthermore, the resulting estimates and the number of variables retained in the model depend crucially on the chosen penalty level. However, there is currently no theoretically founded guidance for this choice in the context of high-dimensional time series. Instead one resorts to selecting the penalty level in an ad hoc manner using, e.g., information criteria or cross-validation. We resolve this problem by considering estimation of the perhaps most commonly employed multivariate time series model, the linear vector autoregressive (VAR) model, and propose a weighted Lasso estimator with penalization chosen in a fully data-driven way. The theoretical guarantees that we establish for the resulting estimation and prediction error match those currently available for methods based on infeasible choices of penalization. We thus provide a first solution for choosing the penalization in high-dimensional time series models."}, "https://arxiv.org/abs/2403.06783": {"title": "A doubly robust estimator for the Mann Whitney Wilcoxon Rank Sum Test when applied for causal inference in observational studies", "link": "https://arxiv.org/abs/2403.06783", "description": "arXiv:2403.06783v1 Announce Type: new \nAbstract: The Mann-Whitney-Wilcoxon rank sum test (MWWRST) is a widely used method for comparing two treatment groups in randomized control trials, particularly when dealing with highly skewed data. However, when applied to observational study data, the MWWRST often yields invalid results for causal inference. To address this limitation, Wu et al. (2014) introduced an approach that incorporates inverse probability weighting (IPW) into this rank-based statistics to mitigate confounding effects. Subsequently, Mao (2018), Zhang et al. (2019), and Ai et al. (2020) extended this IPW estimator to develop doubly robust estimators.\n Nevertheless, each of these approaches has notable limitations. Mao's method imposes stringent assumptions that may not align with real-world study data. Zhang et al.'s (2019) estimators rely on bootstrap inference, which suffers from computational inefficiency and lacks known asymptotic properties. Meanwhile, Ai et al. 
(2020) primarily focus on testing the null hypothesis of equal distributions between two groups, which is a more stringent assumption that may not be well-suited to the primary practical application of MWWRST.\n In this paper, we aim to address these limitations by leveraging functional response models (FRM) to develop doubly robust estimators. We demonstrate the performance of our proposed approach using both simulated and real study data."}, "https://arxiv.org/abs/2403.06879": {"title": "Partially identified heteroskedastic SVARs", "link": "https://arxiv.org/abs/2403.06879", "description": "arXiv:2403.06879v1 Announce Type: new \nAbstract: This paper studies the identification of Structural Vector Autoregressions (SVARs) exploiting a break in the variances of the structural shocks. Point-identification for this class of models relies on an eigen-decomposition involving the covariance matrices of reduced-form errors and requires that all the eigenvalues are distinct. This point-identification, however, fails in the presence of multiplicity of eigenvalues. This occurs in an empirically relevant scenario where, for instance, only a subset of structural shocks had the break in their variances, or where a group of variables shows a variance shift of the same amount. Together with zero or sign restrictions on the structural parameters and impulse responses, we derive the identified sets for impulse responses and show how to compute them. We perform inference on the impulse response functions, building on the robust Bayesian approach developed for set identified SVARs. To illustrate our proposal, we present an empirical example based on the literature on the global crude oil market where the identification is expected to fail due to multiplicity of eigenvalues."}, "https://arxiv.org/abs/2403.05759": {"title": "Membership Testing in Markov Equivalence Classes via Independence Query Oracles", "link": "https://arxiv.org/abs/2403.05759", "description": "arXiv:2403.05759v1 Announce Type: cross \nAbstract: Understanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to learning causal graphs from data, its complementary concept of testing causal relationships has remained largely unexplored. While learning involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the testing counterpart addresses the following critical question: Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC?\n We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\\exp(\\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\\exp(O(s))$ tests, matching our lower bound. Compared to the learning problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that testing is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. 
Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning."}, "https://arxiv.org/abs/2403.06343": {"title": "Seasonal and Periodic Patterns in US COVID-19 Mortality using the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2403.06343", "description": "arXiv:2403.06343v1 Announce Type: cross \nAbstract: Since the emergence of the SARS-CoV-2 virus, research into the existence, extent, and pattern of seasonality has been of the highest importance for public health preparation. This study uses a novel bandpass bootstrap approach called the Variable Bandpass Periodic Block Bootstrap (VBPBB) to investigate the periodically correlated (PC) components including seasonality within US COVID-19 mortality. Bootstrapping to produce confidence intervals (CI) for periodic characteristics such as the seasonal mean requires preservation of the PC component's correlation structure during resampling. While existing bootstrap methods can preserve the PC component correlation structure, filtration of that PC component's frequency from interference is critical to bootstrap the PC component's characteristics accurately and efficiently. The VBPBB filters the PC time series to reduce interference from other components such as noise. This greatly reduces bootstrapped CI size and outperforms other methods in statistical power and accuracy when estimating the periodic mean sampling distribution. VBPBB analysis of US COVID-19 mortality PC components is provided and compared against alternative bootstrapping methods. These results reveal crucial evidence supporting the presence of a seasonal PC pattern and existence of additional PC components, their timing, and CIs for their effect which will aid prediction and preparation for future COVID-19 responses."}, "https://arxiv.org/abs/2403.06371": {"title": "Ensemble Kalman filter in geoscience meets model predictive control", "link": "https://arxiv.org/abs/2403.06371", "description": "arXiv:2403.06371v1 Announce Type: cross \nAbstract: Although data assimilation originates from control theory, the relationship between modern data assimilation methods in geoscience and model predictive control has not been extensively explored. In the present paper, I argue that modern data assimilation methods in geoscience and model predictive control essentially minimize similar quadratic cost functions. Inspired by this similarity, I propose a new ensemble Kalman filter (EnKF)-based method for controlling spatio-temporally chaotic systems, which can readily be applied to high-dimensional and nonlinear Earth systems. In this method, the reference vector, which serves as the control target, is assimilated into the state space as a pseudo-observation by an ensemble Kalman smoother to obtain the appropriate perturbation to be added to the system. A proof-of-concept experiment using the Lorenz 63 model is presented.
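The VBPBB entry above bootstraps confidence intervals for the seasonal mean while preserving periodic correlation. As a simplified illustration, the sketch below resamples whole periods of a known length p with replacement and returns percentile intervals for the seasonal mean; the bandpass filtering step that defines VBPBB is omitted.

```python
# Hedged sketch: a plain periodic block bootstrap CI for seasonal means,
# resampling whole periods of length p. The frequency filtering that
# distinguishes VBPBB is not included here.
import numpy as np

def seasonal_mean_ci(x, p, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)[: (len(x) // p) * p]          # drop an incomplete final period
    blocks = x.reshape(-1, p)                               # one row per full period
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, len(blocks), len(blocks))     # resample periods with replacement
        boot[b] = blocks[idx].mean(axis=0)                  # seasonal mean of the resample
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return blocks.mean(axis=0), lo, hi                      # point estimate and percentile bands
```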
The system is constrained in one wing of the butterfly attractor without tipping to the other side by reasonably small control perturbations that are comparable with those in previous works."}, "https://arxiv.org/abs/1904.00136": {"title": "Estimating spillovers using imprecisely measured networks", "link": "https://arxiv.org/abs/1904.00136", "description": "arXiv:1904.00136v4 Announce Type: replace \nAbstract: In many experimental contexts, whether and how network interactions impact the outcome of interest for both treated and untreated individuals are key concerns. Network data are often assumed to perfectly represent these possible interactions. This paper considers the problem of estimating treatment effects when measured connections are, instead, a noisy representation of the true spillover pathways. We show that existing methods, using the potential outcomes framework, yield biased estimators in the presence of this mismeasurement. We develop a new method, using a class of mixture models, that can account for missing connections and discuss its estimation via the Expectation-Maximization algorithm. We check our method's performance by simulating experiments on real network data from 43 villages in India. Finally, we use data from a previously published study to show that estimates using our method are more robust to the choice of network measure."}, "https://arxiv.org/abs/2108.05555": {"title": "Longitudinal Network Models and Permutation-Uniform Markov Chains", "link": "https://arxiv.org/abs/2108.05555", "description": "arXiv:2108.05555v2 Announce Type: replace \nAbstract: Consider longitudinal networks whose edges turn on and off according to a discrete-time Markov chain with exponential-family transition probabilities. We characterize when their joint distributions are also exponential families with the same parameter, improving data reduction. Further, we show that the permutation-uniform subclass of these chains permits interpretation as an independent, identically distributed sequence on the same state space. We then apply these ideas to temporal exponential random graph models, for which permutation uniformity is well suited, and discuss mean-parameter convergence, dyadic independence, and exchangeability. Our framework facilitates the introduction of a new network model; simplifies analysis of some network and autoregressive models from the literature, including by permitting closed-form expressions for maximum likelihood estimates for some models; and facilitates applying standard tools to longitudinal-network Markov chains from either asymptotics or single-observation exponential random graph models."}, "https://arxiv.org/abs/2201.13451": {"title": "Nonlinear Regression with Residuals: Causal Estimation with Time-varying Treatments and Covariates", "link": "https://arxiv.org/abs/2201.13451", "description": "arXiv:2201.13451v3 Announce Type: replace \nAbstract: Standard regression adjustment gives inconsistent estimates of causal effects when there are time-varying treatment effects and time-varying covariates. Loosely speaking, the issue is that some covariates are post-treatment variables because they may be affected by prior treatment status, and regressing out post-treatment variables causes bias. More precisely, the bias is due to certain non-confounding latent variables that create colliders in the causal graph. These latent variables, which we call phantoms, do not harm the identifiability of the causal effect, but they render naive regression estimates inconsistent.
Motivated by this, we ask: how can we modify regression methods so that they hold up even in the presence of phantoms? We develop an estimator for this setting based on regression modeling (linear, log-linear, probit and Cox regression), proving that it is consistent for a reasonable causal estimand. In particular, the estimator is a regression model fit with a simple adjustment for collinearity, making it easy to understand and implement with standard regression software. The proposed estimators are instances of the parametric g-formula, extending the regression-with-residuals approach to several canonical nonlinear models."}, "https://arxiv.org/abs/2207.07318": {"title": "Flexible global forecast combinations", "link": "https://arxiv.org/abs/2207.07318", "description": "arXiv:2207.07318v3 Announce Type: replace \nAbstract: Forecast combination -- the aggregation of individual forecasts from multiple experts or models -- is a proven approach to economic forecasting. To date, research on economic forecasting has concentrated on local combination methods, which handle separate but related forecasting tasks in isolation. Yet, it has been known for over two decades in the machine learning community that global methods, which exploit task-relatedness, can improve on local methods that ignore it. Motivated by the possibility for improvement, this paper introduces a framework for globally combining forecasts while being flexible to the level of task-relatedness. Through our framework, we develop global versions of several existing forecast combinations. To evaluate the efficacy of these new global forecast combinations, we conduct extensive comparisons using synthetic and real data. Our real data comparisons, which involve forecasts of core economic indicators in the Eurozone, provide empirical evidence that the accuracy of global combinations of economic forecasts can surpass local combinations."}, "https://arxiv.org/abs/2301.09379": {"title": "Revisiting Panel Data Discrete Choice Models with Lagged Dependent Variables", "link": "https://arxiv.org/abs/2301.09379", "description": "arXiv:2301.09379v4 Announce Type: replace \nAbstract: This paper revisits the identification and estimation of a class of semiparametric (distribution-free) panel data binary choice models with lagged dependent variables, exogenous covariates, and entity fixed effects. We provide a novel identification strategy, using an \"identification at infinity\" argument. In contrast with the celebrated Honore and Kyriazidou (2000), our method permits time trends of any form and does not suffer from the \"curse of dimensionality\". We propose an easily implementable conditional maximum score estimator. The asymptotic properties of the proposed estimator are fully characterized. A small-scale Monte Carlo study demonstrates that our approach performs satisfactorily in finite samples. We illustrate the usefulness of our method by presenting an empirical application to enrollment in private hospital insurance using the Household, Income and Labour Dynamics in Australia (HILDA) Survey data."}, "https://arxiv.org/abs/2303.10117": {"title": "Estimation of Grouped Time-Varying Network Vector Autoregression Models", "link": "https://arxiv.org/abs/2303.10117", "description": "arXiv:2303.10117v2 Announce Type: replace \nAbstract: This paper introduces a flexible time-varying network vector autoregressive model framework for large-scale time series. 
A latent group structure is imposed on the heterogeneous and node-specific time-varying momentum and network spillover effects so that the number of unknown time-varying coefficients to be estimated can be reduced considerably. A classic agglomerative clustering algorithm with nonparametrically estimated distance matrix is combined with a ratio criterion to consistently estimate the latent group number and membership. A post-grouping local linear smoothing method is proposed to estimate the group-specific time-varying momentum and network effects, substantially improving the convergence rates of the preliminary estimates which ignore the latent structure. We further modify the methodology and theory to allow for structural breaks in either the group membership, group number or group-specific coefficient functions. Numerical studies including Monte-Carlo simulation and an empirical application are presented to examine the finite-sample performance of the developed model and methodology."}, "https://arxiv.org/abs/2305.11812": {"title": "Off-policy evaluation beyond overlap: partial identification through smoothness", "link": "https://arxiv.org/abs/2305.11812", "description": "arXiv:2305.11812v2 Announce Type: replace \nAbstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions."}, "https://arxiv.org/abs/2306.09806": {"title": "Testing for Peer Effects without Specifying the Network Structure", "link": "https://arxiv.org/abs/2306.09806", "description": "arXiv:2306.09806v2 Announce Type: replace \nAbstract: This paper proposes an Anderson-Rubin (AR) test for the presence of peer effects in panel data without the need to specify the network structure. The unrestricted model of our test is a linear panel data model of social interactions with dyad-specific peer effects. The proposed AR test evaluates if the peer effect coefficients are all zero. As the number of peer effect coefficients increases with the sample size, so does the number of instrumental variables (IVs) employed to test the restrictions under the null, rendering Bekker's many-IV environment. By extending existing many-IV asymptotic results to panel data, we establish the asymptotic validity of the proposed AR test. Our Monte Carlo simulations show the robustness and superior performance of the proposed test compared to some existing tests with misspecified networks. 
We provide two applications to demonstrate its empirical relevance."}, "https://arxiv.org/abs/2308.10138": {"title": "On the Inconsistency of Cluster-Robust Inference and How Subsampling Can Fix It", "link": "https://arxiv.org/abs/2308.10138", "description": "arXiv:2308.10138v4 Announce Type: replace \nAbstract: Conventional methods of cluster-robust inference are inconsistent in the presence of unignorably large clusters. We formalize this claim by establishing a necessary and sufficient condition for the consistency of the conventional methods. We find that this condition for consistency is rejected for a majority of empirical research papers. In this light, we propose a novel score subsampling method that achieves uniform size control over a broad class of data generating processes, including those for which the conventional methods fail. Simulation studies support these claims. Using real data from an empirical paper, we show that the conventional methods conclude significance while our proposed method concludes insignificance."}, "https://arxiv.org/abs/2309.12819": {"title": "Doubly Robust Proximal Causal Learning for Continuous Treatments", "link": "https://arxiv.org/abs/2309.12819", "description": "arXiv:2309.12819v3 Announce Type: replace \nAbstract: Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Exploiting its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications."}, "https://arxiv.org/abs/2310.18905": {"title": "Incorporating nonparametric methods for estimating causal excursion effects in mobile health with zero-inflated count outcomes", "link": "https://arxiv.org/abs/2310.18905", "description": "arXiv:2310.18905v2 Announce Type: replace \nAbstract: In mobile health, tailoring interventions for real-time delivery is of paramount importance. Micro-randomized trials have emerged as the \"gold-standard\" methodology for developing such interventions. Analyzing data from these trials provides insights into the efficacy of interventions and the potential moderation by specific covariates. The \"causal excursion effect\", a novel class of causal estimand, addresses these inquiries. Yet, existing research mainly focuses on continuous or binary data, leaving count data largely unexplored. The current work is motivated by the Drink Less micro-randomized trial from the UK, which focuses on a zero-inflated proximal outcome, i.e., the number of screen views in the subsequent hour following the intervention decision point.
To be specific, we revisit the concept of causal excursion effect, specifically for zero-inflated count outcomes, and introduce novel estimation approaches that incorporate nonparametric techniques. Bidirectional asymptotics are established for the proposed estimators. Simulation studies are conducted to evaluate the performance of the proposed methods. As an illustration, we also implement these methods to the Drink Less trial data."}, "https://arxiv.org/abs/2311.17445": {"title": "Interaction tests with covariate-adaptive randomization", "link": "https://arxiv.org/abs/2311.17445", "description": "arXiv:2311.17445v2 Announce Type: replace \nAbstract: Treatment-covariate interaction tests are commonly applied by researchers to examine whether the treatment effect varies across patient subgroups defined by baseline characteristics. The objective of this study is to explore treatment-covariate interaction tests involving covariate-adaptive randomization. Without assuming a parametric data generating model, we investigate usual interaction tests and observe that they tend to be conservative: specifically, their limiting rejection probabilities under the null hypothesis do not exceed the nominal level and are typically strictly lower than it. To address this problem, we propose modifications to the usual tests to obtain corresponding valid tests. Moreover, we introduce a novel class of stratified-adjusted interaction tests that are simple, more powerful than the usual and modified tests, and broadly applicable to most covariate-adaptive randomization methods. The results are general to encompass two types of interaction tests: one involving stratification covariates and the other involving additional covariates that are not used for randomization. Our study clarifies the application of interaction tests in clinical trials and offers valuable tools for revealing treatment heterogeneity, crucial for advancing personalized medicine."}, "https://arxiv.org/abs/2112.14249": {"title": "Nested Nonparametric Instrumental Variable Regression: Long Term, Mediated, and Time Varying Treatment Effects", "link": "https://arxiv.org/abs/2112.14249", "description": "arXiv:2112.14249v3 Announce Type: replace-cross \nAbstract: Several causal parameters in short panel data models are scalar summaries of a function called a nested nonparametric instrumental variable regression (nested NPIV). Examples include long term, mediated, and time varying treatment effects identified using proxy variables. However, it appears that no prior estimators or guarantees for nested NPIV exist, preventing flexible estimation and inference for these causal parameters. A major challenge is compounding ill posedness due to the nested inverse problems. We analyze adversarial estimators of nested NPIV, and provide sufficient conditions for efficient inference on the causal parameter. Our nonasymptotic analysis has three salient features: (i) introducing techniques that limit how ill posedness compounds; (ii) accommodating neural networks, random forests, and reproducing kernel Hilbert spaces; and (iii) extending to causal functions, e.g. long term heterogeneous treatment effects. 
We measure long term heterogeneous treatment effects of Project STAR and mediated proximal treatment effects of the Job Corps."}, "https://arxiv.org/abs/2211.12585": {"title": "Scalable couplings for the random walk Metropolis algorithm", "link": "https://arxiv.org/abs/2211.12585", "description": "arXiv:2211.12585v2 Announce Type: replace-cross \nAbstract: There has been a recent surge of interest in coupling methods for Markov chain Monte Carlo algorithms: they facilitate convergence quantification and unbiased estimation, while exploiting embarrassingly parallel computing capabilities. Motivated by these, we consider the design and analysis of couplings of the random walk Metropolis algorithm which scale well with the dimension of the target measure. Methodologically, we introduce a low-rank modification of the synchronous coupling that is provably optimally contractive in standard high-dimensional asymptotic regimes. We expose a shortcoming of the reflection coupling, the status quo at time of writing, and we propose a modification which mitigates the issue. Our analysis bridges the gap to the optimal scaling literature and builds a framework of asymptotic optimality which may be of independent interest. We illustrate the applicability of our proposed couplings, and the potential for extending our ideas, with various numerical experiments."}, "https://arxiv.org/abs/2302.07677": {"title": "Bayesian Federated Inference for estimating Statistical Models based on Non-shared Multicenter Data sets", "link": "https://arxiv.org/abs/2302.07677", "description": "arXiv:2302.07677v2 Announce Type: replace-cross \nAbstract: Identifying predictive factors for an outcome of interest via a multivariable analysis is often difficult when the data set is small. Combining data from different medical centers into a single (larger) database would alleviate this problem, but is in practice challenging due to regulatory and logistic problems. Federated Learning (FL) is a machine learning approach that aims to construct from local inferences in separate data centers what would have been inferred had the data sets been merged. It seeks to harvest the statistical power of larger data sets without actually creating them. The FL strategy is not always efficient and precise. Therefore, in this paper we refine and implement an alternative Bayesian Federated Inference (BFI) framework for multicenter data with the same aim as FL. The BFI framework is designed to cope with small data sets by inferring locally not only the optimal parameter values, but also additional features of the posterior parameter distribution, capturing information beyond what is used in FL. BFI has the additional benefit that a single inference cycle across the centers is sufficient, whereas FL needs multiple cycles. We quantify the performance of the proposed methodology on simulated and real life data."}, "https://arxiv.org/abs/2305.00164": {"title": "Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles", "link": "https://arxiv.org/abs/2305.00164", "description": "arXiv:2305.00164v2 Announce Type: replace-cross \nAbstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. 
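The random walk Metropolis coupling entry above designs couplings that keep two chains close in high dimension. The sketch below implements only the plain synchronous coupling, in which both chains share proposal increments and acceptance uniforms, on a standard Gaussian target, and records the distance between the chains; the paper's low-rank modification and the reflection coupling it compares against are not reproduced.

```python
# Hedged sketch: synchronous coupling of two random walk Metropolis chains
# targeting a standard Gaussian in d dimensions. The output tracks the distance
# between the chains over the iterations; exact meeting would require combining
# this with an additional coupling device.
import numpy as np

def log_target(x):
    return -0.5 * np.sum(x ** 2)          # standard Gaussian log-density, up to a constant

def coupled_rwm(x, y, n_iter=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    dist = np.empty(n_iter)
    for t in range(n_iter):
        z = step * rng.standard_normal(x.shape)   # common proposal increment for both chains
        u = rng.uniform()                         # common acceptance uniform
        for state in (x, y):
            prop = state + z
            if np.log(u) < log_target(prop) - log_target(state):
                state[:] = prop                   # accept, updating the chain in place
        dist[t] = np.linalg.norm(x - y)
    return dist

d = 50
dist = coupled_rwm(np.full(d, 3.0), np.full(d, -3.0), step=2.38 / np.sqrt(d))
print("distance between chains at the final iteration:", dist[-1])
```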
Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given for both the estimation accuracy and expected length of confidence intervals for the minimizer and minimum.\n The nonasymptotic local minimax framework brings out new phenomena in simultaneous estimation and inference for the minimizer and minimum. We establish a novel uncertainty principle that provides a fundamental limit on how well the minimizer and minimum can be estimated simultaneously for any convex regression function. A similar result holds for the expected length of the confidence intervals for the minimizer and minimum."}, "https://arxiv.org/abs/2306.12690": {"title": "Uniform error bound for PCA matrix denoising", "link": "https://arxiv.org/abs/2306.12690", "description": "arXiv:2306.12690v2 Announce Type: replace-cross \nAbstract: Principal component analysis (PCA) is a simple and popular tool for processing high-dimensional data. We investigate its effectiveness for matrix denoising.\n We consider clean data that are generated from a low-dimensional subspace but masked by independent high-dimensional sub-Gaussian noise with standard deviation $\\sigma$. Under the low-rank assumption on the clean data with a mild spectral gap assumption, we prove that the distance between each PCA-denoised data point and its corresponding clean data point is uniformly bounded by $O(\\sigma \\log n)$. To illustrate the spectral gap assumption, we show it can be satisfied when the clean data are independently generated with a non-degenerate covariance matrix. We then provide a general lower bound for the error of the denoised data matrix, which indicates PCA denoising gives a uniform error bound that is rate-optimal. Furthermore, we examine how the error bound impacts downstream applications such as clustering and manifold learning. Numerical results validate our theoretical findings and reveal the importance of the uniform error."}, "https://arxiv.org/abs/2310.02679": {"title": "Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization", "link": "https://arxiv.org/abs/2310.02679", "description": "arXiv:2310.02679v3 Announce Type: replace-cross \nAbstract: We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment issues due to use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional \"flow function\". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals.
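The PCA matrix denoising entry above projects noisy observations onto their leading principal subspace. A minimal sketch, assuming the rank k is known and observations are the rows of a numpy array:

```python
# Hedged sketch: rank-k PCA denoising of a data matrix whose rows are noisy
# observations near a low-dimensional subspace. k is assumed known here;
# in practice it can be read off the singular-value spectrum.
import numpy as np

def pca_denoise(Y, k):
    mean = Y.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    return mean + (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation plus the mean

# Toy check: clean data on a 3-dimensional subspace of R^200, plus Gaussian noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 200))
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
denoised = pca_denoise(noisy, k=3)
print("max pointwise error after denoising:", np.abs(denoised - clean).max())
```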
Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods."}, "https://arxiv.org/abs/2310.19788": {"title": "Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget", "link": "https://arxiv.org/abs/2310.19788", "description": "arXiv:2310.19788v3 Announce Type: replace-cross \nAbstract: This study investigates the experimental design problem for identifying the arm with the highest expected outcome, referred to as best arm identification (BAI). In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates an arm and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the arms. At the end of the experiment, the decision-maker recommends one of the arms as an estimate of the best arm. To design an experiment, we first discuss lower bounds for the probability of misidentification. Our analysis highlights that the available information on the outcome distribution, such as means (expected outcomes), variances, and the choice of the best arm, significantly influences the lower bounds. Because available information is limited in actual experiments, we develop a lower bound that is valid under the unknown means and the unknown choice of the best arm, which are referred to as the worst-case lower bound. We demonstrate that the worst-case lower bound depends solely on the variances of the outcomes. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases infinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to the same values across arms. We refer to such strategies as asymptotically worst-case optimal."}, "https://arxiv.org/abs/2312.08150": {"title": "Active learning with biased non-response to label requests", "link": "https://arxiv.org/abs/2312.08150", "description": "arXiv:2312.08150v2 Announce Type: replace-cross \nAbstract: Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. 
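The active learning entry above studies label requests that can go unanswered. The sketch below sets up a generic pool-based uncertainty-sampling loop with simulated non-response; the UCB-EU correction proposed in the paper is not implemented here and would replace the plain uncertainty criterion in the query step. The synthetic dataset, initial labels, and 70% response rate are hypothetical.

```python
# Hedged sketch: pool-based uncertainty sampling with simulated label
# non-response. The query rule below is plain uncertainty sampling; a
# cost-based correction such as UCB-EU would modify this selection step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])  # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(50):                                    # 50 label requests
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]  # most uncertain pooled point
    if rng.uniform() < 0.7:                            # the request is answered 70% of the time
        labeled.append(query)
    pool.remove(query)                                 # drop the queried point either way
print("final number of labeled points:", len(labeled))
```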
We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation."}, "https://arxiv.org/abs/2401.13045": {"title": "Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?", "link": "https://arxiv.org/abs/2401.13045", "description": "arXiv:2401.13045v2 Announce Type: replace-cross \nAbstract: Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. This, in turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes. This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions."}, "https://arxiv.org/abs/2403.07124": {"title": "Stochastic gradient descent-based inference for dynamic network models with attractors", "link": "https://arxiv.org/abs/2403.07124", "description": "arXiv:2403.07124v1 Announce Type: new \nAbstract: In Coevolving Latent Space Networks with Attractors (CLSNA) models, nodes in a latent space represent social actors, and edges indicate their dynamic interactions. Attractors are added at the latent level to capture the notion of attractive and repulsive forces between nodes, borrowing from dynamical systems theory. However, CLSNA's reliance on MCMC estimation makes scaling difficult, and the requirement for nodes to be present throughout the study period limits practical applications. We address these issues by (i) introducing a stochastic gradient descent (SGD) parameter estimation method, (ii) developing a novel approach for uncertainty quantification using SGD, and (iii) extending the model to allow nodes to join and leave over time.
Simulation results show that our extensions result in little loss of accuracy compared to MCMC, but can scale to much larger networks. We apply our approach to the longitudinal social networks of members of US Congress on the social media platform X. Accounting for node dynamics overcomes selection bias in the network and uncovers uniquely and increasingly repulsive forces within the Republican Party."}, "https://arxiv.org/abs/2403.07158": {"title": "Sample Splitting and Assessing Goodness-of-fit of Time Series", "link": "https://arxiv.org/abs/2403.07158", "description": "arXiv:2403.07158v1 Announce Type: new \nAbstract: A fundamental and often final step in time series modeling is to assess the quality of fit of a proposed model to the data. Since the underlying distribution of the innovations that generate a model is often not prescribed, goodness-of-fit tests typically take the form of testing the fitted residuals for serial independence. However, these fitted residuals are inherently dependent since they are based on the same parameter estimates and thus standard tests of serial independence, such as those based on the autocorrelation function (ACF) or distance correlation function (ADCF) of the fitted residuals need to be adjusted. The sample splitting procedure in Pfister et al.~(2018) is one such fix for the case of models for independent data, but fails to work in the dependent setting. In this paper sample splitting is leveraged in the time series setting to perform tests of serial dependence of fitted residuals using the ACF and ADCF. Here the first $f_n$ of the data points are used to estimate the parameters of the model and then using these parameter estimates, the last $l_n$ of the data points are used to compute the estimated residuals. Tests for serial independence are then based on these $l_n$ residuals. As long as the overlap between the $f_n$ and $l_n$ data splits is asymptotically 1/2, the ACF and ADCF tests of serial independence tests often have the same limit distributions as though the underlying residuals are indeed iid. In particular if the first half of the data is used to estimate the parameters and the estimated residuals are computed for the entire data set based on these parameter estimates, then the ACF and ADCF can have the same limit distributions as though the residuals were iid. This procedure ameliorates the need for adjustment in the construction of confidence bounds for both the ACF and ADCF in goodness-of-fit testing."}, "https://arxiv.org/abs/2403.07236": {"title": "Partial Identification of Individual-Level Parameters Using Aggregate Data in a Nonparametric Binary Outcome Model", "link": "https://arxiv.org/abs/2403.07236", "description": "arXiv:2403.07236v1 Announce Type: new \nAbstract: It is well known that the relationship between variables at the individual level can be different from the relationship between those same variables aggregated over individuals. This problem of aggregation becomes relevant when the researcher wants to learn individual-level relationships, but only has access to data that has been aggregated. In this paper, I develop a methodology to partially identify linear combinations of conditional average outcomes from aggregate data when the outcome of interest is binary, while imposing as few restrictions on the underlying data generating process as possible. I construct identified sets using an optimization program that allows for researchers to impose additional shape restrictions. 
I also provide consistency results and construct an inference procedure that is valid with aggregate data, which only provides marginal information about each variable. I apply the methodology to simulated and real-world data sets and find that the estimated identified sets are too wide to be useful. This suggests that to obtain useful information from aggregate data sets about individual-level relationships, researchers must impose further assumptions that are carefully justified."}, "https://arxiv.org/abs/2403.07288": {"title": "Efficient and Model-Agnostic Parameter Estimation Under Privacy-Preserving Post-randomization Data", "link": "https://arxiv.org/abs/2403.07288", "description": "arXiv:2403.07288v1 Announce Type: new \nAbstract: Protecting individual privacy is crucial when releasing sensitive data for public use. While data de-identification helps, it is not enough. This paper addresses parameter estimation in scenarios where data are perturbed using the Post-Randomization Method (PRAM) to enhance privacy. Existing methods for parameter estimation under PRAM data suffer from limitations like being parameter-specific, model-dependent, and lacking efficiency guarantees. We propose a novel, efficient method that overcomes these limitations. Our method is applicable to general parameters defined through estimating equations and makes no assumptions about the underlying data model. We further prove that the proposed estimator achieves the semiparametric efficiency bound, making it optimal in terms of asymptotic variance."}, "https://arxiv.org/abs/2403.07650": {"title": "A Class of Semiparametric Yang and Prentice Frailty Models", "link": "https://arxiv.org/abs/2403.07650", "description": "arXiv:2403.07650v1 Announce Type: new \nAbstract: The Yang and Prentice (YP) regression models have garnered interest from the scientific community due to their ability to analyze data whose survival curves exhibit intersection. These models include proportional hazards (PH) and proportional odds (PO) models as specific cases. However, they encounter limitations when dealing with multivariate survival data due to potential dependencies between the times-to-event. A solution is introducing a frailty term into the hazard functions, making it possible for the times-to-event to be considered independent, given the frailty term. In this study, we propose a new class of YP models that incorporate frailty. We use the exponential distribution, the piecewise exponential distribution (PE), and Bernstein polynomials (BP) as baseline functions. Our approach adopts a Bayesian methodology. The proposed models are evaluated through a simulation study, which shows that the YP frailty models with BP and PE baselines perform similarly to the generator parametric model of the data. We apply the models in two real data sets."}, "https://arxiv.org/abs/2403.07778": {"title": "Joint Modeling of Longitudinal Measurements and Time-to-event Outcomes Using BUGS", "link": "https://arxiv.org/abs/2403.07778", "description": "arXiv:2403.07778v1 Announce Type: new \nAbstract: The objective of this paper is to provide an introduction to the principles of Bayesian joint modeling of longitudinal measurements and time-to-event outcomes, as well as model implementation using the BUGS language syntax. This syntax can be executed directly using OpenBUGS or by utilizing convenient functions to invoke OpenBUGS and JAGS from R software. In this paper, all details of joint models are provided, ranging from simple to more advanced models. 
The presentation starts with the joint modeling of a Gaussian longitudinal marker and time-to-event outcome. The implementation of the Bayesian paradigm of the model is reviewed. The strategies for simulating data from the JM are also discussed. A proportional hazard model with various forms of baseline hazards is taken into consideration, along with a discussion of all possible association structures between the two sub-models. The paper covers joint models with multivariate longitudinal measurements, zero-inflated longitudinal measurements, competing risks, and time-to-event with cure fraction. The models are illustrated by the analyses of several real data sets. All simulated and real data and code are available at \\url{https://github.com/tbaghfalaki/JM-with-BUGS-and-JAGS}."}, "https://arxiv.org/abs/2403.07008": {"title": "AutoEval Done Right: Using Synthetic Data for Model Evaluation", "link": "https://arxiv.org/abs/2403.07008", "description": "arXiv:2403.07008v1 Announce Type: cross \nAbstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4."}, "https://arxiv.org/abs/2403.07031": {"title": "The Cram Method for Efficient Simultaneous Learning and Evaluation", "link": "https://arxiv.org/abs/2403.07031", "description": "arXiv:2403.07031v1 Announce Type: cross \nAbstract: We introduce the \"cram\" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient than sample-splitting. The cram method also naturally accommodates online learning algorithms, making its implementation computationally efficient. To demonstrate the power of the cram method, we consider the standard policy learning setting where cramming is applied to the same data to both develop an individualized treatment rule (ITR) and estimate the average outcome that would result if the learned ITR were to be deployed. We show that under a minimal set of assumptions, the resulting crammed evaluation estimator is consistent and asymptotically normal. While our asymptotic results require a relatively weak stabilization condition on the ML algorithm, we develop a simple, generic method that can be used with any policy learning algorithm to satisfy this condition. Our extensive simulation studies show that, when compared to sample-splitting, cramming reduces the evaluation standard error by more than 40% while improving the performance of the learned policy. We also apply the cram method to a randomized clinical trial to demonstrate its applicability to real-world problems. 
Finally, we briefly discuss future extensions of the cram method to other learning and evaluation settings."}, "https://arxiv.org/abs/2403.07104": {"title": "Shrinkage MMSE estimators of covariances beyond the zero-mean and stationary variance assumptions", "link": "https://arxiv.org/abs/2403.07104", "description": "arXiv:2403.07104v1 Announce Type: cross \nAbstract: We tackle covariance estimation in low-sample scenarios, employing a structured covariance matrix with shrinkage methods. These involve convexly combining a low-bias/high-variance empirical estimate with a biased regularization estimator, striking a bias-variance trade-off. The literature provides optimal settings of the regularization amount through risk minimization between the true covariance and its shrunk counterpart. Such estimators were derived for zero-mean statistics with i.i.d. diagonal regularization matrices accounting for the average sample variance solely. We extend these results to regularization matrices accounting for the sample variances both for centered and non-centered samples. In the latter case, the empirical estimate of the true mean is incorporated into our shrinkage estimators. Introducing confidence weights into the statistics also enhances estimator robustness against outliers. We compare our estimators to other shrinkage methods both on numerical simulations and on real data to solve a detection problem in astronomy."}, "https://arxiv.org/abs/2403.07464": {"title": "On Ranking-based Tests of Independence", "link": "https://arxiv.org/abs/2403.07464", "description": "arXiv:2403.07464v1 Announce Type: cross \nAbstract: In this paper we develop a novel nonparametric framework to test the independence of two random variables $\\mathbf{X}$ and $\\mathbf{Y}$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dx dy)$, based on {\\it Receiver Operating Characteristic} (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that the independence hypothesis $\\mathcal{H}_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\\otimes G,\\; F)$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square. We consider a wide class of rank statistics encompassing many ways of deviating from the diagonal in the ROC space to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here."}, "https://arxiv.org/abs/2403.07495": {"title": "Tuning diagonal scale matrices for HMC", "link": "https://arxiv.org/abs/2403.07495", "description": "arXiv:2403.07495v1 Announce Type: cross \nAbstract: Three approaches for adaptively tuning diagonal scale matrices for HMC are discussed and compared. The common practice of scaling according to estimated marginal standard deviations is taken as a benchmark. 
Scaling according to the mean log-target gradient (ISG), and a scaling method targeting that the frequency of when the underlying Hamiltonian dynamics crosses the respective medians should be uniform across dimensions, are taken as alternatives. Numerical studies suggest that the ISG method leads in many cases to more efficient sampling than the benchmark, in particular in cases with strong correlations or non-linear dependencies. The ISG method is also easy to implement, computationally cheap and would be relatively simple to include in automatically tuned codes as an alternative to the benchmark practice."}, "https://arxiv.org/abs/2403.07657": {"title": "Scalable Spatiotemporal Prediction with Bayesian Neural Fields", "link": "https://arxiv.org/abs/2403.07657", "description": "arXiv:2403.07657v1 Announce Type: cross \nAbstract: Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in many scientific and business-intelligence applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As modern datasets continue to increase in size and complexity, there is a growing need for new statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle large prediction problems. This work presents the Bayesian Neural Field (BayesNF), a domain-general statistical model for inferring rich probability distributions over a spatiotemporal domain, which can be used for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a novel deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust uncertainty quantification. By defining the prior through a sequence of smooth differentiable transforms, posterior inference is conducted on large-scale data using variationally learned surrogates trained via stochastic gradient descent. We evaluate BayesNF against prominent statistical and machine-learning baselines, showing considerable improvements on diverse prediction problems from climate and public health datasets that contain tens to hundreds of thousands of measurements. The paper is accompanied with an open-source software package (https://github.com/google/bayesnf) that is easy-to-use and compatible with modern GPU and TPU accelerators on the JAX machine learning platform."}, "https://arxiv.org/abs/2403.07728": {"title": "CAS: A General Algorithm for Online Selective Conformal Prediction with FCR Control", "link": "https://arxiv.org/abs/2403.07728", "description": "arXiv:2403.07728v1 Announce Type: cross \nAbstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) to measure the averaged miscoverage error. We develop a general framework named CAS (Calibration after Adaptive Selection) that can wrap around any prediction model and online selection rule to output post-selection prediction intervals. If the current individual is selected, we first perform an adaptive selection on historical data to construct a calibration set, then output a conformal prediction interval for the unobserved label. 
We provide tractable constructions for the calibration set for popular online selection rules. We prove that CAS can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. For the decision-driven selection rule, including most online multiple-testing procedures, CAS can exactly control the real-time FCR below the target level without any distributional assumptions. For the online selection with symmetric thresholds, we establish the error bound for the control gap of FCR under mild distributional assumptions. To account for the distribution shift in online data, we also embed CAS into some recent dynamic conformal prediction methods and examine the long-run FCR control. Numerical results on both synthetic and real data corroborate that CAS can effectively control FCR around the target level and yield narrower prediction intervals than existing baselines across various settings."}, "https://arxiv.org/abs/2103.11066": {"title": "Treatment Allocation under Uncertain Costs", "link": "https://arxiv.org/abs/2103.11066", "description": "arXiv:2103.11066v4 Announce Type: replace \nAbstract: We consider the problem of learning how to optimally allocate treatments whose cost is uncertain and can vary with pre-treatment covariates. This setting may arise in medicine if we need to prioritize access to a scarce resource that different patients would use for different amounts of time, or in marketing if we want to target discounts whose cost to the company depends on how much the discounts are used. Here, we show that the optimal treatment allocation rule under budget constraints is a thresholding rule based on priority scores, and we propose a number of practical methods for learning these priority scores using data from a randomized trial. Our formal results leverage a statistical connection between our problem and that of learning heterogeneous treatment effects under endogeneity using an instrumental variable. We find our method to perform well in a number of empirical evaluations."}, "https://arxiv.org/abs/2201.01879": {"title": "Exponential family measurement error models for single-cell CRISPR screens", "link": "https://arxiv.org/abs/2201.01879", "description": "arXiv:2201.01879v3 Announce Type: replace \nAbstract: CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present substantial statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens -- \"thresholded regression\" -- exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV (\"GLM-based errors-in-variables\"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g., Microsoft Azure) and high-performance clusters. 
Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several novel insights."}, "https://arxiv.org/abs/2202.03513": {"title": "Causal survival analysis under competing risks using longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2202.03513", "description": "arXiv:2202.03513v2 Announce Type: replace \nAbstract: Longitudinal modified treatment policies (LMTP) have been recently developed as a novel method to define and estimate causal parameters that depend on the natural value of treatment. LMTPs represent an important advancement in causal inference for longitudinal studies as they allow the non-parametric definition and estimation of the joint effect of multiple categorical, numerical, or continuous exposures measured at several time points. We extend the LMTP methodology to problems in which the outcome is a time-to-event variable subject to right-censoring and competing risks. We present identification results and non-parametric locally efficient estimators that use flexible data-adaptive regression techniques to alleviate model misspecification bias, while retaining important asymptotic properties such as $\\sqrt{n}$-consistency. We present an application to the estimation of the effect of the time-to-intubation on acute kidney injury amongst COVID-19 hospitalized patients, where death by other causes is taken to be the competing event."}, "https://arxiv.org/abs/2301.00584": {"title": "Selective conformal inference with false coverage-statement rate control", "link": "https://arxiv.org/abs/2301.00584", "description": "arXiv:2301.00584v5 Announce Type: replace \nAbstract: Conformal inference is a popular tool for constructing prediction intervals (PI). We consider here the scenario of post-selection/selective conformal inference, that is, PIs are reported only for individuals selected from unlabeled test data. To account for multiplicity, we develop a general split conformal framework to construct selective PIs with false coverage-statement rate (FCR) control. We first investigate the FCR-adjusted method of Benjamini and Yekutieli (2005) in the present setting, and show that it is able to achieve FCR control but yields uniformly inflated PIs. We then propose a novel solution to the problem, named Selective COnditional conformal Predictions (SCOP), which entails performing selection procedures on both the calibration set and the test set and constructing marginal conformal PIs on the selected sets with the aid of the conditional empirical distribution obtained from the calibration set. Under a unified framework and exchangeability assumptions, we show that the SCOP can exactly control the FCR. More importantly, we provide non-asymptotic miscoverage bounds for a general class of selection procedures beyond exchangeability and discuss the conditions under which the SCOP is able to control the FCR. As special cases, the SCOP with quantile-based selection or conformal p-values-based multiple testing procedures enjoys a valid coverage guarantee under mild conditions. 
Numerical results confirm the effectiveness and robustness of SCOP in FCR control and show that it achieves narrower PIs than existing methods in many settings."}, "https://arxiv.org/abs/2302.07070": {"title": "Empirical study of periodic autoregressive models with additive noise -- estimation and testing", "link": "https://arxiv.org/abs/2302.07070", "description": "arXiv:2302.07070v2 Announce Type: replace \nAbstract: Periodic autoregressive (PAR) time series with finite variance is considered one of the most common models of second-order cyclostationary processes. However, in real applications, signals with periodic characteristics may be disturbed by additional noise related to measurement device disturbances or to other external sources. Thus, the known estimation techniques dedicated to PAR models may be inefficient for such cases. When the variance of the additive noise is relatively small, it can be ignored and the classical estimation techniques can be applied. However, for extreme cases, the additive noise can have a significant influence on the estimation results. In this paper, we propose four estimation techniques for the noise-corrupted PAR models with finite variance distributions. The methodology is based on Yule-Walker equations utilizing the autocovariance function. It can be used for any type of finite variance additive noise. The presented simulation study clearly indicates the efficiency of the proposed techniques, also for the extreme case when the additive noise is a sum of Gaussian additive noise and additive outliers. The proposed estimation techniques are also applied for testing if the data correspond to a noise-corrupted PAR model. This issue is strongly related to the identification of the informative component in the data in the case when the model is disturbed by additive non-informative noise. The power of the test is studied for simulated data. Finally, the testing procedure is applied to two real time series describing particulate matter concentration in the air."}, "https://arxiv.org/abs/2303.04669": {"title": "Minimum contrast for the first-order intensity estimation of spatial and spatio-temporal point processes", "link": "https://arxiv.org/abs/2303.04669", "description": "arXiv:2303.04669v2 Announce Type: replace \nAbstract: In this paper, we harness a result in point process theory, specifically the expectation of the weighted $K$-function, where the weighting is done by the true first-order intensity function. This theoretical result can be employed as an estimation method to derive parameter estimates for a particular model assumed for the data. The underlying motivation is to avoid the difficulties associated with dealing with complex likelihoods in point process models and their maximization. The exploited result makes our method theoretically applicable to any model specification. In this paper, we restrict our study to Poisson models, whose likelihood represents the base for many more complex point process models.\n In this context, our proposed method can estimate the vector of local parameters that correspond to the points within the analyzed point pattern without introducing any additional complexity compared to the global estimation. 
We illustrate the method through simulation studies for both purely spatial and spatio-temporal point processes and show complex scenarios based on the Poisson model through the analysis of two real datasets concerning environmental problems."}, "https://arxiv.org/abs/2303.12378": {"title": "A functional spatial autoregressive model using signatures", "link": "https://arxiv.org/abs/2303.12378", "description": "arXiv:2303.12378v2 Announce Type: replace \nAbstract: We propose a new approach to the autoregressive spatial functional model, based on the notion of signature, which represents a function as an infinite series of its iterated integrals. It presents the advantage of being applicable to a wide range of processes. After having provided theoretical guarantees to the proposed model, we have shown in a simulation study and on a real data set that this new approach presents competitive performances compared to the traditional model."}, "https://arxiv.org/abs/2303.17823": {"title": "An interpretable neural network-based non-proportional odds model for ordinal regression", "link": "https://arxiv.org/abs/2303.17823", "description": "arXiv:2303.17823v4 Announce Type: replace \nAbstract: This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is defined for both continuous and discrete responses, whereas standard methods typically treat the ordered continuous variables as if they are discrete, (2) instead of estimating response-dependent finite-dimensional coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets."}, "https://arxiv.org/abs/2306.05186": {"title": "Innovation Processes for Inference", "link": "https://arxiv.org/abs/2306.05186", "description": "arXiv:2306.05186v2 Announce Type: replace \nAbstract: Urn models for innovation have proven to capture fundamental empirical laws shared by several real-world processes. The so-called urn model with triggering includes, as particular cases, an urn representation of the two-parameter Poisson-Dirichlet process and the Dirichlet process, seminal in Bayesian non-parametric inference. In this work, we leverage this connection to introduce a novel approach for quantifying closeness between symbolic sequences and test it within the framework of the authorship attribution problem. The method demonstrates high accuracy when compared to other state-of-the-art methods in different scenarios, featuring a substantial gain in computational efficiency and theoretical transparency. Beyond the practical convenience, this work demonstrates how the recently established connection between urn models and non-parametric Bayesian inference can pave the way for designing more efficient inference methods. 
In particular, the hybrid approach that we propose allows us to relax the exchangeability hypothesis, which can be particularly relevant for systems exhibiting complex correlation patterns and non-stationary dynamics."}, "https://arxiv.org/abs/2306.08693": {"title": "Integrating Uncertainty Awareness into Conformalized Quantile Regression", "link": "https://arxiv.org/abs/2306.08693", "description": "arXiv:2306.08693v2 Announce Type: replace \nAbstract: Conformalized Quantile Regression (CQR) is a recently proposed method for constructing prediction intervals for a response $Y$ given covariates $X$, without making distributional assumptions. However, existing constructions of CQR can be ineffective for problems where the quantile regressors perform better in certain parts of the feature space than others. The reason is that the prediction intervals of CQR do not distinguish between two forms of uncertainty: first, the variability of the conditional distribution of $Y$ given $X$ (i.e., aleatoric uncertainty), and second, our uncertainty in estimating this conditional distribution (i.e., epistemic uncertainty). This can lead to intervals that are overly narrow in regions where epistemic uncertainty is high. To address this, we propose a new variant of the CQR methodology, Uncertainty-Aware CQR (UACQR), that explicitly separates these two sources of uncertainty to adjust quantile regressors differentially across the feature space. Compared to CQR, our methods enjoy the same distribution-free theoretical coverage guarantees, while demonstrating in our experiments stronger conditional coverage properties in simulated settings and real-world data sets alike."}, "https://arxiv.org/abs/2310.17165": {"title": "Price Experimentation and Interference", "link": "https://arxiv.org/abs/2310.17165", "description": "arXiv:2310.17165v2 Announce Type: replace \nAbstract: In this paper, we examine biases arising in A/B tests where firms modify a continuous parameter, such as price, to estimate the global treatment effect of a given performance metric, such as profit. These biases emerge in canonical experimental estimators due to interference among market participants. We employ structural modeling and differential calculus to derive intuitive characterizations of these biases. We then specialize our general model to a standard revenue management pricing problem. This setting highlights a key pitfall in the use of A/B pricing experiments to guide profit maximization: notably, the canonical estimator for the expected change in profits can have the {\\em wrong sign}. In other words, following the guidance of canonical estimators may lead firms to move prices in the wrong direction, inadvertently decreasing profits relative to the status quo. We apply these results to a two-sided market model and show how this ``change of sign\" regime depends on model parameters such as market imbalance, as well as the price markup. Finally, we discuss structural and practical implications for platform operators."}, "https://arxiv.org/abs/2311.06458": {"title": "Conditional Adjustment in a Markov Equivalence Class", "link": "https://arxiv.org/abs/2311.06458", "description": "arXiv:2311.06458v2 Announce Type: replace \nAbstract: We consider the problem of identifying a conditional causal effect through covariate adjustment. We focus on the setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). 
Both MPDAGs and PAGs represent equivalence classes of possible underlying causal models. After defining adjustment sets in this setting, we provide a necessary and sufficient graphical criterion -- the conditional adjustment criterion -- for finding these sets under conditioning on variables unaffected by treatment. We further provide explicit sets from the graph that satisfy the conditional adjustment criterion, and therefore, can be used as adjustment sets for conditional causal effect identification."}, "https://arxiv.org/abs/2312.17061": {"title": "Bayesian Analysis of High Dimensional Vector Error Correction Model", "link": "https://arxiv.org/abs/2312.17061", "description": "arXiv:2312.17061v2 Announce Type: replace \nAbstract: Vector Error Correction Model (VECM) is a classic method to analyse cointegration relationships amongst multivariate non-stationary time series. In this paper, we focus on high dimensional setting and seek for sample-size-efficient methodology to determine the level of cointegration. Our investigation centres at a Bayesian approach to analyse the cointegration matrix, henceforth determining the cointegration rank. We design two algorithms and implement them on simulated examples, yielding promising results particularly when dealing with high number of variables and relatively low number of observations. Furthermore, we extend this methodology to empirically investigate the constituents of the S&P 500 index, where low-volatility portfolios can be found during both in-sample training and out-of-sample testing periods."}, "https://arxiv.org/abs/2111.13800": {"title": "A Two-Stage Feature Selection Approach for Robust Evaluation of Treatment Effects in High-Dimensional Observational Data", "link": "https://arxiv.org/abs/2111.13800", "description": "arXiv:2111.13800v2 Announce Type: replace-cross \nAbstract: A Randomized Control Trial (RCT) is considered as the gold standard for evaluating the effect of any intervention or treatment. However, its feasibility is often hindered by ethical, economical, and legal considerations, making observational data a valuable alternative for drawing causal conclusions. Nevertheless, healthcare observational data presents a difficult challenge due to its high dimensionality, requiring careful consideration to ensure unbiased, reliable, and robust causal inferences. To overcome this challenge, in this study, we propose a novel two-stage feature selection technique called, Outcome Adaptive Elastic Net (OAENet), explicitly designed for making robust causal inference decisions using matching techniques. OAENet offers several key advantages over existing methods: superior performance on correlated and high-dimensional data compared to the existing methods and the ability to select specific sets of variables (including confounders and variables associated only with the outcome). This ensures robustness and facilitates an unbiased estimate of the causal effect. Numerical experiments on simulated data demonstrate that OAENet significantly outperforms state-of-the-art methods by either producing a higher-quality estimate or a comparable estimate in significantly less time. To illustrate the applicability of OAENet, we employ large-scale US healthcare data to estimate the effect of Opioid Use Disorder (OUD) on suicidal behavior. When compared to competing methods, OAENet closely aligns with existing literature on the relationship between OUD and suicidal behavior. 
Performance on both simulated and real-world data highlights that OAENet notably enhances the accuracy of estimating treatment effects or evaluating policy decision-making with causal inference."}, "https://arxiv.org/abs/2311.05330": {"title": "A Bayesian framework for measuring association and its application to emotional dynamics in Web discourse", "link": "https://arxiv.org/abs/2311.05330", "description": "arXiv:2311.05330v2 Announce Type: replace-cross \nAbstract: This paper introduces a Bayesian framework designed to measure the degree of association between categorical random variables. The method is grounded in the formal definition of variable independence and is implemented using Markov Chain Monte Carlo (MCMC) techniques. Unlike commonly employed techniques in Association Rule Learning, this approach enables a clear and precise estimation of confidence intervals and the statistical significance of the measured degree of association. We applied the method to non-exclusive emotions identified by annotators in 4,613 tweets written in Portuguese. This analysis revealed pairs of emotions that exhibit associations and mutually opposed pairs. Moreover, the method identifies hierarchical relations between categories, a feature observed in our data, and is utilized to cluster emotions into basic-level groups."}, "https://arxiv.org/abs/2403.07892": {"title": "Change Point Detection with Copula Entropy based Two-Sample Test", "link": "https://arxiv.org/abs/2403.07892", "description": "arXiv:2403.07892v1 Announce Type: new \nAbstract: Change point detection is a typical task that aims to find changes in time series and can be tackled with a two-sample test. Copula Entropy is a mathematical concept for measuring statistical independence, and a two-sample test based on it was introduced recently. In this paper we propose a nonparametric multivariate method for multiple change point detection with the copula entropy-based two-sample test. Single change point detection is first proposed as a group of two-sample tests at every point of the time series data, and the change point is taken to be the point with the maximum of the test statistics. Multiple change point detection is then proposed by combining the single change point detection method with a binary segmentation strategy. We verify the effectiveness of our method and compare it with other similar methods on simulated univariate and multivariate data and on the Nile data."}, "https://arxiv.org/abs/2403.07963": {"title": "Copula based dependent censoring in cure models", "link": "https://arxiv.org/abs/2403.07963", "description": "arXiv:2403.07963v1 Announce Type: new \nAbstract: In this paper we consider a time-to-event variable $T$ that is subject to random right censoring, and we assume that the censoring time $C$ is stochastically dependent on $T$ and that there is a positive probability of not observing the event. There are various situations in practice where this happens, and appropriate models and methods need to be considered to avoid biased estimators of the survival function or incorrect conclusions in clinical trials. We consider a fully parametric model for the bivariate distribution of $(T,C)$ that takes these features into account. The model depends on a parametric copula (with unknown association parameter) and on parametric marginal distributions for $T$ and $C$. Sufficient conditions are developed under which the model is identified, and an estimation procedure is proposed. 
In particular, our model allows us to identify and estimate the association between $T$ and $C$, even though only the smaller of these two variables is observable. The finite sample performance of the estimated parameters is illustrated by means of a thorough simulation study and the analysis of breast cancer data."}, "https://arxiv.org/abs/2403.08118": {"title": "Characterising harmful data sources when constructing multi-fidelity surrogate models", "link": "https://arxiv.org/abs/2403.08118", "description": "arXiv:2403.08118v1 Announce Type: new \nAbstract: Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources however poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used and use this analysis to provide guidelines that can be used in an applied industrial setting."}, "https://arxiv.org/abs/2403.08130": {"title": "Imputation of Counterfactual Outcomes when the Errors are Predictable", "link": "https://arxiv.org/abs/2403.08130", "description": "arXiv:2403.08130v1 Announce Type: new \nAbstract: A crucial input into causal inference is the imputed counterfactual outcome.\n Imputation error can arise because of sampling uncertainty from estimating the prediction model using the untreated observations, or from out-of-sample information not captured by the model. While the literature has focused on sampling uncertainty, it vanishes with the sample size. Often overlooked is the possibility that the out-of-sample error can be informative about the missing counterfactual outcome if it is mutually or serially correlated. Motivated by the best linear unbiased predictor (BLUP) of Goldberger (1962) in a time series setting, we propose an improved predictor of the potential outcome when the errors are correlated. The proposed predictor is practical as it is not restricted to linear models, can be used with consistent estimators already developed, and improves mean-squared error for a large class of strong mixing error processes. Ignoring predictability in the errors can distort conditional inference. 
However, the precise impact will depend on the choice of estimator as well as the realized values of the residuals."}, "https://arxiv.org/abs/2403.08183": {"title": "Causal Interpretation of Estimands Defined by Exposure Mappings", "link": "https://arxiv.org/abs/2403.08183", "description": "arXiv:2403.08183v1 Announce Type: new \nAbstract: In settings with interference, it is common to utilize estimands defined by exposure mappings to summarize the impact of variation in treatment assignments local to the ego. This paper studies their causal interpretation under weak restrictions on interference. We demonstrate that the estimands can exhibit unpalatable sign reversals under conventional identification conditions. This motivates the formulation of sign preservation criteria for causal interpretability. To satisfy preferred criteria, it is necessary to impose restrictions on interference, either in potential outcomes or selection into treatment. We provide sufficient conditions and show that they are satisfied by a nonparametric model allowing for a complex form of interference in both the outcome and selection stages."}, "https://arxiv.org/abs/2403.08359": {"title": "Assessment of background noise properties in time and time-frequency domains in the context of vibration-based local damage detection in real environment", "link": "https://arxiv.org/abs/2403.08359", "description": "arXiv:2403.08359v1 Announce Type: new \nAbstract: Any measurement in condition monitoring applications is associated with disturbing noise. Till now, most of the diagnostic procedures have assumed the Gaussian distribution for the noise. This paper shares a novel perspective to the problem of local damage detection. The acquired vector of observations is considered as an additive mixture of signal of interest (SOI) and noise with strongly non-Gaussian, heavy-tailed properties, that masks the SOI. The distribution properties of the background noise influence the selection of tools used for the signal analysis, particularly for local damage detection. Thus, it is extremely important to recognize and identify possible non-Gaussian behavior of the noise. The problem considered here is more general than the classical goodness-of-fit testing. The paper highlights the important role of variance, as most of the methods for signal analysis are based on the assumption of the finite-variance distribution of the underlying signal. The finite variance assumption is crucial but implicit to most indicators used in condition monitoring, (such as the root-mean-square value, the power spectral density, the kurtosis, the spectral correlation, etc.), in view that infinite variance implies moments higher than 2 are also infinite. The problem is demonstrated based on three popular types of non-Gaussian distributions observed for real vibration signals. We demonstrate how the properties of noise distribution in the time domain may change by its transformations to the time-frequency domain (spectrogram). Additionally, we propose a procedure to check the presence of the infinite-variance of the background noise. Our investigations are illustrated using simulation studies and real vibration signals from various machines."}, "https://arxiv.org/abs/2403.08514": {"title": "Spatial Latent Gaussian Modelling with Change of Support", "link": "https://arxiv.org/abs/2403.08514", "description": "arXiv:2403.08514v1 Announce Type: new \nAbstract: Spatial data are often derived from multiple sources (e.g. 
satellites, in-situ sensors, survey samples) with different supports, but associated with the same properties of a spatial phenomenon of interest. It is common for predictors to also be measured on different spatial supports than the response variables. Although there is no standard way to work with spatial data with different supports, a prevalent approach used by practitioners has been to use downscaling or interpolation to project all the variables of analysis towards a common support, and then to use standard spatial models. The main disadvantage with this approach is that simple interpolation can introduce biases and, more importantly, the uncertainty associated with the change of support is not taken into account in parameter estimation. In this article, we propose a Bayesian spatial latent Gaussian model that can handle data with different rectilinear supports in both the response variable and predictors. Our approach allows us to handle changes of support more naturally according to the properties of the spatial stochastic process being used, and to take into account the uncertainty from the change of support in parameter estimation and prediction. We use spatial stochastic processes as linear combinations of basis functions where Gaussian Markov random fields define the weights. Our hierarchical modelling approach can be described by the following steps: (i) define a latent model where response variables and predictors are considered as latent stochastic processes with continuous support, (ii) link the continuous-index set stochastic processes with their projections to the support of the observed data, (iii) link the projected process with the observed data. We show the applicability of our approach by simulation studies and modelling land suitability for improved grassland in Rhondda Cynon Taf, a county borough in Wales."}, "https://arxiv.org/abs/2403.08577": {"title": "Evaluation and comparison of covariate balance metrics in studies with time-dependent confounding", "link": "https://arxiv.org/abs/2403.08577", "description": "arXiv:2403.08577v1 Announce Type: new \nAbstract: Marginal structural models have been increasingly used by analysts in recent years to account for confounding bias in studies with time-varying treatments. The parameters of these models are often estimated using inverse probability of treatment weighting. To ensure that the estimated weights adequately control confounding, it is possible to check for residual imbalance between treatment groups in the weighted data. Several balance metrics have been developed and compared in the cross-sectional case but have not yet been evaluated and compared in longitudinal studies with time-varying treatment. We have first extended the definition of several balance metrics to the case of a time-varying treatment, with or without censoring. We then compared the performance of these balance metrics in a simulation study by assessing the strength of the association between their estimated level of imbalance and bias. We found that the Mahalanobis balance performed best. Finally, the method was illustrated for estimating the cumulative effect of statin exposure over one year on the risk of cardiovascular disease or death in people aged 65 and over in population-wide administrative data. 
This illustration confirms the feasibility of employing our proposed metrics in large databases with multiple time-points."}, "https://arxiv.org/abs/2403.08630": {"title": "Leveraging Non-Decimated Wavelet Packet Features and Transformer Models for Time Series Forecasting", "link": "https://arxiv.org/abs/2403.08630", "description": "arXiv:2403.08630v1 Announce Type: new \nAbstract: This article combines wavelet analysis techniques with machine learning methods for univariate time series forecasting, focusing on three main contributions. Firstly, we consider the use of Daubechies wavelets with different numbers of vanishing moments as input features to both non-temporal and temporal forecasting methods, by selecting these numbers during the cross-validation phase. Secondly, we compare the use of both the non-decimated wavelet transform and the non-decimated wavelet packet transform for computing these features, the latter providing a much larger set of potentially useful coefficient vectors. The wavelet coefficients are computed using a shifted version of the typical pyramidal algorithm to ensure no leakage of future information into these inputs. Thirdly, we evaluate the use of these wavelet features on a significantly wider set of forecasting methods than previous studies, including both temporal and non-temporal models, and both statistical and deep learning-based methods. The latter include state-of-the-art transformer-based neural network architectures. Our experiments suggest significant benefit in replacing higher-order lagged features with wavelet features across all examined non-temporal methods for one-step-forward forecasting, and modest benefit when used as inputs for temporal deep learning-based models for long-horizon forecasting."}, "https://arxiv.org/abs/2403.08653": {"title": "Physics-Guided Inverse Regression for Crop Quality Assessment", "link": "https://arxiv.org/abs/2403.08653", "description": "arXiv:2403.08653v1 Announce Type: new \nAbstract: We present an innovative approach leveraging Physics-Guided Neural Networks (PGNNs) for enhancing agricultural quality assessments. Central to our methodology is the application of physics-guided inverse regression, a technique that significantly improves the model's ability to precisely predict quality metrics of crops. This approach directly addresses the challenges of scalability, speed, and practicality that traditional assessment methods face. By integrating physical principles, notably Fick's second law of diffusion, into neural network architectures, our developed PGNN model achieves a notable advancement in enhancing both the interpretability and accuracy of assessments. Empirical validation conducted on cucumbers and mushrooms demonstrates the superior capability of our model in outperforming conventional computer vision techniques in postharvest quality evaluation. 
This underscores our contribution as a scalable and efficient solution to the pressing demands of global food supply challenges."}, "https://arxiv.org/abs/2403.08753": {"title": "Invalid proxies and volatility changes", "link": "https://arxiv.org/abs/2403.08753", "description": "arXiv:2403.08753v1 Announce Type: new \nAbstract: When in proxy-SVARs the covariance matrix of VAR disturbances is subject to exogenous, permanent, nonrecurring breaks that generate target impulse response functions (IRFs) that change across volatility regimes, even strong, exogenous external instruments can result in inconsistent estimates of the dynamic causal effects of interest if the breaks are not properly accounted for. In such cases, it is essential to explicitly incorporate the shifts in unconditional volatility in order to point-identify the target structural shocks and possibly restore consistency. We demonstrate that, under a necessary and sufficient rank condition that leverages moments implied by changes in volatility, the target IRFs can be point-identified and consistently estimated. Importantly, standard asymptotic inference remains valid in this context despite (i) the covariance between the proxies and the instrumented structural shocks being local-to-zero, as in Staiger and Stock (1997), and (ii) the potential failure of instrument exogeneity. We introduce a novel identification strategy that appropriately combines external instruments with \"informative\" changes in volatility, thus obviating the need to assume proxy relevance and exogeneity in estimation. We illustrate the effectiveness of the suggested method by revisiting a fiscal proxy-SVAR previously estimated in the literature, complementing the fiscal instruments with information derived from the massive reduction in volatility observed in the transition from the Great Inflation to the Great Moderation regimes."}, "https://arxiv.org/abs/2403.08079": {"title": "BayesFLo: Bayesian fault localization of complex software systems", "link": "https://arxiv.org/abs/2403.08079", "description": "arXiv:2403.08079v1 Announce Type: cross \nAbstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods, however, are largely deterministic, and thus do not provide a principled approach for assessing probabilistic risk of potential root causes, or for integrating domain and/or structural knowledge from test engineers. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model on potential root cause combinations. A key feature of BayesFLo is its integration of the principles of combination hierarchy and heredity, which capture the structured nature of failure-inducing combinations. A critical challenge, however, is the sheer number of potential root cause scenarios to consider, which renders the computation of posterior root cause probabilities infeasible even for small software systems. We thus develop new algorithms for efficient computation of such probabilities, leveraging recent tools from integer programming and graph representations. 
We then demonstrate the effectiveness of BayesFLo over state-of-the-art fault localization methods, in a suite of numerical experiments and in two motivating case studies on the JMP XGBoost interface."}, "https://arxiv.org/abs/2403.08628": {"title": "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables", "link": "https://arxiv.org/abs/2403.08628", "description": "arXiv:2403.08628v1 Announce Type: cross \nAbstract: This paper establishes the optimal sub-Gaussian variance proxy for truncated Gaussian and truncated exponential random variables. The proofs rely on first characterizing the optimal variance proxy as the unique solution to a set of two equations and then observing that for these two truncated distributions, one may find explicit solutions to this set of equations. Moreover, we establish the conditions under which the optimal variance proxy coincides with the variance, thereby characterizing the strict sub-Gaussianity of the truncated random variables. Specifically, we demonstrate that truncated Gaussian variables exhibit strict sub-Gaussian behavior if and only if they are symmetric, meaning their truncation is symmetric with respect to the mean. Conversely, truncated exponential variables are shown to never exhibit strict sub-Gaussian properties. These findings contribute to the understanding of these prevalent probability distributions in statistics and machine learning, providing a valuable foundation for improved and optimal modeling and decision-making processes."}, "https://arxiv.org/abs/2206.04643": {"title": "On the Performance of the Neyman Allocation with Small Pilots", "link": "https://arxiv.org/abs/2206.04643", "description": "arXiv:2206.04643v3 Announce Type: replace \nAbstract: The Neyman Allocation is used in many papers on experimental design, which typically assume that researchers have access to large pilot studies. This may be unrealistic. To understand the properties of the Neyman Allocation with small pilots, we study its behavior in an asymptotic framework that takes pilot size to be fixed even as the size of the main wave tends to infinity. Our analysis shows that the Neyman Allocation can lead to estimates of the ATE with higher asymptotic variance than with (non-adaptive) balanced randomization. In particular, this happens when the outcome variable is relatively homoskedastic with respect to treatment status or when it exhibits high kurtosis. We provide a series of empirical examples showing that such situations can arise in practice. Our results suggest that researchers with small pilots should not use the Neyman Allocation if they believe that outcomes are homoskedastic or heavy-tailed. We examine some potential methods for improving the finite sample performance of the FNA via simulations."}, "https://arxiv.org/abs/2208.11570": {"title": "Flexible control of the median of the false discovery proportion", "link": "https://arxiv.org/abs/2208.11570", "description": "arXiv:2208.11570v4 Announce Type: replace \nAbstract: We introduce a multiple testing procedure that controls the median of the proportion of false discoveries (FDP) in a flexible way. The procedure only requires a vector of p-values as input and is comparable to the Benjamini-Hochberg method, which controls the mean of the FDP. Our method allows freely choosing one or several values of alpha after seeing the data -- unlike Benjamini-Hochberg, which can be very liberal when alpha is chosen post hoc. 
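For context on the Benjamini-Hochberg baseline that the abstract above compares against, here is a minimal sketch (not the paper's median-FDP procedure) using statsmodels; the p-values are simulated and purely illustrative.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Illustrative p-values: 90 nulls (uniform) and 10 signals (small p-values).
pvals = np.concatenate([rng.uniform(size=90), rng.beta(0.5, 20, size=10)])

# Benjamini-Hochberg step-up procedure: controls the *mean* of the FDP (the FDR)
# at level alpha; this is the baseline the median-FDP method is compared against.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"discoveries at alpha=0.05: {reject.sum()}")
```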
We prove these claims and illustrate them with simulations. Our procedure is inspired by a popular estimator of the total number of true hypotheses. We adapt this estimator to provide simultaneously median unbiased estimators of the FDP, valid for finite samples. This simultaneity allows for the claimed flexibility. Our approach does not assume independence. The time complexity of our method is linear in the number of hypotheses, after sorting the p-values."}, "https://arxiv.org/abs/2209.00105": {"title": "Personalized Biopsy Schedules Using an Interval-censored Cause-specific Joint Model", "link": "https://arxiv.org/abs/2209.00105", "description": "arXiv:2209.00105v5 Announce Type: replace \nAbstract: Active surveillance (AS), where biopsies are conducted to detect cancer progression, has been acknowledged as an efficient way to reduce the overtreatment of prostate cancer. Most AS cohorts use fixed biopsy schedules for all patients. However, the ideal test frequency remains unknown, and the routine use of such invasive tests burdens the patients. An emerging idea is to generate personalized biopsy schedules based on each patient's progression-specific risk. To achieve that, we propose the interval-censored cause-specific joint model (ICJM), which models the impact of longitudinal biomarkers on cancer progression while considering the competing event of early treatment initiation. The underlying likelihood function incorporates the interval-censoring of cancer progression, the competing risk of treatment, and the uncertainty about whether cancer progression occurred since the last biopsy in patients that are right-censored or experience the competing event. The model can produce patient-specific risk profiles until a horizon time. If the risk exceeds a certain threshold, a biopsy is conducted. The optimal threshold can be chosen by balancing two indicators of the biopsy schedules: the expected number of biopsies and expected delay in detection of cancer progression. A simulation study showed that our personalized schedules could considerably reduce the number of biopsies per patient by 34%-54% compared to the fixed schedules, though at the cost of a slightly longer detection delay."}, "https://arxiv.org/abs/2306.01604": {"title": "On the minimum information checkerboard copulas under fixed Kendall's rank correlation", "link": "https://arxiv.org/abs/2306.01604", "description": "arXiv:2306.01604v2 Announce Type: replace \nAbstract: Copulas have gained widespread popularity as statistical models to represent dependence structures between multiple variables in various applications. The minimum information copula, given a finite number of constraints in advance, emerges as the copula closest to the uniform copula when measured in Kullback-Leibler divergence. In prior research, the focus has predominantly been on constraints related to expectations on moments, including Spearman's $\\rho$. This approach allows for obtaining the copula through convex programming. However, the existing framework for minimum information copulas does not encompass non-linear constraints such as Kendall's $\\tau$. To address this limitation, we introduce MICK, a novel minimum information copula under fixed Kendall's $\\tau$. We first characterize MICK by its local dependence property. Despite being defined as the solution to a non-convex optimization problem, we demonstrate that the uniqueness of this copula is guaranteed when the correlation is sufficiently small. 
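As background for the two constraint types discussed in the minimum-information-copula abstract above, a small sketch computing both rank correlations with SciPy; the data are simulated and this is not the MICK estimator itself.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(1)
# Illustrative dependent sample: y is a noisy monotone transform of x.
x = rng.normal(size=500)
y = np.tanh(x) + 0.3 * rng.normal(size=500)

tau, tau_p = kendalltau(x, y)   # non-linear constraint targeted by MICK
rho, rho_p = spearmanr(x, y)    # moment-type constraint used in earlier work
print(f"Kendall's tau = {tau:.3f}, Spearman's rho = {rho:.3f}")
```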
Additionally, we provide numerical insights into applying MICK to real financial data."}, "https://arxiv.org/abs/2306.10601": {"title": "Sliced Wasserstein Regression", "link": "https://arxiv.org/abs/2306.10601", "description": "arXiv:2306.10601v2 Announce Type: replace \nAbstract: While statistical modeling of distributional data has gained increased attention, the case of multivariate distributions has been somewhat neglected despite its relevance in various applications. This is because the Wasserstein distance, commonly used in distributional data analysis, poses challenges for multivariate distributions. A promising alternative is the sliced Wasserstein distance, which offers a computationally simpler solution. We propose distributional regression models with multivariate distributions as responses paired with Euclidean vector predictors. The foundation of our methodology is a slicing transform from the multivariate distribution space to the sliced distribution space for which we establish a theoretical framework, with the Radon transform as a prominent example. We introduce and study the asymptotic properties of sample-based estimators for two regression approaches, one based on utilizing the sliced Wasserstein distance directly in the multivariate distribution space, and a second approach based on a new slice-wise distance, employing a univariate distribution regression for each slice. Both global and local Fr\\'echet regression methods are deployed for these approaches and illustrated in simulations and through applications. These include joint distributions of excess winter death rates and winter temperature anomalies in European countries as a function of base winter temperature and also data from finance."}, "https://arxiv.org/abs/2310.11969": {"title": "Survey calibration for causal inference: a simple method to balance covariate distributions", "link": "https://arxiv.org/abs/2310.11969", "description": "arXiv:2310.11969v2 Announce Type: replace \nAbstract: This paper proposes a simple, yet powerful, method for balancing distributions of covariates for causal inference based on observational studies. The method makes it possible to balance an arbitrary number of quantiles (e.g., medians, quartiles, or deciles) together with means if necessary. The proposed approach is based on the theory of calibration estimators (Deville and S\\\"arndal 1992), in particular, calibration estimators for quantiles, proposed by Harms and Duchesne (2006). The method does not require numerical integration, kernel density estimation or assumptions about the distributions. Valid estimates can be obtained by drawing on existing asymptotic theory. An illustrative example of the proposed approach is presented for the entropy balancing method and the covariate balancing propensity score method. Results of a simulation study indicate that the method efficiently estimates average treatment effects on the treated (ATT), the average treatment effect (ATE), the quantile treatment effect on the treated (QTT) and the quantile treatment effect (QTE), especially in the presence of non-linearity and mis-specification of the models. The proposed approach can be further generalized to other designs (e.g. multi-category, continuous) or methods (e.g. synthetic control method). 
An open source software implementing proposed methods is available."}, "https://arxiv.org/abs/2401.14359": {"title": "Minimum Covariance Determinant: Spectral Embedding and Subset Size Determination", "link": "https://arxiv.org/abs/2401.14359", "description": "arXiv:2401.14359v2 Announce Type: replace \nAbstract: This paper introduces several enhancements to the minimum covariance determinant method of outlier detection and robust estimation of means and covariances. We leverage the principal component transform to achieve dimension reduction and ultimately better analyses. Our best subset selection algorithm strategically combines statistical depth and concentration steps. To ascertain the appropriate subset size and number of principal components, we introduce a bootstrap procedure that estimates the instability of the best subset algorithm. The parameter combination exhibiting minimal instability proves ideal for the purposes of outlier detection and robust estimation. Rigorous benchmarking against prominent MCD variants showcases our approach's superior statistical performance and computational speed in high dimensions. Application to a fruit spectra data set and a cancer genomics data set illustrates our claims."}, "https://arxiv.org/abs/2305.15754": {"title": "Bayesian Analysis for Over-parameterized Linear Model without Sparsity", "link": "https://arxiv.org/abs/2305.15754", "description": "arXiv:2305.15754v2 Announce Type: replace-cross \nAbstract: In the field of high-dimensional Bayesian statistics, a plethora of methodologies have been developed, including various prior distributions that result in parameter sparsity. However, such priors exhibit limitations in handling the spectral eigenvector structure of data, rendering estimations less effective for analyzing the over-parameterized models (high-dimensional linear models that do not assume sparsity) developed in recent years. This study introduces a Bayesian approach that employs a prior distribution dependent on the eigenvectors of data covariance matrices without inducing parameter sparsity. We also provide contraction rates of the derived posterior estimation and develop a truncated Gaussian approximation of the posterior distribution. The former demonstrates the efficiency of posterior estimation, whereas the latter facilitates the uncertainty quantification of parameters via a Bernstein--von Mises-type approach. These findings suggest that Bayesian methods capable of handling data spectra and estimating non-sparse high-dimensional parameters are feasible."}, "https://arxiv.org/abs/2401.06446": {"title": "Increasing dimension asymptotics for two-way crossed mixed effect models", "link": "https://arxiv.org/abs/2401.06446", "description": "arXiv:2401.06446v2 Announce Type: replace-cross \nAbstract: This paper presents asymptotic results for the maximum likelihood and restricted maximum likelihood (REML) estimators within a two-way crossed mixed effect model as the sizes of the rows, columns, and cells tend to infinity. Under very mild conditions which do not require the assumption of normality, the estimators are proven to be asymptotically normal, possessing a structured covariance matrix. 
The growth rate for the number of rows, columns, and cells is unrestricted, whether considered pairwise or collectively."}, "https://arxiv.org/abs/2403.08927": {"title": "Principal stratification with U-statistics under principal ignorability", "link": "https://arxiv.org/abs/2403.08927", "description": "arXiv:2403.08927v1 Announce Type: new \nAbstract: Principal stratification is a popular framework for causal inference in the presence of an intermediate outcome. While the principal average treatment effects have traditionally been the default target of inference, they may not be sufficient when the interest lies in the relative favorability of one potential outcome over the other within the principal stratum. We thus introduce the principal generalized causal effect estimands, which extend the principal average causal effects to accommodate nonlinear contrast functions. Under principal ignorability, we expand the theoretical results in Jiang et al. (2022) to a much wider class of causal estimands in the presence of a binary intermediate variable. We develop identification formulas and derive the efficient influence functions of the generalized estimands for principal stratification analyses. These efficient influence functions motivate a set of multiply robust estimators and lay the ground for obtaining efficient debiased machine learning estimators via cross-fitting based on $U$-statistics. The proposed methods are illustrated through simulations and the analysis of a data example."}, "https://arxiv.org/abs/2403.09042": {"title": "Recurrent Events Modeling Based on a Reflected Brownian Motion with Application to Hypoglycemia", "link": "https://arxiv.org/abs/2403.09042", "description": "arXiv:2403.09042v1 Announce Type: new \nAbstract: Patients with type 2 diabetes need to closely monitor blood sugar levels as their routine diabetes self-management. Although many treatment agents aim to tightly control blood sugar, hypoglycemia often stands as an adverse event. In practice, patients can observe hypoglycemic events more easily than hyperglycemic events due to the perception of neurogenic symptoms. We propose to model each patient's observed hypoglycemic event as a lower-boundary crossing event for a reflected Brownian motion with an upper reflection barrier. The lower-boundary is set by clinical standards. To capture patient heterogeneity and within-patient dependence, covariates and a patient level frailty are incorporated into the volatility and the upper reflection barrier. This framework provides quantification for the underlying glucose level variability, patient heterogeneity, and risk factors' impact on glucose. We make inferences based on a Bayesian framework using Markov chain Monte Carlo. Two model comparison criteria, the Deviance Information Criterion and the Logarithm of the Pseudo-Marginal Likelihood, are used for model selection. The methodology is validated in simulation studies. 
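To make the reflected-Brownian-motion construction above concrete, a minimal simulation sketch of a Brownian path reflected downward at an upper barrier (Skorokhod construction), recording the first crossing of a lower clinical boundary; parameter values are illustrative and this is not the authors' Bayesian model.

```python
import numpy as np

def reflected_bm_first_crossing(x0=0.5, upper=1.0, lower=0.0, sigma=0.3,
                                T=10.0, n_steps=10_000, seed=2):
    """Simulate X_t = x0 + sigma*W_t reflected (downward) at `upper`, and return
    the first time X_t crosses below `lower` (np.inf if it never does)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    w = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))])
    free = x0 + sigma * w                       # unreflected path
    # Skorokhod reflection at the upper barrier: subtract the running excess above it.
    excess = np.maximum.accumulate(np.maximum(free - upper, 0.0))
    x = free - excess
    below = np.nonzero(x <= lower)[0]
    return (t[below[0]] if below.size else np.inf), x

hit_time, path = reflected_bm_first_crossing()
print(f"first lower-boundary crossing at t = {hit_time:.3f}")
```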
In analyzing a dataset from the diabetic patients in the DURABLE trial, our model provides adequate fit, generates data similar to the observed data, and offers insights that could be missed by other models."}, "https://arxiv.org/abs/2403.09350": {"title": "A Bayes Factor Framework for Unified Parameter Estimation and Hypothesis Testing", "link": "https://arxiv.org/abs/2403.09350", "description": "arXiv:2403.09350v1 Announce Type: new \nAbstract: The Bayes factor, the data-based updating factor of the prior to posterior odds of two hypotheses, is a natural measure of statistical evidence for one hypothesis over the other. We show how Bayes factors can also be used for parameter estimation. The key idea is to consider the Bayes factor as a function of the parameter value under the null hypothesis. This 'Bayes factor function' is inverted to obtain point estimates ('maximum evidence estimates') and interval estimates ('support intervals'), similar to how P-value functions are inverted to obtain point estimates and confidence intervals. This provides data analysts with a unified inference framework as Bayes factors (for any tested parameter value), support intervals (at any level), and point estimates can be easily read off from a plot of the Bayes factor function. This approach shares similarities but is also distinct from conventional Bayesian and frequentist approaches: It uses the Bayesian evidence calculus, but without synthesizing data and prior, and it defines statistical evidence in terms of (integrated) likelihood ratios, but also includes a natural way for dealing with nuisance parameters. Applications to real-world examples illustrate how our framework is of practical value for making quantitative inferences."}, "https://arxiv.org/abs/2403.09503": {"title": "Shrinkage for Extreme Partial Least-Squares", "link": "https://arxiv.org/abs/2403.09503", "description": "arXiv:2403.09503v1 Announce Type: new \nAbstract: This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the Extreme Partial Least Squares (EPLS) method--an adaptation of the original Partial Least Squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises-Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation data study is conducted in order to assess the practical utility of our proposed method. This clearly demonstrates its effectiveness even in moderate data problems within high-dimensional settings. 
Furthermore, we provide an illustrative example of the method's applicability using French farm income data, highlighting its efficacy in real-world scenarios."}, "https://arxiv.org/abs/2403.09604": {"title": "Extremal graphical modeling with latent variables", "link": "https://arxiv.org/abs/2403.09604", "description": "arXiv:2403.09604v1 Announce Type: new \nAbstract: Extremal graphical models encode the conditional independence structure of multivariate extremes and provide a powerful tool for quantifying the risk of rare events. Prior work on learning these graphs from data has focused on the setting where all relevant variables are observed. For the popular class of H\\\"usler-Reiss models, we propose the \\texttt{eglatent} method, a tractable convex program for learning extremal graphical models in the presence of latent variables. Our approach decomposes the H\\\"usler-Reiss precision matrix into a sparse component encoding the graphical structure among the observed variables after conditioning on the latent variables, and a low-rank component encoding the effect of a few latent variables on the observed variables. We provide finite-sample guarantees of \\texttt{eglatent} and show that it consistently recovers the conditional graph as well as the number of latent variables. We highlight the improved performances of our approach on synthetic and real data."}, "https://arxiv.org/abs/2011.09437": {"title": "Trend and Variance Adaptive Bayesian Changepoint Analysis & Local Outlier Scoring", "link": "https://arxiv.org/abs/2011.09437", "description": "arXiv:2011.09437v4 Announce Type: replace \nAbstract: We adaptively estimate both changepoints and local outlier processes in a Bayesian dynamic linear model with global-local shrinkage priors in a novel model we call Adaptive Bayesian Changepoints with Outliers (ABCO). We utilize a state-space approach to identify a dynamic signal in the presence of outliers and measurement error with stochastic volatility. We find that global state equation parameters are inadequate for most real applications and we include local parameters to track noise at each time-step. This setup provides a flexible framework to detect unspecified changepoints in complex series, such as those with large interruptions in local trends, with robustness to outliers and heteroskedastic noise. Finally, we compare our algorithm against several alternatives to demonstrate its efficacy in diverse simulation scenarios and two empirical examples on the U.S. economy."}, "https://arxiv.org/abs/2208.07614": {"title": "Reweighting the RCT for generalization: finite sample error and variable selection", "link": "https://arxiv.org/abs/2208.07614", "description": "arXiv:2208.07614v5 Announce Type: replace \nAbstract: Randomized Controlled Trials (RCTs) may suffer from limited scope. In particular, samples may be unrepresentative: some RCTs over- or under- sample individuals with certain characteristics compared to the target population, for which one wants conclusions on treatment effectiveness. Re-weighting trial individuals to match the target population can improve the treatment effect estimation. In this work, we establish the exact expressions of the bias and variance of such reweighting procedures -- also called Inverse Propensity of Sampling Weighting (IPSW) -- in presence of categorical covariates for any sample size. Such results allow us to compare the theoretical performance of different versions of IPSW estimates. 
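A rough sketch of the IPSW reweighting idea discussed above, under simplifying assumptions (a logistic sampling-propensity model, made-up covariates and outcome); it is not the authors' finite-sample analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipsw_ate(X_rct, treat, y, X_target):
    """Reweight RCT units toward the target population (IPSW) and
    return a weighted difference-in-means estimate of the ATE."""
    # Stack covariates and label membership: 1 = RCT sample, 0 = target sample.
    X = np.vstack([X_rct, X_target])
    s = np.concatenate([np.ones(len(X_rct)), np.zeros(len(X_target))])
    ps = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X_rct)[:, 1]
    w = (1 - ps) / ps                        # odds of target membership vs trial membership
    w1, w0 = w[treat == 1], w[treat == 0]
    y1, y0 = y[treat == 1], y[treat == 0]
    return np.sum(w1 * y1) / np.sum(w1) - np.sum(w0 * y0) / np.sum(w0)

rng = np.random.default_rng(3)
X_t = rng.normal(1.0, 1.0, size=(2000, 2))               # target population covariates
X_r = rng.normal(0.0, 1.0, size=(300, 2))                 # unrepresentative RCT covariates
a = rng.integers(0, 2, size=300)                           # randomized treatment
y = 1.0 + a * (1.0 + X_r[:, 0]) + rng.normal(size=300)     # effect modified by the shifted covariate
print(f"IPSW ATE estimate: {ipsw_ate(X_r, a, y, X_t):.2f}")
```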
Besides, our results show how the performance (bias, variance, and quadratic risk) of IPSW estimates depends on the two sample sizes (RCT and target population). A by-product of our work is the proof of consistency of IPSW estimates. Results also reveal that IPSW performances are improved when the trial probability to be treated is estimated (rather than using its oracle counterpart). In addition, we study choice of variables: how including covariates that are not necessary for identifiability of the causal effect may impact the asymptotic variance. Including covariates that are shifted between the two samples but not treatment effect modifiers increases the variance while non-shifted but treatment effect modifiers do not. We illustrate all the takeaways in a didactic example, and on a semi-synthetic simulation inspired from critical care medicine."}, "https://arxiv.org/abs/2212.01832": {"title": "The flexible Gumbel distribution: A new model for inference about the mode", "link": "https://arxiv.org/abs/2212.01832", "description": "arXiv:2212.01832v2 Announce Type: replace \nAbstract: A new unimodal distribution family indexed by the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. A regression model concerning the mode of a response given covariates based on the proposed unimodal distribution can be easily formulated, which we apply to data from an application in criminology to reveal interesting data features that are obscured by outliers."}, "https://arxiv.org/abs/2306.15629": {"title": "A non-parametric approach to detect patterns in binary sequences", "link": "https://arxiv.org/abs/2306.15629", "description": "arXiv:2306.15629v2 Announce Type: replace \nAbstract: In many circumstances given an ordered sequence of one or more types of elements/symbols, the objective is to determine any existence of randomness in the occurrence of one of the elements, say the type 1 element. Such a method can be useful in determining the existence of any non-random pattern in the wins or losses of a player in a series of games played. Existing methods of tests based on the total number of runs or tests based on the length of the longest run (Mosteller (1941)) can be used for testing the null hypothesis of randomness in the entire sequence, and not a specific type of element. Additionally, the Runs Test tends to show results contradictory to the intuition visualised by the graphs of, say, win proportions over time, due to the method used in the computation of runs. 
This paper develops a test approach to address this problem by computing the gaps between two consecutive type 1 elements and thereafter, following the idea of \"pattern\" in occurrence and \"directional\" trend (increasing, decreasing or constant), employs the exact Binomial test, Kendall's Tau and the Siegel-Tukey test for the scale problem. Further modifications suggested by Jan Vegelius (1982) have been applied in the Siegel-Tukey test to adjust for tied ranks and achieve more accurate results. This approach is distribution-free and suitable for small sample sizes. Also, comparisons with the conventional runs test show the superiority of the proposed approach under the null hypothesis of randomness in the occurrence of type 1 elements."}, "https://arxiv.org/abs/2309.03067": {"title": "A modelling framework for detecting and leveraging node-level information in Bayesian network inference", "link": "https://arxiv.org/abs/2309.03067", "description": "arXiv:2309.03067v2 Announce Type: replace \nAbstract: Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modelling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximisation algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases."}, "https://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "https://arxiv.org/abs/2309.12833", "description": "arXiv:2309.12833v3 Announce Type: replace \nAbstract: Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control) and additive noise models are not suitable for applications in which the response is not measured on a continuous scale, but reflects categories or counts. 
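Returning to the gap-based test for binary sequences summarized above: one of its ingredients can be sketched as computing the gaps between consecutive type 1 elements and testing them for a directional trend with Kendall's tau. The exact binomial and tie-adjusted Siegel-Tukey steps are omitted, and the sequence below is invented for illustration.

```python
import numpy as np
from scipy.stats import kendalltau

def gap_trend_test(sequence, element=1):
    """Compute gaps between consecutive occurrences of `element` and test for a
    monotone trend over time via Kendall's tau (one ingredient of a gap-based
    analysis; the exact binomial and Siegel-Tukey steps are not shown)."""
    idx = np.flatnonzero(np.asarray(sequence) == element)
    gaps = np.diff(idx)
    tau, p_value = kendalltau(np.arange(len(gaps)), gaps)
    return gaps, tau, p_value

# A win/loss record where wins (1) become progressively sparser over time.
seq = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
gaps, tau, p = gap_trend_test(seq)
print(f"gaps: {gaps}, Kendall's tau = {tau:.2f}, p = {p:.3f}")
```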
Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, generally, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose TRAM-GCM based on the expected conditional covariance between environments and score residuals with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients."}, "https://arxiv.org/abs/2312.05345": {"title": "Spline-Based Multi-State Models for Analyzing Disease Progression", "link": "https://arxiv.org/abs/2312.05345", "description": "arXiv:2312.05345v2 Announce Type: replace \nAbstract: Motivated by disease progression-related studies, we propose an estimation method for fitting general non-homogeneous multi-state Markov models. The proposal can handle many types of multi-state processes, with several states and various combinations of observation schemes (e.g., intermittent, exactly observed, censored), and allows for the transition intensities to be flexibly modelled through additive (spline-based) predictors. The algorithm is based on a computationally efficient and stable penalized maximum likelihood estimation approach which exploits the information provided by the analytical Hessian matrix of the model log-likelihood. The proposed modeling framework is employed in case studies that aim at modeling the onset of cardiac allograft vasculopathy, and cognitive decline due to aging, where novel patterns are uncovered. To support applicability and reproducibility, all developed tools are implemented in the R package flexmsm."}, "https://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "https://arxiv.org/abs/2205.02274", "description": "arXiv:2205.02274v4 Announce Type: replace-cross \nAbstract: Marketplace companies rely heavily on experimentation when making changes to the design or operation of their platforms. The workhorse of experimentation is the randomized controlled trial (RCT), or A/B test, in which users are randomly assigned to treatment or control groups. However, marketplace interference causes the Stable Unit Treatment Value Assumption (SUTVA) to be violated, leading to bias in the standard RCT metric. In this work, we propose techniques for platforms to run standard RCTs and still obtain meaningful estimates despite the presence of marketplace interference. We specifically consider a generalized matching setting, in which the platform explicitly matches supply with demand via a linear programming algorithm. Our first proposal is for the platform to estimate the value of global treatment and global control via optimization. We prove that this approach is unbiased in the fluid limit. Our second proposal is to compare the average shadow price of the treatment and control groups rather than the total value accrued by each group. We prove that this technique corresponds to the correct first-order approximation (in a Taylor series sense) of the value function of interest even in a finite-size system. We then use this result to prove that, under reasonable assumptions, our estimator is less biased than the RCT estimator. 
At the heart of our result is the idea that it is relatively easy to model interference in matching-driven marketplaces since, in such markets, the platform mediates the spillover."}, "https://arxiv.org/abs/2303.12657": {"title": "Generalised Linear Mixed Model Specification, Analysis, Fitting, and Optimal Design in R with the glmmr Packages", "link": "https://arxiv.org/abs/2303.12657", "description": "arXiv:2303.12657v3 Announce Type: replace-cross \nAbstract: We describe the \\proglang{R} package \\pkg{glmmrBase} and an extension \\pkg{glmmrOptim}. \\pkg{glmmrBase} provides a flexible approach to specifying, fitting, and analysing generalised linear mixed models. We use an object-orientated class system within \\proglang{R} to provide methods for a wide range of covariance and mean functions, including specification of non-linear functions of data and parameters, relevant to multiple applications including cluster randomised trials, cohort studies, spatial and spatio-temporal modelling, and split-plot designs. The class generates relevant matrices and statistics and a wide range of methods including full likelihood estimation of generalised linear mixed models using stochastic Maximum Likelihood, Laplace approximation, power calculation, and access to relevant calculations. The class also includes Hamiltonian Monte Carlo simulation of random effects, sparse matrix methods, and other functionality to support efficient estimation. The \\pkg{glmmrOptim} package implements a set of algorithms to identify c-optimal experimental designs where observations are correlated and can be specified using the generalised linear mixed model classes. Several examples and comparisons to existing packages are provided to illustrate use of the packages."}, "https://arxiv.org/abs/2403.09726": {"title": "Inference for non-probability samples using the calibration approach for quantiles", "link": "https://arxiv.org/abs/2403.09726", "description": "arXiv:2403.09726v1 Announce Type: new \nAbstract: Non-probability survey samples are examples of data sources that have become increasingly popular in recent years, also in official statistics. However, statistical inference based on non-probability samples is much more difficult because they are biased and are not representative of the target population (Wu, 2022). In this paper we consider a method of joint calibration for totals (Deville & S\\\"arndal, 1992) and quantiles (Harms & Duchesne, 2006) and use the proposed approach to extend existing inference methods for non-probability samples, such as inverse probability weighting, mass imputation and doubly robust estimators. By including quantile information in the estimation process non-linear relationships between the target and auxiliary variables can be approximated the way it is done in step-wise (constant) regression. Our simulation study has demonstrated that the estimators in question are more robust against model mis-specification and, as a result, help to reduce bias and improve estimation efficiency. Variance estimation for our proposed approach is also discussed. We show that existing inference methods can be used and that the resulting confidence intervals are at nominal levels. Finally, we applied the proposed methods to estimate the share of vacancies aimed at Ukrainian workers in Poland using an integrated set of administrative and survey data about job vacancies. 
The proposed approaches have been implemented in two R packages (nonprobsvy and jointCalib), which were used to conduct the simulation and empirical study"}, "https://arxiv.org/abs/2403.09877": {"title": "Quantifying Distributional Input Uncertainty via Inflated Kolmogorov-Smirnov Confidence Band", "link": "https://arxiv.org/abs/2403.09877", "description": "arXiv:2403.09877v1 Announce Type: new \nAbstract: In stochastic simulation, input uncertainty refers to the propagation of the statistical noise in calibrating input models to impact output accuracy, in addition to the Monte Carlo simulation noise. The vast majority of the input uncertainty literature focuses on estimating target output quantities that are real-valued. However, outputs of simulation models are random and real-valued targets essentially serve only as summary statistics. To provide a more holistic assessment, we study the input uncertainty problem from a distributional view, namely we construct confidence bands for the entire output distribution function. Our approach utilizes a novel test statistic whose asymptotic consists of the supremum of the sum of a Brownian bridge and a suitable mean-zero Gaussian process, which generalizes the Kolmogorov-Smirnov statistic to account for input uncertainty. Regarding implementation, we also demonstrate how to use subsampling to efficiently estimate the covariance function of the Gaussian process, thereby leading to an implementable estimation of the quantile of the test statistic and a statistically valid confidence band. Numerical results demonstrate how our new confidence bands provide valid coverage for output distributions under input uncertainty that is not achievable by conventional approaches."}, "https://arxiv.org/abs/2403.09907": {"title": "Multi-Layer Kernel Machines: Fast and Optimal Nonparametric Regression with Uncertainty Quantification", "link": "https://arxiv.org/abs/2403.09907", "description": "arXiv:2403.09907v1 Announce Type: new \nAbstract: Kernel ridge regression (KRR) is widely used for nonparametric regression over reproducing kernel Hilbert spaces. It offers powerful modeling capabilities at the cost of significant computational costs, which typically require $O(n^3)$ computational time and $O(n^2)$ storage space, with the sample size n. We introduce a novel framework of multi-layer kernel machines that approximate KRR by employing a multi-layer structure and random features, and study how the optimal number of random features and layer sizes can be chosen while still preserving the minimax optimality of the approximate KRR estimate. For various classes of random features, including those corresponding to Gaussian and Matern kernels, we prove that multi-layer kernel machines can achieve $O(n^2\\log^2n)$ computational time and $O(n\\log^2n)$ storage space, and yield fast and minimax optimal approximations to the KRR estimate for nonparametric regression. Moreover, we construct uncertainty quantification for multi-layer kernel machines by using conformal prediction techniques with robust coverage properties. 
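As background for the random-feature approximation underlying the multi-layer kernel machines above, a single-layer sketch combining random Fourier features for a Gaussian kernel with a ridge solve; hyperparameters and data are illustrative, and the multi-layer construction and conformal intervals are not implemented here.

```python
import numpy as np

def rff_ridge(X_train, y_train, X_test, n_features=300, lengthscale=1.0,
              ridge=1e-2, seed=4):
    """Approximate Gaussian-kernel ridge regression with random Fourier features;
    a single-layer sketch, not the multi-layer machine."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def features(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    Z = features(X_train)
    # Ridge solve in feature space: O(n * D^2) work instead of the O(n^3) kernel solve.
    beta = np.linalg.solve(Z.T @ Z + ridge * np.eye(n_features), Z.T @ y_train)
    return features(X_test) @ beta

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rff_ridge(X, y, X_new).round(2))
```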
The analysis and theoretical predictions are supported by simulations and real data examples."}, "https://arxiv.org/abs/2403.09928": {"title": "Identification and estimation of mediational effects of longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2403.09928", "description": "arXiv:2403.09928v1 Announce Type: new \nAbstract: We demonstrate a comprehensive semiparametric approach to causal mediation analysis, addressing the complexities inherent in settings with longitudinal and continuous treatments, confounders, and mediators. Our methodology utilizes a nonparametric structural equation model and a cross-fitted sequential regression technique based on doubly robust pseudo-outcomes, yielding an efficient, asymptotically normal estimator without relying on restrictive parametric modeling assumptions. We are motivated by a recent scientific controversy regarding the effects of invasive mechanical ventilation (IMV) on the survival of COVID-19 patients, considering acute kidney injury (AKI) as a mediating factor. We highlight the possibility of \"inconsistent mediation,\" in which the direct and indirect effects of the exposure operate in opposite directions. We discuss the significance of mediation analysis for scientific understanding and its potential utility in treatment decisions."}, "https://arxiv.org/abs/2403.09956": {"title": "On the distribution of isometric log-ratio transformations under extra-multinomial count data", "link": "https://arxiv.org/abs/2403.09956", "description": "arXiv:2403.09956v1 Announce Type: new \nAbstract: Compositional data arise when count observations are normalised into proportions adding up to unity. To allow use of standard statistical methods, compositional proportions can be mapped from the simplex into the Euclidean space through the isometric log-ratio (ilr) transformation. When the counts follow a multinomial distribution with fixed class-specific probabilities, the distribution of the ensuing ilr coordinates has been shown to be asymptotically multivariate normal. We here derive an asymptotic normal approximation to the distribution of the ilr coordinates when the counts show overdispersion under the Dirichlet-multinomial mixture model. Using a simulation study, we then investigate the practical applicability of the approximation against the empirical distribution of the ilr coordinates under varying levels of extra-multinomial variation and the total count. The approximation works well, except with a small total count or high amount of overdispersion. These empirical results remain even under population-level heterogeneity in the total count. Our work is motivated by microbiome data, which often exhibit considerable extra-multinomial variation and are increasingly treated as compositional through scaling taxon-specific counts into proportions. We conclude that if the analysis of empirical data relies on normality of the ilr coordinates, it may be advisable to choose a taxonomic level where counts are less sparse so that the distribution of taxon-specific class probabilities remains unimodal."}, "https://arxiv.org/abs/2403.09984": {"title": "Repro Samples Method for High-dimensional Logistic Model", "link": "https://arxiv.org/abs/2403.09984", "description": "arXiv:2403.09984v1 Announce Type: new \nAbstract: This paper presents a novel method to make statistical inferences for both the model support and regression coefficients in a high-dimensional logistic regression model. 
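For the compositional-data abstract above, a minimal sketch of the isometric log-ratio (ilr) map using standard pivot coordinates; the counts are invented and this is generic background rather than the paper's Dirichlet-multinomial approximation.

```python
import numpy as np

def ilr(p):
    """Map a composition p (positive, summing to 1) from the D-simplex to R^(D-1)
    using standard pivot (Helmert-type) ilr coordinates."""
    p = np.asarray(p, dtype=float)
    D = p.size
    z = np.empty(D - 1)
    for i in range(1, D):
        gm = np.exp(np.mean(np.log(p[:i])))          # geometric mean of the first i parts
        z[i - 1] = np.sqrt(i / (i + 1)) * np.log(gm / p[i])
    return z

counts = np.array([30, 55, 10, 5])                    # e.g. taxon counts (no zeros)
proportions = counts / counts.sum()
print(ilr(proportions).round(3))
```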
Our method is based on the repro samples framework, in which we conduct statistical inference by generating artificial samples mimicking the actual data-generating process. The proposed method has two major advantages. Firstly, for model support, we introduce the first method for constructing model confidence set in a high-dimensional setting and the proposed method only requires a weak signal strength assumption. Secondly, in terms of regression coefficients, we establish confidence sets for any group of linear combinations of regression coefficients. Our simulation results demonstrate that the proposed method produces valid and small model confidence sets and achieves better coverage for regression coefficients than the state-of-the-art debiasing methods. Additionally, we analyze single-cell RNA-seq data on the immune response. Besides identifying genes previously proved as relevant in the literature, our method also discovers a significant gene that has not been studied before, revealing a potential new direction in understanding cellular immune response mechanisms."}, "https://arxiv.org/abs/2403.10034": {"title": "Inference for Heterogeneous Graphical Models using Doubly High-Dimensional Linear-Mixed Models", "link": "https://arxiv.org/abs/2403.10034", "description": "arXiv:2403.10034v1 Announce Type: new \nAbstract: Motivated by the problem of inferring the graph structure of functional connectivity networks from multi-level functional magnetic resonance imaging data, we develop a valid inference framework for high-dimensional graphical models that accounts for group-level heterogeneity. We introduce a neighborhood-based method to learn the graph structure and reframe the problem as that of inferring fixed effect parameters in a doubly high-dimensional linear mixed model. Specifically, we propose a LASSO-based estimator and a de-biased LASSO-based inference framework for the fixed effect parameters in the doubly high-dimensional linear mixed model, leveraging random matrix theory to deal with challenges induced by the identical fixed and random effect design matrices arising in our setting. Moreover, we introduce consistent estimators for the variance components to identify subject-specific edges in the inferred graph. To illustrate the generality of the proposed approach, we also adapt our method to account for serial correlation by learning heterogeneous graphs in the setting of a vector autoregressive model. We demonstrate the performance of the proposed framework using real data and benchmark simulation studies."}, "https://arxiv.org/abs/2403.10136": {"title": "Response Style Characterization for Repeated Measures Using the Visual Analogue Scale", "link": "https://arxiv.org/abs/2403.10136", "description": "arXiv:2403.10136v1 Announce Type: new \nAbstract: Self-report measures (e.g., Likert scales) are widely used to evaluate subjective health perceptions. Recently, the visual analog scale (VAS), a slider-based scale, has become popular owing to its ability to precisely and easily assess how people feel. These data can be influenced by the response style (RS), a user-dependent systematic tendency that occurs regardless of questionnaire instructions. Despite its importance, especially in between-individual analysis, little attention has been paid to handling the RS in the VAS (denoted as response profile (RP)), as it is mainly used for within-individual monitoring and is less affected by RP. 
However, VAS measurements often require repeated self-reports of the same questionnaire items, making it difficult to apply conventional methods on a Likert scale. In this study, we developed a novel RP characterization method for various types of repeatedly measured VAS data. This approach involves the modeling of RP as distributional parameters ${\\theta}$ through a mixture of RS-like distributions, and addressing the issue of unbalanced data through bootstrap sampling for treating repeated measures. We assessed the effectiveness of the proposed method using simulated pseudo-data and an actual dataset from an empirical study. The assessment of parameter recovery showed that our method accurately estimated the RP parameter ${\\theta}$, demonstrating its robustness. Moreover, applying our method to an actual VAS dataset revealed the presence of individual RP heterogeneity, even in repeated VAS measurements, similar to the findings of the Likert scale. Our proposed method enables RP heterogeneity-aware VAS data analysis, similar to Likert-scale data analysis."}, "https://arxiv.org/abs/2403.10165": {"title": "Finite mixture copulas for modeling dependence in longitudinal count data", "link": "https://arxiv.org/abs/2403.10165", "description": "arXiv:2403.10165v1 Announce Type: new \nAbstract: Dependence modeling of multivariate count data has been receiving a considerable attention in recent times. Multivariate elliptical copulas are typically preferred in statistical literature to analyze dependence between repeated measurements of longitudinal data since they allow for different choices of the correlation structure. But these copulas lack in flexibility to model dependence and inference is only feasible under parametric restrictions. In this article, we propose the use of finite mixture of elliptical copulas in order to capture complex and hidden temporal dependency of discrete longitudinal data. With guaranteed model identifiability, our approach permits to use different correlation matrices in each component of the mixture copula. We theoretically examine the dependence properties of finite mixture of copulas, before applying them for constructing regression models for count longitudinal data. The inference of the proposed class of models is based on composite likelihood approach and the finite sample performance of the parameter estimates are investigated through extensive simulation studies. For model validation, besides the standard techniques we extended the t-plot method to accommodate finite mixture of elliptical copulas. Finally, our models are applied to analyze the temporal dependency of two real world longitudinal data sets and shown to provide improvements if compared against standard elliptical copulas."}, "https://arxiv.org/abs/2403.10289": {"title": "Towards a power analysis for PLS-based methods", "link": "https://arxiv.org/abs/2403.10289", "description": "arXiv:2403.10289v1 Announce Type: new \nAbstract: In recent years, power analysis has become widely used in applied sciences, with the increasing importance of the replicability issue. When distribution-free methods, such as Partial Least Squares (PLS)-based approaches, are considered, formulating power analysis turns out to be challenging. In this study, we introduce the methodological framework of a new procedure for performing power analysis when PLS-based methods are used. 
Data are simulated by the Monte Carlo method, assuming the null hypothesis of no effect is false and exploiting the latent structure estimated by PLS in the pilot data. In this way, the complex correlation data structure is explicitly considered in power analysis and sample size estimation. The paper offers insights into selecting statistical tests for the power analysis procedure, comparing accuracy-based tests and those based on continuous parameters estimated by PLS. Simulated and real datasets are investigated to show how the method works in practice."}, "https://arxiv.org/abs/2403.10352": {"title": "Goodness-of-Fit for Conditional Distributions: An Approach Using Principal Component Analysis and Component Selection", "link": "https://arxiv.org/abs/2403.10352", "description": "arXiv:2403.10352v1 Announce Type: new \nAbstract: This paper introduces a novel goodness-of-fit test technique for parametric conditional distributions. The proposed tests are based on a residual marked empirical process, for which we develop a conditional Principal Component Analysis. The obtained components provide a basis for various types of new tests in addition to the omnibus one. Component tests based on each component serve as experts in detecting certain directions. Smooth tests that assemble a few components are also of great use in practice. To further improve testing efficiency, we introduce a component selection approach, aiming to identify the most contributory components. The finite sample performance of the proposed tests is illustrated through Monte Carlo experiments."}, "https://arxiv.org/abs/2403.10440": {"title": "Multivariate Bayesian models with flexible shared interactions for analyzing spatio-temporal patterns of rare cancers", "link": "https://arxiv.org/abs/2403.10440", "description": "arXiv:2403.10440v1 Announce Type: new \nAbstract: Rare cancers affect millions of people worldwide each year. However, estimating incidence or mortality rates associated with rare cancers presents important difficulties and poses new statistical methodological challenges. In this paper, we expand the collection of multivariate spatio-temporal models by introducing adaptable shared interactions to enable a comprehensive analysis of both incidence and cancer mortality in rare cancer cases. These models allow the modulation of spatio-temporal interactions between incidence and mortality, allowing for changes in their relationship over time. The new models have been implemented in INLA using r-generic constructions. We conduct a simulation study to evaluate the performance of the new spatio-temporal models in terms of sensitivity and specificity. Results show that multivariate spatio-temporal models with flexible shared interaction outperform conventional multivariate spatio-temporal models with independent interactions. We use these models to analyze incidence and mortality data for pancreatic cancer and leukaemia among males across 142 administrative healthcare districts of Great Britain over a span of nine biennial periods (2002-2019)."}, "https://arxiv.org/abs/2403.10514": {"title": "Multilevel functional distributional models with application to continuous glucose monitoring in diabetes clinical trials", "link": "https://arxiv.org/abs/2403.10514", "description": "arXiv:2403.10514v1 Announce Type: new \nAbstract: Continuous glucose monitoring (CGM) is a minimally invasive technology that allows continuous monitoring of an individual's blood glucose. 
We focus on a large clinical trial that collected CGM data every few minutes for 26 weeks and assumes that the basic observation unit is the distribution of CGM observations in a four-week interval. The resulting data structure is multilevel (because each individual has multiple months of data) and distributional (because the data for each four-week interval is represented as a distribution). The scientific goals are to: (1) identify and quantify the effects of factors that affect glycemic control in type 1 diabetes (T1D) patients; and (2) identify and characterize the patients who respond to treatment. To address these goals, we propose a new multilevel functional model that treats the CGM distributions as a response. Methods are motivated by and applied to data collected by The Juvenile Diabetes Research Foundation Continuous Glucose Monitoring Group. Reproducible code for the methods introduced here is available on GitHub."}, "https://arxiv.org/abs/2403.09667": {"title": "Should the choice of BOIN design parameter p", "link": "https://arxiv.org/abs/2403.09667", "description": "arXiv:2403.09667v1 Announce Type: cross \nAbstract: When the early stopping parameter n.earlystop is relatively small or the cohortsize value is not optimized via simulation, it may be better to use p.tox < 1.4 * target.DLT.rate, or try out different cohort sizes, or increase n.earlystop, whichever is both feasible and provides better operating characteristics. This is because if the cohortsize was not optimized via simulation, even when n.earlystop = 12, the BOIN escalation/de-escalation rules generated using p.tox = 1.4 * target.DLT.rate could be exactly the same as those calculated using p.tox > 3 * target.DLT.rate, which might not be acceptable for some pediatric trials targeting 10% DLT rate. The traditional 3+3 design stops the dose finding process when 3 patients have been treated at the current dose level, 0 DLT has been observed, and the next higher dose has already been eliminated. If additional 3 patients were required to be treated at the current dose in the situation described above, the corresponding boundary table could be generated using BOIN design with target DLT rates ranging from 18% to 29%, p.saf ranging from 8% to 26%, and p.tox ranging from 39% to 99%. To generate the boundary table of this 3+3 design variant, BOIN parameters also need to satisfy a set of conditions."}, "https://arxiv.org/abs/2403.09869": {"title": "Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors", "link": "https://arxiv.org/abs/2403.09869", "description": "arXiv:2403.09869v1 Announce Type: cross \nAbstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. 
Group-aware priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts."}, "https://arxiv.org/abs/2403.10239": {"title": "A Big Data Approach to Understand Sub-national Determinants of FDI in Africa", "link": "https://arxiv.org/abs/2403.10239", "description": "arXiv:2403.10239v1 Announce Type: cross \nAbstract: Various macroeconomic and institutional factors hinder FDI inflows, including corruption, trade openness, access to finance, and political instability. Existing research mostly focuses on country-level data, with limited exploration of firm-level data, especially in developing countries. Recognizing this gap, recent calls for research emphasize the need for qualitative data analysis to delve into FDI determinants, particularly at the regional level. This paper proposes a novel methodology, based on text mining and social network analysis, to extract information from more than 167,000 online news articles to quantify regional-level (sub-national) attributes affecting FDI ownership in African companies. Our analysis extends information on obstacles to industrial development as mapped by the World Bank Enterprise Surveys. Findings suggest that regional (sub-national) structural and institutional characteristics can play an important role in determining foreign ownership."}, "https://arxiv.org/abs/2403.10250": {"title": "Interpretable Machine Learning for Survival Analysis", "link": "https://arxiv.org/abs/2403.10250", "description": "arXiv:2403.10250v1 Announce Type: cross \nAbstract: With the spread and rapid advancement of black box machine learning models, the field of interpretable machine learning (IML) or explainable artificial intelligence (XAI) has become increasingly important over the last decade. This is particularly relevant for survival analysis, where the adoption of IML techniques promotes transparency, accountability and fairness in sensitive areas, such as clinical decision making processes, the development of targeted therapies, interventions or in other medical or healthcare related contexts. More specifically, explainability can uncover a survival model's potential biases and limitations and provide more mathematically sound ways to understand how and which features are influential for prediction or constitute risk factors. However, the lack of readily available IML methods may have deterred medical practitioners and policy makers in public health from leveraging the full potential of machine learning for predicting time-to-event data. We present a comprehensive review of the limited existing work on IML methods for survival analysis within the context of the general IML taxonomy. In addition, we formally detail how commonly used IML methods, such as individual conditional expectation (ICE), partial dependence plots (PDP), accumulated local effects (ALE), different feature importance measures or Friedman's H-interaction statistics, can be adapted to survival outcomes. 
An application of several IML methods to real data on under-5 year mortality of Ghanaian children from the Demographic and Health Surveys (DHS) Program serves as a tutorial or guide for researchers on how to utilize the techniques in practice to facilitate understanding of model decisions or predictions."}, "https://arxiv.org/abs/2312.17420": {"title": "Exact Consistency Tests for Gaussian Mixture Filters using Normalized Deviation Squared Statistics", "link": "https://arxiv.org/abs/2312.17420", "description": "arXiv:2312.17420v2 Announce Type: replace \nAbstract: We consider the problem of evaluating dynamic consistency in discrete time probabilistic filters that approximate stochastic system state densities with Gaussian mixtures. Dynamic consistency means that the estimated probability distributions correctly describe the actual uncertainties. As such, the problem of consistency testing naturally arises in applications with regards to estimator tuning and validation. However, due to the general complexity of the density functions involved, straightforward approaches for consistency testing of mixture-based estimators have remained challenging to define and implement. This paper derives a new exact result for Gaussian mixture consistency testing within the framework of normalized deviation squared (NDS) statistics. It is shown that NDS test statistics for generic multivariate Gaussian mixture models exactly follow mixtures of generalized chi-square distributions, for which efficient computational tools are available. The accuracy and utility of the resulting consistency tests are numerically demonstrated on static and dynamic mixture estimation examples."}, "https://arxiv.org/abs/2305.05998": {"title": "On the Time-Varying Structure of the Arbitrage Pricing Theory using the Japanese Sector Indices", "link": "https://arxiv.org/abs/2305.05998", "description": "arXiv:2305.05998v4 Announce Type: replace-cross \nAbstract: This paper is the first study to examine the time instability of the APT in the Japanese stock market. In particular, we measure how changes in each risk factor affect the stock risk premiums to investigate the validity of the APT over time, applying the rolling window method to Fama and MacBeth's (1973) two-step regression and Kamstra and Shi's (2023) generalized GRS test. We summarize our empirical results as follows: (1) the changes in monetary policy by major central banks greatly affect the validity of the APT in Japan, and (2) the time-varying estimates of the risk premiums for each factor are also unstable over time, and they are affected by the business cycle and economic crises. Therefore, we conclude that the validity of the APT as an appropriate model to explain the Japanese sector index is not stable over time."}, "https://arxiv.org/abs/2307.09552": {"title": "Self-Compatibility: Evaluating Causal Discovery without Ground Truth", "link": "https://arxiv.org/abs/2307.09552", "description": "arXiv:2307.09552v2 Announce Type: replace-cross \nAbstract: As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. 
Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection."}, "https://arxiv.org/abs/2309.09452": {"title": "Beyond expected values: Making environmental decisions using value of information analysis when measurement outcome matters", "link": "https://arxiv.org/abs/2309.09452", "description": "arXiv:2309.09452v2 Announce Type: replace-cross \nAbstract: In ecological and environmental contexts, management actions must sometimes be chosen urgently. Value of information (VoI) analysis provides a quantitative toolkit for projecting the improved management outcomes expected after making additional measurements. However, traditional VoI analysis reports metrics as expected values (i.e. risk-neutral). This can be problematic because expected values hide uncertainties in projections. The true value of a measurement will only be known after the measurement's outcome is known, leaving large uncertainty in the measurement's value before it is performed. As a result, the expected value metrics produced in traditional VoI analysis may not align with the priorities of a risk-averse decision-maker who wants to avoid low-value measurement outcomes. In the present work, we introduce four new VoI metrics that can address a decision-maker's risk-aversion to different measurement outcomes. We demonstrate the benefits of the new metrics with two ecological case studies for which traditional VoI analysis has been previously applied. Using the new metrics, we also demonstrate a clear mathematical link between the often-separated environmental decision-making disciplines of VoI and optimal design of experiments. This mathematical link has the potential to catalyse future collaborations between ecologists and statisticians to work together to quantitatively address environmental decision-making questions of fundamental importance. Overall, the introduced VoI metrics complement existing metrics to provide decision-makers with a comprehensive view of the value of, and risks associated with, a proposed monitoring or measurement activity. This is critical for improved environmental outcomes when decisions must be urgently made."}, "https://arxiv.org/abs/2311.10685": {"title": "High-Throughput Asset Pricing", "link": "https://arxiv.org/abs/2311.10685", "description": "arXiv:2311.10685v2 Announce Type: replace-cross \nAbstract: We use empirical Bayes (EB) to mine data on 140,000 long-short strategies constructed from accounting ratios, past returns, and ticker symbols. This \"high-throughput asset pricing\" produces out-of-sample performance comparable to strategies in top finance journals. But unlike the published strategies, the data-mined strategies are free of look-ahead bias. 
EB predicts that high returns are concentrated in accounting strategies, small stocks, and pre-2004 samples, consistent with limited attention theories. The intuition is seen in the cross-sectional distribution of t-stats, which is far from the null for equal-weighted accounting strategies. High-throughput methods provide a rigorous, unbiased method for documenting asset pricing facts."}, "https://arxiv.org/abs/2403.10680": {"title": "Spatio-temporal Occupancy Models with INLA", "link": "https://arxiv.org/abs/2403.10680", "description": "arXiv:2403.10680v1 Announce Type: new \nAbstract: Modern methods for quantifying and predicting species distribution play a crucial part in biodiversity conservation. Occupancy models are a popular choice for analyzing species occurrence data as they allow one to separate the observational error induced by imperfect detection from the sources of bias affecting the occupancy process. However, the spatial and temporal variation in occupancy not accounted for by environmental covariates is often ignored or modelled through simple spatial structures, as the computational costs of fitting explicit spatio-temporal models are too high. In this work, we demonstrate how INLA may be used to fit complex occupancy models and how the R-INLA package can provide a user-friendly interface to make such complex models available to users.\n We show how occupancy models, provided some simplification of the detection process, can be framed as latent Gaussian models and benefit from the powerful INLA machinery. A large selection of complex modelling features and random effect models have already been implemented in R-INLA. These become available for occupancy models, providing the user with an efficient and flexible toolbox.\n We illustrate how INLA provides a computationally efficient framework for developing and fitting complex occupancy models using two case studies. Through these, we show how different spatio-temporal models that include spatially varying trends, smooth terms, and spatio-temporal random effects can be fitted. At the cost of limiting the complexity of the detection model, INLA can incorporate a range of complex structures in the process.\n INLA-based occupancy models provide an alternative framework to fit complex spatio-temporal occupancy models. The need for new and more flexible computational approaches to fit such models makes INLA an attractive option for addressing complex ecological problems, and a promising area of research."}, "https://arxiv.org/abs/2403.10742": {"title": "Assessing Delayed Treatment Benefits of Immunotherapy Using Long-Term Average Hazard: A Novel Test/Estimation Approach", "link": "https://arxiv.org/abs/2403.10742", "description": "arXiv:2403.10742v1 Announce Type: new \nAbstract: Delayed treatment effects on time-to-event outcomes have often been observed in randomized controlled studies of cancer immunotherapies. In the case of delayed onset of treatment effect, the conventional test/estimation approach using the log-rank test for between-group comparison and Cox's hazard ratio to estimate the magnitude of treatment effect is not optimal, because the log-rank test is not the most powerful option, and the interpretation of the resulting hazard ratio is not obvious. Recently, alternative test/estimation approaches were proposed to address both the power issue and the interpretation problems of the conventional approach. 
One is a test/estimation approach based on long-term restricted mean survival time, and the other approach is based on average hazard with survival weight. This paper integrates these two ideas and proposes a novel test/estimation approach based on long-term average hazard (LT-AH) with survival weight. Numerical studies reveal specific scenarios where the proposed LT-AH method provides a higher power than the two alternative approaches. The proposed approach has test/estimation coherency and can provide robust estimates of the magnitude of treatment effect not dependent on study-specific censoring time distribution. Also, the proposed LT-AH approach can summarize the magnitude of the treatment effect in both absolute difference and relative terms using ``hazard'' (i.e., difference in LT-AH and ratio of LT-AH), meeting guideline recommendations and practical needs. This proposed approach can be a useful alternative to the traditional hazard-based test/estimation approach when delayed onset of survival benefit is expected."}, "https://arxiv.org/abs/2403.10878": {"title": "Cubature scheme for spatio-temporal Poisson point processes estimation", "link": "https://arxiv.org/abs/2403.10878", "description": "arXiv:2403.10878v1 Announce Type: new \nAbstract: This work presents the cubature scheme for the fitting of spatio-temporal Poisson point processes. The methodology is implemented in the R Core Team (2024) package stopp (D'Angelo and Adelfio, 2023), published on the Comprehensive R Archive Network (CRAN) and available from https://CRAN.R-project.org/package=stopp. Since the number of dummy points should be sufficient for an accurate estimate of the likelihood, numerical experiments are currently under development to give guidelines on this aspect."}, "https://arxiv.org/abs/2403.10907": {"title": "Macroeconomic Spillovers of Weather Shocks across U", "link": "https://arxiv.org/abs/2403.10907", "description": "arXiv:2403.10907v1 Announce Type: new \nAbstract: We estimate the short-run effects of severe weather shocks on local economic activity and assess cross-border spillovers operating through economic linkages between U.S. states. We measure weather shocks using a detailed county-level database on emergency declarations triggered by natural disasters and estimate their impacts with a monthly Global Vector Autoregressive (GVAR) model for the U.S. states. Impulse responses highlight significant country-wide macroeconomic effects of weather shocks hitting individual regions. We also show that (i) taking into account economic interconnections between states allows capturing much stronger spillover effects than those associated with mere spatial adjacency, (ii) geographical heterogeneity is critical for assessing country-wide effects of weather shocks, and (iii) network effects amplify the local impacts of these shocks."}, "https://arxiv.org/abs/2403.10945": {"title": "Zero-Inflated Stochastic Volatility Model for Disaggregated Inflation Data with Exact Zeros", "link": "https://arxiv.org/abs/2403.10945", "description": "arXiv:2403.10945v1 Announce Type: new \nAbstract: The disaggregated time-series data for Consumer Price Index often exhibits frequent instances of exact zero price changes, stemming from measurement errors inherent in the data collection process. However, the currently prominent stochastic volatility model of trend inflation is designed for aggregate measures of price inflation, where exact zero price changes rarely occur. 
We propose a zero-inflated stochastic volatility model applicable to such nonstationary real-valued multivariate time-series data with exact zeros, via a Bayesian dynamic generalized linear model that jointly specifies the dynamic zero-generating process. We also provide an efficient custom Gibbs sampler that leverages the P\\'olya-Gamma augmentation. Applying the model to disaggregated Japanese Consumer Price Index data, we find that the zero-inflated model provides more sensible and informative estimates of time-varying trend and volatility. Through an out-of-sample forecasting exercise, we find that the zero-inflated model provides improved point forecasts when zero-inflation is prominent, and better coverage of interval forecasts of the non-zero data by the non-zero distributional component."}, "https://arxiv.org/abs/2403.11003": {"title": "Extreme Treatment Effect: Extrapolating Causal Effects Into Extreme Treatment Domain", "link": "https://arxiv.org/abs/2403.11003", "description": "arXiv:2403.11003v1 Announce Type: new \nAbstract: The potential outcomes framework serves as a fundamental tool for quantifying causal effects. When the treatment variable (exposure) is continuous, one is typically interested in the estimation of the effect curve (also called the average dose-response function), denoted as \\(\\mu(t)\\). In this work, we explore the ``extreme causal effect,'' where our focus lies in determining the impact of an extreme level of treatment, potentially beyond the range of observed values--that is, estimating \\(\\mu(t)\\) for very large \\(t\\). Our framework is grounded in the field of statistics known as extreme value theory. We establish the foundation for our approach, outlining key assumptions that enable the estimation of the extremal causal effect. Additionally, we present a novel and consistent estimation procedure that utilizes extreme value theory in order to potentially reduce the dimension of the confounders to at most 3. In practical applications, our framework proves valuable when assessing the effects of scenarios such as drug overdoses, extreme river discharges, or extremely high temperatures on a variable of interest."}, "https://arxiv.org/abs/2403.11016": {"title": "Comprehensive OOS Evaluation of Predictive Algorithms with Statistical Decision Theory", "link": "https://arxiv.org/abs/2403.11016", "description": "arXiv:2403.11016v1 Announce Type: new \nAbstract: We argue that comprehensive out-of-sample (OOS) evaluation using statistical decision theory (SDT) should replace the current practice of K-fold and Common Task Framework validation in machine learning (ML) research. SDT provides a formal framework for performing comprehensive OOS evaluation across all possible (1) training samples, (2) populations that may generate training data, and (3) populations of prediction interest. Regarding feature (3), we emphasize that SDT requires the practitioner to directly confront the possibility that the future may not look like the past and to account for a possible need to extrapolate from one population to another when building a predictive algorithm. SDT is simple in abstraction, but it is often computationally demanding to implement. We discuss progress in tractable implementation of SDT when prediction accuracy is measured by mean square error or by misclassification rate. We summarize research studying settings in which the training data will be generated from a subpopulation of the population of prediction interest. 
We also consider conditional prediction with alternative restrictions on the state space of possible populations that may generate training data. We conclude by calling on ML researchers to join with econometricians and statisticians in expanding the domain within which implementation of SDT is tractable."}, "https://arxiv.org/abs/2403.11017": {"title": "Continuous-time mediation analysis for repeatedly measured mediators and outcomes", "link": "https://arxiv.org/abs/2403.11017", "description": "arXiv:2403.11017v1 Announce Type: new \nAbstract: Mediation analysis aims to decipher the underlying causal mechanisms between an exposure, an outcome, and intermediate variables called mediators. Initially developed for a fixed-time mediator and outcome, it has been extended to the framework of longitudinal data by discretizing the assessment times of mediator and outcome. Yet, processes in play in longitudinal studies are usually defined in continuous time and measured at irregular and subject-specific visits. This is the case in dementia research when cerebral and cognitive changes measured at planned visits in cohorts are of interest. We thus propose a methodology to estimate the causal mechanisms between a time-fixed exposure ($X$), a mediator process ($\\mathcal{M}_t$) and an outcome process ($\\mathcal{Y}_t$) both measured repeatedly over time in the presence of a time-dependent confounding process ($\\mathcal{L}_t$). We consider three types of causal estimands, the natural effects, path-specific effects and randomized interventional analogues to natural effects, and provide identifiability assumptions. We employ a dynamic multivariate model based on differential equations for their estimation. The performance of the methods is explored in simulations, and we illustrate the method in two real-world examples motivated by the 3C cerebral aging study to assess: (1) the effect of educational level on functional dependency through depressive symptomatology and cognitive functioning, and (2) the effect of a genetic factor on cognitive functioning potentially mediated by vascular brain lesions and confounded by neurodegeneration."}, "https://arxiv.org/abs/2403.11163": {"title": "A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques", "link": "https://arxiv.org/abs/2403.11163", "description": "arXiv:2403.11163v1 Announce Type: new \nAbstract: This paper presents a selective review of statistical computation methods for massive data analysis. A huge number of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset size is too large to be comfortably handled by a single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns the situation where the sample size of the dataset is small enough to be placed on a single computer but too large to be easily processed by its memory as a whole. 
The last class of literature studies minibatch gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models."}, "https://arxiv.org/abs/2403.11276": {"title": "Effects of model misspecification on small area estimators", "link": "https://arxiv.org/abs/2403.11276", "description": "arXiv:2403.11276v1 Announce Type: new \nAbstract: Nested error regression models are commonly used to incorporate observational unit-specific auxiliary variables to improve small area estimates. When the mean structure of this model is misspecified, there is generally an increase in the mean square prediction error (MSPE) of Empirical Best Linear Unbiased Predictors (EBLUP). The Observed Best Prediction (OBP) method has been proposed with the intent to improve on the MSPE over EBLUP. We conduct a Monte Carlo simulation experiment to understand the effect of misspecification of mean structures on different small area estimators. Our simulation results lead to the unexpected finding that OBP may perform very poorly when observational unit level auxiliary variables are used and that OBP can be improved significantly when population means of those auxiliary variables (area level auxiliary variables) are used in the nested error regression model or when a corresponding area level model is used. Our simulation also indicates that the MSPE of OBP is an increasing function of the difference between the sample and population means of the auxiliary variables."}, "https://arxiv.org/abs/2403.11309": {"title": "Nonparametric Identification and Estimation with Non-Classical Errors-in-Variables", "link": "https://arxiv.org/abs/2403.11309", "description": "arXiv:2403.11309v1 Announce Type: new \nAbstract: This paper considers nonparametric identification and estimation of the regression function when a covariate is mismeasured. The measurement error need not be classical. Employing the small measurement error approximation, we establish nonparametric identification under weak and easy-to-interpret conditions on the instrumental variable. The paper also provides nonparametric estimators of the regression function and derives their rates of convergence."}, "https://arxiv.org/abs/2403.11356": {"title": "Multiscale Quantile Regression with Local Error Control", "link": "https://arxiv.org/abs/2403.11356", "description": "arXiv:2403.11356v1 Announce Type: new \nAbstract: For robust and efficient detection of change points, we introduce a novel methodology MUSCLE (multiscale quantile segmentation controlling local error) that partitions serial data into multiple segments, each sharing a common quantile. It leverages multiple tests for quantile changes over different scales and locations, and variational estimation. Unlike the often adopted global error control, MUSCLE focuses on local errors defined on individual segments, significantly improving detection power in finding change points. Meanwhile, due to the built-in model complexity penalty, it enjoys the finite sample guarantee that its false discovery rate (or the expected proportion of falsely detected change points) is upper bounded by its unique tuning parameter. Further, we obtain the consistency and the localisation error rates in estimating change points, under mild signal-to-noise-ratio conditions. Both match (up to log factors) the minimax optimality results in the Gaussian setup. All theoretical results hold under the sole distributional assumption of serial independence. 
Incorporating the wavelet tree data structure, we develop an efficient dynamic programming algorithm for computing MUSCLE. Extensive simulations as well as real data applications in electrophysiology and geophysics demonstrate its competitiveness and effectiveness. An implementation via R package muscle is available from GitHub."}, "https://arxiv.org/abs/2403.11438": {"title": "Models of linkage error for capture-recapture estimation without clerical reviews", "link": "https://arxiv.org/abs/2403.11438", "description": "arXiv:2403.11438v1 Announce Type: new \nAbstract: The capture-recapture method can be applied to measure the coverage of administrative and big data sources, in official statistics. In its basic form, it involves the linkage of two sources while assuming a perfect linkage and other standard assumptions. In practice, linkage errors arise and are a potential source of bias, where the linkage is based on quasi-identifiers. These errors include false positives and false negatives, where the former arise when linking a pair of records from different units, and the latter arise when not linking a pair of records from the same unit. So far, the existing solutions have resorted to costly clerical reviews, or they have made the restrictive conditional independence assumption. In this work, these requirements are relaxed by modeling the number of links from a record instead. The same approach may be taken to estimate the linkage accuracy without clerical reviews, when linking two sources that each have some undercoverage."}, "https://arxiv.org/abs/2403.11562": {"title": "A Comparison of Joint Species Distribution Models for Percent Cover Data", "link": "https://arxiv.org/abs/2403.11562", "description": "arXiv:2403.11562v1 Announce Type: new \nAbstract: 1. Joint species distribution models (JSDMs) have gained considerable traction among ecologists over the past decade, due to their capacity to answer a wide range of questions at both the species- and the community-level. The family of generalized linear latent variable models in particular has proven popular for building JSDMs, being able to handle many response types including presence-absence data, biomass, overdispersed and/or zero-inflated counts.\n 2. We extend latent variable models to handle percent cover data, with vegetation, sessile invertebrate, and macroalgal cover data representing the prime examples of such data arising in community ecology.\n 3. Sparsity is a commonly encountered challenge with percent cover data. Responses are typically recorded as percentages covered per plot, though some species may be completely absent or present, i.e., have 0% or 100% cover respectively, rendering the use of beta distribution inadequate.\n 4. We propose two JSDMs suitable for percent cover data, namely a hurdle beta model and an ordered beta model. We compare the two proposed approaches to a beta distribution for shifted responses, transformed presence-absence data, and an ordinal model for percent cover classes. 
Results demonstrate the hurdle beta JSDM was generally the most accurate at retrieving the latent variables and predicting ecological percent cover data."}, "https://arxiv.org/abs/2403.11564": {"title": "Spatio-temporal point process intensity estimation using zero-deflated subsampling applied to a lightning strikes dataset in France", "link": "https://arxiv.org/abs/2403.11564", "description": "arXiv:2403.11564v1 Announce Type: new \nAbstract: Cloud-to-ground lightning strikes observed in a specific geographical domain over time can be naturally modeled by a spatio-temporal point process. Our focus lies in the parametric estimation of its intensity function, incorporating both spatial factors (such as altitude) and spatio-temporal covariates (such as field temperature, precipitation, etc.). The events are observed in France over a span of three years. Spatio-temporal covariates are observed with resolution $0.1^\\circ \\times 0.1^\\circ$ ($\\approx 100$km$^2$) and six-hour periods. This results in an extensive dataset, further characterized by a significant excess of zeroes (i.e., spatio-temporal cells with no observed events). We reexamine composite likelihood methods commonly employed for spatial point processes, especially in situations where covariates are piecewise constant. Additionally, we extend these methods to account for zero-deflated subsampling, a strategy involving dependent subsampling, with a focus on selecting more cells in regions where events are observed. A simulation study is conducted to illustrate these novel methodologies, followed by their application to the dataset of lightning strikes."}, "https://arxiv.org/abs/2403.11767": {"title": "Multiple testing in game-theoretic probability: pictures and questions", "link": "https://arxiv.org/abs/2403.11767", "description": "arXiv:2403.11767v1 Announce Type: new \nAbstract: The usual way of testing probability forecasts in game-theoretic probability is via construction of test martingales. The standard assumption is that all forecasts are output by the same forecaster. In this paper I will discuss possible extensions of this picture to testing probability forecasts output by several forecasters. This corresponds to multiple hypothesis testing in statistics. One interesting phenomenon is that even a slight relaxation of the requirement of family-wise validity leads to a very significant increase in the efficiency of testing procedures. The main goal of this paper is to report results of preliminary simulation studies and list some directions of further research."}, "https://arxiv.org/abs/2403.11954": {"title": "Robust Estimation and Inference in Categorical Data", "link": "https://arxiv.org/abs/2403.11954", "description": "arXiv:2403.11954v1 Announce Type: new \nAbstract: In empirical science, many variables of interest are categorical. Like any model, models for categorical responses can be misspecified, leading to possibly large biases in estimation. One particularly troublesome source of misspecification is inattentive responding in questionnaires, which is well-known to jeopardize the validity of structural equation models (SEMs) and other survey-based analyses. I propose a general estimator that is designed to be robust to misspecification of models for categorical responses. Unlike hitherto approaches, the estimator makes no assumption whatsoever on the degree, magnitude, or type of misspecification. 
The proposed estimator generalizes maximum likelihood estimation, is strongly consistent, asymptotically Gaussian, has the same time complexity as maximum likelihood, and can be applied to any model for categorical responses. In addition, I develop a novel test that tests whether a given response can be fitted well by the assumed model, which allows one to trace back possible sources of misspecification. I verify the attractive theoretical properties of the proposed methodology in Monte Carlo experiments, and demonstrate its practical usefulness in an empirical application on a SEM of personality traits, where I find compelling evidence for the presence of inattentive responding whose adverse effects the proposed estimator can withstand, unlike maximum likelihood."}, "https://arxiv.org/abs/2403.11983": {"title": "Proposal of a general framework to categorize continuous predictor variables", "link": "https://arxiv.org/abs/2403.11983", "description": "arXiv:2403.11983v1 Announce Type: new \nAbstract: The use of discretized variables in the development of prediction models is a common practice, in part because the decision-making process is more natural when it is based on rules created from segmented models. Although this practice is perhaps more common in medicine, it is extensible to any area of knowledge where a predictive model helps in decision-making. Therefore, providing researchers with a useful and valid categorization method could be a relevant issue when developing prediction models. In this paper, we propose a new general methodology that can be applied to categorize a predictor variable in any regression model where the response variable belongs to the exponential family distribution. Furthermore, it can be applied in any multivariate context, allowing to categorize more than one continuous covariate simultaneously. In addition, a computationally very efficient method is proposed to obtain the optimal number of categories, based on a pseudo-BIC proposal. Several simulation studies have been conducted in which the efficiency of the method with respect to both the location and the number of estimated cut-off points is shown. Finally, the categorization proposal has been applied to a real data set of 543 patients with chronic obstructive pulmonary disease from Galdakao Hospital's five outpatient respiratory clinics, who were followed up for 10 years. We applied the proposed methodology to jointly categorize the continuous variables six-minute walking test and forced expiratory volume in one second in a multiple Poisson generalized additive model for the response variable rate of the number of hospital admissions by years of follow-up. The location and number of cut-off points obtained were clinically validated as being in line with the categorizations used in the literature."}, "https://arxiv.org/abs/2401.01998": {"title": "A Corrected Score Function Framework for Modelling Circadian Gene Expression", "link": "https://arxiv.org/abs/2401.01998", "description": "arXiv:2401.01998v1 Announce Type: cross \nAbstract: Many biological processes display oscillatory behavior based on an approximately 24 hour internal timing system specific to each individual. One process of particular interest is gene expression, for which several circadian transcriptomic studies have identified associations between gene expression during a 24 hour period and an individual's health. 
A challenge with analyzing data from these studies is that each individual's internal timing system is offset relative to the 24 hour day-night cycle, where day-night cycle time is recorded for each collected sample. Laboratory procedures can accurately determine each individual's offset and determine the internal time of sample collection. However, these laboratory procedures are labor-intensive and expensive. In this paper, we propose a corrected score function framework to obtain a regression model of gene expression given internal time when the offset of each individual is too burdensome to determine. A feature of this framework is that it does not require the probability distribution generating offsets to be symmetric with a mean of zero. Simulation studies validate the use of this corrected score function framework for cosinor regression, which is prevalent in circadian transcriptomic studies. Illustrations with three real circadian transcriptomic data sets further demonstrate that the proposed framework consistently mitigates bias relative to using a score function that does not account for this offset."}, "https://arxiv.org/abs/2403.10567": {"title": "Uncertainty estimation in spatial interpolation of satellite precipitation with ensemble learning", "link": "https://arxiv.org/abs/2403.10567", "description": "arXiv:2403.10567v1 Announce Type: cross \nAbstract: Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond."}, "https://arxiv.org/abs/2403.10618": {"title": "Limits of Approximating the Median Treatment Effect", "link": "https://arxiv.org/abs/2403.10618", "description": "arXiv:2403.10618v1 Announce Type: cross \nAbstract: Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. 
In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\\mathbf{a}, \\mathbf{b}$, much of the prior work focused on estimating median$(\\mathbf{a}) -$ median$(\\mathbf{b})$, where median($\\mathbf x$) denotes the median value in the sorted ordering of all the values in vector $\\mathbf x$. It is known that estimating the difference of medians is easier than the desired estimand of median$(\\mathbf{a-b})$, called the Median Treatment Effect (MTE). The fundamental problem of causal inference -- for every individual $i$, we can only observe one of the potential outcome values, i.e., either the value $a_i$ or $b_i$, but not both -- makes estimating MTE particularly challenging. In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\\mathbf{a-b}$. Next, we identify a quantity called variability that exactly captures the complexity of MTE estimation. By drawing connections to instance-optimality studied in theoretical computer science, we show that every algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability. Finally, we provide a simple linear time algorithm for computing the variability exactly. Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values."}, "https://arxiv.org/abs/2403.10766": {"title": "ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference", "link": "https://arxiv.org/abs/2403.10766", "description": "arXiv:2403.10766v1 Announce Type: cross \nAbstract: Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, these solutions have mostly relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result -- a neural-network-based inference machine -- remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effect estimation in general. 
We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method."}, "https://arxiv.org/abs/2403.11332": {"title": "Graph Neural Network based Double Machine Learning Estimator of Network Causal Effects", "link": "https://arxiv.org/abs/2403.11332", "description": "arXiv:2403.11332v1 Announce Type: cross \nAbstract: Our paper addresses the challenge of inferring causal effects in social network data, characterized by complex interdependencies among individuals resulting in challenges such as non-independence of units, interference (where a unit's outcome is affected by neighbors' treatments), and introduction of additional confounding factors from neighboring units. We propose a novel methodology combining graph neural networks and double machine learning, enabling accurate and efficient estimation of direct and peer effects using a single observational social network. Our approach utilizes graph isomorphism networks in conjunction with double machine learning to effectively adjust for network confounders and consistently estimate the desired causal effects. We demonstrate that our estimator is both asymptotically normal and semiparametrically efficient. A comprehensive evaluation against four state-of-the-art baseline methods using three semi-synthetic social network datasets reveals our method's on-par or superior efficacy in precise causal effect estimation. Further, we illustrate the practical application of our method through a case study that investigates the impact of Self-Help Group participation on financial risk tolerance. The results indicate a significant positive direct effect, underscoring the potential of our approach in social network analysis. Additionally, we explore the effects of network sparsity on estimation performance."}, "https://arxiv.org/abs/2403.11333": {"title": "Identification of Information Structures in Bayesian Games", "link": "https://arxiv.org/abs/2403.11333", "description": "arXiv:2403.11333v1 Announce Type: cross \nAbstract: To what extent can an external observer observing an equilibrium action distribution in an incomplete information game infer the underlying information structure? We investigate this issue in a general linear-quadratic-Gaussian framework. A simple class of canonical information structures is offered and proves rich enough to rationalize any possible equilibrium action distribution that can arise under an arbitrary information structure. We show that the class is parsimonious in the sense that the relevant parameters can be uniquely pinned down by an observed equilibrium outcome, up to some qualifications. Our result implies, for example, that the accuracy of each agent's signal about the state is identified, as measured by how much observing the signal reduces the state variance. Moreover, we show that a canonical information structure characterizes the lower bound on the amount by which each agent's signal can reduce the state variance, across all observationally equivalent information structures. 
The lower bound is tight, for example, when the actual information structure is uni-dimensional, or when there are no strategic interactions among agents, but in general, there is a gap since agents' strategic motives confound their private information about fundamental and strategic uncertainty."}, "https://arxiv.org/abs/2403.11343": {"title": "Federated Transfer Learning with Differential Privacy", "link": "https://arxiv.org/abs/2403.11343", "description": "arXiv:2403.11343v1 Announce Type: cross \nAbstract: Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of \\textit{federated differential privacy}, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets."}, "https://arxiv.org/abs/1805.05606": {"title": "Nonparametric Bayesian volatility learning under microstructure noise", "link": "https://arxiv.org/abs/1805.05606", "description": "arXiv:1805.05606v2 Announce Type: replace \nAbstract: In this work, we study the problem of learning the volatility under market microstructure noise. Specifically, we consider noisy discrete time observations from a stochastic differential equation and develop a novel computational method to learn the diffusion coefficient of the equation. We take a nonparametric Bayesian approach, where we \\emph{a priori} model the volatility function as piecewise constant. Its prior is specified via the inverse Gamma Markov chain. Sampling from the posterior is accomplished by incorporating the Forward Filtering Backward Simulation algorithm in the Gibbs sampler. Good performance of the method is demonstrated on two representative synthetic data examples. We also apply the method on a EUR/USD exchange rate dataset. Finally we present a limit result on the prior distribution."}, "https://arxiv.org/abs/2203.06056": {"title": "Identifying Causal Effects using Instrumental Time Series: Nuisance IV and Correcting for the Past", "link": "https://arxiv.org/abs/2203.06056", "description": "arXiv:2203.06056v2 Announce Type: replace \nAbstract: Instrumental variable (IV) regression relies on instruments to infer causal effects from observational data with unobserved confounding. We consider IV regression in time series models, such as vector auto-regressive (VAR) processes. Direct applications of i.i.d. techniques are generally inconsistent as they do not correctly adjust for dependencies in the past. 
In this paper, we outline the difficulties that arise due to time structure and propose methodology for constructing identifying equations that can be used for consistent parametric estimation of causal effects in time series data. One method uses extra nuisance covariates to obtain identifiability (an idea that can be of interest even in the i.i.d. case). We further propose a graph marginalization framework that allows us to apply nuisance IV and other IV methods in a principled way to time series. Our methods make use of a version of the global Markov property, which we prove holds for VAR(p) processes. For VAR(1) processes, we prove identifiability conditions that relate to Jordan forms and are different from the well-known rank conditions in the i.i.d. case (they do not require as many instruments as covariates, for example). We provide methods, prove their consistency, and show how the inferred causal effect can be used for distribution generalization. Simulation experiments corroborate our theoretical results. We provide ready-to-use Python code."}, "https://arxiv.org/abs/2206.12113": {"title": "Sequential adaptive design for emulating costly computer codes", "link": "https://arxiv.org/abs/2206.12113", "description": "arXiv:2206.12113v3 Announce Type: replace \nAbstract: Gaussian processes (GPs) are generally regarded as the gold standard surrogate model for emulating computationally expensive computer-based simulators. However, the problem of training GPs as accurately as possible with a minimum number of model evaluations remains challenging. We address this problem by suggesting a novel adaptive sampling criterion called VIGF (variance of improvement for global fit). The improvement function at any point is a measure of the deviation of the GP emulator from the nearest observed model output. At each iteration of the proposed algorithm, a new run is performed at the point where VIGF is the largest. Then, the new sample is added to the design and the emulator is updated accordingly. A batch version of VIGF is also proposed, which can save the user time when parallel computing is available. Additionally, VIGF is extended to the multi-fidelity case where the expensive high-fidelity model is predicted with the assistance of a lower fidelity simulator. This is performed via hierarchical kriging. The applicability of our method is assessed on a range of test functions and its performance is compared with several sequential sampling strategies. The results suggest that our method has superior performance in predicting the benchmark functions in most cases."}, "https://arxiv.org/abs/2206.12525": {"title": "Causality for Complex Continuous-time Functional Longitudinal Studies", "link": "https://arxiv.org/abs/2206.12525", "description": "arXiv:2206.12525v3 Announce Type: replace \nAbstract: The paramount obstacle in longitudinal studies for causal inference is the complex \"treatment-confounder feedback.\" Traditional methodologies for elucidating causal effects in longitudinal analyses are primarily based on the assumption that time moves in specific intervals or that changes in treatment occur discretely. This conventional view confines treatment-confounder feedback to a limited, countable scope. The advent of real-time monitoring in modern medical research introduces functional longitudinal data with dynamically time-varying outcomes, treatments, and confounders, necessitating dealing with a potentially uncountably infinite treatment-confounder feedback. 
Thus, there is an urgent need for a more elaborate and refined theoretical framework to navigate these intricacies. Recently, Ying (2024) proposed a preliminary framework focusing on end-of-study outcomes and addressing causality in functional longitudinal data. Our paper expands significantly upon this foundation in four ways: First, we conduct a comprehensive review of existing literature, which not only fosters a deeper understanding of the underlying concepts but also illuminates the genesis of both Ying (2024)'s framework and ours. Second, we extend Ying (2024) to fully embrace a functional time-varying outcome process, incorporating right censoring and truncation by death, which are both significant and practical concerns. Third, we formalize previously informal propositions in Ying (2024), demonstrating how this framework broadens the existing frameworks in a nonparametric manner. Lastly, we delve into a detailed discussion on the interpretability and feasibility of our assumptions, and outline a strategy for future numerical studies."}, "https://arxiv.org/abs/2207.00530": {"title": "The Target Study: A Conceptual Model and Framework for Measuring Disparity", "link": "https://arxiv.org/abs/2207.00530", "description": "arXiv:2207.00530v2 Announce Type: replace \nAbstract: We present a conceptual model to measure disparity--the target study--where social groups may be similarly situated (i.e., balanced) on allowable covariates. Our model, based on a sampling design, does not intervene to assign social group membership or alter allowable covariates. To address non-random sample selection, we extend our model to generalize or transport disparity or to assess disparity after an intervention on eligibility-related variables that eliminates forms of collider-stratification. To avoid bias from differential timing of enrollment, we aggregate time-specific study results by balancing calendar time of enrollment across social groups. To provide a framework for emulating our model, we discuss study designs, data structures, and G-computation and weighting estimators. We compare our sampling-based model to prominent decomposition-based models used in healthcare and algorithmic fairness. We provide R code for all estimators and apply our methods to measure health system disparities in hypertension control using electronic medical records."}, "https://arxiv.org/abs/2210.06927": {"title": "Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models", "link": "https://arxiv.org/abs/2210.06927", "description": "arXiv:2210.06927v4 Announce Type: replace \nAbstract: Bayesian modeling provides a principled approach to quantifying uncertainty in model parameters and model structure and has seen a surge of applications in recent years. Within the context of a Bayesian workflow, we are concerned with model selection for the purpose of finding models that best explain the data, that is, help us understand the underlying data generating process. Since we rarely have access to the true process, all we are left with during real-world analyses is incomplete causal knowledge from sources outside of the current data and model predictions of said data. This leads to the important question of when the use of prediction as a proxy for explanation for the purpose of model selection is valid. We approach this question by means of large-scale simulations of Bayesian generalized linear models where we investigate various causal and statistical misspecifications. 
Our results indicate that the use of prediction as proxy for explanation is valid and safe only when the models under consideration are sufficiently consistent with the underlying causal structure of the true data generating process."}, "https://arxiv.org/abs/2211.01938": {"title": "A family of mixture models for beta valued DNA methylation data", "link": "https://arxiv.org/abs/2211.01938", "description": "arXiv:2211.01938v3 Announce Type: replace \nAbstract: As hypermethylation of promoter cytosine-guanine dinucleotide (CpG) islands has been shown to silence tumour suppressor genes, identifying differentially methylated CpG sites between different samples can assist in understanding disease. Differentially methylated CpG sites (DMCs) can be identified using moderated t-tests or nonparametric tests, but this typically requires the use of data transformations due to a lack of appropriate statistical methods able to adequately account for the bounded nature of DNA methylation data.\n We propose a family of beta mixture models (BMMs) which use a model-based approach to cluster CpG sites given their original beta-valued methylation data, with no need for transformations. The BMMs allow (i) objective inference of methylation state thresholds and (ii) identification of DMCs between different sample types. The BMMs employ different parameter constraints facilitating application to different study settings. Parameter estimation proceeds via an expectation-maximisation algorithm, with a novel approximation in the maximization step providing tractability and computational feasibility.\n Performance of BMMs is assessed through thorough simulation studies, and the BMMs are used to analyse a prostate cancer dataset. The BMMs objectively infer intuitive and biologically interpretable methylation state thresholds, and identify DMCs that are related to genes implicated in carcinogenesis and involved in cancer related pathways. An R package betaclust facilitates widespread use of BMMs."}, "https://arxiv.org/abs/2211.16059": {"title": "On Large-Scale Multiple Testing Over Networks: An Asymptotic Approach", "link": "https://arxiv.org/abs/2211.16059", "description": "arXiv:2211.16059v4 Announce Type: replace \nAbstract: This work concerns developing communication- and computation-efficient methods for large-scale multiple testing over networks, which is of interest to many practical applications. We take an asymptotic approach and propose two methods, proportion-matching and greedy aggregation, tailored to distributed settings. The proportion-matching method achieves the global BH performance yet only requires a one-shot communication of the (estimated) proportion of true null hypotheses as well as the number of p-values at each node. By focusing on the asymptotic optimal power, we go beyond the BH procedure by providing an explicit characterization of the asymptotic optimal solution. This leads to the greedy aggregation method that effectively approximates the optimal rejection regions at each node, while computation efficiency comes from the greedy-type approach naturally. Moreover, for both methods, we provide the rate of convergence for both the FDR and power. 
Extensive numerical results over a variety of challenging settings are provided to support our theoretical findings."}, "https://arxiv.org/abs/2212.01900": {"title": "Bayesian survival analysis with INLA", "link": "https://arxiv.org/abs/2212.01900", "description": "arXiv:2212.01900v3 Announce Type: replace \nAbstract: This tutorial shows how various Bayesian survival models can be fitted using the integrated nested Laplace approximation in a clear, legible, and comprehensible manner using the INLA and INLAjoint R-packages. Such models include accelerated failure time, proportional hazards, mixture cure, competing risks, multi-state, frailty, and joint models of longitudinal and survival data, originally presented in the article \"Bayesian survival analysis with BUGS\" (Alvares et al., 2021). In addition, we illustrate the implementation of a new joint model for a longitudinal semicontinuous marker, recurrent events, and a terminal event. Our proposal aims to provide the reader with syntax examples for implementing survival models using a fast and accurate approximate Bayesian inferential approach."}, "https://arxiv.org/abs/2302.13133": {"title": "Data-driven uncertainty quantification for constrained stochastic differential equations and application to solar photovoltaic power forecast data", "link": "https://arxiv.org/abs/2302.13133", "description": "arXiv:2302.13133v2 Announce Type: replace \nAbstract: In this work, we extend the data-driven It\\^{o} stochastic differential equation (SDE) framework for the pathwise assessment of short-term forecast errors to account for the time-dependent upper bound that naturally constrains the observable historical data and forecast. We propose a new nonlinear and time-inhomogeneous SDE model with a Jacobi-type diffusion term for the phenomenon of interest, simultaneously driven by the forecast and the constraining upper bound. We rigorously demonstrate the existence and uniqueness of a strong solution to the SDE model by imposing a condition for the time-varying mean-reversion parameter appearing in the drift term. The normalized forecast function is thresholded to keep such mean-reversion parameters bounded. The SDE model parameter calibration also covers the thresholding parameter of the normalized forecast by applying a novel iterative two-stage optimization procedure to user-selected approximations of the likelihood function. Another novel contribution is estimating the transition density of the forecast error process, not known analytically in a closed form, through a tailored kernel smoothing technique with the control variate method. We fit the model to the 2019 photovoltaic (PV) solar power daily production and forecast data in Uruguay, computing the daily maximum solar PV production estimation. Two statistical versions of the constrained SDE model are fit, with the beta and truncated normal distributions as proxies for the transition density. Empirical results include simulations of the normalized solar PV power production and pathwise confidence bands generated through an indirect inference method. 
An objective comparison of optimal parametric points associated with the two selected statistical approximations is provided by applying the innovative kernel density estimation technique of the transition function of the forecast error process."}, "https://arxiv.org/abs/2303.08528": {"title": "Translating predictive distributions into informative priors", "link": "https://arxiv.org/abs/2303.08528", "description": "arXiv:2303.08528v3 Announce Type: replace \nAbstract: When complex Bayesian models exhibit implausible behaviour, one solution is to assemble available information into an informative prior. Challenges arise as prior information is often only available for the observable quantity, or some model-derived marginal quantity, rather than directly pertaining to the natural parameters in our model. We propose a method for translating available prior information, in the form of an elicited distribution for the observable or model-derived marginal quantity, into an informative joint prior. Our approach proceeds given a parametric class of prior distributions with as yet undetermined hyperparameters, and minimises the difference between the supplied elicited distribution and corresponding prior predictive distribution. We employ a global, multi-stage Bayesian optimisation procedure to locate optimal values for the hyperparameters. Three examples illustrate our approach: a cure-fraction survival model, where censoring implies that the observable quantity is a priori a mixed discrete/continuous quantity; a setting in which prior information pertains to $R^{2}$ -- a model-derived quantity; and a nonlinear regression model."}, "https://arxiv.org/abs/2303.10215": {"title": "Statistical inference for association studies in the presence of binary outcome misclassification", "link": "https://arxiv.org/abs/2303.10215", "description": "arXiv:2303.10215v3 Announce Type: replace \nAbstract: In biomedical and public health association studies, binary outcome variables may be subject to misclassification, resulting in substantial bias in effect estimates. The feasibility of addressing binary outcome misclassification in regression models is often hindered by model identifiability issues. In this paper, we characterize the identifiability problems in this class of models as a specific case of ''label switching'' and leverage a pattern in the resulting parameter estimates to solve the permutation invariance of the complete data log-likelihood. Our proposed algorithm in binary outcome misclassification models does not require gold standard labels and relies only on the assumption that the sum of the sensitivity and specificity exceeds 1. A label switching correction is applied within estimation methods to recover unbiased effect estimates and to estimate misclassification rates. Open source software is provided to implement the proposed methods. We give a detailed simulation study for our proposed methodology and apply these methods to data from the 2020 Medical Expenditure Panel Survey (MEPS)."}, "https://arxiv.org/abs/2303.15158": {"title": "Discovering the Network Granger Causality in Large Vector Autoregressive Models", "link": "https://arxiv.org/abs/2303.15158", "description": "arXiv:2303.15158v2 Announce Type: replace \nAbstract: This paper proposes novel inferential procedures for discovering the network Granger causality in high-dimensional vector autoregressive models. In particular, we mainly offer two multiple testing procedures designed to control the false discovery rate (FDR). 
The first procedure is based on the limiting normal distribution of the $t$-statistics with the debiased lasso estimator. The second procedure is its bootstrap version. We also provide a robustification of the first procedure against any cross-sectional dependence using asymptotic e-variables. Their theoretical properties, including FDR control and power guarantee, are investigated. The finite sample evidence suggests that both procedures can successfully control the FDR while maintaining high power. Finally, the proposed methods are applied to discovering the network Granger causality in a large number of macroeconomic variables and regional house prices in the UK."}, "https://arxiv.org/abs/2306.14311": {"title": "Simple Estimation of Semiparametric Models with Measurement Errors", "link": "https://arxiv.org/abs/2306.14311", "description": "arXiv:2306.14311v2 Announce Type: replace \nAbstract: We develop a practical way of addressing the Errors-In-Variables (EIV) problem in the Generalized Method of Moments (GMM) framework. We focus on the settings in which the variability of the EIV is a fraction of that of the mismeasured variables, which is typical for empirical applications. For any initial set of moment conditions our approach provides a \"corrected\" set of moment conditions that are robust to the EIV. We show that the GMM estimator based on these moments is root-n-consistent, with the standard tests and confidence intervals providing valid inference. This is true even when the EIV are so large that naive estimators (that ignore the EIV problem) are heavily biased with their confidence intervals having 0% coverage. Our approach involves no nonparametric estimation, which is especially important for applications with many covariates, and settings with multivariate or non-classical EIV. In particular, the approach makes it easy to use instrumental variables to address EIV in nonlinear models."}, "https://arxiv.org/abs/2306.14862": {"title": "Marginal Effects for Probit and Tobit with Endogeneity", "link": "https://arxiv.org/abs/2306.14862", "description": "arXiv:2306.14862v2 Announce Type: replace \nAbstract: When evaluating partial effects, it is important to distinguish between structural endogeneity and measurement errors. In contrast to linear models, these two sources of endogeneity affect partial effects differently in nonlinear models. We study this issue focusing on the Instrumental Variable (IV) Probit and Tobit models. We show that even when a valid IV is available, failing to differentiate between the two types of endogeneity can lead to either under- or over-estimation of the partial effects. We develop simple estimators of the bounds on the partial effects and provide easy to implement confidence intervals that correctly account for both types of endogeneity. We illustrate the methods in a Monte Carlo simulation and an empirical application."}, "https://arxiv.org/abs/2308.05577": {"title": "Optimal Designs for Two-Stage Inference", "link": "https://arxiv.org/abs/2308.05577", "description": "arXiv:2308.05577v2 Announce Type: replace \nAbstract: The analysis of screening experiments is often done in two stages, starting with factor selection via an analysis under a main effects model. The success of this first stage is influenced by three components: (1) main effect estimators' variances and (2) bias, and (3) the estimate of the noise variance. Component (3) has only recently been given attention with design techniques that ensure an unbiased estimate of the noise variance. 
In this paper, we propose a design criterion based on expected confidence intervals of the first stage analysis that balances all three components. To address model misspecification, we propose a computationally efficient all-subsets analysis and a corresponding constrained design criterion based on lack-of-fit. Scenarios found in existing design literature are revisited with our criteria and new designs are provided that improve upon existing methods."}, "https://arxiv.org/abs/2310.08536": {"title": "Real-time Prediction of the Great Recession and the Covid-19 Recession", "link": "https://arxiv.org/abs/2310.08536", "description": "arXiv:2310.08536v4 Announce Type: replace \nAbstract: A series of standard and penalized logistic regression models is employed to model and forecast the Great Recession and the Covid-19 recession in the US. The empirical analysis explores the predictive content of numerous macroeconomic and financial indicators with respect to the NBER recession indicator. The predictive ability of the underlying models is evaluated using a set of statistical evaluation metrics. The recessions are scrutinized by closely examining the movement of the five most influential predictors that are chosen through automatic variable selections of the Lasso regression, along with the regression coefficients and the predicted recession probabilities. The results strongly support the application of penalized logistic regression models in the area of recession forecasting. Specifically, the analysis indicates that the mixed usage of different penalized logistic regression models over different forecast horizons largely outperforms standard logistic regression models in the prediction of the Great Recession in the US, as they achieve higher predictive accuracy across five different forecast horizons. The Great Recession is largely predictable, whereas the Covid-19 recession remains unpredictable, given that the Covid-19 pandemic is a real exogenous event. The empirical study reaffirms the traditional role of the term spread as one of the most important recession indicators. The results are validated by constructing, via PCA on a set of selected variables, a recession indicator that suffers less from publication lags and exhibits a very high association with the NBER recession indicator."}, "https://arxiv.org/abs/2310.17571": {"title": "Inside the black box: Neural network-based real-time prediction of US recessions", "link": "https://arxiv.org/abs/2310.17571", "description": "arXiv:2310.17571v2 Announce Type: replace \nAbstract: A standard feedforward neural network (FFN) and two specific types of recurrent neural networks, long short-term memory (LSTM) and gated recurrent unit (GRU), are used for modeling US recessions in the period from 1967 to 2021. The estimated models are then employed to conduct real-time predictions of the Great Recession and the Covid-19 recession in the US. Their predictive performances are compared to those of the traditional linear models, the standard logit model and the ridge logit model. The out-of-sample performance suggests the application of LSTM and GRU in the area of recession forecasting, especially for the long-term forecasting tasks. They outperform other types of models across five different forecast horizons with respect to a selected set of statistical metrics.
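As a rough, hypothetical sketch of the workflow described in the recession-forecasting abstract above (arXiv:2310.08536), the following Python snippet fits a lasso-penalized logistic regression to predict a recession indicator h steps ahead and reads off the most influential predictors; the data, horizon, and regularization strength are invented for illustration and are not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): lasso-penalized logistic regression
# for h-step-ahead recession classification, in the spirit of arXiv:2310.08536.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
T, k, h = 600, 40, 6                      # months, indicators, forecast horizon (assumed)
X = rng.normal(size=(T, k))               # stand-in macro/financial indicators
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=T) > 1.2).astype(int)  # stand-in NBER flag

# Align predictors at time t with the recession indicator at t + h.
X_h, y_h = X[:-h], y[h:]
split = int(0.8 * len(y_h))

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),  # lasso shrinkage
)
model.fit(X_h[:split], y_h[:split])
probs = model.predict_proba(X_h[split:])[:, 1]      # out-of-sample recession probabilities
coef = model.named_steps["logisticregression"].coef_.ravel()
top5 = np.argsort(np.abs(coef))[::-1][:5]           # five most influential predictors
print(top5, probs[:5])
```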
The Shapley additive explanations (SHAP) method is applied to GRU and the ridge logit model, the best performers in the neural network and linear model groups, respectively, to gain insight into the variable importance. The evaluation of variable importance differs between GRU and the ridge logit model, as reflected in their unequal variable orders determined by the SHAP values. These different weight assignments can be attributed to GRU's flexibility and capability to capture the business cycle asymmetries and nonlinearities. The SHAP method delivers some key recession indicators. For forecasting up to 3 months, the stock price index, real GDP, and private residential fixed investment show great short-term predictability, while for longer-term forecasting up to 12 months, the term spread and the producer price index have strong explanatory power for recessions. These findings are robust against other interpretation methods such as the local interpretable model-agnostic explanations (LIME) for GRU and the marginal effects for the ridge logit model."}, "https://arxiv.org/abs/2311.03829": {"title": "Multilevel mixtures of latent trait analyzers for clustering multi-layer bipartite networks", "link": "https://arxiv.org/abs/2311.03829", "description": "arXiv:2311.03829v2 Announce Type: replace \nAbstract: Within network data analysis, bipartite networks represent a particular type of network where relationships occur between two disjoint sets of nodes, formally called sending and receiving nodes. In this context, sending nodes may be organized into layers on the basis of some defined characteristics, resulting in a special case of a multilayer bipartite network, where each layer includes a specific set of sending nodes. To perform a clustering of sending nodes in a multi-layer bipartite network, we extend the Mixture of Latent Trait Analyzers (MLTA), also taking into account the influence of concomitant variables on clustering formation and the multi-layer structure of the data. To this aim, a multilevel approach offers a useful methodological tool to properly account for the hierarchical structure of the data and for the unobserved sources of heterogeneity at multiple levels. A simulation study is conducted to test the performance of the proposal in terms of parameters' and clustering recovery. Furthermore, the model is applied to the European Social Survey data (ESS) to i) perform a clustering of individuals (sending nodes) based on their digital skills (receiving nodes); ii) understand how socio-economic and demographic characteristics influence the individual digitalization level; iii) account for the multilevel structure of the data; iv) obtain a clustering of countries in terms of the baseline attitude to digital technologies of their residents."}, "https://arxiv.org/abs/2312.12741": {"title": "Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances", "link": "https://arxiv.org/abs/2312.12741", "description": "arXiv:2312.12741v2 Announce Type: replace-cross \nAbstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develops a lower bound for the probability of misidentifying the best arm.
They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity, and the gap between the expected rewards of two arms approaches zero (small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown."}, "https://arxiv.org/abs/2403.12243": {"title": "Time-Since-Infection Model for Hospitalization and Incidence Data", "link": "https://arxiv.org/abs/2403.12243", "description": "arXiv:2403.12243v1 Announce Type: new \nAbstract: The Time Since Infection (TSI) models, which use disease surveillance data to model infectious diseases, have become increasingly popular recently due to their flexibility and capacity to address complex disease control questions. However, a notable limitation of TSI models is their primary reliance on incidence data. Even when hospitalization data are available, existing TSI models have not been crafted to estimate disease transmission or predict disease-related hospitalizations - metrics crucial for understanding a pandemic and planning hospital resources. Moreover, their dependence on reported infection data makes them vulnerable to variations in data quality. In this study, we advance TSI models by integrating hospitalization data, marking a significant step forward in modeling with TSI models. Our improvements enable the estimation of key infectious disease parameters without relying on contact tracing data, reduce bias in incidence data, and provide a foundation to connect TSI models with other infectious disease models. We introduce hospitalization propensity parameters to jointly model incidence and hospitalization data. We use a composite likelihood function to accommodate complex data structure and an MCEM algorithm to estimate model parameters. We apply our method to COVID-19 data to estimate disease transmission, assess risk factor impacts, and calculate hospitalization propensity."}, "https://arxiv.org/abs/2403.12250": {"title": "Bayesian Optimization Sequential Surrogate (BOSS) Algorithm: Fast Bayesian Inference for a Broad Class of Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2403.12250", "description": "arXiv:2403.12250v1 Announce Type: new \nAbstract: Approximate Bayesian inference based on Laplace approximation and quadrature methods have become increasingly popular for their efficiency at fitting latent Gaussian models (LGM), which encompass popular models such as Bayesian generalized linear models, survival models, and spatio-temporal models. However, many useful models fall under the LGM framework only if some conditioning parameters are fixed, as the design matrix would vary with these parameters otherwise. 
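A minimal sketch of only the sampling rule described in the fixed-budget BAI abstract above (arXiv:2312.12741): draw each arm with probability proportional to its estimated standard deviation (Neyman allocation), re-estimating the variances as the experiment proceeds. The short forced-exploration phase, naive final recommendation, and all numbers are assumptions for illustration; the paper's AIPW recommendation rule is not reproduced.

```python
# Sketch of Neyman-allocation sampling with estimated standard deviations
# for a two-armed Gaussian bandit (recommendation rule simplified).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = np.array([0.0, 0.1]), np.array([1.0, 2.0])   # unknown to the learner
budget = 5000
rewards = [[], []]

# Short forced-exploration phase so both variances can be estimated.
for arm in (0, 1):
    for _ in range(20):
        rewards[arm].append(rng.normal(mu[arm], sigma[arm]))

for _ in range(budget - 40):
    s = np.array([np.std(rewards[0], ddof=1), np.std(rewards[1], ddof=1)])
    p1 = s[1] / (s[0] + s[1])              # estimated Neyman allocation ratio
    arm = int(rng.random() < p1)           # draw arm 1 with probability p1
    rewards[arm].append(rng.normal(mu[arm], sigma[arm]))

best_arm = int(np.mean(rewards[1]) > np.mean(rewards[0]))   # naive recommendation
print(best_arm, len(rewards[0]), len(rewards[1]))
```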
Such models are termed the conditional LGMs with examples in change-point detection, non-linear regression, etc. Existing methods for fitting conditional LGMs rely on grid search or Markov-chain Monte Carlo (MCMC); both require a large number of evaluations of the unnormalized posterior density of the conditioning parameters. As each evaluation of the density requires fitting a separate LGM, these methods become computationally prohibitive beyond simple scenarios. In this work, we introduce the Bayesian optimization sequential surrogate (BOSS) algorithm, which combines Bayesian optimization with approximate Bayesian inference methods to significantly reduce the computational resources required for fitting conditional LGMs. With orders of magnitude fewer evaluations compared to grid or MCMC methods, Bayesian optimization provides us with sequential design points that capture the majority of the posterior mass of the conditioning parameters, which subsequently yields an accurate surrogate posterior distribution that can be easily normalized. We illustrate the efficiency, accuracy, and practical utility of the proposed method through extensive simulation studies and real-world applications in epidemiology, environmental sciences, and astrophysics."}, "https://arxiv.org/abs/2403.12332": {"title": "A maximum penalised likelihood approach for semiparametric accelerated failure time models with time-varying covariates and partly interval censoring", "link": "https://arxiv.org/abs/2403.12332", "description": "arXiv:2403.12332v1 Announce Type: new \nAbstract: Accelerated failure time (AFT) models are frequently used for modelling survival data. This approach is attractive as it quantifies the direct relationship between the time until an event occurs and various covariates. It asserts that the failure times experience either acceleration or deceleration through a multiplicative factor when these covariates are present. While existing literature provides numerous methods for fitting AFT models with time-fixed covariates, adapting these approaches to scenarios involving both time-varying covariates and partly interval-censored data remains challenging. In this paper, we introduce a maximum penalised likelihood approach to fit a semiparametric AFT model. This method, designed for survival data with partly interval-censored failure times, accommodates both time-fixed and time-varying covariates. We utilise Gaussian basis functions to construct a smooth approximation of the nonparametric baseline hazard and fit the model via a constrained optimisation approach. To illustrate the effectiveness of our proposed method, we conduct a comprehensive simulation study. We also present an implementation of our approach on a randomised clinical trial dataset on advanced melanoma patients."}, "https://arxiv.org/abs/2403.12456": {"title": "Inflation Target at Risk: A Time-varying Parameter Distributional Regression", "link": "https://arxiv.org/abs/2403.12456", "description": "arXiv:2403.12456v1 Announce Type: new \nAbstract: Macro variables frequently display time-varying distributions, driven by the dynamic and evolving characteristics of economic, social, and environmental factors that consistently reshape the fundamental patterns and relationships governing these variables. 
To better understand the distributional dynamics beyond the central tendency, this paper introduces a novel semi-parametric approach for constructing time-varying conditional distributions, relying on the recent advances in distributional regression. We present an efficient precision-based Markov Chain Monte Carlo algorithm that simultaneously estimates all model parameters while explicitly enforcing the monotonicity condition on the conditional distribution function. Our model is applied to construct the forecasting distribution of inflation for the U.S., conditional on a set of macroeconomic and financial indicators. The risks of future inflation deviating excessively high or low from the desired range are carefully evaluated. Moreover, we provide a thorough discussion about the interplay between inflation and unemployment rates during the Global Financial Crisis, COVID, and the third quarter of 2023."}, "https://arxiv.org/abs/2403.12561": {"title": "A Bayesian multilevel hidden Markov model with Poisson-lognormal emissions for intense longitudinal count data", "link": "https://arxiv.org/abs/2403.12561", "description": "arXiv:2403.12561v1 Announce Type: new \nAbstract: Hidden Markov models (HMMs) are probabilistic methods in which observations are seen as realizations of a latent Markov process with discrete states that switch over time. Moving beyond standard statistical tests, HMMs offer a statistical environment to optimally exploit the information present in multivariate time series, uncovering the latent dynamics that rule them. Here, we extend the Poisson HMM to the multilevel framework, accommodating variability between individuals with continuously distributed individual random effects following a lognormal distribution, and describe how to estimate the model in a fully parametric Bayesian framework. The proposed multilevel HMM enables probabilistic decoding of hidden state sequences from multivariate count time-series based on individual-specific parameters, and offers a framework to formally quantify between-individual variability. Through a Monte Carlo study we show that the multilevel HMM outperforms the HMM for scenarios involving heterogeneity between individuals, demonstrating improved decoding accuracy and estimation performance of parameters of the emission distribution, and performs equally well when no between-individual heterogeneity is present. Finally, we illustrate how to use our model to explore the latent dynamics governing complex multivariate count data in an empirical application concerning pilot whale diving behaviour in the wild, and how to identify neural states from multi-electrode recordings of motor neural cortex activity in a macaque monkey in an experimental setup. We make the multilevel HMM introduced in this study publicly available in the R-package mHMMbayes in CRAN."}, "https://arxiv.org/abs/2403.12624": {"title": "Large-scale metric objects filtering for binary classification with application to abnormal brain connectivity detection", "link": "https://arxiv.org/abs/2403.12624", "description": "arXiv:2403.12624v1 Announce Type: new \nAbstract: The classification of random objects within metric spaces without a vector structure has attracted increasing attention. However, the complexity inherent in such non-Euclidean data often restricts existing models to handling only a limited number of features, leaving a gap in real-world applications.
To address this, we propose a data-adaptive filtering procedure to identify informative features from large-scale random objects, leveraging a novel Kolmogorov-Smirnov-type statistic defined on the metric space. Our method, applicable to data in general metric spaces with binary labels, exhibits remarkable flexibility. It enjoys a model-free property, as its implementation does not rely on any specified classifier. Theoretically, it controls the false discovery rate while guaranteeing the sure screening property. Empirically, equipped with a Wasserstein metric, it demonstrates superior sample performance compared to Euclidean competitors. When applied to analyze a dataset on autism, our method identifies significant brain regions associated with the condition. Moreover, it reveals distinct interaction patterns among these regions between individuals with and without autism, achieved by filtering hundreds of thousands of covariance matrices representing various brain connectivities."}, "https://arxiv.org/abs/2403.12653": {"title": "Composite likelihood estimation of stationary Gaussian processes with a view toward stochastic volatility", "link": "https://arxiv.org/abs/2403.12653", "description": "arXiv:2403.12653v1 Announce Type: new \nAbstract: We develop a framework for composite likelihood inference of parametric continuous-time stationary Gaussian processes. We derive the asymptotic theory of the associated maximum composite likelihood estimator. We implement our approach on a pair of models that has been proposed to describe the random log-spot variance of financial asset returns. A simulation study shows that it delivers good performance in these settings and improves upon a method-of-moments estimation. In an application, we inspect the dynamic of an intraday measure of spot variance computed with high-frequency data from the cryptocurrency market. The empirical evidence supports a mechanism, where the short- and long-term correlation structure of stochastic volatility are decoupled in order to capture its properties at different time scales."}, "https://arxiv.org/abs/2403.12677": {"title": "Causal Change Point Detection and Localization", "link": "https://arxiv.org/abs/2403.12677", "description": "arXiv:2403.12677v1 Announce Type: new \nAbstract: Detecting and localizing change points in sequential data is of interest in many areas of application. Various notions of change points have been proposed, such as changes in mean, variance, or the linear regression coefficient. In this work, we consider settings in which a response variable $Y$ and a set of covariates $X=(X^1,\\ldots,X^{d+1})$ are observed over time and aim to find changes in the causal mechanism generating $Y$ from $X$. More specifically, we assume $Y$ depends linearly on a subset of the covariates and aim to determine at what time points either the dependency on the subset or the subset itself changes. We call these time points causal change points (CCPs) and show that they form a subset of the commonly studied regression change points. We propose general methodology to both detect and localize CCPs. Although motivated by causality, we define CCPs without referencing an underlying causal model. The proposed definition of CCPs exploits a notion of invariance, which is a purely observational quantity but -- under additional assumptions -- has a causal meaning. For CCP localization, we propose a loss function that can be combined with existing multiple change point algorithms to localize multiple CCPs efficiently. 
We evaluate and illustrate our methods on simulated datasets."}, "https://arxiv.org/abs/2403.12711": {"title": "Tests for categorical data beyond Pearson: A distance covariance and energy distance approach", "link": "https://arxiv.org/abs/2403.12711", "description": "arXiv:2403.12711v1 Announce Type: new \nAbstract: Categorical variables are of uttermost importance in biomedical research. When two of them are considered, it is often the case that one wants to test whether or not they are statistically dependent. We show weaknesses of classical methods -- such as Pearson's and the G-test -- and we propose testing strategies based on distances that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to the testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how we can calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice."}, "https://arxiv.org/abs/2403.12714": {"title": "On the use of the cumulant generating function for inference on time series", "link": "https://arxiv.org/abs/2403.12714", "description": "arXiv:2403.12714v1 Announce Type: new \nAbstract: We introduce innovative inference procedures for analyzing time series data. Our methodology enables density approximation and composite hypothesis testing based on Whittle's estimator, a widely applied M-estimator in the frequency domain. Its core feature involves the (general Legendre transform of the) cumulant generating function of the Whittle likelihood score, as obtained using an approximated distribution of the periodogram ordinates. We present a testing algorithm that significantly expands the applicability of the state-of-the-art saddlepoint test, while maintaining the numerical accuracy of the saddlepoint approximation. Additionally, we demonstrate connections between our findings and three other prevalent frequency domain approaches: the bootstrap, empirical likelihood, and exponential tilting. Numerical examples using both simulated and real data illustrate the advantages and accuracy of our methodology."}, "https://arxiv.org/abs/2403.12757": {"title": "On Equivalence of Likelihood-Based Confidence Bands for Fatigue-Life and Fatigue-Strength Distributions", "link": "https://arxiv.org/abs/2403.12757", "description": "arXiv:2403.12757v1 Announce Type: new \nAbstract: Fatigue data arise in many research and applied areas and there have been statistical methods developed to model and analyze such data. The distributions of fatigue life and fatigue strength are often of interest to engineers designing products that might fail due to fatigue from cyclic-stress loading. Based on a specified statistical model and the maximum likelihood method, the cumulative distribution function (cdf) and quantile function (qf) can be estimated for the fatigue-life and fatigue-strength distributions. Likelihood-based confidence bands then can be obtained for the cdf and qf. 
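As a hedged illustration of the distance-covariance approach to testing independence of two categorical variables described above (arXiv:2403.12711): the sketch below uses the discrete metric d(a, b) = 1{a != b} and a plain permutation calibration, both of which are my assumptions for illustration; the paper itself develops a calibration that avoids resampling.

```python
# Distance-covariance independence test for two categorical variables
# (discrete metric and permutation calibration assumed for illustration).
import numpy as np

def _centered_dist(labels):
    d = (labels[:, None] != labels[None, :]).astype(float)   # pairwise 0/1 distances
    return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()

def dcov2(x, y):
    A, B = _centered_dist(x), _centered_dist(y)
    return (A * B).mean()            # squared sample distance covariance

def perm_test(x, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    stat = dcov2(x, y)
    null = [dcov2(x, rng.permutation(y)) for _ in range(n_perm)]
    return stat, (1 + sum(s >= stat for s in null)) / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.integers(0, 3, size=200)
y = (x + rng.integers(0, 2, size=200)) % 3    # categorical variable dependent on x
print(perm_test(x, y))                        # (statistic, p-value)
```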
This paper provides equivalence results for confidence bands for fatigue-life and fatigue-strength models. These results are useful for data analysis and computing implementation. We show (a) the equivalence of the confidence bands for the fatigue-life cdf and the fatigue-life qf, (b) the equivalence of confidence bands for the fatigue-strength cdf and the fatigue-strength qf, and (c) the equivalence of confidence bands for the fatigue-life qf and the fatigue-strength qf. Then we illustrate the usefulness of those equivalence results with two examples using experimental fatigue data."}, "https://arxiv.org/abs/2403.12759": {"title": "Robust Numerical Methods for Nonlinear Regression", "link": "https://arxiv.org/abs/2403.12759", "description": "arXiv:2403.12759v1 Announce Type: new \nAbstract: Many scientific and engineering applications require fitting regression models that are nonlinear in the parameters. Advances in computer hardware and software in recent decades have made it easier to fit such models. Relative to fitting regression models that are linear in the parameters, however, fitting nonlinear regression models is more complicated. In particular, software like the $\\texttt{nls}$ R function requires care in how the model is parameterized and how initial values are chosen for the maximum likelihood iterations. Often special diagnostics are needed to detect and suggest approaches for dealing with identifiability problems that can arise with such model fitting. When using Bayesian inference, there is the added complication of having to specify (often noninformative or weakly informative) prior distributions. Generally, the details for these tasks must be determined for each new nonlinear regression model. This paper provides a step-by-step procedure for specifying these details for any appropriate nonlinear regression model. Following the procedure will result in a numerically robust algorithm for fitting the nonlinear regression model. We illustrate the methods with three different nonlinear models that are used in the analysis of experimental fatigue data and we include two detailed numerical examples."}, "https://arxiv.org/abs/2403.12789": {"title": "Bivariate temporal dependence via mixtures of rotated copulas", "link": "https://arxiv.org/abs/2403.12789", "description": "arXiv:2403.12789v1 Announce Type: new \nAbstract: Parametric bivariate copula families are known to flexibly capture various dependence patterns, e.g., positive or negative dependence in the lower or upper tails of bivariate distributions. However, to the best of our knowledge, there is not a single parametric model adaptable enough to capture several of these features simultaneously. To address this, we propose a mixture of 4-way rotations of a parametric copula that is able to capture all these features. We illustrate the construction using the Clayton family but the concept is general and can be applied to other families. In order to include dynamic dependence regimes, the approach is extended to a time-dependent sequence of mixture copulas in which the mixture probabilities are allowed to evolve in time via a moving average type of relationship.
The properties of the proposed model and its performance are examined using simulated and real data sets."}, "https://arxiv.org/abs/2403.12815": {"title": "A Unified Framework for Rerandomization using Quadratic Forms", "link": "https://arxiv.org/abs/2403.12815", "description": "arXiv:2403.12815v1 Announce Type: new \nAbstract: In the design stage of a randomized experiment, one way to ensure treatment and control groups exhibit similar covariate distributions is to randomize treatment until some prespecified level of covariate balance is satisfied. This experimental design strategy is known as rerandomization. Most rerandomization methods utilize balance metrics based on a quadratic form $v^TAv$, where $v$ is a vector of covariate mean differences and $A$ is a positive semi-definite matrix. In this work, we derive general results for treatment-versus-control rerandomization schemes that employ quadratic forms for covariate balance. In addition to allowing researchers to quickly derive properties of rerandomization schemes not previously considered, our theoretical results provide guidance on how to choose the matrix $A$ in practice. We find the Mahalanobis and Euclidean distances optimize different measures of covariate balance. Furthermore, we establish how the covariates' eigenstructure and their relationship to the outcomes dictate which matrix $A$ yields the most precise mean-difference estimator for the average treatment effect. We find that the Euclidean distance is minimax optimal, in the sense that the mean-difference estimator's precision is never too far from the optimal choice, regardless of the relationship between covariates and outcomes. Our theoretical results are verified via simulation, where we find that rerandomization using the Euclidean distance has better performance in high-dimensional settings and typically achieves greater variance reduction to the mean-difference estimator than other quadratic forms."}, "https://arxiv.org/abs/2403.12822": {"title": "FORM-based global reliability sensitivity analysis of systems with multiple failure modes", "link": "https://arxiv.org/abs/2403.12822", "description": "arXiv:2403.12822v1 Announce Type: new \nAbstract: Global variance-based reliability sensitivity indices arise from a variance decomposition of the indicator function describing the failure event. The first-order indices reflect the main effect of each variable on the variance of the failure event and can be used for variable prioritization; the total-effect indices represent the total effect of each variable, including its interaction with other variables, and can be used for variable fixing. This contribution derives expressions for the variance-based reliability indices of systems with multiple failure modes that are based on the first-order reliability method (FORM). The derived expressions are a function of the FORM results and, hence, do not require additional expensive model evaluations. They do involve the evaluation of multinormal integrals, for which effective solutions are available.
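A rough sketch of the quadratic-form balance criterion used in the rerandomization abstract above (arXiv:2403.12815): a candidate treatment assignment is accepted only when $v^TAv$ falls below a threshold, with $A$ set to the identity (Euclidean) or the inverse sample covariance (Mahalanobis). The data, threshold, and acceptance loop are invented for illustration and are not the paper's implementation.

```python
# Rerandomization with a quadratic-form balance metric v^T A v on the
# vector of covariate mean differences (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))                      # covariates
A_euclid = np.eye(d)                             # Euclidean criterion
A_mahal = np.linalg.inv(np.cov(X, rowvar=False)) # Mahalanobis criterion

def balance(assign, A):
    v = X[assign == 1].mean(0) - X[assign == 0].mean(0)
    return v @ A @ v

def rerandomize(A, threshold, max_tries=10_000):
    for _ in range(max_tries):
        assign = rng.permutation(np.repeat([0, 1], n // 2))
        if balance(assign, A) <= threshold:
            return assign                        # accept this allocation
    raise RuntimeError("no acceptable randomization found")

treat = rerandomize(A_mahal, threshold=0.05)     # threshold chosen arbitrarily
print(balance(treat, A_mahal), balance(treat, A_euclid))
```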
We demonstrate that the derived expressions enable an accurate estimation of variance-based reliability sensitivities for general system problems to which FORM is applicable."}, "https://arxiv.org/abs/2403.12880": {"title": "Clustered Mallows Model", "link": "https://arxiv.org/abs/2403.12880", "description": "arXiv:2403.12880v1 Announce Type: new \nAbstract: Rankings are a type of preference elicitation that arise in experiments where assessors arrange items, for example, in decreasing order of utility. Orderings of n items labelled {1,...,n} are permutations that reflect strict preferences. For a number of reasons, strict preferences can be unrealistic assumptions for real data. For example, when items share common traits it may be reasonable to attribute them equal ranks. Also, there can be different importance attributions to decisions that form the ranking. In a situation with, for example, a large number of items, an assessor may wish to rank a certain number of items at the top, rank other items at the bottom, and express indifference to all others. In addition, when aggregating opinions, a judging body might be decisive about some parts of the rank but ambiguous for others. In this paper we extend the well-known Mallows (Mallows, 1957) model (MM) to accommodate item indifference, a phenomenon that can arise for a variety of reasons, such as those mentioned above. The underlying grouping of similar items motivates the proposed Clustered Mallows Model (CMM). The CMM can be interpreted as a Mallows distribution for tied ranks where ties are learned from the data. The CMM provides the flexibility to combine strict and indifferent relations, achieving a simpler and more robust representation of rank collections in the form of ordered clusters. Bayesian inference for the CMM is in the class of doubly-intractable problems since the model's normalisation constant is not available in closed form. We overcome this challenge by sampling from the posterior with a version of the exchange algorithm (Murray et al., 2006). Real data analysis of food preferences and results of Formula 1 races are presented, illustrating the CMM in practical situations."}, "https://arxiv.org/abs/2403.12908": {"title": "Regularised Spectral Estimation for High Dimensional Point Processes", "link": "https://arxiv.org/abs/2403.12908", "description": "arXiv:2403.12908v1 Announce Type: new \nAbstract: Advances in modern technology have enabled the simultaneous recording of neural spiking activity, which statistically can be represented by a multivariate point process. We characterise the second order structure of this process via the spectral density matrix, a frequency domain equivalent of the covariance matrix. In the context of neuronal analysis, statistics based on the spectral density matrix can be used to infer connectivity in the brain network between individual neurons. However, the high-dimensional nature of spike train data means that it is often difficult, or at times impossible, to compute these statistics. In this work, we discuss the importance of regularisation-based methods for spectral estimation, and propose novel methodology for use in the point process setting. We establish asymptotic properties for our proposed estimators and evaluate their performance on synthetic data simulated from multivariate Hawkes processes.
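For background on the Clustered Mallows Model abstract above (arXiv:2403.12880), here is a small sketch of the standard Mallows model it extends, P(pi) proportional to exp(-theta * d(pi, pi0)) with the Kendall tau distance; this is the textbook model with brute-force normalisation for small n, not the CMM or its exchange-algorithm sampler.

```python
# Standard Mallows model with Kendall tau distance (background sketch only).
import numpy as np
from itertools import permutations

def kendall_tau(p, q):
    # number of item pairs ordered differently by the two rankings
    pos_q = {item: i for i, item in enumerate(q)}
    d = 0
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            d += pos_q[p[i]] > pos_q[p[j]]
    return d

def mallows_pmf(theta, pi0):
    perms = list(permutations(pi0))
    w = np.array([np.exp(-theta * kendall_tau(p, pi0)) for p in perms])
    return perms, w / w.sum()          # normalisation tractable only for small n

perms, probs = mallows_pmf(theta=1.0, pi0=(1, 2, 3, 4))
print(perms[np.argmax(probs)], probs.max())   # modal ranking is the consensus pi0
```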
Finally, we apply our methodology to neuroscience spike train data in order to illustrate its ability to infer connectivity in the brain network."}, "https://arxiv.org/abs/2403.12108": {"title": "Does AI help humans make better decisions? A methodological framework for experimental evaluation", "link": "https://arxiv.org/abs/2403.12108", "description": "arXiv:2403.12108v1 Announce Type: cross \nAbstract: The use of Artificial Intelligence (AI) based on data-driven algorithms has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions as compared to a human alone or AI alone. We introduce a new methodological framework that can be used to answer this question experimentally with no additional assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded experimental design, in which the provision of AI-generated recommendations is randomized across cases with a human making final decisions. Under this experimental design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. We apply the proposed methodology to the data from our own randomized controlled trial of a pretrial risk assessment instrument. We find that AI recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Our analysis also shows that AI-alone decisions generally perform worse than human decisions with or without AI assistance. Finally, AI recommendations tend to impose cash bail on non-white arrestees more often than necessary when compared to white arrestees."}, "https://arxiv.org/abs/2403.12110": {"title": "Robust estimations from distribution structures: I", "link": "https://arxiv.org/abs/2403.12110", "description": "arXiv:2403.12110v1 Announce Type: cross \nAbstract: As the most fundamental problem in statistics, robust location estimation has many prominent solutions, such as the trimmed mean, Winsorized mean, Hodges-Lehmann estimator, Huber M-estimator, and median of means. Recent studies suggest that their maximum biases concerning the mean can be quite different, but the underlying mechanisms largely remain unclear. This study exploited a semiparametric method to classify distributions by the asymptotic orderliness of quantile combinations with varying breakdown points, showing their interrelations and connections to parametric distributions. Further deductions explain why the Winsorized mean typically has smaller biases compared to the trimmed mean; two sequences of semiparametric robust mean estimators emerge, particularly highlighting the superiority of the median Hodges-Lehmann mean. This article sheds light on the understanding of the common nature of probability distributions."}, "https://arxiv.org/abs/2403.12284": {"title": "The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control", "link": "https://arxiv.org/abs/2403.12284", "description": "arXiv:2403.12284v1 Announce Type: cross \nAbstract: Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc.
While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially those graph features related to topological structures including cycles (i.e., wreaths), cliques, hubs, etc. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill the gap, we propose a novel inferential framework for general high dimensional graphical models to select graph features with false discovery rate controlled. Our method is based on the maximum of $p$-values from single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with the uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states and identify the residues crucial for protein conformational changes and thus potential targets for inhibition."}, "https://arxiv.org/abs/2403.12367": {"title": "Semisupervised score based matching algorithm to evaluate the effect of public health interventions", "link": "https://arxiv.org/abs/2403.12367", "description": "arXiv:2403.12367v1 Announce Type: cross \nAbstract: Multivariate matching algorithms \"pair\" similar study units in an observational study to remove potential bias and confounding effects caused by the absence of randomizations. In one-to-one multivariate matching algorithms, a large number of \"pairs\" to be matched means both rich information from a large sample and a large number of matching tasks; an efficient matching algorithm that requires only comparatively limited auxiliary matching knowledge, provided through a \"training\" set of paired units by domain experts, is therefore of practical interest.\n We propose a novel one-to-one matching algorithm based on a quadratic score function $S_{\\beta}(x_i,x_j)= \\beta^T (x_i-x_j)(x_i-x_j)^T \\beta$. The weights $\\beta$, which can be interpreted as a variable importance measure, are designed to minimize the score difference between paired training units while maximizing the score difference between unpaired training units. Further, in the typical but intricate case where the training set is much smaller than the unpaired set, we propose a semisupervised companion one-to-one matching algorithm (SCOTOMA) that makes the best use of the unpaired units. The proposed weight estimator is proved to be consistent when the true matching criterion is indeed the quadratic score function. When the model assumptions are violated, we demonstrate that the proposed algorithm still outperforms some popular competing matching algorithms through a series of simulations.
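A minimal sketch of the quadratic score from the matching abstract above (arXiv:2403.12367), S_beta(x_i, x_j) = beta^T (x_i - x_j)(x_i - x_j)^T beta = (beta^T (x_i - x_j))^2, used here inside a simple greedy one-to-one matcher. Estimating beta from an expert-paired training set, which is the core of the paper's proposal, is not reproduced; beta, the greedy pairing rule, and the data are assumptions for illustration.

```python
# Quadratic-score matching sketch: beta is taken as given, matching is greedy.
import numpy as np

def score(beta, xi, xj):
    diff = xi - xj
    return float((beta @ diff) ** 2)      # beta^T (xi - xj)(xi - xj)^T beta

def greedy_match(beta, treated, controls):
    available = set(range(len(controls)))
    pairs = []
    for i, xi in enumerate(treated):
        j = min(available, key=lambda c: score(beta, xi, controls[c]))
        pairs.append((i, j))
        available.remove(j)               # each control used at most once
    return pairs

rng = np.random.default_rng(4)
beta = np.array([1.0, 0.5, 0.1])          # stand-in variable-importance weights
treated = rng.normal(size=(10, 3))
controls = rng.normal(size=(30, 3))
print(greedy_match(beta, treated, controls))
```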
We applied the proposed algorithm to a real-world study to investigate the effect of in-person schooling on the community Covid-19 transmission rate for policy-making purposes."}, "https://arxiv.org/abs/2403.12417": {"title": "On Predictive planning and counterfactual learning in active inference", "link": "https://arxiv.org/abs/2403.12417", "description": "arXiv:2403.12417v1 Announce Type: cross \nAbstract: Given the rapid advancement of artificial intelligence, understanding the foundations of intelligent behaviour is increasingly important. Active inference, regarded as a general theory of behaviour, offers a principled approach to probing the basis of sophistication in planning and decision-making. In this paper, we examine two decision-making schemes in active inference based on 'planning' and 'learning from experience'. Furthermore, we introduce a mixed model that navigates the data-complexity trade-off between these strategies, leveraging the strengths of both to facilitate balanced decision-making. We evaluate our proposed model in a challenging grid-world scenario that requires adaptability from the agent. Additionally, our model provides the opportunity to analyze the evolution of various parameters, offering valuable insights and contributing to an explainable framework for intelligent decision-making."}, "https://arxiv.org/abs/2403.12491": {"title": "A consistent test of spherical symmetry for multivariate and high-dimensional data via data augmentation", "link": "https://arxiv.org/abs/2403.12491", "description": "arXiv:2403.12491v1 Announce Type: cross \nAbstract: We develop a test for spherical symmetry of a multivariate distribution $P$ that works even when the dimension of the data $d$ is larger than the sample size $n$. We propose a non-negative measure $\\zeta(P)$ such that $\\zeta(P)=0$ if and only if $P$ is spherically symmetric. We construct a consistent estimator of $\\zeta(P)$ using the data augmentation method and investigate its large sample properties. The proposed test based on this estimator is calibrated using a novel resampling algorithm. Our test controls the Type-I error, and it is consistent against general alternatives. We also study its behaviour for a sequence of alternatives $(1-\\delta_n) F+\\delta_n G$, where $\\zeta(G)=0$ but $\\zeta(F)>0$, and $\\delta_n \\in [0,1]$. When $\\limsup\\delta_n<1$, for any $G$, the power of our test converges to unity as $n$ increases. However, if $\\limsup\\delta_n=1$, the asymptotic power of our test depends on $\\lim n(1-\\delta_n)^2$. We establish this by proving the minimax rate optimality of our test over a suitable class of alternatives and showing that it is Pitman efficient when $\\lim n(1-\\delta_n)^2>0$. Moreover, our test is provably consistent for high-dimensional data even when $d$ is larger than $n$. Our numerical results amply demonstrate the superiority of the proposed test over some state-of-the-art methods."}, "https://arxiv.org/abs/2403.12858": {"title": "Settlement Mapping for Population Density Modelling in Disease Risk Spatial Analysis", "link": "https://arxiv.org/abs/2403.12858", "description": "arXiv:2403.12858v1 Announce Type: cross \nAbstract: In disease risk spatial analysis, many researchers, especially in Indonesia, are still modelling population density as the ratio of total population to administrative area extent. This model oversimplifies the problem, because it covers large uninhabited areas, while the model should focus on inhabited areas.
This study uses settlement mapping against satellite imagery to focus the model and calculate settlement area extent. To the best of our knowledge, no previous study has specifically compared the use of settlement mapping with that of administrative area to model population density when computing its correlation with a disease case rate. This study investigates the comparison of both models using data on Tuberculosis (TB) case rates in Central and East Java, Indonesia. Our study shows that with administrative area density, Spearman's $\\rho$ was \"Fair\" (0.566, p<0.01), whereas with settlement density it was \"Moderately Strong\" (0.673, p<0.01). The difference is significant according to Hotelling's t test. Based on this result, we encourage researchers to use settlement mapping to improve population density modelling in disease risk spatial analysis. Resources used by and resulting from this work are publicly available at https://github.com/mirzaalimm/PopulationDensityVsDisease."}, "https://arxiv.org/abs/2202.07277": {"title": "Exploiting deterministic algorithms to perform global sensitivity analysis of continuous-time Markov chain compartmental models with application to epidemiology", "link": "https://arxiv.org/abs/2202.07277", "description": "arXiv:2202.07277v3 Announce Type: replace \nAbstract: In this paper, we propose a generic approach to perform global sensitivity analysis (GSA) for compartmental models based on continuous-time Markov chains (CTMC). This approach enables a complete GSA for epidemic models, in which not only the effects of uncertain parameters such as epidemic parameters (transmission rate, mean sojourn duration in compartments) are quantified, but also those of intrinsic randomness and interactions between the two. The main step in our approach is to build a deterministic representation of the underlying continuous-time Markov chain by controlling the latent variables modeling intrinsic randomness. Then, model output can be written as a deterministic function of both uncertain parameters and controlled latent variables, so that it becomes possible to compute standard variance-based sensitivity indices, e.g. the so-called Sobol' indices. However, different simulation algorithms lead to different representations. We exhibit in this work three different representations for CTMC stochastic compartmental models and discuss the results obtained by implementing and comparing GSAs based on each of these representations on a SARS-CoV-2 epidemic model."}, "https://arxiv.org/abs/2205.13734": {"title": "An efficient tensor regression for high-dimensional data", "link": "https://arxiv.org/abs/2205.13734", "description": "arXiv:2205.13734v2 Announce Type: replace \nAbstract: Most currently used tensor regression models for high-dimensional data are based on Tucker decomposition, which has good properties but loses its efficiency in compressing tensors very quickly as the order of tensors increases, say greater than four or five. However, for the simplest tensor autoregression in handling time series data, its coefficient tensor is already of order six. This paper revises a newly proposed tensor train (TT) decomposition and then applies it to tensor regression such that a nice statistical interpretation can be obtained. The new tensor regression can well match the data with hierarchical structures, and it can even lead to a better interpretation for the data with factorial structures, which are supposed to be better fitted by models with Tucker decomposition.
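For readers unfamiliar with the tensor train format mentioned above, a minimal sketch of the standard truncated TT-SVD factorisation of a dense array follows; it illustrates the TT idea only and is not the revised decomposition or the regression estimator proposed in the paper.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Truncated TT-SVD: factor a d-way array into d three-way cores."""
    dims = tensor.shape
    cores, rank_prev = [], 1
    unfolding = np.asarray(tensor, dtype=float)
    for k in range(len(dims) - 1):
        unfolding = unfolding.reshape(rank_prev * dims[k], -1)
        U, s, Vt = np.linalg.svd(unfolding, full_matrices=False)
        rank = min(max_rank, len(s))
        cores.append(U[:, :rank].reshape(rank_prev, dims[k], rank))
        unfolding = s[:rank, None] * Vt[:rank]   # carry the remainder forward
        rank_prev = rank
    cores.append(unfolding.reshape(rank_prev, dims[-1], 1))
    return cores

# e.g. a 4-way coefficient tensor compressed into a chain of small cores
cores = tt_svd(np.random.default_rng(0).normal(size=(3, 4, 5, 6)), max_rank=2)
print([c.shape for c in cores])  # [(1, 3, 2), (2, 4, 2), (2, 5, 2), (2, 6, 1)]
```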
More importantly, the new tensor regression can be easily applied to the case with higher order tensors since TT decomposition can compress the coefficient tensors much more efficiently. The methodology is also extended to tensor autoregression for time series data, and nonasymptotic properties are derived for the ordinary least squares estimations of both tensor regression and autoregression. A new algorithm is introduced to search for estimators, and its theoretical justification is also discussed. Theoretical and computational properties of the proposed methodology are verified by simulation studies, and the advantages over existing methods are illustrated by two real examples."}, "https://arxiv.org/abs/2207.08886": {"title": "Turning the information-sharing dial: efficient inference from different data sources", "link": "https://arxiv.org/abs/2207.08886", "description": "arXiv:2207.08886v3 Announce Type: replace \nAbstract: A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of if data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes."}, "https://arxiv.org/abs/2212.08282": {"title": "Early-Phase Local-Area Model for Pandemics Using Limited Data: A SARS-CoV-2 Application", "link": "https://arxiv.org/abs/2212.08282", "description": "arXiv:2212.08282v2 Announce Type: replace \nAbstract: The emergence of novel infectious agents presents challenges to statistical models of disease transmission. These challenges arise from limited, poor-quality data and an incomplete understanding of the agent. Moreover, outbreaks manifest differently across regions due to various factors, making it imperative for models to factor in regional specifics. In this work, we offer a model that effectively utilizes constrained data resources to estimate disease transmission rates at the local level, especially during the early outbreak phase when primarily infection counts and aggregated local characteristics are accessible. This model merges a pathogen transmission methodology based on daily infection numbers with regression techniques, drawing correlations between disease transmission and local-area factors, such as demographics, health policies, behavior, and even climate, to estimate and forecast daily infections. We incorporate the quasi-score method and an error term to navigate potential data concerns and mistaken assumptions. 
Additionally, we introduce an online estimator that facilitates real-time data updates, complemented by an iterative algorithm for parameter estimation. This approach facilitates real-time analysis of disease transmission when data quality is suboptimal and knowledge of the infectious pathogen is limited. It is particularly useful in the early stages of outbreaks, providing support for local decision-making."}, "https://arxiv.org/abs/2306.07047": {"title": "Foundations of Causal Discovery on Groups of Variables", "link": "https://arxiv.org/abs/2306.07047", "description": "arXiv:2306.07047v3 Announce Type: replace \nAbstract: Discovering causal relationships from observational data is a challenging task that relies on assumptions connecting statistical quantities to graphical or algebraic causal models. In this work, we focus on widely employed assumptions for causal discovery when objects of interest are (multivariate) groups of random variables rather than individual (univariate) random variables, as is the case in a variety of problems in scientific domains such as climate science or neuroscience. If the group-level causal models are derived from partitioning a micro-level model into groups, we explore the relationship between micro and group-level causal discovery assumptions. We investigate the conditions under which assumptions like Causal Faithfulness hold or fail to hold. Our analysis encompasses graphical causal models that contain cycles and bidirected edges. We also discuss grouped time series causal graphs and variants thereof as special cases of our general theoretical framework. Thereby, we aim to provide researchers with a solid theoretical foundation for the development and application of causal discovery methods for variable groups."}, "https://arxiv.org/abs/2306.16821": {"title": "Efficient subsampling for exponential family models", "link": "https://arxiv.org/abs/2306.16821", "description": "arXiv:2306.16821v2 Announce Type: replace \nAbstract: We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are \"closest\" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points and applies to a large class of regression models including models where the Fisher information is of rank larger than $1$."}, "https://arxiv.org/abs/2308.08346": {"title": "RMST-based multiple contrast tests in general factorial designs", "link": "https://arxiv.org/abs/2308.08346", "description": "arXiv:2308.08346v2 Announce Type: replace \nAbstract: Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example for this is the restricted mean survival time (RMST). 
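As an aside, the RMST is the area under the survival curve up to a chosen horizon tau (formalised in the next sentence); a minimal sketch of that computation for a Kaplan-Meier step curve, with hypothetical inputs times, surv, and tau, is:

```python
import numpy as np

def rmst_from_km(times, surv, tau):
    """Area under a Kaplan-Meier step curve up to tau (restricted mean survival time).
    times: sorted event times; surv: survival probabilities at those times."""
    times, surv = np.asarray(times, float), np.asarray(surv, float)
    keep = times < tau
    grid = np.concatenate(([0.0], times[keep], [tau]))
    steps = np.concatenate(([1.0], surv[keep]))  # S(0) = 1; the curve is right-continuous
    return float(np.sum(steps * np.diff(grid)))

# e.g. the curve drops to 0.8 at t=2 and to 0.5 at t=5; up to tau=6:
# rmst_from_km([2, 5], [0.8, 0.5], tau=6)  ->  1*2 + 0.8*3 + 0.5*1 = 4.9
```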
It is defined as the area under the survival curve up to a prespecified time point and, thus, summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found the inflation of the type I error of the asymptotic test for small samples and, therefore, a two-sample permutation test has already been developed. The first goal of the present paper is to further extend the permutation test for general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result. However, global tests do not provide this information. Therefore, multiple tests for the RMST are developed in a second step to infer several null hypotheses simultaneously. Hereby, the asymptotically exact dependence structure between the local test statistics is incorporated to gain more power. Finally, the small sample performance of the proposed global and multiple testing procedures is analyzed in simulations and illustrated in a real data example."}, "https://arxiv.org/abs/2310.18108": {"title": "Transductive conformal inference with adaptive scores", "link": "https://arxiv.org/abs/2310.18108", "description": "arXiv:2310.18108v2 Announce Type: replace \nAbstract: Conformal inference is a fundamental and versatile tool that provides distribution-free guarantees for many machine learning tasks. We consider the transductive setting, where decisions are made on a test sample of $m$ new points, giving rise to $m$ conformal $p$-values. While classical results only concern their marginal distribution, we show that their joint distribution follows a P\\'olya urn model, and establish a concentration inequality for their empirical distribution function. The results hold for arbitrary exchangeable scores, including adaptive ones that can use the covariates of the test+calibration samples at training stage for increased accuracy. We demonstrate the usefulness of these theoretical results through uniform, in-probability guarantees for two machine learning tasks of current interest: interval prediction for transductive transfer learning and novelty detection based on two-class classification."}, "https://arxiv.org/abs/2206.08235": {"title": "Identifying the Most Appropriate Order for Categorical Responses", "link": "https://arxiv.org/abs/2206.08235", "description": "arXiv:2206.08235v3 Announce Type: replace-cross \nAbstract: Categorical responses arise naturally within various scientific disciplines. In many circumstances, there is no predetermined order for the response categories, and the response has to be modeled as nominal. In this study, we regard the order of response categories as part of the statistical model, and show that the true order, when it exists, can be selected using likelihood-based model selection criteria. For predictive purposes, a statistical model with a chosen order may outperform models based on nominal responses, even if a true order does not exist. For multinomial logistic models, widely used for categorical responses, we show the existence of theoretically equivalent orders that cannot be differentiated based on likelihood criteria, and determine the connections between their maximum likelihood estimators. 
We use simulation studies and a real-data analysis to confirm the need and benefits of choosing the most appropriate order for categorical responses."}, "https://arxiv.org/abs/2302.03788": {"title": "Toward a Theory of Causation for Interpreting Neural Code Models", "link": "https://arxiv.org/abs/2302.03788", "description": "arXiv:2302.03788v2 Announce Type: replace-cross \nAbstract: Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (e.g., brackets, parentheses, semicolons) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs."}, "https://arxiv.org/abs/2403.13069": {"title": "kDGLM: an R package for Bayesian analysis of Generalized Dynamic Linear Models", "link": "https://arxiv.org/abs/2403.13069", "description": "arXiv:2403.13069v1 Announce Type: new \nAbstract: This paper introduces kDGLM, an R package designed for Bayesian analysis of Generalized Dynamic Linear Models (GDLM), with a primary focus on both uni- and multivariate exponential families. Emphasizing sequential inference for time series data, the kDGLM package provides comprehensive support for fitting, smoothing, monitoring, and feed-forward interventions. The methodology employed by kDGLM, as proposed in Alves et al. (2024), seamlessly integrates with well-established techniques from the literature, particularly those used in (Gaussian) Dynamic Models. These include discount strategies, autoregressive components, transfer functions, and more. Leveraging key properties of the Kalman filter and smoothing, kDGLM exhibits remarkable computational efficiency, enabling virtually instantaneous fitting times that scale linearly with the length of the time series. This characteristic makes it an exceptionally powerful tool for the analysis of extended time series. For example, when modeling monthly hospital admissions in Brazil due to gastroenteritis from 2010 to 2022, the fitting process took a mere 0.11s.
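The near-instantaneous fitting reported above comes from sequential, Kalman-filter-style updates whose cost grows linearly with the series length; as a rough illustration of such recursions (generic Gaussian dynamic linear model notation, not the kDGLM API), consider:

```python
import numpy as np

def kalman_filter(y, F, G, V, W, m0, C0):
    """Filtering recursions for a Gaussian dynamic linear model
    y_t = F theta_t + v_t, v_t ~ N(0, V);  theta_t = G theta_{t-1} + w_t, w_t ~ N(0, W).
    A single pass over the data, hence linear cost in the series length."""
    m, C = np.asarray(m0, float), np.asarray(C0, float)
    means, covs = [], []
    for y_t in y:
        a = G @ m                        # prior mean at time t
        R = G @ C @ G.T + W              # prior covariance
        f = F @ a                        # one-step forecast mean
        Q = F @ R @ F.T + V              # forecast covariance
        A = R @ F.T @ np.linalg.inv(Q)   # adaptive gain
        m = a + A @ (np.atleast_1d(y_t) - f)
        C = R - A @ Q @ A.T
        means.append(m)
        covs.append(C)
    return np.array(means), np.array(covs)
```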
Even in a spatial-time variant of the model (27 outcomes, 110 latent states, and 156 months, yielding 17,160 parameters), the fitting time was only 4.24s. Currently, the kDGLM package supports a range of distributions, including univariate Normal (unknown mean and observational variance), bivariate Normal (unknown means, observational variances, and correlation), Poisson, Gamma (known shape and unknown mean), and Multinomial (known number of trials and unknown event probabilities). Additionally, kDGLM allows the joint modeling of multiple time series, provided each series follows one of the supported distributions. Ongoing efforts aim to continuously expand the supported distributions."}, "https://arxiv.org/abs/2403.13076": {"title": "Spatial Autoregressive Model on a Dirichlet Distribution", "link": "https://arxiv.org/abs/2403.13076", "description": "arXiv:2403.13076v1 Announce Type: new \nAbstract: Compositional data find broad application across diverse fields due to their efficacy in representing proportions or percentages of various components within a whole. Spatial dependencies often exist in compositional data, particularly when the data represents different land uses or ecological variables. Ignoring the spatial autocorrelations in modelling of compositional data may lead to incorrect estimates of parameters. Hence, it is essential to incorporate spatial information into the statistical analysis of compositional data to obtain accurate and reliable results. However, traditional statistical methods are not directly applicable to compositional data due to the correlation between its observations, which are constrained to lie on a simplex. To address this challenge, the Dirichlet distribution is commonly employed, as its support aligns with the nature of compositional vectors. Specifically, the R package DirichletReg provides a regression model, termed Dirichlet regression, tailored for compositional data. However, this model fails to account for spatial dependencies, thereby restricting its utility in spatial contexts. In this study, we introduce a novel spatial autoregressive Dirichlet regression model for compositional data, adeptly integrating spatial dependencies among observations. We construct a maximum likelihood estimator for a Dirichlet density function augmented with a spatial lag term. We compare this spatial autoregressive model with the same model without spatial lag, where we test both models on synthetic data as well as two real datasets, using different metrics. By considering the spatial relationships among observations, our model provides more accurate and reliable results for the analysis of compositional data. The model is further evaluated against a spatial multinomial regression model for compositional data, and their relative effectiveness is discussed."}, "https://arxiv.org/abs/2403.13118": {"title": "Modal Analysis of Spatiotemporal Data via Multivariate Gaussian Process Regression", "link": "https://arxiv.org/abs/2403.13118", "description": "arXiv:2403.13118v1 Announce Type: new \nAbstract: Modal analysis has become an essential tool to understand the coherent structure of complex flows. The classical modal analysis methods, such as dynamic mode decomposition (DMD) and spectral proper orthogonal decomposition (SPOD), rely on a sufficient amount of data that is regularly sampled in time. However, often one needs to deal with sparse temporally irregular data, e.g., due to experimental measurements and simulation algorithm. 
To overcome the limitations of data scarcity and irregular sampling, we propose a novel modal analysis technique using multi-variate Gaussian process regression (MVGPR). We first establish the connection between MVGPR and the existing modal analysis techniques, DMD and SPOD, from a linear system identification perspective. Next, leveraging this connection, we develop a MVGPR-based modal analysis technique that addresses the aforementioned limitations. The capability of MVGPR is endowed by its judiciously designed kernel structure for correlation function, that is derived from the assumed linear dynamics. Subsequently, the proposed MVGPR method is benchmarked against DMD and SPOD on a range of examples, from academic and synthesized data to unsteady airfoil aerodynamics. The results demonstrate MVGPR as a promising alternative to classical modal analysis methods, especially in the scenario of scarce and temporally irregular data."}, "https://arxiv.org/abs/2403.13197": {"title": "Robust inference of cooperative behaviour of multiple ion channels in voltage-clamp recordings", "link": "https://arxiv.org/abs/2403.13197", "description": "arXiv:2403.13197v1 Announce Type: new \nAbstract: Recent experimental studies have shed light on the intriguing possibility that ion channels exhibit cooperative behaviour. However, a comprehensive understanding of such cooperativity remains elusive, primarily due to limitations in measuring separately the response of each channel. Rather, only the superimposed channel response can be observed, challenging existing data analysis methods. To address this gap, we propose IDC (Idealisation, Discretisation, and Cooperativity inference), a robust statistical data analysis methodology that requires only voltage-clamp current recordings of an ensemble of ion channels. The framework of IDC enables us to integrate recent advancements in idealisation techniques and coupled Markov models. Further, in the cooperativity inference phase of IDC, we introduce a minimum distance estimator and establish its statistical guarantee in the form of asymptotic consistency. We demonstrate the effectiveness and robustness of IDC through extensive simulation studies. As an application, we investigate gramicidin D channels. Our findings reveal that these channels act independently, even at varying applied voltages during voltage-clamp experiments. An implementation of IDC is available from GitLab."}, "https://arxiv.org/abs/2403.13256": {"title": "Bayesian Nonparametric Trees for Principal Causal Effects", "link": "https://arxiv.org/abs/2403.13256", "description": "arXiv:2403.13256v1 Announce Type: new \nAbstract: Principal stratification analysis evaluates how causal effects of a treatment on a primary outcome vary across strata of units defined by their treatment effect on some intermediate quantity. This endeavor is substantially challenged when the intermediate variable is continuously scaled and there are infinitely many basic principal strata. We employ a Bayesian nonparametric approach to flexibly evaluate treatment effects across flexibly-modeled principal strata. The approach uses Bayesian Causal Forests (BCF) to simultaneously specify two Bayesian Additive Regression Tree models; one for the principal stratum membership and one for the outcome, conditional on principal strata. 
We show how the capability of BCF for capturing treatment effect heterogeneity is particularly relevant for assessing how treatment effects vary across the surface defined by continuously-scaled principal strata, in addition to other benefits relating to targeted selection and regularization-induced confounding. The capabilities of the proposed approach are illustrated with a simulation study, and the methodology is deployed to investigate how causal effects of power plant emissions control technologies on ambient particulate pollution vary as a function of the technologies' impact on sulfur dioxide emissions."}, "https://arxiv.org/abs/2403.13260": {"title": "A Bayesian Approach for Selecting Relevant External Data (BASE): Application to a study of Long-Term Outcomes in a Hemophilia Gene Therapy Trial", "link": "https://arxiv.org/abs/2403.13260", "description": "arXiv:2403.13260v1 Announce Type: new \nAbstract: Gene therapies aim to address the root causes of diseases, particularly those stemming from rare genetic defects that can be life-threatening or severely debilitating. While there has been notable progress in the development of gene therapies in recent years, understanding their long-term effectiveness remains challenging due to a lack of data on long-term outcomes, especially during the early stages of their introduction to the market. To address the critical question of estimating long-term efficacy without waiting for the completion of lengthy clinical trials, we propose a novel Bayesian framework. This framework selects pertinent data from external sources, often early-phase clinical trials with more comprehensive longitudinal efficacy data that could lead to an improved inference of the long-term efficacy outcome. We apply this methodology to predict the long-term factor IX (FIX) levels of HEMGENIX (etranacogene dezaparvovec), the first FDA-approved gene therapy to treat adults with severe Hemophilia B, in a phase 3 study. Our application showcases the capability of the framework to estimate the 5-year FIX levels following HEMGENIX therapy, demonstrating sustained FIX levels induced by HEMGENIX infusion. Additionally, we provide theoretical insights into the methodology by establishing its posterior convergence properties."}, "https://arxiv.org/abs/2403.13340": {"title": "Forecasting density-valued functional panel data", "link": "https://arxiv.org/abs/2403.13340", "description": "arXiv:2403.13340v1 Announce Type: new \nAbstract: We introduce a statistical method for modeling and forecasting functional panel data, where each element is a density. Density functions are nonnegative and have a constrained integral and thus do not constitute a linear vector space. We implement a center log-ratio transformation to transform densities into unconstrained functions. These functions exhibit cross-sectional correlation and temporal dependence. Via a functional analysis of variance decomposition, we decompose the unconstrained functional panel data into a deterministic trend component and a time-varying residual component. To produce forecasts for the time-varying component, a functional time series forecasting method, based on the estimation of the long-range covariance, is implemented. By combining the forecasts of the time-varying residual component with the deterministic trend component, we obtain h-step-ahead forecast curves for multiple populations.
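A minimal sketch of the centre log-ratio step just described, applied to a density discretised on a grid (the ANOVA decomposition and the functional time series forecasting machinery are omitted; function names are illustrative):

```python
import numpy as np

def clr(density, eps=1e-12):
    """Centre log-ratio transform of a density discretised on a grid (values sum to 1)."""
    log_d = np.log(np.asarray(density, float) + eps)
    return log_d - log_d.mean()

def inv_clr(z):
    """Map a clr-transformed vector back to a discretised density."""
    d = np.exp(np.asarray(z, float))
    return d / d.sum()

density = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
z = clr(density)                         # unconstrained: entries sum to (approximately) zero
assert np.allclose(inv_clr(z), density)  # the transform is invertible up to normalisation
```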
Illustrated by age- and sex-specific life-table death counts in the United States, we apply our proposed method to generate forecasts of the life-table death counts for 51 states."}, "https://arxiv.org/abs/2403.13398": {"title": "A unified framework for bounding causal effects on the always-survivor and other populations", "link": "https://arxiv.org/abs/2403.13398", "description": "arXiv:2403.13398v1 Announce Type: new \nAbstract: We investigate the bounding problem of causal effects in experimental studies in which the outcome is truncated by death, meaning that the subject dies before the outcome can be measured. Causal effects cannot be point identified without instruments and/or tight parametric assumptions but can be bounded under mild restrictions. Previous work on partial identification under the principal stratification framework has primarily focused on the `always-survivor' subpopulation. In this paper, we present a novel nonparametric unified framework to provide sharp bounds on causal effects on discrete and continuous square-integrable outcomes. These bounds are derived on the `always-survivor', `protected', and `harmed' subpopulations and on the entire population with/without assumptions of monotonicity and stochastic dominance. The main idea depends on rewriting the optimization problem in terms of the integrated tail probability expectation formula using a set of conditional probability distributions. The proposed procedure allows for settings with any type and number of covariates, and can be extended to incorporate average causal effects and complier average causal effects. Furthermore, we present several simulation studies conducted under various assumptions as well as the application of the proposed approach to a real dataset from the National Supported Work Demonstration."}, "https://arxiv.org/abs/2403.13544": {"title": "A class of bootstrap based residuals for compositional data", "link": "https://arxiv.org/abs/2403.13544", "description": "arXiv:2403.13544v1 Announce Type: new \nAbstract: Regression models for compositional data are common in several areas of knowledge. As in other classes of regression models, it is desirable to perform diagnostic analysis in these models using residuals that are approximately standard normally distributed. However, for regression models for compositional data, there has not been any multivariate residual that meets this requirement. In this work, we introduce a class of asymptotically standard normally distributed residuals for compositional data based on bootstrap. Monte Carlo simulation studies indicate that the distributions of the residuals of this class are well approximated by the standard normal distribution in small samples. An application to simulated data also suggests that one of the residuals of the proposed class is better to identify model misspecification than its competitors. Finally, the usefulness of the best residual of the proposed class is illustrated through an application on sleep stages. 
The class of residuals proposed here can also be used in other classes of multivariate regression models."}, "https://arxiv.org/abs/2403.13628": {"title": "Scalable Scalar-on-Image Cortical Surface Regression with a Relaxed-Thresholded Gaussian Process Prior", "link": "https://arxiv.org/abs/2403.13628", "description": "arXiv:2403.13628v1 Announce Type: new \nAbstract: In addressing the challenge of analysing the large-scale Adolescent Brain Cognition Development (ABCD) fMRI dataset, involving over 5,000 subjects and extensive neuroimaging data, we propose a scalable Bayesian scalar-on-image regression model for computational feasibility and efficiency. Our model employs a relaxed-thresholded Gaussian process (RTGP), integrating piecewise-smooth, sparse, and continuous functions capable of both hard- and soft-thresholding. This approach introduces additional flexibility in feature selection in scalar-on-image regression and leads to scalable posterior computation by adopting a variational approximation and utilising the Karhunen-Lo\\`eve expansion for Gaussian processes. This advancement substantially reduces the computational costs in vertex-wise analysis of cortical surface data in large-scale Bayesian spatial models. The model's parameter estimation and prediction accuracy and feature selection performance are validated through extensive simulation studies and an application to the ABCD study. Here, we perform regression analysis correlating intelligence scores with task-based functional MRI data, taking into account confounding factors including age, sex, and parental education level. This validation highlights our model's capability to handle large-scale neuroimaging data while maintaining computational feasibility and accuracy."}, "https://arxiv.org/abs/2403.13725": {"title": "Robust Inference in Locally Misspecified Bipartite Networks", "link": "https://arxiv.org/abs/2403.13725", "description": "arXiv:2403.13725v1 Announce Type: new \nAbstract: This paper introduces a methodology to conduct robust inference in bipartite networks under local misspecification. We focus on a class of dyadic network models with misspecified conditional moment restrictions. The framework of misspecification is local, as the effect of misspecification varies with the sample size. We utilize this local asymptotic approach to construct a robust estimator that is minimax optimal for the mean square error within a neighborhood of misspecification. Additionally, we introduce bias-aware confidence intervals that account for the effect of the local misspecification. These confidence intervals have the correct asymptotic coverage for the true parameter of interest under sparse network asymptotics. Monte Carlo experiments demonstrate that the robust estimator performs well in finite samples and sparse networks. As an empirical illustration, we study the formation of a scientific collaboration network among economists."}, "https://arxiv.org/abs/2403.13738": {"title": "Policy Relevant Treatment Effects with Multidimensional Unobserved Heterogeneity", "link": "https://arxiv.org/abs/2403.13738", "description": "arXiv:2403.13738v1 Announce Type: new \nAbstract: This paper provides a framework for the policy relevant treatment effects using instrumental variables. In the framework, a treatment selection may or may not satisfy the classical monotonicity condition and can accommodate multidimensional unobserved heterogeneity. We can bound the target parameter by extracting information from identifiable estimands. 
We also provide a more conservative yet computationally simpler bound by applying a convex relaxation method. Linear shape restrictions can be easily incorporated to further improve the bounds. Numerical and simulation results illustrate the informativeness of our convex-relaxation bounds, i.e., that our bounds are sufficiently tight."}, "https://arxiv.org/abs/2403.13750": {"title": "Data integration of non-probability and probability samples with predictive mean matching", "link": "https://arxiv.org/abs/2403.13750", "description": "arXiv:2403.13750v1 Announce Type: new \nAbstract: In this paper we study predictive mean matching mass imputation estimators to integrate data from probability and non-probability samples. We consider two approaches: matching predicted to observed ($\\hat{y}-y$ matching) or predicted to predicted ($\\hat{y}-\\hat{y}$ matching) values. We prove the consistency of two semi-parametric mass imputation estimators based on these approaches and derive their variance and estimators of variance. Our approach can be employed with non-parametric regression techniques, such as kernel regression, and the analytical expression for variance can also be applied in nearest neighbour matching for non-probability samples. We conduct extensive simulation studies in order to compare the properties of this estimator with existing approaches, discuss the selection of $k$-nearest neighbours, and study the effects of model mis-specification. The paper finishes with an empirical study on the integration of a job vacancy survey and vacancies submitted to public employment offices (admin and online data). Open source software is available for the proposed approaches."}, "https://arxiv.org/abs/2403.13153": {"title": "Tensor Time Series Imputation through Tensor Factor Modelling", "link": "https://arxiv.org/abs/2403.13153", "description": "arXiv:2403.13153v1 Announce Type: cross \nAbstract: We propose tensor time series imputation when the missing pattern in the tensor data can be general, as long as any two data positions along a tensor fibre are both observed for enough time points. The method is based on a tensor time series factor model with Tucker decomposition of the common component. One distinguishing feature of the tensor time series factor model used is that there can be weak factors in the factor loadings matrix for each mode. This reflects reality better when real data can have weak factors which drive only groups of observed variables, for instance, a sector factor in a financial market driving only stocks in a particular sector. Using the data with missing entries, asymptotic normality is derived for rows of estimated factor loadings, while consistent covariance matrix estimation enables us to carry out inferences. As a first in the literature, we also propose a ratio-based estimator for the rank of the core tensor under general missing patterns. Rates of convergence are spelt out for the imputations from the estimated tensor factor models. We introduce a new measure for gauging imputation performances, and simulation results show that our imputation procedure works well, with asymptotic normality and corresponding inferences also demonstrated. Re-imputation performance is also gauged, and we demonstrate that using a slightly larger rank than the estimated one gives superior re-imputation performance.
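The rank selection mentioned above is ratio-based; as a generic illustration of the eigenvalue-ratio idea (not the authors' exact estimator under general missing patterns), a minimal sketch is:

```python
import numpy as np

def ratio_rank(eigvals, max_rank=None):
    """Pick the rank k that minimises the ratio lambda_{k+1} / lambda_k
    of the sorted positive eigenvalues -- the generic eigenvalue-ratio idea."""
    lam = np.sort(np.asarray(eigvals, float))[::-1]
    k_max = max_rank if max_rank is not None else len(lam) - 1
    ratios = lam[1:k_max + 1] / lam[:k_max]
    return int(np.argmin(ratios)) + 1

# e.g. a clear gap after the third eigenvalue gives rank 3
print(ratio_rank([5.0, 4.2, 3.9, 0.3, 0.25, 0.2]))  # 3
```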
An NYC taxi traffic data set is also analyzed by imposing general missing patterns and gauging the imputation performances."}, "https://arxiv.org/abs/2403.13203": {"title": "Quadratic Point Estimate Method for Probabilistic Moments Computation", "link": "https://arxiv.org/abs/2403.13203", "description": "arXiv:2403.13203v1 Announce Type: cross \nAbstract: This paper presents in detail the originally developed Quadratic Point Estimate Method (QPEM), aimed at efficiently and accurately computing the first four output moments of probabilistic distributions, using 2n^2+1 sample (or sigma) points, with n, the number of input random variables. The proposed QPEM particularly offers an effective, superior, and practical alternative to existing sampling and quadrature methods for low- and moderately-high-dimensional problems. Detailed theoretical derivations are provided proving that the proposed method can achieve a fifth or higher-order accuracy for symmetric input distributions. Various numerical examples, from simple polynomial functions to nonlinear finite element analyses with random field representations, support the theoretical findings and further showcase the validity, efficiency, and applicability of the QPEM, from low- to high-dimensional problems."}, "https://arxiv.org/abs/2403.13361": {"title": "Multifractal wavelet dynamic mode decomposition modeling for marketing time series", "link": "https://arxiv.org/abs/2403.13361", "description": "arXiv:2403.13361v1 Announce Type: cross \nAbstract: Marketing is the way we ensure our sales are the best in the market, our prices the most accessible, and our clients satisfied, thus ensuring our brand has the widest distribution. This requires sophisticated and advanced understanding of the whole related network. Indeed, marketing data may exist in different forms such as qualitative and quantitative data. However, in the literature, it is easily noted that large bibliographies may be collected about qualitative studies, while only a few studies adopt a quantitative point of view. This is a major drawback that results in marketing science still focusing on design, although the market is strongly dependent on quantities such as money and time. Indeed, marketing data may form time series such as brand sales in specified periods, brand-related prices over specified periods, market shares, etc. The purpose of the present work is to investigate some marketing models based on time series for various brands. This paper aims to combine the dynamic mode decomposition and wavelet decomposition to study marketing series due to both prices, and volume sales in order to explore the effect of the time scale on the persistence of brand sales in the market and on the forecasting of such persistence, according to the characteristics of the brand and the related market competition or competitors. Our study is based on a sample of Saudi brands during the period 22 November 2017 to 30 December 2021."}, "https://arxiv.org/abs/2403.13489": {"title": "Antithetic Multilevel Methods for Elliptic and Hypo-Elliptic Diffusions with Applications", "link": "https://arxiv.org/abs/2403.13489", "description": "arXiv:2403.13489v1 Announce Type: cross \nAbstract: In this paper, we present a new antithetic multilevel Monte Carlo (MLMC) method for the estimation of expectations with respect to laws of diffusion processes that can be elliptic or hypo-elliptic. 
In particular, we consider the case where one has to resort to time discretization of the diffusion and numerical simulation of such schemes. Motivated by recent developments, we introduce a new MLMC estimator of expectations, which does not require simulation of intractable L\\'evy areas but has a weak error of order 2 and achieves the optimal computational complexity. We then show how this approach can be used in the context of the filtering problem associated to partially observed diffusions with discrete time observations. We illustrate with numerical simulations that our new approaches provide efficiency gains for several problems relative to some existing methods."}, "https://arxiv.org/abs/2403.13565": {"title": "AdaTrans: Feature-wise and Sample-wise Adaptive Transfer Learning for High-dimensional Regression", "link": "https://arxiv.org/abs/2403.13565", "description": "arXiv:2403.13565v1 Announce Type: cross \nAbstract: We consider the transfer learning problem in the high dimensional setting, where the feature dimension is larger than the sample size. To learn transferable information, which may vary across features or the source samples, we propose an adaptive transfer learning method that can detect and aggregate the feature-wise (F-AdaTrans) or sample-wise (S-AdaTrans) transferable structures. We achieve this by employing a novel fused-penalty, coupled with weights that can adapt according to the transferable structure. To choose the weight, we propose a theoretically informed, data-driven procedure, enabling F-AdaTrans to selectively fuse the transferable signals with the target while filtering out non-transferable signals, and S-AdaTrans to obtain the optimal combination of information transferred from each source sample. The non-asymptotic rates are established, which recover existing near-minimax optimal rates in special cases. The effectiveness of the proposed method is validated using both synthetic and real data."}, "https://arxiv.org/abs/2109.10330": {"title": "A Bayesian hierarchical model for disease mapping that accounts for scaling and heavy-tailed latent effects", "link": "https://arxiv.org/abs/2109.10330", "description": "arXiv:2109.10330v2 Announce Type: replace \nAbstract: In disease mapping, the relative risk of a disease is commonly estimated across different areas within a region of interest. The number of cases in an area is often assumed to follow a Poisson distribution whose mean is decomposed as the product between an offset and the logarithm of the disease's relative risk. The log risk may be written as the sum of fixed effects and latent random effects. The BYM2 model decomposes each latent effect into a weighted sum of independent and spatial effects. We build on the BYM2 model to allow for heavy-tailed latent effects and accommodate potentially outlying risks, after accounting for the fixed effects. We assume a scale mixture structure wherein the variance of the latent process changes across areas and allows for outlier identification. We propose two prior specifications for this scale mixture parameter. These are compared through simulation studies and in the analysis of Zika cases from the first (2015-2016) epidemic in Rio de Janeiro city, Brazil. The simulation studies show that, in terms of the model assessment criterion WAIC and outlier detection, the two proposed parametrisations perform better than the model proposed by Congdon (2017) to capture outliers. 
In particular, the proposed parametrisations are more efficient, in terms of outlier detection, than Congdon's when outliers are neighbours. Our analysis of Zika cases finds 19 out of 160 districts of Rio as potential outliers, after accounting for the socio-development index. Our proposed model may help prioritise interventions and identify potential issues in the recording of cases."}, "https://arxiv.org/abs/2210.13562": {"title": "Prediction intervals for economic fixed-event forecasts", "link": "https://arxiv.org/abs/2210.13562", "description": "arXiv:2210.13562v3 Announce Type: replace \nAbstract: The fixed-event forecasting setup is common in economic policy. It involves a sequence of forecasts of the same (`fixed') predictand, so that the difficulty of the forecasting problem decreases over time. Fixed-event point forecasts are typically published without a quantitative measure of uncertainty. To construct such a measure, we consider forecast postprocessing techniques tailored to the fixed-event case. We develop regression methods that impose constraints motivated by the problem at hand, and use these methods to construct prediction intervals for gross domestic product (GDP) growth in Germany and the US."}, "https://arxiv.org/abs/2211.07823": {"title": "Graph Neural Networks for Causal Inference Under Network Confounding", "link": "https://arxiv.org/abs/2211.07823", "description": "arXiv:2211.07823v3 Announce Type: replace \nAbstract: This paper studies causal inference with observational network data. A challenging aspect of this setting is the possibility of interference in both potential outcomes and selection into treatment, for example due to peer effects in either stage. We therefore consider a nonparametric setup in which both stages are reduced forms of simultaneous-equations models. This results in high-dimensional network confounding, where the network and covariates of all units constitute sources of selection bias. The literature predominantly assumes that confounding can be summarized by a known, low-dimensional function of these objects, and it is unclear what selection models justify common choices of functions. We show that graph neural networks (GNNs) are well suited to adjust for high-dimensional network confounding. We establish a network analog of approximate sparsity under primitive conditions on interference. This demonstrates that the model has low-dimensional structure that makes estimation feasible and justifies the use of shallow GNN architectures."}, "https://arxiv.org/abs/2301.12045": {"title": "Forward selection and post-selection inference in factorial designs", "link": "https://arxiv.org/abs/2301.12045", "description": "arXiv:2301.12045v2 Announce Type: replace \nAbstract: Ever since the seminal work of R. A. Fisher and F. Yates, factorial designs have been an important experimental tool to simultaneously estimate the effects of multiple treatment factors. In factorial designs, the number of treatment combinations grows exponentially with the number of treatment factors, which motivates the forward selection strategy based on the sparsity, hierarchy, and heredity principles for factorial effects. Although this strategy is intuitive and has been widely used in practice, its rigorous statistical theory has not been formally established. To fill this gap, we establish design-based theory for forward factor selection in factorial designs based on the potential outcome framework. 
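As a small aside on the heredity principle invoked above, the sketch below lists which two-factor interactions a forward step may consider once a set of main effects has been selected (strong heredity; the factor names are illustrative, and this is not the paper's design-based procedure):

```python
import itertools

def admissible_interactions(selected_mains, factors, order=2):
    """Strong heredity: an interaction enters the candidate set only if all of
    its parent main effects have already been selected."""
    return [combo for combo in itertools.combinations(factors, order)
            if all(f in selected_mains for f in combo)]

# with main effects {A, B} selected, only the A:B interaction is admissible
print(admissible_interactions({"A", "B"}, ["A", "B", "C", "D"]))  # [('A', 'B')]
```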
We not only prove a consistency property for the factor selection procedure but also discuss statistical inference after factor selection. In particular, with selection consistency, we quantify the advantages of forward selection based on asymptotic efficiency gain in estimating factorial effects. With inconsistent selection in higher-order interactions, we propose two strategies and investigate their impact on subsequent inference. Our formulation differs from the existing literature on variable selection and post-selection inference because our theory is based solely on the physical randomization of the factorial design and does not rely on a correctly specified outcome model."}, "https://arxiv.org/abs/2305.13221": {"title": "Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data", "link": "https://arxiv.org/abs/2305.13221", "description": "arXiv:2305.13221v3 Announce Type: replace \nAbstract: Additive spatial statistical models with weakly stationary process assumptions have become standard in spatial statistics. However, one disadvantage of such models is the computation time, which rapidly increases with the number of data points. The goal of this article is to apply an existing subsampling strategy to standard spatial additive models and to derive the spatial statistical properties. We call this strategy the ''spatial data subset model'' (SDSM) approach, which can be applied to big datasets in a computationally feasible way. Our approach has the advantage that one does not require any additional restrictive model assumptions. That is, computational gains increase as model assumptions are removed when using our model framework. This provides one solution to the computational bottlenecks that occur when applying methods such as Kriging to ''big data''. We provide several properties of this new spatial data subset model approach in terms of moments, sill, nugget, and range under several sampling designs. An advantage of our approach is that it subsamples without throwing away data, and can be implemented using datasets of any size that can be stored. We present the results of the spatial data subset model approach on simulated datasets, and on a large dataset consisting of 150,000 observations of daytime land surface temperatures measured by the MODIS instrument onboard the Terra satellite."}, "https://arxiv.org/abs/2309.03316": {"title": "Spatial data fusion adjusting for preferential sampling using INLA and SPDE", "link": "https://arxiv.org/abs/2309.03316", "description": "arXiv:2309.03316v2 Announce Type: replace \nAbstract: Spatially misaligned data can be fused by using a Bayesian melding model that assumes that underlying all observations there is a spatially continuous Gaussian random field process. This model can be used, for example, to predict air pollution levels by combining point data from monitoring stations and areal data from satellite imagery.\n However, if the data presents preferential sampling, that is, if the observed point locations are not independent of the underlying spatial process, the inference obtained from models that ignore such a dependence structure might not be valid.\n In this paper, we present a Bayesian spatial model for the fusion of point and areal data that takes into account preferential sampling.
The model combines the Bayesian melding specification and a model for the stochastically dependent sampling and underlying spatial processes.\n Fast Bayesian inference is performed using the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approaches. The performance of the model is assessed using simulated data in a range of scenarios and sampling strategies that can appear in real settings. The model is also applied to predict air pollution in the USA."}, "https://arxiv.org/abs/2312.12823": {"title": "Detecting Multiple Change Points in Distributional Sequences Derived from Structural Health Monitoring Data: An Application to Bridge Damage Detection", "link": "https://arxiv.org/abs/2312.12823", "description": "arXiv:2312.12823v2 Announce Type: replace \nAbstract: Detecting damage in critical structures using monitored data is a fundamental task of structural health monitoring, which is extremely important for maintaining structures' safety and life-cycle management. Based on statistical pattern recognition paradigm, damage detection can be conducted by assessing changes in the distribution of properly extracted damage-sensitive features (DSFs). This can be naturally formulated as a distributional change-point detection problem. A good change-point detector for damage detection should be scalable to large DSF datasets, applicable to different types of changes, and capable of controlling for false-positive indications. This study proposes a new distributional change-point detection method for damage detection to address these challenges. We embed the elements of a DSF distributional sequence into the Wasserstein space and construct a moving sum (MOSUM) multiple change-point detector based on Fr\\'echet statistics and establish theoretical properties. Extensive simulation studies demonstrate the superiority of our proposed approach against other competitors to address the aforementioned practical requirements. We apply our method to the cable-tension measurements monitored from a long-span cable-stayed bridge for cable damage detection. We conduct a comprehensive change-point analysis for the extracted DSF data, and reveal interesting patterns from the detected changes, which provides valuable insights into cable system damage."}, "https://arxiv.org/abs/2401.10196": {"title": "Functional Gaussian Graphical Regression Models", "link": "https://arxiv.org/abs/2401.10196", "description": "arXiv:2401.10196v2 Announce Type: replace \nAbstract: Functional data describe a wide range of processes encountered in practice, such as growth curves and spectral absorption. Functional regression considers a version of regression, where both the response and covariates are functional data. Evaluating both the functional relatedness between the response and covariates and the relatedness of a multivariate response function can be challenging. In this paper, we propose a solution for both these issues, by means of a functional Gaussian graphical regression model. It extends the notion of conditional Gaussian graphical models to partially separable functions. For inference, we propose a double-penalized estimator. Additionally, we present a novel adaptation of Kullback-Leibler cross-validation tailored for graph estimators which accounts for precision and regression matrices when the population presents one or more sub-groups, named joint Kullback-Leibler cross-validation. 
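For reference, Kullback-Leibler criteria for such Gaussian graph estimators build on the Gaussian KL divergence; a minimal sketch of that generic formula (not the paper's joint cross-validation score) is:

```python
import numpy as np

def kl_mvn(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians."""
    m0, m1 = np.asarray(m0, float), np.asarray(m1, float)
    S0, S1 = np.asarray(S0, float), np.asarray(S1, float)
    k = m0.size
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k + logdet1 - logdet0)
```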
Evaluation of model performance is done in terms of Kullback-Leibler divergence and graph recovery power. We illustrate the method on an air pollution dataset."}, "https://arxiv.org/abs/2206.13668": {"title": "Non-Independent Components Analysis", "link": "https://arxiv.org/abs/2206.13668", "description": "arXiv:2206.13668v4 Announce Type: replace-cross \nAbstract: A seminal result in the ICA literature states that for $AY = \\varepsilon$, if the components of $\\varepsilon$ are independent and at most one is Gaussian, then $A$ is identified up to sign and permutation of its rows (Comon, 1994). In this paper we study to which extent the independence assumption can be relaxed by replacing it with restrictions on higher order moment or cumulant tensors of $\\varepsilon$. We document new conditions that establish identification for several non-independent component models, e.g. common variance models, and propose efficient estimation methods based on the identification results. We show that in situations where independence cannot be assumed the efficiency gains can be significant relative to methods that rely on independence."}, "https://arxiv.org/abs/2311.00118": {"title": "Extracting the Multiscale Causal Backbone of Brain Dynamics", "link": "https://arxiv.org/abs/2311.00118", "description": "arXiv:2311.00118v2 Announce Type: replace-cross \nAbstract: The bulk of the research effort on brain connectivity revolves around statistical associations among brain regions, which do not directly relate to the causal mechanisms governing brain dynamics. Here we propose the multiscale causal backbone (MCB) of brain dynamics, shared by a set of individuals across multiple temporal scales, and devise a principled methodology to extract it.\n Our approach leverages recent advances in multiscale causal structure learning and optimizes the trade-off between the model fit and its complexity. Empirical assessment on synthetic data shows the superiority of our methodology over a baseline based on canonical functional connectivity networks. When applied to resting-state fMRI data, we find sparse MCBs for both the left and right brain hemispheres. Thanks to its multiscale nature, our approach shows that at low-frequency bands, causal dynamics are driven by brain regions associated with high-level cognitive functions; at higher frequencies instead, nodes related to sensory processing play a crucial role. Finally, our analysis of individual multiscale causal structures confirms the existence of a causal fingerprint of brain connectivity, thus supporting the existing extensive research in brain connectivity fingerprinting from a causal perspective."}, "https://arxiv.org/abs/2311.17885": {"title": "Are Ensembles Getting Better all the Time?", "link": "https://arxiv.org/abs/2311.17885", "description": "arXiv:2311.17885v2 Announce Type: replace-cross \nAbstract: Ensemble methods combine the predictions of several base models. We study whether or not including more models always improves their average performance. This question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform equally well, which is the case of several popular methods such as random forests or deep ensembles. In this setting, we show that ensembles are getting better all the time if, and only if, the considered loss function is convex.
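A toy numerical illustration of this convex-loss monotonicity, averaging exchangeable simulated predictions under squared loss (all values below are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
target = 1.0
preds = rng.normal(loc=1.2, scale=0.5, size=10_000)  # exchangeable member predictions

def ensemble_squared_loss(k, n_rep=20_000):
    """Average squared loss of an ensemble that averages k randomly drawn members."""
    draws = rng.choice(preds, size=(n_rep, k))
    return float(np.mean((draws.mean(axis=1) - target) ** 2))

print([round(ensemble_squared_loss(k), 3) for k in (1, 2, 5, 20)])
# the loss shrinks towards the squared bias as k grows, as the convexity result predicts
```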
More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised as: ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a medical prediction problem (diagnosing melanomas using neural nets) and a \"wisdom of crowds\" experiment (guessing the ratings of upcoming movies)."}, "https://arxiv.org/abs/2403.13934": {"title": "Data integration methods for micro-randomized trials", "link": "https://arxiv.org/abs/2403.13934", "description": "arXiv:2403.13934v1 Announce Type: new \nAbstract: Existing statistical methods for the analysis of micro-randomized trials (MRTs) are designed to estimate causal excursion effects using data from a single MRT. In practice, however, researchers can often find previous MRTs that employ similar interventions. In this paper, we develop data integration methods that capitalize on this additional information, leading to statistical efficiency gains. To further increase efficiency, we demonstrate how to combine these approaches according to a generalization of multivariate precision weighting that allows for correlation between estimates, and we show that the resulting meta-estimator possesses an asymptotic optimality property. We illustrate our methods in simulation and in a case study involving two MRTs in the area of smoking cessation."}, "https://arxiv.org/abs/2403.14036": {"title": "Fused LASSO as Non-Crossing Quantile Regression", "link": "https://arxiv.org/abs/2403.14036", "description": "arXiv:2403.14036v1 Announce Type: new \nAbstract: Quantile crossing has been an ever-present thorn in the side of quantile regression. This has spurred research into obtaining densities and coefficients that obey the quantile monotonicity property. While these are important contributions, they do not provide insight into how exactly these constraints influence the estimated coefficients. This paper extends non-crossing constraints and shows that by varying a single hyperparameter ($\\alpha$) one can obtain commonly used quantile estimators. Namely, we obtain the quantile regression estimator of Koenker and Bassett (1978) when $\\alpha=0$, the non-crossing quantile regression estimator of Bondell et al. (2010) when $\\alpha=1$, and the composite quantile regression estimator of Koenker (1984) and Zou and Yuan (2008) when $\\alpha\\rightarrow\\infty$. As such, we show that non-crossing constraints are simply a special type of fused-shrinkage."}, "https://arxiv.org/abs/2403.14044": {"title": "Statistical tests for comparing the associations of multiple exposures with a common outcome in Cox proportional hazard models", "link": "https://arxiv.org/abs/2403.14044", "description": "arXiv:2403.14044v1 Announce Type: new \nAbstract: With the advancement of medicine, alternative exposures or interventions are emerging with respect to a common outcome, and there is a need to formally test the difference in the associations of multiple exposures. We propose a duplication method-based multivariate Wald test in Cox proportional hazards regression analyses to test the difference in the associations of multiple exposures with the same outcome. The proposed method applies to linear or categorical exposures. 
To illustrate our method, we applied our method to compare the associations between alignment to two different dietary patterns, either as continuous or quartile exposures, and incident chronic diseases, defined as a composite of CVD, cancer, and diabetes, in the Health Professional Follow-up Study. Relevant sample codes in R that implement the proposed approach are provided. The proposed duplication-method-based approach offers a flexible, formal statistical test of multiple exposures for the common outcome with minimal assumptions."}, "https://arxiv.org/abs/2403.14152": {"title": "Generalized Rosenbaum Bounds Sensitivity Analysis for Matched Observational Studies with Treatment Doses: Sufficiency, Consistency, and Efficiency", "link": "https://arxiv.org/abs/2403.14152", "description": "arXiv:2403.14152v1 Announce Type: new \nAbstract: In matched observational studies with binary treatments, the Rosenbaum bounds framework is arguably the most widely used sensitivity analysis framework for assessing sensitivity to unobserved covariates. Unlike the binary treatment case, although widely needed in practice, sensitivity analysis for matched observational studies with treatment doses (i.e., non-binary treatments such as ordinal treatments or continuous treatments) still lacks solid foundations and valid methodologies. We fill in this blank by establishing theoretical foundations and novel methodologies under a generalized Rosenbaum bounds sensitivity analysis framework. First, we present a criterion for assessing the validity of sensitivity analysis in matched observational studies with treatment doses and use that criterion to justify the necessity of incorporating the treatment dose information into sensitivity analysis through generalized Rosenbaum sensitivity bounds. We also generalize Rosenbaum's classic sensitivity parameter $\\Gamma$ to the non-binary treatment case and prove its sufficiency. Second, we study the asymptotic properties of sensitivity analysis by generalizing Rosenbaum's classic design sensitivity and Bahadur efficiency for testing Fisher's sharp null to the non-binary treatment case and deriving novel formulas for them. Our theoretical results showed the importance of appropriately incorporating the treatment dose into a test, which is an intrinsic distinction with the binary treatment case. Third, for testing Neyman's weak null (i.e., null sample average treatment effect), we propose the first valid sensitivity analysis procedure for matching with treatment dose through generalizing an existing optimization-based sensitivity analysis for the binary treatment case, built on the generalized Rosenbaum sensitivity bounds and large-scale mixed integer programming."}, "https://arxiv.org/abs/2403.14216": {"title": "A Gaussian smooth transition vector autoregressive model: An application to the macroeconomic effects of severe weather shocks", "link": "https://arxiv.org/abs/2403.14216", "description": "arXiv:2403.14216v1 Announce Type: new \nAbstract: We introduce a new smooth transition vector autoregressive model with a Gaussian conditional distribution and transition weights that, for a $p$th order model, depend on the full distribution of the preceding $p$ observations. Specifically, the transition weight of each regime increases in its relative weighted likelihood. This data-driven approach facilitates capturing complex switching dynamics, enhancing the identification of gradual regime shifts. 
In an empirical application to the macroeconomic effects of a severe weather shock, we find that in monthly U.S. data from 1961:1 to 2022:3, the impacts of the shock are stronger in the regime prevailing in the early part of the sample and in certain crisis periods than in the regime dominating the latter part of the sample. This suggests overall adaptation of the U.S. economy to increased severe weather over time."}, "https://arxiv.org/abs/2403.14336": {"title": "An empirical appraisal of methods for the dynamic prediction of survival with numerous longitudinal predictors", "link": "https://arxiv.org/abs/2403.14336", "description": "arXiv:2403.14336v1 Announce Type: new \nAbstract: Recently, the increasing availability of repeated measurements in biomedical studies has motivated the development of several statistical methods for the dynamic prediction of survival in settings where a large (potentially high-dimensional) number of longitudinal covariates is available. These methods differ in both how they model the longitudinal covariates trajectories, and how they specify the relationship between the longitudinal covariates and the survival outcome. Because these methods are still quite new, little is known about their applicability, limitations and performance when applied to real-world data.\n To investigate these questions, we present a comparison of the predictive performance of the aforementioned methods and two simpler prediction approaches to three datasets that differ in terms of outcome type, sample size, number of longitudinal covariates and length of follow-up. We discuss how different modelling choices can have an impact on the possibility to accommodate unbalanced study designs and on computing time, and compare the predictive performance of the different approaches using a range of performance measures and landmark times."}, "https://arxiv.org/abs/2403.14348": {"title": "Statistical modeling to adjust for time trends in adaptive platform trials utilizing non-concurrent controls", "link": "https://arxiv.org/abs/2403.14348", "description": "arXiv:2403.14348v1 Announce Type: new \nAbstract: Utilizing non-concurrent controls in the analysis of late-entering experimental arms in platform trials has recently received considerable attention, both on academic and regulatory levels. While incorporating this data can lead to increased power and lower required sample sizes, it might also introduce bias to the effect estimators if temporal drifts are present in the trial. Aiming to mitigate the potential calendar time bias, we propose various frequentist model-based approaches that leverage the non-concurrent control data, while adjusting for time trends. One of the currently available frequentist models incorporates time as a categorical fixed effect, separating the duration of the trial into periods, defined as time intervals bounded by any treatment arm entering or leaving the platform. In this work, we propose two extensions of this model. First, we consider an alternative definition of the time covariate by dividing the trial into fixed-length calendar time intervals. Second, we propose alternative methods to adjust for time trends. In particular, we investigate adjusting for autocorrelated random effects to account for dependency between closer time intervals and employing spline regression to model time with a smooth polynomial function. 
We evaluate the performance of the proposed approaches in a simulation study and illustrate their use by means of a case study."}, "https://arxiv.org/abs/2403.14451": {"title": "Phenology curve estimation via a mixed model representation of functional principal components: Characterizing time series of satellite-derived vegetation indices", "link": "https://arxiv.org/abs/2403.14451", "description": "arXiv:2403.14451v1 Announce Type: new \nAbstract: Vegetation phenology consists of studying synchronous stationary events, such as the vegetation green up and leaves senescence, that can be construed as adaptive responses to climatic constraints. In this paper, we propose a method to estimate the annual phenology curve from multi-annual observations of time series of vegetation indices derived from satellite images. We fitted the classical harmonic regression model to annual-based time series in order to construe the original data set as realizations of a functional process. Hierarchical clustering was applied to define a nearly homogeneous group of annual (smoothed) time series from which a representative and idealized phenology curve was estimated at the pixel level. This curve resulted from fitting a mixed model, based on functional principal components, to the homogeneous group of time series. Leveraging the idealized phenology curve, we employed standard calculus criteria to estimate the following phenological parameters (stationary events): green up, start of season, maturity, senescence, end of season and dormancy. By applying the proposed methodology to four different data cubes (time series from 2000 to 2023 of a popular satellite-derived vegetation index) recorded across grasslands, forests, and annual rainfed agricultural zones of a Flora and Fauna Protected Area in northern Mexico, we verified that our approach characterizes properly the phenological cycle in vegetation with nearly periodic dynamics, such as grasslands and agricultural areas. The R package sephora was used for all computations in this paper."}, "https://arxiv.org/abs/2403.14452": {"title": "On Weighted Trigonometric Regression for Suboptimal Designs in Circadian Biology Studies", "link": "https://arxiv.org/abs/2403.14452", "description": "arXiv:2403.14452v1 Announce Type: new \nAbstract: Studies in circadian biology often use trigonometric regression to model phenomena over time. Ideally, protocols in these studies would collect samples at evenly distributed and equally spaced time points over a 24 hour period. This sample collection protocol is known as an equispaced design, which is considered the optimal experimental design for trigonometric regression under multiple statistical criteria. However, implementing equispaced designs in studies involving individuals is logistically challenging, and failure to employ an equispaced design could cause a loss of statistical power when performing hypothesis tests with an estimated model. This paper is motivated by the potential loss of statistical power during hypothesis testing, and considers a weighted trigonometric regression as a remedy. Specifically, the weights for this regression are the normalized reciprocals of estimates derived from a kernel density estimator for sample collection time, which inflates the weight of samples collected at underrepresented time points. 
A search procedure is also introduced to identify the concentration hyperparameter for kernel density estimation that maximizes the Hessian of weighted squared loss, which relates to both maximizing the $D$-optimality criterion from experimental design literature and minimizing the generalized variance. Simulation studies consistently demonstrate that this weighted regression mitigates variability in inferences produced by an estimated model. Illustrations with three real circadian biology data sets further indicate that this weighted regression consistently yields larger test statistics than its unweighted counterpart for first-order trigonometric regression, or cosinor regression, which is prevalent in circadian biology studies."}, "https://arxiv.org/abs/2403.14531": {"title": "Green's matching: an efficient approach to parameter estimation in complex dynamic systems", "link": "https://arxiv.org/abs/2403.14531", "description": "arXiv:2403.14531v1 Announce Type: new \nAbstract: Parameters of differential equations are essential to characterize intrinsic behaviors of dynamic systems. Numerous methods for estimating parameters in dynamic systems are computationally and/or statistically inadequate, especially for complex systems with general-order differential operators, such as motion dynamics. This article presents Green's matching, a computationally tractable and statistically efficient two-step method, which only needs to approximate trajectories in dynamic systems but not their derivatives due to the inverse of differential operators by Green's function. This yields a statistically optimal guarantee for parameter estimation in general-order equations, a feature not shared by existing methods, and provides an efficient framework for broad statistical inferences in complex dynamic systems."}, "https://arxiv.org/abs/2403.14563": {"title": "Evaluating the impact of instrumental variables in propensity score models using synthetic and negative control experiments", "link": "https://arxiv.org/abs/2403.14563", "description": "arXiv:2403.14563v1 Announce Type: new \nAbstract: In pharmacoepidemiology research, instrumental variables (IVs) are variables that strongly predict treatment but have no causal effect on the outcome of interest except through the treatment. There remain concerns about the inclusion of IVs in propensity score (PS) models amplifying estimation bias and reducing precision. Some PS modeling approaches attempt to address the potential effects of IVs, including selecting only covariates for the PS model that are strongly associated to the outcome of interest, thus screening out IVs. We conduct a study utilizing simulations and negative control experiments to evaluate the effect of IVs on PS model performance and to uncover best PS practices for real-world studies. We find that simulated IVs have a weak effect on bias and precision in both simulations and negative control experiments based on real-world data. In simulation experiments, PS methods that utilize outcome data, including the high-dimensional propensity score, produce the least estimation bias. However, in real-world settings underlying causal structures are unknown, and negative control experiments can illustrate a PS model's ability to minimize systematic bias. 
We find that large-scale, regularized regression based PS models in this case provide the most centered negative control distributions, suggesting superior performance in real-world scenarios."}, "https://arxiv.org/abs/2403.14573": {"title": "A Transfer Learning Causal Approach to Evaluate Racial/Ethnic and Geographic Variation in Outcomes Following Congenital Heart Surgery", "link": "https://arxiv.org/abs/2403.14573", "description": "arXiv:2403.14573v1 Announce Type: new \nAbstract: Congenital heart defects (CHD) are the most prevalent birth defects in the United States and surgical outcomes vary considerably across the country. The outcomes of treatment for CHD differ for specific patient subgroups, with non-Hispanic Black and Hispanic populations experiencing higher rates of mortality and morbidity. A valid comparison of outcomes within racial/ethnic subgroups is difficult given large differences in case-mix and small subgroup sizes. We propose a causal inference framework for outcome assessment and leverage advances in transfer learning to incorporate data from both target and source populations to help estimate causal effects while accounting for different sources of risk factor and outcome differences across populations. Using the Society of Thoracic Surgeons' Congenital Heart Surgery Database (STS-CHSD), we focus on a national cohort of patients undergoing the Norwood operation from 2016-2022 to assess operative mortality and morbidity outcomes across U.S. geographic regions by race/ethnicity. We find racial and ethnic outcome differences after controlling for potential confounding factors. While geography does not have a causal effect on outcomes for non-Hispanic Caucasian patients, non-Hispanic Black patients experience wide variability in outcomes with estimated 30-day mortality ranging from 5.9% (standard error 2.2%) to 21.6% (4.4%) across U.S. regions."}, "https://arxiv.org/abs/2403.14067": {"title": "Automatic Outlier Rectification via Optimal Transport", "link": "https://arxiv.org/abs/2403.14067", "description": "arXiv:2403.14067v1 Announce Type: cross \nAbstract: In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize an optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduced in this paper is the key to making our estimator effectively identify the outlier during the optimization process. We discuss the fundamental differences between our estimator and optimal transport-based distributionally robust optimization estimator. 
Finally, we demonstrate the effectiveness and superiority of our approach over conventional approaches in extensive simulation and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces."}, "https://arxiv.org/abs/2403.14385": {"title": "Estimating Causal Effects with Double Machine Learning -- A Method Evaluation", "link": "https://arxiv.org/abs/2403.14385", "description": "arXiv:2403.14385v1 Announce Type: cross \nAbstract: The estimation of causal effects with observational data continues to be a very active research area. In recent years, researchers have developed new frameworks which use machine learning to relax classical assumptions necessary for the estimation of causal effects. In this paper, we review one of the most prominent methods - \"double/debiased machine learning\" (DML) - and empirically evaluate it by comparing its performance on simulated data relative to more traditional statistical methods, before applying it to real-world data. Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships. This advantage enables a departure from traditional functional form assumptions typically necessary in causal effect estimation. However, we demonstrate that the method continues to critically depend on standard assumptions about causal structure and identification. When estimating the effects of air pollution on housing prices in our application, we find that DML estimates are consistently larger than estimates of less flexible methods. From our overall results, we provide actionable recommendations for specific choices researchers must make when applying DML in practice."}, "https://arxiv.org/abs/1905.07812": {"title": "Iterative Estimation of Nonparametric Regressions with Continuous Endogenous Variables and Discrete Instruments", "link": "https://arxiv.org/abs/1905.07812", "description": "arXiv:1905.07812v2 Announce Type: replace \nAbstract: We consider a nonparametric regression model with continuous endogenous independent variables when only discrete instruments are available that are independent of the error term. While this framework is very relevant for applied research, its implementation is cumbersome, as the regression function becomes the solution to a nonlinear integral equation. We propose a simple iterative procedure to estimate such models and showcase some of its asymptotic properties. In a simulation experiment, we discuss the details of its implementation in the case when the instrumental variable is binary. We conclude with an empirical application in which we examine the effect of pollution on house prices in a short panel of U.S. counties."}, "https://arxiv.org/abs/2111.10715": {"title": "Confidences in Hypotheses", "link": "https://arxiv.org/abs/2111.10715", "description": "arXiv:2111.10715v4 Announce Type: replace \nAbstract: This article outlines a broadly applicable new method of statistical analysis for situations involving two competing hypotheses. Hypotheses assessment is a frequentist procedure designed to answer the question: Given the sample evidence (and assumed model), what is the relative plausibility of each hypothesis? Our aim is to determine frequentist confidences in the hypotheses that are relevant to the data at hand and are as powerful as the particular application allows. 
Hypotheses assessments complement significance tests because providing confidences in the hypotheses in addition to test results can better inform applied researchers about the strength of evidence provided by the data. For simple hypotheses, the method produces minimum and maximum confidences in each hypothesis. The composite case is more complex, and we introduce two conventions to aid with understanding the strength of evidence. Assessments are qualitatively different from significance test and confidence interval outcomes, and thus fill a gap in the statistician's toolkit."}, "https://arxiv.org/abs/2201.06694": {"title": "Homophily in preferences or meetings? Identifying and estimating an iterative network formation model", "link": "https://arxiv.org/abs/2201.06694", "description": "arXiv:2201.06694v4 Announce Type: replace \nAbstract: Is homophily in social and economic networks driven by a taste for homogeneity (preferences) or by a higher probability of meeting individuals with similar attributes (opportunity)? This paper studies identification and estimation of an iterative network game that distinguishes between these two mechanisms. Our approach enables us to assess the counterfactual effects of changing the meeting protocol between agents. As an application, we study the role of preferences and meetings in shaping classroom friendship networks in Brazil. In a network structure in which homophily due to preferences is stronger than homophily due to meeting opportunities, tracking students may improve welfare. Still, the relative benefit of this policy diminishes over the school year."}, "https://arxiv.org/abs/2204.11318": {"title": "Identification and Statistical Decision Theory", "link": "https://arxiv.org/abs/2204.11318", "description": "arXiv:2204.11318v2 Announce Type: replace \nAbstract: Econometricians have usefully separated study of estimation into identification and statistical components. Identification analysis, which assumes knowledge of the probability distribution generating observable data, places an upper bound on what may be learned about population parameters of interest with finite sample data. Yet Wald's statistical decision theory studies decision making with sample data without reference to identification, indeed without reference to estimation. This paper asks if identification analysis is useful to statistical decision theory. The answer is positive, as it can yield an informative and tractable upper bound on the achievable finite sample performance of decision criteria. The reasoning is simple when the decision relevant parameter is point identified. It is more delicate when the true state is partially identified and a decision must be made under ambiguity. Then the performance of some criteria, such as minimax regret, is enhanced by randomizing choice of an action. This may be accomplished by making choice a function of sample data. I find it useful to recast choice of a statistical decision function as selection of choice probabilities for the elements of the choice set. Using sample data to randomize choice conceptually differs from and is complementary to its traditional use to estimate population parameters."}, "https://arxiv.org/abs/2303.10712": {"title": "Mixture of segmentation for heterogeneous functional data", "link": "https://arxiv.org/abs/2303.10712", "description": "arXiv:2303.10712v2 Announce Type: replace \nAbstract: In this paper we consider functional data with heterogeneity in time and in population. 
We propose a mixture model with segmentation of time to represent this heterogeneity while keeping the functional structure. Maximum likelihood estimator is considered, proved to be identifiable and consistent. In practice, an EM algorithm is used, combined with dynamic programming for the maximization step, to approximate the maximum likelihood estimator. The method is illustrated on a simulated dataset, and used on a real dataset of electricity consumption."}, "https://arxiv.org/abs/2307.16798": {"title": "Forster-Warmuth Counterfactual Regression: A Unified Learning Approach", "link": "https://arxiv.org/abs/2307.16798", "description": "arXiv:2307.16798v4 Announce Type: replace \nAbstract: Series or orthogonal basis regression is one of the most popular non-parametric regression techniques in practice, obtained by regressing the response on features generated by evaluating the basis functions at observed covariate values. The most routinely used series estimator is based on ordinary least squares fitting, which is known to be minimax rate optimal in various settings, albeit under stringent restrictions on the basis functions and the distribution of covariates. In this work, inspired by the recently developed Forster-Warmuth (FW) learner, we propose an alternative series regression estimator that can attain the minimax estimation rate under strictly weaker conditions imposed on the basis functions and the joint law of covariates, than existing series estimators in the literature. Moreover, a key contribution of this work generalizes the FW-learner to a so-called counterfactual regression problem, in which the response variable of interest may not be directly observed (hence, the name ``counterfactual'') on all sampled units, and therefore needs to be inferred in order to identify and estimate the regression in view from the observed data. Although counterfactual regression is not entirely a new area of inquiry, we propose the first-ever systematic study of this challenging problem from a unified pseudo-outcome perspective. In fact, we provide what appears to be the first generic and constructive approach for generating the pseudo-outcome (to substitute for the unobserved response) which leads to the estimation of the counterfactual regression curve of interest with small bias, namely bias of second order. Several applications are used to illustrate the resulting FW-learner including many nonparametric regression problems in missing data and causal inference literature, for which we establish high-level conditions for minimax rate optimality of the proposed FW-learner."}, "https://arxiv.org/abs/2312.16241": {"title": "Analysis of Pleiotropy for Testosterone and Lipid Profiles in Males and Females", "link": "https://arxiv.org/abs/2312.16241", "description": "arXiv:2312.16241v2 Announce Type: replace \nAbstract: In modern scientific studies, it is often imperative to determine whether a set of phenotypes is affected by a single factor. If such an influence is identified, it becomes essential to discern whether this effect is contingent upon categories such as sex or age group, and importantly, to understand whether this dependence is rooted in purely non-environmental reasons. The exploration of such dependencies often involves studying pleiotropy, a phenomenon wherein a single genetic locus impacts multiple traits. 
This heightened interest in uncovering dependencies by pleiotropy is fueled by the growing accessibility of summary statistics from genome-wide association studies (GWAS) and the establishment of thoroughly phenotyped sample collections. This advancement enables a systematic and comprehensive exploration of the genetic connections among various traits and diseases. Additive genetic correlation illuminates the genetic connection between two traits, providing valuable insights into the shared biological pathways and underlying causal relationships between them. In this paper, we present a novel method to analyze such dependencies by studying additive genetic correlations between pairs of traits under consideration. Subsequently, we employ matrix comparison techniques to discern and elucidate sex-specific or age-group-specific associations, contributing to a deeper understanding of the nuanced dependencies within the studied traits. Our proposed method is computationally handy and requires only GWAS summary statistics. We validate our method by applying it to the UK Biobank data and present the results."}, "https://arxiv.org/abs/2312.10569": {"title": "Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data", "link": "https://arxiv.org/abs/2312.10569", "description": "arXiv:2312.10569v2 Announce Type: replace-cross \nAbstract: Many modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss of information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS' ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks."}, "https://arxiv.org/abs/2403.14899": {"title": "Statistical Inference For Noisy Matrix Completion Incorporating Auxiliary Information", "link": "https://arxiv.org/abs/2403.14899", "description": "arXiv:2403.14899v1 Announce Type: new \nAbstract: This paper investigates statistical inference for noisy matrix completion in a semi-supervised model when auxiliary covariates are available. The model consists of two parts. One part is a low-rank matrix induced by unobserved latent factors; the other part models the effects of the observed covariates through a coefficient matrix which is composed of high-dimensional column vectors. We model the observational pattern of the responses through a logistic regression of the covariates, and allow its probability to go to zero as the sample size increases. We apply an iterative least squares (LS) estimation approach in our considered context. 
The iterative LS methods in general enjoy a low computational cost, but deriving the statistical properties of the resulting estimators is a challenging task. We show that our method only needs a few iterations, and the resulting entry-wise estimators of the low-rank matrix and the coefficient matrix are guaranteed to have asymptotic normal distributions. As a result, individual inference can be conducted for each entry of the unknown matrices. We also propose a simultaneous testing procedure with multiplier bootstrap for the high-dimensional coefficient matrix. This simultaneous inferential tool can help us further investigate the effects of covariates for the prediction of missing entries."}, "https://arxiv.org/abs/2403.14925": {"title": "Computational Approaches for Exponential-Family Factor Analysis", "link": "https://arxiv.org/abs/2403.14925", "description": "arXiv:2403.14925v1 Announce Type: new \nAbstract: We study a general factor analysis framework where the $n$-by-$p$ data matrix is assumed to follow a general exponential family distribution entry-wise. While this model framework has been proposed before, we here further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship on data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible and able to capture non-Gaussian covariance structures of the data matrix. Our main focus is on efficient computational approaches to perform the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method was shown to lead to asymptotic bias when the simulated sample size grows slower than the square root of the sample size $n$, eliminating its practical application for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solution does not show asymptotic biases, and scales even better for large matrix factorizations with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss its application in three case studies."}, "https://arxiv.org/abs/2403.14954": {"title": "Creating a Spatial Vulnerability Index for Environmental Health", "link": "https://arxiv.org/abs/2403.14954", "description": "arXiv:2403.14954v1 Announce Type: new \nAbstract: Extreme natural hazards are increasing in frequency and intensity. These natural changes in our environment, combined with man-made pollution, have substantial economic, social and health impacts globally. The impact of the environment on human health (environmental health) is becoming well understood in international research literature. However, there are significant barriers to understanding key characteristics of this impact, related to substantial data volumes, data access rights and the time required to compile and compare data over regions and time. 
This study aims to reduce these barriers in Australia by creating an open data repository of national environmental health data and presenting a methodology for the production of health outcome-weighted population vulnerability indices related to extreme heat, extreme cold and air pollution at various temporal and geographical resolutions.\n Current state-of-the-art methods for the calculation of vulnerability indices include equal weight percentile ranking and the use of principal component analysis (PCA). The weighted vulnerability index methodology proposed in this study offers an advantage over others in the literature by considering health outcomes in the calculation process. The resulting vulnerability percentiles more clearly align population sensitivity and adaptive capacity with health risks. The temporal and spatial resolutions of the indices enable national monitoring on a scale never before seen across Australia. Additionally, we show that a weekly temporal resolution can be used to identify spikes in vulnerability due to changes in relative national environmental exposure."}, "https://arxiv.org/abs/2403.15111": {"title": "Fast TTC Computation", "link": "https://arxiv.org/abs/2403.15111", "description": "arXiv:2403.15111v1 Announce Type: new \nAbstract: This paper proposes a fast Markov Matrix-based methodology for computing Top Trading Cycles (TTC) that delivers O(1) computational speed, that is speed independent of the number of agents and objects in the system. The proposed methodology is well suited for complex large-dimensional problems like housing choice. The methodology retains all the properties of TTC, namely, Pareto-efficiency, individual rationality and strategy-proofness."}, "https://arxiv.org/abs/2403.15220": {"title": "Modelling with Discretized Variables", "link": "https://arxiv.org/abs/2403.15220", "description": "arXiv:2403.15220v1 Announce Type: new \nAbstract: This paper deals with econometric models in which the dependent variable, some explanatory variables, or both are observed as censored interval data. This discretization often happens due to confidentiality of sensitive variables like income. Models using these variables cannot point identify regression parameters as the conditional moments are unknown, which led the literature to use interval estimates. Here, we propose a discretization method through which the regression parameters can be point identified while preserving data confidentiality. We demonstrate the asymptotic properties of the OLS estimator for the parameters in multivariate linear regressions for cross-sectional data. The theoretical findings are supported by Monte Carlo experiments and illustrated with an application to the Australian gender wage gap."}, "https://arxiv.org/abs/2403.15258": {"title": "Tests for almost stochastic dominance", "link": "https://arxiv.org/abs/2403.15258", "description": "arXiv:2403.15258v1 Announce Type: new \nAbstract: We introduce a 2-dimensional stochastic dominance (2DSD) index to characterize both strict and almost stochastic dominance. Based on this index, we derive an estimator for the minimum violation ratio (MVR), also known as the critical parameter, of the almost stochastic ordering condition between two variables. We determine the asymptotic properties of the empirical 2DSD index and MVR for the most frequently used stochastic orders. We also provide conditions under which the bootstrap estimators of these quantities are strongly consistent. 
As an application, we develop consistent bootstrap testing procedures for almost stochastic dominance. The performance of the tests is checked via simulations and the analysis of real data."}, "https://arxiv.org/abs/2403.15302": {"title": "Optimal Survival Analyses With Prevalent and Incident Patients", "link": "https://arxiv.org/abs/2403.15302", "description": "arXiv:2403.15302v1 Announce Type: new \nAbstract: Period-prevalent cohorts are often used for their cost-saving potential in epidemiological studies of survival outcomes. Under this design, prevalent patients allow for evaluations of long-term survival outcomes without the need for long follow-up, whereas incident patients allow for evaluations of short-term survival outcomes without the issue of left-truncation. In most period-prevalent survival analyses from the existing literature, patients have been recruited to achieve an overall sample size, with little attention given to the relative frequencies of prevalent and incident patients and their statistical implications. Furthermore, there are no existing methods available to rigorously quantify the impact of these relative frequencies on estimation and inference and incorporate this information into study design strategies. To address these gaps, we develop an approach to identify the optimal mix of prevalent and incident patients that maximizes precision over the entire estimated survival curve, subject to a flexible weighting scheme. In addition, we prove that inference based on the weighted log-rank test or Cox proportional hazards model is most powerful with an entirely prevalent or incident cohort, and we derive theoretical formulas to determine the optimal choice. Simulations confirm the validity of the proposed optimization criteria and show that substantial efficiency gains can be achieved by recruiting the optimal mix of prevalent and incident patients. The proposed methods are applied to assess waitlist outcomes among kidney transplant candidates."}, "https://arxiv.org/abs/2403.15327": {"title": "On two-sample testing for data with arbitrarily missing values", "link": "https://arxiv.org/abs/2403.15327", "description": "arXiv:2403.15327v1 Announce Type: new \nAbstract: We develop a new rank-based approach for univariate two-sample testing in the presence of missing data which makes no assumptions about the missingness mechanism. This approach is a theoretical extension of the Wilcoxon-Mann-Whitney test that controls the Type I error by providing exact bounds for the test statistic after accounting for the number of missing values. Greater statistical power is shown when the method is extended to account for a bounded domain. Furthermore, exact bounds are provided on the proportions of data that can be missing in the two samples while yielding a significant result. Simulations demonstrate that our method has good power, typically for cases of $10\\%$ to $20\\%$ missing data, while standard imputation approaches fail to control the Type I error. We illustrate our method on complex clinical trial data in which patients' withdrawal from the trial lead to missing values."}, "https://arxiv.org/abs/2403.15384": {"title": "Unifying area and unit-level small area estimation through calibration", "link": "https://arxiv.org/abs/2403.15384", "description": "arXiv:2403.15384v1 Announce Type: new \nAbstract: When estimating area means, direct estimators based on area-specific data, are usually consistent under the sampling design without model assumptions. 
However, they are inefficient if the area sample size is small. In small area estimation, model assumptions linking the areas are used to \"borrow strength\" from other areas. The basic area-level model provides design-consistent estimators but error variances are assumed to be known. In practice, they are estimated with the (scarce) area-specific data. These estimators are inefficient, and their error is not accounted for in the associated mean squared error estimators. Unit-level models do not require the error variances to be known but do not account for the survey design. Here we describe a unified estimator of an area mean that may be obtained from either an area-level model or a unit-level model and based on consistent estimators of the model error variances as the number of areas increases. We propose bootstrap mean squared error estimators that account for the uncertainty due to the estimation of the error variances. We show better performance of the new small area estimators and our bootstrap estimators of the mean squared error. We apply the results to education data from Colombia."}, "https://arxiv.org/abs/2403.15175": {"title": "Double Cross-fit Doubly Robust Estimators: Beyond Series Regression", "link": "https://arxiv.org/abs/2403.15175", "description": "arXiv:2403.15175v1 Announce Type: cross \nAbstract: Doubly robust estimators with cross-fitting have gained popularity in causal inference due to their favorable structure-agnostic error guarantees. However, when additional structure, such as H\\\"{o}lder smoothness, is available, then more accurate \"double cross-fit doubly robust\" (DCDR) estimators can be constructed by splitting the training data and undersmoothing nuisance function estimators on independent samples. We study a DCDR estimator of the Expected Conditional Covariance, a functional of interest in causal inference and conditional independence testing, and derive a series of increasingly powerful results with progressively stronger assumptions. We first provide a structure-agnostic error analysis for the DCDR estimator with no assumptions on the nuisance functions or their estimators. Then, assuming the nuisance functions are H\\\"{o}lder smooth, but without assuming knowledge of the true smoothness level or the covariate density, we establish that DCDR estimators with several linear smoothers are semiparametric efficient under minimal conditions and achieve fast convergence rates in the non-$\\sqrt{n}$ regime. When the covariate density and smoothnesses are known, we propose a minimax rate-optimal DCDR estimator based on undersmoothed kernel regression. Moreover, we show an undersmoothed DCDR estimator satisfies a slower-than-$\\sqrt{n}$ central limit theorem, and that inference is possible even in the non-$\\sqrt{n}$ regime. 
Finally, we support our theoretical results with simulations, providing intuition for double cross-fitting and undersmoothing, demonstrating where our estimator achieves semiparametric efficiency while the usual \"single cross-fit\" estimator fails, and illustrating asymptotic normality for the undersmoothed DCDR estimator."}, "https://arxiv.org/abs/2403.15198": {"title": "On the Weighted Top-Difference Distance: Axioms, Aggregation, and Approximation", "link": "https://arxiv.org/abs/2403.15198", "description": "arXiv:2403.15198v1 Announce Type: cross \nAbstract: We study a family of distance functions on rankings that allow for asymmetric treatments of alternatives and consider the distinct relevance of the top and bottom positions for ordered lists. We provide a full axiomatic characterization of our distance. In doing so, we retrieve new characterizations of existing axioms and show how to effectively weaken them for our purposes. This analysis highlights the generality of our distance as it embeds many (semi)metrics previously proposed in the literature. Subsequently, we show that, notwithstanding its level of generality, our distance is still readily applicable. We apply it to preference aggregation, studying the features of the associated median voting rule. It is shown how the derived preference function satisfies many desirable features in the context of voting rules, ranging from fairness to majority and Pareto-related properties. We show how to compute consensus rankings exactly, and provide generalized Diaconis-Graham inequalities that can be leveraged to obtain approximation algorithms. Finally, we propose some truncation ideas for our distances inspired by Lu and Boutilier (2010). These can be leveraged to devise a Polynomial-Time-Approximation Scheme for the corresponding rank aggregation problem."}, "https://arxiv.org/abs/2403.15238": {"title": "WEEP: A method for spatial interpretation of weakly supervised CNN models in computational pathology", "link": "https://arxiv.org/abs/2403.15238", "description": "arXiv:2403.15238v1 Announce Type: cross \nAbstract: Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel method, Wsi rEgion sElection aPproach (WEEP), for model interpretation. It provides a principled yet straightforward way to establish the spatial area of WSI required for assigning a particular prediction label. We demonstrate WEEP on a binary classification task in the area of breast cancer computational pathology. WEEP is easy to implement, is directly connected to the model-based decision process, and offers information relevant to both research and diagnostic applications."}, "https://arxiv.org/abs/2207.02626": {"title": "Estimating the limiting shape of bivariate scaled sample clouds: with additional benefits of self-consistent inference for existing extremal dependence properties", "link": "https://arxiv.org/abs/2207.02626", "description": "arXiv:2207.02626v3 Announce Type: replace \nAbstract: The key to successful statistical analysis of bivariate extreme events lies in flexible modelling of the tail dependence relationship between the two variables. 
In the extreme value theory literature, various techniques are available to model separate aspects of tail dependence, based on different asymptotic limits. Results from Balkema and Nolde (2010) and Nolde (2014) highlight the importance of studying the limiting shape of an appropriately-scaled sample cloud when characterising the whole joint tail. We now develop the first statistical inference for this limit set, which has considerable practical importance for a unified inference framework across different aspects of the joint tail. Moreover, Nolde and Wadsworth (2022) link this limit set to various existing extremal dependence frameworks. Hence, a by-product of our new limit set inference is the first set of self-consistent estimators for several extremal dependence measures, avoiding the current possibility of contradictory conclusions. In simulations, our limit set estimator is successful across a range of distributions, and the corresponding extremal dependence estimators provide a major joint improvement and small marginal improvements over existing techniques. We consider an application to sea wave heights, where our estimates successfully capture the expected weakening extremal dependence as the distance between locations increases."}, "https://arxiv.org/abs/2211.04034": {"title": "Structured Mixture of Continuation-ratio Logits Models for Ordinal Regression", "link": "https://arxiv.org/abs/2211.04034", "description": "arXiv:2211.04034v2 Announce Type: replace \nAbstract: We develop a nonparametric Bayesian modeling approach to ordinal regression based on priors placed directly on the discrete distribution of the ordinal responses. The prior probability models are built from a structured mixture of multinomial distributions. We leverage a continuation-ratio logits representation to formulate the mixture kernel, with mixture weights defined through the logit stick-breaking process that incorporates the covariates through a linear function. The implied regression functions for the response probabilities can be expressed as weighted sums of parametric regression functions, with covariate-dependent weights. Thus, the modeling approach achieves flexible ordinal regression relationships, avoiding linearity or additivity assumptions in the covariate effects. Model flexibility is formally explored through the Kullback-Leibler support of the prior probability model. A key model feature is that the parameters for both the mixture kernel and the mixture weights can be associated with a continuation-ratio logits regression structure. Hence, an efficient and relatively easy to implement posterior simulation method can be designed, using P\\'olya-Gamma data augmentation. Moreover, the model is built from a conditional independence structure for category-specific parameters, which results in additional computational efficiency gains through partial parallel sampling. In addition to the general mixture structure, we study simplified model versions that incorporate covariate dependence only in the mixture kernel parameters or only in the mixture weights. For all proposed models, we discuss approaches to prior specification and develop Markov chain Monte Carlo methods for posterior simulation. 
The methodology is illustrated with several synthetic and real data examples."}, "https://arxiv.org/abs/2302.06054": {"title": "Single Proxy Control", "link": "https://arxiv.org/abs/2302.06054", "description": "arXiv:2302.06054v5 Announce Type: replace \nAbstract: Negative control variables are sometimes used in non-experimental studies to detect the presence of confounding by hidden factors. A negative control outcome (NCO) is an outcome that is influenced by unobserved confounders of the exposure effects on the outcome in view, but is not causally impacted by the exposure. Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration Approach (COCA) as a formal NCO counterfactual method to detect and correct for residual confounding bias. For identification, COCA treats the NCO as an error-prone proxy of the treatment-free counterfactual outcome of interest, and involves regressing the NCO on the treatment-free counterfactual, together with a rank-preserving structural model which assumes a constant individual-level causal effect. In this work, we establish nonparametric COCA identification for the average causal effect for the treated, without requiring rank-preservation, therefore accommodating unrestricted effect heterogeneity across units. This nonparametric identification result has important practical implications, as it provides single proxy confounding control, in contrast to recently proposed proximal causal inference, which relies for identification on a pair of confounding proxies. For COCA estimation we propose three separate strategies: (i) an extended propensity score approach, (ii) an outcome bridge function approach, and (iii) a doubly-robust approach. Finally, we illustrate the proposed methods in an application evaluating the causal impact of a Zika virus outbreak on birth rate in Brazil."}, "https://arxiv.org/abs/2309.06985": {"title": "CARE: Large Precision Matrix Estimation for Compositional Data", "link": "https://arxiv.org/abs/2309.06985", "description": "arXiv:2309.06985v2 Announce Type: replace \nAbstract: High-dimensional compositional data are prevalent in many applications. The simplex constraint poses intrinsic challenges to inferring the conditional dependence relationships among the components forming a composition, as encoded by a large precision matrix. We introduce a precise specification of the compositional precision matrix and relate it to its basis counterpart, which is shown to be asymptotically identifiable under suitable sparsity assumptions. By exploiting this connection, we propose a composition adaptive regularized estimation (CARE) method for estimating the sparse basis precision matrix. We derive rates of convergence for the estimator and provide theoretical guarantees on support recovery and data-driven parameter tuning. Our theory reveals an intriguing trade-off between identification and estimation, thereby highlighting the blessing of dimensionality in compositional data analysis. In particular, in sufficiently high dimensions, the CARE estimator achieves minimax optimality and performs as well as if the basis were observed. We further discuss how our framework can be extended to handle data containing zeros, including sampling zeros and structural zeros. 
The advantages of CARE over existing methods are illustrated by simulation studies and an application to inferring microbial ecological networks in the human gut."}, "https://arxiv.org/abs/2311.06086": {"title": "A three-step approach to production frontier estimation and the Matsuoka's distribution", "link": "https://arxiv.org/abs/2311.06086", "description": "arXiv:2311.06086v2 Announce Type: replace \nAbstract: In this work, we introduce a three-step semiparametric methodology for the estimation of production frontiers. We consider a model inspired by the well-known Cobb-Douglas production function, wherein input factors operate multiplicatively within the model. Efficiency in the proposed model is assumed to follow a continuous univariate uniparametric distribution in $(0,1)$, referred to as Matsuoka's distribution, which is discussed in detail. Following model linearization, the first step is to semiparametrically estimate the regression function through a local linear smoother. The second step focuses on the estimation of the efficiency parameter. Finally, we estimate the production frontier through a plug-in methodology. We present a rigorous asymptotic theory related to the proposed three-step estimation, including consistency, and asymptotic normality, and derive rates for the convergences presented. Incidentally, we also study the Matsuoka's distribution, deriving its main properties. The Matsuoka's distribution exhibits a versatile array of shapes capable of effectively encapsulating the typical behavior of efficiency within production frontier models. To complement the large sample results obtained, a Monte Carlo simulation study is conducted to assess the finite sample performance of the proposed three-step methodology. An empirical application using a dataset of Danish milk producers is also presented."}, "https://arxiv.org/abs/2311.15458": {"title": "Causal Models for Longitudinal and Panel Data: A Survey", "link": "https://arxiv.org/abs/2311.15458", "description": "arXiv:2311.15458v2 Announce Type: replace \nAbstract: This survey discusses the recent causal panel data literature. This recent literature has focused on credibly estimating causal effects of binary interventions in settings with longitudinal data, with an emphasis on practical advice for empirical researchers. It pays particular attention to heterogeneity in the causal effects, often in situations where few units are treated and with particular structures on the assignment pattern. The literature has extended earlier work on difference-in-differences or two-way-fixed-effect estimators. It has more generally incorporated factor models or interactive fixed effects. It has also developed novel methods using synthetic control approaches."}, "https://arxiv.org/abs/2312.08530": {"title": "Using Model-Assisted Calibration Methods to Improve Efficiency of Regression Analyses with Two-Phase Samples under Complex Survey Designs", "link": "https://arxiv.org/abs/2312.08530", "description": "arXiv:2312.08530v2 Announce Type: replace \nAbstract: Two-phase sampling designs are frequently employed in epidemiological studies and large-scale health surveys. In such designs, certain variables are exclusively collected within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or measurement assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. 
Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators. However, no existing methods provide appropriate calibration auxiliary variables while simultaneously considering the complex sample designs present in both the first- and second-phase samples in regression analyses. This paper proposes to calibrate the sample weights for the second-phase subsample to the weighted entire first-phase sample based on score functions of regression coefficients by using predictions of the covariate of interest, which can be computed for the entire first-phase sample. We establish the consistency of the proposed calibration estimation and provide variance estimation. Empirical evidence underscores the robustness of the calibration on score functions compared to the imputation method, which can be sensitive to misspecified prediction models for the variable only collected in the second phase. Examples using data from the National Health and Nutrition Examination Survey are provided."}, "https://arxiv.org/abs/2312.15496": {"title": "A Simple Bias Reduction for Chatterjee's Correlation", "link": "https://arxiv.org/abs/2312.15496", "description": "arXiv:2312.15496v2 Announce Type: replace \nAbstract: Chatterjee's rank correlation coefficient $\xi_n$ is an empirical index for detecting functional dependencies between two variables $X$ and $Y$. It is an estimator for a theoretical quantity $\xi$ that is zero for independence and one if $Y$ is a measurable function of $X$. Based on an equivalent characterization of sorted numbers, we derive an upper bound for $\xi_n$ and suggest a simple normalization aimed at reducing its bias for small sample size $n$. In Monte Carlo simulations of various cases, the normalization reduced the bias in all cases. The mean squared error was reduced, too, for values of $\xi$ greater than about 0.4. Moreover, we observed that non-parametric confidence intervals for $\xi$ based on bootstrapping $\xi_n$ in the usual n-out-of-n way have a coverage probability close to zero. This is remedied by an m-out-of-n bootstrap without replacement in combination with our normalization method."}, "https://arxiv.org/abs/2203.13776": {"title": "Sharp adaptive and pathwise stable similarity testing for scalar ergodic diffusions", "link": "https://arxiv.org/abs/2203.13776", "description": "arXiv:2203.13776v2 Announce Type: replace-cross \nAbstract: Within the nonparametric diffusion model, we develop a multiple test to infer about similarity of an unknown drift $b$ to some reference drift $b_0$: At prescribed significance, we simultaneously identify those regions where violation from similarity occurs, without a priori knowledge of their number, size and location. This test is shown to be minimax-optimal and adaptive. At the same time, the procedure is robust under small deviation from Brownian motion as the driving noise process. A detailed investigation for fractional driving noise, which is neither a semimartingale nor a Markov process, is provided for Hurst indices close to the Brownian motion case."}, "https://arxiv.org/abs/2306.15209": {"title": "Dynamic Reconfiguration of Brain Functional Network in Stroke", "link": "https://arxiv.org/abs/2306.15209", "description": "arXiv:2306.15209v2 Announce Type: replace-cross \nAbstract: The brain continually reorganizes its functional network to adapt to post-stroke functional impairments.
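The Chatterjee-correlation entry above is built around the empirical coefficient $\xi_n$. As a point of reference, here is a minimal NumPy sketch of the standard (unnormalized) $\xi_n$ for tie-free data; the paper's upper bound and bias-reducing normalization are not reproduced, and `chatterjee_xi` is just an illustrative name.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Plain Chatterjee rank correlation xi_n for tie-free data.

    This is the baseline estimator that the bias-reduction paper modifies;
    the paper's own normalization is not implemented here.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    y_sorted = y[np.argsort(x)]                  # order the pairs by x
    r = np.argsort(np.argsort(y_sorted)) + 1     # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(chatterjee_xi(x, np.sin(3 * x)))          # near 1: y is a function of x
print(chatterjee_xi(x, rng.normal(size=500)))   # near 0: independence
```

With a noiseless functional relationship the statistic approaches one and under independence it hovers near zero, matching the behaviour of $\xi$ described in the abstract.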
Previous studies using static modularity analysis have presented global-level behavior patterns of this network reorganization. However, it is far from understood how the brain reconfigures its functional network dynamically following a stroke. This study collected resting-state functional MRI data from 15 stroke patients, with mild (n = 6) and severe (n = 9) two subgroups based on their clinical symptoms. Additionally, 15 age-matched healthy subjects were considered as controls. By applying a multilayer network method, a dynamic modular structure was recognized based on a time-resolved function network. Then dynamic network measurements (recruitment, integration, and flexibility) were calculated to characterize the dynamic reconfiguration of post-stroke brain functional networks, hence, to reveal the neural functional rebuilding process. It was found from this investigation that severe patients tended to have reduced recruitment and increased between-network integration, while mild patients exhibited low network flexibility and less network integration. It is also noted that this severity-dependent alteration in network interaction was not able to be revealed by previous studies using static methods. Clinically, the obtained knowledge of the diverse patterns of dynamic adjustment in brain functional networks observed from the brain signal could help understand the underlying mechanism of the motor, speech, and cognitive functional impairments caused by stroke attacks. The proposed method not only could be used to evaluate patients' current brain status but also has the potential to provide insights into prognosis analysis and prediction."}, "https://arxiv.org/abs/2403.15670": {"title": "Computationally Scalable Bayesian SPDE Modeling for Censored Spatial Responses", "link": "https://arxiv.org/abs/2403.15670", "description": "arXiv:2403.15670v1 Announce Type: new \nAbstract: Observations of groundwater pollutants, such as arsenic or Perfluorooctane sulfonate (PFOS), are riddled with left censoring. These measurements have impact on the health and lifestyle of the populace. Left censoring of these spatially correlated observations are usually addressed by applying Gaussian processes (GPs), which have theoretical advantages. However, this comes with a challenging computational complexity of $\\mathcal{O}(n^3)$, which is impractical for large datasets. Additionally, a sizable proportion of the data being left-censored creates further bottlenecks, since the likelihood computation now involves an intractable high-dimensional integral of the multivariate Gaussian density. In this article, we tackle these two problems simultaneously by approximating the GP with a Gaussian Markov random field (GMRF) approach that exploits an explicit link between a GP with Mat\\'ern correlation function and a GMRF using stochastic partial differential equations (SPDEs). We introduce a GMRF-based measurement error into the model, which alleviates the likelihood computation for the censored data, drastically improving the speed of the model while maintaining admirable accuracy. Our approach demonstrates robustness and substantial computational scalability, compared to state-of-the-art methods for censored spatial responses across various simulation settings. 
Finally, the fit of this fully Bayesian model to the concentration of PFOS in groundwater available at 24,959 sites across California, where 46.62\\% responses are censored, produces prediction surface and uncertainty quantification in real time, thereby substantiating the applicability and scalability of the proposed method. Code for implementation is made available via GitHub."}, "https://arxiv.org/abs/2403.15755": {"title": "Optimized Model Selection for Estimating Treatment Effects from Costly Simulations of the US Opioid Epidemic", "link": "https://arxiv.org/abs/2403.15755", "description": "arXiv:2403.15755v1 Announce Type: new \nAbstract: Agent-based simulation with a synthetic population can help us compare different treatment conditions while keeping everything else constant within the same population (i.e., as digital twins). Such population-scale simulations require large computational power (i.e., CPU resources) to get accurate estimates for treatment effects. We can use meta models of the simulation results to circumvent the need to simulate every treatment condition. Selecting the best estimating model at a given sample size (number of simulation runs) is a crucial problem. Depending on the sample size, the ability of the method to estimate accurately can change significantly. In this paper, we discuss different methods to explore what model works best at a specific sample size. In addition to the empirical results, we provide a mathematical analysis of the MSE equation and how its components decide which model to select and why a specific method behaves that way in a range of sample sizes. The analysis showed why the direction estimation method is better than model-based methods in larger sample sizes and how the between-group variation and the within-group variation affect the MSE equation."}, "https://arxiv.org/abs/2403.15778": {"title": "Supervised Learning via Ensembles of Diverse Functional Representations: the Functional Voting Classifier", "link": "https://arxiv.org/abs/2403.15778", "description": "arXiv:2403.15778v1 Announce Type: new \nAbstract: Many conventional statistical and machine learning methods face challenges when applied directly to high dimensional temporal observations. In recent decades, Functional Data Analysis (FDA) has gained widespread popularity as a framework for modeling and analyzing data that are, by their nature, functions in the domain of time. Although supervised classification has been extensively explored in recent decades within the FDA literature, ensemble learning of functional classifiers has only recently emerged as a topic of significant interest. Thus, the latter subject presents unexplored facets and challenges from various statistical perspectives. The focal point of this paper lies in the realm of ensemble learning for functional data and aims to show how different functional data representations can be used to train ensemble members and how base model predictions can be combined through majority voting. The so-called Functional Voting Classifier (FVC) is proposed to demonstrate how different functional representations leading to augmented diversity can increase predictive accuracy. Many real-world datasets from several domains are used to display that the FVC can significantly enhance performance compared to individual models. 
The framework presented provides a foundation for voting ensembles with functional data and can stimulate a highly encouraging line of research in the FDA context."}, "https://arxiv.org/abs/2403.15802": {"title": "Augmented Doubly Robust Post-Imputation Inference for Proteomic Data", "link": "https://arxiv.org/abs/2403.15802", "description": "arXiv:2403.15802v1 Announce Type: new \nAbstract: Quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of proteins in molecular mechanisms. However, analysis of such data is challenging due to the large proportion of missing values. A common strategy to address this issue is to utilize an imputed dataset, which often introduces systematic bias into downstream analyses if the imputation errors are ignored. In this paper, we propose a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework combines powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data, and a parametric model to estimate the propensity score for debiasing imputed outcomes. Our estimator is compatible with the double machine learning framework and has provable properties. In application to both single-cell and bulk-cell proteomic data our method utilizes the imputed data to gain additional, meaningful discoveries and yet maintains good control of false positives."}, "https://arxiv.org/abs/2403.15862": {"title": "Non-monotone dependence modeling with copulas: an application to the volume-return relationship", "link": "https://arxiv.org/abs/2403.15862", "description": "arXiv:2403.15862v1 Announce Type: new \nAbstract: This paper introduces an innovative method for constructing copula models capable of describing arbitrary non-monotone dependence structures. The proposed method enables the creation of such copulas in parametric form, thus allowing the resulting models to adapt to diverse and intricate real-world data patterns. We apply this novel methodology to analyze the relationship between returns and trading volumes in financial markets, a domain where the existence of non-monotone dependencies is well-documented in the existing literature. Our approach exhibits superior adaptability compared to other models which have previously been proposed in the literature, enabling a deeper understanding of the dependence structure among the considered variables."}, "https://arxiv.org/abs/2403.15877": {"title": "Integrated path stability selection", "link": "https://arxiv.org/abs/2403.15877", "description": "arXiv:2403.15877v1 Announce Type: new \nAbstract: Stability selection is a widely used method for improving the performance of feature selection algorithms. However, stability selection has been found to be highly conservative, resulting in low sensitivity. Further, the theoretical bound on the expected number of false positives, E(FP), is relatively loose, making it difficult to know how many false positives to expect in practice. In this paper, we introduce a novel method for stability selection based on integrating the stability paths rather than maximizing over them. This yields a tighter bound on E(FP), resulting in a feature selection criterion that has higher sensitivity in practice and is better calibrated in terms of matching the target E(FP). 
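The integrated path stability selection abstract contrasts integrating the selection-frequency paths over the regularization grid with the classical maximize-over-the-grid rule. The sketch below computes both summaries, assuming a lasso base learner and half-sample subsampling (standard stability-selection choices, not taken from the paper); the paper's E(FP) bound and the resulting selection threshold are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, B = 200, 50, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                # five true signal features
y = X @ beta + rng.normal(size=n)

lambdas = np.geomspace(1.0, 0.05, 25)         # regularization grid
freq = np.zeros((len(lambdas), p))            # selection frequencies pi_k(lambda)

for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)   # half-sample subsample
    for j, lam in enumerate(lambdas):
        coef = Lasso(alpha=lam, max_iter=5000).fit(X[idx], y[idx]).coef_
        freq[j] += coef != 0
freq /= B

max_stability = freq.max(axis=0)          # classical stability selection statistic
integrated_stability = freq.mean(axis=0)  # integrate (average) over the grid instead
print(np.argsort(-integrated_stability)[:5])   # most stable features
```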
Our proposed method requires the same amount of computation as the original stability selection algorithm, and only requires the user to specify one input parameter, a target value for E(FP). We provide theoretical bounds on performance, and demonstrate the method on simulations and real data from cancer gene expression studies."}, "https://arxiv.org/abs/2403.15910": {"title": "Difference-in-Differences with Unpoolable Data", "link": "https://arxiv.org/abs/2403.15910", "description": "arXiv:2403.15910v1 Announce Type: new \nAbstract: In this study, we identify and relax the assumption of data \"poolability\" in difference-in-differences (DID) estimation. Poolability, or the combination of observations from treated and control units into one dataset, is often not possible due to data privacy concerns. For instance, administrative health data stored in secure facilities is often not combinable across jurisdictions. We propose an innovative approach to estimate DID with unpoolable data: UN--DID. Our method incorporates adjustments for additional covariates, multiple groups, and staggered adoption. Without covariates, UN--DID and conventional DID give identical estimates of the average treatment effect on the treated (ATT). With covariates, we show mathematically and through simulations that UN--DID and conventional DID provide different, but equally informative, estimates of the ATT. An empirical example further underscores the utility of our methodology. The UN--DID method paves the way for more comprehensive analyses of policy impacts, even under data poolability constraints."}, "https://arxiv.org/abs/2403.15934": {"title": "Debiased Machine Learning when Nuisance Parameters Appear in Indicator Functions", "link": "https://arxiv.org/abs/2403.15934", "description": "arXiv:2403.15934v1 Announce Type: new \nAbstract: This paper studies debiased machine learning when nuisance parameters appear in indicator functions. An important example is maximized average welfare under optimal treatment assignment rules. For asymptotically valid inference for a parameter of interest, the current literature on debiased machine learning relies on Gateaux differentiability of the functions inside moment conditions, which does not hold when nuisance parameters appear in indicator functions. In this paper, we propose smoothing the indicator functions, and develop an asymptotic distribution theory for this class of models. The asymptotic behavior of the proposed estimator exhibits a trade-off between bias and variance due to smoothing. We study how a parameter which controls the degree of smoothing can be chosen optimally to minimize an upper bound of the asymptotic mean squared error. A Monte Carlo simulation supports the asymptotic distribution theory, and an empirical example illustrates the implementation of the method."}, "https://arxiv.org/abs/2403.15983": {"title": "Bayesian segmented Gaussian copula factor model for single-cell sequencing data", "link": "https://arxiv.org/abs/2403.15983", "description": "arXiv:2403.15983v1 Announce Type: new \nAbstract: Single-cell sequencing technologies have significantly advanced molecular and cellular biology, offering unprecedented insights into cellular heterogeneity by allowing for the measurement of gene expression at an individual cell level. However, the analysis of such data is challenged by the prevalence of low counts due to dropout events and the skewed nature of the data distribution, which conventional Gaussian factor models struggle to handle effectively. 
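The UN--DID abstract notes that, without covariates, the unpoolable estimator and conventional difference-in-differences give identical ATT estimates. A toy 2x2 illustration with hypothetical group means, where each jurisdiction shares only summary statistics and never unit-level records, is shown below; covariate adjustment, multiple groups and staggered adoption are not covered.

```python
# Hypothetical pre/post outcome means reported separately by each jurisdiction;
# no unit-level data ever leave the secure facilities.
treated = {"pre": 10.0, "post": 14.5}
control = {"pre": 9.5, "post": 11.0}

# Canonical 2x2 difference-in-differences estimate of the ATT (no covariates).
att = (treated["post"] - treated["pre"]) - (control["post"] - control["pre"])
print(att)  # 3.0
```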
To address these challenges, we propose a novel Bayesian segmented Gaussian copula model to explicitly account for inflation of zero and near-zero counts, and to address the high skewness in the data. By employing a Dirichlet-Laplace prior for each column of the factor loadings matrix, we shrink the loadings of unnecessary factors towards zero, which leads to a simple approach to automatically determine the number of latent factors, and resolve the identifiability issue inherent in factor models due to the rotational invariance of the factor loadings matrix. Through simulation studies, we demonstrate the superior performance of our method over existing approaches in conducting factor analysis on data exhibiting the characteristics of single-cell data, such as excessive low counts and high skewness. Furthermore, we apply the proposed method to a real single-cell RNA-sequencing dataset from a lymphoblastoid cell line, successfully identifying biologically meaningful latent factors and detecting previously uncharacterized cell subtypes."}, "https://arxiv.org/abs/2403.16177": {"title": "The Informativeness of Combined Experimental and Observational Data under Dynamic Selection", "link": "https://arxiv.org/abs/2403.16177", "description": "arXiv:2403.16177v1 Announce Type: new \nAbstract: This paper addresses the challenge of estimating the Average Treatment Effect on the Treated Survivors (ATETS; Vikstrom et al., 2018) in the absence of long-term experimental data, utilizing available long-term observational data instead. We establish two theoretical results. First, it is impossible to obtain informative bounds for the ATETS with no model restriction and no auxiliary data. Second, to overturn this negative result, we explore as a promising avenue the recent econometric developments in combining experimental and observational data (e.g., Athey et al., 2020, 2019); we indeed find that exploiting short-term experimental data can be informative without imposing classical model restrictions. Furthermore, building on Chesher and Rosen (2017), we explore how to systematically derive sharp identification bounds, exploiting both the novel data-combination principles and classical model restrictions. Applying the proposed method, we explore what can be learned about the long-run effects of job training programs on employment without long-term experimental data."}, "https://arxiv.org/abs/2403.16256": {"title": "Covariate-adjusted marginal cumulative incidence curves for competing risk analysis", "link": "https://arxiv.org/abs/2403.16256", "description": "arXiv:2403.16256v1 Announce Type: new \nAbstract: Covariate imbalance between treatment groups makes it difficult to compare cumulative incidence curves in competing risk analyses. In this paper we discuss different methods to estimate adjusted cumulative incidence curves including inverse probability of treatment weighting and outcome regression modeling. For these methods to work, correct specification of the propensity score model or outcome regression model, respectively, is needed. We introduce a new doubly robust estimator, which requires correct specification of only one of the two models. We conduct a simulation study to assess the performance of these three methods, including scenarios with model misspecification of the relationship between covariates and treatment and/or outcome. 
We illustrate their usage in a cohort study of breast cancer patients estimating covariate-adjusted marginal cumulative incidence curves for recurrence, second primary tumour development and death after undergoing mastectomy treatment or breast-conserving therapy. Our study points out the advantages and disadvantages of each covariate adjustment method when applied in competing risk analysis."}, "https://arxiv.org/abs/2403.16283": {"title": "Sample Empirical Likelihood Methods for Causal Inference", "link": "https://arxiv.org/abs/2403.16283", "description": "arXiv:2403.16283v1 Announce Type: new \nAbstract: Causal inference is crucial for understanding the true impact of interventions, policies, or actions, enabling informed decision-making and providing insights into the underlying mechanisms that shape our world. In this paper, we establish a framework for the estimation and inference of average treatment effects using a two-sample empirical likelihood function. Two different approaches to incorporating propensity scores are developed. The first approach introduces propensity scores calibrated constraints in addition to the standard model-calibration constraints; the second approach uses the propensity scores to form weighted versions of the model-calibration constraints. The resulting estimators from both approaches are doubly robust. The limiting distributions of the two sample empirical likelihood ratio statistics are derived, facilitating the construction of confidence intervals and hypothesis tests for the average treatment effect. Bootstrap methods for constructing sample empirical likelihood ratio confidence intervals are also discussed for both approaches. Finite sample performances of the methods are investigated through simulation studies."}, "https://arxiv.org/abs/2403.16297": {"title": "Round Robin Active Sequential Change Detection for Dependent Multi-Channel Data", "link": "https://arxiv.org/abs/2403.16297", "description": "arXiv:2403.16297v1 Announce Type: new \nAbstract: This paper considers the problem of sequentially detecting a change in the joint distribution of multiple data sources under a sampling constraint. Specifically, the channels or sources generate observations that are independent over time, but not necessarily independent at any given time instant. The sources follow an initial joint distribution, and at an unknown time instant, the joint distribution of an unknown subset of sources changes. Importantly, there is a hard constraint that only a fixed number of sources are allowed to be sampled at each time instant. The goal is to sequentially observe the sources according to the constraint, and stop sampling as quickly as possible after the change while controlling the false alarm rate below a user-specified level. The sources can be selected dynamically based on the already collected data, and thus, a policy for this problem consists of a joint sampling and change-detection rule. 
A non-randomized policy is studied, and an upper bound is established on its worst-case conditional expected detection delay with respect to both the change point and the observations from the affected sources before the change."}, "https://arxiv.org/abs/2403.16544": {"title": "The Role of Mean Absolute Deviation Function in Obtaining Smooth Estimation for Distribution and Density Functions: Beta Regression Approach", "link": "https://arxiv.org/abs/2403.16544", "description": "arXiv:2403.16544v1 Announce Type: new \nAbstract: Smooth estimation of probability density and distribution functions from a sample is an attractive and important problem with applications in several fields such as business, medicine, and the environment. This article introduces a simple but novel approach for estimating both functions in one process, yielding smooth curves for both via the left mean absolute deviation (MAD) function and a beta regression approach. Our approach explores estimation of both functions by smoothing the first derivative of the left MAD function to obtain the final optimal smooth estimates. The derivation for these final smooth estimates under conditions of nondecreasing distribution function and nonnegative density function is performed by applying beta regression of a polynomial degree on the first derivative of the left MAD function, where the degree of the polynomial is chosen among the models that have smaller mean absolute residuals under the constraint of nonnegativity for the first derivative of the regression vector of expected values. A general class of normal, logistic and Gumbel distributions is derived as proposed smooth estimators for the distribution and density functions using logit, probit and cloglog links, respectively. This approach is applied to simulated data from unimodal, bimodal, tri-modal and skew distributions and an application to a real data set is given."}, "https://arxiv.org/abs/2403.16590": {"title": "Extremal properties of max-autoregressive moving average processes for modelling extreme river flows", "link": "https://arxiv.org/abs/2403.16590", "description": "arXiv:2403.16590v1 Announce Type: new \nAbstract: Max-autoregressive moving average (Max-ARMA) processes are powerful tools for modelling time series data with heavy-tailed behaviour; these are a non-linear version of the popular autoregressive moving average models. River flow data typically have features of heavy tails and non-linearity, as large precipitation events cause sudden spikes in the data that then exponentially decay. Therefore, stationary Max-ARMA models are a suitable candidate for capturing the unique temporal dependence structure exhibited by river flows. This paper contributes to advancing our understanding of the extremal properties of stationary Max-ARMA processes. We detail the first approach for deriving the extremal index, the lagged asymptotic dependence coefficient, and an efficient simulation for a general Max-ARMA process. We use the extremal properties, coupled with the belief that Max-ARMA processes provide only an approximation to extreme river flow, to fit such a model which can broadly capture river flow behaviour over a high threshold. We make our inference under a reparametrisation which gives a simpler parameter space that excludes cases where any parameter is non-identifiable.
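The Max-ARMA abstract mentions an efficient simulation for a general Max-ARMA process. As a rough illustration, the sketch below simulates one common parameterisation, X_t = max(alpha_1 X_{t-1}, ..., alpha_p X_{t-p}, Z_t, beta_1 Z_{t-1}, ..., beta_q Z_{t-q}) with unit-Frechet innovations; this may differ from the authors' exact specification, and their reparametrisation, extremal-index and lagged-dependence calculations are not reproduced.

```python
import numpy as np

def simulate_max_arma(alpha, beta, n, burn_in=200, rng=None):
    """Simulate a Max-ARMA(p, q) process with unit-Frechet innovations:
    X_t = max(alpha_1 X_{t-1}, ..., alpha_p X_{t-p},
              Z_t, beta_1 Z_{t-1}, ..., beta_q Z_{t-q})."""
    rng = rng if rng is not None else np.random.default_rng()
    p, q = len(alpha), len(beta)
    total = n + burn_in + max(p, q)
    z = -1.0 / np.log(rng.uniform(size=total))       # unit-Frechet noise
    x = np.copy(z)
    for t in range(max(p, q), total):
        ar = [a * x[t - i - 1] for i, a in enumerate(alpha)]
        ma = [b * z[t - j - 1] for j, b in enumerate(beta)]
        x[t] = max([z[t]] + ar + ma)
    return x[-n:]                                    # discard the burn-in

series = simulate_max_arma(alpha=[0.7], beta=[0.4], n=1000,
                           rng=np.random.default_rng(42))
print(series[:5])
```

Large innovations propagate forward through the max recursion and then decay geometrically, which is the spike-and-decay behaviour the abstract attributes to river flows.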
We illustrate results for river flow data from the UK River Thames."}, "https://arxiv.org/abs/2403.16673": {"title": "Quasi-randomization tests for network interference", "link": "https://arxiv.org/abs/2403.16673", "description": "arXiv:2403.16673v1 Announce Type: new \nAbstract: Many classical inferential approaches fail to hold when interference exists among the population units. This amounts to the treatment status of one unit affecting the potential outcome of other units in the population. Testing for such spillover effects in this setting makes the null hypothesis non-sharp. An interesting approach to tackling the non-sharp nature of the null hypothesis in this setup is constructing conditional randomization tests such that the null is sharp on the restricted population. In randomized experiments, conditional randomized tests hold finite sample validity. Such approaches can pose computational challenges as finding these appropriate sub-populations based on experimental design can involve solving an NP-hard problem. In this paper, we view the network amongst the population as a random variable instead of being fixed. We propose a new approach that builds a conditional quasi-randomization test. Our main idea is to build the (non-sharp) null distribution of no spillover effects using random graph null models. We show that our method is exactly valid in finite-samples under mild assumptions. Our method displays enhanced power over other methods, with substantial improvement in complex experimental designs. We highlight that the method reduces to a simple permutation test, making it easy to implement in practice. We conduct a simulation study to verify the finite-sample validity of our approach and illustrate our methodology to test for interference in a weather insurance adoption experiment run in rural China."}, "https://arxiv.org/abs/2403.16706": {"title": "An alternative measure for quantifying the heterogeneity in meta-analysis", "link": "https://arxiv.org/abs/2403.16706", "description": "arXiv:2403.16706v1 Announce Type: new \nAbstract: Quantifying the heterogeneity is an important issue in meta-analysis, and among the existing measures, the $I^2$ statistic is most commonly used. In this paper, we first illustrate with a simple example that the $I^2$ statistic is heavily dependent on the study sample sizes, mainly because it is used to quantify the heterogeneity between the observed effect sizes. To reduce the influence of sample sizes, we introduce an alternative measure that aims to directly measure the heterogeneity between the study populations involved in the meta-analysis. We further propose a new estimator, namely the $I_A^2$ statistic, to estimate the newly defined measure of heterogeneity. For practical implementation, the exact formulas of the $I_A^2$ statistic are also derived under two common scenarios with the effect size as the mean difference (MD) or the standardized mean difference (SMD). Simulations and real data analysis demonstrate that the $I_A^2$ statistic provides an asymptotically unbiased estimator for the absolute heterogeneity between the study populations, and it is also independent of the study sample sizes as expected. To conclude, our newly defined $I_A^2$ statistic can be used as a supplemental measure of heterogeneity to monitor the situations where the study effect sizes are indeed similar with little biological difference. 
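For context on the heterogeneity discussion above, the snippet below computes the classical Higgins-Thompson $I^2$ from study-level effects and within-study variances (hypothetical numbers); this is the sample-size-sensitive statistic the abstract critiques, and the proposed $I_A^2$ alternative is not reproduced here.

```python
import numpy as np

def i_squared(effects, variances):
    """Classical I^2 = max(0, (Q - df) / Q) based on Cochran's Q."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)       # fixed-effect pooled estimate
    q = np.sum(w * (effects - pooled) ** 2)        # Cochran's Q
    df = len(effects) - 1
    return max(0.0, (q - df) / q)

# Hypothetical mean differences and their within-study variances for five studies
print(i_squared([0.30, 0.10, 0.45, 0.20, 0.60],
                [0.02, 0.03, 0.02, 0.05, 0.04]))
```

Because the within-study variances shrink as the studies grow, $I^2$ computed this way drifts upward with sample size even when the study populations barely differ, which is the behaviour motivating the proposed alternative.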
In such scenario, the fixed-effect model can be appropriate; nevertheless, when the sample sizes are sufficiently large, the $I^2$ statistic may still increase to 1 and subsequently suggest the random-effects model for meta-analysis."}, "https://arxiv.org/abs/2403.16773": {"title": "Privacy-Protected Spatial Autoregressive Model", "link": "https://arxiv.org/abs/2403.16773", "description": "arXiv:2403.16773v1 Announce Type: new \nAbstract: Spatial autoregressive (SAR) models are important tools for studying network effects. However, with an increasing emphasis on data privacy, data providers often implement privacy protection measures that make classical SAR models inapplicable. In this study, we introduce a privacy-protected SAR model with noise-added response and covariates to meet privacy-protection requirements. However, in this scenario, the traditional quasi-maximum likelihood estimator becomes infeasible because the likelihood function cannot be formulated. To address this issue, we first consider an explicit expression for the likelihood function with only noise-added responses. However, the derivatives are biased owing to the noise in the covariates. Therefore, we develop techniques that can correct the biases introduced by noise. Correspondingly, a Newton-Raphson-type algorithm is proposed to obtain the estimator, leading to a corrected likelihood estimator. To further enhance computational efficiency, we introduce a corrected least squares estimator based on the idea of bias correction. These two estimation methods ensure both data security and the attainment of statistically valid estimators. Theoretical analysis of both estimators is carefully conducted, and statistical inference methods are discussed. The finite sample performances of different methods are demonstrated through extensive simulations and the analysis of a real dataset."}, "https://arxiv.org/abs/2403.16813": {"title": "A Generalized Logrank-type Test for Comparison of Treatment Regimes in Sequential Multiple Assignment Randomized Trials", "link": "https://arxiv.org/abs/2403.16813", "description": "arXiv:2403.16813v1 Announce Type: new \nAbstract: The sequential multiple assignment randomized trial (SMART) is the\n ideal study design for the evaluation of multistage treatment\n regimes, which comprise sequential decision rules that recommend\n treatments for a patient at each of a series of decision points\n based on their evolving characteristics. A common goal is to\n compare the set of so-called embedded regimes represented in the\n design on the basis of a primary outcome of interest. In the study\n of chronic diseases and disorders, this outcome is often a time to\n an event, and a goal is to compare the distributions of the\n time-to-event outcome associated with each regime in the set. We\n present a general statistical framework in which we develop a\n logrank-type test for comparison of the survival distributions\n associated with regimes within a specified set based on the data\n from a SMART with an arbitrary number of stages that allows\n incorporation of covariate information to enhance efficiency and can\n also be used with data from an observational study. The framework\n provides clarification of the assumptions required to yield a\n principled test procedure, and the proposed test subsumes or offers\n an improved alternative to existing methods. We demonstrate\n performance of the methods in a suite of simulation\n studies. 
The methods are applied to a SMART in patients with acute\n promyelocytic leukemia."}, "https://arxiv.org/abs/2403.16832": {"title": "Testing for sufficient follow-up in survival data with a cure fraction", "link": "https://arxiv.org/abs/2403.16832", "description": "arXiv:2403.16832v1 Announce Type: new \nAbstract: In order to estimate the proportion of `immune' or `cured' subjects who will never experience failure, a sufficiently long follow-up period is required. Several statistical tests have been proposed in the literature for assessing the assumption of sufficient follow-up, meaning that the study duration is longer than the support of the survival times for the uncured subjects. However, for practical purposes, the follow-up would be considered sufficiently long if the probability for the event to happen after the end of the study is very small. Based on this observation, we formulate a more relaxed notion of `practically' sufficient follow-up characterized by the quantiles of the distribution and develop a novel nonparametric statistical test. The proposed method relies mainly on the assumption of a non-increasing density function in the tail of the distribution. The test is then based on a shape constrained density estimator such as the Grenander or the kernel smoothed Grenander estimator and a bootstrap procedure is used for computation of the critical values. The performance of the test is investigated through an extensive simulation study, and the method is illustrated on breast cancer data."}, "https://arxiv.org/abs/2403.16844": {"title": "Resistant Inference in Instrumental Variable Models", "link": "https://arxiv.org/abs/2403.16844", "description": "arXiv:2403.16844v1 Announce Type: new \nAbstract: The classical tests in the instrumental variable model can behave arbitrarily if the data is contaminated. For instance, one outlying observation can be enough to change the outcome of a test. We develop a framework to construct testing procedures that are robust to weak instruments, outliers and heavy-tailed errors in the instrumental variable model. The framework is constructed upon M-estimators. By deriving the influence functions of the classical weak instrument robust tests, such as the Anderson-Rubin test, K-test and the conditional likelihood ratio (CLR) test, we prove their unbounded sensitivity to infinitesimal contamination. Therefore, we construct contamination resistant/robust alternatives. In particular, we show how to construct a robust CLR statistic based on Mallows type M-estimators and show that its asymptotic distribution is the same as that of the (classical) CLR statistic. The theoretical results are corroborated by a simulation study. Finally, we revisit three empirical studies affected by outliers and demonstrate how the new robust tests can be used in practice."}, "https://arxiv.org/abs/2403.16906": {"title": "Comparing basic statistical concepts with diagnostic probabilities based on directly observed proportions to help understand the replication crisis", "link": "https://arxiv.org/abs/2403.16906", "description": "arXiv:2403.16906v1 Announce Type: new \nAbstract: Instead of regarding an observed proportion as a sample from a population with an unknown parameter, diagnosticians intuitively use the observed proportion as a direct estimate of the posterior probability of a diagnosis. 
Therefore, a diagnostician might also regard a continuous Gaussian probability distribution of an outcome conditional on a study selection criterion as representing posterior probabilities. Fitting a distribution to its mean and standard deviation (SD) can be regarded as pooling data from an infinite number of imaginary or theoretical studies with an identical mean and SD but randomly different numerical values. For a distribution of possible means based on an SEM, the posterior probability Q of any theoretically true mean falling into a specified tail would be equal to the tail area as a proportion of the whole. If the reverse likelihood distribution of possible study means conditional on the same hypothetical tail threshold is assumed to be the same as the posterior probability distribution of means (as is customary) then, by Bayes' rule, the P value equals Q. Replication involves doing two independent studies, thus doubling the variance for the combined posterior probability distribution. Thus, if the original effect size was 1.96, the number of observations was 100, the SEM was 1 and the original P value was 0.025, the theoretical probability of a replicating study getting a P value of up to 0.025 again is only 0.283. By applying the double variance to power calculations, the required number of observations is doubled compared to conventional approaches. If these theoretical probabilities of replication are consistent with empirical replication study results, this might explain the replication crisis and make the concepts of statistics easier for diagnosticians and others to understand."}, "https://arxiv.org/abs/2403.15499": {"title": "A Causal Analysis of CO2 Reduction Strategies in Electricity Markets Through Machine Learning-Driven Metalearners", "link": "https://arxiv.org/abs/2403.15499", "description": "arXiv:2403.15499v1 Announce Type: cross \nAbstract: This study employs the Causal Machine Learning (CausalML) statistical method to analyze the influence of electricity pricing policies on carbon dioxide (CO2) levels in the household sector. Investigating the causality between potential outcomes and treatment effects, where changes in pricing policies are the treatment, our analysis challenges the conventional wisdom surrounding incentive-based electricity pricing. The study's findings suggest that adopting such policies may inadvertently increase CO2 intensity. Additionally, we integrate a machine learning-based meta-algorithm, reflecting a contemporary statistical approach, to enhance the depth of our causal analysis. The study conducts a comparative analysis of learners X, T, S, and R to ascertain the optimal methods based on the defined question's specified goals and contextual nuances.
This research contributes valuable insights to the ongoing dialogue on sustainable development practices, emphasizing the importance of considering unintended consequences in policy formulation."}, "https://arxiv.org/abs/2403.15635": {"title": "Nonparametric inference of higher order interaction patterns in networks", "link": "https://arxiv.org/abs/2403.15635", "description": "arXiv:2403.15635v1 Announce Type: cross \nAbstract: We propose a method for obtaining parsimonious decompositions of networks into higher order interactions which can take the form of arbitrary motifs. The method is based on a class of analytically solvable generative models, where vertices are connected via explicit copies of motifs, which in combination with non-parametric priors allow us to infer higher order interactions from dyadic graph data without any prior knowledge on the types or frequencies of such interactions. Crucially, we also consider 'degree-corrected' models that correctly reflect the degree distribution of the network and consequently prove to be a better fit for many real-world networks compared to non-degree corrected models. We test the presented approach on simulated data for which we recover the set of underlying higher order interactions to a high degree of accuracy. For empirical networks the method identifies concise sets of atomic subgraphs from within thousands of candidates that cover a large fraction of edges and include higher order interactions of known structural and functional significance. The method not only produces an explicit higher order representation of the network but also a fit of the network to analytically tractable models opening new avenues for the systematic study of higher order network structures."}, "https://arxiv.org/abs/2403.15711": {"title": "Identifiable Latent Neural Causal Models", "link": "https://arxiv.org/abs/2403.15711", "description": "arXiv:2403.15711v1 Announce Type: cross \nAbstract: Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data. It is particularly good at predictions under unseen distribution shifts, because these shifts can generally be interpreted as consequences of interventions. Hence leveraging seen distribution shifts becomes a natural strategy to help identify causal representations, which in turn benefits predictions where distributions are previously unseen. Determining the types (or conditions) of such distribution shifts that do contribute to the identifiability of causal representations is critical. This work establishes a sufficient and necessary condition characterizing the types of distribution shifts for identifiability in the context of latent additive noise models. Furthermore, we present partial identifiability results when only a portion of distribution shifts meets the condition. In addition, we extend our findings to latent post-nonlinear causal models. We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations. Our algorithm, guided by our underlying theory, has demonstrated outstanding performance across a diverse range of synthetic and real-world datasets.
The empirical observations align closely with the theoretical findings, affirming the robustness and effectiveness of our approach."}, "https://arxiv.org/abs/2403.15792": {"title": "Reviving pseudo-inverses: Asymptotic properties of large dimensional Moore-Penrose and Ridge-type inverses with applications", "link": "https://arxiv.org/abs/2403.15792", "description": "arXiv:2403.15792v1 Announce Type: cross \nAbstract: In this paper, we derive high-dimensional asymptotic properties of the Moore-Penrose inverse and the ridge-type inverse of the sample covariance matrix. In particular, the analytical expressions of the weighted sample trace moments are deduced for both generalized inverse matrices and are presented by using the partial exponential Bell polynomials, which can easily be computed in practice. The existing results are extended in several directions: (i) First, the population covariance matrix is not assumed to be a multiple of the identity matrix; (ii) Second, the assumption of normality is not used in the derivation; (iii) Third, the asymptotic results are derived under the high-dimensional asymptotic regime. Our findings are used to construct improved shrinkage estimators of the precision matrix, which asymptotically minimize the quadratic loss with probability one. Finally, the finite sample properties of the derived theoretical results are investigated via an extensive simulation study."}, "https://arxiv.org/abs/2403.16031": {"title": "Learning Directed Acyclic Graphs from Partial Orderings", "link": "https://arxiv.org/abs/2403.16031", "description": "arXiv:2403.16031v1 Announce Type: cross \nAbstract: Directed acyclic graphs (DAGs) are commonly used to model causal relationships among random variables. In general, learning the DAG structure is both computationally and statistically challenging. Moreover, without additional information, the direction of edges may not be estimable from observational data. In contrast, given a complete causal ordering of the variables, the problem can be solved efficiently, even in high dimensions. In this paper, we consider the intermediate problem of learning DAGs when a partial causal ordering of variables is available. We propose a general estimation framework for leveraging the partial ordering and present efficient estimation algorithms for low- and high-dimensional problems. The advantages of the proposed framework are illustrated via numerical studies."}, "https://arxiv.org/abs/2403.16336": {"title": "Predictive Inference in Multi-environment Scenarios", "link": "https://arxiv.org/abs/2403.16336", "description": "arXiv:2403.16336v1 Announce Type: cross \nAbstract: We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include extensions for settings with non-real-valued responses and a theory of consistency for predictive inference in these general problems.
We demonstrate a novel resizing method to adapt to problem difficulty, which applies both to existing approaches for predictive inference with hierarchical data and the methods we develop; this reduces prediction set sizes using limited information from the test environment, a key to the methods' practical performance, which we evaluate through neurochemical sensing and species classification datasets."}, "https://arxiv.org/abs/2403.16413": {"title": "Optimal testing in a class of nonregular models", "link": "https://arxiv.org/abs/2403.16413", "description": "arXiv:2403.16413v1 Announce Type: cross \nAbstract: This paper studies optimal hypothesis testing for nonregular statistical models with parameter-dependent support. We consider both one-sided and two-sided hypothesis testing and develop asymptotically uniformly most powerful tests based on the likelihood ratio process. The proposed one-sided test involves randomization to achieve asymptotic size control, some tuning constant to avoid discontinuities in the limiting likelihood ratio process, and a user-specified alternative hypothetical value to achieve the asymptotic optimality. Our two-sided test becomes asymptotically uniformly most powerful without imposing further restrictions such as unbiasedness. Simulation results illustrate desirable power properties of the proposed tests."}, "https://arxiv.org/abs/2403.16688": {"title": "Optimal convex $M$-estimation via score matching", "link": "https://arxiv.org/abs/2403.16688", "description": "arXiv:2403.16688v1 Announce Type: cross \nAbstract: In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution. At the population level, this fitting process is a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. The procedure is computationally efficient, and we prove that our procedure attains the minimal asymptotic covariance among all convex $M$-estimators. As an example of a non-log-concave setting, for Cauchy errors, the optimal convex loss function is Huber-like, and our procedure yields an asymptotic efficiency greater than 0.87 relative to the oracle maximum likelihood estimator of the regression coefficients that uses knowledge of this error distribution; in this sense, we obtain robustness without sacrificing much efficiency. Numerical experiments confirm the practical merits of our proposal."}, "https://arxiv.org/abs/2403.16828": {"title": "Asymptotics of predictive distributions driven by sample means and variances", "link": "https://arxiv.org/abs/2403.16828", "description": "arXiv:2403.16828v1 Announce Type: cross \nAbstract: Let $\\alpha_n(\\cdot)=P\\bigl(X_{n+1}\\in\\cdot\\mid X_1,\\ldots,X_n\\bigr)$ be the predictive distributions of a sequence $(X_1,X_2,\\ldots)$ of $p$-variate random variables. Suppose $$\\alpha_n=\\mathcal{N}_p(M_n,Q_n)$$ where $M_n=\\frac{1}{n}\\sum_{i=1}^nX_i$ and $Q_n=\\frac{1}{n}\\sum_{i=1}^n(X_i-M_n)(X_i-M_n)^t$. Then, there is a random probability measure $\\alpha$ on $\\mathbb{R}^p$ such that $\\alpha_n\\rightarrow\\alpha$ weakly a.s. 
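The predictive-distribution abstract defines $\alpha_n=\mathcal{N}_p(M_n,Q_n)$ with $M_n$ the sample mean and $Q_n$ the $1/n$-normalized sample covariance. A minimal sketch that simply reads these definitions off for simulated bivariate data is given below; the convergence and total-variation results themselves are not demonstrated by it.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=400)

M_n = X.mean(axis=0)                      # M_n = (1/n) sum X_i
Q_n = np.cov(X, rowvar=False, bias=True)  # Q_n = (1/n) sum (X_i - M_n)(X_i - M_n)^T

alpha_n = multivariate_normal(mean=M_n, cov=Q_n)   # predictive law for X_{n+1}
print(alpha_n.logpdf([0.0, 1.0]))
```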
If $p\\in\\{1,2\\}$, one also obtains $\\lVert\\alpha_n-\\alpha\\rVert\\overset{a.s.}\\longrightarrow 0$ where $\\lVert\\cdot\\rVert$ is total variation distance. Moreover, the convergence rate of $\\lVert\\alpha_n-\\alpha\\rVert$ is arbitrarily close to $n^{-1/2}$. These results (apart from the one regarding the convergence rate) still apply even if $\\alpha_n=\\mathcal{L}_p(M_n,Q_n)$, where $\\mathcal{L}_p$ belongs to a class of distributions much larger than the normal. Finally, the asymptotic behavior of copula-based predictive distributions (introduced in [13]) is investigated and a numerical experiment is performed."}, "https://arxiv.org/abs/2008.12927": {"title": "Broadcasted Nonparametric Tensor Regression", "link": "https://arxiv.org/abs/2008.12927", "description": "arXiv:2008.12927v3 Announce Type: replace \nAbstract: We propose a novel use of a broadcasting operation, which distributes univariate functions to all entries of the tensor covariate, to model the nonlinearity in tensor regression nonparametrically. A penalized estimation and the corresponding algorithm are proposed. Our theoretical investigation, which allows the dimensions of the tensor covariate to diverge, indicates that the proposed estimation yields a desirable convergence rate. We also provide a minimax lower bound, which characterizes the optimality of the proposed estimator for a wide range of scenarios. Numerical experiments are conducted to confirm the theoretical findings, and they show that the proposed model has advantages over its existing linear counterparts."}, "https://arxiv.org/abs/2010.15864": {"title": "Identification and Estimation of Unconditional Policy Effects of an Endogenous Binary Treatment: An Unconditional MTE Approach", "link": "https://arxiv.org/abs/2010.15864", "description": "arXiv:2010.15864v5 Announce Type: replace \nAbstract: This paper studies the identification and estimation of policy effects when treatment status is binary and endogenous. We introduce a new class of marginal treatment effects (MTEs) based on the influence function of the functional underlying the policy target. We show that an unconditional policy effect can be represented as a weighted average of the newly defined MTEs over the individuals who are indifferent about their treatment status. We provide conditions for point identification of the unconditional policy effects. When a quantile is the functional of interest, we introduce the UNconditional Instrumental Quantile Estimator (UNIQUE) and establish its consistency and asymptotic distribution. In the empirical application, we estimate the effect of changing college enrollment status, induced by higher tuition subsidy, on the quantiles of the wage distribution."}, "https://arxiv.org/abs/2102.01155": {"title": "G-Formula for Observational Studies under Stratified Interference, with Application to Bed Net Use on Malaria", "link": "https://arxiv.org/abs/2102.01155", "description": "arXiv:2102.01155v2 Announce Type: replace \nAbstract: Assessing population-level effects of vaccines and other infectious disease prevention measures is important to the field of public health. In infectious disease studies, one person's treatment may affect another individual's outcome, i.e., there may be interference between units. For example, the use of bed nets to prevent malaria by one individual may have an indirect effect on other individuals living in close proximity. 
In some settings, individuals may form groups or clusters where interference only occurs within groups, i.e., there is partial interference. Inverse probability weighted estimators have previously been developed for observational studies with partial interference. Unfortunately, these estimators are not well suited for studies with large clusters. Therefore, in this paper, the parametric g-formula is extended to allow for partial interference. G-formula estimators are proposed for overall effects, effects when treated, and effects when untreated. The proposed estimators can accommodate large clusters and do not suffer from the g-null paradox that may occur in the absence of interference. The large sample properties of the proposed estimators are derived assuming no unmeasured confounders and that the partial interference takes a particular form (referred to as `weak stratified interference'). Simulation studies are presented demonstrating the finite-sample performance of the proposed estimators. The Demographic and Health Survey from the Democratic Republic of the Congo is then analyzed using the proposed g-formula estimators to assess the effects of bed net use on malaria."}, "https://arxiv.org/abs/2109.03087": {"title": "An unbiased estimator of the case fatality rate", "link": "https://arxiv.org/abs/2109.03087", "description": "arXiv:2109.03087v2 Announce Type: replace \nAbstract: During an epidemic outbreak of a new disease, computing the probability of dying once infected is an important though difficult task. Since it is very hard to know the true number of infected people, the focus is placed on estimating the case fatality rate, which is defined as the probability of dying once tested and confirmed as infected. The estimation of this rate at the beginning of an epidemic remains challenging for several reasons, including the time gap between diagnosis and death, and the rapid growth in the number of confirmed cases. In this work, an unbiased estimator of the case fatality rate of a virus is presented. The consistency of the estimator is demonstrated, and its asymptotic distribution is derived, enabling the corresponding confidence intervals (C.I.) to be established. The proposed method is based on the distribution F of the time between confirmation and death of individuals who die because of the virus. The estimator's performance is analyzed in both simulation scenarios and the real-world context of Argentina in 2020 for the COVID-19 pandemic, consistently achieving excellent results when compared to an existing proposal as well as to the conventional \"naive\" estimator that was employed to report the case fatality rates during the last COVID-19 pandemic. In the simulated scenarios, the empirical coverage of our C.I. is studied, both using the F employed to generate the data and an estimated F, and it is observed that the desired level of confidence is reached quickly when using the real F and in a reasonable period of time when estimating F."}, "https://arxiv.org/abs/2212.09961": {"title": "Uncertainty Quantification of MLE for Entity Ranking with Covariates", "link": "https://arxiv.org/abs/2212.09961", "description": "arXiv:2212.09961v2 Announce Type: replace \nAbstract: This paper is concerned with statistical estimation and inference for the ranking problems based on pairwise comparisons with additional covariate information such as the attributes of the compared items.
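The case-fatality-rate abstract bases its estimator on the distribution F of the time between confirmation and death. The sketch below shows a generic delay-adjusted estimator in that spirit, dividing observed deaths by confirmations weighted by F evaluated at the elapsed time, alongside the naive deaths-over-confirmations ratio; it is not the authors' unbiased estimator, and the exponential delay and case counts are hypothetical.

```python
import numpy as np

def F(days):
    """Hypothetical confirmation-to-death delay distribution: exponential, mean 14 days."""
    return 1.0 - np.exp(-np.maximum(days, 0) / 14.0)

def naive_cfr(deaths_to_date, n_confirmed):
    return deaths_to_date / n_confirmed

def delay_adjusted_cfr(deaths_to_date, confirmation_days, t_now):
    """Divide deaths by the expected number of fatal outcomes observable by t_now."""
    exposure = sum(F(t_now - t_i) for t_i in confirmation_days)
    return deaths_to_date / exposure

confirmation_days = np.repeat(np.arange(30), 100)   # 100 confirmations per day for 30 days
deaths_to_date = 60

print(naive_cfr(deaths_to_date, len(confirmation_days)))          # biased low early on
print(delay_adjusted_cfr(deaths_to_date, confirmation_days, 30))  # delay-adjusted value
```

Early in an outbreak the naive ratio understates the fatality rate because recent cases have not yet had time to resolve; the F-weighted denominator is one standard way of correcting for that gap.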
Despite extensive studies, little prior literature investigates this problem under the more realistic setting where covariate information exists. To tackle this issue, we propose a novel model, the Covariate-Assisted Ranking Estimation (CARE) model, which extends the well-known Bradley-Terry-Luce (BTL) model by incorporating the covariate information. Specifically, instead of assuming every compared item has a fixed latent score $\\{\\theta_i^*\\}_{i=1}^n$, we assume the underlying scores are given by $\\{\\alpha_i^*+{x}_i^\\top\\beta^*\\}_{i=1}^n$, where $\\alpha_i^*$ and ${x}_i^\\top\\beta^*$ represent the latent baseline and covariate scores of the $i$-th item, respectively. We impose natural identifiability conditions and derive the $\\ell_{\\infty}$- and $\\ell_2$-optimal rates for the maximum likelihood estimator of $\\{\\alpha_i^*\\}_{i=1}^{n}$ and $\\beta^*$ under a sparse comparison graph, using a novel `leave-one-out' technique (Chen et al., 2019). To conduct statistical inferences, we further derive asymptotic distributions for the MLE of $\\{\\alpha_i^*\\}_{i=1}^n$ and $\\beta^*$ with minimal sample complexity. This allows us to answer the question of whether some covariates have any explanatory power for the latent scores and to threshold some sparse parameters to improve the ranking performance. We improve the approximation method used in (Gao et al., 2021) for the BTL model and generalize it to the CARE model. Moreover, we validate our theoretical results through large-scale numerical studies and an application to the mutual fund stock holding dataset."}, "https://arxiv.org/abs/2303.10016": {"title": "Improving instrumental variable estimators with post-stratification", "link": "https://arxiv.org/abs/2303.10016", "description": "arXiv:2303.10016v2 Announce Type: replace \nAbstract: Experiments studying get-out-the-vote (GOTV) efforts estimate the causal effect of various mobilization efforts on voter turnout. However, there is often substantial noncompliance in these studies. A usual approach is to use an instrumental variable (IV) analysis to estimate impacts for compliers, here being those actually contacted by the investigators. Unfortunately, popular IV estimators can be unstable in studies with a small fraction of compliers. We explore post-stratifying the data (e.g., taking a weighted average of IV estimates within each stratum) using variables that predict complier status (and, potentially, the outcome) to mitigate this. We present the benefits of post-stratification in terms of bias, variance, and improved standard error estimates, and provide a finite-sample asymptotic variance formula. We also compare the performance of different IV approaches and discuss the advantages of our design-based post-stratification approach over incorporating compliance-predictive covariates into the two-stage least squares estimator. In the end, we show that covariates predictive of compliance can increase precision, but only if one is willing to make a bias-variance trade-off by down-weighting or dropping strata with few compliers. By contrast, standard approaches such as two-stage least squares fail to use such information.
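As a concrete illustration of the covariate-assisted score parameterization in the CARE entry above, the sketch below fits a BTL-type model whose item scores take the form alpha_i + x_i^T beta by maximum likelihood. This is a minimal illustration of my own, not the authors' code; the toy data are made up, and the paper's identifiability constraints, sparse comparison graph, and inference results are omitted.

```python
import numpy as np
from scipy.optimize import minimize

def care_negloglik(params, comparisons, X):
    """Negative log-likelihood of a BTL-type model with item scores
    s_i = alpha_i + x_i^T beta (illustrative sketch, not the authors' code).
    Identifiability constraints from the paper are omitted here."""
    n, d = X.shape
    alpha, beta = params[:n], params[n:]
    s = alpha + X @ beta                          # latent scores
    nll = 0.0
    for i, j, y in comparisons:                   # y = 1 if item i beats item j
        p_ij = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))
        nll -= y * np.log(p_ij) + (1 - y) * np.log(1 - p_ij)
    return nll

# toy data: 5 items, 2 covariates, a handful of pairwise outcomes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
comparisons = [(0, 1, 1), (1, 2, 0), (2, 3, 1), (3, 4, 0), (0, 4, 1)]
res = minimize(care_negloglik, x0=np.zeros(5 + 2), args=(comparisons, X))
alpha_hat, beta_hat = res.x[:5], res.x[5:]
```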
We finally examine the benefits of our approach in two GOTV applications."}, "https://arxiv.org/abs/2304.04712": {"title": "Testing for linearity in scalar-on-function regression with responses missing at random", "link": "https://arxiv.org/abs/2304.04712", "description": "arXiv:2304.04712v2 Announce Type: replace \nAbstract: A goodness-of-fit test for the Functional Linear Model with Scalar Response (FLMSR) with responses Missing at Random (MAR) is proposed in this paper. The test statistic relies on a marked empirical process indexed by the projected functional covariate and its distribution under the null hypothesis is calibrated using a wild bootstrap procedure. The computation and performance of the test rely on having an accurate estimator of the functional slope of the FLMSR when the sample has MAR responses. Three estimation methods based on the Functional Principal Components (FPCs) of the covariate are considered. First, the simplified method estimates the functional slope by simply discarding observations with missing responses. Second, the imputed method estimates the functional slope by imputing the missing responses using the simplified estimator. Third, the inverse probability weighted method incorporates the missing response generation mechanism when imputing. Furthermore, both cross-validation and LASSO regression are used to select the FPCs used by each estimator. Several Monte Carlo experiments are conducted to analyze the behavior of the testing procedure in combination with the functional slope estimators. Results indicate that estimators performing missing-response imputation achieve the highest power. The testing procedure is applied to check for linear dependence between the average number of sunny days per year and the mean curve of daily temperatures at weather stations in Spain."}, "https://arxiv.org/abs/2306.14693": {"title": "Conformal link prediction for false discovery rate control", "link": "https://arxiv.org/abs/2306.14693", "description": "arXiv:2306.14693v2 Announce Type: replace \nAbstract: Most link prediction methods return estimates of the connection probability of missing edges in a graph. Such output can be used to rank the missing edges from most to least likely to be a true edge, but does not directly provide a classification into true and non-existent. In this work, we consider the problem of identifying a set of true edges with a control of the false discovery rate (FDR). We propose a novel method based on high-level ideas from the literature on conformal inference. The graph structure induces intricate dependence in the data, which we carefully take into account, as this makes the setup different from the usual setup in conformal inference, where data exchangeability is assumed. The FDR control is empirically demonstrated for both simulated and real data."}, "https://arxiv.org/abs/2306.16715": {"title": "Causal Meta-Analysis by Integrating Multiple Observational Studies with Multivariate Outcomes", "link": "https://arxiv.org/abs/2306.16715", "description": "arXiv:2306.16715v3 Announce Type: replace \nAbstract: Integrating multiple observational studies to make unconfounded causal or descriptive comparisons of group potential outcomes in a large natural population is challenging. Moreover, retrospective cohorts, being convenience samples, are usually unrepresentative of the natural population of interest and have groups with unbalanced covariates. 
We propose a general covariate-balancing framework based on pseudo-populations that extends established weighting methods to the meta-analysis of multiple retrospective cohorts with multiple groups. Additionally, by maximizing the effective sample sizes of the cohorts, we propose a FLEXible, Optimized, and Realistic (FLEXOR) weighting method appropriate for integrative analyses. We develop new weighted estimators for unconfounded inferences on wide-ranging population-level features and estimands relevant to group comparisons of quantitative, categorical, or multivariate outcomes. The asymptotic properties of these estimators are examined. Through simulation studies and meta-analyses of TCGA datasets, we demonstrate the versatility and reliability of the proposed weighting strategy, especially for the FLEXOR pseudo-population."}, "https://arxiv.org/abs/2307.02603": {"title": "Bayesian Structure Learning in Undirected Gaussian Graphical Models: Literature Review with Empirical Comparison", "link": "https://arxiv.org/abs/2307.02603", "description": "arXiv:2307.02603v2 Announce Type: replace \nAbstract: Gaussian graphical models provide a powerful framework to reveal the conditional dependency structure between multivariate variables. The process of uncovering the conditional dependency network is known as structure learning. Bayesian methods can measure the uncertainty of conditional relationships and include prior information. However, frequentist methods are often preferred due to the computational burden of the Bayesian approach. Over the last decade, Bayesian methods have seen substantial improvements, with some now capable of generating accurate estimates of graphs up to a thousand variables in mere minutes. Despite these advancements, a comprehensive review or empirical comparison of all recent methods has not been conducted. This paper delves into a wide spectrum of Bayesian approaches used for structure learning and evaluates their efficacy through a simulation study. We also demonstrate how to apply Bayesian structure learning to a real-world data set and provide directions for future research. This study gives an exhaustive overview of this dynamic field for newcomers, practitioners, and experts."}, "https://arxiv.org/abs/2311.02658": {"title": "Nonparametric Estimation and Comparison of Distance Distributions from Censored Data", "link": "https://arxiv.org/abs/2311.02658", "description": "arXiv:2311.02658v4 Announce Type: replace \nAbstract: Transportation distance information is a powerful resource, but location records are often censored due to privacy concerns or regulatory mandates. We outline methods to approximate, sample from, and compare distributions of distances between censored location pairs, a task with applications to public health informatics, logistics, and more. We validate empirically via simulation and demonstrate applicability to practical geospatial data analysis tasks."}, "https://arxiv.org/abs/2312.06437": {"title": "Posterior Ramifications of Prior Dependence Structures", "link": "https://arxiv.org/abs/2312.06437", "description": "arXiv:2312.06437v2 Announce Type: replace \nAbstract: In fully Bayesian analyses, prior distributions are specified before observing data. Prior elicitation methods transfigure prior information into quantifiable prior distributions. Recently, methods that leverage copulas have been proposed to accommodate more flexible dependence structures when eliciting multivariate priors. 
We prove that under broad conditions, the posterior cannot retain many of these flexible prior dependence structures in large-sample settings. We emphasize the impact of this result by overviewing several objectives for prior specification to help practitioners select prior dependence structures that align with their objectives for posterior analysis. Because correctly specifying the dependence structure a priori can be difficult, we consider how the choice of prior copula impacts the posterior distribution in terms of asymptotic convergence of the posterior mode. Our resulting recommendations streamline the prior elicitation process."}, "https://arxiv.org/abs/2009.13961": {"title": "Online Action Learning in High Dimensions: A Conservative Perspective", "link": "https://arxiv.org/abs/2009.13961", "description": "arXiv:2009.13961v4 Announce Type: replace-cross \nAbstract: Sequential learning problems are common in several fields of research and practical applications. Examples include dynamic pricing and assortment, the design of auctions and incentives, and a large number of sequential treatment experiments. In this paper, we extend one of the most popular learning solutions, the $\\epsilon_t$-greedy heuristic, to high-dimensional contexts considering a conservative directive. We do this by allocating part of the time the original rule uses to adopt completely new actions to a more focused search in a restrictive set of promising actions. The resulting rule might be useful for practical applications that still value surprises, although at a decreasing rate, while also having restrictions on the adoption of unusual actions. With high probability, we find reasonable bounds for the cumulative regret of a conservative high-dimensional decaying $\\epsilon_t$-greedy rule. Also, we provide a lower bound for the cardinality of the set of viable actions that implies an improved regret bound for the conservative version when compared to its non-conservative counterpart. Additionally, we show that end-users have sufficient flexibility when establishing how much safety they want, since it can be tuned without impacting theoretical properties. We illustrate our proposal both in a simulation exercise and using a real dataset."}, "https://arxiv.org/abs/2211.09284": {"title": "Iterative execution of discrete and inverse discrete Fourier transforms with applications for signal denoising via sparsification", "link": "https://arxiv.org/abs/2211.09284", "description": "arXiv:2211.09284v3 Announce Type: replace-cross \nAbstract: We describe a family of iterative algorithms that involve the repeated execution of discrete and inverse discrete Fourier transforms. One interesting member of this family is motivated by the discrete Fourier transform uncertainty principle and involves the application of a sparsification operation to both the real domain and frequency domain data with convergence obtained when real domain sparsity hits a stable pattern. This sparsification variant has practical utility for signal denoising, in particular the recovery of a periodic spike signal in the presence of Gaussian noise. General convergence properties and denoising performance relative to existing methods are demonstrated using simulation studies.
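The iterative Fourier sparsification scheme described in the entry above alternates a DFT, a sparsification step, and an inverse DFT until the real-domain sparsity pattern stabilizes. The following is a minimal NumPy sketch of that general idea; the hard-thresholding rule, sparsity levels, and convergence check are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def sparsify(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def iterative_fft_denoise(x, k_time=10, k_freq=10, max_iter=200):
    """Alternate sparsification in the frequency and time domains until the
    time-domain support stabilizes (illustrative sketch, not the paper's code)."""
    prev_support = None
    for _ in range(max_iter):
        X = sparsify(np.fft.fft(x), k_freq)          # sparsify in the frequency domain
        x = sparsify(np.fft.ifft(X).real, k_time)    # back-transform, sparsify in the time domain
        support = tuple(np.flatnonzero(x))
        if support == prev_support:                  # stop once the sparsity pattern is stable
            break
        prev_support = support
    return x

# toy example: periodic spikes buried in Gaussian noise
rng = np.random.default_rng(1)
signal = np.zeros(128)
signal[::16] = 5.0
denoised = iterative_fft_denoise(signal + rng.normal(scale=1.0, size=128))
```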
An R package implementing this technique and related resources can be found at https://hrfrost.host.dartmouth.edu/IterativeFT."}, "https://arxiv.org/abs/2306.17667": {"title": "Bias-Free Estimation of Signals on Top of Unknown Backgrounds", "link": "https://arxiv.org/abs/2306.17667", "description": "arXiv:2306.17667v2 Announce Type: replace-cross \nAbstract: We present a method for obtaining unbiased signal estimates in the presence of a significant unknown background, eliminating the need for a parametric model for the background itself. Our approach is based on a minimal set of conditions for observation and background estimators, which are typically satisfied in practical scenarios. To showcase the effectiveness of our method, we apply it to simulated data from the planned dielectric axion haloscope MADMAX."}, "https://arxiv.org/abs/2311.02695": {"title": "Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions", "link": "https://arxiv.org/abs/2311.02695", "description": "arXiv:2311.02695v2 Announce Type: replace-cross \nAbstract: The task of inferring high-level causal variables from low-level observations, commonly referred to as causal representation learning, is fundamentally underconstrained. As such, recent works to address this problem focus on various assumptions that lead to identifiability of the underlying latent causal variables. A large corpus of these preceding approaches considers multi-environment data collected under different interventions on the causal model. What is common to virtually all of these works is the restrictive assumption that in each environment, only a single variable is intervened on. In this work, we relax this assumption and provide the first identifiability result for causal representation learning that allows for multiple variables to be targeted by an intervention within one environment. Our approach hinges on a general assumption on the coverage and diversity of interventions across environments, which also includes the shared assumption of single-node interventions of previous works. The main idea behind our approach is to exploit the trace that interventions leave on the variance of the ground truth causal variables and to regularize for a specific notion of sparsity with respect to this trace. In addition to and inspired by our theoretical contributions, we present a practical algorithm to learn causal representations from multi-node interventional data and provide empirical evidence that validates our identifiability results."}, "https://arxiv.org/abs/2403.17087": {"title": "Sparse inference in Poisson Log-Normal model by approximating the L0-norm", "link": "https://arxiv.org/abs/2403.17087", "description": "arXiv:2403.17087v1 Announce Type: new \nAbstract: Variable selection methods are required in practical statistical modeling to identify and include only the most relevant predictors, thereby improving model interpretability. Such variable selection methods are typically employed in regression models, for instance in this article for the Poisson Log Normal model (PLN, Chiquet et al., 2021). This model aims to explain multivariate count data using dependent variables, and its utility has been demonstrated in scientific fields such as ecology and agronomy. In the case of the PLN model, most recent papers focus on sparse network inference by combining the likelihood with an L1-penalty on the precision matrix.
In this paper, we propose to rely on a recent penalization method (SIC, O'Neill and Burke, 2023), which consists of smoothly approximating the L0-penalty and avoids calibrating a tuning parameter with a cross-validation procedure. Moreover, this work focuses on the coefficient matrix of the PLN model and establishes an inference procedure ensuring effective variable selection performance, so that the resulting fitted model explains multivariate count data using only relevant explanatory variables. Our proposal involves implementing a procedure that integrates the SIC penalization algorithm (epsilon-telescoping) and the PLN model fitting algorithm (a variational EM algorithm). To support our proposal, we provide theoretical results and insights about the penalization method, and we perform simulation studies to assess the method, which is also applied to real datasets."}, "https://arxiv.org/abs/2403.17117": {"title": "Covariate-adjusted Group Sequential Comparisons of Survival Probabilities", "link": "https://arxiv.org/abs/2403.17117", "description": "arXiv:2403.17117v1 Announce Type: new \nAbstract: In confirmatory clinical trials, survival outcomes are frequently studied and interim analyses for efficacy and/or futility are often desirable. Methods such as the log rank test and Cox regression model are commonly used to compare treatments in this setting. They rely on a proportional hazards (PH) assumption and are subject to type I error rate inflation and loss of power when PH are violated. Such violations may be expected a priori, particularly when the mechanisms of treatments differ such as immunotherapy vs. chemotherapy for treating cancer. We develop group sequential tests for comparing survival curves with covariate adjustment that allow for interim analyses in the presence of non-PH and offer easily interpreted, clinically meaningful summary measures of the treatment effect. The joint distribution of repeatedly computed test statistics converges to the canonical joint distribution with a Markov structure. The asymptotic distribution of the test statistics allows marginal comparisons of survival probabilities at multiple fixed time points and facilitates both critical value specification to maintain type I error control and sample size/power determination. Simulations demonstrate that the achieved type I error rate and power of the proposed tests meet targeted levels and are robust to the PH assumption and covariate influence. The proposed tests are illustrated using a clinical trial dataset from the Blood and Marrow Transplant Clinical Trials Network 1101 trial."}, "https://arxiv.org/abs/2403.17121": {"title": "High-dimensional Factor Analysis for Network-linked Data", "link": "https://arxiv.org/abs/2403.17121", "description": "arXiv:2403.17121v1 Announce Type: new \nAbstract: Factor analysis is a widely used statistical tool in many scientific disciplines, such as psychology, economics, and sociology. As observations linked by networks become increasingly common, incorporating network structures into factor analysis remains an open problem. In this paper, we focus on high-dimensional factor analysis involving network-connected observations, and propose a generalized factor model with latent factors that account for both the network structure and the dependence structure among high-dimensional variables. These latent factors can be shared by the high-dimensional variables and the network, or exclusively applied to either of them.
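Referring back to the SIC-penalized PLN entry above: epsilon-telescoping amounts to replacing the L0 penalty with a smooth surrogate and re-solving the problem along a decreasing sequence of epsilon values, warm-starting each solve from the previous one. The sketch below illustrates that idea on an ordinary least-squares objective; the particular surrogate beta^2/(beta^2 + eps^2), the schedule, and the fixed lambda are assumptions of mine, not the exact SIC formulation or the PLN variational EM.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_l0(beta, eps):
    """Smooth surrogate of the L0 norm: each term tends to 1 when |beta_j| >> eps
    and to 0 when beta_j -> 0 (illustrative choice of surrogate)."""
    return np.sum(beta**2 / (beta**2 + eps**2))

def penalized_fit(X, y, lam=2.0, eps_schedule=(1.0, 0.3, 0.1, 0.03, 0.01)):
    """Epsilon-telescoping: solve a sequence of smooth problems with shrinking eps,
    warm-starting each solve from the previous solution (sketch only)."""
    beta = np.zeros(X.shape[1])
    for eps in eps_schedule:
        obj = lambda b: np.sum((y - X @ b) ** 2) + lam * smooth_l0(b, eps)
        beta = minimize(obj, beta, method="BFGS").x
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=100)
print(np.round(penalized_fit(X, y), 2))   # near-zero estimates flag irrelevant predictors
```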
We develop a computationally efficient estimation procedure and establish asymptotic inferential theories. Notably, we show that by borrowing information from the network, the proposed estimator of the factor loading matrix achieves optimal asymptotic variance under much milder identifiability constraints than existing literature. Furthermore, we develop a hypothesis testing procedure to tackle the challenge of discerning the shared and individual latent factors' structure. The finite sample performance of the proposed method is demonstrated through simulation studies and a real-world dataset involving a statistician co-authorship network."}, "https://arxiv.org/abs/2403.17127": {"title": "High-Dimensional Mean-Variance Spanning Tests", "link": "https://arxiv.org/abs/2403.17127", "description": "arXiv:2403.17127v1 Announce Type: new \nAbstract: We introduce a new framework for the mean-variance spanning (MVS) hypothesis testing. The procedure can be applied to any test-asset dimension and only requires stationary asset returns and the number of benchmark assets to be smaller than the number of time periods. It involves individually testing moment conditions using a robust Student-t statistic based on the batch-mean method and combining the p-values using the Cauchy combination test. Simulations demonstrate the superior performance of the test compared to state-of-the-art approaches. For the empirical application, we look at the problem of domestic versus international diversification in equities. We find that the advantages of diversification are influenced by economic conditions and exhibit cross-country variation. We also highlight that the rejection of the MVS hypothesis originates from the potential to reduce variance within the domestic global minimum-variance portfolio."}, "https://arxiv.org/abs/2403.17132": {"title": "A Personalized Predictive Model that Jointly Optimizes Discrimination and Calibration", "link": "https://arxiv.org/abs/2403.17132", "description": "arXiv:2403.17132v1 Announce Type: new \nAbstract: Precision medicine is accelerating rapidly in the field of health research. This includes fitting predictive models for individual patients based on patient similarity in an attempt to improve model performance. We propose an algorithm which fits a personalized predictive model (PPM) using an optimal size of a similar subpopulation that jointly optimizes model discrimination and calibration, as it is criticized that calibration is not assessed nearly as often as discrimination despite poorly calibrated models being potentially misleading. We define a mixture loss function that considers model discrimination and calibration, and allows for flexibility in emphasizing one performance measure over another. We empirically show that the relationship between the size of subpopulation and calibration is quadratic, which motivates the development of our jointly optimized model. 
We also investigate the effect of within-population patient weighting on performance and conclude that the size of subpopulation has a larger effect on the predictive performance of the PPM compared to the choice of weight function."}, "https://arxiv.org/abs/2403.17257": {"title": "Statistical Inference on Hierarchical Simultaneous Autoregressive Models with Missing Data", "link": "https://arxiv.org/abs/2403.17257", "description": "arXiv:2403.17257v1 Announce Type: new \nAbstract: Efficient estimation methods for simultaneous autoregressive (SAR) models with missing data in the response variable have been well-developed in the literature. It is common practice to introduce a measurement error into SAR models. The measurement error serves to distinguish the noise component from the spatial process. However, the previous literature has not considered adding a measurement error to the SAR models with missing data. The maximum likelihood estimation for such models with large datasets is challenging and computationally expensive. This paper proposes two efficient likelihood-based estimation methods: the marginal maximum likelihood (ML) and expectation-maximisation (EM) algorithms for estimating SAR models with both measurement errors and missing data in the response variable. The spatial error model (SEM) and the spatial autoregressive model (SAM), two popular SAR model types, are considered. The missing data mechanism is assumed to follow missing at random (MAR). While naive calculation approaches lead to computational complexities of $O(n^3)$, where n is the total number of observations, our computational approaches for both the marginal ML and EM algorithms are designed to reduce the computational complexity. The performance of the proposed methods is investigated empirically using simulated and real datasets."}, "https://arxiv.org/abs/2403.17318": {"title": "Statistical analysis and method to propagate the impact of measurement uncertainty on dynamic mode decomposition", "link": "https://arxiv.org/abs/2403.17318", "description": "arXiv:2403.17318v1 Announce Type: new \nAbstract: We apply random matrix theory to study the impact of measurement uncertainty on dynamic mode decomposition. Specifically, when the measurements follow a normal probability density function, we show how the moments of that density propagate through the dynamic mode decomposition. While we focus on the first and second moments, the analytical expressions we derive are general and can be extended to higher-order moments. Further, the proposed numerical method to propagate uncertainty is agnostic of specific dynamic mode decomposition formulations. Of particular relevance, the estimated second moments provide confidence bounds that may be used as a metric of trustworthiness, that is, how much one can rely on a finite-dimensional linear operator to represent an underlying dynamical system. We perform numerical experiments on two canonical systems and verify the estimated confidence levels by comparing the moments to those obtained from Monte Carlo simulations."}, "https://arxiv.org/abs/2403.17321": {"title": "A Bayesian shrinkage estimator for transfer learning", "link": "https://arxiv.org/abs/2403.17321", "description": "arXiv:2403.17321v1 Announce Type: new \nAbstract: Transfer learning (TL) has emerged as a powerful tool to supplement data collected for a target task with data collected for a related source task. 
The Bayesian framework is natural for TL because information from the source data can be incorporated in the prior distribution for the target data analysis. In this paper, we propose and study Bayesian TL methods for the normal-means problem and multiple linear regression. We propose two classes of prior distributions. The first class assumes the difference in the parameters for the source and target tasks is sparse, i.e., many parameters are shared across tasks. The second assumes that none of the parameters are shared across tasks, but the differences are bounded in $\\ell_2$-norm. For the sparse case, we propose a Bayes shrinkage estimator with theoretical guarantees under mild assumptions. The proposed methodology is tested on synthetic data and outperforms state-of-the-art TL methods. We then use this method to fine-tune the last layer of a neural network model to predict the molecular gap property in a material science application. We report improved performance compared to classical fine tuning and methods using only the target data."}, "https://arxiv.org/abs/2403.17481": {"title": "A Type of Nonlinear Fr\\'echet Regressions", "link": "https://arxiv.org/abs/2403.17481", "description": "arXiv:2403.17481v1 Announce Type: new \nAbstract: The existing Fr\\'echet regression is actually defined within a linear framework, since the weight function in the Fr\\'echet objective function is linearly defined, and the resulting Fr\\'echet regression function is identified to be a linear model when the random object belongs to a Hilbert space. Even for nonparametric and semiparametric Fr\\'echet regressions, which are usually nonlinear, the existing methods handle them by local linear (or local polynomial) technique, and the resulting Fr\\'echet regressions are (locally) linear as well. We in this paper introduce a type of nonlinear Fr\\'echet regressions. Such a framework can be utilized to fit the essentially nonlinear models in a general metric space and uniquely identify the nonlinear structure in a Hilbert space. Particularly, its generalized linear form can return to the standard linear Fr\\'echet regression through a special choice of the weight function. Moreover, the generalized linear form possesses methodological and computational simplicity because the Euclidean variable and the metric space element are completely separable. The favorable theoretical properties (e.g. the estimation consistency and presentation theorem) of the nonlinear Fr\\'echet regressions are established systemically. The comprehensive simulation studies and a human mortality data analysis demonstrate that the new strategy is significantly better than the competitors."}, "https://arxiv.org/abs/2403.17489": {"title": "Adaptive Bayesian Structure Learning of DAGs With Non-conjugate Prior", "link": "https://arxiv.org/abs/2403.17489", "description": "arXiv:2403.17489v1 Announce Type: new \nAbstract: Directed Acyclic Graphs (DAGs) are solid structures used to describe and infer the dependencies among variables in multivariate scenarios. Having a thorough comprehension of the accurate DAG-generating model is crucial for causal discovery and estimation. Our work suggests utilizing a non-conjugate prior for Gaussian DAG structure learning to enhance the posterior probability. We employ the idea of using the Bessel function to address the computational burden, providing faster MCMC computation compared to the use of conjugate priors. 
In addition, our proposal exhibits a greater rate of adaptation when compared to the conjugate prior, specifically for the inclusion of nodes in the DAG-generating model. Simulation studies demonstrate the superior accuracy of DAG learning, and we obtain the same maximum a posteriori and median probability model estimate for the AML data, using the non-conjugate prior."}, "https://arxiv.org/abs/2403.17580": {"title": "Measuring Dependence between Events", "link": "https://arxiv.org/abs/2403.17580", "description": "arXiv:2403.17580v1 Announce Type: new \nAbstract: Measuring dependence between two events, or equivalently between two binary random variables, amounts to expressing the dependence structure inherent in a $2\\times 2$ contingency table in a real number between $-1$ and $1$. Countless such dependence measures exist, but there is little theoretical guidance on how they compare and on their advantages and shortcomings. Thus, practitioners might be overwhelmed by the problem of choosing a suitable measure. We provide a set of natural desirable properties that a proper dependence measure should fulfill. We show that Yule's Q and the little-known Cole coefficient are proper, while the most widely-used measures, the phi coefficient and all contingency coefficients, are improper. They have a severe attainability problem, that is, even under perfect dependence they can be very far away from $-1$ and $1$, and often differ substantially from the proper measures in that they understate strength of dependence. The structural reason is that these are measures for equality of events rather than of dependence. We derive the (in some instances non-standard) limiting distributions of the measures and illustrate how asymptotically valid confidence intervals can be constructed. In a case study on drug consumption we demonstrate how misleading conclusions may arise from the use of improper dependence measures."}, "https://arxiv.org/abs/2403.17609": {"title": "A location Invariant Statistic-Based Consistent Estimation Method for Three-Parameter Generalized Exponential Distribution", "link": "https://arxiv.org/abs/2403.17609", "description": "arXiv:2403.17609v1 Announce Type: new \nAbstract: In numerous instances, the generalized exponential distribution can be used as an alternative to the gamma distribution or the Weibull distribution when analyzing lifetime or skewed data. This article offers a consistent method for estimating the parameters of a three-parameter generalized exponential distribution that sidesteps the issue of an unbounded likelihood function. The method is hinged on a maximum likelihood estimation of shape and scale parameters that uses a location-invariant statistic. Important estimator properties, such as uniqueness and consistency, are demonstrated. In addition, quantile estimates for the lifetime distribution are provided. We present a Monte Carlo simulation study along with comparisons to a number of well-known estimation techniques in terms of bias and root mean square error. For illustrative purposes, a real-world lifetime data set is analyzed."}, "https://arxiv.org/abs/2403.17624": {"title": "The inclusive Synthetic Control Method", "link": "https://arxiv.org/abs/2403.17624", "description": "arXiv:2403.17624v1 Announce Type: new \nAbstract: We introduce the inclusive synthetic control method (iSCM), a modification of synthetic control type methods that allows the inclusion of units potentially affected directly or indirectly by an intervention in the donor pool. 
This method is well suited for applications with either multiple treated units or in which some of the units in the donor pool might be affected by spillover effects. Our iSCM is very easy to implement using most synthetic control type estimators. As an illustrative empirical example, we re-estimate the causal effect of German reunification on GDP per capita allowing for spillover effects from West Germany to Austria."}, "https://arxiv.org/abs/2403.17670": {"title": "A family of Chatterjee's correlation coefficients and their properties", "link": "https://arxiv.org/abs/2403.17670", "description": "arXiv:2403.17670v1 Announce Type: new \nAbstract: Quantifying the strength of functional dependence between random scalars $X$ and $Y$ is an important statistical problem. While many existing correlation coefficients excel in identifying linear or monotone functional dependence, they fall short in capturing general non-monotone functional relationships. In response, we propose a family of correlation coefficients $\\xi^{(h,F)}_n$, characterized by a continuous bivariate function $h$ and a cdf function $F$. By offering a range of selections for $h$ and $F$, $\\xi^{(h,F)}_n$ encompasses a diverse class of novel correlation coefficients, while also incorporates the Chatterjee's correlation coefficient (Chatterjee, 2021) as a special case. We prove that $\\xi^{(h,F)}_n$ converges almost surely to a deterministic limit $\\xi^{(h,F)}$ as sample size $n$ approaches infinity. In addition, under appropriate conditions imposed on $h$ and $F$, the limit $\\xi^{(h,F)}$ satisfies the three appealing properties: (P1). it belongs to the range of $[0,1]$; (P2). it equals 1 if and only if $Y$ is a measurable function of $X$; and (P3). it equals 0 if and only if $Y$ is independent of $X$. As amplified by our numerical experiments, our proposals provide practitioners with a variety of options to choose the most suitable correlation coefficient tailored to their specific practical needs."}, "https://arxiv.org/abs/2403.17777": {"title": "Deconvolution from two order statistics", "link": "https://arxiv.org/abs/2403.17777", "description": "arXiv:2403.17777v1 Announce Type: new \nAbstract: Economic data are often contaminated by measurement errors and truncated by ranking. This paper shows that the classical measurement error model with independent and additive measurement errors is identified nonparametrically using only two order statistics of repeated measurements. The identification result confirms a hypothesis by Athey and Haile (2002) for a symmetric ascending auction model with unobserved heterogeneity. Extensions allow for heterogeneous measurement errors, broadening the applicability to additional empirical settings, including asymmetric auctions and wage offer models. We adapt an existing simulated sieve estimator and illustrate its performance in finite samples."}, "https://arxiv.org/abs/2403.17882": {"title": "On the properties of distance covariance for categorical data: Robustness, sure screening, and approximate null distributions", "link": "https://arxiv.org/abs/2403.17882", "description": "arXiv:2403.17882v1 Announce Type: new \nAbstract: Pearson's Chi-squared test, though widely used for detecting association between categorical variables, exhibits low statistical power in large sparse contingency tables. To address this limitation, two novel permutation tests have been recently developed: the distance covariance permutation test and the U-statistic permutation test. 
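For reference alongside the entry above on the family $\\xi^{(h,F)}_n$, here is a short implementation of its special case, Chatterjee's original coefficient (Chatterjee, 2021), which the abstract cites. The formula below assumes continuous data without ties; the toy examples are my own.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation (Chatterjee, 2021), assuming no ties in y."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                      # sort the pairs by x
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(r))) / (n**2 - 1)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 1000)
print(chatterjee_xi(x, np.sin(3 * x)))          # near 1: y is a (non-monotone) function of x
print(chatterjee_xi(x, rng.normal(size=1000)))  # near 0: y is independent of x
```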
Both leverage the distance covariance functional but employ different estimators. In this work, we explore key statistical properties of the distance covariance for categorical variables. First, we show that unlike Chi-squared, the distance covariance functional is B-robust for any number of categories (fixed or diverging). Second, we establish the strong consistency of distance covariance screening under mild conditions, and simulations confirm its advantage over Chi-squared screening, especially for large sparse tables. Finally, we derive an approximate null distribution for a bias-corrected distance correlation estimate, demonstrating its effectiveness through simulations."}, "https://arxiv.org/abs/2403.17221": {"title": "Are Made and Missed Different? An analysis of Field Goal Attempts of Professional Basketball Players via Depth Based Testing Procedure", "link": "https://arxiv.org/abs/2403.17221", "description": "arXiv:2403.17221v1 Announce Type: cross \nAbstract: In this paper, we develop a novel depth-based testing procedure on spatial point processes to examine the difference in made and missed field goal attempts for NBA players. Specifically, our testing procedure can statistically detect the differences between made and missed field goal attempts for NBA players. We first obtain the depths of the two processes under the polar coordinate system. A two-dimensional Kolmogorov-Smirnov test is then performed to test the difference between the depths of the two processes. Through extensive simulation studies, we show that our testing procedure has good frequentist properties under both the null and alternative hypotheses. A comparison against competing methods shows that our proposed procedure has better testing reliability and testing power. Application to the shot chart data of 191 NBA players in the 2017-2018 regular season offers interesting insights about these players' made and missed shot patterns."}, "https://arxiv.org/abs/2403.17776": {"title": "Exploring the Boundaries of Ambient Awareness in Twitter", "link": "https://arxiv.org/abs/2403.17776", "description": "arXiv:2403.17776v1 Announce Type: cross \nAbstract: Ambient awareness refers to the ability of social media users to obtain knowledge about who knows what (i.e., users' expertise) in their network, by simply being exposed to other users' content (e.g., tweets on Twitter). Previous work, based on user surveys, reveals that individuals self-report ambient awareness only for parts of their networks. However, it is unclear whether it is their limited cognitive capacity or the limited exposure to diagnostic tweets (i.e., online content) that prevents people from developing ambient awareness for their complete network. In this work, we focus on in-wall ambient awareness (IWAA) in Twitter and conduct a two-step data-driven analysis that allows us to explore to what extent IWAA is likely, or even possible. First, we rely on reactions (e.g., likes) as strong evidence of users being aware of experts in Twitter. Unfortunately, such strong evidence can only be measured for active users, who represent the minority in the network. Thus, to study the boundaries of IWAA to a larger extent, in the second part of our analysis, we instead focus on the passive exposure to content generated by other users -- which we refer to as in-wall visibility.
This analysis shows that (in line with \\citet{levordashka2016ambient}) only for a subset of users IWAA is plausible, while for the majority it is unlikely, if even possible, to develop IWAA. We hope that our methodology paves the way for the emergence of data-driven approaches for the study of ambient awareness."}, "https://arxiv.org/abs/2206.04133": {"title": "Bayesian multivariate logistic regression for superiority and inferiority decision-making under observable treatment heterogeneity", "link": "https://arxiv.org/abs/2206.04133", "description": "arXiv:2206.04133v4 Announce Type: replace \nAbstract: The effects of treatments may differ between persons with different characteristics. Addressing such treatment heterogeneity is crucial to investigate whether patients with specific characteristics are likely to benefit from a new treatment. The current paper presents a novel Bayesian method for superiority decision-making in the context of randomized controlled trials with multivariate binary responses and heterogeneous treatment effects. The framework is based on three elements: a) Bayesian multivariate logistic regression analysis with a P\\'olya-Gamma expansion; b) a transformation procedure to transfer obtained regression coefficients to a more intuitive multivariate probability scale (i.e., success probabilities and the differences between them); and c) a compatible decision procedure for treatment comparison with prespecified decision error rates. Procedures for a priori sample size estimation under a non-informative prior distribution are included. A numerical evaluation demonstrated that decisions based on a priori sample size estimation resulted in anticipated error rates among the trial population as well as subpopulations. Further, average and conditional treatment effect parameters could be estimated unbiasedly when the sample was large enough. Illustration with the International Stroke Trial dataset revealed a trend towards heterogeneous effects among stroke patients: Something that would have remained undetected when analyses were limited to average treatment effects."}, "https://arxiv.org/abs/2301.08958": {"title": "A Practical Introduction to Regression Discontinuity Designs: Extensions", "link": "https://arxiv.org/abs/2301.08958", "description": "arXiv:2301.08958v2 Announce Type: replace \nAbstract: This monograph, together with its accompanying first part Cattaneo, Idrobo and Titiunik (2020), collects and expands the instructional materials we prepared for more than $50$ short courses and workshops on Regression Discontinuity (RD) methodology that we taught between 2014 and 2023. In this second monograph, we discuss several topics in RD methodology that build on and extend the analysis of RD designs introduced in Cattaneo, Idrobo and Titiunik (2020). Our first goal is to present an alternative RD conceptual framework based on local randomization ideas. This methodological approach can be useful in RD designs with discretely-valued scores, and can also be used more broadly as a complement to the continuity-based approach in other settings. Then, employing both continuity-based and local randomization approaches, we extend the canonical Sharp RD design in multiple directions: fuzzy RD designs, RD designs with discrete scores, and multi-dimensional RD designs. 
The goal of our two-part monograph is purposely practical and hence we focus on the empirical analysis of RD designs."}, "https://arxiv.org/abs/2302.03157": {"title": "A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data", "link": "https://arxiv.org/abs/2302.03157", "description": "arXiv:2302.03157v2 Announce Type: replace \nAbstract: Recent advancements in Mixed Integer Optimization (MIO) algorithms, paired with hardware enhancements, have led to significant speedups in resolving MIO problems. These strategies have been utilized for optimal subset selection, specifically for choosing $k$ features out of $p$ in linear regression given $n$ observations. In this paper, we broaden this method to facilitate cluster-aware regression, where selection aims to choose $\\lambda$ out of $K$ clusters in a linear mixed effects (LMM) model with $n_k$ observations for each cluster. Through comprehensive testing on a multitude of synthetic and real datasets, we exhibit that our method efficiently solves problems within minutes. Through numerical experiments, we also show that the MIO approach outperforms both Gaussian- and Laplace-distributed LMMs in terms of generating sparse solutions with high predictive power. Traditional LMMs typically assume that clustering effects are independent of individual features. However, we introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model. The inferential and predictive efficacy of this approach is further illustrated through its application in student scoring and protein expression."}, "https://arxiv.org/abs/2303.04754": {"title": "Estimation of Long-Range Dependent Models with Missing Data: to Impute or not to Impute?", "link": "https://arxiv.org/abs/2303.04754", "description": "arXiv:2303.04754v2 Announce Type: replace \nAbstract: Among the most important models for long-range dependent time series is the class of ARFIMA$(p,d,q)$ (Autoregressive Fractionally Integrated Moving Average) models. Estimating the long-range dependence parameter $d$ in ARFIMA models is a well-studied problem, but the literature regarding the estimation of $d$ in the presence of missing data is very sparse. There are two basic approaches to dealing with the problem: missing data can be imputed using some plausible method, and then the estimation can proceed as if no data were missing, or we can use a specially tailored methodology to estimate $d$ in the presence of missing data. In this work, we review some of the methods available for both approaches and compare them through a Monte Carlo simulation study. We present a comparison among 35 different setups to estimate $d$, under tenths of different scenarios, considering percentages of missing data ranging from as few as 10\\% up to 70\\% and several levels of dependence."}, "https://arxiv.org/abs/2304.07034": {"title": "Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata", "link": "https://arxiv.org/abs/2304.07034", "description": "arXiv:2304.07034v4 Announce Type: replace \nAbstract: The optimum sample allocation in stratified sampling is one of the basic issues of survey methodology. 
It is a procedure of dividing the overall sample size into strata sample sizes in such a way that for given sampling designs in strata the variance of the stratified $\\pi$ estimator of the population total (or mean) for a given study variable assumes its minimum. In this work, we consider the optimum allocation of a sample, under lower and upper bounds imposed jointly on sample sizes in strata. We are concerned with the variance function of some generic form that, in particular, covers the case of the simple random sampling without replacement in strata. The goal of this paper is twofold. First, we establish (using the Karush-Kuhn-Tucker conditions) a generic form of the optimal solution, the so-called optimality conditions. Second, based on the established optimality conditions, we derive an efficient recursive algorithm, named RNABOX, which solves the allocation problem under study. The RNABOX can be viewed as a generalization of the classical recursive Neyman allocation algorithm, a popular tool for optimum allocation when only upper bounds are imposed on sample strata-sizes. We implement RNABOX in R as a part of our package stratallo which is available from the Comprehensive R Archive Network (CRAN) repository."}, "https://arxiv.org/abs/2304.10025": {"title": "Identification and multiply robust estimation in causal mediation analysis across principal strata", "link": "https://arxiv.org/abs/2304.10025", "description": "arXiv:2304.10025v3 Announce Type: replace \nAbstract: We consider assessing causal mediation in the presence of a post-treatment event (examples include noncompliance, a clinical event, or a terminal event). We identify natural mediation effects for the entire study population and for each principal stratum characterized by the joint potential values of the post-treatment event. We derive efficient influence functions for each mediation estimand, which motivate a set of multiply robust estimators for inference. The multiply robust estimators are consistent under four types of misspecifications and are efficient when all nuisance models are correctly specified. We illustrate our methods via simulations and two real data examples."}, "https://arxiv.org/abs/2306.10213": {"title": "A General Form of Covariate Adjustment in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2306.10213", "description": "arXiv:2306.10213v2 Announce Type: replace \nAbstract: In randomized clinical trials, adjusting for baseline covariates can improve credibility and efficiency for demonstrating and quantifying treatment effects. This article studies the augmented inverse propensity weighted (AIPW) estimator, which is a general form of covariate adjustment that uses linear, generalized linear, and non-parametric or machine learning models for the conditional mean of the response given covariates. Under covariate-adaptive randomization, we establish general theorems that show a complete picture of the asymptotic normality, {efficiency gain, and applicability of AIPW estimators}. In particular, we provide for the first time a rigorous theoretical justification of using machine learning methods with cross-fitting for dependent data under covariate-adaptive randomization. Based on the general theorems, we offer insights on the conditions for guaranteed efficiency gain and universal applicability {under different randomization schemes}, which also motivate a joint calibration strategy using some constructed covariates after applying AIPW. 
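Returning to the recursive Neyman allocation entry above (RNABOX): the classical recursive Neyman algorithm it generalizes handles upper bounds only, by capping violating strata at their bounds and reallocating the remainder. The sketch below implements that textbook version as a point of reference; it is not RNABOX itself and does not handle the lower bounds that RNABOX adds.

```python
import numpy as np

def recursive_neyman(n, N, S, upper):
    """Classical recursive Neyman allocation with upper bounds only:
    allocate n proportionally to N_h * S_h, fix any stratum exceeding its
    upper bound at that bound, and reallocate the remaining sample size.
    (Textbook sketch; RNABOX additionally handles lower bounds.)"""
    H = len(N)
    alloc = np.zeros(H)
    active = np.ones(H, dtype=bool)
    n_left = float(n)
    while True:
        w = N[active] * S[active]
        prop = n_left * w / w.sum()            # Neyman: n_h proportional to N_h * S_h
        over = prop > upper[active]
        if not over.any():
            alloc[active] = prop
            return alloc
        idx = np.flatnonzero(active)[over]     # cap the violating strata at their bounds
        alloc[idx] = upper[idx]
        n_left -= upper[idx].sum()
        active[idx] = False

N = np.array([1000.0, 500.0, 300.0, 200.0])    # stratum sizes
S = np.array([10.0, 40.0, 5.0, 20.0])          # stratum standard deviations
print(recursive_neyman(400, N, S, upper=np.array([np.inf, 150.0, np.inf, np.inf])))
```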
Our methods are implemented in the R package RobinCar."}, "https://arxiv.org/abs/2306.16378": {"title": "Spatiotemporal Besov Priors for Bayesian Inverse Problems", "link": "https://arxiv.org/abs/2306.16378", "description": "arXiv:2306.16378v2 Announce Type: replace \nAbstract: Fast development in science and technology has driven the need for proper statistical tools to capture special data features such as abrupt changes or sharp contrast. Many inverse problems in data science require spatiotemporal solutions derived from a sequence of time-dependent objects with these spatial features, e.g., dynamic reconstruction of computerized tomography (CT) images with edges. Conventional methods based on Gaussian processes (GP) often fall short in providing satisfactory solutions since they tend to offer over-smooth priors. Recently, the Besov process (BP), defined by wavelet expansions with random coefficients, has emerged as a more suitable prior for Bayesian inverse problems of this nature. While BP excels in handling spatial inhomogeneity, it does not automatically incorporate temporal correlation inherited in the dynamically changing objects. In this paper, we generalize BP to a novel spatiotemporal Besov process (STBP) by replacing the random coefficients in the series expansion with stochastic time functions as Q-exponential process (Q-EP) which governs the temporal correlation structure. We thoroughly investigate the mathematical and statistical properties of STBP. A white-noise representation of STBP is also proposed to facilitate the inference. Simulations, two limited-angle CT reconstruction examples and a highly non-linear inverse problem involving Navier-Stokes equation are used to demonstrate the advantage of the proposed STBP in preserving spatial features while accounting for temporal changes compared with the classic STGP and a time-uncorrelated approach."}, "https://arxiv.org/abs/2310.17248": {"title": "The observed Fisher information attached to the EM algorithm, illustrated on Shepp and Vardi estimation procedure for positron emission tomography", "link": "https://arxiv.org/abs/2310.17248", "description": "arXiv:2310.17248v2 Announce Type: replace \nAbstract: The Shepp & Vardi (1982) implementation of the EM algorithm for PET scan tumor estimation provides a point estimate of the tumor. The current study presents a closed-form formula of the observed Fisher information for Shepp & Vardi PET scan tumor estimation. Keywords: PET scan, EM algorithm, Fisher information matrix, standard errors."}, "https://arxiv.org/abs/2305.04937": {"title": "Randomly sampling bipartite networks with fixed degree sequences", "link": "https://arxiv.org/abs/2305.04937", "description": "arXiv:2305.04937v3 Announce Type: replace-cross \nAbstract: Statistical analysis of bipartite networks frequently requires randomly sampling from the set of all bipartite networks with the same degree sequence as an observed network. Trade algorithms offer an efficient way to generate samples of bipartite networks by incrementally `trading' the positions of some of their edges. However, it is difficult to know how many such trades are required to ensure that the sample is random. I propose a stopping rule that focuses on the distance between sampled networks and the observed network, and stops performing trades when this distribution stabilizes. 
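The "trade" operation referenced in the bipartite-network entry above is a degree-preserving checkerboard swap. Below is a minimal NumPy sketch of repeated trades, with the Hamming distance to the observed matrix tracked as a crude stand-in for the paper's distance-based stopping rule; the single swap orientation and the fixed trade count are simplifications of my own.

```python
import numpy as np

def trade_step(B, rng):
    """Attempt one degree-preserving 'trade': pick rows r1, r2 and columns c1, c2
    forming a checkerboard pattern and swap it, keeping all row/column degrees fixed."""
    r1, r2 = rng.choice(B.shape[0], size=2, replace=False)
    c1, c2 = rng.choice(B.shape[1], size=2, replace=False)
    if B[r1, c1] == 1 and B[r2, c2] == 1 and B[r1, c2] == 0 and B[r2, c1] == 0:
        B[r1, c1] = B[r2, c2] = 0
        B[r1, c2] = B[r2, c1] = 1

def sample_network(B_obs, n_trades=5000, seed=0):
    """Randomize a copy of the observed bipartite matrix by repeated trades and
    record its Hamming distance to the original after each attempted trade."""
    rng = np.random.default_rng(seed)
    B = B_obs.copy()
    distances = []
    for _ in range(n_trades):
        trade_step(B, rng)
        distances.append(int(np.abs(B - B_obs).sum()))
    return B, distances

B_obs = (np.random.default_rng(4).random((20, 30)) < 0.2).astype(int)
B_rand, dist_trace = sample_network(B_obs)
```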
Analyses demonstrate that, for over 300 different degree sequences, using this stopping rule ensures a random sample with a high probability, and that it is practical for use in empirical applications."}, "https://arxiv.org/abs/2305.16915": {"title": "When is cross impact relevant?", "link": "https://arxiv.org/abs/2305.16915", "description": "arXiv:2305.16915v2 Announce Type: replace-cross \nAbstract: Trading pressure from one asset can move the price of another, a phenomenon referred to as cross impact. Using tick-by-tick data spanning 5 years for 500 assets listed in the United States, we identify the features that make cross-impact relevant to explain the variance of price returns. We show that price formation occurs endogenously within highly liquid assets. Then, trades in these assets influence the prices of less liquid correlated products, with an impact velocity constrained by their minimum trading frequency. We investigate the implications of such a multidimensional price formation mechanism on interest rate markets. We find that the 10-year bond future serves as the primary liquidity reservoir, influencing the prices of cash bonds and futures contracts within the interest rate curve. Such behaviour challenges the validity of the theory in Financial Economics that regards long-term rates as agents anticipations of future short term rates."}, "https://arxiv.org/abs/2311.01453": {"title": "PPI++: Efficient Prediction-Powered Inference", "link": "https://arxiv.org/abs/2311.01453", "description": "arXiv:2311.01453v2 Announce Type: replace-cross \nAbstract: We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations."}, "https://arxiv.org/abs/2311.02766": {"title": "Riemannian Laplace Approximation with the Fisher Metric", "link": "https://arxiv.org/abs/2311.02766", "description": "arXiv:2311.02766v3 Announce Type: replace-cross \nAbstract: Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. 
We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments."}, "https://arxiv.org/abs/2403.17948": {"title": "The Rule of link functions on Binomial Regression Model: A Cross Sectional Study on Child Malnutrition, Bangladesh", "link": "https://arxiv.org/abs/2403.17948", "description": "arXiv:2403.17948v1 Announce Type: new \nAbstract: The link function is a key tool in the binomial regression model, a non-linear model under the GLM approach. It transforms the nonlinear regression into a linear model by mapping the interval (-\\infty,\\infty) onto the probability scale [0,1]. Binomial models with the logit, probit, cloglog and cauchy link functions are applied to the proportion of malnourished children aged 0-5 years at the household level. The Multiple Indicator Cluster Survey (MICS) 2019, Bangladesh, was conducted jointly by UNICEF and BBS. The survey covered 64000 households using a two-stage stratified sampling technique, of which around 21000 households have children aged 0-5 years. We use bivariate analysis to assess the statistical association between the response and sociodemographic features. Among the binary regression models, the probit model provides the best result based on the lowest standard errors of the covariates and goodness-of-fit criteria (deviance, AIC)."}, "https://arxiv.org/abs/2403.17982": {"title": "Markov chain models for inspecting response dynamics in psychological testing", "link": "https://arxiv.org/abs/2403.17982", "description": "arXiv:2403.17982v1 Announce Type: new \nAbstract: The importance of considering contextual probabilities in shaping response patterns within psychological testing is underscored, despite the ubiquitous nature of order effects discussed extensively in the methodological literature. Drawing from concepts such as path-dependency, first-order autocorrelation, state-dependency, and hysteresis, the present study addresses how earlier responses serve as an anchor for subsequent answers in tests, surveys, and questionnaires. Introducing the notion of non-commuting observables derived from quantum physics, I highlight their role in characterizing psychological processes and the impact of measurement instruments on participants' responses. We advocate for the utilization of first-order Markov chain modeling to capture and forecast sequential dependencies in survey and test responses. The use of the first-order Markov chain model rests on individuals' propensity to attend partially to preceding responses, with recent items most likely exerting a substantial influence on subsequent response selection. This study contributes to advancing our understanding of the dynamics inherent in sequential data within psychological research and provides a methodological framework for conducting longitudinal analyses of response patterns in tests and questionnaires."}, "https://arxiv.org/abs/2403.17986": {"title": "Comment on \"Safe Testing\" by Gr\\\"unwald, de Heide, and Koolen", "link": "https://arxiv.org/abs/2403.17986", "description": "arXiv:2403.17986v1 Announce Type: new \nAbstract: This comment briefly reflects on \"Safe Testing\" by Gr\\\"{u}nwald et al. (2024). 
The safety of fractional Bayes factors (O'Hagan, 1995) is illustrated and compared to (safe) Bayes factors based on the right Haar prior."}, "https://arxiv.org/abs/2403.18039": {"title": "Doubly robust causal inference through penalized bias-reduced estimation: combining non-probability samples with designed surveys", "link": "https://arxiv.org/abs/2403.18039", "description": "arXiv:2403.18039v1 Announce Type: new \nAbstract: Causal inference on the average treatment effect (ATE) using non-probability samples, such as electronic health records (EHR), faces challenges from sample selection bias and high-dimensional covariates. This requires considering a selection model alongside the treatment and outcome models that are typical ingredients in causal inference. This paper considers integrating large non-probability samples with external probability samples from a designed survey, addressing moderately high-dimensional confounders and variables that influence selection. In contrast to the two-step approach that separates variable selection and debiased estimation, we propose a one-step plug-in doubly robust (DR) estimator of the ATE. We construct a novel penalized estimating equation by minimizing the squared asymptotic bias of the DR estimator. Our approach facilitates ATE inference in high-dimensional settings by ignoring the variability in estimating nuisance parameters, which is not guaranteed in conventional likelihood approaches with non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample."}, "https://arxiv.org/abs/2403.18069": {"title": "Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information", "link": "https://arxiv.org/abs/2403.18069", "description": "arXiv:2403.18069v1 Announce Type: new \nAbstract: The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric space-based statistical objects. The aim of this paper is to introduce a novel two-step framework that comprises: (i) an imputation method for statistical objects taking values in metric spaces, and (ii) a criterion for personalizing imputation using conformal inference techniques. This work is motivated by the need to impute distributional functional representations of continuous glucose monitoring (CGM) data within the context of a longitudinal study on diabetes, where a significant fraction of patients do not have available CGM profiles. The importance of these methods is illustrated by evaluating the effectiveness of CGM data as new digital biomarkers to predict the time to diabetes onset in healthy populations. 
To address these scientific challenges, we propose: (i) a new regression algorithm for missing responses; (ii) novel conformal prediction algorithms tailored for metric spaces with a focus on density responses within the 2-Wasserstein geometry; (iii) a broadly applicable personalized imputation criterion, designed to enhance both of the aforementioned strategies, yet valid across any statistical model and data structure. Our findings reveal that incorporating CGM data into diabetes time-to-event analysis, augmented with a novel personalization phase of imputation, significantly enhances predictive accuracy by over ten percent compared to traditional predictive models for time to diabetes."}, "https://arxiv.org/abs/2403.18115": {"title": "Assessing COVID-19 Vaccine Effectiveness in Observational Studies via Nested Trial Emulation", "link": "https://arxiv.org/abs/2403.18115", "description": "arXiv:2403.18115v1 Announce Type: new \nAbstract: Observational data are often used to estimate real-world effectiveness and durability of coronavirus disease 2019 (COVID-19) vaccines. A sequence of nested trials can be emulated to draw inference from such data while minimizing selection bias, immortal time bias, and confounding. Typically, when nested trial emulation (NTE) is employed, effect estimates are pooled across trials to increase statistical efficiency. However, such pooled estimates may lack a clear interpretation when the treatment effect is heterogeneous across trials. In the context of COVID-19, vaccine effectiveness quite plausibly will vary over calendar time due to newly emerging variants of the virus. This manuscript considers an NTE inverse probability weighted estimator of vaccine effectiveness that may vary over calendar time, time since vaccination, or both. Statistical testing of the trial effect homogeneity assumption is considered. Simulation studies are presented examining the finite-sample performance of these methods under a variety of scenarios. The methods are used to estimate vaccine effectiveness against COVID-19 outcomes using observational data on over 120,000 residents of Abruzzo, Italy during 2021."}, "https://arxiv.org/abs/2403.18248": {"title": "Statistical Inference of Optimal Allocations I: Regularities and their Implications", "link": "https://arxiv.org/abs/2403.18248", "description": "arXiv:2403.18248v1 Announce Type: new \nAbstract: In this paper, we develop a functional differentiability approach for solving statistical optimal allocation problems. We first derive Hadamard differentiability of the value function through a detailed analysis of the general properties of the sorting operator. Central to our framework are the concept of Hausdorff measure and the area and coarea integration formulas from geometric measure theory. Building on our Hadamard differentiability results, we demonstrate how the functional delta method can be used to directly derive the asymptotic properties of the value function process for binary constrained optimal allocation problems, as well as the two-step ROC curve estimator. Moreover, leveraging profound insights from geometric functional analysis on convex and local Lipschitz functionals, we obtain additional generic Fr\\'echet differentiability results for the value functions of optimal allocation problems. These compelling findings motivate us to study carefully the first-order approximation of the optimal social welfare. We then present a double/debiased estimator for the value functions. 
Importantly, the conditions outlined in the Hadamard differentiability section validate the margin assumption from the statistical classification literature on plug-in methods, which justifies a faster convergence rate."}, "https://arxiv.org/abs/2403.18464": {"title": "Cumulative Incidence Function Estimation Based on Population-Based Biobank Data", "link": "https://arxiv.org/abs/2403.18464", "description": "arXiv:2403.18464v1 Announce Type: new \nAbstract: Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data is collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency, and (2) CIF estimation for ages before the lower limit, $c_L$."}, "https://arxiv.org/abs/2403.18503": {"title": "Distributional Treatment Effect with Finite Mixture", "link": "https://arxiv.org/abs/2403.18503", "description": "arXiv:2403.18503v1 Announce Type: new \nAbstract: Treatment effect heterogeneity is of great concern when evaluating a treatment. However, even in the simple case of a binary treatment, the distribution of the treatment effect is difficult to identify due to the fundamental limitation that we cannot observe both the treated and untreated potential outcomes for a given individual. This paper assumes a finite mixture model on the potential outcomes and a vector of control covariates to address treatment endogeneity and imposes a Markov condition on the potential outcomes and covariates within each type to identify the treatment effect distribution. The mixture weights of the finite mixture model are consistently estimated with a nonnegative matrix factorization algorithm, thus allowing us to consistently estimate the component distribution parameters, including those for the treatment effect distribution."}, "https://arxiv.org/abs/2403.18549": {"title": "A communication-efficient, online changepoint detection method for monitoring distributed sensor networks", "link": "https://arxiv.org/abs/2403.18549", "description": "arXiv:2403.18549v1 Announce Type: new \nAbstract: We consider the challenge of efficiently detecting changes within a network of sensors, where we also need to minimise communication between sensors and the cloud. We propose an online, communication-efficient method to detect such changes. The procedure works by performing likelihood ratio tests at each time point, and two thresholds are chosen to filter unimportant test statistics and to make decisions based on the aggregated test statistics, respectively. We provide asymptotic theory concerning consistency and the asymptotic distribution if there are no changes. 
Simulation results suggest that our method can achieve similar performance to the idealised setting, where we have no constraints on communication between sensors, while substantially reducing the transmission costs."}, "https://arxiv.org/abs/2403.18602": {"title": "Collaborative graphical lasso", "link": "https://arxiv.org/abs/2403.18602", "description": "arXiv:2403.18602v1 Announce Type: new \nAbstract: In recent years, the availability of multi-omics data has increased substantially. Multi-omics data integration methods mainly aim to leverage different molecular data sets to gain a complete molecular description of biological processes. An attractive integration approach is the reconstruction of multi-omics networks. However, the development of effective multi-omics network reconstruction strategies lags behind. This hinders maximizing the potential of multi-omics data sets. With this study, we advance the frontier of multi-omics network reconstruction by introducing \"collaborative graphical lasso\" as a novel strategy. Our proposed algorithm synergizes \"graphical lasso\" with the concept of \"collaboration\", effectively harmonizing the integration of multi-omics data sets and thereby enhancing the accuracy of network inference. In addition, to tackle model selection in this framework, we designed an ad hoc procedure based on network stability. We assess the performance of collaborative graphical lasso and the corresponding model selection procedure through simulations, and we apply them to publicly available multi-omics data. This demonstrated that collaborative graphical lasso is able to reconstruct known biological connections and suggest previously unknown and biologically coherent interactions, enabling the generation of novel hypotheses. We implemented collaborative graphical lasso as an R package, available on CRAN as coglasso."}, "https://arxiv.org/abs/2403.18072": {"title": "Goal-Oriented Bayesian Optimal Experimental Design for Nonlinear Models using Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2403.18072", "description": "arXiv:2403.18072v1 Announce Type: cross \nAbstract: Optimal experimental design (OED) provides a systematic approach to quantify and maximize the value of experimental data. Under a Bayesian approach, conventional OED maximizes the expected information gain (EIG) on model parameters. However, we are often interested not in the parameters themselves, but in predictive quantities of interest (QoIs) that depend on the parameters in a nonlinear manner. We present a computational framework of predictive goal-oriented OED (GO-OED) suitable for nonlinear observation and prediction models, which seeks the experimental design providing the greatest EIG on the QoIs. In particular, we propose a nested Monte Carlo estimator for the QoI EIG, featuring Markov chain Monte Carlo for posterior sampling and kernel density estimation for evaluating the posterior-predictive density and its Kullback-Leibler divergence from the prior-predictive. The GO-OED design is then found by maximizing the EIG over the design space using Bayesian optimization. 
We demonstrate the effectiveness of the overall nonlinear GO-OED method, and illustrate its differences versus conventional non-GO-OED, through various test problems and an application of sensor placement for source inversion in a convection-diffusion field."}, "https://arxiv.org/abs/2403.18245": {"title": "LocalCop: An R package for local likelihood inference for conditional copulas", "link": "https://arxiv.org/abs/2403.18245", "description": "arXiv:2403.18245v1 Announce Type: cross \nAbstract: Conditional copulas models allow the dependence structure between multiple response variables to be modelled as a function of covariates. LocalCop (Acar & Lysy, 2024) is an R/C++ package for computationally efficient semiparametric conditional copula modelling using a local likelihood inference framework developed in Acar, Craiu, & Yao (2011), Acar, Craiu, & Yao (2013) and Acar, Czado, & Lysy (2019)."}, "https://arxiv.org/abs/2403.18782": {"title": "Beyond boundaries: Gary Lorden's groundbreaking contributions to sequential analysis", "link": "https://arxiv.org/abs/2403.18782", "description": "arXiv:2403.18782v1 Announce Type: cross \nAbstract: Gary Lorden provided a number of fundamental and novel insights to sequential hypothesis testing and changepoint detection. In this article we provide an overview of Lorden's contributions in the context of existing results in those areas, and some extensions made possible by Lorden's work, mentioning also areas of application including threat detection in physical-computer systems, near-Earth space informatics, epidemiology, clinical trials, and finance."}, "https://arxiv.org/abs/2109.05755": {"title": "IQ: Intrinsic measure for quantifying the heterogeneity in meta-analysis", "link": "https://arxiv.org/abs/2109.05755", "description": "arXiv:2109.05755v2 Announce Type: replace \nAbstract: Quantifying the heterogeneity is an important issue in meta-analysis, and among the existing measures, the $I^2$ statistic is the most commonly used measure in the literature. In this paper, we show that the $I^2$ statistic was, in fact, defined as problematic or even completely wrong from the very beginning. To confirm this statement, we first present a motivating example to show that the $I^2$ statistic is heavily dependent on the study sample sizes, and consequently it may yield contradictory results for the amount of heterogeneity. Moreover, by drawing a connection between ANOVA and meta-analysis, the $I^2$ statistic is shown to have, mistakenly, applied the sampling errors of the estimators rather than the variances of the study populations. Inspired by this, we introduce an Intrinsic measure for Quantifying the heterogeneity in meta-analysis, and meanwhile study its statistical properties to clarify why it is superior to the existing measures. We further propose an optimal estimator, referred to as the IQ statistic, for the new measure of heterogeneity that can be readily applied in meta-analysis. 
Simulations and real data analysis demonstrate that the IQ statistic provides a nearly unbiased estimate of the true heterogeneity and it is also independent of the study sample sizes."}, "https://arxiv.org/abs/2207.07020": {"title": "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO", "link": "https://arxiv.org/abs/2207.07020", "description": "arXiv:2207.07020v3 Announce Type: replace \nAbstract: The multivariate regression interpretation of the Gaussian chain graph model simultaneously parametrizes (i) the direct effects of $p$ predictors on $q$ outcomes and (ii) the residual partial covariances between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation Conditional Maximization algorithm to obtain sparse estimates of the $p \\times q$ matrix of direct effects and the $q \\times q$ residual precision matrix. Our algorithm iteratively solves a sequence of penalized maximum likelihood problems with self-adaptive penalties that gradually filter out negligible regression coefficients and partial covariances. Because it adaptively penalizes individual model parameters, our method is seen to outperform fixed-penalty competitors on simulated data. We establish the posterior contraction rate for our model, buttressing our method's excellent empirical performance with strong theoretical guarantees. Using our method, we estimated the direct effects of diet and residence type on the composition of the gut microbiome of elderly adults."}, "https://arxiv.org/abs/2210.09426": {"title": "Party On: The Labor Market Returns to Social Networks in Adolescence", "link": "https://arxiv.org/abs/2210.09426", "description": "arXiv:2210.09426v5 Announce Type: replace \nAbstract: We investigate the returns to adolescent friendships on earnings in adulthood using data from the National Longitudinal Study of Adolescent to Adult Health. Because both education and friendships are jointly determined in adolescence, OLS estimates of their returns are likely biased. We implement a novel procedure to obtain bounds on the causal returns to friendships: we assume that the returns to schooling range from 5 to 15% (based on prior literature), and instrument for friendships using similarity in age among peers. Having one more friend in adolescence increases earnings between 7 and 14%, substantially more than OLS estimates would suggest."}, "https://arxiv.org/abs/2306.13829": {"title": "Selective inference using randomized group lasso estimators for general models", "link": "https://arxiv.org/abs/2306.13829", "description": "arXiv:2306.13829v3 Announce Type: replace \nAbstract: Selective inference methods are developed for group lasso estimators for use with a wide class of distributions and loss functions. The method includes the use of exponential family distributions, as well as quasi-likelihood modeling for overdispersed count data, for example, and allows for categorical or grouped covariates as well as continuous covariates. A randomized group-regularized optimization problem is studied. The added randomization allows us to construct a post-selection likelihood which we show to be adequate for selective inference when conditioning on the event of the selection of the grouped covariates. This likelihood also provides a selective point estimator, accounting for the selection by the group lasso. 
Confidence regions for the regression parameters in the selected model take the form of Wald-type regions and are shown to have bounded volume. The selective inference method for grouped lasso is illustrated on data from the national health and nutrition examination survey while simulations showcase its behaviour and favorable comparison with other methods."}, "https://arxiv.org/abs/2306.15173": {"title": "Robust propensity score weighting estimation under missing at random", "link": "https://arxiv.org/abs/2306.15173", "description": "arXiv:2306.15173v3 Announce Type: replace \nAbstract: Missing data is frequently encountered in many areas of statistics. Propensity score weighting is a popular method for handling missing data. The propensity score method employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. Doubly robust estimation is attractive, as the consistency of the estimator is guaranteed when either the outcome regression model or the propensity score model is correctly specified. In this paper, we first employ information projection to develop an efficient and doubly robust estimator under indirect model calibration constraints. The resulting propensity score estimator can be equivalently expressed as a doubly robust regression imputation estimator by imposing the internal bias calibration condition in estimating the regression parameters. In addition, we generalize the information projection to allow for outlier-robust estimation. Some asymptotic properties are presented. The simulation study confirms that the proposed method allows robust inference against not only the violation of various model assumptions, but also outliers. A real-life application is presented using data from the Conservation Effects Assessment Project."}, "https://arxiv.org/abs/2307.00567": {"title": "A Note on Ising Network Analysis with Missing Data", "link": "https://arxiv.org/abs/2307.00567", "description": "arXiv:2307.00567v2 Announce Type: replace \nAbstract: The Ising model has become a popular psychometric model for analyzing item response data. The statistical inference of the Ising model is typically carried out via a pseudo-likelihood, as the standard likelihood approach suffers from a high computational cost when there are many variables (i.e., items). Unfortunately, the presence of missing values can hinder the use of pseudo-likelihood, and a listwise deletion approach for missing data treatment may introduce a substantial bias into the estimation and sometimes yield misleading interpretations. This paper proposes a conditional Bayesian framework for Ising network analysis with missing data, which integrates a pseudo-likelihood approach with iterative data imputation. An asymptotic theory is established for the method. Furthermore, a computationally efficient {P{\\'o}lya}-Gamma data augmentation procedure is proposed to streamline the sampling of model parameters. 
The method's performance is shown through simulations and a real-world application to data on major depressive and generalized anxiety disorders from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC)."}, "https://arxiv.org/abs/2307.09713": {"title": "Non-parametric inference on calibration of predicted risks", "link": "https://arxiv.org/abs/2307.09713", "description": "arXiv:2307.09713v3 Announce Type: replace \nAbstract: Moderate calibration, the expected event probability among observations with predicted probability z being equal to z, is a desired property of risk prediction models. Current graphical and numerical techniques for evaluating moderate calibration of risk prediction models are mostly based on smoothing or grouping the data. As well, there is no widely accepted inferential method for the null hypothesis that a model is moderately calibrated. In this work, we discuss recently-developed, and propose novel, methods for the assessment of moderate calibration for binary responses. The methods are based on the limiting distributions of functions of standardized partial sums of prediction errors converging to the corresponding laws of Brownian motion. The novel method relies on well-known properties of the Brownian bridge which enables joint inference on mean and moderate calibration, leading to a unified \"bridge\" test for detecting miscalibration. Simulation studies indicate that the bridge test is more powerful, often substantially, than the alternative test. As a case study we consider a prediction model for short-term mortality after a heart attack, where we provide suggestions on graphical presentation and the interpretation of results. Moderate calibration can be assessed without requiring arbitrary grouping of data or using methods that require tuning of parameters. An accompanying R package implements this method (see https://github.com/resplab/cumulcalib/)."}, "https://arxiv.org/abs/2308.11138": {"title": "NLP-based detection of systematic anomalies among the narratives of consumer complaints", "link": "https://arxiv.org/abs/2308.11138", "description": "arXiv:2308.11138v3 Announce Type: replace \nAbstract: We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau."}, "https://arxiv.org/abs/2309.10978": {"title": "Negative Spillover: A Potential Source of Bias in Pragmatic Clinical Trials", "link": "https://arxiv.org/abs/2309.10978", "description": "arXiv:2309.10978v4 Announce Type: replace \nAbstract: Pragmatic clinical trials evaluate the effectiveness of health interventions in real-world settings. Negative spillover can arise in a pragmatic trial if the study intervention affects how scarce resources are allocated between patients in the intervention and comparison groups. 
This can harm patients assigned to the control group and lead to overestimation of the treatment effect. While this type of negative spillover is often addressed in trials of social welfare and public health interventions, there is little recognition of this source of bias in the medical literature. In this article, I examine what causes negative spillover and how it may have led clinical trial investigators to overestimate the effect of patient navigation, AI-based physiological alarms, and elective induction of labor. I also suggest ways to detect negative spillover and design trials that avoid this potential source of bias."}, "https://arxiv.org/abs/2310.11471": {"title": "Modeling lower-truncated and right-censored insurance claims with an extension of the MBBEFD class", "link": "https://arxiv.org/abs/2310.11471", "description": "arXiv:2310.11471v2 Announce Type: replace \nAbstract: In general insurance, claims are often lower-truncated and right-censored because insurance contracts may involve deductibles and maximal covers. Most classical statistical models are not (directly) suited to model lower-truncated and right-censored claims. A surprisingly flexible family of distributions that can cope with lower-truncated and right-censored claims is the class of MBBEFD distributions, which was originally introduced by Bernegger (1997) for reinsurance pricing but has not gained much attention outside the reinsurance literature. Interestingly, in general insurance, we mainly rely on unimodal skewed densities, whereas the reinsurance literature typically proposes monotonically decreasing densities within the MBBEFD class. We show that this class contains both types of densities, and we extend it to a bigger family of distribution functions suitable for modeling lower-truncated and right-censored claims. In addition, we discuss how changes in the deductible or the maximal cover affect the chosen distributions."}, "https://arxiv.org/abs/2310.13580": {"title": "Bayesian Hierarchical Modeling for Bivariate Multiscale Spatial Data with Application to Blood Test Monitoring", "link": "https://arxiv.org/abs/2310.13580", "description": "arXiv:2310.13580v2 Announce Type: replace \nAbstract: In public health applications, spatial data collected are often recorded at different spatial scales and over different correlated variables. Spatial change of support is a key inferential problem in these applications and has become standard in univariate settings; however, it is less standard in multivariate settings. There are several existing multivariate spatial models that can be easily combined with a multiscale spatial approach to analyze multivariate multiscale spatial data. In this paper, we propose three new models from such combinations for bivariate multiscale spatial data in a Bayesian context. In particular, we extend spatial random effects models, multivariate conditional autoregressive models, and ordered hierarchical models through a multiscale spatial approach. We run simulation studies for the three models and compare them in terms of prediction performance and computational efficiency. 
We motivate our models through an analysis of 2015 Texas annual average percentage receiving two blood tests from the Dartmouth Atlas Project."}, "https://arxiv.org/abs/2310.16502": {"title": "Assessing the overall and partial causal well-specification of nonlinear additive noise models", "link": "https://arxiv.org/abs/2310.16502", "description": "arXiv:2310.16502v3 Announce Type: replace \nAbstract: We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution. We then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data."}, "https://arxiv.org/abs/2401.00624": {"title": "Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures", "link": "https://arxiv.org/abs/2401.00624", "description": "arXiv:2401.00624v2 Announce Type: replace \nAbstract: Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about \"non-zero loadings\" and by the lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies \"non-zero loadings\" by learning the network structure of the large covariance matrix of observed variables, and then offers closed-form estimators for factor loadings, factor scores, covariances between common factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about \"non-zero loadings\". Through an extensive simulation analysis benchmarking against standard packages, SCFA exhibits superior performance in estimating model parameters with a much reduced computational time. We illustrate its practical application through factor analysis on a high-dimensional RNA-seq gene expression dataset."}, "https://arxiv.org/abs/2306.10594": {"title": "A nonparametric test for elliptical distribution based on kernel embedding of probabilities", "link": "https://arxiv.org/abs/2306.10594", "description": "arXiv:2306.10594v2 Announce Type: replace-cross \nAbstract: Elliptical distribution is a basic assumption underlying many multivariate statistical methods. For example, in sufficient dimension reduction and statistical graphical models, this assumption is routinely imposed to simplify the data dependence structure. Before applying such methods, we need to decide whether the data are elliptically distributed. Currently existing tests either focus exclusively on spherical distributions, or rely on bootstrap to determine the null distribution, or require specific forms of the alternative distribution. 
In this paper, we introduce a general nonparametric test for elliptical distribution based on kernel embedding of the probability measure that embodies the two properties that characterize an elliptical distribution: namely, after centering and rescaling, (1) the direction and length of the random vector are independent, and (2) the directional vector is uniformly distributed on the unit sphere. We derive the asymptotic distributions of the test statistic via von-Mises expansion, develop the sample-level procedure to determine the rejection region, and establish the consistency and validity of the proposed test. We also develop the concentration bounds of the test statistic, allowing the dimension to grow with the sample size, and further establish the consistency in this high-dimension setting. We compare our method with several existing methods via simulation studies, and apply our test to a SENIC dataset with and without a transformation aimed to achieve ellipticity."}, "https://arxiv.org/abs/2402.07868": {"title": "Nesting Particle Filters for Experimental Design in Dynamical Systems", "link": "https://arxiv.org/abs/2402.07868", "description": "arXiv:2402.07868v2 Announce Type: replace-cross \nAbstract: In this paper, we propose a novel approach to Bayesian experimental design for non-exchangeable data that formulates it as risk-sensitive policy optimization. We develop the Inside-Out SMC$^2$ algorithm, a nested sequential Monte Carlo technique to infer optimal designs, and embed it into a particle Markov chain Monte Carlo framework to perform gradient-based policy amortization. Our approach is distinct from other amortized experimental design techniques, as it does not rely on contrastive estimators. Numerical validation on a set of dynamical systems showcases the efficacy of our method in comparison to other state-of-the-art strategies."}, "https://arxiv.org/abs/2403.18951": {"title": "Robust estimations from distribution structures: V", "link": "https://arxiv.org/abs/2403.18951", "description": "arXiv:2403.18951v1 Announce Type: new \nAbstract: Due to the complexity of order statistics, the finite sample behaviour of robust statistics is generally not analytically solvable. While the Monte Carlo method can provide approximate solutions, its convergence rate is typically very slow, making the computational cost to achieve the desired accuracy unaffordable for ordinary users. In this paper, we propose an approach analogous to the Fourier transformation to decompose the finite sample structure of the uniform distribution. By obtaining sets of sequences that are consistent with parametric distributions for the first four sample moments, we can approximate the finite sample behavior of other estimators with significantly reduced computational costs. This article reveals the underlying structure of randomness and presents a novel approach to integrate multiple assumptions."}, "https://arxiv.org/abs/2403.19018": {"title": "Efficient global estimation of conditional-value-at-risk through stochastic kriging and extreme value theory", "link": "https://arxiv.org/abs/2403.19018", "description": "arXiv:2403.19018v1 Announce Type: new \nAbstract: We consider the problem of evaluating risk for a system that is modeled by a complex stochastic simulation with many possible input parameter values. 
Two sources of computational burden can be identified: the effort associated with extensive simulation runs required to accurately represent the tail of the loss distribution for each set of parameter values, and the computational cost of evaluating multiple candidate parameter values. The former concern can be addressed by using Extreme Value Theory (EVT) estimations, which specifically concentrate on the tails. Meta-modeling approaches are often used to tackle the latter concern. In this paper, we propose a framework for constructing a particular meta-model, stochastic kriging, based on EVT estimation for a class of coherent measures of risk. The proposed approach requires an efficient estimator of the intrinsic variance, and so we derive an EVT-based expression for it. It then allows us to avoid multiple replications of the risk measure at each design point, which was required in similar previously proposed approaches, resulting in a substantial reduction in computational effort. We then perform a case study, outlining promising use cases and conditions under which the EVT-based approach outperforms simpler empirical estimators."}, "https://arxiv.org/abs/2403.19192": {"title": "Imputing missing not-at-random longitudinal marker values in time-to-event analysis: fully conditional specification multiple imputation in joint modeling", "link": "https://arxiv.org/abs/2403.19192", "description": "arXiv:2403.19192v1 Announce Type: new \nAbstract: We propose a procedure for imputing missing values of time-dependent covariates in a survival model using fully conditional specification. Specifically, we focus on imputing missing values of a longitudinal marker in joint modeling of the marker and time-to-event data, but the procedure can be easily applied to a time-varying covariate survival model as well. First, missing marker values are imputed via fully conditional specification multiple imputation, and then joint modeling is applied for estimating the association between the marker and the event. This procedure is recommended since, in joint modeling, marker measurements that are missing not-at-random can lead to bias (e.g. when patients with higher marker values tend to miss visits). Specifically, in cohort studies such a bias can occur since patients for whom all marker measurements during follow-up are missing are excluded from the analysis. Our procedure makes it possible to include these patients by imputing their missing values using a modified version of fully conditional specification multiple imputation. The imputation model includes a special indicator for the subgroup with missing marker values during follow-up, and can be easily implemented in various software packages (R, SAS, Stata, etc.). Using simulations, we show that the proposed procedure performs better than standard joint modeling in the missing not-at-random scenario with respect to bias, coverage and Type I error rate of the test, and as well as standard joint modeling in the completely missing at random scenario. Finally, we apply the procedure to real data on glucose control and cancer in diabetic patients."}, "https://arxiv.org/abs/2403.19363": {"title": "Dynamic Correlation of Market Connectivity, Risk Spillover and Abnormal Volatility in Stock Price", "link": "https://arxiv.org/abs/2403.19363", "description": "arXiv:2403.19363v1 Announce Type: new \nAbstract: The connectivity of stock markets reflects the information efficiency of capital markets and contributes to interior risk contagion and spillover effects. 
We compare Shanghai Stock Exchange A-shares (SSE A-shares) during tranquil periods with crisis periods, namely the subprime mortgage crisis and the high-leverage period in 2015. We use Pearson correlations of returns, the maximum strongly connected subgraph, and the $3\\sigma$ principle to iteratively determine the threshold value for building a dynamic correlation network of SSE A-shares. Analyses are carried out based on the networking structure, intra-sector connectivity, and node status, leading to several findings. First, compared with tranquil periods, the SSE A-shares network experiences a more significant small-world and connective effect during the subprime mortgage crisis and the high-leverage period in 2015. Second, the finance, energy and utilities sectors have stronger intra-industry connectivity than other sectors. Third, HUB nodes drive the growth of the SSE A-shares market during bull periods, while stocks have a thick-tailed degree distribution in bear periods and show distinct characteristics in terms of market value and finance. Granger linear and non-linear causality networks are also considered for comparison purposes. Studies on the evolution of inter-cycle connectivity in the SSE A-share market may help investors improve portfolios and develop more robust risk management policies."}, "https://arxiv.org/abs/2403.19439": {"title": "Dynamic Analyses of Contagion Risk and Module Evolution on the SSE A-Shares Market Based on Minimum Information Entropy", "link": "https://arxiv.org/abs/2403.19439", "description": "arXiv:2403.19439v1 Announce Type: new \nAbstract: The interactive effect is significant in the Chinese stock market, exacerbating the abnormal market volatilities and risk contagion. Based on daily stock returns in the Shanghai Stock Exchange (SSE) A-shares, this paper divides the period between 2005 and 2018 into eight bull and bear market stages to investigate interactive patterns in the Chinese financial market. We employ the LASSO method to construct the stock network and further use the Map Equation method to analyze the evolution of modules in the SSE A-shares market. Empirical results show: (1) The connected effect is more significant in bear markets than in bull markets; (2) A system module can be found in the network during the first four stages, and the industry aggregation effect leads to module differentiation in the last four stages; (3) Some stocks have leading effects on others throughout the eight periods, and medium- and small-cap stocks with poor financial conditions are more likely to become risk sources, especially in bear markets. Our conclusions are beneficial to improving investment strategies and making regulatory policies."}, "https://arxiv.org/abs/2403.19504": {"title": "Overlap violations in external validity", "link": "https://arxiv.org/abs/2403.19504", "description": "arXiv:2403.19504v1 Announce Type: new \nAbstract: Estimating externally valid causal effects is a foundational problem in the social and biomedical sciences. Generalizing or transporting causal estimates from an experimental sample to a target population of interest relies on an overlap assumption between the experimental sample and the target population--i.e., all units in the target population must have a non-zero probability of being included in the experiment. In practice, having full overlap between an experimental sample and a target population can be implausible. 
In the following paper, we introduce a framework for considering external validity in the presence of overlap violations. We introduce a novel bias decomposition that parameterizes the bias from an overlap violation into two components: (1) the proportion of units omitted, and (2) the degree to which omitting the units moderates the treatment effect. The bias decomposition offers an intuitive and straightforward approach to conducting sensitivity analysis to assess robustness to overlap violations. Furthermore, we introduce a suite of sensitivity tools in the form of summary measures and benchmarking, which help researchers consider the plausibility of the overlap violations. We apply the proposed framework to an experiment evaluating the impact of a cash transfer program in Northern Uganda."}, "https://arxiv.org/abs/2403.19515": {"title": "On Bootstrapping Lasso in Generalized Linear Models and the Cross Validation", "link": "https://arxiv.org/abs/2403.19515", "description": "arXiv:2403.19515v1 Announce Type: new \nAbstract: Generalized linear models (GLMs) constitute an important class of models that generalize ordinary linear regression by connecting the response variable with the covariates through arbitrary link functions. On the other hand, Lasso is a popular and easy-to-implement penalization method in regression when not all the covariates are relevant. However, Lasso generally has a non-tractable asymptotic distribution, and hence the development of an alternative method of distributional approximation is required for the purpose of statistical inference. In this paper, we develop a Bootstrap method which works as an approximation of the distribution of the Lasso estimator for all the sub-models of GLM. To connect the distributional approximation theory based on the proposed Bootstrap method with the practical implementation of Lasso, we explore the asymptotic properties of the K-fold cross-validation-based penalty parameter. The results established essentially justify drawing valid statistical inference regarding the unknown parameters based on the proposed Bootstrap method for any sub-model of GLM after selecting the penalty parameter using K-fold cross-validation. Good finite sample properties are also shown through a moderately large simulation study. The method is also implemented on a real data set."}, "https://arxiv.org/abs/2403.19605": {"title": "Data-Adaptive Tradeoffs among Multiple Risks in Distribution-Free Prediction", "link": "https://arxiv.org/abs/2403.19605", "description": "arXiv:2403.19605v1 Announce Type: new \nAbstract: Decision-making pipelines are generally characterized by tradeoffs among various risk functions. It is often desirable to manage such tradeoffs in a data-adaptive manner. As we demonstrate, if this is done naively, state-of-the-art uncertainty quantification methods can lead to significant violations of putative risk guarantees.\n To address this issue, we develop methods that permit valid control of risk when threshold and tradeoff parameters are chosen adaptively. 
Our methodology supports monotone and nearly-monotone risks, but otherwise makes no distributional assumptions.\n To illustrate the benefits of our approach, we carry out numerical experiments on synthetic data and the large-scale vision dataset MS-COCO."}, "https://arxiv.org/abs/2403.19606": {"title": "Positivity violations in marginal structural survival models with time-dependent confounding: a simulation study on IPTW-estimator performance", "link": "https://arxiv.org/abs/2403.19606", "description": "arXiv:2403.19606v1 Announce Type: new \nAbstract: In longitudinal observational studies, marginal structural models (MSMs) are a class of causal models used to analyze the effect of an exposure on the (survival) outcome of interest while accounting for exposure-affected time-dependent confounding. In the applied literature, inverse probability of treatment weighting (IPTW) has been widely adopted to estimate MSMs. An essential assumption for IPTW-based MSMs is the positivity assumption, which ensures that each individual in the population has a non-zero probability of receiving each exposure level within confounder strata. Positivity, along with consistency, conditional exchangeability, and correct specification of the weighting model, is crucial for valid causal inference through IPTW-based MSMs but is often overlooked compared to confounding bias. Positivity violations can arise from subjects having a zero probability of being exposed/unexposed (strict violations) or near-zero probabilities due to sampling variability (near violations). This article discusses the effect of violations in the positivity assumption on the estimates from IPTW-based MSMs. Building on the algorithms for simulating longitudinal survival data from MSMs by Havercroft and Didelez (2012) and Keogh et al. (2021), systematic simulations under strict/near positivity violations are performed. Various scenarios are explored by varying (i) the size of the confounder interval in which positivity violations arise, (ii) the sample size, (iii) the weight truncation strategy, and (iv) the subject's propensity to follow the protocol violation rule. This study underscores the importance of assessing positivity violations in IPTW-based MSMs to ensure robust and reliable causal inference in survival analyses."}, "https://arxiv.org/abs/2403.19289": {"title": "Graph Neural Networks for Treatment Effect Prediction", "link": "https://arxiv.org/abs/2403.19289", "description": "arXiv:2403.19289v1 Announce Type: cross \nAbstract: Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. 
The framework is flexible since each step can be used separately with other models or policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random underlining the need for models that can generalize with limited labeled samples to reduce experimental risks."}, "https://arxiv.org/abs/2403.19329": {"title": "Simulating Relational Event Histories -- Why and How", "link": "https://arxiv.org/abs/2403.19329", "description": "arXiv:2403.19329v1 Announce Type: cross \nAbstract: Many important social phenomena result from repeated interactions among individuals over time such as email exchanges in an organization, or face-to-face interactions in a classroom. Insights into the mechanisms underlying the dynamics of these interactions can be achieved through simulations of networks on a fine temporal granularity. In this paper, we present statistical frameworks to simulate relational event networks under dyadic and actor-oriented relational event models. These simulators have a broad applicability in temporal social network research such as model fit assessment, theory building, network intervention planning, making predictions, understanding the impact of network structures, to name a few. We show this in three extensive applications. First, it is shown why simulation-based techniques are crucial for relational event model assessment, for example to investigate how past events affect future interactions in the network. Second, we demonstrate how simulation techniques contribute to a better understanding of the longevity of network interventions. Third, we show how simulation techniques are important when building and extending theories about social phenomena such as understanding social identity dynamics using optimal distinctiveness theory."}, "https://arxiv.org/abs/2201.12045": {"title": "A loss discounting framework for model averaging and selection in time series models", "link": "https://arxiv.org/abs/2201.12045", "description": "arXiv:2201.12045v4 Announce Type: replace \nAbstract: We introduce a Loss Discounting Framework for model and forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. We use a loss function to score the performance of different models and introduce a multilevel discounting scheme which allows a flexible specification of the dynamics of the model weights. This novel and simple model combination approach can be easily applied to large scale model averaging/selection, can handle unusual features such as sudden regime changes, and can be tailored to different forecasting problems. We compare our method to both established methodologies and state of the art methods for a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques."}, "https://arxiv.org/abs/2203.11873": {"title": "Nonstationary Spatial Process Models with Spatially Varying Covariance Kernels", "link": "https://arxiv.org/abs/2203.11873", "description": "arXiv:2203.11873v2 Announce Type: replace \nAbstract: Spatial process models for capturing nonstationary behavior in scientific data present several challenges with regard to statistical inference and uncertainty quantification. 
While nonstationary spatially-varying kernels are attractive for their flexibility and richness, their practical implementation has been reported to be overwhelmingly cumbersome because of the high-dimensional parameter spaces resulting from the spatially varying process parameters. Matters are considerably exacerbated with the massive numbers of spatial locations over which measurements are available. With limited theoretical tractability offered by nonstationary spatial processes, overcoming such computational bottlenecks requires a synergy between model construction and algorithm development. We build a class of scalable nonstationary spatial process models using spatially varying covariance kernels. We present some novel consequences of such representations that befit computationally efficient implementation. More specifically, we operate within a coherent Bayesian modeling framework to achieve full uncertainty quantification using Hybrid Monte Carlo with nested interweaving. We carry out experiments on synthetic data sets to explore model selection and parameter identifiability and assess inferential improvements accrued from the nonstationary modeling. We illustrate strengths and pitfalls with a data set on the remotely sensed normalized difference vegetation index, with further analysis of a lead contamination data set in the Supplement."}, "https://arxiv.org/abs/2209.06704": {"title": "Causal chain event graphs for remedial maintenance", "link": "https://arxiv.org/abs/2209.06704", "description": "arXiv:2209.06704v2 Announce Type: replace \nAbstract: The analysis of system reliability has often benefited from graphical tools such as fault trees and Bayesian networks. In this article, instead of conventional graphical tools, we apply a probabilistic graphical model called the chain event graph (CEG) to represent the failures and processes of deterioration of a system. The CEG is derived from an event tree and can flexibly represent the unfolding of asymmetric processes. For this application we need to define a new class of formal intervention, which we call remedial, to model the causal effects of remedial maintenance. This fixes the root causes of a failure and returns the status of the system to as good as new. We demonstrate that the semantics of the CEG are rich enough to express this novel type of intervention. Furthermore, through the bespoke causal algebras, the CEG provides a transparent framework with which to guide and express the rationale behind predictive inferences about the effects of various different types of remedial intervention. A back-door theorem is adapted to apply to these interventions to help discover when a system is only partially observed."}, "https://arxiv.org/abs/2305.17744": {"title": "Heterogeneous Matrix Factorization: When Features Differ by Datasets", "link": "https://arxiv.org/abs/2305.17744", "description": "arXiv:2305.17744v2 Announce Type: replace \nAbstract: In myriad statistical applications, data are collected from related but heterogeneous sources. These sources share some commonalities while containing idiosyncratic characteristics. One of the most fundamental challenges in such scenarios is to recover the shared and source-specific factors. Despite the existence of a few heuristic approaches, a generic algorithm with theoretical guarantees has yet to be established. In this paper, we tackle the problem by proposing a method called Heterogeneous Matrix Factorization to separate the shared and unique factors for a class of problems. 
HMF maintains the orthogonality between the shared and unique factors by leveraging an invariance property in the objective. The algorithm is easy to implement and intrinsically distributed. On the theoretical side, we show that for the squared error loss, HMF will converge to the optimal solutions, which are close to the ground truth. HMF can be integrated with auto-encoders to learn nonlinear feature mappings. Through a variety of case studies, we showcase HMF's benefits and applicability in video segmentation, time-series feature extraction, and recommender systems."}, "https://arxiv.org/abs/2308.06820": {"title": "Divisive Hierarchical Clustering of Variables Identified by Singular Vectors", "link": "https://arxiv.org/abs/2308.06820", "description": "arXiv:2308.06820v3 Announce Type: replace \nAbstract: In this work, we present a novel method for divisive hierarchical variable clustering. A cluster is a group of elements that exhibit higher similarity among themselves than to elements outside this cluster. The correlation coefficient serves as a natural measure to assess the similarity of variables. This means that in a correlation matrix, a cluster is represented by a block of variables with greater internal than external correlation. Our approach provides a nonparametric solution to identify such block structures in the correlation matrix using singular vectors of the underlying data matrix. When divisively clustering $p$ variables, there are $2^{p-1}$ possible splits. Using the singular vectors for cluster identification, we can effectively reduce this number to at most $p(p-1)$, thereby making it computationally efficient. We elaborate on the methodology and outline the incorporation of dissimilarity measures and linkage functions to assess distances between clusters. Additionally, we demonstrate that these distances are ultrametric, ensuring that the resulting hierarchical cluster structure can be uniquely represented by a dendrogram, with the heights of the dendrogram being interpretable. To validate the efficiency of our method, we perform simulation studies and analyze real-world data on personality traits and cognitive abilities. Supplementary materials for this article can be accessed online."}, "https://arxiv.org/abs/2309.01721": {"title": "Direct and Indirect Treatment Effects in the Presence of Semi-Competing Risks", "link": "https://arxiv.org/abs/2309.01721", "description": "arXiv:2309.01721v4 Announce Type: replace \nAbstract: Semi-competing risks refer to the phenomenon that the terminal event (such as death) can censor the non-terminal event (such as disease progression) but not vice versa. The treatment effect on the terminal event can be delivered either directly following the treatment or indirectly through the non-terminal event. We consider two strategies to decompose the total effect into a direct effect and an indirect effect under the framework of mediation analysis in completely randomized experiments by adjusting the prevalence and hazard of non-terminal events, respectively. They require slightly different assumptions on cross-world quantities to achieve identifiability. We establish asymptotic properties for the estimated counterfactual cumulative incidences and decomposed treatment effects. 
We illustrate the subtle difference between these two decompositions through simulation studies and two real-data applications."}, "https://arxiv.org/abs/2401.15680": {"title": "How to achieve model-robust inference in stepped wedge trials with model-based methods?", "link": "https://arxiv.org/abs/2401.15680", "description": "arXiv:2401.15680v2 Announce Type: replace \nAbstract: A stepped wedge design is a unidirectional crossover design where clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs -- via linear mixed models or generalized estimating equations -- is standard practice to evaluate treatment effects accounting for clustering and adjusting for baseline covariates, their properties under misspecification have not been systematically explored. In this article, we study when a potentially misspecified multilevel model can offer consistent estimation for treatment effect estimands that are functions of calendar time and/or exposure time. We define nonparametric treatment effect estimands using potential outcomes, and adapt model-based methods via g-computation to achieve estimand-aligned inference. We prove a central result that, as long as the working model includes a correctly specified treatment effect structure, the g-computation is guaranteed to be consistent even if all remaining model components are arbitrarily misspecified. Furthermore, valid inference is obtained via the sandwich variance estimator. The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge trial."}, "https://arxiv.org/abs/2205.05955": {"title": "Bayesian inference for stochastic oscillatory systems using the phase-corrected Linear Noise Approximation", "link": "https://arxiv.org/abs/2205.05955", "description": "arXiv:2205.05955v3 Announce Type: replace-cross \nAbstract: Likelihood-based inference in stochastic non-linear dynamical systems, such as those found in chemical reaction networks and biological clock systems, is inherently complex and has largely been limited to small and unrealistically simple systems. Recent advances in analytically tractable approximations to the underlying conditional probability distributions enable long-term dynamics to be accurately modelled, and make the large number of model evaluations required for exact Bayesian inference much more feasible. We propose a new methodology for inference in stochastic non-linear dynamical systems exhibiting oscillatory behaviour and show the parameters in these models can be realistically estimated from simulated data. Preliminary analyses based on the Fisher Information Matrix of the model can guide the implementation of Bayesian inference. We show that this parameter sensitivity analysis can predict which parameters are practically identifiable. Several Markov chain Monte Carlo algorithms are compared, with our results suggesting a parallel tempering algorithm consistently gives the best approach for these systems, which are shown to frequently exhibit multi-modal posterior distributions."}, "https://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "https://arxiv.org/abs/2310.06673", "description": "arXiv:2310.06673v2 Announce Type: replace-cross \nAbstract: An assurance calculation is a Bayesian alternative to a power calculation. 
One may be performed to aid the planning of a clinical trial, specifically setting the sample size or to support decisions about whether or not to perform a study. Immuno-oncology is a rapidly evolving area in the development of anticancer drugs. A common phenomenon that arises in trials of such drugs is one of delayed treatment effects, that is, there is a delay in the separation of the survival curves. To calculate assurance for a trial in which a delayed treatment effect is likely to be present, uncertainty about key parameters needs to be considered. If uncertainty is not considered, the number of patients recruited may not be enough to ensure we have adequate statistical power to detect a clinically relevant treatment effect and the risk of an unsuccessful trial is increased. We present a new elicitation technique for when a delayed treatment effect is likely and show how to compute assurance using these elicited prior distributions. We provide an example to illustrate how this can be used in practice and develop open-source software to implement our methods. Our methodology has the potential to improve the success rate and efficiency of Phase III trials in immuno-oncology and for other treatments where a delayed treatment effect is expected to occur."}, "https://arxiv.org/abs/2312.14810": {"title": "Accurate, scalable, and efficient Bayesian Optimal Experimental Design with derivative-informed neural operators", "link": "https://arxiv.org/abs/2312.14810", "description": "arXiv:2312.14810v2 Announce Type: replace-cross \nAbstract: We consider optimal experimental design (OED) problems in selecting the most informative observation sensors to estimate model parameters in a Bayesian framework. Such problems are computationally prohibitive when the parameter-to-observable (PtO) map is expensive to evaluate, the parameters are high-dimensional, and the optimization for sensor selection is combinatorial and high-dimensional. To address these challenges, we develop an accurate, scalable, and efficient computational framework based on derivative-informed neural operators (DINOs). The derivative of the PtO map is essential for accurate evaluation of the optimality criteria of OED in our consideration. We take the key advantage of DINOs, a class of neural operators trained with derivative information, to achieve high approximate accuracy of not only the PtO map but also, more importantly, its derivative. Moreover, we develop scalable and efficient computation of the optimality criteria based on DINOs and propose a modified swapping greedy algorithm for its optimization. We demonstrate that the proposed method is scalable to preserve the accuracy for increasing parameter dimensions and achieves high computational efficiency, with an over 1000x speedup accounting for both offline construction and online evaluation costs, compared to high-fidelity Bayesian OED solutions for a three-dimensional nonlinear convection-diffusion-reaction example with tens of thousands of parameters."}, "https://arxiv.org/abs/2403.19752": {"title": "Deep Learning Framework with Uncertainty Quantification for Survey Data: Assessing and Predicting Diabetes Mellitus Risk in the American Population", "link": "https://arxiv.org/abs/2403.19752", "description": "arXiv:2403.19752v1 Announce Type: new \nAbstract: Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential. 
This approach is key to minimizing potential selective biases in results. The objectives of this paper are: (i) To propose a general predictive framework for regression and classification using neural network (NN) modeling, which incorporates survey weights into the estimation process; (ii) To introduce an uncertainty quantification algorithm for model prediction, tailored for data from complex survey designs; (iii) To apply this method in developing robust risk score models to assess the risk of Diabetes Mellitus in the US population, utilizing data from the NHANES 2011-2014 cohort. The theoretical properties of our estimators are designed to ensure minimal bias and statistical consistency, thereby ensuring that our models yield reliable predictions and contribute novel scientific insights in diabetes research. While focused on diabetes, this NN predictive framework is adaptable to create clinical models for a diverse range of diseases and medical cohorts. The software and the data used in this paper are publicly available on GitHub."}, "https://arxiv.org/abs/2403.19757": {"title": "Nonparametric conditional risk mapping under heteroscedasticity", "link": "https://arxiv.org/abs/2403.19757", "description": "arXiv:2403.19757v1 Announce Type: new \nAbstract: A nonparametric procedure to estimate the conditional probability that a nonstationary geostatistical process exceeds a certain threshold value is proposed. The method consists of a bootstrap algorithm that combines conditional simulation techniques with nonparametric estimations of the trend and the variability. The nonparametric local linear estimator, considering a bandwidth matrix selected by a method that takes the spatial dependence into account, is used to estimate the trend. The variability is modeled by estimating the conditional variance and the variogram from corrected residuals to avoid biases. The proposed method allows estimates of the conditional exceedance risk to be obtained at non-observed spatial locations. The performance of the approach is analyzed by simulation and illustrated with an application to a real data set of precipitation in the U.S."}, "https://arxiv.org/abs/2403.19807": {"title": "Protocols for Observational Studies: Methods and Open Problems", "link": "https://arxiv.org/abs/2403.19807", "description": "arXiv:2403.19807v1 Announce Type: new \nAbstract: For learning about the causal effect of a treatment, a randomized controlled trial (RCT) is considered the gold standard. However, randomizing treatment is sometimes unethical or infeasible, and instead an observational study may be conducted. While some aspects of a well-designed RCT cannot be replicated in an observational study, one aspect that can is to have a protocol with prespecified hypotheses about prespecified outcomes and a prespecified analysis. We illustrate the value of protocols for observational studies in three applications -- the effect of playing high school football on later life mental functioning, the effect of police seizing a gun when arresting a domestic violence suspect on future domestic violence, and the effect of mountaintop mining on health. We then discuss methodologies for observational study protocols. We discuss considerations for protocols that are similar between observational studies and RCTs, and considerations that are different. 
The considerations that are different include (i) whether the protocol should be specified before treatment assignment is known or after; (ii) how multiple outcomes should be incorporated into the planned analysis; and (iii) how subgroups should be incorporated into the planned analysis. We conclude with a discussion of a few open problems in the methodology of observational study protocols."}, "https://arxiv.org/abs/2403.19818": {"title": "Testing for common structures in high-dimensional factor models", "link": "https://arxiv.org/abs/2403.19818", "description": "arXiv:2403.19818v1 Announce Type: new \nAbstract: This work proposes a novel procedure to test for common structures across two high-dimensional factor models. The introduced test allows one to uncover whether two factor models are driven by the same loading matrix up to some linear transformation. The test can be used to discover inter-individual relationships between two data sets. In addition, it can be applied to test for structural changes over time in the loading matrix of an individual factor model. The test aims to reduce the set of possible alternatives in a classical change-point setting. The theoretical results establish the asymptotic behavior of the introduced test statistic. The theory is supported by a simulation study showing promising results in empirical test size and power. A data application investigates changes in the loadings when modeling the celebrated US macroeconomic data set of Stock and Watson."}, "https://arxiv.org/abs/2403.19835": {"title": "Constrained least squares simplicial-simplicial regression", "link": "https://arxiv.org/abs/2403.19835", "description": "arXiv:2403.19835v1 Announce Type: new \nAbstract: Simplicial-simplicial regression refers to the regression setting where both the responses and predictor variables lie within the simplex space, i.e. they are compositional. For this setting, constrained least squares, where the regression coefficients themselves lie within the simplex, is proposed. The model is transformation-free, but the adoption of a power transformation is straightforward; it can treat multiple compositional datasets as predictors and offers the possibility of weights among the simplicial predictors. Among the model's advantages are its ability to treat zeros in a natural way and a highly computationally efficient algorithm to estimate its coefficients. Resampling-based hypothesis testing procedures are employed for inference, such as testing for linear independence and for equality of the regression coefficients to some pre-specified values. The performance of the proposed technique and its comparison to an existing methodology of the same spirit are assessed using simulation studies and real data examples."}, "https://arxiv.org/abs/2403.19842": {"title": "Optimal regimes with limited resources", "link": "https://arxiv.org/abs/2403.19842", "description": "arXiv:2403.19842v1 Announce Type: new \nAbstract: Policy-makers are often faced with the task of distributing a limited supply of resources. To support decision-making in these settings, statisticians are confronted with two challenges: estimands are defined by allocation strategies that are functions of features of all individuals in a cluster; and, relatedly, the observed data are neither independent nor identically distributed when individuals compete for resources. Existing statistical approaches are inadequate because they ignore at least one of these core features. 
As a solution, we develop theory for a general policy class of dynamic regimes for clustered data, covering existing results in classical and interference settings as special cases. We cover policy-relevant estimands and articulate realistic conditions compatible with resource-limited observed data. We derive identification and inference results for settings with a finite number of individuals in a cluster, where the observed dataset is viewed as a single draw from a super-population of clusters. We also consider asymptotic estimands when the number of individuals in a cluster is allowed to grow; under explicit conditions, we recover previous results, thereby clarifying when the use of existing methods is permitted. Our general results lay the foundation for future research on dynamic regimes for clustered data, including the longitudinal cluster setting."}, "https://arxiv.org/abs/2403.19917": {"title": "Doubly robust estimation and inference for a log-concave counterfactual density", "link": "https://arxiv.org/abs/2403.19917", "description": "arXiv:2403.19917v1 Announce Type: new \nAbstract: We consider the problem of causal inference based on observational data (or the related missing data problem) with a binary or discrete treatment variable. In that context we study counterfactual density estimation, which provides more nuanced information than counterfactual mean estimation (i.e., the average treatment effect). We impose the shape-constraint of log-concavity (a unimodality constraint) on the counterfactual densities, and then develop doubly robust estimators of the log-concave counterfactual density (based on an augmented inverse-probability weighted pseudo-outcome), and show the consistency in various global metrics of that estimator. Based on that estimator we also develop asymptotically valid pointwise confidence intervals for the counterfactual density."}, "https://arxiv.org/abs/2403.19994": {"title": "Supervised Bayesian joint graphical model for simultaneous network estimation and subgroup identification", "link": "https://arxiv.org/abs/2403.19994", "description": "arXiv:2403.19994v1 Announce Type: new \nAbstract: Heterogeneity is a fundamental characteristic of cancer. To accommodate heterogeneity, subgroup identification has been extensively studied and broadly categorized into unsupervised and supervised analysis. Compared to unsupervised analysis, supervised approaches potentially hold greater clinical implications. Under the unsupervised analysis framework, several methods focusing on network-based subgroup identification have been developed, offering more comprehensive insights than those restricted to mean, variance, and other simplistic distributions by incorporating the interconnections among variables. However, research on supervised network-based subgroup identification remains limited. In this study, we develop a novel supervised Bayesian graphical model for jointly identifying multiple heterogeneous networks and subgroups. In the proposed model, heterogeneity is not only reflected in molecular data but also associated with a clinical outcome, and a novel similarity prior is introduced to effectively accommodate similarities among the networks of different subgroups, significantly facilitating clinically meaningful biological network construction and subgroup identification. The consistency properties of the estimates are rigorously established, and an efficient algorithm is developed. 
Extensive simulation studies and a real-world application to TCGA data are conducted, which demonstrate the advantages of the proposed approach in terms of both subgroup and network identification."}, "https://arxiv.org/abs/2403.20007": {"title": "Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization", "link": "https://arxiv.org/abs/2403.20007", "description": "arXiv:2403.20007v1 Announce Type: new \nAbstract: The selection of the best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal components analysis and partial least squares. Both approaches are popular linear dimension-reduction methods with numerous applications in several fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components, new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from the most relevant variables, we propose to cast the best subset solution path method into principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for the best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real datasets. The first dataset is analyzed using principal component analysis, while the analysis of the second dataset is based on the partial least squares framework."}, "https://arxiv.org/abs/2403.20182": {"title": "Quantifying Uncertainty: All We Need is the Bootstrap?", "link": "https://arxiv.org/abs/2403.20182", "description": "arXiv:2403.20182v1 Announce Type: new \nAbstract: Standard errors, confidence intervals, hypothesis tests, and other quantifications of uncertainty are essential to statistical practice. However, they feature a plethora of different methods, mathematical formulas, and concepts. Could we not just replace them all with the general and relatively easy-to-understand non-parametric bootstrap? We contribute to answering this question with a review of related work and a simulation study of one- and two-sided confidence intervals over several different sample sizes, confidence levels, data generating processes, and functionals. Results show that the double bootstrap is the overall best method and a viable alternative to typically used approaches in all but the smallest sample sizes."}, "https://arxiv.org/abs/2403.20214": {"title": "Hypergraph adjusted plus-minus", "link": "https://arxiv.org/abs/2403.20214", "description": "arXiv:2403.20214v1 Announce Type: new \nAbstract: In team sports, traditional ranking statistics do not allow for the simultaneous evaluation of both individuals and combinations of players. Metrics for individual player rankings often fail to consider the interaction effects between groups of players, while methods for assessing full lineups cannot be used to identify the value of lower-order combinations of players (pairs, trios, etc.). Given that player and lineup rankings are inherently dependent on each other, these limitations may affect the accuracy of performance evaluations. 
To address this, we propose a novel adjusted box score plus-minus (APM) approach that allows for the simultaneous ranking of individual players, lower-order combinations of players, and full lineups. The method adjusts for the complete dependency structure and is motivated by the connection between APM and the hypergraph representation of a team. We discuss the similarities of our approach to other advanced metrics, demonstrate it using NBA data from 2012-2022, and suggest potential directions for future work."}, "https://arxiv.org/abs/2403.20313": {"title": "Towards a turnkey approach to unbiased Monte Carlo estimation of smooth functions of expectations", "link": "https://arxiv.org/abs/2403.20313", "description": "arXiv:2403.20313v1 Announce Type: new \nAbstract: Given a smooth function $f$, we develop a general approach to turn Monte Carlo samples with expectation $m$ into an unbiased estimate of $f(m)$. Specifically, we develop estimators that are based on randomly truncating the Taylor series expansion of $f$ and estimating the coefficients of the truncated series. We derive their properties and propose a strategy to set their tuning parameters -- which depend on $m$ -- automatically, with a view to making the whole approach simple to use. We develop our methods for the specific functions $f(x)=\\log x$ and $f(x)=1/x$, as they arise in several statistical applications such as maximum likelihood estimation of latent variable models and Bayesian inference for un-normalised models. Detailed numerical studies are performed for a range of applications to determine how competitive and reliable the proposed approach is."}, "https://arxiv.org/abs/2403.20200": {"title": "High-dimensional analysis of ridge regression for non-identically distributed data with a variance profile", "link": "https://arxiv.org/abs/2403.20200", "description": "arXiv:2403.20200v1 Announce Type: cross \nAbstract: High-dimensional linear regression has been thoroughly studied in the context of independent and identically distributed data. We propose to investigate high-dimensional regression models for independent but non-identically distributed data. To this end, we suppose that the set of observed predictors (or features) is a random matrix with a variance profile and with dimensions growing at a proportional rate. Assuming a random effect model, we study the predictive risk of the ridge estimator for linear regression with such a variance profile. In this setting, we provide deterministic equivalents of this risk and of the degrees of freedom of the ridge estimator. For a certain class of variance profiles, our work highlights the emergence of the well-known double descent phenomenon in high-dimensional regression for the minimum norm least-squares estimator when the ridge regularization parameter goes to zero. We also exhibit variance profiles for which the shape of this predictive risk differs from double descent. The proofs of our results are based on tools from random matrix theory in the presence of a variance profile that have not been considered so far to study regression models. Numerical experiments are provided to show the accuracy of the aforementioned deterministic equivalents on the computation of the predictive risk of ridge regression. 
We also investigate the similarities and differences that exist with the standard setting of independent and identically distributed data."}, "https://arxiv.org/abs/2211.06808": {"title": "Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data", "link": "https://arxiv.org/abs/2211.06808", "description": "arXiv:2211.06808v3 Announce Type: replace \nAbstract: Nonstationary and non-Gaussian spatial data are common in various fields, including ecology (e.g., counts of animal species), epidemiology (e.g., disease incidence counts in susceptible regions), and environmental science (e.g., remotely-sensed satellite imagery). Due to modern data collection methods, the size of these datasets has grown considerably. Spatial generalized linear mixed models (SGLMMs) are a flexible class of models used to model nonstationary and non-Gaussian datasets. Despite their utility, SGLMMs can be computationally prohibitive for even moderately large datasets (e.g., 5,000 to 100,000 observed locations). To circumvent this issue, past studies have embedded nested radial basis functions into the SGLMM. However, two crucial specifications (knot placement and bandwidth parameters), which directly affect model performance, are typically fixed prior to model-fitting. We propose a novel approach to model large nonstationary and non-Gaussian spatial datasets using adaptive radial basis functions. Our approach: (1) partitions the spatial domain into subregions; (2) employs reversible-jump Markov chain Monte Carlo (RJMCMC) to infer the number and location of the knots within each partition; and (3) models the latent spatial surface using partition-varying and adaptive basis functions. Through an extensive simulation study, we show that our approach provides more accurate predictions than competing methods while preserving computational efficiency. We demonstrate our approach on two environmental datasets - incidences of plant species and counts of bird species in the United States."}, "https://arxiv.org/abs/2303.05032": {"title": "Sensitivity analysis for principal ignorability violation in estimating complier and noncomplier average causal effects", "link": "https://arxiv.org/abs/2303.05032", "description": "arXiv:2303.05032v4 Announce Type: replace \nAbstract: An important strategy for identifying principal causal effects, which are often used in settings with noncompliance, is to invoke the principal ignorability (PI) assumption. As PI is untestable, it is important to gauge how sensitive effect estimates are to its violation. We focus on this task for the common one-sided noncompliance setting where there are two principal strata, compliers and noncompliers. Under PI, compliers and noncompliers share the same outcome-mean-given-covariates function under the control condition. For sensitivity analysis, we allow this function to differ between compliers and noncompliers in several ways, indexed by an odds ratio, a generalized odds ratio, a mean ratio, or a standardized mean difference sensitivity parameter. We tailor sensitivity analysis techniques (with any sensitivity parameter choice) to several types of PI-based main analysis methods, including outcome regression, influence function (IF) based and weighting methods. We illustrate the proposed sensitivity analyses using several outcome types from the JOBS II study. This application estimates nuisance functions parametrically -- for simplicity and accessibility. 
In addition, we establish rate conditions on nonparametric nuisance estimation for IF-based estimators to be asymptotically normal -- with a view to informing nonparametric inference."}, "https://arxiv.org/abs/2305.17643": {"title": "Flexible sensitivity analysis for causal inference in observational studies subject to unmeasured confounding", "link": "https://arxiv.org/abs/2305.17643", "description": "arXiv:2305.17643v2 Announce Type: replace \nAbstract: Causal inference with observational studies often suffers from unmeasured confounding, yielding biased estimators based on the unconfoundedness assumption. Sensitivity analysis assesses how the causal conclusions change with respect to different degrees of unmeasured confounding. Most existing sensitivity analysis methods work well for specific types of statistical estimation or testing strategies. We propose a flexible sensitivity analysis framework that can deal with commonly used inverse probability weighting, outcome regression, and doubly robust estimators simultaneously. It is based on the well-known parametrization of the selection bias as comparisons of the observed and counterfactual outcomes conditional on observed covariates. It is attractive for practical use because it only requires simple modifications of the standard estimators. Moreover, it naturally extends to many other causal inference settings, including the causal risk ratio or odds ratio, the average causal effect on the treated units, and studies with survival outcomes. We also develop an R package saci to implement our sensitivity analysis estimators."}, "https://arxiv.org/abs/2306.17260": {"title": "Incorporating Auxiliary Variables to Improve the Efficiency of Time-Varying Treatment Effect Estimation", "link": "https://arxiv.org/abs/2306.17260", "description": "arXiv:2306.17260v2 Announce Type: replace \nAbstract: The use of smart devices (e.g., smartphones, smartwatches) and other wearables for context sensing and delivery of digital interventions to improve health outcomes has grown significantly in behavioral and psychiatric studies. Micro-randomized trials (MRTs) are a common experimental design for obtaining data-driven evidence on mobile health (mHealth) intervention effectiveness, where each individual is repeatedly randomized to receive treatments over numerous time points. Individual characteristics and the contexts around randomizations are also collected throughout the study; some may be pre-specified as moderators when assessing time-varying causal effect moderation. Moreover, we have access to abundant measurements beyond just the moderators. Our study aims to leverage this auxiliary information to improve causal estimation and better understand the intervention effect. Similar problems have been raised in randomized controlled trials (RCTs), where extensive literature demonstrates that baseline covariate information can be incorporated to alleviate chance imbalances and increase asymptotic efficiency. However, covariate adjustment in the context of time-varying treatments and repeated measurements, as seen in MRTs, has not been studied. Recognizing the connection to Neyman Orthogonality, we address this gap by introducing an intuitive approach to incorporate auxiliary variables to improve the efficiency of moderated causal excursion effect estimation. 
The efficiency gain of our approach is proved theoretically and demonstrated through simulation studies and an analysis of data from the Intern Health Study (NeCamp et al., 2020)."}, "https://arxiv.org/abs/2310.13240": {"title": "Transparency challenges in policy evaluation with causal machine learning -- improving usability and accountability", "link": "https://arxiv.org/abs/2310.13240", "description": "arXiv:2310.13240v2 Announce Type: replace-cross \nAbstract: Causal machine learning tools are beginning to see use in real-world policy evaluation tasks to flexibly estimate treatment effects. One issue with these methods is that the machine learning models used are generally black boxes, i.e., there is no globally interpretable way to understand how a model makes estimates. This is a clear problem in policy evaluation applications, particularly in government, because it is difficult to understand whether such models are functioning in ways that are fair, based on the correct interpretation of evidence and transparent enough to allow for accountability if things go wrong. However, there has been little discussion of transparency problems in the causal machine learning literature and how these might be overcome. This paper explores why transparency issues are a problem for causal machine learning in public policy evaluation applications and considers ways these problems might be addressed through explainable AI tools and by simplifying models in line with interpretable AI principles. It then applies these ideas to a case-study using a causal forest model to estimate conditional average treatment effects for a hypothetical change in the school leaving age in Australia. It shows that existing tools for understanding black-box predictive models are poorly suited to causal machine learning and that simplifying the model to make it interpretable leads to an unacceptable increase in error (in this application). It concludes that new tools are needed to properly understand causal machine learning models and the algorithms that fit them."}, "https://arxiv.org/abs/2311.04855": {"title": "Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values", "link": "https://arxiv.org/abs/2311.04855", "description": "arXiv:2311.04855v2 Announce Type: replace-cross \nAbstract: Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. 
We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules."}, "https://arxiv.org/abs/2404.00164": {"title": "Sequential Synthetic Difference in Differences", "link": "https://arxiv.org/abs/2404.00164", "description": "arXiv:2404.00164v1 Announce Type: new \nAbstract: We study the estimation of treatment effects of a binary policy in environments with a staggered treatment rollout. We propose a new estimator -- Sequential Synthetic Difference in Difference (Sequential SDiD) -- and establish its theoretical properties in a linear model with interactive fixed effects. Our estimator is based on sequentially applying the original SDiD estimator proposed in Arkhangelsky et al. (2021) to appropriately aggregated data. To establish the theoretical properties of our method, we compare it to an infeasible OLS estimator based on the knowledge of the subspaces spanned by the interactive fixed effects. We show that this OLS estimator has a sequential representation and use this result to show that it is asymptotically equivalent to the Sequential SDiD estimator. This result implies the asymptotic normality of our estimator along with corresponding efficiency guarantees. The method developed in this paper presents a natural alternative to the conventional DiD strategies in staggered adoption designs."}, "https://arxiv.org/abs/2404.00221": {"title": "Robust Learning for Optimal Dynamic Treatment Regimes with Observational Data", "link": "https://arxiv.org/abs/2404.00221", "description": "arXiv:2404.00221v1 Announce Type: new \nAbstract: Many public policies and medical interventions involve dynamics in their treatment assignments, where treatments are sequentially assigned to the same individuals across multiple stages, and the effect of treatment at each stage is usually heterogeneous with respect to the history of prior treatments and associated characteristics. We study statistical learning of optimal dynamic treatment regimes (DTRs) that guide the optimal treatment assignment for each individual at each stage based on the individual's history. We propose a step-wise doubly-robust approach to learn the optimal DTR using observational data under the assumption of sequential ignorability. The approach solves the sequential treatment assignment problem through backward induction, where, at each step, we combine estimators of propensity scores and action-value functions (Q-functions) to construct augmented inverse probability weighting estimators of values of policies for each stage. The approach consistently estimates the optimal DTR if either a propensity score or Q-function for each stage is consistently estimated. Furthermore, the resulting DTR can achieve the optimal convergence rate $n^{-1/2}$ of regret under mild conditions on the convergence rate for estimators of the nuisance parameters."}, "https://arxiv.org/abs/2404.00256": {"title": "Objective Bayesian FDR", "link": "https://arxiv.org/abs/2404.00256", "description": "arXiv:2404.00256v1 Announce Type: new \nAbstract: Here, we develop an objective Bayesian analysis for large-scale datasets. When Bayesian analysis is applied to large-scale datasets, the cut point that provides the posterior probability is usually determined following customs. In this work, we propose setting the cut point in an objective manner, which is determined so as to match the posterior null number with the estimated true null number. 
The posterior probability obtained using an objective cut point is close to the real false discovery rate (FDR), which facilitates control of the FDR level."}, "https://arxiv.org/abs/2404.00319": {"title": "Direction Preferring Confidence Intervals", "link": "https://arxiv.org/abs/2404.00319", "description": "arXiv:2404.00319v1 Announce Type: new \nAbstract: Confidence intervals (CIs) are instrumental in statistical analysis, providing a range estimate of the parameters. In modern statistics, selective inference is common, where only certain parameters are highlighted. However, this selective approach can bias the inference, leading some to advocate for the use of CIs over p-values. To increase the flexibility of confidence intervals, we introduce direction-preferring CIs, enabling analysts to focus on parameters trending in a particular direction. We present these types of CIs in two settings: First, when there is no selection of parameters; and second, for situations involving parameter selection, where we offer a conditional version of the direction-preferring CIs. Both of these methods build upon the foundations of Modified Pratt CIs, which rely on non-equivariant acceptance regions to achieve longer intervals in exchange for improved sign exclusions. We show that for selected parameters out of m > 1 initial parameters of interest, CIs aimed at controlling the false coverage rate have higher power to determine the sign compared to conditional CIs. We also show that conditional confidence intervals control the marginal false coverage rate (mFCR) under any dependency."}, "https://arxiv.org/abs/2404.00359": {"title": "Loss-based prior for tree topologies in BART models", "link": "https://arxiv.org/abs/2404.00359", "description": "arXiv:2404.00359v1 Announce Type: new \nAbstract: We present a novel prior for tree topology within Bayesian Additive Regression Trees (BART) models. This approach quantifies the hypothetical loss in information and the loss due to complexity associated with choosing the wrong tree structure. The resulting prior distribution is compellingly geared toward sparsity, a critical feature considering BART models' tendency to overfit. Our method incorporates prior knowledge into the distribution via two parameters that govern the tree's depth and balance between its left and right branches. Additionally, we propose a default calibration for these parameters, offering an objective version of the prior. We demonstrate our method's efficacy on both simulated and real datasets."}, "https://arxiv.org/abs/2404.00606": {"title": "\"Sound and Fury\": Nonlinear Functionals of Volatility Matrix in the Presence of Jump and Noise", "link": "https://arxiv.org/abs/2404.00606", "description": "arXiv:2404.00606v1 Announce Type: new \nAbstract: This paper resolves a pivotal open problem on nonparametric inference for nonlinear functionals of the volatility matrix. Multiple prominent statistical tasks can be formulated as functionals of the volatility matrix, yet a unified statistical theory of general nonlinear functionals based on noisy data remains challenging and elusive. Nonetheless, this paper shows it can be achieved by combining the strengths of pre-averaging, jump truncation and nonlinearity bias correction. In light of general nonlinearity, bias correction beyond linear approximation becomes necessary. Resultant estimators are nonparametric and robust over a wide spectrum of stochastic models. 
Moreover, the estimators can be rate-optimal, and stable central limit theorems are obtained. The proposed framework lends itself conveniently to uncertainty quantification and permits fully feasible inference. With strong theoretical guarantees, this paper provides an inferential foundation for a wealth of statistical methods for noisy high-frequency data, such as realized principal component analysis, continuous-time linear regression, realized Laplace transform, generalized method of integrated moments, and specification tests, and hence extends current application scopes to noisy data, which is more prevalent in practice."}, "https://arxiv.org/abs/2404.00735": {"title": "Two-Stage Nuisance Function Estimation for Causal Mediation Analysis", "link": "https://arxiv.org/abs/2404.00735", "description": "arXiv:2404.00735v1 Announce Type: new \nAbstract: When estimating the direct and indirect causal effects using the influence function-based estimator of the mediation functional, it is crucial to understand what aspects of the treatment, the mediator, and the outcome mean mechanisms should be focused on. Specifically, considering them as nuisance functions and attempting to fit these nuisance functions as accurately as possible is not necessarily the best approach to take. In this work, we propose a two-stage estimation strategy for the nuisance functions that estimates the nuisance functions based on the role they play in the structure of the bias of the influence function-based estimator of the mediation functional. We provide a robustness analysis of the proposed method, as well as sufficient conditions for consistency and asymptotic normality of the estimator of the parameter of interest."}, "https://arxiv.org/abs/2404.00788": {"title": "A Novel Stratified Analysis Method for Testing and Estimating Overall Treatment Effects on Time-to-Event Outcomes Using Average Hazard with Survival Weight", "link": "https://arxiv.org/abs/2404.00788", "description": "arXiv:2404.00788v1 Announce Type: new \nAbstract: Given the limitations of using the Cox hazard ratio to summarize the magnitude of the treatment effect, alternative measures that do not have these limitations are gaining attention. One of the recently proposed alternative methods uses the average hazard with survival weight (AH). This population quantity can be interpreted as the average intensity of the event occurrence in a given time window that does not involve study-specific censoring. Inference procedures for the ratio of AH and difference in AH have already been proposed in simple randomized controlled trial settings to compare two groups. However, methods with stratification factors have not been well discussed, although stratified analysis is often used in practice to adjust for confounding factors and increase the power to detect a between-group difference. The conventional stratified analysis or meta-analysis approach, which integrates stratum-specific treatment effects using an optimal weight, directly applies to the ratio of AH and difference in AH. However, this conventional approach has significant limitations similar to the Cochran-Mantel-Haenszel method for a binary outcome and the stratified Cox procedure for a time-to-event outcome. To address this, we propose a new stratified analysis method for AH using standardization. With the proposed method, one can summarize the between-group treatment effect in both absolute difference and relative terms, adjusting for stratification factors. 
This can be a valuable alternative to the traditional stratified Cox procedure to estimate and report the magnitude of the treatment effect on time-to-event outcomes using hazard."}, "https://arxiv.org/abs/2404.00820": {"title": "Visual analysis of bivariate dependence between continuous random variables", "link": "https://arxiv.org/abs/2404.00820", "description": "arXiv:2404.00820v1 Announce Type: new \nAbstract: Scatter plots are widely recognized as fundamental tools for illustrating the relationship between two numerical variables. Despite this, based on solid theoretical foundations, scatter plots generated from pairs of continuous random variables may not serve as reliable tools for assessing dependence. Sklar's Theorem implies that scatter plots created from ranked data are preferable for such analysis as they exclusively convey information pertinent to dependence. This is in stark contrast to conventional scatter plots, which also encapsulate information about the variables' marginal distributions. Such additional information is extraneous to dependence analysis and can obscure the visual interpretation of the variables' relationship. In this article, we delve into the theoretical underpinnings of these ranked data scatter plots, hereafter referred to as rank plots. We offer insights into interpreting the information they reveal and examine their connections with various association measures, including Pearson's and Spearman's correlation coefficients, as well as Schweizer-Wolff's measure of dependence. Furthermore, we introduce a novel graphical combination for dependence analysis, termed a dplot, and demonstrate its efficacy through real data examples."}, "https://arxiv.org/abs/2404.00864": {"title": "Convolution-t Distributions", "link": "https://arxiv.org/abs/2404.00864", "description": "arXiv:2404.00864v1 Announce Type: new \nAbstract: We introduce a new class of multivariate heavy-tailed distributions that are convolutions of heterogeneous multivariate t-distributions. Unlike commonly used heavy-tailed distributions, the multivariate convolution-t distributions embody cluster structures with flexible nonlinear dependencies and heterogeneous marginal distributions. Importantly, convolution-t distributions have simple density functions that facilitate estimation and likelihood-based inference. The characteristic features of convolution-t distributions are found to be important in an empirical analysis of realized volatility measures and help identify their underlying factor structure."}, "https://arxiv.org/abs/2404.01043": {"title": "The Mean Shape under the Relative Curvature Condition", "link": "https://arxiv.org/abs/2404.01043", "description": "arXiv:2404.01043v1 Announce Type: new \nAbstract: The relative curvature condition (RCC) serves as a crucial constraint, ensuring the avoidance of self-intersection problems in calculating the mean shape over a sample of swept regions. By considering the RCC, this work discusses estimating the mean shape for a class of swept regions called elliptical slabular objects based on a novel shape representation, namely elliptical tube representation (ETRep). The ETRep shape space equipped with extrinsic and intrinsic distances in accordance with object transformation is explained. The intrinsic distance is determined based on the intrinsic skeletal coordinate system of the shape space. Further, calculating the intrinsic mean shape based on the intrinsic distance over a set of ETReps is demonstrated. 
The proposed intrinsic methodology is applied to statistical shape analysis to design global and partial hypothesis testing methods to study the hippocampal structure in early Parkinson's disease."}, "https://arxiv.org/abs/2404.01076": {"title": "Debiased calibration estimation using generalized entropy in survey sampling", "link": "https://arxiv.org/abs/2404.01076", "description": "arXiv:2404.01076v1 Announce Type: new \nAbstract: Incorporating auxiliary information into survey estimation is a fundamental problem in survey sampling. Calibration weighting is a popular tool for incorporating auxiliary information. The calibration weighting method of Deville and Sarndal (1992) uses a distance measure between the design weights and the final weights to solve the optimization problem with calibration constraints. This paper introduces a novel framework that leverages generalized entropy as the objective function for optimization, where design weights play a role in the constraints to ensure design consistency, rather than being part of the objective function. This innovative calibration framework is particularly attractive due to its generality and its ability to generate more efficient calibration weights compared to traditional methods based on Deville and Sarndal (1992). Furthermore, we identify the optimal choice of the generalized entropy function that achieves the minimum variance across various choices of the generalized entropy function under the same constraints. Asymptotic properties, such as design consistency and asymptotic normality, are presented rigorously. The results from a limited simulation study are also presented. We demonstrate a real-life application using agricultural survey data collected from Kynetec, Inc."}, "https://arxiv.org/abs/2404.01191": {"title": "A Semiparametric Approach for Robust and Efficient Learning with Biobank Data", "link": "https://arxiv.org/abs/2404.01191", "description": "arXiv:2404.01191v1 Announce Type: new \nAbstract: With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing its potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias to both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. 
Our method outperforms existing methods in extensive simulation studies, as well as in a real-world application in phenotyping and genetic risk modeling of type II diabetes."}, "https://arxiv.org/abs/2404.00751": {"title": "C-XGBoost: A tree boosting model for causal effect estimation", "link": "https://arxiv.org/abs/2404.00751", "description": "arXiv:2404.00751v1 Announce Type: cross \nAbstract: Causal effect estimation aims at estimating the Average Treatment Effect as well as the Conditional Average Treatment Effect of a treatment on an outcome from the available data. This knowledge is important in many safety-critical domains, where it often needs to be extracted from observational data. In this work, we propose a new causal inference model, named C-XGBoost, for the prediction of potential outcomes. The motivation of our approach is to exploit the superiority of tree-based models for handling tabular data together with the notable property of causal inference neural network-based models to learn representations that are useful for estimating the outcome for both the treatment and non-treatment cases. The proposed model also inherits the considerable advantages of the XGBoost model, such as efficiently handling features with missing values with minimal preprocessing effort, and it is equipped with regularization techniques to avoid overfitting/bias. Furthermore, we propose a new loss function for efficiently training the proposed causal inference model. The experimental analysis, which is based on the performance profiles of Dolan and Mor{\\'e} as well as on post-hoc and non-parametric statistical tests, provides strong evidence about the effectiveness of the proposed approach."}, "https://arxiv.org/abs/2404.00753": {"title": "A compromise criterion for weighted least squares estimates", "link": "https://arxiv.org/abs/2404.00753", "description": "arXiv:2404.00753v1 Announce Type: cross \nAbstract: When independent errors in a linear model have non-identity covariance, the ordinary least squares estimate of the model coefficients is less efficient than the weighted least squares estimate. However, the practical application of weighted least squares is challenging due to its reliance on the unknown error covariance matrix. Although feasible weighted least squares estimates, which use an approximation of this matrix, often outperform the ordinary least squares estimate in terms of efficiency, this is not always the case. In some situations, feasible weighted least squares can be less efficient than ordinary least squares. This study identifies the conditions under which feasible weighted least squares estimates using fixed weights demonstrate greater efficiency than the ordinary least squares estimate. These conditions provide guidance for the design of feasible estimates using random weights. They also shed light on how a certain robust regression estimate behaves with respect to the linear model with normal errors of unequal variance."}, "https://arxiv.org/abs/2404.00784": {"title": "Estimating sample paths of Gauss-Markov processes from noisy data", "link": "https://arxiv.org/abs/2404.00784", "description": "arXiv:2404.00784v1 Announce Type: cross \nAbstract: I derive the pointwise conditional means and variances of an arbitrary Gauss-Markov process, given noisy observations of points on a sample path. These moments depend on the process's mean and covariance functions, and on the conditional moments of the sampled points. 
I study the Brownian motion and bridge as special cases."}, "https://arxiv.org/abs/2404.00848": {"title": "Predictive Performance Comparison of Decision Policies Under Confounding", "link": "https://arxiv.org/abs/2404.00848", "description": "arXiv:2404.00848v1 Announce Type: cross \nAbstract: Predictive models are often introduced to decision-making tasks under the rationale that they improve performance over an existing decision-making policy. However, it is challenging to compare predictive performance against an existing decision-making policy that is generally under-specified and dependent on unobservable factors. These sources of uncertainty are often addressed in practice by making strong assumptions about the data-generating mechanism. In this work, we propose a method to compare the predictive performance of decision policies under a variety of modern identification approaches from the causal inference and off-policy evaluation literatures (e.g., instrumental variable, marginal sensitivity model, proximal variable). Key to our method is the insight that there are regions of uncertainty that we can safely ignore in the policy comparison. We develop a practical approach for finite-sample estimation of regret intervals under no assumptions on the parametric form of the status quo policy. We verify our framework theoretically and via synthetic data experiments. We conclude with a real-world application using our framework to support a pre-deployment evaluation of a proposed modification to a healthcare enrollment policy."}, "https://arxiv.org/abs/2404.00912": {"title": "Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms", "link": "https://arxiv.org/abs/2404.00912", "description": "arXiv:2404.00912v1 Announce Type: cross \nAbstract: Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to the randomized algorithm. We propose statistical inference methods for a broad range of sketching distributions, such as the subsampled randomized Hadamard transform (SRHT), Sparse Sign Embeddings (SSE) and CountSketch, sketching matrices with i.i.d. entries, and uniform subsampling. To our knowledge, no comparable methods are available for SSE and for SRHT in PCA. Our novel theoretical approach rests on showing the asymptotic normality of certain quadratic forms. As a contribution of broader interest, we show central limit theorems for quadratic forms of the SRHT, relying on a novel proof via a dyadic expansion that leverages the recursive structure of the Hadamard transform. 
Numerical experiments using both synthetic and empirical datasets support the efficacy of our methods, and in particular suggest that sketching methods can have better computation-estimation tradeoffs than recently proposed optimal subsampling methods."}, "https://arxiv.org/abs/2404.01153": {"title": "TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression", "link": "https://arxiv.org/abs/2404.01153", "description": "arXiv:2404.01153v1 Announce Type: cross \nAbstract: The main challenge that sets transfer learning apart from traditional supervised learning is the distribution shift, reflected as the shift between the source and target models and that between the marginal covariate distributions. In this work, we tackle model shifts in the presence of covariate shifts in the high-dimensional regression setting. Specifically, we propose a two-step method with a novel fused-regularizer that effectively leverages samples from source tasks to improve the learning performance on a target task with limited samples. Nonasymptotic bound is provided for the estimation error of the target model, showing the robustness of the proposed method to covariate shifts. We further establish conditions under which the estimator is minimax-optimal. Additionally, we extend the method to a distributed setting, allowing for a pretraining-finetuning strategy, requiring just one round of communication while retaining the estimation rate of the centralized version. Numerical tests validate our theory, highlighting the method's robustness to covariate shifts."}, "https://arxiv.org/abs/2404.01273": {"title": "TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model", "link": "https://arxiv.org/abs/2404.01273", "description": "arXiv:2404.01273v1 Announce Type: cross \nAbstract: Recently, there has been a burgeoning interest in virtual clinical trials, which simulate real-world scenarios and hold the potential to significantly enhance patient safety, expedite development, reduce costs, and contribute to the broader scientific knowledge in healthcare. Existing research often focuses on leveraging electronic health records (EHRs) to support clinical trial outcome prediction. Yet, trained with limited clinical trial outcome data, existing approaches frequently struggle to perform accurate predictions. Some research has attempted to generate EHRs to augment model development but has fallen short in personalizing the generation for individual patient profiles. Recently, the emergence of large language models has illuminated new possibilities, as their embedded comprehensive clinical knowledge has proven beneficial in addressing medical issues. In this paper, we propose a large language model-based digital twin creation approach, called TWIN-GPT. TWIN-GPT can establish cross-dataset associations of medical information given limited data, generating unique personalized digital twins for different patients, thereby preserving individual patient characteristics. Comprehensive experiments show that using digital twins created by TWIN-GPT can boost clinical trial outcome prediction, exceeding various previous prediction approaches. Besides, we also demonstrate that TWIN-GPT can generate high-fidelity trial data that closely approximate specific patients, aiding in more accurate result predictions in data-scarce situations. 
Moreover, our study provides practical evidence for the application of digital twins in healthcare, highlighting its potential significance."}, "https://arxiv.org/abs/2109.02309": {"title": "Hypothesis Testing for Functional Linear Models via Bootstrapping", "link": "https://arxiv.org/abs/2109.02309", "description": "arXiv:2109.02309v4 Announce Type: replace \nAbstract: Hypothesis testing for the slope function in functional linear regression is of both practical and theoretical interest. We develop a novel test for the nullity of the slope function, where testing the slope function is transformed into testing a high-dimensional vector based on functional principal component analysis. This transformation fully circumvents ill-posedness in functional linear regression, thereby enhancing numeric stability. The proposed method leverages the technique of bootstrapping max statistics and exploits the inherent variance decay property of functional data, improving the empirical power of tests especially when the sample size is limited or the signal is relatively weak. We establish validity and consistency of our proposed test when the functional principal components are derived from data. Moreover, we show that the test maintains its asymptotic validity and consistency, even when including \\emph{all} empirical functional principal components in our test statistics. This sharply contrasts with the task of estimating the slope function, which requires a delicate choice of the number (at most in the order of $\\sqrt n$) of functional principal components to ensure estimation consistency. This distinction highlights an interesting difference between estimation and statistical inference regarding the slope function in functional linear regression. To the best of our knowledge, the proposed test is the first of its kind to utilize all empirical functional principal components."}, "https://arxiv.org/abs/2201.10080": {"title": "Spatial meshing for general Bayesian multivariate models", "link": "https://arxiv.org/abs/2201.10080", "description": "arXiv:2201.10080v2 Announce Type: replace \nAbstract: Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. 
We demonstrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real world remote sensing and community ecology applications with large scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of the R package 'meshed', available on CRAN."}, "https://arxiv.org/abs/2209.07028": {"title": "Estimating large causal polytrees from small samples", "link": "https://arxiv.org/abs/2209.07028", "description": "arXiv:2209.07028v3 Announce Type: replace \nAbstract: We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions."}, "https://arxiv.org/abs/2303.16008": {"title": "Risk ratio, odds ratio, risk difference", "link": "https://arxiv.org/abs/2303.16008", "description": "arXiv:2303.16008v3 Announce Type: replace \nAbstract: There are many measures to report so-called treatment or causal effects: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, e.g., absolute versus relative, is often debated because it leads to different impressions of the benefit or risk of a treatment. Besides, different causal measures may lead to various treatment effect heterogeneity: some input variables may have an influence on some causal measures and no effect at all on others. In addition, some measures - but not all - have appealing properties such as collapsibility, matching the intuition of a population summary. In this paper, we first review common causal measures and their pros and cons typically brought forward. Doing so, we clarify the notions of collapsibility and treatment effect heterogeneity, unifying existing definitions. Then, we show that for any causal measure there exists a generative model such that the conditional average treatment effect (CATE) captures the treatment effect. However, only the risk difference can disentangle the treatment effect from the baseline at both population and strata levels, regardless of the outcome type (continuous or binary). As our primary goal is the generalization of causal measures, we show that different sets of covariates are needed to generalize an effect to a target population depending on (i) the causal measure of interest, and (ii) the identification method chosen, that is, generalizing either conditional outcome or local effects."}, "https://arxiv.org/abs/2307.03639": {"title": "Fast and Optimal Inference for Change Points in Piecewise Polynomials via Differencing", "link": "https://arxiv.org/abs/2307.03639", "description": "arXiv:2307.03639v2 Announce Type: replace \nAbstract: We consider the problem of uncertainty quantification in change point regressions, where the signal can be piecewise polynomial of arbitrary but fixed degree. That is, we seek disjoint intervals which, uniformly at a given confidence level, must each contain a change point location. 
We propose a procedure based on performing local tests at a number of scales and locations on a sparse grid, which adapts to the choice of grid in the sense that by choosing a sparser grid one explicitly pays a lower price for multiple testing. The procedure is fast as its computational complexity is always of the order $\\mathcal{O} (n \\log (n))$ where $n$ is the length of the data, and optimal in the sense that under certain mild conditions every change point is detected with high probability and the widths of the intervals returned match the mini-max localisation rates for the associated change point problem up to log factors. A detailed simulation study shows our procedure is competitive against state of the art algorithms for similar problems. Our procedure is implemented in the R package ChangePointInference which is available via https://github.com/gaviosha/ChangePointInference."}, "https://arxiv.org/abs/2307.15213": {"title": "PCA, SVD, and Centering of Data", "link": "https://arxiv.org/abs/2307.15213", "description": "arXiv:2307.15213v2 Announce Type: replace \nAbstract: The research detailed in this paper scrutinizes Principal Component Analysis (PCA), a seminal method employed in statistics and machine learning for the purpose of reducing data dimensionality. Singular Value Decomposition (SVD) is often employed as the primary means for computing PCA, a process that indispensably includes the step of centering - the subtraction of the mean location from the data set. In our study, we delve into a detailed exploration of the influence of this critical yet often ignored or downplayed data centering step. Our research meticulously investigates the conditions under which two PCA embeddings, one derived from SVD with centering and the other without, can be viewed as aligned. As part of this exploration, we analyze the relationship between the first singular vector and the mean direction, subsequently linking this observation to the congruity between two SVDs of centered and uncentered matrices. Furthermore, we explore the potential implications arising from the absence of centering in the context of performing PCA via SVD from a spectral analysis standpoint. Our investigation emphasizes the importance of a comprehensive understanding and acknowledgment of the subtleties involved in the computation of PCA. As such, we believe this paper offers a crucial contribution to the nuanced understanding of this foundational statistical method and stands as a valuable addition to the academic literature in the field of statistics."}, "https://arxiv.org/abs/2308.13069": {"title": "The diachronic Bayesian", "link": "https://arxiv.org/abs/2308.13069", "description": "arXiv:2308.13069v3 Announce Type: replace \nAbstract: It is well known that a Bayesian probability forecast for all future observations should be a probability measure in order to satisfy a natural condition of coherence. The main topics of this paper are the evolution of the Bayesian probability measure and ways of testing its adequacy as it evolves over time. The process of testing evolving Bayesian beliefs is modelled in terms of betting, similarly to the standard Dutch book treatment of coherence. 
The resulting picture is adapted to forecasting several steps ahead and making almost optimal decisions."}, "https://arxiv.org/abs/2310.01402": {"title": "Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs", "link": "https://arxiv.org/abs/2310.01402", "description": "arXiv:2310.01402v2 Announce Type: replace \nAbstract: We investigated whether large language models (LLMs) can develop data validation tests. We considered 96 conditions each for both GPT-3.5 and GPT-4, examining different prompt scenarios, learning modes, temperature settings, and roles. The prompt scenarios were: 1) Asking for expectations, 2) Asking for expectations with a given context, 3) Asking for expectations after requesting a data simulation, and 4) Asking for expectations with a provided data sample. The learning modes were: 1) zero-shot, 2) one-shot, and 3) few-shot learning. We also tested four temperature settings: 0, 0.4, 0.6, and 1. And the two distinct roles were: 1) helpful assistant, 2) expert data scientist. To gauge consistency, every setup was tested five times. The LLM-generated responses were benchmarked against a gold standard data validation suite, created by an experienced data scientist knowledgeable about the data in question. We find there are considerable returns to the use of few-shot learning, and that the more explicit the data setting can be the better, to a point. The best LLM configurations complement, rather than substitute, the gold standard results. This study underscores the value LLMs can bring to the data cleaning and preparation stages of the data science workflow, but highlights that they need considerable evaluation by experienced analysts."}, "https://arxiv.org/abs/2310.02968": {"title": "Sampling depth trade-off in function estimation under a two-level design", "link": "https://arxiv.org/abs/2310.02968", "description": "arXiv:2310.02968v3 Announce Type: replace \nAbstract: Many modern statistical applications involve a two-level sampling scheme that first samples subjects from a population and then samples observations on each subject. These schemes often are designed to learn both the population-level functional structures shared by the subjects and the functional characteristics specific to individual subjects. Common wisdom suggests that learning population-level structures benefits from sampling more subjects whereas learning subject-specific structures benefits from deeper sampling within each subject. Oftentimes these two objectives compete for limited sampling resources, which raises the question of how to optimally sample at the two levels. We quantify such sampling-depth trade-offs by establishing the $L_2$ minimax risk rates for learning the population-level and subject-specific structures under a hierarchical Gaussian process model framework where we consider a Bayesian and a frequentist perspective on the unknown population-level structure. These rates provide general lessons for designing two-level sampling schemes given a fixed sampling budget. Interestingly, they show that subject-specific learning occasionally benefits more by sampling more subjects than by deeper within-subject sampling. We show that the corresponding minimax rates can be readily achieved in practice through simple adaptive estimators without assuming prior knowledge on the underlying variability at the two sampling levels. We validate our theory and illustrate the sampling trade-off in practice through both simulation experiments and two real datasets. 
While we carry out all the theoretical analysis in the context of Gaussian process models for analytical tractability, the results provide insights on effective two-level sampling designs more broadly."}, "https://arxiv.org/abs/2311.02789": {"title": "Estimation and Inference for a Class of Generalized Hierarchical Models", "link": "https://arxiv.org/abs/2311.02789", "description": "arXiv:2311.02789v4 Announce Type: replace \nAbstract: In this paper, we consider estimation and inference for the unknown parameters and function involved in a class of generalized hierarchical models. Such models are of great interest in the literature of neural networks (such as Bauer and Kohler, 2019). We propose a rectified linear unit (ReLU) based deep neural network (DNN) approach, and contribute to the design of DNN by i) providing more transparency for practical implementation, ii) defining different types of sparsity, iii) showing the differentiability, iv) pointing out the set of effective parameters, and v) offering a new variant of rectified linear activation function (ReLU), etc. Asymptotic properties are established accordingly, and a feasible procedure for the purpose of inference is also proposed. We conduct extensive numerical studies to examine the finite-sample performance of the estimation methods, and we also evaluate the empirical relevance and applicability of the proposed models and estimation methods to real data."}, "https://arxiv.org/abs/2311.14032": {"title": "Counterfactual Sensitivity in Quantitative Spatial Models", "link": "https://arxiv.org/abs/2311.14032", "description": "arXiv:2311.14032v2 Announce Type: replace \nAbstract: Counterfactuals in quantitative spatial models are functions of the current state of the world and the model parameters. Current practice treats the current state of the world as perfectly observed, but there is good reason to believe that it is measured with error. This paper provides tools for quantifying uncertainty about counterfactuals when the current state of the world is measured with error. I recommend an empirical Bayes approach to uncertainty quantification, which is both practical and theoretically justified. I apply the proposed method to the applications in Adao, Costinot, and Donaldson (2017) and Allen and Arkolakis (2022) and find non-trivial uncertainty about counterfactuals."}, "https://arxiv.org/abs/2312.02518": {"title": "The general linear hypothesis testing problem for multivariate functional data with applications", "link": "https://arxiv.org/abs/2312.02518", "description": "arXiv:2312.02518v2 Announce Type: replace \nAbstract: As technology continues to advance at a rapid pace, the prevalence of multivariate functional data (MFD) has expanded across diverse disciplines, spanning biology, climatology, finance, and numerous other fields of study. Although MFD are encountered in various fields, the development of methods for testing hypotheses on mean functions, especially the general linear hypothesis testing (GLHT) problem for such data, has been limited. In this study, we propose and study a new global test for the GLHT problem for MFD, which includes the one-way FMANOVA, post hoc, and contrast analysis as special cases. The asymptotic null distribution of the test statistic is shown to be a chi-squared-type mixture dependent on the eigenvalues of the heteroscedastic covariance functions. 
The distribution of the chi-squared-type mixture can be well approximated by a three-cumulant matched chi-squared-approximation with its approximation parameters estimated from the data. By incorporating an adjustment coefficient, the proposed test performs effectively irrespective of the correlation structure in the functional data, even when dealing with a relatively small sample size. Additionally, the proposed test is shown to be root-n consistent, that is, it has a nontrivial power against a local alternative. Simulation studies and a real data example demonstrate finite-sample performance and broad applicability of the proposed test."}, "https://arxiv.org/abs/2401.00395": {"title": "Energetic Variational Gaussian Process Regression for Computer Experiments", "link": "https://arxiv.org/abs/2401.00395", "description": "arXiv:2401.00395v2 Announce Type: replace \nAbstract: The Gaussian process (GP) regression model is a widely employed surrogate modeling technique for computer experiments, offering precise predictions and statistical inference for the computer simulators that generate experimental data. Estimation and inference for GP can be performed in both frequentist and Bayesian frameworks. In this chapter, we construct the GP model through variational inference, particularly employing the recently introduced energetic variational inference method by Wang et al. (2021). Adhering to the GP model assumptions, we derive posterior distributions for its parameters. The energetic variational inference approach bridges the Bayesian sampling and optimization and enables approximation of the posterior distributions and identification of the posterior mode. By incorporating a normal prior on the mean component of the GP model, we also apply shrinkage estimation to the parameters, facilitating mean function variable selection. To showcase the effectiveness of our proposed GP model, we present results from three benchmark examples."}, "https://arxiv.org/abs/2110.00152": {"title": "ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families", "link": "https://arxiv.org/abs/2110.00152", "description": "arXiv:2110.00152v3 Announce Type: replace-cross \nAbstract: The empirical Bayes normal means (EBNM) model is important to many areas of statistics, including (but not limited to) multiple testing, wavelet denoising, and gene expression analysis. There are several existing software packages that can fit EBNM models under different prior assumptions and using different algorithms; however, the differences across interfaces complicate direct comparisons. Further, a number of important prior assumptions do not yet have implementations. Motivated by these issues, we developed the R package ebnm, which provides a unified interface for efficiently fitting EBNM models using a variety of prior assumptions, including nonparametric approaches. In some cases, we incorporated existing implementations into ebnm; in others, we implemented new fitting procedures with a focus on speed and numerical stability. We illustrate the use of ebnm in a detailed analysis of baseball statistics. 
By providing a unified and easily extensible interface, the ebnm package can facilitate development of new methods in statistics, genetics, and other areas; as an example, we briefly discuss the R package flashier, which harnesses methods in ebnm to provide a flexible and robust approach to matrix factorization."}, "https://arxiv.org/abs/2205.13589": {"title": "Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes", "link": "https://arxiv.org/abs/2205.13589", "description": "arXiv:2205.13589v3 Announce Type: replace-cross \nAbstract: We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \\underline{P}roxy variable \\underline{P}essimistic \\underline{P}olicy \\underline{O}ptimization (\\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \\texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \\texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \\texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset."}, "https://arxiv.org/abs/2302.06809": {"title": "Large-scale Multiple Testing: Fundamental Limits of False Discovery Rate Control and Compound Oracle", "link": "https://arxiv.org/abs/2302.06809", "description": "arXiv:2302.06809v2 Announce Type: replace-cross \nAbstract: The false discovery rate (FDR) and the false non-discovery rate (FNR), defined as the expected false discovery proportion (FDP) and the false non-discovery proportion (FNP), are the most popular benchmarks for multiple testing. Despite the theoretical and algorithmic advances in recent years, the optimal tradeoff between the FDR and the FNR has been largely unknown except for certain restricted classes of decision rules, e.g., separable rules, or for other performance metrics, e.g., the marginal FDR and the marginal FNR (mFDR and mFNR). In this paper, we determine the asymptotically optimal FDR-FNR tradeoff under the two-group random mixture model when the number of hypotheses tends to infinity. Distinct from the optimal mFDR-mFNR tradeoff, which is achieved by separable decision rules, the optimal FDR-FNR tradeoff requires compound rules even in the large-sample limit and for models as simple as the Gaussian location model. This suboptimality of separable rules also holds for other objectives, such as maximizing the expected number of true discoveries. Finally, to address the limitation of the FDR which only controls the expectation but not the fluctuation of the FDP, we also determine the optimal tradeoff when the FDP is controlled with high probability and show it coincides with that of the mFDR and the mFNR. 
Extensions to models with a fixed non-null proportion are also obtained."}, "https://arxiv.org/abs/2310.15333": {"title": "Safe and Interpretable Estimation of Optimal Treatment Regimes", "link": "https://arxiv.org/abs/2310.15333", "description": "arXiv:2310.15333v2 Announce Type: replace-cross \nAbstract: Recent statistical and reinforcement learning methods have significantly advanced patient care strategies. However, these approaches face substantial challenges in high-stakes contexts, including missing data, inherent stochasticity, and the critical requirements for interpretability and patient safety. Our work operationalizes a safe and interpretable framework to identify optimal treatment regimes. This approach involves matching patients with similar medical and pharmacological characteristics, allowing us to construct an optimal policy via interpolation. We perform a comprehensive simulation study to demonstrate the framework's ability to identify optimal policies even in complex settings. Ultimately, we operationalize our approach to study regimes for treating seizures in critically ill patients. Our findings strongly support personalized treatment strategies based on a patient's medical history and pharmacological features. Notably, we identify that reducing medication doses for patients with mild and brief seizure episodes while adopting aggressive treatment for patients in the intensive care unit experiencing intense seizures leads to more favorable outcomes."}, "https://arxiv.org/abs/2401.00104": {"title": "Causal State Distillation for Explainable Reinforcement Learning", "link": "https://arxiv.org/abs/2401.00104", "description": "arXiv:2401.00104v2 Announce Type: replace-cross \nAbstract: Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions can be quite challenging. This lack of transparency in RL models has been a long-standing problem, making it difficult for users to grasp the reasons behind an agent's behaviour. Various approaches have been explored to address this problem, with one promising avenue being reward decomposition (RD). RD is appealing as it sidesteps some of the concerns associated with other methods that attempt to rationalize an agent's behaviour in a post-hoc manner. RD works by exposing various facets of the rewards that contribute to the agent's objectives during training. However, RD alone has limitations as it primarily offers insights based on sub-rewards and does not delve into the intricate cause-and-effect relationships that occur within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework that leverages information-theoretic measures for explanation objectives that encourage three crucial properties of causal factors: causal sufficiency, sparseness, and orthogonality. These properties help us distill the cause-and-effect relationships between the agent's states and actions or rewards, allowing for a deeper understanding of its decision-making processes. Our framework is designed to generate local explanations and can be applied to a wide range of RL tasks with multiple reward channels. 
Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections."}, "https://arxiv.org/abs/2404.01495": {"title": "Estimating Heterogeneous Effects: Applications to Labor Economics", "link": "https://arxiv.org/abs/2404.01495", "description": "arXiv:2404.01495v1 Announce Type: new \nAbstract: A growing number of applications involve settings where, in order to infer heterogeneous effects, a researcher compares various units. Examples of research designs include children moving between different neighborhoods, workers moving between firms, patients migrating from one city to another, and banks offering loans to different firms. We present a unified framework for these settings, based on a linear model with normal random coefficients and normal errors. Using the model, we discuss how to recover the mean and dispersion of effects, other features of their distribution, and how to construct predictors of the effects. We provide moment conditions on the model's parameters, and outline various estimation strategies. A main objective of the paper is to clarify some of the underlying assumptions by highlighting their economic content, and to discuss and inform some of the key practical choices."}, "https://arxiv.org/abs/2404.01546": {"title": "Time-Varying Matrix Factor Models", "link": "https://arxiv.org/abs/2404.01546", "description": "arXiv:2404.01546v1 Announce Type: new \nAbstract: Matrix-variate data of high dimensions are frequently observed in finance and economics, spanning extended time periods, such as the long-term data on international trade flows among numerous countries. To address potential structural shifts and explore the matrix structure's informational context, we propose a time-varying matrix factor model. This model accommodates changing factor loadings over time, revealing the underlying dynamic structure through nonparametric principal component analysis and facilitating dimension reduction. We establish the consistency and asymptotic normality of our estimators under general conditions that allow for weak correlations across time, rows, or columns of the noise. A novel approach is introduced to overcome rotational ambiguity in the estimators, enhancing the clarity and interpretability of the estimated loading matrices. Our simulation study highlights the merits of the proposed estimators and the effectiveness of the smoothing operation. In an application to international trade flows, we investigate the trading hubs, centrality, patterns, and trends in the trading network."}, "https://arxiv.org/abs/2404.01566": {"title": "Heterogeneous Treatment Effects and Causal Mechanisms", "link": "https://arxiv.org/abs/2404.01566", "description": "arXiv:2404.01566v1 Announce Type: new \nAbstract: The credibility revolution advances the use of research designs that permit identification and estimation of causal effects. However, understanding which mechanisms produce measured causal effects remains a challenge. A dominant current approach to the quantitative evaluation of mechanisms relies on the detection of heterogeneous treatment effects with respect to pre-treatment covariates. This paper develops a framework to understand when the existence of such heterogeneous treatment effects can support inferences about the activation of a mechanism. We show first that this design cannot provide evidence of mechanism activation without additional, generally implicit, assumptions. 
Further, even when these assumptions are satisfied, if a measured outcome is produced by a non-linear transformation of a directly-affected outcome of theoretical interest, heterogeneous treatment effects are not informative of mechanism activation. We provide novel guidance for interpretation and research design in light of these findings."}, "https://arxiv.org/abs/2404.01641": {"title": "The impact of geopolitical risk on the international agricultural market: Empirical analysis based on the GJR-GARCH-MIDAS model", "link": "https://arxiv.org/abs/2404.01641", "description": "arXiv:2404.01641v1 Announce Type: new \nAbstract: The current international landscape is turbulent and unstable, with frequent outbreaks of geopolitical conflicts worldwide. Geopolitical risk has emerged as a significant threat to regional and global peace, stability, and economic prosperity, causing serious disruptions to the global food system and food security. Focusing on the international food market, this paper builds different dimensions of geopolitical risk measures based on the random matrix theory and constructs single- and two-factor GJR-GARCH-MIDAS models with fixed time span and rolling window, respectively, to investigate the impact of geopolitical risk on food market volatility. The findings indicate that modeling based on rolling window performs better in describing the overall volatility of the wheat, maize, soybean, and rice markets, and the two-factor models generally exhibit stronger explanatory power in most cases. In terms of short-term fluctuations, all four staple food markets demonstrate obvious volatility clustering and high volatility persistence, without significant asymmetry. Regarding long-term volatility, the realized volatility of wheat, maize, and soybean significantly exacerbates their long-run market volatility. Additionally, geopolitical risks of different dimensions show varying directions and degrees of effects in explaining the long-term market volatility of the four staple food commodities. This study contributes to the understanding of the macro-drivers of food market fluctuations, provides useful information for investment using agricultural futures, and offers valuable insights into maintaining the stable operation of food markets and safeguarding global food security."}, "https://arxiv.org/abs/2404.01688": {"title": "Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis", "link": "https://arxiv.org/abs/2404.01688", "description": "arXiv:2404.01688v1 Announce Type: new \nAbstract: When building statistical models for Bayesian data analysis tasks, required and optional iterative adjustments and different modelling choices can give rise to numerous candidate models. In particular, checks and evaluations throughout the modelling process can motivate changes to an existing model or the consideration of alternative models to ultimately obtain models of sufficient quality for the problem at hand. Additionally, failing to consider alternative models can lead to overconfidence in the predictive or inferential ability of a chosen model. The search for suitable models requires modellers to work with multiple models without jeopardising the validity of their results. Multiverse analysis offers a framework for transparent creation of multiple models at once based on different sensible modelling choices, but the number of candidate models arising in the combination of iterations and possible modelling choices can become overwhelming in practice. 
Motivated by these challenges, this work proposes iterative filtering for multiverse analysis to support efficient and consistent assessment of multiple models and meaningful filtering towards fewer models of higher quality across different modelling contexts. Given that causal constraints have been taken into account, we show how multiverse analysis can be combined with recommendations from established Bayesian modelling workflows to identify promising candidate models by assessing predictive abilities and, if needed, tending to computational issues. We illustrate our suggested approach in different realistic modelling scenarios using real data examples."}, "https://arxiv.org/abs/2404.01734": {"title": "Expansion of net correlations in terms of partial correlations", "link": "https://arxiv.org/abs/2404.01734", "description": "arXiv:2404.01734v1 Announce Type: new \nAbstract: Graphical models are usually employed to represent statistical relationships between pairs of variables when all the remaining variables are fixed. In this picture, conditionally independent pairs are disconnected. In the real world, however, strict conditional independence is almost impossible to prove. Here we use a weaker version of the concept of graphical models, in which only the linear component of the conditional dependencies is represented. This notion enables us to relate the marginal Pearson correlation coefficient (a measure of linear marginal dependence) with the partial correlations (a measure of linear conditional dependence). Specifically, we use the graphical model to express the marginal Pearson correlation $\\rho_{ij}$ between variables $X_i$ and $X_j$ as a sum of the efficacies with which messages propagate along all the paths connecting the variables in the graph. The expansion is convergent, and provides a mechanistic interpretation of how global correlations arise from local interactions. Moreover, by weighing the relevance of each path and of each intermediate node, an intuitive way to imagine interventions is enabled, revealing for example what happens when a given edge is pruned, or the weight of an edge is modified. The expansion is also useful to construct minimal equivalent models, in which latent variables are introduced to replace a larger number of marginalised variables. In addition, the expansion yields an alternative algorithm to calculate marginal Pearson correlations, particularly beneficial when partial correlation matrix inversion is difficult. Finally, for Gaussian variables, the mutual information is also related to message-passing efficacies along paths in the graph."}, "https://arxiv.org/abs/2404.01736": {"title": "Nonparametric efficient causal estimation of the intervention-specific expected number of recurrent events with continuous-time targeted maximum likelihood and highly adaptive lasso estimation", "link": "https://arxiv.org/abs/2404.01736", "description": "arXiv:2404.01736v1 Announce Type: new \nAbstract: Longitudinal settings involving outcome, competing risks and censoring events occurring and recurring in continuous time are common in medical research, but are often analyzed with methods that do not allow for taking post-baseline information into account. In this work, we define statistical and causal target parameters via the g-computation formula by carrying out interventions directly on the product integral representing the observed data distribution in a continuous-time counting process model framework. 
In recurrent events settings our target parameter identifies the expected number of recurrent events also in settings where the censoring mechanism or post-baseline treatment decisions depend on past information of post-baseline covariates such as the recurrent event process. We propose a flexible estimation procedure based on targeted maximum likelihood estimation coupled with highly adaptive lasso estimation to provide a novel approach for double robust and nonparametric inference for the considered target parameter. We illustrate the methods in a simulation study."}, "https://arxiv.org/abs/2404.01977": {"title": "Least Squares Inference for Data with Network Dependency", "link": "https://arxiv.org/abs/2404.01977", "description": "arXiv:2404.01977v1 Announce Type: new \nAbstract: We address the inference problem concerning regression coefficients in a classical linear regression model using least squares estimates. The analysis is conducted under circumstances where network dependency exists across units in the sample. Neglecting the dependency among observations may lead to biased estimation of the asymptotic variance and often inflates the Type I error in coefficient inference. In this paper, we first establish a central limit theorem for the ordinary least squares estimate, with a verifiable dependence condition alongside corresponding neighborhood growth conditions. Subsequently, we propose a consistent estimator for the asymptotic variance of the estimated coefficients, which employs a data-driven method to balance the bias-variance trade-off. We find that the optimal tuning depends on the linear hypothesis under consideration and must be chosen adaptively. The presented theory and methods are illustrated and supported by numerical experiments and a data example."}, "https://arxiv.org/abs/2404.02093": {"title": "High-dimensional covariance regression with application to co-expression QTL detection", "link": "https://arxiv.org/abs/2404.02093", "description": "arXiv:2404.02093v1 Announce Type: new \nAbstract: While covariance matrices have been widely studied in many scientific fields, relatively limited progress has been made on estimating conditional covariances that permits a large covariance matrix to vary with high-dimensional subject-level covariates. In this paper, we present a new sparse multivariate regression framework that models the covariance matrix as a function of subject-level covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can be used to determine if and how gene co-expressions vary with genetic variations. To accommodate high-dimensional responses and covariates, we stipulate a combined sparsity structure that encourages covariates with non-zero effects and edges that are modulated by these covariates to be simultaneously sparse. We approach parameter estimation with a blockwise coordinate descent algorithm, and investigate the $\\ell_2$ convergence rate of the estimated parameters. In addition, we propose a computationally efficient debiased inference procedure for uncertainty quantification. 
The efficacy of the proposed method is demonstrated through numerical experiments and an application to a gene co-expression network study with brain cancer patients."}, "https://arxiv.org/abs/2404.02141": {"title": "Robustly estimating heterogeneity in factorial data using Rashomon Partitions", "link": "https://arxiv.org/abs/2404.02141", "description": "arXiv:2404.02141v1 Announce Type: new \nAbstract: Many statistical analyses, in both observational data and randomized control trials, ask: how does the outcome of interest vary with combinations of observable covariates? How do various drug combinations affect health outcomes, or how does technology adoption depend on incentives and demographics? Our goal is to partition this factorial space into ``pools'' of covariate combinations where the outcome differs across the pools (but not within a pool). Existing approaches (i) search for a single ``optimal'' partition under assumptions about the association between covariates or (ii) sample from the entire set of possible partitions. Both these approaches ignore the reality that, especially with correlation structure in covariates, many ways to partition the covariate space may be statistically indistinguishable, despite very different implications for policy or science. We develop an alternative perspective, called Rashomon Partition Sets (RPSs). Each item in the RPS partitions the space of covariates using a tree-like geometry. RPSs incorporate all partitions that have posterior values near the maximum a posteriori partition, even if they offer substantively different explanations, and do so using a prior that makes no assumptions about associations between covariates. This prior is the $\\ell_0$ prior, which we show is minimax optimal. Given the RPS we calculate the posterior of any measurable function of the feature effects vector on outcomes, conditional on being in the RPS. We also characterize approximation error relative to the entire posterior and provide bounds on the size of the RPS. Simulations demonstrate this framework allows for robust conclusions relative to conventional regularization techniques. We apply our method to three empirical settings: price effects on charitable giving, chromosomal structure (telomere length), and the introduction of microfinance."}, "https://arxiv.org/abs/2404.01466": {"title": "TS-CausalNN: Learning Temporal Causal Relations from Non-linear Non-stationary Time Series Data", "link": "https://arxiv.org/abs/2404.01466", "description": "arXiv:2404.01466v1 Announce Type: cross \nAbstract: The growing availability and importance of time series data across various domains, including environmental science, epidemiology, and economics, has led to an increasing need for time-series causal discovery methods that can identify the intricate relationships in the non-stationary, non-linear, and often noisy real world data. However, the majority of current time series causal discovery methods assume stationarity and linear relations in data, making them infeasible for the task. Further, the recent deep learning-based methods rely on the traditional causal structure learning approaches making them computationally expensive. In this paper, we propose a Time-Series Causal Neural Network (TS-CausalNN) - a deep learning technique to discover contemporaneous and lagged causal relations simultaneously. 
Our proposed architecture comprises (i) convolutional blocks comprising parallel custom causal layers, (ii) acyclicity constraint, and (iii) optimization techniques using the augmented Lagrangian approach. In addition to the simple parallel design, an advantage of the proposed model is that it naturally handles the non-stationarity and non-linearity of the data. Through experiments on multiple synthetic and real world datasets, we demonstrate the empirical proficiency of our proposed approach as compared to several state-of-the-art methods. The inferred graphs for the real world dataset are in good agreement with the domain understanding."}, "https://arxiv.org/abs/2404.01467": {"title": "Transnational Network Dynamics of Problematic Information Diffusion", "link": "https://arxiv.org/abs/2404.01467", "description": "arXiv:2404.01467v1 Announce Type: cross \nAbstract: This study maps the spread of two cases of COVID-19 conspiracy theories and misinformation in Spanish and French in Latin American and French-speaking communities on Facebook, and thus contributes to understanding the dynamics, reach and consequences of emerging transnational misinformation networks. The findings show that co-sharing behavior of public Facebook groups created transnational networks by sharing videos of Medicos por la Verdad (MPV) conspiracy theories in Spanish and hydroxychloroquine-related misinformation sparked by microbiologist Didier Raoult (DR) in French, usually igniting the surge of locally led interest groups across the Global South. Using inferential methods, the study shows how these networks are enabled primarily by shared cultural and thematic attributes among Facebook groups, effectively creating very large, networked audiences. The study contributes to the understanding of how potentially harmful conspiracy theories and misinformation transcend national borders through non-English speaking online communities, further highlighting the overlooked role of transnationalism in global misinformation diffusion and the potentially disproportionate harm that it causes in vulnerable communities across the globe."}, "https://arxiv.org/abs/2404.01469": {"title": "A group testing based exploration of age-varying factors in chlamydia infections among Iowa residents", "link": "https://arxiv.org/abs/2404.01469", "description": "arXiv:2404.01469v1 Announce Type: cross \nAbstract: Group testing, a method that screens subjects in pooled samples rather than individually, has been employed as a cost-effective strategy for chlamydia screening among Iowa residents. In efforts to deepen our understanding of chlamydia epidemiology in Iowa, several group testing regression models have been proposed. Different than previous approaches, we expand upon the varying coefficient model to capture potential age-varying associations with chlamydia infection risk. In general, our model operates within a Bayesian framework, allowing regression associations to vary with a covariate of key interest. We employ a stochastic search variable selection process for regularization in estimation. Additionally, our model can integrate random effects to consider potential geographical factors and estimate unknown assay accuracy probabilities. The performance of our model is assessed through comprehensive simulation studies. Upon application to the Iowa group testing dataset, we reveal a significant age-varying racial disparity in chlamydia infections. 
We believe this discovery has the potential to inform the enhancement of interventions and prevention strategies, leading to more effective chlamydia control and management, thereby promoting health equity across all populations."}, "https://arxiv.org/abs/2404.01595": {"title": "Propensity Score Alignment of Unpaired Multimodal Data", "link": "https://arxiv.org/abs/2404.01595", "description": "arXiv:2404.01595v1 Announce Type: cross \nAbstract: Multimodal representation learning techniques typically rely on paired samples to learn common representations, but paired samples are challenging to collect in fields such as biology where measurement devices often destroy the samples. This paper presents an approach to address the challenge of aligning unpaired samples across disparate modalities in multimodal representation learning. We draw an analogy between potential outcomes in causal inference and potential views in multimodal observations, which allows us to use Rubin's framework to estimate a common space in which to match samples. Our approach assumes we collect samples that are experimentally perturbed by treatments, and uses this to estimate a propensity score from each modality, which encapsulates all shared information between a latent state and treatment and can be used to define a distance between samples. We experiment with two alignment techniques that leverage this distance -- shared nearest neighbours (SNN) and optimal transport (OT) matching -- and find that OT matching results in significant improvements over state-of-the-art alignment approaches in both a synthetic multi-modal setting and in real-world data from the NeurIPS Multimodal Single-Cell Integration Challenge."}, "https://arxiv.org/abs/2404.01608": {"title": "FAIRM: Learning invariant representations for algorithmic fairness and domain generalization with minimax optimality", "link": "https://arxiv.org/abs/2404.01608", "description": "arXiv:2404.01608v1 Announce Type: cross \nAbstract: Machine learning methods often assume that the test data have the same distribution as the training data. However, this assumption may not hold due to multiple levels of heterogeneity in applications, raising issues in algorithmic fairness and domain generalization. In this work, we address the problem of fair and generalizable machine learning by invariant principles. We propose a training environment-based oracle, FAIRM, which has desirable fairness and domain generalization properties under a diversity-type condition. We then provide an empirical FAIRM with finite-sample theoretical guarantees under weak distributional assumptions. We then develop efficient algorithms to realize FAIRM in linear models and demonstrate the nonasymptotic performance with minimax optimality. We evaluate our method in numerical experiments with synthetic data and MNIST data and show that it outperforms its counterparts."}, "https://arxiv.org/abs/2404.02120": {"title": "DEMO: Dose Exploration, Monitoring, and Optimization Using a Biological Mediator for Clinical Outcomes", "link": "https://arxiv.org/abs/2404.02120", "description": "arXiv:2404.02120v1 Announce Type: cross \nAbstract: Phase 1-2 designs provide a methodological advance over phase 1 designs for dose finding by using both clinical response and toxicity. A phase 1-2 trial still may fail to select a truly optimal dose, because early response is not a perfect surrogate for long-term therapeutic success. 
To address this problem, a generalized phase 1-2 design first uses a phase 1-2 design's components to identify a set of candidate doses, adaptively randomizes patients among the candidates, and after longer follow up selects a dose to maximize long-term success rate. In this paper, we extend this paradigm by proposing a design that exploits an early treatment-related, real-valued biological outcome, such as pharmacodynamic activity or an immunological effect, that may act as a mediator between dose and clinical outcomes, including tumor response, toxicity, and survival time. We assume multivariate dose-outcome models that include effects appearing in causal pathways from dose to the clinical outcomes. Bayesian model selection is used to identify and eliminate biologically inactive doses. At the end of the trial, a therapeutically optimal dose is chosen from the set of doses that are acceptably safe, clinically effective, and biologically active to maximize restricted mean survival time. Results of a simulation study show that the proposed design may provide substantial improvements over designs that ignore the biological variable."}, "https://arxiv.org/abs/2007.02404": {"title": "Semi-parametric TEnsor Factor Analysis by Iteratively Projected Singular Value Decomposition", "link": "https://arxiv.org/abs/2007.02404", "description": "arXiv:2007.02404v2 Announce Type: replace \nAbstract: This paper introduces a general framework of Semi-parametric TEnsor Factor Analysis (STEFA) that focuses on the methodology and theory of low-rank tensor decomposition with auxiliary covariates. Semi-parametric TEnsor Factor Analysis models extend tensor factor models by incorporating auxiliary covariates in the loading matrices. We propose an algorithm of iteratively projected singular value decomposition (IP-SVD) for the semi-parametric estimation. It iteratively projects tensor data onto the linear space spanned by the basis functions of covariates and applies singular value decomposition on matricized tensors over each mode. We establish the convergence rates of the loading matrices and the core tensor factor. The theoretical results only require a sub-exponential noise distribution, which is weaker than the assumption of sub-Gaussian tail of noise in the literature. Compared with the Tucker decomposition, IP-SVD yields more accurate estimators with a faster convergence rate. Besides estimation, we propose several prediction methods with new covariates based on the STEFA model. On both synthetic and real tensor data, we demonstrate the efficacy of the STEFA model and the IP-SVD algorithm on both the estimation and prediction tasks."}, "https://arxiv.org/abs/2307.04457": {"title": "Predicting milk traits from spectral data using Bayesian probabilistic partial least squares regression", "link": "https://arxiv.org/abs/2307.04457", "description": "arXiv:2307.04457v3 Announce Type: replace \nAbstract: High-dimensional spectral data--routinely generated in dairy production--are used to predict a range of traits in milk products. Partial least squares (PLS) regression is ubiquitously used for these prediction tasks. However, PLS regression is not typically viewed as arising from statistical inference of a probabilistic model, and parameter uncertainty is rarely quantified. 
Additionally, PLS regression does not easily lend itself to model-based modifications, coherent prediction intervals are not readily available, and the process of choosing the latent-space dimension, $\\mathtt{Q}$, can be subjective and sensitive to data size. We introduce a Bayesian latent-variable model, emulating the desirable properties of PLS regression while accounting for parameter uncertainty in prediction. The need to choose $\\mathtt{Q}$ is eschewed through a nonparametric shrinkage prior. The flexibility of the proposed Bayesian partial least squares (BPLS) regression framework is exemplified by considering sparsity modifications and allowing for multivariate response prediction. The BPLS regression framework is used in two motivating settings: 1) multivariate trait prediction from mid-infrared spectral analyses of milk samples, and 2) milk pH prediction from surface-enhanced Raman spectral data. The prediction performance of BPLS regression at least matches that of PLS regression. Additionally, the provision of correctly calibrated prediction intervals objectively provides richer, more informative inference for stakeholders in dairy production."}, "https://arxiv.org/abs/2308.06913": {"title": "Improving the Estimation of Site-Specific Effects and their Distribution in Multisite Trials", "link": "https://arxiv.org/abs/2308.06913", "description": "arXiv:2308.06913v2 Announce Type: replace \nAbstract: In multisite trials, researchers are often interested in several inferential goals: estimating treatment effects for each site, ranking these effects, and studying their distribution. This study seeks to identify optimal methods for estimating these targets. Through a comprehensive simulation study, we assess two strategies and their combined effects: semiparametric modeling of the prior distribution, and alternative posterior summary methods tailored to minimize specific loss functions. Our findings highlight that the success of different estimation strategies depends largely on the amount of within-site and between-site information available from the data. We discuss how our results can guide balancing the trade-offs associated with shrinkage in limited data environments."}, "https://arxiv.org/abs/2112.10151": {"title": "Edge differentially private estimation in the $\\beta$-model via jittering and method of moments", "link": "https://arxiv.org/abs/2112.10151", "description": "arXiv:2112.10151v2 Announce Type: replace-cross \nAbstract: A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. Here we conduct an in-depth study of this trade-off for parameter estimation in the $\\beta$-model (Chatterjee, Diaconis and Sly, 2011) for edge differentially private network data released via jittering (Karwa, Krivitsky and Slavkovi\\'{c}, 2017). Unlike most previous approaches based on maximum likelihood estimation for this network model, we proceed via method-of-moments. This choice facilitates our exploration of a substantially broader range of privacy levels - corresponding to stricter privacy - than has been explored to date. Over this new range we discover that our proposed estimator for the parameters exhibits an interesting phase transition, with both its convergence rate and asymptotic variance following one of three different regimes of behavior depending on the level of privacy. 
Because identification of the operable regime is difficult if not impossible in practice, we devise a novel adaptive bootstrap procedure to construct uniform inference across different phases. In fact, leveraging this bootstrap we are able to provide for simultaneous inference of all parameters in the $\\beta$-model (i.e., equal to the number of nodes), which, to our best knowledge, is the first result of its kind. Numerical experiments confirm the competitive and reliable finite sample performance of the proposed inference methods, next to a comparable maximum likelihood method, as well as significant advantages in terms of computational speed and memory."}, "https://arxiv.org/abs/2311.03381": {"title": "Separating and Learning Latent Confounders to Enhancing User Preferences Modeling", "link": "https://arxiv.org/abs/2311.03381", "description": "arXiv:2311.03381v2 Announce Type: replace-cross \nAbstract: Recommender models aim to capture user preferences from historical feedback and then predict user-specific feedback on candidate items. However, the presence of various unmeasured confounders causes deviations between the user preferences in the historical feedback and the true preferences, resulting in models not meeting their expected performance. Existing debiasing models either (1) are specific to solving one particular bias or (2) directly obtain auxiliary information from user historical feedback, which cannot identify whether the learned preferences are true user preferences or mixed with unmeasured confounders. Moreover, we find that the former recommender system is not only a successor to unmeasured confounders but also acts as an unmeasured confounder affecting user preference modeling, which has always been neglected in previous studies. To this end, we incorporate the effect of the former recommender system and treat it as a proxy for all unmeasured confounders. We propose a novel framework, Separating and Learning Latent Confounders For Recommendation (SLFR), which obtains the representation of unmeasured confounders to identify the counterfactual feedback by disentangling user preferences and unmeasured confounders, then guides the target model to capture the true preferences of users. Extensive experiments on five real-world datasets validate the advantages of our method."}, "https://arxiv.org/abs/2404.02228": {"title": "Seemingly unrelated Bayesian additive regression trees for cost-effectiveness analyses in healthcare", "link": "https://arxiv.org/abs/2404.02228", "description": "arXiv:2404.02228v1 Announce Type: new \nAbstract: In recent years, theoretical results and simulation evidence have shown Bayesian additive regression trees to be a highly-effective method for nonparametric regression. Motivated by cost-effectiveness analyses in health economics, where interest lies in jointly modelling the costs of healthcare treatments and the associated health-related quality of life experienced by a patient, we propose a multivariate extension of BART applicable in regression and classification analyses with several correlated outcome variables. Our framework overcomes some key limitations of existing multivariate BART models by allowing each individual response to be associated with different ensembles of trees, while still handling dependencies between the outcomes. In the case of continuous outcomes, our model is essentially a nonparametric version of seemingly unrelated regression. 
Likewise, our proposal for binary outcomes is a nonparametric generalisation of the multivariate probit model. We give suggestions for easily interpretable prior distributions, which allow specification of both informative and uninformative priors. We provide detailed discussions of MCMC sampling methods to conduct posterior inference. Our methods are implemented in the R package `suBART'. We showcase their performance through extensive simulations and an application to an empirical case study from health economics. By also accommodating propensity scores in a manner befitting a causal analysis, we find substantial evidence for a novel trauma care intervention's cost-effectiveness."}, "https://arxiv.org/abs/2404.02283": {"title": "Integrating representative and non-representative survey data for efficient inference", "link": "https://arxiv.org/abs/2404.02283", "description": "arXiv:2404.02283v1 Announce Type: new \nAbstract: Non-representative surveys are commonly used and widely available but suffer from selection bias that generally cannot be entirely eliminated using weighting techniques. Instead, we propose a Bayesian method to synthesize longitudinal representative unbiased surveys with non-representative biased surveys by estimating the degree of selection bias over time. We show using a simulation study that synthesizing biased and unbiased surveys together out-performs using the unbiased surveys alone, even if the selection bias may evolve in a complex manner over time. Using COVID-19 vaccination data, we are able to synthesize two large sample biased surveys with an unbiased survey to reduce uncertainty in now-casting and inference estimates while simultaneously retaining the empirical credible interval coverage. Ultimately, we are able to conceptually obtain the properties of a large sample unbiased survey if the assumed unbiased survey, used to anchor the estimates, is unbiased for all time-points."}, "https://arxiv.org/abs/2404.02313": {"title": "Optimal combination of composite likelihoods using approximate Bayesian computation with application to state-space models", "link": "https://arxiv.org/abs/2404.02313", "description": "arXiv:2404.02313v1 Announce Type: new \nAbstract: Composite likelihood provides approximate inference when the full likelihood is intractable and sub-likelihood functions of marginal events can be evaluated relatively easily. It has been successfully applied for many complex models. However, its wider application is limited by two issues. First, weight selection of marginal likelihood can have a significant impact on the information efficiency and is currently an open question. Second, calibrated Bayesian inference with composite likelihood requires curvature adjustment which is difficult for dependent data. This work shows that approximate Bayesian computation (ABC) can properly address these two issues by using multiple composite score functions as summary statistics. First, the summary-based posterior distribution gives the optimal Godambe information among a wide class of estimators defined by linear combinations of estimating functions. Second, to make ABC computationally feasible for models where marginal likelihoods have no closed form, a novel approach is proposed to estimate all simulated marginal scores using a Monte Carlo sample with size N. Sufficient conditions are given for the additional noise to be negligible with N fixed as the data size n goes to infinity, and the computational cost is O(n). 
Third, asymptotic properties of ABC with summary statistics having heterogeneous convergence rates is derived, and an adaptive scheme to choose the component composite scores is proposed. Numerical studies show that the new method significantly outperforms the existing Bayesian composite likelihood methods, and the efficiency of adaptively combined composite scores well approximates the efficiency of particle MCMC using the full likelihood."}, "https://arxiv.org/abs/2404.02400": {"title": "On Improved Semi-parametric Bounds for Tail Probability and Expected Loss", "link": "https://arxiv.org/abs/2404.02400", "description": "arXiv:2404.02400v1 Announce Type: new \nAbstract: We revisit the fundamental issue of tail behavior of accumulated random realizations when individual realizations are independent, and we develop new sharper bounds on the tail probability and expected linear loss. The underlying distribution is semi-parametric in the sense that it remains unrestricted other than the assumed mean and variance. Our sharp bounds complement well-established results in the literature, including those based on aggregation, which often fail to take full account of independence and use less elegant proofs. New insights include a proof that in the non-identical case, the distributions attaining the bounds have the equal range property, and that the impact of each random variable on the expected value of the sum can be isolated using an extension of the Korkine identity. We show that the new bounds not only complement the extant results but also open up abundant practical applications, including improved pricing of product bundles, more precise option pricing, more efficient insurance design, and better inventory management."}, "https://arxiv.org/abs/2404.02453": {"title": "Exploring the Connection Between the Normalized Power Prior and Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2404.02453", "description": "arXiv:2404.02453v1 Announce Type: new \nAbstract: The power prior is a popular class of informative priors for incorporating information from historical data. It involves raising the likelihood for the historical data to a power, which acts as a discounting parameter. When the discounting parameter is modeled as random, the normalized power prior is recommended. Bayesian hierarchical modeling is a widely used method for synthesizing information from different sources, including historical data. In this work, we examine the analytical relationship between the normalized power prior (NPP) and Bayesian hierarchical models (BHM) for \\emph{i.i.d.} normal data. We establish a direct relationship between the prior for the discounting parameter of the NPP and the prior for the variance parameter of the BHM. Such a relationship is first established for the case of a single historical dataset, and then extended to the case with multiple historical datasets with dataset-specific discounting parameters. For multiple historical datasets, we develop and establish theory for the BHM-matching NPP (BNPP) which establishes dependence between the dataset-specific discounting parameters leading to inferences that are identical to the BHM. Establishing this relationship not only justifies the NPP from the perspective of hierarchical modeling, but also provides insight on prior elicitation for the NPP. 
We present strategies on inducing priors on the discounting parameter based on hierarchical models, and investigate the borrowing properties of the BNPP."}, "https://arxiv.org/abs/2404.02584": {"title": "Moran's I 2-Stage Lasso: for Models with Spatial Correlation and Endogenous Variables", "link": "https://arxiv.org/abs/2404.02584", "description": "arXiv:2404.02584v1 Announce Type: new \nAbstract: We propose a novel estimation procedure for models with endogenous variables in the presence of spatial correlation based on Eigenvector Spatial Filtering. The procedure, called Moran's $I$ 2-Stage Lasso (Mi-2SL), uses a two-stage Lasso estimator where the Standardised Moran's I is used to set the Lasso tuning parameter. Unlike existing spatial econometric methods, this has the key benefit of not requiring the researcher to explicitly model the spatial correlation process, which is of interest in cases where they are only interested in removing the resulting bias when estimating the direct effect of covariates. We show the conditions necessary for consistent and asymptotically normal parameter estimation assuming the support (relevant) set of eigenvectors is known. Our Monte Carlo simulation results also show that Mi-2SL performs well against common alternatives in the presence of spatial correlation. Our empirical application replicates Cadena and Kovak (2016) instrumental variables estimates using Mi-2SL and shows that in that case, Mi-2SL can boost the performance of the first stage."}, "https://arxiv.org/abs/2404.02594": {"title": "Comparison of the LASSO and Integrative LASSO with Penalty Factors (IPF-LASSO) methods for multi-omics data: Variable selection with Type I error control", "link": "https://arxiv.org/abs/2404.02594", "description": "arXiv:2404.02594v1 Announce Type: new \nAbstract: Variable selection in relation to regression modeling has constituted a methodological problem for more than 60 years. Especially in the context of high-dimensional regression, developing stable and reliable methods, algorithms, and computational tools for variable selection has become an important research topic. Omics data is one source of such high-dimensional data, characterized by diverse genomic layers, and an additional analytical challenge is how to integrate these layers into various types of analyses. While the IPF-LASSO model has previously explored the integration of multiple omics modalities for feature selection and prediction by introducing distinct penalty parameters for each modality, the challenge of incorporating heterogeneous data layers into variable selection with Type I error control remains an open problem. To address this problem, we applied stability selection as a method for variable selection with false positives control in both IPF-LASSO and regular LASSO. The objective of this study was to compare the LASSO algorithm with IPF-LASSO, investigating whether introducing different penalty parameters per omics modality could improve statistical power while controlling false positives. Two high-dimensional data structures were investigated, one with independent data and the other with correlated data. 
The different models were also illustrated using data from a study on breast cancer treatment, where the IPF-LASSO model was able to select some highly relevant clinical variables."}, "https://arxiv.org/abs/2404.02671": {"title": "Bayesian Bi-level Sparse Group Regressions for Macroeconomic Forecasting", "link": "https://arxiv.org/abs/2404.02671", "description": "arXiv:2404.02671v1 Announce Type: new \nAbstract: We propose a Machine Learning approach for optimal macroeconomic forecasting in a high-dimensional setting with covariates presenting a known group structure. Our model encompasses forecasting settings with many series, mixed frequencies, and unknown nonlinearities. We introduce in time-series econometrics the concept of bi-level sparsity, i.e. sparsity holds at both the group level and within groups, and we assume the true model satisfies this assumption. We propose a prior that induces bi-level sparsity, and the corresponding posterior distribution is demonstrated to contract at the minimax-optimal rate, recover the model parameters, and have a support that includes the support of the model asymptotically. Our theory allows for correlation between groups, while predictors in the same group can be characterized by strong covariation as well as common characteristics and patterns. Finite sample performance is illustrated through comprehensive Monte Carlo experiments and a real-data nowcasting exercise of the US GDP growth rate."}, "https://arxiv.org/abs/2404.02685": {"title": "Testing Independence Between High-Dimensional Random Vectors Using Rank-Based Max-Sum Tests", "link": "https://arxiv.org/abs/2404.02685", "description": "arXiv:2404.02685v1 Announce Type: new \nAbstract: In this paper, we address the problem of testing independence between two high-dimensional random vectors. Our approach involves a series of max-sum tests based on three well-known classes of rank-based correlations. These correlation classes encompass several popular rank measures, including Spearman's $\\rho$, Kendall's $\\tau$, Hoeffding's D, Blum-Kiefer-Rosenblatt's R and Bergsma-Dassios-Yanagimoto's $\\tau^*$. The key advantages of our proposed tests are threefold: (1) they do not rely on specific assumptions about the distribution of random vectors, a flexibility that makes them applicable across various scenarios; (2) they can proficiently manage non-linear dependencies between random vectors, a critical aspect in high-dimensional contexts; (3) they have robust performance, regardless of whether the alternative hypothesis is sparse or dense. Notably, our proposed tests demonstrate significant advantages in various scenarios, which is suggested by extensive numerical results and an empirical application in RNA microarray analysis."}, "https://arxiv.org/abs/2404.02764": {"title": "Estimation of Quantile Functionals in Linear Model", "link": "https://arxiv.org/abs/2404.02764", "description": "arXiv:2404.02764v1 Announce Type: new \nAbstract: Various indicators and measures of real-life procedures arise as functionals of the quantile process of a parent random variable Z. However, Z can be observed only through a response in a linear model whose covariates are not under our control and the probability distribution of error terms is generally unknown. The problem is that of nonparametric estimation or other inference for such functionals. 
We propose an estimation procedure based on the averaged two-step regression quantile, recently developed by the authors, combined with an R-estimator of slopes of the linear model."}, "https://arxiv.org/abs/2404.02184": {"title": "What is to be gained by ensemble models in analysis of spectroscopic data?", "link": "https://arxiv.org/abs/2404.02184", "description": "arXiv:2404.02184v1 Announce Type: cross \nAbstract: An empirical study was carried out to compare different implementations of ensemble models aimed at improving prediction in spectroscopic data. A wide range of candidate models were fitted to benchmark datasets from regression and classification settings. A statistical analysis using a linear mixed model was carried out on prediction performance criteria resulting from model fits over random splits of the data. The results showed that the ensemble classifiers were able to consistently outperform candidate models in our application."}, "https://arxiv.org/abs/2404.02519": {"title": "Differentially Private Verification of Survey-Weighted Estimates", "link": "https://arxiv.org/abs/2404.02519", "description": "arXiv:2404.02519v1 Announce Type: cross \nAbstract: Several official statistics agencies release synthetic data as public use microdata files. In practice, synthetic data do not admit accurate results for every analysis. Thus, it is beneficial for agencies to provide users with feedback on the quality of their analyses of the synthetic data. One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data. However, such measures leak information about the confidential records, so that agencies may wish to apply disclosure control methods to the released verification measures. We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design. We illustrate the verification measure using repeated sampling simulations where the confidential data are sampled with a probability proportional to size design, and the analyst estimates a population total or mean with the synthetic data. The simulations suggest that the verification measures can provide useful information about the quality of synthetic data inferences."}, "https://arxiv.org/abs/2404.02736": {"title": "On the Estimation of bivariate Conditional Transition Rates", "link": "https://arxiv.org/abs/2404.02736", "description": "arXiv:2404.02736v1 Announce Type: cross \nAbstract: Recent literature has found conditional transition rates to be a useful tool for avoiding Markov assumptions in multistate models. While the estimation of univariate conditional transition rates has been extensively studied, the intertemporal dependencies captured in the bivariate conditional transition rates still require a consistent estimator. We provide an estimator that is suitable for censored data and emphasize the connection to the rich theory of the estimation of bivariate survival functions. 
Bivariate conditional transition rates are necessary for various applications in the survival context but especially in the calculation of moments in life insurance mathematics."}, "https://arxiv.org/abs/1603.09326": {"title": "Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index", "link": "https://arxiv.org/abs/1603.09326", "description": "arXiv:1603.09326v4 Announce Type: replace \nAbstract: Estimating the long-term effects of treatments is of interest in many fields. A common challenge in estimating such treatment effects is that long-term outcomes are unobserved in the time frame needed to make policy decisions. One approach to overcome this missing data problem is to analyze treatment effects on an intermediate outcome, often called a statistical surrogate, if it satisfies the condition that treatment and outcome are independent conditional on the statistical surrogate. The validity of the surrogacy condition is often controversial. Here we exploit the fact that in modern datasets, researchers often observe a large number, possibly hundreds or thousands, of intermediate outcomes, thought to lie on or close to the causal chain between the treatment and the long-term outcome of interest. Even if none of the individual proxies satisfies the statistical surrogacy criterion by itself, using multiple proxies can be useful in causal inference. We focus primarily on a setting with two samples, an experimental sample containing data about the treatment indicator and the surrogates and an observational sample containing information about the surrogates and the primary outcome. We state assumptions under which the average treatment effect can be identified and estimated with a high-dimensional vector of proxies that collectively satisfy the surrogacy assumption, and derive the bias from violations of the surrogacy assumption, and show that even if the primary outcome is also observed in the experimental sample, there is still information to be gained from using surrogates."}, "https://arxiv.org/abs/2205.00901": {"title": "Beyond Neyman-Pearson: e-values enable hypothesis testing with a data-driven alpha", "link": "https://arxiv.org/abs/2205.00901", "description": "arXiv:2205.00901v3 Announce Type: replace \nAbstract: A standard practice in statistical hypothesis testing is to mention the p-value alongside the accept/reject decision. We show the advantages of mentioning an e-value instead. With p-values, it is not clear how to use an extreme observation (e.g. p $\\ll \\alpha$) for getting better frequentist decisions. With e-values it is straightforward, since they provide Type-I risk control in a generalized Neyman-Pearson setting with the decision task (a general loss function) determined post-hoc, after observation of the data -- thereby providing a handle on `roving $\\alpha$'s'. When Type-II risks are taken into consideration, the only admissible decision rules in the post-hoc setting turn out to be e-value-based. Similarly, if the loss incurred when specifying a faulty confidence interval is not fixed in advance, standard confidence intervals and distributions may fail whereas e-confidence sets and e-posteriors still provide valid risk guarantees. Sufficiently powerful e-values have by now been developed for a range of classical testing problems. 
We discuss the main challenges for wider development and deployment."}, "https://arxiv.org/abs/2205.09094": {"title": "High confidence inference on the probability an individual benefits from treatment using experimental or observational data with known propensity scores", "link": "https://arxiv.org/abs/2205.09094", "description": "arXiv:2205.09094v3 Announce Type: replace \nAbstract: We seek to understand the probability an individual benefits from treatment (PIBT), an inestimable quantity that must be bounded in practice. Given the innate uncertainty in the population-level bounds on PIBT, we seek to better understand the margin of error for their estimation in order to discern whether the estimated bounds on PIBT are tight or wide due to random chance or not. Toward this goal, we present guarantees to the estimation of bounds on marginal PIBT, with any threshold of interest, for a randomized experiment setting or an observational setting where propensity scores are known. We also derive results that permit us to understand heterogeneity in PIBT across learnable sub-groups delineated by pre-treatment features. These results can be used to help with formal statistical power analyses and frequentist confidence statements for settings where we are interested in assumption-lean inference on PIBT through the target bounds with minimal computational complexity compared to a bootstrap approach. Through a real data example from a large randomized experiment, we also demonstrate how our results for PIBT can allow us to understand the practical implication and goodness of fit of an estimate for the conditional average treatment effect (CATE), a function of an individual's baseline covariates."}, "https://arxiv.org/abs/2206.06991": {"title": "Concentration of discrepancy-based approximate Bayesian computation via Rademacher complexity", "link": "https://arxiv.org/abs/2206.06991", "description": "arXiv:2206.06991v4 Announce Type: replace \nAbstract: There has been an increasing interest on summary-free versions of approximate Bayesian computation (ABC), which replace distances among summaries with discrepancies between the empirical distributions of the observed data and the synthetic samples generated under the proposed parameter values. The success of these solutions has motivated theoretical studies on the limiting properties of the induced posteriors. However, current results (i) are often tailored to a specific discrepancy, (ii) require, either explicitly or implicitly, regularity conditions on the data generating process and the assumed statistical model, and (iii) yield bounds depending on sequences of control functions that are not made explicit. As such, there is the lack of a theoretical framework that (i) is unified, (ii) facilitates the derivation of limiting properties that hold uniformly, and (iii) relies on verifiable assumptions that provide concentration bounds clarifying which factors govern the limiting behavior of the ABC posterior. We address this gap via a novel theoretical framework that introduces the concept of Rademacher complexity in the analysis of the limiting properties for discrepancy-based ABC posteriors. This yields a unified theory that relies on constructive arguments and provides more informative asymptotic results and uniform concentration bounds, even in settings not covered by current studies. 
These advancements are obtained by relating the properties of summary-free ABC posteriors to the behavior of the Rademacher complexity associated with the chosen discrepancy within the family of integral probability semimetrics. This family extends summary-based ABC, and includes the Wasserstein distance and maximum mean discrepancy (MMD), among others. As clarified through a focus on the MMD case and via illustrative simulations, this perspective yields an improved understanding of summary-free ABC."}, "https://arxiv.org/abs/2211.04752": {"title": "Bayesian Neural Networks for Macroeconomic Analysis", "link": "https://arxiv.org/abs/2211.04752", "description": "arXiv:2211.04752v4 Announce Type: replace \nAbstract: Macroeconomic data is characterized by a limited number of observations (small T), many time series (big K) but also by featuring temporal dependence. Neural networks, by contrast, are designed for datasets with millions of observations and covariates. In this paper, we develop Bayesian neural networks (BNNs) that are well-suited for handling datasets commonly used for macroeconomic analysis in policy institutions. Our approach avoids extensive specification searches through a novel mixture specification for the activation function that appropriately selects the form of nonlinearities. Shrinkage priors are used to prune the network and force irrelevant neurons to zero. To cope with heteroskedasticity, the BNN is augmented with a stochastic volatility model for the error term. We illustrate how the model can be used in a policy institution by first showing that our different BNNs produce precise density forecasts, typically better than those from other machine learning methods. Finally, we showcase how our model can be used to recover nonlinearities in the reaction of macroeconomic aggregates to financial shocks."}, "https://arxiv.org/abs/2303.00281": {"title": "Posterior Robustness with Milder Conditions: Contamination Models Revisited", "link": "https://arxiv.org/abs/2303.00281", "description": "arXiv:2303.00281v2 Announce Type: replace \nAbstract: Robust Bayesian linear regression is a classical but essential statistical tool. Although novel robustness properties of posterior distributions have been proved recently under a certain class of error distributions, their sufficient conditions are restrictive and exclude several important situations. In this work, we revisit a classical two-component mixture model for response variables, also known as contamination model, where one component is a light-tailed regression model and the other component is heavy-tailed. The latter component is independent of the regression parameters, which is crucial in proving the posterior robustness. We obtain new sufficient conditions for posterior (non-)robustness and reveal non-trivial robustness results by using those conditions. In particular, we find that even the Student-$t$ error distribution can achieve the posterior robustness in our framework. 
A numerical study is performed to check the Kullback-Leibler divergence between the posterior distribution based on full data and that based on data obtained by removing outliers."}, "https://arxiv.org/abs/2303.13281": {"title": "Uncertain Short-Run Restrictions and Statistically Identified Structural Vector Autoregressions", "link": "https://arxiv.org/abs/2303.13281", "description": "arXiv:2303.13281v2 Announce Type: replace \nAbstract: This study proposes a combination of a statistical identification approach with potentially invalid short-run zero restrictions. The estimator shrinks towards imposed restrictions and stops shrinkage when the data provide evidence against a restriction. Simulation results demonstrate how incorporating valid restrictions through the shrinkage approach enhances the accuracy of the statistically identified estimator and how the impact of invalid restrictions decreases with the sample size. The estimator is applied to analyze the interaction between the stock and oil market. The results indicate that incorporating stock market data into the analysis is crucial, as it enables the identification of information shocks, which are shown to be important drivers of the oil price."}, "https://arxiv.org/abs/2307.01449": {"title": "A Double Machine Learning Approach to Combining Experimental and Observational Data", "link": "https://arxiv.org/abs/2307.01449", "description": "arXiv:2307.01449v2 Announce Type: replace \nAbstract: Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework tests for violations of external validity and ignorability under milder assumptions. When only one of these assumptions is violated, we provide semiparametrically efficient treatment effect estimators. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. Through comparative analyses, we show our framework's superiority over existing data fusion methods. The practical utility of our approach is further exemplified by three real-world case studies, underscoring its potential for widespread application in empirical research."}, "https://arxiv.org/abs/2309.12425": {"title": "Principal Stratification with Continuous Post-Treatment Variables: Nonparametric Identification and Semiparametric Estimation", "link": "https://arxiv.org/abs/2309.12425", "description": "arXiv:2309.12425v2 Announce Type: replace \nAbstract: Post-treatment variables often complicate causal inference. They appear in many scientific problems, including noncompliance, truncation by death, mediation, and surrogate endpoint evaluation. Principal stratification is a strategy to address these challenges by adjusting for the potential values of the post-treatment variables, defined as the principal strata. It allows for characterizing treatment effect heterogeneity across principal strata and unveiling the mechanism of the treatment's impact on the outcome related to post-treatment variables. However, the existing literature has primarily focused on binary post-treatment variables, leaving the case with continuous post-treatment variables largely unexplored. 
This gap persists due to the complexity of infinitely many principal strata, which present challenges to both the identification and estimation of causal effects. We fill this gap by providing nonparametric identification and semiparametric estimation theory for principal stratification with continuous post-treatment variables. We propose to use working models to approximate the underlying causal effect surfaces and derive the efficient influence functions of the corresponding model parameters. Based on the theory, we construct doubly robust estimators and implement them in an R package."}, "https://arxiv.org/abs/2310.18556": {"title": "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment", "link": "https://arxiv.org/abs/2310.18556", "description": "arXiv:2310.18556v2 Announce Type: replace \nAbstract: Design-based causal inference, also known as randomization-based or finite-population causal inference, is one of the most widely used causal inference frameworks, largely due to the merit that its statistical validity can be guaranteed by the study design (e.g., randomized experiments) and does not require assuming specific outcome-generating distributions or super-population models. Despite its advantages, design-based causal inference can still suffer from other data-related issues, among which outcome missingness is a prevalent and significant challenge. This work systematically studies the outcome missingness problem in design-based causal inference. First, we propose a general and flexible outcome missingness mechanism that can facilitate finite-population-exact randomization tests for the null effect. Second, under this flexible missingness mechanism, we propose a general framework called \"imputation and re-imputation\" for conducting finite-population-exact randomization tests in design-based causal inference with missing outcomes. This framework can incorporate any imputation algorithms (from linear models to advanced machine learning-based imputation algorithms) while ensuring finite-population-exact type-I error rate control. Third, we extend our framework to conduct covariate adjustment in randomization tests and construct finite-population-valid confidence sets with missing outcomes. Our framework is evaluated via extensive simulation studies and applied to a cluster randomized experiment called the Work, Family, and Health Study. Open-source Python and R packages are also developed for implementation of our framework."}, "https://arxiv.org/abs/2311.03247": {"title": "Multivariate selfsimilarity: Multiscale eigen-structures for selfsimilarity parameter estimation", "link": "https://arxiv.org/abs/2311.03247", "description": "arXiv:2311.03247v2 Announce Type: replace \nAbstract: Scale-free dynamics, formalized by selfsimilarity, provides a versatile paradigm massively and ubiquitously used to model temporal dynamics in real-world data. However, its practical use has mostly remained univariate so far. By contrast, modern applications often demand multivariate data analysis. Accordingly, models for multivariate selfsimilarity were recently proposed. Nevertheless, they have remained rarely used in practice because of a lack of available robust estimation procedures for the vector of selfsimilarity parameters. 
Building upon recent mathematical developments, the present work puts forth an efficient estimation procedure based on the theoretical study of the multiscale eigenstructure of the wavelet spectrum of multivariate selfsimilar processes. The estimation performance is studied theoretically in the asymptotic limits of large scale and sample sizes, and computationally for finite-size samples. As a practical outcome, a fully operational and documented multivariate signal processing estimation toolbox is made freely available and is ready for practical use on real-world data. Its potential benefits are illustrated in epileptic seizure prediction from multi-channel EEG data."}, "https://arxiv.org/abs/2401.11507": {"title": "Inconsistent multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses", "link": "https://arxiv.org/abs/2401.11507", "description": "arXiv:2401.11507v3 Announce Type: replace \nAbstract: During multiple testing, researchers often adjust their alpha level to control the familywise error rate for a statistical inference about a joint union alternative hypothesis (e.g., \"H1,1 or H1,2\"). However, in some cases, they do not make this inference. Instead, they make separate inferences about each of the individual hypotheses that comprise the joint hypothesis (e.g., H1,1 and H1,2). For example, a researcher might use a Bonferroni correction to adjust their alpha level from the conventional level of 0.050 to 0.025 when testing H1,1 and H1,2, find a significant result for H1,1 (p < 0.025) and not for H1,2 (p > 0.025), and so claim support for H1,1 and not for H1,2. However, these separate individual inferences do not require an alpha adjustment. Only a statistical inference about the union alternative hypothesis \"H1,1 or H1,2\" requires an alpha adjustment because it is based on \"at least one\" significant result among the two tests, and so it refers to the familywise error rate. Hence, an inconsistent correction occurs when a researcher corrects their alpha level during multiple testing but does not make an inference about a union alternative hypothesis. In the present article, I discuss this inconsistent correction problem, including its reduction in statistical power for tests of individual hypotheses and its potential causes vis-a-vis error rate confusions and the alpha adjustment ritual. I also provide three illustrations of inconsistent corrections from recent psychology studies. I conclude that inconsistent corrections represent a symptom of statisticism, and I call for a more nuanced inference-based approach to multiple testing corrections."}, "https://arxiv.org/abs/2401.15461": {"title": "Anytime-Valid Tests of Group Invariance through Conformal Prediction", "link": "https://arxiv.org/abs/2401.15461", "description": "arXiv:2401.15461v2 Announce Type: replace \nAbstract: Many standard statistical hypothesis tests, including those for normality and exchangeability, can be reformulated as tests of invariance under a group of transformations. We develop anytime-valid tests of invariance under the action of general compact groups and show their optimality -- in a specific logarithmic-growth sense -- against certain alternatives. This is achieved by using the invariant structure of the problem to construct conformal test martingales, a class of objects associated to conformal prediction. 
We apply our methods to extend recent anytime-valid tests of independence, which leverage exchangeability, to work under general group invariances. Additionally, we show applications to testing for invariance under subgroups of rotations, which corresponds to testing the Gaussian-error assumptions behind linear models."}, "https://arxiv.org/abs/2211.13289": {"title": "Shapley Curves: A Smoothing Perspective", "link": "https://arxiv.org/abs/2211.13289", "description": "arXiv:2211.13289v5 Announce Type: replace-cross \nAbstract: This paper fills the limited statistical understanding of Shapley values as a variable importance measure from a nonparametric (or smoothing) perspective. We introduce population-level \\textit{Shapley curves} to measure the true variable importance, determined by the conditional expectation function and the distribution of covariates. Having defined the estimand, we derive minimax convergence rates and asymptotic normality under general conditions for the two leading estimation strategies. For finite sample inference, we propose a novel version of the wild bootstrap procedure tailored for capturing lower-order terms in the estimation of Shapley curves. Numerical studies confirm our theoretical findings, and an empirical application analyzes the determining factors of vehicle prices."}, "https://arxiv.org/abs/2211.16462": {"title": "Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target", "link": "https://arxiv.org/abs/2211.16462", "description": "arXiv:2211.16462v2 Announce Type: replace-cross \nAbstract: As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \\in \\{0, \\ldots, H-1\\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-,y^+].$ Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call Probability-space Conformalized Quantile Regression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain guarantees for the probability that the cumulative reward of an autonomous system will fall below a threshold sampled from the marginal distribution of the response variable (i.e., a calibrated CDF estimate) that we employ to predict coverage probabilities for user-specified target intervals. 
Experiments on two domains confirm that these probabilities are well-calibrated."}, "https://arxiv.org/abs/2404.02982": {"title": "Spatio-temporal Modeling of Count Data", "link": "https://arxiv.org/abs/2404.02982", "description": "arXiv:2404.02982v1 Announce Type: new \nAbstract: We introduce parsimonious parameterisations for multivariate autoregressive count time series models for spatio-temporal data, including possible regressions on covariates. The number of parameters is reduced by specifying spatial neighbourhood structures for possibly huge matrices that take into account spatio-temporal dependencies. Consistency and asymptotic normality of the parameter estimators are obtained under mild assumptions by employing quasi-maximum likelihood methodology. This is used to obtain an asymptotic Wald test for testing the significance of individual or group effects. Several simulations and two data examples support and illustrate the methods proposed in this paper."}, "https://arxiv.org/abs/2404.03024": {"title": "General Effect Modelling (GEM) -- Part 1", "link": "https://arxiv.org/abs/2404.03024", "description": "arXiv:2404.03024v1 Announce Type: new \nAbstract: We present a flexible tool, called General Effect Modelling (GEM), for the analysis of any multivariate data influenced by one or more qualitative (categorical) or quantitative (continuous) input variables. The variables can be design factors or observed values, e.g., age, sex, or income, or they may represent subgroups of the samples found by data exploration. The first step in GEM separates the variation in the multivariate data into effect matrices associated with each of the influencing variables (and possibly interactions between these) by applying a general linear model. The residuals of the model are added to each of the effect matrices and the results are called Effect plus Residual (ER) values. The tables of ER values have the same dimensions as the original multivariate data. The second step of GEM is a multi- or univariate exploration of the ER tables to learn more about the multivariate data in relation to each input variable. The exploration is simplified as it addresses one input variable at a time, or if preferred, a combination of input variables. One example is a study to identify molecular fingerprints associated with a disease that is not influenced by age, where individuals at different ages with and without the disease are included in the experiment. This situation can be described as an experiment with two input variables: the targeted disease and the individual age. Through GEM, the effect of age can be removed, thus focusing further analysis on the targeted disease without the influence of the confounding effect of age. ER values can also be the combined effect of several input variables. This publication has three parts: the first part describes the GEM methodology, Part 2 is a consideration of multivariate data and the benefits of treating such data by multivariate analysis, with a focus on omics data, and Part 3 is a case study in Multiple Sclerosis (MS)."}, "https://arxiv.org/abs/2404.03029": {"title": "General Effect Modelling (GEM) -- Part 2", "link": "https://arxiv.org/abs/2404.03029", "description": "arXiv:2404.03029v1 Announce Type: new \nAbstract: General Effect Modelling (GEM) is an umbrella over different methods that utilise effects in the analyses of data with multiple design variables and multivariate responses. 
To demonstrate the methodology, we here use GEM in gene expression data where we use GEM to combine data from different cohorts and apply multivariate analysis of the effects of the targeted disease across the cohorts. Omics data are by nature multivariate, yet univariate analysis is the dominating approach used for such data. A major challenge in omics data is that the number of features such as genes, proteins and metabolites are often very large, whereas the number of samples is limited. Furthermore, omics research aims to obtain results that are generically valid across different backgrounds. The present publication applies GEM to address these aspects. First, we emphasise the benefit of multivariate analysis for multivariate data. Then we illustrate the use of GEM to combine data from two different cohorts for multivariate analysis across the cohorts, and we highlight that multivariate analysis can detect information that is lost by univariate validation."}, "https://arxiv.org/abs/2404.03034": {"title": "General Effect Modelling (GEM) -- Part 3", "link": "https://arxiv.org/abs/2404.03034", "description": "arXiv:2404.03034v1 Announce Type: new \nAbstract: The novel data analytical platform General Effect Modelling (GEM), is an umbrella platform covering different data analytical methods that handle data with multiple design variables (or pseudo design variables) and multivariate responses. GEM is here demonstrated in an analysis of proteome data from cerebrospinal fluid (CSF) from two independent previously published datasets, one data set comprised of persons with relapsing-remitting multiple sclerosis, persons with other neurological disorders and persons without neurological disorders, and one data set had persons with clinically isolated syndrome (CIS), which is the first clinical symptom of MS, and controls. The primary aim of the present publication is to use these data to demonstrate how patient stratification can be utilised by GEM for multivariate analysis. We also emphasize how the findings shed light on important aspects of the molecular mechanism of MS that may otherwise be lost. We identified proteins involved in neural development as significantly lower for MS/CIS than for their respective controls. This information was only seen after stratification of the persons into two groups, which were found to have different inflammatory patterns and the utilisation of this by GEM. Our conclusion from the study of these data is that disrupted neural development may be an early event in CIS and MS."}, "https://arxiv.org/abs/2404.03059": {"title": "Asymptotically-exact selective inference for quantile regression", "link": "https://arxiv.org/abs/2404.03059", "description": "arXiv:2404.03059v1 Announce Type: new \nAbstract: When analyzing large datasets, it is common to select a model prior to making inferences. For reliable inferences, it is important to make adjustments that account for the model selection process, resulting in selective inferences. Our paper introduces an asymptotic pivot to infer about the effects of selected variables on conditional quantile functions. Utilizing estimators from smoothed quantile regression, our proposed pivot is easy to compute and ensures asymptotically-exact selective inferences without making strict distributional assumptions about the response variable. 
At the core of the pivot is the use of external randomization, which enables us to utilize the full sample for both selection and inference without the need to partition the data into independent data subsets or discard data at either step. On simulated data, we find that: (i) the asymptotic confidence intervals based on our pivot achieve the desired coverage rates, even in cases where sample splitting fails due to insufficient sample size for inference; (ii) our intervals are consistently shorter than those produced by sample splitting across various models and signal settings. We report similar findings when we apply our approach to study risk factors for low birth weights in a publicly accessible dataset of US birth records from 2022."}, "https://arxiv.org/abs/2404.03127": {"title": "A Bayesian factor analysis model for high-dimensional microbiome count data", "link": "https://arxiv.org/abs/2404.03127", "description": "arXiv:2404.03127v1 Announce Type: new \nAbstract: Dimension reduction techniques are among the most essential analytical tools in the analysis of high-dimensional data. Generalized principal component analysis (PCA) is an extension to standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category and count data. For microbiome count data in particular, the multinomial PCA is a natural counterpart of the standard PCA. However, this technique fails to account for the excessive number of zero values, which is frequently observed in microbiome count data. To allow for sparsity, zero-inflated multivariate distributions can be used. We propose a zero-inflated probabilistic PCA model for latent factor analysis. The proposed model is a fully Bayesian factor analysis technique that is appropriate for microbiome count data analysis. In addition, we use the mean-field-type variational family to approximate the marginal likelihood and develop a classification variational approximation algorithm to fit the model. We demonstrate the efficiency of our procedure for predictions based on the latent factors and the model parameters through simulation experiments, showcasing its superiority over competing methods. This efficiency is further illustrated with two real microbiome count datasets. The method is implemented in R."}, "https://arxiv.org/abs/2404.03152": {"title": "Orthogonal calibration via posterior projections with applications to the Schwarzschild model", "link": "https://arxiv.org/abs/2404.03152", "description": "arXiv:2404.03152v1 Announce Type: new \nAbstract: The orbital superposition method originally developed by Schwarzschild (1979) is used to study the dynamics of growth of a black hole and its host galaxy, and has uncovered new relationships between the galaxy's global characteristics. Scientists are specifically interested in finding optimal parameter choices for this model that best match physical measurements along with quantifying the uncertainty of such procedures. This renders a statistical calibration problem with multivariate outcomes. In this article, we develop a Bayesian method for calibration with multivariate outcomes using orthogonal bias functions thus ensuring parameter identifiability. Our approach is based on projecting the posterior to an appropriate space which allows the user to choose any nonparametric prior on the bias function(s) instead of having to model it (them) with Gaussian processes. We develop a functional projection approach using the theory of Hilbert spaces. 
A finite-dimensional analogue of the projection problem is also considered. We illustrate the proposed approach using a BART prior and apply it to calibrate the Schwarzschild model, illustrating how a multivariate approach may resolve discrepancies resulting from a univariate calibration."}, "https://arxiv.org/abs/2404.03198": {"title": "Delaunay Weighted Two-sample Test for High-dimensional Data by Incorporating Geometric Information", "link": "https://arxiv.org/abs/2404.03198", "description": "arXiv:2404.03198v1 Announce Type: new \nAbstract: Two-sample hypothesis testing is a fundamental problem with various applications, which faces new challenges in the high-dimensional context. To mitigate the issue of the curse of dimensionality, high-dimensional data are typically assumed to lie on a low-dimensional manifold. To incorporate geometric information in the data, we propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points. In contrast to existing similarity measures that only utilize pairwise distances, the Delaunay weight can take both the distance and direction information into account. A detailed computation procedure to approximate the Delaunay weight for the unknown manifold is developed. We further propose a novel nonparametric test statistic using the Delaunay weight matrix to test whether the underlying distributions of two samples are the same or not. Applied to simulated data, the new test exhibits substantial power gain in detecting differences in principal directions between distributions. The proposed test also shows great power on a real dataset of human face images."}, "https://arxiv.org/abs/2404.03235": {"title": "Marginal Treatment Effects and Monotonicity", "link": "https://arxiv.org/abs/2404.03235", "description": "arXiv:2404.03235v1 Announce Type: new \nAbstract: How robust are analyses based on marginal treatment effects (MTE) to violations of Imbens and Angrist (1994) monotonicity? In this note, I present weaker forms of monotonicity under which popular MTE-based estimands still identify the parameters of interest."}, "https://arxiv.org/abs/2404.03250": {"title": "Multi-task learning via robust regularized clustering with non-convex group penalties", "link": "https://arxiv.org/abs/2404.03250", "description": "arXiv:2404.03250v1 Announce Type: new \nAbstract: Multi-task learning (MTL) aims to improve estimation and prediction performance by sharing common information among related tasks. One natural assumption in MTL is that tasks are classified into clusters based on their characteristics. However, existing MTL methods based on this assumption often ignore outlier tasks that have large task-specific components or no relation to other tasks. To address this issue, we propose a novel MTL method called Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC incorporates robust regularization terms inspired by robust convex clustering, which is further extended to handle non-convex and group-sparse penalties. The extension allows MTLRRC to simultaneously perform robust task clustering and outlier task detection. The connection between the extended robust clustering and the multivariate M-estimator is also established. This provides an interpretation of the robustness of MTLRRC against outlier tasks. An efficient algorithm based on a modified alternating direction method of multipliers is developed for the estimation of the parameters.
The effectiveness of MTLRRC is demonstrated through simulation studies and application to real data."}, "https://arxiv.org/abs/2404.03319": {"title": "Early warning systems for financial markets of emerging economies", "link": "https://arxiv.org/abs/2404.03319", "description": "arXiv:2404.03319v1 Announce Type: new \nAbstract: We develop and apply a new online early warning system (EWS) for what is known in machine learning as concept drift, in economics as a regime shift and in statistics as a change point. The system goes beyond linearity assumed in many conventional methods, and is robust to heavy tails and tail-dependence in the data, making it particularly suitable for emerging markets. The key component is an effective change-point detection mechanism for conditional entropy of the data, rather than for a particular indicator of interest. Combined with recent advances in machine learning methods for high-dimensional random forests, the mechanism is capable of finding significant shifts in information transfer between interdependent time series when traditional methods fail. We explore when this happens using simulations and we provide illustrations by applying the method to Uzbekistan's commodity and equity markets as well as to Russia's equity market in 2021-2023."}, "https://arxiv.org/abs/2404.03404": {"title": "Robust inference for linear regression models with possibly skewed error distribution", "link": "https://arxiv.org/abs/2404.03404", "description": "arXiv:2404.03404v1 Announce Type: new \nAbstract: Traditional methods for linear regression generally assume that the underlying error distribution, equivalently the distribution of the responses, is normal. Yet, sometimes real life response data may exhibit a skewed pattern, and assuming normality would not give reliable results in such cases. This is often observed in cases of some biomedical, behavioral, socio-economic and other variables. In this paper, we propose to use the class of skew normal (SN) distributions, which also includes the ordinary normal distribution as its special case, as the model for the errors in a linear regression setup and perform subsequent statistical inference using the popular and robust minimum density power divergence approach to get stable insights in the presence of possible data contamination (e.g., outliers). We provide the asymptotic distribution of the proposed estimator of the regression parameters and also propose robust Wald-type tests of significance for these parameters. We provide an influence function analysis of these estimators and test statistics, and also provide level and power influence functions. Numerical verification including simulation studies and real data analysis is provided to substantiate the theory developed."}, "https://arxiv.org/abs/2404.03420": {"title": "Modeling temporal dependency of longitudinal data: use of multivariate geometric skew-normal copula", "link": "https://arxiv.org/abs/2404.03420", "description": "arXiv:2404.03420v1 Announce Type: new \nAbstract: Use of copula for the purpose of modeling dependence has been receiving considerable attention in recent times. On the other hand, search for multivariate copulas with desirable dependence properties also is an important area of research. When fitting regression models to non-Gaussian longitudinal data, multivariate Gaussian copula is commonly used to account for temporal dependence of the repeated measurements. 
But using a symmetric multivariate Gaussian copula is not preferable in every situation, since it cannot capture non-exchangeable dependence or tail dependence, if present in the data. Hence to ensure reliable inference, it is important to look beyond the Gaussian dependence assumption. In this paper, we construct a geometric skew-normal copula from the multivariate geometric skew-normal (MGSN) distribution proposed by Kundu (2014) and Kundu (2017) in order to model temporal dependency of non-Gaussian longitudinal data. First we investigate the theoretical properties of the proposed multivariate copula, and then develop regression models for both continuous and discrete longitudinal data. The quantile function of this copula is independent of the correlation matrix of its respective multivariate distribution, which provides a computational advantage in terms of likelihood inference compared to the class of copulas derived from skew-elliptical distributions by Azzalini & Valle (1996). Moreover, composite likelihood inference is possible for this multivariate copula, which facilitates estimating parameters from an ordered probit model with the same dependence structure as the geometric skew-normal distribution. We conduct extensive simulation studies to validate our proposed models and then apply them to analyze the longitudinal dependence of two real world data sets. Finally, we report our findings in terms of improvements over multivariate Gaussian copula based regression models."}, "https://arxiv.org/abs/2404.03422": {"title": "Empirical Bayes for the Reluctant Frequentist", "link": "https://arxiv.org/abs/2404.03422", "description": "arXiv:2404.03422v1 Announce Type: new \nAbstract: Empirical Bayes methods offer valuable tools for a large class of compound decision problems. In this tutorial we describe some basic principles of the empirical Bayes paradigm stressing their frequentist interpretation. Emphasis is placed on recent developments of nonparametric maximum likelihood methods for estimating mixture models. A more extensive introductory treatment will eventually be available in \\citet{kg24}. The methods are illustrated with an extended application to models of heterogeneous income dynamics based on PSID data."}, "https://arxiv.org/abs/2404.03116": {"title": "ALAAMEE: Open-source software for fitting autologistic actor attribute models", "link": "https://arxiv.org/abs/2404.03116", "description": "arXiv:2404.03116v1 Announce Type: cross \nAbstract: The autologistic actor attribute model (ALAAM) is a model for social influence, derived from the more widely known exponential-family random graph model (ERGM). ALAAMs can be used to estimate parameters corresponding to multiple forms of social contagion associated with network structure and actor covariates. This work introduces ALAAMEE, open-source Python software for estimation, simulation, and goodness-of-fit testing for ALAAM models. ALAAMEE implements both the stochastic approximation and equilibrium expectation (EE) algorithms for ALAAM parameter estimation, including estimation from snowball sampled network data. It implements data structures and statistics for undirected, directed, and bipartite networks.
We use a simulation study to assess the accuracy of the EE algorithm for ALAAM parameter estimation and statistical inference, and demonstrate the use of ALAAMEE with empirical examples using both small (fewer than 100 nodes) and large (more than 10 000 nodes) networks."}, "https://arxiv.org/abs/2208.06039": {"title": "Semiparametric adaptive estimation under informative sampling", "link": "https://arxiv.org/abs/2208.06039", "description": "arXiv:2208.06039v3 Announce Type: replace \nAbstract: In survey sampling, survey data do not necessarily represent the target population, and the samples are often biased. However, information on the survey weights aids in the elimination of selection bias. The Horvitz-Thompson estimator is a well-known unbiased, consistent, and asymptotically normal estimator; however, it is not efficient. Thus, this study derives the semiparametric efficiency bound for various target parameters by considering the survey weight as a random variable and consequently proposes a semiparametric optimal estimator with certain working models on the survey weights. The proposed estimator is consistent, asymptotically normal, and efficient in a class of the regular and asymptotically linear estimators. Further, a limited simulation study is conducted to investigate the finite sample performance of the proposed method. The proposed method is applied to the 1999 Canadian Workplace and Employee Survey data."}, "https://arxiv.org/abs/2303.06501": {"title": "Learning from limited temporal data: Dynamically sparse historical functional linear models with applications to Earth science", "link": "https://arxiv.org/abs/2303.06501", "description": "arXiv:2303.06501v2 Announce Type: replace \nAbstract: Scientists and statisticians often want to learn about the complex relationships that connect two time-varying variables. Recent work on sparse functional historical linear models confirms that they are promising for this purpose, but several notable limitations exist. Most importantly, previous works have imposed sparsity on the historical coefficient function, but have not allowed the sparsity, hence lag, to vary with time. We simplify the framework of sparse functional historical linear models by using a rectangular coefficient structure along with Whittaker smoothing, then reduce the assumptions of the previous frameworks by estimating the dynamic time lag from a hierarchical coefficient structure. We motivate our study by aiming to extract the physical rainfall-runoff processes hidden within hydrological data. We show the promise and accuracy of our method using eight simulation studies, further justified by two real sets of hydrological data."}, "https://arxiv.org/abs/2306.05299": {"title": "Heterogeneous Autoregressions in Short T Panel Data Models", "link": "https://arxiv.org/abs/2306.05299", "description": "arXiv:2306.05299v2 Announce Type: replace \nAbstract: This paper considers a first-order autoregressive panel data model with individual-specific effects and heterogeneous autoregressive coefficients defined on the interval (-1,1], thus allowing for some of the individual processes to have unit roots. It proposes estimators for the moments of the cross-sectional distribution of the autoregressive (AR) coefficients, assuming a random coefficient model for the autoregressive coefficients without imposing any restrictions on the fixed effects. It is shown that the standard generalized method of moments estimators obtained under homogeneous slopes are biased.
Small sample properties of the proposed estimators are investigated by Monte Carlo experiments and compared with a number of alternatives, both under homogeneous and heterogeneous slopes. It is found that a simple moment estimator of the mean of heterogeneous AR coefficients performs very well even for moderate sample sizes, but to reliably estimate the variance of AR coefficients much larger samples are required. It is also required that the true value of this variance is not too close to zero. The utility of the heterogeneous approach is illustrated in the case of earnings dynamics."}, "https://arxiv.org/abs/2306.14302": {"title": "Improved LM Test for Robust Model Specification Searches in Covariance Structure Analysis", "link": "https://arxiv.org/abs/2306.14302", "description": "arXiv:2306.14302v3 Announce Type: replace \nAbstract: Model specification searches and modifications are commonly employed in covariance structure analysis (CSA) or structural equation modeling (SEM) to improve the goodness-of-fit. However, these practices can be susceptible to capitalizing on chance, as a model that fits one sample may not generalize to another sample from the same population. This paper introduces the improved Lagrange Multipliers (LM) test, which provides a reliable method for conducting a thorough model specification search and effectively identifying missing parameters. By leveraging the stepwise bootstrap method in the standard LM and Wald tests, our data-driven approach enhances the accuracy of parameter identification. The results from Monte Carlo simulations and two empirical applications in political science demonstrate the effectiveness of the improved LM test, particularly when dealing with small sample sizes and models with large degrees of freedom. This approach contributes to better statistical fit and addresses the issue of capitalization on chance in model specification."}, "https://arxiv.org/abs/2309.17295": {"title": "covXtreme : MATLAB software for non-stationary penalised piecewise constant marginal and conditional extreme value models", "link": "https://arxiv.org/abs/2309.17295", "description": "arXiv:2309.17295v2 Announce Type: replace \nAbstract: The covXtreme software provides functionality for estimation of marginal and conditional extreme value models, non-stationary with respect to covariates, and environmental design contours. Generalised Pareto (GP) marginal models of peaks over threshold are estimated, using a piecewise-constant representation for the variation of GP threshold and scale parameters on the (potentially multidimensional) covariate domain of interest. The conditional variation of one or more associated variates, given a large value of a single conditioning variate, is described using the conditional extremes model of Heffernan and Tawn (2004), the slope term of which is also assumed to vary in a piecewise constant manner with covariates. Optimal smoothness of marginal and conditional extreme value model parameters with respect to covariates is estimated using cross-validated roughness-penalised maximum likelihood estimation. Uncertainties in model parameter estimates due to marginal and conditional extreme value threshold choice, and sample size, are quantified using a bootstrap resampling scheme. Estimates of environmental contours using various schemes, including the direct sampling approach of Huseby et al. 2013, are calculated by simulation or numerical integration under fitted models. 
The software was developed in MATLAB for metocean applications, but is applicable generally to multivariate samples of peaks over threshold. The software and case study data can be downloaded from GitHub, with an accompanying user guide."}, "https://arxiv.org/abs/2310.01076": {"title": "A Pareto tail plot without moment restrictions", "link": "https://arxiv.org/abs/2310.01076", "description": "arXiv:2310.01076v2 Announce Type: replace \nAbstract: We propose a mean functional which exists for any probability distributions, and which characterizes the Pareto distribution within the set of distributions with finite left endpoint. This is in sharp contrast to the mean excess plot which is not meaningful for distributions without existing mean, and which has a nonstandard behaviour if the mean is finite, but the second moment does not exist. The construction of the plot is based on the so called principle of a single huge jump, which differentiates between distributions with moderately heavy and super heavy tails. We present an estimator of the tail function based on $U$-statistics and study its large sample properties. The use of the new plot is illustrated by several loss datasets."}, "https://arxiv.org/abs/2310.10915": {"title": "Identifiability of the Multinomial Processing Tree-IRT model for the Philadelphia Naming Test", "link": "https://arxiv.org/abs/2310.10915", "description": "arXiv:2310.10915v3 Announce Type: replace \nAbstract: Naming tests represent an essential tool in gauging the severity of aphasia and monitoring the trajectory of recovery for individuals afflicted with this debilitating condition. In these assessments, patients are presented with images corresponding to common nouns, and their responses are evaluated for accuracy. The Philadelphia Naming Test (PNT) stands as a paragon in this domain, offering nuanced insights into the type of errors made in responses. In a groundbreaking advancement, Walker et al. (2018) introduced a model rooted in Item Response Theory and multinomial processing trees (MPT-IRT). This innovative approach seeks to unravel the intricate mechanisms underlying the various errors patients make when responding to an item, aiming to pinpoint the specific stage of word production where a patient's capability falters. However, given the sophisticated nature of the IRT-MPT model proposed by Walker et al. (2018), it is imperative to scrutinize both its conceptual as well as its statistical validity. Our endeavor here is to closely examine the model's formulation to ensure its parameters are identifiable as a first step in evaluating its validity."}, "https://arxiv.org/abs/2401.14582": {"title": "High-dimensional forecasting with known knowns and known unknowns", "link": "https://arxiv.org/abs/2401.14582", "description": "arXiv:2401.14582v2 Announce Type: replace \nAbstract: Forecasts play a central role in decision making under uncertainty. After a brief review of the general issues, this paper considers ways of using high-dimensional data in forecasting. We consider selecting variables from a known active set, known knowns, using Lasso and OCMT, and approximating unobserved latent factors, known unknowns, by various means. This combines both sparse and dense approaches. We demonstrate the various issues involved in variable selection in a high-dimensional setting with an application to forecasting UK inflation at different horizons over the period 2020q1-2023q1. 
This application shows both the power of parsimonious models and the importance of allowing for global variables."}, "https://arxiv.org/abs/2310.12140": {"title": "Online Estimation with Rolling Validation: Adaptive Nonparametric Estimation with Streaming Data", "link": "https://arxiv.org/abs/2310.12140", "description": "arXiv:2310.12140v2 Announce Type: replace-cross \nAbstract: Online nonparametric estimators are gaining popularity due to their efficient computation and competitive generalization abilities. Important examples include variants of stochastic gradient descent. These algorithms often take one sample point at a time and instantly update the parameter estimate of interest. In this work we consider model selection and hyperparameter tuning for such online algorithms. We propose a weighted rolling-validation procedure, an online variant of leave-one-out cross-validation, that costs minimal extra computation for many typical stochastic gradient descent estimators. Similar to batch cross-validation, it can boost base estimators to achieve a better, adaptive convergence rate. Our theoretical analysis is straightforward, relying mainly on some general statistical stability assumptions. The simulation study underscores the significance of diverging weights in rolling validation in practice and demonstrates its sensitivity even when there is only a slim difference between candidate estimators."}, "https://arxiv.org/abs/2312.00770": {"title": "Random Forest for Dynamic Risk Prediction of Recurrent Events: A Pseudo-Observation Approach", "link": "https://arxiv.org/abs/2312.00770", "description": "arXiv:2312.00770v2 Announce Type: replace-cross \nAbstract: Recurrent events are common in clinical, healthcare, social and behavioral studies. A recent analysis framework for potentially censored recurrent event data is to construct a censored longitudinal data set consisting of times to the first recurrent event in multiple prespecified follow-up windows of length $\\tau$. With the staggering number of potential predictors being generated from genetic, -omic, and electronic health records sources, machine learning approaches such as the random forest are growing in popularity, as they can incorporate information from highly correlated predictors with non-standard relationships. In this paper, we bridge this gap by developing a random forest approach for dynamically predicting probabilities of remaining event-free during a subsequent $\\tau$-duration follow-up period from a reconstructed censored longitudinal data set. We demonstrate the increased ability of our random forest algorithm for predicting the probability of remaining event-free over a $\\tau$-duration follow-up period when compared to the recurrent event modeling framework of Xia et al. (2020) in settings where the association between predictors and recurrent event outcomes is complex in nature.
The proposed random forest algorithm is demonstrated using recurrent exacerbation data from the Azithromycin for the Prevention of Exacerbations of Chronic Obstructive Pulmonary Disease (Albert et al., 2011)."}, "https://arxiv.org/abs/2404.03737": {"title": "Forecasting with Neuro-Dynamic Programming", "link": "https://arxiv.org/abs/2404.03737", "description": "arXiv:2404.03737v1 Announce Type: new \nAbstract: Economic forecasting is concerned with the estimation of some variable like gross domestic product (GDP) in the next period given a set of variables that describes the current situation or state of the economy, including industrial production, retail trade turnover or economic confidence. Neuro-dynamic programming (NDP) provides tools to deal with forecasting and other sequential problems with such high-dimensional state spaces. Whereas conventional forecasting methods penalise the difference (or loss) between predicted and actual outcomes, NDP favours the difference between temporally successive predictions, following an interactive and trial-and-error approach. Past data provides guidance to train the models, but in a different way from ordinary least squares (OLS) and other supervised learning methods, signalling the adjustment costs between sequential states. We found that it is possible to train a GDP forecasting model with data concerned with other countries that performs better than models trained with past data from the tested country (Portugal). In addition, we found that non-linear architectures to approximate the value function of a sequential problem, namely neural networks, can perform better than a simple linear architecture, lowering the out-of-sample mean absolute forecast error (MAE) by 32% from an OLS model."}, "https://arxiv.org/abs/2404.03781": {"title": "Signal cancellation factor analysis", "link": "https://arxiv.org/abs/2404.03781", "description": "arXiv:2404.03781v1 Announce Type: new \nAbstract: Signal cancellation provides a radically new and efficient approach to exploratory factor analysis, without matrix decomposition or presetting the required number of factors. Its current implementation requires that each factor has at least two unique indicators. Its principle is that it is always possible to combine two indicator variables exclusive to the same factor with weights that cancel their common factor information. Successful combinations, consisting of noise only, are recognized by their null correlations with all remaining variables. The optimal combinations of multifactorial indicators, though, typically retain correlations with some other variables. Their signal, however, can be cancelled through combinations with unifactorial indicators of their contributing factors. The loadings are estimated from the relative signal cancellation weights of the variables involved along with their observed correlations. The factor correlations are obtained from those of their unifactorial indicators, corrected by their factor loadings. The method is illustrated with synthetic data from a complex six-factor structure that even includes two doublet factors.
Another example using actual data documents that signal cancellation can rival confirmatory factor analysis."}, "https://arxiv.org/abs/2404.03805": {"title": "Blessing of dimension in Bayesian inference on covariance matrices", "link": "https://arxiv.org/abs/2404.03805", "description": "arXiv:2404.03805v1 Announce Type: new \nAbstract: Bayesian factor analysis is routinely used for dimensionality reduction in modeling of high-dimensional covariance matrices. Factor analytic decompositions express the covariance as a sum of a low rank and diagonal matrix. In practice, Gibbs sampling algorithms are typically used for posterior computation, alternating between updating the latent factors, loadings, and residual variances. In this article, we exploit a blessing of dimensionality to develop a provably accurate pseudo-posterior for the covariance matrix that bypasses the need for Gibbs or other variants of Markov chain Monte Carlo sampling. Our proposed Factor Analysis with BLEssing of dimensionality (FABLE) approach relies on a first-stage singular value decomposition (SVD) to estimate the latent factors, and then defines a jointly conjugate prior for the loadings and residual variances. The accuracy of the resulting pseudo-posterior for the covariance improves with increasing dimensionality. We show that FABLE has excellent performance in high-dimensional covariance matrix estimation, including producing well calibrated credible intervals, both theoretically and through simulation experiments. We also demonstrate the strength of our approach in terms of accurate inference and computational efficiency by applying it to a gene expression data set."}, "https://arxiv.org/abs/2404.03835": {"title": "Quantile-respectful density estimation based on the Harrell-Davis quantile estimator", "link": "https://arxiv.org/abs/2404.03835", "description": "arXiv:2404.03835v1 Announce Type: new \nAbstract: Traditional density and quantile estimators are often inconsistent with each other. Their simultaneous usage may lead to inconsistent results. To address this issue, we propose a novel smooth density estimator that is naturally consistent with the Harrell-Davis quantile estimator. We also provide a jittering implementation to support discrete-continuous mixture distributions."}, "https://arxiv.org/abs/2404.03837": {"title": "Inference for non-stationary time series quantile regression with inequality constraints", "link": "https://arxiv.org/abs/2404.03837", "description": "arXiv:2404.03837v1 Announce Type: new \nAbstract: We consider parameter inference for linear quantile regression with non-stationary predictors and errors, where the regression parameters are subject to inequality constraints. We show that the constrained quantile coefficient estimators are asymptotically equivalent to the metric projections of the unconstrained estimator onto the constrained parameter space. Utilizing a geometry-invariant property of this projection operation, we propose inference procedures - the Wald, likelihood ratio, and rank-based methods - that are consistent regardless of whether the true parameters lie on the boundary of the constrained parameter space. 
We also illustrate the advantages of considering the inequality constraints in analyses through simulations and an application to an electricity demand dataset."}, "https://arxiv.org/abs/2404.03878": {"title": "Wasserstein F-tests for Fr\\'echet regression on Bures-Wasserstein manifolds", "link": "https://arxiv.org/abs/2404.03878", "description": "arXiv:2404.03878v1 Announce Type: new \nAbstract: This paper considers the problem of regression analysis with random covariance matrix as outcome and Euclidean covariates in the framework of Fr\\'echet regression on the Bures-Wasserstein manifold. Such regression problems have many applications in single cell genomics and neuroscience, where we have covariance matrix measured over a large set of samples. Fr\\'echet regression on the Bures-Wasserstein manifold is formulated as estimating the conditional Fr\\'echet mean given covariates $x$. A non-asymptotic $\\sqrt{n}$-rate of convergence (up to $\\log n$ factors) is obtained for our estimator $\\hat{Q}_n(x)$ uniformly for $\\left\\|x\\right\\| \\lesssim \\sqrt{\\log n}$, which is crucial for deriving the asymptotic null distribution and power of our proposed statistical test for the null hypothesis of no association. In addition, a central limit theorem for the point estimate $\\hat{Q}_n(x)$ is obtained, giving insights into a test for covariate effects. The null distribution of the test statistic is shown to converge to a weighted sum of independent chi-squares, which implies that the proposed test has the desired significance level asymptotically. Also, the power performance of the test is demonstrated against a sequence of contiguous alternatives. Simulation results show the accuracy of the asymptotic distributions. The proposed methods are applied to a single cell gene expression data set that shows the change of gene co-expression network as people age."}, "https://arxiv.org/abs/2404.03957": {"title": "Bayesian Graphs of Intelligent Causation", "link": "https://arxiv.org/abs/2404.03957", "description": "arXiv:2404.03957v1 Announce Type: new \nAbstract: Probabilistic Graphical Bayesian models of causation have continued to impact on strategic analyses designed to help evaluate the efficacy of different interventions on systems. However, the standard causal algebras upon which these inferences are based typically assume that the intervened population does not react intelligently to frustrate an intervention. In an adversarial setting this is rarely an appropriate assumption. In this paper, we extend an established Bayesian methodology called Adversarial Risk Analysis to apply it to settings that can legitimately be designated as causal in this graphical sense. To embed this technology we first need to generalize the concept of a causal graph. We then proceed to demonstrate how the predictable intelligent reactions of adversaries to circumvent an intervention when they hear about it can be systematically modelled within such graphical frameworks, importing these recent developments from Bayesian game theory.
The new methodologies and supporting protocols are illustrated through applications associated with an adversary attempting to infiltrate a friendly state."}, "https://arxiv.org/abs/2404.04074": {"title": "DGP-LVM: Derivative Gaussian process latent variable model", "link": "https://arxiv.org/abs/2404.04074", "description": "arXiv:2404.04074v1 Announce Type: new \nAbstract: We develop a framework for derivative Gaussian process latent variable models (DGP-LVM) that can handle multi-dimensional output data using modified derivative covariance functions. The modifications account for complexities in the underlying data generating process such as scaled derivatives, varying information across multiple output dimensions as well as interactions between outputs. Further, our framework provides uncertainty estimates for each latent variable sample using Bayesian inference. Through extensive simulations, we demonstrate that latent variable estimation accuracy can be drastically increased by including derivative information due to our proposed covariance function modifications. The developments are motivated by a concrete biological research problem involving the estimation of the unobserved cellular ordering from single-cell RNA (scRNA) sequencing data for gene expression and its corresponding derivative information known as RNA velocity. Since the RNA velocity is only an estimate of the exact derivative information, the derivative covariance functions need to account for potential scale differences. In a real-world case study, we illustrate the application of DGP-LVMs to such scRNA sequencing data. While motivated by this biological problem, our framework is generally applicable to all kinds of latent variable estimation problems involving derivative information irrespective of the field of study."}, "https://arxiv.org/abs/2404.04122": {"title": "Hidden Markov Models for Multivariate Panel Data", "link": "https://arxiv.org/abs/2404.04122", "description": "arXiv:2404.04122v1 Announce Type: new \nAbstract: While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms due to the unique correlation structure, a consequence of taking observations on several subjects over multiple time points. Additionally, panel data are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the unique correlation structures that arise in panel data. A modified expectation-maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation."}, "https://arxiv.org/abs/2404.04213": {"title": "Modelling handball outcomes using univariate and bivariate approaches", "link": "https://arxiv.org/abs/2404.04213", "description": "arXiv:2404.04213v1 Announce Type: new \nAbstract: Handball has received growing interest in recent years, including academic research on many different aspects of the sport. On the other hand, modelling the outcome of the game has attracted less interest, mainly because of the additional challenges that occur. Data analysis has revealed that the number of goals scored by each team is under-dispersed relative to a Poisson distribution and hence new models are needed for this purpose. Here we propose to circumvent the problem by modelling the score difference.
This removes the need for special models since typical models for integer data like the Skellam distribution can provide sufficient fit and thus reveal some of the characteristics of the game. In the present paper we propose some models starting from a Skellam regression model and also considering zero-inflated versions as well as other discrete distributions in $\\mathbb Z$. Furthermore, we develop some bivariate models using copulas to model the two halves of the game, thus providing insights into the game. Data from the German Bundesliga are used to show the potential of the new models."}, "https://arxiv.org/abs/2404.03764": {"title": "CONCERT: Covariate-Elaborated Robust Local Information Transfer with Conditional Spike-and-Slab Prior", "link": "https://arxiv.org/abs/2404.03764", "description": "arXiv:2404.03764v1 Announce Type: cross \nAbstract: The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only local information is shared. In this paper, we propose a novel Bayesian transfer learning method named \"CONCERT\" to allow robust local information transfer for high-dimensional data analysis. A novel conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize the local similarities and make the sources work collaboratively to help improve the performance on the target. Distinguished from existing work, CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. Variable selection consistency is established for our CONCERT. To make our algorithm scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and a genetic data analysis demonstrate the validity and the advantage of CONCERT over existing cutting-edge transfer learning methods. We also extend our CONCERT to logistic models, with numerical studies showing its superiority over other methods."}, "https://arxiv.org/abs/2404.03804": {"title": "TransformerLSR: Attentive Joint Model of Longitudinal Data, Survival, and Recurrent Events with Concurrent Latent Structure", "link": "https://arxiv.org/abs/2404.03804", "description": "arXiv:2404.03804v1 Announce Type: cross \nAbstract: In applications such as biomedical studies, epidemiology, and social sciences, recurrent events often co-occur with longitudinal measurements and a terminal event, such as death. Therefore, jointly modeling longitudinal measurements, recurrent events, and survival data while accounting for their dependencies is critical. While joint models for the three components exist in the statistical literature, many of these approaches are limited by heavy parametric assumptions and scalability issues. Recently, incorporating deep learning techniques into joint modeling has shown promising results. However, current methods only address joint modeling of longitudinal measurements at regularly-spaced observation times and survival events, neglecting recurrent events. In this paper, we develop TransformerLSR, a flexible transformer-based deep modeling and inference framework to jointly model all three components simultaneously.
TransformerLSR integrates deep temporal point processes into the joint modeling framework, treating recurrent and terminal events as two competing processes dependent on past longitudinal measurements and recurrent event times. Additionally, TransformerLSR introduces a novel trajectory representation and model architecture to potentially incorporate a priori knowledge of known latent structures among concurrent longitudinal variables. We demonstrate the effectiveness and necessity of TransformerLSR through simulation studies and analyzing a real-world medical dataset on patients after kidney transplantation."}, "https://arxiv.org/abs/1907.01136": {"title": "Finding Outliers in Gaussian Model-Based Clustering", "link": "https://arxiv.org/abs/1907.01136", "description": "arXiv:1907.01136v5 Announce Type: replace \nAbstract: Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and \\textit{post hoc} outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers."}, "https://arxiv.org/abs/2306.05568": {"title": "Maximally Machine-Learnable Portfolios", "link": "https://arxiv.org/abs/2306.05568", "description": "arXiv:2306.05568v2 Announce Type: replace \nAbstract: When it comes to stock returns, any form of predictability can bolster risk-adjusted profitability. We develop a collaborative machine learning algorithm that optimizes portfolio weights so that the resulting synthetic security is maximally predictable. Precisely, we introduce MACE, a multivariate extension of Alternating Conditional Expectations that achieves the aforementioned goal by wielding a Random Forest on one side of the equation, and a constrained Ridge Regression on the other. There are two key improvements with respect to Lo and MacKinlay's original maximally predictable portfolio approach. First, it accommodates for any (nonlinear) forecasting algorithm and predictor set. Second, it handles large portfolios. We conduct exercises at the daily and monthly frequency and report significant increases in predictability and profitability using very little conditioning information. Interestingly, predictability is found in bad as well as good times, and MACE successfully navigates the debacle of 2022."}, "https://arxiv.org/abs/2311.12392": {"title": "Individualized Dynamic Model for Multi-resolutional Data with Application to Mobile Health", "link": "https://arxiv.org/abs/2311.12392", "description": "arXiv:2311.12392v3 Announce Type: replace \nAbstract: Mobile health has emerged as a major success for tracking individual health status, due to the popularity and power of smartphones and wearable devices. This has also brought great challenges in handling heterogeneous, multi-resolution data which arise ubiquitously in mobile health due to irregular multivariate measurements collected from individuals. 
In this paper, we propose an individualized dynamic latent factor model for irregular multi-resolution time series data to interpolate unsampled measurements of time series with low resolution. One major advantage of the proposed method is the capability to integrate multiple irregular time series and multiple subjects by mapping the multi-resolution data to the latent space. In addition, the proposed individualized dynamic latent factor model is applicable to capturing heterogeneous longitudinal information through individualized dynamic latent factors. Our theory provides a bound on the integrated interpolation error and the convergence rate for B-spline approximation methods. Both the simulation studies and the application to smartwatch data demonstrate the superior performance of the proposed method compared to existing methods."}, "https://arxiv.org/abs/2404.04301": {"title": "Robust Nonparametric Stochastic Frontier Analysis", "link": "https://arxiv.org/abs/2404.04301", "description": "arXiv:2404.04301v1 Announce Type: new \nAbstract: Benchmarking tools, including stochastic frontier analysis (SFA), data envelopment analysis (DEA), and its stochastic extension (StoNED) are core tools in economics used to estimate an efficiency envelope and production inefficiencies from data. The problem appears in a wide range of fields -- for example, in global health the frontier can quantify efficiency of interventions and funding of health initiatives. Despite their wide use, classic benchmarking approaches have key limitations that preclude even wider applicability. Here we propose a robust non-parametric stochastic frontier meta-analysis (SFMA) approach that fills these gaps. First, we use flexible basis splines and shape constraints to model the frontier function, so specifying a functional form of the frontier as in classic SFA is no longer necessary. Second, the user can specify relative errors on input datapoints, enabling population-level analyses. Third, we develop a likelihood-based trimming strategy to robustify the approach to outliers, which otherwise break available benchmarking methods. We provide a custom optimization algorithm for fast and reliable performance. We implement the approach and algorithm in an open source Python package `sfma'. Synthetic and real examples show the new capabilities of the method, and are used to compare SFMA to state of the art benchmarking packages that implement DEA, SFA, and StoNED."}, "https://arxiv.org/abs/2404.04343": {"title": "Multi-way contingency tables with uniform margins", "link": "https://arxiv.org/abs/2404.04343", "description": "arXiv:2404.04343v1 Announce Type: new \nAbstract: We study the problem of transforming a multi-way contingency table into an equivalent table with uniform margins and same dependence structure. Such a problem relates to recent developments in copula modeling for discrete random vectors. Here, we focus on three-way binary tables and show that, even in such a simple case, the situation is quite different than for two-way tables. Many more constraints are needed to ensure a unique solution to the problem. Therefore, the uniqueness of the transformed table is subject to arbitrary choices of the practitioner. 
We illustrate the theory through some examples, and conclude with a discussion on the topic and future research directions."}, "https://arxiv.org/abs/2404.04398": {"title": "Bayesian Methods for Modeling Cumulative Exposure to Extensive Environmental Health Hazards", "link": "https://arxiv.org/abs/2404.04398", "description": "arXiv:2404.04398v1 Announce Type: new \nAbstract: Measuring the impact of an environmental point source exposure on the risk of disease, like cancer or childhood asthma, is well-developed. Modeling the impact of an environmental health hazard that is extensive in space, like a wastewater canal, is not. We propose a novel Bayesian generative semiparametric model for characterizing the cumulative spatial exposure to an environmental health hazard that is not well-represented by a single point in space. The model couples a dose-response model with a log-Gaussian Cox process integrated against a distance kernel with an unknown length-scale. We show that this model is a well-defined Bayesian inverse model, namely that the posterior exists under a Gaussian process prior for the log-intensity of exposure, and that a simple integral approximation adequately controls the computational error. We quantify the finite-sample properties and the computational tractability of the discretization scheme in a simulation study. Finally, we apply the model to survey data on household risk of childhood diarrheal illness from exposure to a system of wastewater canals in Mezquital Valley, Mexico."}, "https://arxiv.org/abs/2404.04403": {"title": "Low-Rank Robust Subspace Tensor Clustering for Metro Passenger Flow Modeling", "link": "https://arxiv.org/abs/2404.04403", "description": "arXiv:2404.04403v1 Announce Type: new \nAbstract: Tensor clustering has become an important topic, specifically in spatio-temporal modeling, due to its ability to cluster spatial modes (e.g., stations or road segments) and temporal modes (e.g., time of the day or day of the week). Our motivating example is from subway passenger flow modeling, where similarities between stations are commonly found. However, the challenges lie in the innate high-dimensionality of tensors and also the potential existence of anomalies. This is because the three tasks, i.e., dimension reduction, clustering, and anomaly decomposition, are inter-correlated, and treating them separately will yield suboptimal performance. Thus, in this work, we design a tensor-based subspace clustering and anomaly decomposition technique for simultaneous outlier-robust dimension reduction and clustering for high-dimensional tensors. To achieve this, a novel low-rank robust subspace clustering decomposition model is proposed by combining Tucker decomposition, sparse anomaly decomposition, and subspace clustering. An effective algorithm based on Block Coordinate Descent is proposed to update the parameters. Experiments demonstrate the effectiveness of the proposed framework in a simulation study, with a gain of 25% in clustering accuracy over benchmark methods in a hard case. The interrelations of the three tasks are also analyzed via ablation studies, validating the interrelation assumption.
Moreover, a case study on station clustering based on real passenger flow data is conducted, yielding valuable insights."}, "https://arxiv.org/abs/2404.04406": {"title": "Optimality-based reward learning with applications to toxicology", "link": "https://arxiv.org/abs/2404.04406", "description": "arXiv:2404.04406v1 Announce Type: new \nAbstract: In toxicology research, experiments are often conducted to determine the effect of toxicant exposure on the behavior of mice, where mice are randomized to receive the toxicant or not. In particular, in fixed interval experiments, one provides a mouse with reinforcers (e.g., a food pellet), contingent upon some action taken by the mouse (e.g., a press of a lever), but the reinforcers are only provided after fixed time intervals. Often, to analyze fixed interval experiments, one specifies and then estimates the conditional state-action distribution (e.g., using an ANOVA). This existing approach, which in the reinforcement learning framework would be called modeling the mouse's \"behavioral policy,\" is sensitive to misspecification. It is likely that any model for the behavioral policy is misspecified; a mapping from a mouse's exposure to its actions can be highly complex. In this work, we avoid specifying the behavioral policy by instead learning the mouse's reward function. Specifying a reward function is as challenging as specifying a behavioral policy, but we propose a novel approach that incorporates knowledge of the optimal behavior, which is often known to the experimenter, to avoid specifying the reward function itself. In particular, we define the reward as a divergence of the mouse's actions from optimality, where the representations of the action and optimality can be arbitrarily complex. The parameters of the reward function then serve as a measure of the mouse's tolerance for divergence from optimality, which is a novel summary of the impact of the exposure. The parameter itself is scalar, and the proposed objective function is differentiable, allowing us to benefit from typical results on consistency of parametric estimators while making very few assumptions."}, "https://arxiv.org/abs/2404.04415": {"title": "Sample size planning for estimating the global win probability with assurance and precision", "link": "https://arxiv.org/abs/2404.04415", "description": "arXiv:2404.04415v1 Announce Type: new \nAbstract: Most clinical trials conducted in drug development contain multiple endpoints in order to collectively assess the intended effects of the drug on various disease characteristics. Focusing on the estimation of the global win probability, defined as the average win probability (WinP) across endpoints that a treated participant would have a better outcome than a control participant, we propose a closed-form sample size formula incorporating pre-specified precision and assurance, with precision denoted by the lower limit of the confidence interval and assurance denoted by the probability of achieving that lower limit. We make use of the equivalence of the WinP and the area under the receiver operating characteristic curve (AUC) and adapt a formula originally developed for the difference between two AUCs to handle the global WinP. Unequal variance is allowed. Simulation results suggest that the method performs very well.
We illustrate the proposed formula using a Parkinson's disease clinical trial design example."}, "https://arxiv.org/abs/2404.04446": {"title": "Bounding Causal Effects with Leaky Instruments", "link": "https://arxiv.org/abs/2404.04446", "description": "arXiv:2404.04446v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\\textit{partial}$ identification in linear models given a set of $\\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art."}, "https://arxiv.org/abs/2404.04471": {"title": "Estimation and Inference in Ultrahigh Dimensional Partially Linear Single-Index Models", "link": "https://arxiv.org/abs/2404.04471", "description": "arXiv:2404.04471v1 Announce Type: new \nAbstract: This paper is concerned with estimation and inference for ultrahigh dimensional partially linear single-index models. The presence of high dimensional nuisance parameter and nuisance unknown function makes the estimation and inference problem very challenging. In this paper, we first propose a profile partial penalized least squares estimator and establish the sparsity, consistency and asymptotic representation of the proposed estimator in ultrahigh dimensional setting. We then propose an $F$-type test statistic for parameters of primary interest and show that the limiting null distribution of the test statistic is $\\chi^2$ distribution, and the test statistic can detect local alternatives, which converge to the null hypothesis at the root-$n$ rate. We further propose a new test for the specification testing problem of the nonparametric function. The test statistic is shown to be asymptotically normal. Simulation studies are conducted to examine the finite sample performance of the proposed estimators and tests. A real data example is used to illustrate the proposed procedures."}, "https://arxiv.org/abs/2404.04494": {"title": "Fast and simple inner-loop algorithms of static / dynamic BLP estimations", "link": "https://arxiv.org/abs/2404.04494", "description": "arXiv:2404.04494v1 Announce Type: new \nAbstract: This study investigates computationally efficient inner-loop algorithms for estimating static / dynamic BLP models. It provides the following ideas to reduce the number of inner-loop iterations: (1). Add a term concerning the outside option share in the BLP contraction mapping; (2). Analytically represent mean product utilities as a function of value functions and solve for the value functions (for dynamic BLP); (3-1). Combine the spectral / SQUAREM algorithms; (3-2). Choice of the step sizes. These methods are independent and easy to implement. 
This study shows good performance of these ideas by numerical experiments."}, "https://arxiv.org/abs/2404.04590": {"title": "Absolute Technical Efficiency Indices", "link": "https://arxiv.org/abs/2404.04590", "description": "arXiv:2404.04590v1 Announce Type: new \nAbstract: Technical efficiency indices (TEIs) can be estimated using the traditional stochastic frontier analysis approach, which yields relative indices that do not allow self-interpretations. In this paper, we introduce a single-step estimation procedure for TEIs that eliminates the need to identify best practices and avoids imposing restrictive hypotheses on the error term. The resulting indices are absolute and allow for individual interpretation. In our model, we estimate a distance function using the inverse coefficient of resource utilization, rather than treating it as unobservable. We employ a Tobit model with a translog distance function as our econometric framework. Applying this model to a sample of 19 airline companies from 2012 to 2021, we find that: (1) Absolute technical efficiency varies considerably between companies with medium-haul European airlines being technically the most efficient, while Asian airlines are the least efficient; (2) Our estimated TEIs are consistent with the observed data with a decline in efficiency especially during the Covid-19 crisis and Brexit period; (3) All airlines contained in our sample would be able to increase their average technical efficiency by 0.209% if they reduced their average kerosene consumption by 1%; (4) Total factor productivity (TFP) growth slowed between 2013 and 2019 due to a decrease in Disembodied Technical Change (DTC) and a small effect from Scale Economies (SE). Toward the end of our study period, TFP growth seemed increasingly driven by the SE effect, with a sharp decline in 2020 followed by an equally sharp recovery in 2021 for most airlines."}, "https://arxiv.org/abs/2404.04696": {"title": "Dynamic Treatment Regimes with Replicated Observations Available for Error-prone Covariates: a Q-learning Approach", "link": "https://arxiv.org/abs/2404.04696", "description": "arXiv:2404.04696v1 Announce Type: new \nAbstract: Dynamic treatment regimes (DTRs) have received an increasing interest in recent years. DTRs are sequences of treatment decision rules tailored to patient-level information. The main goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that yields the best expected clinical outcome. Q-learning has been considered as one of the most popular regression-based methods to estimate the optimal DTR. However, it is rarely studied in an error-prone setting, where the patient information is contaminated with measurement error. In this paper, we study the effect of covariate measurement error on Q-learning and propose a correction method to correct the measurement error in Q-learning. Simulation studies are conducted to assess the performance of the proposed method in Q-learning. We illustrate the use of the proposed method in an application to the sequenced treatment alternatives to relieve depression data."}, "https://arxiv.org/abs/2404.04697": {"title": "Q-learning in Dynamic Treatment Regimes with Misclassified Binary Outcome", "link": "https://arxiv.org/abs/2404.04697", "description": "arXiv:2404.04697v1 Announce Type: new \nAbstract: The study of precision medicine involves dynamic treatment regimes (DTRs), which are sequences of treatment decision rules recommended by taking patient-level information as input. 
The primary goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that leads to the best expected clinical outcome. Statistical methods have been developed in recent years to estimate an optimal DTR, including Q-learning, a regression-based method in the DTR literature. Although there are many studies concerning Q-learning, little attention has been given in the presence of noisy data, such as misclassified outcomes. In this paper, we investigate the effect of outcome misclassification on Q-learning and propose a correction method to accommodate the misclassification effect. Simulation studies are conducted to demonstrate the satisfactory performance of the proposed method. We illustrate the proposed method in two examples from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study and the smoking cessation program."}, "https://arxiv.org/abs/2404.04700": {"title": "Stratifying on Treatment Status", "link": "https://arxiv.org/abs/2404.04700", "description": "arXiv:2404.04700v1 Announce Type: new \nAbstract: We investigate the estimation of treatment effects from a sample that is stratified on the binary treatment status. In the case of unconfounded assignment where the potential outcomes are independent of the treatment given covariates, we show that standard estimators of the average treatment effect are inconsistent. In the case of an endogenous treatment and a binary instrument, we show that the IV estimator is inconsistent for the local average treatment effect. In both cases, we propose simple alternative estimators that are consistent in stratified samples, assuming that the fraction treated in the population is known or can be estimated."}, "https://arxiv.org/abs/2404.04702": {"title": "Topological data analysis for random sets and its application in detecting outliers and goodness of fit testing", "link": "https://arxiv.org/abs/2404.04702", "description": "arXiv:2404.04702v1 Announce Type: new \nAbstract: In this paper we present the methodology for detecting outliers and testing the goodness-of-fit of random sets using topological data analysis. We construct the filtration from level sets of the signed distance function and consider various summary functions of the persistence diagram derived from the obtained persistence homology. The outliers are detected using functional depths for the summary functions. Global envelope tests using the summary statistics as test statistics were used to construct the goodness-of-fit test. The procedures were justified by a simulation study using germ-grain random set models."}, "https://arxiv.org/abs/2404.04719": {"title": "Change Point Detection in Dynamic Graphs with Generative Model", "link": "https://arxiv.org/abs/2404.04719", "description": "arXiv:2404.04719v1 Announce Type: new \nAbstract: This paper proposes a simple generative model to detect change points in time series of graphs. The proposed framework consists of learnable prior distributions for low-dimensional graph representations and of a decoder that can generate dynamic graphs from the latent representations. The informative prior distributions in the latent spaces are learned from observed data as empirical Bayes, and the expressive power of a generative model is exploited to assist change point detection. Specifically, the model parameters are learned via maximum approximate likelihood, with a Group Fused Lasso regularization. 
The optimization problem is then solved via Alternating Direction Method of Multipliers (ADMM), and Langevin Dynamics are recruited for posterior inference. Experiments in simulated and real data demonstrate the ability of the generative model in supporting change point detection with good performance."}, "https://arxiv.org/abs/2404.04775": {"title": "Bipartite causal inference with interference, time series data, and a random network", "link": "https://arxiv.org/abs/2404.04775", "description": "arXiv:2404.04775v1 Announce Type: new \nAbstract: In bipartite causal inference with interference there are two distinct sets of units: those that receive the treatment, termed interventional units, and those on which the outcome is measured, termed outcome units. Which interventional units' treatment can drive which outcome units' outcomes is often depicted in a bipartite network. We study bipartite causal inference with interference from observational data across time and with a changing bipartite network. Under an exposure mapping framework, we define causal effects specific to each outcome unit, representing average contrasts of potential outcomes across time. We establish unconfoundedness of the exposure received by the outcome units based on unconfoundedness assumptions on the interventional units' treatment assignment and the random graph, hence respecting the bipartite structure of the problem. By harvesting the time component of our setting, causal effects are estimable while controlling only for temporal trends and time-varying confounders. Our results hold for binary, continuous, and multivariate exposure mappings. In the case of a binary exposure, we propose three matching algorithms to estimate the causal effect based on matching exposed to unexposed time periods for the same outcome unit, and we show that the bias of the resulting estimators is bounded. We illustrate our approach with an extensive simulation study and an application on the effect of wildfire smoke on transportation by bicycle."}, "https://arxiv.org/abs/2404.04794": {"title": "A Deep Learning Approach to Nonparametric Propensity Score Estimation with Optimized Covariate Balance", "link": "https://arxiv.org/abs/2404.04794", "description": "arXiv:2404.04794v1 Announce Type: new \nAbstract: This paper proposes a novel propensity score weighting analysis. We define two sufficient and necessary conditions for a function of the covariates to be the propensity score. The first is \"local balance\", which ensures the conditional independence of covariates and treatment assignment across a dense grid of propensity score values. The second condition, \"local calibration\", guarantees that a balancing score is a propensity score. Using three-layer feed-forward neural networks, we develop a nonparametric propensity score model that satisfies these conditions, effectively circumventing the issue of model misspecification and optimizing covariate balance to minimize bias and stabilize the inverse probability weights. Our proposed method performed substantially better than existing methods in extensive numerical studies of both real and simulated benchmark datasets."}, "https://arxiv.org/abs/2404.04905": {"title": "Review for Handling Missing Data with special missing mechanism", "link": "https://arxiv.org/abs/2404.04905", "description": "arXiv:2404.04905v1 Announce Type: new \nAbstract: Missing data poses a significant challenge in data science, affecting decision-making processes and outcomes. 
Understanding what missing data is, how it occurs, and why it is crucial to handle it appropriately is paramount when working with real-world data, especially in tabular data, one of the most commonly used data types in the real world. Three missing mechanisms are defined in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), each presenting unique challenges in imputation. Most existing work is focused on MCAR, which is relatively easy to handle. The special missing mechanisms of MNAR and MAR are less explored and understood. This article reviews existing literature on handling missing values. It compares and contrasts existing methods in terms of their ability to handle different missing mechanisms and data types. It identifies research gaps in the existing literature and lays out potential directions for future research in the field. The information in this review will help data analysts and researchers to adopt and promote good practices for handling missing data in real-world problems."}, "https://arxiv.org/abs/2404.04974": {"title": "Neural Network Modeling for Forecasting Tourism Demand in Stopi\\'{c}a Cave: A Serbian Cave Tourism Study", "link": "https://arxiv.org/abs/2404.04974", "description": "arXiv:2404.04974v1 Announce Type: new \nAbstract: For modeling the number of visits in Stopi\\'{c}a cave (Serbia), we consider the classical Auto-regressive Integrated Moving Average (ARIMA) model, the Machine Learning (ML) method Support Vector Regression (SVR), and the hybrid NeuralPropeth method, which combines classical and ML concepts. The most accurate predictions were obtained with NeuralPropeth, which includes the seasonal component and growing trend of the time-series. In addition, non-linearity is modeled by a shallow Neural Network (NN), and Google Trend is incorporated as an exogenous variable. Modeling tourist demand is of great importance for management structures and decision-makers due to its applicability in establishing sustainable tourism utilization strategies in environmentally vulnerable destinations such as caves. The data provided insights into the tourist demand in Stopi\\'{c}a cave and preliminary data for addressing the issues of carrying capacity within the most visited cave in Serbia."}, "https://arxiv.org/abs/2404.04979": {"title": "CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference", "link": "https://arxiv.org/abs/2404.04979", "description": "arXiv:2404.04979v1 Announce Type: new \nAbstract: Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. 
In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis."}, "https://arxiv.org/abs/2404.04985": {"title": "Towards a generalized accessibility measure for transportation equity and efficiency", "link": "https://arxiv.org/abs/2404.04985", "description": "arXiv:2404.04985v1 Announce Type: new \nAbstract: Locational measures of accessibility are widely used in urban and transportation planning to understand the impact of the transportation system on people's access to places. However, there is a considerable lack of measurement standards and publicly available data. We propose a generalized measure of locational accessibility that has a comprehensible form for transportation planning analysis. This metric combines the cumulative opportunities approach with gravity-based measures and is capable of catering to multiple trip purposes, travel modes, cost thresholds, and scales of analysis. Using data from multiple publicly available datasets, this metric is computed by trip purpose and travel time threshold for all block groups in the United States, and the data is made publicly accessible. Further, case studies of three large metropolitan areas reveal substantial inefficiencies in transportation infrastructure, with the most inefficiency observed in sprawling and non-core urban areas, especially for bicycling. Subsequently, it is shown that targeted investment in facilities can contribute to a more equitable distribution of accessibility to essential shopping and service facilities. By assigning greater weights to socioeconomically disadvantaged neighborhoods, the proposed metric formally incorporates equity considerations into transportation planning, contributing to a more equitable distribution of accessibility to essential services and facilities."}, "https://arxiv.org/abs/2404.05021": {"title": "Context-dependent Causality (the Non-Monotonic Case)", "link": "https://arxiv.org/abs/2404.05021", "description": "arXiv:2404.05021v1 Announce Type: new \nAbstract: We develop a novel identification strategy as well as a new estimator for context-dependent causal inference in non-parametric triangular models with non-separable disturbances. Departing from the common practice, our analysis does not rely on the strict monotonicity assumption. Our key contribution lies in leveraging diffusion models to formulate the structural equations as a system evolving from noise accumulation to account for the influence of the latent context (confounder) on the outcome. Our identifiability strategy involves a system of Fredholm integral equations expressing the distributional relationship between a latent context variable and a vector of observables. These integral equations involve an unknown kernel and are governed by a set of structural form functions, inducing a non-monotonic inverse problem. We prove that if the kernel density can be represented as an infinite mixture of Gaussians, then there exists a unique solution for the unknown function. This is a significant result, as it shows that it is possible to solve a non-monotonic inverse problem even when the kernel is unknown. 
On the methodological front, we leverage a novel and enriched Contaminated Generative Adversarial (Neural) Network (CONGAN), which we provide as a solution to the non-monotonic inverse problem."}, "https://arxiv.org/abs/2404.05060": {"title": "Dir-SPGLM: A Bayesian semiparametric GLM with data-driven reference distribution", "link": "https://arxiv.org/abs/2404.05060", "description": "arXiv:2404.05060v1 Announce Type: new \nAbstract: The recently developed semi-parametric generalized linear model (SPGLM) offers more flexibility as compared to the classical GLM by including the baseline or reference distribution of the response as an additional parameter in the model. However, some inference summaries are not easily generated under existing maximum-likelihood based inference (ML-SPGLM). This includes uncertainty in estimation for model-derived functionals such as exceedance probabilities. The latter are critical in a clinical diagnostic or decision-making setting. In this article, by placing a Dirichlet prior on the baseline distribution, we propose a Bayesian model-based approach for inference to address these important gaps. We establish consistency and asymptotic normality results for the implied canonical parameter. Simulation studies and an illustration with data from an aging research study confirm that the proposed method performs comparably or better in comparison with ML-SPGLM. The proposed Bayesian framework is most attractive for inference with small sample training data or in sparse-data scenarios."}, "https://arxiv.org/abs/2404.05118": {"title": "BayesPPDSurv: An R Package for Bayesian Sample Size Determination Using the Power and Normalized Power Prior for Time-To-Event Data", "link": "https://arxiv.org/abs/2404.05118", "description": "arXiv:2404.05118v1 Announce Type: new \nAbstract: The BayesPPDSurv (Bayesian Power Prior Design for Survival Data) R package supports Bayesian power and type I error calculations and model fitting using the power and normalized power priors incorporating historical data for the analysis of time-to-event outcomes. The package implements the stratified proportional hazards regression model with piecewise constant hazard within each stratum. The package allows the historical data to inform the treatment effect parameter, parameter effects for other covariates in the regression model, as well as the baseline hazard parameters. The use of multiple historical datasets is supported. A novel algorithm is developed for computationally efficient use of the normalized power prior. In addition, the package supports the use of arbitrary sampling priors for computing Bayesian power and type I error rates, and has built-in features that semi-automatically generate sampling priors from the historical data. We demonstrate the use of BayesPPDSurv in a comprehensive case study for a melanoma clinical trial design."}, "https://arxiv.org/abs/2404.05148": {"title": "Generalized Criterion for Identifiability of Additive Noise Models Using Majorization", "link": "https://arxiv.org/abs/2404.05148", "description": "arXiv:2404.05148v1 Announce Type: new \nAbstract: The discovery of causal relationships from observational data is very challenging. Many recent approaches rely on complexity or uncertainty concepts to impose constraints on probability distributions, aiming to identify specific classes of directed acyclic graph (DAG) models. 
In this paper, we introduce a novel identifiability criterion for DAGs that places constraints on the conditional variances of additive noise models. We demonstrate that this criterion extends and generalizes existing identifiability criteria in the literature that employ (conditional) variances as measures of uncertainty in (conditional) distributions. For linear Structural Equation Models, we present a new algorithm that leverages the concept of weak majorization applied to the diagonal elements of the Cholesky factor of the covariance matrix to learn a topological ordering of variables. Through extensive simulations and the analysis of bank connectivity data, we provide evidence of the effectiveness of our approach in successfully recovering DAGs. The code for reproducing the results in this paper is available in Supplementary Materials."}, "https://arxiv.org/abs/2404.05178": {"title": "Estimating granular house price distributions in the Australian market using Gaussian mixtures", "link": "https://arxiv.org/abs/2404.05178", "description": "arXiv:2404.05178v1 Announce Type: new \nAbstract: A new methodology is proposed to approximate the time-dependent house price distribution at a fine regional scale using Gaussian mixtures. The means, variances and weights of the mixture components are related to time, location and dwelling type through a non linear function trained by a deep functional approximator. Price indices are derived as means, medians, quantiles or other functions of the estimated distributions. Price densities for larger regions, such as a city, are calculated via a weighted sum of the component density functions. The method is applied to a data set covering all of Australia at a fine spatial and temporal resolution. In addition to enabling a detailed exploration of the data, the proposed index yields lower prediction errors in the practical task of individual dwelling price projection from previous sales values within the three major Australian cities. The estimated quantiles are also found to be well calibrated empirically, capturing the complexity of house price distributions."}, "https://arxiv.org/abs/2404.05209": {"title": "Maximally Forward-Looking Core Inflation", "link": "https://arxiv.org/abs/2404.05209", "description": "arXiv:2404.05209v1 Announce Type: new \nAbstract: Timely monetary policy decision-making requires timely core inflation measures. We create a new core inflation series that is explicitly designed to succeed at that goal. Precisely, we introduce the Assemblage Regression, a generalized nonnegative ridge regression problem that optimizes the price index's subcomponent weights such that the aggregate is maximally predictive of future headline inflation. Ordering subcomponents according to their rank in each period switches the algorithm to be learning supervised trimmed inflation - or, put differently, the maximally forward-looking summary statistic of the realized price changes distribution. In an extensive out-of-sample forecasting experiment for the US and the euro area, we find substantial improvements for signaling medium-term inflation developments in both the pre- and post-Covid years. Those coming from the supervised trimmed version are particularly striking, and are attributable to a highly asymmetric trimming which contrasts with conventional indicators. We also find that this metric was indicating first upward pressures on inflation as early as mid-2020 and quickly captured the turning point in 2022. 
We also consider extensions, like assembling inflation from geographical regions, trimmed temporal aggregation, and building core measures specialized for either upside or downside inflation risks."}, "https://arxiv.org/abs/2404.05246": {"title": "Assessing the causes of continuous effects by posterior effects of causes", "link": "https://arxiv.org/abs/2404.05246", "description": "arXiv:2404.05246v1 Announce Type: new \nAbstract: To evaluate a single cause of a binary effect, Dawid et al. (2014) defined the probability of causation, while Pearl (2015) defined the probabilities of necessity and sufficiency. For assessing the multiple correlated causes of a binary effect, Lu et al. (2023) defined the posterior causal effects based on post-treatment variables. In many scenarios, outcomes are continuous, simply binarizing them and applying previous methods may result in information loss or biased conclusions. To address this limitation, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous effect, including posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under the assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish the asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method."}, "https://arxiv.org/abs/2404.05349": {"title": "Common Trends and Long-Run Multipliers in Nonlinear Structural VARs", "link": "https://arxiv.org/abs/2404.05349", "description": "arXiv:2404.05349v1 Announce Type: new \nAbstract: While it is widely recognised that linear (structural) VARs may omit important features of economic time series, the use of nonlinear SVARs has to date been almost entirely confined to the modelling of stationary time series, because of a lack of understanding as to how common stochastic trends may be accommodated within nonlinear VAR models. This has unfortunately circumscribed the range of series to which such models can be applied -- and/or required that these series be first transformed to stationarity, a potential source of misspecification -- and prevented the use of long-run identifying restrictions in these models. To address these problems, we develop a flexible class of additively time-separable nonlinear SVARs, which subsume models with threshold-type endogenous regime switching, both of the piecewise linear and smooth transition varieties. We extend the Granger-Johansen representation theorem to this class of models, obtaining conditions that specialise exactly to the usual ones when the model is linear. We further show that, as a corollary, these models are capable of supporting the same kinds of long-run identifying restrictions as are available in linear cointegrated SVARs."}, "https://arxiv.org/abs/2404.05445": {"title": "Unsupervised Training of Convex Regularizers using Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2404.05445", "description": "arXiv:2404.05445v1 Announce Type: new \nAbstract: Unsupervised learning is a training approach in the situation where ground truth data is unavailable, such as inverse imaging problems. 
We present an unsupervised Bayesian training approach to learning convex neural network regularizers using a fixed noisy dataset, based on a dual Markov chain estimation method. Compared to classical supervised adversarial regularization methods, where there is access to both clean images and unlimited noisy copies, we demonstrate close performance on natural image Gaussian deconvolution and Poisson denoising tasks."}, "https://arxiv.org/abs/2404.05671": {"title": "Bayesian Inverse Ising Problem with Three-body Interactions", "link": "https://arxiv.org/abs/2404.05671", "description": "arXiv:2404.05671v1 Announce Type: new \nAbstract: In this paper, we solve the inverse Ising problem with three-body interaction. Using the mean-field approximation, we find a tractable expansion of the normalizing constant. This facilitates estimation, which is known to be quite challenging for the Ising model. We then develop a novel hybrid MCMC algorithm that integrates Adaptive Metropolis Hastings (AMH), Hamiltonian Monte Carlo (HMC), and the Manifold-Adjusted Langevin Algorithm (MALA), which converges quickly and mixes well. We demonstrate the robustness of our algorithm using data simulated with a structure under which parameter estimation is known to be challenging, such as in the presence of a phase transition and at the critical point of the system."}, "https://arxiv.org/abs/2404.05702": {"title": "On the estimation of complex statistics combining different surveys", "link": "https://arxiv.org/abs/2404.05702", "description": "arXiv:2404.05702v1 Announce Type: new \nAbstract: The importance of exploring a potential integration among surveys has been acknowledged in order to enhance effectiveness and minimize expenses. In this work, we employ the alignment method to combine information from two different surveys for the estimation of complex statistics. The derivation of the alignment weights poses challenges in the case of complex statistics due to their non-linear form. To overcome this, we propose to use a linearized variable associated with the complex statistic under consideration. Linearized variables have been widely used to derive variance estimates, thus allowing for the estimation of the variance of the combined complex statistics estimates. Simulations conducted show the effectiveness of the proposed approach, resulting in a reduction of the variance of the combined complex statistics estimates. Also, in some cases, the usage of the alignment weights derived using the linearized variable associated with a complex statistic could result in a further reduction of the variance of the combined estimates."}, "https://arxiv.org/abs/2404.04399": {"title": "Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer", "link": "https://arxiv.org/abs/2404.04399", "description": "arXiv:2404.04399v1 Announce Type: cross \nAbstract: We propose Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to estimate the counterfactual mean of outcome under dynamic treatment policies in longitudinal problem settings. Our approach utilizes a transformer architecture with heterogeneous type embedding trained using temporal-difference learning. After obtaining an initial estimate using the transformer, following the targeted minimum loss-based likelihood estimation (TMLE) framework, we statistically correct for the bias commonly associated with machine learning algorithms. 
Furthermore, our method also facilitates statistical inference by enabling the provision of 95% confidence intervals grounded in asymptotic statistical theory. Simulation results demonstrate our method's superior performance over existing approaches, particularly in complex, long time-horizon scenarios. It remains effective in small-sample, short-duration contexts, matching the performance of asymptotically efficient estimators. To demonstrate our method in practice, we applied our method to estimate counterfactual mean outcomes for standard versus intensive blood pressure management strategies in a real-world cardiovascular epidemiology cohort study."}, "https://arxiv.org/abs/2404.04498": {"title": "Bayesian Inference for Consistent Predictions in Overparameterized Nonlinear Regression", "link": "https://arxiv.org/abs/2404.04498", "description": "arXiv:2404.04498v1 Announce Type: cross \nAbstract: The remarkable generalization performance of overparameterized models has challenged the conventional wisdom of statistical learning theory. While recent theoretical studies have shed light on this behavior in linear models or nonlinear classifiers, a comprehensive understanding of overparameterization in nonlinear regression remains lacking. This paper explores the predictive properties of overparameterized nonlinear regression within the Bayesian framework, extending the methodology of adaptive prior based on the intrinsic spectral structure of the data. We establish posterior contraction for single-neuron models with Lipschitz continuous activation functions and for generalized linear models, demonstrating that our approach achieves consistent predictions in the overparameterized regime. Moreover, our Bayesian framework allows for uncertainty estimation of the predictions. The proposed method is validated through numerical simulations and a real data application, showcasing its ability to achieve accurate predictions and reliable uncertainty estimates. Our work advances the theoretical understanding of the blessing of overparameterization and offers a principled Bayesian approach for prediction in large nonlinear models."}, "https://arxiv.org/abs/2404.05062": {"title": "New methods for computing the generalized chi-square distribution", "link": "https://arxiv.org/abs/2404.05062", "description": "arXiv:2404.05062v1 Announce Type: cross \nAbstract: We present several exact and approximate mathematical methods and open-source software to compute the cdf, pdf and inverse cdf of the generalized chi-square distribution, which appears in Bayesian classification problems. Some methods are geared for speed, while others are designed to be accurate far into the tails, using which we can also measure large values of the discriminability index $d'$ between multinormals. We compare the accuracy and speed of these methods against the best existing methods."}, "https://arxiv.org/abs/2404.05545": {"title": "Evaluating Interventional Reasoning Capabilities of Large Language Models", "link": "https://arxiv.org/abs/2404.05545", "description": "arXiv:2404.05545v1 Announce Type: cross \nAbstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. 
A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from interventions, rather than relying on memorized facts or other shortcuts. Our analysis of four LLMs highlights that while GPT-4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts."}, "https://arxiv.org/abs/2404.05622": {"title": "How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation", "link": "https://arxiv.org/abs/2404.05622", "description": "arXiv:2404.05622v1 Announce Type: cross \nAbstract: Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets without necessitating complex sampling schemes. These benchmark data sets can then be used for model training and a variety of evaluation tasks. Specifically, we propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics, estimating key performance metrics such as cluster and pairwise precision and recall, and analyzing root causes for errors. We validate the framework in an application to inventor name disambiguation and through simulation studies. Software: https://github.com/OlivierBinette/er-evaluation/"}, "https://arxiv.org/abs/1910.02170": {"title": "Donor's Deferral and Return Behavior: Partial Identification from a Regression Discontinuity Design with Manipulation", "link": "https://arxiv.org/abs/1910.02170", "description": "arXiv:1910.02170v3 Announce Type: replace \nAbstract: Volunteer labor can temporarily yield lower benefits to charities than its costs. In such instances, organizations may wish to defer volunteer donations to a later date. Exploiting a discontinuity in blood donations' eligibility criteria, we show that deferring donors reduces their future volunteerism. In our setting, medical staff manipulates donors' reported hemoglobin levels over a threshold to facilitate donation. Such manipulation invalidates standard regression discontinuity design. To circumvent this issue, we propose a procedure for obtaining partial identification bounds where manipulation is present. Our procedure is applicable in various regression discontinuity settings where the running variable is manipulated and discrete."}, "https://arxiv.org/abs/2201.03616": {"title": "Scale Reliant Inference", "link": "https://arxiv.org/abs/2201.03616", "description": "arXiv:2201.03616v4 Announce Type: replace \nAbstract: Scientific fields such as genomics, ecology, and political science often collect multivariate count data. 
In these fields, the data are often sufficiently noisy such that inferences regarding the total size of the measured systems have substantial uncertainty. This uncertainty can hinder downstream analyses, such as differential analysis in case-control studies. There have historically been two approaches to this problem: one considers the data as compositional and the other as counts that can be normalized. In this article, we use the framework of partially identified models to rigorously study the types of scientific questions (estimands) that can be answered (estimated) using these data. We prove that satisfying Frequentist inferential criteria is impossible for many estimation problems. In contrast, we find that the criteria for Bayesian inference can be satisfied, yet it requires a particular type of model called a Bayesian partially identified model. We introduce Scale Simulation Random Variables as a flexible and computationally efficient form of Bayesian partially identified models for analyzing these data. We use simulations and data analysis to validate our theory."}, "https://arxiv.org/abs/2204.12699": {"title": "Randomness of Shapes and Statistical Inference on Shapes via the Smooth Euler Characteristic Transform", "link": "https://arxiv.org/abs/2204.12699", "description": "arXiv:2204.12699v4 Announce Type: replace \nAbstract: In this article, we establish the mathematical foundations for modeling the randomness of shapes and conducting statistical inference on shapes using the smooth Euler characteristic transform. Based on these foundations, we propose two chi-squared statistic-based algorithms for testing hypotheses on random shapes. Simulation studies are presented to validate our mathematical derivations and to compare our algorithms with state-of-the-art methods to demonstrate the utility of our proposed framework. As real applications, we analyze a data set of mandibular molars from four genera of primates and show that our algorithms have the power to detect significant shape differences that recapitulate known morphological variation across suborders. Altogether, our discussions bridge the following fields: algebraic and computational topology, probability theory and stochastic processes, Sobolev spaces and functional analysis, analysis of variance for functional data, and geometric morphometrics."}, "https://arxiv.org/abs/2205.09922": {"title": "Nonlinear Fore(Back)casting and Innovation Filtering for Causal-Noncausal VAR Models", "link": "https://arxiv.org/abs/2205.09922", "description": "arXiv:2205.09922v3 Announce Type: replace \nAbstract: We introduce closed-form formulas of out-of-sample predictive densities for forecasting and backcasting of mixed causal-noncausal (Structural) Vector Autoregressive VAR models. These nonlinear and time irreversible non-Gaussian VAR processes are shown to satisfy the Markov property in both calendar and reverse time. A post-estimation inference method for assessing the forecast interval uncertainty due to the preliminary estimation step is introduced too. The nonlinear past-dependent innovations of a mixed causal-noncausal VAR model are defined and their filtering and identification methods are discussed. 
Our approach is illustrated by a simulation study, and an application to cryptocurrency prices."}, "https://arxiv.org/abs/2206.06344": {"title": "Sparse-group boosting -- Unbiased group and variable selection", "link": "https://arxiv.org/abs/2206.06344", "description": "arXiv:2206.06344v2 Announce Type: replace \nAbstract: In the presence of grouped covariates, we propose a framework for boosting that allows one to enforce sparsity within and between groups. By using component-wise and group-wise gradient boosting at the same time with adjusted degrees of freedom, a model with similar properties to the sparse group lasso can be fitted through boosting. We show that within-group and between-group sparsity can be controlled by a mixing parameter and discuss similarities and differences to the mixing parameter in the sparse group lasso. With simulations, gene data, and agricultural data, we show the effectiveness and predictive competitiveness of this estimator. The data and simulations suggest that, in the presence of grouped variables, the use of sparse group boosting is associated with less biased variable selection and higher predictability compared to component-wise boosting. Additionally, we propose a way of reducing bias in component-wise boosting through the degrees of freedom."}, "https://arxiv.org/abs/2212.04550": {"title": "Modern Statistical Models and Methods for Estimating Fatigue-Life and Fatigue-Strength Distributions from Experimental Data", "link": "https://arxiv.org/abs/2212.04550", "description": "arXiv:2212.04550v3 Announce Type: replace \nAbstract: Engineers and scientists have been collecting and analyzing fatigue data since the 1800s to ensure the reliability of life-critical structures. Applications include (but are not limited to) bridges, building structures, aircraft and spacecraft components, ships, ground-based vehicles, and medical devices. Engineers need to estimate S-N relationships (Stress or Strain versus Number of cycles to failure), typically with a focus on estimating small quantiles of the fatigue-life distribution. Estimates from this kind of model are used as input to models (e.g., cumulative damage models) that predict failure-time distributions under varying stress patterns. Also, design engineers need to estimate lower-tail quantiles of the closely related fatigue-strength distribution. The history of applying incorrect statistical methods is nearly as long, and such practices continue to the present. Examples include treating the applied stress (or strain) as the response and the number of cycles to failure as the explanatory variable in regression analyses (because of the need to estimate strength distributions) and ignoring or otherwise mishandling censored observations (known as runouts in the fatigue literature). The first part of the paper reviews the traditional modeling approach where a fatigue-life model is specified. We then show how this specification induces a corresponding fatigue-strength model. The second part of the paper presents a novel alternative modeling approach where a fatigue-strength model is specified and a corresponding fatigue-life model is induced. 
We explain and illustrate the important advantages of this new modeling approach."}, "https://arxiv.org/abs/2302.14230": {"title": "Optimal Priors for the Discounting Parameter of the Normalized Power Prior", "link": "https://arxiv.org/abs/2302.14230", "description": "arXiv:2302.14230v2 Announce Type: replace \nAbstract: The power prior is a popular class of informative priors for incorporating information from historical data. It involves raising the likelihood for the historical data to a power, which acts as discounting parameter. When the discounting parameter is modelled as random, the normalized power prior is recommended. In this work, we prove that the marginal posterior for the discounting parameter for generalized linear models converges to a point mass at zero if there is any discrepancy between the historical and current data, and that it does not converge to a point mass at one when they are fully compatible. In addition, we explore the construction of optimal priors for the discounting parameter in a normalized power prior. In particular, we are interested in achieving the dual objectives of encouraging borrowing when the historical and current data are compatible and limiting borrowing when they are in conflict. We propose intuitive procedures for eliciting the shape parameters of a beta prior for the discounting parameter based on two minimization criteria, the Kullback-Leibler divergence and the mean squared error. Based on the proposed criteria, the optimal priors derived are often quite different from commonly used priors such as the uniform prior."}, "https://arxiv.org/abs/2305.11323": {"title": "Cumulative differences between paired samples", "link": "https://arxiv.org/abs/2305.11323", "description": "arXiv:2305.11323v2 Announce Type: replace \nAbstract: The simplest, most common paired samples consist of observations from two populations, with each observed response from one population corresponding to an observed response from the other population at the same value of an ordinal covariate. The pair of observed responses (one from each population) at the same value of the covariate is known as a \"matched pair\" (with the matching based on the value of the covariate). A graph of cumulative differences between the two populations reveals differences in responses as a function of the covariate. Indeed, the slope of the secant line connecting two points on the graph becomes the average difference over the wide interval of values of the covariate between the two points; i.e., slope of the graph is the average difference in responses. (\"Average\" refers to the weighted average if the samples are weighted.) Moreover, a simple statistic known as the Kuiper metric summarizes into a single scalar the overall differences over all values of the covariate. The Kuiper metric is the absolute value of the total difference in responses between the two populations, totaled over the interval of values of the covariate for which the absolute value of the total is greatest. The total should be normalized such that it becomes the (weighted) average over all values of the covariate when the interval over which the total is taken is the entire range of the covariate (i.e., the sum for the total gets divided by the total number of observations, if the samples are unweighted, or divided by the total weight, if the samples are weighted). 
This cumulative approach is fully nonparametric and uniquely defined (with only one right way to construct the graphs and scalar summary statistics), unlike traditional methods such as reliability diagrams or parametric or semi-parametric regressions, which typically obscure significant differences due to their parameter settings."}, "https://arxiv.org/abs/2309.09371": {"title": "Gibbs Sampling using Anti-correlation Gaussian Data Augmentation, with Applications to L1-ball-type Models", "link": "https://arxiv.org/abs/2309.09371", "description": "arXiv:2309.09371v2 Announce Type: replace \nAbstract: L1-ball-type priors are a recent generalization of the spike-and-slab priors. By transforming a continuous precursor distribution to the L1-ball boundary, it induces exact zeros with positive prior and posterior probabilities. With great flexibility in choosing the precursor and threshold distributions, we can easily specify models under structured sparsity, such as those with dependent probability for zeros and smoothness among the non-zeros. Motivated to significantly accelerate the posterior computation, we propose a new data augmentation that leads to a fast block Gibbs sampling algorithm. The latent variable, named ``anti-correlation Gaussian'', cancels out the quadratic exponent term in the latent Gaussian distribution, making the parameters of interest conditionally independent so that they can be updated in a block. Compared to existing algorithms such as the No-U-Turn sampler, the new blocked Gibbs sampler has a very low computing cost per iteration and shows rapid mixing of Markov chains. We establish the geometric ergodicity guarantee of the algorithm in linear models. Further, we show useful extensions of our algorithm for posterior estimation of general latent Gaussian models, such as those involving multivariate truncated Gaussian or latent Gaussian process."}, "https://arxiv.org/abs/2310.17999": {"title": "Automated threshold selection and associated inference uncertainty for univariate extremes", "link": "https://arxiv.org/abs/2310.17999", "description": "arXiv:2310.17999v3 Announce Type: replace \nAbstract: Threshold selection is a fundamental problem in any threshold-based extreme value analysis. While models are asymptotically motivated, selecting an appropriate threshold for finite samples is difficult and highly subjective through standard methods. Inference for high quantiles can also be highly sensitive to the choice of threshold. Too low a threshold choice leads to bias in the fit of the extreme value model, while too high a choice leads to unnecessary additional uncertainty in the estimation of model parameters. We develop a novel methodology for automated threshold selection that directly tackles this bias-variance trade-off. We also develop a method to account for the uncertainty in the threshold estimation and propagate this uncertainty through to high quantile inference. Through a simulation study, we demonstrate the effectiveness of our method for threshold selection and subsequent extreme quantile estimation, relative to the leading existing methods, and show how the method's effectiveness is not sensitive to the tuning parameters. 
We apply our method to the well-known, troublesome example of the River Nidd dataset."}, "https://arxiv.org/abs/2102.13273": {"title": "Application-Driven Learning: A Closed-Loop Prediction and Optimization Approach Applied to Dynamic Reserves and Demand Forecasting", "link": "https://arxiv.org/abs/2102.13273", "description": "arXiv:2102.13273v5 Announce Type: replace-cross \nAbstract: Forecasting and decision-making are generally modeled as two sequential steps with no feedback, following an open-loop approach. In this paper, we present application-driven learning, a new closed-loop framework in which the processes of forecasting and decision-making are merged and co-optimized through a bilevel optimization problem. We present our methodology in a general format and prove that the solution converges to the best estimator in terms of the expected cost of the selected application. Then, we propose two solution methods: an exact method based on the KKT conditions of the second-level problem and a scalable heuristic approach suitable for decomposition methods. The proposed methodology is applied to the relevant problem of defining dynamic reserve requirements and conditional load forecasts, offering an alternative approach to current ad hoc procedures implemented in industry practices. We benchmark our methodology with the standard sequential least-squares forecast and dispatch planning process. We apply the proposed methodology to an illustrative system and to a wide range of instances, from dozens of buses to large-scale realistic systems with thousands of buses. Our results show that the proposed methodology is scalable and yields consistently better performance than the standard open-loop approach."}, "https://arxiv.org/abs/2211.13383": {"title": "A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments", "link": "https://arxiv.org/abs/2211.13383", "description": "arXiv:2211.13383v4 Announce Type: replace-cross \nAbstract: In this paper, we aim to propose a consistent non-Gaussian Bayesian filter of which the system state is a continuous function. The distributions of the true system states, and those of the system and observation noises, are only assumed Lebesgue integrable with no prior constraints on what function classes they fall within. This type of filter has significant merits in both theory and practice, which is able to ameliorate the curse of dimensionality for the particle filter, a popular non-Gaussian Bayesian filter of which the system state is parameterized by discrete particles and the corresponding weights. We first propose a new type of statistics, called the generalized logarithmic moments. Together with the power moments, they are used to form a density surrogate, parameterized as an analytic function, to approximate the true system state. The map from the parameters of the proposed density surrogate to both the power moments and the generalized logarithmic moments is proved to be a diffeomorphism, establishing the fact that there exists a unique density surrogate which satisfies both moment conditions. This diffeomorphism also allows us to use gradient methods to treat the convex optimization problem in determining the parameters. Last but not least, simulation results reveal the advantage of using both sets of moments for estimating mixtures of complicated types of functions. 
A robot localization simulation is also given, as an engineering application to validate the proposed filtering scheme."}, "https://arxiv.org/abs/2301.03038": {"title": "Skewed Bernstein-von Mises theorem and skew-modal approximations", "link": "https://arxiv.org/abs/2301.03038", "description": "arXiv:2301.03038v3 Announce Type: replace-cross \nAbstract: Gaussian approximations are routinely employed in Bayesian statistics to ease inference when the target posterior is intractable. Although these approximations are asymptotically justified by Bernstein-von Mises type results, in practice the expected Gaussian behavior may poorly represent the shape of the posterior, thus affecting approximation accuracy. Motivated by these considerations, we derive an improved class of closed-form approximations of posterior distributions which arise from a new treatment of a third-order version of the Laplace method yielding approximations in a tractable family of skew-symmetric distributions. Under general assumptions which account for misspecified models and non-i.i.d. settings, this family of approximations is shown to have a total variation distance from the target posterior whose rate of convergence improves by at least one order of magnitude the one established by the classical Bernstein-von Mises theorem. Specializing this result to the case of regular parametric models shows that the same improvement in approximation accuracy can be also derived for polynomially bounded posterior functionals. Unlike other higher-order approximations, our results prove that it is possible to derive closed-form and valid densities which are expected to provide, in practice, a more accurate, yet similarly-tractable, alternative to Gaussian approximations of the target posterior, while inheriting its limiting frequentist properties. We strengthen such arguments by developing a practical skew-modal approximation for both joint and marginal posteriors that achieves the same theoretical guarantees of its theoretical counterpart by replacing the unknown model parameters with the corresponding MAP estimate. Empirical studies confirm that our theoretical results closely match the remarkable performance observed in practice, even in finite, possibly small, sample regimes."}, "https://arxiv.org/abs/2404.05808": {"title": "Replicability analysis of high dimensional data accounting for dependence", "link": "https://arxiv.org/abs/2404.05808", "description": "arXiv:2404.05808v1 Announce Type: new \nAbstract: Replicability is the cornerstone of scientific research. We study the replicability of data from high-throughput experiments, where tens of thousands of features are examined simultaneously. Existing replicability analysis methods either ignore the dependence among features or impose strong modelling assumptions, producing overly conservative or overly liberal results. Based on $p$-values from two studies, we use a four-state hidden Markov model to capture the structure of local dependence. Our method effectively borrows information from different features and studies while accounting for dependence among features and heterogeneity across studies. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. 
Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods."}, "https://arxiv.org/abs/2404.05933": {"title": "fastcpd: Fast Change Point Detection in R", "link": "https://arxiv.org/abs/2404.05933", "description": "arXiv:2404.05933v1 Announce Type: new \nAbstract: Change point analysis is concerned with detecting and locating structure breaks in the underlying model of a sequence of observations ordered by time, space or other variables. A widely adopted approach for change point analysis is to minimize an objective function with a penalty term on the number of change points. This framework includes several well-established procedures, such as the penalized log-likelihood using the (modified) Bayesian information criterion (BIC) or the minimum description length (MDL). The resulting optimization problem can be solved in polynomial time by dynamic programming or its improved version, such as the Pruned Exact Linear Time (PELT) algorithm (Killick, Fearnhead, and Eckley 2012). However, existing computational methods often suffer from two primary limitations: (1) methods based on direct implementation of dynamic programming or PELT are often time-consuming for long data sequences due to repeated computation of the cost value over different segments of the data sequence; (2) state-of-the-art R packages do not provide enough flexibility for users to handle different change point settings and models. In this work, we present the fastcpd package, aiming to provide an efficient and versatile framework for change point detection in several commonly encountered settings. The core of our algorithm is built upon PELT and the sequential gradient descent method recently proposed by Zhang and Dawn (2023). We illustrate the usage of the fastcpd package through several examples, including mean/variance changes in a (multivariate) Gaussian sequence, parameter changes in regression models, structural breaks in ARMA/GARCH/VAR models, and changes in user-specified models."}, "https://arxiv.org/abs/2404.06064": {"title": "Constructing hierarchical time series through clustering: Is there an optimal way for forecasting?", "link": "https://arxiv.org/abs/2404.06064", "description": "arXiv:2404.06064v1 Announce Type: new \nAbstract: Forecast reconciliation has attracted significant research interest in recent years, with most studies taking the hierarchy of time series as given. We extend existing work that uses time series clustering to construct hierarchies, with the goal of improving forecast accuracy, in three ways. First, we investigate multiple approaches to clustering, including not only different clustering algorithms, but also the way time series are represented and how distance between time series is defined. We find that cluster-based hierarchies lead to improvements in forecast accuracy relative to two-level hierarchies. Second, we devise an approach based on random permutation of hierarchies, keeping the structure of the hierarchy fixed, while time series are randomly allocated to clusters. In doing so, we find that improvements in forecast accuracy that accrue from using clustering do not arise from grouping together similar series but from the structure of the hierarchy. Third, we propose an approach based on averaging forecasts across hierarchies constructed using different clustering methods, that is shown to outperform any single clustering method. 
All analysis is carried out on two benchmark datasets and a simulated dataset. Our findings provide new insights into the role of hierarchy construction in forecast reconciliation and offer valuable guidance on forecasting practice."}, "https://arxiv.org/abs/2404.06093": {"title": "Supervised Contamination Detection, with Flow Cytometry Application", "link": "https://arxiv.org/abs/2404.06093", "description": "arXiv:2404.06093v1 Announce Type: new \nAbstract: The contamination detection problem aims to determine whether a set of observations has been contaminated, i.e. whether it contains points drawn from a distribution different from the reference distribution. Here, we consider a supervised problem, where labeled samples drawn from both the reference distribution and the contamination distribution are available at training time. This problem is motivated by the detection of rare cells in flow cytometry. Compared to novelty detection problems or two-sample testing, where only samples from the reference distribution are available, the challenge lies in efficiently leveraging the observations from the contamination detection to design more powerful tests. In this article, we introduce a test for the supervised contamination detection problem. We provide non-asymptotic guarantees on its Type I error, and characterize its detection rate. The test relies on estimating reference and contamination densities using histograms, and its power depends strongly on the choice of the corresponding partition. We present an algorithm for judiciously choosing the partition that results in a powerful test. Simulations illustrate the good empirical performances of our partition selection algorithm and the efficiency of our test. Finally, we showcase our method and apply it to a real flow cytometry dataset."}, "https://arxiv.org/abs/2404.06205": {"title": "Adaptive Unit Root Inference in Autoregressions using the Lasso Solution Path", "link": "https://arxiv.org/abs/2404.06205", "description": "arXiv:2404.06205v1 Announce Type: new \nAbstract: We show that the activation knot of a potentially non-stationary regressor on the adaptive Lasso solution path in autoregressions can be leveraged for selection-free inference about a unit root. The resulting test has asymptotic power against local alternatives in $1/T$ neighbourhoods, unlike post-selection inference methods based on consistent model selection. Exploiting the information enrichment principle devised by Reinschl\\\"ussel and Arnold arXiv:2402.16580 [stat.ME] to improve the Lasso-based selection of ADF models, we propose a composite statistic and analyse its asymptotic distribution and local power function. Monte Carlo evidence shows that the combined test dominates the comparable post-selection inference methods of Tibshirani et al. [JASA, 2016, 514, 600-620] and may surpass the power of established unit root tests against local alternatives. We apply the new tests to groundwater level time series for Germany and find evidence rejecting stochastic trends to explain observed long-term declines in mean water levels."}, "https://arxiv.org/abs/2404.06471": {"title": "Regression Discontinuity Design with Spillovers", "link": "https://arxiv.org/abs/2404.06471", "description": "arXiv:2404.06471v1 Announce Type: new \nAbstract: Researchers who estimate treatment effects using a regression discontinuity design (RDD) typically assume that there are no spillovers between the treated and control units. This may be unrealistic. 
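The cluster-based hierarchy construction described in the entry on constructing hierarchical time series through clustering (arXiv:2404.06064) can be sketched as follows; the feature representation (mean, standard deviation, lag-1 autocorrelation) and Ward linkage are illustrative choices, not the full set of options studied in that paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clustered_summing_matrix(Y, n_clusters=3):
    # Group the bottom-level series of Y (T x m) by hierarchical clustering of simple
    # features, and return the summing matrix of the induced total/cluster/series hierarchy.
    T, m = Y.shape
    ac1 = np.array([np.corrcoef(Y[:-1, j], Y[1:, j])[0, 1] for j in range(m)])
    feats = np.column_stack([Y.mean(axis=0), Y.std(axis=0), ac1])
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
    labels = fcluster(linkage(feats, method="ward"), t=n_clusters, criterion="maxclust")
    rows = [np.ones(m)]                                  # total
    for g in np.unique(labels):                          # one aggregate per cluster
        rows.append((labels == g).astype(float))
    S = np.vstack(rows + [np.eye(m)])                    # bottom-level identities
    return S, labels

rng = np.random.default_rng(2)
Y = rng.normal(size=(120, 8)).cumsum(axis=0)             # 8 toy series
S, labels = clustered_summing_matrix(Y, n_clusters=3)
print(S.shape, labels)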
We characterize the estimand of RDD in a setting where spillovers occur between units that are close in their values of the running variable. Under the assumption that spillovers are linear-in-means, we show that the estimand depends on the ratio of two terms: (1) the radius over which spillovers occur and (2) the choice of bandwidth used for the local linear regression. Specifically, RDD estimates direct treatment effect when radius is of larger order than the bandwidth, and total treatment effect when radius is of smaller order than the bandwidth. In the more realistic regime where radius is of similar order as the bandwidth, the RDD estimand is a mix of the above effects. To recover direct and spillover effects, we propose incorporating estimated spillover terms into local linear regression -- the local analog of peer effects regression. We also clarify the settings under which the donut-hole RD is able to eliminate the effects of spillovers."}, "https://arxiv.org/abs/2404.05809": {"title": "Self-Labeling in Multivariate Causality and Quantification for Adaptive Machine Learning", "link": "https://arxiv.org/abs/2404.05809", "description": "arXiv:2404.05809v1 Announce Type: cross \nAbstract: Adaptive machine learning (ML) aims to allow ML models to adapt to ever-changing environments with potential concept drift after model deployment. Traditionally, adaptive ML requires a new dataset to be manually labeled to tailor deployed models to altered data distributions. Recently, an interactive causality based self-labeling method was proposed to autonomously associate causally related data streams for domain adaptation, showing promising results compared to traditional feature similarity-based semi-supervised learning. Several unanswered research questions remain, including self-labeling's compatibility with multivariate causality and the quantitative analysis of the auxiliary models used in the self-labeling. The auxiliary models, the interaction time model (ITM) and the effect state detector (ESD), are vital to the success of self-labeling. This paper further develops the self-labeling framework and its theoretical foundations to address these research questions. A framework for the application of self-labeling to multivariate causal graphs is proposed using four basic causal relationships, and the impact of non-ideal ITM and ESD performance is analyzed. A simulated experiment is conducted based on a multivariate causal graph, validating the proposed theory."}, "https://arxiv.org/abs/2404.05929": {"title": "A feature-based information-theoretic approach for detecting interpretable, long-timescale pairwise interactions from time series", "link": "https://arxiv.org/abs/2404.05929", "description": "arXiv:2404.05929v1 Announce Type: cross \nAbstract: Quantifying relationships between components of a complex system is critical to understanding the rich network of interactions that characterize the behavior of the system. Traditional methods for detecting pairwise dependence of time series, such as Pearson correlation, Granger causality, and mutual information, are computed directly in the space of measured time-series values. But for systems in which interactions are mediated by statistical properties of the time series (`time-series features') over longer timescales, this approach can fail to capture the underlying dependence from limited and noisy time-series data, and can be challenging to interpret. 
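The idea of adding an estimated spillover term to the local linear regression, from the regression discontinuity with spillovers entry above (arXiv:2404.06471), can be sketched as below. The spillover proxy (share of treated units within a given radius of the running variable), the triangular kernel, and the bandwidth are illustrative assumptions rather than the paper's exact estimator.

import numpy as np

def rd_with_spillover(y, x, radius, bandwidth):
    d = (x >= 0).astype(float)                       # treatment assignment at cutoff 0
    # estimated spillover exposure: share of treated among units within `radius`
    expo = np.array([d[np.abs(x - xi) <= radius].mean() for xi in x])
    w = np.clip(1 - np.abs(x) / bandwidth, 0, None)  # triangular kernel weights
    keep = w > 0
    Z = np.column_stack([np.ones(keep.sum()), d[keep], x[keep], d[keep] * x[keep], expo[keep]])
    W = np.diag(w[keep])
    coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y[keep])   # weighted least squares
    return {"direct_effect": coef[1], "spillover_coef": coef[4]}

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 2000)
d = (x >= 0).astype(float)
expo_true = np.array([d[np.abs(x - xi) <= 0.1].mean() for xi in x])
y = 0.5 * x + 1.0 * d + 0.4 * expo_true + rng.normal(0, 0.3, x.size)
print(rd_with_spillover(y, x, radius=0.1, bandwidth=0.3))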
Addressing these issues, here we introduce an information-theoretic method for detecting dependence between time series mediated by time-series features that provides interpretable insights into the nature of the interactions. Our method extracts a candidate set of time-series features from sliding windows of the source time series and assesses their role in mediating a relationship to values of the target process. Across simulations of three different generative processes, we demonstrate that our feature-based approach can outperform a traditional inference approach based on raw time-series values, especially in challenging scenarios characterized by short time-series lengths, high noise levels, and long interaction timescales. Our work introduces a new tool for inferring and interpreting feature-mediated interactions from time-series data, contributing to the broader landscape of quantitative analysis in complex systems research, with potential applications in various domains including but not limited to neuroscience, finance, climate science, and engineering."}, "https://arxiv.org/abs/2404.05976": {"title": "A Cyber Manufacturing IoT System for Adaptive Machine Learning Model Deployment by Interactive Causality Enabled Self-Labeling", "link": "https://arxiv.org/abs/2404.05976", "description": "arXiv:2404.05976v1 Announce Type: cross \nAbstract: Machine Learning (ML) has been demonstrated to improve productivity in many manufacturing applications. To host these ML applications, several software and Industrial Internet of Things (IIoT) systems have been proposed to deploy ML applications in manufacturing settings and provide real-time intelligence. Recently, an interactive causality enabled self-labeling method has been proposed to advance adaptive ML applications in cyber-physical systems, especially manufacturing, by automatically adapting and personalizing ML models after deployment to counter data distribution shifts. The unique features of the self-labeling method require a novel software system to support dynamism at various levels.\n This paper proposes the AdaptIoT system, comprised of an end-to-end data streaming pipeline, ML service integration, and an automated self-labeling service. The self-labeling service consists of causal knowledge bases and automated full-cycle self-labeling workflows to adapt multiple ML models simultaneously. AdaptIoT employs a containerized microservice architecture to deliver a scalable and portable solution for small and medium-sized manufacturers. A field demonstration of a self-labeling adaptive ML application is conducted with a makerspace and shows reliable performance."}, "https://arxiv.org/abs/2404.06238": {"title": "Least Squares-Based Permutation Tests in Time Series", "link": "https://arxiv.org/abs/2404.06238", "description": "arXiv:2404.06238v1 Announce Type: cross \nAbstract: This paper studies permutation tests for regression parameters in a time series setting, where the time series is assumed stationary but may exhibit an arbitrary (but weak) dependence structure. In such a setting, it is perhaps surprising that permutation tests can offer any type of inference guarantees, since permuting the covariates can destroy their relationship with the response. Indeed, the fundamental assumption of exchangeability of errors required for the finite-sample exactness of permutation tests can easily fail. 
However, we show that permutation tests may be constructed which are asymptotically valid for a wide class of stationary processes, but remain exact when exchangeability holds. We also consider the problem of testing for no monotone trend and we construct asymptotically valid permutation tests in this setting as well."}, "https://arxiv.org/abs/2404.06239": {"title": "Permutation Testing for Monotone Trend", "link": "https://arxiv.org/abs/2404.06239", "description": "arXiv:2404.06239v1 Announce Type: cross \nAbstract: In this paper, we consider the fundamental problem of testing for monotone trend in a time series. While the term \"trend\" is commonly used and has an intuitive meaning, it is first crucial to specify its exact meaning in a hypothesis testing context. A well-known and commonly used test is the Mann-Kendall test, which we show does not offer Type I error control even in large samples. On the other hand, by an appropriate studentization of the Mann-Kendall statistic, we construct permutation tests that offer asymptotic error control quite generally, but retain the exactness property of permutation tests for i.i.d. observations. We also introduce \"local\" Mann-Kendall statistics as a means of testing for local rather than global trend in a time series. Similar properties of permutation tests are obtained for these tests as well."}, "https://arxiv.org/abs/2107.03253": {"title": "Dynamic Ordered Panel Logit Models", "link": "https://arxiv.org/abs/2107.03253", "description": "arXiv:2107.03253v4 Announce Type: replace \nAbstract: This paper studies a dynamic ordered logit model for panel data with fixed effects. The main contribution of the paper is to construct a set of valid moment conditions that are free of the fixed effects. The moment functions can be computed using four or more periods of data, and the paper presents sufficient conditions for the moment conditions to identify the common parameters of the model, namely the regression coefficients, the autoregressive parameters, and the threshold parameters. The availability of moment conditions suggests that these common parameters can be estimated using the generalized method of moments, and the paper documents the performance of this estimator using Monte Carlo simulations and an empirical illustration on self-reported health status using the British Household Panel Survey."}, "https://arxiv.org/abs/2112.08199": {"title": "M-Estimation based on quasi-processes from discrete samples of Levy processes", "link": "https://arxiv.org/abs/2112.08199", "description": "arXiv:2112.08199v3 Announce Type: replace \nAbstract: We consider M-estimation problems, where the target value is determined using a minimizer of an expected functional of a Levy process. With discrete observations from the Levy process, we can produce a \"quasi-path\" by shuffling increments of the Levy process; we call this a quasi-process. Under a suitable sampling scheme, a quasi-process can converge weakly to the true process according to the properties of the stationary and independent increments. Using this resampling technique, we can estimate objective functionals similar to those estimated using Monte Carlo simulations, and the resulting estimate can serve as a contrast function. 
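The permutation mechanics behind the two monotone-trend entries above (arXiv:2404.06238 and arXiv:2404.06239) can be sketched as below. The studentizer used here is the classical no-ties Mann-Kendall variance n(n-1)(2n+5)/18, a simplification; the papers replace it with a dependence-robust estimate so that the test remains asymptotically valid for weakly dependent series.

import numpy as np

def mann_kendall_stat(x):
    # Classical Mann-Kendall statistic S = sum_{i<j} sign(x_j - x_i).
    s = 0.0
    for i in range(len(x) - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()
    return s

def mk_permutation_test(x, n_perm=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    scale = np.sqrt(n * (n - 1) * (2 * n + 5) / 18.0)        # i.i.d. variance (simplification)
    t_obs = mann_kendall_stat(x) / scale
    perm = np.array([mann_kendall_stat(rng.permutation(x)) / scale for _ in range(n_perm)])
    return np.mean(np.abs(perm) >= abs(t_obs))               # two-sided permutation p-value

rng = np.random.default_rng(4)
x = 0.02 * np.arange(200) + rng.normal(size=200)             # mild upward trend
print(mk_permutation_test(x))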
The M-estimator based on these quasi-processes can be consistent and asymptotically normal."}, "https://arxiv.org/abs/2212.14444": {"title": "Empirical Bayes When Estimation Precision Predicts Parameters", "link": "https://arxiv.org/abs/2212.14444", "description": "arXiv:2212.14444v4 Announce Type: replace \nAbstract: Empirical Bayes methods usually maintain a prior independence assumption: The unknown parameters of interest are independent of the known standard errors of the estimates. This assumption is often theoretically questionable and empirically rejected. This paper instead models the conditional distribution of the parameter given the standard errors as a flexibly parametrized family of distributions, leading to a family of methods that we call CLOSE. This paper establishes that (i) CLOSE is rate-optimal for squared error Bayes regret, (ii) squared error regret control is sufficient for an important class of economic decision problems, and (iii) CLOSE is worst-case robust when our assumption on the conditional distribution is misspecified. Empirically, using CLOSE leads to sizable gains for selecting high-mobility Census tracts. Census tracts selected by CLOSE are substantially more mobile on average than those selected by the standard shrinkage method."}, "https://arxiv.org/abs/2306.16091": {"title": "Adaptive functional principal components analysis", "link": "https://arxiv.org/abs/2306.16091", "description": "arXiv:2306.16091v2 Announce Type: replace \nAbstract: Functional data analysis almost always involves smoothing discrete observations into curves, because they are never observed in continuous time and rarely without error. Although smoothing parameters affect the subsequent inference, data-driven methods for selecting these parameters are not well-developed, frustrated by the difficulty of using all the information shared by curves while being computationally efficient. On the one hand, smoothing individual curves in an isolated, albeit sophisticated way, ignores useful signals present in other curves. On the other hand, bandwidth selection by automatic procedures such as cross-validation after pooling all the curves together quickly becomes computationally infeasible due to the large number of data points. In this paper we propose a new data-driven, adaptive kernel smoothing method, specifically tailored for functional principal components analysis through the derivation of sharp, explicit risk bounds for the eigen-elements. The minimization of these quadratic risk bounds provides refined, yet computationally efficient, bandwidth rules for each eigen-element separately. Both common and independent design cases are allowed. Rates of convergence for the estimators are derived. An extensive simulation study, designed in a versatile manner to closely mimic the characteristics of real data sets, supports our methodological contribution. An illustration on a real data application is provided."}, "https://arxiv.org/abs/2308.04325": {"title": "A Spatial Autoregressive Graphical Model with Applications in Intercropping", "link": "https://arxiv.org/abs/2308.04325", "description": "arXiv:2308.04325v3 Announce Type: replace \nAbstract: Within the statistical literature, a significant gap exists in methods capable of modeling asymmetric multivariate spatial effects that elucidate the relationships underlying complex spatial phenomena. 
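The idea in the CLOSE entry above (arXiv:2212.14444), letting the prior for each parameter depend on its known standard error, can be sketched with a crude binned moment estimator. The binning and normal-normal shrinkage below are simplifying stand-ins for the paper's flexibly parametrized conditional model, used here only to illustrate the shape of the calculation.

import numpy as np

def precision_dependent_shrinkage(x, sigma, n_bins=10):
    # Within bins of sigma, estimate the conditional prior mean/variance by moments,
    # then shrink each estimate toward its bin-specific prior mean.
    x, sigma = np.asarray(x, float), np.asarray(sigma, float)
    edges = np.quantile(sigma, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, sigma, side="right") - 1, 0, n_bins - 1)
    post = np.empty_like(x)
    for b in range(n_bins):
        idx = bins == b
        m_b = x[idx].mean()
        # prior variance: variance of estimates minus average sampling variance
        s2_b = max(x[idx].var() - (sigma[idx] ** 2).mean(), 1e-8)
        w = s2_b / (s2_b + sigma[idx] ** 2)          # shrinkage weights
        post[idx] = w * x[idx] + (1 - w) * m_b
    return post

rng = np.random.default_rng(7)
sigma = rng.uniform(0.2, 2.0, 1000)
theta = 0.5 * sigma + rng.normal(0, 0.3, 1000)       # parameters correlated with precision
x = theta + sigma * rng.normal(size=1000)
print(np.round(precision_dependent_shrinkage(x, sigma)[:5], 3))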
For such a phenomenon, observations at any location are expected to arise from a combination of within- and between-location effects, where the latter exhibit asymmetry. This asymmetry is represented by heterogeneous spatial effects between locations belonging to different categories, where the category is a feature inherent to each location in the data; based on the feature label, asymmetric spatial relations are postulated between neighbouring locations with different labels. Our novel approach synergises the principles of multivariate spatial autoregressive models and the Gaussian graphical model. This synergy enables us to effectively address the gap by accommodating asymmetric spatial relations, overcoming the usual constraints in spatial analyses. Using a Bayesian estimation framework, model performance is assessed in a simulation study. We apply the model to intercropping data, where spatial effects between different crops are unlikely to be symmetric, in order to illustrate the usage of the proposed methodology. An R package containing the proposed methodology can be found on https://CRAN.R-project.org/package=SAGM."}, "https://arxiv.org/abs/2308.05456": {"title": "Optimally weighted average derivative effects", "link": "https://arxiv.org/abs/2308.05456", "description": "arXiv:2308.05456v2 Announce Type: replace \nAbstract: Weighted average derivative effects (WADEs) are nonparametric estimands with uses in economics and causal inference. Debiased WADE estimators typically require learning the conditional mean outcome as well as a Riesz representer (RR) that characterises the requisite debiasing corrections. RR estimators for WADEs often rely on kernel estimators, introducing complicated bandwidth-dependent biases. In our work we propose a new class of RRs that are isomorphic to the class of WADEs and we derive the WADE weight that is optimal, in the sense of having the minimum nonparametric efficiency bound. Our optimal WADE estimators require estimating conditional expectations only (e.g. using machine learning), thus overcoming the limitations of kernel estimators. Moreover, we connect our optimal WADE to projection parameters in partially linear models. We ascribe a causal interpretation to WADE and projection parameters in terms of so-called incremental effects. We propose efficient estimators for two WADE estimands in our class, which we evaluate in a numerical experiment and use to determine the effect of Warfarin dose on blood clotting function."}, "https://arxiv.org/abs/2404.06565": {"title": "Confidence Intervals on Multivariate Normal Quantiles for Environmental Specification Development in Multi-axis Shock and Vibration Testing", "link": "https://arxiv.org/abs/2404.06565", "description": "arXiv:2404.06565v1 Announce Type: new \nAbstract: This article describes two Monte Carlo methods for calculating confidence intervals on cumulative distribution function (CDF) based multivariate normal quantiles that allow for controlling the tail regions of a multivariate distribution where one is most concerned about extreme responses. The CDF based multivariate normal quantiles associated with bivariate distributions are represented as contours, and for trivariate distributions as iso-surfaces. We first provide a novel methodology for an inverse problem, characterizing the uncertainty on the $\\tau^{\\mathrm{th}}$ multivariate quantile probability, when using concurrent univariate quantile probabilities. 
The uncertainty on the $\\tau^{\\mathrm{th}}$ multivariate quantile probability demonstrates the inadequacy of univariate methods, which neglect the correlation between variates. Limitations of traditional multivariate normal tolerance regions and simultaneous univariate tolerance methods are discussed, thereby motivating the need for confidence intervals on CDF based multivariate normal quantiles. Two Monte Carlo methods are discussed; the first calculates the CDF over a tessellated domain and then takes a bootstrap confidence interval over the tessellated CDF. The CDF based multivariate quantiles are then estimated from the CDF confidence intervals. For the second method, only the point associated with the highest probability density along the CDF based quantile is calculated, which greatly improves the computational speed compared to the first method. Monte Carlo simulation studies are used to assess the performance of the various methods. Finally, real data analysis is performed to illustrate a workflow for CDF based multivariate normal quantiles in the domain of mechanical shock and vibration to specify a minimum conservative test level for environmental specification."}, "https://arxiv.org/abs/2404.06602": {"title": "A General Identification Algorithm For Data Fusion Problems Under Systematic Selection", "link": "https://arxiv.org/abs/2404.06602", "description": "arXiv:2404.06602v1 Announce Type: new \nAbstract: Causal inference is made challenging by confounding, selection bias, and other complications. A common approach to addressing these difficulties is the inclusion of auxiliary data on the superpopulation of interest. Such data may measure a different set of variables, or be obtained under different experimental conditions than the primary dataset. Analysis based on multiple datasets must carefully account for similarities between datasets, while appropriately accounting for differences.\n In addition, selection of experimental units into different datasets may be systematic; similar difficulties are encountered in missing data problems. Existing methods for combining datasets either do not consider this issue, or assume simple selection mechanisms.\n In this paper, we provide a general approach, based on graphical causal models, for causal inference from data on the same superpopulation that is obtained under different experimental conditions. Our framework allows both arbitrary unobserved confounding, and arbitrary selection processes into different experimental regimes in our data.\n We describe how systematic selection processes may be organized into a hierarchy similar to censoring processes in missing data: selected completely at random (SCAR), selected at random (SAR), and selected not at random (SNAR). In addition, we provide a general identification algorithm for interventional distributions in this setting."}, "https://arxiv.org/abs/2404.06698": {"title": "Bayesian Model Selection with Latent Group-Based Effects and Variances with the R Package slgf", "link": "https://arxiv.org/abs/2404.06698", "description": "arXiv:2404.06698v1 Announce Type: new \nAbstract: Linear modeling is ubiquitous, but performance can suffer when the model is misspecified. We have recently demonstrated that latent groupings in the levels of categorical predictors can complicate inference in a variety of fields including bioinformatics, agriculture, industry, engineering, and medicine. 
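The inverse problem described in the multivariate quantile entry above (arXiv:2404.06565) can be illustrated with a small Monte Carlo calculation: for a bivariate standard normal with correlation rho, the joint probability attained when each margin is held at its own tau-th quantile falls short of tau, which is the motivation for correlation-aware CDF-based quantiles. The bivariate normal setup and the simple binomial interval are illustrative assumptions, not the paper's bootstrap procedure.

import numpy as np
from scipy.stats import norm

def joint_prob_of_univariate_quantiles(tau, corr, n_mc=200_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, corr], [corr, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_mc)
    q = norm.ppf(tau)                         # common univariate tau-th quantile
    hit = np.all(z <= q, axis=1)              # both components below their marginal quantile
    p_hat = hit.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / n_mc)  # simple binomial standard error
    return p_hat, (p_hat - 1.96 * se, p_hat + 1.96 * se)

print(joint_prob_of_univariate_quantiles(tau=0.95, corr=0.3))   # noticeably below 0.95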
Here we present the R package slgf which enables the user to easily implement our recently-developed approach to detect group-based regression effects, latent interactions, and/or heteroscedastic error variance through Bayesian model selection. We focus on the scenario in which the levels of a categorical predictor exhibit two latent groups. We treat the detection of this grouping structure as an unsupervised learning problem by searching the space of possible groupings of factor levels. First we review the suspected latent grouping factor (SLGF) method. Next, using both observational and experimental data, we illustrate the usage of slgf in the context of several common linear model layouts: one-way analysis of variance (ANOVA), analysis of covariance (ANCOVA), a two-way replicated layout, and a two-way unreplicated layout. We have selected data that reveal the shortcomings of classical analyses to emphasize the advantage our method can provide when a latent grouping structure is present."}, "https://arxiv.org/abs/2404.06701": {"title": "Covariance Regression with High-Dimensional Predictors", "link": "https://arxiv.org/abs/2404.06701", "description": "arXiv:2404.06701v1 Announce Type: new \nAbstract: In the high-dimensional landscape, addressing the challenges of covariance regression with high-dimensional covariates has posed difficulties for conventional methodologies. This paper addresses these hurdles by presenting a novel approach for high-dimensional inference with covariance matrix outcomes. The proposed methodology is illustrated through its application in elucidating brain coactivation patterns observed in functional magnetic resonance imaging (fMRI) experiments and unraveling complex associations within anatomical connections between brain regions identified through diffusion tensor imaging (DTI). In the pursuit of dependable statistical inference, we introduce an integrative approach based on penalized estimation. This approach combines data splitting, variable selection, aggregation of low-dimensional estimators, and robust variance estimation. It enables the construction of reliable confidence intervals for covariate coefficients, supported by theoretical confidence levels under specified conditions, where asymptotic distributions are provided. Through various types of simulation studies, the proposed approach performs well for covariance regression in the presence of high-dimensional covariates. This innovative approach is applied to the Lifespan Human Connectome Project (HCP) Aging Study, which aims to uncover a typical aging trajectory and variations in the brain connectome among mature and older adults. The proposed approach effectively identifies brain networks and associated predictors of white matter integrity, aligning with established knowledge of the human brain."}, "https://arxiv.org/abs/2404.06803": {"title": "A new way to evaluate G-Wishart normalising constants via Fourier analysis", "link": "https://arxiv.org/abs/2404.06803", "description": "arXiv:2404.06803v1 Announce Type: new \nAbstract: The G-Wishart distribution is an essential component for the Bayesian analysis of Gaussian graphical models as the conjugate prior for the precision matrix. Evaluating the marginal likelihood of such models usually requires computing high-dimensional integrals to determine the G-Wishart normalising constant. 
Closed-form results are known for decomposable or chordal graphs, while an explicit representation as a formal series expansion has been derived recently for general graphs. The nested infinite sums, however, do not lend themselves to computation, remaining of limited practical value. Borrowing techniques from random matrix theory and Fourier analysis, we provide novel exact results well suited to the numerical evaluation of the normalising constant for a large class of graphs beyond chordal graphs. Furthermore, they open new possibilities for developing more efficient sampling schemes for Bayesian inference of Gaussian graphical models."}, "https://arxiv.org/abs/2404.06837": {"title": "Sensitivity analysis for publication bias in meta-analysis of sparse data based on exact likelihood", "link": "https://arxiv.org/abs/2404.06837", "description": "arXiv:2404.06837v1 Announce Type: new \nAbstract: Meta-analysis is a powerful tool to synthesize findings from multiple studies. The normal-normal random-effects model is widely used to account for between-study heterogeneity. However, meta-analysis of sparse data, which may arise when the event rate is low for binary or count outcomes, poses a challenge to the normal-normal random-effects model in terms of the accuracy and stability of inference, since the normal approximation to the within-study likelihood may be poor. To reduce bias arising from data sparsity, the generalized linear mixed model can be used by replacing the approximate normal within-study likelihood with an exact likelihood. Publication bias is one of the most serious threats in meta-analysis. Several objective sensitivity analysis methods for evaluating potential impacts of selective publication are available for the normal-normal random-effects model. We propose a sensitivity analysis method by extending the likelihood-based sensitivity analysis with the $t$-statistic selection function of Copas to several generalized linear mixed-effects models. In applications to several real-world meta-analyses and in simulation studies, the proposed method is shown to outperform the likelihood-based sensitivity analysis based on the normal-normal model. The proposed method provides useful guidance for addressing publication bias in meta-analysis of sparse data."}, "https://arxiv.org/abs/2404.06967": {"title": "Multiple imputation for longitudinal data: A tutorial", "link": "https://arxiv.org/abs/2404.06967", "description": "arXiv:2404.06967v1 Announce Type: new \nAbstract: Longitudinal studies are frequently used in medical research and involve collecting repeated measures on individuals over time. Observations from the same individual are invariably correlated and thus an analytic approach that accounts for this clustering by individual is required. While almost all research suffers from missing data, this can be particularly problematic in longitudinal studies as participation often becomes harder to maintain over time. Multiple imputation (MI) is widely used to handle missing data in such studies. When using MI, it is important that the imputation model is compatible with the proposed analysis model. In a longitudinal analysis, this implies that the clustering considered in the analysis model should be reflected in the imputation process. Several MI approaches have been proposed to impute incomplete longitudinal data, such as treating repeated measurements of the same variable as distinct variables or using generalized linear mixed imputation models. 
However, the uptake of these methods has been limited, as they require additional data manipulation and use of advanced imputation procedures. In this tutorial, we review the available MI approaches that can be used for handling incomplete longitudinal data, including where individuals are clustered within higher-level clusters. We illustrate implementation with replicable R and Stata code using a case study from the Childhood to Adolescence Transition Study."}, "https://arxiv.org/abs/2404.06984": {"title": "Adaptive Strategy of Testing Alphas in High Dimensional Linear Factor Pricing Models", "link": "https://arxiv.org/abs/2404.06984", "description": "arXiv:2404.06984v1 Announce Type: new \nAbstract: In recent years, there has been considerable research on testing alphas in high-dimensional linear factor pricing models. In our study, we introduce a novel max-type test procedure that performs well under sparse alternatives. Furthermore, we demonstrate that this new max-type test procedure is asymptotically independent of the sum-type test procedure proposed by Pesaran and Yamagata (2017). Building on this, we propose a Fisher combination test procedure that exhibits good performance for both dense and sparse alternatives."}, "https://arxiv.org/abs/2404.06995": {"title": "Model-free Change-point Detection Using Modern Classifiers", "link": "https://arxiv.org/abs/2404.06995", "description": "arXiv:2404.06995v1 Announce Type: new \nAbstract: In contemporary data analysis, it is increasingly common to work with non-stationary complex datasets. These datasets typically extend beyond the classical low-dimensional Euclidean space, making it challenging to detect shifts in their distribution without relying on strong structural assumptions. This paper introduces a novel offline change-point detection method that leverages modern classifiers developed in the machine-learning community. With suitable data splitting, the test statistic is constructed through sequential computation of the Area Under the Curve (AUC) of a classifier, which is trained on data segments on both ends of the sequence. It is shown that the resulting AUC process attains its maximum at the true change-point location, which facilitates the change-point estimation. The proposed method is completely nonparametric, highly versatile and flexible, and free of stringent assumptions on the underlying data or the nature of the distributional shift. Theoretically, we derive the limiting pivotal distribution of the proposed test statistic under the null, as well as the asymptotic behaviors under both local and fixed alternatives. Weak consistency of the change-point estimator is established. Extensive simulation studies and the analysis of two real-world datasets illustrate the superior performance of our approach compared to existing model-free change-point detection methods."}, "https://arxiv.org/abs/2404.07136": {"title": "To impute or not to? Testing multivariate normality on incomplete dataset: Revisiting the BHEP test", "link": "https://arxiv.org/abs/2404.07136", "description": "arXiv:2404.07136v1 Announce Type: new \nAbstract: In this paper, we focus on testing multivariate normality using the BHEP test with data that are missing completely at random. 
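The classifier-AUC scan in the model-free change-point entry above (arXiv:2404.06995) can be sketched as follows; logistic regression, the odd/even data split, and the candidate grid are simplifying choices for illustration, not the paper's exact construction or its theory.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_change_point(X, grid=None):
    # For each candidate location, label observations by which side they fall on,
    # train on odd indices, evaluate AUC on even indices, and take the argmax.
    n = len(X)
    if grid is None:
        grid = np.arange(int(0.1 * n), int(0.9 * n), max(1, n // 50))
    train, test = np.arange(0, n, 2), np.arange(1, n, 2)
    aucs = []
    for t in grid:
        y = (np.arange(n) >= t).astype(int)
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        scores = clf.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], scores))
    return grid[int(np.argmax(aucs))], max(aucs)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(150, 5)), rng.normal(0.8, 1, size=(150, 5))])
print(auc_change_point(X))   # estimated location should be near 150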
Our objective is twofold: first, to gain insight into the asymptotic behavior of BHEP test statistics under two widely used approaches for handling missing data, namely complete-case analysis and imputation, and second, to compare the power performance of test statistic under these approaches. It is observed that under the imputation approach, the affine invariance of test statistics is not preserved. To address this issue, we propose an appropriate bootstrap algorithm for approximating p-values. Extensive simulation studies demonstrate that both mean and median approaches exhibit greater power compared to testing with complete-case analysis, and open some questions for further research."}, "https://arxiv.org/abs/2404.07141": {"title": "High-dimensional copula-based Wasserstein dependence", "link": "https://arxiv.org/abs/2404.07141", "description": "arXiv:2404.07141v1 Announce Type: new \nAbstract: We generalize 2-Wasserstein dependence coefficients to measure dependence between a finite number of random vectors. This generalization includes theoretical properties, and in particular focuses on an interpretation of maximal dependence and an asymptotic normality result for a proposed semi-parametric estimator under a Gaussian copula assumption. In addition, we discuss general axioms for dependence measures between multiple random vectors, other plausible normalizations, and various examples. Afterwards, we look into plug-in estimators based on penalized empirical covariance matrices in order to deal with high dimensionality issues and take possible marginal independencies into account by inducing (block) sparsity. The latter ideas are investigated via a simulation study, considering other dependence coefficients as well. We illustrate the use of the developed methods in two real data applications."}, "https://arxiv.org/abs/2404.06681": {"title": "Causal Unit Selection using Tractable Arithmetic Circuits", "link": "https://arxiv.org/abs/2404.06681", "description": "arXiv:2404.06681v1 Announce Type: cross \nAbstract: The unit selection problem aims to find objects, called units, that optimize a causal objective function which describes the objects' behavior in a causal context (e.g., selecting customers who are about to churn but would most likely change their mind if encouraged). While early studies focused mainly on bounding a specific class of counterfactual objective functions using data, more recent work allows one to find optimal units exactly by reducing the causal objective to a classical objective on a meta-model, and then applying a variant of the classical Variable Elimination (VE) algorithm to the meta-model -- assuming a fully specified causal model is available. In practice, however, finding optimal units using this approach can be very expensive because the used VE algorithm must be exponential in the constrained treewidth of the meta-model, which is larger and denser than the original model. We address this computational challenge by introducing a new approach for unit selection that is not necessarily limited by the constrained treewidth. This is done through compiling the meta-model into a special class of tractable arithmetic circuits that allows the computation of optimal units in time linear in the circuit size. 
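One common parametrization of the BHEP statistic referenced in the entry above (arXiv:2404.07136) is sketched below, applied to complete cases only, the simpler of the two missing-data strategies discussed there. The tuning parameter beta and the complete-case strategy are assumptions for illustration; critical values would come from the bootstrap procedure the paper proposes, which is not shown.

import numpy as np
from scipy.linalg import inv, sqrtm

def bhep_statistic(X, beta=1.0):
    X = X[~np.isnan(X).any(axis=1)]                      # complete-case analysis
    n, d = X.shape
    S = np.cov(X, rowvar=False)
    Y = (X - X.mean(axis=0)) @ inv(np.real(sqrtm(S)))    # scaled residuals
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    norms = (Y ** 2).sum(axis=1)
    t1 = np.exp(-beta ** 2 * sq / 2.0).sum() / n
    t2 = 2.0 * (1 + beta ** 2) ** (-d / 2) * np.exp(-beta ** 2 * norms / (2 * (1 + beta ** 2))).sum()
    t3 = n * (1 + 2 * beta ** 2) ** (-d / 2)
    return t1 - t2 + t3

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
X[rng.uniform(size=300) < 0.1, 0] = np.nan               # 10% of one column missing completely at random
print(bhep_statistic(X))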
We finally present empirical results on random causal models that show order-of-magnitude speedups based on the proposed method for solving unit selection."}, "https://arxiv.org/abs/2404.06735": {"title": "A Copula Graphical Model for Multi-Attribute Data using Optimal Transport", "link": "https://arxiv.org/abs/2404.06735", "description": "arXiv:2404.06735v1 Announce Type: cross \nAbstract: Motivated by modern data forms such as images and multi-view data, the multi-attribute graphical model aims to explore the conditional independence structure among vectors. Under the Gaussian assumption, the conditional independence between vectors is characterized by blockwise zeros in the precision matrix. To relax the restrictive Gaussian assumption, in this paper, we introduce a novel semiparametric multi-attribute graphical model based on a new copula named Cyclically Monotone Copula. This new copula treats the distribution of the node vectors as multivariate marginals and transforms them into Gaussian distributions based on the optimal transport theory. Since the model allows the node vectors to have arbitrary continuous distributions, it is more flexible than the classical Gaussian copula method that performs coordinatewise Gaussianization. We establish the concentration inequalities of the estimated covariance matrices and provide sufficient conditions for selection consistency of the group graphical lasso estimator. For the setting with high-dimensional attributes, a {Projected Cyclically Monotone Copula} model is proposed to address the curse of dimensionality issue that arises from solving high-dimensional optimal transport problems. Numerical results based on synthetic and real data show the efficiency and flexibility of our methods."}, "https://arxiv.org/abs/2404.07100": {"title": "A New Statistic for Testing Covariance Equality in High-Dimensional Gaussian Low-Rank Models", "link": "https://arxiv.org/abs/2404.07100", "description": "arXiv:2404.07100v1 Announce Type: cross \nAbstract: In this paper, we consider the problem of testing equality of the covariance matrices of L complex Gaussian multivariate time series of dimension $M$ . We study the special case where each of the L covariance matrices is modeled as a rank K perturbation of the identity matrix, corresponding to a signal plus noise model. A new test statistic based on the estimates of the eigenvalues of the different covariance matrices is proposed. In particular, we show that this statistic is consistent and with controlled type I error in the high-dimensional asymptotic regime where the sample sizes $N_1,\\ldots,N_L$ of each time series and the dimension $M$ both converge to infinity at the same rate, while $K$ and $L$ are kept fixed. We also provide some simulations on synthetic and real data (SAR images) which demonstrate significant improvements over some classical methods such as the GLRT, or other alternative methods relevant for the high-dimensional regime and the low-rank model."}, "https://arxiv.org/abs/2106.05031": {"title": "Estimation of Optimal Dynamic Treatment Assignment Rules under Policy Constraints", "link": "https://arxiv.org/abs/2106.05031", "description": "arXiv:2106.05031v4 Announce Type: replace \nAbstract: This paper studies statistical decisions for dynamic treatment assignment problems. 
Many policies involve dynamics in their treatment assignments where treatments are sequentially assigned to individuals across multiple stages and the effect of treatment at each stage is usually heterogeneous with respect to the prior treatments, past outcomes, and observed covariates. We consider estimating an optimal dynamic treatment rule that guides the optimal treatment assignment for each individual at each stage based on the individual's history. This paper proposes an empirical welfare maximization approach in a dynamic framework. The approach estimates the optimal dynamic treatment rule using data from an experimental or quasi-experimental study. The paper proposes two estimation methods: one solves the treatment assignment problem at each stage through backward induction, and the other solves the whole dynamic treatment assignment problem simultaneously across all stages. We derive finite-sample upper bounds on worst-case average welfare regrets for the proposed methods and show $1/\\sqrt{n}$-minimax convergence rates. We also modify the simultaneous estimation method to incorporate intertemporal budget/capacity constraints."}, "https://arxiv.org/abs/2309.11772": {"title": "Active Learning for a Recursive Non-Additive Emulator for Multi-Fidelity Computer Experiments", "link": "https://arxiv.org/abs/2309.11772", "description": "arXiv:2309.11772v2 Announce Type: replace \nAbstract: Computer simulations have become essential for analyzing complex systems, but high-fidelity simulations often come with significant computational costs. To tackle this challenge, multi-fidelity computer experiments have emerged as a promising approach that leverages both low-fidelity and high-fidelity simulations, enhancing both the accuracy and efficiency of the analysis. In this paper, we introduce a new and flexible statistical model, the Recursive Non-Additive (RNA) emulator, that integrates the data from multi-fidelity computer experiments. Unlike conventional multi-fidelity emulation approaches that rely on an additive auto-regressive structure, the proposed RNA emulator recursively captures the relationships between multi-fidelity data using Gaussian process priors without making the additive assumption, allowing the model to accommodate more complex data patterns. Importantly, we derive the posterior predictive mean and variance of the emulator, which can be efficiently computed in a closed-form manner, leading to significant improvements in computational efficiency. Additionally, based on this emulator, we introduce three active learning strategies that optimize the balance between accuracy and simulation costs to guide the selection of the fidelity level and input locations for the next simulation run. We demonstrate the effectiveness of the proposed approach in a suite of synthetic examples and a real-world problem. An R package RNAmf for the proposed methodology is provided on CRAN."}, "https://arxiv.org/abs/2401.06383": {"title": "Decomposition with Monotone B-splines: Fitting and Testing", "link": "https://arxiv.org/abs/2401.06383", "description": "arXiv:2401.06383v2 Announce Type: replace \nAbstract: A univariate continuous function can always be decomposed as the sum of a non-increasing function and a non-decreasing one. Based on this property, we propose a non-parametric regression method that combines two spline-fitted monotone curves. 
We demonstrate by extensive simulations that, compared to standard spline-fitting methods, the proposed approach is particularly advantageous in high-noise scenarios. Several theoretical guarantees are established for the proposed approach. Additionally, we present statistics to test the monotonicity of a function based on monotone decomposition, which can better control the Type I error and achieve comparable (if not higher) power than existing methods. Finally, we apply the proposed fitting and testing approaches to analyze single-cell pseudotime trajectory datasets, identifying significant biological insights for non-monotonically expressed genes through Gene Ontology enrichment analysis. The source code implementing the methodology and producing all results is accessible at https://github.com/szcf-weiya/MonotoneDecomposition.jl."}, "https://arxiv.org/abs/2401.07625": {"title": "Statistics in Survey Sampling", "link": "https://arxiv.org/abs/2401.07625", "description": "arXiv:2401.07625v2 Announce Type: replace \nAbstract: Survey sampling theory and methods are introduced. Sampling designs and estimation methods are carefully discussed in the style of a textbook on survey sampling. Topics include Horvitz-Thompson estimation, simple random sampling, stratified sampling, cluster sampling, ratio estimation, regression estimation, variance estimation, two-phase sampling, and nonresponse adjustment methods."}, "https://arxiv.org/abs/2302.01831": {"title": "Trade-off between predictive performance and FDR control for high-dimensional Gaussian model selection", "link": "https://arxiv.org/abs/2302.01831", "description": "arXiv:2302.01831v3 Announce Type: replace-cross \nAbstract: In the context of high-dimensional Gaussian linear regression for ordered variables, we study the variable selection procedure via the minimization of the penalized least-squares criterion. We focus on model selection where the penalty function depends on an unknown multiplicative constant commonly calibrated for prediction. We propose a new proper calibration of this hyperparameter to simultaneously control predictive risk and false discovery rate. We obtain non-asymptotic bounds on the False Discovery Rate with respect to the hyperparameter and we provide an algorithm to calibrate it. This algorithm is based on quantities that can typically be observed in real data applications. The algorithm is validated in an extensive simulation study and is compared with some existing variable selection procedures. Finally, we study an extension of our approach to the case in which an ordering of the variables is not available."}, "https://arxiv.org/abs/2302.08070": {"title": "Local Causal Discovery for Estimating Causal Effects", "link": "https://arxiv.org/abs/2302.08070", "description": "arXiv:2302.08070v4 Announce Type: replace-cross \nAbstract: Even when the causal graph underlying our data is unknown, we can use observational data to narrow down the possible values that an average treatment effect (ATE) can take by (1) identifying the graph up to a Markov equivalence class; and (2) estimating that ATE for each graph in the class. While the PC algorithm can identify this class under strong faithfulness assumptions, it can be computationally prohibitive. Fortunately, only the local graph structure around the treatment is required to identify the set of possible ATE values, a fact exploited by local discovery algorithms to improve computational efficiency. 
In this paper, we introduce Local Discovery using Eager Collider Checks (LDECC), a new local causal discovery algorithm that leverages unshielded colliders to orient the treatment's parents differently from existing methods. We show that there exist graphs where LDECC exponentially outperforms existing local discovery algorithms and vice versa. Moreover, we show that LDECC and existing algorithms rely on different faithfulness assumptions, leveraging this insight to weaken the assumptions for identifying the set of possible ATE values."}, "https://arxiv.org/abs/2305.04174": {"title": "Root-n consistent semiparametric learning with high-dimensional nuisance functions under minimal sparsity", "link": "https://arxiv.org/abs/2305.04174", "description": "arXiv:2305.04174v2 Announce Type: replace-cross \nAbstract: Treatment effect estimation under unconfoundedness is a fundamental task in causal inference. In response to the challenge of analyzing high-dimensional datasets collected in substantive fields such as epidemiology, genetics, economics, and social sciences, many methods for treatment effect estimation with high-dimensional nuisance parameters (the outcome regression and the propensity score) have been developed in recent years. However, it is still unclear what is the necessary and sufficient sparsity condition on the nuisance parameters for the treatment effect to be $\\sqrt{n}$-estimable. In this paper, we propose a new Double-Calibration strategy that corrects the estimation bias of the nuisance parameter estimates computed by regularized high-dimensional techniques and demonstrate that the corresponding Doubly-Calibrated estimator achieves $1 / \\sqrt{n}$-rate as long as one of the nuisance parameters is sparse with sparsity below $\\sqrt{n} / \\log p$, where $p$ denotes the ambient dimension of the covariates, whereas the other nuisance parameter can be arbitrarily complex and completely misspecified. The Double-Calibration strategy can also be applied to settings other than treatment effect estimation, e.g. regression coefficient estimation in the presence of diverging number of controls in a semiparametric partially linear model."}, "https://arxiv.org/abs/2305.07276": {"title": "multilevLCA: An R Package for Single-Level and Multilevel Latent Class Analysis with Covariates", "link": "https://arxiv.org/abs/2305.07276", "description": "arXiv:2305.07276v2 Announce Type: replace-cross \nAbstract: This contribution presents a guide to the R package multilevLCA, which offers a complete and innovative set of technical tools for the latent class analysis of single-level and multilevel categorical data. We describe the available model specifications, mainly falling within the fixed-effect or random-effect approaches. Maximum likelihood estimation of the model parameters, enhanced by a refined initialization strategy, is implemented either simultaneously, i.e., in one-step, or by means of the more advantageous two-step estimator. The package features i) semi-automatic model selection when a priori information on the number of classes is lacking, ii) predictors of class membership, and iii) output visualization tools for any of the available model specifications. 
All functionalities are illustrated by means of a real application on citizenship norms data, which are available in the package."}, "https://arxiv.org/abs/2311.07474": {"title": "A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals", "link": "https://arxiv.org/abs/2311.07474", "description": "arXiv:2311.07474v2 Announce Type: replace-cross \nAbstract: Most prognostic methods require a decent amount of data for model training. In reality, however, the amount of historical data owned by a single organization might be small or not large enough to train a reliable prognostic model. To address this challenge, this article proposes a federated prognostic model that allows multiple users to jointly construct a failure time prediction model using their multi-stream, high-dimensional, and incomplete data while keeping each user's data local and confidential. The prognostic model first employs multivariate functional principal component analysis to fuse the multi-stream degradation signals. Then, the fused features coupled with the times-to-failure are utilized to build a (log)-location-scale regression model for failure prediction. To estimate parameters using distributed datasets and keep the data privacy of all participants, we propose a new federated algorithm for feature extraction. Numerical studies indicate that the performance of the proposed model is the same as that of classic non-federated prognostic models and is better than that of the models constructed by each user itself."}, "https://arxiv.org/abs/2404.07248": {"title": "Parametric estimation of conditional Archimedean copula generators for censored data", "link": "https://arxiv.org/abs/2404.07248", "description": "arXiv:2404.07248v1 Announce Type: new \nAbstract: In this paper, we propose a novel approach for estimating Archimedean copula generators in a conditional setting, incorporating endogenous variables. Our method allows for the evaluation of the impact of the different levels of covariates on both the strength and shape of dependence by directly estimating the generator function rather than the copula itself. As such, we contribute to relaxing the simplifying assumption inherent in traditional copula modeling. We demonstrate the effectiveness of our methodology through applications in two diverse settings: a diabetic retinopathy study and a claims reserving analysis. In both cases, we show how considering the influence of covariates enables a more accurate capture of the underlying dependence structure in the data, thus enhancing the applicability of copula models, particularly in actuarial contexts."}, "https://arxiv.org/abs/2404.07323": {"title": "Surrogate modeling for probability distribution estimation:uniform or adaptive design?", "link": "https://arxiv.org/abs/2404.07323", "description": "arXiv:2404.07323v1 Announce Type: new \nAbstract: The active learning (AL) technique, one of the state-of-the-art methods for constructing surrogate models, has shown high accuracy and efficiency in forward uncertainty quantification (UQ) analysis. This paper provides a comprehensive study on AL-based global surrogates for computing the full distribution function, i.e., the cumulative distribution function (CDF) and the complementary CDF (CCDF). To this end, we investigate the three essential components for building surrogates, i.e., types of surrogate models, enrichment methods for experimental designs, and stopping criteria. 
For each component, we choose several representative methods and study their desirable configurations. In addition, we devise a uniform design (i.e., space-filling design) as a baseline for measuring the improvement of using AL. Combining all the representative methods, a total of 1,920 UQ analyses are carried out to solve 16 benchmark examples. The performance of the selected strategies is evaluated based on accuracy and efficiency. In the context of full distribution estimation, this study concludes that (i) AL techniques cannot provide a systematic improvement compared with uniform designs, (ii) the recommended surrogate modeling methods depend on the features of the problems (especially the local nonlinearity), target accuracy, and computational budget."}, "https://arxiv.org/abs/2404.07397": {"title": "Mediated probabilities of causation", "link": "https://arxiv.org/abs/2404.07397", "description": "arXiv:2404.07397v1 Announce Type: new \nAbstract: We propose a set of causal estimands that we call ``the mediated probabilities of causation.'' These estimands quantify the probabilities that an observed negative outcome was induced via a mediating pathway versus a direct pathway in a stylized setting involving a binary exposure or intervention, a single binary mediator, and a binary outcome. We outline a set of conditions sufficient to identify these effects given observed data, and propose a doubly-robust projection based estimation strategy that allows for the use of flexible non-parametric and machine learning methods for estimation. We argue that these effects may be more relevant than the probability of causation, particularly in settings where we observe both some negative outcome and negative mediating event, and we wish to distinguish between settings where the outcome was induced via the exposure inducing the mediator versus the exposure inducing the outcome directly. We motivate our quantities of interest by discussing applications to legal and medical questions of causal attribution."}, "https://arxiv.org/abs/2404.07411": {"title": "Joint mixed-effects models for causal inference in clustered network-based observational studies", "link": "https://arxiv.org/abs/2404.07411", "description": "arXiv:2404.07411v1 Announce Type: new \nAbstract: Causal inference on populations embedded in social networks poses technical challenges, since the typical no interference assumption frequently does not hold. Existing methods developed in the context of network interference rely upon the assumption of no unmeasured confounding. However, when faced with multilevel network data, there may be a latent factor influencing both the exposure and the outcome at the cluster level. We propose a Bayesian inference approach that combines a joint mixed-effects model for the outcome and the exposure with direct standardization to identify and estimate causal effects in the presence of network interference and unmeasured cluster confounding. In simulations, we compare our proposed method with linear mixed and fixed effects models and show that unbiased estimation is achieved using the joint model. 
Having derived valid tools for estimation, we examine the effect of maternal college education on adolescent school performance using data from the National Longitudinal Study of Adolescent Health."}, "https://arxiv.org/abs/2404.07440": {"title": "Bayesian Penalized Transformation Models: Structured Additive Location-Scale Regression for Arbitrary Conditional Distributions", "link": "https://arxiv.org/abs/2404.07440", "description": "arXiv:2404.07440v1 Announce Type: new \nAbstract: Penalized transformation models (PTMs) are a novel form of location-scale regression. In PTMs, the shape of the response's conditional distribution is estimated directly from the data, and structured additive predictors are placed on its location and scale. The core of the model is a monotonically increasing transformation function that relates the response distribution to a reference distribution. The transformation function is equipped with a smoothness prior that regularizes how much the estimated distribution diverges from the reference distribution. These models can be seen as a bridge between conditional transformation models and generalized additive models for location, scale and shape. Markov chain Monte Carlo inference for PTMs can be conducted with the No-U-Turn sampler and offers straightforward uncertainty quantification for the conditional distribution as well as for the covariate effects. A simulation study demonstrates the effectiveness of the approach. We apply the model to data from the Fourth Dutch Growth Study and the Framingham Heart Study. A full-featured implementation is available as a Python library."}, "https://arxiv.org/abs/2404.07459": {"title": "Safe subspace screening for the adaptive nuclear norm regularized trace regression", "link": "https://arxiv.org/abs/2404.07459", "description": "arXiv:2404.07459v1 Announce Type: new \nAbstract: Matrix-form data sets arise in many areas, so there is a large body of work on matrix regression models. One special case is the adaptive nuclear norm regularized trace regression, which has been proven to have good statistical performance. To accelerate the computation of this model, we consider the screening rule technique. Based on a matrix decomposition and the optimality conditions of the model, we develop a safe subspace screening rule that can be used to identify the inactive subspace of the solution decomposition and reduce the dimension of the solution. To evaluate the efficiency of the safe subspace screening rule, we embed this result into the alternating direction method of multipliers algorithm over a sequence of tuning parameters. In this process, the solution at each tuning parameter provides a matrix decomposition space. Then, the safe subspace screening rule is applied to eliminate the inactive subspace, reduce the solution dimension, and accelerate the computation. Numerical experiments on simulated and real data sets illustrate the efficiency of our screening rule."}, "https://arxiv.org/abs/2404.07632": {"title": "Consistent Distribution Free Affine Invariant Tests for the Validity of Independent Component Models", "link": "https://arxiv.org/abs/2404.07632", "description": "arXiv:2404.07632v1 Announce Type: new \nAbstract: We propose a family of tests of the validity of the assumptions underlying independent component analysis methods. 
The tests are formulated as L2-type procedures based on characteristic functions and involve weights; a proper choice of these weights and the estimation method for the mixing matrix yields consistent and affine-invariant tests. Due to the complexity of the asymptotic null distribution of the resulting test statistics, implementation is based on permutational and resampling strategies. This leads to distribution-free procedures regardless of whether these procedures are performed on the estimated independent components themselves or the componentwise ranks of their components. A Monte Carlo study involving various estimation methods for the mixing matrix, various weights, and a competing test based on distance covariance is conducted under the null hypothesis as well as under alternatives. A real-data application demonstrates the practical utility and effectiveness of the method."}, "https://arxiv.org/abs/2404.07684": {"title": "Merger Analysis with Latent Price", "link": "https://arxiv.org/abs/2404.07684", "description": "arXiv:2404.07684v1 Announce Type: new \nAbstract: Standard empirical tools for merger analysis assume price data, which may not be readily available. This paper characterizes sufficient conditions for identifying the unilateral effects of mergers without price data. I show that revenues, margins, and revenue diversion ratios are sufficient for identifying the gross upward pricing pressure indices, impact on consumer/producer surplus, and compensating marginal cost reductions associated with a merger. I also describe assumptions on demand that facilitate the identification of revenue diversion ratios and merger simulations. I use the proposed framework to evaluate the Staples/Office Depot merger (2016)."}, "https://arxiv.org/abs/2404.07906": {"title": "WiNNbeta: Batch and drift correction method by white noise normalization for metabolomic studies", "link": "https://arxiv.org/abs/2404.07906", "description": "arXiv:2404.07906v1 Announce Type: new \nAbstract: We developed a method called batch and drift correction method by White Noise Normalization (WiNNbeta) to correct individual metabolites for batch effects and drifts. This method tests for white noise properties to identify metabolites in need of correction and corrects them by using fine-tuned splines. To test the method performance we applied WiNNbeta to LC-MS data from our metabolomic studies and computed CVs before and after WiNNbeta correction in quality control samples."}, "https://arxiv.org/abs/2404.07923": {"title": "A Bayesian Estimator of Sample Size", "link": "https://arxiv.org/abs/2404.07923", "description": "arXiv:2404.07923v1 Announce Type: new \nAbstract: We consider a Bayesian estimator of sample size (BESS) and an application to oncology dose optimization clinical trials. BESS is built upon balancing a trio of Sample size, Evidence from observed data, and Confidence in posterior inference. It uses a simple logic of \"given the evidence from data, a specific sample size can achieve a degree of confidence in the posterior inference.\" The key distinction between BESS and standard sample size estimation (SSE) is that SSE, typically based on Frequentist inference, specifies the true parameters values in its calculation while BESS assumes a possible outcome from the observed data. As a result, the calibration of the sample size is not based on Type I or Type II error rates, but on posterior probabilities. 
We argue that BESS leads to a more interpretable statement for investigators, and can easily accommodate prior information as well as sample size re-estimation. We explore its performance in comparison to SSE and demonstrate its usage through a case study of an oncology optimization trial. BESS can be applied to general hypothesis tests. R functions are available at https://ccte.uchicago.edu/bess."}, "https://arxiv.org/abs/2404.07222": {"title": "Liquidity Jump, Liquidity Diffusion, and Wash Trading of Crypto Assets", "link": "https://arxiv.org/abs/2404.07222", "description": "arXiv:2404.07222v1 Announce Type: cross \nAbstract: We propose that the liquidity of an asset includes two components: liquidity jump and liquidity diffusion. We find that the liquidity diffusion has a higher correlation with crypto wash trading than the liquidity jump. We demonstrate that the treatment of wash trading significantly reduces the liquidity diffusion, but only marginally reduces the liquidity jump. We find that the ARMA-GARCH/EGARCH models are highly effective in modeling the liquidity-adjusted return with and without the treatment of wash trading. We argue that the treatment of wash trading is unnecessary in modeling established crypto assets that trade in mainstream exchanges, even if these exchanges are unregulated."}, "https://arxiv.org/abs/2404.07238": {"title": "Application of the chemical master equation and its analytical solution to the illness-death model", "link": "https://arxiv.org/abs/2404.07238", "description": "arXiv:2404.07238v1 Announce Type: cross \nAbstract: The aim of this article is to relate the chemical master equation (CME) to the illness-death model for chronic diseases. We show that a recently developed differential equation for the prevalence directly follows from the CME. As an application, we use the theory of the CME in a simulation study about diabetes in Germany from a previous publication. We find good agreement between the theory and the simulations."}, "https://arxiv.org/abs/2404.07586": {"title": "State-Space Modeling of Shape-constrained Functional Time Series", "link": "https://arxiv.org/abs/2404.07586", "description": "arXiv:2404.07586v1 Announce Type: cross \nAbstract: Functional time series data frequently appear in economic applications, where the functions of interest are subject to some shape constraints, including monotonicity and convexity, as is typical in the estimation of the Lorenz curve. This paper proposes a state-space model for time-varying functions to extract trends and serial dependence from functional time series while imposing the shape constraints on the estimated functions. The function of interest is modeled by a convex combination of selected basis functions to satisfy the shape constraints, where the time-varying convex weights on the simplex follow dynamic multi-logit models. For the complicated likelihood of this model, a novel data augmentation technique is devised to enable posterior computation by an efficient Markov chain Monte Carlo method. 
The proposed method is applied to the estimation of time-varying Lorenz curves, and its utility is illustrated through numerical experiments and analysis of panel data of household incomes in Japan."}, "https://arxiv.org/abs/2404.07593": {"title": "Diffusion posterior sampling for simulation-based inference in tall data settings", "link": "https://arxiv.org/abs/2404.07593", "description": "arXiv:2404.07593v1 Announce Type: cross \nAbstract: Determining which parameters of a non-linear model could best describe a set of experimental data is a fundamental problem in science and it has gained much traction lately with the rise of complex large-scale simulators (a.k.a. black-box simulators). The likelihood of such models is typically intractable, which is why classical MCMC methods can not be used. Simulation-based inference (SBI) stands out in this context by only requiring a dataset of simulations to train deep generative models capable of approximating the posterior distribution that relates input parameters to a given observation. In this work, we consider a tall data extension in which multiple observations are available and one wishes to leverage their shared information to better infer the parameters of the model. The method we propose is built upon recent developments from the flourishing score-based diffusion literature and allows us to estimate the tall data posterior distribution simply using information from the score network trained on individual observations. We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost."}, "https://arxiv.org/abs/2404.07661": {"title": "Robust performance metrics for imbalanced classification problems", "link": "https://arxiv.org/abs/2404.07661", "description": "arXiv:2404.07661v1 Announce Type: cross \nAbstract: We show that established performance metrics in binary classification, such as the F-score, the Jaccard similarity coefficient or Matthews' correlation coefficient (MCC), are not robust to class imbalance in the sense that if the proportion of the minority class tends to $0$, the true positive rate (TPR) of the Bayes classifier under these metrics tends to $0$ as well. Thus, in imbalanced classification problems, these metrics favour classifiers which ignore the minority class. To alleviate this issue we introduce robust modifications of the F-score and the MCC for which, even in strongly imbalanced settings, the TPR is bounded away from $0$. We numerically illustrate the behaviour of the various performance metrics in simulations as well as on a credit default data set. We also discuss connections to the ROC and precision-recall curves and give recommendations on how to combine their usage with performance metrics."}, "https://arxiv.org/abs/2011.08661": {"title": "Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders", "link": "https://arxiv.org/abs/2011.08661", "description": "arXiv:2011.08661v3 Announce Type: replace \nAbstract: We consider estimation of average treatment effects given observational data with high-dimensional pretreatment variables. Existing methods for this problem typically assume some form of sparsity for the regression functions. 
In this work, we introduce a debiased inverse propensity score weighting (DIPW) scheme for average treatment effect estimation that delivers $\\sqrt{n}$-consistent estimates when the propensity score follows a sparse logistic regression model; the outcome regression functions are permitted to be arbitrarily complex. We further demonstrate how confidence intervals centred on our estimates may be constructed. Our theoretical results quantify the price to pay for permitting the regression functions to be unestimable, which shows up as an inflation of the variance of the estimator compared to the semiparametric efficient variance by a constant factor, under mild conditions. We also show that when outcome regressions can be estimated faster than a slow $1/\\sqrt{ \\log n}$ rate, our estimator achieves semiparametric efficiency. As our results accommodate arbitrary outcome regression functions, averages of transformed responses under each treatment may also be estimated at the $\\sqrt{n}$ rate. Thus, for example, the variances of the potential outcomes may be estimated. We discuss extensions to estimating linear projections of the heterogeneous treatment effect function and explain how propensity score models with more general link functions may be handled within our framework. An R package \\texttt{dipw} implementing our methodology is available on CRAN."}, "https://arxiv.org/abs/2104.13871": {"title": "Selection and Aggregation of Conformal Prediction Sets", "link": "https://arxiv.org/abs/2104.13871", "description": "arXiv:2104.13871v3 Announce Type: replace \nAbstract: Conformal prediction is a generic methodology for finite-sample valid distribution-free prediction. This technique has garnered a lot of attention in the literature partly because it can be applied with any machine learning algorithm that provides point predictions to yield valid prediction regions. Of course, the efficiency (width/volume) of the resulting prediction region depends on the performance of the machine learning algorithm. In the context of point prediction, several techniques (such as cross-validation) exist to select one of many machine learning algorithms for better performance. In contrast, such selection techniques are seldom discussed in the context of set prediction (or prediction regions). In this paper, we consider the problem of obtaining the smallest conformal prediction region given a family of machine learning algorithms. We provide two general-purpose selection algorithms and consider coverage as well as width properties of the final prediction region. The first selection method yields the smallest width prediction region among the family of conformal prediction regions for all sample sizes but only has an approximate coverage guarantee. The second selection method has a finite sample coverage guarantee but only attains close to the smallest width. The approximate optimal width property of the second method is quantified via an oracle inequality. As an illustration, we consider the use of aggregation of non-parametric regression estimators in the split conformal method with the absolute residual conformal score."}, "https://arxiv.org/abs/2203.09000": {"title": "Lorenz map, inequality ordering and curves based on multidimensional rearrangements", "link": "https://arxiv.org/abs/2203.09000", "description": "arXiv:2203.09000v3 Announce Type: replace \nAbstract: We propose a multivariate extension of the Lorenz curve based on multivariate rearrangements of optimal transport theory. 
We define a vector Lorenz map as the integral of the vector quantile map associated with a multivariate resource allocation. Each component of the Lorenz map is the cumulative share of each resource, as in the traditional univariate case. The pointwise ordering of such Lorenz maps defines a new multivariate majorization order, which is equivalent to preference by any social planner with an inequality-averse multivariate rank-dependent social evaluation functional. We define a family of multi-attribute Gini indices and a complete ordering based on the Lorenz map. We propose the level sets of an Inverse Lorenz Function as a practical tool to visualize and compare inequality in two dimensions, and apply it to income-wealth inequality in the United States between 1989 and 2022."}, "https://arxiv.org/abs/2207.04773": {"title": "Functional Regression Models with Functional Response: New Approaches and a Comparative Study", "link": "https://arxiv.org/abs/2207.04773", "description": "arXiv:2207.04773v4 Announce Type: replace \nAbstract: This paper proposes a new nonlinear approach for additive functional regression with functional response based on kernel methods, along with some slight reformulation and implementation of the linear regression and the spectral additive model. The latter methods have in common that the covariates and the response are represented in a basis and so can only be applied when the response and the covariates belong to a Hilbert space, while the proposed method only uses the distances among data and thus can be applied to situations where any of the covariates or the response is not Hilbertian but instead lies in a normed or even metric space with a real vector structure. A comparison of these methods with other procedures readily available in R is performed in a simulation study and on real datasets; the results show the advantages of the nonlinear proposals and the small loss of efficiency when the simulation scenario is truly linear. The comparison is done in the Hilbertian case as it is the only scenario where all the procedures can be compared. Finally, the supplementary material provides a visualization tool for checking the linearity of the relationship between a single covariate and the response and a link to a GitHub repository where the code and data are available."}, "https://arxiv.org/abs/2207.09101": {"title": "Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability", "link": "https://arxiv.org/abs/2207.09101", "description": "arXiv:2207.09101v3 Announce Type: replace \nAbstract: Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome; the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., false positive and false negative rate) or their lower bounds. 
We draw connections between the inter-rater reliability and the binary classification metrics, showing that binary classification metrics depend solely on the IRR coefficient and proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible other uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement and implement the computations in IRR2FPR R package."}, "https://arxiv.org/abs/2301.08324": {"title": "Differentially Private Confidence Intervals for Proportions under Stratified Random Sampling", "link": "https://arxiv.org/abs/2301.08324", "description": "arXiv:2301.08324v2 Announce Type: replace \nAbstract: Confidence intervals are a fundamental tool for quantifying the uncertainty of parameters of interest. With the increase of data privacy awareness, developing a private version of confidence intervals has gained growing attention from both statisticians and computer scientists. Differential privacy is a state-of-the-art framework for analyzing privacy loss when releasing statistics computed from sensitive data. Recent work has been done around differentially private confidence intervals, yet to the best of our knowledge, rigorous methodologies on differentially private confidence intervals in the context of survey sampling have not been studied. In this paper, we propose three differentially private algorithms for constructing confidence intervals for proportions under stratified random sampling. We articulate two variants of differential privacy that make sense for data from stratified sampling designs, analyzing each of our algorithms within one of these two variants. We establish analytical privacy guarantees and asymptotic properties of the estimators. In addition, we conduct simulation studies to evaluate the proposed private confidence intervals, and two applications to the 1940 Census data are provided."}, "https://arxiv.org/abs/2304.00117": {"title": "Efficiently transporting average treatment effects using a sufficient subset of effect modifiers", "link": "https://arxiv.org/abs/2304.00117", "description": "arXiv:2304.00117v2 Announce Type: replace \nAbstract: We develop flexible and nonparametric estimators of the average treatment effect (ATE) transported to a new population that offer potential efficiency gains by incorporating only a sufficient subset of effect modifiers that are differentially distributed between the source and target populations into the transport step. We develop both a one-step estimator when this sufficient subset of effect modifiers is known and a collaborative one-step estimator when it is unknown. We discuss when we would expect our estimators to be more efficient than those that assume all covariates may be relevant effect modifiers and the exceptions when we would expect worse efficiency. We use simulation to compare finite sample performance across our proposed estimators and existing estimators of the transported ATE, including in the presence of practical violations of the positivity assumption. 
Lastly, we apply our proposed estimators to a large-scale housing trial."}, "https://arxiv.org/abs/2309.11058": {"title": "require: Package dependencies for reproducible research", "link": "https://arxiv.org/abs/2309.11058", "description": "arXiv:2309.11058v2 Announce Type: replace \nAbstract: The ability to conduct reproducible research in Stata is often limited by the lack of version control for community-contributed packages. This article introduces the require command, a tool designed to ensure Stata package dependencies are compatible across users and computer systems. Given a list of Stata packages, require verifies that each package is installed, checks for a minimum or exact version or package release date, and optionally installs the package if prompted by the researcher."}, "https://arxiv.org/abs/2311.09972": {"title": "Inference in Auctions with Many Bidders Using Transaction Prices", "link": "https://arxiv.org/abs/2311.09972", "description": "arXiv:2311.09972v2 Announce Type: replace \nAbstract: This paper considers inference in first-price and second-price sealed-bid auctions in empirical settings where we observe auctions with a large number of bidders. Relevant applications include online auctions, treasury auctions, spectrum auctions, art auctions, and IPO auctions, among others. Given the abundance of bidders in each auction, we propose an asymptotic framework in which the number of bidders diverges while the number of auctions remains fixed. This framework allows us to perform asymptotically exact inference on key model features using only transaction price data. Specifically, we examine inference on the expected utility of the auction winner, the expected revenue of the seller, and the tail properties of the valuation distribution. Simulations confirm the accuracy of our inference methods in finite samples. Finally, we also apply them to Hong Kong car license auction data."}, "https://arxiv.org/abs/2312.13643": {"title": "Debiasing Welch's Method for Spectral Density Estimation", "link": "https://arxiv.org/abs/2312.13643", "description": "arXiv:2312.13643v2 Announce Type: replace \nAbstract: Welch's method provides an estimator of the power spectral density that is statistically consistent. This is achieved by averaging over periodograms calculated from overlapping segments of a time series. For a finite length time series, while the variance of the estimator decreases as the number of segments increase, the magnitude of the estimator's bias increases: a bias-variance trade-off ensues when setting the segment number. We address this issue by providing a novel method for debiasing Welch's method which maintains the computational complexity and asymptotic consistency, and leads to improved finite-sample performance. Theoretical results are given for fourth-order stationary processes with finite fourth-order moments and absolutely convergent fourth-order cumulant function. The significant bias reduction is demonstrated with numerical simulation and an application to real-world data. Our estimator also permits irregular spacing over frequency and we demonstrate how this may be employed for signal compression and further variance reduction. 
Code accompanying this work is available in R and python."}, "https://arxiv.org/abs/2404.08105": {"title": "Uniform Inference in High-Dimensional Threshold Regression Models", "link": "https://arxiv.org/abs/2404.08105", "description": "arXiv:2404.08105v1 Announce Type: new \nAbstract: We develop uniform inference for high-dimensional threshold regression parameters and valid inference for the threshold parameter in this paper. We first establish oracle inequalities for prediction errors and $\\ell_1$ estimation errors for the Lasso estimator of the slope parameters and the threshold parameter, allowing for heteroskedastic non-subgaussian error terms and non-subgaussian covariates. Next, we derive the asymptotic distribution of tests involving an increasing number of slope parameters by debiasing (or desparsifying) the scaled Lasso estimator. The asymptotic distribution of tests without the threshold effect is identical to that with a fixed effect. Moreover, we perform valid inference for the threshold parameter using subsampling method. Finally, we conduct simulation studies to demonstrate the performance of our method in finite samples."}, "https://arxiv.org/abs/2404.08128": {"title": "Inference of treatment effect and its regional modifiers using restricted mean survival time in multi-regional clinical trials", "link": "https://arxiv.org/abs/2404.08128", "description": "arXiv:2404.08128v1 Announce Type: new \nAbstract: Multi-regional clinical trials (MRCTs) play an increasingly crucial role in global pharmaceutical development by expediting data gathering and regulatory approval across diverse patient populations. However, differences in recruitment practices and regional demographics often lead to variations in study participant characteristics, potentially biasing treatment effect estimates and undermining treatment effect consistency assessment across regions. To address this challenge, we propose novel estimators and inference methods utilizing inverse probability of sampling and calibration weighting. Our approaches aim to eliminate exogenous regional imbalance while preserving intrinsic differences across regions, such as race and genetic variants. Moreover, time-to-event outcomes in MRCT studies receive limited attention, with existing methodologies primarily focusing on hazard ratios. In this paper, we adopt restricted mean survival time to characterize the treatment effect, offering more straightforward interpretations of treatment effects with fewer assumptions than hazard ratios. Theoretical results are established for the proposed estimators, supported by extensive simulation studies. We illustrate the effectiveness of our methods through a real MRCT case study on acute coronary syndromes."}, "https://arxiv.org/abs/2404.08169": {"title": "AutoGFI: Streamlined Generalized Fiducial Inference for Modern Inference Problems", "link": "https://arxiv.org/abs/2404.08169", "description": "arXiv:2404.08169v1 Announce Type: new \nAbstract: The origins of fiducial inference trace back to the 1930s when R. A. Fisher first introduced the concept as a response to what he perceived as a limitation of Bayesian inference - the requirement for a subjective prior distribution on model parameters in cases where no prior information was available. However, Fisher's initial fiducial approach fell out of favor as complications arose, particularly in multi-parameter problems. 
In the wake of 2000, amidst a renewed interest in contemporary adaptations of fiducial inference, generalized fiducial inference (GFI) emerged to extend Fisher's fiducial argument, providing a promising avenue for addressing numerous crucial and practical inference challenges. Nevertheless, the adoption of GFI has been limited due to its often demanding mathematical derivations and the necessity for implementing complex Markov Chain Monte Carlo algorithms. This complexity has impeded its widespread utilization and practical applicability. This paper presents a significant advancement by introducing an innovative variant of GFI designed to alleviate these challenges. Specifically, this paper proposes AutoGFI, an easily implementable algorithm that streamlines the application of GFI to a broad spectrum of inference problems involving additive noise. AutoGFI can be readily implemented as long as a fitting routine is available, making it accessible to a broader audience of researchers and practitioners. To demonstrate its effectiveness, AutoGFI is applied to three contemporary and challenging problems: tensor regression, matrix completion, and regression with network cohesion. These case studies highlight the immense potential of GFI and illustrate AutoGFI's promising performance when compared to specialized solutions for these problems. Overall, this research paves the way for a more accessible and powerful application of GFI in a range of practical domains."}, "https://arxiv.org/abs/2404.08284": {"title": "A unified generalization of inverse regression via adaptive column selection", "link": "https://arxiv.org/abs/2404.08284", "description": "arXiv:2404.08284v1 Announce Type: new \nAbstract: A bottleneck of sufficient dimension reduction (SDR) in the modern era is that, among numerous methods, only the sliced inverse regression (SIR) is generally applicable under the high-dimensional settings. The higher-order inverse regression methods, which form a major family of SDR methods that are superior to SIR in the population level, suffer from the dimensionality of their intermediate matrix-valued parameters that have an excessive number of columns. In this paper, we propose the generic idea of using a small subset of columns of the matrix-valued parameter for SDR estimation, which breaks the convention of using the ambient matrix for the higher-order inverse regression methods. With the aid of a quick column selection procedure, we then generalize these methods as well as their ensembles towards sparsity under the ultrahigh-dimensional settings, in a uniform manner that resembles sparse SIR and without additional assumptions. This is the first promising attempt in the literature to free the higher-order inverse regression methods from their dimensionality, which facilitates the applicability of SDR. The gain of column selection with respect to SDR estimation efficiency is also studied under the fixed-dimensional settings. Simulation studies and a real data example are provided at the end."}, "https://arxiv.org/abs/2404.08331": {"title": "A Balanced Statistical Boosting Approach for GAMLSS via New Step Lengths", "link": "https://arxiv.org/abs/2404.08331", "description": "arXiv:2404.08331v1 Announce Type: new \nAbstract: Component-wise gradient boosting algorithms are popular for their intrinsic variable selection and implicit regularization, which can be especially beneficial for very flexible model classes. 
When estimating generalized additive models for location, scale and shape (GAMLSS) by means of a component-wise gradient boosting algorithm, an important part of the estimation procedure is to determine the relative complexity of the submodels corresponding to the different distribution parameters. Existing methods either suffer from a computationally expensive tuning procedure or can be biased by structural differences in the negative gradients' sizes, which, if encountered, lead to imbalances between the different submodels. To address this issue, shrunk optimal step lengths have been suggested to replace the typical small fixed step lengths for a non-cyclical boosting algorithm limited to a Gaussian response variable. In this article, we propose a new adaptive step length approach that accounts for the relative size of the fitted base-learners to ensure a natural balance between the different submodels. The new balanced boosting approach thus represents a computationally efficient and easily generalizable alternative to shrunk optimal step lengths. We implement the balanced non-cyclical boosting algorithm for a Gaussian, a negative binomial, and a Weibull distributed response variable and demonstrate the competitive performance of the new adaptive step length approach by means of a simulation study, in the analysis of count data modeling the number of doctor's visits, as well as for survival data in an oncological trial."}, "https://arxiv.org/abs/2404.08352": {"title": "Comment on 'Exact-corrected confidence interval for risk difference in noninferiority binomial trials'", "link": "https://arxiv.org/abs/2404.08352", "description": "arXiv:2404.08352v1 Announce Type: new \nAbstract: The article by Hawila & Berg (2023) commented on here presents four relevant problems, apart from other less important ones that are also noted. First, the title is incorrect, since it leads readers to believe that the confidence interval defined is exact when in fact it is asymptotic. Second, contrary to what is assumed by the authors of the article, the statistic that they define is not monotonic in delta. But it is fundamental that this property holds, as the authors themselves recognize. Third, the inferences provided by the proposed confidence interval may be incoherent, which could lead the scientific community to reach incorrect conclusions in any practical application. For example, for fixed data it might happen that a certain delta value is within the 90%-confidence interval, but outside the 95%-confidence interval. Fourth, the authors do not validate their statistic through a simulation with diverse (and credible) values of the parameters involved. In fact, one of their two examples is for an alpha error of 70%!"}, "https://arxiv.org/abs/2404.08365": {"title": "Estimation and Inference for Three-Dimensional Panel Data Models", "link": "https://arxiv.org/abs/2404.08365", "description": "arXiv:2404.08365v1 Announce Type: new \nAbstract: Hierarchical panel data models have recently garnered significant attention. This study contributes to the relevant literature by introducing a novel three-dimensional (3D) hierarchical panel data model, which integrates panel regression with three sets of latent factor structures: one set of global factors and two sets of local factors. 
Instead of aggregating latent factors from various nodes, as seen in the literature on distributed principal component analysis (PCA), we propose an estimation approach capable of recovering the parameters of interest and disentangling latent factors at different levels and across different dimensions. We establish an asymptotic theory and provide a bootstrap procedure to obtain inference for the parameters of interest while accommodating various types of cross-sectional dependence and time series autocorrelation. Finally, we demonstrate the applicability of our framework by examining productivity convergence in manufacturing industries worldwide."}, "https://arxiv.org/abs/2404.08426": {"title": "confintROB Package: Confidence Intervals in robust linear mixed models", "link": "https://arxiv.org/abs/2404.08426", "description": "arXiv:2404.08426v1 Announce Type: new \nAbstract: Statistical inference is a major scientific endeavor for many researchers. In terms of inferential methods implemented for mixed-effects models, significant progress has been made in the R software. However, these advances primarily concern classical estimators (ML, REML) and mainly focus on fixed effects. In the confintROB package, we have implemented various bootstrap methods for computing confidence intervals (CIs) not only for fixed effects but also for variance components. These methods can be implemented with the widely used lmer function from the lme4 package, as well as with the rlmer function from the robustlmm package and the varComprob function from the robustvarComp package. These functions implement robust estimation methods suitable for data with outliers. The confintROB package implements the Wald method for fixed effects, whereas for both fixed effects and variance components, two bootstrap methods are implemented: the parametric bootstrap and the wild bootstrap. Moreover, the confintROB package can compute both the percentile and the bias-corrected accelerated versions of CIs."}, "https://arxiv.org/abs/2404.08457": {"title": "A Latent Factor Model for High-Dimensional Binary Data", "link": "https://arxiv.org/abs/2404.08457", "description": "arXiv:2404.08457v1 Announce Type: new \nAbstract: In this study, we develop a latent factor model for analysing high-dimensional binary data. Specifically, a standard probit model is used to describe the regression relationship between the observed binary data and the continuous latent variables. Our method assumes that the dependency structure of the observed binary data can be fully captured by the continuous latent factors. To estimate the model, a moment-based estimation method is developed. The proposed method is able to deal with both discontinuity and high dimensionality. Most importantly, the asymptotic properties of the resulting estimators are rigorously established. Extensive simulation studies are presented to demonstrate the proposed methodology. A real dataset about product descriptions is analysed for illustration."}, "https://arxiv.org/abs/2404.08129": {"title": "One Factor to Bind the Cross-Section of Returns", "link": "https://arxiv.org/abs/2404.08129", "description": "arXiv:2404.08129v1 Announce Type: cross \nAbstract: We propose a new non-linear single-factor asset pricing model $r_{it}=h(f_{t}\\lambda_{i})+\\epsilon_{it}$. Despite its parsimony, this model represents exactly any non-linear model with an arbitrary number of factors and loadings -- a consequence of the Kolmogorov-Arnold representation theorem. 
It features only one pricing component $h(f_{t}\\lambda_{i})$, comprising a nonparametric link function of the time-dependent factor and factor loading that we jointly estimate with sieve-based estimators. Using 171 assets across major classes, our model delivers superior cross-sectional performance with a low-dimensional approximation of the link function. Most known finance and macro factors become insignificant after controlling for our single factor."}, "https://arxiv.org/abs/2404.08208": {"title": "Shifting the Paradigm: Estimating Heterogeneous Treatment Effects in the Development of Walkable Cities Design", "link": "https://arxiv.org/abs/2404.08208", "description": "arXiv:2404.08208v1 Announce Type: cross \nAbstract: The transformation of urban environments to accommodate growing populations has profoundly impacted public health and well-being. This paper addresses the critical challenge of estimating the impact of urban design interventions on diverse populations. Traditional approaches, reliant on questionnaires and stated preference techniques, are limited by recall bias and struggle to capture the complex dynamics between environmental attributes and individual characteristics. To address these challenges, we integrate Virtual Reality (VR) with observational causal inference methods to estimate heterogeneous treatment effects, specifically employing Targeted Maximum Likelihood Estimation (TMLE) for its robustness against model misspecification. Our innovative approach leverages a VR-based experiment to collect data that reflects perceptual and experiential factors. The results show the heterogeneous impacts of urban design elements on public health and underscore the necessity for personalized urban design interventions. This study not only extends the application of TMLE to built environment research but also informs public health policy by illuminating the nuanced effects of urban design on mental well-being and advocating for tailored strategies that foster equitable, health-promoting urban spaces."}, "https://arxiv.org/abs/2206.14668": {"title": "Score Matching for Truncated Density Estimation on a Manifold", "link": "https://arxiv.org/abs/2206.14668", "description": "arXiv:2206.14668v2 Announce Type: replace \nAbstract: When observations are truncated, we are limited to an incomplete picture of our dataset. Recent methods propose to use score matching for truncated density estimation, where access to the intractable normalising constant is not required. We present a novel extension of truncated score matching to a Riemannian manifold with boundary. Applications are presented for the von Mises-Fisher and Kent distributions on a two-dimensional sphere in $\\mathbb{R}^3$, as well as a real-world application to extreme storm observations in the USA. In simulated data experiments, our score matching estimator is able to approximate the true parameter values with a low estimation error and shows improvements over a naive maximum likelihood estimator."}, "https://arxiv.org/abs/2305.12643": {"title": "A two-way heterogeneity model for dynamic networks", "link": "https://arxiv.org/abs/2305.12643", "description": "arXiv:2305.12643v2 Announce Type: replace \nAbstract: Dynamic network data analysis requires jointly modelling individual snapshots and time dynamics. This paper proposes a new two-way heterogeneity model towards this goal. 
The new model equips each node of the network with two heterogeneity parameters, one to characterize the propensity of forming ties with other nodes and the other to differentiate the tendency of retaining existing ties over time. Though the negative log-likelihood function is non-convex, it is locally convex in a neighbourhood of the true value of the parameter vector. By using a novel method of moments estimator as the initial value, the consistent local maximum likelihood estimator (MLE) can be obtained by a gradient descent algorithm. To establish the upper bound for the estimation error of the MLE, we derive a new uniform deviation bound, which is of independent interest. The usefulness of the model and the associated theory are further supported by extensive simulation and the analysis of some real network data sets."}, "https://arxiv.org/abs/2312.06334": {"title": "Model validation for aggregate inferences in out-of-sample prediction", "link": "https://arxiv.org/abs/2312.06334", "description": "arXiv:2312.06334v3 Announce Type: replace \nAbstract: Generalization to new samples is a fundamental rationale for statistical modeling. For this purpose, model validation is particularly important, but recent work in survey inference has suggested that simple aggregation of individual prediction scores does not give a good measure of the score for population aggregate estimates. In this manuscript we explain why this occurs, propose two scoring metrics designed specifically for this problem, and demonstrate their use in three different ways. We show that these scoring metrics correctly order models when compared to the true score, although they do underestimate the magnitude of the score. We demonstrate with a problem in survey research, where multilevel regression and poststratification (MRP) has been used extensively to adjust convenience and low-response surveys to make population and subpopulation estimates."}, "https://arxiv.org/abs/2401.07152": {"title": "Inference for Synthetic Controls via Refined Placebo Tests", "link": "https://arxiv.org/abs/2401.07152", "description": "arXiv:2401.07152v2 Announce Type: replace \nAbstract: The synthetic control method is often applied to problems with one treated unit and a small number of control units. A common inferential task in this setting is to test null hypotheses regarding the average treatment effect on the treated. Inference procedures that are justified asymptotically are often unsatisfactory due to (1) small sample sizes that render large-sample approximation fragile and (2) simplification of the estimation procedure that is implemented in practice. An alternative is permutation inference, which is related to a common diagnostic called the placebo test. It has provable Type-I error guarantees in finite samples without simplification of the method, when the treatment is uniformly assigned. Despite this robustness, the placebo test suffers from low resolution since the null distribution is constructed from only $N$ reference estimates, where $N$ is the sample size. This creates a barrier for statistical inference at a common level like $\\alpha = 0.05$, especially when $N$ is small. We propose a novel leave-two-out procedure that bypasses this issue, while still maintaining the same finite-sample Type-I error guarantee under uniform assignment for a wide range of $N$. 
Unlike the placebo test whose Type-I error always equals the theoretical upper bound, our procedure often achieves a lower unconditional Type-I error than theory suggests; this enables useful inference in the challenging regime when $\\alpha < 1/N$. Empirically, our procedure achieves a higher power when the effect size is reasonably large and a comparable power otherwise. We generalize our procedure to non-uniform assignments and show how to conduct sensitivity analysis. From a methodological perspective, our procedure can be viewed as a new type of randomization inference different from permutation or rank-based inference, which is particularly effective in small samples."}, "https://arxiv.org/abs/2301.04512": {"title": "Partial Conditioning for Inference of Many-Normal-Means with H\\\"older Constraints", "link": "https://arxiv.org/abs/2301.04512", "description": "arXiv:2301.04512v2 Announce Type: replace-cross \nAbstract: Inferential models have been proposed for valid and efficient prior-free probabilistic inference. As it gradually gained popularity, this theory is subject to further developments for practically challenging problems. This paper considers the many-normal-means problem with the means constrained to be in the neighborhood of each other, formally represented by a H\\\"older space. A new method, called partial conditioning, is proposed to generate valid and efficient marginal inference about the individual means. It is shown that the method outperforms both a fiducial-counterpart in terms of validity and a conservative-counterpart in terms of efficiency. We conclude the paper by remarking that a general theory of partial conditioning for inferential models deserves future development."}, "https://arxiv.org/abs/2306.00602": {"title": "Approximate Stein Classes for Truncated Density Estimation", "link": "https://arxiv.org/abs/2306.00602", "description": "arXiv:2306.00602v2 Announce Type: replace-cross \nAbstract: Estimating truncated density models is difficult, as these models have intractable normalising constants and hard to satisfy boundary conditions. Score matching can be adapted to solve the truncated density estimation problem, but requires a continuous weighting function which takes zero at the boundary and is positive elsewhere. Evaluation of such a weighting function (and its gradient) often requires a closed-form expression of the truncation boundary and finding a solution to a complicated optimisation problem. In this paper, we propose approximate Stein classes, which in turn leads to a relaxed Stein identity for truncated density estimation. We develop a novel discrepancy measure, truncated kernelised Stein discrepancy (TKSD), which does not require fixing a weighting function in advance, and can be evaluated using only samples on the boundary. We estimate a truncated density model by minimising the Lagrangian dual of TKSD. Finally, experiments show the accuracy of our method to be an improvement over previous works even without the explicit functional form of the boundary."}, "https://arxiv.org/abs/2404.08839": {"title": "Multiply-Robust Causal Change Attribution", "link": "https://arxiv.org/abs/2404.08839", "description": "arXiv:2404.08839v1 Announce Type: new \nAbstract: Comparing two samples of data, we observe a change in the distribution of an outcome variable. In the presence of multiple explanatory variables, how much of the change can be explained by each possible cause? 
We develop a new estimation strategy that, given a causal model, combines regression and re-weighting methods to quantify the contribution of each causal mechanism. Our proposed methodology is multiply robust, meaning that it still recovers the target parameter under partial misspecification. We prove that our estimator is consistent and asymptotically normal. Moreover, it can be incorporated into existing frameworks for causal attribution, such as Shapley values, which will inherit the consistency and large-sample distribution properties. Our method demonstrates excellent performance in Monte Carlo simulations, and we show its usefulness in an empirical application."}, "https://arxiv.org/abs/2404.08883": {"title": "Projection matrices and the sweep operator", "link": "https://arxiv.org/abs/2404.08883", "description": "arXiv:2404.08883v1 Announce Type: new \nAbstract: These notes have been adapted from an undergraduate course given by Professor Alan James at the University of Adelaide from around 1965 and onwards. This adaption has put a focus on the definition of projection matrices and the sweep operator. These devices were at the heart of the development of the statistical package Genstat which initially focussed on the analysis of variance using the sweep operator. The notes provide an algebraic background to the sweep operator which has since been used to effect in a number of experimental design settings."}, "https://arxiv.org/abs/2404.09117": {"title": "Identifying Causal Effects under Kink Setting: Theory and Evidence", "link": "https://arxiv.org/abs/2404.09117", "description": "arXiv:2404.09117v1 Announce Type: new \nAbstract: This paper develops a generalized framework for identifying causal impacts in a reduced-form manner under kinked settings when agents can manipulate their choices around the threshold. The causal estimation using a bunching framework was initially developed by Diamond and Persson (2017) under notched settings. Many empirical applications of bunching designs involve kinked settings. We propose a model-free causal estimator in kinked settings with sharp bunching and then extend to the scenarios with diffuse bunching, misreporting, optimization frictions, and heterogeneity. The estimation method is mostly non-parametric and accounts for the interior response under kinked settings. Applying the proposed approach, we estimate how medical subsidies affect outpatient behaviors in China."}, "https://arxiv.org/abs/2404.09119": {"title": "Causal Inference for Genomic Data with Multiple Heterogeneous Outcomes", "link": "https://arxiv.org/abs/2404.09119", "description": "arXiv:2404.09119v1 Announce Type: new \nAbstract: With the evolution of single-cell RNA sequencing techniques into a standard approach in genomics, it has become possible to conduct cohort-level causal inferences based on single-cell-level measurements. However, the individual gene expression levels of interest are not directly observable; instead, only repeated proxy measurements from each individual's cells are available, providing a derived outcome to estimate the underlying outcome for each of many genes. In this paper, we propose a generic semiparametric inference framework for doubly robust estimation with multiple derived outcomes, which also encompasses the usual setting of multiple outcomes when the response of each unit is available. 
To reliably quantify the causal effects of heterogeneous outcomes, we specialize the analysis to the standardized average treatment effects and the quantile treatment effects. Through this, we demonstrate the use of the semiparametric inferential results for doubly robust estimators derived from both Von Mises expansions and estimating equations. A multiple testing procedure based on the Gaussian multiplier bootstrap is tailored for doubly robust estimators to control the false discovery exceedance rate. Applications in single-cell CRISPR perturbation analysis and individual-level differential expression analysis demonstrate the utility of the proposed methods and offer insights into the usage of different estimands for causal inference in genomics."}, "https://arxiv.org/abs/2404.09126": {"title": "Treatment Effect Heterogeneity and Importance Measures for Multivariate Continuous Treatments", "link": "https://arxiv.org/abs/2404.09126", "description": "arXiv:2404.09126v1 Announce Type: new \nAbstract: Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. We use nonparametric Bayesian methodology that is flexible to ensure our approach can capture a wide range of data generating processes. Additionally, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures, and therefore we introduce novel estimands that summarize the nature and extent of the heterogeneity, and propose estimation procedures for new estimands related to treatment effect heterogeneity. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that it works well in simulated examples both with and without heterogeneity. We apply our approach to a study of the health effects of simultaneous exposure to the components of PM$_{2.5}$ and find that the negative health effects of exposure to these environmental pollutants is exacerbated by low socioeconomic status and age."}, "https://arxiv.org/abs/2404.09154": {"title": "Extreme quantile regression with deep learning", "link": "https://arxiv.org/abs/2404.09154", "description": "arXiv:2404.09154v1 Announce Type: new \nAbstract: Estimation of extreme conditional quantiles is often required for risk assessment of natural hazards in climate and geo-environmental sciences and for quantitative risk management in statistical finance, econometrics, and actuarial sciences. Interest often lies in extrapolating to quantile levels that exceed any past observations. Therefore, it is crucial to use a statistical framework that is well-adapted and especially designed for this purpose, and here extreme-value theory plays a key role. This chapter reviews how extreme quantile regression may be performed using theoretically-justified models, and how modern deep learning approaches can be harnessed in this context to enhance the model's performance in complex high-dimensional settings. 
The power of deep learning combined with the rigor of theoretically-justified extreme-value methods opens the door to efficient extreme quantile regression, in cases where both the number of covariates and the quantile level of interest can be simultaneously ``extreme''."}, "https://arxiv.org/abs/2404.09194": {"title": "Bayesian modeling of co-occurrence microbial interaction networks", "link": "https://arxiv.org/abs/2404.09194", "description": "arXiv:2404.09194v1 Announce Type: new \nAbstract: The human body consists of microbiomes associated with the development and prevention of several diseases. These microbial organisms form several complex interactions that are informative to the scientific community for explaining disease progression and prevention. Contrary to the traditional view of the microbiome as a singular, assortative network, we introduce a novel statistical approach using a weighted stochastic infinite block model to analyze the complex community structures within co-occurrence microbial interaction networks. Our model defines connections between microbial taxa using a novel semi-parametric rank-based correlation method on their transformed relative abundances within a fully connected network framework. Employing a Bayesian nonparametric approach, the proposed model effectively clusters taxa into distinct communities while estimating the number of communities. The posterior summary of the taxa community membership is obtained based on the posterior probability matrix, which could naturally solve the label switching problem. Through simulation studies and real-world application to microbiome data from postmenopausal patients with recurrent urinary tract infections, we demonstrate that our method has superior clustering accuracy over alternative approaches. This advancement provides a more nuanced understanding of microbiome organization, with significant implications for disease research."}, "https://arxiv.org/abs/2404.09309": {"title": "Julia as a universal platform for statistical software development", "link": "https://arxiv.org/abs/2404.09309", "description": "arXiv:2404.09309v1 Announce Type: new \nAbstract: Like Python and Java, which are integrated into Stata, Julia is a free programming language that runs on all major operating systems. The julia package links Stata to Julia as well. Users can transfer data between Stata and Julia at high speed, issue Julia commands from Stata to analyze and plot, and pass results back to Stata. Julia's econometric software ecosystem is not as mature as Stata's or R's, or even Python's. But Julia is an excellent environment for developing high-performance numerical applications, which can then be called from many platforms. The boottest program for wild bootstrap-based inference (Roodman et al. 2019) can call a Julia back end for a 33-50% speed-up, even as the R package fwildclusterboot (Fischer and Roodman 2021) uses the same back end for inference after instrumental variables estimation. reghdfejl mimics reghdfe (Correia 2016) in fitting linear models with high-dimensional fixed effects but calls an independently developed Julia package for tenfold acceleration on hard problems.
reghdfejl also supports nonlinear models--preliminarily, as the Julia package for that purpose matures."}, "https://arxiv.org/abs/2404.09353": {"title": "A Unified Combination Framework for Dependent Tests with Applications to Microbiome Association Studies", "link": "https://arxiv.org/abs/2404.09353", "description": "arXiv:2404.09353v1 Announce Type: new \nAbstract: We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various microbiome association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating $p$-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components may lead to a severe size distortion phenomenon. Compared to the existing $p$-value combination methods, including the vanilla Cauchy combination method, the proposed combination framework can handle the dependence accurately and utilizes the information efficiently to construct tests with accurate size and enhanced power. The development is applied to Microbiome Association Studies, where we aggregate information from multiple existing tests using the same dataset. The combined tests harness the strengths of each individual test across a wide range of alternative spaces, enabling more efficient and meaningful discoveries of vital microbiome associations."}, "https://arxiv.org/abs/2404.09358": {"title": "Two-stage Spatial Regression Models for Spatial Confounding", "link": "https://arxiv.org/abs/2404.09358", "description": "arXiv:2404.09358v1 Announce Type: new \nAbstract: Public health data are often spatially dependent, but standard spatial regression methods can suffer from bias and invalid inference when the independent variable is associated with spatially-correlated residuals. This could occur if, for example, there is an unmeasured environmental contaminant. Geoadditive structural equation modeling (gSEM), in which an estimated spatial trend is removed from both the explanatory and response variables before estimating the parameters of interest, has previously been proposed as a solution, but there has been little investigation of gSEM's properties with point-referenced data. We link gSEM to results on double machine learning and semiparametric regression based on two-stage procedures. We propose using these semiparametric estimators for spatial regression using Gaussian processes with Mat\\`ern covariance to estimate the spatial trends, and term this class of estimators Double Spatial Regression (DSR).
We derive regularity conditions for consistency and root-$n$ asymptotic normality, provide closed-form variance estimation, and show that in simulations where standard spatial regression estimators are highly biased and have poor coverage, DSR can mitigate bias more effectively than competitors and obtain nominal coverage."}, "https://arxiv.org/abs/2404.09362": {"title": "A Bayesian Joint Modelling for Misclassified Interval-censoring and Competing Risks", "link": "https://arxiv.org/abs/2404.09362", "description": "arXiv:2404.09362v1 Announce Type: new \nAbstract: In active surveillance of prostate cancer, cancer progression is interval-censored and the examination to detect progression is subject to misclassification, usually false negatives. Meanwhile, patients may initiate early treatment before progression detection, constituting a competing risk. We developed the Misclassification-Corrected Interval-censored Cause-specific Joint Model (MCICJM) to estimate the association between longitudinal biomarkers and cancer progression in this setting. The sensitivity of the examination is considered in the likelihood of this model via a parameter that may be set to a specific value if the sensitivity is known, or for which a prior distribution can be specified if the sensitivity is unknown. Our simulation results show that misspecification of the sensitivity parameter or ignoring it entirely impacts the model parameters, especially the parameter uncertainty and the baseline hazards. Moreover, specification of a prior distribution for the sensitivity parameter may reduce the risk of misspecification in settings where the exact sensitivity is unknown, but may cause identifiability issues. Thus, imposing restrictions on the baseline hazards is recommended. A trade-off therefore needs to be made between modelling the sensitivity as a constant, at the risk of misspecification, and using a sensitivity prior, at the cost of flexibility."}, "https://arxiv.org/abs/2404.09414": {"title": "General Bayesian inference for causal effects using covariate balancing procedure", "link": "https://arxiv.org/abs/2404.09414", "description": "arXiv:2404.09414v1 Announce Type: new \nAbstract: In observational studies, the propensity score plays a central role in estimating causal effects of interest. The inverse probability weighting (IPW) estimator is especially commonly used. However, if the propensity score model is misspecified, the IPW estimator may produce biased estimates of causal effects. Previous studies have proposed some robust propensity score estimation procedures; these methods, however, require consideration of parameters that dominate the uncertainty of sampling and treatment allocation. In this manuscript, we propose a novel Bayesian estimation procedure that specifies the parameter probabilistically rather than deterministically. Since both the IPW estimator and the propensity score estimator can be derived as solutions to certain loss functions, the general Bayesian paradigm, which does not require the consideration of the full likelihood, can be applied.
In this sense, our proposed method only requires the same level of assumptions as ordinary causal inference contexts."}, "https://arxiv.org/abs/2404.09467": {"title": "The Role of Carbon Pricing in Food Inflation: Evidence from Canadian Provinces", "link": "https://arxiv.org/abs/2404.09467", "description": "arXiv:2404.09467v1 Announce Type: new \nAbstract: Carbon pricing, including carbon tax and cap-and-trade, is usually seen as an effective policy tool for mitigating emissions. Although such policies are often perceived as worsening affordability issues, earlier studies find insignificant or deflationary effects of carbon pricing. We verify this result for the food sector by using provincial-level data on food CPI from Canada. By using a staggered difference-in-differences (DiD) approach, we show that the deflationary effects of carbon pricing on food do exist. Additionally, such effects are weak at first and grow stronger after two years of implementation. However, the overall magnitudes are too small for carbon pricing to be blamed for the current high inflation. Our subsequent analysis suggests that a reduction in consumption is likely to be the cause of deflation. In contrast, carbon pricing has little impact on farm production costs owing to the special treatment farmers receive within carbon pricing systems. This paper decomposes the long-term influence Canadian carbon pricing has on food affordability and its possible mechanisms."}, "https://arxiv.org/abs/2404.09528": {"title": "Overfitting Reduction in Convex Regression", "link": "https://arxiv.org/abs/2404.09528", "description": "arXiv:2404.09528v1 Announce Type: new \nAbstract: Convex regression is a method for estimating an unknown function $f_0$ from a data set of $n$ noisy observations when $f_0$ is known to be convex. This method has played an important role in operations research, economics, machine learning, and many other areas. It has been empirically observed that the convex regression estimator produces inconsistent estimates of $f_0$ and extremely large subgradients near the boundary of the domain of $f_0$ as $n$ increases. In this paper, we provide theoretical evidence of this overfitting behaviour. We also prove that the penalised convex regression estimator, one of the variants of the convex regression estimator, exhibits overfitting behaviour. To eliminate this behaviour, we propose two new estimators by placing a bound on the subgradients of the estimated function. We further show that our proposed estimators do not exhibit the overfitting behaviour by proving that (a) they converge to $f_0$ and (b) their subgradients converge to the gradient of $f_0$, both uniformly over the domain of $f_0$ with probability one as $n \\rightarrow \\infty$. We apply the proposed methods to compute the cost frontier function for Finnish electricity distribution firms and confirm their superior performance in predictive power over some existing methods."}, "https://arxiv.org/abs/2404.09716": {"title": "Optimal Cut-Point Estimation for functional digital biomarkers: Application to Continuous Glucose Monitoring", "link": "https://arxiv.org/abs/2404.09716", "description": "arXiv:2404.09716v1 Announce Type: new \nAbstract: Establishing optimal cut points plays a crucial role in epidemiology and biomarker discovery, enabling the development of effective and practical clinical decision criteria.
While there is an extensive literature on defining optimal cut-offs for scalar biomarkers, there is a notable lack of general methodologies for analyzing statistical objects in more complex spaces of functions and graphs, which are increasingly relevant in digital health applications. This paper proposes a new general methodology to define optimal cut points for random objects in separable Hilbert spaces. The paper is motivated by the need to create new clinical rules for diabetes mellitus, exploiting the functional information of a continuous glucose monitor (CGM) as a digital biomarker. More specifically, we provide a functional cut-off to identify diabetes cases from CGM information, based on distributional functional representations of glucose."}, "https://arxiv.org/abs/2404.09823": {"title": "Biclustering bipartite networks via extended Mixture of Latent Trait Analyzers", "link": "https://arxiv.org/abs/2404.09823", "description": "arXiv:2404.09823v1 Announce Type: new \nAbstract: In the context of network data, bipartite networks are of particular interest, as they provide a useful description of systems representing relationships between sending and receiving nodes. In this framework, we extend the Mixture of Latent Trait Analyzers (MLTA) to perform a joint clustering of sending and receiving nodes, as in the biclustering framework. In detail, sending nodes are partitioned into clusters (called components) via a finite mixture of latent trait models. In each component, receiving nodes are partitioned into clusters (called segments) by adopting a flexible and parsimonious specification of the linear predictor. Dependence between receiving nodes is modeled via a multidimensional latent trait, as in the original MLTA specification. The proposal also allows for the inclusion of concomitant variables in the latent layer of the model, with the aim of understanding how they influence component formation. To estimate model parameters, an EM-type algorithm based on a Gauss-Hermite approximation of intractable integrals is proposed. A simulation study is conducted to test the performance of the model in terms of clustering and parameter recovery. The proposed model is applied to a bipartite network on pediatric patients possibly affected by appendicitis with the objective of identifying groups of patients (sending nodes) that are similar with respect to subsets of clinical conditions (receiving nodes)."}, "https://arxiv.org/abs/2404.09863": {"title": "sfislands: An R Package for Accommodating Islands and Disjoint Zones in Areal Spatial Modelling", "link": "https://arxiv.org/abs/2404.09863", "description": "arXiv:2404.09863v1 Announce Type: new \nAbstract: Fitting areal models which use a spatial weights matrix to represent relationships between geographical units can be a cumbersome task, particularly when these units are not well-behaved. The two chief aims of sfislands are to simplify the process of creating an appropriate neighbourhood matrix, and to quickly visualise the predictions of subsequent models. The package uses visual aids in the form of easily-generated maps to help this process. This paper demonstrates how sfislands could be useful to researchers. It begins by describing the package's functions in the context of a proposed workflow. It then presents three worked examples showing a selection of potential use-cases. These range from earthquakes in Indonesia, to river crossings in London, and hierarchical models of output areas in Liverpool.
We aim to show how the sfislands package streamlines much of the human workflow involved in creating and examining such models."}, "https://arxiv.org/abs/2404.09882": {"title": "A spatio-temporal model to detect potential outliers in disease mapping", "link": "https://arxiv.org/abs/2404.09882", "description": "arXiv:2404.09882v1 Announce Type: new \nAbstract: Spatio-temporal disease mapping models are commonly used to estimate the relative risk of a disease over time and across areas. For each area and time point, the disease count is modelled with a Poisson distribution whose mean is the product of an offset and the disease relative risk. This relative risk is commonly decomposed in the log scale as the sum of fixed and latent effects. The Rushworth model allows for spatio-temporal autocorrelation of the random effects. We build on the Rushworth model to accommodate and identify potentially outlying areas with respect to their disease relative risk evolution, after taking into account the fixed effects. An area may display outlying behaviour at some points in time but not all. At each time point, we assume the latent effects to be spatially structured and include scaling parameters in the precision matrix, to allow for heavy tails. Two prior specifications are considered for the scaling parameters: one where they are independent across space and one with spatial autocorrelation. We investigate the performance of the different prior specifications of the proposed model through simulation studies and analyse the weekly evolution of the number of COVID-19 cases across the 33 boroughs of Montreal and the 96 French departments during the second wave. In Montreal, 6 boroughs are found to be potentially outlying. In France, the model with spatially structured scaling parameters identified 21 departments as potential outliers. We find that these departments tend to be close to each other and within common French regions."}, "https://arxiv.org/abs/2404.09960": {"title": "Pseudo P-values for Assessing Covariate Balance in a Finite Study Population with Application to the California Sugar Sweetened Beverage Tax Study", "link": "https://arxiv.org/abs/2404.09960", "description": "arXiv:2404.09960v1 Announce Type: new \nAbstract: Assessing covariate balance (CB) is a common practice in various types of evaluation studies. Two-sample descriptive statistics, such as the standardized mean difference, have been widely applied in the scientific literature to assess the goodness of CB. Studies in health policy, health services research, built and social environment research, and many other fields often involve a finite number of units that may be subject to different treatment levels. Our case study, the California Sugar Sweetened Beverage (SSB) Tax Study, includes 332 study cities in the state of California, among which individual cities may elect to levy a city-wide excise tax on SSB sales. Evaluating the balance of covariates between study cities with and without the tax policy is essential for assessing the effects of the policy on health outcomes of interest. In this paper, we introduce the novel concepts of the pseudo p-value and the standardized pseudo p-value, which are descriptive statistics to assess the overall goodness of CB between study arms in a finite study population. While not meant as a hypothesis test, the pseudo p-values bear superficial similarity to the classic p-value, which makes them easy to apply and interpret in applications.
We discuss some theoretical properties of the pseudo p-values and present an algorithm to calculate them. We report a numerical simulation study to demonstrate their performance. We apply the pseudo p-values to the California SSB Tax study to assess the balance of city-level characteristics between the two study arms."}, "https://arxiv.org/abs/2404.09966": {"title": "A fully Bayesian approach for the imputation and analysis of derived outcome variables with missingness", "link": "https://arxiv.org/abs/2404.09966", "description": "arXiv:2404.09966v1 Announce Type: new \nAbstract: Derived variables are variables that are constructed from one or more source variables through established mathematical operations or algorithms. For example, body mass index (BMI) is a derived variable constructed from two source variables: weight and height. When using a derived variable as the outcome in a statistical model, complications arise when some of the source variables have missing values. In this paper, we propose how one can define a single fully Bayesian model to simultaneously impute missing values and sample from the posterior. We compare our proposed method with alternative approaches that rely on multiple imputation, and, with a simulated dataset, consider how best to estimate the risk of microcephaly in newborns exposed to the ZIKA virus."}, "https://arxiv.org/abs/2404.08694": {"title": "Musical Listening Qualia: A Multivariate Approach", "link": "https://arxiv.org/abs/2404.08694", "description": "arXiv:2404.08694v1 Announce Type: cross \nAbstract: French and American participants listened to new music stimuli and evaluated the stimuli using either adjectives or quantitative musical dimensions. Results were analyzed using correspondence analysis (CA), hierarchical cluster analysis (HCA), multiple factor analysis (MFA), and partial least squares correlation (PLSC). French and American listeners differed when they described the musical stimuli using adjectives, but not when using the quantitative dimensions. The present work serves as a case study in research methodology that allows for a balance between relaxing experimental control and maintaining statistical rigor."}, "https://arxiv.org/abs/2404.08816": {"title": "Evaluating the Quality of Answers in Political Q&A Sessions with Large Language Models", "link": "https://arxiv.org/abs/2404.08816", "description": "arXiv:2404.08816v1 Announce Type: cross \nAbstract: This paper presents a new approach to evaluating the quality of answers in political question-and-answer sessions. We propose to measure an answer's quality based on the degree to which it allows us to infer the initial question accurately. This conception of answer quality inherently reflects their relevance to initial questions. Drawing parallels with semantic search, we argue that this measurement approach can be operationalized by fine-tuning a large language model on the observed corpus of questions and answers without additional labeled data. We showcase our measurement approach within the context of the Question Period in the Canadian House of Commons. Our approach yields valuable insights into the correlates of the quality of answers in the Question Period. 
We find that answer quality varies significantly based on the party affiliation of the members of Parliament asking the questions and uncover a meaningful correlation between answer quality and the topics of the questions."}, "https://arxiv.org/abs/2404.09059": {"title": "Prevalence estimation methods for time-dependent antibody kinetics of infected and vaccinated individuals: a graph-theoretic approach", "link": "https://arxiv.org/abs/2404.09059", "description": "arXiv:2404.09059v1 Announce Type: cross \nAbstract: Immune events such as infection, vaccination, and a combination of the two result in distinct time-dependent antibody responses in affected individuals. These responses and event prevalences combine non-trivially to govern antibody levels sampled from a population. Time-dependence and disease prevalence pose considerable modeling challenges that need to be addressed to provide a rigorous mathematical underpinning of the underlying biology. We propose a time-inhomogeneous Markov chain model for event-to-event transitions coupled with a probabilistic framework for antibody kinetics and demonstrate its use in a setting in which individuals can be infected or vaccinated but not both. We prove the equivalency of this approach to the framework developed in our previous work. Synthetic data are used to demonstrate the modeling process and conduct prevalence estimation via transition probability matrices. This approach is ideal for modeling sequences of infections and vaccinations, or personal trajectories in a population, making it an important first step towards a mathematical characterization of reinfection, vaccination boosting, and cross-events of infection after vaccination or vice versa."}, "https://arxiv.org/abs/2404.09142": {"title": "Controlling the False Discovery Rate in Subspace Selection", "link": "https://arxiv.org/abs/2404.09142", "description": "arXiv:2404.09142v1 Announce Type: cross \nAbstract: Controlling the false discovery rate (FDR) is a popular approach to multiple testing, variable selection, and related problems of simultaneous inference. In many contemporary applications, models are not specified by discrete variables, which necessitates a broadening of the scope of the FDR control paradigm. Motivated by the ubiquity of low-rank models for high-dimensional matrices, we present methods for subspace selection in principal components analysis that provide control on a geometric analog of FDR that is adapted to subspace selection. Our methods crucially rely on recently-developed tools from random matrix theory, in particular on a characterization of the limiting behavior of eigenvectors and the gaps between successive eigenvalues of large random matrices. Our procedure is parameter-free, and we show that it provides FDR control in subspace selection for common noise models considered in the literature. We demonstrate the utility of our algorithm with numerical experiments on synthetic data and on problems arising in single-cell RNA sequencing and hyperspectral imaging."}, "https://arxiv.org/abs/2404.09156": {"title": "Statistics of extremes for natural hazards: landslides and earthquakes", "link": "https://arxiv.org/abs/2404.09156", "description": "arXiv:2404.09156v1 Announce Type: cross \nAbstract: In this chapter, we illustrate the use of split bulk-tail models and subasymptotic models motivated by extreme-value theory in the context of hazard assessment for earthquake-induced landslides.
A spatial joint areal model is presented for modeling both landslide counts and landslide sizes, paying particular attention to extreme landslides, which are the most devastating ones."}, "https://arxiv.org/abs/2404.09157": {"title": "Statistics of Extremes for Neuroscience", "link": "https://arxiv.org/abs/2404.09157", "description": "arXiv:2404.09157v1 Announce Type: cross \nAbstract: This chapter illustrates how tools from univariate and multivariate statistics of extremes can complement classical methods used to study brain signals and enhance the understanding of brain activity and connectivity during specific cognitive tasks or abnormal episodes, such as an epileptic seizure."}, "https://arxiv.org/abs/2404.09629": {"title": "Quantifying fair income distribution in Thailand", "link": "https://arxiv.org/abs/2404.09629", "description": "arXiv:2404.09629v1 Announce Type: cross \nAbstract: Given a vast concern about high income inequality in Thailand as opposed to empirical findings around the world showing people's preference for fair income inequality over unfair income equality, it is important to examine whether inequality in income distribution in Thailand over the past three decades is fair, and what fair inequality in income distribution in Thailand should be. To quantitatively measure fair income distribution, this study employs the fairness benchmarks that are derived from the distributions of athletes' salaries in professional sports which satisfy the concepts of distributive justice and procedural justice, the no-envy principle of fair allocation, and the general consensus or the international norm criterion of a meaningful benchmark. By using the data on quintile income shares and the income Gini index of Thailand from the National Social and Economic Development Council, this study finds that, throughout the period from 1988 to 2021, the Thai income earners in the bottom 20%, the second 20%, and the top 20% receive income shares more than the fair shares whereas those in the third 20% and the fourth 20% receive income shares less than the fair shares. Provided that there are infinite combinations of quintile income shares that can have the same value of income Gini index but only one of them is regarded as fair, this study demonstrates the use of fairness benchmarks as a practical guideline for designing policies with an aim to achieve fair income distribution in Thailand. Moreover, a comparative analysis is conducted by employing the method for estimating optimal (fair) income distribution representing feasible income equality in order to provide an alternative recommendation on what optimal (fair) income distribution characterizing feasible income equality in Thailand should be."}, "https://arxiv.org/abs/2404.09729": {"title": "Amplitude-Phase Fusion for Enhanced Electrocardiogram Morphological Analysis", "link": "https://arxiv.org/abs/2404.09729", "description": "arXiv:2404.09729v1 Announce Type: cross \nAbstract: Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes, for the first time, a novel fusion entropy metric, morphological ECG entropy (MEE), specifically designed for ECG morphology to comprehensively describe the fusion of amplitude and phase patterns.
MEE is computed based on beat-level samples, enabling detailed analysis of each cardiac cycle. Experimental results demonstrate that MEE achieves rapid, accurate, and label-free localization of abnormal ECG arrhythmia regions. Furthermore, MEE provides a method for assessing sample diversity, facilitating compression of imbalanced training sets (via representative sample selection), and outperforms random pruning. Additionally, MEE can characterize regions of poor signal quality. We further discuss the robustness of the MEE calculation to noise interference and its low computational complexity. Finally, we integrate this method into a clinical interactive interface to provide a more convenient and intuitive user experience. These findings indicate that MEE serves as a valuable clinical descriptor for ECG characterization. The implementation code can be referenced at the following link: https://github.com/fdu-harry/ECG-MEE-metric."}, "https://arxiv.org/abs/2404.09847": {"title": "Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning", "link": "https://arxiv.org/abs/2404.09847", "description": "arXiv:2404.09847v1 Announce Type: cross \nAbstract: Constrained learning has become increasingly important, especially in the realm of algorithmic fairness and machine learning. In these settings, predictive models are developed specifically to satisfy pre-defined notions of fairness. Here, we study the general problem of constrained statistical machine learning through a statistical functional lens. We consider learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded. We characterize the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. We show that closed-form solutions for the optimal constrained parameter are often available, providing insight into mechanisms that drive fairness in predictive models. Our results also suggest natural estimators of the constrained parameter that can be constructed by combining estimates of unconstrained parameters of the data generating distribution. Thus, our estimation procedure for constructing fair machine learning algorithms can be applied in conjunction with any statistical learning approach and off-the-shelf software. We demonstrate the generality of our method by explicitly considering a number of examples of statistical fairness constraints and implementing the approach using several popular learning approaches."}, "https://arxiv.org/abs/2404.09962": {"title": "Invariant Subspace Decomposition", "link": "https://arxiv.org/abs/2404.09962", "description": "arXiv:2404.09962v1 Announce Type: cross \nAbstract: We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. In this work, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component.
As we show, this decomposition can be utilized both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure."}, "https://arxiv.org/abs/2111.09254": {"title": "Universal Inference Meets Random Projections: A Scalable Test for Log-concavity", "link": "https://arxiv.org/abs/2111.09254", "description": "arXiv:2111.09254v4 Announce Type: replace \nAbstract: Shape constraints yield flexible middle grounds between fully nonparametric and fully parametric approaches to modeling distributions of data. The specific assumption of log-concavity is motivated by applications across economics, survival modeling, and reliability theory. However, there do not currently exist valid tests for whether the underlying density of given data is log-concave. The recent universal inference methodology provides a valid test. The universal test relies on maximum likelihood estimation (MLE), and efficient methods already exist for finding the log-concave MLE. This yields the first test of log-concavity that is provably valid in finite samples in any dimension, for which we also establish asymptotic consistency results. Empirically, we find that a random projections approach that converts the d-dimensional testing problem into many one-dimensional problems can yield high power, leading to a simple procedure that is statistically and computationally efficient."}, "https://arxiv.org/abs/2202.00618": {"title": "Penalized Estimation of Frailty-Based Illness-Death Models for Semi-Competing Risks", "link": "https://arxiv.org/abs/2202.00618", "description": "arXiv:2202.00618v3 Announce Type: replace \nAbstract: Semi-competing risks refers to the survival analysis setting where the occurrence of a non-terminal event is subject to whether a terminal event has occurred, but not vice versa. Semi-competing risks arise in a broad range of clinical contexts, with a novel example being the pregnancy condition preeclampsia, which can only occur before the `terminal' event of giving birth. Models that acknowledge semi-competing risks enable investigation of relationships between covariates and the joint timing of the outcomes, but methods for model selection and prediction of semi-competing risks in high dimensions are lacking. Instead, researchers commonly analyze only a single or composite outcome, losing valuable information and limiting clinical utility -- in the obstetric setting, this means ignoring valuable insight into timing of delivery after preeclampsia has onset. To address this gap we propose a novel penalized estimation framework for frailty-based illness-death multi-state modeling of semi-competing risks. Our approach combines non-convex and structured fusion penalization, inducing global sparsity as well as parsimony across submodels. We perform estimation and model selection via a pathwise routine for non-convex optimization, and prove the first statistical error bound results in this setting. 
We present a simulation study investigating estimation error and model selection performance, and a comprehensive application of the method to joint risk modeling of preeclampsia and timing of delivery using pregnancy data from an electronic health record."}, "https://arxiv.org/abs/2206.08503": {"title": "Semiparametric Single-Index Estimation for Average Treatment Effects", "link": "https://arxiv.org/abs/2206.08503", "description": "arXiv:2206.08503v3 Announce Type: replace \nAbstract: We propose a semiparametric method to estimate the average treatment effect under the assumption of unconfoundedness given observational data. Our estimation method alleviates misspecification issues of the propensity score function by estimating the single-index link function involved through Hermite polynomials. Our approach is computationally tractable and allows for moderately large dimension covariates. We provide the large sample properties of the estimator and show its validity. Also, the average treatment effect estimator achieves the parametric rate and asymptotic normality. Our extensive Monte Carlo study shows that the proposed estimator is valid in finite samples. We also provide an empirical analysis on the effect of maternal smoking on babies' birth weight and the effect of job training program on future earnings."}, "https://arxiv.org/abs/2208.02657": {"title": "Using Instruments for Selection to Adjust for Selection Bias in Mendelian Randomization", "link": "https://arxiv.org/abs/2208.02657", "description": "arXiv:2208.02657v3 Announce Type: replace \nAbstract: Selection bias is a common concern in epidemiologic studies. In the literature, selection bias is often viewed as a missing data problem. Popular approaches to adjust for bias due to missing data, such as inverse probability weighting, rely on the assumption that data are missing at random and can yield biased results if this assumption is violated. In observational studies with outcome data missing not at random, Heckman's sample selection model can be used to adjust for bias due to missing data. In this paper, we review Heckman's method and a similar approach proposed by Tchetgen Tchetgen and Wirth (2017). We then discuss how to apply these methods to Mendelian randomization analyses using individual-level data, with missing data for either the exposure or outcome or both. We explore whether genetic variants associated with participation can be used as instruments for selection. We then describe how to obtain missingness-adjusted Wald ratio, two-stage least squares and inverse variance weighted estimates. The two methods are evaluated and compared in simulations, with results suggesting that they can both mitigate selection bias but may yield parameter estimates with large standard errors in some settings. In an illustrative real-data application, we investigate the effects of body mass index on smoking using data from the Avon Longitudinal Study of Parents and Children."}, "https://arxiv.org/abs/2209.08036": {"title": "mpower: An R Package for Power Analysis of Exposure Mixture Studies via Monte Carlo Simulations", "link": "https://arxiv.org/abs/2209.08036", "description": "arXiv:2209.08036v2 Announce Type: replace \nAbstract: Estimating sample size and statistical power is an essential part of a good study design. This R package allows users to conduct power analysis based on Monte Carlo simulations in settings in which consideration of the correlations between predictors is important. 
It runs power analyses given a data generative model and an inference model. It can set up a data generative model that preserves dependence structures among variables given existing data (continuous, binary, or ordinal) or high-level descriptions of the associations. Users can generate power curves to assess the trade-offs between sample size, effect size, and power of a design. This paper presents tutorials and examples focusing on applications for environmental mixture studies when predictors tend to be moderately to highly correlated. It easily interfaces with several existing and newly developed analysis strategies for assessing associations between exposures and health outcomes. However, the package is sufficiently general to facilitate power simulations in a wide variety of settings."}, "https://arxiv.org/abs/2209.09810": {"title": "The boosted HP filter is more general than you might think", "link": "https://arxiv.org/abs/2209.09810", "description": "arXiv:2209.09810v2 Announce Type: replace \nAbstract: The global financial crisis and Covid recession have renewed discussion concerning trend-cycle discovery in macroeconomic data, and boosting has recently upgraded the popular HP filter to a modern machine learning device suited to data-rich and rapid computational environments. This paper extends boosting's trend determination capability to higher order integrated processes and time series with roots that are local to unity. The theory is established by understanding the asymptotic effect of boosting on a simple exponential function. Given a universe of time series in FRED databases that exhibit various dynamic patterns, boosting promptly captures downturns at crises and the recoveries that follow."}, "https://arxiv.org/abs/2305.13188": {"title": "Fast Variational Inference for Bayesian Factor Analysis in Single and Multi-Study Settings", "link": "https://arxiv.org/abs/2305.13188", "description": "arXiv:2305.13188v2 Announce Type: replace \nAbstract: Factor models are routinely used to analyze high-dimensional data in both single-study and multi-study settings. Bayesian inference for such models relies on Markov Chain Monte Carlo (MCMC) methods which scale poorly as the number of studies, observations, or measured variables increases. To address this issue, we propose variational inference algorithms to approximate the posterior distribution of Bayesian latent factor models using the multiplicative gamma process shrinkage prior. The proposed algorithms provide fast approximate inference at a fraction of the time and memory of MCMC-based implementations while maintaining comparable accuracy in characterizing the data covariance matrix. We conduct extensive simulations to evaluate our proposed algorithms and show their utility in estimating the model for high-dimensional multi-study gene expression data in ovarian cancers. Overall, our proposed approaches enable more efficient and scalable inference for factor models, facilitating their use in high-dimensional settings. An R package VIMSFA implementing our methods is available on GitHub (github.com/blhansen/VI-MSFA)."}, "https://arxiv.org/abs/2305.18809": {"title": "Discrete forecast reconciliation", "link": "https://arxiv.org/abs/2305.18809", "description": "arXiv:2305.18809v3 Announce Type: replace \nAbstract: This paper presents a formal framework and proposes algorithms to extend forecast reconciliation to discrete-valued data, including low counts.
A novel method is introduced based on recasting the optimisation of scoring rules as an assignment problem, which is solved using quadratic programming. The proposed framework produces coherent joint probabilistic forecasts for count hierarchical time series. Two discrete reconciliation algorithms are also proposed and compared against generalisations of the top-down and bottom-up approaches for count data. Two simulation experiments and two empirical examples are conducted to validate that the proposed reconciliation algorithms improve forecast accuracy. The empirical applications are forecasting criminal offences in Washington D.C. and product unit sales in the M5 dataset. Compared to benchmarks, the proposed framework shows superior performance in both simulations and empirical studies."}, "https://arxiv.org/abs/2307.11401": {"title": "Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data", "link": "https://arxiv.org/abs/2307.11401", "description": "arXiv:2307.11401v2 Announce Type: replace \nAbstract: We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data."}, "https://arxiv.org/abs/2308.11672": {"title": "Simulation-Based Prior Knowledge Elicitation for Parametric Bayesian Models", "link": "https://arxiv.org/abs/2308.11672", "description": "arXiv:2308.11672v2 Announce Type: replace \nAbstract: A central characteristic of Bayesian statistics is the ability to consistently incorporate prior knowledge into various modeling processes. In this paper, we focus on translating domain expert knowledge into corresponding prior distributions over model parameters, a process known as prior elicitation. Expert knowledge can manifest itself in diverse formats, including information about raw data, summary statistics, or model parameters. A major challenge for existing elicitation methods is how to effectively utilize all of these different formats in order to formulate prior distributions that align with the expert's expectations, regardless of the model structure. 
To address these challenges, we develop a simulation-based elicitation method that can learn the hyperparameters of potentially any parametric prior distribution from a wide spectrum of expert knowledge using stochastic gradient descent. We validate the effectiveness and robustness of our elicitation method in four representative case studies covering linear models, generalized linear models, and hierarchical models. Our results support the claim that our method is largely independent of the underlying model structure and adaptable to various elicitation techniques, including quantile-based, moment-based, and histogram-based methods."}, "https://arxiv.org/abs/2401.12084": {"title": "Temporal Aggregation for the Synthetic Control Method", "link": "https://arxiv.org/abs/2401.12084", "description": "arXiv:2401.12084v2 Announce Type: replace \nAbstract: The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit with panel data. Two challenges arise with higher frequency data (e.g., monthly versus yearly): (1) achieving excellent pre-treatment fit is typically more challenging; and (2) overfitting to noise is more likely. Aggregating data over time can mitigate these problems but can also destroy important signal. In this paper, we bound the bias for SCM with disaggregated and aggregated outcomes and give conditions under which aggregating tightens the bounds. We then propose finding weights that balance both disaggregated and aggregated series."}, "https://arxiv.org/abs/2010.16271": {"title": "View selection in multi-view stacking: Choosing the meta-learner", "link": "https://arxiv.org/abs/2010.16271", "description": "arXiv:2010.16271v3 Announce Type: replace-cross \nAbstract: Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, was shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantage over the other three."}, "https://arxiv.org/abs/2303.05561": {"title": "Exploration of the search space of Gaussian graphical models for paired data", "link": "https://arxiv.org/abs/2303.05561", "description": "arXiv:2303.05561v2 Announce Type: replace-cross \nAbstract: We consider the problem of learning a Gaussian graphical model in the case where the observations come from two dependent groups sharing the same variables. We focus on a family of coloured Gaussian graphical models specifically suited for the paired data problem.
Commonly, graphical models are ordered by the submodel relationship so that the search space is a lattice, called the model inclusion lattice. We introduce a novel order between models, named the twin order. We show that, embedded with this order, the model space is a lattice that, unlike the model inclusion lattice, is distributive. Furthermore, we provide the relevant rules for the computation of the neighbours of a model. The latter are more efficient than the same operations in the model inclusion lattice, and are then exploited to achieve a more efficient exploration of the search space. These results can be applied to improve the efficiency of both greedy and Bayesian model search procedures. Here we implement a stepwise backward elimination procedure and evaluate its performance by means of simulations. Finally, the procedure is applied to learn a brain network from fMRI data where the two groups correspond to the left and right hemispheres, respectively."}, "https://arxiv.org/abs/2306.11281": {"title": "Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models", "link": "https://arxiv.org/abs/2306.11281", "description": "arXiv:2306.11281v3 Announce Type: replace-cross \nAbstract: Answering counterfactual queries has important applications such as explainability, robustness, and fairness but is challenging when the causal variables are unobserved and the observations are non-linear mixtures of these latent variables, such as pixels in images. One approach is to recover the latent Structural Causal Model (SCM), which may be infeasible in practice due to requiring strong assumptions, e.g., linearity of the causal mechanisms or perfect atomic interventions. Meanwhile, more practical ML-based approaches using naive domain translation models to generate counterfactual samples lack theoretical grounding and may construct invalid counterfactuals. In this work, we strive to strike a balance between practicality and theoretical guarantees by analyzing a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). We show that recovering the latent SCM is unnecessary for estimating domain counterfactuals, thereby sidestepping some of the theoretic challenges. By assuming invertibility and sparsity of intervention, we prove domain counterfactual estimation error can be bounded by a data fit term and intervention sparsity term. Building upon our theoretical results, we develop a theoretically grounded practical algorithm that simplifies the modeling process to generative model estimation under autoregressive and shared parameter constraints that enforce intervention sparsity. Finally, we show an improvement in counterfactual estimation over baseline methods through extensive simulated and image-based experiments."}, "https://arxiv.org/abs/2404.10063": {"title": "Adjusting for bias due to measurement error in functional quantile regression models with error-prone functional and scalar covariates", "link": "https://arxiv.org/abs/2404.10063", "description": "arXiv:2404.10063v1 Announce Type: new \nAbstract: Wearable devices enable the continuous monitoring of physical activity (PA) but generate complex functional data with poorly characterized errors. Most work on functional data views the data as smooth, latent curves obtained at discrete time intervals with some random noise with mean zero and constant variance. 
Viewing this noise as homoscedastic and independent ignores potential serial correlations. Our preliminary studies indicate that failing to account for these serial correlations can bias estimations. In dietary assessments, epidemiologists often use self-reported measures based on food frequency questionnaires that are prone to recall bias. With the increased availability of complex, high-dimensional functional, and scalar biomedical data potentially prone to measurement errors, it is necessary to adjust for biases induced by these errors to permit accurate analyses in various regression settings. However, there has been limited work to address measurement errors in functional and scalar covariates in the context of quantile regression. Therefore, we developed new statistical methods based on simulation extrapolation (SIMEX) and mixed effects regression with repeated measures to correct for measurement error biases in this context. We conducted simulation studies to establish the finite sample properties of our new methods. The methods are illustrated through application to a real data set."}, "https://arxiv.org/abs/2404.10111": {"title": "From Predictive Algorithms to Automatic Generation of Anomalies", "link": "https://arxiv.org/abs/2404.10111", "description": "arXiv:2404.10111v1 Announce Type: new \nAbstract: Machine learning algorithms can find predictive signals that researchers fail to notice; yet they are notoriously hard-to-interpret. How can we extract theoretical insights from these black boxes? History provides a clue. Facing a similar problem -- how to extract theoretical insights from their intuitions -- researchers often turned to ``anomalies:'' constructed examples that highlight flaws in an existing theory and spur the development of new ones. Canonical examples include the Allais paradox and the Kahneman-Tversky choice experiments for expected utility theory. We suggest anomalies can extract theoretical insights from black box predictive algorithms. We develop procedures to automatically generate anomalies for an existing theory when given a predictive algorithm. We cast anomaly generation as an adversarial game between a theory and a falsifier, the solutions to which are anomalies: instances where the black box algorithm predicts - were we to collect data - we would likely observe violations of the theory. As an illustration, we generate anomalies for expected utility theory using a large, publicly available dataset on real lottery choices. Based on an estimated neural network that predicts lottery choices, our procedures recover known anomalies and discover new ones for expected utility theory. 
In incentivized experiments, subjects violate expected utility theory on these algorithmically generated anomalies; moreover, the violation rates are similar to observed rates for the Allais paradox and Common ratio effect."}, "https://arxiv.org/abs/2404.10251": {"title": "Perturbations of Markov Chains", "link": "https://arxiv.org/abs/2404.10251", "description": "arXiv:2404.10251v1 Announce Type: new \nAbstract: This chapter surveys progress on three related topics in perturbations of Markov chains: the motivating question of when and how \"perturbed\" MCMC chains are developed, the theoretical problem of how perturbation theory can be used to analyze such chains, and finally the question of how the theoretical analyses can lead to practical advice."}, "https://arxiv.org/abs/2404.10344": {"title": "Semi-parametric profile pseudolikelihood via local summary statistics for spatial point pattern intensity estimation", "link": "https://arxiv.org/abs/2404.10344", "description": "arXiv:2404.10344v1 Announce Type: new \nAbstract: Second-order statistics play a crucial role in analysing point processes. Previous research has specifically explored locally weighted second-order statistics for point processes, offering diagnostic tests in various spatial domains. However, there remains a need to improve inference for complex intensity functions, especially when the point process likelihood is intractable and in the presence of interactions among points. This paper addresses this gap by proposing a method that exploits local second-order characteristics to account for local dependencies in the fitting procedure. Our approach utilises the Papangelou conditional intensity function for general Gibbs processes, avoiding explicit assumptions about the degree of interaction and homogeneity. We provide simulation results and an application to real data to assess the proposed method's goodness-of-fit. Overall, this work contributes to advancing statistical techniques for point process analysis in the presence of spatial interactions."}, "https://arxiv.org/abs/2404.10381": {"title": "Covariate Ordered Systematic Sampling as an Improvement to Randomized Controlled Trials", "link": "https://arxiv.org/abs/2404.10381", "description": "arXiv:2404.10381v1 Announce Type: new \nAbstract: The Randomized Controlled Trial (RCT) or A/B testing is considered the gold standard method for estimating causal effects. Fisher famously advocated randomly allocating experiment units into treatment and control groups to preclude systematic biases. We propose a variant of systematic sampling called Covariate Ordered Systematic Sampling (COSS). In COSS, we order experimental units using a pre-experiment covariate and allocate them alternately into treatment and control groups. Using theoretical proofs, experiments on simulated data, and hundreds of A/B tests conducted within 3 real-world marketing campaigns, we show how our method achieves better sensitivity gains than commonly used variance reduction techniques like CUPED while retaining the simplicity of RCTs."}, "https://arxiv.org/abs/2404.10495": {"title": "Assumption-Lean Quantile Regression", "link": "https://arxiv.org/abs/2404.10495", "description": "arXiv:2404.10495v1 Announce Type: new \nAbstract: Quantile regression is a powerful tool for detecting exposure-outcome associations given covariates across different parts of the outcome's distribution, but has two major limitations when the aim is to infer the effect of an exposure. 
Firstly, the exposure coefficient estimator may not converge to a meaningful quantity when the model is misspecified, and secondly, variable selection methods may induce bias and excess uncertainty, rendering inferences biased and overly optimistic. In this paper, we address these issues via partially linear quantile regression models which parametrize the conditional association of interest, but do not restrict the association with other covariates in the model. We propose consistent estimators for the unknown model parameter by mapping it onto a nonparametric main effect estimand that captures the (conditional) association of interest even when the quantile model is misspecified. This estimand is estimated using the efficient influence function under the nonparametric model, allowing for the incorporation of data-adaptive procedures such as variable selection and machine learning. Our approach provides a flexible and reliable method for detecting associations that is robust to model misspecification and excess uncertainty induced by variable selection methods. The proposal is illustrated using simulation studies and data on annual health care costs associated with excess body weight."}, "https://arxiv.org/abs/2404.10594": {"title": "Nonparametric Isotropy Test for Spatial Point Processes using Random Rotations", "link": "https://arxiv.org/abs/2404.10594", "description": "arXiv:2404.10594v1 Announce Type: new \nAbstract: In spatial statistics, point processes are often assumed to be isotropic meaning that their distribution is invariant under rotations. Statistical tests for the null hypothesis of isotropy found in the literature are based either on asymptotics or on Monte Carlo simulation of a parametric null model. Here, we present a nonparametric test based on resampling the Fry points of the observed point pattern. Empirical levels and powers of the test are investigated in a simulation study for four point process models with anisotropy induced by different mechanisms. Finally, a real data set is tested for isotropy."}, "https://arxiv.org/abs/2404.10629": {"title": "Weighting methods for truncation by death in cluster-randomized trials", "link": "https://arxiv.org/abs/2404.10629", "description": "arXiv:2404.10629v1 Announce Type: new \nAbstract: Patient-centered outcomes, such as quality of life and length of hospital stay, are the focus in a wide array of clinical studies. However, participants in randomized trials for elderly or critically and severely ill patient populations may have truncated or undefined non-mortality outcomes if they do not survive through the measurement time point. To address truncation by death, the survivor average causal effect (SACE) has been proposed as a causally interpretable subgroup treatment effect defined under the principal stratification framework. However, the majority of methods for estimating SACE have been developed in the context of individually-randomized trials. Only limited discussions have been centered around cluster-randomized trials (CRTs), where methods typically involve strong distributional assumptions for outcome modeling. In this paper, we propose two weighting methods to estimate SACE in CRTs that obviate the need for potentially complicated outcome distribution modeling. We establish the requisite assumptions that address latent clustering effects to enable point identification of SACE, and we provide computationally-efficient asymptotic variance estimators for each weighting estimator. 
In simulations, we evaluate our weighting estimators, demonstrating their finite-sample operating characteristics and robustness to certain departures from the identification assumptions. We illustrate our methods using data from a CRT to assess the impact of a sedation protocol on mechanical ventilation among children with acute respiratory failure."}, "https://arxiv.org/abs/2404.10427": {"title": "Effect of Systematic Uncertainties on Density and Temperature Estimates in Coronae of Capella", "link": "https://arxiv.org/abs/2404.10427", "description": "arXiv:2404.10427v1 Announce Type: cross \nAbstract: We estimate the coronal density of Capella using the O VII and Fe XVII line systems in the soft X-ray regime that have been observed over the course of the Chandra mission. Our analysis combines measures of error due to uncertainty in the underlying atomic data with statistical errors in the Chandra data to derive meaningful overall uncertainties on the plasma density of the coronae of Capella. We consider two Bayesian frameworks. First, the so-called pragmatic-Bayesian approach considers the atomic data and their uncertainties as fully specified and uncorrectable. The fully-Bayesian approach, on the other hand, allows the observed spectral data to update the atomic data and their uncertainties, thereby reducing the overall errors on the inferred parameters. To incorporate atomic data uncertainties, we obtain a set of atomic data replicates, the distribution of which captures their uncertainty. A principal component analysis of these replicates allows us to represent the atomic uncertainty with a lower-dimensional multivariate Gaussian distribution. A $t$-distribution approximation of the uncertainties of a subset of plasma parameters including a priori temperature information, obtained from the temperature-sensitive-only Fe XVII spectral line analysis, is carried forward into the density- and temperature-sensitive O VII spectral line analysis. Markov Chain Monte Carlo based model fitting is implemented including Multi-step Monte Carlo Gibbs Sampler and Hamiltonian Monte Carlo. Our analysis recovers an isothermally approximated coronal plasma temperature of $\\approx$5 MK and a coronal plasma density of $\\approx$10$^{10}$ cm$^{-3}$, with uncertainties of 0.1 and 0.2 dex respectively."}, "https://arxiv.org/abs/2404.10436": {"title": "Tree Bandits for Generative Bayes", "link": "https://arxiv.org/abs/2404.10436", "description": "arXiv:2404.10436v1 Announce Type: cross \nAbstract: In generative models with obscured likelihood, Approximate Bayesian Computation (ABC) is often the tool of last resort for inference. However, ABC demands many prior parameter trials to keep only a small fraction that passes an acceptance test. To accelerate ABC rejection sampling, this paper develops a self-aware framework that learns from past trials and errors. We apply recursive partitioning classifiers on the ABC lookup table to sequentially refine high-likelihood regions into boxes. Each box is regarded as an arm in a binary bandit problem treating ABC acceptance as a reward. Each arm has a proclivity for being chosen for the next ABC evaluation, depending on the prior distribution and past rejections. The method places more splits in those areas where the likelihood resides, shying away from low-probability regions destined for ABC rejections. We provide two versions: (1) ABC-Tree for posterior sampling, and (2) ABC-MAP for maximum a posteriori estimation. 
We demonstrate accurate ABC approximability at much lower simulation cost. We justify the use of our tree-based bandit algorithms with nearly optimal regret bounds. Finally, we successfully apply our approach to the problem of masked image classification using deep generative models."}, "https://arxiv.org/abs/2404.10523": {"title": "Capturing the Macroscopic Behaviour of Molecular Dynamics with Membership Functions", "link": "https://arxiv.org/abs/2404.10523", "description": "arXiv:2404.10523v1 Announce Type: cross \nAbstract: Markov processes serve as foundational models in many scientific disciplines, such as molecular dynamics, and their simulation forms a common basis for analysis. While simulations produce useful trajectories, obtaining macroscopic information directly from microstate data presents significant challenges. This paper addresses this gap by introducing the concept of membership functions being the macrostates themselves. We derive equations for the holding times of these macrostates and demonstrate their consistency with the classical definition. Furthermore, we discuss the application of the ISOKANN method for learning these quantities from simulation data. In addition, we present a novel method for extracting transition paths based on the ISOKANN results and demonstrate its efficacy by applying it to simulations of the mu-opioid receptor. With this approach we provide a new perspective on analyzing the macroscopic behaviour of Markov systems."}, "https://arxiv.org/abs/2404.10530": {"title": "JCGM 101-compliant uncertainty evaluation using virtual experiments", "link": "https://arxiv.org/abs/2404.10530", "description": "arXiv:2404.10530v1 Announce Type: cross \nAbstract: Virtual experiments (VEs), a modern tool in metrology, can be used to help perform an uncertainty evaluation for the measurand. Current guidelines in metrology do not cover the many possibilities to incorporate VEs into an uncertainty evaluation, and it is often difficult to assess if the intended use of a VE complies with said guidelines. In recent work, it was shown that a VE can be used in conjunction with real measurement data and a Monte Carlo procedure to produce equal results to a supplement of the Guide to the Expression of Uncertainty in Measurement. However, this was shown only for linear measurement models. In this work, we extend this Monte Carlo approach to a common class of non-linear measurement models and more complex VEs, providing a reference approach for suitable uncertainty evaluations involving VEs. Numerical examples are given to show that the theoretical derivations hold in a practical scenario."}, "https://arxiv.org/abs/2404.10580": {"title": "Data-driven subgrouping of patient trajectories with chronic diseases: Evidence from low back pain", "link": "https://arxiv.org/abs/2404.10580", "description": "arXiv:2404.10580v1 Announce Type: cross \nAbstract: Clinical data informs the personalization of health care with a potential for more effective disease management. In practice, this is achieved by subgrouping, whereby clusters with similar patient characteristics are identified and then receive customized treatment plans with the goal of targeting subgroup-specific disease dynamics. In this paper, we propose a novel mixture hidden Markov model for subgrouping patient trajectories from chronic diseases. 
Our model is probabilistic and carefully designed to capture different trajectory phases of chronic diseases (i.e., \"severe\", \"moderate\", and \"mild\") through tailored latent states. We demonstrate our subgrouping framework based on a longitudinal study across 847 patients with non-specific low back pain. Here, our subgrouping framework identifies 8 subgroups. Further, we show that our subgrouping framework outperforms common baselines in terms of cluster validity indices. Finally, we discuss the applicability of the model to other chronic and long-lasting diseases."}, "https://arxiv.org/abs/1808.10522": {"title": "Optimal Instrument Selection using Bayesian Model Averaging for Model Implied Instrumental Variable Two Stage Least Squares Estimators", "link": "https://arxiv.org/abs/1808.10522", "description": "arXiv:1808.10522v2 Announce Type: replace \nAbstract: Model-Implied Instrumental Variable Two-Stage Least Squares (MIIV-2SLS) is a limited information, equation-by-equation, non-iterative estimator for latent variable models. Associated with this estimator are equation specific tests of model misspecification. One issue with equation specific tests is that they lack specificity, in that they indicate that some instruments are problematic without revealing which specific ones. Instruments that are poor predictors of their target variables (weak instruments) is a second potential problem. We propose a novel extension to detect instrument specific tests of misspecification and weak instruments. We term this the Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging (MIIV-2SBMA) estimator. We evaluate the performance of MIIV-2SBMA against MIIV-2SLS in a simulation study and show that it has comparable performance in terms of parameter estimation. Additionally, our instrument specific overidentification tests developed within the MIIV-2SBMA framework show increased power to detect specific problematic and weak instruments. Finally, we demonstrate MIIV-2SBMA using an empirical example."}, "https://arxiv.org/abs/2308.10375": {"title": "Model Selection over Partially Ordered Sets", "link": "https://arxiv.org/abs/2308.10375", "description": "arXiv:2308.10375v3 Announce Type: replace \nAbstract: In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as presence or absence of a variable or an edge. Consequently, false positive error or false negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false positive and false negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false positive and false negative errors. 
We describe model selection procedures that provide false positive error control in our general setting and we illustrate their utility with numerical experiments."}, "https://arxiv.org/abs/2310.09701": {"title": "A robust and powerful replicability analysis for high dimensional data", "link": "https://arxiv.org/abs/2310.09701", "description": "arXiv:2310.09701v2 Announce Type: replace \nAbstract: Identifying replicable signals across different studies provides stronger scientific evidence and more powerful inference. Existing literature on high dimensional replicability analysis either imposes strong modeling assumptions or has low power. We develop a powerful and robust empirical Bayes approach for high dimensional replicability analysis. Our method effectively borrows information from different features and studies while accounting for heterogeneity. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods."}, "https://arxiv.org/abs/2312.13450": {"title": "Precise FWER Control for Gaussian Related Fields: Riding the SuRF to continuous land -- Part 1", "link": "https://arxiv.org/abs/2312.13450", "description": "arXiv:2312.13450v2 Announce Type: replace \nAbstract: The Gaussian Kinematic Formula (GKF) is a powerful and computationally efficient tool to perform statistical inference on random fields and became a well-established tool in the analysis of neuroimaging data. Using realistic error models, recent articles show that GKF based methods for \\emph{voxelwise inference} lead to conservative control of the familywise error rate (FWER) and for cluster-size inference lead to inflated false positive rates. In this series of articles we identify and resolve the main causes of these shortcomings in the traditional usage of the GKF for voxelwise inference. This first part removes the \\textit{good lattice assumption} and allows the data to be non-stationary, yet still assumes the data to be Gaussian. The latter assumption is resolved in part 2, where we also demonstrate that our GKF based methodology is non-conservative under realistic error models."}, "https://arxiv.org/abs/2401.08290": {"title": "Causal Machine Learning for Moderation Effects", "link": "https://arxiv.org/abs/2401.08290", "description": "arXiv:2401.08290v2 Announce Type: replace \nAbstract: It is valuable for any decision maker to know the impact of decisions (treatments) on average and for subgroups. The causal machine learning literature has recently provided tools for estimating group average treatment effects (GATE) to understand treatment heterogeneity better. This paper addresses the challenge of interpreting such differences in treatment effects between groups while accounting for variations in other covariates. We propose a new parameter, the balanced group average treatment effect (BGATE), which measures a GATE with a specific distribution of a priori-determined covariates. By taking the difference of two BGATEs, we can analyse heterogeneity more meaningfully than by comparing two GATEs. The estimation strategy for this parameter is based on double/debiased machine learning for discrete treatments in an unconfoundedness setting, and the estimator is shown to be $\\sqrt{N}$-consistent and asymptotically normal under standard conditions.
Adding additional identifying assumptions allows specific balanced differences in treatment effects between groups to be interpreted causally, leading to the causal balanced group average treatment effect. We explore the finite sample properties in a small-scale simulation study and demonstrate the usefulness of these parameters in an empirical example."}, "https://arxiv.org/abs/2401.13057": {"title": "Inference under partial identification with minimax test statistics", "link": "https://arxiv.org/abs/2401.13057", "description": "arXiv:2401.13057v2 Announce Type: replace \nAbstract: We provide a means of computing and estimating the asymptotic distributions of statistics based on an outer minimization of an inner maximization. Such test statistics, which arise frequently in moment models, are of special interest in providing hypothesis tests under partial identification. Under general conditions, we provide an asymptotic characterization of such test statistics using the minimax theorem, and a means of computing critical values using the bootstrap. Making some light regularity assumptions, our results augment several asymptotic approximations that have been provided for partially identified hypothesis tests, and extend them by mitigating their dependence on local linear approximations of the parameter space. These asymptotic results are generally simple to state and straightforward to compute (esp.\\ adversarially)."}, "https://arxiv.org/abs/2401.14355": {"title": "Multiply Robust Difference-in-Differences Estimation of Causal Effect Curves for Continuous Exposures", "link": "https://arxiv.org/abs/2401.14355", "description": "arXiv:2401.14355v2 Announce Type: replace \nAbstract: Researchers commonly use difference-in-differences (DiD) designs to evaluate public policy interventions. While methods exist for estimating effects in the context of binary interventions, policies often result in varied exposures across regions implementing the policy. Yet, existing approaches for incorporating continuous exposures face substantial limitations in addressing confounding variables associated with intervention status, exposure levels, and outcome trends. These limitations significantly constrain policymakers' ability to fully comprehend policy impacts and design future interventions. In this work, we propose new estimators for causal effect curves within the DiD framework, accounting for multiple sources of confounding. Our approach accommodates misspecification of a subset of treatment, exposure, and outcome models while avoiding any parametric assumptions on the effect curve. We present the statistical properties of the proposed methods and illustrate their application through simulations and a study investigating the heterogeneous effects of a nutritional excise tax under different levels of accessibility to cross-border shopping."}, "https://arxiv.org/abs/2205.06812": {"title": "Principal-Agent Hypothesis Testing", "link": "https://arxiv.org/abs/2205.06812", "description": "arXiv:2205.06812v3 Announce Type: replace-cross \nAbstract: Consider the relationship between a regulator (the principal) and an experimenter (the agent) such as a pharmaceutical company. The pharmaceutical company wishes to sell a drug for profit, whereas the regulator wishes to allow only efficacious drugs to be marketed. The efficacy of the drug is not known to the regulator, so the pharmaceutical company must run a costly trial to prove efficacy to the regulator. 
Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested agent; a lower standard of statistical evidence incentivizes the agent to run more trials that are less likely to be effective. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial for understanding this system and designing protocols with high social utility. In this work, we discuss how the regulator can set up a protocol with payoffs based on statistical evidence. We show how to design protocols that are robust to an agent's strategic actions, and derive the optimal protocol in the presence of strategic entrants."}, "https://arxiv.org/abs/2207.07978": {"title": "Robust Multivariate Functional Control Chart", "link": "https://arxiv.org/abs/2207.07978", "description": "arXiv:2207.07978v3 Announce Type: replace-cross \nAbstract: In modern Industry 4.0 applications, a huge amount of data is acquired during manufacturing processes that are often contaminated with anomalous observations in the form of both casewise and cellwise outliers. These can seriously reduce the performance of control charting procedures, especially in complex and high-dimensional settings. To mitigate this issue in the context of profile monitoring, we propose a new framework, referred to as robust multivariate functional control chart (RoMFCC), that is able to monitor multivariate functional data while being robust to both functional casewise and cellwise outliers. The RoMFCC relies on four main elements: (I) a functional univariate filter to identify functional cellwise outliers to be replaced by missing components; (II) a robust multivariate functional data imputation method of missing values; (III) a casewise robust dimensionality reduction; (IV) a monitoring strategy for the multivariate functional quality characteristic. An extensive Monte Carlo simulation study is performed to compare the RoMFCC with competing monitoring schemes already appeared in the literature. Finally, a motivating real-case study is presented where the proposed framework is used to monitor a resistance spot welding process in the automotive industry."}, "https://arxiv.org/abs/2304.01163": {"title": "The extended Ville's inequality for nonintegrable nonnegative supermartingales", "link": "https://arxiv.org/abs/2304.01163", "description": "arXiv:2304.01163v2 Announce Type: replace-cross \nAbstract: Following the initial work by Robbins, we rigorously present an extended theory of nonnegative supermartingales, requiring neither integrability nor finiteness. In particular, we derive a key maximal inequality foreshadowed by Robbins, which we call the extended Ville's inequality, that strengthens the classical Ville's inequality (for integrable nonnegative supermartingales), and also applies to our nonintegrable setting. We derive an extension of the method of mixtures, which applies to $\\sigma$-finite mixtures of our extended nonnegative supermartingales. 
We present some implications of our theory for sequential statistics, such as the use of improper mixtures (priors) in deriving nonparametric confidence sequences and (extended) e-processes."}, "https://arxiv.org/abs/2404.10834": {"title": "VARX Granger Analysis: Modeling, Inference, and Applications", "link": "https://arxiv.org/abs/2404.10834", "description": "arXiv:2404.10834v1 Announce Type: new \nAbstract: Vector Autoregressive models with exogenous input (VARX) provide a powerful framework for modeling complex dynamical systems like brains, markets, or societies. Their simplicity allows us to uncover linear effects between endogenous and exogenous variables. The Granger formalism is naturally suited for VARX models, but the connection between the two is not widely understood. We aim to bridge this gap by providing both the basic equations and easy-to-use code. We first explain how the Granger formalism can be combined with a VARX model using deviance as a test statistic. We also present a bias correction for the deviance in the case of L2 regularization, a technique used to reduce model complexity. To address the challenge of modeling long responses, we propose the use of basis functions, which further reduce parameter complexity. We demonstrate that p-values are correctly estimated, even for short signals where regularization influences the results. Additionally, we analyze the model's performance under various scenarios where model assumptions are violated, such as missing variables or indirect observations of the underlying dynamics. Finally, we showcase the practical utility of our approach by applying it to real-world data from neuroscience, physiology, and sociology. To facilitate its adoption, we make Matlab, Python, and R code available here: https://github.com/lcparra/varx"}, "https://arxiv.org/abs/2404.10884": {"title": "Modeling Interconnected Modules in Multivariate Outcomes: Evaluating the Impact of Alcohol Intake on Plasma Metabolomics", "link": "https://arxiv.org/abs/2404.10884", "description": "arXiv:2404.10884v1 Announce Type: new \nAbstract: Alcohol consumption has been shown to influence cardiovascular mechanisms in humans, leading to observable alterations in the plasma metabolomic profile. Regression models are commonly employed to investigate these effects, treating metabolomics features as the outcomes and alcohol intake as the exposure. Given the latent dependence structure among the numerous metabolomic features (e.g., co-expression networks with interconnected modules), modeling this structure is crucial for accurately identifying metabolomic features associated with alcohol intake. However, integrating dependence structures into regression models remains difficult in both estimation and inference procedures due to their large or high dimensionality. To bridge this gap, we propose an innovative multivariate regression model that accounts for correlations among outcome features by incorporating an interconnected community structure. Furthermore, we derive closed-form and likelihood-based estimators, accompanied by explicit exact and explicit asymptotic covariance matrix estimators, respectively. Simulation analysis demonstrates that our approach provides accurate estimation of both dependence and regression coefficients, and enhances sensitivity while maintaining a well-controlled discovery rate, as evidenced through benchmarking against existing regression models. 
Finally, we apply our approach to assess the impact of alcohol intake on $249$ metabolomic biomarkers measured using nuclear magnetic resonance spectroscopy. The results indicate that alcohol intake can elevate high-density lipoprotein levels by enhancing the transport rate of Apolipoprotein A1."}, "https://arxiv.org/abs/2404.10974": {"title": "Compressive Bayesian non-negative matrix factorization for mutational signatures analysis", "link": "https://arxiv.org/abs/2404.10974", "description": "arXiv:2404.10974v1 Announce Type: new \nAbstract: Non-negative matrix factorization (NMF) is widely used in many applications for dimensionality reduction. Inferring an appropriate number of factors for NMF is a challenging problem, and several approaches based on information criteria or sparsity-inducing priors have been proposed. However, inference in these models is often complicated and computationally challenging. In this paper, we introduce a novel methodology for overfitted Bayesian NMF models using \"compressive hyperpriors\" that force unneeded factors down to negligible values while only imposing mild shrinkage on needed factors. The method is based on using simple semi-conjugate priors to facilitate inference, while setting the strength of the hyperprior in a data-dependent way to achieve this compressive property. We apply our method to mutational signatures analysis in cancer genomics, where we find that it outperforms state-of-the-art alternatives. In particular, we illustrate how our compressive hyperprior enables the use of biologically informed priors on the signatures, yielding significantly improved accuracy. We provide theoretical results establishing the compressive property, and we demonstrate the method in simulations and on real data from a breast cancer application."}, "https://arxiv.org/abs/2404.11057": {"title": "Partial Identification of Heteroskedastic Structural VARs: Theory and Bayesian Inference", "link": "https://arxiv.org/abs/2404.11057", "description": "arXiv:2404.11057v1 Announce Type: new \nAbstract: We consider structural vector autoregressions identified through stochastic volatility. Our focus is on whether a particular structural shock is identified by heteroskedasticity without the need to impose any sign or exclusion restrictions. Three contributions emerge from our exercise: (i) a set of conditions under which the matrix containing structural parameters is partially or globally unique; (ii) a statistical procedure to assess the validity of the conditions mentioned above; and (iii) a shrinkage prior distribution for conditional variances centred on a hypothesis of homoskedasticity. Such a prior ensures that the evidence for identifying a structural shock comes only from the data and is not favoured by the prior. We illustrate our new methods using a U.S. fiscal structural model."}, "https://arxiv.org/abs/2404.11092": {"title": "Estimation for conditional moment models based on martingale difference divergence", "link": "https://arxiv.org/abs/2404.11092", "description": "arXiv:2404.11092v1 Announce Type: new \nAbstract: We provide a new estimation method for conditional moment models via the martingale difference divergence (MDD). Our MDD-based estimation method is formed in the framework of a continuum of unconditional moment restrictions.
Unlike the existing estimation methods in this framework, the MDD-based estimation method adopts a non-integrable weighting function, which could grab more information from unconditional moment restrictions than the integrable weighting function to enhance the estimation efficiency. Due to the nature of shift-invariance in MDD, our MDD-based estimation method can not identify the intercept parameters. To overcome this identification issue, we further provide a two-step estimation procedure for the model with intercept parameters. Under regularity conditions, we establish the asymptotics of the proposed estimators, which are not only easy-to-implement with analytic asymptotic variances, but also applicable to time series data with an unspecified form of conditional heteroskedasticity. Finally, we illustrate the usefulness of the proposed estimators by simulations and two real examples."}, "https://arxiv.org/abs/2404.11125": {"title": "Interval-censored linear quantile regression", "link": "https://arxiv.org/abs/2404.11125", "description": "arXiv:2404.11125v1 Announce Type: new \nAbstract: Censored quantile regression has emerged as a prominent alternative to classical Cox's proportional hazards model or accelerated failure time model in both theoretical and applied statistics. While quantile regression has been extensively studied for right-censored survival data, methodologies for analyzing interval-censored data remain limited in the survival analysis literature. This paper introduces a novel local weighting approach for estimating linear censored quantile regression, specifically tailored to handle diverse forms of interval-censored survival data. The estimation equation and the corresponding convex objective function for the regression parameter can be constructed as a weighted average of quantile loss contributions at two interval endpoints. The weighting components are nonparametrically estimated using local kernel smoothing or ensemble machine learning techniques. To estimate the nonparametric distribution mass for interval-censored data, a modified EM algorithm for nonparametric maximum likelihood estimation is employed by introducing subject-specific latent Poisson variables. The proposed method's empirical performance is demonstrated through extensive simulation studies and real data analyses of two HIV/AIDS datasets."}, "https://arxiv.org/abs/2404.11150": {"title": "Automated, efficient and model-free inference for randomized clinical trials via data-driven covariate adjustment", "link": "https://arxiv.org/abs/2404.11150", "description": "arXiv:2404.11150v1 Announce Type: new \nAbstract: In May 2023, the U.S. Food and Drug Administration (FDA) released guidance for industry on \"Adjustment for Covariates in Randomized Clinical Trials for Drugs and Biological Products\". Covariate adjustment is a statistical analysis method for improving precision and power in clinical trials by adjusting for pre-specified, prognostic baseline variables. Though recommended by the FDA and the European Medicines Agency (EMA), many trials do not exploit the available information in baseline variables or make use only of the baseline measurement of the outcome. This is likely (partly) due to the regulatory mandate to pre-specify baseline covariates for adjustment, leading to challenges in determining appropriate covariates and their functional forms. 
We will explore the potential of automated data-adaptive methods, such as machine learning algorithms, for covariate adjustment, addressing the challenge of pre-specification. Specifically, our approach allows the use of complex models or machine learning algorithms without compromising the interpretation or validity of the treatment effect estimate and its corresponding standard error, even in the presence of misspecified outcome working models. This contrasts the majority of competing works which assume correct model specification for the validity of standard errors. Our proposed estimators either necessitate ultra-sparsity in the outcome model (which can be relaxed by limiting the number of predictors in the model) or necessitate integration with sample splitting to enhance their performance. As such, we will arrive at simple estimators and standard errors for the marginal treatment effect in randomized clinical trials, which exploit data-adaptive outcome predictions based on prognostic baseline covariates, and have low (or no) bias in finite samples even when those predictions are themselves biased."}, "https://arxiv.org/abs/2404.11198": {"title": "Forecasting with panel data: Estimation uncertainty versus parameter heterogeneity", "link": "https://arxiv.org/abs/2404.11198", "description": "arXiv:2404.11198v1 Announce Type: new \nAbstract: We provide a comprehensive examination of the predictive accuracy of panel forecasting methods based on individual, pooling, fixed effects, and Bayesian estimation, and propose optimal weights for forecast combination schemes. We consider linear panel data models, allowing for weakly exogenous regressors and correlated heterogeneity. We quantify the gains from exploiting panel data and demonstrate how forecasting performance depends on the degree of parameter heterogeneity, whether such heterogeneity is correlated with the regressors, the goodness of fit of the model, and the cross-sectional ($N$) and time ($T$) dimensions. Monte Carlo simulations and empirical applications to house prices and CPI inflation show that forecast combination and Bayesian forecasting methods perform best overall and rarely produce the least accurate forecasts for individual series."}, "https://arxiv.org/abs/2404.11235": {"title": "Bayesian Markov-Switching Vector Autoregressive Process", "link": "https://arxiv.org/abs/2404.11235", "description": "arXiv:2404.11235v1 Announce Type: new \nAbstract: This study introduces marginal density functions of the general Bayesian Markov-Switching Vector Autoregressive (MS-VAR) process. In the case of the Bayesian MS-VAR process, we provide closed-form density functions and Monte Carlo simulation algorithms, including the importance sampling method. The Monte Carlo simulation method departs from the previous simulation methods because it removes the duplication in a regime vector."}, "https://arxiv.org/abs/2404.11323": {"title": "Bayesian Optimization for Identification of Optimal Biological Dose Combinations in Personalized Dose-Finding Trials", "link": "https://arxiv.org/abs/2404.11323", "description": "arXiv:2404.11323v1 Announce Type: new \nAbstract: Early phase, personalized dose-finding trials for combination therapies seek to identify patient-specific optimal biological dose (OBD) combinations, which are defined as safe dose combinations which maximize therapeutic benefit for a specific covariate pattern.
Given the small sample sizes which are typical of these trials, it is challenging for traditional parametric approaches to identify OBD combinations across multiple dosing agents and covariate patterns. To address these challenges, we propose a Bayesian optimization approach to dose-finding which formally incorporates toxicity information into both the initial data collection process and the sequential search strategy. Independent Gaussian processes are used to model the efficacy and toxicity surfaces, and an acquisition function is utilized to define the dose-finding strategy and an early stopping rule. This work is motivated by a personalized dose-finding trial which considers a dual-agent therapy for obstructive sleep apnea, where OBD combinations are tailored to obstructive sleep apnea severity. To compare the performance of the personalized approach to a standard approach where covariate information is ignored, a simulation study is performed. We conclude that personalized dose-finding is essential in the presence of heterogeneity."}, "https://arxiv.org/abs/2404.11324": {"title": "Weighted-Average Least Squares for Negative Binomial Regression", "link": "https://arxiv.org/abs/2404.11324", "description": "arXiv:2404.11324v1 Announce Type: new \nAbstract: Model averaging methods have become an increasingly popular tool for improving predictions and dealing with model uncertainty, especially in Bayesian settings. Recently, frequentist model averaging methods such as information theoretic and least squares model averaging have emerged. This work focuses on the issue of covariate uncertainty where managing the computational resources is key: The model space grows exponentially with the number of covariates such that averaged models must often be approximated. Weighted-average least squares (WALS), first introduced for (generalized) linear models in the econometric literature, combines Bayesian and frequentist aspects and additionally employs a semiorthogonal transformation of the regressors to reduce the computational burden. This paper extends WALS for generalized linear models to the negative binomial (NB) regression model for overdispersed count data. A simulation experiment and an empirical application using data on doctor visits were conducted to compare the predictive power of WALS for NB regression to traditional estimators. The results show that WALS for NB improves on the maximum likelihood estimator in sparse situations and is competitive with lasso while being computationally more efficient."}, "https://arxiv.org/abs/2404.11345": {"title": "Jacobi Prior: An Alternate Bayesian Method for Supervised Learning", "link": "https://arxiv.org/abs/2404.11345", "description": "arXiv:2404.11345v1 Announce Type: new \nAbstract: This paper introduces the `Jacobi prior,' an alternative Bayesian method that aims to address the computational challenges inherent in traditional techniques. It demonstrates that the Jacobi prior performs better than well-known methods such as Lasso, Ridge, Elastic Net, and the MCMC-based Horse-Shoe Prior, especially in predictive accuracy. Additionally, we show that the Jacobi prior is more than a hundred times faster than these methods while maintaining similar predictive accuracy. The method is implemented for Generalised Linear Models, Gaussian process regression, and classification, making it suitable for longitudinal/panel data analysis.
The Jacobi prior can also handle data partitioned across servers worldwide, making it useful for distributed computing environments. Because the method runs faster while maintaining predictive accuracy, it is well suited to organizations seeking to reduce their environmental impact and meet ESG standards. To demonstrate the performance of the Jacobi prior, we conducted a detailed simulation study with four experiments, examining statistical consistency, accuracy, and speed. Additionally, we present two empirical studies. First, we thoroughly evaluate credit risk by studying default probability using data from the U.S. Small Business Administration (SBA). Second, we use the Jacobi prior to classify stars, quasars, and galaxies in a three-class problem using multinomial logit regression on Sloan Digital Sky Survey data. We use different filters as features. All code and datasets for this paper are available in the following GitHub repository: https://github.com/sourish-cmi/Jacobi-Prior/"}, "https://arxiv.org/abs/2404.11427": {"title": "Matern Correlation: A Panoramic Primer", "link": "https://arxiv.org/abs/2404.11427", "description": "arXiv:2404.11427v1 Announce Type: new \nAbstract: Matern correlation is of pivotal importance in spatial statistics and machine learning. This paper serves as a panoramic primer for this correlation with an emphasis on the exposition of its changing behavior and smoothness properties in response to the change of its two parameters. Such exposition is achieved through a series of simulation studies, the use of an interactive 3D visualization applet, and a practical modeling example, all tailored for a wide-ranging statistical audience. Meanwhile, the thorough understanding of these parameter-smoothness relationships, in turn, serves as a pragmatic guide for researchers in their real-world modeling endeavors, such as setting appropriate initial values for these parameters and parameter-fine-tuning in their Bayesian modeling practice or simulation studies involving the Matern correlation. Derived problems surrounding Matern, such as inconsistent parameter inference, extended forms of Matern and limitations of Matern, are also explored and surveyed to impart a panoramic view of this correlation."}, "https://arxiv.org/abs/2404.11510": {"title": "Improved bounds and inference on optimal regimes", "link": "https://arxiv.org/abs/2404.11510", "description": "arXiv:2404.11510v1 Announce Type: new \nAbstract: Point identification of causal effects requires strong assumptions that are unreasonable in many practical settings. However, informative bounds on these effects can often be derived under plausible assumptions. Even when these bounds are wide or cover null effects, they can guide practical decisions based on formal decision theoretic criteria. Here we derive new results on optimal treatment regimes in settings where the effect of interest is bounded. These results are driven by consideration of superoptimal regimes; we define regimes that leverage an individual's natural treatment value, which is typically ignored in the existing literature. We obtain (sharp) bounds for the value function of superoptimal regimes, and provide performance guarantees relative to conventional optimal regimes. As a case study, we consider a commonly studied Marginal Sensitivity Model and illustrate that the superoptimal regime can be identified when conventional optimal regimes are not. We similarly illustrate this property in an instrumental variable setting.
Finally, we derive efficient estimators for upper and lower bounds on the superoptimal value in instrumental variable settings, building on recent results on covariate adjusted Balke-Pearl bounds. These estimators are applied to study the effect of prompt ICU admission on survival."}, "https://arxiv.org/abs/2404.11579": {"title": "Spatial Heterogeneous Additive Partial Linear Model: A Joint Approach of Bivariate Spline and Forest Lasso", "link": "https://arxiv.org/abs/2404.11579", "description": "arXiv:2404.11579v1 Announce Type: new \nAbstract: Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an efficient clustering approach to identify the model heterogeneity of the spatial additive partial linear model. Specifically, we aim to detect the spatially contiguous clusters based on the regression coefficients while introducing a spatially varying intercept to deal with the smooth spatial effect. On the one hand, to approximate the spatially varying intercept, we use the method of bivariate spline over triangulation, which can effectively handle the data from a complex domain. On the other hand, a novel fusion penalty termed the forest lasso is proposed to reveal the spatial clustering pattern. Our proposed fusion penalty has advantages in both the estimation and computation efficiencies when dealing with large spatial data. Theoretical properties of our estimator are established, and simulation results show that our approach can achieve more accurate estimation with a limited computation cost compared with the existing approaches. To illustrate its practical use, we apply our approach to analyze the spatial pattern of the relationship between land surface temperature measured by satellites and air temperature measured by ground stations in the United States."}, "https://arxiv.org/abs/2404.10883": {"title": "Automated Discovery of Functional Actual Causes in Complex Environments", "link": "https://arxiv.org/abs/2404.10883", "description": "arXiv:2404.10883v1 Announce Type: cross \nAbstract: Reinforcement learning (RL) algorithms often struggle to learn policies that generalize to novel situations due to issues such as causal confusion, overfitting to irrelevant factors, and failure to isolate control of state factors. These issues stem from a common source: a failure to accurately identify and exploit state-specific causal relationships in the environment. While some prior works in RL aim to identify these relationships explicitly, they rely on informal domain-specific heuristics such as spatial and temporal proximity. Actual causality offers a principled and general framework for determining the causes of particular events. However, existing definitions of actual cause often attribute causality to a large number of events, even if many of them rarely influence the outcome. Prior work on actual causality proposes normality as a solution to this problem, but its existing implementations are challenging to scale to complex and continuous-valued RL environments. This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes.
We additionally introduce Joint Optimization for Actual Cause Inference (JACI), an algorithm that learns from observational data to infer functional actual causes. We demonstrate empirically that FAC agrees with known results on a suite of examples from the actual causality literature, and JACI identifies actual causes with significantly higher accuracy than existing heuristic methods in a set of complex, continuous-valued environments."}, "https://arxiv.org/abs/2404.10942": {"title": "What Hides behind Unfairness? Exploring Dynamics Fairness in Reinforcement Learning", "link": "https://arxiv.org/abs/2404.10942", "description": "arXiv:2404.10942v1 Announce Type: cross \nAbstract: In sequential decision-making problems involving sensitive attributes like race and gender, reinforcement learning (RL) agents must carefully consider long-term fairness while maximizing returns. Recent works have proposed many different types of fairness notions, but how unfairness arises in RL problems remains unclear. In this paper, we address this gap in the literature by investigating the sources of inequality through a causal lens. We first analyse the causal relationships governing the data generation process and decompose the effect of sensitive attributes on long-term well-being into distinct components. We then introduce a novel notion called dynamics fairness, which explicitly captures the inequality stemming from environmental dynamics, distinguishing it from those induced by decision-making or inherited from the past. This notion requires evaluating the expected changes in the next state and the reward induced by changing the value of the sensitive attribute while holding everything else constant. To quantitatively evaluate this counterfactual concept, we derive identification formulas that allow us to obtain reliable estimations from data. Extensive experiments demonstrate the effectiveness of the proposed techniques in explaining, detecting, and reducing inequality in reinforcement learning."}, "https://arxiv.org/abs/2404.11006": {"title": "Periodicity in New York State COVID-19 Hospitalizations Leveraged from the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2404.11006", "description": "arXiv:2404.11006v1 Announce Type: cross \nAbstract: The outbreak of the SARS-CoV-2 virus, which led to an unprecedented global pandemic, has underscored the critical importance of understanding seasonal patterns. This knowledge is fundamental for decision-making in healthcare and public health domains. Investigating the presence, intensity, and precise nature of seasonal trends, as well as these temporal patterns, is essential for forecasting future occurrences, planning interventions, and making informed decisions based on the evolution of events over time. This study employs the Variable Bandpass Periodic Block Bootstrap (VBPBB) to separate and analyze different periodic components by frequency in time series data, focusing on annually correlated (PC) principal components. Bootstrapping, a method used to estimate statistical sampling distributions through random sampling with replacement, is particularly useful in this context. Specifically, block bootstrapping, a model-independent resampling method suitable for time series data, is utilized. Its extensions are aimed at preserving the correlation structures inherent in PC processes. The VBPBB applies a bandpass filter to isolate the relevant PC frequency, thereby minimizing contamination from extraneous frequencies and noise. 
This approach significantly narrows the confidence intervals, enhancing the precision of estimated sampling distributions for the investigated periodic characteristics. Furthermore, we compared the outcomes of block bootstrapping for periodically correlated time series with VBPBB against those from more traditional bootstrapping methods. Our analysis shows VBPBB provides strong evidence of the existence of an annual seasonal PC pattern in hospitalization rates not detectible by other methods, providing timing and confidence intervals for their impact."}, "https://arxiv.org/abs/2404.11341": {"title": "The Causal Chambers: Real Physical Systems as a Testbed for AI Methodology", "link": "https://arxiv.org/abs/2404.11341", "description": "arXiv:2404.11341v1 Announce Type: cross \nAbstract: In some fields of AI, machine learning and statistics, the validation of new methods and algorithms is often hindered by the scarcity of suitable real-world datasets. Researchers must often turn to simulated data, which yields limited information about the applicability of the proposed methods to real problems. As a step forward, we have constructed two devices that allow us to quickly and inexpensively produce large datasets from non-trivial but well-understood physical systems. The devices, which we call causal chambers, are computer-controlled laboratories that allow us to manipulate and measure an array of variables from these physical systems, providing a rich testbed for algorithms from a variety of fields. We illustrate potential applications through a series of case studies in fields such as causal discovery, out-of-distribution generalization, change point detection, independent component analysis, and symbolic regression. For applications to causal inference, the chambers allow us to carefully perform interventions. We also provide and empirically validate a causal model of each chamber, which can be used as ground truth for different tasks. All hardware and software is made open source, and the datasets are publicly available at causalchamber.org or through the Python package causalchamber."}, "https://arxiv.org/abs/2208.03489": {"title": "Forecasting Algorithms for Causal Inference with Panel Data", "link": "https://arxiv.org/abs/2208.03489", "description": "arXiv:2208.03489v3 Announce Type: replace \nAbstract: Conducting causal inference with panel data is a core challenge in social science research. We adapt a deep neural architecture for time series forecasting (the N-BEATS algorithm) to more accurately impute the counterfactual evolution of a treated unit had treatment not occurred. Across a range of settings, the resulting estimator (``SyNBEATS'') significantly outperforms commonly employed methods (synthetic controls, two-way fixed effects), and attains comparable or more accurate performance compared to recently proposed methods (synthetic difference-in-differences, matrix completion). An implementation of this estimator is available for public use. Our results highlight how advances in the forecasting literature can be harnessed to improve causal inference in panel data settings."}, "https://arxiv.org/abs/2209.03218": {"title": "Local Projection Inference in High Dimensions", "link": "https://arxiv.org/abs/2209.03218", "description": "arXiv:2209.03218v3 Announce Type: replace \nAbstract: In this paper, we estimate impulse responses by local projections in high-dimensional settings. 
We use the desparsified (de-biased) lasso to estimate the high-dimensional local projections, while leaving the impulse response parameter of interest unpenalized. We establish the uniform asymptotic normality of the proposed estimator under general conditions. Finally, we demonstrate small sample performance through a simulation study and consider two canonical applications in macroeconomic research on monetary policy and government spending."}, "https://arxiv.org/abs/2209.10587": {"title": "DeepVARwT: Deep Learning for a VAR Model with Trend", "link": "https://arxiv.org/abs/2209.10587", "description": "arXiv:2209.10587v3 Announce Type: replace \nAbstract: The vector autoregressive (VAR) model has been used to describe the dependence within and across multiple time series. This is a model for stationary time series which can be extended to allow the presence of a deterministic trend in each series. Detrending the data either parametrically or nonparametrically before fitting the VAR model gives rise to additional errors in the subsequent model fitting. In this study, we propose a new approach called DeepVARwT that employs deep learning methodology for maximum likelihood estimation of the trend and the dependence structure at the same time. A Long Short-Term Memory (LSTM) network is used for this purpose. To ensure the stability of the model, we enforce the causality condition on the autoregressive coefficients using the transformation of Ansley & Kohn (1986). We provide a simulation study and an application to real data. In the simulation study, we use realistic trend functions generated from real data and compare the estimates with true function/parameter values. In the real data application, we compare the prediction performance of this model with state-of-the-art models in the literature."}, "https://arxiv.org/abs/2305.00218": {"title": "Subdata selection for big data regression: an improved approach", "link": "https://arxiv.org/abs/2305.00218", "description": "arXiv:2305.00218v2 Announce Type: replace \nAbstract: In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may suffer due to the large sample size, since they involve inverting huge data matrices or even because the data cannot fit in memory. Proposed approaches are based on selecting representative subdata to run the regression. Existing approaches select the subdata using information criteria and/or properties from orthogonal arrays. In the present paper we improve existing algorithms by providing a new algorithm that is based on the D-optimality approach. We provide simulation evidence for its performance. Evidence about the parameters of the proposed algorithm is also provided in order to clarify the trade-offs between execution time and information gain. Real data applications are also provided."}, "https://arxiv.org/abs/2305.08201": {"title": "Efficient Computation of High-Dimensional Penalized Generalized Linear Mixed Models by Latent Factor Modeling of the Random Effects", "link": "https://arxiv.org/abs/2305.08201", "description": "arXiv:2305.08201v2 Announce Type: replace \nAbstract: Modern biomedical datasets are increasingly high dimensional and exhibit complex correlation structures. Generalized Linear Mixed Models (GLMMs) have long been employed to account for such dependencies. 
However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches."}, "https://arxiv.org/abs/1710.00915": {"title": "Change Acceleration and Detection", "link": "https://arxiv.org/abs/1710.00915", "description": "arXiv:1710.00915v4 Announce Type: replace-cross \nAbstract: A novel sequential change detection problem is proposed, in which the goal is to not only detect but also accelerate the change. Specifically, it is assumed that the sequentially collected observations are responses to treatments selected in real time. The assigned treatments determine the pre-change and post-change distributions of the responses and also influence when the change happens. The goal is to find a treatment assignment rule and a stopping rule that minimize the expected total number of observations subject to a user-specified bound on the false alarm probability. The optimal solution is obtained under a general Markovian change-point model. Moreover, an alternative procedure is proposed, whose applicability is not restricted to Markovian change-point models and whose design requires minimal computation. For a large class of change-point models, the proposed procedure is shown to achieve the optimal performance in an asymptotic sense. Finally, its performance is found in simulation studies to be comparable to the optimal, uniformly with respect to the error probability."}, "https://arxiv.org/abs/1908.01109": {"title": "The Use of Binary Choice Forests to Model and Estimate Discrete Choices", "link": "https://arxiv.org/abs/1908.01109", "description": "arXiv:1908.01109v5 Announce Type: replace-cross \nAbstract: Problem definition. In retailing, discrete choice models (DCMs) are commonly used to capture the choice behavior of customers when offered an assortment of products. When estimating DCMs using transaction data, flexible models (such as machine learning models or nonparametric models) are typically not interpretable and hard to estimate, while tractable models (such as the multinomial logit model) tend to misspecify the complex behavior represented in the data. Methodology/results. In this study, we use a forest of binary decision trees to represent DCMs. This approach is based on random forests, a popular machine learning algorithm. The resulting model is interpretable: the decision trees can explain the decision-making process of customers during the purchase. We show that our approach can predict the choice probability of any DCM consistently and thus never suffers from misspecification. Moreover, our algorithm predicts assortments unseen in the training data. The mechanism and errors can be theoretically analyzed. 
We also prove that the random forest can recover preference rankings of customers thanks to splitting criteria such as the Gini index and information gain ratio. Managerial implications. The framework has unique practical advantages. It can capture customers' behavioral patterns such as irrationality or sequential searches when purchasing a product. It handles nonstandard formats of training data that result from aggregation. It can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product. It can also incorporate price information and customer features. Our numerical experiments using synthetic and real data show that using random forests to estimate customer choices can outperform existing methods."}, "https://arxiv.org/abs/2204.01884": {"title": "Policy Learning with Competing Agents", "link": "https://arxiv.org/abs/2204.01884", "description": "arXiv:2204.01884v4 Announce Type: replace-cross \nAbstract: Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating estimation of the optimal policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy gradient. In a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior."}, "https://arxiv.org/abs/2211.07451": {"title": "Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain", "link": "https://arxiv.org/abs/2211.07451", "description": "arXiv:2211.07451v3 Announce Type: replace-cross \nAbstract: Forecasts of regional electricity net-demand, consumption minus embedded generation, are an essential input for reliable and economic power system operation, and energy trading. While such forecasts are typically performed region by region, operations such as managing power flows require spatially coherent joint forecasts, which account for cross-regional dependencies. Here, we forecast the joint distribution of net-demand across the 14 regions constituting Great Britain's electricity network. Joint modelling is complicated by the fact that the net-demand variability within each region, and the dependencies between regions, vary with temporal, socio-economical and weather-related factors. We accommodate these characteristics by proposing a multivariate Gaussian model based on a modified Cholesky parametrisation, which allows us to model each unconstrained parameter via an additive model. Given that the number of model parameters and covariates is large, we adopt a semi-automated approach to model selection, based on gradient boosting. 
In addition to comparing the forecasting performance of several versions of the proposed model with that of two non-Gaussian copula-based models, we visually explore the model output to interpret how the covariates affect net-demand variability and dependencies.\n The code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.7315105, while methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM."}, "https://arxiv.org/abs/2305.08204": {"title": "glmmPen: High Dimensional Penalized Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2305.08204", "description": "arXiv:2305.08204v2 Announce Type: replace-cross \nAbstract: Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process since model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower-dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first to select fixed and random effects in higher dimension using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo Expectation Conditional Minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show our method has good performance in selecting both the fixed and random effects in high dimensional GLMMs."}, "https://arxiv.org/abs/2308.15838": {"title": "Adaptive Lasso, Transfer Lasso, and Beyond: An Asymptotic Perspective", "link": "https://arxiv.org/abs/2308.15838", "description": "arXiv:2308.15838v2 Announce Type: replace-cross \nAbstract: This paper presents a comprehensive exploration of the theoretical properties inherent in the Adaptive Lasso and the Transfer Lasso. The Adaptive Lasso, a well-established method, employs regularization divided by initial estimators and is characterized by asymptotic normality and variable selection consistency. In contrast, the recently proposed Transfer Lasso employs regularization subtracted by initial estimators with the demonstrated capacity to curtail non-asymptotic estimation errors. A pivotal question thus emerges: Given the distinct ways the Adaptive Lasso and the Transfer Lasso employ initial estimators, what benefits or drawbacks does this disparity confer upon each method? This paper conducts a theoretical examination of the asymptotic properties of the Transfer Lasso, thereby elucidating its differentiation from the Adaptive Lasso. Informed by the findings of this analysis, we introduce a novel method, one that amalgamates the strengths and compensates for the weaknesses of both methods. 
The paper concludes with validations of our theory and comparisons of the methods via simulation experiments."}, "https://arxiv.org/abs/2404.11678": {"title": "Corrected Correlation Estimates for Meta-Analysis", "link": "https://arxiv.org/abs/2404.11678", "description": "arXiv:2404.11678v1 Announce Type: new \nAbstract: Meta-analysis allows rigorous aggregation of estimates and uncertainty across multiple studies. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across exposure groups, accounting for within-study correlations improves accuracy and efficiency of meta-analytic results. Canonical approaches of Greenland-Longnecker and Hamling estimate pseudo cases and non-cases for exposure groups to obtain within-study correlations. However, currently available implementations for both methods fail on simple examples.\n We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations guaranteed to work for any feasible input.\n For RRs, we show the new GL approach is always guaranteed to succeed, but any Hamling approach may fail: we give counter-examples where no solutions exist. We derive a sufficient condition on reported RRs that guarantees success when reported variances are all equal."}, "https://arxiv.org/abs/2404.11713": {"title": "Propensity Score Analysis with Guaranteed Subgroup Balance", "link": "https://arxiv.org/abs/2404.11713", "description": "arXiv:2404.11713v1 Announce Type: new \nAbstract: Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity may be present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a new algorithm, the propensity score analysis with guaranteed subgroup balance (G-SBPS), to achieve covariate mean balance in all subgroups. We further incorporated nonparametric kernel regression for the propensity scores and developed a kernelized G-SBPS (kG-SBPS) to improve the subgroup mean balance of covariate transformations in a rich functional class. This extension is more robust to propensity score model misspecification. Extensive numerical studies showed that G-SBPS and kG-SBPS improve both subgroup covariate balance and subgroup treatment effect estimation, compared to existing approaches. We applied G-SBPS and kG-SBPS to a dataset on right heart catheterization to estimate the subgroup average treatment effects on the hospital length of stay and a dataset on diabetes self-management training to estimate the subgroup average treatment effects for the treated on the hospitalization rate."}, "https://arxiv.org/abs/2404.11739": {"title": "Testing Mechanisms", "link": "https://arxiv.org/abs/2404.11739", "description": "arXiv:2404.11739v1 Announce Type: new \nAbstract: Economists are often interested in the mechanisms by which a particular treatment affects an outcome. 
This paper develops tests for the ``sharp null of full mediation'' that the treatment $D$ operates on the outcome $Y$ only through a particular conjectured mechanism (or set of mechanisms) $M$. A key observation is that if $D$ is randomly assigned and has a monotone effect on $M$, then $D$ is a valid instrumental variable for the local average treatment effect (LATE) of $M$ on $Y$. Existing tools for testing the validity of the LATE assumptions can thus be used to test the sharp null of full mediation when $M$ and $D$ are binary. We develop a more general framework that allows one to test whether the effect of $D$ on $Y$ is fully explained by a potentially multi-valued and multi-dimensional set of mechanisms $M$, allowing for relaxations of the monotonicity assumption. We further provide methods for lower-bounding the size of the alternative mechanisms when the sharp null is rejected. An advantage of our approach relative to existing tools for mediation analysis is that it does not require stringent assumptions about how $M$ is assigned; on the other hand, our approach helps to answer different questions than traditional mediation analysis by focusing on the sharp null rather than estimating average direct and indirect effects. We illustrate the usefulness of the testable implications in two empirical applications."}, "https://arxiv.org/abs/2404.11747": {"title": "Spatio-temporal patterns of diurnal temperature: a random matrix approach I-case of India", "link": "https://arxiv.org/abs/2404.11747", "description": "arXiv:2404.11747v1 Announce Type: new \nAbstract: We consider the spatio-temporal gridded daily diurnal temperature range (DTR) data across India during the 72-year period 1951--2022. We augment this data with information on the El Nino-Southern Oscillation (ENSO) and on the climatic regions (Stamp's and Koeppen's classification) and four seasons of India.\n We use various matrix theory approaches to trim out strong but routine signals, random matrix theory to remove noise, and novel empirical generalised singular-value distributions to establish retention of essential signals in the trimmed data. We make use of the spatial Bergsma statistics to measure spatial association and identify temporal change points in the spatial-association.\n In particular, our investigation captures a yet unknown change-point over the 72 years under study with drastic changes in spatial-association of DTR in India. It also brings out changes in spatial association with regard to ENSO.\n We conclude that while studying/modelling Indian DTR data, due consideration should be granted to the strong spatial association that is being persistently exhibited over decades, and provision should be kept for potential change points in the temporal behaviour, which in turn can bring moderate to dramatic changes in the spatial association pattern.\n Some of our analysis also reaffirms the conclusions made by other authors, regarding spatial and temporal behavior of DTR, adding our own insights. We consider the data from the yearly, seasonal and climatic zones points of view, and discover several new and interesting statistical structures which should be of interest, especially to climatologists and statisticians. 
Our methods are not country specific and could be used profitably for DTR data from other geographical areas."}, "https://arxiv.org/abs/2404.11767": {"title": "Regret Analysis in Threshold Policy Design", "link": "https://arxiv.org/abs/2404.11767", "description": "arXiv:2404.11767v1 Announce Type: new \nAbstract: Threshold policies are targeting mechanisms that assign treatments based on whether an observable characteristic exceeds a certain threshold. They are widespread across multiple domains, such as welfare programs, taxation, and clinical medicine. This paper addresses the problem of designing threshold policies using experimental data, when the goal is to maximize the population welfare. First, I characterize the regret (a measure of policy optimality) of the Empirical Welfare Maximizer (EWM) policy, popular in the literature. Next, I introduce the Smoothed Welfare Maximizer (SWM) policy, which improves the EWM's regret convergence rate under an additional smoothness condition. The two policies are compared by studying how differently their regrets depend on the population distribution, and by investigating their finite sample performances through Monte Carlo simulations. In many contexts, the welfare guaranteed by the novel SWM policy is larger than with the EWM. An empirical illustration demonstrates how the treatment recommendations of the two policies may in practice notably differ."}, "https://arxiv.org/abs/2404.11781": {"title": "Asymmetric canonical correlation analysis of Riemannian and high-dimensional data", "link": "https://arxiv.org/abs/2404.11781", "description": "arXiv:2404.11781v1 Announce Type: new \nAbstract: In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture."}, "https://arxiv.org/abs/2404.11813": {"title": "Detection of a structural break in intraday volatility pattern", "link": "https://arxiv.org/abs/2404.11813", "description": "arXiv:2404.11813v1 Announce Type: new \nAbstract: We develop theory leading to testing procedures for the presence of a change point in the intraday volatility pattern. The new theory is developed in the framework of Functional Data Analysis. It is based on a model akin to the stochastic volatility model for scalar point-to-point returns. In our context, we study intraday curves, one curve per trading day. 
After postulating a suitable model for such functional data, we present three tests focusing, respectively, on changes in the shape, the magnitude and arbitrary changes in the sequences of the curves of interest. We justify the respective procedures by showing that they have asymptotically correct size and by deriving consistency rates for all tests. These rates involve the sample size (the number of trading days) and the grid size (the number of observations per day). We also derive the corresponding change point estimators and their consistency rates. All procedures are additionally validated by a simulation study and an application to US stocks."}, "https://arxiv.org/abs/2404.11839": {"title": "(Empirical) Bayes Approaches to Parallel Trends", "link": "https://arxiv.org/abs/2404.11839", "description": "arXiv:2404.11839v1 Announce Type: new \nAbstract: We consider Bayes and Empirical Bayes (EB) approaches for dealing with violations of parallel trends. In the Bayes approach, the researcher specifies a prior over both the pre-treatment violations of parallel trends $\\delta_{pre}$ and the post-treatment violations $\\delta_{post}$. The researcher then updates their posterior about the post-treatment bias $\\delta_{post}$ given an estimate of the pre-trends $\\delta_{pre}$. This allows them to form posterior means and credible sets for the treatment effect of interest, $\\tau_{post}$. In the EB approach, the prior on the violations of parallel trends is learned from the pre-treatment observations. We illustrate these approaches in two empirical applications."}, "https://arxiv.org/abs/2404.12319": {"title": "Marginal Analysis of Count Time Series in the Presence of Missing Observations", "link": "https://arxiv.org/abs/2404.12319", "description": "arXiv:2404.12319v1 Announce Type: new \nAbstract: Time series in real-world applications often have missing observations, making typical analytical methods unsuitable. One method for dealing with missing data is the concept of amplitude modulation. While this principle works with any data, here, missing data for unbounded and bounded count time series are investigated, where tailor-made dispersion and skewness statistics are used for model diagnostics. General closed-form asymptotic formulas are derived for such statistics with only weak assumptions on the underlying process. Moreover, closed-form formulas are derived for the popular special cases of Poisson and binomial autoregressive processes, always under the assumption that missingness occurs. The finite-sample performances of the considered asymptotic approximations are analyzed with simulations. The practical application of the corresponding dispersion and skewness tests under missing data is demonstrated with three real-data examples."}, "https://arxiv.org/abs/2404.11675": {"title": "Decomposition of Longitudinal Disparities: an Application to the Fetal Growth-Singletons Study", "link": "https://arxiv.org/abs/2404.11675", "description": "arXiv:2404.11675v1 Announce Type: cross \nAbstract: Addressing health disparities among different demographic groups is a key challenge in public health. Despite many efforts, there is still a gap in understanding how these disparities unfold over time. Our paper focuses on this overlooked longitudinal aspect, which is crucial in both clinical and public health settings. 
In this paper, we introduce a longitudinal disparity decomposition method that decomposes disparities into three components: the explained disparity linked to differences in the explanatory variables' conditional distribution when the modifier distribution is identical between majority and minority groups, the explained disparity that emerges specifically from the unequal distribution of the modifier and its interaction with covariates, and the unexplained disparity. The proposed method offers a dynamic alternative to the traditional Peters-Belson decomposition approach, tackling both the potential reduction in disparity if the covariate distributions of minority groups matched those of the majority group and the evolving nature of disparity over time. We apply the proposed approach to a fetal growth study to gain insights into disparities between different race/ethnicity groups in fetal developmental progress throughout the course of pregnancy."}, "https://arxiv.org/abs/2404.11922": {"title": "Redefining the Shortest Path Problem Formulation of the Linear Non-Gaussian Acyclic Model: Pairwise Likelihood Ratios, Prior Knowledge, and Path Enumeration", "link": "https://arxiv.org/abs/2404.11922", "description": "arXiv:2404.11922v1 Announce Type: cross \nAbstract: Effective causal discovery is essential for learning the causal graph from observational data. The linear non-Gaussian acyclic model (LiNGAM) operates under the assumption of a linear data generating process with non-Gaussian noise in determining the causal graph. Its assumption of unmeasured confounders being absent, however, poses practical limitations. In response, empirical research has shown that the reformulation of LiNGAM as a shortest path problem (LiNGAM-SPP) addresses this limitation. Within LiNGAM-SPP, mutual information is chosen to serve as the measure of independence. This introduces a challenge, however: parameter tuning is now needed due to the reliance on kNN mutual information estimators. The paper proposes a threefold enhancement to the LiNGAM-SPP framework.\n First, the need for parameter tuning is eliminated by using the pairwise likelihood ratio in lieu of kNN-based mutual information. This substitution is validated on a general data generating process and benchmark real-world data sets, outperforming existing methods especially when given a larger set of features. The incorporation of prior knowledge is then enabled by a node-skipping strategy implemented on the graph representation of all causal orderings to eliminate violations based on the provided input of relative orderings. Flexibility relative to existing approaches is achieved. Last among the three enhancements is the utilization of the distribution of paths in the graph representation of all causal orderings. From this, crucial properties of the true causal graph such as the presence of unmeasured confounders and sparsity may be inferred. To some extent, the expected performance of the causal discovery algorithm may be predicted. 
The refinements above advance the practicality and performance of LiNGAM-SPP, showcasing the potential of graph-search-based methodologies in advancing causal discovery."}, "https://arxiv.org/abs/2404.12181": {"title": "Estimation of the invariant measure of a multidimensional diffusion from noisy observations", "link": "https://arxiv.org/abs/2404.12181", "description": "arXiv:2404.12181v1 Announce Type: cross \nAbstract: We introduce a new approach for estimating the invariant density of a multidimensional diffusion when dealing with high-frequency observations blurred by independent noises. We consider the intermediate regime, where observations occur at discrete time instances $k\\Delta_n$ for $k=0,\\dots,n$, under the conditions $\\Delta_n\\to 0$ and $n\\Delta_n\\to\\infty$. Our methodology involves the construction of a kernel density estimator that uses a pre-averaging technique to proficiently remove noise from the data while preserving the analytical characteristics of the underlying signal and its asymptotic properties. The rate of convergence of our estimator depends on both the anisotropic regularity of the density and the intensity of the noise. We establish conditions on the intensity of the noise that ensure the recovery of convergence rates similar to those achievable without any noise. Furthermore, we prove a Bernstein concentration inequality for our estimator, from which we derive an adaptive procedure for the kernel bandwidth selection."}, "https://arxiv.org/abs/2404.12238": {"title": "Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation", "link": "https://arxiv.org/abs/2404.12238", "description": "arXiv:2404.12238v1 Announce Type: cross \nAbstract: In recent years, there has been a growing interest in using machine learning techniques for the estimation of treatment effects. Most of the best-performing methods rely on representation learning strategies that encourage shared behavior among potential outcomes to increase the precision of treatment effect estimates. In this paper we discuss and classify these models in terms of their algorithmic inductive biases and present a new model, NN-CGC, that considers additional information from the causal graph. NN-CGC tackles bias resulting from spurious variable interactions by implementing novel constraints on models, and it can be integrated with other representation learning methods. We test the effectiveness of our method using three different base models on common benchmarks. Our results indicate that our model constraints lead to significant improvements, achieving new state-of-the-art results in treatment effects estimation. We also show that our method is robust to imperfect causal graphs and that using partial causal information is preferable to ignoring it."}, "https://arxiv.org/abs/2404.12290": {"title": "Debiased Distribution Compression", "link": "https://arxiv.org/abs/2404.12290", "description": "arXiv:2404.12290v1 Announce Type: cross \nAbstract: Modern compression methods can summarize a target distribution $\\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. 
Given $n$ points targeting the wrong distribution and quadratic time, Stein Kernel Thinning (SKT) returns $\\sqrt{n}$ equal-weighted points with $\\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\\mathbb {P}$. For larger-scale compression tasks, Low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein Recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\\operatorname{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering."}, "https://arxiv.org/abs/2306.00389": {"title": "R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models", "link": "https://arxiv.org/abs/2306.00389", "description": "arXiv:2306.00389v2 Announce Type: replace-cross \nAbstract: Models with random effects, such as generalised linear mixed models (GLMMs), are often used for analysing clustered data. Parameter inference with these models is difficult because of the presence of cluster-specific random effects, which must be integrated out when evaluating the likelihood function. Here, we propose a sequential variational Bayes algorithm, called Recursive Variational Gaussian Approximation for Latent variable models (R-VGAL), for estimating parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially, requires only a single pass through the data, and can provide parameter updates as new data are collected without the need of re-processing the previous data. At each update, the R-VGAL algorithm requires the gradient and Hessian of a \"partial\" log-likelihood function evaluated at the new observation, which are generally not available in closed form for GLMMs. To circumvent this issue, we propose using an importance-sampling-based approach for estimating the gradient and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be unstable when traversing the first few data points, but that this issue can be mitigated by using a variant of variational tempering in the initial steps of the algorithm. Through illustrations on both simulated and real datasets, we show that R-VGAL provides good approximations to the exact posterior distributions, that it can be made robust through tempering, and that it is computationally efficient."}, "https://arxiv.org/abs/2311.03343": {"title": "Distribution-uniform anytime-valid sequential inference", "link": "https://arxiv.org/abs/2311.03343", "description": "arXiv:2311.03343v2 Announce Type: replace-cross \nAbstract: Are asymptotic confidence sequences and anytime $p$-values uniformly valid for a nontrivial class of distributions $\\mathcal{P}$? We give a positive answer to this question by deriving distribution-uniform anytime-valid inference procedures. Historically, anytime-valid methods -- including confidence sequences, anytime $p$-values, and sequential hypothesis tests that enable inference at stopping times -- have been justified nonasymptotically. 
Nevertheless, asymptotic procedures such as those based on the central limit theorem occupy an important part of the statistical toolbox due to their simplicity, universality, and weak assumptions. While recent work has derived asymptotic analogues of anytime-valid methods with the aforementioned benefits, these were not shown to be $\\mathcal{P}$-uniform, meaning that their asymptotics are not uniformly valid in a class of distributions $\\mathcal{P}$. Indeed, the anytime-valid inference literature currently has no central limit theory to draw from that is both uniform in $\\mathcal{P}$ and in the sample size $n$. This paper fills that gap by deriving a novel $\\mathcal{P}$-uniform strong Gaussian approximation theorem. We apply some of these results to obtain an anytime-valid test of conditional independence without the Model-X assumption, as well as a $\\mathcal{P}$-uniform law of the iterated logarithm."}, "https://arxiv.org/abs/2404.12462": {"title": "Axiomatic modeling of fixed proportion technologies", "link": "https://arxiv.org/abs/2404.12462", "description": "arXiv:2404.12462v1 Announce Type: new \nAbstract: Understanding input substitution and output transformation possibilities is critical for efficient resource allocation and firm strategy. There are important examples of fixed proportion technologies where certain inputs are non-substitutable and/or certain outputs are non-transformable. However, there is widespread confusion about the appropriate modeling of fixed proportion technologies in data envelopment analysis. We point out and rectify several misconceptions in the existing literature, and show how fixed proportion technologies can be correctly incorporated into the axiomatic framework. A Monte Carlo study is performed to demonstrate the proposed solution."}, "https://arxiv.org/abs/2404.12463": {"title": "Spatially Selected and Dependent Random Effects for Small Area Estimation with Application to Rent Burden", "link": "https://arxiv.org/abs/2404.12463", "description": "arXiv:2404.12463v1 Announce Type: new \nAbstract: Area-level models for small area estimation typically rely on areal random effects to shrink design-based direct estimates towards a model-based predictor. Incorporating the spatial dependence of the random effects into these models can further improve the estimates when there are not enough covariates to fully account for spatial dependence of the areal means. A number of recent works have investigated models that include random effects for only a subset of areas, in order to improve the precision of estimates. However, such models do not readily handle spatial dependence. In this paper, we introduce a model that accounts for spatial dependence in both the random effects as well as the latent process that selects the effects. We show how this model can significantly improve predictive accuracy via an empirical simulation study based on data from the American Community Survey, and illustrate its properties via an application to estimate county-level median rent burden."}, "https://arxiv.org/abs/2404.12499": {"title": "A Multivariate Copula-based Bayesian Framework for Doping Detection", "link": "https://arxiv.org/abs/2404.12499", "description": "arXiv:2404.12499v1 Announce Type: new \nAbstract: Doping control is an essential component of anti-doping organizations for protecting clean sports competitions. Since 2009, this mission has been complemented worldwide by the Athlete Biological Passport (ABP), used to monitor athletes' individual profiles over time. 
The practical implementation of the ABP is based on a Bayesian framework, called ADAPTIVE, intended to identify individual reference ranges outside of which an observation may indicate doping abuse. Currently, this method follows a univariate approach, relying on simultaneous univariate analysis of different markers. This work extends the ADAPTIVE method to a multivariate testing framework, making use of copula models to couple the marginal distribution of biomarkers with their dependency structure. After introducing the proposed copula-based hierarchical model, we discuss our approach to inference, grounded in a Bayesian spirit, and present an extension to multidimensional predictive reference regions. Focusing on the hematological module of the ABP, we evaluate the proposed framework in both data-driven simulations and real data."}, "https://arxiv.org/abs/2404.12581": {"title": "Two-step Estimation of Network Formation Models with Unobserved Heterogeneities and Strategic Interactions", "link": "https://arxiv.org/abs/2404.12581", "description": "arXiv:2404.12581v1 Announce Type: new \nAbstract: In this paper, I characterize the network formation process as a static game of incomplete information, where the latent payoff of forming a link between two individuals depends on the structure of the network, as well as private information on agents' attributes. I allow agents' private unobserved attributes to be correlated with observed attributes through individual fixed effects. Using data from a single large network, I propose a two-step estimator for the model primitives. In the first step, I estimate agents' equilibrium beliefs of other people's choice probabilities. In the second step, I plug the first-step estimator into the conditional choice probability expression and estimate the model parameters and the unobserved individual fixed effects together using Joint MLE. Assuming that the observed attributes are discrete, I show that the first-step estimator is uniformly consistent at rate $N^{-1/4}$, where $N$ is the total number of linking proposals. I also show that the second-step estimator converges asymptotically to a normal distribution at the same rate."}, "https://arxiv.org/abs/2404.12592": {"title": "Integer Programming for Learning Directed Acyclic Graphs from Non-identifiable Gaussian Models", "link": "https://arxiv.org/abs/2404.12592", "description": "arXiv:2404.12592v1 Announce Type: new \nAbstract: We study the problem of learning directed acyclic graphs from continuous observational data, generated according to a linear Gaussian structural equation model. State-of-the-art structure learning methods for this setting have at least one of the following shortcomings: i) they cannot provide optimality guarantees and can suffer from learning sub-optimal models; ii) they rely on the stringent assumption that the noise is homoscedastic, and hence the underlying model is fully identifiable. We overcome these shortcomings and develop a computationally efficient mixed-integer programming framework for learning medium-sized problems that accounts for arbitrary heteroscedastic noise. We present an early stopping criterion under which we can terminate the branch-and-bound procedure to achieve an asymptotically optimal solution and establish the consistency of this approximate solution. 
In addition, we show via numerical experiments that our method outperforms three state-of-the-art algorithms and is robust to noise heteroscedasticity, whereas the performance of the competing methods deteriorates under strong violations of the identifiability assumption. The software implementation of our method is available as the Python package \\emph{micodag}."}, "https://arxiv.org/abs/2404.12696": {"title": "Gaussian dependence structure pairwise goodness-of-fit testing based on conditional covariance and the 20/60/20 rule", "link": "https://arxiv.org/abs/2404.12696", "description": "arXiv:2404.12696v1 Announce Type: new \nAbstract: We present a novel data-oriented statistical framework that assesses the presumed Gaussian dependence structure in a pairwise setting. This refers to both multivariate normality and normal copula goodness-of-fit testing. The proposed test clusters the data according to the 20/60/20 rule and confronts conditional covariance (or correlation) estimates on the obtained subsets. The corresponding test statistic has a natural practical interpretation, desirable statistical properties, and asymptotic pivotal distribution under the multivariate normality assumption. We illustrate the usefulness of the introduced framework using extensive power simulation studies and show that our approach outperforms popular benchmark alternatives. Also, we apply the proposed methodology to commodities market data."}, "https://arxiv.org/abs/2404.12756": {"title": "Why not a thin plate spline for spatial models? A comparative study using Bayesian inference", "link": "https://arxiv.org/abs/2404.12756", "description": "arXiv:2404.12756v1 Announce Type: new \nAbstract: Spatial modelling often uses Gaussian random fields to capture the stochastic nature of studied phenomena. However, this approach incurs significant computational burdens ($O(n^3)$), primarily due to covariance matrix computations. In this study, we propose to use a low-rank approximation of a thin plate spline as a spatial random effect in Bayesian spatial models. We compare its statistical performance and computational efficiency with the approximated Gaussian random field (by the SPDE method). In this case, the dense matrix of the thin plate spline is approximated using a truncated spectral decomposition, resulting in computational complexity of $O(kn^2)$ operations, where k is the number of knots. Bayesian inference is conducted via the Hamiltonian Monte Carlo algorithm of the probabilistic software Stan, which allows us to evaluate performance and diagnostics for the proposed models. A simulation study reveals that both models accurately recover the parameters used to simulate data. However, models using a thin plate spline demonstrate superior execution time to achieve the convergence of chains compared to the models utilizing an approximated Gaussian random field. Furthermore, thin plate spline models exhibited better computational efficiency for simulated data coming from different spatial locations. In a real application, models using a thin plate spline as spatial random effect produced similar results in estimating a relative index of abundance for a benthic marine species when compared to models incorporating an approximated Gaussian random field. 
Although they were not the most computationally efficient models, their simplicity in parametrization, execution time and predictive performance make them a valid alternative for spatial modelling under Bayesian inference."}, "https://arxiv.org/abs/2404.12882": {"title": "The modified conditional sum-of-squares estimator for fractionally integrated models", "link": "https://arxiv.org/abs/2404.12882", "description": "arXiv:2404.12882v1 Announce Type: new \nAbstract: In this paper, we analyse the influence of estimating a constant term on the bias of the conditional sum-of-squares (CSS) estimator in a stationary or non-stationary type-II ARFIMA ($p_1$,$d$,$p_2$) model. We derive expressions for the estimator's bias and show that the leading term can be easily removed by a simple modification of the CSS objective function. We call this new estimator the modified conditional sum-of-squares (MCSS) estimator. We show theoretically and by means of Monte Carlo simulations that its performance relative to that of the CSS estimator is markedly improved even for small sample sizes. Finally, we revisit three classical short datasets that have in the past been described by ARFIMA($p_1$,$d$,$p_2$) models with constant term, namely the post-second World War real GNP data, the extended Nelson-Plosser data, and the Nile data."}, "https://arxiv.org/abs/2404.12997": {"title": "On the Asymmetric Volatility Connectedness", "link": "https://arxiv.org/abs/2404.12997", "description": "arXiv:2404.12997v1 Announce Type: new \nAbstract: Connectedness measures the degree to which a time-series variable spills volatility over to other variables, compared to the rate at which it receives volatility from them. The idea is based on the percentage of variance decomposition from one variable to the others, which is estimated by making use of a VAR model. Diebold and Yilmaz (2012, 2014) suggested estimating this simple and useful measure of percentage risk spillover impact. Their method is symmetric by nature, however. The current paper offers an alternative asymmetric approach for measuring the volatility spillover direction, which is based on estimating the asymmetric variance decompositions introduced by Hatemi-J (2011, 2014). This approach accounts explicitly for the asymmetric property in the estimations, which accords better with reality. An application is provided to capture the potential asymmetric volatility spillover impacts between the three largest financial markets in the world."}, "https://arxiv.org/abs/2404.12613": {"title": "A Fourier Approach to the Parameter Estimation Problem for One-dimensional Gaussian Mixture Models", "link": "https://arxiv.org/abs/2404.12613", "description": "arXiv:2404.12613v1 Announce Type: cross \nAbstract: The purpose of this paper is twofold. First, we propose a novel algorithm for estimating parameters in one-dimensional Gaussian mixture models (GMMs). The algorithm takes advantage of the Hankel structure inherent in the Fourier data obtained from independent and identically distributed (i.i.d) samples of the mixture. For GMMs with a unified variance, a singular value ratio functional using the Fourier data is introduced and used to resolve the variance and component number simultaneously. The consistency of the estimator is derived. Compared to classic algorithms such as the method of moments and the maximum likelihood method, the proposed algorithm does not require prior knowledge of the number of Gaussian components or good initial guesses. 
Numerical experiments demonstrate its superior performance in estimation accuracy and computational cost. Second, we reveal that there exists a fundamental limit to the problem of estimating the number of Gaussian components or model order in the mixture model if the number of i.i.d samples is finite. For the case of a single variance, we show that the model order can be successfully estimated only if the minimum separation distance between the component means exceeds a certain threshold value and can fail if below. We derive a lower bound for this threshold value, referred to as the computational resolution limit, in terms of the number of i.i.d samples, the variance, and the number of Gaussian components. Numerical experiments confirm this phase transition phenomenon in estimating the model order. Moreover, we demonstrate that our algorithm achieves better scores in likelihood, AIC, and BIC when compared to the EM algorithm."}, "https://arxiv.org/abs/2404.12862": {"title": "A Guide to Feature Importance Methods for Scientific Inference", "link": "https://arxiv.org/abs/2404.12862", "description": "arXiv:2404.12862v1 Announce Type: cross \nAbstract: While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide, due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models."}, "https://arxiv.org/abs/1711.08265": {"title": "Sparse Variable Selection on High Dimensional Heterogeneous Data with Tree Structured Responses", "link": "https://arxiv.org/abs/1711.08265", "description": "arXiv:1711.08265v2 Announce Type: replace \nAbstract: We consider the problem of sparse variable selection on high-dimensional heterogeneous data sets, which has been taking on renewed interest recently due to the growth of biological and medical data sets with complex, non-i.i.d. structures and huge quantities of response variables. The heterogeneity is likely to confound the association between explanatory variables and responses, resulting in enormous false discoveries when Lasso or its variants are na\\\"ively applied. Therefore, developing effective confounder correction methods has become a topic of growing interest among researchers. However, directly employing recent confounder correction methods can result in undesirable performance because they ignore the convoluted interdependency among response variables. 
To fully improve current variable selection methods, we introduce a model, the tree-guided sparse linear mixed model, that can utilize the dependency information from multiple responses to explore how the responses cluster and to select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that our proposed model outperforms the existing methods and achieves the highest ROC area."}, "https://arxiv.org/abs/2101.01157": {"title": "A tutorial on spatiotemporal partially observed Markov process models via the R package spatPomp", "link": "https://arxiv.org/abs/2101.01157", "description": "arXiv:2101.01157v4 Announce Type: replace \nAbstract: We describe a computational framework for modeling and statistical inference on high-dimensional stochastic dynamic systems. Our primary motivation is the investigation of metapopulation dynamics arising from a collection of spatially distributed, interacting biological populations. To make progress on this goal, we embed it in a more general problem: inference for a collection of interacting partially observed nonlinear non-Gaussian stochastic processes. Each process in the collection is called a unit; in the case of spatiotemporal models, the units correspond to distinct spatial locations. The dynamic state for each unit may be discrete or continuous, scalar or vector valued. In metapopulation applications, the state can represent a structured population or the abundances of a collection of species at a single location. We consider models where the collection of states has a Markov property. A sequence of noisy measurements is made on each unit, resulting in a collection of time series. A model of this form is called a spatiotemporal partially observed Markov process (SpatPOMP). The R package spatPomp provides an environment for implementing SpatPOMP models, analyzing data using existing methods, and developing new inference approaches. Our presentation of spatPomp reviews various methodologies in a unifying notational framework. We demonstrate the package on a simple Gaussian system and on a nontrivial epidemiological model for measles transmission within and between cities. We show how to construct user-specified SpatPOMP models within spatPomp."}, "https://arxiv.org/abs/2110.15517": {"title": "CP Factor Model for Dynamic Tensors", "link": "https://arxiv.org/abs/2110.15517", "description": "arXiv:2110.15517v2 Announce Type: replace \nAbstract: Observations in various applications are frequently represented as a time series of multidimensional arrays, called tensor time series, preserving the inherent multidimensional structure. In this paper, we present a factor model approach, in a form similar to tensor CP decomposition, to the analysis of high-dimensional dynamic tensor time series. As the loading vectors are uniquely defined but not necessarily orthogonal, it is significantly different from the existing tensor factor models based on Tucker-type tensor decomposition. The model structure allows for a set of uncorrelated one-dimensional latent dynamic factor processes, making it much more convenient to study the underlying dynamics of the time series. A new high order projection estimator is proposed for such a factor model, utilizing the special structure and the idea of the higher order orthogonal iteration procedures commonly used in Tucker-type tensor factor model and general tensor CP decomposition procedures. 
Theoretical investigation provides statistical error bounds for the proposed methods, which shows the significant advantage of utilizing the special model structure. Simulation study is conducted to further demonstrate the finite sample properties of the estimators. Real data application is used to illustrate the model and its interpretations."}, "https://arxiv.org/abs/2303.06434": {"title": "Direct Bayesian Regression for Distribution-valued Covariates", "link": "https://arxiv.org/abs/2303.06434", "description": "arXiv:2303.06434v2 Announce Type: replace \nAbstract: In this manuscript, we study the problem of scalar-on-distribution regression; that is, instances where subject-specific distributions or densities, or in practice, repeated measures from those distributions, are the covariates related to a scalar outcome via a regression model. We propose a direct regression for such distribution-valued covariates that circumvents estimating subject-specific densities and directly uses the observed repeated measures as covariates. The model is invariant to any transformation or ordering of the repeated measures. Endowing the regression function with a Gaussian Process prior, we obtain closed form or conjugate Bayesian inference. Our method subsumes the standard Bayesian non-parametric regression using Gaussian Processes as a special case. Theoretically, we show that the method can achieve an optimal estimation error bound. To our knowledge, this is the first theoretical study on Bayesian regression using distribution-valued covariates. Through simulation studies and analysis of activity count dataset, we demonstrate that our method performs better than approaches that require an intermediate density estimation step."}, "https://arxiv.org/abs/2311.05883": {"title": "Time-Varying Identification of Monetary Policy Shocks", "link": "https://arxiv.org/abs/2311.05883", "description": "arXiv:2311.05883v3 Announce Type: replace \nAbstract: We propose a new Bayesian heteroskedastic Markov-switching structural vector autoregression with data-driven time-varying identification. The model selects alternative exclusion restrictions over time and, as a condition for the search, allows to verify identification through heteroskedasticity within each regime. Based on four alternative monetary policy rules, we show that a monthly six-variable system supports time variation in US monetary policy shock identification. In the sample-dominating first regime, systematic monetary policy follows a Taylor rule extended by the term spread, effectively curbing inflation. In the second regime, occurring after 2000 and gaining more persistence after the global financial and COVID crises, it is characterized by a money-augmented Taylor rule. This regime's unconventional monetary policy provides economic stimulus, features the liquidity effect, and is complemented by a pure term spread shock. Absent the specific monetary policy of the second regime, inflation would be over one percentage point higher on average after 2008."}, "https://arxiv.org/abs/2404.13177": {"title": "A Bayesian Hybrid Design with Borrowing from Historical Study", "link": "https://arxiv.org/abs/2404.13177", "description": "arXiv:2404.13177v1 Announce Type: new \nAbstract: In early phase drug development of combination therapy, the primary objective is to preliminarily assess whether there is additive activity when a novel agent combined with an established monotherapy. 
Due to potential feasibility issues with a large randomized study, uncontrolled single-arm trials have been the mainstream approach in cancer clinical trials. However, such trials often present significant challenges in deciding whether to proceed to the next phase of development. A hybrid design, leveraging data from a completed historical clinical study of the monotherapy, offers a valuable option to enhance study efficiency and improve informed decision-making. Compared to traditional single-arm designs, the hybrid design may significantly enhance power by borrowing external information, enabling a more robust assessment of activity. The primary challenge of a hybrid design lies in handling the information borrowing. We introduce a Bayesian dynamic power prior (DPP) framework with three components controlling the amount of dynamic borrowing. The framework offers flexible study design options with explicit interpretation of borrowing, allowing customization according to specific needs. Furthermore, the posterior distribution in the proposed framework has a closed form, offering significant advantages in computational efficiency. The proposed framework's utility is demonstrated through simulations and a case study."}, "https://arxiv.org/abs/2404.13233": {"title": "On a Notion of Graph Centrality Based on L1 Data Depth", "link": "https://arxiv.org/abs/2404.13233", "description": "arXiv:2404.13233v1 Announce Type: new \nAbstract: A new measure to assess the centrality of vertices in an undirected and connected graph is proposed. The proposed measure, L1 centrality, can adequately handle graphs with weights assigned to vertices and edges. The study provides tools for graphical and multiscale analysis based on the L1 centrality. Specifically, the suggested analysis tools include the target plot, L1 centrality-based neighborhood, local L1 centrality, multiscale edge representation, and heterogeneity plot and index. Most importantly, our work is closely associated with the concept of data depth for multivariate data, which allows for a wide range of practical applications of the proposed measure. Throughout the paper, we demonstrate our tools with two interesting examples: the Marvel Cinematic Universe movie network and the bill cosponsorship network of the 21st National Assembly of South Korea."}, "https://arxiv.org/abs/2404.13284": {"title": "Impact of methodological assumptions and covariates on the cutoff estimation in ROC analysis", "link": "https://arxiv.org/abs/2404.13284", "description": "arXiv:2404.13284v1 Announce Type: new \nAbstract: The Receiver Operating Characteristic (ROC) curve stands as a cornerstone in assessing the efficacy of biomarkers for disease diagnosis. Beyond merely evaluating performance, it provides an optimal cutoff for biomarker values, crucial for disease categorization. While diverse methodologies exist for threshold estimation, less attention has been paid to integrating covariate impact into this process. Covariates can strongly impact diagnostic summaries, leading to variations across different covariate levels. Therefore, a tailored covariate-based framework is imperative for outlining covariate-specific optimal cutoffs. Moreover, recent investigations into cutoff estimators have overlooked the influence of ROC curve estimation methodologies. This study endeavors to bridge this research gap. 
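To make the power-prior borrowing idea in the hybrid-design abstract concrete, here is a minimal static power prior for a binary endpoint with a conjugate Beta update. The discount weight and all counts are hypothetical, and this is deliberately simpler than the paper's three-component dynamic power prior, which calibrates the borrowing weight from the data.

```python
# Generic (static) power prior for a binary endpoint -- a simplified relative of the
# dynamic power prior discussed above; all numbers below are hypothetical.
from scipy import stats

a0, b0 = 1.0, 1.0          # vague Beta prior
y_hist, n_hist = 36, 100   # historical monotherapy responders / patients (assumed)
y_cur, n_cur = 22, 50      # current-trial responders / patients (assumed)
delta = 0.5                # discount weight on the historical likelihood (0 = no borrowing)

# Conjugate update: historical data enter the posterior raised to the power delta.
a_post = a0 + delta * y_hist + y_cur
b_post = b0 + delta * (n_hist - y_hist) + (n_cur - y_cur)
post = stats.beta(a_post, b_post)
print("posterior mean:", round(post.mean(), 3),
      " 95% CrI:", [round(q, 3) for q in post.interval(0.95)])
```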
Extensive simulation studies are conducted to scrutinize the performance of ROC curve estimation models in estimating different cutoffs in varying scenarios, encompassing diverse data-generating mechanisms and covariate effects. Additionally, leveraging the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the research assesses the performance of different biomarkers in diagnosing Alzheimer's disease and determines the suitable optimal cutoffs."}, "https://arxiv.org/abs/2404.13339": {"title": "Group COMBSS: Group Selection via Continuous Optimization", "link": "https://arxiv.org/abs/2404.13339", "description": "arXiv:2404.13339v1 Announce Type: new \nAbstract: We present a new optimization method for the group selection problem in linear regression. In this problem, predictors are assumed to have a natural group structure and the goal is to select a small set of groups that best fits the response. The incorporation of group structure in a predictor matrix is a key factor in obtaining better estimators and identifying associations between response and predictors. Such a discrete constrained problem is well-known to be hard, particularly in high-dimensional settings where the number of predictors is much larger than the number of observations. We propose to tackle this problem by framing the underlying discrete binary constrained problem into an unconstrained continuous optimization problem. The performance of our proposed approach is compared to state-of-the-art variable selection strategies on simulated data sets. We illustrate the effectiveness of our approach on a genetic dataset to identify grouping of markers across chromosomes."}, "https://arxiv.org/abs/2404.13356": {"title": "How do applied researchers use the Causal Forest? A methodological review of a method", "link": "https://arxiv.org/abs/2404.13356", "description": "arXiv:2404.13356v1 Announce Type: new \nAbstract: This paper conducts a methodological review of papers using the causal forest machine learning method for flexibly estimating heterogeneous treatment effects. It examines 133 peer-reviewed papers. It shows that the emerging best practice relies heavily on the approach and tools created by the original authors of the causal forest such as their grf package and the approaches given by them in examples. Generally researchers use the causal forest on a relatively low-dimensional dataset relying on randomisation or observed controls to identify effects. There are several common ways to then communicate results -- by mapping out the univariate distribution of individual-level treatment effect estimates, displaying variable importance results for the forest and graphing the distribution of treatment effects across covariates that are important either for theoretical reasons or because they have high variable importance. Some deviations from this common practice are interesting and deserve further development and use. Others are unnecessary or even harmful."}, "https://arxiv.org/abs/2404.13366": {"title": "Prior Effective Sample Size When Borrowing on the Treatment Effect Scale", "link": "https://arxiv.org/abs/2404.13366", "description": "arXiv:2404.13366v1 Announce Type: new \nAbstract: With the robust uptick in the applications of Bayesian external data borrowing, eliciting a prior distribution with the proper amount of information becomes increasingly critical. The prior effective sample size (ESS) is an intuitive and efficient measure for this purpose. 
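Related to the cutoff-estimation abstract above, the following sketch computes one common optimal cutoff, the Youden index, from an empirical ROC curve on simulated biomarker data. It ignores covariates and the comparison of ROC estimation models that the paper focuses on.

```python
# Illustrative Youden-index cutoff from an empirical ROC curve (covariates ignored);
# the biomarker distributions below are simulated, not the ADNI data.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
healthy = rng.normal(0.0, 1.0, 500)
diseased = rng.normal(1.2, 1.0, 500)
y = np.r_[np.zeros(500), np.ones(500)]
score = np.r_[healthy, diseased]

fpr, tpr, thresholds = roc_curve(y, score)
cutoff = thresholds[np.argmax(tpr - fpr)]   # maximise sensitivity + specificity - 1
print("estimated optimal cutoff:", round(float(cutoff), 3))
```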
The majority of ESS definitions have been proposed in the context of borrowing control information. While many Bayesian models can be naturally extended to leveraging external information on the treatment effect scale, very little attention has been directed to computing the prior ESS in this setting. In this research, we bridge this methodological gap by extending the popular ELIR ESS definition. We lay out the general framework, and derive the prior ESS for various types of endpoints and treatment effect measures. The posterior distribution and the predictive consistency property of ESS are also examined. The methods are implemented in R programs available on GitHub: https://github.com/squallteo/TrtEffESS."}, "https://arxiv.org/abs/2404.13442": {"title": "Difference-in-Differences under Bipartite Network Interference: A Framework for Quasi-Experimental Assessment of the Effects of Environmental Policies on Health", "link": "https://arxiv.org/abs/2404.13442", "description": "arXiv:2404.13442v1 Announce Type: new \nAbstract: Pollution from coal-fired power plants has been linked to substantial health and mortality burdens in the US. In recent decades, federal regulatory policies have spurred efforts to curb emissions through various actions, such as the installation of emissions control technologies on power plants. However, assessing the health impacts of these measures, particularly over longer periods of time, is complicated by several factors. First, the units that potentially receive the intervention (power plants) are disjoint from those on which outcomes are measured (communities), and second, pollution emitted from power plants disperses and affects geographically far-reaching areas. This creates a methodological challenge known as bipartite network interference (BNI). To our knowledge, no methods have been developed for conducting quasi-experimental studies with panel data in the BNI setting. In this study, motivated by the need for robust estimates of the total health impacts of power plant emissions control technologies in recent decades, we introduce a novel causal inference framework for difference-in-differences analysis under BNI with staggered treatment adoption. We explain the unique methodological challenges that arise in this setting and propose a solution via a data reconfiguration and mapping strategy. The proposed approach is advantageous because analysis is conducted at the intervention unit level, avoiding the need to arbitrarily define treatment status at the outcome unit level, but it permits interpretation of results at the more policy-relevant outcome unit level. Using this interference-aware approach, we investigate the impacts of installation of flue gas desulfurization scrubbers on coal-fired power plants on coronary heart disease hospitalizations among older Americans over the period 2003-2014, finding an overall beneficial effect in mitigating such disease outcomes."}, "https://arxiv.org/abs/2404.13487": {"title": "Generating Synthetic Rainfall Fields by R-vine Copulas Applied to Seamless Probabilistic Predictions", "link": "https://arxiv.org/abs/2404.13487", "description": "arXiv:2404.13487v1 Announce Type: new \nAbstract: Many post-processing methods improve forecasts at individual locations but remove their correlation structure, which is crucial for predicting larger-scale events like total precipitation amount over areas such as river catchments that are relevant for weather warnings and flood predictions. 
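As a back-of-the-envelope companion to the prior-ESS abstract, the snippet below applies the classical variance-ratio heuristic for a normal prior on a treatment effect. The numbers are made up, and this heuristic is not the ELIR-based definition extended in the paper; it is only meant to convey what "prior information worth m subjects" means.

```python
# Heuristic prior ESS for a normal prior on a treatment effect (not the ELIR method):
# if one subject contributes sampling variance sigma^2 on the effect scale and the
# prior is N(mu0, tau^2), the prior carries roughly sigma^2 / tau^2 subjects of information.
sigma = 2.0   # assumed per-subject SD on the treatment-effect scale (hypothetical)
tau = 0.5     # prior SD for the treatment effect (hypothetical)
ess = sigma**2 / tau**2
print("approximate prior effective sample size:", ess)   # -> 16.0
```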
We propose a method to reintroduce spatial correlation into a post-processed forecast using an R-vine copula fitted to historical observations. This method works similarly to related approaches like the Schaake shuffle and ensemble copula coupling, i.e., by rearranging predictions at individual locations and reintroducing spatial correlation while maintaining the post-processed marginal distribution. Here, the copula measures how well an arrangement compares with the historical distribution of precipitation. No close relationship is needed between the post-processed marginal distributions and the spatial correlation source. This is an advantage compared to Schaake shuffle and ensemble copula coupling, which rely on a ranking with no ties at each considered location in their source for spatial correlations. However, weather variables such as the precipitation amount, whose distribution has an atom at zero, have rankings with ties. To evaluate the proposed method, it is applied to a precipitation forecast produced by a combination model with two input forecasts that deliver calibrated marginal distributions but without spatial correlations. The obtained results indicate that the calibration of the combination model carries over to the output of the proposed model, i.e., the evaluation of area predictions shows a similar improvement in forecast quality as the predictions for individual locations. Additionally, the spatial correlation of the forecast is evaluated with the help of object-based metrics, for which the proposed model also shows an improvement compared to both input forecasts."}, "https://arxiv.org/abs/2404.13589": {"title": "The quantile-based classifier with variable-wise parameters", "link": "https://arxiv.org/abs/2404.13589", "description": "arXiv:2404.13589v1 Announce Type: new \nAbstract: Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters."}, "https://arxiv.org/abs/2404.13707": {"title": "Robust inference for the unification of confidence intervals in meta-analysis", "link": "https://arxiv.org/abs/2404.13707", "description": "arXiv:2404.13707v1 Announce Type: new \nAbstract: Traditional meta-analysis assumes that the effect sizes estimated in individual studies follow a Gaussian distribution. However, this distributional assumption is not always satisfied in practice, leading to potentially biased results. In the situation when the number of studies, denoted as K, is large, the cumulative Gaussian approximation errors from each study could make the final estimation unreliable. In the situation when K is small, it is not realistic to assume the random-effect follows Gaussian distribution. In this paper, we present a novel empirical likelihood method for combining confidence intervals under the meta-analysis framework. 
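For context on the reordering family of methods mentioned in the rainfall abstract, here is a bare-bones Schaake-shuffle-style rearrangement in NumPy: calibrated marginal samples at each location are reordered to follow the rank structure of a historical template. The R-vine copula step and the treatment of ties at zero precipitation are not reproduced, and all fields are simulated.

```python
# Bare-bones Schaake-shuffle-style reordering (illustrative only): impose the rank
# structure of correlated historical fields on independently generated forecast members.
import numpy as np

rng = np.random.default_rng(3)
n_members, n_locations = 20, 5

common = rng.normal(size=(n_members, 1))                                # shared weather driver
historical = np.exp(common + 0.4 * rng.normal(size=(n_members, n_locations)))  # correlated template
forecast = rng.gamma(2.0, 3.0, size=(n_members, n_locations))           # calibrated, uncorrelated members

shuffled = np.empty_like(forecast)
for j in range(n_locations):
    ranks = historical[:, j].argsort().argsort()       # rank of each historical member at location j
    shuffled[:, j] = np.sort(forecast[:, j])[ranks]    # member gets the forecast value of that rank
# Columns keep their marginal distributions but inherit the historical rank correlation.
print(np.corrcoef(shuffled, rowvar=False).round(2))
```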
This method is free of the Gaussian assumption in effect size estimates from individual studies and from the random-effects. We establish the large-sample properties of the non-parametric estimator, and introduce a criterion governing the relationship between the number of studies, K, and the sample size of each study, n_i. Our methodology supersedes conventional meta-analysis techniques in both theoretical robustness and computational efficiency. We assess the performance of our proposed methods using simulation studies, and apply our proposed methods to two examples."}, "https://arxiv.org/abs/2404.13735": {"title": "Identification and Estimation of Nonseparable Triangular Equations with Mismeasured Instruments", "link": "https://arxiv.org/abs/2404.13735", "description": "arXiv:2404.13735v1 Announce Type: new \nAbstract: In this paper, I study the nonparametric identification and estimation of the marginal effect of an endogenous variable $X$ on the outcome variable $Y$, given a potentially mismeasured instrument variable $W^*$, without assuming linearity or separability of the functions governing the relationship between observables and unobservables. To address the challenges arising from the co-existence of measurement error and nonseparability, I first employ the deconvolution technique from the measurement error literature to identify the joint distribution of $Y, X, W^*$ using two error-laden measurements of $W^*$. I then recover the structural derivative of the function of interest and the \"Local Average Response\" (LAR) from the joint distribution via the \"unobserved instrument\" approach in Matzkin (2016). I also propose nonparametric estimators for these parameters and derive their uniform rates of convergence. Monte Carlo exercises show evidence that the estimators I propose have good finite sample performance."}, "https://arxiv.org/abs/2404.13753": {"title": "A nonstandard application of cross-validation to estimate density functionals", "link": "https://arxiv.org/abs/2404.13753", "description": "arXiv:2404.13753v1 Announce Type: new \nAbstract: Cross-validation is usually employed to evaluate the performance of a given statistical methodology. When such a methodology depends on a number of tuning parameters, cross-validation proves to be helpful to select the parameters that optimize the estimated performance. In this paper, however, a very different and nonstandard use of cross-validation is investigated. Instead of focusing on the cross-validated parameters, the main interest is switched to the estimated value of the error criterion at optimal performance. It is shown that this approach is able to provide consistent and efficient estimates of some density functionals, with the noteworthy feature that these estimates do not rely on the choice of any further tuning parameter, so that, in that sense, they can be considered to be purely empirical. Here, a base case of application of this new paradigm is developed in full detail, while many other possible extensions are hinted as well."}, "https://arxiv.org/abs/2404.13825": {"title": "Change-point analysis for binomial autoregressive model with application to price stability counts", "link": "https://arxiv.org/abs/2404.13825", "description": "arXiv:2404.13825v1 Announce Type: new \nAbstract: The first-order binomial autoregressive (BAR(1)) model is the most frequently used tool to analyze the bounded count time series. 
The BAR(1) model is stationary and assumes the process parameters remain constant over time, which may be incompatible with non-stationary real data that exhibit a piecewise stationary structure. To better analyze the non-stationary bounded count time series, this article introduces the BAR(1) process with multiple change-points, which contains the BAR(1) model as a special case. Our primary goals are not only to detect the change-points, but also to estimate their number and locations. For this, the cumulative sum (CUSUM) test and minimum description length (MDL) principle are employed to deal with the testing and estimation problems. The proposed approaches are also applied to the analysis of the Harmonised Index of Consumer Prices of the European Union."}, "https://arxiv.org/abs/2404.13834": {"title": "Inference for multiple change-points in generalized integer-valued autoregressive model", "link": "https://arxiv.org/abs/2404.13834", "description": "arXiv:2404.13834v1 Announce Type: new \nAbstract: In this paper, we propose a computationally valid and theoretically justified method, the likelihood ratio scan method (LRSM), for estimating multiple change-points in a piecewise stationary generalized conditional integer-valued autoregressive process. LRSM with the usual window parameter $h$ is better suited to long time series with few, evenly spaced change-points, whereas LRSM with the multiple window parameter $h_{mix}$ performs well in short time series with many, densely located change-points. The LRSM can be computed efficiently, with computational complexity of order $O((\\log n)^3 n)$. Moreover, two bootstrap procedures, namely parametric and block bootstrap, are developed for constructing confidence intervals (CIs) for each of the change-points. Simulation experiments and real data analysis show that the LRSM and bootstrap procedures have excellent performance and are consistent with the theoretical analysis."}, "https://arxiv.org/abs/2404.13836": {"title": "MultiFun-DAG: Multivariate Functional Directed Acyclic Graph", "link": "https://arxiv.org/abs/2404.13836", "description": "arXiv:2404.13836v1 Announce Type: new \nAbstract: Directed Acyclic Graphical (DAG) models efficiently formulate causal relationships in complex systems. Traditional DAGs assume nodes to be scalar variables, characterizing complex systems in an oversimplified form. This paper considers that nodes can be multivariate functional data and thus proposes a multivariate functional DAG (MultiFun-DAG). It constructs a hidden bilinear multivariate function-to-function regression to describe the causal relationships between different nodes. Then an Expectation-Maximization algorithm is used to learn the graph structure as a score-based method with acyclicity constraints. Theoretical properties are derived. Numerical studies and a case study from urban traffic congestion analysis are conducted to show MultiFun-DAG's effectiveness."}, "https://arxiv.org/abs/2404.13939": {"title": "Unlocking Insights: Enhanced Analysis of Covariance in General Factorial Designs through Multiple Contrast Tests under Variance Heteroscedasticity", "link": "https://arxiv.org/abs/2404.13939", "description": "arXiv:2404.13939v1 Announce Type: new \nAbstract: A common goal in clinical trials is to conduct tests on estimated treatment effects adjusted for covariates such as age or sex. 
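To make the CUSUM step of the change-point abstracts tangible, here is a generic mean-shift CUSUM statistic on a simulated count series. It is not the paper's statistic for BAR(1) parameters, and the MDL model-selection step is omitted; data and shift location are hypothetical.

```python
# Generic CUSUM statistic for a single mean shift in a simulated count series
# (illustration only; the change-point papers above target BAR(1)/INGARCH parameters).
import numpy as np

rng = np.random.default_rng(4)
x = np.r_[rng.binomial(20, 0.3, 150), rng.binomial(20, 0.5, 150)].astype(float)

n = len(x)
k = np.arange(1, n)                     # candidate change locations
partial = np.cumsum(x)[:-1]             # partial sums S_k, k = 1..n-1
total = x.sum()
cusum = np.abs(partial - k / n * total) / (x.std(ddof=1) * np.sqrt(n))
khat = int(k[np.argmax(cusum)])
print("max standardized CUSUM:", round(cusum.max(), 2), " estimated change-point:", khat)
```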
Analysis of Covariance (ANCOVA) is often used in these scenarios to test the global null hypothesis of no treatment effect using an $F$-test. However, in several samples, the $F$-test does not provide any information about individual null hypotheses and has strict assumptions such as variance homoscedasticity. We extend the method proposed by Konietschke et al. (2021) to a multiple contrast test procedure (MCTP), which allows us to test arbitrary linear hypotheses and provides information about the global as well as the individual null hypotheses. Further, we can calculate compatible simultaneous confidence intervals for the individual effects. We derive a small sample size approximation of the distribution of the test statistic via a multivariate t-distribution. As an alternative, we introduce a Wild-bootstrap method. Extensive simulations show that our methods are applicable even when sample sizes are small. Their application is further illustrated within a real data example."}, "https://arxiv.org/abs/2404.13986": {"title": "Stochastic Volatility in Mean: Efficient Analysis by a Generalized Mixture Sampler", "link": "https://arxiv.org/abs/2404.13986", "description": "arXiv:2404.13986v1 Announce Type: new \nAbstract: In this paper we consider the simulation-based Bayesian analysis of stochastic volatility in mean (SVM) models. Extending the highly efficient Markov chain Monte Carlo mixture sampler for the SV model proposed in Kim et al. (1998) and Omori et al. (2007), we develop an accurate approximation of the non-central chi-squared distribution as a mixture of thirty normal distributions. Under this mixture representation, we sample the parameters and latent volatilities in one block. We also detail a correction of the small approximation error by using additional Metropolis-Hastings steps. The proposed method is extended to the SVM model with leverage. The methodology and models are applied to excess holding yields in empirical studies, and the SVM model with leverage is shown to outperform competing volatility models based on marginal likelihoods."}, "https://arxiv.org/abs/2404.14124": {"title": "Gaussian distributional structural equation models: A framework for modeling latent heteroscedasticity", "link": "https://arxiv.org/abs/2404.14124", "description": "arXiv:2404.14124v1 Announce Type: new \nAbstract: Accounting for the complexity of psychological theories requires methods that can predict not only changes in the means of latent variables -- such as personality factors, creativity, or intelligence -- but also changes in their variances. Structural equation modeling (SEM) is the framework of choice for analyzing complex relationships among latent variables, but current methods do not allow modeling latent variances as a function of other latent variables. In this paper, we develop a Bayesian framework for Gaussian distributional SEM which overcomes this limitation. We validate our framework using extensive simulations, which demonstrate that the new models produce reliable statistical inference and can be computed with sufficient efficiency for practical everyday use. 
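As a crude illustration of many-to-one contrasts under unequal variances, the sketch below builds Welch-type standard errors with a Bonferroni adjustment for simultaneous intervals. The MCTP in the ANCOVA abstract instead calibrates with a multivariate-t approximation or a wild bootstrap, which is less conservative; all data here are simulated and no covariate adjustment is shown.

```python
# Crude many-to-one contrasts with Welch-type SEs and Bonferroni-adjusted intervals
# (the MCTP above uses a multivariate-t / wild-bootstrap calibration instead).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = {"control": rng.normal(0.0, 1.0, 30),
          "dose A": rng.normal(0.4, 1.5, 25),     # note unequal variances
          "dose B": rng.normal(0.8, 0.7, 40)}

ctrl = groups["control"]
m = len(groups) - 1                               # number of contrasts
for name in ("dose A", "dose B"):
    g = groups[name]
    diff = g.mean() - ctrl.mean()
    se = np.sqrt(g.var(ddof=1) / len(g) + ctrl.var(ddof=1) / len(ctrl))
    # Welch-Satterthwaite degrees of freedom for this contrast
    df = se**4 / ((g.var(ddof=1) / len(g))**2 / (len(g) - 1)
                  + (ctrl.var(ddof=1) / len(ctrl))**2 / (len(ctrl) - 1))
    crit = stats.t.ppf(1 - 0.05 / (2 * m), df)    # Bonferroni-adjusted critical value
    print(f"{name} - control: {diff:5.2f}  [{diff - crit*se:5.2f}, {diff + crit*se:5.2f}]")
```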
We illustrate our framework's applicability in a real-world case study that addresses a substantive hypothesis from personality psychology."}, "https://arxiv.org/abs/2404.14149": {"title": "A Linear Relationship between Correlation and Cohen's Kappa for Binary Data and Simulating Multivariate Nominal and Ordinal Data with Specified Kappa Matrix", "link": "https://arxiv.org/abs/2404.14149", "description": "arXiv:2404.14149v1 Announce Type: new \nAbstract: Cohen's kappa is a useful measure of agreement between judges, inter-rater reliability, and also goodness of fit in classification problems. For binary nominal and ordinal data, kappa and correlation are equally applicable. We have found a linear relationship between correlation and kappa for binary data. Exact bounds of kappa are important because kappa can be only 0.5 even when there is very strong agreement. The exact upper bound was developed by Cohen (1960), but the exact lower bound is also important if the range of kappa is small for some marginals. We have developed an algorithm to find the exact lower bound given marginal proportions. Our final contribution is a method to generate multivariate nominal and ordinal data with a specified kappa matrix based on the rearrangement of independently generated marginal data into a multidimensional contingency table, where cell counts are found by solving a system of linear equations for positive roots."}, "https://arxiv.org/abs/2404.14213": {"title": "An Exposure Model Framework for Signal Detection based on Electronic Healthcare Data", "link": "https://arxiv.org/abs/2404.14213", "description": "arXiv:2404.14213v1 Announce Type: new \nAbstract: Despite extensive safety assessments of drugs prior to their introduction to the market, certain adverse drug reactions (ADRs) remain undetected. The primary objective of pharmacovigilance is to identify these ADRs (i.e., signals). In addition to traditional spontaneous reporting systems (SRSs), electronic health (EHC) data is being used for signal detection as well. Unlike SRS, EHC data is longitudinal and thus requires assumptions about the patient's drug exposure history and its impact on ADR occurrences over time, assumptions that many current methods make only implicitly.\n We propose an exposure model framework that explicitly models the longitudinal relationship between the drug and the ADR. By considering multiple such models simultaneously, we can detect signals that might be missed by other approaches. The parameters of these models are estimated using maximum likelihood, and the Bayesian Information Criterion (BIC) is employed to select the most suitable model. Since BIC is connected to the posterior distribution, it serves the dual purpose of identifying the best-fitting model and determining the presence of a signal by evaluating the posterior probability of the null model.\n We evaluate the effectiveness of this framework through a simulation study, for which we develop an EHC data simulator. Additionally, we conduct a case study applying our approach to four drug-ADR pairs using an EHC dataset comprising over 1.2 million insured individuals. 
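A quick numerical companion to the kappa abstract: compute Cohen's kappa and the Pearson correlation side by side for two simulated binary raters. The exact linear relationship and the bound results are derived in the paper; the sketch only evaluates the two quantities on made-up ratings.

```python
# Compute Cohen's kappa and the Pearson correlation side by side for binary raters
# (simulated data; the exact linear relationship between the two is derived in the paper).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(6)
rater1 = rng.binomial(1, 0.4, 500)
flip = rng.binomial(1, 0.15, 500)
rater2 = np.where(flip == 1, 1 - rater1, rater1)   # rater 2 mostly agrees with rater 1

kappa = cohen_kappa_score(rater1, rater2)
corr = np.corrcoef(rater1, rater2)[0, 1]
print("kappa:", round(kappa, 3), " correlation:", round(corr, 3))
```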
Both the method and the EHC data simulator code are publicly accessible as part of the R package https://github.com/bips-hb/expard."}, "https://arxiv.org/abs/2404.14275": {"title": "Maximally informative feature selection using Information Imbalance: Application to COVID-19 severity prediction", "link": "https://arxiv.org/abs/2404.14275", "description": "arXiv:2404.14275v1 Announce Type: new \nAbstract: Clinical databases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features, and automatically select the set of features which is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients. We apply this algorithm to a data set of ~ 1,300 patients treated for COVID-19 in Udine hospital before October 2021. Using this approach, we find combinations of features which, if used in combination, are maximally informative of the clinical fate and of the severity of the disease. The optimal number of features, which is determined automatically, turns out to be between 10 and 15. These features can be measured at admission. The approach can be used also if the features are available only for a fraction of the patients, does not require imputation and, importantly, is able to automatically select features with small inter-feature correlation. Clinical insights deriving from this study are also discussed."}, "https://arxiv.org/abs/2404.13056": {"title": "Variational Bayesian Optimal Experimental Design with Normalizing Flows", "link": "https://arxiv.org/abs/2404.13056", "description": "arXiv:2404.13056v1 Announce Type: cross \nAbstract: Bayesian optimal experimental design (OED) seeks experiments that maximize the expected information gain (EIG) in model parameters. Directly estimating the EIG using nested Monte Carlo is computationally expensive and requires an explicit likelihood. Variational OED (vOED), in contrast, estimates a lower bound of the EIG without likelihood evaluations by approximating the posterior distributions with variational forms, and then tightens the bound by optimizing its variational parameters. We introduce the use of normalizing flows (NFs) for representing variational distributions in vOED; we call this approach vOED-NFs. Specifically, we adopt NFs with a conditional invertible neural network architecture built from compositions of coupling layers, and enhanced with a summary network for data dimension reduction. We present Monte Carlo estimators to the lower bound along with gradient expressions to enable a gradient-based simultaneous optimization of the variational parameters and the design variables. The vOED-NFs algorithm is then validated in two benchmark problems, and demonstrated on a partial differential equation-governed application of cathodic electrophoretic deposition and an implicit likelihood case with stochastic modeling of aphid population. The findings suggest that a composition of 4--5 coupling layers is able to achieve lower EIG estimation bias, under a fixed budget of forward model runs, compared to previous approaches. 
The resulting NFs produce approximate posteriors that agree well with the true posteriors and are able to capture non-Gaussian and multi-modal features effectively."}, "https://arxiv.org/abs/2404.13147": {"title": "Multiclass ROC", "link": "https://arxiv.org/abs/2404.13147", "description": "arXiv:2404.13147v1 Announce Type: cross \nAbstract: Model evaluation is of crucial importance in modern statistical applications. The construction of the ROC curve and the calculation of the AUC have been widely used for binary classification evaluation. Recent research generalizing the ROC/AUC analysis to multi-class classification has problems in at least one of four areas: (1) failure to provide sensible plots, (2) sensitivity to imbalanced data, (3) inability to specify misclassification costs, and (4) inability to quantify evaluation uncertainty. Borrowing from a binomial matrix factorization model, we provide an evaluation metric summarizing the pair-wise multi-class True Positive Rate (TPR) and False Positive Rate (FPR) with a one-dimensional vector representation. Visualization of the representation vector measures the relative speed of increase between TPR and FPR across all class pairs, which in turn provides an ROC plot for the multi-class counterpart. An integration over the factorized vectors provides a binary-AUC-equivalent summary of classifier performance. Misclassification weight specification and bootstrapped confidence intervals are also enabled to accommodate a variety of evaluation criteria. To support our findings, we conducted extensive simulation studies and compared our method to the pair-wise averaged AUC statistics on benchmark datasets."}, "https://arxiv.org/abs/2404.13198": {"title": "An economically-consistent discrete choice model with flexible utility specification based on artificial neural networks", "link": "https://arxiv.org/abs/2404.13198", "description": "arXiv:2404.13198v1 Announce Type: cross \nAbstract: Random utility maximisation (RUM) models are one of the cornerstones of discrete choice modelling. However, specifying the utility function of RUM models is not straightforward and has a considerable impact on the resulting interpretable outcomes and welfare measures. In this paper, we propose a new discrete choice model based on artificial neural networks (ANNs) named \"Alternative-Specific and Shared weights Neural Network (ASS-NN)\", which provides a further balance between flexible utility approximation from the data and consistency with two assumptions: RUM theory and fungibility of money (i.e., \"one euro is one euro\"). Therefore, the ASS-NN can derive economically-consistent outcomes, such as marginal utilities or willingness to pay, without explicitly specifying the utility functional form. Using a Monte Carlo experiment and empirical data from the Swissmetro dataset, we show that ASS-NN outperforms (in terms of goodness of fit) conventional multinomial logit (MNL) models under different utility specifications. Furthermore, we show how the ASS-NN is used to derive marginal utilities and willingness to pay measures."}, "https://arxiv.org/abs/2404.13302": {"title": "Monte Carlo sampling with integrator snippets", "link": "https://arxiv.org/abs/2404.13302", "description": "arXiv:2404.13302v1 Announce Type: cross \nAbstract: Assume interest is in sampling from a probability distribution $\\mu$ defined on $(\\mathsf{Z},\\mathscr{Z})$. 
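For contrast with the factorization-based summary proposed in the Multiclass ROC abstract, the pairwise-averaged (one-vs-one) AUC baseline it is compared against is available in scikit-learn; the three-class problem below is simulated and purely illustrative.

```python
# Pairwise-averaged (one-vs-one) multiclass AUC baseline in scikit-learn -- the kind of
# summary statistic the factorization-based ROC representation above is compared against.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.4, random_state=0)

prob = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)
print("one-vs-one macro AUC:", round(roc_auc_score(yte, prob, multi_class="ovo"), 3))
```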
We develop a framework to construct sampling algorithms taking full advantage of numerical integrators of ODEs, say $\\psi\\colon\\mathsf{Z}\\rightarrow\\mathsf{Z}$ for one integration step, to explore $\\mu$ efficiently and robustly. The popular Hybrid/Hamiltonian Monte Carlo (HMC) algorithm [Duane, 1987], [Neal, 2011] and its derivatives are examples of such a use of numerical integrators. However, we show how the potential of integrators can be exploited beyond current ideas and HMC sampling in order to take into account aspects of the geometry of the target distribution. A key idea is the notion of an integrator snippet, a fragment of the orbit of an ODE numerical integrator $\\psi$, and its associated probability distribution $\\bar{\\mu}$, which takes the form of a mixture of distributions derived from $\\mu$ and $\\psi$. Exploiting properties of mixtures, we show how samples from $\\bar{\\mu}$ can be used to estimate expectations with respect to $\\mu$. We focus here primarily on Sequential Monte Carlo (SMC) algorithms, but the approach can be used in the context of Markov chain Monte Carlo algorithms as discussed at the end of the manuscript. We illustrate the performance of these new algorithms through numerical experimentation and provide preliminary theoretical results supporting the observed performance."}, "https://arxiv.org/abs/2404.13371": {"title": "On Risk-Sensitive Decision Making Under Uncertainty", "link": "https://arxiv.org/abs/2404.13371", "description": "arXiv:2404.13371v1 Announce Type: cross \nAbstract: This paper studies a risk-sensitive decision-making problem under uncertainty. It considers a decision-making process that unfolds over a fixed number of stages, in which a decision-maker chooses among multiple alternatives, some of which are deterministic and others are stochastic. The decision-maker's cumulative value is updated at each stage, reflecting the outcomes of the chosen alternatives. After formulating this as a stochastic control problem, we delineate the necessary optimality conditions for it. Two illustrative examples from optimal betting and inventory management are provided to support our theory."}, "https://arxiv.org/abs/2404.13649": {"title": "Distributional Principal Autoencoders", "link": "https://arxiv.org/abs/2404.13649", "description": "arXiv:2404.13649v1 Announce Type: cross \nAbstract: Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distribution of data given its low-dimensional latent variables. Motivated by this, we propose the Distributional Principal Autoencoder (DPA), which consists of an encoder that maps high-dimensional data to low-dimensional latent variables and a decoder that maps the latent variables back to the data space. For reducing the dimension, the DPA encoder aims to minimise the unexplained variability of the data with an adaptive choice of the latent dimension. For reconstructing data, the DPA decoder aims to match the conditional distribution of all data that are mapped to a certain latent value, thus ensuring that the reconstructed data retain the original data distribution. 
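Background for the integrator-snippet abstract: the orbits in question come from numerical integrators such as the leapfrog scheme used in HMC. The sketch below generates one leapfrog orbit for a hypothetical Gaussian target; the mixture weighting and SMC machinery of the paper are not shown, and the step size and orbit length are arbitrary choices.

```python
# One leapfrog orbit for Hamiltonian dynamics on a 2-D Gaussian target: the kind of
# integrator trajectory whose fragments ("snippets") the sampler above reuses.
import numpy as np

prec = np.array([[1.0, 0.6], [0.6, 2.0]])      # hypothetical target precision matrix

def grad_neg_logpi(z):
    return prec @ z                             # gradient of -log pi for N(0, prec^{-1})

def leapfrog_orbit(z, v, step=0.2, n_steps=15):
    """Return the sequence of positions visited by leapfrog integration."""
    orbit = [z.copy()]
    v = v - 0.5 * step * grad_neg_logpi(z)      # initial half kick
    for i in range(n_steps):
        z = z + step * v                         # drift
        orbit.append(z.copy())
        kick = step if i < n_steps - 1 else 0.5 * step
        v = v - kick * grad_neg_logpi(z)         # kick (half-size at the end)
    return np.array(orbit)

rng = np.random.default_rng(7)
orbit = leapfrog_orbit(rng.normal(size=2), rng.normal(size=2))
print(orbit.round(2))
```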
Our numerical results on climate data, single-cell data, and image benchmarks demonstrate the practical feasibility and success of the approach in reconstructing the original distribution of the data. DPA embeddings are shown to preserve meaningful structures of data such as the seasonal cycle for precipitations and cell types for gene expression."}, "https://arxiv.org/abs/2404.13964": {"title": "An Economic Solution to Copyright Challenges of Generative AI", "link": "https://arxiv.org/abs/2404.13964", "description": "arXiv:2404.13964v1 Announce Type: cross \nAbstract: Generative artificial intelligence (AI) systems are trained on large data corpora to generate new pieces of text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content. The metric for contributions is quantitatively determined by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. This framework enables a platform where AI developers benefit from access to high-quality training data, thus improving model performance. Meanwhile, copyright owners receive fair compensation, driving the continued provision of relevant data for generative model training. Experiments demonstrate that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners."}, "https://arxiv.org/abs/2404.14052": {"title": "Differential contributions of machine learning and statistical analysis to language and cognitive sciences", "link": "https://arxiv.org/abs/2404.14052", "description": "arXiv:2404.14052v1 Announce Type: cross \nAbstract: Data-driven approaches have revolutionized scientific research. Machine learning and statistical analysis are commonly utilized in this type of research. Despite their widespread use, these methodologies differ significantly in their techniques and objectives. Few studies have utilized a consistent dataset to demonstrate these differences within the social sciences, particularly in language and cognitive sciences. This study leverages the Buckeye Speech Corpus to illustrate how both machine learning and statistical analysis are applied in data-driven research to obtain distinct insights. This study significantly enhances our understanding of the diverse approaches employed in data-driven strategies."}, "https://arxiv.org/abs/2404.14136": {"title": "Elicitability and identifiability of tail risk measures", "link": "https://arxiv.org/abs/2404.14136", "description": "arXiv:2404.14136v1 Announce Type: cross \nAbstract: Tail risk measures are fully determined by the distribution of the underlying loss beyond its quantile at a certain level, with Value-at-Risk and Expected Shortfall being prime examples. They are induced by law-based risk measures, called their generators, evaluated on the tail distribution. This paper establishes joint identifiability and elicitability results of tail risk measures together with the corresponding quantile, provided that their generators are identifiable and elicitable, respectively. As an example, we establish the joint identifiability and elicitability of the tail expectile together with the quantile. 
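Since Value-at-Risk and Expected Shortfall are the running examples in the tail-risk abstract, here is their plain empirical estimation on a simulated heavy-tailed loss sample; the elicitability and identifiability theory itself is not illustrated by this, and the loss distribution and level are arbitrary.

```python
# Plain empirical Value-at-Risk and Expected Shortfall of a simulated loss sample
# (the two running examples of tail risk measures above).
import numpy as np

rng = np.random.default_rng(8)
losses = rng.standard_t(df=4, size=100_000)   # heavy-tailed hypothetical losses

alpha = 0.975
var = np.quantile(losses, alpha)              # Value-at-Risk at level alpha
es = losses[losses >= var].mean()             # Expected Shortfall: mean loss beyond VaR
print(f"VaR_{alpha}: {var:.3f}   ES_{alpha}: {es:.3f}")
```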
The corresponding consistent scores constitute a novel class of weighted scores, nesting the known class of scores of Fissler and Ziegel for the Expected Shortfall together with the quantile. For statistical purposes, our results pave the way to easier model fitting for tail risk measures via regression and the generalized method of moments, but also model comparison and model validation in terms of established backtesting procedures."}, "https://arxiv.org/abs/2404.14328": {"title": "Preserving linear invariants in ensemble filtering methods", "link": "https://arxiv.org/abs/2404.14328", "description": "arXiv:2404.14328v1 Announce Type: cross \nAbstract: Formulating dynamical models for physical phenomena is essential for understanding the interplay between the different mechanisms and predicting the evolution of physical states. However, a dynamical model alone is often insufficient to address these fundamental tasks, as it suffers from model errors and uncertainties. One common remedy is to rely on data assimilation, where the state estimate is updated with observations of the true system. Ensemble filters sequentially assimilate observations by updating a set of samples over time. They operate in two steps: a forecast step that propagates each sample through the dynamical model and an analysis step that updates the samples with incoming observations. For accurate and robust predictions of dynamical systems, discrete solutions must preserve their critical invariants. While modern numerical solvers satisfy these invariants, existing invariant-preserving analysis steps are limited to Gaussian settings and are often not compatible with classical regularization techniques of ensemble filters, e.g., inflation and covariance tapering. The present work focuses on preserving linear invariants, such as mass, stoichiometric balance of chemical species, and electrical charges. Using tools from measure transport theory (Spantini et al., 2022, SIAM Review), we introduce a generic class of nonlinear ensemble filters that automatically preserve desired linear invariants in non-Gaussian filtering problems. By specializing this framework to the Gaussian setting, we recover a constrained formulation of the Kalman filter. Then, we show how to combine existing regularization techniques for the ensemble Kalman filter (Evensen, 1994, J. Geophys. Res.) with the preservation of the linear invariants. Finally, we assess the benefits of preserving linear invariants for the ensemble Kalman filter and nonlinear ensemble filters."}, "https://arxiv.org/abs/1908.04218": {"title": "Asymptotic Validity and Finite-Sample Properties of Approximate Randomization Tests", "link": "https://arxiv.org/abs/1908.04218", "description": "arXiv:1908.04218v3 Announce Type: replace \nAbstract: Randomization tests rely on simple data transformations and possess an appealing robustness property. In addition to being finite-sample valid if the data distribution is invariant under the transformation, these tests can be asymptotically valid under a suitable studentization of the test statistic, even if the invariance does not hold. However, practical implementation often encounters noisy data, resulting in approximate randomization tests that may not be as robust. In this paper, our key theoretical contribution is a non-asymptotic bound on the discrepancy between the size of an approximate randomization test and the size of the original randomization test using noiseless data. 
This allows us to derive novel conditions for the validity of approximate randomization tests under data invariances, while being able to leverage existing results based on studentization if the invariance does not hold. We illustrate our theory through several examples, including tests of significance in linear regression. Our theory can explain certain aspects of how randomization tests perform in small samples, addressing limitations of prior theoretical results."}, "https://arxiv.org/abs/2109.02204": {"title": "On the edge eigenvalues of the precision matrices of nonstationary autoregressive processes", "link": "https://arxiv.org/abs/2109.02204", "description": "arXiv:2109.02204v3 Announce Type: replace \nAbstract: This paper investigates the structural changes in the parameters of first-order autoregressive models by analyzing the edge eigenvalues of the precision matrices. Specifically, edge eigenvalues in the precision matrix are observed if and only if there is a structural change in the autoregressive coefficients. We demonstrate that these edge eigenvalues correspond to the zeros of some determinantal equation. Additionally, we propose a consistent estimator for detecting outliers within the panel time series framework, supported by numerical experiments."}, "https://arxiv.org/abs/2112.07465": {"title": "The multirank likelihood for semiparametric canonical correlation analysis", "link": "https://arxiv.org/abs/2112.07465", "description": "arXiv:2112.07465v4 Announce Type: replace \nAbstract: Many analyses of multivariate data focus on evaluating the dependence between two sets of variables, rather than the dependence among individual variables within each set. Canonical correlation analysis (CCA) is a classical data analysis technique that estimates parameters describing the dependence between such sets. However, inference procedures based on traditional CCA rely on the assumption that all variables are jointly normally distributed. We present a semiparametric approach to CCA in which the multivariate margins of each variable set may be arbitrary, but the dependence between variable sets is described by a parametric model that provides low-dimensional summaries of dependence. While maximum likelihood estimation in the proposed model is intractable, we propose two estimation strategies: one using a pseudolikelihood for the model and one using a Markov chain Monte Carlo (MCMC) algorithm that provides Bayesian estimates and confidence regions for the between-set dependence parameters. The MCMC algorithm is derived from a multirank likelihood function, which uses only part of the information in the observed data in exchange for being free of assumptions about the multivariate margins. We apply the proposed Bayesian inference procedure to Brazilian climate data and monthly stock returns from the materials and communications market sectors."}, "https://arxiv.org/abs/2202.02311": {"title": "Graphical criteria for the identification of marginal causal effects in continuous-time survival and event-history analyses", "link": "https://arxiv.org/abs/2202.02311", "description": "arXiv:2202.02311v2 Announce Type: replace \nAbstract: We consider continuous-time survival or more general event-history settings, where the aim is to infer the causal effect of a time-dependent treatment process. This is formalised as the effect on the outcome event of a (possibly hypothetical) intervention on the intensity of the treatment process, i.e. a stochastic intervention. 
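To connect the randomization-test abstract to practice, here is a minimal sign-flip randomization test of symmetry about zero with a studentized statistic. The data and the invariance assumption are hypothetical, and the approximate-test setting analyzed in the paper (noisy data) is not reproduced.

```python
# Minimal sign-flip randomization test of H0: the distribution is symmetric about 0,
# using a studentized statistic (illustration; not the paper's approximate-test setting).
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(0.3, 1.0, 40)                      # hypothetical sample

def tstat(z):
    return np.sqrt(len(z)) * z.mean() / z.std(ddof=1)

obs = tstat(x)
draws = [tstat(x * rng.choice([-1, 1], size=len(x))) for _ in range(5000)]
pval = (1 + sum(abs(d) >= abs(obs) for d in draws)) / (1 + len(draws))
print("studentized sign-flip p-value:", round(pval, 4))
```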
To establish whether valid inference about the interventional situation can be drawn from typical observational, i.e. non-experimental, data we propose graphical rules indicating whether the observed information is sufficient to identify the desired causal effect by suitable re-weighting. In analogy to the well-known causal directed acyclic graphs, the corresponding dynamic graphs combine causal semantics with local independence models for multivariate counting processes. Importantly, we highlight that causal inference from censored data requires structural assumptions on the censoring process beyond the usual independent censoring assumption, which can be represented and verified graphically. Our results establish general non-parametric identifiability and do not rely on particular survival models. We illustrate our proposal with a data example on HPV-testing for cervical cancer screening, where the desired effect is estimated by re-weighted cumulative incidence curves."}, "https://arxiv.org/abs/2203.14511": {"title": "Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments", "link": "https://arxiv.org/abs/2203.14511", "description": "arXiv:2203.14511v3 Announce Type: replace \nAbstract: Researchers are increasingly turning to machine learning (ML) algorithms to investigate causal heterogeneity in randomized experiments. Despite their promise, ML algorithms may fail to accurately ascertain heterogeneous treatment effects under practical settings with many covariates and small sample size. In addition, the quantification of estimation uncertainty remains a challenge. We develop a general approach to statistical inference for heterogeneous treatment effects discovered by a generic ML algorithm. We apply the Neyman's repeated sampling framework to a common setting, in which researchers use an ML algorithm to estimate the conditional average treatment effect and then divide the sample into several groups based on the magnitude of the estimated effects. We show how to estimate the average treatment effect within each of these groups, and construct a valid confidence interval. In addition, we develop nonparametric tests of treatment effect homogeneity across groups, and rank-consistency of within-group average treatment effects. The validity of our methodology does not rely on the properties of ML algorithms because it is solely based on the randomization of treatment assignment and random sampling of units. Finally, we generalize our methodology to the cross-fitting procedure by accounting for the additional uncertainty induced by the random splitting of data."}, "https://arxiv.org/abs/2207.07533": {"title": "Selection of the Most Probable Best", "link": "https://arxiv.org/abs/2207.07533", "description": "arXiv:2207.07533v2 Announce Type: replace \nAbstract: We consider an expected-value ranking and selection (R&S) problem where all k solutions' simulation outputs depend on a common parameter whose uncertainty can be modeled by a distribution. We define the most probable best (MPB) to be the solution that has the largest probability of being optimal with respect to the distribution and design an efficient sequential sampling algorithm to learn the MPB when the parameter has a finite support. We derive the large deviations rate of the probability of falsely selecting the MPB and formulate an optimal computing budget allocation problem to find the rate-maximizing static sampling ratios. 
The problem is then relaxed to obtain a set of optimality conditions that are interpretable and computationally efficient to verify. We devise a series of algorithms that replace the unknown means in the optimality conditions with their estimates and prove the algorithms' sampling ratios achieve the conditions as the simulation budget increases. Furthermore, we show that the empirical performances of the algorithms can be significantly improved by adopting the kernel ridge regression for mean estimation while achieving the same asymptotic convergence results. The algorithms are benchmarked against a state-of-the-art contextual R&S algorithm and demonstrated to have superior empirical performances."}, "https://arxiv.org/abs/2209.10128": {"title": "Efficient Integrated Volatility Estimation in the Presence of Infinite Variation Jumps via Debiased Truncated Realized Variations", "link": "https://arxiv.org/abs/2209.10128", "description": "arXiv:2209.10128v3 Announce Type: replace \nAbstract: Statistical inference for stochastic processes based on high-frequency observations has been an active research area for more than two decades. One of the most well-known and widely studied problems has been the estimation of the quadratic variation of the continuous component of an It\\^o semimartingale with jumps. Several rate- and variance-efficient estimators have been proposed in the literature when the jump component is of bounded variation. However, to date, very few methods can deal with jumps of unbounded variation. By developing new high-order expansions of the truncated moments of a locally stable L\\'evy process, we propose a new rate- and variance-efficient volatility estimator for a class of It\\^o semimartingales whose jumps behave locally like those of a stable L\\'evy process with Blumenthal-Getoor index $Y\\in (1,8/5)$ (hence, of unbounded variation). The proposed method is based on a two-step debiasing procedure for the truncated realized quadratic variation of the process and can also cover the case $Y<1$. Our Monte Carlo experiments indicate that the method outperforms other efficient alternatives in the literature in the setting covered by our theoretical framework."}, "https://arxiv.org/abs/2210.12382": {"title": "Model-free controlled variable selection via data splitting", "link": "https://arxiv.org/abs/2210.12382", "description": "arXiv:2210.12382v3 Announce Type: replace \nAbstract: Addressing the simultaneous identification of contributory variables while controlling the false discovery rate (FDR) in high-dimensional data is a crucial statistical challenge. In this paper, we propose a novel model-free variable selection procedure in sufficient dimension reduction framework via a data splitting technique. The variable selection problem is first converted to a least squares procedure with several response transformations. We construct a series of statistics with global symmetry property and leverage the symmetry to derive a data-driven threshold aimed at error rate control. Our approach demonstrates the capability for achieving finite-sample and asymptotic FDR control under mild theoretical conditions. 
Numerical experiments confirm that our procedure has satisfactory FDR control and higher power compared with existing methods."}, "https://arxiv.org/abs/2212.04814": {"title": "The Falsification Adaptive Set in Linear Models with Instrumental Variables that Violate the Exclusion or Conditional Exogeneity Restriction", "link": "https://arxiv.org/abs/2212.04814", "description": "arXiv:2212.04814v2 Announce Type: replace \nAbstract: Masten and Poirier (2021) introduced the falsification adaptive set (FAS) in linear models with a single endogenous variable estimated with multiple correlated instrumental variables (IVs). The FAS reflects the model uncertainty that arises from falsification of the baseline model. We show that it applies to cases where a conditional exogeneity assumption holds and invalid instruments violate the exclusion assumption only. We propose a generalized FAS that reflects the model uncertainty when some instruments violate the exclusion assumption and/or some instruments violate the conditional exogeneity assumption. Under the assumption that invalid instruments are not themselves endogenous explanatory variables, if there is at least one relevant instrument that satisfies both the exclusion and conditional exogeneity assumptions then this generalized FAS is guaranteed to contain the parameter of interest."}, "https://arxiv.org/abs/2212.12501": {"title": "Learning Optimal Dynamic Treatment Regimens Subject to Stagewise Risk Controls", "link": "https://arxiv.org/abs/2212.12501", "description": "arXiv:2212.12501v2 Announce Type: replace \nAbstract: Dynamic treatment regimens (DTRs) aim at tailoring individualized sequential treatment rules that maximize cumulative beneficial outcomes by accommodating patients' heterogeneity in decision-making. For many chronic diseases including type 2 diabetes mellitus (T2D), treatments are usually multifaceted in the sense that aggressive treatments with a higher expected reward are also likely to elevate the risk of acute adverse events. In this paper, we propose a new weighted learning framework, namely benefit-risk dynamic treatment regimens (BR-DTRs), to address the benefit-risk trade-off. The new framework relies on a backward learning procedure by restricting the induced risk of the treatment rule to be no larger than a pre-specified risk constraint at each treatment stage. Computationally, the estimated treatment rule solves a weighted support vector machine problem with a modified smooth constraint. Theoretically, we show that the proposed DTRs are Fisher consistent, and we further obtain the convergence rates for both the value and risk functions. Finally, the performance of the proposed method is demonstrated via extensive simulation studies and application to a real study for T2D patients."}, "https://arxiv.org/abs/2303.08790": {"title": "lmw: Linear Model Weights for Causal Inference", "link": "https://arxiv.org/abs/2303.08790", "description": "arXiv:2303.08790v2 Announce Type: replace \nAbstract: The linear regression model is widely used in the biomedical and social sciences as well as in policy and business research to adjust for covariates and estimate the average effects of treatments. Behind every causal inference endeavor there is a hypothetical randomized experiment. 
However, in routine regression analyses in observational studies, it is unclear how well the adjustments made by regression approximate key features of randomized experiments, such as covariate balance, study representativeness, sample boundedness, and unweighted sampling. In this paper, we provide software to empirically address this question. We introduce the lmw package for R to compute the implied linear model weights and perform diagnostics for their evaluation. The weights are obtained as part of the design stage of the study; that is, without using outcome information. The implementation is general and applicable, for instance, in settings with instrumental variables and multi-valued treatments; in essence, in any situation where the linear model is the vehicle for adjustment and estimation of average treatment effects with discrete-valued interventions."}, "https://arxiv.org/abs/2303.08987": {"title": "Generalized Score Matching", "link": "https://arxiv.org/abs/2303.08987", "description": "arXiv:2303.08987v2 Announce Type: replace \nAbstract: Score matching is an estimation procedure that has been developed for statistical models whose probability density function is known up to proportionality but whose normalizing constant is intractable, so that maximum likelihood is difficult or impossible to implement. To date, applications of score matching have focused more on continuous IID models. Motivated by various data modelling problems, this article proposes a unified asymptotic theory of generalized score matching developed under the independence assumption, covering both continuous and discrete response data, thereby giving a sound basis for score-matching-based inference. Real data analyses and simulation studies provide convincing evidence of strong practical performance of the proposed methods."}, "https://arxiv.org/abs/2303.13237": {"title": "Improving estimation for asymptotically independent bivariate extremes via global estimators for the angular dependence function", "link": "https://arxiv.org/abs/2303.13237", "description": "arXiv:2303.13237v3 Announce Type: replace \nAbstract: Modelling the extremal dependence of bivariate variables is important in a wide variety of practical applications, including environmental planning, catastrophe modelling and hydrology. The majority of these approaches are based on the framework of bivariate regular variation, and a wide range of literature is available for estimating the dependence structure in this setting. However, such procedures are only applicable to variables exhibiting asymptotic dependence, even though asymptotic independence is often observed in practice. In this paper, we consider the so-called `angular dependence function'; this quantity summarises the extremal dependence structure for asymptotically independent variables. Until recently, only pointwise estimators of the angular dependence function have been available. We introduce a range of global estimators and compare them to another recently introduced technique for global estimation through a systematic simulation study, and a case study on river flow data from the north of England, UK."}, "https://arxiv.org/abs/2304.02476": {"title": "A Class of Models for Large Zero-inflated Spatial Data", "link": "https://arxiv.org/abs/2304.02476", "description": "arXiv:2304.02476v2 Announce Type: replace \nAbstract: Spatially correlated data with an excess of zeros, usually referred to as zero-inflated spatial data, arise in many disciplines. 
Examples include count data, for instance, abundance (or lack thereof) of animal species and disease counts, as well as semi-continuous data like observed precipitation. Spatial two-part models are a flexible class of models for such data. Fitting two-part models can be computationally expensive for large data due to high-dimensional dependent latent variables, costly matrix operations, and slow mixing Markov chains. We describe a flexible, computationally efficient approach for modeling large zero-inflated spatial data using the projection-based intrinsic conditional autoregression (PICAR) framework. We study our approach, which we call PICAR-Z, through extensive simulation studies and two environmental data sets. Our results suggest that PICAR-Z provides accurate predictions while remaining computationally efficient. An important goal of our work is to allow researchers who are not experts in computation to easily build computationally efficient extensions to zero-inflated spatial models; this also allows for a more thorough exploration of modeling choices in two-part models than was previously possible. We show that PICAR-Z is easy to implement and extend in popular probabilistic programming languages such as nimble and stan."}, "https://arxiv.org/abs/2304.05323": {"title": "A nonparametric framework for treatment effect modifier discovery in high dimensions", "link": "https://arxiv.org/abs/2304.05323", "description": "arXiv:2304.05323v2 Announce Type: replace \nAbstract: Heterogeneous treatment effects are driven by treatment effect modifiers, pre-treatment covariates that modify the effect of a treatment on an outcome. Current approaches for uncovering these variables are limited to low-dimensional data, data with weakly correlated covariates, or data generated according to parametric processes. We resolve these issues by developing a framework for defining model-agnostic treatment effect modifier variable importance parameters applicable to high-dimensional data with arbitrary correlation structure, deriving one-step, estimating equation and targeted maximum likelihood estimators of these parameters, and establishing these estimators' asymptotic properties. This framework is showcased by defining variable importance parameters for data-generating processes with continuous, binary, and time-to-event outcomes with binary treatments, and deriving accompanying multiply-robust and asymptotically linear estimators. Simulation experiments demonstrate that these estimators' asymptotic guarantees are approximately achieved in realistic sample sizes for observational and randomized studies alike. This framework is applied to gene expression data collected for a clinical trial assessing the effect of a monoclonal antibody therapy on disease-free survival in breast cancer patients. Genes predicted to have the greatest potential for treatment effect modification have previously been linked to breast cancer. 
An open-source R package implementing this methodology, unihtee, is made available on GitHub at https://github.com/insightsengineering/unihtee."}, "https://arxiv.org/abs/2305.12624": {"title": "Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors", "link": "https://arxiv.org/abs/2305.12624", "description": "arXiv:2305.12624v2 Announce Type: replace \nAbstract: Wearable devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behavior, such as sleep quality and physical activity. The continuous monitoring often occurs in epochs of 60 seconds over multiple days, resulting in high dimensional longitudinal curves that are best described and analyzed as functional data. From this perspective, the functional data are smooth, latent functions obtained at discrete time intervals and prone to homoscedastic white noise. However, the assumption of homoscedastic errors might not be appropriate in this setting because the devices collect the data serially. While researchers have previously addressed measurement error in scalar covariates prone to errors, less work has been done on correcting measurement error in high dimensional longitudinal curves prone to heteroscedastic errors. We present two new methods for correcting measurement error in longitudinal functional curves prone to complex measurement error structures in multi-level generalized functional linear regression models. These methods are based on two-stage scalable regression calibration. We assume that the distribution of the scalar responses and the surrogate measures prone to heteroscedastic errors both belong in the exponential family and that the measurement errors follow Gaussian processes. In simulations and sensitivity analyses, we established some finite sample properties of these methods. In our simulations, both regression calibration methods for correcting measurement error performed better than estimators based on averaging the longitudinal functional data and using observations from a single day. We also applied the methods to assess the relationship between physical activity and type 2 diabetes in community dwelling adults in the United States who participated in the National Health and Nutrition Examination Survey."}, "https://arxiv.org/abs/2306.09976": {"title": "Catch me if you can: Signal localization with knockoff e-values", "link": "https://arxiv.org/abs/2306.09976", "description": "arXiv:2306.09976v3 Announce Type: replace \nAbstract: We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. 
To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank."}, "https://arxiv.org/abs/2306.16785": {"title": "A location-scale joint model for studying the link between the time-dependent subject-specific variability of blood pressure and competing events", "link": "https://arxiv.org/abs/2306.16785", "description": "arXiv:2306.16785v2 Announce Type: replace \nAbstract: Given the high incidence of cardio- and cerebrovascular diseases (CVD), and their association with morbidity and mortality, their prevention is a major public health issue. A high level of blood pressure is a well-known risk factor for these events and an increasing number of studies suggest that blood pressure variability may also be an independent risk factor. However, these studies suffer from significant methodological weaknesses. In this work we propose a new location-scale joint model for the repeated measures of a marker and competing events. This joint model combines a mixed model including a subject-specific and time-dependent residual variance modeled through random effects, and cause-specific proportional intensity models for the competing events. The risk of events may depend simultaneously on the current value of the variance, as well as the current value and the current slope of the marker trajectory. The model is estimated by maximizing the likelihood function using the Marquardt-Levenberg algorithm. The estimation procedure is implemented in an R package and is validated through a simulation study. This model is applied to study the association between blood pressure variability and the risk of CVD and death from other causes. Using data from a large clinical trial on the secondary prevention of stroke, we find that the current individual variability of blood pressure is associated with the risk of CVD and death. Moreover, the comparison with a model without heterogeneous variance shows the importance of taking into account this variability in the goodness-of-fit and for dynamic predictions."}, "https://arxiv.org/abs/2307.04527": {"title": "Automatic Debiased Machine Learning for Covariate Shifts", "link": "https://arxiv.org/abs/2307.04527", "description": "arXiv:2307.04527v3 Announce Type: replace \nAbstract: In this paper we address the problem of bias in machine learning of parameters following covariate shifts. Covariate shift occurs when the distribution of input features changes between the training and deployment stages. Regularization and model selection associated with machine learning bias many parameter estimates. In this paper, we propose an automatic debiased machine learning approach to correct for this bias under covariate shifts. The proposed approach leverages state-of-the-art techniques in debiased machine learning to debias estimators of policy and causal parameters when covariate shift is present. The debiasing is automatic in only relying on the parameter of interest and not requiring the form of the bias. We show that our estimator is asymptotically normal as the sample size grows. 
Finally, we demonstrate the proposed method on a regression problem using a Monte-Carlo simulation."}, "https://arxiv.org/abs/2307.07898": {"title": "A Graph-Prediction-Based Approach for Debiasing Underreported Data", "link": "https://arxiv.org/abs/2307.07898", "description": "arXiv:2307.07898v3 Announce Type: replace \nAbstract: We present a novel Graph-based debiasing Algorithm for Underreported Data (GRAUD) aiming at an efficient joint estimation of event counts and discovery probabilities across spatial or graphical structures. This innovative method provides a solution to problems seen in fields such as policing data and COVID-$19$ data analysis. Our approach avoids the need for strong priors typically associated with Bayesian frameworks. By leveraging the graph structures on unknown variables $n$ and $p$, our method debiases the under-reported data and estimates the discovery probability at the same time. We validate the effectiveness of our method through simulation experiments and illustrate its practicality in one real-world application: police 911 calls-to-service data."}, "https://arxiv.org/abs/2307.14867": {"title": "One-step smoothing splines instrumental regression", "link": "https://arxiv.org/abs/2307.14867", "description": "arXiv:2307.14867v3 Announce Type: replace \nAbstract: We extend nonparametric regression smoothing splines to a context where there is endogeneity and instrumental variables are available. Unlike popular existing estimators, the resulting estimator is one-step and relies on a unique regularization parameter. We derive uniform rates of convergence for the estimator and its first derivative. We also address the issue of imposing monotonicity in estimation and extend the approach to a partly linear model. Simulations confirm the good performance of our estimator compared to two-step procedures. Our method yields economically sensible results when used to estimate Engel curves."}, "https://arxiv.org/abs/2309.07261": {"title": "Simultaneous inference for generalized linear models with unmeasured confounders", "link": "https://arxiv.org/abs/2309.07261", "description": "arXiv:2309.07261v3 Announce Type: replace \nAbstract: Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. 
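Since the abstract above benchmarks its error control against the Benjamini-Hochberg procedure, a standard textbook implementation of that step-up rule is sketched here for reference; it is independent of the paper's confounder-adjusted estimator.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest index with p_(k) <= k * q / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```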
By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model."}, "https://arxiv.org/abs/2312.03967": {"title": "Test-negative designs with various reasons for testing: statistical bias and solution", "link": "https://arxiv.org/abs/2312.03967", "description": "arXiv:2312.03967v2 Announce Type: replace \nAbstract: Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases where randomization is not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical simulations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, disease-unrelated reasons, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. The performance of our proposed method is demonstrated through simulation studies."}, "https://arxiv.org/abs/2312.06415": {"title": "Bioequivalence Design with Sampling Distribution Segments", "link": "https://arxiv.org/abs/2312.06415", "description": "arXiv:2312.06415v3 Announce Type: replace \nAbstract: In bioequivalence design, power analyses dictate how much data must be collected to detect the absence of clinically important effects. Power is computed as a tail probability in the sampling distribution of the pertinent test statistics. When these test statistics cannot be constructed from pivotal quantities, their sampling distributions are approximated via repetitive, time-intensive computer simulation. We propose a novel simulation-based method to quickly approximate the power curve for many such bioequivalence tests by efficiently exploring segments (as opposed to the entirety) of the relevant sampling distributions. Despite not estimating the entire sampling distribution, this approach prompts unbiased sample size recommendations. We illustrate this method using two-group bioequivalence tests with unequal variances and overview its broader applicability in clinical design. All methods proposed in this work can be implemented using the developed dent package in R."}, "https://arxiv.org/abs/2203.05860": {"title": "Modelling non-stationarity in asymptotically independent extremes", "link": "https://arxiv.org/abs/2203.05860", "description": "arXiv:2203.05860v4 Announce Type: replace-cross \nAbstract: In many practical applications, evaluating the joint impact of combinations of environmental variables is important for risk management and structural design analysis. 
When such variables are considered simultaneously, non-stationarity can exist within both the marginal distributions and dependence structure, resulting in complex data structures. In the context of extremes, few methods have been proposed for modelling trends in extremal dependence, even though capturing this feature is important for quantifying joint impact. Moreover, most proposed techniques are only applicable to data structures exhibiting asymptotic dependence. Motivated by observed dependence trends of data from the UK Climate Projections, we propose a novel semi-parametric modelling framework for bivariate extremal dependence structures. This framework allows us to capture a wide variety of dependence trends for data exhibiting asymptotic independence. When applied to the climate projection dataset, our model detects significant dependence trends in observations and, in combination with models for marginal non-stationarity, can be used to produce estimates of bivariate risk measures at future time points."}, "https://arxiv.org/abs/2309.08043": {"title": "On Prediction Feature Assignment in the Heckman Selection Model", "link": "https://arxiv.org/abs/2309.08043", "description": "arXiv:2309.08043v2 Announce Type: replace-cross \nAbstract: Under missing-not-at-random (MNAR) sample selection bias, the performance of a prediction model is often degraded. This paper focuses on one classic instance of MNAR sample selection bias where a subset of samples have non-randomly missing outcomes. The Heckman selection model and its variants have commonly been used to handle this type of sample selection bias. The Heckman model uses two separate equations to model the prediction and selection of samples, where the selection features include all prediction features. When using the Heckman model, the prediction features must be properly chosen from the set of selection features. However, choosing the proper prediction features is a challenging task for the Heckman model. This is especially the case when the number of selection features is large. Existing approaches that use the Heckman model often provide a manually chosen set of prediction features. In this paper, we propose Heckman-FA as a novel data-driven framework for obtaining prediction features for the Heckman model. Heckman-FA first trains an assignment function that determines whether or not a selection feature is assigned as a prediction feature. Using the parameters of the trained function, the framework extracts a suitable set of prediction features based on the goodness-of-fit of the prediction model given the chosen prediction features and the correlation between noise terms of the prediction and selection equations. Experimental results on real-world datasets show that Heckman-FA produces a robust regression model under MNAR sample selection bias."}, "https://arxiv.org/abs/2310.01374": {"title": "Corrected generalized cross-validation for finite ensembles of penalized estimators", "link": "https://arxiv.org/abs/2310.01374", "description": "arXiv:2310.01374v2 Announce Type: replace-cross \nAbstract: Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. 
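For context on the abstract above: the uncorrected GCV criterion it refers to applies a multiplicative degrees-of-freedom adjustment to the training error. In standard notation (this is the classical single-estimator formula, not the paper's corrected CGCV estimator), with hat/smoother matrix $\mathbf{S}$ and fitted values $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$,

$$\mathrm{GCV} = \frac{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\bigl(1 - \operatorname{tr}(\mathbf{S})/n\bigr)^2},$$

e.g. $\mathbf{S} = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top$ for ridge regression; the correction proposed in the abstract adds a further additive term per ensemble component, as described next.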
We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar adjustment (in an additive sense) based on degrees-of-freedom-adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, nor out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV."}, "https://arxiv.org/abs/2404.14534": {"title": "Random Indicator Imputation for Missing Not At Random Data", "link": "https://arxiv.org/abs/2404.14534", "description": "arXiv:2404.14534v1 Announce Type: new \nAbstract: Imputation methods for dealing with incomplete data typically assume that the missingness mechanism is at random (MAR). These methods can also be applied to missing not at random (MNAR) situations, where the user specifies some adjustment parameters that describe the degree of departure from MAR. The effect of different pre-chosen values on the inferences is then studied. This paper proposes a novel imputation method, the Random Indicator (RI) method, which, in contrast to the current methodology, estimates these adjustment parameters from the data. For an incomplete variable $X$, the RI method assumes that the observed part of $X$ is normal and the probability for $X$ to be missing follows a logistic function. The idea is to estimate the adjustment parameters by generating a pseudo response indicator from this logistic function. Our method iteratively draws imputations for $X$ and a realization of the response indicator $R$ for $X$, to which we refer as $\\dot{R}$. By cross-classifying $X$ by $R$ and $\\dot{R}$, we obtain various properties on the distribution of the missing data. These properties form the basis for estimating the degree of departure from MAR. Our numerical simulations show that the RI method performs very well across a variety of situations. We show how the method can be used in a real-life data set. The RI method is automatic and opens up new ways to tackle the problem of MNAR data."}, "https://arxiv.org/abs/2404.14603": {"title": "Quantifying the Internal Validity of Weighted Estimands", "link": "https://arxiv.org/abs/2404.14603", "description": "arXiv:2404.14603v1 Announce Type: new \nAbstract: In this paper we study a class of weighted estimands, which we define as parameters that can be expressed as weighted averages of the underlying heterogeneous treatment effects. The popular ordinary least squares (OLS), two-stage least squares (2SLS), and two-way fixed effects (TWFE) estimands are all special cases within our framework. Our focus is on answering two questions concerning weighted estimands. First, under what conditions can they be interpreted as the average treatment effect for some (possibly latent) subpopulation? 
Second, when these conditions are satisfied, what is the upper bound on the size of that subpopulation, either in absolute terms or relative to a target population of interest? We argue that this upper bound provides a valuable diagnostic for empirical research. When a given weighted estimand corresponds to the average treatment effect for a small subset of the population of interest, we say its internal validity is low. Our paper develops practical tools to quantify the internal validity of weighted estimands."}, "https://arxiv.org/abs/2404.14623": {"title": "On Bayesian wavelet shrinkage estimation of nonparametric regression models with stationary errors", "link": "https://arxiv.org/abs/2404.14623", "description": "arXiv:2404.14623v1 Announce Type: new \nAbstract: This work proposes a Bayesian rule based on the mixture of a point mass function at zero and the logistic distribution to perform wavelet shrinkage in nonparametric regression models with stationary errors (with short or long-memory behavior). The proposal is assessed through Monte Carlo experiments and illustrated with real data. Simulation studies indicate that the precision of the estimates decreases as the amount of correlation increases. However, given a sample size and error correlated noise, the performance of the rule is almost the same while the signal-to-noise ratio decreases, compared to the performance of the rule under independent and identically distributed errors. Further, we find that the performance of the proposal is better than the standard soft thresholding rule with universal policy in most of the considered underlying functions, sample sizes and signal-to-noise ratios scenarios."}, "https://arxiv.org/abs/2404.14644": {"title": "Identifying sparse treatment effects", "link": "https://arxiv.org/abs/2404.14644", "description": "arXiv:2404.14644v1 Announce Type: new \nAbstract: Based on technological advances in sensing modalities, randomized trials with primary outcomes represented as high-dimensional vectors have become increasingly prevalent. For example, these outcomes could be week-long time-series data from wearable devices or high-dimensional neuroimaging data, such as from functional magnetic resonance imaging. This paper focuses on randomized treatment studies with such high-dimensional outcomes characterized by sparse treatment effects, where interventions may influence a small number of dimensions, e.g., small temporal windows or specific brain regions. Conventional practices, such as using fixed, low-dimensional summaries of the outcomes, result in significantly reduced power for detecting treatment effects. To address this limitation, we propose a procedure that involves subset selection followed by inference. Specifically, given a potentially large set of outcome summaries, we identify the subset that captures treatment effects, which requires only one call to the Lasso, and subsequently conduct inference on the selected subset. 
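The abstract above describes a select-then-infer pipeline: one Lasso call for subset selection, then inference on the selected outcome summaries. The sketch below is one plausible sample-splitting instantiation of that idea, not the authors' exact procedure; the Lasso target, the penalty level, and the Bonferroni-adjusted t-tests are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def select_then_infer(Y, A, lasso_penalty=0.05, seed=0):
    """Hypothetical two-stage analysis for high-dimensional outcomes Y (n x p)
    and binary treatment A (n,): Lasso-based selection on one half of the
    sample, per-dimension two-sample t-tests with Bonferroni correction on
    the other half."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    idx = rng.permutation(n)
    sel, inf = idx[: n // 2], idx[n // 2:]

    # One Lasso call on the selection half: regress the centered treatment
    # indicator on all outcome summaries; nonzero coefficients flag
    # dimensions that appear to carry a treatment signal.
    coef = Lasso(alpha=lasso_penalty).fit(Y[sel], A[sel] - A[sel].mean()).coef_
    chosen = np.flatnonzero(coef)

    # Inference half: test mean differences only on the selected subset.
    out = {}
    for j in chosen:
        t, p = stats.ttest_ind(Y[inf][A[inf] == 1, j], Y[inf][A[inf] == 0, j])
        out[j] = (t, min(1.0, p * len(chosen)))  # Bonferroni-adjusted p-value
    return out
```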
Via theoretical analysis as well as simulations, we demonstrate that our method asymptotically selects the correct subset and increases statistical power."}, "https://arxiv.org/abs/2404.14840": {"title": "Analysis of cohort stepped wedge cluster-randomized trials with non-ignorable dropout via joint modeling", "link": "https://arxiv.org/abs/2404.14840", "description": "arXiv:2404.14840v1 Announce Type: new \nAbstract: Stepped wedge cluster-randomized trial (CRTs) designs randomize clusters of individuals to intervention sequences, ensuring that every cluster eventually transitions from a control period to receive the intervention under study by the end of the study period. The analysis of stepped wedge CRTs is usually more complex than parallel-arm CRTs due to potential secular trends that result in changing intra-cluster and period-cluster correlations over time. A further challenge in the analysis of closed-cohort stepped wedge CRTs, which follow groups of individuals enrolled in each period longitudinally, is the occurrence of dropout. This is particularly problematic in studies of individuals at high risk for mortality, which causes non-ignorable missing outcomes. If not appropriately addressed, missing outcomes from death will erode statistical power, at best, and bias treatment effect estimates, at worst. Joint longitudinal-survival models can accommodate informative dropout and missingness patterns in longitudinal studies. Specifically, within this framework one directly models the dropout process via a time-to-event submodel together with the longitudinal outcome of interest. The two submodels are then linked using a variety of possible association structures. This work extends linear mixed-effects models by jointly modeling the dropout process to accommodate informative missing outcome data in closed-cohort stepped wedge CRTs. We focus on constant intervention and general time-on-treatment effect parametrizations for the longitudinal submodel and study the performance of the proposed methodology using Monte Carlo simulation under several data-generating scenarios. We illustrate the joint modeling methodology in practice by reanalyzing the `Frail Older Adults: Care in Transition' (ACT) trial, a stepped wedge CRT of a multifaceted geriatric care model versus usual care in the Netherlands."}, "https://arxiv.org/abs/2404.15017": {"title": "The mosaic permutation test: an exact and nonparametric goodness-of-fit test for factor models", "link": "https://arxiv.org/abs/2404.15017", "description": "arXiv:2404.15017v1 Announce Type: new \nAbstract: Financial firms often rely on factor models to explain correlations among asset returns. These models are important for managing risk, for example by modeling the probability that many assets will simultaneously lose value. Yet after major events, e.g., COVID-19, analysts may reassess whether existing models continue to fit well: specifically, after accounting for the factor exposures, are the residuals of the asset returns independent? With this motivation, we introduce the mosaic permutation test, a nonparametric goodness-of-fit test for preexisting factor models. Our method allows analysts to use nearly any machine learning technique to detect model violations while provably controlling the false positive rate, i.e., the probability of rejecting a well-fitting model. Notably, this result does not rely on asymptotic approximations and makes no parametric assumptions. 
This property helps prevent analysts from unnecessarily rebuilding accurate models, which can waste resources and increase risk. We illustrate our methodology by applying it to the Blackrock Fundamental Equity Risk (BFRE) model. Using the mosaic permutation test, we find that the BFRE model generally explains the most significant correlations among assets. However, we find evidence of unexplained correlations among certain real estate stocks, and we show that adding new factors improves model fit. We implement our methods in the python package mosaicperm."}, "https://arxiv.org/abs/2404.15060": {"title": "Fast and reliable confidence intervals for a variance component or proportion", "link": "https://arxiv.org/abs/2404.15060", "description": "arXiv:2404.15060v1 Announce Type: new \nAbstract: We show that confidence intervals for a variance component or proportion, with asymptotically correct uniform coverage probability, can be obtained by inverting certain test-statistics based on the score for the restricted likelihood. The results apply in settings where the variance or proportion is near or at the boundary of the parameter set. Simulations indicate the proposed test-statistics are approximately pivotal and lead to confidence intervals with near-nominal coverage even in small samples. We illustrate our methods' application in spatially-resolved transcriptomics where we compute approximately 15,000 confidence intervals, used for gene ranking, in less than 4 minutes. In the settings we consider, the proposed method is between two and 28,000 times faster than popular alternatives, depending on how many confidence intervals are computed."}, "https://arxiv.org/abs/2404.15073": {"title": "The Complex Estimand of Clone-Censor-Weighting When Studying Treatment Initiation Windows", "link": "https://arxiv.org/abs/2404.15073", "description": "arXiv:2404.15073v1 Announce Type: new \nAbstract: Clone-censor-weighting (CCW) is an analytic method for studying treatment regimens that are indistinguishable from one another at baseline without relying on landmark dates or creating immortal person time. One particularly interesting CCW application is estimating outcomes when starting treatment within specific time windows in observational data (e.g., starting a treatment within 30 days of hospitalization). In such cases, CCW estimates something fairly complex. We show how using CCW to study a regimen such as \"start treatment prior to day 30\" estimates the potential outcome of a hypothetical intervention where A) prior to day 30, everyone follows the treatment start distribution of the study population and B) everyone who has not initiated by day 30 initiates on day 30. As a result, the distribution of treatment initiation timings provides essential context for the results of CCW studies. We also show that if the exposure effect varies over time, ignoring exposure history when estimating inverse probability of censoring weights (IPCW) estimates the risk under an impossible intervention and can create selection bias. 
Finally, we examine some simplifying assumptions that can make this complex treatment effect more interpretable and allow everyone to contribute to IPCW."}, "https://arxiv.org/abs/2404.15115": {"title": "Principal Component Analysis and biplots", "link": "https://arxiv.org/abs/2404.15115", "description": "arXiv:2404.15115v1 Announce Type: new \nAbstract: Principal Component Analysis and biplots are so well-established and readily implemented that it is just too tempting to take their internal workings for granted. In this note I get back to basics in comparing how PCA and biplots are implemented in base-R and contributed R packages, leveraging an implementation-agnostic understanding of the computational structure of each technique. I do so with a view to illustrating discrepancies that users might find elusive, as these arise from seemingly innocuous computational choices made under the hood. The proposed evaluation grid elevates aspects that are usually disregarded, including relationships that should hold if the computational rationale underpinning each technique is followed correctly. Strikingly, what is expected from these equivalences rarely follows without caveats from the output of specific implementations alone."}, "https://arxiv.org/abs/2404.15245": {"title": "Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification", "link": "https://arxiv.org/abs/2404.15245", "description": "arXiv:2404.15245v1 Announce Type: new \nAbstract: Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in a binary setting that allows us to train models invariant over environments. We provide sufficient conditions for such invariance and show it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets."}, "https://arxiv.org/abs/2203.05120": {"title": "A low-rank ensemble Kalman filter for elliptic observations", "link": "https://arxiv.org/abs/2203.05120", "description": "arXiv:2203.05120v3 Announce Type: cross \nAbstract: We propose a regularization method for ensemble Kalman filtering (EnKF) with elliptic observation operators. Commonly used EnKF regularization methods suppress state correlations at long distances. For observations described by elliptic partial differential equations, such as the pressure Poisson equation (PPE) in incompressible fluid flows, distance localization cannot be applied, as we cannot disentangle slowly decaying physical interactions from spurious long-range correlations. This is particularly true for the PPE, in which distant vortex elements couple nonlinearly to induce pressure. Instead, these inverse problems have a low effective dimension: low-dimensional projections of the observations strongly inform a low-dimensional subspace of the state space. We derive a low-rank factorization of the Kalman gain based on the spectrum of the Jacobian of the observation operator. The identified eigenvectors generalize the source and target modes of the multipole expansion, independently of the underlying spatial distribution of the problem. 
Given rapid spectral decay, inference can be performed in the low-dimensional subspace spanned by the dominant eigenvectors. This low-rank EnKF is assessed on dynamical systems with Poisson observation operators, where we seek to estimate the positions and strengths of point singularities over time from potential or pressure observations. We also comment on the broader applicability of this approach to elliptic inverse problems outside the context of filtering."}, "https://arxiv.org/abs/2404.14446": {"title": "Spatio-temporal Joint Analysis of PM2", "link": "https://arxiv.org/abs/2404.14446", "description": "arXiv:2404.14446v1 Announce Type: cross \nAbstract: The substantial threat of concurrent air pollutants to public health is increasingly severe under climate change. To identify the common drivers and extent of spatio-temporal similarity of PM2.5 and ozone, this paper proposed a log Gaussian-Gumbel Bayesian hierarchical model allowing for sharing a SPDE-AR(1) spatio-temporal interaction structure. The proposed model outperforms in terms of estimation accuracy and prediction capacity for its increased parsimony and reduced uncertainty, especially for the shared ozone sub-model. Besides the consistently significant influence of temperature (positive), extreme drought (positive), fire burnt area (positive), and wind speed (negative) on both PM2.5 and ozone, surface pressure and GDP per capita (precipitation) demonstrate only positive associations with PM2.5 (ozone), while population density relates to neither. In addition, our results show the distinct spatio-temporal interactions and different seasonal patterns of PM2.5 and ozone, with peaks of PM2.5 and ozone in cold and hot seasons, respectively. Finally, with the aid of the excursion function, we see that the areas around the intersection of San Luis Obispo and Santa Barbara counties are likely to exceed the unhealthy ozone level for sensitive groups throughout the year. Our findings provide new insights for regional and seasonal strategies in the co-control of PM2.5 and ozone. Our methodology is expected to be utilized when interest lies in multiple interrelated processes in the fields of environment and epidemiology."}, "https://arxiv.org/abs/2404.14460": {"title": "Inference of Causal Networks using a Topological Threshold", "link": "https://arxiv.org/abs/2404.14460", "description": "arXiv:2404.14460v1 Announce Type: cross \nAbstract: We propose a constraint-based algorithm, which automatically determines causal relevance thresholds, to infer causal networks from data. We call these topological thresholds. We present two methods for determining the threshold: the first seeks a set of edges that leaves no disconnected nodes in the network; the second seeks a causal large connected component in the data.\n We tested these methods both for discrete synthetic and real data, and compared the results with those obtained for the PC algorithm, which we took as the benchmark. We show that this novel algorithm is generally faster and more accurate than the PC algorithm.\n The algorithm for determining the thresholds requires choosing a measure of causality. We tested our methods for Fisher Correlations, commonly used in PC algorithm (for instance in \\cite{kalisch2005}), and further proposed a discrete and asymmetric measure of causality, that we called Net Influence, which provided very good results when inferring causal networks from discrete data. 
This metric allows for inferring directionality of the edges in the process of applying the thresholds, speeding up the inference of causal DAGs."}, "https://arxiv.org/abs/2404.14786": {"title": "LLM-Enhanced Causal Discovery in Temporal Domain from Interventional Data", "link": "https://arxiv.org/abs/2404.14786", "description": "arXiv:2404.14786v1 Announce Type: cross \nAbstract: In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for operation and maintenance of graph construction, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets with heavy reliance on intervention targets and ignore the textual information hidden in real-world systems, failing to conduct causal discovery for real industrial scenarios. To tackle this problem, in this paper we propose to investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations via leveraging the textual information in systems which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract the meta-knowledge from textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures."}, "https://arxiv.org/abs/2404.15133": {"title": "Bayesian Strategies for Repulsive Spatial Point Processes", "link": "https://arxiv.org/abs/2404.15133", "description": "arXiv:2404.15133v1 Announce Type: cross \nAbstract: There is increasing interest in developing Bayesian inferential algorithms for point process models with intractable likelihoods. A purpose of this paper is to illustrate the utility of using simulation-based strategies, including approximate Bayesian computation (ABC) and Markov chain Monte Carlo (MCMC) methods for this task. Shirota and Gelfand (2017) proposed an extended version of an ABC approach for repulsive spatial point processes, including the Strauss point process and the determinantal point process, but their algorithm was not correctly detailed. We explain that it is, in general, intractable and therefore impractical to use, except in some restrictive situations. This motivates us to instead consider an ABC-MCMC algorithm developed by Fearnhead and Prangle (2012). We further explore the use of the exchange algorithm, together with the recently proposed noisy Metropolis-Hastings algorithm (Alquier et al., 2016). 
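For context on the exchange algorithm mentioned here, in its standard form (not anything specific to this paper): for a doubly-intractable posterior $\pi(\theta \mid y) \propto p(\theta)\,\gamma(y \mid \theta)/Z(\theta)$, one proposes $\theta' \sim q(\cdot \mid \theta)$, simulates a single auxiliary dataset $y' \sim \gamma(\cdot \mid \theta')/Z(\theta')$, and accepts with probability

$$\alpha = \min\left\{1,\; \frac{p(\theta')\,q(\theta \mid \theta')\,\gamma(y \mid \theta')\,\gamma(y' \mid \theta)}{p(\theta)\,q(\theta' \mid \theta)\,\gamma(y \mid \theta)\,\gamma(y' \mid \theta')}\right\},$$

in which the intractable normalizing constants $Z(\theta)$ and $Z(\theta')$ cancel; the noisy Metropolis-Hastings variant discussed next replaces the single auxiliary draw with several.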
As an extension of the exchange algorithm, which requires a single simulation from the likelihood at each iteration, the noisy Metropolis-Hastings algorithm considers multiple draws from the same likelihood function. We find that both of these inferential approaches yield good performance for repulsive spatial point processes in both simulated and real data applications and should be considered as viable approaches for the analysis of these models."}, "https://arxiv.org/abs/2404.15209": {"title": "Data-Driven Knowledge Transfer in Batch $Q^*$ Learning", "link": "https://arxiv.org/abs/2404.15209", "description": "arXiv:2404.15209v1 Announce Type: cross \nAbstract: In data-driven decision-making in marketing, healthcare, and education, it is desirable to utilize a large amount of data from existing ventures to navigate high-dimensional feature spaces and address data scarcity in new ventures. We explore knowledge transfer in dynamic decision-making by concentrating on batch stationary environments and formally defining task discrepancies through the lens of Markov decision processes (MDPs). We propose a framework of Transferred Fitted $Q$-Iteration algorithm with general function approximation, enabling the direct estimation of the optimal action-state function $Q^*$ using both target and source data. We establish the relationship between statistical performance and MDP task discrepancy under sieve approximation, shedding light on the impact of source and target sample sizes and task discrepancy on the effectiveness of knowledge transfer. We show that the final learning error of the $Q^*$ function is significantly improved from the single task rate both theoretically and empirically."}, "https://arxiv.org/abs/2004.01623": {"title": "Estimation and Uniform Inference in Sparse High-Dimensional Additive Models", "link": "https://arxiv.org/abs/2004.01623", "description": "arXiv:2004.01623v2 Announce Type: replace \nAbstract: We develop a novel method to construct uniformly valid confidence bands for a nonparametric component $f_1$ in the sparse additive model $Y=f_1(X_1)+\\ldots + f_p(X_p) + \\varepsilon$ in a high-dimensional setting. Our method integrates sieve estimation into a high-dimensional Z-estimation framework, facilitating the construction of uniformly valid confidence bands for the target component $f_1$. To form these confidence bands, we employ a multiplier bootstrap procedure. Additionally, we provide rates for the uniform lasso estimation in high dimensions, which may be of independent interest. Through simulation studies, we demonstrate that our proposed method delivers reliable results in terms of estimation and coverage, even in small samples."}, "https://arxiv.org/abs/2205.07950": {"title": "The Power of Tests for Detecting $p$-Hacking", "link": "https://arxiv.org/abs/2205.07950", "description": "arXiv:2205.07950v3 Announce Type: replace \nAbstract: $p$-Hacking undermines the validity of empirical studies. A flourishing empirical literature investigates the prevalence of $p$-hacking based on the distribution of $p$-values across studies. Interpreting results in this literature requires a careful understanding of the power of methods for detecting $p$-hacking. We theoretically study the implications of likely forms of $p$-hacking on the distribution of $p$-values to understand the power of tests for detecting it. Power depends crucially on the $p$-hacking strategy and the distribution of true effects. 
Publication bias can enhance the power for testing the joint null of no $p$-hacking and no publication bias."}, "https://arxiv.org/abs/2311.16333": {"title": "From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks", "link": "https://arxiv.org/abs/2311.16333", "description": "arXiv:2311.16333v2 Announce Type: replace \nAbstract: We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density forecasting through a novel neural network architecture with dedicated mean and variance hemispheres. Our architecture features several key ingredients making MLE work in this context. First, the hemispheres share a common core at the entrance of the network which accommodates for various forms of time variation in the error variance. Second, we introduce a volatility emphasis constraint that breaks mean/variance indeterminacy in this class of overparametrized nonlinear models. Third, we conduct a blocked out-of-bag reality check to curb overfitting in both conditional moments. Fourth, the algorithm utilizes standard deep learning software and thus handles large data sets - both computationally and statistically. Ergo, our Hemisphere Neural Network (HNN) provides proactive volatility forecasts based on leading indicators when it can, and reactive volatility based on the magnitude of previous prediction errors when it must. We evaluate point and density forecasts with an extensive out-of-sample experiment and benchmark against a suite of models ranging from classics to more modern machine learning-based offerings. In all cases, HNN fares well by consistently providing accurate mean/variance forecasts for all targets and horizons. Studying the resulting volatility paths reveals its versatility, while probabilistic forecasting evaluation metrics showcase its enviable reliability. Finally, we also demonstrate how this machinery can be merged with other structured deep learning models by revisiting Goulet Coulombe (2022)'s Neural Phillips Curve."}, "https://arxiv.org/abs/2401.04263": {"title": "Two-Step Targeted Minimum-Loss Based Estimation for Non-Negative Two-Part Outcomes", "link": "https://arxiv.org/abs/2401.04263", "description": "arXiv:2401.04263v2 Announce Type: replace \nAbstract: Non-negative two-part outcomes are defined as outcomes with a density function that have a zero point mass but are otherwise positive. Examples, such as healthcare expenditure and hospital length of stay, are common in healthcare utilization research. Despite the practical relevance of non-negative two-part outcomes, very few methods exist to leverage knowledge of their semicontinuity to achieve improved performance in estimating causal effects. In this paper, we develop a nonparametric two-step targeted minimum-loss based estimator (denoted as hTMLE) for non-negative two-part outcomes. We present methods for a general class of interventions referred to as modified treatment policies, which can accommodate continuous, categorical, and binary exposures. The two-step TMLE uses a targeted estimate of the intensity component of the outcome to produce a targeted estimate of the binary component of the outcome that may improve finite sample efficiency. 
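For readers unfamiliar with the two-part terminology in the abstract above: the binary and intensity components refer to the standard semicontinuous decomposition (a textbook identity, not the paper's estimator itself),

$$E[Y \mid A, W] = \underbrace{P(Y > 0 \mid A, W)}_{\text{binary component}} \times \underbrace{E[Y \mid Y > 0, A, W]}_{\text{intensity component}},$$

with the two-step TMLE targeting the intensity piece first and then using it when targeting the binary piece.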
We demonstrate the efficiency gains achieved by the two-step TMLE with simulated examples and then apply it to a cohort of Medicaid beneficiaries to estimate the effect of chronic pain and physical disability on days' supply of opioids."}, "https://arxiv.org/abs/2404.15572": {"title": "Mapping Incidence and Prevalence Peak Data for SIR Forecasting Applications", "link": "https://arxiv.org/abs/2404.15572", "description": "arXiv:2404.15572v1 Announce Type: new \nAbstract: Infectious disease modeling and forecasting have played a key role in helping assess and respond to epidemics and pandemics. Recent work has leveraged data on disease peak infection and peak hospital incidence to fit compartmental models for the purpose of forecasting and describing the dynamics of a disease outbreak. Incorporating these data can greatly stabilize a compartmental model fit on early observations, where slight perturbations in the data may lead to model fits that project a wildly unrealistic peak infection. We introduce a new method for incorporating historic data on the value and time of peak incidence of hospitalization into the fit for a Susceptible-Infectious-Recovered (SIR) model by formulating the relationship between an SIR model's starting parameters and peak incidence as a system of two equations that can be solved computationally. This approach is assessed for practicality in terms of accuracy and speed of computation via simulation. To exhibit the modeling potential, we update the Dirichlet-Beta State Space modeling framework to use hospital incidence data, as this framework was previously formulated to incorporate only data on total infections."}, "https://arxiv.org/abs/2404.15586": {"title": "Multiple testing with anytime-valid Monte-Carlo p-values", "link": "https://arxiv.org/abs/2404.15586", "description": "arXiv:2404.15586v1 Announce Type: new \nAbstract: In contemporary problems involving genetic or neuroimaging data, thousands of hypotheses need to be tested. Due to their high power, and finite sample guarantees on type I error under weak assumptions, Monte-Carlo permutation tests are often considered the gold standard for these settings. However, the enormous computational effort required for (thousands of) permutation tests is a major burden. Recently, Fischer and Ramdas (2024) constructed a permutation test for a single hypothesis in which the permutations are drawn sequentially one-by-one and the testing process can be stopped at any point without inflating the type I error. They showed that the number of permutations can be substantially reduced (under null and alternative) while the power remains similar. We show how their approach can be modified to make it suitable for a broad class of multiple testing procedures. In particular, we discuss its use with the Benjamini-Hochberg procedure and illustrate the application on a large dataset."}, "https://arxiv.org/abs/2404.15329": {"title": "Greedy Capon Beamformer", "link": "https://arxiv.org/abs/2404.15329", "description": "arXiv:2404.15329v1 Announce Type: cross \nAbstract: We propose a greedy Capon beamformer (GCB) for direction finding of narrow-band sources present in the array's viewing field. After defining the grid covering the location search space, the algorithm greedily builds the interference-plus-noise covariance matrix by identifying a high-power source on the grid using Capon's principle of maximizing the signal to interference plus noise ratio (SINR) while enforcing unit gain towards the signal of interest.
An estimate of the power of the detected source is derived by exploiting the unit power constraint, which subsequently allows the noise covariance matrix to be updated by a simple rank-1 matrix addition composed of the outer product of the selected steering matrix with itself, scaled by the signal power estimate. Our numerical examples demonstrate the effectiveness of the proposed GCB in direction finding, where it performs favourably compared to state-of-the-art algorithms under a broad variety of settings. Furthermore, GCB estimates of directions of arrival (DOAs) are very fast to compute."}, "https://arxiv.org/abs/2404.15440": {"title": "Exploring Convergence in Relation using Association Rules Mining: A Case Study in Collaborative Knowledge Production", "link": "https://arxiv.org/abs/2404.15440", "description": "arXiv:2404.15440v1 Announce Type: cross \nAbstract: This study delves into the pivotal role played by non-experts in knowledge production on open collaboration platforms, with a particular focus on the intricate process of tag development that culminates in the proposal of new glitch classes. Leveraging the power of Association Rule Mining (ARM), this research endeavors to unravel the underlying dynamics of collaboration among citizen scientists. By meticulously quantifying tag associations and scrutinizing their temporal dynamics, the study provides a comprehensive and nuanced understanding of how non-experts collaborate to generate valuable scientific insights. Furthermore, this investigation extends its purview to examine the phenomenon of ideological convergence within online citizen science knowledge production. To accomplish this, a novel measurement algorithm, based on the Mann-Kendall Trend Test, is introduced. This innovative approach sheds illuminating light on the dynamics of collaborative knowledge production, revealing both the vast opportunities and daunting challenges inherent in leveraging non-expert contributions for scientific research endeavors. Notably, the study uncovers a robust pattern of convergence in ideology, employing both the newly proposed convergence testing method and the traditional approach based on the stationarity of time series data. This groundbreaking discovery holds significant implications for understanding the dynamics of online citizen science communities and underscores the crucial role played by non-experts in shaping the scientific landscape of the digital age. Ultimately, this study contributes significantly to our understanding of online citizen science communities, highlighting their potential to harness collective intelligence for tackling complex scientific tasks and enriching our comprehension of collaborative knowledge production processes in the digital age."}, "https://arxiv.org/abs/2404.15484": {"title": "Uncertainty, Imprecise Probabilities and Interval Capacity Measures on a Product Space", "link": "https://arxiv.org/abs/2404.15484", "description": "arXiv:2404.15484v1 Announce Type: cross \nAbstract: In Basili and Pratelli (2024), a novel and coherent concept of interval probability measures has been introduced, providing a method for representing imprecise probabilities and uncertainty. Within the framework of set algebra, we introduced the concepts of weak complementation and interval probability measures associated with a family of random variables, which effectively capture the inherent uncertainty in any event. This paper conducts a comprehensive analysis of these concepts within a specific probability space.
Additionally, we elaborate on an updating rule for events, integrating essential concepts of statistical independence, dependence, and stochastic dominance."}, "https://arxiv.org/abs/2404.15495": {"title": "Correlations versus noise in the NFT market", "link": "https://arxiv.org/abs/2404.15495", "description": "arXiv:2404.15495v1 Announce Type: cross \nAbstract: The non-fungible token (NFT) market emerges as a recent trading innovation leveraging blockchain technology, mirroring the dynamics of the cryptocurrency market. To deepen the understanding of the dynamics of this market, in the current study, based on the capitalization changes and transaction volumes across a large number of token collections on the Ethereum platform, the degree of correlation in this market is examined by using the multivariate formalism of detrended correlation coefficient and correlation matrix. It appears that correlation strength is lower here than that observed in previously studied markets. Consequently, the eigenvalue spectra of the correlation matrix more closely follow the Marchenko-Pastur distribution; still, some departures indicating the existence of correlations remain. The comparison of results obtained from the correlation matrix built from the Pearson coefficients and, independently, from the detrended cross-correlation coefficients suggests that the global correlations in the NFT market arise from higher frequency fluctuations. Corresponding minimal spanning trees (MSTs) for capitalization variability exhibit a scale-free character while, for the number of transactions, they are somewhat more decentralized."}, "https://arxiv.org/abs/2404.15649": {"title": "The Impact of Loss Estimation on Gibbs Measures", "link": "https://arxiv.org/abs/2404.15649", "description": "arXiv:2404.15649v1 Announce Type: cross \nAbstract: In recent years, the shortcomings of Bayes posteriors as inferential devices have received increased attention. A popular strategy for fixing them has been to instead target a Gibbs measure based on losses that connect a parameter of interest to observed data. While existing theory for such inference procedures relies on these losses to be analytically available, in many situations these losses must be stochastically estimated using pseudo-observations. The current paper fills this research gap, and derives the first asymptotic theory for Gibbs measures based on estimated losses. Our findings reveal that the number of pseudo-observations required to accurately approximate the exact Gibbs measure depends on the rates at which the bias and variance of the estimated loss converge to zero. These results are particularly consequential for the emerging field of generalised Bayesian inference, for estimated intractable likelihoods, and for biased pseudo-marginal approaches. We apply our results to three Gibbs measures that have been proposed to deal with intractable likelihoods and model misspecification."}, "https://arxiv.org/abs/2404.15654": {"title": "Autoregressive Networks with Dependent Edges", "link": "https://arxiv.org/abs/2404.15654", "description": "arXiv:2404.15654v1 Announce Type: cross \nAbstract: We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models which accommodate, for example, transitivity, density dependence, and other stylized features often observed in real network data.
By assuming that the edges of the network at each time are conditionally independent given their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and maximum likelihood estimation in a straightforward manner. Due to the possibly large number of parameters in the models, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed based on a projection-based iteration which mitigates the impact of the other parameters (Chang et al., 2021, 2023). Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. Illustration with a transitivity model is carried out using both simulated data and a real network data set."}, "https://arxiv.org/abs/2404.15764": {"title": "Assessment of the quality of a prediction", "link": "https://arxiv.org/abs/2404.15764", "description": "arXiv:2404.15764v1 Announce Type: cross \nAbstract: Shannon defined the mutual information between two variables. We illustrate why the true mutual information between a variable and the predictions made by a prediction algorithm is not a suitable measure of prediction quality, but the apparent Shannon mutual information (ASI) is; indeed it is the unique prediction quality measure with either of two very different lists of desirable properties, as previously shown by de Finetti and other authors. However, estimating the uncertainty of the ASI is a difficult problem, because of the long and non-symmetric heavy tails of the distribution of the individual values of $j(x,y)=\\log\\frac{Q_y(x)}{P(x)}$. We propose a Bayesian modelling method for the distribution of $j(x,y)$, from the posterior distribution of which the uncertainty in the ASI can be inferred. This method is based on Dirichlet-based mixtures of skew-Student distributions. We illustrate its use on data from a Bayesian model for prediction of the recurrence time of prostate cancer. We believe that this approach is generally appropriate for most problems, where it is infeasible to derive the explicit distribution of the samples of $j(x,y)$, though the precise modelling parameters may need adjustment to suit particular cases."}, "https://arxiv.org/abs/2404.15843": {"title": "Large-sample theory for inferential models: a possibilistic Bernstein--von Mises theorem", "link": "https://arxiv.org/abs/2404.15843", "description": "arXiv:2404.15843v1 Announce Type: cross \nAbstract: The inferential model (IM) framework offers alternatives to the familiar probabilistic (e.g., Bayesian and fiducial) uncertainty quantification in statistical inference. Allowing this uncertainty quantification to be imprecise makes it possible to achieve exact validity and reliability. But are imprecision and exact validity compatible with attainment of the classical notions of statistical efficiency? The present paper offers an affirmative answer to this question via a new possibilistic Bernstein--von Mises theorem that parallels a fundamental result in Bayesian inference.
Among other things, our result demonstrates that the IM solution is asymptotically efficient in the sense that its asymptotic credal set is the smallest that contains the Gaussian distribution whose variance agrees with the Cramer--Rao lower bound."}, "https://arxiv.org/abs/2404.15967": {"title": "Interpretable clustering with the Distinguishability criterion", "link": "https://arxiv.org/abs/2404.15967", "description": "arXiv:2404.15967v1 Announce Type: cross \nAbstract: Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remains an outstanding problem. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implementation of the Distinguishability criterion corresponds to the Bayes risk of a randomized classifier under the 0-1 loss. We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures, such as hierarchical clustering, k-means, and finite mixture models. We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications."}, "https://arxiv.org/abs/2106.07096": {"title": "Tests for partial correlation between repeatedly observed nonstationary nonlinear timeseries", "link": "https://arxiv.org/abs/2106.07096", "description": "arXiv:2106.07096v2 Announce Type: replace \nAbstract: We describe two families of statistical tests to detect partial correlation in vectorial timeseries. The tests measure whether an observed timeseries Y can be predicted from a second series X, even after accounting for a third series Z which may correlate with X. They do not make any assumptions on the nature of these timeseries, such as stationarity or linearity, but they do require that multiple statistically independent recordings of the 3 series are available. Intuitively, the tests work by asking if the series Y recorded on one experiment can be better predicted from X recorded on the same experiment than on a different experiment, after accounting for the prediction from Z recorded on both experiments."}, "https://arxiv.org/abs/2204.06960": {"title": "The replication of equivalence studies", "link": "https://arxiv.org/abs/2204.06960", "description": "arXiv:2204.06960v3 Announce Type: replace \nAbstract: Replication studies are increasingly conducted to assess the credibility of scientific findings. Most of these replication attempts target studies with a superiority design, but there is a lack of methodology regarding the analysis of replication studies with alternative types of designs, such as equivalence. In order to fill this gap, we propose two approaches, the two-trials rule and the sceptical TOST procedure, adapted from methods used in superiority settings. Both methods have the same overall Type-I error rate, but the sceptical TOST procedure allows replication success even for non-significant original or replication studies. This leads to a larger project power and other differences in relevant operating characteristics. Both methods can be used for sample size calculation of the replication study, based on the results from the original one. 
The two methods are applied to data from the Reproducibility Project: Cancer Biology."}, "https://arxiv.org/abs/2305.11126": {"title": "More powerful multiple testing under dependence via randomization", "link": "https://arxiv.org/abs/2305.11126", "description": "arXiv:2305.11126v3 Announce Type: replace \nAbstract: We show that two procedures for false discovery rate (FDR) control -- the Benjamini-Yekutieli procedure for dependent p-values, and the e-Benjamini-Hochberg procedure for dependent e-values -- can both be made more powerful by a simple randomization involving one independent uniform random variable. As a corollary, the Hommel test under arbitrary dependence is also improved. Importantly, our randomized improvements are never worse than the originals and are typically strictly more powerful, with marked improvements in simulations. The same technique also improves essentially every other multiple testing procedure based on e-values."}, "https://arxiv.org/abs/2306.01086": {"title": "Multi-Study R-Learner for Estimating Heterogeneous Treatment Effects Across Studies Using Statistical Machine Learning", "link": "https://arxiv.org/abs/2306.01086", "description": "arXiv:2306.01086v3 Announce Type: replace \nAbstract: Estimating heterogeneous treatment effects (HTEs) is crucial for precision medicine. While multiple studies can improve the generalizability of results, leveraging them for estimation is statistically challenging. Existing approaches often assume identical HTEs across studies, but this may be violated due to various sources of between-study heterogeneity, including differences in study design, study populations, and data collection protocols, among others. To this end, we propose a framework for multi-study HTE estimation that accounts for between-study heterogeneity in the nuisance functions and treatment effects. Our approach, the multi-study R-learner, extends the R-learner to obtain principled statistical estimation with machine learning (ML) in the multi-study setting. It involves a data-adaptive objective function that links study-specific treatment effects with nuisance functions through membership probabilities, which enable information to be borrowed across potentially heterogeneous studies. The multi-study R-learner framework can combine data from randomized controlled trials, observational studies, or a combination of both. It's easy to implement and flexible in its ability to incorporate ML for estimating HTEs, nuisance functions, and membership probabilities. In the series estimation framework, we show that the multi-study R-learner is asymptotically normal and more efficient than the R-learner when there is between-study heterogeneity in the propensity score model under homoscedasticity. We illustrate using cancer data that the proposed method performs favorably compared to existing approaches in the presence of between-study heterogeneity."}, "https://arxiv.org/abs/2306.11979": {"title": "Qini Curves for Multi-Armed Treatment Rules", "link": "https://arxiv.org/abs/2306.11979", "description": "arXiv:2306.11979v3 Announce Type: replace \nAbstract: Qini curves have emerged as an attractive and popular approach for evaluating the benefit of data-driven targeting rules for treatment allocation. We propose a generalization of the Qini curve to multiple costly treatment arms, that quantifies the value of optimally selecting among both units and treatment arms at different budget levels. 
We develop an efficient algorithm for computing these curves and propose bootstrap-based confidence intervals that are exact in large samples for any point on the curve. These confidence intervals can be used to conduct hypothesis tests comparing the value of treatment targeting using an optimal combination of arms with using just a subset of arms, or with a non-targeting assignment rule ignoring covariates, at different budget levels. We demonstrate the statistical performance in a simulation experiment and an application to treatment targeting for election turnout."}, "https://arxiv.org/abs/2308.04420": {"title": "Contour Location for Reliability in Airfoil Simulation Experiments using Deep Gaussian Processes", "link": "https://arxiv.org/abs/2308.04420", "description": "arXiv:2308.04420v2 Announce Type: replace \nAbstract: Bayesian deep Gaussian processes (DGPs) outperform ordinary GPs as surrogate models of complex computer experiments when response surface dynamics are non-stationary, which is especially prevalent in aerospace simulations. Yet DGP surrogates have not been deployed for the canonical downstream task in that setting: reliability analysis through contour location (CL). In that context, we are motivated by a simulation of an RAE-2822 transonic airfoil which demarcates efficient and inefficient flight conditions. Level sets separating passable versus failable operating conditions are best learned through strategic sequential design. There are two limitations to modern CL methodology which hinder DGP integration in this setting. First, derivative-based optimization underlying acquisition functions is thwarted by sampling-based Bayesian (i.e., MCMC) inference, which is essential for DGP posterior integration. Second, canonical acquisition criteria, such as entropy, are famously myopic to the extent that optimization may even be undesirable. Here we tackle both of these limitations at once, proposing a hybrid criterion that explores along the Pareto front of entropy and (predictive) uncertainty, requiring evaluation only at strategically located \"triangulation\" candidates. We showcase DGP CL performance in several synthetic benchmark exercises and on the RAE-2822 airfoil."}, "https://arxiv.org/abs/2308.12181": {"title": "Consistency of common spatial estimators under spatial confounding", "link": "https://arxiv.org/abs/2308.12181", "description": "arXiv:2308.12181v2 Announce Type: replace \nAbstract: This paper addresses the asymptotic performance of popular spatial regression estimators on the task of estimating the linear effect of an exposure on an outcome under \"spatial confounding\" -- the presence of an unmeasured spatially-structured variable influencing both the exposure and the outcome. The existing literature on spatial confounding is informal and inconsistent; this paper is an attempt to bring clarity through rigorous results on the asymptotic bias and consistency of estimators from popular spatial regression models. We consider two data generation processes: one where the confounder is a fixed function of space and one where it is a random function (i.e., a stochastic process on the spatial domain). We first show that the estimators from ordinary least squares (OLS) and restricted spatial regression are asymptotically biased under spatial confounding. 
We then prove a novel main result on the consistency of the generalized least squares (GLS) estimator using a Gaussian process (GP) covariance matrix in the presence of spatial confounding under in-fill (fixed domain) asymptotics. The result holds under very general conditions -- for any exposure with some non-spatial variation (noise), for any spatially continuous confounder, for any choice of Mat\\'ern or square exponential Gaussian process covariance used to construct the GLS estimator, and without requiring Gaussianity of errors. Finally, we prove that spatial estimators from GLS, GP regression, and spline models that are consistent under confounding by a fixed function will also be consistent under confounding by a random function. We conclude that, contrary to much of the literature on spatial confounding, traditional spatial estimators are capable of estimating linear exposure effects under spatial confounding in the presence of some noise in the exposure. We support our theoretical arguments with simulation studies."}, "https://arxiv.org/abs/2311.16451": {"title": "Variational Inference for the Latent Shrinkage Position Model", "link": "https://arxiv.org/abs/2311.16451", "description": "arXiv:2311.16451v2 Announce Type: replace \nAbstract: The latent position model (LPM) is a popular method used in network data analysis where nodes are assumed to be positioned in a $p$-dimensional latent space. The latent shrinkage position model (LSPM) is an extension of the LPM which automatically determines the number of effective dimensions of the latent space via a Bayesian nonparametric shrinkage prior. However, the LSPM's reliance on Markov chain Monte Carlo for inference, while rigorous, is computationally expensive, making it challenging to scale to networks with large numbers of nodes. We introduce a variational inference approach for the LSPM, aiming to reduce computational demands while retaining the model's ability to intrinsically determine the number of effective latent dimensions. The performance of the variational LSPM is illustrated through simulation studies and its application to real-world network data. To promote wider adoption and ease of implementation, we also provide open-source code."}, "https://arxiv.org/abs/2111.09447": {"title": "Unbiased Risk Estimation in the Normal Means Problem via Coupled Bootstrap Techniques", "link": "https://arxiv.org/abs/2111.09447", "description": "arXiv:2111.09447v3 Announce Type: replace-cross \nAbstract: We develop a new approach for estimating the risk of an arbitrary estimator of the mean vector in the classical normal means problem. The key idea is to generate two auxiliary data vectors, by adding carefully constructed normal noise vectors to the original data. We then train the estimator of interest on the first auxiliary vector and test it on the second. In order to stabilize the risk estimate, we average this procedure over multiple draws of the synthetic noise vector. A key aspect of this coupled bootstrap (CB) approach is that it delivers an unbiased estimate of risk under no assumptions on the estimator of the mean vector, albeit for a modified and slightly \"harder\" version of the original problem, where the noise variance is elevated. We prove that, under the assumptions required for the validity of Stein's unbiased risk estimator (SURE), a limiting version of the CB estimator recovers SURE exactly.
We then analyze a bias-variance decomposition of the error of the CB estimator, which elucidates the effects of the variance of the auxiliary noise and the number of bootstrap samples on the accuracy of the estimator. Lastly, we demonstrate that the CB estimator performs favorably in various simulated experiments."}, "https://arxiv.org/abs/2207.11332": {"title": "Comparing baseball players across eras via novel Full House Modeling", "link": "https://arxiv.org/abs/2207.11332", "description": "arXiv:2207.11332v2 Announce Type: replace-cross \nAbstract: A new methodological framework suitable for era-adjusting baseball statistics is developed in this article. Within this methodological framework specific models are motivated. We call these models Full House Models. Full House Models work by balancing the achievements of Major League Baseball (MLB) players within a given season and the size of the MLB talent pool from which a player came. We demonstrate the utility of Full House Models in an application comparing baseball players' performance statistics across eras. Our results reveal a new ranking of baseball's greatest players which includes several modern players among the top all-time players. Modern players are elevated by Full House Modeling because they come from a larger talent pool. Sensitivity and multiverse analyses, which investigate how results change with changes to modeling inputs, including the estimate of the talent pool, are presented."}, "https://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "https://arxiv.org/abs/2310.03722", "description": "arXiv:2310.03722v3 Announce Type: replace-cross \nAbstract: In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$ of a Gaussian distribution with unknown variance $\\sigma^2$. Curiously, he employed both an improper (right Haar) mixture over $\\sigma$ and an improper (flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his construction, which use generalized nonintegrable martingales and an extended Ville's inequality. While this does yield a sequential t-test, it does not yield an \"e-process\" (due to the nonintegrability of his martingale). In this paper, we develop two new e-processes and confidence sequences for the same setting: one is a test martingale in a reduced filtration, while the other is an e-process in the canonical data filtration. These are respectively obtained by swapping Lai's flat mixture for a Gaussian mixture, and swapping the right Haar mixture over $\\sigma$ with the maximum likelihood estimate under the null, as done in universal inference. We also analyze the width of resulting confidence sequences, which have a curious polynomial dependence on the error probability $\\alpha$ that we prove to be not only unavoidable, but (for universal inference) even better than the classical fixed-sample t-test. Numerical experiments are provided along the way to compare and contrast the various approaches, including some recent suboptimal ones."}, "https://arxiv.org/abs/2312.11108": {"title": "Multiple change point detection in functional data with applications to biomechanical fatigue data", "link": "https://arxiv.org/abs/2312.11108", "description": "arXiv:2312.11108v3 Announce Type: replace-cross \nAbstract: Injuries to the lower extremity joints are often debilitating, particularly for professional athletes.
Understanding the onset of stressful conditions on these joints is therefore important in order to ensure prevention of injuries as well as individualised training for enhanced athletic performance. We study the biomechanical joint angles from the hip, knee and ankle for runners who are experiencing fatigue. The data is cyclic in nature and densely collected by body-worn sensors, which makes it ideal to work with in the functional data analysis (FDA) framework.\n We develop a new method for multiple change point detection for functional data, which improves the state of the art with respect to at least two novel aspects. First, the curves are compared with respect to their maximum absolute deviation, which leads to a better interpretation of local changes in the functional data compared to classical $L^2$-approaches. Secondly, as slight aberrations are often to be expected in human movement data, our method will not detect arbitrarily small changes but hunts for relevant changes, where the maximum absolute deviation between the curves exceeds a specified threshold, say $\\Delta >0$. We recover multiple changes in a long functional time series of biomechanical knee angle data, which are larger than the desired threshold $\\Delta$, allowing us to identify changes purely due to fatigue. In this work, we analyse data from both a controlled indoor and an uncontrolled outdoor (marathon) setting."}, "https://arxiv.org/abs/2404.16166": {"title": "Double Robust Variance Estimation", "link": "https://arxiv.org/abs/2404.16166", "description": "arXiv:2404.16166v1 Announce Type: new \nAbstract: Doubly robust estimators have gained popularity in the field of causal inference due to their ability to provide consistent point estimates when either an outcome or exposure model is correctly specified. However, the influence function based variance estimator frequently used with doubly robust estimators is only consistent when both the outcome and exposure models are correctly specified. Here, use of M-estimation and the empirical sandwich variance estimator for doubly robust point and variance estimation is demonstrated. Simulation studies illustrate the properties of the influence function based and empirical sandwich variance estimators. Estimators are applied to data from the Improving Pregnancy Outcomes with Progesterone (IPOP) trial to estimate the effect of maternal anemia on birth weight among women with HIV. In the example, birth weights if all women had anemia were estimated to be lower than birth weights if no women had anemia, though estimates were imprecise. Variance estimates were more stable under varying model specifications for the empirical sandwich variance estimator than the influence function based variance estimator."}, "https://arxiv.org/abs/2404.16209": {"title": "Exploring Spatial Context: A Comprehensive Bibliography of GWR and MGWR", "link": "https://arxiv.org/abs/2404.16209", "description": "arXiv:2404.16209v1 Announce Type: new \nAbstract: Local spatial models such as Geographically Weighted Regression (GWR) and Multiscale Geographically Weighted Regression (MGWR) serve as instrumental tools to capture intrinsic contextual effects through the estimates of the local intercepts and behavioral contextual effects through estimates of the local slope parameters. GWR and MGWR provide simple implementation yet powerful frameworks that could be extended to various disciplines that handle spatial data.
This bibliography aims to serve as a comprehensive compilation of peer-reviewed papers that have utilized GWR or MGWR as a primary analytical method to conduct spatial analyses and acts as a useful guide to anyone searching the literature for previous examples of local statistical modeling in a wide variety of application fields."}, "https://arxiv.org/abs/2404.16490": {"title": "On Neighbourhood Cross Validation", "link": "https://arxiv.org/abs/2404.16490", "description": "arXiv:2404.16490v1 Announce Type: new \nAbstract: It is shown how to efficiently and accurately compute and optimize a range of cross validation criteria for a wide range of models estimated by minimizing a quadratically penalized smooth loss. Example models include generalized additive models for location scale and shape and smooth additive quantile regression. Example losses include negative log likelihoods and smooth quantile losses. Example cross validation criteria include leave-out-neighbourhood cross validation for dealing with un-modelled short range autocorrelation as well as the more familiar leave-one-out cross validation. For a $p$ coefficient model of $n$ data, estimable at $O(np^2)$ computational cost, the general $O(n^2p^2)$ cost of ordinary cross validation is reduced to $O(np^2)$, computing the cross validation criterion to $O(p^3n^{-2})$ accuracy. This is achieved by directly approximating the model coefficient estimates under data subset omission, via efficiently computed single step Newton updates of the full data coefficient estimates. Optimization of the resulting cross validation criterion, with respect to multiple smoothing/precision parameters, can be achieved efficiently using quasi-Newton optimization, adapted to deal with the indefiniteness that occurs when the optimal value for a smoothing parameter tends to infinity. The link between cross validation and the jackknife can be exploited to achieve reasonably well calibrated uncertainty quantification for the model coefficients in non standard settings such as leaving-out-neighbourhoods under residual autocorrelation or quantile regression. Several practical examples are provided, focussing particularly on dealing with un-modelled auto-correlation."}, "https://arxiv.org/abs/2404.16610": {"title": "Conformalized Ordinal Classification with Marginal and Conditional Coverage", "link": "https://arxiv.org/abs/2404.16610", "description": "arXiv:2404.16610v1 Announce Type: new \nAbstract: Conformal prediction is a general distribution-free approach for constructing prediction sets combined with any machine learning algorithm that achieve valid marginal or conditional coverage in finite samples. Ordinal classification is common in real applications where the target variable has natural ordering among the class labels. In this paper, we discuss constructing distribution-free prediction sets for such ordinal classification problems by leveraging the ideas of conformal prediction and multiple testing with FWER control. Newer conformal prediction methods are developed for constructing contiguous and non-contiguous prediction sets based on marginal and conditional (class-specific) conformal $p$-values, respectively. Theoretically, we prove that the proposed methods respectively achieve satisfactory levels of marginal and class-specific conditional coverages. 
Through a simulation study and real data analysis, the proposed methods show promising performance compared to the existing conformal method."}, "https://arxiv.org/abs/2404.16709": {"title": "Understanding Reliability from a Regression Perspective", "link": "https://arxiv.org/abs/2404.16709", "description": "arXiv:2404.16709v1 Announce Type: new \nAbstract: Reliability is an important quantification of measurement precision based on a latent variable measurement model. Inspired by McDonald (2011), we present a regression framework of reliability, placing emphasis on whether latent or observed scores serve as the regression outcome. Our theory unifies two extant perspectives of reliability: (a) classical test theory (measurement decomposition), and (b) optimal prediction of latent scores (prediction decomposition). Importantly, reliability should be treated as a property of the observed score under a measurement decomposition, but a property of the latent score under a prediction decomposition. To facilitate the evaluation and interpretation of distinct reliability coefficients for complex measurement models, we introduce a Monte Carlo approach for approximate calculation of reliability. We illustrate the proposed computational procedure with an empirical data analysis, which concerns measuring susceptibility and severity of depressive symptoms using a two-dimensional item response theory model. We conclude with a discussion on computing reliability coefficients and outline future avenues of research."}, "https://arxiv.org/abs/2404.16745": {"title": "Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness", "link": "https://arxiv.org/abs/2404.16745", "description": "arXiv:2404.16745v1 Announce Type: new \nAbstract: In the era of data explosion, statisticians have been developing interpretable and computationally efficient statistical methods to measure latent factors (e.g., skills, abilities, and personalities) using large-scale assessment data. In addition to understanding the latent information, the covariate effect on responses controlling for latent factors is also of great scientific interest and has wide applications, such as evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased toward certain individual characteristics (e.g., gender and race) taking into account their latent abilities. However, the large sample size, substantial covariate dimension, and great test length pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete types of responses, nonlinear latent factor models are often assumed, bringing further complexity to the problem. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions to address the identifiability issue. Based on the identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects under a practical yet challenging asymptotic regime. Furthermore, we derive estimation and inference results for latent factors and the factor loadings.
We illustrate the finite sample performance of the proposed method through extensive numerical studies and an application to an educational assessment dataset obtained from the Programme for International Student Assessment (PISA)."}, "https://arxiv.org/abs/2404.16746": {"title": "Estimating the Number of Components in Finite Mixture Models via Variational Approximation", "link": "https://arxiv.org/abs/2404.16746", "description": "arXiv:2404.16746v1 Announce Type: new \nAbstract: This work introduces a new method for selecting the number of components in finite mixture models (FMMs) using variational Bayes, inspired by the large-sample properties of the Evidence Lower Bound (ELBO) derived from mean-field (MF) variational approximation. Specifically, we establish matching upper and lower bounds for the ELBO without assuming conjugate priors, suggesting the consistency of model selection for FMMs based on maximizing the ELBO. As a by-product of our proof, we demonstrate that the MF approximation inherits the stable behavior (benefiting from model singularity) of the posterior distribution, which tends to eliminate the extra components under model misspecification where the number of mixture components is over-specified. This stable behavior also leads to the $n^{-1/2}$ convergence rate for parameter estimation, up to a logarithmic factor, under this model overspecification. Empirical experiments are conducted to validate our theoretical findings and to compare with other state-of-the-art methods for selecting the number of components in FMMs."}, "https://arxiv.org/abs/2404.16775": {"title": "Estimating Metocean Environments Associated with Extreme Structural Response", "link": "https://arxiv.org/abs/2404.16775", "description": "arXiv:2404.16775v1 Announce Type: new \nAbstract: Extreme value analysis (EVA) uses data to estimate long-term extreme environmental conditions for variables such as significant wave height and period, for the design of marine structures. Together with models for the short-term evolution of the ocean environment and for wave-structure interaction, EVA provides a basis for full probabilistic design analysis. Environmental contours provide an alternate approach to estimating structural integrity, without requiring structural knowledge. These contour methods also exploit statistical models, including EVA, but avoid the need for structural modelling by making what are believed to be conservative assumptions about the shape of the structural failure boundary in the environment space. These assumptions, however, may not always be appropriate, or may lead to unnecessarily wasted resources from over-design. We introduce a methodology for full probabilistic analysis to estimate the joint probability density of the environment, conditional on the occurrence of an extreme structural response, for simple structures. We use this conditional density of the environment as a basis to assess the performance of different environmental contour methods.
We demonstrate the difficulty of estimating the contour boundary in the environment space for typical data samples, as well as the dependence of the performance of the environmental contour on the structure being considered."}, "https://arxiv.org/abs/2404.16287": {"title": "Differentially Private Federated Learning: Servers Trustworthiness, Estimation, and Statistical Inference", "link": "https://arxiv.org/abs/2404.16287", "description": "arXiv:2404.16287v1 Announce Type: cross \nAbstract: Differentially private federated learning is crucial for maintaining privacy in distributed environments. This paper investigates the challenges of high-dimensional estimation and inference under the constraints of differential privacy. First, we study scenarios involving an untrusted central server, demonstrating the inherent difficulties of accurate estimation in high-dimensional problems. Our findings indicate that the tight minimax rates depend on the high dimensionality of the data even with sparsity assumptions. Second, we consider a scenario with a trusted central server and introduce a novel federated estimation algorithm tailored for linear regression models. This algorithm effectively handles the slight variations among models distributed across different machines. We also propose methods for statistical inference, including coordinate-wise confidence intervals for individual parameters and strategies for simultaneous inference. Extensive simulation experiments support our theoretical advances, underscoring the efficacy and reliability of our approaches."}, "https://arxiv.org/abs/2404.16583": {"title": "Fast Machine-Precision Spectral Likelihoods for Stationary Time Series", "link": "https://arxiv.org/abs/2404.16583", "description": "arXiv:2404.16583v1 Announce Type: cross \nAbstract: We provide in this work an algorithm for approximating a very broad class of symmetric Toeplitz matrices to machine precision in $\\mathcal{O}(n \\log n)$ time. In particular, for a Toeplitz matrix $\\mathbf{\\Sigma}$ with values $\\mathbf{\\Sigma}_{j,k} = h_{|j-k|} = \\int_{-1/2}^{1/2} e^{2 \\pi i |j-k| \\omega} S(\\omega) \\mathrm{d} \\omega$ where $S(\\omega)$ is piecewise smooth, we give an approximation $\\mathbf{\\mathcal{F}} \\mathbf{\\Sigma} \\mathbf{\\mathcal{F}}^H \\approx \\mathbf{D} + \\mathbf{U} \\mathbf{V}^H$, where $\\mathbf{\\mathcal{F}}$ is the DFT matrix, $\\mathbf{D}$ is diagonal, and the matrices $\\mathbf{U}$ and $\\mathbf{V}$ are in $\\mathbb{C}^{n \\times r}$ with $r \\ll n$. Studying these matrices in the context of time series, we offer a theoretical explanation of this structure and connect it to existing spectral-domain approximation frameworks. We then give a complete discussion of the numerical method for assembling the approximation and demonstrate its efficiency for improving Whittle-type likelihood approximations, including dramatic examples where a correction of rank $r = 2$ to the standard Whittle approximation increases the accuracy from $3$ to $14$ digits for a matrix $\\mathbf{\\Sigma} \\in \\mathbb{R}^{10^5 \\times 10^5}$. The method and analysis of this work apply well beyond time series analysis, providing an algorithm for extremely accurate direct solves with a wide variety of symmetric Toeplitz matrices.
The analysis employed here largely depends on asymptotic expansions of oscillatory integrals, and also provides a new perspective on when existing spectral-domain approximation methods for Gaussian log-likelihoods can be particularly problematic."}, "https://arxiv.org/abs/2212.09833": {"title": "Direct covariance matrix estimation with compositional data", "link": "https://arxiv.org/abs/2212.09833", "description": "arXiv:2212.09833v2 Announce Type: replace \nAbstract: Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject's gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. Through simulation studies, we demonstrate that our estimator can outperform existing estimators. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with chronic fatigue syndrome."}, "https://arxiv.org/abs/2303.03521": {"title": "Bayesian Variable Selection for Function-on-Scalar Regression Models: a comparative analysis", "link": "https://arxiv.org/abs/2303.03521", "description": "arXiv:2303.03521v4 Announce Type: replace \nAbstract: In this work, we developed a new Bayesian method for variable selection in function-on-scalar regression (FOSR). Our method uses a hierarchical Bayesian structure and latent variables to enable an adaptive covariate selection process for FOSR. Extensive simulation studies show the proposed method's main properties, such as its accuracy in estimating the coefficients and high capacity to select variables correctly. Furthermore, we conducted a substantial comparative analysis with the main competing methods, the BGLSS (Bayesian Group Lasso with Spike and Slab prior) method, the group LASSO (Least Absolute Shrinkage and Selection Operator), the group MCP (Minimax Concave Penalty), and the group SCAD (Smoothly Clipped Absolute Deviation). Our results demonstrate that the proposed methodology is superior in correctly selecting covariates compared with the existing competing methods while maintaining a satisfactory level of goodness of fit. In contrast, the competing methods could not balance selection accuracy with goodness of fit. We also considered a COVID-19 dataset and some socioeconomic data from Brazil as an application and obtained satisfactory results. 
In short, the proposed Bayesian variable selection model is highly competitive, showing significant predictive and selective quality."}, "https://arxiv.org/abs/2304.02339": {"title": "Combining experimental and observational data through a power likelihood", "link": "https://arxiv.org/abs/2304.02339", "description": "arXiv:2304.02339v2 Announce Type: replace \nAbstract: Randomized controlled trials are the gold standard for causal inference and play a pivotal role in modern evidence-based medicine. However, the sample sizes they use are often too limited to draw significant causal conclusions for subgroups that are less prevalent in the population. In contrast, observational data are becoming increasingly accessible in large volumes but can be subject to bias as a result of hidden confounding. Given these complementary features, we propose a power likelihood approach to augmenting RCTs with observational data to improve the efficiency of treatment effect estimation. We provide a data-adaptive procedure for maximizing the expected log predictive density (ELPD) to select the learning rate that best regulates the information from the observational data. We validate our method through a simulation study that shows increased power while maintaining an approximate nominal coverage rate. Finally, we apply our method in a real-world data fusion study augmenting the PIONEER 6 clinical trial with a US health claims dataset, demonstrating the effectiveness of our method and providing detailed guidance on how to address practical considerations in its application."}, "https://arxiv.org/abs/2307.04225": {"title": "Copula-like inference for discrete bivariate distributions with rectangular supports", "link": "https://arxiv.org/abs/2307.04225", "description": "arXiv:2307.04225v3 Announce Type: replace \nAbstract: After reviewing a large body of literature on the modeling of bivariate discrete distributions with finite support, \\cite{Gee20} made a compelling case for the use of $I$-projections in the sense of \\cite{Csi75} as a sound way to attempt to decompose a bivariate probability mass function (p.m.f.) into its two univariate margins and a bivariate p.m.f.\\ with uniform margins playing the role of a discrete copula. From a practical perspective, the necessary $I$-projections on Fr\\'echet classes can be carried out using the iterative proportional fitting procedure (IPFP), also known as Sinkhorn's algorithm or matrix scaling in the literature. After providing conditions under which a bivariate p.m.f.\\ can be decomposed in the aforementioned sense, we investigate, for starting bivariate p.m.f.s with rectangular supports, nonparametric and parametric estimation procedures as well as goodness-of-fit tests for the underlying discrete copula. Related asymptotic results are provided and build upon a differentiability result for $I$-projections on Fr\\'echet classes which can be of independent interest. Theoretical results are complemented by finite-sample experiments and a data example."}, "https://arxiv.org/abs/2311.03769": {"title": "Nonparametric Screening for Additive Quantile Regression in Ultra-high Dimension", "link": "https://arxiv.org/abs/2311.03769", "description": "arXiv:2311.03769v2 Announce Type: replace \nAbstract: In practical applications, one often does not know the \"true\" structure of the underlying conditional quantile function, especially in the ultra-high dimensional setting. 
To deal with ultra-high dimensionality, quantile-adaptive marginal nonparametric screening methods have been recently developed. However, these approaches may miss important covariates that are marginally independent of the response, or may select unimportant covariates due to their high correlations with important covariates. To mitigate such shortcomings, we develop a conditional nonparametric quantile screening procedure (complemented by subsequent selection) for nonparametric additive quantile regression models. Under some mild conditions, we show that the proposed screening method can identify all relevant covariates in a small number of steps with probability approaching one. The subsequent narrowed best subset (via a modified Bayesian information criterion) also contains all the relevant covariates with overwhelming probability. The advantages of our proposed procedure are demonstrated through simulation studies and a real data example."}, "https://arxiv.org/abs/2312.05985": {"title": "Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions", "link": "https://arxiv.org/abs/2312.05985", "description": "arXiv:2312.05985v2 Announce Type: replace \nAbstract: To address the bias of the canonical two-way fixed effects estimator for difference-in-differences under staggered adoptions, Wooldridge (2021) proposed the extended two-way fixed effects estimator, which adds many parameters. However, this reduces efficiency. Restricting some of these parameters to be equal (for example, subsequent treatment effects within a cohort) helps, but ad hoc restrictions may reintroduce bias. We propose a machine learning estimator with a single tuning parameter, fused extended two-way fixed effects (FETWFE), that enables automatic data-driven selection of these restrictions. We prove that under an appropriate sparsity assumption FETWFE identifies the correct restrictions with probability tending to one, which improves efficiency. We also prove the consistency, oracle property, and asymptotic normality of FETWFE for several classes of heterogeneous marginal treatment effect estimators under either conditional or marginal parallel trends, and we prove the same results for conditional average treatment effects under conditional parallel trends. We demonstrate FETWFE in simulation studies and an empirical application."}, "https://arxiv.org/abs/2303.07854": {"title": "Empirical Bayes inference in sparse high-dimensional generalized linear models", "link": "https://arxiv.org/abs/2303.07854", "description": "arXiv:2303.07854v2 Announce Type: replace-cross \nAbstract: High-dimensional linear models have been widely studied, but the developments in high-dimensional generalized linear models, or GLMs, have been slower. In this paper, we propose an empirical or data-driven prior leading to an empirical Bayes posterior distribution which can be used for estimation of and inference on the coefficient vector in a high-dimensional GLM, as well as for variable selection. We prove that our proposed posterior concentrates around the true/sparse coefficient vector at the optimal rate, provide conditions under which the posterior can achieve variable selection consistency, and prove a Bernstein--von Mises theorem that implies asymptotically valid uncertainty quantification. 
Computation of the proposed empirical Bayes posterior is simple and efficient, and is shown to perform well in simulations compared to existing Bayesian and non-Bayesian methods in terms of estimation and variable selection."}, "https://arxiv.org/abs/2306.05571": {"title": "Heterogeneity-aware integrative regression for ancestry-specific association studies", "link": "https://arxiv.org/abs/2306.05571", "description": "arXiv:2306.05571v2 Announce Type: replace-cross \nAbstract: Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. In order to improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model which makes the objective function convex and the penalty scale invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. Our method provides a substantial improvement in protein expression prediction accuracy in individuals of African ancestry, and in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population."}, "https://arxiv.org/abs/2311.02610": {"title": "An adaptive standardisation methodology for Day-Ahead electricity price forecasting", "link": "https://arxiv.org/abs/2311.02610", "description": "arXiv:2311.02610v2 Announce Type: replace-cross \nAbstract: The study of Day-Ahead prices in the electricity market is one of the most popular problems in time series forecasting. Previous research has focused on employing increasingly complex learning algorithms to capture the sophisticated dynamics of the market. However, there is a threshold where increased complexity fails to yield substantial improvements. In this work, we propose an alternative approach by introducing an adaptive standardisation to mitigate the effects of dataset shifts that commonly occur in the market. By doing so, learning algorithms can prioritize uncovering the true relationship between the target variable and the explanatory variables. We investigate five distinct markets, including two novel datasets, previously unexplored in the literature. These datasets provide a more realistic representation of the current market context, that conventional datasets do not show. The results demonstrate a significant improvement across all five markets using the widely accepted learning algorithms in the literature (LEAR and DNN). In particular, the combination of the proposed methodology with the methodology previously presented in the literature obtains the best results. 
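The adaptive standardisation in the day-ahead price abstract (arXiv:2311.02610) is not specified in detail here; as a rough, hedged illustration of the general idea only, the sketch below standardises each observation with statistics from a trailing window and inverts the transform on forecasts. The one-week window and the plain rolling mean/std are assumptions of this sketch, not the paper's procedure.

```python
import numpy as np
import pandas as pd

def rolling_standardise(prices: pd.Series, window: int = 168):
    """Standardise each observation with the mean/std of the preceding window
    (one week of hourly data by default), so the forecasting model sees a
    locally rescaled target; returns the transformed series plus the
    statistics needed to undo the transform on forecasts."""
    mu = prices.shift(1).rolling(window).mean()
    sigma = prices.shift(1).rolling(window).std()
    z = (prices - mu) / sigma
    return z, mu, sigma

def invert(z_hat: pd.Series, mu: pd.Series, sigma: pd.Series) -> pd.Series:
    """Map forecasts back to the original price scale."""
    return z_hat * sigma + mu

# Made-up hourly prices purely for shape; a real application would use market data.
idx = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
prices = pd.Series(50 + 10 * np.sin(np.arange(len(idx)) * 2 * np.pi / 24)
                   + np.random.default_rng(0).normal(0, 3, len(idx)), index=idx)
z, mu, sigma = rolling_standardise(prices)
```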
This significant advancement unveils new lines of research in this field, highlighting the potential of adaptive transformations in enhancing the performance of forecasting models."}, "https://arxiv.org/abs/2403.14713": {"title": "Auditing Fairness under Unobserved Confounding", "link": "https://arxiv.org/abs/2403.14713", "description": "arXiv:2403.14713v2 Announce Type: replace-cross \nAbstract: The presence of inequity is a fundamental problem in the outcomes of decision-making systems, especially when human lives are at stake. Yet, estimating notions of unfairness or inequity is difficult, particularly if they rely on hard-to-measure concepts such as risk. Such measurements of risk can be accurately obtained when no unobserved confounders have jointly influenced past decisions and outcomes. However, in the real world, this assumption rarely holds. In this paper, we show a surprising result that one can still give meaningful bounds on treatment rates to high-risk individuals, even when entirely eliminating or relaxing the assumption that all relevant risk factors are observed. We use the fact that in many real-world settings (e.g., the release of a new treatment) we have data from prior to any allocation to derive unbiased estimates of risk. This result is of immediate practical interest: we can audit unfair outcomes of existing decision-making systems in a principled manner. For instance, in a real-world study of Paxlovid allocation, our framework provably identifies that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates."}, "https://arxiv.org/abs/2404.16961": {"title": "On the testability of common trends in panel data without placebo periods", "link": "https://arxiv.org/abs/2404.16961", "description": "arXiv:2404.16961v1 Announce Type: new \nAbstract: We demonstrate and discuss the testability of the common trend assumption imposed in Difference-in-Differences (DiD) estimation in panel data when not relying on multiple pre-treatment periods for running placebo tests. Our testing approach involves two steps: (i) constructing a control group of non-treated units whose pre-treatment outcome distribution matches that of treated units, and (ii) verifying if this control group and the original non-treated group share the same time trend in average outcomes. Testing is motivated by the fact that in several (but not all) panel data models, a common trend violation across treatment groups implies and is implied by a common trend violation across pre-treatment outcomes. For this reason, the test verifies a sufficient, but (depending on the model) not necessary condition for DiD-based identification. We investigate the finite sample performance of a testing procedure that is based on double machine learning, which permits controlling for covariates in a data-driven manner, in a simulation study and also apply it to labor market data from the National Supported Work Demonstration."}, "https://arxiv.org/abs/2404.17019": {"title": "Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules", "link": "https://arxiv.org/abs/2404.17019", "description": "arXiv:2404.17019v1 Announce Type: new \nAbstract: A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. 
In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman's approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman's repeated sampling framework is as relevant for causal inference today as it has been since its inception."}, "https://arxiv.org/abs/2404.17049": {"title": "Overidentification in Shift-Share Designs", "link": "https://arxiv.org/abs/2404.17049", "description": "arXiv:2404.17049v1 Announce Type: new \nAbstract: This paper studies the testability of identifying restrictions commonly employed to assign a causal interpretation to two stage least squares (TSLS) estimators based on Bartik instruments. For homogeneous effects models applied to short panels, our analysis yields testable implications previously noted in the literature for the two major available identification strategies. We propose overidentification tests for these restrictions that remain valid in high dimensional regimes and are robust to heteroskedasticity and clustering. We further show that homogeneous effect models in short panels, and their corresponding overidentification tests, are of central importance by establishing that: (i) In heterogeneous effects models, interpreting TSLS as a positively weighted average of treatment effects can impose implausible assumptions on the distribution of the data; and (ii) Alternative identifying strategies relying on long panels can prove uninformative in short panel applications. We highlight the empirical relevance of our results by examining the viability of Bartik instruments for identifying the effect of rising Chinese import competition on US local labor markets."}, "https://arxiv.org/abs/2404.17181": {"title": "Consistent information criteria for regularized regression and loss-based learning problems", "link": "https://arxiv.org/abs/2404.17181", "description": "arXiv:2404.17181v1 Announce Type: new \nAbstract: Many problems in statistics and machine learning can be formulated as model selection problems, where the goal is to choose an optimal parsimonious model among a set of candidate models. It is typical to conduct model selection by penalizing the objective function via information criteria (IC), as with the pioneering work by Akaike and Schwarz. In recent work, a generalized IC framework was proposed for consistent estimation in general loss-based learning problems. Here, we propose a consistent estimation method for Generalized Linear Model (GLM) regressions by utilizing these recent IC developments. We further advance the generalized IC framework by considering model selection problems in which the model set consists of a potentially uncountable set of models. 
In addition to theoretical expositions, our proposal introduces a computational procedure for the implementation of our methods in the finite sample setting, which we demonstrate via an extensive simulation study."}, "https://arxiv.org/abs/2404.17380": {"title": "Correspondence analysis: handling cell-wise outliers via the reconstitution algorithm", "link": "https://arxiv.org/abs/2404.17380", "description": "arXiv:2404.17380v1 Announce Type: new \nAbstract: Correspondence analysis (CA) is a popular technique to visualize the relationship between two categorical variables. CA uses the data from a two-way contingency table and is affected by the presence of outliers. The supplementary points method is a popular method to handle outliers. Its disadvantage is that the information from entire rows or columns is removed. However, outliers can be caused by cells only. In this paper, a reconstitution algorithm is introduced to cope with such cells. This algorithm can reduce the contribution of cells in CA instead of deleting entire rows or columns. Thus the remaining information in the row and column involved can be used in the analysis. The reconstitution algorithm is compared with two alternative methods for handling outliers, the supplementary points method and MacroPCA. It is shown that the proposed strategy works well."}, "https://arxiv.org/abs/2404.17464": {"title": "Bayesian Federated Inference for Survival Models", "link": "https://arxiv.org/abs/2404.17464", "description": "arXiv:2404.17464v1 Announce Type: new \nAbstract: In cancer research, overall survival and progression free survival are often analyzed with the Cox model. To estimate accurately the parameters in the model, sufficient data and, more importantly, sufficient events need to be observed. In practice, this is often a problem. Merging data sets from different medical centers may help, but this is not always possible due to strict privacy legislation and logistic difficulties. Recently, the Bayesian Federated Inference (BFI) strategy for generalized linear models was proposed. With this strategy the statistical analyses are performed in the local centers where the data were collected (or stored) and only the inference results are combined to a single estimated model; merging data is not necessary. The BFI methodology aims to compute from the separate inference results in the local centers what would have been obtained if the analysis had been based on the merged data sets. In this paper we generalize the BFI methodology as initially developed for generalized linear models to survival models. Simulation studies and real data analyses show excellent performance; i.e., the results obtained with the BFI methodology are very similar to the results obtained by analyzing the merged data. An R package for doing the analyses is available."}, "https://arxiv.org/abs/2404.17482": {"title": "A comparison of the discrimination performance of lasso and maximum likelihood estimation in logistic regression model", "link": "https://arxiv.org/abs/2404.17482", "description": "arXiv:2404.17482v1 Announce Type: new \nAbstract: Logistic regression is widely used in many areas of knowledge. Several works compare the performance of lasso and maximum likelihood estimation in logistic regression. However, part of these works do not perform simulation studies and the remaining ones do not consider scenarios in which the ratio of the number of covariates to sample size is high. 
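As background for the correspondence analysis abstract (arXiv:2404.17380): the reconstitution algorithm is the paper's contribution and is not reproduced here, but the classical CA decomposition it builds on can be sketched as an SVD of the standardized residuals of the contingency table (the counts below are made up).

```python
import numpy as np

def correspondence_analysis(N):
    """Classical CA of a two-way contingency table N: returns principal
    row/column coordinates from the SVD of the standardized residuals."""
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = U * sv / np.sqrt(r)[:, None]      # principal row coordinates
    G = Vt.T * sv / np.sqrt(c)[:, None]   # principal column coordinates
    return F, G, sv**2                    # coordinates and principal inertias

# Toy 3x4 table (made-up counts)
N = np.array([[20,  5, 10,  3],
              [ 4, 25,  6,  8],
              [ 7,  9, 30, 12]], dtype=float)
F, G, inertias = correspondence_analysis(N)
```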
In this work, we compare the discrimination performance of lasso and maximum likelihood estimation in logistic regression using simulation studies and applications. Variable selection is done both by lasso and by stepwise when maximum likelihood estimation is used. We consider a wide range of values for the ratio of the number of covariates to sample size. The main conclusion of the work is that lasso has a better discrimination performance than maximum likelihood estimation when the ratio of the number of covariates to sample size is high."}, "https://arxiv.org/abs/2404.17561": {"title": "Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems", "link": "https://arxiv.org/abs/2404.17561", "description": "arXiv:2404.17561v1 Announce Type: new \nAbstract: We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set."}, "https://arxiv.org/abs/2404.17562": {"title": "Boosting e-BH via conditional calibration", "link": "https://arxiv.org/abs/2404.17562", "description": "arXiv:2404.17562v1 Announce Type: new \nAbstract: The e-BH procedure is an e-value-based multiple testing procedure that provably controls the false discovery rate (FDR) under any dependence structure between the e-values. Despite this appealing theoretical FDR control guarantee, the e-BH procedure often suffers from low power in practice. In this paper, we propose a general framework that boosts the power of e-BH without sacrificing its FDR control under arbitrary dependence. This is achieved by the technique of conditional calibration, where we take as input the e-values and calibrate them to be a set of \"boosted e-values\" that are guaranteed to be no less -- and are often more -- powerful than the original ones. Our general framework is explicitly instantiated in three classes of multiple testing problems: (1) testing under parametric models, (2) conditional independence testing under the model-X setting, and (3) model-free conformalized selection. Extensive numerical experiments show that our proposed method significantly improves the power of e-BH while continuing to control the FDR. 
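For context on the e-BH abstract (arXiv:2404.17562), the base (unboosted) e-BH procedure it strengthens is short enough to state directly; the conditional-calibration boosting proposed in the paper is not attempted in this sketch.

```python
import numpy as np

def e_bh(e_values, alpha=0.1):
    """Base e-BH: with m e-values, reject the hypotheses with the k largest
    e-values, where k is the largest index such that the k-th largest e-value
    is at least m / (alpha * k). This controls FDR at level alpha under
    arbitrary dependence for valid e-values."""
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e)                 # indices sorted by decreasing e-value
    sorted_e = e[order]
    ks = np.arange(1, m + 1)
    passing = np.nonzero(sorted_e >= m / (alpha * ks))[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing.max() + 1                  # number of rejections
    return np.sort(order[:k])

# Example with made-up e-values: rejects the hypotheses with e-values 80 and 25
print(e_bh([0.5, 80.0, 3.0, 25.0, 1.2], alpha=0.1))
```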
We also demonstrate the effectiveness of our method through an application to an observational study dataset for identifying individuals whose counterfactuals satisfy certain properties."}, "https://arxiv.org/abs/2404.17211": {"title": "Pseudo-Observations and Super Learner for the Estimation of the Restricted Mean Survival Time", "link": "https://arxiv.org/abs/2404.17211", "description": "arXiv:2404.17211v1 Announce Type: cross \nAbstract: In the context of right-censored data, we study the problem of predicting the restricted time to event based on a set of covariates. Under a quadratic loss, this problem is equivalent to estimating the conditional Restricted Mean Survival Time (RMST). To that aim, we propose a flexible and easy-to-use ensemble algorithm that combines pseudo-observations and super learner. The classical theoretical results of the super learner are extended to right-censored data, using a new definition of pseudo-observations, the so-called split pseudo-observations. Simulation studies indicate that the split pseudo-observations and the standard pseudo-observations are similar even for small sample sizes. The method is applied to maintenance and colon cancer datasets, showing the interest of the method in practice, as compared to other prediction methods. We complement the predictions obtained from our method with our RMST-adapted risk measure, prediction intervals and variable importance measures developed in a previous work."}, "https://arxiv.org/abs/2404.17468": {"title": "On Elliptical and Inverse Elliptical Wishart distributions: Review, new results, and applications", "link": "https://arxiv.org/abs/2404.17468", "description": "arXiv:2404.17468v1 Announce Type: cross \nAbstract: This paper deals with matrix-variate distributions, from Wishart to Inverse Elliptical Wishart distributions over the set of symmetric definite positive matrices. Similar to the multivariate scenario, (Inverse) Elliptical Wishart distributions form a vast and general family of distributions, encompassing, for instance, Wishart or $t$-Wishart ones. The first objective of this study is to present a unified overview of Wishart, Inverse Wishart, Elliptical Wishart, and Inverse Elliptical Wishart distributions through their fundamental properties. This involves leveraging the stochastic representation of these distributions to establish key statistical properties of the Normalized Wishart distribution. Subsequently, this enables the computation of expectations, variances, and Kronecker moments for Elliptical Wishart and Inverse Elliptical Wishart distributions. As an illustrative application, the practical utility of these generalized Elliptical Wishart distributions is demonstrated using a real electroencephalographic dataset. This showcases their effectiveness in accurately modeling heterogeneous data."}, "https://arxiv.org/abs/2404.17483": {"title": "Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect Estimation", "link": "https://arxiv.org/abs/2404.17483", "description": "arXiv:2404.17483v1 Announce Type: cross \nAbstract: There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. 
To avoid losing such predictive feature information, existing methods learn separate feature representations using the inverse of probability weighting (IPW). However, due to the numerically unstable IPW weights, they suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator via weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Experimental results show that by effectively correcting the weight values, our method outperforms the existing ones, including traditional weighting schemes."}, "https://arxiv.org/abs/2101.00009": {"title": "Adversarial Estimation of Riesz Representers", "link": "https://arxiv.org/abs/2101.00009", "description": "arXiv:2101.00009v3 Announce Type: replace \nAbstract: Many causal parameters are linear functionals of an underlying regression. The Riesz representer is a key component in the asymptotic variance of a semiparametrically estimated linear functional. We propose an adversarial framework to estimate the Riesz representer using general function spaces. We prove a nonasymptotic mean square rate in terms of an abstract quantity called the critical radius, then specialize it for neural networks, random forests, and reproducing kernel Hilbert spaces as leading cases. Our estimators are highly compatible with targeted and debiased machine learning with sample splitting; our guarantees directly verify general conditions for inference that allow mis-specification. We also use our guarantees to prove inference without sample splitting, based on stability or complexity. Our estimators achieve nominal coverage in highly nonlinear simulations where some previous methods break down. They shed new light on the heterogeneous effects of matching grants."}, "https://arxiv.org/abs/2107.05936": {"title": "Testability of Reverse Causality Without Exogenous Variation", "link": "https://arxiv.org/abs/2107.05936", "description": "arXiv:2107.05936v2 Announce Type: replace \nAbstract: This paper shows that testability of reverse causality is possible even in the absence of exogenous variation, such as in the form of instrumental variables. Instead of relying on exogenous variation, we achieve testability by imposing relatively weak model restrictions and exploiting that a dependence of residual and purported cause is informative about the causal direction. Our main assumption is that the true functional relationship is nonlinear and that error terms are additively separable. We extend previous results by incorporating control variables and allowing heteroskedastic errors. We build on reproducing kernel Hilbert space (RKHS) embeddings of probability distributions to test conditional independence and demonstrate the efficacy in detecting the causal direction in both Monte Carlo simulations and an application to German survey data."}, "https://arxiv.org/abs/2304.09988": {"title": "The effect of estimating prevalences on the population-wise error rate", "link": "https://arxiv.org/abs/2304.09988", "description": "arXiv:2304.09988v2 Announce Type: replace \nAbstract: The population-wise error rate (PWER) is a type I error rate for clinical trials with multiple target populations. In such trials, one treatment is tested for its efficacy in each population. The PWER is defined as the probability that a randomly selected, future patient will be exposed to an inefficient treatment based on the study results. 
The PWER can be understood and computed as an average of strata-specific family-wise error rates and involves the prevalences of these strata. A major issue of this concept is that the prevalences are usually not known in practice, so that the PWER cannot be directly controlled. Instead, one could use an estimator based on the given sample, like their maximum-likelihood estimator under a multinomial distribution. In this paper we show in simulations that this does not substantially inflate the true PWER. We differentiate between the expected PWER, which is almost perfectly controlled, and study-specific values of the PWER which are conditioned on all subgroup sample sizes and vary within a narrow range. Thereby, we consider up to eight different overlapping patient populations and moderate to large sample sizes. In these settings, we also consider the maximum strata-wise family-wise error rate, which is found to be, on average, at least bounded by twice the significance level used for PWER control. Finally, we introduce an adjustment of the PWER that could be made when, by chance, no patients are recruited from a stratum, so that this stratum is not counted in PWER control. We would then reduce the PWER in order to control for multiplicity in this stratum as well."}, "https://arxiv.org/abs/2305.02434": {"title": "Uncertainty Quantification and Confidence Intervals for Naive Rare-Event Estimators", "link": "https://arxiv.org/abs/2305.02434", "description": "arXiv:2305.02434v2 Announce Type: replace \nAbstract: We consider the estimation of rare-event probabilities using sample proportions output by naive Monte Carlo or collected data. Unlike using variance reduction techniques, this naive estimator does not have a priori relative efficiency guarantee. On the other hand, due to the recent surge of sophisticated rare-event problems arising in safety evaluations of intelligent systems, efficiency-guaranteed variance reduction may face implementation challenges which, coupled with the availability of computation or data collection power, motivate the use of such a naive estimator. In this paper we study the uncertainty quantification, namely the construction, coverage validity and tightness of confidence intervals, for rare-event probabilities using only sample proportions. In addition to the known normality, Wilson's and exact intervals, we investigate and compare them with two new intervals derived from Chernoff's inequality and the Berry-Esseen theorem. Moreover, we generalize our results to the natural situation where sampling stops by reaching a target number of rare-event hits. Our findings show that the normality and Wilson's intervals are not always valid, but they are close to the newly developed valid intervals in terms of half-width. In contrast, the exact interval is conservative, but safely guarantees the attainment of the nominal confidence level. Our new intervals, while being more conservative than the exact interval, provide useful insights in understanding the tightness of the considered intervals."}, "https://arxiv.org/abs/2309.15983": {"title": "What To Do (and Not to Do) with Causal Panel Analysis under Parallel Trends: Lessons from A Large Reanalysis Study", "link": "https://arxiv.org/abs/2309.15983", "description": "arXiv:2309.15983v2 Announce Type: replace \nAbstract: Two-way fixed effects (TWFE) models are ubiquitous in causal panel analysis in political science. 
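The rare-event abstract (arXiv:2305.02434) compares several binomial confidence intervals; the three textbook ones it mentions (normal, Wilson, exact Clopper-Pearson) can be written down directly, as in the sketch below, while the paper's new Chernoff- and Berry-Esseen-based intervals are not reproduced.

```python
import numpy as np
from scipy import stats

def wald_interval(x, n, alpha=0.05):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = x / n
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return max(p - half, 0.0), min(p + half, 1.0)

def wilson_interval(x, n, alpha=0.05):
    """Wilson score interval."""
    p = x / n
    z = stats.norm.ppf(1 - alpha / 2)
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return centre - half, centre + half

def clopper_pearson_interval(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval via beta quantiles."""
    lo = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, n - x + 1)
    hi = 1.0 if x == n else stats.beta.ppf(1 - alpha / 2, x + 1, n - x)
    return lo, hi

# e.g. 3 rare-event hits observed in 100,000 Monte Carlo runs (made-up numbers)
x, n = 3, 100_000
for f in (wald_interval, wilson_interval, clopper_pearson_interval):
    print(f.__name__, f(x, n))
```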
However, recent methodological discussions challenge their validity in the presence of heterogeneous treatment effects (HTE) and violations of the parallel trends assumption (PTA). This burgeoning literature has introduced multiple estimators and diagnostics, leading to confusion among empirical researchers on two fronts: the reliability of existing results based on TWFE models and the current best practices. To address these concerns, we examined, replicated, and reanalyzed 37 articles from three leading political science journals that employed observational panel data with binary treatments. Using six newly introduced HTE-robust estimators, we find that although precision may be affected, the core conclusions derived from TWFE estimates largely remain unchanged. PTA violations and insufficient statistical power, however, continue to be significant obstacles to credible inferences. Based on these findings, we offer recommendations for improving practice in empirical research."}, "https://arxiv.org/abs/2312.09633": {"title": "Natural Gradient Variational Bayes without Fisher Matrix Analytic Calculation and Its Inversion", "link": "https://arxiv.org/abs/2312.09633", "description": "arXiv:2312.09633v2 Announce Type: replace \nAbstract: This paper introduces a method for efficiently approximating the inverse of the Fisher information matrix, a crucial step in achieving effective variational Bayes inference. A notable aspect of our approach is the avoidance of analytically computing the Fisher information matrix and its explicit inversion. Instead, we introduce an iterative procedure for generating a sequence of matrices that converge to the inverse of Fisher information. The natural gradient variational Bayes algorithm without analytic expression of the Fisher matrix and its inversion is provably convergent and achieves a convergence rate of order O(log s/s), with s the number of iterations. We also obtain a central limit theorem for the iterates. Implementation of our method does not require storage of large matrices, and achieves a linear complexity in the number of variational parameters. Our algorithm exhibits versatility, making it applicable across a diverse array of variational Bayes domains, including Gaussian approximation and normalizing flow Variational Bayes. We offer a range of numerical examples to demonstrate the efficiency and reliability of the proposed variational Bayes method."}, "https://arxiv.org/abs/2401.06575": {"title": "A Weibull Mixture Cure Frailty Model for High-dimensional Covariates", "link": "https://arxiv.org/abs/2401.06575", "description": "arXiv:2401.06575v2 Announce Type: replace \nAbstract: A novel mixture cure frailty model is introduced for handling censored survival data. Mixture cure models are preferable when the existence of a cured fraction among patients can be assumed. However, such models are heavily underexplored: frailty structures within cure models remain largely undeveloped, and furthermore, most existing methods do not work for high-dimensional datasets, when the number of predictors is significantly larger than the number of observations. In this study, we introduce a novel extension of the Weibull mixture cure model that incorporates a frailty component, employed to model an underlying latent population heterogeneity with respect to the outcome risk. 
Additionally, high-dimensional covariates are integrated into both the cure rate and survival part of the model, providing a comprehensive approach to employ the model in the context of high-dimensional omics data. We also perform variable selection via an adaptive elastic-net penalization, and propose a novel approach to inference using the expectation-maximization (EM) algorithm. Extensive simulation studies are conducted across various scenarios to demonstrate the performance of the model, and results indicate that our proposed method outperforms competitor models. We apply the novel approach to analyze RNAseq gene expression data from bulk breast cancer patients included in The Cancer Genome Atlas (TCGA) database. A set of prognostic biomarkers is then derived from selected genes, and subsequently validated via both functional enrichment analysis and comparison to the existing biological literature. Finally, a prognostic risk score index based on the identified biomarkers is proposed and validated by exploring the patients' survival."}, "https://arxiv.org/abs/2310.19091": {"title": "Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector", "link": "https://arxiv.org/abs/2310.19091", "description": "arXiv:2310.19091v2 Announce Type: replace-cross \nAbstract: Machine Learning (ML) systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, they still face the challenge of aligning nuanced policy objectives with the precise formalization requirements necessitated by ML models. In this paper, we aim to bridge the gap between ML model requirements and public sector decision-making by presenting a comprehensive overview of key technical challenges where disjunctions between policy goals and ML models commonly arise. We concentrate on pivotal points of the ML pipeline that connect the model to its operational environment, discussing the significance of representative training data and highlighting the importance of a model setup that facilitates effective decision-making. Additionally, we link these challenges with emerging methodological advancements, encompassing causal ML, domain adaptation, uncertainty quantification, and multi-objective optimization, illustrating the path forward for harmonizing ML and public sector objectives."}, "https://arxiv.org/abs/2404.17615": {"title": "DeepVARMA: A Hybrid Deep Learning and VARMA Model for Chemical Industry Index Forecasting", "link": "https://arxiv.org/abs/2404.17615", "description": "arXiv:2404.17615v1 Announce Type: new \nAbstract: Since the chemical industry index is one of the important indicators to measure the development of the chemical industry, forecasting it is critical for understanding the economic situation and trends of the industry. Taking the multivariable nonstationary series-synthetic material index as the main research object, this paper proposes a new prediction model: DeepVARMA, and its variants Deep-VARMA-re and DeepVARMA-en, which combine LSTM and VARMAX models. 
The new model first uses a deep learning model, such as an LSTM, to remove the trend of the target time series and to learn a representation of the endogenous variables; it then uses the VARMAX model to predict the detrended target time series with the embeddings of the endogenous variables, and finally combines the trend learned by the LSTM with the dependency structure learned by the VARMAX model to obtain the final predictive values. The experimental results show that (1) the new model achieves the best prediction accuracy by combining the LSTM encoding of the exogenous variables and the VARMAX model. (2) In multivariate non-stationary series prediction, DeepVARMA uses a phased processing strategy to show higher adaptability and accuracy compared to the traditional VARMA model as well as the machine learning models LSTM, RF and XGBoost. (3) Compared with smooth sequence prediction, the traditional VARMA and VARMAX models fluctuate more in predicting non-smooth sequences, while DeepVARMA shows more flexibility and robustness. This study provides more accurate tools and methods for future development and scientific decision-making in the chemical industry."}, "https://arxiv.org/abs/2404.17693": {"title": "A Survey Selection Correction using Nonrandom Followup with an Application to the Gender Entrepreneurship Gap", "link": "https://arxiv.org/abs/2404.17693", "description": "arXiv:2404.17693v1 Announce Type: new \nAbstract: Selection into samples undermines efforts to describe populations and to estimate relationships between variables. We develop a simple method for correcting for sample selection that explains differences in survey responses between early and late respondents through correlation between potential responses and the preference for survey response. Our method relies on researchers observing the number of data collection attempts prior to each individual's survey response rather than covariates that affect response rates without affecting potential responses. Applying our method to a survey of entrepreneurial aspirations among undergraduates at University of Wisconsin-Madison, we find suggestive evidence that both the entrepreneurial aspiration rate and the male-female gap in that rate are larger among survey respondents than in the population; we estimate the gap at 21 percentage points in the sample and 19 percentage points in the population. Our results suggest that the male-female gap in entrepreneurial aspirations arises prior to direct exposure to the labor market."}, "https://arxiv.org/abs/2404.17734": {"title": "Manipulating a Continuous Instrumental Variable in an Observational Study of Premature Babies: Algorithm, Partial Identification Bounds, and Inference under Randomization and Biased Randomization Assumptions", "link": "https://arxiv.org/abs/2404.17734", "description": "arXiv:2404.17734v1 Announce Type: new \nAbstract: Regionalization of intensive care for premature babies refers to a triage system of mothers with high-risk pregnancies to hospitals of varied capabilities based on risks faced by infants. Due to the limited capacity of high-level hospitals, which are equipped with advanced expertise to provide critical care, understanding the effect of delivering premature babies at such hospitals on infant mortality for different subgroups of high-risk mothers could facilitate the design of an efficient perinatal regionalization system. Towards answering this question, Baiocchi et al. 
(2010) proposed to strengthen an excess-travel-time-based, continuous instrumental variable (IV) in an IV-based, matched-pair design by switching focus to a smaller cohort amenable to being paired with a larger separation in the IV dose. Three elements changed with the strengthened IV: the study cohort, compliance rate and latent complier subgroup. Here, we introduce a non-bipartite, template matching algorithm that embeds data into a target, pair-randomized encouragement trial which maintains fidelity to the original study cohort while strengthening the IV. We then study randomization-based and IV-dependent, biased-randomization-based inference of partial identification bounds for the sample average treatment effect (SATE) in an IV-based matched pair design, which deviates from the usual effect ratio estimand in that the SATE is agnostic to the IV and who is matched to whom, although a strengthened IV design could narrow the partial identification bounds. Based on our proposed strengthened-IV design, we found that delivering at a high-level NICU reduced preterm babies' mortality rate compared to a low-level NICU for $81,766 \\times 2 = 163,532$ mothers and their preterm babies and the effect appeared to be minimal among non-black, low-risk mothers."}, "https://arxiv.org/abs/2404.17763": {"title": "Likelihood Based Inference in Fully and Partially Observed Exponential Family Graphical Models with Intractable Normalizing Constants", "link": "https://arxiv.org/abs/2404.17763", "description": "arXiv:2404.17763v1 Announce Type: new \nAbstract: Probabilistic graphical models that encode an underlying Markov random field are fundamental building blocks of generative modeling to learn latent representations in modern multivariate data sets with complex dependency structures. Among these, the exponential family graphical models are especially popular, given their fairly well-understood statistical properties and computational scalability to high-dimensional data based on pseudo-likelihood methods. These models have been successfully applied in many fields, such as the Ising model in statistical physics and count graphical models in genomics. Another strand of models allows some nodes to be latent, so as to allow the marginal distribution of the observable nodes to depart from exponential family to capture more complex dependence. These approaches form the basis of generative models in artificial intelligence, such as the Boltzmann machines and their restricted versions. A fundamental barrier to likelihood-based (i.e., both maximum likelihood and fully Bayesian) inference in both fully and partially observed cases is the intractability of the likelihood. The usual workaround is via adopting pseudo-likelihood based approaches, following the pioneering work of Besag (1974). The goal of this paper is to demonstrate that full likelihood based analysis of these models is feasible in a computationally efficient manner. The chief innovation lies in using a technique of Geyer (1991) to estimate the intractable normalizing constant, as well as its gradient, for intractable graphical models. 
Extensive numerical results, supporting theory and comparisons with pseudo-likelihood based approaches demonstrate the applicability of the proposed method."}, "https://arxiv.org/abs/2404.17772": {"title": "PWEXP: An R Package Using Piecewise Exponential Model for Study Design and Event/Timeline Prediction", "link": "https://arxiv.org/abs/2404.17772", "description": "arXiv:2404.17772v1 Announce Type: new \nAbstract: Parametric assumptions such as an exponential distribution are commonly used in clinical trial design and analysis. However, violations of these distributional assumptions can introduce biases in sample size and power calculations. The piecewise exponential (PWE) hazard model partitions the hazard function into segments, each with a constant hazard, and is easy to interpret and compute. Due to its piecewise property, PWE can fit a wide range of survival curves and accurately predict the future number of events and analysis time in event-driven clinical trials, thus enabling more flexible and reliable study designs. Compared with other existing approaches, the PWE model provides a superior balance of flexibility and robustness in model fitting and prediction. The proposed PWEXP package is designed for estimating and predicting PWE hazard models for right-censored data. By utilizing well-established criteria such as AIC, BIC, and cross-validation log-likelihood, the PWEXP package chooses the optimal number of change-points and determines the optimal positions of the change-points. With its good fit properties, the PWEXP package provides accurate and robust hazard estimation, which can be used for reliable power calculation at study design and timeline prediction at study conduct. The package also offers visualization functions to facilitate the interpretation of survival curve fitting results."}, "https://arxiv.org/abs/2404.17792": {"title": "A General Framework for Random Effects Models for Binary, Ordinal, Count Type and Continuous Dependent Variables Including Variable Selection", "link": "https://arxiv.org/abs/2404.17792", "description": "arXiv:2404.17792v1 Announce Type: new \nAbstract: A general random effects model is proposed that allows for continuous as well as discrete distributions of the responses. Responses can be unrestricted continuous, bounded continuous, binary, ordered categorical or given in the form of counts. The distribution of the responses is not restricted to exponential families, which is a severe restriction in generalized mixed models. Generalized mixed models use fixed distributions for responses, for example the Poisson distribution in count data, which has the disadvantage of not accounting for overdispersion. By using a response function and a thresholds function, the proposed mixed thresholds model can account for a variety of alternative distributions that often show better fits than fixed distributions used within the generalized linear model framework. A particular strength of the model is that it provides a tool for joint modeling: responses may be of different types, some discrete, others continuous. In addition to introducing the mixed thresholds model, parameter sparsity is addressed. Random effects models can contain a large number of parameters, in particular if effects have to be assumed to be measurement-specific. Methods to obtain sparser representations are proposed and illustrated. 
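To make the piecewise exponential model in the PWEXP abstract (arXiv:2404.17772) concrete, the sketch below writes out the PWE log-likelihood for right-censored data and a crude BIC comparison over a fixed grid of candidate change-points; the package itself also optimizes the change-point locations, which is omitted here, and the simulated data are purely illustrative.

```python
import numpy as np

def pwe_loglik(times, events, breaks, hazards):
    """Log-likelihood of a piecewise-constant-hazard model for right-censored
    data. `breaks` are sorted interior change-points; `hazards` has
    len(breaks) + 1 entries, one per segment."""
    edges = np.concatenate(([0.0], breaks, [np.inf]))
    h = np.asarray(hazards, dtype=float)
    # exposure of each subject to each segment = overlap of [0, t) with the segment
    exposure = np.clip(times[:, None], edges[:-1], edges[1:]) - edges[:-1]
    H = (exposure * h).sum(axis=1)                             # cumulative hazard H(t)
    seg = np.searchsorted(edges, times, side="right") - 1      # segment containing each time
    return np.sum(events * np.log(h[seg]) - H)

def fit_hazards(times, events, breaks):
    """Closed-form MLE of segment hazards: events in segment / exposure in segment."""
    edges = np.concatenate(([0.0], breaks, [np.inf]))
    exposure = (np.clip(times[:, None], edges[:-1], edges[1:]) - edges[:-1]).sum(axis=0)
    seg = np.searchsorted(edges, times, side="right") - 1
    d = np.bincount(seg, weights=events, minlength=len(edges) - 1)
    return d / exposure

# Compare 0 vs 1 change-point (at a fixed, assumed location) by BIC on simulated data
rng = np.random.default_rng(1)
times = rng.exponential(2.0, 300)
events = (times < 3.0).astype(float)
times = np.minimum(times, 3.0)                                 # administrative censoring at t = 3
for breaks in (np.array([]), np.array([1.0])):
    h = fit_hazards(times, events, breaks)
    ll = pwe_loglik(times, events, breaks, h)
    k = len(h) + len(breaks)                                   # hazards plus change-point locations
    print(len(breaks), "change-point(s), BIC =", -2 * ll + k * np.log(len(times)))
```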
The methods are shown to work in the thresholds model but could also be adapted to other modeling approaches."}, "https://arxiv.org/abs/2404.17885": {"title": "Sequential monitoring for explosive volatility regimes", "link": "https://arxiv.org/abs/2404.17885", "description": "arXiv:2404.17885v1 Announce Type: new \nAbstract: In this paper, we develop two families of sequential monitoring procedure to (timely) detect changes in a GARCH(1,1) model. Whilst our methodologies can be applied for the general analysis of changepoints in GARCH(1,1) sequences, they are in particular designed to detect changes from stationarity to explosivity or vice versa, thus allowing to check for volatility bubbles. Our statistics can be applied irrespective of whether the historical sample is stationary or not, and indeed without prior knowledge of the regime of the observations before and after the break. In particular, we construct our detectors as the CUSUM process of the quasi-Fisher scores of the log likelihood function. In order to ensure timely detection, we then construct our boundary function (exceeding which would indicate a break) by including a weighting sequence which is designed to shorten the detection delay in the presence of a changepoint. We consider two types of weights: a lighter set of weights, which ensures timely detection in the presence of changes occurring early, but not too early after the end of the historical sample; and a heavier set of weights, called Renyi weights which is designed to ensure timely detection in the presence of changepoints occurring very early in the monitoring horizon. In both cases, we derive the limiting distribution of the detection delays, indicating the expected delay for each set of weights. Our theoretical results are validated via a comprehensive set of simulations, and an empirical application to daily returns of individual stocks."}, "https://arxiv.org/abs/2404.18000": {"title": "Thinking inside the bounds: Improved error distributions for indifference point data analysis and simulation via beta regression using common discounting functions", "link": "https://arxiv.org/abs/2404.18000", "description": "arXiv:2404.18000v1 Announce Type: new \nAbstract: Standard nonlinear regression is commonly used when modeling indifference points due to its ability to closely follow observed data, resulting in a good model fit. However, standard nonlinear regression currently lacks a reasonable distribution-based framework for indifference points, which limits its ability to adequately describe the inherent variability in the data. Software commonly assumes data follow a normal distribution with constant variance. However, typical indifference points do not follow a normal distribution or exhibit constant variance. To address these limitations, this paper introduces a class of nonlinear beta regression models that offers excellent fit to discounting data and enhances simulation-based approaches. This beta regression model can accommodate popular discounting functions. This work proposes three specific advances. First, our model automatically captures non-constant variance as a function of delay. Second, our model improves simulation-based approaches since it obeys the natural boundaries of observable data, unlike the ordinary assumption of normal residuals and constant variance. Finally, we introduce a scale-location-truncation trick that allows beta regression to accommodate observed values of zero and one. 
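To illustrate the kind of model described in the beta-regression abstract (arXiv:2404.18000), the sketch below fits a beta likelihood whose mean follows the hyperbolic discounting function 1/(1 + k*delay) with a common precision parameter; the paper's scale-location-truncation device for exact 0/1 responses is not implemented, so responses are assumed to lie strictly inside (0, 1), and the data are simulated.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta as beta_dist

def neg_loglik(params, delay, y):
    """Negative log-likelihood of a beta regression with hyperbolic mean
    mu(d) = 1 / (1 + k * d) and constant precision phi (both kept positive
    via a log parameterization)."""
    k, phi = np.exp(params)
    mu = 1.0 / (1.0 + k * delay)
    a, b = mu * phi, (1.0 - mu) * phi
    return -np.sum(beta_dist.logpdf(y, a, b))

rng = np.random.default_rng(0)
delay = np.repeat([1, 7, 30, 90, 180, 365], 10).astype(float)
true_mu = 1.0 / (1.0 + 0.02 * delay)
y = rng.beta(true_mu * 20, (1 - true_mu) * 20)      # simulated indifference points in (0, 1)

res = minimize(neg_loglik, x0=np.log([0.01, 5.0]), args=(delay, y), method="Nelder-Mead")
k_hat, phi_hat = np.exp(res.x)
print("estimated discounting rate k:", k_hat, "precision:", phi_hat)
```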
A comparison between beta regression and standard nonlinear regression reveals close agreement in the estimated discounting rate k obtained from both methods."}, "https://arxiv.org/abs/2404.18197": {"title": "A General Causal Inference Framework for Cross-Sectional Observational Data", "link": "https://arxiv.org/abs/2404.18197", "description": "arXiv:2404.18197v1 Announce Type: new \nAbstract: Causal inference methods for observational data are highly regarded due to their wide applicability. While there are already numerous methods available for de-confounding bias, these methods generally assume that covariates consist solely of confounders or make naive assumptions about the covariates. Such assumptions face challenges in both theory and practice, particularly when dealing with high-dimensional covariates. Relaxing these naive assumptions and identifying the confounding covariates that truly require correction can effectively enhance the practical significance of these methods. Therefore, this paper proposes a General Causal Inference (GCI) framework specifically designed for cross-sectional observational data, which precisely identifies the key confounding covariates and provides corresponding identification algorithm. Specifically, based on progressive derivations of the Markov property on Directed Acyclic Graph, we conclude that the key confounding covariates are equivalent to the common root ancestors of the treatment and the outcome variable. Building upon this conclusion, the GCI framework is composed of a novel Ancestor Set Identification (ASI) algorithm and de-confounding inference methods. Firstly, the ASI algorithm is theoretically supported by the conditional independence properties and causal asymmetry between variables, enabling the identification of key confounding covariates. Subsequently, the identified confounding covariates are used in the de-confounding inference methods to obtain unbiased causal effect estimation, which can support informed decision-making. Extensive experiments on synthetic datasets demonstrate that the GCI framework can effectively identify the critical confounding covariates and significantly improve the precision, stability, and interpretability of causal inference in observational studies."}, "https://arxiv.org/abs/2404.18207": {"title": "Testing for Asymmetric Information in Insurance with Deep Learning", "link": "https://arxiv.org/abs/2404.18207", "description": "arXiv:2404.18207v1 Announce Type: new \nAbstract: The positive correlation test for asymmetric information developed by Chiappori and Salanie (2000) has been applied in many insurance markets. Most of the literature focuses on the special case of constant correlation; it also relies on restrictive parametric specifications for the choice of coverage and the occurrence of claims. We relax these restrictions by estimating conditional covariances and correlations using deep learning methods. We test the positive correlation property by using the intersection test of Chernozhukov, Lee, and Rosen (2013) and the \"sorted groups\" test of Chernozhukov, Demirer, Duflo, and Fernandez-Val (2023). Our results confirm earlier findings that the correlation between risk and coverage is small. 
Random forests and gradient boosting trees produce similar results to neural networks."}, "https://arxiv.org/abs/2404.18232": {"title": "A cautious approach to constraint-based causal model selection", "link": "https://arxiv.org/abs/2404.18232", "description": "arXiv:2404.18232v1 Announce Type: new \nAbstract: We study the data-driven selection of causal graphical models using constraint-based algorithms, which determine the existence or non-existence of edges (causal connections) in a graph based on testing a series of conditional independence hypotheses. In settings where the ultimate scientific goal is to use the selected graph to inform estimation of some causal effect of interest (e.g., by selecting a valid and sufficient set of adjustment variables), we argue that a \"cautious\" approach to graph selection should control the probability of falsely removing edges and prefer dense, rather than sparse, graphs. We propose a simple inversion of the usual conditional independence testing procedure: to remove an edge, test the null hypothesis of conditional association greater than some user-specified threshold, rather than the null of independence. This equivalence-testing formulation of the independence constraints leads to a procedure with desirable statistical properties and behaviors that better match the inferential goals of certain scientific studies, for example observational epidemiological studies that aim to estimate causal effects in the face of causal model uncertainty. We illustrate our approach on a data example from environmental epidemiology."}, "https://arxiv.org/abs/2404.18256": {"title": "Semiparametric causal mediation analysis in cluster-randomized experiments", "link": "https://arxiv.org/abs/2404.18256", "description": "arXiv:2404.18256v1 Announce Type: new \nAbstract: In cluster-randomized experiments, there is emerging interest in exploring the causal mechanism in which a cluster-level treatment affects the outcome through an intermediate outcome. Despite extensive development of causal mediation methods in the past decade, only a few have addressed causal mediation in cluster-randomized studies, all of which depend on parametric model-based estimators. In this article, we develop the formal semiparametric efficiency theory to motivate doubly-robust methods for addressing several mediation effect estimands corresponding to both the cluster-average and the individual-level treatment effects in cluster-randomized experiments--the natural indirect effect, natural direct effect, and spillover mediation effect. We derive the efficient influence function for each mediation effect, and carefully parameterize each efficient influence function to motivate practical strategies for operationalizing each estimator. We consider both parametric working models and data-adaptive machine learners to estimate the nuisance functions, and obtain semiparametric efficient causal mediation estimators in the latter case. Our methods are illustrated via extensive simulations and two completed cluster-randomized experiments."}, "https://arxiv.org/abs/2404.18268": {"title": "Optimal Treatment Allocation under Constraints", "link": "https://arxiv.org/abs/2404.18268", "description": "arXiv:2404.18268v1 Announce Type: new \nAbstract: In optimal policy problems where treatment effects vary at the individual level, optimally allocating treatments to recipients is complex even when potential outcomes are known. 
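The "cautious" edge-removal rule in the constraint-based selection abstract (arXiv:2404.18232) inverts the usual independence test. One simple way to operationalize that idea for a partial correlation, offered here as an illustrative simplification rather than the paper's exact procedure, is to drop an edge only when a Fisher-z confidence interval lies entirely inside an equivalence band (-delta, delta).

```python
import numpy as np
from scipy import stats

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in `cond`,
    computed from residuals of least-squares regressions."""
    X = np.column_stack([np.ones(len(data))] + [data[:, c] for c in cond])
    ri = data[:, i] - X @ np.linalg.lstsq(X, data[:, i], rcond=None)[0]
    rj = data[:, j] - X @ np.linalg.lstsq(X, data[:, j], rcond=None)[0]
    return np.corrcoef(ri, rj)[0, 1]

def remove_edge(data, i, j, cond, delta=0.1, alpha=0.05):
    """Cautious rule: remove the i-j edge only if the (1 - 2*alpha) Fisher-z
    interval for the partial correlation is contained in (-delta, delta),
    i.e. only when we are confident the conditional association is small."""
    n, k = len(data), len(cond)
    r = partial_corr(data, i, j, cond)
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - k - 3)
    zcrit = stats.norm.ppf(1 - alpha)
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
    return (-delta < lo) and (hi < delta)

# Made-up example: X2 is a noisy copy of X0, X1 is independent noise
rng = np.random.default_rng(3)
x0 = rng.normal(size=500); x1 = rng.normal(size=500); x2 = x0 + 0.5 * rng.normal(size=500)
data = np.column_stack([x0, x1, x2])
print(remove_edge(data, 0, 1, cond=[2]))   # weak association: edge may be removed
print(remove_edge(data, 0, 2, cond=[1]))   # strong association: edge is kept
```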
We present an algorithm for multi-arm treatment allocation problems that is guaranteed to find the optimal allocation in strongly polynomial time, and which is able to handle arbitrary potential outcomes as well as constraints on treatment requirement and capacity. Further, starting from an arbitrary allocation, we show how to optimally re-allocate treatments in a Pareto-improving manner. To showcase our results, we use data from Danish nurse home visiting for infants. We estimate nurse specific treatment effects for children born 1959-1967 in Copenhagen, comparing nurses against each other. We exploit random assignment of newborn children to nurses within a district to obtain causal estimates of nurse-specific treatment effects using causal machine learning. Using these estimates, and treating the Danish nurse home visiting program as a case of an optimal treatment allocation problem (where a treatment is a nurse), we document room for significant productivity improvements by optimally re-allocating nurses to children. Our estimates suggest that optimal allocation of nurses to children could have improved average yearly earnings by USD 1,815 and length of education by around two months."}, "https://arxiv.org/abs/2404.18370": {"title": "Out-of-distribution generalization under random, dense distributional shifts", "link": "https://arxiv.org/abs/2404.18370", "description": "arXiv:2404.18370v1 Announce Type: new \nAbstract: Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that p(y|x) remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we will discuss empirical evidence for such random dense distributional shifts and explain why commonly used models for distribution shifts-including adversarial approaches-may not be appropriate under these conditions. Then, we will develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we will apply the framework to several real-world data sets and discuss diagnostics to evaluate the fit of the distributional uncertainty model."}, "https://arxiv.org/abs/2404.18377": {"title": "Inference for the panel ARMA-GARCH model when both $N$ and $T$ are large", "link": "https://arxiv.org/abs/2404.18377", "description": "arXiv:2404.18377v1 Announce Type: new \nAbstract: We propose a panel ARMA-GARCH model to capture the dynamics of large panel data with $N$ individuals over $T$ time periods. For this model, we provide a two-step estimation procedure to estimate the ARMA parameters and GARCH parameters stepwisely. Under some regular conditions, we show that all of the proposed estimators are asymptotically normal with the convergence rate $(NT)^{-1/2}$, and they have the asymptotic biases when both $N$ and $T$ diverge to infinity at the same rate. Particularly, we find that the asymptotic biases result from the fixed effect, estimation effect, and unobservable initial values. To correct the biases, we further propose the bias-corrected version of estimators by using either the analytical asymptotics or jackknife method. 
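The allocation abstract (arXiv:2404.18268) does not spell out its algorithm here; as a generic illustration of capacity-constrained multi-arm assignment, the sketch below expands each arm into capacity-many slots and solves the resulting assignment problem with SciPy's Hungarian solver, maximizing the total estimated potential outcome. This is a standard reduction, not the paper's strongly polynomial algorithm, and it handles only capacity constraints.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate(outcomes, capacities):
    """Assign each individual to exactly one arm, respecting per-arm
    capacities, so that the sum of (estimated) potential outcomes is
    maximized. `outcomes` is an (n_individuals x n_arms) matrix."""
    n, m = outcomes.shape
    if sum(capacities) < n:
        raise ValueError("total capacity must cover all individuals")
    slot_arm = np.repeat(np.arange(m), capacities)   # one column ("slot") per unit of capacity
    cost = -outcomes[:, slot_arm]                    # negate: the solver minimizes
    rows, cols = linear_sum_assignment(cost)
    assignment = np.empty(n, dtype=int)
    assignment[rows] = slot_arm[cols]
    return assignment

# Toy example: 6 individuals, 3 arms with capacities 2/2/2, made-up outcomes
rng = np.random.default_rng(7)
outcomes = rng.normal(size=(6, 3))
alloc = allocate(outcomes, capacities=[2, 2, 2])
print(alloc, outcomes[np.arange(6), alloc].sum())
```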
Our asymptotic results are based on a new central limit theorem for the linear-quadratic form in the martingale difference sequence, when the weight matrix is uniformly bounded in row and column. Simulations and one real example are given to demonstrate the usefulness of our panel ARMA-GARCH model."}, "https://arxiv.org/abs/2404.18421": {"title": "Semiparametric mean and variance joint models with Laplace link functions for count time series", "link": "https://arxiv.org/abs/2404.18421", "description": "arXiv:2404.18421v1 Announce Type: new \nAbstract: Count time series data are frequently analyzed by modeling their conditional means and the conditional variance is often considered to be a deterministic function of the corresponding conditional mean and is not typically modeled independently. We propose a semiparametric mean and variance joint model, called random rounded count-valued generalized autoregressive conditional heteroskedastic (RRC-GARCH) model, to address this limitation. The RRC-GARCH model and its variations allow for the joint modeling of both the conditional mean and variance and offer a flexible framework for capturing various mean-variance structures (MVSs). One main feature of this model is its ability to accommodate negative values for regression coefficients and autocorrelation functions. The autocorrelation structure of the RRC-GARCH model using the proposed Laplace link functions with nonnegative regression coefficients is the same as that of an autoregressive moving-average (ARMA) process. For the new model, the stationarity and ergodicity are established and the consistency and asymptotic normality of the conditional least squares estimator are proved. Model selection criteria are proposed to evaluate the RRC-GARCH models. The performance of the RRC-GARCH model is assessed through analyses of both simulated and real data sets. The results indicate that the model can effectively capture the MVS of count time series data and generate accurate forecast means and variances."}, "https://arxiv.org/abs/2404.18678": {"title": "Sequential model confidence", "link": "https://arxiv.org/abs/2404.18678", "description": "arXiv:2404.18678v1 Announce Type: new \nAbstract: In most prediction and estimation situations, scientists consider various statistical models for the same problem, and naturally want to select amongst the best. Hansen et al. (2011) provide a powerful solution to this problem by the so-called model confidence set, a subset of the original set of available models that contains the best models with a given level of confidence. Importantly, model confidence sets respect the underlying selection uncertainty by being flexible in size. However, they presuppose a fixed sample size which stands in contrast to the fact that model selection and forecast evaluation are inherently sequential tasks where we successively collect new data and where the decision to continue or conclude a study may depend on the previous outcomes. In this article, we extend model confidence sets sequentially over time by relying on sequential testing methods. Recently, e-processes and confidence sequences have been introduced as new, safe methods for assessing statistical evidence. 
Sequential model confidence sets allow to continuously monitor the models' performances and come with time-uniform, nonasymptotic coverage guarantees."}, "https://arxiv.org/abs/2404.18732": {"title": "Two-way Homogeneity Pursuit for Quantile Network Vector Autoregression", "link": "https://arxiv.org/abs/2404.18732", "description": "arXiv:2404.18732v1 Announce Type: new \nAbstract: While the Vector Autoregression (VAR) model has received extensive attention for modelling complex time series, quantile VAR analysis remains relatively underexplored for high-dimensional time series data. To address this disparity, we introduce a two-way grouped network quantile (TGNQ) autoregression model for time series collected on large-scale networks, known for their significant heterogeneous and directional interactions among nodes. Our proposed model simultaneously conducts node clustering and model estimation to balance complexity and interpretability. To account for the directional influence among network nodes, each network node is assigned two latent group memberships that can be consistently estimated using our proposed estimation procedure. Theoretical analysis demonstrates the consistency of membership and parameter estimators even with an overspecified number of groups. With the correct group specification, estimated parameters are proven to be asymptotically normal, enabling valid statistical inferences. Moreover, we propose a quantile information criterion for consistently selecting the number of groups. Simulation studies show promising finite sample performance, and we apply the methodology to analyze connectedness and risk spillover effects among Chinese A-share stocks."}, "https://arxiv.org/abs/2404.18779": {"title": "Semiparametric fiducial inference", "link": "https://arxiv.org/abs/2404.18779", "description": "arXiv:2404.18779v1 Announce Type: new \nAbstract: R. A. Fisher introduced the concept of fiducial as a potential replacement for the Bayesian posterior distribution in the 1930s. During the past century, fiducial approaches have been explored in various parametric and nonparametric settings. However, to the best of our knowledge, no fiducial inference has been developed in the realm of semiparametric statistics. In this paper, we propose a novel fiducial approach for semiparametric models. To streamline our presentation, we use the Cox proportional hazards model, which is the most popular model for the analysis of survival data, as a running example. Other models and extensions are also discussed. In our experiments, we find our method to perform well especially in situations when the maximum likelihood estimator fails."}, "https://arxiv.org/abs/2404.18854": {"title": "Switching Models of Oscillatory Networks Greatly Improve Inference of Dynamic Functional Connectivity", "link": "https://arxiv.org/abs/2404.18854", "description": "arXiv:2404.18854v1 Announce Type: new \nAbstract: Functional brain networks can change rapidly as a function of stimuli or cognitive shifts. Tracking dynamic functional connectivity is particularly challenging as it requires estimating the structure of the network at each moment as well as how it is shifting through time. In this paper, we describe a general modeling framework and a set of specific models that provides substantially increased statistical power for estimating rhythmic dynamic networks, based on the assumption that for a particular experiment or task, the network state at any moment is chosen from a discrete set of possible network modes. 
Each model is comprised of three components: (1) a set of latent switching states that represent transitions between the expression of each network mode; (2) a set of latent oscillators, each characterized by an estimated mean oscillation frequency and an instantaneous phase and amplitude at each time point; and (3) an observation model that relates the observed activity at each electrode to a linear combination of the latent oscillators. We develop an expectation-maximization procedure to estimate the network structure for each switching state and the probability of each state being expressed at each moment. We conduct a set of simulation studies to illustrate the application of these models and quantify their statistical power, even in the face of model misspecification."}, "https://arxiv.org/abs/2404.18857": {"title": "VT-MRF-SPF: Variable Target Markov Random Field Scalable Particle Filter", "link": "https://arxiv.org/abs/2404.18857", "description": "arXiv:2404.18857v1 Announce Type: new \nAbstract: Markov random fields (MRFs) are invaluable tools across diverse fields, and spatiotemporal MRFs (STMRFs) amplify their effectiveness by integrating spatial and temporal dimensions. However, modeling spatiotemporal data introduces additional hurdles, including dynamic spatial dimensions and partial observations, prevalent in scenarios like disease spread analysis and environmental monitoring. Tracking high-dimensional targets with complex spatiotemporal interactions over extended periods poses significant challenges in accuracy, efficiency, and computational feasibility. To tackle these obstacles, we introduce the variable target MRF scalable particle filter (VT-MRF-SPF), a fully online learning algorithm designed for high-dimensional target tracking over STMRFs with varying dimensions under partial observation. We rigorously guarantee algorithm performance, explicitly indicating overcoming the curse of dimensionality. Additionally, we provide practical guidelines for tuning graphical parameters, leading to superior performance in extensive examinations."}, "https://arxiv.org/abs/2404.18862": {"title": "Conformal Prediction Sets for Populations of Graphs", "link": "https://arxiv.org/abs/2404.18862", "description": "arXiv:2404.18862v1 Announce Type: new \nAbstract: The analysis of data such as graphs has been gaining increasing attention in the past years. This is justified by the numerous applications in which they appear. Several methods are present to predict graphs, but much fewer to quantify the uncertainty of the prediction. The present work proposes an uncertainty quantification methodology for graphs, based on conformal prediction. The method works both for graphs with the same set of nodes (labelled graphs) and graphs with no clear correspondence between the set of nodes across the observed graphs (unlabelled graphs). The unlabelled case is dealt with the creation of prediction sets embedded in a quotient space. The proposed method does not rely on distributional assumptions, it achieves finite-sample validity, and it identifies interpretable prediction sets. To explore the features of this novel forecasting technique, we perform two simulation studies to show the methodology in both the labelled and the unlabelled case. 
We showcase the applicability of the method in analysing the performance of different teams during the FIFA 2018 football world championship via their player passing networks."}, "https://arxiv.org/abs/2404.18905": {"title": "Detecting critical treatment effect bias in small subgroups", "link": "https://arxiv.org/abs/2404.18905", "description": "arXiv:2404.18905v1 Announce Type: new \nAbstract: Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge."}, "https://arxiv.org/abs/2403.15352": {"title": "Universal Cold RNA Phase Transitions", "link": "https://arxiv.org/abs/2403.15352", "description": "arXiv:2403.15352v1 Announce Type: cross \nAbstract: RNA's diversity of structures and functions impacts all life forms since primordia. We use calorimetric force spectroscopy to investigate RNA folding landscapes in previously unexplored low-temperature conditions. We find that Watson-Crick RNA hairpins, the most basic secondary structure elements, undergo a glass-like transition below $\\mathbf{T_G\\sim 20 ^{\\circ}}$C where the heat capacity abruptly changes and the RNA folds into a diversity of misfolded structures. We hypothesize that an altered RNA biochemistry, determined by sequence-independent ribose-water interactions, outweighs sequence-dependent base pairing. The ubiquitous ribose-water interactions lead to universal RNA phase transitions below $\\mathbf{T_G}$, such as maximum stability at $\\mathbf{T_S\\sim 5 ^{\\circ}}$C where water density is maximum, and cold denaturation at $\\mathbf{T_C\\sim-50^{\\circ}}$C. RNA cold biochemistry may have a profound impact on RNA function and evolution."}, "https://arxiv.org/abs/2404.17682": {"title": "Testing for similarity of dose response in multi-regional clinical trials", "link": "https://arxiv.org/abs/2404.17682", "description": "arXiv:2404.17682v1 Announce Type: cross \nAbstract: This paper addresses the problem of deciding whether the dose response relationships between subgroups and the full population in a multi-regional trial are similar to each other. Similarity is measured in terms of the maximal deviation between the dose response curves. We consider a parametric framework and develop two powerful bootstrap tests for the similarity between the dose response curves of one subgroup and the full population, and for the similarity between the dose response curves of several subgroups and the full population. 
We prove the validity of the tests, investigate the finite sample properties by means of a simulation study and finally illustrate the methodology in a case study."}, "https://arxiv.org/abs/2404.17737": {"title": "Neutral Pivoting: Strong Bias Correction for Shared Information", "link": "https://arxiv.org/abs/2404.17737", "description": "arXiv:2404.17737v1 Announce Type: cross \nAbstract: In the absence of historical data for use as forecasting inputs, decision makers often ask a panel of judges to predict the outcome of interest, leveraging the wisdom of the crowd (Surowiecki 2005). Even if the crowd is large and skilled, shared information can bias the simple mean of judges' estimates. Addressing the issue of bias, Palley and Soll (2019) introduces a novel approach called pivoting. Pivoting can take several forms, most notably the powerful and reliable minimal pivot. We build on the intuition of the minimal pivot and propose a more aggressive bias correction known as the neutral pivot. The neutral pivot achieves the largest bias correction of its class that both avoids the need to directly estimate crowd composition or skill and maintains a smaller expected squared error than the simple mean for all considered settings. Empirical assessments on real datasets confirm the effectiveness of the neutral pivot compared to current methods."}, "https://arxiv.org/abs/2404.17769": {"title": "Conformal Ranked Retrieval", "link": "https://arxiv.org/abs/2404.17769", "description": "arXiv:2404.17769v1 Announce Type: cross \nAbstract: Given the wide adoption of ranked retrieval techniques in various information systems that significantly impact our daily lives, there is an increasing need to assess and address the uncertainty inherent in their predictions. This paper introduces a novel method using the conformal risk control framework to quantitatively measure and manage risks in the context of ranked retrieval problems. Our research focuses on a typical two-stage ranked retrieval problem, where the retrieval stage generates candidates for subsequent ranking. By carefully formulating the conformal risk for each stage, we have developed algorithms to effectively control these risks within their specified bounds. The efficacy of our proposed methods has been demonstrated through comprehensive experiments on three large-scale public datasets for ranked retrieval tasks, including the MSLR-WEB dataset, the Yahoo LTRC dataset and the MS MARCO dataset."}, "https://arxiv.org/abs/2404.17812": {"title": "High-Dimensional Single-Index Models: Link Estimation and Marginal Inference", "link": "https://arxiv.org/abs/2404.17812", "description": "arXiv:2404.17812v1 Announce Type: cross \nAbstract: This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging the information from the estimated link function, we propose more efficient estimators that are better aligned with the underlying model. Furthermore, we rigorously establish the asymptotic normality of each coordinate of the estimator. This provides a valid construction of confidence intervals and $p$-values for any finite collection of coordinates. 
Numerical experiments validate our theoretical results."}, "https://arxiv.org/abs/2404.17856": {"title": "Uncertainty quantification for iterative algorithms in linear models with application to early stopping", "link": "https://arxiv.org/abs/2404.17856", "description": "arXiv:2404.17856v1 Announce Type: cross \nAbstract: This paper investigates the iterates $\\hat{b}^1,\\dots,\\hat{b}^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable with the sample size $n$, i.e., $p \\asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators for the generalization error of the iterate $\\hat{b}^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\\sqrt n$-consistent under Gaussian designs. Applications to early-stopping are provided: when the generalization error of the iterates is a U-shaped function of the iteration $t$, the estimates allow one to select from the data an iteration $\\hat t$ that achieves the smallest generalization error along the trajectory. Additionally, we provide a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\\hat{b}^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results."}, "https://arxiv.org/abs/2404.18786": {"title": "Randomization-based confidence intervals for the local average treatment effect", "link": "https://arxiv.org/abs/2404.18786", "description": "arXiv:2404.18786v1 Announce Type: cross \nAbstract: We consider the problem of generating confidence intervals in randomized experiments with noncompliance. We show that a refinement of a randomization-based procedure proposed by Imbens and Rosenbaum (2005) has desirable properties. Namely, we show that using a studentized Anderson-Rubin-type statistic as a test statistic yields confidence intervals that are finite-sample exact under treatment effect homogeneity, and remain asymptotically valid for the Local Average Treatment Effect when the treatment effect is heterogeneous. We provide a uniform analysis of this procedure."}, "https://arxiv.org/abs/2006.13850": {"title": "Global Sensitivity and Domain-Selective Testing for Functional-Valued Responses: An Application to Climate Economy Models", "link": "https://arxiv.org/abs/2006.13850", "description": "arXiv:2006.13850v4 Announce Type: replace \nAbstract: Understanding the dynamics and evolution of climate change and associated uncertainties is key for designing robust policy actions. Computer models are key tools in this scientific effort, which have now reached a high level of sophistication and complexity. Model auditing is needed in order to better understand their results, and to deal with the fact that such models are increasingly opaque with respect to their inner workings. Current techniques such as Global Sensitivity Analysis (GSA) are limited to dealing either with multivariate outputs, stochastic ones, or finite-change inputs. This limits their applicability to time-varying variables such as future pathways of greenhouse gases. To provide additional semantics in the analysis of a model ensemble, we provide an extension of GSA methodologies tackling the case of stochastic functional outputs with finite change inputs. 
To deal with finite change inputs and functional outputs, we propose an extension of currently available GSA methodologies while we deal with the stochastic part by introducing a novel, domain-selective inferential technique for sensitivity indices. Our method is explored via a simulation study that shows its robustness and efficacy in detecting sensitivity patterns. We apply it to real world data, where its capabilities can provide to practitioners and policymakers additional information about the time dynamics of sensitivity patterns, as well as information about robustness."}, "https://arxiv.org/abs/2012.11679": {"title": "Discordant Relaxations of Misspecified Models", "link": "https://arxiv.org/abs/2012.11679", "description": "arXiv:2012.11679v5 Announce Type: replace \nAbstract: In many set-identified models, it is difficult to obtain a tractable characterization of the identified set. Therefore, researchers often rely on non-sharp identification conditions, and empirical results are often based on an outer set of the identified set. This practice is often viewed as conservative yet valid because an outer set is always a superset of the identified set. However, this paper shows that when the model is refuted by the data, two sets of non-sharp identification conditions derived from the same model could lead to disjoint outer sets and conflicting empirical results. We provide a sufficient condition for the existence of such discordancy, which covers models characterized by conditional moment inequalities and the Artstein (1983) inequalities. We also derive sufficient conditions for the non-existence of discordant submodels, therefore providing a class of models for which constructing outer sets cannot lead to misleading interpretations. In the case of discordancy, we follow Masten and Poirier (2021) by developing a method to salvage misspecified models, but unlike them, we focus on discrete relaxations. We consider all minimum relaxations of a refuted model that restores data-consistency. We find that the union of the identified sets of these minimum relaxations is robust to detectable misspecifications and has an intuitive empirical interpretation."}, "https://arxiv.org/abs/2204.13439": {"title": "Mahalanobis balancing: a multivariate perspective on approximate covariate balancing", "link": "https://arxiv.org/abs/2204.13439", "description": "arXiv:2204.13439v4 Announce Type: replace \nAbstract: In the past decade, various exact balancing-based weighting methods were introduced to the causal inference literature. Exact balancing alleviates the extreme weight and model misspecification issues that may incur when one implements inverse probability weighting. It eliminates covariate imbalance by imposing balancing constraints in an optimization problem. The optimization problem can nevertheless be infeasible when there is bad overlap between the covariate distributions in the treated and control groups or when the covariates are high-dimensional. Recently, approximate balancing was proposed as an alternative balancing framework, which resolves the feasibility issue by using inequality moment constraints instead. However, it can be difficult to select the threshold parameters when the number of constraints is large. Moreover, moment constraints may not fully capture the discrepancy of covariate distributions. In this paper, we propose Mahalanobis balancing, which approximately balances covariate distributions from a multivariate perspective. 
We use a quadratic constraint to control overall imbalance with a single threshold parameter, which can be tuned by a simple selection procedure. We show that the dual problem of Mahalanobis balancing is an l_2 norm-based regularized regression problem, and establish interesting connection to propensity score models. We further generalize Mahalanobis balancing to the high-dimensional scenario. We derive asymptotic properties and make extensive comparisons with existing balancing methods in the numerical studies."}, "https://arxiv.org/abs/2206.02508": {"title": "Tucker tensor factor models: matricization and mode-wise PCA estimation", "link": "https://arxiv.org/abs/2206.02508", "description": "arXiv:2206.02508v3 Announce Type: replace \nAbstract: High-dimensional, higher-order tensor data are gaining prominence in a variety of fields, including but not limited to computer vision and network analysis. Tensor factor models, induced from noisy versions of tensor decompositions or factorizations, are natural potent instruments to study a collection of tensor-variate objects that may be dependent or independent. However, it is still in the early stage of developing statistical inferential theories for the estimation of various low-rank structures, which are customary to play the role of signals of tensor factor models. In this paper, we attempt to ``decode\" the estimation of a higher-order tensor factor model by leveraging tensor matricization. Specifically, we recast it into mode-wise traditional high-dimensional vector/fiber factor models, enabling the deployment of conventional principal components analysis (PCA) for estimation. Demonstrated by the Tucker tensor factor model (TuTFaM), which is induced from the noisy version of the widely-used Tucker decomposition, we summarize that estimations on signal components are essentially mode-wise PCA techniques, and the involvement of projection and iteration will enhance the signal-to-noise ratio to various extent. We establish the inferential theory of the proposed estimators, conduct rich simulation experiments, and illustrate how the proposed estimations can work in tensor reconstruction, and clustering for independent video and dependent economic datasets, respectively."}, "https://arxiv.org/abs/2301.13701": {"title": "On the Stability of General Bayesian Inference", "link": "https://arxiv.org/abs/2301.13701", "description": "arXiv:2301.13701v2 Announce Type: replace \nAbstract: We study the stability of posterior predictive inferences to the specification of the likelihood model and perturbations of the data generating process. In modern big data analyses, useful broad structural judgements may be elicited from the decision-maker but a level of interpolation is required to arrive at a likelihood model. As a result, an often computationally convenient canonical form is used in place of the decision-maker's true beliefs. Equally, in practice, observational datasets often contain unforeseen heterogeneities and recording errors and therefore do not necessarily correspond to how the process was idealised by the decision-maker. Acknowledging such imprecisions, a faithful Bayesian analysis should ideally be stable across reasonable equivalence classes of such inputs. 
We are able to guarantee that traditional Bayesian updating provides stability across only a very strict class of likelihood models and data generating processes, requiring the decision-maker to elicit their beliefs and understand how the data was generated with an unreasonable degree of accuracy. On the other hand, a generalised Bayesian alternative using the $\\beta$-divergence loss function is shown to be stable across practical and interpretable neighbourhoods, providing assurances that posterior inferences are not overly dependent on accidentally introduced spurious specifications or data collection errors. We illustrate this in linear regression, binary classification, and mixture modelling examples, showing that stable updating does not compromise the ability to learn about the data generating process. These stability results provide a compelling justification for using generalised Bayes to facilitate inference under simplified canonical models."}, "https://arxiv.org/abs/2304.04519": {"title": "On new omnibus tests of uniformity on the hypersphere", "link": "https://arxiv.org/abs/2304.04519", "description": "arXiv:2304.04519v5 Announce Type: replace \nAbstract: Two new omnibus tests of uniformity for data on the hypersphere are proposed. The new test statistics exploit closed-form expressions for orthogonal polynomials, feature tuning parameters, and are related to a ``smooth maximum'' function and the Poisson kernel. We obtain exact moments of the test statistics under uniformity and rotationally symmetric alternatives, and give their null asymptotic distributions. We consider approximate oracle tuning parameters that maximize the power of the tests against known generic alternatives and provide tests that estimate oracle parameters through cross-validated procedures while maintaining the significance level. Numerical experiments explore the effectiveness of null asymptotic distributions and the accuracy of inexpensive approximations of exact null distributions. A simulation study compares the powers of the new tests with other tests of the Sobolev class, showing the benefits of the former. The proposed tests are applied to the study of the (seemingly uniform) nursing times of wild polar bears."}, "https://arxiv.org/abs/2306.01198": {"title": "Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations", "link": "https://arxiv.org/abs/2306.01198", "description": "arXiv:2306.01198v3 Announce Type: replace \nAbstract: Matching algorithms are commonly used to predict matches between items in a collection. For example, in 1:1 face verification, a matching algorithm predicts whether two face images depict the same person. Accurately assessing the uncertainty of the error rates of such algorithms can be challenging when data are dependent and error rates are low, two aspects that have been often overlooked in the literature. In this work, we review methods for constructing confidence intervals for error rates in 1:1 matching tasks. We derive and examine the statistical properties of these methods, demonstrating how coverage and interval width vary with sample size, error rates, and degree of data dependence on both analysis and experiments with synthetic and real-world datasets. 
Based on our findings, we provide recommendations for best practices for constructing confidence intervals for error rates in 1:1 matching tasks."}, "https://arxiv.org/abs/2306.01468": {"title": "Robust Bayesian Inference for Berkson and Classical Measurement Error Models", "link": "https://arxiv.org/abs/2306.01468", "description": "arXiv:2306.01468v2 Announce Type: replace \nAbstract: Measurement error occurs when a covariate influencing a response variable is corrupted by noise. This can lead to misleading inference outcomes, particularly in problems where accurately estimating the relationship between covariates and response variables is crucial, such as causal effect estimation. Existing methods for dealing with measurement error often rely on strong assumptions such as knowledge of the error distribution or its variance and availability of replicated measurements of the covariates. We propose a Bayesian Nonparametric Learning framework that is robust to mismeasured covariates, does not require the preceding assumptions, and can incorporate prior beliefs about the error distribution. This approach gives rise to a general framework that is suitable for both Classical and Berkson error models via the appropriate specification of the prior centering measure of a Dirichlet Process (DP). Moreover, it offers flexibility in the choice of loss function depending on the type of regression model. We provide bounds on the generalization error based on the Maximum Mean Discrepancy (MMD) loss which allows for generalization to non-Gaussian distributed errors and nonlinear covariate-response relationships. We showcase the effectiveness of the proposed framework versus prior art in real-world problems containing either Berkson or Classical measurement errors."}, "https://arxiv.org/abs/2306.09151": {"title": "Estimating the Sampling Distribution of Posterior Decision Summaries in Bayesian Clinical Trials", "link": "https://arxiv.org/abs/2306.09151", "description": "arXiv:2306.09151v2 Announce Type: replace \nAbstract: Bayesian inference and the use of posterior or posterior predictive probabilities for decision making have become increasingly popular in clinical trials. The current practice in Bayesian clinical trials relies on a hybrid Bayesian-frequentist approach where the design and decision criteria are assessed with respect to frequentist operating characteristics such as power and type I error rate conditioning on a given set of parameters. These operating characteristics are commonly obtained via simulation studies. The utility of Bayesian measures, such as ``assurance\", that incorporate uncertainty about model parameters in estimating the probabilities of various decisions in trials has been demonstrated recently. However, the computational burden remains an obstacle toward wider use of such criteria. In this article, we propose methodology which utilizes large sample theory of the posterior distribution to define parametric models for the sampling distribution of the posterior summaries used for decision making. The parameters of these models are estimated using a small number of simulation scenarios, thereby refining these models to capture the sampling distribution for small to moderate sample size. The proposed approach toward the assessment of conditional and marginal operating characteristics and sample size determination can be considered as simulation-assisted rather than simulation-based. 
It enables formal incorporation of uncertainty about the trial assumptions via a design prior and significantly reduces the computational burden for the design of Bayesian trials in general."}, "https://arxiv.org/abs/2307.16138": {"title": "A switching state-space transmission model for tracking epidemics and assessing interventions", "link": "https://arxiv.org/abs/2307.16138", "description": "arXiv:2307.16138v2 Announce Type: replace \nAbstract: The effective control of infectious diseases relies on accurate assessment of the impact of interventions, which is often hindered by the complex dynamics of the spread of disease. A Beta-Dirichlet switching state-space transmission model is proposed to track underlying dynamics of disease and evaluate the effectiveness of interventions simultaneously. As time evolves, the switching mechanism introduced in the susceptible-exposed-infected-recovered (SEIR) model is able to capture the timing and magnitude of changes in the transmission rate due to the effectiveness of control measures. The implementation of this model is based on a particle Markov Chain Monte Carlo algorithm, which can estimate the time evolution of SEIR states, switching states, and high-dimensional parameters efficiently. The efficacy of the proposed model and estimation procedure are demonstrated through simulation studies. With a real-world application to British Columbia's COVID-19 outbreak, the proposed switching state-space transmission model quantifies the reduction of transmission rate following interventions. The proposed model provides a promising tool to inform public health policies aimed at studying the underlying dynamics and evaluating the effectiveness of interventions during the spread of the disease."}, "https://arxiv.org/abs/2309.02072": {"title": "Data Scaling Effect of Deep Learning in Financial Time Series Forecasting", "link": "https://arxiv.org/abs/2309.02072", "description": "arXiv:2309.02072v4 Announce Type: replace \nAbstract: For many years, researchers have been exploring the use of deep learning in the forecasting of financial time series. However, they have continued to rely on the conventional econometric approach for model optimization, optimizing the deep learning models on individual assets. In this paper, we use the stock volatility forecast as an example to illustrate global training - optimizes the deep learning model across a wide range of stocks - is both necessary and beneficial for any academic or industry practitioners who is interested in employing deep learning to forecast financial time series. Furthermore, a pre-trained foundation model for volatility forecast is introduced, capable of making accurate zero-shot forecasts for any stocks."}, "https://arxiv.org/abs/2309.06305": {"title": "Sensitivity Analysis for Linear Estimators", "link": "https://arxiv.org/abs/2309.06305", "description": "arXiv:2309.06305v3 Announce Type: replace \nAbstract: We propose a novel sensitivity analysis framework for linear estimators with identification failures that can be viewed as seeing the wrong outcome distribution. Our approach measures the degree of identification failure through the change in measure between the observed distribution and a hypothetical target distribution that would identify the causal parameter of interest. The framework yields a sensitivity analysis that generalizes existing bounds for Average Potential Outcome (APO), Regression Discontinuity (RD), and instrumental variables (IV) exclusion failure designs. 
Our partial identification results extend results from the APO context to allow even unbounded likelihood ratios. Our proposed sensitivity analysis consistently estimates sharp bounds under plausible conditions and estimates valid bounds under mild conditions. We find that our method performs well in simulations even when targeting a discontinuous and nearly infinite bound."}, "https://arxiv.org/abs/2312.07881": {"title": "Efficiency of QMLE for dynamic panel data models with interactive effects", "link": "https://arxiv.org/abs/2312.07881", "description": "arXiv:2312.07881v2 Announce Type: replace \nAbstract: This paper derives the efficiency bound for estimating the parameters of dynamic panel data models in the presence of an increasing number of incidental parameters. We study the efficiency problem by formulating the dynamic panel as a simultaneous equations system, and show that the quasi-maximum likelihood estimator (QMLE) applied to the system achieves the efficiency bound. Comparison of QMLE with fixed effects estimators is made."}, "https://arxiv.org/abs/2111.04597": {"title": "Neyman-Pearson Multi-class Classification via Cost-sensitive Learning", "link": "https://arxiv.org/abs/2111.04597", "description": "arXiv:2111.04597v3 Announce Type: replace-cross \nAbstract: Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications such as loan default prediction, different types of errors can have varying consequences. To address this asymmetry issue, two popular paradigms have been developed: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Previous studies on the NP paradigm have primarily focused on the binary case, while the multi-class NP problem poses a greater challenge due to its unknown feasibility. In this work, we tackle the multi-class NP problem by establishing a connection with the CS problem via strong duality and propose two algorithms. We extend the concept of NP oracle inequalities, crucial in binary classifications, to NP oracle properties in the multi-class context. Our algorithms satisfy these NP oracle properties under certain conditions. Furthermore, we develop practical algorithms to assess the feasibility and strong duality in multi-class NP problems, which can offer practitioners the landscape of a multi-class NP problem with various target error levels. Simulations and real data studies validate the effectiveness of our algorithms. To our knowledge, this is the first study to address the multi-class NP problem with theoretical guarantees. The proposed algorithms have been implemented in the R package \\texttt{npcs}, which is available on CRAN."}, "https://arxiv.org/abs/2211.01939": {"title": "Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation", "link": "https://arxiv.org/abs/2211.01939", "description": "arXiv:2211.01939v3 Announce Type: replace-cross \nAbstract: We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike machine learning, there is no perfect analogue of cross-validation for model selection as we do not observe the counterfactual potential outcomes. Towards this, a variety of surrogate metrics have been proposed for CATE model selection that use only observed data. However, we do not have a good understanding regarding their effectiveness due to limited comparisons in prior studies. 
We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling."}, "https://arxiv.org/abs/2404.19118": {"title": "Identification and estimation of causal effects using non-concurrent controls in platform trials", "link": "https://arxiv.org/abs/2404.19118", "description": "arXiv:2404.19118v1 Announce Type: new \nAbstract: Platform trials are multi-arm designs that simultaneously evaluate multiple treatments for a single disease within the same overall trial structure. Unlike traditional randomized controlled trials, they allow treatment arms to enter and exit the trial at distinct times while maintaining a control arm throughout. This control arm comprises both concurrent controls, where participants are randomized concurrently to either the treatment or control arm, and non-concurrent controls, who enter the trial when the treatment arm under study is unavailable. While flexible, platform trials introduce a unique challenge with the use of non-concurrent controls, raising questions about how to efficiently utilize their data to estimate treatment effects. Specifically, what estimands should be used to evaluate the causal effect of a treatment versus control? Under what assumptions can these estimands be identified and estimated? Do we achieve any efficiency gains? In this paper, we use structural causal models and counterfactuals to clarify estimands and formalize their identification in the presence of non-concurrent controls in platform trials. We also provide outcome regression, inverse probability weighting, and doubly robust estimators for their estimation. We discuss efficiency gains, demonstrate their performance in a simulation study, and apply them to the ACTT platform trial, resulting in a 20% improvement in precision."}, "https://arxiv.org/abs/2404.19127": {"title": "A model-free subdata selection method for classification", "link": "https://arxiv.org/abs/2404.19127", "description": "arXiv:2404.19127v1 Announce Type: new \nAbstract: Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. 
Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive simulated and real datasets."}, "https://arxiv.org/abs/2404.19144": {"title": "A Locally Robust Semiparametric Approach to Examiner IV Designs", "link": "https://arxiv.org/abs/2404.19144", "description": "arXiv:2404.19144v1 Announce Type: new \nAbstract: I propose a locally robust semiparametric framework for estimating causal effects using the popular examiner IV design, in the presence of many examiners and possibly many covariates relative to the sample size. The key ingredient of this approach is an orthogonal moment function that is robust to biases and local misspecification from the first step estimation of the examiner IV. I derive the orthogonal moment function and show that it delivers multiple robustness where the outcome model or at least one of the first step components is misspecified but the estimating equation remains valid. The proposed framework not only allows for estimation of the examiner IV in the presence of many examiners and many covariates relative to sample size, using a wide range of nonparametric and machine learning techniques including LASSO, Dantzig, neural networks and random forests, but also delivers root-n consistent estimation of the parameter of interest under mild assumptions."}, "https://arxiv.org/abs/2404.19145": {"title": "Orthogonal Bootstrap: Efficient Simulation of Input Uncertainty", "link": "https://arxiv.org/abs/2404.19145", "description": "arXiv:2404.19145v1 Announce Type: new \nAbstract: Bootstrap is a popular methodology for simulating input uncertainty. However, it can be computationally expensive when the number of samples is large. We propose a new approach called \\textbf{Orthogonal Bootstrap} that reduces the number of required Monte Carlo replications. We decompose the target being simulated into two parts: the \\textit{non-orthogonal part} which has a closed-form result known as Infinitesimal Jackknife and the \\textit{orthogonal part} which is easier to simulate. We theoretically and numerically show that Orthogonal Bootstrap significantly reduces the computational cost of Bootstrap while improving empirical accuracy and maintaining the same width of the constructed interval."}, "https://arxiv.org/abs/2404.19325": {"title": "Correcting for confounding in longitudinal experiments: positioning non-linear mixed effects modeling as implementation of standardization using latent conditional exchangeability", "link": "https://arxiv.org/abs/2404.19325", "description": "arXiv:2404.19325v1 Announce Type: new \nAbstract: Non-linear mixed effects modeling and simulation (NLME M&S) is evaluated for use in standardization with longitudinal data in the presence of confounders. Standardization is a well-known method in causal inference to correct for confounding by analyzing and combining results from subgroups of patients. We show that non-linear mixed effects modeling is a particular implementation of standardization that conditions on individual parameters described by the random effects of the mixed effects model. Our motivation is that in pharmacometrics NLME M&S is routinely used to analyze clinical trials and to predict and compare potential outcomes of the same patient population under different treatment regimens. Such a comparison is a causal question sometimes referred to as causal prediction. 
Nonetheless, NLME M&S is rarely positioned as a method for causal prediction.\n As an example, a simulated clinical trial is used that assumes treatment confounder feedback in which early outcomes can cause deviations from the planned treatment schedule. Being interested in the outcome for the hypothetical situation that patients adhere to the planned treatment schedule, we put assumptions in a causal diagram. From the causal diagram, conditional independence assumptions are derived either using latent conditional exchangeability, conditioning on the individual parameters, or using sequential conditional exchangeability, conditioning on earlier outcomes. Both conditional independencies can be used to estimate the estimand of interest, e.g., with standardization, and they give unbiased estimates."}, "https://arxiv.org/abs/2404.19344": {"title": "Data-adaptive structural change-point detection via isolation", "link": "https://arxiv.org/abs/2404.19344", "description": "arXiv:2404.19344v1 Announce Type: new \nAbstract: In this paper, a new data-adaptive method, called DAIS (Data Adaptive ISolation), is introduced for the estimation of the number and the location of change-points in a given data sequence. The proposed method can detect changes in various different signal structures; we focus on the examples of piecewise-constant and continuous, piecewise-linear signals. We highlight, however, that our algorithm can be extended to other frameworks, such as piecewise-quadratic signals. The data-adaptivity of our methodology lies in the fact that, at each step, and for the data under consideration, we search for the most prominent change-point in a targeted neighborhood of the data sequence that contains this change-point with high probability. Using a suitably chosen contrast function, the change-point will then get detected after being isolated in an interval. The isolation feature enhances estimation accuracy, while the data-adaptive nature of DAIS is advantageous regarding, mainly, computational complexity and accuracy. The simulation results presented indicate that DAIS is at least as accurate as state-of-the-art competitors."}, "https://arxiv.org/abs/2404.19465": {"title": "Optimal E-Values for Exponential Families: the Simple Case", "link": "https://arxiv.org/abs/2404.19465", "description": "arXiv:2404.19465v1 Announce Type: new \nAbstract: We provide a general condition under which e-variables in the form of a simple-vs.-simple likelihood ratio exist when the null hypothesis is a composite, multivariate exponential family. Such `simple' e-variables are easy to compute and expected-log-optimal with respect to any stopping time. Simple e-variables were previously only known to exist in quite specific settings, but we offer a unifying theorem on their existence for testing exponential families. We start with a simple alternative $Q$ and a regular exponential family null. Together these induce a second exponential family ${\\cal Q}$ containing $Q$, with the same sufficient statistic as the null. Our theorem shows that simple e-variables exist whenever the covariance matrices of ${\\cal Q}$ and the null are in a certain relation. 
Examples in which this relation holds include some $k$-sample tests, Gaussian location- and scale tests, and tests for more general classes of natural exponential families."}, "https://arxiv.org/abs/2404.19472": {"title": "Multi-label Classification under Uncertainty: A Tree-based Conformal Prediction Approach", "link": "https://arxiv.org/abs/2404.19472", "description": "arXiv:2404.19472v1 Announce Type: new \nAbstract: Multi-label classification is a common challenge in various machine learning applications, where a single data instance can be associated with multiple classes simultaneously. The current paper proposes a novel tree-based method for multi-label classification using conformal prediction and multiple hypothesis testing. The proposed method employs hierarchical clustering with labelsets to develop a hierarchical tree, which is then formulated as a multiple-testing problem with a hierarchical structure. The split-conformal prediction method is used to obtain marginal conformal $p$-values for each tested hypothesis, and two \\textit{hierarchical testing procedures} are developed based on marginal conformal $p$-values, including a hierarchical Bonferroni procedure and its modification for controlling the family-wise error rate. The prediction sets are thus formed based on the testing outcomes of these two procedures. We establish a theoretical guarantee of valid coverage for the prediction sets through proven family-wise error rate control of those two procedures. We demonstrate the effectiveness of our method in a simulation study and two real data analyses, compared to other conformal methods for multi-label classification."}, "https://arxiv.org/abs/2404.19494": {"title": "The harms of class imbalance corrections for machine learning based prediction models: a simulation study", "link": "https://arxiv.org/abs/2404.19494", "description": "arXiv:2404.19494v1 Announce Type: new \nAbstract: Risk prediction models are increasingly used in healthcare to aid in clinical decision making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally represented in the data). It is common for researchers to correct this class imbalance, yet the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class imbalance across different data-generating scenarios (varying sample size, the number of predictors and event fraction). Our findings were illustrated in a case study using MIMIC-III data. In all simulation scenarios, prediction models developed without a correction for class imbalance consistently had equal or better calibration performance than prediction models developed with a correction for class imbalance. The miscalibration introduced by correcting for class imbalance was characterized by an over-estimation of risk and was not always able to be corrected with re-calibration. 
Correcting for class imbalance is not always necessary and may even be harmful for clinical prediction models which aim to produce reliable risk estimates on an individual basis."}, "https://arxiv.org/abs/2404.19661": {"title": "PCA for Point Processes", "link": "https://arxiv.org/abs/2404.19661", "description": "arXiv:2404.19661v1 Announce Type: new \nAbstract: We introduce a novel statistical framework for the analysis of replicated point processes that allows for the study of point pattern variability at a population level. By treating point process realizations as random measures, we adopt a functional analysis perspective and propose a form of functional Principal Component Analysis (fPCA) for point processes. The originality of our method is to base our analysis on the cumulative mass functions of the random measures which gives us a direct and interpretable analysis. Key theoretical contributions include establishing a Karhunen-Lo\\`{e}ve expansion for the random measures and a Mercer Theorem for covariance measures. We establish convergence in a strong sense, and introduce the concept of principal measures, which can be seen as latent processes governing the dynamics of the observed point patterns. We propose an easy-to-implement estimation strategy of eigenelements for which parametric rates are achieved. We fully characterize the solutions of our approach to Poisson and Hawkes processes and validate our methodology via simulations and diverse applications in seismology, single-cell biology and neurosciences, demonstrating its versatility and effectiveness. Our method is implemented in the pppca R-package."}, "https://arxiv.org/abs/2404.19700": {"title": "Comparing Multivariate Distributions: A Novel Approach Using Optimal Transport-based Plots", "link": "https://arxiv.org/abs/2404.19700", "description": "arXiv:2404.19700v1 Announce Type: new \nAbstract: Quantile-Quantile (Q-Q) plots are widely used for assessing the distributional similarity between two datasets. Traditionally, Q-Q plots are constructed for univariate distributions, making them less effective in capturing complex dependencies present in multivariate data. In this paper, we propose a novel approach for constructing multivariate Q-Q plots, which extend the traditional Q-Q plot methodology to handle high-dimensional data. Our approach utilizes optimal transport (OT) and entropy-regularized optimal transport (EOT) to align the empirical quantiles of the two datasets. Additionally, we introduce another technique based on OT and EOT potentials which can effectively compare two multivariate datasets. Through extensive simulations and real data examples, we demonstrate the effectiveness of our proposed approach in capturing multivariate dependencies and identifying distributional differences such as tail behaviour. We also propose two test statistics based on the Q-Q and potential plots to compare two distributions rigorously."}, "https://arxiv.org/abs/2404.19707": {"title": "Identification by non-Gaussianity in structural threshold and smooth transition vector autoregressive models", "link": "https://arxiv.org/abs/2404.19707", "description": "arXiv:2404.19707v1 Announce Type: new \nAbstract: Linear structural vector autoregressive models can be identified statistically without imposing restrictions on the model if the shocks are mutually independent and at most one of them is Gaussian. 
We show that this result extends to structural threshold and smooth transition vector autoregressive models incorporating a time-varying impact matrix defined as a weighted sum of the impact matrices of the regimes. Our empirical application studies the effects of the climate policy uncertainty shock on the U.S. macroeconomy. In a structural logistic smooth transition vector autoregressive model consisting of two regimes, we find that a positive climate policy uncertainty shock decreases production in times of low economic policy uncertainty but slightly increases it in times of high economic policy uncertainty. The introduced methods are implemented in the accompanying R package sstvars."}, "https://arxiv.org/abs/2404.19224": {"title": "Variational approximations of possibilistic inferential models", "link": "https://arxiv.org/abs/2404.19224", "description": "arXiv:2404.19224v1 Announce Type: cross \nAbstract: Inferential models (IMs) offer reliable, data-driven, possibilistic statistical inference. But despite IMs' theoretical/foundational advantages, efficient computation in applications is a major challenge. This paper presents a simple and apparently powerful Monte Carlo-driven strategy for approximating the IM's possibility contour, or at least its $\\alpha$-level set for a specified $\\alpha$. Our proposal utilizes a parametric family that, in a certain sense, approximately covers the credal set associated with the IM's possibility measure, which is reminiscent of variational approximations now widely used in Bayesian statistics."}, "https://arxiv.org/abs/2404.19242": {"title": "A Minimal Set of Parameters Based Depth-Dependent Distortion Model and Its Calibration Method for Stereo Vision Systems", "link": "https://arxiv.org/abs/2404.19242", "description": "arXiv:2404.19242v1 Announce Type: cross \nAbstract: Depth position highly affects lens distortion, especially in close-range photography, which limits the measurement accuracy of existing stereo vision systems. Moreover, traditional depth-dependent distortion models and their calibration methods have remained complicated. In this work, we propose a minimal set of parameters based depth-dependent distortion model (MDM), which considers the radial and decentering distortions of the lens to improve the accuracy of stereo vision systems and simplify their calibration process. In addition, we present an easy and flexible calibration method for the MDM of stereo vision systems with a commonly used planar pattern, which requires cameras to observe the planar pattern in different orientations. The proposed technique is easy to use and flexible compared with classical calibration techniques for depth-dependent distortion models in which the lens must be perpendicular to the planar pattern. The experimental validation of the MDM and its calibration method showed that the MDM improved the calibration accuracy by 56.55% and 74.15% compared with Li's distortion model and traditional Brown's distortion model. Besides, an iteration-based reconstruction method is proposed to iteratively estimate the depth information in the MDM during three-dimensional reconstruction. 
The results showed that the accuracy of the iteration-based reconstruction method was improved by 9.08% compared with that of the non-iteration reconstruction method."}, "https://arxiv.org/abs/2404.19495": {"title": "Percentage Coefficient (bp) -- Effect Size Analysis (Theory Paper 1)", "link": "https://arxiv.org/abs/2404.19495", "description": "arXiv:2404.19495v1 Announce Type: cross \nAbstract: Percentage coefficient (bp) has emerged in recent publications as an additional and alternative estimator of effect size for regression analysis. This paper retraces the theory behind the estimator. It is posited that an estimator must first serve the fundamental function of enabling researchers and readers to comprehend an estimand, the target of estimation. It may then serve the instrumental function of enabling researchers and readers to compare two or more estimands. Defined as the regression coefficient when dependent variable (DV) and independent variable (IV) are both on conceptual 0-1 percentage scales, percentage coefficients (bp) feature 1) a clearly comprehensible interpretation and 2) equitable scales for comparison. Thus, the coefficient (bp) serves both functions effectively and efficiently, thereby serving some needs not completely served by other indicators such as raw coefficient (bw) and standardized beta. Another fundamental premise of the functionalist theory is that \"effect\" is not a monolithic concept. Rather, it is a collection of compartments, each of which measures a component of the conglomerate that we call \"effect.\" A regression coefficient (b), for example, measures one aspect of effect, which is unit effect, aka efficiency, as it indicates the unit change in DV associated with a one-unit increase in IV. Percentage coefficient (bp) indicates the change in DV in percentage points associated with a whole scale increase in IV. It is not meant to be an all-encompassing indicator of the all-encompassing concept of effect, but rather an interpretable and comparable indicator of efficiency, one of the key components of effect."}, "https://arxiv.org/abs/2404.19640": {"title": "Attacking Bayes: On the Adversarial Robustness of Bayesian Neural Networks", "link": "https://arxiv.org/abs/2404.19640", "description": "arXiv:2404.19640v1 Announce Type: cross \nAbstract: Adversarial examples have been shown to cause neural networks to fail on a wide range of vision and language tasks, but recent work has claimed that Bayesian neural networks (BNNs) are inherently robust to adversarial perturbations. In this work, we examine this claim. To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. 
We also identify various conceptual and experimental errors in previous works that claimed inherent adversarial robustness of BNNs and conclusively demonstrate that BNNs and uncertainty-aware Bayesian prediction pipelines are not inherently robust against adversarial attacks."}, "https://arxiv.org/abs/2102.08809": {"title": "Testing for Nonlinear Cointegration under Heteroskedasticity", "link": "https://arxiv.org/abs/2102.08809", "description": "arXiv:2102.08809v3 Announce Type: replace \nAbstract: This article discusses tests for nonlinear cointegration in the presence of variance breaks. We build on cointegration test approaches under heteroskedasticity (Cavaliere and Taylor, 2006, Journal of Time Series Analysis) and nonlinearity (Choi and Saikkonen, 2010, Econometric Theory) to propose a bootstrap test and prove its consistency. A Monte Carlo study shows the approach to have satisfactory finite-sample properties. We provide an empirical application to the environmental Kuznets curves (EKC), finding that the cointegration test provides little evidence for the EKC hypothesis. Additionally, we examine a nonlinear relation between the US money demand and the interest rate, finding that our test does not reject the null of a smooth transition cointegrating relation."}, "https://arxiv.org/abs/2112.10248": {"title": "Efficient Modeling of Spatial Extremes over Large Geographical Domains", "link": "https://arxiv.org/abs/2112.10248", "description": "arXiv:2112.10248v3 Announce Type: replace \nAbstract: Various natural phenomena exhibit spatial extremal dependence at short spatial distances. However, existing models proposed in the spatial extremes literature often assume that extremal dependence persists across the entire domain. This is a strong limitation when modeling extremes over large geographical domains, and yet it has been mostly overlooked in the literature. We here develop a more realistic Bayesian framework based on a novel Gaussian scale mixture model, with the Gaussian process component defined by a stochastic partial differential equation yielding a sparse precision matrix, and the random scale component modeled as a low-rank Pareto-tailed or Weibull-tailed spatial process determined by compactly-supported basis functions. We show that our proposed model is approximately tail-stationary and that it can capture a wide range of extremal dependence structures. Its inherently sparse structure allows fast Bayesian computations in high spatial dimensions based on a customized Markov chain Monte Carlo algorithm prioritizing calibration in the tail. We fit our model to analyze heavy monsoon rainfall data in Bangladesh. Our study shows that our model outperforms natural competitors and that it fits precipitation extremes well. We finally use the fitted model to draw inference on long-term return levels for marginal precipitation and spatial aggregates."}, "https://arxiv.org/abs/2204.00473": {"title": "Finite Sample Inference in Incomplete Models", "link": "https://arxiv.org/abs/2204.00473", "description": "arXiv:2204.00473v2 Announce Type: replace \nAbstract: We propose confidence regions for the parameters of incomplete models with exact coverage of the true parameter in finite samples. Our confidence region inverts a test, which generalizes Monte Carlo tests to incomplete models. The test statistic is a discrete analogue of a new optimal transport characterization of the sharp identified region. 
Both test statistic and critical values rely on simulation drawn from the distribution of latent variables and are computed using solutions to discrete optimal transport, hence linear programming problems. We also propose a fast preliminary search in the parameter space with an alternative, more conservative yet consistent test, based on a parameter free critical value."}, "https://arxiv.org/abs/2210.01757": {"title": "Transportability of model-based estimands in evidence synthesis", "link": "https://arxiv.org/abs/2210.01757", "description": "arXiv:2210.01757v5 Announce Type: replace \nAbstract: In evidence synthesis, effect modifiers are typically described as variables that induce treatment effect heterogeneity at the individual level, through treatment-covariate interactions in an outcome model parametrized at such level. As such, effect modification is defined with respect to a conditional measure, but marginal effect estimates are required for population-level decisions in health technology assessment. For non-collapsible measures, purely prognostic variables that are not determinants of treatment response at the individual level may modify marginal effects, even where there is individual-level treatment effect homogeneity. With heterogeneity, marginal effects for measures that are not directly collapsible cannot be expressed in terms of marginal covariate moments, and generally depend on the joint distribution of conditional effect measure modifiers and purely prognostic variables. There are implications for recommended practices in evidence synthesis. Unadjusted anchored indirect comparisons can be biased in the absence of individual-level treatment effect heterogeneity, or when marginal covariate moments are balanced across studies. Covariate adjustment may be necessary to account for cross-study imbalances in joint covariate distributions involving purely prognostic variables. In the absence of individual patient data for the target, covariate adjustment approaches are inherently limited in their ability to remove bias for measures that are not directly collapsible. Directly collapsible measures would facilitate the transportability of marginal effects between studies by: (1) reducing dependence on model-based covariate adjustment where there is individual-level treatment effect homogeneity or marginal covariate moments are balanced; and (2) facilitating the selection of baseline covariates for adjustment where there is individual-level treatment effect heterogeneity."}, "https://arxiv.org/abs/2210.05792": {"title": "Flexible Modeling of Nonstationary Extremal Dependence using Spatially-Fused LASSO and Ridge Penalties", "link": "https://arxiv.org/abs/2210.05792", "description": "arXiv:2210.05792v3 Announce Type: replace \nAbstract: Statistical modeling of a nonstationary spatial extremal dependence structure is challenging. Max-stable processes are common choices for modeling spatially-indexed block maxima, where an assumption of stationarity is usual to make inference feasible. However, this assumption is often unrealistic for data observed over a large or complex domain. We propose a computationally-efficient method for estimating extremal dependence using a globally nonstationary, but locally-stationary, max-stable process by exploiting nonstationary kernel convolutions. We divide the spatial domain into a fine grid of subregions, assign each of them its own dependence parameters, and use LASSO ($L_1$) or ridge ($L_2$) penalties to obtain spatially-smooth parameter estimates. 
We then develop a novel data-driven algorithm to merge homogeneous neighboring subregions. The algorithm facilitates model parsimony and interpretability. To make our model suitable for high-dimensional data, we exploit a pairwise likelihood to draw inferences and discuss computational and statistical efficiency. An extensive simulation study demonstrates the superior performance of our proposed model and the subregion-merging algorithm over the approaches that either do not model nonstationarity or do not update the domain partition. We apply our proposed method to model monthly maximum temperatures at over 1400 sites in Nepal and the surrounding Himalayan and sub-Himalayan regions; we again observe significant improvements in model fit compared to a stationary process and a nonstationary process without subregion-merging. Furthermore, we demonstrate that the estimated merged partition is interpretable from a geographic perspective and leads to better model diagnostics by adequately reducing the number of subregion-specific parameters."}, "https://arxiv.org/abs/2210.13027": {"title": "E-Valuating Classifier Two-Sample Tests", "link": "https://arxiv.org/abs/2210.13027", "description": "arXiv:2210.13027v2 Announce Type: replace \nAbstract: We introduce a powerful deep classifier two-sample test for high-dimensional data based on E-values, called E-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting E-values are suitable for anytime-valid sequential two-sample tests. This feature allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level."}, "https://arxiv.org/abs/2302.03435": {"title": "Logistic regression with missing responses and predictors: a review of existing approaches and a case study", "link": "https://arxiv.org/abs/2302.03435", "description": "arXiv:2302.03435v2 Announce Type: replace \nAbstract: This work considers logistic regression when both the response and the predictor variables may be missing. Several existing approaches are reviewed, including complete case analysis, inverse probability weighting, multiple imputation and maximum likelihood. The methods are compared in a simulation study, which serves to evaluate the bias, the variance and the mean squared error of the estimators for the regression coefficients. In the simulations, the maximum likelihood methodology is the one that presents the best results, followed by multiple imputation with five imputations, which is the second best. The methods are applied to a case study on obesity among schoolchildren in the municipality of Viana do Castelo, North Portugal, where a logistic regression model is used to predict the International Obesity Task Force (IOTF) indicator from physical examinations and the past values of the obesity status. All the variables in the case study are potentially missing, with gender as the only exception. The results provided by the several methods are in good agreement, indicating the relevance of the past values of IOTF and physical scores for the prediction of obesity. 
Practical recommendations are given."}, "https://arxiv.org/abs/2303.17872": {"title": "Lancaster correlation -- a new dependence measure linked to maximum correlation", "link": "https://arxiv.org/abs/2303.17872", "description": "arXiv:2303.17872v2 Announce Type: replace \nAbstract: We suggest novel correlation coefficients which equal the maximum correlation for a class of bivariate Lancaster distributions while being only slightly smaller than maximum correlation for a variety of further bivariate distributions. In contrast to maximum correlation, however, our correlation coefficients allow for rank and moment-based estimators which are simple to compute and have tractable asymptotic distributions. Confidence intervals resulting from these asymptotic approximations and the covariance bootstrap show good finite-sample coverage. In a simulation, the power of asymptotic as well as permutation tests for independence based on our correlation measures compares favorably with competing methods based on distance correlation or rank coefficients for functional dependence, among others. Moreover, for the bivariate normal distribution, our correlation coefficients equal the absolute value of the Pearson correlation, an attractive feature for practitioners which is not shared by various competitors. We illustrate the practical usefulness of our methods in applications to two real data sets."}, "https://arxiv.org/abs/2401.11128": {"title": "Regularized Estimation of Sparse Spectral Precision Matrices", "link": "https://arxiv.org/abs/2401.11128", "description": "arXiv:2401.11128v2 Announce Type: replace \nAbstract: Spectral precision matrix, the inverse of a spectral density matrix, is an object of central interest in frequency-domain analysis of multivariate time series. Estimation of spectral precision matrix is a key step in calculating partial coherency and graphical model selection of stationary time series. When the dimension of a multivariate time series is moderate to large, traditional estimators of spectral density matrices such as averaged periodograms tend to be severely ill-conditioned, and one needs to resort to suitable regularization strategies involving optimization over complex variables.\n In this work, we propose complex graphical Lasso (CGLASSO), an $\\ell_1$-penalized estimator of spectral precision matrix based on local Whittle likelihood maximization. We develop fast $\\textit{pathwise coordinate descent}$ algorithms for implementing CGLASSO on large dimensional time series data sets. At its core, our algorithmic development relies on a ring isomorphism between complex and real matrices that helps map a number of optimization problems over complex variables to similar optimization problems over real variables. This finding may be of independent interest and more broadly applicable for high-dimensional statistical analysis with complex-valued data. We also present a complete non-asymptotic theory of our proposed estimator which shows that consistent estimation is possible in high-dimensional regime as long as the underlying spectral precision matrix is suitably sparse. 
We compare the performance of CGLASSO with competing alternatives on simulated data sets, and use it to construct partial coherence network among brain regions from a real fMRI data set."}, "https://arxiv.org/abs/2208.03215": {"title": "Hierarchical Bayesian data selection", "link": "https://arxiv.org/abs/2208.03215", "description": "arXiv:2208.03215v2 Announce Type: replace-cross \nAbstract: There are many issues that can cause problems when attempting to infer model parameters from data. Data and models are both imperfect, and as such there are multiple scenarios in which standard methods of inference will lead to misleading conclusions; corrupted data, models which are only representative of subsets of the data, or multiple regions in which the model is best fit using different parameters. Methods exist for the exclusion of some anomalous types of data, but in practice, data cleaning is often undertaken by hand before attempting to fit models to data. In this work, we will employ hierarchical Bayesian data selection; the simultaneous inference of both model parameters, and parameters which represent our belief that each observation within the data should be included in the inference. The aim, within a Bayesian setting, is to find the regions of observation space for which the model can well-represent the data, and to find the corresponding model parameters for those regions. A number of approaches will be explored, and applied to test problems in linear regression, and to the problem of fitting an ODE model, approximated by a finite difference method. The approaches are simple to implement, can aid mixing of Markov chains designed to sample from the arising densities, and are broadly applicable to many inferential problems."}, "https://arxiv.org/abs/2303.02756": {"title": "A New Class of Realistic Spatio-Temporal Processes with Advection and Their Simulation", "link": "https://arxiv.org/abs/2303.02756", "description": "arXiv:2303.02756v2 Announce Type: replace-cross \nAbstract: Traveling phenomena, frequently observed in a variety of scientific disciplines including atmospheric science, seismography, and oceanography, have long suffered from limitations due to lack of realistic statistical modeling tools and simulation methods. Our work primarily addresses this, introducing more realistic and flexible models for spatio-temporal random fields. We break away from the traditional confines of the classic frozen field by either relaxing the assumption of a single deterministic velocity or rethinking the hypothesis regarding the spectrum shape, thus enhancing the realism of our models. While the proposed models stand out for their realism and flexibility, they are also paired with simulation algorithms that are equally or less computationally complex than the commonly used circulant embedding for Gaussian random fields in $\\mathbb{R}^{2+1}$. This combination of realistic modeling with efficient simulation methods creates an effective solution for better understanding traveling phenomena."}, "https://arxiv.org/abs/2308.00957": {"title": "Causal Inference with Differentially Private (Clustered) Outcomes", "link": "https://arxiv.org/abs/2308.00957", "description": "arXiv:2308.00957v2 Announce Type: replace-cross \nAbstract: Estimating causal effects from randomized experiments is only feasible if participants agree to reveal their potentially sensitive responses. 
Of the many ways of ensuring privacy, label differential privacy is a widely used measure of an algorithm's privacy guarantee, which might encourage participants to share responses without running the risk of de-anonymization. Many differentially private mechanisms inject noise into the original data-set to achieve this privacy guarantee, which increases the variance of most statistical estimators and makes the precise measurement of causal effects difficult: there exists a fundamental privacy-variance trade-off to performing causal analyses from differentially private data. With the aim of achieving lower variance for stronger privacy guarantees, we suggest a new differential privacy mechanism, Cluster-DP, which leverages any given cluster structure of the data while still allowing for the estimation of causal effects. We show that, depending on an intuitive measure of cluster quality, we can improve the variance loss while maintaining our privacy guarantees. We compare its performance, theoretically and empirically, to that of its unclustered version and a more extreme uniform-prior version which does not use any of the original response distribution, both of which are special cases of the Cluster-DP algorithm."}, "https://arxiv.org/abs/2405.00158": {"title": "BayesBlend: Easy Model Blending using Pseudo-Bayesian Model Averaging, Stacking and Hierarchical Stacking in Python", "link": "https://arxiv.org/abs/2405.00158", "description": "arXiv:2405.00158v1 Announce Type: new \nAbstract: Averaging predictions from multiple competing inferential models frequently outperforms predictions from any single model, providing that models are optimally weighted to maximize predictive performance. This is particularly the case in so-called $\\mathcal{M}$-open settings where the true model is not in the set of candidate models, and may be neither mathematically reifiable nor known precisely. This practice of model averaging has a rich history in statistics and machine learning, and there are currently a number of methods to estimate the weights for constructing model-averaged predictive distributions. Nonetheless, there are few existing software packages that can estimate model weights from the full variety of methods available, and none that blend model predictions into a coherent predictive distribution according to the estimated weights. In this paper, we introduce the BayesBlend Python package, which provides a user-friendly programming interface to estimate weights and blend multiple (Bayesian) models' predictive distributions. BayesBlend implements pseudo-Bayesian model averaging, stacking and, uniquely, hierarchical Bayesian stacking to estimate model weights. We demonstrate the usage of BayesBlend with examples of insurance loss modeling."}, "https://arxiv.org/abs/2405.00161": {"title": "Estimating Heterogeneous Treatment Effects with Item-Level Outcome Data: Insights from Item Response Theory", "link": "https://arxiv.org/abs/2405.00161", "description": "arXiv:2405.00161v1 Announce Type: new \nAbstract: Analyses of heterogeneous treatment effects (HTE) are common in applied causal inference research. However, when outcomes are latent variables assessed via psychometric instruments such as educational tests, standard methods ignore the potential HTE that may exist among the individual items of the outcome measure. 
Failing to account for \"item-level\" HTE (IL-HTE) can lead to both estimated standard errors that are too small and identification challenges in the estimation of treatment-by-covariate interaction effects. We demonstrate how Item Response Theory (IRT) models that estimate a treatment effect for each assessment item can both address these challenges and provide new insights into HTE generally. This study articulates the theoretical rationale for the IL-HTE model and demonstrates its practical value using data from 20 randomized controlled trials in economics, education, and health. Our results show that the IL-HTE model reveals item-level variation masked by average treatment effects, provides more accurate statistical inference, allows for estimates of the generalizability of causal effects, resolves identification problems in the estimation of interaction effects, and provides estimates of standardized treatment effect sizes corrected for attenuation due to measurement error."}, "https://arxiv.org/abs/2405.00179": {"title": "A Bayesian joint longitudinal-survival model with a latent stochastic process for intensive longitudinal data", "link": "https://arxiv.org/abs/2405.00179", "description": "arXiv:2405.00179v1 Announce Type: new \nAbstract: The availability of mobile health (mHealth) technology has enabled increased collection of intensive longitudinal data (ILD). ILD have potential to capture rapid fluctuations in outcomes that may be associated with changes in the risk of an event. However, existing methods for jointly modeling longitudinal and event-time outcomes are not well-equipped to handle ILD due to the high computational cost. We propose a joint longitudinal and time-to-event model suitable for analyzing ILD. In this model, we summarize a multivariate longitudinal outcome as a smaller number of time-varying latent factors. These latent factors, which are modeled using an Ornstein-Uhlenbeck stochastic process, capture the risk of a time-to-event outcome in a parametric hazard model. We take a Bayesian approach to fit our joint model and conduct simulations to assess its performance. We use it to analyze data from an mHealth study of smoking cessation. We summarize the longitudinal self-reported intensity of nine emotions as the psychological states of positive and negative affect. These time-varying latent states capture the risk of the first smoking lapse after an attempted quit. Understanding factors associated with smoking lapse is of keen interest to smoking cessation researchers."}, "https://arxiv.org/abs/2405.00185": {"title": "Finite-sample adjustments for comparing clustered adaptive interventions using data from a clustered SMART", "link": "https://arxiv.org/abs/2405.00185", "description": "arXiv:2405.00185v1 Announce Type: new \nAbstract: Adaptive interventions, aka dynamic treatment regimens, are sequences of pre-specified decision rules that guide the provision of treatment for an individual given information about their baseline and evolving needs, including in response to prior intervention. Clustered adaptive interventions (cAIs) extend this idea by guiding the provision of intervention at the level of clusters (e.g., clinics), but with the goal of improving outcomes at the level of individuals within the cluster (e.g., clinicians or patients within clinics). A clustered, sequential multiple-assignment randomized trial (cSMART) is a multistage, multilevel randomized trial design used to construct high-quality cAIs. 
In a cSMART, clusters are randomized at multiple intervention decision points; at each decision point, the randomization probability can depend on response to prior data. A challenge in cluster-randomized trials, including cSMARTs, is the deleterious effect of small samples of clusters on statistical inference, particularly via estimation of standard errors. \\par This manuscript develops finite-sample adjustment (FSA) methods for making improved statistical inference about the causal effects of cAIs in a cSMART. The paper develops FSA methods that (i) scale variance estimators using a degree-of-freedom adjustment, (ii) reference a t distribution (instead of a normal), and (iii) employ a ``bias corrected\" variance estimator. Method (iii) requires extensions that are unique to the analysis of cSMARTs. Extensive simulation experiments are used to test the performance of the methods. The methods are illustrated using the Adaptive School-based Implementation of CBT (ASIC) study, a cSMART designed to construct a cAI for improving the delivery of cognitive behavioral therapy (CBT) by school mental health professionals within high schools in Michigan."}, "https://arxiv.org/abs/2405.00294": {"title": "Conformal inference for random objects", "link": "https://arxiv.org/abs/2405.00294", "description": "arXiv:2405.00294v1 Announce Type: new \nAbstract: We develop an inferential toolkit for analyzing object-valued responses, which correspond to data situated in general metric spaces, paired with Euclidean predictors within the conformal framework. To this end we introduce conditional profile average transport costs, where we compare distance profiles that correspond to one-dimensional distributions of probability mass falling into balls of increasing radius through the optimal transport cost when moving from one distance profile to another. The average transport cost to transport a given distance profile to all others is crucial for statistical inference in metric spaces and underpins the proposed conditional profile scores. A key feature of the proposed approach is to utilize the distribution of conditional profile average transport costs as conformity score for general metric space-valued responses, which facilitates the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity for the prediction sets. The finite sample performance for synthetic data in various metric spaces demonstrates that the proposed conditional profile score outperforms existing methods in terms of both coverage level and size of the resulting prediction sets, even in the special case of scalar and thus Euclidean responses. We also demonstrate the practical utility of conditional profile scores for network data from New York taxi trips and for compositional data reflecting energy sourcing of U.S. states."}, "https://arxiv.org/abs/2405.00424": {"title": "Optimal Bias-Correction and Valid Inference in High-Dimensional Ridge Regression: A Closed-Form Solution", "link": "https://arxiv.org/abs/2405.00424", "description": "arXiv:2405.00424v1 Announce Type: new \nAbstract: Ridge regression is an indispensable tool in big data econometrics but suffers from bias issues affecting both statistical efficiency and scalability. We introduce an iterative strategy to correct the bias effectively when the dimension $p$ is less than the sample size $n$. 
For $p>n$, our method optimally reduces the bias to a level unachievable through linear transformations of the response. We employ a Ridge-Screening (RS) method to handle the remaining bias when $p>n$, creating a reduced model suitable for bias-correction. Under certain conditions, the selected model nests the true one, making RS a novel variable selection approach. We establish the asymptotic properties and valid inferences of our de-biased ridge estimators for both $p< n$ and $p>n$, where $p$ and $n$ may grow towards infinity, along with the number of iterations. Our method is validated using simulated and real-world data examples, providing a closed-form solution to bias challenges in ridge regression inferences."}, "https://arxiv.org/abs/2405.00535": {"title": "Bayesian Varying-Effects Vector Autoregressive Models for Inference of Brain Connectivity Networks and Covariate Effects in Pediatric Traumatic Brain Injury", "link": "https://arxiv.org/abs/2405.00535", "description": "arXiv:2405.00535v1 Announce Type: new \nAbstract: In this paper, we develop an analytical approach for estimating brain connectivity networks that accounts for subject heterogeneity. More specifically, we consider a novel extension of a multi-subject Bayesian vector autoregressive model that estimates group-specific directed brain connectivity networks and accounts for the effects of covariates on the network edges. We adopt a flexible approach, allowing for (possibly) non-linear effects of the covariates on edge strength via a novel Bayesian nonparametric prior that employs a weighted mixture of Gaussian processes. For posterior inference, we achieve computational scalability by implementing a variational Bayes scheme. Our approach enables simultaneous estimation of group-specific networks and selection of relevant covariate effects. We show improved performance over competing two-stage approaches on simulated data. We apply our method to resting-state fMRI data from children with a history of traumatic brain injury and healthy controls to estimate the effects of age and sex on the group-level connectivities. Our results highlight differences in the distribution of parent nodes. They also suggest alteration in the relation of age, with peak edge strength in children with traumatic brain injury (TBI), and differences in effective connectivity strength between males and females."}, "https://arxiv.org/abs/2405.00581": {"title": "Conformalized Tensor Completion with Riemannian Optimization", "link": "https://arxiv.org/abs/2405.00581", "description": "arXiv:2405.00581v1 Announce Type: new \nAbstract: Tensor data, or multi-dimensional array, is a data format popular in multiple fields such as social network analysis, recommender systems, and brain imaging. It is not uncommon to observe tensor data containing missing values and tensor completion aims at estimating the missing values given the partially observed tensor. Considerable effort has been devoted to devising scalable tensor completion algorithms, but little to quantifying the uncertainty of the estimator. In this paper, we nest the uncertainty quantification (UQ) of tensor completion under a split conformal prediction framework and establish the connection of the UQ problem to a problem of estimating the missing propensity of each tensor entry. We model the data missingness of the tensor with a tensor Ising model parameterized by a low-rank tensor parameter. 
We propose to estimate the tensor parameter by maximum pseudo-likelihood estimation (MPLE) with a Riemannian gradient descent algorithm. Extensive simulation studies have been conducted to justify the validity of the resulting conformal interval. We apply our method to the regional total electron content (TEC) reconstruction problem."}, "https://arxiv.org/abs/2405.00619": {"title": "One-Bit Total Variation Denoising over Networks with Applications to Partially Observed Epidemics", "link": "https://arxiv.org/abs/2405.00619", "description": "arXiv:2405.00619v1 Announce Type: new \nAbstract: This paper introduces a novel approach for epidemic nowcasting and forecasting over networks using total variation (TV) denoising, a method inspired by classical signal processing techniques. Considering a network that models a population as a set of $n$ nodes characterized by their infection statuses $Y_i$ and that represents contacts as edges, we prove the consistency of graph-TV denoising for estimating the underlying infection probabilities $\\{p_i\\}_{ i \\in \\{1,\\cdots, n\\}}$ in the presence of Bernoulli noise. Our results provide an important extension of existing bounds derived in the Gaussian case to the study of binary variables -- an approach hereafter referred to as one-bit total variation denoising. The methodology is further extended to handle incomplete observations, thereby expanding its relevance to various real-world situations where observations over the full graph may not be accessible. Focusing on the context of epidemics, we establish that one-bit total variation denoising enhances both nowcasting and forecasting accuracy in networks, as further evidenced by comprehensive numerical experiments and two real-world examples. The contributions of this paper lie in its theoretical developments, particularly in addressing the incomplete data case, thereby paving the way for more precise epidemic modelling and enhanced surveillance strategies in practical settings."}, "https://arxiv.org/abs/2405.00626": {"title": "SARMA: Scalable Low-Rank High-Dimensional Autoregressive Moving Averages via Tensor Decomposition", "link": "https://arxiv.org/abs/2405.00626", "description": "arXiv:2405.00626v1 Announce Type: new \nAbstract: Existing models for high-dimensional time series are overwhelmingly developed within the finite-order vector autoregressive (VAR) framework, whereas the more flexible vector autoregressive moving averages (VARMA) have been much less considered. This paper introduces a high-dimensional model for capturing VARMA dynamics, namely the Scalable ARMA (SARMA) model, by combining novel reparameterization and tensor decomposition techniques. To ensure identifiability and computational tractability, we first consider a reparameterization of the VARMA model and discover that this interestingly amounts to a Tucker-low-rank structure for the AR coefficient tensor along the temporal dimension. Motivated by this finding, we further consider Tucker decomposition across the response and predictor dimensions of the AR coefficient tensor, enabling factor extraction across variables and time lags. Additionally, we consider sparsity assumptions on the factor loadings to accomplish automatic variable selection and greater estimation efficiency. For the proposed model, we develop both rank-constrained and sparsity-inducing estimators. Algorithms and model selection methods are also provided. 
Simulation studies and empirical examples confirm the validity of our theory and advantages of our approaches over existing competitors."}, "https://arxiv.org/abs/2405.00118": {"title": "Causal Inference with High-dimensional Discrete Covariates", "link": "https://arxiv.org/abs/2405.00118", "description": "arXiv:2405.00118v1 Announce Type: cross \nAbstract: When estimating causal effects from observational studies, researchers often need to adjust for many covariates to deconfound the non-causal relationship between exposure and outcome, among which many covariates are discrete. The behavior of commonly used estimators in the presence of many discrete covariates is not well understood since their properties are often analyzed under structural assumptions including sparsity and smoothness, which do not apply in discrete settings. In this work, we study the estimation of causal effects in a model where the covariates required for confounding adjustment are discrete but high-dimensional, meaning the number of categories $d$ is comparable with or even larger than sample size $n$. Specifically, we show the mean squared error of commonly used regression, weighting and doubly robust estimators is bounded by $\\frac{d^2}{n^2}+\\frac{1}{n}$. We then prove the minimax lower bound for the average treatment effect is of order $\\frac{d^2}{n^2 \\log^2 n}+\\frac{1}{n}$, which characterizes the fundamental difficulty of causal effect estimation in the high-dimensional discrete setting, and shows the estimators mentioned above are rate-optimal up to log-factors. We further consider additional structures that can be exploited, namely effect homogeneity and prior knowledge of the covariate distribution, and propose new estimators that enjoy faster convergence rates of order $\\frac{d}{n^2} + \\frac{1}{n}$, which achieve consistency in a broader regime. The results are illustrated empirically via simulation studies."}, "https://arxiv.org/abs/2405.00417": {"title": "Conformal Risk Control for Ordinal Classification", "link": "https://arxiv.org/abs/2405.00417", "description": "arXiv:2405.00417v1 Announce Type: cross \nAbstract: As a natural extension to the standard conformal prediction method, several conformal risk control methods have been recently developed and applied to various learning problems. In this work, we seek to control the conformal risk in expectation for ordinal classification tasks, which have broad applications to many real problems. For this purpose, we firstly formulated the ordinal classification task in the conformal risk control framework, and provided theoretic risk bounds of the risk control method. Then we proposed two types of loss functions specially designed for ordinal classification tasks, and developed corresponding algorithms to determine the prediction set for each case to control their risks at a desired level. We demonstrated the effectiveness of our proposed methods, and analyzed the difference between the two types of risks on three different datasets, including a simulated dataset, the UTKFace dataset and the diabetic retinopathy detection dataset."}, "https://arxiv.org/abs/2405.00576": {"title": "Calibration of the rating transition model for high and low default portfolios", "link": "https://arxiv.org/abs/2405.00576", "description": "arXiv:2405.00576v1 Announce Type: cross \nAbstract: In this paper we develop Maximum likelihood (ML) based algorithms to calibrate the model parameters in credit rating transition models. 
Since the credit rating transition models are not Gaussian linear models, the celebrated Kalman filter is not suitable to compute the likelihood of observed migrations. Therefore, we develop a Laplace approximation of the likelihood function and as a result the Kalman filter can be used in the end to compute the likelihood function. This approach is applied to so-called high-default portfolios, in which the number of migrations (defaults) is large enough to obtain high accuracy of the Laplace approximation. By contrast, low-default portfolios have a limited number of observed migrations (defaults). Therefore, in order to calibrate low-default portfolios, we develop a ML algorithm using a particle filter (PF) and Gaussian process regression. Experiments show that both algorithms are efficient and produce accurate approximations of the likelihood function and the ML estimates of the model parameters."}, "https://arxiv.org/abs/1902.09608": {"title": "On Binscatter", "link": "https://arxiv.org/abs/1902.09608", "description": "arXiv:1902.09608v5 Announce Type: replace \nAbstract: Binscatter is a popular method for visualizing bivariate relationships and conducting informal specification testing. We study the properties of this method formally and develop enhanced visualization and econometric binscatter tools. These include estimating conditional means with optimal binning and quantifying uncertainty. We also highlight a methodological problem related to covariate adjustment that can yield incorrect conclusions. We revisit two applications using our methodology and find substantially different results relative to those obtained using prior informal binscatter methods. General purpose software in Python, R, and Stata is provided. Our technical work is of independent interest for the nonparametric partition-based estimation literature."}, "https://arxiv.org/abs/2207.04248": {"title": "A Statistical-Modelling Approach to Feedforward Neural Network Model Selection", "link": "https://arxiv.org/abs/2207.04248", "description": "arXiv:2207.04248v5 Announce Type: replace \nAbstract: Feedforward neural networks (FNNs) can be viewed as non-linear regression models, where covariates enter the model through a combination of weighted summations and non-linear functions. Although these models have some similarities to the approaches used within statistical modelling, the majority of neural network research has been conducted outside of the field of statistics. This has resulted in a lack of statistically-based methodology, and, in particular, there has been little emphasis on model parsimony. Determining the input layer structure is analogous to variable selection, while the structure for the hidden layer relates to model complexity. In practice, neural network model selection is often carried out by comparing models using out-of-sample performance. However, in contrast, the construction of an associated likelihood function opens the door to information-criteria-based variable and architecture selection. A novel model selection method, which performs both input- and hidden-node selection, is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance as the model selection objective function leads to an increased probability of recovering the true model, while parsimoniously achieving favourable out-of-sample performance. 
Simulation studies are used to evaluate and justify the proposed method, and applications to real data are investigated."}, "https://arxiv.org/abs/2307.10503": {"title": "Regularizing threshold priors with sparse response patterns in Bayesian factor analysis with categorical indicators", "link": "https://arxiv.org/abs/2307.10503", "description": "arXiv:2307.10503v2 Announce Type: replace \nAbstract: Instruments comprising ordered responses to items are ubiquitous for studying many constructs of interest. However, using such an item response format may lead to items with response categories infrequently endorsed or not endorsed at all. In maximum likelihood estimation, this results in non-existing estimates for thresholds. This work focuses on a Bayesian estimation approach to counter this issue. The issue changes from the existence of an estimate to how to effectively construct threshold priors. The proposed prior specification reconceptualizes the threshold prior as a prior on the probability of each response category, a metric that is easier to manipulate while maintaining the necessary ordering constraints on the thresholds. The resulting induced prior is more communicable, and we demonstrate statistical efficiency comparable to that of existing threshold priors. Evidence is provided using a simulated data set, a Monte Carlo simulation study, and an example multi-group item-factor model analysis. All analyses demonstrate how at least a relatively informative threshold prior is necessary to avoid inefficient posterior sampling and increase confidence in the coverage rates of posterior credible intervals."}, "https://arxiv.org/abs/2307.15330": {"title": "Group integrative dynamic factor models with application to multiple subject brain connectivity", "link": "https://arxiv.org/abs/2307.15330", "description": "arXiv:2307.15330v3 Announce Type: replace \nAbstract: This work introduces a novel framework for dynamic factor model-based group-level analysis of multiple subjects' time series data, called GRoup Integrative DYnamic factor (GRIDY) models. The framework identifies and characterizes inter-subject similarities and differences between two pre-determined groups by considering a combination of group spatial information and individual temporal dynamics. Furthermore, it enables the identification of intra-subject similarities and differences over time by employing different model configurations for each subject. Methodologically, the framework combines a novel principal angle-based rank selection algorithm and a non-iterative integrative analysis framework. Inspired by simultaneous component analysis, this approach also reconstructs identifiable latent factor series with flexible covariance structures. The performance of the GRIDY models is evaluated through simulations conducted under various scenarios. An application is also presented to compare resting-state functional MRI data collected from multiple subjects in autism spectrum disorder and control groups."}, "https://arxiv.org/abs/2309.09299": {"title": "Bounds on Average Effects in Discrete Choice Panel Data Models", "link": "https://arxiv.org/abs/2309.09299", "description": "arXiv:2309.09299v3 Announce Type: replace \nAbstract: In discrete choice panel data, the estimation of average effects is crucial for quantifying the effect of covariates, and for policy evaluation and counterfactual analysis. 
This task is challenging in short panels with individual-specific effects due to partial identification and the incidental parameter problem. In particular, estimation of the sharp identified set is practically infeasible at realistic sample sizes whenever the number of support points of the observed covariates is large, such as when the covariates are continuous. In this paper, we therefore propose estimating outer bounds on the identified set of average effects. Our bounds are easy to construct, converge at the parametric rate, and are computationally simple to obtain even in moderately large samples, independent of whether the covariates are discrete or continuous. We also provide asymptotically valid confidence intervals on the identified set."}, "https://arxiv.org/abs/2309.13640": {"title": "Visualizing periodic stability in studies: the moving average meta-analysis (MA2)", "link": "https://arxiv.org/abs/2309.13640", "description": "arXiv:2309.13640v2 Announce Type: replace \nAbstract: Relative clinical benefits are often visually explored and formally analysed through a (cumulative) meta-analysis. In this manuscript, we introduce and further explore the moving average meta-analysis to aid towards the exploration and visualization of periodic stability in a meta-analysis."}, "https://arxiv.org/abs/2310.16207": {"title": "Propensity weighting plus adjustment in proportional hazards model is not doubly robust", "link": "https://arxiv.org/abs/2310.16207", "description": "arXiv:2310.16207v2 Announce Type: replace \nAbstract: Recently, it has become common for applied works to combine commonly used survival analysis modeling methods, such as the multivariable Cox model and propensity score weighting, with the intention of forming a doubly robust estimator of an exposure effect hazard ratio that is unbiased in large samples when either the Cox model or the propensity score model is correctly specified. This combination does not, in general, produce a doubly robust estimator, even after regression standardization, when there is truly a causal effect. We demonstrate via simulation this lack of double robustness for the semiparametric Cox model, the Weibull proportional hazards model, and a simple proportional hazards flexible parametric model, with both the latter models fit via maximum likelihood. We provide a novel proof that the combination of propensity score weighting and a proportional hazards survival model, fit either via full or partial likelihood, is consistent under the null of no causal effect of the exposure on the outcome under particular censoring mechanisms if either the propensity score or the outcome model is correctly specified and contains all confounders. Given our results suggesting that double robustness only exists under the null, we outline two simple alternative estimators that are doubly robust for the survival difference at a given time point (in the above sense), provided the censoring mechanism can be correctly modeled, and one doubly robust method of estimation for the full survival curve. 
We provide R code to use these estimators for estimation and inference in the supporting information."}, "https://arxiv.org/abs/2311.11054": {"title": "Modern extreme value statistics for Utopian extremes", "link": "https://arxiv.org/abs/2311.11054", "description": "arXiv:2311.11054v2 Announce Type: replace \nAbstract: Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the EVA (2023) Conference Data Challenge, which was organised for the 13$^\\text{th}$ International Conference on Extreme Value Analysis. This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of \"Utopia\""}, "https://arxiv.org/abs/2311.17271": {"title": "Spatial-Temporal Extreme Modeling for Point-to-Area Random Effects (PARE)", "link": "https://arxiv.org/abs/2311.17271", "description": "arXiv:2311.17271v2 Announce Type: replace \nAbstract: One measurement modality for rainfall is a fixed location rain gauge. However, extreme rainfall, flooding, and other climate extremes often occur at larger spatial scales and affect more than one location in a community. For example, in 2017 Hurricane Harvey impacted all of Houston and the surrounding region causing widespread flooding. Flood risk modeling requires understanding of rainfall for hydrologic regions, which may contain one or more rain gauges. Further, policy changes to address the risks and damages of natural hazards such as severe flooding are usually made at the community/neighborhood level or higher geo-spatial scale. Therefore, spatial-temporal methods which convert results from one spatial scale to another are especially useful in applications for evolving environmental extremes. We develop a point-to-area random effects (PARE) modeling strategy for understanding spatial-temporal extreme values at the areal level, when the core information are time series at point locations distributed over the region."}, "https://arxiv.org/abs/2401.00264": {"title": "Identification of Nonlinear Dynamic Panels under Partial Stationarity", "link": "https://arxiv.org/abs/2401.00264", "description": "arXiv:2401.00264v3 Announce Type: replace \nAbstract: This paper studies identification for a wide range of nonlinear panel data models, including binary choice, ordered response, and other types of limited dependent variable models. Our approach accommodates dynamic models with any number of lagged dependent variables as well as other types of (potentially contemporary) endogeneity. Our identification strategy relies on a partial stationarity condition, which not only allows for an unknown distribution of errors but also for temporal dependencies in errors. 
We derive partial identification results under flexible model specifications and provide additional support conditions for point identification. We demonstrate the robust finite-sample performance of our approach using Monte Carlo simulations, and apply the approach to an empirical application analyzing income categories using various ordered choice models."}, "https://arxiv.org/abs/2312.14191": {"title": "Noisy Measurements Are Important, the Design of Census Products Is Much More Important", "link": "https://arxiv.org/abs/2312.14191", "description": "arXiv:2312.14191v2 Announce Type: replace-cross \nAbstract: McCartan et al. (2023) call for \"making differential privacy work for census data users.\" This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design: the query strategy output. The more important component is the query workload output: the statistics released to the public. Optimizing the query workload, specifically the Redistricting Data (P.L. 94-171) Summary File, could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic."}, "https://arxiv.org/abs/2405.00827": {"title": "Overcoming model uncertainty -- how equivalence tests can benefit from model averaging", "link": "https://arxiv.org/abs/2405.00827", "description": "arXiv:2405.00827v1 Announce Type: new \nAbstract: A common problem in numerous research areas, particularly in clinical trials, is to test whether the effect of an explanatory variable on an outcome variable is equivalent across different groups. In practice, these tests are frequently used to compare the effect between patient groups, e.g. based on gender, age or treatments. Equivalence is usually assessed by testing whether the difference between the groups does not exceed a pre-specified equivalence threshold. Classical approaches are based on testing the equivalence of single quantities, e.g. the mean, the area under the curve (AUC) or other values of interest. However, when differences depending on a particular covariate are observed, these approaches can turn out to be inaccurate. Instead, whole regression curves over the entire covariate range, describing for instance the time window or a dose range, are considered and tests are based on a suitable distance measure of two such curves, as, for example, the maximum absolute distance between them. In this regard, a key assumption is that the true underlying regression models are known, which is rarely the case in practice. However, misspecification can lead to severe problems such as inflated type I errors or, on the other hand, conservative test procedures. In this paper, we propose a solution to this problem by introducing a flexible extension of such an equivalence test using model averaging in order to overcome this assumption and make the test applicable under model uncertainty. 
Specifically, we introduce model averaging based on smooth AIC weights and we propose a testing procedure which makes use of the duality between confidence intervals and hypothesis testing. We demonstrate the validity of our approach by means of a simulation study and illustrate its practical relevance in a time-response case study with toxicological gene expression data."}, "https://arxiv.org/abs/2405.00917": {"title": "Semiparametric mean and variance joint models with clipped-Laplace link functions for bounded integer-valued time series", "link": "https://arxiv.org/abs/2405.00917", "description": "arXiv:2405.00917v1 Announce Type: new \nAbstract: We present a novel approach for modeling bounded count time series data, by deriving accurate upper and lower bounds for the variance of a bounded count random variable while maintaining a fixed mean. Leveraging these bounds, we propose semiparametric mean and variance joint (MVJ) models utilizing a clipped-Laplace link function. These models offer a flexible and feasible structure for both mean and variance, accommodating various scenarios of under-dispersion, equi-dispersion, or over-dispersion in bounded time series. The proposed MVJ models feature a linear mean structure with positive regression coefficients summing to one and allow for negative regression coefficients and autocorrelations. We demonstrate that the autocorrelation structure of MVJ models mirrors that of an autoregressive moving-average (ARMA) process, provided the proposed clipped-Laplace link functions with nonnegative regression coefficients summing to one are utilized. We establish conditions ensuring the stationarity and ergodicity properties of the MVJ process, along with demonstrating the consistency and asymptotic normality of the conditional least squares estimators. To aid model selection and diagnostics, we introduce two model selection criteria and apply two model diagnostic statistics. Finally, we conduct simulations and real data analyses to investigate the finite-sample properties of the proposed MVJ models, providing insights into their efficacy and applicability in practical scenarios."}, "https://arxiv.org/abs/2405.00953": {"title": "Asymptotic Properties of the Distributional Synthetic Controls", "link": "https://arxiv.org/abs/2405.00953", "description": "arXiv:2405.00953v1 Announce Type: new \nAbstract: This paper enhances our comprehension of the Distributional Synthetic Control (DSC) proposed by Gunsilius (2023), focusing on its asymptotic properties. We first establish the DSC estimator's asymptotic optimality. The essence of this optimality is that the treatment effect estimator given by DSC achieves the lowest possible squared prediction error among all potential treatment effect estimators that depend on an average of quantiles of control units. We also establish the convergence of the DSC weights when some requirements are met, as well as the convergence rate. A significant aspect of our research is that we find DSC synthesis forms an optimal weighted average, particularly in situations where it is impractical to perfectly fit the treated unit's quantiles through the weighted average of the control units' quantiles. 
To corroborate our theoretical insights, we provide empirical evidence derived from simulations."}, "https://arxiv.org/abs/2405.01072": {"title": "Statistical Inference on the Cumulative Distribution Function using Judgment Post Stratification", "link": "https://arxiv.org/abs/2405.01072", "description": "arXiv:2405.01072v1 Announce Type: new \nAbstract: In this work, we discuss a general class of estimators for the cumulative distribution function (CDF) based on the judgment post stratification (JPS) sampling scheme which includes both empirical and kernel distribution functions. Specifically, we obtain the expectation of the estimators in this class and show that they are asymptotically more efficient than their competitors in simple random sampling (SRS), as long as the rankings are better than random guessing. We find a mild condition that is necessary and sufficient for them to be asymptotically unbiased. We also prove that given the same condition, the estimators in this class are strongly uniformly consistent estimators of the true CDF, and converge in distribution to a normal distribution when the sample size goes to infinity.\n We then focus on the kernel distribution function (KDF) in the JPS design and obtain the optimal bandwidth. We next carry out a comprehensive Monte Carlo simulation to compare the performance of the KDF in the JPS design for different choices of sample size, set size, ranking quality, parent distribution, kernel function as well as both perfect and imperfect rankings set-ups with its counterpart in the SRS design. It is found that the JPS estimator dramatically improves the efficiency of the KDF as compared to its SRS competitor for a wide range of settings. Finally, we apply the described procedure to a real dataset from a medical context to show its usefulness and applicability in practice."}, "https://arxiv.org/abs/2405.01110": {"title": "Investigating the causal effects of multiple treatments using longitudinal data: a simulation study", "link": "https://arxiv.org/abs/2405.01110", "description": "arXiv:2405.01110v1 Announce Type: new \nAbstract: Many clinical questions involve estimating the effects of multiple treatments using observational data. When using longitudinal data, the interest is often in the effect of treatment strategies that involve sustaining treatment over time. This requires causal inference methods appropriate for handling multiple treatments and time-dependent confounding. Robins' generalised methods (g-methods) are a family of methods which can deal with time-dependent confounding and some of these have been extended to situations with multiple treatments, although there are currently no studies comparing different methods in this setting. 
We show how five g-methods (inverse-probability-of-treatment weighted estimation of marginal structural models, g-formula, g-estimation, censoring and weighting, and a sequential trials approach) can be extended to situations with multiple treatments, compare their performances in a simulation study, and demonstrate their application with an example using data from the UK CF Registry."}, "https://arxiv.org/abs/2405.01182": {"title": "A Model-Based Approach to Shot Charts Estimation in Basketball", "link": "https://arxiv.org/abs/2405.01182", "description": "arXiv:2405.01182v1 Announce Type: new \nAbstract: Shot charts in basketball analytics provide an indispensable tool for evaluating players' shooting performance by visually representing the distribution of field goal attempts across different court locations. However, conventional methods often overlook the bounded nature of the basketball court, leading to inaccurate representations, particularly along the boundaries and corners. In this paper, we propose a novel model-based approach to shot chart estimation and visualization that explicitly considers the physical boundaries of the basketball court. By employing Gaussian mixtures for bounded data, our methodology allows to obtain more accurate estimation of shot density distributions for both made and missed shots. Bayes' rule is then applied to derive estimates for the probability of successful shooting from any given locations, and to identify the regions with the highest expected scores. To illustrate the efficacy of our proposal, we apply it to data from the 2022-23 NBA regular season, showing its usefulness through detailed analyses of shot patterns for two prominent players."}, "https://arxiv.org/abs/2405.01275": {"title": "Variable Selection in Ultra-high Dimensional Feature Space for the Cox Model with Interval-Censored Data", "link": "https://arxiv.org/abs/2405.01275", "description": "arXiv:2405.01275v1 Announce Type: new \nAbstract: We develop a set of variable selection methods for the Cox model under interval censoring, in the ultra-high dimensional setting where the dimensionality can grow exponentially with the sample size. The methods select covariates via a penalized nonparametric maximum likelihood estimation with some popular penalty functions, including lasso, adaptive lasso, SCAD, and MCP. We prove that our penalized variable selection methods with folded concave penalties or adaptive lasso penalty enjoy the oracle property. Extensive numerical experiments show that the proposed methods have satisfactory empirical performance under various scenarios. The utility of the methods is illustrated through an application to a genome-wide association study of age to early childhood caries."}, "https://arxiv.org/abs/2405.01281": {"title": "Demistifying Inference after Adaptive Experiments", "link": "https://arxiv.org/abs/2405.01281", "description": "arXiv:2405.01281v1 Announce Type: new \nAbstract: Adaptive experiments such as multi-arm bandits adapt the treatment-allocation policy and/or the decision to stop the experiment to the data observed so far. This has the potential to improve outcomes for study participants within the experiment, to improve the chance of identifying best treatments after the experiment, and to avoid wasting data. Seen as an experiment (rather than just a continually optimizing system) it is still desirable to draw statistical inferences with frequentist guarantees. 
The concentration inequalities and union bounds that generally underlie adaptive experimentation algorithms can yield overly conservative inferences, but at the same time the asymptotic normality we would usually appeal to in non-adaptive settings can be imperiled by adaptivity. In this article we aim to explain why, how, and when adaptivity is in fact an issue for inference and, when it is, understand the various ways to fix it: reweighting to stabilize variances and recover asymptotic normality, always-valid inference based on joint normality of an asymptotic limiting sequence, and characterizing and inverting the non-normal distributions induced by adaptivity."}, "https://arxiv.org/abs/2405.01336": {"title": "Quantification of vaccine waning as a challenge effect", "link": "https://arxiv.org/abs/2405.01336", "description": "arXiv:2405.01336v1 Announce Type: new \nAbstract: Knowing whether vaccine protection wanes over time is important for health policy and drug development. However, quantifying waning effects is difficult. A simple contrast of vaccine efficacy at two different times compares different populations of individuals: those who were uninfected at the first time versus those who remain uninfected until the second time. Thus, the contrast of vaccine efficacy at early and late times can not be interpreted as a causal effect. We propose to quantify vaccine waning using the challenge effect, which is a contrast of outcomes under controlled exposures to the infectious agent following vaccination. We identify sharp bounds on the challenge effect under non-parametric assumptions that are broadly applicable in vaccine trials using routinely collected data. We demonstrate that the challenge effect can differ substantially from the conventional vaccine efficacy due to depletion of susceptible individuals from the risk set over time. Finally, we apply the methods to derive bounds on the waning of the BNT162b2 COVID-19 vaccine using data from a placebo-controlled randomized trial. Our estimates of the challenge effect suggest waning protection after 2 months beyond administration of the second vaccine dose."}, "https://arxiv.org/abs/2405.01372": {"title": "Statistical algorithms for low-frequency diffusion data: A PDE approach", "link": "https://arxiv.org/abs/2405.01372", "description": "arXiv:2405.01372v1 Announce Type: new \nAbstract: We consider the problem of making nonparametric inference in multi-dimensional diffusion models from low-frequency data. Statistical analysis in this setting is notoriously challenging due to the intractability of the likelihood and its gradient, and computational methods have thus far largely resorted to expensive simulation-based techniques. In this article, we propose a new computational approach which is motivated by PDE theory and is built around the characterisation of the transition densities as solutions of the associated heat (Fokker-Planck) equation. Employing optimal regularity results from the theory of parabolic PDEs, we prove a novel characterisation for the gradient of the likelihood. Using these developments, for the nonlinear inverse problem of recovering the diffusivity (in divergence form models), we then show that the numerical evaluation of the likelihood and its gradient can be reduced to standard elliptic eigenvalue problems, solvable by powerful finite element methods. 
This enables the efficient implementation of a large class of statistical algorithms, including (i) preconditioned Crank-Nicolson and Langevin-type methods for posterior sampling, and (ii) gradient-based descent optimisation schemes to compute maximum likelihood and maximum-a-posteriori estimates. We showcase the effectiveness of these methods via extensive simulation studies in a nonparametric Bayesian model with Gaussian process priors. Interestingly, the optimisation schemes provided satisfactory numerical recovery while exhibiting rapid convergence towards stationary points despite the problem nonlinearity; thus our approach may lead to significant computational speed-ups. The reproducible code is available online at https://github.com/MattGiord/LF-Diffusion."}, "https://arxiv.org/abs/2405.01463": {"title": "Dynamic Local Average Treatment Effects", "link": "https://arxiv.org/abs/2405.01463", "description": "arXiv:2405.01463v1 Announce Type: new \nAbstract: We consider Dynamic Treatment Regimes (DTRs) with one sided non-compliance that arise in applications such as digital recommendations and adaptive medical trials. These are settings where decision makers encourage individuals to take treatments over time, but adapt encouragements based on previous encouragements, treatments, states, and outcomes. Importantly, individuals may choose to (not) comply with a treatment recommendation, whenever it is made available to them, based on unobserved confounding factors. We provide non-parametric identification, estimation, and inference for Dynamic Local Average Treatment Effects, which are expected values of multi-period treatment contrasts among appropriately defined complier subpopulations. Under standard assumptions in the Instrumental Variable and DTR literature, we show that one can identify local average effects of contrasts that correspond to offering treatment at any single time step. Under an additional cross-period effect-compliance independence assumption, which is satisfied in Staggered Adoption settings and a generalization of them, which we define as Staggered Compliance settings, we identify local average treatment effects of treating in multiple time periods."}, "https://arxiv.org/abs/2405.00727": {"title": "Generalised envelope spectrum-based signal-to-noise objectives: Formulation, optimisation and application for gear fault detection under time-varying speed conditions", "link": "https://arxiv.org/abs/2405.00727", "description": "arXiv:2405.00727v1 Announce Type: cross \nAbstract: In vibration-based condition monitoring, optimal filter design improves fault detection by enhancing weak fault signatures within vibration signals. This process involves optimising a derived objective function from a defined objective. The objectives are often based on proxy health indicators to determine the filter's parameters. However, these indicators can be compromised by irrelevant extraneous signal components and fluctuating operational conditions, affecting the filter's efficacy. Fault detection primarily uses the fault component's prominence in the squared envelope spectrum, quantified by a squared envelope spectrum-based signal-to-noise ratio. New optimal filter objective functions are derived from the proposed generalised envelope spectrum-based signal-to-noise objective for machines operating under variable speed conditions. 
Instead of optimising proxy health indicators, the optimal filter coefficients of the formulation directly maximise the squared envelope spectrum-based signal-to-noise ratio over targeted frequency bands using standard gradient-based optimisers. Four derived objective functions from the proposed objective effectively outperform five prominent methods in tests on three experimental datasets."}, "https://arxiv.org/abs/2405.00910": {"title": "De-Biasing Models of Biased Decisions: A Comparison of Methods Using Mortgage Application Data", "link": "https://arxiv.org/abs/2405.00910", "description": "arXiv:2405.00910v1 Announce Type: cross \nAbstract: Prediction models can improve efficiency by automating decisions such as the approval of loan applications. However, they may inherit bias against protected groups from the data they are trained on. This paper adds counterfactual (simulated) ethnic bias to real data on mortgage application decisions, and shows that this bias is replicated by a machine learning model (XGBoost) even when ethnicity is not used as a predictive variable. Next, several other de-biasing methods are compared: averaging over prohibited variables, taking the most favorable prediction over prohibited variables (a novel method), and jointly minimizing errors as well as the association between predictions and prohibited variables. De-biasing can recover some of the original decisions, but the results are sensitive to whether the bias is effected through a proxy."}, "https://arxiv.org/abs/2405.01404": {"title": "Random Pareto front surfaces", "link": "https://arxiv.org/abs/2405.01404", "description": "arXiv:2405.01404v1 Announce Type: cross \nAbstract: The Pareto front of a set of vectors is the subset which is comprised solely of all of the best trade-off points. By interpolating this subset, we obtain the optimal trade-off surface. In this work, we prove a very useful result which states that all Pareto front surfaces can be explicitly parametrised using polar coordinates. In particular, our polar parametrisation result tells us that we can fully characterise any Pareto front surface using the length function, which is a scalar-valued function that returns the projected length along any positive radial direction. Consequently, by exploiting this representation, we show how it is possible to generalise many useful concepts from linear algebra, probability and statistics, and decision theory to function over the space of Pareto front surfaces. Notably, we focus our attention on the stochastic setting where the Pareto front surface itself is a stochastic process. Among other things, we showcase how it is possible to define and estimate many statistical quantities of interest such as the expectation, covariance and quantile of any Pareto front surface distribution. As a motivating example, we investigate how these statistics can be used within a design of experiments setting, where the goal is to both infer and use the Pareto front surface distribution in order to make effective decisions. Besides this, we also illustrate how these Pareto front ideas can be used within the context of extreme value theory. 
Finally, as a numerical example, we apply some of our new methodology to a real-world air pollution data set."}, "https://arxiv.org/abs/2405.01450": {"title": "A mixed effects cosinor modelling framework for circadian gene expression", "link": "https://arxiv.org/abs/2405.01450", "description": "arXiv:2405.01450v1 Announce Type: cross \nAbstract: The cosinor model is frequently used to represent gene expression given the 24 hour day-night cycle time at which a corresponding tissue sample is collected. However, the timing of many biological processes is based on individual-specific internal timing systems that are offset relative to day-night cycle time. When these offsets are unknown, they pose a challenge in performing statistical analyses with a cosinor model. To clarify, when sample collection times are mis-recorded, cosinor regression can yield attenuated parameter estimates, which would also attenuate test statistics. This attenuation bias would inflate type II error rates in identifying genes with oscillatory behavior. This paper proposes a heuristic method to account for unknown offsets when tissue samples are collected in a longitudinal design. Specifically, this method involves first estimating individual-specific cosinor models for each gene. The times of sample collection for that individual are then translated based on the estimated phase-shifts across every gene. Simulation studies confirm that this method mitigates bias in estimation and inference. Illustrations with real data from three circadian biology studies highlight that this method produces parameter estimates and inferences akin to those obtained when each individual's offset is known."}, "https://arxiv.org/abs/2405.01484": {"title": "Designing Algorithmic Recommendations to Achieve Human-AI Complementarity", "link": "https://arxiv.org/abs/2405.01484", "description": "arXiv:2405.01484v1 Announce Type: cross \nAbstract: Algorithms frequently assist, rather than replace, human decision-makers. However, the design and analysis of algorithms often focus on predicting outcomes and do not explicitly model their effect on human decisions. This discrepancy between the design and role of algorithmic assistants becomes of particular concern in light of empirical evidence that suggests that algorithmic assistants repeatedly fail to improve human decisions. In this article, we formalize the design of recommendation algorithms that assist human decision-makers without making restrictive ex-ante assumptions about how recommendations affect decisions. We formulate an algorithmic-design problem that leverages the potential-outcomes framework from causal inference to model the effect of recommendations on a human decision-maker's binary treatment choice. Within this model, we introduce a monotonicity assumption that leads to an intuitive classification of human responses to the algorithm. Under this monotonicity assumption, we can express the human's response to algorithmic recommendations in terms of their compliance with the algorithm and the decision they would take if the algorithm sends no recommendation. We showcase the utility of our framework using an online experiment that simulates a hiring task. 
We argue that our approach explains the relative performance of different recommendation algorithms in the experiment, and can help design solutions that realize human-AI complementarity."}, "https://arxiv.org/abs/2305.05281": {"title": "Causal Discovery via Conditional Independence Testing with Proxy Variables", "link": "https://arxiv.org/abs/2305.05281", "description": "arXiv:2305.05281v3 Announce Type: replace \nAbstract: Distinguishing causal connections from correlations is important in many scenarios. However, the presence of unobserved variables, such as the latent confounder, can introduce bias in conditional independence testing commonly employed in constraint-based causal discovery for identifying causal relations. To address this issue, existing methods introduced proxy variables to adjust for the bias caused by unobserveness. However, these methods were either limited to categorical variables or relied on strong parametric assumptions for identification. In this paper, we propose a novel hypothesis-testing procedure that can effectively examine the existence of the causal relationship over continuous variables, without any parametric constraint. Our procedure is based on discretization, which under completeness conditions, is able to asymptotically establish a linear equation whose coefficient vector is identifiable under the causal null hypothesis. Based on this, we introduce our test statistic and demonstrate its asymptotic level and power. We validate the effectiveness of our procedure using both synthetic and real-world data."}, "https://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "https://arxiv.org/abs/2305.10817", "description": "arXiv:2305.10817v4 Announce Type: replace \nAbstract: We introduce an approach which allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables. This framework makes causality detection possible even between high-dimensional systems where only few of the variables are known or measured. Benchmark tests on coupled chaotic dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings. We also show that the method can be used to robustly detect causality in human electroencephalography data."}, "https://arxiv.org/abs/2401.03990": {"title": "Identification with possibly invalid IVs", "link": "https://arxiv.org/abs/2401.03990", "description": "arXiv:2401.03990v2 Announce Type: replace \nAbstract: This paper proposes a novel identification strategy relying on quasi-instrumental variables (quasi-IVs). A quasi-IV is a relevant but possibly invalid IV because it is not exogenous or not excluded. 
We show that a variety of models with discrete or continuous endogenous treatment which are usually identified with an IV - quantile models with rank invariance, additive models with homogenous treatment effects, and local average treatment effect models - can be identified under the joint relevance of two complementary quasi-IVs instead. To achieve identification, we complement one excluded but possibly endogenous quasi-IV (e.g., \"relevant proxies\" such as lagged treatment choice) with one exogenous (conditional on the excluded quasi-IV) but possibly included quasi-IV (e.g., random assignment or exogenous market shocks). Our approach also holds if any of the two quasi-IVs turns out to be a valid IV. In practice, being able to address endogeneity with complementary quasi-IVs instead of IVs is convenient since there are many applications where quasi-IVs are more readily available. Difference-in-differences is a notable example: time is an exogenous quasi-IV while the group assignment acts as a complementary excluded quasi-IV."}, "https://arxiv.org/abs/2303.11786": {"title": "Skeleton Regression: A Graph-Based Approach to Estimation with Manifold Structure", "link": "https://arxiv.org/abs/2303.11786", "description": "arXiv:2303.11786v2 Announce Type: replace-cross \nAbstract: We introduce a new regression framework designed to deal with large-scale, complex data that lies around a low-dimensional manifold with noises. Our approach first constructs a graph representation, referred to as the skeleton, to capture the underlying geometric structure. We then define metrics on the skeleton graph and apply nonparametric regression techniques, along with feature transformations based on the graph, to estimate the regression function. We also discuss the limitations of some nonparametric regressors with respect to the general metric space such as the skeleton graph. The proposed regression framework suggests a novel way to deal with data with underlying geometric structures and provides additional advantages in handling the union of multiple manifolds, additive noises, and noisy observations. We provide statistical guarantees for the proposed method and demonstrate its effectiveness through simulations and real data examples."}, "https://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "https://arxiv.org/abs/2310.05921", "description": "arXiv:2310.05921v3 Announce Type: replace-cross \nAbstract: We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing to exhibit high throughput and low error, to the choice of trusting a nominal policy versus switching to a safe backup policy at run-time. The decisions produced by our algorithms are safe in the sense that they come with provable statistical guarantees of having low risk without any assumptions on the world model whatsoever; the observations need not be I.I.D. and can even be adversarial. The theory extends results from conformal prediction to calibrate decisions directly, without requiring the construction of prediction sets. 
Experiments demonstrate the utility of our approach in robot motion planning around humans, automated stock trading, and robot manufacturing."}, "https://arxiv.org/abs/2405.01645": {"title": "Synthetic Controls with spillover effects: A comparative study", "link": "https://arxiv.org/abs/2405.01645", "description": "arXiv:2405.01645v1 Announce Type: new \nAbstract: This study introduces the Iterative Synthetic Control Method, a modification of the Synthetic Control Method (SCM) designed to improve its predictive performance by utilizing control units affected by the treatment in question. This method is then compared to other SCM modifications: SCM without any modifications, SCM after removing all spillover-affected units, Inclusive SCM, and the SP SCM model. For the comparison, Monte Carlo simulations are utilized, generating artificial datasets with known counterfactuals and comparing the predictive performance of the methods. Generally, the Inclusive SCM performed best in all settings and is relatively simple to implement. The Iterative SCM, introduced in this paper, was a close second, with a small difference in performance and a simpler implementation."}, "https://arxiv.org/abs/2405.01651": {"title": "Confidence regions for a persistence diagram of a single image with one or more loops", "link": "https://arxiv.org/abs/2405.01651", "description": "arXiv:2405.01651v1 Announce Type: new \nAbstract: Topological data analysis (TDA) uses persistent homology to quantify loops and higher-dimensional holes in data, making it particularly relevant for examining the characteristics of images of cells in the field of cell biology. In the context of a cell injury, as time progresses, a wound in the form of a ring emerges in the cell image and then gradually vanishes. Performing statistical inference on this ring-like pattern in a single image is challenging due to the absence of repeated samples. In this paper, we develop a novel framework leveraging TDA to estimate underlying structures within individual images and quantify associated uncertainties through confidence regions. Our proposed method partitions the image into the background and the damaged cell regions. Then pixels within the affected cell region are used to establish confidence regions in the space of persistence diagrams (topological summary statistics). The method establishes estimates on the persistence diagrams which correct the bias of traditional TDA approaches. A simulation study is conducted to evaluate the coverage probabilities of the proposed confidence regions in comparison to an alternative approach proposed in this paper. We also illustrate our methodology with a real-world example of cell repair."}, "https://arxiv.org/abs/2405.01709": {"title": "Minimax Regret Learning for Data with Heterogeneous Subgroups", "link": "https://arxiv.org/abs/2405.01709", "description": "arXiv:2405.01709v1 Announce Type: new \nAbstract: Modern complex datasets often consist of various sub-populations. To develop robust and generalizable methods in the presence of sub-population heterogeneity, it is important to guarantee a uniform learning performance instead of an average one. In many applications, prior information is often available on which sub-population or group the data points belong to. Given the observed groups of data, we develop a min-max-regret (MMR) learning framework for general supervised learning, which aims to minimize the worst-group regret. 
Motivated from the regret-based decision theoretic framework, the proposed MMR is distinguished from the value-based or risk-based robust learning methods in the existing literature. The regret criterion features several robustness and invariance properties simultaneously. In terms of generalizability, we develop the theoretical guarantee for the worst-case regret over a super-population of the meta data, which incorporates the observed sub-populations, their mixtures, as well as other unseen sub-populations that could be approximated by the observed ones. We demonstrate the effectiveness of our method through extensive simulation studies and an application to kidney transplantation data from hundreds of transplant centers."}, "https://arxiv.org/abs/2405.01913": {"title": "Unleashing the Power of AI: Transforming Marketing Decision-Making in Heavy Machinery with Machine Learning, Radar Chart Simulation, and Markov Chain Analysis", "link": "https://arxiv.org/abs/2405.01913", "description": "arXiv:2405.01913v1 Announce Type: new \nAbstract: This pioneering research introduces a novel approach for decision-makers in the heavy machinery industry, specifically focusing on production management. The study integrates machine learning techniques like Ridge Regression, Markov chain analysis, and radar charts to optimize North American Crawler Cranes market production processes. Ridge Regression enables growth pattern identification and performance assessment, facilitating comparisons and addressing industry challenges. Markov chain analysis evaluates risk factors, aiding in informed decision-making and risk management. Radar charts simulate benchmark product designs, enabling data-driven decisions for production optimization. This interdisciplinary approach equips decision-makers with transformative insights, enhancing competitiveness in the heavy machinery industry and beyond. By leveraging these techniques, companies can revolutionize their production management strategies, driving success in diverse markets."}, "https://arxiv.org/abs/2405.02087": {"title": "Testing for an Explosive Bubble using High-Frequency Volatility", "link": "https://arxiv.org/abs/2405.02087", "description": "arXiv:2405.02087v1 Announce Type: new \nAbstract: Based on a continuous-time stochastic volatility model with a linear drift, we develop a test for explosive behavior in financial asset prices at a low frequency when prices are sampled at a higher frequency. The test exploits the volatility information in the high-frequency data. The method consists of devolatizing log-asset price increments with realized volatility measures and performing a supremum-type recursive Dickey-Fuller test on the devolatized sample. The proposed test has a nuisance-parameter-free asymptotic distribution and is easy to implement. We study the size and power properties of the test in Monte Carlo simulations. A real-time date-stamping strategy based on the devolatized sample is proposed for the origination and conclusion dates of the explosive regime. Conditions under which the real-time date-stamping strategy is consistent are established. 
The test and the date-stamping strategy are applied to study explosive behavior in cryptocurrency and stock markets."}, "https://arxiv.org/abs/2405.02217": {"title": "Identifying and exploiting alpha in linear asset pricing models with strong, semi-strong, and latent factors", "link": "https://arxiv.org/abs/2405.02217", "description": "arXiv:2405.02217v1 Announce Type: new \nAbstract: The risk premia of traded factors are the sum of factor means and a parameter vector we denote by $\\phi$, which is identified from the cross section regression of the alpha of individual securities on the vector of factor loadings. If $\\phi$ is non-zero, one can construct \"$\\phi$-portfolios\" which exploit the systematic components of non-zero alpha. We show that for known values of betas and when $\\phi$ is non-zero there exist $\\phi$-portfolios that dominate mean-variance portfolios. The paper then proposes a two-step bias corrected estimator of $\\phi$ and derives its asymptotic distribution allowing for idiosyncratic pricing errors, weak missing factors, and weak error cross-sectional dependence. Small sample results from extensive Monte Carlo experiments show that the proposed estimator has the correct size with good power properties. The paper also provides an empirical application to a large number of U.S. securities with risk factors selected from a large number of potential risk factors according to their strength and constructs $\\phi$-portfolios and compares their Sharpe ratios to those of mean-variance and S&P 500 portfolios."}, "https://arxiv.org/abs/2405.02231": {"title": "Efficient spline orthogonal basis for representation of density functions", "link": "https://arxiv.org/abs/2405.02231", "description": "arXiv:2405.02231v1 Announce Type: new \nAbstract: Probability density functions form a specific class of functional data objects with intrinsic properties of scale invariance and relative scale characterized by the unit integral constraint. The Bayes spaces methodology respects their specific nature, and the centred log-ratio transformation enables processing such functional data in the standard Lebesgue space of square-integrable functions. As the data representing densities are frequently observed in their discrete form, the focus has been on their spline representation. Therefore, the crucial step in the approximation is to construct a proper spline basis reflecting their specific properties. Since the centred log-ratio transformation forms a subspace of functions with a zero integral constraint, the standard $B$-spline basis is no longer suitable. Recently, a new spline basis incorporating this zero integral property, called $Z\\!B$-splines, was developed. However, this basis does not possess the orthogonality property which is beneficial from a computational and application point of view. In this paper, we describe an efficient method for constructing an orthogonal $Z\\!B$-spline basis, called $Z\\!B$-splinets. The advantages of the $Z\\!B$-splinet approach are foremost its computational efficiency and the locality of basis supports, which is desirable for data interpretability, e.g. in the context of functional principal component analysis. 
The proposed approach is demonstrated on an empirical demographic dataset."}, "https://arxiv.org/abs/2405.01598": {"title": "Predictive Decision Synthesis for Portfolios: Betting on Better Models", "link": "https://arxiv.org/abs/2405.01598", "description": "arXiv:2405.01598v1 Announce Type: cross \nAbstract: We discuss and develop Bayesian dynamic modelling and predictive decision synthesis for portfolio analysis. The context involves model uncertainty with a set of candidate models for financial time series with main foci in sequential learning, forecasting, and recursive decisions for portfolio reinvestments. The foundational perspective of Bayesian predictive decision synthesis (BPDS) defines novel, operational analysis and resulting predictive and decision outcomes. A detailed case study of BPDS in financial forecasting of international exchange rate time series and portfolio rebalancing, with resulting BPDS-based decision outcomes compared to traditional Bayesian analysis, exemplifies and highlights the practical advances achievable under the expanded, subjective Bayesian approach that BPDS defines."}, "https://arxiv.org/abs/2405.01611": {"title": "Unifying and extending Precision Recall metrics for assessing generative models", "link": "https://arxiv.org/abs/2405.01611", "description": "arXiv:2405.01611v1 Announce Type: cross \nAbstract: With the recent success of generative models in image and text, the evaluation of generative models has gained a lot of attention. Whereas most generative models are compared in terms of scalar values such as Frechet Inception Distance (FID) or Inception Score (IS), in the last years (Sajjadi et al., 2018) proposed a definition of precision-recall curve to characterize the closeness of two distributions. Since then, various approaches to precision and recall have seen the light (Kynkaanniemi et al., 2019; Naeem et al., 2020; Park & Kim, 2023). They center their attention on the extreme values of precision and recall, but apart from this fact, their ties are elusive. In this paper, we unify most of these approaches under the same umbrella, relying on the work of (Simon et al., 2019). Doing so, we were able not only to recover entire curves, but also to expose the sources of the accounted pitfalls of the concerned metrics. We also provide consistency results that go well beyond the ones presented in the corresponding literature. Last, we study the different behaviors of the curves obtained experimentally."}, "https://arxiv.org/abs/2405.01744": {"title": "ALCM: Autonomous LLM-Augmented Causal Discovery Framework", "link": "https://arxiv.org/abs/2405.01744", "description": "arXiv:2405.01744v1 Announce Type: cross \nAbstract: To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicating their emergent capabilities and widespread applicability in facilitating causal reasoning across diverse domains, such as medicine, finance, and science. The expansive knowledge base of LLMs holds the potential to elevate the field of causal reasoning by offering interpretability, making inferences, generalizability, and uncovering novel causal structures. 
In this paper, we introduce a new framework, named Autonomous LLM-Augmented Causal Discovery Framework (ALCM), to synergize data-driven causal discovery algorithms and LLMs, automating the generation of a more resilient, accurate, and explicable causal graph. The ALCM consists of three integral components: causal structure learning, causal wrapper, and LLM-driven causal refiner. These components autonomously collaborate within a dynamic environment to address causal discovery questions and deliver plausible causal graphs. We evaluate the ALCM framework by implementing two demonstrations on seven well-known datasets. Experimental results demonstrate that ALCM outperforms existing LLM methods and conventional data-driven causal reasoning mechanisms. This study not only shows the effectiveness of the ALCM but also underscores new research directions in leveraging the causal reasoning capabilities of LLMs."}, "https://arxiv.org/abs/2405.02225": {"title": "Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks", "link": "https://arxiv.org/abs/2405.02225", "description": "arXiv:2405.02225v1 Announce Type: cross \nAbstract: This paper introduces a framework for post-processing machine learning models so that their predictions satisfy multi-group fairness guarantees. Based on the celebrated notion of multicalibration, we introduce $(\\mathbf{s},\\mathcal{G}, \\alpha)-$GMC (Generalized Multi-Dimensional Multicalibration) for multi-dimensional mappings $\\mathbf{s}$, constraint set $\\mathcal{G}$, and a pre-specified threshold level $\\alpha$. We propose associated algorithms to achieve this notion in general settings. This framework is then applied to diverse scenarios encompassing different fairness concerns, including false negative rate control in image segmentation, prediction set conditional uncertainty quantification in hierarchical classification, and de-biased text generation in language models. We conduct numerical studies on several datasets and tasks."}, "https://arxiv.org/abs/2112.05274": {"title": "Handling missing data when estimating causal effects with Targeted Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2112.05274", "description": "arXiv:2112.05274v4 Announce Type: replace \nAbstract: Targeted Maximum Likelihood Estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate eight missing data methods in this context: complete-case analysis, extended TMLE incorporating outcome-missingness model, missing covariate missing indicator method, five multiple imputation (MI) approaches using parametric or machine-learning models. Six scenarios were considered, varying in exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether outcome influenced missingness in other variables and presence of interaction/non-linear terms in missingness models). Complete-case analysis and extended TMLE had small biases when outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when missingness models included a non-linear term. 
When choosing a method to handle missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and non-linearities is expected to perform well."}, "https://arxiv.org/abs/2212.12539": {"title": "Stable Distillation and High-Dimensional Hypothesis Testing", "link": "https://arxiv.org/abs/2212.12539", "description": "arXiv:2212.12539v2 Announce Type: replace \nAbstract: While powerful methods have been developed for high-dimensional hypothesis testing assuming orthogonal parameters, current approaches struggle to generalize to the more common non-orthogonal case. We propose Stable Distillation (SD), a simple paradigm for iteratively extracting independent pieces of information from observed data, assuming a parametric model. When applied to hypothesis testing for large regression models, SD orthogonalizes the effect estimates of non-orthogonal predictors by judiciously introducing noise into the observed outcomes vector, yielding mutually independent p-values across predictors. Simulations and a real regression example using US campaign contributions show that SD yields a scalable approach for non-orthogonal designs that exceeds or matches the power of existing methods against sparse alternatives. While we only present explicit SD algorithms for hypothesis testing in ordinary least squares and logistic regression, we provide general guidance for deriving and improving the power of SD procedures."}, "https://arxiv.org/abs/2307.14282": {"title": "Causal Effects in Matching Mechanisms with Strategically Reported Preferences", "link": "https://arxiv.org/abs/2307.14282", "description": "arXiv:2307.14282v2 Announce Type: replace \nAbstract: A growing number of central authorities use assignment mechanisms to allocate students to schools in a way that reflects student preferences and school priorities. However, most real-world mechanisms incentivize students to strategically misreport their preferences. In this paper, we provide an approach for identifying the causal effects of school assignment on future outcomes that accounts for strategic misreporting. Misreporting may invalidate existing point-identification approaches, and we derive sharp bounds for causal effects that are robust to strategic behavior. Our approach applies to any mechanism as long as there exist placement scores and cutoffs that characterize that mechanism's allocation rule. We use data from a deferred acceptance mechanism that assigns students to more than 1,000 university-major combinations in Chile. Matching theory predicts that students' behavior in Chile should be strategic because they can list only up to eight options, and we find empirical evidence consistent with such behavior. Our bounds are informative enough to reveal significant heterogeneity in graduation success with respect to preferences and school assignment."}, "https://arxiv.org/abs/2310.10271": {"title": "A geometric power analysis for general log-linear models", "link": "https://arxiv.org/abs/2310.10271", "description": "arXiv:2310.10271v2 Announce Type: replace \nAbstract: Log-linear models are widely used to express the association in multivariate frequency data on contingency tables. The paper focuses on the power analysis for testing the goodness-of-fit hypothesis for this model type. 
Conventionally, for the power-related sample size calculations a deviation from the null hypothesis (effect size) is specified by means of the chi-square goodness-of-fit index. It is argued that the odds ratio is a more natural measure of effect size, with the advantage of having a data-relevant interpretation. Therefore, a class of log-affine models that are specified by odds ratios whose values deviate from those of the null by a small amount can be chosen as an alternative. Being expressed as sets of constraints on odds ratios, both hypotheses are represented by smooth surfaces in the probability simplex, and thus, the power analysis can be given a geometric interpretation as well. A concept of geometric power is introduced and a Monte-Carlo algorithm for its estimation is proposed. The framework is applied to the power analysis of goodness-of-fit in the context of multinomial sampling. An iterative scaling procedure for generating distributions from a log-affine model is described and its convergence is proved. To illustrate, the geometric power analysis is carried out for data from a clinical study."}, "https://arxiv.org/abs/2305.11672": {"title": "Nonparametric classification with missing data", "link": "https://arxiv.org/abs/2305.11672", "description": "arXiv:2305.11672v2 Announce Type: replace-cross \nAbstract: We introduce a new nonparametric framework for classification problems in the presence of missing data. The key aspect of our framework is that the regression function decomposes into an anova-type sum of orthogonal functions, of which some (or even many) may be zero. Working under a general missingness setting, which allows features to be missing not at random, our main goal is to derive the minimax rate for the excess risk in this problem. In addition to the decomposition property, the rate depends on parameters that control the tail behaviour of the marginal feature distributions, the smoothness of the regression function and a margin condition. The ambient data dimension does not appear in the minimax rate, which can therefore be faster than in the classical nonparametric setting. We further propose a new method, called the Hard-thresholding Anova Missing data (HAM) classifier, based on a careful combination of a k-nearest neighbour algorithm and a thresholding step. The HAM classifier attains the minimax rate up to polylogarithmic factors and numerical experiments further illustrate its utility."}, "https://arxiv.org/abs/2307.02375": {"title": "Online Learning of Order Flow and Market Impact with Bayesian Change-Point Detection Methods", "link": "https://arxiv.org/abs/2307.02375", "description": "arXiv:2307.02375v2 Announce Type: replace-cross \nAbstract: Financial order flow exhibits a remarkable level of persistence, wherein buy (sell) trades are often followed by subsequent buy (sell) trades over extended periods. This persistence can be attributed to the division and gradual execution of large orders. Consequently, distinct order flow regimes might emerge, which can be identified through suitable time series models applied to market data. In this paper, we propose the use of Bayesian online change-point detection (BOCPD) methods to identify regime shifts in real-time and enable online predictions of order flow and market impact. To enhance the effectiveness of our approach, we have developed a novel BOCPD method using a score-driven approach. This method accommodates temporal correlations and time-varying parameters within each regime. 
Through empirical application to NASDAQ data, we have found that: (i) Our newly proposed model demonstrates superior out-of-sample predictive performance compared to existing models that assume i.i.d. behavior within each regime; (ii) When examining the residuals, our model demonstrates good specification in terms of both distributional assumptions and temporal correlations; (iii) Within a given regime, the price dynamics exhibit a concave relationship with respect to time and volume, mirroring the characteristics of actual large orders; (iv) By incorporating regime information, our model produces more accurate online predictions of order flow and market impact compared to models that do not consider regimes."}, "https://arxiv.org/abs/2311.04037": {"title": "Causal Discovery Under Local Privacy", "link": "https://arxiv.org/abs/2311.04037", "description": "arXiv:2311.04037v3 Announce Type: replace-cross \nAbstract: Differential privacy is a widely adopted framework designed to safeguard the sensitive information of data providers within a data set. It is based on the application of controlled noise at the interface between the server that stores and processes the data, and the data consumers. Local differential privacy is a variant that allows data providers to apply the privatization mechanism themselves on their data individually. Therefore it provides protection also in contexts in which the server, or even the data collector, cannot be trusted. The introduction of noise, however, inevitably affects the utility of the data, particularly by distorting the correlations between individual data components. This distortion can prove detrimental to tasks such as causal discovery. In this paper, we consider various well-known locally differentially private mechanisms and compare the trade-off between the privacy they provide, and the accuracy of the causal structure produced by algorithms for causal learning when applied to data obfuscated by these mechanisms. Our analysis yields valuable insights for selecting appropriate local differentially private protocols for causal discovery tasks. We foresee that our findings will aid researchers and practitioners in conducting locally private causal discovery."}, "https://arxiv.org/abs/2405.02343": {"title": "Rejoinder on \"Marked spatial point processes: current state and extensions to point processes on linear networks\"", "link": "https://arxiv.org/abs/2405.02343", "description": "arXiv:2405.02343v1 Announce Type: new \nAbstract: We are grateful to all discussants for their invaluable comments, suggestions, questions, and contributions to our article. We have attentively reviewed all discussions with keen interest. In this rejoinder, our objective is to address and engage with all points raised by the discussants in a comprehensive and considerate manner. Consistently, we identify the discussants, in alphabetical order, as follows: CJK for Cronie, Jansson, and Konstantinou, DS for Stoyan, GP for Grabarnik and Pommerening, MRS for Myllym\\\"aki, Rajala, and S\\\"arkk\\\"a, and MCvL for van Lieshout throughout this rejoinder."}, "https://arxiv.org/abs/2405.02480": {"title": "A Network Simulation of OTC Markets with Multiple Agents", "link": "https://arxiv.org/abs/2405.02480", "description": "arXiv:2405.02480v1 Announce Type: new \nAbstract: We present a novel agent-based approach to simulating an over-the-counter (OTC) financial market in which trades are intermediated solely by market makers and agent visibility is constrained to a network topology. 
Dynamics, such as changes in price, result from agent-level interactions that ubiquitously occur via market maker agents acting as liquidity providers. Two additional agents are considered: trend investors use a deep convolutional neural network paired with a deep Q-learning framework to inform trading decisions by analysing price history; and value investors use a static price-target to determine their trade directions and sizes. We demonstrate that our novel inclusion of a network topology with market makers facilitates explorations into various market structures. First, we present the model and an overview of its mechanics. Second, we validate our findings via comparison to the real-world: we demonstrate a fat-tailed distribution of price changes, auto-correlated volatility, a skew negatively correlated to market maker positioning, predictable price-history patterns and more. Finally, we demonstrate that our network-based model can lend insights into the effect of market-structure on price-action. For example, we show that markets with sparsely connected intermediaries can have a critical point of fragmentation, beyond which the market forms distinct clusters and arbitrage becomes rapidly possible between the prices of different market makers. A discussion is provided on future work that would be beneficial."}, "https://arxiv.org/abs/2405.02529": {"title": "Chauhan Weighted Trajectory Analysis reduces sample size requirements and expedites time-to-efficacy signals in advanced cancer clinical trials", "link": "https://arxiv.org/abs/2405.02529", "description": "arXiv:2405.02529v1 Announce Type: new \nAbstract: As Kaplan-Meier (KM) analysis is limited to single unidirectional endpoints, most advanced cancer randomized clinical trials (RCTs) are powered for either progression free survival (PFS) or overall survival (OS). This discards efficacy information carried by partial responses, complete responses, and stable disease that frequently precede progressive disease and death. Chauhan Weighted Trajectory Analysis (CWTA) is a generalization of KM that simultaneously assesses multiple rank-ordered endpoints. We hypothesized that CWTA could use this efficacy information to reduce sample size requirements and expedite efficacy signals in advanced cancer trials. We performed 100-fold and 1000-fold simulations of solid tumour systemic therapy RCTs with health statuses rank ordered from complete response (Stage 0) to death (Stage 4). At increments of sample size and hazard ratio, we compared KM PFS and OS with CWTA for (i) sample size requirements to achieve a power of 0.8 and (ii) time-to-first significant efficacy signal. CWTA consistently demonstrated greater power, and reduced sample size requirements by 18% to 35% compared to KM PFS and 14% to 20% compared to KM OS. CWTA also expedited time-to-efficacy signals 2- to 6-fold. CWTA, by incorporating all efficacy signals in the cancer treatment trajectory, provides clinically relevant reduction in required sample size and meaningfully expedites the efficacy signals of cancer treatments compared to KM PFS and KM OS. 
Using CWTA rather than KM as the primary trial outcome has the potential to meaningfully reduce the numbers of patients, trial duration, and costs to evaluate therapies in advanced cancer."}, "https://arxiv.org/abs/2405.02539": {"title": "Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models", "link": "https://arxiv.org/abs/2405.02539", "description": "arXiv:2405.02539v1 Announce Type: new \nAbstract: While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority."}, "https://arxiv.org/abs/2405.02551": {"title": "Power-Enhanced Two-Sample Mean Tests for High-Dimensional Compositional Data with Application to Microbiome Data Analysis", "link": "https://arxiv.org/abs/2405.02551", "description": "arXiv:2405.02551v1 Announce Type: new \nAbstract: Testing differences in mean vectors is a fundamental task in the analysis of high-dimensional compositional data. Existing methods may suffer from low power if the underlying signal pattern is in a situation that does not favor the deployed test. In this work, we develop two-sample power-enhanced mean tests for high-dimensional compositional data based on the combination of $p$-values, which integrates strengths from two popular types of tests: the maximum-type test and the quadratic-type test. We provide rigorous theoretical guarantees on the proposed tests, showing accurate Type-I error rate control and enhanced testing power. Our method boosts the testing power towards a broader alternative space, which yields robust performance across a wide range of signal pattern settings. Our theory also contributes to the literature on power enhancement and Gaussian approximation for high-dimensional hypothesis testing. We demonstrate the performance of our method on both simulated data and real-world microbiome data, showing that our proposed approach improves the testing power substantially compared to existing methods."}, "https://arxiv.org/abs/2405.02666": {"title": "The Analysis of Criminal Recidivism: A Hierarchical Model-Based Approach for the Analysis of Zero-Inflated, Spatially Correlated recurrent events Data", "link": "https://arxiv.org/abs/2405.02666", "description": "arXiv:2405.02666v1 Announce Type: new \nAbstract: The life course perspective in criminology has become prominent in recent years, offering valuable insights into various patterns of criminal offending and pathways. The study of criminal trajectories aims to understand the beginning, persistence and desistence in crime, providing intriguing explanations about these moments in life. 
Central to this analysis is the identification of patterns in the frequency of criminal victimization and recidivism, along with the factors that contribute to them. Specifically, this work introduces a new class of models that overcome limitations in traditional methods used to analyze criminal recidivism. These models are designed for recurrent events data characterized by excess of zeros and spatial correlation. They extend the Non-Homogeneous Poisson Process, incorporating spatial dependence in the model through random effects, enabling the analysis of associations among individuals within the same spatial stratum. To deal with the excess of zeros in the data, a zero-inflated Poisson mixed model was incorporated. In addition to parametric models following the Power Law process for baseline intensity functions, we propose flexible semi-parametric versions approximating the intensity function using Bernstein Polynomials. The Bayesian approach offers advantages such as incorporating external evidence and modeling specific correlations between random effects and observed data. The performance of these models was evaluated in a simulation study with various scenarios, and we applied them to analyze criminal recidivism data in the Metropolitan Region of Belo Horizonte, Brazil. The results provide a detailed analysis of high-risk areas for recurrent crimes and the behavior of recidivism rates over time. This research significantly enhances our understanding of criminal trajectories, paving the way for more effective strategies in combating criminal recidivism."}, "https://arxiv.org/abs/2405.02715": {"title": "Grouping predictors via network-wide metrics", "link": "https://arxiv.org/abs/2405.02715", "description": "arXiv:2405.02715v1 Announce Type: new \nAbstract: When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, the identification of groups of predictors significantly associated with the response variable eases further downstream analysis and decision-making. This paper proposes a new data analysis methodology that utilizes the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed to determine the correct group, with probability tending to one as the sample size diverges to infinity. For this reason, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. 
The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data."}, "https://arxiv.org/abs/2405.02779": {"title": "Estimating Complier Average Causal Effects with Mixtures of Experts", "link": "https://arxiv.org/abs/2405.02779", "description": "arXiv:2405.02779v1 Announce Type: new \nAbstract: Understanding the causal impact of medical interventions is essential in healthcare research, especially through randomized controlled trials (RCTs). Despite their prominence, challenges arise due to discrepancies between treatment allocation and actual intake, influenced by various factors like patient non-adherence or procedural errors. This paper focuses on the Complier Average Causal Effect (CACE), crucial for evaluating treatment efficacy among compliant patients. Existing methodologies often rely on assumptions such as exclusion restriction and monotonicity, which can be problematic in practice. We propose a novel approach, leveraging supervised learning architectures, to estimate CACE without depending on these assumptions. Our method involves a two-step process: first estimating compliance probabilities for patients, then using these probabilities to estimate two nuisance components relevant to CACE calculation. Building upon the principal ignorability assumption, we introduce four root-n consistent, asymptotically normal, CACE estimators, and prove that the underlying mixtures of experts' nuisance components are identifiable. Our causal framework allows our estimation procedures to enjoy reduced mean squared errors when exclusion restriction or monotonicity assumptions hold. Through simulations and application to a breastfeeding promotion RCT, we demonstrate the method's performance and applicability."}, "https://arxiv.org/abs/2405.02871": {"title": "Modeling frequency distribution above a priority in presence of IBNR", "link": "https://arxiv.org/abs/2405.02871", "description": "arXiv:2405.02871v1 Announce Type: new \nAbstract: In reinsurance, Poisson and Negative binomial distributions are employed for modeling frequency. However, the incomplete data regarding reported incurred claims above a priority level presents challenges in estimation. This paper focuses on frequency estimation using Schnieper's framework for claim numbering. We demonstrate that Schnieper's model is consistent with a Poisson distribution for the total number of claims above a priority at each year of development, providing a robust basis for parameter estimation. Additionally, we explain how to build an alternative assumption based on a Negative binomial distribution, which yields similar results. The study includes a bootstrap procedure to manage uncertainty in parameter estimation and a case study comparing assumptions and evaluating the impact of the bootstrap approach."}, "https://arxiv.org/abs/2405.02905": {"title": "Mixture of partially linear experts", "link": "https://arxiv.org/abs/2405.02905", "description": "arXiv:2405.02905v1 Announce Type: new \nAbstract: In the mixture of experts model, a common assumption is the linearity between a response variable and covariates. While this assumption has theoretical and computational benefits, it may lead to suboptimal estimates by overlooking potential nonlinear relationships among the variables. To address this limitation, we propose a partially linear structure that incorporates unspecified functions to capture nonlinear relationships. 
We establish the identifiability of the proposed model under mild conditions and introduce a practical estimation algorithm. We present the performance of our approach through numerical studies, including simulations and real data analysis."}, "https://arxiv.org/abs/2405.02983": {"title": "CVXSADes: a stochastic algorithm for constructing optimal exact regression designs with single or multiple objectives", "link": "https://arxiv.org/abs/2405.02983", "description": "arXiv:2405.02983v1 Announce Type: new \nAbstract: We propose an algorithm to construct optimal exact designs (EDs). Most of the work in the optimal regression design literature focuses on the approximate design (AD) paradigm due to its desired properties, including the optimality verification conditions derived by Kiefer (1959, 1974). ADs may have unbalanced weights, and practitioners may have difficulty implementing them with a designated run size $n$. Some EDs are constructed using rounding methods to get an integer number of runs at each support point of an AD, but this approach may not yield optimal results. To construct EDs, one may need to perform new combinatorial constructions for each $n$, and there is no unified approach to construct them. Therefore, we develop a systematic way to construct EDs for any given $n$. Our method can transform ADs into EDs while retaining high statistical efficiency in two steps. The first step involves constructing an AD by utilizing the convex nature of many design criteria. The second step employs a simulated annealing algorithm to search for the ED stochastically. Through several applications, we demonstrate the utility of our method for various design problems. Additionally, we show that the design efficiency approaches unity as the number of design points increases."}, "https://arxiv.org/abs/2405.03021": {"title": "Tuning parameter selection in econometrics", "link": "https://arxiv.org/abs/2405.03021", "description": "arXiv:2405.03021v1 Announce Type: new \nAbstract: I review some of the main methods for selecting tuning parameters in nonparametric and $\\ell_1$-penalized estimation. For the nonparametric estimation, I consider the methods of Mallows, Stein, Lepski, cross-validation, penalization, and aggregation in the context of series estimation. For the $\\ell_1$-penalized estimation, I consider the methods based on the theory of self-normalized moderate deviations, bootstrap, Stein's unbiased risk estimation, and cross-validation in the context of Lasso estimation. I explain the intuition behind each of the methods and discuss their comparative advantages. I also give some extensions."}, "https://arxiv.org/abs/2405.03041": {"title": "Bayesian Functional Graphical Models with Change-Point Detection", "link": "https://arxiv.org/abs/2405.03041", "description": "arXiv:2405.03041v1 Announce Type: new \nAbstract: Functional data analysis, which models data as realizations of random functions over a continuum, has emerged as a useful tool for time series data. Often, the goal is to infer the dynamic connections (or time-varying conditional dependencies) among multiple functions or time series. For this task, we propose a dynamic and Bayesian functional graphical model. Our modeling approach prioritizes the careful definition of an appropriate graph to identify both time-invariant and time-varying connectivity patterns. 
We introduce a novel block-structured sparsity prior paired with a finite basis expansion, which together yield effective shrinkage and graph selection with efficient computations via a Gibbs sampling algorithm. Crucially, the model includes (one or more) graph changepoints, which are learned jointly with all model parameters and incorporate graph dynamics. Simulation studies demonstrate excellent graph selection capabilities, with significant improvements over competing methods. We apply the proposed approach to study dynamic connectivity patterns of sea surface temperatures in the Pacific Ocean and discover meaningful edges."}, "https://arxiv.org/abs/2405.03042": {"title": "Functional Post-Clustering Selective Inference with Applications to EHR Data Analysis", "link": "https://arxiv.org/abs/2405.03042", "description": "arXiv:2405.03042v1 Announce Type: new \nAbstract: In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post-clustering data often leads to an inflated type-I error. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing data collected over time. Our method extends classical selective inference methods for cross-sectional data to longitudinal data. We provide theoretical guarantees for our approach with upper bounds on the selective type-I and type-II errors. We apply the method to simulated data and real-world Acute Kidney Injury (AKI) EHR datasets, thereby illustrating the advantages of our approach."}, "https://arxiv.org/abs/2405.03083": {"title": "Causal K-Means Clustering", "link": "https://arxiv.org/abs/2405.03083", "description": "arXiv:2405.03083v1 Announce Type: new \nAbstract: Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions. 
Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse."}, "https://arxiv.org/abs/2405.03096": {"title": "Exact Sampling of Spanning Trees via Fast-forwarded Random Walks", "link": "https://arxiv.org/abs/2405.03096", "description": "arXiv:2405.03096v1 Announce Type: new \nAbstract: Tree graphs are routinely used in statistics. When estimating a Bayesian model with a tree component, sampling the posterior remains a core difficulty. Existing Markov chain Monte Carlo methods tend to rely on local moves, often leading to poor mixing. A promising approach is to instead directly sample spanning trees on an auxiliary graph. Current spanning tree samplers, such as the celebrated Aldous--Broder algorithm, predominantly rely on simulating random walks that are required to visit all the nodes of the graph. Such algorithms are prone to getting stuck in certain sub-graphs. We formalize this phenomenon using the bottlenecks in the random walk's transition probability matrix. We then propose a novel fast-forwarded cover algorithm that can break free from bottlenecks. The core idea is a marginalization argument that leads to a closed-form expression which allows for fast-forwarding to the event of visiting a new node. Unlike many existing approximation algorithms, our algorithm yields exact samples. We demonstrate the enhanced efficiency of the fast-forwarded cover algorithm, and illustrate its application in fitting a Bayesian dendrogram model on a Massachusetts crimes and communities dataset."}, "https://arxiv.org/abs/2405.03225": {"title": "Consistent response prediction for multilayer networks on unknown manifolds", "link": "https://arxiv.org/abs/2405.03225", "description": "arXiv:2405.03225v1 Announce Type: new \nAbstract: Our paper deals with a collection of networks on a common set of nodes, where some of the networks are associated with responses. Assuming that the networks correspond to points on a one-dimensional manifold in a higher dimensional ambient space, we propose an algorithm to consistently predict the response at an unlabeled network. Our model involves a specific multiple random network model, namely the common subspace independent edge model, where the networks share a common invariant subspace, and the heterogeneity amongst the networks is captured by a set of low dimensional matrices. Our algorithm estimates these low dimensional matrices that capture the heterogeneity of the networks, learns the underlying manifold by isomap, and consistently predicts the response at an unlabeled network. We provide theoretical justifications for the use of our algorithm, validated by numerical simulations. Finally, we demonstrate the use of our algorithm on larval Drosophila connectome data."}, "https://arxiv.org/abs/2405.03603": {"title": "Copas-Heckman-type sensitivity analysis for publication bias in rare-event meta-analysis under the framework of the generalized linear mixed model", "link": "https://arxiv.org/abs/2405.03603", "description": "arXiv:2405.03603v1 Announce Type: new \nAbstract: Publication bias (PB) is one of the serious issues in meta-analysis. Many existing methods dealing with PB are based on the normal-normal (NN) random-effects model assuming normal models in both the within-study and the between-study levels. For rare-event meta-analysis where the data contain rare occurrences of event, the standard NN random-effects model may perform poorly. 
Instead, the generalized linear mixed effects model (GLMM) using the exact within-study model is recommended. However, no method has been proposed for dealing with PB in rare-event meta-analysis using the GLMM. In this paper, we propose sensitivity analysis methods for evaluating the impact of PB on the GLMM based on the famous Copas-Heckman-type selection model. The proposed methods can be easily implemented with standard software for fitting the nonlinear mixed-effects model. We use a real-world example to show the usefulness of the proposed methods in evaluating the potential impact of PB in meta-analysis of the log-transformed odds ratio based on the GLMM using the non-central hypergeometric or binomial distribution as the within-study model. An extension of the proposed method is also introduced for evaluating PB in meta-analysis of proportion based on the GLMM with the binomial within-study model."}, "https://arxiv.org/abs/2405.03606": {"title": "Strang Splitting for Parametric Inference in Second-order Stochastic Differential Equations", "link": "https://arxiv.org/abs/2405.03606", "description": "arXiv:2405.03606v1 Announce Type: new \nAbstract: We address parameter estimation in second-order stochastic differential equations (SDEs), prevalent in physics, biology, and ecology. Second-order SDE is converted to a first-order system by introducing an auxiliary velocity variable, raising two main challenges. First, the system is hypoelliptic since the noise affects only the velocity, making the Euler-Maruyama estimator ill-conditioned. To overcome that, we propose an estimator based on the Strang splitting scheme. Second, since the velocity is rarely observed, we adjust the estimator for partial observations. We present four estimators for complete and partial observations, using full likelihood or only velocity marginal likelihood. These estimators are intuitive, easy to implement, and computationally fast, and we prove their consistency and asymptotic normality. Our analysis demonstrates that using full likelihood with complete observations reduces the asymptotic variance of the diffusion estimator. With partial observations, the asymptotic variance increases due to information loss but remains unaffected by the likelihood choice. However, a numerical study on the Kramers oscillator reveals that using marginal likelihood for partial observations yields less biased estimators. We apply our approach to paleoclimate data from the Greenland ice core and fit it to the Kramers oscillator model, capturing transitions between metastable states reflecting observed climatic conditions during glacial eras."}, "https://arxiv.org/abs/2405.02475": {"title": "Generalizing Orthogonalization for Models with Non-linearities", "link": "https://arxiv.org/abs/2405.02475", "description": "arXiv:2405.02475v1 Announce Type: cross \nAbstract: The complexity of black-box algorithms can lead to various challenges, including the introduction of biases. These biases present immediate risks in the algorithms' application. It was, for instance, shown that neural networks can deduce racial information solely from a patient's X-ray scan, a task beyond the capability of medical experts. If this fact is not known to the medical expert, automatic decision-making based on this algorithm could lead to prescribing a treatment (purely) based on racial information. 
While current methodologies allow for the \"orthogonalization\" or \"normalization\" of neural networks with respect to such information, existing approaches are grounded in linear models. Our paper advances the discourse by introducing corrections for non-linearities such as ReLU activations. Our approach also encompasses scalar and tensor-valued predictions, facilitating its integration into neural network architectures. Through extensive experiments, we validate our method's effectiveness in safeguarding sensitive data in generalized linear models, normalizing convolutional neural networks for metadata, and rectifying pre-existing embeddings for undesired attributes."}, "https://arxiv.org/abs/2405.03063": {"title": "Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection", "link": "https://arxiv.org/abs/2405.03063", "description": "arXiv:2405.03063v1 Announce Type: cross \nAbstract: Suppose that we first apply the Lasso to a design matrix, and then update one of its columns. In general, the signs of the Lasso coefficients may change, and there is no closed-form expression for updating the Lasso solution exactly. In this work, we propose an approximate formula for updating a debiased Lasso coefficient. We provide general nonasymptotic error bounds in terms of the norms and correlations of a given design matrix's columns, and then prove asymptotic convergence results for the case of a random design matrix with i.i.d.\\ sub-Gaussian row vectors and i.i.d.\\ Gaussian noise. Notably, the approximate formula is asymptotically correct for most coordinates in the proportional growth regime, under the mild assumption that each row of the design matrix is sub-Gaussian with a covariance matrix having a bounded condition number. Our proof only requires certain concentration and anti-concentration properties to control various error terms and the number of sign changes. In contrast, rigorously establishing distributional limit properties (e.g.\\ Gaussian limits for the debiased Lasso) under similarly general assumptions has been considered an open problem in the universality theory. As applications, we show that the approximate formula allows us to reduce the computation complexity of variable selection algorithms that require solving multiple Lasso problems, such as the conditional randomization test and a variant of the knockoff filter."}, "https://arxiv.org/abs/2405.03579": {"title": "Some Statistical and Data Challenges When Building Early-Stage Digital Experimentation and Measurement Capabilities", "link": "https://arxiv.org/abs/2405.03579", "description": "arXiv:2405.03579v1 Announce Type: cross \nAbstract: Digital experimentation and measurement (DEM) capabilities -- the knowledge and tools necessary to run experiments with digital products, services, or experiences and measure their impact -- are fast becoming part of the standard toolkit of digital/data-driven organisations in guiding business decisions. Many large technology companies report having mature DEM capabilities, and several businesses have been established purely to manage experiments for others. Given the growing evidence that data-driven organisations tend to outperform their non-data-driven counterparts, there has never been a greater need for organisations to build/acquire DEM capabilities to thrive in the current digital era.\n This thesis presents several novel approaches to statistical and data challenges for organisations building DEM capabilities. 
We focus on the fundamentals associated with building DEM capabilities, which lead to a richer understanding of the underlying assumptions and thus enable us to develop more appropriate capabilities. We address why one should engage in DEM by quantifying the benefits and risks of acquiring DEM capabilities. This is done using a ranking under lower uncertainty model, enabling one to construct a business case. We also examine what ingredients are necessary to run digital experiments. In addition to clarifying the existing literature around statistical tests, datasets, and methods in experimental design and causal inference, we construct an additional dataset and detailed case studies on applying state-of-the-art methods. Finally, we investigate when a digital experiment design would outperform another, leading to an evaluation framework that compares competing designs' data efficiency."}, "https://arxiv.org/abs/2302.12728": {"title": "Statistical Principles for Platform Trials", "link": "https://arxiv.org/abs/2302.12728", "description": "arXiv:2302.12728v2 Announce Type: replace \nAbstract: While within a clinical study there may be multiple doses and endpoints, across different studies each study will result in either an approval or a lack of approval of the drug compound studied. The False Approval Rate (FAR) is the proportion of drug compounds that lack efficacy incorrectly approved by regulators. (In the U.S., compounds that have efficacy and are approved are not involved in the FAR consideration, according to our reading of the relevant U.S. Congressional statute).\n While Tukey's (1953) Error Rate Familywise (ERFw) is meant to be applied within a clinical study, Tukey's (1953) Error Rate per Family (ERpF), defined alongside ERFw, is meant to be applied across studies. We show that controlling Error Rate Familywise (ERFw) within a clinical study at 5% in turn controls Error Rate per Family (ERpF) across studies at 5-per-100, regardless of whether the studies are correlated or not. Further, we show that ongoing regulatory practice, the additive multiplicity adjustment method of controlling ERpF, is controlling False Approval Rate FAR exactly (not conservatively) at 5-per-100 (even for Platform trials).\n In contrast, if a regulatory agency chooses to control the False Discovery Rate (FDR) across studies at 5% instead, then this change in policy from ERpF control to FDR control will result in incorrectly approving drug compounds that lack efficacy at a rate higher than 5-per-100, because in essence it gives the industry additional rewards for successfully developing compounds that have efficacy and are approved. It seems to us that the discussion of such a change in policy would be at a level higher than merely statistical, needing harmonisation/harmonization. (In the U.S., policy is set by the Congress.)"}, "https://arxiv.org/abs/2305.06262": {"title": "Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease", "link": "https://arxiv.org/abs/2305.06262", "description": "arXiv:2305.06262v3 Announce Type: replace \nAbstract: We propose a Bayesian model selection approach that allows medical practitioners to select among predictor variables while taking their respective costs into account. Medical procedures almost always incur costs in time and/or money. These costs might exceed their usefulness for modeling the outcome of interest. 
We develop Bayesian model selection that uses flexible model priors to penalize costly predictors a priori and select a subset of predictors useful relative to their costs. Our approach (i) gives the practitioner control over the magnitude of cost penalization, (ii) enables the prior to scale well with sample size, and (iii) enables the creation of our proposed inclusion path visualization, which can be used to make decisions about individual candidate predictors using both probabilistic and visual tools. We demonstrate the effectiveness of our inclusion path approach and the importance of being able to adjust the magnitude of the prior's cost penalization through a dataset pertaining to heart disease diagnosis in patients at the Cleveland Clinic Foundation, where several candidate predictors with various costs were recorded for patients, and through simulated data."}, "https://arxiv.org/abs/2306.00296": {"title": "Inference in Predictive Quantile Regressions", "link": "https://arxiv.org/abs/2306.00296", "description": "arXiv:2306.00296v2 Announce Type: replace \nAbstract: This paper studies inference in predictive quantile regressions when the predictive regressor has a near-unit root. We derive asymptotic distributions for the quantile regression estimator and its heteroskedasticity and autocorrelation consistent (HAC) t-statistic in terms of functionals of Ornstein-Uhlenbeck processes. We then propose a switching-fully modified (FM) predictive test for quantile predictability. The proposed test employs an FM style correction with a Bonferroni bound for the local-to-unity parameter when the predictor has a near unit root. It switches to a standard predictive quantile regression test with a slightly conservative critical value when the largest root of the predictor lies in the stationary range. Simulations indicate that the test has a reliable size in small samples and good power. We employ this new methodology to test the ability of three commonly employed, highly persistent and endogenous lagged valuation regressors - the dividend price ratio, earnings price ratio, and book-to-market ratio - to predict the median, shoulders, and tails of the stock return distribution."}, "https://arxiv.org/abs/2306.14761": {"title": "Doubly ranked tests for grouped functional data", "link": "https://arxiv.org/abs/2306.14761", "description": "arXiv:2306.14761v2 Announce Type: replace \nAbstract: Nonparametric tests for functional data are a challenging class of tests to work with because of the potentially high dimensional nature of functional data. One of the main challenges for considering rank-based tests, like the Mann-Whitney or Wilcoxon Rank Sum tests (MWW), is that the unit of observation is a curve. Thus any rank-based test must consider ways of ranking curves. While several procedures, including depth-based methods, have recently been used to create scores for rank-based tests, these scores are not constructed under the null and often introduce additional, uncontrolled for variability. We therefore reconsider the problem of rank-based tests for functional data and develop an alternative approach that incorporates the null hypothesis throughout. Our approach first ranks realizations from the curves at each time point, then summarizes the ranks for each subject using a sufficient statistic we derive, and finally re-ranks the sufficient statistics in a procedure we refer to as a doubly ranked test. 
As we demonstrate, doubly ranked tests are more powerful while maintaining ideal type I error in the two-sample, MWW setting. We also extend our framework to more than two samples, developing a Kruskal-Wallis test for functional data which exhibits good test characteristics as well. Finally, we illustrate the use of doubly ranked tests in functional data contexts from material science, climatology, and public health policy."}, "https://arxiv.org/abs/2309.10017": {"title": "A Change-Point Approach to Estimating the Proportion of False Null Hypotheses in Multiple Testing", "link": "https://arxiv.org/abs/2309.10017", "description": "arXiv:2309.10017v2 Announce Type: replace \nAbstract: For estimating the proportion of false null hypotheses in multiple testing, a family of estimators by Storey (2002) is widely used in the applied and statistical literature, with many methods suggested for selecting the parameter $\\lambda$. Inspired by change-point concepts, our new approach to the latter problem first approximates the $p$-value plot with a piecewise linear function with a single change-point and then selects the $p$-value at the change-point location as $\\lambda$. Simulations show that our method has among the smallest RMSE across various settings, and we extend it to address the estimation in cases of superuniform $p$-values. We provide asymptotic theory for our estimator, relying on the theory of quantile processes. Additionally, we propose an application in the change-point literature and illustrate it using high-dimensional CNV data."}, "https://arxiv.org/abs/2312.07520": {"title": "Estimating Counterfactual Matrix Means with Short Panel Data", "link": "https://arxiv.org/abs/2312.07520", "description": "arXiv:2312.07520v2 Announce Type: replace \nAbstract: We develop a new, spectral approach for identifying and estimating average counterfactual outcomes under a low-rank factor model with short panel data and general outcome missingness patterns. Applications include event studies and studies of outcomes of \"matches\" between agents of two types, e.g. workers and firms, typically conducted under less-flexible Two-Way-Fixed-Effects (TWFE) models of outcomes. Given an infinite population of units and a finite number of outcomes, we show our approach identifies all counterfactual outcome means, including those not estimable by existing methods, if a particular graph constructed based on overlaps in observed outcomes between subpopulations is connected. Our analogous, computationally efficient estimation procedure yields consistent, asymptotically normal estimates of counterfactual outcome means under fixed-$T$ (number of outcomes), large-$N$ (sample size) asymptotics. In a semi-synthetic simulation study based on matched employer-employee data, our estimator has lower bias and only slightly higher variance than a TWFE-model-based estimator when estimating average log-wages."}, "https://arxiv.org/abs/2312.09884": {"title": "Investigating the heterogeneity of \"study twins\"", "link": "https://arxiv.org/abs/2312.09884", "description": "arXiv:2312.09884v2 Announce Type: replace \nAbstract: Meta-analyses are commonly performed based on random-effects models, while in certain cases one might also argue in favour of a common-effect model. One such case may be given by the example of two \"study twins\" that are performed according to a common (or at least very similar) protocol. Here we investigate the particular case of meta-analysis of a pair of studies, e.g. 
summarizing the results of two confirmatory clinical trials in phase III of a clinical development programme. Thereby, we focus on the question of to what extent homogeneity or heterogeneity may be discernible, and include an empirical investigation of published (\"twin\") pairs of studies. A pair of estimates from two studies only provides very little evidence on homogeneity or heterogeneity of effects, and ad-hoc decision criteria may often be misleading."}, "https://arxiv.org/abs/2306.10614": {"title": "Identifiable causal inference with noisy treatment and no side information", "link": "https://arxiv.org/abs/2306.10614", "description": "arXiv:2306.10614v2 Announce Type: replace-cross \nAbstract: In some causal inference scenarios, the treatment variable is measured inaccurately, for instance in epidemiology or econometrics. Failure to correct for the effect of this measurement error can lead to biased causal effect estimates. Previous research has not studied methods that address this issue from a causal viewpoint while allowing for complex nonlinear dependencies and without assuming access to side information. For such a scenario, this study proposes a model that assumes a continuous treatment variable that is inaccurately measured. Building on existing results for measurement error models, we prove that our model's causal effect estimates are identifiable, even without knowledge of the measurement error variance or other side information. Our method relies on a deep latent variable model in which Gaussian conditionals are parameterized by neural networks, and we develop an amortized importance-weighted variational objective for training the model. Empirical results demonstrate the method's good performance with unknown measurement error. More broadly, our work extends the range of applications in which reliable causal inference can be conducted."}, "https://arxiv.org/abs/2310.00809": {"title": "Towards Causal Foundation Model: on Duality between Causal Inference and Attention", "link": "https://arxiv.org/abs/2310.00809", "description": "arXiv:2310.00809v2 Announce Type: replace-cross \nAbstract: Foundation models have brought changes to the landscape of machine learning, demonstrating sparks of human-level intelligence across a diverse array of tasks. However, a gap persists in complex tasks such as causal inference, primarily due to challenges associated with intricate reasoning steps and high numerical precision requirements. In this work, we take a first step towards building causally-aware foundation models for complex tasks. We propose a novel, theoretically sound method called Causal Inference with Attention (CInA), which utilizes multiple unlabeled datasets to perform self-supervised causal learning, and subsequently enables zero-shot causal inference on unseen tasks with new data. This is based on our theoretical results that demonstrate the primal-dual connection between optimal covariate balancing and self-attention, facilitating zero-shot causal inference through the final layer of a trained transformer-type architecture. 
We demonstrate empirically that our approach CInA effectively generalizes to out-of-distribution datasets and various real-world datasets, matching or even surpassing traditional per-dataset causal inference methodologies."}, "https://arxiv.org/abs/2405.03778": {"title": "An Autoregressive Model for Time Series of Random Objects", "link": "https://arxiv.org/abs/2405.03778", "description": "arXiv:2405.03778v1 Announce Type: new \nAbstract: Random variables in metric spaces indexed by time and observed at equally spaced time points are receiving increased attention due to their broad applicability. However, the absence of inherent structure in metric spaces has resulted in a literature that is predominantly non-parametric and model-free. To address this gap in models for time series of random objects, we introduce an adaptation of the classical linear autoregressive model tailored for data lying in a Hadamard space. The parameters of interest in this model are the Fr\\'echet mean and a concentration parameter, both of which we prove can be consistently estimated from data. Additionally, we propose a test statistic and establish its asymptotic normality, thereby enabling hypothesis testing for the absence of serial dependence. Finally, we introduce a bootstrap procedure to obtain critical values for the test statistic under the null hypothesis. Theoretical results of our method, including the convergence of the estimators as well as the size and power of the test, are illustrated through simulations, and the utility of the model is demonstrated by an analysis of a time series of consumer inflation expectations."}, "https://arxiv.org/abs/2405.03815": {"title": "Statistical inference for a stochastic generalized logistic differential equation", "link": "https://arxiv.org/abs/2405.03815", "description": "arXiv:2405.03815v1 Announce Type: new \nAbstract: This research aims to estimate three parameters in a stochastic generalized logistic differential equation. We assume the intrinsic growth rate and shape parameters are constant but unknown. To estimate these two parameters, we use the maximum likelihood method and establish that the estimators for these two parameters are strongly consistent. We estimate the diffusion parameter by using the quadratic variation processes. To test our results, we evaluate two data scenarios, complete and incomplete, with fixed values assigned to the three parameters. In the incomplete data scenario, we apply an Expectation Maximization algorithm."}, "https://arxiv.org/abs/2405.03826": {"title": "A quantile-based nonadditive fixed effects model", "link": "https://arxiv.org/abs/2405.03826", "description": "arXiv:2405.03826v1 Announce Type: new \nAbstract: I propose a quantile-based nonadditive fixed effects panel model to study heterogeneous causal effects. Similar to standard fixed effects (FE) model, my model allows arbitrary dependence between regressors and unobserved heterogeneity, but it generalizes the additive separability of standard FE to allow the unobserved heterogeneity to enter nonseparably. Similar to structural quantile models, my model's random coefficient vector depends on an unobserved, scalar ''rank'' variable, in which outcomes (excluding an additive noise term) are monotonic at a particular value of the regressor vector, which is much weaker than the conventional monotonicity assumption that must hold at all possible values. 
This rank is assumed to be stable over time, which is often more economically plausible than the panel quantile studies that assume individual rank is iid over time. It uncovers the heterogeneous causal effects as functions of the rank variable. I provide identification and estimation results, establishing uniform consistency and uniform asymptotic normality of the heterogeneous causal effect function estimator. Simulations show reasonable finite-sample performance and show my model complements fixed effects quantile regression. Finally, I illustrate the proposed methods by examining the causal effect of a country's oil wealth on its military defense spending."}, "https://arxiv.org/abs/2405.03834": {"title": "Covariance-free Multifidelity Control Variates Importance Sampling for Reliability Analysis of Rare Events", "link": "https://arxiv.org/abs/2405.03834", "description": "arXiv:2405.03834v1 Announce Type: new \nAbstract: Multifidelity modeling has been steadily gaining attention as a tool to address the problem of exorbitant model evaluation costs that makes the estimation of failure probabilities a significant computational challenge for complex real-world problems, particularly when failure is a rare event. To implement multifidelity modeling, estimators that efficiently combine information from multiple models/sources are necessary. In past works, the variance reduction techniques of Control Variates (CV) and Importance Sampling (IS) have been leveraged for this task. In this paper, we present the CVIS framework; a creative take on a coupled Control Variates and Importance Sampling estimator for bifidelity reliability analysis. The framework addresses some of the practical challenges of the CV method by using an estimator for the control variate mean and side-stepping the need to estimate the covariance between the original estimator and the control variate through a clever choice for the tuning constant. The task of selecting an efficient IS distribution is also considered, with a view towards maximally leveraging the bifidelity structure and maintaining expressivity. Additionally, a diagnostic is provided that indicates both the efficiency of the algorithm as well as the relative predictive quality of the models utilized. Finally, the behavior and performance of the framework is explored through analytical and numerical examples."}, "https://arxiv.org/abs/2405.03910": {"title": "A Primer on the Analysis of Randomized Experiments and a Survey of some Recent Advances", "link": "https://arxiv.org/abs/2405.03910", "description": "arXiv:2405.03910v1 Announce Type: new \nAbstract: The past two decades have witnessed a surge of new research in the analysis of randomized experiments. The emergence of this literature may seem surprising given the widespread use and long history of experiments as the \"gold standard\" in program evaluation, but this body of work has revealed many subtle aspects of randomized experiments that may have been previously unappreciated. 
This article provides an overview of some of these topics, primarily focused on stratification, regression adjustment, and cluster randomization."}, "https://arxiv.org/abs/2405.03985": {"title": "Bayesian Multilevel Compositional Data Analysis: Introduction, Evaluation, and Application", "link": "https://arxiv.org/abs/2405.03985", "description": "arXiv:2405.03985v1 Announce Type: new \nAbstract: Multilevel compositional data commonly occur in various fields, particularly in intensive, longitudinal studies using ecological momentary assessments. Examples include data repeatedly measured over time that are non-negative and sum to a constant value, such as sleep-wake movement behaviours in a 24-hour day. This article presents a novel methodology for analysing multilevel compositional data using a Bayesian inference approach. This method can be used to investigate how reallocation of time between sleep-wake movement behaviours may be associated with other phenomena (e.g., emotions, cognitions) at a daily level. We explain the theoretical details of the data and the models, and outline the steps necessary to implement this method. We introduce the R package multilevelcoda to facilitate the application of this method and illustrate using a real data example. An extensive parameter recovery simulation study verified the robust performance of the method. Across all simulation conditions investigated in the simulation study, the model had minimal convergence issues (convergence rate > 99%) and achieved excellent quality of parameter estimates and inference, with an average bias of 0.00 (range -0.09, 0.05) and coverage of 0.95 (range 0.93, 0.97). We conclude the article with recommendations on the use of the Bayesian compositional multilevel modelling approach, and hope to promote wider application of this method to answer robust questions using the increasingly available data from intensive, longitudinal studies."}, "https://arxiv.org/abs/2405.04193": {"title": "A generalized ordinal quasi-symmetry model and its separability for analyzing multi-way tables", "link": "https://arxiv.org/abs/2405.04193", "description": "arXiv:2405.04193v1 Announce Type: new \nAbstract: This paper addresses the challenge of modeling multi-way contingency tables for matched set data with ordinal categories. Although the complete symmetry and marginal homogeneity models are well established, they may not always provide a satisfactory fit to the data. To address this issue, we propose a generalized ordinal quasi-symmetry model that offers increased flexibility when the complete symmetry model fails to capture the underlying structure. We investigate the properties of this new model and provide an information-theoretic interpretation, elucidating its relationship to the ordinal quasi-symmetry model. Moreover, we revisit Agresti's findings and present a new necessary and sufficient condition for the complete symmetry model, proving that the proposed model and the marginal moment equality model are separable hypotheses. The separability of the proposed model and marginal moment equality model is a significant development in the analysis of multi-way contingency tables. It enables researchers to examine the symmetry structure in the data with greater precision, providing a more thorough understanding of the underlying patterns. 
This powerful framework equips researchers with the necessary tools to explore the complexities of ordinal variable relationships in matched set data, paving the way for new discoveries and insights."}, "https://arxiv.org/abs/2405.04226": {"title": "NEST: Neural Estimation by Sequential Testing", "link": "https://arxiv.org/abs/2405.04226", "description": "arXiv:2405.04226v1 Announce Type: new \nAbstract: Adaptive psychophysical procedures aim to increase the efficiency and reliability of measurements. With increasing stimulus and experiment complexity in the last decade, estimating multi-dimensional psychometric functions has become a challenging task for adaptive procedures. If the experimenter has limited information about the underlying psychometric function, it is not possible to use parametric techniques developed for the multi-dimensional stimulus space. Although there are non-parametric approaches that use Gaussian process methods and specific hand-crafted acquisition functions, their performance is sensitive to proper selection of the kernel function, which is not always straightforward. In this work, we use a neural network as the psychometric function estimator and introduce a novel acquisition function for stimulus selection. We thoroughly benchmark our technique both using simulations and by conducting psychovisual experiments under realistic conditions. We show that our method outperforms the state of the art without the need to select a kernel function and significantly reduces the experiment duration."}, "https://arxiv.org/abs/2405.04238": {"title": "Homogeneity of multinomial populations when data are classified into a large number of groups", "link": "https://arxiv.org/abs/2405.04238", "description": "arXiv:2405.04238v1 Announce Type: new \nAbstract: Suppose that we are interested in the comparison of two independent categorical variables. Suppose also that the population is divided into subpopulations or groups. Notice that the distribution of the target variable may vary across subpopulations, moreover, it may happen that the two independent variables have the same distribution in the whole population, but their distributions could differ in some groups. So, instead of testing the homogeneity of the two categorical variables, one may be interested in simultaneously testing the homogeneity in all groups. A novel procedure is proposed for carrying out such a testing problem. The test statistic is shown to be asymptotically normal, avoiding the use of complicated resampling methods to get $p$-values. Here by asymptotic we mean when the number of groups increases; the sample sizes of the data from each group can either stay bounded or grow with the number of groups. The finite sample performance of the proposal is empirically evaluated through an extensive simulation study. The usefulness of the proposal is illustrated by three data sets coming from diverse experimental fields such as education, the COVID-19 pandemic and digital elevation models."}, "https://arxiv.org/abs/2405.04254": {"title": "Distributed variable screening for generalized linear models", "link": "https://arxiv.org/abs/2405.04254", "description": "arXiv:2405.04254v1 Announce Type: new \nAbstract: In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. 
Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather than just their marginal effects, and this characteristic enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that with a high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite sample performance of the proposed method, and an application to a real dataset showcases its practical utility."}, "https://arxiv.org/abs/2405.04365": {"title": "Detailed Gender Wage Gap Decompositions: Controlling for Worker Unobserved Heterogeneity Using Network Theory", "link": "https://arxiv.org/abs/2405.04365", "description": "arXiv:2405.04365v1 Announce Type: new \nAbstract: Recent advances in the literature of decomposition methods in economics have allowed for the identification and estimation of detailed wage gap decompositions. In this context, building reliable counterfactuals requires using tighter controls to ensure that similar workers are correctly identified by making sure that important unobserved variables such as skills are controlled for, as well as comparing only workers with similar observable characteristics. This paper contributes to the wage decomposition literature in two main ways: (i) developing an economically principled, network-based approach to control for unobserved worker skill heterogeneity in the presence of potential discrimination; and (ii) extending existing generic decomposition tools to accommodate a potential lack of overlapping supports in covariates between groups being compared, which is likely to be the norm in more detailed decompositions. We illustrate the methodology by decomposing the gender wage gap in Brazil."}, "https://arxiv.org/abs/2405.04419": {"title": "Transportability of Principal Causal Effects", "link": "https://arxiv.org/abs/2405.04419", "description": "arXiv:2405.04419v1 Announce Type: new \nAbstract: Recent research in causal inference has made important progress in addressing challenges to the external validity of trial findings. Such methods weight trial participant data to more closely resemble the distribution of effect-modifying covariates in a well-defined target population. In the presence of participant non-adherence to study medication, these methods effectively transport an intention-to-treat effect that averages over heterogeneous compliance behaviors. In this paper, we develop a principal stratification framework to identify causal effects conditioning on both compliance behavior and membership in the target population. We also develop non-parametric efficiency theory for and construct efficient estimators of such \"transported\" principal causal effects and characterize their finite-sample performance in simulation experiments. While this work focuses on treatment non-adherence, the framework is applicable to a broad class of estimands that target effects in clinically-relevant, possibly latent subsets of a target population."}, "https://arxiv.org/abs/2405.04446": {"title": "Causal Inference in the Multiverse of Hazard", "link": "https://arxiv.org/abs/2405.04446", "description": "arXiv:2405.04446v1 Announce Type: new \nAbstract: Hazard serves as a pivotal estimand in both practical applications and methodological frameworks. 
However, its causal interpretation poses notable challenges, including inherent selection biases and ill-defined populations to be compared between different treatment groups. In response, we propose a novel definition of counterfactual hazard within the framework of possible worlds. Instead of conditioning on prior survival status as a conditional probability, our new definition involves intervening in the prior status, treating it as a marginal probability. Using single-world intervention graphs, we demonstrate that the proposed counterfactual hazard is a type of controlled direct effect. Conceptually, intervening in survival status at each time point generates a new possible world, where the proposed hazards across time points represent risks in these hypothetical scenarios, forming a \"multiverse of hazard.\" The cumulative and average counterfactual hazards correspond to the sum and average of risks across this multiverse, respectively, with the actual world's risk lying between the two. This conceptual shift reframes hazards in the actual world as a collection of risks across possible worlds, marking a significant advancement in the causal interpretation of hazards."}, "https://arxiv.org/abs/2405.04465": {"title": "Two-way Fixed Effects and Differences-in-Differences Estimators in Heterogeneous Adoption Designs", "link": "https://arxiv.org/abs/2405.04465", "description": "arXiv:2405.04465v1 Announce Type: new \nAbstract: We consider treatment-effect estimation under a parallel trends assumption, in heterogeneous adoption designs where no unit is treated at period one, and units receive a weakly positive dose at period two. First, we develop a test of the assumption that the treatment effect is mean independent of the treatment, under which the commonly-used two-way-fixed-effects estimator is consistent. When this test is rejected, we propose alternative, robust estimators. If there are stayers with a period-two treatment equal to 0, the robust estimator is a difference-in-differences (DID) estimator using stayers as the control group. If there are quasi-stayers with a period-two treatment arbitrarily close to zero, the robust estimator is a DID using units with a period-two treatment below a bandwidth as controls. Finally, without stayers or quasi-stayers, we propose non-parametric bounds, and an estimator relying on a parametric specification of treatment-effect heterogeneity. We use our results to revisit Pierce and Schott (2016) and Enikolopov et al. (2011)."}, "https://arxiv.org/abs/2405.04475": {"title": "Bayesian Copula Density Estimation Using Bernstein Yett-Uniform Priors", "link": "https://arxiv.org/abs/2405.04475", "description": "arXiv:2405.04475v1 Announce Type: new \nAbstract: Probability density estimation is a central task in statistics. Copula-based models provide a great deal of flexibility in modelling multivariate distributions, allowing for the specifications of models for the marginal distributions separately from the dependence structure (copula) that links them to form a joint distribution. Choosing a class of copula models is not a trivial task and its misspecification can lead to wrong conclusions. We introduce a novel class of random Bernstein copula functions, and studied its support and the behavior of its posterior distribution. The proposal is based on a particular class of random grid-uniform copulas, referred to as yett-uniform copulas. 
Alternative Markov chain Monte Carlo algorithms for exploring the posterior distribution under the proposed model are also studied. The methodology is illustrated by means of simulated and real data."}, "https://arxiv.org/abs/2405.04531": {"title": "Stochastic Gradient MCMC for Massive Geostatistical Data", "link": "https://arxiv.org/abs/2405.04531", "description": "arXiv:2405.04531v1 Announce Type: new \nAbstract: Gaussian processes (GPs) are commonly used for prediction and inference for spatial data analyses. However, since estimation and prediction tasks have cubic time and quadratic memory complexity in number of locations, GPs are difficult to scale to large spatial datasets. The Vecchia approximation induces sparsity in the dependence structure and is one of several methods proposed to scale GP inference. Our work adds to the substantial research in this area by developing a stochastic gradient Markov chain Monte Carlo (SGMCMC) framework for efficient computation in GPs. At each step, the algorithm subsamples a minibatch of locations and subsequently updates process parameters through a Vecchia-approximated GP likelihood. Since the Vecchia-approximated GP has a time complexity that is linear in the number of locations, this results in scalable estimation in GPs. Through simulation studies, we demonstrate that SGMCMC is competitive with state-of-the-art scalable GP algorithms in terms of computational time and parameter estimation. An application of our method is also provided using the Argo dataset of ocean temperature measurements."}, "https://arxiv.org/abs/2405.03720": {"title": "Spatial Transfer Learning with Simple MLP", "link": "https://arxiv.org/abs/2405.03720", "description": "arXiv:2405.03720v1 Announce Type: cross \nAbstract: First step to investigate the potential of transfer learning applied to the field of spatial statistics"}, "https://arxiv.org/abs/2405.03723": {"title": "Generative adversarial learning with optimal input dimension and its adaptive generator architecture", "link": "https://arxiv.org/abs/2405.03723", "description": "arXiv:2405.03723v1 Announce Type: cross \nAbstract: We investigate the impact of the input dimension on the generalization error in generative adversarial networks (GANs). In particular, we first provide both theoretical and practical evidence to validate the existence of an optimal input dimension (OID) that minimizes the generalization error. Then, to identify the OID, we introduce a novel framework called generalized GANs (G-GANs), which includes existing GANs as a special case. By incorporating the group penalty and the architecture penalty developed in the paper, G-GANs have several intriguing features. First, our framework offers adaptive dimensionality reduction from the initial dimension to a dimension necessary for generating the target distribution. Second, this reduction in dimensionality also shrinks the required size of the generator network architecture, which is automatically identified by the proposed architecture penalty. Both reductions in dimensionality and the generator network significantly improve the stability and the accuracy of the estimation and prediction. Theoretical support for the consistent selection of the input dimension and the generator network is provided. Third, the proposed algorithm involves an end-to-end training process, and the algorithm allows for dynamic adjustments between the input dimension and the generator network during training, further enhancing the overall performance of G-GANs. 
Extensive experiments conducted with simulated and benchmark data demonstrate the superior performance of G-GANs. In particular, compared to that of off-the-shelf methods, G-GANs achieves an average improvement of 45.68% in the CT slice dataset, 43.22% in the MNIST dataset and 46.94% in the FashionMNIST dataset in terms of the maximum mean discrepancy or Frechet inception distance. Moreover, the features generated based on the input dimensions identified by G-GANs align with visually significant features."}, "https://arxiv.org/abs/2405.04043": {"title": "Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference", "link": "https://arxiv.org/abs/2405.04043", "description": "arXiv:2405.04043v1 Announce Type: cross \nAbstract: Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains."}, "https://arxiv.org/abs/2012.00180": {"title": "Anisotropic local constant smoothing for change-point regression function estimation", "link": "https://arxiv.org/abs/2012.00180", "description": "arXiv:2012.00180v2 Announce Type: replace \nAbstract: Understanding forest fire spread in any region of Canada is critical to promoting forest health, and protecting human life and infrastructure. Quantifying fire spread from noisy images, where regions of a fire are separated by change-point boundaries, is critical to faithfully estimating fire spread rates. In this research, we develop a statistically consistent smooth estimator that allows us to denoise fire spread imagery from micro-fire experiments. We develop an anisotropic smoothing method for change-point data that uses estimates of the underlying data generating process to inform smoothing. We show that the anisotropic local constant regression estimator is consistent with convergence rate $O\\left(n^{-1/{(q+2)}}\\right)$. We demonstrate its effectiveness on simulated one- and two-dimensional change-point data and fire spread imagery from micro-fire experiments."}, "https://arxiv.org/abs/2205.00171": {"title": "A Heteroskedasticity-Robust Overidentifying Restriction Test with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2205.00171", "description": "arXiv:2205.00171v3 Announce Type: replace \nAbstract: This paper proposes an overidentifying restriction test for high-dimensional linear instrumental variable models. 
The novelty of the proposed test is that it allows the number of covariates and instruments to be larger than the sample size. The test is scale-invariant and is robust to heteroskedastic errors. To construct the final test statistic, we first introduce a test based on the maximum norm of multiple parameters that could be high-dimensional. The theoretical power based on the maximum norm is higher than that in the modified Cragg-Donald test (Koles\\'{a}r, 2018), the only existing test allowing for large-dimensional covariates. Second, following the principle of power enhancement (Fan et al., 2015), we introduce the power-enhanced test, with an asymptotically zero component used to enhance the power to detect some extreme alternatives with many locally invalid instruments. Finally, an empirical example of the trade and economic growth nexus demonstrates the usefulness of the proposed test."}, "https://arxiv.org/abs/2207.09098": {"title": "ReBoot: Distributed statistical learning via refitting bootstrap samples", "link": "https://arxiv.org/abs/2207.09098", "description": "arXiv:2207.09098v3 Announce Type: replace \nAbstract: In this paper, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. ReBoot refits a new model to mini-batches of bootstrap samples that are continuously drawn from each of the locally fitted models. It requires only one round of communication of model parameters without much memory. Theoretically, we analyze the statistical error rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. In both cases, ReBoot provably achieves the full-sample statistical rate. In particular, we show that the systematic bias of ReBoot, the error that is independent of the number of subsamples (i.e., the number of sites), is $O(n ^ {-2})$ in GLM, where $n$ is the subsample size (the sample size of each local site). This rate is sharper than that of model parameter averaging and its variants, implying the higher tolerance of ReBoot with respect to data splits to maintain the full-sample rate. Our simulation study demonstrates the statistical advantage of ReBoot over competing methods. Finally, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification. FedReBoot exhibits substantial superiority over Federated Averaging (FedAvg) within early rounds of communication."}, "https://arxiv.org/abs/2304.02563": {"title": "The transcoding sampler for stick-breaking inferences on Dirichlet process mixtures", "link": "https://arxiv.org/abs/2304.02563", "description": "arXiv:2304.02563v2 Announce Type: replace \nAbstract: Dirichlet process mixture models suffer from slow mixing of the MCMC posterior chain produced by stick-breaking Gibbs samplers, as opposed to collapsed Gibbs samplers based on the Polya urn representation which have shorter integrated autocorrelation time (IAT).\n We study how cluster membership information is encoded under the two aforementioned samplers, and we introduce the transcoding algorithm to switch between encodings. 
We also develop the transcoding sampler, which consists of undertaking posterior partition inference with any high-efficiency sampler, such as collapsed Gibbs, and subsequently transcoding it to the stick-breaking representation via the transcoding algorithm, thereby allowing inference on all stick-breaking parameters of interest while retaining the shorter IAT of the high-efficiency sampler.\n The transcoding sampler is substantially simpler to implement than the slice sampler; it can inherit the shorter IAT of collapsed Gibbs samplers, and it can also achieve zero IAT when paired with a posterior partition sampler that is i.i.d., such as the sequential importance sampler."}, "https://arxiv.org/abs/2311.02043": {"title": "Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective", "link": "https://arxiv.org/abs/2311.02043", "description": "arXiv:2311.02043v2 Announce Type: replace \nAbstract: Quantile regression is a powerful tool for inferring how covariates affect specific percentiles of the response distribution. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or non-parametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Further, neither approach is well-suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model, we derive optimal and interpretable linear estimates and uncertainty quantification for each model-based conditional quantile. Our approach introduces a quantile-focused squared error loss, which enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, variable selection, and inference over frequentist and Bayesian competitors. We apply these tools to identify the quantile-specific impacts of social and environmental stressors on educational outcomes for a large cohort of children in North Carolina."}, "https://arxiv.org/abs/2401.00097": {"title": "Recursive identification with regularization and on-line hyperparameters estimation", "link": "https://arxiv.org/abs/2401.00097", "description": "arXiv:2401.00097v2 Announce Type: replace \nAbstract: This paper presents a regularized recursive identification algorithm with simultaneous on-line estimation of both the model parameters and the algorithm's hyperparameters. A new kernel is proposed to facilitate the algorithm development. The performance of this novel scheme is compared with that of the recursive least squares algorithm in simulation."}, "https://arxiv.org/abs/2401.02048": {"title": "Random Effect Restricted Mean Survival Time Model", "link": "https://arxiv.org/abs/2401.02048", "description": "arXiv:2401.02048v2 Announce Type: replace \nAbstract: The restricted mean survival time (RMST) model has been garnering attention as a way to provide a clinically intuitive measure: the mean survival time. RMST models, which use methods based on pseudo time-to-event values and inverse probability censoring weighting, can adjust for covariates. 
However, no approach has yet been introduced that considers random effects for clusters. In this paper, we propose a new random-effect RMST model. We present two methods of analysis that consider variable effects by i) using a generalized mixed model with pseudo-values and ii) integrating the estimated results from the inverse probability censoring weighting estimating equations for each cluster. We evaluate our proposed methods through computer simulations. In addition, we analyze the effect of a mother's age at birth on under-five deaths in India using states as clusters."}, "https://arxiv.org/abs/2202.08081": {"title": "Reasoning with fuzzy and uncertain evidence using epistemic random fuzzy sets: general framework and practical models", "link": "https://arxiv.org/abs/2202.08081", "description": "arXiv:2202.08081v4 Announce Type: replace-cross \nAbstract: We introduce a general theory of epistemic random fuzzy sets for reasoning with fuzzy or crisp evidence. This framework generalizes both the Dempster-Shafer theory of belief functions, and possibility theory. Independent epistemic random fuzzy sets are combined by the generalized product-intersection rule, which extends both Dempster's rule for combining belief functions, and the product conjunctive combination of possibility distributions. We introduce Gaussian random fuzzy numbers and their multi-dimensional extensions, Gaussian random fuzzy vectors, as practical models for quantifying uncertainty about scalar or vector quantities. Closed-form expressions for the combination, projection and vacuous extension of Gaussian random fuzzy numbers and vectors are derived."}, "https://arxiv.org/abs/2209.01328": {"title": "Optimal empirical Bayes estimation for the Poisson model via minimum-distance methods", "link": "https://arxiv.org/abs/2209.01328", "description": "arXiv:2209.01328v2 Announce Type: replace-cross \nAbstract: The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that the Robbins estimator lacks the desired smoothness and monotonicity of Bayes estimators and can be easily derailed by those data points that were rarely observed before. Based on the minimum-distance method, we propose a suite of empirical Bayes estimators, including the classical nonparametric maximum likelihood, that outperform the Robbins method in a variety of synthetic and real data sets and retain its optimality in terms of minimax regret."}, "https://arxiv.org/abs/2211.02039": {"title": "The Projected Covariance Measure for assumption-lean variable significance testing", "link": "https://arxiv.org/abs/2211.02039", "description": "arXiv:2211.02039v4 Announce Type: replace-cross \nAbstract: Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. 
In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches."}, "https://arxiv.org/abs/2405.04624": {"title": "Adaptive design of experiments methodology for noise resistance with unreplicated experiments", "link": "https://arxiv.org/abs/2405.04624", "description": "arXiv:2405.04624v1 Announce Type: new \nAbstract: A new gradient-based adaptive sampling method is proposed for design of experiments applications which balances space filling, local refinement, and error minimization objectives while reducing reliance on delicate tuning parameters. High order local maximum entropy approximants are used for metamodelling, which take advantage of boundary-corrected kernel density estimation to increase accuracy and robustness on highly clumped datasets, as well as conferring the resulting metamodel with some robustness against data noise in the common case of unreplicated experiments. Two-dimensional test cases are analyzed against full factorial and latin hypercube designs and compare favourably. The proposed method is then applied in a unique manner to the problem of adaptive spatial resolution in time-varying non-linear functions, opening up the possibility to adapt the method to solve partial differential equations."}, "https://arxiv.org/abs/2405.04769": {"title": "Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets", "link": "https://arxiv.org/abs/2405.04769", "description": "arXiv:2405.04769v1 Announce Type: new \nAbstract: Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we will study here, synthetic datasets from confidential data. Methods to generate such datasets are increasingly numerous, using varied tools including Bayesian models, deep neural networks and copulas. However, little is still known about how to properly perform statistical inference with these differentially private synthetic (DIPS) datasets. The challenge is for the analyses to take into account the variability from the synthetic data generation in addition to the usual sampling variability. A similar challenge also occurs when missing data is imputed before analysis, and statisticians have developed appropriate inference procedures for this case, which have been extended to the case of synthetic datasets for privacy. In this work, we study the applicability of these procedures, based on combining rules, to the analysis of DIPS datasets. 
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases."}, "https://arxiv.org/abs/2405.04816": {"title": "Testing the Fairness-Improvability of Algorithms", "link": "https://arxiv.org/abs/2405.04816", "description": "arXiv:2405.04816v1 Announce Type: new \nAbstract: Many algorithms have a disparate impact in that their benefits or harms fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, however, because it is not always clear whether there exists an alternative more-fair algorithm that does not compromise on other key objectives such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policy maker can formally substantiate, or refute, this necessity defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can incorporate any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and demonstrate its use empirically by evaluating a healthcare algorithm originally considered by Obermeyer et al. (2019). In this demonstration, we find strong statistically significant evidence that it is possible to reduce the algorithm's disparate impact without compromising on the accuracy of its predictions."}, "https://arxiv.org/abs/2405.04845": {"title": "Weighted Particle-Based Optimization for Efficient Generalized Posterior Calibration", "link": "https://arxiv.org/abs/2405.04845", "description": "arXiv:2405.04845v1 Announce Type: new \nAbstract: In the realm of statistical learning, the increasing volume of accessible data and increasing model complexity necessitate robust methodologies. This paper explores two branches of robust Bayesian methods in response to this trend. The first is generalized Bayesian inference, which introduces a learning rate parameter to enhance robustness against model misspecifications. The second is Gibbs posterior inference, which formulates inferential problems using generic loss functions rather than probabilistic models. In such approaches, it is necessary to calibrate the spread of the posterior distribution by selecting a learning rate parameter. The study aims to enhance the generalized posterior calibration (GPC) algorithm proposed by Syring and Martin (2019) [Biometrika, Volume 106, Issue 2, pp. 479-486]. Their algorithm chooses the learning rate to achieve the nominal frequentist coverage probability, but it is computationally intensive because it requires repeated posterior simulations for bootstrap samples. We propose a more efficient version of the GPC inspired by sequential Monte Carlo (SMC) samplers. A target distribution with a different learning rate is evaluated without posterior simulation as in the reweighting step in SMC sampling. Thus, the proposed algorithm can reach the desired value within a few iterations. This improvement substantially reduces the computational cost of the GPC. 
Its efficacy is demonstrated through synthetic and real data applications."}, "https://arxiv.org/abs/2405.04895": {"title": "On Correlation and Prediction Interval Reduction", "link": "https://arxiv.org/abs/2405.04895", "description": "arXiv:2405.04895v1 Announce Type: new \nAbstract: Pearson's correlation coefficient is a popular statistical measure to summarize the strength of association between two continuous variables. It is usually interpreted via its square as percentage of variance of one variable predicted by the other in a linear regression model. It can be generalized for multiple regression via the coefficient of determination, which is not straightforward to interpret in terms of prediction accuracy. In this paper, we propose to assess the prediction accuracy of a linear model via the prediction interval reduction (PIR) by comparing the width of the prediction interval derived from this model with the width of the prediction interval obtained without this model. At the population level, PIR is one-to-one related to the correlation and the coefficient of determination. In particular, a correlation of 0.5 corresponds to a PIR of only 13%. It is also the one's complement of the coefficient of alienation introduced at the beginning of last century. We argue that PIR is easily interpretable and useful to keep in mind how difficult it is to make accurate individual predictions, an important message in the era of precision medicine and artificial intelligence. Different estimates of PIR are compared in the context of a linear model and an extension of the PIR concept to non-linear models is outlined."}, "https://arxiv.org/abs/2405.04904": {"title": "Dependence-based fuzzy clustering of functional time series", "link": "https://arxiv.org/abs/2405.04904", "description": "arXiv:2405.04904v1 Announce Type: new \nAbstract: Time series clustering is an important data mining task with a wide variety of applications. While most methods focus on time series taking values on the real line, very few works consider functional time series. However, functional objects frequently arise in many fields, such as actuarial science, demography or finance. Functional time series are indexed collections of infinite-dimensional curves viewed as random elements taking values in a Hilbert space. In this paper, the problem of clustering functional time series is addressed. To this aim, a distance between functional time series is introduced and used to construct a clustering procedure. The metric relies on a measure of serial dependence which can be seen as a natural extension of the classical quantile autocorrelation function to the functional setting. Since the dynamics of the series may vary over time, we adopt a fuzzy approach, which enables the procedure to locate each series into several clusters with different membership degrees. The resulting algorithm can group series generated from similar stochastic processes, reaching accurate results with series coming from a broad variety of functional models and requiring minimum hyperparameter tuning. Several simulation experiments show that the method exhibits a high clustering accuracy besides being computationally efficient. 
Two interesting applications involving high-frequency financial time series and age-specific mortality improvement rates illustrate the potential of the proposed approach."}, "https://arxiv.org/abs/2405.04917": {"title": "Guiding adaptive shrinkage by co-data to improve regression-based prediction and feature selection", "link": "https://arxiv.org/abs/2405.04917", "description": "arXiv:2405.04917v1 Announce Type: new \nAbstract: The high dimensional nature of genomics data complicates feature selection, in particular in low sample size studies - not uncommon in clinical prediction settings. It is widely recognized that complementary data on the features, `co-data', may improve results. Examples are prior feature groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories. Yet, the uptake of learning methods that structurally use such co-data is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, crucial for the performance of those learners. We discuss technical aspects, but also the applicability in terms of types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating feature selection. Finally, we demonstrate the versatility of the guided shrinkage methodology by showing how to `do-it-yourself': we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving feature selection in genetics studies."}, "https://arxiv.org/abs/2405.04928": {"title": "A joint model for DHS and MICS surveys: Spatial modeling with anonymized locations", "link": "https://arxiv.org/abs/2405.04928", "description": "arXiv:2405.04928v1 Announce Type: new \nAbstract: Anonymizing the GPS locations of observations can bias a spatial model's parameter estimates and attenuate spatial predictions when improperly accounted for, and is relevant in applications from public health to paleoseismology. In this work, we demonstrate that a newly introduced method for geostatistical modeling in the presence of anonymized point locations can be extended to account for more general kinds of positional uncertainty due to location anonymization, including both jittering (a form of random perturbations of GPS coordinates) and geomasking (reporting only the name of the area containing the true GPS coordinates). We further provide a numerical integration scheme that flexibly accounts for the positional uncertainty as well as spatial and covariate information.\n We apply the method to women's secondary education completion data in the 2018 Nigeria demographic and health survey (NDHS) containing jittered point locations, and the 2016 Nigeria multiple indicator cluster survey (NMICS) containing geomasked locations. We show that accounting for the positional uncertainty in the surveys can improve predictions in terms of their continuous rank probability score."}, "https://arxiv.org/abs/2405.04973": {"title": "SVARs with breaks: Identification and inference", "link": "https://arxiv.org/abs/2405.04973", "description": "arXiv:2405.04973v1 Announce Type: new \nAbstract: In this paper we propose a class of structural vector autoregressions (SVARs) characterized by structural breaks (SVAR-WB). 
Together with standard restrictions on the parameters and on functions of them, we also consider constraints across the different regimes. Such constraints can be either (a) in the form of stability restrictions, indicating that not all the parameters or impulse responses are subject to structural changes, or (b) in terms of inequalities regarding particular characteristics of the SVAR-WB across the regimes. We show that all these kinds of restrictions provide benefits in terms of identification. We derive conditions for point and set identification of the structural parameters of the SVAR-WB, mixing equality, sign, rank and stability restrictions, as well as constraints on forecast error variances (FEVs). As point identification, when achieved, holds locally but not globally, there will be a set of isolated structural parameters that are observationally equivalent in the parametric space. In this respect, both common frequentist and Bayesian approaches produce unreliable inference, as the former focuses on just one of these observationally equivalent points, while the latter suffers from a non-vanishing sensitivity to the prior. To overcome these issues, we propose alternative approaches for estimation and inference that account for all admissible observationally equivalent structural parameters. Moreover, we develop a pure Bayesian and a robust Bayesian approach for doing inference in set-identified SVAR-WBs. Both the theory of identification and inference are illustrated through a set of examples and an empirical application on the transmission of US monetary policy over the great inflation and great moderation regimes."}, "https://arxiv.org/abs/2405.05119": {"title": "Combining Rollout Designs and Clustering for Causal Inference under Low-order Interference", "link": "https://arxiv.org/abs/2405.05119", "description": "arXiv:2405.05119v1 Announce Type: new \nAbstract: Estimating causal effects under interference is pertinent to many real-world settings. However, the true interference network may be unknown to the practitioner, precluding many existing techniques that leverage this information. A recent line of work with low-order potential outcomes models uses staggered rollout designs to obtain unbiased estimators that require no network information. However, their use of polynomial extrapolation can lead to prohibitively high variance. To address this, we propose a two-stage experimental design that restricts treatment rollout to a sub-population. We analyze the bias and variance of an interpolation-style estimator under this experimental design. Through numerical simulations, we explore the trade-off between the error attributable to the subsampling of our experimental design and the extrapolation of the estimator. Under low-order interaction models with degree greater than 1, the proposed design greatly reduces the error of the polynomial interpolation estimator, such that it outperforms baseline estimators, especially when the treatment probability is small."}, "https://arxiv.org/abs/2405.05121": {"title": "A goodness-of-fit diagnostic for count data derived from half-normal plots with a simulated envelope", "link": "https://arxiv.org/abs/2405.05121", "description": "arXiv:2405.05121v1 Announce Type: new \nAbstract: Traditional methods of model diagnostics may include a plethora of graphical techniques based on residual analysis, as well as formal tests (e.g. Shapiro-Wilk test for normality and Bartlett test for homogeneity of variance). 
In this paper we derive a new distance metric based on the half-normal plot with a simulation envelope, a graphical model evaluation method, and investigate its properties through simulation studies. The proposed metric can help to assess the fit of a given model, and also act as a model selection criterion by being comparable across models, whether based or not on a true likelihood. More specifically, it quantitatively encompasses the model evaluation principles and removes the subjective bias when closely related models are involved. We validate the technique by means of an extensive simulation study carried out using count data, and illustrate with two case studies in ecology and fisheries research."}, "https://arxiv.org/abs/2405.05139": {"title": "Multivariate group sequential tests for global summary statistics", "link": "https://arxiv.org/abs/2405.05139", "description": "arXiv:2405.05139v1 Announce Type: new \nAbstract: We describe group sequential tests which efficiently incorporate information from multiple endpoints allowing for early stopping at pre-planned interim analyses. We formulate a testing procedure where several outcomes are examined, and interim decisions are based on a global summary statistic. An error spending approach to this problem is defined which allows for unpredictable group sizes and nuisance parameters such as the correlation between endpoints. We present and compare three methods for implementation of the testing procedure including numerical integration, the Delta approximation and Monte Carlo simulation. In our evaluation, numerical integration techniques performed best for implementation with error rate calculations accurate to five decimal places. Our proposed testing method is flexible and accommodates summary statistics derived from general, non-linear functions of endpoints informed by the statistical model. Type 1 error rates are controlled, and sample size calculations can easily be performed to satisfy power requirements."}, "https://arxiv.org/abs/2405.05220": {"title": "Causal Duration Analysis with Diff-in-Diff", "link": "https://arxiv.org/abs/2405.05220", "description": "arXiv:2405.05220v1 Announce Type: new \nAbstract: In economic program evaluation, it is common to obtain panel data in which outcomes are indicators that an individual has reached an absorbing state. For example, they may indicate whether an individual has exited a period of unemployment, passed an exam, left a marriage, or had their parole revoked. The parallel trends assumption that underpins difference-in-differences generally fails in such settings. We suggest identifying conditions that are analogous to those of difference-in-differences but apply to hazard rates rather than mean outcomes. These alternative assumptions motivate estimators that retain the simplicity and transparency of standard diff-in-diff, and we suggest analogous specification tests. Our approach can be adapted to general linear restrictions between the hazard rates of different groups, motivating duration analogues of the triple differences and synthetic control methods. 
We apply our procedures to examine the impact of a policy that increased the generosity of unemployment benefits, using a cross-cohort comparison."}, "https://arxiv.org/abs/2405.04711": {"title": "Community detection in multi-layer bipartite networks", "link": "https://arxiv.org/abs/2405.04711", "description": "arXiv:2405.04711v1 Announce Type: cross \nAbstract: The problem of community detection in multi-layer undirected networks has received considerable attention in recent years. However, practical scenarios often involve multi-layer bipartite networks, where each layer consists of two distinct types of nodes. Existing community detection algorithms tailored for multi-layer undirected networks are not directly applicable to multi-layer bipartite networks. To address this challenge, this paper introduces a novel multi-layer degree-corrected stochastic co-block model specifically designed to capture the underlying community structure within multi-layer bipartite networks. Within this framework, we propose an efficient debiased spectral co-clustering algorithm for detecting nodes' communities. We establish the consistent estimation property of our proposed algorithm and demonstrate that an increased number of layers in bipartite networks improves the accuracy of community detection. Through extensive numerical experiments, we showcase the superior performance of our algorithm compared to existing methods. Additionally, we validate our algorithm by applying it to real-world multi-layer network datasets, yielding meaningful and insightful results."}, "https://arxiv.org/abs/2405.04715": {"title": "Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning", "link": "https://arxiv.org/abs/2405.04715", "description": "arXiv:2405.04715v1 Announce Type: cross \nAbstract: Statistics suffers from a fundamental problem, \"the curse of endogeneity\" -- the regression function, or more broadly the prediction risk minimizer with infinite data, may not be the target we wish to pursue. This is because when complex data are collected from multiple sources, the biases deviated from the interested (causal) association inherited in individuals or sub-populations are not expected to be canceled. Traditional remedies are of hindsight and restrictive in being tailored to prior knowledge like untestable cause-effect structures, resulting in methods that risk model misspecification and lack scalable applicability. This paper seeks to offer a purely data-driven and universally applicable method that only uses the heterogeneity of the biases in the data rather than following pre-offered commandments. Such an idea is formulated as a nonparametric invariance pursuit problem, whose goal is to unveil the invariant conditional expectation $m^\\star(x)\\equiv \\mathbb{E}[Y^{(e)}|X_{S^\\star}^{(e)}=x_{S^\\star}]$ with unknown important variable set $S^\\star$ across heterogeneous environments $e\\in \\mathcal{E}$. Under the structural causal model framework, $m^\\star$ can be interpreted as certain data-driven causality in general. The paper contributes to proposing a novel framework, called Focused Adversarial Invariance Regularization (FAIR), formulated as a single minimax optimization program that can solve the general invariance pursuit problem. As illustrated by the unified non-asymptotic analysis, our adversarial estimation framework can attain provable sample-efficient estimation akin to standard regression under a minimal identification condition for various tasks and models. 
As an application, the FAIR-NN estimator realized by two Neural Network classes is highlighted as the first approach to attain statistically efficient estimation in general nonparametric invariance learning."}, "https://arxiv.org/abs/2405.04919": {"title": "Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression", "link": "https://arxiv.org/abs/2405.04919", "description": "arXiv:2405.04919v1 Announce Type: cross \nAbstract: We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one only needs to fit $(k+1)$-NN regression once, and does not need to repeat the training-validation of $k$-NN regression for each of the training data points. Numerical experiments confirm the validity of the fast computation method."}, "https://arxiv.org/abs/1905.04028": {"title": "Demand and Welfare Analysis in Discrete Choice Models with Social Interactions", "link": "https://arxiv.org/abs/1905.04028", "description": "arXiv:1905.04028v2 Announce Type: replace \nAbstract: Many real-life settings of consumer-choice involve social interactions, causing targeted policies to have spillover-effects. This paper develops novel empirical tools for analyzing demand and welfare-effects of policy-interventions in binary choice settings with social interactions. Examples include subsidies for health-product adoption and vouchers for attending a high-achieving school. We establish the connection between econometrics of large games and Brock-Durlauf-type interaction models, under both I.I.D. and spatially correlated unobservables. We develop new convergence results for associated beliefs and estimates of preference-parameters under increasing-domain spatial asymptotics. Next, we show that even with fully parametric specifications and unique equilibrium, choice data that are sufficient for counterfactual demand-prediction under interactions are insufficient for welfare-calculations. This is because distinct underlying mechanisms producing the same interaction coefficient can imply different welfare-effects and deadweight-loss from a policy-intervention. Standard index-restrictions imply distribution-free bounds on welfare. We illustrate our results using experimental data on mosquito-net adoption in rural Kenya."}, "https://arxiv.org/abs/2106.10503": {"title": "Robust Bayesian Modeling of Counts with Zero inflation and Outliers: Theoretical Robustness and Efficient Computation", "link": "https://arxiv.org/abs/2106.10503", "description": "arXiv:2106.10503v2 Announce Type: replace \nAbstract: Count data with zero inflation and large outliers are ubiquitous in many scientific applications. However, posterior analysis under a standard statistical model, such as Poisson or negative binomial distribution, is sensitive to such contamination. This study introduces a novel framework for Bayesian modeling of counts that is robust to both zero inflation and large outliers. In doing so, we introduce a rescaled beta distribution and adopt it to absorb undesirable effects from zero and outlying counts. 
The proposed approach has two appealing features: the efficiency of the posterior computation via a custom Gibbs sampling algorithm and a theoretically guaranteed posterior robustness, where extreme outliers are automatically removed from the posterior distribution. We demonstrate the usefulness of the proposed method by applying it to trend filtering and spatial modeling using predictive Gaussian processes."}, "https://arxiv.org/abs/2208.00174": {"title": "Bump hunting through density curvature features", "link": "https://arxiv.org/abs/2208.00174", "description": "arXiv:2208.00174v3 Announce Type: replace \nAbstract: Bump hunting deals with finding in sample spaces meaningful data subsets known as bumps. These have traditionally been conceived as modal or concave regions in the graph of the underlying density function. We define an abstract bump construct based on curvature functionals of the probability density. Then, we explore several alternative characterizations involving derivatives up to second order. In particular, a suitable implementation of Good and Gaskins' original concave bumps is proposed in the multivariate case. Moreover, we bring to exploratory data analysis concepts like the mean curvature and the Laplacian that have produced good results in applied domains. Our methodology addresses the approximation of the curvature functional with a plug-in kernel density estimator. We provide theoretical results that assure the asymptotic consistency of bump boundaries in the Hausdorff distance with affordable convergence rates. We also present asymptotically valid and consistent confidence regions bounding curvature bumps. The theory is illustrated through several use cases in sports analytics with datasets from the NBA, MLB and NFL. We conclude that the different curvature instances effectively combine to generate insightful visualizations."}, "https://arxiv.org/abs/2210.02599": {"title": "The Local to Unity Dynamic Tobit Model", "link": "https://arxiv.org/abs/2210.02599", "description": "arXiv:2210.02599v3 Announce Type: replace \nAbstract: This paper considers highly persistent time series that are subject to nonlinearities in the form of censoring or an occasionally binding constraint, such as are regularly encountered in macroeconomics. A tractable candidate model for such series is the dynamic Tobit with a root local to unity. We show that this model generates a process that converges weakly to a non-standard limiting process, that is constrained (regulated) to be positive. Surprisingly, despite the presence of censoring, the OLS estimators of the model parameters are consistent. We show that this allows OLS-based inferences to be drawn on the overall persistence of the process (as measured by the sum of the autoregressive coefficients), and for the null of a unit root to be tested in the presence of censoring. Our simulations illustrate that the conventional ADF test substantially over-rejects when the data is generated by a dynamic Tobit with a unit root, whereas our proposed test is correctly sized. 
We provide an application of our methods to testing for a unit root in the Swiss franc / euro exchange rate, during a period when this was subject to an occasionally binding lower bound."}, "https://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "https://arxiv.org/abs/2304.03853", "description": "arXiv:2304.03853v5 Announce Type: replace \nAbstract: StepMix is an open-source Python package for the pseudo-likelihood estimation (one-, two- and three-step approaches) of generalized finite mixture models (latent profile and latent class analysis) with external variables (covariates and distal outcomes). In many applications in social sciences, the main objective is not only to cluster individuals into latent classes, but also to use these classes to develop more complex statistical models. These models generally divide into a measurement model that relates the latent classes to observed indicators, and a structural model that relates covariates and outcome variables to the latent classes. The measurement and structural models can be estimated jointly using the so-called one-step approach or sequentially using stepwise methods, which present significant advantages for practitioners regarding the interpretability of the estimated latent classes. In addition to the one-step approach, StepMix implements the most important stepwise estimation methods from the literature, including the bias-adjusted three-step methods with Bolk-Croon-Hagenaars and maximum likelihood corrections and the more recent two-step approach. These pseudo-likelihood estimators are presented in this paper under a unified framework as specific expectation-maximization subroutines. To facilitate and promote their adoption among the data science community, StepMix follows the object-oriented design of the scikit-learn library and provides an additional R wrapper."}, "https://arxiv.org/abs/2304.07726": {"title": "Bayesian Causal Synthesis for Meta-Inference on Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2304.07726", "description": "arXiv:2304.07726v2 Announce Type: replace \nAbstract: The estimation of heterogeneous treatment effects in the potential outcome setting is biased when there exists model misspecification or unobserved confounding. As these biases are unobservable, what model to use when remains a critical open question. In this paper, we propose a novel Bayesian methodology to mitigate misspecification and improve estimation via a synthesis of multiple causal estimates, which we call Bayesian causal synthesis. Our development is built upon identifying a synthesis function that correctly specifies the heterogeneous treatment effect under no unobserved confounding, and achieves the irreducible bias under unobserved confounding. We show that our proposed method results in consistent estimates of the heterogeneous treatment effect; either with no bias or with irreducible bias. We provide a computational algorithm for fast posterior sampling. 
Several benchmark simulations and an empirical study highlight the efficacy of the proposed approach compared to existing methodologies, providing improved point and density estimation of the heterogeneous treatment effect, even under unobserved confounding."}, "https://arxiv.org/abs/2305.06465": {"title": "Occam Factor for Random Graphs: Erd\\\"{o}s-R\\'{e}nyi, Independent Edge, and Rank-1 Stochastic Blockmodel", "link": "https://arxiv.org/abs/2305.06465", "description": "arXiv:2305.06465v4 Announce Type: replace \nAbstract: We investigate the evidence/flexibility (i.e., \"Occam\") paradigm and demonstrate the theoretical and empirical consistency of Bayesian evidence for the task of determining an appropriate generative model for network data. This model selection framework involves determining a collection of candidate models, equipping each of these models' parameters with prior distributions derived via the encompassing priors method, and computing or approximating each models' evidence. We demonstrate how such a criterion may be used to select the most suitable model among the Erd\\\"{o}s-R\\'{e}nyi (ER) model, independent edge (IE) model, and rank-1 stochastic blockmodel (SBM). The Erd\\\"{o}s-R\\'{e}nyi may be considered as being linearly nested within IE, a fact which permits exponential family results. The rank-1 SBM is not so ideal, so we propose a numerical method to approximate its evidence. We apply this paradigm to brain connectome data. Future work necessitates deriving and equipping additional candidate random graph models with appropriate priors so they may be included in the paradigm."}, "https://arxiv.org/abs/2307.01284": {"title": "Does regional variation in wage levels identify the effects of a national minimum wage?", "link": "https://arxiv.org/abs/2307.01284", "description": "arXiv:2307.01284v3 Announce Type: replace \nAbstract: I evaluate the performance of estimators that exploit regional variation in wage levels to identify the employment and wage effects of national minimum wage laws. For the \"effective minimum wage\" design, I show that the identification assumptions in Lee (1999) are difficult to satisfy in settings without regional minimum wages. For the \"fraction affected\" design, I show that economic factors such as skill-biased technical change or regional convergence may cause parallel trends violations and should be investigated using pre-treatment data. I also show that this design is subject to misspecification biases that are not easily solved with changes in specification."}, "https://arxiv.org/abs/2307.03317": {"title": "Fitted value shrinkage", "link": "https://arxiv.org/abs/2307.03317", "description": "arXiv:2307.03317v5 Announce Type: replace \nAbstract: We propose a penalized least-squares method to fit the linear regression model with fitted values that are invariant to invertible linear transformations of the design matrix. This invariance is important, for example, when practitioners have categorical predictors and interactions. Our method has the same computational cost as ridge-penalized least squares, which lacks this invariance. We derive the expected squared distance between the vector of population fitted values and its shrinkage estimator as well as the tuning parameter value that minimizes this expectation. In addition to using cross validation, we construct two estimators of this optimal tuning parameter value and study their asymptotic properties. 
Our numerical experiments and data examples show that our method performs similarly to ridge-penalized least-squares."}, "https://arxiv.org/abs/2307.05825": {"title": "Bayesian taut splines for estimating the number of modes", "link": "https://arxiv.org/abs/2307.05825", "description": "arXiv:2307.05825v3 Announce Type: replace \nAbstract: The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts."}, "https://arxiv.org/abs/2308.00836": {"title": "Differentially Private Linear Regression with Linked Data", "link": "https://arxiv.org/abs/2308.00836", "description": "arXiv:2308.00836v2 Announce Type: replace \nAbstract: There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data."}, "https://arxiv.org/abs/2309.08707": {"title": "Fixed-b Asymptotics for Panel Models with Two-Way Clustering", "link": "https://arxiv.org/abs/2309.08707", "description": "arXiv:2309.08707v3 Announce Type: replace \nAbstract: This paper studies a cluster robust variance estimator proposed by Chiang, Hansen and Sasaki (2024) for linear panels. 
First, we show algebraically that this variance estimator (CHS estimator, hereafter) is a linear combination of three common variance estimators: the one-way unit cluster estimator, the \"HAC of averages\" estimator, and the \"average of HACs\" estimator. Based on this finding, we obtain a fixed-b asymptotic result for the CHS estimator and corresponding test statistics as the cross-section and time sample sizes jointly go to infinity. Furthermore, we propose two simple bias-corrected versions of the variance estimator and derive the fixed-b limits. In a simulation study, we find that the two bias-corrected variance estimators along with fixed-b critical values provide improvements in finite sample coverage probabilities. We illustrate the impact of bias-correction and use of the fixed-b critical values on inference in an empirical example from Thompson (2011) on the relationship between industry profitability and market concentration."}, "https://arxiv.org/abs/2312.07873": {"title": "Bayesian Estimation of Propensity Scores for Integrating Multiple Cohorts with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2312.07873", "description": "arXiv:2312.07873v2 Announce Type: replace \nAbstract: Comparative meta-analyses of groups of subjects by integrating multiple observational studies rely on estimated propensity scores (PSs) to mitigate covariate imbalances. However, PS estimation grapples with the theoretical and practical challenges posed by high-dimensional covariates. Motivated by an integrative analysis of breast cancer patients across seven medical centers, this paper tackles the challenges associated with integrating multiple observational datasets. The proposed inferential technique, called Bayesian Motif Submatrices for Covariates (B-MSC), addresses the curse of dimensionality by a hybrid of Bayesian and frequentist approaches. B-MSC uses nonparametric Bayesian \"Chinese restaurant\" processes to eliminate redundancy in the high-dimensional covariates and discover latent motifs or lower-dimensional structure. With these motifs as potential predictors, standard regression techniques can be utilized to accurately infer the PSs and facilitate covariate-balanced group comparisons. Simulations and meta-analysis of the motivating cancer investigation demonstrate the efficacy of the B-MSC approach to accurately estimate the propensity scores and efficiently address covariate imbalance when integrating observational health studies with high-dimensional covariates."}, "https://arxiv.org/abs/1812.07318": {"title": "Zero-Inflated Autoregressive Conditional Duration Model for Discrete Trade Durations with Excessive Zeros", "link": "https://arxiv.org/abs/1812.07318", "description": "arXiv:1812.07318v4 Announce Type: replace-cross \nAbstract: In finance, durations between successive transactions are usually modeled by the autoregressive conditional duration model based on a continuous distribution omitting zero values. Zero or close-to-zero durations can be caused by either split transactions or independent transactions. We propose a discrete model allowing for excessive zero values based on the zero-inflated negative binomial distribution with score dynamics. This model allows to distinguish between the processes generating split and standard transactions. 
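For the zero-inflated duration abstract above, a small sketch of its basic building block: drawing discrete durations from a zero-inflated negative binomial, where a zero comes either from the inflation component (split transactions) or from the count distribution itself. The parameters are static and arbitrary assumptions; the paper's score-driven dynamics and estimation are not implemented.

```python
# Illustrative sketch (assumed parameter values, no score dynamics): draw
# discrete durations from a zero-inflated negative binomial distribution.
import numpy as np

rng = np.random.default_rng(1)

def rzinb(n, pi_zero, r, p):
    """Zero-inflated NB: with probability pi_zero return 0, else draw NB(r, p)."""
    inflate = rng.random(n) < pi_zero
    counts = rng.negative_binomial(r, p, size=n)
    return np.where(inflate, 0, counts)

durations = rzinb(10_000, pi_zero=0.15, r=2.0, p=0.3)
print(f"share of zero durations: {np.mean(durations == 0):.3f}")
```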
We use the existing theory on score models to establish the invertibility of the score filter and verify that sufficient conditions hold for the consistency and asymptotic normality of the maximum likelihood estimator of the model parameters. In an empirical study, we find that split transactions cause between 92 and 98 percent of zero and close-to-zero values. Furthermore, the loss of decimal places in the proposed approach is less severe than the incorrect treatment of zero values in continuous models."}, "https://arxiv.org/abs/2304.09010": {"title": "Causal Flow-based Variational Auto-Encoder for Disentangled Causal Representation Learning", "link": "https://arxiv.org/abs/2304.09010", "description": "arXiv:2304.09010v4 Announce Type: replace-cross \nAbstract: Disentangled representation learning aims to learn low-dimensional representations of data, where each dimension corresponds to an underlying generative factor. Currently, Variational Auto-Encoders (VAEs) are widely used for disentangled representation learning, with the majority of methods assuming independence among generative factors. However, in real-world scenarios, generative factors typically exhibit complex causal relationships. We thus design a new VAE-based framework named Disentangled Causal Variational Auto-Encoder (DCVAE), which includes a variant of autoregressive flows known as causal flows, capable of learning effective causal disentangled representations. We provide a theoretical analysis of the disentanglement identifiability of DCVAE, ensuring that our model can effectively learn causal disentangled representations. The performance of DCVAE is evaluated on both synthetic and real-world datasets, demonstrating its outstanding capability in achieving causal disentanglement and performing intervention experiments. Moreover, DCVAE exhibits remarkable performance on downstream tasks and has the potential to learn the true causal structure among factors."}, "https://arxiv.org/abs/2311.05532": {"title": "Uncertainty-Aware Bayes' Rule and Its Applications", "link": "https://arxiv.org/abs/2311.05532", "description": "arXiv:2311.05532v2 Announce Type: replace-cross \nAbstract: Bayes' rule has enabled innumerable powerful algorithms of statistical signal processing and statistical machine learning. However, when there exist model misspecifications in prior distributions and/or data distributions, the direct application of Bayes' rule is questionable. Philosophically, the key is to balance the relative importance of prior and data distributions when calculating posterior distributions: if prior (resp. data) distributions are overly conservative, we should upweight the prior belief (resp. data evidence); if prior (resp. data) distributions are overly opportunistic, we should downweight the prior belief (resp. data evidence). This paper derives a generalized Bayes' rule, called uncertainty-aware Bayes' rule, to technically realize the above philosophy, i.e., to combat the model uncertainties in prior distributions and/or data distributions. 
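The up/down-weighting philosophy described in the uncertainty-aware Bayes' rule abstract can be illustrated, under a simple conjugate-normal assumption, by a generic power posterior that raises the likelihood and the prior to user-chosen exponents. This is only a familiar stand-in for the weighting idea, not the authors' rule; all names and values below are assumptions.

```python
# Generic power-posterior illustration (not the authors' uncertainty-aware rule).
# Conjugate normal model: prior N(mu0, tau0^2), data x_1..x_n ~ N(theta, sigma^2)
# with sigma known; the "posterior" is proportional to likelihood**alpha * prior**beta.
import numpy as np

def power_posterior_normal(x, sigma, mu0, tau0, alpha=1.0, beta=1.0):
    n, xbar = len(x), float(np.mean(x))
    prec = beta / tau0**2 + alpha * n / sigma**2          # posterior precision
    mean = (beta * mu0 / tau0**2 + alpha * n * xbar / sigma**2) / prec
    return mean, np.sqrt(1.0 / prec)                      # posterior mean and sd

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=30)
print(power_posterior_normal(x, sigma=2.0, mu0=0.0, tau0=1.0))             # standard Bayes
print(power_posterior_normal(x, sigma=2.0, mu0=0.0, tau0=1.0, beta=0.2))   # downweighted prior
```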
Simulated and real-world experiments on classification and estimation showcase the superiority of the presented uncertainty-aware Bayes' rule over the conventional Bayes' rule: In particular, the uncertainty-aware Bayes classifier, the uncertainty-aware Kalman filter, the uncertainty-aware particle filter, and the uncertainty-aware interactive-multiple-model filter are suggested and validated."}, "https://arxiv.org/abs/2405.05389": {"title": "On foundation of generative statistics with F-entropy: a gradient-based approach", "link": "https://arxiv.org/abs/2405.05389", "description": "arXiv:2405.05389v1 Announce Type: new \nAbstract: This paper explores the interplay between statistics and generative artificial intelligence. Generative statistics, an integral part of the latter, aims to construct models that can {\\it generate} efficiently and meaningfully new data across the whole of the (usually high dimensional) sample space, e.g. a new photo. Within it, the gradient-based approach is a current favourite that exploits effectively, for the above purpose, the information contained in the observed sample, e.g. an old photo. However, often there are missing data in the observed sample, e.g. missing bits in the old photo. To handle this situation, we have proposed a gradient-based algorithm for generative modelling. More importantly, our paper underpins rigorously this powerful approach by introducing a new F-entropy that is related to Fisher's divergence. (The F-entropy is also of independent interest.) The underpinning has enabled the gradient-based approach to expand its scope. Possible future projects include discrete data and Bayesian variational inference."}, "https://arxiv.org/abs/2405.05403": {"title": "A fast and accurate inferential method for complex parametric models: the implicit bootstrap", "link": "https://arxiv.org/abs/2405.05403", "description": "arXiv:2405.05403v1 Announce Type: new \nAbstract: Performing inference, such as computing confidence intervals, is traditionally done, in the parametric case, by first fitting a model and then using the estimates to compute quantities derived at the asymptotic level or by means of simulations such as the ones from the family of bootstrap methods. These methods require the derivation and computation of a consistent estimator that can be very challenging to obtain when the models are complex, as is the case, for example, when the data exhibit features such as censoring or misclassification errors, or contain outliers. In this paper, we propose a simulation-based inferential method, the implicit bootstrap, that bypasses the need to compute a consistent estimator and can therefore be easily implemented. The implicit bootstrap is transformation respecting, and we show that, under conditions similar to those for the studentized bootstrap but without the need for a consistent estimator, it is first and second order accurate. Using simulation studies, we also show the coverage accuracy of the method with data settings for which traditional methods are computationally very involved and also lead to poor coverage, especially when the sample size is relatively small. 
Based on these empirical results, we also explore theoretically the case of exact inference."}, "https://arxiv.org/abs/2405.05459": {"title": "Estimation and Inference for Change Points in Functional Regression Time Series", "link": "https://arxiv.org/abs/2405.05459", "description": "arXiv:2405.05459v1 Announce Type: new \nAbstract: In this paper, we study the estimation and inference of change points under a functional linear regression model with changes in the slope function. We present a novel Functional Regression Binary Segmentation (FRBS) algorithm which is computationally efficient as well as achieving consistency in multiple change point detection. This algorithm utilizes the predictive power of piece-wise constant functional linear regression models in the reproducing kernel Hilbert space framework. We further propose a refinement step that improves the localization rate of the initial estimator output by FRBS, and derive asymptotic distributions of the refined estimators for two different regimes determined by the magnitude of a change. To facilitate the construction of confidence intervals for underlying change points based on the limiting distribution, we propose a consistent block-type long-run variance estimator. Our theoretical justifications for the proposed approach accommodate temporal dependence and heavy-tailedness in both the functional covariates and the measurement errors. Empirical effectiveness of our methodology is demonstrated through extensive simulation studies and an application to the Standard and Poor's 500 index dataset."}, "https://arxiv.org/abs/2405.05534": {"title": "Sequential Validation of Treatment Heterogeneity", "link": "https://arxiv.org/abs/2405.05534", "description": "arXiv:2405.05534v1 Announce Type: new \nAbstract: We use the martingale construction of Luedtke and van der Laan (2016) to develop tests for the presence of treatment heterogeneity. The resulting sequential validation approach can be instantiated using various validation metrics, such as BLPs, GATES, QINI curves, etc., and provides an alternative to cross-validation-like cross-fold application of these metrics."}, "https://arxiv.org/abs/2405.05638": {"title": "An Efficient Finite Difference Approximation via a Double Sample-Recycling Approach", "link": "https://arxiv.org/abs/2405.05638", "description": "arXiv:2405.05638v1 Announce Type: new \nAbstract: Estimating stochastic gradients is pivotal in fields like service systems within operations research. The classical method for this estimation is the finite difference approximation, which entails generating samples at perturbed inputs. Nonetheless, practical challenges persist in determining the perturbation and obtaining an optimal finite difference estimator in the sense of possessing the smallest mean squared error (MSE). To tackle this problem, we propose a double sample-recycling approach in this paper. Firstly, pilot samples are recycled to estimate the optimal perturbation. Secondly, recycling these pilot samples again and generating new samples at the estimated perturbation, lead to an efficient finite difference estimator. We analyze its bias, variance and MSE. Our analyses demonstrate a reduction in asymptotic variance, and in some cases, a decrease in asymptotic bias, compared to the optimal finite difference estimator. Therefore, our proposed estimator consistently coincides with, or even outperforms the optimal finite difference estimator. 
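A toy sketch of the two-stage idea in the finite-difference abstract above: pilot samples at the input gauge the simulation noise and fix the perturbation, after which a central finite difference is formed. The perturbation rule below is a generic rule of thumb, not the MSE-optimal choice derived in the paper, and the pilot-sample recycling is only indicated in comments.

```python
# Toy two-stage finite-difference gradient estimate for a noisy simulator
# (illustrative rule of thumb for the perturbation; not the paper's estimator).
import numpy as np

rng = np.random.default_rng(3)

def simulator(x):
    """Noisy evaluation of f(x) = x**2; the true gradient at x is 2*x."""
    return x**2 + 0.5 * rng.standard_normal()

def fd_gradient(x, n_pilot=50, n_new=200):
    # Stage 1: pilot samples at x estimate the simulation noise level.
    pilot = np.array([simulator(x) for _ in range(n_pilot)])
    sigma_hat = pilot.std(ddof=1)
    h = max(1e-3, sigma_hat ** (1.0 / 3.0))   # illustrative perturbation rule
    # Stage 2: central differences at the chosen perturbation.  The paper
    # additionally recycles the pilot samples here; we simply draw new ones.
    up = np.array([simulator(x + h) for _ in range(n_new)])
    down = np.array([simulator(x - h) for _ in range(n_new)])
    return (up.mean() - down.mean()) / (2.0 * h)

print(fd_gradient(1.0), "vs true gradient", 2.0)
```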
In numerical experiments, we apply the estimator in several examples, and numerical results demonstrate its robustness, as well as coincidence with the theory presented, especially in the case of small sample sizes."}, "https://arxiv.org/abs/2405.05730": {"title": "Change point localisation and inference in fragmented functional data", "link": "https://arxiv.org/abs/2405.05730", "description": "arXiv:2405.05730v1 Announce Type: new \nAbstract: We study the problem of change point localisation and inference for sequentially collected fragmented functional data, where each curve is observed only over discrete grids randomly sampled over a short fragment. The sequence of underlying covariance functions is assumed to be piecewise constant, with changes happening at unknown time points. To localise the change points, we propose a computationally efficient fragmented functional dynamic programming (FFDP) algorithm with consistent change point localisation rates. With an extra step of local refinement, we derive the limiting distributions for the refined change point estimators in two different regimes where the minimal jump size vanishes and where it remains constant as the sample size diverges. Such results are the first time seen in the fragmented functional data literature. As a byproduct of independent interest, we also present a non-asymptotic result on the estimation error of the covariance function estimators over intervals with change points inspired by Lin et al. (2021). Our result accounts for the effects of the sampling grid size within each fragment under novel identifiability conditions. Extensive numerical studies are also provided to support our theoretical results."}, "https://arxiv.org/abs/2405.05759": {"title": "Advancing Distribution Decomposition Methods Beyond Common Supports: Applications to Racial Wealth Disparities", "link": "https://arxiv.org/abs/2405.05759", "description": "arXiv:2405.05759v1 Announce Type: new \nAbstract: I generalize state-of-the-art approaches that decompose differences in the distribution of a variable of interest between two groups into a portion explained by covariates and a residual portion. The method that I propose relaxes the overlapping supports assumption, allowing the groups being compared to not necessarily share exactly the same covariate support. I illustrate my method revisiting the black-white wealth gap in the U.S. as a function of labor income and other variables. Traditionally used decomposition methods would trim (or assign zero weight to) observations that lie outside the common covariate support region. On the other hand, by allowing all observations to contribute to the existing wealth gap, I find that otherwise trimmed observations contribute from 3% to 19% to the overall wealth gap, at different portions of the wealth distribution."}, "https://arxiv.org/abs/2405.05773": {"title": "Parametric Analysis of Bivariate Current Status data with Competing risks using Frailty model", "link": "https://arxiv.org/abs/2405.05773", "description": "arXiv:2405.05773v1 Announce Type: new \nAbstract: Shared and correlated Gamma frailty models are widely used in the literature to model the association in multivariate current status data. In this paper, we have proposed two other new Gamma frailty models, namely shared cause-specific and correlated cause-specific Gamma frailty to capture association in bivariate current status data with competing risks. 
We have investigated the identifiability of the bivariate models with competing risks for each of the four frailty variables. We have considered maximum likelihood estimation of the model parameters. Thorough simulation studies have been performed to study the finite sample behaviour of the estimated parameters. Also, we have analyzed a real data set on hearing loss in two ears using Exponential type and Weibull type cause-specific baseline hazard functions with the four different Gamma frailty variables and compare the fits using AIC."}, "https://arxiv.org/abs/2405.05781": {"title": "Nonparametric estimation of a future entry time distribution given the knowledge of a past state occupation in a progressive multistate model with current status data", "link": "https://arxiv.org/abs/2405.05781", "description": "arXiv:2405.05781v1 Announce Type: new \nAbstract: Case-I interval-censored (current status) data from multistate systems are often encountered in cancer and other epidemiological studies. In this article, we focus on the problem of estimating state entry distribution and occupation probabilities, contingent on a preceding state occupation. This endeavor is particularly complex owing to the inherent challenge of the unavailability of directly observed counts of individuals at risk of transitioning from a state, due to the cross-sectional nature of the data. We propose two nonparametric approaches, one using the fractional at-risk set approach recently adopted in the right-censoring framework and the other a new estimator based on the ratio of marginal state occupation probabilities. Both estimation approaches utilize innovative applications of concepts from the competing risks paradigm. The finite-sample behavior of the proposed estimators is studied via extensive simulation studies where we show that the estimators based on severely censored current status data have good performance when compared with those based on complete data. We demonstrate the application of the two methods to analyze data from patients diagnosed with breast cancer."}, "https://arxiv.org/abs/2405.05868": {"title": "Trustworthy Dimensionality Reduction", "link": "https://arxiv.org/abs/2405.05868", "description": "arXiv:2405.05868v1 Announce Type: new \nAbstract: Different unsupervised models for dimensionality reduction like PCA, LLE, Shannon's mapping, tSNE, UMAP, etc. work on different principles, hence, they are difficult to compare on the same ground. Although they are usually good for visualisation purposes, they can produce spurious patterns that are not present in the original data, losing its trustability (or credibility). On the other hand, information about some response variable (or knowledge of class labels) allows us to do supervised dimensionality reduction such as SIR, SAVE, etc. which work to reduce the data dimension without hampering its ability to explain the particular response at hand. Therefore, the reduced dataset cannot be used to further analyze its relationship with some other kind of responses, i.e., it loses its generalizability. To make a better dimensionality reduction algorithm with a better balance between these two, we shall formally describe the mathematical model used by dimensionality reduction algorithms and provide two indices to measure these intuitive concepts such as trustability and generalizability. Then, we propose a Localized Skeletonization and Dimensionality Reduction (LSDR) algorithm which approximately achieves optimality in both these indices to some extent. 
The proposed algorithm has been compared with state-of-the-art algorithms such as tSNE and UMAP and is found to be better overall in preserving global structure while retaining useful local information as well. We also propose some of the possible extensions of LSDR which could make this algorithm universally applicable for various types of data similar to tSNE and UMAP."}, "https://arxiv.org/abs/2405.05419": {"title": "Decompounding Under General Mixing Distributions", "link": "https://arxiv.org/abs/2405.05419", "description": "arXiv:2405.05419v1 Announce Type: cross \nAbstract: This study focuses on statistical inference for compound models of the form $X=\\xi_1+\\ldots+\\xi_N$, where $N$ is a random variable denoting the count of summands, which are independent and identically distributed (i.i.d.) random variables $\\xi_1, \\xi_2, \\ldots$. The paper addresses the problem of reconstructing the distribution of $\\xi$ from observed samples of $X$'s distribution, a process referred to as decompounding, with the assumption that $N$'s distribution is known. This work diverges from the conventional scope by not limiting $N$'s distribution to the Poisson type, thus embracing a broader context. We propose a nonparametric estimate for the density of $\\xi$, derive its rates of convergence and prove that these rates are minimax optimal for suitable classes of distributions for $\\xi$ and $N$. Finally, we illustrate the numerical performance of the algorithm on simulated examples."}, "https://arxiv.org/abs/2405.05596": {"title": "Measuring Strategization in Recommendation: Users Adapt Their Behavior to Shape Future Content", "link": "https://arxiv.org/abs/2405.05596", "description": "arXiv:2405.05596v1 Announce Type: cross \nAbstract: Most modern recommendation algorithms are data-driven: they generate personalized recommendations by observing users' past behaviors. A common assumption in recommendation is that how a user interacts with a piece of content (e.g., whether they choose to \"like\" it) is a reflection of the content, but not of the algorithm that generated it. Although this assumption is convenient, it fails to capture user strategization: that users may attempt to shape their future recommendations by adapting their behavior to the recommendation algorithm. In this work, we test for user strategization by conducting a lab experiment and survey. To capture strategization, we adopt a model in which strategic users select their engagement behavior based not only on the content, but also on how their behavior affects downstream recommendations. Using a custom music player that we built, we study how users respond to different information about their recommendation algorithm as well as to different incentives about how their actions affect downstream outcomes. We find strong evidence of strategization across outcome metrics, including participants' dwell time and use of \"likes.\" For example, participants who are told that the algorithm mainly pays attention to \"likes\" and \"dislikes\" use those functions 1.9x more than participants told that the algorithm mainly pays attention to dwell time. A close analysis of participant behavior (e.g., in response to our incentive conditions) rules out experimenter demand as the main driver of these trends. Further, in our post-experiment survey, nearly half of participants self-report strategizing \"in the wild,\" with some stating that they ignore content they actually like to avoid over-recommendation of that content in the future. 
Together, our findings suggest that user strategization is common and that platforms cannot ignore the effect of their algorithms on user behavior."}, "https://arxiv.org/abs/2405.05656": {"title": "Consistent Empirical Bayes estimation of the mean of a mixing distribution without identifiability assumption", "link": "https://arxiv.org/abs/2405.05656", "description": "arXiv:2405.05656v1 Announce Type: cross \nAbstract: Consider a Non-Parametric Empirical Bayes (NPEB) setup. We observe $Y_i \\sim f(y|\\theta_i)$, $\\theta_i \\in \\Theta$ independent, where $\\theta_i \\sim G$ are independent $i=1,...,n$. The mixing distribution $G$ is unknown, $G \\in \\{G\\}$, with no parametric assumptions about the class $\\{G \\}$. The common NPEB task is to estimate $\\theta_i, \\; i=1,...,n$. Conditions that imply 'optimality' of such NPEB estimators typically require identifiability of $G$ based on $Y_1,...,Y_n$. We consider the task of estimating $E_G \\theta$. We show that 'often' consistent estimation of $E_G \\theta$ is implied without identifiability.\n We motivate the latter task, especially in setups with non-response and missing data. We demonstrate consistency in simulations."}, "https://arxiv.org/abs/2405.05969": {"title": "Learned harmonic mean estimation of the Bayesian evidence with normalizing flows", "link": "https://arxiv.org/abs/2405.05969", "description": "arXiv:2405.05969v1 Announce Type: cross \nAbstract: We present the learned harmonic mean estimator with normalizing flows - a robust, scalable and flexible estimator of the Bayesian evidence for model comparison. Since the estimator is agnostic to sampling strategy and simply requires posterior samples, it can be applied to compute the evidence using any Markov chain Monte Carlo (MCMC) sampling technique, including saved down MCMC chains, or any variational inference approach. The learned harmonic mean estimator was recently introduced, where machine learning techniques were developed to learn a suitable internal importance sampling target distribution to solve the issue of exploding variance of the original harmonic mean estimator. In this article we present the use of normalizing flows as the internal machine learning technique within the learned harmonic mean estimator. Normalizing flows can be elegantly coupled with the learned harmonic mean to provide an approach that is more robust, flexible and scalable than the machine learning models considered previously. We perform a series of numerical experiments, applying our method to benchmark problems and to a cosmological example in up to 21 dimensions. We find the learned harmonic mean estimator is in agreement with ground truth values and nested sampling estimates. The open-source harmonic Python package implementing the learned harmonic mean, now with normalizing flows included, is publicly available."}, "https://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "https://arxiv.org/abs/2212.04620", "description": "arXiv:2212.04620v3 Announce Type: replace \nAbstract: Production functions are potentially misspecified when revenue is used as a proxy for output. I formalize and strengthen this common knowledge by showing that neither the production function nor Hicks-neutral productivity can be identified with such a revenue proxy. This result obtains when relaxing the standard assumptions used in the literature to allow for imperfect competition. 
It holds for a large class of production functions, including all commonly used parametric forms. Among the prevalent approaches to address this issue, only those that impose assumptions on the underlying demand system can possibly identify the production function."}, "https://arxiv.org/abs/2305.19885": {"title": "Reliability analysis of arbitrary systems based on active learning and global sensitivity analysis", "link": "https://arxiv.org/abs/2305.19885", "description": "arXiv:2305.19885v2 Announce Type: replace \nAbstract: System reliability analysis aims at computing the probability of failure of an engineering system given a set of uncertain inputs and limit state functions. Active-learning solution schemes have been shown to be a viable tool but as of yet they are not as efficient as in the context of component reliability analysis. This is due to some peculiarities of system problems, such as the presence of multiple failure modes and their uneven contribution to failure, or the dependence on the system configuration (e.g., series or parallel). In this work, we propose a novel active learning strategy designed for solving general system reliability problems. This algorithm combines subset simulation and Kriging/PC-Kriging, and relies on an enrichment scheme tailored to specifically address the weaknesses of this class of methods. More specifically, it relies on three components: (i) a new learning function that does not require the specification of the system configuration, (ii) a density-based clustering technique that allows one to automatically detect the different failure modes, and (iii) sensitivity analysis to estimate the contribution of each limit state to system failure so as to select only the most relevant ones for enrichment. The proposed method is validated on two analytical examples and compared against results gathered in the literature. Finally, a complex engineering problem related to power transmission is solved, thereby showcasing the efficiency of the proposed method in a real-case scenario."}, "https://arxiv.org/abs/2312.06204": {"title": "Multilayer Network Regression with Eigenvector Centrality and Community Structure", "link": "https://arxiv.org/abs/2312.06204", "description": "arXiv:2312.06204v2 Announce Type: replace \nAbstract: In the analysis of complex networks, centrality measures and community structures are two important aspects. For multilayer networks, one crucial task is to integrate information across different layers, especially taking the dependence structure within and between layers into consideration. In this study, we introduce a novel two-stage regression model (CC-MNetR) that leverages the eigenvector centrality and network community structure of fourth-order tensor-like multilayer networks. In particular, we construct community-based centrality measures, which are then incorporated into the regression model. In addition, considering the noise of network data, we analyze the centrality measure with and without measurement errors respectively, and establish the consistent properties of the least squares estimates in the regression. 
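A minimal illustration of the regression-on-centrality idea from the multilayer network abstract above, using toy random layers, per-layer eigenvector centralities and ordinary least squares. It omits the paper's community-based centralities and measurement-error analysis, and all data and parameters are synthetic assumptions.

```python
# Minimal sketch: eigenvector centralities of each network layer used as
# regressors in an ordinary least-squares fit (toy data; not the CC-MNetR model).
import numpy as np
import networkx as nx

rng = np.random.default_rng(4)
n_nodes, n_layers = 60, 3

# One random graph per layer of the multilayer network.
layers = [nx.gnp_random_graph(n_nodes, 0.1, seed=s) for s in range(n_layers)]

# Node-by-layer matrix of eigenvector centralities.
C = np.column_stack([
    [nx.eigenvector_centrality_numpy(g)[i] for i in range(n_nodes)]
    for g in layers
])

# Synthetic node-level response, then OLS with an intercept.
beta_true = np.array([2.0, -1.0, 0.5])
y = C @ beta_true + 0.1 * rng.standard_normal(n_nodes)
X = np.column_stack([np.ones(n_nodes), C])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # intercept followed by the three layer-centrality coefficients
```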
Our proposed method is then applied to the World Input-Output Database (WIOD) dataset to explore how input-output network data between different countries and different industries affect the Gross Output of each industry."}, "https://arxiv.org/abs/2312.15624": {"title": "Negative Control Falsification Tests for Instrumental Variable Designs", "link": "https://arxiv.org/abs/2312.15624", "description": "arXiv:2312.15624v2 Announce Type: replace \nAbstract: We develop theoretical foundations for widely used falsification tests for instrumental variable (IV) designs. We characterize these tests as conditional independence tests between negative control variables - proxies for potential threats - and either the IV or the outcome. We find that conventional applications of these falsification tests would flag problems in exogenous IV designs, and propose simple solutions to avoid this. We also propose new falsification tests that incorporate new types of negative control variables or alternative statistical tests. Finally, we illustrate that under stronger assumptions, negative control variables can also be used for bias correction."}, "https://arxiv.org/abs/2212.09706": {"title": "Multiple testing under negative dependence", "link": "https://arxiv.org/abs/2212.09706", "description": "arXiv:2212.09706v4 Announce Type: replace-cross \nAbstract: The multiple testing literature has primarily dealt with three types of dependence assumptions between p-values: independence, positive regression dependence, and arbitrary dependence. In this paper, we provide what we believe are the first theoretical results under various notions of negative dependence (negative Gaussian dependence, negative regression dependence, negative association, negative orthant dependence and weak negative dependence). These include the Simes global null test and the Benjamini-Hochberg procedure, which are known experimentally to be anti-conservative under negative dependence. The anti-conservativeness of these procedures is bounded by factors smaller than that under arbitrary dependence (in particular, by factors independent of the number of hypotheses). We also provide new results about negatively dependent e-values, and provide several examples as to when negative dependence may arise. Our proofs are elementary and short, thus amenable to extensions."}, "https://arxiv.org/abs/2404.17735": {"title": "Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models", "link": "https://arxiv.org/abs/2404.17735", "description": "arXiv:2404.17735v2 Announce Type: replace-cross \nAbstract: Diffusion probabilistic models (DPMs) have become the state-of-the-art in high-quality image generation. However, DPMs have an arbitrary noisy latent space with no interpretable or controllable semantics. Although there has been significant research effort to improve image sample quality, there is little work on representation-controlled generation using diffusion models. Specifically, causal modeling and controllable counterfactual generation using DPMs is an underexplored area. In this work, we propose CausalDiffAE, a diffusion-based causal representation learning framework to enable counterfactual generation according to a specified causal model. Our key idea is to use an encoder to extract high-level semantically meaningful causal variables from high-dimensional data and model stochastic variation using reverse diffusion. 
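The multiple-testing abstract above studies the Simes test and the Benjamini-Hochberg procedure under negative dependence; for reference, a standard Benjamini-Hochberg step-up implementation is sketched below. The paper's negative-dependence bounds and corrections are not reproduced.

```python
# Standard Benjamini-Hochberg step-up procedure (baseline object in the abstract).
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array marking the hypotheses rejected by BH at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest index meeting its threshold
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))
```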
We propose a causal encoding mechanism that maps high-dimensional data to causally related latent factors and parameterize the causal mechanisms among latent factors using neural networks. To enforce the disentanglement of causal variables, we formulate a variational objective and leverage auxiliary label information in a prior to regularize the latent space. We propose a DDIM-based counterfactual generation procedure subject to do-interventions. Finally, to address the limited label supervision scenario, we also study the application of CausalDiffAE when a part of the training data is unlabeled, which also enables granular control over the strength of interventions in generating counterfactuals during inference. We empirically show that CausalDiffAE learns a disentangled latent space and is capable of generating high-quality counterfactual images."}, "https://arxiv.org/abs/2405.06135": {"title": "Local Longitudinal Modified Treatment Policies", "link": "https://arxiv.org/abs/2405.06135", "description": "arXiv:2405.06135v1 Announce Type: new \nAbstract: Longitudinal Modified Treatment Policies (LMTPs) provide a framework for defining a broad class of causal target parameters for continuous and categorical exposures. We propose Local LMTPs, a generalization of LMTPs to settings where the target parameter is conditional on subsets of units defined by the treatment or exposure. Such parameters have wide scientific relevance, with well-known parameters such as the Average Treatment Effect on the Treated (ATT) falling within the class. We provide a formal causal identification result that expresses the Local LMTP parameter in terms of sequential regressions, and derive the efficient influence function of the parameter which defines its semi-parametric and local asymptotic minimax efficiency bound. Efficient semi-parametric inference of Local LMTP parameters requires estimating the ratios of functions of complex conditional probabilities (or densities). We propose an estimator for Local LMTP parameters that directly estimates these required ratios via empirical loss minimization, drawing on the theory of Riesz representers. The estimator is implemented using a combination of ensemble machine learning algorithms and deep neural networks, and evaluated via simulation studies. We illustrate in simulation that estimation of the density ratios using Riesz representation might provide more stable estimators in finite samples in the presence of empirical violations of the overlap/positivity assumption."}, "https://arxiv.org/abs/2405.06156": {"title": "A Sharp Test for the Judge Leniency Design", "link": "https://arxiv.org/abs/2405.06156", "description": "arXiv:2405.06156v1 Announce Type: new \nAbstract: We propose a new specification test to assess the validity of the judge leniency design. We characterize a set of sharp testable implications, which exploit all the relevant information in the observed data distribution to detect violations of the judge leniency design assumptions. The proposed sharp test is asymptotically valid and consistent and will not make discordant recommendations. When the judge's leniency design assumptions are rejected, we propose a way to salvage the model using partial monotonicity and exclusion assumptions, under which a variant of the Local Instrumental Variable (LIV) estimand can recover the Marginal Treatment Effect. Simulation studies show our test outperforms existing non-sharp tests by significant margins. 
We apply our test to assess the validity of the judge leniency design using data from Stevenson (2018), and it rejects the validity for three crime categories: robbery, drug selling, and drug possession."}, "https://arxiv.org/abs/2405.06335": {"title": "Bayesian factor zero-inflated Poisson model for multiple grouped count data", "link": "https://arxiv.org/abs/2405.06335", "description": "arXiv:2405.06335v1 Announce Type: new \nAbstract: This paper proposes a computationally efficient Bayesian factor model for multiple grouped count data. Adopting the link function approach, the proposed model can capture the association within and between the at-risk probabilities and Poisson counts over multiple dimensions. The likelihood function for the grouped count data consists of the differences of the cumulative distribution functions evaluated at the endpoints of the groups, defining the probabilities of each data point falling in the groups. The combination of the data augmentation of underlying counts, the P\\'{o}lya-Gamma augmentation to approximate the Poisson distribution, and parameter expansion for the factor components is used to facilitate posterior computing. The efficacy of the proposed factor model is demonstrated using the simulated data and real data on the involvement of youths in the nineteen illegal activities."}, "https://arxiv.org/abs/2405.06353": {"title": "Next generation clinical trials: Seamless designs and master protocols", "link": "https://arxiv.org/abs/2405.06353", "description": "arXiv:2405.06353v1 Announce Type: new \nAbstract: Background: Drug development is often inefficient, costly and lengthy, yet it is essential for evaluating the safety and efficacy of new interventions. Compared with other disease areas, this is particularly true for Phase II / III cancer clinical trials where high attrition rates and reduced regulatory approvals are being seen. In response to these challenges, seamless clinical trials and master protocols have emerged to streamline the drug development process. Methods: Seamless clinical trials, characterized by their ability to transition seamlessly from one phase to another, can lead to accelerating the development of promising therapies while Master protocols provide a framework for investigating multiple treatment options and patient subgroups within a single trial. Results: We discuss the advantages of these methods through real trial examples and the principals that lead to their success while also acknowledging the associated regulatory considerations and challenges. Conclusion: Seamless designs and Master protocols have the potential to improve confirmatory clinical trials. In the disease area of cancer, this ultimately means that patients can receive life-saving treatments sooner."}, "https://arxiv.org/abs/2405.06366": {"title": "Accounting for selection biases in population analyses: equivalence of the in-likelihood and post-processing approaches", "link": "https://arxiv.org/abs/2405.06366", "description": "arXiv:2405.06366v1 Announce Type: new \nAbstract: In this paper I show the equivalence, under appropriate assumptions, of two alternative methods to account for the presence of selection biases (also called selection effects) in population studies: one is to include the selection effects in the likelihood directly; the other follows the procedure of first inferring the observed distribution and then removing selection effects a posteriori. 
Moreover, I investigate a potential bias allegedly induced by the latter approach: I show that this procedure, if applied under the appropriate assumptions, does not produce the aforementioned bias."}, "https://arxiv.org/abs/2405.06479": {"title": "Informativeness of Weighted Conformal Prediction", "link": "https://arxiv.org/abs/2405.06479", "description": "arXiv:2405.06479v1 Announce Type: new \nAbstract: Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, it is pointed out in this paper that the effectiveness of WCP heavily relies on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations."}, "https://arxiv.org/abs/2405.06559": {"title": "The landscapemetrics and motif packages for measuring landscape patterns and processes", "link": "https://arxiv.org/abs/2405.06559", "description": "arXiv:2405.06559v1 Announce Type: new \nAbstract: This book chapter emphasizes the significance of categorical raster data in ecological studies, specifically land use or land cover (LULC) data, and highlights the pivotal role of landscape metrics and pattern-based spatial analysis in comprehending environmental patterns and their dynamics. It explores the usage of R packages, particularly landscapemetrics and motif, for quantifying and analyzing landscape patterns using LULC data from three distinct European regions. It showcases the computation, visualization, and comparison of landscape metrics, while also addressing additional features such as patch value extraction, sub-region sampling, and moving window computation. Furthermore, the chapter delves into the intricacies of pattern-based spatial analysis, explaining how spatial signatures are computed and how the motif package facilitates comparisons and clustering of landscape patterns. The chapter concludes by discussing the potential of customization and expansion of the presented tools."}, "https://arxiv.org/abs/2405.06613": {"title": "Simultaneously detecting spatiotemporal changes with penalized Poisson regression models", "link": "https://arxiv.org/abs/2405.06613", "description": "arXiv:2405.06613v1 Announce Type: new \nAbstract: In the realm of large-scale spatiotemporal data, abrupt changes are commonly occurring across both spatial and temporal domains. This study aims to address the concurrent challenges of detecting change points and identifying spatial clusters within spatiotemporal count data. We introduce an innovative method based on the Poisson regression model, employing doubly fused penalization to unveil the underlying spatiotemporal change patterns. To efficiently estimate the model, we present an iterative shrinkage and threshold based algorithm to minimize the doubly penalized likelihood function. We establish the statistical consistency properties of the proposed estimator, confirming its reliability and accuracy. 
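For the weighted conformal prediction abstract above, a schematic sketch of the underlying WCP construction under covariate shift: a weighted quantile of calibration nonconformity scores, with the test point's weight attached to an infinite score. The likelihood-ratio weights are assumed known in this toy example, and the paper's multi-source proposals are not implemented.

```python
# Schematic weighted conformal quantile under covariate shift (framework the
# abstract builds on; not the paper's multi-source methods).
import numpy as np

def weighted_conformal_quantile(scores, weights, w_test, alpha=0.1):
    """(1 - alpha) weighted quantile of calibration scores, with the test point's
    weight attached to a +infinity score, as in standard weighted conformal prediction."""
    scores = np.append(scores, np.inf)
    weights = np.append(weights, w_test)
    weights = weights / weights.sum()
    order = np.argsort(scores)
    cumw = np.cumsum(weights[order])
    idx = np.searchsorted(cumw, 1.0 - alpha)
    return scores[order][idx]

# Toy example: absolute residuals on a calibration set, stand-in shift weights.
rng = np.random.default_rng(5)
residuals = np.abs(rng.standard_normal(500))          # nonconformity scores
lik_ratio = np.exp(0.5 * rng.standard_normal(500))    # assumed covariate-shift weights
q = weighted_conformal_quantile(residuals, lik_ratio, w_test=1.0, alpha=0.1)
print("prediction interval half-width at the test point:", q)
```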
Furthermore, we conduct extensive numerical experiments to validate our theoretical findings, thereby highlighting the superior performance of our method when compared to existing competitive approaches."}, "https://arxiv.org/abs/2405.06635": {"title": "Multivariate Interval-Valued Models in Frequentist and Bayesian Schemes", "link": "https://arxiv.org/abs/2405.06635", "description": "arXiv:2405.06635v1 Announce Type: new \nAbstract: In recent years, addressing the challenges posed by massive datasets has led researchers to explore aggregated data, particularly leveraging interval-valued data, akin to traditional symbolic data analysis. While much recent research, with the exception of Samdai et al. (2023) who focused on the bivariate case, has primarily concentrated on parameter estimation in single-variable scenarios, this paper extends such investigations to the multivariate domain for the first time. We derive maximum likelihood (ML) estimators for the parameters and establish their asymptotic distributions. Additionally, we pioneer a theoretical Bayesian framework, previously confined to the univariate setting, for multivariate data. We provide a detailed exposition of the proposed estimators and conduct comparative performance analyses. Finally, we validate the effectiveness of our estimators through simulations and real-world data analysis."}, "https://arxiv.org/abs/2405.06013": {"title": "Variational Inference for Acceleration of SN Ia Photometric Distance Estimation with BayeSN", "link": "https://arxiv.org/abs/2405.06013", "description": "arXiv:2405.06013v1 Announce Type: cross \nAbstract: Type Ia supernovae (SNe Ia) are standardizable candles whose observed light curves can be used to infer their distances, which can in turn be used in cosmological analyses. As the quantity of observed SNe Ia grows with current and upcoming surveys, increasingly scalable analyses are necessary to take full advantage of these new datasets for precise estimation of cosmological parameters. Bayesian inference methods enable fitting SN Ia light curves with robust uncertainty quantification, but traditional posterior sampling using Markov Chain Monte Carlo (MCMC) is computationally expensive. We present an implementation of variational inference (VI) to accelerate the fitting of SN Ia light curves using the BayeSN hierarchical Bayesian model for time-varying SN Ia spectral energy distributions (SEDs). We demonstrate and evaluate its performance on both simulated light curves and data from the Foundation Supernova Survey with two different forms of surrogate posterior -- a multivariate normal and a custom multivariate zero-lower-truncated normal distribution -- and compare them with the Laplace Approximation and full MCMC analysis. To validate our variational approximation, we calculate the Pareto-smoothed importance sampling (PSIS) diagnostic, and perform variational simulation-based calibration (VSBC). The VI approximation achieves similar results to MCMC but with an order-of-magnitude speedup for the inference of the photometric distance moduli. 
Overall, we show that VI is a promising method for scalable parameter inference that enables analysis of larger datasets for precision cosmology."}, "https://arxiv.org/abs/2405.06540": {"title": "Separating States in Astronomical Sources Using Hidden Markov Models: With a Case Study of Flaring and Quiescence on EV Lac", "link": "https://arxiv.org/abs/2405.06540", "description": "arXiv:2405.06540v1 Announce Type: cross \nAbstract: We present a new method to distinguish between different states (e.g., high and low, quiescent and flaring) in astronomical sources with count data. The method models the underlying physical process as latent variables following a continuous-space Markov chain that determines the expected Poisson counts in observed light curves in multiple passbands. For the underlying state process, we consider several autoregressive processes, yielding continuous-space hidden Markov models of varying complexity. Under these models, we can infer the state that the object is in at any given time. The state predictions from these models are then dichotomized with the help of a finite-mixture model to produce state classifications. We apply these techniques to X-ray data from the active dMe flare star EV Lac, splitting the data into quiescent and flaring states. We find that a first-order vector autoregressive process efficiently separates flaring from quiescence: flaring occurs over 30-40% of the observation durations, a well-defined persistent quiescent state can be identified, and the flaring state is characterized by higher temperatures and emission measures."}, "https://arxiv.org/abs/2405.06558": {"title": "Random matrix theory improved Fr\\'echet mean of symmetric positive definite matrices", "link": "https://arxiv.org/abs/2405.06558", "description": "arXiv:2405.06558v1 Announce Type: cross \nAbstract: In this study, we consider the realm of covariance matrices in machine learning, particularly focusing on computing Fr\\'echet means on the manifold of symmetric positive definite matrices, commonly referred to as Karcher or geometric means. Such means are leveraged in numerous machine-learning tasks. Relying on advanced statistical tools, we introduce a random matrix theory-based method that estimates Fr\\'echet means, which is particularly beneficial when dealing with low sample support and a high number of matrices to average. Our experimental evaluation, involving both synthetic and real-world EEG and hyperspectral datasets, shows that we largely outperform state-of-the-art methods."}, "https://arxiv.org/abs/2109.04146": {"title": "Forecasting high-dimensional functional time series with dual-factor structures", "link": "https://arxiv.org/abs/2109.04146", "description": "arXiv:2109.04146v2 Announce Type: replace \nAbstract: We propose a dual-factor model for high-dimensional functional time series (HDFTS) that considers multiple populations. The HDFTS is first decomposed into a collection of functional time series (FTS) in a lower dimension and a group of population-specific basis functions. The system of basis functions describes cross-sectional heterogeneity, while the reduced-dimension FTS retains most of the information common to multiple populations. The low-dimensional FTS is further decomposed into a product of common functional loadings and a matrix-valued time series that contains the most temporal dynamics embedded in the original HDFTS. The proposed general-form dual-factor structure is connected to several commonly used functional factor models. 
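Relating to the Fréchet-mean abstract above, a plain fixed-point iteration for the affine-invariant Karcher mean of symmetric positive definite matrices, which is the standard baseline that the paper improves on; the random-matrix-theory correction itself is not shown, and the input matrices below are synthetic.

```python
# Plain fixed-point iteration for the affine-invariant Frechet (Karcher) mean of
# SPD matrices: a standard baseline, not the RMT-improved estimator of the paper.
import numpy as np

def _spd_func(A, f):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def karcher_mean(mats, n_iter=50, tol=1e-10):
    M = np.mean(mats, axis=0)                 # initialise at the Euclidean mean
    for _ in range(n_iter):
        M_isqrt = _spd_func(M, lambda w: 1.0 / np.sqrt(w))
        M_sqrt = _spd_func(M, np.sqrt)
        # Mean of the log-maps of the data at the current iterate.
        T = np.mean([_spd_func(M_isqrt @ A @ M_isqrt, np.log) for A in mats], axis=0)
        M = M_sqrt @ _spd_func(T, np.exp) @ M_sqrt
        if np.linalg.norm(T) < tol:
            break
    return M

rng = np.random.default_rng(6)
mats = [np.cov(rng.standard_normal((200, 5)), rowvar=False) for _ in range(20)]
print(karcher_mean(mats).shape)
```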
We demonstrate the finite-sample performances of the proposed method in recovering cross-sectional basis functions and extracting common features using simulated HDFTS. An empirical study shows that the proposed model produces more accurate point and interval forecasts for subnational age-specific mortality rates in Japan. The financial benefits associated with the improved mortality forecasts are translated into a life annuity pricing scheme."}, "https://arxiv.org/abs/2303.07167": {"title": "When Respondents Don't Care Anymore: Identifying the Onset of Careless Responding", "link": "https://arxiv.org/abs/2303.07167", "description": "arXiv:2303.07167v2 Announce Type: replace \nAbstract: Questionnaires in the behavioral and organizational sciences tend to be lengthy: survey measures comprising hundreds of items are the norm rather than the exception. However, literature suggests that the longer a questionnaire takes, the higher the probability that participants lose interest and start responding carelessly. Consequently, in long surveys a large number of participants may engage in careless responding, posing a major threat to internal validity. We propose a novel method for identifying the onset of careless responding (or an absence thereof) for each participant. It is based on combined measurements of multiple dimensions in which carelessness may manifest, such as inconsistency and invariability. Since a structural break in either dimension is potentially indicative of carelessness, the proposed method searches for evidence for changepoints along the combined measurements. It is highly flexible, based on machine learning, and provides statistical guarantees on its performance. An empirical application on data from a seminal study on the incidence of careless responding reveals that the reported incidence has likely been substantially underestimated due to the presence of respondents that were careless for only parts of the questionnaire. In simulation experiments, we find that the proposed method achieves high reliability in correctly identifying carelessness onset, discriminates well between careless and attentive respondents, and captures a variety of careless response types, even when a large number of careless respondents are present. Furthermore, we provide freely available open source software to enhance accessibility and facilitate adoption by empirical researchers."}, "https://arxiv.org/abs/2101.00726": {"title": "Distributionally robust halfspace depth", "link": "https://arxiv.org/abs/2101.00726", "description": "arXiv:2101.00726v2 Announce Type: replace-cross \nAbstract: Tukey's halfspace depth can be seen as a stochastic program and as such it is not guarded against optimizer's curse, so that a limited training sample may easily result in a poor out-of-sample performance. We propose a generalized halfspace depth concept relying on the recent advances in distributionally robust optimization, where every halfspace is examined using the respective worst-case distribution in the Wasserstein ball of radius $\\delta\\geq 0$ centered at the empirical law. This new depth can be seen as a smoothed and regularized classical halfspace depth which is retrieved as $\\delta\\downarrow 0$. It inherits most of the main properties of the latter and, additionally, enjoys various new attractive features such as continuity and strict positivity beyond the convex hull of the support. We provide numerical illustrations of the new depth and its advantages, and develop some fundamental theory. 
In particular, we study the upper level sets and the median region including their breakdown properties."}, "https://arxiv.org/abs/2307.08079": {"title": "Flexible and efficient spatial extremes emulation via variational autoencoders", "link": "https://arxiv.org/abs/2307.08079", "description": "arXiv:2307.08079v3 Announce Type: replace-cross \nAbstract: Many real-world processes have complex tail dependence structures that cannot be characterized using classical Gaussian processes. More flexible spatial extremes models exhibit appealing extremal dependence properties but are often exceedingly prohibitive to fit and simulate from in high dimensions. In this paper, we aim to push the boundaries on computation and modeling of high-dimensional spatial extremes via integrating a new spatial extremes model that has flexible and non-stationary dependence properties in the encoding-decoding structure of a variational autoencoder called the XVAE. The XVAE can emulate spatial observations and produce outputs that have the same statistical properties as the inputs, especially in the tail. Our approach also provides a novel way of making fast inference with complex extreme-value processes. Through extensive simulation studies, we show that our XVAE is substantially more time-efficient than traditional Bayesian inference while outperforming many spatial extremes models with a stationary dependence structure. Lastly, we analyze a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea, which includes 30 years of daily measurements at 16703 grid cells. We demonstrate how to use XVAE to identify regions susceptible to marine heatwaves under climate change and examine the spatial and temporal variability of the extremal dependence structure."}, "https://arxiv.org/abs/2312.16214": {"title": "Stochastic Equilibrium the Lucas Critique and Keynesian Economics", "link": "https://arxiv.org/abs/2312.16214", "description": "arXiv:2312.16214v2 Announce Type: replace-cross \nAbstract: In this paper, a mathematically rigorous solution overturns existing wisdom regarding New Keynesian Dynamic Stochastic General Equilibrium. I develop a formal concept of stochastic equilibrium. I prove uniqueness and necessity, when agents are patient, with general application. Existence depends on appropriately specified eigenvalue conditions. Otherwise, no solution of any kind exists. I construct the equilibrium with Calvo pricing. I provide novel comparative statics with the non-stochastic model of mathematical significance. I uncover a bifurcation between neighbouring stochastic systems and approximations taken from the Zero Inflation Non-Stochastic Steady State (ZINSS). The correct Phillips curve agrees with the zero limit from the trend inflation framework. It contains a large lagged inflation coefficient and a small response to expected inflation. Price dispersion can be first or second order depending how shocks are scaled. The response to the output gap is always muted and is zero at standard parameters. A neutrality result is presented to explain why and align Calvo with Taylor pricing. Present and lagged demand shocks enter the Phillips curve so there is no Divine Coincidence and the system is identified from structural shocks alone. The lagged inflation slope is increasing in the inflation response, embodying substantive policy trade-offs. The Taylor principle is reversed, inactive settings are necessary, pointing towards inertial policy. 
The observational equivalence idea of the Lucas critique is disproven. The bifurcation results from the breakdown of the constraints implied by lagged nominal rigidity, associated with cross-equation cancellation possible only at ZINSS. There is a dual relationship between restrictions on the econometrician and constraints on repricing firms. Thus, if the model is correct, goodness of fit will jump."}, "https://arxiv.org/abs/2405.06763": {"title": "Post-selection inference for causal effects after causal discovery", "link": "https://arxiv.org/abs/2405.06763", "description": "arXiv:2405.06763v1 Announce Type: new \nAbstract: Algorithms for constraint-based causal discovery select graphical causal models among a space of possible candidates (e.g., all directed acyclic graphs) by executing a sequence of conditional independence tests. These may be used to inform the estimation of causal effects (e.g., average treatment effects) when there is uncertainty about which covariates ought to be adjusted for, or which variables act as confounders versus mediators. However, naively using the data twice, for model selection and estimation, would lead to invalid confidence intervals. Moreover, if the selected graph is incorrect, the inferential claims may apply to a selected functional that is distinct from the actual causal effect. We propose an approach to post-selection inference that is based on a resampling and screening procedure, which essentially performs causal discovery multiple times with randomly varying intermediate test statistics. Then, an estimate of the target causal effect and corresponding confidence sets are constructed from a union of individual graph-based estimates and intervals. We show that this construction has asymptotically correct coverage for the true causal effect parameter. Importantly, the guarantee holds for a fixed population-level effect, not a data-dependent or selection-dependent quantity. Most of our exposition focuses on the PC-algorithm for learning directed acyclic graphs and the multivariate Gaussian case for simplicity, but the approach is general and modular, so it may be used with other conditional independence based discovery algorithms and distributional families."}, "https://arxiv.org/abs/2405.06779": {"title": "Generalization Problems in Experiments Involving Multidimensional Decisions", "link": "https://arxiv.org/abs/2405.06779", "description": "arXiv:2405.06779v1 Announce Type: new \nAbstract: Can the causal effects estimated in experiments be generalized to real-world scenarios? This question lies at the heart of social science studies. External validity primarily assesses whether experimental effects persist across different settings, implicitly presuming the experiment's ecological validity -- that is, the consistency of experimental effects with their real-life counterparts. However, we argue that this presumed consistency may not always hold, especially in experiments involving multidimensional decision processes, such as conjoint experiments. We introduce a formal model to elucidate how attention and salience effects lead to three types of inconsistencies between experimental findings and real-world phenomena: amplified effect magnitude, effect sign reversal, and effect importance reversal. We derive testable hypotheses from each theoretical outcome and test these hypotheses using data from various existing conjoint experiments and our own experiments. 
Drawing on our theoretical framework, we propose several recommendations for experimental design aimed at enhancing the generalizability of survey experiment findings."}, "https://arxiv.org/abs/2405.06796": {"title": "The Multiple Change-in-Gaussian-Mean Problem", "link": "https://arxiv.org/abs/2405.06796", "description": "arXiv:2405.06796v1 Announce Type: new \nAbstract: A manuscript version of the chapter \"The Multiple Change-in-Gaussian-Mean Problem\" from the book \"Change-Point Detection and Data Segmentation\" by Fearnhead and Fryzlewicz, currently in preparation. All R code and data to accompany this chapter and the book are gradually being made available through https://github.com/pfryz/cpdds."}, "https://arxiv.org/abs/2405.06799": {"title": "Riemannian Statistics for Any Type of Data", "link": "https://arxiv.org/abs/2405.06799", "description": "arXiv:2405.06799v1 Announce Type: new \nAbstract: This paper introduces a novel approach to statistics and data analysis, departing from the conventional assumption of data residing in Euclidean space to consider a Riemannian Manifold. The challenge lies in the absence of vector space operations on such manifolds. Pennec X. et al. in their book Riemannian Geometric Statistics in Medical Image Analysis proposed analyzing data on Riemannian manifolds through geometry, this approach is effective with structured data like medical images, where the intrinsic manifold structure is apparent. Yet, its applicability to general data lacking implicit local distance notions is limited. We propose a solution to generalize Riemannian statistics for any type of data."}, "https://arxiv.org/abs/2405.06813": {"title": "A note on distance variance for categorical variables", "link": "https://arxiv.org/abs/2405.06813", "description": "arXiv:2405.06813v1 Announce Type: new \nAbstract: This study investigates the extension of distance variance, a validated spread metric for continuous and binary variables [Edelmann et al., 2020, Ann. Stat., 48(6)], to quantify the spread of general categorical variables. We provide both geometric and algebraic characterizations of distance variance, revealing its connections to some commonly used entropy measures, and the variance-covariance matrix of the one-hot encoded representation. However, we demonstrate that distance variance fails to satisfy the Schur-concavity axiom for categorical variables with more than two categories, leading to counterintuitive results. This limitation hinders its applicability as a universal measure of spread."}, "https://arxiv.org/abs/2405.06850": {"title": "Identifying Peer Effects in Networks with Unobserved Effort and Isolated Students", "link": "https://arxiv.org/abs/2405.06850", "description": "arXiv:2405.06850v1 Announce Type: new \nAbstract: Peer influence on effort devoted to some activity is often studied using proxy variables when actual effort is unobserved. For instance, in education, academic effort is often proxied by GPA. We propose an alternative approach that circumvents this approximation. Our framework distinguishes unobserved shocks to GPA that do not affect effort from preference shocks that do affect effort levels. We show that peer effects estimates obtained using our approach can differ significantly from classical estimates (where effort is approximated) if the network includes isolated students. 
Applying our approach to data on high school students in the United States, we find that peer effect estimates relying on GPA as a proxy for effort are 40% lower than those obtained using our approach."}, "https://arxiv.org/abs/2405.06866": {"title": "Dynamic Contextual Pricing with Doubly Non-Parametric Random Utility Models", "link": "https://arxiv.org/abs/2405.06866", "description": "arXiv:2405.06866v1 Announce Type: new \nAbstract: In the evolving landscape of digital commerce, adaptive dynamic pricing strategies are essential for gaining a competitive edge. This paper introduces novel {\\em doubly nonparametric random utility models} that eschew traditional parametric assumptions used in estimating consumer demand's mean utility function and noise distribution. Existing nonparametric methods like multi-scale {\\em Distributional Nearest Neighbors (DNN and TDNN)}, initially designed for offline regression, face challenges in dynamic online pricing due to design limitations, such as the indirect observability of utility-related variables and the absence of uniform convergence guarantees. We address these challenges with innovative population equations that facilitate nonparametric estimation within decision-making frameworks and establish new analytical results on the uniform convergence rates of DNN and TDNN, enhancing their applicability in dynamic environments.\n Our theoretical analysis confirms that the statistical learning rates for the mean utility function and noise distribution are minimax optimal. We also derive a regret bound that illustrates the critical interaction between model dimensionality and noise distribution smoothness, deepening our understanding of dynamic pricing under varied market conditions. These contributions offer substantial theoretical insights and practical tools for implementing effective, data-driven pricing strategies, advancing the theoretical framework of pricing models and providing robust methodologies for navigating the complexities of modern markets."}, "https://arxiv.org/abs/2405.06889": {"title": "Tuning parameter selection for the adaptive nuclear norm regularized trace regression", "link": "https://arxiv.org/abs/2405.06889", "description": "arXiv:2405.06889v1 Announce Type: new \nAbstract: Regularized models have been applied in lots of areas, with high-dimensional data sets being popular. Because the tuning parameter determines the theoretical performance and computational efficiency of the regularized models, tuning parameter selection is a basic and important issue. We consider the tuning parameter selection for adaptive nuclear norm regularized trace regression, which is achieved by the Bayesian information criterion (BIC). The proposed BIC is established with the help of an unbiased estimator of degrees of freedom. Under some regularity conditions, this BIC is proved to achieve the rank consistency of the tuning parameter selection. That is, the model solution under the selected tuning parameter converges to the true solution and has the same rank as that of the true solution in probability. 
Some numerical results are presented to evaluate the performance of the proposed BIC on tuning parameter selection."}, "https://arxiv.org/abs/2405.07026": {"title": "Selective Randomization Inference for Adaptive Experiments", "link": "https://arxiv.org/abs/2405.07026", "description": "arXiv:2405.07026v1 Announce Type: new \nAbstract: Adaptive experiments use preliminary analyses of the data to inform further course of action and are commonly used in many disciplines including medical and social sciences. Because the null hypothesis and experimental design are not pre-specified, it has long been recognized that statistical inference for adaptive experiments is not straightforward. Most existing methods only apply to specific adaptive designs and rely on strong assumptions. In this work, we propose selective randomization inference as a general framework for analyzing adaptive experiments. In a nutshell, our approach applies conditional post-selection inference to randomization tests. By using directed acyclic graphs to describe the data generating process, we derive a selective randomization p-value that controls the selective type-I error without requiring independent and identically distributed data or any other modelling assumptions. We show how rejection sampling and Markov Chain Monte Carlo can be used to compute the selective randomization p-values and construct confidence intervals for a homogeneous treatment effect. To mitigate the risk of disconnected confidence intervals, we propose the use of hold-out units. Lastly, we demonstrate our method and compare it with other randomization tests using synthetic and real-world data."}, "https://arxiv.org/abs/2405.07102": {"title": "Nested Instrumental Variables Design: Switcher Average Treatment Effect, Identification, Efficient Estimation and Generalizability", "link": "https://arxiv.org/abs/2405.07102", "description": "arXiv:2405.07102v1 Announce Type: new \nAbstract: Instrumental variables (IV) are a commonly used tool to estimate causal effects from non-randomized data. A prototype of an IV is a randomized trial with non-compliance where the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV non-parametrically identifies the average treatment effect among a non-identifiable complier subgroup, whose generalizability is often under debate. In many studies, there could exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multi-center clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates based on comparing the conditional average treatment effect among the always-compliers and that among the switchers under the nested IV framework. 
We apply the proposed framework and method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability."}, "https://arxiv.org/abs/2405.07109": {"title": "Bridging Binarization: Causal Inference with Dichotomized Continuous Treatments", "link": "https://arxiv.org/abs/2405.07109", "description": "arXiv:2405.07109v1 Announce Type: new \nAbstract: The average treatment effect (ATE) is a common parameter estimated in causal inference literature, but it is only defined for binary treatments. Thus, despite concerns raised by some researchers, many studies seeking to estimate the causal effect of a continuous treatment create a new binary treatment variable by dichotomizing the continuous values into two categories. In this paper, we affirm binarization as a statistically valid method for answering causal questions about continuous treatments by showing the equivalence between the binarized ATE and the difference in the average outcomes of two specific modified treatment policies. These policies impose cut-offs corresponding to the binarized treatment variable and assume preservation of relative self-selection. Relative self-selection is the ratio of the probability density of an individual having an exposure equal to one value of the continuous treatment variable versus another. The policies assume that, for any two values of the treatment variable with non-zero probability density after the cut-off, this ratio will remain unchanged. Through this equivalence, we clarify the assumptions underlying binarization and discuss how to properly interpret the resulting estimator. Additionally, we introduce a new target parameter that can be computed after binarization that considers the status-quo world. We argue that this parameter addresses more relevant causal questions than the traditional binarized ATE parameter. Finally, we present a simulation study to illustrate the implications of these assumptions when analyzing data and to demonstrate how to correctly implement estimators of the parameters discussed."}, "https://arxiv.org/abs/2405.07134": {"title": "On the Ollivier-Ricci curvature as fragility indicator of the stock markets", "link": "https://arxiv.org/abs/2405.07134", "description": "arXiv:2405.07134v1 Announce Type: new \nAbstract: Recently, an indicator for stock market fragility and crash size in terms of the Ollivier-Ricci curvature has been proposed. We study analytical and empirical properties of such indicator, test its elasticity with respect to different parameters and provide heuristics for the parameters involved. We show when and how the indicator accurately describes a financial crisis. We also propose an alternate method for calculating the indicator using a specific sub-graph with special curvature properties."}, "https://arxiv.org/abs/2405.07138": {"title": "Large-dimensional Robust Factor Analysis with Group Structure", "link": "https://arxiv.org/abs/2405.07138", "description": "arXiv:2405.07138v1 Announce Type: new \nAbstract: In this paper, we focus on exploiting the group structure for large-dimensional factor models, which captures the homogeneous effects of common factors on individuals within the same group. 
In view of the fact that datasets in macroeconomics and finance are typically heavy-tailed, we propose to identify the unknown group structure using the agglomerative hierarchical clustering algorithm and an information criterion with the robust two-step (RTS) estimates as initial values. The loadings and factors are then re-estimated conditional on the identified groups. Theoretically, we demonstrate the consistency of the estimators for both group membership and the number of groups determined by the information criterion. Under finite second moment condition, we provide the convergence rate for the newly estimated factor loadings with group information, which are shown to achieve efficiency gains compared to those obtained without group structure information. Numerical simulations and real data analysis demonstrate the nice finite sample performance of our proposed approach in the presence of both group structure and heavy-tailedness."}, "https://arxiv.org/abs/2405.07186": {"title": "Adaptive-TMLE for the Average Treatment Effect based on Randomized Controlled Trial Augmented with Real-World Data", "link": "https://arxiv.org/abs/2405.07186", "description": "arXiv:2405.07186v1 Announce Type: new \nAbstract: We consider the problem of estimating the average treatment effect (ATE) when both randomized control trial (RCT) data and real-world data (RWD) are available. We decompose the ATE estimand as the difference between a pooled-ATE estimand that integrates RCT and RWD and a bias estimand that captures the conditional effect of RCT enrollment on the outcome. We introduce an adaptive targeted minimum loss-based estimation (A-TMLE) framework to estimate them. We prove that the A-TMLE estimator is root-n-consistent and asymptotically normal. Moreover, in finite sample, it achieves the super-efficiency one would obtain had one known the oracle model for the conditional effect of the RCT enrollment on the outcome. Consequently, the smaller the working model of the bias induced by the RWD is, the greater our estimator's efficiency, while our estimator will always be at least as efficient as an efficient estimator that uses the RCT data only. A-TMLE outperforms existing methods in simulations by having smaller mean-squared-error and 95% confidence intervals. A-TMLE could help utilize RWD to improve the efficiency of randomized trial results without biasing the estimates of intervention effects. This approach could allow for smaller, faster trials, decreasing the time until patients can receive effective treatments."}, "https://arxiv.org/abs/2405.07292": {"title": "Kernel Three Pass Regression Filter", "link": "https://arxiv.org/abs/2405.07292", "description": "arXiv:2405.07292v1 Announce Type: new \nAbstract: We forecast a single time series using a high-dimensional set of predictors. When these predictors share common underlying dynamics, an approximate latent factor model provides a powerful characterization of their co-movements Bai(2003). These latent factors succinctly summarize the data and can also be used for prediction, alleviating the curse of dimensionality in high-dimensional prediction exercises, see Stock & Watson (2002a). However, forecasting using these latent factors suffers from two potential drawbacks. First, not all pervasive factors among the set of predictors may be relevant, and using all of them can lead to inefficient forecasts. The second shortcoming is the assumption of linear dependence of predictors on the underlying factors. 
The first issue can be addressed by using some form of supervision, which leads to the omission of irrelevant information. One example is the three-pass regression filter proposed by Kelly & Pruitt (2015). We extend their framework to cases where the form of dependence might be nonlinear by developing a new estimator, which we refer to as the Kernel Three-Pass Regression Filter (K3PRF). This alleviates the aforementioned second shortcoming. The estimator is computationally efficient and performs well empirically. The short-term performance matches or exceeds that of established models, while the long-term performance shows significant improvement."}, "https://arxiv.org/abs/2405.07294": {"title": "Factor Strength Estimation in Vector and Matrix Time Series Factor Models", "link": "https://arxiv.org/abs/2405.07294", "description": "arXiv:2405.07294v1 Announce Type: new \nAbstract: Most factor modelling research in vector or matrix-valued time series assume all factors are pervasive/strong and leave weaker factors and their corresponding series to the noise. Weaker factors can in fact be important to a group of observed variables, for instance a sector factor in a large portfolio of stocks may only affect particular sectors, but can be important both in interpretations and predictions for those stocks. While more recent factor modelling researches do consider ``local'' factors which are weak factors with sparse corresponding factor loadings, there are real data examples in the literature where factors are weak because of weak influence on most/all observed variables, so that the corresponding factor loadings are not sparse (non-local). As a first in the literature, we propose estimators of factor strengths for both local and non-local weak factors, and prove their consistency with rates of convergence spelt out for both vector and matrix-valued time series factor models. Factor strength has an important indication in what estimation procedure of factor models to follow, as well as the estimation accuracy of various estimators (Chen and Lam, 2024). Simulation results show that our estimators have good performance in recovering the true factor strengths, and an analysis on the NYC taxi traffic data indicates the existence of weak factors in the data which may not be localized."}, "https://arxiv.org/abs/2405.07397": {"title": "The Spike-and-Slab Quantile LASSO for Robust Variable Selection in Cancer Genomics Studies", "link": "https://arxiv.org/abs/2405.07397", "description": "arXiv:2405.07397v1 Announce Type: new \nAbstract: Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the non-robust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ro\\v{c}kov\\'a and George, 2018). 
Furthermore, the spike-and-slab quantile LASSO has a computational advantage to locate the posterior modes via soft-thresholding rule guided Expectation-Maximization (EM) steps in the coordinate descent framework, a phenomenon rarely observed for robust regularization with non-differentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinomas (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA)."}, "https://arxiv.org/abs/2405.07408": {"title": "Bayesian Spatially Clustered Compositional Regression: Linking intersectoral GDP contributions to Gini Coefficients", "link": "https://arxiv.org/abs/2405.07408", "description": "arXiv:2405.07408v1 Announce Type: new \nAbstract: The Gini coefficient is a universally used measurement of income inequality. Intersectoral GDP contributions reveal the economic development of different sectors of the national economy. Linking intersectoral GDP contributions to Gini coefficients will provide a better understanding of how the Gini coefficient is influenced by different industries. In this paper, a compositional regression with spatially clustered coefficients is proposed to explore heterogeneous effects over spatial locations under a nonparametric Bayesian framework. Specifically, a Markov random field constraint mixture of finite mixtures prior is designed for Bayesian log contrast regression with compositional covariates, which allows for both spatially contiguous clusters and discontinuous clusters. In addition, an efficient Markov chain Monte Carlo algorithm for posterior sampling that enables simultaneous inference on both cluster configurations and cluster-wise parameters is designed. The compelling empirical performance of the proposed method is demonstrated via extensive simulation studies and an application to 51 states of the United States from the 2019 Bureau of Economic Analysis."}, "https://arxiv.org/abs/2405.07420": {"title": "Robust Inference for High-Dimensional Panel Data Models", "link": "https://arxiv.org/abs/2405.07420", "description": "arXiv:2405.07420v1 Announce Type: new \nAbstract: In this paper, we propose a robust estimation and inferential method for high-dimensional panel data models. Specifically, (1) we investigate the case where the number of regressors can grow faster than the sample size, (2) we pay particular attention to non-Gaussian, serially and cross-sectionally correlated and heteroskedastic error processes, and (3) we develop an estimation method for the high-dimensional long-run covariance matrix using a thresholded estimator.\n Methodologically and technically, we develop two Nagaev-type concentration inequalities: one for a partial sum and the other for a quadratic form, subject to a set of easily verifiable conditions. Leveraging these two inequalities, we also derive a non-asymptotic bound for the LASSO estimator, achieve asymptotic normality via the node-wise LASSO regression, and establish a sharp convergence rate for the thresholded heteroskedasticity and autocorrelation consistent (HAC) estimator.\n Our study thus provides the relevant literature with a complete toolkit for conducting inference about the parameters of interest involved in a high-dimensional panel data framework. 
We also demonstrate the practical relevance of these theoretical results by investigating a high-dimensional panel data model with interactive fixed effects. Moreover, we conduct extensive numerical studies using simulated and real data examples."}, "https://arxiv.org/abs/2405.07504": {"title": "Hierarchical inference of evidence using posterior samples", "link": "https://arxiv.org/abs/2405.07504", "description": "arXiv:2405.07504v1 Announce Type: new \nAbstract: The Bayesian evidence, crucial ingredient for model selection, is arguably the most important quantity in Bayesian data analysis: at the same time, however, it is also one of the most difficult to compute. In this paper we present a hierarchical method that leverages on a multivariate normalised approximant for the posterior probability density to infer the evidence for a model in a hierarchical fashion using a set of posterior samples drawn using an arbitrary sampling scheme."}, "https://arxiv.org/abs/2405.07631": {"title": "Improving prediction models by incorporating external data with weights based on similarity", "link": "https://arxiv.org/abs/2405.07631", "description": "arXiv:2405.07631v1 Announce Type: new \nAbstract: In clinical settings, we often face the challenge of building prediction models based on small observational data sets. For example, such a data set might be from a medical center in a multi-center study. Differences between centers might be large, thus requiring specific models based on the data set from the target center. Still, we want to borrow information from the external centers, to deal with small sample sizes. There are approaches that either assign weights to each external data set or each external observation. To incorporate information on differences between data sets and observations, we propose an approach that combines both into weights that can be incorporated into a likelihood for fitting regression models. Specifically, we suggest weights at the data set level that incorporate information on how well the models that provide the observation weights distinguish between data sets. Technically, this takes the form of inverse probability weighting. We explore different scenarios where covariates and outcomes differ among data sets, informing our simulation design for method evaluation. The concept of effective sample size is used for understanding the effectiveness of our subgroup modeling approach. We demonstrate our approach through a clinical application, predicting applied radiotherapy doses for cancer patients. Generally, the proposed approach provides improved prediction performance when external data sets are similar. We thus provide a method for quantifying similarity of external data sets to the target data set and use this similarity to include external observations for improving performance in a target data set prediction modeling task with small data."}, "https://arxiv.org/abs/2405.07860": {"title": "Uniform Inference for Subsampled Moment Regression", "link": "https://arxiv.org/abs/2405.07860", "description": "arXiv:2405.07860v1 Announce Type: new \nAbstract: We propose a method for constructing a confidence region for the solution to a conditional moment equation. The method is built around a class of algorithms for nonparametric regression based on subsampled kernels. This class includes random forest regression. 
We bound the error in the confidence region's nominal coverage probability, under the restriction that the conditional moment equation of interest satisfies a local orthogonality condition. The method is applicable to the construction of confidence regions for conditional average treatment effects in randomized experiments, among many other similar problems encountered in applied economics and causal inference. As a by-product, we obtain several new order-explicit results on the concentration and normal approximation of high-dimensional $U$-statistics."}, "https://arxiv.org/abs/2405.07979": {"title": "Low-order outcomes and clustered designs: combining design and analysis for causal inference under network interference", "link": "https://arxiv.org/abs/2405.07979", "description": "arXiv:2405.07979v1 Announce Type: new \nAbstract: Variance reduction for causal inference in the presence of network interference is often achieved through either outcome modeling, which is typically analyzed under unit-randomized Bernoulli designs, or clustered experimental designs, which are typically analyzed without strong parametric assumptions. In this work, we study the intersection of these two approaches and consider the problem of estimation in low-order outcome models using data from a general experimental design. Our contributions are threefold. First, we present an estimator of the total treatment effect (also called the global average treatment effect) in a low-degree outcome model when the data are collected under general experimental designs, generalizing previous results for Bernoulli designs. We refer to this estimator as the pseudoinverse estimator and give bounds on its bias and variance in terms of properties of the experimental design. Second, we evaluate these bounds for the case of cluster randomized designs with both Bernoulli and complete randomization. For clustered Bernoulli randomization, we find that our estimator is always unbiased and that its variance scales like the smaller of the variance obtained from a low-order assumption and the variance obtained from cluster randomization, showing that combining these variance reduction strategies is preferable to using either individually. For clustered complete randomization, we find a notable bias-variance trade-off mediated by specific features of the clustering. Third, when choosing a clustered experimental design, our bounds can be used to select a clustering from a set of candidate clusterings. Across a range of graphs and clustering algorithms, we show that our method consistently selects clusterings that perform well on a range of response models, suggesting that our bounds are useful to practitioners."}, "https://arxiv.org/abs/2405.07985": {"title": "Improved LARS algorithm for adaptive LASSO in the linear regression model", "link": "https://arxiv.org/abs/2405.07985", "description": "arXiv:2405.07985v1 Announce Type: new \nAbstract: The adaptive LASSO has been used for consistent variable selection in place of LASSO in the linear regression model. In this article, we propose a modified LARS algorithm to combine adaptive LASSO with some biased estimators, namely the Almost Unbiased Ridge Estimator (AURE), Liu Estimator (LE), Almost Unbiased Liu Estimator (AULE), Principal Component Regression Estimator (PCRE), r-k class estimator, and r-d class estimator. 
Furthermore, we examine the performance of the proposed algorithm using a Monte Carlo simulation study and real-world examples."}, "https://arxiv.org/abs/2405.07343": {"title": "Graph neural networks for power grid operational risk assessment under evolving grid topology", "link": "https://arxiv.org/abs/2405.07343", "description": "arXiv:2405.07343v1 Announce Type: cross \nAbstract: This article investigates the ability of graph neural networks (GNNs) to identify risky conditions in a power grid over the subsequent few hours, without explicit, high-resolution information regarding future generator on/off status (grid topology) or power dispatch decisions. The GNNs are trained using supervised learning, to predict the power grid's aggregated bus-level (either zonal or system-level) or individual branch-level state under different power supply and demand conditions. The variability of the stochastic grid variables (wind/solar generation and load demand), and their statistical correlations, are rigorously considered while generating the inputs for the training data. The outputs in the training data, obtained by solving numerous mixed-integer linear programming (MILP) optimal power flow problems, correspond to system-level, zonal and transmission line-level quantities of interest (QoIs). The QoIs predicted by the GNNs are used to conduct hours-ahead, sampling-based reliability and risk assessment w.r.t. zonal and system-level (load shedding) as well as branch-level (overloading) failure events. The proposed methodology is demonstrated for three synthetic grids with sizes ranging from 118 to 2848 buses. Our results demonstrate that GNNs are capable of providing fast and accurate prediction of QoIs and can be good proxies for computationally expensive MILP algorithms. The excellent accuracy of GNN-based reliability and risk assessment suggests that GNN models can substantially improve situational awareness by quickly providing rigorous reliability and risk estimates."}, "https://arxiv.org/abs/2405.07359": {"title": "Forecasting with an N-dimensional Langevin Equation and a Neural-Ordinary Differential Equation", "link": "https://arxiv.org/abs/2405.07359", "description": "arXiv:2405.07359v1 Announce Type: cross \nAbstract: Accurate prediction of electricity day-ahead prices is essential in competitive electricity markets. Although stationary electricity-price forecasting techniques have received considerable attention, research on non-stationary methods is comparatively scarce, despite the common prevalence of non-stationary features in electricity markets. Specifically, existing non-stationary techniques will often aim to address individual non-stationary features in isolation, leaving aside the exploration of concurrent multiple non-stationary effects. Our overarching objective here is the formulation of a framework to systematically model and forecast non-stationary electricity-price time series, encompassing the broader scope of non-stationary behavior. For this purpose we develop a data-driven model that combines an N-dimensional Langevin equation (LE) with a neural-ordinary differential equation (NODE). The LE captures fine-grained details of the electricity-price behavior in stationary regimes but is inadequate for non-stationary conditions. To overcome this inherent limitation, we adopt a NODE approach to learn, and at the same time predict, the difference between the actual electricity-price time series and the simulated price trajectories generated by the LE. 
By learning this difference, the NODE reconstructs the non-stationary components of the time series that the LE is not able to capture. We exemplify the effectiveness of our framework using the Spanish electricity day-ahead market as a prototypical case study. Our findings reveal that the NODE nicely complements the LE, providing a comprehensive strategy to tackle both stationary and non-stationary electricity-price behavior. The framework's dependability and robustness is demonstrated through different non-stationary scenarios by comparing it against a range of basic naive methods."}, "https://arxiv.org/abs/2405.07552": {"title": "Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery", "link": "https://arxiv.org/abs/2405.07552", "description": "arXiv:2405.07552v1 Announce Type: cross \nAbstract: In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative tool to the least squares regression for robustness against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses big challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into the least-squares optimization. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without the restrictive independent assumption between the error term and covariates. An efficient algorithm is developed, which enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2405.07836": {"title": "Forecasting with Hyper-Trees", "link": "https://arxiv.org/abs/2405.07836", "description": "arXiv:2405.07836v1 Announce Type: cross \nAbstract: This paper introduces the concept of Hyper-Trees and offers a new direction in applying tree-based models to time series data. Unlike conventional applications of decision trees that forecast time series directly, Hyper-Trees are designed to learn the parameters of a target time series model. Our framework leverages the gradient-based nature of boosted trees, which allows us to extend the concept of Hyper-Networks to Hyper-Trees and to induce a time-series inductive bias to tree models. By relating the parameters of a target time series model to features, Hyper-Trees address the challenge of parameter non-stationarity and enable tree-based forecasts to extend beyond their initial training range. With our research, we aim to explore the effectiveness of Hyper-Trees across various forecasting scenarios and to expand the application of gradient boosted decision trees past their conventional use in time series forecasting."}, "https://arxiv.org/abs/2405.07910": {"title": "A Unification of Exchangeability and Continuous Exposure and Confounder Measurement Errors: Probabilistic Exchangeability", "link": "https://arxiv.org/abs/2405.07910", "description": "arXiv:2405.07910v1 Announce Type: cross \nAbstract: Exchangeability concerning a continuous exposure, X, implies no confounding bias when identifying average exposure effects of X, AEE(X). When X is measured with error (Xep), two challenges arise in identifying AEE(X). 
Firstly, exchangeability regarding Xep does not equal exchangeability regarding X. Secondly, the necessity of the non-differential error assumption (NDEA), overly stringent in practice, remains uncertain. To address them, this article proposes unifying exchangeability and exposure and confounder measurement errors with three novel concepts. The first, Probabilistic Exchangeability (PE), states that the outcomes of those with Xep=e are probabilistically exchangeable with the outcomes of those truly exposed to X=eT. The relationship between AEE(Xep) and AEE(X) in risk difference and ratio scales is mathematically expressed as a probabilistic certainty, termed exchangeability probability (Pe). Squared Pe (Pe.sq) quantifies the extent to which AEE(Xep) differs from AEE(X) due to exposure measurement error not akin to confounding mechanisms. In realistic settings, the coefficient of determination (R.sq) in the regression of X against Xep may be sufficient to measure Pe.sq. The second concept, Emergent Pseudo Confounding (EPC), describes the bias introduced by exposure measurement error, akin to confounding mechanisms. PE can hold when EPC is controlled for, which is weaker than NDEA. The third, Emergent Confounding, describes when bias due to confounder measurement error arises. Adjustment for E(P)C can be performed like confounding adjustment to ensure PE. This paper provides justification for using AEE(Xep) and maximum insight into potential divergence of AEE(Xep) from AEE(X) and its measurement. Differential errors do not necessarily compromise causal inference."}, "https://arxiv.org/abs/2405.07971": {"title": "Sensitivity Analysis for Active Sampling, with Applications to the Simulation of Analog Circuits", "link": "https://arxiv.org/abs/2405.07971", "description": "arXiv:2405.07971v1 Announce Type: cross \nAbstract: We propose an active sampling flow, with the use-case of simulating the impact of combined variations on analog circuits. In such a context, given the large number of parameters, it is difficult to fit a surrogate model and to efficiently explore the space of design features.\n By combining a drastic dimension reduction using sensitivity analysis and Bayesian surrogate modeling, we obtain a flexible active sampling flow. On synthetic and real datasets, this flow outperforms the usual Monte-Carlo sampling which often forms the foundation of design space exploration."}, "https://arxiv.org/abs/2008.13087": {"title": "Efficient Nested Simulation Experiment Design via the Likelihood Ratio Method", "link": "https://arxiv.org/abs/2008.13087", "description": "arXiv:2008.13087v3 Announce Type: replace \nAbstract: In the nested simulation literature, a common assumption is that the experimenter can choose the number of outer scenarios to sample. This paper considers the case when the experimenter is given a fixed set of outer scenarios from an external entity. We propose a nested simulation experiment design that pools inner replications from one scenario to estimate another scenario's conditional mean via the likelihood ratio method. Given the outer scenarios, we decide how many inner replications to run at each outer scenario as well as how to pool the inner replications by solving a bi-level optimization problem that minimizes the total simulation effort. We provide asymptotic analyses on the convergence rates of the performance measure estimators computed from the optimized experiment design. 
Under some assumptions, the optimized design achieves $\\cO(\\Gamma^{-1})$ mean squared error of the estimators given simulation budget $\\Gamma$. Numerical experiments demonstrate that our design outperforms a state-of-the-art design that pools replications via regression."}, "https://arxiv.org/abs/2102.10778": {"title": "Interactive identification of individuals with positive treatment effect while controlling false discoveries", "link": "https://arxiv.org/abs/2102.10778", "description": "arXiv:2102.10778v3 Announce Type: replace \nAbstract: Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which subjects have a positive treatment effect? While subgroup analysis has received attention, claims about individual participants are much more challenging. We frame the problem in terms of multiple hypothesis testing: each individual has a null hypothesis (stating that the potential outcomes are equal, for example) and we aim to identify those for whom the null is false (the treatment potential outcome stochastically dominates the control one, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction -- a human data scientist (or a computer program) may adaptively guide the algorithm in a data-dependent manner to gain power. We show how to extend the methods to observational settings and achieve a type of doubly-robust FDR control. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power."}, "https://arxiv.org/abs/2210.14205": {"title": "Unit Averaging for Heterogeneous Panels", "link": "https://arxiv.org/abs/2210.14205", "description": "arXiv:2210.14205v3 Announce Type: replace \nAbstract: In this work we introduce a unit averaging procedure to efficiently recover unit-specific parameters in a heterogeneous panel model. The procedure consists in estimating the parameter of a given unit using a weighted average of all the unit-specific parameter estimators in the panel. The weights of the average are determined by minimizing an MSE criterion we derive. We analyze the properties of the resulting minimum MSE unit averaging estimator in a local heterogeneity framework inspired by the literature on frequentist model averaging, and we derive the local asymptotic distribution of the estimator and the corresponding weights. The benefits of the procedure are showcased with an application to forecasting unemployment rates for a panel of German regions."}, "https://arxiv.org/abs/2301.11472": {"title": "Fast Bayesian inference for spatial mean-parameterized Conway-Maxwell-Poisson models", "link": "https://arxiv.org/abs/2301.11472", "description": "arXiv:2301.11472v4 Announce Type: replace \nAbstract: Count data with complex features arise in many disciplines, including ecology, agriculture, criminology, medicine, and public health. Zero inflation, spatial dependence, and non-equidispersion are common features in count data. There are two classes of models that allow for these features -- the mode-parameterized Conway--Maxwell--Poisson (COMP) distribution and the generalized Poisson model. 
However both require the use of either constraints on the parameter space or a parameterization that leads to challenges in interpretability. We propose a spatial mean-parameterized COMP model that retains the flexibility of these models while resolving the above issues. We use a Bayesian spatial filtering approach in order to efficiently handle high-dimensional spatial data and we use reversible-jump MCMC to automatically choose the basis vectors for spatial filtering. The COMP distribution poses two additional computational challenges -- an intractable normalizing function in the likelihood and no closed-form expression for the mean. We propose a fast computational approach that addresses these challenges by, respectively, introducing an efficient auxiliary variable algorithm and pre-computing key approximations for fast likelihood evaluation. We illustrate the application of our methodology to simulated and real datasets, including Texas HPV-cancer data and US vaccine refusal data."}, "https://arxiv.org/abs/2309.04793": {"title": "Interpreting TSLS Estimators in Information Provision Experiments", "link": "https://arxiv.org/abs/2309.04793", "description": "arXiv:2309.04793v3 Announce Type: replace \nAbstract: To estimate the causal effects of beliefs on actions, researchers often conduct information provision experiments. We consider the causal interpretation of two-stage least squares (TSLS) estimators in these experiments. In particular, we characterize common TSLS estimators as weighted averages of causal effects, and interpret these weights under general belief updating conditions that nest parametric models from the literature. Our framework accommodates TSLS estimators for both active and passive control designs. Notably, we find that some passive control estimators allow for negative weights, which compromises their causal interpretation. We give practical guidance on such issues, and illustrate our results in two empirical applications."}, "https://arxiv.org/abs/2311.08691": {"title": "On Doubly Robust Estimation with Nonignorable Missing Data Using Instrumental Variables", "link": "https://arxiv.org/abs/2311.08691", "description": "arXiv:2311.08691v2 Announce Type: replace \nAbstract: Suppose we are interested in the mean of an outcome that is subject to nonignorable nonresponse. This paper develops new semiparametric estimation methods with instrumental variables which affect nonresponse, but not the outcome. The proposed estimators remain consistent and asymptotically normal even under partial model misspecifications for two variation independent nuisance components. We evaluate the performance of the proposed estimators via a simulation study, and apply them in adjusting for missing data induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer experience as an instrumental variable."}, "https://arxiv.org/abs/2312.05802": {"title": "Enhancing Scalability in Bayesian Nonparametric Factor Analysis of Spatiotemporal Data", "link": "https://arxiv.org/abs/2312.05802", "description": "arXiv:2312.05802v3 Announce Type: replace \nAbstract: This manuscript puts forward novel practicable spatiotemporal Bayesian factor analysis frameworks computationally feasible for moderate to large data. 
Our models exhibit significantly enhanced computational scalability and storage efficiency, deliver high overall modeling performances, and possess powerful inferential capabilities for adequately predicting outcomes at future time points or new spatial locations and satisfactorily clustering spatial locations into regions with similar temporal trajectories, a frequently encountered crucial task. On top of a baseline separable factor model with temporally dependent latent factors and spatially dependent factor loadings under a probit stick-breaking process (PSBP) prior, we integrate a new slice sampling algorithm that permits unknown, varying numbers of spatial mixture components across all factors and guarantees them to be non-increasing through the MCMC iterations, thus considerably enhancing model flexibility, efficiency, and scalability. We further introduce a novel spatial latent nearest-neighbor Gaussian process (NNGP) prior and new sequential updating algorithms for the spatially varying latent variables in the PSBP prior, thereby attaining high spatial scalability. The markedly accelerated posterior sampling and spatial prediction as well as the great modeling and inferential performances of our models are substantiated by our simulation experiments."}, "https://arxiv.org/abs/2401.03881": {"title": "Density regression via Dirichlet process mixtures of normal structured additive regression models", "link": "https://arxiv.org/abs/2401.03881", "description": "arXiv:2401.03881v2 Announce Type: replace \nAbstract: Within Bayesian nonparametrics, dependent Dirichlet process mixture models provide a highly flexible approach for conducting inference about the conditional density function. However, several formulations of this class make either rather restrictive modelling assumptions or involve intricate algorithms for posterior inference, thus preventing their widespread use. In response to these challenges, we present a flexible, versatile, and computationally tractable model for density regression based on a single-weights dependent Dirichlet process mixture of normal distributions model for univariate continuous responses. We assume an additive structure for the mean of each mixture component and incorporate the effects of continuous covariates through smooth nonlinear functions. The key components of our modelling approach are penalised B-splines and their bivariate tensor product extension. Our proposed method also seamlessly accommodates parametric effects of categorical covariates, linear effects of continuous covariates, interactions between categorical and/or continuous covariates, varying coefficient terms, and random effects, which is why we refer to our model as a Dirichlet process mixture of normal structured additive regression models. A noteworthy feature of our method is its efficiency in posterior simulation through Gibbs sampling, as closed-form full conditional distributions for all model parameters are available. Results from a simulation study demonstrate that our approach successfully recovers true conditional densities and other regression functionals in various challenging scenarios. Applications to a toxicology, disease diagnosis, and agricultural study are provided and further underpin the broad applicability of our modelling framework. 
An R package, DDPstar, implementing the proposed method is publicly available at https://bitbucket.org/mxrodriguez/ddpstar."}, "https://arxiv.org/abs/2401.07018": {"title": "Graphical models for cardinal paired comparisons data", "link": "https://arxiv.org/abs/2401.07018", "description": "arXiv:2401.07018v2 Announce Type: replace \nAbstract: Graphical models for cardinal paired comparison data with and without covariates are rigorously analyzed. Novel, graph-based, necessary and sufficient conditions which guarantee strong consistency, asymptotic normality, and the exponential convergence of the estimated ranks are emphasized. A complete theory for models with covariates is laid out. In particular, conditions under which covariates can be safely omitted from the model are provided. The methodology is employed in the analysis of both finite and infinite sets of ranked items, specifically in the case of large sparse comparison graphs. The proposed methods are explored by simulation and applied to the ranking of teams in the National Basketball Association (NBA)."}, "https://arxiv.org/abs/2203.02605": {"title": "Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions", "link": "https://arxiv.org/abs/2203.02605", "description": "arXiv:2203.02605v3 Announce Type: replace-cross \nAbstract: In recent years, reinforcement learning (RL) has acquired a prominent position in health-related sequential decision-making problems, gaining traction as a valuable tool for delivering adaptive interventions (AIs). However, in part due to a poor synergy between the methodological and the applied communities, its real-life application is still limited and its potential is still to be realized. To address this gap, our work provides the first unified technical survey on RL methods, complemented with case studies, for constructing various types of AIs in healthcare. In particular, using the common methodological umbrella of RL, we bridge two seemingly different AI domains, dynamic treatment regimes and just-in-time adaptive interventions in mobile health, highlighting similarities and differences between them and discussing the implications of using RL. Open problems and considerations for future research directions are outlined. Finally, we leverage our experience in designing case studies in both areas to showcase the significant collaborative opportunities between statistical, RL, and healthcare researchers in advancing AIs."}, "https://arxiv.org/abs/2302.08854": {"title": "Post Reinforcement Learning Inference", "link": "https://arxiv.org/abs/2302.08854", "description": "arXiv:2302.08854v3 Announce Type: replace-cross \nAbstract: We consider estimation and inference using data collected from reinforcement learning algorithms. These algorithms, characterized by their adaptive experimentation, interact with individual units over multiple stages, dynamically adjusting their strategies based on previous interactions. Our goal is to evaluate a counterfactual policy post-data collection and estimate structural parameters, like dynamic treatment effects, which can be used for credit assignment and determining the effect of earlier actions on final outcomes. Such parameters of interest can be framed as solutions to moment equations, but not minimizers of a population loss function, leading to Z-estimation approaches for static data. 
However, in the adaptive data collection environment of reinforcement learning, where algorithms deploy nonstationary behavior policies, standard estimators do not achieve asymptotic normality due to the fluctuating variance. We propose a weighted Z-estimation approach with carefully designed adaptive weights to stabilize the time-varying estimation variance. We identify proper weighting schemes to restore the consistency and asymptotic normality of the weighted Z-estimators for target parameters, which allows for hypothesis testing and constructing uniform confidence regions. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation."}, "https://arxiv.org/abs/2405.08177": {"title": "Parameter identifiability, parameter estimation and model prediction for differential equation models", "link": "https://arxiv.org/abs/2405.08177", "description": "arXiv:2405.08177v1 Announce Type: new \nAbstract: Interpreting data with mathematical models is an important aspect of real-world applied mathematical modeling. Very often we are interested to understand the extent to which a particular data set informs and constrains model parameters. This question is closely related to the concept of parameter identifiability, and in this article we present a series of computational exercises to introduce tools that can be used to assess parameter identifiability, estimate parameters and generate model predictions. Taking a likelihood-based approach, we show that very similar ideas and algorithms can be used to deal with a range of different mathematical modelling frameworks. The exercises and results presented in this article are supported by a suite of open access codes that can be accessed on GitHub."}, "https://arxiv.org/abs/2405.08180": {"title": "An adaptive enrichment design using Bayesian model averaging for selection and threshold-identification of tailoring variables", "link": "https://arxiv.org/abs/2405.08180", "description": "arXiv:2405.08180v1 Announce Type: new \nAbstract: Precision medicine stands as a transformative approach in healthcare, offering tailored treatments that can enhance patient outcomes and reduce healthcare costs. As understanding of complex disease improves, clinical trials are being designed to detect subgroups of patients with enhanced treatment effects. Biomarker-driven adaptive enrichment designs, which enroll a general population initially and later restrict accrual to treatment-sensitive patients, are gaining popularity. Current practice often assumes either pre-trial knowledge of biomarkers defining treatment-sensitive subpopulations or a simple, linear relationship between continuous markers and treatment effectiveness. Motivated by a trial studying rheumatoid arthritis treatment, we propose a Bayesian adaptive enrichment design which identifies important tailoring variables out of a larger set of candidate biomarkers. Our proposed design is equipped with a flexible modelling framework where the effects of continuous biomarkers are introduced using free knot B-splines. The parameters of interest are then estimated by marginalizing over the space of all possible variable combinations using Bayesian model averaging. At interim analyses, we assess whether a biomarker-defined subgroup has enhanced or reduced treatment effects, allowing for early termination due to efficacy or futility and restricting future enrollment to treatment-sensitive patients. 
We consider pre-categorized and continuous biomarkers, the latter of which may have complex, nonlinear relationships to the outcome and treatment effect. Using simulations, we derive the operating characteristics of our design and compare its performance to two existing approaches."}, "https://arxiv.org/abs/2405.08222": {"title": "Random Utility Models with Skewed Random Components: the Smallest versus Largest Extreme Value Distribution", "link": "https://arxiv.org/abs/2405.08222", "description": "arXiv:2405.08222v1 Announce Type: new \nAbstract: At the core of most random utility models (RUMs) is an individual agent with a random utility component following a largest extreme value Type I (LEVI) distribution. What if, instead, the random component follows its mirror image -- the smallest extreme value Type I (SEVI) distribution? Differences between these specifications, closely tied to the random component's skewness, can be quite profound. For the same preference parameters, the two RUMs, equivalent with only two choice alternatives, diverge progressively as the number of alternatives increases, resulting in substantially different estimates and predictions for key measures, such as elasticities and market shares.\n The LEVI model imposes the well-known independence-of-irrelevant-alternatives property, while SEVI does not. Instead, the SEVI choice probability for a particular option involves enumerating all subsets that contain this option. The SEVI model, though more complex to estimate, is shown to have computationally tractable closed-form choice probabilities. Much of the paper delves into explicating the properties of the SEVI model and exploring implications of the random component's skewness.\n Conceptually, the difference between the LEVI and SEVI models centers on whether information, known only to the agent, is more likely to increase or decrease the systematic utility parameterized using observed attributes. LEVI does the former; SEVI the latter. An immediate implication is that if choice is characterized by SEVI random components, then the observed choice is more likely to correspond to the systematic-utility-maximizing choice than if characterized by LEVI. Examining standard empirical examples from different applied areas, we find that the SEVI model outperforms the LEVI model, suggesting the relevance of its inclusion in applied researchers' toolkits."}, "https://arxiv.org/abs/2405.08284": {"title": "Predicting NVIDIA's Next-Day Stock Price: A Comparative Analysis of LSTM, MLP, ARIMA, and ARIMA-GARCH Models", "link": "https://arxiv.org/abs/2405.08284", "description": "arXiv:2405.08284v1 Announce Type: new \nAbstract: Forecasting stock prices remains a considerable challenge in financial markets, bearing significant implications for investors, traders, and financial institutions. Amid the ongoing AI revolution, NVIDIA has emerged as a key player driving innovation across various sectors. Given its prominence, we chose NVIDIA as the subject of our study."}, "https://arxiv.org/abs/2405.08307": {"title": "Sequential Maximal Updated Density Parameter Estimation for Dynamical Systems with Parameter Drift", "link": "https://arxiv.org/abs/2405.08307", "description": "arXiv:2405.08307v1 Announce Type: new \nAbstract: We present a novel method for generating sequential parameter estimates and quantifying epistemic uncertainty in dynamical systems within a data-consistent (DC) framework. 
The DC framework differs from traditional Bayesian approaches due to the incorporation of the push-forward of an initial density, which performs selective regularization in parameter directions not informed by the data in the resulting updated density. This extends a previous study that included the linear Gaussian theory within the DC framework and introduced the maximal updated density (MUD) estimate as an alternative to both least squares and maximum a posteriori (MAP) estimates. In this work, we introduce algorithms for operational settings of MUD estimation in real or near-real time where spatio-temporal datasets arrive in packets to provide updated estimates of parameters and identify potential parameter drift. Computational diagnostics within the DC framework prove critical for evaluating (1) the quality of the DC update and MUD estimate and (2) the detection of parameter value drift. The algorithms are applied to estimate (1) wind drag parameters in a high-fidelity storm surge model, (2) the thermal diffusivity field for a heat conductivity problem, and (3) changing infection and incubation rates of an epidemiological model."}, "https://arxiv.org/abs/2405.08525": {"title": "Doubly-robust inference and optimality in structure-agnostic models with smoothness", "link": "https://arxiv.org/abs/2405.08525", "description": "arXiv:2405.08525v1 Announce Type: new \nAbstract: We study the problem of constructing an estimator of the average treatment effect (ATE) that exhibits doubly-robust asymptotic linearity (DRAL). This is a stronger requirement than doubly-robust consistency. A DRAL estimator can yield asymptotically valid Wald-type confidence intervals even when the propensity score or the outcome model is inconsistently estimated. On the contrary, the celebrated doubly-robust, augmented-IPW (AIPW) estimator generally requires consistent estimation of both nuisance functions for standard root-n inference. We make three main contributions. First, we propose a new hybrid class of distributions that consists of the structure-agnostic class introduced in Balakrishnan et al. (2023) with additional smoothness constraints. While DRAL is generally not possible in the pure structure-agnostic class, we show that it can be attained in the new hybrid one. Second, we calculate minimax lower bounds for estimating the ATE in the new class, as well as in the pure structure-agnostic one. Third, building upon the literature on doubly-robust inference (van der Laan, 2014; Benkeser et al., 2017; Dukes et al., 2021), we propose a new estimator of the ATE that enjoys DRAL. Under certain conditions, we show that its rate of convergence in the new class can be much faster than that achieved by the AIPW estimator and, in particular, matches the minimax lower bound rate, thereby establishing its optimality. Finally, we clarify the connection between DRAL estimators and those based on higher-order influence functions (Robins et al., 2017) and complement our theoretical findings with simulations."}, "https://arxiv.org/abs/2405.08675": {"title": "Simplifying Debiased Inference via Automatic Differentiation and Probabilistic Programming", "link": "https://arxiv.org/abs/2405.08675", "description": "arXiv:2405.08675v1 Announce Type: new \nAbstract: We introduce an algorithm that simplifies the construction of efficient estimators, making them accessible to a broader audience. 'Dimple' takes as input computer code representing a parameter of interest and outputs an efficient estimator. 
Unlike standard approaches, it does not require users to derive a functional derivative known as the efficient influence function. Dimple avoids this task by applying automatic differentiation to the statistical functional of interest. Doing so requires expressing this functional as a composition of primitives satisfying a novel differentiability condition. Dimple also uses this composition to determine the nuisances it must estimate. In software, primitives can be implemented independently of one another and reused across different estimation problems. We provide a proof-of-concept Python implementation and showcase through examples how it allows users to go from parameter specification to efficient estimation with just a few lines of code."}, "https://arxiv.org/abs/2405.08687": {"title": "Latent group structure in linear panel data models with endogenous regressors", "link": "https://arxiv.org/abs/2405.08687", "description": "arXiv:2405.08687v1 Announce Type: new \nAbstract: This paper concerns the estimation of linear panel data models with endogenous regressors and a latent group structure in the coefficients. We consider instrumental variables estimation of the group-specific coefficient vector. We show that direct application of the Kmeans algorithm to the generalized method of moments objective function does not yield unique estimates. We newly develop and theoretically justify two-stage estimation methods that apply the Kmeans algorithm to a regression of the dependent variable on predicted values of the endogenous regressors. The results of Monte Carlo simulations demonstrate that two-stage estimation with the first stage modeled using a latent group structure achieves good classification accuracy, even if the true first-stage regression is fully heterogeneous. We apply our estimation methods to revisiting the relationship between income and democracy."}, "https://arxiv.org/abs/2405.08727": {"title": "Intervention effects based on potential benefit", "link": "https://arxiv.org/abs/2405.08727", "description": "arXiv:2405.08727v1 Announce Type: new \nAbstract: Optimal treatment rules are mappings from individual patient characteristics to tailored treatment assignments that maximize mean outcomes. In this work, we introduce a conditional potential benefit (CPB) metric that measures the expected improvement under an optimally chosen treatment compared to the status quo, within covariate strata. The potential benefit combines (i) the magnitude of the treatment effect, and (ii) the propensity for subjects to naturally select a suboptimal treatment. As a consequence, heterogeneity in the CPB can provide key insights into the mechanism by which a treatment acts and/or highlight potential barriers to treatment access or adverse effects. Moreover, we demonstrate that CPB is the natural prioritization score for individualized treatment policies when intervention capacity is constrained. That is, in the resource-limited setting where treatment options are freely accessible, but the ability to intervene on a portion of the target population is constrained (e.g., if the population is large, and follow-up and encouragement of treatment uptake is labor-intensive), targeting subjects with highest CPB maximizes the mean outcome. Focusing on this resource-limited setting, we derive formulas for optimal constrained treatment rules, and for any given budget, quantify the loss compared to the optimal unconstrained rule. 
We describe sufficient identification assumptions, and propose nonparametric, robust, and efficient estimators of the proposed quantities emerging from our framework."}, "https://arxiv.org/abs/2405.08730": {"title": "A Generalized Difference-in-Differences Estimator for Unbiased Estimation of Desired Estimands from Staggered Adoption and Stepped-Wedge Settings", "link": "https://arxiv.org/abs/2405.08730", "description": "arXiv:2405.08730v1 Announce Type: new \nAbstract: Staggered treatment adoption arises in the evaluation of policy impact and implementation in a variety of settings. This occurs in both randomized stepped-wedge trials and non-randomized quasi-experimental designs using causal inference methods based on difference-in-differences analysis. In both settings, it is crucial to carefully consider the target estimand and possible treatment effect heterogeneities in order to estimate the effect without bias and in an interpretable fashion. This paper proposes a novel non-parametric approach to this estimation for either setting. By constructing an estimator using two-by-two difference-in-difference comparisons as building blocks with arbitrary weights, the investigator can select weights to target the desired estimand in an unbiased manner under assumed treatment effect homogeneity, and minimize the variance under an assumed working covariance structure. This provides desirable bias properties with a relatively small sacrifice in variance and power by using the comparisons efficiently. The method is demonstrated on toy examples to show the process, as well as in the re-analysis of a stepped wedge trial on the impact of novel tuberculosis diagnostic tools. A full algorithm with R code is provided to implement this method. The proposed method allows for high flexibility and clear targeting of desired effects, providing one solution to the bias-variance-generalizability tradeoff."}, "https://arxiv.org/abs/2405.08738": {"title": "Calibrated sensitivity models", "link": "https://arxiv.org/abs/2405.08738", "description": "arXiv:2405.08738v1 Announce Type: new \nAbstract: In causal inference, sensitivity models assess how unmeasured confounders could alter causal analyses. However, the sensitivity parameter in these models -- which quantifies the degree of unmeasured confounding -- is often difficult to interpret. For this reason, researchers will sometimes compare the magnitude of the sensitivity parameter to an estimate for measured confounding. This is known as calibration. We propose novel calibrated sensitivity models, which directly incorporate measured confounding, and bound the degree of unmeasured confounding by a multiple of measured confounding. We illustrate how to construct calibrated sensitivity models via several examples. We also demonstrate their advantages over standard sensitivity analyses and calibration; in particular, the calibrated sensitivity parameter is an intuitive unit-less ratio of unmeasured divided by measured confounding, unlike standard sensitivity parameters, and one can correctly incorporate uncertainty due to estimating measured confounding, which standard calibration methods fail to do. By incorporating uncertainty due to measured confounding, we observe that causal analyses can be less robust or more robust to unmeasured confounding than would have been shown with standard approaches. 
We develop efficient estimators and methods for inference for bounds on the average treatment effect with three calibrated sensitivity models, and establish that our estimators are doubly robust and attain parametric efficiency and asymptotic normality under nonparametric conditions on their nuisance function estimators. We illustrate our methods with data analyses on the effect of exposure to violence on attitudes towards peace in Darfur and the effect of mothers' smoking on infant birthweight."}, "https://arxiv.org/abs/2405.08759": {"title": "Optimal Sequential Procedure for Early Detection of Multiple Side Effects", "link": "https://arxiv.org/abs/2405.08759", "description": "arXiv:2405.08759v1 Announce Type: new \nAbstract: In this paper, we propose an optimal sequential procedure for the early detection of potential side effects resulting from the administration of some treatment (e.g. a vaccine, say). The results presented here extend previous results obtained in Wang and Boukai (2024), who study the single side effect case, to the case of two (or more) side effects. While the sequential procedure we employ simultaneously monitors several of the treatment's side effects, the $(\\alpha, \\beta)$-optimal test we propose does not require any information about the inter-correlation between these potential side effects. However, in all of the subsequent analyses, including the derivations of the exact expressions of the Average Sample Number (ASN), the Power function, and the properties of the post-test (or post-detection) estimators, we accounted specifically for the correlation between the potential side effects. In real-life applications (such as post-marketing surveillance), the number of available observations is large enough to justify asymptotic analyses of the sequential procedure (testing and post-detection estimation) properties. Accordingly, we also derive the consistency and asymptotic normality of our post-test estimators; results which enable us to also provide (asymptotic, post-detection) confidence intervals for the probabilities of various side effects. Moreover, to compare two specific side effects, their relative risk plays an important role. We derive the distribution of the estimated relative risk in the asymptotic framework to provide appropriate inference. To illustrate the theoretical results presented, we provide two detailed examples based on data on side effects of the COVID-19 vaccine collected in Nigeria (see Ilori et al. (2022))."}, "https://arxiv.org/abs/2405.08203": {"title": "Community detection in bipartite signed networks is highly dependent on parameter choice", "link": "https://arxiv.org/abs/2405.08203", "description": "arXiv:2405.08203v1 Announce Type: cross \nAbstract: Decision-making processes often involve voting. Human interactions with exogenous entities such as legislation or products can be effectively modeled as two-mode (bipartite) signed networks, where people can vote positively, vote negatively, or abstain from voting on the entities. Detecting communities in such networks could help us understand underlying properties: for example, ideological camps or consumer preferences. While community detection is an established practice separately for bipartite and signed networks, it remains largely unexplored in the case of bipartite signed networks. In this paper, we systematically evaluate the efficacy of community detection methods on bipartite signed networks using a synthetic benchmark and real-world datasets. 
Our findings reveal that when no communities are present in the data, these methods often recover spurious communities. When communities are present, the algorithms exhibit promising performance, although their performance is highly susceptible to parameter choice. This indicates that researchers using community detection methods in the context of bipartite signed networks should not take the communities found at face value: it is essential to assess the robustness of parameter choices or perform domain-specific external validation."}, "https://arxiv.org/abs/2405.08290": {"title": "MCMC using $\\textit{bouncy}$ Hamiltonian dynamics: A unifying framework for Hamiltonian Monte Carlo and piecewise deterministic Markov process samplers", "link": "https://arxiv.org/abs/2405.08290", "description": "arXiv:2405.08290v1 Announce Type: cross \nAbstract: Piecewise-deterministic Markov process (PDMP) samplers constitute a state of the art Markov chain Monte Carlo (MCMC) paradigm in Bayesian computation, with examples including the zig-zag and bouncy particle sampler (BPS). Recent work on the zig-zag has indicated its connection to Hamiltonian Monte Carlo, a version of the Metropolis algorithm that exploits Hamiltonian dynamics. Here we establish that, in fact, the connection between the paradigms extends far beyond the specific instance. The key lies in (1) the fact that any time-reversible deterministic dynamics provides a valid Metropolis proposal and (2) how PDMPs' characteristic velocity changes constitute an alternative to the usual acceptance-rejection. We turn this observation into a rigorous framework for constructing rejection-free Metropolis proposals based on bouncy Hamiltonian dynamics which simultaneously possess Hamiltonian-like properties and generate discontinuous trajectories similar in appearance to PDMPs. When combined with periodic refreshment of the inertia, the dynamics converge strongly to PDMP equivalents in the limit of increasingly frequent refreshment. We demonstrate the practical implications of this new paradigm, with a sampler based on a bouncy Hamiltonian dynamics closely related to the BPS. The resulting sampler exhibits competitive performance on challenging real-data posteriors involving tens of thousands of parameters."}, "https://arxiv.org/abs/2405.08719": {"title": "Addressing Misspecification in Simulation-based Inference through Data-driven Calibration", "link": "https://arxiv.org/abs/2405.08719", "description": "arXiv:2405.08719v1 Announce Type: cross \nAbstract: Driven by steady progress in generative modeling, simulation-based inference (SBI) has enabled inference over stochastic simulators. However, recent work has demonstrated that model misspecification can harm SBI's reliability. This work introduces robust posterior estimation (ROPE), a framework that overcomes model misspecification with a small real-world calibration set of ground truth parameter measurements. We formalize the misspecification gap as the solution of an optimal transport problem between learned representations of real-world and simulated observations. Assuming the prior distribution over the parameters of interest is known and well-specified, our method offers a controllable balance between calibrated uncertainty and informative inference under all possible misspecifications of the simulator. 
Our empirical results on four synthetic tasks and two real-world problems demonstrate that ROPE outperforms baselines and consistently returns informative and calibrated credible intervals."}, "https://arxiv.org/abs/2405.08796": {"title": "Variational Bayes and non-Bayesian Updating", "link": "https://arxiv.org/abs/2405.08796", "description": "arXiv:2405.08796v1 Announce Type: cross \nAbstract: I show how variational Bayes can be used as a microfoundation for a popular model of non-Bayesian updating. All the results here are mathematically trivial, but I think this direction is potentially interesting."}, "https://arxiv.org/abs/2405.08806": {"title": "Bounds on the Distribution of a Sum of Two Random Variables: Revisiting a problem of Kolmogorov with application to Individual Treatment Effects", "link": "https://arxiv.org/abs/2405.08806", "description": "arXiv:2405.08806v1 Announce Type: cross \nAbstract: We revisit the following problem, proposed by Kolmogorov: given prescribed marginal distributions $F$ and $G$ for random variables $X,Y$ respectively, characterize the set of compatible distribution functions for the sum $Z=X+Y$. Bounds on the distribution function for $Z$ were given by Makarov (1982) and Frank et al. (1987), the latter using copula theory. However, though they obtain the same bounds, they make different assertions concerning their sharpness. In addition, their solutions leave some open problems in the case when the given marginal distribution functions are discontinuous. These issues have led to some confusion and erroneous statements in subsequent literature, which we correct.\n Kolmogorov's problem is closely related to inferring possible distributions for individual treatment effects $Y_1 - Y_0$ given the marginal distributions of $Y_1$ and $Y_0$; the latter being identified from a randomized experiment. We use our new insights to sharpen and correct results due to Fan and Park (2010) concerning individual treatment effects, and to fill some other logical gaps."}, "https://arxiv.org/abs/2009.05079": {"title": "Finding Groups of Cross-Correlated Features in Bi-View Data", "link": "https://arxiv.org/abs/2009.05079", "description": "arXiv:2009.05079v4 Announce Type: replace \nAbstract: Datasets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A,B) of feature sets from two data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features that have significant aggregate correlation with the features in B, and vice-versa. This paper proposes an iterative-testing based bimodule search procedure (BSP) to identify stable bimodules.\n Compared to existing methods for detecting cross-correlated features, BSP was the best at recovering true bimodules with sufficient signal, while limiting the false discoveries. In addition, we applied BSP to the problem of expression quantitative trait loci (eQTL) analysis using data from the GTEx consortium. BSP identified several thousand SNP-gene bimodules. 
While many of the individual SNP-gene pairs appearing in the discovered bimodules were identified by standard eQTL methods, the discovered bimodules revealed genomic subnetworks that appeared to be biologically meaningful and worthy of further scientific investigation."}, "https://arxiv.org/abs/2202.07234": {"title": "Long-term Causal Inference Under Persistent Confounding via Data Combination", "link": "https://arxiv.org/abs/2202.07234", "description": "arXiv:2202.07234v4 Announce Type: replace \nAbstract: We study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Since the long-term outcome is observed only after a long delay, it is not measured in the experimental data, but only recorded in the observational data. However, both types of data include observations of some short-term outcomes. In this paper, we uniquely tackle the challenge of persistent unmeasured confounders, i.e., some unmeasured confounders that can simultaneously affect the treatment, short-term outcomes and the long-term outcome, noting that they invalidate identification strategies in previous literature. To address this challenge, we exploit the sequential structure of multiple short-term outcomes, and develop three novel identification strategies for the average long-term treatment effect. We further propose three corresponding estimators and prove their asymptotic consistency and asymptotic normality. We finally apply our methods to estimate the effect of a job training program on long-term employment using semi-synthetic data. We numerically show that our proposals outperform existing methods that fail to handle persistent confounders."}, "https://arxiv.org/abs/2204.12023": {"title": "A One-Covariate-at-a-Time Method for Nonparametric Additive Models", "link": "https://arxiv.org/abs/2204.12023", "description": "arXiv:2204.12023v3 Announce Type: replace \nAbstract: This paper proposes a one-covariate-at-a-time multiple testing (OCMT) approach to choose significant variables in high-dimensional nonparametric additive regression models. Similarly to Chudik, Kapetanios and Pesaran (2018), we consider the statistical significance of individual nonparametric additive components one at a time and take into account the multiple testing nature of the problem. One-stage and multiple-stage procedures are both considered. The former works well in terms of the true positive rate only if the marginal effects of all signals are strong enough; the latter helps to pick up hidden signals that have weak marginal effects. Simulations demonstrate the good finite sample performance of the proposed procedures. As an empirical application, we use the OCMT procedure on a dataset we extracted from the Longitudinal Survey on Rural Urban Migration in China. We find that our procedure works well in terms of the out-of-sample forecast root mean square errors, compared with competing methods."}, "https://arxiv.org/abs/2206.10676": {"title": "Conditional probability tensor decompositions for multivariate categorical response regression", "link": "https://arxiv.org/abs/2206.10676", "description": "arXiv:2206.10676v2 Announce Type: replace \nAbstract: In many modern regression applications, the response consists of multiple categorical random variables whose probability mass is a function of a common set of predictors. 
In this article, we propose a new method for modeling such a probability mass function in settings where the number of response variables, the number of categories per response, and the dimension of the predictor are large. Our method relies on a functional probability tensor decomposition: a decomposition of a tensor-valued function such that its range is a restricted set of low-rank probability tensors. This decomposition is motivated by the connection between the conditional independence of responses, or lack thereof, and their probability tensor rank. We show that the model implied by such a low-rank functional probability tensor decomposition can be interpreted in terms of a mixture of regressions and can thus be fit using maximum likelihood. We derive an efficient and scalable penalized expectation maximization algorithm to fit this model and examine its statistical properties. We demonstrate the encouraging performance of our method through both simulation studies and an application to modeling the functional classes of genes."}, "https://arxiv.org/abs/2302.13999": {"title": "Forecasting Macroeconomic Tail Risk in Real Time: Do Textual Data Add Value?", "link": "https://arxiv.org/abs/2302.13999", "description": "arXiv:2302.13999v2 Announce Type: replace \nAbstract: We examine the incremental value of news-based data relative to the FRED-MD economic indicators for quantile predictions of employment, output, inflation and consumer sentiment in a high-dimensional setting. Our results suggest that news data contain valuable information that is not captured by a large set of economic indicators. We provide empirical evidence that this information can be exploited to improve tail risk predictions. The added value is largest when media coverage and sentiment are combined to compute text-based predictors. Methods that capture quantile-specific non-linearities produce overall superior forecasts relative to methods that feature linear predictive relationships. The results are robust along different modeling choices."}, "https://arxiv.org/abs/2304.09460": {"title": "Studying continuous, time-varying, and/or complex exposures using longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2304.09460", "description": "arXiv:2304.09460v3 Announce Type: replace \nAbstract: This tutorial discusses methodology for causal inference using longitudinal modified treatment policies. This method facilitates the mathematical formalization, identification, and estimation of many novel parameters, and mathematically generalizes many commonly used parameters, such as the average treatment effect. Longitudinal modified treatment policies apply to a wide variety of exposures, including binary, multivariate, and continuous, and can accommodate time-varying treatments and confounders, competing risks, loss-to-follow-up, as well as survival, binary, or continuous outcomes. Longitudinal modified treatment policies can be seen as an extension of static and dynamic interventions to involve the natural value of treatment, and, like dynamic interventions, can be used to define alternative estimands with a positivity assumption that is more likely to be satisfied than estimands corresponding to static interventions. This tutorial aims to illustrate several practical uses of the longitudinal modified treatment policy methodology, including describing different estimation strategies and their corresponding advantages and disadvantages. 
We provide numerous examples of types of research questions which can be answered using longitudinal modified treatment policies. We go into more depth with one of these examples -- specifically, estimating the effect of delaying intubation on critically ill COVID-19 patients' mortality. We demonstrate the use of the open-source R package lmtp to estimate the effects, and we provide code on https://github.com/kathoffman/lmtp-tutorial."}, "https://arxiv.org/abs/2310.09239": {"title": "Causal Quantile Treatment Effects with missing data by double-sampling", "link": "https://arxiv.org/abs/2310.09239", "description": "arXiv:2310.09239v2 Announce Type: replace \nAbstract: Causal weighted quantile treatment effects (WQTE) are a useful complement to standard causal contrasts that focus on the mean when interest lies at the tails of the counterfactual distribution. To date, however, methods for estimation and inference regarding causal WQTEs have assumed complete data on all relevant factors. In most practical settings, however, data will be missing or incomplete, particularly when the data are not collected for research purposes, as is the case for electronic health records and disease registries. Furthermore, such data sources may be particularly susceptible to the outcome data being missing-not-at-random (MNAR). In this paper, we consider the use of double-sampling, through which the otherwise missing data are ascertained on a sub-sample of study units, as a strategy to mitigate bias due to MNAR data in the estimation of causal WQTEs. With the additional data in-hand, we present identifying conditions that do not require assumptions regarding missingness in the original data. We then propose a novel inverse-probability weighted estimator and derive its asymptotic properties, both pointwise at specific quantiles and uniformly across a range of quantiles over some compact subset of (0,1), allowing the propensity score and double-sampling probabilities to be estimated. For practical inference, we develop a bootstrap method that can be used for both pointwise and uniform inference. A simulation study is conducted to examine the finite sample performance of the proposed estimators. The proposed method is illustrated with data from an EHR-based study examining the relative effects of two bariatric surgery procedures on BMI loss at 3 years post-surgery."}, "https://arxiv.org/abs/2310.13446": {"title": "Simple binning algorithm and SimDec visualization for comprehensive sensitivity analysis of complex computational models", "link": "https://arxiv.org/abs/2310.13446", "description": "arXiv:2310.13446v2 Announce Type: replace \nAbstract: Models of complex technological systems inherently contain interactions and dependencies among their input variables that affect their joint influence on the output. Such models are often computationally expensive, and few sensitivity analysis methods can effectively process such complexities. Moreover, the sensitivity analysis field as a whole pays limited attention to the nature of interaction effects, whose understanding can prove to be critical for the design of safe and reliable systems. In this paper, we introduce and extensively test a simple binning approach for computing sensitivity indices and demonstrate how complementing it with the smart visualization method, simulation decomposition (SimDec), can permit important insights into the behavior of complex engineering models. 
The simple binning approach computes first-, second-order effects, and a combined sensitivity index, and is considerably more computationally efficient than the mainstream measure for Sobol indices introduced by Saltelli et al. The totality of the sensitivity analysis framework provides an efficient and intuitive way to analyze the behavior of complex systems containing interactions and dependencies."}, "https://arxiv.org/abs/2401.10057": {"title": "A method for characterizing disease emergence curves from paired pathogen detection and serology data", "link": "https://arxiv.org/abs/2401.10057", "description": "arXiv:2401.10057v2 Announce Type: replace \nAbstract: Wildlife disease surveillance programs and research studies track infection and identify risk factors for wild populations, humans, and agriculture. Often, several types of samples are collected from individuals to provide more complete information about an animal's infection history. Methods that jointly analyze multiple data streams to study disease emergence and drivers of infection via epidemiological process models remain underdeveloped. Joint-analysis methods can more thoroughly analyze all available data, more precisely quantifying epidemic processes, outbreak status, and risks. We contribute a paired data modeling approach that analyzes multiple samples from individuals. We use \"characterization maps\" to link paired data to epidemiological processes through a hierarchical statistical observation model. Our approach can provide both Bayesian and frequentist estimates of epidemiological parameters and state. We motivate our approach through the need to use paired pathogen and antibody detection tests to estimate parameters and infection trajectories for the widely applicable susceptible, infectious, recovered (SIR) model. We contribute general formulas to link characterization maps to arbitrary process models and datasets and an extended SIR model that better accommodates paired data. We find via simulation that paired data can more efficiently estimate SIR parameters than unpaired data, requiring samples from 5-10 times fewer individuals. We then study SARS-CoV-2 in wild White-tailed deer (Odocoileus virginianus) from three counties in the United States. Estimates for average infectious times corroborate captive animal studies. Our methods use general statistical theory to let applications extend beyond the SIR model we consider, and to more complicated examples of paired data."}, "https://arxiv.org/abs/2003.12408": {"title": "On the role of surrogates in the efficient estimation of treatment effects with limited outcome data", "link": "https://arxiv.org/abs/2003.12408", "description": "arXiv:2003.12408v3 Announce Type: replace-cross \nAbstract: In many experiments and observational studies, the outcome of interest is often difficult or expensive to observe, reducing effective sample sizes for estimating average treatment effects (ATEs) even when identifiable. We study how incorporating data on units for which only surrogate outcomes not of primary interest are observed can increase the precision of ATE estimation. We refrain from imposing stringent surrogacy conditions, which permit surrogates as perfect replacements for the target outcome. Instead, we supplement the available, albeit limited, observations of the target outcome (which by themselves identify the ATE) with abundant observations of surrogate outcomes, without any assumptions beyond random assignment and missingness and corresponding overlap conditions. 
To quantify the potential gains, we derive the difference in efficiency bounds on ATE estimation with and without surrogates, both when an overwhelming or comparable number of units have missing outcomes. We develop robust ATE estimation and inference methods that realize these efficiency gains. We empirically demonstrate the gains by studying the long-term-earning effects of job training."}, "https://arxiv.org/abs/2107.10955": {"title": "Learning Linear Polytree Structural Equation Models", "link": "https://arxiv.org/abs/2107.10955", "description": "arXiv:2107.10955v4 Announce Type: replace-cross \nAbstract: We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under the Gaussian polytree models, we study sufficient conditions on the sample sizes for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. On the other hand, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are also derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider the problem of inverse correlation matrix estimation under the linear polytree models, and establish the estimation error bound in terms of the dimension and the total number of v-structures. We also consider an extension of group linear polytree models, in which each node represents a group of variables. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structures can only be approximated by polytrees."}, "https://arxiv.org/abs/2306.05665": {"title": "Causal health impacts of power plant emission controls under modeled and uncertain physical process interference", "link": "https://arxiv.org/abs/2306.05665", "description": "arXiv:2306.05665v2 Announce Type: replace-cross \nAbstract: Causal inference with spatial environmental data is often challenging due to the presence of interference: outcomes for observational units depend on some combination of local and non-local treatment. This is especially relevant when estimating the effect of power plant emissions controls on population health, as pollution exposure is dictated by (i) the location of point-source emissions, as well as (ii) the transport of pollutants across space via dynamic physical-chemical processes. In this work, we estimate the effectiveness of air quality interventions at coal-fired power plants in reducing two adverse health outcomes in Texas in 2016: pediatric asthma ED visits and Medicare all-cause mortality. We develop methods for causal inference with interference when the underlying network structure is not known with certainty and instead must be estimated from ancillary data. Notably, uncertainty in the interference structure is propagated to the resulting causal effect estimates. We offer a Bayesian, spatial mechanistic model for the interference mapping which we combine with a flexible non-parametric outcome model to marginalize estimates of causal effects over uncertainty in the structure of interference. 
Our analysis finds some evidence that emissions controls at upwind power plants reduce asthma ED visits and all-cause mortality; however, accounting for uncertainty in the interference renders the results largely inconclusive."}, "https://arxiv.org/abs/2405.08841": {"title": "Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data", "link": "https://arxiv.org/abs/2405.08841", "description": "arXiv:2405.08841v1 Announce Type: new \nAbstract: Epidemiological delays, such as incubation periods, serial intervals, and hospital lengths of stay, are among key quantities in infectious disease epidemiology that inform public health policy and clinical practice. This information is used to inform mathematical and statistical models, which in turn can inform control strategies. There are three main challenges that make delay distributions difficult to estimate. First, the data are commonly censored (e.g., symptom onset may only be reported by date instead of the exact time of day). Second, delays are often right truncated when being estimated in real time (not all events that have occurred have been observed yet). Third, during a rapidly growing or declining outbreak, overrepresentation or underrepresentation, respectively, of recently infected cases in the data can lead to bias in estimates. Studies that estimate delays rarely address all these factors and sometimes report several estimates using different combinations of adjustments, which can lead to conflicting answers and confusion about which estimates are most accurate. In this work, we formulate a checklist of best practices for estimating and reporting epidemiological delays with a focus on the incubation period and serial interval. We also propose strategies for handling common biases and identify areas where more work is needed. Our recommendations can help improve the robustness and utility of reported estimates and provide guidance for the evaluation of estimates for downstream use in transmission models or other analyses."}, "https://arxiv.org/abs/2405.08853": {"title": "Evaluating the Uncertainty in Mean Residual Times: Estimators Based on Residence Times from Discrete Time Processes", "link": "https://arxiv.org/abs/2405.08853", "description": "arXiv:2405.08853v1 Announce Type: new \nAbstract: In this work, we propose estimators for the uncertainty in mean residual times that require, for their evaluation, statistically independent individual residence times obtained from a discrete time process. We examine their performance through numerical experiments involving well-known probability distributions, and an application example using molecular dynamics simulation results from an aqueous NaCl solution is provided. These computationally inexpensive estimators, capable of achieving very accurate outcomes, serve as useful tools for assessing and reporting uncertainties in mean residual times across a wide range of simulations."}, "https://arxiv.org/abs/2405.08912": {"title": "High dimensional test for functional covariates", "link": "https://arxiv.org/abs/2405.08912", "description": "arXiv:2405.08912v1 Announce Type: new \nAbstract: As medical devices become more complex, they routinely collect extensive and complicated data. While classical regressions typically examine the relationship between an outcome and a vector of predictors, it becomes imperative to identify the relationship with predictors possessing functional structures. 
In this article, we introduce a novel inference procedure for examining the relationship between outcomes and large-scale functional predictors. We target testing the linear hypothesis on the functional parameters under the generalized functional linear regression framework, where the number of the functional parameters grows with the sample size. We develop the estimation procedure for the high dimensional generalized functional linear model incorporating B-spline functional approximation and amenable regularization. Furthermore, we construct a procedure that is able to test the local alternative hypothesis on the linear combinations of the functional parameters. We establish the statistical guarantees in terms of non-asymptotic convergence of the parameter estimation and the oracle property and asymptotic normality of the estimators. Moreover, we derive the asymptotic distribution of the test statistic. We carry out intensive simulations and illustrate with a new dataset from an Alzheimer's disease magnetoencephalography study."}, "https://arxiv.org/abs/2405.09003": {"title": "Nonparametric Inference on Dose-Response Curves Without the Positivity Condition", "link": "https://arxiv.org/abs/2405.09003", "description": "arXiv:2405.09003v1 Announce Type: new \nAbstract: Existing statistical methods in causal inference often rely on the assumption that every individual has some chance of receiving any treatment level regardless of its associated covariates, which is known as the positivity condition. This assumption could be violated in observational studies with continuous treatments. In this paper, we present a novel integral estimator of the causal effects with continuous treatments (i.e., dose-response curves) without requiring the positivity condition. Our approach involves estimating the derivative function of the treatment effect on each observed data sample and integrating it to the treatment level of interest so as to address the bias resulting from the lack of positivity condition. The validity of our approach relies on an alternative weaker assumption that can be satisfied by additive confounding models. We provide a fast and reliable numerical recipe for computing our estimator in practice and derive its related asymptotic theory. To conduct valid inference on the dose-response curve and its derivative, we propose using the nonparametric bootstrap and establish its consistency. The practical performances of our proposed estimators are validated through simulation studies and an analysis of the effect of air pollution exposure (PM$_{2.5}$) on cardiovascular mortality rates."}, "https://arxiv.org/abs/2405.09080": {"title": "Causal Inference for a Hidden Treatment", "link": "https://arxiv.org/abs/2405.09080", "description": "arXiv:2405.09080v1 Announce Type: new \nAbstract: In many empirical settings, directly observing a treatment variable may be infeasible although an error-prone surrogate measurement of the latter will often be available. Causal inference based solely on the observed surrogate measurement of the hidden treatment may be particularly challenging without an additional assumption or auxiliary data. To address this issue, we propose a method that carefully incorporates the surrogate measurement together with a proxy of the hidden treatment to identify its causal effect on any scale for which identification would in principle be feasible had contrary to fact the treatment been observed error-free. 
Beyond identification, we provide general semiparametric theory for causal effects identified using our approach, and we derive a large class of semiparametric estimators with an appealing multiple robustness property. A significant obstacle to our approach is the estimation of nuisance functions involving the hidden treatment, which prevents the direct application of standard machine learning algorithms. To resolve this, we introduce a novel semiparametric EM algorithm, thus adding a practical dimension to our theoretical contributions. This methodology can be adapted to analyze a large class of causal parameters in the proposed hidden treatment model, including the population average treatment effect, the effect of treatment on the treated, quantile treatment effects, and causal effects under marginal structural models. We examine the finite-sample performance of our method using simulations and an application which aims to estimate the causal effect of Alzheimer's disease on hippocampal volume using data from the Alzheimer's Disease Neuroimaging Initiative."}, "https://arxiv.org/abs/2405.09149": {"title": "Exploring uniformity and maximum entropy distribution on torus through intrinsic geometry: Application to protein-chemistry", "link": "https://arxiv.org/abs/2405.09149", "description": "arXiv:2405.09149v1 Announce Type: new \nAbstract: A generic family of distributions, defined on the surface of a curved torus, is introduced using its area element. The area uniformity and the maximum entropy distribution are identified using the trigonometric moments of the proposed family. A marginal distribution is obtained as a three-parameter modification of the von Mises distribution that encompasses the von Mises, Cardioid, and Uniform distributions as special cases. The proposed family of the marginal distribution exhibits both symmetric and asymmetric, unimodal or bimodal shapes, contingent upon parameters. Furthermore, we scrutinize a two-parameter symmetric submodel, examining its moments, measure of variation, Kullback-Leibler divergence, and maximum likelihood estimation, among other properties. In addition, we introduce a modified acceptance-rejection sampling with a thin envelope obtained from the upper-Riemann-sum of a circular density, achieving a high rate of acceptance. This proposed sampling scheme will accelerate empirical studies for large-scale simulation by reducing the processing time. Furthermore, we extend the Uniform, Wrapped Cauchy, and Kato-Jones distributions to the surface of the curved torus and implement the proposed bivariate toroidal distribution for different groups of protein data, namely, $\\alpha$-helix, $\\beta$-sheet, and their mixture. A marginal of this proposed distribution is fitted to the wind direction data."}, "https://arxiv.org/abs/2405.09331": {"title": "Multi-Source Conformal Inference Under Distribution Shift", "link": "https://arxiv.org/abs/2405.09331", "description": "arXiv:2405.09331v1 Announce Type: new \nAbstract: Recent years have experienced increasing utilization of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification from machine learning predictions, make it challenging to achieve valid inferences in multi-source environments.
In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length of stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016-2022 in the U.S. illustrate the utility of our methodology."}, "https://arxiv.org/abs/2405.09485": {"title": "Predicting Future Change-points in Time Series", "link": "https://arxiv.org/abs/2405.09485", "description": "arXiv:2405.09485v1 Announce Type: new \nAbstract: Change-point detection and estimation procedures have been widely developed in the literature. However, commonly used approaches in change-point analysis have mainly been focusing on detecting change-points within an entire time series (off-line methods), or quickest detection of change-points in sequentially observed data (on-line methods). Both classes of methods are concerned with change-points that have already occurred. The arguably more important question of when future change-points may occur, remains largely unexplored. In this paper, we develop a novel statistical model that describes the mechanism of change-point occurrence. Specifically, the model assumes a latent process in the form of a random walk driven by non-negative innovations, and an observed process which behaves differently when the latent process belongs to different regimes. By construction, an occurrence of a change-point is equivalent to hitting a regime threshold by the latent process. Therefore, by predicting when the latent process will hit the next regime threshold, future change-points can be forecasted. The probabilistic properties of the model such as stationarity and ergodicity are established. A composite likelihood-based approach is developed for parameter estimation and model selection. Moreover, we construct the predictor and prediction interval for future change points based on the estimated model."}, "https://arxiv.org/abs/2405.09509": {"title": "Double Robustness of Local Projections and Some Unpleasant VARithmetic", "link": "https://arxiv.org/abs/2405.09509", "description": "arXiv:2405.09509v1 Announce Type: new \nAbstract: We consider impulse response inference in a locally misspecified stationary vector autoregression (VAR) model. The conventional local projection (LP) confidence interval has correct coverage even when the misspecification is so large that it can be detected with probability approaching 1. This follows from a \"double robustness\" property analogous to that of modern estimators for partially linear regressions. In contrast, VAR confidence intervals dramatically undercover even for misspecification so small that it is difficult to detect statistically and cannot be ruled out based on economic theory. 
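For orientation alongside the multi-source conformal entry above (arXiv:2405.09331), a single-source split-conformal baseline, not the paper's shift-aware multi-source procedure; the data and the gradient-boosting model are arbitrary stand-ins.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Split-conformal baseline: fit on one half of the data, calibrate absolute
# residuals on the other half, and widen predictions by their quantile.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=1000)

fit_idx, cal_idx = np.arange(500), np.arange(500, 1000)
model = GradientBoostingRegressor(random_state=0).fit(X[fit_idx], y[fit_idx])

alpha = 0.1
residuals = np.abs(y[cal_idx] - model.predict(X[cal_idx]))
k = int(np.ceil((len(cal_idx) + 1) * (1 - alpha)))
q = np.sort(residuals)[k - 1]                     # finite-sample conformal quantile

X_new = rng.normal(size=(5, 5))
pred = model.predict(X_new)
print(np.column_stack([pred - q, pred + q]))      # 90% prediction intervals
```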
This is because of a \"no free lunch\" result for VARs: the worst-case bias and coverage distortion are small if, and only if, the variance is close to that of LP. While VAR coverage can be restored by using a bias-aware critical value or a large lag length, the resulting confidence interval tends to be at least as wide as the LP interval."}, "https://arxiv.org/abs/2405.09536": {"title": "Wasserstein Gradient Boosting: A General Framework with Applications to Posterior Regression", "link": "https://arxiv.org/abs/2405.09536", "description": "arXiv:2405.09536v1 Announce Type: new \nAbstract: Gradient boosting is a sequential ensemble method that fits a new base learner to the gradient of the remaining loss at each step. We propose a novel family of gradient boosting, Wasserstein gradient boosting, which fits a new base learner to an exactly or approximately available Wasserstein gradient of a loss functional on the space of probability distributions. Wasserstein gradient boosting returns a set of particles that approximates a target probability distribution assigned at each input. In probabilistic prediction, a parametric probability distribution is often specified on the space of output variables, and a point estimate of the output-distribution parameter is produced for each input by a model. Our main application of Wasserstein gradient boosting is a novel distributional estimate of the output-distribution parameter, which approximates the posterior distribution over the output-distribution parameter determined pointwise at each data point. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with various existing methods."}, "https://arxiv.org/abs/2405.08907": {"title": "Properties of stationary cyclical processes", "link": "https://arxiv.org/abs/2405.08907", "description": "arXiv:2405.08907v1 Announce Type: cross \nAbstract: The paper investigates the theoretical properties of zero-mean stationary time series with cyclical components, admitting the representation $y_t=\\alpha_t \\cos \\lambda t + \\beta_t \\sin \\lambda t$, with $\\lambda \\in (0,\\pi]$ and $[\\alpha_t\\,\\, \\beta_t]$ following some bivariate process. We diagnose that in the extant literature on cyclic time series, a prevalent assumption of Gaussianity for $[\\alpha_t\\,\\, \\beta_t]$ imposes inadvertently a severe restriction on the amplitude of the process. Moreover, it is shown that other common distributions may suffer from either similar defects or fail to guarantee the stationarity of $y_t$. To address both of the issues, we propose to introduce a direct stochastic modulation of the amplitude and phase shift in an almost periodic function. We prove that this novel approach may lead, in general, to a stationary (up to any order) time series, and specifically, to a zero-mean stationary time series featuring cyclicity, with a pseudo-cyclical autocovariance function that may even decay at a very slow rate. 
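A bare-bones local-projection regression in the spirit of the LP/VAR entry above (arXiv:2405.09509): one OLS per horizon of y_{t+h} on the period-t shock and a lagged control, with Newey-West (HAC) standard errors, on simulated data. The bias-aware critical values discussed in that paper are not implemented; all parameter values are illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a simple series driven by an observed shock, then run one local
# projection per horizon with HAC standard errors.
rng = np.random.default_rng(3)
T = 400
shock = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + shock[t] + 0.2 * rng.normal()

irf, se = [], []
for h in range(13):
    lhs = y[1 + h:]                                          # y_{t+h}
    rhs = sm.add_constant(np.column_stack([shock[1:T - h], y[:T - h - 1]]))
    res = sm.OLS(lhs, rhs).fit(cov_type="HAC", cov_kwds={"maxlags": h + 1})
    irf.append(res.params[1])
    se.append(res.bse[1])
print(np.round(irf, 2), np.round(se, 2))
```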
The proposed process fills an important gap in this type of models and allows for flexible modeling of amplitude and phase shift."}, "https://arxiv.org/abs/2405.09076": {"title": "Enhancing Airline Customer Satisfaction: A Machine Learning and Causal Analysis Approach", "link": "https://arxiv.org/abs/2405.09076", "description": "arXiv:2405.09076v1 Announce Type: cross \nAbstract: This study explores the enhancement of customer satisfaction in the airline industry, a critical factor for retaining customers and building brand reputation, which are vital for revenue growth. Utilizing a combination of machine learning and causal inference methods, we examine the specific impact of service improvements on customer satisfaction, with a focus on the online boarding pass experience. Through detailed data analysis involving several predictive and causal models, we demonstrate that improvements in the digital aspects of customer service significantly elevate overall customer satisfaction. This paper highlights how airlines can strategically leverage these insights to make data-driven decisions that enhance customer experiences and, consequently, their market competitiveness."}, "https://arxiv.org/abs/2405.09500": {"title": "Identifying Heterogeneous Decision Rules From Choices When Menus Are Unobserved", "link": "https://arxiv.org/abs/2405.09500", "description": "arXiv:2405.09500v1 Announce Type: cross \nAbstract: Given only aggregate choice data and limited information about how menus are distributed across the population, we describe what can be inferred robustly about the distribution of preferences (or more general decision rules). We strengthen and generalize existing results on such identification and provide an alternative analytical approach to study the problem. We show further that our model and results are applicable, after suitable reinterpretation, to other contexts. One application is to the robust identification of the distribution of updating rules given only the population distribution of beliefs and limited information about heterogeneous information sources."}, "https://arxiv.org/abs/2209.07295": {"title": "A new set of tools for goodness-of-fit validation", "link": "https://arxiv.org/abs/2209.07295", "description": "arXiv:2209.07295v2 Announce Type: replace \nAbstract: We introduce two new tools to assess the validity of statistical distributions. These tools are based on components derived from a new statistical quantity, the $comparison$ $curve$. The first tool is a graphical representation of these components on a $bar$ $plot$ (B plot), which can provide a detailed appraisal of the validity of the statistical model, in particular when supplemented by acceptance regions related to the model. The knowledge gained from this representation can sometimes suggest an existing $goodness$-$of$-$fit$ test to supplement this visual assessment with a control of the type I error. Otherwise, an adaptive test may be preferable and the second tool is the combination of these components to produce a powerful $\\chi^2$-type goodness-of-fit test. Because the number of these components can be large, we introduce a new selection rule to decide, in a data driven fashion, on their proper number to take into consideration. In a simulation, our goodness-of-fit tests are seen to be powerwise competitive with the best solutions that have been recommended in the context of a fully specified model as well as when some parameters must be estimated. 
Practical examples show how to use these tools to derive principled information about where the model departs from the data."}, "https://arxiv.org/abs/2212.11304": {"title": "Powerful Partial Conjunction Hypothesis Testing via Conditioning", "link": "https://arxiv.org/abs/2212.11304", "description": "arXiv:2212.11304v3 Announce Type: replace \nAbstract: A Partial Conjunction Hypothesis (PCH) test combines information across a set of base hypotheses to determine whether some subset is non-null. PCH tests arise in a diverse array of fields, but standard PCH testing methods can be highly conservative, leading to low power especially in low signal settings commonly encountered in applications. In this paper, we introduce the conditional PCH (cPCH) test, a new method for testing a single PCH that directly corrects the conservativeness of standard approaches by conditioning on certain order statistics of the base p-values. Under distributional assumptions commonly encountered in PCH testing, the cPCH test is valid and produces nearly uniformly distributed p-values under the null (i.e., cPCH p-values are only very slightly conservative). We demonstrate that the cPCH test matches or outperforms existing single PCH tests with particular power gains in low signal settings, maintains Type I error control even under model misspecification, and can be used to outperform state-of-the-art multiple PCH testing procedures in certain settings, particularly when side information is present. Finally, we illustrate an application of the cPCH test through a replicability analysis across DNA microarray studies."}, "https://arxiv.org/abs/2307.11127": {"title": "Asymptotically Unbiased Synthetic Control Methods by Distribution Matching", "link": "https://arxiv.org/abs/2307.11127", "description": "arXiv:2307.11127v3 Announce Type: replace \nAbstract: Synthetic Control Methods (SCMs) have become an essential tool for comparative case studies. The fundamental idea of SCMs is to estimate the counterfactual outcomes of a treated unit using a weighted sum of the observed outcomes of untreated units. The accuracy of the synthetic control (SC) is critical for evaluating the treatment effect of a policy intervention; therefore, the estimation of SC weights has been the focus of extensive research. In this study, we first point out that existing SCMs suffer from an endogeneity problem, the correlation between the outcomes of untreated units and the error term of the synthetic control, which yields a bias in the treatment effect estimator. We then propose a novel SCM based on density matching, assuming that the density of outcomes of the treated unit can be approximated by a weighted average of the joint density of untreated units (i.e., a mixture model). Based on this assumption, we estimate SC weights by matching the moments of treated outcomes with the weighted sum of moments of untreated outcomes. Our proposed method has three advantages over existing methods: first, our estimator is asymptotically unbiased under the assumption of the mixture model; second, due to the asymptotic unbiasedness, we can reduce the mean squared error in counterfactual predictions; third, our method generates full densities of the treatment effect, not merely expected values, which broadens the applicability of SCMs. 
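A toy version of the moment-matching idea in the synthetic-control entry above (arXiv:2307.11127): simplex weights are chosen so that a weighted sum of the donors' moments matches the treated unit's moments. The data, the number of moments, and the SLSQP optimizer are illustrative choices, not the paper's estimator.

```python
import numpy as np
from scipy.optimize import minimize

# Purely illustrative pre-treatment data: T0 periods, J donor units, and a
# treated unit constructed as (roughly) a mixture of the first three donors.
rng = np.random.default_rng(4)
T0, J = 40, 10
Y0 = rng.normal(loc=rng.normal(size=J), scale=1.0, size=(T0, J))
y1 = Y0[:, :3].mean(axis=1) + rng.normal(scale=0.1, size=T0)

def moments(x):
    return np.array([np.mean(x), np.mean(x ** 2), np.mean(x ** 3)])

target = moments(y1)
donor_moments = np.column_stack([moments(Y0[:, j]) for j in range(J)])  # 3 x J

def objective(w):
    return np.sum((donor_moments @ w - target) ** 2)

w0 = np.full(J, 1.0 / J)
res = minimize(objective, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * J,
               constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
print(np.round(res.x, 3))                      # estimated synthetic-control weights
```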
We provide experimental results to demonstrate the effectiveness of our proposed method."}, "https://arxiv.org/abs/2310.09105": {"title": "Estimating Individual Responses when Tomorrow Matters", "link": "https://arxiv.org/abs/2310.09105", "description": "arXiv:2310.09105v3 Announce Type: replace \nAbstract: We propose a regression-based approach to estimate how individuals' expectations influence their responses to a counterfactual change. We provide conditions under which average partial effects based on regression estimates recover structural effects. We propose a practical three-step estimation method that relies on panel data on subjective expectations. We illustrate our approach in a model of consumption and saving, focusing on the impact of an income tax that not only changes current income but also affects beliefs about future income. Applying our approach to Italian survey data, we find that individuals' beliefs matter for evaluating the impact of tax policies on consumption decisions."}, "https://arxiv.org/abs/2310.14448": {"title": "Semiparametrically Efficient Score for the Survival Odds Ratio", "link": "https://arxiv.org/abs/2310.14448", "description": "arXiv:2310.14448v2 Announce Type: replace \nAbstract: We consider a general proportional odds model for survival data under binary treatment, where the functional form of the covariates is left unspecified. We derive the efficient score for the conditional survival odds ratio given the covariates using modern semiparametric theory. The efficient score may be useful in the development of doubly robust estimators, although computational challenges remain."}, "https://arxiv.org/abs/2312.08174": {"title": "Double Machine Learning for Static Panel Models with Fixed Effects", "link": "https://arxiv.org/abs/2312.08174", "description": "arXiv:2312.08174v3 Announce Type: replace \nAbstract: Recent advances in causal inference have seen the development of methods which make use of the predictive power of machine learning algorithms. In this paper, we use double machine learning (DML) (Chernozhukov et al., 2018) to approximate high-dimensional and non-linear nuisance functions of the confounders to make inferences about the effects of policy interventions from panel data. We propose new estimators by adapting correlated random effects, within-group and first-difference estimation for linear models to an extension of Robinson (1988)'s partially linear regression model to static panel data models with individual fixed effects and unspecified non-linear confounder effects. Using Monte Carlo simulations, we compare the relative performance of different machine learning algorithms and find that conventional least squares estimators perform well when the data generating process is mildly non-linear and smooth, but there are substantial performance gains with DML in terms of bias reduction when the true effect of the regressors is non-linear and discontinuous. However, inference based on individual learners can lead to badly biased inference. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the minimum wage on voting behavior in the UK."}, "https://arxiv.org/abs/2212.01792": {"title": "Classification by sparse generalized additive models", "link": "https://arxiv.org/abs/2212.01792", "description": "arXiv:2212.01792v4 Announce Type: replace-cross \nAbstract: We consider (nonparametric) sparse (generalized) additive models (SpAM) for classification.
The design of a SpAM classifier is based on minimizing the logistic loss with sparse group Lasso/Slope-type penalties on the coefficients of univariate additive components' expansions in orthonormal series (e.g., Fourier or wavelets). The resulting classifier is inherently adaptive to the unknown sparsity and smoothness. We show that under a certain sparse group restricted eigenvalue condition it is nearly-minimax (up to log-factors) simultaneously across the entire range of analytic, Sobolev and Besov classes. The performance of the proposed classifier is illustrated on simulated and real-data examples."}, "https://arxiv.org/abs/2308.01156": {"title": "A new adaptive local polynomial density estimation procedure on complicated domains", "link": "https://arxiv.org/abs/2308.01156", "description": "arXiv:2308.01156v2 Announce Type: replace-cross \nAbstract: This paper presents a novel approach for pointwise estimation of multivariate density functions on known domains of arbitrary dimensions using nonparametric local polynomial estimators. Our method is highly flexible, as it applies to both simple domains, such as open connected sets, and more complicated domains that are not star-shaped around the point of estimation. This enables us to handle domains with sharp concavities, holes, and local pinches, such as polynomial sectors. Additionally, we introduce a data-driven selection rule based on the general ideas of Goldenshluger and Lepski. Our results demonstrate that the local polynomial estimators are minimax under an $L^2$ risk across a wide range of H\\\"older-type functional classes. In the adaptive case, we provide oracle inequalities and explicitly determine the convergence rate of our statistical procedure. Simulations on polynomial sectors show that our oracle estimates outperform those of the most popular alternative method, found in the sparr package for the R software. Our statistical procedure is implemented in an online R package which is readily accessible."}, "https://arxiv.org/abs/2405.09797": {"title": "Identification of Single-Treatment Effects in Factorial Experiments", "link": "https://arxiv.org/abs/2405.09797", "description": "arXiv:2405.09797v1 Announce Type: new \nAbstract: Despite their cost, randomized controlled trials (RCTs) are widely regarded as gold-standard evidence in disciplines ranging from social science to medicine. In recent decades, researchers have increasingly sought to reduce the resource burden of repeated RCTs with factorial designs that simultaneously test multiple hypotheses, e.g. experiments that evaluate the effects of many medications or products simultaneously. Here I show that when multiple interventions are randomized in experiments, the effect any single intervention would have outside the experimental setting is not identified absent heroic assumptions, even if otherwise perfectly realistic conditions are achieved. This happens because single-treatment effects involve a counterfactual world with a single focal intervention, allowing other variables to take their natural values (which may be confounded or modified by the focal intervention). In contrast, observational studies and factorial experiments provide information about potential-outcome distributions with zero and multiple interventions, respectively. In this paper, I formalize sufficient conditions for the identifiability of those isolated quantities.
I show that researchers who rely on this type of design have to either justify linearity of functional forms or -- in the nonparametric case -- specify with Directed Acyclic Graphs how variables are related in the real world. Finally, I develop nonparametric sharp bounds -- i.e., maximally informative best-/worst-case estimates consistent with limited RCT data -- that show when extrapolations about effect signs are empirically justified. These new results are illustrated with simulated data."}, "https://arxiv.org/abs/2405.09810": {"title": "Trajectory-Based Individualized Treatment Rules", "link": "https://arxiv.org/abs/2405.09810", "description": "arXiv:2405.09810v1 Announce Type: new \nAbstract: A core component of precision medicine research involves optimizing individualized treatment rules (ITRs) based on patient characteristics. Many studies used to estimate ITRs are longitudinal in nature, collecting outcomes over time. Yet, to date, methods developed to estimate ITRs often ignore the longitudinal structure of the data. Information available from the longitudinal nature of the data can be especially useful in mental health studies. Although treatment means might appear similar, understanding the trajectory of outcomes over time can reveal important differences between treatments and placebo effects. This longitudinal perspective is especially beneficial in mental health research, where subtle shifts in outcome patterns can hold significant implications. Despite numerous studies involving the collection of outcome data across various time points, most precision medicine methods used to develop ITRs overlook the information available from the longitudinal structure. The prevalence of missing data in such studies exacerbates the issue, as neglecting the longitudinal nature of the data can significantly impair the effectiveness of treatment rules. This paper develops a powerful longitudinal trajectory-based ITR construction method that incorporates baseline variables, via a single-index or biosignature, into the modeling of longitudinal outcomes. This trajectory-based ITR approach substantially minimizes the negative impact of missing data compared to more traditional ITR approaches. The approach is illustrated through simulation studies and a clinical trial for depression, contrasting it with more traditional ITRs that ignore longitudinal information."}, "https://arxiv.org/abs/2405.09887": {"title": "Quantization-based LHS for dependent inputs: application to sensitivity analysis of environmental models", "link": "https://arxiv.org/abs/2405.09887", "description": "arXiv:2405.09887v1 Announce Type: new \nAbstract: Numerical modeling is essential for comprehending intricate physical phenomena in different domains. To handle complexity, sensitivity analysis, particularly screening, is crucial for identifying influential input parameters. Kernel-based methods, such as the Hilbert Schmidt Independence Criterion (HSIC), are valuable for analyzing dependencies between inputs and outputs. Moreover, due to the computational expense of such models, metamodels (or surrogate models) are often unavoidable. Implementing metamodels and HSIC requires data from the original model, which leads to the need for space-filling designs. While existing methods like Latin Hypercube Sampling (LHS) are effective for independent variables, incorporating dependence is challenging.
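For reference, a standard Latin Hypercube design for independent inputs built with SciPy's qmc module; the quantization-based variant for correlated inputs described in this entry (arXiv:2405.09887) is not reproduced here, and the three input ranges are hypothetical.

```python
import numpy as np
from scipy.stats import qmc

# Baseline space-filling design for three independent inputs of a hypothetical
# simulator; each marginal is stratified into 256 equal-probability bins.
sampler = qmc.LatinHypercube(d=3, seed=5)
unit_sample = sampler.random(n=256)                         # points in [0, 1)^3
lower, upper = np.array([0.0, 10.0, -1.0]), np.array([1.0, 50.0, 1.0])
design = qmc.scale(unit_sample, lower, upper)               # map to physical input ranges
print(design[:5])
```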
This paper introduces a novel LHS variant, Quantization-based LHS, which leverages Voronoi vector quantization to address correlated inputs. The method ensures comprehensive coverage of stratified variables, enhancing distribution across marginals. The paper outlines expectation estimators based on Quantization-based LHS in various dependency settings, demonstrating their unbiasedness. The method is applied on several models of growing complexities, first on simple examples to illustrate the theory, then on more complex environmental hydrological models, when the dependence is known or not, and with more and more interactive processes and factors. The last application is on the digital twin of a French vineyard catchment (Beaujolais region) to design a vegetative filter strip and reduce water, sediment and pesticide transfers from the fields to the river. Quantization-based LHS is used to compute HSIC measures and independence tests, demonstrating its usefulness, especially in the context of complex models."}, "https://arxiv.org/abs/2405.09906": {"title": "Process-based Inference for Spatial Energetics Using Bayesian Predictive Stacking", "link": "https://arxiv.org/abs/2405.09906", "description": "arXiv:2405.09906v1 Announce Type: new \nAbstract: Rapid developments in streaming data technologies have enabled real-time monitoring of human activity that can deliver high-resolution data on health variables over trajectories or paths carved out by subjects as they conduct their daily physical activities. Wearable devices, such as wrist-worn sensors that monitor gross motor activity, have become prevalent and have kindled the emerging field of ``spatial energetics'' in environmental health sciences. We devise a Bayesian inferential framework for analyzing such data while accounting for information available on specific spatial coordinates comprising a trajectory or path using a Global Positioning System (GPS) device embedded within the wearable device. We offer full probabilistic inference with uncertainty quantification using spatial-temporal process models adapted for data generated from ``actigraph'' units as the subject traverses a path or trajectory in their daily routine. Anticipating the need for fast inference for mobile health data, we pursue exact inference using conjugate Bayesian models and employ predictive stacking to assimilate inference across these individual models. This circumvents issues with iterative estimation algorithms such as Markov chain Monte Carlo. We devise Bayesian predictive stacking in this context for models that treat time as discrete epochs and that treat time as continuous. We illustrate our methods with simulation experiments and analysis of data from the Physical Activity through Sustainable Transport Approaches (PASTA-LA) study conducted by the Fielding School of Public Health at the University of California, Los Angeles."}, "https://arxiv.org/abs/2405.10026": {"title": "The case for specifying the \"ideal\" target trial", "link": "https://arxiv.org/abs/2405.10026", "description": "arXiv:2405.10026v1 Announce Type: new \nAbstract: The target trial is an increasingly popular conceptual device for guiding the design and analysis of observational studies that seek to perform causal inference. As tends to occur with concepts like this, there is variability in how certain aspects of the approach are understood, which may lead to potentially consequential differences in how the approach is taught, implemented, and interpreted in practice. 
In this commentary, we provide a perspective on two of these aspects: how the target trial should be specified, and relatedly, how the target trial fits within a formal causal inference framework."}, "https://arxiv.org/abs/2405.10036": {"title": "Large-scale Data Integration using Matrix Denoising and Geometric Factor Matching", "link": "https://arxiv.org/abs/2405.10036", "description": "arXiv:2405.10036v1 Announce Type: new \nAbstract: Unsupervised integrative analysis of multiple data sources has become commonplace, and scalable algorithms are necessary to accommodate the ever-increasing availability of data. Only a few current methods have estimation speed as their focus, and those that do are only applicable to restricted data layouts such as different data types measured on the same observation units. We introduce a novel point of view on low-rank matrix integration phrased as a graph estimation problem which allows development of a method, large-scale Collective Matrix Factorization (lsCMF), which is able to integrate data in flexible layouts in a speedy fashion. It utilizes a matrix denoising framework for rank estimation and geometric properties of singular vectors to efficiently integrate data. The quick estimation speed of lsCMF while retaining good estimation of data structure is then demonstrated in simulation studies."}, "https://arxiv.org/abs/2405.10067": {"title": "Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layouts", "link": "https://arxiv.org/abs/2405.10067", "description": "arXiv:2405.10067v1 Announce Type: new \nAbstract: Interest in unsupervised methods for joint analysis of heterogeneous data sources has risen in recent years. Low-rank latent factor models have proven to be an effective tool for data integration and have been extended to a large number of data source layouts. Of particular interest is the separation of variation present in data sources into shared and individual subspaces. In addition, interpretability of estimated latent factors is crucial to further understanding.\n We present sparse and orthogonal low-rank Collective Matrix Factorization (solrCMF) to estimate low-rank latent factor models for flexible data layouts. These encompass traditional multi-view (one group, multiple data types) and multi-grid (multiple groups, multiple data types) layouts, as well as augmented layouts, which allow the inclusion of side information between data types or groups. In addition, solrCMF allows tensor-like layouts (repeated layers), estimates interpretable factors, and determines variation structure among factors and data sources.\n Using a penalized optimization approach, we automatically separate variability into the globally and partially shared as well as individual components and estimate sparse representations of factors. To further increase interpretability of factors, we enforce orthogonality between them. Estimation is performed efficiently in a recent multi-block ADMM framework which we adapted to support embedded manifold constraints.\n The performance of solrCMF is demonstrated in simulation studies and compares favorably to existing methods."}, "https://arxiv.org/abs/2405.10198": {"title": "Comprehensive Causal Machine Learning", "link": "https://arxiv.org/abs/2405.10198", "description": "arXiv:2405.10198v1 Announce Type: new \nAbstract: Uncovering causal effects at various levels of granularity provides substantial value to decision makers.
Comprehensive machine learning approaches to causal effect estimation allow the use of a single causal machine learning approach for estimation and inference of causal mean effects at all levels of granularity. Focusing on selection-on-observables, this paper compares three such approaches: the modified causal forest (mcf), the generalized random forest (grf), and double machine learning (dml). It also provides proven theoretical guarantees for the mcf and compares the theoretical properties of the approaches. The findings indicate that dml-based methods excel for average treatment effects at the population level (ATE) and group level (GATE) with few groups, when selection into treatment is not too strong. However, for finer causal heterogeneity, explicitly outcome-centred forest-based approaches are superior. The mcf has three additional benefits: (i) It is the most robust estimator in cases when dml-based approaches underperform because of substantial selectivity; (ii) it is the best estimator for GATEs when the number of groups gets larger; and (iii) it is the only estimator that is internally consistent, in the sense that low-dimensional causal ATEs and GATEs are obtained as aggregates of finer-grained causal parameters."}, "https://arxiv.org/abs/2405.10302": {"title": "Optimal Aggregation of Prediction Intervals under Unsupervised Domain Shift", "link": "https://arxiv.org/abs/2405.10302", "description": "arXiv:2405.10302v1 Announce Type: new \nAbstract: As machine learning models are increasingly deployed in dynamic environments, it becomes paramount to assess and quantify uncertainties associated with distribution shifts. A distribution shift occurs when the underlying data-generating process changes, leading to a deviation in the model's performance. The prediction interval, which captures the range of likely outcomes for a given prediction, serves as a crucial tool for characterizing uncertainties induced by their underlying distribution. In this paper, we propose methodologies for aggregating prediction intervals to obtain one with minimal width and adequate coverage on the target domain under unsupervised domain shift, under which we have labeled samples from a related source domain and unlabeled covariates from the target domain. Our analysis encompasses scenarios where the source and the target domain are related via i) a bounded density ratio, and ii) a measure-preserving transformation. Our proposed methodologies are computationally efficient and easy to implement. Beyond illustrating the performance of our method through a real-world dataset, we also delve into the theoretical details. This includes establishing rigorous theoretical guarantees, coupled with finite sample bounds, regarding the coverage and width of our prediction intervals. Our approach excels in practical applications and is underpinned by a solid theoretical framework, ensuring its reliability and effectiveness across diverse contexts."}, "https://arxiv.org/abs/2405.09596": {"title": "Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)", "link": "https://arxiv.org/abs/2405.09596", "description": "arXiv:2405.09596v1 Announce Type: cross \nAbstract: The prediction of ship trajectories is a growing field of study in artificial intelligence. Traditional methods rely on the use of LSTM, GRU networks, and even Transformer architectures for the prediction of spatio-temporal series.
This study proposes a viable alternative for predicting these trajectories using only GNSS positions. It considers this spatio-temporal problem as a natural language processing problem. The latitude/longitude coordinates of AIS messages are transformed into cell identifiers using the H3 index. Thanks to the pseudo-octal representation, it becomes easier for language models to learn the spatial hierarchy of the H3 index. The method is compared with a classical Kalman filter, widely used in the maritime domain, and introduces the Fr\\'echet distance as the main evaluation metric. We show that it is possible to predict ship trajectories quite precisely up to 8 hours with 30 minutes of context. We demonstrate that this alternative works well enough to predict trajectories worldwide."}, "https://arxiv.org/abs/2405.09989": {"title": "A Gaussian Process Model for Ordinal Data with Applications to Chemoinformatics", "link": "https://arxiv.org/abs/2405.09989", "description": "arXiv:2405.09989v1 Announce Type: cross \nAbstract: With the proliferation of screening tools for chemical testing, it is now possible to create vast databases of chemicals easily. However, rigorous statistical methodologies employed to analyse these databases are in their infancy, and further development to facilitate chemical discovery is imperative. In this paper, we present conditional Gaussian process models to predict ordinal outcomes from chemical experiments, where the inputs are chemical compounds. We implement the Tanimoto distance, a metric on the chemical space, within the covariance of the Gaussian processes to capture correlated effects in the chemical space. A novel aspect of our model is that the kernel contains a scaling parameter, a feature not previously examined in the literature, that controls the strength of the correlation between elements of the chemical space. Using molecular fingerprints, a numerical representation of a compound's location within the chemical space, we show that accounting for correlation amongst chemical compounds improves predictive performance over the uncorrelated model, where effects are assumed to be independent. Moreover, we present a genetic algorithm for the facilitation of chemical discovery and identification of important features to the compound's efficacy. A simulation study is conducted to demonstrate the suitability of the proposed methods. Our proposed methods are demonstrated on a hazard classification problem of organic solvents."}, "https://arxiv.org/abs/2208.13370": {"title": "A Consistent ICM-based $\\chi^2$ Specification Test", "link": "https://arxiv.org/abs/2208.13370", "description": "arXiv:2208.13370v2 Announce Type: replace \nAbstract: In spite of the omnibus property of Integrated Conditional Moment (ICM) specification tests, they are not commonly used in empirical practice owing to, e.g., the non-pivotality of the test and the high computational cost of available bootstrap schemes especially in large samples. This paper proposes specification and mean independence tests based on a class of ICM metrics termed the generalized martingale difference divergence (GMDD). The proposed tests exhibit consistency, asymptotic $\\chi^2$-distribution under the null hypothesis, and computational efficiency. Moreover, they demonstrate robustness to heteroskedasticity of unknown form and can be adapted to enhance power towards specific alternatives. A power comparison with classical bootstrap-based ICM tests using Bahadur slopes is also provided. 
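A small NumPy illustration related to the chemoinformatics GP entry above (arXiv:2405.09989): the Tanimoto (Jaccard) similarity between binary fingerprints, computed here on random bit vectors that stand in for real molecular fingerprints; the additional scaling parameter introduced in that paper is not reproduced.

```python
import numpy as np

# Tanimoto similarity matrix for binary fingerprints: <x, y> / (|x|^2 + |y|^2 - <x, y>).
rng = np.random.default_rng(6)
F = rng.integers(0, 2, size=(20, 128)).astype(float)        # 20 compounds, 128-bit fingerprints

def tanimoto(A, B):
    inner = A @ B.T
    na = (A * A).sum(axis=1)[:, None]
    nb = (B * B).sum(axis=1)[None, :]
    return inner / (na + nb - inner)

K = tanimoto(F, F)                                          # candidate GP covariance matrix
print(K.shape, np.round(K[0, :5], 2))
```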
Monte Carlo simulations are conducted to showcase the proposed tests' excellent size control and competitive power."}, "https://arxiv.org/abs/2309.04047": {"title": "Fully Latent Principal Stratification With Measurement Models", "link": "https://arxiv.org/abs/2309.04047", "description": "arXiv:2309.04047v2 Announce Type: replace \nAbstract: There is wide agreement on the importance of implementation data from randomized effectiveness studies in behavioral science; however, there are few methods available to incorporate these data into causal models, especially when they are multivariate or longitudinal, and interest is in low-dimensional summaries. We introduce a framework for studying how treatment effects vary between subjects who implement an intervention differently, combining principal stratification with latent variable measurement models; since principal strata are latent in both treatment arms, we call it \"fully-latent principal stratification\" or FLPS. We describe FLPS models including item-response-theory measurement, show that they are feasible in a simulation study, and illustrate them in an analysis of hint usage from a randomized study of computerized mathematics tutors."}, "https://arxiv.org/abs/2310.02278": {"title": "A Stable and Efficient Covariate-Balancing Estimator for Causal Survival Effects", "link": "https://arxiv.org/abs/2310.02278", "description": "arXiv:2310.02278v2 Announce Type: replace \nAbstract: We propose an empirically stable and asymptotically efficient covariate-balancing approach to the problem of estimating survival causal effects in data with conditionally-independent censoring. This addresses a challenge often encountered in state-of-the-art nonparametric methods: the use of inverses of small estimated probabilities and the resulting amplification of estimation error. We validate our theoretical results in experiments on synthetic and semi-synthetic data."}, "https://arxiv.org/abs/2312.04077": {"title": "When is Plasmode simulation superior to parametric simulation when estimating the MSE of the least squares estimator in linear regression?", "link": "https://arxiv.org/abs/2312.04077", "description": "arXiv:2312.04077v2 Announce Type: replace \nAbstract: Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest for researchers developing new methods and practitioners confronted with the choice of the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments using the combination of resampling feature data from a real-life dataset and generating the target variable with a known user-selected outcome-generating model (OGM), is an alternative that is often claimed to produce more realistic data. We compare parametric and Plasmode simulation for the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the OGM were known, parametric simulation would obviously be the best choice in terms of estimating the MSE well. However, in reality, both are usually unknown, so researchers have to make assumptions: in Plasmode simulation for the OGM, in parametric simulation for both DGP and OGM. Most likely, these assumptions do not exactly reflect the truth. 
Here, we aim to find out how assumptions deviating from the true DGP and the true OGM affect the performance of parametric and Plasmode simulations in the context of MSE estimation for the LSE and in which situations which simulation type is preferable. Our results suggest that the preferable simulation method depends on many factors, including the number of features, and on how and to what extent the assumptions of a parametric simulation differ from the true DGP. Also, the resampling strategy used for Plasmode influences the results. In particular, subsampling with a small sampling proportion can be recommended."}, "https://arxiv.org/abs/2312.12641": {"title": "Robust Point Matching with Distance Profiles", "link": "https://arxiv.org/abs/2312.12641", "description": "arXiv:2312.12641v2 Announce Type: replace \nAbstract: While matching procedures based on pairwise distances are conceptually appealing and thus favored in practice, theoretical guarantees for such procedures are rarely found in the literature. We propose and analyze matching procedures based on distance profiles that are easily implementable in practice, showing these procedures are robust to outliers and noise. We demonstrate the performance of the proposed method using a real data example and provide simulation studies to complement the theoretical findings."}, "https://arxiv.org/abs/2203.15945": {"title": "A Framework for Improving the Reliability of Black-box Variational Inference", "link": "https://arxiv.org/abs/2203.15945", "description": "arXiv:2203.15945v2 Announce Type: replace-cross \nAbstract: Black-box variational inference (BBVI) now sees widespread use in machine learning and statistics as a fast yet flexible alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, stochastic optimization methods for BBVI remain unreliable and require substantial expertise and hand-tuning to apply effectively. In this paper, we propose Robust and Automated Black-box VI (RABVI), a framework for improving the reliability of BBVI optimization. RABVI is based on rigorously justified automation techniques, includes just a small number of intuitive tuning parameters, and detects inaccurate estimates of the optimal variational approximation. RABVI adaptively decreases the learning rate by detecting convergence of the fixed--learning-rate iterates, then estimates the symmetrized Kullback--Leibler (KL) divergence between the current variational approximation and the optimal one. It also employs a novel optimization termination criterion that enables the user to balance desired accuracy against computational cost by comparing (i) the predicted relative decrease in the symmetrized KL divergence if a smaller learning rate were used and (ii) the predicted computation required to converge with the smaller learning rate. We validate the robustness and accuracy of RABVI through carefully designed simulation studies and on a diverse set of real-world model and data examples."}, "https://arxiv.org/abs/2209.09936": {"title": "Solving Fredholm Integral Equations of the First Kind via Wasserstein Gradient Flows", "link": "https://arxiv.org/abs/2209.09936", "description": "arXiv:2209.09936v3 Announce Type: replace-cross \nAbstract: Solving Fredholm equations of the first kind is crucial in many areas of the applied sciences. In this work we adopt a probabilistic and variational point of view by considering a minimization problem in the space of probability measures with an entropic regularization.
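A compact Plasmode-style replication loop in the sense of the simulation-comparison entry above (arXiv:2312.04077): feature rows are resampled from a (here synthetic) stand-in for a real dataset, the outcome is generated from a known OGM, and the MSE of the least squares estimator is averaged over replications. The OGM coefficients and noise level are illustrative assumptions.

```python
import numpy as np

# Plasmode-style MSE experiment for the least squares estimator.
rng = np.random.default_rng(7)
X_real = rng.standard_t(df=5, size=(500, 4))      # synthetic stand-in for real feature data
beta = np.array([1.0, -0.5, 0.0, 2.0])            # known, user-selected OGM coefficients

def one_replication(n=200):
    rows = rng.choice(X_real.shape[0], size=n, replace=True)   # resample feature rows
    X = X_real[rows]
    y = X @ beta + rng.normal(size=n)                          # generate outcome from the OGM
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]            # least squares estimator
    return np.sum((beta_hat - beta) ** 2)

print(np.mean([one_replication() for _ in range(1000)]))       # estimated MSE of the LSE
```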
Contrary to classical approaches which discretize the domain of the solutions, we introduce an algorithm to asymptotically sample from the unique solution of the regularized minimization problem. As a result our estimators do not depend on any underlying grid and have better scalability properties than most existing methods. Our algorithm is based on a particle approximation of the solution of a McKean--Vlasov stochastic differential equation associated with the Wasserstein gradient flow of our variational formulation. We prove the convergence towards a minimizer and provide practical guidelines for its numerical implementation. Finally, our method is compared with other approaches on several examples including density deconvolution and epidemiology."}, "https://arxiv.org/abs/2405.10371": {"title": "Causal Discovery in Multivariate Extremes with a Hydrological Analysis of Swiss River Discharges", "link": "https://arxiv.org/abs/2405.10371", "description": "arXiv:2405.10371v1 Announce Type: new \nAbstract: Causal asymmetry is based on the principle that an event is a cause only if its absence would not have been a cause. From there, uncovering causal effects becomes a matter of comparing a well-defined score in both directions. Motivated by studying causal effects at extreme levels of a multivariate random vector, we propose to construct a model-agnostic causal score relying solely on the assumption of the existence of a max-domain of attraction. Based on a representation of a Generalized Pareto random vector, we construct the causal score as the Wasserstein distance between the margins and a well-specified random variable. The proposed methodology is illustrated on a hydrologically simulated dataset of different characteristics of catchments in Switzerland: discharge, precipitation, and snowmelt."}, "https://arxiv.org/abs/2405.10449": {"title": "Optimal Text-Based Time-Series Indices", "link": "https://arxiv.org/abs/2405.10449", "description": "arXiv:2405.10449v1 Announce Type: new \nAbstract: We propose an approach to construct text-based time-series indices in an optimal way--typically, indices that maximize the contemporaneous relation or the predictive performance with respect to a target variable, such as inflation. We illustrate our methodology with a corpus of news articles from the Wall Street Journal by optimizing text-based indices focusing on tracking the VIX index and inflation expectations. Our results highlight the superior performance of our approach compared to existing indices."}, "https://arxiv.org/abs/2405.10461": {"title": "Prediction in Measurement Error Models", "link": "https://arxiv.org/abs/2405.10461", "description": "arXiv:2405.10461v1 Announce Type: new \nAbstract: We study the well known difficult problem of prediction in measurement error models. By targeting directly at the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval which achieves a pre-determined prediction level. The constructing procedure requires a working model for the distribution of the variable prone to error. If the working model is correct, the prediction interval estimator obtains the smallest variability in terms of assessing the true center and length. If the working model is incorrect, the prediction interval estimation is still consistent. 
We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining minimal prediction interval length. Numerical experiments are conducted to illustrate the performance, and we apply our method to predict the concentration of Abeta1-12 in cerebrospinal fluid in an Alzheimer's disease dataset."}, "https://arxiv.org/abs/2405.10490": {"title": "Neural Optimization with Adaptive Heuristics for Intelligent Marketing System", "link": "https://arxiv.org/abs/2405.10490", "description": "arXiv:2405.10490v1 Announce Type: new \nAbstract: Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing."}, "https://arxiv.org/abs/2405.10527": {"title": "Hawkes Models And Their Applications", "link": "https://arxiv.org/abs/2405.10527", "description": "arXiv:2405.10527v1 Announce Type: new \nAbstract: The Hawkes process is a model for counting the number of arrivals to a system which exhibits the self-exciting property - that one arrival creates a heightened chance of further arrivals in the near future. The model, and its generalizations, have been applied in a plethora of disparate domains, though two particularly developed applications are in seismology and in finance. As the original model is elegantly simple, generalizations have been proposed which: track marks for each arrival, are multivariate, have a spatial component, are driven by renewal processes, treat time as discrete, and so on. This paper creates a cohesive review of the traditional Hawkes model and the modern generalizations, providing details on their construction, simulation algorithms, and giving key references to the appropriate literature for a detailed treatment."}, "https://arxiv.org/abs/2405.10539": {"title": "Overcoming Medical Overuse with AI Assistance: An Experimental Investigation", "link": "https://arxiv.org/abs/2405.10539", "description": "arXiv:2405.10539v1 Announce Type: new \nAbstract: This study evaluates the effectiveness of Artificial Intelligence (AI) in mitigating medical overtreatment, a significant issue characterized by unnecessary interventions that inflate healthcare costs and pose risks to patients.
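To make the self-exciting structure surveyed in the Hawkes entry above (arXiv:2405.10527) concrete, a standard Ogata thinning simulation of a univariate Hawkes process with an exponential kernel; the parameter values are arbitrary, and the branching ratio alpha is kept below one so the process is stable.

```python
import numpy as np

# Ogata thinning for a univariate Hawkes process with exponential excitation:
# lambda(t) = mu + sum_{t_i < t} alpha * beta * exp(-beta * (t - t_i)).
def intensity(t, events, mu, alpha, beta):
    past = np.asarray([s for s in events if s < t])
    return mu + alpha * beta * np.exp(-beta * (t - past)).sum()

def simulate_hawkes(mu=0.5, alpha=0.6, beta=1.5, horizon=200.0, seed=8):
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        # Valid upper bound on the intensity until the next accepted arrival.
        lam_bar = intensity(t, events, mu, alpha, beta) + alpha * beta
        t += rng.exponential(1.0 / lam_bar)
        if t >= horizon:
            break
        if rng.uniform() * lam_bar <= intensity(t, events, mu, alpha, beta):
            events.append(t)            # accepted arrival excites future intensity
    return np.array(events)

arrivals = simulate_hawkes()
print(len(arrivals), np.round(arrivals[:5], 2))
```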
We conducted a lab-in-the-field experiment at a medical school, utilizing a novel medical prescription task, manipulating monetary incentives and the availability of AI assistance among medical students using a three-by-two factorial design. We tested three incentive schemes: Flat (constant pay regardless of treatment quantity), Progressive (pay increases with the number of treatments), and Regressive (penalties for overtreatment) to assess their influence on the adoption and effectiveness of AI assistance. Our findings demonstrate that AI significantly reduced overtreatment rates by up to 62% in the Regressive incentive conditions where (prospective) physician and patient interests were most aligned. Diagnostic accuracy improved by 17% to 37%, depending on the incentive scheme. Adoption of AI advice was high, with approximately half of the participants modifying their decisions based on AI input across all settings. For policy implications, we quantified the monetary (57%) and non-monetary (43%) incentives of overtreatment and highlighted AI's potential to mitigate non-monetary incentives and enhance social welfare. Our results provide valuable insights for healthcare administrators considering AI integration into healthcare systems."}, "https://arxiv.org/abs/2405.10655": {"title": "Macroeconomic Factors, Industrial Indexes and Bank Spread in Brazil", "link": "https://arxiv.org/abs/2405.10655", "description": "arXiv:2405.10655v1 Announce Type: new \nAbstract: The main objective of this paper is to identify which macroeconomic factors and industrial indexes influenced the total Brazilian banking spread between March 2011 and March 2015. This paper considers subclassification of industrial activities in Brazil. Monthly time series data were used in multivariate linear regression models using Eviews (7.0). Eighteen variables were considered as candidates to be determinants. Variables which positively influenced bank spread are: Default, IPIs (Industrial Production Indexes) for capital goods, intermediate goods, durable consumer goods, semi-durable and non-durable goods, the Selic, GDP, unemployment rate and EMBI+. Variables which influence negatively are: Consumer and general consumer goods IPIs, IPCA, the balance of the loan portfolio and the retail sales index. A p-value of 5% was considered. The main conclusion of this work is that the progress of industry, job creation and consumption can reduce bank spread. Keywords: Credit. Bank spread. Macroeconomics. Industrial Production Indexes. Finance."}, "https://arxiv.org/abs/2405.10719": {"title": "$\\ell_1$-Regularized Generalized Least Squares", "link": "https://arxiv.org/abs/2405.10719", "description": "arXiv:2405.10719v1 Announce Type: new \nAbstract: In this paper we propose an $\\ell_1$-regularized GLS estimator for high-dimensional regressions with potentially autocorrelated errors. We establish non-asymptotic oracle inequalities for estimation accuracy in a framework that allows for highly persistent autoregressive errors. In practice, the whitening matrix required to implement the GLS is unknown; we present a feasible estimator for this matrix, derive consistency results and ultimately show how our proposed feasible GLS can closely recover the optimal performance (as if the errors were white noise) of the LASSO.
A simulation study verifies the performance of the proposed method, demonstrating that the penalized (feasible) GLS-LASSO estimator performs on par with the LASSO in the case of white noise errors, whilst outperforming it in terms of sign-recovery and estimation error when the errors exhibit significant correlation."}, "https://arxiv.org/abs/2405.10742": {"title": "Efficient Sampling in Disease Surveillance through Subpopulations: Sampling Canaries in the Coal Mine", "link": "https://arxiv.org/abs/2405.10742", "description": "arXiv:2405.10742v1 Announce Type: new \nAbstract: We consider disease outbreak detection settings where the population under study consists of various subpopulations available for stratified surveillance. These subpopulations can for example be based on age cohorts, but may also correspond to other subgroups of the population under study such as international travellers. Rather than sampling uniformly over the entire population, one may elevate the effectiveness of the detection methodology by optimally choosing a subpopulation for sampling. We show (under some assumptions) the relative sampling efficiency between two subpopulations is inversely proportional to the ratio of their respective baseline disease risks. This leads to a considerable potential increase in sampling efficiency when sampling from the subpopulation with higher baseline disease risk, if the two subpopulation baseline risks differ strongly. Our mathematical results require a careful treatment of the power curves of exact binomial tests as a function of their sample size, which are erratic and non-monotonic due to the discreteness of the underlying distribution. Subpopulations with comparatively high baseline disease risk are typically in greater contact with health professionals, and thus when sampled for surveillance purposes this is typically motivated merely through a convenience argument. With this study, we aim to elevate the status of such \"convenience surveillance\" to optimal subpopulation surveillance."}, "https://arxiv.org/abs/2405.10769": {"title": "Efficient estimation of target population treatment effect from multiple source trials under effect-measure transportability", "link": "https://arxiv.org/abs/2405.10769", "description": "arXiv:2405.10769v1 Announce Type: new \nAbstract: When the marginal causal effect comparing the same treatment pair is available from multiple trials, we wish to transport all results to make inference on the target population effect. To account for the differences between populations, statistical analysis is often performed controlling for relevant variables. However, when transportability assumptions are placed on conditional causal effects, rather than the distribution of potential outcomes, we need to carefully choose these effect measures. In particular, we present identifiability results in two cases: target population average treatment effect for a continuous outcome and causal mean ratio for a positive outcome. We characterize the semiparametric efficiency bounds of the causal effects under the respective transportability assumptions and propose estimators that are doubly robust against model misspecifications. 
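The subpopulation-surveillance entry above turns on the power curves of exact binomial tests, which are erratic and non-monotonic in the sample size because of discreteness, and on comparing those curves across baseline risks. A minimal Python sketch of such a power curve follows; the baseline risk p0 and the outbreak multiplier are made-up illustration values, not figures from the paper.

import numpy as np
from scipy.stats import binom

def exact_binomial_power(n, p0, p1, alpha=0.05):
    """Power of the one-sided exact binomial test of H0: p = p0 against p = p1 > p0."""
    # Smallest critical value k with P(X >= k | p0) <= alpha.
    k = int(binom.ppf(1 - alpha, n, p0)) + 1
    return binom.sf(k - 1, n, p1)  # P(X >= k | p1)

# The printed power is non-monotonic in n due to the discrete critical value.
p0, lift = 0.002, 3.0  # hypothetical baseline risk and outbreak risk multiplier
for n in range(500, 2501, 500):
    print(n, round(exact_binomial_power(n, p0, lift * p0), 3))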
We highlight an important discussion on the tension between the non-collapsibility of conditional effects and the variational independence induced by transportability in the case of multiple source trials."}, "https://arxiv.org/abs/2405.10773": {"title": "Proximal indirect comparison", "link": "https://arxiv.org/abs/2405.10773", "description": "arXiv:2405.10773v1 Announce Type: new \nAbstract: We consider the problem of indirect comparison, where a treatment arm of interest is absent by design in the target randomized control trial (RCT) but available in a source RCT. The identifiability of the target population average treatment effect often relies on conditional transportability assumptions. However, it is a common concern whether all relevant effect modifiers are measured and controlled for. We highlight a new proximal identification result in the presence of shifted, unobserved effect modifiers based on proxies: an adjustment proxy in both RCTs and an additional reweighting proxy in the source RCT. We propose an estimator which is doubly-robust against misspecifications of the so-called bridge functions and asymptotically normal under mild consistency of the nuisance models. An alternative estimator is presented to accommodate missing outcomes in the source RCT, which we then apply to conduct a proximal indirect comparison analysis using two weight management trials."}, "https://arxiv.org/abs/2405.10925": {"title": "High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates", "link": "https://arxiv.org/abs/2405.10925", "description": "arXiv:2405.10925v1 Announce Type: new \nAbstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. 
HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors."}, "https://arxiv.org/abs/2405.10469": {"title": "Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions", "link": "https://arxiv.org/abs/2405.10469", "description": "arXiv:2405.10469v1 Announce Type: cross \nAbstract: The development of open benchmarking platforms could greatly accelerate the adoption of AI agents in retail. This paper presents comprehensive simulations of customer shopping behaviors for the purpose of benchmarking reinforcement learning (RL) agents that optimize coupon targeting. The difficulty of this learning problem is largely driven by the sparsity of customer purchase events. We trained agents using offline batch data comprising summarized customer purchase histories to help mitigate this effect. Our experiments revealed that contextual bandit and deep RL methods that are less prone to over-fitting the sparse reward distributions significantly outperform static policies. This study offers a practical framework for simulating AI agents that optimize the entire retail customer journey. It aims to inspire the further development of simulation tools for retail AI systems."}, "https://arxiv.org/abs/2405.10795": {"title": "Non trivial optimal sampling rate for estimating a Lipschitz-continuous function in presence of mean-reverting Ornstein-Uhlenbeck noise", "link": "https://arxiv.org/abs/2405.10795", "description": "arXiv:2405.10795v1 Announce Type: cross \nAbstract: We examine a mean-reverting Ornstein-Uhlenbeck process that perturbs an unknown Lipschitz-continuous drift and aim to estimate the drift's value at a predetermined time horizon by sampling the path of the process. Due to the time varying nature of the drift we propose an estimation procedure that involves an online, time-varying optimization scheme implemented using a stochastic gradient ascent algorithm to maximize the log-likelihood of our observations. The objective of the paper is to investigate the optimal sample size/rate for achieving the minimum mean square distance between our estimator and the true value of the drift. In this setting we uncover a trade-off between the correlation of the observations, which increases with the sample size, and the dynamic nature of the unknown drift, which is weakened by increasing the frequency of observation. The mean square error is shown to be non monotonic in the sample size, attaining a global minimum whose precise description depends on the parameters that govern the model. In the static case, i.e. when the unknown drift is constant, our method outperforms the arithmetic mean of the observations in highly correlated regimes, despite the latter being a natural candidate estimator. We then compare our online estimator with the global maximum likelihood estimator."}, "https://arxiv.org/abs/2110.07051": {"title": "Fast and Scalable Inference for Spatial Extreme Value Models", "link": "https://arxiv.org/abs/2110.07051", "description": "arXiv:2110.07051v3 Announce Type: replace \nAbstract: The generalized extreme value (GEV) distribution is a popular model for analyzing and forecasting extreme weather data. To increase prediction accuracy, spatial information is often pooled via a latent Gaussian process (GP) on the GEV parameters. 
Inference for GEV-GP models is typically carried out using Markov chain Monte Carlo (MCMC) methods, or using approximate inference methods such as the integrated nested Laplace approximation (INLA). However, MCMC becomes prohibitively slow as the number of spatial locations increases, whereas INLA is only applicable in practice to a limited subset of GEV-GP models. In this paper, we revisit the original Laplace approximation for fitting spatial GEV models. In combination with a popular sparsity-inducing spatial covariance approximation technique, we show through simulations that our approach accurately estimates the Bayesian predictive distribution of extreme weather events, is scalable to several thousand spatial locations, and is several orders of magnitude faster than MCMC. A case study in forecasting extreme snowfall across Canada is presented."}, "https://arxiv.org/abs/2112.08934": {"title": "Lassoed Boosting and Linear Prediction in the Equities Market", "link": "https://arxiv.org/abs/2112.08934", "description": "arXiv:2112.08934v3 Announce Type: replace \nAbstract: We consider a two-stage estimation method for linear regression. First, it uses the lasso in Tibshirani (1996) to screen variables and, second, re-estimates the coefficients using the least-squares boosting method in Friedman (2001) on every set of selected variables. Based on the large-scale simulation experiment in Hastie et al. (2020), lassoed boosting performs as well as the relaxed lasso in Meinshausen (2007) and, under certain scenarios, can yield a sparser model. Applied to predicting equity returns, lassoed boosting gives the smallest mean-squared prediction error compared to several other methods."}, "https://arxiv.org/abs/2210.05983": {"title": "Model-based clustering in simple hypergraphs through a stochastic blockmodel", "link": "https://arxiv.org/abs/2210.05983", "description": "arXiv:2210.05983v3 Announce Type: replace \nAbstract: We propose a model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochastic blockmodel for graphs and assumes the existence of latent node groups and hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational approximation Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection.\n To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods using synthetic data generated from the model, as well as from a line clustering experiment and a co-authorship dataset."}, "https://arxiv.org/abs/2210.09560": {"title": "A Bayesian Convolutional Neural Network-based Generalized Linear Model", "link": "https://arxiv.org/abs/2210.09560", "description": "arXiv:2210.09560v3 Announce Type: replace \nAbstract: Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. 
To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear models (GLMs) framework. We use extracted nodes from the last hidden layer of CNN with Monte Carlo (MC) dropout as informative covariates in GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems, which have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm can be broadly applicable to image regressions or correlated data analysis by enabling accurate Bayesian inference quickly."}, "https://arxiv.org/abs/2306.08485": {"title": "Graph-Aligned Random Partition Model (GARP)", "link": "https://arxiv.org/abs/2306.08485", "description": "arXiv:2306.08485v2 Announce Type: replace \nAbstract: Bayesian nonparametric mixtures and random partition models are powerful tools for probabilistic clustering. However, standard independent mixture models can be restrictive in some applications such as inference on cell lineage due to the biological relations of the clusters. The increasing availability of large genomic data requires new statistical tools to perform model-based clustering and infer the relationship between homogeneous subgroups of units. Motivated by single-cell RNA applications we develop a novel dependent mixture model to jointly perform cluster analysis and align the clusters on a graph. Our flexible graph-aligned random partition model (GARP) exploits Gibbs-type priors as building blocks, allowing us to derive analytical results on the graph-aligned random partition's probability mass function (pmf). We derive a generalization of the Chinese restaurant process from the pmf and a related efficient and neat MCMC algorithm to perform Bayesian inference. We perform posterior inference on real single-cell RNA data from mice stem cells. We further investigate the performance of our model in capturing the underlying clustering structure as well as the underlying graph by means of simulation studies."}, "https://arxiv.org/abs/2306.09555": {"title": "Geometric-Based Pruning Rules For Change Point Detection in Multiple Independent Time Series", "link": "https://arxiv.org/abs/2306.09555", "description": "arXiv:2306.09555v2 Announce Type: replace \nAbstract: We consider the problem of detecting multiple changes in multiple independent time series. The search for the best segmentation can be expressed as a minimization problem over a given cost function. We focus on dynamic programming algorithms that solve this problem exactly. When the number of changes is proportional to data length, an inequality-based pruning rule encoded in the PELT algorithm leads to a linear time complexity. Another type of pruning, called functional pruning, gives a close-to-linear time complexity whatever the number of changes, but only for the analysis of univariate time series.\n We propose a few extensions of functional pruning for multiple independent time series based on the use of simple geometric shapes (balls and hyperrectangles). 
We focus on the Gaussian case, but some of our rules can be easily extended to the exponential family. In a simulation study we compare the computational efficiency of different geometric-based pruning rules. We show that for small dimensions (2, 3, 4) some of them ran significantly faster than inequality-based approaches in particular when the underlying number of changes is small compared to the data length."}, "https://arxiv.org/abs/2306.15075": {"title": "Differences in academic preparedness do not fully explain Black-White enrollment disparities in advanced high school coursework", "link": "https://arxiv.org/abs/2306.15075", "description": "arXiv:2306.15075v2 Announce Type: replace \nAbstract: Whether racial disparities in enrollment in advanced high school coursework can be attributed to differences in prior academic preparation is a central question in sociological research and education policy. However, previous investigations face methodological limitations, for they compare race-specific enrollment rates of students after adjusting for characteristics only partially related to their academic preparedness for advanced coursework. Informed by a recently-developed statistical technique, we propose and estimate a novel measure of students' academic preparedness and use administrative data from the New York City Department of Education to measure differences in AP mathematics enrollment rates among similarly prepared students of different races. We find that preexisting differences in academic preparation do not fully explain the under-representation of Black students relative to White students in AP mathematics. Our results imply that achieving equal opportunities for AP enrollment not only requires equalizing earlier academic experiences, but also addressing inequities that emerge from coursework placement processes."}, "https://arxiv.org/abs/2307.09864": {"title": "Asymptotic equivalence of Principal Components and Quasi Maximum Likelihood estimators in Large Approximate Factor Models", "link": "https://arxiv.org/abs/2307.09864", "description": "arXiv:2307.09864v4 Announce Type: replace \nAbstract: We provide an alternative derivation of the asymptotic results for the Principal Components estimator of a large approximate factor model. Results are derived under a minimal set of assumptions and, in particular, we require only the existence of 4th order moments. A special focus is given to the time series setting, a case considered in almost all recent econometric applications of factor models. Hence, estimation is based on the classical $n\\times n$ sample covariance matrix and not on a $T\\times T$ covariance matrix often considered in the literature. Indeed, despite the two approaches being asymptotically equivalent, the former is more coherent with a time series setting and it immediately allows us to write more intuitive asymptotic expansions for the Principal Component estimators showing that they are equivalent to OLS as long as $\\sqrt n/T\\to 0$ and $\\sqrt T/n\\to 0$, that is the loadings are estimated in a time series regression as if the factors were known, while the factors are estimated in a cross-sectional regression as if the loadings were known. 
Finally, we give some alternative sets of primitive sufficient conditions for mean-squared consistency of the sample covariance matrix of the factors, of the idiosyncratic components, and of the observed time series, which is the starting point for Principal Component Analysis."}, "https://arxiv.org/abs/2311.03644": {"title": "BOB: Bayesian Optimized Bootstrap for Uncertainty Quantification in Gaussian Mixture Models", "link": "https://arxiv.org/abs/2311.03644", "description": "arXiv:2311.03644v2 Announce Type: replace \nAbstract: A natural way to quantify uncertainties in Gaussian mixture models (GMMs) is through Bayesian methods. That said, sampling from the joint posterior distribution of GMMs via standard Markov chain Monte Carlo (MCMC) imposes several computational challenges, which have prevented a broader full Bayesian implementation of these models. A growing body of literature has introduced the Weighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as alternatives to MCMC sampling. The core idea of these methods is to repeatedly compute maximum a posteriori (MAP) estimates on many randomly weighted posterior densities. These MAP estimates then can be treated as approximate posterior draws. Nonetheless, a central question remains unanswered: How to select the random weights under arbitrary sample sizes. We, therefore, introduce the Bayesian Optimized Bootstrap (BOB), a computational method to automatically select these random weights by minimizing, through Bayesian Optimization, a black-box and noisy version of the reverse Kullback-Leibler (KL) divergence between the Bayesian posterior and an approximate posterior obtained via random weighting. Our proposed method outperforms competing approaches in recovering the Bayesian posterior, it provides a better uncertainty quantification, and it retains key asymptotic properties from existing methods. BOB's performance is demonstrated through extensive simulations, along with real-world data analyses."}, "https://arxiv.org/abs/2311.05794": {"title": "An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits", "link": "https://arxiv.org/abs/2311.05794", "description": "arXiv:2311.05794v2 Announce Type: replace \nAbstract: In multi-armed bandit (MAB) experiments, it is often advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. We develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit experiments that produces powerful and anytime-valid inference on the ATE for \\emph{any} bandit algorithm of the experimenter's choice, even those without probabilistic treatment assignment. Intuitively, the MAD \"mixes\" any bandit algorithm of the experimenter's choice with a Bernoulli design through a tuning parameter $\\delta_t$, where $\\delta_t$ is a deterministic sequence that decreases the priority placed on the Bernoulli design as the sample size grows. We prove that for $\\delta_t = \\omega\\left(t^{-1/4}\\right)$, the MAD generates anytime-valid asymptotic confidence sequences that are guaranteed to shrink around the true ATE. Hence, the experimenter is guaranteed to detect a true non-zero treatment effect in finite time. Additionally, we prove that the regret of the MAD approaches that of its underlying bandit algorithm over time, and hence, incurs a relatively small loss in regret in return for powerful inferential guarantees. 
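One natural reading of the MAD assignment step described above is a per-period mixture of the experimenter's bandit assignment probability with a Bernoulli(1/2) design, weighted by the deterministic sequence delta_t. The following Python sketch illustrates that reading for a two-armed experiment; the function name, the bandit_prob argument, and the specific choice delta_t = t^(-0.24) (which satisfies the omega(t^{-1/4}) condition) are assumptions for illustration, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)

def mad_assignment(t, bandit_prob, exponent=0.24):
    """One MAD treatment assignment for a two-armed experiment.

    t           : 1-based time index
    bandit_prob : probability of assigning arm 1 proposed by any bandit algorithm
    exponent    : delta_t = t**(-exponent); exponent < 1/4 keeps delta_t = omega(t^{-1/4})
    """
    delta_t = t ** (-exponent)
    # With weight delta_t follow the Bernoulli(1/2) design, otherwise the bandit.
    p_treat = delta_t * 0.5 + (1.0 - delta_t) * bandit_prob
    return int(rng.random() < p_treat), p_treat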
Finally, we conduct an extensive simulation study exhibiting that the MAD achieves finite-sample anytime validity and high power without significant losses in finite-sample reward."}, "https://arxiv.org/abs/2312.10563": {"title": "Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration", "link": "https://arxiv.org/abs/2312.10563", "description": "arXiv:2312.10563v2 Announce Type: replace \nAbstract: Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variations as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide association study (GWAS) summary data, and provide solid statistical guarantees. Our framework employs carefully crafted estimating equations, allowing for different sets of genetic variations to instrument the exposure and the mediator, to efficiently integrate information stored in three independent GWAS. As part of this endeavor, we demonstrate that in mediation analysis, the challenge raised by instrument selection goes beyond the well-known winner's curse issue, and therefore, addressing it requires special treatment. We then develop bias correction techniques to address the instrument selection issue and commonly encountered measurement error bias issue. Collectively, through our theoretical investigations, we show that our framework provides valid statistical inference for both direct and mediation effects with enhanced statistical efficiency compared to existing methods. We further illustrate the finite-sample performance of our approach through simulation experiments and a case study."}, "https://arxiv.org/abs/2003.05492": {"title": "An asymptotic Peskun ordering and its application to lifted samplers", "link": "https://arxiv.org/abs/2003.05492", "description": "arXiv:2003.05492v5 Announce Type: replace-cross \nAbstract: A Peskun ordering between two samplers, implying a dominance of one over the other, is known among the Markov chain Monte Carlo community for being a remarkably strong result. It is however also known for being a result that is notably difficult to establish. Indeed, one has to prove that the probability to reach a state $\\mathbf{y}$ from a state $\\mathbf{x}$, using a sampler, is greater than or equal to the probability using the other sampler, and this must hold for all pairs $(\\mathbf{x}, \\mathbf{y})$ such that $\\mathbf{x} \\neq \\mathbf{y}$. We provide in this paper a weaker version that does not require an inequality between the probabilities for all these states: essentially, the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the probabilities are greater than or equal to belong to a mass-concentrating set. The weak ordering turns out to be useful to compare lifted samplers for partially-ordered discrete state-spaces with their Metropolis--Hastings counterparts. An analysis in great generality yields a qualitative conclusion: they asymptotically perform better in certain situations (and we are able to identify them), but not necessarily in others (and the reasons why are made clear). 
A quantitative study in a specific context of graphical-model simulation is also conducted."}, "https://arxiv.org/abs/2311.00905": {"title": "Data-driven fixed-point tuning for truncated realized variations", "link": "https://arxiv.org/abs/2311.00905", "description": "arXiv:2311.00905v2 Announce Type: replace-cross \nAbstract: Many methods for estimating integrated volatility and related functionals of semimartingales in the presence of jumps require specification of tuning parameters for their use in practice. In much of the available theory, tuning parameters are assumed to be deterministic and their values are specified only up to asymptotic constraints. However, in empirical work and in simulation studies, they are typically chosen to be random and data-dependent, with explicit choices often relying entirely on heuristics. In this paper, we consider novel data-driven tuning procedures for the truncated realized variations of a semimartingale with jumps based on a type of random fixed-point iteration. Being effectively automated, our approach alleviates the need for delicate decision-making regarding tuning parameters in practice and can be implemented using information regarding sampling frequency alone. We show our methods can lead to asymptotically efficient estimation of integrated volatility and exhibit superior finite-sample performance compared to popular alternatives in the literature."}, "https://arxiv.org/abs/2312.07792": {"title": "Differentially private projection-depth-based medians", "link": "https://arxiv.org/abs/2312.07792", "description": "arXiv:2312.07792v2 Announce Type: replace-cross \nAbstract: We develop $(\\epsilon,\\delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure (e.g., we do not assume any moment bounds), we quantify the probability that the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We then present a new definition of the finite sample breakdown point which applies to a mechanism, and present a lower bound on the finite sample breakdown point of the projection-depth-based median. We demonstrate our main results on the canonical projection-depth-based median, as well as on projection-depth-based medians derived from trimmed estimators. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation. In the Cauchy setting, we show that the \"outlier error amplification\" effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics, which may be of general interest."}, "https://arxiv.org/abs/2405.11081": {"title": "What are You Weighting For? Improved Weights for Gaussian Mixture Filtering With Application to Cislunar Orbit Determination", "link": "https://arxiv.org/abs/2405.11081", "description": "arXiv:2405.11081v1 Announce Type: new \nAbstract: This work focuses on the critical aspect of accurate weight computation during the measurement incorporation phase of Gaussian mixture filters. The proposed novel approach computes weights by linearizing the measurement model about each component's posterior estimate rather than the prior, as traditionally done. 
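The weight-computation idea in the Gaussian mixture filtering entry above (evaluate each component's measurement likelihood with the measurement model linearized at the component's posterior estimate instead of its prior) can be sketched with standard EKF ingredients. The sketch below is an illustration built from textbook Gaussian-mixture and extended Kalman filter formulas under that reading, not the paper's exact algorithm; all function and argument names are assumptions.

import numpy as np
from numpy.linalg import inv
from scipy.stats import multivariate_normal

def ekf_update(z, m_prior, P_prior, h, H_jac, R):
    """Standard EKF update used to obtain one component's posterior mean."""
    H = H_jac(m_prior)
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ inv(S)
    m_post = m_prior + K @ (z - h(m_prior))
    P_post = P_prior - K @ S @ K.T
    return m_post, P_post

def component_weight(z, m_prior, P_prior, w_prior, h, H_jac, R, linearize_at):
    """Un-normalized weight of one mixture component after observing z.

    linearize_at: expansion point for h, either the prior mean (traditional
    weighting) or the component's posterior mean (the proposed alternative).
    """
    H = H_jac(linearize_at)
    # Predicted measurement under h(x) ~= h(x*) + H (x - x*).
    z_pred = h(linearize_at) + H @ (m_prior - linearize_at)
    S = H @ P_prior @ H.T + R
    return w_prior * multivariate_normal.pdf(z, mean=z_pred, cov=S)

# Usage sketch: run ekf_update per component, pass the posterior mean as
# linearize_at, then normalize the weights so they sum to one across components.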
This work proves equivalence with traditional methods for linear models, provides novel sigma-point extensions to the traditional and proposed methods, and empirically demonstrates improved performance in nonlinear cases. Two illustrative examples, the Avocado and a cislunar single target tracking scenario, serve to highlight the advantages of the new weight computation technique by analyzing filter accuracy and consistency through varying the number of Gaussian mixture components."}, "https://arxiv.org/abs/2405.11111": {"title": "Euclidean mirrors and first-order changepoints in network time series", "link": "https://arxiv.org/abs/2405.11111", "description": "arXiv:2405.11111v1 Announce Type: new \nAbstract: We describe a model for a network time series whose evolution is governed by an underlying stochastic process, known as the latent position process, in which network evolution can be represented in Euclidean space by a curve, called the Euclidean mirror. We define the notion of a first-order changepoint for a time series of networks, and construct a family of latent position process networks with underlying first-order changepoints. We prove that a spectral estimate of the associated Euclidean mirror localizes these changepoints, even when the graph distribution evolves continuously, but at a rate that changes. Simulated and real data examples on organoid networks show that this localization captures empirically significant shifts in network evolution."}, "https://arxiv.org/abs/2405.11156": {"title": "A Randomized Permutation Whole-Model Test Heuristic for Self-Validated Ensemble Models (SVEM)", "link": "https://arxiv.org/abs/2405.11156", "description": "arXiv:2405.11156v1 Announce Type: new \nAbstract: We introduce a heuristic to test the significance of fit of Self-Validated Ensemble Models (SVEM) against the null hypothesis of a constant response. A SVEM model averages predictions from nBoot fits of a model, applied to fractionally weighted bootstraps of the target dataset. It tunes each fit on a validation copy of the training data, utilizing anti-correlated weights for training and validation. The proposed test computes SVEM predictions centered by the response column mean and normalized by the ensemble variability at each of nPoint points spaced throughout the factor space. A reference distribution is constructed by refitting the SVEM model to nPerm randomized permutations of the response column and recording the corresponding standardized predictions at the nPoint points. A reduced-rank singular value decomposition applied to the centered and scaled nPerm x nPoint reference matrix is used to calculate the Mahalanobis distance for each of the nPerm permutation results as well as the jackknife (holdout) Mahalanobis distance of the original response column. The process is repeated independently for each response in the experiment, producing a joint graphical summary. We present a simulation driven power analysis and discuss limitations of the test relating to model flexibility and design adequacy. 
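The SVEM whole-model test described above reduces, at its core, to Mahalanobis-type distances computed in the reduced-rank SVD space of the permutation reference matrix. The Python sketch below illustrates only that step; the fractionally weighted SVEM refits, the anti-correlated validation weights, and the jackknife (holdout) refinement are omitted, and the column-wise centering and scaling stands in for the "centered and scaled" reference matrix of the abstract. Names and defaults are illustrative assumptions.

import numpy as np

def mahalanobis_via_svd(reference, observed, rank=None, tol=1e-8):
    """Mahalanobis-type distances from a permutation reference matrix.

    reference : (nPerm, nPoint) standardized predictions under permuted responses
    observed  : (nPoint,) standardized predictions for the original response
    """
    center = reference.mean(axis=0)
    scale = reference.std(axis=0, ddof=1)
    scale[scale == 0] = 1.0
    Rc = (reference - center) / scale

    # Reduced-rank SVD of the centered and scaled reference matrix.
    U, s, Vt = np.linalg.svd(Rc, full_matrices=False)
    if rank is None:
        rank = int(np.sum(s > tol * s[0]))
    Vr, sr = Vt[:rank].T, s[:rank]
    sd = sr / np.sqrt(reference.shape[0] - 1)  # singular values -> PC std devs

    def dist(x):
        scores = ((x - center) / scale) @ Vr / sd
        return float(np.sqrt(np.sum(scores ** 2)))

    ref_dists = np.array([dist(r) for r in reference])
    return dist(observed), ref_dists  # compare the observed distance to the reference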
The test maintains the nominal Type I error rate even when the base SVEM model contains more parameters than observations."}, "https://arxiv.org/abs/2405.11248": {"title": "Generalized extremiles and risk measures of distorted random variables", "link": "https://arxiv.org/abs/2405.11248", "description": "arXiv:2405.11248v1 Announce Type: new \nAbstract: Quantiles, expectiles and extremiles can be seen as concepts defined via an optimization problem, where this optimization problem is driven by two important ingredients: the loss function as well as a distributional weight function. This leads to the formulation of a general class of functionals that contains next to the above concepts many interesting quantities, including also a subclass of distortion risks. The focus of the paper is on developing estimators for such functionals and to establish asymptotic consistency and asymptotic normality of these estimators. The advantage of the general framework is that it allows application to a very broad range of concepts, providing as such estimation tools and tools for statistical inference (for example for construction of confidence intervals) for all involved concepts. After developing the theory for the general functional we apply it to various settings, illustrating the broad applicability. In a real data example the developed tools are used in an analysis of natural disasters."}, "https://arxiv.org/abs/2405.11358": {"title": "A Bayesian Nonparametric Approach for Clustering Functional Trajectories over Time", "link": "https://arxiv.org/abs/2405.11358", "description": "arXiv:2405.11358v1 Announce Type: new \nAbstract: Functional concurrent, or varying-coefficient, regression models are commonly used in biomedical and clinical settings to investigate how the relation between an outcome and observed covariate varies as a function of another covariate. In this work, we propose a Bayesian nonparametric approach to investigate how clusters of these functional relations evolve over time. Our model clusters individual functional trajectories within and across time periods while flexibly accommodating the evolution of the partitions across time periods with covariates. Motivated by mobile health data collected in a novel, smartphone-based smoking cessation intervention study, we demonstrate how our proposed method can simultaneously cluster functional trajectories, accommodate temporal dependence, and provide insights into the transitions between functional clusters over time."}, "https://arxiv.org/abs/2405.11477": {"title": "Analyze Additive and Interaction Effects via Collaborative Trees", "link": "https://arxiv.org/abs/2405.11477", "description": "arXiv:2405.11477v1 Announce Type: new \nAbstract: We present Collaborative Trees, a novel tree model designed for regression prediction, along with its bagging version, which aims to analyze complex statistical associations between features and uncover potential patterns inherent in the data. We decompose the mean decrease in impurity from the proposed tree model to analyze the additive and interaction effects of features on the response variable. Additionally, we introduce network diagrams to visually depict how each feature contributes additively to the response and how pairs of features contribute interaction effects. Through a detailed demonstration using an embryo growth dataset, we illustrate how the new statistical tools aid data analysis, both visually and numerically. 
Moreover, we delve into critical aspects of tree modeling, such as prediction performance, inference stability, and bias in feature importance measures, leveraging real datasets and simulation experiments for comprehensive discussions. On the theory side, we show that Collaborative Trees, built upon a ``sum of trees'' approach with our own innovative tree model regularization, exhibit characteristics akin to matching pursuit, under the assumption of high-dimensional independent binary input features (or one-hot feature groups). This newfound link sheds light on the superior capability of our tree model in estimating additive effects of features, a crucial factor for accurate interaction effect estimation."}, "https://arxiv.org/abs/2405.11522": {"title": "A comparative study of augmented inverse propensity weighted estimators using outcome-adaptive lasso and other penalized regression methods", "link": "https://arxiv.org/abs/2405.11522", "description": "arXiv:2405.11522v1 Announce Type: new \nAbstract: Confounder selection may be efficiently conducted using penalized regression methods when causal effects are estimated from observational data with many variables. An outcome-adaptive lasso was proposed to build a model for the propensity score that can be employed in conjunction with other variable selection methods for the outcome model to apply the augmented inverse propensity weighted (AIPW) estimator. However, researchers may not know which method is optimal to use for outcome model when applying the AIPW estimator with the outcome-adaptive lasso. This study provided hints on readily implementable penalized regression methods that should be adopted for the outcome model as a counterpart of the outcome-adaptive lasso. We evaluated the bias and variance of the AIPW estimators using the propensity score (PS) model and an outcome model based on penalized regression methods under various conditions by analyzing a clinical trial example and numerical experiments; the estimates and standard errors of the AIPW estimators were almost identical in an example with over 5000 participants. The AIPW estimators using penalized regression methods with the oracle property performed well in terms of bias and variance in numerical experiments with smaller sample sizes. Meanwhile, the bias of the AIPW estimator using the ordinary lasso for the PS and outcome models was considerably larger."}, "https://arxiv.org/abs/2405.11615": {"title": "Approximation of bivariate densities with compositional splines", "link": "https://arxiv.org/abs/2405.11615", "description": "arXiv:2405.11615v1 Announce Type: new \nAbstract: Reliable estimation and approximation of probability density functions is fundamental for their further processing. However, their specific properties, i.e. scale invariance and relative scale, prevent the use of standard methods of spline approximation and have to be considered when building a suitable spline basis. Bayes Hilbert space methodology allows to account for these properties of densities and enables their conversion to a standard Lebesgue space of square integrable functions using the centered log-ratio transformation. As the transformed densities fulfill a zero integral constraint, the constraint should likewise be respected by any spline basis used. Bayes Hilbert space methodology also allows to decompose bivariate densities into their interactive and independent parts with univariate marginals. 
As this yields a useful framework for studying the dependence structure between random variables, a spline basis ideally should admit a corresponding decomposition. This paper proposes a new spline basis for (transformed) bivariate densities respecting the desired zero integral property. We show that there is a one-to-one correspondence of this basis to a corresponding basis in the Bayes Hilbert space of bivariate densities using tools of this methodology. Furthermore, the spline representation and the resulting decomposition into interactive and independent parts are derived. Finally, this novel spline representation is evaluated in a simulation study and applied to empirical geochemical data."}, "https://arxiv.org/abs/2405.11624": {"title": "On Generalized Transmuted Lifetime Distribution", "link": "https://arxiv.org/abs/2405.11624", "description": "arXiv:2405.11624v1 Announce Type: new \nAbstract: This article presents a new class of generalized transmuted lifetime distributions which includes a large number of lifetime distributions as sub-families. Several important mathematical quantities such as density function, distribution function, quantile function, moments, moment generating function, stress-strength reliability function, order statistics, R\\'enyi and q-entropy, residual and reversed residual life function, and cumulative information generating function are obtained. The methods of maximum likelihood, ordinary least square, weighted least square, Cram\\'er-von Mises, Anderson Darling, and Right-tail Anderson Darling are considered to estimate the model parameters in a general way. Further, well-organized Monte Carlo simulation experiments have been performed to observe the behavior of the estimators. Finally, two real datasets have also been analyzed to demonstrate the effectiveness of the proposed distribution in real-life modeling."}, "https://arxiv.org/abs/2405.11626": {"title": "Distribution-in-distribution-out Regression", "link": "https://arxiv.org/abs/2405.11626", "description": "arXiv:2405.11626v1 Announce Type: new \nAbstract: Regression analysis with probability measures as input predictors and output response has recently drawn great attention. However, it is challenging to handle multiple input probability measures due to the non-flat Riemannian geometry of the Wasserstein space, hindering the definition of arithmetic operations, so that an additive linear structure is not well-defined. In this work, a distribution-in-distribution-out regression model is proposed by introducing parallel transport to achieve provable commutativity and additivity of newly defined arithmetic operations in Wasserstein space. The appealing properties of the DIDO regression model can serve as a foundation for model estimation, prediction, and inference. Specifically, the Fr\\'echet least squares estimator is employed to obtain the best linear unbiased estimate, supported by the newly established Fr\\'echet Gauss-Markov Theorem. Furthermore, we investigate a special case when predictors and response are all univariate Gaussian measures, leading to a simple closed-form solution of linear model coefficients and $R^2$ metric. 
A simulation study and real case study in intraoperative cardiac output prediction are performed to evaluate the performance of the proposed method."}, "https://arxiv.org/abs/2405.11681": {"title": "Distributed Tensor Principal Component Analysis", "link": "https://arxiv.org/abs/2405.11681", "description": "arXiv:2405.11681v1 Announce Type: new \nAbstract: As tensors become widespread in modern data analysis, Tucker low-rank Principal Component Analysis (PCA) has become essential for dimensionality reduction and structural discovery in tensor datasets. Motivated by the common scenario where large-scale tensors are distributed across diverse geographic locations, this paper investigates tensor PCA within a distributed framework where direct data pooling is impractical.\n We offer a comprehensive analysis of three specific scenarios in distributed Tensor PCA: a homogeneous setting in which tensors at various locations are generated from a single noise-affected model; a heterogeneous setting where tensors at different locations come from distinct models but share some principal components, aiming to improve estimation across all locations; and a targeted heterogeneous setting, designed to boost estimation accuracy at a specific location with limited samples by utilizing transferred knowledge from other sites with ample data.\n We introduce novel estimation methods tailored to each scenario, establish statistical guarantees, and develop distributed inference techniques to construct confidence regions. Our theoretical findings demonstrate that these distributed methods achieve sharp rates of accuracy by efficiently aggregating shared information across different tensors, while maintaining reasonable communication costs. Empirical validation through simulations and real-world data applications highlights the advantages of our approaches, particularly in managing heterogeneous tensor data."}, "https://arxiv.org/abs/2405.11720": {"title": "Estimating optimal tailored active surveillance strategy under interval censoring", "link": "https://arxiv.org/abs/2405.11720", "description": "arXiv:2405.11720v1 Announce Type: new \nAbstract: Active surveillance (AS) using repeated biopsies to monitor disease progression has been a popular alternative to immediate surgical intervention in cancer care. However, a biopsy procedure is invasive and sometimes leads to severe side effects of infection and bleeding. To reduce the burden of repeated surveillance biopsies, biomarker-assisted decision rules are sought to replace the fixed-for-all regimen with tailored biopsy intensity for individual patients. Constructing or evaluating such decision rules is challenging. The key AS outcome is often ascertained subject to interval censoring. Furthermore, patients will discontinue their participation in the AS study once they receive a positive surveillance biopsy. Thus, patient dropout is affected by the outcomes of these biopsies. In this work, we propose a nonparametric kernel-based method to estimate the true positive rates (TPRs) and true negative rates (TNRs) of a tailored AS strategy, accounting for interval censoring and immediate dropouts. Based on these estimates, we develop a weighted classification framework to estimate the optimal tailored AS strategy and further incorporate the cost-benefit ratio for cost-effectiveness in medical decision-making. Theoretically, we provide a uniform generalization error bound of the derived AS strategy accommodating all possible trade-offs between TPRs and TNRs. 
Simulation and application to a prostate cancer surveillance study show the superiority of the proposed method."}, "https://arxiv.org/abs/2405.11723": {"title": "Inference with non-differentiable surrogate loss in a general high-dimensional classification framework", "link": "https://arxiv.org/abs/2405.11723", "description": "arXiv:2405.11723v1 Announce Type: new \nAbstract: Penalized empirical risk minimization with a surrogate loss function is often used to derive a high-dimensional linear decision rule in classification problems. Although much of the literature focuses on the generalization error, there is a lack of valid inference procedures to identify the driving factors of the estimated decision rule, especially when the surrogate loss is non-differentiable. In this work, we propose a kernel-smoothed decorrelated score to construct hypothesis testing and interval estimations for the linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version in a high-dimensional setup. Simulation and real data analysis are conducted to demonstrate the validity and superiority of the proposed method."}, "https://arxiv.org/abs/2405.11759": {"title": "Testing Sign Congruence", "link": "https://arxiv.org/abs/2405.11759", "description": "arXiv:2405.11759v1 Announce Type: new \nAbstract: We consider testing the null hypothesis that two parameters $({\\mu}_1, {\\mu}_2)$ have the same sign, assuming that (asymptotically) normal estimators are available. Examples of this problem include the analysis of heterogeneous treatment effects, causal interpretation of reduced-form estimands, meta-studies, and mediation analysis. A number of tests were recently proposed. We recommend a test that is simple and rejects more often than many of these recent proposals. Like all other tests in the literature, it is conservative if the truth is near (0, 0) and therefore also biased. To clarify whether these features are avoidable, we also provide a test that is unbiased and has exact size control on the boundary of the null hypothesis, but which has counterintuitive properties and hence we do not recommend. The method that we recommend can be used to revisit existing findings using information typically reported in empirical research papers."}, "https://arxiv.org/abs/2405.11781": {"title": "Structural Nested Mean Models Under Parallel Trends with Interference", "link": "https://arxiv.org/abs/2405.11781", "description": "arXiv:2405.11781v1 Announce Type: new \nAbstract: Despite the common occurrence of interference in Difference-in-Differences (DiD) applications, standard DiD methods rely on an assumption that interference is absent, and comparatively little work has considered how to accommodate and learn about spillover effects within a DiD framework. Here, we extend the so-called `DiD-SNMMs' of Shahn et al (2022) to accommodate interference in a time-varying DiD setting. Doing so enables estimation of a richer set of effects than previous DiD approaches. 
For example, DiD-SNMMs do not assume the absence of spillover effects after direct exposures and can model how effects of direct or indirect (i.e. spillover) exposures depend on past and concurrent (direct or indirect) exposure and covariate history. We consider both cluster and network interference structures and illustrate the methodology in simulations."}, "https://arxiv.org/abs/2405.11954": {"title": "Comparing predictive ability in presence of instability over a very short time", "link": "https://arxiv.org/abs/2405.11954", "description": "arXiv:2405.11954v1 Announce Type: new \nAbstract: We consider forecast comparison in the presence of instability when this affects only a short period of time. We demonstrate that global tests do not perform well in this case, as they were not designed to capture very short-lived instabilities, and their power vanishes altogether when the magnitude of the shock is very large. We then discuss and propose approaches that are more suitable to detect such situations, such as nonparametric methods (S test or MAX procedure). We illustrate these results in different Monte Carlo exercises and in evaluating the nowcast of the quarterly US nominal GDP from the Survey of Professional Forecasters (SPF) against a naive benchmark of no growth, over the period that includes the GDP instability brought by the Covid-19 crisis. We recommend that the forecaster should not pool the sample, but exclude the short periods of high local instability from the evaluation exercise."}, "https://arxiv.org/abs/2405.12083": {"title": "Instrumented Difference-in-Differences with heterogeneous treatment effects", "link": "https://arxiv.org/abs/2405.12083", "description": "arXiv:2405.12083v1 Announce Type: new \nAbstract: Many studies exploit variation in the timing of policy adoption across units as an instrument for treatment, and use instrumental variable techniques. This paper formalizes the underlying identification strategy as an instrumented difference-in-differences (DID-IV). In a simple setting with two periods and two groups, our DID-IV design mainly consists of a monotonicity assumption, and parallel trends assumptions in the treatment and the outcome. In this design, a Wald-DID estimand, which scales the DID estimand of the outcome by the DID estimand of the treatment, captures the local average treatment effect on the treated (LATET). In contrast to Fuzzy DID design considered in \cite{De_Chaisemartin2018-xe}, our DID-IV design does not {\it ex-ante} require strong restrictions on the treatment adoption behavior across units, and our target parameter, the LATET, is policy-relevant if the instrument is based on the policy change of interest to the researcher. We extend the canonical DID-IV design to multiple period settings with the staggered adoption of the instrument across units, which we call staggered DID-IV designs. We propose an estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity. We illustrate our findings in the setting of \cite{Oreopoulos2006-bn}, estimating returns to schooling in the United Kingdom. 
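In the two-period, two-group setting the Wald-DID estimand described above is simply the difference-in-differences of the outcome divided by the difference-in-differences of the treatment. A minimal Python sketch follows; the pandas column names ('group', 'post', 'y', 'd') are assumptions for illustration, not the paper's notation.

import pandas as pd

def wald_did(df):
    """Wald-DID: DiD of the outcome scaled by the DiD of the treatment take-up.

    Expects columns: 'group' (1 = exposed to the instrument), 'post' (1 = second
    period), 'y' (outcome) and 'd' (treatment take-up).
    """
    m = df.groupby(['group', 'post'])[['y', 'd']].mean()
    did_y = (m.loc[(1, 1), 'y'] - m.loc[(1, 0), 'y']) - (m.loc[(0, 1), 'y'] - m.loc[(0, 0), 'y'])
    did_d = (m.loc[(1, 1), 'd'] - m.loc[(1, 0), 'd']) - (m.loc[(0, 1), 'd'] - m.loc[(0, 0), 'd'])
    return did_y / did_d  # estimates the LATET under the stated assumptions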
In this application, the two-way fixed effects instrumental variable regression, which is the conventional approach to implement staggered DID-IV designs, yields a negative estimate, whereas our estimation method indicates a substantial gain from schooling."}, "https://arxiv.org/abs/2405.12157": {"title": "Asymmetry models and separability for multi-way contingency tables with ordinal categories", "link": "https://arxiv.org/abs/2405.12157", "description": "arXiv:2405.12157v1 Announce Type: new \nAbstract: In this paper, we propose a model that indicates the asymmetry structure for cell probabilities in multivariate contingency tables with the same ordered categories. The proposed model is the closest to the symmetry model in terms of the $f$-divergence under certain conditions and incorporates various asymmetry models as special cases, including existing models. We elucidate the relationship between the proposed model and conventional models from several aspects of divergence in $f$-divergence. Furthermore, we provide theorems showing that the symmetry model can be decomposed into two or more models, each imposing less restrictive parameter constraints than the symmetry condition. We also discuss the properties of goodness-of-fit statistics, particularly focusing on the likelihood ratio test statistics and Wald test statistics. Finally, we summarize the proposed model and discuss some problems and future work."}, "https://arxiv.org/abs/2405.12180": {"title": "Estimating the Impact of Social Distance Policy in Mitigating COVID-19 Spread with Factor-Based Imputation Approach", "link": "https://arxiv.org/abs/2405.12180", "description": "arXiv:2405.12180v1 Announce Type: new \nAbstract: We identify the effectiveness of social distancing policies in reducing the transmission of COVID-19. We build a model that measures the relative frequency and geographic distribution of the virus growth rate and provides hypothetical infection distribution in the states that enacted the social distancing policies, where we control for time-varying, observed and unobserved, state-level heterogeneities. Using panel data on infection and deaths in all US states from February 20 to April 20, 2020, we find that stay-at-home orders and other types of social distancing policies significantly reduced the growth rate of infection and deaths. We show that the effects are time-varying and range from the weakest at the beginning of policy intervention to the strongest by the end of our sample period. We also found that social distancing policies were more effective in states with higher income, better education, more white people, more democratic voters, and higher CNN viewership."}, "https://arxiv.org/abs/2405.10991": {"title": "Relative Counterfactual Contrastive Learning for Mitigating Pretrained Stance Bias in Stance Detection", "link": "https://arxiv.org/abs/2405.10991", "description": "arXiv:2405.10991v1 Announce Type: cross \nAbstract: Stance detection classifies stance relations (namely, Favor, Against, or Neither) between comments and targets. Pretrained language models (PLMs) are widely used to mine the stance relation to improve the performance of stance detection through pretrained knowledge. However, PLMs also embed ``bad'' pretrained knowledge concerning stance into the extracted stance relation semantics, resulting in pretrained stance bias. It is not trivial to measure pretrained stance bias due to its weak quantifiability. 
In this paper, we propose Relative Counterfactual Contrastive Learning (RCCL), in which pretrained stance bias is mitigated as relative stance bias instead of absolute stance bias to overcome the difficulty of measuring bias. Firstly, we present a new structural causal model for characterizing complicated relationships among context, PLMs and stance relations to locate pretrained stance bias. Then, based on masked language model prediction, we present a target-aware relative stance sample generation method for obtaining relative bias. Finally, we use contrastive learning based on counterfactual theory to mitigate pretrained stance bias and preserve context stance relation. Experiments show that the proposed method is superior to stance detection and debiasing baselines."}, "https://arxiv.org/abs/2405.11377": {"title": "Causal Customer Churn Analysis with Low-rank Tensor Block Hazard Model", "link": "https://arxiv.org/abs/2405.11377", "description": "arXiv:2405.11377v1 Announce Type: cross \nAbstract: This study introduces an innovative method for analyzing the impact of various interventions on customer churn, using the potential outcomes framework. We present a new causal model, the tensorized latent factor block hazard model, which incorporates tensor completion methods for a principled causal analysis of customer churn. A crucial element of our approach is the formulation of a 1-bit tensor completion for the parameter tensor. This captures hidden customer characteristics and temporal elements from churn records, effectively addressing the binary nature of churn data and its time-monotonic trends. Our model also uniquely categorizes interventions by their similar impacts, enhancing the precision and practicality of implementing customer retention strategies. For computational efficiency, we apply a projected gradient descent algorithm combined with spectral clustering. We lay down the theoretical groundwork for our model, including its non-asymptotic properties. The efficacy and superiority of our model are further validated through comprehensive experiments on both simulated and real-world applications."}, "https://arxiv.org/abs/2405.11688": {"title": "Performance Analysis of Monte Carlo Algorithms in Dense Subgraph Identification", "link": "https://arxiv.org/abs/2405.11688", "description": "arXiv:2405.11688v1 Announce Type: cross \nAbstract: The exploration of network structures through the lens of graph theory has become a cornerstone in understanding complex systems across diverse fields. Identifying densely connected subgraphs within larger networks is crucial for uncovering functional modules in biological systems, cohesive groups within social networks, and critical paths in technological infrastructures. The most representative approach, the SM algorithm, cannot locate subgraphs of large size and therefore cannot identify dense subgraphs, while the SA algorithm previously used by researchers combines simulated annealing and efficient moves for the Markov chain. However, simulated annealing methods, including SA, cannot be guaranteed to locate the global optima unless a logarithmic cooling schedule is used. To this end, our study introduces and evaluates the performance of the Simulated Annealing Algorithm (SAA), which combines simulated annealing with the stochastic approximation Monte Carlo algorithm. 
The performance of SAA against two other numerical algorithms, SM and SA, is examined in the context of identifying these critical subgraph structures using simulated graphs with embedded cliques. We have found that SAA outperforms both SA and SM in terms of 1) the number of iterations needed to find the densest subgraph, 2) the percentage of time the algorithm is able to find a clique after 10,000 iterations, and 3) computation time. The promising results suggest that the SAA algorithm could offer a robust tool for dissecting complex systems and potentially transform our approach to solving problems in interdisciplinary fields."}, "https://arxiv.org/abs/2405.11923": {"title": "Rate Optimality and Phase Transition for User-Level Local Differential Privacy", "link": "https://arxiv.org/abs/2405.11923", "description": "arXiv:2405.11923v1 Announce Type: cross \nAbstract: Most of the literature on differential privacy considers the item-level case where each user has a single observation, but a growing field of interest is that of user-level privacy where each of the $n$ users holds $T$ observations and wishes to maintain the privacy of their entire collection.\n In this paper, we derive a general minimax lower bound, which shows that, for locally private user-level estimation problems, the risk cannot, in general, be made to vanish for a fixed number of users even when each user holds an arbitrarily large number of observations. We then derive matching, up to logarithmic factors, lower and upper bounds for univariate and multidimensional mean estimation, sparse mean estimation and non-parametric density estimation. In particular, with other model parameters held fixed, we observe phase transition phenomena in the minimax rates as $T$, the number of observations each user holds, varies.\n In the case of (non-sparse) mean estimation and density estimation, we see that, for $T$ below a phase transition boundary, the rate is the same as having $nT$ users in the item-level setting. Different behaviour is however observed in the case of $s$-sparse $d$-dimensional mean estimation, wherein consistent estimation is impossible when $d$ exceeds the number of observations in the item-level setting, but is possible in the user-level setting when $T \\gtrsim s \\log (d)$, up to logarithmic factors. This may be of independent interest for applications as an example of a high-dimensional problem that is feasible under local privacy constraints."}, "https://arxiv.org/abs/2112.01611": {"title": "Robust changepoint detection in the variability of multivariate functional data", "link": "https://arxiv.org/abs/2112.01611", "description": "arXiv:2112.01611v2 Announce Type: replace \nAbstract: We consider the problem of robustly detecting changepoints in the variability of a sequence of independent multivariate functions. We develop a novel changepoint procedure, called the functional Kruskal--Wallis for covariance (FKWC) changepoint procedure, based on rank statistics and multivariate functional data depth. The FKWC changepoint procedure allows the user to test for at most one changepoint (AMOC) or an epidemic period, or to estimate the number and locations of an unknown number of changepoints in the data. We show that when the ``signal-to-noise'' ratio is bounded below, the changepoint estimates produced by the FKWC procedure attain the minimax localization rate for detecting general changes in distribution in the univariate setting (Theorem 1). 
We also characterize the behavior of the proposed test statistics for the AMOC and epidemic setting under the null hypothesis (Theorem 2) and, as a simple consequence of our main result, these tests are consistent (Corollary 1). In simulation, we show that our method is particularly robust when compared to similar changepoint methods. We present an application of the FKWC procedure to intraday asset returns and fMRI scans. As a by-product of Theorem 1, we provide a concentration result for integrated functional depth functions (Lemma 2), which may be of general interest."}, "https://arxiv.org/abs/2212.09844": {"title": "Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding", "link": "https://arxiv.org/abs/2212.09844", "description": "arXiv:2212.09844v5 Announce Type: replace \nAbstract: Predictive algorithms inform consequential decisions in settings where the outcome is selectively observed given choices made by human decision makers. We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data. We impose general assumptions on how much the outcome may vary on average between unselected and selected units conditional on observed covariates and identified nuisance parameters, formalizing popular empirical strategies for imputing missing data such as proxy outcomes and instrumental variables. We develop debiased machine learning estimators for the bounds on a large class of predictive performance estimands, such as the conditional likelihood of the outcome, a predictive algorithm's mean square error, true/false positive rate, and many others, under these assumptions. In an administrative dataset from a large Australian financial institution, we illustrate how varying assumptions on unobserved confounding leads to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups."}, "https://arxiv.org/abs/2303.11777": {"title": "Quasi Maximum Likelihood Estimation of High-Dimensional Factor Models: A Critical Review", "link": "https://arxiv.org/abs/2303.11777", "description": "arXiv:2303.11777v5 Announce Type: replace \nAbstract: We review Quasi Maximum Likelihood estimation of factor models for high-dimensional panels of time series. We consider two cases: (1) estimation when no dynamic model for the factors is specified (Bai and Li, 2012, 2016); (2) estimation based on the Kalman smoother and the Expectation Maximization algorithm, thus allowing us to model the factor dynamics explicitly (Doz et al., 2012, Barigozzi and Luciani, 2019). Our interest is in approximate factor models, i.e., when we allow for the idiosyncratic components to be mildly cross-sectionally, as well as serially, correlated. Although such a setting apparently makes estimation harder, we show, in fact, that factor models do not suffer from the {\\it curse of dimensionality} problem, but instead enjoy a {\\it blessing of dimensionality} property. In particular, given an approximate factor structure, if the cross-sectional dimension of the data, $N$, grows to infinity, we show that: (i) identification of the model is still possible, and (ii) the mis-specification error due to the use of an exact factor model log-likelihood vanishes. Moreover, if we also let the sample size, $T$, grow to infinity, we can also consistently estimate all parameters of the model and make inference. 
The same is true for estimation of the latent factors, which can be carried out by weighted least-squares, linear projection, or Kalman filtering/smoothing. We also compare the approaches presented with Principal Component analysis and the classical, fixed $N$, exact Maximum Likelihood approach. We conclude with a discussion on efficiency of the considered estimators."}, "https://arxiv.org/abs/2305.07581": {"title": "Nonparametric data segmentation in multivariate time series via joint characteristic functions", "link": "https://arxiv.org/abs/2305.07581", "description": "arXiv:2305.07581v3 Announce Type: replace \nAbstract: Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points in the marginal distribution, but also those in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series."}, "https://arxiv.org/abs/2309.01334": {"title": "Average treatment effect on the treated, under lack of positivity", "link": "https://arxiv.org/abs/2309.01334", "description": "arXiv:2309.01334v3 Announce Type: replace \nAbstract: The use of propensity score (PS) methods has become ubiquitous in causal inference. At the heart of these methods is the positivity assumption. Violation of the positivity assumption leads to the presence of extreme PS weights when estimating average causal effects of interest, such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT), which renders the related statistical inference invalid. To circumvent this issue, trimming or truncating the extreme estimated PSs has been widely used. However, these methods require that we specify a priori a threshold and sometimes an additional smoothing parameter. While there are a number of methods dealing with the lack of positivity when estimating ATE, surprisingly there has been little effort on the same issue for the ATT. In this paper, we first review widely used methods, such as trimming and truncation, for the ATT. We emphasize the underlying intuition behind these methods to better understand their applications and highlight their main limitations. Then, we argue that the current methods simply target estimands that are scaled ATT (and thus move the goalpost to a different target of interest), where we specify the scale and the target populations. We further propose a PS weight-based alternative for the average causal effect on the treated, called overlap weighted average treatment effect on the treated (OWATT). The appeal of our proposed method lies in its ability to obtain similar or even better results than trimming and truncation while relaxing the need to choose a threshold a priori (or even specify a smoothing parameter). 
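The trimming approach reviewed in the ATT abstract above can be illustrated with a minimal sketch; the standard ATT weights and the 0.95 trimming threshold below are textbook choices on simulated data, and the paper's proposed OWATT weights are not reproduced here.

```python
# Minimal sketch of ATT propensity-score weighting with trimming of extreme
# estimated PSs (threshold 0.95 is arbitrary; data are simulated).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))
ps_true = 1 / (1 + np.exp(-(1.5 * x[:, 0] + 1.5 * x[:, 1])))   # near-violations of positivity
a = rng.binomial(1, ps_true)
y = 1.0 * a + x.sum(axis=1) + rng.normal(size=n)               # true effect of 1

ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
keep = ps < 0.95                                   # trim units with extreme estimated PS
w = np.where(a == 1, 1.0, ps / (1 - ps))           # standard ATT weights
w, a_t, y_t = w[keep], a[keep], y[keep]

att = (np.sum(w * a_t * y_t) / np.sum(w * a_t)
       - np.sum(w * (1 - a_t) * y_t) / np.sum(w * (1 - a_t)))
print(round(att, 2))                               # approximate ATT among retained units
```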
The performance of the proposed method is illustrated via a series of Monte Carlo simulations and a data analysis on racial disparities in health care expenditures."}, "https://arxiv.org/abs/2309.04926": {"title": "Testing for Stationary or Persistent Coefficient Randomness in Predictive Regressions", "link": "https://arxiv.org/abs/2309.04926", "description": "arXiv:2309.04926v4 Announce Type: replace \nAbstract: This study considers tests for coefficient randomness in predictive regressions. Our focus is on how tests for coefficient randomness are influenced by the persistence of the random coefficient. We show that when the random coefficient is stationary, or I(0), Nyblom's (1989) LM test loses its optimality (in terms of power), which is established against the alternative of an integrated, or I(1), random coefficient. We demonstrate this by constructing a test that is more powerful than the LM test when the random coefficient is stationary, although the test is dominated in terms of power by the LM test when the random coefficient is integrated. This implies that the best test for coefficient randomness differs from context to context, and the persistence of the random coefficient determines which test is the best one. We apply these tests to U.S. stock returns data."}, "https://arxiv.org/abs/2104.14412": {"title": "Nonparametric Test for Volatility in Clustered Multiple Time Series", "link": "https://arxiv.org/abs/2104.14412", "description": "arXiv:2104.14412v3 Announce Type: replace-cross \nAbstract: Contagion arising from clustering of multiple time series like those in the stock market indicators can further complicate the nature of volatility, causing a parametric test (relying on an asymptotic distribution) to suffer from issues with size and power. We propose a test on volatility based on the bootstrap method for multiple time series, intended to account for the possible presence of a contagion effect. While the test is fairly robust to distributional assumptions, it depends on the nature of volatility. The test is correctly sized even in cases where the time series are almost nonstationary. The test is also powerful, especially when the time series are stationary in mean and volatility is contained in only a few clusters. We illustrate the method on global stock price data."}, "https://arxiv.org/abs/2109.08793": {"title": "Estimations of the Local Conditional Tail Average Treatment Effect", "link": "https://arxiv.org/abs/2109.08793", "description": "arXiv:2109.08793v3 Announce Type: replace-cross \nAbstract: The conditional tail average treatment effect (CTATE) is defined as a difference between the conditional tail expectations of potential outcomes, which can capture heterogeneity and deliver aggregated local information on treatment effects over different quantile levels and is closely related to the notion of second-order stochastic dominance and the Lorenz curve. These properties render it a valuable tool for policy evaluation. In this paper, we study estimation of the CTATE locally for a group of compliers (local CTATE or LCTATE) under the two-sided noncompliance framework. We consider a semiparametric treatment effect framework under endogeneity for the LCTATE estimation using a newly introduced class of consistent loss functions jointly for the conditional tail expectation and quantile. We establish the asymptotic theory of our proposed LCTATE estimator and provide an efficient algorithm for its implementation. 
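As a rough illustration of the tail contrast underlying the CTATE described above, the sketch below compares plug-in conditional tail expectations between two simulated arms; it ignores the endogeneity and compliance machinery that the paper actually develops.

```python
# Crude plug-in contrast of conditional tail expectations between two arms
# (simulated data; the paper's semiparametric LCTATE estimator is not shown).
import numpy as np

def tail_expectation(y, tau=0.75):
    """Average of y above its tau-level quantile."""
    q = np.quantile(y, tau)
    return y[y >= q].mean()

rng = np.random.default_rng(2)
y_treated = rng.normal(1.0, 1.5, 10_000)
y_control = rng.normal(0.0, 1.0, 10_000)
print(round(tail_expectation(y_treated) - tail_expectation(y_control), 2))
```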
We then apply the method to evaluate the effects of participating in programs under the Job Training Partnership Act in the US."}, "https://arxiv.org/abs/2305.19461": {"title": "Residual spectrum: Brain functional connectivity detection beyond coherence", "link": "https://arxiv.org/abs/2305.19461", "description": "arXiv:2305.19461v2 Announce Type: replace-cross \nAbstract: Coherence is a widely used measure to assess linear relationships between time series. However, it fails to capture nonlinear dependencies. To overcome this limitation, this paper introduces the notion of residual spectral density as a higher-order extension of the squared coherence. The method is based on an orthogonal decomposition of time series regression models. We propose a test for the existence of the residual spectrum and derive its fundamental properties. A numerical study illustrates finite sample performance of the proposed method. An application of the method shows that the residual spectrum can effectively detect brain connectivity. Our study reveals a noteworthy contrast in connectivity patterns between schizophrenia patients and healthy individuals. Specifically, we observed that non-linear connectivity in schizophrenia patients surpasses that of healthy individuals, which stands in stark contrast to the established understanding that linear connectivity tends to be higher in healthy individuals. This finding sheds new light on the intricate dynamics of brain connectivity in schizophrenia."}, "https://arxiv.org/abs/2308.05945": {"title": "Improving Ego-Cluster for Network Effect Measurement", "link": "https://arxiv.org/abs/2308.05945", "description": "arXiv:2308.05945v2 Announce Type: replace-cross \nAbstract: The network effect, wherein one user's activity impacts another user, is common in social network platforms. Many new features in social networks are specifically designed to create a network effect, enhancing user engagement. For instance, content creators tend to produce more when their articles and posts receive positive feedback from followers. This paper discusses a new cluster-level experimentation methodology for measuring creator-side metrics in the context of A/B experiments. The methodology is designed to address cases where the experiment randomization unit and the metric measurement unit differ. It is a crucial part of LinkedIn's overall strategy to foster a robust creator community and ecosystem. The method is developed based on widely-cited research at LinkedIn but significantly improves the efficiency and flexibility of the clustering algorithm. This improvement results in a stronger capability for measuring creator-side metrics and an increased velocity for creator-related experiments."}, "https://arxiv.org/abs/2405.12467": {"title": "Conditional Choice Probability Estimation of Dynamic Discrete Choice Models with 2-period Finite Dependence", "link": "https://arxiv.org/abs/2405.12467", "description": "arXiv:2405.12467v1 Announce Type: new \nAbstract: This paper extends the work of Arcidiacono and Miller (2011, 2019) by introducing a novel characterization of finite dependence within dynamic discrete choice models, demonstrating that numerous models display 2-period finite dependence. We recast finite dependence as a problem of sequentially searching for weights and introduce a computationally efficient method for determining these weights by utilizing the Kronecker product structure embedded in state transitions. 
With the estimated weights, we develop a computationally attractive Conditional Choice Probability estimator with 2-period finite dependence. The computational efficacy of our proposed estimator is demonstrated through Monte Carlo simulations."}, "https://arxiv.org/abs/2405.12581": {"title": "Spectral analysis for noisy Hawkes processes inference", "link": "https://arxiv.org/abs/2405.12581", "description": "arXiv:2405.12581v1 Announce Type: new \nAbstract: Classic estimation methods for Hawkes processes rely on the assumption that observed event times are indeed a realisation of a Hawkes process, without considering any potential perturbation of the model. However, in practice, observations are often altered by some noise, the form of which depends on the context. It is then necessary to model the alteration mechanism in order to accurately infer such a noisy Hawkes process. While several models exist, we consider, in this work, the observations to be the indistinguishable union of event times coming from a Hawkes process and from an independent Poisson process. Since standard inference methods (such as maximum likelihood or Expectation-Maximisation) are either unworkable or numerically prohibitive in this context, we propose an estimation procedure based on the spectral analysis of second order properties of the noisy Hawkes process. Novel results include sufficient conditions for identifiability of the ensuing statistical model with exponential interaction functions for both univariate and bivariate processes. Although we mainly focus on the exponential scenario, other types of kernels are investigated and discussed. A new estimator based on maximising the spectral log-likelihood is then described, and its behaviour is numerically illustrated on synthetic data. Besides not requiring knowledge of the source of each observed time (Hawkes or Poisson process), the proposed estimator is shown to perform accurately in estimating both processes."}, "https://arxiv.org/abs/2405.12622": {"title": "Asymptotic Properties of Matthews Correlation Coefficient", "link": "https://arxiv.org/abs/2405.12622", "description": "arXiv:2405.12622v1 Announce Type: new \nAbstract: Evaluating classifications is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC) is recognized as a performance metric with high reliability, offering a balanced measurement even in the presence of class imbalances. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of MCC. This deficiency often leads to studies merely validating and comparing MCC point estimates, a practice that, while common, overlooks the statistical significance and reliability of results. Addressing this research gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for the single MCC and the differences between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performances. 
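A plug-in MCC with a simple percentile-bootstrap interval, shown below, is a crude stand-in for the asymptotic confidence intervals proposed in the MCC abstract above; the MCC formula is standard, while the data and bootstrap settings are illustrative assumptions.

```python
# Plug-in Matthews correlation coefficient with a percentile-bootstrap interval
# (illustrative; not the asymptotic intervals derived in the paper).
import numpy as np

def mcc(y_true, y_pred):
    # Confusion-matrix counts as floats to avoid integer overflow in the product below.
    tp = float(np.sum((y_true == 1) & (y_pred == 1)))
    tn = float(np.sum((y_true == 0) & (y_pred == 0)))
    fp = float(np.sum((y_true == 0) & (y_pred == 1)))
    fn = float(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 500)
pred = np.where(rng.random(500) < 0.8, y, 1 - y)      # an 80%-accurate classifier

point = mcc(y, pred)
boot = [mcc(y[idx], pred[idx])
        for idx in (rng.integers(0, 500, 500) for _ in range(2000))]
print(round(point, 2), np.round(np.percentile(boot, [2.5, 97.5]), 2))
```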
Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field."}, "https://arxiv.org/abs/2405.12668": {"title": "Short and simple introduction to Bellman filtering and smoothing", "link": "https://arxiv.org/abs/2405.12668", "description": "arXiv:2405.12668v1 Announce Type: new \nAbstract: Based on Bellman's dynamic-programming principle, Lange (2024) presents an approximate method for filtering, smoothing and parameter estimation for possibly non-linear and/or non-Gaussian state-space models. While the approach applies more generally, this pedagogical note highlights the main results in the case where (i) the state transition remains linear and Gaussian while (ii) the observation density is log-concave and sufficiently smooth in the state variable. I demonstrate how Kalman's (1960) filter and Rauch et al.'s (1965) smoother can be obtained as special cases within the proposed framework. The main aim is to present non-experts (and my own students) with an accessible introduction, enabling them to implement the proposed methods."}, "https://arxiv.org/abs/2405.12694": {"title": "Parameter estimation in Comparative Judgement", "link": "https://arxiv.org/abs/2405.12694", "description": "arXiv:2405.12694v1 Announce Type: new \nAbstract: Comparative Judgement is an assessment method where item ratings are estimated based on rankings of subsets of the items. These rankings are typically pairwise, with ratings taken to be the estimated parameters from fitting a Bradley-Terry model. Likelihood penalization is often employed. Adaptive scheduling of the comparisons can increase the efficiency of the assessment. We show that the most commonly used penalty is not the best-performing penalty under adaptive scheduling and can lead to substantial bias in parameter estimates. We demonstrate this using simulated and real data and provide a theoretical explanation for the relative performance of the penalties considered. Further, we propose a superior approach based on bootstrapping. It is shown to produce better parameter estimates for adaptive schedules and to be robust to variations in underlying strength distributions and initial penalization method."}, "https://arxiv.org/abs/2405.12816": {"title": "A Non-Parametric Box-Cox Approach to Robustifying High-Dimensional Linear Hypothesis Testing", "link": "https://arxiv.org/abs/2405.12816", "description": "arXiv:2405.12816v1 Announce Type: new \nAbstract: The mainstream theory of hypothesis testing in high-dimensional regression typically assumes the underlying true model is a low-dimensional linear regression model, yet the Box-Cox transformation is a regression technique commonly used to mitigate anomalies like non-additivity and heteroscedasticity. This paper introduces a more flexible framework, the non-parametric Box-Cox model with unspecified transformation, to address model mis-specification in high-dimensional linear hypothesis testing while preserving the interpretation of regression coefficients. Model estimation and computation in high dimensions poses challenges beyond traditional sparse penalization methods. We propose the constrained partial penalized composite probit regression method for sparse estimation and investigate its statistical properties. 
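The Bradley-Terry fitting step mentioned in the comparative-judgement abstract above can be sketched as penalised likelihood maximisation; the ridge-style penalty and the toy win matrix below are illustrative assumptions, not the penalties compared in the paper.

```python
# Minimal Bradley-Terry fit for pairwise comparative-judgement data by direct
# penalised likelihood maximisation (toy win counts; illustrative penalty).
import numpy as np
from scipy.optimize import minimize

# wins[i, j] = number of times item i beat item j (hypothetical data)
wins = np.array([[0, 8, 6],
                 [2, 0, 7],
                 [4, 3, 0]], dtype=float)
k = wins.shape[0]

def neg_penalised_loglik(theta, lam=0.1):
    diff = theta[:, None] - theta[None, :]
    # log P(i beats j) = (theta_i - theta_j) - log(1 + exp(theta_i - theta_j))
    ll = np.sum(wins * (diff - np.log1p(np.exp(diff))))
    return -ll + lam * np.sum(theta ** 2)

fit = minimize(neg_penalised_loglik, np.zeros(k), method="BFGS")
ratings = fit.x - fit.x.mean()          # identified only up to a location shift
print(np.round(ratings, 2))
```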
Additionally, we present a computationally efficient algorithm using augmented Lagrangian and coordinate majorization descent for solving regularization problems with folded concave penalization and linear constraints. For testing linear hypotheses, we propose the partial penalized composite likelihood ratio test, score test and Wald test, and show that their limiting distributions under the null and local alternatives follow generalized chi-squared distributions with the same degrees of freedom and noncentrality parameter. Extensive simulation studies are conducted to examine the finite sample performance of the proposed tests. Our analysis of supermarket data illustrates potential discrepancies between our testing procedures and standard high-dimensional methods, highlighting the importance of our robustified approach."}, "https://arxiv.org/abs/2405.12924": {"title": "Robust Nonparametric Regression for Compositional Data: the Simplicial--Real case", "link": "https://arxiv.org/abs/2405.12924", "description": "arXiv:2405.12924v1 Announce Type: new \nAbstract: Statistical analysis of compositional data has gained a lot of attention due to its great potential for applications. A feature of these data is that they are multivariate vectors that lie in the simplex, that is, the components of each vector are positive and sum up to a constant value. This fact poses a challenge to the analyst due to the internal dependency of the components, which exhibit a spurious negative correlation. Since classical multivariate techniques are not appropriate in this scenario, it is necessary to endow the simplex with a suitable algebraic-geometric structure, which is a starting point for developing adequate methodology and strategies to handle compositions. We center our attention on regression problems with real responses and compositional covariates, and we adopt a nonparametric approach due to the flexibility it provides. Aware of the potential damage that outliers may produce, we introduce a robust estimator in the framework of nonparametric regression for compositional data. The performance of the estimators is investigated by means of a numerical study where different contamination schemes are simulated. A real data analysis illustrates the advantages of using a robust procedure."}, "https://arxiv.org/abs/2405.12953": {"title": "Quantifying Uncertainty in Classification Performance: ROC Confidence Bands Using Conformal Prediction", "link": "https://arxiv.org/abs/2405.12953", "description": "arXiv:2405.12953v1 Announce Type: new \nAbstract: To evaluate a classification algorithm, it is common practice to plot the ROC curve using test data. However, the inherent randomness in the test data can undermine our confidence in the conclusions drawn from the ROC curve, necessitating uncertainty quantification. In this article, we propose an algorithm to construct confidence bands for the ROC curve, quantifying the uncertainty of classification on the test data in terms of sensitivity and specificity. The algorithm is based on a procedure called conformal prediction, which constructs individualized confidence intervals for the test set; the confidence bands for the ROC curve can then be obtained by combining these individualized intervals. Furthermore, we address both scenarios where the test data are either iid or non-iid relative to the observed data set and propose distinct algorithms for each case with valid coverage probability. 
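One common way to endow the simplex with the algebraic-geometric structure mentioned in the compositional-regression abstract above is Aitchison's centred log-ratio (clr) transform; the sketch below shows the closure and clr operations, although the abstract does not commit to this particular representation.

```python
# Closure and centred log-ratio (clr) transform from Aitchison geometry, one
# common way to give the simplex an algebraic-geometric structure (illustrative).
import numpy as np

def closure(x):
    """Rescale positive parts so each row sums to 1 (a point in the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centred log-ratio: log parts minus their row-wise mean log."""
    logx = np.log(closure(x))
    return logx - logx.mean(axis=-1, keepdims=True)

comp = closure([[10.0, 30.0, 60.0],
                [ 5.0,  5.0, 90.0]])
print(np.round(clr(comp), 3))          # rows sum to zero in clr coordinates
```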
The proposed method is validated through both theoretical results and numerical experiments."}, "https://arxiv.org/abs/2405.12343": {"title": "Determine the Number of States in Hidden Markov Models via Marginal Likelihood", "link": "https://arxiv.org/abs/2405.12343", "description": "arXiv:2405.12343v1 Announce Type: cross \nAbstract: Hidden Markov models (HMM) have been widely used by scientists to model stochastic systems: the underlying process is a discrete Markov chain and the observations are noisy realizations of the underlying process. Determining the number of hidden states for an HMM is a model selection problem, which is yet to be satisfactorily solved, especially for the popular Gaussian HMM with heterogeneous covariance. In this paper, we propose a consistent method for determining the number of hidden states of an HMM based on the marginal likelihood, which is obtained by integrating out both the parameters and hidden states. Moreover, we show that the model selection problem of HMM includes the order selection problem of finite mixture models as a special case. We give rigorous proof of the consistency of the proposed marginal likelihood method and provide an efficient computation method for practical implementation. We numerically compare the proposed method with the Bayesian information criterion (BIC), demonstrating the effectiveness of the proposed marginal likelihood method."}, "https://arxiv.org/abs/2207.11532": {"title": "Change Point Detection for High-dimensional Linear Models: A General Tail-adaptive Approach", "link": "https://arxiv.org/abs/2207.11532", "description": "arXiv:2207.11532v3 Announce Type: replace \nAbstract: We propose a novel approach for detecting change points in high-dimensional linear regression models. Unlike previous research that relied on strict Gaussian/sub-Gaussian error assumptions and assumed prior knowledge of change points, we propose a tail-adaptive method for change point detection and estimation. We use a weighted combination of composite quantile and least squares losses to build a new loss function, allowing us to leverage information from both conditional means and quantiles. For change point testing, we develop a family of individual testing statistics with different weights to account for unknown tail structures. These individual tests are further aggregated to construct a powerful tail-adaptive test for sparse regression coefficient changes. For change point estimation, we propose a family of argmax-based individual estimators. We provide theoretical justifications for the validity of these tests and change point estimators. Additionally, we introduce a new algorithm for detecting multiple change points in a tail-adaptive manner using wild binary segmentation. Extensive numerical results show the effectiveness of our method. Lastly, an R package called ``TailAdaptiveCpt\" is developed to implement our algorithms."}, "https://arxiv.org/abs/2303.14508": {"title": "A spectral based goodness-of-fit test for stochastic block models", "link": "https://arxiv.org/abs/2303.14508", "description": "arXiv:2303.14508v2 Announce Type: replace \nAbstract: Community detection is a fundamental problem in complex network data analysis. Though many methods have been proposed, most existing methods require the number of communities to be a known parameter, which is often not the case in practice. In this paper, we propose a novel goodness-of-fit test for the stochastic block model. The test statistic is based on the linear spectral statistic of the adjacency matrix. 
Under the null hypothesis, we prove that the linear spectral statistic converges in distribution to $N(0,1)$. Some recent results on generalized Wigner matrices are used to prove the main theorems. Numerical experiments and real-world data examples illustrate that our proposed linear spectral statistic has good performance."}, "https://arxiv.org/abs/2304.12414": {"title": "Bayesian Geostatistics Using Predictive Stacking", "link": "https://arxiv.org/abs/2304.12414", "description": "arXiv:2304.12414v2 Announce Type: replace \nAbstract: We develop Bayesian predictive stacking for geostatistical models, where the primary inferential objective is to provide inference on the latent spatial random field and conduct spatial predictions at arbitrary locations. We exploit analytically tractable posterior distributions for regression coefficients of predictors and the realizations of the spatial process conditional upon process parameters. We subsequently combine such inference by stacking these models across the range of values of the hyper-parameters. We devise stacking of means and posterior densities in a manner that is computationally efficient without resorting to iterative algorithms such as Markov chain Monte Carlo (MCMC) and can exploit the benefits of parallel computations. We offer novel theoretical insights into the resulting inference within an infill asymptotic paradigm and through empirical results showing that stacked inference is comparable to full sampling-based Bayesian inference at a significantly lower computational cost."}, "https://arxiv.org/abs/2311.00596": {"title": "Evaluating Binary Outcome Classifiers Estimated from Survey Data", "link": "https://arxiv.org/abs/2311.00596", "description": "arXiv:2311.00596v3 Announce Type: replace \nAbstract: Surveys are commonly used to facilitate research in epidemiology, health, and the social and behavioral sciences. Often, these surveys are not simple random samples, and respondents are given weights reflecting their probability of selection into the survey. It is well known that analysts can use these survey weights to produce unbiased estimates of population quantities like totals. In this article, we show that survey weights can also be beneficial for evaluating the quality of predictive models when splitting data into training and test sets. In particular, we characterize model assessment statistics, such as sensitivity and specificity, as finite population quantities, and compute survey-weighted estimates of these quantities with sample test data comprising a random subset of the original data. Using simulations with data from the National Survey on Drug Use and Health and the National Comorbidity Survey, we show that unweighted metrics estimated with sample test data can misrepresent population performance, but weighted metrics appropriately adjust for the complex sampling design. We also show that this conclusion holds for models trained using upsampling for mitigating class imbalance. The results suggest that weighted metrics should be used when evaluating performance on sample test data."}, "https://arxiv.org/abs/2312.00590": {"title": "Inference on common trends in functional time series", "link": "https://arxiv.org/abs/2312.00590", "description": "arXiv:2312.00590v4 Announce Type: replace \nAbstract: We study statistical inference on unit roots and cointegration for time series in a Hilbert space. 
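The survey-weighted sensitivity and specificity described in the classifier-evaluation abstract above reduce to weighted ratios on the test split; the sketch below uses hypothetical weights and predictions to show the computation.

```python
# Survey-weighted sensitivity and specificity on a test split, treating them as
# finite-population quantities (hypothetical weights, outcomes, and predictions).
import numpy as np

rng = np.random.default_rng(4)
n = 1000
w = rng.uniform(0.5, 5.0, n)            # survey weights (inverse selection probabilities)
y = rng.integers(0, 2, n)               # observed outcome on the test split
pred = np.where(rng.random(n) < 0.75, y, 1 - y)

sens = np.sum(w * (pred == 1) * (y == 1)) / np.sum(w * (y == 1))
spec = np.sum(w * (pred == 0) * (y == 0)) / np.sum(w * (y == 0))
print(round(sens, 2), round(spec, 2))
```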
We develop statistical inference on the number of common stochastic trends embedded in the time series, i.e., the dimension of the nonstationary subspace. We also consider tests of hypotheses on the nonstationary and stationary subspaces themselves. The Hilbert space can be of an arbitrarily large dimension, and our methods remain asymptotically valid even when the time series of interest takes values in a subspace of possibly unknown dimension. This has wide applicability in practice; for example, to the case of cointegrated vector time series that are either high-dimensional or of finite dimension, to high-dimensional factor models that include a finite number of nonstationary factors, to cointegrated curve-valued (or function-valued) time series, and to nonstationary dynamic functional factor models. We include two empirical illustrations to the term structure of interest rates and labor market indices, respectively."}, "https://arxiv.org/abs/2210.14086": {"title": "A Global Wavelet Based Bootstrapped Test of Covariance Stationarity", "link": "https://arxiv.org/abs/2210.14086", "description": "arXiv:2210.14086v3 Announce Type: replace-cross \nAbstract: We propose a covariance stationarity test for an otherwise dependent and possibly globally non-stationary time series. We work in a generalized version of the new setting in Jin, Wang and Wang (2015), who exploit Walsh (1923) functions in order to compare sub-sample covariances with the full sample counterpart. They impose strict stationarity under the null, only consider linear processes under either hypothesis in order to achieve a parametric estimator for an inverted high dimensional asymptotic covariance matrix, and do not consider any other orthonormal basis. Conversely, we work with a general orthonormal basis under mild conditions that include Haar wavelet and Walsh functions; and we allow for linear or nonlinear processes with possibly non-iid innovations. This is important in macroeconomics and finance where nonlinear feedback and random volatility occur in many settings. We completely sidestep asymptotic covariance matrix estimation and inversion by bootstrapping a max-correlation difference statistic, where the maximum is taken over the correlation lag $h$ and basis generated sub-sample counter $k$ (the number of systematic samples). We achieve a higher feasible rate of increase for the maximum lag and counter $\\mathcal{H}_{T}$ and $\\mathcal{K}_{T}$. Of particular note, our test is capable of detecting breaks in variance, and distant, or very mild, deviations from stationarity."}, "https://arxiv.org/abs/2307.03687": {"title": "Leveraging text data for causal inference using electronic health records", "link": "https://arxiv.org/abs/2307.03687", "description": "arXiv:2307.03687v2 Announce Type: replace-cross \nAbstract: In studies that rely on data from electronic health records (EHRs), unstructured text data such as clinical progress notes offer a rich source of information about patient characteristics and care that may be missing from structured data. Despite the prevalence of text in clinical research, these data are often ignored for the purposes of quantitative analysis due to their complexity. This paper presents a unified framework for leveraging text data to support causal inference with electronic health data at multiple stages of analysis. 
In particular, we consider how natural language processing and statistical text analysis can be combined with standard inferential techniques to address common challenges due to missing data, confounding bias, and treatment effect heterogeneity. Through an application to a recent EHR study investigating the effects of a non-randomized medical intervention on patient outcomes, we show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect and identify patient subgroups that may benefit most from treatment. We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited, such as in developing countries. To this end, we provide code and open-source replication materials to encourage adoption and broader exploration of these techniques in clinical research."}, "https://arxiv.org/abs/2312.10002": {"title": "On the Injectivity of Euler Integral Transforms with Hyperplanes and Quadric Hypersurfaces", "link": "https://arxiv.org/abs/2312.10002", "description": "arXiv:2312.10002v2 Announce Type: replace-cross \nAbstract: The Euler characteristic transform (ECT) is an integral transform used widely in topological data analysis. Previous efforts by Curry et al. and Ghrist et al. have independently shown that the ECT is injective on all compact definable sets. In this work, we first study the injectivity of the ECT on definable sets that are not necessarily compact and prove a complete classification of constructible functions that the Euler characteristic transform is not injective on. We then introduce the quadric Euler characteristic transform (QECT) as a natural generalization of the ECT by detecting definable shapes with quadric hypersurfaces rather than hyperplanes. We also discuss some criteria for the injectivity of QECT."}, "https://arxiv.org/abs/2405.13100": {"title": "Better Simulations for Validating Causal Discovery with the DAG-Adaptation of the Onion Method", "link": "https://arxiv.org/abs/2405.13100", "description": "arXiv:2405.13100v1 Announce Type: new \nAbstract: The number of artificial intelligence algorithms for learning causal models from data is growing rapidly. Most ``causal discovery'' or ``causal structure learning'' algorithms are primarily validated through simulation studies. However, no widely accepted simulation standards exist and publications often report conflicting performance statistics -- even when only considering publications that simulate data from linear models. In response, several manuscripts have criticized a popular simulation design for validating algorithms in the linear case.\n We propose a new simulation design for generating linear models for directed acyclic graphs (DAGs): the DAG-adaptation of the Onion (DaO) method. DaO simulations are fundamentally different from existing simulations because they prioritize the distribution of correlation matrices rather than the distribution of linear effects. Specifically, the DaO method uniformly samples the space of all correlation matrices consistent with (i.e. Markov to) a DAG. We also discuss how to sample DAGs and present methods for generating DAGs with scale-free in-degree or out-degree. We compare the DaO method against two alternative simulation designs and provide implementations of the DaO method in Python and R: https://github.com/bja43/DaO_simulation. 
We advocate for others to adopt DaO simulations as a fair universal benchmark."}, "https://arxiv.org/abs/2405.13342": {"title": "Scalable Bayesian inference for heat kernel Gaussian processes on manifolds", "link": "https://arxiv.org/abs/2405.13342", "description": "arXiv:2405.13342v1 Announce Type: new \nAbstract: We develop scalable manifold learning methods and theory, motivated by the problem of estimating manifold of fMRI activation in the Human Connectome Project (HCP). We propose the Fast Graph Laplacian Estimation for Heat Kernel Gaussian Processes (FLGP) in the natural exponential family model. FLGP handles large sample sizes $ n $, preserves the intrinsic geometry of data, and significantly reduces computational complexity from $ \\mathcal{O}(n^3) $ to $ \\mathcal{O}(n) $ via a novel reduced-rank approximation of the graph Laplacian's transition matrix and truncated Singular Value Decomposition for eigenpair computation. Our numerical experiments demonstrate FLGP's scalability and improved accuracy for manifold learning from large-scale complex data."}, "https://arxiv.org/abs/2405.13353": {"title": "Adaptive Bayesian Multivariate Spline Knot Inference with Prior Specifications on Model Complexity", "link": "https://arxiv.org/abs/2405.13353", "description": "arXiv:2405.13353v1 Announce Type: new \nAbstract: In multivariate spline regression, the number and locations of knots influence the performance and interpretability significantly. However, due to non-differentiability and varying dimensions, there is no desirable frequentist method to make inference on knots. In this article, we propose a fully Bayesian approach for knot inference in multivariate spline regression. The existing Bayesian method often uses BIC to calculate the posterior, but BIC is too liberal and it will heavily overestimate the knot number when the candidate model space is large. We specify a new prior on the knot number to take into account the complexity of the model space and derive an analytic formula in the normal model. In the non-normal cases, we utilize the extended Bayesian information criterion to approximate the posterior density. The samples are simulated in the space with differing dimensions via reversible jump Markov chain Monte Carlo. We apply the proposed method in knot inference and manifold denoising. Experiments demonstrate the splendid capability of the algorithm, especially in function fitting with jumping discontinuity."}, "https://arxiv.org/abs/2405.13400": {"title": "Ensemble size dependence of the logarithmic score for forecasts issued as multivariate normal distributions", "link": "https://arxiv.org/abs/2405.13400", "description": "arXiv:2405.13400v1 Announce Type: new \nAbstract: Multivariate probabilistic verification is concerned with the evaluation of joint probability distributions of vector quantities such as a weather variable at multiple locations or a wind vector for instance. The logarithmic score is a proper score that is useful in this context. In order to apply this score to ensemble forecasts, a choice for the density is required. Here, we are interested in the specific case when the density is multivariate normal with mean and covariance given by the ensemble mean and ensemble covariance, respectively. Under the assumptions of multivariate normality and exchangeability of the ensemble members, a relationship is derived which describes how the logarithmic score depends on ensemble size. 
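The quantity whose ensemble-size dependence is characterised in the logarithmic-score abstract above is the Gaussian log score computed from the ensemble mean and covariance; the sketch below evaluates the unadjusted score on simulated data and does not reproduce the fair-score adjustment itself.

```python
# Unadjusted multivariate-normal logarithmic score (negative log predictive
# density) computed from an ensemble's mean and covariance (simulated data;
# the ensemble-size adjustment derived in the paper is not reproduced).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
d, m = 3, 20                                  # dimension and ensemble size
ensemble = rng.normal(size=(m, d))            # hypothetical ensemble forecast
obs = rng.normal(size=d)                      # verifying observation

mu = ensemble.mean(axis=0)
cov = np.cov(ensemble, rowvar=False)          # sample covariance of the members
log_score = -multivariate_normal(mean=mu, cov=cov).logpdf(obs)
print(round(log_score, 2))
```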
It permits estimation of the score in the limit of infinite ensemble size from a small ensemble and thus produces a fair logarithmic score for multivariate ensemble forecasts under the assumption of normality. This generalises a study from 2018, which derived the ensemble size adjustment of the logarithmic score in the univariate case.\n An application to medium-range forecasts examines the usefulness of the ensemble size adjustments when multivariate normality is only an approximation. Predictions of vectors consisting of several different combinations of upper air variables are considered. Logarithmic scores are calculated for these vectors using ECMWF's daily extended-range forecasts, which consist of a 100-member ensemble. The probabilistic forecasts of these vectors are verified against operational ECMWF analyses in the Northern mid-latitudes in autumn 2023. Scores are computed for ensemble sizes from 8 to 100. The fair logarithmic scores of ensembles with different cardinalities are very close, in contrast to the unadjusted scores, which decrease considerably with ensemble size. This provides evidence for the practical usefulness of the derived relationships."}, "https://arxiv.org/abs/2405.13537": {"title": "Sequential Bayesian inference for stochastic epidemic models of cumulative incidence", "link": "https://arxiv.org/abs/2405.13537", "description": "arXiv:2405.13537v1 Announce Type: new \nAbstract: Epidemics are inherently stochastic, and stochastic models provide an appropriate way to describe and analyse such phenomena. Given temporal incidence data consisting of, for example, the number of new infections or removals in a given time window, a continuous-time discrete-valued Markov process provides a natural description of the dynamics of each model component, typically taken to be the number of susceptible, exposed, infected or removed individuals. Fitting the SEIR model to time-course data is a challenging problem due to incomplete observations and, consequently, the intractability of the observed data likelihood. Whilst sampling-based inference schemes such as Markov chain Monte Carlo are routinely applied, their computational cost typically restricts analysis to data sets of no more than a few thousand infective cases. Instead, we develop a sequential inference scheme that makes use of a computationally cheap approximation of the most natural Markov process model. Crucially, the resulting model allows a tractable conditional parameter posterior, which can be summarised in terms of a set of low dimensional statistics. This is used to rejuvenate parameter samples in conjunction with a novel bridge construct for propagating state trajectories conditional on the next observation of cumulative incidence. The resulting inference framework also allows for stochastic infection and reporting rates. We illustrate our approach using synthetic and real data applications."}, "https://arxiv.org/abs/2405.13553": {"title": "Hidden semi-Markov models with inhomogeneous state dwell-time distributions", "link": "https://arxiv.org/abs/2405.13553", "description": "arXiv:2405.13553v1 Announce Type: new \nAbstract: The well-established methodology for the estimation of hidden semi-Markov models (HSMMs) as hidden Markov models (HMMs) with extended state spaces is further developed to incorporate covariate influences across all aspects of the state process model, in particular, regarding the distributions governing the state dwell time. 
The special case of periodically varying covariate effects on the state dwell-time distributions - and possibly the conditional transition probabilities - is examined in detail to derive important properties of such models, namely the periodically varying unconditional state distribution as well as the overall state dwell-time distribution. Through simulation studies, we ascertain key properties of these models and develop recommendations for hyperparameter settings. Furthermore, we provide a case study involving an HSMM with periodically varying dwell-time distributions to analyse the movement trajectory of an arctic muskox, demonstrating the practical relevance of the developed methodology."}, "https://arxiv.org/abs/2405.13591": {"title": "Running in circles: is practical application feasible for data fission and data thinning in post-clustering differential analysis?", "link": "https://arxiv.org/abs/2405.13591", "description": "arXiv:2405.13591v1 Announce Type: new \nAbstract: The standard pipeline to analyse single-cell RNA sequencing (scRNA-seq) often involves two steps: clustering and Differential Expression Analysis (DEA) to annotate cell populations based on gene expression. However, using clustering results for data-driven hypothesis formulation compromises statistical properties, especially Type I error control. Data fission was introduced to split the information contained in each observation into two independent parts that can be used for clustering and testing. However, data fission was originally designed for non-mixture distributions, and adapting it for mixtures requires knowledge of the unknown clustering structure to estimate component-specific scale parameters. As components are typically unavailable in practice, scale parameter estimators often exhibit bias. We explicitly quantify how this bias affects the Type I error rate of subsequent post-clustering differential analysis despite employing data fission. In response, we propose a novel approach that involves modeling each observation as a realization of its distribution, with scale parameters estimated non-parametrically. Simulation studies showcase the efficacy of our method when components are clearly separated. However, the level of separability required to reach good performance complicates its application to real scRNA-seq data."}, "https://arxiv.org/abs/2405.13621": {"title": "Interval identification of natural effects in the presence of outcome-related unmeasured confounding", "link": "https://arxiv.org/abs/2405.13621", "description": "arXiv:2405.13621v1 Announce Type: new \nAbstract: With reference to a binary outcome and a binary mediator, we derive identification bounds for natural effects under a reduced set of assumptions. Specifically, no assumptions about confounding are made that involve the outcome; we only assume no unobserved exposure-mediator confounding as well as a condition termed partially constant cross-world dependence (PC-CWD), which poses fewer constraints on the counterfactual probabilities than the usual cross-world independence assumption. The proposed strategy can also be used to achieve interval identification of the total effect, which is no longer point identified under the considered set of assumptions. Our derivations are based on postulating a logistic regression model for the mediator as well as for the outcome. 
However, in both cases the functional form governing the dependence on the explanatory variables is allowed to be arbitrary, thereby resulting in a semi-parametric approach. To account for sampling variability, we provide delta-method approximations of standard errors in order to build uncertainty intervals from identification bounds. The proposed method is applied to a dataset gathered from a Spanish prospective cohort study. The aim is to evaluate whether the effect of smoking on lung cancer risk is mediated by the onset of pulmonary emphysema."}, "https://arxiv.org/abs/2405.13767": {"title": "Enhancing Dose Selection in Phase I Cancer Trials: Extending the Bayesian Logistic Regression Model with Non-DLT Adverse Events Integration", "link": "https://arxiv.org/abs/2405.13767", "description": "arXiv:2405.13767v1 Announce Type: new \nAbstract: This paper presents the Burdened Bayesian Logistic Regression Model (BBLRM), an enhancement to the Bayesian Logistic Regression Model (BLRM) for dose-finding in phase I oncology trials. Traditionally, the BLRM determines the maximum tolerated dose (MTD) based on dose-limiting toxicities (DLTs). However, clinicians often perceive model-based designs like BLRM as complex and less conservative than rule-based designs, such as the widely used 3+3 method. To address these concerns, the BBLRM incorporates non-DLT adverse events (nDLTAEs) into the model. These events, although not severe enough to qualify as DLTs, provide additional information suggesting that higher doses might result in DLTs. In the BBLRM, an additional parameter $\\delta$ is introduced to account for nDLTAEs. This parameter adjusts the toxicity probability estimates, making the model more conservative in dose escalation. The $\\delta$ parameter is derived from the proportion of patients experiencing nDLTAEs within each cohort and is tuned to balance the model's conservatism. This approach aims to reduce the likelihood of assigning toxic doses as MTD while involving clinicians more directly in the decision-making process. The paper includes a simulation study comparing BBLRM with the traditional BLRM across various scenarios. The simulations demonstrate that BBLRM significantly reduces the selection of toxic doses as MTD without compromising, and sometimes even increasing, the accuracy of MTD identification. These results suggest that integrating nDLTAEs into the dose-finding process can enhance the safety and acceptance of model-based designs in phase I oncology trials."}, "https://arxiv.org/abs/2405.13783": {"title": "Nonparametric quantile regression for spatio-temporal processes", "link": "https://arxiv.org/abs/2405.13783", "description": "arXiv:2405.13783v1 Announce Type: new \nAbstract: In this paper, we develop a new and effective approach to nonparametric quantile regression that accommodates ultrahigh-dimensional data arising from spatio-temporal processes. This approach proves advantageous in staving off computational challenges that constitute known hindrances to existing nonparametric quantile regression methods when the number of predictors is much larger than the available sample size. We investigate conditions under which estimation is feasible and of good overall quality and obtain sharp approximations that we employ to devising statistical inference methodology. These include simultaneous confidence intervals and tests of hypotheses, whose asymptotics is borne by a non-trivial functional central limit theorem tailored to martingale differences. 
Additionally, we provide finite-sample results through various simulations which, accompanied by an illustrative application to real-worldesque data (on electricity demand), offer guarantees on the performance of the proposed methodology."}, "https://arxiv.org/abs/2405.13799": {"title": "Extending Kernel Testing To General Designs", "link": "https://arxiv.org/abs/2405.13799", "description": "arXiv:2405.13799v1 Announce Type: new \nAbstract: Kernel-based testing has revolutionized the field of non-parametric tests through the embedding of distributions in an RKHS. This strategy has proven to be powerful and flexible, yet its applicability has been limited to the standard two-sample case, while practical situations often involve more complex experimental designs. To extend kernel testing to any design, we propose a linear model in the RKHS that allows for the decomposition of mean embeddings into additive functional effects. We then introduce a truncated kernel Hotelling-Lawley statistic to test the effects of the model, demonstrating that its asymptotic distribution is chi-square, which remains valid with its Nystrom approximation. We discuss a homoscedasticity assumption that, although absent in the standard two-sample case, is necessary for general designs. Finally, we illustrate our framework using a single-cell RNA sequencing dataset and provide kernel-based generalizations of classical diagnostic and exploration tools to broaden the scope of kernel testing in any experimental design."}, "https://arxiv.org/abs/2405.13801": {"title": "Bayesian Inference Under Differential Privacy: Prior Selection Considerations with Application to Univariate Gaussian Data and Regression", "link": "https://arxiv.org/abs/2405.13801", "description": "arXiv:2405.13801v1 Announce Type: new \nAbstract: We describe Bayesian inference for the mean and variance of bounded data protected by differential privacy and modeled as Gaussian. Using this setting, we demonstrate that analysts can and should take the constraints imposed by the bounds into account when specifying prior distributions. Additionally, we provide theoretical and empirical results regarding what classes of default priors produce valid inference for a differentially private release in settings where substantial prior information is not available. We discuss how these results can be applied to Bayesian inference for regression with differentially private data."}, "https://arxiv.org/abs/2405.13844": {"title": "Causal Inference with Cocycles", "link": "https://arxiv.org/abs/2405.13844", "description": "arXiv:2405.13844v1 Announce Type: new \nAbstract: Many interventions in causal inference can be represented as transformations. We identify a local symmetry property satisfied by a large class of causal models under such interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional and counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show they achieve semiparametric efficiency under typical conditions. Since (infinitely) many distributions can share the same cocycle, these estimators make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. 
We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset."}, "https://arxiv.org/abs/2405.13926": {"title": "Some models are useful, but for how long?: A decision theoretic approach to choosing when to refit large-scale prediction models", "link": "https://arxiv.org/abs/2405.13926", "description": "arXiv:2405.13926v1 Announce Type: new \nAbstract: Large-scale prediction models (typically using tools from artificial intelligence, AI, or machine learning, ML) are increasingly ubiquitous across a variety of industries and scientific domains. Such methods are often paired with detailed data from sources such as electronic health records, wearable sensors, and omics data (high-throughput technology used to understand biology). Despite their utility, implementing AI and ML tools at the scale necessary to work with this data introduces two major challenges. First, it can cost tens of thousands of dollars to train a modern AI/ML model at scale. Second, once the model is trained, its predictions may become less relevant as patient and provider behavior change, and predictions made for one geographical area may be less accurate for another. These two challenges raise a fundamental question: how often should you refit the AI/ML model to optimally trade-off between cost and relevance? Our work provides a framework for making decisions about when to {\\it refit} AI/ML models when the goal is to maintain valid statistical inference (e.g. estimating a treatment effect in a clinical trial). Drawing on portfolio optimization theory, we treat the decision of {\\it recalibrating} versus {\\it refitting} the model as a choice between ''investing'' in one of two ''assets.'' One asset, recalibrating the model based on another model, is quick and relatively inexpensive but bears uncertainty from sampling and the possibility that the other model is not relevant to current circumstances. The other asset, {\\it refitting} the model, is costly but removes the irrelevance concern (though not the risk of sampling error). We explore the balancing act between these two potential investments in this paper."}, "https://arxiv.org/abs/2405.13945": {"title": "Exogenous Consideration and Extended Random Utility", "link": "https://arxiv.org/abs/2405.13945", "description": "arXiv:2405.13945v1 Announce Type: new \nAbstract: In a consideration set model, an individual maximizes utility among the considered alternatives. I relate a consideration set additive random utility model to classic discrete choice and the extended additive random utility model, in which utility can be $-\\infty$ for infeasible alternatives. When observable utility shifters are bounded, all three models are observationally equivalent. Moreover, they have the same counterfactual bounds and welfare formulas for changes in utility shifters like price. For attention interventions, welfare cannot change in the full consideration model but is completely unbounded in the limited consideration model. 
The identified set for consideration set probabilities has a minimal width for any bounded support of shifters, but with unbounded support it is a point: identification \"towards\" infinity does not resemble identification \"at\" infinity."}, "https://arxiv.org/abs/2405.13970": {"title": "Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces", "link": "https://arxiv.org/abs/2405.13970", "description": "arXiv:2405.13970v1 Announce Type: new \nAbstract: Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees in the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different function objects, such as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics."}, "https://arxiv.org/abs/2405.14048": {"title": "fsemipar: an R package for SoF semiparametric regression", "link": "https://arxiv.org/abs/2405.14048", "description": "arXiv:2405.14048v1 Announce Type: new \nAbstract: Functional data analysis has become a tool of interest in applied areas such as economics, medicine, and chemistry. Among the techniques developed in recent literature, functional semiparametric regression stands out for its balance between flexible modelling and output interpretation. Despite the large variety of research papers dealing with scalar-on-function (SoF) semiparametric models, there is a notable gap in software tools for their implementation. This article introduces the R package \\texttt{fsemipar}, tailored for these models. \\texttt{fsemipar} not only estimates functional single-index models using kernel smoothing techniques but also estimates and selects relevant scalar variables in semi-functional models with multivariate linear components. A standout feature is its ability to identify impact points of a curve on the response, even in models with multiple functional covariates, and to integrate both continuous and pointwise effects of functional predictors within a single model. In addition, it allows the use of location-adaptive estimators based on the $k$-nearest-neighbours approach for all the semiparametric models included. 
Its flexible interface empowers users to customise a wide range of input parameters and includes the standard S3 methods for prediction, statistical analysis, and estimate visualization (\\texttt{predict}, \\texttt{summary}, \\texttt{print}, and \\texttt{plot}), enhancing clear result interpretation. Throughout the article, we illustrate the functionalities and the practicality of \\texttt{fsemipar} using two chemometric datasets."}, "https://arxiv.org/abs/2405.14104": {"title": "On the Identifying Power of Monotonicity for Average Treatment Effects", "link": "https://arxiv.org/abs/2405.14104", "description": "arXiv:2405.14104v1 Announce Type: new \nAbstract: In the context of a binary outcome, treatment, and instrument, Balke and Pearl (1993, 1997) establish that adding monotonicity to the instrument exogeneity assumption does not decrease the identified sets for average potential outcomes and average treatment effect parameters when those assumptions are consistent with the distribution of the observable data. We show that the same results hold in the broader context of multi-valued outcome, treatment, and instrument. An important example of such a setting is a multi-arm randomized controlled trial with noncompliance."}, "https://arxiv.org/abs/2405.14145": {"title": "Generalised Bayes Linear Inference", "link": "https://arxiv.org/abs/2405.14145", "description": "arXiv:2405.14145v1 Announce Type: new \nAbstract: Motivated by big data and the vast parameter spaces in modern machine learning models, optimisation approaches to Bayesian inference have seen a surge in popularity in recent years. In this paper, we address the connection between the popular new methods termed generalised Bayesian inference and Bayes linear methods. We propose a further generalisation to Bayesian inference that unifies these and other recent approaches by considering the Bayesian inference problem as one of finding the closest point in a particular solution space to a data generating process, where these notions differ depending on user-specified geometries and foundational belief systems. Motivated by this framework, we propose a generalisation to Bayes linear approaches that enables fast and principled inferences that obey the coherence requirements implied by domain restrictions on random quantities. We demonstrate the efficacy of generalised Bayes linear inference on a number of examples, including monotonic regression and inference for spatial counts. This paper is accompanied by an R package available at github.com/astfalckl/bayeslinear."}, "https://arxiv.org/abs/2405.14149": {"title": "A Direct Importance Sampling-based Framework for Rare Event Uncertainty Quantification in Non-Gaussian Spaces", "link": "https://arxiv.org/abs/2405.14149", "description": "arXiv:2405.14149v1 Announce Type: new \nAbstract: This work introduces a novel framework for precisely and efficiently estimating rare event probabilities in complex, high-dimensional non-Gaussian spaces, building on our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach. An unnormalized sampling target is first constructed and sampled, relaxing the optimal importance sampling distribution and appropriately designed for non-Gaussian spaces. Post-sampling, its normalizing constant is estimated using a stable inverse importance sampling procedure, employing an importance sampling density based on the already available samples. The sought probability is then computed based on the estimates evaluated in these two stages. 
The proposed estimator is theoretically analyzed, proving its unbiasedness and deriving its analytical coefficient of variation. To sample the constructed target, we resort to our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) and we prove that it converges to the correct stationary target distribution. To avoid the challenging task of tuning the trajectory length in complex spaces, QNp-HMCMC is effectively utilized in this work with a single-step integration. We thus show the equivalence of QNp-HMCMC with single-step implementation to a unique and efficient preconditioned Metropolis-adjusted Langevin algorithm (MALA). An optimization approach is also leveraged to initiate QNp-HMCMC effectively, and the implementation of the developed framework in bounded spaces is eventually discussed. A series of diverse problems involving high dimensionality (several hundred inputs), strong nonlinearity, and non-Gaussianity is presented, showcasing the capabilities and efficiency of the suggested framework and demonstrating its advantages compared to relevant state-of-the-art sampling methods."}, "https://arxiv.org/abs/2405.14166": {"title": "Optimal Bayesian predictive probability for delayed response in single-arm clinical trials with binary efficacy outcome", "link": "https://arxiv.org/abs/2405.14166", "description": "arXiv:2405.14166v1 Announce Type: new \nAbstract: In oncology, phase II or multiple expansion cohort trials are crucial for clinical development plans. This is because they aid in identifying potent agents with sufficient activity to continue development and confirm the proof of concept. Typically, these clinical trials are single-arm trials, with the primary endpoint being short-term treatment efficacy. Despite the development of several well-designed methodologies, there may be a practical impediment in that the endpoints may not be observed within a sufficient time for adaptive go/no-go decisions to be made in a timely manner at each interim monitoring. Specifically, the Response Evaluation Criteria in Solid Tumors guideline defines a confirmed response and necessitates it in non-randomized trials, where the response is the primary endpoint. However, obtaining the confirmed outcome from all participants entered at interim monitoring may be time-consuming as non-responders should be followed up until the disease progresses. Thus, this study proposed an approach to accelerate the decision-making process that incorporated the outcome without confirmation by discounting its contribution to the decision-making framework using the generalized Bayes' theorem. Further, the behavior of the proposed approach was evaluated through a simple simulation study. The results demonstrated that the proposed approach made appropriate interim go/no-go decisions."}, "https://arxiv.org/abs/2405.14208": {"title": "An Empirical Comparison of Methods to Produce Business Statistics Using Non-Probability Data", "link": "https://arxiv.org/abs/2405.14208", "description": "arXiv:2405.14208v1 Announce Type: new \nAbstract: There is a growing trend among statistical agencies to explore non-probability data sources for producing more timely and detailed statistics, while reducing costs and respondent burden. Coverage and measurement error are two issues that may be present in such data. 
The imperfections may be corrected using available information relating to the population of interest, such as a census or a reference probability sample.\n In this paper, we compare a wide range of existing methods for producing population estimates using a non-probability dataset through a simulation study based on a realistic business population. The study was conducted to examine the performance of the methods under different missingness and data quality assumptions. The results confirm the ability of the methods examined to address selection bias. When no measurement error is present in the non-probability dataset, a screening dual-frame approach for the probability sample tends to yield lower sample size and mean squared error results. The presence of measurement error and/or nonignorable missingness increases mean squared errors for estimators that depend heavily on the non-probability data. In this case, the best approach tends to be to fall back to a model-assisted estimator based on the probability sample."}, "https://arxiv.org/abs/2405.14392": {"title": "Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing Flows", "link": "https://arxiv.org/abs/2405.14392", "description": "arXiv:2405.14392v1 Announce Type: new \nAbstract: Continuous normalizing flows (CNFs) learn the probability path between a reference and a target density by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we re-purpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective and using the learned probability path to improve Monte Carlo sampling. We propose a sequential method, which uses samples from a Markov chain to fix the probability path defining the FM objective. We augment this scheme with an adaptive tempering mechanism that allows the discovery of multiple modes in the target. Under mild assumptions, we establish convergence to a local optimum of the FM objective, discuss improvements in the convergence rate, and illustrate our methods on synthetic and real-world examples."}, "https://arxiv.org/abs/2405.14456": {"title": "Cumulant-based approximation for fast and efficient prediction for species distribution", "link": "https://arxiv.org/abs/2405.14456", "description": "arXiv:2405.14456v1 Announce Type: new \nAbstract: Species distribution modeling plays an important role in estimating the habitat suitability of species using environmental variables. For this purpose, Maxent and the Poisson point process are popular and powerful methods extensively employed across various ecological and biological sciences. However, the computational speed becomes prohibitively slow when using huge background datasets, which is often the case with fine-resolution data or global-scale estimations. To address this problem, we propose a computationally efficient species distribution model using a cumulant-based approximation (CBA) applied to the loss function of $\\gamma$-divergence. Additionally, we introduce a sequential estimating algorithm with an $L_1$ penalty to select important environmental variables closely associated with species distribution. The regularized geometric-mean method, derived from the CBA, demonstrates high computational efficiency and estimation accuracy. 
Moreover, by applying CBA to Maxent, we establish that Maxent and Fisher linear discriminant analysis are equivalent under a normality assumption. This equivalence leads to a highly efficient computational method for estimating species distribution. The effectiveness of our proposed methods is illustrated through simulation studies and by analyzing data on 226 species from the National Centre for Ecological Analysis and Synthesis and 709 Japanese vascular plant species. The computational efficiency of the proposed methods is significantly improved compared to Maxent, while maintaining comparable estimation accuracy. An R package {\\tt CBA} is also prepared to provide all programming code used in the simulation studies and real data analysis."}, "https://arxiv.org/abs/2405.14492": {"title": "Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data", "link": "https://arxiv.org/abs/2405.14492", "description": "arXiv:2405.14492v1 Announce Type: new \nAbstract: Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages."}, "https://arxiv.org/abs/2405.14509": {"title": "Closed-form estimators for an exponential family derived from likelihood equations", "link": "https://arxiv.org/abs/2405.14509", "description": "arXiv:2405.14509v1 Announce Type: new \nAbstract: In this paper, we derive closed-form estimators for the parameters of some probability distributions belonging to the exponential family. A bootstrap bias-reduced version of these proposed closed-form estimators is also derived. A Monte Carlo simulation is performed for the assessment of the estimators. The results are seen to be quite favorable to the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2405.14628": {"title": "Online robust estimation and bootstrap inference for function-on-scalar regression", "link": "https://arxiv.org/abs/2405.14628", "description": "arXiv:2405.14628v1 Announce Type: new \nAbstract: We propose a novel and robust online function-on-scalar regression technique via geometric median to learn associations between functional responses and scalar covariates based on massive or streaming datasets. 
The online estimation procedure, developed using the average stochastic gradient descent algorithm, offers an efficient and cost-effective method for analyzing sequentially augmented datasets, eliminating the need to store large volumes of data in memory. We establish the almost sure consistency, $L_p$ convergence, and asymptotic normality of the online estimator. To enable efficient and fast inference of the parameters of interest, including the derivation of confidence intervals, we also develop an innovative two-step online bootstrap procedure to approximate the limiting error distribution of the robust online estimator. Numerical studies under a variety of scenarios demonstrate the effectiveness and efficiency of the proposed online learning method. A real application analyzing PM$_{2.5}$ air-quality data is also included to exemplify the proposed online approach."}, "https://arxiv.org/abs/2405.14652": {"title": "Statistical inference for high-dimensional convoluted rank regression", "link": "https://arxiv.org/abs/2405.14652", "description": "arXiv:2405.14652v1 Announce Type: new \nAbstract: High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression is recently proposed, and penalized convoluted rank regression estimators are introduced. However, these developed estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods."}, "https://arxiv.org/abs/2405.14686": {"title": "Efficient Algorithms for the Sensitivities of the Pearson Correlation Coefficient and Its Statistical Significance to Online Data", "link": "https://arxiv.org/abs/2405.14686", "description": "arXiv:2405.14686v1 Announce Type: new \nAbstract: Reliably measuring the collinearity of bivariate data is crucial in statistics, particularly for time-series analysis or ongoing studies in which incoming observations can significantly impact current collinearity estimates. Leveraging identities from Welford's online algorithm for sample variance, we develop a rigorous theoretical framework for analyzing the maximal change to the Pearson correlation coefficient and its p-value that can be induced by additional data. Further, we show that the resulting optimization problems yield elegant closed-form solutions that can be accurately computed by linear- and constant-time algorithms. Our work not only creates new theoretical avenues for robust correlation measures, but also has broad practical implications for disciplines that span econometrics, operations research, clinical trials, climatology, differential privacy, and bioinformatics. 
Software implementations of our algorithms in Cython-wrapped C are made available at https://github.com/marc-harary/sensitivity for reproducibility, practical deployment, and future theoretical development."}, "https://arxiv.org/abs/2405.14711": {"title": "Zero-inflation in the Multivariate Poisson Lognormal Family", "link": "https://arxiv.org/abs/2405.14711", "description": "arXiv:2405.14711v1 Announce Type: new \nAbstract: Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts solely stem from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific, or dependent on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) a Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing $90.6\\%$ of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination."}, "https://arxiv.org/abs/2405.13224": {"title": "Integrating behavioral experimental findings into dynamical models to inform social change interventions", "link": "https://arxiv.org/abs/2405.13224", "description": "arXiv:2405.13224v1 Announce Type: cross \nAbstract: Addressing global challenges -- from public health to climate change -- often involves stimulating the large-scale adoption of new products or behaviors. Research traditions that focus on individual decision making suggest that achieving this objective requires better identifying the drivers of individual adoption choices. On the other hand, computational approaches rooted in complexity science focus on maximizing the propagation of a given product or behavior throughout social networks of interconnected adopters. The integration of these two perspectives -- although advocated by several research communities -- has remained elusive so far. Here we show how achieving this integration could inform seeding policies to facilitate the large-scale adoption of a given behavior or product. Drawing on complex contagion and discrete choice theories, we propose a method to estimate individual-level thresholds to adoption, and validate its predictive power in two choice experiments. 
By integrating the estimated thresholds into computational simulations, we show that state-of-the-art seeding methods for social influence maximization might be suboptimal if they neglect individual-level behavioral drivers, which can be corrected through the proposed experimental method."}, "https://arxiv.org/abs/2405.13794": {"title": "Conditioning diffusion models by explicit forward-backward bridging", "link": "https://arxiv.org/abs/2405.13794", "description": "arXiv:2405.13794v1 Announce Type: cross \nAbstract: Given an unconditional diffusion model $\\pi(x, y)$, using it to perform conditional simulation $\\pi(x \\mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express conditional simulation as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\\pi(x \\mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples."}, "https://arxiv.org/abs/2007.13804": {"title": "The Spectral Approach to Linear Rational Expectations Models", "link": "https://arxiv.org/abs/2007.13804", "description": "arXiv:2007.13804v5 Announce Type: replace \nAbstract: This paper considers linear rational expectations models in the frequency domain. The paper characterizes existence and uniqueness of solutions to particular as well as generic systems. The set of all solutions to a given system is shown to be a finite dimensional affine space in the frequency domain. It is demonstrated that solutions can be discontinuous with respect to the parameters of the models in the context of non-uniqueness, invalidating mainstream frequentist and Bayesian methods. The ill-posedness of the problem motivates regularized solutions with theoretically guaranteed uniqueness, continuity, and even differentiability properties."}, "https://arxiv.org/abs/2012.02708": {"title": "A Multivariate Realized GARCH Model", "link": "https://arxiv.org/abs/2012.02708", "description": "arXiv:2012.02708v2 Announce Type: replace \nAbstract: We propose a novel class of multivariate GARCH models that utilize realized measures of volatilities and correlations. The central component is an unconstrained vector parametrization of the conditional correlation matrix that facilitates factor models for correlations. This offers an elegant solution to the primary challenge that plagues multivariate GARCH models in high-dimensional settings. As an illustration, we consider block correlation structures that naturally simplify to linear factor models for the conditional correlations. We apply the model to returns of nine assets and inspect in-sample and out-of-sample model performance in comparison with several popular benchmarks."}, "https://arxiv.org/abs/2203.15897": {"title": "Calibrated Model Criticism Using Split Predictive Checks", "link": "https://arxiv.org/abs/2203.15897", "description": "arXiv:2203.15897v3 Announce Type: replace \nAbstract: Checking how well a fitted model explains the data is one of the most fundamental parts of a Bayesian data analysis. 
However, existing model checking methods suffer from trade-offs between being well-calibrated, automated, and computationally efficient. To overcome these limitations, we propose split predictive checks (SPCs), which combine the ease-of-use and speed of posterior predictive checks with the good calibration properties of predictive checks that rely on model-specific derivations or inference schemes. We develop an asymptotic theory for two types of SPCs: single SPCs and divided SPCs. Our results demonstrate that they offer complementary strengths. Single SPCs work well with smaller datasets and provide excellent power when there is substantial misspecification, such as when the uncertainty in the test statistic is significantly underestimated. When the sample size is large, divided SPCs can provide better power and are able to detect more subtle forms of misspecification. We validate the finite-sample utility of SPCs through extensive simulation experiments in exponential family and hierarchical models, and provide three real-data examples where SPCs offer novel insights and additional flexibility beyond what is available when using posterior predictive checks."}, "https://arxiv.org/abs/2303.05659": {"title": "A marginal structural model for normal tissue complication probability", "link": "https://arxiv.org/abs/2303.05659", "description": "arXiv:2303.05659v4 Announce Type: replace \nAbstract: The goal of radiation therapy for cancer is to deliver prescribed radiation dose to the tumor while minimizing dose to the surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized as dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modelling has centered around making patient-level risk predictions with features extracted from the DVHs, but few have considered adapting a causal framework to evaluate the safety of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, as well as propose estimators based on marginal structural models that impose bivariable monotonicity between dose, volume, and toxicity risk. The properties of these estimators are studied through simulations, and their use is illustrated in the context of radiotherapy treatment of anal canal cancer patients."}, "https://arxiv.org/abs/2303.09575": {"title": "Sample size determination via learning-type curves", "link": "https://arxiv.org/abs/2303.09575", "description": "arXiv:2303.09575v2 Announce Type: replace \nAbstract: This paper is concerned with sample size determination methodology for prediction models. We propose combining the individual calculations via a learning-type curve. We suggest two distinct ways of doing so, a deterministic skeleton of a learning curve and a Gaussian process centred upon its deterministic counterpart. We employ several learning algorithms for modelling the primary endpoint and distinct measures for trial efficacy. We find that the performance may vary with the sample size, but borrowing information across sample sizes universally improves the performance of such calculations. The Gaussian process-based learning curve appears more robust and statistically efficient, while computational efficiency is comparable. We suggest that anchoring against historical evidence when extrapolating sample sizes should be adopted when such data are available. 
The methods are illustrated on binary and survival endpoints."}, "https://arxiv.org/abs/2305.14194": {"title": "A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments", "link": "https://arxiv.org/abs/2305.14194", "description": "arXiv:2305.14194v2 Announce Type: replace \nAbstract: We develop new methodology to improve our understanding of the causal effects of multivariate air pollution exposures on public health. Typically, exposure to air pollution for an individual is measured at their home geographic region, though people travel to different regions with potentially different levels of air pollution. To account for this, we incorporate estimates of the mobility of individuals from cell phone mobility data to get an improved estimate of their exposure to air pollution. We treat this as an interference problem, where individuals in one geographic region can be affected by exposures in other regions due to mobility into those areas. We propose policy-relevant estimands and derive expressions showing the extent of bias one would obtain by ignoring this mobility. We additionally highlight the benefits of the proposed interference framework relative to a measurement error framework for accounting for mobility. We develop novel estimation strategies to estimate causal effects that account for this spatial spillover utilizing flexible Bayesian methodology. Lastly, we use the proposed methodology to study the health effects of ambient air pollution on mortality among Medicare enrollees in the United States."}, "https://arxiv.org/abs/2306.11697": {"title": "Treatment Effects in Extreme Regimes", "link": "https://arxiv.org/abs/2306.11697", "description": "arXiv:2306.11697v2 Announce Type: replace \nAbstract: Understanding treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the unavailability of counterfactual outcomes and the rarity and difficulty of collecting extreme data in practice. To address this issue, we propose a new framework based on extreme value theory for estimating treatment effects in extreme regimes. We quantify these effects using variations in tail decay rates of potential outcomes in the presence and absence of treatments. We establish algorithms for calculating these quantities and develop related theoretical results. We demonstrate the efficacy of our approach on various standard synthetic and semi-synthetic datasets."}, "https://arxiv.org/abs/2309.15600": {"title": "pencal: an R Package for the Dynamic Prediction of Survival with Many Longitudinal Predictors", "link": "https://arxiv.org/abs/2309.15600", "description": "arXiv:2309.15600v2 Announce Type: replace \nAbstract: In survival analysis, longitudinal information on the health status of a patient can be used to dynamically update the predicted probability that a patient will experience an event of interest. Traditional approaches to dynamic prediction such as joint models become computationally unfeasible with more than a handful of longitudinal covariates, warranting the development of methods that can handle a larger number of longitudinal covariates. We introduce the R package pencal, which implements a Penalized Regression Calibration approach that makes it possible to handle many longitudinal covariates as predictors of survival. 
pencal uses mixed-effects models to summarize the trajectories of the longitudinal covariates up to a prespecified landmark time, and a penalized Cox model to predict survival based on both baseline covariates and summary measures of the longitudinal covariates. This article illustrates the structure of the R package, provides a step by step example showing how to estimate PRC, compute dynamic predictions of survival and validate performance, and shows how parallelization can be used to significantly reduce computing time."}, "https://arxiv.org/abs/2310.15512": {"title": "Inference for Rank-Rank Regressions", "link": "https://arxiv.org/abs/2310.15512", "description": "arXiv:2310.15512v2 Announce Type: replace \nAbstract: Slope coefficients in rank-rank regressions are popular measures of intergenerational mobility. In this paper, we first point out two important properties of the OLS estimator in such regressions: commonly used variance estimators do not consistently estimate the asymptotic variance of the OLS estimator and, when the underlying distribution is not continuous, the OLS estimator may be highly sensitive to the way in which ties are handled. Motivated by these findings we derive the asymptotic theory for the OLS estimator in a general rank-rank regression specification without making assumptions about the continuity of the underlying distribution. We then extend the asymptotic theory to other regressions involving ranks that have been used in empirical work. Finally, we apply our new inference methods to three empirical studies. We find that the confidence intervals based on estimators of the correct variance may sometimes be substantially shorter and sometimes substantially longer than those based on commonly used variance estimators. The differences in confidence intervals concern economically meaningful values of mobility and thus may lead to different conclusions when comparing mobility across different regions or countries."}, "https://arxiv.org/abs/2311.11153": {"title": "Biarchetype analysis: simultaneous learning of observations and features based on extremes", "link": "https://arxiv.org/abs/2311.11153", "description": "arXiv:2311.11153v2 Announce Type: replace \nAbstract: We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. 
The application of biarchetype analysis across various machine learning challenges underscores its value, and both the source code and examples are readily accessible in R and Python at https://github.com/aleixalcacer/JA-BIAA."}, "https://arxiv.org/abs/2311.15322": {"title": "False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values", "link": "https://arxiv.org/abs/2311.15322", "description": "arXiv:2311.15322v3 Announce Type: replace \nAbstract: The effective utilization of structural information in data while ensuring statistical validity poses a significant challenge in false discovery rate (FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning methods without relying on strong assumptions or highly idealized models. However, existing conformal methods have limitations in handling structured multiple testing. This is because their validity requires the deployment of symmetric rules, which assume the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome these limitations, we introduce the pseudo local index of significance (PLIS) procedure, which is capable of accommodating asymmetric rules and requires only pairwise exchangeability between the null conformity scores. We demonstrate that PLIS offers finite-sample guarantees in FDR control and the ability to assign higher weights to relevant data points. Numerical results confirm the effectiveness and robustness of PLIS and show improvements in power compared to existing model-free methods in various scenarios."}, "https://arxiv.org/abs/2306.13214": {"title": "Prior-itizing Privacy: A Bayesian Approach to Setting the Privacy Budget in Differential Privacy", "link": "https://arxiv.org/abs/2306.13214", "description": "arXiv:2306.13214v2 Announce Type: replace-cross \nAbstract: When releasing outputs from confidential data, agencies need to balance the analytical usefulness of the released data with the obligation to protect data subjects' confidentiality. For releases satisfying differential privacy, this balance is reflected by the privacy budget, $\\varepsilon$. We provide a framework for setting $\\varepsilon$ based on its relationship with Bayesian posterior probabilities of disclosure. The agency responsible for the data release decides how much posterior risk it is willing to accept at various levels of prior risk, which implies a unique $\\varepsilon$. Agencies can evaluate different risk profiles to determine one that leads to an acceptable trade-off in risk and utility."}, "https://arxiv.org/abs/2309.09367": {"title": "ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors", "link": "https://arxiv.org/abs/2309.09367", "description": "arXiv:2309.09367v3 Announce Type: replace-cross \nAbstract: In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. 
We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs."}, "https://arxiv.org/abs/2310.09818": {"title": "MCMC for Bayesian nonparametric mixture modeling under differential privacy", "link": "https://arxiv.org/abs/2310.09818", "description": "arXiv:2310.09818v2 Announce Type: replace-cross \nAbstract: Estimating the probability density of a population while preserving the privacy of individuals in that population is an important and challenging problem that has received considerable attention in recent years. While the previous literature focused on frequentist approaches, in this paper, we propose a Bayesian nonparametric mixture model under differential privacy (DP) and present two Markov chain Monte Carlo (MCMC) algorithms for posterior inference. One is a marginal approach, resembling Neal's algorithm 5 with a pseudo-marginal Metropolis-Hastings move, and the other is a conditional approach. Although our focus is primarily on local DP, we show that our MCMC algorithms can be easily extended to deal with global differential privacy mechanisms. Moreover, for some carefully chosen mechanisms and mixture kernels, we show how auxiliary parameters can be analytically marginalized, allowing standard MCMC algorithms (i.e., non-privatized, such as Neal's Algorithm 2) to be efficiently employed. Our approach is general and applicable to any mixture model and privacy mechanism. In several simulations and a real case study, we discuss the performance of our algorithms and evaluate different privacy mechanisms proposed in the frequentist literature."}, "https://arxiv.org/abs/2311.00541": {"title": "An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek", "link": "https://arxiv.org/abs/2311.00541", "description": "arXiv:2311.00541v3 Announce Type: replace-cross \nAbstract: Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as ``kosmos'' (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. 
It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed."}, "https://arxiv.org/abs/2311.05025": {"title": "Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients", "link": "https://arxiv.org/abs/2311.05025", "description": "arXiv:2311.05025v2 Announce Type: replace-cross \nAbstract: We present an unbiased method for Bayesian posterior means based on kinetic Langevin dynamics that combines advanced splitting methods with enhanced gradient approximations. Our approach avoids Metropolis correction by coupling Markov chains at different discretization levels in a multilevel Monte Carlo approach. Theoretical analysis demonstrates that our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. It can achieve accuracy $\\epsilon>0$ for estimating expectations of Lipschitz functions in $d$ dimensions with $\\mathcal{O}(d^{1/4}\\epsilon^{-2})$ expected gradient evaluations, without assuming warm start. We exhibit similar bounds using both approximate and stochastic gradients, and our method's computational cost is shown to scale independently of the size of the dataset. The proposed method is tested using a multinomial regression problem on the MNIST dataset and a Poisson regression model for soccer scores. Experiments indicate that the number of gradient evaluations per effective sample is independent of dimension, even when using inexact gradients. For product distributions, we give dimension-independent variance bounds. Our results demonstrate that the unbiased algorithm we present can be much more efficient than the ``gold-standard\" randomized Hamiltonian Monte Carlo."}, "https://arxiv.org/abs/2401.06687": {"title": "Proximal Causal Inference With Text Data", "link": "https://arxiv.org/abs/2401.06687", "description": "arXiv:2401.06687v2 Announce Type: replace-cross \nAbstract: Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses multiple instances of pre-treatment text data, infers two proxies from two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove that our text-based proxy method satisfies identification conditions required by the proximal g-formula while other seemingly reasonable proposals do not. We evaluate our method in synthetic and semi-synthetic settings and find that it produces estimates with low bias. To address untestable assumptions associated with the proximal g-formula, we further propose an odds ratio falsification heuristic. 
This new combination of proximal causal inference and zero-shot classifiers expands the set of text-specific causal methods available to practitioners."}, "https://arxiv.org/abs/2405.14913": {"title": "High Rank Path Development: an approach of learning the filtration of stochastic processes", "link": "https://arxiv.org/abs/2405.14913", "description": "arXiv:2405.14913v1 Announce Type: new \nAbstract: Since weak convergence for stochastic processes does not account for the growth of information over time, which is represented by the underlying filtration, a slightly erroneous stochastic model in the weak topology may cause huge losses in multi-period decision-making problems. To address such discontinuities, Aldous introduced extended weak convergence, which can fully characterise all essential properties, including the filtration, of stochastic processes; however, efficient numerical implementations of it were considered hard to find. In this paper, we introduce a novel metric called High Rank PCF Distance (HRPCFD) for extended weak convergence based on the high rank path development method from rough path theory, which also defines the characteristic function for measure-valued processes. We then show that HRPCFD admits many favourable analytic properties, which allow us to design an efficient algorithm for training HRPCFD from data and construct the HRPCF-GAN by using HRPCFD as the discriminator for conditional time series generation. Our numerical experiments on both hypothesis testing and generative modelling validate the outperformance of our approach compared with several state-of-the-art methods, highlighting its potential in broad applications of synthetic time series generation and in addressing classic financial and economic challenges, such as optimal stopping or utility maximisation problems."}, "https://arxiv.org/abs/2405.14990": {"title": "Dispersion Modeling in Zero-inflated Tweedie Models with Applications to Insurance Claim Data Analysis", "link": "https://arxiv.org/abs/2405.14990", "description": "arXiv:2405.14990v1 Announce Type: new \nAbstract: The Tweedie generalized linear models are commonly applied in the insurance industry to analyze semicontinuous claim data. For better prediction of the aggregated claim size, the mean and dispersion of the Tweedie model are often estimated together using the double generalized linear models. In some actuarial applications, it is common to observe an excessive percentage of zeros, which often results in a decline in the performance of the Tweedie model. The zero-inflated Tweedie model has been recently considered in the literature, which draws inspiration from the zero-inflated Poisson model. In this article, we consider the problem of dispersion modeling of the Tweedie state in the zero-inflated Tweedie model, in addition to the mean modeling. We also model the probability of the zero state based on the generalized expectation-maximization algorithm. To potentially incorporate nonlinear and interaction effects of the covariates, we estimate the mean, dispersion, and zero-state probability using decision-tree-based gradient boosting. 
We conduct extensive numerical studies to demonstrate the improved performance of our method over existing ones."}, "https://arxiv.org/abs/2405.15038": {"title": "Preferential Latent Space Models for Networks with Textual Edges", "link": "https://arxiv.org/abs/2405.15038", "description": "arXiv:2405.15038v1 Announce Type: new \nAbstract: Many real-world networks contain rich textual information in the edges, such as email networks where an edge between two nodes is an email exchange. Other examples include co-author networks and social media networks. The useful textual information carried in the edges is often discarded in most network analyses, resulting in an incomplete view of the relationships between nodes. In this work, we propose to represent the text document between each pair of nodes as a vector counting the appearances of keywords extracted from the corpus, and introduce a new and flexible preferential latent space network model that can offer direct insights on how contents of the textual exchanges modulate the relationships between nodes. We establish identifiability conditions for the proposed model and tackle model estimation with a computationally efficient projected gradient descent algorithm. We further derive the non-asymptotic error bound of the estimator from each step of the algorithm. The efficacy of our proposed method is demonstrated through simulations and an analysis of the Enron email network."}, "https://arxiv.org/abs/2405.15042": {"title": "Modularity, Higher-Order Recombination, and New Venture Success", "link": "https://arxiv.org/abs/2405.15042", "description": "arXiv:2405.15042v1 Announce Type: new \nAbstract: Modularity is critical for the emergence and evolution of complex social, natural, and technological systems robust to exploratory failure. We consider this in the context of emerging business organizations, which can be understood as complex systems. We build a theory of organizational emergence as higher-order, modular recombination wherein successful start-ups assemble novel combinations of successful modular components, rather than engage in the lower-order combination of disparate, singular components. Lower-order combinations are critical for long-term socio-economic transformation, but manifest diffuse benefits requiring support as public goods. Higher-order combinations facilitate rapid experimentation and attract private funding. We evaluate this with U.S. venture-funded start-ups over 45 years using company descriptions. We build a dynamic semantic space with word embedding models constructed from evolving business discourse, which allow us to measure the modularity of and distance between new venture components. Using event history models, we demonstrate how ventures more likely achieve successful IPOs and high-priced acquisitions when they combine diverse modules of clustered components. We demonstrate how higher-order combination enables venture success by accelerating firm development and diversifying investment, and we reflect on its implications for social innovation."}, "https://arxiv.org/abs/2405.15053": {"title": "A Latent Variable Approach to Learning High-dimensional Multivariate longitudinal Data", "link": "https://arxiv.org/abs/2405.15053", "description": "arXiv:2405.15053v1 Announce Type: new \nAbstract: High-dimensional multivariate longitudinal data, which arise when many outcome variables are measured repeatedly over time, are becoming increasingly common in social, behavioral and health sciences. 
We propose a latent variable model for drawing statistical inferences on covariate effects and predicting future outcomes based on high-dimensional multivariate longitudinal data. This model introduces unobserved factors to account for the between-variable and across-time dependence and assist the prediction. Statistical inference and prediction tools are developed under a general setting that allows outcome variables to be of mixed types and possibly unobserved for certain time points, for example, due to right censoring. A central limit theorem is established for drawing statistical inferences on regression coefficients. Additionally, an information criterion is introduced to choose the number of factors. The proposed model is applied to customer grocery shopping records to predict and understand shopping behavior."}, "https://arxiv.org/abs/2405.15192": {"title": "Addressing Duplicated Data in Point Process Models", "link": "https://arxiv.org/abs/2405.15192", "description": "arXiv:2405.15192v1 Announce Type: new \nAbstract: Spatial point process models are widely applied to point pattern data from various fields in the social and environmental sciences. However, a serious hurdle in fitting point process models is the presence of duplicated points, wherein multiple observations share identical spatial coordinates. This often occurs because of decisions made in the geo-coding process, such as assigning representative locations (e.g., aggregate-level centroids) to observations when data producers lack exact location information. Because spatial point process models like the Log-Gaussian Cox Process (LGCP) assume unique locations, researchers often employ {\\it ad hoc} solutions (e.g., jittering) to address duplicated data before analysis. As an alternative, this study proposes a Modified Minimum Contrast (MMC) method that adapts the inference procedure to account for the effect of duplicates without needing to alter the data. The proposed MMC method is applied to LGCP models, with simulation results demonstrating the gains of our method relative to existing approaches in terms of parameter estimation. Interestingly, simulation results also show the effect of the geo-coding process on parameter estimates, which can be utilized in the implementation of the MMC method. The MMC approach is then used to infer the spatial clustering characteristics of conflict events in Afghanistan (2008-2009)."}, "https://arxiv.org/abs/2405.15204": {"title": "A New Fit Assessment Framework for Common Factor Models Using Generalized Residuals", "link": "https://arxiv.org/abs/2405.15204", "description": "arXiv:2405.15204v1 Announce Type: new \nAbstract: Standard common factor models, such as the linear normal factor model, rely on strict parametric assumptions, which require rigorous model-data fit assessment to prevent fallacious inferences. However, overall goodness-of-fit diagnostics conventionally used in factor analysis do not offer diagnostic information on where the misfit originates. In the current work, we propose a new fit assessment framework for common factor models by extending the theory of generalized residuals (Haberman & Sinharay, 2013). This framework allows for the flexible adaptation of test statistics to identify various sources of misfit. In addition, the resulting goodness-of-fit tests provide more informative diagnostics, as the evaluation is performed conditionally on latent variables. 
Several examples of test statistics suitable for assessing various model assumptions are presented within this framework, and their performance is evaluated by simulation studies and a real data example."}, "https://arxiv.org/abs/2405.15242": {"title": "Causal machine learning methods and use of sample splitting in settings with high-dimensional confounding", "link": "https://arxiv.org/abs/2405.15242", "description": "arXiv:2405.15242v1 Announce Type: new \nAbstract: Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjustment for potential confounding bias in modern studies is challenging due to the presence of high-dimensional confounding, induced when there are many confounders relative to sample size, or complex relationships between continuous confounders and exposure and outcome. As a promising avenue to overcome this challenge, doubly robust methods (Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE)) enable the use of data-adaptive approaches to fit the two models they involve. Biased standard errors may result when the data-adaptive approaches used are very complex. The coupling of doubly robust methods with cross-fitting has been proposed to tackle this. Despite advances, limited evaluation, comparison, and guidance are available on the implementation of AIPW and TMLE with data-adaptive approaches and cross-fitting in realistic settings where high-dimensional confounding is present. We conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches in estimating the average causal effect (ACE) and evaluated the benefits of using cross-fitting with a varying number of folds, as well as the impact of using a reduced versus full (larger, more diverse) library in the Super Learner (SL) ensemble learning approach used for the data-adaptive models. A range of scenarios in terms of data generation, and sample size were considered. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, with the number of folds a less important consideration. Using a full SL library was important to reduce bias and variance in the complex scenarios typical of modern health research studies."}, "https://arxiv.org/abs/2405.15531": {"title": "MMD Two-sample Testing in the Presence of Arbitrarily Missing Data", "link": "https://arxiv.org/abs/2405.15531", "description": "arXiv:2405.15531v1 Announce Type: new \nAbstract: In many real-world applications, it is common that a proportion of the data may be missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) which accounts for missing data in both samples, without making assumptions about the missingness mechanism. Our approach is based on deriving the mathematically precise bounds of the MMD test statistic after accounting for all possible missing values. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data where data may be arbitrarily missing. Simulation results show that our method has good statistical power, typically for cases where 5% to 10% of the data are missing. 
We highlight the value of our approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error."}, "https://arxiv.org/abs/2405.15576": {"title": "Online Changepoint Detection via Dynamic Mode Decomposition", "link": "https://arxiv.org/abs/2405.15576", "description": "arXiv:2405.15576v1 Announce Type: new \nAbstract: Detecting changes in data streams is a vital task in many applications. There is increasing interest in changepoint detection in the online setting, to enable real-time monitoring and support prompt responses and informed decision-making. Many approaches assume stationary sequences before encountering an abrupt change in the mean or variance. Notably less attention has focused on the challenging case where the monitored sequences exhibit trend, periodicity and seasonality. Dynamic mode decomposition is a data-driven dimensionality reduction technique that extracts the essential components of a dynamical system. We propose a changepoint detection method that leverages this technique to sequentially model the dynamics of a moving window of data and produce a low-rank reconstruction. A change is identified when there is a significant difference between this reconstruction and the observed data, and we provide theoretical justification for this approach. Extensive simulations demonstrate that our approach has superior detection performance compared to other methods for detecting small changes in mean, variance, periodicity, and second-order structure, among others, in data that exhibits seasonality. Results on real-world datasets also show excellent performance compared to contemporary approaches."}, "https://arxiv.org/abs/2405.15579": {"title": "Generating density nowcasts for U", "link": "https://arxiv.org/abs/2405.15579", "description": "arXiv:2405.15579v1 Announce Type: new \nAbstract: Recent results in the literature indicate that artificial neural networks (ANNs) can outperform the dynamic factor model (DFM) in terms of the accuracy of GDP nowcasts. Compared to the DFM, the performance advantage of these highly flexible, nonlinear estimators is particularly evident in periods of recessions and structural breaks. From the perspective of policy-makers, however, nowcasts are the most useful when they are conveyed with uncertainty attached to them. While the DFM and other classical time series approaches analytically derive the predictive (conditional) distribution for GDP growth, ANNs can only produce point nowcasts based on their default training procedure (backpropagation). To fill this gap, first in the literature, we adapt two different deep learning algorithms that enable ANNs to generate density nowcasts for U.S. GDP growth: Bayes by Backprop and Monte Carlo dropout. The accuracy of point nowcasts, defined as the mean of the empirical predictive distribution, is evaluated relative to a naive constant growth model for GDP and a benchmark DFM specification. Using a 1D CNN as the underlying ANN architecture, both algorithms outperform those benchmarks during the evaluation period (2012:Q1 -- 2022:Q4). Furthermore, both algorithms are able to dynamically adjust the location (mean), scale (variance), and shape (skew) of the empirical predictive distribution. 
The results indicate that both Bayes by Backprop and Monte Carlo dropout can effectively augment the scope and functionality of ANNs, rendering them a fully compatible and competitive alternative to classical time series approaches."}, "https://arxiv.org/abs/2405.15641": {"title": "Predictive Uncertainty Quantification with Missing Covariates", "link": "https://arxiv.org/abs/2405.15641", "description": "arXiv:2405.15641v1 Announce Type: new \nAbstract: Predictive uncertainty quantification is crucial in decision-making problems. We investigate how to adequately quantify predictive uncertainty with missing covariates. A bottleneck is that missing values induce heteroskedasticity on the response's predictive distribution given the observed covariates. Thus, we focus on building predictive sets for the response that are valid conditionally to the missing values pattern. We show that this goal is impossible to achieve informatively in a distribution-free fashion, and we propose useful restrictions on the distribution class. Motivated by these hardness results, we characterize how missing values and predictive uncertainty intertwine. Particularly, we rigorously formalize the idea that the more missing values, the higher the predictive uncertainty. Then, we introduce a generalized framework, coined CP-MDA-Nested*, outputting predictive sets in both regression and classification. Under independence between the missing value pattern and both the features and the response (an assumption justified by our hardness results), these predictive sets are valid conditionally to any pattern of missing values. Moreover, it provides great flexibility in the trade-off between statistical variability and efficiency. Finally, we experimentally assess the performances of CP-MDA-Nested* beyond its scope of theoretical validity, demonstrating promising outcomes in more challenging configurations than independence."}, "https://arxiv.org/abs/2405.15670": {"title": "Post-selection inference for quantifying uncertainty in changes in variance", "link": "https://arxiv.org/abs/2405.15670", "description": "arXiv:2405.15670v1 Announce Type: new \nAbstract: Quantifying uncertainty in detected changepoints is an important problem. However, it is challenging, as the naive approach would use the data twice, first to detect the changes, and then to test them. This will bias the test, and can lead to anti-conservative p-values. One approach to avoid this is to use ideas from post-selection inference, which conditions on the information in the data used to choose which changes to test. As a result, this produces valid p-values; that is, p-values that have a uniform distribution if there is no change. Currently such methods have been developed for detecting changes in mean only. This paper presents two approaches for constructing post-selection p-values for detecting changes in variance. These vary depending on the method used to detect the changes, but are general in terms of being applicable for a range of change-detection methods and a range of hypotheses that we may wish to test."}, "https://arxiv.org/abs/2405.15716": {"title": "Empirical Crypto Asset Pricing", "link": "https://arxiv.org/abs/2405.15716", "description": "arXiv:2405.15716v1 Announce Type: new \nAbstract: We motivate the study of the crypto asset class with eleven empirical facts, and study the drivers of crypto asset returns through the lens of univariate factors. We argue crypto assets are a new, attractive, and independent asset class. 
In a novel and rigorously built panel of crypto assets, we examine pricing ability of sixty three asset characteristics to find rich signal content across the characteristics and at several future horizons. Only univariate financial factors (i.e., functions of previous returns) were associated with statistically significant long-short strategies, suggestive of speculatively driven returns as opposed to more fundamental pricing factors."}, "https://arxiv.org/abs/2405.15721": {"title": "Dynamic Latent-Factor Model with High-Dimensional Asset Characteristics", "link": "https://arxiv.org/abs/2405.15721", "description": "arXiv:2405.15721v1 Announce Type: new \nAbstract: We develop novel estimation procedures with supporting econometric theory for a dynamic latent-factor model with high-dimensional asset characteristics, that is, the number of characteristics is on the order of the sample size. Utilizing the Double Selection Lasso estimator, our procedure employs regularization to eliminate characteristics with low signal-to-noise ratios yet maintains asymptotically valid inference for asset pricing tests. The crypto asset class is well-suited for applying this model given the limited number of tradable assets and years of data as well as the rich set of available asset characteristics. The empirical results present out-of-sample pricing abilities and risk-adjusted returns for our novel estimator as compared to benchmark methods. We provide an inference procedure for measuring the risk premium of an observable nontradable factor, and employ this to find that the inflation-mimicking portfolio in the crypto asset class has positive risk compensation."}, "https://arxiv.org/abs/2405.15740": {"title": "On Flexible Inverse Probability of Treatment and Intensity Weighting: Informative Censoring, Variable Inclusion, and Weight Trimming", "link": "https://arxiv.org/abs/2405.15740", "description": "arXiv:2405.15740v1 Announce Type: new \nAbstract: Many observational studies feature irregular longitudinal data, where the observation times are not common across individuals in the study. Further, the observation times may be related to the longitudinal outcome. In this setting, failing to account for the informative observation process may result in biased causal estimates. This can be coupled with other sources of bias, including non-randomized treatment assignments and informative censoring. This paper provides an overview of a flexible weighting method used to adjust for informative observation processes and non-randomized treatment assignments. We investigate the sensitivity of the flexible weighting method to violations of the noninformative censoring assumption, examine variable selection for the observation process weighting model, known as inverse intensity weighting, and look at the impacts of weight trimming for the flexible weighting model. We show that the flexible weighting method is sensitive to violations of the noninformative censoring assumption and show that a previously proposed extension fails under such violations. We also show that variables confounding the observation and outcome processes should always be included in the observation intensity model. Finally, we show that weight trimming should be applied in the flexible weighting model when the treatment assignment process is highly informative and driving the extreme weights. 
We conclude with an application of the methodology to a real data set to examine the impacts of household water sources on malaria diagnoses."}, "https://arxiv.org/abs/2405.14893": {"title": "YUI: Day-ahead Electricity Price Forecasting Using Invariance Simplified Supply and Demand Curve", "link": "https://arxiv.org/abs/2405.14893", "description": "arXiv:2405.14893v1 Announce Type: cross \nAbstract: In day-ahead electricity market, it is crucial for all market participants to have access to reliable and accurate price forecasts for their decision-making processes. Forecasting methods currently utilized in industrial applications frequently neglect the underlying mechanisms of price formation, while economic research from the perspective of supply and demand have stringent data collection requirements, making it difficult to apply in actual markets. Observing the characteristics of the day-ahead electricity market, we introduce two invariance assumptions to simplify the modeling of supply and demand curves. Upon incorporating the time invariance assumption, we can forecast the supply curve using the market equilibrium points from multiple time slots in the recent period. By introducing the price insensitivity assumption, we can approximate the demand curve using a straight line. The point where these two curves intersect provides us with the forecast price. The proposed model, forecasting suppl\\textbf{Y} and demand cUrve simplified by Invariance, termed as YUI, is more efficient than state-of-the-art methods. Our experiment results in Shanxi day-ahead electricity market show that compared with existing methods, YUI can reduce forecast error by 13.8\\% in MAE and 28.7\\% in sMAPE. Code is publicly available at https://github.com/wangln19/YUI."}, "https://arxiv.org/abs/2405.15132": {"title": "Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification", "link": "https://arxiv.org/abs/2405.15132", "description": "arXiv:2405.15132v1 Announce Type: cross \nAbstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. Since to estimate the density it is necessary to know the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets."}, "https://arxiv.org/abs/2405.15141": {"title": "Likelihood distortion and Bayesian local robustness", "link": "https://arxiv.org/abs/2405.15141", "description": "arXiv:2405.15141v1 Announce Type: cross \nAbstract: Robust Bayesian analysis has been mainly devoted to detecting and measuring robustness to the prior distribution. 
Indeed, many contributions in the literature aim to define suitable classes of priors which allow the computation of variations of quantities of interest while the prior changes within those classes. The literature has devoted much less attention to the robustness of Bayesian methods to the likelihood function due to mathematical and computational complexity, and because it is often arguably considered a more objective choice compared to the prior. In this contribution, a new approach to Bayesian local robustness to the likelihood function is proposed and extended to robustness to the prior and to both. This approach is based on the notion of distortion function introduced in the literature on risk theory, and then successfully adopted to build suitable classes of priors for Bayesian global robustness to the prior. The novel robustness measure is a local sensitivity measure that turns out to be very tractable and easy to compute for certain classes of distortion functions. Asymptotic properties are derived and numerical experiments illustrate the theory and its applicability for modelling purposes."}, "https://arxiv.org/abs/2405.15294": {"title": "Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision", "link": "https://arxiv.org/abs/2405.15294", "description": "arXiv:2405.15294v1 Announce Type: cross \nAbstract: We provide a theoretical and computational investigation of the Gamma-Maximin method with soft revision, which was recently proposed as a robust criterion for pseudo-label selection (PLS) in semi-supervised learning. Opposed to traditional methods for PLS we use credal sets of priors (\"generalized Bayes\") to represent the epistemic modeling uncertainty. These latter are then updated by the Gamma-Maximin method with soft revision. We eventually select pseudo-labeled data that are most likely in light of the least favorable distribution from the so updated credal set. We formalize the task of finding optimal pseudo-labeled data w.r.t. the Gamma-Maximin method with soft revision as an optimization problem. A concrete implementation for the class of logistic models then allows us to compare the predictive power of the method with competing approaches. It is observed that the Gamma-Maximin method with soft revision can achieve very promising results, especially when the proportion of labeled data is low."}, "https://arxiv.org/abs/2405.15357": {"title": "Strong screening rules for group-based SLOPE models", "link": "https://arxiv.org/abs/2405.15357", "description": "arXiv:2405.15357v1 Announce Type: cross \nAbstract: Tuning the regularization parameter in penalized regression models is an expensive task, requiring multiple models to be fit along a path of parameters. Strong screening rules drastically reduce computational costs by lowering the dimensionality of the input prior to fitting. We develop strong screening rules for group-based Sorted L-One Penalized Estimation (SLOPE) models: Group SLOPE and Sparse-group SLOPE. The developed rules are applicable for the wider family of group-based OWL models, including OSCAR. Our experiments on both synthetic and real data show that the screening rules significantly accelerate the fitting process. 
The screening rules make it accessible for group SLOPE and sparse-group SLOPE to be applied to high-dimensional datasets, particularly those encountered in genetics."}, "https://arxiv.org/abs/2405.15600": {"title": "Transfer Learning for Spatial Autoregressive Models", "link": "https://arxiv.org/abs/2405.15600", "description": "arXiv:2405.15600v1 Announce Type: cross \nAbstract: The spatial autoregressive (SAR) model has been widely applied in various empirical economic studies to characterize the spatial dependence among subjects. However, the precision of estimating the SAR model diminishes when the sample size of the target data is limited. In this paper, we propose a new transfer learning framework for the SAR model to borrow the information from similar source data to improve both estimation and prediction. When the informative source data sets are known, we introduce a two-stage algorithm, including a transferring stage and a debiasing stage, to estimate the unknown parameters and also establish the theoretical convergence rates for the resulting estimators. If we do not know which sources to transfer, a transferable source detection algorithm is proposed to detect informative sources data based on spatial residual bootstrap to retain the necessary spatial dependence. Its detection consistency is also derived. Simulation studies demonstrate that using informative source data, our transfer learning algorithm significantly enhances the performance of the classical two-stage least squares estimator. In the empirical application, we apply our method to the election prediction in swing states in the 2020 U.S. presidential election, utilizing polling data from the 2016 U.S. presidential election along with other demographic and geographical data. The empirical results show that our method outperforms traditional estimation methods."}, "https://arxiv.org/abs/2105.12891": {"title": "Identification and Estimation of Partial Effects in Nonlinear Semiparametric Panel Models", "link": "https://arxiv.org/abs/2105.12891", "description": "arXiv:2105.12891v5 Announce Type: replace \nAbstract: Average partial effects (APEs) are often not point identified in panel models with unrestricted unobserved individual heterogeneity, such as a binary response panel model with fixed effects and logistic errors as a special case. This lack of point identification occurs despite the identification of these models' common coefficients. We provide a unified framework to establish the point identification of various partial effects in a wide class of nonlinear semiparametric models under an index sufficiency assumption on the unobserved heterogeneity, even when the error distribution is unspecified and non-stationary. This assumption does not impose parametric restrictions on the unobserved heterogeneity and idiosyncratic errors. We also present partial identification results when the support condition fails. We then propose three-step semiparametric estimators for APEs, average structural functions, and average marginal effects, and show their consistency and asymptotic normality. Finally, we illustrate our approach in a study of determinants of married women's labor supply."}, "https://arxiv.org/abs/2207.05281": {"title": "Constrained D-optimal Design for Paid Research Study", "link": "https://arxiv.org/abs/2207.05281", "description": "arXiv:2207.05281v4 Announce Type: replace \nAbstract: We consider constrained sampling problems in paid research studies or clinical trials. 
When qualified volunteers are more than the budget allowed, we recommend a D-optimal sampling strategy based on the optimal design theory and develop a constrained lift-one algorithm to find the optimal allocation. Unlike the literature which mainly deals with linear models, our solution solves the constrained sampling problem under fairly general statistical models, including generalized linear models and multinomial logistic models, and with more general constraints. We justify theoretically the optimality of our sampling strategy and show by simulation studies and real-world examples the advantages over simple random sampling and proportionally stratified sampling strategies."}, "https://arxiv.org/abs/2301.01085": {"title": "The Chained Difference-in-Differences", "link": "https://arxiv.org/abs/2301.01085", "description": "arXiv:2301.01085v3 Announce Type: replace \nAbstract: This paper studies the identification, estimation, and inference of long-term (binary) treatment effect parameters when balanced panel data is not available, or consists of only a subset of the available data. We develop a new estimator: the chained difference-in-differences, which leverages the overlapping structure of many unbalanced panel data sets. This approach consists in aggregating a collection of short-term treatment effects estimated on multiple incomplete panels. Our estimator accommodates (1) multiple time periods, (2) variation in treatment timing, (3) treatment effect heterogeneity, (4) general missing data patterns, and (5) sample selection on observables. We establish the asymptotic properties of the proposed estimator and discuss identification and efficiency gains in comparison to existing methods. Finally, we illustrate its relevance through (i) numerical simulations, and (ii) an application about the effects of an innovation policy in France."}, "https://arxiv.org/abs/2306.10779": {"title": "Bootstrap test procedure for variance components in nonlinear mixed effects models in the presence of nuisance parameters and a singular Fisher Information Matrix", "link": "https://arxiv.org/abs/2306.10779", "description": "arXiv:2306.10779v2 Announce Type: replace \nAbstract: We examine the problem of variance components testing in general mixed effects models using the likelihood ratio test. We account for the presence of nuisance parameters, i.e. the fact that some untested variances might also be equal to zero. Two main issues arise in this context leading to a non regular setting. First, under the null hypothesis the true parameter value lies on the boundary of the parameter space. Moreover, due to the presence of nuisance parameters the exact location of these boundary points is not known, which prevents from using classical asymptotic theory of maximum likelihood estimation. Then, in the specific context of nonlinear mixed-effects models, the Fisher information matrix is singular at the true parameter value. We address these two points by proposing a shrinked parametric bootstrap procedure, which is straightforward to apply even for nonlinear models. We show that the procedure is consistent, solving both the boundary and the singularity issues, and we provide a verifiable criterion for the applicability of our theoretical results. We show through a simulation study that, compared to the asymptotic approach, our procedure has a better small sample performance and is more robust to the presence of nuisance parameters. 
A real data application is also provided."}, "https://arxiv.org/abs/2307.13627": {"title": "A flexible class of priors for orthonormal matrices with basis function-specific structure", "link": "https://arxiv.org/abs/2307.13627", "description": "arXiv:2307.13627v2 Announce Type: replace \nAbstract: Statistical modeling of high-dimensional matrix-valued data motivates the use of a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction. Low-rank representations commonly factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for basis function correlation structure a priori. While there exist Bayesian methods that allow for a common correlation structure across basis functions, empirical examples motivate the need for basis function-specific dependence structure. We propose a prior distribution for orthonormal matrices that can explicitly model basis function-specific structure. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. We discuss how the prior specification can be used for various scenarios and demonstrate favorable model properties through synthetic data examples. Finally, we apply our method to two-meter air temperature data from the Pacific Northwest, enhancing our understanding of the Earth system's internal variability."}, "https://arxiv.org/abs/2310.02414": {"title": "Sharp and Robust Estimation of Partially Identified Discrete Response Models", "link": "https://arxiv.org/abs/2310.02414", "description": "arXiv:2310.02414v3 Announce Type: replace \nAbstract: Semiparametric discrete choice models are widely used in a variety of practical applications. While these models are point identified in the presence of continuous covariates, they can become partially identified when covariates are discrete. In this paper we find that classic estimators, including the maximum score estimator (Manski (1975)), lose their attractive statistical properties without point identification. First of all, they are not sharp, with the estimator converging to an outer region of the identified set (Komarova (2013)), and in many discrete designs it weakly converges to a random set. Second, they are not robust, with their distribution limit discontinuously changing with respect to the parameters of the model. We propose a novel class of estimators based on the concept of a quantile of a random set, which we show to be both sharp and robust. We demonstrate that our approach extends from cross-sectional settings to classic static and dynamic discrete panel data models."}, "https://arxiv.org/abs/2311.07034": {"title": "Regularized Halfspace Depth for Functional Data", "link": "https://arxiv.org/abs/2311.07034", "description": "arXiv:2311.07034v2 Announce Type: replace \nAbstract: Data depth is a powerful nonparametric tool originally proposed to rank multivariate data from center outward. In this context, one of the most archetypical depth notions is Tukey's halfspace depth. In the last few decades notions of depth have also been proposed for functional data. However, Tukey's depth cannot be extended to handle functional data because of its degeneracy. 
Here, we propose a new halfspace depth for functional data which avoids degeneracy by regularization. The halfspace projection directions are constrained to have a small reproducing kernel Hilbert space norm. Desirable theoretical properties of the proposed depth, such as isometry invariance, maximality at center, monotonicity relative to a deepest point, upper semi-continuity, and consistency, are established. Moreover, the regularized halfspace depth can rank functional data with varying emphasis in shape or magnitude, depending on the regularization. A new outlier detection approach is also proposed, which is capable of detecting both shape and magnitude outliers. It is applicable to trajectories in $L^2$, a very general space of functions that includes non-smooth trajectories. Based on extensive numerical studies, our methods are shown to perform well in terms of detecting outliers of different types. Three real data examples showcase the proposed depth notion."}, "https://arxiv.org/abs/2311.17021": {"title": "Optimal Categorical Instrumental Variables", "link": "https://arxiv.org/abs/2311.17021", "description": "arXiv:2311.17021v2 Announce Type: replace \nAbstract: This paper discusses estimation with a categorical instrumental variable in settings with potentially few observations per category. The proposed categorical instrumental variable estimator (CIV) leverages a regularization assumption that implies the existence of a latent categorical variable with fixed finite support achieving the same first stage fit as the observed instrument. In asymptotic regimes that allow the number of observations per category to grow at an arbitrarily small polynomial rate with the sample size, I show that when the cardinality of the support of the optimal instrument is known, CIV is root-n asymptotically normal, achieves the same asymptotic variance as the oracle IV estimator that presumes knowledge of the optimal instrument, and is semiparametrically efficient under homoskedasticity. Under-specifying the number of support points reduces efficiency but maintains asymptotic normality. In an application that leverages judge fixed effects as instruments, CIV compares favorably to commonly used jackknife-based instrumental variable estimators."}, "https://arxiv.org/abs/2312.03643": {"title": "Propagating moments in probabilistic graphical models with polynomial regression forms for decision support systems", "link": "https://arxiv.org/abs/2312.03643", "description": "arXiv:2312.03643v2 Announce Type: replace \nAbstract: Probabilistic graphical models are widely used to model complex systems under uncertainty. Traditionally, Gaussian directed graphical models are applied for analysis of large networks with continuous variables as they can provide conditional and marginal distributions in closed form, simplifying the inferential task. The Gaussianity and linearity assumptions are often adequate, yet can lead to poor performance when dealing with some practical applications. In this paper, we model each variable in graph G as a polynomial regression of its parents to capture complex relationships between individual variables and with a utility function of polynomial form. We develop a message-passing algorithm to propagate information throughout the network solely using moments, which enables the expected utility scores to be calculated exactly. Our propagation method scales up well and enables us to perform inference in terms of a finite number of expectations. 
We illustrate how the proposed methodology works with examples and in an application to decision problems in energy planning and for real-time clinical decision support."}, "https://arxiv.org/abs/2312.06478": {"title": "Prediction De-Correlated Inference: A safe approach for post-prediction inference", "link": "https://arxiv.org/abs/2312.06478", "description": "arXiv:2312.06478v3 Announce Type: replace \nAbstract: In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called post-prediction inference. We propose a novel assumption-lean framework for statistical inference under post-prediction setting, called Prediction De-Correlated Inference (PDC). Our approach is safe, in the sense that PDC can automatically adapt to any black-box machine-learning model and consistently outperform the supervised counterparts. The PDC framework also offers easy extensibility for accommodating multiple predictive models. Both numerical results and real-world data analysis demonstrate the superiority of PDC over the state-of-the-art methods."}, "https://arxiv.org/abs/2203.15756": {"title": "Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data", "link": "https://arxiv.org/abs/2203.15756", "description": "arXiv:2203.15756v3 Announce Type: replace-cross \nAbstract: Constraint-based causal discovery methods leverage conditional independence tests to infer causal relationships in a wide variety of applications. Just as the majority of machine learning methods, existing work focuses on studying $\\textit{independent and identically distributed}$ data. However, it is known that even with infinite i.i.d.$\\ $ data, constraint-based methods can only identify causal structures up to broad Markov equivalence classes, posing a fundamental limitation for causal discovery. In this work, we observe that exchangeable data contains richer conditional independence structure than i.i.d.$\\ $ data, and show how the richer structure can be leveraged for causal discovery. We first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as $\\textit{independent causal mechanism (ICM)}$ generative processes. We then present our main identifiability theorem, which shows that given data from an ICM generative process, its unique causal structure can be identified through performing conditional independence tests. We finally develop a causal discovery algorithm and demonstrate its applicability to inferring causal relationships from multi-environment data. Our code and models are publicly available at: https://github.com/syguo96/Causal-de-Finetti"}, "https://arxiv.org/abs/2305.16539": {"title": "On the existence of powerful p-values and e-values for composite hypotheses", "link": "https://arxiv.org/abs/2305.16539", "description": "arXiv:2305.16539v3 Announce Type: replace-cross \nAbstract: Given a composite null $ \\mathcal P$ and composite alternative $ \\mathcal Q$, when and how can we construct a p-value whose distribution is exactly uniform under the null, and stochastically smaller than uniform under the alternative? Similarly, when and how can we construct an e-value whose expectation exactly equals one under the null, but its expected logarithm under the alternative is positive? 
We answer these basic questions, and other related ones, when $ \\mathcal P$ and $ \\mathcal Q$ are convex polytopes (in the space of probability measures). We prove that such constructions are possible if and only if $ \\mathcal Q$ does not intersect the span of $ \\mathcal P$. If the p-value is allowed to be stochastically larger than uniform under $P\\in \\mathcal P$, and the e-value can have expectation at most one under $P\\in \\mathcal P$, then it is achievable whenever $ \\mathcal P$ and $ \\mathcal Q$ are disjoint. More generally, even when $ \\mathcal P$ and $ \\mathcal Q$ are not polytopes, we characterize the existence of a bounded nontrivial e-variable whose expectation exactly equals one under any $P \\in \\mathcal P$. The proofs utilize recently developed techniques in simultaneous optimal transport. A key role is played by coarsening the filtration: sometimes, no such p-value or e-value exists in the richest data filtration, but it does exist in some reduced filtration, and our work provides the first general characterization of this phenomenon. We also provide an iterative construction that explicitly constructs such processes, and under certain conditions it finds the one that grows fastest under a specific alternative $Q$. We discuss implications for the construction of composite nonnegative (super)martingales, and end with some conjectures and open problems."}, "https://arxiv.org/abs/2405.15887": {"title": "Data-adaptive exposure thresholds for the Horvitz-Thompson estimator of the Average Treatment Effect in experiments with network interference", "link": "https://arxiv.org/abs/2405.15887", "description": "arXiv:2405.15887v1 Announce Type: new \nAbstract: Randomized controlled trials often suffer from interference, a violation of the Stable Unit Treatment Value Assumption (SUTVA) in which a unit's treatment assignment affects the outcomes of its neighbors. This interference causes bias in naive estimators of the average treatment effect (ATE). A popular method to achieve unbiasedness is to pair the Horvitz-Thompson estimator of the ATE with a known exposure mapping: a function that identifies which units in a given randomization are not subject to interference. For example, an exposure mapping can specify that any unit with at least $h$-fraction of its neighbors having the same treatment status does not experience interference. However, this threshold $h$ is difficult to elicit from domain experts, and a misspecified threshold can induce bias. In this work, we propose a data-adaptive method to select the \"$h$\"-fraction threshold that minimizes the mean squared error of the Horvitz-Thompson estimator. Our method estimates the bias and variance of the Horvitz-Thompson estimator under different thresholds using a linear dose-response model of the potential outcomes. We present simulations illustrating that our method improves upon non-adaptive choices of the threshold. We further illustrate the performance of our estimator by running experiments on a publicly-available Amazon product similarity graph. 
Furthermore, we demonstrate that our method is robust to deviations from the linear potential outcomes model."}, "https://arxiv.org/abs/2405.15948": {"title": "Multicalibration for Censored Survival Data: Towards Universal Adaptability in Predictive Modeling", "link": "https://arxiv.org/abs/2405.15948", "description": "arXiv:2405.15948v1 Announce Type: new \nAbstract: Traditional statistical and machine learning methods assume identical distribution for the training and test data sets. This assumption, however, is often violated in real applications, particularly in health care research, where the training data~(source) may underrepresent specific subpopulations in the testing or target domain. Such disparities, coupled with censored observations, present significant challenges for investigators aiming to make predictions for those minority groups. This paper focuses on target-independent learning under covariate shift, where we study multicalibration for survival probability and restricted mean survival time, and propose a black-box post-processing boosting algorithm designed for censored survival data. Our algorithm, leveraging the pseudo observations, yields a multicalibrated predictor competitive with propensity scoring regarding predictions on the unlabeled target domain, not just overall but across diverse subpopulations. Our theoretical analysis for pseudo observations relies on functional delta method and $p$-variational norm. We further investigate the algorithm's sample complexity and convergence properties, as well as the multicalibration guarantee for post-processed predictors. Our theoretical insights reveal the link between multicalibration and universal adaptability, suggesting that our calibrated function performs comparably to, if not better than, the inverse propensity score weighting estimator. The performance of our proposed methods is corroborated through extensive numerical simulations and a real-world case study focusing on prediction of cardiovascular disease risk in two large prospective cohort studies. These empirical results confirm its potential as a powerful tool for predictive analysis with censored outcomes in diverse and shifting populations."}, "https://arxiv.org/abs/2405.16046": {"title": "Sensitivity Analysis for Attributable Effects in Case$^2$ Studies", "link": "https://arxiv.org/abs/2405.16046", "description": "arXiv:2405.16046v1 Announce Type: new \nAbstract: The case$^2$ study, also referred to as the case-case study design, is a valuable approach for conducting inference for treatment effects. Unlike traditional case-control studies, the case$^2$ design compares treatment in two types of cases with the same disease. A key quantity of interest is the attributable effect, which is the number of cases of disease among treated units which are caused by the treatment. Two key assumptions that are usually made for making inferences about the attributable effect in case$^2$ studies are 1.) treatment does not cause the second type of case, and 2.) the treatment does not alter an individual's case type. However, these assumptions are not realistic in many real-data applications. In this article, we present a sensitivity analysis framework to scrutinize the impact of deviations from these assumptions on obtained results. We also include sensitivity analyses related to the assumption of unmeasured confounding, recognizing the potential bias introduced by unobserved covariates. 
The proposed methodology is exemplified through an investigation into whether having violent behavior in the last year of life increases suicide risk via 1993 National Mortality Followback Survey dataset."}, "https://arxiv.org/abs/2405.16106": {"title": "On the PM2", "link": "https://arxiv.org/abs/2405.16106", "description": "arXiv:2405.16106v1 Announce Type: new \nAbstract: Spatial confounding, often regarded as a major concern in epidemiological studies, relates to the difficulty of recovering the effect of an exposure on an outcome when these variables are associated with unobserved factors. This issue is particularly challenging in spatio-temporal analyses, where it has been less explored so far. To study the effects of air pollution on mortality in Italy, we argue that a model that simultaneously accounts for spatio-temporal confounding and for the non-linear form of the effect of interest is needed. To this end, we propose a Bayesian dynamic generalized linear model, which allows for a non-linear association and for a decomposition of the exposure effect into two components. This decomposition accommodates associations with the outcome at fine and coarse temporal and spatial scales of variation. These features, when combined, allow reducing the spatio-temporal confounding bias and recovering the true shape of the association, as demonstrated through simulation studies. The results from the real-data application indicate that the exposure effect seems to have different magnitudes in different seasons, with peaks in the summer. We hypothesize that this could be due to possible interactions between the exposure variable with air temperature and unmeasured confounders."}, "https://arxiv.org/abs/2405.16161": {"title": "Inference for Optimal Linear Treatment Regimes in Personalized Decision-making", "link": "https://arxiv.org/abs/2405.16161", "description": "arXiv:2405.16161v1 Announce Type: new \nAbstract: Personalized decision-making, tailored to individual characteristics, is gaining significant attention. The optimal treatment regime aims to provide the best-expected outcome in the entire population, known as the value function. One approach to determine this optimal regime is by maximizing the Augmented Inverse Probability Weighting (AIPW) estimator of the value function. However, the derived treatment regime can be intricate and nonlinear, limiting their use. For clarity and interoperability, we emphasize linear regimes and determine the optimal linear regime by optimizing the AIPW estimator within set constraints.\n While the AIPW estimator offers a viable path to estimating the optimal regime, current methodologies predominantly focus on its asymptotic distribution, leaving a gap in studying the linear regime itself. However, there are many benefits to understanding the regime, as pinpointing significant covariates can enhance treatment effects and provide future clinical guidance. In this paper, we explore the asymptotic distribution of the estimated linear regime. Our results show that the parameter associated with the linear regime follows a cube-root convergence to a non-normal limiting distribution characterized by the maximizer of a centered Gaussian process with a quadratic drift. When making inferences for the estimated linear regimes with cube-root convergence in practical scenarios, the standard nonparametric bootstrap is invalid. As a solution, we facilitate the Cattaneo et al. 
(2020) bootstrap technique to provide a consistent distributional approximation for the estimated linear regimes, validated further through simulations and real-world data applications from the eICU Collaborative Research Database."}, "https://arxiv.org/abs/2405.16192": {"title": "Novel closed-form point estimators for a weighted exponential family derived from likelihood equations", "link": "https://arxiv.org/abs/2405.16192", "description": "arXiv:2405.16192v1 Announce Type: new \nAbstract: In this paper, we propose and investigate closed-form point estimators for a weighted exponential family. We also develop a bias-reduced version of these proposed closed-form estimators through bootstrap methods. Estimators are assessed using a Monte Carlo simulation, revealing favorable results for the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2405.16246": {"title": "Conformalized Late Fusion Multi-View Learning", "link": "https://arxiv.org/abs/2405.16246", "description": "arXiv:2405.16246v1 Announce Type: new \nAbstract: Uncertainty quantification for multi-view learning is motivated by the increasing use of multi-view data in scientific problems. A common variant of multi-view learning is late fusion: train separate predictors on individual views and combine them after single-view predictions are available. Existing methods for uncertainty quantification for late fusion often rely on undesirable distributional assumptions for validity. Conformal prediction is one approach that avoids such distributional assumptions. However, naively applying conformal prediction to late-stage fusion pipelines often produces overly conservative and uninformative prediction regions, limiting its downstream utility. We propose a novel methodology, Multi-View Conformal Prediction (MVCP), where conformal prediction is instead performed separately on the single-view predictors and only fused subsequently. Our framework extends the standard scalar formulation of a score function to a multivariate score that produces more efficient downstream prediction regions in both classification and regression settings. We then demonstrate that such improvements can be realized in methods built atop conformalized regressors, specifically in robust predict-then-optimize pipelines."}, "https://arxiv.org/abs/2405.16298": {"title": "Fast Emulation and Modular Calibration for Simulators with Functional Response", "link": "https://arxiv.org/abs/2405.16298", "description": "arXiv:2405.16298v1 Announce Type: new \nAbstract: Scalable surrogate models enable efficient emulation of computer models (or simulators), particularly when dealing with large ensembles of runs. While Gaussian Process (GP) models are commonly employed for emulation, they face limitations in scaling to truly large datasets. Furthermore, when dealing with dense functional output, such as spatial or time-series data, additional complexities arise, requiring careful handling to ensure fast emulation. This work presents a highly scalable emulator for functional data, building upon the works of Kennedy and O'Hagan (2001) and Higdon et al. (2008), while incorporating the local approximate Gaussian Process framework proposed by Gramacy and Apley (2015). The emulator utilizes global GP lengthscale parameter estimates to scale the input space, leading to a substantial improvement in prediction speed. We demonstrate that our fast approximation-based emulator can serve as a viable alternative to the methods outlined in Higdon et al. 
(2008) for functional response, while drastically reducing computational costs. The proposed emulator is applied to quickly calibrate the multiphysics continuum hydrodynamics simulator FLAG with a large ensemble of 20000 runs. The methods presented are implemented in the R package FlaGP."}, "https://arxiv.org/abs/2405.16379": {"title": "Selective inference for multiple pairs of clusters after K-means clustering", "link": "https://arxiv.org/abs/2405.16379", "description": "arXiv:2405.16379v1 Announce Type: new \nAbstract: If the same data is used for both clustering and for testing a null hypothesis that is formulated in terms of the estimated clusters, then the traditional hypothesis testing framework often fails to control the Type I error. Gao et al. [2022] and Chen and Witten [2023] provide selective inference frameworks for testing if a pair of estimated clusters indeed stem from underlying differences, for the case where hierarchical clustering and K-means clustering, respectively, are used to define the clusters. In applications, however, it is often of interest to test for multiple pairs of clusters. In our work, we extend the pairwise test of Chen and Witten [2023] to a test for multiple pairs of clusters, where the cluster assignments are produced by K-means clustering. We further develop an analogous test for the setting where the variance is unknown, building on the work of Yun and Barber [2023] that extends Gao et al. [2022]'s pairwise test to the case of unknown variance. For both known and unknown variance settings, we present methods that address certain forms of data-dependence in the choice of pairs of clusters to test for. We show that our proposed tests control the Type I error, both theoretically and empirically, and provide a numerical study of their empirical powers under various settings."}, "https://arxiv.org/abs/2405.16467": {"title": "Two-way fixed effects instrumental variable regressions in staggered DID-IV designs", "link": "https://arxiv.org/abs/2405.16467", "description": "arXiv:2405.16467v1 Announce Type: new \nAbstract: Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation in the timing of policy adoption across units as an instrument for treatment. This paper studies the properties of the TWFEIV estimator in staggered instrumented difference-in-differences (DID-IV) designs. We show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator can be decomposed into a weighted average of all possible two-group/two-period Wald-DID estimators. Under staggered DID-IV designs, a causal interpretation of the TWFEIV estimand hinges on the stable effects of the instrument on the treatment and the outcome over time. We illustrate the use of our decomposition theorem for the TWFEIV estimator through an empirical application."}, "https://arxiv.org/abs/2405.16492": {"title": "A joint model for (un)bounded longitudinal markers, competing risks, and recurrent events using patient registry data", "link": "https://arxiv.org/abs/2405.16492", "description": "arXiv:2405.16492v1 Announce Type: new \nAbstract: Joint models for longitudinal and survival data have become a popular framework for studying the association between repeatedly measured biomarkers and clinical events. Nevertheless, addressing complex survival data structures, especially handling both recurrent and competing event times within a single model, remains a challenge. This causes important information to be disregarded. 
Moreover, existing frameworks rely on a Gaussian distribution for continuous markers, which may be unsuitable for bounded biomarkers, resulting in biased estimates of associations. To address these limitations, we propose a Bayesian shared-parameter joint model that simultaneously accommodates multiple (possibly bounded) longitudinal markers, a recurrent event process, and competing risks. We use the beta distribution to model responses bounded within any interval (a,b) without sacrificing the interpretability of the association. The model offers various forms of association, discontinuous risk intervals, and both gap and calendar timescales. A simulation study shows that it outperforms simpler joint models. We utilize the US Cystic Fibrosis Foundation Patient Registry to study the associations between changes in lung function and body mass index, and the risk of recurrent pulmonary exacerbations, while accounting for the competing risks of death and lung transplantation. Our efficient implementation allows fast fitting of the model despite its complexity and the large sample size from this patient registry. Our comprehensive approach provides new insights into cystic fibrosis disease progression by quantifying the relationship between the most important clinical markers and events more precisely than has been possible before. The model implementation is available in the R package JMbayes2."}, "https://arxiv.org/abs/2405.16547": {"title": "Estimating Dyadic Treatment Effects with Unknown Confounders", "link": "https://arxiv.org/abs/2405.16547", "description": "arXiv:2405.16547v1 Announce Type: new \nAbstract: This paper proposes a statistical inference method for assessing treatment effects with dyadic data. Under the assumption that the treatments follow an exchangeable distribution, our approach allows for the presence of any unobserved confounding factors that potentially cause endogeneity of treatment choice without requiring additional information other than the treatments and outcomes. Building on the literature of graphon estimation in network data analysis, we propose a neighborhood kernel smoothing method for estimating dyadic average treatment effects. We also develop a permutation inference method for testing the sharp null hypothesis. Under certain regularity conditions, we derive the rate of convergence of the proposed estimator and demonstrate the size control property of our test. We apply our method to international trade data to assess the impact of free trade agreements on bilateral trade flows."}, "https://arxiv.org/abs/2405.16602": {"title": "Multiple imputation of missing covariates when using the Fine-Gray model", "link": "https://arxiv.org/abs/2405.16602", "description": "arXiv:2405.16602v1 Announce Type: new \nAbstract: The Fine-Gray model for the subdistribution hazard is commonly used for estimating associations between covariates and competing risks outcomes. When there are missing values in the covariates included in a given model, researchers may wish to multiply impute them. Assuming interest lies in estimating the risk of only one of the competing events, this paper develops a substantive-model-compatible multiple imputation approach that exploits the parallels between the Fine-Gray model and the standard (single-event) Cox model. 
In the presence of right-censoring, this involves first imputing the potential censoring times for those failing from competing events, and thereafter imputing the missing covariates by leveraging methodology previously developed for the Cox model in the setting without competing risks. In a simulation study, we compared the proposed approach to alternative methods, such as imputing compatibly with cause-specific Cox models. The proposed method performed well (in terms of estimation of both subdistribution log hazard ratios and cumulative incidences) when data were generated assuming proportional subdistribution hazards, and performed satisfactorily when this assumption was not satisfied. The gain in efficiency compared to a complete-case analysis was demonstrated both in the simulation study and in an applied data example on competing outcomes following an allogeneic stem cell transplantation. For individual-specific cumulative incidence estimation, assuming proportionality on the correct scale at the analysis phase appears to be more important than correctly specifying the imputation procedure used to impute the missing covariates."}, "https://arxiv.org/abs/2405.16780": {"title": "Analysis of Broken Randomized Experiments by Principal Stratification", "link": "https://arxiv.org/abs/2405.16780", "description": "arXiv:2405.16780v1 Announce Type: new \nAbstract: Although randomized controlled trials have long been regarded as the ``gold standard'' for evaluating treatment effects, there is no natural safeguard against post-treatment events. For example, non-compliance makes the actual treatment different from the assigned treatment, truncation-by-death renders the outcome undefined or ill-defined, and missingness prevents the outcomes from being measured. In this paper, we develop a statistical analysis framework using principal stratification to investigate the treatment effect in broken randomized experiments. The average treatment effect in compliers and always-survivors is adopted as the target causal estimand. We establish the asymptotic properties of the estimator. We apply the framework to study the effect of training on earnings in the Job Corps Study and find that the training program does not have an effect on employment but possibly has an effect on improving earnings after employment."}, "https://arxiv.org/abs/2405.16859": {"title": "Gaussian Mixture Model with Rare Events", "link": "https://arxiv.org/abs/2405.16859", "description": "arXiv:2405.16859v1 Announce Type: new \nAbstract: We study here a Gaussian Mixture Model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence rate. To theoretically understand this phenomenon, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the empirically slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. As compared with the standard EM algorithm, the key feature of the MEM algorithm is that it additionally requires labeled data. 
We find that the MEM algorithm significantly improves the numerical convergence rate as compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs."}, "https://arxiv.org/abs/2405.16885": {"title": "Hidden Markov modelling of spatio-temporal dynamics of measles in 1750-1850 Finland", "link": "https://arxiv.org/abs/2405.16885", "description": "arXiv:2405.16885v1 Announce Type: new \nAbstract: Real-world spatio-temporal datasets, and phenomena related to them, are often challenging to visualise or gain a general overview of. In order to summarise information encompassed in such data, we combine two well-known statistical modelling methods. To account for the spatial dimension, we use the intrinsic modification of the conditional autoregression, and incorporate it into the hidden Markov model, allowing the spatial patterns to vary over time. We apply our method to parish register data on deaths caused by measles in Finland in 1750-1850, and gain novel insight into previously undiscovered infection dynamics. Five distinctive, recurring states describing spatially and temporally differing infection burden and potential routes of spread are identified. We also find that there is a change in the occurrences of the most typical spatial patterns circa 1812, possibly due to changes in communication routes after major administrative transformations in Finland."}, "https://arxiv.org/abs/2405.16989": {"title": "Uncertainty Learning for High-dimensional Mean-variance Portfolio", "link": "https://arxiv.org/abs/2405.16989", "description": "arXiv:2405.16989v1 Announce Type: new \nAbstract: Accounting for uncertainty in data quality is important for accurate statistical inference. We aim for an optimal conservative allocation for a large universe of assets in a mean-variance portfolio (MVP), which is the worst-case choice within the uncertainty in the data distribution. Unlike the low-dimensional MVP studied in Blanchet et al. (2022, Management Science), the large number of assets raises a challenging problem in quantifying the uncertainty, due to the large deviation of the sample covariance matrix from its population version. To overcome this difficulty, we propose a data-adaptive method to quantify the uncertainty with the help of a factor structure. A Monte Carlo simulation is conducted to show the superiority of our method in high-dimensional cases: by avoiding the over-conservative results in Blanchet et al. (2022), our allocation is closer to the oracle version in terms of risk minimization and control of the expected portfolio return."}, "https://arxiv.org/abs/2405.17064": {"title": "The Probability of Improved Prediction: a new concept in statistical inference", "link": "https://arxiv.org/abs/2405.17064", "description": "arXiv:2405.17064v1 Announce Type: new \nAbstract: In an attempt to provide an answer to the increasing criticism of p-values and to bridge the gap between statistical inference and prediction modelling, we introduce the probability of improved prediction (PIP). In general, the PIP is a probabilistic measure for comparing two competing models. Three versions of the PIP and several estimators are introduced and the relationships between them, p-values and the mean squared error are investigated. The performance of the estimators is assessed in a simulation study. 
An application shows how the PIP can support p-values to strengthen the conclusions or possibly point to issues with, e.g., replicability."}, "https://arxiv.org/abs/2405.17117": {"title": "Robust Reproducible Network Exploration", "link": "https://arxiv.org/abs/2405.17117", "description": "arXiv:2405.17117v1 Announce Type: new \nAbstract: We propose a novel method of network detection that is robust against any complex dependence structure. Our goal is to conduct exploratory network detection, meaning that we attempt to detect a network composed of ``connectable'' edges that are worth investigating in detail for further modelling or precise network analysis. For reproducible network detection, we pursue high power while controlling the false discovery rate (FDR). In particular, we formalize the problem as a multiple testing problem, and propose p-variables that are used in the Benjamini-Hochberg procedure. We show that the proposed method controls the FDR under arbitrary dependence structure with any sample size, and has asymptotic power one. The validity is also confirmed by simulations and a real data example."}, "https://arxiv.org/abs/2405.17166": {"title": "Cross-border cannibalization: Spillover effects of wind and solar energy on interconnected European electricity markets", "link": "https://arxiv.org/abs/2405.17166", "description": "arXiv:2405.17166v1 Announce Type: new \nAbstract: The average revenue, or market value, of wind and solar energy tends to fall with increasing market shares, as is now evident across European electricity markets. At the same time, these markets have become more interconnected. In this paper, we empirically study the multiple cross-border effects on the value of renewable energy: on one hand, interconnection is a flexibility resource that allows exporting energy when it is locally abundant, benefiting renewables. On the other hand, wind and solar radiation are correlated across space, so neighboring supply adds to the local one to depress domestic prices. We estimate both effects, using spatial panel regression on electricity market data from 2015 to 2023 from 30 European bidding zones. We find that domestic wind and solar value is not only depressed by domestic, but also by neighboring renewables expansion. The better interconnected a market is, the smaller the effect of domestic but the larger the effect of neighboring renewables. While wind value is stabilized by interconnection, solar value is not. If wind market share increases both at home and in neighboring markets by one percentage point, the value factor of wind energy is reduced by just above 1 percentage point. For solar, this number is almost 4 percentage points."}, "https://arxiv.org/abs/2405.17225": {"title": "Quantifying the Reliance of Black-Box Decision-Makers on Variables of Interest", "link": "https://arxiv.org/abs/2405.17225", "description": "arXiv:2405.17225v1 Announce Type: new \nAbstract: This paper introduces a framework for measuring how much black-box decision-makers rely on variables of interest. The framework adapts a permutation-based measure of variable importance from the explainable machine learning literature. With an emphasis on applicability, I present some of the framework's theoretical and computational properties, explain how reliance computations have policy implications, and work through an illustrative example. 
In the empirical application to interruptions by Supreme Court Justices during oral argument, I find that the effect of gender is more muted compared to the existing literature's estimate; I then use this paper's framework to compare Justices' reliance on gender and alignment to their reliance on experience, which are incomparable using regression coefficients."}, "https://arxiv.org/abs/2405.17237": {"title": "Mixing it up: Inflation at risk", "link": "https://arxiv.org/abs/2405.17237", "description": "arXiv:2405.17237v1 Announce Type: new \nAbstract: Assessing the contribution of various risk factors to future inflation risks was crucial for guiding monetary policy during the recent high inflation period. However, existing methodologies often provide limited insights by focusing solely on specific percentiles of the forecast distribution. In contrast, this paper introduces a comprehensive framework that examines how economic indicators impact the entire forecast distribution of macroeconomic variables, facilitating the decomposition of the overall risk outlook into its underlying drivers. Additionally, the framework allows for the construction of risk measures that align with central bank preferences, serving as valuable summary statistics. Applied to the recent inflation surge, the framework reveals that U.S. inflation risk was primarily influenced by the recovery of the U.S. business cycle and surging commodity prices, partially mitigated by adjustments in monetary policy and credit spreads."}, "https://arxiv.org/abs/2405.17254": {"title": "Estimating treatment-effect heterogeneity across sites in multi-site randomized experiments with imperfect compliance", "link": "https://arxiv.org/abs/2405.17254", "description": "arXiv:2405.17254v1 Announce Type: new \nAbstract: We consider multi-site randomized controlled trials with a large number of small sites and imperfect compliance, conducted in non-random convenience samples in each site. We show that an Empirical-Bayes (EB) estimator can be used to estimate a lower bound of the variance of intention-to-treat (ITT) effects across sites. We also propose bounds for the coefficient from a regression of site-level ITTs on sites' control-group outcome. Turning to local average treatment effects (LATEs), the EB estimator cannot be used to estimate their variance, because site-level LATE estimators are biased. Instead, we propose two testable assumptions under which the LATEs' variance can be written as a function of sites' ITT and first-stage (FS) effects, thus allowing us to use an EB estimator leveraging only unbiased ITT and FS estimators. We revisit Behaghel et al. (2014), who study the effect of counselling programs on job seekers job-finding rate, in more than 200 job placement agencies in France. We find considerable ITT heterogeneity, and even more LATE heterogeneity: our lower bounds on ITTs' (resp. LATEs') standard deviation are more than three (resp. four) times larger than the average ITT (resp. LATE) across sites. Sites with a lower job-finding rate in the control group have larger ITT effects."}, "https://arxiv.org/abs/2405.17259": {"title": "The state learner -- a super learner for right-censored data", "link": "https://arxiv.org/abs/2405.17259", "description": "arXiv:2405.17259v1 Announce Type: new \nAbstract: In survival analysis, prediction models are needed as stand-alone tools and in applications of causal inference to estimate nuisance parameters. 
The super learner is a machine learning algorithm which combines a library of prediction models into a meta learner based on cross-validated loss. In right-censored data, the choice of the loss function and the estimation of the expected loss need careful consideration. We introduce the state learner, a new super learner for survival analysis, which simultaneously evaluates libraries of prediction models for the event of interest and the censoring distribution. The state learner can be applied to all types of survival models, works in the presence of competing risks, and does not require a single pre-specified estimator of the conditional censoring distribution. We establish an oracle inequality for the state learner and investigate its performance through numerical experiments. We illustrate the application of the state learner with prostate cancer data, as a stand-alone prediction tool, and, for causal inference, as a way to estimate the nuisance parameter models of a smooth statistical functional."}, "https://arxiv.org/abs/2405.17265": {"title": "Assessing uncertainty in Gaussian mixtures-based entropy estimation", "link": "https://arxiv.org/abs/2405.17265", "description": "arXiv:2405.17265v1 Announce Type: new \nAbstract: Entropy estimation plays a crucial role in various fields, such as information theory, statistical data science, and machine learning. However, traditional entropy estimation methods often struggle with complex data distributions. Mixture-based estimation of entropy has been recently proposed and gained attention due to its ease of use and accuracy. This paper presents a novel approach to quantify the uncertainty associated with this mixture-based entropy estimation method using weighted likelihood bootstrap. Unlike standard methods, our approach leverages the underlying mixture structure by assigning random weights to observations in a weighted likelihood bootstrap procedure, leading to more accurate uncertainty estimation. The generation of weights is also investigated, leading to the proposal of using weights obtained from a Dirichlet distribution with parameter $\\alpha = 0.8137$ instead of the usual $\\alpha = 1$. Furthermore, the use of centered percentile intervals emerges as the preferred choice to ensure empirical coverage close to the nominal level. Extensive simulation studies comparing different resampling strategies are presented and results discussed. The proposed approach is illustrated by analyzing the log-returns of daily Gold prices at COMEX for the years 2014--2022, and the Net Rating scores, an advanced statistic used in basketball analytics, for NBA teams with reference to the 2022/23 regular season."}, "https://arxiv.org/abs/2405.17290": {"title": "Count Data Models with Heterogeneous Peer Effects under Rational Expectations", "link": "https://arxiv.org/abs/2405.17290", "description": "arXiv:2405.17290v1 Announce Type: new \nAbstract: This paper develops a micro-founded peer effect model for count responses using a game of incomplete information. The model incorporates heterogeneity in peer effects through agents' groups based on observed characteristics. Parameter identification is established using the identification condition of linear models, which relies on the presence of friends' friends who are not direct friends in the network. I show that this condition extends to a large class of nonlinear models. The model parameters are estimated using the nested pseudo-likelihood approach, controlling for network endogeneity. 
I present an empirical application on students' participation in extracurricular activities. I find that females are more responsive to their peers than males, whereas male peers do not influence male students. An easy-to-use R package, named CDatanet, is available for implementing the model."}, "https://arxiv.org/abs/2405.15950": {"title": "A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction", "link": "https://arxiv.org/abs/2405.15950", "description": "arXiv:2405.15950v1 Announce Type: cross \nAbstract: Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that deviate substantially from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased, while those for small-valued outcomes are positively biased. We refer to this linear central tendency warped bias as the \"systematic bias of machine learning regression\". In this paper, we first demonstrate that this issue persists across various machine learning models, and then delve into its theoretical underpinnings. We propose a general constrained optimization approach designed to correct this bias and develop a computationally efficient algorithm to implement our method. Our simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning models, our method effectively addresses the longstanding issue of \"systematic bias of machine learning regression\" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age."}, "https://arxiv.org/abs/2405.16055": {"title": "Federated Learning for Non-factorizable Models using Deep Generative Prior Approximations", "link": "https://arxiv.org/abs/2405.16055", "description": "arXiv:2405.16055v1 Announce Type: cross \nAbstract: Federated learning (FL) allows for collaborative model training across decentralized clients while preserving privacy by avoiding data sharing. However, current FL methods assume conditional independence between client models, limiting the use of priors that capture dependence, such as Gaussian processes (GPs). We introduce the Structured Independence via deep Generative Model Approximation (SIGMA) prior, which enables FL for non-factorizable models across clients, expanding the applicability of FL to fields such as spatial statistics, epidemiology, environmental science, and other domains where modeling dependencies is crucial. The SIGMA prior is a pre-trained deep generative model that approximates the desired prior and induces a specified conditional independence structure in the latent variables, creating an approximate model suitable for FL settings. We demonstrate the SIGMA prior's effectiveness on synthetic data and showcase its utility in a real-world example of FL for spatial data, using a conditional autoregressive prior to model spatial dependence across Australia. 
Our work enables new FL applications in domains where modeling dependent data is essential for accurate predictions and decision-making."}, "https://arxiv.org/abs/2405.16069": {"title": "IncomeSCM: From tabular data set to time-series simulator and causal estimation benchmark", "link": "https://arxiv.org/abs/2405.16069", "description": "arXiv:2405.16069v1 Announce Type: cross \nAbstract: Evaluating observational estimators of causal effects demands information that is rarely available: unconfounded interventions and outcomes from the population of interest, created either by randomization or adjustment. As a result, it is customary to fall back on simulators when creating benchmark tasks. Simulators offer great control but are often too simplistic to make challenging tasks, either because they are hand-designed and lack the nuances of real-world data, or because they are fit to observational data without structural constraints. In this work, we propose a general, repeatable strategy for turning observational data into sequential structural causal models and challenging estimation tasks by following two simple principles: 1) fitting real-world data where possible, and 2) creating complexity by composing simple, hand-designed mechanisms. We implement these ideas in a highly configurable software package and apply it to the well-known Adult income data set to construct the IncomeSCM simulator. From this, we devise multiple estimation tasks and sample data sets to compare established estimators of causal effects. The tasks present a suitable challenge, with effect estimates varying greatly in quality between methods, despite similar performance in the modeling of factual outcomes, highlighting the need for dedicated causal estimators and model selection criteria."}, "https://arxiv.org/abs/2405.16130": {"title": "Automating the Selection of Proxy Variables of Unmeasured Confounders", "link": "https://arxiv.org/abs/2405.16130", "description": "arXiv:2405.16130v1 Announce Type: cross \nAbstract: Recently, interest has grown in the use of proxy variables of unobserved confounding for inferring the causal effect in the presence of unmeasured confounders from observational data. One difficulty inhibiting the practical use is finding valid proxy variables of unobserved confounding to a target causal effect of interest. These proxy variables are typically justified by background knowledge. In this paper, we investigate the estimation of causal effects among multiple treatments and a single outcome, all of which are affected by unmeasured confounders, within a linear causal model, without prior knowledge of the validity of proxy variables. To be more specific, we first extend the existing proxy variable estimator, originally addressing a single unmeasured confounder, to accommodate scenarios where multiple unmeasured confounders exist between the treatments and the outcome. Subsequently, we present two different sets of precise identifiability conditions for selecting valid proxy variables of unmeasured confounders, based on the second-order statistics and higher-order statistics of the data, respectively. Moreover, we propose two data-driven methods for the selection of proxy variables and for the unbiased estimation of causal effects. Theoretical analysis demonstrates the correctness of our proposed algorithms. 
Experimental results on both synthetic and real-world data show the effectiveness of the proposed approach."}, "https://arxiv.org/abs/2405.16250": {"title": "Conformal Robust Control of Linear Systems", "link": "https://arxiv.org/abs/2405.16250", "description": "arXiv:2405.16250v1 Announce Type: cross \nAbstract: End-to-end engineering design pipelines, in which designs are evaluated using concurrently defined optimal controllers, are becoming increasingly common in practice. To discover designs that perform well even under the misspecification of system dynamics, such end-to-end pipelines have now begun evaluating designs with a robust control objective in place of the nominal optimal control setup. Current approaches of specifying such robust control subproblems, however, rely on hand specification of perturbations anticipated to be present upon deployment or margin methods that ignore problem structure, resulting in a lack of theoretical guarantees and overly conservative empirical performance. We, instead, propose a novel methodology for LQR systems that leverages conformal prediction to specify such uncertainty regions in a data-driven fashion. Such regions have distribution-free coverage guarantees on the true system dynamics, in turn allowing for a probabilistic characterization of the regret of the resulting robust controller. We then demonstrate that such a controller can be efficiently produced via a novel policy gradient method that has convergence guarantees. We finally demonstrate the superior empirical performance of our method over alternate robust control specifications in a collection of engineering control systems, specifically for airfoils and a load-positioning system."}, "https://arxiv.org/abs/2405.16455": {"title": "On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization", "link": "https://arxiv.org/abs/2405.16455", "description": "arXiv:2405.16455v1 Announce Type: cross \nAbstract: Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. 
Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF."}, "https://arxiv.org/abs/2405.16564": {"title": "Contextual Linear Optimization with Bandit Feedback", "link": "https://arxiv.org/abs/2405.16564", "description": "arXiv:2405.16564v1 Announce Type: cross \nAbstract: Contextual linear optimization (CLO) uses predictive observations to reduce uncertainty in random cost coefficients and thereby improve average-cost performance. An example is a stochastic shortest path with random edge costs (e.g., traffic) and predictive features (e.g., lagged traffic, weather). Existing work on CLO assumes the data has fully observed cost coefficient vectors, but in many applications, we can only see the realized cost of a historical decision, that is, just one projection of the random cost coefficient vector, to which we refer as bandit feedback. We study a class of algorithms for CLO with bandit feedback, which we term induced empirical risk minimization (IERM), where we fit a predictive model to directly optimize the downstream performance of the policy it induces. We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses. A byproduct of our theory of independent interest is fast-rate regret bound for IERM with full feedback and misspecified policy class. We compare the performance of different modeling choices numerically using a stochastic shortest path example and provide practical insights from the empirical results."}, "https://arxiv.org/abs/2405.16672": {"title": "Transfer Learning Under High-Dimensional Graph Convolutional Regression Model for Node Classification", "link": "https://arxiv.org/abs/2405.16672", "description": "arXiv:2405.16672v1 Announce Type: cross \nAbstract: Node classification is a fundamental task, but obtaining node classification labels can be challenging and expensive in many real-world scenarios. Transfer learning has emerged as a promising solution to address this challenge by leveraging knowledge from source domains to enhance learning in a target domain. Existing transfer learning methods for node classification primarily focus on integrating Graph Convolutional Networks (GCNs) with various transfer learning techniques. While these approaches have shown promising results, they often suffer from a lack of theoretical guarantees, restrictive conditions, and high sensitivity to hyperparameter choices. To overcome these limitations, we propose a Graph Convolutional Multinomial Logistic Regression (GCR) model and a transfer learning method based on the GCR model, called Trans-GCR. We provide theoretical guarantees of the estimate obtained under GCR model in high-dimensional settings. Moreover, Trans-GCR demonstrates superior empirical performance, has a low computational cost, and requires fewer hyperparameters than existing methods."}, "https://arxiv.org/abs/2405.17178": {"title": "Statistical Mechanism Design: Robust Pricing, Estimation, and Inference", "link": "https://arxiv.org/abs/2405.17178", "description": "arXiv:2405.17178v1 Announce Type: cross \nAbstract: This paper tackles challenges in pricing and revenue projections due to consumer uncertainty. 
We propose a novel data-based approach for firms facing unknown consumer type distributions. Unlike existing methods, we assume firms only observe a finite sample of consumers' types. We introduce \\emph{empirically optimal mechanisms}, a simple and intuitive class of sample-based mechanisms with strong finite-sample revenue guarantees. Furthermore, we leverage our results to develop a toolkit for statistical inference on profits. Our approach allows us to reliably estimate the profits associated with any particular mechanism, to construct confidence intervals, and, more generally, to conduct valid hypothesis testing."}, "https://arxiv.org/abs/2405.17318": {"title": "Extremal correlation coefficient for functional data", "link": "https://arxiv.org/abs/2405.17318", "description": "arXiv:2405.17318v1 Announce Type: cross \nAbstract: We propose a coefficient that measures dependence in paired samples of functions. It has properties similar to the Pearson correlation, but differs in significant ways: 1) it is designed to measure dependence between curves, 2) it focuses only on extreme curves. The new coefficient is derived within the framework of regular variation in Banach spaces. A consistent estimator is proposed and justified by an asymptotic analysis and a simulation study. The usefulness of the new coefficient is illustrated on financial and climate functional data."}, "https://arxiv.org/abs/2112.13398": {"title": "Long Story Short: Omitted Variable Bias in Causal Machine Learning", "link": "https://arxiv.org/abs/2112.13398", "description": "arXiv:2112.13398v5 Announce Type: replace \nAbstract: We develop a general theory of omitted variable bias for a wide range of common causal parameters, including (but not limited to) averages of potential outcomes, average treatment effects, average causal derivatives, and policy effects from covariate shifts. Our theory applies to nonparametric models, while naturally allowing for (semi-)parametric restrictions (such as partial linearity) when such assumptions are made. We show how simple plausibility judgments on the maximum explanatory power of omitted variables are sufficient to bound the magnitude of the bias, thus facilitating sensitivity analysis in otherwise complex, nonlinear models. Finally, we provide flexible and efficient statistical inference methods for the bounds, which can leverage modern machine learning algorithms for estimation. These results allow empirical researchers to perform sensitivity analyses in a flexible class of machine-learned causal models using very simple, and interpretable, tools. We demonstrate the utility of our approach with two empirical examples."}, "https://arxiv.org/abs/2208.08638": {"title": "Lost in the Shuffle: Testing Power in the Presence of Errorful Network Vertex Labels", "link": "https://arxiv.org/abs/2208.08638", "description": "arXiv:2208.08638v5 Announce Type: replace \nAbstract: Two-sample network hypothesis testing is an important inference task with applications across diverse fields such as medicine, neuroscience, and sociology. Many of these testing methodologies operate under the implicit assumption that the vertex correspondence across networks is a priori known. This assumption is often untrue, and the power of the subsequent test can degrade when there are misaligned/label-shuffled vertices across networks. 
This power loss due to shuffling is theoretically explored in the context of random dot product and stochastic block model networks for a pair of hypothesis tests based on Frobenius norm differences between estimated edge probability matrices or between adjacency matrices. The loss in testing power is further reinforced by numerous simulations and experiments, both in the stochastic block model and in the random dot product graph model, where the power loss across multiple recently proposed tests in the literature is considered. Lastly, the impact that shuffling can have in real-data testing is demonstrated in a pair of examples from neuroscience and from social network analysis."}, "https://arxiv.org/abs/2210.00697": {"title": "A flexible model for correlated count data, with application to multi-condition differential expression analyses of single-cell RNA sequencing data", "link": "https://arxiv.org/abs/2210.00697", "description": "arXiv:2210.00697v3 Announce Type: replace \nAbstract: Detecting differences in gene expression is an important part of single-cell RNA sequencing experiments, and many statistical methods have been developed for this aim. Most differential expression analyses focus on comparing expression between two groups (e.g., treatment vs. control). But there is increasing interest in multi-condition differential expression analyses in which expression is measured in many conditions, and the aim is to accurately detect and estimate expression differences in all conditions. We show that directly modeling single-cell RNA-seq counts in all conditions simultaneously, while also inferring how expression differences are shared across conditions, leads to greatly improved performance for detecting and estimating expression differences compared to existing methods. We illustrate the potential of this new approach by analyzing data from a single-cell experiment studying the effects of cytokine stimulation on gene expression. We call our new method \"Poisson multivariate adaptive shrinkage\", and it is implemented in an R package available online at https://github.com/stephenslab/poisson.mash.alpha."}, "https://arxiv.org/abs/2301.09694": {"title": "Flexible Modeling of Demographic Transition Processes with a Bayesian Hierarchical B-splines Model", "link": "https://arxiv.org/abs/2301.09694", "description": "arXiv:2301.09694v2 Announce Type: replace \nAbstract: Several demographic and health indicators, including the total fertility rate (TFR) and modern contraceptive use rate (mCPR), evolve similarly over time, characterized by a transition between stable states. Existing approaches for estimation or projection of transitions in multiple populations have successfully used parametric functions to capture the relation between the rate of change of an indicator and its level. However, incorrect parametric forms may result in bias or incorrect coverage in long-term projections. We propose a new class of models to capture demographic transitions in multiple populations. Our proposal, the B-spline Transition Model (BTM), models the relationship between the rate of change of an indicator and its level using B-splines, allowing for data-adaptive estimation of transition functions. Bayesian hierarchical models are used to share information on the transition function between populations. We apply the BTM to estimate and project country-level TFR and mCPR and compare the results against those from extant parametric models. 
For TFR, BTM projections have generally lower error than the comparison model. For mCPR, while results are comparable between BTM and a parametric approach, the B-spline model generally improves out-of-sample predictions. The case studies suggest that the BTM may be considered for demographic applications"}, "https://arxiv.org/abs/2302.01607": {"title": "dynamite: An R Package for Dynamic Multivariate Panel Models", "link": "https://arxiv.org/abs/2302.01607", "description": "arXiv:2302.01607v2 Announce Type: replace \nAbstract: dynamite is an R package for Bayesian inference of intensive panel (time series) data comprising multiple measurements per multiple individuals measured in time. The package supports joint modeling of multiple response variables, time-varying and time-invariant effects, a wide range of discrete and continuous distributions, group-specific random effects, latent factors, and customization of prior distributions of the model parameters. Models in the package are defined via a user-friendly formula interface, and estimation of the posterior distribution of the model parameters takes advantage of state-of-the-art Markov chain Monte Carlo methods. The package enables efficient computation of both individual-level and summarized predictions and offers a comprehensive suite of tools for visualization and model diagnostics."}, "https://arxiv.org/abs/2307.13686": {"title": "Characteristics and Predictive Modeling of Short-term Impacts of Hurricanes on the US Employment", "link": "https://arxiv.org/abs/2307.13686", "description": "arXiv:2307.13686v3 Announce Type: replace \nAbstract: The physical and economic damages of hurricanes can acutely affect employment and the well-being of employees. However, a comprehensive understanding of these impacts remains elusive as many studies focused on narrow subsets of regions or hurricanes. Here we present an open-source dataset that serves interdisciplinary research on hurricane impacts on US employment. Compared to past domain-specific efforts, this dataset has greater spatial-temporal granularity and variable coverage. To demonstrate potential applications of this dataset, we focus on the short-term employment disruptions related to hurricanes during 1990-2020. The observed county-level employment changes in the initial month are small on average, though large employment losses (>30%) can occur after extreme storms. The overall small changes partly result from compensation among different employment sectors, which may obscure large, concentrated employment losses after hurricanes. Additional econometric analyses concur on the post-storm employment losses in hospitality and leisure but disagree on employment changes in the other industries. The dataset also enables data-driven analyses that highlight vulnerabilities such as pronounced employment losses related to Puerto Rico and rainy hurricanes. Furthermore, predictive modeling of short-term employment changes shows promising performance for service-providing industries and high-impact storms. In the examined cases, the nonlinear Random Forests model greatly outperforms the multiple linear regression model. The nonlinear model also suggests that more severe hurricane hazards projected by physical models may cause more extreme losses in US service-providing employment. 
Finally, we share our dataset and analytical code to facilitate the study and modeling of hurricane impacts in a changing climate."}, "https://arxiv.org/abs/2308.05205": {"title": "Dynamic survival analysis: modelling the hazard function via ordinary differential equations", "link": "https://arxiv.org/abs/2308.05205", "description": "arXiv:2308.05205v4 Announce Type: replace \nAbstract: The hazard function represents one of the main quantities of interest in the analysis of survival data. We propose a general approach for parametrically modelling the dynamics of the hazard function using systems of autonomous ordinary differential equations (ODEs). This modelling approach can be used to provide qualitative and quantitative analyses of the evolution of the hazard function over time. Our proposal capitalises on the extensive literature on ODEs which, in particular, allows for establishing basic rules or laws on the dynamics of the hazard function via the use of autonomous ODEs. We show how to implement the proposed modelling framework in cases where there is an analytic solution to the system of ODEs or where an ODE solver is required to obtain a numerical solution. We focus on the use of a Bayesian modelling approach, but the proposed methodology can also be coupled with maximum likelihood estimation. A simulation study is presented to illustrate the performance of these models and the interplay of sample size and censoring. Two case studies using real data are presented to illustrate the use of the proposed approach and to highlight the interpretability of the corresponding models. We conclude with a discussion on potential extensions of our work and strategies to include covariates into our framework. Although we focus on examples in Medical Statistics, the proposed framework is applicable in any context where the interest lies in estimating and interpreting the dynamics of the hazard function."}, "https://arxiv.org/abs/2308.10583": {"title": "The Multivariate Bernoulli detector: Change point estimation in discrete survival analysis", "link": "https://arxiv.org/abs/2308.10583", "description": "arXiv:2308.10583v2 Announce Type: replace \nAbstract: Time-to-event data are often recorded on a discrete scale with multiple, competing risks as potential causes for the event. In this context, application of continuous survival analysis methods with a single risk suffers from biased estimation. Therefore, we propose the Multivariate Bernoulli detector for competing risks with discrete times involving a multivariate change point model on the cause-specific baseline hazards. Through the prior on the number of change points and their location, we impose dependence between change points across risks, while also allowing for data-driven learning of their number. Then, conditionally on these change points, a Multivariate Bernoulli prior is used to infer which risks are involved. The focus of posterior inference is on cause-specific hazard rates and dependence across risks. Such dependence is often present due to subject-specific changes across time that affect all risks. Full posterior inference is performed through a tailored local-global Markov chain Monte Carlo (MCMC) algorithm, which exploits a data augmentation trick and MCMC updates from non-conjugate Bayesian nonparametric methods. 
We illustrate our model in simulations and on ICU data, comparing its performance with existing approaches."}, "https://arxiv.org/abs/2310.12402": {"title": "Sufficient dimension reduction for regression with metric space-valued responses", "link": "https://arxiv.org/abs/2310.12402", "description": "arXiv:2310.12402v2 Announce Type: replace \nAbstract: Data visualization and dimension reduction for regression between a general metric space-valued response and Euclidean predictors is proposed. Current Fr\\'ech\\'et dimension reduction methods require that the response metric space be continuously embeddable into a Hilbert space, which imposes restriction on the type of metric and kernel choice. We relax this assumption by proposing a Euclidean embedding technique which avoids the use of kernels. Under this framework, classical dimension reduction methods such as ordinary least squares and sliced inverse regression are extended. An extensive simulation experiment demonstrates the superior performance of the proposed method on synthetic data compared to existing methods where applicable. The real data analysis of factors influencing the distribution of COVID-19 transmission in the U.S. and the association between BMI and structural brain connectivity of healthy individuals are also investigated."}, "https://arxiv.org/abs/2310.12427": {"title": "Fast Power Curve Approximation for Posterior Analyses", "link": "https://arxiv.org/abs/2310.12427", "description": "arXiv:2310.12427v2 Announce Type: replace \nAbstract: Bayesian hypothesis tests leverage posterior probabilities, Bayes factors, or credible intervals to inform data-driven decision making. We propose a framework for power curve approximation with such hypothesis tests. We present a fast approach to explore the approximate sampling distribution of posterior probabilities when the conditions for the Bernstein-von Mises theorem are satisfied. We extend that approach to consider segments of such sampling distributions in a targeted manner for each sample size explored. These sampling distribution segments are used to construct power curves for various types of posterior analyses. Our resulting method for power curve approximation is orders of magnitude faster than conventional power curve estimation for Bayesian hypothesis tests. We also prove the consistency of the corresponding power estimates and sample size recommendations under certain conditions."}, "https://arxiv.org/abs/2311.02543": {"title": "Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling", "link": "https://arxiv.org/abs/2311.02543", "description": "arXiv:2311.02543v2 Announce Type: replace \nAbstract: This paper discusses estimation and limited information goodness-of-fit test statistics in factor models for binary data using pairwise likelihood estimation and sampling weights. The paper extends the applicability of pairwise likelihood estimation for factor models with binary data to accommodate complex sampling designs. Additionally, it introduces two key limited information test statistics: the Pearson chi-squared test and the Wald test. To enhance computational efficiency, the paper introduces modifications to both test statistics. 
The performance of the estimation and the proposed test statistics under simple random sampling and unequal probability sampling is evaluated using simulated data."}, "https://arxiv.org/abs/2311.13017": {"title": "W-kernel and essential subspace for frequencist evaluation of Bayesian estimators", "link": "https://arxiv.org/abs/2311.13017", "description": "arXiv:2311.13017v2 Announce Type: replace \nAbstract: The posterior covariance matrix W defined by the log-likelihood of each observation plays important roles both in the sensitivity analysis and the frequencist evaluation of Bayesian estimators. This study is focused on the matrix W and its principal space; we term the latter the essential subspace. A key tool for treating frequencist properties is the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)). The matrix W can be interpreted as a reproducing kernel and is denoted as the W-kernel. Using the W-kernel, the essential subspace is expressed as a principal space given by kernel principal component analysis. A relation to the Fisher kernel and neural tangent kernel is established, which elucidates the connection to the classical asymptotic theory. We also discuss a type of Bayesian-frequencist duality, which arises naturally from the kernel framework. Finally, two applications are discussed: the selection of a representative set of observations and dimensional reduction in the approximate bootstrap. In the former, incomplete Cholesky decomposition is introduced as an efficient method for computing the essential subspace. In the latter, different implementations of the approximate bootstrap for posterior means are compared."}, "https://arxiv.org/abs/2312.06098": {"title": "Mixture Matrix-valued Autoregressive Model", "link": "https://arxiv.org/abs/2312.06098", "description": "arXiv:2312.06098v2 Announce Type: replace \nAbstract: Time series of matrix-valued data are increasingly available in various areas including economics, finance, social science, etc. These data may shed light on the inter-dynamical relationships between two sets of attributes, for instance countries and economic indices. The matrix autoregressive (MAR) model provides a parsimonious approach for analyzing such data. However, the MAR model, being a linear model with parametric constraints, cannot capture the nonlinear patterns in the data, such as regime shifts in the dynamics. We propose a mixture matrix autoregressive (MMAR) model for analyzing potential regime shifts in the dynamics between two attributes, for instance, due to recession vs. blooming, or quiet period vs. pandemic. We propose an EM algorithm for maximum likelihood estimation. We derive some theoretical properties of the proposed method including consistency and asymptotic distribution, and illustrate its performance via simulations and real applications."}, "https://arxiv.org/abs/2312.10596": {"title": "A maximin optimal approach for sampling designs in two-phase studies", "link": "https://arxiv.org/abs/2312.10596", "description": "arXiv:2312.10596v2 Announce Type: replace \nAbstract: Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often employed to save data collection costs. In two-phase studies, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a subset of subjects in the second phase based on a predetermined sampling rule. 
The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on designing sampling rules for estimating a scalar parameter in some parametric models or specific estimating problems. However, real-world scenarios are usually model-unknown and involve two-phase designs for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method is model-free and applicable to general estimating problems. The resulting sampling rule can minimize the semiparametric efficiency bound when the parameter is scalar and improve the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. The implementation of the proposed design is illustrated in a real data analysis."}, "https://arxiv.org/abs/2007.03069": {"title": "Outcome-Driven Dynamic Refugee Assignment with Allocation Balancing", "link": "https://arxiv.org/abs/2007.03069", "description": "arXiv:2007.03069v5 Announce Type: replace-cross \nAbstract: This study proposes two new dynamic assignment algorithms to match refugees and asylum seekers to geographic localities within a host country. The first, currently implemented in a multi-year randomized control trial in Switzerland, seeks to maximize the average predicted employment level (or any measured outcome of interest) of refugees through a minimum-discord online assignment algorithm. The performance of this algorithm is tested on real refugee resettlement data from both the US and Switzerland, where we find that it is able to achieve near-optimal expected employment compared to the hindsight-optimal solution, and is able to improve upon the status quo procedure by 40-50%. However, pure outcome maximization can result in a periodically imbalanced allocation to the localities over time, leading to implementation difficulties and an undesirable workflow for resettlement resources and agents. To address these problems, the second algorithm balances the goal of improving refugee outcomes with the desire for an even allocation over time. We find that this algorithm can achieve near-perfect balance over time with only a small loss in expected employment compared to the employment-maximizing algorithm. In addition, the allocation balancing algorithm offers a number of ancillary benefits compared to pure outcome maximization, including robustness to unknown arrival flows and greater exploration."}, "https://arxiv.org/abs/2012.02985": {"title": "Selecting the number of components in PCA via random signflips", "link": "https://arxiv.org/abs/2012.02985", "description": "arXiv:2012.02985v3 Announce Type: replace-cross \nAbstract: Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, it turns out that these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. 
This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of \"empirical null\" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an illustration on data from astronomy."}, "https://arxiv.org/abs/2311.01412": {"title": "Causal Temporal Regime Structure Learning", "link": "https://arxiv.org/abs/2311.01412", "description": "arXiv:2311.01412v2 Announce Type: replace-cross \nAbstract: We address the challenge of structure learning from multivariate time series that are characterized by a sequence of different, unknown regimes. We introduce a new optimization-based method (CASTOR) that concurrently learns the Directed Acyclic Graph (DAG) for each regime and determines the number of regimes along with their sequential arrangement. Through the optimization of a score function via an expectation maximization (EM) algorithm, CASTOR alternates between learning the regime indices (Expectation step) and inferring causal relationships in each regime (Maximization step). We further prove the identifiability of regimes and DAGs within the CASTOR framework. We conduct extensive experiments and show that our method consistently outperforms causal discovery models across various settings (linear and nonlinear causal relationships) and datasets (synthetic and real data)."}, "https://arxiv.org/abs/2405.17591": {"title": "Individualized Dynamic Mediation Analysis Using Latent Factor Models", "link": "https://arxiv.org/abs/2405.17591", "description": "arXiv:2405.17591v1 Announce Type: new \nAbstract: Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analyses assume that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity in many real-world applications. Additionally, the presence of unobserved confounding variables poses a significant challenge to inferring both causal and mediation effects. To address these issues, we propose an individualized dynamic mediation analysis method. Our approach can identify the significant mediators at the population level while capturing the time-varying and heterogeneous mediation effects via latent factor modeling on the coefficients of structural equation models. Another advantage of our method is that we can infer individualized mediation effects in the presence of unmeasured time-varying confounders. We provide estimation consistency for our proposed causal estimand and selection consistency for significant mediators. 
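The signflip parallel analysis recipe above (arXiv:2012.02985) is concrete enough to sketch. The following is an illustrative simplification, not the calibrated FlipPA procedure: it forms "empirical null" matrices by flipping the sign of each entry with probability one-half and keeps the leading components whose singular values exceed an upper quantile of the corresponding null singular values.

```python
# Minimal sketch of signflip parallel analysis for choosing the number of PCA
# components: compare data singular values to those of "empirical null"
# matrices obtained by flipping the sign of each entry with probability 1/2.
# An illustrative simplification of the idea in arXiv:2012.02985, not the
# paper's calibrated FlipPA procedure.
import numpy as np

def signflip_parallel_analysis(X, n_null=100, quantile=0.95, rng=None):
    rng = np.random.default_rng(rng)
    data_sv = np.linalg.svd(X, compute_uv=False)
    null_sv = np.empty((n_null, data_sv.size))
    for b in range(n_null):
        flips = rng.choice([-1.0, 1.0], size=X.shape)       # random sign flips
        null_sv[b] = np.linalg.svd(X * flips, compute_uv=False)
    thresholds = np.quantile(null_sv, quantile, axis=0)      # per-index null quantiles
    # Keep leading components whose singular values rise above the null.
    k = 0
    while k < data_sv.size and data_sv[k] > thresholds[k]:
        k += 1
    return k

# Example: rank-2 signal plus column-wise heterogeneous noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 50)) * 3
noise = rng.normal(size=(300, 50)) * rng.uniform(0.5, 1.5, size=(1, 50))
print(signflip_parallel_analysis(signal + noise, rng=1))
```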
Extensive simulation studies and an application to a DNA methylation study demonstrate the effectiveness and advantages of our method."}, "https://arxiv.org/abs/2405.17669": {"title": "Bayesian Nonparametrics for Principal Stratification with Continuous Post-Treatment Variables", "link": "https://arxiv.org/abs/2405.17669", "description": "arXiv:2405.17669v1 Announce Type: new \nAbstract: Principal stratification provides a causal inference framework that allows adjustment for confounded post-treatment variables when comparing treatments. Although the literature has focused mainly on binary post-treatment variables, there is a growing interest in principal stratification involving continuous post-treatment variables. However, characterizing the latent principal strata with a continuous post-treatment presents a significant challenge, which is further complicated in observational studies where the treatment is not randomized. In this paper, we introduce the Confounders-Aware SHared atoms BAyesian mixture (CASBAH), a novel approach for principal stratification with continuous post-treatment variables that can be directly applied to observational studies. CASBAH leverages a dependent Dirichlet process, utilizing shared atoms across treatment levels, to effectively control for measured confounders and facilitate information sharing between treatment groups in the identification of principal strata membership. CASBAH also offers a comprehensive quantification of uncertainty surrounding the membership of the principal strata. Through Monte Carlo simulations, we show that the proposed methodology has excellent performance in characterizing the latent principal strata and estimating the effects of treatment on post-treatment variables and outcomes. Finally, CASBAH is applied to a case study in which we estimate the causal effects of US national air quality regulations on pollution levels and health outcomes."}, "https://arxiv.org/abs/2405.17684": {"title": "ZIKQ: An innovative centile chart method for utilizing natural history data in rare disease clinical development", "link": "https://arxiv.org/abs/2405.17684", "description": "arXiv:2405.17684v1 Announce Type: new \nAbstract: Utilizing natural history data as an external control plays an important role in the clinical development of rare diseases, since placebo groups in double-blind randomized trials may not be available due to ethical reasons and low disease prevalence. This article proposes an innovative approach for utilizing natural history data to support rare disease clinical development by constructing reference centile charts. Due to the deteriorating nature of certain rare diseases, the distributions of clinical endpoints can be age-dependent and have an absorbing state of zero, which can result in censored natural history data. Existing methods for reference centile charts cannot be directly applied to censored natural history data. Therefore, we propose a new calibrated zero-inflated kernel quantile (ZIKQ) estimation method to construct reference centile charts from censored natural history data. 
Using the application to Duchenne Muscular Dystrophy drug development, we demonstrate that the reference centile charts using the ZIKQ method can be implemented to evaluate treatment efficacy and facilitate a more targeted patient enrollment in rare disease clinical development."}, "https://arxiv.org/abs/2405.17707": {"title": "The Multiplex $p_2$ Model: Mixed-Effects Modeling for Multiplex Social Networks", "link": "https://arxiv.org/abs/2405.17707", "description": "arXiv:2405.17707v1 Announce Type: new \nAbstract: Social actors are often embedded in multiple social networks, and there is a growing interest in studying social systems from a multiplex network perspective. In this paper, we propose a mixed-effects model for cross-sectional multiplex network data that assumes dyads to be conditionally independent. Building on the uniplex $p_2$ model, we incorporate dependencies between different network layers via cross-layer dyadic effects and actor random effects. These cross-layer effects model the tendencies for ties between two actors and the ties to and from the same actor to be dependent across different relational dimensions. The model can also study the effect of actor and dyad covariates. As simulation-based goodness-of-fit analyses are common practice in applied network studies, we here propose goodness-of-fit measures for multiplex network analyses. We evaluate our choice of priors and the computational faithfulness and inferential properties of the proposed method through simulation. We illustrate the utility of the multiplex $p_2$ model in a replication study of a toxic chemical policy network. An original study that reflects on gossip as perceived by gossip senders and gossip targets, and their differences in perspectives, based on data from 34 Hungarian elementary school classes, highlights the applicability of the proposed method."}, "https://arxiv.org/abs/2405.17744": {"title": "Factor Augmented Matrix Regression", "link": "https://arxiv.org/abs/2405.17744", "description": "arXiv:2405.17744v1 Announce Type: new \nAbstract: We introduce \\underline{F}actor-\\underline{A}ugmented \\underline{Ma}trix \\underline{R}egression (FAMAR) to address the growing applications of matrix-variate data and their associated challenges, particularly with high-dimensionality and covariate correlations. FAMAR encompasses two key algorithms. The first is a novel non-iterative approach that efficiently estimates the factors and loadings of the matrix factor model, utilizing techniques of pre-training, diverse projection, and block-wise averaging. The second algorithm offers an accelerated solution for penalized matrix factor regression. Both algorithms are supported by established statistical and numerical convergence properties. Empirical evaluations, conducted on synthetic and real economics datasets, demonstrate FAMAR's superiority in terms of accuracy, interpretability, and computational speed. Our application to economic data showcases how matrix factors can be incorporated to predict the GDPs of the countries of interest, and the influence of these factors on the GDPs."}, "https://arxiv.org/abs/2405.17787": {"title": "Dyadic Regression with Sample Selection", "link": "https://arxiv.org/abs/2405.17787", "description": "arXiv:2405.17787v1 Announce Type: new \nAbstract: This paper addresses the sample selection problem in panel dyadic regression analysis. Dyadic data often include many zeros in the main outcomes due to the underlying network formation process. 
This not only contaminates popular estimators used in practice but also complicates the inference due to the dyadic dependence structure. We extend Kyriazidou (1997)'s approach to dyadic data and characterize the asymptotic distribution of our proposed estimator. The convergence rates are $\\sqrt{n}$ or $\\sqrt{n^{2}h_{n}}$, depending on the degeneracy of the H\\'{a}jek projection part of the estimator, where $n$ is the number of nodes and $h_{n}$ is a bandwidth. We propose a bias-corrected confidence interval and a variance estimator that adapts to the degeneracy. A Monte Carlo simulation shows the good finite-sample performance of our estimator and highlights the importance of bias correction in both asymptotic regimes when the fraction of zeros in outcomes varies. We illustrate our procedure using data from Moretti and Wilson (2017)'s paper on migration."}, "https://arxiv.org/abs/2405.17828": {"title": "On Robust Clustering of Temporal Point Process", "link": "https://arxiv.org/abs/2405.17828", "description": "arXiv:2405.17828v1 Announce Type: new \nAbstract: Clustering of event stream data is of great importance in many application scenarios, including, but not limited to, e-commerce, electronic health, online testing, mobile music service, etc. Existing clustering algorithms fail to take outlier data into consideration and are implemented without theoretical guarantees. In this paper, we propose a robust temporal point process clustering framework which works under mild assumptions and meanwhile addresses several important issues in the event stream clustering problem. Specifically, we introduce a computationally efficient model-free distance function to quantify the dissimilarity between different event streams so that outliers can be detected and good initial clusters can be obtained. We further consider an expectation-maximization-type algorithm incorporating Catoni's influence function for robust estimation and fine-tuning of clusters. We also establish theoretical results, including algorithmic convergence, estimation error bounds, and outlier detection. Simulation results corroborate our theoretical findings and real data applications show the effectiveness of our proposed methodology."}, "https://arxiv.org/abs/2405.17919": {"title": "Fisher's Legacy of Directional Statistics, and Beyond to Statistics on Manifolds", "link": "https://arxiv.org/abs/2405.17919", "description": "arXiv:2405.17919v1 Announce Type: new \nAbstract: It will not be an exaggeration to say that R A Fisher is the Albert Einstein of Statistics. He pioneered almost all the main branches of statistics, but it is not as well known that he opened the area of Directional Statistics with his 1953 paper introducing a distribution on the sphere which is now known as the Fisher distribution. He stressed that for spherical data one should take into account that the data lie on a manifold. We will describe this Fisher distribution and reanalyse his geological data. We also comment on the two goals he set himself in that paper, and how he reinvented the von Mises distribution on the circle. Since then, many extensions of this distribution have appeared bearing Fisher's name such as the von Mises Fisher distribution and the matrix Fisher distribution. In fact, the subject of Directional Statistics has grown tremendously in the last two decades with new applications emerging in Life Sciences, Image Analysis, Machine Learning and so on. 
We give a recent new method of constructing the Fisher type distribution which has been motivated by some problems in Machine Learning. The subject related to his distribution has evolved since then more broadly as Statistics on Manifolds which also includes the new field of Shape Analysis. We end with a historical note pointing out some correspondence between D'Arcy Thompson and R A Fisher related to Shape Analysis."}, "https://arxiv.org/abs/2405.17954": {"title": "Comparison of predictive values with paired samples", "link": "https://arxiv.org/abs/2405.17954", "description": "arXiv:2405.17954v1 Announce Type: new \nAbstract: Positive predictive value and negative predictive value are two widely used parameters to assess the clinical usefulness of a medical diagnostic test. When there are two diagnostic tests, it is recommendable to make a comparative assessment of the values of these two parameters after applying the two tests to the same subjects (paired samples). The objective is then to make individual or global inferences about the difference or the ratio of the predictive value of the two diagnostic tests. These inferences are usually based on complex and not very intuitive expressions, some of which have subsequently been reformulated. We define the two properties of symmetry which any inference method must verify - symmetry in diagnoses and symmetry in the tests -, we propose new inference methods, and we define them with simple expressions. All of the methods are compared with each other, selecting the optimal method: (a) to obtain a confidence interval for the difference or ratio; (b) to perform an individual homogeneity test of the two predictive values; and (c) to carry out a global homogeneity test of the two predictive values."}, "https://arxiv.org/abs/2405.18089": {"title": "Semi-nonparametric models of multidimensional matching: an optimal transport approach", "link": "https://arxiv.org/abs/2405.18089", "description": "arXiv:2405.18089v1 Announce Type: new \nAbstract: This paper proposes empirically tractable multidimensional matching models, focusing on worker-job matching. We generalize the parametric model proposed by Lindenlaub (2017), which relies on the assumption of joint normality of observed characteristics of workers and jobs. In our paper, we allow unrestricted distributions of characteristics and show identification of the production technology, and equilibrium wage and matching functions using tools from optimal transport theory. Given identification, we propose efficient, consistent, asymptotically normal sieve estimators. We revisit Lindenlaub's empirical application and show that, between 1990 and 2010, the U.S. economy experienced much larger technological progress favoring cognitive abilities than the original findings suggest. Furthermore, our flexible model specifications provide a significantly better fit for patterns in the evolution of wage inequality."}, "https://arxiv.org/abs/2405.18288": {"title": "Stagewise Boosting Distributional Regression", "link": "https://arxiv.org/abs/2405.18288", "description": "arXiv:2405.18288v1 Announce Type: new \nAbstract: Forward stagewise regression is a simple algorithm that can be used to estimate regularized models. The updating rule adds a small constant to a regression coefficient in each iteration, such that the underlying optimization problem is solved slowly with small improvements. 
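The stagewise updating rule quoted above (arXiv:2405.18288) can be made concrete with a small sketch. This is plain forward stagewise regression for squared-error loss with standardized covariates, an assumed baseline rather than the paper's stagewise boosting algorithm for distributional regression.

```python
# Minimal sketch of forward stagewise regression for squared-error loss:
# at each iteration, nudge the coefficient most correlated with the current
# residual by a small constant eps. Illustrates the updating rule quoted above,
# not the stagewise boosting distributional regression of arXiv:2405.18288.
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_iter=5000):
    X = (X - X.mean(0)) / X.std(0)          # standardize covariates (assumption)
    beta = np.zeros(X.shape[1])
    residual = y - y.mean()
    for _ in range(n_iter):
        corr = X.T @ residual               # covariate/residual inner products
        j = np.argmax(np.abs(corr))         # most correlated covariate
        step = eps * np.sign(corr[j])       # small constant update
        beta[j] += step
        residual -= step * X[:, j]          # slow, incremental fit
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 2.0 - X[:, 3] + rng.normal(scale=0.5, size=200)
print(np.round(forward_stagewise(X, y), 2))
```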
This is similar to gradient boosting, with the essential difference that the step size in the latter algorithm is determined by the product of the gradient and a step length parameter. One often overlooked challenge in gradient boosting for distributional regression is the issue of a vanishingly small gradient, which practically halts the algorithm's progress. We show that gradient boosting in this case oftentimes results in suboptimal models, especially for complex problems where certain distributional parameters are never updated due to the vanishing gradient. Therefore, we propose a stagewise boosting-type algorithm for distributional regression, combining stagewise regression ideas with gradient boosting. Additionally, we extend it with a novel regularization method, correlation filtering, to provide additional stability when the problem involves a large number of covariates. Furthermore, the algorithm includes best-subset selection for parameters and can be applied to big data problems by leveraging stochastic approximations of the updating steps. Besides the advantage of processing large datasets, the stochastic nature of the approximations can lead to better results, especially for complex distributions, by reducing the risk of being trapped in a local optimum. The performance of our proposed stagewise boosting distributional regression approach is investigated in an extensive simulation study and by estimating a full probabilistic model for lightning counts with data of more than 9.1 million observations and 672 covariates."}, "https://arxiv.org/abs/2405.18323": {"title": "Optimal Design in Repeated Testing for Count Data", "link": "https://arxiv.org/abs/2405.18323", "description": "arXiv:2405.18323v1 Announce Type: new \nAbstract: In this paper, we develop optimal designs for growth curve models with count data based on the Rasch Poisson-Gamma counts (RPGCM) model. This model is often used in educational and psychological testing when test results yield count data. In the RPGCM, the test scores are determined by respondents' ability and item difficulty. Locally D-optimal designs are derived for maximum quasi-likelihood estimation to efficiently estimate the mean abilities of the respondents over time. Using the log link, both unstructured, linear, and nonlinear growth curves of log mean abilities are taken into account. Finally, the sensitivity of the derived optimal designs to an imprecise choice of parameter values is analyzed using D-efficiency."}, "https://arxiv.org/abs/2405.18413": {"title": "Homophily-adjusted social influence estimation", "link": "https://arxiv.org/abs/2405.18413", "description": "arXiv:2405.18413v1 Announce Type: new \nAbstract: Homophily and social influence are two key concepts of social network analysis. Distinguishing between these phenomena is difficult, and approaches to disambiguate the two have been primarily limited to longitudinal data analyses. In this study, we provide sufficient conditions for valid estimation of social influence through cross-sectional data, leading to a novel homophily-adjusted social influence model which addresses the backdoor pathway of latent homophilic features. The oft-used network autocorrelation model (NAM) is the special case of our proposed model with no latent homophily, suggesting that the NAM is only valid when all homophilic attributes are observed. 
We conducted an extensive simulation study to evaluate the performance of our proposed homophily-adjusted model, comparing its results with those from the conventional NAM. Our findings shed light on the nuanced dynamics of social networks, presenting a valuable tool for researchers seeking to estimate the effects of social influence while accounting for homophily. Code to implement our approach is available at https://github.com/hanhtdpham/hanam."}, "https://arxiv.org/abs/2405.17640": {"title": "Probabilistically Plausible Counterfactual Explanations with Normalizing Flows", "link": "https://arxiv.org/abs/2405.17640", "description": "arXiv:2405.17640v1 Announce Type: cross \nAbstract: We present PPCEF, a novel method for generating probabilistically plausible counterfactual explanations (CFs). PPCEF advances beyond existing methods by combining a probabilistic formulation that leverages the data distribution with the optimization of plausibility within a unified framework. Compared to reference approaches, our method enforces plausibility by directly optimizing the explicit density function without assuming a particular family of parametrized distributions. This ensures CFs are not only valid (i.e., achieve class change) but also align with the underlying data's probability density. For that purpose, our approach leverages normalizing flows as powerful density estimators to capture the complex high-dimensional data distribution. Furthermore, we introduce a novel loss that balances the trade-off between achieving class change and maintaining closeness to the original instance while also incorporating a probabilistic plausibility term. PPCEF's unconstrained formulation allows for efficient gradient-based optimization with batch processing, leading to orders of magnitude faster computation compared to prior methods. Moreover, the unconstrained formulation of PPCEF allows for the seamless integration of future constraints tailored to specific counterfactual properties. Finally, extensive evaluations demonstrate PPCEF's superiority in generating high-quality, probabilistically plausible counterfactual explanations in high-dimensional tabular settings. This makes PPCEF a powerful tool for not only interpreting complex machine learning models but also for improving fairness, accountability, and trust in AI systems."}, "https://arxiv.org/abs/2405.17642": {"title": "Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels", "link": "https://arxiv.org/abs/2405.17642", "description": "arXiv:2405.17642v1 Announce Type: cross \nAbstract: Growing regulatory and societal pressures demand increased transparency in AI, particularly in understanding the decisions made by complex machine learning models. Counterfactual Explanations (CFs) have emerged as a promising technique within Explainable AI (xAI), offering insights into individual model predictions. However, to understand the systemic biases and disparate impacts of AI models, it is crucial to move beyond local CFs and embrace global explanations, which offer a holistic view across diverse scenarios and populations. Unfortunately, generating Global Counterfactual Explanations (GCEs) faces challenges in computational complexity, defining the scope of \"global,\" and ensuring the explanations are both globally representative and locally plausible. 
We introduce a novel unified approach for generating Local, Group-wise, and Global Counterfactual Explanations for differentiable classification models via gradient-based optimization to address these challenges. This framework aims to bridge the gap between individual and systemic insights, enabling a deeper understanding of model decisions and their potential impact on diverse populations. Our approach further innovates by incorporating a probabilistic plausibility criterion, enhancing actionability and trustworthiness. By offering a cohesive solution to the optimization and plausibility challenges in GCEs, our work significantly advances the interpretability and accountability of AI models, marking a step forward in the pursuit of transparent AI."}, "https://arxiv.org/abs/2405.18206": {"title": "Multi-CATE: Multi-Accurate Conditional Average Treatment Effect Estimation Robust to Unknown Covariate Shifts", "link": "https://arxiv.org/abs/2405.18206", "description": "arXiv:2405.18206v1 Announce Type: cross \nAbstract: Estimating heterogeneous treatment effects is important to tailor treatments to those individuals who would most likely benefit. However, conditional average treatment effect predictors may often be trained on one population but deployed on different, possibly unknown, populations. We use methodology for learning multi-accurate predictors to post-process CATE T-learners (differenced regressions) to become robust to unknown covariate shifts at the time of deployment. The method works in general for pseudo-outcome regression, such as the DR-learner. We show how this approach can combine (large) confounded observational and (smaller) randomized datasets by learning a confounded predictor from the observational dataset, and auditing for multi-accuracy on the randomized controlled trial. We show improvements in bias and mean squared error in simulations with increasingly larger covariate shift, and on a semi-synthetic case study of a parallel large observational study and smaller randomized controlled experiment. Overall, we establish a connection between methods developed for multi-distribution learning and achieve appealing desiderata (e.g. external validity) in causal inference and machine learning."}, "https://arxiv.org/abs/2405.18379": {"title": "A Note on the Prediction-Powered Bootstrap", "link": "https://arxiv.org/abs/2405.18379", "description": "arXiv:2405.18379v1 Announce Type: cross \nAbstract: We introduce PPBoot: a bootstrap-based method for prediction-powered inference. PPBoot is applicable to arbitrary estimation problems and is very simple to implement, essentially only requiring one application of the bootstrap. Through a series of examples, we demonstrate that PPBoot often performs nearly identically to (and sometimes better than) the earlier PPI(++) method based on asymptotic normality, when the latter is applicable, without requiring any asymptotic characterizations. 
Given its versatility, PPBoot could simplify and expand the scope of application of prediction-powered inference to problems where central limit theorems are hard to prove."}, "https://arxiv.org/abs/2405.18412": {"title": "Tensor Methods in High Dimensional Data Analysis: Opportunities and Challenges", "link": "https://arxiv.org/abs/2405.18412", "description": "arXiv:2405.18412v1 Announce Type: cross \nAbstract: Large amount of multidimensional data represented by multiway arrays or tensors are prevalent in modern applications across various fields such as chemometrics, genomics, physics, psychology, and signal processing. The structural complexity of such data provides vast new opportunities for modeling and analysis, but efficiently extracting information content from them, both statistically and computationally, presents unique and fundamental challenges. Addressing these challenges requires an interdisciplinary approach that brings together tools and insights from statistics, optimization and numerical linear algebra among other fields. Despite these hurdles, significant progress has been made in the last decade. This review seeks to examine some of the key advancements and identify common threads among them, under eight different statistical settings."}, "https://arxiv.org/abs/2008.09263": {"title": "Empirical Likelihood Covariate Adjustment for Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2008.09263", "description": "arXiv:2008.09263v3 Announce Type: replace \nAbstract: This paper proposes a versatile covariate adjustment method that directly incorporates covariate balance in regression discontinuity (RD) designs. The new empirical entropy balancing method reweights the standard local polynomial RD estimator by using the entropy balancing weights that minimize the Kullback--Leibler divergence from the uniform weights while satisfying the covariate balance constraints. Our estimator can be formulated as an empirical likelihood estimator that efficiently incorporates the information from the covariate balance condition as correctly specified over-identifying moment restrictions, and thus has an asymptotic variance no larger than that of the standard estimator without covariates. We demystify the asymptotic efficiency gain of Calonico, Cattaneo, Farrell, and Titiunik (2019)'s regression-based covariate-adjusted estimator, as their estimator has the same asymptotic variance as ours. Further efficiency improvement from balancing over sieve spaces is possible if our entropy balancing weights are computed using stronger covariate balance constraints that are imposed on functions of covariates. We then show that our method enjoys favorable second-order properties from empirical likelihood estimation and inference: the estimator has a small (bounded) nonlinearity bias, and the likelihood ratio based confidence set admits a simple analytical correction that can be used to improve coverage accuracy. The coverage accuracy of our confidence set is robust against slight perturbation to the covariate balance condition, which may happen in cases such as data contamination and misspecified \"unaffected\" outcomes used as covariates. 
The proposed entropy balancing approach for covariate adjustment is applicable to other RD-related settings."}, "https://arxiv.org/abs/2012.02601": {"title": "Asymmetric uncertainty : Nowcasting using skewness in real-time data", "link": "https://arxiv.org/abs/2012.02601", "description": "arXiv:2012.02601v4 Announce Type: replace \nAbstract: This paper presents a new way to account for downside and upside risks when producing density nowcasts of GDP growth. The approach relies on modelling location, scale and shape common factors in real-time macroeconomic data. While movements in the location generate shifts in the central part of the predictive density, the scale controls its dispersion (akin to general uncertainty) and the shape its asymmetry, or skewness (akin to downside and upside risks). The empirical application is centred on US GDP growth and the real-time data come from Fred-MD. The results show that there is more to real-time data than their levels or means: their dispersion and asymmetry provide valuable information for nowcasting economic activity. Scale and shape common factors (i) yield more reliable measures of uncertainty and (ii) improve precision when macroeconomic uncertainty is at its peak."}, "https://arxiv.org/abs/2203.04080": {"title": "On Robust Inference in Time Series Regression", "link": "https://arxiv.org/abs/2203.04080", "description": "arXiv:2203.04080v3 Announce Type: replace \nAbstract: Least squares regression with heteroskedasticity consistent standard errors (\"OLS-HC regression\") has proved very useful in cross section environments. However, several major difficulties, which are generally overlooked, must be confronted when transferring the HC technology to time series environments via heteroskedasticity and autocorrelation consistent standard errors (\"OLS-HAC regression\"). First, in plausible time-series environments, OLS parameter estimates can be inconsistent, so that OLS-HAC inference fails even asymptotically. Second, most economic time series have autocorrelation, which renders OLS parameter estimates inefficient. Third, autocorrelation similarly renders conditional predictions based on OLS parameter estimates inefficient. Finally, the structure of popular HAC covariance matrix estimators is ill-suited for capturing the autoregressive autocorrelation typically present in economic time series, which produces large size distortions and reduced power in HAC-based hypothesis testing, in all but the largest samples. We show that all four problems are largely avoided by the use of a simple and easily-implemented dynamic regression procedure, which we call DURBIN. We demonstrate the advantages of DURBIN with detailed simulations covering a range of practical issues."}, "https://arxiv.org/abs/2210.04146": {"title": "Inference in parametric models with many L-moments", "link": "https://arxiv.org/abs/2210.04146", "description": "arXiv:2210.04146v3 Announce Type: replace \nAbstract: L-moments are expected values of linear combinations of order statistics that provide robust alternatives to traditional moments. The estimation of parametric models by matching sample L-moments has been shown to outperform maximum likelihood estimation (MLE) in small samples from popular distributions. The choice of the number of L-moments to be used in estimation remains ad-hoc, though: researchers typically set the number of L-moments equal to the number of parameters, as to achieve an order condition for identification. 
This approach is generally inefficient in larger samples. In this paper, we show that, by properly choosing the number of L-moments and weighting these accordingly, we are able to construct an estimator that outperforms MLE in finite samples, and yet does not suffer from efficiency losses asymptotically. We do so by considering a \"generalised\" method of L-moments estimator and deriving its asymptotic properties in a framework where the number of L-moments varies with sample size. We then propose methods to automatically select the number of L-moments in a given sample. Monte Carlo evidence shows our proposed approach is able to outperform (in a mean-squared error sense) MLE in smaller samples, whilst working as well as it in larger samples. We then consider extensions of our approach to conditional and semiparametric models, and apply the latter to study expenditure patterns in a ridesharing platform in Brazil."}, "https://arxiv.org/abs/2302.09526": {"title": "Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators", "link": "https://arxiv.org/abs/2302.09526", "description": "arXiv:2302.09526v3 Announce Type: replace \nAbstract: We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLM) and linear interpolators classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is invariably beneficial to integrate the unlabeled data with some nonzero mixing ratio $\\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $\\alpha^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand.\n The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks."}, "https://arxiv.org/abs/2309.06668": {"title": "On the uses and abuses of regression models: a call for reform of statistical practice and teaching", "link": "https://arxiv.org/abs/2309.06668", "description": "arXiv:2309.06668v2 Announce Type: replace \nAbstract: Regression methods dominate the practice of biostatistical analysis, but biostatistical training emphasises the details of regression models and methods ahead of the purposes for which such modelling might be useful. More broadly, statistics is widely understood to provide a body of techniques for \"modelling data\", underpinned by what we describe as the \"true model myth\": that the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. 
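Since the L-moment matching approach discussed above (arXiv:2210.04146) starts from explicit sample quantities, a small worked example may be useful. The sketch below computes the first two sample L-moments from the order statistics and matches them to a normal distribution, for which lambda_1 = mu and lambda_2 = sigma / sqrt(pi); the paper's generalised, weighted estimator with a sample-size-dependent number of L-moments is well beyond this illustration.

```python
# Minimal sketch of method-of-L-moments estimation: compute the first two
# sample L-moments from the order statistics and match them to a normal
# distribution, for which lambda_1 = mu and lambda_2 = sigma / sqrt(pi).
# This is the basic "number of L-moments = number of parameters" recipe, not
# the generalised weighted estimator of arXiv:2210.04146.
import numpy as np

def sample_l_moments_12(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    b0 = x.mean()                             # probability-weighted moment b_0
    b1 = np.sum((i - 1) / (n - 1) * x) / n    # probability-weighted moment b_1
    return b0, 2.0 * b1 - b0                  # l_1 = b_0, l_2 = 2 b_1 - b_0

def fit_normal_by_l_moments(x):
    l1, l2 = sample_l_moments_12(x)
    mu_hat = l1
    sigma_hat = l2 * np.sqrt(np.pi)           # invert lambda_2 = sigma / sqrt(pi)
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
print(fit_normal_by_l_moments(rng.normal(loc=2.0, scale=3.0, size=500)))
```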
By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective has led to a range of problems in the application of regression methods, including misguided \"adjustment\" for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline a new approach to the teaching and application of biostatistical methods, which situates them within a framework that first requires clear definition of the substantive research question at hand within one of three categories: descriptive, predictive, or causal. Within this approach, the simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models as well as other advanced biostatistical methods should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of \"input\" variables, but their conceptualisation and usage should follow from the purpose at hand."}, "https://arxiv.org/abs/2311.01217": {"title": "The learning effects of subsidies to bundled goods: a semiparametric approach", "link": "https://arxiv.org/abs/2311.01217", "description": "arXiv:2311.01217v3 Announce Type: replace \nAbstract: Can temporary subsidies to bundles induce long-run changes in demand due to learning about the quality of one of the constituent goods? This paper provides theoretical support and empirical evidence on this mechanism. Theoretically, we introduce a model where an agent learns about the quality of an innovation through repeated consumption. We then assess the predictions of our theory in a randomised experiment in a ridesharing platform. The experiment subsidised car trips integrating with a train or metro station, which we interpret as a bundle. Given the heavy-tailed nature of our data, we propose a semiparametric specification for treatment effects that enables the construction of more efficient estimators. We then introduce an efficient estimator for our specification by relying on L-moments. Our results indicate that a ten-weekday 50\\% discount on integrated trips leads to a large contemporaneous increase in the demand for integration, and, consistent with our model, persistent changes in the mean and dispersion of nonintegrated app rides. These effects last for over four months. A calibration of our theoretical model suggests that around 40\\% of the contemporaneous increase in integrated rides may be attributable to increased incentives to learning. Our results have nontrivial policy implications for the design of public transit systems."}, "https://arxiv.org/abs/2401.10592": {"title": "Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights", "link": "https://arxiv.org/abs/2401.10592", "description": "arXiv:2401.10592v2 Announce Type: replace \nAbstract: Randomized controlled clinical trials provide the gold standard for evidence generation in relation to the efficacy of a new treatment in medical research. Relevant information from previous studies may be desirable to incorporate in the design and analysis of a new trial, with the Bayesian paradigm providing a coherent framework to formally incorporate prior knowledge. 
Many established methods involve the use of a discounting factor, sometimes related to a measure of `similarity' between historical and the new trials. However, it is often the case that the sample size is highly nonlinear in those discounting factors. This hinders communication with subject-matter experts to elicit sensible values for borrowing strength at the trial design stage. Focusing on a commensurate predictive prior method that can incorporate historical data from multiple sources, we highlight a particular issue of nonmonotonicity and explain why this causes issues with interpretability of the discounting factors (hereafter referred to as `weights'). We propose a solution for this, from which an analytical sample size formula is derived. We then propose a linearization technique such that the sample size changes uniformly over the weights. Our approach leads to interpretable weights that represent the probability that historical data are (ir)relevant to the new trial, and could therefore facilitate easier elicitation of expert opinion on their values.\n Keywords: Bayesian sample size determination; Commensurate priors; Historical borrowing; Prior aggregation; Uniform shrinkage."}, "https://arxiv.org/abs/2207.09054": {"title": "Towards a Low-SWaP 1024-beam Digital Array: A 32-beam Sub-system at 5", "link": "https://arxiv.org/abs/2207.09054", "description": "arXiv:2207.09054v2 Announce Type: replace-cross \nAbstract: Millimeter wave communications require multibeam beamforming in order to utilize wireless channels that suffer from obstructions, path loss, and multi-path effects. Digital multibeam beamforming has maximum degrees of freedom compared to analog phased arrays. However, circuit complexity and power consumption are important constraints for digital multibeam systems. A low-complexity digital computing architecture is proposed for a multiplication-free 32-point linear transform that approximates multiple simultaneous RF beams similar to a discrete Fourier transform (DFT). Arithmetic complexity due to multiplication is reduced from the FFT complexity of $\\mathcal{O}(N\\: \\log N)$ for DFT realizations, down to zero, thus yielding a 46% and 55% reduction in chip area and dynamic power consumption, respectively, for the $N=32$ case considered. The paper describes the proposed 32-point DFT approximation targeting a 1024-beams using a 2D array, and shows the multiplierless approximation and its mapping to a 32-beam sub-system consisting of 5.8 GHz antennas that can be used for generating 1024 digital beams without multiplications. Real-time beam computation is achieved using a Xilinx FPGA at 120 MHz bandwidth per beam. Theoretical beam performance is compared with measured RF patterns from both a fixed-point FFT as well as the proposed multiplier-free algorithm and are in good agreement."}, "https://arxiv.org/abs/2303.15029": {"title": "Random measure priors in Bayesian recovery from sketches", "link": "https://arxiv.org/abs/2303.15029", "description": "arXiv:2303.15029v2 Announce Type: replace-cross \nAbstract: This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. 
This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general ``traits'' setting, where each data point has integer levels of association with multiple symbols, typically referred to as ``traits''. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions."}, "https://arxiv.org/abs/2405.18531": {"title": "Difference-in-Discontinuities: Estimation, Inference and Validity Tests", "link": "https://arxiv.org/abs/2405.18531", "description": "arXiv:2405.18531v1 Announce Type: new \nAbstract: This paper investigates the econometric theory behind the newly developed difference-in-discontinuities design (DiDC). Despite its increasing use in applied research, there are currently few studies of its properties. The method combines elements of regression discontinuity (RDD) and difference-in-differences (DiD) designs, allowing researchers to eliminate the effects of potential confounders at the discontinuity. We formalize the difference-in-discontinuity theory by stating the identification assumptions and proposing a nonparametric estimator, deriving its asymptotic properties and examining the scenarios in which the DiDC has desirable bias properties when compared to the standard RDD. We also provide comprehensive tests for one of the identification assumptions of the DiDC. Monte Carlo simulation studies show that the estimators have good performance in finite samples. Finally, we revisit Grembi et al. (2016), which studies the effects of relaxing fiscal rules on public finance outcomes in Italian municipalities. The results show that the proposed estimator exhibits substantially smaller confidence intervals for the estimated effects."}, "https://arxiv.org/abs/2405.18597": {"title": "Causal inference in the closed-loop: marginal structural models for sequential excursion effects", "link": "https://arxiv.org/abs/2405.18597", "description": "arXiv:2405.18597v1 Announce Type: new \nAbstract: Optogenetics is widely used to study the effects of neural circuit manipulation on behavior. However, the paucity of causal inference methodological work on this topic has resulted in analysis conventions that discard information and constrain the scientific questions that can be posed. To fill this gap, we introduce a nonparametric causal inference framework for analyzing \"closed-loop\" designs, which use dynamic policies that assign treatment based on covariates. In this setting, standard methods can introduce bias and occlude causal effects. Building on the sequentially randomized experiments literature in causal inference, our approach extends history-restricted marginal structural models for dynamic regimes. 
In practice, our framework can identify a wide range of causal effects of optogenetics on trial-by-trial behavior, such as, fast/slow-acting, dose-response, additive/antagonistic, and floor/ceiling. Importantly, it does so without requiring negative controls, and can estimate how causal effect magnitudes evolve across time points. From another view, our work extends \"excursion effect\" methods--popular in the mobile health literature--to enable estimation of causal contrasts for treatment sequences greater than length one, in the presence of positivity violations. We derive rigorous statistical guarantees, enabling hypothesis testing of these causal effects. We demonstrate our approach on data from a recent study of dopaminergic activity on learning, and show how our method reveals relevant effects obscured in standard analyses."}, "https://arxiv.org/abs/2405.18722": {"title": "Adaptive and Efficient Learning with Blockwise Missing and Semi-Supervised Data", "link": "https://arxiv.org/abs/2405.18722", "description": "arXiv:2405.18722v1 Announce Type: new \nAbstract: Data fusion is an important way to realize powerful and generalizable analyses across multiple sources. However, different capability of data collection across the sources has become a prominent issue in practice. This could result in the blockwise missingness (BM) of covariates troublesome for integration. Meanwhile, the high cost of obtaining gold-standard labels can cause the missingness of response on a large proportion of samples, known as the semi-supervised (SS) problem. In this paper, we consider a challenging scenario confronting both the BM and SS issues, and propose a novel Data-adaptive projecting Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE). Starting with a complete-data-only estimator, it involves two successive projection steps to reduce its variance without incurring bias. Compared to existing approaches, DEFUSE achieves a two-fold improvement. First, it leverages the BM labeled sample more efficiently through a novel data-adaptive projection approach robust to model misspecification on the missing covariates, leading to better variance reduction. Second, our method further incorporates the large unlabeled sample to enhance the estimation efficiency through imputation and projection. Compared to the previous SS setting with complete covariates, our work reveals a more essential role of the unlabeled sample in the BM setting. These advantages are justified in asymptotic and simulation studies. We also apply DEFUSE for the risk modeling and inference of heart diseases with the MIMIC-III electronic medical record (EMR) data."}, "https://arxiv.org/abs/2405.18836": {"title": "Do Finetti: On Causal Effects for Exchangeable Data", "link": "https://arxiv.org/abs/2405.18836", "description": "arXiv:2405.18836v1 Announce Type: new \nAbstract: We study causal effect estimation in a setting where the data are not i.i.d. (independent and identically distributed). We focus on exchangeable data satisfying an assumption of independent causal mechanisms. Traditional causal effect estimation frameworks, e.g., relying on structural causal models and do-calculus, are typically limited to i.i.d. data and do not extend to more general exchangeable generative processes, which naturally arise in multi-environment data. 
To address this gap, we develop a generalized framework for exchangeable data and introduce a truncated factorization formula that facilitates both the identification and estimation of causal effects in our setting. To illustrate potential applications, we introduce a causal P\\'olya urn model and demonstrate how intervention propagates effects in exchangeable data settings. Finally, we develop an algorithm that performs simultaneous causal discovery and effect estimation given multi-environment data."}, "https://arxiv.org/abs/2405.18856": {"title": "Inference under covariate-adaptive randomization with many strata", "link": "https://arxiv.org/abs/2405.18856", "description": "arXiv:2405.18856v1 Announce Type: new \nAbstract: Covariate-adaptive randomization is widely employed to balance baseline covariates in interventional studies such as clinical trials and experiments in development economics. Recent years have witnessed substantial progress in inference under covariate-adaptive randomization with a fixed number of strata. However, concerns have been raised about the impact of a large number of strata on its design and analysis, which is a common scenario in practice, such as in multicenter randomized clinical trials. In this paper, we propose a general framework for inference under covariate-adaptive randomization, which extends the seminal works of Bugni et al. (2018, 2019) by allowing for a diverging number of strata. Furthermore, we introduce a novel weighted regression adjustment that ensures efficiency improvement. On top of establishing the asymptotic theory, practical algorithms for handling situations involving an extremely large number of strata are also developed. Moreover, by linking design balance and inference robustness, we highlight the advantages of stratified block randomization, which enforces better covariate balance within strata compared to simple randomization. This paper offers a comprehensive landscape of inference under covariate-adaptive randomization, spanning from fixed to diverging to extremely large numbers of strata."}, "https://arxiv.org/abs/2405.18873": {"title": "A Return to Biased Nets: New Specifications and Approximate Bayesian Inference", "link": "https://arxiv.org/abs/2405.18873", "description": "arXiv:2405.18873v1 Announce Type: new \nAbstract: The biased net paradigm was the first general and empirically tractable scheme for parameterizing complex patterns of dependence in networks, expressing deviations from uniform random graph structure in terms of latent ``bias events,'' whose realizations enhance reciprocity, transitivity, or other structural features. Subsequent developments have introduced local specifications of biased nets, which reduce the need for approximations required in early specifications based on tracing processes. Here, we show that while one such specification leads to inconsistencies, a closely related Markovian specification both evades these difficulties and can be extended to incorporate new types of effects. We introduce the notion of inhibitory bias events, with satiation as an example, which are useful for avoiding degeneracies that can arise from closure bias terms. Although our approach does not lead to a computable likelihood, we provide a strategy for approximate Bayesian inference using random forest prevision. 
We demonstrate our approach on a network of friendship ties among college students, recapitulating a relationship between the sibling bias and tie strength posited in earlier work by Fararo."}, "https://arxiv.org/abs/2405.18987": {"title": "Transmission Channel Analysis in Dynamic Models", "link": "https://arxiv.org/abs/2405.18987", "description": "arXiv:2405.18987v1 Announce Type: new \nAbstract: We propose a framework for the analysis of transmission channels in a large class of dynamic models. To this end, we formulate our approach both using graph theory and potential outcomes, which we show to be equivalent. Our method, labelled Transmission Channel Analysis (TCA), allows for the decomposition of total effects captured by impulse response functions into the effects flowing along transmission channels, thereby providing a quantitative assessment of the strength of various transmission channels. We establish that this requires no additional identification assumptions beyond the identification of the structural shock whose effects the researcher wants to decompose. Additionally, we prove that impulse response functions are sufficient statistics for the computation of transmission effects. We also demonstrate the empirical relevance of TCA for policy evaluation by decomposing the effects of various monetary policy shock measures into instantaneous implementation effects and effects that likely relate to forward guidance."}, "https://arxiv.org/abs/2405.19058": {"title": "Participation bias in the estimation of heritability and genetic correlation", "link": "https://arxiv.org/abs/2405.19058", "description": "arXiv:2405.19058v1 Announce Type: new \nAbstract: It is increasingly recognized that participation bias can pose problems for genetic studies. Recently, to overcome the challenge that genetic information of non-participants is unavailable, it is shown that by comparing the IBD (identity by descent) shared and not-shared segments among the participants, one can estimate the genetic component underlying participation. That, however, does not directly address how to adjust estimates of heritability and genetic correlation for phenotypes correlated with participation. Here, for phenotypes whose mean differences between population and sample are known, we demonstrate a way to do so by adopting a statistical framework that separates out the genetic and non-genetic correlations between participation and these phenotypes. Crucially, our method avoids making the assumption that the effect of the genetic component underlying participation is manifested entirely through these other phenotypes. Applying the method to 12 UK Biobank phenotypes, we found 8 have significant genetic correlations with participation, including body mass index, educational attainment, and smoking status. For most of these phenotypes, without adjustments, estimates of heritability and the absolute value of genetic correlation would have underestimation biases."}, "https://arxiv.org/abs/2405.19145": {"title": "L-Estimation in Instrumental Variables Regression for Censored Data in Presence of Endogeneity and Dependent Errors", "link": "https://arxiv.org/abs/2405.19145", "description": "arXiv:2405.19145v1 Announce Type: new \nAbstract: In this article, we propose L-estimators of the unknown parameters in the instrumental variables regression in the presence of censored data under endogeneity. We allow the random errors involved in the model to be dependent. 
The proposed estimation procedure is a two-stage procedure, and the large sample properties of the proposed estimators are established. The utility of the proposed methodology is demonstrated for various simulated data and a benchmark real data set."}, "https://arxiv.org/abs/2405.19231": {"title": "Covariate Shift Corrected Conditional Randomization Test", "link": "https://arxiv.org/abs/2405.19231", "description": "arXiv:2405.19231v1 Announce Type: new \nAbstract: Conditional independence tests are crucial across various disciplines in determining the independence of an outcome variable $Y$ from a treatment variable $X$, conditioning on a set of confounders $Z$. The Conditional Randomization Test (CRT) offers a powerful framework for such testing by assuming known distributions of $X \\mid Z$; it controls the Type-I error exactly, allowing for the use of flexible, black-box test statistics. In practice, testing for conditional independence often involves using data from a source population to draw conclusions about a target population. This can be challenging due to covariate shift -- differences in the distribution of $X$, $Z$, and surrogate variables, which can affect the conditional distribution of $Y \\mid X, Z$ -- rendering traditional CRT approaches invalid. To address this issue, we propose a novel Covariate Shift Corrected Pearson Chi-squared Conditional Randomization (csPCR) test. This test adapts to covariate shifts by integrating importance weights and employing the control variates method to reduce variance in the test statistics and thus enhance power. Theoretically, we establish that the csPCR test controls the Type-I error asymptotically. Empirically, through simulation studies, we demonstrate that our method not only maintains control over Type-I errors but also exhibits superior power, confirming its efficacy and practical utility in real-world scenarios where covariate shifts are prevalent. Finally, we apply our methodology to a real-world dataset to assess the impact of a COVID-19 treatment on the 90-day mortality rate among patients."}, "https://arxiv.org/abs/2405.19312": {"title": "Causal Inference for Balanced Incomplete Block Designs", "link": "https://arxiv.org/abs/2405.19312", "description": "arXiv:2405.19312v1 Announce Type: new \nAbstract: Researchers often turn to block randomization to increase the precision of their inference or due to practical considerations, such as in multi-site trials. However, if the number of treatments under consideration is large it might not be practical or even feasible to assign all treatments within each block. We develop novel inference results under the finite-population design-based framework for a natural alternative to the complete block design that does not require reducing the number of treatment arms, the balanced incomplete block design (BIBD). This includes deriving the properties of two estimators for BIBDs and proposing conservative variance estimators. To assist practitioners in understanding the trade-offs of using BIBDs over other designs, the precisions of resulting estimators are compared to standard estimators for the complete block design, the cluster-randomized design, and the completely randomized design. Simulations and a data illustration demonstrate the strengths and weaknesses of using BIBDs. 
This work highlights BIBDs as practical and currently underutilized designs."}, "https://arxiv.org/abs/2405.18459": {"title": "Probing the Information Theoretical Roots of Spatial Dependence Measures", "link": "https://arxiv.org/abs/2405.18459", "description": "arXiv:2405.18459v1 Announce Type: cross \nAbstract: Intuitively, there is a relation between measures of spatial dependence and information theoretical measures of entropy. For instance, we can provide an intuition of why spatial data is special by stating that, on average, spatial data samples contain less than expected information. Similarly, spatial data, e.g., remotely sensed imagery, that is easy to compress is also likely to show significant spatial autocorrelation. Formulating our (highly specific) core concepts of spatial information theory in the widely used language of information theory opens new perspectives on their differences and similarities and also fosters cross-disciplinary collaboration, e.g., with the broader AI/ML communities. Interestingly, however, this intuitive relation is challenging to formalize and generalize, leading prior work to rely mostly on experimental results, e.g., for describing landscape patterns. In this work, we will explore the information theoretical roots of spatial autocorrelation, more specifically Moran's I, through the lens of self-information (also known as surprisal) and provide both formal proofs and experiments."}, "https://arxiv.org/abs/2405.18518": {"title": "LSTM-COX Model: A Concise and Efficient Deep Learning Approach for Handling Recurrent Events", "link": "https://arxiv.org/abs/2405.18518", "description": "arXiv:2405.18518v1 Announce Type: cross \nAbstract: In the current field of clinical medicine, traditional methods for analyzing recurrent events have limitations when dealing with complex time-dependent data. This study combines Long Short-Term Memory networks (LSTM) with the Cox model to enhance the model's performance in analyzing recurrent events with dynamic temporal information. Compared to classical models, the LSTM-Cox model significantly improves the accuracy of extracting clinical risk features and exhibits lower Akaike Information Criterion (AIC) values, while maintaining good performance on simulated datasets. In an empirical analysis of bladder cancer recurrence data, the model successfully reduced the mean squared error during the training phase and achieved a Concordance index of up to 0.90 on the test set. Furthermore, the model effectively distinguished between high and low-risk patient groups, and the identified recurrence risk features such as the number of tumor recurrences and maximum size were consistent with other research and clinical trial results. This study not only provides a straightforward and efficient method for analyzing recurrent data and extracting features but also offers a convenient pathway for integrating deep learning techniques into clinical risk prediction systems."}, "https://arxiv.org/abs/2405.18563": {"title": "Counterfactual Explanations for Multivariate Time-Series without Training Datasets", "link": "https://arxiv.org/abs/2405.18563", "description": "arXiv:2405.18563v1 Announce Type: cross \nAbstract: Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. 
Counterfactual explanations (CFEs) have emerged as a solution, offering interpretations of opaque ML models and providing a pathway to transition from one decision to another. However, most existing CFE methods require access to the model's training dataset, few methods can handle multivariate time-series, and none can handle multivariate time-series without training datasets. These limitations can be formidable in many scenarios. In this paper, we present CFWoT, a novel reinforcement-learning-based CFE method that generates CFEs when training datasets are unavailable. CFWoT is model-agnostic and suitable for both static and multivariate time-series datasets with continuous and discrete features. Users have the flexibility to specify non-actionable, immutable, and preferred features, as well as causal constraints which CFWoT guarantees will be respected. We demonstrate the performance of CFWoT against four baselines on several datasets and find that, despite not having access to a training dataset, CFWoT finds CFEs that make significantly fewer and significantly smaller changes to the input time-series. These properties make CFEs more actionable, as the magnitude of change required to alter an outcome is vastly reduced."}, "https://arxiv.org/abs/2405.18601": {"title": "From Conformal Predictions to Confidence Regions", "link": "https://arxiv.org/abs/2405.18601", "description": "arXiv:2405.18601v1 Announce Type: cross \nAbstract: Conformal prediction methodologies have significantly advanced the quantification of uncertainties in predictive models. Yet, the construction of confidence regions for model parameters presents a notable challenge, often necessitating stringent assumptions regarding data distribution or merely providing asymptotic guarantees. We introduce a novel approach termed CCR, which employs a combination of conformal prediction intervals for the model outputs to establish confidence regions for model parameters. We present coverage guarantees that hold under minimal assumptions on the noise and are valid in the finite-sample regime. Our approach is applicable to both split conformal predictions and black-box methodologies including full or cross-conformal approaches. In the specific case of linear models, the derived confidence region manifests as the feasible set of a Mixed-Integer Linear Program (MILP), facilitating the deduction of confidence intervals for individual parameters and enabling robust optimization. We empirically compare CCR to recent advancements in challenging settings such as with heteroskedastic and non-Gaussian noise."}, "https://arxiv.org/abs/2405.18621": {"title": "Multi-Armed Bandits with Network Interference", "link": "https://arxiv.org/abs/2405.18621", "description": "arXiv:2405.18621v1 Announce Type: cross \nAbstract: Online experimentation with interference is a common challenge in modern applications such as e-commerce and adaptive clinical trials in medicine. For example, in online marketplaces, the revenue of a good depends on discounts applied to competing goods. Statistical inference with interference is widely studied in the offline setting, but far less is known about how to adaptively assign treatments to minimize regret. We address this gap by studying a multi-armed bandit (MAB) problem where a learner (e-commerce platform) sequentially assigns one of $\\mathcal{A}$ possible actions (discounts) to $N$ units (goods) over $T$ rounds to minimize regret (maximize revenue). 
Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units. With $\\mathcal{A}$ actions and $N$ units, minimizing regret is combinatorially difficult since the action space grows as $\\mathcal{A}^N$. To overcome this issue, we study a sparse network interference model, where the reward of a unit is only affected by the treatments assigned to $s$ neighboring units. We use tools from discrete Fourier analysis to develop a sparse linear representation of the unit-specific reward $r_n: [\\mathcal{A}]^N \\rightarrow \\mathbb{R} $, and propose simple, linear regression-based algorithms to minimize regret. Importantly, our algorithms achieve provably low regret both when the learner observes the interference neighborhood for all units and when it is unknown. This significantly generalizes other works on this topic which impose strict conditions on the strength of interference on a known network, and also compare regret to a markedly weaker optimal action. Empirically, we corroborate our theoretical findings via numerical simulations."}, "https://arxiv.org/abs/2405.18671": {"title": "Watermarking Counterfactual Explanations", "link": "https://arxiv.org/abs/2405.18671", "description": "arXiv:2405.18671v1 Announce Type: cross \nAbstract: The field of Explainable Artificial Intelligence (XAI) focuses on techniques for providing explanations to end-users about the decision-making processes that underlie modern-day machine learning (ML) models. Within the vast universe of XAI techniques, counterfactual (CF) explanations are often preferred by end-users as they help explain the predictions of ML models by providing an easy-to-understand & actionable recourse (or contrastive) case to individual end-users who are adversely impacted by predicted outcomes. However, recent studies have shown significant security concerns with using CF explanations in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on proprietary ML models. In this paper, we propose a model-agnostic watermarking framework (for adding watermarks to CF explanations) that can be leveraged to detect unauthorized model extraction attacks (which rely on the watermarked CF explanations). Our novel framework solves a bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks that rely on these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme, while ensuring that these embedded watermarks do not compromise the quality of the generated CF explanations. We evaluate this framework's performance across a diverse set of real-world datasets, CF explanation methods, and model extraction techniques, and show that our watermarking detection system can be used to accurately identify extracted ML models that are trained using the watermarked CF explanations. Our work paves the way for the secure adoption of CF explanations in real-world applications."}, "https://arxiv.org/abs/2405.19225": {"title": "Synthetic Potential Outcomes for Mixtures of Treatment Effects", "link": "https://arxiv.org/abs/2405.19225", "description": "arXiv:2405.19225v1 Announce Type: cross \nAbstract: Modern data analysis frequently relies on the use of large datasets, often constructed as amalgamations of diverse populations or data-sources. 
Heterogeneity across these smaller datasets constitutes two major challenges for causal inference: (1) the source of each sample can introduce latent confounding between treatment and effect, and (2) diverse populations may respond differently to the same treatment, giving rise to heterogeneous treatment effects (HTEs). The issues of latent confounding and HTEs have been studied separately but not in conjunction. In particular, previous works only report the conditional average treatment effect (CATE) among similar individuals (with respect to the measured covariates). CATEs cannot resolve mixtures of potential treatment effects driven by latent heterogeneity, which we call mixtures of treatment effects (MTEs). Inspired by method of moment approaches to mixture models, we propose \"synthetic potential outcomes\" (SPOs). Our new approach deconfounds heterogeneity while also guaranteeing the identifiability of MTEs. This technique bypasses full recovery of a mixture, which significantly simplifies its requirements for identifiability. We demonstrate the efficacy of SPOs on synthetic data."}, "https://arxiv.org/abs/2405.19317": {"title": "Adaptive Generalized Neyman Allocation: Local Asymptotic Minimax Optimal Best Arm Identification", "link": "https://arxiv.org/abs/2405.19317", "description": "arXiv:2405.19317v1 Announce Type: cross \nAbstract: This study investigates a local asymptotic minimax optimal strategy for fixed-budget best arm identification (BAI). We propose the Adaptive Generalized Neyman Allocation (AGNA) strategy and show that its worst-case upper bound of the probability of misidentifying the best arm aligns with the worst-case lower bound under the small-gap regime, where the gap between the expected outcomes of the best and suboptimal arms is small. Our strategy corresponds to a generalization of the Neyman allocation for two-armed bandits (Neyman, 1934; Kaufmann et al., 2016) and a refinement of existing strategies such as the ones proposed by Glynn & Juneja (2004) and Shin et al. (2018). Compared to Komiyama et al. (2022), which proposes a minimax rate-optimal strategy, our proposed strategy has a tighter upper bound that exactly matches the lower bound, including the constant terms, by restricting the class of distributions to the class of small-gap distributions. Our result contributes to the longstanding open issue about the existence of asymptotically optimal strategies in fixed-budget BAI, by presenting the local asymptotic minimax optimal strategy."}, "https://arxiv.org/abs/2106.14083": {"title": "Bayesian Time-Varying Tensor Vector Autoregressive Models for Dynamic Effective Connectivity", "link": "https://arxiv.org/abs/2106.14083", "description": "arXiv:2106.14083v2 Announce Type: replace \nAbstract: In contemporary neuroscience, a key area of interest is dynamic effective connectivity, which is crucial for understanding the dynamic interactions and causal relationships between different brain regions. Dynamic effective connectivity can provide insights into how brain network interactions are altered in neurological disorders such as dyslexia. Time-varying vector autoregressive (TV-VAR) models have been employed to draw inferences for this purpose. However, their significant computational requirements pose challenges, since the number of parameters to be estimated increases quadratically with the number of time series. In this paper, we propose a computationally efficient Bayesian time-varying VAR approach. 
For dealing with large-dimensional time series, the proposed framework employs a tensor decomposition for the VAR coefficient matrices at different lags. Dynamically varying connectivity patterns are captured by assuming that at any given time only a subset of components in the tensor decomposition is active. Latent binary time series select the active components at each time via an innovative and parsimonious Ising model in the time-domain. Furthermore, we propose sparsity-inducing priors to achieve global-local shrinkage of the VAR coefficients, automatically determine the rank of the tensor decomposition, and guide the selection of the lags of the auto-regression. We show the performance of our model formulation via simulation studies and data from a real fMRI study involving a book reading experiment."}, "https://arxiv.org/abs/2307.11941": {"title": "Visibility graph-based covariance functions for scalable spatial analysis in non-convex domains", "link": "https://arxiv.org/abs/2307.11941", "description": "arXiv:2307.11941v3 Announce Type: replace \nAbstract: We present a new method for constructing valid covariance functions of Gaussian processes for spatial analysis in irregular, non-convex domains such as bodies of water. Standard covariance functions based on geodesic distances are not guaranteed to be positive definite on such domains, while existing non-Euclidean approaches fail to respect the partially Euclidean nature of these domains where the geodesic distance agrees with the Euclidean distances for some pairs of points. Using a visibility graph on the domain, we propose a class of covariance functions that preserve Euclidean-based covariances between points that are connected in the domain while incorporating the non-convex geometry of the domain via conditional independence relationships. We show that the proposed method preserves the partially Euclidean nature of the intrinsic geometry on the domain while maintaining validity (positive definiteness) and marginal stationarity of the covariance function over the entire parameter space, properties which are not always fulfilled by existing approaches to construct covariance functions on non-convex domains. We provide useful approximations to improve computational efficiency, resulting in a scalable algorithm. We compare the performance of our method with those of competing state-of-the-art methods using simulation studies on synthetic non-convex domains. The method is applied to data regarding acidity levels in the Chesapeake Bay, showing its potential for ecological monitoring in real-world spatial applications on irregular domains."}, "https://arxiv.org/abs/2206.12235": {"title": "Guided sequential ABC schemes for intractable Bayesian models", "link": "https://arxiv.org/abs/2206.12235", "description": "arXiv:2206.12235v5 Announce Type: replace-cross \nAbstract: Sequential algorithms such as sequential importance sampling (SIS) and sequential Monte Carlo (SMC) have proven fundamental in Bayesian inference for models not admitting a readily available likelihood function. For approximate Bayesian computation (ABC), SMC-ABC is the state-of-the-art sampler. However, since the ABC paradigm is intrinsically wasteful, sequential ABC schemes can benefit from well-targeted proposal samplers that efficiently avoid improbable parameter regions. We contribute to the ABC modeller's toolbox with novel proposal samplers that are conditional on summary statistics of the data. 
In a sense, the proposed parameters are \"guided\" to rapidly reach regions of the posterior surface that are compatible with the observed data. This speeds up the convergence of these sequential samplers, thus reducing the computational effort, while preserving the accuracy in the inference. We provide a variety of guided Gaussian and copula-based samplers for both SIS-ABC and SMC-ABC easing inference for challenging case-studies, including multimodal posteriors, highly correlated posteriors, hierarchical models with about 20 parameters, and a simulation study of cell movements using more than 400 summary statistics."}, "https://arxiv.org/abs/2305.15988": {"title": "Non-Log-Concave and Nonsmooth Sampling via Langevin Monte Carlo Algorithms", "link": "https://arxiv.org/abs/2305.15988", "description": "arXiv:2305.15988v2 Announce Type: replace-cross \nAbstract: We study the problem of approximate sampling from non-log-concave distributions, e.g., Gaussian mixtures, which is often challenging even in low dimensions due to their multimodality. We focus on performing this task via Markov chain Monte Carlo (MCMC) methods derived from discretizations of the overdamped Langevin diffusions, which are commonly known as Langevin Monte Carlo algorithms. Furthermore, we are also interested in two nonsmooth cases for which a large class of proximal MCMC methods have been developed: (i) a nonsmooth prior is considered with a Gaussian mixture likelihood; (ii) a Laplacian mixture distribution. Such nonsmooth and non-log-concave sampling tasks arise from a wide range of applications to Bayesian inference and imaging inverse problems such as image deconvolution. We perform numerical simulations to compare the performance of most commonly used Langevin Monte Carlo algorithms."}, "https://arxiv.org/abs/2405.19523": {"title": "Comparison of Point Process Learning and its special case Takacs-Fiksel estimation", "link": "https://arxiv.org/abs/2405.19523", "description": "arXiv:2405.19523v1 Announce Type: new \nAbstract: Recently, Cronie et al. (2024) introduced the notion of cross-validation for point processes and a new statistical methodology called Point Process Learning (PPL). In PPL one splits a point process/pattern into a training and a validation set, and then predicts the latter from the former through a parametrised Papangelou conditional intensity. The model parameters are estimated by minimizing a point process prediction error; this notion was introduced as the second building block of PPL. It was shown that PPL outperforms the state-of-the-art in both kernel intensity estimation and estimation of the parameters of the Gibbs hard-core process. In the latter case, the state-of-the-art was represented by pseudolikelihood estimation. In this paper we study PPL in relation to Takacs-Fiksel estimation, of which pseudolikelihood is a special case. We show that Takacs-Fiksel estimation is a special case of PPL in the sense that PPL with a specific loss function asymptotically reduces to Takacs-Fiksel estimation if we let the cross-validation regime tend to leave-one-out cross-validation. Moreover, PPL involves a certain type of hyperparameter given by a weight function which ensures that the prediction errors have expectation zero if and only if we have the correct parametrisation. We show that the weight function takes an explicit but intractable form for general Gibbs models. Consequently, we propose different approaches to estimate the weight function in practice. 
In order to assess how the general PPL setup performs in relation to its special case Takacs-Fiksel estimation, we conduct a simulation study where we find that for common Gibbs models we can find loss functions and hyperparameters so that PPL typically outperforms Takacs-Fiksel estimation significantly in terms of mean square error. Here, the hyperparameters are the cross-validation parameters and the weight function estimate."}, "https://arxiv.org/abs/2405.19539": {"title": "Canonical Correlation Analysis as Reduced Rank Regression in High Dimensions", "link": "https://arxiv.org/abs/2405.19539", "description": "arXiv:2405.19539v1 Announce Type: new \nAbstract: Canonical Correlation Analysis (CCA) is a widespread technique for discovering linear relationships between two sets of variables $X \\in \\mathbb{R}^{n \\times p}$ and $Y \\in \\mathbb{R}^{n \\times q}$. In high dimensions however, standard estimates of the canonical directions cease to be consistent without assuming further structure. In this setting, a possible solution consists in leveraging the presumed sparsity of the solution: only a subset of the covariates span the canonical directions. While the last decade has seen a proliferation of sparse CCA methods, practical challenges regarding the scalability and adaptability of these methods still persist. To circumvent these issues, this paper suggests an alternative strategy that uses reduced rank regression to estimate the canonical directions when one of the datasets is high-dimensional while the other remains low-dimensional. By casting the problem of estimating the canonical direction as a regression problem, our estimator is able to leverage the rich statistics literature on high-dimensional regression and is easily adaptable to accommodate a wider range of structural priors. Our proposed solution maintains computational efficiency and accuracy, even in the presence of very high-dimensional data. We validate the benefits of our approach through a series of simulated experiments and further illustrate its practicality by applying it to three real-world datasets."}, "https://arxiv.org/abs/2405.19637": {"title": "Inference in semiparametric formation models for directed networks", "link": "https://arxiv.org/abs/2405.19637", "description": "arXiv:2405.19637v1 Announce Type: new \nAbstract: We propose a semiparametric model for dyadic link formations in directed networks. The model contains a set of degree parameters that measure different effects of popularity or outgoingness across nodes, a regression parameter vector that reflects the homophily effect resulting from the nodal attributes or pairwise covariates associated with edges, and a set of latent random noises with unknown distributions. Our interest lies in inferring the unknown degree parameters and homophily parameters. The dimension of the degree parameters increases with the number of nodes. Under the high-dimensional regime, we develop a kernel-based least squares approach to estimate the unknown parameters. The major advantage of our estimator is that it does not encounter the incidental parameter problem for the homophily parameters. We prove consistency of all the resulting estimators of the degree parameters and homophily parameters. We establish high-dimensional central limit theorems for the proposed estimators and provide several applications of our general theory, including testing the existence of degree heterogeneity, testing sparse signals and recovering the support. 
Simulation studies and a real data application are conducted to illustrate the finite sample performance of the proposed methods."}, "https://arxiv.org/abs/2405.19666": {"title": "Bayesian Joint Modeling for Longitudinal Magnitude Data with Informative Dropout: an Application to Critical Care Data", "link": "https://arxiv.org/abs/2405.19666", "description": "arXiv:2405.19666v1 Announce Type: new \nAbstract: In various biomedical studies, the focus of analysis centers on the magnitudes of data, particularly when algebraic signs are irrelevant or lost. To analyze the magnitude outcomes in repeated measures studies, using models with random effects is essential. This is because random effects can account for individual heterogeneity, enhancing parameter estimation precision. However, there are currently no established regression methods that incorporate random effects and are specifically designed for magnitude outcomes. This article bridges this gap by introducing Bayesian regression modeling approaches for analyzing magnitude data, with a key focus on the incorporation of random effects. Additionally, the proposed method is extended to address multiple causes of informative dropout, commonly encountered in repeated measures studies. To tackle the missing data challenge arising from dropout, a joint modeling strategy is developed, building upon the previously introduced regression techniques. Two numerical simulation studies are conducted to assess the validity of our method. The chosen simulation scenarios aim to resemble the conditions of our motivating study. The results demonstrate that the proposed method for magnitude data exhibits good performance in terms of both estimation accuracy and precision, and the joint models effectively mitigate bias due to missing data. Finally, we apply proposed models to analyze the magnitude data from the motivating study, investigating if sex impacts the magnitude change in diaphragm thickness over time for ICU patients."}, "https://arxiv.org/abs/2405.19803": {"title": "Dynamic Factor Analysis of High-dimensional Recurrent Events", "link": "https://arxiv.org/abs/2405.19803", "description": "arXiv:2405.19803v1 Announce Type: new \nAbstract: Recurrent event time data arise in many studies, including biomedicine, public health, marketing, and social media analysis. High-dimensional recurrent event data involving large numbers of event types and observations become prevalent with the advances in information technology. This paper proposes a semiparametric dynamic factor model for the dimension reduction and prediction of high-dimensional recurrent event data. The proposed model imposes a low-dimensional structure on the mean intensity functions of the event types while allowing for dependencies. A nearly rate-optimal smoothing-based estimator is proposed. An information criterion that consistently selects the number of factors is also developed. Simulation studies demonstrate the effectiveness of these inference tools. 
The proposed method is applied to grocery shopping data, for which an interpretable factor structure is obtained."}, "https://arxiv.org/abs/2405.19849": {"title": "Modelling and Forecasting Energy Market Volatility Using GARCH and Machine Learning Approach", "link": "https://arxiv.org/abs/2405.19849", "description": "arXiv:2405.19849v1 Announce Type: new \nAbstract: This paper presents a comparative analysis of univariate and multivariate GARCH-family models and machine learning algorithms in modeling and forecasting the volatility of major energy commodities: crude oil, gasoline, heating oil, and natural gas. It uses a comprehensive dataset incorporating financial, macroeconomic, and environmental variables to assess predictive performance and discusses volatility persistence and transmission across these commodities. Aspects of volatility persistence and transmission, traditionally examined by GARCH-class models, are jointly explored using the SHAP (Shapley Additive exPlanations) method. The findings reveal that machine learning models demonstrate superior out-of-sample forecasting performance compared to traditional GARCH models. Machine learning models tend to underpredict, while GARCH models tend to overpredict energy market volatility, suggesting a hybrid use of both types of models. There is volatility transmission from crude oil to the gasoline and heating oil markets. The volatility transmission in the natural gas market is less prevalent."}, "https://arxiv.org/abs/2405.19865": {"title": "Reduced Rank Regression for Mixed Predictor and Response Variables", "link": "https://arxiv.org/abs/2405.19865", "description": "arXiv:2405.19865v1 Announce Type: new \nAbstract: In this paper, we propose the generalized mixed reduced rank regression method, GMR$^3$ for short. GMR$^3$ is a regression method for a mix of numeric, binary and ordinal response variables. The predictor variables can be a mix of binary, nominal, ordinal, and numeric variables. For dealing with the categorical predictors we use optimal scaling. A majorization-minimization algorithm is derived for maximum likelihood estimation under a local independence assumption. We discuss in detail model selection for the dimensionality or rank, and the selection of predictor variables. We show an application of GMR$^3$ using the Eurobarometer Surveys data set of 2023."}, "https://arxiv.org/abs/2405.19897": {"title": "The Political Resource Curse Redux", "link": "https://arxiv.org/abs/2405.19897", "description": "arXiv:2405.19897v1 Announce Type: new \nAbstract: In the study of the Political Resource Curse (Brollo et al.,2013), the authors identified a new channel to investigate whether the windfalls of resources are unambiguously beneficial to society, both with theory and empirical evidence. This paper revisits the framework with a new dataset. Specifically, we implemented a regression discontinuity design and difference-in-difference specification"}, "https://arxiv.org/abs/2405.19985": {"title": "Targeted Sequential Indirect Experiment Design", "link": "https://arxiv.org/abs/2405.19985", "description": "arXiv:2405.19985v1 Announce Type: new \nAbstract: Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health. 
Such queries are inherently causal (rather than purely associational), but in many settings, experiments cannot be conducted directly on the target variables of interest and are instead indirect: they perturb the target variable but do not remove potential confounding factors. If, additionally, the resulting experimental measurements are multi-dimensional and the studied mechanisms nonlinear, the query of interest is generally not identified. We develop an adaptive strategy to design indirect experiments that optimally inform a targeted query about the ground truth mechanism in terms of sequentially narrowing the gap between an upper and lower bound on the query. While the general formulation consists of a bi-level optimization procedure, we derive an efficiently estimable analytical kernel-based estimator of the bounds for the causal effect, a query of key interest, and demonstrate the efficacy of our approach in confounded, multivariate, nonlinear synthetic settings."}, "https://arxiv.org/abs/2405.20137": {"title": "A unified framework of principal component analysis and factor analysis", "link": "https://arxiv.org/abs/2405.20137", "description": "arXiv:2405.20137v1 Announce Type: new \nAbstract: Principal component analysis and factor analysis are fundamental multivariate analysis methods. In this paper a unified framework to connect them is introduced. Under a general latent variable model, we present matrix optimization problems from the viewpoint of loss function minimization, and show that the two methods can be viewed as solutions to the optimization problems with specific loss functions. Specifically, principal component analysis can be derived from a broad class of loss functions including the L2 norm, while factor analysis corresponds to a modified L0 norm problem. Related problems are discussed, including algorithms, penalized maximum likelihood estimation under the latent variable model, and a principal component factor model. These results can lead to new tools of data analysis and research topics."}, "https://arxiv.org/abs/2405.20149": {"title": "Accounting for Mismatch Error in Small Area Estimation with Linked Data", "link": "https://arxiv.org/abs/2405.20149", "description": "arXiv:2405.20149v1 Announce Type: new \nAbstract: In small area estimation different data sources are integrated in order to produce reliable estimates of target parameters (e.g., a mean or a proportion) for a collection of small subsets (areas) of a finite population. Regression models such as the linear mixed effects model or M-quantile regression are often used to improve the precision of survey sample estimates by leveraging auxiliary information for which means or totals are known at the area level. In many applications, the unit-level linkage of records from different sources is probabilistic and potentially error-prone. In this paper, we present adjustments of the small area predictors that are based on either the linear mixed effects model or M-quantile regression to account for the presence of linkage error. These adjustments are developed from a two-component mixture model that hinges on the assumption of independence of the target and auxiliary variable given incorrect linkage. Estimation and inference are based on composite likelihoods and machinery revolving around the Expectation-Maximization Algorithm. For each of the two regression methods, we propose modified small area predictors and approximations for their mean squared errors. 
The empirical performance of the proposed approaches is studied in both design-based and model-based simulations that include comparisons to a variety of baselines."}, "https://arxiv.org/abs/2405.20164": {"title": "Item response parameter estimation performance using Gaussian quadrature and Laplace", "link": "https://arxiv.org/abs/2405.20164", "description": "arXiv:2405.20164v1 Announce Type: new \nAbstract: Item parameter estimation in pharmacometric item response theory (IRT) models is predominantly performed using the Laplace estimation algorithm as implemented in NONMEM. In psychometrics a wide range of different software tools, including several packages for the open-source software R for implementation of IRT, are also available. Each has its own set of benefits and limitations, and to date a systematic comparison of the primary estimation algorithms has not been performed. A simulation study evaluating varying numbers of hypothetical sample sizes and item scenarios at baseline was performed using both Laplace and Gauss-Hermite quadrature (GHQ-EM). In scenarios with at least 20 items and more than 100 subjects, item parameters were estimated with good precision and were similar between estimation algorithms as demonstrated by several measures of bias and precision. The minimal differences observed for certain parameters or sample size scenarios were reduced when translating to the total score scale. The ease of use, speed of estimation and relative accuracy of the GHQ-EM method employed in mirt make it an appropriate alternative or supportive analytical approach to NONMEM for potential pharmacometrics IRT applications."}, "https://arxiv.org/abs/2405.20342": {"title": "Evaluating Approximations of Count Distributions and Forecasts for Poisson-Lindley Integer Autoregressive Processes", "link": "https://arxiv.org/abs/2405.20342", "description": "arXiv:2405.20342v1 Announce Type: new \nAbstract: Although many time series are realizations from discrete processes, it is often the case that a continuous Gaussian model is implemented for modeling and forecasting the data, resulting in incoherent forecasts. Forecasts using a Poisson-Lindley integer autoregressive (PLINAR) model are compared to variations of Gaussian forecasts via simulation by equating relevant moments of the marginals of the PLINAR to the Gaussian AR. To illustrate utility, the methods discussed are applied and compared using a discrete series with model parameters being estimated using each of conditional least squares, Yule-Walker, and maximum likelihood."}, "https://arxiv.org/abs/2405.19407": {"title": "Tempered Multifidelity Importance Sampling for Gravitational Wave Parameter Estimation", "link": "https://arxiv.org/abs/2405.19407", "description": "arXiv:2405.19407v1 Announce Type: cross \nAbstract: Estimating the parameters of compact binaries which coalesce and produce gravitational waves is a challenging Bayesian inverse problem. Gravitational-wave parameter estimation lies within the class of multifidelity problems, where a variety of models with differing assumptions, levels of fidelity, and computational cost are available for use in inference. In an effort to accelerate the solution of a Bayesian inverse problem, cheaper surrogates for the best models may be used to reduce the cost of likelihood evaluations when sampling the posterior. Importance sampling can then be used to reweight these samples to represent the true target posterior, incurring a reduction in the effective sample size. 
In cases when the problem is high dimensional, or when the surrogate model produces a poor approximation of the true posterior, this reduction in effective samples can be dramatic and render multifidelity importance sampling ineffective. We propose a novel method of tempered multifidelity importance sampling in order to remedy this issue. With this method the biasing distribution produced by the low-fidelity model is tempered, allowing for potentially better overlap with the target distribution. There is an optimal temperature which maximizes the efficiency in this setting, and we propose a low-cost strategy for approximating this optimal temperature using samples from the untempered distribution. In this paper, we motivate this method by applying it to Gaussian target and biasing distributions. Finally, we apply it to a series of problems in gravitational wave parameter estimation and demonstrate improved efficiencies when applying the method to real gravitational wave detections."}, "https://arxiv.org/abs/2405.19463": {"title": "Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data", "link": "https://arxiv.org/abs/2405.19463", "description": "arXiv:2405.19463v1 Announce Type: cross \nAbstract: We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms require neither matrix inversions nor mini-batches and provide a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation of order $\\mathcal{O}(\\log T/T)$ and $\\mathcal{O}(1/T^{1-\\iota})$ for any $\\iota>0$ under the availability of two-sample and one-sample oracles, respectively, where $T$ is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure avoids explicitly modeling and estimating the relationship between the confounder and the instrumental variables, demonstrating the benefit of the proposed approach over recent works based on reformulating the problem as minimax optimization problems. Numerical experiments are provided to corroborate the theoretical results."}, "https://arxiv.org/abs/2405.19610": {"title": "Factor Augmented Tensor-on-Tensor Neural Networks", "link": "https://arxiv.org/abs/2405.19610", "description": "arXiv:2405.19610v1 Announce Type: cross \nAbstract: This paper studies the prediction task of tensor-on-tensor regression in which both covariates and responses are multi-dimensional arrays (a.k.a., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focused on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employed black-box deep learning algorithms that failed to utilize the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin with summarizing and extracting useful predictive information (represented by the ``factor tensor'') from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as input of a temporal convolutional neural network. 
The proposed methods effectively handle nonlinearity between complex data structures, and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our proposed methods exploit the underlying latent factor structure to enhance the prediction, and, at the same time, drastically reduce the data dimensionality, which speeds up the computation. The empirical performances of our proposed methods are demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our proposed algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods."}, "https://arxiv.org/abs/2405.19704": {"title": "Enhancing Sufficient Dimension Reduction via Hellinger Correlation", "link": "https://arxiv.org/abs/2405.19704", "description": "arXiv:2405.19704v1 Announce Type: cross \nAbstract: In this work, we develop a new theory and method for sufficient dimension reduction (SDR) in single-index models, where SDR is a sub-field of supervised dimension reduction based on conditional independence. Our work is primarily motivated by the recent introduction of the Hellinger correlation as a dependency measure. Utilizing this measure, we develop a method capable of effectively detecting the dimension reduction subspace, complete with theoretical justification. Through extensive numerical experiments, we demonstrate that our proposed method significantly enhances and outperforms existing SDR methods. This improvement is largely attributed to our proposed method's deeper understanding of data dependencies and the refinement of existing SDR techniques."}, "https://arxiv.org/abs/2405.19920": {"title": "The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions", "link": "https://arxiv.org/abs/2405.19920", "description": "arXiv:2405.19920v1 Announce Type: cross \nAbstract: We present the ARR2 prior, a joint prior over the auto-regressive components in Bayesian time-series models and their induced $R^2$. Compared to other priors designed for time-series models, the ARR2 prior allows for flexible and intuitive shrinkage. We derive the prior for pure auto-regressive models, and extend it to auto-regressive models with exogenous inputs, and state-space models. Through both simulations and real-world modelling exercises, we demonstrate the efficacy of the ARR2 prior in improving sparse and reliable inference, while showing greater inference quality and predictive performance than other shrinkage priors. An open-source implementation of the prior is provided."}, "https://arxiv.org/abs/2405.20039": {"title": "Task-Agnostic Machine Learning-Assisted Inference", "link": "https://arxiv.org/abs/2405.20039", "description": "arXiv:2405.20039v1 Announce Type: cross \nAbstract: Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened up a whole new field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples and then uses the predicted outcomes in downstream statistical inference. 
However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis. This is because any extension of these approaches to new, more sophisticated statistical tasks requires task-specific algebraic derivations and software implementations, which ignores the massive library of existing software tools already developed for complex inference tasks and severely constrains the scope of post-prediction inference in real applications. To address this challenge, we propose a novel statistical framework for task-agnostic ML-assisted inference. It provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routine. It delivers valid and efficient inference that is robust to arbitrary choices of ML models, while allowing nearly all existing analytical frameworks to be incorporated into the analysis of ML-predicted outcomes. Through extensive experiments, we showcase the validity, versatility, and superiority of our method compared to existing approaches."}, "https://arxiv.org/abs/2405.20088": {"title": "Personalized Predictions from Population Level Experiments: A Study on Alzheimer's Disease", "link": "https://arxiv.org/abs/2405.20088", "description": "arXiv:2405.20088v1 Announce Type: cross \nAbstract: The purpose of this article is to infer patient level outcomes from population level randomized control trials (RCTs). In this pursuit, we utilize the recently proposed synthetic nearest neighbors (SNN) estimator. At its core, SNN leverages information across patients to impute missing data associated with each patient of interest. We focus on two types of missing data: (i) unrecorded outcomes from discontinuing the assigned treatments and (ii) unobserved outcomes associated with unassigned treatments. Data imputation in the former powers and de-biases RCTs, while data imputation in the latter simulates \"synthetic RCTs\" to predict the outcomes for each patient under every treatment. The SNN estimator is interpretable, transparent, and causally justified under a broad class of missing data scenarios. Relative to several standard methods, we empirically find that SNN performs well for the above two applications using Phase 3 clinical trial data on patients with Alzheimer's Disease. Our findings directly suggest that SNN can tackle a current pain point within the clinical trial workflow on patient dropouts and serve as a new tool towards the development of precision medicine. Building on our insights, we discuss how SNN can further generalize to real-world applications."}, "https://arxiv.org/abs/2405.20191": {"title": "Multidimensional spatiotemporal clustering -- An application to environmental sustainability scores in Europe", "link": "https://arxiv.org/abs/2405.20191", "description": "arXiv:2405.20191v1 Announce Type: cross \nAbstract: The assessment of corporate sustainability performance is extremely relevant in facilitating the transition to a green and low-carbon intensity economy. However, companies located in different areas may be subject to different sustainability and environmental risks and policies. Henceforth, the main objective of this paper is to investigate the spatial and temporal pattern of the sustainability evaluations of European firms. 
We leverage a large dataset containing information about companies' sustainability performances, measured by MSCI ESG ratings, and geographical coordinates of firms in Western Europe between 2013 and 2023. By means of a modified version of the Chavent et al. (2018) hierarchical algorithm, we conduct a spatial clustering analysis, combining sustainability and spatial information, and a spatiotemporal clustering analysis, which combines the time dynamics of multiple sustainability features and spatial dissimilarities, to detect groups of firms with homogeneous sustainability performance. We are able to build cross-national and cross-industry clusters with remarkable differences in terms of sustainability scores. Among other results, in the spatiotemporal analysis, we observe a high degree of geographical overlap among clusters, indicating that the temporal dynamics in sustainability assessment are relevant within a multidimensional approach. Our findings help to capture the diversity of ESG ratings across Western Europe and may assist practitioners and policymakers in evaluating companies facing different sustainability-linked risks in different areas."}, "https://arxiv.org/abs/2208.02024": {"title": "Time-Varying Dispersion Integer-Valued GARCH Models", "link": "https://arxiv.org/abs/2208.02024", "description": "arXiv:2208.02024v2 Announce Type: replace \nAbstract: We propose a general class of INteger-valued Generalized AutoRegressive Conditionally Heteroscedastic (INGARCH) processes by allowing time-varying mean and dispersion parameters, which we call time-varying dispersion INGARCH (tv-DINGARCH) models. More specifically, we consider mixed Poisson INGARCH models and allow for dynamic modeling of the dispersion parameter (as well as the mean), similar to the spirit of the ordinary GARCH models. We derive conditions to obtain first and second-order stationarity, and ergodicity as well. Estimation of the parameters is addressed and their associated asymptotic properties are established. A restricted bootstrap procedure is proposed for testing constant dispersion against time-varying dispersion. Monte Carlo simulation studies are presented for checking point estimation, standard errors, and the performance of the restricted bootstrap approach. We apply the tv-DINGARCH process to model the weekly number of reported measles infections in North Rhine-Westphalia, Germany, from January 2001 to May 2013, and compare its performance to the ordinary INGARCH approach."}, "https://arxiv.org/abs/2208.07590": {"title": "Neural Networks for Extreme Quantile Regression with an Application to Forecasting of Flood Risk", "link": "https://arxiv.org/abs/2208.07590", "description": "arXiv:2208.07590v3 Announce Type: replace \nAbstract: Risk assessment for extreme events requires accurate estimation of high quantiles that go beyond the range of historical observations. When the risk depends on the values of observed predictors, regression techniques are used to interpolate in the predictor space. We propose the EQRN model that combines tools from neural networks and extreme value theory into a method capable of extrapolation in the presence of complex predictor dependence. Neural networks can naturally incorporate additional structure in the data. We develop a recurrent version of EQRN that is able to capture complex sequential dependence in time series. We apply this method to forecast flood risk in the Swiss Aare catchment. 
It exploits information from multiple covariates in space and time to provide one-day-ahead predictions of return levels and exceedance probabilities. This output complements the static return level from a traditional extreme value analysis, and the predictions are able to adapt to distributional shifts as experienced in a changing climate. Our model can help authorities to manage flooding more effectively and to minimize its disastrous impacts through early warning systems."}, "https://arxiv.org/abs/2305.10656": {"title": "Spectral Change Point Estimation for High Dimensional Time Series by Sparse Tensor Decomposition", "link": "https://arxiv.org/abs/2305.10656", "description": "arXiv:2305.10656v2 Announce Type: replace \nAbstract: Multivariate time series may be subject to partial structural changes over certain frequency bands, for instance, in neuroscience. We study the change point detection problem with high dimensional time series, within the framework of the frequency domain. The overarching goal is to locate all change points and delineate which series are activated by the change, over which frequencies. In practice, the number of activated series per change and frequency could span from a few to full participation. We solve the problem by first computing a CUSUM tensor based on spectra estimated from blocks of the time series. A frequency-specific projection approach is applied for dimension reduction. The projection direction is estimated by a proposed tensor decomposition algorithm that adjusts to the sparsity level of changes. Finally, the projected CUSUM vectors across frequencies are aggregated for change point detection. We provide theoretical guarantees on the number of estimated change points and the convergence rate of their locations. We derive error bounds for the estimated projection direction for identifying the frequency-specific series activated in a change. We provide data-driven rules for the choice of parameters. The efficacy of the proposed method is illustrated by simulation and a stock returns application."}, "https://arxiv.org/abs/2306.14004": {"title": "Latent Factor Analysis in Short Panels", "link": "https://arxiv.org/abs/2306.14004", "description": "arXiv:2306.14004v2 Announce Type: replace \nAbstract: We develop inferential tools for latent factor analysis in short panels. The pseudo maximum likelihood setting under a large cross-sectional dimension n and a fixed time series dimension T relies on a diagonal TxT covariance matrix of the errors without imposing sphericity or Gaussianity. We outline the asymptotic distributions of the latent factor and error covariance estimates as well as of an asymptotically uniformly most powerful invariant (AUMPI) test for the number of factors based on the likelihood ratio statistic. We derive the AUMPI characterization from inequalities ensuring the monotone likelihood ratio property for positive definite quadratic forms in normal variables. An empirical application to a large panel of monthly U.S. stock returns separates, month after month, systematic and idiosyncratic risks in short subperiods of bear vs. bull market based on the selected number of factors. We observe an uptrend in the paths of total and idiosyncratic volatilities while the systematic risk explains a large part of the cross-sectional total variance in bear markets but is not driven by a single factor. 
Rank tests show that observed factors struggle to span the latent factors, with the discrepancy between the dimensions of the two factor spaces decreasing over time."}, "https://arxiv.org/abs/2212.14857": {"title": "Nuisance Function Tuning for Optimal Doubly Robust Estimation", "link": "https://arxiv.org/abs/2212.14857", "description": "arXiv:2212.14857v2 Announce Type: replace-cross \nAbstract: Estimators of doubly robust functionals typically rely on estimating two complex nuisance functions, such as the propensity score and conditional outcome mean for the average treatment effect functional. We consider the problem of how to estimate nuisance functions to obtain optimal rates of convergence for a doubly robust nonparametric functional that has witnessed applications across the causal inference and conditional independence testing literature. For several plug-in type estimators and a one-step type estimator, we illustrate the interplay between different tuning parameter choices for the nuisance function estimators and sample splitting strategies on the optimal rate of estimating the functional of interest. For each of these estimators and each sample splitting strategy, we show the necessity to undersmooth the nuisance function estimators under low regularity conditions to obtain optimal rates of convergence for the functional of interest. By performing suitable nuisance function tuning and sample splitting strategies, we show that some of these estimators can achieve minimax rates of convergence in all H\\\"older smoothness classes of the nuisance functions."}, "https://arxiv.org/abs/2308.13047": {"title": "Federated Causal Inference from Observational Data", "link": "https://arxiv.org/abs/2308.13047", "description": "arXiv:2308.13047v2 Announce Type: replace-cross \nAbstract: Decentralized data sources are prevalent in real-world applications, posing a formidable challenge for causal inference. These sources cannot be consolidated into a single entity owing to privacy constraints. The presence of dissimilar data distributions and missing values within them can potentially introduce bias to the causal estimands. In this article, we propose a framework to estimate causal effects from decentralized data sources. The proposed framework avoids exchanging raw data among the sources, thus contributing towards privacy-preserving causal learning. Three instances of the proposed framework are introduced to estimate causal effects across a wide range of diverse scenarios within a federated setting. (1) FedCI: a Bayesian framework based on Gaussian processes for estimating causal effects from federated observational data sources. It estimates the posterior distributions of the causal effects to compute the higher-order statistics that capture the uncertainty. (2) CausalRFF: an adaptive transfer algorithm that learns the similarities among the data sources by utilizing Random Fourier Features to disentangle the loss function into multiple components, each of which is associated with a data source. It estimates the similarities among the sources through transfer coefficients, and hence requires no prior information about the similarity measures. (3) CausalFI: a new approach for federated causal inference from incomplete data, enabling the estimation of causal effects from multiple decentralized and incomplete data sources. It accounts for the missing data under the missing at random assumption, while also estimating higher-order statistics of the causal estimands. 
The proposed federated framework and its instances are an important step towards a privacy-preserving causal learning model."}, "https://arxiv.org/abs/2309.10211": {"title": "Loop Polarity Analysis to Avoid Underspecification in Deep Learning", "link": "https://arxiv.org/abs/2309.10211", "description": "arXiv:2309.10211v2 Announce Type: replace-cross \nAbstract: Deep learning is a powerful set of techniques for detecting complex patterns in data. However, when the causal structure of that process is underspecified, deep learning models can be brittle, lacking robustness to shifts in the distribution of the data-generating process. In this paper, we turn to loop polarity analysis as a tool for specifying the causal structure of a data-generating process, in order to encode a more robust understanding of the relationship between system structure and system behavior within the deep learning pipeline. We use simulated epidemic data based on an SIR model to demonstrate how measuring the polarity of the different feedback loops that compose a system can lead to more robust inferences on the part of neural networks, improving the out-of-distribution performance of a deep learning model and infusing a system-dynamics-inspired approach into the machine learning development pipeline."}, "https://arxiv.org/abs/2309.15769": {"title": "Algebraic and Statistical Properties of the Ordinary Least Squares Interpolator", "link": "https://arxiv.org/abs/2309.15769", "description": "arXiv:2309.15769v2 Announce Type: replace-cross \nAbstract: Deep learning research has uncovered the phenomenon of benign overfitting for overparameterized statistical models, which has drawn significant theoretical interest in recent years. Given its simplicity and practicality, the ordinary least squares (OLS) interpolator has become essential to gain foundational insights into this phenomenon. While properties of OLS are well established in classical, underparameterized settings, its behavior in high-dimensional, overparameterized regimes is less explored (unlike for ridge or lasso regression) though significant progress has been made of late. We contribute to this growing literature by providing fundamental algebraic and statistical results for the minimum $\\ell_2$-norm OLS interpolator. In particular, we provide algebraic equivalents of (i) the leave-$k$-out residual formula, (ii) Cochran's formula, and (iii) the Frisch-Waugh-Lovell theorem in the overparameterized regime. These results aid in understanding the OLS interpolator's ability to generalize and have substantive implications for causal inference. Under the Gauss-Markov model, we present statistical results such as an extension of the Gauss-Markov theorem and an analysis of variance estimation under homoskedastic errors for the overparameterized regime. To substantiate our theoretical contributions, we conduct simulations that further explore the stochastic properties of the OLS interpolator."}, "https://arxiv.org/abs/2401.14535": {"title": "CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process", "link": "https://arxiv.org/abs/2401.14535", "description": "arXiv:2401.14535v2 Announce Type: replace-cross \nAbstract: Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and making downstream reasoning. 
While some recent methods can robustly identify these latent causal variables, they rely on strict assumptions about the invertible generation process from latent variables to observed data. However, these assumptions are often hard to satisfy in real-world applications containing information loss. For instance, the visual perception process translates a 3D space into 2D images, or the phenomenon of persistence of vision incorporates historical data into current perceptions. To address this challenge, we establish an identifiability theory that allows for the recovery of independent latent components even when they come from a nonlinear and non-invertible mix. Using this theory as a foundation, we propose a principled approach, CaRiNG, to learn the CAusal RepresentatIon of Non-invertible Generative temporal data with identifiability guarantees. Specifically, we utilize temporal context to recover lost latent information and apply the conditions in our theory to guide the training process. Through experiments conducted on synthetic datasets, we validate that our CaRiNG method reliably identifies the causal process, even when the generation process is non-invertible. Moreover, we demonstrate that our approach considerably improves temporal understanding and reasoning in practical applications."}, "https://arxiv.org/abs/2405.20400": {"title": "Fast leave-one-cluster-out cross-validation by clustered Network Information Criteria (NICc)", "link": "https://arxiv.org/abs/2405.20400", "description": "arXiv:2405.20400v1 Announce Type: new \nAbstract: This paper introduces a clustered estimator of the Network Information Criterion (NICc) to approximate leave-one-cluster-out cross-validated deviance, which can be used as an alternative to cluster-based cross-validation when modeling clustered data. Stone proved that the Akaike Information Criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation if the parametric model is true. Ripley pointed out that the Network Information Criterion (NIC) derived in Stone's proof is a better approximation to leave-one-observation-out cross-validation when the model is not true. For clustered data, we derive a clustered estimator of NIC, referred to as NICc, by substituting the Fisher information matrix in NIC with an estimator that adjusts for clustering. This adjustment imposes a larger penalty in NICc than the unclustered estimator of NIC when modeling clustered data, thereby preventing overfitting more effectively. In a simulation study and an empirical example, we use linear and logistic regression to model clustered data with Gaussian or binomial response, respectively. We show that NICc is a better approximation to leave-one-cluster-out deviance and prevents overfitting more effectively than AIC and the Bayesian Information Criterion (BIC). NICc leads to more accurate model selection, as determined by cluster-based cross-validation, compared to AIC and BIC."}, "https://arxiv.org/abs/2405.20415": {"title": "Differentially Private Boxplots", "link": "https://arxiv.org/abs/2405.20415", "description": "arXiv:2405.20415v1 Announce Type: new \nAbstract: Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains relatively underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequently, we introduce a differentially private boxplot. 
We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms a boxplot naively constructed from existing differentially private quantile algorithms. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization."}, "https://arxiv.org/abs/2405.20601": {"title": "Bayesian Nonparametric Quasi Likelihood", "link": "https://arxiv.org/abs/2405.20601", "description": "arXiv:2405.20601v1 Announce Type: new \nAbstract: A recent trend in Bayesian research has been revisiting generalizations of the likelihood that enable Bayesian inference without requiring the specification of a model for the data generating mechanism. This paper focuses on a Bayesian nonparametric extension of Wedderburn's quasi-likelihood, using Bayesian additive regression trees to model the mean function. Here, the analyst posits only a structural relationship between the mean and variance of the outcome. We show that this approach provides a unified, computationally efficient, framework for extending Bayesian decision tree ensembles to many new settings, including simplex-valued and heavily heteroskedastic data. We also introduce Bayesian strategies for inferring the dispersion parameter of the quasi-likelihood, a task which is complicated by the fact that the quasi-likelihood itself does not contain information about this parameter; despite these challenges, we are able to inject updates for the dispersion parameter into a Markov chain Monte Carlo inference scheme in a way that, in the parametric setting, leads to a Bernstein-von Mises result for the stationary distribution of the resulting Markov chain. We illustrate the utility of our approach on a variety of both synthetic and non-synthetic datasets."}, "https://arxiv.org/abs/2405.20644": {"title": "Fixed-budget optimal designs for multi-fidelity computer experiments", "link": "https://arxiv.org/abs/2405.20644", "description": "arXiv:2405.20644v1 Announce Type: new \nAbstract: This work focuses on the design of experiments of multi-fidelity computer experiments. We consider the autoregressive Gaussian process model proposed by Kennedy and O'Hagan (2000) and the optimal nested design that maximizes the prediction accuracy subject to a budget constraint. An approximate solution is identified through the idea of multi-level approximation and recent error bounds of Gaussian process regression. The proposed (approximately) optimal designs admit a simple analytical form. We prove that, to achieve the same prediction accuracy, the proposed optimal multi-fidelity design requires much lower computational cost than any single-fidelity design in the asymptotic sense. Numerical studies confirm this theoretical assertion."}, "https://arxiv.org/abs/2405.20655": {"title": "Statistical inference for case-control logistic regression via integrating external summary data", "link": "https://arxiv.org/abs/2405.20655", "description": "arXiv:2405.20655v1 Announce Type: new \nAbstract: Case-control sampling is a commonly used retrospective sampling design to alleviate imbalanced structure of binary data. 
When fitting the logistic regression model with case-control data, although the slope parameter of the model can be consistently estimated, the intercept parameter is not identifiable, and the marginal case proportion is not estimable, either. We consider situations in which, besides the case-control data from the main study (called the internal study), summary-level information from related external studies is also available. An empirical likelihood-based approach is proposed to make inference for the logistic model by incorporating the internal case-control data and external information. We show that the intercept parameter is identifiable with the help of external information, and then all the regression parameters as well as the marginal case proportion can be estimated consistently. The proposed method also accounts for the possible variability in external studies. The resultant estimators are shown to be asymptotically normally distributed. The asymptotic variance-covariance matrix can be consistently estimated by the case-control data. The optimal way to utilize external information is discussed. Simulation studies are conducted to verify the theoretical findings. A real data set is analyzed for illustration."}, "https://arxiv.org/abs/2405.20758": {"title": "Fast Bayesian Basis Selection for Functional Data Representation with Correlated Errors", "link": "https://arxiv.org/abs/2405.20758", "description": "arXiv:2405.20758v1 Announce Type: new \nAbstract: Functional data analysis (FDA) finds widespread application across various fields, due to data being recorded continuously over a time interval or at several discrete points. Since the data is not observed at every point but rather across a dense grid, smoothing techniques are often employed to convert the observed data into functions. In this work, we propose a novel Bayesian approach for selecting basis functions for smoothing one or multiple curves simultaneously. Our method differs from other Bayesian approaches in two key ways: (i) by accounting for correlated errors and (ii) by developing a variational EM algorithm instead of a Gibbs sampler. Simulation studies demonstrate that our method effectively identifies the true underlying structure of the data across various scenarios, and it is applicable to different types of functional data. Our variational EM algorithm not only recovers the basis coefficients and the correct set of basis functions but also estimates the existing within-curve correlation. When applied to the motorcycle dataset, our method demonstrates comparable, and in some cases superior, performance in terms of adjusted $R^2$ compared to other techniques such as regression splines, Bayesian LASSO and LASSO. Additionally, when assuming independence among observations within a curve, our method, utilizing only a variational Bayes algorithm, is on the order of thousands of times faster than a Gibbs sampler on average. Our proposed method is implemented in R and codes are available at https://github.com/acarolcruz/VB-Bases-Selection."}, "https://arxiv.org/abs/2405.20817": {"title": "Extremile scalar-on-function regression with application to climate scenarios", "link": "https://arxiv.org/abs/2405.20817", "description": "arXiv:2405.20817v1 Announce Type: new \nAbstract: Extremiles provide a generalization of quantiles which are not only robust, but also have an intrinsic link with extreme value theory. This paper introduces an extremile regression model tailored for functional covariate spaces. 
The estimation procedure turns out to be a weighted version of local linear scalar-on-function regression, where now a double kernel approach plays a crucial role. Asymptotic expressions for the bias and variance are established, applicable to both decreasing bandwidth sequences and automatically selected bandwidths. The methodology is then investigated in detail through a simulation study. Furthermore, we highlight the applicability of the model through the analysis of data sourced from the CH2018 Swiss climate scenarios project, offering insights into its ability to serve as a modern tool to quantify climate behaviour."}, "https://arxiv.org/abs/2405.20856": {"title": "Parameter identification in linear non-Gaussian causal models under general confounding", "link": "https://arxiv.org/abs/2405.20856", "description": "arXiv:2405.20856v1 Announce Type: new \nAbstract: Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result, explore a generalization to models with feedback loops, and provide new results on the identifiability of the causal graph."}, "https://arxiv.org/abs/2405.20936": {"title": "Bayesian Deep Generative Models for Replicated Networks with Multiscale Overlapping Clusters", "link": "https://arxiv.org/abs/2405.20936", "description": "arXiv:2405.20936v1 Announce Type: new \nAbstract: Our interest is in replicated network data with multiple networks observed across the same set of nodes. Examples include brain connection networks, in which nodes correspond to brain regions and replicates to different individuals, and ecological networks, in which nodes correspond to species and replicates to samples collected at different locations and/or times. Our goal is to infer a hierarchical structure of the nodes at a population level, while performing multi-resolution clustering of the individual replicates. In brain connectomics, the focus is on inferring common relationships among the brain regions, while characterizing inter-individual variability in an easily interpretable manner. To accomplish this, we propose a Bayesian hierarchical model, while providing theoretical support in terms of identifiability and posterior consistency, and design efficient methods for posterior computation. We provide novel technical tools for proving model identifiability, which are of independent interest. 
Our simulations and application to brain connectome data provide support for the proposed methodology."}, "https://arxiv.org/abs/2405.20957": {"title": "Data Fusion for Heterogeneous Treatment Effect Estimation with Multi-Task Gaussian Processes", "link": "https://arxiv.org/abs/2405.20957", "description": "arXiv:2405.20957v1 Announce Type: new \nAbstract: Bridging the gap between internal and external validity is crucial for heterogeneous treatment effect estimation. Randomised controlled trials (RCTs), favoured for their internal validity due to randomisation, often encounter challenges in generalising findings due to strict eligibility criteria. Observational studies, on the other hand, provide external validity advantages through larger and more representative samples but suffer from compromised internal validity due to unmeasured confounding. Motivated by these complementary characteristics, we propose a novel Bayesian nonparametric approach leveraging multi-task Gaussian processes to integrate data from both RCTs and observational studies. In particular, we introduce a parameter which controls the degree of borrowing between the datasets and prevents the observational dataset from dominating the estimation. The value of the parameter can be either user-set or chosen through a data-adaptive procedure. Our approach outperforms other methods in point predictions across the covariate support of the observational study, and furthermore provides a calibrated measure of uncertainty for the estimated treatment effects, which is crucial when extrapolating. We demonstrate the robust performance of our approach in diverse scenarios through multiple simulation studies and a real-world education randomised trial."}, "https://arxiv.org/abs/2405.21020": {"title": "Bayesian Estimation of Hierarchical Linear Models from Incomplete Data: Cluster-Level Interaction Effects and Small Sample Sizes", "link": "https://arxiv.org/abs/2405.21020", "description": "arXiv:2405.21020v1 Announce Type: new \nAbstract: We consider Bayesian estimation of a hierarchical linear model (HLM) from small sample sizes where 37 patient-physician encounters are repeatedly measured at four time points. The continuous response $Y$ and continuous covariates $C$ are partially observed and assumed missing at random. With $C$ having linear effects, the HLM may be efficiently estimated by available methods. When $C$ includes cluster-level covariates having interactive or other nonlinear effects given small sample sizes, however, maximum likelihood estimation is suboptimal. Existing Gibbs samplers are based on a Bayesian joint distribution compatible with the HLM, but they impute missing values of $C$ by a Metropolis algorithm via a proposal density with a constant variance, whereas the target conditional distribution has a nonconstant variance. Therefore, the samplers are not guaranteed to be compatible with the joint distribution and, thus, not guaranteed to always produce unbiased estimation of the HLM. We introduce a compatible Gibbs sampler that imputes parameters and missing values directly from the exact conditional distributions. 
We analyze repeated measurements from patient-physician encounters by our sampler, and compare our estimators with those of existing methods by simulation."}, "https://arxiv.org/abs/2405.20418": {"title": "A Bayesian joint model of multiple nonlinear longitudinal and competing risks outcomes for dynamic prediction in multiple myeloma: joint estimation and corrected two-stage approaches", "link": "https://arxiv.org/abs/2405.20418", "description": "arXiv:2405.20418v1 Announce Type: cross \nAbstract: Predicting cancer-associated clinical events is challenging in oncology. In Multiple Myeloma (MM), a cancer of plasma cells, disease progression is determined by changes in biomarkers, such as serum concentration of the paraprotein secreted by plasma cells (M-protein). Therefore, the time-dependent behaviour of M-protein and the transition across lines of therapy (LoT) that may be a consequence of disease progression should be accounted for in statistical models to predict relevant clinical outcomes. Furthermore, it is important to understand the contribution of the patterns of longitudinal biomarkers, upon each LoT initiation, to time-to-death or time-to-next-LoT. Motivated by these challenges, we propose a Bayesian joint model for trajectories of multiple longitudinal biomarkers, such as M-protein, and the competing risks of death and transition to next LoT. Additionally, we explore two estimation approaches for our joint model: simultaneous estimation of all parameters (joint estimation) and sequential estimation of parameters using a corrected two-stage strategy aiming to reduce computational time. Our proposed model and estimation methods are applied to a retrospective cohort study from a real-world database of patients diagnosed with MM in the US from January 2015 to February 2022. We split the data into training and test sets in order to validate the joint model using both estimation approaches and make dynamic predictions of times until clinical events of interest, informed by longitudinally measured biomarkers and baseline variables available up to the time of prediction."}, "https://arxiv.org/abs/2405.20715": {"title": "Transforming Japan Real Estate", "link": "https://arxiv.org/abs/2405.20715", "description": "arXiv:2405.20715v1 Announce Type: cross \nAbstract: The Japanese real estate market, valued over 35 trillion USD, offers significant investment opportunities. Accurate rent and price forecasting could provide a substantial competitive edge. This paper explores using alternative data variables to predict real estate performance in 1100 Japanese municipalities. A comprehensive house price index was created, covering all municipalities from 2005 to the present, using a dataset of over 5 million transactions. This core dataset was enriched with economic factors spanning decades, allowing for price trajectory predictions.\n The findings show that alternative data variables can indeed forecast real estate performance effectively. Investment signals based on these variables yielded notable returns with low volatility. For example, the net migration ratio delivered an annualized return of 4.6% with a Sharpe ratio of 1.5. Taxable income growth and new dwellings ratio also performed well, with annualized returns of 4.1% (Sharpe ratio of 1.3) and 3.3% (Sharpe ratio of 0.9), respectively. 
When combined with transformer models to predict risk-adjusted returns 4 years in advance, the model achieved an R-squared score of 0.28, explaining nearly 30% of the variation in future municipality prices.\n These results highlight the potential of alternative data variables in real estate investment. They underscore the need for further research to identify more predictive factors. Nonetheless, the evidence suggests that such data can provide valuable insights into real estate price drivers, enabling more informed investment decisions in the Japanese market."}, "https://arxiv.org/abs/2405.20779": {"title": "Asymptotic utility of spectral anonymization", "link": "https://arxiv.org/abs/2405.20779", "description": "arXiv:2405.20779v1 Announce Type: cross \nAbstract: In the contemporary data landscape characterized by multi-source data collection and third-party sharing, ensuring individual privacy stands as a critical concern. While various anonymization methods exist, their utility preservation and privacy guarantees remain challenging to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods that directly modify the original data, SA operates by perturbing the data in a spectral basis and subsequently reverting them to their original basis. Alongside the original version $\\mathcal{P}$-SA, employing random permutation transformation, we introduce two novel SA variants: $\\mathcal{J}$-spectral anonymization and $\\mathcal{O}$-spectral anonymization, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. Our results reveal, in particular, that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% when compared to the original data. To assess the applicability of these asymptotic results in practice, we conduct a simulation study with finite data and also evaluate the privacy protection offered by these algorithms using distance-based record linkage. Our research reveals that while no method exhibits clear superiority in finite-sample utility, $\\mathcal{O}$-SA distinguishes itself for its exceptional privacy preservation, never producing identical records, albeit with increased computational complexity. Conversely, $\\mathcal{P}$-SA emerges as a computationally efficient alternative, demonstrating unmatched efficiency in mean estimation."}, "https://arxiv.org/abs/2405.21012": {"title": "G-Transformer for Conditional Average Potential Outcome Estimation over Time", "link": "https://arxiv.org/abs/2405.21012", "description": "arXiv:2405.21012v1 Announce Type: cross \nAbstract: Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task suffer from either (a) bias or (b) large variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model designed for unbiased, low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. 
In sum, this work represents a significant step towards personalized decision-making from electronic health records."}, "https://arxiv.org/abs/2201.02532": {"title": "Approximate Factor Models for Functional Time Series", "link": "https://arxiv.org/abs/2201.02532", "description": "arXiv:2201.02532v3 Announce Type: replace \nAbstract: We propose a novel approximate factor model tailored for analyzing time-dependent curve data. Our model decomposes such data into two distinct components: a low-dimensional predictable factor component and an unpredictable error term. These components are identified through the autocovariance structure of the underlying functional time series. The model parameters are consistently estimated using the eigencomponents of a cumulative autocovariance operator and an information criterion is proposed to determine the appropriate number of factors. The methodology is applied to yield curve modeling and forecasting. Our results indicate that more than three factors are required to characterize the dynamics of the term structure of bond yields."}, "https://arxiv.org/abs/2303.01887": {"title": "Fast Forecasting of Unstable Data Streams for On-Demand Service Platforms", "link": "https://arxiv.org/abs/2303.01887", "description": "arXiv:2303.01887v2 Announce Type: replace \nAbstract: On-demand service platforms face a challenging problem of forecasting a large collection of high-frequency regional demand data streams that exhibit instabilities. This paper develops a novel forecast framework that is fast and scalable, and automatically assesses changing environments without human intervention. We empirically test our framework on a large-scale demand data set from a leading on-demand delivery platform in Europe, and find strong performance gains from using our framework against several industry benchmarks, across all geographical regions, loss functions, and both pre- and post-Covid periods. We translate forecast gains to economic impacts for this on-demand service platform by computing financial gains and reductions in computing costs."}, "https://arxiv.org/abs/2309.13251": {"title": "Nonparametric estimation of conditional densities by generalized random forests", "link": "https://arxiv.org/abs/2309.13251", "description": "arXiv:2309.13251v3 Announce Type: replace \nAbstract: Considering a continuous random variable $Y$ together with a continuous random vector $X$, I propose a nonparametric estimator $\\hat{f}(\\cdot|x)$ for the conditional density of $Y$ given $X=x$. This estimator takes the form of an exponential series whose coefficients $\\hat{\\theta}_{x}=(\\hat{\\theta}_{x,1}, \\dots,\\hat{\\theta}_{x,J})$ are the solution of a system of nonlinear equations that depends on an estimator of the conditional expectation $E[\\phi (Y)|X=x]$, where $\\phi$ is a $J$-dimensional vector of basis functions. The distinguishing feature of the proposed estimator is that $E[\\phi(Y)|X=x]$ is estimated by generalized random forest (Athey, Tibshirani, and Wager, Annals of Statistics, 2019), targeting the heterogeneity of $\\hat{\\theta}_{x}$ across $x$. I show that $\\hat{f}(\\cdot|x)$ is uniformly consistent and asymptotically normal, allowing $J \\rightarrow \\infty$. I also provide a standard error formula to construct asymptotically valid confidence intervals. 
Results from Monte Carlo experiments and an empirical illustration are provided."}, "https://arxiv.org/abs/2104.11702": {"title": "Correlated Dynamics in Marketing Sensitivities", "link": "https://arxiv.org/abs/2104.11702", "description": "arXiv:2104.11702v2 Announce Type: replace-cross \nAbstract: Understanding individual customers' sensitivities to prices, promotions, brands, and other marketing mix elements is fundamental to a wide swath of marketing problems. An important but understudied aspect of this problem is the dynamic nature of these sensitivities, which change over time and vary across individuals. Prior work has developed methods for capturing such dynamic heterogeneity within product categories, but neglected the possibility of correlated dynamics across categories. In this work, we introduce a framework to capture such correlated dynamics using a hierarchical dynamic factor model, where individual preference parameters are influenced by common cross-category dynamic latent factors, estimated through Bayesian nonparametric Gaussian processes. We apply our model to grocery purchase data, and find that a surprising degree of dynamic heterogeneity can be accounted for by only a few global trends. We also characterize the patterns in how consumers' sensitivities evolve across categories. Managerially, the proposed framework not only enhances predictive accuracy by leveraging cross-category data, but enables more precise estimation of quantities of interest, like price elasticity."}, "https://arxiv.org/abs/2406.00196": {"title": "A Seamless Phase II/III Design with Dose Optimization for Oncology Drug Development", "link": "https://arxiv.org/abs/2406.00196", "description": "arXiv:2406.00196v1 Announce Type: new \nAbstract: The US FDA's Project Optimus initiative that emphasizes dose optimization prior to marketing approval represents a pivotal shift in oncology drug development. It has a ripple effect for rethinking what changes may be made to conventional pivotal trial designs to incorporate a dose optimization component. Aligned with this initiative, we propose a novel Seamless Phase II/III Design with Dose Optimization (SDDO framework). The proposed design starts with dose optimization in a randomized setting, leading to an interim analysis focused on optimal dose selection, trial continuation decisions, and sample size re-estimation (SSR). Based on the decision at interim analysis, patient enrollment continues for both the selected dose arm and control arm, and the significance of treatment effects will be determined at final analysis. The SDDO framework offers increased flexibility and cost-efficiency through sample size adjustment, while stringently controlling the Type I error. This proposed design also facilitates both Accelerated Approval (AA) and regular approval in a \"one-trial\" approach. Extensive simulation studies confirm that our design reliably identifies the optimal dosage and makes preferable decisions with a reduced sample size while retaining statistical power."}, "https://arxiv.org/abs/2406.00245": {"title": "Model-based Clustering of Zero-Inflated Single-Cell RNA Sequencing Data via the EM Algorithm", "link": "https://arxiv.org/abs/2406.00245", "description": "arXiv:2406.00245v1 Announce Type: new \nAbstract: Biological cells can be distinguished by their phenotype or at the molecular level, based on their genome, epigenome, and transcriptome. 
This paper focuses on the transcriptome, which encompasses all the RNA transcripts in a given cell population, indicating the genes being expressed at a given time. We consider single-cell RNA sequencing data and develop a novel model-based clustering method to group cells based on their transcriptome profiles. Our clustering approach takes into account the presence of zero inflation in the data, which can occur due to genuine biological zeros or technological noise. The proposed model for clustering involves a mixture of zero-inflated Poisson or zero-inflated negative binomial distributions, and parameter estimation is carried out using the EM algorithm. We evaluate the performance of our proposed methodology through simulation studies and analyses of publicly available datasets."}, "https://arxiv.org/abs/2406.00322": {"title": "Adaptive Penalized Likelihood method for Markov Chains", "link": "https://arxiv.org/abs/2406.00322", "description": "arXiv:2406.00322v1 Announce Type: new \nAbstract: Maximum Likelihood Estimation (MLE) and Likelihood Ratio Test (LRT) are widely used methods for estimating the transition probability matrix in Markov chains and identifying significant relationships between transitions, such as equality. However, the estimated transition probability matrix derived from MLE lacks accuracy compared to the real one, and LRT is inefficient in high-dimensional Markov chains. In this study, we extend the adaptive Lasso technique from linear models to Markov chains and propose a novel model by applying penalized maximum likelihood estimation to optimize the estimation of the transition probability matrix. We also demonstrate that the new model enjoys oracle properties, which means the estimated transition probability matrix performs as well as if the true matrix were given. Simulations show that our new method behaves very well overall in comparison with various competitors. Real data analysis further confirms the value of our proposed method."}, "https://arxiv.org/abs/2406.00442": {"title": "Optimizing hydrogen and e-methanol production through Power-to-X integration in biogas plants", "link": "https://arxiv.org/abs/2406.00442", "description": "arXiv:2406.00442v1 Announce Type: new \nAbstract: The European Union strategy for net-zero emissions relies on developing hydrogen and electrofuels infrastructure. These fuels will be crucial as energy carriers and balancing agents for renewable energy variability. Large-scale production requires more renewable capacity, and various Power-to-X (PtX) concepts are emerging in renewable-rich countries. However, sourcing renewable carbon to scale carbon-based electrofuels is a significant challenge. This study explores a PtX hub that sources renewable CO2 from biogas plants, integrating renewable energy, hydrogen production, and methanol synthesis on site. This concept creates an internal market for energy and materials, interfacing with the external energy system. The size and operation of the PtX hub were optimized, considering integration with local energy systems and a potential hydrogen grid. The levelized costs of hydrogen and methanol were estimated for a 2030 start, considering new legislation on renewable fuels of non-biological origin (RFNBOs). Our results show the PtX hub can rely mainly on on-site renewable energy, selling excess electricity to the grid. A local hydrogen grid connection improves operations, and the behind-the-meter market lowers energy prices, buffering against market variability. 
We found methanol costs could be below 650 euros per ton and hydrogen production costs below 3 euros per kg, with standalone methanol plants costing 23 per cent more. The ratio of CO2 recovery to methanol production is crucial, with over 90 per cent recovery requiring significant investment in CO2 and H2 storage. Overall, our findings support planning PtX infrastructures integrated with the agricultural sector as a cost-effective way to access renewable carbon."}, "https://arxiv.org/abs/2406.00472": {"title": "Financial Deepening and Economic Growth in Select Emerging Markets with Currency Board Systems: Theory and Evidence", "link": "https://arxiv.org/abs/2406.00472", "description": "arXiv:2406.00472v1 Announce Type: new \nAbstract: This paper investigates some indicators of financial development in select countries with currency board systems and raises some questions about the connection between financial development and growth in currency board systems. Most of those cases are long-past episodes of what we would now call emerging markets. However, the paper also looks at Hong Kong, the currency board system that is one of the world's largest and most advanced financial markets. The global financial crisis of 2008-09 created doubts about the efficiency of financial markets in advanced economies, including in Hong Kong, and unsettled the previous consensus that a large financial sector would be more stable than a smaller one."}, "https://arxiv.org/abs/2406.00493": {"title": "Assessment of Case Influence in the Lasso with a Case-weight Adjusted Solution Path", "link": "https://arxiv.org/abs/2406.00493", "description": "arXiv:2406.00493v1 Announce Type: new \nAbstract: We study case influence in the Lasso regression using Cook's distance, which measures the overall change in the fitted values when one observation is deleted. Unlike in ordinary least squares regression, the estimated coefficients in the Lasso do not have a closed form due to the nondifferentiability of the $\\ell_1$ penalty, and neither does Cook's distance. To find the case-deleted Lasso solution without refitting the model, we approach it from the full data solution by introducing a weight parameter ranging from 1 to 0 and generating a solution path indexed by this parameter. We show that the solution path is piecewise linear with respect to a simple function of the weight parameter under a fixed penalty. The resulting case influence is a function of the penalty and weight, and it becomes Cook's distance when the weight is 0. As the penalty parameter changes, selected variables change, and the magnitude of Cook's distance for the same data point may vary with the subset of variables selected. In addition, we introduce a case influence graph to visualize how the contribution of each data point changes with the penalty parameter. From the graph, we can identify influential points at different penalty levels and make modeling decisions accordingly. 
Moreover, we find that case influence graphs exhibit different patterns between underfitting and overfitting phases, which can provide additional information for model selection."}, "https://arxiv.org/abs/2406.00549": {"title": "Zero Inflation as a Missing Data Problem: a Proxy-based Approach", "link": "https://arxiv.org/abs/2406.00549", "description": "arXiv:2406.00549v1 Announce Type: new \nAbstract: A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data).\n Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated.\n This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, given the proxy-indicator relationship is known.\n If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using methods in Duarte et al.[2023].\n We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs)."}, "https://arxiv.org/abs/2406.00650": {"title": "Cluster-robust jackknife and bootstrap inference for binary response models", "link": "https://arxiv.org/abs/2406.00650", "description": "arXiv:2406.00650v1 Announce Type: new \nAbstract: We study cluster-robust inference for binary response models. Inference based on the most commonly-used cluster-robust variance matrix estimator (CRVE) can be very unreliable. We study several alternatives. Conceptually the simplest of these, but also the most computationally demanding, involves jackknifing at the cluster level. We also propose a linearized version of the cluster-jackknife variance matrix estimator as well as linearized versions of the wild cluster bootstrap. The linearizations are based on empirical scores and are computationally efficient. Throughout we use the logit model as a leading example. We also discuss a new Stata software package called logitjack which implements these procedures. 
Simulation results strongly favor the new methods, and two empirical examples suggest that it can be important to use them in practice."}, "https://arxiv.org/abs/2406.00700": {"title": "On the modelling and prediction of high-dimensional functional time series", "link": "https://arxiv.org/abs/2406.00700", "description": "arXiv:2406.00700v1 Announce Type: new \nAbstract: We propose a two-step procedure to model and predict high-dimensional functional time series, where the number of function-valued time series $p$ is large in relation to the length of time series $n$. Our first step performs an eigenanalysis of a positive definite matrix, which leads to a one-to-one linear transformation for the original high-dimensional functional time series, and the transformed curve series can be segmented into several groups such that any two subseries from any two different groups are uncorrelated both contemporaneously and serially. Consequently in our second step those groups are handled separately without the information loss on the overall linear dynamic structure. The second step is devoted to establishing a finite-dimensional dynamical structure for all the transformed functional time series within each group. Furthermore the finite-dimensional structure is represented by that of a vector time series. Modelling and forecasting for the original high-dimensional functional time series are realized via those for the vector time series in all the groups. We investigate the theoretical properties of our proposed methods, and illustrate the finite-sample performance through both extensive simulation and two real datasets."}, "https://arxiv.org/abs/2406.00730": {"title": "Assessing survival models by interval testing", "link": "https://arxiv.org/abs/2406.00730", "description": "arXiv:2406.00730v1 Announce Type: new \nAbstract: When considering many survival models, decisions become more challenging in health economic evaluation. In this paper, we present a set of methods to assist with selecting the most appropriate survival models. The methods highlight areas of particularly poor fit. Furthermore, plots and overall p-values provide guidance on whether a survival model should be rejected or not."}, "https://arxiv.org/abs/2406.00804": {"title": "On the Addams family of discrete frailty distributions for modelling multivariate case I interval-censored data", "link": "https://arxiv.org/abs/2406.00804", "description": "arXiv:2406.00804v1 Announce Type: new \nAbstract: Random effect models for time-to-event data, also known as frailty models, provide a conceptually appealing way of quantifying association between survival times and of representing heterogeneities resulting from factors which may be difficult or impossible to measure. In the literature, the random effect is usually assumed to have a continuous distribution. However, in some areas of application, discrete frailty distributions may be more appropriate. The present paper is about the implementation and interpretation of the Addams family of discrete frailty distributions. We propose methods of estimation for this family of densities in the context of shared frailty models for the hazard rates for case I interval-censored data. Our optimization framework allows for stratification of random effect distributions by covariates. We highlight interpretational advantages of the Addams family of discrete frailty distributions and the K-point distribution as compared to other frailty distributions. 
A unique feature of the Addams family and the K-point distribution is that the support of the frailty distribution depends on its parameters. This feature is best exploited by imposing a model on the distributional parameters, resulting in a model with non-homogeneous covariate effects that can be analysed using standard measures such as the hazard ratio. Our methods are illustrated with applications to multivariate case I interval-censored infection data."}, "https://arxiv.org/abs/2406.00827": {"title": "LaLonde (1986) after Nearly Four Decades: Lessons Learned", "link": "https://arxiv.org/abs/2406.00827", "description": "arXiv:2406.00827v1 Announce Type: new \nAbstract: In 1986, Robert LaLonde published an article that compared nonexperimental estimates to experimental benchmarks LaLonde (1986). He concluded that the nonexperimental methods at the time could not systematically replicate experimental benchmarks, casting doubt on the credibility of these methods. Following LaLonde's critical assessment, there have been significant methodological advances and practical changes, including (i) an emphasis on estimators based on unconfoundedness, (ii) a focus on the importance of overlap in covariate distributions, (iii) the introduction of propensity score-based methods leading to doubly robust estimators, (iv) a greater emphasis on validation exercises to bolster research credibility, and (v) methods for estimating and exploiting treatment effect heterogeneity. To demonstrate the practical lessons from these advances, we reexamine the LaLonde data and the Imbens-Rubin-Sacerdote lottery data. We show that modern methods, when applied in contexts with significant covariate overlap, yield robust estimates for the adjusted differences between the treatment and control groups. However, this does not mean that these estimates are valid. To assess their credibility, validation exercises (such as placebo tests) are essential, whereas goodness of fit tests alone are inadequate. Our findings highlight the importance of closely examining the assignment process, carefully inspecting overlap, and conducting validation exercises when analyzing causal effects with nonexperimental data."}, "https://arxiv.org/abs/2406.00866": {"title": "Planning for Gold: Sample Splitting for Valid Powerful Design of Observational Studies", "link": "https://arxiv.org/abs/2406.00866", "description": "arXiv:2406.00866v1 Announce Type: new \nAbstract: Observational studies are valuable tools for inferring causal effects in the absence of controlled experiments. However, these studies may be biased due to the presence of some relevant, unmeasured set of covariates. The design of an observational study has a prominent effect on its sensitivity to hidden biases, and the best design may not be apparent without examining the data. One approach to facilitate a data-inspired design is to split the sample into a planning sample for choosing the design and an analysis sample for making inferences. We devise a powerful and flexible method for selecting outcomes in the planning sample when an unknown number of outcomes are affected by the treatment. We investigate the theoretical properties of our method and conduct extensive simulations that demonstrate pronounced benefits, especially at higher levels of allowance for unmeasured confounding. 
Finally, we demonstrate our method in an observational study of the multi-dimensional impacts of a devastating flood in Bangladesh."}, "https://arxiv.org/abs/2406.00906": {"title": "A Bayesian Generalized Bridge Regression Approach to Covariance Estimation in the Presence of Covariates", "link": "https://arxiv.org/abs/2406.00906", "description": "arXiv:2406.00906v1 Announce Type: new \nAbstract: A hierarchical Bayesian approach that permits simultaneous inference for the regression coefficient matrix and the error precision (inverse covariance) matrix in the multivariate linear model is proposed. Assuming a natural ordering of the elements of the response, the precision matrix is reparameterized so it can be estimated with univariate-response linear regression techniques. A novel generalized bridge regression prior that accommodates both sparse and dense settings and is competitive with alternative methods for univariate-response regression is proposed and used in this framework. Two component-wise Markov chain Monte Carlo algorithms are developed for sampling, including a data augmentation algorithm based on a scale mixture of normals representation. Numerical examples demonstrate that the proposed method is competitive with comparable joint mean-covariance models, particularly in estimation of the precision matrix. The method is also used to estimate the 253 by 253 precision matrices of two classes of spectra extracted from images taken by the Hubble Space Telescope. Some interesting structural patterns in the estimates are discussed."}, "https://arxiv.org/abs/2406.00930": {"title": "A class of sequential multi-hypothesis tests", "link": "https://arxiv.org/abs/2406.00930", "description": "arXiv:2406.00930v1 Announce Type: new \nAbstract: In this paper, we deal with sequential testing of multiple hypotheses. In the general scheme of construction of optimal tests based on backward induction, we propose a modification which provides a simplified (generally speaking, suboptimal) version of the optimal test, for any particular criterion of optimization. We call this the DBC version (the one with Dropped Backward Control) of the optimal test. In particular, for the case of two simple hypotheses, dropping backward control in the Bayesian test produces the classical sequential probability ratio test (SPRT). Similarly, dropping backward control in the modified Kiefer-Weiss solutions produces Lorden's 2-SPRTs.\n In the case of more than two hypotheses, we obtain in this way new classes of sequential multi-hypothesis tests, and investigate their properties. The efficiency of the DBC-tests is evaluated with respect to the optimal Bayesian multi-hypothesis test and with respect to the matrix sequential probability ratio test (MSPRT) by Armitage. In a multihypothesis variant of the Kiefer-Weiss problem for binomial proportions, the performance of the DBC-test is numerically compared with that of the exact solution. In a model of normal observations with a linear trend, the performance of the DBC-test is numerically compared with that of the MSPRT. 
Some other numerical examples are presented.\n In all the cases the proposed tests exhibit a very high efficiency with respect to the optimal tests (more than 99.3\\% when sampling from Bernoulli populations) and/or with respect to the MSPRT (even outperforming the latter in some scenarios)."}, "https://arxiv.org/abs/2406.00940": {"title": "Measurement Error-Robust Causal Inference via Constructed Instrumental Variables", "link": "https://arxiv.org/abs/2406.00940", "description": "arXiv:2406.00940v1 Announce Type: new \nAbstract: Measurement error can often be harmful when estimating causal effects. Two scenarios in which this is the case are in the estimation of (a) the average treatment effect when confounders are measured with error and (b) the natural indirect effect when the exposure and/or confounders are measured with error. Methods adjusting for measurement error typically require external data or knowledge about the measurement error distribution. Here, we propose methodology not requiring any such information. Instead, we show that when the outcome regression is linear in the error-prone variables, consistent estimation of these causal effects can be recovered using constructed instrumental variables under certain conditions. These variables, which are functions of only the observed data, behave like instrumental variables for the error-prone variables. Using data from a study of the effects of prenatal exposure to heavy metals on growth and neurodevelopment in Bangladeshi mother-infant pairs, we apply our methodology to estimate (a) the effect of lead exposure on birth length while controlling for maternal protein intake, and (b) lead exposure's role in mediating the effect of maternal protein intake on birth length. Protein intake is calculated from food journal entries, and is suspected to be highly prone to measurement error."}, "https://arxiv.org/abs/2406.00941": {"title": "A Robust Residual-Based Test for Structural Changes in Factor Models", "link": "https://arxiv.org/abs/2406.00941", "description": "arXiv:2406.00941v1 Announce Type: new \nAbstract: In this paper, we propose an easy-to-implement residual-based specification testing procedure for detecting structural changes in factor models, which is powerful against both smooth and abrupt structural changes with unknown break dates. The proposed test is robust against the over-specified number of factors, and serially and cross-sectionally correlated error processes. A new central limit theorem is given for the quadratic forms of panel data with dependence over both dimensions, thereby filling a gap in the literature. We establish the asymptotic properties of the proposed test statistic, and accordingly develop a simulation-based scheme to select critical value in order to improve finite sample performance. Through extensive simulations and a real-world application, we confirm our theoretical results and demonstrate that the proposed test exhibits desirable size and power in practice."}, "https://arxiv.org/abs/2406.01002": {"title": "Random Subspace Local Projections", "link": "https://arxiv.org/abs/2406.01002", "description": "arXiv:2406.01002v1 Announce Type: new \nAbstract: We show how random subspace methods can be adapted to estimating local projections with many controls. Random subspace methods have their roots in the machine learning literature and are implemented by averaging over regressions estimated over different combinations of subsets of these controls. 
We document three key results: (i) Our approach can successfully recover the impulse response functions across Monte Carlo experiments representative of different macroeconomic settings and identification schemes. (ii) Our results suggest that random subspace methods are more accurate than other dimension reduction methods if the underlying large dataset has a factor structure similar to typical macroeconomic datasets such as FRED-MD. (iii) Our approach leads to differences in the estimated impulse response functions relative to benchmark methods when applied to two widely studied empirical applications."}, "https://arxiv.org/abs/2406.01218": {"title": "Sequential FDR and pFDR control under arbitrary dependence, with application to pharmacovigilance database monitoring", "link": "https://arxiv.org/abs/2406.01218", "description": "arXiv:2406.01218v1 Announce Type: new \nAbstract: We propose sequential multiple testing procedures which control the false discovery rate (FDR) or the positive false discovery rate (pFDR) under arbitrary dependence between the data streams. This is accomplished by \"optimizing\" an upper bound on these error metrics for a class of step-down sequential testing procedures. Both open-ended and truncated versions of these sequential procedures are given, both being able to control the type I multiple testing metric (FDR or pFDR) at specified levels, and the former being able to control both the type I and type II metrics (e.g., FDR and the false nondiscovery rate, FNR). In simulation studies, these procedures provide 45-65% savings in average sample size over their fixed-sample competitors. We illustrate our procedures on drug data from the United Kingdom's Yellow Card Pharmacovigilance Database."}, "https://arxiv.org/abs/2406.01242": {"title": "Multiple Comparison Procedures for Simultaneous Inference in Functional MANOVA", "link": "https://arxiv.org/abs/2406.01242", "description": "arXiv:2406.01242v1 Announce Type: new \nAbstract: Functional data analysis is becoming increasingly popular for studying data from real-valued random functions. Nevertheless, there is a lack of multiple testing procedures for such data. These are particularly important in factorial designs to compare different groups or to infer factor effects. We propose a new class of testing procedures for arbitrary linear hypotheses in general factorial designs with functional data. Our methods allow global as well as multiple inference of both univariate and multivariate mean functions without assuming particular error distributions or homoscedasticity. That is, we allow for different structures of the covariance functions between groups. To this end, we use point-wise quadratic-form-type test functions that take potential heteroscedasticity into account. Taking the supremum over each test function, we define a class of local test statistics. We analyse their (joint) asymptotic behaviour and propose a resampling approach to approximate the limit distributions. The resulting global and multiple testing procedures are asymptotically valid under weak conditions and applicable in general functional MANOVA settings. 
We evaluate their small-sample performance in extensive simulations and finally illustrate their applicability by analysing a multivariate functional air pollution data set."}, "https://arxiv.org/abs/2406.01557": {"title": "Bayesian compositional regression with flexible microbiome feature aggregation and selection", "link": "https://arxiv.org/abs/2406.01557", "description": "arXiv:2406.01557v1 Announce Type: new \nAbstract: Ongoing advances in microbiome profiling have allowed unprecedented insights into the molecular activities of microbial communities. This has fueled a strong scientific interest in understanding the critical role the microbiome plays in governing human health, by identifying microbial features associated with clinical outcomes of interest. Several aspects of microbiome data limit the applicability of existing variable selection approaches. In particular, microbiome data are high-dimensional, extremely sparse, and compositional. Importantly, many of the observed features, although categorized as different taxa, may play related functional roles. To address these challenges, we propose a novel compositional regression approach that leverages the data-adaptive clustering and variable selection properties of the spiked Dirichlet process to identify taxa that exhibit similar functional roles. Our proposed method, Bayesian Regression with Agglomerated Compositional Effects using a dirichLET process (BRACElet), enables the identification of a sparse set of features with shared impacts on the outcome, facilitating dimension reduction and model interpretation. We demonstrate that BRACElet outperforms existing approaches for microbiome variable selection through simulation studies and an application elucidating the impact of oral microbiome composition on insulin resistance."}, "https://arxiv.org/abs/2406.00128": {"title": "Matrix-valued Factor Model with Time-varying Main Effects", "link": "https://arxiv.org/abs/2406.00128", "description": "arXiv:2406.00128v1 Announce Type: cross \nAbstract: We introduce the matrix-valued time-varying Main Effects Factor Model (MEFM). MEFM is a generalization to the traditional matrix-valued factor model (FM). We give rigorous definitions of MEFM and its identifications, and propose estimators for the time-varying grand mean, row and column main effects, and the row and column factor loading matrices for the common component. Rates of convergence for different estimators are spelt out, with asymptotic normality shown. The core rank estimator for the common component is also proposed, with consistency of the estimators presented. We propose a test for testing if FM is sufficient against the alternative that MEFM is necessary, and demonstrate the power of such a test in various simulation settings. We also demonstrate numerically the accuracy of our estimators in extended simulation experiments. A set of NYC Taxi traffic data is analysed and our test suggests that MEFM is indeed necessary for analysing the data against a traditional FM."}, "https://arxiv.org/abs/2406.00317": {"title": "Combining Experimental and Historical Data for Policy Evaluation", "link": "https://arxiv.org/abs/2406.00317", "description": "arXiv:2406.00317v1 Announce Type: cross \nAbstract: This paper studies policy evaluation with multiple data sources, especially in scenarios that involve one experimental dataset with two arms, complemented by a historical dataset generated under a single control arm. 
We propose novel data integration methods that linearly integrate base policy value estimators constructed based on the experimental and historical data, with weights optimized to minimize the mean square error (MSE) of the resulting combined estimator. We further apply the pessimistic principle to obtain more robust estimators, and extend these developments to sequential decision making. Theoretically, we establish non-asymptotic error bounds for the MSEs of our proposed estimators, and derive their oracle, efficiency and robustness properties across a broad spectrum of reward shift scenarios. Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators."}, "https://arxiv.org/abs/2406.00326": {"title": "Far beyond day-ahead with econometric models for electricity price forecasting", "link": "https://arxiv.org/abs/2406.00326", "description": "arXiv:2406.00326v1 Announce Type: cross \nAbstract: The surge in global energy prices during the recent energy crisis, which peaked in 2022, has intensified the need for mid-term to long-term forecasting for hedging and valuation purposes. This study analyzes the statistical predictability of power prices before, during, and after the energy crisis, using econometric models with an hourly resolution. To stabilize the model estimates, we define fundamentally derived coefficient bounds. We provide an in-depth analysis of the unit root behavior of the power price series, showing that the long-term stochastic trend is explained by the prices of commodities used as fuels for power generation: gas, coal, oil, and emission allowances (EUA). However, as the forecasting horizon increases, spurious effects become extremely relevant, leading to highly significant but economically meaningless results. To mitigate these spurious effects, we propose the \"current\" model: estimating the current same-day relationship between power prices and their regressors and projecting this relationship into the future. This flexible and interpretable method is applied to hourly German day-ahead power prices for forecasting horizons up to one year ahead, utilizing a combination of regularized regression methods and generalized additive models."}, "https://arxiv.org/abs/2406.00394": {"title": "Learning Causal Abstractions of Linear Structural Causal Models", "link": "https://arxiv.org/abs/2406.00394", "description": "arXiv:2406.00394v1 Announce Type: cross \nAbstract: The need for modelling causal knowledge at different levels of granularity arises in several settings. Causal Abstraction provides a framework for formalizing this problem by relating two Structural Causal Models at different levels of detail. Despite increasing interest in applying causal abstraction, e.g. in the interpretability of large machine learning models, the graphical and parametrical conditions under which a causal model can abstract another are not known. Furthermore, learning causal abstractions from data is still an open problem. In this work, we tackle both issues for linear causal models with linear abstraction functions. First, we characterize how the low-level coefficients and the abstraction function determine the high-level coefficients and how the high-level model constrains the causal ordering of low-level variables. Then, we apply our theoretical results to learn high-level and low-level causal models and their abstraction function from observational data. 
In particular, we introduce Abs-LiNGAM, a method that leverages the constraints induced by the learned high-level model and the abstraction function to speed up the recovery of the larger low-level model, under the assumption of non-Gaussian noise terms. In simulated settings, we show the effectiveness of learning causal abstractions from data and the potential of our method in improving scalability of causal discovery."}, "https://arxiv.org/abs/2406.00535": {"title": "Causal Contrastive Learning for Counterfactual Regression Over Time", "link": "https://arxiv.org/abs/2406.00535", "description": "arXiv:2406.00535v1 Announce Type: cross \nAbstract: Estimating treatment effects over time holds significance in various domains, including precision medicine, epidemiology, economics, and marketing. This paper introduces a unique approach to counterfactual regression over time, emphasizing long-term predictions. Distinguishing itself from existing models like Causal Transformer, our approach highlights the efficacy of employing RNNs for long-term forecasting, complemented by Contrastive Predictive Coding (CPC) and Information Maximization (InfoMax). Emphasizing efficiency, we avoid the need for computationally expensive transformers. Leveraging CPC, our method captures long-term dependencies in the presence of time-varying confounders. Notably, recent models have disregarded the importance of invertible representation, compromising identification assumptions. To remedy this, we employ the InfoMax principle, maximizing a lower bound of mutual information between sequence data and its representation. Our method achieves state-of-the-art counterfactual estimation results using both synthetic and real-world data, marking the pioneering incorporation of Contrastive Predictive Coding in causal inference."}, "https://arxiv.org/abs/2406.00610": {"title": "Portfolio Optimization with Robust Covariance and Conditional Value-at-Risk Constraints", "link": "https://arxiv.org/abs/2406.00610", "description": "arXiv:2406.00610v1 Announce Type: cross \nAbstract: The measure of portfolio risk is an important input of the Markowitz framework. In this study, we explored various methods to obtain robust covariance estimators that are less susceptible to financial data noise. We evaluated the performance of a large-cap portfolio using various forms of Ledoit Shrinkage Covariance and Robust Gerber Covariance matrix during the period of 2012 to 2022. Out-of-sample performance indicates that robust covariance estimators can outperform the market capitalization-weighted benchmark portfolio, particularly during bull markets. The Gerber covariance with Mean-Absolute-Deviation (MAD) emerged as the top performer. However, robust estimators do not manage tail risk well under extreme market conditions, for example, during the Covid-19 period. When we aim to control for tail risk, we should add a constraint on Conditional Value-at-Risk (CVaR) to make a more conservative decision on risk exposure. Additionally, we incorporated the unsupervised clustering algorithm K-means into the optimization algorithm (i.e. Nested Clustering Optimization, NCO). 
It not only helps mitigate the numerical instability of the optimization algorithm, but also contributes to lower drawdown."}, "https://arxiv.org/abs/2406.00611": {"title": "DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation", "link": "https://arxiv.org/abs/2406.00611", "description": "arXiv:2406.00611v1 Announce Type: cross \nAbstract: Designing faithful yet accurate AI models is challenging, particularly in the field of individual treatment effect estimation (ITE). ITE prediction models deployed in critical settings such as healthcare should ideally (i) be accurate and (ii) provide faithful explanations. However, current solutions are inadequate: state-of-the-art black-box models do not supply explanations, post-hoc explainers for black-box models lack faithfulness guarantees, and self-interpretable models greatly compromise accuracy. To address these issues, we propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample. A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples. We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space. We evaluate DISCRET on diverse tasks involving tabular, image, and text data. DISCRET outperforms the best self-interpretable models and has accuracy comparable to the best black-box models while providing faithful explanations. DISCRET is available at https://github.com/wuyinjun-1993/DISCRET-ICML2024."}, "https://arxiv.org/abs/2406.00701": {"title": "Profiled Transfer Learning for High Dimensional Linear Model", "link": "https://arxiv.org/abs/2406.00701", "description": "arXiv:2406.00701v1 Announce Type: cross \nAbstract: We develop here a novel transfer learning methodology called Profiled Transfer Learning (PTL). The method is based on the \\textit{approximate-linear} assumption between the source and target parameters. Compared with the commonly assumed \\textit{vanishing-difference} assumption and \\textit{low-rank} assumption in the literature, the \\textit{approximate-linear} assumption is more flexible and less stringent. Specifically, the PTL estimator is constructed in two major steps. Firstly, we regress the response on the transferred feature, leading to the profiled responses. Subsequently, we learn the regression relationship between profiled responses and the covariates on the target data. The final estimator is then assembled based on the \\textit{approximate-linear} relationship. To theoretically support the PTL estimator, we derive the non-asymptotic upper bound and minimax lower bound. We find that the PTL estimator is minimax optimal under appropriate regularity conditions. Extensive simulation studies are presented to demonstrate the finite sample performance of the new method. A real data example about sentence prediction is also presented with very encouraging results."}, "https://arxiv.org/abs/2406.00713": {"title": "Logistic Variational Bayes Revisited", "link": "https://arxiv.org/abs/2406.00713", "description": "arXiv:2406.00713v1 Announce Type: cross \nAbstract: Variational logistic regression is a popular method for approximate Bayesian inference seeing widespread use in many areas of machine learning, including Bayesian optimization, reinforcement learning and multi-instance learning. 
However, due to the intractability of the Evidence Lower Bound, authors have turned to the use of Monte Carlo, quadrature or bounds to perform inference, methods which are costly or give poor approximations to the true posterior.\n In this paper we introduce a new bound for the expectation of softplus function and subsequently show how this can be applied to variational logistic regression and Gaussian process classification. Unlike other bounds, our proposal does not rely on extending the variational family, or introducing additional parameters to ensure the bound is tight. In fact, we show that this bound is tighter than the state-of-the-art, and that the resulting variational posterior achieves state-of-the-art performance, whilst being significantly faster to compute than Monte-Carlo methods."}, "https://arxiv.org/abs/2406.00778": {"title": "Bayesian Joint Additive Factor Models for Multiview Learning", "link": "https://arxiv.org/abs/2406.00778", "description": "arXiv:2406.00778v1 Announce Type: cross \nAbstract: It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. We propose a joint additive factor regression model (JAFAR) with a structured additive design, accounting for shared and view-specific components. We ensure identifiability via a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide an efficient implementation via a partially collapsed Gibbs sampler and extend our approach to allow flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar."}, "https://arxiv.org/abs/2406.00853": {"title": "A Tutorial on Doubly Robust Learning for Causal Inference", "link": "https://arxiv.org/abs/2406.00853", "description": "arXiv:2406.00853v1 Announce Type: cross \nAbstract: Doubly robust learning offers a robust framework for causal inference from observational data by integrating propensity score and outcome modeling. Despite its theoretical appeal, practical adoption remains limited due to perceived complexity and inaccessible software. This tutorial aims to demystify doubly robust methods and demonstrate their application using the EconML package. We provide an introduction to causal inference, discuss the principles of outcome modeling and propensity scores, and illustrate the doubly robust approach through simulated case studies. 
By simplifying the methodology and offering practical coding examples, we intend to make doubly robust learning accessible to researchers and practitioners in data science and statistics."}, "https://arxiv.org/abs/2406.00998": {"title": "Distributional Refinement Network: Distributional Forecasting via Deep Learning", "link": "https://arxiv.org/abs/2406.00998", "description": "arXiv:2406.00998v1 Announce Type: cross \nAbstract: A key task in actuarial modelling involves modelling the distributional properties of losses. Classic (distributional) regression approaches like Generalized Linear Models (GLMs; Nelder and Wedderburn, 1972) are commonly used, but challenges remain in developing models that can (i) allow covariates to flexibly impact different aspects of the conditional distribution, (ii) integrate developments in machine learning and AI to maximise the predictive power while considering (i), and, (iii) maintain a level of interpretability in the model to enhance trust in the model and its outputs, which is often compromised in efforts pursuing (i) and (ii). We tackle this problem by proposing a Distributional Refinement Network (DRN), which combines an inherently interpretable baseline model (such as GLMs) with a flexible neural network-a modified Deep Distribution Regression (DDR; Li et al., 2019) method. Inspired by the Combined Actuarial Neural Network (CANN; Schelldorfer and W{\\''u}thrich, 2019), our approach flexibly refines the entire baseline distribution. As a result, the DRN captures varying effects of features across all quantiles, improving predictive performance while maintaining adequate interpretability. Using both synthetic and real-world data, we demonstrate the DRN's superior distributional forecasting capacity. The DRN has the potential to be a powerful distributional regression model in actuarial science and beyond."}, "https://arxiv.org/abs/2406.01259": {"title": "Aging modeling and lifetime prediction of a proton exchange membrane fuel cell using an extended Kalman filter", "link": "https://arxiv.org/abs/2406.01259", "description": "arXiv:2406.01259v1 Announce Type: cross \nAbstract: This article presents a methodology that aims to model and to provide predictive capabilities for the lifetime of Proton Exchange Membrane Fuel Cell (PEMFC). The approach integrates parametric identification, dynamic modeling, and Extended Kalman Filtering (EKF). The foundation is laid with the creation of a representative aging database, emphasizing specific operating conditions. Electrochemical behavior is characterized through the identification of critical parameters. The methodology extends to capture the temporal evolution of the identified parameters. We also address challenges posed by the limiting current density through a differential analysis-based modeling technique and the detection of breakpoints. This approach, involving Monte Carlo simulations, is coupled with an EKF for predicting voltage degradation. The Remaining Useful Life (RUL) is also estimated. The results show that our approach accurately predicts future voltage and RUL with very low relative errors."}, "https://arxiv.org/abs/2111.13226": {"title": "A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment", "link": "https://arxiv.org/abs/2111.13226", "description": "arXiv:2111.13226v4 Announce Type: replace \nAbstract: Causal inference grows increasingly complex as the number of confounders increases. 
Given treatments $X$, confounders $Z$ and outcomes $Y$, we develop a non-parametric method to test the \\textit{do-null} hypothesis $H_0:\\; p(y|\\text{\\it do}(X=x))=p(y)$ against the general alternative. Building on the Hilbert Schmidt Independence Criterion (HSIC) for marginal independence testing, we propose backdoor-HSIC (bd-HSIC) and demonstrate that it is calibrated and has power for both binary and continuous treatments under a large number of confounders. Additionally, we establish convergence properties of the estimators of covariance operators used in bd-HSIC. We investigate the advantages and disadvantages of bd-HSIC against parametric tests as well as the importance of using do-null testing in contrast to marginal independence testing or conditional independence testing. A complete implementation can be found at \\hyperlink{https://github.com/MrHuff/kgformula}{\\texttt{https://github.com/MrHuff/kgformula}}."}, "https://arxiv.org/abs/2302.03996": {"title": "High-Dimensional Granger Causality for Climatic Attribution", "link": "https://arxiv.org/abs/2302.03996", "description": "arXiv:2302.03996v2 Announce Type: replace \nAbstract: In this paper we test for Granger causality in high-dimensional vector autoregressive models (VARs) to disentangle and interpret the complex causal chains linking radiative forcings and global temperatures. By allowing for high dimensionality in the model, we can enrich the information set with relevant natural and anthropogenic forcing variables to obtain reliable causal relations. This provides a step forward from existing climatology literature, which has mostly treated these variables in isolation in small models. Additionally, our framework allows us to disregard the order of integration of the variables by directly estimating the VAR in levels, thus avoiding accumulating biases coming from unit-root and cointegration tests. This is of particular appeal for climate time series, which are well known to contain stochastic trends and long memory. We are thus able to establish causal networks linking radiative forcings to global temperatures and to connect radiative forcings among themselves, thereby allowing for tracing the path of dynamic causal effects through the system."}, "https://arxiv.org/abs/2303.02438": {"title": "Bayesian clustering of high-dimensional data via latent repulsive mixtures", "link": "https://arxiv.org/abs/2303.02438", "description": "arXiv:2303.02438v2 Announce Type: replace \nAbstract: Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2023). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. The repulsive point process must be anisotropic to favor well-separated clusters of data, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes. 
We illustrate our model in simulations as well as a plant species co-occurrence dataset."}, "https://arxiv.org/abs/2303.12687": {"title": "On Weighted Orthogonal Learners for Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2303.12687", "description": "arXiv:2303.12687v2 Announce Type: replace \nAbstract: Motivated by applications in personalized medicine and individualized policymaking, there is a growing interest in techniques for quantifying treatment effect heterogeneity in terms of the conditional average treatment effect (CATE). Some of the most prominent methods for CATE estimation developed in recent years are T-Learner, DR-Learner and R-Learner. The latter two were designed to improve on the former by being Neyman-orthogonal. However, the relations between them remain unclear, and likewise the literature remains vague on whether these learners converge to a useful quantity or (functional) estimand when the underlying optimization procedure is restricted to a class of functions that does not include the CATE. In this article, we provide insight into these questions by discussing DR-Learner and R-Learner as special cases of a general class of weighted Neyman-orthogonal learners for the CATE, for which we moreover derive oracle bounds. Our results shed light on how one may construct Neyman-orthogonal learners with desirable properties, on when DR-Learner may be preferred over R-Learner (and vice versa), and on novel learners that may sometimes be preferable to either of these. Theoretical findings are confirmed using results from simulation studies on synthetic data, as well as an application in critical care medicine."}, "https://arxiv.org/abs/2305.04140": {"title": "A Nonparametric Mixed-Effects Mixture Model for Patterns of Clinical Measurements Associated with COVID-19", "link": "https://arxiv.org/abs/2305.04140", "description": "arXiv:2305.04140v2 Announce Type: replace \nAbstract: Some patients with COVID-19 show changes in signs and symptoms such as temperature and oxygen saturation days before being positively tested for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to these subgroups. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and trajectories in the latent groups using smoothing splines. We developed an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients."}, "https://arxiv.org/abs/2309.02584": {"title": "Multivariate Mat\\'ern Models -- A Spectral Approach", "link": "https://arxiv.org/abs/2309.02584", "description": "arXiv:2309.02584v2 Announce Type: replace \nAbstract: The classical Mat\\'ern model has been a staple in spatial statistics. Novel data-rich applications in environmental and physical sciences, however, call for new, flexible vector-valued spatial and space-time models. 
Therefore, the extension of the classical Mat\'ern model has been a problem of active theoretical and methodological interest. In this paper, we offer a new perspective on extending the Mat\'ern covariance model to the vector-valued setting. We adopt a spectral, stochastic integral approach, which allows us to address challenging issues on the validity of the covariance structure and at the same time to obtain new, flexible, and interpretable models. In particular, our multivariate extensions of the Mat\'ern model allow for asymmetric covariance structures. Moreover, the spectral approach provides essentially complete flexibility in modeling the local structure of the process. We establish closed-form representations of the cross-covariances when available, compare them with existing models, simulate Gaussian instances of these new processes, and demonstrate estimation of the model's parameters through maximum likelihood. An application of the new class of multivariate Mat\'ern models to environmental data indicates their success in capturing inherent covariance-asymmetry phenomena."}, "https://arxiv.org/abs/2310.01198": {"title": "Likelihood Based Inference for ARMA Models", "link": "https://arxiv.org/abs/2310.01198", "description": "arXiv:2310.01198v3 Announce Type: replace \nAbstract: Autoregressive moving average (ARMA) models are frequently used to analyze time series data. Despite the popularity of these models, likelihood-based inference for ARMA models has subtleties that have been previously identified but continue to cause difficulties in widely used data analysis strategies. We provide a summary of parameter estimation via maximum likelihood and discuss common pitfalls that may lead to sub-optimal parameter estimates. We propose a random initialization algorithm for parameter estimation that frequently yields higher likelihoods than traditional maximum likelihood estimation procedures. We then investigate the parameter uncertainty of maximum likelihood estimates, and propose the use of profile confidence intervals as a superior alternative to intervals derived from the Fisher information matrix. Through a series of simulation studies, we demonstrate the efficacy of our proposed algorithm and the improved nominal coverage of profile confidence intervals compared to the normal approximation based on Fisher's Information."}, "https://arxiv.org/abs/2310.07151": {"title": "Identification and Estimation of a Semiparametric Logit Model using Network Data", "link": "https://arxiv.org/abs/2310.07151", "description": "arXiv:2310.07151v2 Announce Type: replace \nAbstract: This paper studies the identification and estimation of a semiparametric binary network model in which the unobserved social characteristic is endogenous, that is, the unobserved individual characteristic influences both the binary outcome of interest and how links are formed within the network. The exact functional form of the latent social characteristic is not known. The proposed estimators are obtained based on matching pairs of agents whose network formation distributions are the same. The consistency and the asymptotic distribution of the estimators are established. The finite sample properties of the proposed estimators are assessed in a Monte-Carlo simulation. 
We conclude this study with an empirical application."}, "https://arxiv.org/abs/2310.08672": {"title": "Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal", "link": "https://arxiv.org/abs/2310.08672", "description": "arXiv:2310.08672v2 Announce Type: replace \nAbstract: In many settings, interventions may be more effective for some individuals than others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use \"nudges\" to encourage students to renew their financial-aid applications before a non-binding deadline. We begin with baseline approaches to targeting. First, we target based on a causal forest that estimates heterogeneous treatment effects and then assigns students to treatment according to those estimated to have the highest treatment effects. Next, we evaluate two alternative targeting policies, one targeting students with low predicted probability of renewing financial aid in the absence of the treatment, the other targeting those with high probability. The predicted baseline outcome is not the ideal criterion for targeting, nor is it a priori clear whether to prioritize low, high, or intermediate predicted probability. Nonetheless, targeting on low baseline outcomes is common in practice, for example because the relationship between individual characteristics and treatment effects is often difficult or impossible to estimate with historical data. We propose hybrid approaches that incorporate the strengths of both predictive approaches (accurate estimation) and causal approaches (correct criterion); we show that targeting intermediate baseline outcomes is most effective in our specific application, while targeting based on low baseline outcomes is detrimental. In one year of the experiment, nudging all students improved early filing by an average of 6.4 percentage points over a baseline average of 37% filing, and we estimate that targeting half of the students using our preferred policy attains around 75% of this benefit."}, "https://arxiv.org/abs/2310.10329": {"title": "Towards Data-Conditional Simulation for ABC Inference in Stochastic Differential Equations", "link": "https://arxiv.org/abs/2310.10329", "description": "arXiv:2310.10329v2 Announce Type: replace \nAbstract: We develop a Bayesian inference method for discretely-observed stochastic differential equations (SDEs). Inference is challenging for most SDEs, due to the analytical intractability of the likelihood function. Nevertheless, forward simulation via numerical methods is straightforward, motivating the use of approximate Bayesian computation (ABC). We propose a conditional simulation scheme for SDEs that is based on lookahead strategies for sequential Monte Carlo (SMC) and particle smoothing using backward simulation. This leads to the simulation of trajectories that are consistent with the observed trajectory, thereby increasing the ABC acceptance rate. We additionally employ an invariant neural network, previously developed for Markov processes, to learn the summary statistics function required in ABC. The neural network is incrementally retrained by exploiting an ABC-SMC sampler, which provides new training data at each round. 
Since the SDEs simulation scheme differs from standard forward simulation, we propose a suitable importance sampling correction, which has the added advantage of guiding the parameters towards regions of high posterior density, especially in the first ABC-SMC round. Our approach achieves accurate inference and is about three times faster than standard (forward-only) ABC-SMC. We illustrate our method in five simulation studies, including three examples from the Chan-Karaolyi-Longstaff-Sanders SDE family, a stochastic bi-stable model (Schl{\\\"o}gl) that is notoriously challenging for ABC methods, and a two dimensional biochemical reaction network."}, "https://arxiv.org/abs/2311.01341": {"title": "Composite Dyadic Models for Spatio-Temporal Data", "link": "https://arxiv.org/abs/2311.01341", "description": "arXiv:2311.01341v3 Announce Type: replace \nAbstract: Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully-connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe."}, "https://arxiv.org/abs/2401.04498": {"title": "Efficient designs for multivariate crossover trials", "link": "https://arxiv.org/abs/2401.04498", "description": "arXiv:2401.04498v2 Announce Type: replace \nAbstract: This article aims to study efficient/trace optimal designs for crossover trials with multiple responses recorded from each subject in the time periods. A multivariate fixed effects model is proposed with direct and carryover effects corresponding to the multiple responses. The corresponding error dispersion matrix is chosen to be either of the proportional or the generalized Markov covariance type, permitting the existence of direct and cross-correlations within and between the multiple responses. The corresponding information matrices for direct effects under the two types of dispersions are used to determine efficient designs. The efficiency of orthogonal array designs of Type $I$ and strength $2$ is investigated for a wide choice of covariance functions, namely, Mat($0.5$), Mat($1.5$) and Mat($\\infty$). To motivate these multivariate crossover designs, a gene expression dataset in a $3 \\times 3$ framework is utilized."}, "https://arxiv.org/abs/2102.11076": {"title": "Kernel Ridge Riesz Representers: Generalization Error and Mis-specification", "link": "https://arxiv.org/abs/2102.11076", "description": "arXiv:2102.11076v3 Announce Type: replace-cross \nAbstract: Kernel balancing weights provide confidence intervals for average treatment effects, based on the idea of balancing covariates for the treated group and untreated group in feature space, often with ridge regularization. 
Previous works on the classical kernel ridge balancing weights have certain limitations: (i) not articulating generalization error for the balancing weights, (ii) typically requiring correct specification of features, and (iii) providing inference for only average effects.\n I interpret kernel balancing weights as kernel ridge Riesz representers (KRRR) and address these limitations via a new characterization of the counterfactual effective dimension. KRRR is an exact generalization of kernel ridge regression and kernel ridge balancing weights. I prove strong properties similar to kernel ridge regression: population $L_2$ rates controlling generalization error, and a standalone closed form solution that can interpolate. The framework relaxes the stringent assumption that the underlying regression model is correctly specified by the features. It extends inference beyond average effects to heterogeneous effects, i.e. causal functions. I use KRRR to infer heterogeneous treatment effects, by age, of 401(k) eligibility on assets."}, "https://arxiv.org/abs/2309.11028": {"title": "The Topology and Geometry of Neural Representations", "link": "https://arxiv.org/abs/2309.11028", "description": "arXiv:2309.11028v3 Announce Type: replace-cross \nAbstract: A central question for neuroscience is how to characterize brain representations of perceptual and cognitive content. An ideal characterization should distinguish different functional regions with robustness to noise and idiosyncrasies of individual brains that do not correspond to computational differences. Previous studies have characterized brain representations by their representational geometry, which is defined by the representational dissimilarity matrix (RDM), a summary statistic that abstracts from the roles of individual neurons (or responses channels) and characterizes the discriminability of stimuli. Here we explore a further step of abstraction: from the geometry to the topology of brain representations. We propose topological representational similarity analysis (tRSA), an extension of representational similarity analysis (RSA) that uses a family of geo-topological summary statistics that generalizes the RDM to characterize the topology while de-emphasizing the geometry. We evaluate this new family of statistics in terms of the sensitivity and specificity for model selection using both simulations and fMRI data. In the simulations, the ground truth is a data-generating layer representation in a neural network model and the models are the same and other layers in different model instances (trained from different random seeds). In fMRI, the ground truth is a visual area and the models are the same and other areas measured in different subjects. Results show that topology-sensitive characterizations of population codes are robust to noise and interindividual variability and maintain excellent sensitivity to the unique representational signatures of different neural network layers and brain regions. These methods enable researchers to calibrate comparisons among representations in brains and models to be sensitive to the geometry, the topology, or a combination of both."}, "https://arxiv.org/abs/2310.13444": {"title": "Testing for the extent of instability in nearly unstable processes", "link": "https://arxiv.org/abs/2310.13444", "description": "arXiv:2310.13444v2 Announce Type: replace-cross \nAbstract: This paper deals with unit root issues in time series analysis. 
It has been known for a long time that unit root tests may be flawed when a series although stationary has a root close to unity. That motivated recent papers dedicated to autoregressive processes where the bridge between stability and instability is expressed by means of time-varying coefficients. The process we consider has a companion matrix $A_{n}$ with spectral radius $\\rho(A_{n}) < 1$ satisfying $\\rho(A_{n}) \\rightarrow 1$, a situation described as `nearly-unstable'. The question we investigate is: given an observed path supposed to come from a nearly-unstable process, is it possible to test for the `extent of instability', i.e. to test how close we are to the unit root? In this regard, we develop a strategy to evaluate $\\alpha$ and to test for $\\mathcal{H}_0 : ``\\alpha = \\alpha_0\"$ against $\\mathcal{H}_1 : ``\\alpha > \\alpha_0\"$ when $\\rho(A_{n})$ lies in an inner $O(n^{-\\alpha})$-neighborhood of the unity, for some $0 < \\alpha < 1$. Empirical evidence is given about the advantages of the flexibility induced by such a procedure compared to the common unit root tests. We also build a symmetric procedure for the usually left out situation where the dominant root lies around $-1$."}, "https://arxiv.org/abs/2406.01652": {"title": "Distributional bias compromises leave-one-out cross-validation", "link": "https://arxiv.org/abs/2406.01652", "description": "arXiv:2406.01652v1 Announce Type: new \nAbstract: Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called \"leave-one-out cross-validation\" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses."}, "https://arxiv.org/abs/2406.01819": {"title": "Bayesian Linear Models: A compact general set of results", "link": "https://arxiv.org/abs/2406.01819", "description": "arXiv:2406.01819v1 Announce Type: new \nAbstract: I present all the details in calculating the posterior distribution of the conjugate Normal-Gamma prior in Bayesian Linear Models (BLM), including correlated observations, prediction, model selection and comments on efficient numeric implementations. A Python implementation is also presented. 
These results have been presented and are available in many books and texts but, I believe, a general, compact and simple presentation is always welcome and not always easy to find. Since correlated observations are also included, these results may also be useful for time series analysis and spatial statistics. Other particular cases presented include regression, Gaussian processes and Bayesian Dynamic Models."}, "https://arxiv.org/abs/2406.02028": {"title": "How should parallel cluster randomized trials with a baseline period be analyzed? A survey of estimands and common estimators", "link": "https://arxiv.org/abs/2406.02028", "description": "arXiv:2406.02028v1 Announce Type: new \nAbstract: The parallel cluster randomized trial with baseline (PB-CRT) is a common variant of the standard parallel cluster randomized trial (P-CRT) that maintains parallel randomization but additionally allows for both within- and between-cluster comparisons. We define two estimands of interest in the context of PB-CRTs, the participant-average treatment effect (pATE) and cluster-average treatment effect (cATE), to address participant and cluster-level hypotheses. Previous work has indicated that under informative cluster sizes, commonly used mixed-effects models may yield inconsistent estimators for the estimands of interest. In this work, we theoretically derive the convergence of the unweighted and inverse cluster-period size weighted (i.) independence estimating equation, (ii.) fixed-effects model, (iii.) exchangeable mixed-effects model, and (iv.) nested-exchangeable mixed-effects model treatment effect estimators in a PB-CRT with continuous outcomes. We report a simulation study to evaluate the bias and inference with these different treatment effect estimators and their corresponding model-based or jackknife variance estimators. We then re-analyze a PB-CRT examining the effects of community youth teams on improving mental health among adolescent girls in rural eastern India. We demonstrate that the unweighted and weighted independence estimating equation and fixed-effects model regularly yield consistent estimators for the pATE and cATE estimands, whereas the mixed-effects models yield inconsistent estimators under informative cluster sizes. However, we demonstrate that unlike the nested-exchangeable mixed-effects model and corresponding analyses in P-CRTs, the exchangeable mixed-effects model is surprisingly robust to bias in many PB-CRT scenarios."}, "https://arxiv.org/abs/2406.02124": {"title": "Measuring the Dispersion of Discrete Distributions", "link": "https://arxiv.org/abs/2406.02124", "description": "arXiv:2406.02124v1 Announce Type: new \nAbstract: Measuring dispersion is among the most fundamental and ubiquitous concepts in statistics, both in applied and theoretical contexts. In order to ensure that dispersion measures like the standard deviation indeed capture the dispersion of any given distribution, they are by definition required to preserve a stochastic order of dispersion. The most basic order that functions as a foundation underneath the concept of dispersion measures is the so-called dispersive order. However, that order is incompatible with almost all discrete distributions, including all lattice distributions and most empirical distributions. Thus, there is no guarantee that popular measures properly capture the dispersion of these distributions.\n In this paper, discrete adaptations of the dispersive order are derived and analyzed. 
Their derivation is directly informed by key properties of the dispersive order in order to obtain a foundation for the measurement of discrete dispersion that is as similar as possible to the continuous setting. Two slightly different orders are obtained that both have numerous properties that the original dispersive order also has. Their behaviour on well-known families of lattice distributions is generally as expected if the parameter differences are large enough. Most popular dispersion measures preserve both discrete dispersive orders, which rigorously ensures that they are also meaningful in discrete settings. However, the interquantile range preserves neither discrete order, implying that it should not be used to measure the dispersion of discrete distributions."}, "https://arxiv.org/abs/2406.02152": {"title": "A sequential test procedure for the choice of the number of regimes in multivariate nonlinear models", "link": "https://arxiv.org/abs/2406.02152", "description": "arXiv:2406.02152v1 Announce Type: new \nAbstract: This paper proposes a sequential test procedure for determining the number of regimes in nonlinear multivariate autoregressive models. The procedure relies on linearity and no additional nonlinearity tests for both multivariate smooth transition and threshold autoregressive models. We conduct a simulation study to evaluate the finite-sample properties of the proposed test in small samples. Our findings indicate that the test exhibits satisfactory size properties, with the rescaled version of the Lagrange Multiplier test statistics demonstrating the best performance in most simulation settings. The sequential procedure is also applied to two empirical cases, the US monthly interest rates and Icelandic river flows. In both cases, the detected number of regimes aligns well with the existing literature."}, "https://arxiv.org/abs/2406.02241": {"title": "Enabling Decision-Making with the Modified Causal Forest: Policy Trees for Treatment Assignment", "link": "https://arxiv.org/abs/2406.02241", "description": "arXiv:2406.02241v1 Announce Type: new \nAbstract: Decision-making plays a pivotal role in shaping outcomes in various disciplines, such as medicine, economics, and business. This paper provides guidance to practitioners on how to implement a decision tree designed to address treatment assignment policies using an interpretable and non-parametric algorithm. Our Policy Tree is motivated by the method proposed by Zhou, Athey, and Wager (2023), distinguishing itself in the policy score calculation, incorporating constraints, and handling categorical and continuous variables. We demonstrate the usage of the Policy Tree for multiple, discrete treatments on data sets from different fields. The Policy Tree is available in Python's open-source package mcf (Modified Causal Forest)."}, "https://arxiv.org/abs/2406.02297": {"title": "Optimal Stock Portfolio Selection with a Multivariate Hidden Markov Model", "link": "https://arxiv.org/abs/2406.02297", "description": "arXiv:2406.02297v1 Announce Type: new \nAbstract: The underlying market trends that drive stock price fluctuations are often referred to in terms of bull and bear markets. Optimal stock portfolio selection methods need to take into account these market trends; however, the bull and bear market states tend to be unobserved and can only be assigned retrospectively. We fit a linked hidden Markov model (LHMM) to relative stock price changes for S&P 500 stocks from 2011--2016 based on weekly closing values. 
The LHMM consists of a multivariate state process whose individual components correspond to HMMs for each of the 12 sectors of the S&P 500 stocks. The state processes are linked using a Gaussian copula so that the states of the component chains are correlated at any given time point. The LHMM allows us to capture more heterogeneity in the underlying market dynamics for each sector. In this study, stock performances are evaluated in terms of capital gains using the LHMM by utilizing historical stock price data. Based on the fitted LHMM, optimal stock portfolios are constructed to maximize capital gain while balancing reward and risk. Under out-of-sample testing, the annual capital gain for the portfolios for 2016--2017 is calculated. Portfolios constructed using the LHMM are able to generate returns comparable to the S&P 500 index."}, "https://arxiv.org/abs/2406.02320": {"title": "Compositional dynamic modelling for causal prediction in multivariate time series", "link": "https://arxiv.org/abs/2406.02320", "description": "arXiv:2406.02320v1 Announce Type: new \nAbstract: Theoretical developments in sequential Bayesian analysis of multivariate dynamic models underlie new methodology for causal prediction. This extends the utility of existing models with computationally efficient methodology, enabling routine exploration of Bayesian counterfactual analyses with multiple selected time series as synthetic controls. Methodological contributions also define the concept of outcome adaptive modelling to monitor and inferentially respond to changes in experimental time series following interventions designed to explore causal effects. The benefits of sequential analyses with time-varying parameter models for causal investigations are inherited in this broader setting. A case study in commercial causal analysis -- involving retail revenue outcomes related to marketing interventions -- highlights the methodological advances."}, "https://arxiv.org/abs/2406.02321": {"title": "A Bayesian nonlinear stationary model with multiple frequencies for business cycle analysis", "link": "https://arxiv.org/abs/2406.02321", "description": "arXiv:2406.02321v1 Announce Type: new \nAbstract: We design a novel, nonlinear single-source-of-error model for analysis of multiple business cycles. The model's specification is intended to capture key empirical characteristics of business cycle data by allowing for simultaneous cycles of different types and lengths, as well as time-variable amplitude and phase shift. The model is shown to feature relevant theoretical properties, including stationarity and pseudo-cyclical autocovariance function, and enables a decomposition of overall cyclic fluctuations into separate frequency-specific components. We develop a Bayesian framework for estimation and inference in the model, along with an MCMC procedure for posterior sampling, combining the Gibbs sampler and the Metropolis-Hastings algorithm, suitably adapted to address encountered numerical issues. Empirical results obtained from the model applied to the Polish GDP growth rates imply co-existence of two types of economic fluctuations: the investment and inventory cycles, and support the stochastic variability of the amplitude and phase shift, also capturing some business cycle asymmetries. 
Finally, the Bayesian framework enables a fully probabilistic inference on the business cycle clocks and dating, which seems the most relevant approach in view of economic uncertainties."}, "https://arxiv.org/abs/2406.02369": {"title": "Identifying Sample Size and Accuracy and Precision of the Estimators in Case-Crossover Analysis with Distributed Lags of Heteroskedastic Time-Varying Continuous Exposures Measured with Simple or Complex Error", "link": "https://arxiv.org/abs/2406.02369", "description": "arXiv:2406.02369v1 Announce Type: new \nAbstract: Power analyses help investigators design robust and reproducible research. Understanding of determinants of statistical power is helpful in interpreting results in publications and in analysis for causal inference. Case-crossover analysis, a matched case-control analysis, is widely used to estimate health effects of short-term exposures. Despite its widespread use, understanding of sample size, statistical power, and the accuracy and precision of the estimator in real-world data settings is very limited. First, the variance of exposures that exhibit spatiotemporal patterns may be heteroskedastic (e.g., air pollution and temperature exposures, impacted by climate change). Second, distributed lags of the exposure variable may be used to identify critical exposure time-windows. Third, exposure measurement error is not uncommon, impacting the accuracy and/or precision of the estimator, depending on the measurement error mechanism. Exposure measurement errors result in covariate measurement errors of distributed lags. All these issues complicate the understanding. Therefore, I developed approximation equations for sample size, estimates of the estimators and standard errors, and identified conditions for applications. I discussed polynomials for non-linear effect estimation. I analyzed air pollution estimates in the United States (U.S.), developed by U.S. Environmental Protection Agency to examine errors, and conducted statistical simulations. Overall, sample size can be calculated based on external information about exposure variable validation, without validation data in hand. For estimators of distributed lags, calculations may perform well if residual confounding due to covariate measurement error is not severe. This condition may sometimes be difficult to identify without validation data, suggesting investigators should consider validation research."}, "https://arxiv.org/abs/2406.02525": {"title": "The Impact of Acquisition on Product Quality in the Console Gaming Industry", "link": "https://arxiv.org/abs/2406.02525", "description": "arXiv:2406.02525v1 Announce Type: new \nAbstract: The console gaming industry, a dominant force in the global entertainment sector, has witnessed a wave of consolidation in recent years, epitomized by Microsoft's high-profile acquisitions of Activision Blizzard and Zenimax. This study investigates the repercussions of such mergers on consumer welfare and innovation within the gaming landscape, focusing on product quality as a key metric. Through a comprehensive analysis employing a difference-in-difference model, the research evaluates the effects of acquisition on game review ratings, drawing from a dataset comprising over 16,000 console games released between 2000 and 2023. The research addresses key assumptions underlying the difference-in-difference methodology, including parallel trends and spillover effects, to ensure the robustness of the findings. 
The DID results suggest a positive and statistically significant impact of acquisition on game review ratings, when controlling for genre and release year. The study contributes to the literature by offering empirical evidence on the direct consequences of industry consolidation on consumer welfare and competition dynamics within the gaming sector."}, "https://arxiv.org/abs/2406.02530": {"title": "LongBet: Heterogeneous Treatment Effect Estimation in Panel Data", "link": "https://arxiv.org/abs/2406.02530", "description": "arXiv:2406.02530v1 Announce Type: new \nAbstract: This paper introduces a novel approach for estimating heterogeneous treatment effects of binary treatment in panel data, particularly focusing on short panel data with large cross-sectional data and observed confounders. In contrast to the traditional difference-in-differences literature, which often relies on the parallel trend assumption, our proposed model does not necessitate such an assumption. Instead, it leverages observed confounders to impute potential outcomes and identify treatment effects. The method presented is a Bayesian semi-parametric approach based on the Bayesian causal forest model, which is extended here to suit panel data settings. The approach offers the advantage of the Bayesian framework in providing uncertainty quantification on the estimates. Simulation studies demonstrate its performance with and without the presence of parallel trends. Additionally, our proposed model enables the estimation of conditional average treatment effects, a capability that is rarely available in panel data settings."}, "https://arxiv.org/abs/2406.01649": {"title": "CoLa-DCE -- Concept-guided Latent Diffusion Counterfactual Explanations", "link": "https://arxiv.org/abs/2406.01649", "description": "arXiv:2406.01649v1 Announce Type: cross \nAbstract: Recent advancements in generative AI have introduced novel prospects and practical implementations. Diffusion models especially show their strength in generating diverse and, at the same time, realistic features, positioning them well for generating counterfactual explanations for computer vision models. Answering \"what if\" questions of what needs to change to make an image classifier change its prediction, counterfactual explanations align well with human understanding and consequently help in making model behavior more comprehensible. Current methods succeed in generating authentic counterfactuals, but lack transparency as feature changes are not directly perceivable. To address this limitation, we introduce Concept-guided Latent Diffusion Counterfactual Explanations (CoLa-DCE). CoLa-DCE generates concept-guided counterfactuals for any classifier with a high degree of control regarding concept selection and spatial conditioning. The counterfactuals comprise an increased granularity through minimal feature changes. The reference feature visualization ensures better comprehensibility, while the feature localization provides increased transparency of \"where\" changed \"what\". 
We demonstrate the advantages of our approach in minimality and comprehensibility across multiple image classification models and datasets and provide insights into how our CoLa-DCE explanations help comprehend model errors like misclassification cases."}, "https://arxiv.org/abs/2406.01653": {"title": "An efficient Wasserstein-distance approach for reconstructing jump-diffusion processes using parameterized neural networks", "link": "https://arxiv.org/abs/2406.01653", "description": "arXiv:2406.01653v1 Announce Type: cross \nAbstract: We analyze the Wasserstein distance ($W$-distance) between two probability distributions associated with two multidimensional jump-diffusion processes. Specifically, we analyze a temporally decoupled squared $W_2$-distance, which provides both upper and lower bounds associated with the discrepancies in the drift, diffusion, and jump amplitude functions between the two jump-diffusion processes. Then, we propose a temporally decoupled squared $W_2$-distance method for efficiently reconstructing unknown jump-diffusion processes from data using parameterized neural networks. We further show its performance can be enhanced by utilizing prior information on the drift function of the jump-diffusion process. The effectiveness of our proposed reconstruction method is demonstrated across several examples and applications."}, "https://arxiv.org/abs/2406.01663": {"title": "An efficient solution to Hidden Markov Models on trees with coupled branches", "link": "https://arxiv.org/abs/2406.01663", "description": "arXiv:2406.01663v1 Announce Type: cross \nAbstract: Hidden Markov Models (HMMs) are powerful tools for modeling sequential data, where the underlying states evolve in a stochastic manner and are only indirectly observable. Traditional HMM approaches are well-established for linear sequences, and have been extended to other structures such as trees. In this paper, we extend the framework of HMMs on trees to address scenarios where the tree-like structure of the data includes coupled branches -- a common feature in biological systems where entities within the same lineage exhibit dependent characteristics. We develop a dynamic programming algorithm that efficiently solves the likelihood, decoding, and parameter learning problems for tree-based HMMs with coupled branches. Our approach scales polynomially with the number of states and nodes, making it computationally feasible for a wide range of applications and does not suffer from the underflow problem. We demonstrate our algorithm by applying it to simulated data and propose self-consistency checks for validating the assumptions of the model used for inference. This work not only advances the theoretical understanding of HMMs on trees but also provides a practical tool for analyzing complex biological data where dependencies between branches cannot be ignored."}, "https://arxiv.org/abs/2406.01750": {"title": "Survival Data Simulation With the R Package rsurv", "link": "https://arxiv.org/abs/2406.01750", "description": "arXiv:2406.01750v1 Announce Type: cross \nAbstract: In this paper we propose a novel R package, called rsurv, developed for general survival data simulation purposes. The package is built under a new approach to simulate survival data that depends heavily on the use of dplyr verbs. 
The proposed package allows simulations of survival data from a wide range of regression models, including accelerated failure time (AFT), proportional hazards (PH), proportional odds (PO), accelerated hazard (AH), Yang and Prentice (YP), and extended hazard (EH) models. The package rsurv also stands out by its ability to generate survival data from an unlimited number of baseline distributions, provided that an implementation of the quantile function of the chosen baseline distribution is available in R. Another nice feature of the package rsurv lies in the fact that linear predictors are specified using R formulas, facilitating the inclusion of categorical variables, interaction terms and offset variables. The functions implemented in the package rsurv can also be employed to simulate survival data with more complex structures, such as survival data with different types of censoring mechanisms, survival data with cure fraction, survival data with random effects (frailties), multivariate survival data, and competing risks survival data."}, "https://arxiv.org/abs/2406.01813": {"title": "Diffusion Boosted Trees", "link": "https://arxiv.org/abs/2406.01813", "description": "arXiv:2406.01813v1 Announce Type: cross \nAbstract: Combining the merits of both denoising diffusion probabilistic models and gradient boosting, the diffusion boosting paradigm is introduced for tackling supervised learning problems. We develop Diffusion Boosted Trees (DBT), which can be viewed as both a new denoising diffusion generative model parameterized by decision trees (one single tree for each diffusion timestep), and a new boosting algorithm that combines the weak learners into a strong learner of conditional distributions without making explicit parametric assumptions on their density forms. We demonstrate through experiments the advantages of DBT over deep neural network-based diffusion models as well as the competence of DBT on real-world regression tasks, and present a business application (fraud detection) of DBT for classification on tabular data with the ability of learning to defer."}, "https://arxiv.org/abs/2406.01823": {"title": "Causal Discovery with Fewer Conditional Independence Tests", "link": "https://arxiv.org/abs/2406.01823", "description": "arXiv:2406.01823v1 Announce Type: cross \nAbstract: Many questions in science center around the fundamental problem of understanding causal relationships. However, most constraint-based causal discovery algorithms, including the well-celebrated PC algorithm, often incur an exponential number of conditional independence (CI) tests, posing limitations in various applications. Addressing this, our work focuses on characterizing what can be learned about the underlying causal graph with a reduced number of CI tests. We show that it is possible to learn a coarser representation of the hidden causal graph with a polynomial number of tests. This coarser representation, named Causal Consistent Partition Graph (CCPG), comprises a partition of the vertices and a directed graph defined over its components. CCPG satisfies consistency of orientations and additional constraints which favor finer partitions. Furthermore, it reduces to the underlying causal graph when the causal graph is identifiable. 
As a consequence, our results offer the first efficient algorithm for recovering the true causal graph with a polynomial number of tests, in special cases where the causal graph is fully identifiable through observational data and potentially additional interventions."}, "https://arxiv.org/abs/2406.01933": {"title": "Orthogonal Causal Calibration", "link": "https://arxiv.org/abs/2406.01933", "description": "arXiv:2406.01933v1 Announce Type: cross \nAbstract: Estimates of causal parameters such as conditional average treatment effects and conditional quantile treatment effects play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters.\n In this work, we provide a general framework for calibrating predictors involving nuisance estimation. We consider a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\\ell$, under which we say an estimator $\\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. We prove generic upper bounds on the calibration error of any causal parameter estimate $\\theta$ with respect to any loss $\\ell$ using a concept called Neyman Orthogonality. Our bounds involve two decoupled terms - one measuring the error in estimating the unknown nuisance parameters, and the other representing the calibration error in a hypothetical world where the learned nuisance estimates were true. We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration. One algorithm, which applies to universally orthogonalizable loss functions, transforms the data into generalized pseudo-outcomes and applies an off-the-shelf calibration procedure. The other algorithm, which applies to conditionally orthogonalizable loss functions, extends the classical uniform mass binning algorithm to include nuisance estimation. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation."}, "https://arxiv.org/abs/2406.02049": {"title": "Causal Effect Identification in LiNGAM Models with Latent Confounders", "link": "https://arxiv.org/abs/2406.02049", "description": "arXiv:2406.02049v1 Announce Type: cross \nAbstract: We study the generic identifiability of causal effects in linear non-Gaussian acyclic models (LiNGAM) with latent variables. We consider the problem in two main settings: When the causal graph is known a priori, and when it is unknown. In both settings, we provide a complete graphical characterization of the identifiable direct or total causal effects among observed variables. Moreover, we propose efficient algorithms to certify the graphical conditions. Finally, we propose an adaptation of the reconstruction independent component analysis (RICA) algorithm that estimates the causal effects from the observational data given the causal graph. 
Experimental results show the effectiveness of the proposed method in estimating the causal effects."}, "https://arxiv.org/abs/2406.02360": {"title": "A Practical Approach for Exploring Granger Connectivity in High-Dimensional Networks of Time Series", "link": "https://arxiv.org/abs/2406.02360", "description": "arXiv:2406.02360v1 Announce Type: cross \nAbstract: This manuscript presents a novel method for discovering effective connectivity between specified pairs of nodes in a high-dimensional network of time series. To accurately perform Granger causality analysis from the first node to the second node, it is essential to eliminate the influence of all other nodes within the network. The approach proposed is to create a low-dimensional representation of all other nodes in the network using frequency-domain-based dynamic principal component analysis (spectral DPCA). The resulting scores are subsequently removed from the first and second nodes of interest, thus eliminating the confounding effect of other nodes within the high-dimensional network. To conduct hypothesis testing on Granger causality, we propose a permutation-based causality test. This test enhances the accuracy of our findings when the error structures are non-Gaussian. The approach has been validated in extensive simulation studies, which demonstrate the efficacy of the methodology as a tool for causality analysis in complex time series networks. The proposed methodology has also been demonstrated to be both expedient and viable on real datasets, with particular success observed on multichannel EEG networks."}, "https://arxiv.org/abs/2406.02424": {"title": "Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints", "link": "https://arxiv.org/abs/2406.02424", "description": "arXiv:2406.02424v1 Announce Type: cross \nAbstract: We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers that behave according to an unknown demand model. The firm aims to maximize its revenue, i.e. minimize its regret over a clairvoyant that knows the model in advance. The demand model is a generalized linear model (GLM), allowing for a stochastic feature vector in $\\mathbb R^d$ that encodes product and consumer information. We first show that the optimal regret upper bound is of order $\\sqrt{dT}$, up to a logarithmic factor, improving upon existing upper bounds in the literature by a $\\sqrt{d}$ factor. This sharper rate is materialised by two algorithms: a confidence bound-type (supCB) algorithm and an explore-then-commit (ETC) algorithm. A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem with many arms based on a careful discretization. We further study contextual dynamic pricing under the local differential privacy (LDP) constraints. In particular, we propose a stochastic gradient descent based ETC algorithm that achieves an optimal regret upper bound of order $d\\sqrt{T}/\\epsilon$, up to a logarithmic factor, where $\\epsilon>0$ is the privacy parameter. The regret upper bounds with and without LDP constraints are accompanied by newly constructed minimax lower bounds, which further characterize the cost of privacy. 
Extensive numerical experiments and a real data application on online lending are conducted to illustrate the efficiency and practical value of the proposed algorithms in dynamic pricing."}, "https://arxiv.org/abs/1604.04318": {"title": "Principal Sub-manifolds", "link": "https://arxiv.org/abs/1604.04318", "description": "arXiv:1604.04318v5 Announce Type: replace \nAbstract: We propose a novel method of finding principal components in multivariate data sets that lie on an embedded nonlinear Riemannian manifold within a higher-dimensional space. Our aim is to extend the geometric interpretation of PCA, while being able to capture non-geodesic modes of variation in the data. We introduce the concept of a principal sub-manifold, a manifold passing through a reference point, and at any point on the manifold extending in the direction of highest variation in the space spanned by the eigenvectors of the local tangent space PCA. Compared to recent work for the case where the sub-manifold is of dimension one Panaretos et al. (2014)$-$essentially a curve lying on the manifold attempting to capture one-dimensional variation$-$the current setting is much more general. The principal sub-manifold is therefore an extension of the principal flow, accommodating to capture higher dimensional variation in the data. We show the principal sub-manifold yields the ball spanned by the usual principal components in Euclidean space. By means of examples, we illustrate how to find, use and interpret a principal sub-manifold and we present an application in shape analysis."}, "https://arxiv.org/abs/2303.00178": {"title": "Disentangling Structural Breaks in Factor Models for Macroeconomic Data", "link": "https://arxiv.org/abs/2303.00178", "description": "arXiv:2303.00178v2 Announce Type: replace \nAbstract: Through a routine normalization of the factor variance, standard methods for estimating factor models in macroeconomics do not distinguish between breaks of the factor variance and factor loadings. We argue that it is important to distinguish between structural breaks in the factor variance and loadings within factor models commonly employed in macroeconomics as both can lead to markedly different interpretations when viewed via the lens of the underlying dynamic factor model. We then develop a projection-based decomposition that leads to two standard and easy-to-implement Wald tests to disentangle structural breaks in the factor variance and factor loadings. Applying our procedure to U.S. macroeconomic data, we find evidence of both types of breaks associated with the Great Moderation and the Great Recession. Through our projection-based decomposition, we estimate that the Great Moderation is associated with an over 60% reduction in the total factor variance, highlighting the relevance of disentangling breaks in the factor structure."}, "https://arxiv.org/abs/2306.13257": {"title": "Semiparametric Estimation of the Shape of the Limiting Bivariate Point Cloud", "link": "https://arxiv.org/abs/2306.13257", "description": "arXiv:2306.13257v3 Announce Type: replace \nAbstract: We propose a model to flexibly estimate joint tail properties by exploiting the convergence of an appropriately scaled point cloud onto a compact limit set. Characteristics of the shape of the limit set correspond to key tail dependence properties. We directly model the shape of the limit set using Bezier splines, which allow flexible and parsimonious specification of shapes in two dimensions. 
We then fit the Bezier splines to data in pseudo-polar coordinates using Markov chain Monte Carlo sampling, utilizing a limiting approximation to the conditional likelihood of the radii given angles. By imposing appropriate constraints on the parameters of the Bezier splines, we guarantee that each posterior sample is a valid limit set boundary, allowing direct posterior analysis of any quantity derived from the shape of the curve. Furthermore, we obtain interpretable inference on the asymptotic dependence class by using mixture priors with point masses on the corner of the unit box. Finally, we apply our model to bivariate datasets of extremes of variables related to fire risk and air pollution."}, "https://arxiv.org/abs/2401.00618": {"title": "Changes-in-Changes for Ordered Choice Models: Too Many \"False Zeros\"?", "link": "https://arxiv.org/abs/2401.00618", "description": "arXiv:2401.00618v2 Announce Type: replace \nAbstract: In this paper, we develop a Difference-in-Differences model for discrete, ordered outcomes, building upon elements from a continuous Changes-in-Changes model. We focus on outcomes derived from self-reported survey data eliciting socially undesirable, illegal, or stigmatized behaviors like tax evasion, substance abuse, or domestic violence, where too many \"false zeros\", or more broadly, underreporting are likely. We provide characterizations for distributional parallel trends, a concept central to our approach, within a general threshold-crossing model framework. In cases where outcomes are assumed to be reported correctly, we propose a framework for identifying and estimating treatment effects across the entire distribution. This framework is then extended to modeling underreported outcomes, allowing the reporting decision to depend on treatment status. A simulation study documents the finite sample performance of the estimators. Applying our methodology, we investigate the impact of recreational marijuana legalization for adults in several U.S. states on the short-term consumption behavior of 8th-grade high-school students. The results indicate small, but significant increases in consumption probabilities at each level. These effects are further amplified upon accounting for misreporting."}, "https://arxiv.org/abs/2209.04419": {"title": "Majority Vote for Distributed Differentially Private Sign Selection", "link": "https://arxiv.org/abs/2209.04419", "description": "arXiv:2209.04419v2 Announce Type: replace-cross \nAbstract: Privacy-preserving data analysis has become more prevalent in recent years. In this study, we propose a distributed group differentially private Majority Vote mechanism, for the sign selection problem in a distributed setup. To achieve this, we apply the iterative peeling to the stability function and use the exponential mechanism to recover the signs. For enhanced applicability, we study the private sign selection for mean estimation and linear regression problems, in distributed systems. Our method recovers the support and signs with the optimal signal-to-noise ratio as in the non-private scenario, which is better than contemporary works of private variable selections. Moreover, the sign selection consistency is justified by theoretical guarantees. 
Simulation studies are conducted to demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2303.02119": {"title": "Conditional Aalen--Johansen estimation", "link": "https://arxiv.org/abs/2303.02119", "description": "arXiv:2303.02119v2 Announce Type: replace-cross \nAbstract: The conditional Aalen--Johansen estimator, a general-purpose non-parametric estimator of conditional state occupation probabilities, is introduced. The estimator is applicable for any finite-state jump process and supports conditioning on external as well as internal covariate information. The conditioning feature permits a much more detailed analysis of the distributional characteristics of the process. The estimator reduces to the conditional Kaplan--Meier estimator in the special case of a survival model and also encompasses other, more recent, landmark estimators when covariates are discrete. Strong uniform consistency and asymptotic normality are established under lax moment conditions on the multivariate counting process, allowing in particular for an unbounded number of transitions."}, "https://arxiv.org/abs/2306.05937": {"title": "Robust Data-driven Prescriptiveness Optimization", "link": "https://arxiv.org/abs/2306.05937", "description": "arXiv:2306.05937v2 Announce Type: replace-cross \nAbstract: The abundance of data has led to the emergence of a variety of optimization techniques that attempt to leverage available side information to provide more anticipative decisions. The wide range of methods and contexts of application have motivated the design of a universal unitless measure of performance known as the coefficient of prescriptiveness. This coefficient was designed to quantify both the quality of contextual decisions compared to a reference one and the prescriptive power of side information. To identify policies that maximize the former in a data-driven context, this paper introduces a distributionally robust contextual optimization model where the coefficient of prescriptiveness substitutes for the classical empirical risk minimization objective. We present a bisection algorithm to solve this model, which relies on solving a series of linear programs when the distributional ambiguity set has an appropriate nested form and polyhedral structure. Studying a contextual shortest path problem, we evaluate the robustness of the resulting policies against alternative methods when the out-of-sample dataset is subject to varying amounts of distribution shift."}, "https://arxiv.org/abs/2406.02751": {"title": "Bayesian Statistics: A Review and a Reminder for the Practicing Reliability Engineer", "link": "https://arxiv.org/abs/2406.02751", "description": "arXiv:2406.02751v1 Announce Type: new \nAbstract: This paper introduces and reviews some of the principles and methods used in Bayesian reliability. 
It specifically discusses methods used in the analysis of success/no-success data and then reminds the reader of a simple (yet infrequently applied) Monte Carlo algorithm that can be used to calculate the posterior distribution of a system's reliability."}, "https://arxiv.org/abs/2406.02794": {"title": "PriME: Privacy-aware Membership profile Estimation in networks", "link": "https://arxiv.org/abs/2406.02794", "description": "arXiv:2406.02794v1 Announce Type: new \nAbstract: This paper presents a novel approach to estimating community membership probabilities for network vertices generated by the Degree Corrected Mixed Membership Stochastic Block Model while preserving individual edge privacy. Operating within the $\\varepsilon$-edge local differential privacy framework, we introduce an optimal private algorithm based on a symmetric edge flip mechanism and spectral clustering for accurate estimation of vertex community memberships. We conduct a comprehensive analysis of the estimation risk and establish the optimality of our procedure by providing matching lower bounds to the minimax risk under privacy constraints. To validate our approach, we demonstrate its performance through numerical simulations and its practical application to real-world data. This work represents a significant step forward in balancing accurate community membership estimation with stringent privacy preservation in network data analysis."}, "https://arxiv.org/abs/2406.02834": {"title": "Asymptotic inference with flexible covariate adjustment under rerandomization and stratified rerandomization", "link": "https://arxiv.org/abs/2406.02834", "description": "arXiv:2406.02834v1 Announce Type: new \nAbstract: Rerandomization is an effective treatment allocation procedure to control for baseline covariate imbalance. For estimating the average treatment effect, rerandomization has been previously shown to improve the precision of the unadjusted and the linearly-adjusted estimators over simple randomization without compromising consistency. However, it remains unclear whether such results apply more generally to the class of M-estimators, including the g-computation formula with generalized linear regression and doubly-robust methods, and more broadly, to efficient estimators with data-adaptive machine learners. In this paper, using a super-population framework, we develop the asymptotic theory for a more general class of covariate-adjusted estimators under rerandomization and its stratified extension. We prove that the asymptotic linearity and the influence function remain identical for any M-estimator under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. We further explain, drawing examples from several common M-estimators, that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. These results are extended to stratified rerandomization. Finally, we study the asymptotic theory for efficient estimators based on data-adaptive machine learners, and prove their efficiency optimality under rerandomization and stratified rerandomization. 
Our results are demonstrated via simulations and re-analyses of a cluster-randomized experiment that used stratified rerandomization."}, "https://arxiv.org/abs/2406.02835": {"title": "When is IV identification agnostic about outcomes?", "link": "https://arxiv.org/abs/2406.02835", "description": "arXiv:2406.02835v1 Announce Type: new \nAbstract: Many identification results in instrumental variables (IV) models have the property that identification holds with no restrictions on the joint distribution of potential outcomes or how these outcomes are correlated with selection behavior. This enables many IV models to allow for arbitrary heterogeneity in treatment effects and the possibility of selection on gains in the outcome variable. I call this type of identification result \"outcome-agnostic\", and provide a necessary and sufficient condition for counterfactual means or treatment effects to be identified in an outcome-agnostic manner, when the instruments and treatments have finite support. In addition to unifying many existing IV identification results, this characterization suggests a brute-force approach to revealing all restrictions on selection behavior that yield identification of treatment effect parameters. While computationally intensive, the approach uncovers even in simple settings new selection models that afford identification of interpretable causal parameters."}, "https://arxiv.org/abs/2406.02840": {"title": "Statistical inference of convex order by Wasserstein projection", "link": "https://arxiv.org/abs/2406.02840", "description": "arXiv:2406.02840v1 Announce Type: new \nAbstract: Ranking distributions according to a stochastic order has wide applications in diverse areas. Although stochastic dominance has received much attention, convex order, particularly in general dimensions, has yet to be investigated from a statistical point of view. This article addresses this gap by introducing a simple statistical test for convex order based on the Wasserstein projection distance. This projection distance not only encodes whether two distributions are indeed in convex order, but also quantifies the deviation from the desired convex order and produces an optimal convex order approximation. Lipschitz stability of the backward and forward Wasserstein projection distance is proved, which leads to elegant consistency results of the estimator we employ as our test statistic. Combining these with state-of-the-art results regarding the convergence rate of empirical distributions, we also derive upper bounds for the $p$-value and type I error of our test statistic, as well as upper bounds on the type II error for an appropriate class of strict alternatives. Lastly, we provide an efficient numerical scheme for our test statistic, by way of an entropic Frank-Wolfe algorithm. Some experiments based on synthetic data sets illuminate the success of our approach empirically."}, "https://arxiv.org/abs/2406.02948": {"title": "Copula-based semiparametric nonnormal transformed linear model for survival data with dependent censoring", "link": "https://arxiv.org/abs/2406.02948", "description": "arXiv:2406.02948v1 Announce Type: new \nAbstract: Although the independent censoring assumption is commonly used in survival analysis, it can be violated when the censoring time is related to the survival time, which often happens in many practical applications. To address this issue, we propose a flexible semiparametric method for dependent censored data. 
Our approach involves fitting the survival time and the censoring time with a joint transformed linear model, where the transformed function is unspecified. This allows for a very general class of models that can account for possible covariate effects, while also accommodating administrative censoring. We assume that the transformed variables have a bivariate nonnormal distribution based on parametric copulas and parametric marginals, which further enhances the flexibility of our method. We demonstrate the identifiability of the proposed model and establish the consistency and asymptotic normality of the model parameters under appropriate regularity conditions and assumptions. Furthermore, we evaluate the performance of our method through extensive simulation studies, and provide a real data example for illustration."}, "https://arxiv.org/abs/2406.03022": {"title": "Is local opposition taking the wind out of the energy transition?", "link": "https://arxiv.org/abs/2406.03022", "description": "arXiv:2406.03022v1 Announce Type: new \nAbstract: Local opposition to the installation of renewable energy sources is a potential threat to the energy transition. Local communities tend to oppose the construction of energy plants due to the associated negative externalities (the so-called 'not in my backyard' or NIMBY phenomenon) according to widespread belief, mostly based on anecdotal evidence. Using administrative data on wind turbine installation and electoral outcomes across municipalities located in the South of Italy during 2000-19, we estimate the impact of wind turbines' installation on incumbent regional governments' electoral support during the next elections. Our main findings, derived by a wind-speed based instrumental variable strategy, point in the direction of a mild and not statistically significant electoral backlash for right-wing regional administrations and of a strong and statistically significant positive reinforcement for left-wing regional administrations. Based on our analysis, the hypothesis of an electoral effect of NIMBY type of behavior in connection with the development of wind turbines appears not to be supported by the data."}, "https://arxiv.org/abs/2406.03053": {"title": "Identification of structural shocks in Bayesian VEC models with two-state Markov-switching heteroskedasticity", "link": "https://arxiv.org/abs/2406.03053", "description": "arXiv:2406.03053v1 Announce Type: new \nAbstract: We develop a Bayesian framework for cointegrated structural VAR models identified by two-state Markovian breaks in conditional covariances. The resulting structural VEC specification with Markov-switching heteroskedasticity (SVEC-MSH) is formulated in the so-called B-parameterization, in which the prior distribution is specified directly for the matrix of the instantaneous reactions of the endogenous variables to structural innovations. We discuss some caveats pertaining to the identification conditions presented earlier in the literature on stationary structural VAR-MSH models, and revise the restrictions to actually ensure the unique global identification through the two-state heteroskedasticity. To enable the posterior inference in the proposed model, we design an MCMC procedure, combining the Gibbs sampler and the Metropolis-Hastings algorithm. 
The methodology is illustrated with both simulated and real-world data examples."}, "https://arxiv.org/abs/2406.03056": {"title": "Sparse two-stage Bayesian meta-analysis for individualized treatments", "link": "https://arxiv.org/abs/2406.03056", "description": "arXiv:2406.03056v1 Announce Type: new \nAbstract: Individualized treatment rules tailor treatments to patients based on clinical, demographic, and other characteristics. Estimation of individualized treatment rules requires the identification of individuals who benefit most from the particular treatments and thus the detection of variability in treatment effects. To develop an effective individualized treatment rule, data from multisite studies may be required due to the low power provided by smaller datasets for detecting the often small treatment-covariate interactions. However, sharing of individual-level data is sometimes constrained. Furthermore, sparsity may arise in two senses: different data sites may recruit from different populations, making it infeasible to estimate identical models or all parameters of interest at all sites, and the number of non-zero parameters in the model for the treatment rule may be small. To address these issues, we adopt a two-stage Bayesian meta-analysis approach to estimate individualized treatment rules which optimize expected patient outcomes using multisite data without disclosing individual-level data beyond the sites. Simulation results demonstrate that our approach can provide consistent estimates of the parameters which fully characterize the optimal individualized treatment rule. We estimate the optimal Warfarin dose strategy using data from the International Warfarin Pharmacogenetics Consortium, where data sparsity and small treatment-covariate interaction effects pose additional statistical challenges."}, "https://arxiv.org/abs/2406.03130": {"title": "Ordinal Mixed-Effects Random Forest", "link": "https://arxiv.org/abs/2406.03130", "description": "arXiv:2406.03130v1 Announce Type: new \nAbstract: We propose an innovative statistical method, called Ordinal Mixed-Effect Random Forest (OMERF), that extends the use of random forest to the analysis of hierarchical data and ordinal responses. The model preserves the flexibility and ability of modeling complex patterns of both categorical and continuous variables, typical of tree-based ensemble methods, and, at the same time, takes into account the structure of hierarchical data, modeling the dependence structure induced by the grouping and allowing statistical inference at all data levels. A simulation study is conducted to validate the performance of the proposed method and to compare it with that of other state-of-the-art models. The application of OMERF is exemplified in a case study focusing on predicting students' performance using data from the Programme for International Student Assessment (PISA) 2022. The model identifies discriminating student characteristics and estimates the school effect."}, "https://arxiv.org/abs/2406.03252": {"title": "Continuous-time modeling and bootstrap for chain ladder reserving", "link": "https://arxiv.org/abs/2406.03252", "description": "arXiv:2406.03252v1 Announce Type: new \nAbstract: We revisit the famous Mack model, which gives an estimate for the mean square error of prediction of the chain ladder claims reserves. We introduce a stochastic differential equation driven by a Brownian motion to model accumulated total claims amount for the chain ladder method. 
Within this continuous-time framework, we propose a bootstrap technique for estimating the distribution of claims reserves. It turns out that our approach inherently captures asymmetry and non-negativity, eliminating the necessity for additional assumptions. We conclude with a case study and comparative analysis against alternative methodologies based on Mack's model."}, "https://arxiv.org/abs/2406.03296": {"title": "Multi-relational Network Autoregression Model with Latent Group Structures", "link": "https://arxiv.org/abs/2406.03296", "description": "arXiv:2406.03296v1 Announce Type: new \nAbstract: Multi-relational networks among entities are frequently observed in the era of big data. Quantifying the effects of multiple networks has attracted significant research interest recently. In this work, we model multiple network effects through an autoregressive framework for tensor-valued time series. To characterize the potential heterogeneity of the networks and handle the high dimensionality of the time series data simultaneously, we assume a separate group structure for entities in each network and estimate all group memberships in a data-driven fashion. Specifically, we propose a group tensor network autoregression (GTNAR) model, which assumes that within each network, entities in the same group share the same set of model parameters, and the parameters differ across networks. An iterative algorithm is developed to estimate the model parameters and the latent group memberships simultaneously. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly- or possibly over-specified. An information criterion for group number estimation of each network is also provided to consistently select the group numbers. Lastly, we implement the method on a Yelp dataset to illustrate the usefulness of the method."}, "https://arxiv.org/abs/2406.03302": {"title": "Combining an experimental study with external data: study designs and identification strategies", "link": "https://arxiv.org/abs/2406.03302", "description": "arXiv:2406.03302v1 Announce Type: new \nAbstract: There is increasing interest in combining information from experimental studies, including randomized and single-group trials, with information from external experimental or observational data sources. Such efforts are usually motivated by the desire to compare treatments evaluated in different studies -- for instance, through the introduction of external treatment groups -- or to estimate treatment effects with greater precision. Proposals to combine experimental studies with external data were made at least as early as the 1970s, but in recent years have come under increasing consideration by regulatory agencies involved in drug and device evaluation, particularly with the increasing availability of rich observational data. In this paper, we describe basic templates of study designs and data structures for combining information from experimental studies with external data, and use the potential (counterfactual) outcomes framework to elaborate identification strategies for potential outcome means and average treatment effects in these designs. 
In formalizing designs and identification strategies for combining information from experimental studies with external data, we hope to provide a conceptual foundation to support the systematic use and evaluation of such efforts."}, "https://arxiv.org/abs/2406.03321": {"title": "Decision synthesis in monetary policy", "link": "https://arxiv.org/abs/2406.03321", "description": "arXiv:2406.03321v1 Announce Type: new \nAbstract: The macroeconomy is a sophisticated dynamic system involving significant uncertainties that complicate modelling. In response, decision makers consider multiple models that provide different predictions and policy recommendations which are then synthesized into a policy decision. In this setting, we introduce and develop Bayesian predictive decision synthesis (BPDS) to formalize monetary policy decision processes. BPDS draws on recent developments in model combination and statistical decision theory that yield new opportunities in combining multiple models, emphasizing the integration of decision goals, expectations and outcomes into the model synthesis process. Our case study concerns central bank policy decisions about target interest rates with a focus on implications for multi-step macroeconomic forecasting."}, "https://arxiv.org/abs/2406.03336": {"title": "Griddy-Gibbs sampling for Bayesian P-splines models with Poisson data", "link": "https://arxiv.org/abs/2406.03336", "description": "arXiv:2406.03336v1 Announce Type: new \nAbstract: P-splines are appealing for smoothing Poisson distributed counts. They provide a flexible setting for modeling nonlinear model components based on a discretized penalty structure with a relatively simple computational backbone. Under a Bayesian inferential process relying on Markov chain Monte Carlo, estimates of spline coefficients are typically obtained by means of Metropolis-type algorithms, which may suffer from convergence issues if the proposal distribution is not properly chosen. To avoid such a sensitive calibration choice, we extend the Griddy-Gibbs sampler to Bayesian P-splines models with a Poisson response variable. In this model class, conditional posterior distributions of spline components are shown to have attractive mathematical properties. Despite their non-conjugate nature, conditional posteriors of spline coefficients can be efficiently explored with a Gibbs sampling scheme by relying on grid-based approximations. The proposed Griddy-Gibbs sampler for Bayesian P-splines (GGSBPS) algorithm is an interesting calibration-free tool for density estimation and histogram smoothing that is made available in a compact and user-friendly routine. The performance of our approach is assessed in different simulation settings and the GGSBPS algorithm is illustrated on two real datasets."}, "https://arxiv.org/abs/2406.03358": {"title": "Bayesian Quantile Estimation and Regression with Martingale Posteriors", "link": "https://arxiv.org/abs/2406.03358", "description": "arXiv:2406.03358v1 Announce Type: new \nAbstract: Quantile estimation and regression within the Bayesian framework is challenging as the choice of likelihood and prior is not obvious. In this paper, we introduce a novel Bayesian nonparametric method for quantile estimation and regression based on the recently introduced martingale posterior (MP) framework. The core idea of the MP is that posterior sampling is equivalent to predictive imputation, which allows us to break free of the stringent likelihood-prior specification. 
We demonstrate that a recursive estimate of a smooth quantile function, subject to a martingale condition, is entirely sufficient for full nonparametric Bayesian inference. We term the resulting posterior distribution as the quantile martingale posterior (QMP), which arises from an implicit generative predictive distribution. Associated with the QMP is an expedient, MCMC-free and parallelizable posterior computation scheme, which can be further accelerated with an asymptotic approximation based on a Gaussian process. Furthermore, the well-known issue of monotonicity in quantile estimation is naturally alleviated through increasing rearrangement due to the connections to the Bayesian bootstrap. Finally, the QMP has a particularly tractable form that allows for comprehensive theoretical study, which forms a main focus of the work. We demonstrate the ease of posterior computation in simulations and real data experiments."}, "https://arxiv.org/abs/2406.03385": {"title": "Discrete Autoregressive Switching Processes in Sparse Graphical Modeling of Multivariate Time Series Data", "link": "https://arxiv.org/abs/2406.03385", "description": "arXiv:2406.03385v1 Announce Type: new \nAbstract: We propose a flexible Bayesian approach for sparse Gaussian graphical modeling of multivariate time series. We account for temporal correlation in the data by assuming that observations are characterized by an underlying and unobserved hidden discrete autoregressive process. We assume multivariate Gaussian emission distributions and capture spatial dependencies by modeling the state-specific precision matrices via graphical horseshoe priors. We characterize the mixing probabilities of the hidden process via a cumulative shrinkage prior that accommodates zero-inflated parameters for non-active components, and further incorporate a sparsity-inducing Dirichlet prior to estimate the effective number of states from the data. For posterior inference, we develop a sampling procedure that allows estimation of the number of discrete autoregressive lags and the number of states, and that cleverly avoids having to deal with the changing dimensions of the parameter space. We thoroughly investigate performance of our proposed methodology through several simulation studies. We further illustrate the use of our approach for the estimation of dynamic brain connectivity based on fMRI data collected on a subject performing a task-based experiment on latent learning"}, "https://arxiv.org/abs/2406.03400": {"title": "Non-stationary Spatio-Temporal Modeling Using the Stochastic Advection-Diffusion Equation", "link": "https://arxiv.org/abs/2406.03400", "description": "arXiv:2406.03400v1 Announce Type: new \nAbstract: We construct flexible spatio-temporal models through stochastic partial differential equations (SPDEs) where both diffusion and advection can be spatially varying. Computations are done through a Gaussian Markov random field approximation of the solution of the SPDE, which is constructed through a finite volume method. The new flexible non-separable model is compared to a flexible separable model both for reconstruction and forecasting and evaluated in terms of root mean square errors and continuous rank probability scores. A simulation study demonstrates that the non-separable model performs better when the data is simulated with non-separable effects such as diffusion and advection. 
Further, we estimate surrogate models for emulating the output of an ocean model in Trondheimsfjorden, Norway, and simulate observations of autonomous underwater vehicles. The results show that the flexible non-separable model outperforms the flexible separable model for real-time prediction of unobserved locations."}, "https://arxiv.org/abs/2406.03432": {"title": "Bayesian inference for scale mixtures of skew-normal linear models under the centered parameterization", "link": "https://arxiv.org/abs/2406.03432", "description": "arXiv:2406.03432v1 Announce Type: new \nAbstract: In many situations we are interested in modeling real data where the response distribution, even conditional on the covariates, presents asymmetry and/or heavy/light tails. In these situations, it is more suitable to consider models based on skewed and/or heavy-/light-tailed distributions, such as the class of scale mixtures of skew-normal distributions. The classical parameterization of these distributions can be problematic due to inferential issues that arise when the skewness parameter is in a neighborhood of 0; in that case, the centered parameterization becomes more appropriate. In this paper, we develop a class of scale mixtures of skew-normal distributions under the centered parameterization and propose a linear regression model based on them. We explore a hierarchical representation and set up an MCMC scheme for parameter estimation. Furthermore, we develop residual and influence analysis tools. A Monte Carlo experiment is conducted to evaluate the performance of the MCMC algorithm and the behavior of the residual distribution. The methodology is illustrated with the analysis of a real data set."}, "https://arxiv.org/abs/2406.03463": {"title": "Gaussian Copula Models for Nonignorable Missing Data Using Auxiliary Marginal Quantiles", "link": "https://arxiv.org/abs/2406.03463", "description": "arXiv:2406.03463v1 Announce Type: new \nAbstract: We present an approach for modeling and imputation of nonignorable missing data under Gaussian copulas. The analyst posits a set of quantiles of the marginal distributions of the study variables, for example, reflecting information from external data sources or elicited expert opinion. When these quantiles are accurately specified, we prove it is possible to consistently estimate the copula correlation and perform multiple imputation in the presence of nonignorable missing data. We develop algorithms for estimation and imputation that are computationally efficient, which we evaluate in simulation studies of multiple imputation inferences. We apply the model to analyze associations between lead exposure levels and end-of-grade test scores for 170,000 students in North Carolina. These measurements are not missing at random, as children deemed at-risk for high lead exposure are more likely to be measured. We construct plausible marginal quantiles for lead exposure using national statistics provided by the Centers for Disease Control and Prevention. 
Complete cases and missing at random analyses appear to underestimate the relationships between certain variables and end-of-grade test scores, while multiple imputation inferences under our model support stronger adverse associations between lead exposure and educational outcomes."}, "https://arxiv.org/abs/2406.02584": {"title": "Planetary Causal Inference: Implications for the Geography of Poverty", "link": "https://arxiv.org/abs/2406.02584", "description": "arXiv:2406.02584v1 Announce Type: cross \nAbstract: Earth observation data such as satellite imagery can, when combined with machine learning, have profound impacts on our understanding of the geography of poverty through the prediction of living conditions, especially where government-derived economic indicators are either unavailable or potentially untrustworthy. Recent work has progressed in using EO data not only to predict spatial economic outcomes, but also to explore cause and effect, an understanding which is critical for downstream policy analysis. In this review, we first document the growth of interest in EO-ML analyses in the causal space. We then trace the relationship between spatial statistics and EO-ML methods before discussing the four ways in which EO data has been used in causal ML pipelines -- (1.) poverty outcome imputation for downstream causal analysis, (2.) EO image deconfounding, (3.) EO-based treatment effect heterogeneity, and (4.) EO-based transportability analysis. We conclude by providing a workflow for how researchers can incorporate EO data in causal ML analysis going forward."}, "https://arxiv.org/abs/2406.03341": {"title": "Tackling GenAI Copyright Issues: Originality Estimation and Genericization", "link": "https://arxiv.org/abs/2406.03341", "description": "arXiv:2406.03341v1 Announce Type: cross \nAbstract: The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. While some studies explore methods to mitigate copyright risks by steering the outputs of generative models away from those resembling copyrighted data, little attention has been paid to the question of how much of a resemblance is undesirable; more original or unique data are afforded stronger protection, and the threshold level of resemblance for constituting infringement correspondingly lower. Here, leveraging this principle, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to infringe copyright. To achieve this, we introduce a metric for quantifying the level of originality of data in a manner that is consistent with the legal framework. This metric can be practically estimated by drawing samples from a generative model, which is then used for the genericization process. Experiments demonstrate that our genericization method successfully modifies the output of a text-to-image generative model so that it produces more generic, copyright-compliant images."}, "https://arxiv.org/abs/2212.12874": {"title": "Test and Measure for Partial Mean Dependence Based on Machine Learning Methods", "link": "https://arxiv.org/abs/2212.12874", "description": "arXiv:2212.12874v2 Announce Type: replace \nAbstract: It is of importance to investigate the significance of a subset of covariates $W$ for the response $Y$ given covariates $Z$ in regression modeling. 
To this end, we propose a significance test for the partial mean independence problem based on machine learning methods and data splitting. The test statistic converges to the standard chi-squared distribution under the null hypothesis while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability are also discussed. If the null hypothesis is rejected, we propose a partial Generalized Measure of Correlation (pGMC) to measure the partial mean dependence of $Y$ given $W$ after controlling for the nonlinear effect of $Z$. We present the appealing theoretical properties of the pGMC and establish the asymptotic normality of its estimator with the optimal root-$N$ convergence rate. Furthermore, a valid confidence interval for the pGMC is also derived. As an important special case when there are no conditional covariates $Z$, we introduce a new test of overall significance of covariates for the response in a model-free setting. Numerical studies and real data analysis are also conducted to compare with existing approaches and to demonstrate the validity and flexibility of our proposed procedures."}, "https://arxiv.org/abs/2303.07272": {"title": "Accounting for multiplicity in machine learning benchmark performance", "link": "https://arxiv.org/abs/2303.07272", "description": "arXiv:2303.07272v4 Announce Type: replace \nAbstract: Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows for multiple methods, oftentimes several thousands, to be evaluated under identical conditions and across time. The highest ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. Using the highest-ranked performance as an estimate of SOTA yields a biased estimator, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well-studied in the context of multiple comparisons and multiple testing, but has, as far as the authors are aware, been nearly absent from the discussion regarding SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss three real-world examples: Kaggle competitions that demonstrate various aspects."}, "https://arxiv.org/abs/2306.00382": {"title": "Calibrated and Conformal Propensity Scores for Causal Effect Estimation", "link": "https://arxiv.org/abs/2306.00382", "description": "arXiv:2306.00382v2 Announce Type: replace \nAbstract: Propensity scores are commonly used to estimate treatment effects from observational data. We argue that the probabilistic output of a learned propensity score model should be calibrated -- i.e., a predictive treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group -- and we propose simple recalibration techniques to ensure this property. 
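To make the multiplicity effect described in arXiv:2303.07272 above concrete, a small simulation with independent classifiers of identical true accuracy (all numbers below are illustrative assumptions, not the paper's settings) shows how the maximum observed benchmark score drifts upward as more methods are evaluated:

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, true_acc, n_repeats = 10_000, 0.80, 5_000    # every classifier has the same true accuracy
for n_classifiers in (1, 10, 100, 1_000):
    # observed benchmark accuracy of each classifier on the shared test set
    observed = rng.binomial(n_test, true_acc, size=(n_repeats, n_classifiers)) / n_test
    sota = observed.max(axis=1)                       # the reported "state of the art"
    print(n_classifiers, round(sota.mean(), 4))
# The expected maximum drifts upward with the number of classifiers evaluated,
# even though no classifier is actually better than 0.80.
```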
We prove that calibration is a necessary condition for unbiased treatment effect estimation when using popular inverse propensity weighted and doubly robust estimators. We derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the probabilistic propensity score model and show that calibration strictly improves this error bound while also avoiding extreme propensity weights. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional image covariates and genome-wide association studies (GWASs). Calibrated propensity scores improve the speed of GWAS analysis by more than two-fold by enabling the use of simpler models that are faster to train."}, "https://arxiv.org/abs/2306.11907": {"title": "Exact Inference for Random Effects Meta-Analyses with Small, Sparse Data", "link": "https://arxiv.org/abs/2306.11907", "description": "arXiv:2306.11907v2 Announce Type: replace \nAbstract: Meta-analysis aggregates information across related studies to provide more reliable statistical inference and has been a vital tool for assessing the safety and efficacy of many high profile pharmaceutical products. A key challenge in conducting a meta-analysis is that the number of related studies is typically small. Applying classical methods that are asymptotic in the number of studies can compromise the validity of inference, particularly when heterogeneity across studies is present. Moreover, serious adverse events are often rare and can result in one or more studies with no events in at least one study arm. Practitioners often apply arbitrary continuity corrections or remove zero-event studies to stabilize or define effect estimates in such settings, which can further invalidate subsequent inference. To address these significant practical issues, we introduce an exact inference method for comparing event rates in two treatment arms under a random effects framework, which we coin \"XRRmeta\". In contrast to existing methods, the coverage of the confidence interval from XRRmeta is guaranteed to be at or above the nominal level (up to Monte Carlo error) when the event rates, number of studies, and/or the within-study sample sizes are small. Extensive numerical studies indicate that XRRmeta does not yield overly conservative inference and we apply our proposed method to two real-data examples using our open source R package."}, "https://arxiv.org/abs/2307.04400": {"title": "ARK: Robust Knockoffs Inference with Coupling", "link": "https://arxiv.org/abs/2307.04400", "description": "arXiv:2307.04400v2 Announce Type: replace \nAbstract: We investigate the robustness of the model-X knockoffs framework with respect to the misspecified or estimated feature distribution. We achieve such a goal by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name as the approximate knockoffs (ARK) procedure, under the measures of the false discovery rate (FDR) and $k$-familywise error rate ($k$-FWER). The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that random variables in these two procedures can be close in realizations. 
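A minimal sketch of the recalibration idea from the calibrated propensity score entry above (arXiv:2306.00382): fit any propensity model, recalibrate its probabilities, and plug them into an IPW estimate of the average treatment effect. The isotonic recalibration via scikit-learn, the clipping threshold, and the synthetic data are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

def ipw_ate_calibrated(X, t, y):
    """IPW estimate of the average treatment effect using recalibrated propensity scores."""
    propensity = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
    propensity.fit(X, t)
    e = np.clip(propensity.predict_proba(X)[:, 1], 1e-3, 1 - 1e-3)  # calibrated, clipped scores
    return np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))      # Horvitz-Thompson style IPW

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = 2.0 * t + X[:, 0] + rng.normal(size=2000)
print(ipw_ate_calibrated(X, t, y))   # should land near the true effect of 2
```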
We prove that if such coupled model-X knockoffs procedure exists, the approximate knockoffs procedure can achieve the asymptotic FDR or $k$-FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework. Additionally, we formally connect our concept of knockoff variable coupling to a type of Wasserstein distance."}, "https://arxiv.org/abs/2308.11805": {"title": "The Impact of Stocks on Correlations between Crop Yields and Prices and on Revenue Insurance Premiums using Semiparametric Quantile Regression", "link": "https://arxiv.org/abs/2308.11805", "description": "arXiv:2308.11805v2 Announce Type: replace-cross \nAbstract: Crop yields and harvest prices are often considered to be negatively correlated, thus acting as a natural risk management hedge through stabilizing revenues. Storage theory gives reason to believe that the correlation is an increasing function of stocks carried over from previous years. Stock-conditioned second moments have implications for price movements during shortages and for hedging needs, while spatially varying yield-price correlation structures have implications for who benefits from commodity support policies. In this paper, we propose to use semi-parametric quantile regression (SQR) with penalized B-splines to estimate a stock-conditioned joint distribution of yield and price. The proposed method, validated through a comprehensive simulation study, enables sampling from the true joint distribution using SQR. Then it is applied to approximate stock-conditioned correlation and revenue insurance premium for both corn and soybeans in the United States. For both crops, Cornbelt core regions have more negative correlations than do peripheral regions. We find strong evidence that correlation becomes less negative as stocks increase. We also show that conditioning on stocks is important when calculating actuarially fair revenue insurance premiums. In particular, revenue insurance premiums in the Cornbelt core will be biased upward if the model for calculating premiums does not allow correlation to vary with stocks available. The stock-dependent correlation can be viewed as a form of tail dependence that, if unacknowledged, leads to mispricing of revenue insurance products."}, "https://arxiv.org/abs/2401.00139": {"title": "Is Knowledge All Large Language Models Needed for Causal Reasoning?", "link": "https://arxiv.org/abs/2401.00139", "description": "arXiv:2401.00139v2 Announce Type: replace-cross \nAbstract: This paper explores the causal reasoning of large language models (LLMs) to enhance their interpretability and reliability in advancing artificial intelligence. Despite the proficiency of LLMs in a range of tasks, their potential for understanding causality requires further exploration. We propose a novel causal attribution model that utilizes ``do-operators\" for constructing counterfactual scenarios, allowing us to systematically quantify the influence of input numerical data and LLMs' pre-existing knowledge on their causal reasoning processes. Our newly developed experimental setup assesses LLMs' reliance on contextual information and inherent knowledge across various domains. Our evaluation reveals that LLMs' causal reasoning ability mainly depends on the context and domain-specific knowledge provided. 
In the absence of such knowledge, LLMs can still maintain a degree of causal reasoning using the available numerical data, albeit with limitations in the calculations. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively leveraging both knowledge and numerical information."}, "https://arxiv.org/abs/2406.03596": {"title": "A Multivariate Equivalence Test Based on Mahalanobis Distance with a Data-Driven Margin", "link": "https://arxiv.org/abs/2406.03596", "description": "arXiv:2406.03596v1 Announce Type: new \nAbstract: Multivariate equivalence testing is needed in a variety of scenarios for drug development. For example, drug products obtained from natural sources may contain many components for which the individual effects and/or their interactions on clinical efficacy and safety cannot be completely characterized. Such lack of sufficient characterization poses a challenge both for generic drug developers to demonstrate, and for regulatory authorities to determine, the sameness of a proposed generic product to its reference product. Another case is to ensure batch-to-batch consistency of naturally derived products containing a vast number of components, such as botanical products. The equivalence or sameness between products containing many components that cannot be individually evaluated needs to be studied in a holistic manner. A multivariate equivalence test based on Mahalanobis distance may be suitable for evaluating many variables holistically. Existing studies based on this method assumed either a predetermined constant margin, for which a consensus is difficult to achieve, or a margin derived from the data whose randomness is ignored during testing. In this study, we propose a multivariate equivalence test based on Mahalanobis distance with a data-driven margin that accounts for the randomness in the margin. Several possible implementations are compared with existing approaches via extensive simulation studies."}, "https://arxiv.org/abs/2406.03681": {"title": "Multiscale Tests for Point Processes and Longitudinal Networks", "link": "https://arxiv.org/abs/2406.03681", "description": "arXiv:2406.03681v1 Announce Type: new \nAbstract: We propose a new testing framework applicable to both the two-sample problem on point processes and the community detection problem on rectangular arrays of point processes, which we refer to as longitudinal networks; the latter problem is useful in situations where we observe interactions among a group of individuals over time. Our framework is based on a multiscale discretization scheme that considers not just the global null but also a collection of nulls local to small regions in the domain; in the two-sample problem, the local rejections tell us where the intensity functions differ, and in the longitudinal network problem, the local rejections tell us when the community structure is most salient. We provide theoretical analysis for the two-sample problem and show that our method has minimax optimal power under a Holder continuity condition. 
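For the Mahalanobis-distance equivalence test discussed above (arXiv:2406.03596), the core statistic is the squared Mahalanobis distance between the two mean vectors under a pooled covariance. A short sketch follows, with a fixed margin standing in for the paper's data-driven margin; the simulated batches and margin value are assumptions.

```python
import numpy as np

def mahalanobis_equivalence(x_test, x_ref, margin):
    """Squared Mahalanobis distance between the two sample mean vectors under a pooled
    covariance; equivalence is concluded when the distance falls below the margin."""
    n1, n2 = len(x_test), len(x_ref)
    diff = x_test.mean(axis=0) - x_ref.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x_test, rowvar=False)
              + (n2 - 1) * np.cov(x_ref, rowvar=False)) / (n1 + n2 - 2)
    d2 = float(diff @ np.linalg.solve(pooled, diff))
    return d2, d2 < margin

rng = np.random.default_rng(0)
test_batch = rng.normal(0.05, 1.0, size=(50, 8))   # 8 components measured on 50 units
ref_batch = rng.normal(0.00, 1.0, size=(50, 8))
print(mahalanobis_equivalence(test_batch, ref_batch, margin=1.0))
```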
We provide extensive simulation and real data analysis demonstrating the practicality of our proposed method."}, "https://arxiv.org/abs/2406.03861": {"title": "Small area estimation with generalized random forests: Estimating poverty rates in Mexico", "link": "https://arxiv.org/abs/2406.03861", "description": "arXiv:2406.03861v1 Announce Type: new \nAbstract: Identifying and addressing poverty is challenging in administrative units with limited information on income distribution and well-being. To overcome this obstacle, small area estimation methods have been developed to provide reliable and efficient estimators at disaggregated levels, enabling informed decision-making by policymakers despite the data scarcity. From a theoretical perspective, we propose a robust and flexible approach for estimating poverty indicators based on binary response variables within the small area estimation context: the generalized mixed effects random forest. Our method employs machine learning techniques to identify predictive, non-linear relationships from data, while also modeling hierarchical structures. Mean squared error estimation is explored using a parametric bootstrap. From an applied perspective, we examine the impact of information loss due to converting continuous variables into binary variables on the performance of small area estimation methods. We evaluate the proposed point and uncertainty estimates in both model- and design-based simulations. Finally, we apply our method to a case study revealing spatial patterns of poverty in the Mexican state of Tlaxcala."}, "https://arxiv.org/abs/2406.03900": {"title": "Enhanced variable selection for boosting sparser and less complex models in distributional copula regression", "link": "https://arxiv.org/abs/2406.03900", "description": "arXiv:2406.03900v1 Announce Type: new \nAbstract: Structured additive distributional copula regression allows to model the joint distribution of multivariate outcomes by relating all distribution parameters to covariates. Estimation via statistical boosting enables accounting for high-dimensional data and incorporating data-driven variable selection, both of which are useful given the complexity of the model class. However, as known from univariate (distributional) regression, the standard boosting algorithm tends to select too many variables with minor importance, particularly in settings with large sample sizes, leading to complex models with difficult interpretation. To counteract this behavior and to avoid selecting base-learners with only a negligible impact, we combined the ideas of probing, stability selection and a new deselection approach with statistical boosting for distributional copula regression. In a simulation study and an application to the joint modelling of weight and length of newborns, we found that all proposed methods enhance variable selection by reducing the number of false positives. However, only stability selection and the deselection approach yielded similar predictive performance to classical boosting. 
Finally, the deselection approach scales better to larger datasets and led to competitive predictive performance, which we further illustrated in a genomic cohort study from the UK Biobank by modelling the joint genetic predisposition for two phenotypes."}, "https://arxiv.org/abs/2406.03971": {"title": "Comments on B", "link": "https://arxiv.org/abs/2406.03971", "description": "arXiv:2406.03971v1 Announce Type: new \nAbstract: In P\\\"otscher and Preinerstorfer (2022) and in the abridged version P\\\"otscher and Preinerstorfer (2024, published in Econometrica) we have tried to clear up the confusion introduced in Hansen (2022a) and in the earlier versions Hansen (2021a,b). Unfortunately, Hansen's (2024) reply to P\\\"otscher and Preinerstorfer (2024) further adds to the confusion. While we are already somewhat tired of the matter, for the sake of the econometrics community we feel compelled to provide clarification. We also add a comment on Portnoy (2023), a \"correction\" to Portnoy (2022), as well as on Lei and Wooldridge (2022)."}, "https://arxiv.org/abs/2406.04072": {"title": "Variational Prior Replacement in Bayesian Inference and Inversion", "link": "https://arxiv.org/abs/2406.04072", "description": "arXiv:2406.04072v1 Announce Type: new \nAbstract: Many scientific investigations require that the values of a set of model parameters be estimated using recorded data. In Bayesian inference, information from both observed data and prior knowledge is combined to update model parameters probabilistically. Prior information represents our belief about the range of values that the variables can take, and their relative probabilities when considered independently of recorded data. Situations arise in which we wish to change prior information: (i) prior information is subjective by nature, (ii) we may wish to test different states of prior information as hypotheses, and (iii) information from new studies may emerge, so prior information may evolve over time. Estimating the solution to any single inference problem is usually computationally costly, as it typically requires thousands of model samples and their forward simulations. Therefore, recalculating the Bayesian solution every time prior information changes can be extremely expensive. We develop a mathematical formulation that allows prior information to be changed in a solution using variational methods, without performing Bayesian inference on each occasion. In this method, existing prior information is removed from a previously obtained posterior distribution and is replaced by new prior information. We therefore call the methodology variational prior replacement (VPR). We demonstrate VPR using a 2D seismic full waveform inversion example, where VPR provides almost identical posterior solutions to those obtained by solving independent inference problems using different priors. The former can be completed within minutes even on a laptop, whereas the latter requires days of computations using high-performance computing resources. 
We demonstrate the value of the method by comparing the posterior solutions obtained using three different types of prior information."}, "https://arxiv.org/abs/2406.04077": {"title": "Why recommended visit intervals should be extracted when conducting longitudinal analyses using electronic health record data: examining visit mechanism and sensitivity to assessment not at random", "link": "https://arxiv.org/abs/2406.04077", "description": "arXiv:2406.04077v1 Announce Type: new \nAbstract: Electronic health records (EHRs) provide an efficient approach to generating rich longitudinal datasets. However, since patients visit as needed, the assessment times are typically irregular and may be related to the patient's health. Failing to account for this informative assessment process could result in biased estimates of the disease course. In this paper, we show how estimation of the disease trajectory can be enhanced by leveraging an underutilized piece of information that is often in the patient's EHR: physician-recommended intervals between visits. Specifically, we demonstrate how recommended intervals can be used in characterizing the assessment process, and in investigating the sensitivity of the results to assessment not at random (ANAR). We illustrate our proposed approach in a clinic-based cohort study of juvenile dermatomyositis (JDM). In this study, we found that the recommended intervals explained 78% of the variability in the assessment times. Under a specific case of ANAR where we assumed that a worsening in disease led to patients visiting earlier than recommended, the estimated population average disease activity trajectory was shifted downward relative to the trajectory assuming assessment at random. These results demonstrate the crucial role recommended intervals play in improving the rigour of the analysis by allowing us to assess both the plausibility of the AAR assumption and the sensitivity of the results to departures from this assumption. Thus, we advise that studies using irregular longitudinal data should extract recommended visit intervals and follow our procedure for incorporating them into analyses."}, "https://arxiv.org/abs/2406.04085": {"title": "Copula-based models for correlated circular data", "link": "https://arxiv.org/abs/2406.04085", "description": "arXiv:2406.04085v1 Announce Type: new \nAbstract: We exploit Gaussian copulas to specify a class of multivariate circular distributions and obtain parametric models for the analysis of correlated circular data. This approach provides a straightforward extension of traditional multivariate normal models to the circular setting, without imposing restrictions on the marginal data distribution nor requiring overwhelming routines for parameter estimation. The proposal is illustrated on two case studies of animal orientation and sea currents, where we propose an autoregressive model for circular time series and a geostatistical model for circular spatial series."}, "https://arxiv.org/abs/2406.04133": {"title": "GLOBUS: Global building renovation potential by 2070", "link": "https://arxiv.org/abs/2406.04133", "description": "arXiv:2406.04133v1 Announce Type: new \nAbstract: Surpassing the two large emission sectors of transportation and industry, the building sector accounted for 34% and 37% of global energy consumption and carbon emissions in 2021, respectively. 
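A minimal sketch of the Gaussian-copula construction for correlated circular data described in arXiv:2406.04085 above, assuming von Mises marginals (the approach itself does not restrict the marginals); the correlation matrix and marginal parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm, vonmises

def sample_circular_gaussian_copula(R, kappa, mu, n, seed=0):
    """Correlated circular draws: Gaussian copula with correlation R and von Mises marginals."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(mu)), R, size=n)   # latent Gaussian with correlation R
    u = norm.cdf(z)                                             # uniform marginals, dependence retained
    theta = np.column_stack([vonmises.ppf(u[:, j], kappa[j], loc=mu[j])
                             for j in range(len(mu))])
    return np.mod(theta, 2 * np.pi)                             # wrap angles onto [0, 2*pi)

angles = sample_circular_gaussian_copula(
    R=np.array([[1.0, 0.7], [0.7, 1.0]]), kappa=[2.0, 4.0], mu=[0.0, np.pi / 2], n=500)
print(angles[:3])
```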
The building sector, the final piece to be addressed in the transition to net-zero carbon emissions, requires a comprehensive, multisectoral strategy for reducing emissions. Until now, the absence of data on global building floorspace has impeded the measurement of building carbon intensity (carbon emissions per floorspace) and the identification of ways to achieve carbon neutrality for buildings. For this study, we develop a global building stock model (GLOBUS) to fill that data gap. Our study's primary contribution lies in providing a dataset of global building stock turnover using scenarios that incorporate various levels of building renovation. By unifying the evaluation indicators, the dataset empowers building science researchers to perform comparative analyses based on floorspace. Specifically, the building stock dataset establishes a reference for measuring carbon emission intensity and decarbonization intensity of buildings within different countries. Further, we emphasize the sufficiency of existing buildings by incorporating building renovation into the model. Renovation can minimize the need to expand the building stock, thereby bolstering decarbonization of the building sector."}, "https://arxiv.org/abs/2406.04150": {"title": "A novel robust meta-analysis model using the $t$ distribution for outlier accommodation and detection", "link": "https://arxiv.org/abs/2406.04150", "description": "arXiv:2406.04150v1 Announce Type: new \nAbstract: The random effects meta-analysis model is an important tool for integrating results from multiple independent studies. However, the standard model is based on the assumption of normal distributions for both random effects and within-study errors, making it susceptible to outlying studies. Although robust modeling using the $t$ distribution is an appealing idea, the existing work, which explores the use of the $t$ distribution only for random effects, involves complicated numerical integration and numerical optimization. In this paper, a novel robust meta-analysis model using the $t$ distribution is proposed ($t$Meta). The novelty is that the marginal distribution of the effect size in $t$Meta follows the $t$ distribution, enabling $t$Meta to simultaneously accommodate and detect outlying studies in a simple and adaptive manner. A simple and fast EM-type algorithm is developed for maximum likelihood estimation. Due to the mathematical tractability of the $t$ distribution, $t$Meta is free of numerical integration and allows for efficient optimization. Experiments on real data demonstrate that $t$Meta compares favorably with related competitors in situations involving mild outliers. Moreover, in the presence of gross outliers, while related competitors may fail, $t$Meta continues to perform consistently and robustly."}, "https://arxiv.org/abs/2406.04167": {"title": "Comparing estimators of discriminative performance of time-to-event models", "link": "https://arxiv.org/abs/2406.04167", "description": "arXiv:2406.04167v1 Announce Type: new \nAbstract: Predicting the timing and occurrence of events is a major focus of data science applications, especially in the context of biomedical research. Performance of models estimating these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. 
Many estimators for these quantities have been proposed which can be broadly categorized as either semi-parametric estimators or non-parametric estimators. In this paper, we review various estimators' mathematical construction and compare the behavior of the two classes of estimators. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly over-optimistic out-of-sample estimation of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify here suggests this class of estimators may be inappropriate for use in model assessment and selection based on out-of-sample evaluation criteria. This is due to the semi-parametric estimators' bias in favor of models that are overfit when using out-of-sample prediction criteria (e.g., cross validation). Non-parametric estimators, which do not exhibit this behavior, are highly variable for local discrimination. We propose to address the high variability problem through penalized regression splines smoothing. The behavior of various estimators of time-dependent AUC and concordance are illustrated via a simulation study using two different mechanisms that produce over-optimistic out-of-sample estimates using semi-parametric estimators. Estimators are further compared using a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014."}, "https://arxiv.org/abs/2406.04256": {"title": "Gradient Boosting for Hierarchical Data in Small Area Estimation", "link": "https://arxiv.org/abs/2406.04256", "description": "arXiv:2406.04256v1 Announce Type: new \nAbstract: This paper introduces Mixed Effect Gradient Boosting (MEGB), which combines the strengths of Gradient Boosting with Mixed Effects models to address complex, hierarchical data structures often encountered in statistical analysis. The methodological foundations, including a review of the Mixed Effects model and the Extreme Gradient Boosting method, leading to the introduction of MEGB are shown in detail. It highlights how MEGB can derive area-level mean estimations from unit-level data and calculate Mean Squared Error (MSE) estimates using a nonparametric bootstrap approach. The paper evaluates MEGB's performance through model-based and design-based simulation studies, comparing it against established estimators. The findings indicate that MEGB provides promising area mean estimations and may outperform existing small area estimators in various scenarios. The paper concludes with a discussion on future research directions, highlighting the possibility of extending MEGB's framework to accommodate different types of outcome variables or non-linear area level indicators."}, "https://arxiv.org/abs/2406.03821": {"title": "Bayesian generalized method of moments applied to pseudo-observations in survival analysis", "link": "https://arxiv.org/abs/2406.03821", "description": "arXiv:2406.03821v1 Announce Type: cross \nAbstract: Bayesian inference for survival regression modeling offers numerous advantages, especially for decision-making and external data borrowing, but demands the specification of the baseline hazard function, which may be a challenging task. We propose an alternative approach that does not need the specification of this function. 
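The MEGB entry above (arXiv:2406.04256) combines gradient boosting with random area effects for hierarchical data. Below is a deliberately simplified sketch of that general idea, alternating a boosted fit of the fixed part with shrunken group-intercept updates; it is not the authors' MEGB algorithm, and the variance components, shrinkage rule, and hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def mixed_effects_boosting(X, y, groups, n_iter=10, sigma2_b=1.0, sigma2_e=1.0):
    """Alternate a boosted fit of the fixed part with shrunken group intercepts (BLUP-style)."""
    b = pd.Series(0.0, index=pd.unique(groups))      # random intercepts, one per area/group
    booster = None
    for _ in range(n_iter):
        booster = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        booster.fit(X, y - b.loc[groups].to_numpy()) # fit the fixed part with intercepts removed
        resid = pd.DataFrame({"g": groups, "r": y - booster.predict(X)})
        stats = resid.groupby("g")["r"].agg(["mean", "count"])
        shrink = stats["count"] / (stats["count"] + sigma2_e / sigma2_b)
        b = shrink * stats["mean"]                   # updated, shrunken group intercepts
    return booster, b
```

Area-level mean estimates would then combine the boosted prediction with the estimated intercept of each area; MSE estimation via bootstrap, as in the abstract, is not sketched here.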
Our approach combines pseudo-observations, which convert censored data into longitudinal data, with the Generalized Method of Moments (GMM) to estimate the parameters of interest directly from the survival function. GMM may be viewed as an extension of the Generalized Estimating Equation (GEE) currently used for frequentist pseudo-observations analysis and can be extended to the Bayesian framework using a pseudo-likelihood function. We assessed the behavior of the frequentist and Bayesian GMM in the new context of analyzing pseudo-observations. We compared their performances to the Cox, GEE, and Bayesian piecewise exponential models through a simulation study of two-arm randomized clinical trials. Frequentist and Bayesian GMM gave valid inferences with performance similar to the three benchmark methods, except for small sample sizes and high censoring rates. For illustration, three post-hoc efficacy analyses were performed on randomized clinical trials involving patients with Ewing Sarcoma, producing results similar to those of the benchmark methods. Through a simple application of estimating hazard ratios, these findings confirm the effectiveness of this new Bayesian approach based on pseudo-observations and the generalized method of moments. This offers new insights into using pseudo-observations for Bayesian survival analysis."}, "https://arxiv.org/abs/2406.03924": {"title": "Statistical Multicriteria Benchmarking via the GSD-Front", "link": "https://arxiv.org/abs/2406.03924", "description": "arXiv:2406.03924v1 Announce Type: cross \nAbstract: Given the vast number of classifiers that have been (and continue to be) proposed, reliable methods for comparing them are becoming increasingly important. The desire for reliability is broken down into three main aspects: (1) Comparisons should allow for different quality metrics simultaneously. (2) Comparisons should take into account the statistical uncertainty induced by the choice of benchmark suite. (3) The robustness of the comparisons under small deviations in the underlying assumptions should be verifiable. To address (1), we propose to compare classifiers using a generalized stochastic dominance ordering (GSD) and present the GSD-front as an information-efficient alternative to the classical Pareto-front. For (2), we propose a consistent statistical estimator for the GSD-front and construct a statistical test for whether a (potentially new) classifier lies in the GSD-front of a set of state-of-the-art classifiers. For (3), we relax our proposed test using techniques from robust statistics and imprecise probabilities. We illustrate our concepts on the benchmark suite PMLB and on the platform OpenML."}, "https://arxiv.org/abs/2406.04191": {"title": "Strong Approximations for Empirical Processes Indexed by Lipschitz Functions", "link": "https://arxiv.org/abs/2406.04191", "description": "arXiv:2406.04191v1 Announce Type: cross \nAbstract: This paper presents new uniform Gaussian strong approximations for empirical processes indexed by classes of functions based on $d$-variate random vectors ($d\\geq1$). First, a uniform Gaussian strong approximation is established for general empirical processes indexed by Lipschitz functions, encompassing and improving on all previous results in the literature. 
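The pseudo-observation step in the Bayesian GMM entry above (arXiv:2406.03821) rests on the standard jackknife construction $\hat{\theta}_i(t) = n\hat{S}(t) - (n-1)\hat{S}^{(-i)}(t)$ applied to the Kaplan-Meier estimator. A self-contained sketch is below; it is a naive O(n^2) implementation of that generic construction, not the paper's GMM/GEE machinery, and tie handling is simplified.

```python
import numpy as np

def km_surv(time, event, t):
    """Kaplan-Meier estimate of S(t) at a single time point t."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, at_risk = 1.0, len(time)
    for i in range(len(time)):
        if time[i] > t:
            break
        if event[i] == 1:
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return surv

def pseudo_observations(time, event, t):
    """Jackknife pseudo-observations of S(t): theta_i = n*S_hat - (n-1)*S_hat_without_i."""
    n = len(time)
    full = km_surv(time, event, t)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False
        pseudo[i] = n * full - (n - 1) * km_surv(time[keep], event[keep], t)
        keep[i] = True
    return pseudo   # one uncensored-style response per subject, usable in GEE/GMM-type estimating equations
```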
When specialized to the setting considered by Rio (1994), and certain constraints on the function class hold, our result improves the approximation rate $n^{-1/(2d)}$ to $n^{-1/\\max\\{d,2\\}}$, up to the same $\\operatorname{polylog} n$ term, where $n$ denotes the sample size. Remarkably, we establish a valid uniform Gaussian strong approximation at the optimal rate $n^{-1/2}\\log n$ for $d=2$, which was previously known to be valid only for univariate ($d=1$) empirical processes via the celebrated Hungarian construction (Koml\\'os et al., 1975). Second, a uniform Gaussian strong approximation is established for a class of multiplicative separable empirical processes indexed by Lipschitz functions, which address some outstanding problems in the literature (Chernozhukov et al., 2014, Section 3). In addition, two other uniform Gaussian strong approximation results are presented for settings where the function class takes the form of a sequence of Haar basis based on generalized quasi-uniform partitions. We demonstrate the improvements and usefulness of our new strong approximation results with several statistical applications to nonparametric density and regression estimation."}, "https://arxiv.org/abs/1903.05054": {"title": "Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions", "link": "https://arxiv.org/abs/1903.05054", "description": "arXiv:1903.05054v2 Announce Type: replace \nAbstract: Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets."}, "https://arxiv.org/abs/2206.06821": {"title": "DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models", "link": "https://arxiv.org/abs/2206.06821", "description": "arXiv:2206.06821v2 Announce Type: replace \nAbstract: We present DoWhy-GCM, an extension of the DoWhy Python library, which leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation, DoWhy-GCM addresses diverse causal queries, such as identifying the root causes of outliers and distributional changes, attributing causal influences to the data generating process of each node, or diagnosis of causal structures. With DoWhy-GCM, users typically specify cause-effect relations via a causal graph, fit causal mechanisms, and pose causal queries -- all with just a few lines of code. The general documentation is available at https://www.pywhy.org/dowhy and the DoWhy-GCM specific code at https://github.com/py-why/dowhy/tree/main/dowhy/gcm."}, "https://arxiv.org/abs/2304.01273": {"title": "Heterogeneity-robust granular instruments", "link": "https://arxiv.org/abs/2304.01273", "description": "arXiv:2304.01273v3 Announce Type: replace \nAbstract: Granular instrumental variables (GIV) has experienced sharp growth in empirical macro-finance. 
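The DoWhy-GCM entry above describes a graph-specify / fit / query workflow. A short sketch against the dowhy.gcm module is shown below; the function names follow the package documentation linked in that entry, but exact signatures should be checked against the installed version, and the toy graph and data are assumptions.

```python
import networkx as nx
import numpy as np
import pandas as pd
from dowhy import gcm

# Toy data consistent with the chain X -> Y -> Z
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2 * X + rng.normal(size=1000)
Z = 3 * Y + rng.normal(size=1000)
data = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

causal_model = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y"), ("Y", "Z")]))
gcm.auto.assign_causal_mechanisms(causal_model, data)   # choose a mechanism per node from the data
gcm.fit(causal_model, data)

# One example query: samples from the interventional distribution under do(X := 1)
samples = gcm.interventional_samples(causal_model, {"X": lambda x: 1.0},
                                     num_samples_to_draw=500)
print(samples[["Y", "Z"]].mean())
```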
The methodology's rise showcases granularity's potential for identification across many economic environments, like the estimation of spillovers and demand systems. I propose a new estimator--called robust granular instrumental variables (RGIV)--that enables studying unit-level heterogeneity in spillovers. Unlike existing methods that assume heterogeneity is a function of observables, RGIV leaves heterogeneity unrestricted. In contrast to the baseline GIV estimator, RGIV allows for unknown shock variances and equal-sized units. Applied to the Euro area, I find strong evidence of country-level heterogeneity in sovereign yield spillovers."}, "https://arxiv.org/abs/2304.14545": {"title": "Augmented balancing weights as linear regression", "link": "https://arxiv.org/abs/2304.14545", "description": "arXiv:2304.14545v3 Announce Type: replace \nAbstract: We provide a novel characterization of augmented balancing weights, also known as automatic debiased machine learning (AutoDML). These popular doubly robust or de-biased machine learning estimators combine outcome modeling with balancing weights - weights that achieve covariate balance directly in lieu of estimating and inverting the propensity score. When the outcome and weighting models are both linear in some (possibly infinite) basis, we show that the augmented estimator is equivalent to a single linear model with coefficients that combine the coefficients from the original outcome model and coefficients from an unpenalized ordinary least squares (OLS) fit on the same data. We see that, under certain choices of regularization parameters, the augmented estimator often collapses to the OLS estimator alone; this occurs for example in a re-analysis of the Lalonde 1986 dataset. We then extend these results to specific choices of outcome and weighting models. We first show that the augmented estimator that uses (kernel) ridge regression for both outcome and weighting models is equivalent to a single, undersmoothed (kernel) ridge regression. This holds numerically in finite samples and lays the groundwork for a novel analysis of undersmoothing and asymptotic rates of convergence. When the weighting model is instead lasso-penalized regression, we give closed-form expressions for special cases and demonstrate a ``double selection'' property. Our framework opens the black box on this increasingly popular class of estimators, bridges the gap between existing results on the semiparametric efficiency of undersmoothed and doubly robust estimators, and provides new insights into the performance of augmented balancing weights."}, "https://arxiv.org/abs/2406.04423": {"title": "Determining the Number of Communities in Sparse and Imbalanced Settings", "link": "https://arxiv.org/abs/2406.04423", "description": "arXiv:2406.04423v1 Announce Type: new \nAbstract: Community structures represent a crucial aspect of network analysis, and various methods have been developed to identify these communities. However, a common hurdle lies in determining the number of communities K, a parameter that often requires estimation in practice. Existing approaches for estimating K face two notable challenges: the weak community signal present in sparse networks and the imbalance in community sizes or edge densities that result in unequal per-community expected degree. We propose a spectral method based on a novel network operator whose spectral properties effectively overcome both challenges. 
This operator is a refined version of the non-backtracking operator, adapted from a \"centered\" adjacency matrix. Its leading eigenvalues are more concentrated than those of the adjacency matrix for sparse networks, while they also demonstrate enhanced signal under imbalance scenarios, a benefit attributed to the centering step. This is justified, either theoretically or numerically, under the null model K = 1, in both dense and ultra-sparse settings. A goodness-of-fit test based on the leading eigenvalue can be applied to determine the number of communities K."}, "https://arxiv.org/abs/2406.04448": {"title": "Bayesian Methods to Improve The Accuracy of Differentially Private Measurements of Constrained Parameters", "link": "https://arxiv.org/abs/2406.04448", "description": "arXiv:2406.04448v1 Announce Type: new \nAbstract: Formal disclosure avoidance techniques are necessary to ensure that published data can not be used to identify information about individuals. The addition of statistical noise to unpublished data can be implemented to achieve differential privacy, which provides a formal mathematical privacy guarantee. However, the infusion of noise results in data releases which are less precise than if no noise had been added, and can lead to some of the individual data points being nonsensical. Examples of this are estimates of population counts which are negative, or estimates of the ratio of counts which violate known constraints. A straightforward way to guarantee that published estimates satisfy these known constraints is to specify a statistical model and incorporate a prior on census counts and ratios which properly constrains the parameter space. We utilize rejection sampling methods for drawing samples from the posterior distribution and we show that this implementation produces estimates of population counts and ratios which maintain formal privacy, are more precise than the original unconstrained noisy measurements, and are guaranteed to satisfy prior constraints."}, "https://arxiv.org/abs/2406.04498": {"title": "Conformal Multi-Target Hyperrectangles", "link": "https://arxiv.org/abs/2406.04498", "description": "arXiv:2406.04498v1 Announce Type: new \nAbstract: We propose conformal hyperrectangular prediction regions for multi-target regression. We propose split conformal prediction algorithms for both point and quantile regression to form hyperrectangular prediction regions, which allow for easy marginal interpretation and do not require covariance estimation. In practice, it is preferable that a prediction region is balanced, that is, having identical marginal prediction coverage, since prediction accuracy is generally equally important across components of the response vector. The proposed algorithms possess two desirable properties, namely, tight asymptotic overall nominal coverage as well as asymptotic balance, that is, identical asymptotic marginal coverage, under mild conditions. We then compare our methods to some existing methods on both simulated and real data sets. 
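For the conformal hyperrectangle entry above (arXiv:2406.04498), a bare-bones split-conformal sketch is shown below: one point regressor and one calibration quantile per response component, whose intervals multiply into a hyperrectangle. The equal split of the miscoverage budget across dimensions is a crude stand-in for the paper's balancing construction, and the model choice and 50/50 split are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def split_conformal_hyperrectangle(X, Y, X_new, alpha=0.1, seed=0):
    """Per-target split conformal intervals whose product forms a hyperrectangle."""
    X_tr, X_cal, Y_tr, Y_cal = train_test_split(X, Y, test_size=0.5, random_state=seed)
    d = Y.shape[1]
    alpha_k = alpha / d                      # naive equal split of the miscoverage budget
    lower, upper = [], []
    for k in range(d):
        model = RandomForestRegressor(n_estimators=200, random_state=seed)
        model.fit(X_tr, Y_tr[:, k])
        scores = np.abs(Y_cal[:, k] - model.predict(X_cal))          # calibration residuals
        n = len(scores)
        q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha_k)) / n))
        pred = model.predict(X_new)
        lower.append(pred - q)
        upper.append(pred + q)
    return np.column_stack(lower), np.column_stack(upper)            # per-target interval endpoints
```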
Our simulation results and real data analysis show that our methods outperform existing methods while achieving the desired nominal coverage and good balance between dimensions."}, "https://arxiv.org/abs/2406.04505": {"title": "Causal Inference in Randomized Trials with Partial Clustering and Imbalanced Dependence Structures", "link": "https://arxiv.org/abs/2406.04505", "description": "arXiv:2406.04505v1 Announce Type: new \nAbstract: In many randomized trials, participants are grouped into clusters, such as neighborhoods or schools, and these clusters are assumed to be the independent unit. This assumption, however, might not reflect the underlying dependence structure, with serious consequences for statistical power. First, consider a cluster randomized trial where participants are artificially grouped together for the purposes of randomization. For intervention participants the groups are the basis for intervention delivery, but for control participants the groups are dissolved. Second, consider an individually randomized group treatment trial where participants are randomized and then, post-randomization, intervention participants are grouped together for intervention delivery, while the control participants continue with the standard of care. In both trial designs, outcomes among intervention participants will be dependent within each cluster, while outcomes for control participants will be effectively independent. We use causal models to non-parametrically describe the data generating process for each trial design and formalize the conditional independence in the observed data distribution. For estimation and inference, we propose a novel implementation of targeted minimum loss-based estimation (TMLE) accounting for partial clustering and the imbalanced dependence structure. TMLE is a model-robust approach, leverages covariate adjustment and machine learning to improve precision, and facilitates estimation of a large set of causal effects. In finite sample simulations, TMLE achieved comparable or markedly higher statistical power than common alternatives. Finally, application of TMLE to real data from the SEARCH-IPT trial resulted in 20-57\\% efficiency gains, demonstrating the real-world consequences of our proposed approach."}, "https://arxiv.org/abs/2406.04518": {"title": "A novel multivariate regression model for unbalanced binary data : a strong conjugacy under random effect approach", "link": "https://arxiv.org/abs/2406.04518", "description": "arXiv:2406.04518v1 Announce Type: new \nAbstract: In this paper, we derive a new multivariate regression model designed to fit correlated binary data. The multivariate distribution is derived from a Bernoulli mixed model with a nonnormal random intercept under the marginal approach. The random effect distribution is assumed to be the generalized log-gamma (GLG) distribution by considering a particular parameter setting. The complementary log-log function is specified to lead to strong conjugacy between the response variable and random effect. The new discrete multivariate distribution, named the MBerGLG distribution, has location and dispersion parameters. The MBerGLG distribution leads to the MBerGLG regression (MBerGLGR) model, providing an alternative approach for fitting both unbalanced and balanced correlated binary response data. Monte Carlo simulation studies show that its maximum likelihood estimators are unbiased, efficient, and asymptotically consistent. 
Randomized quantile residuals are used to identify possible departures of the data from the proposed model and to detect atypical subjects. Finally, two applications are presented in the data analysis section."}, "https://arxiv.org/abs/2406.04599": {"title": "Imputation of Nonignorable Missing Data in Surveys Using Auxiliary Margins Via Hot Deck and Sequential Imputation", "link": "https://arxiv.org/abs/2406.04599", "description": "arXiv:2406.04599v1 Announce Type: new \nAbstract: Survey data collection is often plagued by unit and item nonresponse. To reduce reliance on strong assumptions about the missingness mechanisms, statisticians can use information about population marginal distributions known, for example, from censuses or administrative databases. One approach that does so is the Missing Data with Auxiliary Margins, or MD-AM, framework, which uses multiple imputation for both unit and item nonresponse so that survey-weighted estimates accord with the known marginal distributions. However, this framework relies on specifying and estimating a joint distribution for the survey data and nonresponse indicators, which can be computationally and practically daunting in data with many variables of mixed types. We propose two adaptations to the MD-AM framework to simplify the imputation task. First, rather than specifying a joint model for unit respondents' data, we use random hot deck imputation while still leveraging the known marginal distributions. Second, instead of sampling from conditional distributions implied by the joint model for the missing data due to item nonresponse, we apply multiple imputation by chained equations for item nonresponse before imputation for unit nonresponse. Using simulation studies with nonignorable missingness mechanisms, we demonstrate that the proposed approach can provide more accurate point and interval estimates than models that do not leverage the auxiliary information. We illustrate the approach using data on voter turnout from the U.S. Current Population Survey."}, "https://arxiv.org/abs/2406.04653": {"title": "Dynamical mixture modeling with fast, automatic determination of Markov chains", "link": "https://arxiv.org/abs/2406.04653", "description": "arXiv:2406.04653v1 Announce Type: new \nAbstract: Markov state modeling has gained popularity in various scientific fields due to its ability to reduce complex time series data into transitions between a few states. Yet, current frameworks are limited by assuming a single Markov chain describes the data, and they suffer from an inability to discern heterogeneities. As a solution, this paper proposes a variational expectation-maximization algorithm that identifies a mixture of Markov chains in a time-series data set. The method is agnostic to the definition of the Markov states, whether data-driven (e.g. by spectral clustering) or based on domain knowledge. Variational EM efficiently and organically identifies the number of Markov chains and the dynamics of each chain without expensive model comparisons or posterior sampling. The approach is supported by a theoretical analysis and numerical experiments, including simulated and observational data sets based on ${\\tt Last.fm}$ music listening, ultramarathon running, and gene expression. 
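The mixture-of-Markov-chains entry above (arXiv:2406.04653) uses variational EM with automatic determination of the number of chains. The sketch below implements a plain (non-variational) EM for the same mixture model on integer-coded state sequences, with the number of chains fixed in advance; initialization and the smoothing constant are assumptions.

```python
import numpy as np

def em_markov_mixture(seqs, n_states, n_chains, n_iter=200, seed=0):
    """Plain EM for a mixture of first-order Markov chains over integer-coded sequences."""
    rng = np.random.default_rng(seed)
    pi = np.full(n_chains, 1.0 / n_chains)                           # mixture weights
    P = rng.dirichlet(np.ones(n_states), size=(n_chains, n_states))  # row-stochastic transition matrices
    counts = np.zeros((len(seqs), n_states, n_states))
    for i, s in enumerate(seqs):                                     # per-sequence transition counts
        np.add.at(counts[i], (s[:-1], s[1:]), 1.0)
    for _ in range(n_iter):
        # E-step: responsibility of each chain for each sequence
        loglik = np.einsum("nij,kij->nk", counts, np.log(P)) + np.log(pi)
        loglik -= loglik.max(axis=1, keepdims=True)
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: reweighted counts give new weights and transition matrices
        pi = resp.mean(axis=0)
        weighted = np.einsum("nk,nij->kij", resp, counts) + 1e-6
        P = weighted / weighted.sum(axis=2, keepdims=True)
    return pi, P, resp
```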
The results show the new algorithm is competitive with contemporary mixture modeling approaches and powerful in identifying meaningful heterogeneities in time series data."}, "https://arxiv.org/abs/2406.04655": {"title": "Bayesian Inference for Spatial-temporal Non-Gaussian Data Using Predictive Stacking", "link": "https://arxiv.org/abs/2406.04655", "description": "arXiv:2406.04655v1 Announce Type: new \nAbstract: Analysing non-Gaussian spatial-temporal data typically requires introducing spatial dependence in generalised linear models through the link function of an exponential family distribution. However, unlike in Gaussian likelihoods, inference is considerably encumbered by the inability to analytically integrate out the random effects and reduce the dimension of the parameter space. Iterative estimation algorithms struggle to converge due to the presence of weakly identified parameters. We devise an approach that obviates these issues by exploiting generalised conjugate multivariate distribution theory for exponential families, which enables exact sampling from analytically available posterior distributions conditional upon some fixed process parameters. More specifically, we expand upon the Diaconis-Ylvisaker family of conjugate priors to achieve analytically tractable posterior inference for spatially-temporally varying regression models conditional on some kernel parameters. Subsequently, we assimilate inference from these individual posterior distributions over a range of values of these parameters using Bayesian predictive stacking. We evaluate inferential performance on simulated data, compare with fully Bayesian inference using Markov chain Monte Carlo and apply our proposed method to analyse spatially-temporally referenced avian count data from the North American Breeding Bird Survey database."}, "https://arxiv.org/abs/2406.04796": {"title": "Robust Inference of Dynamic Covariance Using Wishart Processes and Sequential Monte Carlo", "link": "https://arxiv.org/abs/2406.04796", "description": "arXiv:2406.04796v1 Announce Type: new \nAbstract: Several disciplines, such as econometrics, neuroscience, and computational psychology, study the dynamic interactions between variables over time. A Bayesian nonparametric model known as the Wishart process has been shown to be effective in this situation, but its inference remains highly challenging. In this work, we introduce a Sequential Monte Carlo (SMC) sampler for the Wishart process, and show how it compares to conventional inference approaches, namely MCMC and variational inference. Using simulations, we show that SMC sampling results in the most robust estimates and out-of-sample predictions of dynamic covariance. SMC especially outperforms the alternative approaches when using composite covariance functions with correlated parameters. We demonstrate the practical applicability of our proposed approach on a dataset of clinical depression (n=1), and show how an accurate representation of the posterior distribution can be used to test for dynamics in the covariance."}, "https://arxiv.org/abs/2406.04849": {"title": "Dynamic prediction of death risk given a renewal hospitalization process", "link": "https://arxiv.org/abs/2406.04849", "description": "arXiv:2406.04849v1 Announce Type: new \nAbstract: Predicting the risk of death for chronic patients is highly valuable for informed medical decision-making. 
This paper proposes a general framework for dynamic prediction of the risk of death of a patient given her hospitalization history, which is generally available to physicians. Predictions are based on a joint model for the death and hospitalization processes, thereby avoiding the potential bias arising from selection of survivors. The framework accommodates various submodels for the hospitalization process. In particular, we study prediction of the risk of death in a renewal model for hospitalizations, a common approach to recurrent event modelling. In the renewal model, the distribution of hospitalizations throughout the follow-up period impacts the risk of death. This result differs from prediction in the Poisson model, previously studied, where only the number of hospitalizations matters. We apply our methodology to a prospective, observational cohort study of 401 patients treated for COPD in one of six outpatient respiratory clinics run by the Respiratory Service of Galdakao University Hospital, with a median follow-up of 4.16 years. We find that more concentrated hospitalizations increase the risk of death."}, "https://arxiv.org/abs/2406.04874": {"title": "Approximate Bayesian Computation with Deep Learning and Conformal prediction", "link": "https://arxiv.org/abs/2406.04874", "description": "arXiv:2406.04874v1 Announce Type: new \nAbstract: Approximate Bayesian Computation (ABC) methods are commonly used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Classical ABC methods are based on nearest neighbor type algorithms and rely on the choice of so-called summary statistics, distances between datasets and a tolerance threshold. Recently, methods combining ABC with more complex machine learning algorithms have been proposed to mitigate the impact of these \"user-choices\". In this paper, we propose the first, to our knowledge, ABC method completely free of summary statistics, distance and tolerance threshold. Moreover, in contrast with usual generalizations of the ABC method, it associates a confidence interval (having a proper frequentist marginal coverage) with the posterior mean estimation (or other moment-type estimates).\n Our method, ABCD-Conformal, uses a neural network with Monte Carlo Dropout to provide an estimation of the posterior mean (or other moment-type functionals), and conformal theory to obtain associated confidence sets. The method is efficient for estimating multidimensional parameters; we test it on three different applications and compare it with other ABC methods in the literature."}, "https://arxiv.org/abs/2406.04994": {"title": "Unguided structure learning of DAGs for count data", "link": "https://arxiv.org/abs/2406.04994", "description": "arXiv:2406.04994v1 Announce Type: new \nAbstract: Mainly motivated by the problem of modelling directional dependence relationships for multivariate count data in high-dimensional settings, we present a new algorithm, called learnDAG, for learning the structure of directed acyclic graphs (DAGs). In particular, the proposed algorithm tackles the problem of learning DAGs from observational data in two main steps: (i) estimation of candidate parent sets; and (ii) feature selection. We experimentally compare learnDAG to several popular competitors in recovering the true structure of the graphs in situations where relatively moderate sample sizes are available.
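For the ABCD-Conformal entry (arXiv:2406.04874), a minimal sketch of the general recipe is split conformal prediction wrapped around a Monte Carlo point estimate: the arrays of stochastic forward passes below stand in for MC-Dropout draws from a trained network, and the normalized nonconformity score is one common choice, not necessarily the paper's exact construction.

```python
# Split-conformal interval around a Monte Carlo point prediction.
# mc_draws_*: arrays of shape (n, T) holding T stochastic forward passes per
# observation; theta_cal: true parameter values on the calibration set.
import numpy as np

def conformal_interval(mc_draws_cal, theta_cal, mc_draws_test, alpha=0.1):
    mean_cal = mc_draws_cal.mean(axis=1)
    sd_cal = mc_draws_cal.std(axis=1) + 1e-8
    # Normalized nonconformity scores on the calibration set.
    scores = np.abs(theta_cal - mean_cal) / sd_cal
    n = len(scores)
    # Finite-sample-corrected quantile level, clipped at 1.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    mean_test = mc_draws_test.mean(axis=1)
    sd_test = mc_draws_test.std(axis=1) + 1e-8
    return mean_test - q * sd_test, mean_test + q * sd_test

# Toy usage with synthetic "MC Dropout" draws.
rng = np.random.default_rng(0)
theta_cal = rng.normal(size=200)
draws_cal = theta_cal[:, None] + rng.normal(scale=0.3, size=(200, 50))
theta_test = rng.normal(size=5)
draws_test = theta_test[:, None] + rng.normal(scale=0.3, size=(5, 50))
lo, hi = conformal_interval(draws_cal, theta_cal, draws_test)
```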
Furthermore, to strengthen the case for our algorithm, we validate it through the analysis of real datasets."}, "https://arxiv.org/abs/2406.05010": {"title": "Testing common invariant subspace of multilayer networks", "link": "https://arxiv.org/abs/2406.05010", "description": "arXiv:2406.05010v1 Announce Type: new \nAbstract: A graph (or network) is a mathematical structure that has been widely used to model relational data. As real-world systems get more complex, multilayer (or multiple) networks are employed to represent diverse patterns of relationships among the objects in the systems. One active research problem in multilayer network analysis is to study the common invariant subspace of the networks, because such a common invariant subspace could capture the fundamental structural patterns and interactions across all layers. Many methods have been proposed to estimate the common invariant subspace. However, whether real-world multilayer networks share the same common subspace remains unknown. In this paper, we first attempt to answer this question by means of hypothesis testing. The null hypothesis states that the multilayer networks share the same subspace, and under the alternative hypothesis, there exist at least two networks that do not have the same subspace. We propose a Weighted Degree Difference Test, derive the limiting distribution of the test statistic and provide an analytical characterization of its power. A simulation study shows that the proposed test has satisfactory performance, and a real data application is provided."}, "https://arxiv.org/abs/2406.05012": {"title": "TrendLSW: Trend and Spectral Estimation of Nonstationary Time Series in R", "link": "https://arxiv.org/abs/2406.05012", "description": "arXiv:2406.05012v1 Announce Type: new \nAbstract: The TrendLSW R package has been developed to provide users with a suite of wavelet-based techniques to analyse the statistical properties of nonstationary time series. The key components of the package are (a) two approaches for the estimation of the evolutionary wavelet spectrum in the presence of trend; (b) wavelet-based trend estimation in the presence of locally stationary wavelet errors via both linear and nonlinear wavelet thresholding; and (c) the calculation of associated pointwise confidence intervals. Lastly, the package directly implements boundary handling methods that enable the methods to be performed on data of arbitrary length, not just dyadic length as is common for wavelet-based methods, ensuring no pre-processing of data is necessary. The key functionality of the package is demonstrated through two data examples, arising from biology and activity monitoring."}, "https://arxiv.org/abs/2406.04915": {"title": "Bayesian inference of Latent Spectral Shapes", "link": "https://arxiv.org/abs/2406.04915", "description": "arXiv:2406.04915v1 Announce Type: cross \nAbstract: This paper proposes a hierarchical spatial-temporal model for modelling the spectrograms of animal calls. The motivation stems from analyzing recordings of the so-called grunt calls emitted by various lemur species. Our goal is to identify a latent spectral shape that characterizes each species and facilitates measuring dissimilarities between them. The model addresses the synchronization of animal vocalizations, due to varying time-lengths and speeds, with non-stationary temporal patterns and accounts for periodic sampling artifacts produced by the time discretization of analog signals.
The former is achieved through a synchronization function, and the latter is modeled using a circular representation of time. To overcome the curse of dimensionality inherent in the model's implementation, we employ the Nearest Neighbor Gaussian Process, and posterior samples are obtained using the Markov Chain Monte Carlo method. We apply the model to a real dataset comprising sounds from 8 different species. We define a representative sound for each species and compare them using a simple distance measure. Cross-validation is used to evaluate the predictive capability of our proposal and explore special cases. Additionally, a simulation example is provided to demonstrate that the algorithm is capable of retrieving the true parameters."}, "https://arxiv.org/abs/2105.07685": {"title": "Time-lag bias induced by unobserved heterogeneity: comparing treated patients to controls with a different start of follow-up", "link": "https://arxiv.org/abs/2105.07685", "description": "arXiv:2105.07685v2 Announce Type: replace \nAbstract: In comparative effectiveness research, treated and control patients might have a different start of follow-up as treatment is often started later in the disease trajectory. This typically occurs when data from treated and controls are not collected within the same source. Only patients who did not yet experience the event of interest whilst in the control condition end up in the treatment data source. In case of unobserved heterogeneity, these treated patients will have a lower average risk than the controls. We illustrate how failing to account for this time-lag between treated and controls leads to bias in the estimated treatment effect. We define estimands and time axes, then explore five methods to adjust for this time-lag bias by utilising the time between diagnosis and treatment initiation in different ways. We conducted a simulation study to evaluate whether these methods reduce the bias and then applied the methods to a comparison between fertility patients treated with insemination and similar but untreated patients. We conclude that time-lag bias can be vast and that the time between diagnosis and treatment initiation should be taken into account in the analysis to respect the chronology of the disease and treatment trajectory."}, "https://arxiv.org/abs/2307.07342": {"title": "Bounded-memory adjusted scores estimation in generalized linear models with large data sets", "link": "https://arxiv.org/abs/2307.07342", "description": "arXiv:2307.07342v4 Announce Type: replace \nAbstract: The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression, in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood estimates are not, which is a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters and appear to deliver a substantial reduction in the persistent bias of the maximum likelihood estimator in high-dimensional settings where the number of parameters grows asymptotically as a proportion of the number of observations.
In this work, we develop and present two new variants of iteratively reweighted least squares for estimating generalized linear models with adjusted score equations for mean bias reduction and maximization of the likelihood penalized by a positive power of the Jeffreys-prior penalty, which eliminate the requirement of storing $O(n)$ quantities in memory, and can operate with data sets that exceed computer memory or even hard drive capacity. We achieve that through incremental QR decompositions, which enable IWLS iterations to have access only to data chunks of predetermined size. Both procedures can also be readily adapted to fit generalized linear models when distinct parts of the data are stored across different sites and, due to privacy concerns, cannot be fully transferred across sites. We assess the procedures through a real-data application with millions of observations."}, "https://arxiv.org/abs/2401.11352": {"title": "A Connection Between Covariate Adjustment and Stratified Randomization in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2401.11352", "description": "arXiv:2401.11352v2 Announce Type: replace \nAbstract: The statistical efficiency of randomized clinical trials can be improved by incorporating information from baseline covariates (i.e., pre-treatment patient characteristics). This can be done in the design stage using stratified (permuted block) randomization or in the analysis stage through covariate adjustment. This article makes a connection between covariate adjustment and stratified randomization in a general framework where all regular, asymptotically linear estimators are identified as augmented estimators. From a geometric perspective, covariate adjustment can be viewed as an attempt to approximate the optimal augmentation function, and stratified randomization improves a given approximation by moving it closer to the optimal augmentation function. The efficiency benefit of stratified randomization is asymptotically equivalent to attaching an optimal augmentation term based on the stratification factor. Under stratified randomization, adjusting for the stratification factor only in data analysis is not expected to improve efficiency, and the key to efficient estimation is incorporating new prognostic information from other covariates. In designing a trial with stratified randomization, it is not essential to include all important covariates in the stratification, because their prognostic information can be incorporated through covariate adjustment. These observations are confirmed in a simulation study and illustrated using real clinical trial data."}, "https://arxiv.org/abs/2406.05188": {"title": "Numerically robust square root implementations of statistical linear regression filters and smoothers", "link": "https://arxiv.org/abs/2406.05188", "description": "arXiv:2406.05188v1 Announce Type: new \nAbstract: In this article, square-root formulations of the statistical linear regression filter and smoother are developed. Crucially, the method uses QR decompositions rather than Cholesky downdates. This makes the method inherently more numerically robust than the downdate-based methods, which may fail in the face of rounding errors. This increased robustness is demonstrated in an ill-conditioned problem, where it is compared against a reference implementation in both double and single precision arithmetic.
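The bounded-memory entry above (arXiv:2307.07342) accumulates each IWLS solve through incremental QR decompositions over data chunks. A minimal sketch of that inner step is given below, assuming chunks of design matrix X, working response z, and weights w; the adjusted-score and Jeffreys-penalty modifications described in the abstract are not shown.

```python
# One weighted least-squares solve accumulated chunk by chunk via incremental QR,
# keeping only a (p x p) triangular factor and a length-p vector in memory.
import numpy as np

def chunked_wls(chunks):
    """chunks: iterable of (X, z, w) with X (n_i x p), z (n_i,), w (n_i,)."""
    R, qtb = None, None
    for X, z, w in chunks:
        sw = np.sqrt(w)
        A = X * sw[:, None]
        b = z * sw
        if R is not None:
            A = np.vstack([R, A])            # stack previous triangular factor on top
            b = np.concatenate([qtb, b])
        Q, R = np.linalg.qr(A)               # reduced QR: R is (p x p) once n >= p
        qtb = Q.T @ b
    return np.linalg.solve(R, qtb)           # weighted least-squares coefficients

# Toy usage: 10 chunks of 1000 rows, 5 coefficients.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
def make_chunk():
    X = rng.normal(size=(1000, 5))
    z = X @ beta + rng.normal(size=1000)
    return X, z, np.ones(1000)
print(chunked_wls(make_chunk() for _ in range(10)))
```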
The new implementation is found to be more robust than the alternative when implemented in lower-precision arithmetic."}, "https://arxiv.org/abs/2406.05193": {"title": "Probabilistic Clustering using Shared Latent Variable Model for Assessing Alzheimers Disease Biomarkers", "link": "https://arxiv.org/abs/2406.05193", "description": "arXiv:2406.05193v1 Announce Type: new \nAbstract: The preclinical stage of many neurodegenerative diseases can span decades before symptoms become apparent. Understanding the sequence of preclinical biomarker changes provides a critical opportunity for early diagnosis and effective intervention prior to significant loss of patients' brain functions. The main challenge to early detection lies in the absence of direct observation of the disease state and the considerable variability in both biomarkers and disease dynamics among individuals. Recent research hypothesized the existence of subgroups with distinct biomarker patterns due to co-morbidities and degrees of brain resilience. Our ability to diagnose and intervene early during the preclinical stage of neurodegenerative diseases will be enhanced by further insights into heterogeneity in the biomarker-disease relationship. In this paper, we focus on Alzheimer's disease (AD) and attempt to identify the systematic patterns within the heterogeneous AD biomarker-disease cascade. Specifically, we quantify the disease progression using a dynamic latent variable whose mixture distribution represents patient subgroups. Model estimation uses Hamiltonian Monte Carlo with the number of clusters determined by the Bayesian Information Criterion (BIC). We report simulation studies that investigate the performance of the proposed model in finite sample settings that are similar to our motivating application. We apply the proposed model to the BIOCARD data, a longitudinal study that was conducted over two decades among individuals who were initially cognitively normal. Our application yields evidence consistent with the hypothetical model of biomarker dynamics presented in Jack et al. (2013). In addition, our analysis identified two subgroups with distinct disease-onset patterns. Finally, we develop a dynamic prediction approach to improve the precision of prognoses."}, "https://arxiv.org/abs/2406.05304": {"title": "Polytomous Explanatory Item Response Models for Item Discrimination: Assessing Negative-Framing Effects in Social-Emotional Learning Surveys", "link": "https://arxiv.org/abs/2406.05304", "description": "arXiv:2406.05304v1 Announce Type: new \nAbstract: Modeling item parameters as a function of item characteristics has a long history but has generally focused on models for item location. Explanatory item response models for item discrimination are available but rarely used. In this study, we extend existing approaches for modeling item discrimination from dichotomous to polytomous item responses. We illustrate our proposed approach with an application to four social-emotional learning surveys of preschool children to investigate how item discrimination depends on whether an item is positively or negatively framed. Negative framing predicts significantly lower item discrimination on two of the four surveys, and a plausibly causal estimate from a regression discontinuity analysis shows that negative framing reduces discrimination by about 30\\% on one survey.
We conclude with a discussion of potential applications of explanatory models for item discrimination."}, "https://arxiv.org/abs/2406.05340": {"title": "Selecting the Number of Communities for Weighted Degree-Corrected Stochastic Block Models", "link": "https://arxiv.org/abs/2406.05340", "description": "arXiv:2406.05340v1 Announce Type: new \nAbstract: We investigate how to select the number of communities for weighted networks without full likelihood modeling. First, we propose a novel weighted degree-corrected stochastic block model (DCSBM), in which the mean adjacency matrix is modeled in the same way as in the standard DCSBM, while the variance profile matrix is assumed to be related to the mean adjacency matrix through a given variance function. Our method for selecting the number of communities is based on a sequential testing framework; in each step, the weighted DCSBM is fitted via a spectral clustering method. A key step is to carry out matrix scaling on the estimated variance profile matrix. The resulting scaling factors can be used to normalize the adjacency matrix, from which the testing statistic is obtained. Under mild conditions on the weighted DCSBM, our proposed procedure is shown to be consistent in estimating the true number of communities. Numerical experiments on both simulated and real network data also demonstrate the desirable empirical properties of our method."}, "https://arxiv.org/abs/2406.05548": {"title": "Causal Interpretation of Regressions With Ranks", "link": "https://arxiv.org/abs/2406.05548", "description": "arXiv:2406.05548v1 Announce Type: new \nAbstract: In studies of educational production functions or intergenerational mobility, it is common to transform the key variables into percentile ranks. Yet, it remains unclear what the regression coefficient estimates when ranks of the outcome or the treatment are used. In this paper, we derive effective causal estimands for a broad class of commonly used regression methods, including the ordinary least squares (OLS), two-stage least squares (2SLS), difference-in-differences (DiD), and regression discontinuity designs (RDD). Specifically, we introduce a novel primitive causal estimand, the Rank Average Treatment Effect (rank-ATE), and prove that it serves as the building block of the effective estimands of all the aforementioned econometric methods. For 2SLS, DiD, and RDD, we show that direct applications to outcome ranks identify parameters that are difficult to interpret. To address this issue, we develop alternative methods to identify more interpretable causal parameters."}, "https://arxiv.org/abs/2406.05592": {"title": "Constrained Design of a Binary Instrument in a Partially Linear Model", "link": "https://arxiv.org/abs/2406.05592", "description": "arXiv:2406.05592v1 Announce Type: new \nAbstract: We study the question of how best to assign an encouragement in a randomized encouragement study. In our setting, units arrive with covariates, receive a nudge toward treatment or control, acquire one of those statuses in a way that need not align with the nudge, and finally have a response observed. The nudge can be seen as a binary instrument that affects the response only via the treatment status. Our goal is to assign the nudge as a function of covariates in a way that best estimates the local average treatment effect (LATE). We assume a partially linear model, wherein the baseline model is non-parametric and the treatment term is linear in the covariates.
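The weighted-DCSBM entry (arXiv:2406.05340) mentions matrix scaling of the estimated variance profile matrix, with the scaling factors then used to normalize the adjacency matrix. The abstract does not spell out the scaling target, so the sketch below assumes the common symmetric (Sinkhorn-type) scaling to unit row sums; treat it as an illustration of matrix scaling rather than the paper's exact step.

```python
# Symmetric Sinkhorn-style scaling: find positive d so that diag(d) V diag(d)
# has unit row sums. Assumed stand-in for the "matrix scaling" step.
import numpy as np

def symmetric_scaling(V, n_iter=500, tol=1e-10):
    d = np.ones(V.shape[0])
    for _ in range(n_iter):
        row_sums = d * (V @ d)          # row sums of diag(d) V diag(d)
        d_new = d / np.sqrt(row_sums)   # equivalent to sqrt(d / (V d))
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Usage: scale an estimated variance profile matrix, then normalize an adjacency matrix.
rng = np.random.default_rng(0)
V = rng.uniform(0.5, 2.0, size=(6, 6)); V = (V + V.T) / 2
d = symmetric_scaling(V)
print(np.round(np.diag(d) @ V @ np.diag(d) @ np.ones(6), 6))   # ~all ones
A = rng.poisson(2.0, size=(6, 6)).astype(float); A = (A + A.T) / 2
A_normalized = np.diag(d) @ A @ np.diag(d)
```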
Under this model, we outline a two-stage procedure to consistently estimate the LATE. Though the variance of the LATE is intractable, we derive a finite sample approximation and thus a design criterion to minimize. This criterion is convex, allowing for constraints that might arise for budgetary or ethical reasons. We prove conditions under which our solution asymptotically recovers the lowest true variance among all possible nudge propensities. We apply our method to a semi-synthetic example involving triage in an emergency department and find significant gains relative to a regression discontinuity design."}, "https://arxiv.org/abs/2406.05607": {"title": "HAL-based Plugin Estimation of the Causal Dose-Response Curve", "link": "https://arxiv.org/abs/2406.05607", "description": "arXiv:2406.05607v1 Announce Type: new \nAbstract: Estimating the marginally adjusted dose-response curve for continuous treatments is a longstanding statistical challenge critical across multiple fields. In the context of parametric models, mis-specification may result in substantial bias, hindering the accurate discernment of the true data generating distribution and the associated dose-response curve. In contrast, non-parametric models face difficulties as the dose-response curve isn't pathwise differentiable, and then there is no $\\sqrt{n}$-consistent estimator. The emergence of the Highly Adaptive Lasso (HAL) MLE by van der Laan [2015] and van der Laan [2017] and the subsequent theoretical evidence by van der Laan [2023] regarding its pointwise asymptotic normality and uniform convergence rates, have highlighted the asymptotic efficacy of the HAL-based plug-in estimator for this intricate problem. This paper delves into the HAL-based plug-in estimators, including those with cross-validation and undersmoothing selectors, and introduces the undersmoothed smoothness-adaptive HAL-based plug-in estimator. We assess these estimators through extensive simulations, employing detailed evaluation metrics. Building upon the theoretical proofs in van der Laan [2023], our empirical findings underscore the asymptotic effectiveness of the undersmoothed smoothness-adaptive HAL-based plug-in estimator in estimating the marginally adjusted dose-response curve."}, "https://arxiv.org/abs/2406.05805": {"title": "Toward identifiability of total effects in summary causal graphs with latent confounders: an extension of the front-door criterion", "link": "https://arxiv.org/abs/2406.05805", "description": "arXiv:2406.05805v1 Announce Type: new \nAbstract: Conducting experiments to estimate total effects can be challenging due to cost, ethical concerns, or practical limitations. As an alternative, researchers often rely on causal graphs to determine if it is possible to identify these effects from observational data. Identifying total effects in fully specified non-temporal causal graphs has garnered considerable attention, with Pearl's front-door criterion enabling the identification of total effects in the presence of latent confounding even when no variable set is sufficient for adjustment. However, specifying a complete causal graph is challenging in many domains. Extending these identifiability results to partially specified graphs is crucial, particularly in dynamic systems where causal relationships evolve over time. 
This paper addresses the challenge of identifying total effects using a specific and well-known partially specified graph in dynamic systems called a summary causal graph, which does not specify the temporal lag between causal relations and can contain cycles. In particular, this paper presents sufficient graphical conditions for identifying total effects from observational data, even in the presence of hidden confounding and when no variable set is sufficient for adjustment, contributing to the ongoing effort to understand and estimate causal effects from observational data using summary causal graphs."}, "https://arxiv.org/abs/2406.05944": {"title": "Embedding Network Autoregression for time series analysis and causal peer effect inference", "link": "https://arxiv.org/abs/2406.05944", "description": "arXiv:2406.05944v1 Announce Type: new \nAbstract: We propose an Embedding Network Autoregressive Model (ENAR) for multivariate networked longitudinal data. We assume the network is generated from a latent variable model, and these unobserved variables are included in a structural peer effect model or a time series network autoregressive model as additive effects. This approach takes a unified view of two related problems, (1) modeling and predicting multivariate time series data and (2) causal peer influence estimation in the presence of homophily from finite time longitudinal data. Our estimation strategy comprises estimating latent factors from the observed network adjacency matrix either through spectral embedding or maximum likelihood estimation, followed by least squares estimation of the network autoregressive model. We show that the estimated momentum and peer effect parameters are consistent and asymptotically normal in asymptotic setups with a growing number of network vertices N while including a growing number of time points T and finite T cases. We allow the number of latent vectors K to grow at appropriate rates, which improves upon existing rates when such results are available for related models."}, "https://arxiv.org/abs/2406.05987": {"title": "Data-Driven Real-time Coupon Allocation in the Online Platform", "link": "https://arxiv.org/abs/2406.05987", "description": "arXiv:2406.05987v1 Announce Type: new \nAbstract: Traditionally, firms have offered coupons to customer groups at predetermined discount rates. However, advancements in machine learning and the availability of abundant customer data now enable platforms to provide real-time customized coupons to individuals. In this study, we partner with Meituan, a leading shopping platform, to develop a real-time, end-to-end coupon allocation system that is fast and effective in stimulating demand while adhering to marketing budgets when faced with uncertain traffic from a diverse customer base. Leveraging comprehensive customer and product features, we estimate Conversion Rates (CVR) under various coupon values and employ isotonic regression to ensure the monotonicity of predicted CVRs with respect to coupon value. Using calibrated CVR predictions as input, we propose a Lagrangian Dual-based algorithm that efficiently determines optimal coupon values for each arriving customer within 50 milliseconds. We theoretically and numerically investigate the model performance under parameter misspecifications and apply a control loop to adapt to real-time updated information, thereby better adhering to the marketing budget. 
Finally, we demonstrate through large-scale field experiments and observational data that our proposed coupon allocation algorithm outperforms traditional approaches in terms of both higher conversion rates and increased revenue. As of May 2024, Meituan has implemented our framework to distribute coupons to over 100 million users across more than 110 major cities in China, resulting in an additional CNY 8 million in annual profit. We demonstrate how to integrate a machine learning prediction model for estimating customer CVR, a Lagrangian Dual-based coupon value optimizer, and a control system to achieve real-time coupon delivery while dynamically adapting to random customer arrival patterns."}, "https://arxiv.org/abs/2406.06071": {"title": "Bayesian Parametric Methods for Deriving Distribution of Restricted Mean Survival Time", "link": "https://arxiv.org/abs/2406.06071", "description": "arXiv:2406.06071v1 Announce Type: new \nAbstract: We propose a Bayesian method for deriving the distribution of restricted mean survival time (RMST) using posterior samples, which accounts for covariates and heterogeneity among clusters based on a parametric model for survival time. We derive an explicit RMST equation by evaluating the integral of the survival function, allowing for the calculation of not only the mean and credible interval but also the mode, median, and probability of exceeding a certain value. Additionally, we propose two methods: one using random effects to account for heterogeneity among clusters and another utilizing frailty. We developed custom Stan code for the exponential, Weibull, log-normal frailty, and log-logistic models, as they cannot be processed using the brm functions in R. We evaluate our proposed methods through computer simulations and analyze real data from the eight Empowered Action Group states in India to confirm consistent results across states after adjusting for cluster differences. In conclusion, we derived explicit RMST formulas for parametric models and their distributions, enabling the calculation of the mean, median, mode, and credible interval. Our simulations confirmed the robustness of the proposed methods, and using the shrinkage effect allowed for more accurate results for each cluster. This manuscript has not been published elsewhere. The manuscript is not under consideration in whole or in part by another journal."}, "https://arxiv.org/abs/2406.06426": {"title": "Biomarker-Guided Adaptive Enrichment Design with Threshold Detection for Clinical Trials with Time-to-Event Outcome", "link": "https://arxiv.org/abs/2406.06426", "description": "arXiv:2406.06426v1 Announce Type: new \nAbstract: Biomarker-guided designs are increasingly used to evaluate personalized treatments based on patients' biomarker status in Phase II and III clinical trials. With adaptive enrichment, these designs can improve the efficiency of evaluating the treatment effect in biomarker-positive patients by increasing their proportion in the randomized trial. While time-to-event outcomes are often used as the primary endpoint to measure treatment effects for a new therapy in severe diseases like cancer and cardiovascular diseases, there is limited research on biomarker-guided adaptive enrichment trials in this context. Such trials almost always adopt hazard ratio methods for statistical measurement of treatment effects.
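For the coupon-allocation entry (arXiv:2406.05987), a stylized picture of a Lagrangian dual-based allocator is: score each candidate coupon value by predicted conversions minus a dual price on expected spend, pick the argmax for each arriving customer, and nudge the dual variable toward the budget. The CVR model, constants, and update rule below are illustrative placeholders, not Meituan's production system.

```python
# Stylized Lagrangian-dual coupon assignment with an online dual-ascent update.
import numpy as np

COUPON_VALUES = np.array([0.0, 2.0, 5.0, 10.0])
BUDGET_PER_CUSTOMER = 1.5   # average allowed expected coupon spend
STEP = 0.05                 # dual ascent step size

def cvr(features, c):
    """Placeholder conversion-rate model, monotone in the coupon value c."""
    base = 1 / (1 + np.exp(-features @ np.array([0.4, -0.3])))
    return base + (1 - base) * (1 - np.exp(-0.15 * c))

def assign_coupon(features, lam):
    p = cvr(features, COUPON_VALUES)
    score = p - lam * p * COUPON_VALUES      # conversions minus priced expected spend
    return COUPON_VALUES[np.argmax(score)]

lam, rng = 0.1, np.random.default_rng(0)
for t in range(10_000):                      # stream of arriving customers
    x = rng.normal(size=2)
    c = assign_coupon(x, lam)
    expected_spend = cvr(x, c) * c
    lam = max(0.0, lam + STEP * (expected_spend - BUDGET_PER_CUSTOMER))
print(f"final dual variable: {lam:.3f}")
```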
In contrast, restricted mean survival time (RMST) has gained popularity for analyzing time-to-event outcomes because it offers more straightforward interpretations of treatment effects and does not require the proportional hazard assumption. This paper proposes a two-stage biomarker-guided adaptive RMST design with threshold detection and patient enrichment. We develop sophisticated methods for identifying the optimal biomarker threshold, treatment effect estimators in the biomarker-positive subgroup, and approaches for type I error rate, power analysis, and sample size calculation. We present a numerical example of re-designing an oncology trial. An extensive simulation study is conducted to evaluate the performance of the proposed design."}, "https://arxiv.org/abs/2406.06452": {"title": "Estimating Heterogeneous Treatment Effects by Combining Weak Instruments and Observational Data", "link": "https://arxiv.org/abs/2406.06452", "description": "arXiv:2406.06452v1 Announce Type: new \nAbstract: Accurately predicting conditional average treatment effects (CATEs) is crucial in personalized medicine and digital platform analytics. Since often the treatments of interest cannot be directly randomized, observational data is leveraged to learn CATEs, but this approach can incur significant bias from unobserved confounding. One strategy to overcome these limitations is to seek latent quasi-experiments in instrumental variables (IVs) for the treatment, for example, a randomized intent to treat or a randomized product recommendation. This approach, on the other hand, can suffer from low compliance, i.e., IV weakness. Some subgroups may even exhibit zero compliance meaning we cannot instrument for their CATEs at all. In this paper we develop a novel approach to combine IV and observational data to enable reliable CATE estimation in the presence of unobserved confounding in the observational data and low compliance in the IV data, including no compliance for some subgroups. We propose a two-stage framework that first learns biased CATEs from the observational data, and then applies a compliance-weighted correction using IV data, effectively leveraging IV strength variability across covariates. We characterize the convergence rates of our method and validate its effectiveness through a simulation study. Additionally, we demonstrate its utility with real data by analyzing the heterogeneous effects of 401(k) plan participation on wealth."}, "https://arxiv.org/abs/2406.06516": {"title": "Distribution-Free Predictive Inference under Unknown Temporal Drift", "link": "https://arxiv.org/abs/2406.06516", "description": "arXiv:2406.06516v1 Announce Type: new \nAbstract: Distribution-free prediction sets play a pivotal role in uncertainty quantification for complex statistical models. Their validity hinges on reliable calibration data, which may not be readily available as real-world environments often undergo unknown changes over time. In this paper, we propose a strategy for choosing an adaptive window and use the data therein to construct prediction sets. The window is selected by optimizing an estimated bias-variance tradeoff. We provide sharp coverage guarantees for our method, showing its adaptivity to the underlying temporal drift. 
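Both of the preceding RMST entries rely on the identity RMST(tau) = integral of the survival function S(t) from 0 to tau; for an exponential model this reduces to (1 - exp(-lambda*tau))/lambda. A minimal sketch of the posterior-samples idea in arXiv:2406.06071 evaluates this integral per posterior draw; the Weibull survival function and the synthetic draws below are assumptions standing in for MCMC output.

```python
# RMST(tau) = integral_0^tau S(t) dt, evaluated per posterior draw so the whole
# posterior distribution of the RMST (mean, median, credible interval) is available.
import numpy as np

def rmst_weibull(shape, scale, tau, n_grid=2000):
    t = np.linspace(0.0, tau, n_grid)
    surv = np.exp(-(t / scale) ** shape)
    return np.trapz(surv, t)

tau = 5.0
rng = np.random.default_rng(0)
# Stand-in posterior draws for (shape, scale); in practice these come from MCMC.
shape_draws = rng.normal(1.3, 0.05, size=4000)
scale_draws = rng.normal(4.0, 0.20, size=4000)
rmst_draws = np.array([rmst_weibull(k, s, tau) for k, s in zip(shape_draws, scale_draws)])

print("posterior mean RMST:", rmst_draws.mean())
print("posterior median RMST:", np.median(rmst_draws))
print("95% credible interval:", np.quantile(rmst_draws, [0.025, 0.975]))
```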
We also illustrate its efficacy through numerical experiments on synthetic and real data."}, "https://arxiv.org/abs/2406.05242": {"title": "Markov chain Monte Carlo without evaluating the target: an auxiliary variable approach", "link": "https://arxiv.org/abs/2406.05242", "description": "arXiv:2406.05242v1 Announce Type: cross \nAbstract: In sampling tasks, it is common for target distributions to be known up to a normalizing constant. However, in many situations, evaluating even the unnormalized distribution can be costly or infeasible. This issue arises in scenarios such as sampling from the Bayesian posterior for tall datasets and the `doubly-intractable' distributions. In this paper, we begin by observing that seemingly different Markov chain Monte Carlo (MCMC) algorithms, such as the exchange algorithm, PoissonMH, and TunaMH, can be unified under a simple common procedure. We then extend this procedure into a novel framework that allows the use of auxiliary variables in both the proposal and acceptance-rejection steps. We develop the theory of the new framework, applying it to existing algorithms to simplify and extend their results. Several new algorithms emerge from this framework, with improved performance demonstrated on both synthetic and real datasets."}, "https://arxiv.org/abs/2406.05264": {"title": "\"Minus-One\" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity", "link": "https://arxiv.org/abs/2406.05264", "description": "arXiv:2406.05264v1 Announce Type: cross \nAbstract: We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that \"learns\" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected."}, "https://arxiv.org/abs/2406.05633": {"title": "Heterogeneous Treatment Effects in Panel Data", "link": "https://arxiv.org/abs/2406.05633", "description": "arXiv:2406.05633v1 Announce Type: cross \nAbstract: We address a core problem in causal inference: estimating heterogeneous treatment effects using panel data with general treatment patterns. Many existing methods either do not utilize the potential underlying structure in panel data or have limitations in the allowable treatment patterns. In this work, we propose and evaluate a new method that first partitions observations into disjoint clusters with similar treatment effects using a regression tree, and then leverages the (assumed) low-rank structure of the panel data to estimate the average treatment effect for each cluster. Our theoretical results establish the convergence of the resulting estimates to the true treatment effects. Computation experiments with semi-synthetic data show that our method achieves superior accuracy compared to alternative approaches, using a regression tree with no more than 40 leaves. 
Hence, our method provides more accurate and interpretable estimates than alternative methods."}, "https://arxiv.org/abs/2406.06014": {"title": "Network two-sample test for block models", "link": "https://arxiv.org/abs/2406.06014", "description": "arXiv:2406.06014v1 Announce Type: cross \nAbstract: We consider the two-sample testing problem for networks, where the goal is to determine whether two sets of networks originated from the same stochastic model. Assuming no vertex correspondence and allowing for different numbers of nodes, we address a fundamental network testing problem that goes beyond simple adjacency matrix comparisons. We adopt the stochastic block model (SBM) for network distributions, due to its interpretability and its potential to approximate more general models. The lack of meaningful node labels and vertex correspondence translates to a graph matching challenge when developing a test for SBMs. We introduce an efficient algorithm to match estimated network parameters, allowing us to properly combine and contrast information within and across samples, leading to a powerful test. We show that the matching algorithm and the overall test are consistent under mild conditions on the sparsity of the networks and the sample sizes, and we derive a chi-squared asymptotic null distribution for the test. Through a mixture of theoretical insights and empirical validations, including experiments with both synthetic and real-world data, this study advances robust statistical inference for complex network data."}, "https://arxiv.org/abs/2406.06348": {"title": "Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning", "link": "https://arxiv.org/abs/2406.06348", "description": "arXiv:2406.06348v1 Announce Type: cross \nAbstract: The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks with up to ${10^4}$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces."}, "https://arxiv.org/abs/2203.14223": {"title": "Identifying Peer Influence in Therapeutic Communities Adjusting for Latent Homophily", "link": "https://arxiv.org/abs/2203.14223", "description": "arXiv:2203.14223v4 Announce Type: replace \nAbstract: We investigate peer role model influence on successful graduation from Therapeutic Communities (TCs) for substance abuse and criminal behavior.
We use data from 3 TCs that kept records of exchanges of affirmations among residents and their precise entry and exit dates, allowing us to form peer networks and define a causal effect of interest. The role model effect measures the difference in the expected outcome of a resident (ego) who can observe one of their peers graduate before the ego's exit vs not graduating. To identify peer influence in the presence of unobserved homophily in observational data, we model the network with a latent variable model. We show that our peer influence estimator is asymptotically unbiased when the unobserved latent positions are estimated from the observed network. We additionally propose a measurement error bias correction method to further reduce bias due to estimating latent positions. Our simulations show the proposed latent homophily adjustment and bias correction perform well in finite samples. We also extend the methodology to the case of binary response with a probit model. Our results indicate a positive effect of peers' graduation on residents' graduation and that it differs based on gender, race, and the definition of the role model effect. A counterfactual exercise quantifies the potential benefits of an intervention directly on the treated resident and indirectly on their peers through network propagation."}, "https://arxiv.org/abs/2212.08697": {"title": "Multi-Task Learning for Sparsity Pattern Heterogeneity: Statistical and Computational Perspectives", "link": "https://arxiv.org/abs/2212.08697", "description": "arXiv:2212.08697v2 Announce Type: replace \nAbstract: We consider a problem in Multi-Task Learning (MTL) where multiple linear models are jointly trained on a collection of datasets (\"tasks\"). A key novelty of our framework is that it allows the sparsity pattern of regression coefficients and the values of non-zero coefficients to differ across tasks while still leveraging partially shared structure. Our methods encourage models to share information across tasks through separately encouraging 1) coefficient supports, and/or 2) nonzero coefficient values to be similar. This allows models to borrow strength during variable selection even when non-zero coefficient values differ across tasks. We propose a novel mixed-integer programming formulation for our estimator. We develop custom scalable algorithms based on block coordinate descent and combinatorial local search to obtain high-quality (approximate) solutions for our estimator. Additionally, we propose a novel exact optimization algorithm to obtain globally optimal solutions. We investigate the theoretical properties of our estimators. We formally show how our estimators leverage the shared support information across tasks to achieve better variable selection performance. We evaluate the performance of our methods in simulations and two biomedical applications. Our proposed approaches appear to outperform other sparse MTL methods in variable selection and prediction accuracy. We provide the sMTL package on CRAN."}, "https://arxiv.org/abs/2212.09145": {"title": "Classification of multivariate functional data on different domains with Partial Least Squares approaches", "link": "https://arxiv.org/abs/2212.09145", "description": "arXiv:2212.09145v3 Announce Type: replace \nAbstract: Classification (supervised-learning) of multivariate functional data is considered when the elements of the random functional vector of interest are defined on different domains. 
In this setting, PLS classification and tree PLS-based methods for multivariate functional data are presented. From a computational point of view, we show that the PLS components of the regression with multivariate functional data can be obtained using only the PLS methodology with univariate functional data. This offers an alternative way to present the PLS algorithm for multivariate functional data."}, "https://arxiv.org/abs/2212.10024": {"title": "Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples", "link": "https://arxiv.org/abs/2212.10024", "description": "arXiv:2212.10024v2 Announce Type: replace \nAbstract: Data subsampling has become widely recognized as a tool to overcome computational and economic bottlenecks in analyzing massive datasets. We contribute to the development of adaptive design for estimation of finite population characteristics, using active learning and adaptive importance sampling. We propose an active sampling strategy that iterates between estimation and data collection with optimal subsamples, guided by machine learning predictions on yet unseen data. The method is illustrated on virtual simulation-based safety assessment of advanced driver assistance systems. Substantial performance improvements are demonstrated compared to traditional sampling methods."}, "https://arxiv.org/abs/2301.01660": {"title": "Projection predictive variable selection for discrete response families with finite support", "link": "https://arxiv.org/abs/2301.01660", "description": "arXiv:2301.01660v3 Announce Type: replace \nAbstract: The projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach achieving an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback-Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The cost of the slightly better performance of the augmented-data projection is a substantial increase in runtime. Thus, in such cases, we recommend the latent projection in the early phase of a model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection."}, "https://arxiv.org/abs/2303.06661": {"title": "Bayesian size-and-shape regression modelling", "link": "https://arxiv.org/abs/2303.06661", "description": "arXiv:2303.06661v3 Announce Type: replace \nAbstract: Building on Dryden et al. (2021), this note presents the Bayesian estimation of a regression model for size-and-shape response variables with Gaussian landmarks. 
Our proposal fits into the framework of Bayesian latent variable models and allows for highly flexible modelling."}, "https://arxiv.org/abs/2306.15286": {"title": "Multilayer random dot product graphs: Estimation and online change point detection", "link": "https://arxiv.org/abs/2306.15286", "description": "arXiv:2306.15286v4 Announce Type: replace \nAbstract: We study the multilayer random dot product graph (MRDPG) model, an extension of the random dot product graph to multilayer networks. To estimate the edge probabilities, we deploy a tensor-based methodology and demonstrate its superiority over existing approaches. Moving to dynamic MRDPGs, we formulate and analyse an online change point detection framework. At every time point, we observe a realization from an MRDPG. Across layers, we assume fixed shared common node sets and latent positions but allow for different connectivity matrices. We propose efficient tensor algorithms under both fixed and random latent position cases to minimize the detection delay while controlling false alarms. Notably, in the random latent position case, we devise a novel nonparametric change point detection algorithm based on density kernel estimation that is applicable to a wide range of scenarios, including stochastic block models as special cases. Our theoretical findings are supported by extensive numerical experiments, with the code available online https://github.com/MountLee/MRDPG."}, "https://arxiv.org/abs/2306.15537": {"title": "Sparse estimation in ordinary kriging for functional data", "link": "https://arxiv.org/abs/2306.15537", "description": "arXiv:2306.15537v2 Announce Type: replace \nAbstract: We introduce a sparse estimation method for ordinary kriging of functional data. Functional kriging predicts a feature, given as a function, at a location where data are not observed, using a linear combination of data observed at other locations. To estimate the weights of the linear combination, we apply lasso-type regularization when minimizing the expected squared error. We derive an algorithm to compute the estimator using the augmented Lagrange method. Tuning parameters included in the estimation procedure are selected by cross-validation. Since the proposed method can shrink some of the weights of the linear combination to exactly zero, we can investigate which locations are necessary or unnecessary to predict the feature. Simulation and real data analysis show that the proposed method provides reasonable results."}, "https://arxiv.org/abs/2306.15607": {"title": "Assessing small area estimates via bootstrap-weighted k-Nearest-Neighbor artificial populations", "link": "https://arxiv.org/abs/2306.15607", "description": "arXiv:2306.15607v2 Announce Type: replace \nAbstract: Comparing and evaluating small area estimation (SAE) models for a given application is inherently difficult. Typically, many areas lack enough data to check unit-level modeling assumptions or to assess unit-level predictions empirically, and no ground truth is available for checking area-level estimates. Design-based simulation from artificial populations can help with each of these issues, but only if the artificial populations realistically represent the application at hand and are not built using assumptions that inherently favor one SAE model over another.
In this paper, we borrow ideas from random hot deck, approximate Bayesian bootstrap (ABB), and k Nearest Neighbor (kNN) imputation methods to propose a kNN-based approximation to ABB (KBAABB), for generating an artificial population when rich unit-level auxiliary data is available. We introduce diagnostic checks on the process of building the artificial population, and we demonstrate how to use such an artificial population for design-based simulation studies to compare and evaluate SAE models, using real data from the Forest Inventory and Analysis (FIA) program of the US Forest Service."}, "https://arxiv.org/abs/2309.14621": {"title": "Confidence Intervals for the F1 Score: A Comparison of Four Methods", "link": "https://arxiv.org/abs/2309.14621", "description": "arXiv:2309.14621v2 Announce Type: replace \nAbstract: In Natural Language Processing (NLP), binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is an estimate of the population F1 score, it is not sufficient to report the sample F1 score without an indication of how accurate it is. Confidence intervals are an indication of how accurate the sample F1 score is. However, most studies either do not report them or report them using methods that demonstrate poor statistical properties. In the present study, I review current analytical methods (i.e., Clopper-Pearson method and Wald method) to construct confidence intervals for the population F1 score, propose two new analytical methods (i.e., Wilson direct method and Wilson indirect method) to do so, and compare these methods based on their coverage probabilities and interval lengths, as well as whether these methods suffer from overshoot and degeneracy. Theoretical results demonstrate that both proposed methods do not suffer from overshoot and degeneracy. Experimental results suggest that both proposed methods perform better, as compared to current methods, in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks. I discuss the practical implications of these results, and suggest areas for future research."}, "https://arxiv.org/abs/2311.12016": {"title": "Estimating Heterogeneous Exposure Effects in the Case-Crossover Design using BART", "link": "https://arxiv.org/abs/2311.12016", "description": "arXiv:2311.12016v2 Announce Type: replace \nAbstract: Epidemiological approaches for examining human health responses to environmental exposures in observational studies often control for confounding by implementing clever matching schemes and using statistical methods based on conditional likelihood. Nonparametric regression models have surged in popularity in recent years as a tool for estimating individual-level heterogeneous effects, which provide a more detailed picture of the exposure-response relationship but can also be aggregated to obtain improved marginal estimates at the population level. In this work we incorporate Bayesian additive regression trees (BART) into the conditional logistic regression model to identify heterogeneous exposure effects in a case-crossover design. Conditional logistic BART (CL-BART) utilizes reversible jump Markov chain Monte Carlo to bypass the conditional conjugacy requirement of the original BART algorithm. Our work is motivated by the growing interest in identifying subpopulations more vulnerable to environmental exposures. 
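For the F1 confidence-interval entry (arXiv:2309.14621), the exact Wilson direct and indirect constructions are defined in the paper. Since F1 = 2TP/(2TP + FP + FN) has the algebraic form of a proportion, one plausible direct-style sketch, shown below purely as an assumption, applies the standard Wilson score interval with 2TP successes out of 2TP + FP + FN trials.

```python
# Wilson score interval applied to F1 written as a proportion-like quantity.
# Illustrative guess at the flavour of the method, not the paper's exact formula.
from scipy.stats import norm

def wilson_interval(successes, trials, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = (z / denom) * (p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) ** 0.5
    return max(0.0, centre - half), min(1.0, centre + half)

def f1_wilson_ci(tp, fp, fn, alpha=0.05):
    f1 = 2 * tp / (2 * tp + fp + fn)
    return f1, wilson_interval(2 * tp, 2 * tp + fp + fn, alpha)

f1, (lo, hi) = f1_wilson_ci(tp=80, fp=20, fn=30)
print(f"F1 = {f1:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```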
We apply CL-BART to a study of the impact of heat waves on people with Alzheimer's disease in California and effect modification by other chronic conditions. Through this application, we also describe strategies to examine heterogeneous odds ratios through variable importance, partial dependence, and lower-dimensional summaries."}, "https://arxiv.org/abs/2301.12537": {"title": "Non-Asymptotic State-Space Identification of Closed-Loop Stochastic Linear Systems using Instrumental Variables", "link": "https://arxiv.org/abs/2301.12537", "description": "arXiv:2301.12537v4 Announce Type: replace-cross \nAbstract: The paper suggests a generalization of the Sign-Perturbed Sums (SPS) finite sample system identification method for the identification of closed-loop observable stochastic linear systems in state-space form. The solution builds on the theory of matrix-variate regression and instrumental variable methods to construct distribution-free confidence regions for the state-space matrices. Both direct and indirect identification are studied, and the exactness as well as the strong consistency of the construction are proved. Furthermore, a new, computationally efficient ellipsoidal outer-approximation algorithm for the confidence regions is proposed. The new construction results in a semidefinite optimization problem which has an order-of-magnitude smaller number of constraints, as if one applied the ellipsoidal outer-approximation after vectorization. The effectiveness of the approach is also demonstrated empirically via a series of numerical experiments."}, "https://arxiv.org/abs/2308.06718": {"title": "Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables", "link": "https://arxiv.org/abs/2308.06718", "description": "arXiv:2308.06718v2 Announce Type: replace-cross \nAbstract: We investigate the task of learning causal structure in the presence of latent variables, including locating latent variables and determining their quantity, and identifying causal relationships among both latent and observed variables. To this end, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\\bf{Y}$ and $\\bf{Z}$, GIN holds if and only if $\\omega^{\\intercal}\\mathbf{Y}$ and $\\mathbf{Z}$ are independent, where $\\omega$ is a non-zero parameter vector determined by the cross-covariance between $\\mathbf{Y}$ and $\\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic models. Roughly speaking, GIN implies the existence of a set $\\mathcal{S}$ such that $\\mathcal{S}$ is causally earlier (w.r.t. the causal ordering) than $\\mathbf{Y}$, and that every active (collider-free) path between $\\mathbf{Y}$ and $\\mathbf{Z}$ must contain a node from $\\mathcal{S}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. 
With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the causal structure of a LiNGLaH is identifiable in light of GIN conditions. Experimental results show the effectiveness of the proposed method."}, "https://arxiv.org/abs/2310.00250": {"title": "The oracle property of the generalized outcome adaptive lasso", "link": "https://arxiv.org/abs/2310.00250", "description": "arXiv:2310.00250v2 Announce Type: replace-cross \nAbstract: The generalized outcome-adaptive lasso (GOAL) is a variable selection method for high-dimensional causal inference proposed by Bald\\'e et al. [2023, {\\em Biometrics} {\\bfseries 79(1)}, 514--520]. When the dimension is high, it is now well established that an ideal variable selection method should have the oracle property to ensure the optimal large sample performance. However, the oracle property of GOAL has not been proven. In this paper, we show that the GOAL estimator enjoys the oracle property. Our simulation shows that the GOAL method deals with the collinearity problem better than the oracle-like method, the outcome-adaptive lasso (OAL)."}, "https://arxiv.org/abs/2312.12844": {"title": "Effective Causal Discovery under Identifiable Heteroscedastic Noise Model", "link": "https://arxiv.org/abs/2312.12844", "description": "arXiv:2312.12844v2 Announce Type: replace-cross \nAbstract: Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of both accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., exogenous noises have equal variances across variables, observations, or even both. The noises in real data usually violate both assumptions due to the biases introduced by different data collection processes. To address the issue of heteroscedastic noise, we introduce relaxed and implementable sufficient conditions, proving the identifiability of a general class of SEMs subject to these conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increasing optimization difficulties and to learn a causal DAG from data with heteroscedastic variable noise under varying variance. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data."}, "https://arxiv.org/abs/2406.06767": {"title": "ULV: A robust statistical method for clustered data, with applications to multisubject, single-cell omics data", "link": "https://arxiv.org/abs/2406.06767", "description": "arXiv:2406.06767v1 Announce Type: new \nAbstract: Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions.
These breakthroughs have enabled us to achieve increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all the issues mentioned above simultaneously. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity."}, "https://arxiv.org/abs/2406.06768": {"title": "Data-Driven Switchback Experiments: Theoretical Tradeoffs and Empirical Bayes Designs", "link": "https://arxiv.org/abs/2406.06768", "description": "arXiv:2406.06768v1 Announce Type: new \nAbstract: We study the design and analysis of switchback experiments conducted on a single aggregate unit. The design problem is to partition the continuous time space into intervals and switch treatments between intervals, in order to minimize the estimation error of the treatment effect. We show that the estimation error depends on four factors: carryover effects, periodicity, serially correlated outcomes, and impacts from simultaneous experiments. We derive a rigorous bias-variance decomposition and show the tradeoffs of the estimation error from these factors. The decomposition provides three new insights in choosing a design: First, balancing the periodicity between treated and control intervals reduces the variance; second, switching less frequently reduces the bias from carryover effects while increasing the variance from correlated outcomes, and vice versa; third, randomizing interval start and end points reduces both bias and variance from simultaneous experiments. Combining these insights, we propose a new empirical Bayes design approach. This approach uses prior data and experiments for designing future experiments. We illustrate this approach using real data from a ride-sharing platform, yielding a design that reduces MSE by 33% compared to the status quo design used on the platform."}, "https://arxiv.org/abs/2406.06804": {"title": "Robustness to Missing Data: Breakdown Point Analysis", "link": "https://arxiv.org/abs/2406.06804", "description": "arXiv:2406.06804v1 Announce Type: new \nAbstract: Missing data is pervasive in econometric applications, and rarely is it plausible that the data are missing (completely) at random. 
This paper proposes a methodology for studying the robustness of results drawn from incomplete datasets. Selection is measured as the squared Hellinger divergence between the distributions of complete and incomplete observations, which has a natural interpretation. The breakdown point is defined as the minimal amount of selection needed to overturn a given result. Reporting point estimates and lower confidence intervals of the breakdown point is a simple, concise way to communicate the robustness of a result. An estimator of the breakdown point of a result drawn from a generalized method of moments model is proposed and shown to be root-n consistent and asymptotically normal under mild assumptions. Lower confidence intervals of the breakdown point are simple to construct. The paper concludes with a simulation study illustrating the finite sample performance of the estimators in several common models."}, "https://arxiv.org/abs/2406.06834": {"title": "Power Analysis for Experiments with Clustered Data, Ratio Metrics, and Regression for Covariate Adjustment", "link": "https://arxiv.org/abs/2406.06834", "description": "arXiv:2406.06834v1 Announce Type: new \nAbstract: We describe how to calculate standard errors for A/B tests that include clustered data, ratio metrics, and/or covariate adjustment. We may do this for power analysis/sample size calculations prior to running an experiment using historical data, or after an experiment for hypothesis testing and confidence intervals. The different applications have a common framework, using the sample variance of certain residuals. The framework is compatible with modular software, can be plugged into standard tools, doesn't require computing covariance matrices, and is numerically stable. Using this approach we estimate that covariate adjustment gives a median 66% variance reduction for a key metric, reducing experiment run time by 66%."}, "https://arxiv.org/abs/2406.06851": {"title": "Unbiased Markov Chain Monte Carlo: what, why, and how", "link": "https://arxiv.org/abs/2406.06851", "description": "arXiv:2406.06851v1 Announce Type: new \nAbstract: This document presents methods to remove the initialization or burn-in bias from Markov chain Monte Carlo (MCMC) estimates, with consequences on parallel computing, convergence diagnostics and performance assessment. The document is written as an introduction to these methods for MCMC users. Some theoretical results are mentioned, but the focus is on the methodology."}, "https://arxiv.org/abs/2406.06860": {"title": "Cluster GARCH", "link": "https://arxiv.org/abs/2406.06860", "description": "arXiv:2406.06860v1 Announce Type: new \nAbstract: We introduce a novel multivariate GARCH model with flexible convolution-t distributions that is applicable in high-dimensional systems. The model is called Cluster GARCH because it can accommodate cluster structures in the conditional correlation matrix and in the tail dependencies. The expressions for the log-likelihood function and its derivatives are tractable, and the latter facilitate a score-driven model for the dynamic correlation structure. We apply the Cluster GARCH model to daily returns for 100 assets and find it outperforms existing models, both in-sample and out-of-sample.
Moreover, the convolution-t distribution provides a better empirical performance than the conventional multivariate t-distribution."}, "https://arxiv.org/abs/2406.06924": {"title": "A Novel Nonlinear Nonparametric Correlation Measurement With A Case Study on Surface Roughness in Finish Turning", "link": "https://arxiv.org/abs/2406.06924", "description": "arXiv:2406.06924v1 Announce Type: new \nAbstract: Estimating the correlation coefficient has become a daunting task with the increasing complexity of dataset patterns. One of the problems in manufacturing applications consists of the estimation of a critical process variable during a machining operation from directly measurable process variables. For example, the prediction of surface roughness of a workpiece during finish turning processes. In this paper, we conduct an exhaustive study of the existing popular correlation coefficients: Pearson correlation coefficient, Spearman's rank correlation coefficient, Kendall's Tau correlation coefficient, Fechner correlation coefficient, and Nonlinear correlation coefficient. However, none of them can capture all the nonlinear and linear correlations. We therefore present a universal non-linear non-parametric correlation measure, the g-correlation coefficient. Unlike other correlation measurements, g-correlation requires no assumptions and picks the dominating pattern of the dataset after examining all the major patterns, whether linear or nonlinear. Results of testing on both linearly correlated and non-linearly correlated datasets, and comparison with the correlation coefficients introduced in the literature, show that g-correlation is robust on all the linearly correlated datasets and outperforms the alternatives on some non-linearly correlated datasets. Results of the application of different correlation concepts to surface roughness assessment show that g-correlation has a central role among all standard concepts of correlation."}, "https://arxiv.org/abs/2406.06941": {"title": "Efficient combination of observational and experimental datasets under general restrictions on outcome mean functions", "link": "https://arxiv.org/abs/2406.06941", "description": "arXiv:2406.06941v1 Announce Type: new \nAbstract: A researcher collecting data from a randomized controlled trial (RCT) often has access to an auxiliary observational dataset that may be confounded or otherwise biased for estimating causal effects. Common modeling assumptions impose restrictions on the outcome mean function - the conditional expectation of the outcome of interest given observed covariates - in the two datasets. Running examples from the literature include settings where the observational dataset is subject to outcome-mediated selection bias or to confounding bias taking an assumed parametric form. We propose a succinct framework to derive the efficient influence function for any identifiable pathwise differentiable estimand under a general class of restrictions on the outcome mean function. This uncovers the surprising result that, with homoskedastic outcomes and a constant propensity score in the RCT, even strong parametric assumptions cannot improve the semiparametric lower bound for estimating various average treatment effects. We then leverage double machine learning to construct a one-step estimator that achieves the semiparametric efficiency bound even in cases when the outcome mean function and other nuisance parameters are estimated nonparametrically.
The goal is to empower a researcher with custom, previously unstudied modeling restrictions on the outcome mean function to systematically construct causal estimators that maximally leverage their assumptions for variance reduction. We demonstrate the finite sample precision gains of our estimator over existing approaches in extensions of various numerical studies and data examples from the literature."}, "https://arxiv.org/abs/2406.06980": {"title": "Sensitivity Analysis for the Test-Negative Design", "link": "https://arxiv.org/abs/2406.06980", "description": "arXiv:2406.06980v1 Announce Type: new \nAbstract: The test-negative design has become popular for evaluating the effectiveness of post-licensure vaccines using observational data. In addition to its logistical convenience in data collection, the design is also believed to control for the differential health-care-seeking behavior between vaccinated and unvaccinated individuals, which is an important but often unmeasured confounder between the vaccination and infection. Hence, the design has been employed routinely to monitor seasonal flu vaccines and more recently to measure the COVID-19 vaccine effectiveness. Despite its popularity, the design has been questioned, in particular about its ability to fully control for the unmeasured confounding. In this paper, we explore deviations from a perfect test-negative design, and propose various sensitivity analysis methods for estimating the effect of vaccination measured by the causal odds ratio on the subpopulation of individuals with good health-care-seeking behavior. We start with point identification of the causal odds ratio under a test-negative design, considering two forms of assumptions on the unmeasured confounder. These assumptions then lead to two approaches for conducting sensitivity analysis, addressing the influence of the unmeasured confounding in different ways. Specifically, one approach investigates partial control for the unmeasured confounder in the test-negative design, while the other examines the impact of the unmeasured confounder on both vaccination and infection. Furthermore, these approaches can be combined to provide narrower bounds on the true causal odds ratio, and can be further extended to sharpen the bounds by restricting the treatment effect heterogeneity. Finally, we apply the proposed methods to evaluate the effectiveness of COVID-19 vaccines using observational data from test-negative designs."}, "https://arxiv.org/abs/2406.07449": {"title": "Boosted Conformal Prediction Intervals", "link": "https://arxiv.org/abs/2406.07449", "description": "arXiv:2406.07449v1 Announce Type: new \nAbstract: This paper introduces a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length. We employ machine learning techniques, notably gradient boosting, to systematically improve upon a predefined conformity score function. This process is guided by carefully constructed loss functions that measure the deviation of prediction intervals from the targeted properties. The procedure operates post-training, relying solely on model predictions and without modifying the trained model (e.g., the deep network).
Systematic experiments demonstrate that starting from conventional conformal methods, our boosted procedure achieves substantial improvements in reducing interval length and decreasing deviation from target conditional coverage."}, "https://arxiv.org/abs/2406.06654": {"title": "Training and Validating a Treatment Recommender with Partial Verification Evidence", "link": "https://arxiv.org/abs/2406.06654", "description": "arXiv:2406.06654v1 Announce Type: cross \nAbstract: Current clinical decision support systems (DSS) are trained and validated on observational data from the target clinic. This is problematic for treatments validated in a randomized clinical trial (RCT), but not yet introduced in any clinic. In this work, we report on a method for training and validating the DSS using the RCT data. The key challenges we address are those of missingness -- missing rationale for treatment assignment (the assignment is at random), and missing verification evidence, since the effectiveness of a treatment for a patient can only be verified (ground truth) for treatments that were actually assigned to a patient. We use data from a multi-armed RCT that investigated the effectiveness of single- and combination-treatments for 240+ tinnitus patients recruited and treated in 5 clinical centers.\n To deal with the 'missing rationale' challenge, we re-model the target variable (outcome) in order to suppress the effect of the randomly-assigned treatment, and control for the effect of treatment in general. Our methods are also robust to missing values in features and to a small number of patients per RCT arm. We deal with 'missing verification evidence' by using counterfactual treatment verification, which compares the effectiveness of the DSS recommendations to the effectiveness of the RCT assignments when they are aligned vs. not aligned.\n We demonstrate that our approach leverages the RCT data for learning and verification, by showing that the DSS suggests treatments that improve the outcome. The results are limited by the small number of patients per treatment; while our ensemble is designed to mitigate this effect, the predictive performance of the methods is affected by the smallness of the data. We provide a basis for the establishment of decision support routines on treatments that have been tested in RCTs but have not yet been deployed clinically."}, "https://arxiv.org/abs/2406.06671": {"title": "Controlling Counterfactual Harm in Decision Support Systems Based on Prediction Sets", "link": "https://arxiv.org/abs/2406.06671", "description": "arXiv:2406.06671v1 Announce Type: cross \nAbstract: Decision support systems based on prediction sets help humans solve multiclass classification tasks by narrowing down the set of potential label values to a subset of them, namely a prediction set, and asking them to always predict label values from the prediction sets. While this type of system has been proven to be effective at improving the average accuracy of the predictions made by humans, by restricting human agency, they may cause harm$\\unicode{x2014}$a human who has succeeded at predicting the ground-truth label of an instance on their own may have failed had they used these systems. In this paper, our goal is to control how frequently a decision support system based on prediction sets may cause harm, by design. To this end, we start by characterizing the above notion of harm using the theoretical framework of structural causal models.
Then, we show that, under a natural, albeit unverifiable, monotonicity assumption, we can estimate how frequently a system may cause harm using only predictions made by humans on their own. Further, we also show that, under a weaker monotonicity assumption, which can be verified experimentally, we can bound how frequently a system may cause harm again using only predictions made by humans on their own. Building upon these assumptions, we introduce a computational framework to design decision support systems based on prediction sets that are guaranteed to cause harm less frequently than a user-specified value using conformal risk control. We validate our framework using real human predictions from two different human subject studies and show that, in decision support systems based on prediction sets, there is a trade-off between accuracy and counterfactual harm."}, "https://arxiv.org/abs/2406.06868": {"title": "Causality for Complex Continuous-time Functional Longitudinal Studies with Dynamic Treatment Regimes", "link": "https://arxiv.org/abs/2406.06868", "description": "arXiv:2406.06868v1 Announce Type: cross \nAbstract: Causal inference in longitudinal studies is often hampered by treatment-confounder feedback. Existing methods typically assume discrete time steps or step-like data changes, which we term ``regular and irregular functional studies,'' limiting their applicability to studies with continuous monitoring data, like intensive care units or continuous glucose monitoring. These studies, which we formally term ``functional longitudinal studies,'' require new approaches. Moreover, existing methods tailored for ``functional longitudinal studies'' can only investigate static treatment regimes, which are independent of historical covariates or treatments, leading to either stringent parametric assumptions or strong positivity assumptions. This restriction has limited the range of causal questions these methods can answer and their practicality. We address these limitations by developing a nonparametric framework for functional longitudinal data, accommodating dynamic treatment regimes that depend on historical covariates or treatments, and may or may not depend on the actual treatment administered. To build intuition and explain our approach, we provide a comprehensive review of existing methods for regular and irregular longitudinal studies. We then formally define the potential outcomes and causal effects of interest, develop identification assumptions, and derive g-computation and inverse probability weighting formulas through novel applications of stochastic process and measure theory. Additionally, we compute the efficient influence curve using semiparametric theory. Our framework generalizes existing literature, and achieves double robustness under specific conditions. Finally, to aid interpretation, we provide sufficient and intuitive conditions for our identification assumptions, enhancing the applicability of our methodology to real-world scenarios."}, "https://arxiv.org/abs/2406.07075": {"title": "New density/likelihood representations for Gibbs models based on generating functionals of point processes", "link": "https://arxiv.org/abs/2406.07075", "description": "arXiv:2406.07075v1 Announce Type: cross \nAbstract: Deriving exact density functions for Gibbs point processes has been challenging due to their general intractability, stemming from the intractability of their normalising constants/partition functions. 
This paper offers a solution to this open problem by exploiting a recent alternative representation of point process densities. Here, for a finite point process, the density is expressed as the void probability multiplied by a higher-order Papangelou conditional intensity function. By leveraging recent results on dependent thinnings, exact expressions for generating functionals and void probabilities of locally stable point processes are derived. Consequently, exact expressions for density/likelihood functions, partition functions and posterior densities are also obtained. The paper finally extends the results to locally stable Gibbsian random fields on lattices by representing them as point processes."}, "https://arxiv.org/abs/2406.07121": {"title": "The Treatment of Ties in Rank-Biased Overlap", "link": "https://arxiv.org/abs/2406.07121", "description": "arXiv:2406.07121v1 Announce Type: cross \nAbstract: Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines by comparing the rankings of documents they retrieve for the same queries. In these situations, though, it is very frequent to find tied documents that have the same score. Unfortunately, the treatment of ties in RBO remains superficial and incomplete, in the sense that it is not clear how to calculate it from the ranking prefixes only. In addition, the existing way of dealing with ties is very different from the one traditionally followed in the field of Statistics, most notably found in rank correlation coefficients such as Kendall's and Spearman's. In this paper we propose a generalized formulation for RBO to handle ties, thanks to which we complete the original definitions by showing how to perform prefix evaluation. We also use it to fully develop two variants that align with the ones found in the Statistics literature: one when there is a reference ranking to compare to, and one when there is not. Overall, these three variants provide researchers with flexibility when comparing rankings with RBO, by clearly determining what ties mean, and how they should be treated. Finally, using both synthetic and TREC data, we demonstrate the use of these new tie-aware RBO measures. We show that the scores may differ substantially from the original tie-unaware RBO measure, where ties had to be broken at random or by arbitrary criteria such as by document ID. Overall, these results evidence the need for a proper account of ties in rank similarity measures such as RBO."}, "https://arxiv.org/abs/2106.15675": {"title": "Estimating Gaussian mixtures using sparse polynomial moment systems", "link": "https://arxiv.org/abs/2106.15675", "description": "arXiv:2106.15675v3 Announce Type: replace \nAbstract: The method of moments is a classical statistical technique for density estimation that solves a system of moment equations to estimate the parameters of an unknown distribution. A fundamental question critical to understanding identifiability asks how many moment equations are needed to get finitely many solutions and how many solutions there are. We answer this question for classes of Gaussian mixture models using the tools of polyhedral geometry. In addition, we show that a generic Gaussian $k$-mixture model is identifiable from its first $3k+2$ moments. 
Using these results, we present a homotopy algorithm that performs parameter recovery for high dimensional Gaussian mixture models where the number of paths tracked scales linearly in the dimension."}, "https://arxiv.org/abs/2109.11990": {"title": "Optimization-based Causal Estimation from Heterogenous Environments", "link": "https://arxiv.org/abs/2109.11990", "description": "arXiv:2109.11990v3 Announce Type: replace \nAbstract: This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments-and ones that exhibit sufficient heterogeneity-CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model and more accurate predictions under interventions."}, "https://arxiv.org/abs/2211.16921": {"title": "Binary De Bruijn Processes", "link": "https://arxiv.org/abs/2211.16921", "description": "arXiv:2211.16921v2 Announce Type: replace \nAbstract: Binary time series data are very common in many applications, and are typically modelled independently via a Bernoulli process with a single probability of success. However, the probability of a success can be dependent on the outcome successes of past events. Presented here is a novel approach for modelling binary time series data called a binary de Bruijn process which takes into account temporal correlation. The structure is derived from de Bruijn Graphs - a directed graph, where given a set of symbols, V, and a 'word' length, m, the nodes of the graph consist of all possible sequences of V of length m. De Bruijn Graphs are equivalent to mth order Markov chains, where the 'word' length controls the number of states that each individual state is dependent on. This increases correlation over a wider area. To quantify how clustered a sequence generated from a de Bruijn process is, the run lengths of letters are observed along with run length properties. Inference is also presented along with two application examples: precipitation data and the Oxford and Cambridge boat race."}, "https://arxiv.org/abs/2303.00982": {"title": "Aggregated Intersection Bounds and Aggregated Minimax Values", "link": "https://arxiv.org/abs/2303.00982", "description": "arXiv:2303.00982v2 Announce Type: replace \nAbstract: This paper proposes a novel framework of aggregated intersection bounds, where the target parameter is obtained by averaging the minimum (or maximum) of a collection of regression functions over the covariate space. 
Examples of such quantities include the lower and upper bounds on distributional effects (Fr\\'echet-Hoeffding, Makarov) as well as the optimal welfare in statistical treatment choice problems. The proposed estimator -- the envelope score estimator -- is shown to have an oracle property, where the oracle knows the identity of the minimizer for each covariate value. Next, the result is extended to the aggregated minimax values of a collection of regression functions, covering optimal distributional welfare in worst-case and best-case, respectively. This proposed estimator -- the envelope saddle value estimator -- is shown to have an oracle property, where the oracle knows the identity of the saddle point."}, "https://arxiv.org/abs/2406.07564": {"title": "Optimizing Sales Forecasts through Automated Integration of Market Indicators", "link": "https://arxiv.org/abs/2406.07564", "description": "arXiv:2406.07564v1 Announce Type: new \nAbstract: Recognizing that traditional forecasting models often rely solely on historical demand, this work investigates the potential of data-driven techniques to automatically select and integrate market indicators for improving customer demand predictions. By adopting an exploratory methodology, we integrate macroeconomic time series, such as national GDP growth, from the \\textit{Eurostat} database into \\textit{Neural Prophet} and \\textit{SARIMAX} forecasting models. Suitable time series are automatically identified through different state-of-the-art feature selection methods and applied to sales data from our industrial partner. It could be shown that forecasts can be significantly enhanced by incorporating external information. Notably, the potential of feature selection methods stands out, especially due to their capability for automation without expert knowledge and manual selection effort. In particular, the Forward Feature Selection technique consistently yielded superior forecasting accuracy for both SARIMAX and Neural Prophet across different company sales datasets. In the comparative analysis of the errors of the selected forecasting models, namely Neural Prophet and SARIMAX, it is observed that neither model demonstrates a significant superiority over the other."}, "https://arxiv.org/abs/2406.07651": {"title": "surveygenmod2: A SAS macro for estimating complex survey adjusted generalized linear models and Wald-type tests", "link": "https://arxiv.org/abs/2406.07651", "description": "arXiv:2406.07651v1 Announce Type: new \nAbstract: surveygenmod2 builds on the macro written by da Silva (2017) for generalized linear models under complex survey designs. The updated macro fixed several minor bugs we encountered while updating the macro for use in SAS\\textregistered. We added additional features for conducting basic Wald-type tests on groups of parameters based on the estimated regression coefficients and parameter variance-covariance matrix."}, "https://arxiv.org/abs/2406.07756": {"title": "The Exchangeability Assumption for Permutation Tests of Multiple Regression Models: Implications for Statistics and Data Science", "link": "https://arxiv.org/abs/2406.07756", "description": "arXiv:2406.07756v1 Announce Type: new \nAbstract: Permutation tests are a powerful and flexible approach to inference via resampling. As computational methods become more ubiquitous in the statistics curriculum, use of permutation tests has become more tractable. 
At the heart of the permutation approach is the exchangeability assumption, which determines the appropriate null sampling distribution. We explore the exchangeability assumption in the context of permutation tests for multiple linear regression models. Various permutation schemes for the multiple linear regression setting have been previously proposed and assessed in the literature. As has been demonstrated previously, in most settings, the choice of how to permute a multiple linear regression model does not materially change inferential conclusions. Regardless, we believe that (1) understanding exchangeability in the multiple linear regression setting and also (2) how it relates to the null hypothesis of interest is valuable. We also briefly explore model settings beyond multiple linear regression (e.g., settings where clustering or hierarchical relationships exist) as a motivation for the benefit and flexibility of permutation tests. We close with pedagogical recommendations for instructors who want to bring multiple linear regression permutation inference into their classroom as a way to deepen student understanding of resampling-based inference."}, "https://arxiv.org/abs/2406.07787": {"title": "A Diagnostic Tool for Functional Causal Discovery", "link": "https://arxiv.org/abs/2406.07787", "description": "arXiv:2406.07787v1 Announce Type: new \nAbstract: Causal discovery methods aim to determine the causal direction between variables using observational data. Functional causal discovery methods, such as those based on the Linear Non-Gaussian Acyclic Model (LiNGAM), rely on structural and distributional assumptions to infer the causal direction. However, approaches for assessing causal discovery methods' performance as a function of sample size or the impact of assumption violations, inevitable in real-world scenarios, are lacking. To address this need, we propose Causal Direction Detection Rate (CDDR) diagnostic that evaluates whether and to what extent the interaction between assumption violations and sample size affects the ability to identify the hypothesized causal direction. Given a bivariate dataset of size N on a pair of variables, X and Y, CDDR diagnostic is the plotted comparison of the probability of each causal discovery outcome (e.g. X causes Y, Y causes X, or inconclusive) as a function of sample size less than N. We fully develop CDDR diagnostic in a bivariate case and demonstrate its use for two methods, LiNGAM and our new test-based causal discovery approach. We find CDDR diagnostic for the test-based approach to be more informative since it uses a richer set of causal discovery outcomes. Under certain assumptions, we prove that the probability estimates of detecting each possible causal discovery outcome are consistent and asymptotically normal. Through simulations, we study CDDR diagnostic's behavior when linearity and non-Gaussianity assumptions are violated. Additionally, we illustrate CDDR diagnostic on four real datasets, including three for which the causal direction is known."}, "https://arxiv.org/abs/2406.07809": {"title": "Did Harold Zuercher Have Time-Separable Preferences?", "link": "https://arxiv.org/abs/2406.07809", "description": "arXiv:2406.07809v1 Announce Type: new \nAbstract: This paper proposes an empirical model of dynamic discrete choice to allow for non-separable time preferences, generalizing the well-known Rust (1987) model. Under weak conditions, we show the existence of value functions and hence well-defined optimal choices. 
We construct a contraction mapping of the value function and propose an estimation method similar to Rust's nested fixed point algorithm. Finally, we apply the framework to the bus engine replacement data. We improve the fit of the data with our general model and reject the null hypothesis that Harold Zuercher has separable time preferences. Misspecifying an agent's preference as time-separable when it is not leads to biased inferences about structure parameters (such as the agent's risk attitudes) and misleading policy recommendations."}, "https://arxiv.org/abs/2406.07868": {"title": "Bridging multiple worlds: multi-marginal optimal transport for causal partial-identification problem", "link": "https://arxiv.org/abs/2406.07868", "description": "arXiv:2406.07868v1 Announce Type: new \nAbstract: Under the prevalent potential outcome model in causal inference, each unit is associated with multiple potential outcomes but at most one of which is observed, leading to many causal quantities being only partially identified. The inherent missing data issue echoes the multi-marginal optimal transport (MOT) problem, where marginal distributions are known, but how the marginals couple to form the joint distribution is unavailable. In this paper, we cast the causal partial identification problem in the framework of MOT with $K$ margins and $d$-dimensional outcomes and obtain the exact partial identified set. In order to estimate the partial identified set via MOT, statistically, we establish a convergence rate of the plug-in MOT estimator for general quadratic objective functions and prove it is minimax optimal for a quadratic objective function stemming from the variance minimization problem with arbitrary $K$ and $d \\le 4$. Numerically, we demonstrate the efficacy of our method over several real-world datasets where our proposal consistently outperforms the baseline by a significant margin (over 70%). In addition, we provide efficient off-the-shelf implementations of MOT with general objective functions."}, "https://arxiv.org/abs/2406.07940": {"title": "Simple yet Sharp Sensitivity Analysis for Any Contrast Under Unmeasured Confounding", "link": "https://arxiv.org/abs/2406.07940", "description": "arXiv:2406.07940v1 Announce Type: new \nAbstract: We extend our previous work on sensitivity analysis for the risk ratio and difference contrasts under unmeasured confounding to any contrast. We prove that the bounds produced are still arbitrarily sharp, i.e. practically attainable. We illustrate the usability of the bounds with real data."}, "https://arxiv.org/abs/2406.08019": {"title": "Assessing Extreme Risk using Stochastic Simulation of Extremes", "link": "https://arxiv.org/abs/2406.08019", "description": "arXiv:2406.08019v1 Announce Type: new \nAbstract: Risk management is particularly concerned with extreme events, but analysing these events is often hindered by the scarcity of data, especially in a multivariate context. This data scarcity complicates risk management efforts. Various tools can assess the risk posed by extreme events, even under extraordinary circumstances. This paper studies the evaluation of univariate risk for a given risk factor using metrics that account for its asymptotic dependence on other risk factors. Data availability is crucial, particularly for extreme events where it is often limited by the nature of the phenomenon itself, making estimation challenging. To address this issue, two non-parametric simulation algorithms based on multivariate extreme theory are developed. 
These algorithms aim to extend a sample of extremes jointly and conditionally for asymptotically dependent variables using stochastic simulation and multivariate Generalised Pareto Distributions. The approach is illustrated with numerical analyses of both simulated and real data to assess the accuracy of extreme risk metric estimations."}, "https://arxiv.org/abs/2406.08022": {"title": "Null hypothesis Bayes factor estimates can be biased in (some) common factorial designs: A simulation study", "link": "https://arxiv.org/abs/2406.08022", "description": "arXiv:2406.08022v1 Announce Type: new \nAbstract: Bayes factor null hypothesis tests provide a viable alternative to frequentist measures of evidence quantification. Bayes factors for realistic interesting models cannot be calculated exactly, but have to be estimated, which involves approximations to complex integrals. Crucially, the accuracy of these estimates, i.e., whether an estimated Bayes factor corresponds to the true Bayes factor, is unknown, and may depend on data, prior, and likelihood. We have recently developed a novel statistical procedure, namely simulation-based calibration (SBC) for Bayes factors, to test, for a given analysis, whether the computed Bayes factors are accurate. Here, we use SBC for Bayes factors to test, for some common cognitive designs, whether Bayes factors are estimated accurately. We use the bridgesampling/brms packages as well as the BayesFactor package in R. We find that Bayes factor estimates are accurate and exhibit only little bias in Latin square designs with (a) random effects for subjects only and (b) for crossed random effects for subjects and items, but a single fixed-factor. However, Bayes factor estimates turn out biased and liberal in a 2x2 design with crossed random effects for subjects and items. These results suggest that researchers should test, for their individual analysis, whether Bayes factor estimates are accurate. Moreover, future research is needed to determine the boundary conditions under which Bayes factor estimates are accurate or biased, as well as software development to improve estimation accuracy."}, "https://arxiv.org/abs/2406.08168": {"title": "Global Tests for Smoothed Functions in Mean Field Variational Additive Models", "link": "https://arxiv.org/abs/2406.08168", "description": "arXiv:2406.08168v1 Announce Type: new \nAbstract: Variational regression methods are an increasingly popular tool for their efficient estimation of complex models. Given the mixed model representation of penalized effects, additive regression models with smoothed effects and scalar-on-function regression models can be fit relatively efficiently in a variational framework. However, inferential procedures for smoothed and functional effects in such a context are limited. We demonstrate that by using the Mean Field Variational Bayesian (MFVB) approximation to the additive model and the subsequent Coordinate Ascent Variational Inference (CAVI) algorithm, we can obtain a form of the estimated effects required of a Frequentist test for semiparametric curves. We establish MFVB approximations and CAVI algorithms for both Gaussian and binary additive models with an arbitrary number of smoothed and functional effects. We then derive a global testing framework for smoothed and functional effects. Our empirical study demonstrates that the test maintains good Frequentist properties in the variational framework and can be used to directly test results from a converged MFVB approximation and CAVI algorithm.
We illustrate the applicability of this approach in a wide range of data illustrations."}, "https://arxiv.org/abs/2406.08172": {"title": "inlamemi: An R package for missing data imputation and measurement error modelling using INLA", "link": "https://arxiv.org/abs/2406.08172", "description": "arXiv:2406.08172v1 Announce Type: new \nAbstract: Measurement error and missing data in variables used in statistical models are common, and can at worst lead to serious biases in analyses if they are ignored. Yet, these problems are often not dealt with adequately, presumably in part because analysts lack simple enough tools to account for error and missingness. In this R package, we provide functions to aid fitting hierarchical Bayesian models that account for cases where either measurement error (classical or Berkson), missing data, or both are present in continuous covariates. Model fitting is done in a Bayesian framework using integrated nested Laplace approximations (INLA), an approach that is growing in popularity due to its combination of computational speed and accuracy. The {inlamemi} R package is suitable for data analysts who have little prior experience using the R package {R-INLA}, and aids in formulating suitable hierarchical models for a variety of scenarios in order to appropriately capture the processes that generate the measurement error and/or missingness. Numerous examples are given to help analysts identify scenarios similar to their own, and make the process of specifying a suitable model easier."}, "https://arxiv.org/abs/2406.08174": {"title": "A computationally efficient procedure for combining ecological datasets by means of sequential consensus inference", "link": "https://arxiv.org/abs/2406.08174", "description": "arXiv:2406.08174v1 Announce Type: new \nAbstract: Combining data has become an indispensable tool for managing the current diversity and abundance of data. But, as data complexity and data volume swell, the computational demands of previously proposed models for combining data escalate proportionally, posing a significant challenge to practical implementation. This study presents a sequential consensus Bayesian inference procedure that allows for a flexible definition of models, aiming to emulate the versatility of integrated models while significantly reducing their computational cost. The method is based on updating the distribution of the fixed effects and hyperparameters from their marginal posterior distribution throughout a sequential inference procedure, and performing a consensus on the random effects after the sequential inference is completed. The applicability, together with its strengths and limitations, is outlined in the methodological description of the procedure. The sequential consensus method is presented in two distinct algorithms. The first algorithm performs a sequential updating and consensus from the stored values of the marginal or joint posterior distribution of the random effects. The second algorithm performs an extra step, addressing the deficiencies that may arise when the model partition does not share the whole latent field. 
The performance of the procedure is shown by three different examples -- one simulated and two with real data -- intended to expose its strengths and limitations."}, "https://arxiv.org/abs/2406.08241": {"title": "Mode-based estimation of the center of symmetry", "link": "https://arxiv.org/abs/2406.08241", "description": "arXiv:2406.08241v1 Announce Type: new \nAbstract: In the mean-median-mode triad of univariate centrality measures, the mode has been overlooked for estimating the center of symmetry in continuous and unimodal settings. This paper expands on the connection between kernel mode estimators and M-estimators for location, bridging the gap between the nonparametrics and robust statistics communities. The variance of modal estimators is studied in terms of a bandwidth parameter, establishing conditions for an optimal solution that outperforms the household sample mean. A purely nonparametric approach is adopted, modeling heavy-tailedness through regular variation. The results lead to an estimator proposal that includes a novel one-parameter family of kernels with compact support, offering extra robustness and efficiency. The effectiveness and versatility of the new method are demonstrated in a real-world case study and a thorough simulation study, comparing favorably to traditional and more competitive alternatives. Several myths about the mode are clarified along the way, reopening the quest for flexible and efficient nonparametric estimators."}, "https://arxiv.org/abs/2406.08279": {"title": "Positive and negative word of mouth in the United States", "link": "https://arxiv.org/abs/2406.08279", "description": "arXiv:2406.08279v1 Announce Type: new \nAbstract: Word of mouth is a process by which consumers transmit positive or negative sentiment to other consumers about a business. While this process has long been recognized as a type of promotion for businesses, the value of word of mouth is questionable. This study examines the various correlates of word of mouth with demographic variables, including the role of the trust of business owners. Education level, region of residence, and income level were found to be significant predictors of positive word of mouth. Although the results generally suggest that the majority of respondents do not engage in word of mouth, there are valuable insights to be learned."}, "https://arxiv.org/abs/2406.08366": {"title": "Highest Probability Density Conformal Regions", "link": "https://arxiv.org/abs/2406.08366", "description": "arXiv:2406.08366v1 Announce Type: new \nAbstract: We propose a new method for finding the highest predictive density set or region using signed conformal inference. The proposed method is computationally efficient, while also carrying conformal coverage guarantees. We prove that, under mild regularity conditions, the conformal prediction set is asymptotically close to its oracle counterpart. The efficacy of the method is illustrated through simulations and real applications."}, "https://arxiv.org/abs/2406.08367": {"title": "Hierarchical Bayesian Emulation of the Expected Net Present Value Utility Function via a Multi-Model Ensemble Member Decomposition", "link": "https://arxiv.org/abs/2406.08367", "description": "arXiv:2406.08367v1 Announce Type: new \nAbstract: Computer models are widely used to study complex real world physical systems. However, there are major limitations to their direct use including: their complex structure; large numbers of inputs and outputs; and long evaluation times.
Bayesian emulators are an effective means of addressing these challenges providing fast and efficient statistical approximation for computer model outputs. It is commonly assumed that computer models behave like a ``black-box'' function with no knowledge of the output prior to its evaluation. This ensures that emulators are generalisable but potentially limits their accuracy compared with exploiting such knowledge of constrained or structured output behaviour. We assume a ``grey-box'' computer model and establish a hierarchical emulation framework encompassing structured emulators which exploit known constrained and structured behaviour of constituent computer model outputs. This achieves greater physical interpretability and more accurate emulator predictions. This research is motivated by and applied to the commercially important TNO OLYMPUS Well Control Optimisation Challenge from the petroleum industry. We re-express this as a decision support under uncertainty problem. First, we reduce the computational expense of the analysis by identifying a representative subset of models using an efficient multi-model ensemble subsampling technique. Next we apply our hierarchical emulation methodology to the expected Net Present Value utility function with well control decision parameters as inputs."}, "https://arxiv.org/abs/2406.08390": {"title": "Coordinated Trading Strategies for Battery Storage in Reserve and Spot Markets", "link": "https://arxiv.org/abs/2406.08390", "description": "arXiv:2406.08390v1 Announce Type: new \nAbstract: Quantity and price risks are key uncertainties market participants face in electricity markets with increased volatility, for instance, due to high shares of renewables. From day ahead until real-time, there is a large variation in the best available information, leading to price changes that flexible assets, such as battery storage, can exploit economically. This study contributes to understanding how coordinated bidding strategies can enhance multi-market trading and large-scale energy storage integration. Our findings shed light on the complexities arising from interdependencies and the high-dimensional nature of the problem. We show how stochastic dual dynamic programming is a suitable solution technique for such an environment. We include the three markets of the frequency containment reserve, day-ahead, and intraday in stochastic modelling and develop a multi-stage stochastic program. Prices are represented in a multidimensional Markov Chain, following the scheduling of the markets and allowing for time-dependent randomness. Using the example of a battery storage in the German energy sector, we provide valuable insights into the technical aspects of our method and the economic feasibility of battery storage operation. We find that capacity reservation in the frequency containment reserve dominates over the battery's cycling in spot markets at the given resolution on prices in 2022. In an adjusted price environment, we find that coordination can yield an additional value of up to 12.5%."}, "https://arxiv.org/abs/2406.08419": {"title": "Identification and Inference on Treatment Effects under Covariate-Adaptive Randomization and Imperfect Compliance", "link": "https://arxiv.org/abs/2406.08419", "description": "arXiv:2406.08419v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) frequently utilize covariate-adaptive randomization (CAR) (e.g., stratified block randomization) and commonly suffer from imperfect compliance. 
This paper studies the identification and inference for the average treatment effect (ATE) and the average treatment effect on the treated (ATT) in such RCTs with a binary treatment.\n We first develop characterizations of the identified sets for both estimands. Since data are generally not i.i.d. under CAR, these characterizations do not follow from existing results. We then provide consistent estimators of the identified sets and asymptotically valid confidence intervals for the parameters. Our asymptotic analysis leads to concrete practical recommendations regarding how to estimate the treatment assignment probabilities that enter in estimated bounds. In the case of the ATE, using sample analog assignment frequencies is more efficient than using the true assignment probabilities. On the contrary, using the true assignment probabilities is preferable for the ATT."}, "https://arxiv.org/abs/2406.07555": {"title": "Sequential Monte Carlo for Cut-Bayesian Posterior Computation", "link": "https://arxiv.org/abs/2406.07555", "description": "arXiv:2406.07555v1 Announce Type: cross \nAbstract: We propose a sequential Monte Carlo (SMC) method to efficiently and accurately compute cut-Bayesian posterior quantities of interest, variations of standard Bayesian approaches constructed primarily to account for model misspecification. We prove finite sample concentration bounds for estimators derived from the proposed method along with a linear tempering extension and apply these results to a realistic setting where a computer model is misspecified. We then illustrate the SMC method for inference in a modular chemical reactor example that includes submodels for reaction kinetics, turbulence, mass transfer, and diffusion. The samples obtained are commensurate with a direct-sampling approach that consists of running multiple Markov chains, with computational efficiency gains using the SMC method. Overall, the SMC method presented yields a novel, rigorous approach to computing with cut-Bayesian posterior distributions."}, "https://arxiv.org/abs/2406.07825": {"title": "Shape-Constrained Distributional Optimization via Importance-Weighted Sample Average Approximation", "link": "https://arxiv.org/abs/2406.07825", "description": "arXiv:2406.07825v1 Announce Type: cross \nAbstract: Shape-constrained optimization arises in a wide range of problems including distributionally robust optimization (DRO) that has surging popularity in recent years. In the DRO literature, these problems are usually solved via reduction into moment-constrained problems using the Choquet representation. While powerful, such an approach could face tractability challenges arising from the geometries and the compatibility between the shape and the objective function and moment constraints. In this paper, we propose an alternative methodology to solve shape-constrained optimization problems by integrating sample average approximation with importance sampling, the latter used to convert the distributional optimization into an optimization problem over the likelihood ratio with respect to a sampling distribution. We demonstrate how our approach, which relies on finite-dimensional linear programs, can handle a range of shape-constrained problems beyond the reach of previous Choquet-based reformulations, and entails vanishing and quantifiable optimality gaps. 
Moreover, our theoretical analyses based on strong duality and empirical processes reveal the critical role of shape constraints in guaranteeing desirable consistency and convergence rates."}, "https://arxiv.org/abs/2406.08041": {"title": "HARd to Beat: The Overlooked Impact of Rolling Windows in the Era of Machine Learning", "link": "https://arxiv.org/abs/2406.08041", "description": "arXiv:2406.08041v1 Announce Type: cross \nAbstract: We investigate the predictive abilities of the heterogeneous autoregressive (HAR) model compared to machine learning (ML) techniques across an unprecedented dataset of 1,455 stocks. Our analysis focuses on the role of fitting schemes, particularly the training window and re-estimation frequency, in determining the HAR model's performance. Despite extensive hyperparameter tuning, ML models fail to surpass the linear benchmark set by HAR when utilizing a refined fitting approach for the latter. Moreover, the simplicity of HAR allows for an interpretable model with drastically lower computational costs. We assess performance using QLIKE, MSE, and realized utility metrics, finding that HAR consistently outperforms its ML counterparts when both rely solely on realized volatility and VIX as predictors. Our results underscore the importance of a correctly specified fitting scheme. They suggest that properly fitted HAR models provide superior forecasting accuracy, establishing robust guidelines for their practical application and use as a benchmark. This study not only reaffirms the efficacy of the HAR model but also provides a critical perspective on the practical limitations of ML approaches in realized volatility forecasting."}, "https://arxiv.org/abs/2406.08097": {"title": "Inductive Global and Local Manifold Approximation and Projection", "link": "https://arxiv.org/abs/2406.08097", "description": "arXiv:2406.08097v1 Announce Type: cross \nAbstract: Nonlinear dimensional reduction with the manifold assumption, often called manifold learning, has proven its usefulness in a wide range of high-dimensional data analysis. The significant impact of t-SNE and UMAP has catalyzed intense research interest, seeking further innovations toward visualizing not only the local but also the global structure information of the data. Moreover, there have been consistent efforts toward generalizable dimensional reduction that handles unseen data. In this paper, we first propose GLoMAP, a novel manifold learning method for dimensional reduction and high-dimensional data visualization. GLoMAP preserves locally and globally meaningful distance estimates and displays a progression from global to local formation during the course of optimization. Furthermore, we extend GLoMAP to its inductive version, iGLoMAP, which utilizes a deep neural network to map data to its lower-dimensional representation. This allows iGLoMAP to provide lower-dimensional embeddings for unseen points without needing to re-train the algorithm. iGLoMAP is also well-suited for mini-batch learning, enabling large-scale, accelerated gradient calculations. 
We have successfully applied both GLoMAP and iGLoMAP to the simulated and real-data settings, with competitive experiments against the state-of-the-art methods."}, "https://arxiv.org/abs/2406.08180": {"title": "Stochastic Process-based Method for Degree-Degree Correlation of Evolving Networks", "link": "https://arxiv.org/abs/2406.08180", "description": "arXiv:2406.08180v1 Announce Type: cross \nAbstract: Existing studies on the degree correlation of evolving networks typically rely on differential equations and statistical analysis, resulting in only approximate solutions due to inherent randomness. To address this limitation, we propose an improved Markov chain method for modeling degree correlation in evolving networks. By redesigning the network evolution rules to reflect actual network dynamics more accurately, we achieve a topological structure that closely matches real-world network evolution. Our method models the degree correlation evolution process for both directed and undirected networks and provides theoretical results that are verified through simulations. This work offers the first theoretical solution for the steady-state degree correlation in evolving network models and is applicable to more complex evolution mechanisms and networks with directional attributes. Additionally, it supports the study of dynamic characteristic control based on network structure at any given time, offering a new tool for researchers in the field."}, "https://arxiv.org/abs/2406.08322": {"title": "MMIL: A novel algorithm for disease associated cell type discovery", "link": "https://arxiv.org/abs/2406.08322", "description": "arXiv:2406.08322v1 Announce Type: cross \nAbstract: Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality."}, "https://arxiv.org/abs/2111.15524": {"title": "Efficiency and Robustness of Rosenbaum's Regression (Un)-Adjusted Rank-based Estimator in Randomized Experiments", "link": "https://arxiv.org/abs/2111.15524", "description": "arXiv:2111.15524v4 Announce Type: replace \nAbstract: Mean-based estimators of the causal effect in a completely randomized experiment may behave poorly if the potential outcomes have a heavy-tail, or contain outliers. We study an alternative estimator by Rosenbaum that estimates the constant additive treatment effect by inverting a randomization test using ranks. 
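A minimal sketch of the rank-inversion idea in the preceding sentence (my own toy example with simulated data, not the paper's implementation): hypothesise a constant additive effect tau, remove it from the treated outcomes, and report the tau at which the Wilcoxon rank-sum statistic is closest to its null expectation.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n = 200
z = rng.permutation(np.repeat([0, 1], n // 2))      # completely randomized assignment
tau_true = 1.5
y = rng.standard_t(df=3, size=n) + tau_true * z     # heavy-tailed outcomes plus additive effect

def rank_sum(y_adj, z):
    """Wilcoxon rank-sum statistic of the treated arm, computed on adjusted outcomes."""
    return rankdata(y_adj)[z == 1].sum()

# Under the hypothesis Y(1) - Y(0) = tau, the adjusted outcomes y - tau*z are exchangeable
# across arms, so the statistic should sit near its null expectation n1 * (n + 1) / 2.
null_mean = (z == 1).sum() * (n + 1) / 2.0
grid = np.linspace(-1.0, 4.0, 1001)
tau_hat = grid[np.argmin([abs(rank_sum(y - t * z, z) - null_mean) for t in grid])]
print(f"rank-inversion estimate of the additive effect: {tau_hat:.3f}")
```

The same inversion yields a confidence interval by collecting all values of tau at which the rank test does not reject.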
By investigating the breakdown point and asymptotic relative efficiency of this rank-based estimator, we show that it is provably robust against outliers and heavy-tailed potential outcomes, and has asymptotic variance at most 1.16 times that of the difference-in-means estimator (and much smaller when the potential outcomes are not light-tailed). We further derive a consistent estimator of the asymptotic standard error for Rosenbaum's estimator which yields a readily computable confidence interval for the treatment effect. We also study a regression adjusted version of Rosenbaum's estimator to incorporate additional covariate information in randomization inference. We prove a gain in efficiency from this regression adjustment method under a linear regression model. We illustrate through synthetic and real data that, unlike the mean-based estimators, these rank-based estimators (both unadjusted and regression adjusted) are efficient and robust against heavy-tailed distributions, contamination, and model misspecification. Finally, we initiate the study of Rosenbaum's estimator when the constant treatment effect assumption may be violated."}, "https://arxiv.org/abs/2208.10910": {"title": "A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression", "link": "https://arxiv.org/abs/2208.10910", "description": "arXiv:2208.10910v3 Announce Type: replace \nAbstract: We introduce a new empirical Bayes approach for large-scale multiple linear regression. Our approach combines two key ideas: (i) the use of flexible \"adaptive shrinkage\" priors, which approximate the nonparametric family of scale mixture of normal distributions by a finite mixture of normal distributions; and (ii) the use of variational approximations to efficiently estimate prior hyperparameters and compute approximate posteriors. Combining these two ideas results in fast and flexible methods, with computational speed comparable to fast penalized regression methods such as the Lasso, and with competitive prediction accuracy across a wide range of scenarios. Further, we provide new results that establish conceptual connections between our empirical Bayes methods and penalized methods. Specifically, we show that the posterior mean from our method solves a penalized regression problem, with the form of the penalty function being learned from the data by directly solving an optimization problem (rather than being tuned by cross-validation). Our methods are implemented in an R package, mr.ash.alpha, available from https://github.com/stephenslab/mr.ash.alpha."}, "https://arxiv.org/abs/2211.06568": {"title": "Effective experience rating for large insurance portfolios via surrogate modeling", "link": "https://arxiv.org/abs/2211.06568", "description": "arXiv:2211.06568v3 Announce Type: replace \nAbstract: Experience rating in insurance uses a Bayesian credibility model to upgrade the current premiums of a contract by taking into account policyholders' attributes and their claim history. Most data-driven models used for this task are mathematically intractable, and premiums must be obtained through numerical methods such as simulation via MCMC. However, these methods can be computationally expensive and even prohibitive for large portfolios when applied at the policyholder level. Additionally, these computations become ``black-box'' procedures as there is no analytical expression showing how the claim history of policyholders is used to upgrade their premiums. 
To address these challenges, this paper proposes a surrogate modeling approach to inexpensively derive an analytical expression for computing the Bayesian premiums for any given model, approximately. As a part of the methodology, the paper introduces a \\emph{likelihood-based summary statistic} of the policyholder's claim history that serves as the main input of the surrogate model and that is sufficient for certain families of distribution, including the exponential dispersion family. As a result, the computational burden of experience rating for large portfolios is reduced through the direct evaluation of such analytical expression, which can provide a transparent and interpretable way of computing Bayesian premiums."}, "https://arxiv.org/abs/2307.09404": {"title": "Continuous-time multivariate analysis", "link": "https://arxiv.org/abs/2307.09404", "description": "arXiv:2307.09404v3 Announce Type: replace \nAbstract: The starting point for much of multivariate analysis (MVA) is an $n\\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. Here we introduce a framework for extending techniques of multivariate analysis to such settings. The proposed continuous-time multivariate analysis (CTMVA) framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as $B$-splines, as in the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We present continuous-time extensions of the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering. We show that CTMVA can improve on the performance of classical MVA, in particular for correlation estimation and clustering, and can be applied in some settings where classical MVA cannot, including variables observed at disparate time points. CTMVA is illustrated with a novel perspective on a well-known Canadian weather data set, and with applications to data sets involving international development, brain signals, and air quality. The proposed methods are implemented in the publicly available R package \\texttt{ctmva}."}, "https://arxiv.org/abs/2307.10808": {"title": "Claim Reserving via Inverse Probability Weighting: A Micro-Level Chain-Ladder Method", "link": "https://arxiv.org/abs/2307.10808", "description": "arXiv:2307.10808v3 Announce Type: replace \nAbstract: Claim reserving primarily relies on macro-level models, with the Chain-Ladder method being the most widely adopted. These methods were heuristically developed without minimal statistical foundations, relying on oversimplified data assumptions and neglecting policyholder heterogeneity, often resulting in conservative reserve predictions. Micro-level reserving, utilizing stochastic modeling with granular information, can improve predictions but tends to involve less attractive and complex models for practitioners. 
This paper aims to strike a practical balance between aggregate and individual models by introducing a methodology that enables the Chain-Ladder method to incorporate individual information. We achieve this by proposing a novel framework, formulating the claim reserving problem within a population sampling context. We introduce a reserve estimator in a frequency and severity distribution-free manner that utilizes inverse probability weights (IPW) driven by individual information, akin to propensity scores. We demonstrate that the Chain-Ladder method emerges as a particular case of such an IPW estimator, thereby inheriting a statistically sound foundation based on population sampling theory that enables the use of granular information, and other extensions."}, "https://arxiv.org/abs/2309.15408": {"title": "A smoothed-Bayesian approach to frequency recovery from sketched data", "link": "https://arxiv.org/abs/2309.15408", "description": "arXiv:2309.15408v2 Announce Type: replace \nAbstract: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a {\\em smoothed-Bayesian} method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on \\emph{multi-view} learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives."}, "https://arxiv.org/abs/2310.17820": {"title": "Sparse Bayesian Multidimensional Item Response Theory", "link": "https://arxiv.org/abs/2310.17820", "description": "arXiv:2310.17820v2 Announce Type: replace \nAbstract: Multivariate Item Response Theory (MIRT) is widely sought after by applied researchers looking for interpretable (sparse) explanations underlying response patterns in questionnaire data. There is, however, an unmet demand for such sparsity discovery tools in practice. Our paper develops a Bayesian platform for binary and ordinal item MIRT which requires minimal tuning and scales well on large datasets due to its parallelizable features. Bayesian methodology for MIRT models has traditionally relied on MCMC simulation, which can not only be slow in practice but also often renders exact sparsity recovery impossible without additional thresholding. In this work, we develop a scalable Bayesian EM algorithm to estimate sparse factor loadings from mixed continuous, binary, and ordinal item responses. 
We address the seemingly insurmountable problem of unknown latent factor dimensionality with tools from Bayesian nonparametrics which enable estimating the number of factors. Rotations to sparsity through parameter expansion further enhance convergence and interpretability without identifiability constraints. In our simulation study, we show that our method reliably recovers both the factor dimensionality as well as the latent structure on high-dimensional synthetic data even for small samples. We demonstrate the practical usefulness of our approach on three datasets: an educational assessment dataset, a quality-of-life measurement dataset, and a bio-behavioral dataset. All demonstrations show that our tool yields interpretable estimates, facilitating interesting discoveries that might otherwise go unnoticed under a pure confirmatory factor analysis setting."}, "https://arxiv.org/abs/2312.05411": {"title": "Deep Bayes Factors", "link": "https://arxiv.org/abs/2312.05411", "description": "arXiv:2312.05411v2 Announce Type: replace \nAbstract: There is no other model or hypothesis verification tool in Bayesian statistics that is as widely used as the Bayes factor. We focus on generative models that are likelihood-free and, therefore, render the computation of Bayes factors (marginal likelihood ratios) far from obvious. We propose a deep learning estimator of the Bayes factor based on simulated data from two competing models using the likelihood ratio trick. This estimator is devoid of summary statistics and obviates some of the difficulties with ABC model choice. We establish sufficient conditions for consistency of our Deep Bayes Factor estimator as well as its consistency as a model selection tool. We investigate the performance of our estimator on various examples using a wide range of quality metrics related to estimation and model decision accuracy. After training, our deep learning approach enables rapid evaluations of the Bayes factor estimator at any fictional data arriving from either hypothesized model, not just the observed data $Y_0$. This allows us to inspect entire Bayes factor distributions under the two models and to quantify the relative location of the Bayes factor evaluated at $Y_0$ in light of these distributions. Such tail area evaluations are not possible for Bayes factor estimators tailored to $Y_0$. We find the performance of our Deep Bayes Factors competitive with existing MCMC techniques that require knowledge of the likelihood function. We also consider variants for estimating posterior or intrinsic Bayes factors. We demonstrate the usefulness of our approach on a relatively high-dimensional real data example about determining cognitive biases."}, "https://arxiv.org/abs/2406.08628": {"title": "Empirical Evidence That There Is No Such Thing As A Validated Prediction Model", "link": "https://arxiv.org/abs/2406.08628", "description": "arXiv:2406.08628v1 Announce Type: new \nAbstract: Background: External validations are essential to assess clinical prediction models (CPMs) before deployment. Apart from model misspecification, differences in patient population and other factors influence a model's AUC (c-statistic). We aimed to quantify variation in AUCs across external validation studies and adjust expectations of a model's performance in a new setting.\n Methods: The Tufts-PACE CPM Registry contains CPMs for cardiovascular disease prognosis. We analyzed the AUCs of 469 CPMs with a total of 1,603 external validations. 
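As a toy illustration of the likelihood ratio trick used by the Deep Bayes Factors abstract above (my own example with two simple univariate models and a logistic regression in place of a deep network): a classifier trained to separate simulations from the two models recovers the ratio of their marginal likelihoods through $D(x)/(1-D(x))$.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 100_000
# model 1: x | theta ~ N(theta, 1), theta ~ N(0, 1)  =>  marginal is N(0, 2)
x1 = rng.normal(0, 1, n) + rng.normal(0, 1, n)
# model 0: x ~ N(0, 1)
x0 = rng.normal(0, 1, n)

x = np.concatenate([x0, x1])
labels = np.concatenate([np.zeros(n), np.ones(n)])
features = np.column_stack([x, x ** 2])              # log marginal ratio is quadratic in x
clf = LogisticRegression(C=1e6).fit(features, labels)

x_obs = 2.0
d = clf.predict_proba(np.array([[x_obs, x_obs ** 2]]))[0, 1]
bf_estimate = d / (1 - d)                             # likelihood ratio trick
bf_exact = norm.pdf(x_obs, scale=np.sqrt(2)) / norm.pdf(x_obs, scale=1.0)
print(f"estimated Bayes factor {bf_estimate:.3f} vs exact {bf_exact:.3f}")
```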
For each CPM, we performed a random effects meta-analysis to estimate the between-study standard deviation $\\tau$ among the AUCs. Since the majority of these meta-analyses have only a handful of validations, this leads to very poor estimates of $\\tau$. So, we estimated a log-normal distribution of $\\tau$ across all CPMs and used this as an empirical prior. We compared this empirical Bayesian approach with frequentist meta-analyses using cross-validation.\n Results: The 469 CPMs had a median of 2 external validations (IQR: [1-3]). The estimated distribution of $\\tau$ had a mean of 0.055 and a standard deviation of 0.015. If $\\tau$ = 0.05, the 95% prediction interval for the AUC in a new setting is at least +/- 0.1, regardless of the number of validations. Frequentist methods underestimate the uncertainty about the AUC in a new setting. Accounting for $\\tau$ in a Bayesian approach achieved near nominal coverage.\n Conclusion: Due to large heterogeneity among the validated AUC values of a CPM, there is great irreducible uncertainty in predicting the AUC in a new setting. This uncertainty is underestimated by existing methods. The proposed empirical Bayes approach addresses this problem and merits wide application in judging the validity of prediction models."}, "https://arxiv.org/abs/2406.08668": {"title": "Causal Inference on Missing Exposure via Robust Estimation", "link": "https://arxiv.org/abs/2406.08668", "description": "arXiv:2406.08668v1 Announce Type: new \nAbstract: How to deal with missing data in observational studies is a common concern for causal inference. When the covariates are missing at random (MAR), multiple approaches have been provided to help solve the issue. However, if the exposure is MAR, few approaches are available and careful adjustments for both missingness and confounding are required to ensure a consistent estimate of the true causal effect on the response. In this article, a new inverse probability weighting (IPW) estimator based on weighted estimating equations (WEE) is proposed to incorporate weights from both the missingness and propensity score (PS) models, which can reduce the joint effect of extreme weights in finite samples. Additionally, we develop a triple robust (TR) estimator via WEE to further protect against the misspecification of the missingness model. The asymptotic properties of WEE estimators are proved using properties of estimating equations. Based on the simulation studies, WEE methods outperform others, including imputation-based approaches, in terms of bias and variability. Finally, an application study is conducted to identify the causal effect of the presence of cardiovascular disease on mortality for COVID-19 patients."}, "https://arxiv.org/abs/2406.08685": {"title": "Variational Bayes Inference for Spatial Error Models with Missing Data", "link": "https://arxiv.org/abs/2406.08685", "description": "arXiv:2406.08685v1 Announce Type: new \nAbstract: The spatial error model (SEM) is a type of simultaneous autoregressive (SAR) model for analysing spatially correlated data. Markov chain Monte Carlo (MCMC) is one of the most widely used Bayesian methods for estimating SEM, but it has significant limitations when it comes to handling missing data in the response variable due to its high computational cost. Variational Bayes (VB) approximation offers an alternative solution to this problem. 
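A quick back-of-envelope check of the "at least +/- 0.1" figure quoted in the meta-analysis abstract above (my own reading, ignoring estimation error in the pooled mean AUC): the half-width of a 95% prediction interval for the AUC in a new setting is bounded below by the between-study component alone,

$$ 1.96\,\sqrt{\tau^2 + \widehat{\mathrm{se}}^2} \;\ge\; 1.96\,\tau = 1.96 \times 0.05 \approx 0.098 \approx 0.1, $$

which matches the reported lower bound regardless of how many validations contribute to $\widehat{\mathrm{se}}$.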
Two VB-based algorithms employing Gaussian variational approximation with factor covariance structure are presented, joint VB (JVB) and hybrid VB (HVB), suitable for both missing at random and not at random inference. When dealing with many missing values, the JVB is inaccurate, and the standard HVB algorithm struggles to achieve accurate inferences. Our modified versions of HVB enable accurate inference within a reasonable computational time, thus improving its performance. The performance of the VB methods is evaluated using simulated and real datasets."}, "https://arxiv.org/abs/2406.08776": {"title": "Learning Joint and Individual Structure in Network Data with Covariates", "link": "https://arxiv.org/abs/2406.08776", "description": "arXiv:2406.08776v1 Announce Type: new \nAbstract: Datasets consisting of a network and covariates associated with its vertices have become ubiquitous. One problem pertaining to this type of data is to identify information unique to the network, information unique to the vertex covariates and information that is shared between the network and the vertex covariates. Existing techniques for network data and vertex covariates focus on capturing structure that is shared but are usually not able to differentiate structure that is unique to each dataset. This work formulates a low-rank model that simultaneously captures joint and individual information in network data with vertex covariates. A two-step estimation procedure is proposed, composed of an efficient spectral method followed by a refinement optimization step. Theoretically, we show that the spectral method is able to consistently recover the joint and individual components under a general signal-plus-noise model.\n Simulations and real data examples demonstrate the ability of the methods to recover accurate and interpretable components. In particular, the application of the methodology to a food trade network between countries with economic, developmental and geographical country-level indicators as covariates yields joint and individual factors that explain the trading patterns."}, "https://arxiv.org/abs/2406.08784": {"title": "Improved methods for empirical Bayes multivariate multiple testing and effect size estimation", "link": "https://arxiv.org/abs/2406.08784", "description": "arXiv:2406.08784v1 Announce Type: new \nAbstract: Estimating the sharing of genetic effects across different conditions is important to many statistical analyses of genomic data. The patterns of sharing arising from these data are often highly heterogeneous. To flexibly model these heterogeneous sharing patterns, Urbut et al. (2019) proposed the multivariate adaptive shrinkage (MASH) method to jointly analyze genetic effects across multiple conditions. However, multivariate analyses using MASH (as well as other multivariate analyses) require good estimates of the sharing patterns, and estimating these patterns efficiently and accurately remains challenging. Here we describe new empirical Bayes methods that provide improvements in speed and accuracy over existing methods. The two key ideas are: (1) adaptive regularization to improve accuracy in settings with many conditions; (2) improving the speed of the model fitting algorithms by exploiting analytical results on covariance estimation. In simulations, we show that the new methods provide better model fits, better out-of-sample performance, and improved power and accuracy in detecting the true underlying signals. 
In an analysis of eQTLs in 49 human tissues, our new analysis pipeline achieves better model fits and better out-of-sample performance than the existing MASH analysis pipeline. We have implemented the new methods, which we call ``Ultimate Deconvolution'', in an R package, udr, available on GitHub."}, "https://arxiv.org/abs/2406.08867": {"title": "A Robust Bayesian approach for reliability prognosis of one-shot devices under cumulative risk model", "link": "https://arxiv.org/abs/2406.08867", "description": "arXiv:2406.08867v1 Announce Type: new \nAbstract: The reliability prognosis of one-shot devices is drawing increasing attention because of their wide applicability. The present study aims to determine the lifetime prognosis of highly durable one-shot device units under a step-stress accelerated life testing (SSALT) experiment applying a cumulative risk model (CRM). In an SSALT experiment, CRM retains the continuity of hazard function by allowing the lag period before the effects of stress change emerge. In an analysis of such lifetime data, plentiful datasets might have outliers where conventional methods like maximum likelihood estimation or likelihood-based Bayesian estimation frequently fail. This work develops a robust estimation method based on density power divergence in classical and Bayesian frameworks. The hypothesis is tested by implementing the Bayes factor based on a robustified posterior. In Bayesian estimation, we exploit Hamiltonian Monte Carlo, which has certain advantages over the conventional Metropolis-Hastings algorithms. Further, the influence functions are examined to evaluate the robust behaviour of the estimators and the Bayes factor. Finally, the analytical development is validated through a simulation study and a real data analysis."}, "https://arxiv.org/abs/2406.08880": {"title": "Jackknife inference with two-way clustering", "link": "https://arxiv.org/abs/2406.08880", "description": "arXiv:2406.08880v1 Announce Type: new \nAbstract: For linear regression models with cross-section or panel data, it is natural to assume that the disturbances are clustered in two dimensions. However, the finite-sample properties of two-way cluster-robust tests and confidence intervals are often poor. We discuss several ways to improve inference with two-way clustering. Two of these are existing methods for avoiding, or at least ameliorating, the problem of undefined standard errors when a cluster-robust variance matrix estimator (CRVE) is not positive definite. One is a new method that always avoids the problem. More importantly, we propose a family of new two-way CRVEs based on the cluster jackknife. Simulations for models with two-way fixed effects suggest that, in many cases, the cluster-jackknife CRVE combined with our new method yields surprisingly accurate inferences. We provide a simple software package, twowayjack for Stata, that implements our recommended variance estimator."}, "https://arxiv.org/abs/2406.08968": {"title": "Covariate Selection for Optimizing Balance with Covariate-Adjusted Response-Adaptive Randomization", "link": "https://arxiv.org/abs/2406.08968", "description": "arXiv:2406.08968v1 Announce Type: new \nAbstract: Balancing influential covariates is crucial for valid treatment comparisons in clinical studies. While covariate-adaptive randomization is commonly used to achieve balance, its performance can be inadequate when the number of baseline covariates is large. 
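As a side illustration of the cluster-jackknife idea in the two-way clustering abstract above, here is a one-way (single clustering dimension) version (my own simplified sketch, not the paper's twowayjack package): re-fit OLS leaving out one cluster at a time and combine the leave-one-out coefficient estimates into a variance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
G, n_per = 30, 25                                    # clusters and observations per cluster
cluster = np.repeat(np.arange(G), n_per)
x = rng.normal(size=G * n_per)
u = rng.normal(size=G)[cluster] + rng.normal(size=G * n_per)   # cluster-correlated errors
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones_like(x), x])

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# delete-one-cluster estimates
loo = np.array([
    np.linalg.lstsq(X[cluster != g], y[cluster != g], rcond=None)[0] for g in range(G)
])
centered = loo - loo.mean(axis=0)
V_jack = (G - 1) / G * centered.T @ centered         # cluster-jackknife variance estimate
print("beta_hat:", np.round(beta_hat, 3),
      "jackknife s.e.:", np.round(np.sqrt(np.diag(V_jack)), 3))
```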
It is therefore essential to identify the influential factors associated with the outcome and ensure balance among these critical covariates. In this article, we propose a novel covariate-adjusted response-adaptive randomization that integrates the patients' responses and covariates information to select sequentially significant covariates and maintain their balance. We establish theoretically the consistency of our covariate selection method and demonstrate that the improved covariate balancing, as evidenced by a faster convergence rate of the imbalance measure, leads to higher efficiency in estimating treatment effects. Furthermore, we provide extensive numerical and empirical studies to illustrate the benefits of our proposed method across various settings."}, "https://arxiv.org/abs/2406.09010": {"title": "A geometric approach to informed MCMC sampling", "link": "https://arxiv.org/abs/2406.09010", "description": "arXiv:2406.09010v1 Announce Type: new \nAbstract: A Riemannian geometric framework for Markov chain Monte Carlo (MCMC) is developed where using the Fisher-Rao metric on the manifold of probability density functions (pdfs) informed proposal densities for Metropolis-Hastings (MH) algorithms are constructed. We exploit the square-root representation of pdfs under which the Fisher-Rao metric boils down to the standard $L^2$ metric on the positive orthant of the unit hypersphere. The square-root representation allows us to easily compute the geodesic distance between densities, resulting in a straightforward implementation of the proposed geometric MCMC methodology. Unlike the random walk MH that blindly proposes a candidate state using no information about the target, the geometric MH algorithms effectively move an uninformed base density (e.g., a random walk proposal density) towards different global/local approximations of the target density. We compare the proposed geometric MH algorithm with other MCMC algorithms for various Markov chain orderings, namely the covariance, efficiency, Peskun, and spectral gap orderings. The superior performance of the geometric algorithms over other MH algorithms like the random walk Metropolis, independent MH and variants of Metropolis adjusted Langevin algorithms is demonstrated in the context of various multimodal, nonlinear and high dimensional examples. In particular, we use extensive simulation and real data applications to compare these algorithms for analyzing mixture models, logistic regression models and ultra-high dimensional Bayesian variable selection models. A publicly available R package accompanies the article."}, "https://arxiv.org/abs/2406.09055": {"title": "Relational event models with global covariates", "link": "https://arxiv.org/abs/2406.09055", "description": "arXiv:2406.09055v1 Announce Type: new \nAbstract: Traditional inference in relational event models from dynamic network data involves only dyadic and node-specific variables, as anything that is global, i.e. constant across dyads, drops out of the partial likelihood. We address this with the use of nested case-control sampling on a time-shifted version of the event process. This leads to a partial likelihood of a degenerate logistic additive model, enabling efficient estimation of global and non-global covariate effects. The method's effectiveness is demonstrated through a simulation study. 
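The square-root representation mentioned in the geometric MCMC abstract above can be made concrete with a short numerical sketch (my own illustration, not the accompanying R package): square-root densities have unit $L^2$ norm, so they sit on the unit hypersphere, and the great-circle distance between two densities is the arccosine of the inner product of their square roots (the Bhattacharyya coefficient), which is the geodesic distance the informed proposals rely on, up to a convention-dependent constant.

```python
import numpy as np
from scipy.stats import norm

# two densities on a fine grid
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=2.0)

# inner product of the square-root densities (Bhattacharyya coefficient),
# then the great-circle distance on the unit hypersphere
bc = np.sum(np.sqrt(p * q)) * dx
geodesic = np.arccos(np.clip(bc, -1.0, 1.0))
print(f"Bhattacharyya coefficient {bc:.4f}, geodesic distance {geodesic:.4f}")
```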
An application to bike sharing data reveals significant influences of global covariates like weather and time of day on bike-sharing dynamics."}, "https://arxiv.org/abs/2406.09163": {"title": "Covariate balancing with measurement error", "link": "https://arxiv.org/abs/2406.09163", "description": "arXiv:2406.09163v1 Announce Type: new \nAbstract: In recent years, there is a growing body of causal inference literature focusing on covariate balancing methods. These methods eliminate observed confounding by equalizing covariate moments between the treated and control groups. The validity of covariate balancing relies on an implicit assumption that all covariates are accurately measured, which is frequently violated in observational studies. Nevertheless, the impact of measurement error on covariate balancing is unclear, and there is no existing work on balancing mismeasured covariates adequately. In this article, we show that naively ignoring measurement error reversely increases the magnitude of covariate imbalance and induces bias to treatment effect estimation. We then propose a class of measurement error correction strategies for the existing covariate balancing methods. Theoretically, we show that these strategies successfully recover balance for all covariates, and eliminate bias of treatment effect estimation. We assess the proposed correction methods in simulation studies and real data analysis."}, "https://arxiv.org/abs/2406.09195": {"title": "When Pearson $\\chi^2$ and other divisible statistics are not goodness-of-fit tests", "link": "https://arxiv.org/abs/2406.09195", "description": "arXiv:2406.09195v1 Announce Type: new \nAbstract: Thousands of experiments are analyzed and papers are published each year involving the statistical analysis of grouped data. While this area of statistics is often perceived - somewhat naively - as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and what are the new possibilities in their hands.\n Motivated by this need, the article introduces a unifying approach to the analysis of grouped data which allows us to study the class of divisible statistics - that includes Pearson's $\\chi^2$, the likelihood ratio as special cases - with a fresh perspective. The contributions collected in this manuscript span from modeling and estimation to distribution-free goodness-of-fit tests.\n Perhaps the most surprising result presented here is that, in a sparse regime, all tests proposed in the literature are dominated by a class of weighted linear statistics."}, "https://arxiv.org/abs/2406.09254": {"title": "General Bayesian Predictive Synthesis", "link": "https://arxiv.org/abs/2406.09254", "description": "arXiv:2406.09254v1 Announce Type: new \nAbstract: This study investigates Bayesian ensemble learning for improving the quality of decision-making. We consider a decision-maker who selects an action from a set of candidates based on a policy trained using observations. In our setting, we assume the existence of experts who provide predictive distributions based on their own policies. Our goal is to integrate these predictive distributions within the Bayesian framework. Our proposed method, which we refer to as General Bayesian Predictive Synthesis (GBPS), is characterized by a loss minimization framework and does not rely on parameter estimation, unlike existing studies. 
Inspired by Bayesian predictive synthesis and general Bayes frameworks, we evaluate the performance of our proposed method through simulation studies."}, "https://arxiv.org/abs/2406.08666": {"title": "Interventional Causal Discovery in a Mixture of DAGs", "link": "https://arxiv.org/abs/2406.08666", "description": "arXiv:2406.08666v1 Announce Type: cross \nAbstract: Causal interactions among a group of variables are often modeled by a single causal graph. In some domains, however, these interactions are best described by multiple co-existing causal graphs, e.g., in dynamical systems or genomics. This paper addresses the hitherto unknown role of interventions in learning causal interactions among variables governed by a mixture of causal systems, each modeled by one directed acyclic graph (DAG). Causal discovery from mixtures is fundamentally more challenging than single-DAG causal discovery. Two major difficulties stem from (i) inherent uncertainty about the skeletons of the component DAGs that constitute the mixture and (ii) possibly cyclic relationships across these component DAGs. This paper addresses these challenges and aims to identify edges that exist in at least one component DAG of the mixture, referred to as true edges. First, it establishes matching necessary and sufficient conditions on the size of interventions required to identify the true edges. Next, guided by the necessity results, an adaptive algorithm is designed that learns all true edges using ${\\cal O}(n^2)$ interventions, where $n$ is the number of nodes. Remarkably, the size of the interventions is optimal if the underlying mixture model does not contain cycles across its components. More generally, the gap between the intervention size used by the algorithm and the optimal size is quantified. It is shown to be bounded by the cyclic complexity number of the mixture model, defined as the size of the minimal intervention that can break the cycles in the mixture, which is upper bounded by the number of cycles among the ancestors of a node."}, "https://arxiv.org/abs/2406.08697": {"title": "Orthogonalized Estimation of Difference of $Q$-functions", "link": "https://arxiv.org/abs/2406.08697", "description": "arXiv:2406.08697v1 Announce Type: cross \nAbstract: Offline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of causal contrast functions such as CATE, which is sufficient for optimizing decisions and can adapt to potentially smoother structure. We develop a dynamic generalization of the R-learner (Nie and Wager 2021, Lewis and Syrgkanis 2021) for estimating and optimizing the difference of $Q^\\pi$-functions, $Q^\\pi(s,1)-Q^\\pi(s,0)$ (which can be used to optimize multiple-valued actions). We leverage orthogonal estimation to improve convergence rates in the presence of slower nuisance estimation rates and prove consistency of policy optimization under a margin condition. 
The method can leverage black-box nuisance estimators of the $Q$-function and behavior policy to target estimation of a more structured $Q$-function contrast."}, "https://arxiv.org/abs/2406.08709": {"title": "Introducing Diminutive Causal Structure into Graph Representation Learning", "link": "https://arxiv.org/abs/2406.08709", "description": "arXiv:2406.08709v1 Announce Type: cross \nAbstract: When engaging in end-to-end graph representation learning with Graph Neural Networks (GNNs), the intricate causal relationships and rules inherent in graph data pose a formidable challenge for the model in accurately capturing authentic data relationships. A proposed mitigating strategy involves the direct integration of rules or relationships corresponding to the graph data into the model. However, within the domain of graph representation learning, the inherent complexity of graph data obstructs the derivation of a comprehensive causal structure that encapsulates universal rules or relationships governing the entire dataset. Instead, only specialized diminutive causal structures, delineating specific causal relationships within constrained subsets of graph data, emerge as discernible. Motivated by empirical insights, it is observed that GNN models exhibit a tendency to converge towards such specialized causal structures during the training process. Consequently, we posit that the introduction of these specific causal structures is advantageous for the training of GNN models. Building upon this proposition, we introduce a novel method that enables GNN models to glean insights from these specialized diminutive causal structures, thereby enhancing overall performance. Our method specifically extracts causal knowledge from the model representation of these diminutive causal structures and incorporates interchange intervention to optimize the learning process. Theoretical analysis serves to corroborate the efficacy of our proposed method. Furthermore, empirical experiments consistently demonstrate significant performance improvements across diverse datasets."}, "https://arxiv.org/abs/2406.08738": {"title": "Volatility Forecasting Using Similarity-based Parameter Correction and Aggregated Shock Information", "link": "https://arxiv.org/abs/2406.08738", "description": "arXiv:2406.08738v1 Announce Type: cross \nAbstract: We develop a procedure for forecasting the volatility of a time series immediately following a news shock. Adapting the similarity-based framework of Lin and Eck (2020), we exploit series that have experienced similar shocks. We aggregate their shock-induced excess volatilities by positing the shocks to be affine functions of exogenous covariates. The volatility shocks are modeled as random effects and estimated as fixed effects. The aggregation of these estimates is done in service of adjusting the $h$-step-ahead GARCH forecast of the time series under study by an additive term. The adjusted and unadjusted forecasts are evaluated using the unobservable but easily-estimated realized volatility (RV). A real-world application is provided, as are simulation results suggesting the conditions and hyperparameters under which our method thrives."}, "https://arxiv.org/abs/2406.09169": {"title": "Empirical Networks are Sparse: Enhancing Multi-Edge Models with Zero-Inflation", "link": "https://arxiv.org/abs/2406.09169", "description": "arXiv:2406.09169v1 Announce Type: cross \nAbstract: Real-world networks are sparse. 
As we show in this article, even when a large number of interactions is observed most node pairs remain disconnected. We demonstrate that classical multi-edge network models, such as the $G(N,p)$, configuration models, and stochastic block models, fail to accurately capture this phenomenon. To mitigate this issue, zero-inflation must be integrated into these traditional models. Through zero-inflation, we incorporate a mechanism that accounts for the excess number of zeroes (disconnected pairs) observed in empirical data. By performing an analysis on all the datasets from the Sociopatterns repository, we illustrate how zero-inflated models more accurately reflect the sparsity and heavy-tailed edge count distributions observed in empirical data. Our findings underscore that failing to account for these ubiquitous properties in real-world networks inadvertently leads to biased models which do not accurately represent complex systems and their dynamics."}, "https://arxiv.org/abs/2406.09172": {"title": "Generative vs", "link": "https://arxiv.org/abs/2406.09172", "description": "arXiv:2406.09172v1 Announce Type: cross \nAbstract: Learning a parametric model from a given dataset indeed enables to capture intrinsic dependencies between random variables via a parametric conditional probability distribution and in turn predict the value of a label variable given observed variables. In this paper, we undertake a comparative analysis of generative and discriminative approaches which differ in their construction and the structure of the underlying inference problem. Our objective is to compare the ability of both approaches to leverage information from various sources in an epistemic uncertainty aware inference via the posterior predictive distribution. We assess the role of a prior distribution, explicit in the generative case and implicit in the discriminative case, leading to a discussion about discriminative models suffering from imbalanced dataset. We next examine the double role played by the observed variables in the generative case, and discuss the compatibility of both approaches with semi-supervised learning. We also provide with practical insights and we examine how the modeling choice impacts the sampling from the posterior predictive distribution. With regard to this, we propose a general sampling scheme enabling supervised learning for both approaches, as well as semi-supervised learning when compatible with the considered modeling approach. Throughout this paper, we illustrate our arguments and conclusions using the example of affine regression, and validate our comparative analysis through classification simulations using neural network based models."}, "https://arxiv.org/abs/2406.09387": {"title": "Oblivious subspace embeddings for compressed Tucker decompositions", "link": "https://arxiv.org/abs/2406.09387", "description": "arXiv:2406.09387v1 Announce Type: cross \nAbstract: Emphasis in the tensor literature on random embeddings (tools for low-distortion dimension reduction) for the canonical polyadic (CP) tensor decomposition has left analogous results for the more expressive Tucker decomposition comparatively lacking. This work establishes general Johnson-Lindenstrauss (JL) type guarantees for the estimation of Tucker decompositions when an oblivious random embedding is applied along each mode. 
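A toy sketch of what an oblivious random embedding applied along each mode looks like in practice (my own illustration with Gaussian sketching matrices and a random low-rank tensor, not the paper's implementation): each mode of the tensor is multiplied by an independent random matrix, compressing it while roughly preserving its norm on average.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = (60, 50, 40), (15, 15, 15)                    # original and sketched mode sizes
X = rng.normal(size=(n[0], 8)) @ rng.normal(size=(8, n[1] * n[2]))
X = X.reshape(n)                                     # a random rank-8 (in mode 1) tensor

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    T = np.moveaxis(T, mode, 0)
    shape = T.shape
    out = M @ T.reshape(shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + shape[1:]), 0, mode)

# oblivious Gaussian (JL-type) embeddings, one per mode
sketches = [rng.normal(size=(m[k], n[k])) / np.sqrt(m[k]) for k in range(3)]
Y = X
for k in range(3):
    Y = mode_product(Y, sketches[k], k)

print("original shape:", X.shape, "-> sketched shape:", Y.shape)
print("norm ratio ||Y|| / ||X|| =", np.linalg.norm(Y) / np.linalg.norm(X))
```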
When these embeddings are drawn from a JL-optimal family, the decomposition can be estimated within $\\varepsilon$ relative error under restrictions on the embedding dimension that are in line with recent CP results. We implement a higher-order orthogonal iteration (HOOI) decomposition algorithm with random embeddings to demonstrate the practical benefits of this approach and its potential to improve the accessibility of otherwise prohibitive tensor analyses. On moderately large face image and fMRI neuroimaging datasets, empirical results show that substantial dimension reduction is possible with minimal increase in reconstruction error relative to traditional HOOI ($\\leq$5% larger error, 50%-60% lower computation time for large models with 50% dimension reduction along each mode). Especially for large tensors, our method outperforms traditional higher-order singular value decomposition (HOSVD) and recently proposed TensorSketch methods."}, "https://arxiv.org/abs/2110.06692": {"title": "A procedure for multiple testing of partial conjunction hypotheses based on a hazard rate inequality", "link": "https://arxiv.org/abs/2110.06692", "description": "arXiv:2110.06692v4 Announce Type: replace \nAbstract: The partial conjunction null hypothesis is tested in order to discover a signal that is present in multiple studies. The standard approach of carrying out a multiple test procedure on the partial conjunction (PC) $p$-values can be extremely conservative. We suggest alleviating this conservativeness, by eliminating many of the conservative PC $p$-values prior to the application of a multiple test procedure. This leads to the following two step procedure: first, select the set with PC $p$-values below a selection threshold; second, within the selected set only, apply a family-wise error rate or false discovery rate controlling procedure on the conditional PC $p$-values. The conditional PC $p$-values are valid if the null p-values are uniform and the combining method is Fisher. The proof of their validity is based on a novel inequality in hazard rate order of partial sums of order statistics which may be of independent interest. We also provide the conditions for which the false discovery rate controlling procedures considered will be below the nominal level. We demonstrate the potential usefulness of our novel method, CoFilter (conditional testing after filtering), for analyzing multiple genome wide association studies of Crohn's disease."}, "https://arxiv.org/abs/2310.07850": {"title": "Conformal prediction with local weights: randomization enables local guarantees", "link": "https://arxiv.org/abs/2310.07850", "description": "arXiv:2310.07850v2 Announce Type: replace \nAbstract: In this work, we consider the problem of building distribution-free prediction intervals with finite-sample conditional coverage guarantees. Conformal prediction (CP) is an increasingly popular framework for building prediction intervals with distribution-free guarantees, but these guarantees only ensure marginal coverage: the probability of coverage is averaged over a random draw of both the training and test data, meaning that there might be substantial undercoverage within certain subpopulations. Instead, ideally, we would want to have local coverage guarantees that hold for each possible value of the test point's features. 
While the impossibility of achieving pointwise local coverage is well established in the literature, many variants of conformal prediction algorithm show favorable local coverage properties empirically. Relaxing the definition of local coverage can allow for a theoretical understanding of this empirical phenomenon. We aim to bridge this gap between theoretical validation and empirical performance by proving achievable and interpretable guarantees for a relaxed notion of local coverage. Building on the localized CP method of Guan (2023) and the weighted CP framework of Tibshirani et al. (2019), we propose a new method, randomly-localized conformal prediction (RLCP), which returns prediction intervals that are not only marginally valid but also achieve a relaxed local coverage guarantee and guarantees under covariate shift. Through a series of simulations and real data experiments, we validate these coverage guarantees of RLCP while comparing it with the other local conformal prediction methods."}, "https://arxiv.org/abs/2311.16529": {"title": "Efficient and Globally Robust Causal Excursion Effect Estimation", "link": "https://arxiv.org/abs/2311.16529", "description": "arXiv:2311.16529v3 Announce Type: replace \nAbstract: Causal excursion effect (CEE) characterizes the effect of an intervention under policies that deviate from the experimental policy. It is widely used to study the effect of time-varying interventions that have the potential to be frequently adaptive, such as those delivered through smartphones. We study the semiparametric efficient estimation of CEE and we derive a semiparametric efficiency bound for CEE with identity or log link functions under working assumptions, in the context of micro-randomized trials. We propose a class of two-stage estimators that achieve the efficiency bound and are robust to misspecified nuisance models. In deriving the asymptotic property of the estimators, we establish a general theory for globally robust Z-estimators with either cross-fitted or non-cross-fitted nuisance parameters. We demonstrate substantial efficiency gain of the proposed estimator compared to existing ones through simulations and a real data application using the Drink Less micro-randomized trial."}, "https://arxiv.org/abs/2110.02318": {"title": "Approximate Message Passing for orthogonally invariant ensembles: Multivariate non-linearities and spectral initialization", "link": "https://arxiv.org/abs/2110.02318", "description": "arXiv:2110.02318v2 Announce Type: replace-cross \nAbstract: We study a class of Approximate Message Passing (AMP) algorithms for symmetric and rectangular spiked random matrix models with orthogonally invariant noise. The AMP iterates have fixed dimension $K \\geq 1$, a multivariate non-linearity is applied in each AMP iteration, and the algorithm is spectrally initialized with $K$ super-critical sample eigenvectors. We derive the forms of the Onsager debiasing coefficients and corresponding AMP state evolution, which depend on the free cumulants of the noise spectral distribution. This extends previous results for such models with $K=1$ and an independent initialization.\n Applying this approach to Bayesian principal components analysis, we introduce a Bayes-OAMP algorithm that uses as its non-linearity the posterior mean conditional on all preceding AMP iterates. 
We describe a practical implementation of this algorithm, where all debiasing and state evolution parameters are estimated from the observed data, and we illustrate the accuracy and stability of this approach in simulations."}, "https://arxiv.org/abs/2305.12883": {"title": "Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors", "link": "https://arxiv.org/abs/2305.12883", "description": "arXiv:2305.12883v3 Announce Type: replace-cross \nAbstract: In recent years, there has been a significant growth in research focusing on minimum $\\ell_2$ norm (ridgeless) interpolation least squares estimators. However, the majority of these analyses have been limited to an unrealistic regression error structure, assuming independent and identically distributed errors with zero mean and common variance. In this paper, we explore prediction risk as well as estimation risk under more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Notably, we establish that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data."}, "https://arxiv.org/abs/2406.09473": {"title": "Multidimensional clustering in judge designs", "link": "https://arxiv.org/abs/2406.09473", "description": "arXiv:2406.09473v1 Announce Type: new \nAbstract: Estimates in judge designs run the risk of being biased due to the many judge identities that are implicitly or explicitly used as instrumental variables. The usual method to analyse judge designs, via a leave-out mean instrument, eliminates this many instrument bias only in case the data are clustered in at most one dimension. What is left out in the mean defines this clustering dimension. How most judge designs cluster their standard errors, however, implies that there are additional clustering dimensions, which makes that a many instrument bias remains. We propose two estimators that are many instrument bias free, also in multidimensional clustered judge designs. The first generalises the one dimensional cluster jackknife instrumental variable estimator, by removing from this estimator the additional bias terms due to the extra dependence in the data. The second models all but one clustering dimensions by fixed effects and we show how these numerous fixed effects can be removed without introducing extra bias. A Monte-Carlo experiment and the revisitation of two judge designs show the empirical relevance of properly accounting for multidimensional clustering in estimation."}, "https://arxiv.org/abs/2406.09521": {"title": "Randomization Inference: Theory and Applications", "link": "https://arxiv.org/abs/2406.09521", "description": "arXiv:2406.09521v1 Announce Type: new \nAbstract: We review approaches to statistical inference based on randomization. Permutation tests are treated as an important special case. Under a certain group invariance property, referred to as the ``randomization hypothesis,'' randomization tests achieve exact control of the Type I error rate in finite samples. Although this unequivocal precision is very appealing, the range of problems that satisfy the randomization hypothesis is somewhat limited. 
We show that randomization tests are often asymptotically, or approximately, valid and efficient in settings that deviate from the conditions required for finite-sample error control. When randomization tests fail to offer even asymptotic Type 1 error control, their asymptotic validity may be restored by constructing an asymptotically pivotal test statistic. Randomization tests can then provide exact error control for tests of highly structured hypotheses with good performance in a wider class of problems. We give a detailed overview of several prominent applications of randomization tests, including two-sample permutation tests, regression, and conformal inference."}, "https://arxiv.org/abs/2406.09597": {"title": "Ridge Regression for Paired Comparisons: A Tractable New Approach, with Application to Premier League Football", "link": "https://arxiv.org/abs/2406.09597", "description": "arXiv:2406.09597v1 Announce Type: new \nAbstract: Paired comparison models, such as Bradley-Terry and Thurstone-Mosteller, are commonly used to estimate relative strengths of pairwise compared items in tournament-style datasets. With predictive performance as primary criterion, we discuss estimation of paired comparison models with a ridge penalty. A new approach is derived which combines empirical Bayes and composite likelihoods without any need to re-fit the model, as a convenient alternative to cross-validation of the ridge tuning parameter. Simulation studies, together with application to 28 seasons of English Premier League football, demonstrate much better predictive accuracy of the new approach relative to ordinary maximum likelihood. While the application of a standard bias-reducing penalty was found to improve appreciably the performance of maximum likelihood, the ridge penalty with tuning as developed here yields greater accuracy still."}, "https://arxiv.org/abs/2406.09625": {"title": "Time Series Forecasting with Many Predictors", "link": "https://arxiv.org/abs/2406.09625", "description": "arXiv:2406.09625v1 Announce Type: new \nAbstract: We propose a novel approach for time series forecasting with many predictors, referred to as the GO-sdPCA, in this paper. The approach employs a variable selection method known as the group orthogonal greedy algorithm and the high-dimensional Akaike information criterion to mitigate the impact of irrelevant predictors. Moreover, a novel technique, called peeling, is used to boost the variable selection procedure so that many factor-relevant predictors can be included in prediction. Finally, the supervised dynamic principal component analysis (sdPCA) method is adopted to account for the dynamic information in factor recovery. In simulation studies, we found that the proposed method adapts well to unknown degrees of sparsity and factor strength, which results in good performance even when the number of relevant predictors is large compared to the sample size. Applying to economic and environmental studies, the proposed method consistently performs well compared to some commonly used benchmarks in one-step-ahead out-sample forecasts."}, "https://arxiv.org/abs/2406.09714": {"title": "Large language model validity via enhanced conformal prediction methods", "link": "https://arxiv.org/abs/2406.09714", "description": "arXiv:2406.09714v1 Announce Type: cross \nAbstract: We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). 
Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM's original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on both synthetic and real-world datasets."}, "https://arxiv.org/abs/2406.10086": {"title": "Discovering influential text using convolutional neural networks", "link": "https://arxiv.org/abs/2406.10086", "description": "arXiv:2406.10086v1 Announce Type: cross \nAbstract: Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model's ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome."}, "https://arxiv.org/abs/1707.07215": {"title": "Sparse Recovery With Multiple Data Streams: A Sequential Adaptive Testing Approach", "link": "https://arxiv.org/abs/1707.07215", "description": "arXiv:1707.07215v3 Announce Type: replace \nAbstract: Multistage design has been used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multi-stage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and missed discovery rate (MDR). 
The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, controls the error rates and leads to great savings in total study costs. Numerical studies confirm the effectiveness of SMART. The SMART procedure is illustrated through the analysis of large-scale A/B tests, high-throughput screening and image analysis."}, "https://arxiv.org/abs/2208.00139": {"title": "Another look at forecast trimming for combinations: robustness, accuracy and diversity", "link": "https://arxiv.org/abs/2208.00139", "description": "arXiv:2208.00139v2 Announce Type: replace \nAbstract: Forecast combination is widely recognized as a preferred strategy over forecast selection due to its ability to mitigate the uncertainty associated with identifying a single \"best\" forecast. Nonetheless, sophisticated combinations are often empirically dominated by simple averaging, which is commonly attributed to the weight estimation error. The issue becomes more problematic when dealing with a forecast pool containing a large number of individual forecasts. In this paper, we propose a new forecast trimming algorithm to identify an optimal subset from the original forecast pool for forecast combination tasks. In contrast to existing approaches, our proposed algorithm simultaneously takes into account the robustness, accuracy and diversity issues of the forecast pool, rather than isolating each one of these issues. We also develop five forecast trimming algorithms as benchmarks, including one trimming-free algorithm and several trimming algorithms that isolate each one of the three key issues. Experimental results show that our algorithm achieves superior forecasting performance in general in terms of both point forecasts and prediction intervals. Nevertheless, we argue that diversity does not always have to be addressed in forecast trimming. Based on the results, we offer some practical guidelines on the selection of forecast trimming algorithms for a target series."}, "https://arxiv.org/abs/2302.08348": {"title": "A robust statistical framework for cyber-vulnerability prioritisation under partial information in threat intelligence", "link": "https://arxiv.org/abs/2302.08348", "description": "arXiv:2302.08348v4 Announce Type: replace \nAbstract: Proactive cyber-risk assessment is gaining momentum due to the wide range of sectors that can benefit from the prevention of cyber-incidents by preserving integrity, confidentiality, and the availability of data. The rising attention to cybersecurity also results from the increasing connectivity of cyber-physical systems, which generates multiple sources of uncertainty about emerging cyber-vulnerabilities. This work introduces a robust statistical framework for quantitative and qualitative reasoning under uncertainty about cyber-vulnerabilities and their prioritisation. Specifically, we take advantage of mid-quantile regression to deal with ordinal risk assessments, and we compare it to current alternatives for cyber-risk ranking and graded responses. For this purpose, we identify a novel accuracy measure suited for rank invariance under partial knowledge of the whole set of existing vulnerabilities. The model is tested on both simulated and real data from selected databases that support the evaluation, exploitation, or response to cyber-vulnerabilities in realistic contexts. 
Such datasets allow us to compare multiple models and accuracy measures, discussing the implications of partial knowledge about cyber-vulnerabilities on threat intelligence and decision-making in operational scenarios."}, "https://arxiv.org/abs/2303.03009": {"title": "Identification of Ex ante Returns Using Elicited Choice Probabilities: an Application to Preferences for Public-sector Jobs", "link": "https://arxiv.org/abs/2303.03009", "description": "arXiv:2303.03009v2 Announce Type: replace \nAbstract: Ex ante returns, the net value that agents perceive before they take an investment decision, are understood as the main drivers of individual decisions. Hence, their distribution in a population is an important tool for counterfactual analysis and policy evaluation. This paper studies the identification of the population distribution of ex ante returns using stated choice experiments, in the context of binary investment decisions. The environment is characterised by uncertainty about future outcomes, with some uncertainty being resolved over time. In this context, each individual holds a probability distribution over different levels of returns. The paper provides novel, nonparametric identification results for the population distribution of returns, accounting for uncertainty. It complements these with a nonparametric/semiparametric estimation methodology, which is new to the stated-preference literature. Finally, it uses these results to study the preferences of high-ability students in Cote d'Ivoire for public-sector jobs and how the competition for talent affects the expansion of the private sector."}, "https://arxiv.org/abs/2305.01201": {"title": "Estimating Input Coefficients for Regional Input-Output Tables Using Deep Learning with Mixup", "link": "https://arxiv.org/abs/2305.01201", "description": "arXiv:2305.01201v3 Announce Type: replace \nAbstract: An input-output table is an important source of data for analyzing the economic situation of a region. The input-output table for each region (regional input-output table) in Japan is not always publicly available, so it is often necessary to estimate the table. In particular, various methods have been developed for estimating input coefficients, which are an important part of the input-output table. Currently, non-survey methods are often used to estimate input coefficients because they require less data and computation, but these methods have some problems, such as discarding information and requiring additional data for estimation.\n In this study, the input coefficients are estimated by approximating the generation process with an artificial neural network (ANN) to mitigate the problems of the non-survey methods and to estimate the input coefficients with higher precision. To avoid over-fitting due to the small amount of data used, a data augmentation technique called mixup is introduced to increase the data size by generating virtual regions through region composition and scaling.\n By comparing the estimates of the input coefficients with those of Japan as a whole, it is shown that the accuracy of our method is higher and more stable than that of conventional non-survey methods. 
In addition, the estimated input coefficients for the three cities in Japan are generally close to the published values for each city."}, "https://arxiv.org/abs/2310.02600": {"title": "Neural Bayes Estimators for Irregular Spatial Data using Graph Neural Networks", "link": "https://arxiv.org/abs/2310.02600", "description": "arXiv:2310.02600v2 Announce Type: replace \nAbstract: Neural Bayes estimators are neural networks that approximate Bayes estimators in a fast and likelihood-free manner. Although they are appealing to use with spatial models, where estimation is often a computational bottleneck, neural Bayes estimators in spatial applications have, to date, been restricted to data collected over a regular grid. These estimators are also currently dependent on a prescribed set of spatial locations, which means that the neural network needs to be re-trained for new data sets; this renders them impractical in many applications and impedes their widespread adoption. In this work, we employ graph neural networks to tackle the important problem of parameter point estimation from data collected over arbitrary spatial locations. In addition to extending neural Bayes estimation to irregular spatial data, our architecture leads to substantial computational benefits, since the estimator can be used with any configuration or number of locations and independent replicates, thus amortising the cost of training for a given spatial model. We also facilitate fast uncertainty quantification by training an accompanying neural Bayes estimator that approximates a set of marginal posterior quantiles. We illustrate our methodology on Gaussian and max-stable processes. Finally, we showcase our methodology on a data set of global sea-surface temperature, where we estimate the parameters of a Gaussian process model in 2161 spatial regions, each containing thousands of irregularly-spaced data points, in just a few minutes with a single graphics processing unit."}, "https://arxiv.org/abs/2312.09825": {"title": "Extreme value methods for estimating rare events in Utopia", "link": "https://arxiv.org/abs/2312.09825", "description": "arXiv:2312.09825v2 Announce Type: replace \nAbstract: To capture the extremal behaviour of complex environmental phenomena in practice, flexible techniques for modelling tail behaviour are required. In this paper, we introduce a variety of such methods, which were used by the Lancopula Utopiversity team to tackle the EVA (2023) Conference Data Challenge. This data challenge was split into four challenges, labelled C1-C4. Challenges C1 and C2 comprise univariate problems, where the goal is to estimate extreme quantiles for a non-stationary time series exhibiting several complex features. For these, we propose a flexible modelling technique, based on generalised additive models, with diagnostics indicating generally good performance for the observed data. Challenges C3 and C4 concern multivariate problems where the focus is on estimating joint extremal probabilities. For challenge C3, we propose an extension of available models in the multivariate literature and use this framework to estimate extreme probabilities in the presence of non-stationary dependence. 
Finally, for challenge C4, which concerns a 50 dimensional random vector, we employ a clustering technique to achieve dimension reduction and use a conditional modelling approach to estimate extremal probabilities across independent groups of variables."}, "https://arxiv.org/abs/2312.12361": {"title": "Improved multifidelity Monte Carlo estimators based on normalizing flows and dimensionality reduction techniques", "link": "https://arxiv.org/abs/2312.12361", "description": "arXiv:2312.12361v2 Announce Type: replace \nAbstract: We study the problem of multifidelity uncertainty propagation for computationally expensive models. In particular, we consider the general setting where the high-fidelity and low-fidelity models have a dissimilar parameterization both in terms of number of random inputs and their probability distributions, which can be either known in closed form or provided through samples. We derive novel multifidelity Monte Carlo estimators which rely on a shared subspace between the high-fidelity and low-fidelity models where the parameters follow the same probability distribution, i.e., a standard Gaussian. We build the shared space employing normalizing flows to map different probability distributions into a common one, together with linear and nonlinear dimensionality reduction techniques, active subspaces and autoencoders, respectively, which capture the subspaces where the models vary the most. We then compose the existing low-fidelity model with these transformations and construct modified models with an increased correlation with the high-fidelity model, which therefore yield multifidelity Monte Carlo estimators with reduced variance. A series of numerical experiments illustrate the properties and advantages of our approaches."}, "https://arxiv.org/abs/2105.04981": {"title": "Quantifying patient and neighborhood risks for stillbirth and preterm birth in Philadelphia with a Bayesian spatial model", "link": "https://arxiv.org/abs/2105.04981", "description": "arXiv:2105.04981v5 Announce Type: replace-cross \nAbstract: Stillbirth and preterm birth are major public health challenges. Using a Bayesian spatial model, we quantified patient-specific and neighborhood risks of stillbirth and preterm birth in the city of Philadelphia. We linked birth data from electronic health records at Penn Medicine hospitals from 2010 to 2017 with census-tract-level data from the United States Census Bureau. We found that both patient-level characteristics (e.g. self-identified race/ethnicity) and neighborhood-level characteristics (e.g. violent crime) were significantly associated with patients' risk of stillbirth or preterm birth. Our neighborhood analysis found that higher-risk census tracts had 2.68 times the average risk of stillbirth and 2.01 times the average risk of preterm birth compared to lower-risk census tracts. Higher neighborhood rates of women in poverty or on public assistance were significantly associated with greater neighborhood risk for these outcomes, whereas higher neighborhood rates of college-educated women or women in the labor force were significantly associated with lower risk. Several of these neighborhood associations were missed by the patient-level analysis. These results suggest that neighborhood-level analyses of adverse pregnancy outcomes can reveal nuanced relationships and, thus, should be considered by epidemiologists. 
Our findings can potentially guide place-based public health interventions to reduce stillbirth and preterm birth rates."}, "https://arxiv.org/abs/2406.10308": {"title": "Quick and Simple Kernel Differential Equation Regression Estimators for Data with Sparse Design", "link": "https://arxiv.org/abs/2406.10308", "description": "arXiv:2406.10308v1 Announce Type: new \nAbstract: Local polynomial regression of order at least one often performs poorly in regions of sparse data. Local constant regression is exceptional in this regard, though it is the least accurate method in general, especially at the boundaries of the data. Incorporating information from differential equations which may approximately or exactly hold is one way of extending the sparse design capacity of local constant regression while reducing bias and variance. A nonparametric regression method that exploits first order differential equations is introduced in this paper and applied to noisy mouse tumour growth data. Asymptotic biases and variances of kernel estimators using Taylor polynomials with different degrees are discussed. Model comparison is performed for different estimators through simulation studies under various scenarios which simulate exponential-type growth."}, "https://arxiv.org/abs/2406.10360": {"title": "Causal inference for N-of-1 trials", "link": "https://arxiv.org/abs/2406.10360", "description": "arXiv:2406.10360v1 Announce Type: new \nAbstract: The aim of personalized medicine is to tailor treatment decisions to individuals' characteristics. N-of-1 trials are within-person crossover trials that hold the promise of targeting individual-specific effects. While the idea behind N-of-1 trials might seem simple, analyzing and interpreting N-of-1 trials is not straightforward. In particular, there exists confusion about the role of randomization in this design, the (implicit) target estimand, and the need for covariate adjustment. Here we ground N-of-1 trials in a formal causal inference framework and formalize intuitive claims from the N-of-1 trial literature. We focus on causal inference from a single N-of-1 trial and define a conditional average treatment effect (CATE) that represents a target in this setting, which we call the U-CATE. We discuss the assumptions sufficient for identification and estimation of the U-CATE under different causal models in which the treatment schedule is assigned at baseline. A simple mean difference is shown to be an unbiased, asymptotically normal estimator of the U-CATE in simple settings, such as when participants have stable conditions (e.g., chronic pain) and interventions have effects limited in time (no carryover). We also consider settings where carryover effects, trends over time, time-varying common causes of the outcome, and outcome-outcome effects are present. In these more complex settings, we show that a time-varying g-formula identifies the U-CATE under explicit assumptions. Finally, we analyze data from N-of-1 trials about acne symptoms. Using this example, we show how different assumptions about the underlying data generating process can lead to different analytical strategies in N-of-1 trials."}, "https://arxiv.org/abs/2406.10473": {"title": "Design-based variance estimation of the H\\'ajek effect estimator in stratified and clustered experiments", "link": "https://arxiv.org/abs/2406.10473", "description": "arXiv:2406.10473v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are used to evaluate treatment effects. 
When individuals are grouped together, clustered RCTs are conducted. Stratification is recommended to reduce imbalance of baseline covariates between treatment and control. In practice, this can lead to comparisons between clusters of very different sizes. As a result, direct adjustment estimators that average differences of means within the strata may be inconsistent. We study differences of inverse probability weighted means of a treatment and a control group -- H\\'ajek effect estimators -- under two common forms of stratification: small strata that increase in number; or larger strata with growing numbers of clusters in each. Under either scenario, mild conditions give consistency and asymptotic Normality. We propose a variance estimator applicable to designs with any number of strata and strata of any size. We describe a special use of the variance estimator that improves small sample performance of Wald-type confidence intervals. The H\\'ajek effect estimator lends itself to covariance adjustment, and our variance estimator remains applicable. Simulations and real-world applications in children's nutrition and education confirm favorable operating characteristics, demonstrating advantages of the H\\'ajek effect estimator beyond its simplicity and ease of use."}, "https://arxiv.org/abs/2406.10499": {"title": "Functional Clustering for Longitudinal Associations between County-Level Social Determinants of Health and Stroke Mortality in the US", "link": "https://arxiv.org/abs/2406.10499", "description": "arXiv:2406.10499v1 Announce Type: new \nAbstract: Understanding longitudinally changing associations between Social determinants of health (SDOH) and stroke mortality is crucial for timely stroke management. Previous studies have revealed a significant regional disparity in the SDOH -- stroke mortality associations. However, they do not develop data-driven methods based on these longitudinal associations for regional division in stroke control. To fill this gap, we propose a novel clustering method for SDOH -- stroke mortality associations in the US counties. To enhance interpretability and statistical efficiency of the clustering outcomes, we introduce a new class of smoothness-sparsity pursued penalties for simultaneous clustering and variable selection in the longitudinal associations. As a result, we can identify important SDOH that contribute to longitudinal changes in the stroke mortality, facilitating clustering of US counties into several regions based on how these SDOH relate to stroke mortality. The effectiveness of our proposed method is demonstrated through extensive numerical studies. By applying our method to a county-level SDOH and stroke mortality longitudinal data, we identify 18 important SDOH for stroke mortality and divide the US counties into two clusters based on these selected SDOH. Our findings unveil complex regional heterogeneity in the longitudinal associations between SDOH and stroke mortality, providing valuable insights in region-specific SDOH adjustments for mitigating stroke mortality."}, "https://arxiv.org/abs/2406.10554": {"title": "Causal Inference with Outcomes Truncated by Death and Missing Not at Random", "link": "https://arxiv.org/abs/2406.10554", "description": "arXiv:2406.10554v1 Announce Type: new \nAbstract: In clinical trials, principal stratification analysis is commonly employed to address the issue of truncation by death, where a subject dies before the outcome can be measured. 
However, in practice, many survivor outcomes may remain uncollected or be missing not at random, posing a challenge to standard principal stratification analyses. In this paper, we explore the identification, estimation, and bounds of the average treatment effect within a subpopulation of individuals who would potentially survive under both treatment and control conditions. We show that the causal parameter of interest can be identified by introducing a proxy variable that affects the outcome only through the principal strata, while requiring that the treatment variable does not directly affect the missingness mechanism. Subsequently, we propose an approach for estimating causal parameters and derive nonparametric bounds in cases where identification assumptions are violated. We illustrate the performance of the proposed method through simulation studies and a real dataset obtained from a Human Immunodeficiency Virus (HIV) study."}, "https://arxiv.org/abs/2406.10612": {"title": "Producing treatment hierarchies in network meta-analysis using probabilistic models and treatment-choice criteria", "link": "https://arxiv.org/abs/2406.10612", "description": "arXiv:2406.10612v1 Announce Type: new \nAbstract: A key output of network meta-analysis (NMA) is the relative ranking of the treatments; nevertheless, it has attracted a lot of criticism. This is mainly due to the fact that ranking is an influential output and prone to over-interpretation even when relative effects imply small differences between treatments. To date, common ranking methods rely on metrics that lack a straightforward interpretation, while it is still unclear how to measure their uncertainty. We introduce a novel framework for estimating treatment hierarchies in NMA. First, we formulate a mathematical expression that defines a treatment choice criterion (TCC) based on clinically important values. This TCC is applied to the study treatment effects to generate paired data indicating treatment preferences or ties. Then, we synthesize the paired data across studies using an extension of the so-called \"Bradley-Terry\" model. We assign to each treatment a latent variable interpreted as the treatment \"ability\" and we estimate the ability parameters within a regression model. Higher ability estimates correspond to higher positions in the final ranking. We further extend our model to adjust for covariates that may affect treatment selection. We illustrate the proposed approach and compare it with alternatives in two datasets: a network comparing 18 antidepressants for major depression and a network comparing 6 antihypertensives for the incidence of diabetes. Our approach provides a robust and interpretable treatment hierarchy which accounts for clinically important values and is presented alongside uncertainty measures. Overall, the proposed framework offers a novel approach for ranking in NMA based on concrete criteria and guards against over-interpretation of unimportant differences between treatments."}, "https://arxiv.org/abs/2406.10733": {"title": "A Laplace transform-based test for the equality of positive semidefinite matrix distributions", "link": "https://arxiv.org/abs/2406.10733", "description": "arXiv:2406.10733v1 Announce Type: new \nAbstract: In this paper, we present a novel test for determining equality in distribution of matrix distributions. Our approach is based on the integral squared difference of the empirical Laplace transforms with respect to the noncentral Wishart measure. 
We conduct an extensive power study to assess the performance of the test and determine the optimal choice of parameters. Furthermore, we demonstrate the applicability of the test on financial and non-life insurance data, illustrating its effectiveness in practical scenarios."}, "https://arxiv.org/abs/2406.10792": {"title": "Data-Adaptive Identification of Subpopulations Vulnerable to Chemical Exposures using Stochastic Interventions", "link": "https://arxiv.org/abs/2406.10792", "description": "arXiv:2406.10792v1 Announce Type: new \nAbstract: In environmental epidemiology, identifying subpopulations vulnerable to chemical exposures and those who may benefit differently from exposure-reducing policies is essential. For instance, sex-specific vulnerabilities, age, and pregnancy are critical factors for policymakers when setting regulatory guidelines. However, current semi-parametric methods for heterogeneous treatment effects are often limited to binary exposures and function as black boxes, lacking clear, interpretable rules for subpopulation-specific policy interventions. This study introduces a novel method using cross-validated targeted minimum loss-based estimation (TMLE) paired with a data-adaptive target parameter strategy to identify subpopulations with the most significant differential impact from simulated policy interventions that reduce exposure. Our approach is assumption-lean, allowing for the integration of machine learning while still yielding valid confidence intervals. We demonstrate the robustness of our methodology through simulations and application to NHANES data. Our analysis of NHANES data for persistent organic pollutants on leukocyte telomere length (LTL) identified age as the maximum effect modifier. Specifically, we found that exposure to 3,3',4,4',5-pentachlorobiphenyl (pcnb) consistently had a differential impact on LTL, with a one standard deviation reduction in exposure leading to a more pronounced increase in LTL among younger populations compared to older ones. We offer our method as an open-source software package, \\texttt{EffectXshift}, enabling researchers to investigate the effect modification of continuous exposures. The \\texttt{EffectXshift} package provides clear and interpretable results, informing targeted public health interventions and policy decisions."}, "https://arxiv.org/abs/2406.10837": {"title": "EM Estimation of Conditional Matrix Variate $t$ Distributions", "link": "https://arxiv.org/abs/2406.10837", "description": "arXiv:2406.10837v1 Announce Type: new \nAbstract: Conditional matrix variate student $t$ distribution was introduced by Battulga (2024a). In this paper, we propose a new version of the conditional matrix variate student $t$ distribution. The paper provides EM algorithms, which estimate parameters of the conditional matrix variate student $t$ distributions, including general cases and special cases with Minnesota prior."}, "https://arxiv.org/abs/2406.10962": {"title": "SynthTree: Co-supervised Local Model Synthesis for Explainable Prediction", "link": "https://arxiv.org/abs/2406.10962", "description": "arXiv:2406.10962v1 Announce Type: new \nAbstract: Explainable machine learning (XML) has emerged as a major challenge in artificial intelligence (AI). Although black-box models such as Deep Neural Networks and Gradient Boosting often exhibit exceptional predictive accuracy, their lack of interpretability is a notable drawback, particularly in domains requiring transparency and trust. 
This paper tackles this core AI problem by proposing a novel method to enhance explainability with minimal accuracy loss, using a Mixture of Linear Models (MLM) estimated under the co-supervision of black-box models. We have developed novel methods for estimating MLM by leveraging AI techniques. Specifically, we explore two approaches for partitioning the input space: agglomerative clustering and decision trees. The agglomerative clustering approach provides greater flexibility in model construction, while the decision tree approach further enhances explainability, yielding a decision tree model with linear or logistic regression models at its leaf nodes. Comparative analyses with widely-used and state-of-the-art predictive models demonstrate the effectiveness of our proposed methods. Experimental results show that statistical models can significantly enhance the explainability of AI, thereby broadening their potential for real-world applications. Our findings highlight the critical role that statistical methodologies can play in advancing explainable AI."}, "https://arxiv.org/abs/2406.11184": {"title": "HEDE: Heritability estimation in high dimensions by Ensembling Debiased Estimators", "link": "https://arxiv.org/abs/2406.11184", "description": "arXiv:2406.11184v1 Announce Type: new \nAbstract: Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results. Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank."}, "https://arxiv.org/abs/2406.11216": {"title": "Bayesian Hierarchical Modelling of Noisy Gamma Processes: Model Formulation, Identifiability, Model Fitting, and Extensions to Unit-to-Unit Variability", "link": "https://arxiv.org/abs/2406.11216", "description": "arXiv:2406.11216v1 Announce Type: new \nAbstract: The gamma process is a natural model for monotonic degradation processes. In practice, it is desirable to extend the single gamma process to incorporate measurement error and to construct models for the degradation of several nominally identical units. 
In this paper, we show how these extensions are easily facilitated through the Bayesian hierarchical modelling framework. Following the precepts of the Bayesian statistical workflow, we show the principled construction of a noisy gamma process model. We also reparameterise the gamma process to simplify the specification of priors and make it obvious how the single gamma process model can be extended to include unit-to-unit variability or covariates. We first fit the noisy gamma process model to a single simulated degradation trace. In doing so, we find an identifiability problem between the volatility of the gamma process and the measurement error when there are only a few noisy degradation observations. However, this lack of identifiability can be resolved by including extra information in the analysis through a stronger prior or extra data that informs one of the non-identifiable parameters, or by borrowing information from multiple units. We then explore extensions of the model to account for unit-to-unit variability and demonstrate them using a crack-propagation data set with added measurement error. Lastly, we perform model selection in a fully Bayesian framework by using cross-validation to approximate the expected log probability density of a new observation. We also show how failure time distributions with uncertainty intervals can be calculated for new units or units that are currently under test but are yet to fail."}, "https://arxiv.org/abs/2406.11306": {"title": "Bayesian Variable Selection via Hierarchical Gaussian Process Model in Computer Experiments", "link": "https://arxiv.org/abs/2406.11306", "description": "arXiv:2406.11306v1 Announce Type: new \nAbstract: Identifying the active factors that have significant impacts on the output of a complex system is an important but challenging variable selection problem in computer experiments. In this paper, a Bayesian hierarchical Gaussian process model is developed, and latent indicator variables are embedded into this setting to label the important variables. Parameter estimation and variable selection can be carried out simultaneously in a full Bayesian framework through an efficient Markov chain Monte Carlo (MCMC) method, the Metropolis-within-Gibbs sampler. The superior performance of the proposed method compared with related competitors is demonstrated through the analysis of simulated examples and a practical application."}, "https://arxiv.org/abs/2406.11399": {"title": "Spillover Detection for Donor Selection in Synthetic Control Models", "link": "https://arxiv.org/abs/2406.11399", "description": "arXiv:2406.11399v1 Announce Type: new \nAbstract: Synthetic control (SC) models are widely used to estimate causal effects in settings with observational time-series data. To identify the causal effect on a target unit, SC requires the existence of correlated units that are not impacted by the intervention. Given one of these potential donor units, how can we decide whether it is in fact a valid donor - that is, one not subject to spillover effects from the intervention? Such a decision typically requires appealing to strong a priori domain knowledge specifying the units, which becomes infeasible in situations with large pools of potential donors. In this paper, we introduce a practical, theoretically-grounded donor selection procedure, aiming to weaken this domain knowledge requirement. 
Our main result is a theorem that yields the assumptions required to identify donor values at post-intervention time points using only pre-intervention data. We show how this theorem - and the assumptions underpinning it - can be turned into a practical method for detecting potential spillover effects and excluding invalid donors when constructing SCs. Importantly, we employ sensitivity analysis to formally bound the bias in our SC causal estimate in situations where an excluded donor was indeed valid, or where a selected donor was invalid. Using ideas from the proximal causal inference and instrumental variables literature, we show that the excluded donors can nevertheless be leveraged to further debias causal effect estimates. Finally, we illustrate our donor selection procedure on both simulated and real-world datasets."}, "https://arxiv.org/abs/2406.11467": {"title": "Resilience of international oil trade networks under extreme event shock-recovery simulations", "link": "https://arxiv.org/abs/2406.11467", "description": "arXiv:2406.11467v1 Announce Type: new \nAbstract: With the frequent occurrence of black swan events, the global energy security situation has become increasingly complex and severe. Assessing the resilience of the international oil trade network (iOTN) is crucial for evaluating its ability to withstand extreme shocks and recover thereafter, ensuring energy security. We overcome the limitations of discrete historical data by developing a simulation model for extreme event shock-recovery in the iOTNs. We introduce a network efficiency indicator to measure oil resource allocation efficiency and evaluate network performance. We then construct a resilience index to explore the resilience of the iOTNs along the dimensions of resistance and recoverability. Our findings indicate that extreme events can lead to sharp declines in the performance of the iOTNs, especially when economies with significant trading positions and relations suffer shocks. The upward trend in recoverability and resilience reflects the self-organizing nature of the iOTNs, demonstrating its capacity for optimizing its own structure and functionality. Unlike traditional energy security research based solely on discrete historical data or resistance indicators, our model evaluates resilience from multiple dimensions, offering insights for global energy governance systems while providing diverse perspectives for various economies to mitigate risks and uphold energy security."}, "https://arxiv.org/abs/2406.11573": {"title": "Bayesian Outcome Weighted Learning", "link": "https://arxiv.org/abs/2406.11573", "description": "arXiv:2406.11573v1 Announce Type: new \nAbstract: One of the primary goals of statistical precision medicine is to learn optimal individualized treatment rules (ITRs). The classification-based, or machine learning-based, approach to estimating optimal ITRs was first introduced in outcome-weighted learning (OWL). OWL recasts the optimal ITR learning problem into a weighted classification problem, which can be solved using machine learning methods, e.g., support vector machines. In this paper, we introduce a Bayesian formulation of OWL. Starting from the OWL objective function, we generate a pseudo-likelihood which can be expressed as a scale mixture of normal distributions. A Gibbs sampling algorithm is developed to sample the posterior distribution of the parameters. 
In addition to providing a strategy for learning an optimal ITR, Bayesian OWL provides a natural, probabilistic approach to estimate uncertainty in ITR treatment recommendations themselves. We demonstrate the performance of our method through several simulation studies."}, "https://arxiv.org/abs/2406.11584": {"title": "The analysis of paired comparison data in the presence of cyclicality and intransitivity", "link": "https://arxiv.org/abs/2406.11584", "description": "arXiv:2406.11584v1 Announce Type: new \nAbstract: A principled approach to cyclicality and intransitivity in cardinal paired comparison data is developed within the framework of graphical linear models. Fundamental to our developments is a detailed understanding and study of the parameter space which accommodates cyclicality and intransitivity. In particular, the relationships between the reduced, completely transitive model, the full, not necessarily transitive model, and all manner of intermediate models are explored for both complete and incomplete paired comparison graphs. It is shown that identifying cyclicality and intransitivity reduces to a model selection problem and a new method for model selection employing geometrical insights, unique to the problem at hand, is proposed. The large sample properties of the estimators as well as guarantees on the selected model are provided. It is thus shown that in large samples all cyclicalities and intransitivities can be identified. The method is exemplified using simulations and the analysis of an illustrative example."}, "https://arxiv.org/abs/2406.11585": {"title": "Bayesian regression discontinuity design with unknown cutoff", "link": "https://arxiv.org/abs/2406.11585", "description": "arXiv:2406.11585v1 Announce Type: new \nAbstract: Regression discontinuity design (RDD) is a quasi-experimental approach used to estimate the causal effects of an intervention assigned based on a cutoff criterion. RDD exploits the idea that close to the cutoff units below and above are similar; hence, they can be meaningfully compared. Consequently, the causal effect can be estimated only locally at the cutoff point. This makes the cutoff point an essential element of RDD. However, especially in medical applications, the exact cutoff location may not always be disclosed to the researcher, and even when it is, the actual location may deviate from the official one. As we illustrate on the application of RDD to the HIV treatment eligibility data, estimating the causal effect at an incorrect cutoff point leads to meaningless results. Moreover, since the cutoff criterion often acts as a guideline rather than as a strict rule, the location of the cutoff may be unclear from the data. The method we present can be applied both as an estimation and validation tool in RDD. We use a Bayesian approach to incorporate prior knowledge and uncertainty about the cutoff location in the causal effect estimation. At the same time, our Bayesian model LoTTA is fitted globally to the whole data, whereas RDD is a local, boundary point estimation problem. 
In this work, we address a natural question that arises: how can Bayesian inference be made more local, so that it renders a meaningful and powerful estimate of the treatment effect?"}, "https://arxiv.org/abs/2406.11806": {"title": "A conservation law for posterior predictive variance", "link": "https://arxiv.org/abs/2406.11806", "description": "arXiv:2406.11806v1 Announce Type: new \nAbstract: We use the law of total variance to generate multiple expressions for the posterior predictive variance in Bayesian hierarchical models. These expressions are sums of terms involving conditional expectations and conditional variances. Since the posterior predictive variance is fixed given the hierarchical model, it represents a constant quantity that is conserved over the various expressions for it. The terms in the expressions can be assessed in absolute or relative terms to understand the main contributors to the length of prediction intervals. Also, sometimes these terms can be interpreted in the context of the hierarchical model. We show several examples, both closed-form and computational, to illustrate the uses of this approach in model assessment."}, "https://arxiv.org/abs/2406.10366": {"title": "Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework", "link": "https://arxiv.org/abs/2406.10366", "description": "arXiv:2406.10366v1 Announce Type: cross \nAbstract: Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal on examples of commonly used evaluation methodologies - involving cross-validation, clustering evaluation, and LLM benchmarking - that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how the estimands framework can help uncover underlying issues, their causes, and potential solutions. Ultimately, we believe this framework can improve the validity of evaluations through better-aligned inference, and help decision-makers and model users interpret reported results more effectively."}, "https://arxiv.org/abs/2406.10464": {"title": "The data augmentation algorithm", "link": "https://arxiv.org/abs/2406.10464", "description": "arXiv:2406.10464v1 Announce Type: cross \nAbstract: Data augmentation (DA) algorithms are popular Markov chain Monte Carlo (MCMC) algorithms often used for sampling from intractable probability distributions. This review article comprehensively surveys DA MCMC algorithms, highlighting their theoretical foundations, methodological implementations, and diverse applications in frequentist and Bayesian statistics. The article discusses tools for studying the convergence properties of DA algorithms. Furthermore, it describes various strategies for accelerating the convergence of DA algorithms and different extensions of DA algorithms, and it outlines promising directions for future research. 
This paper aims to serve as a resource for researchers and practitioners seeking to leverage data augmentation techniques in MCMC algorithms by providing key insights and synthesizing recent developments."}, "https://arxiv.org/abs/2406.10481": {"title": "DCDILP: a distributed learning method for large-scale causal structure learning", "link": "https://arxiv.org/abs/2406.10481", "description": "arXiv:2406.10481v1 Announce Type: cross \nAbstract: This paper presents a novel approach to causal discovery through a divide-and-conquer framework. By decomposing the problem into smaller subproblems defined on Markov blankets, the proposed DCDILP method first explores in parallel the local causal graphs of these subproblems. However, this local discovery phase encounters systematic challenges due to the presence of hidden confounders (variables within each Markov blanket may be influenced by external variables). Moreover, aggregating these local causal graphs into a consistent global graph defines a large combinatorial optimization problem. DCDILP addresses these challenges by: i) restricting the local subgraphs to causal links directly related to the central variable of the Markov blanket; ii) formulating the reconciliation of local causal graphs as an integer linear programming problem. The merits of the approach, in terms of both causal discovery accuracy and scalability with the size of the problem, are showcased by experiments and comparisons with the state of the art."}, "https://arxiv.org/abs/2406.10738": {"title": "Adaptive Experimentation When You Can't Experiment", "link": "https://arxiv.org/abs/2406.10738", "description": "arXiv:2406.10738v1 Announce Type: cross \nAbstract: This paper introduces the \\emph{confounded pure exploration transductive linear bandit} (\\texttt{CPET-LB}) problem. As a motivating example, online services often cannot directly assign users to specific control or treatment experiences, for business or practical reasons. In these settings, naively comparing treatment and control groups that may result from self-selection can lead to biased estimates of underlying treatment effects. Instead, online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment. Our methodology provides online services with an adaptive experimental design approach for learning the best-performing treatment for such \\textit{encouragement designs}. We consider a more general underlying model captured by a linear structural equation and formulate pure exploration linear bandits in this setting. Though pure exploration has been extensively studied in standard adaptive experimental design settings, we believe this is the first work considering a setting where noise is confounded. Elimination-style algorithms using experimental design methods in combination with a novel finite-time confidence interval on an instrumental variable style estimator are presented with sample complexity upper bounds nearly matching a minimax lower bound. 
Finally, experiments are conducted that demonstrate the efficacy of our approach."}, "https://arxiv.org/abs/2406.11043": {"title": "Statistical Considerations for Evaluating Treatment Effect under Various Non-proportional Hazard Scenarios", "link": "https://arxiv.org/abs/2406.11043", "description": "arXiv:2406.11043v1 Announce Type: cross \nAbstract: We conducted a systematic comparison of statistical methods used for the analysis of time-to-event outcomes under various proportional and nonproportional hazard (NPH) scenarios. Our study used data from recently published oncology trials to compare the Log-rank test, still by far the most widely used option, against some available alternatives, including the MaxCombo test, the Restricted Mean Survival Time Difference (dRMST) test, the Generalized Gamma Model (GGM) and the Generalized F Model (GFM). Power, type I error rate, and time-dependent bias with respect to the RMST difference, survival probability difference, and median survival time were used to evaluate and compare the performance of these methods. In addition to the real data, we simulated three hypothetical scenarios with crossing hazards chosen so that the early and late effects 'cancel out' and used them to evaluate the ability of the aforementioned methods to detect time-specific and overall treatment effects. We implemented novel metrics for assessing the time-dependent bias in treatment effect estimates to provide a more comprehensive evaluation in NPH scenarios. Recommendations under each NPH scenario are provided by examining the type I error rate, power, and time-dependent bias associated with each statistical approach."}, "https://arxiv.org/abs/2406.11046": {"title": "Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data", "link": "https://arxiv.org/abs/2406.11046", "description": "arXiv:2406.11046v1 Announce Type: cross \nAbstract: Advancements in Artificial Intelligence, particularly with ChatGPT, have significantly impacted software development. Utilizing novel data from GitHub Innovation Graph, we hypothesize that ChatGPT enhances software production efficiency. Utilizing natural experiments where some governments banned ChatGPT, we employ Difference-in-Differences (DID), Synthetic Control (SC), and Synthetic Difference-in-Differences (SDID) methods to estimate its effects. Our findings indicate a significant positive impact on the number of git pushes, repositories, and unique developers per 100,000 people, particularly for high-level, general purpose, and shell scripting languages. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns."}, "https://arxiv.org/abs/2406.11490": {"title": "Interventional Imbalanced Multi-Modal Representation Learning via $\\beta$-Generalization Front-Door Criterion", "link": "https://arxiv.org/abs/2406.11490", "description": "arXiv:2406.11490v1 Announce Type: cross \nAbstract: Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. 
Benchmark methods offer a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we aim to capture the true causality between the discriminative knowledge of the predominant modality and the predictive label while considering the auxiliary modality. Thus, we introduce the $\\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method."}, "https://arxiv.org/abs/2406.11501": {"title": "Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality", "link": "https://arxiv.org/abs/2406.11501", "description": "arXiv:2406.11501v1 Announce Type: cross \nAbstract: Leveraging the development of the structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionist causal analysis encounters theoretical adaptation limitations. Accordingly, the cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. The twin network is a subtle attempt, establishing a symbiotic relationship, to bridge the gap between graphical modeling and the introduction of counterfactuals, albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical applications, we show that introducing the proposed teleporter theory directly yields the conditional independence between counterfactual variables and real-world variables from the cross-world SCM, without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory in practical applications. Adhering to the proposed theory, we build a plug-and-play module whose effectiveness is substantiated by experiments on benchmarks."}, "https://arxiv.org/abs/2406.11761": {"title": "Joint Linked Component Analysis for Multiview Data", "link": "https://arxiv.org/abs/2406.11761", "description": "arXiv:2406.11761v1 Announce Type: cross \nAbstract: In this work, we propose the joint linked component analysis (joint\\_LCA) for multiview data.
Unlike classic methods which extract the shared components in a sequential manner, the objective of joint\\_LCA is to identify the view-specific loading matrices and the rank of the common latent subspace simultaneously. We formulate a matrix decomposition model where a joint structure and an individual structure are present in each data view, which enables us to arrive at a clean svd representation for the cross covariance between any pair of data views. An objective function with a novel penalty term is then proposed to achieve simultaneous estimation and rank selection. In addition, a refitting procedure is employed as a remedy to reduce the shrinkage bias caused by the penalization."}, "https://arxiv.org/abs/1908.04822": {"title": "R-miss-tastic: a unified platform for missing values methods and workflows", "link": "https://arxiv.org/abs/1908.04822", "description": "arXiv:1908.04822v4 Announce Type: replace \nAbstract: Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This led to the development of various methods, formalizations, and tools. For practitioners, it remains nevertheless challenging to decide which method is most suited for their problem, partially due to a lack of systematic covering of this topic in statistics or data science curricula.\n To help address this challenge, we have launched the \"R-miss-tastic\" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), \"R-miss-tastic\" covers the development of standardized analysis workflows. Indeed, we have developed several pipelines in R and Python to allow for hands-on illustration of and recommendations on missing values handling in various statistical tasks such as matrix completion, estimation and prediction, while ensuring reproducibility of the analyses. Finally, the platform is dedicated to users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and also teachers who are looking for didactic materials (notebooks, video, slides)."}, "https://arxiv.org/abs/2109.11647": {"title": "Treatment Effects in Market Equilibrium", "link": "https://arxiv.org/abs/2109.11647", "description": "arXiv:2109.11647v3 Announce Type: replace \nAbstract: Policy-relevant treatment effect estimation in a marketplace setting requires taking into account both the direct benefit of the treatment and any spillovers induced by changes to the market equilibrium. The standard way to address these challenges is to evaluate interventions via cluster-randomized experiments, where each cluster corresponds to an isolated market. This approach, however, cannot be used when we only have access to a single market (or a small number of markets). Here, we show how to identify and estimate policy-relevant treatment effects using a unit-level randomized trial run within a single large market. 
A standard Bernoulli-randomized trial allows consistent estimation of direct effects, and of treatment heterogeneity measures that can be used for welfare-improving targeting. Estimating spillovers - as well as providing confidence intervals for the direct effect - requires estimates of price elasticities, which we provide using an augmented experimental design. Our results rely on all spillovers being mediated via the (observed) prices of a finite number of traded goods, and the market power of any single unit decaying as the market gets large. We illustrate our results using a simulation calibrated to a conditional cash transfer experiment in the Philippines."}, "https://arxiv.org/abs/2110.11771": {"title": "Additive Density-on-Scalar Regression in Bayes Hilbert Spaces with an Application to Gender Economics", "link": "https://arxiv.org/abs/2110.11771", "description": "arXiv:2110.11771v3 Announce Type: replace \nAbstract: Motivated by research on gender identity norms and the distribution of the woman's share in a couple's total labor income, we consider functional additive regression models for probability density functions as responses with scalar covariates. To preserve nonnegativity and integration to one under vector space operations, we formulate the model for densities in a Bayes Hilbert space, which allows to not only consider continuous densities, but also, e.g., discrete or mixed densities. Mixed ones occur in our application, as the woman's income share is a continuous variable having discrete point masses at zero and one for single-earner couples. Estimation is based on a gradient boosting algorithm, allowing for potentially numerous flexible covariate effects and model selection. We develop properties of Bayes Hilbert spaces related to subcompositional coherence, yielding (odds-ratio) interpretation of effect functions and simplified estimation for mixed densities via an orthogonal decomposition. Applying our approach to data from the German Socio-Economic Panel Study (SOEP) shows a more symmetric distribution in East German than in West German couples after reunification and a smaller child penalty comparing couples with and without minor children. These West-East differences become smaller, but are persistent over time."}, "https://arxiv.org/abs/2201.01357": {"title": "Estimating Heterogeneous Causal Effects of High-Dimensional Treatments: Application to Conjoint Analysis", "link": "https://arxiv.org/abs/2201.01357", "description": "arXiv:2201.01357v4 Announce Type: replace \nAbstract: Estimation of heterogeneous treatment effects is an active area of research. Most of the existing methods, however, focus on estimating the conditional average treatment effects of a single, binary treatment given a set of pre-treatment covariates. In this paper, we propose a method to estimate the heterogeneous causal effects of high-dimensional treatments, which poses unique challenges in terms of estimation and interpretation. The proposed approach finds maximally heterogeneous groups and uses a Bayesian mixture of regularized logistic regressions to identify groups of units who exhibit similar patterns of treatment effects. By directly modeling group membership with covariates, the proposed methodology allows one to explore the unit characteristics that are associated with different patterns of treatment effects. 
Our motivating application is conjoint analysis, which is a popular type of survey experiment in social science and marketing research and is based on a high-dimensional factorial design. We apply the proposed methodology to the conjoint data, where survey respondents are asked to select one of two immigrant profiles with randomly selected attributes. We find that a group of respondents with a relatively high degree of prejudice appears to discriminate against immigrants from non-European countries like Iraq. An open-source software package is available for implementing the proposed methodology."}, "https://arxiv.org/abs/2210.15253": {"title": "An extended generalized Pareto regression model for count data", "link": "https://arxiv.org/abs/2210.15253", "description": "arXiv:2210.15253v2 Announce Type: replace \nAbstract: The statistical modeling of discrete extremes has received less attention than their continuous counterparts in the Extreme Value Theory (EVT) literature. One approach to the transition from continuous to discrete extremes is the modeling of threshold exceedances of integer random variables by the discrete version of the generalized Pareto distribution. However, the optimal choice of thresholds defining exceedances remains a problematic issue. Moreover, in a regression framework, the treatment of the majority of non-extreme data below the selected threshold is either ignored or separated from the extremes. To tackle these issues, we expand on the concept of employing a smooth transition between the bulk and the upper tail of the distribution. In the case of zero inflation, we also develop models with an additional parameter. To incorporate possible predictors, we relate the parameters to additive smoothed predictors via an appropriate link, as in the generalized additive model (GAM) framework. A penalized maximum likelihood estimation procedure is implemented. We illustrate our modeling proposal with a real dataset of avalanche activity in the French Alps. With the advantage of bypassing the threshold selection step, our results indicate that the proposed models are more flexible and robust than competing models, such as the negative binomial distribution"}, "https://arxiv.org/abs/2211.16552": {"title": "Bayesian inference for aggregated Hawkes processes", "link": "https://arxiv.org/abs/2211.16552", "description": "arXiv:2211.16552v3 Announce Type: replace \nAbstract: The Hawkes process, a self-exciting point process, has a wide range of applications in modeling earthquakes, social networks and stock markets. The established estimation process requires that researchers have access to the exact time stamps and spatial information. However, available data are often rounded or aggregated. We develop a Bayesian estimation procedure for the parameters of a Hawkes process based on aggregated data. Our approach is developed for temporal, spatio-temporal, and mutually exciting Hawkes processes where data are available over discrete time periods and regions. We show theoretically that the parameters of the Hawkes process are identifiable from aggregated data under general specifications. We demonstrate the method on simulated temporal and spatio-temporal data with various model specifications in the presence of one or more interacting processes, and under varying coarseness of data aggregation. 
Finally, we examine the internal and cross-excitation effects of airstrikes and insurgent violence events from February 2007 to June 2008, with some data aggregated by day."}, "https://arxiv.org/abs/2301.06718": {"title": "Sparse and Integrative Principal Component Analysis for Multiview Data", "link": "https://arxiv.org/abs/2301.06718", "description": "arXiv:2301.06718v2 Announce Type: replace \nAbstract: We consider dimension reduction of multiview data, which are emerging in scientific studies. Formulating multiview data as multivariate data with block structures corresponding to the different views of the data, we estimate top eigenvectors from multiview data that have two-fold sparsity: elementwise sparsity and blockwise sparsity. We propose a Fantope-based optimization criterion with multiple penalties to enforce the desired sparsity patterns, and a denoising step is employed to handle the potential presence of heteroskedastic noise across different data views. An alternating direction method of multipliers (ADMM) algorithm is used for optimization. We derive the $\\ell_2$ convergence of the estimated top eigenvectors and establish their sparsity and support recovery properties. Numerical studies are used to illustrate the proposed method."}, "https://arxiv.org/abs/2302.04354": {"title": "Consider or Choose? The Role and Power of Consideration Sets", "link": "https://arxiv.org/abs/2302.04354", "description": "arXiv:2302.04354v3 Announce Type: replace \nAbstract: Consideration sets play a crucial role in discrete choice modeling, where customers are commonly assumed to go through a two-stage decision-making process. Specifically, customers are assumed to form consideration sets in the first stage and then use a second-stage choice mechanism to pick the product with the highest utility from the consideration sets. Recent studies mostly aim to propose more powerful choice mechanisms based on advanced non-parametric models to improve prediction accuracy. In contrast, this paper takes a step back from exploring more complex second-stage choice mechanisms and instead focuses on how effectively we can model customer choice relying only on the first-stage consideration set formation. To this end, we study a class of nonparametric choice models that is specified only by a distribution over consideration sets and has a bounded rationality interpretation. We denote it as the consideration set model. Intriguingly, we show that this class of choice models can be characterized by the axiom of symmetric demand cannibalization, which enables complete statistical identification. We further consider the model's downstream assortment planning as an application. We first present an exact description of the optimal assortment, proving that it is revenue-ordered based on the blocks defined by the consideration sets. Despite this compelling structure, we establish that the assortment optimization problem under this model is NP-hard even to approximate. This result shows that accounting for consideration sets in the model inevitably results in inapproximability in assortment planning, even though the consideration set model uses the simplest possible uniform second-stage choice mechanism.
Finally, using a real-world dataset, we show the tremendous power of the first-stage consideration sets when modeling customers' decision-making processes."}, "https://arxiv.org/abs/2303.08568": {"title": "Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models", "link": "https://arxiv.org/abs/2303.08568", "description": "arXiv:2303.08568v2 Announce Type: replace \nAbstract: We present a method to generate contingency tables that follow loglinear models with prescribed marginal probabilities and dependence structures. We make use of (loglinear) Poisson regression, where the dependence structures, described using odds ratios, are implemented using an offset term. We apply this methodology to carry out simulation studies in the context of population size estimation using dual system and triple system estimators, popular in official statistics. These estimators use contingency tables that summarise the counts of elements enumerated or captured within lists that are linked. The simulation is used to investigate these estimators in the situation that the model assumptions are fulfilled, and the situation that the model assumptions are violated."}, "https://arxiv.org/abs/2309.07893": {"title": "Choosing a Proxy Metric from Past Experiments", "link": "https://arxiv.org/abs/2309.07893", "description": "arXiv:2309.07893v2 Announce Type: replace \nAbstract: In many randomized experiments, the treatment effect of the long-term metric (i.e. the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric -- so they can be used to effectively guide decision-making in the near-term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not apriori fixed; rather it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines."}, "https://arxiv.org/abs/2309.10083": {"title": "Invariant Probabilistic Prediction", "link": "https://arxiv.org/abs/2309.10083", "description": "arXiv:2309.10083v2 Announce Type: replace \nAbstract: In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. 
While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method to yield invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure on simulated as well as on single-cell data."}, "https://arxiv.org/abs/2401.00245": {"title": "Alternative Approaches for Estimating Highest-Density Regions", "link": "https://arxiv.org/abs/2401.00245", "description": "arXiv:2401.00245v2 Announce Type: replace \nAbstract: Among the variety of statistical intervals, highest-density regions (HDRs) stand out for their ability to effectively summarize a distribution or sample, unveiling its distinctive and salient features. An HDR represents the minimum size set that satisfies a certain probability coverage, and current methods for their computation require knowledge or estimation of the underlying probability distribution or density $f$. In this work, we illustrate a broader framework for computing HDRs, which generalizes the classical density quantile method introduced in the seminal paper of Hyndman (1996). The framework is based on neighbourhood measures, i.e., measures that preserve the order induced in the sample by $f$, and include the density $f$ as a special case. We explore a number of suitable distance-based measures, such as the $k$-nearest neighborhood distance, and some probabilistic variants based on copula models. An extensive comparison is provided, showing the advantages of the copula-based strategy, especially in those scenarios that exhibit complex structures (e.g., multimodalities or particular dependencies). Finally, we discuss the practical implications of our findings for estimating HDRs in real-world applications."}, "https://arxiv.org/abs/2401.16286": {"title": "Robust Functional Data Analysis for Stochastic Evolution Equations in Infinite Dimensions", "link": "https://arxiv.org/abs/2401.16286", "description": "arXiv:2401.16286v2 Announce Type: replace \nAbstract: We develop an asymptotic theory for the jump robust measurement of covariations in the context of stochastic evolution equation in infinite dimensions. Namely, we identify scaling limits for realized covariations of solution processes with the quadratic covariation of the latent random process that drives the evolution equation which is assumed to be a Hilbert space-valued semimartingale. 
We discuss applications to dynamically consistent and outlier-robust dimension reduction in the spirit of functional principal components and the estimation of infinite-dimensional stochastic volatility models."}, "https://arxiv.org/abs/2208.02807": {"title": "Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport", "link": "https://arxiv.org/abs/2208.02807", "description": "arXiv:2208.02807v3 Announce Type: replace-cross \nAbstract: We study the problem of data-driven background estimation, arising in the search of physics signals predicted by the Standard Model at the Large Hadron Collider. Our work is motivated by the search for the production of pairs of Higgs bosons decaying into four bottom quarks. A number of other physical processes, known as background, also share the same final state. The data arising in this problem is therefore a mixture of unlabeled background and signal events, and the primary aim of the analysis is to determine whether the proportion of unlabeled signal events is nonzero. A challenging but necessary first step is to estimate the distribution of background events. Past work in this area has determined regions of the space of collider events where signal is unlikely to appear, and where the background distribution is therefore identifiable. The background distribution can be estimated in these regions, and extrapolated into the region of primary interest using transfer learning with a multivariate classifier. We build upon this existing approach in two ways. First, we revisit this method by developing a customized residual neural network which is tailored to the structure and symmetries of collider data. Second, we develop a new method for background estimation, based on the optimal transport problem, which relies on modeling assumptions distinct from earlier work. These two methods can serve as cross-checks for each other in particle physics analyses, due to the complementarity of their underlying assumptions. We compare their performance on simulated double Higgs boson data."}, "https://arxiv.org/abs/2312.12477": {"title": "When Graph Neural Network Meets Causality: Opportunities, Methodologies and An Outlook", "link": "https://arxiv.org/abs/2312.12477", "description": "arXiv:2312.12477v2 Announce Type: replace-cross \nAbstract: Graph Neural Networks (GNNs) have emerged as powerful representation learning tools for capturing complex dependencies within diverse graph-structured data. Despite their success in a wide range of graph mining tasks, GNNs have raised serious concerns regarding their trustworthiness, including susceptibility to distribution shift, biases towards certain populations, and lack of explainability. Recently, integrating causal learning techniques into GNNs has sparked numerous ground-breaking studies since many GNN trustworthiness issues can be alleviated by capturing the underlying data causality rather than superficial correlations. In this survey, we comprehensively review recent research efforts on Causality-Inspired GNNs (CIGNNs). Specifically, we first employ causal tools to analyze the primary trustworthiness risks of existing GNNs, underscoring the necessity for GNNs to comprehend the causal mechanisms within graph data. Moreover, we introduce a taxonomy of CIGNNs based on the type of causal learning capability they are equipped with, i.e., causal reasoning and causal representation learning. 
Besides, we systematically introduce typical methods within each category and discuss how they mitigate trustworthiness risks. Finally, we summarize useful resources and discuss several future directions, hoping to shed light on new research opportunities in this emerging field. The representative papers, along with open-source data and codes, are available in https://github.com/usail-hkust/Causality-Inspired-GNNs."}, "https://arxiv.org/abs/2406.11892": {"title": "Simultaneous comparisons of the variances of k treatments with that of a control: a Levene-Dunnett type procedure", "link": "https://arxiv.org/abs/2406.11892", "description": "arXiv:2406.11892v1 Announce Type: new \nAbstract: There are some global tests for heterogeneity of variance in k-sample one-way layouts, but few consider pairwise comparisons between treatment levels. For experimental designs with a control, comparisons of the variances between the treatment levels and the control are of interest - in analogy to the location parameter with the Dunnett (1955) procedure. Such a many-to-one approach for variances is proposed using the Levene transformation, a kind of residuals. Its properties are characterized with simulation studies and corresponding data examples are evaluated with R code."}, "https://arxiv.org/abs/2406.11940": {"title": "Model-Based Inference and Experimental Design for Interference Using Partial Network Data", "link": "https://arxiv.org/abs/2406.11940", "description": "arXiv:2406.11940v1 Announce Type: new \nAbstract: The stable unit treatment value assumption states that the outcome of an individual is not affected by the treatment statuses of others, however in many real world applications, treatments can have an effect on many others beyond the immediately treated. Interference can generically be thought of as mediated through some network structure. In many empirically relevant situations however, complete network data (required to adjust for these spillover effects) are too costly or logistically infeasible to collect. Partially or indirectly observed network data (e.g., subsamples, aggregated relational data (ARD), egocentric sampling, or respondent-driven sampling) reduce the logistical and financial burden of collecting network data, but the statistical properties of treatment effect adjustments from these design strategies are only beginning to be explored. In this paper, we present a framework for the estimation and inference of treatment effect adjustments using partial network data through the lens of structural causal models. We also illustrate procedures to assign treatments using only partial network data, with the goal of either minimizing estimator variance or optimally seeding. We derive single network asymptotic results applicable to a variety of choices for an underlying graph model. We validate our approach using simulated experiments on observed graphs with applications to information diffusion in India and Malawi."}, "https://arxiv.org/abs/2406.11942": {"title": "Clustering functional data with measurement errors: a simulation-based approach", "link": "https://arxiv.org/abs/2406.11942", "description": "arXiv:2406.11942v1 Announce Type: new \nAbstract: Clustering analysis of functional data, which comprises observations that evolve continuously over time or space, has gained increasing attention across various scientific disciplines. 
Practical applications often involve functional data that are contaminated with measurement errors arising from imprecise instruments, sampling errors, or other sources. These errors can significantly distort the inherent data structure, resulting in erroneous clustering outcomes. In this paper, we propose a simulation-based approach designed to mitigate the impact of measurement errors. Our proposed method estimates the distribution of functional measurement errors through repeated measurements. Subsequently, the clustering algorithm is applied to simulated data generated from the conditional distribution of the unobserved true functional data given the observed contaminated functional data, accounting for the adjustments made to rectify measurement errors. We show through simulations that the proposed method has better numerical performance than naive methods that neglect such errors. Our proposed method was applied to a childhood obesity study, giving more reliable clustering results."}, "https://arxiv.org/abs/2406.12028": {"title": "Mixed-resolution hybrid modeling in an element-based framework", "link": "https://arxiv.org/abs/2406.12028", "description": "arXiv:2406.12028v1 Announce Type: new \nAbstract: Computational modeling of a complex system is limited by the parts of the system with the least information. While detailed models and high-resolution data may be available for parts of a system, abstract relationships are often necessary to connect the parts and model the full system. For example, modeling food security necessitates the interaction of climate and socioeconomic factors, with models of system components existing at different levels of information in terms of granularity and resolution. Connecting these models is an ongoing challenge. In this work, we demonstrate methodology to quantize and integrate information from data and detailed component models alongside abstract relationships in a hybrid element-based modeling and simulation framework. In a case study of modeling food security, we apply quantization methods to generate (1) time-series model input from climate data and (2) a discrete representation of a component model (a statistical emulator of crop yield), which we then incorporate as an update rule in the hybrid element-based model, bridging differences in model granularity and resolution. Simulation of the hybrid element-based model recapitulated the trends of the original emulator, supporting the use of this methodology to integrate data and information from component models to simulate complex systems."}, "https://arxiv.org/abs/2406.12171": {"title": "Model Selection for Causal Modeling in Missing Exposure Problems", "link": "https://arxiv.org/abs/2406.12171", "description": "arXiv:2406.12171v1 Announce Type: new \nAbstract: In causal inference, properly selecting the propensity score (PS) model is a popular topic and has been widely investigated in observational studies. In addition, there is a large literature concerning the missing data problem. However, there are very few studies investigating the model selection issue for causal inference when the exposure is missing at random (MAR). In this paper, we discuss how to select both imputation and PS models, which can result in the smallest RMSE of the estimated causal effect. Then, we provide a new criterion, called the ``rank score'', for evaluating the overall performance of both models.
The simulation studies show that the full imputation plus the outcome-related PS models lead to the smallest RMSE and the rank score can also pick the best models. An application study is conducted to study the causal effect of CVD on the mortality of COVID-19 patients."}, "https://arxiv.org/abs/2406.12237": {"title": "Lasso regularization for mixture experiments with noise variables", "link": "https://arxiv.org/abs/2406.12237", "description": "arXiv:2406.12237v1 Announce Type: new \nAbstract: We apply classical and Bayesian lasso regularizations to a family of models with the presence of mixture and process variables. We analyse the performance of these estimates with respect to ordinary least squares estimators by a simulation study and a real data application. Our results demonstrate the superior performance of Bayesian lasso, particularly via coordinate ascent variational inference, in terms of variable selection accuracy and response optimization."}, "https://arxiv.org/abs/2406.12780": {"title": "Bayesian Consistency for Long Memory Processes: A Semiparametric Perspective", "link": "https://arxiv.org/abs/2406.12780", "description": "arXiv:2406.12780v1 Announce Type: new \nAbstract: In this work, we will investigate a Bayesian approach to estimating the parameters of long memory models. Long memory, characterized by the phenomenon of hyperbolic autocorrelation decay in time series, has garnered significant attention. This is because, in many situations, the assumption of short memory, such as the Markovianity assumption, can be deemed too restrictive. Applications for long memory models can be readily found in fields such as astronomy, finance, and environmental sciences. However, current parametric and semiparametric approaches to modeling long memory present challenges, particularly in the estimation process.\n In this study, we will introduce various methods applied to this problem from a Bayesian perspective, along with a novel semiparametric approach for deriving the posterior distribution of the long memory parameter. Additionally, we will establish the asymptotic properties of the model. An advantage of this approach is that it allows to implement state-of-the-art efficient algorithms for nonparametric Bayesian models."}, "https://arxiv.org/abs/2406.12817": {"title": "Intrinsic Modeling of Shape-Constrained Functional Data, With Applications to Growth Curves and Activity Profiles", "link": "https://arxiv.org/abs/2406.12817", "description": "arXiv:2406.12817v1 Announce Type: new \nAbstract: Shape-constrained functional data encompass a wide array of application fields especially in the life sciences, such as activity profiling, growth curves, healthcare and mortality. Most existing methods for general functional data analysis often ignore that such data are subject to inherent shape constraints, while some specialized techniques rely on strict distributional assumptions. We propose an approach for modeling such data that harnesses the intrinsic geometry of functional trajectories by decomposing them into size and shape components. We focus on the two most prevalent shape constraints, positivity and monotonicity, and develop individual-level estimators for the size and shape components. Furthermore, we demonstrate the applicability of our approach by conducting subsequent analyses involving Fr\\'{e}chet mean and Fr\\'{e}chet regression and establish rates of convergence for the empirical estimators. 
Illustrative examples include simulations and data applications for activity profiles for Mediterranean fruit flies during their entire lifespan and for data from the Z\\\"{u}rich longitudinal growth study."}, "https://arxiv.org/abs/2406.12212": {"title": "Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data", "link": "https://arxiv.org/abs/2406.12212", "description": "arXiv:2406.12212v1 Announce Type: cross \nAbstract: Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors from hundreds of thousands of single nucleotide polymorphisms (SNPs) for obesity. We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously select variables and modeling for ultra-high dimensional genetic data, focusing on high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness."}, "https://arxiv.org/abs/2406.12474": {"title": "Exploring Intra and Inter-language Consistency in Embeddings with ICA", "link": "https://arxiv.org/abs/2406.12474", "description": "arXiv:2406.12474v1 Announce Type: cross \nAbstract: Word embeddings represent words as multidimensional real vectors, facilitating data analysis and processing, but are often challenging to interpret. Independent Component Analysis (ICA) creates clearer semantic axes by identifying independent key features. Previous research has shown ICA's potential to reveal universal semantic axes across languages. However, it lacked verification of the consistency of independent components within and across languages. We investigated the consistency of semantic axes in two ways: both within a single language and across multiple languages. We first probed into intra-language consistency, focusing on the reproducibility of axes by performing ICA multiple times and clustering the outcomes. Then, we statistically examined inter-language consistency by verifying those axes' correspondences using statistical tests. We newly applied statistical methods to establish a robust framework that ensures the reliability and universality of semantic axes."}, "https://arxiv.org/abs/2104.00262": {"title": "Statistical significance revisited", "link": "https://arxiv.org/abs/2104.00262", "description": "arXiv:2104.00262v3 Announce Type: replace \nAbstract: Statistical significance measures the reliability of a result obtained from a random experiment. We investigate the number of repetitions needed for a statistical result to have a certain significance. 
In the first step, we consider binomially distributed variables in the example of medication testing with fixed placebo efficacy, asking how many experiments are needed in order to achieve a significance of 95%. In the next step, we take the probability distribution of the placebo efficacy into account, which to the best of our knowledge has not been done so far. Depending on the specifics, we show that in order to obtain identical significance, it may be necessary to perform twice as many experiments as in a setting where the placebo distribution is neglected. We proceed by considering more general probability distributions and close with comments on some erroneous assumptions on probability distributions which lead, for instance, to a trivial explanation of the fat tail."}, "https://arxiv.org/abs/2109.08109": {"title": "Standard Errors for Calibrated Parameters", "link": "https://arxiv.org/abs/2109.08109", "description": "arXiv:2109.08109v3 Announce Type: replace \nAbstract: Calibration, the practice of choosing the parameters of a structural model to match certain empirical moments, can be viewed as minimum distance estimation. Existing standard error formulas for such estimators require a consistent estimate of the correlation structure of the empirical moments, which is often unavailable in practice. Instead, the variances of the individual empirical moments are usually readily estimable. Using only these variances, we derive conservative standard errors and confidence intervals for the structural parameters that are valid even under the worst-case correlation structure. In the over-identified case, we show that the moment weighting scheme that minimizes the worst-case estimator variance amounts to a moment selection problem with a simple solution. Finally, we develop tests of over-identifying or parameter restrictions. We apply our methods empirically to a model of menu cost pricing for multi-product firms and to a heterogeneous agent New Keynesian model."}, "https://arxiv.org/abs/2306.15642": {"title": "Neural Bayes estimators for censored inference with peaks-over-threshold models", "link": "https://arxiv.org/abs/2306.15642", "description": "arXiv:2306.15642v4 Announce Type: replace \nAbstract: Making inference with spatial extremal dependence models can be computationally burdensome since they involve intractable and/or censored likelihoods. Building on recent advances in likelihood-free inference with neural Bayes estimators, that is, neural networks that approximate Bayes estimators, we develop highly efficient estimators for censored peaks-over-threshold models that use data augmentation techniques to encode censoring information in the neural network input. Our new method provides a paradigm shift that challenges traditional censored likelihood-based inference methods for spatial extremal dependence models. Our simulation studies highlight significant gains in both computational and statistical efficiency, relative to competing likelihood-based approaches, when applying our novel estimators to make inference with popular extremal dependence models, such as max-stable, $r$-Pareto, and random scale mixture process models. We also illustrate that it is possible to train a single neural Bayes estimator for a general censoring level, precluding the need to retrain the network when the censoring level is changed.
We illustrate the efficacy of our estimators by making fast inference on hundreds-of-thousands of high-dimensional spatial extremal dependence models to assess extreme particulate matter 2.5 microns or less in diameter (${\\rm PM}_{2.5}$) concentration over the whole of Saudi Arabia."}, "https://arxiv.org/abs/2310.08063": {"title": "Inference for Nonlinear Endogenous Treatment Effects Accounting for High-Dimensional Covariate Complexity", "link": "https://arxiv.org/abs/2310.08063", "description": "arXiv:2310.08063v3 Announce Type: replace \nAbstract: Nonlinearity and endogeneity are prevalent challenges in causal analysis using observational data. This paper proposes an inference procedure for a nonlinear and endogenous marginal effect function, defined as the derivative of the nonparametric treatment function, with a primary focus on an additive model that includes high-dimensional covariates. Using the control function approach for identification, we implement a regularized nonparametric estimation to obtain an initial estimator of the model. Such an initial estimator suffers from two biases: the bias in estimating the control function and the regularization bias for the high-dimensional outcome model. Our key innovation is to devise the double bias correction procedure that corrects these two biases simultaneously. Building on this debiased estimator, we further provide a confidence band of the marginal effect function. Simulations and an empirical study of air pollution and migration demonstrate the validity of our procedures."}, "https://arxiv.org/abs/2311.00553": {"title": "Polynomial Chaos Surrogate Construction for Random Fields with Parametric Uncertainty", "link": "https://arxiv.org/abs/2311.00553", "description": "arXiv:2311.00553v2 Announce Type: replace \nAbstract: Engineering and applied science rely on computational experiments to rigorously study physical systems. The mathematical models used to probe these systems are highly complex, and sampling-intensive studies often require prohibitively many simulations for acceptable accuracy. Surrogate models provide a means of circumventing the high computational expense of sampling such complex models. In particular, polynomial chaos expansions (PCEs) have been successfully used for uncertainty quantification studies of deterministic models where the dominant source of uncertainty is parametric. We discuss an extension to conventional PCE surrogate modeling to enable surrogate construction for stochastic computational models that have intrinsic noise in addition to parametric uncertainty. We develop a PCE surrogate on a joint space of intrinsic and parametric uncertainty, enabled by Rosenblatt transformations, and then extend the construction to random field data via the Karhunen-Loeve expansion. We then take advantage of closed-form solutions for computing PCE Sobol indices to perform a global sensitivity analysis of the model which quantifies the intrinsic noise contribution to the overall model output variance. Additionally, the resulting joint PCE is generative in the sense that it allows generating random realizations at any input parameter setting that are statistically approximately equivalent to realizations from the underlying stochastic model. 
The method is demonstrated on a chemical catalysis example model."}, "https://arxiv.org/abs/2311.18501": {"title": "Perturbation-based Effect Measures for Compositional Data", "link": "https://arxiv.org/abs/2311.18501", "description": "arXiv:2311.18501v2 Announce Type: replace \nAbstract: Existing effect measures for compositional features are inadequate for many modern applications for two reasons. First, modern datasets with compositional covariates, for example in microbiome research, display traits such as high-dimensionality and sparsity that can be poorly modelled with traditional parametric approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike many existing effect measures for compositional features, we do not define our effects based on a parametric model or a transformation of the data. Instead, we use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These effects naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated and semi-synthetic data and demonstrate advantages over existing techniques on data from New York schools and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees."}, "https://arxiv.org/abs/2312.05682": {"title": "Valid Cross-Covariance Models via Multivariate Mixtures with an Application to the Confluent Hypergeometric Class", "link": "https://arxiv.org/abs/2312.05682", "description": "arXiv:2312.05682v2 Announce Type: replace \nAbstract: Modeling of multivariate random fields through Gaussian processes calls for the construction of valid cross-covariance functions describing the dependence between any two component processes at different spatial locations. The required validity conditions often present challenges that lead to complicated restrictions on the parameter space. The purpose of this work is to present techniques using multivariate mixtures for establishing validity that are simultaneously simplified and comprehensive. This is accomplished using results on conditionally negative semidefinite matrices and the Schur product theorem. For illustration, we use the recently-introduced Confluent Hypergeometric (CH) class of covariance functions. In addition, we establish the spectral density of the Confluent Hypergeometric covariance and use this to construct valid multivariate models as well as propose new cross-covariances. Our approach leads to valid multivariate cross-covariance models that inherit the desired marginal properties of the Confluent Hypergeometric model and outperform the multivariate Mat\\'ern model in out-of-sample prediction under slowly-decaying correlation of the underlying multivariate random field. We also establish properties of the new models, including results on equivalence of Gaussian measures. 
We demonstrate the new model's use for multivariate oceanography dataset consisting of temperature, salinity and oxygen, as measured by autonomous floats in the Southern Ocean."}, "https://arxiv.org/abs/2209.01679": {"title": "Orthogonal and Linear Regressions and Pencils of Confocal Quadrics", "link": "https://arxiv.org/abs/2209.01679", "description": "arXiv:2209.01679v3 Announce Type: replace-cross \nAbstract: This paper enhances and develops bridges between statistics, mechanics, and geometry. For a given system of points in $\\mathbb R^k$ representing a sample of full rank, we construct an explicit pencil of confocal quadrics with the following properties: (i) All the hyperplanes for which the hyperplanar moments of inertia for the given system of points are equal, are tangent to the same quadrics from the pencil of quadrics. As an application, we develop regularization procedures for the orthogonal least square method, analogues of lasso and ridge methods from linear regression. (ii) For any given point $P$ among all the hyperplanes that contain it, the best fit is the tangent hyperplane to the quadric from the confocal pencil corresponding to the maximal Jacobi coordinate of the point $P$; the worst fit among the hyperplanes containing $P$ is the tangent hyperplane to the ellipsoid from the confocal pencil that contains $P$. The confocal pencil of quadrics provides a universal tool to solve the restricted principal component analysis restricted at any given point. Both results (i) and (ii) can be seen as generalizations of the classical result of Pearson on orthogonal regression. They have natural and important applications in the statistics of the errors-in-variables models (EIV). For the classical linear regressions we provide a geometric characterization of hyperplanes of least squares in a given direction among all hyperplanes which contain a given point. The obtained results have applications in restricted regressions, both ordinary and orthogonal ones. For the latter, a new formula for test statistic is derived. The developed methods and results are illustrated in natural statistics examples."}, "https://arxiv.org/abs/2302.07658": {"title": "SUrvival Control Chart EStimation Software in R: the success package", "link": "https://arxiv.org/abs/2302.07658", "description": "arXiv:2302.07658v2 Announce Type: replace-cross \nAbstract: Monitoring the quality of statistical processes has been of great importance, mostly in industrial applications. Control charts are widely used for this purpose, but often lack the possibility to monitor survival outcomes. Recently, inspecting survival outcomes has become of interest, especially in medical settings where outcomes often depend on risk factors of patients. For this reason many new survival control charts have been devised and existing ones have been extended to incorporate survival outcomes. The R package success allows users to construct risk-adjusted control charts for survival data. Functions to determine control chart parameters are included, which can be used even without expert knowledge on the subject of control charts. 
The package allows to create static as well as interactive charts, which are built using ggplot2 (Wickham 2016) and plotly (Sievert 2020)."}, "https://arxiv.org/abs/2303.05263": {"title": "Fast post-process Bayesian inference with Variational Sparse Bayesian Quadrature", "link": "https://arxiv.org/abs/2303.05263", "description": "arXiv:2303.05263v2 Announce Type: replace-cross \nAbstract: In applied Bayesian inference scenarios, users may have access to a large number of pre-existing model evaluations, for example from maximum-a-posteriori (MAP) optimization runs. However, traditional approximate inference techniques make little to no use of this available information. We propose the framework of post-process Bayesian inference as a means to obtain a quick posterior approximation from existing target density evaluations, with no further model calls. Within this framework, we introduce Variational Sparse Bayesian Quadrature (VSBQ), a method for post-process approximate inference for models with black-box and potentially noisy likelihoods. VSBQ reuses existing target density evaluations to build a sparse Gaussian process (GP) surrogate model of the log posterior density function. Subsequently, we leverage sparse-GP Bayesian quadrature combined with variational inference to achieve fast approximate posterior inference over the surrogate. We validate our method on challenging synthetic scenarios and real-world applications from computational neuroscience. The experiments show that VSBQ builds high-quality posterior approximations by post-processing existing optimization traces, with no further model evaluations."}, "https://arxiv.org/abs/2406.13052": {"title": "Distance Covariance, Independence, and Pairwise Differences", "link": "https://arxiv.org/abs/2406.13052", "description": "arXiv:2406.13052v1 Announce Type: new \nAbstract: (To appear in The American Statistician.) Distance covariance (Sz\\'ekely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an independent copy of $(X,Y)$. This raises natural questions about independence of variables like $X-X'$ and $Y-Y'$, about the connection between Cov$(|X-X'|,|Y-Y'|)$ and the covariance between doubly centered distances, and about necessary and sufficient conditions for independence. We show some basic results and present a new and nontechnical counterexample to a common fallacy, which provides more insight. We also show some motivating examples involving bivariate distributions and contingency tables, which can be used as didactic material for introducing distance correlation."}, "https://arxiv.org/abs/2406.13111": {"title": "Nonparametric Motion Control in Functional Connectivity Studies in Children with Autism Spectrum Disorder", "link": "https://arxiv.org/abs/2406.13111", "description": "arXiv:2406.13111v1 Announce Type: new \nAbstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce motion artifacts. 
Many studies remove scans with excessive motion, which can lead to drastic reductions in sample size and introduce selection bias. To avoid such exclusions, we propose an estimand inspired by causal inference methods that quantifies the difference in average functional connectivity in autistic and non-ASD children while standardizing motion relative to the low motion distribution in scans that pass motion quality control. We introduce a nonparametric estimator for motion control, called MoCo, that uses all participants and flexibly models the impacts of motion and other relevant features using an ensemble of machine learning methods. We establish large-sample efficiency and multiple robustness of our proposed estimator. The framework is applied to estimate the difference in functional connectivity between 132 autistic and 245 non-ASD children, of which 34 and 126 pass motion quality control. MoCo appears to dramatically reduce motion artifacts relative to no participant removal, while more efficiently utilizing participant data and accounting for possible selection biases relative to the na\\\"ive approach with participant removal."}, "https://arxiv.org/abs/2406.13122": {"title": "Testing for Underpowered Literatures", "link": "https://arxiv.org/abs/2406.13122", "description": "arXiv:2406.13122v1 Announce Type: new \nAbstract: How many experimental studies would have come to different conclusions had they been run on larger samples? I show how to estimate the expected number of statistically significant results that a set of experiments would have reported had their sample sizes all been counterfactually increased by a chosen factor. The estimator is consistent and asymptotically normal. Unlike existing methods, my approach requires no assumptions about the distribution of true effects of the interventions being studied other than continuity. This method includes an adjustment for publication bias in the reported t-scores. An application to randomized controlled trials (RCTs) published in top economics journals finds that doubling every experiment's sample size would only increase the power of two-sided t-tests by 7.2 percentage points on average. This effect is small and is comparable to the effect for systematic replication projects in laboratory psychology where previous studies enabled accurate power calculations ex ante. These effects are both smaller than for non-RCTs. This comparison suggests that RCTs are on average relatively insensitive to sample size increases. The policy implication is that grant givers should generally fund more experiments rather than fewer, larger ones."}, "https://arxiv.org/abs/2406.13197": {"title": "Representation Transfer Learning for Semiparametric Regression", "link": "https://arxiv.org/abs/2406.13197", "description": "arXiv:2406.13197v1 Announce Type: new \nAbstract: We propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. 
We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data."}, "https://arxiv.org/abs/2406.13310": {"title": "A finite-infinite shared atoms nested model for the Bayesian analysis of large grouped data", "link": "https://arxiv.org/abs/2406.13310", "description": "arXiv:2406.13310v1 Announce Type: new \nAbstract: The use of hierarchical mixture priors with shared atoms has recently flourished in the Bayesian literature for partially exchangeable data. Leveraging nested levels of mixtures, these models allow the estimation of a two-layered data partition: across groups and across observations. This paper discusses and compares the properties of such modeling strategies when the mixing weights are assigned either a finite-dimensional Dirichlet distribution or a Dirichlet process prior. Based on these considerations, we introduce a novel hierarchical nonparametric prior based on a finite set of shared atoms, a specification that enhances the flexibility of the induced random measures and the availability of fast posterior inference. To support these findings, we analytically derive the induced prior correlation structure and partially exchangeable partition probability function. Additionally, we develop a novel mean-field variational algorithm for posterior inference to boost the applicability of our nested model to large multivariate data. We then assess and compare the performance of the different shared-atom specifications via simulation. We also show that our variational proposal is highly scalable and that the accuracy of the posterior density estimate and the estimated partition is comparable with state-of-the-art Gibbs sampler algorithms. Finally, we apply our model to a real dataset of Spotify's song features, simultaneously segmenting artists and songs with similar characteristics."}, "https://arxiv.org/abs/2406.13395": {"title": "Bayesian Inference for Multidimensional Welfare Comparisons", "link": "https://arxiv.org/abs/2406.13395", "description": "arXiv:2406.13395v1 Announce Type: new \nAbstract: Using both single-index measures and stochastic dominance concepts, we show how Bayesian inference can be used to make multivariate welfare comparisons. A four-dimensional distribution for the well-being attributes income, mental health, education, and happiness is estimated via Bayesian Markov chain Monte Carlo using unit-record data taken from the Household, Income and Labour Dynamics in Australia survey. Marginal distributions of beta and gamma mixtures and discrete ordinal distributions are combined using a copula. Improvements in both well-being generally and poverty magnitude are assessed using posterior means of single-index measures and posterior probabilities of stochastic dominance. The conditions for stochastic dominance depend on the class of utility functions that is assumed to define a social welfare function and the number of attributes in the utility function. 
Three classes of utility functions are considered, and posterior probabilities of dominance are computed for one-, two-, and four-attribute utility functions for three time intervals within the period 2001 to 2019."}, "https://arxiv.org/abs/2406.13478": {"title": "Semiparametric Localized Principal Stratification Analysis with Continuous Strata", "link": "https://arxiv.org/abs/2406.13478", "description": "arXiv:2406.13478v1 Announce Type: new \nAbstract: Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify the principal causal effect under weak principal ignorability. We then target the local functional substitute of the principal causal effect, which is statistically regular and can accurately approximate the principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and the principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries."}, "https://arxiv.org/abs/2406.13500": {"title": "Gradient-Boosted Generalized Linear Models for Conditional Vine Copulas", "link": "https://arxiv.org/abs/2406.13500", "description": "arXiv:2406.13500v1 Announce Type: new \nAbstract: Vine copulas are flexible dependence models using bivariate copulas as building blocks. If the parameters of the bivariate copulas in the vine copula depend on covariates, one obtains a conditional vine copula. We propose an extension for the estimation of continuous conditional vine copulas, where the parameters of continuous conditional bivariate copulas are estimated sequentially and separately via gradient-boosting. For this purpose, we link covariates via generalized linear models (GLMs) to Kendall's $\\tau$ correlation coefficient, from which the corresponding copula parameter can be obtained. Consequently, the gradient-boosting algorithm estimates the copula parameters, providing a natural covariate selection. In a second step, an additional covariate deselection procedure is applied. The performance of the gradient-boosted conditional vine copulas is illustrated in a simulation study. Linear covariate effects in low- and high-dimensional settings are investigated for the conditional bivariate copulas separately and for conditional vine copulas. Moreover, the gradient-boosted conditional vine copulas are applied to the temporal postprocessing of ensemble weather forecasts in a low-dimensional setting. The results show that our suggested method is able to outperform the benchmark methods and identifies temporal correlations better. 
Finally, we provide an R package called boostCopula for this method."}, "https://arxiv.org/abs/2406.13635": {"title": "Temporal label recovery from noisy dynamical data", "link": "https://arxiv.org/abs/2406.13635", "description": "arXiv:2406.13635v1 Announce Type: new \nAbstract: Analyzing dynamical data often requires information on the temporal labels, but such information is unavailable in many applications. Recovery of these temporal labels, closely related to the seriation or sequencing problem, becomes crucial in the study. However, challenges arise due to the nonlinear nature of the data and the complexity of the underlying dynamical system, which may be periodic or non-periodic. Additionally, noise within the feature space complicates the theoretical analysis. Our work develops spectral algorithms that leverage manifold learning concepts to recover temporal labels from noisy data. We first construct the graph Laplacian of the data, and then employ the second (and the third) Fiedler vectors to recover temporal labels. This method can be applied to both periodic and aperiodic cases. It also does not require monotone properties on the similarity matrix, which are commonly assumed in existing spectral seriation algorithms. We derive $\\ell_{\\infty}$ error bounds for our estimators of the temporal labels and ranking, without assumptions on the eigen-gap. In numerical experiments, our method outperforms spectral seriation algorithms based on a similarity matrix. The performance of our algorithms is further demonstrated on a synthetic biomolecule data example."}, "https://arxiv.org/abs/2406.13691": {"title": "Computationally efficient multi-level Gaussian process regression for functional data observed under completely or partially regular sampling designs", "link": "https://arxiv.org/abs/2406.13691", "description": "arXiv:2406.13691v1 Announce Type: new \nAbstract: Gaussian process regression is a frequently used statistical method for flexible yet fully probabilistic non-linear regression modeling. A common obstacle is its computational complexity, which scales poorly with the number of observations. This is especially an issue when applying Gaussian process models to multiple functions simultaneously in various applications of functional data analysis.\n We consider a multi-level Gaussian process regression model where a common mean function and individual subject-specific deviations are modeled simultaneously as latent Gaussian processes. We derive exact analytic and computationally efficient expressions for the log-likelihood function and the posterior distributions in the case where the observations are sampled on either a completely or partially regular grid. This enables us to fit the model to large data sets that are currently computationally inaccessible using a standard implementation. 
We show through a simulation study that our analytic expressions are several orders of magnitude faster compared to a standard implementation, and we provide an implementation in the probabilistic programming language Stan."}, "https://arxiv.org/abs/2406.13826": {"title": "Testing identification in mediation and dynamic treatment models", "link": "https://arxiv.org/abs/2406.13826", "description": "arXiv:2406.13826v1 Announce Type: new \nAbstract: We propose a test for the identification of causal effects in mediation and dynamic treatment models that is based on two sets of observed variables, namely covariates to be controlled for and suspected instruments, building on the test by Huber and Kueck (2022) for single treatment models. We consider models with a sequential assignment of a treatment and a mediator to assess the direct treatment effect (net of the mediator), the indirect treatment effect (via the mediator), or the joint effect of both treatment and mediator. We establish testable conditions for identifying such effects in observational data. These conditions jointly imply (1) the exogeneity of the treatment and the mediator conditional on covariates and (2) the validity of distinct instruments for the treatment and the mediator, meaning that the instruments do not directly affect the outcome (other than through the treatment or mediator) and are unconfounded given the covariates. Our framework extends to post-treatment sample selection or attrition problems when replacing the mediator by a selection indicator for observing the outcome, enabling joint testing of the selectivity of treatment and attrition. We propose a machine learning-based test to control for covariates in a data-driven manner and analyze its finite sample performance in a simulation study. Additionally, we apply our method to Slovak labor market data and find that our testable implications are not rejected for a sequence of training programs typically considered in dynamic treatment evaluations."}, "https://arxiv.org/abs/2406.13833": {"title": "Cluster Quilting: Spectral Clustering for Patchwork Learning", "link": "https://arxiv.org/abs/2406.13833", "description": "arXiv:2406.13833v1 Announce Type: new \nAbstract: Patchwork learning arises as a new and challenging data collection paradigm where both samples and features are observed in fragmented subsets. Due to technological limits, measurement expense, or multimodal data integration, such patchwork data structures are frequently seen in neuroscience, healthcare, and genomics, among others. Instead of analyzing each data patch separately, it is highly desirable to extract comprehensive knowledge from the whole data set. In this work, we focus on the clustering problem in patchwork learning, aiming at discovering clusters amongst all samples even when some are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure amongst all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both properties of the patch-wise observation regime as well as the clustering signal and noise dependencies. 
We also validate our Cluster Quilting algorithm through extensive empirical studies on both simulated and real data sets in neuroscience and genomics, where it discovers more accurate and scientifically more plausible clusters than other approaches."}, "https://arxiv.org/abs/2406.13836": {"title": "Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions", "link": "https://arxiv.org/abs/2406.13836", "description": "arXiv:2406.13836v1 Announce Type: new \nAbstract: In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets."}, "https://arxiv.org/abs/2406.13876": {"title": "An Empirical Bayes Jackknife Regression Framework for Covariance Matrix Estimation", "link": "https://arxiv.org/abs/2406.13876", "description": "arXiv:2406.13876v1 Announce Type: new \nAbstract: Covariance matrix estimation, a classical statistical topic, poses significant challenges when the sample size is comparable to or smaller than the number of features. In this paper, we frame covariance matrix estimation as a compound decision problem and apply an optimal decision rule to estimate covariance parameters. To approximate this rule, we introduce an algorithm that integrates jackknife techniques with machine learning regression methods. This algorithm exhibits adaptability across diverse scenarios without relying on assumptions about data distribution. Simulation results and gene network inference from an RNA-seq experiment in mice demonstrate that our approach either matches or surpasses several state-of-the-art methods"}, "https://arxiv.org/abs/2406.13906": {"title": "Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data", "link": "https://arxiv.org/abs/2406.13906", "description": "arXiv:2406.13906v1 Announce Type: new \nAbstract: The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. 
Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application."}, "https://arxiv.org/abs/2406.13938": {"title": "Coverage of Credible Sets for Regression under Variable Selection", "link": "https://arxiv.org/abs/2406.13938", "description": "arXiv:2406.13938v1 Announce Type: new \nAbstract: We study the asymptotic frequentist coverage of credible sets based on a novel Bayesian approach for a multiple linear regression model under variable selection. We initially ignore the issue of variable selection, which allows us to put a conjugate normal prior on the coefficient vector. The variable selection step is incorporated directly in the posterior through a sparsity-inducing map and uses the induced prior for making an inference instead of the natural conjugate posterior. The sparsity-inducing map minimizes the sum of the squared l2-distance weighted by the data matrix and a suitably scaled l1-penalty term. We obtain the limiting coverage of various credible regions and demonstrate that a modified credible interval for a component has the exact asymptotic frequentist coverage if the corresponding predictor is asymptotically uncorrelated with other predictors. Through extensive simulation, we provide a guideline for choosing the penalty parameter as a function of the credibility level appropriate for the corresponding coverage. We also show finite-sample numerical results that support the conclusions from the asymptotic theory. In addition, we provide the credInt package that implements the method in R to obtain the credible intervals along with the posterior samples."}, "https://arxiv.org/abs/2406.14046": {"title": "Estimating Time-Varying Parameters of Various Smoothness in Linear Models via Kernel Regression", "link": "https://arxiv.org/abs/2406.14046", "description": "arXiv:2406.14046v1 Announce Type: new \nAbstract: We consider estimating nonparametric time-varying parameters in linear models using kernel regression. Our contributions are twofold. First, we consider a broad class of time-varying parameters, including deterministic smooth functions, the rescaled random walk, structural breaks, the threshold model, and their mixtures. We show that those time-varying parameters can be consistently estimated by kernel regression. Our analysis exploits the smoothness of time-varying parameters rather than their specific form. The second contribution is to reveal that the bandwidth used in kernel regression determines the trade-off between the rate of convergence and the size of the class of time-varying parameters that can be estimated. An implication from our result is that the bandwidth should be proportional to $T^{-1/2}$ if the time-varying parameter follows the rescaled random walk, where $T$ is the sample size. We propose a specific choice of the bandwidth that accommodates a wide range of time-varying parameter models. 
An empirical application shows that the kernel-based estimator with this choice can capture the random-walk dynamics in time-varying parameters."}, "https://arxiv.org/abs/2406.14182": {"title": "Averaging polyhazard models using Piecewise deterministic Monte Carlo with applications to data with long-term survivors", "link": "https://arxiv.org/abs/2406.14182", "description": "arXiv:2406.14182v1 Announce Type: new \nAbstract: Polyhazard models are a class of flexible parametric models for modelling survival over extended time horizons. Their additive hazard structure allows for flexible, non-proportional hazards whose characteristics can change over time while retaining a parametric form, which permits survival to be extrapolated beyond the observation period of a study. Significant user input is required, however, in selecting the number of latent hazards to model, their distributions, and the choice of which variables to associate with each hazard. The resulting set of models is too large to explore manually, limiting their practical usefulness. Motivated by applications to stroke survivor and kidney transplant patient survival times, we extend the standard polyhazard model through a prior structure allowing for joint inference of parameters and structural quantities, and develop a sampling scheme that utilises state-of-the-art Piecewise Deterministic Markov Processes to sample from the resulting transdimensional posterior with minimal user tuning."}, "https://arxiv.org/abs/2406.14184": {"title": "On integral priors for multiple comparison in Bayesian model selection", "link": "https://arxiv.org/abs/2406.14184", "description": "arXiv:2406.14184v1 Announce Type: new \nAbstract: Noninformative priors constructed for estimation purposes are usually not appropriate for model selection and testing. The methodology of integral priors was developed to obtain prior distributions for Bayesian model selection when comparing two models, modifying initial improper reference priors. We propose a generalization of this methodology to more than two models. Our approach adds an artificial copy of each model under comparison by compactifying the parametric space and creating an ergodic Markov chain across all models that returns the integral priors as marginals of the stationary distribution. Besides the guarantee of their existence and the lack of paradoxes attached to estimation reference priors, an additional advantage of this methodology is that the simulation of this Markov chain is straightforward as it only requires simulations of imaginary training samples for all models and from the corresponding posterior distributions. This renders its implementation automatic and generic, both in the nested case and in the nonnested case."}, "https://arxiv.org/abs/2406.14380": {"title": "Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach", "link": "https://arxiv.org/abs/2406.14380", "description": "arXiv:2406.14380v1 Announce Type: new \nAbstract: Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate the treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. 
This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a \"recommender choice model\" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias. We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator."}, "https://arxiv.org/abs/2406.14453": {"title": "The Effective Number of Parameters in Kernel Density Estimation", "link": "https://arxiv.org/abs/2406.14453", "description": "arXiv:2406.14453v1 Announce Type: new \nAbstract: The quest for a formula that satisfactorily measures the effective degrees of freedom in kernel density estimation (KDE) is a long standing problem with few solutions. Starting from the orthogonal polynomial sequence (OPS) expansion for the ratio of the empirical to the oracle density, we show how convolution with the kernel leads to a new OPS with respect to which one may express the resulting KDE. The expansion coefficients of the two OPS systems can then be related via a kernel sensitivity matrix, and this then naturally leads to a definition of effective parameters by taking the trace of a symmetrized positive semi-definite normalized version. The resulting effective degrees of freedom (EDoF) formula is an oracle-based quantity; the first ever proposed in the literature. Asymptotic properties of the empirical EDoF are worked out through influence functions. Numerical investigations confirm the theoretical insights."}, "https://arxiv.org/abs/2406.14535": {"title": "On estimation and order selection for multivariate extremes via clustering", "link": "https://arxiv.org/abs/2406.14535", "description": "arXiv:2406.14535v1 Announce Type: new \nAbstract: We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. 
Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimations of heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation."}, "https://arxiv.org/abs/2406.11308": {"title": "Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?", "link": "https://arxiv.org/abs/2406.11308", "description": "arXiv:2406.11308v1 Announce Type: cross \nAbstract: In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2 - 3% during the color-conversion process of white light-emitting diodes (LEDs)."}, "https://arxiv.org/abs/2406.12908": {"title": "Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal Lens", "link": "https://arxiv.org/abs/2406.12908", "description": "arXiv:2406.12908v1 Announce Type: cross \nAbstract: AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of Multi-Modal Time-Series Forecasting Models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method on a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings. 
Our work will help different stakeholders of time-series forecasting understand the models' behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making."}, "https://arxiv.org/abs/2406.13814": {"title": "Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches", "link": "https://arxiv.org/abs/2406.13814", "description": "arXiv:2406.13814v1 Announce Type: cross \nAbstract: Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates."}, "https://arxiv.org/abs/2406.13944": {"title": "Generalization error of min-norm interpolators in transfer learning", "link": "https://arxiv.org/abs/2406.13944", "description": "arXiv:2406.13944v1 Announce Type: cross \nAbstract: This paper establishes the generalization error of pooled min-$\\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. 
By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results."}, "https://arxiv.org/abs/2406.13966": {"title": "Causal Inference with Latent Variables: Recent Advances and Future Prospectives", "link": "https://arxiv.org/abs/2406.13966", "description": "arXiv:2406.13966v1 Announce Type: cross \nAbstract: Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs)."}, "https://arxiv.org/abs/2406.14003": {"title": "Deep Optimal Experimental Design for Parameter Estimation Problems", "link": "https://arxiv.org/abs/2406.14003", "description": "arXiv:2406.14003v1 Announce Type: cross \nAbstract: Optimal experimental design is a well-studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques have been changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. 
We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations."}, "https://arxiv.org/abs/2406.14145": {"title": "Temperature in the Iberian Peninsula: Trend, seasonality, and heterogeneity", "link": "https://arxiv.org/abs/2406.14145", "description": "arXiv:2406.14145v1 Announce Type: cross \nAbstract: In this paper, we propose fitting unobserved component models to represent the dynamic evolution of bivariate systems of centre and log-range temperatures obtained monthly from minimum/maximum temperatures observed at a given location. In doing so, the centre and log-range temperature are decomposed into potentially stochastic trends, seasonal, and transitory components. Since our model encompasses deterministic trends and seasonal components as limiting cases, we contribute to the debate on whether stochastic or deterministic components better represent the trend and seasonal components. The methodology is implemented to centre and log-range temperature observed in four locations in the Iberian Peninsula, namely, Barcelona, Coru\\~{n}a, Madrid, and Seville. We show that, at each location, the centre temperature can be represented by a smooth integrated random walk with time-varying slope, while a stochastic level better represents the log-range. We also show that centre and log-range temperature are unrelated. The methodology is then extended to simultaneously model centre and log-range temperature observed at several locations in the Iberian Peninsula. We fit a multi-level dynamic factor model to extract potential commonalities among centre (log-range) temperature while also allowing for heterogeneity in different areas in the Iberian Peninsula. We show that, although the commonality in trends of average temperature is considerable, the regional components are also relevant."}, "https://arxiv.org/abs/2406.14163": {"title": "A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics", "link": "https://arxiv.org/abs/2406.14163", "description": "arXiv:2406.14163v1 Announce Type: cross \nAbstract: Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. 
It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows."}, "https://arxiv.org/abs/2112.07755": {"title": "Separate Exchangeability as Modeling Principle in Bayesian Nonparametrics", "link": "https://arxiv.org/abs/2112.07755", "description": "arXiv:2112.07755v2 Announce Type: replace \nAbstract: We argue for the use of separate exchangeability as a modeling principle in Bayesian nonparametric (BNP) inference. Separate exchangeability is \\emph{de facto} widely applied in the Bayesian parametric case, e.g., it naturally arises in simple mixed models. However, while in some areas, such as random graphs, separate and (closely related) joint exchangeability are widely used, it is curiously underused for several other applications in BNP. We briefly review the definition of separate exchangeability focusing on the implications of such a definition in Bayesian modeling. We then discuss two tractable classes of models that implement separate exchangeability that are the natural counterparts of familiar partially exchangeable BNP models.\n The first is nested random partitions for a data matrix, defining a partition of columns and nested partitions of rows, nested within column clusters. Many recent models for nested partitions implement partially exchangeable models related to variations of the well-known nested Dirichlet process. We argue that inference under such models in some cases ignores important features of the experimental setup. We obtain the separately exchangeable counterpart of such partially exchangeable partition structures.\n The second class is about setting up separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved. We highlight how a Dirichlet process mixture of linear models known as ANOVA DDP can naturally implement separate exchangeability in such regression problems. Finally, we illustrate how to perform inference under such models in two real data examples."}, "https://arxiv.org/abs/2205.10310": {"title": "Treatment Effects in Bunching Designs: The Impact of Mandatory Overtime Pay on Hours", "link": "https://arxiv.org/abs/2205.10310", "description": "arXiv:2205.10310v4 Announce Type: replace \nAbstract: This paper studies the identifying power of bunching at kinks when the researcher does not assume a parametric choice model. I find that in a general choice model, identifying the average causal response to the policy switch at a kink amounts to confronting two extrapolation problems, each about the distribution of a counterfactual choice that is observed only in a censored manner. I apply this insight to partially identify the effect of overtime pay regulation on the hours of U.S. workers using administrative payroll data, assuming that each distribution satisfies a weak non-parametric shape constraint in the region where it is not observed. 
The resulting bounds are informative and indicate a relatively small elasticity of demand for weekly hours, addressing a long-standing question about the causal effects of the overtime mandate."}, "https://arxiv.org/abs/2302.03200": {"title": "Multivariate Bayesian dynamic modeling for causal prediction", "link": "https://arxiv.org/abs/2302.03200", "description": "arXiv:2302.03200v2 Announce Type: replace \nAbstract: Bayesian forecasting is developed in multivariate time series analysis for causal inference. Causal evaluation of sequentially observed time series data from control and treated units focuses on the impacts of interventions using contemporaneous outcomes in control units. Methodological developments here concern multivariate dynamic models for time-varying effects across multiple treated units with explicit foci on sequential learning and aggregation of intervention effects. Analysis explores dimension reduction across multiple synthetic counterfactual predictors. Computational advances leverage fully conjugate models for efficient sequential learning and inference, including cross-unit correlations and their time variation. This allows full uncertainty quantification on model hyper-parameters via Bayesian model averaging. A detailed case study evaluates interventions in a supermarket promotions experiment, with coupled predictive analyses in selected regions of a large-scale commercial system. Comparisons with existing methods highlight the issues of appropriate uncertainty quantification in causal inference when aggregating across treated units, among other practical concerns."}, "https://arxiv.org/abs/2303.14298": {"title": "Sensitivity Analysis in Unconditional Quantile Effects", "link": "https://arxiv.org/abs/2303.14298", "description": "arXiv:2303.14298v3 Announce Type: replace \nAbstract: This paper proposes a framework to analyze the effects of counterfactual policies on the unconditional quantiles of an outcome variable. For a given counterfactual policy, we obtain identified sets for the effect of both marginal and global changes in the proportion of treated individuals. To conduct a sensitivity analysis, we introduce the quantile breakdown frontier, a curve that (i) indicates whether a sensitivity analysis is possible or not, and (ii) when a sensitivity analysis is possible, quantifies the amount of selection bias consistent with a given conclusion of interest across different quantiles. To illustrate our method, we perform a sensitivity analysis on the effect of unionizing low-income workers on the quantiles of the distribution of (log) wages."}, "https://arxiv.org/abs/2305.19089": {"title": "Impulse Response Analysis of Structural Nonlinear Time Series Models", "link": "https://arxiv.org/abs/2305.19089", "description": "arXiv:2305.19089v4 Announce Type: replace \nAbstract: This paper proposes a semiparametric sieve approach to estimate impulse response functions of nonlinear time series within a general class of structural autoregressive models. We prove that a two-step procedure can flexibly accommodate nonlinear specifications while avoiding the need to choose fixed parametric forms. Sieve impulse responses are proven to be consistent by deriving uniform estimation guarantees, and an iterative algorithm makes it straightforward to compute them in practice. With simulations, we show that the proposed semiparametric approach proves effective against misspecification while suffering only minor efficiency losses. 
In a US monetary policy application, we find that the pointwise sieve GDP response associated with an interest rate increase is larger than that of a linear model. Finally, in an analysis of interest rate uncertainty shocks, sieve responses imply significantly more substantial contractionary effects on both production and inflation."}, "https://arxiv.org/abs/2306.12949": {"title": "On the use of the Gram matrix for multivariate functional principal components analysis", "link": "https://arxiv.org/abs/2306.12949", "description": "arXiv:2306.12949v2 Announce Type: replace \nAbstract: Dimension reduction is crucial in functional data analysis (FDA). The key tool to reduce the dimension of the data is functional principal component analysis. Existing approaches for functional principal component analysis usually involve the diagonalization of the covariance operator. With the increasing size and complexity of functional datasets, estimating the covariance operator has become more challenging. Therefore, there is a growing need for efficient methodologies to estimate the eigencomponents. Using the duality of the space of observations and the space of functional features, we propose to use the inner-product between the curves to estimate the eigenelements of multivariate and multidimensional functional datasets. The relationship between the eigenelements of the covariance operator and those of the inner-product matrix is established. We explore the application of these methodologies in several FDA settings and provide general guidance on their usability."}, "https://arxiv.org/abs/2307.05732": {"title": "From isotonic to Lipschitz regression: a new interpolative perspective on shape-restricted estimation", "link": "https://arxiv.org/abs/2307.05732", "description": "arXiv:2307.05732v3 Announce Type: replace \nAbstract: This manuscript seeks to bridge two seemingly disjoint paradigms of nonparametric regression estimation based on smoothness assumptions and shape constraints. The proposed approach is motivated by a conceptually simple observation: Every Lipschitz function is a sum of monotonic and linear functions. This principle is further generalized to higher-order monotonicity and multivariate covariates. A family of estimators is proposed based on a sample-splitting procedure, which inherits desirable methodological, theoretical, and computational properties of shape-restricted estimators. Our theoretical analysis provides convergence guarantees of the estimator under heteroscedastic and heavy-tailed errors, as well as adaptive properties to the complexity of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results, and extensive numerical studies validate the theoretical properties and provide empirical evidence for the practicality of the proposed estimation framework."}, "https://arxiv.org/abs/2309.01404": {"title": "Hierarchical Regression Discontinuity Design: Pursuing Subgroup Treatment Effects", "link": "https://arxiv.org/abs/2309.01404", "description": "arXiv:2309.01404v2 Announce Type: replace \nAbstract: Regression discontinuity design (RDD) is widely adopted for causal inference under intervention determined by a continuous variable. While one is interested in treatment effect heterogeneity by subgroups in many applications, RDD typically suffers from small subgroup-wise sample sizes, which makes the estimation results highly unstable. 
To solve this issue, we introduce hierarchical RDD (HRDD), a hierarchical Bayes approach for pursuing treatment effect heterogeneity in RDD. A key feature of HRDD is to employ a pseudo-model based on a loss function to estimate subgroup-level parameters of treatment effects under RDD, and assign a hierarchical prior distribution to ''borrow strength'' from other subgroups. The posterior computation can be easily done by a simple Gibbs sampler, and the optimal bandwidth can be automatically selected by the Hyv\\\"{a}rinen scores for unnormalized models. We demonstrate the proposed HRDD through simulation and real data analysis, and show that HRDD provides much more stable point and interval estimation than separately applying the standard RDD method to each subgroup."}, "https://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "https://arxiv.org/abs/2310.04924", "description": "arXiv:2310.04924v3 Announce Type: replace \nAbstract: Monte Carlo significance tests are a general tool that produces p-values by generating samples from the null distribution. However, Monte Carlo tests are limited to null hypotheses from which we can sample exactly. Markov chain Monte Carlo (MCMC) significance tests are a way to produce statistically valid p-values for null hypotheses from which we can only sample approximately. These methods were first introduced by Besag and Clifford in 1989 and make no assumptions on the mixing time of the MCMC procedure. Here we review the two methods of Besag and Clifford and introduce a new method that unifies the existing procedures. We use simple examples to highlight the difference between MCMC significance tests and standard Monte Carlo tests based on exact sampling. We also survey a range of contemporary applications in the literature including goodness-of-fit testing for the Rasch model, tests for detecting gerrymandering [8] and a permutation-based test of conditional independence [3]."}, "https://arxiv.org/abs/2310.17806": {"title": "Transporting treatment effects from difference-in-differences studies", "link": "https://arxiv.org/abs/2310.17806", "description": "arXiv:2310.17806v2 Announce Type: replace \nAbstract: Difference-in-differences (DID) is a popular approach to identify the causal effects of treatments and policies in the presence of unmeasured confounding. DID identifies the sample average treatment effect in the treated (SATT). However, a goal of such research is often to inform decision-making in target populations outside the treated sample. Transportability methods have been developed to extend inferences from study samples to external target populations; these methods have primarily been developed and applied in settings where identification is based on conditional independence between the treatment and potential outcomes, such as in a randomized trial. We present a novel approach to identifying and estimating effects in a target population, based on DID conducted in a study sample that differs from the target population. We present a range of assumptions under which one may identify causal effects in the target population and employ causal diagrams to illustrate these assumptions. In most realistic settings, results depend critically on the assumption that any unmeasured confounders are not effect measure modifiers on the scale of the effect of interest (e.g., risk difference, odds ratio). 
We develop several estimators of transported effects, including g-computation, inverse odds weighting, and a doubly robust estimator based on the efficient influence function. Simulation results support theoretical properties of the proposed estimators. As an example, we apply our approach to study the effects of a 2018 US federal smoke-free public housing law on air quality in public housing across the US, using data from a DID study conducted in New York City alone."}, "https://arxiv.org/abs/2310.20088": {"title": "Functional Principal Component Analysis for Distribution-Valued Processes", "link": "https://arxiv.org/abs/2310.20088", "description": "arXiv:2310.20088v2 Announce Type: replace \nAbstract: We develop statistical models for samples of distribution-valued stochastic processes featuring time-indexed univariate distributions, with emphasis on functional principal component analysis. The proposed model presents an intrinsic rather than transformation-based approach. The starting point is a transport process representation for distribution-valued processes under the Wasserstein metric. Substituting transports for distributions addresses the challenge of centering distribution-valued processes and leads to a useful and interpretable decomposition of each realized process into a process-specific single transport and a real-valued trajectory. This representation makes it possible to utilize a scalar multiplication operation for transports and facilitates not only functional principal component analysis but also the introduction of a latent Gaussian process. This Gaussian process proves especially useful for the case where the distribution-valued processes are only observed on a sparse grid of time points, establishing an approach for longitudinal distribution-valued data. We study the convergence of the key components of this novel representation to their population targets and demonstrate the practical utility of the proposed approach through simulations and several data illustrations."}, "https://arxiv.org/abs/2312.16260": {"title": "Multinomial Link Models", "link": "https://arxiv.org/abs/2312.16260", "description": "arXiv:2312.16260v2 Announce Type: replace \nAbstract: We propose a unified multinomial link model for analyzing categorical responses. It not only covers the existing multinomial logistic models and their extensions as special cases, but also includes new models that can incorporate observations with NA or Unknown responses in the data analysis. We provide explicit formulae and detailed algorithms for finding the maximum likelihood estimates of the model parameters and computing the Fisher information matrix. Our algorithms solve the infeasibility issue of existing statistical software in estimating parameters of cumulative link models. The applications to real datasets show that the new models can fit the data significantly better, and the corresponding data analysis may correct misleading conclusions due to missing responses."}, "https://arxiv.org/abs/2406.14636": {"title": "MSmix: An R Package for clustering partial rankings via mixtures of Mallows Models with Spearman distance", "link": "https://arxiv.org/abs/2406.14636", "description": "arXiv:2406.14636v1 Announce Type: new \nAbstract: MSmix is a recently developed R package implementing maximum likelihood estimation of finite mixtures of Mallows models with Spearman distance for full and partial rankings. 
The package is designed to implement computationally tractable estimation routines of the model parameters, with the ability to handle arbitrary forms of partial rankings and sequences of a large number of items. The frequentist estimation task is accomplished via EM algorithms, integrating data augmentation strategies to recover the unobserved heterogeneity and the missing ranks. The package also provides functionalities for uncertainty quantification of the estimated parameters, via diverse bootstrap methods and asymptotic confidence intervals. Generic methods for S3 class objects are constructed for more effectively managing the output of the main routines. The usefulness of the package and its computational performance compared with competing software is illustrated via applications to both simulated and original real ranking datasets."}, "https://arxiv.org/abs/2406.14650": {"title": "Conditional correlation estimation and serial dependence identification", "link": "https://arxiv.org/abs/2406.14650", "description": "arXiv:2406.14650v1 Announce Type: new \nAbstract: It has been recently shown in Jaworski, P., Jelito, D. and Pitera, M. (2024), 'A note on the equivalence between the conditional uncorrelation and the independence of random variables', Electronic Journal of Statistics 18(1), that one can characterise the independence of random variables via the family of conditional correlations on quantile-induced sets. This effectively shows that the localized linear measure of dependence is able to detect any form of nonlinear dependence for appropriately chosen conditioning sets. In this paper, we expand this concept, focusing on the statistical properties of conditional correlation estimators and their potential usage in serial dependence identification. In particular, we show how to estimate conditional correlations in generic and serial dependence setups, discuss key properties of the related estimators, define the conditional equivalent of the autocorrelation function, and provide a series of examples which prove that the proposed framework could be efficiently used in many practical econometric applications."}, "https://arxiv.org/abs/2406.14717": {"title": "Analysis of Linked Files: A Missing Data Perspective", "link": "https://arxiv.org/abs/2406.14717", "description": "arXiv:2406.14717v1 Announce Type: new \nAbstract: In many applications, researchers seek to identify overlapping entities across multiple data files. Record linkage algorithms facilitate this task, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in biased, or overly precise estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the linkage mechanisms that underpin analysis methods with linked files. Following the missing data literature, we group these methods under three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. 
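The conditional correlation entry above estimates ordinary correlations restricted to quantile-induced conditioning sets, both for generic pairs and for lagged values of a single series. A minimal sketch of that idea, assuming a plain plug-in estimator on the set where X falls between its a- and b-quantiles; the quantile levels, the lagged-pair construction, and the toy example are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def conditional_corr(x, y, a, b):
    """Pearson correlation of (x, y) restricted to the quantile-induced set
    where x lies between its a- and b-quantiles; a simple plug-in estimator."""
    x, y = np.asarray(x), np.asarray(y)
    lo, hi = np.quantile(x, [a, b])
    mask = (x >= lo) & (x <= hi)
    if mask.sum() < 3:
        return np.nan
    return np.corrcoef(x[mask], y[mask])[0, 1]

def conditional_autocorr(z, lag, a, b):
    """Serial-dependence variant: condition on the lagged value of the same series."""
    x, y = z[:-lag], z[lag:]          # pairs (z_t, z_{t+lag})
    return conditional_corr(x, y, a, b)

rng = np.random.default_rng(0)
# Example: y = x^2 + noise is (almost) uncorrelated with x globally,
# but strongly correlated on an upper-quantile conditioning set.
x = rng.normal(size=5000)
y = x**2 + 0.1 * rng.normal(size=5000)
print(round(np.corrcoef(x, y)[0, 1], 3))            # near 0
print(round(conditional_corr(x, y, 0.75, 1.0), 3))  # clearly positive
```

The toy example illustrates the abstract's point that a localized linear measure can detect nonlinear dependence once the conditioning set is chosen appropriately.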
We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios."}, "https://arxiv.org/abs/2406.14738": {"title": "Robust parameter estimation for partially observed second-order diffusion processes", "link": "https://arxiv.org/abs/2406.14738", "description": "arXiv:2406.14738v1 Announce Type: new \nAbstract: Estimating parameters of a diffusion process given continuous-time observations of the process via maximum likelihood approaches or, online, via stochastic gradient descent or Kalman filter formulations constitutes a well-established research area. It has also been established previously that these techniques are, in general, not robust to perturbations in the data in the form of temporal correlations. While the subject is relatively well understood and appropriate modifications have been suggested in the context of multi-scale diffusion processes and their reduced model equations, we consider here an alternative setting where a second-order diffusion process in positions and velocities is only observed via its positions. In this note, we propose a simple modification to standard stochastic gradient descent and Kalman filter formulations, which eliminates the arising systematic estimation biases. The modification can be extended to standard maximum likelihood approaches and avoids computation of previously proposed correction terms."}, "https://arxiv.org/abs/2406.14772": {"title": "Consistent community detection in multi-layer networks with heterogeneous differential privacy", "link": "https://arxiv.org/abs/2406.14772", "description": "arXiv:2406.14772v1 Announce Type: new \nAbstract: As network data has become increasingly prevalent, a substantial amount of attention has been paid to the privacy issue in publishing network data. One of the critical challenges for data publishers is to preserve the topological structures of the original network while protecting sensitive information. In this paper, we propose a personalized edge flipping mechanism that allows data publishers to protect edge information based on each node's privacy preference. It can achieve differential privacy while preserving the community structure under the multi-layer degree-corrected stochastic block model after appropriately debiasing, and thus consistent community detection in the privatized multi-layer networks is achievable. Theoretically, we establish the consistency of community detection in the privatized multi-layer network and show that better privacy protection of edges can be obtained for a proportion of nodes while allowing other nodes to give up their privacy. Furthermore, the advantage of the proposed personalized edge-flipping mechanism is also supported by its numerical performance on various synthetic networks and a real-life multi-layer network."}, "https://arxiv.org/abs/2406.14814": {"title": "Frank copula is minimum information copula under fixed Kendall's $\\tau$", "link": "https://arxiv.org/abs/2406.14814", "description": "arXiv:2406.14814v1 Announce Type: new \nAbstract: In dependence modeling, various copulas have been utilized. Among them, the Frank copula has been one of the most typical choices due to its simplicity. In this work, we demonstrate that the Frank copula is the minimum information copula under fixed Kendall's $\\tau$ (MICK), both theoretically and numerically. First, we explain that both MICK and the Frank density follow the hyperbolic Liouville equation. 
Moreover, we show that the copula density satisfying the Liouville equation is uniquely the Frank copula. Our result asserts that selecting the Frank copula as an appropriate copula model is equivalent to using Kendall's $\\tau$ as the sole available information about the true distribution, based on the entropy maximization principle."}, "https://arxiv.org/abs/2406.14904": {"title": "Enhancing reliability in prediction intervals using point forecasters: Heteroscedastic Quantile Regression and Width-Adaptive Conformal Inference", "link": "https://arxiv.org/abs/2406.14904", "description": "arXiv:2406.14904v1 Announce Type: new \nAbstract: Building prediction intervals for time series forecasting problems presents a complex challenge, particularly when relying solely on point predictors, a common scenario for practitioners in the industry. While research has primarily focused on achieving increasingly efficient valid intervals, we argue that, when evaluating a set of intervals, traditional measures alone are insufficient. There are additional crucial characteristics: the intervals must vary in length, with this variation directly linked to the difficulty of the prediction, and the coverage of the interval must remain independent of the difficulty of the prediction for practical utility. We propose the Heteroscedastic Quantile Regression (HQR) model and the Width-Adaptive Conformal Inference (WACI) method, providing theoretical coverage guarantees, to overcome those issues, respectively. The methodologies are evaluated in the context of Electricity Price Forecasting and Wind Power Forecasting, representing complex scenarios in time series forecasting. The results demonstrate that HQR and WACI not only improve or achieve typical measures of validity and efficiency but also successfully fulfil the commonly ignored mentioned characteristics."}, "https://arxiv.org/abs/2406.15078": {"title": "The Influence of Nuisance Parameter Uncertainty on Statistical Inference in Practical Data Science Models", "link": "https://arxiv.org/abs/2406.15078", "description": "arXiv:2406.15078v1 Announce Type: new \nAbstract: For multiple reasons -- such as avoiding overtraining from one data set or because of having received numerical estimates for some parameters in a model from an alternative source -- it is sometimes useful to divide a model's parameters into one group of primary parameters and one group of nuisance parameters. However, uncertainty in the values of nuisance parameters is an inevitable factor that impacts the model's reliability. This paper examines the issue of uncertainty calculation for primary parameters of interest in the presence of nuisance parameters. We illustrate a general procedure on two distinct model forms: 1) the GARCH time series model with univariate nuisance parameter and 2) multiple hidden layer feed-forward neural network models with multivariate nuisance parameters. Leveraging an existing theoretical framework for nuisance parameter uncertainty, we show how to modify the confidence regions for the primary parameters while considering the inherent uncertainty introduced by nuisance parameters. Furthermore, our study validates the practical effectiveness of adjusted confidence regions that properly account for uncertainty in nuisance parameters. 
Such an adjustment helps data scientists produce results that more honestly reflect the overall uncertainty."}, "https://arxiv.org/abs/2406.15157": {"title": "MIDAS-QR with 2-Dimensional Structure", "link": "https://arxiv.org/abs/2406.15157", "description": "arXiv:2406.15157v1 Announce Type: new \nAbstract: Mixed frequency data has been shown to improve the performance of growth-at-risk models in the literature. Most of the research has focused on imposing structure on the high-frequency lags when estimating MIDAS-QR models, akin to what is done in mean models. However, only imposing structure on the lag dimension can potentially induce quantile variation that would otherwise not be there. In this paper we extend the framework by introducing structure on both the lag dimension and the quantile dimension. In this way we are able to shrink unnecessary quantile variation in the high-frequency variables. This leads to more gradual lag profiles in both dimensions compared to the MIDAS-QR and UMIDAS-QR. We show that this proposed method leads to further gains in nowcasting and forecasting in a pseudo-out-of-sample exercise on US data."}, "https://arxiv.org/abs/2406.15170": {"title": "Inference for Delay Differential Equations Using Manifold-Constrained Gaussian Processes", "link": "https://arxiv.org/abs/2406.15170", "description": "arXiv:2406.15170v1 Announce Type: new \nAbstract: Dynamic systems described by differential equations often involve feedback among system components. When there are time delays for components to sense and respond to feedback, delay differential equation (DDE) models are commonly used. This paper considers the problem of inferring unknown system parameters, including the time delays, from noisy and sparse experimental data observed from the system. We propose an extension of manifold-constrained Gaussian processes to conduct parameter inference for DDEs, where the time delay parameters have posed a challenge for existing methods that bypass numerical solvers. Our method uses a Bayesian framework to impose a Gaussian process model over the system trajectory, conditioned on the manifold constraint that satisfies the DDEs. For efficient computation, a linear interpolation scheme is developed to approximate the values of the time-delayed system outputs, along with corresponding theoretical error bounds on the approximated derivatives. Two simulation examples, based on Hutchinson's equation and the lac operon system, together with a real-world application using Ontario COVID-19 data, are used to illustrate the efficacy of our method."}, "https://arxiv.org/abs/2406.15285": {"title": "Monte Carlo Integration in Simple and Complex Simulation Designs", "link": "https://arxiv.org/abs/2406.15285", "description": "arXiv:2406.15285v1 Announce Type: new \nAbstract: Simulation studies are used to evaluate and compare the properties of statistical methods in controlled experimental settings. In most cases, performing a simulation study requires knowledge of the true value of the parameter, or estimand, of interest. However, in many simulation designs, the true value of the estimand is difficult to compute analytically. Here, we illustrate the use of Monte Carlo integration to compute true estimand values in simple and complex simulation designs. 
We provide general pseudocode that can be replicated in any software program of choice to demonstrate key principles in using Monte Carlo integration in two scenarios: a simple three-variable simulation where interest lies in the marginally adjusted odds ratio; and a more complex causal mediation analysis where interest lies in the controlled direct effect in the presence of mediator-outcome confounders affected by the exposure. We discuss general strategies that can be used to minimize Monte Carlo error, and to serve as checks on the simulation program to avoid coding errors. R programming code is provided illustrating the application of our pseudocode in these settings."}, "https://arxiv.org/abs/2406.15288": {"title": "Difference-in-Differences with Time-Varying Covariates in the Parallel Trends Assumption", "link": "https://arxiv.org/abs/2406.15288", "description": "arXiv:2406.15288v1 Announce Type: new \nAbstract: In this paper, we study difference-in-differences identification and estimation strategies where the parallel trends assumption holds after conditioning on time-varying covariates and/or time-invariant covariates. Our first main contribution is to point out a number of weaknesses of commonly used two-way fixed effects (TWFE) regressions in this context. In addition to issues related to multiple periods and variation in treatment timing that have been emphasized in the literature, we show that, even in the case with only two time periods, TWFE regressions are not generally robust to (i) paths of untreated potential outcomes depending on the level of time-varying covariates (as opposed to only the change in the covariates over time), (ii) paths of untreated potential outcomes depending on time-invariant covariates, and (iii) violations of linearity conditions for outcomes over time and/or the propensity score. Even in cases where none of the previous three issues hold, we show that TWFE regressions can suffer from negative weighting and weight-reversal issues. Thus, TWFE regressions can deliver misleading estimates of causal effect parameters in a number of empirically relevant cases. Second, we extend these arguments to the case of multiple periods and variation in treatment timing. Third, we provide simple diagnostics for assessing the extent of misspecification bias arising due to TWFE regressions. Finally, we propose alternative (and simple) estimation strategies that can circumvent these issues with two-way fixed effects regressions."}, "https://arxiv.org/abs/2406.14753": {"title": "A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms", "link": "https://arxiv.org/abs/2406.14753", "description": "arXiv:2406.14753v1 Announce Type: cross \nAbstract: We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of our approach and derive an algorithm based on a specific instance of this approach. Our empirical results demonstrate the significant benefits of our approach."}, "https://arxiv.org/abs/2406.14808": {"title": "On the estimation rate of Bayesian PINN for inverse problems", "link": "https://arxiv.org/abs/2406.14808", "description": "arXiv:2406.14808v1 Announce Type: cross \nAbstract: Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. 
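The Monte Carlo integration entry above computes true estimand values directly from the data-generating mechanism; its first scenario is a simple three-variable simulation targeting the marginally adjusted odds ratio. A minimal Python sketch of that recipe, where the logistic data-generating model and its coefficients are illustrative assumptions rather than the paper's own:

```python
import numpy as np

rng = np.random.default_rng(2024)
M = 2_000_000  # large Monte Carlo sample keeps the Monte Carlo error small

def expit(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative mechanism: confounder C, binary exposure A, binary outcome Y.
c = rng.normal(size=M)
# Counterfactual risks under do(A=1) and do(A=0): set A, keep C as drawn.
p_y1 = expit(-1.0 + 0.8 * 1 + 0.5 * c)   # risk had everyone been exposed
p_y0 = expit(-1.0 + 0.8 * 0 + 0.5 * c)   # risk had no one been exposed

# Marginal (standardized) risks: average over the confounder distribution.
r1, r0 = p_y1.mean(), p_y0.mean()
true_marginal_or = (r1 / (1 - r1)) / (r0 / (1 - r0))
print(f"true marginally adjusted OR ~ {true_marginal_or:.4f}")
# Note: the conditional OR exp(0.8) differs from the marginal OR because
# the odds ratio is non-collapsible; Monte Carlo integration gives the
# marginal target without an analytical derivation.
```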
Although several architectures exist for PINNs that work remarkably well in practice, our theoretical understanding of their performance is somewhat limited. In this work, we study the behavior of a Bayesian PINN estimator of the solution of a PDE from $n$ independent noisy measurements of the solution. We focus on a class of equations that are linear in their parameters (with unknown coefficients $\\theta_\\star$). We show that when the partial differential equation admits a classical solution (say $u_\\star$), differentiable to order $\\beta$, the mean square error of the Bayesian posterior mean is at least of order $n^{-2\\beta/(2\\beta + d)}$. Furthermore, we establish a convergence rate of the linear coefficients of $\\theta_\\star$ depending on the order of the underlying differential operator. Last but not least, our theoretical results are validated through extensive simulations."}, "https://arxiv.org/abs/2406.15195": {"title": "Multiscale modelling of animal movement with persistent dynamics", "link": "https://arxiv.org/abs/2406.15195", "description": "arXiv:2406.15195v1 Announce Type: cross \nAbstract: Wild animals are commonly fitted with trackers that record their position through time, to learn about their behaviour. Broadly, statistical models for tracking data often fall into two categories: local models focus on describing small-scale movement decisions, and global models capture large-scale spatial distributions. Due to this dichotomy, it is challenging to describe mathematically how animals' distributions arise from their short-term movement patterns, and to combine data sets collected at different scales. We propose a multiscale model of animal movement and space use based on the underdamped Langevin process, widely used in statistical physics. The model is convenient to describe animal movement for three reasons: it is specified in continuous time (such that its parameters are not dependent on an arbitrary time scale), its speed and direction are autocorrelated (similarly to real animal trajectories), and it has a closed-form stationary distribution that we can view as a model of long-term space use. We use the common form of a resource selection function for the stationary distribution, to model the environmental drivers behind the animal's movement decisions. We further increase flexibility by allowing movement parameters to be time-varying, e.g., to account for daily cycles in an animal's activity. We formulate the model as a state-space model and present a method of inference based on the Kalman filter. The approach requires discretising the continuous-time process, and we use simulations to investigate performance for various time resolutions of observation. The approach works well at fine resolutions, though the estimated stationary distribution tends to be too flat when time intervals between observations are very long."}, "https://arxiv.org/abs/2406.15311": {"title": "The disruption index suffers from citation inflation and is confounded by shifts in scholarly citation practice", "link": "https://arxiv.org/abs/2406.15311", "description": "arXiv:2406.15311v1 Announce Type: cross \nAbstract: Measuring the rate of innovation in academia and industry is fundamental to monitoring the efficiency and competitiveness of the knowledge economy. To this end, a disruption index (CD) was recently developed and applied to publication and patent citation networks (Wu et al., Nature 2019; Park et al., Nature 2023). 
Here we show that CD systematically decreases over time due to secular growth in research and patent production, following two distinct mechanisms unrelated to innovation -- one behavioral and the other structural. Whereas the behavioral explanation reflects shifts associated with techno-social factors (e.g. self-citation practices), the structural explanation follows from `citation inflation' (CI), an inextricable feature of real citation networks attributable to increasing reference list lengths, which causes CD to systematically decrease. We demonstrate this causal link by way of mathematical deduction, computational simulation, multi-variate regression, and quasi-experimental comparison of the disruptiveness of PNAS versus PNAS Plus articles, which differ only in their lengths. Accordingly, we analyze CD data available in the SciSciNet database and find that disruptiveness incrementally increased from 2005-2015, and that the negative relationship between disruption and team-size is remarkably small in overall magnitude effect size, and shifts from negative to positive for team size $\\geq$ 8 coauthors."}, "https://arxiv.org/abs/2107.09235": {"title": "Distributional Effects with Two-Sided Measurement Error: An Application to Intergenerational Income Mobility", "link": "https://arxiv.org/abs/2107.09235", "description": "arXiv:2107.09235v3 Announce Type: replace \nAbstract: This paper considers identification and estimation of distributional effect parameters that depend on the joint distribution of an outcome and another variable of interest (\"treatment\") in a setting with \"two-sided\" measurement error -- that is, where both variables are possibly measured with error. Examples of these parameters in the context of intergenerational income mobility include transition matrices, rank-rank correlations, and the poverty rate of children as a function of their parents' income, among others. Building on recent work on quantile regression (QR) with measurement error in the outcome (particularly, Hausman, Liu, Luo, and Palmer (2021)), we show that, given (i) two linear QR models separately for the outcome and treatment conditional on other observed covariates and (ii) assumptions about the measurement error for each variable, one can recover the joint distribution of the outcome and the treatment. Besides these conditions, our approach does not require an instrument, repeated measurements, or distributional assumptions about the measurement error. Using recent data from the 1997 National Longitudinal Study of Youth, we find that accounting for measurement error notably reduces several estimates of intergenerational mobility parameters."}, "https://arxiv.org/abs/2206.01824": {"title": "Estimation of Over-parameterized Models from an Auto-Modeling Perspective", "link": "https://arxiv.org/abs/2206.01824", "description": "arXiv:2206.01824v3 Announce Type: replace \nAbstract: From a model-building perspective, we propose a paradigm shift for fitting over-parameterized models. Philosophically, the mindset is to fit models to future observations rather than to the observed sample. Technically, given an imputation method to generate future observations, we fit over-parameterized models to these future observations by optimizing an approximation of the desired expected loss function based on its sample counterpart and an adaptive $\\textit{duality function}$. The required imputation method is also developed using the same estimation technique with an adaptive $m$-out-of-$n$ bootstrap approach. 
We illustrate its applications with the many-normal-means problem, $n < p$ linear regression, and neural network-based image classification of MNIST digits. The numerical results demonstrate its superior performance across these diverse applications. While primarily expository, the paper conducts an in-depth investigation into the theoretical aspects of the topic. It concludes with remarks on some open problems."}, "https://arxiv.org/abs/2210.17063": {"title": "Shrinkage Methods for Treatment Choice", "link": "https://arxiv.org/abs/2210.17063", "description": "arXiv:2210.17063v2 Announce Type: replace \nAbstract: This study examines the problem of determining whether to treat individuals based on observed covariates. The most common decision rule is the conditional empirical success (CES) rule proposed by Manski (2004), which assigns individuals to treatments that yield the best experimental outcomes conditional on the observed covariates. Conversely, using shrinkage estimators, which shrink unbiased but noisy preliminary estimates toward the average of these estimates, is a common approach in statistical estimation problems because it is well-known that shrinkage estimators have smaller mean squared errors than unshrunk estimators. Inspired by this idea, we propose a computationally tractable shrinkage rule that selects the shrinkage factor by minimizing the upper bound of the maximum regret. Then, we compare the maximum regret of the proposed shrinkage rule with that of CES and pooling rules when the parameter space is correctly specified or misspecified. Our theoretical results demonstrate that the shrinkage rule performs well in many cases and these findings are further supported by numerical experiments. Specifically, we show that the maximum regret of the shrinkage rule can be strictly smaller than that of the CES and pooling rules in certain cases when the parameter space is correctly specified. In addition, we find that the shrinkage rule is robust against misspecifications of the parameter space. Finally, we apply our method to experimental data from the National Job Training Partnership Act Study."}, "https://arxiv.org/abs/2212.11442": {"title": "Small-time approximation of the transition density for diffusions with singularities", "link": "https://arxiv.org/abs/2212.11442", "description": "arXiv:2212.11442v2 Announce Type: replace \nAbstract: The Wright-Fisher (W-F) diffusion model serves as a foundational framework for interpreting population evolution through allele frequency dynamics over time. Despite the known transition probability between consecutive generations, an exact analytical expression for the transition density at arbitrary time intervals remains elusive. Commonly utilized distributions such as Gaussian or Beta inadequately address the fixation issue at extreme allele frequencies (0 or 1), particularly for short periods. In this study, we introduce two alternative parametric functions, namely the Asymptotic Expansion (AE) and the Gaussian approximation (GaussA), derived through probabilistic methodologies, aiming to better approximate this density. The AE function provides a suitable density for allele frequency distributions, encompassing extreme values within the interval [0,1]. Additionally, we outline the range of validity for the GaussA approximation. While our primary focus is on W-F diffusion, we demonstrate how our findings extend to other diffusion models featuring singularities. 
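The Wright-Fisher entry above validates the proposed transition-density approximations against simulated allele-frequency paths. A minimal sketch of such a simulation, assuming the neutral W-F diffusion dX_t = sqrt(X_t (1 - X_t)) dB_t (time measured in units of 2N generations), a plain Euler-Maruyama scheme, and absorption at the boundaries; the initial frequency, horizon, and step size are illustrative choices, not the paper's:

```python
import numpy as np

def simulate_wf_paths(x0, t, n_paths=50_000, n_steps=1_000, seed=1):
    """Euler-Maruyama paths of the neutral Wright-Fisher diffusion
    dX = sqrt(X(1-X)) dB, started at x0 and run until time t.
    Paths are absorbed at 0 and 1, mimicking allele loss/fixation."""
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        free = (x > 0.0) & (x < 1.0)             # only unabsorbed paths move
        dw = rng.normal(scale=np.sqrt(dt), size=free.sum())
        x[free] = np.clip(x[free] + np.sqrt(x[free] * (1.0 - x[free])) * dw, 0.0, 1.0)
    return x

xT = simulate_wf_paths(x0=0.2, t=0.1)
print("P(lost)   ~", np.mean(xT == 0.0))
print("P(fixed)  ~", np.mean(xT == 1.0))
print("mean freq ~", xT.mean())   # ~ 0.2: the neutral diffusion is a martingale
```

The simulated endpoint frequencies, including the point masses at 0 and 1, are what a candidate transition-density approximation would then be compared against.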
Through simulations of allele frequencies under a W-F process and employing a recently developed adaptive density estimation method, we conduct a comparative analysis to assess the fit of the proposed densities against the Beta and Gaussian distributions."}, "https://arxiv.org/abs/2302.12648": {"title": "New iterative algorithms for estimation of item functioning", "link": "https://arxiv.org/abs/2302.12648", "description": "arXiv:2302.12648v2 Announce Type: replace \nAbstract: When the item functioning of multi-item measurement is modeled with three- or four-parameter models, parameter estimation may become challenging. Effective algorithms are crucial in such scenarios. This paper explores innovations in parameter estimation in generalized logistic regression models, which may be used in item response modeling to account for guessing/pretending or slipping/dissimulation and for the effect of covariates. We introduce a new implementation of the EM algorithm and propose a new algorithm based on the parametrized link function. The two novel iterative algorithms are compared to existing methods in a simulation study. Additionally, the study examines software implementation, including the specification of initial values for numerical algorithms, as well as asymptotic properties and the estimation of standard errors. Overall, the newly proposed algorithm based on the parametrized link function outperforms other procedures, especially for small sample sizes. Moreover, the newly implemented EM algorithm provides additional information regarding respondents' inclination to guess or pretend and slip or dissimulate when answering the item. The study also discusses applications of the methods in the context of the detection of differential item functioning. Methods are demonstrated using real data from psychological and educational assessments."}, "https://arxiv.org/abs/2306.10221": {"title": "Dynamic Modeling of Sparse Longitudinal Data and Functional Snippets With Stochastic Differential Equations", "link": "https://arxiv.org/abs/2306.10221", "description": "arXiv:2306.10221v2 Announce Type: replace \nAbstract: Sparse functional/longitudinal data have attracted widespread interest due to the prevalence of such data in social and life sciences. A prominent scenario where such data are routinely encountered is accelerated longitudinal studies, where subjects are enrolled in the study at a random time and are only tracked for a short amount of time relative to the domain of interest. The statistical analysis of such functional snippets is challenging since information for the far-off-diagonal regions of the covariance structure is missing. Our main methodological contribution is to address this challenge by bypassing covariance estimation and instead modeling the underlying process as the solution of a data-adaptive stochastic differential equation. Taking advantage of the interface between Gaussian functional data and stochastic differential equations makes it possible to efficiently reconstruct the target process by estimating its dynamic distribution. The proposed approach allows one to consistently recover forward sample paths from functional snippets at the subject level. We establish the existence and uniqueness of the solution to the proposed data-driven stochastic differential equation and derive rates of convergence for the corresponding estimators. 
The finite-sample performance is demonstrated with simulation studies and functional snippets arising from a growth study and spinal bone mineral density data."}, "https://arxiv.org/abs/2308.15770": {"title": "Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data", "link": "https://arxiv.org/abs/2308.15770", "description": "arXiv:2308.15770v3 Announce Type: replace \nAbstract: Concentrations of pathogen genomes measured in wastewater have recently become available as a new data source to use when modeling the spread of infectious diseases. One promising use for this data source is inference of the effective reproduction number, the average number of individuals a newly infected person will infect. We propose a model where new infections arrive according to a time-varying immigration rate which can be interpreted as an average number of secondary infections produced by one infectious individual per unit time. This model allows us to estimate the effective reproduction number from concentrations of pathogen genomes while avoiding difficult to verify assumptions about the dynamics of the susceptible population. As a byproduct of our primary goal, we also produce a new model for estimating the effective reproduction number from case data using the same framework. We test this modeling framework in an agent-based simulation study with a realistic data generating mechanism which accounts for the time-varying dynamics of pathogen shedding. Finally, we apply our new model to estimating the effective reproduction number of SARS-CoV-2 in Los Angeles, California, using pathogen RNA concentrations collected from a large wastewater treatment facility."}, "https://arxiv.org/abs/2211.14692": {"title": "Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes", "link": "https://arxiv.org/abs/2211.14692", "description": "arXiv:2211.14692v4 Announce Type: replace-cross \nAbstract: In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes (RadGP), a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbors within a predetermined radius. We prove that any radial neighbors Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. 
We offer further empirical validation of our approach via applications on simulated and real world data showing excellent performance in both prior and posterior approximations to the original Gaussian process."}, "https://arxiv.org/abs/2311.12717": {"title": "Phylogenetic least squares estimation without genetic distances", "link": "https://arxiv.org/abs/2311.12717", "description": "arXiv:2311.12717v2 Announce Type: replace-cross \nAbstract: Least squares estimation of phylogenies is an established family of methods with good statistical properties. State-of-the-art least squares phylogenetic estimation proceeds by first estimating a distance matrix, which is then used to determine the phylogeny by minimizing a squared-error loss function. Here, we develop a method for least squares phylogenetic inference that does not rely on a pre-estimated distance matrix. Our approach allows us to circumvent the typical need to first estimate a distance matrix by forming a new loss function inspired by the phylogenetic likelihood score function; in this manner, inference is not based on a summary statistic of the sequence data, but directly on the sequence data itself. We use a Jukes-Cantor substitution model to show that our method leads to improvements over ordinary least squares phylogenetic inference, and is even observed to rival maximum likelihood estimation in terms of topology estimation efficiency. Using a Kimura 2-parameter model, we show that our method also allows for estimation of the global transition/transversion ratio simultaneously with the phylogeny and its branch lengths. This is impossible to accomplish with any other distance-based method as far as we know. Our developments pave the way for more optimal phylogenetic inference under the least squares framework, particularly in settings under which likelihood-based inference is infeasible, including when one desires to build a phylogeny based on information provided by only a subset of all possible nucleotide substitutions such as synonymous or non-synonymous substitutions."}, "https://arxiv.org/abs/2406.15573": {"title": "Sparse Bayesian multidimensional scaling(s)", "link": "https://arxiv.org/abs/2406.15573", "description": "arXiv:2406.15573v1 Announce Type: new \nAbstract: Bayesian multidimensional scaling (BMDS) is a probabilistic dimension reduction tool that allows one to model and visualize data consisting of dissimilarities between pairs of objects. Although BMDS has proven useful within, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require a burdensome order of $N^2$ floating-point operations, where $N$ is the number of data points. Thus, BMDS becomes impractical as $N$ grows large. We propose and compare two sparse versions of BMDS (sBMDS) that apply log-likelihood and gradient computations to subsets of the observed dissimilarity matrix data. Landmark sBMDS (L-sBMDS) extracts columns, while banded sBMDS (B-sBMDS) extracts diagonals of the data. These sparse variants let one specify a time complexity between $N^2$ and $N$. Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency across all models using both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. 
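The sparse BMDS entry above cuts the O(N^2) likelihood cost by evaluating only part of the observed dissimilarity matrix: selected columns for landmark sBMDS, the first off-diagonals for banded sBMDS. A minimal sketch of the banded idea under a simplified Gaussian error model (BMDS itself typically uses a truncated-normal likelihood; the band count, noise level, and latent dimension here are illustrative assumptions):

```python
import numpy as np

def banded_loglik(latent, dissim, sigma, n_bands):
    """Gaussian log-likelihood of observed dissimilarities, restricted to
    pairs (i, j) with 1 <= j - i <= n_bands (the first off-diagonals).
    Cost is O(N * n_bands) rather than O(N^2)."""
    n = latent.shape[0]
    ll = 0.0
    for k in range(1, n_bands + 1):
        i = np.arange(n - k)
        j = i + k
        dist = np.linalg.norm(latent[i] - latent[j], axis=1)
        resid = dissim[i, j] - dist
        ll += np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))
    return ll

rng = np.random.default_rng(3)
n, d, sigma = 500, 2, 0.1
x_true = rng.normal(size=(n, d))
full_dist = np.linalg.norm(x_true[:, None, :] - x_true[None, :, :], axis=-1)
dissim = full_dist + rng.normal(scale=sigma, size=(n, n))  # noisy dissimilarities

print(banded_loglik(x_true, dissim, sigma, n_bands=50))
```

In an MCMC setting, this banded evaluation (and its gradient) would simply replace the full-matrix likelihood inside the Metropolis-Hastings or Hamiltonian Monte Carlo update.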
We observe approximately 3-fold, 10-fold and 40-fold speedups with negligible loss of accuracy, when applying the sBMDS likelihoods and gradients to 500, 1,000 and 5,000 data points with 50 bands (landmarks); these speedups only increase with the size of data considered. Finally, we apply the sBMDS variants to the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks."}, "https://arxiv.org/abs/2406.15582": {"title": "Graphical copula GARCH modeling with dynamic conditional dependence", "link": "https://arxiv.org/abs/2406.15582", "description": "arXiv:2406.15582v1 Announce Type: new \nAbstract: Modeling returns on large portfolios is a challenging problem as the number of parameters in the covariance matrix grows as the square of the size of the portfolio. Traditional correlation models, for example, the dynamic conditional correlation (DCC)-GARCH model, often ignore the nonlinear dependencies in the tail of the return distribution. In this paper, we aim to develop a framework to model the nonlinear dependencies dynamically, namely the graphical copula GARCH (GC-GARCH) model. Motivated from the capital asset pricing model, to allow modeling of large portfolios, the number of parameters can be greatly reduced by introducing conditional independence among stocks given some risk factors. The joint distribution of the risk factors is factorized using a directed acyclic graph (DAG) with pair-copula construction (PCC) to enhance the modeling of the tails of the return distribution while offering the flexibility of having complex dependent structures. The DAG induces topological orders to the risk factors, which can be regarded as a list of directions of the flow of information. The conditional distributions among stock returns are also modeled using PCC. Dynamic conditional dependence structures are incorporated to allow the parameters in the copulas to be time-varying. Three-stage estimation is used to estimate parameters in the marginal distributions, the risk factor copulas, and the stock copulas. The simulation study shows that the proposed estimation procedure can estimate the parameters and the underlying DAG structure accurately. In the investment experiment of the empirical study, we demonstrate that the GC-GARCH model produces more precise conditional value-at-risk prediction and considerably higher cumulative portfolio returns than the DCC-GARCH model."}, "https://arxiv.org/abs/2406.15608": {"title": "Nonparametric FBST for Validating Linear Models", "link": "https://arxiv.org/abs/2406.15608", "description": "arXiv:2406.15608v1 Announce Type: new \nAbstract: The Full Bayesian Significance Test (FBST) possesses many desirable aspects, such as not requiring a non-zero prior probability for hypotheses while also producing a measure of evidence for $H_0$. Still, few attempts have been made to bring the FBST to nonparametric settings, with the main drawback being the need to obtain the highest posterior density (HPD) in a function space. 
In this work, we use Gaussian processes to provide an analytically tractable FBST for hypotheses of the type $$ H_0: g(\\boldsymbol{x}) = \\boldsymbol{b}(\\boldsymbol{x})\\boldsymbol{\\beta}, \\quad \\forall \\boldsymbol{x} \\in \\mathcal{X}, \\quad \\boldsymbol{\\beta} \\in \\mathbb{R}^k, $$ where $g(\\cdot)$ is the regression function, $\\boldsymbol{b}(\\cdot)$ is a vector of linearly independent linear functions -- such as $\\boldsymbol{b}(\\boldsymbol{x}) = \\boldsymbol{x}'$ -- and $\\mathcal{X}$ is the covariates' domain. We also make use of pragmatic hypotheses to verify if the adherence of linear models may be approximately instead of exactly true, allowing for the inclusion of valuable information such as measurement errors and utility judgments. This contribution extends the theory of the FBST, allowing its application in nonparametric settings and providing a procedure that easily tests if linear models are adequate for the data and that can automatically perform variable selection."}, "https://arxiv.org/abs/2406.15667": {"title": "Identification and Estimation of Causal Effects in High-Frequency Event Studies", "link": "https://arxiv.org/abs/2406.15667", "description": "arXiv:2406.15667v1 Announce Type: new \nAbstract: We provide precise conditions for nonparametric identification of causal effects by high-frequency event study regressions, which have been used widely in the recent macroeconomics, financial economics and political economy literatures. The high-frequency event study method regresses changes in an outcome variable on a measure of unexpected changes in a policy variable in a narrow time window around an event or a policy announcement (e.g., a 30-minute window around an FOMC announcement). We show that, contrary to popular belief, the narrow size of the window is not sufficient for identification. Rather, the population regression coefficient identifies a causal estimand when (i) the effect of the policy shock on the outcome does not depend on the other shocks (separability) and (ii) the surprise component of the news or event dominates all other shocks that are present in the event window (relative exogeneity). Technically, the latter condition requires the policy shock to have infinite variance in the event window. Under these conditions, we establish the causal meaning of the event study estimand corresponding to the regression coefficient and the consistency and asymptotic normality of the event study estimator. Notably, this standard linear regression estimator is robust to general forms of nonlinearity. We apply our results to Nakamura and Steinsson's (2018a) analysis of the real economic effects of monetary policy, providing a simple empirical procedure to analyze the extent to which the standard event study estimator adequately estimates causal effects of interest."}, "https://arxiv.org/abs/2406.15700": {"title": "Mixture of Directed Graphical Models for Discrete Spatial Random Fields", "link": "https://arxiv.org/abs/2406.15700", "description": "arXiv:2406.15700v1 Announce Type: new \nAbstract: Current approaches for modeling discrete-valued outcomes associated with spatially-dependent areal units incur computational and theoretical challenges, especially in the Bayesian setting when full posterior inference is desired. As an alternative, we propose a novel statistical modeling framework for this data setting, namely a mixture of directed graphical models (MDGMs). 
The components of the mixture, directed graphical models, can be represented by directed acyclic graphs (DAGs) and are computationally quick to evaluate. The DAGs representing the mixture components are selected to correspond to an undirected graphical representation of an assumed spatial contiguity/dependence structure of the areal units, which underlies the specification of traditional modeling approaches for discrete spatial processes such as Markov random fields (MRFs). We introduce the concept of compatibility to show how an undirected graph can be used as a template for the structural dependencies between areal units to create sets of DAGs which, as a collection, preserve the structural dependencies represented in the template undirected graph. We then introduce three classes of compatible DAGs and corresponding algorithms for fitting MDGMs based on these classes. In addition, we compare MDGMs to MRFs and a popular Bayesian MRF model approximation used in high-dimensional settings in a series of simulations and an analysis of ecometrics data collected as part of the Adolescent Health and Development in Context Study."}, "https://arxiv.org/abs/2406.15702": {"title": "Testing for Restricted Stochastic Dominance under Survey Nonresponse with Panel Data: Theory and an Evaluation of Poverty in Australia", "link": "https://arxiv.org/abs/2406.15702", "description": "arXiv:2406.15702v1 Announce Type: new \nAbstract: This paper lays the groundwork for a unifying approach to stochastic dominance testing under survey nonresponse that integrates the partial identification approach to incomplete data and design-based inference for complex survey data. We propose a novel inference procedure for restricted $s$th-order stochastic dominance, tailored to accommodate a broad spectrum of nonresponse assumptions. The method uses pseudo-empirical likelihood to formulate the test statistic and compares it to a critical value from the chi-squared distribution with one degree of freedom. We detail the procedure's asymptotic properties under both null and alternative hypotheses, establishing its uniform validity under the null and consistency against various alternatives. Using the Household, Income and Labour Dynamics in Australia survey, we demonstrate the procedure's utility in a sensitivity analysis of temporal poverty comparisons among Australian households."}, "https://arxiv.org/abs/2406.15844": {"title": "Bayesian modeling of multi-species labeling errors in ecological studies", "link": "https://arxiv.org/abs/2406.15844", "description": "arXiv:2406.15844v1 Announce Type: new \nAbstract: Ecological and conservation studies monitoring bird communities typically rely on species classification based on bird vocalizations. Historically, this has been based on expert volunteers going into the field and making lists of the bird species that they observe. Recently, machine learning algorithms have emerged that can accurately classify bird species based on audio recordings of their vocalizations. Such algorithms crucially rely on training data that are labeled by experts. Automated classification is challenging when multiple species are vocalizing simultaneously, there is background noise, and/or the bird is far from the microphone. In continuously monitoring different locations, the size of the audio data becomes immense and it is only possible for human experts to label a tiny proportion of the available data. In addition, experts can vary in their accuracy and breadth of knowledge about different species. 
This article focuses on the important problem of combining sparse expert annotations to improve bird species classification while providing uncertainty quantification. We are additionally interested in providing expert performance scores to increase their engagement and encourage improvements. We propose a Bayesian hierarchical modeling approach and evaluate this approach on a new community science platform developed in Finland."}, "https://arxiv.org/abs/2406.15912": {"title": "Clustering and Meta-Analysis Using a Mixture of Dependent Linear Tail-Free Priors", "link": "https://arxiv.org/abs/2406.15912", "description": "arXiv:2406.15912v1 Announce Type: new \nAbstract: We propose a novel nonparametric Bayesian approach for meta-analysis with event time outcomes. The model is an extension of linear dependent tail-free processes. The extension includes a modification to facilitate (conditionally) conjugate posterior updating and a hierarchical extension with a random partition of studies. The partition is formalized as a Dirichlet process mixture. The model development is motivated by a meta-analysis of cancer immunotherapy studies. The aim is to validate the use of relevant biomarkers in the design of immunotherapy studies. The hypothesis is about immunotherapy in general, rather than about a specific tumor type, therapy and marker. This broad hypothesis leads to a very diverse set of studies being included in the analysis and gives rise to substantial heterogeneity across studies."}, "https://arxiv.org/abs/2406.15933": {"title": "On the use of splines for representing ordered factors", "link": "https://arxiv.org/abs/2406.15933", "description": "arXiv:2406.15933v1 Announce Type: new \nAbstract: In the context of regression-type statistical models, the inclusion of some ordered factors among the explanatory variables requires the conversion of qualitative levels to numeric components of the linear predictor. The present note represents a follow-up to the methodology proposed by Azzalini (2023) for constructing numeric scores assigned to the factor levels. The aim of the present supplement is to allow additional flexibility in the mapping from ordered levels to numeric scores."}, "https://arxiv.org/abs/2406.16136": {"title": "Distribution-Free Online Change Detection for Low-Rank Images", "link": "https://arxiv.org/abs/2406.16136", "description": "arXiv:2406.16136v1 Announce Type: new \nAbstract: We present a distribution-free CUSUM procedure designed for online change detection in a time series of low-rank images, particularly when the change causes a mean shift. We represent images as matrix data and allow for temporal dependence, in addition to inherent spatial dependence, before and after the change. The marginal distributions are assumed to be general, not limited to any specific parametric distribution. We propose new monitoring statistics that utilize the low-rank structure of the in-control mean matrix. Additionally, we study the properties of the proposed detection procedure, assessing whether the monitoring statistics effectively capture a mean shift and evaluating the rate of increase in average run length relative to the control limit in both in-control and out-of-control cases. 
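The change-detection entry above builds CUSUM-type monitoring statistics around the low-rank structure of the in-control mean matrix. A minimal sketch of that general recipe, assuming the monitored scalar is the projection of each image onto the leading singular vectors of the in-control mean and that a standard one-sided CUSUM is run on it; the drift k, control limit h, Gaussian noise, and shift size are illustrative assumptions, not the paper's statistic:

```python
import numpy as np

rng = np.random.default_rng(7)
p, q, rank = 30, 30, 2
U = np.linalg.qr(rng.normal(size=(p, rank)))[0]
V = np.linalg.qr(rng.normal(size=(q, rank)))[0]
mean_ic = U @ np.diag([5.0, 3.0]) @ V.T           # in-control low-rank mean

u1, v1 = U[:, 0], V[:, 0]                         # leading singular vectors

def summary(img):
    # project the image onto the leading rank-1 direction of the in-control mean
    return u1 @ img @ v1

k, h = 0.25, 8.0                                  # CUSUM drift and control limit
baseline = summary(mean_ic)
s = 0.0
for t in range(1, 301):
    shift = 0.5 * np.outer(u1, v1) if t > 150 else 0.0   # mean shift after t = 150
    img = mean_ic + shift + rng.normal(scale=0.5, size=(p, q))
    s = max(0.0, s + (summary(img) - baseline) - k)      # one-sided CUSUM update
    if s > h:
        print("alarm at t =", t)
        break
```

Projecting onto the low-rank directions concentrates the shift into a single monitored scalar, which is the intuition behind exploiting the in-control mean's structure.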
The effectiveness of our procedure is demonstrated through simulated and real data experiments."}, "https://arxiv.org/abs/2406.16171": {"title": "Exploring the difficulty of estimating win probability: a simulation study", "link": "https://arxiv.org/abs/2406.16171", "description": "arXiv:2406.16171v1 Announce Type: new \nAbstract: Estimating win probability is one of the classic modeling tasks of sports analytics. Many widely used win probability estimators are statistical win probability models, which fit the relationship between a binary win/loss outcome variable and certain game-state variables using data-driven regression or machine learning approaches. To illustrate just how difficult it is to accurately fit a statistical win probability model from noisy and highly correlated observational data, in this paper we conduct a simulation study. We create a simplified random walk version of football in which true win probability at each game-state is known, and we see how well a model recovers it. We find that the dependence structure of observational play-by-play data substantially inflates the bias and variance of estimators and lowers the effective sample size. This makes it essential to quantify uncertainty in win probability estimates, but typical bootstrapped confidence intervals are too narrow and don't achieve nominal coverage. Hence, we introduce a novel method, the fractional bootstrap, to calibrate these intervals to achieve adequate coverage."}, "https://arxiv.org/abs/2406.16234": {"title": "Efficient estimation of longitudinal treatment effects using difference-in-differences and machine learning", "link": "https://arxiv.org/abs/2406.16234", "description": "arXiv:2406.16234v1 Announce Type: new \nAbstract: Difference-in-differences is based on a parallel trends assumption, which states that changes over time in average potential outcomes are independent of treatment assignment, possibly conditional on covariates. With time-varying treatments, parallel trends assumptions can identify many types of parameters, but most work has focused on group-time average treatment effects and similar parameters conditional on the treatment trajectory. This paper focuses instead on identification and estimation of the intervention-specific mean - the mean potential outcome had everyone been exposed to a proposed intervention - which may be directly policy-relevant in some settings. Previous estimators for this parameter under parallel trends have relied on correctly-specified parametric models, which may be difficult to guarantee in applications. We develop multiply-robust and efficient estimators of the intervention-specific mean based on the efficient influence function, and derive conditions under which data-adaptive machine learning methods can be used to relax modeling assumptions. Our approach allows the parallel trends assumption to be conditional on the history of time-varying covariates, thus allowing for adjustment for time-varying covariates possibly impacted by prior treatments. Simulation results support the use of the proposed methods at modest sample sizes. 
As an example, we estimate the effect of a hypothetical federal minimum wage increase on self-rated health in the US."}, "https://arxiv.org/abs/2406.16458": {"title": "Distance-based Chatterjee correlation: a new generalized robust measure of directed association for multivariate real and complex-valued data", "link": "https://arxiv.org/abs/2406.16458", "description": "arXiv:2406.16458v1 Announce Type: new \nAbstract: Building upon the Chatterjee correlation (2021: J. Am. Stat. Assoc. 116, p2009) for two real-valued variables, this study introduces a generalized measure of directed association between two vector variables, real or complex-valued, and of possibly different dimensions. The new measure is denoted as the \"distance-based Chatterjee correlation\", owing to the use here of the \"distance transformed data\" defined in Szekely et al (2007: Ann. Statist. 35, p2769) for the distance correlation. A main property of the new measure, inherited from the original Chatterjee correlation, is its predictive and asymmetric nature: it measures how well one variable can be predicted by the other, asymmetrically. This allows for inferring the causal direction of the association, by using the method of Blobaum et al (2019: PeerJ Comput. Sci. 1, e169). Since the original Chatterjee correlation is based on ranks, it is not available for complex variables, nor for general multivariate data. The novelty of our work is the extension to multivariate real and complex-valued pairs of vectors, offering a robust measure of directed association in a completely non-parametric setting. Informally, the intuitive assumption used here is that distance correlation is mathematically equivalent to Pearson's correlation when applied to \"distance transformed\" data. The next logical step is to compute Chatterjee's correlation on the same \"distance transformed\" data, thereby extending the analysis to multivariate vectors of real and complex-valued data. As a bonus, the new measure here is robust to outliers, which is not true for the distance correlation of Szekely et al. Additionally, this approach allows for inference regarding the causal direction of the association between the variables."}, "https://arxiv.org/abs/2406.16485": {"title": "Influence analyses of \"designs\" for evaluating inconsistency in network meta-analysis", "link": "https://arxiv.org/abs/2406.16485", "description": "arXiv:2406.16485v1 Announce Type: new \nAbstract: Network meta-analysis is an evidence synthesis method for comparative effectiveness analyses of multiple available treatments. To justify evidence synthesis, consistency is a relevant assumption; however, existing methods founded on statistical testing possibly have substantial limitations in statistical power or several drawbacks in treating multi-arm studies. Moreover, inconsistency is theoretically explained as design-by-treatment interactions, and the primary purpose of these analyses is prioritizing \"designs\" for further investigations to explore sources of biases and irregular issues that might influence the overall results. In this article, we propose an alternative framework for inconsistency evaluations using influence diagnostic methods that enable quantitative evaluations of the influences of individual designs on the overall results. We provide four new methods to quantify the influences of individual designs through a \"leave-one-design-out\" analysis framework. 
We also propose a simple summary measure, the O-value, for prioritizing designs and interpreting these influence analyses straightforwardly. Furthermore, we propose another testing approach based on the leave-one-design-out analysis framework. By applying the new methods to a network meta-analysis of antihypertensive drugs, we demonstrate that the new methods accurately locate potential sources of inconsistency. The proposed methods provide new insights into alternatives to existing test-based methods, especially quantifications of influences of individual designs on the overall network meta-analysis results."}, "https://arxiv.org/abs/2406.16507": {"title": "Statistical ranking with dynamic covariates", "link": "https://arxiv.org/abs/2406.16507", "description": "arXiv:2406.16507v1 Announce Type: new \nAbstract: We consider a covariate-assisted ranking model grounded in the Plackett--Luce framework. Unlike existing works focusing on pure covariates or individual effects with fixed covariates, our approach integrates individual effects with dynamic covariates. This added flexibility enhances realistic ranking yet poses significant challenges for analyzing the associated estimation procedures. This paper makes an initial attempt to address these challenges. We begin by discussing the necessary and sufficient condition for the model's identifiability. We then introduce an efficient alternating maximization algorithm to compute the maximum likelihood estimator (MLE). Under suitable assumptions on the topology of comparison graphs and dynamic covariates, we establish a quantitative uniform consistency result for the MLE with convergence rates characterized by the asymptotic graph connectivity. The proposed graph topology assumption holds for several popular random graph models under optimal leading-order sparsity conditions. A comprehensive numerical study is conducted to corroborate our theoretical findings and demonstrate the application of the proposed model to real-world datasets, including horse racing and tennis competitions."}, "https://arxiv.org/abs/2406.16523": {"title": "YEAST: Yet Another Sequential Test", "link": "https://arxiv.org/abs/2406.16523", "description": "arXiv:2406.16523v1 Announce Type: new \nAbstract: Large-scale randomised experiments have become a standard tool for developing products and improving user experience. To reduce losses from shipping harmful changes, experimental results are, in practice, often checked repeatedly, which leads to inflated false alarm rates. To alleviate this problem, one can use sequential testing techniques as they control false discovery rates despite repeated checks. While multiple sequential testing methods exist in the literature, they either restrict the number of interim checks the experimenter can perform or have tuning parameters that require calibration. In this paper, we propose a novel sequential testing method that does not limit the number of interim checks and at the same time does not have any tuning parameters. The proposed method is new and does not stem from existing experiment monitoring procedures. It controls false discovery rates by ``inverting'' a bound on the threshold crossing probability derived from a classical maximal inequality. We demonstrate both in simulations and using real-world data that the proposed method outperforms current state-of-the-art sequential tests for continuous test monitoring. 
In addition, we illustrate the method's effectiveness with a real-world application on a major online fashion platform."}, "https://arxiv.org/abs/2406.16820": {"title": "EFECT -- A Method and Metric to Assess the Reproducibility of Stochastic Simulation Studies", "link": "https://arxiv.org/abs/2406.16820", "description": "arXiv:2406.16820v1 Announce Type: new \nAbstract: Reproducibility is a foundational standard for validating scientific claims in computational research. Stochastic computational models are employed across diverse fields such as systems biology, financial modelling and environmental sciences. Existing infrastructure and software tools support various aspects of reproducible model development, application, and dissemination, but do not adequately address independently reproducing simulation results that form the basis of scientific conclusions. To bridge this gap, we introduce the Empirical Characteristic Function Equality Convergence Test (EFECT), a data-driven method to quantify the reproducibility of stochastic simulation results. EFECT employs empirical characteristic functions to compare reported results with those independently generated by assessing distributional inequality, termed the EFECT error, a metric that quantifies the likelihood of equality. Additionally, we establish the EFECT convergence point, a metric for determining the required number of simulation runs to achieve an EFECT error value of a priori statistical significance, setting a reproducibility benchmark. EFECT supports all real-valued and bounded results irrespective of the model or method that produced them, and accommodates stochasticity from intrinsic model variability and random sampling of model inputs. We tested EFECT with stochastic differential equations, agent-based models, and Boolean networks, demonstrating its broad applicability and effectiveness. EFECT standardizes stochastic simulation reproducibility, establishing a workflow that guarantees reliable results, supporting a wide range of stakeholders, and thereby enhancing validation of stochastic simulation studies across a model's lifecycle. To promote future standardization efforts, we are developing the open-source software library libSSR in diverse programming languages for easy integration of EFECT."}, "https://arxiv.org/abs/2406.16830": {"title": "Adjusting for Selection Bias Due to Missing Eligibility Criteria in Emulated Target Trials", "link": "https://arxiv.org/abs/2406.16830", "description": "arXiv:2406.16830v1 Announce Type: new \nAbstract: Target trial emulation (TTE) is a popular framework for observational studies based on electronic health records (EHR). A key component of this framework is determining the patient population eligible for inclusion in both a target trial of interest and its observational emulation. Missingness in variables that define eligibility criteria, however, presents a major challenge to determining the eligible population when emulating a target trial with an observational study. In practice, patients with incomplete data are almost always excluded from analysis despite the possibility of selection bias, which can arise when subjects with observed eligibility data are fundamentally different from excluded subjects. Despite this, to the best of our knowledge, very little work has been done to mitigate this concern. 
In this paper, we propose a novel conceptual framework to address selection bias in TTE studies, tailored towards time-to-event endpoints, and describe estimation and inferential procedures via inverse probability weighting (IPW). Under an EHR-based simulation infrastructure, developed to reflect the complexity of EHR data, we characterize common settings under which missing eligibility data poses the threat of selection bias and investigate the ability of the proposed methods to address it. Finally, using EHR databases from Kaiser Permanente, we demonstrate the use of our method to evaluate the effect of bariatric surgery on microvascular outcomes among a cohort of severely obese patients with Type II diabetes mellitus (T2DM)."}, "https://arxiv.org/abs/2406.16859": {"title": "On the extensions of the Chatterjee-Spearman test", "link": "https://arxiv.org/abs/2406.16859", "description": "arXiv:2406.16859v1 Announce Type: new \nAbstract: Chatterjee (2021) introduced a novel independence test that is rank-based, asymptotically normal and consistent against all alternatives. One limitation of Chatterjee's test is its low statistical power for detecting monotonic relationships. To address this limitation, in our previous work (Zhang, 2024, Commun. Stat. - Theory Methods), we proposed to combine Chatterjee's and Spearman's correlations into a max-type test and established the asymptotic joint normality. This work examines three key extensions of the combined test. First, motivated by its original asymmetric form, we extend the Chatterjee-Spearman test to a symmetric version, and derive the asymptotic null distribution of the symmetrized statistic. Second, we investigate the relationships between Chatterjee's correlation and other popular rank correlations, including Kendall's tau and quadrant correlation. We demonstrate that, under independence, Chatterjee's correlation and any of these rank correlations are asymptotically jointly normal and independent. Simulation studies demonstrate that the Chatterjee-Kendall test has better power than the Chatterjee-Spearman test. Finally, we explore two possible extensions to the multivariate case. These extensions expand the applicability of the rank-based combined tests to a broader range of scenarios."}, "https://arxiv.org/abs/2406.15514": {"title": "How big does a population need to be before demographers can ignore individual-level randomness in demographic events?", "link": "https://arxiv.org/abs/2406.15514", "description": "arXiv:2406.15514v1 Announce Type: cross \nAbstract: When studying a national-level population, demographers can safely ignore the effect of individual-level randomness on age-sex structure. When studying a single community, or group of communities, however, the potential importance of individual-level randomness is less clear. We seek to measure the effect of individual-level randomness in births and deaths on standard summary indicators of age-sex structure, for populations of different sizes, focusing on demographic conditions typical of historical populations. We conduct a microsimulation experiment where we simulate events and age-sex structure under a range of settings for demographic rates and population size. The experiment results suggest that individual-level randomness strongly affects age-sex structure for populations of about 100, but has a much smaller effect on populations of 1,000, and a negligible effect on populations of 10,000. 
Our conclusion is that analyses of age-sex structure in historical populations with sizes on the order of 100 must account for individual-level randomness in demographic events. Analyses of populations with sizes on the order of 1,000 may need to make some allowance for individual-level variation, but other issues, such as measurement error, probably deserve more attention. Analyses of populations of 10,000 can safely ignore individual-level variation."}, "https://arxiv.org/abs/2406.15522": {"title": "Statistical Inference and A/B Testing in Fisher Markets and Paced Auctions", "link": "https://arxiv.org/abs/2406.15522", "description": "arXiv:2406.15522v1 Announce Type: cross \nAbstract: We initiate the study of statistical inference and A/B testing for two market equilibrium models: linear Fisher market (LFM) equilibrium and first-price pacing equilibrium (FPPE). LFM arises from fair resource allocation systems such as allocation of food to food banks and notification opportunities to different types of notifications. For LFM, we assume that the data observed is captured by the classical finite-dimensional Fisher market equilibrium, and its steady-state behavior is modeled by a continuous limit Fisher market. The second type of equilibrium we study, FPPE, arises from internet advertising where advertisers are constrained by budgets and advertising opportunities are sold via first-price auctions. For platforms that use pacing-based methods to smooth out the spending of advertisers, FPPE provides a hindsight-optimal configuration of the pacing method. We propose a statistical framework for the FPPE model, in which a continuous limit FPPE models the steady-state behavior of the auction platform, and a finite FPPE provides the data to estimate primitives of the limit FPPE. Both LFM and FPPE have an Eisenberg-Gale convex program characterization, the pillar upon which we derive our statistical theory. We start by deriving basic convergence results for the finite market to the limit market. We then derive asymptotic distributions, and construct confidence intervals. Furthermore, we establish the asymptotic local minimax optimality of estimation based on finite markets. We then show that the theory can be used for conducting statistically valid A/B testing on auction platforms. Synthetic and semi-synthetic experiments verify the validity and practicality of our theory."}, "https://arxiv.org/abs/2406.15867": {"title": "Hedging in Sequential Experiments", "link": "https://arxiv.org/abs/2406.15867", "description": "arXiv:2406.15867v1 Announce Type: cross \nAbstract: Experimentation involves risk. The investigator expends time and money in the pursuit of data that supports a hypothesis. In the end, the investigator may find that all of these costs were for naught and the data fail to reject the null. Furthermore, the investigator may not be able to test other hypotheses with the same data set in order to avoid false positives due to p-hacking. Therefore, there is a need for a mechanism for investigators to hedge the risk of financial and statistical bankruptcy in the business of experimentation.\n In this work, we build on the game-theoretic statistics framework to enable an investigator to hedge their bets against the null hypothesis and thus avoid ruin. First, we describe a method by which the investigator's test martingale wealth process can be capitalized by solving for the risk-neutral price. 
Then, we show that a portfolio that comprises the risky test martingale and a risk-free process is still a test martingale, which enables the investigator to select a particular risk-return position using Markowitz portfolio theory. Finally, we show that a function that is a derivative of the test martingale process can be constructed and used as a hedging instrument by the investigator or as a speculative instrument by a risk-seeking investor who wants to participate in the potential returns of the uncertain experiment wealth process. Together, these instruments enable an investigator to hedge the risk of ruin and they enable an investigator to efficiently hedge experimental risk."}, "https://arxiv.org/abs/2406.15904": {"title": "Learning When the Concept Shifts: Confounding, Invariance, and Dimension Reduction", "link": "https://arxiv.org/abs/2406.15904", "description": "arXiv:2406.15904v1 Announce Type: cross \nAbstract: Practitioners often deploy a learned prediction model in a new environment where the joint distribution of covariate and response has shifted. In observational data, the distribution shift is often driven by unobserved confounding factors lurking in the environment, with the underlying mechanism unknown. Confounding can obfuscate the definition of the best prediction model (concept shift) and shift covariates to domains yet unseen (covariate shift). Therefore, a model maximizing prediction accuracy in the source environment could suffer a significant accuracy drop in the target environment. This motivates us to study the domain adaptation problem with observational data: given labeled covariate and response pairs from a source environment, and unlabeled covariates from a target environment, how can one predict the missing target response reliably? We root the adaptation problem in a linear structural causal model to address endogeneity and unobserved confounding. We study the necessity and benefit of leveraging exogenous, invariant covariate representations to cure concept shifts and improve target prediction. This further motivates a new representation learning method for adaptation that optimizes for a lower-dimensional linear subspace and, subsequently, a prediction model confined to that subspace. The procedure operates on a non-convex objective, one that naturally interpolates between predictability and stability/invariance, constrained on the Stiefel manifold. We study the optimization landscape and prove that, when the regularization is sufficient, nearly all local optima align with an invariant linear subspace resilient to both concept and covariate shift. In terms of predictability, we show that a model that uses the learned lower-dimensional subspace can incur a nearly ideal gap between target and source risk. Three real-world data sets are investigated to validate our method and theory."}, "https://arxiv.org/abs/2406.16174": {"title": "Comparison of methods for mediation analysis with multiple correlated mediators", "link": "https://arxiv.org/abs/2406.16174", "description": "arXiv:2406.16174v1 Announce Type: cross \nAbstract: Various methods have emerged for conducting mediation analyses with multiple correlated mediators, each with distinct strengths and limitations. However, a comparative evaluation of these methods is lacking, providing the motivation for this paper. This study examines six mediation analysis methods for multiple correlated mediators that provide insights into the contributors to health disparities. 
We assessed the performance of each method in identifying joint or path-specific mediation effects in the context of binary outcome variables, varying mediator types and levels of residual correlation between mediators. Through comprehensive simulations, the performance of six methods in estimating joint and/or path-specific mediation effects was assessed rigorously using a variety of metrics, including bias, mean squared error, coverage and width of the 95$\\%$ confidence intervals. Subsequently, these methods were applied to the REasons for Geographic And Racial Differences in Stroke (REGARDS) study, where differing conclusions were obtained depending on the mediation method employed. This evaluation provides valuable guidance for researchers grappling with complex multi-mediator scenarios, enabling them to select an optimal mediation method for their research question and dataset."}, "https://arxiv.org/abs/2406.16221": {"title": "F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data", "link": "https://arxiv.org/abs/2406.16221", "description": "arXiv:2406.16221v1 Announce Type: cross \nAbstract: Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stake sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset."}, "https://arxiv.org/abs/2406.16227": {"title": "VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data", "link": "https://arxiv.org/abs/2406.16227", "description": "arXiv:2406.16227v1 Announce Type: cross \nAbstract: Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including `omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. 
VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We demonstrate VICatMix's utility in integrative cluster analysis with different `omics datasets, enabling the discovery of novel subtypes.\n \\textbf{Availability:} VICatMix is freely available as an R package, incorporating C++ for faster computation, at \\url{https://github.com/j-ackierao/VICatMix}."}, "https://arxiv.org/abs/2406.16241": {"title": "Position: Benchmarking is Limited in Reinforcement Learning Research", "link": "https://arxiv.org/abs/2406.16241", "description": "arXiv:2406.16241v1 Announce Type: cross \nAbstract: Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking."}, "https://arxiv.org/abs/2406.16351": {"title": "METRIK: Measurement-Efficient Randomized Controlled Trials using Transformers with Input Masking", "link": "https://arxiv.org/abs/2406.16351", "description": "arXiv:2406.16351v1 Announce Type: cross \nAbstract: Clinical randomized controlled trials (RCTs) collect hundreds of measurements spanning various metric types (e.g., laboratory tests, cognitive/motor assessments, etc.) across 100s-1000s of subjects to evaluate the effect of a treatment, but do so at the cost of significant trial expense. To reduce the number of measurements, trial protocols can be revised to remove metrics extraneous to the study's objective, but doing so requires additional human labor and limits the set of hypotheses that can be studied with the collected data. In contrast, a planned missing design (PMD) can reduce the amount of data collected without removing any metric by imputing the unsampled data. Standard PMDs randomly sample data to leverage statistical properties of imputation algorithms, but are ad hoc, hence suboptimal. Methods that learn PMDs produce more sample-efficient PMDs, but are not suitable for RCTs because they require ample prior data (150+ subjects) to model the data distribution. Therefore, we introduce a framework called Measurement EfficienT Randomized Controlled Trials using Transformers with Input MasKing (METRIK), which, for the first time, calculates a PMD specific to the RCT from a modest amount of prior data (e.g., 60 subjects). 
Specifically, METRIK models the PMD as a learnable input masking layer that is optimized with a state-of-the-art imputer based on the Transformer architecture. METRIK implements a novel sampling and selection algorithm to generate a PMD that satisfies the trial designer's objective, i.e., whether to maximize sampling efficiency or imputation performance for a given sampling budget. Evaluated across five real-world clinical RCT datasets, METRIK increases the sampling efficiency of and imputation performance under the generated PMD by leveraging correlations over time and across metrics, thereby removing the need to manually remove metrics from the RCT."}, "https://arxiv.org/abs/2406.16605": {"title": "CLEAR: Can Language Models Really Understand Causal Graphs?", "link": "https://arxiv.org/abs/2406.16605", "description": "arXiv:2406.16605v1 Announce Type: cross \nAbstract: Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR."}, "https://arxiv.org/abs/2406.16708": {"title": "CausalFormer: An Interpretable Transformer for Temporal Causal Discovery", "link": "https://arxiv.org/abs/2406.16708", "description": "arXiv:2406.16708v1 Announce Type: cross \nAbstract: Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we proposed an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution which aggregates each input time series along the temporal dimension under the temporal priority constraint. 
Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality. Our code is available at https://github.com/lingbai-kong/CausalFormer."}, "https://arxiv.org/abs/2107.12420": {"title": "Semiparametric Estimation of Treatment Effects in Observational Studies with Heterogeneous Partial Interference", "link": "https://arxiv.org/abs/2107.12420", "description": "arXiv:2107.12420v3 Announce Type: replace \nAbstract: In many observational studies in social science and medicine, subjects or units are connected, and one unit's treatment and attributes may affect another's treatment and outcome, violating the stable unit treatment value assumption (SUTVA) and resulting in interference. To enable feasible estimation and inference, many previous works assume exchangeability of interfering units (neighbors). However, in many applications with distinctive units, interference is heterogeneous and needs to be modeled explicitly. In this paper, we focus on the partial interference setting, and only restrict units to be exchangeable conditional on observable characteristics. Under this framework, we propose generalized augmented inverse propensity weighted (AIPW) estimators for general causal estimands that include heterogeneous direct and spillover effects. We show that they are semiparametric efficient and robust to heterogeneous interference as well as model misspecifications. We apply our methods to the Add Health dataset to study the direct effects of alcohol consumption on academic performance and the spillover effects of parental incarceration on adolescent well-being."}, "https://arxiv.org/abs/2202.02903": {"title": "Difference in Differences with Time-Varying Covariates", "link": "https://arxiv.org/abs/2202.02903", "description": "arXiv:2202.02903v3 Announce Type: replace \nAbstract: This paper considers identification and estimation of causal effect parameters from participating in a binary treatment in a difference in differences (DID) setup when the parallel trends assumption holds after conditioning on observed covariates. Relative to existing work in the econometrics literature, we consider the case where the value of covariates can change over time and, potentially, where participating in the treatment can affect the covariates themselves. We propose new empirical strategies in both cases. We also consider two-way fixed effects (TWFE) regressions that include time-varying regressors, which is the most common way that DID identification strategies are implemented under conditional parallel trends. 
We show that, even in the case with only two time periods, these TWFE regressions are not generally robust to (i) time-varying covariates being affected by the treatment, (ii) treatment effects and/or paths of untreated potential outcomes depending on the level of time-varying covariates in addition to only the change in the covariates over time, (iii) treatment effects and/or paths of untreated potential outcomes depending on time-invariant covariates, (iv) treatment effect heterogeneity with respect to observed covariates, and (v) violations of strong functional form assumptions, both for outcomes over time and the propensity score, that are unlikely to be plausible in most DID applications. Thus, TWFE regressions can deliver misleading estimates of causal effect parameters in a number of empirically relevant cases. We propose both doubly robust estimands and regression adjustment/imputation strategies that are robust to these issues while not being substantially more challenging to implement."}, "https://arxiv.org/abs/2301.05743": {"title": "Re-thinking Spatial Confounding in Spatial Linear Mixed Models", "link": "https://arxiv.org/abs/2301.05743", "description": "arXiv:2301.05743v2 Announce Type: replace \nAbstract: In the last two decades, considerable research has been devoted to a phenomenon known as spatial confounding. Spatial confounding is thought to occur when there is multicollinearity between a covariate and the random effect in a spatial regression model. This multicollinearity is considered highly problematic when the inferential goal is estimating regression coefficients and various methodologies have been proposed to attempt to alleviate it. Recently, it has become apparent that many of these methodologies are flawed, yet the field continues to expand. In this paper, we offer a novel perspective of synthesizing the work in the field of spatial confounding. We propose that at least two distinct phenomena are currently conflated with the term spatial confounding. We refer to these as the ``analysis model'' and the ``data generation'' types of spatial confounding. We show that these two issues can lead to contradicting conclusions about whether spatial confounding exists and whether methods to alleviate it will improve inference. Our results also illustrate that in most cases, traditional spatial linear mixed models do help to improve inference on regression coefficients. Drawing on the insights gained, we offer a path forward for research in spatial confounding."}, "https://arxiv.org/abs/2302.12111": {"title": "Communication-Efficient Distributed Estimation and Inference for Cox's Model", "link": "https://arxiv.org/abs/2302.12111", "description": "arXiv:2302.12111v3 Announce Type: replace \nAbstract: Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. 
In addition, we provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods."}, "https://arxiv.org/abs/2304.01921": {"title": "Individual Welfare Analysis: Random Quasilinear Utility, Independence, and Confidence Bounds", "link": "https://arxiv.org/abs/2304.01921", "description": "arXiv:2304.01921v3 Announce Type: replace \nAbstract: We introduce a novel framework for individual-level welfare analysis. It builds on a parametric model for continuous demand with a quasilinear utility function, allowing for heterogeneous coefficients and unobserved individual-good-level preference shocks. We obtain bounds on the individual-level consumer welfare loss at any confidence level due to a hypothetical price increase, solving a scalable optimization problem constrained by a novel confidence set under an independence restriction. This confidence set is computationally simple and robust to weak instruments, nonlinearity, and partial identification. The validity of the confidence set is guaranteed by our new results on the joint limiting distribution of the independence test by Chatterjee (2021). These results, together with the confidence set, may have applications beyond welfare analysis. Monte Carlo simulations and two empirical applications on gasoline and food demand demonstrate the effectiveness of our method."}, "https://arxiv.org/abs/2305.16018": {"title": "Accommodating informative visit times for analysing irregular longitudinal data: a sensitivity analysis approach with balancing weights estimators", "link": "https://arxiv.org/abs/2305.16018", "description": "arXiv:2305.16018v2 Announce Type: replace \nAbstract: Irregular longitudinal data with informative visit times arise when patients' visits are partly driven by concurrent disease outcomes. However, existing methods, such as inverse intensity weighting (IIW), often overlook or have not adequately assessed the influence of informative visit times on estimation and inference. Based on novel balancing weights estimators, we propose a new sensitivity analysis approach to addressing informative visit times within the IIW framework. The balancing weights are obtained by balancing observed history variable distributions over time and including a selection function with specified sensitivity parameters to characterise the additional influence of the concurrent outcome on the visit process. A calibration procedure is proposed to anchor the range of the sensitivity parameters to the amount of variation in the visit process that could be additionally explained by the concurrent outcome given the observed history and time. Simulations demonstrate that our balancing weights estimators outperform existing weighted estimators for robustness and efficiency. 
We provide an R Markdown tutorial of the proposed methods and apply them to analyse data from a clinic-based cohort of psoriatic arthritis."}, "https://arxiv.org/abs/2308.01704": {"title": "Similarity-based Random Partition Distribution for Clustering Functional Data", "link": "https://arxiv.org/abs/2308.01704", "description": "arXiv:2308.01704v3 Announce Type: replace \nAbstract: Random partition distribution is a crucial tool for model-based clustering. This study advances the field of random partition in the context of functional spatial data, focusing on the challenges posed by hourly population data across various regions and dates. We propose an extended generalized Dirichlet process, named the similarity-based generalized Dirichlet process (SGDP), to address the limitations of simple random partition distributions (e.g., those induced by the Dirichlet process), such as an overabundance of clusters. This model prevents excess cluster production as well as incorporates pairwise similarity information to ensure accurate and meaningful grouping. The theoretical properties of the SGDP are studied. Then, SGDP-based random partition is applied to a real-world dataset of hourly population flow in $500\\text{m}^2$ meshes in the central part of Tokyo. In this empirical context, our method excels at detecting meaningful patterns in the data while accounting for spatial nuances. The results underscore the adaptability and utility of the method, showcasing its prowess in revealing intricate spatiotemporal dynamics. The proposed SGDP will significantly contribute to urban planning, transportation, and policy-making and will be a helpful tool for understanding population dynamics and their implications."}, "https://arxiv.org/abs/2308.15681": {"title": "Scalable Composite Likelihood Estimation of Probit Models with Crossed Random Effects", "link": "https://arxiv.org/abs/2308.15681", "description": "arXiv:2308.15681v2 Announce Type: replace \nAbstract: Crossed random effects structures arise in many scientific contexts. They raise severe computational problems with likelihood computations scaling like $N^{3/2}$ or worse for $N$ data points. In this paper we develop a new composite likelihood approach for crossed random effects probit models. For data arranged in R rows and C columns, the likelihood function includes a very difficult R + C dimensional integral. The composite likelihood we develop uses the marginal distribution of the response along with two hierarchical models. The cost is reduced to $\\mathcal{O}(N)$ and it can be computed with $R + C$ one dimensional integrals. We find that the commonly used Laplace approximation has a cost that grows superlinearly. We get consistent estimates of the probit slope and variance components from our composite likelihood algorithm. We also show how to estimate the covariance of the estimated regression coefficients. The algorithm scales readily to a data set of five million observations from Stitch Fix with $R + C > 700{,}000$."}, "https://arxiv.org/abs/2308.15986": {"title": "Sensitivity Analysis of Inverse Probability Weighting Estimators of Causal Effects in Observational Studies with Multivalued Treatments", "link": "https://arxiv.org/abs/2308.15986", "description": "arXiv:2308.15986v4 Announce Type: replace \nAbstract: One of the fundamental challenges in drawing causal inferences from observational studies is that the assumption of no unmeasured confounding is not testable from observed data. 
Therefore, assessing sensitivity to this assumption's violation is important to obtain valid causal conclusions in observational studies. Although several sensitivity analysis frameworks are available in the causal inference literature, very few of them are applicable to observational studies with multivalued treatments. To address this issue, we propose a framework for performing sensitivity analysis in multivalued treatment settings. Within this framework, we propose a general class of additive causal estimands. We demonstrate that the estimation of the causal estimands under the proposed sensitivity model can be performed very efficiently. Simulation results show that the proposed framework performs well in terms of bias of the point estimates and coverage of the confidence intervals when there is sufficient overlap in the covariate distributions. We illustrate the application of our proposed method by conducting an observational study that estimates the causal effect of fish consumption on blood mercury levels."}, "https://arxiv.org/abs/2310.13764": {"title": "Statistical Inference for Bures-Wasserstein Flows", "link": "https://arxiv.org/abs/2310.13764", "description": "arXiv:2310.13764v2 Announce Type: replace \nAbstract: We develop a statistical framework for conducting inference on collections of time-varying covariance operators (covariance flows) over a general, possibly infinite dimensional, Hilbert space. We model the intrinsically non-linear structure of covariances by means of the Bures-Wasserstein metric geometry. We make use of the Riemannian-like structure induced by this metric to define a notion of mean and covariance of a random flow, and develop an associated Karhunen-Lo\\`eve expansion. We then treat the problem of estimation and construction of functional principal components from a finite collection of covariance flows, observed fully or irregularly.\n Our theoretical results are motivated by modern problems in functional data analysis, where one observes operator-valued random processes -- for instance when analysing dynamic functional connectivity and fMRI data, or when analysing multiple functional time series in the frequency domain. Nevertheless, our framework is also novel in the finite-dimensional (matrix) case, and we demonstrate what simplifications can be afforded then. We illustrate our methodology by means of simulations and data analyses."}, "https://arxiv.org/abs/2311.13410": {"title": "Assessing the Unobserved: Enhancing Causal Inference in Sociology with Sensitivity Analysis", "link": "https://arxiv.org/abs/2311.13410", "description": "arXiv:2311.13410v2 Announce Type: replace \nAbstract: Explaining social events is a primary objective of applied data-driven sociology. To achieve that objective, many sociologists use statistical causal inference to identify causality using observational studies, a research context where the analyst does not control the data-generating process. However, it is often challenging in observational studies to satisfy the no-unmeasured-confounding assumption, namely, that there is no lurking third variable affecting the causal relationship of interest. In this article, we develop a framework enabling sociologists to employ a different strategy to enhance the quality of observational studies. 
Our framework builds on a surprisingly simple statistical approach, sensitivity analysis: a thought-experimental framework where the analyst imagines a lever, which they can pull to probe a variety of theoretically driven statistical magnitudes of posited unmeasured confounding, which in turn distorts the causal effect of interest. By pulling that lever, the analyst can identify how strong an unmeasured confounder must be to wash away the estimated causal effect. Although each sensitivity analysis method requires its own assumptions, this sort of post-hoc analysis provides underutilized tools to bound causal quantities. Extending Lundberg et al., we develop a five-step approach to how applied sociological research can incorporate sensitivity analysis, empowering scholars to rejuvenate causal inference in observational studies."}, "https://arxiv.org/abs/2311.14889": {"title": "Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data", "link": "https://arxiv.org/abs/2311.14889", "description": "arXiv:2311.14889v2 Announce Type: replace \nAbstract: In this paper we review recent advances in statistical methods for the evaluation of the heterogeneity of treatment effects (HTE), including subgroup identification and estimation of individualized treatment regimens, from randomized clinical trials and observational studies. We identify several types of approaches using the features introduced in Lipkovich, Dmitrienko and D'Agostino (2017) that distinguish the recommended principled methods from basic methods for HTE evaluation that typically rely on rules of thumb and general guidelines (the methods are often referred to as common practices). We discuss the advantages and disadvantages of various principled methods as well as common measures for evaluating their performance. We use simulated data and a case study based on a historical clinical trial to illustrate several new approaches to HTE evaluation."}, "https://arxiv.org/abs/2401.06261": {"title": "Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables", "link": "https://arxiv.org/abs/2401.06261", "description": "arXiv:2401.06261v2 Announce Type: replace \nAbstract: Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated with multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent, which is usually not possible when considering a group of candidate genes from the same locus. We used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition. This is a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results even at modest sample sizes. Importantly, the causal effect estimates remain unbiased and their variance small when instruments are highly correlated. 
We applied MVMR with correlated instrumental variable sets at risk loci from genome-wide association studies (GWAS) for coronary artery disease using eQTL data from the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR is run on a tissue-by-tissue basis, and testing all gene-tissue pairs at a given locus in a single model to predict causal gene-tissue combinations remains infeasible."}, "https://arxiv.org/abs/2305.11561": {"title": "Causal Inference on Process Graphs, Part I: The Structural Equation Process Representation", "link": "https://arxiv.org/abs/2305.11561", "description": "arXiv:2305.11561v2 Announce Type: replace-cross \nAbstract: When dealing with time series data, causal inference methods often employ structural vector autoregressive (SVAR) processes to model time-evolving random systems. In this work, we rephrase recursive SVAR processes with possible latent component processes as a linear Structural Causal Model (SCM) of stochastic processes on a simple causal graph, the \\emph{process graph}, that models every process as a single node. Using this reformulation, we generalise Wright's well-known path-rule for linear Gaussian SCMs to the newly introduced process SCMs and we express the auto-covariance sequence of an SVAR process by means of a generalised trek-rule. Employing the Fourier-Transformation, we derive compact expressions for causal effects in the frequency domain that allow us to efficiently visualise the causal interactions in a multivariate SVAR process. Finally, we observe that the process graph can be used to formulate graphical criteria for identifying causal effects and to derive algebraic relations with which these frequency domain causal effects can be recovered from the observed spectral density."}, "https://arxiv.org/abs/2312.07320": {"title": "Convergence rates of non-stationary and deep Gaussian process regression", "link": "https://arxiv.org/abs/2312.07320", "description": "arXiv:2312.07320v3 Announce Type: replace-cross \nAbstract: The focus of this work is the convergence of non-stationary and deep Gaussian process regression. More precisely, we follow a Bayesian approach to regression or interpolation, where the prior placed on the unknown function $f$ is a non-stationary or deep Gaussian process, and we derive convergence rates of the posterior mean to the true function $f$ in terms of the number of observed training points. In some cases, we also show convergence of the posterior variance to zero. The only assumption imposed on the function $f$ is that it is an element of a certain reproducing kernel Hilbert space, which we in particular cases show to be norm-equivalent to a Sobolev space. Our analysis includes the case of estimated hyper-parameters in the covariance kernels employed, both in an empirical Bayes' setting and the particular hierarchical setting constructed through deep Gaussian processes. We consider the settings of noise-free or noisy observations on deterministic or random training points. We establish general assumptions sufficient for the convergence of deep Gaussian process regression, along with explicit examples demonstrating the fulfilment of these assumptions. 
Specifically, our examples require that the H\\\"older or Sobolev norms of the penultimate layer are bounded almost surely."}, "https://arxiv.org/abs/2406.17056": {"title": "Efficient two-sample instrumental variable estimators with change points and near-weak identification", "link": "https://arxiv.org/abs/2406.17056", "description": "arXiv:2406.17056v1 Announce Type: new \nAbstract: We consider estimation and inference in a linear model with endogenous regressors where the parameters of interest change across two samples. If the first-stage is common, we show how to use this information to obtain more efficient two-sample GMM estimators than the standard split-sample GMM, even in the presence of near-weak instruments. We also propose two tests to detect change points in the parameters of interest, depending on whether the first-stage is common or not. We derive the limiting distribution of these tests and show that they have non-trivial power even under weaker and possibly time-varying identification patterns. The finite sample properties of our proposed estimators and testing procedures are illustrated in a series of Monte-Carlo experiments, and in an application to the open-economy New Keynesian Phillips curve. Our empirical analysis using US data provides strong support for a New Keynesian Phillips curve with incomplete pass-through and reveals important time variation in the relationship between inflation and exchange rate pass-through."}, "https://arxiv.org/abs/2406.17058": {"title": "Bayesian Deep ICE", "link": "https://arxiv.org/abs/2406.17058", "description": "arXiv:2406.17058v1 Announce Type: new \nAbstract: Deep Independent Component Estimation (DICE) has many applications in modern day machine learning as a feature engineering extraction method. We provide a novel latent variable representation of independent component analysis that enables both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction. We discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this representation hierarchy, we unify a number of hitherto disjoint estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research."}, "https://arxiv.org/abs/2406.17131": {"title": "Bayesian temporal biclustering with applications to multi-subject neuroscience studies", "link": "https://arxiv.org/abs/2406.17131", "description": "arXiv:2406.17131v1 Announce Type: new \nAbstract: We consider the problem of analyzing multivariate time series collected on multiple subjects, with the goal of identifying groups of subjects exhibiting similar trends in their recorded measurements over time as well as time-varying groups of associated measurements. To this end, we propose a Bayesian model for temporal biclustering featuring nested partitions, where a time-invariant partition of subjects induces a time-varying partition of measurements. Our approach allows for data-driven determination of the number of subject and measurement clusters as well as estimation of the number and location of changepoints in measurement partitions. To efficiently perform model fitting and posterior estimation with Markov Chain Monte Carlo, we derive a blocked update of measurements' cluster-assignment sequences. 
We illustrate the performance of our model in two applications to functional magnetic resonance imaging data and to an electroencephalogram dataset. The results indicate that the proposed model can combine information from potentially many subjects to discover a set of interpretable, dynamic patterns. Experiments on simulated data compare the estimation performance of the proposed model against ground-truth values and other statistical methods, showing that it performs well at identifying ground-truth subject and measurement clusters even when no subject or time dependence is present."}, "https://arxiv.org/abs/2406.17278": {"title": "Estimation and Inference for CP Tensor Factor Models", "link": "https://arxiv.org/abs/2406.17278", "description": "arXiv:2406.17278v1 Announce Type: new \nAbstract: High-dimensional tensor-valued data have recently gained attention from researchers in economics and finance. We consider the estimation and inference of high-dimensional tensor factor models, where each dimension of the tensor diverges. Our focus is on a factor model that admits CP-type tensor decomposition, which allows for non-orthogonal loading vectors. Based on the contemporary covariance matrix, we propose an iterative simultaneous projection estimation method. Our estimator is robust to weak dependence among factors and weak correlation across different dimensions in the idiosyncratic shocks. We establish an inferential theory, demonstrating both consistency and asymptotic normality under relaxed assumptions. Within a unified framework, we consider two eigenvalue ratio-based estimators for the number of factors in a tensor factor model and justify their consistency. Through a simulation study and two empirical applications featuring sorted portfolios and international trade flows, we illustrate the advantages of our proposed estimator over existing methodologies in the literature."}, "https://arxiv.org/abs/2406.17318": {"title": "Model Uncertainty in Latent Gaussian Models with Univariate Link Function", "link": "https://arxiv.org/abs/2406.17318", "description": "arXiv:2406.17318v1 Announce Type: new \nAbstract: We consider a class of latent Gaussian models with a univariate link function (ULLGMs). These are based on standard likelihood specifications (such as Poisson, Binomial, Bernoulli, Erlang, etc.) but incorporate a latent normal linear regression framework on a transformation of a key scalar parameter. We allow for model uncertainty regarding the covariates included in the regression. The ULLGM class typically accommodates extra dispersion in the data and has clear advantages for deriving theoretical properties and designing computational procedures. We formally characterize posterior existence under a convenient and popular improper prior and propose an efficient Markov chain Monte Carlo algorithm for Bayesian model averaging in ULLGMs. Simulation results suggest that the framework provides accurate results that are robust to some degree of misspecification. 
The methodology is successfully applied to measles vaccination coverage data from Ethiopia and to data on bilateral migration flows between OECD countries."}, "https://arxiv.org/abs/2406.17361": {"title": "Tree-based variational inference for Poisson log-normal models", "link": "https://arxiv.org/abs/2406.17361", "description": "arXiv:2406.17361v1 Announce Type: new \nAbstract: When studying ecosystems, hierarchical trees are often used to organize entities based on proximity criteria, such as the taxonomy in microbiology, social classes in geography, or product types in retail businesses, offering valuable insights into entity relationships. Despite their significance, current count-data models do not leverage this structured information. In particular, the widely used Poisson log-normal (PLN) model, known for its ability to model interactions between entities from count data, lacks the ability to incorporate such hierarchical tree structures, limiting its applicability in domains characterized by such complexities. To address this matter, we introduce the PLN-Tree model as an extension of the PLN model, specifically designed for modeling hierarchical count data. By integrating structured variational inference techniques, we propose an adapted training procedure and establish identifiability results, enhancing both theoretical foundations and practical interpretability. Additionally, we extend our framework to classification tasks as a preprocessing pipeline, showcasing its versatility. Experimental evaluations on synthetic datasets as well as real-world microbiome data demonstrate the superior performance of the PLN-Tree model in capturing hierarchical dependencies and providing valuable insights into complex data structures, showing the practical interest of knowledge graphs like the taxonomy in ecosystems modeling."}, "https://arxiv.org/abs/2406.17444": {"title": "Bayesian Partial Reduced-Rank Regression", "link": "https://arxiv.org/abs/2406.17444", "description": "arXiv:2406.17444v1 Announce Type: new \nAbstract: Reduced-rank (RR) regression may be interpreted as a dimensionality reduction technique able to reveal complex relationships among the data parsimoniously. However, RR regression models typically overlook any potential group structure among the responses by assuming a low-rank structure on the coefficient matrix. To address this limitation, a Bayesian Partial RR (BPRR) regression is exploited, where the response vector and the coefficient matrix are partitioned into low- and full-rank sub-groups. As opposed to the literature, which assumes known group structure and rank, a novel strategy is introduced that treats them as unknown parameters to be estimated. The main contribution is two-fold: an approach to infer the low- and full-rank group memberships from the data is proposed, and then, conditionally on this allocation, the corresponding (reduced) rank is estimated. Both steps are carried out in a Bayesian approach, allowing for full uncertainty quantification and based on a partially collapsed Gibbs sampler. It relies on a Laplace approximation of the marginal likelihood and the Metropolized Shotgun Stochastic Search to estimate the group allocation efficiently. 
Applications to synthetic and real-world data demonstrate the potential of the proposed method to reveal hidden structures in the data."}, "https://arxiv.org/abs/2406.17445": {"title": "Copula-Based Estimation of Causal Effects in Multiple Linear and Path Analysis Models", "link": "https://arxiv.org/abs/2406.17445", "description": "arXiv:2406.17445v1 Announce Type: new \nAbstract: Regression analysis is one of the most widely used statistical techniques, but it only measures the direct effect of independent variables on the dependent variable. Path analysis looks for both direct and indirect effects of independent variables and may overcome several hurdles associated with regression models. It utilizes one or more structural regression equations in the model which are used to estimate the unknown parameters. The aim of this work is to study path analysis models in which the endogenous (dependent) variable and exogenous (independent) variables are linked through elliptical copulas. Using well-organized numerical schemes, we investigate the performance of path models when direct and indirect effects are estimated applying classical ordinary least squares and copula-based regression approaches in different scenarios. Finally, two real data applications are also presented to demonstrate the performance of path analysis using the copula approach."}, "https://arxiv.org/abs/2406.17466": {"title": "Two-Stage Testing in a high dimensional setting", "link": "https://arxiv.org/abs/2406.17466", "description": "arXiv:2406.17466v1 Announce Type: new \nAbstract: In a high dimensional regression setting in which the number of variables ($p$) is much larger than the sample size ($n$), the number of possible two-way interactions between the variables is immense. If the number of variables is of the order of one million, which is usually the case in, e.g., genetics, the number of two-way interactions is of the order one million squared. In the pursuit of detecting two-way interactions, testing all pairs for interactions one-by-one is computationally infeasible and the multiple testing correction will be severe. In this paper we describe a two-stage testing procedure consisting of a screening and an evaluation stage. It is proven that, under some assumptions, the test statistics in the two stages are asymptotically independent. As a result, multiplicity correction in the second stage is only needed for the number of statistical tests that are actually performed in that stage. This increases the power of the testing procedure. Also, since the testing procedure in the first stage is computationally simple, the computational burden is lowered. Simulations have been performed for multiple settings and regression models (generalized linear models and Cox PH model) to study the performance of the two-stage testing procedure. The results show type I error control and an increase in power compared to the procedure in which the pairs are tested one-by-one."}, "https://arxiv.org/abs/2406.17567": {"title": "Transfer Learning for High Dimensional Robust Regression", "link": "https://arxiv.org/abs/2406.17567", "description": "arXiv:2406.17567v1 Announce Type: new \nAbstract: Transfer learning has become an essential technique for utilizing information from source datasets to improve the performance of the target task. However, in the context of high-dimensional data, heterogeneity arises due to heteroscedastic variance or inhomogeneous covariate effects. 
To solve this problem, this paper proposes a robust transfer learning method based on Huber regression, specifically designed for scenarios where the transferable source data set is known. This method effectively mitigates the impact of data heteroscedasticity, leading to improvements in estimation and prediction accuracy. Moreover, when the transferable source data set is unknown, the paper introduces an efficient detection algorithm to identify informative sources. The effectiveness of the proposed method is demonstrated through numerical simulations and an empirical analysis using superconductor data."}, "https://arxiv.org/abs/2406.17571": {"title": "Causal Responder Detection", "link": "https://arxiv.org/abs/2406.17571", "description": "arXiv:2406.17571v1 Announce Type: new \nAbstract: We introduce the causal responders detection (CARD), a novel method for responder analysis that identifies treated subjects who significantly respond to a treatment. Leveraging recent advances in conformal prediction, CARD employs machine learning techniques to accurately identify responders while controlling the false discovery rate in finite sample sizes. Additionally, we incorporate a propensity score adjustment to mitigate bias arising from non-random treatment allocation, enhancing the robustness of our method in observational settings. Simulation studies demonstrate that CARD effectively detects responders with high power in diverse scenarios."}, "https://arxiv.org/abs/2406.17637": {"title": "Nowcasting in triple-system estimation", "link": "https://arxiv.org/abs/2406.17637", "description": "arXiv:2406.17637v1 Announce Type: new \nAbstract: When samples that each cover part of a population for a certain reference date become available slowly over time, an estimate of the population size can be obtained when at least two samples are available. Ideally one uses all the available samples, but if some samples become available much later one may want to use the samples that are available earlier, to obtain a preliminary or nowcast estimate. However, a limited number of samples may no longer lead to asymptotically unbiased estimates, in particular in the case of two early available samples that suffer from pairwise dependence. In this paper we propose a multiple system nowcasting model that deals with this issue by combining the early available samples with samples from a previous reference date and the expectation-maximisation algorithm. This leads to a nowcast estimate that is asymptotically unbiased under more relaxed assumptions than the dual-system estimator. The multiple system nowcasting model is applied to the problem of estimating the number of homeless people in The Netherlands, which leads to reasonably accurate nowcast estimates."}, "https://arxiv.org/abs/2406.17708": {"title": "Forecast Relative Error Decomposition", "link": "https://arxiv.org/abs/2406.17708", "description": "arXiv:2406.17708v1 Announce Type: new \nAbstract: We introduce a class of relative error decomposition measures that are well-suited for the analysis of shocks in nonlinear dynamic models. They include the Forecast Relative Error Decomposition (FRED), Forecast Error Kullback Decomposition (FEKD) and Forecast Error Laplace Decomposition (FELD). These measures are preferable to the traditional Forecast Error Variance Decomposition (FEVD) because they account for nonlinear dependence in both a serial and cross-sectional sense. 
This is illustrated by applications to dynamic models for qualitative data, count data, stochastic volatility and cyberrisk."}, "https://arxiv.org/abs/2406.17422": {"title": "Causal Inference on Process Graphs, Part II: Causal Structure and Effect Identification", "link": "https://arxiv.org/abs/2406.17422", "description": "arXiv:2406.17422v1 Announce Type: cross \nAbstract: A structural vector autoregressive (SVAR) process is a linear causal model for variables that evolve over a discrete set of time points and between which there may be lagged and instantaneous effects. The qualitative causal structure of an SVAR process can be represented by its finite and directed process graph, in which a directed link connects two processes whenever there is a lagged or instantaneous effect between them. At the process graph level, the causal structure of SVAR processes is compactly parameterised in the frequency domain. In this paper, we consider the problem of causal discovery and causal effect estimation from the spectral density, the frequency domain analogue of the auto covariance, of the SVAR process. Causal discovery concerns the recovery of the process graph and causal effect estimation concerns the identification and estimation of causal effects in the frequency domain.\n We show that information about the process graph, in terms of $d$- and $t$-separation statements, can be identified by verifying algebraic constraints on the spectral density. Furthermore, we introduce a notion of rational identifiability for frequency causal effects that may be confounded by exogenous latent processes, and show that the recent graphical latent factor half-trek criterion can be used on the process graph to assess whether a given (confounded) effect can be identified by rational operations on the entries of the spectral density."}, "https://arxiv.org/abs/2406.17714": {"title": "Compositional Models for Estimating Causal Effects", "link": "https://arxiv.org/abs/2406.17714", "description": "arXiv:2406.17714v1 Announce Type: cross \nAbstract: Many real-world systems can be represented as sets of interacting components. Examples of such systems include computational systems such as query processors, natural systems such as cells, and social systems such as families. Many approaches have been proposed in traditional (associational) machine learning to model such structured systems, including statistical relational models and graph neural networks. Despite this prior work, existing approaches to estimating causal effects typically treat such systems as single units, represent them with a fixed set of variables and assume a homogeneous data-generating process. We study a compositional approach for estimating individual treatment effects (ITE) in structured systems, where each unit is represented by the composition of multiple heterogeneous components. This approach uses a modular architecture to model potential outcomes at each component and aggregates component-level potential outcomes to obtain the unit-level potential outcomes. We discover novel benefits of the compositional approach in causal inference - systematic generalization to estimate counterfactual outcomes of unseen combinations of components and improved overlap guarantees between treatment and control groups compared to the classical methods for causal effect estimation. 
We also introduce a set of novel environments for empirically evaluating the compositional approach and demonstrate the effectiveness of our approach using both simulated and real-world data."}, "https://arxiv.org/abs/2109.13648": {"title": "Gaussian and Student's $t$ mixture vector autoregressive model with application to the effects of the Euro area monetary policy shock", "link": "https://arxiv.org/abs/2109.13648", "description": "arXiv:2109.13648v4 Announce Type: replace \nAbstract: A new mixture vector autoregressive model based on Gaussian and Student's $t$ distributions is introduced. As its mixture components, our model incorporates conditionally homoskedastic linear Gaussian vector autoregressions and conditionally heteroskedastic linear Student's $t$ vector autoregressions. For a $p$th order model, the mixing weights depend on the full distribution of the preceding $p$ observations, which leads to attractive practical and theoretical properties such as ergodicity and full knowledge of the stationary distribution of $p+1$ consecutive observations. A structural version of the model with statistically identified shocks is also proposed. The empirical application studies the effects of the Euro area monetary policy shock. We fit a two-regime model to the data and find the effects, particularly on inflation, stronger in the regime that mainly prevails before the Financial crisis than in the regime that mainly dominates after it. The introduced methods are implemented in the accompanying R package gmvarkit."}, "https://arxiv.org/abs/2210.09828": {"title": "Modelling Large Dimensional Datasets with Markov Switching Factor Models", "link": "https://arxiv.org/abs/2210.09828", "description": "arXiv:2210.09828v4 Announce Type: replace \nAbstract: We study a novel large dimensional approximate factor model with regime changes in the loadings driven by a latent first order Markov process. By exploiting the equivalent linear representation of the model, we first recover the latent factors by means of Principal Component Analysis. We then cast the model in state-space form, and we estimate loadings and transition probabilities through an EM algorithm based on a modified version of the Baum-Lindgren-Hamilton-Kim filter and smoother that makes use of the factors previously estimated. Our approach is appealing as it provides closed form expressions for all estimators. More importantly, it does not require knowledge of the true number of factors. We derive the theoretical properties of the proposed estimation procedure, and we show their good finite sample performance through a comprehensive set of Monte Carlo experiments. The empirical usefulness of our approach is illustrated through three applications to large U.S. datasets of stock returns, macroeconomic variables, and inflation indexes."}, "https://arxiv.org/abs/2307.09319": {"title": "Estimation of the Number Needed to Treat, the Number Needed to be Exposed, and the Exposure Impact Number with Instrumental Variables", "link": "https://arxiv.org/abs/2307.09319", "description": "arXiv:2307.09319v3 Announce Type: replace \nAbstract: The Number needed to treat (NNT) is an efficacy index defined as the average number of patients needed to treat to attain one additional treatment benefit. In observational studies, specifically in epidemiology, the adequacy of the populationwise NNT is questionable since the exposed group characteristics may substantially differ from the unexposed. 
To address this issue, groupwise efficacy indices were defined: the Exposure Impact Number (EIN) for the exposed group and the Number Needed to be Exposed (NNE) for the unexposed. Each defined index answers a unique research question since it targets a unique sub-population. In observational studies, the group allocation is typically affected by confounders that might be unmeasured. The available estimation methods that rely either on randomization or the sufficiency of the measured covariates for confounding control will result in inconsistent estimators of the true NNT (EIN, NNE) in such settings. Using Rubin's potential outcomes framework, we explicitly define the NNT and its derived indices as causal contrasts. Next, we introduce a novel method that uses instrumental variables to estimate the three aforementioned indices in observational studies. We present two analytical examples and a corresponding simulation study. The simulation study illustrates that the novel estimators are statistically consistent, unlike the previously available methods, and their analytical confidence intervals' empirical coverage rates converge to their nominal values. Finally, a real-world data example of an analysis of the effect of vitamin D deficiency on the mortality rate is presented."}, "https://arxiv.org/abs/2406.17827": {"title": "Practical identifiability and parameter estimation of compartmental epidemiological models", "link": "https://arxiv.org/abs/2406.17827", "description": "arXiv:2406.17827v1 Announce Type: new \nAbstract: Practical parameter identifiability in ODE-based epidemiological models is a known issue, yet one that merits further study. It is essentially ubiquitous due to noise and errors in real data. In this study, to avoid uncertainty stemming from data of unknown quality, simulated data with added noise are used to investigate practical identifiability in two distinct epidemiological models. Particular emphasis is placed on the role of initial conditions, which are assumed unknown, except those that are directly measured. Instead of just focusing on one method of estimation, we use and compare results from various broadly used methods, including maximum likelihood and Markov Chain Monte Carlo (MCMC) estimation.\n Among other findings, our analysis revealed that the MCMC estimator is overall more robust than the point estimators considered. Its estimates and predictions are improved when the initial conditions of certain compartments are fixed so that the model becomes globally identifiable. For the point estimators, whether fixing or fitting the initial conditions that are not directly measured improves parameter estimates is model-dependent. Specifically, in the standard SEIR model, fixing the initial condition for the susceptible population S(0) improved parameter estimates, while this was not true when fixing the initial condition of the asymptomatic population in a more involved model. Our study also corroborates the change in the quality of parameter estimates depending on whether pre-peak or post-peak time series are used. 
Finally, our examples suggest that in the presence of significantly noisy data, the value of structural identifiability is moot."}, "https://arxiv.org/abs/2406.17971": {"title": "Robust integration of external control data in randomized trials", "link": "https://arxiv.org/abs/2406.17971", "description": "arXiv:2406.17971v1 Announce Type: new \nAbstract: One approach for increasing the efficiency of randomized trials is the use of \"external controls\" -- individuals who received the control treatment in the trial during routine practice or in prior experimental studies. Existing external control methods, however, can have substantial bias if the populations underlying the trial and the external control data are not exchangeable. Here, we characterize a randomization-aware class of treatment effect estimators in the population underlying the trial that remain consistent and asymptotically normal when using external control data, even when exchangeability does not hold. We consider two members of this class of estimators: the well-known augmented inverse probability weighting trial-only estimator, which is the efficient estimator when only trial data are used; and a more efficient member of the class when exchangeability holds and external control data are available, which we refer to as the optimized randomization-aware estimator. To achieve robust integration of external control data in trial analyses, we then propose a combined estimator based on the efficient trial-only estimator and the optimized randomization-aware estimator. We show that the combined estimator is consistent and no less efficient than the most efficient of the two component estimators, whether the exchangeability assumption holds or not. We examine the estimators' performance in simulations and we illustrate their use with data from two trials of paliperidone extended-release for schizophrenia."}, "https://arxiv.org/abs/2406.18047": {"title": "Shrinkage Estimators for Beta Regression Models", "link": "https://arxiv.org/abs/2406.18047", "description": "arXiv:2406.18047v1 Announce Type: new \nAbstract: The beta regression model is a useful framework to model response variables that are rates or proportions, that is to say, response variables which are continuous and restricted to the interval (0,1). As with any other regression model, parameter estimates may be affected by collinearity or even perfect collinearity among the explanatory variables. To handle these situations shrinkage estimators are proposed. In particular we develop ridge regression and LASSO estimators from a penalized likelihood perspective with a logit link function. The properties of the resulting estimators are evaluated through a simulation study and a real data application"}, "https://arxiv.org/abs/2406.18052": {"title": "Flexible Conformal Highest Predictive Conditional Density Sets", "link": "https://arxiv.org/abs/2406.18052", "description": "arXiv:2406.18052v1 Announce Type: new \nAbstract: We introduce our method, conformal highest conditional density sets (CHCDS), that forms conformal prediction sets using existing estimated conditional highest density predictive regions. We prove the validity of the method and that conformal adjustment is negligible under some regularity conditions. In particular, if we correctly specify the underlying conditional density estimator, the conformal adjustment will be negligible. When the underlying model is incorrect, the conformal adjustment provides guaranteed nominal unconditional coverage. 
We compare the proposed method via simulation and a real data analysis to other existing methods. Our numerical results show that the flexibility of being able to use any existing conditional density estimation method is a large advantage for CHCDS compared to existing methods."}, "https://arxiv.org/abs/2406.18154": {"title": "Errors-In-Variables Model Fitting for Partially Unpaired Data Utilizing Mixture Models", "link": "https://arxiv.org/abs/2406.18154", "description": "arXiv:2406.18154v1 Announce Type: new \nAbstract: The goal of this paper is to introduce a general argumentation framework for regression in the errors-in-variables regime, allowing for full flexibility about the dimensionality of the data, error probability density types, the (linear or nonlinear) model type and the avoidance of explicit definition of loss functions. Further, we introduce in this framework model fitting for partially unpaired data, i.e. for given data groups the pairing information of input and output is lost (semi-supervised). This is achieved by constructing mixture model densities, which directly model this loss of pairing information allowing for inference. In a numerical simulation study linear and nonlinear model fits are illustrated as well as a real data study is presented based on life expectancy data from the world bank utilizing a multiple linear regression model. These results allow the conclusion that high quality model fitting is possible with partially unpaired data, which opens the possibility for new applications with unfortunate or deliberate loss of pairing information in the data."}, "https://arxiv.org/abs/2406.18189": {"title": "Functional knockoffs selection with applications to functional data analysis in high dimensions", "link": "https://arxiv.org/abs/2406.18189", "description": "arXiv:2406.18189v1 Announce Type: new \nAbstract: The knockoffs is a recently proposed powerful framework that effectively controls the false discovery rate (FDR) for variable selection. However, none of the existing knockoff solutions are directly suited to handle multivariate or high-dimensional functional data, which has become increasingly prevalent in various scientific applications. In this paper, we propose a novel functional model-X knockoffs selection framework tailored to sparse high-dimensional functional models, and show that our proposal can achieve the effective FDR control for any sample size. Furthermore, we illustrate the proposed functional model-X knockoffs selection procedure along with the associated theoretical guarantees for both FDR control and asymptotic power using examples of commonly adopted functional linear additive regression models and the functional graphical model. In the construction of functional knockoffs, we integrate essential components including the correlation operator matrix, the Karhunen-Lo\\`eve expansion, and semidefinite programming, and develop executable algorithms. 
We demonstrate the superiority of our proposed methods over the competitors through both extensive simulations and the analysis of two brain imaging datasets."}, "https://arxiv.org/abs/2406.18191": {"title": "Asymptotic Uncertainty in the Estimation of Frequency Domain Causal Effects for Linear Processes", "link": "https://arxiv.org/abs/2406.18191", "description": "arXiv:2406.18191v1 Announce Type: new \nAbstract: Structural vector autoregressive (SVAR) processes are commonly used time series models to identify and quantify causal interactions between dynamically interacting processes from observational data. The causal relationships between these processes can be effectively represented by a finite directed process graph - a graph that connects two processes whenever there is a direct delayed or simultaneous effect between them. Recent research has introduced a framework for quantifying frequency domain causal effects along paths on the process graph. This framework allows to identify how the spectral density of one process is contributing to the spectral density of another. In the current work, we characterise the asymptotic distribution of causal effect and spectral contribution estimators in terms of algebraic relations dictated by the process graph. Based on the asymptotic distribution we construct approximate confidence intervals and Wald type hypothesis tests for the estimated effects and spectral contributions. Under the assumption of causal sufficiency, we consider the class of differentiable estimators for frequency domain causal quantities, and within this class we identify the asymptotically optimal estimator. We illustrate the frequency domain Wald tests and uncertainty approximation on synthetic data, and apply them to analyse the impact of the 10 to 11 year solar cycle on the North Atlantic Oscillation (NAO). Our results confirm a significant effect of the solar cycle on the NAO at the 10 to 11 year time scale."}, "https://arxiv.org/abs/2406.18390": {"title": "The $\\ell$-test: leveraging sparsity in the Gaussian linear model for improved inference", "link": "https://arxiv.org/abs/2406.18390", "description": "arXiv:2406.18390v1 Announce Type: new \nAbstract: We develop novel LASSO-based methods for coefficient testing and confidence interval construction in the Gaussian linear model with $n\\ge d$. Our methods' finite-sample guarantees are identical to those of their ubiquitous ordinary-least-squares-$t$-test-based analogues, yet have substantially higher power when the true coefficient vector is sparse. In particular, our coefficient test, which we call the $\\ell$-test, performs like the one-sided $t$-test (despite not being given any information about the sign) under sparsity, and the corresponding confidence intervals are more than 10% shorter than the standard $t$-test based intervals. The nature of the $\\ell$-test directly provides a novel exact adjustment conditional on LASSO selection for post-selection inference, allowing for the construction of post-selection p-values and confidence intervals. None of our methods require resampling or Monte Carlo estimation. We perform a variety of simulations and a real data analysis on an HIV drug resistance data set to demonstrate the benefits of the $\\ell$-test. 
We end with a discussion of how the $\\ell$-test may asymptotically apply to a much more general class of parametric models."}, "https://arxiv.org/abs/2406.18484": {"title": "An Understanding of Principal Differential Analysis", "link": "https://arxiv.org/abs/2406.18484", "description": "arXiv:2406.18484v1 Announce Type: new \nAbstract: In functional data analysis, replicate observations of a smooth functional process and its derivatives offer a unique opportunity to flexibly estimate continuous-time ordinary differential equation models. Ramsay (1996) first proposed to estimate a linear ordinary differential equation from functional data in a technique called Principal Differential Analysis, by formulating a functional regression in which the highest-order derivative of a function is modelled as a time-varying linear combination of its lower-order derivatives. Principal Differential Analysis was introduced as a technique for data reduction and representation, using solutions of the estimated differential equation as a basis to represent the functional data. In this work, we re-formulate PDA as a generative statistical model in which functional observations arise as solutions of a deterministic ODE that is forced by a smooth random error process. This viewpoint defines a flexible class of functional models based on differential equations and leads to an improved understanding and characterisation of the sources of variability in Principal Differential Analysis. It does, however, result in parameter estimates that can be heavily biased under the standard estimation approach of PDA. Therefore, we introduce an iterative bias-reduction algorithm that can be applied to improve parameter estimates. We also examine the utility of our approach when the form of the deterministic part of the differential equation is unknown and possibly non-linear, where Principal Differential Analysis is treated as an approximate model based on time-varying linearisation. We demonstrate our approach on simulated data from linear and non-linear differential equations and on real data from human movement biomechanics. Supplementary R code for this manuscript is available at \\url{https://github.com/edwardgunning/UnderstandingOfPDAManuscript}."}, "https://arxiv.org/abs/2406.17972": {"title": "LABOR-LLM: Language-Based Occupational Representations with Large Language Models", "link": "https://arxiv.org/abs/2406.17972", "description": "arXiv:2406.17972v1 Announce Type: cross \nAbstract: Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. 
(2024) introduced a transformer-based \"foundation model\", CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions."}, "https://arxiv.org/abs/2406.18240": {"title": "Concordance in basal cell carcinoma diagnosis", "link": "https://arxiv.org/abs/2406.18240", "description": "arXiv:2406.18240v1 Announce Type: cross \nAbstract: Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists."}, "https://arxiv.org/abs/2108.03464": {"title": "Bayesian $L_{\\frac{1}{2}}$ regression", "link": "https://arxiv.org/abs/2108.03464", "description": "arXiv:2108.03464v2 Announce Type: replace \nAbstract: It is well known that Bridge regression enjoys superior theoretical properties when compared to traditional LASSO. 
However, the current latent variable representation of its Bayesian counterpart, based on the exponential power prior, is computationally expensive in higher dimensions. In this paper, we show that the exponential power prior has a closed form scale mixture of normal decomposition for $\\alpha=(\\frac{1}{2})^\\gamma, \\gamma \\in \\{1, 2,\\ldots\\}$. We call these types of priors $L_{\\frac{1}{2}}$ prior for short. We develop an efficient partially collapsed Gibbs sampling scheme for computation using the $L_{\\frac{1}{2}}$ prior and study theoretical properties when $p>n$. In addition, we introduce a non-separable Bridge penalty function inspired by the fully Bayesian formulation and a novel, efficient coordinate descent algorithm. We prove the algorithm's convergence and show that the local minimizer from our optimisation algorithm has an oracle property. Finally, simulation studies were carried out to illustrate the performance of the new algorithms. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2306.16297": {"title": "A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation", "link": "https://arxiv.org/abs/2306.16297", "description": "arXiv:2306.16297v2 Announce Type: replace \nAbstract: Twin revolutions in wearable technologies and health interventions delivered by smartphones have greatly increased the accessibility of mobile health (mHealth) interventions. Micro-randomized trials (MRTs) are designed to assess the effectiveness of the mHealth intervention and introduce a novel class of causal estimands called \"causal excursion effects.\" These estimands enable the evaluation of how intervention effects change over time and are influenced by individual characteristics or context. However, existing analysis methods for causal excursion effects require prespecified features of the observed high-dimensional history to build a working model for a critical nuisance parameter. Machine learning appears ideal for automatic feature construction, but their naive application can lead to bias under model misspecification. To address this issue, this paper revisits the estimation of causal excursion effects from a meta-learner perspective, where the analyst remains agnostic to the supervised learning algorithms used to estimate nuisance parameters. We present the bidirectional asymptotic properties of the proposed estimators and compare them both theoretically and through extensive simulations. The results show relative efficiency gains and support the suggestion of a doubly robust alternative to existing methods. Finally, the proposed methods' practical utilities are demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States (NeCamp et al., 2020)."}, "https://arxiv.org/abs/2309.04685": {"title": "Simultaneous Modeling of Disease Screening and Severity Prediction: A Multi-task and Sparse Regularization Approach", "link": "https://arxiv.org/abs/2309.04685", "description": "arXiv:2309.04685v2 Announce Type: replace \nAbstract: The exploration of biomarkers, which are clinically useful biomolecules, and the development of prediction models using them are important problems in biomedical research. Biomarkers are widely used for disease screening, and some are related not only to the presence or absence of a disease but also to its severity. These biomarkers can be useful for prioritization of treatment and clinical decision-making. 
Considering a model helpful for both disease screening and severity prediction, this paper focuses on regression modeling for an ordinal response equipped with a hierarchical structure.\n If the response variable is a combination of the presence of disease and severity such as \\{{\\it healthy, mild, intermediate, severe}\\}, for example, the simplest method would be to apply the conventional ordinal regression model. However, the conventional model has flexibility issues and may not be suitable for the problems addressed in this paper, where the levels of the response variable might be heterogeneous. Therefore, this paper proposes a model that treats screening and severity prediction as different tasks, and an estimation method based on structural sparse regularization that leverages any common structure between the tasks when such commonality exists. In numerical experiments, the proposed method demonstrated stable performance across many scenarios compared to existing ordinal regression methods."}, "https://arxiv.org/abs/2401.05330": {"title": "Hierarchical Causal Models", "link": "https://arxiv.org/abs/2401.05330", "description": "arXiv:2401.05330v2 Announce Type: replace \nAbstract: Scientists often want to learn about cause and effect from hierarchical data, collected from subunits nested inside units. Consider students in schools, cells in patients, or cities in states. In such settings, unit-level variables (e.g. each school's budget) may affect subunit-level variables (e.g. the test scores of each student in each school) and vice versa. To address causal questions with hierarchical data, we propose hierarchical causal models, which extend structural causal models and causal graphical models by adding inner plates. We develop a general graphical identification technique for hierarchical causal models that extends do-calculus. We find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data, that is, if we had only unit-level summaries of subunit-level variables (e.g. the school's average test score, rather than each student's score). We develop estimation techniques for hierarchical causal models, using methods including hierarchical Bayesian models. We illustrate our results in simulation and via a reanalysis of the classic \"eight schools\" study."}, "https://arxiv.org/abs/2301.13152": {"title": "STEEL: Singularity-aware Reinforcement Learning", "link": "https://arxiv.org/abs/2301.13152", "description": "arXiv:2301.13152v5 Announce Type: replace-cross \nAbstract: Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. The existing methods require an absolute continuity assumption (e.g., there do not exist non-overlapping regions) on the distribution induced by the target policies with respect to the data distribution over either the state or action or both. We propose a new batch RL algorithm that allows for singularity for both state and action spaces (e.g., existence of non-overlapping regions between offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. 
Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL improves the applicability and robustness of batch RL. In addition, a two-step adaptive STEEL, which is nearly tuning-free, is proposed. Extensive simulation studies and one (semi)-real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL."}, "https://arxiv.org/abs/2406.18681": {"title": "Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features", "link": "https://arxiv.org/abs/2406.18681", "description": "arXiv:2406.18681v1 Announce Type: new \nAbstract: This article focuses on drawing computationally-efficient predictive inference from Gaussian process (GP) regressions with a large number of features when the response is conditionally independent of the features given the projection to a noisy low dimensional manifold. Bayesian estimation of the regression relationship using Markov Chain Monte Carlo and subsequent predictive inference is computationally prohibitive and may lead to inferential inaccuracies since accurate variable selection is essentially impossible in such high-dimensional GP regressions. As an alternative, this article proposes a strategy to sketch the high-dimensional feature vector with a carefully constructed sketching matrix, before fitting a GP with the scalar outcome and the sketched feature vector to draw predictive inference. The analysis is performed in parallel with many different sketching matrices and smoothing parameters in different processors, and the predictive inferences are combined using Bayesian predictive stacking. Since the posterior predictive distribution in each processor is analytically tractable, the algorithm allows bypassing the robustness issues due to convergence and mixing of MCMC chains, leading to a fast implementation with a very large number of features. Simulation studies show superior performance of the proposed approach compared with a wide variety of competitors. The approach outperforms competitors in drawing point predictions with predictive uncertainties of outdoor air pollution from satellite images."}, "https://arxiv.org/abs/2406.18819": {"title": "MultiObjMatch: Matching with Optimal Tradeoffs between Multiple Objectives in R", "link": "https://arxiv.org/abs/2406.18819", "description": "arXiv:2406.18819v1 Announce Type: new \nAbstract: In an observational study, matching aims to create many small sets of similar treated and control units from initial samples that may differ substantially in order to permit more credible causal inferences. The problem of constructing matched sets may be formulated as an optimization problem, but it can be challenging to specify a single objective function that adequately captures all the design considerations at work. One solution, proposed by \\citet{pimentel2019optimal}, is to explore a family of matched designs that are Pareto optimal for multiple objective functions. 
We present an R package, \\href{https://github.com/ShichaoHan/MultiObjMatch}{\\texttt{MultiObjMatch}}, that implements this multi-objective matching strategy using a network flow algorithm for several common design goals: marginal balance on important covariates, size of the matched sample, and average within-pair multivariate distances. We demonstrate the package's flexibility in exploring user-defined tradeoffs of interest via two case studies, a reanalysis of the canonical National Supported Work dataset and a novel analysis of a clinical dataset to estimate the impact of diabetic kidney disease on hospitalization costs."}, "https://arxiv.org/abs/2406.18829": {"title": "Full Information Linked ICA: addressing missing data problem in multimodal fusion", "link": "https://arxiv.org/abs/2406.18829", "description": "arXiv:2406.18829v1 Announce Type: new \nAbstract: Recent advances in multimodal imaging acquisition techniques have allowed us to measure different aspects of brain structure and function. Multimodal fusion, such as linked independent component analysis (LICA), is popularly used to integrate complementary information. However, it has suffered from missing data, commonly occurring in neuroimaging data. Therefore, in this paper, we propose a Full Information LICA algorithm (FI-LICA) to handle the missing data problem during multimodal fusion under the LICA framework. Built upon complete cases, our method employs the principle of full information and utilizes all available information to recover the missing latent information. Our simulation experiments showed the ideal performance of FI-LICA compared to current practices. Further, we applied FI-LICA to multimodal data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, showcasing better performance in classifying current diagnosis and in predicting the AD transition of participants with mild cognitive impairment (MCI), thereby highlighting the practical utility of our proposed method."}, "https://arxiv.org/abs/2406.18905": {"title": "Bayesian inference: More than Bayes's theorem", "link": "https://arxiv.org/abs/2406.18905", "description": "arXiv:2406.18905v1 Announce Type: new \nAbstract: Bayesian inference gets its name from *Bayes's theorem*, expressing posterior probabilities for hypotheses about a data generating process as the (normalized) product of prior probabilities and a likelihood function. But Bayesian inference uses all of probability theory, not just Bayes's theorem. Many hypotheses of scientific interest are *composite hypotheses*, with the strength of evidence for the hypothesis dependent on knowledge about auxiliary factors, such as the values of nuisance parameters (e.g., uncertain background rates or calibration factors). Many important capabilities of Bayesian methods arise from use of the law of total probability, which instructs analysts to compute probabilities for composite hypotheses by *marginalization* over auxiliary factors. This tutorial targets relative newcomers to Bayesian inference, aiming to complement tutorials that focus on Bayes's theorem and how priors modulate likelihoods. The emphasis here is on marginalization over parameter spaces -- both how it is the foundation for important capabilities, and how it may motivate caution when parameter spaces are large. 
Topics covered include the difference between likelihood and probability, understanding the impact of priors beyond merely shifting the maximum likelihood estimate, and the role of marginalization in accounting for uncertainty in nuisance parameters, systematic error, and model misspecification."}, "https://arxiv.org/abs/2406.18913": {"title": "A Note on Identification of Match Fixed Effects as Interpretable Unobserved Match Affinity", "link": "https://arxiv.org/abs/2406.18913", "description": "arXiv:2406.18913v1 Announce Type: new \nAbstract: We highlight that match fixed effects, represented by the coefficients of interaction terms involving dummy variables for two elements, lack identification without specific restrictions on parameters. Consequently, the coefficients typically reported as relative match fixed effects by statistical software are not interpretable. To address this, we establish normalization conditions that enable identification of match fixed effect parameters as interpretable indicators of unobserved match affinity, facilitating comparisons among observed matches."}, "https://arxiv.org/abs/2406.19021": {"title": "Nonlinear Multivariate Function-on-function Regression with Variable Selection", "link": "https://arxiv.org/abs/2406.19021", "description": "arXiv:2406.19021v1 Announce Type: new \nAbstract: This paper proposes a multivariate nonlinear function-on-function regression model, which allows both the response and the covariates to be multi-dimensional functions. The model is built upon the multivariate functional reproducing kernel Hilbert space (RKHS) theory. It predicts the response function by linearly combining each covariate function in their respective functional RKHS, and extends the representation theorem to accommodate model estimation. Further variable selection is proposed by adding the lasso penalty to the coefficients of the kernel functions. A block coordinate descent algorithm is proposed for model estimation, and several theoretical properties are discussed. Finally, we evaluate the efficacy of our proposed model using simulation data and a real-case dataset in meteorology."}, "https://arxiv.org/abs/2406.19033": {"title": "Factor multivariate stochastic volatility models of high dimension", "link": "https://arxiv.org/abs/2406.19033", "description": "arXiv:2406.19033v1 Announce Type: new \nAbstract: Building upon the pertinence of the factor decomposition to break the curse of dimensionality inherent to multivariate volatility processes, we develop a factor model-based multivariate stochastic volatility (fMSV) framework that relies on two viewpoints: sparse approximate factor model and sparse factor loading matrix. We propose a two-stage estimation procedure for the fMSV model: the first stage obtains the estimators of the factor model, and the second stage estimates the MSV part using the estimated common factor variables. We derive the asymptotic properties of the estimators. Simulated experiments are performed to assess the forecasting performances of the covariance matrices. The empirical analysis based on vectors of asset returns illustrates that the forecasting performance of the fMSV models surpasses that of competing conditional covariance models."}, "https://arxiv.org/abs/2406.19152": {"title": "Mixture priors for replication studies", "link": "https://arxiv.org/abs/2406.19152", "description": "arXiv:2406.19152v1 Announce Type: new \nAbstract: Replication of scientific studies is important for assessing the credibility of their results. 
However, there is no consensus on how to quantify the extent to which a replication study replicates an original result. We propose a novel Bayesian approach based on mixture priors. The idea is to use a mixture of the posterior distribution based on the original study and a non-informative distribution as the prior for the analysis of the replication study. The mixture weight then determines the extent to which the original and replication data are pooled.\n Two distinct strategies are presented: one with fixed mixture weights, and one that introduces uncertainty by assigning a prior distribution to the mixture weight itself. Furthermore, it is shown how within this framework Bayes factors can be used for formal testing of scientific hypotheses, such as tests regarding the presence or absence of an effect. To showcase the practical application of the methodology, we analyze data from three replication studies. Our findings suggest that mixture priors are a valuable and intuitive alternative to other Bayesian methods for analyzing replication studies, such as hierarchical models and power priors. We provide the free and open source R package repmix that implements the proposed methodology."}, "https://arxiv.org/abs/2406.19157": {"title": "How to build your latent Markov model -- the role of time and space", "link": "https://arxiv.org/abs/2406.19157", "description": "arXiv:2406.19157v1 Announce Type: new \nAbstract: Statistical models that involve latent Markovian state processes have become immensely popular tools for analysing time series and other sequential data. However, the plethora of model formulations, the inconsistent use of terminology, and the various inferential approaches and software packages can be overwhelming to practitioners, especially when they are new to this area. With this review-like paper, we thus aim to provide guidance for both statisticians and practitioners working with latent Markov models by offering a unifying view on what otherwise are often considered separate model classes, from hidden Markov models over state-space models to Markov-modulated Poisson processes. In particular, we provide a roadmap for identifying a suitable latent Markov model formulation given the data to be analysed. Furthermore, we emphasise that it is key to applied work with any of these model classes to understand how recursive techniques exploiting the models' dependence structure can be used for inference. The R package LaMa adapts this unified view and provides an easy-to-use framework for very fast (C++ based) evaluation of the likelihood of any of the models discussed in this paper, allowing users to tailor a latent Markov model to their data using a Lego-type approach."}, "https://arxiv.org/abs/2406.19213": {"title": "Comparing Lasso and Adaptive Lasso in High-Dimensional Data: A Genetic Survival Analysis in Triple-Negative Breast Cancer", "link": "https://arxiv.org/abs/2406.19213", "description": "arXiv:2406.19213v1 Announce Type: new \nAbstract: This study aims to evaluate the performance of Cox regression with lasso penalty and adaptive lasso penalty in high-dimensional settings. Variable selection methods are necessary in this context to reduce dimensionality and make the problem feasible. Several weight calculation procedures for adaptive lasso are proposed to determine if they offer an improvement over lasso, as adaptive lasso addresses its inherent bias. 
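One of the weight constructions mentioned in the next sentence is ridge-based. As a rough illustration of how such data-driven weights enter the penalty, the sketch below implements a ridge-weighted adaptive lasso for a plain linear model rather than the penalized Cox survival setting the study evaluates; the simulated data, the exponent gamma, and all tuning choices are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]      # sparse truth (illustrative)
y = X @ beta + rng.normal(size=n)

# Step 1: pilot ridge fit gives data-driven weights w_j = 1 / |beta_ridge_j|^gamma
gamma = 1.0
beta_ridge = Ridge(alpha=1.0).fit(X, y).coef_
w = 1.0 / (np.abs(beta_ridge) ** gamma + 1e-8)

# Step 2: adaptive lasso = ordinary lasso on rescaled columns X_j / w_j,
# then back-transform the coefficients
X_scaled = X / w
fit = LassoCV(cv=5).fit(X_scaled, y)
beta_alasso = fit.coef_ / w
print("selected variables:", np.flatnonzero(beta_alasso != 0))
```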
These proposed weights are based on principal component analysis, ridge regression, univariate Cox regressions and random survival forest (RSF). The proposals are evaluated in simulated datasets.\n A real application of these methodologies in the context of genomic data is also carried out. The study consists of determining the variables, clinical and genetic, that influence the survival of patients with triple-negative breast cancer (TNBC), which is a type of breast cancer with low survival rates due to its aggressive nature."}, "https://arxiv.org/abs/2406.19346": {"title": "Eliciting prior information from clinical trials via calibrated Bayes factor", "link": "https://arxiv.org/abs/2406.19346", "description": "arXiv:2406.19346v1 Announce Type: new \nAbstract: In the Bayesian framework, power prior distributions are increasingly adopted in clinical trials and similar studies to incorporate external and past information, typically to inform the parameter associated with a treatment effect. Their use is particularly effective in scenarios with small sample sizes and where robust prior information is actually available. A crucial component of this methodology is represented by its weight parameter, which controls the volume of historical information incorporated into the current analysis. This parameter can be considered as either fixed or random. Although various strategies exist for its determination, eliciting the prior distribution of the weight parameter according to a full Bayesian approach remains a challenge. In general, this parameter should be carefully selected to accurately reflect the available prior information without dominating the posterior inferential conclusions. To this aim, we propose a novel method for eliciting the prior distribution of the weight parameter through a simulation-based calibrated Bayes factor procedure. This approach allows for the prior distribution to be updated based on the strength of evidence provided by the data: the goal is to facilitate the integration of historical data when it aligns with current information and to limit it when discrepancies, such as prior-data conflicts, arise. The performance of the proposed method is tested through simulation studies and applied to real data from clinical trials."}, "https://arxiv.org/abs/2406.18623": {"title": "Unbiased least squares regression via averaged stochastic gradient descent", "link": "https://arxiv.org/abs/2406.18623", "description": "arXiv:2406.18623v1 Announce Type: cross \nAbstract: We consider an on-line least squares regression problem with optimal solution $\\theta^*$ and Hessian matrix H, and study a time-average stochastic gradient descent estimator of $\\theta^*$. For $k\\ge2$, we provide an unbiased estimator of $\\theta^*$ that is a modification of the time-average estimator, runs with an expected number of time-steps of order k, and has O(1/k) expected excess risk. The constant behind the O notation depends on parameters of the regression and is a poly-logarithmic function of the smallest eigenvalue of H. We provide both a biased and unbiased estimator of the expected excess risk of the time-average estimator and of its unbiased counterpart, without requiring knowledge of either H or $\\theta^*$. We describe an \"average-start\" version of our estimators with similar properties. Our approach is based on randomized multilevel Monte Carlo.
Our numerical experiments confirm our theoretical findings."}, "https://arxiv.org/abs/2406.18814": {"title": "Length Optimization in Conformal Prediction", "link": "https://arxiv.org/abs/2406.18814", "description": "arXiv:2406.18814v1 Announce Type: cross \nAbstract: Conditional validity and length efficiency are two crucial aspects of conformal prediction (CP). Achieving conditional validity ensures accurate uncertainty quantification for data subpopulations, while proper length efficiency ensures that the prediction sets remain informative and non-trivial. Despite significant efforts to address each of these issues individually, a principled framework that reconciles these two objectives has been missing in the CP literature. In this paper, we develop Conformal Prediction with Length-Optimization (CPL) - a novel framework that constructs prediction sets with (near-) optimal length while ensuring conditional validity under various classes of covariate shifts, including the key cases of marginal and group-conditional coverage. In the infinite sample regime, we provide strong duality results which indicate that CPL achieves conditional validity and length optimality. In the finite sample regime, we show that CPL constructs conditionally valid prediction sets. Our extensive empirical evaluations demonstrate the superior prediction set size performance of CPL compared to state-of-the-art methods across diverse real-world and synthetic datasets in classification, regression, and text-related settings."}, "https://arxiv.org/abs/2406.19082": {"title": "Gratia: An R package for exploring generalized additive models", "link": "https://arxiv.org/abs/2406.19082", "description": "arXiv:2406.19082v1 Announce Type: cross \nAbstract: Generalized additive models (GAMs, Hastie & Tibshirani, 1990; Wood, 2017) are an extension of the generalized linear model that allows the effects of covariates to be modelled as smooth functions. GAMs are increasingly used in many areas of science (e.g. Pedersen, Miller, Simpson, & Ross, 2019; Simpson, 2018) because the smooth functions allow nonlinear relationships between covariates and the response to be learned from the data through the use of penalized splines. Within the R (R Core Team, 2024) ecosystem, Simon Wood's mgcv package (Wood, 2017) is widely used to fit GAMs and is a Recommended package that ships with R as part of the default install. A growing number of other R packages build upon mgcv, for example as an engine to fit specialised models not handled by mgcv itself (e.g. GJMR, Marra & Radice, 2023), or to make use of the wide range of splines available in mgcv (e.g. brms, B\\\"urkner, 2017).\n The gratia package builds upon mgcv by providing functions that make working with GAMs easier. gratia takes a tidy approach (Wickham, 2014) providing ggplot2 (Wickham, 2016) replacements for mgcv's base graphics-based plots, functions for model diagnostics and exploration of fitted models, and a family of functions for drawing samples from the posterior distribution of a fitted GAM. Additional functionality is provided to facilitate the teaching and understanding of GAMs."}, "https://arxiv.org/abs/2208.07831": {"title": "Structured prior distributions for the covariance matrix in latent factor models", "link": "https://arxiv.org/abs/2208.07831", "description": "arXiv:2208.07831v3 Announce Type: replace \nAbstract: Factor models are widely used for dimension reduction in the analysis of multivariate data. 
This is achieved through decomposition of a p x p covariance matrix into the sum of two components. Through a latent factor representation, they can be interpreted as a diagonal matrix of idiosyncratic variances and a shared variation matrix, that is, the product of a p x k factor loadings matrix and its transpose. If k << p, this defines a parsimonious factorisation of the covariance matrix. Historically, little attention has been paid to incorporating prior information in Bayesian analyses using factor models where, at best, the prior for the factor loadings is order invariant. In this work, a class of structured priors is developed that can encode ideas of dependence structure about the shared variation matrix. The construction allows data-informed shrinkage towards sensible parametric structures while also facilitating inference over the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is extended to stationary dynamic factor models. For computational inference, parameter-expanded Markov chain Monte Carlo samplers are proposed, including an efficient adaptive Gibbs sampler. Two substantive applications showcase the scope of the methodology and its inferential benefits."}, "https://arxiv.org/abs/2305.06466": {"title": "The Bayesian Infinitesimal Jackknife for Variance", "link": "https://arxiv.org/abs/2305.06466", "description": "arXiv:2305.06466v2 Announce Type: replace \nAbstract: The frequentist variability of Bayesian posterior expectations can provide meaningful measures of uncertainty even when models are misspecified. Classical methods to asymptotically approximate the frequentist covariance of Bayesian estimators such as the Laplace approximation and the nonparametric bootstrap can be practically inconvenient, since the Laplace approximation may require an intractable integral to compute the marginal log posterior, and the bootstrap requires computing the posterior for many different bootstrap datasets. We develop and explore the infinitesimal jackknife (IJ), an alternative method for computing asymptotic frequentist covariance of smooth functionals of exchangeable data, which is based on the \"influence function\" of robust statistics. We show that the influence function for posterior expectations has the form of a simple posterior covariance, and that the IJ covariance estimate is, in turn, easily computed from a single set of posterior samples. Under conditions similar to those required for a Bayesian central limit theorem to apply, we prove that the corresponding IJ covariance estimate is asymptotically equivalent to the Laplace approximation and the bootstrap. In the presence of nuisance parameters that may not obey a central limit theorem, we argue using a von Mises expansion that the IJ covariance is inconsistent, but can remain a good approximation to the limiting frequentist variance. We demonstrate the accuracy and computational benefits of the IJ covariance estimates with simulated and real-world experiments."}, "https://arxiv.org/abs/2306.06756": {"title": "Semi-Parametric Inference for Doubly Stochastic Spatial Point Processes: An Approximate Penalized Poisson Likelihood Approach", "link": "https://arxiv.org/abs/2306.06756", "description": "arXiv:2306.06756v2 Announce Type: replace \nAbstract: Doubly-stochastic point processes model the occurrence of events over a spatial domain as an inhomogeneous Poisson process conditioned on the realization of a random intensity function. 
They are flexible tools for capturing spatial heterogeneity and dependence. However, existing implementations of doubly-stochastic spatial models are computationally demanding, often have limited theoretical guarantee, and/or rely on restrictive assumptions. We propose a penalized regression method for estimating covariate effects in doubly-stochastic point processes that is computationally efficient and does not require a parametric form or stationarity of the underlying intensity. Our approach is based on an approximate (discrete and deterministic) formulation of the true (continuous and stochastic) intensity function. We show that consistency and asymptotic normality of the covariate effect estimates can be achieved despite the model misspecification, and develop a covariance estimator that leads to a valid, albeit conservative, statistical inference procedure. A simulation study shows the validity of our approach under less restrictive assumptions on the data generating mechanism, and an application to Seattle crime data demonstrates better prediction accuracy compared with existing alternatives."}, "https://arxiv.org/abs/2307.00450": {"title": "Bayesian Hierarchical Modeling and Inference for Mechanistic Systems in Industrial Hygiene", "link": "https://arxiv.org/abs/2307.00450", "description": "arXiv:2307.00450v2 Announce Type: replace \nAbstract: A series of experiments in stationary and moving passenger rail cars were conducted to measure removal rates of particles in the size ranges of SARS-CoV-2 viral aerosols, and the air changes per hour provided by existing and modified air handling systems. Such methods for exposure assessments are customarily based on mechanistic models derived from physical laws of particle movement that are deterministic and do not account for measurement errors inherent in data collection. The resulting analysis compromises on reliably learning about mechanistic factors such as ventilation rates, aerosol generation rates and filtration efficiencies from field measurements. This manuscript develops a Bayesian state space modeling framework that synthesizes information from the mechanistic system as well as the field data. We derive a stochastic model from finite difference approximations of differential equations explaining particle concentrations. Our inferential framework trains the mechanistic system using the field measurements from the chamber experiments and delivers reliable estimates of the underlying physical process with fully model-based uncertainty quantification. Our application falls within the realm of Bayesian \"melding\" of mechanistic and statistical models and is of significant relevance to industrial hygienists and public health researchers working on assessment of exposure to viral aerosols in rail car fleets."}, "https://arxiv.org/abs/2307.13094": {"title": "Inference in Experiments with Matched Pairs and Imperfect Compliance", "link": "https://arxiv.org/abs/2307.13094", "description": "arXiv:2307.13094v2 Announce Type: replace \nAbstract: This paper studies inference for the local average treatment effect in randomized controlled trials with imperfect compliance where treatment status is determined according to \"matched pairs.\" By \"matched pairs,\" we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. 
Under weak assumptions governing the quality of the pairings, we first derive the limit distribution of the usual Wald (i.e., two-stage least squares) estimator of the local average treatment effect. We show further that conventional heteroskedasticity-robust estimators of the Wald estimator's limiting variance are generally conservative, in that their probability limits are (typically strictly) larger than the limiting variance. We therefore provide an alternative estimator of the limiting variance that is consistent. Finally, we consider the use of additional observed, baseline covariates not used in pairing units to increase the precision with which we can estimate the local average treatment effect. To this end, we derive the limiting behavior of a two-stage least squares estimator of the local average treatment effect which includes both the additional covariates in addition to pair fixed effects, and show that its limiting variance is always less than or equal to that of the Wald estimator. To complete our analysis, we provide a consistent estimator of this limiting variance. A simulation study confirms the practical relevance of our theoretical results. Finally, we apply our results to revisit a prominent experiment studying the effect of macroinsurance on microenterprise in Egypt."}, "https://arxiv.org/abs/2310.00803": {"title": "A Bayesian joint model for mediation analysis with matrix-valued mediators", "link": "https://arxiv.org/abs/2310.00803", "description": "arXiv:2310.00803v2 Announce Type: replace \nAbstract: Unscheduled treatment interruptions may lead to reduced quality of care in radiation therapy (RT). Identifying the RT prescription dose effects on the outcome of treatment interruptions, mediated through doses distributed into different organs-at-risk (OARs), can inform future treatment planning. The radiation exposure to OARs can be summarized by a matrix of dose-volume histograms (DVH) for each patient. Although various methods for high-dimensional mediation analysis have been proposed recently, few studies investigated how matrix-valued data can be treated as mediators. In this paper, we propose a novel Bayesian joint mediation model for high-dimensional matrix-valued mediators. In this joint model, latent features are extracted from the matrix-valued data through an adaptation of probabilistic multilinear principal components analysis (MPCA), retaining the inherent matrix structure. We derive and implement a Gibbs sampling algorithm to jointly estimate all model parameters, and introduce a Varimax rotation method to identify active indicators of mediation among the matrix-valued data. Our simulation study finds that the proposed joint model has higher efficiency in estimating causal decomposition effects compared to an alternative two-step method, and demonstrates that the mediation effects can be identified and visualized in the matrix form. We apply the method to study the effect of prescription dose on treatment interruptions in anal canal cancer patients."}, "https://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "https://arxiv.org/abs/2310.03521", "description": "arXiv:2310.03521v2 Announce Type: replace \nAbstract: In copula models the marginal distributions and copula function are specified separately. We treat these as two modules in a modular Bayesian inference framework, and propose conducting modified Bayesian inference by \"cutting feedback\". 
Cutting feedback limits the influence of potentially misspecified modules in posterior inference. We consider two types of cuts. The first limits the influence of a misspecified copula on inference for the marginals, which is a Bayesian analogue of the popular Inference for Margins (IFM) estimator. The second limits the influence of misspecified marginals on inference for the copula parameters by using a pseudo likelihood of the ranks to define the cut model. We establish that if only one of the modules is misspecified, then the appropriate cut posterior gives accurate uncertainty quantification asymptotically for the parameters in the other module. Computation of the cut posteriors is difficult, and new variational inference methods to do so are proposed. The efficacy of the new methodology is demonstrated using both simulated data and a substantive multivariate time series application from macroeconomic forecasting. In the latter, cutting feedback from misspecified marginals to a 1096 dimension copula improves posterior inference and predictive accuracy greatly, compared to conventional Bayesian inference."}, "https://arxiv.org/abs/2311.08340": {"title": "Causal Message Passing: A Method for Experiments with Unknown and General Network Interference", "link": "https://arxiv.org/abs/2311.08340", "description": "arXiv:2311.08340v2 Announce Type: replace \nAbstract: Randomized experiments are a powerful methodology for data-driven evaluation of decisions or interventions. Yet, their validity may be undermined by network interference. This occurs when the treatment of one unit impacts not only its outcome but also that of connected units, biasing traditional treatment effect estimations. Our study introduces a new framework to accommodate complex and unknown network interference, moving beyond specialized models in the existing literature. Our framework, termed causal message-passing, is grounded in high-dimensional approximate message passing methodology. It is tailored for multi-period experiments and is particularly effective in settings with many units and prevalent network interference. The framework models causal effects as a dynamic process where a treated unit's impact propagates through the network via neighboring units until equilibrium is reached. This approach allows us to approximate the dynamics of potential outcomes over time, enabling the extraction of valuable information before treatment effects reach equilibrium. Utilizing causal message-passing, we introduce a practical algorithm to estimate the total treatment effect, defined as the impact observed when all units are treated compared to the scenario where no unit receives treatment. We demonstrate the effectiveness of this approach across five numerical scenarios, each characterized by a distinct interference structure."}, "https://arxiv.org/abs/2312.10695": {"title": "Nonparametric Strategy Test", "link": "https://arxiv.org/abs/2312.10695", "description": "arXiv:2312.10695v3 Announce Type: replace \nAbstract: We present a nonparametric statistical test for determining whether an agent is following a given mixed strategy in a repeated strategic-form game given samples of the agent's play. This involves two components: determining whether the agent's frequencies of pure strategies are sufficiently close to the target frequencies, and determining whether the pure strategies selected are independent between different game iterations. 
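A stripped-down version of this two-component strategy test can be sketched as follows: a chi-squared goodness-of-fit test for the pure-strategy frequencies, plus an ordinary Wald-Wolfowitz runs test applied to a binarised version of the play sequence, with the two p-values combined by Bonferroni as described in the next sentences. The play sequence and target frequencies below are invented, and the binarisation is a simplification of the paper's generalized runs test.

```python
import numpy as np
from scipy.stats import chisquare, norm

plays = np.array(list("RPSRRPSSPRPSRPSPRSRP"))        # illustrative R/P/S sequence
target = {"R": 1/3, "P": 1/3, "S": 1/3}               # mixed strategy under H0

# Component 1: chi-squared goodness of fit of pure-strategy frequencies
labels = list(target)
obs = np.array([(plays == s).sum() for s in labels])
exp = np.array([target[s] * len(plays) for s in labels])
p_freq = chisquare(obs, exp).pvalue

# Component 2: Wald-Wolfowitz runs test on a binarised sequence ("R" vs not "R")
x = (plays == "R").astype(int)
runs = 1 + np.sum(x[1:] != x[:-1])
n1, n2 = x.sum(), len(x) - x.sum()
mu = 2 * n1 * n2 / (n1 + n2) + 1
var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
p_runs = 2 * norm.sf(abs(runs - mu) / np.sqrt(var))

# Bonferroni combination at overall level alpha = 0.05
alpha = 0.05
reject = (p_freq < alpha / 2) or (p_runs < alpha / 2)
print(f"p_freq={p_freq:.3f}, p_runs={p_runs:.3f}, reject H0: {reject}")
```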
Our integrated test involves applying a chi-squared goodness of fit test for the first component and a generalized Wald-Wolfowitz runs test for the second component. The results from both tests are combined using Bonferroni correction to produce a complete test for a given significance level $\\alpha.$ We applied the test to publicly available data of human rock-paper-scissors play. The data consists of 50 iterations of play for 500 human players. We test with a null hypothesis that the players are following a uniform random strategy independently at each game iteration. Using a significance level of $\\alpha = 0.05$, we conclude that 305 (61%) of the subjects are following the target strategy."}, "https://arxiv.org/abs/2312.15205": {"title": "X-Vine Models for Multivariate Extremes", "link": "https://arxiv.org/abs/2312.15205", "description": "arXiv:2312.15205v2 Announce Type: replace \nAbstract: Regular vine sequences permit the organisation of variables in a random vector along a sequence of trees. Regular vine models have become greatly popular in dependence modelling as a way to combine arbitrary bivariate copulas into higher-dimensional ones, offering flexibility, parsimony, and tractability. In this project, we use regular vine structures to decompose and construct the exponent measure density of a multivariate extreme value distribution, or, equivalently, the tail copula density. Although these densities pose theoretical challenges due to their infinite mass, their homogeneity property offers simplifications. The theory sheds new light on existing parametric families and facilitates the construction of new ones, called X-vines. Computations proceed via recursive formulas in terms of bivariate model components. We develop simulation algorithms for X-vine multivariate Pareto distributions as well as methods for parameter estimation and model selection on the basis of threshold exceedances. The methods are illustrated by Monte Carlo experiments and a case study on US flight delay data."}, "https://arxiv.org/abs/2311.10263": {"title": "Stable Differentiable Causal Discovery", "link": "https://arxiv.org/abs/2311.10263", "description": "arXiv:2311.10263v2 Announce Type: replace-cross \nAbstract: Inferring causal relationships as directed acyclic graphs (DAGs) is an important but challenging problem. Differentiable Causal Discovery (DCD) is a promising approach to this problem, framing the search as a continuous optimization. But existing DCD methods are numerically unstable, with poor performance beyond tens of variables. In this paper, we propose Stable Differentiable Causal Discovery (SDCD), a new method that improves previous DCD methods in two ways: (1) It employs an alternative constraint for acyclicity; this constraint is more stable, both theoretically and empirically, and fast to compute. (2) It uses a training procedure tailored for sparse causal graphs, which are common in real-world scenarios. We first derive SDCD and prove its stability and correctness. We then evaluate it with both observational and interventional data and on both small-scale and large-scale settings. We find that SDCD outperforms existing methods in both convergence speed and accuracy and can scale to thousands of variables. 
We provide code at https://github.com/azizilab/sdcd."}, "https://arxiv.org/abs/2406.19432": {"title": "Estimation of Shannon differential entropy: An extensive comparative review", "link": "https://arxiv.org/abs/2406.19432", "description": "arXiv:2406.19432v1 Announce Type: new \nAbstract: In this research work, a total of 45 different estimators of the Shannon differential entropy were reviewed. The estimators were mainly based on three classes, namely: window size spacings, kernel density estimation (KDE) and k-nearest neighbour (kNN) estimation. A total of 16, 5 and 6 estimators were selected from each of the classes, respectively, for comparison. The performances of the 27 selected estimators, in terms of their bias values and root mean squared errors (RMSEs) as well as their asymptotic behaviours, were compared through extensive Monte Carlo simulations. The empirical comparisons were carried out at different sample sizes of 10, 50, and 100 and different variable dimensions of 1, 2, 3, and 5, for three groups of continuous distributions according to their symmetry and support. The results showed that the spacings based estimators generally performed better than the estimators from the other two classes at univariate level, but suffered from non existence at multivariate level. The kNN based estimators were generally inferior to the estimators from the other two classes considered but showed an advantage of existence for all dimensions. Also, a new class of optimal window size was obtained and sets of estimators were recommended for different groups of distributions at different variable dimensions. Finally, the asymptotic biases, variances and distributions of the 'best estimators' were considered."}, "https://arxiv.org/abs/2406.19503": {"title": "Improving Finite Sample Performance of Causal Discovery by Exploiting Temporal Structure", "link": "https://arxiv.org/abs/2406.19503", "description": "arXiv:2406.19503v1 Announce Type: new \nAbstract: Methods of causal discovery aim to identify causal structures in a data driven way. Existing algorithms are known to be unstable and sensitive to statistical errors, and are therefore rarely used with biomedical or epidemiological data. We present an algorithm that efficiently exploits temporal structure, so-called tiered background knowledge, for estimating causal structures. Tiered background knowledge is readily available from, e.g., cohort or registry data. When used efficiently it renders the algorithm more robust to statistical errors and ultimately increases accuracy in finite samples. We describe the algorithm and illustrate how it proceeds. Moreover, we offer formal proofs as well as examples of desirable properties of the algorithm, which we demonstrate empirically in an extensive simulation study. To illustrate its usefulness in practice, we apply the algorithm to data from a children's cohort study investigating the interplay of diet, physical activity and other lifestyle factors for health outcomes."}, "https://arxiv.org/abs/2406.19535": {"title": "Modeling trajectories using functional linear differential equations", "link": "https://arxiv.org/abs/2406.19535", "description": "arXiv:2406.19535v1 Announce Type: new \nAbstract: We are motivated by a study that seeks to better understand the dynamic relationship between muscle activation and paw position during locomotion. 
For each gait cycle in this experiment, activation in the biceps and triceps is measured continuously and in parallel with paw position as a mouse trotted on a treadmill. We propose an innovative general regression method that draws from both ordinary differential equations and functional data analysis to model the relationship between these functional inputs and responses as a dynamical system that evolves over time. Specifically, our model addresses gaps in both literatures and borrows strength across curves estimating ODE parameters across all curves simultaneously rather than separately modeling each functional observation. Our approach compares favorably to related functional data methods in simulations and in cross-validated predictive accuracy of paw position in the gait data. In the analysis of the gait cycles, we find that paw speed and position are dynamically influenced by inputs from the biceps and triceps muscles, and that the effect of muscle activation persists beyond the activation itself."}, "https://arxiv.org/abs/2406.19550": {"title": "Provably Efficient Posterior Sampling for Sparse Linear Regression via Measure Decomposition", "link": "https://arxiv.org/abs/2406.19550", "description": "arXiv:2406.19550v1 Announce Type: new \nAbstract: We consider the problem of sampling from the posterior distribution of a $d$-dimensional coefficient vector $\\boldsymbol{\\theta}$, given linear observations $\\boldsymbol{y} = \\boldsymbol{X}\\boldsymbol{\\theta}+\\boldsymbol{\\varepsilon}$. In general, such posteriors are multimodal, and therefore challenging to sample from. This observation has prompted the exploration of various heuristics that aim at approximating the posterior distribution.\n In this paper, we study a different approach based on decomposing the posterior distribution into a log-concave mixture of simple product measures. This decomposition allows us to reduce sampling from a multimodal distribution of interest to sampling from a log-concave one, which is tractable and has been investigated in detail. We prove that, under mild conditions on the prior, for random designs, such measure decomposition is generally feasible when the number of samples per parameter $n/d$ exceeds a constant threshold. We thus obtain a provably efficient (polynomial time) sampling algorithm in a regime where this was previously not known. Numerical simulations confirm that the algorithm is practical, and reveal that it has attractive statistical properties compared to state-of-the-art methods."}, "https://arxiv.org/abs/2406.19563": {"title": "Bayesian Rank-Clustering", "link": "https://arxiv.org/abs/2406.19563", "description": "arXiv:2406.19563v1 Announce Type: new \nAbstract: In a traditional analysis of ordinal comparison data, the goal is to infer an overall ranking of objects from best to worst with each object having a unique rank. However, the ranks of some objects may not be statistically distinguishable. This could happen due to insufficient data or to the true underlying abilities or qualities being equal for some objects. In such cases, practitioners may prefer an overall ranking where groups of objects are allowed to have equal ranks or to be $\\textit{rank-clustered}$. Existing models related to rank-clustering are limited by their inability to handle a variety of ordinal data types, to quantify uncertainty, or by the need to pre-specify the number and size of potential rank-clusters. 
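As background for the worth-parameter models invoked in the next sentence (the Bradley-Terry-Luce family), the toy sketch below evaluates pairwise win probabilities and the induced log-likelihood; objects given identical worths are exactly the kind of tie a rank-clustered fit would report. All worths and comparisons are invented for illustration and say nothing about the spike-and-slab machinery itself.

```python
import numpy as np

# Illustrative worth parameters for 4 objects (larger = better)
theta = np.array([1.2, 0.3, 0.3, -0.8])

def btl_prob(i, j, theta):
    """P(object i beats object j) under a Bradley-Terry-Luce model."""
    return np.exp(theta[i]) / (np.exp(theta[i]) + np.exp(theta[j]))

# Invented pairwise comparison data: (winner, loser)
comparisons = [(0, 1), (0, 3), (1, 2), (2, 1), (1, 3), (0, 2)]
loglik = sum(np.log(btl_prob(w, l, theta)) for w, l in comparisons)
print(f"P(0 beats 3) = {btl_prob(0, 3, theta):.3f}, log-likelihood = {loglik:.3f}")

# Objects 1 and 2 share the same worth, so their ranks are "clustered":
print(f"P(1 beats 2) = {btl_prob(1, 2, theta):.3f}")  # = 0.5 by construction
```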
We overcome these limitations through the proposed Bayesian $\\textit{Rank-Clustered Bradley-Terry-Luce}$ model. We allow for rank-clustering via parameter fusion by imposing a novel spike-and-slab prior on object-specific worth parameters in the Bradley-Terry-Luce family of distributions for ordinal comparisons. We demonstrate the model on simulated and real datasets in survey analysis, elections, and sports."}, "https://arxiv.org/abs/2406.19597": {"title": "What's the Weight? Estimating Controlled Outcome Differences in Complex Surveys for Health Disparities Research", "link": "https://arxiv.org/abs/2406.19597", "description": "arXiv:2406.19597v1 Announce Type: new \nAbstract: A basic descriptive question in statistics often asks whether there are differences in mean outcomes between groups based on levels of a discrete covariate (e.g., racial disparities in health outcomes). However, when this categorical covariate of interest is correlated with other factors related to the outcome, direct comparisons may lead to biased estimates and invalid inferential conclusions without appropriate adjustment. Propensity score methods are broadly employed with observational data as a tool to achieve covariate balance, but how to implement them in complex surveys is less studied - in particular, when the survey weights depend on the group variable under comparison. In this work, we focus on a specific example in which sample selection depends on race. We propose identification formulas to properly estimate the average controlled difference (ACD) in outcomes between Black and White individuals, with appropriate weighting for covariate imbalance across the two racial groups and generalizability. Via extensive simulation, we show that our proposed methods outperform traditional analytic approaches in terms of bias, mean squared error, and coverage. We are motivated by the interplay between race and social determinants of health when estimating racial differences in telomere length using data from the National Health and Nutrition Examination Survey. We build a propensity score model for race to properly adjust for other social determinants while characterizing the controlled effect of race on telomere length. We find that evidence of racial differences in telomere length between Black and White individuals attenuates after accounting for confounding by socioeconomic factors and after utilizing appropriate propensity score and survey weighting techniques. Software to implement these methods can be found in the R package svycdiff at https://github.com/salernos/svycdiff."}, "https://arxiv.org/abs/2406.19604": {"title": "Geodesic Causal Inference", "link": "https://arxiv.org/abs/2406.19604", "description": "arXiv:2406.19604v1 Announce Type: new \nAbstract: Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for scalar outcomes and recently for distributional outcomes. We introduce here a general framework for causal inference when outcomes reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the characterization of treatment effects through the concept of the geodesic average treatment effect.
Using ideas from Fr\\'echet regression, we develop estimation methods of the geodesic average treatment effect and derive consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Our methodology is illustrated by a simulation study and real data examples for compositional outcomes of U.S. statewise energy source data to study the effect of coal mining, network data of New York taxi trips, where the effect of the COVID-19 pandemic is of interest, and brain functional connectivity network data to study the effect of Alzheimer's disease."}, "https://arxiv.org/abs/2406.19673": {"title": "Extended sample size calculations for evaluation of prediction models using a threshold for classification", "link": "https://arxiv.org/abs/2406.19673", "description": "arXiv:2406.19673v1 Announce Type: new \nAbstract: When evaluating the performance of a model for individualised risk prediction, the sample size needs to be large enough to precisely estimate the performance measures of interest. Current sample size guidance is based on precisely estimating calibration, discrimination, and net benefit, which should be the first stage of calculating the minimum required sample size. However, when a clinically important threshold is used for classification, other performance measures can also be used. We extend the previously published guidance to precisely estimate threshold-based performance measures. We have developed closed-form solutions to estimate the sample size required to target sufficiently precise estimates of accuracy, specificity, sensitivity, PPV, NPV, and F1-score in an external evaluation study of a prediction model with a binary outcome. This approach requires the user to pre-specify the target standard error and the expected value for each performance measure. We describe how the sample size formulae were derived and demonstrate their use in an example. Extension to time-to-event outcomes is also considered. In our examples, the minimum sample size required was lower than that required to precisely estimate the calibration slope, and we expect this would most often be the case. Our formulae, along with corresponding Python code and updated R and Stata commands (pmvalsampsize), enable researchers to calculate the minimum sample size needed to precisely estimate threshold-based performance measures in an external evaluation study. These criteria should be used alongside previously published criteria to precisely estimate the calibration, discrimination, and net-benefit."}, "https://arxiv.org/abs/2406.19691": {"title": "Optimal subsampling for functional composite quantile regression in massive data", "link": "https://arxiv.org/abs/2406.19691", "description": "arXiv:2406.19691v1 Announce Type: new \nAbstract: As computer resources become increasingly limited, traditional statistical methods face challenges in analyzing massive data, especially in functional data analysis. To address this issue, subsampling offers a viable solution by significantly reducing computational requirements. This paper introduces a subsampling technique for composite quantile regression, designed for efficient application within the functional linear model on large datasets. We establish the asymptotic distribution of the subsampling estimator and introduce an optimal subsampling method based on the functional L-optimality criterion. 
Results from simulation studies and the real data analysis consistently demonstrate the superiority of the L-optimality criterion-based optimal subsampling method over the uniform subsampling approach."}, "https://arxiv.org/abs/2406.19702": {"title": "Vector AutoRegressive Moving Average Models: A Review", "link": "https://arxiv.org/abs/2406.19702", "description": "arXiv:2406.19702v1 Announce Type: new \nAbstract: Vector AutoRegressive Moving Average (VARMA) models form a powerful and general model class for analyzing dynamics among multiple time series. While VARMA models encompass the Vector AutoRegressive (VAR) models, their popularity in empirical applications is dominated by the latter. Can this phenomenon be explained fully by the simplicity of VAR models? Perhaps many users of VAR models have not fully appreciated what VARMA models can provide. The goal of this review is to provide a comprehensive resource for researchers and practitioners seeking insights into the advantages and capabilities of VARMA models. We start by reviewing the identification challenges inherent to VARMA models thereby encompassing classical and modern identification schemes and we continue along the same lines regarding estimation, specification and diagnosis of VARMA models. We then highlight the practical utility of VARMA models in terms of Granger Causality analysis, forecasting and structural analysis as well as recent advances and extensions of VARMA models to further facilitate their adoption in practice. Finally, we discuss some interesting future research directions where VARMA models can fulfill their potentials in applications as compared to their subclass of VAR models."}, "https://arxiv.org/abs/2406.19716": {"title": "Functional Time Transformation Model with Applications to Digital Health", "link": "https://arxiv.org/abs/2406.19716", "description": "arXiv:2406.19716v1 Announce Type: new \nAbstract: The advent of wearable and sensor technologies now leads to functional predictors which are intrinsically infinite dimensional. While the existing approaches for functional data and survival outcomes lean on the well-established Cox model, the proportional hazard (PH) assumption might not always be suitable in real-world applications. Motivated by physiological signals encountered in digital medicine, we develop a more general and flexible functional time-transformation model for estimating the conditional survival function with both functional and scalar covariates. A partially functional regression model is used to directly model the survival time on the covariates through an unknown monotone transformation and a known error distribution. We use Bernstein polynomials to model the monotone transformation function and the smooth functional coefficients. A sieve method of maximum likelihood is employed for estimation. Numerical simulations illustrate a satisfactory performance of the proposed method in estimation and inference. We demonstrate the application of the proposed model through two case studies involving wearable data i) Understanding the association between diurnal physical activity pattern and all-cause mortality based on accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014 and ii) Modelling Time-to-Hypoglycemia events in a cohort of diabetic patients based on distributional representation of continuous glucose monitoring (CGM) data. 
The results provide important epidemiological insights into the direct association between survival times and the physiological signals and also exhibit superior predictive performance compared to traditional summary based biomarkers in the CGM study."}, "https://arxiv.org/abs/2406.19722": {"title": "Exact Bayesian Gaussian Cox Processes Using Random Integral", "link": "https://arxiv.org/abs/2406.19722", "description": "arXiv:2406.19722v1 Announce Type: new \nAbstract: A Gaussian Cox process is a popular model for point process data, in which the intensity function is a transformation of a Gaussian process. Posterior inference of this intensity function involves an intractable integral (i.e., the cumulative intensity function) in the likelihood resulting in doubly intractable posterior distribution. Here, we propose a nonparametric Bayesian approach for estimating the intensity function of an inhomogeneous Poisson process without reliance on large data augmentation or approximations of the likelihood function. We propose to jointly model the intensity and the cumulative intensity function as a transformed Gaussian process, allowing us to directly bypass the need of approximating the cumulative intensity function in the likelihood. We propose an exact MCMC sampler for posterior inference and evaluate its performance on simulated data. We demonstrate the utility of our method in three real-world scenarios including temporal and spatial event data, as well as aggregated time count data collected at multiple resolutions. Finally, we discuss extensions of our proposed method to other point processes."}, "https://arxiv.org/abs/2406.19778": {"title": "A multiscale Bayesian nonparametric framework for partial hierarchical clustering", "link": "https://arxiv.org/abs/2406.19778", "description": "arXiv:2406.19778v1 Announce Type: new \nAbstract: In recent years, there has been a growing demand to discern clusters of subjects in datasets characterized by a large set of features. Often, these clusters may be highly variable in size and present partial hierarchical structures. In this context, model-based clustering approaches with nonparametric priors are gaining attention in the literature due to their flexibility and adaptability to new data. However, current approaches still face challenges in recognizing hierarchical cluster structures and in managing tiny clusters or singletons. To address these limitations, we propose a novel infinite mixture model with kernels organized within a multiscale structure. Leveraging a careful specification of the kernel parameters, our method allows the inclusion of additional information guiding possible hierarchies among clusters while maintaining flexibility. We provide theoretical support and an elegant, parsimonious formulation based on infinite factorization that allows efficient inference via Gibbs sampler."}, "https://arxiv.org/abs/2406.19887": {"title": "Confidence intervals for tree-structured varying coefficients", "link": "https://arxiv.org/abs/2406.19887", "description": "arXiv:2406.19887v1 Announce Type: new \nAbstract: The tree-structured varying coefficient model (TSVC) is a flexible regression approach that allows the effects of covariates to vary with the values of the effect modifiers. Relevant effect modifiers are identified inherently using recursive partitioning techniques. To quantify uncertainty in TSVC models, we propose a procedure to construct confidence intervals of the estimated partition-specific coefficients. 
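For orientation, the snippet below shows what a generic parametric bootstrap for regression-coefficient confidence intervals looks like in an ordinary linear model. It deliberately ignores the selective-inference complications created by TSVC's data-driven partitioning, which the tailored procedure described next is designed to address; the simulated design, error scale, and number of replicates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 0.5, -0.3, 0.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# Fit the working model and estimate the error scale
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma_hat = np.std(y - X @ beta_hat, ddof=X.shape[1])

# Parametric bootstrap: simulate from the fitted model and refit
B = 2000
boot = np.empty((B, X.shape[1]))
for b in range(B):
    y_star = X @ beta_hat + rng.normal(scale=sigma_hat, size=n)
    boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

# Percentile confidence intervals for each coefficient
ci = np.percentile(boot, [2.5, 97.5], axis=0)
for j in range(X.shape[1]):
    print(f"beta_{j}: estimate {beta_hat[j]: .3f}, 95% CI [{ci[0, j]: .3f}, {ci[1, j]: .3f}]")
```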
This task constitutes a selective inference problem as the coefficients of a TSVC model result from data-driven model building. To account for this issue, we introduce a parametric bootstrap approach, which is tailored to the complex structure of TSVC. Finite sample properties, particularly coverage proportions, of the proposed confidence intervals are evaluated in a simulation study. For illustration, we consider applications to data from COVID-19 patients and from patients suffering from acute odontogenic infection. The proposed approach may also be adapted for constructing confidence intervals for other tree-based methods."}, "https://arxiv.org/abs/2406.19903": {"title": "Joint estimation of insurance loss development factors using Bayesian hidden Markov models", "link": "https://arxiv.org/abs/2406.19903", "description": "arXiv:2406.19903v1 Announce Type: new \nAbstract: Loss development modelling is the actuarial practice of predicting the total 'ultimate' losses incurred on a set of policies once all claims are reported and settled. This poses a challenging prediction task as losses frequently take years to fully emerge from reported claims, and not all claims might yet be reported. Loss development models frequently estimate a set of 'link ratios' from insurance loss triangles, which are multiplicative factors transforming losses at one time point to ultimate. However, link ratios estimated using classical methods typically underestimate ultimate losses and cannot be extrapolated outside the domains of the triangle, requiring extension by 'tail factors' from another model. Although flexible, this two-step process relies on subjective decision points that might bias inference. Methods that jointly estimate 'body' link ratios and smooth tail factors offer an attractive alternative. This paper proposes a novel application of Bayesian hidden Markov models to loss development modelling, where discrete, latent states representing body and tail processes are automatically learned from the data. The hidden Markov development model is found to perform comparably to, and frequently better than, the two-step approach on numerical examples and industry datasets."}, "https://arxiv.org/abs/2406.19936": {"title": "Deep Learning of Multivariate Extremes via a Geometric Representation", "link": "https://arxiv.org/abs/2406.19936", "description": "arXiv:2406.19936v1 Announce Type: new \nAbstract: The study of geometric extremes, where extremal dependence properties are inferred from the deterministic limiting shapes of scaled sample clouds, provides an exciting approach to modelling the extremes of multivariate data. These shapes, termed limit sets, link together several popular extremal dependence modelling frameworks. Although the geometric approach is becoming an increasingly popular modelling tool, current inference techniques are limited to a low dimensional setting (d < 4), and generally require rigid modelling assumptions. In this work, we propose a range of novel theoretical results to aid with the implementation of the geometric extremes framework and introduce the first approach to modelling limit sets using deep learning. By leveraging neural networks, we construct asymptotically-justified yet flexible semi-parametric models for extremal dependence of high-dimensional data. 
We showcase the efficacy of our deep approach by modelling the complex extremal dependencies between meteorological and oceanographic variables in the North Sea off the coast of the UK."}, "https://arxiv.org/abs/2406.19940": {"title": "Closed-Form Power and Sample Size Calculations for Bayes Factors", "link": "https://arxiv.org/abs/2406.19940", "description": "arXiv:2406.19940v1 Announce Type: new \nAbstract: Determining an appropriate sample size is a critical element of study design, and the method used to determine it should be consistent with the planned analysis. When the planned analysis involves Bayes factor hypothesis testing, the sample size is usually desired to ensure a sufficiently high probability of obtaining a Bayes factor indicating compelling evidence for a hypothesis, given that the hypothesis is true. In practice, Bayes factor sample size determination is typically performed using computationally intensive Monte Carlo simulation. Here, we summarize alternative approaches that enable sample size determination without simulation. We show how, under approximate normality assumptions, sample sizes can be determined numerically, and provide the R package bfpwr for this purpose. Additionally, we identify conditions under which sample sizes can even be determined in closed-form, resulting in novel, easy-to-use formulas that also help foster intuition, enable asymptotic analysis, and can also be used for hybrid Bayesian/likelihoodist design. Furthermore, we show how in our framework power and sample size can be computed without simulation for more complex analysis priors, such as Jeffreys-Zellner-Siow priors or nonlocal normal moment priors. Case studies from medicine and psychology illustrate how researchers can use our methods to design informative yet cost-efficient studies."}, "https://arxiv.org/abs/2406.19956": {"title": "Three Scores and 15 Years (1948-2023) of Rao's Score Test: A Brief History", "link": "https://arxiv.org/abs/2406.19956", "description": "arXiv:2406.19956v1 Announce Type: new \nAbstract: Rao (1948) introduced the score test statistic as an alternative to the likelihood ratio and Wald test statistics. In spite of the optimality properties of the score statistic shown in Rao and Poti (1946), the Rao score (RS) test remained unnoticed for almost 20 years. Today, the RS test is part of the ``Holy Trinity'' of hypothesis testing and has found its place in the Statistics and Econometrics textbooks and related software. Reviewing the history of the RS test we note that remarkable test statistics proposed in the literature earlier or around the time of Rao (1948) mostly from intuition, such as Pearson (1900) goodness-fit-test, Moran (1948) I test for spatial dependence and Durbin and Watson (1950) test for serial correlation, can be given RS test statistic interpretation. At the same time, recent developments in the robust hypothesis testing under certain forms of misspecification, make the RS test an active area of research in Statistics and Econometrics. From our brief account of the history the RS test we conclude that its impact in science goes far beyond its calendar starting point with promising future research activities for many years to come."}, "https://arxiv.org/abs/2406.19965": {"title": "Futility analyses for the MCP-Mod methodology based on longitudinal models", "link": "https://arxiv.org/abs/2406.19965", "description": "arXiv:2406.19965v1 Announce Type: new \nAbstract: This article discusses futility analyses for the MCP-Mod methodology. 
Formulas are derived for calculating predictive and conditional power for MCP-Mod, which also cover the case when longitudinal models are used, allowing incomplete data from patients at interim to be utilized. A simulation study is conducted to evaluate the repeated sampling properties of the proposed decision rules and to assess the benefit of using a longitudinal versus a completer-only model for decision making at interim. The results suggest that the proposed methods perform adequately and a longitudinal analysis outperforms a completer-only analysis, particularly when the recruitment speed is higher and the correlation over time is larger. The proposed methodology is illustrated using real data from a dose-finding study for severe uncontrolled asthma."}, "https://arxiv.org/abs/2406.19986": {"title": "Instrumental Variable Estimation of Distributional Causal Effects", "link": "https://arxiv.org/abs/2406.19986", "description": "arXiv:2406.19986v1 Announce Type: new \nAbstract: Estimating the causal effect of a treatment on the entire response distribution is an important yet challenging task. For instance, one might be interested in how a pension plan affects not only the average savings among all individuals but also the entire savings distribution. While sufficiently large randomized studies can be used to estimate such distributional causal effects, they are often either not feasible in practice or involve non-compliance. A well-established class of methods for estimating average causal effects from either observational studies with unmeasured confounding or randomized studies with non-compliance is instrumental variable (IV) methods. In this work, we develop an IV-based approach for identifying and estimating distributional causal effects. We introduce a distributional IV model with corresponding assumptions, which leads to a novel identification result for the interventional cumulative distribution function (CDF) under a binary treatment. We then use this identification to construct a nonparametric estimator, called DIVE, for estimating the interventional CDFs under both treatments. We empirically assess the performance of DIVE in a simulation experiment and illustrate the usefulness of distributional causal effects on two real-data applications."}, "https://arxiv.org/abs/2406.19989": {"title": "A Closed-Form Solution to the 2-Sample Problem for Quantifying Changes in Gene Expression using Bayes Factors", "link": "https://arxiv.org/abs/2406.19989", "description": "arXiv:2406.19989v1 Announce Type: new \nAbstract: Sequencing technologies have revolutionised the field of molecular biology. We now have the ability to routinely capture the complete RNA profile in tissue samples. This wealth of data allows for comparative analyses of RNA levels at different times, shedding light on the dynamics of developmental processes, and under different environmental responses, providing insights into gene expression regulation and stress responses. However, given the inherent variability of the data stemming from biological and technological sources, quantifying changes in gene expression proves to be a statistical challenge. Here, we present a closed-form Bayesian solution to this problem. Our approach is tailored to the differential gene expression analysis of processed RNA-Seq data.
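To give a flavour of what a closed-form conjugate Bayes factor looks like in a two-sample count setting, the sketch below compares a shared-rate model against a separate-rates model for Poisson counts with Gamma priors. This is only an analogy chosen for its simple marginal likelihood; it is not the paper's model for processed RNA-Seq data, and the read counts and prior hyperparameters are invented.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_poisson_gamma(y, a=1.0, b=1.0):
    """log m(y) for y_i ~ Poisson(lam), lam ~ Gamma(a, rate=b), integrated over lam."""
    y = np.asarray(y)
    n, s = len(y), y.sum()
    return (a * np.log(b) - gammaln(a) - gammaln(y + 1).sum()
            + gammaln(a + s) - (a + s) * np.log(b + n))

# Invented read counts for one gene in two conditions
y1 = np.array([12, 15, 9, 14])
y2 = np.array([25, 31, 28, 22])

# H1: separate rates per condition; H0: one shared rate
log_bf10 = (log_marginal_poisson_gamma(y1) + log_marginal_poisson_gamma(y2)
            - log_marginal_poisson_gamma(np.concatenate([y1, y2])))
print(f"log Bayes factor in favour of differential expression: {log_bf10:.2f}")
```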
The framework unifies and streamlines an otherwise complex analysis, typically involving parameter estimations and multiple statistical tests, into a concise mathematical equation for the calculation of Bayes factors. Using conjugate priors, we can solve the equations analytically. For each gene, we calculate a Bayes factor, which can be used for ranking genes according to the statistical evidence for the gene's expression change given RNA-Seq data. The presented closed-form solution is derived under minimal assumptions and may be applied to a variety of other 2-sample problems."}, "https://arxiv.org/abs/2406.19412": {"title": "Dynamically Consistent Analysis of Realized Covariations in Term Structure Models", "link": "https://arxiv.org/abs/2406.19412", "description": "arXiv:2406.19412v1 Announce Type: cross \nAbstract: In this article we show how to analyze the covariation of bond prices nonparametrically and robustly, staying consistent with a general no-arbitrage setting. This is, in particular, motivated by the problem of identifying the number of statistically relevant factors in the bond market under minimal conditions. We apply this method in an empirical study which suggests that a high number of factors is needed to describe the term structure evolution and that the term structure of volatility varies over time."}, "https://arxiv.org/abs/2406.19573": {"title": "On Counterfactual Interventions in Vector Autoregressive Models", "link": "https://arxiv.org/abs/2406.19573", "description": "arXiv:2406.19573v1 Announce Type: cross \nAbstract: Counterfactual reasoning allows us to explore hypothetical scenarios in order to explain the impacts of our decisions. However, addressing such inquiries is impossible without establishing the appropriate mathematical framework. In this work, we introduce the problem of counterfactual reasoning in the context of vector autoregressive (VAR) processes. We also formulate the inference of a causal model as a joint regression task where, for inference, we use both data with and without interventions. After learning the model, we exploit the linearity of the VAR model to make exact predictions about the effects of counterfactual interventions. Furthermore, we quantify the total causal effects of past counterfactual interventions. The source code for this project is freely available at https://github.com/KurtButler/counterfactual_interventions."}, "https://arxiv.org/abs/2406.19974": {"title": "Generalizing self-normalized importance sampling with couplings", "link": "https://arxiv.org/abs/2406.19974", "description": "arXiv:2406.19974v1 Announce Type: cross \nAbstract: An essential problem in statistics and machine learning is the estimation of expectations involving PDFs with intractable normalizing constants. The self-normalized importance sampling (SNIS) estimator, which normalizes the IS weights, has become the standard approach due to its simplicity. However, the SNIS estimator has been shown to exhibit high variance in challenging estimation problems, e.g., involving rare events or posterior predictive distributions in Bayesian statistics. Further, most of the state-of-the-art adaptive importance sampling (AIS) methods adapt the proposal as if the weights had not been normalized. In this paper, we propose a framework that considers the original task as estimation of a ratio of two integrals.
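The SNIS entry above (arXiv:2406.19974) builds on the standard self-normalized estimator, which is itself a ratio of two importance sampling estimates sharing one proposal. A minimal sketch of plain SNIS, with a toy unnormalized target and a proposal chosen purely for illustration (not the paper's coupled construction):

```python
# Minimal self-normalized importance sampling (SNIS) sketch for E_p[f(X)]
# when p is known only up to a normalizing constant.
import numpy as np

rng = np.random.default_rng(0)

log_p_tilde = lambda x: -0.5 * (x - 2.0) ** 2        # unnormalized N(2, 1) target
f = lambda x: x                                       # estimate the target mean

# Proposal q = N(0, 3^2); draw samples and form log-weights log p_tilde - log q
x = rng.normal(0.0, 3.0, size=100_000)
log_q = -0.5 * (x / 3.0) ** 2 - np.log(3.0)           # additive constants cancel
log_w = log_p_tilde(x) - log_q

w = np.exp(log_w - log_w.max())                       # stabilized weights
snis = np.sum(w * f(x)) / np.sum(w)                   # ratio of two IS estimators
print(snis)                                           # close to 2.0
```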
In our new formulation, we obtain samples from a joint proposal distribution in an extended space, with two of its marginals playing the role of proposals used to estimate each integral. Importantly, the framework allows us to induce and control a dependency between both estimators. We propose a construction of the joint proposal that decomposes in two (multivariate) marginals and a coupling. This leads to a two-stage framework suitable to be integrated with existing or new AIS and/or variational inference (VI) algorithms. The marginals are adapted in the first stage, while the coupling can be chosen and adapted in the second stage. We show in several examples the benefits of the proposed methodology, including an application to Bayesian prediction with misspecified models."}, "https://arxiv.org/abs/2406.20088": {"title": "Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints", "link": "https://arxiv.org/abs/2406.20088", "description": "arXiv:2406.20088v1 Announce Type: cross \nAbstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate, precisely characterizing the effects of privacy constraints, source samples, and target samples on classification accuracy. The results reveal interesting phase transition phenomena and highlight the intricate trade-offs between preserving privacy and achieving classification accuracy. We then develop a data-driven adaptive classifier that achieves the optimal rate within a logarithmic factor across a large collection of parameter spaces while satisfying the same set of differential privacy constraints. Simulation studies and real-world data applications further elucidate the theoretical analysis with numerical results."}, "https://arxiv.org/abs/2109.11271": {"title": "Design-based theory for Lasso adjustment in randomized block experiments and rerandomized experiments", "link": "https://arxiv.org/abs/2109.11271", "description": "arXiv:2109.11271v3 Announce Type: replace \nAbstract: Blocking, a special case of rerandomization, is routinely implemented in the design stage of randomized experiments to balance the baseline covariates. This study proposes a regression adjustment method based on the least absolute shrinkage and selection operator (Lasso) to efficiently estimate the average treatment effect in randomized block experiments with high-dimensional covariates. We derive the asymptotic properties of the proposed estimator and outline the conditions under which this estimator is more efficient than the unadjusted one. We provide a conservative variance estimator to facilitate valid inferences. Our framework allows one treated or control unit in some blocks and heterogeneous propensity scores across blocks, thus including paired experiments and finely stratified experiments as special cases. We further accommodate rerandomized experiments and a combination of blocking and rerandomization. Moreover, our analysis allows both the number of blocks and block sizes to tend to infinity, as well as heterogeneous treatment effects across blocks without assuming a true outcome data-generating model. 
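For the Lasso-adjustment entry above (arXiv:2109.11271), a rough sense of the estimator can be conveyed by the simpler completely randomized case: fit Lasso outcome models in each arm, impute both potential outcomes for every unit, and correct with within-arm residual means. The sketch below is only that simplified analogue; the paper's blocked and rerandomized designs and its variance estimator are not implemented, and all simulation settings are assumptions.

```python
# Hedged sketch of Lasso regression adjustment for the ATE in a completely
# randomized experiment (the entry above treats the harder blocked/rerandomized case).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 400, 50
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n)                             # random assignment
Y = 1.0 * T + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # true ATE = 1

mu1 = LassoCV(cv=5).fit(X[T == 1], Y[T == 1])                # outcome model, treated
mu0 = LassoCV(cv=5).fit(X[T == 0], Y[T == 0])                # outcome model, control

# Imputation estimate plus residual correction within each arm
ate = (
    np.mean(mu1.predict(X) - mu0.predict(X))
    + np.mean(Y[T == 1] - mu1.predict(X[T == 1]))
    - np.mean(Y[T == 0] - mu0.predict(X[T == 0]))
)
print(round(ate, 3))                                          # close to 1
```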
Simulation studies and two real-data analyses demonstrate the advantages of the proposed method."}, "https://arxiv.org/abs/2310.20376": {"title": "Hierarchical Mixture of Finite Mixtures", "link": "https://arxiv.org/abs/2310.20376", "description": "arXiv:2310.20376v2 Announce Type: replace \nAbstract: Statistical modelling in the presence of data organized in groups is a crucial task in Bayesian statistics. The present paper conceives a mixture model based on a novel family of Bayesian priors designed for multilevel data and obtained by normalizing a finite point process. In particular, the work extends the popular Mixture of Finite Mixture model to the hierarchical framework to capture heterogeneity within and between groups. A full distribution theory for this new family and the induced clustering is developed, including the marginal, posterior, and predictive distributions. Efficient marginal and conditional Gibbs samplers are designed to provide posterior inference. The proposed mixture model overcomes the Hierarchical Dirichlet Process, the utmost tool for handling multilevel data, in terms of analytical feasibility, clustering discovery, and computational time. The motivating application comes from the analysis of shot put data, which contains performance measurements of athletes across different seasons. In this setting, the proposed model is exploited to induce clustering of the observations across seasons and athletes. By linking clusters across seasons, similarities and differences in athletes' performances are identified."}, "https://arxiv.org/abs/2301.12616": {"title": "Active Sequential Two-Sample Testing", "link": "https://arxiv.org/abs/2301.12616", "description": "arXiv:2301.12616v4 Announce Type: replace-cross \nAbstract: A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \\emph{active sequential two-sample testing framework} that not only sequentially but also \\emph{actively queries}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \\emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. 
In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control."}, "https://arxiv.org/abs/2310.00125": {"title": "Covariance Expressions for Multi-Fidelity Sampling with Multi-Output, Multi-Statistic Estimators: Application to Approximate Control Variates", "link": "https://arxiv.org/abs/2310.00125", "description": "arXiv:2310.00125v2 Announce Type: replace-cross \nAbstract: We provide a collection of results on covariance expressions between Monte Carlo based multi-output mean, variance, and Sobol main effect variance estimators from an ensemble of models. These covariances can be used within multi-fidelity uncertainty quantification strategies that seek to reduce the estimator variance of high-fidelity Monte Carlo estimators with an ensemble of low-fidelity models. Such covariance expressions are required within approaches like the approximate control variate and multi-level best linear unbiased estimator. While the literature provides these expressions for some single-output cases such as mean and variance, our results are relevant to both multiple function outputs and multiple statistics across any sampling strategy. Following the description of these results, we use them within an approximate control variate scheme to show that leveraging multiple outputs can dramatically reduce estimator variance compared to single-output approaches. Synthetic examples are used to highlight the effects of optimal sample allocation and pilot sample estimation. A flight-trajectory simulation of entry, descent, and landing is used to demonstrate multi-output estimation in practical applications."}, "https://arxiv.org/abs/2312.10499": {"title": "Censored extreme value estimation", "link": "https://arxiv.org/abs/2312.10499", "description": "arXiv:2312.10499v4 Announce Type: replace-cross \nAbstract: A novel and comprehensive methodology designed to tackle the challenges posed by extreme values in the context of random censorship is introduced. The main focus is on the analysis of integrals based on the product-limit estimator of normalized upper order statistics, called extreme Kaplan--Meier integrals. These integrals allow for the transparent derivation of various important asymptotic distributional properties, offering an alternative approach to conventional plug-in estimation methods. Notably, this methodology demonstrates robustness and wide applicability within the scope of max-domains of attraction. A noteworthy by-product is the extension of generalized Hill-type estimators of extremes to encompass all max-domains of attraction, which is of independent interest. The theoretical framework is applied to construct novel estimators for positive and real-valued extreme value indices for right-censored data. Simulation studies supporting the theory are provided."}, "https://arxiv.org/abs/2407.00139": {"title": "A Calibrated Sensitivity Analysis for Weighted Causal Decompositions", "link": "https://arxiv.org/abs/2407.00139", "description": "arXiv:2407.00139v1 Announce Type: new \nAbstract: Disparities in health or well-being experienced by minority groups can be difficult to study using the traditional exposure-outcome paradigm in causal inference, since potential outcomes in variables such as race or sexual minority status are challenging to interpret. 
Causal decomposition analysis addresses this gap by positing causal effects on disparities under interventions to other, intervenable exposures that may play a mediating role in the disparity. While invoking weaker assumptions than causal mediation approaches, decomposition analyses are often conducted in observational settings and require uncheckable assumptions that eliminate unmeasured confounders. Leveraging the marginal sensitivity model, we develop a sensitivity analysis for weighted causal decomposition estimators and use the percentile bootstrap to construct valid confidence intervals for causal effects on disparities. We also propose a two-parameter amplification that enhances interpretability and facilitates an intuitive understanding of the plausibility of unmeasured confounders and their effects. We illustrate our framework on a study examining the effect of parental acceptance on disparities in suicidal ideation among sexual minority youth. We find that the effect is small and sensitive to unmeasured confounding, suggesting that further screening studies are needed to identify mitigating interventions in this vulnerable population."}, "https://arxiv.org/abs/2407.00364": {"title": "Medical Knowledge Integration into Reinforcement Learning Algorithms for Dynamic Treatment Regimes", "link": "https://arxiv.org/abs/2407.00364", "description": "arXiv:2407.00364v1 Announce Type: new \nAbstract: The goal of precision medicine is to provide individualized treatment at each stage of chronic diseases, a concept formalized by Dynamic Treatment Regimes (DTR). These regimes adapt treatment strategies based on decision rules learned from clinical data to enhance therapeutic effectiveness. Reinforcement Learning (RL) algorithms allow to determine these decision rules conditioned by individual patient data and their medical history. The integration of medical expertise into these models makes possible to increase confidence in treatment recommendations and facilitate the adoption of this approach by healthcare professionals and patients. In this work, we examine the mathematical foundations of RL, contextualize its application in the field of DTR, and present an overview of methods to improve its effectiveness by integrating medical expertise."}, "https://arxiv.org/abs/2407.00381": {"title": "Climate change analysis from LRD manifold functional regression", "link": "https://arxiv.org/abs/2407.00381", "description": "arXiv:2407.00381v1 Announce Type: new \nAbstract: A functional nonlinear regression approach, incorporating time information in the covariates, is proposed for temporal strong correlated manifold map data sequence analysis. Specifically, the functional regression parameters are supported on a connected and compact two--point homogeneous space. The Generalized Least--Squares (GLS) parameter estimator is computed in the linearized model, having error term displaying manifold scale varying Long Range Dependence (LRD). The performance of the theoretical and plug--in nonlinear regression predictors is illustrated by simulations on sphere, in terms of the empirical mean of the computed spherical functional absolute errors. In the case where the second--order structure of the functional error term in the linearized model is unknown, its estimation is performed by minimum contrast in the functional spectral domain. 
The linear case is illustrated in the Supplementary Material, revealing the effect of the slow decay velocity in time of the trace norms of the covariance operator family of the regression LRD error term. The purely spatial statistical analysis of atmospheric pressure at high cloud bottom and downward solar radiation flux in Alegria et al. (2021) is extended to the spatiotemporal context, illustrating the numerical results from a generated synthetic data set."}, "https://arxiv.org/abs/2407.00561": {"title": "Advancing Information Integration through Empirical Likelihood: Selective Reviews and a New Idea", "link": "https://arxiv.org/abs/2407.00561", "description": "arXiv:2407.00561v1 Announce Type: new \nAbstract: Information integration plays a pivotal role in biomedical studies by facilitating the combination and analysis of independent datasets from multiple studies, thereby uncovering valuable insights that might otherwise remain obscured due to the limited sample size in individual studies. However, sharing raw data from independent studies presents significant challenges, primarily due to the need to safeguard sensitive participant information and the cumbersome paperwork involved in data sharing. In this article, we first provide a selective review of recent methodological developments in information integration via empirical likelihood, wherein only summary information is required, rather than the raw data. Following this, we introduce a new insight and a potentially promising framework that could broaden the application of information integration across a wider spectrum. Furthermore, this new framework offers computational convenience compared to classic empirical likelihood-based methods. We provide numerical evaluations to assess its performance and discuss various extensions in the end."}, "https://arxiv.org/abs/2407.00564": {"title": "Variational Nonparametric Inference in Functional Stochastic Block Model", "link": "https://arxiv.org/abs/2407.00564", "description": "arXiv:2407.00564v1 Announce Type: new \nAbstract: We propose a functional stochastic block model whose vertices involve functional data information. This new model extends the classic stochastic block model with vector-valued nodal information, and finds applications in real-world networks whose nodal information could be functional curves. Examples include international trade data in which a network vertex (country) is associated with the annual or quarterly GDP over a certain time period, and MyFitnessPal data in which a network vertex (MyFitnessPal user) is associated with daily calorie information measured over a certain time period. Two statistical tasks will be jointly executed. First, we will detect community structures of the network vertices assisted by the functional nodal information. Second, we propose a computationally efficient variational test to examine the significance of the functional nodal information. We show that the community detection algorithms achieve weak and strong consistency, and the variational test is asymptotically chi-square with diverging degrees of freedom. As a byproduct, we propose pointwise confidence intervals for the slope function of the functional nodal information.
Our methods are examined through both simulated and real datasets."}, "https://arxiv.org/abs/2407.00650": {"title": "Proper Scoring Rules for Multivariate Probabilistic Forecasts based on Aggregation and Transformation", "link": "https://arxiv.org/abs/2407.00650", "description": "arXiv:2407.00650v1 Announce Type: new \nAbstract: Proper scoring rules are an essential tool to assess the predictive performance of probabilistic forecasts. However, propriety alone does not ensure an informative characterization of predictive performance and it is recommended to compare forecasts using multiple scoring rules. With that in mind, interpretable scoring rules providing complementary information are necessary. We formalize a framework based on aggregation and transformation to build interpretable multivariate proper scoring rules. Aggregation-and-transformation-based scoring rules are able to target specific features of the probabilistic forecasts; which improves the characterization of the predictive performance. This framework is illustrated through examples taken from the literature and studied using numerical experiments showcasing its benefits. In particular, it is shown that it can help bridge the gap between proper scoring rules and spatial verification tools."}, "https://arxiv.org/abs/2407.00655": {"title": "Markov Switching Multiple-equation Tensor Regressions", "link": "https://arxiv.org/abs/2407.00655", "description": "arXiv:2407.00655v1 Announce Type: new \nAbstract: We propose a new flexible tensor model for multiple-equation regression that accounts for latent regime changes. The model allows for dynamic coefficients and multi-dimensional covariates that vary across equations. We assume the coefficients are driven by a common hidden Markov process that addresses structural breaks to enhance the model flexibility and preserve parsimony. We introduce a new Soft PARAFAC hierarchical prior to achieve dimensionality reduction while preserving the structural information of the covariate tensor. The proposed prior includes a new multi-way shrinking effect to address over-parametrization issues. We developed theoretical results to help hyperparameter choice. An efficient MCMC algorithm based on random scan Gibbs and back-fitting strategy is developed to achieve better computational scalability of the posterior sampling. The validity of the MCMC algorithm is demonstrated theoretically, and its computational efficiency is studied using numerical experiments in different parameter settings. The effectiveness of the model framework is illustrated using two original real data analyses. The proposed model exhibits superior performance when compared to the current benchmark, Lasso regression."}, "https://arxiv.org/abs/2407.00716": {"title": "On a General Theoretical Framework of Reliability", "link": "https://arxiv.org/abs/2407.00716", "description": "arXiv:2407.00716v1 Announce Type: new \nAbstract: Reliability is an essential measure of how closely observed scores represent latent scores (reflecting constructs), assuming some latent variable measurement model. We present a general theoretical framework of reliability, placing emphasis on measuring association between latent and observed scores. This framework was inspired by McDonald's (2011) regression framework, which highlighted the coefficient of determination as a measure of reliability. 
We extend McDonald's (2011) framework beyond coefficients of determination and introduce four desiderata for reliability measures (estimability, normalization, symmetry, and invariance). We also present theoretical examples to illustrate distinct measures of reliability and report on a numerical study that demonstrates the behavior of different reliability measures. We conclude with a discussion on the use of reliability coefficients and outline future avenues of research."}, "https://arxiv.org/abs/2407.00791": {"title": "inlabru: software for fitting latent Gaussian models with non-linear predictors", "link": "https://arxiv.org/abs/2407.00791", "description": "arXiv:2407.00791v1 Announce Type: new \nAbstract: The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian models and the likelihoods currently implemented in {INLA}, the main software implementation of the INLA methodology.\n {inlabru} is a software package that extends the types of models that can be fitted using INLA by allowing the latent predictor to be non-linear in its parameters, moving beyond the additive linear predictor framework to allow more complex functional relationships. For inference it uses an approximate iterative method based on the first-order Taylor expansion of the non-linear predictor, fitting the model using INLA for each linearised model configuration.\n {inlabru} automates much of the workflow required to fit models using {R-INLA}, simplifying the process for users to specify, fit and predict from models. There is additional support for fitting joint likelihood models by building each likelihood individually. {inlabru} also supports the direct use of spatial data structures, such as those implemented in the {sf} and {terra} packages.\n In this paper we outline the statistical theory, model structure and basic syntax required for users to understand and develop their own models using {inlabru}. We evaluate the approximate inference method using a Bayesian method checking approach. We provide three examples modelling simulated spatial data that demonstrate the benefits of the additional flexibility provided by {inlabru}."}, "https://arxiv.org/abs/2407.00797": {"title": "A placement-value based approach to concave ROC analysis", "link": "https://arxiv.org/abs/2407.00797", "description": "arXiv:2407.00797v1 Announce Type: new \nAbstract: The receiver operating characteristic (ROC) curve is an important graphic tool for evaluating a test in a wide range of disciplines. While useful, an ROC curve can cross the chance line, either by having an S-shape or a hook at the extreme specificity. These non-concave ROC curves are sub-optimal according to decision theory, as there are points that are superior than those corresponding to the portions below the chance line with either the same sensitivity or specificity. We extend the literature by proposing a novel placement value-based approach to ensure concave curvature of the ROC curve, and utilize Bayesian paradigm to make estimations under both a parametric and a semiparametric framework. 
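To see why non-concave ROC curves (the S-shapes and hooks discussed in the arXiv:2407.00797 entry above) are considered sub-optimal, it helps to compare an empirical ROC with its least concave majorant, i.e. the upper convex hull of its points. The sketch below does only that comparison and does not implement the paper's Bayesian placement-value method; the unequal-variance score distributions are an assumption chosen to produce a hook.

```python
# Illustrative sketch only: an empirical ROC curve can dip below the chance
# line; its least concave majorant (the ROC convex hull) repairs this.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
controls = rng.normal(0.0, 1.0, 300)                 # toy binormal scores with
cases = rng.normal(1.0, 2.5, 300)                    # unequal variances (hooked ROC)
y = np.r_[np.zeros(300), np.ones(300)]
scores = np.r_[controls, cases]

fpr, tpr, _ = roc_curve(y, scores)

def concave_roc(fpr, tpr):
    """Upper convex hull of the ROC points (least concave majorant)."""
    hull = []
    for p in zip(fpr, tpr):                          # fpr is already increasing
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            if (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox) >= 0:
                hull.pop()                           # drop points under the envelope
            else:
                break
        hull.append(p)
    return np.array(hull)

hull = concave_roc(fpr, tpr)
print(len(fpr), "raw ROC points ->", len(hull), "points on the concave envelope")
```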
We conduct extensive simulation studies to assess the performance of the proposed methodology under various scenarios, and apply it to a pancreatic cancer dataset."}, "https://arxiv.org/abs/2407.00846": {"title": "Estimating the cognitive effects of statins from observational data using the survival-incorporated median: a summary measure for clinical outcomes in the presence of death", "link": "https://arxiv.org/abs/2407.00846", "description": "arXiv:2407.00846v1 Announce Type: new \nAbstract: The issue of \"truncation by death\" commonly arises in clinical research: subjects may die before their follow-up assessment, resulting in undefined clinical outcomes. This article addresses truncation by death by analyzing the Long Life Family Study (LLFS), a multicenter observational study involving over 4000 older adults with familial longevity. We are interested in the cognitive effects of statins in LLFS participants, as the impact of statins on cognition remains unclear despite their widespread use. In this application, rather than treating death as a mechanism through which clinical outcomes are missing, we advocate treating death as part of the outcome measure. We focus on the survival-incorporated median, the median of a composite outcome combining death and cognitive scores, to summarize the effect of statins. We propose an estimator for the survival-incorporated median from observational data, applicable in both point-treatment settings and time-varying treatment settings. Simulations demonstrate the survival-incorporated median as a simple and useful summary measure. We apply this method to estimate the effect of statins on the change in cognitive function (measured by the Digit Symbol Substitution Test), incorporating death. Our results indicate no significant difference in cognitive decline between participants with a similar age distribution on and off statins from baseline. Through this application, we aim to not only contribute to this clinical question but also offer insights into analyzing clinical outcomes in the presence of death."}, "https://arxiv.org/abs/2407.00859": {"title": "Statistical inference on partially shape-constrained function-on-scalar linear regression models", "link": "https://arxiv.org/abs/2407.00859", "description": "arXiv:2407.00859v1 Announce Type: new \nAbstract: We consider functional linear regression models where functional outcomes are associated with scalar predictors by coefficient functions with shape constraints, such as monotonicity and convexity, that apply to sub-domains of interest. To validate the partial shape constraints, we propose testing a composite hypothesis of linear functional constraints on regression coefficients. Our approach employs kernel- and spline-based methods within a unified inferential framework, evaluating the statistical significance of the hypothesis by measuring an $L^2$-distance between constrained and unconstrained model fits. In the theoretical study of large-sample analysis under mild conditions, we show that both methods achieve the standard rate of convergence observed in the nonparametric estimation literature. Through numerical experiments of finite-sample analysis, we demonstrate that the type I error rate keeps the significance level as specified across various scenarios and that the power increases with sample size, confirming the consistency of the test procedure under both estimation methods. 
Our theoretical and numerical results provide researchers with the flexibility to choose a method based on computational preference. The practicality of partial shape-constrained inference is illustrated by two data applications: one involving clinical trials of NeuroBloc in type A-resistant cervical dystonia and the other with the National Institute of Mental Health Schizophrenia Study."}, "https://arxiv.org/abs/2407.00882": {"title": "Subgroup Identification with Latent Factor Structure", "link": "https://arxiv.org/abs/2407.00882", "description": "arXiv:2407.00882v1 Announce Type: new \nAbstract: Subgroup analysis has attracted growing attention due to its ability to identify meaningful subgroups from a heterogeneous population and thereby improve predictive power. However, in many scenarios such as social science and biology, the covariates are possibly highly correlated due to the existence of common factors, which poses great challenges for group identification and is neglected in the existing literature. In this paper, we aim to fill this gap in the ``diverging dimension'' regime and propose a center-augmented subgroup identification method under the Factor Augmented (sparse) Linear Model framework, which bridges dimension reduction and sparse regression. The proposed method accommodates possibly high cross-sectional dependence among covariates and enjoys a computational advantage, with complexity $O(nK)$, in contrast to the $O(n^2)$ complexity of the conventional pairwise fusion penalty method in the literature, where $n$ is the sample size and $K$ is the number of subgroups. We also investigate the asymptotic properties of its oracle estimators under conditions on the minimal distance between group centroids. To implement the proposed approach, we introduce a Difference of Convex functions based Alternating Direction Method of Multipliers (DC-ADMM) algorithm and demonstrate its convergence to a local minimizer in finite steps. We illustrate the superiority of the proposed method through extensive numerical experiments and a real macroeconomic data example. An \\texttt{R} package \\texttt{SILFS} implementing the method is also available on CRAN."}, "https://arxiv.org/abs/2407.00890": {"title": "Macroeconomic Forecasting with Large Language Models", "link": "https://arxiv.org/abs/2407.00890", "description": "arXiv:2407.00890v1 Announce Type: new \nAbstract: This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database.
Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios"}, "https://arxiv.org/abs/2407.01036": {"title": "Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests", "link": "https://arxiv.org/abs/2407.01036", "description": "arXiv:2407.01036v1 Announce Type: new \nAbstract: A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments."}, "https://arxiv.org/abs/2407.01055": {"title": "Exact statistical analysis for response-adaptive clinical trials: a general and computationally tractable approach", "link": "https://arxiv.org/abs/2407.01055", "description": "arXiv:2407.01055v1 Announce Type: new \nAbstract: Response-adaptive (RA) designs of clinical trials allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. RA designs face greater regulatory scrutiny due to potential type I error inflation, which limits their uptake in practice. Existing approaches to type I error control either only work for specific designs, have a risk of Monte Carlo/approximation error, are conservative, or computationally intractable. We develop a general and computationally tractable approach for exact analysis in two-arm RA designs with binary outcomes. We use the approach to construct exact tests applicable to designs that use either randomized or deterministic RA procedures, allowing for complexities such as delayed outcomes, early stopping or allocation of participants in blocks. Our efficient forward recursion implementation allows for testing of two-arm trials with 1,000 participants on a standard computer. Through an illustrative computational study of trials using randomized dynamic programming we show that, contrary to what is known for equal allocation, a conditional exact test has, almost uniformly, higher power than the unconditional test. Two real-world trials with the above-mentioned complexities are re-analyzed to demonstrate the value of our approach in controlling type I error and/or improving the statistical power."}, "https://arxiv.org/abs/2407.01186": {"title": "Data fusion for efficiency gain in ATE estimation: A practical review with simulations", "link": "https://arxiv.org/abs/2407.01186", "description": "arXiv:2407.01186v1 Announce Type: new \nAbstract: The integration of real-world data (RWD) and randomized controlled trials (RCT) is increasingly important for advancing causal inference in scientific research. This combination holds great promise for enhancing the efficiency of causal effect estimation, offering benefits such as reduced trial participant numbers and expedited drug access for patients. 
Despite the availability of numerous data fusion methods, selecting the most appropriate one for a specific research question remains challenging. This paper systematically reviews and compares these methods regarding their assumptions, limitations, and implementation complexities. Through simulations reflecting real-world scenarios, we identify a prevalent risk-reward trade-off across different methods. We investigate and interpret this trade-off, providing key insights into the strengths and weaknesses of various methods; thereby helping researchers navigate through the application of data fusion for improved causal inference."}, "https://arxiv.org/abs/2407.00417": {"title": "Obtaining $(\\epsilon,\\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables", "link": "https://arxiv.org/abs/2407.00417", "description": "arXiv:2407.00417v1 Announce Type: cross \nAbstract: We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(\\epsilon, \\delta)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database."}, "https://arxiv.org/abs/2407.01171": {"title": "Neural Conditional Probability for Inference", "link": "https://arxiv.org/abs/2407.01171", "description": "arXiv:2407.01171v1 Announce Type: cross \nAbstract: We introduce NCP (Neural Conditional Probability), a novel operator-theoretic approach for learning conditional distributions with a particular focus on inference tasks. NCP can be used to build conditional confidence regions and extract important statistics like conditional quantiles, mean, and covariance. It offers streamlined learning through a single unconditional training phase, facilitating efficient inference without the need for retraining even when conditioning changes. By tapping into the powerful approximation capabilities of neural networks, our method efficiently handles a wide variety of complex probability distributions, effectively dealing with nonlinear relationships between input and output variables. Theoretical guarantees ensure both optimization consistency and statistical accuracy of the NCP method. Our experiments show that our approach matches or beats leading methods using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This demonstrates that a minimalistic architecture with a theoretically grounded loss function can achieve competitive results without sacrificing performance, even in the face of more complex architectures."}, "https://arxiv.org/abs/2010.00729": {"title": "Individual-centered partial information in social networks", "link": "https://arxiv.org/abs/2010.00729", "description": "arXiv:2010.00729v4 Announce Type: replace \nAbstract: In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. 
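One way to picture a partial adjacency matrix of this kind is to extract, for a focal individual, the subgraph on all nodes within path length $L$. The sketch below does exactly that with a toy random graph; it is only an illustration of the general idea, and the paper's precise definition of the observed partial information may differ.

```python
# Hedged sketch of an ego-centered local network: the subgraph on nodes within
# path length L of a focal individual (illustration only).
import numpy as np

rng = np.random.default_rng(3)
n = 200
A = (rng.random((n, n)) < 0.03).astype(int)          # toy Erdos-Renyi graph
A = np.triu(A, 1); A = A + A.T                       # symmetric, no self-loops

def local_adjacency(A, i, L=2):
    """Adjacency matrix restricted to nodes within distance L of node i."""
    n = A.shape[0]
    reach = np.zeros(n, dtype=bool)
    reach[i] = True
    frontier = reach.copy()
    for _ in range(L):                               # breadth-first expansion
        frontier = (A[frontier].sum(axis=0) > 0) & ~reach
        reach |= frontier
    idx = np.flatnonzero(reach)
    return A[np.ix_(idx, idx)], idx

A_local, nodes = local_adjacency(A, i=0, L=2)
print(A_local.shape, "local network around node 0 with", len(nodes), "nodes")
```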
Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive theoretical properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual's partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure."}, "https://arxiv.org/abs/2106.09499": {"title": "Maximum Entropy Spectral Analysis: an application to gravitational waves data analysis", "link": "https://arxiv.org/abs/2106.09499", "description": "arXiv:2106.09499v2 Announce Type: replace \nAbstract: The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, offers a powerful tool for spectral estimation of a time-series. It relies on Jaynes' maximum entropy principle, allowing the spectrum of a stochastic process to be inferred using the coefficients of an autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides estimates for both the autoregressive coefficients and the order $p$ of the process. We provide a ready-to-use implementation of this algorithm in a Python package called \\texttt{memspectrum}, characterized through power spectral density (PSD) analysis on synthetic data with known PSD and comparisons of different criteria for stopping the recursion. Additionally, we compare the performance of our implementation with the ubiquitous Welch algorithm, using synthetic data generated from the GW150914 strain spectrum released by the LIGO-Virgo-Kagra collaboration. Our findings indicate that Burg's method provides PSD estimates with systematically lower variance and bias. This is particularly manifest in the case of a small (O($5000$)) number of data points, making Burg's method most suitable to work in this regime. Since this is close to the typical length of analysed gravitational waves data, improving the estimate of the PSD in this regime leads to more reliable posterior profiles for the system under study. We conclude our investigation by utilising MESA, and its particularly easy parametrisation where the only free parameter is the order $p$ of the AR process, to marginalise over the interferometers noise PSD in conjunction with inferring the parameters of GW150914."}, "https://arxiv.org/abs/2201.04811": {"title": "Binary response model with many weak instruments", "link": "https://arxiv.org/abs/2201.04811", "description": "arXiv:2201.04811v4 Announce Type: replace \nAbstract: This paper considers an endogenous binary response model with many weak instruments. We employ a control function approach and a regularization scheme to obtain better estimation results for the endogenous binary response model in the presence of many weak instruments. Two consistent and asymptotically normally distributed estimators are provided, each of which is called a regularized conditional maximum likelihood estimator (RCMLE) and a regularized nonlinear least squares estimator (RNLSE). 
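The control function idea in the arXiv:2201.04811 entry above can be sketched in its plain, unregularized form: regress the endogenous regressor on the instruments, then include the first-stage residuals as an extra covariate in a probit for the binary response. The code below shows only this Rivers-Vuong-style baseline, not the paper's regularized RCMLE/RNLSE estimators; the data-generating values are assumptions.

```python
# Hedged sketch of a plain control-function probit for a binary response with
# one endogenous regressor; no regularization for many weak instruments here.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, m = 1000, 30
Z = rng.normal(size=(n, m))                          # many (weak-ish) instruments
pi = np.full(m, 0.1)                                 # weak first-stage signal
u, v = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
d = Z @ pi + v                                       # endogenous regressor
y = (0.8 * d + u > 0).astype(int)                    # binary outcome

# Stage 1: first-stage OLS, keep residuals as the control function
stage1 = sm.OLS(d, sm.add_constant(Z)).fit()
vhat = stage1.resid

# Stage 2: probit of y on d and the control function
X2 = sm.add_constant(np.column_stack([d, vhat]))
stage2 = sm.Probit(y, X2).fit(disp=0)
print(stage2.params)                                 # identified up to the usual probit scale
```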
Monte Carlo simulations show that the proposed estimators outperform the existing ones when there are many weak instruments. We use the proposed estimation method to examine the effect of family income on college completion."}, "https://arxiv.org/abs/2211.10776": {"title": "Bayesian Modal Regression based on Mixture Distributions", "link": "https://arxiv.org/abs/2211.10776", "description": "arXiv:2211.10776v5 Announce Type: replace \nAbstract: Compared to mean regression and quantile regression, the literature on modal regression is very sparse. A unifying framework for Bayesian modal regression is proposed, based on a family of unimodal distributions indexed by the mode, along with other parameters that allow for flexible shapes and tail behaviors. Sufficient conditions for posterior propriety under an improper prior on the mode parameter are derived. Following prior elicitation, regression analysis of simulated data and datasets from several real-life applications are conducted. Besides drawing inference for covariate effects that are easy to interpret, prediction and model selection under the proposed Bayesian modal regression framework are also considered. Evidence from these analyses suggest that the proposed inference procedures are very robust to outliers, enabling one to discover interesting covariate effects missed by mean or median regression, and to construct much tighter prediction intervals than those from mean or median regression. Computer programs for implementing the proposed Bayesian modal regression are available at https://github.com/rh8liuqy/Bayesian_modal_regression."}, "https://arxiv.org/abs/2212.01699": {"title": "Parametric Modal Regression with Error in Covariates", "link": "https://arxiv.org/abs/2212.01699", "description": "arXiv:2212.01699v3 Announce Type: replace \nAbstract: An inference procedure is proposed to provide consistent estimators of parameters in a modal regression model with a covariate prone to measurement error. A score-based diagnostic tool exploiting parametric bootstrap is developed to assess adequacy of parametric assumptions imposed on the regression model. The proposed estimation method and diagnostic tool are applied to synthetic data generated from simulation experiments and data from real-world applications to demonstrate their implementation and performance. These empirical examples illustrate the importance of adequately accounting for measurement error in the error-prone covariate when inferring the association between a response and covariates based on a modal regression model that is especially suitable for skewed and heavy-tailed response data."}, "https://arxiv.org/abs/2212.04746": {"title": "Model-based clustering of categorical data based on the Hamming distance", "link": "https://arxiv.org/abs/2212.04746", "description": "arXiv:2212.04746v2 Announce Type: replace \nAbstract: A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components.\n Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. 
The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches."}, "https://arxiv.org/abs/2301.04625": {"title": "Enhanced Response Envelope via Envelope Regularization", "link": "https://arxiv.org/abs/2301.04625", "description": "arXiv:2301.04625v2 Announce Type: replace \nAbstract: The response envelope model provides substantial efficiency gains over the standard multivariate linear regression by identifying the material part of the response to the model and by excluding the immaterial part. In this paper, we propose the enhanced response envelope by incorporating a novel envelope regularization term based on a nonconvex manifold formulation. It is shown that the enhanced response envelope can yield better prediction risk than the original envelope estimator. The enhanced response envelope naturally handles high-dimensional data for which the original response envelope is not serviceable without necessary remedies. In an asymptotic high-dimensional regime where the ratio of the number of predictors over the number of samples converges to a non-zero constant, we characterize the risk function and reveal an interesting double descent phenomenon for the envelope model. A simulation study confirms our main theoretical findings. Simulations and real data applications demonstrate that the enhanced response envelope does have significantly improved prediction performance over the original envelope method, especially when the number of predictors is close to or moderately larger than the number of samples. Proofs and additional simulation results are shown in the supplementary file to this paper."}, "https://arxiv.org/abs/2301.13088": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces", "link": "https://arxiv.org/abs/2301.13088", "description": "arXiv:2301.13088v3 Announce Type: replace \nAbstract: Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. 
This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners."}, "https://arxiv.org/abs/2307.02331": {"title": "Differential recall bias in estimating treatment effects in observational studies", "link": "https://arxiv.org/abs/2307.02331", "description": "arXiv:2307.02331v2 Announce Type: replace \nAbstract: Observational studies are frequently used to estimate the effect of an exposure or treatment on an outcome. To obtain an unbiased estimate of the treatment effect, it is crucial to measure the exposure accurately. A common type of exposure misclassification is recall bias, which occurs in retrospective cohort studies when study subjects may inaccurately recall their past exposure. Particularly challenging is differential recall bias in the context of self-reported binary exposures, where the bias may be directional rather than random , and its extent varies according to the outcomes experienced. This paper makes several contributions: (1) it establishes bounds for the average treatment effect (ATE) even when a validation study is not available; (2) it proposes multiple estimation methods across various strategies predicated on different assumptions; and (3) it suggests a sensitivity analysis technique to assess the robustness of the causal conclusion, incorporating insights from prior research. The effectiveness of these methods is demonstrated through simulation studies that explore various model misspecification scenarios. These approaches are then applied to investigate the effect of childhood physical abuse on mental health in adulthood."}, "https://arxiv.org/abs/2309.10642": {"title": "Correcting Selection Bias in Standardized Test Scores Comparisons", "link": "https://arxiv.org/abs/2309.10642", "description": "arXiv:2309.10642v4 Announce Type: replace \nAbstract: This paper addresses the issue of sample selection bias when comparing countries using International assessments like PISA (Program for International Student Assessment). Despite its widespread use, PISA rankings may be biased due to different attrition patterns in different countries, leading to inaccurate comparisons. This study proposes a methodology to correct for sample selection bias using a quantile selection model. Applying the method to PISA 2018 data, I find that correcting for selection bias significantly changes the rankings (based on the mean) of countries' educational performances. My results highlight the importance of accounting for sample selection bias in international educational comparisons."}, "https://arxiv.org/abs/2310.01575": {"title": "Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis", "link": "https://arxiv.org/abs/2310.01575", "description": "arXiv:2310.01575v2 Announce Type: replace \nAbstract: Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. 
Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and a lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States."}, "https://arxiv.org/abs/2310.10393": {"title": "Statistical and Causal Robustness for Causal Null Hypothesis Tests", "link": "https://arxiv.org/abs/2310.10393", "description": "arXiv:2310.10393v2 Announce Type: replace \nAbstract: Prior work applying semiparametric theory to causal inference has primarily focused on deriving estimators that exhibit statistical robustness under a prespecified causal model that permits identification of a desired causal parameter. However, a fundamental challenge is correct specification of such a model, which usually involves making untestable assumptions. Evidence factors is an approach to combining hypothesis tests of a common causal null hypothesis under two or more candidate causal models. Under certain conditions, this yields a test that is valid if at least one of the underlying models is correct, which is a form of causal robustness. We propose a method of combining semiparametric theory with evidence factors. We develop a causal null hypothesis test based on joint asymptotic normality of K asymptotically linear semiparametric estimators, where each estimator is based on a distinct identifying functional derived from each of K candidate causal models. We show that this test provides both statistical and causal robustness in the sense that it is valid if at least one of the K proposed causal models is correct, while also allowing for slower than parametric rates of convergence in estimating nuisance functions. We demonstrate the effectiveness of our method via simulations and applications to the Framingham Heart Study and Wisconsin Longitudinal Study."}, "https://arxiv.org/abs/2308.01198": {"title": "Analyzing the Reporting Error of Public Transport Trips in the Danish National Travel Survey Using Smart Card Data", "link": "https://arxiv.org/abs/2308.01198", "description": "arXiv:2308.01198v3 Announce Type: replace-cross \nAbstract: Household travel surveys have been used for decades to collect individuals' and households' travel behavior. However, self-reported surveys are subject to recall bias, as respondents might struggle to recall and report their activities accurately.
This study examines the time reporting error of public transit users in a nationwide household travel survey by matching, at the individual level, five consecutive years of data from two sources, namely the Danish National Travel Survey (TU) and the Danish Smart Card system (Rejsekort). Survey respondents are matched with travel cards from the Rejsekort data solely based on the respondents' declared spatiotemporal travel behavior. Approximately 70% of the respondents were successfully matched with Rejsekort travel cards. The findings reveal a median time reporting error of 11.34 minutes, with an Interquartile Range of 28.14 minutes. Furthermore, a statistical analysis was performed to explore the relationships between the survey respondents' reporting error and their socio-economic and demographic characteristics. The results indicate that females and respondents with a fixed schedule are in general more accurate than males and respondents with a flexible schedule in reporting their times of travel. Moreover, trips reported during weekdays or via the internet displayed higher accuracies compared to trips reported during weekends and holidays or via telephone interviews. This disaggregated analysis provides valuable insights that could help in improving the design and analysis of travel surveys, as well as accounting for reporting errors/biases in travel survey-based applications. Furthermore, it offers valuable insights into the psychology of travel recall by survey respondents."}, "https://arxiv.org/abs/2401.13665": {"title": "Entrywise Inference for Missing Panel Data: A Simple and Instance-Optimal Approach", "link": "https://arxiv.org/abs/2401.13665", "description": "arXiv:2401.13665v2 Announce Type: replace-cross \nAbstract: Longitudinal or panel data can be represented as a matrix with rows indexed by units and columns indexed by time. We consider inferential questions associated with the missing data version of panel data induced by staggered adoption. We propose a computationally efficient procedure for estimation, involving only simple matrix algebra and singular value decomposition, and prove non-asymptotic and high-probability bounds on its error in estimating each missing entry. By controlling proximity to a suitably scaled Gaussian variable, we develop and analyze a data-driven procedure for constructing entrywise confidence intervals with pre-specified coverage. Despite its simplicity, our procedure turns out to be instance-optimal: we prove that the width of our confidence intervals matches a non-asymptotic instance-wise lower bound derived via a Bayesian Cram\\'{e}r-Rao argument. We illustrate the sharpness of our theoretical characterization on a variety of numerical examples. Our analysis is based on a general inferential toolbox for SVD-based algorithms applied to the matrix denoising model, which might be of independent interest."}, "https://arxiv.org/abs/2407.01565": {"title": "A pseudo-outcome-based framework to analyze treatment heterogeneity in survival data using electronic health records", "link": "https://arxiv.org/abs/2407.01565", "description": "arXiv:2407.01565v1 Announce Type: new \nAbstract: An important aspect of precision medicine focuses on characterizing diverse responses to treatment due to unique patient characteristics, also known as heterogeneous treatment effects (HTE), and identifying beneficial subgroups with enhanced treatment effects. Estimating HTE with right-censored data in observational studies remains challenging. 
In this paper, we propose a pseudo-outcome-based framework for analyzing HTE in survival data, which includes a list of meta-learners for estimating HTE, a variable importance metric for identifying predictive variables to HTE, and a data-adaptive procedure to select subgroups with enhanced treatment effects. We evaluate the finite sample performance of the framework under various settings of observational studies. Furthermore, we applied the proposed methods to analyze the treatment heterogeneity of a Written Asthma Action Plan (WAAP) on time-to-ED (Emergency Department) return due to asthma exacerbation using a large asthma electronic health records dataset with visit records expanded from pre- to post-COVID-19 pandemic. We identified vulnerable subgroups of patients with poorer asthma outcomes but enhanced benefits from WAAP and characterized patient profiles. Our research provides valuable insights for healthcare providers on the strategic distribution of WAAP, particularly during disruptive public health crises, ultimately improving the management and control of pediatric asthma."}, "https://arxiv.org/abs/2407.01631": {"title": "Model Identifiability for Bivariate Failure Time Data with Competing Risks: Parametric Cause-specific Hazards and Non-parametric Frailty", "link": "https://arxiv.org/abs/2407.01631", "description": "arXiv:2407.01631v1 Announce Type: new \nAbstract: One of the commonly used approaches to capture dependence in multivariate survival data is through the frailty variables. The identifiability issues should be carefully investigated while modeling multivariate survival with or without competing risks. The use of non-parametric frailty distribution(s) is sometimes preferred for its robustness and flexibility properties. In this paper, we consider modeling of bivariate survival data with competing risks through four different kinds of non-parametric frailty and parametric baseline cause-specific hazard functions to investigate the corresponding model identifiability. We make the common assumption of the frailty mean being equal to unity."}, "https://arxiv.org/abs/2407.01763": {"title": "A Cepstral Model for Efficient Spectral Analysis of Covariate-dependent Time Series", "link": "https://arxiv.org/abs/2407.01763", "description": "arXiv:2407.01763v1 Announce Type: new \nAbstract: This article introduces a novel and computationally fast model to study the association between covariates and power spectra of replicated time series. A random covariate-dependent Cram\\'{e}r spectral representation and a semiparametric log-spectral model are used to quantify the association between the log-spectra and covariates. Each replicate-specific log-spectrum is represented by the cepstrum, inducing a cepstral-based multivariate linear model with the cepstral coefficients as the responses. By using only a small number of cepstral coefficients, the model parsimoniously captures frequency patterns of time series and saves a significant amount of computational time compared to existing methods. A two-stage estimation procedure is proposed. In the first stage, a Whittle likelihood-based approach is used to estimate the truncated replicate-specific cepstral coefficients. In the second stage, parameters of the cepstral-based multivariate linear model, and consequently the effect functions of covariates, are estimated. 
The model is flexible in the sense that it can accommodate various estimation methods for the multivariate linear model, depending on the application, domain knowledge, or characteristics of the covariates. Numerical studies confirm that the proposed method outperforms some existing methods despite its simplicity and shorter computational time. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2407.01765": {"title": "A General Framework for Design-Based Treatment Effect Estimation in Paired Cluster-Randomized Experiments", "link": "https://arxiv.org/abs/2407.01765", "description": "arXiv:2407.01765v1 Announce Type: new \nAbstract: Paired cluster-randomized experiments (pCRTs) are common across many disciplines because there is often natural clustering of individuals, and paired randomization can help balance baseline covariates to improve experimental precision. Although pCRTs are common, there is surprisingly no obvious way to analyze this randomization design if an individual-level (rather than cluster-level) treatment effect is of interest. Variance estimation is also complicated due to the dependency created through pairing clusters. Therefore, we aim to provide an intuitive and practical comparison between different estimation strategies in pCRTs in order to inform practitioners' choice of strategy. To this end, we present a general framework for design-based estimation in pCRTs for average individual effects. This framework offers a novel and intuitive view on the bias-variance trade-off between estimators and emphasizes the benefits of covariate adjustment for estimation with pCRTs. In addition to providing a general framework for estimation in pCRTs, the point and variance estimators we present support fixed-sample unbiased estimation with similar precision to a common regression model and consistently conservative variance estimation. Through simulation studies, we compare the performance of the point and variance estimators reviewed. Finally, we compare the performance of estimators with simulations using real data from an educational efficacy trial. Our analysis and simulation studies inform the choice of point and variance estimators for analyzing pCRTs in practice."}, "https://arxiv.org/abs/2407.01770": {"title": "Exploring causal effects of hormone- and radio-treatments in an observational study of breast cancer using copula-based semi-competing risks models", "link": "https://arxiv.org/abs/2407.01770", "description": "arXiv:2407.01770v1 Announce Type: new \nAbstract: Breast cancer patients may experience relapse or death after surgery during the follow-up period, leading to dependent censoring of relapse. This phenomenon, known as semi-competing risk, imposes challenges in analyzing treatment effects on breast cancer and necessitates advanced statistical tools for unbiased analysis. Despite progress in estimation and inference within semi-competing risks regression, its application to causal inference is still in its early stages. This article aims to propose a frequentist and semi-parametric framework based on copula models that can facilitate valid causal inference, net quantity estimation and interpretation, and sensitivity analysis for unmeasured factors under right-censored semi-competing risks data. We also propose novel procedures to enhance parameter estimation and its applicability in real practice. 
After that, we apply the proposed framework to a breast cancer study and detect the time-varying causal effects of hormone- and radio-treatments on patients' relapse-free survival and overall survival. Moreover, extensive numerical evaluations demonstrate the method's feasibility, highlighting minimal estimation bias and reliable statistical inference."}, "https://arxiv.org/abs/2407.01868": {"title": "Forecast Linear Augmented Projection (FLAP): A free lunch to reduce forecast error variance", "link": "https://arxiv.org/abs/2407.01868", "description": "arXiv:2407.01868v1 Announce Type: new \nAbstract: A novel forecast linear augmented projection (FLAP) method is introduced, which reduces the forecast error variance of any unbiased multivariate forecast without introducing bias. The method first constructs new component series which are linear combinations of the original series. Forecasts are then generated for both the original and component series. Finally, the full vector of forecasts is projected onto a linear subspace where the constraints implied by the combination weights hold. It is proven that the trace of the forecast error variance is non-increasing with the number of components, and mild conditions are established for which it is strictly decreasing. It is also shown that the proposed method achieves maximum forecast error variance reduction among linear projection methods. The theoretical results are validated through simulations and two empirical applications based on Australian tourism and FRED-MD data. Notably, using FLAP with Principal Component Analysis (PCA) to construct the new series leads to substantial forecast error variance reduction."}, "https://arxiv.org/abs/2407.01883": {"title": "Robust Linear Mixed Models using Hierarchical Gamma-Divergence", "link": "https://arxiv.org/abs/2407.01883", "description": "arXiv:2407.01883v1 Announce Type: new \nAbstract: Linear mixed models (LMMs), which typically assume normality for both the random effects and error terms, are a popular class of methods for analyzing longitudinal and clustered data. However, such models can be sensitive to outliers, and this can lead to poor statistical results (e.g., biased inference on model parameters and inaccurate prediction of random effects) if the data are contaminated. We propose a new approach to robust estimation and inference for LMMs using a hierarchical gamma divergence, which offers an automated, data-driven approach to downweight the effects of outliers occurring in both the error, and the random effects, using normalized powered density weights. For estimation and inference, we develop a computationally scalable minorization-maximization algorithm for the resulting objective function, along with a clustered bootstrap method for uncertainty quantification and a Hyvarinen score criterion for selecting a tuning parameter controlling the degree of robustness. When the genuine and contamination mixed effects distributions are sufficiently separated, then under suitable regularity conditions assuming the number of clusters tends to infinity, we show the resulting robust estimates can be asymptotically controlled even under a heavy level of (covariate-dependent) contamination. Simulation studies demonstrate hierarchical gamma divergence consistently outperforms several currently available methods for robustifying LMMs, under a wide range of scenarios of outlier generation at both the response and random effects levels. 
We illustrate the proposed method using data from a multi-center AIDS cohort study, where the use of a robust LMM based on hierarchical gamma divergence produces noticeably different results compared to methods that do not adequately adjust for potential outlier contamination."}, "https://arxiv.org/abs/2407.02085": {"title": "Regularized estimation of Monge-Kantorovich quantiles for spherical data", "link": "https://arxiv.org/abs/2407.02085", "description": "arXiv:2407.02085v1 Announce Type: new \nAbstract: Tools from optimal transport (OT) theory have recently been used to define a notion of quantile function for directional data. In practice, regularization is mandatory for applications that require out-of-sample estimates. To this end, we introduce a regularized estimator built from entropic optimal transport, by extending the definition of the entropic map to the spherical setting. We propose a stochastic algorithm to directly solve a continuous OT problem between the uniform distribution and a target distribution, by expanding Kantorovich potentials in the basis of spherical harmonics. In addition, we define the directional Monge-Kantorovich depth, a companion concept for OT-based quantiles. We show that it benefits from desirable properties related to Liu-Zuo-Serfling axioms for the statistical analysis of directional data. Building on our regularized estimators, we illustrate the benefits of our methodology for data analysis."}, "https://arxiv.org/abs/2407.02178": {"title": "Reverse time-to-death as time-scale in time-to-event analysis for studies of advanced illness and palliative care", "link": "https://arxiv.org/abs/2407.02178", "description": "arXiv:2407.02178v1 Announce Type: new \nAbstract: Background: Incidence of adverse outcome events rises as patients with advanced illness approach end-of-life. Exposures that tend to occur near end-of-life, e.g., use of wheelchair, oxygen therapy and palliative care, may therefore be found associated with the incidence of the adverse outcomes. We propose a strategy for time-to-event analysis to mitigate the time-varying confounding. Methods: We propose a concept of reverse time-to-death (rTTD) and its use for the time-scale in time-to-event analysis. We used data on community-based palliative care uptake (exposure) and emergency department visits (outcome) among patients with advanced cancer in Singapore to illustrate. We compare the results against those of the common practice of using time-on-study (TOS) as the time-scale. Results: Graphical analysis demonstrated that cancer patients receiving palliative care had a higher rate of emergency department visits than non-recipients mainly because they were closer to end-of-life, and that rTTD analysis made comparisons between patients at the same time-to-death. Analysis of emergency department visits in relation to palliative care using the TOS time-scale showed a significant increase in the hazard ratio estimate when observed time-varying covariates were omitted from statistical adjustment (change-in-estimate=0.38; 95% CI 0.15 to 0.60). There was no such change in the otherwise identical analysis using rTTD (change-in-estimate=0.04; 95% CI -0.02 to 0.11), demonstrating the ability of the rTTD time-scale to mitigate confounding that intensifies in relation to time-to-death. 
Conclusion: Use of rTTD as time-scale in time-to-event analysis provides a simple and robust approach to control time-varying confounding in studies of advanced illness, even if the confounders are unmeasured."}, "https://arxiv.org/abs/2407.02183": {"title": "How do financial variables impact public debt growth in China? An empirical study based on Markov regime-switching model", "link": "https://arxiv.org/abs/2407.02183", "description": "arXiv:2407.02183v1 Announce Type: new \nAbstract: The deep financial turmoil in China caused by the COVID-19 pandemic has exacerbated fiscal shocks and soaring public debt levels, which raises concerns about the stability and sustainability of China's public debt growth in the future. This paper employs the Markov regime-switching model with time-varying transition probability (TVTP-MS) to investigate the growth pattern of China's public debt and the impact of financial variables such as credit, house prices and stock prices on the growth of public debt. We identify two distinct regimes of China's public debt, i.e., the surge regime with high growth rate and high volatility and the steady regime with low growth rate and low volatility. The main results are twofold. On the one hand, an increase in the growth rate of the financial variables helps to moderate the growth rate of public debt, whereas the effects differ between the two regimes. More specifically, the impacts of credit and house prices are significant in the surge regime, whereas stock prices affect public debt growth significantly in the steady regime. On the other hand, a higher growth rate of financial variables also increases the probability of public debt either staying in or switching to the steady regime. These findings highlight the necessity of aligning financial adjustments with the prevailing public debt regime when developing sustainable fiscal policies."}, "https://arxiv.org/abs/2407.02262": {"title": "Conditional Forecasts in Large Bayesian VARs with Multiple Equality and Inequality Constraints", "link": "https://arxiv.org/abs/2407.02262", "description": "arXiv:2407.02262v1 Announce Type: new \nAbstract: Conditional forecasts, i.e. projections of a set of variables of interest on the future paths of some other variables, are used routinely by empirical macroeconomists in a number of applied settings. In spite of this, the existing algorithms used to generate conditional forecasts tend to be very computationally intensive, especially when working with large Vector Autoregressions or when multiple linear equality and inequality constraints are imposed at once. We introduce a novel precision-based sampler that is fast, scales well, and yields conditional forecasts from linear equality and inequality constraints. We show in a simulation study that the proposed method produces forecasts that are identical to those from the existing algorithms but in a fraction of the time. 
We then illustrate the performance of our method in a large Bayesian Vector Autoregression where we simultaneously impose a mix of linear equality and inequality constraints on the future trajectories of key US macroeconomic indicators over the 2020--2022 period."}, "https://arxiv.org/abs/2407.02367": {"title": "Rediscovering Bottom-Up: Effective Forecasting in Temporal Hierarchies", "link": "https://arxiv.org/abs/2407.02367", "description": "arXiv:2407.02367v1 Announce Type: new \nAbstract: Forecast reconciliation has become a prominent topic in recent forecasting literature, with a primary distinction made between cross-sectional and temporal hierarchies. This work focuses on temporal hierarchies, such as aggregating monthly time series data to annual data. We explore the impact of various forecast reconciliation methods on temporally aggregated ARIMA models, thereby bridging the fields of hierarchical forecast reconciliation and temporal aggregation both theoretically and experimentally. Our paper is the first to theoretically examine the effects of temporal hierarchical forecast reconciliation, demonstrating that the optimal method aligns with a bottom-up aggregation approach. To assess the practical implications and performance of the reconciled forecasts, we conduct a series of simulation studies, confirming that the findings extend to more complex models. This result helps explain the strong performance of the bottom-up approach observed in many prior studies. Finally, we apply our methods to real data examples, where we observe similar results."}, "https://arxiv.org/abs/2407.02401": {"title": "Fuzzy Social Network Analysis: Theory and Application in a University Department's Collaboration Network", "link": "https://arxiv.org/abs/2407.02401", "description": "arXiv:2407.02401v1 Announce Type: new \nAbstract: Social network analysis (SNA) helps us understand the relationships and interactions between individuals, groups, organisations, or other social entities. In SNA, ties are generally binary or weighted based on their strength. Nonetheless, when actors are individuals, the relationships between actors are often imprecise and identifying them with simple scalars leads to information loss. Social relationships are often vague in real life. Although many classical social network techniques contemplate the use of weighted links, these approaches do not align with the original philosophy of fuzzy logic, which instead aims to preserve the vagueness inherent in human language and real life. Dealing with imprecise ties and introducing fuzziness in the definition of relationships requires an extension of social network analysis to fuzzy numbers instead of crisp values. The mathematical formalisation for this generalisation needs to extend classical centrality indices and operations to fuzzy numbers. For this reason, this paper proposes a generalisation of the so-called Fuzzy Social Network Analysis (FSNA) to the context of imprecise relationships among actors. 
The article shows the theory and application of real data collected through a fascinating mouse tracking technique to study the fuzzy relationships in a collaboration network among the members of a University department."}, "https://arxiv.org/abs/2407.01621": {"title": "Deciphering interventional dynamical causality from non-intervention systems", "link": "https://arxiv.org/abs/2407.01621", "description": "arXiv:2407.01621v1 Announce Type: cross \nAbstract: Detecting and quantifying causality is a focal topic in the fields of science, engineering, and interdisciplinary studies. However, causal studies on non-intervention systems attract much attention but remain extremely challenging. To address this challenge, we propose a framework named Interventional Dynamical Causality (IntDC) for such non-intervention systems, along with its computational criterion, Interventional Embedding Entropy (IEE), to quantify causality. The IEE criterion theoretically and numerically enables the deciphering of IntDC solely from observational (non-interventional) time-series data, without requiring any knowledge of dynamical models or real interventions in the considered system. Demonstrations of performance showed the accuracy and robustness of IEE on benchmark simulated systems as well as real-world systems, including the neural connectomes of C. elegans, COVID-19 transmission networks in Japan, and regulatory networks surrounding key circadian genes."}, "https://arxiv.org/abs/2407.01623": {"title": "Uncertainty estimation in satellite precipitation spatial prediction by combining distributional regression algorithms", "link": "https://arxiv.org/abs/2407.01623", "description": "arXiv:2407.01623v1 Announce Type: cross \nAbstract: To facilitate effective decision-making, gridded satellite precipitation products should include uncertainty estimates. Machine learning has been proposed for issuing such estimates. However, most existing algorithms for this purpose rely on quantile regression. Distributional regression offers distinct advantages over quantile regression, including the ability to model intermittency as well as a stronger ability to extrapolate beyond the training data, which is critical for predicting extreme precipitation. In this work, we introduce the concept of distributional regression for the engineering task of creating precipitation datasets through data merging. Building upon this concept, we propose new ensemble learning methods that can be valuable not only for spatial prediction but also for prediction problems in general. These methods exploit conditional zero-adjusted probability distributions estimated with generalized additive models for location, scale, and shape (GAMLSS), spline-based GAMLSS and distributional regression forests as well as their ensembles (stacking based on quantile regression, and equal-weight averaging). To identify the most effective methods for our specific problem, we compared them to benchmarks using a large, multi-source precipitation dataset. Stacking emerged as the most successful strategy. Three specific stacking methods achieved the best performance based on the quantile scoring rule, although the ranking of these methods varied across quantile levels. 
This suggests that a task-specific combination of multiple algorithms could yield significant benefits."}, "https://arxiv.org/abs/2407.01751": {"title": "Asymptotic tests for monotonicity and convexity of a probability mass function", "link": "https://arxiv.org/abs/2407.01751", "description": "arXiv:2407.01751v1 Announce Type: cross \nAbstract: In shape-constrained nonparametric inference, it is often necessary to perform preliminary tests to verify whether a probability mass function (p.m.f.) satisfies qualitative constraints such as monotonicity, convexity or in general $k$-monotonicity. In this paper, we are interested in testing $k$-monotonicity of a compactly supported p.m.f. and we put our main focus on monotonicity and convexity; i.e., $k \\in \\{1,2\\}$. We consider new testing procedures that are directly derived from the definition of $k$-monotonicity and rely exclusively on the empirical measure, as well as tests that are based on the projection of the empirical measure on the class of $k$-monotone p.m.f.s. The asymptotic behaviour of the introduced test statistics is derived and a simulation study is performed to assess the finite sample performance of all the proposed tests. Applications to real datasets are presented to illustrate the theory."}, "https://arxiv.org/abs/2407.01794": {"title": "Conditionally valid Probabilistic Conformal Prediction", "link": "https://arxiv.org/abs/2407.01794", "description": "arXiv:2407.01794v1 Announce Type: cross \nAbstract: We develop a new method for creating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution $P_{Y \\mid X}$. Most existing methods, such as conformalized quantile regression and probabilistic conformal prediction, only offer marginal coverage guarantees. Our approach extends these methods to achieve conditional coverage, which is essential for many practical applications. While exact conditional guarantees are impossible without assumptions on the data distribution, we provide non-asymptotic bounds that explicitly depend on the quality of the available estimate of the conditional distribution. Our confidence sets are highly adaptive to the local structure of the data, making them particularly useful in high heteroskedasticity situations. We demonstrate the effectiveness of our approach through extensive simulations, showing that it outperforms existing methods in terms of conditional coverage and improves the reliability of statistical inference in a wide range of applications."}, "https://arxiv.org/abs/2407.01874": {"title": "Simultaneous semiparametric inference for single-index models", "link": "https://arxiv.org/abs/2407.01874", "description": "arXiv:2407.01874v1 Announce Type: cross \nAbstract: In the common partially linear single-index model we establish a Bahadur representation for a smoothing spline estimator of all model parameters and use this result to prove the joint weak convergence of the estimator of the index link function at a given point, together with the estimators of the parametric regression coefficients. We obtain the surprising result that, despite the nature of single-index models where the link function is evaluated at a linear combination of the index-coefficients, the estimator of the link function and the estimator of the index-coefficients are asymptotically independent. 
Our approach leverages a delicate analysis based on reproducing kernel Hilbert space and empirical process theory.\n We show that the smoothing spline estimator achieves the minimax optimal rate with respect to the $L^2$-risk and consider several statistical applications where joint inference on all model parameters is of interest. In particular, we develop a simultaneous confidence band for the link function and propose inference tools to investigate if the maximum absolute deviation between the (unknown) link function and a given function exceeds a given threshold. We also construct tests for joint hypotheses regarding model parameters which involve both the nonparametric and parametric components and propose novel multiplier bootstrap procedures to avoid the estimation of unknown asymptotic quantities."}, "https://arxiv.org/abs/2208.07044": {"title": "On minimum contrast method for multivariate spatial point processes", "link": "https://arxiv.org/abs/2208.07044", "description": "arXiv:2208.07044v3 Announce Type: replace \nAbstract: Compared to widely used likelihood-based approaches, the minimum contrast (MC) method offers a computationally efficient method for estimation and inference of spatial point processes. These relative gains in computing time become more pronounced when analyzing complicated multivariate point process models. Despite this, there has been little exploration of the MC method for multivariate spatial point processes. Therefore, this article introduces a new MC method for parametric multivariate spatial point processes. A contrast function is computed based on the trace of the power of the difference between the conjectured $K$-function matrix and its nonparametric unbiased edge-corrected estimator. Under standard assumptions, we derive the asymptotic normality of our MC estimator. The performance of the proposed method is demonstrated through simulation studies of bivariate log-Gaussian Cox processes and five-variate product-shot-noise Cox processes."}, "https://arxiv.org/abs/2401.00249": {"title": "Forecasting CPI inflation under economic policy and geopolitical uncertainties", "link": "https://arxiv.org/abs/2401.00249", "description": "arXiv:2401.00249v2 Announce Type: replace \nAbstract: Forecasting consumer price index (CPI) inflation is of paramount importance for both academics and policymakers at the central banks. This study introduces a filtered ensemble wavelet neural network (FEWNet) to forecast CPI inflation, which is tested on BRIC countries. FEWNet breaks down inflation data into high and low-frequency components using wavelets and utilizes them along with other economic factors (economic policy uncertainty and geopolitical risk) to produce forecasts. All the wavelet-transformed series and filtered exogenous variables are fed into downstream autoregressive neural networks to make the final ensemble forecast. Theoretically, we show that FEWNet reduces the empirical risk compared to fully connected autoregressive neural networks. FEWNet is more accurate than other forecasting methods and can also estimate the uncertainty in its predictions due to its capacity to effectively capture non-linearities and long-range dependencies in the data through its adaptable architecture. 
This makes FEWNet a valuable tool for central banks to manage inflation."}, "https://arxiv.org/abs/2211.15771": {"title": "Approximate Gibbs Sampler for Efficient Inference of Hierarchical Bayesian Models for Grouped Count Data", "link": "https://arxiv.org/abs/2211.15771", "description": "arXiv:2211.15771v2 Announce Type: replace-cross \nAbstract: Hierarchical Bayesian Poisson regression models (HBPRMs) provide a flexible modeling approach for the relationship between predictors and count response variables. The applications of HBPRMs to large-scale datasets require efficient inference algorithms due to the high computational cost of inferring many model parameters based on random sampling. Although Markov Chain Monte Carlo (MCMC) algorithms have been widely used for Bayesian inference, sampling using this class of algorithms is time-consuming for applications with large-scale data and time-sensitive decision-making, partially due to the non-conjugacy of many models. To overcome this limitation, this research develops an approximate Gibbs sampler (AGS) to efficiently learn the HBPRMs while maintaining the inference accuracy. In the proposed sampler, the data likelihood is approximated with a Gaussian distribution such that the conditional posterior of the coefficients has a closed-form solution. Numerical experiments using real and synthetic datasets with small and large counts demonstrate the superior performance of AGS in comparison to the state-of-the-art sampling algorithm, especially for large datasets."}, "https://arxiv.org/abs/2407.02583": {"title": "Generalized Ridge Regression: Biased Estimation for Multiple Linear Regression Models", "link": "https://arxiv.org/abs/2407.02583", "description": "arXiv:2407.02583v1 Announce Type: new \nAbstract: When the regressors of an econometric linear model are nonorthogonal, it is well known that their estimation by ordinary least squares can present various problems that discourage the use of this model. The ridge regression is the most commonly used alternative; however, its generalized version has hardly been analyzed. The present work addresses the estimation of this generalized version, as well as the calculation of its mean squared error, goodness of fit and bootstrap inference."}, "https://arxiv.org/abs/2407.02671": {"title": "When Do Natural Mediation Effects Differ from Their Randomized Interventional Analogues: Test and Theory", "link": "https://arxiv.org/abs/2407.02671", "description": "arXiv:2407.02671v1 Announce Type: new \nAbstract: In causal mediation analysis, the natural direct and indirect effects (natural effects) are nonparametrically unidentifiable in the presence of treatment-induced confounding, which motivated the development of randomized interventional analogues (RIAs) of the natural effects. The RIAs are easier to identify and widely used in practice. Applied researchers often interpret RIA estimates as if they were the natural effects, even though the RIAs could be poor proxies for the natural effects. This calls for practical and theoretical guidance on when the RIAs differ from or coincide with the natural effects, which this paper aims to address. We develop a novel empirical test for the divergence between the RIAs and the natural effects under the weak assumptions sufficient for identifying the RIAs and illustrate the test using the Moving to Opportunity Study. 
We also provide new theoretical insights on the relationship between the RIAs and the natural effects from a covariance perspective and a structural equation perspective. Additionally, we discuss previously undocumented connections between the natural effects, the RIAs, and estimands in instrumental variable analysis and Wilcoxon-Mann-Whitney tests."}, "https://arxiv.org/abs/2407.02676": {"title": "Covariate-dependent hierarchical Dirichlet process", "link": "https://arxiv.org/abs/2407.02676", "description": "arXiv:2407.02676v1 Announce Type: new \nAbstract: The intricacies inherent in contemporary real datasets demand more advanced statistical models to effectively address complex challenges. In this article we delve into problems related to identifying clusters across related groups, when additional covariate information is available. We formulate a novel Bayesian nonparametric approach based on mixture models, integrating ideas from the hierarchical Dirichlet process and \"single-atoms\" dependent Dirichlet process. The proposed method exhibits exceptional generality and flexibility, accommodating both continuous and discrete covariates through the utilization of appropriate kernel functions. We construct a robust and efficient Markov chain Monte Carlo (MCMC) algorithm involving data augmentation to tackle the intractable normalized weights. The versatility of the proposed model extends our capability to discern the relationship between covariates and clusters. Through testing on both simulated and real-world datasets, our model demonstrates its capacity to identify meaningful clusters across groups, providing valuable insights for a spectrum of applications."}, "https://arxiv.org/abs/2407.02684": {"title": "A dimension reduction approach to edge weight estimation for use in spatial models", "link": "https://arxiv.org/abs/2407.02684", "description": "arXiv:2407.02684v1 Announce Type: new \nAbstract: Models for areal data are traditionally defined using the neighborhood structure of the regions on which data are observed. The unweighted adjacency matrix of a graph is commonly used to characterize the relationships between locations, resulting in the implicit assumption that all pairs of neighboring regions interact similarly, an assumption which may not be true in practice. It has been shown that more complex spatial relationships between graph nodes may be represented when edge weights are allowed to vary. Christensen and Hoff (2023) introduced a covariance model for data observed on graphs which is more flexible than traditional alternatives, parameterizing covariance as a function of an unknown edge weights matrix. A potential issue with their approach is that each edge weight is treated as a unique parameter, resulting in increasingly challenging parameter estimation as graph size increases. Within this article we propose a framework for estimating edge weight matrices that reduces their effective dimension via a basis function representation of the edge weights. 
We show that this method may be used to enhance the performance and flexibility of covariance models parameterized by such matrices in a series of illustrations, simulations and data examples."}, "https://arxiv.org/abs/2407.02902": {"title": "Instrumental Variable methods to target Hypothetical Estimands with longitudinal repeated measures data: Application to the STEP 1 trial", "link": "https://arxiv.org/abs/2407.02902", "description": "arXiv:2407.02902v1 Announce Type: new \nAbstract: The STEP 1 randomized trial evaluated the effect of taking semaglutide vs placebo on body weight over a 68 week duration. As with any study evaluating an intervention delivered over a sustained period, non-adherence was observed. This was addressed in the original trial analysis within the Estimand Framework by viewing non-adherence as an intercurrent event. The primary analysis applied a treatment policy strategy which viewed it as an aspect of the treatment regimen, and thus made no adjustment for its presence. A supplementary analysis used a hypothetical strategy, targeting an estimand that would have been realised had all participants adhered, under the assumption that no post-baseline variables confounded adherence and change in body weight. In this paper we propose an alternative Instrumental Variable method to adjust for non-adherence which does not rely on the same `unconfoundedness' assumption and is less vulnerable to positivity violations (e.g., it can give valid results even under conditions where non-adherence is guaranteed). Unlike many previous Instrumental Variable approaches, it makes full use of the repeatedly measured outcome data, and allows for a time-varying effect of treatment adherence on a participant's weight. We show that it provides a natural vehicle for defining two distinct hypothetical estimands: the treatment effect if all participants would have adhered to semaglutide, and the treatment effect if all participants would have adhered to both semaglutide and placebo. When applied to the STEP 1 study, they both suggest a sustained, slowly decaying weight loss effect of semaglutide treatment."}, "https://arxiv.org/abs/2407.03085": {"title": "Accelerated Inference for Partially Observed Markov Processes using Automatic Differentiation", "link": "https://arxiv.org/abs/2407.03085", "description": "arXiv:2407.03085v1 Announce Type: new \nAbstract: Automatic differentiation (AD) has driven recent advances in machine learning, including deep neural networks and Hamiltonian Markov Chain Monte Carlo methods. Partially observed nonlinear stochastic dynamical systems have proved resistant to AD techniques because widely used particle filter algorithms yield an estimated likelihood function that is discontinuous as a function of the model parameters. We show how to embed two existing AD particle filter methods in a theoretical framework that provides an extension to a new class of algorithms. This new class permits a bias/variance tradeoff and hence a mean squared error substantially lower than the existing algorithms. We develop likelihood maximization algorithms suited to the Monte Carlo properties of the AD gradient estimate. Our algorithms require only a differentiable simulator for the latent dynamic system; by contrast, most previous approaches to AD likelihood maximization for particle filters require access to the system's transition probabilities. 
Numerical results indicate that a hybrid algorithm that uses AD to refine a coarse solution from an iterated filtering algorithm shows substantial improvement over current state-of-the-art methods for a challenging scientific benchmark problem."}, "https://arxiv.org/abs/2407.03167": {"title": "Tail calibration of probabilistic forecasts", "link": "https://arxiv.org/abs/2407.03167", "description": "arXiv:2407.03167v1 Announce Type: new \nAbstract: Probabilistic forecasts comprehensively describe the uncertainty in the unknown future outcome, making them essential for decision making and risk management. While several methods have been introduced to evaluate probabilistic forecasts, existing evaluation techniques are ill-suited to the evaluation of tail properties of such forecasts. However, these tail properties are often of particular interest to forecast users due to the severe impacts caused by extreme outcomes. In this work, we introduce a general notion of tail calibration for probabilistic forecasts, which allows forecasters to assess the reliability of their predictions for extreme outcomes. We study the relationships between tail calibration and standard notions of forecast calibration, and discuss connections to peaks-over-threshold models in extreme value theory. Diagnostic tools are introduced and applied in a case study on European precipitation forecasts."}, "https://arxiv.org/abs/2407.03265": {"title": "Policymaker meetings as heteroscedasticity shifters: Identification and simultaneous inference in unstable SVARs", "link": "https://arxiv.org/abs/2407.03265", "description": "arXiv:2407.03265v1 Announce Type: new \nAbstract: We propose a novel approach to identification in structural vector autoregressions (SVARs) that uses external instruments for heteroscedasticity of a structural shock of interest. This approach does not require lead/lag exogeneity for identification, does not require heteroskedasticity to be persistent, and facilitates interpretation of the structural shocks. To implement this identification approach in applications, we develop a new method for simultaneous inference of structural impulse responses and other parameters, employing a dependent wild-bootstrap of local projection estimators. This method is robust to an arbitrary number of unit roots and cointegration relationships, time-varying local means and drifts, and conditional heteroskedasticity of unknown form and can be used with other identification schemes, including Cholesky and the conventional external IV. We show how to construct pointwise and simultaneous confidence bounds for structural impulse responses and how to compute smoothed local projections with the corresponding confidence bounds. Using simulated data from a standard log-linearized DSGE model, we show that the method can reliably recover the true impulse responses in realistic datasets. As an empirical application, we adopt the proposed method in order to identify a monetary policy shock using the dates of FOMC meetings in a standard six-variable VAR. The robustness of our identification and inference methods allows us to construct an instrumental variable for the monetary policy shock that dates back to 1965. The resulting impulse response functions for all variables align with the classical Cholesky identification scheme and are different from the narrative sign restricted Bayesian VAR estimates. 
In particular, the response to inflation manifests a price puzzle that is indicative of the cost channel of the interest rates."}, "https://arxiv.org/abs/2407.03279": {"title": "Finely Stratified Rerandomization Designs", "link": "https://arxiv.org/abs/2407.03279", "description": "arXiv:2407.03279v1 Announce Type: new \nAbstract: We study estimation and inference on causal parameters under finely stratified rerandomization designs, which use baseline covariates to match units into groups (e.g. matched pairs), then rerandomize within-group treatment assignments until a balance criterion is satisfied. We show that finely stratified rerandomization does partially linear regression adjustment by design, providing nonparametric control over the covariates used for stratification, and linear control over the rerandomization covariates. We also introduce novel rerandomization criteria, allowing for nonlinear imbalance metrics and proposing a minimax scheme that optimizes the balance criterion using pilot data or prior information provided by the researcher. While the asymptotic distribution of generalized method of moments (GMM) estimators under stratified rerandomization is generically non-Gaussian, we show how to restore asymptotic normality using optimal ex-post linear adjustment. This allows us to provide simple asymptotically exact inference methods for superpopulation parameters, as well as efficient conservative inference methods for finite population parameters."}, "https://arxiv.org/abs/2405.14896": {"title": "Study on spike-and-wave detection in epileptic signals using t-location-scale distribution and the K-nearest neighbors classifier", "link": "https://arxiv.org/abs/2405.14896", "description": "arXiv:2405.14896v1 Announce Type: cross \nAbstract: Pattern classification in electroencephalography (EEG) signals is an important problem in biomedical engineering since it enables the detection of brain activity, particularly the early detection of epileptic seizures. In this paper, we propose a k-nearest neighbors classification for epileptic EEG signals based on a t-location-scale statistical representation to detect spike-and-waves. The proposed approach is demonstrated on a real dataset containing both spike-and-wave events and normal brain function signals, where our performance is evaluated in terms of classification accuracy, sensitivity, and specificity."}, "https://arxiv.org/abs/2407.02657": {"title": "Large Scale Hierarchical Industrial Demand Time-Series Forecasting incorporating Sparsity", "link": "https://arxiv.org/abs/2407.02657", "description": "arXiv:2407.02657v1 Announce Type: cross \nAbstract: Hierarchical time-series forecasting (HTSF) is an important problem for many real-world business applications where the goal is to simultaneously forecast multiple time-series that are related to each other via a hierarchical relation. Recent works, however, do not address two important challenges that are typically observed in many demand forecasting applications at large companies. First, many time-series at lower levels of the hierarchy have high sparsity i.e., they have a significant number of zeros. Most HTSF methods do not address this varying sparsity across the hierarchy. Further, they do not scale well to the large size of the real-world hierarchy typically unseen in benchmarks used in literature. 
We resolve both these challenges by proposing HAILS, a novel probabilistic hierarchical model that enables accurate and calibrated probabilistic forecasts across the hierarchy by adaptively modeling sparse and dense time-series with different distributional assumptions and reconciling them to adhere to hierarchical constraints. We show the scalability and effectiveness of our methods by evaluating them against real-world demand forecasting datasets. We deploy HAILS at a large chemical manufacturing company for a product demand forecasting application with over ten thousand products and observe a significant 8.5\\% improvement in forecast accuracy and 23% better improvement for sparse time-series. The enhanced accuracy and scalability make HAILS a valuable tool for improved business planning and customer experience."}, "https://arxiv.org/abs/2407.02702": {"title": "Practical Guide for Causal Pathways and Sub-group Disparity Analysis", "link": "https://arxiv.org/abs/2407.02702", "description": "arXiv:2407.02702v1 Announce Type: cross \nAbstract: In this study, we introduce the application of causal disparity analysis to unveil intricate relationships and causal pathways between sensitive attributes and the targeted outcomes within real-world observational data. Our methodology involves employing causal decomposition analysis to quantify and examine the causal interplay between sensitive attributes and outcomes. We also emphasize the significance of integrating heterogeneity assessment in causal disparity analysis to gain deeper insights into the impact of sensitive attributes within specific sub-groups on outcomes. Our two-step investigation focuses on datasets where race serves as the sensitive attribute. The results on two datasets indicate the benefit of leveraging causal analysis and heterogeneity assessment not only for quantifying biases in the data but also for disentangling their influences on outcomes. We demonstrate that the sub-groups identified by our approach as the most affected by disparities are the ones with the largest ML classification errors. We also show that grouping the data only based on a sensitive attribute is not enough, and through these analyses, we can find sub-groups that are directly affected by disparities. We hope that our findings will encourage the adoption of such methodologies in future ethical AI practices and bias audits, fostering a more equitable and fair technological landscape."}, "https://arxiv.org/abs/2407.02754": {"title": "Is Cross-Validation the Gold Standard to Evaluate Model Performance?", "link": "https://arxiv.org/abs/2407.02754", "description": "arXiv:2407.02754v1 Announce Type: cross \nAbstract: Cross-Validation (CV) is the default choice for evaluating the performance of machine learning models. Despite its wide usage, its statistical benefits have remained half-understood, especially in challenging nonparametric regimes. In this paper we fill in this gap and show that in fact, for a wide spectrum of models, CV does not statistically outperform the simple \"plug-in\" approach where one reuses training data for testing evaluation. Specifically, in terms of both the asymptotic bias and coverage accuracy of the associated interval for out-of-sample evaluation, $K$-fold CV provably cannot outperform plug-in regardless of the rate at which the parametric or nonparametric models converge. 
Leave-one-out CV can have a smaller bias as compared to plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases leave-one-out again does not outperform plug-in once this variability is taken into account. We obtain our theoretical comparisons via a novel higher-order Taylor analysis that allows us to derive necessary conditions for limit theorems of testing evaluations, which applies to model classes that are not amenable to previously known sufficient conditions. Our numerical results demonstrate that plug-in performs indeed no worse than CV across a wide range of examples."}, "https://arxiv.org/abs/2407.03094": {"title": "Conformal Prediction for Causal Effects of Continuous Treatments", "link": "https://arxiv.org/abs/2407.03094", "description": "arXiv:2407.03094v1 Announce Type: cross \nAbstract: Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data."}, "https://arxiv.org/abs/2108.06473": {"title": "Evidence Aggregation for Treatment Choice", "link": "https://arxiv.org/abs/2108.06473", "description": "arXiv:2108.06473v2 Announce Type: replace \nAbstract: Consider a planner who has limited knowledge of the policy's causal impact on a certain local population of interest due to a lack of data, but does have access to the publicized intervention studies performed for similar policies on different populations. How should the planner make use of and aggregate this existing evidence to make her policy decision? Following Manski (2020; Towards Credible Patient-Centered Meta-Analysis, \\textit{Epidemiology}), we formulate the planner's problem as a statistical decision problem with a social welfare objective, and solve for an optimal aggregation rule under the minimax-regret criterion. We investigate the analytical properties, computational feasibility, and welfare regret performance of this rule. 
We apply the minimax regret decision rule to two settings: whether to enact an active labor market policy based on 14 randomized control trial studies; and whether to approve a drug (Remdesivir) for COVID-19 treatment using a meta-database of clinical trials."}, "https://arxiv.org/abs/2112.01709": {"title": "Optimized variance estimation under interference and complex experimental designs", "link": "https://arxiv.org/abs/2112.01709", "description": "arXiv:2112.01709v2 Announce Type: replace \nAbstract: Unbiased and consistent variance estimators generally do not exist for design-based treatment effect estimators because experimenters never observe more than one potential outcome for any unit. The problem is exacerbated by interference and complex experimental designs. Experimenters must accept conservative variance estimators in these settings, but they can strive to minimize conservativeness. In this paper, we show that the task of constructing a minimally conservative variance estimator can be interpreted as an optimization problem that aims to find the lowest estimable upper bound of the true variance given the experimenter's risk preference and knowledge of the potential outcomes. We characterize the set of admissible bounds in the class of quadratic forms, and we demonstrate that the optimization problem is a convex program for many natural objectives. The resulting variance estimators are guaranteed to be conservative regardless of whether the background knowledge used to construct the bound is correct, but the estimators are less conservative if the provided information is reasonably accurate. Numerical results show that the resulting variance estimators can be considerably less conservative than existing estimators, allowing experimenters to draw more informative inferences about treatment effects."}, "https://arxiv.org/abs/2207.10513": {"title": "A flexible and interpretable spatial covariance model for data on graphs", "link": "https://arxiv.org/abs/2207.10513", "description": "arXiv:2207.10513v2 Announce Type: replace \nAbstract: Spatial models for areal data are often constructed such that all pairs of adjacent regions are assumed to have near-identical spatial autocorrelation. In practice, data can exhibit dependence structures more complicated than can be represented under this assumption. In this article we develop a new model for spatially correlated data observed on graphs, which can flexibly represent many types of spatial dependence patterns while retaining aspects of the original graph geometry. Our method implies an embedding of the graph into Euclidean space wherein covariance can be modeled using traditional covariance functions, such as those from the Mat\\'{e}rn family. We parameterize our model using a class of graph metrics compatible with such covariance functions, and which characterize distance in terms of network flow, a property useful for understanding proximity in many ecological settings. By estimating the parameters underlying these metrics, we recover the \"intrinsic distances\" between graph nodes, which assist in the interpretation of the estimated covariance and allow us to better understand the relationship between the observed process and spatial domain. We compare our model to existing methods for spatially dependent graph data, primarily conditional autoregressive models and their variants, and illustrate advantages of our method over traditional approaches. 
We fit our model to bird abundance data for several species in North Carolina, and show how it provides insight into the interactions between species-specific spatial distributions and geography."}, "https://arxiv.org/abs/2212.02335": {"title": "Policy Learning with the polle package", "link": "https://arxiv.org/abs/2212.02335", "description": "arXiv:2212.02335v4 Announce Type: replace \nAbstract: The R package polle is a unifying framework for learning and evaluating finite stage policies based on observational data. The package implements a collection of existing and novel methods for causal policy learning including doubly robust restricted Q-learning, policy tree learning, and outcome weighted learning. The package deals with (near) positivity violations by only considering realistic policies. Highly flexible machine learning methods can be used to estimate the nuisance components and valid inference for the policy value is ensured via cross-fitting. The library is built up around a simple syntax with four main functions policy_data(), policy_def(), policy_learn(), and policy_eval() used to specify the data structure, define user-specified policies, specify policy learning methods and evaluate (learned) policies. The functionality of the package is illustrated via extensive reproducible examples."}, "https://arxiv.org/abs/2306.08940": {"title": "Spatial modeling of extremes and an angular component", "link": "https://arxiv.org/abs/2306.08940", "description": "arXiv:2306.08940v2 Announce Type: replace \nAbstract: Many environmental processes such as rainfall, wind or snowfall are inherently spatial and the modelling of extremes has to take into account that feature. In addition, environmental processes are often attached with an angle, e.g., wind speed and direction or extreme snowfall and time of occurrence in year. This article proposes a Bayesian hierarchical model with a conditional independence assumption that aims at modelling simultaneously spatial extremes and an angular component. The proposed model relies on the extreme value theory as well as recent developments for handling directional statistics over a continuous domain. Working within a Bayesian setting, a Gibbs sampler is introduced whose performances are analysed through a simulation study. The paper ends with an application on extreme wind speed in France. Results show that extreme wind events in France are mainly coming from West apart from the Mediterranean part of France and the Alps."}, "https://arxiv.org/abs/1607.00393": {"title": "Frequentist properties of Bayesian inequality tests", "link": "https://arxiv.org/abs/1607.00393", "description": "arXiv:1607.00393v4 Announce Type: replace-cross \nAbstract: Bayesian and frequentist criteria fundamentally differ, but often posterior and sampling distributions agree asymptotically (e.g., Gaussian with same covariance). For the corresponding single-draw experiment, we characterize the frequentist size of a certain Bayesian hypothesis test of (possibly nonlinear) inequalities. If the null hypothesis is that the (possibly infinite-dimensional) parameter lies in a certain half-space, then the Bayesian test's size is $\\alpha$; if the null hypothesis is a subset of a half-space, then size is above $\\alpha$; and in other cases, size may be above, below, or equal to $\\alpha$. Rejection probabilities at certain points in the parameter space are also characterized. 
Two examples illustrate our results: translog cost function curvature and ordinal distribution relationships."}, "https://arxiv.org/abs/2010.03832": {"title": "Estimation of the Spectral Measure from Convex Combinations of Regularly Varying Random Vectors", "link": "https://arxiv.org/abs/2010.03832", "description": "arXiv:2010.03832v2 Announce Type: replace-cross \nAbstract: The extremal dependence structure of a regularly varying random vector X is fully described by its limiting spectral measure. In this paper, we investigate how to recover characteristics of the measure, such as extremal coefficients, from the extremal behaviour of convex combinations of components of X. Our considerations result in a class of new estimators of moments of the corresponding combinations for the spectral vector. We show asymptotic normality by means of a functional limit theorem and, focusing on the estimation of extremal coefficients, we verify that the minimal asymptotic variance can be achieved by a plug-in estimator using subsampling bootstrap. We illustrate the benefits of our approach on simulated and real data."}, "https://arxiv.org/abs/2407.03379": {"title": "missForestPredict -- Missing data imputation for prediction settings", "link": "https://arxiv.org/abs/2407.03379", "description": "arXiv:2407.03379v1 Announce Type: new \nAbstract: Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times."}, "https://arxiv.org/abs/2407.03383": {"title": "Continuous Optimization for Offline Change Point Detection and Estimation", "link": "https://arxiv.org/abs/2407.03383", "description": "arXiv:2407.03383v1 Announce Type: new \nAbstract: This work explores the use of novel advances in best subset selection for regression modelling via continuous optimization for offline change point detection and estimation in univariate Gaussian data sequences. The approach exploits reformulating the normal mean multiple change point model into a regularized statistical inverse problem enforcing sparsity. After introducing the problem statement, criteria and previous investigations via Lasso-regularization, the recently developed framework of continuous optimization for best subset selection (COMBSS) is briefly introduced and related to the problem at hand. 
Supervised and unsupervised perspectives are explored with the latter testing different approaches for the choice of regularization penalty parameters via the discrepancy principle and a confidence bound. The main result is an adaptation and evaluation of the COMBSS approach for offline normal mean multiple change-point detection via experimental results on simulated data for different choices of regularisation parameters. Results and future directions are discussed."}, "https://arxiv.org/abs/2407.03389": {"title": "A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data", "link": "https://arxiv.org/abs/2407.03389", "description": "arXiv:2407.03389v1 Announce Type: new \nAbstract: In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The method is a variant of the Deterministic Information Bottleneck algorithm which optimally compresses the data while retaining relevant information about the underlying structure. We compare the performance of the proposed method to that of three well-established clustering methods (KAMILA, K-Prototypes, and Partitioning Around Medoids with Gower's dissimilarity) on simulated and real-world datasets. The results demonstrate that the proposed approach represents a competitive alternative to conventional clustering techniques under specific conditions."}, "https://arxiv.org/abs/2407.03420": {"title": "Balancing events, not patients, maximizes power of the logrank test: and other insights on unequal randomization in survival trials", "link": "https://arxiv.org/abs/2407.03420", "description": "arXiv:2407.03420v1 Announce Type: new \nAbstract: We revisit the question of what randomization ratio (RR) maximizes power of the logrank test in event-driven survival trials under proportional hazards (PH). By comparing three approximations of the logrank test (Schoenfeld, Freedman, Rubinstein) to empirical simulations, we find that the RR that maximizes power is the RR that balances number of events across treatment arms at the end of the trial. This contradicts the common misconception implied by Schoenfeld's approximation that 1:1 randomization maximizes power. Besides power, we consider other factors that might influence the choice of RR (accrual, trial duration, sample size, etc.). We perform simulations to better understand how unequal randomization might impact these factors in practice. Altogether, we derive 6 insights to guide statisticians in the design of survival trials considering unequal randomization."}, "https://arxiv.org/abs/2407.03539": {"title": "Population Size Estimation with Many Lists and Heterogeneity: A Conditional Log-Linear Model Among the Unobserved", "link": "https://arxiv.org/abs/2407.03539", "description": "arXiv:2407.03539v1 Announce Type: new \nAbstract: We contribute a general and flexible framework to estimate the size of a closed population in the presence of $K$ capture-recapture lists and heterogeneous capture probabilities. Our novel identifying strategy leverages the fact that it is sufficient for identification that a subset of the $K$ lists are not arbitrarily dependent \\textit{within the subset of the population unobserved by the remaining lists}, conditional on covariates. 
This identification approach is interpretable and actionable, interpolating between the two predominant approaches in the literature as special cases: (conditional) independence across lists and log-linear models with no highest-order interaction. We derive nonparametric doubly-robust estimators for the resulting identification expression that are nearly optimal and approximately normal for any finite sample size, even when the heterogeneous capture probabilities are estimated nonparametrically using machine learning methods. Additionally, we devise a sensitivity analysis to show how deviations from the identification assumptions affect the resulting population size estimates, allowing for the integration of domain-specific knowledge into the identification and estimation processes more transparently. We empirically demonstrate the advantages of our method using both synthetic data and real data from the Peruvian internal armed conflict to estimate the number of casualties. The proposed methodology addresses recent critiques of capture-recapture models by allowing for a weaker and more interpretable identifying assumption and accommodating complex heterogeneous capture probabilities depending on high-dimensional or continuous covariates."}, "https://arxiv.org/abs/2407.03558": {"title": "Aggregated Sure Independence Screening for Variable Selection with Interaction Structures", "link": "https://arxiv.org/abs/2407.03558", "description": "arXiv:2407.03558v1 Announce Type: new \nAbstract: A new method called aggregated sure independence screening is proposed to address the computational challenges in variable selection of interactions when the number of explanatory variables is much higher than the number of observations (i.e., $p\\gg n$). In this problem, the two main challenges are the strong hierarchical restriction and the number of candidates for the main effects and interactions. If $n$ is a few hundred and $p$ is ten thousand, then the memory needed for the augmented matrix of the full model is more than $100{\\rm GB}$ in size, beyond the memory capacity of a personal computer. This issue can be solved by our proposed method but not by competing methods. Two advantages are that the proposed method can include important interactions even if the related main effects are weak or absent, and it can be combined with an arbitrary variable selection method for interactions. The research addresses the main concern for variable selection of interactions because it makes previous methods applicable to the case when $p$ is extremely large."}, "https://arxiv.org/abs/2407.03616": {"title": "When can weak latent factors be statistically inferred?", "link": "https://arxiv.org/abs/2407.03616", "description": "arXiv:2407.03616v1 Announce Type: new \nAbstract: This article establishes a new and comprehensive estimation and inference theory for principal component analysis (PCA) under the weak factor model that allows for cross-sectionally dependent idiosyncratic components under nearly minimal factor strength relative to the noise level, or signal-to-noise ratio. Our theory is applicable regardless of the relative growth rate between the cross-sectional dimension $N$ and temporal dimension $T$. This more realistic assumption and notable result require a completely new technical device, as the commonly-used leave-one-out trick is no longer applicable to the case with cross-sectional dependence. 
Another notable advancement of our theory is on PCA inference $ - $ for example, under the regime where $N\\asymp T$, we show that the asymptotic normality for the PCA-based estimator holds as long as the signal-to-noise ratio (SNR) grows faster than a polynomial rate of $\\log N$. This finding significantly surpasses prior work that required a polynomial rate of $N$. Our theory is entirely non-asymptotic, offering finite-sample characterizations for both the estimation error and the uncertainty level of statistical inference. A notable technical innovation is our closed-form first-order approximation of PCA-based estimator, which paves the way for various statistical tests. Furthermore, we apply our theories to design easy-to-implement statistics for validating whether given factors fall in the linear spans of unknown latent factors, testing structural breaks in the factor loadings for an individual unit, checking whether two units have the same risk exposures, and constructing confidence intervals for systematic risks. Our empirical studies uncover insightful correlations between our test results and economic cycles."}, "https://arxiv.org/abs/2407.03619": {"title": "Multivariate Representations of Univariate Marked Hawkes Processes", "link": "https://arxiv.org/abs/2407.03619", "description": "arXiv:2407.03619v1 Announce Type: new \nAbstract: Univariate marked Hawkes processes are used to model a range of real-world phenomena including earthquake aftershock sequences, contagious disease spread, content diffusion on social media platforms, and order book dynamics. This paper illustrates a fundamental connection between univariate marked Hawkes processes and multivariate Hawkes processes. Exploiting this connection renders a framework that can be built upon for expressive and flexible inference on diverse data. Specifically, multivariate unmarked Hawkes representations are introduced as a tool to parameterize univariate marked Hawkes processes. We show that such multivariate representations can asymptotically approximate a large class of univariate marked Hawkes processes, are stationary given the approximated process is stationary, and that resultant conditional intensity parameters are identifiable. A simulation study demonstrates the efficacy of this approach, and provides heuristic bounds for error induced by the relatively larger parameter space of multivariate Hawkes processes."}, "https://arxiv.org/abs/2407.03690": {"title": "Robust CATE Estimation Using Novel Ensemble Methods", "link": "https://arxiv.org/abs/2407.03690", "description": "arXiv:2407.03690v1 Announce Type: new \nAbstract: The estimation of Conditional Average Treatment Effects (CATE) is crucial for understanding the heterogeneity of treatment effects in clinical trials. We evaluate the performance of common methods, including causal forests and various meta-learners, across a diverse set of scenarios revealing that each of the methods fails in one or more of the tested scenarios. Given the inherent uncertainty of the data-generating process in real-life scenarios, the robustness of a CATE estimator to various scenarios is critical for its reliability.\n To address this limitation of existing methods, we propose two new ensemble methods that integrate multiple estimators to enhance prediction stability and performance - Stacked X-Learner which uses the X-Learner with model stacking for estimating the nuisance functions, and Consensus Based Averaging (CBA), which averages only the models with highest internal agreement. 
We show that these models achieve good performance across a wide range of scenarios varying in complexity, sample size and structure of the underlying-mechanism, including a biologically driven model for PD-L1 inhibition pathway for cancer treatment."}, "https://arxiv.org/abs/2407.03725": {"title": "Under the null of valid specification, pre-tests of valid specification do not distort inference", "link": "https://arxiv.org/abs/2407.03725", "description": "arXiv:2407.03725v1 Announce Type: new \nAbstract: Consider a parameter of interest, which can be consistently estimated under some conditions. Suppose also that we can at least partly test these conditions with specification tests. We consider the common practice of conducting inference on the parameter of interest conditional on not rejecting these tests. We show that if the tested conditions hold, conditional inference is valid but possibly conservative. This holds generally, without imposing any assumption on the asymptotic dependence between the estimator of the parameter of interest and the specification test."}, "https://arxiv.org/abs/2407.03726": {"title": "Absolute average and median treatment effects as causal estimands on metric spaces", "link": "https://arxiv.org/abs/2407.03726", "description": "arXiv:2407.03726v1 Announce Type: new \nAbstract: We define the notions of absolute average and median treatment effects as causal estimands on general metric spaces such as Riemannian manifolds, propose estimators using stratification, and prove several properties, including strong consistency. In the process, we also demonstrate the strong consistency of the weighted sample Fr\\'echet means and geometric medians. Stratification allows these estimators to be utilized beyond the narrow constraints of a completely randomized experiment. After constructing confidence intervals using bootstrapping, we outline how to use the proposed estimates to test Fisher's sharp null hypothesis that the absolute average or median treatment effect is zero. Empirical evidence for the strong consistency of the estimators and the reasonable asymptotic coverage of the confidence intervals is provided through simulations in both randomized experiments and observational study settings. We also apply our methods to real data from an observational study to investigate the causal relationship between Alzheimer's disease and the shape of the corpus callosum, rejecting the aforementioned null hypotheses in cases where conventional Euclidean methods fail to do so. Our proposed methods are more generally applicable than past studies in dealing with general metric spaces."}, "https://arxiv.org/abs/2407.03774": {"title": "Mixture Modeling for Temporal Point Processes with Memory", "link": "https://arxiv.org/abs/2407.03774", "description": "arXiv:2407.03774v1 Announce Type: new \nAbstract: We propose a constructive approach to building temporal point processes that incorporate dependence on their history. The dependence is modeled through the conditional density of the duration, i.e., the interval between successive event times, using a mixture of first-order conditional densities for each one of a specific number of lagged durations. Such a formulation for the conditional duration density accommodates high-order dynamics, and it thus enables flexible modeling for point processes with memory. The implied conditional intensity function admits a representation as a local mixture of first-order hazard functions. 
By specifying appropriate families of distributions for the first-order conditional densities, with different shapes for the associated hazard functions, we can obtain either self-exciting or self-regulating point processes. From the perspective of duration processes, we develop a method to specify a stationary marginal density. The resulting model, interpreted as a dependent renewal process, introduces high-order Markov dependence among identically distributed durations. Furthermore, we provide extensions to cluster point processes. These can describe duration clustering behaviors attributed to different factors, thus expanding the scope of the modeling framework to a wider range of applications. Regarding implementation, we develop a Bayesian approach to inference, model checking, and prediction. We investigate point process model properties analytically, and illustrate the methodology with both synthetic and real data examples."}, "https://arxiv.org/abs/2407.04071": {"title": "Three- and four-parameter item response model in factor analysis framework", "link": "https://arxiv.org/abs/2407.04071", "description": "arXiv:2407.04071v1 Announce Type: new \nAbstract: This work proposes a 4-parameter factor analytic (4P FA) model for multi-item measurements composed of binary items as an extension to the dichotomized single latent variable FA model. We provide an analytical derivation of the relationship between the newly proposed 4P FA model and its counterpart in the item response theory (IRT) framework, the 4P IRT model. A Bayesian estimation method for the proposed 4P FA model is provided to estimate the four item parameters, the respondents' latent scores, and the scores cleaned of the guessing and inattention effects. The newly proposed algorithm is implemented in R and Python, and the relationship between the 4P FA and 4P IRT is empirically demonstrated using real datasets from admission tests and the assessment of anxiety."}, "https://arxiv.org/abs/2407.04104": {"title": "Network-based Neighborhood regression", "link": "https://arxiv.org/abs/2407.04104", "description": "arXiv:2407.04104v1 Announce Type: new \nAbstract: Given the ubiquity of modularity in biological systems, module-level regulation analysis is vital for understanding biological systems across various levels and their dynamics. Current statistical analysis on biological modules predominantly focuses on either detecting the functional modules in biological networks or sub-group regression on the biological features without using the network data. This paper proposes a novel network-based neighborhood regression framework whose regression functions depend on both the global community-level information and local connectivity structures among entities. An efficient community-wise least square optimization approach is developed to uncover the strength of regulation among the network modules while enabling asymptotic inference. With random graph theory, we derive non-asymptotic estimation error bounds for the proposed estimator, achieving exact minimax optimality. Unlike the root-n consistency typical in canonical linear regression, our model exhibits linear consistency in the number of nodes n, highlighting the advantage of incorporating neighborhood information. The effectiveness of the proposed framework is further supported by extensive numerical experiments. 
Application to whole-exome sequencing and RNA-sequencing Autism datasets demonstrates the usage of the proposed method in identifying the association between the gene modules of genetic variations and the gene modules of genomic differential expressions."}, "https://arxiv.org/abs/2407.04142": {"title": "Bayesian Structured Mediation Analysis With Unobserved Confounders", "link": "https://arxiv.org/abs/2407.04142", "description": "arXiv:2407.04142v1 Announce Type: new \nAbstract: We explore methods to reduce the impact of unobserved confounders on the causal mediation analysis of high-dimensional mediators with spatially smooth structures, such as brain imaging data. The key approach is to incorporate the latent individual effects, which influence the structured mediators, as unobserved confounders in the outcome model, thereby potentially debiasing the mediation effects. We develop BAyesian Structured Mediation analysis with Unobserved confounders (BASMU) framework, and establish its model identifiability conditions. Theoretical analysis is conducted on the asymptotic bias of the Natural Indirect Effect (NIE) and the Natural Direct Effect (NDE) when the unobserved confounders are omitted in mediation analysis. For BASMU, we propose a two-stage estimation algorithm to mitigate the impact of these unobserved confounders on estimating the mediation effect. Extensive simulations demonstrate that BASMU substantially reduces the bias in various scenarios. We apply BASMU to the analysis of fMRI data in the Adolescent Brain Cognitive Development (ABCD) study, focusing on four brain regions previously reported to exhibit meaningful mediation effects. Compared with the existing image mediation analysis method, BASMU identifies two to four times more voxels that have significant mediation effects, with the NIE increased by 41%, and the NDE decreased by 26%."}, "https://arxiv.org/abs/2407.04437": {"title": "Overeducation under different macroeconomic conditions: The case of Spanish university graduates", "link": "https://arxiv.org/abs/2407.04437", "description": "arXiv:2407.04437v1 Announce Type: new \nAbstract: This paper examines the incidence and persistence of overeducation in the early careers of Spanish university graduates. We investigate the role played by the business cycle and field of study and their interaction in shaping both phenomena. We also analyse the relevance of specific types of knowledge and skills as driving factors in reducing overeducation risk. We use data from the Survey on the Labour Insertion of University Graduates (EILU) conducted by the Spanish National Statistics Institute in 2014 and 2019. The survey collects rich information on cohorts that graduated in the 2009/2010 and 2014/2015 academic years during the Great Recession and the subsequent economic recovery, respectively. Our results show, first, the relevance of the economic scenario when graduates enter the labour market. Graduation during a recession increased overeducation risk and persistence. Second, a clear heterogeneous pattern occurs across fields of study, with health sciences graduates displaying better performance in terms of both overeducation incidence and persistence and less impact of the business cycle. Third, we find evidence that some transversal skills (language, IT, management) can help to reduce overeducation risk in the absence of specific knowledge required for the job, thus indicating some kind of compensatory role. Finally, our findings have important policy implications. 
Overeducation, and more importantly overeducation persistence, imply a non-negligible misallocation of resources. Therefore, policymakers need to address this issue in the design of education and labour market policies."}, "https://arxiv.org/abs/2407.04446": {"title": "Random-Effect Meta-Analysis with Robust Between-Study Variance", "link": "https://arxiv.org/abs/2407.04446", "description": "arXiv:2407.04446v1 Announce Type: new \nAbstract: Meta-analyses are widely employed to demonstrate strong evidence across numerous studies. On the other hand, in the context of rare diseases, meta-analyses are often conducted with a limited number of studies in which the analysis methods are based on theoretical frameworks assuming that the between-study variance is known. That is, the estimate of between-study variance is substituted for the true value, neglecting the randomness of the between-study variance estimated from the data. Consequently, excessively narrow confidence intervals for the overall treatment effect have been constructed in meta-analyses based on only a few studies. In the present study, we propose overcoming this problem by estimating the distribution of between-study variance using the maximum likelihood-like estimator. We also suggest an approach for estimating the overall treatment effect via the distribution of the between-study variance. Our proposed method can extend many existing approaches to allow more adequate estimation when only a few studies are available. Through simulation and analysis of real data, we demonstrate that our method remains consistently conservative compared to existing methods, which enables meta-analyses to consider the randomness of the between-study variance."}, "https://arxiv.org/abs/2407.04448": {"title": "Learning control variables and instruments for causal analysis in observational data", "link": "https://arxiv.org/abs/2407.04448", "description": "arXiv:2407.04448v1 Announce Type: new \nAbstract: This study introduces a data-driven, machine learning-based method to detect suitable control variables and instruments for assessing the causal effect of a treatment on an outcome in observational data, if they exist. Our approach tests the joint existence of instruments, which are associated with the treatment but not directly with the outcome (at least conditional on observables), and suitable control variables, conditional on which the treatment is exogenous, and learns the partition of instruments and control variables from the observed data. The detection of sets of instruments and control variables relies on the condition that proper instruments are conditionally independent of the outcome given the treatment and suitable control variables. We establish the consistency of our method for detecting control variables and instruments under certain regularity conditions, investigate the finite sample performance through a simulation study, and provide an empirical application to labor market data from the Job Corps study."}, "https://arxiv.org/abs/2407.04530": {"title": "A spatial-correlated multitask linear mixed-effects model for imaging genetics", "link": "https://arxiv.org/abs/2407.04530", "description": "arXiv:2407.04530v1 Announce Type: new \nAbstract: Imaging genetics aims to uncover the hidden relationship between imaging quantitative traits (QTs) and genetic markers (e.g. single nucleotide polymorphism (SNP)), and brings valuable insights into the pathogenesis of complex diseases, such as cancers and cognitive disorders (e.g. the Alzheimer's Disease). 
However, most linear models in imaging genetics didn't explicitly model the inner relationship among QTs, which might miss some potential efficiency gains from information borrowing across brain regions. In this work, we developed a novel Bayesian regression framework for identifying significant associations between QTs and genetic markers while explicitly modeling spatial dependency between QTs, with the main contributions as follows. Firstly, we developed a spatial-correlated multitask linear mixed-effects model (LMM) to account for dependencies between QTs. We incorporated a population-level mixed effects term into the model, taking full advantage of the dependent structure of brain imaging-derived QTs. Secondly, we implemented the model in the Bayesian framework and derived a Markov chain Monte Carlo (MCMC) algorithm to achieve the model inference. Further, we incorporated the MCMC samples with the Cauchy combination test (CCT) to examine the association between SNPs and QTs, which avoided computationally intractable multi-test issues. The simulation studies indicated improved power of our proposed model compared to classic models where inner dependencies of QTs were not modeled. We also applied the new spatial model to an imaging dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database."}, "https://arxiv.org/abs/2407.04659": {"title": "Simulation-based Calibration of Uncertainty Intervals under Approximate Bayesian Estimation", "link": "https://arxiv.org/abs/2407.04659", "description": "arXiv:2407.04659v1 Announce Type: new \nAbstract: The mean field variational Bayes (VB) algorithm implemented in Stan is relatively fast and efficient, making it feasible to produce model-estimated official statistics on a rapid timeline. Yet, while consistent point estimates of parameters are achieved for continuous data models, the mean field approximation often produces inaccurate uncertainty quantification to the extent that parameters are correlated a posteriori. In this paper, we propose a simulation procedure that calibrates uncertainty intervals for model parameters estimated under approximate algorithms to achieve nominal coverages. Our procedure detects and corrects biased estimation of both first and second moments of approximate marginal posterior distributions induced by any estimation algorithm that produces consistent first moments under specification of the correct model. The method generates replicate datasets using parameters estimated in an initial model run. The model is subsequently re-estimated on each replicate dataset, and we use the empirical distribution over the re-samples to formulate calibrated confidence intervals of parameter estimates of the initial model run that are guaranteed to asymptotically achieve nominal coverage. We demonstrate the performance of our procedure in Monte Carlo simulation study and apply it to real data from the Current Employment Statistics survey."}, "https://arxiv.org/abs/2407.04667": {"title": "The diameter of a stochastic matrix: A new measure for sensitivity analysis in Bayesian networks", "link": "https://arxiv.org/abs/2407.04667", "description": "arXiv:2407.04667v1 Announce Type: new \nAbstract: Bayesian networks are one of the most widely used classes of probabilistic models for risk management and decision support because of their interpretability and flexibility in including heterogeneous pieces of information. 
In any applied modelling, it is critical to assess how robust the inferences on certain target variables are to changes in the model. In Bayesian networks, these analyses fall under the umbrella of sensitivity analysis, which is most commonly carried out by quantifying dissimilarities using Kullback-Leibler information measures. In this paper, we argue that robustness methods based instead on the familiar total variation distance provide simple and more valuable bounds on robustness to misspecification, which are both formally justifiable and transparent. We introduce a novel measure of dependence in conditional probability tables called the diameter to derive such bounds. This measure quantifies the strength of dependence between a variable and its parents. We demonstrate how such formal robustness considerations can be embedded in building a Bayesian network."}, "https://arxiv.org/abs/2407.03336": {"title": "Efficient and Precise Calculation of the Confluent Hypergeometric Function", "link": "https://arxiv.org/abs/2407.03336", "description": "arXiv:2407.03336v1 Announce Type: cross \nAbstract: Kummer's function, also known as the confluent hypergeometric function (CHF), is an important mathematical function, in particular due to its many special cases, which include the Bessel function, the incomplete Gamma function and the error function (erf). The CHF has no closed form expression, but instead is most commonly expressed as an infinite sum of ratios of rising factorials, which makes its precise and efficient calculation challenging. It is a function of three parameters, the first two being the rising factorial base of the numerator and denominator, and the third being a scale parameter. Accurate and efficient calculation for large values of the scale parameter is particularly challenging due to numeric underflow and overflow which easily occur when summing the underlying component terms. This work presents an elegant and precise mathematical algorithm for the calculation of the CHF, which is of particular advantage for large values of the scale parameter. This method massively reduces the number and range of component terms which need to be summed to achieve any required precision, thus obviating the need for the computationally intensive transformations needed by current algorithms."}, "https://arxiv.org/abs/2407.03781": {"title": "Block-diagonal idiosyncratic covariance estimation in high-dimensional factor models for financial time series", "link": "https://arxiv.org/abs/2407.03781", "description": "arXiv:2407.03781v1 Announce Type: cross \nAbstract: Estimation of high-dimensional covariance matrices in latent factor models is an important topic in many fields and especially in finance. Since the number of financial assets grows while the estimation window length remains of limited size, the often used sample estimator yields noisy estimates which are not even positive definite. Under the assumption of latent factor models, the covariance matrix is decomposed into a common low-rank component and a full-rank idiosyncratic component. In this paper we focus on the estimation of the idiosyncratic component, under the assumption of a grouped structure of the time series, which may arise due to specific factors such as industries, asset classes or countries. We propose a generalized methodology for estimation of the block-diagonal idiosyncratic component by clustering the residual series and applying shrinkage to the obtained blocks in order to ensure positive definiteness. 
We derive two different estimators based on different clustering methods and test their performance using simulation and historical data. The proposed methods are shown to provide reliable estimates and outperform other state-of-the-art estimators based on thresholding methods."}, "https://arxiv.org/abs/2407.04138": {"title": "Online Bayesian changepoint detection for network Poisson processes with community structure", "link": "https://arxiv.org/abs/2407.04138", "description": "arXiv:2407.04138v1 Announce Type: cross \nAbstract: Network point processes often exhibit latent structure that govern the behaviour of the sub-processes. It is not always reasonable to assume that this latent structure is static, and detecting when and how this driving structure changes is often of interest. In this paper, we introduce a novel online methodology for detecting changes within the latent structure of a network point process. We focus on block-homogeneous Poisson processes, where latent node memberships determine the rates of the edge processes. We propose a scalable variational procedure which can be applied on large networks in an online fashion via a Bayesian forgetting factor applied to sequential variational approximations to the posterior distribution. The proposed framework is tested on simulated and real-world data, and it rapidly and accurately detects changes to the latent edge process rates, and to the latent node group memberships, both in an online manner. In particular, in an application on the Santander Cycles bike-sharing network in central London, we detect changes within the network related to holiday periods and lockdown restrictions between 2019 and 2020."}, "https://arxiv.org/abs/2407.04214": {"title": "Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome", "link": "https://arxiv.org/abs/2407.04214", "description": "arXiv:2407.04214v1 Announce Type: cross \nAbstract: For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least 28 days after testing positive and asked to report current symptom status. This study design yielded current status data: Outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates specialized statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. 
Female sex, history of seasonal allergies, fatigue during acute infection, and higher viral load were associated with slower symptom resolution."}, "https://arxiv.org/abs/2105.03067": {"title": "The $s$-value: evaluating stability with respect to distributional shifts", "link": "https://arxiv.org/abs/2105.03067", "description": "arXiv:2105.03067v4 Announce Type: replace \nAbstract: Common statistical measures of uncertainty such as $p$-values and confidence intervals quantify the uncertainty due to sampling, that is, the uncertainty due to not observing the full population. However, sampling is not the only source of uncertainty. In practice, distributions change between locations and across time. This makes it difficult to gather knowledge that transfers across data sets. We propose a measure of instability that quantifies the distributional instability of a statistical parameter with respect to Kullback-Leibler divergence, that is, the sensitivity of the parameter under general distributional perturbations within a Kullback-Leibler divergence ball. In addition, we quantify the instability of parameters with respect to directional or variable-specific shifts. Measuring instability with respect to directional shifts can be used to detect the type of shifts a parameter is sensitive to. We discuss how such knowledge can inform data collection for improved estimation of statistical parameters under shifted distributions. We evaluate the performance of the proposed measure on real data and show that it can elucidate the distributional instability of a parameter with respect to certain shifts and can be used to improve estimation accuracy under shifted distributions."}, "https://arxiv.org/abs/2107.06141": {"title": "Identification of Average Marginal Effects in Fixed Effects Dynamic Discrete Choice Models", "link": "https://arxiv.org/abs/2107.06141", "description": "arXiv:2107.06141v2 Announce Type: replace \nAbstract: In nonlinear panel data models, fixed effects methods are often criticized because they cannot identify average marginal effects (AMEs) in short panels. The common argument is that identifying AMEs requires knowledge of the distribution of unobserved heterogeneity, but this distribution is not identified in a fixed effects model with a short panel. In this paper, we derive identification results that contradict this argument. In a panel data dynamic logit model, and for $T$ as small as three, we prove the point identification of different AMEs, including causal effects of changes in the lagged dependent variable or the last choice's duration. Our proofs are constructive and provide simple closed-form expressions for the AMEs in terms of probabilities of choice histories. We illustrate our results using Monte Carlo experiments and with an empirical application of a dynamic structural model of consumer brand choice with state dependence."}, "https://arxiv.org/abs/2307.00835": {"title": "Engression: Extrapolation through the Lens of Distributional Regression", "link": "https://arxiv.org/abs/2307.00835", "description": "arXiv:2307.00835v3 Announce Type: replace \nAbstract: Distributional regression aims to estimate the full conditional distribution of a target variable, given covariates. Popular methods include linear and tree-ensemble based quantile regression. We propose a neural network-based distributional regression methodology called `engression'. 
An engression model is generative in the sense that we can sample from the fitted conditional distribution and is also suitable for high-dimensional outcomes. Furthermore, we find that modelling the conditional distribution on training data can constrain the fitted function outside of the training support, which offers a new perspective to the challenging extrapolation problem in nonlinear regression. In particular, for `pre-additive noise' models, where noise is added to the covariates before applying a nonlinear transformation, we show that engression can successfully perform extrapolation under some assumptions such as monotonicity, whereas traditional regression approaches such as least-squares or quantile regression fall short under the same assumptions. Our empirical results, from both simulated and real data, validate the effectiveness of the engression method and indicate that the pre-additive noise model is typically suitable for many real-world scenarios. The software implementations of engression are available in both R and Python."}, "https://arxiv.org/abs/2310.11680": {"title": "Trimmed Mean Group Estimation of Average Effects in Ultra Short T Panels under Correlated Heterogeneity", "link": "https://arxiv.org/abs/2310.11680", "description": "arXiv:2310.11680v2 Announce Type: replace \nAbstract: The commonly used two-way fixed effects estimator is biased under correlated heterogeneity and can lead to misleading inference. This paper proposes a new trimmed mean group (TMG) estimator which is consistent at the irregular rate of n^{1/3} even if the time dimension of the panel is as small as the number of its regressors. Extensions to panels with time effects are provided, and a Hausman test of correlated heterogeneity is proposed. Small sample properties of the TMG estimator (with and without time effects) are investigated by Monte Carlo experiments and shown to be satisfactory and perform better than other trimmed estimators proposed in the literature. The proposed test of correlated heterogeneity is also shown to have the correct size and satisfactory power. The utility of the TMG approach is illustrated with an empirical application."}, "https://arxiv.org/abs/2311.16614": {"title": "A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections", "link": "https://arxiv.org/abs/2311.16614", "description": "arXiv:2311.16614v4 Announce Type: replace \nAbstract: Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $\\alpha$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. 
Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters."}, "https://arxiv.org/abs/2312.01168": {"title": "MacroPARAFAC for handling rowwise and cellwise outliers in incomplete multi-way data", "link": "https://arxiv.org/abs/2312.01168", "description": "arXiv:2312.01168v2 Announce Type: replace \nAbstract: Multi-way data extend two-way matrices into higher-dimensional tensors, often explored through dimensional reduction techniques. In this paper, we study the Parallel Factor Analysis (PARAFAC) model for handling multi-way data, representing it more compactly through a concise set of loading matrices and scores. We assume that the data may be incomplete and could contain both rowwise and cellwise outliers, signifying cases that deviate from the majority and outlying cells dispersed throughout the data array. To address these challenges, we present a novel algorithm designed to robustly estimate both loadings and scores. Additionally, we introduce an enhanced outlier map to distinguish various patterns of outlying behavior. Through simulations and the analysis of fluorescence Excitation-Emission Matrix (EEM) data, we demonstrate the robustness of our approach. Our results underscore the effectiveness of diagnostic tools in identifying and interpreting unusual patterns within the data."}, "https://arxiv.org/abs/2401.09381": {"title": "Modelling clusters in network time series with an application to presidential elections in the USA", "link": "https://arxiv.org/abs/2401.09381", "description": "arXiv:2401.09381v2 Announce Type: replace \nAbstract: Network time series are becoming increasingly relevant in the study of dynamic processes characterised by a known or inferred underlying network structure. Generalised Network Autoregressive (GNAR) models provide a parsimonious framework for exploiting the underlying network, even in the high-dimensional setting. We extend the GNAR framework by presenting the $\\textit{community}$-$\\alpha$ GNAR model that exploits prior knowledge and/or exogenous variables for identifying and modelling dynamic interactions across communities in the network. We further analyse the dynamics of $\\textit{ Red, Blue}$ and $\\textit{Swing}$ states throughout presidential elections in the USA. Our analysis suggests interesting global and communal effects."}, "https://arxiv.org/abs/2008.04267": {"title": "Robust Validation: Confident Predictions Even When Distributions Shift", "link": "https://arxiv.org/abs/2008.04267", "description": "arXiv:2008.04267v3 Announce Type: replace-cross \nAbstract: While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy -- coming from robust statistics and optimization -- is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. 
An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity."}, "https://arxiv.org/abs/2208.03737": {"title": "Finite Tests from Functional Characterizations", "link": "https://arxiv.org/abs/2208.03737", "description": "arXiv:2208.03737v5 Announce Type: replace-cross \nAbstract: Classically, testing whether decision makers belong to specific preference classes involves two main approaches. The first, known as the functional approach, assumes access to an entire demand function. The second, the revealed preference approach, constructs inequalities to test finite demand data. This paper bridges these methods by using the functional approach to test finite data through preference learnability results. We develop a computationally efficient algorithm that generates tests for choice data based on functional characterizations of preference families. We provide these restrictions for various applications, including homothetic and weakly separable preferences, where the latter's revealed preference characterization is provably NP-Hard. We also address choice under uncertainty, offering tests for betweenness preferences. Lastly, we perform a simulation exercise demonstrating that our tests are effective in finite samples and accurately reject demands not belonging to a specified class."}, "https://arxiv.org/abs/2302.00934": {"title": "High-dimensional variable clustering based on maxima of a weakly dependent random process", "link": "https://arxiv.org/abs/2302.00934", "description": "arXiv:2302.00934v3 Announce Type: replace-cross \nAbstract: We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which defines population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm depending on a tuning parameter that recovers the clusters of variables without specifying the number of clusters \\emph{a priori}. Our work provides some theoretical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. A data-driven selection method for the tuning parameter is also proposed. To further illustrate the significance of our work, we applied our method to neuroscience and environmental real-datasets. These applications highlight the potential and versatility of the proposed approach."}, "https://arxiv.org/abs/2303.13598": {"title": "Bootstrap-Assisted Inference for Generalized Grenander-type Estimators", "link": "https://arxiv.org/abs/2303.13598", "description": "arXiv:2303.13598v3 Announce Type: replace-cross \nAbstract: Westling and Carone (2020) proposed a framework for studying the large sample distributional properties of generalized Grenander-type estimators, a versatile class of nonparametric estimators of monotone functions. 
The limiting distribution of those estimators is representable as the left derivative of the greatest convex minorant of a Gaussian process whose monomial mean can be of unknown order (when the degree of flatness of the function of interest is unknown). The standard nonparametric bootstrap is unable to consistently approximate the large sample distribution of the generalized Grenander-type estimators even if the monomial order of the mean is known, making statistical inference a challenging endeavour in applications. To address this inferential problem, we present a bootstrap-assisted inference procedure for generalized Grenander-type estimators. The procedure relies on a carefully crafted, yet automatic, transformation of the estimator. Moreover, our proposed method can be made ``flatness robust'' in the sense that it can be made adaptive to the (possibly unknown) degree of flatness of the function of interest. The method requires only the consistent estimation of a single scalar quantity, for which we propose an automatic procedure based on numerical derivative estimation and the generalized jackknife. Under random sampling, our inference method can be implemented using a computationally attractive exchangeable bootstrap procedure. We illustrate our methods with examples and we also provide a small simulation study. The development of formal results is made possible by some technical results that may be of independent interest."}, "https://arxiv.org/abs/2308.13451": {"title": "Gotta match 'em all: Solution diversification in graph matching matched filters", "link": "https://arxiv.org/abs/2308.13451", "description": "arXiv:2308.13451v3 Announce Type: replace-cross \nAbstract: We present a novel approach for finding multiple noisily embedded template graphs in a very large background graph. Our method builds upon the graph-matching-matched-filter technique proposed in Sussman et al., with the discovery of multiple diverse matchings being achieved by iteratively penalizing a suitable node-pair similarity matrix in the matched filter algorithm. In addition, we propose algorithmic speed-ups that greatly enhance the scalability of our matched-filter approach. We present theoretical justification of our methodology in the setting of correlated Erdos-Renyi graphs, showing its ability to sequentially discover multiple templates under mild model conditions. We additionally demonstrate our method's utility via extensive experiments using both simulated models and real-world datasets, including human brain connectomes and a large transactional knowledge base."}, "https://arxiv.org/abs/2407.04812": {"title": "Active-Controlled Trial Design for HIV Prevention Trials with a Counterfactual Placebo", "link": "https://arxiv.org/abs/2407.04812", "description": "arXiv:2407.04812v1 Announce Type: new \nAbstract: In the quest for enhanced HIV prevention methods, the advent of antiretroviral drugs as pre-exposure prophylaxis (PrEP) has marked a significant stride forward. However, the ethical challenges in conducting placebo-controlled trials for new PrEP agents against a backdrop of highly effective existing PrEP options necessitate innovative approaches. This manuscript delves into the design and implementation of active-controlled trials that incorporate a counterfactual placebo estimate - a theoretical estimate of what HIV incidence would have been without effective prevention. 
We introduce a novel statistical framework for regulatory approval of new PrEP agents, predicated on the assumption of an available and consistent counterfactual placebo estimate. Our approach aims to assess the absolute efficacy (i.e., against placebo) of the new PrEP agent relative to the absolute efficacy of the active control. We propose a two-step procedure for hypothesis testing and further develop an approach that addresses potential biases inherent in non-randomized comparison to counterfactual placebos. By exploring different scenarios with moderately and highly effective active controls and counterfactual placebo estimates from various sources, we demonstrate how our design can significantly reduce sample sizes compared to traditional non-inferiority trials and offer a robust framework for evaluating new PrEP agents. This work contributes to the methodological repertoire for HIV prevention trials and underscores the importance of adaptability in the face of ethical and practical challenges."}, "https://arxiv.org/abs/2407.04933": {"title": "A Trigonometric Seasonal Component Model and its Application to Time Series with Two Types of Seasonality", "link": "https://arxiv.org/abs/2407.04933", "description": "arXiv:2407.04933v1 Announce Type: new \nAbstract: A finite trigonometric series model for seasonal time series is considered in this paper. This component model is shown to be useful, in particular, for the modeling of time series with two types of seasonality, a long period and a short period. This component model is also shown to be effective in the case of ordinary seasonal time series with only one seasonal component, if the seasonal pattern is simple and can be well represented by a small number of trigonometric components. As examples, electricity demand data, bi-hourly temperature data, CO2 data, and two economic time series are considered. The last section summarizes the findings from the empirical studies."}, "https://arxiv.org/abs/2407.05001": {"title": "Treatment effect estimation under covariate-adaptive randomization with heavy-tailed outcomes", "link": "https://arxiv.org/abs/2407.05001", "description": "arXiv:2407.05001v1 Announce Type: new \nAbstract: Randomized experiments are the gold standard for investigating causal relationships, with comparisons of potential outcomes under different treatment groups used to estimate treatment effects. However, outcomes with heavy-tailed distributions pose significant challenges to traditional statistical approaches. While recent studies have explored these issues under simple randomization, their application in more complex randomization designs, such as stratified randomization or covariate-adaptive randomization, has not been adequately addressed. To fill this gap, this paper examines the properties of the estimated influence function-based M-estimator under covariate-adaptive randomization with heavy-tailed outcomes, demonstrating its consistency and asymptotic normality. Yet, the existing variance estimator tends to overestimate the asymptotic variance, especially under more balanced designs, and lacks universal applicability across randomization methods. To remedy this, we introduce a novel stratified transformed difference-in-means estimator to enhance efficiency and propose a universally applicable variance estimator to facilitate valid inferences. Additionally, we establish the consistency of kernel-based density estimation in the context of covariate-adaptive randomization. 
Numerical results demonstrate the effectiveness of the proposed methods in finite samples."}, "https://arxiv.org/abs/2407.05089": {"title": "Bayesian network-guided sparse regression with flexible varying effects", "link": "https://arxiv.org/abs/2407.05089", "description": "arXiv:2407.05089v1 Announce Type: new \nAbstract: In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates including sex and dietary intake variables to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors."}, "https://arxiv.org/abs/2407.05159": {"title": "Roughness regularization for functional data analysis with free knots spline estimation", "link": "https://arxiv.org/abs/2407.05159", "description": "arXiv:2407.05159v1 Announce Type: new \nAbstract: In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically, at distinct time intervals. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. The currently proposed FDA methods can often encounter challenges, especially when dealing with curves of varying shapes. This can largely be attributed to the method's strong dependence on data approximation as a key aspect of the analysis process. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data."}, "https://arxiv.org/abs/2407.05241": {"title": "Joint identification of spatially variable genes via a network-assisted Bayesian regularization approach", "link": "https://arxiv.org/abs/2407.05241", "description": "arXiv:2407.05241v1 Announce Type: new \nAbstract: Identifying genes that display spatial patterns is critical to investigating expression interactions within a spatial context and further dissecting biological understanding of complex mechanistic functionality. 
Despite the increase in statistical methods designed to identify spatially variable genes, they are mostly based on marginal analysis and share the limitation that the dependence (network) structures among genes are not well accommodated, where a biological process usually involves changes in multiple genes that interact in a complex network. Moreover, the latent cellular composition within spots may introduce confounding variations, negatively affecting identification accuracy. In this study, we develop a novel Bayesian regularization approach for spatial transcriptomic data, with the confounding variations induced by varying cellular distributions effectively corrected. Significantly advancing from the existing studies, a thresholded graph Laplacian regularization is proposed to simultaneously identify spatially variable genes and accommodate the network structure among genes. The proposed method is based on a zero-inflated negative binomial distribution, effectively accommodating the count nature, zero inflation, and overdispersion of spatial transcriptomic data. Extensive simulations and the application to real data demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2407.05288": {"title": "Efficient Bayesian dynamic closed skew-normal model preserving mean and covariance for spatio-temporal data", "link": "https://arxiv.org/abs/2407.05288", "description": "arXiv:2407.05288v1 Announce Type: new \nAbstract: Although Bayesian skew-normal models are useful for flexibly modeling spatio-temporal processes, they still have difficulty in computation cost and interpretability in their mean and variance parameters, including regression coefficients. To address these problems, this study proposes a spatio-temporal model that incorporates skewness while maintaining mean and variance, by applying the flexible subclass of the closed skew-normal distribution. An efficient sampling method is introduced, leveraging the autoregressive representation of the model. Additionally, the model's symmetry concerning spatial order is demonstrated, and Mardia's skewness and kurtosis are derived, showing independence from the mean and variance. Simulation studies compare the estimation performance of the proposed model with that of the Gaussian model. The result confirms its superiority in high skewness and low observation noise scenarios. The identification of Cobb-Douglas production functions across US states is examined as an application to real data, revealing that the proposed model excels in both goodness-of-fit and predictive performance."}, "https://arxiv.org/abs/2407.05372": {"title": "A Convexified Matching Approach to Imputation and Individualized Inference", "link": "https://arxiv.org/abs/2407.05372", "description": "arXiv:2407.05372v1 Announce Type: new \nAbstract: We introduce a new convexified matching method for missing value imputation and individualized inference inspired by computational optimal transport. Our method integrates favorable features from mainstream imputation approaches: optimal matching, regression imputation, and synthetic control. We impute counterfactual outcomes based on convex combinations of observed outcomes, defined based on an optimal coupling between the treated and control data sets. The optimal coupling problem is considered a convex relaxation to the combinatorial optimal matching problem. 
We estimate granular-level individual treatment effects while maintaining a desirable aggregate-level summary by properly constraining the coupling. We construct transparent, individual confidence intervals for the estimated counterfactual outcomes. We devise fast iterative entropic-regularized algorithms to solve the optimal coupling problem that scales favorably when the number of units to match is large. Entropic regularization plays a crucial role in both inference and computation; it helps control the width of the individual confidence intervals and design fast optimization algorithms."}, "https://arxiv.org/abs/2407.05400": {"title": "Collaborative Analysis for Paired A/B Testing Experiments", "link": "https://arxiv.org/abs/2407.05400", "description": "arXiv:2407.05400v1 Announce Type: new \nAbstract: With the extensive use of digital devices, online experimental platforms are commonly used to conduct experiments to collect data for evaluating different variations of products, algorithms, and interface designs, a.k.a., A/B tests. In practice, multiple A/B testing experiments are often carried out based on a common user population on the same platform. The same user's responses to different experiments can be correlated to some extent due to the individual effect of the user. In this paper, we propose a novel framework that collaboratively analyzes the data from paired A/B tests, namely, a pair of A/B testing experiments conducted on the same set of experimental subjects. The proposed analysis approach for paired A/B tests can lead to more accurate estimates than the traditional separate analysis of each experiment. We obtain the asymptotic distribution of the proposed estimators and demonstrate that the proposed estimators are asymptotically the best linear unbiased estimators under certain assumptions. Moreover, the proposed analysis approach is computationally efficient, easy to implement, and robust to different types of responses. Both numerical simulations and numerical studies based on a real case are used to examine the performance of the proposed method."}, "https://arxiv.org/abs/2407.05431": {"title": "Without Pain -- Clustering Categorical Data Using a Bayesian Mixture of Finite Mixtures of Latent Class Analysis Models", "link": "https://arxiv.org/abs/2407.05431", "description": "arXiv:2407.05431v1 Announce Type: new \nAbstract: We propose a Bayesian approach for model-based clustering of multivariate categorical data where variables are allowed to be associated within clusters and the number of clusters is unknown. The approach uses a two-layer mixture of finite mixtures model where the cluster distributions are approximated using latent class analysis models. A careful specification of priors with suitable hyperparameter values is crucial to identify the two-layer structure and obtain a parsimonious cluster solution. We outline the Bayesian estimation based on Markov chain Monte Carlo sampling with the telescoping sampler and describe how to obtain an identified clustering model by resolving the label switching issue. 
Empirical demonstrations in a simulation study using artificial data as well as a data set on low back pain indicate the good clustering performance of the proposed approach, provided hyperparameters are selected which induce sufficient shrinkage."}, "https://arxiv.org/abs/2407.05470": {"title": "Bayesian Finite Mixture Models", "link": "https://arxiv.org/abs/2407.05470", "description": "arXiv:2407.05470v1 Announce Type: new \nAbstract: Finite mixture models are a useful statistical model class for clustering and density approximation. In the Bayesian framework, finite mixture models require the specification of suitable priors in addition to the data model. These priors help to avoid spurious results and provide a principled way to define cluster shapes and a preference for specific cluster solutions. A generic model estimation scheme for finite mixtures with a fixed number of components is available using Markov chain Monte Carlo (MCMC) sampling with data augmentation. The posterior allows uncertainty to be assessed in a comprehensive way, but component-specific posterior inference requires resolving the label switching issue.\n In this paper we focus on the application of Bayesian finite mixture models for clustering. We start by discussing suitable specification, estimation and inference of the model if the number of components is assumed to be known. We then continue to explain suitable strategies for fitting Bayesian finite mixture models when the number of components is not known. In addition, all steps required to perform Bayesian finite mixture modeling are illustrated on a data example where a finite mixture model of multivariate Gaussian distributions is fitted. Suitable prior specification, estimation using MCMC and posterior inference are discussed for this example assuming the number of components to be known as well as unknown."}, "https://arxiv.org/abs/2407.05537": {"title": "Optimal treatment strategies for prioritized outcomes", "link": "https://arxiv.org/abs/2407.05537", "description": "arXiv:2407.05537v1 Announce Type: new \nAbstract: Dynamic treatment regimes formalize precision medicine as a sequence of decision rules, one for each stage of clinical intervention, that map current patient information to a recommended intervention. Optimal regimes are typically defined as maximizing some functional of a scalar outcome's distribution, e.g., the distribution's mean or median. However, in many clinical applications, there are multiple outcomes of interest. We consider the problem of estimating an optimal regime when there are multiple outcomes that are ordered by priority but which cannot be readily combined by domain experts into a meaningful single scalar outcome. We propose a definition of optimality in this setting and show that an optimal regime with respect to this definition leads to maximal mean utility under a large class of utility functions. Furthermore, we use inverse reinforcement learning to identify a composite outcome that most closely aligns with our definition within a pre-specified class. 
Simulation experiments and an application to data from a sequential multiple assignment randomized trial (SMART) on HIV/STI prevention illustrate the usefulness of the proposed approach."}, "https://arxiv.org/abs/2407.05543": {"title": "Functional Principal Component Analysis for Truncated Data", "link": "https://arxiv.org/abs/2407.05543", "description": "arXiv:2407.05543v1 Announce Type: new \nAbstract: Functional principal component analysis (FPCA) is a key tool in the study of functional data, driving both exploratory analyses and feature construction for use in formal modeling and testing procedures. However, existing methods for FPCA do not apply when functional observations are truncated, e.g., the measurement instrument only supports recordings within a pre-specified interval, thereby truncating values outside of the range to the nearest boundary. A naive application of existing methods without correction for truncation induces bias. We extend the FPCA framework to accommodate truncated noisy functional data by first recovering smooth mean and covariance surface estimates that are representative of the latent process's mean and covariance functions. Unlike traditional sample covariance smoothing techniques, our procedure yields a positive semi-definite covariance surface, computed without the need to retroactively remove negative eigenvalues in the covariance operator decomposition. Additionally, we construct a FPC score predictor and demonstrate its use in the generalized functional linear model. Convergence rates for the proposed estimators are provided. In simulation experiments, the proposed method yields better predictive performance and lower bias than existing alternatives. We illustrate its practical value through an application to a study with truncated blood glucose measurements."}, "https://arxiv.org/abs/2407.05585": {"title": "Unmasking Bias: A Framework for Evaluating Treatment Benefit Predictors Using Observational Studies", "link": "https://arxiv.org/abs/2407.05585", "description": "arXiv:2407.05585v1 Announce Type: new \nAbstract: Treatment benefit predictors (TBPs) map patient characteristics into an estimate of the treatment benefit tailored to individual patients, which can support optimizing treatment decisions. However, the assessment of their performance might be challenging with the non-random treatment assignment. This study conducts a conceptual analysis, which can be applied to finite-sample studies. We present a framework for evaluating TBPs using observational data from a target population of interest. We then explore the impact of confounding bias on TBP evaluation using measures of discrimination and calibration, which are the moderate calibration and the concentration of the benefit index ($C_b$), respectively. We illustrate that failure to control for confounding can lead to misleading values of performance metrics and establish how the confounding bias propagates to an evaluation bias to quantify the explicit bias for the performance metrics. 
These findings underscore the necessity of accounting for confounding factors when evaluating TBPs, ensuring more reliable and contextually appropriate treatment decisions."}, "https://arxiv.org/abs/2407.05596": {"title": "Methodology for Calculating CO2 Absorption by Tree Planting for Greening Projects", "link": "https://arxiv.org/abs/2407.05596", "description": "arXiv:2407.05596v1 Announce Type: new \nAbstract: In order to explore the possibility of carbon credits for greening projects, which play an important role in climate change mitigation, this paper examines a formula for estimating the amount of carbon fixation for greening activities in urban areas through tree planting. The usefulness of the formula studied was examined by conducting calculations based on actual data through measurements made by on-site surveys of a greening company. A series of calculation results suggest that this formula may be useful. Recognizing carbon credits for green businesses for the carbon sequestration of their projects is an important incentive not only as part of environmental improvement and climate change action, but also to improve the health and well-being of local communities and to generate economic benefits. This study is a pioneering exploration of the methodology."}, "https://arxiv.org/abs/2407.05624": {"title": "Dynamic Matrix Factor Models for High Dimensional Time Series", "link": "https://arxiv.org/abs/2407.05624", "description": "arXiv:2407.05624v1 Announce Type: new \nAbstract: Matrix time series, which consist of matrix-valued data observed over time, are prevalent in various fields such as economics, finance, and engineering. Such matrix time series data are often observed in high dimensions. Matrix factor models are employed to reduce the dimensionality of such data, but they lack the capability to make predictions without specified dynamics in the latent factor process. To address this issue, we propose a two-component dynamic matrix factor model that extends the standard matrix factor model by incorporating a matrix autoregressive structure for the low-dimensional latent factor process. This two-component model injects prediction capability into the matrix factor model and provides deeper insights into the dynamics of high-dimensional matrix time series. We present the estimation procedures of the model and their theoretical properties, as well as empirical analysis of the estimation procedures via simulations, and a case study of New York City taxi data, demonstrating the performance and usefulness of the model."}, "https://arxiv.org/abs/2407.05625": {"title": "New User Event Prediction Through the Lens of Causal Inference", "link": "https://arxiv.org/abs/2407.05625", "description": "arXiv:2407.05625v1 Announce Type: new \nAbstract: Modeling and analysis for event series generated by heterogeneous users of various behavioral patterns are closely involved in our daily lives, including credit card fraud detection, online platform user recommendation, and social network analysis. The most commonly adopted approach to this task is to classify users into behavior-based categories and analyze each of them separately. However, this approach requires extensive data to fully understand user behavior, presenting challenges in modeling newcomers without historical knowledge. In this paper, we propose a novel discrete event prediction framework for new users through the lens of causal inference. Our method offers an unbiased prediction for new users without needing to know their categories. 
We treat the user event history as the ''treatment'' for future events and the user category as the key confounder. Thus, the prediction problem can be framed as counterfactual outcome estimation, with the new user model trained on an adjusted dataset where each event is re-weighted by its inverse propensity score. We demonstrate the superior performance of the proposed framework with a numerical simulation study and two real-world applications, including Netflix rating prediction and seller contact prediction for customer support at Amazon."}, "https://arxiv.org/abs/2407.05691": {"title": "Multi-resolution subsampling for large-scale linear classification", "link": "https://arxiv.org/abs/2407.05691", "description": "arXiv:2407.05691v1 Announce Type: new \nAbstract: Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information of the full data. The present work takes the view that sampling techniques are recommended for the region we focus on and summary measures are enough to collect the information for the rest according to a well-designed data partitioning. We propose a multi-resolution subsampling strategy that combines global information described by summary measures and local information obtained from selected subsample points. We show that the proposed method will lead to a more efficient subsample-based estimator for general large-scale classification problems. Some asymptotic properties of the proposed method are established and connections to existing subsampling procedures are explored. Finally, we illustrate the proposed subsampling strategy via simulated and real-world examples."}, "https://arxiv.org/abs/2407.05824": {"title": "Counting on count regression: overlooked aspects of the Negative Binomial specification", "link": "https://arxiv.org/abs/2407.05824", "description": "arXiv:2407.05824v1 Announce Type: new \nAbstract: Negative Binomial regression is a staple in Operations Management empirical research. Most of its analytical aspects are considered either self-evident, or minutiae that are better left to specialised textbooks. But what if the evidence provided by trusted sources disagrees? In this note I set out to verify results about the Negative Binomial regression specification presented in widely-cited academic sources. I identify problems in how these sources approach the gamma function and its derivatives, with repercussions on the Fisher Information Matrix that may ultimately affect statistical testing. By elevating computations that are rarely specified in full, I provide recommendations to improve methodological evidence that is typically presented without proof."}, "https://arxiv.org/abs/2407.05849": {"title": "Small area prediction of counts under machine learning-type mixed models", "link": "https://arxiv.org/abs/2407.05849", "description": "arXiv:2407.05849v1 Announce Type: new \nAbstract: This paper proposes small area estimation methods that utilize generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas using discrete survey data. Specifically, we present two approaches based on random forests: the Generalized Mixed Effects Random Forest (GMERF) and a Mixed Effects Random Forest (MERF), both tailored to address challenges associated with count outcomes, particularly overdispersion. 
Our analysis reveals that the MERF, which does not assume a Poisson distribution to model the mean behavior of count data, excels in scenarios of severe overdispersion. Conversely, the GMERF performs best under conditions where Poisson distribution assumptions are moderately met. Additionally, we introduce and evaluate three bootstrap methodologies - one parametric and two non-parametric - designed to assess the reliability of point estimators for area-level means. The effectiveness of these methodologies is tested through model-based (and design-based) simulations and applied to a real-world dataset from the state of Guerrero in Mexico, demonstrating their robustness and potential for practical applications."}, "https://arxiv.org/abs/2407.05854": {"title": "A Low-Rank Bayesian Approach for Geoadditive Modeling", "link": "https://arxiv.org/abs/2407.05854", "description": "arXiv:2407.05854v1 Announce Type: new \nAbstract: Kriging is an established methodology for predicting spatial data in geostatistics. Current kriging techniques can handle linear dependencies on spatially referenced covariates. Although splines have shown promise in capturing nonlinear dependencies of covariates, their combination with kriging, especially in handling count data, remains underexplored. This paper proposes a novel Bayesian approach to the low-rank representation of geoadditive models, which integrates splines and kriging to account for both spatial correlations and nonlinear dependencies of covariates. The proposed method accommodates Gaussian and count data inherent in many geospatial datasets. Additionally, Laplace approximations to selected posterior distributions enhances computational efficiency, resulting in faster computation times compared to Markov chain Monte Carlo techniques commonly used for Bayesian inference. Method performance is assessed through a simulation study, demonstrating the effectiveness of the proposed approach. The methodology is applied to the analysis of heavy metal concentrations in the Meuse river and vulnerability to the coronavirus disease 2019 (COVID-19) in Belgium. Through this work, we provide a new flexible and computationally efficient framework for analyzing spatial data."}, "https://arxiv.org/abs/2407.05896": {"title": "A new multivariate Poisson model", "link": "https://arxiv.org/abs/2407.05896", "description": "arXiv:2407.05896v1 Announce Type: new \nAbstract: Multi-dimensional data frequently occur in many different fields, including risk management, insurance, biology, environmental sciences, and many more. In analyzing multivariate data, it is imperative that the underlying modelling assumptions adequately reflect both the marginal behavior as well as the associations between components. This work focuses specifically on developing a new multivariate Poisson model appropriate for multi-dimensional count data. The proposed formulation is based on convolutions of comonotonic shock vectors with Poisson distributed components and allows for flexibility in capturing different degrees of positive dependence. In this paper, the general model framework will be presented along with various distributional properties. 
Several estimation techniques will be explored and assessed both through simulations and in a real data application involving extreme rainfall events."}, "https://arxiv.org/abs/2407.05914": {"title": "Constructing Level Sets Using Smoothed Approximate Bayesian Computation", "link": "https://arxiv.org/abs/2407.05914", "description": "arXiv:2407.05914v1 Announce Type: new \nAbstract: This paper presents a novel approach to level set estimation for any function/simulation with an arbitrary number of continuous inputs and arbitrary numbers of continuous responses. We present a method that uses existing data from computer model simulations to fit a Gaussian process surrogate and use a newly proposed Markov Chain Monte Carlo technique, which we refer to as Smoothed Approximate Bayesian Computation to sample sets of parameters that yield a desired response, which improves on ``hard-clipped\" versions of ABC. We prove that our method converges to the correct distribution (i.e. the posterior distribution of level sets, or probability contours) and give results of our method on known functions and a dam breach simulation where the relationship between input parameters and responses of interest is unknown. Two versions of S-ABC are offered based on: 1) surrogating an accurately known target model and 2) surrogating an approximate model, which leads to uncertainty in estimating the level sets. In addition, we show how our method can be extended to multiple responses with an accompanying example. As demonstrated, S-ABC is able to estimate a level set accurately without the use of a predefined grid or signed distance function."}, "https://arxiv.org/abs/2407.05957": {"title": "A likelihood ratio test for circular multimodality", "link": "https://arxiv.org/abs/2407.05957", "description": "arXiv:2407.05957v1 Announce Type: new \nAbstract: The modes of a statistical population are high frequency points around which most of the probability mass is accumulated. For the particular case of circular densities, we address the problem of testing if, given an observed sample of a random angle, the underlying circular distribution model is multimodal. Our work is motivated by the analysis of migration patterns of birds and the methodological proposal follows a novel approach based on likelihood ratio ideas, combined with critical bandwidths. Theoretical results support the behaviour of the test, whereas simulation examples show its finite sample performance."}, "https://arxiv.org/abs/2407.06038": {"title": "Comparing Causal Inference Methods for Point Exposures with Missing Confounders: A Simulation Study", "link": "https://arxiv.org/abs/2407.06038", "description": "arXiv:2407.06038v1 Announce Type: new \nAbstract: Causal inference methods based on electronic health record (EHR) databases must simultaneously handle confounding and missing data. Vast scholarship exists aimed at addressing these two issues separately, but surprisingly few papers attempt to address them simultaneously. In practice, when faced with simultaneous missing data and confounding, analysts may proceed by first imputing missing data and subsequently using outcome regression or inverse-probability weighting (IPW) to address confounding. However, little is known about the theoretical performance of such $\\textit{ad hoc}$ methods. 
In a recent paper Levis $\\textit{et al.}$ outline a robust framework for tackling these problems together under certain identifying conditions, and introduce a pair of estimators for the average treatment effect (ATE), one of which is non-parametric efficient. In this work we present a series of simulations, motivated by a published EHR based study of the long-term effects of bariatric surgery on weight outcomes, to investigate these new estimators and compare them to existing $\\textit{ad hoc}$ methods. While the latter perform well in certain scenarios, no single estimator is uniformly best. As such, the work of Levis $\\textit{et al.}$ may serve as a reasonable default for causal inference when handling confounding and missing data together."}, "https://arxiv.org/abs/2407.06069": {"title": "How to Add Baskets to an Ongoing Basket Trial with Information Borrowing", "link": "https://arxiv.org/abs/2407.06069", "description": "arXiv:2407.06069v1 Announce Type: new \nAbstract: Basket trials test a single therapeutic treatment on several patient populations under one master protocol. A desirable adaptive design feature in these studies may be the incorporation of new baskets to an ongoing study. Limited basket sample sizes can cause issues in power and precision of treatment effect estimates which could be amplified in added baskets due to the shortened recruitment time. While various Bayesian information borrowing techniques have been introduced to tackle the issue of small sample sizes, the impact of including new baskets in the trial and into the borrowing model has yet to be investigated. We explore approaches for adding baskets to an ongoing trial under information borrowing and highlight when it is beneficial to add a basket compared to running a separate investigation for new baskets. We also propose a novel calibration approach for the decision criteria that is more robust to false decision making. Simulation studies are conducted to assess the performance of approaches which is monitored primarily through type I error control and precision of estimates. Results display a substantial improvement in power for a new basket when information borrowing is utilized, however, this comes with potential inflation of error rates which can be shown to be reduced under the proposed calibration procedure."}, "https://arxiv.org/abs/2407.06173": {"title": "Large Row-Constrained Supersaturated Designs for High-throughput Screening", "link": "https://arxiv.org/abs/2407.06173", "description": "arXiv:2407.06173v1 Announce Type: new \nAbstract: High-throughput screening, in which multiwell plates are used to test large numbers of compounds against specific targets, is widely used across many areas of the biological sciences and most prominently in drug discovery. We propose a statistically principled approach to these screening experiments, using the machinery of supersaturated designs and the Lasso. To accommodate limitations on the number of biological entities that can be applied to a single microplate well, we present a new class of row-constrained supersaturated designs. We develop a computational procedure to construct these designs, provide some initial lower bounds on the average squared off-diagonal values of their main-effects information matrix, and study the impact of the constraint on design quality. 
We also show via simulation that the proposed constrained row screening method is statistically superior to existing methods and demonstrate the use of the new methodology on a real drug-discovery system."}, "https://arxiv.org/abs/2407.04980": {"title": "Enabling Causal Discovery in Post-Nonlinear Models with Normalizing Flows", "link": "https://arxiv.org/abs/2407.04980", "description": "arXiv:2407.04980v1 Announce Type: cross \nAbstract: Post-nonlinear (PNL) causal models stand out as a versatile and adaptable framework for modeling intricate causal relationships. However, accurately capturing the invertibility constraint required in PNL models remains challenging in existing studies. To address this problem, we introduce CAF-PoNo (Causal discovery via Normalizing Flows for Post-Nonlinear models), harnessing the power of the normalizing flows architecture to enforce the crucial invertibility constraint in PNL models. Through normalizing flows, our method precisely reconstructs the hidden noise, which plays a vital role in cause-effect identification through statistical independence testing. Furthermore, the proposed approach exhibits remarkable extensibility, as it can be seamlessly expanded to facilitate multivariate causal discovery via causal order identification, empowering us to efficiently unravel complex causal relationships. Extensive experimental evaluations on both simulated and real datasets consistently demonstrate that the proposed method outperforms several state-of-the-art approaches in both bivariate and multivariate causal discovery tasks."}, "https://arxiv.org/abs/2407.04992": {"title": "Scalable Variational Causal Discovery Unconstrained by Acyclicity", "link": "https://arxiv.org/abs/2407.04992", "description": "arXiv:2407.04992v1 Announce Type: cross \nAbstract: Bayesian causal discovery offers the power to quantify epistemic uncertainties among a broad range of structurally diverse causal theories potentially explaining the data, represented in forms of directed acyclic graphs (DAGs). However, existing methods struggle with efficient DAG sampling due to the complex acyclicity constraint. In this study, we propose a scalable Bayesian approach to effectively learn the posterior distribution over causal graphs given observational data thanks to the ability to generate DAGs without explicitly enforcing acyclicity. Specifically, we introduce a novel differentiable DAG sampling method that can generate a valid acyclic causal graph by mapping an unconstrained distribution of implicit topological orders to a distribution over DAGs. Given this efficient DAG sampling scheme, we are able to model the posterior distribution over causal graphs using a simple variational distribution over a continuous domain, which can be learned via the variational inference framework. Extensive empirical experiments on both simulated and real datasets demonstrate the superior performance of the proposed model compared to several state-of-the-art baselines."}, "https://arxiv.org/abs/2407.05330": {"title": "Fast Proxy Experiment Design for Causal Effect Identification", "link": "https://arxiv.org/abs/2407.05330", "description": "arXiv:2407.05330v1 Announce Type: cross \nAbstract: Identifying causal effects is a key problem of interest across many disciplines. The two long-standing approaches to estimate causal effects are observational and experimental (randomized) studies. Observational studies can suffer from unmeasured confounding, which may render the causal effects unidentifiable. 
On the other hand, direct experiments on the target variable may be too costly or even infeasible to conduct. A middle ground between these two approaches is to estimate the causal effect of interest through proxy experiments, which are conducted on variables with a lower cost to intervene on compared to the main target. Akbari et al. [2022] studied this setting and demonstrated that the problem of designing the optimal (minimum-cost) experiment for causal effect identification is NP-complete and provided a naive algorithm that may require solving exponentially many NP-hard problems as a sub-routine in the worst case. In this work, we provide a few reformulations of the problem that allow for designing significantly more efficient algorithms to solve it as witnessed by our extensive simulations. Additionally, we study the closely-related problem of designing experiments that enable us to identify a given effect through valid adjustments sets."}, "https://arxiv.org/abs/2407.05492": {"title": "Gaussian Approximation and Output Analysis for High-Dimensional MCMC", "link": "https://arxiv.org/abs/2407.05492", "description": "arXiv:2407.05492v1 Announce Type: cross \nAbstract: The widespread use of Markov Chain Monte Carlo (MCMC) methods for high-dimensional applications has motivated research into the scalability of these algorithms with respect to the dimension of the problem. Despite this, numerous problems concerning output analysis in high-dimensional settings have remained unaddressed. We present novel quantitative Gaussian approximation results for a broad range of MCMC algorithms. Notably, we analyse the dependency of the obtained approximation errors on the dimension of both the target distribution and the feature space. We demonstrate how these Gaussian approximations can be applied in output analysis. This includes determining the simulation effort required to guarantee Markov chain central limit theorems and consistent variance estimation in high-dimensional settings. We give quantitative convergence bounds for termination criteria and show that the termination time of a wide class of MCMC algorithms scales polynomially in dimension while ensuring a desired level of precision. Our results offer guidance to practitioners for obtaining appropriate standard errors and deciding the minimum simulation effort of MCMC algorithms in both multivariate and high-dimensional settings."}, "https://arxiv.org/abs/2407.05954": {"title": "Causality-driven Sequence Segmentation for Enhancing Multiphase Industrial Process Data Analysis and Soft Sensing", "link": "https://arxiv.org/abs/2407.05954", "description": "arXiv:2407.05954v1 Announce Type: cross \nAbstract: The dynamic characteristics of multiphase industrial processes present significant challenges in the field of industrial big data modeling. Traditional soft sensing models frequently neglect the process dynamics and have difficulty in capturing transient phenomena like phase transitions. To address this issue, this article introduces a causality-driven sequence segmentation (CDSS) model. This model first identifies the local dynamic properties of the causal relationships between variables, which are also referred to as causal mechanisms. It then segments the sequence into different phases based on the sudden shifts in causal mechanisms that occur during phase transitions. 
Additionally, a novel metric, similarity distance, is designed to evaluate the temporal consistency of causal mechanisms, which includes both causal similarity distance and stable similarity distance. The discovered causal relationships in each phase are represented as a temporal causal graph (TCG). Furthermore, a soft sensing model called temporal-causal graph convolutional network (TC-GCN) is trained for each phase by using the time-extended data and the adjacency matrix of TCG. Numerical examples are used to validate the proposed CDSS model, and the segmentation results demonstrate that CDSS performs well in segmenting both stable and unstable multiphase series. In particular, it separates non-stationary time series more accurately than other methods. The effectiveness of the proposed CDSS model and the TC-GCN model is also verified through a penicillin fermentation process. Experimental results indicate that the breakpoints discovered by CDSS align well with the reaction mechanisms and that TC-GCN achieves excellent predictive accuracy."}, "https://arxiv.org/abs/1906.03661": {"title": "Community Correlations and Testing Independence Between Binary Graphs", "link": "https://arxiv.org/abs/1906.03661", "description": "arXiv:1906.03661v3 Announce Type: replace \nAbstract: Graph data has a unique structure that deviates from standard data assumptions, often necessitating modifications to existing methods or the development of new ones to ensure valid statistical analysis. In this paper, we explore the notion of correlation and dependence between two binary graphs. Given vertex communities, we propose community correlations to measure the edge association, which equals zero if and only if the two graphs are conditionally independent within a specific pair of communities. The set of community correlations naturally leads to the maximum community correlation, indicating conditional independence on all possible pairs of communities, and to the overall graph correlation, which equals zero if and only if the two binary graphs are unconditionally independent. We then compute the sample community correlations via graph encoder embedding, proving they converge to their respective population versions, and derive the asymptotic null distribution to enable a fast, valid, and consistent test for conditional or unconditional independence between two binary graphs. The theoretical results are validated through comprehensive simulations, and we provide two real-data examples: one using Enron email networks and another using mouse connectome graphs, to demonstrate the utility of the proposed correlation measures."}, "https://arxiv.org/abs/2005.12017": {"title": "Estimating spatially varying health effects of wildland fire smoke using mobile health data", "link": "https://arxiv.org/abs/2005.12017", "description": "arXiv:2005.12017v3 Announce Type: replace \nAbstract: Wildland fire smoke exposures are an increasing threat to public health, and thus there is a growing need for studying the effects of protective behaviors on reducing health outcomes. Emerging smartphone applications provide unprecedented opportunities to deliver health risk communication messages to a large number of individuals when and where they experience the exposure and subsequently study the effectiveness, but also pose novel methodological challenges. 
Smoke Sense, a citizen science project, provides an interactive smartphone app platform for participants to engage with information about air quality and ways to protect their health and record their own health symptoms and actions taken to reduce smoke exposure. We propose a new, doubly robust estimator of the structural nested mean model parameter that accounts for spatially- and time-varying effects via a local estimating equation approach with geographical kernel weighting. Moreover, our analytical framework is flexible enough to handle informative missingness by inverse probability weighting of estimating functions. We evaluate the new method using extensive simulation studies and apply it to Smoke Sense data reported by the citizen scientists to increase the knowledge base about the relationship between health preventive measures and improved health outcomes. Our results estimate how the protective behaviors effects vary over space and time and find that protective behaviors have more significant effects on reducing health symptoms in the Southwest than the Northwest region of the USA."}, "https://arxiv.org/abs/2206.01779": {"title": "Bayesian and Frequentist Inference for Synthetic Controls", "link": "https://arxiv.org/abs/2206.01779", "description": "arXiv:2206.01779v3 Announce Type: replace \nAbstract: The synthetic control method has become a widely popular tool to estimate causal effects with observational data. Despite this, inference for synthetic control methods remains challenging. Often, inferential results rely on linear factor model data generating processes. In this paper, we characterize the conditions on the factor model primitives (the factor loadings) for which the statistical risk minimizers are synthetic controls (in the simplex). Then, we propose a Bayesian alternative to the synthetic control method that preserves the main features of the standard method and provides a new way of doing valid inference. We explore a Bernstein-von Mises style result to link our Bayesian inference to the frequentist inference. For linear factor model frameworks we show that a maximum likelihood estimator (MLE) of the synthetic control weights can consistently estimate the predictive function of the potential outcomes for the treated unit and that our Bayes estimator is asymptotically close to the MLE in the total variation sense. Through simulations, we show that there is convergence between the Bayes and frequentist approach even in sparse settings. Finally, we apply the method to re-visit the study of the economic costs of the German re-unification and the Catalan secession movement. The Bayesian synthetic control method is available in the bsynth R-package."}, "https://arxiv.org/abs/2308.01724": {"title": "Reconciling Functional Data Regression with Excess Bases", "link": "https://arxiv.org/abs/2308.01724", "description": "arXiv:2308.01724v2 Announce Type: replace \nAbstract: As the development of measuring instruments and computers has accelerated the collection of massive amounts of data, functional data analysis (FDA) has experienced a surge of attention. The FDA methodology treats longitudinal data as a set of functions on which inference, including regression, is performed. Functionalizing data typically involves fitting the data with basis functions. In general, the number of basis functions smaller than the sample size is selected. This paper casts doubt on this convention. 
Recent statistical theory has revealed the so-called double-descent phenomenon in which excess parameters overcome overfitting and lead to precise interpolation. Applying this idea to choosing the number of bases to be used for functional data, we show that choosing an excess number of bases can lead to more accurate predictions. Specifically, we explored this phenomenon in a functional regression context and examined its validity through numerical experiments. In addition, we introduce two real-world datasets to demonstrate that the double-descent phenomenon goes beyond theoretical and numerical experiments, confirming its importance in practical applications."}, "https://arxiv.org/abs/2310.18858": {"title": "Estimating a function of the scale parameter in a gamma distribution with bounded variance", "link": "https://arxiv.org/abs/2310.18858", "description": "arXiv:2310.18858v2 Announce Type: replace \nAbstract: Given a gamma population with known shape parameter $\\alpha$, we develop a general theory for estimating a function $g(\\cdot)$ of the scale parameter $\\beta$ with bounded variance. We begin by defining a sequential sampling procedure with $g(\\cdot)$ satisfying some desired condition in proposing the stopping rule, and show the procedure enjoys appealing asymptotic properties. After these general conditions, we substitute $g(\\cdot)$ with specific functions including the gamma mean, the gamma variance, the gamma rate parameter, and a gamma survival probability as four possible illustrations. For each illustration, Monte Carlo simulations are carried out to justify the remarkable performance of our proposed sequential procedure. This is further substantiated with a real data study on weights of newly born babies."}, "https://arxiv.org/abs/1809.01796": {"title": "Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data", "link": "https://arxiv.org/abs/1809.01796", "description": "arXiv:1809.01796v2 Announce Type: replace-cross \nAbstract: In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named Sparse Tensor Alternating Thresholding for Singular Value Decomposition (STAT-SVD) is proposed. The proposed procedure features a novel double projection \\& thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with regular tensor SVD model, STAT-SVD permits more robust estimation under weaker assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates."}, "https://arxiv.org/abs/2305.07993": {"title": "The Nonstationary Newsvendor with (and without) Predictions", "link": "https://arxiv.org/abs/2305.07993", "description": "arXiv:2305.07993v3 Announce Type: replace-cross \nAbstract: The classic newsvendor model yields an optimal decision for a \"newsvendor\" selecting a quantity of inventory, under the assumption that the demand is drawn from a known distribution. 
Motivated by applications such as cloud provisioning and staffing, we consider a setting in which newsvendor-type decisions must be made sequentially, in the face of demand drawn from a stochastic process that is both unknown and nonstationary. All prior work on this problem either (a) assumes that the level of nonstationarity is known, or (b) imposes additional statistical assumptions that enable accurate predictions of the unknown demand.\n We study the Nonstationary Newsvendor, with and without predictions. We first, in the setting without predictions, design a policy which we prove (via matching upper and lower bounds) achieves order-optimal regret -- ours is the first policy to accomplish this without being given the level of nonstationarity of the underlying demand. We then, for the first time, introduce a model for generic (i.e. with no statistical assumptions) predictions with arbitrary accuracy, and propose a policy that incorporates these predictions without being given their accuracy. We upper bound the regret of this policy, and show that it matches the best achievable regret had the accuracy of the predictions been known. Finally, we empirically validate our new policy with experiments based on three real-world datasets containing thousands of time-series, showing that it succeeds in closing approximately 74% of the gap between the best approaches based on nonstationarity and predictions alone."}, "https://arxiv.org/abs/2401.11646": {"title": "Nonparametric Density Estimation via Variance-Reduced Sketching", "link": "https://arxiv.org/abs/2401.11646", "description": "arXiv:2401.11646v2 Announce Type: replace-cross \nAbstract: Nonparametric density models are of great interest in various scientific and engineering disciplines. Classical density kernel methods, while numerically robust and statistically sound in low-dimensional settings, become inadequate even in moderate higher-dimensional settings due to the curse of dimensionality. In this paper, we introduce a new framework called Variance-Reduced Sketching (VRS), specifically designed to estimate multivariable density functions with a reduced curse of dimensionality. Our framework conceptualizes multivariable functions as infinite-size matrices, and facilitates a new sketching technique motivated by numerical linear algebra literature to reduce the variance in density estimation problems. We demonstrate the robust numerical performance of VRS through a series of simulated experiments and real-world data applications. Notably, VRS shows remarkable improvement over existing neural network estimators and classical kernel methods in numerous density models. Additionally, we offer theoretical justifications for VRS to support its ability to deliver nonparametric density estimation with a reduced curse of dimensionality."}, "https://arxiv.org/abs/2407.06350": {"title": "A Surrogate Endpoint Based Provisional Approval Causal Roadmap", "link": "https://arxiv.org/abs/2407.06350", "description": "arXiv:2407.06350v1 Announce Type: new \nAbstract: For many rare diseases with no approved preventive interventions, promising interventions exist, yet it has been difficult to conduct a pivotal phase 3 trial that could provide direct evidence demonstrating a beneficial effect on the target disease outcome. When a promising putative surrogate endpoint(s) for the target outcome is available, surrogate-based provisional approval of an intervention may be pursued. 
We apply the Causal Roadmap rubric to define a surrogate endpoint based provisional approval causal roadmap, which combines observational study data that estimates the relationship between the putative surrogate and the target outcome, with a phase 3 surrogate endpoint study that collects the same data but is very under-powered to assess the treatment effect (TE) on the target outcome. The objective is conservative estimation/inference for the TE with an estimated lower uncertainty bound that allows (through two bias functions) for an imperfect surrogate and imperfect transport of the conditional target outcome risk in the untreated between the observational and phase 3 studies. Two estimators of TE (plug-in, nonparametric efficient one-step) with corresponding inference procedures are developed. Finite-sample performance of the plug-in estimator is evaluated in two simulation studies, with R code provided. The roadmap is illustrated with contemporary Group B Streptococcus vaccine development."}, "https://arxiv.org/abs/2407.06387": {"title": "Conditional Rank-Rank Regression", "link": "https://arxiv.org/abs/2407.06387", "description": "arXiv:2407.06387v1 Announce Type: new \nAbstract: Rank-rank regressions are widely used in economic research to evaluate phenomena such as intergenerational income persistence or mobility. However, when covariates are incorporated to capture between-group persistence, the resulting coefficients can be difficult to interpret as such. We propose the conditional rank-rank regression, which uses conditional ranks instead of unconditional ranks, to measure average within-group income persistence. This property is analogous to that of the unconditional rank-rank regression that measures the overall income persistence. The difference between conditional and unconditional rank-rank regression coefficients therefore can measure between-group persistence. We develop a flexible estimation approach using distribution regression and establish a theoretical framework for large sample inference. An empirical study on intergenerational income mobility in Switzerland demonstrates the advantages of this approach. The study reveals stronger intergenerational persistence between fathers and sons compared to fathers and daughters, with the within-group persistence explaining 62% of the overall income persistence for sons and 52% for daughters. Families of small size or with highly educated fathers exhibit greater persistence in passing on their economic status."}, "https://arxiv.org/abs/2407.06395": {"title": "Logit unfolding choice models for binary data", "link": "https://arxiv.org/abs/2407.06395", "description": "arXiv:2407.06395v1 Announce Type: new \nAbstract: Discrete choice models with non-monotonic response functions are important in many areas of application, especially political sciences and marketing. This paper describes a novel unfolding model for binary data that allows for heavy-tailed shocks to the underlying utilities. One of our key contributions is a Markov chain Monte Carlo algorithm that requires little or no parameter tuning, fully explores the support of the posterior distribution, and can be used to fit various extensions of our core model that involve (Bayesian) hypothesis testing on the latent construct. 
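As a companion to the Conditional Rank-Rank Regression entry above, the following sketch computes the familiar unconditional rank-rank slope that the paper generalises; the conditional version based on distribution regression is not implemented here, and the simulated incomes and names are purely illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def rank_rank_slope(parent_income, child_income):
    # Unconditional rank-rank regression: OLS slope of the child's income rank
    # on the parent's income rank, with ranks scaled to [0, 1].
    rp = rankdata(parent_income) / len(parent_income)
    rc = rankdata(child_income) / len(child_income)
    X = np.column_stack([np.ones_like(rp), rp])
    beta, *_ = np.linalg.lstsq(X, rc, rcond=None)
    return beta[1]  # rank persistence coefficient

rng = np.random.default_rng(1)
parent = rng.lognormal(mean=10.0, sigma=0.5, size=1000)
child = 0.4 * np.log(parent) + rng.normal(size=1000)
print(rank_rank_slope(parent, child))
```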
Our empirical evaluations of the model and the associated algorithm suggest that they provide better complexity-adjusted fit to voting data from the United States House of Representatives."}, "https://arxiv.org/abs/2407.06466": {"title": "Increased risk of type I errors for detecting heterogeneity of treatment effects in cluster-randomized trials using mixed-effect models", "link": "https://arxiv.org/abs/2407.06466", "description": "arXiv:2407.06466v1 Announce Type: new \nAbstract: Evaluating heterogeneity of treatment effects (HTE) across subgroups is common in both randomized trials and observational studies. Although several statistical challenges of HTE analyses including low statistical power and multiple comparisons are widely acknowledged, issues arising for clustered data, including cluster randomized trials (CRTs), have received less attention. Notably, the potential for model misspecification is increased given the complex clustering structure (e.g., due to correlation among individuals within a subgroup and cluster), which could impact inference and type 1 errors. To illustrate this issue, we conducted a simulation study to evaluate the performance of common analytic approaches for testing the presence of HTE for continuous, binary, and count outcomes: generalized linear mixed models (GLMM) and generalized estimating equations (GEE) including interaction terms between treatment group and subgroup. We found that standard GLMM analyses that assume a common correlation of participants within clusters can lead to severely elevated type 1 error rates of up to 47.2% compared to the 5% nominal level if the within-cluster correlation varies across subgroups. A flexible GLMM, which allows subgroup-specific within-cluster correlations, achieved the nominal type 1 error rate, as did GEE (though rates were slightly elevated even with as many as 50 clusters). Applying the methods to a real-world CRT using the count outcome utilization of healthcare, we found a large impact of the model specification on inference: the standard GLMM yielded highly significant interaction by sex (P=0.01), whereas the interaction was non-statistically significant under the flexible GLMM and GEE (P=0.64 and 0.93, respectively). We recommend that HTE analyses using GLMM account for within-subgroup correlation to avoid anti-conservative inference."}, "https://arxiv.org/abs/2407.06497": {"title": "Bayesian design for mathematical models of fruit growth based on misspecified prior information", "link": "https://arxiv.org/abs/2407.06497", "description": "arXiv:2407.06497v1 Announce Type: new \nAbstract: Bayesian design can be used for efficient data collection over time when the process can be described by the solution to an ordinary differential equation (ODE). Typically, Bayesian designs in such settings are obtained by maximising the expected value of a utility function that is derived from the joint probability distribution of the parameters and the response, given prior information about an appropriate ODE. However, in practice, appropriately defining such information \textit{a priori} can be difficult due to incomplete knowledge about the mechanisms that govern how the process evolves over time. In this paper, we propose a method for finding Bayesian designs based on a flexible class of ODEs. Specifically, we consider the inclusion of spline terms into ODEs to provide flexibility in modelling how the process changes over time.
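One of the analytic approaches compared in the cluster-randomized HTE entry above (GEE with a treatment-by-subgroup interaction and an exchangeable working correlation) can be sketched as follows; the simulated clusters, variable names, and effect sizes are hypothetical and this is not the simulation design used in that paper.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cluster-randomized data: treatment assigned at the cluster level
rng = np.random.default_rng(2)
n_clusters, cluster_size = 50, 20
df = pd.DataFrame({
    "cluster": np.repeat(np.arange(n_clusters), cluster_size),
    "subgroup": rng.integers(0, 2, n_clusters * cluster_size),
})
df["treat"] = (df["cluster"] % 2).astype(int)
cluster_effect = rng.normal(0.0, 0.3, n_clusters)[df["cluster"]]
df["y"] = 1.0 + 0.5 * df["treat"] + cluster_effect + rng.normal(size=len(df))

# GEE with an exchangeable working correlation; the robust (sandwich) test of the
# treat:subgroup interaction is the heterogeneity-of-treatment-effect test.
model = smf.gee("y ~ treat * subgroup", groups="cluster", data=df,
                cov_struct=sm.cov_struct.Exchangeable(),
                family=sm.families.Gaussian())
print(model.fit().summary())
```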
We then propose to leverage this flexibility to form designs that are efficient even when the prior information is misspecified. Our approach is motivated by a sampling problem in agriculture where the goal is to provide a better understanding of fruit growth where prior information is based on studies conducted overseas, and therefore is potentially misspecified."}, "https://arxiv.org/abs/2407.06522": {"title": "Independent Approximates provide a maximum likelihood estimate for heavy-tailed distributions", "link": "https://arxiv.org/abs/2407.06522", "description": "arXiv:2407.06522v1 Announce Type: new \nAbstract: Heavy-tailed distributions are infamously difficult to estimate because their moments tend to infinity as the shape of the tail decay increases. Nevertheless, this study shows the utilization of a modified group of moments for estimating a heavy-tailed distribution. These modified moments are determined from powers of the original distribution. The nth-power distribution is guaranteed to have finite moments up to n-1. Samples from the nth-power distribution are drawn from n-tuple Independent Approximates, which are the set of independent samples grouped into n-tuples and sub-selected to be approximately equal to each other. We show that Independent Approximates are a maximum likelihood estimator for the generalized Pareto and the Student's t distributions, which are members of the family of coupled exponential distributions. We use the first (original), second, and third power distributions to estimate their zeroth (geometric mean), first, and second power-moments respectively. In turn, these power-moments are used to estimate the scale and shape of the distributions. A least absolute deviation criterion is used to select the optimal set of Independent Approximates. Estimates using higher powers and moments are possible, though the number of n-tuples that are approximately equal may be limited."}, "https://arxiv.org/abs/2407.06722": {"title": "Femicide Laws, Unilateral Divorce, and Abortion Decriminalization Fail to Stop Women's Killings in Mexico", "link": "https://arxiv.org/abs/2407.06722", "description": "arXiv:2407.06722v1 Announce Type: new \nAbstract: This paper evaluates the effectiveness of femicide laws in combating gender-based killings of women, a major cause of premature female mortality globally. Focusing on Mexico, a pioneer in adopting such legislation, the paper leverages variations in the enactment of femicide laws and associated prison sentences across states. Using the difference-in-difference estimator, the analysis reveals that these laws have not significantly affected the incidence of femicides, homicides of women, or reports of women who have disappeared. These findings remain robust even when accounting for differences in prison sentencing, whether states also implemented unilateral divorce laws, or decriminalized abortion alongside femicide legislation. The results suggest that legislative measures are insufficient to address violence against women in settings where impunity prevails."}, "https://arxiv.org/abs/2407.06733": {"title": "Causes and Electoral Consequences of Political Assassinations: The Role of Organized Crime in Mexico", "link": "https://arxiv.org/abs/2407.06733", "description": "arXiv:2407.06733v1 Announce Type: new \nAbstract: Mexico has experienced a notable surge in assassinations of political candidates and mayors.
This article argues that these killings are largely driven by organized crime, aiming to influence candidate selection, control local governments for rent-seeking, and retaliate against government crackdowns. Using a new dataset of political assassinations in Mexico from 2000 to 2021 and instrumental variables, we address endogeneity concerns in the location and timing of government crackdowns. Our instruments include historical Chinese immigration patterns linked to opium cultivation in Mexico, local corn prices, and U.S. illicit drug prices. The findings reveal that candidates in municipalities near oil pipelines face an increased risk of assassination due to drug trafficking organizations expanding into oil theft, particularly during elections and fuel price hikes. Government arrests or killings of organized crime members trigger retaliatory violence, further endangering incumbent mayors. This political violence has a negligible impact on voter turnout, as it targets politicians rather than voters. However, voter turnout increases in areas where authorities disrupt drug smuggling, raising the chances of the local party being re-elected. These results offer new insights into how criminal groups attempt to capture local governments and the implications for democracy under criminal governance."}, "https://arxiv.org/abs/2407.06835": {"title": "A flexible model for Record Linkage", "link": "https://arxiv.org/abs/2407.06835", "description": "arXiv:2407.06835v1 Announce Type: new \nAbstract: Combining data from various sources empowers researchers to explore innovative questions, for example those raised by conducting healthcare monitoring studies. However, the lack of a unique identifier often poses challenges. Record linkage procedures determine whether pairs of observations collected on different occasions belong to the same individual using partially identifying variables (e.g. birth year, postal code). Existing methodologies typically involve a compromise between computational efficiency and accuracy. Traditional approaches simplify this task by condensing information, yet they neglect dependencies among linkage decisions and disregard the one-to-one relationship required to establish coherent links. Modern approaches offer a comprehensive representation of the data generation process, at the expense of computational overhead and reduced flexibility. We propose a flexible method, that adapts to varying data complexities, addressing registration errors and accommodating changes of the identifying information over time. Our approach balances accuracy and scalability, estimating the linkage using a Stochastic Expectation Maximisation algorithm on a latent variable model. We illustrate the ability of our methodology to connect observations using large real data applications and demonstrate the robustness of our model to the linking variables quality in a simulation study. The proposed algorithm FlexRL is implemented and available in an open source R package."}, "https://arxiv.org/abs/2407.06867": {"title": "Distributionally robust risk evaluation with an isotonic constraint", "link": "https://arxiv.org/abs/2407.06867", "description": "arXiv:2407.06867v1 Announce Type: new \nAbstract: Statistical learning under distribution shift is challenging when neither prior knowledge nor fully accessible data from the target distribution is available. 
Distributionally robust learning (DRL) aims to control the worst-case statistical performance within an uncertainty set of candidate distributions, but how to properly specify the set remains challenging. To enable distributional robustness without being overly conservative, in this paper, we propose a shape-constrained approach to DRL, which incorporates prior information about the way in which the unknown target distribution differs from its estimate. More specifically, we assume the unknown density ratio between the target distribution and its estimate is isotonic with respect to some partial order. At the population level, we provide a solution to the shape-constrained optimization problem that does not involve the isotonic constraint. At the sample level, we provide consistency results for an empirical estimator of the target in a range of different settings. Empirical studies on both synthetic and real data examples demonstrate the improved accuracy of the proposed shape-constrained approach."}, "https://arxiv.org/abs/2407.06883": {"title": "Dealing with idiosyncratic cross-correlation when constructing confidence regions for PC factors", "link": "https://arxiv.org/abs/2407.06883", "description": "arXiv:2407.06883v1 Announce Type: new \nAbstract: In this paper, we propose a computationally simple estimator of the asymptotic covariance matrix of the Principal Components (PC) factors valid in the presence of cross-correlated idiosyncratic components. The proposed estimator of the asymptotic Mean Square Error (MSE) of PC factors is based on adaptive thresholding the sample covariances of the idiosyncratic residuals with the threshold based on their individual variances. We compare the finite sample performance of confidence regions for the PC factors obtained using the proposed asymptotic MSE with those of available extant asymptotic and bootstrap regions and show that the former beats all alternative procedures for a wide variety of idiosyncratic cross-correlation structures."}, "https://arxiv.org/abs/2407.06892": {"title": "When Knockoffs fail: diagnosing and fixing non-exchangeability of Knockoffs", "link": "https://arxiv.org/abs/2407.06892", "description": "arXiv:2407.06892v1 Announce Type: new \nAbstract: Knockoffs are a popular statistical framework that addresses the challenging problem of conditional variable selection in high-dimensional settings with statistical control. Such statistical control is essential for the reliability of inference. However, knockoff guarantees rely on an exchangeability assumption that is difficult to test in practice, and there is little discussion in the literature on how to deal with unfulfilled hypotheses. This assumption is related to the ability to generate data similar to the observed data. To maintain reliable inference, we introduce a diagnostic tool based on Classifier Two-Sample Tests. Using simulations and real data, we show that violations of this assumption occur in common settings for classical Knockoffs generators, especially when the data have a strong dependence structure. We show that the diagnostic tool correctly detects such behavior. To fix knockoff generation, we propose a nonparametric, computationally-efficient alternative knockoff construction, which is based on constructing a predictor of each variable based on all others. We show that this approach achieves asymptotic exchangeability with the original variables under standard assumptions on the predictive model.
We show empirically that the proposed approach restores error control on simulated data."}, "https://arxiv.org/abs/2407.06970": {"title": "Effect estimation in the presence of a misclassified binary mediator", "link": "https://arxiv.org/abs/2407.06970", "description": "arXiv:2407.06970v1 Announce Type: new \nAbstract: Mediation analyses allow researchers to quantify the effect of an exposure variable on an outcome variable through a mediator variable. If a binary mediator variable is misclassified, the resulting analysis can be severely biased. Misclassification is especially difficult to deal with when it is differential and when there are no gold standard labels available. Previous work has addressed this problem using a sensitivity analysis framework or by assuming that misclassification rates are known. We leverage a variable related to the misclassification mechanism to recover unbiased parameter estimates without using gold standard labels. The proposed methods require the reasonable assumption that the sum of the sensitivity and specificity is greater than 1. Three correction methods are presented: (1) an ordinary least squares correction for Normal outcome models, (2) a multi-step predictive value weighting method, and (3) a seamless expectation-maximization algorithm. We apply our misclassification correction strategies to investigate the mediating role of gestational hypertension on the association between maternal age and pre-term birth."}, "https://arxiv.org/abs/2407.07067": {"title": "Aggregate Bayesian Causal Forests: The ABCs of Flexible Causal Inference for Hierarchically Structured Data", "link": "https://arxiv.org/abs/2407.07067", "description": "arXiv:2407.07067v1 Announce Type: new \nAbstract: This paper introduces aggregate Bayesian Causal Forests (aBCF), a new Bayesian model for causal inference using aggregated data. Aggregated data are common in policy evaluations where we observe individuals such as students, but participation in an intervention is determined at a higher level of aggregation, such as schools implementing a curriculum. Interventions often have millions of individuals but far fewer higher-level units, making aggregation computationally attractive. To analyze aggregated data, a model must account for heteroskedasticity and intraclass correlation (ICC). Like Bayesian Causal Forests (BCF), aBCF estimates heterogeneous treatment effects with minimal parametric assumptions, but accounts for these aggregated data features, improving estimation of average and aggregate unit-specific effects.\n After introducing the aBCF model, we demonstrate via simulation that aBCF improves performance for aggregated data over BCF. We anchor our simulation on an evaluation of a large-scale Medicare primary care model. We demonstrate that aBCF produces treatment effect estimates with a lower root mean squared error and narrower uncertainty intervals while achieving the same level of coverage. We show that aBCF is not sensitive to the prior distribution used and that estimation improvements relative to BCF decline as the ICC approaches one. 
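The diagnostic idea in the Knockoffs entry above can be illustrated with a generic classifier two-sample test: if a classifier separates original rows from knockoff rows much better than chance, exchangeability is suspect. This simplified sketch is not the authors' implementation, and the "knockoffs" below are deliberately poor ones built from independent noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def c2st_auc(X, X_knockoff, seed=0):
    # Classifier two-sample test: cross-validated AUC for telling original rows
    # apart from knockoff rows; values well above 0.5 flag non-exchangeability.
    Z = np.vstack([X, X_knockoff])
    y = np.r_[np.zeros(len(X)), np.ones(len(X_knockoff))]
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    return cross_val_score(clf, Z, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(3)
cov = 0.5 * np.ones((5, 5)) + 0.5 * np.eye(5)          # strong positive dependence
X = rng.multivariate_normal(np.zeros(5), cov, size=500)
X_bad = rng.standard_normal((500, 5))                  # ignores the dependence structure
print(c2st_auc(X, X_bad))                              # expect an AUC clearly above 0.5
```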
Code is available at https://github.com/mathematica-mpr/bcf-1."}, "https://arxiv.org/abs/2407.06390": {"title": "JANET: Joint Adaptive predictioN-region Estimation for Time-series", "link": "https://arxiv.org/abs/2407.06390", "description": "arXiv:2407.06390v1 Announce Type: cross \nAbstract: Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (Joint Adaptive predictioN-region Estimation for Time-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlled K-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET's superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data."}, "https://arxiv.org/abs/2407.06533": {"title": "LETS-C: Leveraging Language Embedding for Time Series Classification", "link": "https://arxiv.org/abs/2407.06533", "description": "arXiv:2407.06533v1 Announce Type: cross \nAbstract: Recent advancements in language modeling have shown promising results when applied to time series data. In particular, fine-tuning pre-trained large language models (LLMs) for time series classification tasks has achieved state-of-the-art (SOTA) performance on standard benchmarks. However, these LLM-based models have a significant drawback due to the large model size, with the number of trainable parameters in the millions. In this paper, we propose an alternative approach to leveraging the success of language modeling in the time series domain. Instead of fine-tuning LLMs, we utilize a language embedding model to embed time series and then pair the embeddings with a simple classification head composed of convolutional neural networks (CNN) and multilayer perceptron (MLP). We conducted extensive experiments on well-established time series classification benchmark datasets. We demonstrated LETS-C not only outperforms the current SOTA in classification accuracy but also offers a lightweight solution, using only 14.5% of the trainable parameters on average compared to the SOTA model. Our findings suggest that leveraging language encoders to embed time series data, combined with a simple yet effective classification head, offers a promising direction for achieving high-performance time series classification while maintaining a lightweight model architecture."}, "https://arxiv.org/abs/2407.06875": {"title": "Extending the blended generalized extreme value distribution", "link": "https://arxiv.org/abs/2407.06875", "description": "arXiv:2407.06875v1 Announce Type: cross \nAbstract: The generalized extreme value (GEV) distribution is commonly employed to help estimate the likelihood of extreme events in many geophysical and other application areas. The recently proposed blended generalized extreme value (bGEV) distribution modifies the GEV with positive shape parameter to avoid a hard lower bound that complicates fitting and inference. 
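For the JANET entry above, the inductive (split) conformal framework it generalises can be sketched in a few lines for a single prediction; this baseline assumes exchangeability and ignores the time-series and multi-step issues that JANET is designed to handle, and all data and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def split_conformal_interval(X_train, y_train, X_cal, y_cal, x_new, alpha=0.1):
    # Split conformal prediction: fit on a training split, take a quantile of the
    # absolute calibration residuals, and return a symmetric prediction interval.
    model = Ridge().fit(X_train, y_train)
    residuals = np.abs(y_cal - model.predict(X_cal))
    n = len(residuals)
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model.predict(x_new.reshape(1, -1))[0]
    return pred - q, pred + q

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=400)
print(split_conformal_interval(X[:200], y[:200], X[200:], y[200:], X[0]))
```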
Here, the bGEV is extended to the GEV with negative shape parameter, avoiding a hard upper bound that is unrealistic in many applications. This extended bGEV is shown to improve on the GEV for forecasting future heat extremes based on past data. Software implementing this bGEV and applying it to the example temperature data is provided."}, "https://arxiv.org/abs/2407.07018": {"title": "End-To-End Causal Effect Estimation from Unstructured Natural Language Data", "link": "https://arxiv.org/abs/2407.07018", "description": "arXiv:2407.07018v1 Announce Type: cross \nAbstract: Knowing the effect of an intervention is critical for human decision-making, but current approaches for causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and time-to-completion for studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six (two synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource."}, "https://arxiv.org/abs/2407.07072": {"title": "Assumption Smuggling in Intermediate Outcome Tests of Causal Mechanisms", "link": "https://arxiv.org/abs/2407.07072", "description": "arXiv:2407.07072v1 Announce Type: cross \nAbstract: Political scientists are increasingly attuned to the promises and pitfalls of establishing causal effects. But the vital question for many is not if a causal effect exists but why and how it exists. Even so, many researchers avoid causal mediation analyses due to the assumptions required, instead opting to explore causal mechanisms through what we call intermediate outcome tests. These tests use the same research design used to estimate the effect of treatment on the outcome to estimate the effect of the treatment on one or more mediators, with authors often concluding that evidence of the latter is evidence of a causal mechanism. We show in this paper that, without further assumptions, this can neither establish nor rule out the existence of a causal mechanism. Instead, such conclusions about the indirect effect of treatment rely on implicit and usually very strong assumptions that are often unmet. 
Thus, such causal mechanism tests, though very common in political science, should not be viewed as a free lunch but rather should be used judiciously, and researchers should explicitly state and defend the requisite assumptions."}, "https://arxiv.org/abs/2107.07942": {"title": "Flexible Covariate Adjustments in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2107.07942", "description": "arXiv:2107.07942v3 Announce Type: replace \nAbstract: Empirical regression discontinuity (RD) studies often use covariates to increase the precision of their estimates. In this paper, we propose a novel class of estimators that use such covariate information more efficiently than existing methods and can accommodate many covariates. It involves running a standard RD analysis in which a function of the covariates has been subtracted from the original outcome variable. We characterize the function that leads to the estimator with the smallest asymptotic variance, and consider feasible versions of such estimators in which this function is estimated, for example, through modern machine learning techniques."}, "https://arxiv.org/abs/2112.13651": {"title": "Factor modelling for high-dimensional functional time series", "link": "https://arxiv.org/abs/2112.13651", "description": "arXiv:2112.13651v3 Announce Type: replace \nAbstract: Many economic and scientific problems involve the analysis of high-dimensional functional time series, where the number of functional variables $p$ diverges as the number of serially dependent observations $n$ increases. In this paper, we present a novel functional factor model for high-dimensional functional time series that maintains and makes use of the functional and dynamic structure to achieve great dimension reduction and find the latent factor structure. To estimate the number of functional factors and the factor loadings, we propose a fully functional estimation procedure based on an eigenanalysis for a nonnegative definite and symmetric matrix. Our proposal involves a weight matrix to improve the estimation efficiency and tackle the issue of heterogeneity, the rationale of which is illustrated by formulating the estimation from a novel regression perspective. Asymptotic properties of the proposed method are studied when $p$ diverges at some polynomial rate as $n$ increases. To provide a parsimonious model and enhance interpretability for near-zero factor loadings, we impose sparsity assumptions on the factor loading space and then develop a regularized estimation procedure with theoretical guarantees when $p$ grows exponentially fast relative to $n.$ Finally, we demonstrate that our proposed estimators significantly outperform the competing methods through both simulations and applications to a U.K. temperature data set and a Japanese mortality data set."}, "https://arxiv.org/abs/2303.11721": {"title": "Using Forests in Multivariate Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2303.11721", "description": "arXiv:2303.11721v2 Announce Type: replace \nAbstract: We discuss estimation and inference of conditional treatment effects in regression discontinuity designs with multiple scores. 
Aside from the commonly used local linear regression approach and a minimax-optimal estimator recently proposed by Imbens and Wager (2019), we consider two estimators based on random forests -- honest regression forests and local linear forests -- whose construction resembles that of standard local regressions, with theoretical validity following from results in Wager and Athey (2018) and Friedberg et al. (2020). We design a systematic Monte Carlo study with data generating processes built both from functional forms that we specify and from Wasserstein Generative Adversarial Networks that can closely mimic the observed data. We find that no single estimator dominates across all simulations: (i) local linear regressions perform well in univariate settings, but can undercover when multivariate scores are transformed into a univariate score -- which is commonly done in practice -- possibly due to the \"zero-density\" issue of the collapsed univariate score at the transformed cutoff; (ii) good performance of the minimax-optimal estimator depends on accurate estimation of a nuisance parameter and its current implementation only accepts up to two scores; (iii) forest-based estimators are not designed for estimation at boundary points and can suffer from bias in finite sample, but their flexibility in modeling multivariate scores opens the door to a wide range of empirical applications in multivariate regression discontinuity designs."}, "https://arxiv.org/abs/2307.05818": {"title": "What Does it Take to Control Global Temperatures? A toolbox for testing and estimating the impact of economic policies on climate", "link": "https://arxiv.org/abs/2307.05818", "description": "arXiv:2307.05818v2 Announce Type: replace \nAbstract: This paper tests the feasibility and estimates the cost of climate control through economic policies. It provides a toolbox for a statistical historical assessment of a Stochastic Integrated Model of Climate and the Economy, and its use in (possibly counterfactual) policy analysis. Recognizing that stabilization requires supressing a trend, we use an integrated-cointegrated Vector Autoregressive Model estimated using a newly compiled dataset ranging between years A.D. 1000-2008, extending previous results on Control Theory in nonstationary systems. We test statistically whether, and quantify to what extent, carbon abatement policies can effectively stabilize or reduce global temperatures. Our formal test of policy feasibility shows that carbon abatement can have a significant long run impact and policies can render temperatures stationary around a chosen long run mean. In a counterfactual empirical illustration of the possibilities of our modeling strategy, we study a retrospective policy aiming to keep global temperatures close to their 1900 historical level. Achieving this via carbon abatement may cost about 75% of the observed 2008 level of world GDP, a cost equivalent to reverting to levels of output historically observed in the mid 1960s. 
By contrast, investment in carbon neutral technology could achieve the policy objective and be self-sustainable as long as it costs less than 50% of 2008 global GDP and 75% of consumption."}, "https://arxiv.org/abs/2310.12285": {"title": "Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm", "link": "https://arxiv.org/abs/2310.12285", "description": "arXiv:2310.12285v2 Announce Type: replace \nAbstract: High-dimensional longitudinal data is increasingly used in a wide range of scientific studies. To properly account for dependence between longitudinal observations, statistical methods for high-dimensional linear mixed models (LMMs) have been developed. However, few packages implementing these high-dimensional LMMs are available in the statistical software R. Additionally, some packages suffer from scalability issues. This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators of hyperparameters for increased flexibility and an Expectation-Conditional-Minimization (ECM) algorithm for computationally efficient maximum a posteriori probability (MAP) estimation of parameters. The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation. We illustrate Linear Mixed Modeling with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies evaluating fixed and random effects estimation along with computation time. A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time. Supplementary materials are available online."}, "https://arxiv.org/abs/2401.07259": {"title": "Inference for bivariate extremes via a semi-parametric angular-radial model", "link": "https://arxiv.org/abs/2401.07259", "description": "arXiv:2401.07259v2 Announce Type: replace \nAbstract: The modelling of multivariate extreme events is important in a wide variety of applications, including flood risk analysis, metocean engineering and financial modelling. A wide variety of statistical techniques have been proposed in the literature; however, many such methods are limited in the forms of dependence they can capture, or make strong parametric assumptions about data structures. In this article, we introduce a novel inference framework for multivariate extremes based on a semi-parametric angular-radial model. This model overcomes the limitations of many existing approaches and provides a unified paradigm for assessing joint tail behaviour. Alongside inferential tools, we also introduce techniques for assessing uncertainty and goodness of fit. Our proposed technique is tested on simulated data sets alongside observed metocean time series, with results indicating generally good performance."}, "https://arxiv.org/abs/2308.12227": {"title": "Semiparametric Modeling and Analysis for Longitudinal Network Data", "link": "https://arxiv.org/abs/2308.12227", "description": "arXiv:2308.12227v2 Announce Type: replace-cross \nAbstract: We introduce a semiparametric latent space model for analyzing longitudinal network data. The model consists of a static latent space component and a time-varying node-specific baseline component. We develop a semiparametric efficient score equation for the latent space parameter by adjusting for the baseline nuisance component.
Estimation is accomplished through a one-step update estimator and an appropriately penalized maximum likelihood estimator. We derive oracle error bounds for the two estimators and address identifiability concerns from a quotient manifold perspective. Our approach is demonstrated using the New York Citi Bike Dataset."}, "https://arxiv.org/abs/2407.07217": {"title": "The Hidden Subsidy of the Affordable Care Act", "link": "https://arxiv.org/abs/2407.07217", "description": "arXiv:2407.07217v1 Announce Type: new \nAbstract: Under the ACA, the federal government paid a substantially larger share of medical costs of newly eligible Medicaid enrollees than previously eligible ones. States could save up to 100% of their per-enrollee costs by reclassifying original enrollees into the newly eligible group. We examine whether this fiscal incentive changed states' enrollment practices. We find that Medicaid expansion caused large declines in the number of beneficiaries enrolled in the original Medicaid population, suggesting widespread reclassifications. In 2019 alone, this phenomenon affected 4.4 million Medicaid enrollees at a federal cost of $8.3 billion. Our results imply that reclassifications inflated the federal cost of Medicaid expansion by 18.2%."}, "https://arxiv.org/abs/2407.07251": {"title": "R", "link": "https://arxiv.org/abs/2407.07251", "description": "arXiv:2407.07251v1 Announce Type: new \nAbstract: This note provides a conceptual clarification of Ronald Aylmer Fisher's (1935) pioneering exact test in the context of the Lady Testing Tea experiment. It unveils a critical implicit assumption in Fisher's calibration: the taster minimizes expected misclassification given fixed probabilistic information. Without similar assumptions or an explicit alternative hypothesis, the rationale behind Fisher's specification of the rejection region remains unclear."}, "https://arxiv.org/abs/2407.07637": {"title": "Function-valued marked spatial point processes on linear networks: application to urban cycling profiles", "link": "https://arxiv.org/abs/2407.07637", "description": "arXiv:2407.07637v1 Announce Type: new \nAbstract: In the literature on spatial point processes, there is an emerging challenge in studying marked point processes with points being labelled by functions. In this paper, we focus on point processes living on linear networks and, from distinct points of view, propose several marked summary characteristics that are of great use in studying the average association and dispersion of the function-valued marks. Through a simulation study, we evaluate the performance of our proposed marked summary characteristics, both when marks are independent and when some sort of spatial dependence is evident among them. Finally, we employ our proposed mark summary characteristics to study the spatial structure of urban cycling profiles in Vancouver, Canada."}, "https://arxiv.org/abs/2407.07647": {"title": "Two Stage Least Squares with Time-Varying Instruments: An Application to an Evaluation of Treatment Intensification for Type-2 Diabetes", "link": "https://arxiv.org/abs/2407.07647", "description": "arXiv:2407.07647v1 Announce Type: new \nAbstract: As longitudinal data becomes more available in many settings, policy makers are increasingly interested in the effect of time-varying treatments (e.g. sustained treatment strategies). In settings such as this, the preferred analysis techniques are the g-methods, however these require the untestable assumption of no unmeasured confounding. 
Instrumental variable analyses can minimise bias through unmeasured confounding. Of these methods, the Two Stage Least Squares technique is one of the most well used in Econometrics, but it has not been fully extended, and evaluated, in full time-varying settings. This paper proposes a robust two stage least squares method for the econometric evaluation of time-varying treatment. Using a simulation study we found that, unlike standard two stage least squares, it performs relatively well across a wide range of circumstances, including model misspecification. It compares well with recent time-varying instrument approaches via g-estimation. We illustrate the methods in an evaluation of treatment intensification for Type-2 Diabetes Mellitus, exploring the exogeneity in prescribing preferences to operationalise a time-varying instrument."}, "https://arxiv.org/abs/2407.07717": {"title": "High-dimensional Covariance Estimation by Pairwise Likelihood Truncation", "link": "https://arxiv.org/abs/2407.07717", "description": "arXiv:2407.07717v1 Announce Type: new \nAbstract: Pairwise likelihood offers a useful approximation to the full likelihood function for covariance estimation in high-dimensional context. It simplifies high-dimensional dependencies by combining marginal bivariate likelihood objects, thereby making estimation more manageable. In certain models, including the Gaussian model, both pairwise and full likelihoods are known to be maximized by the same parameter values, thus retaining optimal statistical efficiency, when the number of variables is fixed. Leveraging this insight, we introduce the estimation of sparse high-dimensional covariance matrices by maximizing a truncated version of the pairwise likelihood function, which focuses on pairwise terms corresponding to nonzero covariance elements. To achieve a meaningful truncation, we propose to minimize the discrepancy between pairwise and full likelihood scores plus an L1-penalty discouraging the inclusion of uninformative terms. Differently from other regularization approaches, our method selects whole pairwise likelihood objects rather than individual covariance parameters, thus retaining the inherent unbiasedness of the pairwise likelihood estimating equations. This selection procedure is shown to have the selection consistency property as the covariance dimension increases exponentially fast. As a result, the implied pairwise likelihood estimator is consistent and converges to the oracle maximum likelihood estimator that assumes knowledge of nonzero covariance entries."}, "https://arxiv.org/abs/2407.07809": {"title": "Direct estimation and inference of higher-level correlations from lower-level measurements with applications in gene-pathway and proteomics studies", "link": "https://arxiv.org/abs/2407.07809", "description": "arXiv:2407.07809v1 Announce Type: new \nAbstract: This paper tackles the challenge of estimating correlations between higher-level biological variables (e.g., proteins and gene pathways) when only lower-level measurements are directly observed (e.g., peptides and individual genes). Existing methods typically aggregate lower-level data into higher-level variables and then estimate correlations based on the aggregated data. However, different data aggregation methods can yield varying correlation estimates as they target different higher-level quantities. Our solution is a latent factor model that directly estimates these higher-level correlations from lower-level data without the need for data aggregation. 
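The Two Stage Least Squares entry above builds on the textbook cross-sectional estimator, which the short sketch below reproduces with one instrument and one endogenous treatment; it is not the time-varying extension proposed in that paper, and the simulated data and names are purely illustrative.

```python
import numpy as np

def two_stage_least_squares(y, treatment, instrument):
    # Stage 1: regress the treatment on the instrument (plus intercept).
    Z = np.column_stack([np.ones_like(instrument), instrument])
    stage1, *_ = np.linalg.lstsq(Z, treatment, rcond=None)
    fitted = Z @ stage1
    # Stage 2: regress the outcome on the fitted (exogenous) part of the treatment.
    X = np.column_stack([np.ones_like(fitted), fitted])
    stage2, *_ = np.linalg.lstsq(X, y, rcond=None)
    return stage2[1]  # causal effect estimate

rng = np.random.default_rng(5)
n = 5000
u = rng.normal(size=n)                  # unmeasured confounder
z = rng.normal(size=n)                  # instrument, e.g. prescribing preference
d = 0.8 * z + u + rng.normal(size=n)    # treatment, confounded by u
y = 2.0 * d + u + rng.normal(size=n)    # outcome; true effect is 2
print(two_stage_least_squares(y, d, z)) # close to 2, unlike naive OLS
```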
We further introduce a shrinkage estimator to ensure the positive definiteness and improve the accuracy of the estimated correlation matrix. Furthermore, we establish the asymptotic normality of our estimator, enabling efficient computation of p-values for the identification of significant correlations. The effectiveness of our approach is demonstrated through comprehensive simulations and the analysis of proteomics and gene expression datasets. We develop the R package highcor for implementing our method."}, "https://arxiv.org/abs/2407.07134": {"title": "Calibrating satellite maps with field data for improved predictions of forest biomass", "link": "https://arxiv.org/abs/2407.07134", "description": "arXiv:2407.07134v1 Announce Type: cross \nAbstract: Spatially explicit quantification of forest biomass is important for forest-health monitoring and carbon accounting. Direct field measurements of biomass are laborious and expensive, typically limiting their spatial and temporal sampling density and therefore the precision and resolution of the resulting inference. Satellites can provide biomass predictions at a far greater density, but these predictions are often biased relative to field measurements and exhibit heterogeneous errors. We developed and implemented a coregionalization model between sparse field measurements and a predictive satellite map to deliver improved predictions of biomass density at a 1-by-1 km resolution throughout the Pacific states of California, Oregon and Washington. The model accounts for zero-inflation in the field measurements and the heterogeneous errors in the satellite predictions. A stochastic partial differential equation approach to spatial modeling is applied to handle the magnitude of the satellite data. The spatial detail rendered by the model is much finer than would be possible with the field measurements alone, and the model provides substantial noise-filtering and bias-correction to the satellite map."}, "https://arxiv.org/abs/2407.07297": {"title": "Geometric quantile-based measures of multivariate distributional characteristics", "link": "https://arxiv.org/abs/2407.07297", "description": "arXiv:2407.07297v1 Announce Type: cross \nAbstract: Several new geometric quantile-based measures for multivariate dispersion, skewness, kurtosis, and spherical asymmetry are defined. These measures differ from existing measures, which use volumes and are easy to calculate. Some theoretical justification is given, followed by experiments illustrating that they are reasonable measures of these distributional characteristics and computing confidence regions with the desired coverage."}, "https://arxiv.org/abs/2407.07338": {"title": "Towards Complete Causal Explanation with Expert Knowledge", "link": "https://arxiv.org/abs/2407.07338", "description": "arXiv:2407.07338v1 Announce Type: cross \nAbstract: We study the problem of restricting Markov equivalence classes of maximal ancestral graphs (MAGs) containing certain edge marks, which we refer to as expert knowledge. MAGs forming a Markov equivalence class can be uniquely represented by an essential ancestral graph. We seek to learn the restriction of the essential ancestral graph containing the proposed expert knowledge. Our contributions are several-fold. First, we prove certain properties for the entire Markov equivalence class including a conjecture from Ali et al. (2009). 
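The higher-level correlation entry above mentions a shrinkage estimator used to guarantee positive definiteness; the generic device below (linear shrinkage of an estimated correlation matrix toward the identity) illustrates that idea only and is not the specific estimator developed in that paper or in the highcor package.

```python
import numpy as np

def shrink_to_pd(corr_hat, eps=1e-6):
    # Shrink an estimated correlation matrix toward the identity, using the smallest
    # weight that pushes every eigenvalue up to at least eps (the diagonal stays 1).
    corr_hat = (corr_hat + corr_hat.T) / 2.0
    lam_min = np.linalg.eigvalsh(corr_hat).min()
    if lam_min >= eps:
        return corr_hat
    w = (eps - lam_min) / (1.0 - lam_min)
    return (1.0 - w) * corr_hat + w * np.eye(len(corr_hat))

bad = np.array([[1.0, 0.9, 0.9],
                [0.9, 1.0, -0.9],
                [0.9, -0.9, 1.0]])            # eigenvalues 1.9, 1.9, -0.8: not PD
print(np.linalg.eigvalsh(shrink_to_pd(bad)))  # all eigenvalues now >= eps
```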
Second, we present three sound graphical orientation rules, two of which generalize previously known rules, for adding expert knowledge to an essential graph. We also show that some orientation rules of Zhang (2008) are not needed for restricting the Markov equivalence class with expert knowledge. We provide an algorithm for including this expert knowledge and show that our algorithm is complete in certain settings, i.e., in these settings, the output of our algorithm is a restricted essential ancestral graph. We conjecture this algorithm is complete generally. Outside of our specified settings, we provide an algorithm for checking whether a graph is a restricted essential graph and discuss its runtime. This work can be seen as a generalization of Meek (1995)."}, "https://arxiv.org/abs/2407.07596": {"title": "Learning treatment effects while treating those in need", "link": "https://arxiv.org/abs/2407.07596", "description": "arXiv:2407.07596v1 Announce Type: cross \nAbstract: Many social programs attempt to allocate scarce resources to people with the greatest need. Indeed, public services increasingly use algorithmic risk assessments motivated by this goal. However, targeting the highest-need recipients often conflicts with attempting to evaluate the causal effect of the program as a whole, as the best evaluations would be obtained by randomizing the allocation. We propose a framework to design randomized allocation rules which optimally balance targeting high-need individuals with learning treatment effects, presenting policymakers with a Pareto frontier between the two goals. We give sample complexity guarantees for the policy learning problem and provide a computationally efficient strategy to implement it. We then apply our framework to data from human services in Allegheny County, Pennsylvania. Optimized policies can substantially mitigate the tradeoff between learning and targeting. For example, it is often possible to obtain 90% of the optimal utility in targeting high-need individuals while ensuring that the average treatment effect can be estimated with less than 2 times the samples that a randomized controlled trial would require. Mechanisms for targeting public services often focus on measuring need as accurately as possible. However, our results suggest that algorithmic systems in public services can be most impactful if they incorporate program evaluation as an explicit goal alongside targeting."}, "https://arxiv.org/abs/2301.01480": {"title": "A new over-dispersed count model", "link": "https://arxiv.org/abs/2301.01480", "description": "arXiv:2301.01480v3 Announce Type: replace \nAbstract: A new two-parameter discrete distribution, namely the PoiG distribution, is derived by the convolution of a Poisson variate and an independently distributed geometric random variable. This distribution generalizes both the Poisson and geometric distributions and can be used for modelling over-dispersed as well as equi-dispersed count data. A number of important statistical properties of the proposed count model are derived, such as the probability generating function, the moment generating function, the moments, the survival function and the hazard rate function. Monotonic properties, such as the log concavity and the stochastic ordering, are also investigated in detail. Method of moments and the maximum likelihood estimators of the parameters of the proposed model are presented.
It is envisaged that the proposed distribution may prove to be useful for practitioners for modelling over-dispersed count data compared to its closest competitors."}, "https://arxiv.org/abs/2302.10367": {"title": "JOINTVIP: Prioritizing variables in observational study design with joint variable importance plot in R", "link": "https://arxiv.org/abs/2302.10367", "description": "arXiv:2302.10367v4 Announce Type: replace \nAbstract: Credible causal effect estimation requires treated subjects and controls to be otherwise similar. In observational settings, such as analysis of electronic health records, this is not guaranteed. Investigators must balance background variables so they are similar in treated and control groups. Common approaches include matching (grouping individuals into small homogeneous sets) or weighting (upweighting or downweighting individuals) to create similar profiles. However, creating identical distributions may be impossible if many variables are measured, and not all variables are of equal importance to the outcome. The joint variable importance plot (jointVIP) package guides decisions about which variables to prioritize for adjustment by quantifying and visualizing each variable's relationship to both treatment and outcome."}, "https://arxiv.org/abs/2307.13364": {"title": "Testing for sparse idiosyncratic components in factor-augmented regression models", "link": "https://arxiv.org/abs/2307.13364", "description": "arXiv:2307.13364v4 Announce Type: replace \nAbstract: We propose a novel bootstrap test of a dense model, namely factor regression, against a sparse plus dense alternative augmenting model with sparse idiosyncratic components. The asymptotic properties of the test are established under time series dependence and polynomial tails. We outline a data-driven rule to select the tuning parameter and prove its theoretical validity. In simulation experiments, our procedure exhibits high power against sparse alternatives and low power against dense deviations from the null. Moreover, we apply our test to various datasets in macroeconomics and finance and often reject the null. This suggests the presence of sparsity -- on top of a dense model -- in commonly studied economic applications. The R package FAS implements our approach."}, "https://arxiv.org/abs/2401.02917": {"title": "Bayesian changepoint detection via logistic regression and the topological analysis of image series", "link": "https://arxiv.org/abs/2401.02917", "description": "arXiv:2401.02917v2 Announce Type: replace \nAbstract: We present a Bayesian method for multivariate changepoint detection that allows for simultaneous inference on the location of a changepoint and the coefficients of a logistic regression model for distinguishing pre-changepoint data from post-changepoint data. In contrast to many methods for multivariate changepoint detection, the proposed method is applicable to data of mixed type and avoids strict assumptions regarding the distribution of the data and the nature of the change. The regression coefficients provide an interpretable description of a potentially complex change. For posterior inference, the model admits a simple Gibbs sampling algorithm based on P\'olya-gamma data augmentation. We establish conditions under which the proposed method is guaranteed to recover the true underlying changepoint. As a testing ground for our method, we consider the problem of detecting topological changes in time series of images.
We demonstrate that our proposed method $\\mathtt{bclr}$, combined with a topological feature embedding, performs well on both simulated and real image data. The method also successfully recovers the location and nature of changes in more traditional changepoint tasks."}, "https://arxiv.org/abs/2407.07933": {"title": "Identification and Estimation of the Bi-Directional MR with Some Invalid Instruments", "link": "https://arxiv.org/abs/2407.07933", "description": "arXiv:2407.07933v1 Announce Type: new \nAbstract: We consider the challenging problem of estimating causal effects from purely observational data in the bi-directional Mendelian randomization (MR), where some invalid instruments, as well as unmeasured confounding, usually exist. To address this problem, most existing methods attempt to find proper valid instrumental variables (IVs) for the target causal effect by expert knowledge or by assuming that the causal model is a one-directional MR model. As such, in this paper, we first theoretically investigate the identification of the bi-directional MR from observational data. In particular, we provide necessary and sufficient conditions under which valid IV sets are correctly identified such that the bi-directional MR model is identifiable, including the causal directions of a pair of phenotypes (i.e., the treatment and outcome). Moreover, based on the identification theory, we develop a cluster fusion-like method to discover valid IV sets and estimate the causal effects of interest. We theoretically demonstrate the correctness of the proposed algorithm. Experimental results show the effectiveness of our method for estimating causal effects in bi-directional MR."}, "https://arxiv.org/abs/2407.07934": {"title": "Identifying macro conditional independencies and macro total effects in summary causal graphs with latent confounding", "link": "https://arxiv.org/abs/2407.07934", "description": "arXiv:2407.07934v1 Announce Type: new \nAbstract: Understanding causal relationships in dynamic systems is essential for numerous scientific fields, including epidemiology, economics, and biology. While causal inference methods have been extensively studied, they often rely on fully specified causal graphs, which may not always be available or practical in complex dynamic systems. Partially specified causal graphs, such as summary causal graphs (SCGs), provide a simplified representation of causal relationships, omitting temporal information and focusing on high-level causal structures. This simplification introduces new challenges concerning the types of queries of interest: macro queries, which involve relationships between clusters represented as vertices in the graph, and micro queries, which pertain to relationships between variables that are not directly visible through the vertices of the graph. In this paper, we first clearly distinguish between macro conditional independencies and micro conditional independencies and between macro total effects and micro total effects. Then, we demonstrate the soundness and completeness of the d-separation to identify macro conditional independencies in SCGs. Furthermore, we establish that the do-calculus is sound and complete for identifying macro total effects in SCGs. 
Conversely, we also show through various examples that these results do not hold when considering micro conditional independencies and micro total effects."}, "https://arxiv.org/abs/2407.07973": {"title": "Reduced-Rank Matrix Autoregressive Models: A Medium $N$ Approach", "link": "https://arxiv.org/abs/2407.07973", "description": "arXiv:2407.07973v1 Announce Type: new \nAbstract: Reduced-rank regressions are powerful tools used to identify co-movements within economic time series. However, this task becomes challenging when we observe matrix-valued time series, where each dimension may have a different co-movement structure. We propose reduced-rank regressions with a tensor structure for the coefficient matrix to provide new insights into co-movements within and between the dimensions of matrix-valued time series. Moreover, we relate the co-movement structures to two commonly used reduced-rank models, namely the serial correlation common feature and the index model. Two empirical applications involving U.S.\\ states and economic indicators for the Eurozone and North American countries illustrate how our new tools identify co-movements."}, "https://arxiv.org/abs/2407.07988": {"title": "Production function estimation using subjective expectations data", "link": "https://arxiv.org/abs/2407.07988", "description": "arXiv:2407.07988v1 Announce Type: new \nAbstract: Standard methods for estimating production functions in the Olley and Pakes (1996) tradition require assumptions on input choices. We introduce a new method that exploits (increasingly available) data on a firm's expectations of its future output and inputs that allows us to obtain consistent production function parameter estimates while relaxing these input demand assumptions. In contrast to dynamic panel methods, our proposed estimator can be implemented on very short panels (including a single cross-section), and Monte Carlo simulations show it outperforms alternative estimators when firms' material input choices are subject to optimization error. Implementing a range of production function estimators on UK data, we find our proposed estimator yields results that are either similar to or more credible than commonly-used alternatives. These differences are larger in industries where material inputs appear harder to optimize. We show that TFP implied by our proposed estimator is more strongly associated with future jobs growth than existing methods, suggesting that failing to adequately account for input endogeneity may underestimate the degree of dynamic reallocation in the economy."}, "https://arxiv.org/abs/2407.08140": {"title": "Variational Bayes for Mixture of Gaussian Structural Equation Models", "link": "https://arxiv.org/abs/2407.08140", "description": "arXiv:2407.08140v1 Announce Type: new \nAbstract: Structural equation models (SEMs) are commonly used to study the structural relationship between observed variables and latent constructs. Recently, Bayesian fitting procedures for SEMs have received more attention thanks to their potential to facilitate the adoption of more flexible model structures, and variational approximations have been shown to provide fast and accurate inference for Bayesian analysis of SEMs. However, the application of variational approximations is currently limited to very simple, elemental SEMs. We develop mean-field variational Bayes algorithms for two SEM formulations for data that present non-Gaussian features such as skewness and multimodality. 
The proposed models exploit the use of mixtures of Gaussians, include covariates for the analysis of latent traits and consider missing data. We also examine two variational information criteria for model selection that are straightforward to compute in our variational inference framework. The performance of the MFVB algorithms and information criteria is investigated in a simulated data study and a real data application."}, "https://arxiv.org/abs/2407.08228": {"title": "Wasserstein $k$-Centres Clustering for Distributional Data", "link": "https://arxiv.org/abs/2407.08228", "description": "arXiv:2407.08228v1 Announce Type: new \nAbstract: We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes the mode of variation of data because the space of probability distributions lacks a vector space structure, preventing the application of existing methods for functional data. In this study, we propose a novel clustering method for distributional data on the real line, which takes account of differences in both the mean and mode of variation structures of clusters, in the spirit of the $k$-centres clustering approach proposed for functional data. Specifically, we consider the space of distributions equipped with the Wasserstein metric and define a geodesic mode of variation of distributional data using geodesic principal component analysis. Then, we utilize the geodesic mode of each cluster to predict the cluster membership of each distribution. We theoretically show the validity of the proposed clustering criterion by studying the probability of correct membership. Through a simulation study and real data application, we demonstrate that the proposed distributional clustering method can improve cluster quality compared to conventional clustering algorithms."}, "https://arxiv.org/abs/2407.08278": {"title": "Structuring, Sequencing, Staging, Selecting: the 4S method for the longitudinal analysis of multidimensional measurement scales in chronic diseases", "link": "https://arxiv.org/abs/2407.08278", "description": "arXiv:2407.08278v1 Announce Type: new \nAbstract: In clinical studies, measurement scales are often collected to report disease-related manifestations from clinician or patient perspectives. Their analysis can help identify relevant manifestations throughout the disease course, enhancing knowledge of disease progression and guiding clinicians in providing appropriate support. However, the analysis of measurement scales in health studies is not straightforward, as they consist of repeated, ordinal, and potentially multidimensional item data. Their sum-score summaries may considerably reduce information and impede interpretation, their change over time occurs along clinical progression, and, as with many other longitudinal processes, their observation may be truncated by events. This work establishes a comprehensive strategy in four consecutive steps to leverage repeated data from multidimensional measurement scales.
The 4S method successively (1) structures the scale into subdimensions satisfying three calibration assumptions (unidimensionality, conditional independence, increasing monotonicity), (2) describes each subdimension's progression using a joint latent process model which includes a continuous-time item response theory model for the longitudinal subpart, (3) aligns each subdimension's progression with disease stages through a projection approach, and (4) identifies the most informative items across disease stages using the Fisher information. The method is comprehensively illustrated in multiple system atrophy (MSA), an alpha-synucleinopathy, with the analysis of daily activity and motor impairments over disease progression. The 4S method provides an effective and complete analytical strategy for any measurement scale repeatedly collected in health studies."}, "https://arxiv.org/abs/2407.08317": {"title": "Inference procedures in sequential trial emulation with survival outcomes: comparing confidence intervals based on the sandwich variance estimator, bootstrap and jackknife", "link": "https://arxiv.org/abs/2407.08317", "description": "arXiv:2407.08317v1 Announce Type: new \nAbstract: Sequential trial emulation (STE) is an approach to estimating causal treatment effects by emulating a sequence of target trials from observational data. In STE, inverse probability weighting is commonly utilised to address time-varying confounding and/or dependent censoring. Then structural models for potential outcomes are applied to the weighted data to estimate treatment effects. For inference, the simple sandwich variance estimator is popular but conservative, while nonparametric bootstrap is computationally expensive, and a more efficient alternative, linearised estimating function (LEF) bootstrap, has not been adapted to STE. We evaluated the performance of various methods for constructing confidence intervals (CIs) of marginal risk differences in STE with survival outcomes by comparing the coverage of CIs based on nonparametric/LEF bootstrap, jackknife, and the sandwich variance estimator through simulations. LEF bootstrap CIs demonstrated the best coverage with small/moderate sample sizes, low event rates and low treatment prevalence, which were the motivating scenarios for STE. They were less affected by treatment group imbalance and faster to compute than nonparametric bootstrap CIs. With large sample sizes and medium/high event rates, the sandwich-variance-estimator-based CIs had the best coverage and were the fastest to compute. These findings offer guidance in constructing CIs in causal survival analysis using STE."}, "https://arxiv.org/abs/2407.08382": {"title": "Adjusting for Participation Bias in Case-Control Genetic Association Studies for Rare Diseases", "link": "https://arxiv.org/abs/2407.08382", "description": "arXiv:2407.08382v1 Announce Type: new \nAbstract: Collection of genotype data in case-control genetic association studies may often be incomplete for reasons related to genes themselves. This non-ignorable missingness structure, if not appropriately accounted for, can result in participation bias in association analyses. To deal with this issue, Chen et al. (2016) proposed to collect additional genetic information from family members of individuals whose genotype data were not available, and developed a maximum likelihood method for bias correction.
In this study, we develop an estimating equation approach to analyzing data collected from this design that allows adjustment for covariates. It jointly estimates odds ratio parameters for genetic association and missingness, where a logistic regression model is used to relate missingness to genotype and other covariates. Our method allows correlation between genotype and covariates while using genetic information from family members to provide information on the missing genotype data. In the estimating equation for genetic association parameters, we weight the contribution of each genotyped subject to the empirical likelihood score function by the inverse probability that the genotype data are available. We evaluate large and finite sample performance of our method via simulation studies and apply it to a family-based case-control study of breast cancer."}, "https://arxiv.org/abs/2407.08510": {"title": "Comparative analysis of Mixed-Data Sampling (MIDAS) model compared to Lag-Llama model for inflation nowcasting", "link": "https://arxiv.org/abs/2407.08510", "description": "arXiv:2407.08510v1 Announce Type: new \nAbstract: Inflation is one of the most important economic indicators closely watched by both public institutions and private agents. This study compares the performance of a traditional econometric model, Mixed Data Sampling regression, with one of the newest developments from the field of Artificial Intelligence, a foundational time series forecasting model based on a Long short-term memory neural network called Lag-Llama, in their ability to nowcast the Harmonized Index of Consumer Prices in the Euro area. The two models were compared to assess whether Lag-Llama can outperform the MIDAS regression, with the MIDAS regression evaluated under a best-case scenario using a dataset spanning from 2010 to 2022. The following metrics were used to evaluate the models: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), correlation with the target, R-squared and adjusted R-squared. The results show better performance of the pre-trained Lag-Llama across all metrics."}, "https://arxiv.org/abs/2407.08565": {"title": "Information matrix test for normality of innovations in stationary time series models", "link": "https://arxiv.org/abs/2407.08565", "description": "arXiv:2407.08565v1 Announce Type: new \nAbstract: This study focuses on the problem of testing for normality of innovations in stationary time series models. To achieve this, we introduce an information matrix (IM) based test. While the IM test was originally developed to test for model misspecification, our study shows that the test can also be used to test for the normality of innovations in various time series models. We provide sufficient conditions under which the limiting null distribution of the test statistics exists. As applications, a first-order threshold moving average model, a GARCH model and a double autoregressive model are considered. We conduct simulations to evaluate the performance of the proposed test and compare it with other tests, and provide a real data analysis."}, "https://arxiv.org/abs/2407.08599": {"title": "Goodness of fit of relational event models", "link": "https://arxiv.org/abs/2407.08599", "description": "arXiv:2407.08599v1 Announce Type: new \nAbstract: A type of dynamic network involves temporally ordered interactions between actors, where past network configurations may influence future ones.
The relational event model can be used to identify the underlying dynamics that drive interactions among system components. Despite the rapid development of this model over the past 15 years, an ongoing area of research revolves around evaluating the goodness of fit of this model, especially when it incorporates time-varying and random effects. Current methodologies often rely on comparing observed and simulated events using specific statistics, but this can be computationally intensive, and requires various assumptions.\n We propose an additive mixed-effect relational event model estimated via case-control sampling, and introduce a versatile framework for testing the goodness of fit of such models using weighted martingale residuals. Our focus is on a Kolmogorov-Smirnov type test designed to assess if covariates are accurately modeled. Our approach can be easily extended to evaluate whether other features of network dynamics have been appropriately incorporated into the model. We assess the goodness of fit of various relational event models using synthetic data to evaluate the test's power and coverage. Furthermore, we apply the method to a social study involving 57,791 emails sent by 159 employees of a Polish manufacturing company in 2010.\n The method is implemented in the R package mgcv."}, "https://arxiv.org/abs/2407.08602": {"title": "An Introduction to Causal Discovery", "link": "https://arxiv.org/abs/2407.08602", "description": "arXiv:2407.08602v1 Announce Type: new \nAbstract: In social sciences and economics, causal inference traditionally focuses on assessing the impact of predefined treatments (or interventions) on predefined outcomes, such as the effect of education programs on earnings. Causal discovery, in contrast, aims to uncover causal relationships among multiple variables in a data-driven manner, by investigating statistical associations rather than relying on predefined causal structures. This approach, more common in computer science, seeks to understand causality in an entire system of variables, which can be visualized by causal graphs. This survey provides an introduction to key concepts, algorithms, and applications of causal discovery from the perspectives of economics and social sciences. It covers fundamental concepts like d-separation, causal faithfulness, and Markov equivalence, sketches various algorithms for causal discovery, and discusses the back-door and front-door criteria for identifying causal effects. The survey concludes with more specific examples of causal discovery, e.g. for learning all variables that directly affect an outcome of interest and/or testing identification of causal effects in observational data."}, "https://arxiv.org/abs/2407.07996": {"title": "Gradual changes in functional time series", "link": "https://arxiv.org/abs/2407.07996", "description": "arXiv:2407.07996v1 Announce Type: cross \nAbstract: We consider the problem of detecting gradual changes in the sequence of mean functions from a not necessarily stationary functional time series. Our approach is based on the maximum deviation (calculated over a given time interval) between a benchmark function and the mean functions at different time points. We speak of a gradual change of size $\\Delta $, if this quantity exceeds a given threshold $\\Delta>0$. 
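As a small illustration of the max-deviation quantity defined in the gradual-change abstract above (arXiv:2407.07996), the Python sketch below evaluates the largest sup-norm distance between estimated mean curves and a benchmark curve on a common grid and reports the first time this deviation exceeds a threshold delta. The Gaussian-approximation calibration used by the paper's test is omitted, and all array names are illustrative.

# Illustrative only: max sup-norm deviation of mean curves from a benchmark,
# and the first time the deviation exceeds a threshold delta.
import numpy as np

def max_sup_deviation(curves, benchmark):
    """curves: (T, m) estimated mean functions on a common grid of m points;
    benchmark: (m,) benchmark function on the same grid."""
    return float(np.max(np.abs(curves - benchmark[None, :])))

def first_exceedance(curves, benchmark, delta):
    """Index of the first curve whose sup-norm distance to the benchmark
    exceeds delta, or None if no curve does."""
    dev = np.max(np.abs(curves - benchmark[None, :]), axis=1)
    hits = np.flatnonzero(dev > delta)
    return int(hits[0]) if hits.size else None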
For example, the benchmark function could represent an average of yearly temperature curves from the pre-industrial time, and we are interested in the question if the yearly temperature curves afterwards deviate from the pre-industrial average by more than $\\Delta =1.5$ degrees Celsius, where the deviations are measured with respect to the sup-norm. Using Gaussian approximations for high-dimensional data we develop a test for hypotheses of this type and estimators for the time where a deviation of size larger than $\\Delta$ appears for the first time. We prove the validity of our approach and illustrate the new methods by a simulation study and a data example, where we analyze yearly temperature curves at different stations in Australia."}, "https://arxiv.org/abs/2407.08271": {"title": "Gaussian process interpolation with conformal prediction: methods and comparative analysis", "link": "https://arxiv.org/abs/2407.08271", "description": "arXiv:2407.08271v1 Announce Type: cross \nAbstract: This article advocates the use of conformal prediction (CP) methods for Gaussian process (GP) interpolation to enhance the calibration of prediction intervals. We begin by illustrating that using a GP model with parameters selected by maximum likelihood often results in predictions that are not optimally calibrated. CP methods can adjust the prediction intervals, leading to better uncertainty quantification while maintaining the accuracy of the underlying GP model. We compare different CP variants and introduce a novel variant based on an asymmetric score. Our numerical experiments demonstrate the effectiveness of CP methods in improving calibration without compromising accuracy. This work aims to facilitate the adoption of CP methods in the GP community."}, "https://arxiv.org/abs/2407.08485": {"title": "Local logistic regression for dimension reduction in classification", "link": "https://arxiv.org/abs/2407.08485", "description": "arXiv:2407.08485v1 Announce Type: cross \nAbstract: Sufficient dimension reduction has received much interest over the past 30 years. Most existing approaches focus on statistical models linking the response to the covariate through a regression equation, and as such are not adapted to binary classification problems. We address the question of dimension reduction for binary classification by fitting a localized nearest-neighbor logistic model with $\\ell_1$-penalty in order to estimate the gradient of the conditional probability of interest. Our theoretical analysis shows that the pointwise convergence rate of the gradient estimator is optimal under very mild conditions. The dimension reduction subspace is estimated using an outer product of such gradient estimates at several points in the covariate space. Our implementation uses cross-validation on the misclassification rate to estimate the dimension of this subspace. We find that the proposed approach outperforms existing competitors in synthetic and real data applications."}, "https://arxiv.org/abs/2407.08560": {"title": "Causal inference through multi-stage learning and doubly robust deep neural networks", "link": "https://arxiv.org/abs/2407.08560", "description": "arXiv:2407.08560v1 Announce Type: cross \nAbstract: Deep neural networks (DNNs) have demonstrated remarkable empirical performance in large-scale supervised learning problems, particularly in scenarios where both the sample size $n$ and the dimension of covariates $p$ are large. 
This study delves into the application of DNNs across a wide spectrum of intricate causal inference tasks, where direct estimation falls short and necessitates multi-stage learning. Examples include estimating the conditional average treatment effect and dynamic treatment effect. In this framework, DNNs are constructed sequentially, with subsequent stages building upon preceding ones. To mitigate the impact of estimation errors from early stages on subsequent ones, we integrate DNNs in a doubly robust manner. In contrast to previous research, our study offers theoretical assurances regarding the effectiveness of DNNs in settings where the dimensionality $p$ expands with the sample size. These findings are significant independently and extend to degenerate single-stage learning problems."}, "https://arxiv.org/abs/2407.08718": {"title": "The exact non-Gaussian weak lensing likelihood: A framework to calculate analytic likelihoods for correlation functions on masked Gaussian random fields", "link": "https://arxiv.org/abs/2407.08718", "description": "arXiv:2407.08718v1 Announce Type: cross \nAbstract: We present exact non-Gaussian joint likelihoods for auto- and cross-correlation functions on arbitrarily masked spherical Gaussian random fields. Our considerations apply to spin-0 as well as spin-2 fields but are demonstrated here for the spin-2 weak-lensing correlation function.\n We motivate that this likelihood cannot be Gaussian and show how it can nevertheless be calculated exactly for any mask geometry and on a curved sky, as well as jointly for different angular-separation bins and redshift-bin combinations. Splitting our calculation into a large- and small-scale part, we apply a computationally efficient approximation for the small scales that does not alter the overall non-Gaussian likelihood shape.\n To compare our exact likelihoods to correlation-function sampling distributions, we simulated a large number of weak-lensing maps, including shape noise, and find excellent agreement for one-dimensional as well as two-dimensional distributions. Furthermore, we compare the exact likelihood to the widely employed Gaussian likelihood and find significant levels of skewness at angular separations $\\gtrsim 1^{\\circ}$ such that the mode of the exact distributions is shifted away from the mean towards lower values of the correlation function. We find that the assumption of a Gaussian random field for the weak-lensing field is well valid at these angular separations.\n Considering the skewness of the non-Gaussian likelihood, we evaluate its impact on the posterior constraints on $S_8$. On a simplified weak-lensing-survey setup with an area of $10 \\ 000 \\ \\mathrm{deg}^2$, we find that the posterior mean of $S_8$ is up to $2\\%$ higher when using the non-Gaussian likelihood, a shift comparable to the precision of current stage-III surveys."}, "https://arxiv.org/abs/2302.05747": {"title": "Individualized Treatment Allocation in Sequential Network Games", "link": "https://arxiv.org/abs/2302.05747", "description": "arXiv:2302.05747v4 Announce Type: replace \nAbstract: Designing individualized allocation of treatments so as to maximize the equilibrium welfare of interacting agents has many policy-relevant applications. Focusing on sequential decision games of interacting agents, this paper develops a method to obtain optimal treatment assignment rules that maximize a social welfare criterion by evaluating stationary distributions of outcomes. 
Stationary distributions in sequential decision games are given by Gibbs distributions, which are difficult to optimize with respect to a treatment allocation due to analytical and computational complexity. We apply a variational approximation to the stationary distribution and optimize the approximated equilibrium welfare with respect to treatment allocation using a greedy optimization algorithm. We characterize the performance of the variational approximation, deriving a performance guarantee for the greedy optimization algorithm via a welfare regret bound. We implement our proposed method in simulation exercises and an empirical application using the Indian microfinance data (Banerjee et al., 2013), and show it delivers significant welfare gains."}, "https://arxiv.org/abs/2308.14625": {"title": "Models for temporal clustering of extreme events with applications to mid-latitude winter cyclones", "link": "https://arxiv.org/abs/2308.14625", "description": "arXiv:2308.14625v2 Announce Type: replace \nAbstract: The occurrence of extreme events like heavy precipitation or storms at a certain location often shows a clustering behaviour and is thus not described well by a Poisson process. We construct a general model for the inter-exceedance times in between such events which combines different candidate models for such behaviour. This allows us to distinguish data generating mechanisms leading to clusters of dependent events with exponential inter-exceedance times in between clusters from independent events with heavy-tailed inter-exceedance times, and even allows us to combine these two mechanisms for better descriptions of such occurrences. We propose a modification of the Cram\\'er-von Mises distance for model fitting. An application to mid-latitude winter cyclones illustrates the usefulness of our work."}, "https://arxiv.org/abs/2312.06883": {"title": "Adaptive Experiments Toward Learning Treatment Effect Heterogeneity", "link": "https://arxiv.org/abs/2312.06883", "description": "arXiv:2312.06883v3 Announce Type: replace \nAbstract: Understanding treatment effect heterogeneity has become an increasingly popular task in various fields, as it helps design personalized advertisements in e-commerce or targeted treatment in biomedical studies. However, most of the existing work in this research area focused on either analyzing observational data based on strong causal assumptions or conducting post hoc analyses of randomized controlled trial data, and there has been limited effort dedicated to the design of randomized experiments specifically for uncovering treatment effect heterogeneity. In the manuscript, we develop a framework for designing and analyzing response adaptive experiments toward better learning treatment effect heterogeneity. Concretely, we provide response adaptive experimental design frameworks that sequentially revise the data collection mechanism according to the accrued evidence during the experiment. Such design strategies allow for the identification of subgroups with the largest treatment effects with enhanced statistical efficiency. The proposed frameworks not only unify adaptive enrichment designs and response-adaptive randomization designs but also complement A/B test designs in e-commerce and randomized trial designs in clinical settings. 
We demonstrate the merit of our design with theoretical justifications and in simulation studies with synthetic e-commerce and clinical trial data."}, "https://arxiv.org/abs/2407.08761": {"title": "Bayesian analysis for pretest-posttest binary outcomes with adaptive significance levels", "link": "https://arxiv.org/abs/2407.08761", "description": "arXiv:2407.08761v1 Announce Type: new \nAbstract: Count outcomes in longitudinal studies are frequent in clinical and engineering settings. In frequentist and Bayesian statistical analysis, methods such as mixed linear models allow the variability or correlation within individuals to be taken into account. However, in more straightforward scenarios, where only two stages of an experiment are observed (pre-treatment vs. post-treatment), there are only a few tools available, mainly for continuous outcomes. Thus, this work introduces a Bayesian statistical methodology for comparing paired samples in binary pretest-posttest scenarios. We establish a Bayesian probabilistic model for the inferential analysis of the unknown quantities, which is validated and refined through simulation analyses, and present an application to a dataset taken from the Television School and Family Smoking Prevention and Cessation Project (TVSFP) (Flay et al., 1995). The application of the Full Bayesian Significance Test (FBST) for precise hypothesis testing, along with the implementation of adaptive significance levels in the decision-making process, is included."}, "https://arxiv.org/abs/2407.08814": {"title": "Covariate Assisted Entity Ranking with Sparse Intrinsic Scores", "link": "https://arxiv.org/abs/2407.08814", "description": "arXiv:2407.08814v1 Announce Type: new \nAbstract: This paper addresses the item ranking problem with associated covariates, focusing on scenarios where the preference scores cannot be fully explained by covariates and the remaining intrinsic scores are sparse. Specifically, we extend the pioneering Bradley-Terry-Luce (BTL) model by incorporating covariate information and considering sparse individual intrinsic scores. Our work introduces novel model identification conditions and examines the regularized penalized Maximum Likelihood Estimator (MLE) statistical rates. We then construct a debiased estimator for the penalized MLE and analyze its distributional properties. Additionally, we apply our method to the goodness-of-fit test for models with no latent intrinsic scores, namely, the covariates fully explaining the preference scores of individual items. We also offer confidence intervals for ranks. Our numerical studies lend further support to our theoretical findings, demonstrating the validity of our proposed method."}, "https://arxiv.org/abs/2407.08827": {"title": "Estimating Methane Emissions from the Upstream Oil and Gas Industry Using a Multi-Stage Framework", "link": "https://arxiv.org/abs/2407.08827", "description": "arXiv:2407.08827v1 Announce Type: new \nAbstract: Measurement-based methane inventories, which involve surveying oil and gas facilities and compiling data to estimate methane emissions, are becoming the gold standard for quantifying emissions. However, there is a current lack of statistical guidance for the design and analysis of such surveys. The only existing method is a Monte Carlo procedure which is difficult to interpret, computationally intensive, and lacks available open-source code for its implementation. We provide an alternative method by framing methane surveys in the context of multi-stage sampling designs.
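The multi-stage framing in the methane-inventory abstract above (arXiv:2407.08827) can be pictured with a textbook two-stage, Horvitz-Thompson-style total estimator. The sketch below is only that generic device, not the paper's estimators or variance formulas, and the site-level quantities and names are hypothetical.

# Generic two-stage sampling sketch (not the estimator of arXiv:2407.08827):
# stage 1 samples sites with known inclusion probabilities, stage 2 samples
# emission sources within each sampled site; totals are scaled up by the
# inverse inclusion probabilities at both stages.
import numpy as np

def site_total(source_rates, source_pi):
    """Estimated total emissions of one site from its second-stage sample."""
    return float(np.sum(np.asarray(source_rates) / np.asarray(source_pi)))

def population_total(site_totals, site_pi):
    """Estimated total emissions over all sites from the first-stage sample."""
    return float(np.sum(np.asarray(site_totals) / np.asarray(site_pi)))

# Hypothetical usage: two surveyed sites sampled with probabilities 0.25 and 0.5.
t1 = site_total(source_rates=[3.0, 1.2], source_pi=[0.5, 0.5])
t2 = site_total(source_rates=[0.8], source_pi=[0.25])
print(population_total([t1, t2], site_pi=[0.25, 0.5]))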
We contribute estimators of the total emissions along with variance estimators which do not require simulation, as well as stratum-level total estimators. We show that the variance contribution from each stage of sampling can be estimated to inform the design of future surveys. We also introduce a more efficient modification of the estimator. Finally, we propose combining the multi-stage approach with a simple Monte Carlo procedure to model measurement error. The resulting methods are interpretable and require minimal computational resources. We apply the methods to aerial survey data of oil and gas facilities in British Columbia, Canada, to estimate the methane emissions in the province. An R package is provided to facilitate the use of the methods."}, "https://arxiv.org/abs/2407.08862": {"title": "Maximum Entropy Estimation of Heterogeneous Causal Effects", "link": "https://arxiv.org/abs/2407.08862", "description": "arXiv:2407.08862v1 Announce Type: new \nAbstract: For the purpose of causal inference we employ a stochastic model of the data generating process, utilizing individual propensity probabilities for the treatment, and also individual and counterfactual prognosis probabilities for the outcome. We assume a generalized version of the stable unit treatment value assumption, but we do not assume any version of strongly ignorable treatment assignment. Instead of conducting a sensitivity analysis, we utilize the principle of maximum entropy to estimate the distribution of causal effects. We develop a principled middle-way between extreme explanations of the observed data: we do not conclude that an observed association is wholly spurious, and we do not conclude that it is wholly causal. Rather, our conclusions are tempered and we conclude that the association is part spurious and part causal. In an example application we apply our methodology to analyze an observed association between marijuana use and hard drug use."}, "https://arxiv.org/abs/2407.08911": {"title": "Computationally efficient and statistically accurate conditional independence testing with spaCRT", "link": "https://arxiv.org/abs/2407.08911", "description": "arXiv:2407.08911v1 Announce Type: new \nAbstract: We introduce the saddlepoint approximation-based conditional randomization test (spaCRT), a novel conditional independence test that effectively balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens. Resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computational cost. The spaCRT leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands. We prove that the spaCRT p-value approximates the dCRT p-value with vanishing relative error, and that these two tests are asymptotically equivalent. Through extensive simulations and real data analysis, we demonstrate that the spaCRT controls Type-I error and maintains high power, outperforming other asymptotic and resampling-based tests. 
Our method is particularly well-suited for large-scale single-cell CRISPR screen analyses, facilitating the efficient and accurate assessment of perturbation-gene associations."}, "https://arxiv.org/abs/2407.09062": {"title": "Temporal M-quantile models and robust bias-corrected small area predictors", "link": "https://arxiv.org/abs/2407.09062", "description": "arXiv:2407.09062v1 Announce Type: new \nAbstract: In small area estimation, it is a smart strategy to rely on data measured over time. However, linear mixed models struggle to properly capture time dependencies when the number of lags is large. Given the lack of published studies addressing robust prediction in small areas using time-dependent data, this research seeks to extend M-quantile models to this field. Indeed, our methodology successfully addresses this challenge and offers flexibility to the widely imposed assumption of unit-level independence. Under the new model, robust bias-corrected predictors for small area linear indicators are derived. Additionally, the optimal selection of the robustness parameter for bias correction is explored, contributing theoretically to the field and enhancing outlier detection. For the estimation of the mean squared error (MSE), a first-order approximation and analytical estimators are obtained under general conditions. Several simulation experiments are conducted to evaluate the performance of the fitting algorithm, the new predictors, and the resulting MSE estimators, as well as the optimal selection of the robustness parameter. Finally, an application to the Spanish Living Conditions Survey data illustrates the usefulness of the proposed predictors."}, "https://arxiv.org/abs/2407.09293": {"title": "Sample size for developing a prediction model with a binary outcome: targeting precise individual risk estimates to improve clinical decisions and fairness", "link": "https://arxiv.org/abs/2407.09293", "description": "arXiv:2407.09293v1 Announce Type: new \nAbstract: When developing a clinical prediction model, the sample size of the development dataset is a key consideration. Small sample sizes lead to greater concerns of overfitting, instability, poor performance and lack of fairness. Previous research has outlined minimum sample size calculations to minimise overfitting and precisely estimate the overall risk. However even when meeting these criteria, the uncertainty (instability) in individual-level risk estimates may be considerable. In this article we propose how to examine and calculate the sample size required for developing a model with acceptably precise individual-level risk estimates to inform decisions and improve fairness. We outline a five-step process to be used before data collection or when an existing dataset is available. It requires researchers to specify the overall risk in the target population, the (anticipated) distribution of key predictors in the model, and an assumed 'core model' either specified directly (i.e., a logistic regression equation is provided) or based on specified C-statistic and relative effects of (standardised) predictors. We produce closed-form solutions that decompose the variance of an individual's risk estimate into Fisher's unit information matrix, predictor values and total sample size; this allows researchers to quickly calculate and examine individual-level uncertainty interval widths and classification instability for specified sample sizes. 
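The variance decomposition sketched in the sample-size abstract above (arXiv:2407.09293) can be illustrated as follows: under an assumed logistic core model and draws from the target predictor distribution, the unit Fisher information gives an approximate variance for any individual's linear predictor at a candidate sample size, which maps to an uncertainty-interval width on the risk scale. This is an illustrative calculation only, not the pmstabilityss implementation, and all names are assumptions.

# Illustrative calculation (not pmstabilityss): approximate individual-level
# risk-interval width from an assumed logistic core model, the predictor
# distribution and a candidate development sample size n.
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def unit_information(X_pop, beta):
    """Monte Carlo estimate of Fisher's unit information I1 = E[p(1-p) x x'],
    where rows of X_pop (including an intercept column) are draws from the
    assumed predictor distribution."""
    p = expit(X_pop @ beta)
    w = p * (1.0 - p)
    return (X_pop * w[:, None]).T @ X_pop / X_pop.shape[0]

def risk_interval_width(x, beta, I1, n, z=1.96):
    """Approximate 95% uncertainty-interval width of the estimated risk for an
    individual with predictor vector x, if the model is fitted on n subjects."""
    var_eta = x @ np.linalg.solve(n * I1, x)      # x' (n I1)^{-1} x
    eta = x @ beta
    half = z * np.sqrt(var_eta)
    return expit(eta + half) - expit(eta - half)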
Such information can be presented to key stakeholders (e.g., health professionals, patients, funders) using prediction and classification instability plots to help identify the (target) sample size required to improve trust, reliability and fairness in individual predictions. Our proposal is implemented in the software module pmstabilityss. We provide real examples and emphasise the importance of clinical context including any risk thresholds for decision making."}, "https://arxiv.org/abs/2407.09371": {"title": "Computationally Efficient Estimation of Large Probit Models", "link": "https://arxiv.org/abs/2407.09371", "description": "arXiv:2407.09371v1 Announce Type: new \nAbstract: Probit models are useful for modeling correlated discrete responses in many disciplines, including discrete choice data in economics. However, the Gaussian latent variable feature of probit models coupled with identification constraints poses significant computational challenges for estimation and inference, especially when the dimension of the discrete response variable is large. In this paper, we propose a computationally efficient Expectation-Maximization (EM) algorithm for estimating large probit models. Our work is distinct from existing methods in two important aspects. First, instead of simulation or sampling methods, we apply and customize expectation propagation (EP), a deterministic method originally proposed for approximate Bayesian inference, to estimate moments of the truncated multivariate normal (TMVN) in the E (expectation) step. Second, we take advantage of a symmetric identification condition to transform the constrained optimization problem in the M (maximization) step into a one-dimensional problem, which is solved efficiently using Newton's method instead of off-the-shelf solvers. Our method enables the analysis of correlated choice data in the presence of more than 100 alternatives, which is a reasonable size in modern applications, such as online shopping and booking platforms, but has been difficult in practice with probit models. We apply our probit estimation method to study ordering effects in hotel search results on Expedia.com."}, "https://arxiv.org/abs/2407.09390": {"title": "Tail-robust factor modelling of vector and tensor time series in high dimensions", "link": "https://arxiv.org/abs/2407.09390", "description": "arXiv:2407.09390v1 Announce Type: new \nAbstract: We study the problem of factor modelling vector- and tensor-valued time series in the presence of heavy tails in the data, which produce anomalous observations with non-negligible probability. For this, we propose to combine a two-step procedure with data truncation, which is easy to implement and does not require iteratively searching for a numerical solution. Departing from the light-tail assumptions often adopted in the time series factor modelling literature, we derive the theoretical properties of the proposed estimators while only assuming the existence of the $(2 + 2\\epsilon)$-th moment for some $\\epsilon \\in (0, 1)$, fully characterising the effect of heavy tails on the rates of estimation as well as the level of truncation.
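To make the "truncate, then estimate" idea in the tail-robust factor modelling abstract above (arXiv:2407.09390) concrete, the sketch below winsorises the panel element-wise and then applies a standard principal-component factor estimator. The paper's actual two-step procedure, its tensor extension and its choice of truncation level are not reproduced, and the names are illustrative.

# Heavily simplified sketch of truncation followed by PCA factor estimation;
# not the estimator of arXiv:2407.09390.
import numpy as np

def truncate(X, tau):
    """Winsorise entries of the T x p panel X at +/- tau to damp heavy tails."""
    return np.clip(X, -tau, tau)

def pca_factors(X, r):
    """Principal-component estimates of r factors and loadings from X (T x p)."""
    T, p = X.shape
    cov = X.T @ X / T
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    loadings = eigvecs[:, ::-1][:, :r] * np.sqrt(p)   # common scale convention
    factors = X @ loadings / p
    return factors, loadings

# Hypothetical usage: choose tau as a high quantile of |X|, then estimate.
# F_hat, L_hat = pca_factors(truncate(X, np.quantile(np.abs(X), 0.99)), r=3)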
Numerical experiments on simulated datasets demonstrate the good performance of the proposed estimator, which is further supported by applications to two macroeconomic datasets."}, "https://arxiv.org/abs/2407.09443": {"title": "Addressing Confounding and Continuous Exposure Measurement Error Using Corrected Score Functions", "link": "https://arxiv.org/abs/2407.09443", "description": "arXiv:2407.09443v1 Announce Type: new \nAbstract: Confounding and exposure measurement error can introduce bias when drawing inference about the marginal effect of an exposure on an outcome of interest. While there are broad methodologies for addressing each source of bias individually, confounding and exposure measurement error frequently co-occur and there is a need for methods that address them simultaneously. In this paper, corrected score methods are derived under classical additive measurement error to draw inference about marginal exposure effects using only measured variables. Three estimators are proposed based on g-formula, inverse probability weighting, and doubly-robust estimation techniques. The estimators are shown to be consistent and asymptotically normal, and the doubly-robust estimator is shown to exhibit its namesake property. The methods, which are implemented in the R package mismex, perform well in finite samples under both confounding and measurement error as demonstrated by simulation studies. The proposed doubly-robust estimator is applied to study the effects of two biomarkers on HIV-1 infection using data from the HVTN 505 preventative vaccine trial."}, "https://arxiv.org/abs/2407.08750": {"title": "ROLCH: Regularized Online Learning for Conditional Heteroskedasticity", "link": "https://arxiv.org/abs/2407.08750", "description": "arXiv:2407.08750v1 Announce Type: cross \nAbstract: Large-scale streaming data are common in modern machine learning applications and have led to the development of online learning algorithms. Many fields, such as supply chain management, weather and meteorology, energy markets, and finance, have pivoted towards using probabilistic forecasts, which yields the need not only for accurate learning of the expected value but also for learning the conditional heteroskedasticity. Against this backdrop, we present a methodology for online estimation of regularized linear distributional models for conditional heteroskedasticity. The proposed algorithm is based on a combination of recent developments for the online estimation of LASSO models and the well-known GAMLSS framework. We provide a case study on day-ahead electricity price forecasting, in which we show the competitive performance of the adaptive estimation combined with strongly reduced computational effort. Our algorithms are implemented in a computationally efficient Python package."}, "https://arxiv.org/abs/2407.09130": {"title": "On goodness-of-fit testing for self-exciting point processes", "link": "https://arxiv.org/abs/2407.09130", "description": "arXiv:2407.09130v1 Announce Type: cross \nAbstract: Despite the wide usage of parametric point processes in theory and applications, a sound goodness-of-fit procedure to test whether a given parametric model is appropriate for data coming from a self-exciting point process has been missing in the literature. In this work, we establish a bootstrap-based goodness-of-fit test which empirically works for all kinds of self-exciting point processes (and even beyond).
In an infill-asymptotic setting we also prove its asymptotic consistency, albeit only in the particular case that the underlying point process is inhomogeneous Poisson."}, "https://arxiv.org/abs/2407.09387": {"title": "Meta-Analysis with Untrusted Data", "link": "https://arxiv.org/abs/2407.09387", "description": "arXiv:2407.09387v1 Announce Type: cross \nAbstract: [See paper for full abstract] Meta-analysis is a crucial tool for answering scientific questions. It is usually conducted on a relatively small amount of ``trusted'' data -- ideally from randomized, controlled trials -- which allow causal effects to be reliably estimated with minimal assumptions. We show how to answer causal questions much more precisely by making two changes. First, we incorporate untrusted data drawn from large observational databases, related scientific literature and practical experience -- without sacrificing rigor or introducing strong assumptions. Second, we train richer models capable of handling heterogeneous trials, addressing a long-standing challenge in meta-analysis. Our approach is based on conformal prediction, which fundamentally produces rigorous prediction intervals, but doesn't handle indirect observations: in meta-analysis, we observe only noisy effects due to the limited number of participants in each trial. To handle noise, we develop a simple, efficient version of fully-conformal kernel ridge regression, based on a novel condition called idiocentricity. We introduce noise-correcting terms in the residuals and analyze their interaction with a ``variance shaving'' technique. In multiple experiments on healthcare datasets, our algorithms deliver tighter, sounder intervals than traditional ones. This paper charts a new course for meta-analysis and evidence-based medicine, where heterogeneity and untrusted data are embraced for more nuanced and precise predictions."}, "https://arxiv.org/abs/2308.00812": {"title": "Causal exposure-response curve estimation with surrogate confounders: a study of air pollution and children's health in Medicaid claims data", "link": "https://arxiv.org/abs/2308.00812", "description": "arXiv:2308.00812v2 Announce Type: replace \nAbstract: In this paper, we undertake a case study to estimate a causal exposure-response function (ERF) for long-term exposure to fine particulate matter (PM$_{2.5}$) and respiratory hospitalizations in socioeconomically disadvantaged children using nationwide Medicaid claims data. These data present specific challenges. First, family income-based Medicaid eligibility criteria for children differ by state, creating socioeconomically distinct populations and leading to clustered data. Second, Medicaid enrollees' socioeconomic status, a confounder and an effect modifier of the exposure-response relationships under study, is not measured. However, two surrogates are available: median household income of each enrollee's zip code and state-level Medicaid family income eligibility thresholds for children. We introduce a customized approach for causal ERF estimation called MedMatch, building on generalized propensity score (GPS) matching methods. MedMatch adapts these methods to (1) leverage the surrogate variables to account for potential confounding and/or effect modification by socioeconomic status and (2) address practical challenges presented by differing exposure distributions across clusters. We also propose a new hyperparameter selection criterion for MedMatch and traditional GPS matching methods. 
Through extensive simulation studies, we demonstrate the strong performance of MedMatch relative to conventional approaches in this setting. We apply MedMatch to estimate the causal ERF between PM$_{2.5}$ and respiratory hospitalization among children in Medicaid, 2000-2012. We find a positive association, with a steeper curve at lower PM$_{2.5}$ concentrations that levels off at higher concentrations."}, "https://arxiv.org/abs/2311.04540": {"title": "On the estimation of the number of components in multivariate functional principal component analysis", "link": "https://arxiv.org/abs/2311.04540", "description": "arXiv:2311.04540v2 Announce Type: replace \nAbstract: Happ and Greven (2018) developed a methodology for principal components analysis of multivariate functional data for data observed on different dimensional domains. Their approach relies on an estimation of univariate functional principal components for each univariate functional feature. In this paper, we present extensive simulations to investigate choosing the number of principal components to retain. We show empirically that the conventional approach of using a percentage of variance explained threshold for each univariate functional feature may be unreliable when aiming to explain an overall percentage of variance in the multivariate functional data, and thus we advise practitioners to be careful when using it."}, "https://arxiv.org/abs/2312.11991": {"title": "Outcomes truncated by death in RCTs: a simulation study on the survivor average causal effect", "link": "https://arxiv.org/abs/2312.11991", "description": "arXiv:2312.11991v2 Announce Type: replace \nAbstract: Continuous outcome measurements truncated by death present a challenge for the estimation of unbiased treatment effects in randomized controlled trials (RCTs). One way to deal with such situations is to estimate the survivor average causal effect (SACE), but this requires making non-testable assumptions. Motivated by an ongoing RCT in very preterm infants with intraventricular hemorrhage, we performed a simulation study to compare a SACE estimator with complete case analysis (CCA) and an analysis after multiple imputation of missing outcomes. We set up 9 scenarios combining positive, negative and no treatment effect on the outcome (cognitive development) and on survival at 2 years of age. Treatment effect estimates from all methods were compared in terms of bias, mean squared error and coverage with regard to two true treatment effects: the treatment effect on the outcome used in the simulation and the SACE, which was derived by simulation of both potential outcomes per patient. Despite targeting different estimands (principal stratum estimand, hypothetical estimand), the SACE-estimator and multiple imputation gave similar estimates of the treatment effect and efficiently reduced the bias compared to CCA. Also, both methods were relatively robust to omission of one covariate in the analysis, and thus violation of relevant assumptions. Although the SACE is not without controversy, we find it useful if mortality is inherent to the study population. 
Some degree of violation of the required assumptions is almost certain, but may be acceptable in practice."}, "https://arxiv.org/abs/2309.10476": {"title": "Thermodynamically rational decision making under uncertainty", "link": "https://arxiv.org/abs/2309.10476", "description": "arXiv:2309.10476v3 Announce Type: replace-cross \nAbstract: An analytical characterization of thermodynamically rational agent behaviour is obtained for a simple, yet non--trivial example of a ``Maxwell's demon\" operating with partial information. Our results provide the first fully transparent physical understanding of a decision problem under uncertainty."}, "https://arxiv.org/abs/2407.09565": {"title": "A Short Note on Event-Study Synthetic Difference-in-Differences Estimators", "link": "https://arxiv.org/abs/2407.09565", "description": "arXiv:2407.09565v1 Announce Type: new \nAbstract: I propose an event study extension of Synthetic Difference-in-Differences (SDID) estimators. I show that, in simple and staggered adoption designs, estimators from Arkhangelsky et al. (2021) can be disaggregated into dynamic treatment effect estimators, comparing the lagged outcome differentials of treated and synthetic controls to their pre-treatment average. Estimators presented in this note can be computed using the sdid_event Stata package."}, "https://arxiv.org/abs/2407.09696": {"title": "Regularizing stock return covariance matrices via multiple testing of correlations", "link": "https://arxiv.org/abs/2407.09696", "description": "arXiv:2407.09696v1 Announce Type: new \nAbstract: This paper develops a large-scale inference approach for the regularization of stock return covariance matrices. The framework allows for the presence of heavy tails and multivariate GARCH-type effects of unknown form among the stock returns. The approach involves simultaneous testing of all pairwise correlations, followed by setting non-statistically significant elements to zero. This adaptive thresholding is achieved through sign-based Monte Carlo resampling within multiple testing procedures, controlling either the traditional familywise error rate, a generalized familywise error rate, or the false discovery proportion. Subsequent shrinkage ensures that the final covariance matrix estimate is positive definite and well-conditioned while preserving the achieved sparsity. Compared to alternative estimators, this new regularization method demonstrates strong performance in simulation experiments and real portfolio optimization."}, "https://arxiv.org/abs/2407.09735": {"title": "Positive and Unlabeled Data: Model, Estimation, Inference, and Classification", "link": "https://arxiv.org/abs/2407.09735", "description": "arXiv:2407.09735v1 Announce Type: new \nAbstract: This study introduces a new approach to addressing positive and unlabeled (PU) data through the double exponential tilting model (DETM). Traditional methods often fall short because they only apply to selected completely at random (SCAR) PU data, where the labeled positive and unlabeled positive data are assumed to be from the same distribution. In contrast, our DETM's dual structure effectively accommodates the more complex and underexplored selected at random PU data, where the labeled and unlabeled positive data can be from different distributions. We rigorously establish the theoretical foundations of DETM, including identifiability, parameter estimation, and asymptotic properties. 
Additionally, we move forward to statistical inference by developing a goodness-of-fit test for the SCAR condition and constructing confidence intervals for the proportion of positive instances in the target domain. We leverage an approximated Bayes classifier for classification tasks, demonstrating DETM's robust performance in prediction. Through theoretical insights and practical applications, this study highlights DETM as a comprehensive framework for addressing the challenges of PU data."}, "https://arxiv.org/abs/2407.09738": {"title": "Sparse Asymptotic PCA: Identifying Sparse Latent Factors Across Time Horizon", "link": "https://arxiv.org/abs/2407.09738", "description": "arXiv:2407.09738v1 Announce Type: new \nAbstract: This paper proposes a novel method for sparse latent factor modeling using a new sparse asymptotic Principal Component Analysis (APCA). This approach analyzes the co-movements of large-dimensional panel data systems over time horizons within a general approximate factor model framework. Unlike existing sparse factor modeling approaches based on sparse PCA, which assume sparse loading matrices, our sparse APCA assumes that factor processes are sparse over the time horizon, while the corresponding loading matrices are not necessarily sparse. This development is motivated by the observation that the assumption of sparse loadings may not be appropriate for financial returns, where exposure to market factors is generally universal and non-sparse. We propose a truncated power method to estimate the first sparse factor process and a sequential deflation method for multi-factor cases. Additionally, we develop a data-driven approach to identify the sparsity of risk factors over the time horizon using a novel cross-sectional cross-validation method. Theoretically, we establish that our estimators are consistent under mild conditions. Monte Carlo simulations demonstrate that the proposed method performs well in finite samples. Empirically, we analyze daily stock returns for a balanced panel of S&P 500 stocks from January 2004 to December 2016. Through textual analysis, we examine specific events associated with the identified sparse factors that systematically influence the stock market. Our approach offers a new pathway for economists to study and understand the systematic risks of economic and financial systems over time."}, "https://arxiv.org/abs/2407.09759": {"title": "Estimation of Integrated Volatility Functionals with Kernel Spot Volatility Estimators", "link": "https://arxiv.org/abs/2407.09759", "description": "arXiv:2407.09759v1 Announce Type: new \nAbstract: For a multidimensional It\\^o semimartingale, we consider the problem of estimating integrated volatility functionals. Jacod and Rosenbaum (2013) studied a plug-in type of estimator based on a Riemann sum approximation of the integrated functional and a spot volatility estimator with a forward uniform kernel. Motivated by recent results that show that spot volatility estimators with general two-side kernels of unbounded support are more accurate, in this paper, an estimator using a general kernel spot volatility estimator as the plug-in is considered. A biased central limit theorem for estimating the integrated functional is established with an optimal convergence rate. Unbiased central limit theorems for estimators with proper de-biasing terms are also obtained both at the optimal convergence regime for the bandwidth and when applying undersmoothing. 
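For orientation on the plug-in construction described in the integrated volatility abstract above (arXiv:2407.09759), the one-dimensional sketch below estimates the spot variance with a two-sided kernel average of squared increments and then forms the Riemann-sum plug-in for an integrated functional. The paper's bias corrections, multivariate setting and kernel theory are omitted; the Gaussian kernel and all names are assumptions.

# One-dimensional, illustrative plug-in for an integrated volatility functional
# integral_0^T g(c_t) dt: kernel spot-variance estimates followed by a Riemann
# sum. Not the (de-biased, multivariate) estimator studied in the paper.
import numpy as np

def spot_variance(increments, dt, bandwidth):
    """Two-sided Gaussian-kernel spot-variance estimate at each observation time:
    a kernel-weighted average of squared increments scaled by the time step."""
    n = increments.shape[0]
    times = np.arange(n) * dt
    c_hat = np.empty(n)
    for i in range(n):
        w = np.exp(-0.5 * ((times - times[i]) / bandwidth) ** 2)
        c_hat[i] = np.sum(w * increments ** 2) / (np.sum(w) * dt)
    return c_hat

def integrated_functional(increments, dt, bandwidth, g=lambda c: c ** 2):
    """Riemann-sum plug-in estimate of integral g(c_t) dt (default g(c) = c^2)."""
    return float(np.sum(g(spot_variance(increments, dt, bandwidth))) * dt)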
Our results show that one can significantly reduce the estimator's bias by adopting a general kernel instead of the standard uniform kernel. Our proposed bias-corrected estimators are found to maintain remarkable robustness against bandwidth selection in a variety of sampling frequencies and functions."}, "https://arxiv.org/abs/2407.09761": {"title": "Exploring Differences between Two Decades of Mental Health Related Emergency Department Visits by Youth via Recurrent Events Analyses", "link": "https://arxiv.org/abs/2407.09761", "description": "arXiv:2407.09761v1 Announce Type: new \nAbstract: We aim to develop a tool for understanding how the mental health of youth aged less than 18 years evolves over time through administrative records of mental health related emergency department (MHED) visits in two decades. Administrative health data usually contain rich information for investigating public health issues; however, many restrictions and regulations apply to their use. Moreover, the data are usually not in a conventional format since administrative databases are created and maintained to serve non-research purposes and only information for people who seek health services is accessible. Analysis of administrative health data is thus challenging in general. In the MHED data analyses, we are particularly concerned with (i) evaluating dynamic patterns and impacts with doubly-censored recurrent event data, and (ii) re-calibrating estimators developed based on truncated data by leveraging summary statistics from the population. The findings are verified empirically via simulation. We have established the asymptotic properties of the inference procedures. The contributions of this paper are twofold. We present innovative strategies for processing doubly-censored recurrent event data, and overcoming the truncation induced by the data collection. In addition, through exploring the pediatric MHED visit records, we provide new insights into children's/youths' mental health changes over time."}, "https://arxiv.org/abs/2407.09772": {"title": "Valid standard errors for Bayesian quantile regression with clustered and independent data", "link": "https://arxiv.org/abs/2407.09772", "description": "arXiv:2407.09772v1 Announce Type: new \nAbstract: In Bayesian quantile regression, the most commonly used likelihood is the asymmetric Laplace (AL) likelihood. The reason for this choice is not that it is a plausible data-generating model but that the corresponding maximum likelihood estimator is identical to the classical estimator by Koenker and Bassett (1978), and in that sense, the AL likelihood can be thought of as a working likelihood. AL-based quantile regression has been shown to produce good finite-sample Bayesian point estimates and to be consistent. However, if the AL distribution does not correspond to the data-generating distribution, credible intervals based on posterior standard deviations can have poor coverage. Yang, Wang, and He (2016) proposed an adjustment to the posterior covariance matrix that produces asymptotically valid intervals. However, we show that this adjustment is sensitive to the choice of scale parameter for the AL likelihood and can lead to poor coverage when the sample size is small to moderate. We therefore propose using Infinitesimal Jackknife (IJ) standard errors (Giordano & Broderick, 2023). These standard errors do not require resampling but can be obtained from a single MCMC run. We also propose a version of IJ standard errors for clustered data. 
Simulations and applications to real data show that the IJ standard errors have good frequentist properties, both for independent and clustered data. We provide an R-package that computes IJ standard errors for clustered or independent data after estimation with the brms wrapper in R for Stan."}, "https://arxiv.org/abs/2407.10014": {"title": "Identification of Average Causal Effects in Confounded Additive Noise Models", "link": "https://arxiv.org/abs/2407.10014", "description": "arXiv:2407.10014v1 Announce Type: new \nAbstract: Additive noise models (ANMs) are an important setting studied in causal inference. Most of the existing works on ANMs assume causal sufficiency, i.e., there are no unobserved confounders. This paper focuses on confounded ANMs, where a set of treatment variables and a target variable are affected by an unobserved confounder that follows a multivariate Gaussian distribution. We introduce a novel approach for estimating the average causal effects (ACEs) of any subset of the treatment variables on the outcome and demonstrate that a small set of interventional distributions is sufficient to estimate all of them. In addition, we propose a randomized algorithm that further reduces the number of required interventions to poly-logarithmic in the number of nodes. Finally, we demonstrate that these interventions are also sufficient to recover the causal structure between the observed variables. This establishes that a poly-logarithmic number of interventions is sufficient to infer the causal effects of any subset of treatments on the outcome in confounded ANMs with high probability, even when the causal structure between treatments is unknown. The simulation results indicate that our method can accurately estimate all ACEs in the finite-sample regime. We also demonstrate the practical significance of our algorithm by evaluating it on semi-synthetic data."}, "https://arxiv.org/abs/2407.10089": {"title": "The inverse Kalman filter", "link": "https://arxiv.org/abs/2407.10089", "description": "arXiv:2407.10089v1 Announce Type: new \nAbstract: In this study, we introduce a new approach, the inverse Kalman filter (IKF), which enables accurate matrix-vector multiplication between a covariance matrix from a dynamic linear model and any real-valued vector with linear computational cost. We incorporate the IKF with the conjugate gradient algorithm, which substantially accelerates the computation of matrix inversion for a general form of covariance matrices, whereas other approximation approaches may not be directly applicable. We demonstrate the scalability and efficiency of the IKF approach through distinct applications, including nonparametric estimation of particle interaction functions and predicting incomplete lattices of correlated data, using both simulation and real-world observations, including cell trajectory and satellite radar interferogram."}, "https://arxiv.org/abs/2407.10185": {"title": "Semiparametric Efficient Inference for the Probability of Necessary and Sufficient Causation", "link": "https://arxiv.org/abs/2407.10185", "description": "arXiv:2407.10185v1 Announce Type: new \nAbstract: Causal attribution, which aims to explain why events or behaviors occur, is crucial in causal inference and enhances our understanding of cause-and-effect relationships in scientific research. The probabilities of necessary causation (PN) and sufficient causation (PS) are two of the most common quantities for attribution in causal inference. 
While many works have explored the identification or bounds of PN and PS, efficient estimation remains unaddressed. To fill this gap, this paper focuses on obtaining semiparametric efficient estimators of PN and PS under two sets of identifiability assumptions: strong ignorability and monotonicity, and strong ignorability and conditional independence. We derive efficient influence functions and semiparametric efficiency bounds for PN and PS under the two sets of identifiability assumptions, respectively. Based on this, we propose efficient estimators for PN and PS, and show their large sample properties. Extensive simulations validate the superiority of our estimators compared to competing methods. We apply our methods to a real-world dataset to assess various risk factors affecting stroke."}, "https://arxiv.org/abs/2407.10272": {"title": "Two-way Threshold Matrix Autoregression", "link": "https://arxiv.org/abs/2407.10272", "description": "arXiv:2407.10272v1 Announce Type: new \nAbstract: Matrix-valued time series data are widely available in various applications, attracting increasing attention in the literature. However, while nonlinearity has been recognized, the literature has so far neglected a deeper and more intricate level of nonlinearity, namely the {\\it row-level} nonlinear dynamics and the {\\it column-level} nonlinear dynamics, which are often observed in economic and financial data. In this paper, we propose a novel two-way threshold matrix autoregression (TWTMAR) model. This model is designed to effectively characterize the threshold structure in both rows and columns of matrix-valued time series. Unlike existing models that consider a single threshold variable or assume a uniform structure change across the matrix, the TWTMAR model allows for distinct threshold effects for rows and columns using two threshold variables. This approach achieves greater dimension reduction and yields better interpretation compared to existing methods. Moreover, we propose a parameter estimation procedure leveraging the intrinsic matrix structure and investigate the asymptotic properties. The efficacy and flexibility of the model are demonstrated through both simulation studies and an empirical analysis of the Fama-French Portfolio dataset."}, "https://arxiv.org/abs/2407.10418": {"title": "An integrated perspective of robustness in regression through the lens of the bias-variance trade-off", "link": "https://arxiv.org/abs/2407.10418", "description": "arXiv:2407.10418v1 Announce Type: new \nAbstract: This paper presents an integrated perspective on robustness in regression. Specifically, we examine the relationship between traditional outlier-resistant robust estimation and robust optimization, which focuses on parameter estimation resistant to imaginary dataset-perturbations. While both are commonly regarded as robust methods, these concepts demonstrate a bias-variance trade-off, indicating that they follow roughly converse strategies."}, "https://arxiv.org/abs/2407.10442": {"title": "Inference at the data's edge: Gaussian processes for modeling and inference under model-dependency, poor overlap, and extrapolation", "link": "https://arxiv.org/abs/2407.10442", "description": "arXiv:2407.10442v1 Announce Type: new \nAbstract: The Gaussian Process (GP) is a highly flexible non-linear regression approach that provides a principled approach to handling our uncertainty over predicted (counterfactual) values. 
It does so by computing a posterior distribution over predicted points as a function of a chosen model space and the observed data, in contrast to conventional approaches that effectively compute uncertainty estimates conditionally on placing full faith in a fitted model. This is especially valuable under conditions of extrapolation or weak overlap, where model dependency poses a severe threat. We first offer an accessible explanation of GPs, and provide an implementation suitable to social science inference problems. In doing so we reduce the number of user-chosen hyperparameters from three to zero. We then illustrate the settings in which GPs can be most valuable: those where conventional approaches have poor properties due to model-dependency/extrapolation in data-sparse regions. Specifically, we apply it to (i) comparisons in which treated and control groups have poor covariate overlap; (ii) interrupted time-series designs, where models are fitted prior to an event but extrapolated after it; and (iii) regression discontinuity, which depends on model estimates taken at or just beyond the edge of their supporting data."}, "https://arxiv.org/abs/2407.10653": {"title": "The Dynamic, the Static, and the Weak factor models and the analysis of high-dimensional time series", "link": "https://arxiv.org/abs/2407.10653", "description": "arXiv:2407.10653v1 Announce Type: new \nAbstract: Several fundamental and closely interconnected issues related to factor models are reviewed and discussed: dynamic versus static loadings, rate-strong versus rate-weak factors, the concept of weakly common component recently introduced by Gersing et al. (2023), the irrelevance of cross-sectional ordering and the assumption of cross-sectional exchangeability, and the problem of undetected strong factors."}, "https://arxiv.org/abs/2407.10721": {"title": "Nonparametric Multivariate Profile Monitoring Via Tree Ensembles", "link": "https://arxiv.org/abs/2407.10721", "description": "arXiv:2407.10721v1 Announce Type: new \nAbstract: Monitoring random profiles over time is used to assess whether the system of interest, generating the profiles, is operating under desired conditions at any time-point. In practice, accurate detection of a change-point within a sequence of responses that exhibit a functional relationship with multiple explanatory variables is an important goal for effectively monitoring such profiles. We present a nonparametric method utilizing ensembles of regression trees and random forests to model the functional relationship along with an associated Kolmogorov-Smirnov statistic to monitor profile behavior. Through a simulation study considering multiple factors, we demonstrate that our method offers strong performance and competitive detection capability when compared to existing methods."}, "https://arxiv.org/abs/2407.10846": {"title": "Joint Learning from Heterogeneous Rank Data", "link": "https://arxiv.org/abs/2407.10846", "description": "arXiv:2407.10846v1 Announce Type: new \nAbstract: The statistical modelling of ranking data has a long history and encompasses various perspectives on how observed rankings arise. One of the most common models, the Plackett-Luce model, is frequently used to aggregate rankings from multiple rankers into a single ranking that corresponds to the underlying quality of the ranked objects. Given that rankers frequently exhibit heterogeneous preferences, mixture-type models have been developed to group rankers with more or less homogeneous preferences together to reduce bias. 
However, occasionally, these preference groups are known a priori. Under these circumstances, current practice consists of fitting Plackett-Luce models separately for each group. Nevertheless, there might be some commonalities between different groups of rankers, such that separate estimation implies a loss of information. We propose an extension of the Plackett-Luce model, the Sparse Fused Plackett-Luce model, that allows for joint learning of such heterogeneous rank data, whereby information from different groups is utilised to achieve better model performance. The observed rankings can be considered a function of variables pertaining to the ranked objects. As such, we allow for these types of variables, where information on the coefficients is shared across groups. Moreover, as not all variables might be relevant for the ranking of an object, we impose sparsity on the coefficients to improve interpretability, estimation and prediction of the model. Simulation studies indicate superior performance of the proposed method compared to existing approaches. To illustrate the usage and interpretation of the method, an application on data consisting of consumer preferences regarding various sweet potato varieties is provided. An R package containing the proposed methodology can be found on https://CRAN.R-project.org/package=SFPL."}, "https://arxiv.org/abs/2407.09542": {"title": "Multi-object Data Integration in the Study of Primary Progressive Aphasia", "link": "https://arxiv.org/abs/2407.09542", "description": "arXiv:2407.09542v1 Announce Type: cross \nAbstract: This article focuses on a multi-modal imaging data application where structural/anatomical information from gray matter (GM) and brain connectivity information in the form of a brain connectome network from functional magnetic resonance imaging (fMRI) are available for a number of subjects with different degrees of primary progressive aphasia (PPA), a neurodegenerative disorder (ND) measured through a speech rate measure on motor speech loss. The clinical/scientific goal in this study becomes the identification of brain regions of interest significantly related to the speech rate measure to gain insight into ND patterns. Viewing the brain connectome network and GM images as objects, we develop an integrated object response regression framework of network and GM images on the speech rate measure. A novel integrated prior formulation is proposed on network and structural image coefficients in order to exploit network information of the brain connectome while leveraging the interconnections among the two objects. The principled Bayesian framework allows the characterization of uncertainty in ascertaining a region being actively related to the speech rate measure. Our framework yields new insights into the relationship of brain regions associated with PPA, offering a deeper understanding of neuro-degenerative patterns of PPA. The supplementary file adds details about posterior computation and additional empirical results."}, "https://arxiv.org/abs/2407.09632": {"title": "Granger Causality in Extremes", "link": "https://arxiv.org/abs/2407.09632", "description": "arXiv:2407.09632v1 Announce Type: cross \nAbstract: We introduce a rigorous mathematical framework for Granger causality in extremes, designed to identify causal links from extreme events in time series. Granger causality plays a pivotal role in uncovering directional relationships among time-varying variables. 
While this notion gains heightened importance during extreme and highly volatile periods, state-of-the-art methods primarily focus on causality within the body of the distribution, often overlooking causal mechanisms that manifest only during extreme events. Our framework is designed to infer causality mainly from extreme events by leveraging the causal tail coefficient. We establish equivalences between causality in extremes and other causal concepts, including (classical) Granger causality, Sims causality, and structural causality. We prove other key properties of Granger causality in extremes and show that the framework is especially helpful under the presence of hidden confounders. We also propose a novel inference method for detecting the presence of Granger causality in extremes from data. Our method is model-free, can handle non-linear and high-dimensional time series, outperforms current state-of-the-art methods in all considered setups, both in performance and speed, and was found to uncover coherent effects when applied to financial and extreme weather observations."}, "https://arxiv.org/abs/2407.09664": {"title": "An Introduction to Permutation Processes (version 0", "link": "https://arxiv.org/abs/2407.09664", "description": "arXiv:2407.09664v1 Announce Type: cross \nAbstract: These lecture notes were prepared for a special topics course in the Department of Statistics at the University of Washington, Seattle. They comprise the first eight chapters of a book currently in progress."}, "https://arxiv.org/abs/2407.09832": {"title": "Molecular clouds: do they deserve a non-Gaussian description?", "link": "https://arxiv.org/abs/2407.09832", "description": "arXiv:2407.09832v1 Announce Type: cross \nAbstract: Molecular clouds show complex structures reflecting their non-linear dynamics. Many studies, investigating the bridge between their morphology and physical properties, have shown the interest provided by non-Gaussian higher-order statistics to grasp physical information. Yet, as this bridge is usually characterized in the supervised world of simulations, transferring it onto observations can be hazardous, especially when the discrepancy between simulations and observations remains unknown. In this paper, we aim at identifying relevant summary statistics directly from the observation data. To do so, we develop a test that compares the informative power of two sets of summary statistics for a given dataset. Contrary to supervised approaches, this test does not require the knowledge of any data label or parameter, but focuses instead on comparing the degeneracy levels of these descriptors, relying on a notion of statistical compatibility. We apply this test to column density maps of 14 nearby molecular clouds observed by Herschel, and iteratively compare different sets of usual summary statistics. We show that a standard Gaussian description of these clouds is highly degenerate but can be substantially improved when being estimated on the logarithm of the maps. This illustrates that low-order statistics, properly used, remain a very powerful tool. We then further show that such descriptions still exhibit a small quantity of degeneracies, some of which are lifted by the higher order statistics provided by reduced wavelet scattering transforms. This property of observations quantitatively differs from state-of-the-art simulations of dense molecular cloud collapse and is not reproduced by logfBm models. 
Finally we show how the summary statistics identified can be cooperatively used to build a morphological distance, which is evaluated visually, and gives very satisfactory results."}, "https://arxiv.org/abs/2407.10132": {"title": "Optimal Kernel Choice for Score Function-based Causal Discovery", "link": "https://arxiv.org/abs/2407.10132", "description": "arXiv:2407.10132v1 Announce Type: cross \nAbstract: Score-based methods have demonstrated their effectiveness in discovering causal relationships by scoring different causal structures based on their goodness of fit to the data. Recently, Huang et al. proposed a generalized score function that can handle general data distributions and causal relationships by modeling the relations in reproducing kernel Hilbert space (RKHS). The selection of an appropriate kernel within this score function is crucial for accurately characterizing causal relationships and ensuring precise causal discovery. However, the current method involves manual heuristic selection of kernel parameters, making the process tedious and less likely to ensure optimality. In this paper, we propose a kernel selection method within the generalized score function that automatically selects the optimal kernel that best fits the data. Specifically, we model the generative process of the variables involved in each step of the causal graph search procedure as a mixture of independent noise variables. Based on this model, we derive an automatic kernel selection method by maximizing the marginal likelihood of the variables involved in each search step. We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms heuristic kernel selection methods."}, "https://arxiv.org/abs/2407.10175": {"title": "Low Volatility Stock Portfolio Through High Dimensional Bayesian Cointegration", "link": "https://arxiv.org/abs/2407.10175", "description": "arXiv:2407.10175v1 Announce Type: cross \nAbstract: We employ a Bayesian modelling technique for high dimensional cointegration estimation to construct low volatility portfolios from a large number of stocks. The proposed Bayesian framework effectively identifies sparse and important cointegration relationships amongst large baskets of stocks across various asset spaces, resulting in portfolios with reduced volatility. Such cointegration relationships persist well over the out-of-sample testing time, providing practical benefits in portfolio construction and optimization. Further studies on drawdown and volatility minimization also highlight the benefits of including cointegrated portfolios as risk management instruments."}, "https://arxiv.org/abs/2407.10659": {"title": "A nonparametric test for rough volatility", "link": "https://arxiv.org/abs/2407.10659", "description": "arXiv:2407.10659v1 Announce Type: cross \nAbstract: We develop a nonparametric test for deciding whether volatility of an asset follows a standard semimartingale process, with paths of finite quadratic variation, or a rough process with paths of infinite quadratic variation. The test utilizes the fact that volatility is rough if and only if volatility increments are negatively autocorrelated at high frequencies. It is based on the sample autocovariance of increments of spot volatility estimates computed from high-frequency asset return data. 
By showing a feasible CLT for this statistic under the null hypothesis of semimartingale volatility paths, we construct a test with fixed asymptotic size and an asymptotic power equal to one. The test is derived under very general conditions for the data-generating process. In particular, it is robust to jumps with arbitrary activity and to the presence of market microstructure noise. In an application of the test to SPY high-frequency data, we find evidence for rough volatility."}, "https://arxiv.org/abs/2407.10869": {"title": "Hidden Markov models with an unknown number of states and a repulsive prior on the state parameters", "link": "https://arxiv.org/abs/2407.10869", "description": "arXiv:2407.10869v1 Announce Type: cross \nAbstract: Hidden Markov models (HMMs) offer a robust and efficient framework for analyzing time series data, modelling both the underlying latent state progression over time and the observation process, conditional on the latent state. However, a critical challenge lies in determining the appropriate number of underlying states, often unknown in practice. In this paper, we employ a Bayesian framework, treating the number of states as a random variable and employing reversible jump Markov chain Monte Carlo to sample from the posterior distributions of all parameters, including the number of states. Additionally, we introduce repulsive priors for the state parameters in HMMs, and hence avoid overfitting issues and promote parsimonious models with dissimilar state components. We perform an extensive simulation study comparing performance of models with independent and repulsive prior distributions on the state parameters, and demonstrate our proposed framework on two ecological case studies: GPS tracking data on muskox in Antarctica and acoustic data on Cape gannets in South Africa. Our results highlight how our framework effectively explores the model space, defined by models with different latent state dimensions, while leading to latent states that are distinguished better and hence are more interpretable, enabling better understanding of complex dynamic systems."}, "https://arxiv.org/abs/2205.01061": {"title": "Robust inference for matching under rolling enrollment", "link": "https://arxiv.org/abs/2205.01061", "description": "arXiv:2205.01061v3 Announce Type: replace \nAbstract: Matching in observational studies faces complications when units enroll in treatment on a rolling basis. While each treated unit has a specific time of entry into the study, control units each have many possible comparison, or \"pseudo-treatment,\" times. The recent GroupMatch framework (Pimentel et al., 2020) solves this problem by searching over all possible pseudo-treatment times for each control and selecting those permitting the closest matches based on covariate histories. However, valid methods of inference have been described only for special cases of the general GroupMatch design, and these rely on strong assumptions. We provide three important innovations to address these problems. First, we introduce a new design, GroupMatch with instance replacement, that allows additional flexibility in control selection and proves more amenable to analysis. Second, we propose a block bootstrap approach for inference in GroupMatch with instance replacement and demonstrate that it accounts properly for complex correlations across matched sets. 
Third, we develop a permutation-based falsification test to detect possible violations of the important timepoint agnosticism assumption underpinning GroupMatch, which requires homogeneity of potential outcome means across time. Via simulation and a case study of the impact of short-term injuries on batting performance in major league baseball, we demonstrate the effectiveness of our methods for data analysis in practice."}, "https://arxiv.org/abs/2210.00091": {"title": "Factorized Fusion Shrinkage for Dynamic Relational Data", "link": "https://arxiv.org/abs/2210.00091", "description": "arXiv:2210.00091v3 Announce Type: replace \nAbstract: Modern data science applications often involve complex relational data with dynamic structures. An abrupt change in such dynamic relational data is typically observed in systems that undergo regime changes due to interventions. In such a case, we consider a factorized fusion shrinkage model in which all decomposed factors are dynamically shrunk towards group-wise fusion structures, where the shrinkage is obtained by applying global-local shrinkage priors to the successive differences of the row vectors of the factorized matrices. The proposed priors enjoy many favorable properties in comparison and clustering of the estimated dynamic latent factors. Comparing estimated latent factors involves both adjacent and long-term comparisons, with the time range of comparison considered as a variable. Under certain conditions, we demonstrate that the posterior distribution attains the minimax optimal rate up to logarithmic factors. In terms of computation, we present a structured mean-field variational inference framework that balances optimal posterior inference with computational scalability, exploiting both the dependence among components and across time. The framework can accommodate a wide variety of models, including dynamic matrix factorization, latent space models for networks and low-rank tensors. The effectiveness of our methodology is demonstrated through extensive simulations and real-world data analysis."}, "https://arxiv.org/abs/2210.07491": {"title": "Latent process models for functional network data", "link": "https://arxiv.org/abs/2210.07491", "description": "arXiv:2210.07491v3 Announce Type: replace \nAbstract: Network data are often sampled with auxiliary information or collected through the observation of a complex system over time, leading to multiple network snapshots indexed by a continuous variable. Many methods in statistical network analysis are traditionally designed for a single network, and can be applied to an aggregated network in this setting, but that approach can miss important functional structure. Here we develop an approach to estimating the expected network explicitly as a function of a continuous index, be it time or another indexing variable. We parameterize the network expectation through low dimensional latent processes, whose components we represent with a fixed, finite-dimensional functional basis. 
We derive a gradient descent estimation algorithm, establish theoretical guarantees for recovery of the low dimensional structure, compare our method to competitors, and apply it to a data set of international political interactions over time, showing our proposed method to adapt well to data, outperform competitors, and provide interpretable and meaningful results."}, "https://arxiv.org/abs/2306.16549": {"title": "UTOPIA: Universally Trainable Optimal Prediction Intervals Aggregation", "link": "https://arxiv.org/abs/2306.16549", "description": "arXiv:2306.16549v2 Announce Type: replace \nAbstract: Uncertainty quantification in prediction presents a compelling challenge with vast applications across various domains, including biomedical science, economics, and weather forecasting. There exists a wide array of methods for constructing prediction intervals, such as quantile regression and conformal prediction. However, practitioners often face the challenge of selecting the most suitable method for a specific real-world data problem. In response to this dilemma, we introduce a novel and universally applicable strategy called Universally Trainable Optimal Predictive Intervals Aggregation (UTOPIA). This technique excels in efficiently aggregating multiple prediction intervals while maintaining a small average width of the prediction band and ensuring coverage. UTOPIA is grounded in linear or convex programming, making it straightforward to train and implement. In the specific case where the prediction methods are elementary basis functions, as in kernel and spline bases, our method becomes the construction of a prediction band. Our proposed methodologies are supported by theoretical guarantees on the coverage probability and the average width of the aggregated prediction interval, which are detailed in this paper. The practicality and effectiveness of UTOPIA are further validated through its application to synthetic data and two real-world datasets in finance and macroeconomics."}, "https://arxiv.org/abs/2309.08783": {"title": "Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression", "link": "https://arxiv.org/abs/2309.08783", "description": "arXiv:2309.08783v4 Announce Type: replace \nAbstract: Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice. For example, Aphasia Quotient (AQ) is a critical measure of language impairment and informs treatment decisions, but it is challenging to measure in stroke patients. It is of interest to use high-resolution T2 neuroimages of brain damage to predict AQ. However, sparse regression models show marked evidence of heteroscedastic error even after transformations are applied. This violation of the homoscedasticity assumption can lead to bias in estimated coefficients, prediction intervals (PI) with improper length, and increased type I errors. Bayesian heteroscedastic linear regression models relax the homoscedastic error assumption but can enforce restrictive prior assumptions on parameters, and many are computationally infeasible in the high-dimensional setting. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. 
H-PROBE is a computationally efficient maximum a posteriori estimation approach that requires minimal prior assumptions and can incorporate covariates hypothesized to impact heterogeneity. We apply this method by using high-dimensional neuroimages to predict and provide PIs for AQ that accurately quantify predictive uncertainty. Our analysis demonstrates that H-PROBE can provide narrower PI widths than standard methods without sacrificing coverage. Narrower PIs are clinically important for determining the risk of moderate to severe aphasia. Additionally, through extensive simulation studies, we exhibit that H-PROBE results in superior prediction, variable selection, and predictive inference compared to alternative methods."}, "https://arxiv.org/abs/2311.13327": {"title": "Regressions under Adverse Conditions", "link": "https://arxiv.org/abs/2311.13327", "description": "arXiv:2311.13327v2 Announce Type: replace \nAbstract: We introduce a new regression method that relates the mean of an outcome variable to covariates, given the \"adverse condition\" that a distress variable falls in its tail. This allows us to tailor classical mean regressions to adverse economic scenarios, which receive increasing interest in managing macroeconomic and financial risks, among many others. In the terminology of the systemic risk literature, our method can be interpreted as a regression for the Marginal Expected Shortfall. We propose a two-step procedure to estimate the new models, show consistency and asymptotic normality of the estimator, and propose feasible inference under weak conditions allowing for cross-sectional and time series applications. The accuracy of the asymptotic approximations of the two-step estimator is verified in simulations. Two empirical applications show that our regressions under adverse conditions are valuable in such diverse fields as the study of the relation between systemic risk and asset price bubbles, and dissecting macroeconomic growth vulnerabilities into individual components."}, "https://arxiv.org/abs/2311.13556": {"title": "Universally Optimal Multivariate Crossover Designs", "link": "https://arxiv.org/abs/2311.13556", "description": "arXiv:2311.13556v2 Announce Type: replace \nAbstract: In this article, universally optimal multivariate crossover designs are studied. The multiple response crossover design is motivated by a $3 \\times 3$ crossover setup, where the effect of $3$ doses of an oral drug is studied on gene expressions related to mucosal inflammation. Subjects are assigned to three treatment sequences and response measurements on $5$ different gene expressions are taken from each subject in each of the $3$ time periods. To model multiple or $g$ responses, where $g>1$, in a crossover setup, a multivariate fixed effect model with both direct and carryover treatment effects is considered. It is assumed that there are non-zero within-response correlations, while between-response correlations are taken to be zero. The information matrix corresponding to the direct effects is obtained and some results are studied. The information matrix in the multivariate case is shown to differ from the univariate case, particularly in the completely symmetric property. 
For the $g>1$ case, with $t$ treatments and $p$ periods, for $p=t \geq 3$, the design represented by a Type $\rm{I}$ orthogonal array of strength $2$ is proved to be universally optimal over the class of binary designs, for the direct treatment effects."}, "https://arxiv.org/abs/2309.07692": {"title": "A minimum Wasserstein distance approach to Fisher's combination of independent discrete p-values", "link": "https://arxiv.org/abs/2309.07692", "description": "arXiv:2309.07692v2 Announce Type: replace-cross \nAbstract: This paper introduces a comprehensive framework to adjust a discrete test statistic for improving its hypothesis testing procedure. The adjustment minimizes the Wasserstein distance to a null-approximating continuous distribution, tackling some fundamental challenges inherent in combining statistical significances derived from discrete distributions. The related theory justifies Lancaster's mid-p and mean-value chi-squared statistics for Fisher's combination as special cases. However, in order to counter the conservative nature of Lancaster's testing procedures, we propose an updated null-approximating distribution. It is achieved by further minimizing the Wasserstein distance to the adjusted statistics within a proper distribution family. Specifically, in the context of Fisher's combination, we propose an optimal gamma distribution as a substitute for the traditionally used chi-squared distribution. This new approach yields an asymptotically consistent test that significantly improves type I error control and enhances statistical power."}, "https://arxiv.org/abs/2407.11035": {"title": "Optimal estimators of cross-partial derivatives and surrogates of functions", "link": "https://arxiv.org/abs/2407.11035", "description": "arXiv:2407.11035v1 Announce Type: new \nAbstract: Computing cross-partial derivatives using fewer model runs is relevant in modeling, such as stochastic approximation, derivative-based ANOVA, exploring complex models, and active subspaces. This paper introduces surrogates of all the cross-partial derivatives of functions by evaluating such functions at $N$ randomized points and using a set of $L$ constraints. Randomized points rely on independent, central, and symmetric variables. The associated estimators, based on $NL$ model runs, reach the optimal rates of convergence (i.e., $\mathcal{O}(N^{-1})$), and the biases of our approximations do not suffer from the curse of dimensionality for a wide class of functions. Such results are used for i) computing the main and upper-bounds of sensitivity indices, and ii) deriving emulators of simulators or surrogates of functions thanks to the derivative-based ANOVA. Simulations are presented to show the accuracy of our emulators and estimators of sensitivity indices. The plug-in estimates of indices using the U-statistics of one sample are numerically much more stable."}, "https://arxiv.org/abs/2407.11094": {"title": "Robust Score-Based Quickest Change Detection", "link": "https://arxiv.org/abs/2407.11094", "description": "arXiv:2407.11094v1 Announce Type: new \nAbstract: Methods in the field of quickest change detection rapidly detect in real-time a change in the data-generating distribution of an online data stream. Existing methods have been able to detect this change point when the densities of the pre- and post-change distributions are known. Recent work has extended these results to the case where the pre- and post-change distributions are known only by their score functions. 
This work considers the case where the pre- and post-change score functions are known only to correspond to distributions in two disjoint sets. This work employs a pair of \"least-favorable\" distributions to robustify the existing score-based quickest change detection algorithm, the properties of which are studied. This paper calculates the least-favorable distributions for specific model classes and provides methods of estimating the least-favorable distributions for common constructions. Simulation results are provided demonstrating the performance of our robust change detection algorithm."}, "https://arxiv.org/abs/2407.11173": {"title": "Approximate Bayesian inference for high-resolution spatial disaggregation using alternative data sources", "link": "https://arxiv.org/abs/2407.11173", "description": "arXiv:2407.11173v1 Announce Type: new \nAbstract: This paper addresses the challenge of obtaining precise demographic information at a fine-grained spatial level, a necessity for planning localized public services such as water distribution networks, or understanding local human impacts on the ecosystem. While population sizes are commonly available for large administrative areas, such as wards in India, practical applications often demand knowledge of population density at smaller spatial scales. We explore the integration of alternative data sources, specifically satellite-derived products, including land cover, land use, street density, building heights, vegetation coverage, and drainage density. Using a case study focused on Bangalore City, India, with a ward-level population dataset for 198 wards and satellite-derived sources covering 786,702 pixels at a resolution of 30mX30m, we propose a semiparametric Bayesian spatial regression model for obtaining pixel-level population estimates. Given the high dimensionality of the problem, exact Bayesian inference is deemed impractical; we discuss an approximate Bayesian inference scheme based on the recently proposed max-and-smooth approach, a combination of Laplace approximation and Markov chain Monte Carlo. A simulation study validates the reasonable performance of our inferential approach. Mapping pixel-level estimates to the ward level demonstrates the effectiveness of our method in capturing the spatial distribution of population sizes. While our case study focuses on a demographic application, the methodology developed here readily applies to count-type spatial datasets from various scientific disciplines, where high-resolution alternative data sources are available."}, "https://arxiv.org/abs/2407.11342": {"title": "GenTwoArmsTrialSize: An R Statistical Software Package to estimate Generalized Two Arms Randomized Clinical Trial Sample Size", "link": "https://arxiv.org/abs/2407.11342", "description": "arXiv:2407.11342v1 Announce Type: new \nAbstract: The precise calculation of sample sizes is a crucial aspect in the design of clinical trials particularly for pharmaceutical statisticians. While various R statistical software packages have been developed by researchers to estimate required sample sizes under different assumptions, there has been a notable absence of a standalone R statistical software package that allows researchers to comprehensively estimate sample sizes under generalized scenarios. This paper introduces the R statistical software package \"GenTwoArmsTrialSize\" available on the Comprehensive R Archive Network (CRAN), designed for estimating the required sample size in two-arm clinical trials. 
The package incorporates four endpoint types, two trial treatment designs, four types of hypothesis tests, as well as considerations for noncompliance and loss of follow-up, providing researchers with the capability to estimate sample sizes across 24 scenarios. To facilitate understanding of the estimation process and illuminate the impact of noncompliance and loss of follow-up on the size and variability of estimations, the paper includes four hypothetical examples and one applied example. The discussion encompasses the package's limitations and outlines directions for future extensions and improvements."}, "https://arxiv.org/abs/2407.11614": {"title": "Restricted mean survival times for comparing grouped survival data: a Bayesian nonparametric approach", "link": "https://arxiv.org/abs/2407.11614", "description": "arXiv:2407.11614v1 Announce Type: new \nAbstract: Comparing survival experiences of different groups of data is an important issue in several applied problems. A typical example is where one wishes to investigate treatment effects. Here we propose a new Bayesian approach based on restricted mean survival times (RMST). A nonparametric prior is specified for the underlying survival functions: this extends the standard univariate neutral to the right processes to a multivariate setting and induces a prior for the RMST's. We rely on a representation as exponential functionals of compound subordinators to determine closed form expressions of prior and posterior mixed moments of RMST's. These results are used to approximate functionals of the posterior distribution of RMST's and are essential for comparing time--to--event data arising from different samples."}, "https://arxiv.org/abs/2407.11634": {"title": "A goodness-of-fit test for testing exponentiality based on normalized dynamic survival extropy", "link": "https://arxiv.org/abs/2407.11634", "description": "arXiv:2407.11634v1 Announce Type: new \nAbstract: The cumulative residual extropy (CRJ) is a measure of uncertainty that serves as an alternative to extropy. It replaces the probability density function with the survival function in the expression of extropy. This work introduces a new concept called normalized dynamic survival extropy (NDSE), a dynamic variation of CRJ. We observe that NDSE is equivalent to CRJ of the random variable of interest $X_{[t]}$ in the age replacement model at a fixed time $t$. Additionally, we have demonstrated that NDSE remains constant exclusively for exponential distribution at any time. We categorize two classes, INDSE and DNDSE, based on their increasing and decreasing NDSE values. Next, we present a non-parametric test to assess whether a distribution follows an exponential pattern against INDSE. We derive the exact and asymptotic distribution for the test statistic $\\widehat{\\Delta}^*$. Additionally, a test for asymptotic behavior is presented in the paper for right censoring data. Finally, we determine the critical values and power of our exact test through simulation. The simulation demonstrates that the suggested test is easy to compute and has significant statistical power, even with small sample sizes. We also conduct a power comparison analysis among other tests, which shows better power for the proposed test against other alternatives mentioned in this paper. 
Some numerical real-life examples validating the test are also included."}, "https://arxiv.org/abs/2407.11646": {"title": "Discovery and inference of possibly bi-directional causal relationships with invalid instrumental variables", "link": "https://arxiv.org/abs/2407.11646", "description": "arXiv:2407.11646v1 Announce Type: new \nAbstract: Learning causal relationships between pairs of complex traits from observational studies is of great interest across various scientific domains. However, most existing methods assume the absence of unmeasured confounding and restrict causal relationships between two traits to be uni-directional, which may be violated in real-world systems. In this paper, we address the challenge of causal discovery and effect inference for two traits while accounting for unmeasured confounding and potential feedback loops. By leveraging possibly invalid instrumental variables, we provide identification conditions for causal parameters in a model that allows for bi-directional relationships, and we also establish identifiability of the causal direction under the introduced conditions. Then we propose a data-driven procedure to detect the causal direction and provide inference results about causal effects along the identified direction. We show that our method consistently recovers the true direction and produces valid confidence intervals for the causal effect. We conduct extensive simulation studies to show that our proposal outperforms existing methods. We finally apply our method to analyze real data sets from UK Biobank."}, "https://arxiv.org/abs/2407.11674": {"title": "Effect Heterogeneity with Earth Observation in Randomized Controlled Trials: Exploring the Role of Data, Model, and Evaluation Metric Choice", "link": "https://arxiv.org/abs/2407.11674", "description": "arXiv:2407.11674v1 Announce Type: new \nAbstract: Many social and environmental phenomena are associated with macroscopic changes in the built environment, captured by satellite imagery on a global scale and with daily temporal resolution. While widely used for prediction, these images and especially image sequences remain underutilized for causal inference, especially in the context of randomized controlled trials (RCTs), where causal identification is established by design. In this paper, we develop and compare a set of general tools for analyzing Conditional Average Treatment Effects (CATEs) from temporal satellite data that can be applied to any RCT where geographical identifiers are available. Through a simulation study, we analyze different modeling strategies for estimating CATE in sequences of satellite images. We find that image sequence representation models with more parameters generally yield a higher ability to detect heterogeneity. To explore the role of model and data choice in practice, we apply the approaches to two influential RCTs--Banerjee et al. (2015), a poverty study in Cusco, Peru, and Bolsen et al. (2014), a water conservation experiment in the USA. We benchmark our image sequence models against image-only, tabular-only, and combined image-tabular data sources. We detect a stronger heterogeneity signal in the Peru experiment and for image sequence over image-only data. Land cover classifications over satellite images facilitate interpretation of what image features drive heterogeneity. These satellite-based CATE models enable generalizing the RCT results to larger geographical areas outside the original experimental context. 
While promising, transportability estimates highlight the need for sensitivity analysis. Overall, this paper shows how satellite sequence data can be incorporated into the analysis of RCTs, and how choices regarding satellite image data and model can be improved using evaluation metrics."}, "https://arxiv.org/abs/2407.11729": {"title": "Using shrinkage methods to estimate treatment effects in overlapping subgroups in randomized clinical trials with a time-to-event endpoint", "link": "https://arxiv.org/abs/2407.11729", "description": "arXiv:2407.11729v1 Announce Type: new \nAbstract: In randomized controlled trials, forest plots are frequently used to investigate the homogeneity of treatment effect estimates in subgroups. However, the interpretation of subgroup-specific treatment effect estimates requires great care due to the smaller sample size of subgroups and the large number of investigated subgroups. Bayesian shrinkage methods have been proposed to address these issues, but they often focus on disjoint subgroups while subgroups displayed in forest plots are overlapping, i.e., each subject appears in multiple subgroups. In our approach, we first build a flexible Cox model based on all available observations, including categorical covariates that identify the subgroups of interest and their interactions with the treatment group variable. We explore both penalized partial likelihood estimation with a lasso or ridge penalty for treatment-by-covariate interaction terms, and Bayesian estimation with a regularized horseshoe prior. One advantage of the Bayesian approach is the ability to derive credible intervals for shrunken subgroup-specific estimates. In a second step, the Cox model is marginalized to obtain treatment effect estimates for all subgroups. We illustrate these methods using data from a randomized clinical trial in follicular lymphoma and evaluate their properties in a simulation study. In all simulation scenarios, the overall mean-squared error is substantially smaller for penalized and shrinkage estimators compared to the standard subgroup-specific treatment effect estimator but leads to some bias for heterogeneous subgroups. We recommend that subgroup-specific estimators, which are typically displayed in forest plots, are more routinely complemented by treatment effect estimators based on shrinkage methods. The proposed methods are implemented in the R package bonsaiforest."}, "https://arxiv.org/abs/2407.11765": {"title": "Nowcasting R&D Expenditures: A Machine Learning Approach", "link": "https://arxiv.org/abs/2407.11765", "description": "arXiv:2407.11765v1 Announce Type: new \nAbstract: Macroeconomic data are crucial for monitoring countries' performance and driving policy. However, traditional data acquisition processes are slow, subject to delays, and performed at a low frequency. We address this 'ragged-edge' problem with a two-step framework. The first step is a supervised learning model predicting observed low-frequency figures. We propose a neural-network-based nowcasting model that exploits mixed-frequency, high-dimensional data. The second step uses the elasticities derived from the previous step to interpolate unobserved high-frequency figures. We apply our method to nowcast countries' yearly research and development (R&D) expenditure series. These series are collected through infrequent surveys, making them ideal candidates for this task. 
We exploit a range of predictors, chiefly Internet search volume data, and document the relevance of these data in improving out-of-sample predictions. Furthermore, we leverage the high frequency of our data to derive monthly estimates of R&D expenditures, which are currently unobserved. We compare our results with those obtained from the classical regression-based and the sparse temporal disaggregation methods. Finally, we validate our results by reporting a strong correlation with monthly R&D employment data."}, "https://arxiv.org/abs/2407.11937": {"title": "Generalized Difference-in-Differences", "link": "https://arxiv.org/abs/2407.11937", "description": "arXiv:2407.11937v1 Announce Type: new \nAbstract: In many social science applications, researchers use the difference-in-differences (DID) estimator to establish causal relationships, exploiting cross-sectional variation in a baseline factor and temporal variation in exposure to an event that presumably may affect all units. This approach, often referred to as generalized DID (GDID), differs from canonical DID in that it lacks a \"clean control group\" unexposed to the event after the event occurs. In this paper, we clarify GDID as a research design in terms of its data structure, feasible estimands, and identifying assumptions that allow the DID estimator to recover these estimands. We frame GDID as a factorial design with two factors: the baseline factor, denoted by $G$, and the exposure level to the event, denoted by $Z$, and define effect modification and causal interaction as the associative and causal effects of $G$ on the effect of $Z$, respectively. We show that under the canonical no anticipation and parallel trends assumptions, the DID estimator identifies only the effect modification of $G$ in GDID, and propose an additional generalized parallel trends assumption to identify causal interaction. Moreover, we show that the canonical DID research design can be framed as a special case of the GDID research design with an additional exclusion restriction assumption, thereby reconciling the two approaches. We illustrate these findings with empirical examples from economics and political science, and provide recommendations for improving practice and interpretation under GDID."}, "https://arxiv.org/abs/2407.11032": {"title": "Mechanisms for Data Sharing in Collaborative Causal Inference (Extended Version)", "link": "https://arxiv.org/abs/2407.11032", "description": "arXiv:2407.11032v1 Announce Type: cross \nAbstract: Collaborative causal inference (CCI) is a federated learning method for pooling data from multiple, often self-interested, parties, to achieve a common learning goal over causal structures, e.g. estimation and optimization of treatment variables in a medical setting. Since obtaining data can be costly for the participants and sharing unique data poses the risk of losing competitive advantages, motivating the participation of all parties through equitable rewards and incentives is necessary. This paper devises an evaluation scheme to measure the value of each party's data contribution to the common learning task, tailored to causal inference's statistical demands, by comparing completed partially directed acyclic graphs (CPDAGs) inferred from observational data contributed by the participants. The Data Valuation Scheme thus obtained can then be used to introduce mechanisms that incentivize the agents to contribute data. 
It can be leveraged to reward agents fairly, according to the quality of their data, or to maximize all agents' data contributions."}, "https://arxiv.org/abs/2407.11056": {"title": "Industrial-Grade Time-Dependent Counterfactual Root Cause Analysis through the Unanticipated Point of Incipient Failure: a Proof of Concept", "link": "https://arxiv.org/abs/2407.11056", "description": "arXiv:2407.11056v1 Announce Type: cross \nAbstract: This paper describes the development of a counterfactual Root Cause Analysis diagnosis approach for an industrial multivariate time series environment. It drives the attention toward the Point of Incipient Failure, which is the moment in time when the anomalous behavior is first observed, and where the root cause is assumed to be found before the issue propagates. The paper presents the elementary but essential concepts of the solution and illustrates them experimentally on a simulated setting. Finally, it discusses avenues of improvement for the maturity of the causal technology to meet the robustness challenges of increasingly complex environments in the industry."}, "https://arxiv.org/abs/2407.11426": {"title": "Generally-Occurring Model Change for Robust Counterfactual Explanations", "link": "https://arxiv.org/abs/2407.11426", "description": "arXiv:2407.11426v1 Announce Type: cross \nAbstract: With the increasing impact of algorithmic decision-making on human lives, the interpretability of models has become a critical issue in machine learning. Counterfactual explanation is an important method in the field of interpretable machine learning, which can not only help users understand why machine learning models make specific decisions, but also help users understand how to change these decisions. Naturally, it is an important task to study the robustness of counterfactual explanation generation algorithms to model changes. Previous literature has proposed the concept of Naturally-Occurring Model Change, which has given us a deeper understanding of robustness to model change. In this paper, we first further generalize the concept of Naturally-Occurring Model Change, proposing a more general concept of model parameter changes, Generally-Occurring Model Change, which has a wider range of applicability. We also prove the corresponding probabilistic guarantees. In addition, we consider a more specific problem, data set perturbation, and give relevant theoretical results by combining optimization theory."}, "https://arxiv.org/abs/2407.11465": {"title": "Testing by Betting while Borrowing and Bargaining", "link": "https://arxiv.org/abs/2407.11465", "description": "arXiv:2407.11465v1 Announce Type: cross \nAbstract: Testing by betting has been a cornerstone of the game-theoretic statistics literature. In this framework, a betting score (or more generally an e-process), as opposed to a traditional p-value, is used to quantify the evidence against a null hypothesis: the higher the betting score, the more money one has made betting against the null, and thus the larger the evidence that the null is false. A key ingredient assumed throughout past works is that one cannot bet more money than one currently has. In this paper, we ask what happens if the bettor is allowed to borrow money after going bankrupt, allowing further financial flexibility in this game of hypothesis testing. We propose various definitions of (adjusted) evidence relative to the wealth borrowed, indebted, and accumulated. 
We also ask what happens if the bettor can \"bargain\", in order to obtain odds better than specified by the null hypothesis. The adjustment of wealth in order to serve as evidence appeals to the characterization of arbitrage, interest rates, and num\\'eraire-adjusted pricing in this setting."}, "https://arxiv.org/abs/2407.11676": {"title": "SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation", "link": "https://arxiv.org/abs/2407.11676", "description": "arXiv:2407.11676v1 Announce Type: cross \nAbstract: Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-Bench, we propose a framework to evaluate DA methods and present a fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data with specific feature extraction. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-Bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors. SKADA-Bench is available on GitHub at https://github.com/scikit-adaptation/skada-bench."}, "https://arxiv.org/abs/2407.11887": {"title": "On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting", "link": "https://arxiv.org/abs/2407.11887", "description": "arXiv:2407.11887v1 Announce Type: cross \nAbstract: The prediction of extreme events in time series is a fundamental problem arising in many financial, scientific, engineering, and other applications. We begin by establishing a general Neyman-Pearson-type characterization of optimal extreme event predictors in terms of density ratios. This yields new insights and several closed-form optimal extreme event predictors for additive models. These results naturally extend to time series, where we study optimal extreme event prediction for heavy-tailed autoregressive and moving average models. Using a uniform law of large numbers for ergodic time series, we establish the asymptotic optimality of an empirical version of the optimal predictor for autoregressive models. Using multivariate regular variation, we also obtain expressions for the optimal extremal precision in heavy-tailed infinite moving averages, which provide theoretical bounds on the ability to predict extremes in this general class of models. The developed theory and methodology are applied to the important problem of solar flare prediction based on the state-of-the-art GOES satellite flux measurements of the Sun. 
Our results demonstrate the success and limitations of long-memory autoregressive as well as long-range dependent heavy-tailed FARIMA models for the prediction of extreme solar flares."}, "https://arxiv.org/abs/2407.11901": {"title": "Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows", "link": "https://arxiv.org/abs/2407.11901", "description": "arXiv:2407.11901v1 Announce Type: cross \nAbstract: We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG), a system of a backward Hamilton-Jacobi (HJ) and a forward continuity partial differential equations (PDEs) whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner even for learning high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures."}, "https://arxiv.org/abs/2209.11691": {"title": "Linear multidimensional regression with interactive fixed-effects", "link": "https://arxiv.org/abs/2209.11691", "description": "arXiv:2209.11691v3 Announce Type: replace \nAbstract: This paper studies a linear and additively separable model for multidimensional panel data of three or more dimensions with unobserved interactive fixed effects. Two approaches are considered to account for these unobserved interactive fixed-effects when estimating coefficients on the observed covariates. First, the model is embedded within the standard two dimensional panel framework and restrictions are formed under which the factor structure methods in Bai (2009) lead to consistent estimation of model parameters, but at slow rates of convergence. The second approach develops a kernel weighted fixed-effects method that is more robust to the multidimensional nature of the problem and can achieve the parametric rate of consistency under certain conditions. Theoretical results and simulations show some benefits to standard two-dimensional panel methods when the structure of the interactive fixed-effect term is known, but also highlight how the kernel weighted method performs well without knowledge of this structure. 
The methods are implemented to estimate the demand elasticity for beer."}, "https://arxiv.org/abs/2305.08488": {"title": "Hierarchical DCC-HEAVY Model for High-Dimensional Covariance Matrices", "link": "https://arxiv.org/abs/2305.08488", "description": "arXiv:2305.08488v2 Announce Type: replace \nAbstract: We introduce an HD DCC-HEAVY class of hierarchical-type factor models for high-dimensional covariance matrices, employing the realized measures built from higher-frequency data. The modelling approach features straightforward estimation and forecasting schemes, independent of the cross-sectional dimension of the assets under consideration, and accounts for sophisticated asymmetric dynamics in the covariances. Empirical analyses suggest that the HD DCC-HEAVY models have a better in-sample fit and deliver statistically and economically significant out-of-sample gains relative to the existing hierarchical factor model and standard benchmarks. The results are robust under different frequencies and market conditions."}, "https://arxiv.org/abs/2306.02813": {"title": "Variational inference based on a subclass of closed skew normals", "link": "https://arxiv.org/abs/2306.02813", "description": "arXiv:2306.02813v2 Announce Type: replace \nAbstract: Gaussian distributions are widely used in Bayesian variational inference to approximate intractable posterior densities, but the ability to accommodate skewness can improve approximation accuracy significantly, when data or prior information is scarce. We study the properties of a subclass of closed skew normals constructed using affine transformation of independent standardized univariate skew normals as the variational density, and illustrate how it provides increased flexibility and accuracy in approximating the joint posterior in various applications, by overcoming limitations in existing skew normal variational approximations. The evidence lower bound is optimized using stochastic gradient ascent, where analytic natural gradient updates are derived. We also demonstrate how problems in maximum likelihood estimation of skew normal parameters occur similarly in stochastic variational inference, and can be resolved using the centered parametrization. Supplemental materials are available online."}, "https://arxiv.org/abs/2312.15494": {"title": "Variable Selection in High Dimensional Linear Regressions with Parameter Instability", "link": "https://arxiv.org/abs/2312.15494", "description": "arXiv:2312.15494v2 Announce Type: replace \nAbstract: This paper considers the problem of variable selection allowing for parameter instability. It distinguishes between signal and pseudo-signal variables that are correlated with the target variable, and noise variables that are not, and investigates the asymptotic properties of the One Covariate at a Time Multiple Testing (OCMT) method proposed by Chudik et al. (2018) under parameter instability. It is established that OCMT continues to asymptotically select an approximating model that includes all the signals and none of the noise variables. Properties of post-selection regressions are also investigated, and the in-sample fit of the selected regression is shown to have the oracle property. The theoretical results support the use of unweighted observations at the selection stage of OCMT, whilst applying down-weighting of observations only at the forecasting stage. 
Monte Carlo and empirical applications show that OCMT without down-weighting at the selection stage yields smaller mean squared forecast errors compared to Lasso, Adaptive Lasso, and boosting."}, "https://arxiv.org/abs/2401.14558": {"title": "Simulation Model Calibration with Dynamic Stratification and Adaptive Sampling", "link": "https://arxiv.org/abs/2401.14558", "description": "arXiv:2401.14558v2 Announce Type: replace \nAbstract: Calibrating simulation models that take large quantities of multi-dimensional data as input is a hard simulation optimization problem. Existing adaptive sampling strategies offer a methodological solution. However, they may not sufficiently reduce the computational cost for estimation and solution algorithm's progress within a limited budget due to extreme noise levels and heteroskedasticity of system responses. We propose integrating stratification with adaptive sampling for the purpose of efficiency in optimization. Stratification can exploit local dependence in the simulation inputs and outputs. Yet, the state-of-the-art does not provide a full capability to adaptively stratify the data as different solution alternatives are evaluated. We devise two procedures for data-driven calibration problems that involve a large dataset with multiple covariates to calibrate models within a fixed overall simulation budget. The first approach dynamically stratifies the input data using binary trees, while the second approach uses closed-form solutions based on linearity assumptions between the objective function and concomitant variables. We find that dynamic adjustment of the stratification structure accelerates optimization and reduces run-to-run variability in generated solutions. Our case study for calibrating a wind power simulation model, widely used in the wind industry, using the proposed stratified adaptive sampling, shows better-calibrated parameters under a limited budget."}, "https://arxiv.org/abs/2407.12100": {"title": "Agglomerative Clustering of Simulation Output Distributions Using Regularized Wasserstein Distance", "link": "https://arxiv.org/abs/2407.12100", "description": "arXiv:2407.12100v1 Announce Type: new \nAbstract: We investigate the use of clustering methods on data produced by a stochastic simulator, with applications in anomaly detection, pre-optimization, and online monitoring. We introduce an agglomerative clustering algorithm that clusters multivariate empirical distributions using the regularized Wasserstein distance and apply the proposed methodology to a call-center model."}, "https://arxiv.org/abs/2407.12114": {"title": "Bounds on causal effects in $2^{K}$ factorial experiments with non-compliance", "link": "https://arxiv.org/abs/2407.12114", "description": "arXiv:2407.12114v1 Announce Type: new \nAbstract: Factorial experiments are ubiquitous in the social and biomedical sciences, but when units fail to comply with each assigned factor, identification and estimation of the average treatment effects become impossible. Leveraging an instrumental variables approach, previous studies have shown how to identify and estimate the causal effect of treatment uptake among respondents who comply with treatment. A major caveat is that these identification results rely on strong assumptions on the effect of randomization on treatment uptake. 
This paper shows how to bound these complier average treatment effects under more mild assumptions on non-compliance."}, "https://arxiv.org/abs/2407.12175": {"title": "Temporal Configuration Model: Statistical Inference and Spreading Processes", "link": "https://arxiv.org/abs/2407.12175", "description": "arXiv:2407.12175v1 Announce Type: new \nAbstract: We introduce a family of parsimonious network models that are intended to generalize the configuration model to temporal settings. We present consistent estimators for the model parameters and perform numerical simulations to illustrate the properties of the estimators on finite samples. We also develop analytical solutions for basic and effective reproductive numbers for the early stage of discrete-time SIR spreading process. We apply three distinct temporal configuration models to empirical student proximity networks and compare their performance."}, "https://arxiv.org/abs/2407.12348": {"title": "MM Algorithms for Statistical Estimation in Quantile Regression", "link": "https://arxiv.org/abs/2407.12348", "description": "arXiv:2407.12348v1 Announce Type: new \nAbstract: Quantile regression is a robust and practically useful way to efficiently model quantile varying correlation and predict varied response quantiles of interest. This article constructs and tests MM algorithms, which are simple to code and have been suggested superior to some other prominent quantile regression methods in nonregularized problems, in an array of quantile regression settings including linear (modeling different quantile coefficients both separately and simultaneously), nonparametric, regularized, and monotone quantile regression. Applications to various real data sets and two simulation studies comparing MM to existing tested methods have corroborated our algorithms' effectiveness. We have made one key advance by generalizing our MM algorithm to efficiently fit easy-to-predict-and-interpret parametric quantile regression models for data sets exhibiting manifest complicated nonlinear correlation patterns, which has not yet been covered by current literature to the best of our knowledge."}, "https://arxiv.org/abs/2407.12422": {"title": "Conduct Parameter Estimation in Homogeneous Goods Markets with Equilibrium Existence and Uniqueness Conditions: The Case of Log-linear Specification", "link": "https://arxiv.org/abs/2407.12422", "description": "arXiv:2407.12422v1 Announce Type: new \nAbstract: We propose a constrained generalized method of moments estimator (GMM) incorporating theoretical conditions for the unique existence of equilibrium prices for estimating conduct parameters in a log-linear model with homogeneous goods markets. First, we derive such conditions. Second, Monte Carlo simulations confirm that in a log-linear model, incorporating the conditions resolves the problems of implausibly low or negative values of conduct parameters."}, "https://arxiv.org/abs/2407.12557": {"title": "Comparing Homogeneous And Inhomogeneous Time Markov Chains For Modelling Degradation In Sewer Pipe Networks", "link": "https://arxiv.org/abs/2407.12557", "description": "arXiv:2407.12557v1 Announce Type: new \nAbstract: Sewer pipe systems are essential for social and economic welfare. Managing these systems requires robust predictive models for degradation behaviour. This study focuses on probability-based approaches, particularly Markov chains, for their ability to associate random variables with degradation. 
Literature predominantly uses homogeneous and inhomogeneous Markov chains for this purpose. However, their effectiveness in sewer pipe degradation modelling is still debatable. Some studies support homogeneous Markov chains, while others challenge their utility. We examine this issue using a large-scale sewer network in the Netherlands, incorporating historical inspection data. We model degradation with homogeneous discrete and continuous time Markov chains, and inhomogeneous-time Markov chains using Gompertz, Weibull, Log-Logistic and Log-Normal density functions. Our analysis suggests that, despite their higher computational requirements, inhomogeneous-time Markov chains are more appropriate for modelling the nonlinear stochastic characteristics related to sewer pipe degradation, particularly the Gompertz distribution. However, they pose a risk of over-fitting, necessitating significant improvements in parameter inference processes to effectively address this issue."}, "https://arxiv.org/abs/2407.12633": {"title": "Bayesian spatial functional data clustering: applications in disease surveillance", "link": "https://arxiv.org/abs/2407.12633", "description": "arXiv:2407.12633v1 Announce Type: new \nAbstract: Our method extends the application of random spanning trees to cases where the response variable belongs to the exponential family, making it suitable for a wide range of real-world scenarios, including non-Gaussian likelihoods. The proposed model addresses the limitations of previous spatial clustering methods by allowing all within-cluster model parameters to be cluster-specific, thus offering greater flexibility. Additionally, we propose a Bayesian inference algorithm that overcomes the computational challenges associated with the reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm by employing composition sampling and the integrated nested Laplace approximation (INLA) to compute the marginal distribution necessary for the acceptance probability. This enhancement improves the mixing and feasibility of Bayesian inference for complex models. We demonstrate the effectiveness of our approach through simulation studies and apply it to real-world disease mapping applications: COVID-19 in the United States of America, and dengue fever in the states of Minas Gerais and S\\~ao Paulo, Brazil. Our results highlight the model's capability to uncover meaningful spatial patterns and temporal dynamics in disease outbreaks, providing valuable insights for public health decision-making and resource allocation."}, "https://arxiv.org/abs/2407.12700": {"title": "Bayesian Joint Modeling of Interrater and Intrarater Reliability with Multilevel Data", "link": "https://arxiv.org/abs/2407.12700", "description": "arXiv:2407.12700v1 Announce Type: new \nAbstract: We formulate three generalized Bayesian models for analyzing interrater and intrarater reliability in the presence of multilevel data. Stan implementations of these models provide new estimates of interrater and intrarater reliability. We also derive formulas for calculating marginal correlations under each of the three models. Comparisons of the kappa estimates and marginal correlations across the different models are presented from two real-world datasets. 
Simulations demonstrate properties of the different measures of agreement under different model assumptions."}, "https://arxiv.org/abs/2407.12132": {"title": "Maximum-likelihood regression with systematic errors for astronomy and the physical sciences: I", "link": "https://arxiv.org/abs/2407.12132", "description": "arXiv:2407.12132v1 Announce Type: cross \nAbstract: The paper presents a new statistical method that enables the use of systematic errors in the maximum-likelihood regression of integer-count Poisson data to a parametric model. The method is primarily aimed at the characterization of the goodness-of-fit statistic in the presence of the over-dispersion that is induced by sources of systematic error, and is based on a quasi-maximum-likelihood method that retains the Poisson distribution of the data. We show that the Poisson deviance, which is the usual goodness-of-fit statistic and that is commonly referred to in astronomy as the Cash statistics, can be easily generalized in the presence of systematic errors, under rather general conditions. The method and the associated statistics are first developed theoretically, and then they are tested with the aid of numerical simulations and further illustrated with real-life data from astronomical observations. The statistical methods presented in this paper are intended as a simple general-purpose framework to include additional sources of uncertainty for the analysis of integer-count data in a variety of practical data analysis situations."}, "https://arxiv.org/abs/2407.12254": {"title": "COKE: Causal Discovery with Chronological Order and Expert Knowledge in High Proportion of Missing Manufacturing Data", "link": "https://arxiv.org/abs/2407.12254", "description": "arXiv:2407.12254v1 Announce Type: cross \nAbstract: Understanding causal relationships between machines is crucial for fault diagnosis and optimization in manufacturing processes. Real-world datasets frequently exhibit up to 90% missing data and high dimensionality from hundreds of sensors. These datasets also include domain-specific expert knowledge and chronological order information, reflecting the recording order across different machines, which is pivotal for discerning causal relationships within the manufacturing data. However, previous methods for handling missing data in scenarios akin to real-world conditions have not been able to effectively utilize expert knowledge. Conversely, prior methods that can incorporate expert knowledge struggle with datasets that exhibit missing values. Therefore, we propose COKE to construct causal graphs in manufacturing datasets by leveraging expert knowledge and chronological order among sensors without imputing missing data. Utilizing the characteristics of the recipe, we maximize the use of samples with missing values, derive embeddings from intersections with an initial graph that incorporates expert knowledge and chronological order, and create a sensor ordering graph. The graph-generating process has been optimized by an actor-critic architecture to obtain a final graph that has a maximum reward. Experimental evaluations in diverse settings of sensor quantities and missing proportions demonstrate that our approach compared with the benchmark methods shows an average improvement of 39.9% in the F1-score. Moreover, the F1-score improvement can reach 62.6% when considering the configuration similar to real-world datasets, and 85.0% in real-world semiconductor datasets. 
The source code is available at https://github.com/OuTingYun/COKE."}, "https://arxiv.org/abs/2407.12708": {"title": "An Approximation for the 32-point Discrete Fourier Transform", "link": "https://arxiv.org/abs/2407.12708", "description": "arXiv:2407.12708v1 Announce Type: cross \nAbstract: This brief note aims at condensing some results on the 32-point approximate DFT and discussing its arithmetic complexity."}, "https://arxiv.org/abs/2407.12751": {"title": "Scalable Monte Carlo for Bayesian Learning", "link": "https://arxiv.org/abs/2407.12751", "description": "arXiv:2407.12751v1 Announce Type: cross \nAbstract: This book aims to provide a graduate-level introduction to advanced topics in Markov chain Monte Carlo (MCMC) algorithms, as applied broadly in the Bayesian computational context. Most, if not all of these topics (stochastic gradient MCMC, non-reversible MCMC, continuous time MCMC, and new techniques for convergence assessment) have emerged as recently as the last decade, and have driven substantial recent practical and theoretical advances in the field. A particular focus is on methods that are scalable with respect to either the amount of data, or the data dimension, motivated by the emerging high-priority application areas in machine learning and AI."}, "https://arxiv.org/abs/2205.05002": {"title": "Estimating Discrete Games of Complete Information: Bringing Logit Back in the Game", "link": "https://arxiv.org/abs/2205.05002", "description": "arXiv:2205.05002v3 Announce Type: replace \nAbstract: Estimating discrete games of complete information is often computationally difficult due to partial identification and the absence of closed-form moment characterizations. For both unordered and ordered-actions games, I propose computationally tractable approaches to estimation and inference that leverage convex programming theory via logit expressions. These methods are scalable to settings with many players and actions as they remove the computational burden associated with equilibria enumeration, numerical simulation, and grid search. I use simulation and empirical examples to show that my approaches can be several orders of magnitude faster than existing approaches."}, "https://arxiv.org/abs/2303.02498": {"title": "A stochastic network approach to clustering and visualising single-cell genomic count data", "link": "https://arxiv.org/abs/2303.02498", "description": "arXiv:2303.02498v3 Announce Type: replace \nAbstract: Important tasks in the study of genomic data include the identification of groups of similar cells (for example by clustering), and visualisation of data summaries (for example by dimensional reduction). In this paper, we propose a novel approach to studying single-cell genomic data, by modelling the observed genomic data count matrix $\\mathbf{X}\\in\\mathbb{Z}_{\\geq0}^{p\\times n}$ as a bipartite network with multi-edges. Utilising this first-principles network representation of the raw data, we propose clustering single cells in a suitably identified $d$-dimensional Laplacian Eigenspace (LE) via a Gaussian mixture model (GMM-LE), and employing UMAP to non-linearly project the LE to two dimensions for visualisation (UMAP-LE). This LE representation of the data estimates transformed latent positions (of genes and cells), under a latent position model of nodes in a bipartite stochastic network. 
We demonstrate how these estimated latent positions can enable fine-grained clustering and visualisation of single-cell genomic data, by application to data from three recent genomics studies in different biological contexts. In each data application, clusters of cells independently learned by our proposed methodology are found to correspond to cells expressing specific marker genes that were independently defined by domain experts. In this validation setting, our proposed clustering methodology outperforms the industry-standard for these data. Furthermore, we validate components of the LE decomposition of the data by contrasting healthy cells from normal and at-risk groups in a machine-learning model, thereby generating an LE cancer biomarker that significantly predicts long-term patient survival outcome in an independent validation dataset."}, "https://arxiv.org/abs/2304.02171": {"title": "Faster estimation of dynamic discrete choice models using index invertibility", "link": "https://arxiv.org/abs/2304.02171", "description": "arXiv:2304.02171v3 Announce Type: replace \nAbstract: Many estimators of dynamic discrete choice models with persistent unobserved heterogeneity have desirable statistical properties but are computationally intensive. In this paper we propose a method to quicken estimation for a broad class of dynamic discrete choice problems by exploiting semiparametric index restrictions. Specifically, we propose an estimator for models whose reduced form parameters are invertible functions of one or more linear indices (Ahn, Ichimura, Powell and Ruud 2018), a property we term index invertibility. We establish that index invertibility implies a set of equality constraints on the model parameters. Our proposed estimator uses the equality constraints to decrease the dimension of the optimization problem, thereby generating computational gains. Our main result shows that the proposed estimator is asymptotically equivalent to the unconstrained, computationally heavy estimator. In addition, we provide a series of results on the number of independent index restrictions on the model parameters, providing theoretical guidance on the extent of computational gains. Finally, we demonstrate the advantages of our approach via Monte Carlo simulations."}, "https://arxiv.org/abs/2308.04971": {"title": "Stein Variational Rare Event Simulation", "link": "https://arxiv.org/abs/2308.04971", "description": "arXiv:2308.04971v2 Announce Type: replace \nAbstract: Rare event simulation and rare event probability estimation are important tasks within the analysis of systems subject to uncertainty and randomness. Simultaneously, accurately estimating rare event probabilities is an inherently difficult task that calls for dedicated tools and methods. One way to improve estimation efficiency on difficult rare event estimation problems is to leverage gradients of the computational model representing the system in consideration, e.g., to explore the rare event faster and more reliably. We present a novel approach for estimating rare event probabilities using such model gradients by drawing on a technique to generate samples from non-normalized posterior distributions in Bayesian inference - the Stein variational gradient descent. We propagate samples generated from a tractable input distribution towards a near-optimal rare event importance sampling distribution by exploiting a similarity of the latter with Bayesian posterior distributions. 
Sample propagation takes the shape of passing samples through a sequence of invertible transforms such that their densities can be tracked and used to construct an unbiased importance sampling estimate of the rare event probability - the Stein variational rare event estimator. We discuss settings and parametric choices of the algorithm and suggest a method for balancing convergence speed with stability by choosing the step width or base learning rate adaptively. We analyze the method's performance on several analytical test functions and two engineering examples in low to high stochastic dimensions ($d = 2 - 869$) and find that it consistently outperforms other state-of-the-art gradient-based rare event simulation methods."}, "https://arxiv.org/abs/2312.01870": {"title": "Extreme-value modelling of migratory bird arrival dates: Insights from citizen science data", "link": "https://arxiv.org/abs/2312.01870", "description": "arXiv:2312.01870v3 Announce Type: replace \nAbstract: Citizen science mobilises many observers and gathers huge datasets but often without strict sampling protocols, which results in observation biases due to heterogeneity in sampling effort that can lead to biased statistical inferences. We develop a spatiotemporal Bayesian hierarchical model for bias-corrected estimation of arrival dates of the first migratory bird individuals at a breeding site. Higher sampling effort could be correlated with earlier observed dates. We implement data fusion of two citizen-science datasets with fundamentally different protocols (BBS, eBird) and map posterior distributions of the latent process, which contains four spatial components with Gaussian process priors: species niche; sampling effort; position and scale parameters of annual first date of arrival. The data layer includes four response variables: counts of observed eBird locations (Poisson); presence-absence at observed eBird locations (Binomial); BBS occurrence counts (Poisson); first arrival dates (Generalized Extreme-Value). We devise a Markov Chain Monte Carlo scheme and check by simulation that the latent process components are identifiable. We apply our model to several migratory bird species in the northeastern US for 2001--2021, and find that the sampling effort significantly modulates the observed first arrival date. We exploit this relationship to effectively bias-correct predictions of the true first arrival dates."}, "https://arxiv.org/abs/2101.09271": {"title": "Representation of Context-Specific Causal Models with Observational and Interventional Data", "link": "https://arxiv.org/abs/2101.09271", "description": "arXiv:2101.09271v4 Announce Type: replace-cross \nAbstract: We address the problem of representing context-specific causal models based on both observational and experimental data collected under general (e.g. hard or soft) interventions by introducing a new family of context-specific conditional independence models called CStrees. This family is defined via a novel factorization criterion that allows for a generalization of the factorization property defining general interventional DAG models. We derive a graphical characterization of model equivalence for observational CStrees that extends the Verma and Pearl criterion for DAGs. This characterization is then extended to CStree models under general, context-specific interventions. To obtain these results, we formalize a notion of context-specific intervention that can be incorporated into concise graphical representations of CStree models. 
We relate CStrees to other context-specific models, showing that the families of DAGs, CStrees, labeled DAGs and staged trees form a strict chain of inclusions. We end with an application of interventional CStree models to a real data set, revealing the context-specific nature of the data dependence structure and the soft, interventional perturbations."}, "https://arxiv.org/abs/2302.04412": {"title": "Spatiotemporal factor models for functional data with application to population map forecast", "link": "https://arxiv.org/abs/2302.04412", "description": "arXiv:2302.04412v3 Announce Type: replace-cross \nAbstract: The proliferation of mobile devices has led to the collection of large amounts of population data. This situation has prompted the need to utilize this rich, multidimensional data in practical applications. In response to this trend, we have integrated functional data analysis (FDA) and factor analysis to address the challenge of predicting hourly population changes across various districts in Tokyo. Specifically, by assuming a Gaussian process, we avoided the large covariance matrix parameters of the multivariate normal distribution. In addition, the data were both time and spatially dependent between districts. To capture these characteristics, a Bayesian factor model was introduced, which modeled the time series of a small number of common factors and expressed the spatial structure through factor loading matrices. Furthermore, the factor loading matrices were made identifiable and sparse to ensure the interpretability of the model. We also proposed a Bayesian shrinkage method as a systematic approach for factor selection. Through numerical experiments and data analysis, we investigated the predictive accuracy and interpretability of our proposed method. We concluded that the flexibility of the method allows for the incorporation of additional time series features, thereby improving its accuracy."}, "https://arxiv.org/abs/2407.13029": {"title": "Bayesian Inference and the Principle of Maximum Entropy", "link": "https://arxiv.org/abs/2407.13029", "description": "arXiv:2407.13029v1 Announce Type: new \nAbstract: Bayes' theorem incorporates distinct types of information through the likelihood and prior. Direct observations of state variables enter the likelihood and modify posterior probabilities through consistent updating. Information in terms of expected values of state variables modify posterior probabilities by constraining prior probabilities to be consistent with the information. Constraints on the prior can be exact, limiting hypothetical frequency distributions to only those that satisfy the constraints, or be approximate, allowing residual deviations from the exact constraint to some degree of tolerance. When the model parameters and constraint tolerances are known, posterior probability follows directly from Bayes' theorem. When parameters and tolerances are unknown a prior for them must be specified. When the system is close to statistical equilibrium the computation of posterior probabilities is simplified due to the concentration of the prior on the maximum entropy hypothesis. 
The relationship between maximum entropy reasoning and Bayes' theorem from this point of view is that maximum entropy reasoning is a special case of Bayesian inference with a constrained entropy-favoring prior."}, "https://arxiv.org/abs/2407.13169": {"title": "Combining Climate Models using Bayesian Regression Trees and Random Paths", "link": "https://arxiv.org/abs/2407.13169", "description": "arXiv:2407.13169v1 Announce Type: new \nAbstract: Climate models, also known as general circulation models (GCMs), are essential tools for climate studies. Each climate model may have varying accuracy across the input domain, but no single model is uniformly better than the others. One strategy for improving climate model prediction performance is to integrate multiple model outputs using input-dependent weights. Along with this concept, weight functions modeled using Bayesian Additive Regression Trees (BART) were recently shown to be useful for integrating multiple Effective Field Theories in nuclear physics applications. However, a restriction of this approach is that the weights could only be modeled as piecewise constant functions. To smoothly integrate multiple climate models, we propose a new tree-based model, Random Path BART (RPBART), that incorporates random path assignments into the BART model to produce smooth weight functions and smooth predictions of the physical system, all in a matrix-free formulation. The smoothness feature of RPBART requires a more complex prior specification, for which we introduce a semivariogram to guide its hyperparameter selection. This approach is easy to interpret, computationally cheap, and avoids an expensive cross-validation study. Finally, we propose a posterior projection technique to enable detailed analysis of the fitted posterior weight functions. This allows us to identify a sparse set of climate models that can largely recover the underlying system within a given spatial region, as well as to quantify model discrepancy within the model set under consideration. Our method is demonstrated on an ensemble of 8 GCMs modeling the average monthly surface temperature."}, "https://arxiv.org/abs/2407.13261": {"title": "Enhanced inference for distributions and quantiles of individual treatment effects in various experiments", "link": "https://arxiv.org/abs/2407.13261", "description": "arXiv:2407.13261v1 Announce Type: new \nAbstract: Understanding treatment effect heterogeneity has become increasingly important in many fields. In this paper we study distributions and quantiles of individual treatment effects to provide a more comprehensive and robust understanding of treatment effects beyond the usual averages, even though they are more challenging to infer due to nonidentifiability from observed data. Recent randomization-based approaches offer finite-sample valid inference for treatment effect distributions and quantiles in both completely randomized and stratified randomized experiments, but can be overly conservative by assuming the worst-case scenario where units with large effects are all assigned to the treated (or control) group. We introduce two improved methods to enhance the power of these existing approaches. The first method reinterprets existing approaches as inferring treatment effects among only treated or control units, and then combines the inference for treated and control units to infer treatment effects for all units. The second method explicitly controls for the actual number of treated units with large effects. 
Both simulations and applications demonstrate the substantial gain from the improved methods. These methods are further extended to sampling-based experiments as well as quasi-experiments from matching, in which the ideas for both improved methods play critical and complementary roles."}, "https://arxiv.org/abs/2407.13283": {"title": "Heterogeneous Clinical Trial Outcomes via Multi-Output Gaussian Processes", "link": "https://arxiv.org/abs/2407.13283", "description": "arXiv:2407.13283v1 Announce Type: new \nAbstract: We make use of Kronecker structure for scaling Gaussian Process models to large-scale, heterogeneous, clinical data sets. Repeated measures, commonly performed in clinical research, facilitate computational acceleration for nonlinear Bayesian nonparametric models and enable exact sampling for non-conjugate inference, when combinations of continuous and discrete endpoints are observed. Model inference is performed in Stan, and comparisons are made with brms on simulated data and two real clinical data sets, following a radiological image quality theme. Scalable Gaussian Process models compare favourably with parametric models on real data sets with 17,460 observations. Different GP model specifications are explored, with components analogous to random effects, and their theoretical properties are described."}, "https://arxiv.org/abs/2407.13302": {"title": "Non-zero block selector: A linear correlation coefficient measure for blocking-selection models", "link": "https://arxiv.org/abs/2407.13302", "description": "arXiv:2407.13302v1 Announce Type: new \nAbstract: Multiple-group data is widely used in genomic studies, finance, and social science. This study investigates a block structure that consists of covariate and response groups. It examines the block-selection problem of high-dimensional models with group structures for both responses and covariates, where both the number of blocks and the dimension within each block are allowed to grow larger than the sample size. We propose a novel strategy for detecting the block structure, which includes the block-selection model and a non-zero block selector (NBS). We establish the uniform consistency of the NBS and propose three estimators based on the NBS to enhance modeling efficiency. We prove that the estimators achieve the oracle solution and show that they are consistent, jointly asymptotically normal, and efficient in modeling extremely high-dimensional data. Simulations generate complex data settings and demonstrate the superiority of the proposed method. A gene-data analysis also demonstrates its effectiveness."}, "https://arxiv.org/abs/2407.13314": {"title": "NIRVAR: Network Informed Restricted Vector Autoregression", "link": "https://arxiv.org/abs/2407.13314", "description": "arXiv:2407.13314v1 Announce Type: new \nAbstract: High-dimensional panels of time series arise in many scientific disciplines such as neuroscience, finance, and macroeconomics. Often, co-movements within groups of the panel components occur. Extracting these groupings from the data provides a coarse-grained description of the complex system in question and can inform subsequent prediction tasks. We develop a novel methodology to model such a panel as a restricted vector autoregressive process, where the coefficient matrix is the weighted adjacency matrix of a stochastic block model. 
This network time series model, which we call the Network Informed Restricted Vector Autoregression (NIRVAR) model, yields a coefficient matrix that has a sparse block-diagonal structure. We propose an estimation procedure that embeds each panel component in a low-dimensional latent space and clusters the embedded points to recover the blocks of the coefficient matrix. Crucially, the method allows for network-based time series modelling when the underlying network is unobserved. We derive the bias, consistency and asymptotic normality of the NIRVAR estimator. Simulation studies suggest that the NIRVAR estimated embedded points are Gaussian distributed around the ground truth latent positions. On three applications to finance, macroeconomics, and transportation systems, NIRVAR outperforms competing factor and network time series models in terms of out-of-sample prediction."}, "https://arxiv.org/abs/2407.13374": {"title": "A unifying modelling approach for hierarchical distributed lag models", "link": "https://arxiv.org/abs/2407.13374", "description": "arXiv:2407.13374v1 Announce Type: new \nAbstract: We present a statistical modelling framework for implementing Distributed Lag Models (DLMs), encompassing several extensions of the approach to capture the temporally distributed effect from covariates via regression. We place DLMs in the context of penalised Generalized Additive Models (GAMs) and illustrate their implementation via the R package \\texttt{mgcv}, which allows for flexible and interpretable inference in addition to thorough model assessment. We show how the interpretation of penalised splines as random quantities enables approximate Bayesian inference and hierarchical structures in the same practical setting. We focus on epidemiological studies and demonstrate the approach with application to mortality data from Cyprus and Greece. For the Cyprus case study, we investigate, for the first time, the joint lagged effects from both temperature and humidity on mortality risk, with the unexpected result that humidity severely increases risk during cold rather than hot conditions. Another novel application is the use of the proposed framework for hierarchical pooling, to estimate district-specific covariate-lag risk on mortality and the use of posterior simulation to compare risk across districts."}, "https://arxiv.org/abs/2407.13402": {"title": "Block-Additive Gaussian Processes under Monotonicity Constraints", "link": "https://arxiv.org/abs/2407.13402", "description": "arXiv:2407.13402v1 Announce Type: new \nAbstract: We generalize the additive constrained Gaussian process framework to handle interactions between input variables while enforcing monotonicity constraints everywhere on the input space. The block-additive structure of the model is particularly suitable in the presence of interactions, while maintaining tractable computations. In addition, we develop a sequential algorithm, MaxMod, for model selection (i.e., the choice of the active input variables and of the blocks). We speed up our implementations through efficient matrix computations and thanks to explicit expressions of criteria involved in MaxMod. 
The performance and scalability of our methodology are showcased with several numerical examples in dimensions up to 120, as well as in a 5D real-world coastal flooding application, where interpretability is enhanced by the selection of the blocks."}, "https://arxiv.org/abs/2407.13446": {"title": "Subsampled One-Step Estimation for Fast Statistical Inference", "link": "https://arxiv.org/abs/2407.13446", "description": "arXiv:2407.13446v1 Announce Type: new \nAbstract: Subsampling is an effective approach to alleviate the computational burden associated with large-scale datasets. Nevertheless, existing subsampling estimators incur a substantial loss in estimation efficiency compared to estimators based on the full dataset. Specifically, the convergence rate of existing subsampling estimators is typically $n^{-1/2}$ rather than $N^{-1/2}$, where $n$ and $N$ denote the subsample and full data sizes, respectively. This paper proposes a subsampled one-step (SOS) method to mitigate the estimation efficiency loss by utilizing the asymptotic expansions of the subsampling and full-data estimators. The resulting SOS estimator is computationally efficient and achieves a fast convergence rate of $\\max\\{n^{-1}, N^{-1/2}\\}$ rather than $n^{-1/2}$. We establish the asymptotic distribution of the SOS estimator, which can be non-normal in general, and construct confidence intervals on top of the asymptotic distribution. Furthermore, we prove that the SOS estimator is asymptotically normal and equivalent to the full data-based estimator when $n / \\sqrt{N} \\to \\infty$. Simulation studies and real data analyses were conducted to demonstrate the finite sample performance of the SOS estimator. Numerical results suggest that the SOS estimator is almost as computationally efficient as the uniform subsampling estimator while achieving similar estimation efficiency to the full data-based estimator."}, "https://arxiv.org/abs/2407.13546": {"title": "Treatment-control comparisons in platform trials including non-concurrent controls", "link": "https://arxiv.org/abs/2407.13546", "description": "arXiv:2407.13546v1 Announce Type: new \nAbstract: Shared controls in platform trials comprise concurrent and non-concurrent controls. For a given experimental arm, non-concurrent controls refer to data from patients allocated to the control arm before the arm enters the trial. The use of non-concurrent controls in the analysis is attractive because it may increase the trial's power of testing treatment differences while decreasing the sample size. However, since arms are added sequentially in the trial, randomization occurs at different times, which can introduce bias in the estimates due to time trends. In this article, we present methods to incorporate non-concurrent control data in treatment-control comparisons, allowing for time trends. We focus mainly on frequentist approaches that model the time trend and Bayesian strategies that limit the borrowing level depending on the heterogeneity between concurrent and non-concurrent controls. We examine the impact of time trends, overlap between experimental treatment arms and entry times of arms in the trial on the operating characteristics of treatment effect estimators for each method under different patterns for the time trends. 
We argue under which conditions the methods lead to type 1 error control and discuss the gain in power compared to trials only using concurrent controls by means of a simulation study in which methods are compared."}, "https://arxiv.org/abs/2407.13613": {"title": "Revisiting Randomization with the Cube Method", "link": "https://arxiv.org/abs/2407.13613", "description": "arXiv:2407.13613v1 Announce Type: new \nAbstract: We propose a novel randomization approach for randomized controlled trials (RCTs), named the cube method. The cube method allows for the selection of balanced samples across various covariate types, ensuring consistent adherence to balance tests and, whence, substantial precision gains when estimating treatment effects. We establish several statistical properties for the population and sample average treatment effects (PATE and SATE, respectively) under randomization using the cube method. The relevance of the cube method is particularly striking when comparing the behavior of prevailing methods employed for treatment allocation when the number of covariates to balance is increasing. We formally derive and compare bounds of balancing adjustments depending on the number of units $n$ and the number of covariates $p$ and show that our randomization approach outperforms methods proposed in the literature when $p$ is large and $p/n$ tends to 0. We run simulation studies to illustrate the substantial gains from the cube method for a large set of covariates."}, "https://arxiv.org/abs/2407.13678": {"title": "Joint modelling of time-to-event and longitudinal response using robust skew normal-independent distributions", "link": "https://arxiv.org/abs/2407.13678", "description": "arXiv:2407.13678v1 Announce Type: new \nAbstract: Joint modelling of longitudinal observations and event times continues to remain a topic of considerable interest in biomedical research. For example, in HIV studies, the longitudinal bio-marker such as CD4 cell count in a patient's blood over follow up months is jointly modelled with the time to disease progression, death or dropout via a random intercept term mostly assumed to be Gaussian. However, longitudinal observations in these kinds of studies often exhibit non-Gaussian behavior (due to high degree of skewness), and parameter estimation is often compromised under violations of the Gaussian assumptions. In linear mixed-effects model assumptions, the distributional assumption for the subject-specific random-effects is taken as Gaussian which may not be true in many situations. Further, this assumption makes the model extremely sensitive to outlying observations. We address these issues in this work by devising a joint model which uses a robust distribution in a parametric setup along with a conditional distributional assumption that ensures dependency of two processes in case the subject-specific random effects is given."}, "https://arxiv.org/abs/2407.13495": {"title": "Identifying Research Hotspots and Future Development Trends in Current Psychology: A Bibliometric Analysis of the Past Decade's Publications", "link": "https://arxiv.org/abs/2407.13495", "description": "arXiv:2407.13495v1 Announce Type: cross \nAbstract: By conducting a bibliometric analysis on 4,869 publications in Current Psychology from 2013 to 2022, this paper examined the annual publications and annual citations, as well as the leading institutions, countries, and keywords. CiteSpace, VOSviewer and SCImago Graphica were utilized for visualization analysis. 
On one hand, this paper analyzed the academic influence of Current Psychology over the past decade. On the other hand, it explored the research hotspots and future development trends within the field of international psychology. The results revealed that the three main research areas covered in the publications of Current Psychology were: the psychological well-being of young people, the negative emotions of adults, and self-awareness and management. The latest research hotspots highlighted in the journal include negative emotions, personality, and mental health. The three main development trends of Current Psychology are: 1) exploring the personality psychology of both adolescents and adults, 2) promoting interdisciplinary research to study social psychological issues through the use of diversified research methods, and 3) emphasizing the emotional psychology of individuals and their interaction with social reality, from a people-oriented perspective."}, "https://arxiv.org/abs/2407.13514": {"title": "Topological Analysis of Seizure-Induced Changes in Brain Hierarchy Through Effective Connectivity", "link": "https://arxiv.org/abs/2407.13514", "description": "arXiv:2407.13514v1 Announce Type: cross \nAbstract: Traditional Topological Data Analysis (TDA) methods, such as Persistent Homology (PH), rely on distance measures (e.g., cross-correlation, partial correlation, coherence, and partial coherence) that are symmetric by definition. While useful for studying topological patterns in functional brain connectivity, the main limitation of these methods is their inability to capture the directional dynamics, which is crucial for understanding effective brain connectivity. We propose the Causality-Based Topological Ranking (CBTR) method, which integrates Causal Inference (CI) to assess effective brain connectivity with Hodge Decomposition (HD) to rank brain regions based on their mutual influence. Our simulations confirm that the CBTR method accurately and consistently identifies hierarchical structures in multivariate time series data. Moreover, this method effectively identifies brain regions showing the most significant interaction changes with other regions during seizures using electroencephalogram (EEG) data. These results provide novel insights into the brain's hierarchical organization and illuminate the impact of seizures on its dynamics."}, "https://arxiv.org/abs/2407.13641": {"title": "Optimal rates for estimating the covariance kernel from synchronously sampled functional data", "link": "https://arxiv.org/abs/2407.13641", "description": "arXiv:2407.13641v1 Announce Type: cross \nAbstract: We obtain minimax-optimal convergence rates in the supremum norm, including information-theoretic lower bounds, for estimating the covariance kernel of a stochastic process which is repeatedly observed at discrete, synchronous design points. In particular, for dense design we obtain the $\\sqrt n$-rate of convergence in the supremum norm without additional logarithmic factors which typically occur in the results in the literature. Surprisingly, in the transition from dense to sparse design the rates do not reflect the two-dimensional nature of the covariance kernel but correspond to those for univariate mean function estimation. Our estimation method can make use of higher-order smoothness of the covariance kernel away from the diagonal, and does not require the same smoothness on the diagonal itself. 
Hence, as in Mohammadi and Panaretos (2024) we can cover covariance kernels of processes with rough sample paths. Moreover, the estimator does not use mean function estimation to form residuals, and no smoothness assumptions on the mean have to be imposed. In the dense case we also obtain a central limit theorem in the supremum norm, which can be used as the basis for the construction of uniform confidence sets. Simulations and real-data applications illustrate the practical usefulness of the methods."}, "https://arxiv.org/abs/2006.02611": {"title": "Tensor Factor Model Estimation by Iterative Projection", "link": "https://arxiv.org/abs/2006.02611", "description": "arXiv:2006.02611v3 Announce Type: replace \nAbstract: Tensor time series, which is a time series consisting of tensorial observations, has become ubiquitous. It typically exhibits high dimensionality. One approach for dimension reduction is to use a factor model structure, in a form similar to Tucker tensor decomposition, except that the time dimension is treated as a dynamic process with a time dependent structure. In this paper we introduce two approaches to estimate such a tensor factor model by using iterative orthogonal projections of the original tensor time series. These approaches extend the existing estimation procedures and improve the estimation accuracy and convergence rate significantly as proven in our theoretical investigation. Our algorithms are similar to the higher order orthogonal projection method for tensor decomposition, but with significant differences due to the need to unfold tensors in the iterations and the use of autocorrelation. Consequently, our analysis is significantly different from the existing ones. Computational and statistical lower bounds are derived to prove the optimality of the sample size requirement and convergence rate for the proposed methods. Simulation study is conducted to further illustrate the statistical properties of these estimators."}, "https://arxiv.org/abs/2101.05774": {"title": "Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables", "link": "https://arxiv.org/abs/2101.05774", "description": "arXiv:2101.05774v4 Announce Type: replace \nAbstract: We propose a procedure which combines hierarchical clustering with a test of overidentifying restrictions for selecting valid instrumental variables (IV) from a large set of IVs. Some of these IVs may be invalid in that they fail the exclusion restriction. We show that if the largest group of IVs is valid, our method achieves oracle properties. Unlike existing techniques, our work deals with multiple endogenous regressors. Simulation results suggest an advantageous performance of the method in various settings. The method is applied to estimating the effect of immigration on wages."}, "https://arxiv.org/abs/2212.14075": {"title": "Forward Orthogonal Deviations GMM and the Absence of Large Sample Bias", "link": "https://arxiv.org/abs/2212.14075", "description": "arXiv:2212.14075v2 Announce Type: replace \nAbstract: It is well known that generalized method of moments (GMM) estimators of dynamic panel data regressions can have significant bias when the number of time periods ($T$) is not small compared to the number of cross-sectional units ($n$). The bias is attributed to the use of many instrumental variables. 
This paper shows that if the maximum number of instrumental variables used in a period increases with $T$ at a rate slower than $T^{1/2}$, then GMM estimators that exploit the forward orthogonal deviations (FOD) transformation do not have asymptotic bias, regardless of how fast $T$ increases relative to $n$. This conclusion is specific to using the FOD transformation. A similar conclusion does not necessarily apply when other transformations are used to remove fixed effects. Monte Carlo evidence illustrating the analytical results is provided."}, "https://arxiv.org/abs/2307.00093": {"title": "Design Sensitivity and Its Implications for Weighted Observational Studies", "link": "https://arxiv.org/abs/2307.00093", "description": "arXiv:2307.00093v2 Announce Type: replace \nAbstract: Sensitivity to unmeasured confounding is not typically a primary consideration in designing treated-control comparisons in observational studies. We introduce a framework allowing researchers to optimize robustness to omitted variable bias at the design stage using a measure called design sensitivity. Design sensitivity, which describes the asymptotic power of a sensitivity analysis, allows transparent assessment of the impact of different estimation strategies on sensitivity. We apply this general framework to two commonly-used sensitivity models, the marginal sensitivity model and the variance-based sensitivity model. By comparing design sensitivities, we interrogate how key features of weighted designs, including choices about trimming of weights and model augmentation, impact robustness to unmeasured confounding, and how these impacts may differ for the two different sensitivity models. We illustrate the proposed framework on a study examining drivers of support for the 2016 Colombian peace agreement."}, "https://arxiv.org/abs/2307.10454": {"title": "Latent Gaussian dynamic factor modeling and forecasting for multivariate count time series", "link": "https://arxiv.org/abs/2307.10454", "description": "arXiv:2307.10454v2 Announce Type: replace \nAbstract: This work considers estimation and forecasting in a multivariate, possibly high-dimensional count time series model constructed from a transformation of a latent Gaussian dynamic factor series. The estimation of the latent model parameters is based on second-order properties of the count and underlying Gaussian time series, yielding estimators of the underlying covariance matrices for which standard principal component analysis applies. Theoretical consistency results are established for the proposed estimation, building on certain concentration results for the models of the type considered. They also involve the memory of the latent Gaussian process, quantified through a spectral gap, shown to be suitably bounded as the model dimension increases, which is of independent interest. In addition, novel cross-validation schemes are suggested for model selection. The forecasting is carried out through a particle-based sequential Monte Carlo, leveraging Kalman filtering techniques. 
A simulation study and an application are also considered."}, "https://arxiv.org/abs/2212.01621": {"title": "A simple extension of Azadkia & Chatterjee's rank correlation to multi-response vectors", "link": "https://arxiv.org/abs/2212.01621", "description": "arXiv:2212.01621v3 Announce Type: replace-cross \nAbstract: Recently, Chatterjee (2023) recognized the lack of a direct generalization of his rank correlation $\\xi$ in Azadkia and Chatterjee (2021) to a multi-dimensional response vector. As a natural solution to this problem, we here propose an extension of $\\xi$ that is applicable to a set of $q \\geq 1$ response variables, where our approach builds upon converting the original vector-valued problem into a univariate problem and then applying the rank correlation $\\xi$ to it. Our novel measure $T$ quantifies the scale-invariant extent of functional dependence of a response vector $\\mathbf{Y} = (Y_1,\\dots,Y_q)$ on predictor variables $\\mathbf{X} = (X_1, \\dots,X_p)$, characterizes independence of $\\mathbf{X}$ and $\\mathbf{Y}$ as well as perfect dependence of $\\mathbf{Y}$ on $\\mathbf{X}$ and hence fulfills all the characteristics of a measure of predictability. Aiming at maximum interpretability, we provide various invariance results for $T$ as well as a closed-form expression in multivariate normal models. Building upon the graph-based estimator for $\\xi$ in Azadkia and Chatterjee (2021), we obtain a non-parametric, strongly consistent estimator for $T$ and show its asymptotic normality. Based on this estimator, we develop a model-free and dependence-based feature ranking and forward feature selection for multiple-outcome data. Simulation results and real case studies illustrate $T$'s broad applicability."}, "https://arxiv.org/abs/2302.03391": {"title": "Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection", "link": "https://arxiv.org/abs/2302.03391", "description": "arXiv:2302.03391v2 Announce Type: replace-cross \nAbstract: Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple l1 penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model. We demonstrate the performances of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses."}, "https://arxiv.org/abs/2311.12978": {"title": "Physics-Informed Priors with Application to Boundary Layer Velocity", "link": "https://arxiv.org/abs/2311.12978", "description": "arXiv:2311.12978v2 Announce Type: replace-cross \nAbstract: One of the most popular recent areas of machine learning predicates the use of neural networks augmented by information about the underlying process in the form of Partial Differential Equations (PDEs). 
These physics-informed neural networks are obtained by penalizing the inference with a PDE, and have been cast as a minimization problem currently lacking a formal approach to quantify the uncertainty. In this work, we propose a novel model-based framework which regards the PDE as prior information for a deep Bayesian neural network. The prior is calibrated without data to resemble the PDE solution in the prior mean, while our degree of confidence in the PDE relative to the data is expressed in terms of the prior variance. The information embedded in the PDE is then propagated to the posterior, yielding physics-informed forecasts with uncertainty quantification. We apply our approach to a simulated viscous fluid and to experimentally-obtained turbulent boundary layer velocity in a water tunnel using an appropriately simplified Navier-Stokes equation. Our approach requires very few observations to produce physically-consistent forecasts as opposed to non-physical forecasts stemming from non-informed priors, thereby allowing the forecasting of complex systems where some amount of data as well as some contextual knowledge is available."}, "https://arxiv.org/abs/2407.13814": {"title": "Building Population-Informed Priors for Bayesian Inference Using Data-Consistent Stochastic Inversion", "link": "https://arxiv.org/abs/2407.13814", "description": "arXiv:2407.13814v1 Announce Type: new \nAbstract: Bayesian inference provides a powerful tool for leveraging observational data to inform model predictions and uncertainties. However, when such data is limited, Bayesian inference may not adequately constrain uncertainty without the use of highly informative priors. Common approaches for constructing informative priors typically rely on either assumptions or knowledge of the underlying physics, which may not be available in all scenarios. In this work, we consider the scenario where data are available on a population of assets/individuals, which occurs in many problem domains such as biomedical or digital twin applications, and leverage this population-level data to systematically constrain the Bayesian prior and subsequently improve individualized inferences. The approach proposed in this paper is based upon a recently developed technique known as data-consistent inversion (DCI) for constructing a pullback probability measure. Succinctly, we utilize DCI to build population-informed priors for subsequent Bayesian inference on individuals. While the approach is general and applies to nonlinear maps and arbitrary priors, we prove that for linear inverse problems with Gaussian priors, the population-informed prior produces an increase in the information gain as measured by the determinant and trace of the inverse posterior covariance. We also demonstrate that the Kullback-Leibler divergence often improves with high probability. Numerical results, including linear-Gaussian examples and one inspired by digital twins for additively manufactured assets, indicate that there is significant value in using these population-informed priors."}, "https://arxiv.org/abs/2407.13865": {"title": "Projection-pursuit Bayesian regression for symmetric matrix predictors", "link": "https://arxiv.org/abs/2407.13865", "description": "arXiv:2407.13865v1 Announce Type: new \nAbstract: This paper develops a novel Bayesian approach for nonlinear regression with symmetric matrix predictors, often used to encode connectivity of different nodes. 
Unlike methods that vectorize matrices as predictors, which results in a large number of model parameters and unstable estimation, we propose a Bayesian multi-index regression method, resulting in a projection-pursuit-type estimator that leverages the structure of matrix-valued predictors. We establish the model identifiability conditions and impose a sparsity-inducing prior on the projection directions for sparse sampling to prevent overfitting and enhance interpretability of the parameter estimates. Posterior inference is conducted through Bayesian backfitting. The performance of the proposed method is evaluated through simulation studies and a case study investigating the relationship between brain connectivity features and cognitive scores."}, "https://arxiv.org/abs/2407.13904": {"title": "In defense of MAR over latent ignorability (or latent MAR) for outcome missingness in studying principal causal effects: a causal graph view", "link": "https://arxiv.org/abs/2407.13904", "description": "arXiv:2407.13904v1 Announce Type: new \nAbstract: This paper concerns outcome missingness in principal stratification analysis. We revisit a common assumption known as latent ignorability or latent missing-at-random (LMAR), often considered a relaxation of missing-at-random (MAR). LMAR posits that the outcome is independent of its missingness if one conditions on principal stratum (which is partially unobservable) in addition to observed variables. The literature has focused on methods assuming LMAR (usually supplemented with a more specific assumption about the missingness), without considering the theoretical plausibility and necessity of LMAR. In this paper, we devise a way to represent principal stratum in causal graphs, and use causal graphs to examine this assumption. We find that LMAR is harder to satisfy than MAR, and for the purpose of breaking the dependence between the outcome and its missingness, no benefit is gained from conditioning on principal stratum on top of conditioning on observed variables. This finding has an important implication: MAR should be preferred over LMAR. This is convenient because MAR is easier to handle and (unlike LMAR) if MAR is assumed no additional assumption is needed. We thus turn to focus on the plausibility of MAR and its implications, with a view to facilitating appropriate use of this assumption. We clarify conditions on the causal structure and on auxiliary variables (if available) that need to hold for MAR to hold, and we use MAR to recover effect identification under two dominant identification assumptions (exclusion restriction and principal ignorability). We briefly comment on cases where MAR does not hold. In terms of broader connections, most of the MAR findings are also relevant to classic instrumental variable analysis that targets the local average treatment effect; and the LMAR finding suggests general caution with assumptions that condition on principal stratum."}, "https://arxiv.org/abs/2407.13958": {"title": "Flexible max-stable processes for fast and efficient inference", "link": "https://arxiv.org/abs/2407.13958", "description": "arXiv:2407.13958v1 Announce Type: new \nAbstract: Max-stable processes serve as the fundamental distributional family in extreme value theory. However, likelihood-based inference methods for max-stable processes still heavily rely on composite likelihoods, rendering them impractical in high dimensions due to their intractable densities. 
In this paper, we introduce a fast and efficient inference method, based on angular densities, for the class of max-stable processes whose angular densities do not put mass on the boundary space of the simplex and which can be used to construct r-Pareto processes. We demonstrate the efficiency of the proposed method through two new max-stable processes, the truncated extremal-t process and the skewed Brown-Resnick process. The proposed method is shown to be computationally efficient and can be applied to large datasets. Furthermore, the skewed Brown-Resnick process contains the popular Brown-Resnick model as a special case and possesses nonstationary extremal dependence structures. We showcase the new max-stable processes on simulated and real data."}, "https://arxiv.org/abs/2407.13971": {"title": "Dimension-reduced Reconstruction Map Learning for Parameter Estimation in Likelihood-Free Inference Problems", "link": "https://arxiv.org/abs/2407.13971", "description": "arXiv:2407.13971v1 Announce Type: new \nAbstract: Many application areas rely on models that can be readily simulated but lack a closed-form likelihood, or an accurate approximation under arbitrary parameter values. Existing parameter estimation approaches in this setting are generally approximate. Recent work on using neural network models to reconstruct the mapping from the data space to the parameters from a set of synthetic parameter-data pairs suffers from the curse of dimensionality, resulting in inaccurate estimation as the data size grows. We propose a dimension-reduced approach to likelihood-free estimation which combines the ideas of reconstruction map estimation with dimension-reduction approaches based on subject-specific knowledge. We examine the properties of reconstruction map estimation with and without dimension reduction and explore the trade-off between the information loss from reducing the data dimension and the approximation error. Numerical examples show that the proposed approach compares favorably with reconstruction map estimation, approximate Bayesian computation, and synthetic likelihood estimation."}, "https://arxiv.org/abs/2407.13980": {"title": "Byzantine-tolerant distributed learning of finite mixture models", "link": "https://arxiv.org/abs/2407.13980", "description": "arXiv:2407.13980v1 Announce Type: new \nAbstract: This paper proposes two split-and-conquer (SC) learning estimators for finite mixture models that are tolerant to Byzantine failures. In SC learning, individual machines obtain local estimates, which are then transmitted to a central server for aggregation. During this communication, the server may receive malicious or incorrect information from some local machines, a scenario known as Byzantine failures. While SC learning approaches have been devised to mitigate Byzantine failures in statistical models with Euclidean parameters, developing Byzantine-tolerant methods for finite mixture models with non-Euclidean parameters requires a distinct strategy. Our proposed distance-based methods are hyperparameter tuning free, unlike existing methods, and are resilient to Byzantine failures while achieving high statistical efficiency. We validate the effectiveness of our methods both theoretically and empirically via experiments on simulated and real data from machine learning applications for digit recognition. 
The code for the experiment can be found at https://github.com/SarahQiong/RobustSCGMM."}, "https://arxiv.org/abs/2407.14002": {"title": "Derandomized Truncated D-vine Copula Knockoffs with e-values to control the false discovery rate", "link": "https://arxiv.org/abs/2407.14002", "description": "arXiv:2407.14002v1 Announce Type: new \nAbstract: The Model-X knockoff framework is a practical methodology for variable selection, which stands out from other selection strategies since it allows for the control of the false discovery rate (FDR) with finite-sample guarantees. In this article, we propose a Truncated D-vine Copula Knockoffs (TDCK) algorithm for sampling approximate knockoffs from complex multivariate distributions. Our algorithm improves upon features of previous attempts to sample knockoffs in the multivariate setting, with the three main contributions being: 1) the truncation of the D-vine copula, which reduces the dependence between the original variables and their corresponding knockoffs, improving the statistical power; 2) the employment of a straightforward non-parametric formulation for marginal transformations, eliminating the need for a specific parametric family or a kernel density estimator; 3) the use of the \"rvinecopulib\" R package, which offers better flexibility than existing vine copula knockoff fitting methods. To eliminate the randomness across distinct realizations, which results in different sets of selected variables, we wrap the TDCK method with an existing derandomizing procedure for knockoffs, leading to a Derandomized Truncated D-vine Copula Knockoffs with e-values (DTDCKe) procedure. We demonstrate the robustness of the DTDCKe procedure under various scenarios with extensive simulation studies. We further illustrate its efficacy using a gene expression dataset, showing it achieves a more reliable gene selection than other competing methods, when the findings are compared with those of a meta-analysis. The results indicate that our Truncated D-vine copula approach is robust and has superior power, representing an appealing approach for variable selection in different multivariate applications, particularly in gene expression analysis."}, "https://arxiv.org/abs/2407.14022": {"title": "Causal Inference with Complex Treatments: A Survey", "link": "https://arxiv.org/abs/2407.14022", "description": "arXiv:2407.14022v1 Announce Type: new \nAbstract: Causal inference plays an important role in explanatory analysis and decision making across various fields like statistics, marketing, health care, and education. Its main task is to estimate treatment effects and inform intervention policies. Traditionally, most previous works focus on the binary treatment setting, in which there is a single treatment that a unit either adopts or not. However, in practice, the treatment can be much more complex, encompassing multi-valued, continuous, or bundle options. In this paper, we refer to these as complex treatments and systematically and comprehensively review the causal inference methods for addressing them. First, we formally revisit the problem definition, the basic assumptions, and their possible variations under specific conditions. Second, we sequentially review the related methods for multi-valued, continuous, and bundled treatment settings. In each situation, we tentatively divide the methods into two categories: those conforming to the unconfoundedness assumption and those violating it. Subsequently, we discuss the available datasets and open-source code. 
Finally, we provide a brief summary of these works and suggest potential directions for future research."}, "https://arxiv.org/abs/2407.14074": {"title": "Regression Adjustment for Estimating Distributional Treatment Effects in Randomized Controlled Trials", "link": "https://arxiv.org/abs/2407.14074", "description": "arXiv:2407.14074v1 Announce Type: new \nAbstract: In this paper, we address the issue of estimating and conducting inference on distributional treatment effects in randomized experiments. The distributional treatment effect provides a more comprehensive understanding of treatment effects by characterizing heterogeneous effects across individual units, as opposed to relying solely on the average treatment effect. To enhance the precision of distributional treatment effect estimation, we propose a regression adjustment method that utilizes distributional regression and pre-treatment information. Our method is designed to be free from restrictive distributional assumptions. We establish theoretical efficiency gains and develop a practical, statistically sound inferential framework. Through extensive simulation studies and empirical applications, we illustrate the substantial advantages of our method, equipping researchers with a powerful tool for capturing the full spectrum of treatment effects in experimental research."}, "https://arxiv.org/abs/2407.14248": {"title": "Incertus", "link": "https://arxiv.org/abs/2407.14248", "description": "arXiv:2407.14248v1 Announce Type: new \nAbstract: In this paper, we present Insertus.jl, a Julia package that helps the user generate a randomization sequence of a given length for a multi-arm trial with a pre-specified target allocation ratio and assess the operating characteristics of the chosen randomization method through Monte Carlo simulations. The developed package is computationally efficient, and it can be invoked in R. Furthermore, the package is open-ended -- it can flexibly accommodate new randomization procedures and evaluate their statistical properties via simulation. It may also be helpful for validating other randomization methods for which software is not readily available. In summary, Insertus.jl can be used as \"Lego Blocks\" to construct a fit-for-purpose randomization procedure for a given clinical trial design."}, "https://arxiv.org/abs/2407.14311": {"title": "A Bayesian joint model of multiple longitudinal and categorical outcomes with application to multiple myeloma using permutation-based variable importance", "link": "https://arxiv.org/abs/2407.14311", "description": "arXiv:2407.14311v1 Announce Type: new \nAbstract: Joint models have proven to be an effective approach for uncovering potentially hidden connections between various types of outcomes, mainly continuous, time-to-event, and binary. Typically, longitudinal continuous outcomes are characterized by linear mixed-effects models, survival outcomes are described by proportional hazards models, and the link between outcomes is captured by shared random effects. Other modeling variations include generalized linear mixed-effects models for longitudinal data and logistic regression when a binary outcome is present, rather than time until an event of interest. However, in a clinical research setting, one might be interested in modeling the physician's chosen treatment based on the patient's medical history in order to identify prognostic factors. In this situation, there are often multiple treatment options, requiring the use of a multiclass classification approach. 
Inspired by this context, we develop a Bayesian joint model for longitudinal and categorical data. In particular, our motivation comes from a multiple myeloma study, in which biomarkers display nonlinear trajectories that are well captured through bi-exponential submodels, where patient-level information is shared with the categorical submodel. We also present a variable importance strategy for ranking prognostic factors. We apply our proposal and a competing model to the multiple myeloma data, compare the variable importance and inferential results for both models, and illustrate patient-level interpretations using our joint model."}, "https://arxiv.org/abs/2407.14349": {"title": "Measuring and testing tail equivalence", "link": "https://arxiv.org/abs/2407.14349", "description": "arXiv:2407.14349v1 Announce Type: new \nAbstract: We call two copulas tail equivalent if their first-order approximations in the tail coincide. As a special case, a copula is called tail symmetric if it is tail equivalent to the associated survival copula. We propose a novel measure and statistical test for tail equivalence. The proposed measure takes the value of zero if and only if the two copulas share a pair of tail order and tail order parameter in common. Moreover, taking the nature of these tail quantities into account, we design the proposed measure so that it takes a large value when tail orders are different, and a small value when tail order parameters are non-identical. We derive asymptotic properties of the proposed measure, and then propose a novel statistical test for tail equivalence. Performance of the proposed test is demonstrated in a series of simulation studies and empirical analyses of financial stock returns in the periods of the world financial crisis and the COVID-19 recession. Our empirical analysis reveals non-identical tail behaviors in different pairs of stocks, different parts of tails, and the two periods of recessions."}, "https://arxiv.org/abs/2407.14365": {"title": "Modified BART for Learning Heterogeneous Effects in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2407.14365", "description": "arXiv:2407.14365v1 Announce Type: new \nAbstract: This paper introduces BART-RDD, a sum-of-trees regression model built around a novel regression tree prior, which incorporates the special covariate structure of regression discontinuity designs. Specifically, the tree splitting process is constrained to ensure overlap within a narrow band surrounding the running variable cutoff value, where the treatment effect is identified. It is shown that unmodified BART-based models estimate RDD treatment effects poorly, while our modified model accurately recovers treatment effects at the cutoff. Specifically, BART-RDD is perhaps the first RDD method that effectively learns conditional average treatment effects. The new method is investigated in thorough simulation studies as well as an empirical application looking at the effect of academic probation on student performance in subsequent terms (Lindo et al., 2010)."}, "https://arxiv.org/abs/2407.14369": {"title": "tidychangepoint: a unified framework for analyzing changepoint detection in univariate time series", "link": "https://arxiv.org/abs/2407.14369", "description": "arXiv:2407.14369v1 Announce Type: new \nAbstract: We present tidychangepoint, a new R package for changepoint detection analysis. 
tidychangepoint leverages existing packages like changepoint, GA, tsibble, and broom to provide tidyverse-compliant tools for segmenting univariate time series using various changepoint detection algorithms. In addition, tidychangepoint also provides model-fitting procedures for commonly-used parametric models, tools for computing various penalized objective functions, and graphical diagnostic displays. tidychangepoint wraps both deterministic algorithms like PELT, and also flexible, randomized, genetic algorithms that can be used with any compliant model-fitting function and any penalized objective function. By bringing all of these disparate tools together in a cohesive fashion, tidychangepoint facilitates comparative analysis of changepoint detection algorithms and models."}, "https://arxiv.org/abs/2407.14003": {"title": "Time Series Generative Learning with Application to Brain Imaging Analysis", "link": "https://arxiv.org/abs/2407.14003", "description": "arXiv:2407.14003v1 Announce Type: cross \nAbstract: This paper focuses on the analysis of sequential image data, particularly brain imaging data such as MRI, fMRI, CT, with the motivation of understanding the brain aging process and neurodegenerative diseases. To achieve this goal, we investigate image generation in a time series context. Specifically, we formulate a min-max problem derived from the $f$-divergence between neighboring pairs to learn a time series generator in a nonparametric manner. The generator enables us to generate future images by transforming prior lag-k observations and a random vector from a reference distribution. With a deep neural network learned generator, we prove that the joint distribution of the generated sequence converges to the latent truth under a Markov and a conditional invariance condition. Furthermore, we extend our generation mechanism to a panel data scenario to accommodate multiple samples. The effectiveness of our mechanism is evaluated by generating real brain MRI sequences from the Alzheimer's Disease Neuroimaging Initiative. These generated image sequences can be used as data augmentation to enhance the performance of further downstream tasks, such as Alzheimer's disease detection."}, "https://arxiv.org/abs/2205.07689": {"title": "From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data", "link": "https://arxiv.org/abs/2205.07689", "description": "arXiv:2205.07689v3 Announce Type: replace \nAbstract: How can we tell complex point clouds with different small scale characteristics apart, while disregarding global features? Can we find a suitable transformation of such data in a way that allows to discriminate between differences in this sense with statistical guarantees? In this paper, we consider the analysis and classification of complex point clouds as they are obtained, e.g., via single molecule localization microscopy. We focus on the task of identifying differences between noisy point clouds based on small scale characteristics, while disregarding large scale information such as overall size. We propose an approach based on a transformation of the data via the so-called Distance-to-Measure (DTM) function, a transformation which is based on the average of nearest neighbor distances. For each data set, we estimate the probability density of average local distances of all data points and use the estimated densities for classification. 
While the applicability is immediate and the practical performance of the proposed methodology is very good, the theoretical study of the density estimators is quite challenging, as they are based on i.i.d. observations that have been obtained via a complicated transformation. In fact, the transformed data are stochastically dependent in a non-local way that is not captured by commonly considered dependence measures. Nonetheless, we show that the asymptotic behaviour of the density estimator is driven by a kernel density estimator of certain i.i.d. random variables by using theoretical properties of U-statistics, which allows to handle the dependencies via a Hoeffding decomposition. We show via a numerical study and in an application to simulated single molecule localization microscopy data of chromatin fibers that unsupervised classification tasks based on estimated DTM-densities achieve excellent separation results."}, "https://arxiv.org/abs/2401.15014": {"title": "A Robust Bayesian Method for Building Polygenic Risk Scores using Projected Summary Statistics and Bridge Prior", "link": "https://arxiv.org/abs/2401.15014", "description": "arXiv:2401.15014v2 Announce Type: replace \nAbstract: Polygenic risk scores (PRS) developed from genome-wide association studies (GWAS) are of increasing interest for clinical and research applications. Bayesian methods have been popular for building PRS because of their natural ability to regularize models and incorporate external information. In this article, we present new theoretical results, methods, and extensive numerical studies to advance Bayesian methods for PRS applications. We identify a potential risk, under a common Bayesian PRS framework, of posterior impropriety when integrating the required GWAS summary-statistics and linkage disequilibrium (LD) data from two distinct sources. As a principled remedy to this problem, we propose a projection of the summary statistics data that ensures compatibility between the two sources and in turn a proper behavior of the posterior. We further introduce a new PRS method, with accompanying software package, under the less-explored Bayesian bridge prior to more flexibly model varying sparsity levels in effect size distributions. We extensively benchmark it against alternative Bayesian methods using both synthetic and real datasets, quantifying the impact of both prior specification and LD estimation strategy. Our proposed PRS-Bridge, equipped with the projection technique and flexible prior, demonstrates the most consistent and generally superior performance across a variety of scenarios."}, "https://arxiv.org/abs/2407.14630": {"title": "Identification of changes in gene expression", "link": "https://arxiv.org/abs/2407.14630", "description": "arXiv:2407.14630v1 Announce Type: new \nAbstract: Evaluating the change in gene expression is a common goal in many research areas, such as in toxicological studies as well as in clinical trials. In practice, the analysis is often based on multiple t-tests evaluated at the observed time points. This severely limits the accuracy of determining the time points at which the gene changes in expression. Even if a parametric approach is chosen, the analysis is often restricted to identifying the onset of an effect. In this paper, we propose a parametric method to identify the time frame where the gene expression significantly changes. This is achieved by fitting a parametric model to the time-response data and constructing a confidence band for its first derivative. 
The confidence band is derived by a flexible two-step bootstrap approach, which can be applied to a wide variety of possible curves. Our method focuses on the first derivative, since it provides an easy-to-compute and reliable measure for the change in response. It is summarised in terms of a hypothesis test, such that rejecting the null hypothesis means detecting a significant change in gene expression. Furthermore, a method for calculating confidence intervals for time points of interest (e.g. the beginning and end of significant change) is developed. We demonstrate the validity of our approach through a simulation study and present a variety of different applications to mouse gene expression data from a study investigating the effect of a Western diet on the progression of non-alcoholic fatty liver disease."}, "https://arxiv.org/abs/2407.14635": {"title": "Predicting the Distribution of Treatment Effects: A Covariate-Adjustment Approach", "link": "https://arxiv.org/abs/2407.14635", "description": "arXiv:2407.14635v1 Announce Type: new \nAbstract: Important questions for impact evaluation require knowledge not only of average effects, but of the distribution of treatment effects. What proportion of people are harmed? Does a policy help many by a little? Or a few by a lot? The inability to observe individual counterfactuals makes these empirical questions challenging. I propose an approach to inference on points of the distribution of treatment effects by incorporating predicted counterfactuals through covariate adjustment. I show that finite-sample inference is valid under weak assumptions, for example when data come from a Randomized Controlled Trial (RCT), and that large-sample inference is asymptotically exact under suitable conditions. Finally, I revisit five RCTs in microcredit where average effects are not statistically significant and find evidence of both positive and negative treatment effects in household income. On average across studies, at least 13.6% of households benefited and 12.5% were negatively affected."}, "https://arxiv.org/abs/2407.14666": {"title": "A Bayesian workflow for securitizing casualty insurance risk", "link": "https://arxiv.org/abs/2407.14666", "description": "arXiv:2407.14666v1 Announce Type: new \nAbstract: Casualty insurance-linked securities (ILS) are appealing to investors because the underlying insurance claims, which are directly related to resulting security performance, are uncorrelated with most other asset classes. Conversely, casualty ILS are appealing to insurers as an efficient capital management tool. However, securitizing casualty insurance risk is non-trivial, as it requires forecasting loss ratios for pools of insurance policies that have not yet been written, in addition to estimating how the underlying losses will develop over time within future accident years. In this paper, we lay out a Bayesian workflow that tackles these complexities by using: (1) theoretically informed time-series and state-space models to capture how loss ratios develop and change over time; (2) historic industry data to inform prior distributions of models fit to individual programs; (3) stacking to combine loss ratio predictions from candidate models; and (4) both prior predictive simulations and simulation-based calibration to aid model specification. 
Using historic Schedule P filings, we then show how our proposed Bayesian workflow can be used to assess and compare models across a variety of key model performance metrics evaluated on future accident year losses."}, "https://arxiv.org/abs/2407.14703": {"title": "Generalizing and transporting causal inferences from randomized trials in the presence of trial engagement effects", "link": "https://arxiv.org/abs/2407.14703", "description": "arXiv:2407.14703v1 Announce Type: new \nAbstract: Trial engagement effects are effects of trial participation on the outcome that are not mediated by treatment assignment. Most work on extending (generalizing or transporting) causal inferences from a randomized trial to a target population has, explicitly or implicitly, assumed that trial engagement effects are absent, allowing evidence about the effects of the treatments examined in trials to be applied to non-experimental settings. Here, we define novel causal estimands and present identification results for generalizability and transportability analyses in the presence of trial engagement effects. Our approach allows for trial engagement effects under assumptions of no causal interaction between trial participation and treatment assignment on the absolute or relative scales. We show that under these assumptions, even in the presence of trial engagement effects, the trial data can be combined with covariate data from the target population to identify average treatment effects in the context of usual care as implemented in the target population (i.e., outside the experimental setting). The identifying observed data functionals under these no-interaction assumptions are the same as those obtained under the stronger identifiability conditions that have been invoked in prior work. Therefore, our results suggest a new interpretation for previously proposed generalizability and transportability estimators; this interpretation may be useful in analyses under causal structures where background knowledge suggests that trial engagement effects are present but interactions between trial participation and treatment are negligible."}, "https://arxiv.org/abs/2407.14748": {"title": "Regression models for binary data with scale mixtures of centered skew-normal link functions", "link": "https://arxiv.org/abs/2407.14748", "description": "arXiv:2407.14748v1 Announce Type: new \nAbstract: For binary regression, the use of symmetrical link functions is not appropriate when there is evidence that the probability of success increases at a different rate than it decreases. In these cases, the use of link functions based on the cumulative distribution function of a skewed and heavy-tailed distribution can be useful. The most popular choices are scale mixtures of the skew-normal distribution. This family of distributions can have some identifiability problems, caused by the so-called direct parameterization. Also, in binary modeling with skewed link functions, another identifiability problem can arise from the presence of both the intercept and the skewness parameter. To circumvent these issues, in this work we propose link functions based on the scale mixtures of skew-normal distributions under the centered parameterization. Furthermore, we propose to fix the sign of the skewness parameter, which is a new perspective in the literature to deal with the identifiability problem in skewed link functions. Bayesian inference using MCMC algorithms and residual analysis are developed. 
Simulation studies are performed to evaluate the performance of the model. Also, the methodology is applied to a heart disease dataset."}, "https://arxiv.org/abs/2407.14914": {"title": "Leveraging Uniformization and Sparsity for Computation of Continuous Time Dynamic Discrete Choice Games", "link": "https://arxiv.org/abs/2407.14914", "description": "arXiv:2407.14914v1 Announce Type: new \nAbstract: Continuous-time formulations of dynamic discrete choice games offer notable computational advantages, particularly in modeling strategic interactions in oligopolistic markets. This paper extends these benefits by addressing computational challenges in order to improve model solution and estimation. We first establish new results on the rates of convergence of the value iteration, policy evaluation, and relative value iteration operators in the model, holding fixed player beliefs. Next, we introduce a new representation of the value function in the model based on uniformization -- a technique used in the analysis of continuous time Markov chains -- which allows us to draw a direct analogy to discrete time models. Furthermore, we show that uniformization also leads to a stable method to compute the matrix exponential, an operator appearing in the model's log likelihood function when only discrete time \"snapshot\" data are available. We also develop a new algorithm that concurrently computes the matrix exponential and its derivatives with respect to model parameters, enhancing computational efficiency. By leveraging the inherent sparsity of the model's intensity matrix, combined with sparse matrix techniques and precomputed addresses, we show how to significantly speed up computations. These strategies allow researchers to estimate more sophisticated and realistic models of strategic interactions and policy impacts in empirical industrial organization."}, "https://arxiv.org/abs/2407.15084": {"title": "High-dimensional log contrast models with measurement errors", "link": "https://arxiv.org/abs/2407.15084", "description": "arXiv:2407.15084v1 Announce Type: new \nAbstract: High-dimensional compositional data are frequently encountered in many fields of modern scientific research. In regression analysis of compositional data, the presence of covariate measurement errors poses grand challenges for existing statistical error-in-variable regression analysis methods since measurement error in one component of the composition has an impact on others. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose a new method named Error-in-composition (Eric) Lasso for regression analysis of corrupted compositional predictors. Estimation error bounds of Eric Lasso and its asymptotic sign-consistent selection properties are established. We then illustrate the finite sample performance of Eric Lasso using simulation studies and demonstrate its potential usefulness in a real data application."}, "https://arxiv.org/abs/2407.15276": {"title": "Nonlinear Binscatter Methods", "link": "https://arxiv.org/abs/2407.15276", "description": "arXiv:2407.15276v1 Announce Type: new \nAbstract: Binned scatter plots are a powerful statistical tool for empirical work in the social, behavioral, and biomedical sciences. 
Available methods rely on a quantile-based partitioning estimator of the conditional mean regression function to primarily construct flexible yet interpretable visualization methods, but they can also be used to estimate treatment effects, assess uncertainty, and test substantive domain-specific hypotheses. This paper introduces novel binscatter methods based on nonlinear, possibly nonsmooth M-estimation methods, covering generalized linear, robust, and quantile regression models. We provide a host of theoretical results and practical tools for local constant estimation along with piecewise polynomial and spline approximations, including (i) optimal tuning parameter (number of bins) selection, (ii) confidence bands, and (iii) formal statistical tests regarding functional form or shape restrictions. Our main results rely on novel strong approximations for general partitioning-based estimators covering random, data-driven partitions, which may be of independent interest. We demonstrate our methods with an empirical application studying the relation between the percentage of individuals without health insurance and per capita income at the zip-code level. We provide general-purpose software packages implementing our methods in Python, R, and Stata."}, "https://arxiv.org/abs/2407.15340": {"title": "Random Survival Forest for Censored Functional Data", "link": "https://arxiv.org/abs/2407.15340", "description": "arXiv:2407.15340v1 Announce Type: new \nAbstract: This paper introduces a Random Survival Forest (RSF) method for functional data. The focus is specifically on defining a new functional data structure, the Censored Functional Data (CFD), for dealing with temporal observations that are censored due to study limitations or incomplete data collection. This approach allows for precise modelling of functional survival trajectories, leading to improved interpretation and prediction of survival dynamics across different groups. A medical survival study on the benchmark SOFA data set is presented. Results show good performance of the proposed approach, particularly in ranking the importance of predicting variables, as captured through dynamic changes in SOFA scores and patient mortality rates."}, "https://arxiv.org/abs/2407.15377": {"title": "Replicable Bandits for Digital Health Interventions", "link": "https://arxiv.org/abs/2407.15377", "description": "arXiv:2407.15377v1 Announce Type: new \nAbstract: Adaptive treatment assignment algorithms, such as bandit and reinforcement learning algorithms, are increasingly used in digital health intervention clinical trials. Causal inference and related data analyses are critical for evaluating digital health interventions, deciding how to refine the intervention, and deciding whether to roll-out the intervention more broadly. However the replicability of these analyses has received relatively little attention. This work investigates the replicability of statistical analyses from trials deploying adaptive treatment assignment algorithms. We demonstrate that many standard statistical estimators can be inconsistent and fail to be replicable across repetitions of the clinical trial, even as the sample size grows large. We show that this non-replicability is intimately related to properties of the adaptive algorithm itself. We introduce a formal definition of a \"replicable bandit algorithm\" and prove that under such algorithms, a wide variety of common statistical analyses are guaranteed to be consistent. 
We present both theoretical results and simulation studies based on a mobile health oral health self-care intervention. Our findings underscore the importance of designing adaptive algorithms with replicability in mind, especially for settings like digital health where deployment decisions rely heavily on replicated evidence. We conclude by discussing open questions on the connections between algorithm design, statistical inference, and experimental replicability."}, "https://arxiv.org/abs/2407.15461": {"title": "Forecasting mortality rates with functional signatures", "link": "https://arxiv.org/abs/2407.15461", "description": "arXiv:2407.15461v1 Announce Type: new \nAbstract: This study introduces an innovative methodology for mortality forecasting, which integrates signature-based methods within the functional data framework of the Hyndman-Ullah (HU) model. This new approach, termed the Hyndman-Ullah with truncated signatures (HUts) model, aims to enhance the accuracy and robustness of mortality predictions. By utilizing signature regression, the HUts model aims to capture complex, nonlinear dependencies in mortality data, which enhances forecasting accuracy across various demographic conditions. The model is applied to mortality data from 12 countries, comparing its forecasting performance against classical models like the Lee-Carter model and variants of the HU models across multiple forecast horizons. Our findings indicate that, overall, the HUts model not only provides more precise point forecasts but also shows robustness against data irregularities, such as those observed in countries with historical outliers. The integration of signature-based methods enables the HUts model to capture complex patterns in mortality data, making it a powerful tool for actuaries and demographers. Prediction intervals are also constructed using bootstrapping methods."}, "https://arxiv.org/abs/2407.15522": {"title": "Big Data Analytics-Enabled Dynamic Capabilities and Market Performance: Examining the Roles of Marketing Ambidexterity and Competitor Pressure", "link": "https://arxiv.org/abs/2407.15522", "description": "arXiv:2407.15522v1 Announce Type: new \nAbstract: This study, rooted in dynamic capability theory and the developing era of Big Data Analytics, explores the transformative effect of BDA EDCs on marketing ambidexterity and firms' market performance in the textile sector of Pakistan's cities. Specifically, focusing on firms that deal directly with customers, it investigates the nuanced role of BDA EDCs in textile retail firms' potential to navigate market dynamics. Emphasizing the exploitation component of marketing ambidexterity, the study investigates the mediating function of marketing ambidexterity and the moderating influence of competitive pressure. Using a survey questionnaire, the study targets key decision makers in textile firms of Faisalabad, Chiniot and Lahore, Pakistan. The PLS-SEM model was employed as the analytical technique, which allows for a full examination of the complicated relations between BDA EDCs, marketing ambidexterity, competitive pressure, and market performance. The study predicts a positive impact of Big Data on marketing ambidexterity, with a specific emphasis on exploitation. The study expects this exploitation-orientated marketing ambidexterity to significantly enhance the firms' market performance. This research contributes to the existing literature on dynamic capabilities-based frameworks from the perspective of the retail segment of the textile industry. 
The study emphasizes the role of BDA-EDCs in the retail sector, imparting insights into the direct and indirect effects of BDA EDCs on market performance within the retail sector. The study's novelty lies in its contextualization of BDA-EDCs in the textile sector of Faisalabad, Lahore and Chiniot, providing a unique perspective on the effect of BDA on marketing ambidexterity and market performance in firms. Methodologically, the study uses multiple samples from the retail sector to ensure broader generalizability, contributing practical insights."}, "https://arxiv.org/abs/2407.15666": {"title": "Particle Based Inference for Continuous-Discrete State Space Models", "link": "https://arxiv.org/abs/2407.15666", "description": "arXiv:2407.15666v1 Announce Type: new \nAbstract: This article develops a methodology allowing application of the complete machinery of particle-based inference methods upon what we call the class of continuous-discrete State Space Models (CD-SSMs). Such models correspond to a latent continuous-time It\^o diffusion process which is observed with noise at discrete time instances. Due to the continuous-time nature of the hidden signal, standard Feynman-Kac formulations and their accompanying particle-based approximations have to overcome several challenges, arising mainly due to the following considerations: (i) finite-time transition densities of the signal are typically intractable; (ii) ancestors of sampled signals are determined w.p.~1, thus cannot be resampled; (iii) diffusivity parameters given a sampled signal yield Dirac distributions. We overcome all of the above issues by introducing a framework based on carefully designed proposals and transformations thereof. That is, we obtain new expressions for the Feynman-Kac model that accommodate the effects of a continuous-time signal and overcome induced degeneracies. The constructed formulations will enable use of the full range of particle-based algorithms for CD-SSMs: for filtering/smoothing and parameter inference, whether online or offline. Our framework is compatible with guided proposals in the filtering steps that are essential for efficient algorithmic performance in the presence of informative observations or in higher dimensions, and is applicable for a very general class of CD-SSMs, including the case when the signal is modelled as a hypo-elliptic diffusion. Our methods can be immediately incorporated into available software packages for particle-based algorithms."}, "https://arxiv.org/abs/2407.15674": {"title": "LASSO Estimation in Exponential Random Graph models", "link": "https://arxiv.org/abs/2407.15674", "description": "arXiv:2407.15674v1 Announce Type: new \nAbstract: The paper demonstrates the use of LASSO-based estimation in network models. Taking the Exponential Random Graph Model (ERGM) as a flexible and widely used model for network data analysis, the paper focuses on the question of how to specify the (sufficient) statistics that define the model structure. This includes both endogenous network statistics (e.g. twostars, triangles, etc.) and statistics involving exogenous covariates, on the node as well as on the edge level. LASSO estimation is a penalized estimation method that shrinks some of the parameter estimates to be equal to zero. As such it allows for model selection by modifying the amount of penalty.
The concept is well established in standard regression and we demonstrate its usage in network data analysis, with the advantage of automatically providing a model selection framework."}, "https://arxiv.org/abs/2407.15733": {"title": "Online closed testing with e-values", "link": "https://arxiv.org/abs/2407.15733", "description": "arXiv:2407.15733v1 Announce Type: new \nAbstract: In contemporary research, data scientists often test an infinite sequence of hypotheses $H_1,H_2,\\ldots $ one by one, and are required to make real-time decisions without knowing the future hypotheses or data. In this paper, we consider such an online multiple testing problem with the goal of providing simultaneous lower bounds for the number of true discoveries in data-adaptively chosen rejection sets. Using the (online) closure principle, we show that for this task it is necessary to use an anytime-valid test for each intersection hypothesis. Motivated by this result, we construct a new online closed testing procedure and a corresponding short-cut with a true discovery guarantee based on multiplying sequential e-values. This general but simple procedure gives uniform improvements over existing methods but also allows to construct entirely new and powerful procedures. In addition, we introduce new ideas for hedging and boosting of sequential e-values that provably increase power. Finally, we also propose the first online true discovery procedure for arbitrarily dependent e-values."}, "https://arxiv.org/abs/2407.14537": {"title": "Small but not least changes: The Art of Creating Disruptive Innovations", "link": "https://arxiv.org/abs/2407.14537", "description": "arXiv:2407.14537v1 Announce Type: cross \nAbstract: In the ever-evolving landscape of technology, product innovation thrives on replacing outdated technologies with groundbreaking ones or through the ingenious recombination of existing technologies. Our study embarks on a revolutionary journey by genetically representing products, extracting their chromosomal data, and constructing a comprehensive phylogenetic network of automobiles. We delve deep into the technological features that shape innovation, pinpointing the ancestral roots of products and mapping out intricate product-family triangles. By leveraging the similarities within these triangles, we introduce a pioneering \"Product Disruption Index\"-inspired by the CD index (Funk and Owen-Smith, 2017)-to quantify a product's disruptiveness. Our approach is rigorously validated against the scientifically recognized trend of decreasing disruptiveness over time (Park et al., 2023) and through compelling case studies. Our statistical analysis reveals a fascinating insight: disruptive product innovations often stem from minor, yet crucial, modifications."}, "https://arxiv.org/abs/2407.14861": {"title": "Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes", "link": "https://arxiv.org/abs/2407.14861", "description": "arXiv:2407.14861v1 Announce Type: cross \nAbstract: With the growing access to administrative health databases, retrospective studies have become crucial evidence for medical treatments. Yet, non-randomized studies frequently face selection biases, requiring mitigation strategies. Propensity score matching (PSM) addresses these biases by selecting comparable populations, allowing for analysis without further methodological constraints. However, PSM has several drawbacks. 
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria. To prevent cherry-picking the best method, public authorities must involve field experts and engage in extensive discussions with researchers.\n To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches. A2A constructs artificial matching tasks that mirror the original ones but with known outcomes, assessing each matching method's performance comprehensively from propensity estimation to ATE estimation. When combined with Standardized Mean Difference, A2A enhances the precision of model selection, resulting in a reduction of up to 50% in ATE estimation errors across synthetic tasks and up to 90% in predicted ATE variability across both synthetic and real-world datasets. To our knowledge, A2A is the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection.\n Computing A2A requires solving hundreds of PSMs, we therefore automate all manual steps of the PSM pipeline. We integrate PSM methods from Python and R, our automated pipeline, a new metric, and reproducible experiments into popmatch, our new Python package, to enhance reproducibility and accessibility to bias correction methods."}, "https://arxiv.org/abs/2407.15028": {"title": "Statistical Models for Outbreak Detection of Measles in North Cotabato, Philippines", "link": "https://arxiv.org/abs/2407.15028", "description": "arXiv:2407.15028v1 Announce Type: cross \nAbstract: A measles outbreak occurs when the number of cases of measles in the population exceeds the typical level. Outbreaks that are not detected and managed early can increase mortality and morbidity and incur costs from activities responding to these events. The number of measles cases in the Province of North Cotabato, Philippines, was used in this study. Weekly reported cases of measles from January 2016 to December 2021 were provided by the Epidemiology and Surveillance Unit of the North Cotabato Provincial Health Office. Several integer-valued autoregressive (INAR) time series models were used to explore the possibility of detecting and identifying measles outbreaks in the province along with the classical ARIMA model. These models were evaluated based on goodness of fit, measles outbreak detection accuracy, and timeliness. The results of this study confirmed that INAR models have the conceptual advantage over ARIMA since the latter produces non-integer forecasts, which are not realistic for count data such as measles cases. Among the INAR models, the ZINGINAR (1) model was recommended for having a good model fit and timely and accurate detection of outbreaks. Furthermore, policymakers and decision-makers from relevant government agencies can use the ZINGINAR (1) model to improve disease surveillance and implement preventive measures against contagious diseases beforehand."}, "https://arxiv.org/abs/2407.15256": {"title": "Weak-instrument-robust subvector inference in instrumental variables regression: A subvector Lagrange multiplier test and properties of subvector Anderson-Rubin confidence sets", "link": "https://arxiv.org/abs/2407.15256", "description": "arXiv:2407.15256v1 Announce Type: cross \nAbstract: We propose a weak-instrument-robust subvector Lagrange multiplier test for instrumental variables regression. We show that it is asymptotically size-correct under a technical condition. 
This is the first weak-instrument-robust subvector test for instrumental variables regression to recover the degrees of freedom of the commonly used Wald test, which is not robust to weak instruments. Additionally, we provide a closed-form solution for subvector confidence sets obtained by inverting the subvector Anderson-Rubin test. We show that they are centered around a k-class estimator. Also, we show that the subvector confidence sets for single coefficients of the causal parameter are jointly bounded if and only if Anderson's likelihood-ratio test rejects the hypothesis that the first-stage regression parameter is of reduced rank, that is, that the causal parameter is not identified. Finally, we show that if a confidence set obtained by inverting the Anderson-Rubin test is bounded and nonempty, it is equal to a Wald-based confidence set with a data-dependent confidence level. We explicitly compute this Wald-based confidence set."}, "https://arxiv.org/abs/2407.15564": {"title": "Non-parametric estimation of conditional quantiles for time series with heavy tails", "link": "https://arxiv.org/abs/2407.15564", "description": "arXiv:2407.15564v1 Announce Type: cross \nAbstract: We propose a modified weighted Nadaraya-Watson estimator for the conditional distribution of a time series with heavy tails. We establish the asymptotic normality of the proposed estimator. A simulation study is carried out to assess the performance of the estimator. We illustrate our method using a dataset."}, "https://arxiv.org/abs/2407.15636": {"title": "On-the-fly spectral unmixing based on Kalman filtering", "link": "https://arxiv.org/abs/2407.15636", "description": "arXiv:2407.15636v1 Announce Type: cross \nAbstract: This work introduces an on-the-fly (i.e., online) linear unmixing method which is able to sequentially analyze spectral data acquired on a spectrum-by-spectrum basis. After deriving a sequential counterpart of the conventional linear mixing model, the proposed approach recasts the linear unmixing problem into a linear state-space estimation framework. Under Gaussian noise and state models, the estimation of the pure spectra can be efficiently conducted by resorting to Kalman filtering. Interestingly, it is shown that this Kalman filter can operate in a lower-dimensional subspace while ensuring the nonnegativity constraint inherent to pure spectra. This dimensionality reduction significantly lightens the computational burden, while leveraging recent advances related to the representation of essential spectral information. The proposed method is evaluated through extensive numerical experiments conducted on synthetic and real Raman data sets. The results show that this Kalman filter-based method offers a convenient trade-off between unmixing accuracy and computational efficiency, which is crucial for operating in an on-the-fly setting. To the best of the authors' knowledge, this is the first operational method which is able to solve the spectral unmixing problem efficiently in a dynamic fashion.
It also constitutes a valuable building block for benefiting from acquisition and processing frameworks recently proposed in the microscopy literature, which are motivated by practical issues such as reducing acquisition time and avoiding potential damage to photosensitive samples."}, "https://arxiv.org/abs/2407.15764": {"title": "Huber means on Riemannian manifolds", "link": "https://arxiv.org/abs/2407.15764", "description": "arXiv:2407.15764v1 Announce Type: cross \nAbstract: This article introduces Huber means on Riemannian manifolds, providing a robust alternative to the Frechet mean by integrating elements of both square and absolute loss functions. The Huber means are designed to be highly resistant to outliers while maintaining efficiency, making them a valuable generalization of Huber's M-estimator for manifold-valued data. We comprehensively investigate the statistical and computational aspects of Huber means, demonstrating their utility in manifold-valued data analysis. Specifically, we establish minimal conditions for ensuring the existence and uniqueness of the Huber mean and discuss regularity conditions for unbiasedness. The Huber means are statistically consistent and satisfy a central limit theorem. Additionally, we propose a moment-based estimator for the limiting covariance matrix, which is used to construct a robust one-sample location test procedure and an approximate confidence region for location parameters. Huber means are shown to be highly robust and efficient in the presence of outliers or under heavy-tailed distributions. More specifically, the Huber mean achieves a breakdown point of at least 0.5, the highest among all isometric equivariant estimators, and is more efficient than the Frechet mean under heavy-tailed distributions. Numerical examples on spheres and the set of symmetric positive-definite matrices further illustrate the efficiency and reliability of the proposed Huber means on Riemannian manifolds."}, "https://arxiv.org/abs/2007.04229": {"title": "Estimating Monte Carlo variance from multiple Markov chains", "link": "https://arxiv.org/abs/2007.04229", "description": "arXiv:2007.04229v4 Announce Type: replace \nAbstract: Modern computational advances have enabled easy parallel implementations of Markov chain Monte Carlo (MCMC). However, almost all work in estimating the variance of Monte Carlo averages, including the efficient batch means (BM) estimator, focuses on a single-chain MCMC run. We demonstrate that simply averaging covariance matrix estimators from multiple chains can yield critical underestimates in small Monte Carlo sample sizes, especially for slow-mixing Markov chains. We extend the work of \cite{arg:and:2006} and propose a multivariate replicated batch means (RBM) estimator that utilizes information from parallel chains, thereby correcting for the underestimation. Under weak conditions on the mixing rate of the process, RBM is strongly consistent and exhibits similar large-sample bias and variance to the BM estimator. We also exhibit superior theoretical properties of RBM by showing that the (negative) bias in the RBM estimator is less than that of the average BM estimator in the presence of positive correlation in MCMC.
Consequently, in small runs, the RBM estimator can be dramatically superior and this is demonstrated through a variety of examples."}, "https://arxiv.org/abs/2007.12807": {"title": "Cross-validation Approaches for Multi-study Predictions", "link": "https://arxiv.org/abs/2007.12807", "description": "arXiv:2007.12807v4 Announce Type: replace \nAbstract: We consider prediction in multiple studies with potential differences in the relationships between predictors and outcomes. Our objective is to integrate data from multiple studies to develop prediction models for unseen studies. We propose and investigate two cross-validation approaches applicable to multi-study stacking, an ensemble method that linearly combines study-specific ensemble members to produce generalizable predictions. Among our cross-validation approaches are some that avoid reuse of the same data in both the training and stacking steps, as done in earlier multi-study stacking. We prove that under mild regularity conditions the proposed cross-validation approaches produce stacked prediction functions with oracle properties. We also identify analytically in which scenarios the proposed cross-validation approaches increase prediction accuracy compared to stacking with data reuse. We perform a simulation study to illustrate these results. Finally, we apply our method to predicting mortality from long-term exposure to air pollutants, using collections of datasets."}, "https://arxiv.org/abs/2104.04628": {"title": "Modeling Time-Varying Random Objects and Dynamic Networks", "link": "https://arxiv.org/abs/2104.04628", "description": "arXiv:2104.04628v2 Announce Type: replace \nAbstract: Samples of dynamic or time-varying networks and other random object data such as time-varying probability distributions are increasingly encountered in modern data analysis. Common methods for time-varying data such as functional data analysis are infeasible when observations are time courses of networks or other complex non-Euclidean random objects that are elements of general metric spaces. In such spaces, only pairwise distances between the data objects are available and a strong limitation is that one cannot carry out arithmetic operations due to the lack of an algebraic structure. We combat this complexity by a generalized notion of mean trajectory taking values in the object space. For this, we adopt pointwise Fr\\'echet means and then construct pointwise distance trajectories between the individual time courses and the estimated Fr\\'echet mean trajectory, thus representing the time-varying objects and networks by functional data. Functional principal component analysis of these distance trajectories can reveal interesting features of dynamic networks and object time courses and is useful for downstream analysis. Our approach also makes it possible to study the empirical dynamics of time-varying objects, including dynamic regression to the mean or explosive behavior over time. We demonstrate desirable asymptotic properties of sample based estimators for suitable population targets under mild assumptions. 
The utility of the proposed methodology is illustrated with dynamic networks, time-varying distribution data and longitudinal growth data."}, "https://arxiv.org/abs/2112.04626": {"title": "Bayesian Semiparametric Longitudinal Inverse-Probit Mixed Models for Category Learning", "link": "https://arxiv.org/abs/2112.04626", "description": "arXiv:2112.04626v4 Announce Type: replace \nAbstract: Understanding how the adult human brain learns novel categories is an important problem in neuroscience. Drift-diffusion models are popular in such contexts for their ability to mimic the underlying neural mechanisms. One such model for gradual longitudinal learning was recently developed by Paulon et al. (2021). Fitting conventional drift-diffusion models, however, requires data on both category responses and associated response times. In practice, category response accuracies are often the only reliable measure recorded by behavioral scientists to describe human learning. However, to our knowledge, drift-diffusion models for such scenarios have never been considered in the literature. To address this gap, in this article, we build carefully on Paulon et al. (2021), but now with latent response times integrated out, to derive a novel biologically interpretable class of `inverse-probit' categorical probability models for observed categories alone. However, this new marginal model presents significant identifiability and inferential challenges not encountered originally for the joint model by Paulon et al. (2021). We address these new challenges using a novel projection-based approach with a symmetry-preserving identifiability constraint that allows us to work with conjugate priors in an unconstrained space. We adapt the model for group and individual-level inference in longitudinal settings. Building again on the model's latent variable representation, we design an efficient Markov chain Monte Carlo algorithm for posterior computation. We evaluate the empirical performance of the method through simulation experiments. The practical efficacy of the method is illustrated in applications to longitudinal tone learning studies."}, "https://arxiv.org/abs/2205.04345": {"title": "A unified diagnostic test for regression discontinuity designs", "link": "https://arxiv.org/abs/2205.04345", "description": "arXiv:2205.04345v4 Announce Type: replace \nAbstract: Diagnostic tests for regression discontinuity designs face a size-control problem. We document a massive over-rejection of the identifying restriction among empirical studies in the top five economics journals. At least one diagnostic test was rejected for 21 out of 60 studies, whereas less than 5% of the collected 799 tests rejected the null hypotheses. In other words, more than one-third of the studies rejected at least one of their diagnostic tests, whereas their underlying identifying restrictions appear valid. Multiple testing causes this problem because the median number of tests per study was as high as 12. Therefore, we offer unified tests to overcome the size-control problem. Our procedure is based on the new joint asymptotic normality of local polynomial mean and density estimates. In simulation studies, our unified tests outperformed the Bonferroni correction.
We implement the procedure as an R package rdtest with two empirical examples in its vignettes."}, "https://arxiv.org/abs/2208.14015": {"title": "The SPDE approach for spatio-temporal datasets with advection and diffusion", "link": "https://arxiv.org/abs/2208.14015", "description": "arXiv:2208.14015v4 Announce Type: replace \nAbstract: In the task of predicting spatio-temporal fields in environmental science using statistical methods, introducing statistical models inspired by the physics of the underlying phenomena that are numerically efficient is of growing interest. Large space-time datasets call for new numerical methods to efficiently process them. The Stochastic Partial Differential Equation (SPDE) approach has proven to be effective for the estimation and the prediction in a spatial context. We present here the advection-diffusion SPDE with first order derivative in time which defines a large class of nonseparable spatio-temporal models. A Gaussian Markov random field approximation of the solution to the SPDE is built by discretizing the temporal derivative with a finite difference method (implicit Euler) and by solving the spatial SPDE with a finite element method (continuous Galerkin) at each time step. The ''Streamline Diffusion'' stabilization technique is introduced when the advection term dominates the diffusion. Computationally efficient methods are proposed to estimate the parameters of the SPDE and to predict the spatio-temporal field by kriging, as well as to perform conditional simulations. The approach is applied to a solar radiation dataset. Its advantages and limitations are discussed."}, "https://arxiv.org/abs/2209.01396": {"title": "Small Study Regression Discontinuity Designs: Density Inclusive Study Size Metric and Performance", "link": "https://arxiv.org/abs/2209.01396", "description": "arXiv:2209.01396v3 Announce Type: replace \nAbstract: Regression discontinuity (RD) designs are popular quasi-experimental studies in which treatment assignment depends on whether the value of a running variable exceeds a cutoff. RD designs are increasingly popular in educational applications due to the prevalence of cutoff-based interventions. In such applications sample sizes can be relatively small or there may be sparsity around the cutoff. We propose a metric, density inclusive study size (DISS), that characterizes the size of an RD study better than overall sample size by incorporating the density of the running variable. We show the usefulness of this metric in a Monte Carlo simulation study that compares the operating characteristics of popular nonparametric RD estimation methods in small studies. We also apply the DISS metric and RD estimation methods to school accountability data from the state of Indiana."}, "https://arxiv.org/abs/2302.03687": {"title": "Covariate Adjustment in Stratified Experiments", "link": "https://arxiv.org/abs/2302.03687", "description": "arXiv:2302.03687v4 Announce Type: replace \nAbstract: This paper studies covariate adjusted estimation of the average treatment effect in stratified experiments. We work in a general framework that includes matched tuples designs, coarse stratification, and complete randomization as special cases. Regression adjustment with treatment-covariate interactions is known to weakly improve efficiency for completely randomized designs. 
By contrast, we show that for stratified designs such regression estimators are generically inefficient, potentially even increasing estimator variance relative to the unadjusted benchmark. Motivated by this result, we derive the asymptotically optimal linear covariate adjustment for a given stratification. We construct several feasible estimators that implement this efficient adjustment in large samples. In the special case of matched pairs, for example, the regression including treatment, covariates, and pair fixed effects is asymptotically optimal. We also provide novel asymptotically exact inference methods that allow researchers to report smaller confidence intervals, fully reflecting the efficiency gains from both stratification and adjustment. Simulations and an empirical application demonstrate the value of our proposed methods."}, "https://arxiv.org/abs/2303.16101": {"title": "Two-step estimation of latent trait models", "link": "https://arxiv.org/abs/2303.16101", "description": "arXiv:2303.16101v2 Announce Type: replace \nAbstract: We consider two-step estimation of latent variable models, in which just the measurement model is estimated in the first step and the measurement parameters are then fixed at their estimated values in the second step where the structural model is estimated. We show how this approach can be implemented for latent trait models (item response theory models) where the latent variables are continuous and their measurement indicators are categorical variables. The properties of two-step estimators are examined using simulation studies and applied examples. They perform well, and have attractive practical and conceptual properties compared to the alternative one-step and three-step approaches. These results are in line with previous findings for other families of latent variable models. This provides strong evidence that two-step estimation is a flexible and useful general method of estimation for different types of latent variable models."}, "https://arxiv.org/abs/2306.05593": {"title": "Localized Neural Network Modelling of Time Series: A Case Study on US Monetary Policy", "link": "https://arxiv.org/abs/2306.05593", "description": "arXiv:2306.05593v2 Announce Type: replace \nAbstract: In this paper, we investigate a semiparametric regression model under the context of treatment effects via a localized neural network (LNN) approach. Due to a vast number of parameters involved, we reduce the number of effective parameters by (i) exploring the use of identification restrictions; and (ii) adopting a variable selection method based on the group-LASSO technique. Subsequently, we derive the corresponding estimation theory and propose a dependent wild bootstrap procedure to construct valid inferences accounting for the dependence of data. Finally, we validate our theoretical findings through extensive numerical studies. 
In an empirical study, we revisit the impacts of a tightening monetary policy action on a variety of economic variables, including short-/long-term interest rate, inflation, unemployment rate, industrial price and equity return via the newly proposed framework using a monthly dataset of the US."}, "https://arxiv.org/abs/2307.00127": {"title": "Large-scale Bayesian Structure Learning for Gaussian Graphical Models using Marginal Pseudo-likelihood", "link": "https://arxiv.org/abs/2307.00127", "description": "arXiv:2307.00127v3 Announce Type: replace \nAbstract: Bayesian methods for learning Gaussian graphical models offer a comprehensive framework that addresses model uncertainty and incorporates prior knowledge. Despite their theoretical strengths, the applicability of Bayesian methods is often constrained by computational demands, especially in modern contexts involving thousands of variables. To overcome this issue, we introduce two novel Markov chain Monte Carlo (MCMC) search algorithms with a significantly lower computational cost than leading Bayesian approaches. Our proposed MCMC-based search algorithms use the marginal pseudo-likelihood approach to bypass the complexities of computing intractable normalizing constants and iterative precision matrix sampling. These algorithms can deliver reliable results in mere minutes on standard computers, even for large-scale problems with one thousand variables. Furthermore, our proposed method efficiently addresses model uncertainty by exploring the full posterior graph space. We establish the consistency of graph recovery, and our extensive simulation study indicates that the proposed algorithms, particularly for large-scale sparse graphs, outperform leading Bayesian approaches in terms of computational efficiency and accuracy. We also illustrate the practical utility of our methods on medium and large-scale applications from human and mice gene expression studies. The implementation supporting the new approach is available through the R package BDgraph."}, "https://arxiv.org/abs/2308.13630": {"title": "Degrees of Freedom: Search Cost and Self-consistency", "link": "https://arxiv.org/abs/2308.13630", "description": "arXiv:2308.13630v2 Announce Type: replace \nAbstract: Model degrees of freedom ($\\df$) is a fundamental concept in statistics because it quantifies the flexibility of a fitting procedure and is indispensable in model selection. To investigate the gap between $\\df$ and the number of independent variables in the fitting procedure, \\textcite{tibshiraniDegreesFreedomModel2015} introduced the \\emph{search degrees of freedom} ($\\sdf$) concept to account for the search cost during model selection. However, this definition has two limitations: it does not consider fitting procedures in augmented spaces and does not use the same fitting procedure for $\\sdf$ and $\\df$. We propose a \\emph{modified search degrees of freedom} ($\\msdf$) to directly account for the cost of searching in either original or augmented spaces. We check this definition for various fitting procedures, including classical linear regressions, spline methods, adaptive regressions (the best subset and the lasso), regression trees, and multivariate adaptive regression splines (MARS). In many scenarios when $\\sdf$ is applicable, $\\msdf$ reduces to $\\sdf$. However, for certain procedures like the lasso, $\\msdf$ offers a fresh perspective on search costs. 
For some complex procedures like MARS, the $\\df$ has been pre-determined during model fitting, but the $\\df$ of the final fitted procedure might differ from the pre-determined one. To investigate this discrepancy, we introduce the concepts of \\emph{nominal} $\\df$ and \\emph{actual} $\\df$, and define the property of \\emph{self-consistency}, which occurs when there is no gap between these two $\\df$'s. We propose a correction procedure for MARS to align these two $\\df$'s, demonstrating improved fitting performance through extensive simulations and two real data applications."}, "https://arxiv.org/abs/2309.09872": {"title": "Moment-assisted GMM for Improving Subsampling-based MLE with Large-scale data", "link": "https://arxiv.org/abs/2309.09872", "description": "arXiv:2309.09872v2 Announce Type: replace \nAbstract: The maximum likelihood estimation is computationally demanding for large datasets, particularly when the likelihood function includes integrals. Subsampling can reduce the computational burden, but it typically results in efficiency loss. This paper proposes a moment-assisted subsampling (MAS) method that can improve the estimation efficiency of existing subsampling-based maximum likelihood estimators. The motivation behind this approach stems from the fact that sample moments can be efficiently computed even if the sample size of the whole data set is huge. Through the generalized method of moments, the proposed method incorporates informative sample moments of the whole data. The MAS estimator can be computed rapidly and is asymptotically normal with a smaller asymptotic variance than the corresponding estimator without incorporating sample moments of the whole data. The asymptotic variance of the MAS estimator depends on the specific sample moments incorporated. We derive the optimal moment that minimizes the resulting asymptotic variance in terms of Loewner order. Simulation studies and real data analysis were conducted to compare the proposed method with existing subsampling methods. Numerical results demonstrate the promising performance of the MAS method across various scenarios."}, "https://arxiv.org/abs/2311.05200": {"title": "Efficient Bayesian functional principal component analysis of irregularly-observed multivariate curves", "link": "https://arxiv.org/abs/2311.05200", "description": "arXiv:2311.05200v2 Announce Type: replace \nAbstract: The analysis of multivariate functional curves has the potential to yield important scientific discoveries in domains such as healthcare, medicine, economics and social sciences. However, it is common for real-world settings to present longitudinal data that are both irregularly and sparsely observed, which introduces important challenges for the current functional data methodology. A Bayesian hierarchical framework for multivariate functional principal component analysis is proposed, which accommodates the intricacies of such irregular observation settings by flexibly pooling information across subjects and correlated curves. The model represents common latent dynamics via shared functional principal component scores, thereby effectively borrowing strength across curves while circumventing the computationally challenging task of estimating covariance matrices. These scores also provide a parsimonious representation of the major modes of joint variation of the curves and constitute interpretable scalar summaries that can be employed in follow-up analyses. 
Estimation is carried out using variational inference, which combines efficiency, modularity and approximate posterior density estimation, enabling the joint analysis of large datasets with parameter uncertainty quantification. Detailed simulations assess the effectiveness of the approach in sharing information from sparse and irregularly sampled multivariate curves. The methodology is also exploited to estimate the molecular disease courses of individual patients with SARS-CoV-2 infection and characterise patient heterogeneity in recovery outcomes; this study reveals key coordinated dynamics across the immune, inflammatory and metabolic systems, which are associated with survival and long-COVID symptoms up to one year post disease onset. The approach is implemented in the R package bayesFPCA."}, "https://arxiv.org/abs/2311.18146": {"title": "Co-Active Subspace Methods for the Joint Analysis of Adjacent Computer Models", "link": "https://arxiv.org/abs/2311.18146", "description": "arXiv:2311.18146v2 Announce Type: replace \nAbstract: Active subspace (AS) methods are a valuable tool for understanding the relationship between the inputs and outputs of a Physics simulation. In this paper, an elegant generalization of the traditional ASM is developed to assess the co-activity of two computer models. This generalization, which we refer to as a Co-Active Subspace (C-AS) Method, allows for the joint analysis of two or more computer models allowing for thorough exploration of the alignment (or non-alignment) of the respective gradient spaces. We define co-active directions, co-sensitivity indices, and a scalar ``concordance\" metric (and complementary ``discordance\" pseudo-metric) and we demonstrate that these are powerful tools for understanding the behavior of a class of computer models, especially when used to supplement traditional AS analysis. Details for efficient estimation of the C-AS and an accompanying R package (github.com/knrumsey/concordance) are provided. Practical application is demonstrated through analyzing a set of simulated rate stick experiments for PBX 9501, a high explosive, offering insights into complex model dynamics."}, "https://arxiv.org/abs/2008.03073": {"title": "Degree distributions in networks: beyond the power law", "link": "https://arxiv.org/abs/2008.03073", "description": "arXiv:2008.03073v5 Announce Type: replace-cross \nAbstract: The power law is useful in describing count phenomena such as network degrees and word frequencies. With a single parameter, it captures the main feature that the frequencies are linear on the log-log scale. Nevertheless, there have been criticisms of the power law, for example that a threshold needs to be pre-selected without its uncertainty quantified, that the power law is simply inadequate, and that subsequent hypothesis tests are required to determine whether the data could have come from the power law. We propose a modelling framework that combines two different generalisations of the power law, namely the generalised Pareto distribution and the Zipf-polylog distribution, to resolve these issues. The proposed mixture distributions are shown to fit the data well and quantify the threshold uncertainty in a natural way. 
A model selection step embedded in the Bayesian inference algorithm further answers the question whether the power law is adequate."}, "https://arxiv.org/abs/2309.07810": {"title": "Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression", "link": "https://arxiv.org/abs/2309.07810", "description": "arXiv:2309.07810v3 Announce Type: replace-cross \nAbstract: Debiasing is a fundamental concept in high-dimensional statistics. While degrees-of-freedom adjustment is the state-of-the-art technique in high-dimensional linear regression, it is limited to i.i.d. samples and sub-Gaussian covariates. These constraints hinder its broader practical use. Here, we introduce Spectrum-Aware Debiasing--a novel method for high-dimensional regression. Our approach applies to problems with structured dependencies, heavy tails, and low-rank structures. Our method achieves debiasing through a rescaled gradient descent step, deriving the rescaling factor using spectral information of the sample covariance matrix. The spectrum-based approach enables accurate debiasing in much broader contexts. We study the common modern regime where the number of features and samples scale proportionally. We establish asymptotic normality of our proposed estimator (suitably centered and scaled) under various convergence notions when the covariates are right-rotationally invariant. Such designs have garnered recent attention due to their crucial role in compressed sensing. Furthermore, we devise a consistent estimator for its asymptotic variance.\n Our work has two notable by-products: first, we use Spectrum-Aware Debiasing to correct bias in principal components regression (PCR), providing the first debiased PCR estimator in high dimensions. Second, we introduce a principled test for checking alignment between the signal and the eigenvectors of the sample covariance matrix. This test is independently valuable for statistical methods developed using approximate message passing, leave-one-out, or convex Gaussian min-max theorems. We demonstrate our method through simulated and real data experiments. Technically, we connect approximate message passing algorithms with debiasing and provide the first proof of the Cauchy property of vector approximate message passing (V-AMP)."}, "https://arxiv.org/abs/2407.15874": {"title": "Spatially-clustered spatial autoregressive models with application to agricultural market concentration in Europe", "link": "https://arxiv.org/abs/2407.15874", "description": "arXiv:2407.15874v1 Announce Type: new \nAbstract: In this paper, we present an extension of the spatially-clustered linear regression models, namely, the spatially-clustered spatial autoregression (SCSAR) model, to deal with spatial heterogeneity issues in clustering procedures. In particular, we extend classical spatial econometrics models, such as the spatial autoregressive model, the spatial error model, and the spatially-lagged model, by allowing the regression coefficients to be spatially varying according to a cluster-wise structure. Cluster memberships and regression coefficients are jointly estimated through a penalized maximum likelihood algorithm which encourages neighboring units to belong to the same spatial cluster with shared regression coefficients. 
Motivated by the increase in observed values of the Gini index for agricultural production in Europe between 2010 and 2020, the proposed methodology is employed to assess the presence of local spatial spillovers on the market concentration index for the European regions in the last decade. Empirical findings support the hypothesis of fragmentation of the European agricultural market, as the regions can be well represented by a clustering structure partitioning the continent into three groups, roughly approximated by a division among Western, North Central and Southeastern regions. Also, we detect heterogeneous local effects induced by the selected explanatory variables on the regional market concentration. In particular, we find that variables associated with the social, territorial and economic relevance of the agricultural sector seem to act differently across the spatial and temporal dimensions, across the clusters, and with respect to the pooled model."}, "https://arxiv.org/abs/2407.16024": {"title": "Generalized functional dynamic principal component analysis", "link": "https://arxiv.org/abs/2407.16024", "description": "arXiv:2407.16024v1 Announce Type: new \nAbstract: In this paper, we explore dimension reduction for time series of functional data within both stationary and non-stationary frameworks. We introduce a functional framework of generalized dynamic principal component analysis (GDPCA). The concept of GDPCA aims for better adaptation to possible nonstationary features of the series. We define the functional generalized dynamic principal component (GDPC) as static factor time series in a functional dynamic factor model and obtain the multivariate GDPC from a truncation of the functional dynamic factor model. GDFPCA uses a minimum squared error criterion to evaluate the reconstruction of the original functional time series. The computation of GDPC involves a two-step estimation of the coefficient vector of the loading curves in a basis expansion. We provide a proof of the consistency of the reconstruction, with the GDPC reconstruction converging in mean square to the original functional time series. Monte Carlo simulation studies indicate that the proposed GDFPCA is comparable to dynamic functional principal component analysis (DFPCA) when the data generating process is stationary, and outperforms DFPCA and FPCA when the data generating process is non-stationary. The results of applications to real data reaffirm the findings in simulation studies."}, "https://arxiv.org/abs/2407.16037": {"title": "Estimating Distributional Treatment Effects in Randomized Experiments: Machine Learning for Variance Reduction", "link": "https://arxiv.org/abs/2407.16037", "description": "arXiv:2407.16037v1 Announce Type: new \nAbstract: We propose a novel regression adjustment method designed for estimating distributional treatment effect parameters in randomized experiments. Randomized experiments have been extensively used to estimate treatment effects in various scientific fields. However, to gain deeper insights, it is essential to estimate distributional treatment effects rather than relying solely on average effects. Our approach incorporates pre-treatment covariates into a distributional regression framework, utilizing machine learning techniques to improve the precision of distributional treatment effect estimators.
The proposed approach can be readily implemented with off-the-shelf machine learning methods and remains valid as long as the nuisance components are reasonably well estimated. Also, we establish the asymptotic properties of the proposed estimator and present a uniformly valid inference method. Through simulation results and real data analysis, we demonstrate the effectiveness of integrating machine learning techniques in reducing the variance of distributional treatment effect estimators in finite samples."}, "https://arxiv.org/abs/2407.16116": {"title": "Robust and consistent model evaluation criteria in high-dimensional regression", "link": "https://arxiv.org/abs/2407.16116", "description": "arXiv:2407.16116v1 Announce Type: new \nAbstract: In the last two decades, sparse regularization methods such as the LASSO have been applied in various fields. Most regularization methods have one or more regularization parameters, and selecting the value of the regularization parameter is essentially equivalent to selecting a model; thus, we need to determine the regularization parameter adequately. To determine the regularization parameter in the linear regression model, we often apply information criteria such as the AIC and BIC; however, it has been pointed out that these criteria are sensitive to outliers and tend not to perform well in high-dimensional settings. Outliers generally have a negative influence not only on estimation but also on model selection; consequently, it is important to employ a selection method that is robust against outliers. In addition, when the number of explanatory variables is quite large, most conventional criteria are prone to selecting unnecessary explanatory variables. In this paper, we propose model evaluation criteria based on statistical divergence that are robust in both parameter estimation and model selection. Furthermore, the proposed criteria simultaneously achieve selection consistency and robustness even in high-dimensional settings. We also report the results of some numerical examples to verify that the proposed criteria perform robust and consistent variable selection compared with conventional selection methods."}, "https://arxiv.org/abs/2407.16212": {"title": "Optimal experimental design: Formulations and computations", "link": "https://arxiv.org/abs/2407.16212", "description": "arXiv:2407.16212v1 Announce Type: new \nAbstract: Questions of `how best to acquire data' are essential to modeling and prediction in the natural and social sciences, engineering applications, and beyond. Optimal experimental design (OED) formalizes these questions and creates computational methods to answer them. This article presents a systematic survey of modern OED, from its foundations in classical design theory to current research involving OED for complex models. We begin by reviewing criteria used to formulate an OED problem and thus to encode the goal of performing an experiment. We emphasize the flexibility of the Bayesian and decision-theoretic approach, which encompasses information-based criteria that are well-suited to nonlinear and non-Gaussian statistical models. We then discuss methods for estimating or bounding the values of these design criteria; this endeavor can be quite challenging due to strong nonlinearities, high parameter dimension, large per-sample costs, or settings where the model is implicit.
A complementary set of computational issues involves optimization methods used to find a design; we discuss such methods in the discrete (combinatorial) setting of observation selection and in settings where an exact design can be continuously parameterized. Finally, we present emerging methods for sequential OED that build non-myopic design policies, rather than explicit designs; these methods naturally adapt to the outcomes of past experiments in proposing new experiments, while seeking coordination among all experiments to be performed. Throughout, we highlight important open questions and challenges."}, "https://arxiv.org/abs/2407.16299": {"title": "Sparse outlier-robust PCA for multi-source data", "link": "https://arxiv.org/abs/2407.16299", "description": "arXiv:2407.16299v1 Announce Type: new \nAbstract: Sparse and outlier-robust Principal Component Analysis (PCA) has been a very active field of research recently. Yet, most existing methods apply PCA to a single dataset, whereas multi-source data, i.e. multiple related datasets requiring joint analysis, arise across many scientific areas. We introduce a novel PCA methodology that simultaneously (i) selects important features, (ii) allows for the detection of global sparse patterns across multiple data sources as well as local source-specific patterns, and (iii) is resistant to outliers. To this end, we develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns, and where the ssMRCD estimator is used as a plug-in to permit joint outlier-robust analysis across multiple data sources. We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers and illustrate its practical advantages in simulation and in applications."}, "https://arxiv.org/abs/2407.16349": {"title": "Bayesian modelling of VAR precision matrices using stochastic block networks", "link": "https://arxiv.org/abs/2407.16349", "description": "arXiv:2407.16349v1 Announce Type: new \nAbstract: Commonly used priors for Vector Autoregressions (VARs) induce shrinkage on the autoregressive coefficients. Introducing shrinkage on the error covariance matrix is sometimes done but, in the vast majority of cases, without considering the network structure of the shocks and by placing the prior on the lower Cholesky factor of the precision matrix. In this paper, we propose a prior on the VAR error precision matrix directly. Our prior, which resembles a standard spike and slab prior, models variable inclusion probabilities through a stochastic block model that clusters shocks into groups. Within groups, the probability of having relations across group members is higher (inducing less sparsity), whereas relations across groups imply a lower probability that members of each group are conditionally related. We show in simulations that our approach recovers the true network structure well. Using a US macroeconomic data set, we illustrate how our approach can be used to cluster shocks together and that this feature leads to improved density forecasts."}, "https://arxiv.org/abs/2407.16366": {"title": "Robust Bayesian Model Averaging for Linear Regression Models With Heavy-Tailed Errors", "link": "https://arxiv.org/abs/2407.16366", "description": "arXiv:2407.16366v1 Announce Type: new \nAbstract: In this article, our goal is to develop a method for Bayesian model averaging in linear regression models to accommodate heavier tailed error distributions than the normal distribution.
Motivated by the use of the Huber loss function in the presence of outliers, Park and Casella (2008) proposed the concept of the Bayesian Huberized lasso, which has been recently developed and implemented by Kawakami and Hashimoto (2023), with hyperbolic errors. Because the Huberized lasso cannot enforce regression coefficients to be exactly zero, we propose a fully Bayesian variable selection approach with spike and slab priors that can address sparsity more effectively. Furthermore, while the hyperbolic distribution has heavier tails than a normal distribution, its tails are less heavy in comparison to a Cauchy distribution. Thus, we propose a regression model with an error distribution that encompasses both hyperbolic and Student-t distributions. Our model aims to capture the benefit of using the Huber loss, but it can also adapt to heavier tails and unknown levels of sparsity, as necessitated by the data. We develop an efficient Gibbs sampler with Metropolis-Hastings steps for posterior computation. Through simulation studies and analyses of the benchmark Boston housing dataset and NBA player salaries in the 2022-2023 season, we show that our method is competitive with various state-of-the-art methods."}, "https://arxiv.org/abs/2407.16374": {"title": "A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests", "link": "https://arxiv.org/abs/2407.16374", "description": "arXiv:2407.16374v1 Announce Type: new \nAbstract: In the statistical literature, as well as in artificial intelligence and machine learning, measures of discrepancy between two probability distributions are largely used to develop measures of goodness-of-fit. We concentrate on quadratic distances, which depend on a non-negative definite kernel. We propose a unified framework for the study of two-sample and k-sample goodness of fit tests based on the concept of matrix distance. We provide a succinct review of the goodness of fit literature related to the use of distance measures, and specifically to quadratic distances. We show that the quadratic distance kernel-based two-sample test has the same functional form as the maximum mean discrepancy test. We develop tests for the $k$-sample scenario, where the two-sample problem is a special case. We derive their asymptotic distribution under the null hypothesis and discuss computational aspects of the test procedures. We assess their performance, in terms of level and power, via extensive simulations and a real data example. The proposed framework is implemented in the QuadratiK package, available in both R and Python environments."}, "https://arxiv.org/abs/2407.16550": {"title": "A Kernel-Based Conditional Two-Sample Test Using Nearest Neighbors (with Applications to Calibration, Regression Curves, and Simulation-Based Inference)", "link": "https://arxiv.org/abs/2407.16550", "description": "arXiv:2407.16550v1 Announce Type: new \nAbstract: In this paper we introduce a kernel-based measure for detecting differences between two conditional distributions. Using the `kernel trick' and nearest-neighbor graphs, we propose a consistent estimate of this measure which can be computed in nearly linear time (for a fixed number of nearest neighbors). Moreover, when the two conditional distributions are the same, the estimate has a Gaussian limit and its asymptotic variance has a simple form that can be easily estimated from the data.
The resulting test attains precise asymptotic level and is universally consistent for detecting differences between two conditional distributions. We also provide a resampling based test using our estimate that applies to the conditional goodness-of-fit problem, which controls Type I error in finite samples and is asymptotically consistent with only a finite number of resamples. A method to de-randomize the resampling test is also presented. The proposed methods can be readily applied to a broad range of problems, ranging from classical nonparametric statistics to modern machine learning. Specifically, we explore three applications: testing model calibration, regression curve evaluation, and validation of emulator models in simulation-based inference. We illustrate the superior performance of our method for these tasks, both in simulations as well as on real data. In particular, we apply our method to (1) assess the calibration of neural network models trained on the CIFAR-10 dataset, (2) compare regression functions for wind power generation across two different turbines, and (3) validate emulator models on benchmark examples with intractable posteriors and for generating synthetic `redshift' associated with galaxy images."}, "https://arxiv.org/abs/2407.15854": {"title": "Decoding Digital Influence: The Role of Social Media Behavior in Scientific Stratification Through Logistic Attribution Method", "link": "https://arxiv.org/abs/2407.15854", "description": "arXiv:2407.15854v1 Announce Type: cross \nAbstract: Scientific social stratification is a classic theme in the sociology of science. The deep integration of social media has bridged the gap between scientometrics and sociology of science. This study comprehensively analyzes the impact of social media on scientific stratification and mobility, delving into the complex interplay between academic status and social media activity in the digital age. [Research Method] Innovatively, this paper employs An Explainable Logistic Attribution Analysis from a meso-level perspective to explore the correlation between social media behaviors and scientific social stratification. It examines the impact of scientists' use of social media in the digital age on scientific stratification and mobility, uniquely combining statistical methods with machine learning. This fusion effectively integrates hypothesis testing with a substantive interpretation of the contribution of independent variables to the model. [Research Conclusion] Empirical evidence demonstrates that social media promotes stratification and mobility within the scientific community, revealing a nuanced and non-linear facilitation mechanism. Social media activities positively impact scientists' status within the scientific social hierarchy to a certain extent, but beyond a specific threshold, this impact turns negative. It shows that the advent of social media has opened new channels for academic influence, transcending the limitations of traditional academic publishing, and prompting changes in scientific stratification. 
Additionally, the study acknowledges the limitations of its experimental design and suggests future research directions."}, "https://arxiv.org/abs/2407.15868": {"title": "A Survey on Differential Privacy for SpatioTemporal Data in Transportation Research", "link": "https://arxiv.org/abs/2407.15868", "description": "arXiv:2407.15868v1 Announce Type: cross \nAbstract: With low-cost computing devices, improved sensor technology, and the proliferation of data-driven algorithms, we have more data than we know what to do with. In transportation, we are seeing a surge in spatiotemporal data collection. At the same time, concerns over user privacy have led to research on differential privacy in applied settings. In this paper, we look at some recent developments in differential privacy in the context of spatiotemporal data. Spatiotemporal data contain not only features about users but also the geographical locations of their frequent visits. Hence, the public release of such data carries extreme risks. To address the need for such data in research and inference without exposing private information, significant work has been proposed. This survey paper aims to summarize these efforts and provide a review of differential privacy mechanisms and related software. We also discuss related work in transportation where such mechanisms have been applied. Furthermore, we address the challenges in the deployment and mass adoption of differential privacy in transportation spatiotemporal data for downstream analyses."}, "https://arxiv.org/abs/2407.16152": {"title": "Discovering overlapping communities in multi-layer directed networks", "link": "https://arxiv.org/abs/2407.16152", "description": "arXiv:2407.16152v1 Announce Type: cross \nAbstract: This article explores the challenging problem of detecting overlapping communities in multi-layer directed networks. Our goal is to understand the underlying asymmetric overlapping community structure by analyzing the mixed memberships of nodes. We introduce a new model, the multi-layer mixed membership stochastic co-block model (multi-layer MM-ScBM), to model multi-layer directed networks in which nodes can belong to multiple communities. We develop a spectral procedure to estimate nodes' memberships in both sending and receiving patterns. Our method uses a successive projection algorithm on a few leading eigenvectors of two debiased aggregation matrices. To our knowledge, this is the first work to detect asymmetric overlapping communities in multi-layer directed networks. We demonstrate the consistent estimation properties of our method by providing per-node error rates under the multi-layer MM-ScBM framework. Our theoretical analysis reveals that increasing the overall sparsity, the number of nodes, or the number of layers can improve the accuracy of overlapping community detection. Extensive numerical experiments are conducted to validate these theoretical findings. We also apply our method to one real-world multi-layer directed network, gaining insightful results."}, "https://arxiv.org/abs/2407.16283": {"title": "A Randomized Exchange Algorithm for Optimal Design of Multi-Response Experiments", "link": "https://arxiv.org/abs/2407.16283", "description": "arXiv:2407.16283v1 Announce Type: cross \nAbstract: Despite the increasing prevalence of vector observations, computation of optimal experimental design for multi-response models has received limited attention. 
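Editorial note: the differential-privacy survey above (arXiv:2407.15868) reviews mechanisms for releasing spatiotemporal data; for reference, here is a minimal sketch of the classical Laplace mechanism for a count query, one of the basic building blocks such reviews cover. The sensitivity and epsilon values below are illustrative assumptions, not recommendations from the survey.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace noise with scale sensitivity/epsilon,
    giving epsilon-differential privacy for a query with the given L1 sensitivity."""
    if rng is None:
        rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privatize the number of trips observed at one location (hypothetical count).
trip_count = 412
noisy_count = laplace_mechanism(trip_count, sensitivity=1.0, epsilon=0.5)
print(round(noisy_count, 1))
```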
To address this problem within the framework of approximate designs, we introduce mREX, an algorithm that generalizes the randomized exchange algorithm REX (J Am Stat Assoc 115:529, 2020), originally specialized for single-response models. The mREX algorithm incorporates several improvements: a novel method for computing efficient sparse initial designs, an extension to all differentiable Kiefer's optimality criteria, and an efficient method for performing optimal exchanges of weights. For the most commonly used D-optimality criterion, we propose a technique for optimal weight exchanges based on the characteristic matrix polynomial. The mREX algorithm is applicable to linear, nonlinear, and generalized linear models, and scales well to large problems. It typically converges to optimal designs faster than available alternative methods, although it does not require advanced mathematical programming solvers. We demonstrate the application of mREX to bivariate dose-response Emax models for clinical trials, both without and with the inclusion of covariates."}, "https://arxiv.org/abs/2407.16376": {"title": "Bayesian Autoregressive Online Change-Point Detection with Time-Varying Parameters", "link": "https://arxiv.org/abs/2407.16376", "description": "arXiv:2407.16376v1 Announce Type: cross \nAbstract: Change points in real-world systems mark significant regime shifts in system dynamics, possibly triggered by exogenous or endogenous factors. These points define regimes for the time evolution of the system and are crucial for understanding transitions in financial, economic, social, environmental, and technological contexts. Building upon the Bayesian approach introduced in \\cite{c:07}, we devise a new method for online change point detection in the mean of a univariate time series, which is well suited for real-time applications and is able to handle the general temporal patterns displayed by data in many empirical contexts. We first describe time series as an autoregressive process of an arbitrary order. Second, the variance and correlation of the data are allowed to vary within each regime driven by a scoring rule that updates the value of the parameters for a better fit of the observations. Finally, a change point is detected in a probabilistic framework via the posterior distribution of the current regime length. By modeling temporal dependencies and time-varying parameters, the proposed approach enhances both the estimate accuracy and the forecasting power. Empirical validations using various datasets demonstrate the method's effectiveness in capturing memory and dynamic patterns, offering deeper insights into the non-stationary dynamics of real-world systems."}, "https://arxiv.org/abs/2407.16567": {"title": "CASTRO -- Efficient constrained sampling method for material and chemical experimental design", "link": "https://arxiv.org/abs/2407.16567", "description": "arXiv:2407.16567v1 Announce Type: cross \nAbstract: The exploration of multicomponent material composition space requires significant time and financial investments, necessitating efficient use of resources for statistically relevant compositions. This article introduces a novel methodology, implemented in the open-source CASTRO (ConstrAined Sequential laTin hypeRcube sampling methOd) software package, to overcome equality-mixture constraints and ensure comprehensive design space coverage. 
Our approach leverages Latin hypercube sampling (LHS) and LHS with multidimensional uniformity (LHSMDU) using a divide-and-conquer strategy to manage high-dimensional problems effectively. By incorporating previous experimental knowledge within a limited budget, our method strategically recommends a feasible number of experiments to explore the design space. We validate our methodology with two examples: a four-dimensional problem with near-uniform distributions and a nine-dimensional problem with additional mixture constraints, yielding specific component distributions. Our constrained sequential LHS or LHSMDU approach enables thorough design space exploration, proving robustness for experimental design. This research not only advances material science but also offers promising solutions for efficiency challenges in pharmaceuticals and chemicals. CASTRO and the case studies are available for free download on GitHub."}, "https://arxiv.org/abs/1910.12545": {"title": "Testing Forecast Rationality for Measures of Central Tendency", "link": "https://arxiv.org/abs/1910.12545", "description": "arXiv:1910.12545v5 Announce Type: replace \nAbstract: Rational respondents to economic surveys may report as a point forecast any measure of the central tendency of their (possibly latent) predictive distribution, for example the mean, median, mode, or any convex combination thereof. We propose tests of forecast rationality when the measure of central tendency used by the respondent is unknown. We overcome an identification problem that arises when the measures of central tendency are equal or in a local neighborhood of each other, as is the case for (exactly or nearly) symmetric distributions. As a building block, we also present novel tests for the rationality of mode forecasts. We apply our tests to income forecasts from the Federal Reserve Bank of New York's Survey of Consumer Expectations. We find these forecasts are rationalizable as mode forecasts, but not as mean or median forecasts. We also find heterogeneity in the measure of centrality used by respondents when stratifying the sample by past income, age, job stability, and survey experience."}, "https://arxiv.org/abs/2110.06450": {"title": "Online network change point detection with missing values and temporal dependence", "link": "https://arxiv.org/abs/2110.06450", "description": "arXiv:2110.06450v3 Announce Type: replace \nAbstract: In this paper we study online change point detection in dynamic networks with time heterogeneous missing pattern within networks and dependence across the time course. The missingness probabilities, the entrywise sparsity of networks, the rank of networks and the jump size in terms of the Frobenius norm, are all allowed to vary as functions of the pre-change sample size. On top of a thorough handling of all the model parameters, we notably allow the edges and missingness to be dependent. To the best of our knowledge, such general framework has not been rigorously nor systematically studied before in the literature. We propose a polynomial time change point detection algorithm, with a version of soft-impute algorithm (e.g. Mazumder et al., 2010; Klopp, 2015) as the imputation sub-routine. Piecing up these standard sub-routines algorithms, we are able to solve a brand new problem with sharp detection delay subject to an overall Type-I error control. 
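Editorial note: to make the equality-mixture constraint targeted by CASTRO (arXiv:2407.16567, above) concrete, the sketch below draws a Latin hypercube sample with scipy and renormalizes each row onto the simplex so that the components sum to one. This shows only the basic ingredient, not CASTRO's sequential, divide-and-conquer procedure, and the renormalization distorts the Latin hypercube uniformity.

```python
import numpy as np
from scipy.stats import qmc

def mixture_lhs(n_samples, n_components, seed=0):
    """Latin hypercube points renormalized so each row sums to 1
    (a crude way to respect an equality-mixture constraint)."""
    sampler = qmc.LatinHypercube(d=n_components, seed=seed)
    raw = sampler.random(n=n_samples)
    # note: simple renormalization is shown only for illustration
    return raw / raw.sum(axis=1, keepdims=True)

designs = mixture_lhs(n_samples=20, n_components=4)
print(designs.sum(axis=1))  # each row sums to 1
```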
Extensive numerical experiments are conducted demonstrating the outstanding performance of our proposed method in practice."}, "https://arxiv.org/abs/2111.13774": {"title": "Robust Permutation Tests in Linear Instrumental Variables Regression", "link": "https://arxiv.org/abs/2111.13774", "description": "arXiv:2111.13774v4 Announce Type: replace \nAbstract: This paper develops permutation versions of identification-robust tests in linear instrumental variables (IV) regression. Unlike the existing randomization and rank-based tests in which independence between the instruments and the error terms is assumed, the permutation Anderson-Rubin (AR), Lagrange Multiplier (LM) and Conditional Likelihood Ratio (CLR) tests are asymptotically similar and robust to conditional heteroskedasticity under the standard exclusion restriction, i.e. the orthogonality between the instruments and the error terms. Moreover, when the instruments are independent of the structural error term, the permutation AR tests are exact, hence robust to heavy tails. As such, these tests share the strengths of the rank-based tests and the wild bootstrap AR tests. Numerical illustrations corroborate the theoretical results."}, "https://arxiv.org/abs/2312.07775": {"title": "On the construction of stationary processes and random fields", "link": "https://arxiv.org/abs/2312.07775", "description": "arXiv:2312.07775v2 Announce Type: replace \nAbstract: We propose a new method to construct a stationary process and random field with a given decreasing covariance function and any one-dimensional marginal distribution. The result is a new class of stationary processes and random fields. The construction method utilizes a correlated binary sequence, and it allows a simple and practical way to model dependence structures in a stationary process and random field as its dependence structure is induced by the correlation structure of a few disjoint sets in the support set of the marginal distribution. Simulation results of the proposed models are provided, which show the empirical behavior of a sample path."}, "https://arxiv.org/abs/2307.02781": {"title": "Dynamic Factor Analysis with Dependent Gaussian Processes for High-Dimensional Gene Expression Trajectories", "link": "https://arxiv.org/abs/2307.02781", "description": "arXiv:2307.02781v2 Announce Type: replace-cross \nAbstract: The increasing availability of high-dimensional, longitudinal measures of gene expression can facilitate understanding of biological mechanisms, as required for precision medicine. Biological knowledge suggests that it may be best to describe complex diseases at the level of underlying pathways, which may interact with one another. We propose a Bayesian approach that allows for characterising such correlation among different pathways through Dependent Gaussian Processes (DGP) and mapping the observed high-dimensional gene expression trajectories into unobserved low-dimensional pathway expression trajectories via Bayesian Sparse Factor Analysis. Our proposal is the first attempt to relax the classical assumption of independent factors for longitudinal data and demonstrates superior performance in recovering the shape of pathway expression trajectories, revealing the relationships between genes and pathways, and predicting gene expressions (closer point estimates and narrower predictive intervals), as shown through simulations and real data analysis. 
To fit the model, we propose a Monte Carlo Expectation Maximization (MCEM) scheme that can be implemented conveniently by combining a standard Markov Chain Monte Carlo sampler and an R package GPFDA (Konzen and others, 2021), which returns the maximum likelihood estimates of DGP hyperparameters. The modular structure of MCEM makes it generalizable to other complex models involving the DGP model component. Our R package DGP4LCF that implements the proposed approach is available on CRAN."}, "https://arxiv.org/abs/2407.16786": {"title": "Generalised Causal Dantzig", "link": "https://arxiv.org/abs/2407.16786", "description": "arXiv:2407.16786v1 Announce Type: new \nAbstract: Prediction invariance of causal models under heterogeneous settings has been exploited by a number of recent methods for causal discovery, typically focussing on recovering the causal parents of a target variable of interest. When instrumental variables are not available, the causal Dantzig estimator exploits invariance under the more restrictive case of shift interventions. However, also in this case, one requires observational data from a number of sufficiently different environments, which is rarely available. In this paper, we consider a structural equation model where the target variable is described by a generalised additive model conditional on its parents. Besides having finite moments, no modelling assumptions are made on the conditional distributions of the other variables in the system. Under this setting, we characterise the causal model uniquely by means of two key properties: the Pearson residuals are invariant under the causal model and conditional on the causal parents the causal parameters maximise the population likelihood. These two properties form the basis of a computational strategy for searching the causal model among all possible models. Crucially, for generalised linear models with a known dispersion parameter, such as Poisson and logistic regression, the causal model can be identified from a single data environment."}, "https://arxiv.org/abs/2407.16870": {"title": "CoCA: Cooperative Component Analysis", "link": "https://arxiv.org/abs/2407.16870", "description": "arXiv:2407.16870v1 Announce Type: new \nAbstract: We propose Cooperative Component Analysis (CoCA), a new method for unsupervised multi-view analysis: it identifies the component that simultaneously captures significant within-view variance and exhibits strong cross-view correlation. The challenge of integrating multi-view data is particularly important in biology and medicine, where various types of \"-omic\" data, ranging from genomics to proteomics, are measured on the same set of samples. The goal is to uncover important, shared signals that represent underlying biological mechanisms. CoCA combines an approximation error loss to preserve information within data views and an \"agreement penalty\" to encourage alignment across data views. By balancing the trade-off between these two key components in the objective, CoCA has the property of interpolating between the commonly-used principal component analysis (PCA) and canonical correlation analysis (CCA) as special cases at the two ends of the solution path. CoCA chooses the degree of agreement in a data-adaptive manner, using a validation set or cross-validation to estimate test error. Furthermore, we propose a sparse variant of CoCA that incorporates the Lasso penalty to yield feature sparsity, facilitating the identification of key features driving the observed patterns. 
We demonstrate the effectiveness of CoCA on simulated data and two real multiomics studies of COVID-19 and ductal carcinoma in situ of breast. In both real data applications, CoCA successfully integrates multiomics data, extracting components that are not only consistently present across different data views but also more informative and predictive of disease progression. CoCA offers a powerful framework for discovering important shared signals in multi-view data, with the potential to uncover novel insights in an increasingly multi-view data world."}, "https://arxiv.org/abs/2407.16948": {"title": "Relative local dependence of bivariate copulas", "link": "https://arxiv.org/abs/2407.16948", "description": "arXiv:2407.16948v1 Announce Type: new \nAbstract: For a bivariate probability distribution, local dependence around a single point on the support is often formulated as the second derivative of the logarithm of the probability density function. However, this definition lacks the invariance under marginal distribution transformations, which is often required as a criterion for dependence measures. In this study, we examine the \\textit{relative local dependence}, which we define as the ratio of the local dependence to the probability density function, for copulas. By using this notion, we point out that typical copulas can be characterised as the solutions to the corresponding partial differential equations, particularly highlighting that the relative local dependence of the Frank copula remains constant. The estimation and visualization of the relative local dependence are demonstrated using simulation data. Furthermore, we propose a class of copulas where local dependence is proportional to the $k$-th power of the probability density function, and as an example, we demonstrate a newly discovered relationship derived from the density functions of two representative copulas, the Frank copula and the Farlie-Gumbel-Morgenstern (FGM) copula."}, "https://arxiv.org/abs/2407.16950": {"title": "Identification and inference of outcome conditioned partial effects of general interventions", "link": "https://arxiv.org/abs/2407.16950", "description": "arXiv:2407.16950v1 Announce Type: new \nAbstract: This paper proposes a new class of distributional causal quantities, referred to as the \\textit{outcome conditioned partial policy effects} (OCPPEs), to measure the \\textit{average} effect of a general counterfactual intervention of a target covariate on the individuals in different quantile ranges of the outcome distribution.\n The OCPPE approach is valuable in several aspects: (i) Unlike the unconditional quantile partial effect (UQPE) that is not $\\sqrt{n}$-estimable, an OCPPE is $\\sqrt{n}$-estimable. Analysts can use it to capture heterogeneity across the unconditional distribution of $Y$ as well as obtain accurate estimation of the aggregated effect at the upper and lower tails of $Y$. (ii) The semiparametric efficiency bound for an OCPPE is explicitly derived. (iii) We propose an efficient debiased estimator for OCPPE, and provide feasible uniform inference procedures for the OCPPE process. (iv) The efficient doubly robust score for an OCPPE can be used to optimize infinitesimal nudges to a continuous treatment by maximizing a quantile specific Empirical Welfare function. 
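Editorial note: the definitions used in the relative local dependence abstract above (arXiv:2407.16948) can be written out explicitly; the symbol $\gamma$ below is our own notation, not necessarily the paper's.

```latex
% local dependence of a bivariate density f(x, y)
\gamma(x, y) = \frac{\partial^{2}}{\partial x\, \partial y} \log f(x, y),
\qquad
% relative local dependence: ratio of local dependence to the density
\tilde{\gamma}(x, y) = \frac{\gamma(x, y)}{f(x, y)}.
```

The abstract's remark that the relative local dependence of the Frank copula is constant means that $\tilde{\gamma}$ does not vary over the unit square for that family.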
We illustrate the method by analyzing how anti-smoking policies impact low percentiles of live infants' birthweights."}, "https://arxiv.org/abs/2407.17036": {"title": "A Bayesian modelling framework for health care resource use and costs in trial-based economic evaluations", "link": "https://arxiv.org/abs/2407.17036", "description": "arXiv:2407.17036v1 Announce Type: new \nAbstract: Individual-level effectiveness and healthcare resource use (HRU) data are routinely collected in trial-based economic evaluations. While effectiveness is often expressed in terms of utility scores derived from some health-related quality of life instruments (e.g.~EQ-5D questionnaires), different types of HRU may be included. Costs are usually generated by applying unit prices to HRU data and statistical methods have been traditionally implemented to analyse costs and utilities or after combining them into aggregated variables (e.g. Quality-Adjusted Life Years). When outcome data are not fully observed, e.g. some patients drop out or only provided partial information, the validity of the results may be hindered both in terms of efficiency and bias. Often, partially-complete HRU data are handled using \"ad-hoc\" methods, implicitly relying on some assumptions (e.g. fill-in a zero) which are hard to justify beside the practical convenience of increasing the completion rate. We present a general Bayesian framework for the modelling of partially-observed HRUs which allows a flexible model specification to accommodate the typical complexities of the data and to quantify the impact of different types of uncertainty on the results. We show the benefits of using our approach using a motivating example and compare the results to those from traditional analyses focussed on the modelling of cost variables after adopting some ad-hoc imputation strategy for HRU data."}, "https://arxiv.org/abs/2407.17113": {"title": "Bayesian non-linear subspace shrinkage using horseshoe priors", "link": "https://arxiv.org/abs/2407.17113", "description": "arXiv:2407.17113v1 Announce Type: new \nAbstract: When modeling biological responses using Bayesian non-parametric regression, prior information may be available on the shape of the response in the form of non-linear function spaces that define the general shape of the response. To incorporate such information into the analysis, we develop a non-linear functional shrinkage (NLFS) approach that uniformly shrinks the non-parametric fitted function into a non-linear function space while allowing for fits outside of this space when the data suggest alternative shapes. This approach extends existing functional shrinkage approaches into linear subspaces to shrinkage into non-linear function spaces using a Taylor series expansion and corresponding updating of non-linear parameters. We demonstrate this general approach on the Hill model, a popular, biologically motivated model, and show that shrinkage into combined function spaces, i.e., where one has two or more non-linear functions a priori, is straightforward. We demonstrate this approach through synthetic and real data. Computational details on the underlying MCMC sampling are provided with data and analysis available in an online supplement."}, "https://arxiv.org/abs/2407.17225": {"title": "Asymmetry Analysis of Bilateral Shapes", "link": "https://arxiv.org/abs/2407.17225", "description": "arXiv:2407.17225v1 Announce Type: new \nAbstract: Many biological objects possess bilateral symmetry about a midline or midplane, up to a ``noise'' term. 
This paper uses landmark-based methods to measure departures from bilateral symmetry, especially for the two-group problem where one group is more asymmetric than the other. In this paper, we formulate our work in the framework of size-and-shape analysis including registration via rigid body motion. Our starting point is a vector of elementary asymmetry features defined at the individual landmark coordinates for each object. We introduce two approaches for testing. In the first, the elementary features are combined into a scalar composite asymmetry measure for each object. Then standard univariate tests can be used to compare the two groups. In the second approach, a univariate test statistic is constructed for each elementary feature. The maximum of these statistics lead to an overall test statistic to compare the two groups and we then provide a technique to extract the important features from the landmark data. Our methodology is illustrated on a pre-registered smile dataset collected to assess the success of cleft lip surgery on human subjects. The asymmetry in a group of cleft lip subjects is compared to a group of normal subjects, and statistically significant differences have been found by univariate tests in the first approach. Further, our feature extraction method leads to an anatomically plausible set of landmarks for medical applications."}, "https://arxiv.org/abs/2407.17385": {"title": "Causal modelling without counterfactuals and individualised effects", "link": "https://arxiv.org/abs/2407.17385", "description": "arXiv:2407.17385v1 Announce Type: new \nAbstract: The most common approach to causal modelling is the potential outcomes framework due to Neyman and Rubin. In this framework, outcomes of counterfactual treatments are assumed to be well-defined. This metaphysical assumption is often thought to be problematic yet indispensable. The conventional approach relies not only on counterfactuals, but also on abstract notions of distributions and assumptions of independence that are not directly testable. In this paper, we construe causal inference as treatment-wise predictions for finite populations where all assumptions are testable; this means that one can not only test predictions themselves (without any fundamental problem), but also investigate sources of error when they fail. The new framework highlights the model-dependence of causal claims as well as the difference between statistical and scientific inference."}, "https://arxiv.org/abs/2407.16797": {"title": "Estimating the hyperuniformity exponent of point processes", "link": "https://arxiv.org/abs/2407.16797", "description": "arXiv:2407.16797v1 Announce Type: cross \nAbstract: We address the challenge of estimating the hyperuniformity exponent $\\alpha$ of a spatial point process, given only one realization of it. Assuming that the structure factor $S$ of the point process follows a vanishing power law at the origin (the typical case of a hyperuniform point process), this exponent is defined as the slope near the origin of $\\log S$. Our estimator is built upon the (expanding window) asymptotic variance of some wavelet transforms of the point process. By combining several scales and several wavelets, we develop a multi-scale, multi-taper estimator $\\widehat{\\alpha}$. We analyze its asymptotic behavior, proving its consistency under various settings, and enabling the construction of asymptotic confidence intervals for $\\alpha$ when $\\alpha < d$ and under Brillinger mixing. 
This construction is derived from a multivariate central limit theorem where the normalisations are non-standard and vary among the components. We also present a non-asymptotic deviation inequality providing insights into the influence of tapers on the bias-variance trade-off of $\\widehat{\\alpha}$. Finally, we investigate the performance of $\\widehat{\\alpha}$ through simulations, and we apply our method to the analysis of hyperuniformity in a real dataset of marine algae."}, "https://arxiv.org/abs/2407.16975": {"title": "On the Parameter Identifiability of Partially Observed Linear Causal Models", "link": "https://arxiv.org/abs/2407.16975", "description": "arXiv:2407.16975v1 Announce Type: cross \nAbstract: Linear causal models are important tools for modeling causal dependencies and yet in practice, only a subset of the variables can be observed. In this paper, we examine the parameter identifiability of these models by investigating whether the edge coefficients can be recovered given the causal structure and partially observed data. Our setting is more general than that of prior research - we allow all variables, including both observed and latent ones, to be flexibly related, and we consider the coefficients of all edges, whereas most existing works focus only on the edges between observed variables. Theoretically, we identify three types of indeterminacy for the parameters in partially observed linear causal models. We then provide graphical conditions that are sufficient for all parameters to be identifiable and show that some of them are provably necessary. Methodologically, we propose a novel likelihood-based parameter estimation method that addresses the variance indeterminacy of latent variables in a specific way and can asymptotically recover the underlying parameters up to trivial indeterminacy. Empirical studies on both synthetic and real-world datasets validate our identifiability theory and the effectiveness of the proposed method in the finite-sample regime."}, "https://arxiv.org/abs/2407.17132": {"title": "Exploring Covid-19 Spatiotemporal Dynamics: Non-Euclidean Spatially Aware Functional Registration", "link": "https://arxiv.org/abs/2407.17132", "description": "arXiv:2407.17132v1 Announce Type: cross \nAbstract: When it came to Covid-19, timing was everything. This paper considers the spatiotemporal dynamics of the Covid-19 pandemic via a developed methodology of non-Euclidean spatially aware functional registration. In particular, the daily SARS-CoV-2 incidence in each of 380 local authorities in the UK from March to June 2020 is analysed to understand the phase variation of the waves when considered as curves. This is achieved by adapting a traditional registration method (that of local variation analysis) to account for the clear spatial dependencies in the data. This adapted methodology is shown via simulation studies to perform substantially better for the estimation of the registration functions than the non-spatial alternative. Moreover, it is found that the driving time between locations represents the spatial dependency in the Covid-19 data better than geographical distance. However, since driving time is non-Euclidean, the traditional spatial frameworks break down; to solve this, a methodology inspired by multidimensional scaling is developed to approximate the driving times by a Euclidean distance which enables the established theory to be applied. 
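Editorial note: the hyperuniformity abstract above (arXiv:2407.16797) defines $\alpha$ as the slope of $\log S$ near the origin. A naive sketch of that definition, assuming the structure factor $S$ has already been evaluated on a grid of small wavenumbers, is a least-squares fit on the log-log scale; this is not the paper's wavelet, multi-taper estimator.

```python
import numpy as np

def naive_hyperuniformity_exponent(k, S, k_max=0.2):
    """Estimate alpha as the log-log slope of the structure factor S(k)
    over small wavenumbers k < k_max (definition-based, not the wavelet estimator)."""
    mask = (k > 0) & (k < k_max) & (S > 0)
    slope, _intercept = np.polyfit(np.log(k[mask]), np.log(S[mask]), deg=1)
    return slope

# Synthetic example: S(k) ~ k^alpha with alpha = 1 near the origin, plus noise.
rng = np.random.default_rng(1)
k = np.linspace(1e-3, 0.5, 200)
S = k**1.0 * np.exp(rng.normal(scale=0.05, size=k.size))
print(naive_hyperuniformity_exponent(k, S))
```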
Finally, the resulting estimates of the registration/warping processes are analysed by taking functionals to understand the qualitatively observable earliness/lateness and sharpness/flatness of the Covid-19 waves quantitatively."}, "https://arxiv.org/abs/2407.17200": {"title": "Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems", "link": "https://arxiv.org/abs/2407.17200", "description": "arXiv:2407.17200v1 Announce Type: cross \nAbstract: A recent stream of structured learning approaches has improved the practical state of the art for a range of combinatorial optimization problems with complex objectives encountered in operations research. Such approaches train policies that chain a statistical model with a surrogate combinatorial optimization oracle to map any instance of the problem to a feasible solution. The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately. However learning such policies by risk minimization is challenging because the empirical risk is piecewise constant in the parameters, and few theoretical guarantees have been provided so far. In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and improves generalization. Our main contribution is a generalization bound that controls the perturbation bias, the statistical learning error, and the optimization error. Our analysis relies on the introduction of a uniform weak property, which captures and quantifies the interplay of the statistical model and the surrogate combinatorial optimization oracle. This property holds under mild assumptions on the statistical model, the surrogate optimization, and the instance data distribution. We illustrate the result on a range of applications such as stochastic vehicle scheduling. In particular, such policies are relevant for contextual stochastic optimization and our results cover this case."}, "https://arxiv.org/abs/2407.17329": {"title": "Low dimensional representation of multi-patient flow cytometry datasets using optimal transport for minimal residual disease detection in leukemia", "link": "https://arxiv.org/abs/2407.17329", "description": "arXiv:2407.17329v1 Announce Type: cross \nAbstract: Representing and quantifying Minimal Residual Disease (MRD) in Acute Myeloid Leukemia (AML), a type of cancer that affects the blood and bone marrow, is essential in the prognosis and follow-up of AML patients. As traditional cytological analysis cannot detect leukemia cells below 5\\%, the analysis of flow cytometry dataset is expected to provide more reliable results. In this paper, we explore statistical learning methods based on optimal transport (OT) to achieve a relevant low-dimensional representation of multi-patient flow cytometry measurements (FCM) datasets considered as high-dimensional probability distributions. Using the framework of OT, we justify the use of the K-means algorithm for dimensionality reduction of multiple large-scale point clouds through mean measure quantization by merging all the data into a single point cloud. After this quantization step, the visualization of the intra and inter-patients FCM variability is carried out by embedding low-dimensional quantized probability measures into a linear space using either Wasserstein Principal Component Analysis (PCA) through linearized OT or log-ratio PCA of compositional data. 
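Editorial note: to make the quantization-then-embedding pipeline from the flow cytometry abstract above (arXiv:2407.17329) concrete, here is a simplified sketch: pool all patients' point clouds, quantize with K-means, represent each patient by the proportions of its points assigned to each centroid, and project those proportion vectors with ordinary PCA. The paper instead uses Wasserstein PCA or log-ratio PCA of compositional data; plain PCA below is a stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def embed_patients(clouds, n_centroids=50, n_components=2, seed=0):
    """clouds: list of (n_i x d) arrays, one point cloud per patient."""
    pooled = np.vstack(clouds)
    km = KMeans(n_clusters=n_centroids, random_state=seed, n_init=10).fit(pooled)
    # each patient -> histogram of centroid assignments (a quantized measure)
    weights = np.stack([
        np.bincount(km.predict(c), minlength=n_centroids) / len(c) for c in clouds
    ])
    return PCA(n_components=n_components).fit_transform(weights)

rng = np.random.default_rng(0)
clouds = [rng.normal(loc=i * 0.1, size=(500, 5)) for i in range(12)]
print(embed_patients(clouds).shape)  # (12, 2)
```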
Using a publicly available FCM dataset and a FCM dataset from Bordeaux University Hospital, we demonstrate the benefits of our approach over the popular kernel mean embedding technique for statistical learning from multiple high-dimensional probability distributions. We also highlight the usefulness of our methodology for low-dimensional projection and clustering patient measurements according to their level of MRD in AML from FCM. In particular, our OT-based approach allows a relevant and informative two-dimensional representation of the results of the FlowSom algorithm, a state-of-the-art method for the detection of MRD in AML using multi-patient FCM."}, "https://arxiv.org/abs/2407.17401": {"title": "Estimation of bid-ask spreads in the presence of serial dependence", "link": "https://arxiv.org/abs/2407.17401", "description": "arXiv:2407.17401v1 Announce Type: cross \nAbstract: Starting from a basic model in which the dynamic of the transaction prices is a geometric Brownian motion disrupted by a microstructure white noise, corresponding to the random alternation of bids and asks, we propose moment-based estimators along with their statistical properties. We then make the model more realistic by considering serial dependence: we assume a geometric fractional Brownian motion for the price, then an Ornstein-Uhlenbeck process for the microstructure noise. In these two cases of serial dependence, we propose again consistent and asymptotically normal estimators. All our estimators are compared on simulated data with existing approaches, such as Roll, Corwin-Schultz, Abdi-Ranaldo, or Ardia-Guidotti-Kroencke estimators."}, "https://arxiv.org/abs/1902.09615": {"title": "Binscatter Regressions", "link": "https://arxiv.org/abs/1902.09615", "description": "arXiv:1902.09615v5 Announce Type: replace \nAbstract: We introduce the package Binsreg, which implements the binscatter methods developed by Cattaneo, Crump, Farrell, and Feng (2024b,a). The package includes seven commands: binsreg, binslogit, binsprobit, binsqreg, binstest, binspwc, and binsregselect. The first four commands implement binscatter plotting, point estimation, and uncertainty quantification (confidence intervals and confidence bands) for least squares linear binscatter regression (binsreg) and for nonlinear binscatter regression (binslogit for Logit regression, binsprobit for Probit regression, and binsqreg for quantile regression). The next two commands focus on pointwise and uniform inference: binstest implements hypothesis testing procedures for parametric specifications and for nonparametric shape restrictions of the unknown regression function, while binspwc implements multi-group pairwise statistical comparisons. Finally, the command binsregselect implements data-driven number of bins selectors. The commands offer binned scatter plots, and allow for covariate adjustment, weighting, clustering, and multi-sample analysis, which is useful when studying treatment effect heterogeneity in randomized and observational studies, among many other features."}, "https://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "https://arxiv.org/abs/2202.08419", "description": "arXiv:2202.08419v4 Announce Type: replace \nAbstract: In this paper, we develop a novel high-dimensional time-varying coefficient estimation method, based on high-dimensional Ito diffusion processes. 
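Editorial note: the bid-ask spread abstract above (arXiv:2407.17401) compares its moment-based estimators with classical ones such as Roll's; for orientation, a minimal numpy sketch of the classic Roll (1984) estimator, which infers the effective spread from the negative first-order autocovariance of price changes, is given below. The serial-dependence-aware estimators proposed in the paper are not reproduced here.

```python
import numpy as np

def roll_spread(prices):
    """Classic Roll estimator: spread = 2 * sqrt(-cov(dp_t, dp_{t-1})),
    returned as NaN when the autocovariance is non-negative."""
    dp = np.diff(np.asarray(prices, dtype=float))
    autocov = np.cov(dp[1:], dp[:-1])[0, 1]
    return 2.0 * np.sqrt(-autocov) if autocov < 0 else np.nan

# Simulated mid-price plus bid/ask bounce with half-spread 0.05.
rng = np.random.default_rng(2)
mid = np.cumsum(rng.normal(scale=0.01, size=5000)) + 100
trades = mid + 0.05 * rng.choice([-1, 1], size=mid.size)
print(roll_spread(trades))  # close to 2 * 0.05 = 0.10
```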
To account for high-dimensional time-varying coefficients, we first estimate local (or instantaneous) coefficients using a time-localized Dantzig selection scheme under a sparsity condition, which results in biased local coefficient estimators due to the regularization. To handle the bias, we propose a debiasing scheme, which provides well-performing unbiased local coefficient estimators. With the unbiased local coefficient estimators, we estimate the integrated coefficient, and to further account for the sparsity of the coefficient process, we apply thresholding schemes. We call this Thresholding dEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED estimator. In the empirical analysis, we apply the TED procedure to analyzing high-dimensional factor models using high-frequency data."}, "https://arxiv.org/abs/2211.08858": {"title": "Unbalanced Kantorovich-Rubinstein distance, plan, and barycenter on finite spaces: A statistical perspective", "link": "https://arxiv.org/abs/2211.08858", "description": "arXiv:2211.08858v2 Announce Type: replace \nAbstract: We analyze statistical properties of plug-in estimators for unbalanced optimal transport quantities between finitely supported measures in different prototypical sampling models. Specifically, our main results provide non-asymptotic bounds on the expected error of empirical Kantorovich-Rubinstein (KR) distance, plans, and barycenters for mass penalty parameter $C>0$. The impact of the mass penalty parameter $C$ is studied in detail. Based on this analysis, we mathematically justify randomized computational schemes for KR quantities which can be used for fast approximate computations in combination with any exact solver. Using synthetic and real datasets, we empirically analyze the behavior of the expected errors in simulation studies and illustrate the validity of our theoretical bounds."}, "https://arxiv.org/abs/2008.07361": {"title": "Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?", "link": "https://arxiv.org/abs/2008.07361", "description": "arXiv:2008.07361v2 Announce Type: replace-cross \nAbstract: Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements.\n Materials and Methods: We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value.\n Results: The adequate sample size achieves a median reduction of the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The median reduction of the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively.\n Discussion: Based on our results a conservative, yet significant, reduction in sample size and model complexity can be estimated for future prediction work. 
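Editorial note: the "adequate sample size" rule described in the abstract above (arXiv:2008.07361) is simple to state in code: given a learning curve of performance values, take the smallest sample size whose performance is within a threshold of the maximum. The sketch below assumes a larger-is-better metric (e.g., AUC) and that the learning curve has already been computed; the numbers are illustrative, not from the study.

```python
import numpy as np

def adequate_sample_size(sample_sizes, performances, threshold=0.01):
    """Smallest sample size whose performance >= max(performance) - threshold."""
    sample_sizes = np.asarray(sample_sizes)
    performances = np.asarray(performances)
    ok = performances >= performances.max() - threshold
    return int(sample_sizes[ok].min())

sizes = [1_000, 5_000, 10_000, 50_000, 100_000]
aucs = [0.70, 0.76, 0.78, 0.795, 0.80]
print(adequate_sample_size(sizes, aucs, threshold=0.01))  # 50000
```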
However, if a researcher is willing to generate a learning curve, a much larger reduction in model complexity may be possible, as suggested by the large outcome-dependent variability.\n Conclusion: Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with substantially reduced model complexity."}, "https://arxiv.org/abs/2306.00833": {"title": "When Does Bottom-up Beat Top-down in Hierarchical Community Detection?", "link": "https://arxiv.org/abs/2306.00833", "description": "arXiv:2306.00833v2 Announce Type: replace-cross \nAbstract: Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive ($\textit{top-down}$) algorithms recursively partition the nodes into two communities, until a stopping rule indicates that no further split is needed. In contrast, agglomerative ($\textit{bottom-up}$) algorithms first identify the smallest community structure and then repeatedly merge the communities using a $\textit{linkage}$ method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive compared to those existing for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis."}, "https://arxiv.org/abs/2407.17534": {"title": "Extension of W-method and A-learner for multiple binary outcomes", "link": "https://arxiv.org/abs/2407.17534", "description": "arXiv:2407.17534v1 Announce Type: new \nAbstract: In this study, we compared two groups, in which subjects were assigned to either the treatment or the control group. In such trials, if the efficacy of the treatment cannot be demonstrated in a population that meets the eligibility criteria, identifying the subgroups for which the treatment is effective is desirable. Such subgroups can be identified by estimating heterogeneous treatment effects (HTE). In recent years, methods for estimating HTE have increasingly relied on complex models. Although these models improve the estimation accuracy, they often sacrifice interpretability. Despite significant advancements in the methods for continuous or univariate binary outcomes, methods for multiple binary outcomes are less prevalent, and existing interpretable methods, such as the W-method and A-learner, while capable of estimating HTE for a single binary outcome, still fail to capture the correlation structure when applied to multiple binary outcomes. We thus propose two methods for estimating HTE for multiple binary outcomes: one based on the W-method and the other based on the A-learner. 
We also demonstrate that the conventional A-learner introduces bias in the estimation of the treatment effect. The proposed method employs a framework based on reduced-rank regression to capture the correlation structure among multiple binary outcomes. We correct for the bias inherent in the A-learner estimates and investigate the impact of this bias through numerical simulations. Finally, we demonstrate the effectiveness of the proposed method using a real data application."}, "https://arxiv.org/abs/2407.17592": {"title": "Robust Maximum $L_q$-Likelihood Covariance Estimation for Replicated Spatial Data", "link": "https://arxiv.org/abs/2407.17592", "description": "arXiv:2407.17592v1 Announce Type: new \nAbstract: Parameter estimation with the maximum $L_q$-likelihood estimator (ML$q$E) is an alternative to the maximum likelihood estimator (MLE) that considers the $q$-th power of the likelihood values for some $q<1$. In this method, extreme values are down-weighted because of their lower likelihood values, which yields robust estimates. In this work, we study the properties of the ML$q$E for spatial data with replicates. We investigate the asymptotic properties of the ML$q$E for Gaussian random fields with a Mat\\'ern covariance function, and carry out simulation studies to investigate the numerical performance of the ML$q$E. We show that it can provide more robust and stable estimation results when some of the replicates in the spatial data contain outliers. In addition, we develop a mechanism to find the optimal choice of the hyper-parameter $q$ for the ML$q$E. The robustness of our approach is further verified on a United States precipitation dataset. Compared with other robust methods for spatial data, our proposal is more intuitive and easier to understand, yet it performs well when dealing with datasets containing outliers."}, "https://arxiv.org/abs/2407.17666": {"title": "Causal estimands and identification of time-varying effects in non-stationary time series from N-of-1 mobile device data", "link": "https://arxiv.org/abs/2407.17666", "description": "arXiv:2407.17666v1 Announce Type: new \nAbstract: Mobile technology (mobile phones and wearable devices) generates continuous data streams encompassing outcomes, exposures and covariates, presented as intensive longitudinal or multivariate time series data. The high frequency of measurements enables granular and dynamic evaluation of treatment effect, revealing their persistence and accumulation over time. Existing methods predominantly focus on the contemporaneous effect, temporal-average, or population-average effects, assuming stationarity or invariance of treatment effects over time, which are inadequate both conceptually and statistically to capture dynamic treatment effects in personalized mobile health data. We here propose new causal estimands for multivariate time series in N-of-1 studies. These estimands summarize how time-varying exposures impact outcomes in both short- and long-term. We propose identifiability assumptions and a g-formula estimator that accounts for exposure-outcome and outcome-covariate feedback. The g-formula employs a state space model framework innovatively to accommodate time-varying behavior of treatment effects in non-stationary time series. We apply the proposed method to a multi-year smartphone observational study of bipolar patients and estimate the dynamic effect of phone-based communication on mood of patients with bipolar disorder in an N-of-1 setting. 
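Editorial note: the ML$q$E described above (arXiv:2407.17592) replaces the logarithm in the log-likelihood with a $q$-deformed logarithm. Assuming the standard formulation of Ferrari and Yang (2010) matches the paper's, the estimator solves

```latex
\hat{\theta}_q = \arg\max_{\theta} \sum_{i=1}^{n} L_q\bigl(f(x_i;\theta)\bigr),
\qquad
L_q(u) =
\begin{cases}
  \dfrac{u^{1-q}-1}{1-q}, & q \neq 1,\\[1ex]
  \log u, & q = 1,
\end{cases}
```

so that for $q<1$ observations with small likelihood values are down-weighted in the estimating equations, which is the source of the robustness to outlying replicates described in the abstract.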
Our approach reveals substantial heterogeneity in treatment effects over time and across individuals. A simulation-based strategy is also proposed for the development of a short-term, dynamic, and personalized treatment recommendation based on patient's past information, in combination with a novel positivity diagnostics plot, validating proper causal inference in time series data."}, "https://arxiv.org/abs/2407.17694": {"title": "Doubly Robust Conditional Independence Testing with Generative Neural Networks", "link": "https://arxiv.org/abs/2407.17694", "description": "arXiv:2407.17694v1 Announce Type: new \nAbstract: This article addresses the problem of testing the conditional independence of two generic random vectors $X$ and $Y$ given a third random vector $Z$, which plays an important role in statistical and machine learning applications. We propose a new non-parametric testing procedure that avoids explicitly estimating any conditional distributions but instead requires sampling from the two marginal conditional distributions of $X$ given $Z$ and $Y$ given $Z$. We further propose using a generative neural network (GNN) framework to sample from these approximated marginal conditional distributions, which tends to mitigate the curse of dimensionality due to its adaptivity to any low-dimensional structures and smoothness underlying the data. Theoretically, our test statistic is shown to enjoy a doubly robust property against GNN approximation errors, meaning that the test statistic retains all desirable properties of the oracle test statistic utilizing the true marginal conditional distributions, as long as the product of the two approximation errors decays to zero faster than the parametric rate. Asymptotic properties of our statistic and the consistency of a bootstrap procedure are derived under both null and local alternatives. Extensive numerical experiments and real data analysis illustrate the effectiveness and broad applicability of our proposed test."}, "https://arxiv.org/abs/2407.17804": {"title": "Bayesian Spatiotemporal Wombling", "link": "https://arxiv.org/abs/2407.17804", "description": "arXiv:2407.17804v1 Announce Type: new \nAbstract: Stochastic process models for spatiotemporal data underlying random fields find substantial utility in a range of scientific disciplines. Subsequent to predictive inference on the values of the random field (or spatial surface indexed continuously over time) at arbitrary space-time coordinates, scientific interest often turns to gleaning information regarding zones of rapid spatial-temporal change. We develop Bayesian modeling and inference for directional rates of change along a given surface. These surfaces, which demarcate regions of rapid change, are referred to as ``wombling'' surface boundaries. Existing methods for studying such changes have often been associated with curves and are not easily extendable to surfaces resulting from curves evolving over time. Our current contribution devises a fully model-based inferential framework for analyzing differential behavior in spatiotemporal responses by formalizing the notion of a ``wombling'' surface boundary using conventional multi-linear vector analytic frameworks and geometry followed by posterior predictive computations using triangulated surface approximations. 
We illustrate our methodology with comprehensive simulation experiments followed by multiple applications in environmental and climate science; pollutant analysis in environmental health; and brain imaging."}, "https://arxiv.org/abs/2407.17848": {"title": "Bayesian Benchmarking Small Area Estimation via Entropic Tilting", "link": "https://arxiv.org/abs/2407.17848", "description": "arXiv:2407.17848v1 Announce Type: new \nAbstract: Benchmarking estimation and its risk evaluation is a practically important issue in small area estimation. While hierarchical Bayesian methods have been widely adopted in small area estimation, a unified Bayesian approach to benchmarking estimation has not been fully discussed. This work employs an entropic tilting method to modify the posterior distribution of the small area parameters to meet the benchmarking constraint, which enables us to obtain benchmarked point estimation as well as reasonable uncertainty quantification. Using conditionally independent structures of the posterior, we first introduce general Monte Carlo methods for obtaining a benchmarked posterior and then show that the benchmarked posterior can be obtained in an analytical form for some representative small area models. We demonstrate the usefulness of the proposed method through simulation and empirical studies."}, "https://arxiv.org/abs/2407.17888": {"title": "Enhanced power enhancements for testing many moment equalities: Beyond the $2$- and $\infty$-norm", "link": "https://arxiv.org/abs/2407.17888", "description": "arXiv:2407.17888v1 Announce Type: new \nAbstract: Tests based on the $2$- and $\infty$-norm have received considerable attention in high-dimensional testing problems, as they are powerful against dense and sparse alternatives, respectively. The power enhancement principle of Fan et al. (2015) combines these two norms to construct tests that are powerful against both types of alternatives. Nevertheless, the $2$- and $\infty$-norm are just two out of the whole spectrum of $p$-norms that one can base a test on. In the context of testing whether a candidate parameter satisfies a large number of moment equalities, we construct a test that harnesses the strength of all $p$-norms with $p\in[2, \infty]$. As a result, this test is consistent against strictly more alternatives than any test based on a single $p$-norm. In particular, our test is consistent against more alternatives than tests based on the $2$- and $\infty$-norm, which is what most implementations of the power enhancement principle target.\n We illustrate the scope of our general results by using them to construct a test that simultaneously dominates the Anderson-Rubin test (based on $p=2$) and tests based on the $\infty$-norm in terms of consistency in the linear instrumental variable model with many (weak) instruments."}, "https://arxiv.org/abs/2407.17920": {"title": "Tobit Exponential Smoothing, towards an enhanced demand planning in the presence of censored data", "link": "https://arxiv.org/abs/2407.17920", "description": "arXiv:2407.17920v1 Announce Type: new \nAbstract: ExponenTial Smoothing (ETS) is a widely adopted forecasting technique in both research and practical applications. One critical development in ETS was the establishment of a robust statistical foundation based on state space models with a single source of error. However, an important challenge in ETS that remains unsolved is censored data estimation. 
This issue is critical in supply chain management, in particular, when companies have to deal with stockouts. This work solves that problem by proposing the Tobit ETS, which extends the use of ETS models to handle censored data efficiently. This advancement builds upon the linear models taxonomy and extends it to encompass censored data scenarios. The results show that the Tobit ETS reduces considerably the forecast bias. Real and simulation data are used from the airline and supply chain industries to corroborate the findings."}, "https://arxiv.org/abs/2407.18077": {"title": "An Alternating Direction Method of Multipliers Algorithm for the Weighted Fused LASSO Signal Approximator", "link": "https://arxiv.org/abs/2407.18077", "description": "arXiv:2407.18077v1 Announce Type: new \nAbstract: We present an Alternating Direction Method of Multipliers (ADMM) algorithm designed to solve the Weighted Generalized Fused LASSO Signal Approximator (wFLSA). First, we show that wFLSAs can always be reformulated as a Generalized LASSO problem. With the availability of algorithms tailored to the Generalized LASSO, the issue appears to be, in principle, resolved. However, the computational complexity of these algorithms is high, with a time complexity of $O(p^4)$ for a single iteration, where $p$ represents the number of coefficients. To overcome this limitation, we propose an ADMM algorithm specifically tailored for wFLSA-equivalent problems, significantly reducing the complexity to $O(p^2)$. Our algorithm is publicly accessible through the R package wflsa."}, "https://arxiv.org/abs/2407.18166": {"title": "Identification and multiply robust estimation of causal effects via instrumental variables from an auxiliary heterogeneous population", "link": "https://arxiv.org/abs/2407.18166", "description": "arXiv:2407.18166v1 Announce Type: new \nAbstract: Evaluating causal effects in a primary population of interest with unmeasured confounders is challenging. Although instrumental variables (IVs) are widely used to address unmeasured confounding, they may not always be available in the primary population. Fortunately, IVs might have been used in previous observational studies on similar causal problems, and these auxiliary studies can be useful to infer causal effects in the primary population, even if they represent different populations. However, existing methods often assume homogeneity or equality of conditional average treatment effects between the primary and auxiliary populations, which may be limited in practice. This paper aims to remove the homogeneity requirement and establish a novel identifiability result allowing for different conditional average treatment effects across populations. We also construct a multiply robust estimator that remains consistent despite partial misspecifications of the observed data model and achieves local efficiency if all nuisance models are correct. The proposed approach is illustrated through simulation studies. 
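Editorial note: the Tobit ETS abstract above (arXiv:2407.17920) does not spell out the censored-data likelihood, so the generic right-censored Gaussian (Tobit-type) log-likelihood is recalled below, with $\mu_t$ standing for a one-step-ahead point forecast, $\sigma^2$ the forecast variance, and $c_t$ the censoring level (e.g., available stock in a stockout period). This is a textbook construction in our own notation, not necessarily the exact state-space formulation used in the paper.

```latex
\ell(\mu, \sigma)
 = \sum_{t:\, y_t < c_t} \left[ \log \phi\!\left(\frac{y_t - \mu_t}{\sigma}\right) - \log \sigma \right]
 + \sum_{t:\, y_t = c_t} \log\!\left[ 1 - \Phi\!\left(\frac{c_t - \mu_t}{\sigma}\right) \right],
```

where $\phi$ and $\Phi$ are the standard normal density and distribution function; the first sum covers periods with fully observed demand and the second covers censored periods where only $y_t \ge c_t$ is known.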
We finally apply our approach by leveraging data from lower income individuals with cigarette price as a valid IV to evaluate the causal effect of smoking on physical functional status in higher income group where strong IVs are not available."}, "https://arxiv.org/abs/2407.18206": {"title": "Starting Small: Prioritizing Safety over Efficacy in Randomized Experiments Using the Exact Finite Sample Likelihood", "link": "https://arxiv.org/abs/2407.18206", "description": "arXiv:2407.18206v1 Announce Type: new \nAbstract: We use the exact finite sample likelihood and statistical decision theory to answer questions of ``why?'' and ``what should you have done?'' using data from randomized experiments and a utility function that prioritizes safety over efficacy. We propose a finite sample Bayesian decision rule and a finite sample maximum likelihood decision rule. We show that in finite samples from 2 to 50, it is possible for these rules to achieve better performance according to established maximin and maximum regret criteria than a rule based on the Boole-Frechet-Hoeffding bounds. We also propose a finite sample maximum likelihood criterion. We apply our rules and criterion to an actual clinical trial that yielded a promising estimate of efficacy, and our results point to safety as a reason for why results were mixed in subsequent trials."}, "https://arxiv.org/abs/2407.17565": {"title": "Periodicity significance testing with null-signal templates: reassessment of PTF's SMBH binary candidates", "link": "https://arxiv.org/abs/2407.17565", "description": "arXiv:2407.17565v1 Announce Type: cross \nAbstract: Periodograms are widely employed for identifying periodicity in time series data, yet they often struggle to accurately quantify the statistical significance of detected periodic signals when the data complexity precludes reliable simulations. We develop a data-driven approach to address this challenge by introducing a null-signal template (NST). The NST is created by carefully randomizing the period of each cycle in the periodogram template, rendering it non-periodic. It has the same frequentist properties as a periodic signal template regardless of the noise probability distribution, and we show with simulations that the distribution of false positives is the same as with the original periodic template, regardless of the underlying data. Thus, performing a periodicity search with the NST acts as an effective simulation of the null (no-signal) hypothesis, without having to simulate the noise properties of the data. We apply the NST method to the supermassive black hole binaries (SMBHB) search in the Palomar Transient Factory (PTF), where Charisi et al. had previously proposed 33 high signal to (white) noise candidates utilizing simulations to quantify their significance. Our approach reveals that these simulations do not capture the complexity of the real data. There are no statistically significant periodic signal detections above the non-periodic background. To improve the search sensitivity we introduce a Gaussian quadrature based algorithm for the Bayes Factor with correlated noise as a test statistic, in contrast to the standard signal to white noise. We show with simulations that this improves sensitivity to true signals by more than an order of magnitude. 
However, using the Bayes Factor approach also results in no statistically significant detections in the PTF data."}, "https://arxiv.org/abs/2407.17658": {"title": "Semiparametric Piecewise Accelerated Failure Time Model for the Analysis of Immune-Oncology Clinical Trials", "link": "https://arxiv.org/abs/2407.17658", "description": "arXiv:2407.17658v1 Announce Type: cross \nAbstract: Effectiveness of immune-oncology chemotherapies has been demonstrated in recent clinical trials. The Kaplan-Meier estimates of the survival functions of the immune therapy and the control often suggested the presence of a lag-time before the immune therapy begins to act. This implies that the use of the hazard ratio under the proportional hazards assumption would not be appealing, and many alternatives have been investigated, such as the restricted mean survival time. In addition to such an overall summary of the treatment contrast, the lag-time is also an important feature of the treatment effect. Identical survival functions up to the lag-time imply that patients who are likely to die before the lag-time would not benefit from the treatment, and identifying such patients would be very important. We propose the semiparametric piecewise accelerated failure time model and its inference procedure based on the semiparametric maximum likelihood method. It provides not only an overall treatment summary, but also a framework to identify patients who have less benefit from the immune therapy in a unified way. Numerical experiments confirm that each parameter can be estimated with minimal bias. Through a real data analysis, we illustrate the evaluation of the effect of immune-oncology therapy and the characterization of covariates under which patients are unlikely to benefit from treatment."}, "https://arxiv.org/abs/1904.00111": {"title": "Simple subvector inference on sharp identified set in affine models", "link": "https://arxiv.org/abs/1904.00111", "description": "arXiv:1904.00111v3 Announce Type: replace \nAbstract: This paper studies a regularized support function estimator for bounds on components of the parameter vector in the case in which the identified set is a polygon. The proposed regularized estimator has three important properties: (i) it has a uniform asymptotic Gaussian limit in the presence of flat faces without redundant (or overidentifying) constraints (or vice versa); (ii) the bias from regularization does not enter the first-order limiting distribution; (iii) the estimator remains consistent for the sharp (non-enlarged) identified set for the individual components even in the non-regular case. These properties are used to construct \\emph{uniformly valid} confidence sets for an element $\\theta_{1}$ of a parameter vector $\\theta\\in\\mathbb{R}^{d}$ that is partially identified by affine moment equality and inequality conditions. The proposed confidence sets can be computed as a solution to a small number of linear and convex quadratic programs, leading to a substantial decrease in computation time and guaranteeing a global optimum. As a result, the method provides uniformly valid inference in applications in which the dimension of the parameter space, $d$, and the number of inequalities, $k$, were previously computationally infeasible ($d,k=100$). The proposed approach can be extended to construct confidence sets for intersection bounds, to construct joint polygon-shaped confidence sets for multiple components of $\\theta$, and to find the set of solutions to a linear program.
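As a rough illustration of the object in the affine-model entry above (and not of its regularized support-function estimator), when the identified set is the polygon {theta : A theta <= b}, sharp bounds on a single component follow from two linear programs; the A and b below are hypothetical.

```python
# Bounds on theta_1 over a polygonal identified set via two linear programs.
import numpy as np
from scipy.optimize import linprog

A = np.array([[ 1.0,  1.0],    # hypothetical affine inequality constraints
              [-1.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([2.0, 1.0, 0.0])

c = np.zeros(2); c[0] = 1.0                      # objective picks out theta_1
lo = linprog(c,  A_ub=A, b_ub=b, bounds=[(None, None)] * 2)   # minimize theta_1
hi = linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)   # maximize theta_1
print("identified interval for theta_1:", lo.fun, -hi.fun)    # [-1, 2] here
```

The paper's contribution is the uniformly valid inference around such bounds; the LPs above only compute the population-style interval for a fixed A and b.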
Inference for coefficients in the linear IV regression model with an interval outcome is used as an illustrative example."}, "https://arxiv.org/abs/2106.11043": {"title": "Scalable Bayesian inference for time series via divide-and-conquer", "link": "https://arxiv.org/abs/2106.11043", "description": "arXiv:2106.11043v3 Announce Type: replace \nAbstract: Bayesian computational algorithms tend to scale poorly as data size increases. This has motivated divide-and-conquer-based approaches for scalable inference. These divide the data into subsets, perform inference for each subset in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that usually lack rigorous theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer method, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach."}, "https://arxiv.org/abs/2202.08728": {"title": "Nonparametric extensions of randomized response for private confidence sets", "link": "https://arxiv.org/abs/2202.08728", "description": "arXiv:2202.08728v4 Announce Type: replace \nAbstract: This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations $(X_1, \\dots, X_n)$ with mean $\\mu^\\star$ that are privatized into $(Z_1, \\dots, Z_n)$, we present confidence intervals (CI) and time-uniform confidence sequences (CS) for $\\mu^\\star$ when only given access to the privatized data. To achieve this, we study a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests."}, "https://arxiv.org/abs/2305.14275": {"title": "Variational Inference with Coverage Guarantees in Simulation-Based Inference", "link": "https://arxiv.org/abs/2305.14275", "description": "arXiv:2305.14275v3 Announce Type: replace \nAbstract: Amortized variational inference is an often employed framework in simulation-based inference that produces a posterior approximation that can be rapidly computed given any new observation. Unfortunately, there are few guarantees about the quality of these approximate posteriors. We propose Conformalized Amortized Neural Variational Inference (CANVI), a procedure that is scalable, easily implemented, and provides guaranteed marginal coverage. Given a collection of candidate amortized posterior approximators, CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. 
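For the locally private randomized-response entry above, a minimal binary-case sketch follows; the paper covers arbitrary bounded observations and time-uniform confidence sequences, which this toy example does not attempt, and the mechanism parameters below are assumptions for illustration.

```python
# Warner-style randomized response plus a private Hoeffding-type confidence interval.
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 0.75                        # p = probability of reporting the true answer
x = rng.binomial(1, 0.3, size=n)         # sensitive binary responses with mean 0.3
keep = rng.random(n) < p
z = np.where(keep, x, 1 - x)             # locally private reports

# E[Z] = (2p - 1) * mu + (1 - p), so an unbiased estimate of mu is:
mu_hat = (z.mean() - (1 - p)) / (2 * p - 1)

# Hoeffding CI for E[Z] (Z in {0, 1}), rescaled through the linear map above.
alpha = 0.05
half_width = np.sqrt(np.log(2 / alpha) / (2 * n)) / abs(2 * p - 1)
print(f"mu_hat = {mu_hat:.3f} +/- {half_width:.3f}")
```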
CANVI ensures that the resulting predictor constructs regions that contain the truth with a user-specified level of probability. CANVI is agnostic to design decisions in formulating the candidate approximators and only requires access to samples from the forward model, permitting its use in likelihood-free settings. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation. Finally, we demonstrate the accurate calibration and high predictive efficiency of CANVI on a suite of simulation-based inference benchmark tasks and an important scientific task: analyzing galaxy emission spectra."}, "https://arxiv.org/abs/2309.16373": {"title": "Regularization and Model Selection for Ordinal-on-Ordinal Regression with Applications to Food Products' Testing and Survey Data", "link": "https://arxiv.org/abs/2309.16373", "description": "arXiv:2309.16373v2 Announce Type: replace \nAbstract: Ordinal data are quite common in applied statistics. Although some model selection and regularization techniques for categorical predictors and ordinal response models have been developed over the past few years, less work has been done concerning ordinal-on-ordinal regression. Motivated by a consumer test and a survey on the willingness to pay for luxury food products consisting of Likert-type items, we propose a strategy for smoothing and selecting ordinally scaled predictors in the cumulative logit model. First, the group lasso is modified by the use of difference penalties on neighboring dummy coefficients, thus taking into account the predictors' ordinal structure. Second, a fused lasso-type penalty is presented for the fusion of predictor categories and factor selection. The performance of both approaches is evaluated in simulation studies and on real-world data."}, "https://arxiv.org/abs/2310.16213": {"title": "Bayes Factors Based on Test Statistics and Non-Local Moment Prior Densities", "link": "https://arxiv.org/abs/2310.16213", "description": "arXiv:2310.16213v2 Announce Type: replace \nAbstract: We describe Bayes factors based on z, t, $\\chi^2$, and F statistics when non-local moment prior distributions are used to define alternative hypotheses. The non-local alternative prior distributions are centered on standardized effects. The prior densities include a dispersion parameter that can be used to model prior precision and the variation of effect sizes across replicated experiments. We examine the convergence rates of Bayes factors under true null and true alternative hypotheses and show how these Bayes factors can be used to construct Bayes factor functions. An example illustrates the application of resulting Bayes factors to psychological experiments."}, "https://arxiv.org/abs/2407.18341": {"title": "Shrinking Coarsened Win Ratio and Testing of Composite Endpoint", "link": "https://arxiv.org/abs/2407.18341", "description": "arXiv:2407.18341v1 Announce Type: new \nAbstract: Composite endpoints consisting of both terminal and non-terminal events, such as death and hospitalization, are frequently used as primary endpoints in cardiovascular clinical trials. The Win Ratio method (WR) proposed by Pocock et al. (2012) [1] employs a hierarchical structure to combine fatal and non-fatal events by giving death information an absolute priority, which adversely affects power if the treatment effect is mainly on the non-fatal outcomes. 
We hereby propose the Shrinking Coarsened Win Ratio method (SCWR) that relaxes the strict hierarchical structure of the standard WR by adding stages with coarsened thresholds shrinking to zero. A weighted adaptive approach is developed to determine the thresholds in SCWR. This method preserves the good statistical properties of the standard WR and has a greater capacity to detect treatment effects on non-fatal events. We show that SCWR has an overall more favorable performance than WR in our simulations, which address the influence of follow-up time, the association between events, and the treatment effect levels, as well as in a case study based on the Digitalis Investigation Group clinical trial data."}, "https://arxiv.org/abs/2407.18389": {"title": "Doubly Robust Targeted Estimation of Conditional Average Treatment Effects for Time-to-event Outcomes with Competing Risks", "link": "https://arxiv.org/abs/2407.18389", "description": "arXiv:2407.18389v1 Announce Type: new \nAbstract: In recent years, precision treatment strategies have gained significant attention in medical research, particularly for patient care. We propose a novel framework for estimating conditional average treatment effects (CATE) in time-to-event data with competing risks, using ICU patients with sepsis as an illustrative example. Our approach, based on cumulative incidence functions and targeted maximum likelihood estimation (TMLE), achieves both asymptotic efficiency and double robustness. The primary contribution of this work lies in our derivation of the efficient influence function for the targeted causal parameter, CATE. We establish theoretical proofs of these properties and subsequently confirm them through simulations. Our TMLE framework is flexible, accommodating various regression and machine learning models, making it applicable in diverse scenarios. In order to identify variables contributing to treatment effect heterogeneity and to facilitate accurate estimation of CATE, we developed two distinct variable importance measures (VIMs). This work provides a powerful tool for optimizing personalized treatment strategies, furthering the pursuit of precision medicine."}, "https://arxiv.org/abs/2407.18432": {"title": "Accounting for reporting delays in real-time phylodynamic analyses with preferential sampling", "link": "https://arxiv.org/abs/2407.18432", "description": "arXiv:2407.18432v1 Announce Type: new \nAbstract: The COVID-19 pandemic demonstrated that fast and accurate analysis of continually collected infectious disease surveillance data is crucial for situational awareness and policy making. Coalescent-based phylodynamic analysis can use genetic sequences of a pathogen to estimate changes in its effective population size, a measure of genetic diversity. These changes in effective population size can be connected to the changes in the number of infections in the population of interest under certain conditions. Phylodynamics is an important set of tools because its methods are often resilient to the ascertainment biases present in traditional surveillance data (e.g., preferentially testing symptomatic individuals). Unfortunately, it takes weeks or months to sequence and deposit the sampled pathogen genetic sequences into a database, making them available for such analyses. These reporting delays severely decrease the precision of phylodynamic methods closer to the present time, and for some models can lead to extreme biases.
Here we present a method that affords reliable estimation of the effective population size trajectory closer to the time of data collection, allowing for policy decisions to be based on more recent data. Our work uses readily available historic times between sampling and sequencing for a population of interest, and incorporates this information into the sampling model to mitigate the effects of reporting delay in real-time analyses. We illustrate our methodology on simulated data and on SARS-CoV-2 sequences collected in the state of Washington in 2021."}, "https://arxiv.org/abs/2407.18612": {"title": "Integration of Structural Equation Modeling and Bayesian Networks in the Context of Causal Inference: A Case Study on Personal Positive Youth Development", "link": "https://arxiv.org/abs/2407.18612", "description": "arXiv:2407.18612v1 Announce Type: new \nAbstract: In this study, the combined use of structural equation modeling (SEM) and Bayesian network modeling (BNM) in causal inference analysis is revisited. The perspective highlights the debate between proponents of using BNM as either an exploratory phase or even as the sole phase in the definition of structural models, and those advocating for SEM as the superior alternative for exploratory analysis. The individual strengths and limitations of SEM and BNM are recognized, but this exploration evaluates the contention between utilizing SEM's robust structural inference capabilities and the dynamic probabilistic modeling offered by BNM. A case study of the work of \\citet{balaguer_2022} on a structural model for personal positive youth development (\\textit{PYD}) as a function of positive parenting (\\textit{PP}) and perception of the climate and functioning of the school (\\textit{CFS}) is presented. The paper ultimately presents a clear stance on the analytical primacy of SEM in exploratory causal analysis, while acknowledging the potential of BNM in subsequent phases."}, "https://arxiv.org/abs/2407.18721": {"title": "Ensemble Kalman inversion approximate Bayesian computation", "link": "https://arxiv.org/abs/2407.18721", "description": "arXiv:2407.18721v1 Announce Type: new \nAbstract: Approximate Bayesian computation (ABC) is the most popular approach to inferring parameters in the case where the data model is specified in the form of a simulator. It is not possible to directly implement standard Monte Carlo methods for inference in such a model, due to the likelihood not being available to evaluate pointwise. The main idea of ABC is to perform inference on an alternative model with an approximate likelihood (the ABC likelihood), estimated at each iteration from points simulated from the data model. The central challenge of ABC is then to trade off bias (introduced by approximating the model) with the variance introduced by estimating the ABC likelihood. Stabilising the variance of the ABC likelihood requires a computational cost that is exponential in the dimension of the data; thus, the most common approach to reducing variance is to perform inference conditional on summary statistics. In this paper we introduce a new approach to estimating the ABC likelihood: using iterative ensemble Kalman inversion (IEnKI) (Iglesias, 2016; Iglesias et al., 2018). We first introduce new estimators of the marginal likelihood in the case of a Gaussian data model using the IEnKI output, then show how this may be used in ABC.
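For readers new to the ABC setup summarized above, a bare-bones rejection-ABC sketch (with summary statistics and a hard tolerance) is given below; it is only the baseline that the IEnKI-based likelihood estimator of the entry is designed to improve on, not that estimator itself, and the simulator, prior, and tolerance are assumptions for illustration.

```python
# Generic rejection ABC with summary statistics.
import numpy as np

rng = np.random.default_rng(2)
y_obs = rng.normal(3.0, 1.0, size=50)                 # "observed" data
s_obs = np.array([y_obs.mean(), y_obs.std()])         # summary statistics

def simulate(theta, size=50):
    return rng.normal(theta, 1.0, size=size)          # simulator with unknown mean theta

accepted = []
eps = 0.2                                             # ABC tolerance
for _ in range(20000):
    theta = rng.uniform(-10, 10)                      # draw from the prior
    y_sim = simulate(theta)
    s_sim = np.array([y_sim.mean(), y_sim.std()])
    if np.linalg.norm(s_sim - s_obs) < eps:           # keep parameters whose summaries match
        accepted.append(theta)

print(len(accepted), "accepted; approximate posterior mean:", np.mean(accepted))
```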
Performance is illustrated on the Lotka-Volterra model, where we observe substantial improvements over standard ABC and other commonly-used approaches."}, "https://arxiv.org/abs/2407.18835": {"title": "Robust Estimation of Polychoric Correlation", "link": "https://arxiv.org/abs/2407.18835", "description": "arXiv:2407.18835v1 Announce Type: new \nAbstract: Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust to partial misspecification of the polychoric model, that is, the model is only misspecified for an unknown fraction of observations, for instance (but not limited to) careless respondents. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation and is consistent as well as asymptotically normally distributed. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify."}, "https://arxiv.org/abs/2407.18885": {"title": "Simulation Experiment Design for Calibration via Active Learning", "link": "https://arxiv.org/abs/2407.18885", "description": "arXiv:2407.18885v1 Announce Type: new \nAbstract: Simulation models often have parameters as input and return outputs to understand the behavior of complex systems. Calibration is the process of estimating the values of the parameters in a simulation model in light of observed data from the system that is being simulated. When simulation models are expensive, emulators are built with simulation data as a computationally efficient approximation of an expensive model. An emulator then can be used to predict model outputs, instead of repeatedly running an expensive simulation model during the calibration process. Sequential design with an intelligent selection criterion can guide the process of collecting simulation data to build an emulator, making the calibration process more efficient and effective. This article proposes two novel criteria for sequentially acquiring new simulation data in an active learning setting by considering uncertainties on the posterior density of parameters. Analysis of several simulation experiments and real-data simulation experiments from epidemiology demonstrates that proposed approaches result in improved posterior and field predictions."}, "https://arxiv.org/abs/2407.18905": {"title": "The nph2ph-transform: applications to the statistical analysis of completed clinical trials", "link": "https://arxiv.org/abs/2407.18905", "description": "arXiv:2407.18905v1 Announce Type: new \nAbstract: We present several illustrations from completed clinical trials on a statistical approach that allows us to gain useful insights regarding the time dependency of treatment effects. Our approach leans on a simple proposition: all non-proportional hazards (NPH) models are equivalent to a proportional hazards model. The nph2ph transform brings an NPH model into a PH form. 
We often find very simple approximations for this transform, enabling us to analyze complex NPH observations as though they had arisen under proportional hazards. Many techniques become available to us, and we use these to understand treatment effects better."}, "https://arxiv.org/abs/2407.18314": {"title": "Higher Partials of fStress", "link": "https://arxiv.org/abs/2407.18314", "description": "arXiv:2407.18314v1 Announce Type: cross \nAbstract: We define *fDistances*, which generalize Euclidean distances, squared distances, and log distances. The least squares loss function to fit fDistances to dissimilarity data is *fStress*. We give formulas and R/C code to compute partial derivatives of orders one to four of fStress, relying heavily on the use of Fa\\`a di Bruno's chain rule formula for higher derivatives."}, "https://arxiv.org/abs/2407.18698": {"title": "Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation", "link": "https://arxiv.org/abs/2407.18698", "description": "arXiv:2407.18698v1 Announce Type: cross \nAbstract: Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, $k-$sampling, nucleus $p-$sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence, diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available."}, "https://arxiv.org/abs/2407.18755": {"title": "Score matching through the roof: linear, nonlinear, and latent variables causal discovery", "link": "https://arxiv.org/abs/2407.18755", "description": "arXiv:2407.18755v1 Announce Type: cross \nAbstract: Causal discovery from observational data holds great promise, but existing methods rely on strong assumptions about the underlying causal structure, often requiring full observability of all relevant variables. We tackle these challenges by leveraging the score function $\\nabla \\log p(X)$ of observed variables for causal discovery and propose the following contributions. First, we generalize the existing results of identifiability with the score to additive noise models with minimal requirements on the causal mechanisms. Second, we establish conditions for inferring causal relations from the score even in the presence of hidden variables; this result is two-faced: we demonstrate the score's potential as an alternative to conditional independence tests to infer the equivalence class of causal graphs with hidden variables, and we provide the necessary conditions for identifying direct causes in latent variable models. 
Building on these insights, we propose a flexible algorithm for causal discovery across linear, nonlinear, and latent variable models, which we empirically validate."}, "https://arxiv.org/abs/2201.06898": {"title": "Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period", "link": "https://arxiv.org/abs/2201.06898", "description": "arXiv:2201.06898v4 Announce Type: replace \nAbstract: We propose difference-in-differences estimators in designs where the treatment is continuously distributed at every period, as is often the case when one studies the effects of taxes, tariffs, or prices. We assume that between consecutive periods, the treatment of some units, the switchers, changes, while the treatment of other units remains constant. We show that under a placebo-testable parallel-trends assumption, averages of the slopes of switchers' potential outcomes can be nonparametrically estimated. We generalize our estimators to the instrumental-variable case. We use our estimators to estimate the price-elasticity of gasoline consumption."}, "https://arxiv.org/abs/2209.04329": {"title": "Heterogeneous Treatment Effect Bounds under Sample Selection with an Application to the Effects of Social Media on Political Polarization", "link": "https://arxiv.org/abs/2209.04329", "description": "arXiv:2209.04329v5 Announce Type: replace \nAbstract: We propose a method for estimation and inference for bounds for heterogeneous causal effect parameters in general sample selection models where the treatment can affect whether an outcome is observed and no exclusion restrictions are available. The method provides conditional effect bounds as functions of policy relevant pre-treatment variables. It allows for conducting valid statistical inference on the unidentified conditional effects. We use a flexible debiased/double machine learning approach that can accommodate non-linear functional forms and high-dimensional confounders. Easily verifiable high-level conditions for estimation, misspecification robust confidence intervals, and uniform confidence bands are provided as well. We re-analyze data from a large scale field experiment on Facebook on counter-attitudinal news subscription with attrition. Our method yields substantially tighter effect bounds compared to conventional methods and suggests depolarization effects for younger users."}, "https://arxiv.org/abs/2307.16502": {"title": "Percolated stochastic block model via EM algorithm and belief propagation with non-backtracking spectra", "link": "https://arxiv.org/abs/2307.16502", "description": "arXiv:2307.16502v5 Announce Type: replace-cross \nAbstract: Whereas Laplacian and modularity based spectral clustering is apt to dense graphs, recent results show that for sparse ones, the non-backtracking spectrum is the best candidate to find assortative clusters of nodes. Here belief propagation in the sparse stochastic block model is derived with arbitrary given model parameters that results in a non-linear system of equations; with linear approximation, the spectrum of the non-backtracking matrix is able to specify the number $k$ of clusters. Then the model parameters themselves can be estimated by the EM algorithm. Bond percolation in the assortative model is considered in the following two senses: the within- and between-cluster edge probabilities decrease with the number of nodes and edges coming into existence in this way are retained with probability $\\beta$. 
As a consequence, the optimal $k$ is the number of structural real eigenvalues (greater than $\\sqrt{c}$, where $c$ is the average degree) of the non-backtracking matrix of the graph. Assuming these eigenvalues $\\mu_1 >\\dots > \\mu_k$ are distinct, the multiple phase transitions obtained for $\\beta$ are $\\beta_i =\\frac{c}{\\mu_i^2}$; further, at $\\beta_i$ the number of detectable clusters is $i$, for $i=1,\\dots ,k$. Inflation-deflation techniques are also discussed to classify the nodes themselves, which can form the basis of sparse spectral clustering."}, "https://arxiv.org/abs/2309.03731": {"title": "Using representation balancing to learn conditional-average dose responses from clustered data", "link": "https://arxiv.org/abs/2309.03731", "description": "arXiv:2309.03731v2 Announce Type: replace-cross \nAbstract: Estimating a unit's responses to interventions with an associated dose, the \"conditional average dose response\" (CADR), is relevant in a variety of domains, from healthcare to business, economics, and beyond. Such a response typically needs to be estimated from observational data, which introduces several challenges. That is why the machine learning (ML) community has proposed several tailored CADR estimators. Yet most of these methods require strong assumptions on the distribution of the data and the assignment of interventions, which go beyond the standard assumptions in causal inference. Whereas previous works have so far focused on smooth shifts in covariate distributions across doses, in this work we study estimating CADR from clustered data, where different doses are assigned to different segments of a population. On a novel benchmarking dataset, we show the impacts of clustered data on model performance and propose an estimator, CBRNet, that learns cluster-agnostic and hence dose-agnostic covariate representations through representation balancing for unbiased CADR inference. We run extensive experiments to illustrate the workings of our method and compare it with the state of the art in ML for CADR estimation."}, "https://arxiv.org/abs/2401.13929": {"title": "HMM for Discovering Decision-Making Dynamics Using Reinforcement Learning Experiments", "link": "https://arxiv.org/abs/2401.13929", "description": "arXiv:2401.13929v2 Announce Type: replace-cross \nAbstract: Major depressive disorder (MDD) presents challenges in diagnosis and treatment due to its complex and heterogeneous nature. Emerging evidence indicates that reward processing abnormalities may serve as a behavioral marker for MDD. To measure reward processing, patients perform computer-based behavioral tasks that involve making choices or responding to stimuli that are associated with different outcomes. Reinforcement learning (RL) models are fitted to extract parameters that measure various aspects of reward processing to characterize how patients make decisions in behavioral tasks. Recent findings suggest the inadequacy of characterizing reward learning solely based on a single RL model; instead, there may be a switching of decision-making processes between multiple strategies. An important scientific question is how the dynamics of learning strategies in decision-making affect the reward learning ability of individuals with MDD. Motivated by the probabilistic reward task (PRT) within the EMBARC study, we propose a novel RL-HMM framework for analyzing reward-based decision-making.
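Returning to the percolated stochastic block model entry above, its eigenvalue criterion can be sketched directly: build the non-backtracking matrix and count the real eigenvalues exceeding $\sqrt{c}$. The graph and parameters below are arbitrary choices for illustration, and the percolation and EM steps of the paper are not reproduced.

```python
# Count real non-backtracking eigenvalues above sqrt(average degree) to estimate k.
import numpy as np
import networkx as nx

G = nx.planted_partition_graph(2, 100, p_in=0.08, p_out=0.01, seed=3)

darts = [(u, v) for u, v in G.edges()] + [(v, u) for u, v in G.edges()]
idx = {d: i for i, d in enumerate(darts)}

B = np.zeros((len(darts), len(darts)))
for (u, v) in darts:
    for w in G.neighbors(v):
        if w != u:                                    # non-backtracking: cannot return to u
            B[idx[(u, v)], idx[(v, w)]] = 1.0

c = 2 * G.number_of_edges() / G.number_of_nodes()     # average degree
eig = np.linalg.eigvals(B)
real = eig[np.abs(eig.imag) < 1e-8].real
k_hat = int((real > np.sqrt(c)).sum())
print("estimated number of clusters:", k_hat)         # typically 2 for this configuration
```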
Our model accommodates learning strategy switching between two distinct approaches under a hidden Markov model (HMM): subjects making decisions based on the RL model or opting for random choices. We account for continuous RL state space and allow time-varying transition probabilities in the HMM. We introduce a computationally efficient EM algorithm for parameter estimation and employ a nonparametric bootstrap for inference. We apply our approach to the EMBARC study to show that MDD patients are less engaged in RL compared to the healthy controls, and engagement is associated with brain activities in the negative affect circuitry during an emotional conflict task."}, "https://arxiv.org/abs/2407.19030": {"title": "Multimodal data integration and cross-modal querying via orchestrated approximate message passing", "link": "https://arxiv.org/abs/2407.19030", "description": "arXiv:2407.19030v1 Announce Type: new \nAbstract: The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, these reference data sets are often queried later by new subjects that are only partially observed. Leveraging on asymptotic normality of estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the prowess of both the data integration and the prediction set construction algorithms on a tri-modal single-cell dataset."}, "https://arxiv.org/abs/2407.19057": {"title": "Partial Identification of the Average Treatment Effect with Stochastic Counterfactuals and Discordant Twins", "link": "https://arxiv.org/abs/2407.19057", "description": "arXiv:2407.19057v1 Announce Type: new \nAbstract: We develop a novel approach to partially identify causal estimands, such as the average treatment effect (ATE), from observational data. To better satisfy the stable unit treatment value assumption (SUTVA) we utilize stochastic counterfactuals within a propensity-prognosis model of the data generating process. For more precise identification we utilize knowledge of discordant twin outcomes as evidence for randomness in the data generating process. Our approach culminates with a constrained optimization problem; the solution gives upper and lower bounds for the ATE. We demonstrate the applicability of our introduced methodology with three example applications."}, "https://arxiv.org/abs/2407.19135": {"title": "Bayesian Mapping of Mortality Clusters", "link": "https://arxiv.org/abs/2407.19135", "description": "arXiv:2407.19135v1 Announce Type: new \nAbstract: Disease mapping analyses the distribution of several diseases within a territory. Primary goals include identifying areas with unexpected changes in mortality rates, studying the relation among multiple diseases, and dividing the analysed territory into clusters based on the observed levels of disease incidence or mortality. In this work, we focus on detecting spatial mortality clusters, that occur when neighbouring areas within a territory exhibit similar mortality levels due to one or more diseases. 
When multiple death causes are examined together, it is relevant to identify both the spatial boundaries of the clusters and the diseases that lead to their formation. However, existing methods in literature struggle to address this dual problem effectively and simultaneously. To overcome these limitations, we introduce Perla, a multivariate Bayesian model that clusters areas in a territory according to the observed mortality rates of multiple death causes, also exploiting the information of external covariates. Our model incorporates the spatial data structure directly into the clustering probabilities by leveraging the stick-breaking formulation of the multinomial distribution. Additionally, it exploits suitable global-local shrinkage priors to ensure that the detection of clusters is driven by concrete differences across mortality levels while excluding spurious differences. We propose an MCMC algorithm for posterior inference that consists of closed-form Gibbs sampling moves for nearly every model parameter, without requiring complex tuning operations. This work is primarily motivated by a case study on the territory of the local unit ULSS6 Euganea within the Italian public healthcare system. To demonstrate the flexibility and effectiveness of our methodology, we also validate Perla with a series of simulation experiments and an extensive case study on mortality levels in U.S. counties."}, "https://arxiv.org/abs/2407.19171": {"title": "Assessing Spatial Disparities: A Bayesian Linear Regression Approach", "link": "https://arxiv.org/abs/2407.19171", "description": "arXiv:2407.19171v1 Announce Type: new \nAbstract: Epidemiological investigations of regionally aggregated spatial data often involve detecting spatial health disparities between neighboring regions on a map of disease mortality or incidence rates. Analyzing such data introduces spatial dependence among the health outcomes and seeks to report statistically significant spatial disparities by delineating boundaries that separate neighboring regions with widely disparate health outcomes. However, current statistical methods are often inadequate for appropriately defining what constitutes a spatial disparity and for constructing rankings of posterior probabilities that are robust under changes to such a definition. More specifically, non-parametric Bayesian approaches endow spatial effects with discrete probability distributions using Dirichlet processes, or generalizations thereof, and rely upon computationally intensive methods for inferring on weakly identified parameters. In this manuscript, we introduce a Bayesian linear regression framework to detect spatial health disparities. This enables us to exploit Bayesian conjugate posterior distributions in a more accessible manner and accelerate computation significantly over existing Bayesian non-parametric approaches. 
Simulation experiments conducted over a county map of the entire United States demonstrate the effectiveness of our method, and we apply it to a data set from the Institute of Health Metrics and Evaluation (IHME) on age-standardized US county-level estimates of mortality rates across tracheal, bronchus, and lung cancer."}, "https://arxiv.org/abs/2407.19269": {"title": "A Bayesian Approach Toward Robust Multidimensional Ellipsoid-Specific Fitting", "link": "https://arxiv.org/abs/2407.19269", "description": "arXiv:2407.19269v1 Announce Type: new \nAbstract: This work presents a novel and effective method for fitting multidimensional ellipsoids to scattered data contaminated by noise and outliers. We approach the problem as a Bayesian parameter estimation process and maximize the posterior probability of a certain ellipsoidal solution given the data. We establish a more robust correlation between these points based on the predictive distribution within the Bayesian framework. We incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain, ensuring ellipsoid-specific results regardless of inputs. We then establish the connection between measurement points and model data via Bayes' rule to enhance the method's robustness against noise. Because it is independent of the spatial dimension, the proposed method not only delivers high-quality fits to challenging elongated ellipsoids but also generalizes well to multidimensional spaces. To address outlier disturbances, often overlooked by previous approaches, we further introduce a uniform distribution on top of the predictive distribution to significantly enhance the algorithm's robustness against outliers. We introduce an $\\epsilon$-accelerated technique to expedite the convergence of EM considerably. To the best of our knowledge, this is the first comprehensive method capable of performing multidimensional ellipsoid-specific fitting within the Bayesian optimization paradigm under diverse disturbances. We evaluate it across lower and higher dimensional spaces in the presence of heavy noise, outliers, and substantial variations in axis ratios. Also, we apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks."}, "https://arxiv.org/abs/2407.19329": {"title": "Normality testing after transformation", "link": "https://arxiv.org/abs/2407.19329", "description": "arXiv:2407.19329v1 Announce Type: new \nAbstract: Transforming a random variable to improve its normality leads to a follow-up test for whether the transformed variable follows a normal distribution. Previous work has shown that the Anderson-Darling test for normality suffers from resubstitution bias following Box-Cox transformation, and indicates normality much too often. The work reported here extends this by adding the Shapiro-Wilk statistic and the two-parameter Box-Cox transformation, all of which show severe bias. We also develop a recalibration to correct the bias in all four settings. The methodology was motivated by finding reference ranges in biomarker studies where parametric analysis, possibly on a power-transformed measurand, can be much more informative than nonparametric analysis.
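To make the "transform, then test" workflow in the normality-testing entry concrete, here is a small simulation of the naive procedure (a Box-Cox fit followed by a Shapiro-Wilk test on the same data); the entry's point is that this resubstitution step is biased, and its recalibration is not reproduced here. The data-generating choices are assumptions for illustration.

```python
# Naive normality test after an estimated Box-Cox transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
rejections = 0
for _ in range(1000):
    x = rng.gamma(shape=2.0, scale=1.0, size=100)     # clearly non-normal, biomarker-like data
    z, lam = stats.boxcox(x)                          # one-parameter Box-Cox fit by ML
    if stats.shapiro(z).pvalue < 0.05:                # Shapiro-Wilk on the transformed data
        rejections += 1
print("rejection rate after transformation:", rejections / 1000)
# Typically very small: the fitted transform makes the same data look "too normal".
```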
It is illustrated with a data set on biomarkers."}, "https://arxiv.org/abs/2407.19339": {"title": "Accounting for Nonresponse in Election Polls: Total Margin of Error", "link": "https://arxiv.org/abs/2407.19339", "description": "arXiv:2407.19339v1 Announce Type: new \nAbstract: The potential impact of nonresponse on election polls is well known and frequently acknowledged. Yet measurement and reporting of polling error has focused solely on sampling error, represented by the margin of error of a poll. Survey statisticians have long recommended measurement of the total survey error of a sample estimate by its mean square error (MSE), which jointly measures sampling and non-sampling errors. Extending the conventional language of polling, we think it reasonable to use the square root of maximum MSE to measure the total margin of error. This paper demonstrates how to measure the potential impact of nonresponse using the concept of the total margin of error, which we argue should be a standard feature in the reporting of election poll results. We first show how to jointly measure statistical imprecision and response bias when a pollster lacks any knowledge of the candidate preferences of non-responders. We then extend the analysis to settings where the pollster has partial knowledge that bounds the preferences of non-responders."}, "https://arxiv.org/abs/2407.19378": {"title": "Penalized Principal Component Analysis for Large-dimension Factor Model with Group Pursuit", "link": "https://arxiv.org/abs/2407.19378", "description": "arXiv:2407.19378v1 Announce Type: new \nAbstract: This paper investigates the intrinsic group structures within the framework of large-dimensional approximate factor models, which portrays homogeneous effects of the common factors on the individuals that fall into the same group. To this end, we propose a fusion Penalized Principal Component Analysis (PPCA) method and derive a closed-form solution for the $\\ell_2$-norm optimization problem. We also show the asymptotic properties of our proposed PPCA estimates. With the PPCA estimates as an initialization, we identify the unknown group structure by a combination of the agglomerative hierarchical clustering algorithm and an information criterion. Then the factor loadings and factor scores are re-estimated conditional on the identified latent groups. Under some regularity conditions, we establish the consistency of the membership estimators as well as that of the group number estimator derived from the information criterion. Theoretically, we show that the post-clustering estimators for the factor loadings and factor scores with group pursuit achieve efficiency gains compared to the estimators by conventional PCA method. Thorough numerical studies validate the established theory and a real financial example illustrates the practical usefulness of the proposed method."}, "https://arxiv.org/abs/2407.19502": {"title": "Identifying arbitrary transformation between the slopes in functional regression", "link": "https://arxiv.org/abs/2407.19502", "description": "arXiv:2407.19502v1 Announce Type: new \nAbstract: In this article, we study whether the slope functions of two functional regression models in two samples are associated with any arbitrary transformation (barring constant and linear transformation) or not along the vertical axis. 
In order to address this issue, a statistical hypothesis testing problem is formalized, and the test statistic is formed based on the estimated second derivative of the unknown transformation. The asymptotic properties of the test statistics are investigated using some advanced techniques related to empirical processes. Moreover, to implement the test for small sample sizes, a bootstrap algorithm is proposed, and it is shown that the bootstrap version of the test is as good as the original test for sufficiently large sample sizes. Furthermore, the utility of the proposed methodology is shown for simulated data sets, and DTI data are analyzed using this methodology."}, "https://arxiv.org/abs/2407.19509": {"title": "Heterogeneous Grouping Structures in Panel Data", "link": "https://arxiv.org/abs/2407.19509", "description": "arXiv:2407.19509v1 Announce Type: new \nAbstract: In this paper we examine the existence of heterogeneity within a group, in panels with latent grouping structure. The assumption of within-group homogeneity is prevalent in this literature, implying that the formation of groups alleviates cross-sectional heterogeneity, regardless of the prior knowledge of groups. While the latter hypothesis makes inference powerful, it can often be restrictive. We allow for models with richer heterogeneity that can be found both in the cross-section and within a group, without imposing the simple assumption that all groups must be heterogeneous. We further contribute to the method proposed by \\cite{su2016identifying}, by showing that the model parameters can be consistently estimated and the groups, while unknown, can be identified in the presence of different types of heterogeneity. Within the same framework we consider the validity of assuming both cross-sectional and within-group homogeneity, using testing procedures. Simulations demonstrate good finite-sample performance of the approach in both classification and estimation, while empirical applications across several datasets provide evidence of multiple clusters and reject the hypothesis of within-group homogeneity."}, "https://arxiv.org/abs/2407.19558": {"title": "Identification and Inference with Invalid Instruments", "link": "https://arxiv.org/abs/2407.19558", "description": "arXiv:2407.19558v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are widely used to study the causal effect of an exposure on an outcome in the presence of unmeasured confounding. IV methods require an instrument, a variable that is (A1) associated with the exposure, (A2) has no direct effect on the outcome except through the exposure, and (A3) is not related to unmeasured confounders. Unfortunately, finding variables that satisfy conditions (A2) or (A3) can be challenging in practice. This paper reviews works where instruments may not satisfy conditions (A2) or (A3), which we refer to as invalid instruments. We review identification and inference under different violations of (A2) or (A3), specifically under linear models, non-linear models, and heteroskedastic models.
We conclude with an empirical comparison of various methods by re-analyzing the effect of body mass index on systolic blood pressure from the UK Biobank."}, "https://arxiv.org/abs/2407.19602": {"title": "Metropolis--Hastings with Scalable Subsampling", "link": "https://arxiv.org/abs/2407.19602", "description": "arXiv:2407.19602v1 Announce Type: new \nAbstract: The Metropolis-Hastings (MH) algorithm is one of the most widely used Markov Chain Monte Carlo schemes for generating samples from Bayesian posterior distributions. The algorithm is asymptotically exact, flexible and easy to implement. However, in the context of Bayesian inference for large datasets, evaluating the likelihood on the full data for thousands of iterations until convergence can be prohibitively expensive. This paper introduces a new subsample MH algorithm that satisfies detailed balance with respect to the target posterior and utilises control variates to enable exact, efficient Bayesian inference on datasets with large numbers of observations. Through theoretical results, simulation experiments and real-world applications on certain generalised linear models, we demonstrate that our method requires substantially smaller subsamples and is computationally more efficient than the standard MH algorithm and other exact subsample MH algorithms."}, "https://arxiv.org/abs/2407.19618": {"title": "Experimenting on Markov Decision Processes with Local Treatments", "link": "https://arxiv.org/abs/2407.19618", "description": "arXiv:2407.19618v1 Announce Type: new \nAbstract: As service systems grow increasingly complex and dynamic, many interventions become localized, available and effective only in specific states. This paper investigates experiments with local treatments on a widely-used class of dynamic models, Markov Decision Processes (MDPs). In particular, we focus on utilizing the local structure to improve the inference efficiency of the average treatment effect. We begin by demonstrating the efficiency of classical inference methods, including model-based estimation and temporal difference learning under a fixed policy, as well as classical A/B testing with general treatments. We then introduce a variance reduction technique that exploits the local treatment structure by sharing information for states unaffected by the treatment policy. Our new estimator effectively overcomes the variance lower bound for general treatments while matching the more stringent lower bound incorporating the local treatment structure. Furthermore, our estimator can optimally achieve a reduction, linear in the number of test arms, for a major part of the variance. Finally, we explore scenarios with perfect knowledge of the control arm and design estimators that further improve inference efficiency."}, "https://arxiv.org/abs/2407.19624": {"title": "Nonparametric independence tests in high-dimensional settings, with applications to the genetics of complex disease", "link": "https://arxiv.org/abs/2407.19624", "description": "arXiv:2407.19624v1 Announce Type: new \nAbstract: [PhD thesis of FCP.] Nowadays, genetics studies large numbers of very diverse variables. Mathematical statistics has evolved in parallel to its applications, with much recent interest in high-dimensional settings. In the genetics of human common disease, a number of relevant problems can be formulated as tests of independence. We show how defining adequate premetric structures on the support spaces of the genetic data allows for novel approaches to such testing.
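As a generic point of reference for the independence-testing theme in the thesis entry above (and not its premetric-based methodology), the sketch below runs a permutation test built on the classical sample distance covariance; the data-generating step is an assumed toy example.

```python
# Permutation-based independence test using (biased) sample distance covariance.
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance for univariate x, y."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = x**2 + 0.5 * rng.normal(size=200)         # dependent on x but nearly uncorrelated

stat = dcov2(x, y)
perm = np.array([dcov2(x, rng.permutation(y)) for _ in range(499)])
p_value = (1 + np.sum(perm >= stat)) / (1 + len(perm))
print("permutation p-value:", p_value)        # small: the dependence is detected
```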
This yields a solid theoretical framework, which reflects the underlying biology, and allows for computationally-efficient implementations. For each problem, we provide mathematical results, simulations and the application to real data."}, "https://arxiv.org/abs/2407.19659": {"title": "Estimating heterogeneous treatment effects by W-MCM based on Robust reduced rank regression", "link": "https://arxiv.org/abs/2407.19659", "description": "arXiv:2407.19659v1 Announce Type: new \nAbstract: Recently, from the personalized medicine perspective, there has been an increased demand to identify subgroups of subjects for whom treatment is effective. Consequently, the estimation of heterogeneous treatment effects (HTE) has been attracting attention. While various estimation methods have been developed for a single outcome, there are still limited approaches for estimating HTE for multiple outcomes. Accurately estimating HTE remains a challenge especially for datasets where there is a high correlation between outcomes or the presence of outliers. Therefore, this study proposes a method that uses a robust reduced-rank regression framework to estimate treatment effects and identify effective subgroups. This approach allows the consideration of correlations between treatment effects and the estimation of treatment effects with an accurate low-rank structure. It also provides robust estimates for outliers. This study demonstrates that, when treatment effects are estimated using the reduced rank regression framework with an appropriate rank, the expected value of the estimator equals the treatment effect. Finally, we illustrate the effectiveness and interpretability of the proposed method through simulations and real data examples."}, "https://arxiv.org/abs/2407.19744": {"title": "Robust classification via finite mixtures of matrix-variate skew t distributions", "link": "https://arxiv.org/abs/2407.19744", "description": "arXiv:2407.19744v1 Announce Type: new \nAbstract: Analysis of matrix-variate data is becoming increasingly common in the literature, particularly in the field of clustering and classification. It is well-known that real data, including real matrix-variate data, often exhibit high levels of asymmetry. To address this issue, one common approach is to introduce a tail or skewness parameter to a symmetric distribution. In this regard, we introduced here a new distribution called the matrix-variate skew t distribution (MVST), which provides flexibility in terms of heavy tail and skewness. We then conduct a thorough investigation of various characterizations and probabilistic properties of the MVST distribution. We also explore extensions of this distribution to a finite mixture model. To estimate the parameters of the MVST distribution, we develop an efficient EM-type algorithm that computes maximum likelihood (ML) estimates of the model parameters. To validate the effectiveness and usefulness of the developed models and associated methods, we perform empirical experiments using simulated data as well as three real data examples. Our results demonstrate the efficacy of the developed approach in handling asymmetric matrix-variate data."}, "https://arxiv.org/abs/2407.19978": {"title": "Inferring High-Dimensional Dynamic Networks Changing with Multiple Covariates", "link": "https://arxiv.org/abs/2407.19978", "description": "arXiv:2407.19978v1 Announce Type: new \nAbstract: High-dimensional networks play a key role in understanding complex relationships. 
These relationships are often dynamic in nature and can change with multiple external factors (e.g., time and groups). Methods for estimating graphical models are often restricted to static graphs or graphs that can change with a single covariate (e.g., time). We propose a novel class of graphical models, the covariate-varying network (CVN), that can change with multiple external covariates.\n In order to introduce sparsity, we apply a $L_1$-penalty to the precision matrices of $m \\geq 2$ graphs we want to estimate. These graphs often show a level of similarity. In order to model this 'smoothness', we introduce the concept of a 'meta-graph' where each node in the meta-graph corresponds to an individual graph in the CVN. The (weighted) adjacency matrix of the meta-graph represents the strength with which similarity is enforced between the $m$ graphs.\n The resulting optimization problem is solved by employing an alternating direction method of multipliers. We test our method using a simulation study and we show its applicability by applying it to a real-world data set, the gene expression networks from the study 'German Cancer in childhood and molecular-epidemiology' (KiKme). An implementation of the algorithm in R is publicly available under https://github.com/bips-hb/cvn"}, "https://arxiv.org/abs/2407.20027": {"title": "Graphical tools for detection and control of selection bias with multiple exposures and samples", "link": "https://arxiv.org/abs/2407.20027", "description": "arXiv:2407.20027v1 Announce Type: new \nAbstract: Among recent developments in definitions and analysis of selection bias is the potential outcomes approach of Kenah (Epidemiology, 2023), which allows non-parametric analysis using single-world intervention graphs, linking selection of study participants to identification of causal effects. Mohan & Pearl (JASA, 2021) provide a framework for missing data via directed acyclic graphs augmented with nodes indicating missingness for each sometimes-missing variable, which allows for analysis of more general missing data problems but cannot easily encode scenarios in which different groups of variables are observed in specific subsamples. We give an alternative formulation of the potential outcomes framework based on conditional separable effects and indicators for selection into subsamples. This is practical for problems between the single-sample scenarios considered by Kenah and the variable-wise missingness considered by Mohan & Pearl. This simplifies identification conditions and admits generalizations to scenarios with multiple, potentially nested or overlapping study samples, as well as multiple or time-dependent exposures. We give examples of identifiability arguments for case-cohort studies, multiple or time-dependent exposures, and direct effects of selection."}, "https://arxiv.org/abs/2407.20051": {"title": "Estimating risk factors for pathogenic dose accrual from longitudinal data", "link": "https://arxiv.org/abs/2407.20051", "description": "arXiv:2407.20051v1 Announce Type: new \nAbstract: Estimating risk factors for incidence of a disease is crucial for understanding its etiology. For diseases caused by enteric pathogens, off-the-shelf statistical model-based approaches do not provide biological plausibility and ignore important sources of variability. We propose a new approach to estimating incidence risk factors built on established work in quantitative microbiological risk assessment. 
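To make the covariate-varying network (CVN) formulation above (arXiv:2407.19978) concrete, one plausible form of the penalized objective, assuming the meta-graph smoothness term penalizes weighted elementwise differences of precision matrices (the exact penalty used in the paper may differ), is:

```latex
\min_{\Theta_1,\dots,\Theta_m \succ 0}\;
\sum_{i=1}^{m}\Bigl[\operatorname{tr}(S_i\Theta_i)-\log\det\Theta_i\Bigr]
\;+\;\lambda_1\sum_{i=1}^{m}\lVert\Theta_i\rVert_1
\;+\;\lambda_2\sum_{i<j} w_{ij}\,\lVert\Theta_i-\Theta_j\rVert_1
```

Here $S_i$ is the empirical covariance of the observations assigned to graph $i$, $\lVert\cdot\rVert_1$ is the elementwise $\ell_1$ norm, and $w_{ij}$ is an entry of the weighted meta-graph adjacency matrix; the ADMM solver mentioned in the abstract would split these three terms through consensus variables.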
Excepting those risk factors which affect both dose accrual and within-host pathogen survival rates, our model's regression parameters are easily interpretable as the dose accrual rate ratio due to the risk factors under study. We also describe a method for leveraging information across multiple pathogens. The proposed methods are available as an R package at \\url{https://github.com/dksewell/ladie}. Our simulation study shows unacceptable coverage rates from generalized linear models, while the proposed approach maintains the nominal rate even when the model is misspecified. Finally, we demonstrate the proposed approach by applying it to Nairobian infant data obtained through the PATHOME study (\\url{https://reporter.nih.gov/project-details/10227256}), discovering the impact of various environmental factors on infant enteric infections."}, "https://arxiv.org/abs/2407.20073": {"title": "Transfer Learning Targeting Mixed Population: A Distributional Robust Perspective", "link": "https://arxiv.org/abs/2407.20073", "description": "arXiv:2407.20073v1 Announce Type: new \nAbstract: Despite recent advances in transfer learning with multiple source data sets, developments are still lacking for mixture target populations that could be approximated through a composite of the sources due to certain key factors like ethnicity in practice. To address this open problem under distributional shifts of covariates and outcome models as well as the absence of accurate labels on the target, we propose a novel approach for distributionally robust transfer learning targeting a mixture population. It learns a set of covariate-specific weights to infer the target outcome model with multiple sources, relying on a joint source mixture assumption for the target population. Then our method incorporates a group adversarial learning step to enhance the robustness against moderate violation of the joint mixture assumption. In addition, our framework allows the use of side information, such as a small labeled sample, as guidance to avoid over-conservative results. Statistical convergence and predictive accuracy of our method are quantified through asymptotic studies. Simulation and real-world studies demonstrate the outperformance of our method over existing multi-source and transfer learning approaches."}, "https://arxiv.org/abs/2407.20085": {"title": "Local Level Dynamic Random Partition Models for Changepoint Detection", "link": "https://arxiv.org/abs/2407.20085", "description": "arXiv:2407.20085v1 Announce Type: new \nAbstract: Motivated by an increasing demand for models that can effectively describe features of complex multivariate time series, e.g. from sensor data in biomechanics, motion analysis, and sports science, we introduce a novel state-space modeling framework where the state equation encodes the evolution of latent partitions of the data over time. Building on the principles of dynamic linear models, our approach develops a random partition model capable of linking data partitions to previous ones over time, using a straightforward Markov structure that accounts for temporal persistence and facilitates changepoint detection. The selection of changepoints involves multiple dependent decisions, and we address this time-dependence by adopting a non-marginal false discovery rate control. 
This leads to a simple decision rule that ensures more stringent control of the false discovery rate compared to approaches that do not consider dependence. The method is efficiently implemented using a Gibbs sampling algorithm, leading to a straightforward approach compared to existing methods for dependent random partition models. Additionally, we show how the proposed method can be adapted to handle multi-view clustering scenarios. Simulation studies and the analysis of a human gesture phase dataset collected through various sensing technologies show the effectiveness of the method in clustering multivariate time series and detecting changepoints."}, "https://arxiv.org/abs/2407.19092": {"title": "Boosted generalized normal distributions: Integrating machine learning with operations knowledge", "link": "https://arxiv.org/abs/2407.19092", "description": "arXiv:2407.19092v1 Announce Type: cross \nAbstract: Applications of machine learning (ML) techniques to operational settings often face two challenges: i) ML methods mostly provide point predictions whereas many operational problems require distributional information; and ii) They typically do not incorporate the extensive body of knowledge in the operations literature, particularly the theoretical and empirical findings that characterize specific distributions. We introduce a novel and rigorous methodology, the Boosted Generalized Normal Distribution ($b$GND), to address these challenges. The Generalized Normal Distribution (GND) encompasses a wide range of parametric distributions commonly encountered in operations, and $b$GND leverages gradient boosting with tree learners to flexibly estimate the parameters of the GND as functions of covariates. We establish $b$GND's statistical consistency, thereby extending this key property to special cases studied in the ML literature that lacked such guarantees. Using data from a large academic emergency department in the United States, we show that the distributional forecasting of patient wait and service times can be meaningfully improved by leveraging findings from the healthcare operations literature. Specifically, $b$GND performs 6% and 9% better than the distribution-agnostic ML benchmark used to forecast wait and service times respectively. Further analysis suggests that these improvements translate into a 9% increase in patient satisfaction and a 4% reduction in mortality for myocardial infarction patients. Our work underscores the importance of integrating ML with operations knowledge to enhance distributional forecasts."}, "https://arxiv.org/abs/2407.19191": {"title": "Network sampling based inference for subgraph counts and clustering coefficient in a Stochastic Block Model framework with some extensions to a sparse case", "link": "https://arxiv.org/abs/2407.19191", "description": "arXiv:2407.19191v1 Announce Type: cross \nAbstract: Sampling is frequently used to collect data from large networks. In this article we provide valid asymptotic prediction intervals for subgraph counts and clustering coefficient of a population network when a network sampling scheme is used to observe the population. The theory is developed under a model based framework, where it is assumed that the population network is generated by a Stochastic Block Model (SBM). 
We study the effects of induced and ego-centric network formation, following the initial selection of nodes by Bernoulli sampling, and establish asymptotic normality of sample based subgraph count and clustering coefficient statistic under both network formation methods. The asymptotic results are developed under a joint design and model based approach, where the effect of sampling design is not ignored. In case of the sample based clustering coefficient statistic, we find that a bias correction is required in the ego-centric case, but there is no such bias in the induced case. We also extend the asymptotic normality results for estimated subgraph counts to a mildly sparse SBM framework, where edge probabilities decay to zero at a slow rate. In this sparse setting we find that the scaling and the maximum allowable decay rate for edge probabilities depend on the choice of the target subgraph. We obtain an expression for this maximum allowable decay rate and our results suggest that the rate becomes slower if the target subgraph has more edges in a certain sense. The simulation results suggest that the proposed prediction intervals have excellent coverage, even when the node selection probability is small and unknown SBM parameters are replaced by their estimates. Finally, the proposed methodology is applied to a real data set."}, "https://arxiv.org/abs/2407.19399": {"title": "Large-scale Multiple Testing of Cross-covariance Functions with Applications to Functional Network Models", "link": "https://arxiv.org/abs/2407.19399", "description": "arXiv:2407.19399v1 Announce Type: cross \nAbstract: The estimation of functional networks through functional covariance and graphical models have recently attracted increasing attention in settings with high dimensional functional data, where the number of functional variables p is comparable to, and maybe larger than, the number of subjects. In this paper, we first reframe the functional covariance model estimation as a tuning-free problem of simultaneously testing p(p-1)/2 hypotheses for cross-covariance functions. Our procedure begins by constructing a Hilbert-Schmidt-norm-based test statistic for each pair, and employs normal quantile transformations for all test statistics, upon which a multiple testing step is proposed. We then explore the multiple testing procedure under a general error-contamination framework and establish that our procedure can control false discoveries asymptotically. Additionally, we demonstrate that our proposed methods for two concrete examples: the functional covariance model with partial observations and, importantly, the more challenging functional graphical model, can be seamlessly integrated into the general error-contamination framework, and, with verifiable conditions, achieve theoretical guarantees on effective false discovery control. Finally, we showcase the superiority of our proposals through extensive simulations and functional connectivity analysis of two neuroimaging datasets."}, "https://arxiv.org/abs/2407.19613": {"title": "Causal effect estimation under network interference with mean-field methods", "link": "https://arxiv.org/abs/2407.19613", "description": "arXiv:2407.19613v1 Announce Type: cross \nAbstract: We study causal effect estimation from observational data under interference. The interference pattern is captured by an observed network. We adopt the chain graph framework of Tchetgen Tchetgen et. al. 
(2021), which allows (i) interaction among the outcomes of distinct study units connected along the graph and (ii) long range interference, whereby the outcome of an unit may depend on the treatments assigned to distant units connected along the interference network. For ``mean-field\" interaction networks, we develop a new scalable iterative algorithm to estimate the causal effects. For gaussian weighted networks, we introduce a novel causal effect estimation algorithm based on Approximate Message Passing (AMP). Our algorithms are provably consistent under a ``high-temperature\" condition on the underlying model. We estimate the (unknown) parameters of the model from data using maximum pseudo-likelihood and establish $\\sqrt{n}$-consistency of this estimator in all parameter regimes. Finally, we prove that the downstream estimators obtained by plugging in estimated parameters into the aforementioned algorithms are consistent at high-temperature. Our methods can accommodate dense interactions among the study units -- a setting beyond reach using existing techniques. Our algorithms originate from the study of variational inference approaches in high-dimensional statistics; overall, we demonstrate the usefulness of these ideas in the context of causal effect estimation under interference."}, "https://arxiv.org/abs/2407.19688": {"title": "Causal Interventional Prediction System for Robust and Explainable Effect Forecasting", "link": "https://arxiv.org/abs/2407.19688", "description": "arXiv:2407.19688v1 Announce Type: cross \nAbstract: Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect prediction task and further establish a causal graph based on treatment, adjustment variable, confounder, and outcome. Correspondingly, we design a causal interventional prediction system (CIPS) based on a variational autoencoder and fully conditional specification of multiple imputations. Extensive results demonstrate the superiority of our system over state-of-the-art methods and show remarkable versatility and extensibility in practice."}, "https://arxiv.org/abs/2407.19932": {"title": "Testing for the Asymmetric Optimal Hedge Ratios: With an Application to Bitcoin", "link": "https://arxiv.org/abs/2407.19932", "description": "arXiv:2407.19932v1 Announce Type: cross \nAbstract: Reducing financial risk is of paramount importance to investors, financial institutions, and corporations. Since the pioneering contribution of Johnson (1960), the optimal hedge ratio based on futures is regularly utilized. The current paper suggests an explicit and efficient method for testing the null hypothesis of a symmetric optimal hedge ratio against an asymmetric alternative one within a multivariate setting. If the null is rejected, the position dependent optimal hedge ratios can be estimated via the suggested model. This approach is expected to enhance the accuracy of the implemented hedging strategies compared to the standard methods since it accounts for the fact that the source of risk depends on whether the investor is a buyer or a seller of the risky asset. An application is provided using spot and futures prices of Bitcoin. 
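The object under test in the hedging paper above (arXiv:2407.19932), whose empirical findings continue below, is the minimum-variance hedge ratio $h = \mathrm{Cov}(\Delta s, \Delta f)/\mathrm{Var}(\Delta f)$ for spot returns hedged with futures returns. A rough illustration of a position-dependent version, crudely splitting the sample by the sign of the futures return rather than using the paper's multivariate testing framework, on simulated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)

# placeholder spot/futures return series (in practice: Bitcoin spot and futures)
T = 2_000
f = rng.standard_t(df=4, size=T) * 0.02           # futures returns
s = 0.9 * f + rng.normal(scale=0.01, size=T)      # spot returns

def hedge_ratio(spot, fut):
    """Minimum-variance hedge ratio Cov(spot, fut) / Var(fut)."""
    return np.cov(spot, fut)[0, 1] / np.var(fut, ddof=1)

h_all = hedge_ratio(s, f)
h_up = hedge_ratio(s[f > 0], f[f > 0])     # crude proxy for one position/regime
h_down = hedge_ratio(s[f < 0], f[f < 0])   # crude proxy for the other
print(f"symmetric h = {h_all:.3f}, up-market h = {h_up:.3f}, down-market h = {h_down:.3f}")
```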
The results strongly support the view that the optimal hedge ratio for this cryptocurrency is position dependent. The investor that is long in Bitcoin has a much higher conditional optimal hedge ratio compared to the one that is short in the asset. The difference between the two conditional optimal hedge ratios is statistically significant, which has important repercussions for implementing risk management strategies."}, "https://arxiv.org/abs/2112.11870": {"title": "Bag of DAGs: Inferring Directional Dependence in Spatiotemporal Processes", "link": "https://arxiv.org/abs/2112.11870", "description": "arXiv:2112.11870v5 Announce Type: replace \nAbstract: We propose a class of nonstationary processes to characterize space- and time-varying directional associations in point-referenced data. We are motivated by spatiotemporal modeling of air pollutants in which local wind patterns are key determinants of the pollutant spread, but information regarding prevailing wind directions may be missing or unreliable. We propose to map a discrete set of wind directions to edges in a sparse directed acyclic graph (DAG), accounting for uncertainty in directional correlation patterns across a domain. The resulting Bag of DAGs processes (BAGs) lead to interpretable nonstationarity and scalability for large data due to sparsity of DAGs in the bag. We outline Bayesian hierarchical models using BAGs and illustrate inferential and performance gains of our methods compared to other state-of-the-art alternatives. We analyze fine particulate matter using high-resolution data from low-cost air quality sensors in California during the 2020 wildfire season. An R package is available on GitHub."}, "https://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "https://arxiv.org/abs/2208.09638", "description": "arXiv:2208.09638v3 Announce Type: replace \nAbstract: What is the purpose of pre-analysis plans, and how should they be designed? We model the interaction between an agent who analyzes data and a principal who makes a decision based on agent reports. The agent could be the manufacturer of a new drug, and the principal a regulator deciding whether the drug is approved. Or the agent could be a researcher submitting a research paper, and the principal an editor deciding whether it is published. The agent decides which statistics to report to the principal. The principal cannot verify whether the analyst reported selectively. Absent a pre-analysis message, if there are conflicts of interest, then many desirable decision rules cannot be implemented. Allowing the agent to send a message before seeing the data increases the set of decision rules that can be implemented, and allows the principal to leverage agent expertise. The optimal mechanisms that we characterize require pre-analysis plans. Applying these results to hypothesis testing, we show that optimal rejection rules pre-register a valid test, and make worst-case assumptions about unreported statistics. 
Optimal tests can be found as a solution to a linear-programming problem."}, "https://arxiv.org/abs/2306.14019": {"title": "Instrumental Variable Approach to Estimating Individual Causal Effects in N-of-1 Trials: Application to ISTOP Study", "link": "https://arxiv.org/abs/2306.14019", "description": "arXiv:2306.14019v2 Announce Type: replace \nAbstract: An N-of-1 trial is a multiple crossover trial conducted in a single individual to provide evidence to directly inform personalized treatment decisions. Advancements in wearable devices greatly improved the feasibility of adopting these trials to identify optimal individual treatment plans, particularly when treatments differ among individuals and responses are highly heterogeneous. Our work was motivated by the I-STOP-AFib Study, which examined the impact of different triggers on atrial fibrillation (AF) occurrence. We described a causal framework for 'N-of-1' trial using potential treatment selection paths and potential outcome paths. Two estimands of individual causal effect were defined:(a) the effect of continuous exposure, and (b) the effect of an individual observed behavior. We addressed three challenges: (a) imperfect compliance to the randomized treatment assignment; (b) binary treatments and binary outcomes which led to the 'non-collapsibility' issue of estimating odds ratios; and (c) serial inference in the longitudinal observations. We adopted the Bayesian IV approach where the study randomization was the IV as it impacted the choice of exposure of a subject but not directly the outcome. Estimations were through a system of two parametric Bayesian models to estimate the individual causal effect. Our model got around the non-collapsibility and non-consistency by modeling the confounding mechanism through latent structural models and by inferring with Bayesian posterior of functionals. Autocorrelation present in the repeated measurements was also accounted for. The simulation study showed our method largely reduced bias and greatly improved the coverage of the estimated causal effect, compared to existing methods (ITT, PP, and AT). We applied the method to I-STOP-AFib Study to estimate the individual effect of alcohol on AF occurrence."}, "https://arxiv.org/abs/2307.16048": {"title": "Structural restrictions in local causal discovery: identifying direct causes of a target variable", "link": "https://arxiv.org/abs/2307.16048", "description": "arXiv:2307.16048v2 Announce Type: replace \nAbstract: We consider the problem of learning a set of direct causes of a target variable from an observational joint distribution. Learning directed acyclic graphs (DAGs) that represent the causal structure is a fundamental problem in science. Several results are known when the full DAG is identifiable from the distribution, such as assuming a nonlinear Gaussian data-generating process. Here, we are only interested in identifying the direct causes of one target variable (local causal structure), not the full DAG. This allows us to relax the identifiability assumptions and develop possibly faster and more robust algorithms. In contrast to the Invariance Causal Prediction framework, we only assume that we observe one environment without any interventions. We discuss different assumptions for the data-generating process of the target variable under which the set of direct causes is identifiable from the distribution. While doing so, we put essentially no assumptions on the variables other than the target variable. 
In addition to the novel identifiability results, we provide two practical algorithms for estimating the direct causes from a finite random sample and demonstrate their effectiveness on several benchmark and real datasets."}, "https://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "https://arxiv.org/abs/2310.05646", "description": "arXiv:2310.05646v4 Announce Type: replace \nAbstract: We study transfer learning for estimating piecewise-constant signals when source data, which may be relevant but disparate, are available in addition to the target data. We first investigate transfer learning estimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for unisource data scenarios and then generalise these estimators to accommodate multisources. To further reduce estimation errors, especially when some sources significantly differ from the target, we introduce an informative source selection algorithm. We then examine these estimators with multisource selection and establish their minimax optimality. Unlike the common narrative in the transfer learning literature that the performance is enhanced through large source sample sizes, our approaches leverage higher observation frequencies and accommodate diverse frequencies across multiple sources. Our theoretical findings are supported by extensive numerical experiments, with the code available online, see https://github.com/chrisfanwang/transferlearning"}, "https://arxiv.org/abs/2310.09597": {"title": "Adaptive maximization of social welfare", "link": "https://arxiv.org/abs/2310.09597", "description": "arXiv:2310.09597v2 Announce Type: replace \nAbstract: We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation. We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of $T^{2/3}$. This implies that (i) welfare maximization is harder than the multi-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for continuous policy sets), using a dyadic search algorithm. We analyze an extension to nonlinear income taxation, and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier), and price setting for bilateral trade (which is harder)."}, "https://arxiv.org/abs/2312.02404": {"title": "Nonparametric Bayesian Adjustment of Unmeasured Confounders in Cox Proportional Hazards Models", "link": "https://arxiv.org/abs/2312.02404", "description": "arXiv:2312.02404v3 Announce Type: replace \nAbstract: In observational studies, unmeasured confounders present a crucial challenge in accurately estimating desired causal effects. To calculate the hazard ratio (HR) in Cox proportional hazard models for time-to-event outcomes, two-stage residual inclusion and limited information maximum likelihood are typically employed. However, these methods are known to entail difficulty in terms of potential bias of HR estimates and parameter identification. 
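The welfare-maximization paper above (arXiv:2310.09597) builds on a variant of the Exp3 algorithm for adversarial bandits. For reference, a minimal sketch of standard Exp3 over a finite policy set is shown below; the welfare-specific variant, its $T^{2/3}$ analysis, and the reward model are not reproduced, and the Bernoulli rewards here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)

K, T, gamma = 5, 10_000, 0.05              # arms, rounds, exploration rate
true_means = np.linspace(0.2, 0.7, K)      # placeholder reward means in [0, 1]

weights = np.ones(K)
pulls = np.zeros(K, dtype=int)
for _ in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = rng.choice(K, p=probs)
    reward = rng.binomial(1, true_means[arm])      # observed reward in [0, 1]
    x_hat = reward / probs[arm]                    # importance-weighted estimate
    weights[arm] *= np.exp(gamma * x_hat / K)      # exponential-weights update
    weights /= weights.max()                       # rescale for numerical stability
    pulls[arm] += 1

print("pulls per arm:", pulls, "| best arm by true mean:", true_means.argmax())
```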
This study introduces a novel nonparametric Bayesian method designed to estimate an unbiased HR, addressing concerns that previous research methods have had. Our proposed method consists of two phases: 1) detecting clusters based on the likelihood of the exposure and outcome variables, and 2) estimating the hazard ratio within each cluster. Although it is implicitly assumed that unmeasured confounders affect outcomes through cluster effects, our algorithm is well-suited for such data structures. The proposed Bayesian estimator has good performance compared with some competitors."}, "https://arxiv.org/abs/2312.03268": {"title": "Design-based inference for generalized network experiments with stochastic interventions", "link": "https://arxiv.org/abs/2312.03268", "description": "arXiv:2312.03268v2 Announce Type: replace \nAbstract: A growing number of researchers are conducting randomized experiments to analyze causal relationships in network settings where units influence one another. A dominant methodology for analyzing these experiments is design-based, leveraging random treatment assignments as the basis for inference. In this paper, we generalize this design-based approach to accommodate complex experiments with a variety of causal estimands and different target populations. An important special case of such generalized network experiments is a bipartite network experiment, in which treatment is randomized among one set of units, and outcomes are measured on a separate set of units. We propose a broad class of causal estimands based on stochastic interventions for generalized network experiments. Using a design-based approach, we show how to estimate these causal quantities without bias and develop conservative variance estimators. We apply our methodology to a randomized experiment in education where participation in an anti-conflict promotion program is randomized among selected students. Our analysis estimates the causal effects of treating each student or their friends among different target populations in the network. We find that the program improves the overall conflict awareness among students but does not significantly reduce the total number of such conflicts."}, "https://arxiv.org/abs/2312.14086": {"title": "A Bayesian approach to functional regression: theory and computation", "link": "https://arxiv.org/abs/2312.14086", "description": "arXiv:2312.14086v2 Announce Type: replace \nAbstract: We propose a novel Bayesian methodology for inference in functional linear and logistic regression models based on the theory of reproducing kernel Hilbert spaces (RKHS's). We introduce general models that build upon the RKHS generated by the covariance function of the underlying stochastic process, and whose formulation includes as particular cases all finite-dimensional models based on linear combinations of marginals of the process, which can collectively be seen as a dense subspace made of simple approximations. By imposing a suitable prior distribution on this dense functional space we can perform data-driven inference via standard Bayes methodology, estimating the posterior distribution through reversible jump Markov chain Monte Carlo methods. In this context, our contribution is two-fold. First, we derive a theoretical result that guarantees posterior consistency, based on an application of a classic theorem of Doob to our RKHS setting. 
Second, we show that several prediction strategies stemming from our Bayesian procedure are competitive against other usual alternatives in both simulations and real data sets, including a Bayesian-motivated variable selection method."}, "https://arxiv.org/abs/2312.17566": {"title": "Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing", "link": "https://arxiv.org/abs/2312.17566", "description": "arXiv:2312.17566v2 Announce Type: replace \nAbstract: Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are contingent on prior assumptions that may be disputed. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. The rate converges pointwise as the sample size grows and, under some conditions, uniformly. The `Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore its benefits, including post-hoc variable selection, and limitations, including finite-sample inflation, through a Mendelian randomization study and simulations comparing approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure and e-values."}, "https://arxiv.org/abs/2202.04796": {"title": "The Transfer Performance of Economic Models", "link": "https://arxiv.org/abs/2202.04796", "description": "arXiv:2202.04796v4 Announce Type: replace-cross \nAbstract: Economists often estimate models using data from a particular domain, e.g. estimating risk preferences in a particular subject pool or for a specific class of lotteries. Whether a model's predictions extrapolate well across domains depends on whether the estimated model has captured generalizable structure. We provide a tractable formulation for this \"out-of-domain\" prediction problem and define the transfer error of a model based on how well it performs on data from a new domain. We derive finite-sample forecast intervals that are guaranteed to cover realized transfer errors with a user-selected probability when domains are iid, and use these intervals to compare the transferability of economic models and black box algorithms for predicting certainty equivalents. We find that in this application, the black box algorithms we consider outperform standard economic models when estimated and tested on data from the same domain, but the economic models generalize across domains better than the black-box algorithms do."}, "https://arxiv.org/abs/2407.20386": {"title": "On the power properties of inference for parameters with interval identified sets", "link": "https://arxiv.org/abs/2407.20386", "description": "arXiv:2407.20386v1 Announce Type: new \nAbstract: This paper studies a specific inference problem for a partially-identified parameter of interest with an interval identified set. We consider the favorable situation in which a researcher has two possible estimators to construct the confidence interval proposed in Imbens and Manski (2004) and Stoye (2009), and one is more efficient than the other. While the literature shows that both estimators deliver asymptotically exact confidence intervals for the parameter of interest, their inference in terms of statistical power is not compared. 
One would expect that using the more efficient estimator would result in more powerful inference. We formally prove this result."}, "https://arxiv.org/abs/2407.20491": {"title": "High dimensional inference for extreme value indices", "link": "https://arxiv.org/abs/2407.20491", "description": "arXiv:2407.20491v1 Announce Type: new \nAbstract: When applying multivariate extreme values statistics to analyze tail risk in compound events defined by a multivariate random vector, one often assumes that all dimensions share the same extreme value index. While such an assumption can be tested using a Wald-type test, the performance of such a test deteriorates as the dimensionality increases. This paper introduces a novel test for testing extreme value indices in a high dimensional setting. We show the asymptotic behavior of the test statistic and conduct simulation studies to evaluate its finite sample performance. The proposed test significantly outperforms existing methods in high dimensional settings. We apply this test to examine two datasets previously assumed to have identical extreme value indices across all dimensions."}, "https://arxiv.org/abs/2407.20520": {"title": "Uncertainty Quantification under Noisy Constraints, with Applications to Raking", "link": "https://arxiv.org/abs/2407.20520", "description": "arXiv:2407.20520v1 Announce Type: new \nAbstract: We consider statistical inference problems under uncertain equality constraints, and provide asymptotically valid uncertainty estimates for inferred parameters. The proposed approach leverages the implicit function theorem and primal-dual optimality conditions for a particular problem class. The motivating application is multi-dimensional raking, where observations are adjusted to match marginals; for example, adjusting estimated deaths across race, county, and cause in order to match state all-race all-cause totals. We review raking from a convex optimization perspective, providing explicit primal-dual formulations, algorithms, and optimality conditions for a wide array of raking applications, which are then leveraged to obtain the uncertainty estimates. Empirical results show that the approach obtains, at the cost of a single solve, nearly the same uncertainty estimates as computationally intensive Monte Carlo techniques that pass thousands of observed and of marginal draws through the entire raking process."}, "https://arxiv.org/abs/2407.20580": {"title": "Laplace approximation for Bayesian variable selection via Le Cam's one-step procedure", "link": "https://arxiv.org/abs/2407.20580", "description": "arXiv:2407.20580v1 Announce Type: new \nAbstract: Variable selection in high-dimensional spaces is a pervasive challenge in contemporary scientific exploration and decision-making. However, existing approaches that are known to enjoy strong statistical guarantees often struggle to cope with the computational demands arising from the high dimensionality. To address this issue, we propose a novel Laplace approximation method based on Le Cam's one-step procedure (\\textsf{OLAP}), designed to effectively tackles the computational burden. Under some classical high-dimensional assumptions we show that \\textsf{OLAP} is a statistically consistent variable selection procedure. Furthermore, we show that the approach produces a posterior distribution that can be explored in polynomial time using a simple Gibbs sampling algorithm. 
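As background for the raking application above (arXiv:2407.20520): raking adjusts a table of estimates so that its row and column sums match known marginal totals, classically via iterative proportional fitting. The paper's contribution, propagating uncertainty through the primal-dual optimality conditions, is not sketched here; below is only the plain two-margin raking step on made-up numbers.

```python
import numpy as np

def rake_2d(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting: rescale rows and columns in turn
    until both sets of marginals are matched (targets must share one total)."""
    x = table.astype(float).copy()
    for _ in range(max_iter):
        x *= (row_targets / x.sum(axis=1))[:, None]   # match row sums
        x *= (col_targets / x.sum(axis=0))[None, :]   # match column sums
        if np.allclose(x.sum(axis=1), row_targets, atol=tol):
            break
    return x

# e.g. deaths estimated by county (rows) and cause (columns), adjusted to
# match known county totals and cause totals (all numbers are made up)
est = np.array([[10., 20., 5.],
                [30., 10., 25.]])
raked = rake_2d(est, row_targets=np.array([40., 70.]),
                col_targets=np.array([45., 35., 30.]))
print(raked.round(2), raked.sum(axis=1), raked.sum(axis=0))
```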
Toward that polynomial complexity result, we also made some general, noteworthy contributions to the mixing time analysis of Markov chains. We illustrate the method using logistic and Poisson regression models applied to simulated and real data examples."}, "https://arxiv.org/abs/2407.20683": {"title": "An online generalization of the e-BH procedure", "link": "https://arxiv.org/abs/2407.20683", "description": "arXiv:2407.20683v1 Announce Type: new \nAbstract: In online multiple testing, hypotheses arrive one by one over time, and at each time a decision on the current hypothesis must be made solely based on the data and hypotheses observed so far. In this paper we relax this setup by allowing initially accepted hypotheses to be rejected due to information gathered at later steps. We propose online e-BH, an online ARC (online with acceptance-to-rejection changes) version of the e-BH procedure. Online e-BH is the first nontrivial online procedure which provably controls the FDR at data-adaptive stopping times and under arbitrary dependence between the test statistics. Online e-BH uniformly improves e-LOND, the existing method for e-value based online FDR control. In addition, we introduce new boosting techniques for online e-BH to increase the power in the case of locally dependent e-values. Furthermore, based on the same proof technique as used for online e-BH, we show that all existing online procedures with valid FDR control under arbitrary dependence also control the FDR at data-adaptive stopping times."}, "https://arxiv.org/abs/2407.20738": {"title": "A Local Modal Outer-Product-Gradient Estimator for Dimension Reduction", "link": "https://arxiv.org/abs/2407.20738", "description": "arXiv:2407.20738v1 Announce Type: new \nAbstract: Sufficient dimension reduction (SDR) is a valuable approach for handling high-dimensional data. Outer Product Gradient (OPG) is a popular approach. However, because it focuses on the mean regression function, OPG may ignore some directions of the central subspace (CS) when the distribution of errors is symmetric about zero. The mode of a distribution can provide an important summary of data. A Local Modal OPG (LMOPG) estimator and its algorithm, based on mode regression, are proposed to estimate the basis of the CS when the error distribution is skewed. The estimator is shown to be consistent and asymptotically normal under some mild conditions. Monte Carlo simulation is used to evaluate the performance and demonstrate the efficiency and robustness of the proposed method."}, "https://arxiv.org/abs/2407.20796": {"title": "Linear mixed modelling of federated data when only the mean, covariance, and sample size are available", "link": "https://arxiv.org/abs/2407.20796", "description": "arXiv:2407.20796v1 Announce Type: new \nAbstract: In medical research, individual-level patient data provide invaluable information, but the patients' right to confidentiality remains of utmost priority. This poses a huge challenge when estimating statistical models such as linear mixed models, which are an extension of linear regression models that can account for potential heterogeneity whenever data come from different data providers. Federated learning algorithms tackle this hurdle by estimating parameters without retrieving individual-level data. Instead, iterative communication of parameter estimate updates between the data providers and the analyst is required. In this paper, we propose an alternative framework to federated learning algorithms for fitting linear mixed models. 
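The base e-BH procedure that the online paper above (arXiv:2407.20683) generalizes has a compact description: with $m$ e-values at level $\alpha$, reject the hypotheses with the $\hat{k}$ largest e-values, where $\hat{k}$ is the largest $k$ whose $k$-th largest e-value is at least $m/(k\alpha)$. A small sketch of this base procedure (the online ARC version and the boosting techniques are not implemented here):

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """Base e-BH: reject the k-hat hypotheses with the largest e-values,
    where k-hat = max{k : e_(k) >= m / (k * alpha)}, e_(k) = k-th largest."""
    e = np.asarray(e_values, dtype=float)
    m = e.size
    order = np.argsort(-e)                    # indices by decreasing e-value
    sorted_e = e[order]
    ks = np.arange(1, m + 1)
    ok = sorted_e >= m / (ks * alpha)
    if not ok.any():
        return np.array([], dtype=int)        # no rejections
    k_hat = ks[ok].max()
    return np.sort(order[:k_hat])             # indices of rejected hypotheses

# toy example: a few large e-values among mostly uninformative ones
e_vals = [200., 1.1, 0.7, 95., 0.9, 3.0, 150., 1.0, 0.5, 0.8]
print(e_bh(e_vals, alpha=0.05))               # rejects indices 0, 3, 6
```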
Specifically, our approach only requires the mean, covariance, and sample size of multiple covariates from different data providers once. Using the principle of statistical sufficiency within the framework of likelihood as theoretical support, this proposed framework achieves estimates identical to those derived from actual individual-level data. We demonstrate this approach through real data on 15 068 patient records from 70 clinics at the Children's Hospital of Pennsylvania (CHOP). Assuming that each clinic only shares summary statistics once, we model the COVID-19 PCR test cycle threshold as a function of patient information. Simplicity, communication efficiency, and wider scope of implementation in any statistical software distinguish our approach from existing strategies in the literature."}, "https://arxiv.org/abs/2407.20819": {"title": "Design and inference for multi-arm clinical trials with informational borrowing: the interacting urns design", "link": "https://arxiv.org/abs/2407.20819", "description": "arXiv:2407.20819v1 Announce Type: new \nAbstract: This paper deals with a new design methodology for stratified comparative experiments based on interacting reinforced urn systems. The key idea is to model the interaction between urns for borrowing information across strata and to use it in the design phase in order to i) enhance the information exchange at the beginning of the study, when only few subjects have been enrolled and the stratum-specific information on treatments' efficacy could be scarce, ii) let the information sharing adaptively evolves via a reinforcement mechanism based on the observed outcomes, for skewing at each step the allocations towards the stratum-specific most promising treatment and iii) make the contribution of the strata with different treatment efficacy vanishing as the stratum information grows. In particular, we introduce the Interacting Urns Design, namely a new Covariate-Adjusted Response-Adaptive procedure, that randomizes the treatment allocations according to the evolution of the urn system. The theoretical properties of this proposal are described and the corresponding asymptotic inference is provided. Moreover, by a functional central limit theorem, we obtain the asymptotic joint distribution of the Wald-type sequential test statistics, which allows to sequentially monitor the suggested design in the clinical practice."}, "https://arxiv.org/abs/2407.20929": {"title": "ROC curve analysis for functional markers", "link": "https://arxiv.org/abs/2407.20929", "description": "arXiv:2407.20929v1 Announce Type: new \nAbstract: Functional markers become a more frequent tool in medical diagnosis. In this paper, we aim to define an index allowing to discriminate between populations when the observations are functional data belonging to a Hilbert space. We discuss some of the problems arising when estimating optimal directions defined to maximize the area under the curve of a projection index and we construct the corresponding ROC curve. We also go one step forward and consider the case of possibly different covariance operators, for which we recommend a quadratic discrimination rule. Consistency results are derived for both linear and quadratic indexes, under mild conditions. The results of our numerical experiments allow to see the advantages of the quadratic rule when the populations have different covariance operators. 
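The sufficiency idea behind the summary-statistics approach above (arXiv:2407.20796) is easy to see in its simplest special case: if each site shares only its sample size and the mean and covariance of the stacked vector (response, covariates), the pooled cross-product matrices, and hence the pooled least-squares fit, can be rebuilt exactly. The sketch below covers only this fixed-effects case, not the full mixed-model estimation of the paper; the three simulated 'clinics' are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

def site_summaries(y, X):
    """What each site shares: n, and the mean and covariance of z = (y, x1, ..., xp)."""
    Z = np.column_stack([y, X])
    return Z.shape[0], Z.mean(axis=0), np.cov(Z, rowvar=False)

def pooled_ols(summaries):
    """Rebuild pooled sums of squares and cross-products from the summaries and
    solve the least-squares normal equations (with an intercept)."""
    p = summaries[0][1].size                 # dimension of z
    ZtZ, Zsum, n_tot = np.zeros((p, p)), np.zeros(p), 0
    for n, mean, cov in summaries:
        ZtZ += (n - 1) * cov + n * np.outer(mean, mean)   # sum of z z'
        Zsum += n * mean
        n_tot += n
    # block out y (index 0) from the covariates and add an intercept column
    XtX = np.zeros((p, p))
    XtX[0, 0] = n_tot
    XtX[0, 1:] = XtX[1:, 0] = Zsum[1:]
    XtX[1:, 1:] = ZtZ[1:, 1:]
    Xty = np.concatenate([[Zsum[0]], ZtZ[1:, 0]])
    return np.linalg.solve(XtX, Xty)         # (intercept, slopes)

# three simulated 'clinics' generating data from the same linear model
sites = []
for n in (200, 350, 150):
    X = rng.normal(size=(n, 2))
    y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)
    sites.append(site_summaries(y, X))
print(np.round(pooled_ols(sites), 3))        # close to (1.0, 2.0, -0.5)
```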
We also illustrate the considered methods on a real data set."}, "https://arxiv.org/abs/2407.20995": {"title": "Generalized Multivariate Functional Additive Mixed Models for Location, Scale, and Shape", "link": "https://arxiv.org/abs/2407.20995", "description": "arXiv:2407.20995v1 Announce Type: new \nAbstract: We propose a flexible regression framework to model the conditional distribution of multilevel generalized multivariate functional data of potentially mixed type, e.g. binary and continuous data. We make pointwise parametric distributional assumptions for each dimension of the multivariate functional data and model each distributional parameter as an additive function of covariates. The dependency between the different outcomes and, for multilevel functional data, also between different functions within a level is modelled by shared latent multivariate Gaussian processes. For a parsimonious representation of the latent processes, (generalized) multivariate functional principal components are estimated from the data and used as an empirical basis for these latent processes in the regression framework. Our modular two-step approach is very general and can easily incorporate new developments in the estimation of functional principal components for all types of (generalized) functional data. Flexible additive covariate effects for scalar or even functional covariates are available and are estimated in a Bayesian framework. We provide an easy-to-use implementation in the accompanying R package 'gmfamm' on CRAN and conduct a simulation study to confirm the validity of our regression framework and estimation strategy. The proposed multivariate functional model is applied to four-dimensional traffic data in Berlin, which consists of the hourly numbers and mean speed of cars and trucks at different locations."}, "https://arxiv.org/abs/2407.20295": {"title": "Warped multifidelity Gaussian processes for data fusion of skewed environmental data", "link": "https://arxiv.org/abs/2407.20295", "description": "arXiv:2407.20295v1 Announce Type: cross \nAbstract: Understanding the dynamics of climate variables is paramount for numerous sectors, like energy and environmental monitoring. This study focuses on the critical need for a precise mapping of environmental variables for national or regional monitoring networks, a task notably challenging when dealing with skewed data. To address this issue, we propose a novel data fusion approach, the \\textit{warped multifidelity Gaussian process} (WMFGP). The method performs prediction using multiple time-series, accommodating varying reliability and resolutions and effectively handling skewness. In an extended simulation experiment, the benefits and limitations of the method are explored, while as a case study we focus on the wind speed monitored by the network of ARPA Lombardia, one of the regional environmental agencies operating in Italy. ARPA grapples with data gaps, and due to the connection between wind speed and air quality, it struggles with effective air quality management. We illustrate the efficacy of our approach in filling the wind speed data gaps through two extensive simulation experiments. 
The case study provides more informative wind speed predictions crucial for predicting air pollutant concentrations, enhancing network maintenance, and advancing understanding of relevant meteorological and climatic phenomena."}, "https://arxiv.org/abs/2407.20352": {"title": "Designing Time-Series Models With Hypernetworks & Adversarial Portfolios", "link": "https://arxiv.org/abs/2407.20352", "description": "arXiv:2407.20352v1 Announce Type: cross \nAbstract: This article describes the methods that achieved 4th and 6th place in the forecasting and investment challenges, respectively, of the M6 competition, ultimately securing the 1st place in the overall duathlon ranking. In the forecasting challenge, we tested a novel meta-learning model that utilizes hypernetworks to design a parametric model tailored to a specific family of forecasting tasks. This approach allowed us to leverage similarities observed across individual forecasting tasks while also acknowledging potential heterogeneity in their data generating processes. The model's training can be directly performed with backpropagation, eliminating the need for reliance on higher-order derivatives and is equivalent to a simultaneous search over the space of parametric functions and their optimal parameter values. The proposed model's capabilities extend beyond M6, demonstrating superiority over state-of-the-art meta-learning methods in the sinusoidal regression task and outperforming conventional parametric models on time-series from the M4 competition. In the investment challenge, we adjusted portfolio weights to induce greater or smaller correlation between our submission and that of other participants, depending on the current ranking, aiming to maximize the probability of achieving a good rank."}, "https://arxiv.org/abs/2407.20377": {"title": "Leveraging Natural Language and Item Response Theory Models for ESG Scoring", "link": "https://arxiv.org/abs/2407.20377", "description": "arXiv:2407.20377v1 Announce Type: cross \nAbstract: This paper explores an innovative approach to Environmental, Social, and Governance (ESG) scoring by integrating Natural Language Processing (NLP) techniques with Item Response Theory (IRT), specifically the Rasch model. The study utilizes a comprehensive dataset of news articles in Portuguese related to Petrobras, a major oil company in Brazil, collected from 2022 and 2023. The data is filtered and classified for ESG-related sentiments using advanced NLP methods. The Rasch model is then applied to evaluate the psychometric properties of these ESG measures, providing a nuanced assessment of ESG sentiment trends over time. The results demonstrate the efficacy of this methodology in offering a more precise and reliable measurement of ESG factors, highlighting significant periods and trends. This approach may enhance the robustness of ESG metrics and contribute to the broader field of sustainability and finance by offering a deeper understanding of the temporal dynamics in ESG reporting."}, "https://arxiv.org/abs/2407.20553": {"title": "DiffusionCounterfactuals: Inferring High-dimensional Counterfactuals with Guidance of Causal Representations", "link": "https://arxiv.org/abs/2407.20553", "description": "arXiv:2407.20553v1 Announce Type: cross \nAbstract: Accurate estimation of counterfactual outcomes in high-dimensional data is crucial for decision-making and understanding causal relationships and intervention outcomes in various domains, including healthcare, economics, and social sciences. 
However, existing methods often struggle to generate accurate and consistent counterfactuals, particularly when the causal relationships are complex. We propose a novel framework that incorporates causal mechanisms and diffusion models to generate high-quality counterfactual samples guided by causal representation. Our approach introduces a novel, theoretically grounded training and sampling process that enables the model to consistently generate accurate counterfactual high-dimensional data under multiple intervention steps. Experimental results on various synthetic and real benchmarks demonstrate the proposed approach outperforms state-of-the-art methods in generating accurate and high-quality counterfactuals, using different evaluation metrics."}, "https://arxiv.org/abs/2407.20700": {"title": "Industrial-Grade Smart Troubleshooting through Causal Technical Language Processing: a Proof of Concept", "link": "https://arxiv.org/abs/2407.20700", "description": "arXiv:2407.20700v1 Announce Type: cross \nAbstract: This paper describes the development of a causal diagnosis approach for troubleshooting an industrial environment on the basis of the technical language expressed in Return on Experience records. The proposed method leverages the vectorized linguistic knowledge contained in the distributed representation of a Large Language Model, and the causal associations entailed by the embedded failure modes and mechanisms of the industrial assets. The paper presents the elementary but essential concepts of the solution, which is conceived as a causality-aware retrieval augmented generation system, and illustrates them experimentally on a real-world Predictive Maintenance setting. Finally, it discusses avenues of improvement for the maturity of the utilized causal technology to meet the robustness challenges of increasingly complex scenarios in the industry."}, "https://arxiv.org/abs/2210.08346": {"title": "Inferring a population composition from survey data with nonignorable nonresponse: Borrowing information from external sources", "link": "https://arxiv.org/abs/2210.08346", "description": "arXiv:2210.08346v2 Announce Type: replace \nAbstract: We introduce a method to make inference on the composition of a heterogeneous population using survey data, accounting for the possibility that capture heterogeneity is related to key survey variables. To deal with nonignorable nonresponse, we combine different data sources and propose the use of Fisher's noncentral hypergeometric model in a Bayesian framework. To illustrate the potentialities of our methodology, we focus on a case study aimed at estimating the composition of the population of Italian graduates by their occupational status one year after graduating, stratifying by gender and degree program. We account for the possibility that surveys inquiring about the occupational status of new graduates may have response rates that depend on individuals' employment status, implying the nonignorability of the nonresponse. Our findings show that employed people are generally more inclined to answer the questionnaire. Neglecting the nonresponse bias in such contexts might lead to overestimating the employment rate."}, "https://arxiv.org/abs/2306.05498": {"title": "Monte Carlo inference for semiparametric Bayesian regression", "link": "https://arxiv.org/abs/2306.05498", "description": "arXiv:2306.05498v2 Announce Type: replace \nAbstract: Data transformations are essential for broad applicability of parametric regression models. 
However, for Bayesian analysis, joint inference of the transformation and model parameters typically involves restrictive parametric transformations or nonparametric representations that are computationally inefficient and cumbersome for implementation and theoretical analysis, which limits their usability in practice. This paper introduces a simple, general, and efficient strategy for joint posterior inference of an unknown transformation and all regression model parameters. The proposed approach directly targets the posterior distribution of the transformation by linking it with the marginal distributions of the independent and dependent variables, and then deploys a Bayesian nonparametric model via the Bayesian bootstrap. Crucially, this approach delivers (1) joint posterior consistency under general conditions, including multiple model misspecifications, and (2) efficient Monte Carlo (not Markov chain Monte Carlo) inference for the transformation and all parameters for important special cases. These tools apply across a variety of data domains, including real-valued, positive, and compactly-supported data. Simulation studies and an empirical application demonstrate the effectiveness and efficiency of this strategy for semiparametric Bayesian analysis with linear models, quantile regression, and Gaussian processes. The R package SeBR is available on CRAN."}, "https://arxiv.org/abs/2309.00706": {"title": "Causal Effect Estimation after Propensity Score Trimming with Continuous Treatments", "link": "https://arxiv.org/abs/2309.00706", "description": "arXiv:2309.00706v2 Announce Type: replace \nAbstract: Propensity score trimming, which discards subjects with propensity scores below a threshold, is a common way to address positivity violations that complicate causal effect estimation. However, most works on trimming assume treatment is discrete and models for the outcome regression and propensity score are parametric. This work proposes nonparametric estimators for trimmed average causal effects in the case of continuous treatments based on efficient influence functions. For continuous treatments, an efficient influence function for a trimmed causal effect does not exist, due to a lack of pathwise differentiability induced by trimming and a continuous treatment. Thus, we target a smoothed version of the trimmed causal effect for which an efficient influence function exists. Our resulting estimators exhibit doubly-robust style guarantees, with error involving products or squares of errors for the outcome regression and propensity score, which allows for valid inference even when nonparametric models are used. Our results allow the trimming threshold to be fixed or defined as a quantile of the propensity score, such that confidence intervals incorporate uncertainty involved in threshold estimation. These findings are validated via simulation and an application, thereby showing how to efficiently-but-flexibly estimate trimmed causal effects with continuous treatments."}, "https://arxiv.org/abs/2301.03894": {"title": "Location- and scale-free procedures for distinguishing between distribution tail models", "link": "https://arxiv.org/abs/2301.03894", "description": "arXiv:2301.03894v2 Announce Type: replace-cross \nAbstract: We consider distinguishing between two distribution tail models when tails of one model are lighter (or heavier) than those of the other. Two procedures are proposed: one scale-free and one location- and scale-free, and their asymptotic properties are established. 
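One computational ingredient of the semiparametric Bayesian regression paper above (arXiv:2306.05498) is the Bayesian bootstrap, which represents an unknown distribution by the observed points reweighted with flat-Dirichlet weights. A minimal sketch of Bayesian-bootstrap posterior draws of a CDF is given below; the paper's transformation-plus-regression machinery is not shown, and the data and evaluation grid are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

def bayesian_bootstrap_cdf_draws(x, grid, n_draws=1000):
    """Posterior draws of the CDF under the Bayesian bootstrap:
    weights ~ Dirichlet(1, ..., 1) on the observed points."""
    x = np.asarray(x)
    n = x.size
    indicators = (x[None, :] <= grid[:, None]).astype(float)   # grid x n
    weights = rng.dirichlet(np.ones(n), size=n_draws)           # draws x n
    return weights @ indicators.T                               # draws x grid

# toy data: posterior uncertainty about F at a few grid points
x = rng.gamma(shape=2.0, scale=1.0, size=200)
grid = np.array([0.5, 1.0, 2.0, 4.0])
draws = bayesian_bootstrap_cdf_draws(x, grid)
print("posterior mean of F(grid):", draws.mean(axis=0).round(3))
print("95% intervals:", np.quantile(draws, [0.025, 0.975], axis=0).round(3))
```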
We show the advantage of using these procedures for distinguishing between certain tail models in comparison with the tests proposed in the literature by simulation and apply them to data on daily precipitation in Green Bay, US and Saentis, Switzerland."}, "https://arxiv.org/abs/2305.01392": {"title": "Multi-Scale CUSUM Tests for Time Dependent Spherical Random Fields", "link": "https://arxiv.org/abs/2305.01392", "description": "arXiv:2305.01392v2 Announce Type: replace-cross \nAbstract: This paper investigates the asymptotic behavior of structural break tests in the harmonic domain for time dependent spherical random fields. In particular, we prove a functional central limit theorem result for the fluctuations over time of the sample spherical harmonic coefficients, under the null of isotropy and stationarity; furthermore, we prove consistency of the corresponding CUSUM test, under a broad range of alternatives, including deterministic trend, abrupt change, and a nontrivial power alternative. Our results are then applied to NCEP data on global temperature: our estimates suggest that Climate Change does not simply affect global average temperatures, but also the nature of spatial fluctuations at different scales."}, "https://arxiv.org/abs/2308.09869": {"title": "Symmetrisation of a class of two-sample tests by mutually considering depth ranks including functional spaces", "link": "https://arxiv.org/abs/2308.09869", "description": "arXiv:2308.09869v2 Announce Type: replace-cross \nAbstract: Statistical depth functions provide measures of the outlyingness, or centrality, of the elements of a space with respect to a distribution. It is a nonparametric concept applicable to spaces of any dimension, for instance, multivariate and functional. Liu and Singh (1993) presented a multivariate two-sample test based on depth-ranks. We dedicate this paper to improving the power of the associated test statistic and incorporating its applicability to functional data. In doing so, we obtain a more natural test statistic that is symmetric in both samples. We derive the null asymptotic of the proposed test statistic, also proving the validity of the testing procedure for functional data. Finally, the finite sample performance of the test for functional data is illustrated by means of a simulation study and a real data analysis on annual temperature curves of ocean drifters is executed."}, "https://arxiv.org/abs/2310.17546": {"title": "A changepoint approach to modelling non-stationary soil moisture dynamics", "link": "https://arxiv.org/abs/2310.17546", "description": "arXiv:2310.17546v2 Announce Type: replace-cross \nAbstract: Soil moisture dynamics provide an indicator of soil health that scientists model via drydown curves. The typical modelling process requires the soil moisture time series to be manually separated into drydown segments and then exponential decay models are fitted to them independently. Sensor development over recent years means that experiments that were previously conducted over a few field campaigns can now be scaled to months or years at a higher sampling rate. To better meet the challenge of increasing data size, this paper proposes a novel changepoint-based approach to automatically identify structural changes in the soil drying process and simultaneously estimate the drydown parameters that are of interest to soil scientists. A simulation study is carried out to demonstrate the performance of the method in detecting changes and retrieving model parameters. 
Practical aspects of the method such as adding covariates and penalty learning are discussed. The method is applied to hourly soil moisture time series from the NEON data portal to investigate the temporal dynamics of soil moisture drydown. We recover known relationships previously identified manually, alongside delivering new insights into the temporal variability across soil types and locations."}, "https://arxiv.org/abs/2407.21119": {"title": "Potential weights and implicit causal designs in linear regression", "link": "https://arxiv.org/abs/2407.21119", "description": "arXiv:2407.21119v1 Announce Type: new \nAbstract: When do linear regressions estimate causal effects in quasi-experiments? This paper provides a generic diagnostic that assesses whether a given linear regression specification on a given dataset admits a design-based interpretation. To do so, we define a notion of potential weights, which encode counterfactual decisions a given regression makes to unobserved potential outcomes. If the specification does admit such an interpretation, this diagnostic can find a vector of unit-level treatment assignment probabilities -- which we call an implicit design -- under which the regression estimates a causal effect. This diagnostic also finds the implicit causal effect estimand. Knowing the implicit design and estimand adds transparency, leads to further sanity checks, and opens the door to design-based statistical inference. When applied to regression specifications studied in the causal inference literature, our framework recovers and extends existing theoretical results. When applied to widely-used specifications not covered by existing causal inference literature, our framework generates new theoretical insights."}, "https://arxiv.org/abs/2407.21154": {"title": "Bayesian thresholded modeling for integrating brain node and network predictors", "link": "https://arxiv.org/abs/2407.21154", "description": "arXiv:2407.21154v1 Announce Type: new \nAbstract: Progress in neuroscience has provided unprecedented opportunities to advance our understanding of brain alterations and their correspondence to phenotypic profiles. With data collected from various imaging techniques, studies have integrated different types of information ranging from brain structure, function, or metabolism. More recently, an emerging way to categorize imaging traits is through a metric hierarchy, including localized node-level measurements and interactive network-level metrics. However, limited research has been conducted to integrate these different hierarchies and achieve a better understanding of the neurobiological mechanisms and communications. In this work, we address this literature gap by proposing a Bayesian regression model under both vector-variate and matrix-variate predictors. To characterize the interplay between different predicting components, we propose a set of biologically plausible prior models centered on an innovative joint thresholded prior. This captures the coupling and grouping effect of signal patterns, as well as their spatial contiguity across brain anatomy. By developing a posterior inference, we can identify and quantify the uncertainty of signaling node- and network-level neuromarkers, as well as their predictive mechanism for phenotypic outcomes. Through extensive simulations, we demonstrate that our proposed method outperforms the alternative approaches substantially in both out-of-sample prediction and feature selection. 
By implementing the model to study children's general mental abilities, we establish a powerful predictive mechanism based on the identified task contrast traits and resting-state sub-networks."}, "https://arxiv.org/abs/2407.21253": {"title": "An overview of methods for receiver operating characteristic analysis, with an application to SARS-CoV-2 vaccine-induced humoral responses in solid organ transplant recipients", "link": "https://arxiv.org/abs/2407.21253", "description": "arXiv:2407.21253v1 Announce Type: new \nAbstract: Receiver operating characteristic (ROC) analysis is a tool to evaluate the capacity of a numeric measure to distinguish between groups, often employed in the evaluation of diagnostic tests. Overall classification ability is sometimes crudely summarized by a single numeric measure such as the area under the empirical ROC curve. However, it may also be of interest to estimate the full ROC curve while leveraging assumptions regarding the nature of the data (parametric) or about the ROC curve directly (semiparametric). Although there has been recent interest in methods to conduct comparisons by way of stochastic ordering, nuances surrounding ROC geometry and estimation are not widely known in the broader scientific and statistical community. The overarching goals of this manuscript are to (1) provide an overview of existing frameworks for ROC curve estimation with examples, (2) offer intuition for and considerations regarding methodological trade-offs, and (3) supply sample R code to guide implementation. We utilize simulations to demonstrate the bias-variance trade-off across various methods. As an illustrative example, we analyze data from a recent cohort study in order to compare responses to SARS-CoV-2 vaccination between solid organ transplant recipients and healthy controls."}, "https://arxiv.org/abs/2407.21322": {"title": "Randomized Controlled Trials of Service Interventions: The Impact of Capacity Constraints", "link": "https://arxiv.org/abs/2407.21322", "description": "arXiv:2407.21322v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs), or experiments, are the gold standard for intervention evaluation. However, the main appeal of RCTs, the clean identification of causal effects, can be compromised by interference, when one subject's treatment assignment can influence another subject's behavior or outcomes. In this paper, we formalise and study a type of interference stemming from the operational implementation of a subclass of interventions we term Service Interventions (SIs): interventions that include an on-demand service component provided by a costly and limited resource (e.g., healthcare providers or teachers).\n We show that in such a system, the capacity constraints induce dependencies across experiment subjects, where an individual may need to wait before receiving the intervention. By modeling these dependencies using a queueing system, we show how increasing the number of subjects without increasing the capacity of the system can lead to a nonlinear decrease in the treatment effect size. This has implications for conventional power analysis and recruitment strategies: increasing the sample size of an RCT without appropriately expanding capacity can decrease the study's power. To address this issue, we propose a method to jointly select the system capacity and number of users using the square root staffing rule from queueing theory. 
We show how incorporating knowledge of the queueing structure can help an experimenter reduce the amount of capacity and number of subjects required while still maintaining high power. In addition, our analysis of congestion-driven interference provides one concrete mechanism to explain why similar protocols can result in different RCT outcomes and why promising interventions at the RCT stage may not perform well at scale."}, "https://arxiv.org/abs/2407.21407": {"title": "Deep Fr\\'echet Regression", "link": "https://arxiv.org/abs/2407.21407", "description": "arXiv:2407.21407v1 Announce Type: new \nAbstract: Advancements in modern science have led to the increasing availability of non-Euclidean data in metric spaces. This paper addresses the challenge of modeling relationships between non-Euclidean responses and multivariate Euclidean predictors. We propose a flexible regression model capable of handling high-dimensional predictors without imposing parametric assumptions. Two primary challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we demonstrate the feasibility of mapping the metric space where responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, employing local Fr\\'echet regression, to map the low-dimensional manifold representations back to objects in the original metric space. We develop a theoretical framework, investigating the convergence rate of deep neural networks under dependent sub-Gaussian noise with bias. The convergence rate of the proposed regression model is then obtained by expanding the scope of local Fr\\'echet regression to accommodate multivariate predictors in the presence of errors in predictors. Simulations and case studies show that the proposed model outperforms existing methods for non-Euclidean responses, focusing on the special cases of probability measures and networks."}, "https://arxiv.org/abs/2407.21588": {"title": "A Bayesian Bootstrap Approach for Dynamic Borrowing for Minimizing Mean Squared Error", "link": "https://arxiv.org/abs/2407.21588", "description": "arXiv:2407.21588v1 Announce Type: new \nAbstract: For dynamic borrowing to leverage external data to augment the control arm of small RCTs, the key step is determining the amount of borrowing based on the similarity of the outcomes in the controls from the trial and the external data sources. A simple approach for this task uses the empirical Bayesian approach, which maximizes the marginal likelihood (maxML) of the amount of borrowing, while a likelihood-independent alternative minimizes the mean squared error (minMSE). We consider two minMSE approaches that differ from each other in the way of estimating the parameters in the minMSE rule. The classical one adjusts for bias due to sample variance, which in some situations is equivalent to the maxML rule. We propose a simplified alternative without the variance adjustment, which has asymptotic properties partially similar to the maxML rule, leading to no borrowing if means of control outcomes from the two data sources are different and may have less bias than that of the maxML rule. In contrast, the maxML rule may lead to full borrowing even when two datasets are moderately different, which may not be a desirable property. 
For inference, we propose a Bayesian bootstrap (BB) based approach that takes into account the uncertainty of the estimated amount of borrowing and that of the pre-adjustment. The approach can also be used with a pre-adjustment on the external controls for population differences between the two data sources using, e.g., inverse probability weighting. The proposed approach is computationally efficient and is implemented via a simple algorithm. We conducted a simulation study to examine the properties of the proposed approach, including the coverage of 95% CIs based on the Bayesian bootstrapped posterior samples or on asymptotic normality. The approach is illustrated by an example of borrowing controls for an AML trial from another study."}, "https://arxiv.org/abs/2407.21682": {"title": "Shape-restricted transfer learning analysis for generalized linear regression model", "link": "https://arxiv.org/abs/2407.21682", "description": "arXiv:2407.21682v1 Announce Type: new \nAbstract: Transfer learning has emerged as a highly sought-after and actively pursued research area within the statistical community. The core concept of transfer learning involves leveraging insights and information from auxiliary datasets to enhance the analysis of the primary dataset of interest. In this paper, our focus is on datasets originating from distinct yet interconnected distributions. We assume that the training data conform to a standard generalized linear model, while the testing data exhibit a connection to the training data based on a prior probability shift assumption. Ultimately, we discover that the two-sample conditional means are interrelated through an unknown, nondecreasing function. We integrate the power of generalized estimating equations with the shape-restricted score function, creating a robust framework for improved inference regarding the underlying parameters. We theoretically establish the asymptotic properties of our estimator and demonstrate, through simulation studies, that our method yields more accurate parameter estimates compared to those based solely on the testing or training data. Finally, we apply our method to a real-world example."}, "https://arxiv.org/abs/2407.21695": {"title": "Unveiling land use dynamics: Insights from a hierarchical Bayesian spatio-temporal modelling of Compositional Data", "link": "https://arxiv.org/abs/2407.21695", "description": "arXiv:2407.21695v1 Announce Type: new \nAbstract: Changes in land use patterns have significant environmental and socioeconomic impacts, making it crucial for policymakers to understand their causes and consequences. This study, part of the European LAMASUS (Land Management for Sustainability) project, aims to support the EU's climate neutrality target by developing a governance model through collaboration between policymakers, land users, and researchers. We present a methodological synthesis for treating land use data using a Bayesian approach within spatial and spatio-temporal modeling frameworks.\n The study tackles the challenges of analyzing land use changes, particularly the presence of zero values and computational issues with large datasets. It introduces joint model structures to address zeros and employs sequential inference and consensus methods for Big Data problems.
Spatial downscaling models approximate smaller scales from aggregated data, circumventing high-resolution data complications.\n We explore Beta regression and Compositional Data Analysis (CoDa) for land use data, review relevant spatial and spatio-temporal models, and present strategies for handling zeros. The paper demonstrates the implementation of key models, downscaling techniques, and solutions to Big Data challenges with examples from simulated data and the LAMASUS project, providing a comprehensive framework for understanding and managing land use changes."}, "https://arxiv.org/abs/2407.21025": {"title": "Reinforcement Learning in High-frequency Market Making", "link": "https://arxiv.org/abs/2407.21025", "description": "arXiv:2407.21025v1 Announce Type: cross \nAbstract: This paper establishes a new and comprehensive theoretical analysis for the application of reinforcement learning (RL) in high-frequency market making. We bridge the modern RL theory and the continuous-time statistical models in high-frequency financial economics. Unlike most of the existing literature, which focuses on developing various RL methods for the market making problem, our work is a pilot study that provides theoretical analysis. We target the effects of sampling frequency, and find an interesting tradeoff between the error and the complexity of the RL algorithm when tweaking the values of the time increment $\\Delta$: as $\\Delta$ becomes smaller, the error will be smaller but the complexity will be larger. We also study the two-player case under the general-sum game framework and establish the convergence of the Nash equilibrium to the continuous-time game equilibrium as $\\Delta\\rightarrow0$. The Nash Q-learning algorithm, which is an online multi-agent RL method, is applied to solve the equilibrium. Our theories are not only useful for practitioners in choosing the sampling frequency, but also very general and applicable to other high-frequency financial decision making problems, e.g., optimal executions, as long as the time-discretization of a continuous-time Markov decision process is adopted. Monte Carlo simulation evidence supports all of our theories."}, "https://arxiv.org/abs/2407.21238": {"title": "Quantile processes and their applications in finite populations", "link": "https://arxiv.org/abs/2407.21238", "description": "arXiv:2407.21238v1 Announce Type: cross \nAbstract: The weak convergence of the quantile processes, which are constructed based on different estimators of the finite population quantiles, is shown under various well-known sampling designs based on a superpopulation model. The results related to the weak convergence of these quantile processes are applied to find asymptotic distributions of the smooth $L$-estimators and the estimators of smooth functions of finite population quantiles. Based on these asymptotic distributions, confidence intervals are constructed for several finite population parameters like the median, the $\\alpha$-trimmed means, the interquartile range and the quantile based measure of skewness. Comparisons of various estimators are carried out based on their asymptotic distributions. We show that the use of the auxiliary information in the construction of the estimators sometimes has an adverse effect on the performances of the smooth $L$-estimators and the estimators of smooth functions of finite population quantiles under several sampling designs.
Further, the performance of each of the above-mentioned estimators sometimes becomes worse under sampling designs, which use the auxiliary information, than their performances under simple random sampling without replacement (SRSWOR)."}, "https://arxiv.org/abs/2407.21456": {"title": "A Ball Divergence Based Measure For Conditional Independence Testing", "link": "https://arxiv.org/abs/2407.21456", "description": "arXiv:2407.21456v1 Announce Type: cross \nAbstract: In this paper we introduce a new measure of conditional dependence between two random vectors ${\\boldsymbol X}$ and ${\\boldsymbol Y}$ given another random vector $\\boldsymbol Z$ using the ball divergence. Our measure characterizes conditional independence and does not require any moment assumptions. We propose a consistent estimator of the measure using a kernel averaging technique and derive its asymptotic distribution. Using this statistic we construct two tests for conditional independence, one in the model-${\\boldsymbol X}$ framework and the other based on a novel local wild bootstrap algorithm. In the model-${\\boldsymbol X}$ framework, which assumes the knowledge of the distribution of ${\\boldsymbol X}|{\\boldsymbol Z}$, applying the conditional randomization test we obtain a method that controls Type I error in finite samples and is asymptotically consistent, even if the distribution of ${\\boldsymbol X}|{\\boldsymbol Z}$ is incorrectly specified up to distance preserving transformations. More generally, in situations where ${\\boldsymbol X}|{\\boldsymbol Z}$ is unknown or hard to estimate, we design a double-bandwidth based local wild bootstrap algorithm that asymptotically controls both Type I error and power. We illustrate the advantage of our method, both in terms of Type I error and power, in a range of simulation settings and also in a real data example. A consequence of our theoretical results is a general framework for studying the asymptotic properties of a 2-sample conditional $V$-statistic, which is of independent interest."}, "https://arxiv.org/abs/2303.15376": {"title": "Identifiability of causal graphs under nonadditive conditionally parametric causal models", "link": "https://arxiv.org/abs/2303.15376", "description": "arXiv:2303.15376v4 Announce Type: replace \nAbstract: Causal discovery from observational data typically requires strong assumptions about the data-generating process. Previous research has established the identifiability of causal graphs under various models, including linear non-Gaussian, post-nonlinear, and location-scale models. However, these models may have limited applicability in real-world situations that involve a mixture of discrete and continuous variables or where the cause affects the variance or tail behavior of the effect. In this study, we introduce a new class of models, called Conditionally Parametric Causal Models (CPCM), which assume that the distribution of the effect, given the cause, belongs to well-known families such as Gaussian, Poisson, Gamma, or heavy-tailed Pareto distributions. These models are adaptable to a wide range of practical situations where the cause can influence the variance or tail behavior of the effect. We demonstrate the identifiability of CPCM by leveraging the concept of sufficient statistics. Furthermore, we propose an algorithm for estimating the causal structure from random samples drawn from CPCM. 
We evaluate the empirical properties of our methodology on various datasets, demonstrating state-of-the-art performance across multiple benchmarks."}, "https://arxiv.org/abs/2310.13232": {"title": "Interaction Screening and Pseudolikelihood Approaches for Tensor Learning in Ising Models", "link": "https://arxiv.org/abs/2310.13232", "description": "arXiv:2310.13232v2 Announce Type: replace \nAbstract: In this paper, we study two well known methods of Ising structure learning, namely the pseudolikelihood approach and the interaction screening approach, in the context of tensor recovery in $k$-spin Ising models. We show that both these approaches, with proper regularization, retrieve the underlying hypernetwork structure using a sample size logarithmic in the number of network nodes, and exponential in the maximum interaction strength and maximum node-degree. We also track down the exact dependence of the rate of tensor recovery on the interaction order $k$, that is allowed to grow with the number of samples and nodes, for both the approaches. We then provide a comparative discussion of the performance of the two approaches based on simulation studies, which also demonstrates the exponential dependence of the tensor recovery rate on the maximum coupling strength. Our tensor recovery methods are then applied on gene data taken from the Curated Microarray Database (CuMiDa), where we focus on understanding the important genes related to hepatocellular carcinoma."}, "https://arxiv.org/abs/2311.14846": {"title": "Fast Estimation of the Renshaw-Haberman Model and Its Variants", "link": "https://arxiv.org/abs/2311.14846", "description": "arXiv:2311.14846v2 Announce Type: replace \nAbstract: In mortality modelling, cohort effects are often taken into consideration as they add insights about variations in mortality across different generations. Statistically speaking, models such as the Renshaw-Haberman model may provide a better fit to historical data compared to their counterparts that incorporate no cohort effects. However, when such models are estimated using an iterative maximum likelihood method in which parameters are updated one at a time, convergence is typically slow and may not even be reached within a reasonably established maximum number of iterations. Among others, the slow convergence problem hinders the study of parameter uncertainty through bootstrapping methods. In this paper, we propose an intuitive estimation method that minimizes the sum of squared errors between actual and fitted log central death rates. The complications arising from the incorporation of cohort effects are overcome by formulating part of the optimization as a principal component analysis with missing values. Using mortality data from various populations, we demonstrate that our proposed method produces satisfactory estimation results and is significantly more efficient compared to the traditional likelihood-based approach."}, "https://arxiv.org/abs/2408.00032": {"title": "Methodological Foundations of Modern Causal Inference in Social Science Research", "link": "https://arxiv.org/abs/2408.00032", "description": "arXiv:2408.00032v1 Announce Type: new \nAbstract: This paper serves as a literature review of methodology concerning the (modern) causal inference methods to address the causal estimand with observational/survey data that have been or will be used in social science research. 
Mainly, this paper is divided into two parts. The first concerns inference from the statistical estimand to the causal estimand, in which we review the assumptions for causal identification and the methodological strategies for addressing violations of some of these assumptions. The second discusses the asymptotic analysis relating the measure obtained from the observational data to the theoretical measure and replicates the derivation of the efficient/doubly robust average treatment effect estimator, which is commonly used in current social science analysis."}, "https://arxiv.org/abs/2408.00100": {"title": "A new unit-bimodal distribution based on correlated Birnbaum-Saunders random variables", "link": "https://arxiv.org/abs/2408.00100", "description": "arXiv:2408.00100v1 Announce Type: new \nAbstract: In this paper, we propose a new distribution over the unit interval which can be characterized as a ratio of the type Z=Y/(X+Y) where X and Y are two correlated Birnbaum-Saunders random variables. The stress-strength probability between X and Y is calculated explicitly when the respective scale parameters are equal. Two applications of the ratio distribution are discussed."}, "https://arxiv.org/abs/2408.00177": {"title": "Fast variational Bayesian inference for correlated survival data: an application to invasive mechanical ventilation duration analysis", "link": "https://arxiv.org/abs/2408.00177", "description": "arXiv:2408.00177v1 Announce Type: new \nAbstract: Correlated survival data are prevalent in various clinical settings and have been extensively discussed in the literature. One of the most common types of correlated survival data is clustered survival data, where the survival times from individuals in a cluster are associated. Our study is motivated by invasive mechanical ventilation data from different intensive care units (ICUs) in Ontario, Canada, forming multiple clusters. The survival times from patients within the same ICU cluster are correlated. To address this association, we introduce a shared frailty log-logistic accelerated failure time model that accounts for intra-cluster correlation through a cluster-specific random intercept. We present a novel, fast variational Bayes (VB) algorithm for parameter inference and evaluate its performance using simulation studies varying the number of clusters and their sizes. We further compare the performance of our proposed VB algorithm with the h-likelihood method and a Markov Chain Monte Carlo (MCMC) algorithm. The proposed algorithm delivers satisfactory results and demonstrates computational efficiency over the MCMC algorithm. We apply our method to the ICU ventilation data from Ontario to investigate the ICU site random effect on ventilation duration."}, "https://arxiv.org/abs/2408.00289": {"title": "Operator on Operator Regression in Quantum Probability", "link": "https://arxiv.org/abs/2408.00289", "description": "arXiv:2408.00289v1 Announce Type: new \nAbstract: This article introduces operator on operator regression in quantum probability. In this regression model, the response and the independent variables are certain operator valued observables; they are linearly associated through an unknown scalar coefficient (denoted by $\\beta$), and the error is a random operator.
In the course of this study, we propose a quantum version of a class of estimators of $\\beta$ (denoted by $M$-estimators), and the large sample behaviour of these quantum estimators is derived, given that the true model is also linear and the samples are observed eigenvalue pairs of the operator valued observables."}, "https://arxiv.org/abs/2408.00291": {"title": "Bayesian Synthetic Control Methods with Spillover Effects: Estimating the Economic Cost of the 2011 Sudan Split", "link": "https://arxiv.org/abs/2408.00291", "description": "arXiv:2408.00291v1 Announce Type: new \nAbstract: The synthetic control method (SCM) is widely used for causal inference with panel data, particularly when there are few treated units. SCM relies on the stable unit treatment value assumption (SUTVA), which posits that potential outcomes are unaffected by the treatment status of other units. However, interventions often impact not only treated units but also untreated units, a phenomenon known as spillover effects. This study introduces a novel panel data method that extends SCM to allow for spillover effects and estimate both treatment and spillover effects. This method leverages a spatial autoregressive panel data model to account for spillover effects. We also propose Bayesian inference methods using Bayesian horseshoe priors for regularization. We apply the proposed method to two empirical studies: evaluating the effect of the California tobacco tax on consumption and estimating the economic impact of the 2011 division of Sudan on GDP per capita."}, "https://arxiv.org/abs/2408.00618": {"title": "Facilitating heterogeneous effect estimation via statistically efficient categorical modifiers", "link": "https://arxiv.org/abs/2408.00618", "description": "arXiv:2408.00618v1 Announce Type: new \nAbstract: Categorical covariates such as race, sex, or group are ubiquitous in regression analysis. While main-only (or ANCOVA) linear models are predominant, cat-modified linear models that include categorical-continuous or categorical-categorical interactions are increasingly important and allow heterogeneous, group-specific effects. However, with standard approaches, the addition of cat-modifiers fundamentally alters the estimates and interpretations of the main effects, often inflates their standard errors, and introduces significant concerns about group (e.g., racial) biases. We advocate an alternative parametrization and estimation scheme using abundance-based constraints (ABCs). ABCs induce a model parametrization that is both interpretable and equitable. Crucially, we show that with ABCs, the addition of cat-modifiers 1) leaves main effect estimates unchanged and 2) enhances their statistical power, under reasonable conditions. Thus, analysts can, and arguably should, include cat-modifiers in linear regression models to discover potential heterogeneous effects--without compromising estimation, inference, and interpretability for the main effects. Using simulated data, we verify these invariance properties for estimation and inference and showcase the capabilities of ABCs to increase statistical power. We apply these tools to study demographic heterogeneities among the effects of social and environmental factors on STEM educational outcomes for children in North Carolina.
An R package lmabc is available."}, "https://arxiv.org/abs/2408.00651": {"title": "A Dirichlet stochastic block model for composition-weighted networks", "link": "https://arxiv.org/abs/2408.00651", "description": "arXiv:2408.00651v1 Announce Type: new \nAbstract: Network data are observed in various applications where the individual entities of the system interact with or are connected to each other, and often these interactions are defined by their associated strength or importance. Clustering is a common task in network analysis that involves finding groups of nodes displaying similarities in the way they interact with the rest of the network. However, most clustering methods use the strengths of connections between entities in their original form, ignoring the possible differences in the capacities of individual nodes to send or receive edges. This often leads to clustering solutions that are heavily influenced by the nodes' capacities. One way to overcome this is to analyse the strengths of connections in relative rather than absolute terms, expressing each edge weight as a proportion of the sending (or receiving) capacity of the respective node. This, however, induces additional modelling constraints that most existing clustering methods are not designed to handle. In this work we propose a stochastic block model for composition-weighted networks based on direct modelling of compositional weight vectors using a Dirichlet mixture, with the parameters determined by the cluster labels of the sender and the receiver nodes. Inference is implemented via an extension of the classification expectation-maximisation algorithm that uses a working independence assumption, expressing the complete data likelihood of each node of the network as a function of fixed cluster labels of the remaining nodes. A model selection criterion is derived to aid the choice of the number of clusters. The model is validated using simulation studies, and showcased on network data from the Erasmus exchange program and a bike sharing network for the city of London."}, "https://arxiv.org/abs/2408.00237": {"title": "Empirical Bayes Linked Matrix Decomposition", "link": "https://arxiv.org/abs/2408.00237", "description": "arXiv:2408.00237v1 Announce Type: cross \nAbstract: Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular \"omics\" technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. 
For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for \"blockwise\" imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation."}, "https://arxiv.org/abs/2408.00270": {"title": "Strong Oracle Guarantees for Partial Penalized Tests of High Dimensional Generalized Linear Models", "link": "https://arxiv.org/abs/2408.00270", "description": "arXiv:2408.00270v1 Announce Type: cross \nAbstract: Partial penalized tests provide flexible approaches to testing linear hypotheses in high dimensional generalized linear models. However, because the estimators used in these tests are local minimizers of potentially non-convex folded-concave penalized objectives, the solutions one computes in practice may not coincide with the unknown local minima for which we have nice theoretical guarantees. To close this gap between theory and computation, we introduce local linear approximation (LLA) algorithms to compute the full and reduced model estimators for these tests and develop theory specifically for the LLA solutions. We prove that our LLA algorithms converge to oracle estimators for the full and reduced models in two steps with overwhelming probability. We then leverage this strong oracle result and the asymptotic properties of the oracle estimators to show that the partial penalized test statistics evaluated at the two-step LLA solutions are approximately chi-square in large samples, giving us guarantees for the tests using specific computed solutions and thereby closing the theoretical gap. We conduct simulation studies to assess the finite-sample performance of our testing procedures, finding that partial penalized tests using the LLA solutions agree with tests using the oracle estimators, and demonstrate our testing procedures in a real data application."}, "https://arxiv.org/abs/2408.00399": {"title": "Unsupervised Pairwise Causal Discovery on Heterogeneous Data using Mutual Information Measures", "link": "https://arxiv.org/abs/2408.00399", "description": "arXiv:2408.00399v1 Announce Type: cross \nAbstract: A fundamental task in science is to determine the underlying causal relations because it is the knowledge of this functional structure that leads to the correct interpretation of an effect given the apparent associations in the observed data. In this sense, Causal Discovery is a technique that tackles this challenge by analyzing the statistical properties of the constituent variables. In this work, we target the generalizability of the discovery method by following a reductionist approach that only involves two variables, i.e., the pairwise or bi-variate setting. We question the current (possibly misleading) baseline results on the basis that they were obtained through supervised learning, which is arguably contrary to this genuinely exploratory endeavor.
In consequence, we approach this problem in an unsupervised way, using robust Mutual Information measures, and observing the impact of the different variable types, which is oftentimes ignored in the design of solutions. Thus, we provide a novel set of standard unbiased results that can serve as a reference to guide future discovery tasks in completely unknown environments."}, "https://arxiv.org/abs/2205.03706": {"title": "Identification and Estimation of Dynamic Games with Unknown Information Structure", "link": "https://arxiv.org/abs/2205.03706", "description": "arXiv:2205.03706v3 Announce Type: replace \nAbstract: This paper studies the identification and estimation of dynamic games when the underlying information structure is unknown to the researcher. To tractably characterize the set of Markov perfect equilibrium predictions while maintaining weak assumptions on players' information, we introduce \\textit{Markov correlated equilibrium}, a dynamic analog of Bayes correlated equilibrium. The set of Markov correlated equilibrium predictions coincides with the set of Markov perfect equilibrium predictions that can arise when the players can observe more signals than assumed by the analyst. Using Markov correlated equilibrium as the solution concept, we propose tractable computational strategies for informationally robust estimation, inference, and counterfactual analysis that deal with the non-convexities arising in dynamic environments. We use our method to analyze the dynamic competition between Starbucks and Dunkin' in the US and the role of informational assumptions."}, "https://arxiv.org/abs/2208.07900": {"title": "Statistical Inferences and Predictions for Areal Data and Spatial Data Fusion with Hausdorff--Gaussian Processes", "link": "https://arxiv.org/abs/2208.07900", "description": "arXiv:2208.07900v2 Announce Type: replace \nAbstract: Accurate modeling of spatial dependence is pivotal in analyzing spatial data, influencing parameter estimation and out-of-sample predictions. The spatial structure and geometry of the data significantly impact valid statistical inference. Existing models for areal data often rely on adjacency matrices, struggling to differentiate between polygons of varying sizes and shapes. Conversely, data fusion models, while effective, rely on computationally intensive numerical integrals, presenting challenges for moderately large datasets. In response to these issues, we propose the Hausdorff-Gaussian process (HGP), a versatile model class utilizing the Hausdorff distance to capture spatial dependence in both point and areal data. We introduce a valid correlation function within the HGP framework, accommodating diverse modeling techniques, including geostatistical and areal models. Integration into generalized linear mixed-effects models enhances its applicability, particularly in addressing change of support and data fusion challenges. We validate our approach through a comprehensive simulation study and application to two real-world scenarios: one involving areal data and another demonstrating its effectiveness in data fusion. The results suggest that the HGP is competitive with specialized models regarding goodness-of-fit and prediction performances. 
In summary, the HGP offers a flexible and robust solution for modeling spatial data of various types and shapes, with potential applications spanning fields such as public health and climate science."}, "https://arxiv.org/abs/2209.08273": {"title": "Low-Rank Covariance Completion for Graph Quilting with Applications to Functional Connectivity", "link": "https://arxiv.org/abs/2209.08273", "description": "arXiv:2209.08273v2 Announce Type: replace \nAbstract: As a tool for estimating networks in high dimensions, graphical models are commonly applied to calcium imaging data to estimate functional neuronal connectivity, i.e. relationships between the activities of neurons. However, in many calcium imaging data sets, the full population of neurons is not recorded simultaneously, but instead in partially overlapping blocks. This leads to the Graph Quilting problem, as first introduced by (Vinci et.al. 2019), in which the goal is to infer the structure of the full graph when only subsets of features are jointly observed. In this paper, we study a novel two-step approach to Graph Quilting, which first imputes the complete covariance matrix using low-rank covariance completion techniques before estimating the graph structure. We introduce three approaches to solve this problem: block singular value decomposition, nuclear norm penalization, and non-convex low-rank factorization. While prior works have studied low-rank matrix completion, we address the challenges brought by the block-wise missingness and are the first to investigate the problem in the context of graph learning. We discuss theoretical properties of the two-step procedure, showing graph selection consistency of one proposed approach by proving novel L infinity-norm error bounds for matrix completion with block-missingness. We then investigate the empirical performance of the proposed methods on simulations and on real-world data examples, through which we show the efficacy of these methods for estimating functional connectivity from calcium imaging data."}, "https://arxiv.org/abs/2211.16502": {"title": "Identified vaccine efficacy for binary post-infection outcomes under misclassification without monotonicity", "link": "https://arxiv.org/abs/2211.16502", "description": "arXiv:2211.16502v4 Announce Type: replace \nAbstract: In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. 
Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials."}, "https://arxiv.org/abs/2305.06645": {"title": "Causal Inference for Continuous Multiple Time Point Interventions", "link": "https://arxiv.org/abs/2305.06645", "description": "arXiv:2305.06645v3 Announce Type: replace \nAbstract: There are limited options to estimate the treatment effects of variables which are continuous and measured at multiple time points, particularly if the true dose-response curve should be estimated as closely as possible. However, these situations may be of relevance: in pharmacology, one may be interested in how outcomes of people living with -- and treated for -- HIV, such as viral failure, would vary for time-varying interventions such as different drug concentration trajectories. A challenge for doing causal inference with continuous interventions is that the positivity assumption is typically violated. To address positivity violations, we develop projection functions, which reweigh and redefine the estimand of interest based on functions of the conditional support for the respective interventions. With these functions, we obtain the desired dose-response curve in areas of enough support, and otherwise a meaningful estimand that does not require the positivity assumption. We develop $g$-computation type plug-in estimators for this case. Those are contrasted with g-computation estimators which are applied to continuous interventions without specifically addressing positivity violations, which we propose to be presented with diagnostics. The ideas are illustrated with longitudinal data from HIV positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children $<13$ years in Zambia/Uganda. Simulations show in which situations a standard g-computation approach is appropriate, and in which it leads to bias and how the proposed weighted estimation approach then recovers the alternative estimand of interest."}, "https://arxiv.org/abs/2307.13124": {"title": "Conformal prediction for frequency-severity modeling", "link": "https://arxiv.org/abs/2307.13124", "description": "arXiv:2307.13124v3 Announce Type: replace \nAbstract: We present a model-agnostic framework for the construction of prediction intervals of insurance claims, with finite sample statistical guarantees, extending the technique of split conformal prediction to the domain of two-stage frequency-severity modeling. The framework effectiveness is showcased with simulated and real datasets using classical parametric models and contemporary machine learning methods. 
When the underlying severity model is a random forest, we extend the two-stage split conformal prediction algorithm, showing how the out-of-bag mechanism can be leveraged to eliminate the need for a calibration set in the conformal procedure."}, "https://arxiv.org/abs/2308.13346": {"title": "GARHCX-NoVaS: A Model-free Approach to Incorporate Exogenous Variables", "link": "https://arxiv.org/abs/2308.13346", "description": "arXiv:2308.13346v2 Announce Type: replace \nAbstract: In this work, we explore the forecasting ability of a recently proposed normalizing and variance-stabilizing (NoVaS) transformation with the possible inclusion of exogenous variables. From an applied point of view, extra knowledge such as fundamentals- and sentiments-based information could be beneficial for improving the prediction accuracy of market volatility if it is incorporated into the forecasting process. In the classical approach, such models that include exogenous variables are typically termed GARCHX-type models. Being a Model-free prediction method, NoVaS has generally shown more accurate, stable and robust (to misspecifications) performance than classical GARCH-type methods. This motivates us to extend this framework to GARCHX forecasting as well. We derive the NoVaS transformation needed to include exogenous covariates and then construct the corresponding prediction procedure. Extensive simulation studies bolster our claim that the NoVaS method outperforms traditional ones, especially for long-term time-aggregated predictions. We also provide an interesting data analysis to exhibit how our method could possibly shed light on the role of geopolitical risks in forecasting volatility in national stock market indices for three different countries in Europe."}, "https://arxiv.org/abs/2401.10233": {"title": "Likelihood-ratio inference on differences in quantiles", "link": "https://arxiv.org/abs/2401.10233", "description": "arXiv:2401.10233v2 Announce Type: replace \nAbstract: Quantiles can represent key operational and business metrics, but the computational challenges associated with inference have hampered their adoption in online experimentation. One-sample confidence intervals are trivial to construct; however, two-sample inference has traditionally required bootstrapping or a density estimator. This paper presents a new two-sample difference-in-quantile hypothesis test and confidence interval based on a likelihood-ratio test statistic. A conservative version of the test does not involve a density estimator; a second version of the test, which uses a density estimator, yields confidence intervals very close to the nominal coverage level. It can be computed using only four order statistics from each sample."}, "https://arxiv.org/abs/2201.05102": {"title": "Space-time extremes of severe US thunderstorm environments", "link": "https://arxiv.org/abs/2201.05102", "description": "arXiv:2201.05102v3 Announce Type: replace-cross \nAbstract: Severe thunderstorms cause substantial economic and human losses in the United States. Simultaneous high values of convective available potential energy (CAPE) and storm relative helicity (SRH) are favorable to severe weather, and both they and the composite variable $\\mathrm{PROD}=\\sqrt{\\mathrm{CAPE}} \\times \\mathrm{SRH}$ can be used as indicators of severe thunderstorm activity.
Their extremal spatial dependence exhibits temporal non-stationarity due to seasonality and large-scale atmospheric signals such as El Ni\\~no-Southern Oscillation (ENSO). In order to investigate this, we introduce a space-time model based on a max-stable, Brown--Resnick, field whose range depends on ENSO and on time through a tensor product spline. We also propose a max-stability test based on empirical likelihood and the bootstrap. The marginal and dependence parameters must be estimated separately owing to the complexity of the model, and we develop a bootstrap-based model selection criterion that accounts for the marginal uncertainty when choosing the dependence model. In the case study, the out-sample performance of our model is good. We find that extremes of PROD, CAPE and SRH are generally more localized in summer and, in some regions, less localized during El Ni\\~no and La Ni\\~na events, and give meteorological interpretations of these phenomena."}, "https://arxiv.org/abs/2209.15224": {"title": "Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models", "link": "https://arxiv.org/abs/2209.15224", "description": "arXiv:2209.15224v3 Announce Type: replace-cross \nAbstract: Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that effectively utilizes unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Additionally, iterative unsupervised multi-task and transfer learning methods may suffer from an initialization alignment problem, and two alignment algorithms are proposed to resolve the issue. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees."}, "https://arxiv.org/abs/2310.12428": {"title": "Enhanced Local Explainability and Trust Scores with Random Forest Proximities", "link": "https://arxiv.org/abs/2310.12428", "description": "arXiv:2310.12428v2 Announce Type: replace-cross \nAbstract: We initiate a novel approach to explain the predictions and out of sample performance of random forest (RF) regression and classification models by exploiting the fact that any RF can be mathematically formulated as an adaptive weighted K nearest-neighbors model. Specifically, we employ a recent result that, for both regression and classification tasks, any RF prediction can be rewritten exactly as a weighted sum of the training targets, where the weights are RF proximities between the corresponding pairs of data points. 
We show that this linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established feature-based methods like SHAP, which generate attributions for a model prediction across input features. We show how this proximity-based approach to explainability can be used in conjunction with SHAP to explain not just the model predictions, but also out-of-sample performance, in the sense that proximities furnish a novel means of assessing when a given model prediction is more or less likely to be correct. We demonstrate this approach in the modeling of US corporate bond prices and returns in both regression and classification cases."}, "https://arxiv.org/abs/2401.06925": {"title": "Modeling Latent Selection with Structural Causal Models", "link": "https://arxiv.org/abs/2401.06925", "description": "arXiv:2401.06925v2 Announce Type: replace-cross \nAbstract: Selection bias is ubiquitous in real-world data, and can lead to misleading results if not dealt with properly. We introduce a conditioning operation on Structural Causal Models (SCMs) to model latent selection from a causal perspective. We show that the conditioning operation transforms an SCM with the presence of an explicit latent selection mechanism into an SCM without such selection mechanism, which partially encodes the causal semantics of the selected subpopulation according to the original SCM. Furthermore, we show that this conditioning operation preserves the simplicity, acyclicity, and linearity of SCMs, and commutes with marginalization. Thanks to these properties, combined with marginalization and intervention, the conditioning operation offers a valuable tool for conducting causal reasoning tasks within causal models where latent details have been abstracted away. We demonstrate by example how classical results of causal inference can be generalized to include selection bias and how the conditioning operation helps with modeling of real-world problems."}, "https://arxiv.org/abs/2408.00908": {"title": "Early Stopping Based on Repeated Significance", "link": "https://arxiv.org/abs/2408.00908", "description": "arXiv:2408.00908v1 Announce Type: new \nAbstract: For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\\alpha$ for the success criterion produces statistical confidence at level $1 - \\alpha$. For multiple criteria, a Bonferroni correction that partitions $\\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points."}, "https://arxiv.org/abs/2408.01023": {"title": "Distilling interpretable causal trees from causal forests", "link": "https://arxiv.org/abs/2408.01023", "description": "arXiv:2408.01023v1 Announce Type: new \nAbstract: Machine learning methods for estimating treatment effect heterogeneity promise greater flexibility than existing methods that test a few pre-specified hypotheses. However, one problem these methods can have is that it can be challenging to extract insights from complicated machine learning models. 
A high-dimensional distribution of conditional average treatment effects may give accurate, individual-level estimates, but it can be hard to understand the underlying patterns and to know what the implications of the analysis are. This paper proposes the Distilled Causal Tree, a method for distilling a single, interpretable causal tree from a causal forest. This compares well to existing methods of extracting a single tree, particularly in noisy data or high-dimensional data where there are many correlated features. Here it even outperforms the base causal forest in most simulations. Its estimates are doubly robust and asymptotically normal just as those of the causal forest are."}, "https://arxiv.org/abs/2408.01208": {"title": "Distributional Difference-in-Differences Models with Multiple Time Periods: A Monte Carlo Analysis", "link": "https://arxiv.org/abs/2408.01208", "description": "arXiv:2408.01208v1 Announce Type: new \nAbstract: Researchers are often interested in evaluating the impact of a policy on the entire (or specific parts of the) distribution of the outcome of interest. In this paper, I provide a practical toolkit to recover the whole counterfactual distribution of the untreated potential outcome for the treated group in non-experimental settings with staggered treatment adoption by generalizing the existing quantile treatment effects on the treated (QTT) estimator proposed by Callaway and Li (2019). Besides the QTT, I consider different approaches that anonymously summarize the quantiles of the distribution of the outcome of interest (such as tests for stochastic dominance rankings) without relying on rank invariance assumptions. The finite-sample properties of the estimator proposed are analyzed via different Monte Carlo simulations. Despite being slightly biased for relatively small sample sizes, the proposed method's performance improves substantially as the sample size increases."}, "https://arxiv.org/abs/2408.00955": {"title": "Aggregation Models with Optimal Weights for Distributed Gaussian Processes", "link": "https://arxiv.org/abs/2408.00955", "description": "arXiv:2408.00955v1 Announce Type: cross \nAbstract: Gaussian process (GP) models have received increasing attention in recent years due to their superb prediction accuracy and modeling flexibility. To address the computational burdens of GP models for large-scale datasets, distributed learning for GPs is often adopted. Current aggregation models for distributed GPs are not time-efficient when incorporating correlations between GP experts. In this work, we propose a novel approach for aggregated prediction in distributed GPs. The technique is suitable for both exact and sparse variational GPs. The proposed method incorporates correlations among experts, leading to better prediction accuracy with manageable computational requirements. As demonstrated by empirical studies, the proposed approach results in more stable predictions in less time than state-of-the-art consistent aggregation models."}, "https://arxiv.org/abs/2408.01017": {"title": "Application of Superconducting Technology in the Electricity Industry: A Game-Theoretic Analysis of Government Subsidy Policies and Power Company Equipment Upgrade Decisions", "link": "https://arxiv.org/abs/2408.01017", "description": "arXiv:2408.01017v1 Announce Type: cross \nAbstract: This study investigates the potential impact of \"LK-99,\" a novel material developed by a Korean research team, on the power equipment industry. 
Using evolutionary game theory, we model the interactions between governmental subsidies and technology adoption by power companies. A key innovation of this research is the introduction of sensitivity analyses concerning time delays and initial subsidy amounts, which significantly influence the strategic decisions of both government and corporate entities. The findings indicate that these factors are critical in determining the rate of technology adoption and the efficiency of the market as a whole. Due to existing data limitations, the study offers a broad overview of likely trends and recommends the inclusion of real-world data for more precise modeling once the material demonstrates room-temperature superconducting characteristics. The research contributes foundational insights valuable for future policy design and has significant implications for advancing the understanding of technology adoption and market dynamics."}, "https://arxiv.org/abs/2408.01117": {"title": "Reduced-Rank Estimation for Ill-Conditioned Stochastic Linear Model with High Signal-to-Noise Ratio", "link": "https://arxiv.org/abs/2408.01117", "description": "arXiv:2408.01117v1 Announce Type: cross \nAbstract: The reduced-rank approach has been used for decades in robust linear estimation of both deterministic and random vectors of parameters in the linear model $y=Hx+\sqrt{\epsilon}n$. In practical settings, estimation is frequently performed under incomplete or inexact model knowledge, which in the stochastic case significantly increases the mean-square error (MSE) of an estimate obtained by the linear minimum mean-square-error (MMSE) estimator, which is MSE-optimal among linear estimators in the theoretical case of perfect model knowledge. However, the improved performance of reduced-rank estimators over the MMSE estimator in estimation under incomplete or inexact model knowledge has been established to date only by means of numerical simulations and arguments indicating that the reduced-rank approach may provide improved performance over the MMSE estimator in certain settings. In this paper we focus on the high signal-to-noise ratio (SNR) case, which has not been previously considered as a natural area of application of reduced-rank estimators. We first show explicit sufficient conditions under which familiar reduced-rank MMSE and truncated SVD estimators achieve lower MSE than the MMSE estimator if the singular values of the array response matrix H are perturbed. We then extend these results to the case of a generic perturbation of the array response matrix H, and demonstrate why the MMSE estimator frequently attains a higher MSE than the reduced-rank MMSE and truncated SVD estimators if H is ill-conditioned. The main results of this paper are verified in numerical simulations."}, "https://arxiv.org/abs/2408.01298": {"title": "Probabilistic Inversion Modeling of Gas Emissions: A Gradient-Based MCMC Estimation of Gaussian Plume Parameters", "link": "https://arxiv.org/abs/2408.01298", "description": "arXiv:2408.01298v1 Announce Type: cross \nAbstract: In response to global concerns regarding air quality and the environmental impact of greenhouse gas emissions, detecting and quantifying sources of emissions has become critical. To understand this impact and target mitigations effectively, methods for accurate quantification of greenhouse gas emissions are required. In this paper, we focus on the inversion of concentration measurements to estimate source location and emission rate. 
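To make the comparison in the reduced-rank abstract above concrete, here is a minimal synthetic sketch contrasting the full linear MMSE estimate with a truncated SVD estimate when both estimators are built from an inexactly known, ill-conditioned H. The dimensions, perturbation and noise level are arbitrary choices of mine, and which estimator wins depends on the setting, which is precisely the paper's subject:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r = 20, 10, 4          # measurements, parameters, retained rank
eps = 1e-4                   # noise level (high-SNR regime)

# Ill-conditioned response matrix H_true with sharply decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
s = 10.0 ** -np.arange(p)
H_true = U @ np.diag(s) @ V.T

# The estimators only have access to a perturbed (inexactly known) H.
H = U @ np.diag(s * (1 + 0.05 * rng.standard_normal(p))) @ V.T

x = rng.standard_normal(p)                              # x ~ N(0, I)
y = H_true @ x + np.sqrt(eps) * rng.standard_normal(m)  # y = Hx + sqrt(eps) n

# Full linear MMSE estimate under C_x = I: x_hat = H^T (H H^T + eps I)^{-1} y.
x_mmse = H.T @ np.linalg.solve(H @ H.T + eps * np.eye(m), y)

# Rank-r truncated SVD estimate: invert only the r dominant singular directions.
Us, ss, Vts = np.linalg.svd(H, full_matrices=False)
x_tsvd = Vts[:r].T @ ((Us[:, :r].T @ y) / ss[:r])

print("MMSE estimate error:", np.linalg.norm(x_mmse - x))
print("TSVD estimate error:", np.linalg.norm(x_tsvd - x))
```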
In practice, such methods often rely on atmospheric stability class-based Gaussian plume dispersion models. However, incorrectly identifying the atmospheric stability class can lead to significant bias in estimates of source characteristics. We present a robust approach that reduces this bias by jointly estimating the horizontal and vertical dispersion parameters of the Gaussian plume model, together with source location and emission rate, atmospheric background concentration, and sensor measurement error variance. Uncertainty in parameter estimation is quantified through probabilistic inversion using gradient-based MCMC methods. A simulation study is performed to assess the inversion methodology. We then focus on inference for the published Chilbolton dataset, which contains controlled methane releases, and demonstrate the practical benefits of estimating dispersion parameters in source inversion problems."}, "https://arxiv.org/abs/2408.01326": {"title": "Nonparametric Mean and Covariance Estimation for Discretely Observed High-Dimensional Functional Data: Rates of Convergence and Division of Observational Regimes", "link": "https://arxiv.org/abs/2408.01326", "description": "arXiv:2408.01326v1 Announce Type: cross \nAbstract: Nonparametric estimation of the mean and covariance parameters for functional data is a critical task, with local linear smoothing being a popular choice. In recent years, many scientific domains have been producing high-dimensional functional data for which $p$, the number of curves per subject, is often much larger than the sample size $n$. Much of the methodology developed for such data relies on preliminary nonparametric estimates of the unknown mean functions and the auto- and cross-covariance functions. We investigate the convergence rates of local linear estimators in terms of the maximal error across components and pairs of components for mean and covariance functions, respectively, in both $L^2$ and uniform metrics. The local linear estimators utilize a generic weighting scheme that can adjust for differing numbers of discrete observations $N_{ij}$ across curves $j$ and subjects $i$, where the $N_{ij}$ vary with $n$. Particular attention is given to the equal weight per observation (OBS) and equal weight per subject (SUBJ) weighting schemes. The theoretical results utilize novel applications of concentration inequalities for functional data and demonstrate that, similar to univariate functional data, the order of the $N_{ij}$ relative to $p$ and $n$ divides high-dimensional functional data into three regimes: sparse, dense, and ultra-dense, with the high-dimensional parametric convergence rate of $\\left\\{\\log(p)/n\\right\\}^{1/2}$ being attainable in the latter two."}, "https://arxiv.org/abs/2211.06337": {"title": "Differentially Private Methods for Compositional Data", "link": "https://arxiv.org/abs/2211.06337", "description": "arXiv:2211.06337v3 Announce Type: replace \nAbstract: Confidential data, such as electronic health records, activity data from wearable devices, and geolocation data, are becoming increasingly prevalent. Differential privacy provides a framework to conduct statistical analyses while mitigating the risk of leaking private information. Compositional data, which consist of vectors with positive components that add up to a constant, have received little attention in the differential privacy literature. This article proposes differentially private approaches for analyzing compositional data using the Dirichlet distribution. 
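For reference, the Gaussian plume dispersion model mentioned in the inversion abstract above has a standard closed form; the sketch below implements that textbook formula (with ground reflection) rather than the authors' probabilistic inversion, and all parameter values are arbitrary:

```python
import numpy as np

def gaussian_plume(y, z, Q, u, sigma_y, sigma_z, H):
    """Textbook Gaussian plume concentration at crosswind offset y and height z
    for emission rate Q, wind speed u, dispersion parameters sigma_y and sigma_z,
    and effective source height H; the second exponential is the usual
    ground-reflection term. In stability-class formulations sigma_y and sigma_z
    are functions of downwind distance; supplying them directly mirrors the idea
    of estimating the dispersion parameters themselves."""
    return (Q / (2.0 * np.pi * u * sigma_y * sigma_z)
            * np.exp(-y**2 / (2.0 * sigma_y**2))
            * (np.exp(-(z - H)**2 / (2.0 * sigma_z**2))
               + np.exp(-(z + H)**2 / (2.0 * sigma_z**2))))

# Concentration on the plume centreline at 2 m height for arbitrary parameters.
print(gaussian_plume(y=0.0, z=2.0, Q=0.5, u=3.0, sigma_y=8.0, sigma_z=4.0, H=1.5))
```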
We explore several methods, including Bayesian and bootstrap procedures. For the Bayesian methods, we consider posterior inference techniques based on Markov Chain Monte Carlo, Approximate Bayesian Computation, and asymptotic approximations. We conduct an extensive simulation study to compare these approaches and make evidence-based recommendations. Finally, we apply the methodology to a data set from the American Time Use Survey."}, "https://arxiv.org/abs/2212.09996": {"title": "A marginalized three-part interrupted time series regression model for proportional data", "link": "https://arxiv.org/abs/2212.09996", "description": "arXiv:2212.09996v2 Announce Type: replace \nAbstract: Interrupted time series (ITS) is often used to evaluate the effectiveness of a health policy intervention that accounts for the temporal dependence of outcomes. When the outcome of interest is a percentage or percentile, the data can be highly skewed, bounded in $[0, 1]$, and have many zeros or ones. A three-part Beta regression model is commonly used to separate zeros, ones, and positive values explicitly by three submodels. However, incorporating temporal dependence into the three-part Beta regression model is challenging. In this article, we propose a marginalized zero-one-inflated Beta time series model that captures the temporal dependence of outcomes through copula and allows investigators to examine covariate effects on the marginal mean. We investigate its practical performance using simulation studies and apply the model to a real ITS study."}, "https://arxiv.org/abs/2309.13666": {"title": "More power to you: Using machine learning to augment human coding for more efficient inference in text-based randomized trials", "link": "https://arxiv.org/abs/2309.13666", "description": "arXiv:2309.13666v2 Announce Type: replace \nAbstract: For randomized trials that use text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by trained human raters. This process, the current standard, is both time-consuming and limiting: even the largest human coding efforts are typically constrained to measure only a small set of dimensions across a subsample of available texts. In this work, we present an inferential framework that can be used to increase the power of an impact assessment, given a fixed human-coding budget, by taking advantage of any \"untapped\" observations -- those documents not manually scored due to time or resource constraints -- as a supplementary resource. Our approach, a methodological combination of causal inference, survey sampling methods, and machine learning, has four steps: (1) select and code a sample of documents; (2) build a machine learning model to predict the human-coded outcomes from a set of automatically extracted text features; (3) generate machine-predicted scores for all documents and use these scores to estimate treatment impacts; and (4) adjust the final impact estimates using the residual differences between human-coded and machine-predicted outcomes. This final step ensures any biases in the modeling procedure do not propagate to biases in final estimated effects. 
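The four-step recipe described above can be illustrated compactly; the following is a minimal synthetic sketch of the general idea (machine-predicted scores for all documents plus a residual correction from the human-coded subsample), not the authors' estimator, with all names and data invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
treat = rng.integers(0, 2, n)                    # randomized treatment assignment
y_human = 0.3 * treat + rng.normal(0.0, 1.0, n)  # human-coded outcome (only partly observed)

# Machine-predicted scores, deliberately biased in a treatment-dependent way so
# that the residual correction in step (4) actually matters.
y_pred = 0.7 * y_human + 0.2 * treat + rng.normal(0.0, 0.5, n)

coded = np.zeros(n, dtype=bool)
coded[rng.choice(n, size=300, replace=False)] = True   # the human-coding budget

def diff_in_means(v, mask):
    return v[mask & (treat == 1)].mean() - v[mask & (treat == 0)].mean()

everyone = np.ones(n, dtype=bool)
impact_pred = diff_in_means(y_pred, everyone)         # step (3): predicted scores, all documents
correction = diff_in_means(y_human - y_pred, coded)   # step (4): residuals, coded subsample only
print("bias-corrected impact estimate:", impact_pred + correction)  # close to the true 0.3
```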
Through an extensive simulation study and an application to a recent field trial in education, we show that our proposed approach can be used to reduce the scope of a human-coding effort while maintaining nominal power to detect a significant treatment impact."}, "https://arxiv.org/abs/2312.15222": {"title": "Sequential decision rules for 2-arm clinical trials: a Bayesian perspective", "link": "https://arxiv.org/abs/2312.15222", "description": "arXiv:2312.15222v2 Announce Type: replace \nAbstract: Practical employment of Bayesian trial designs has been rare. Even if accepted in principle, the regulators have commonly required that such designs be calibrated according to an upper bound for the frequentist Type 1 error rate. This represents an internally inconsistent hybrid methodology, where important advantages from applying the Bayesian principles are lost. In particular, all pre-planned interim looks have an inflating multiplicity effect on the Type 1 error rate. To present an alternative approach, we consider the prototype case of a 2-arm superiority trial with binary outcomes. The design is adaptive, using error tolerance criteria based on sequentially updated posterior probabilities, to conclude efficacy of the experimental treatment or futility of the trial. In the proposed approach, the regulators are assumed to have the main responsibility in defining criteria for the error control against false conclusions of efficacy, whereas the trial investigators will have a natural role in determining the criteria for concluding futility and thereby stopping the trial. It is suggested that the control of the Type 1 error rate be replaced by the control of a criterion called regulators' False Discovery Probability (rFDP), a term corresponding directly to the probability interpretation of this criterion. Importantly, the sequential error control during the data analysis based on posterior probabilities will satisfy the rFDP criterion automatically, so that no separate computations are needed for such a purpose. The method contains the option of applying a decision rule for terminating the trial early if the predicted costs from continuing would exceed the corresponding gains. The proposed approach can lower ultimately unnecessary barriers to the practical application of Bayesian trial designs."}, "https://arxiv.org/abs/2401.13975": {"title": "Sparse signal recovery and source localization via covariance learning", "link": "https://arxiv.org/abs/2401.13975", "description": "arXiv:2401.13975v2 Announce Type: replace \nAbstract: In the Multiple Measurements Vector (MMV) model, measurement vectors are connected to unknown, jointly sparse signal vectors through a linear regression model employing a single known measurement matrix (or dictionary). Typically, the number of atoms (columns of the dictionary) is greater than the number of measurements, and the sparse signal recovery problem is generally ill-posed. In this paper, we treat the signals and measurement noise as independent Gaussian random vectors with unknown signal covariance matrix and noise variance, respectively, and characterize the solution of the likelihood equation in terms of a fixed-point (FP) equation, thereby enabling the recovery of the sparse signal support (sources with non-zero variances) via a block coordinate descent (BCD) algorithm that leverages the FP characterization of the likelihood equation. Additionally, a greedy pursuit method, analogous to the popular simultaneous orthogonal matching pursuit (OMP), is introduced. 
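For context on the greedy route just mentioned, here is a minimal sketch of generic simultaneous orthogonal matching pursuit for an MMV problem; the paper's covariance-learning variant differs, and the dictionary, sparsity level, and dimensions below are arbitrary:

```python
import numpy as np

def somp(Y, A, k):
    """Generic simultaneous OMP: greedily select k dictionary atoms that jointly
    explain all measurement vectors in Y (m x L), given dictionary A (m x N)."""
    residual, support = Y.copy(), []
    for _ in range(k):
        # Atom whose correlation with the residual, aggregated over snapshots, is largest.
        scores = np.linalg.norm(A.T @ residual, axis=1)
        scores[support] = -np.inf            # do not reselect chosen atoms
        support.append(int(np.argmax(scores)))
        # Re-fit jointly on the current support and update the residual.
        X_s, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        residual = Y - A[:, support] @ X_s
    return sorted(support), X_s

# Toy MMV problem: 3 active atoms shared by L = 10 snapshots.
rng = np.random.default_rng(0)
m, N, L, k = 20, 60, 10, 3
A = rng.standard_normal((m, N)) / np.sqrt(m)
true_support = [5, 17, 42]
X = np.zeros((N, L)); X[true_support] = rng.standard_normal((k, L))
Y = A @ X + 0.01 * rng.standard_normal((m, L))
print(somp(Y, A, k)[0], "vs true", true_support)
```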
Our numerical examples demonstrate the effectiveness of the proposed covariance learning (CL) algorithms in both classic sparse signal recovery and direction-of-arrival (DOA) estimation problems, where they perform favourably compared to state-of-the-art algorithms under a broad variety of settings."}, "https://arxiv.org/abs/2110.15501": {"title": "Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning", "link": "https://arxiv.org/abs/2110.15501", "description": "arXiv:2110.15501v4 Announce Type: replace-cross \nAbstract: Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial guidance on early stopping of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention, with the goal of inferring the mean outcome of the optimal policy (i.e., the value) in real time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal, with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method."}, "https://arxiv.org/abs/2408.01626": {"title": "Weighted Brier Score -- an Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration", "link": "https://arxiv.org/abs/2408.01626", "description": "arXiv:2408.01626v1 Announce Type: new \nAbstract: As advancements in novel biomarker-based algorithms and models accelerate disease risk prediction and stratification in medicine, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output the absolute risk of disease; subsequently, patient counseling and shared decision-making are based on the estimated individual risk and cost-benefit assessment. The overall impact of the application is often referred to as clinical utility, which has received significant attention in model assessment lately. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for effectively assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that aligns with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components, examining how weighting influences the overall score and its individual components. Through this decomposition, we link the weighted Brier score to the $H$ measure, which has been proposed as a coherent alternative to the area under the receiver operating characteristic curve. 
This theoretical link to the $H$ measure further supports our weighting method and underscores the essential elements of discrimination and calibration in risk prediction evaluation. The practical use of the weighted Brier score as an overall summary is demonstrated using data from the Prostate Cancer Active Surveillance Study (PASS)."}, "https://arxiv.org/abs/2408.01628": {"title": "On Nonparametric Estimation of Covariograms", "link": "https://arxiv.org/abs/2408.01628", "description": "arXiv:2408.01628v1 Announce Type: new \nAbstract: The paper overviews and investigates several nonparametric methods of estimating covariograms. It provides a unified approach and notation to compare the main approaches used in applied research. The primary focus is on methods that utilise the actual values of observations, rather than their ranks. We concentrate on such desirable properties of covariograms as bias, positive-definiteness and behaviour at large distances. The paper discusses several theoretical properties and demonstrates some surprising drawbacks of well-known estimators. Numerical studies provide a comparison of representatives from different methods using various metrics. The results provide important insight and guidance for practitioners who use estimated covariograms in various applications, including kriging, monitoring network optimisation, cross-validation, and other related tasks."}, "https://arxiv.org/abs/2408.01662": {"title": "Principal component analysis balancing prediction and approximation accuracy for spatial data", "link": "https://arxiv.org/abs/2408.01662", "description": "arXiv:2408.01662v1 Announce Type: new \nAbstract: Dimension reduction is often the first step in statistical modeling or prediction of multivariate spatial data. However, most existing dimension reduction techniques do not account for the spatial correlation between observations and do not take the downstream modeling task into consideration when finding the lower-dimensional representation. We formalize the closeness of approximation to the original data and the utility of lower-dimensional scores for downstream modeling as two complementary, sometimes conflicting, metrics for dimension reduction. We illustrate how existing methodologies fall into this framework and propose a flexible dimension reduction algorithm that achieves the optimal trade-off. We derive a computationally simple form for our algorithm and illustrate its performance through simulation studies, as well as two applications in air pollution modeling and spatial transcriptomics."}, "https://arxiv.org/abs/2408.01893": {"title": "Minimum Gamma Divergence for Regression and Classification Problems", "link": "https://arxiv.org/abs/2408.01893", "description": "arXiv:2408.01893v1 Announce Type: new \nAbstract: The book is structured into four main chapters. Chapter 1 introduces the foundational concepts of divergence measures, including the well-known Kullback-Leibler divergence and its limitations. It then presents a detailed exploration of power divergences, such as the $\\alpha$, $\\beta$, and $\\gamma$-divergences, highlighting their unique properties and advantages. Chapter 2 explores minimum divergence methods for regression models, demonstrating how these methods can improve robustness and efficiency in statistical estimation. Chapter 3 extends these methods to Poisson point processes, with a focus on ecological applications, providing a robust framework for modeling species distributions and other spatial phenomena. 
Finally, Chapter 4 explores the use of divergence measures in machine learning, including applications in Boltzmann machines, AdaBoost, and active learning. The chapter emphasizes the practical benefits of these measures in enhancing model robustness and performance."}, "https://arxiv.org/abs/2408.01985": {"title": "Analysis of Factors Affecting the Entry of Foreign Direct Investment into Indonesia (Case Study of Three Industrial Sectors in Indonesia)", "link": "https://arxiv.org/abs/2408.01985", "description": "arXiv:2408.01985v1 Announce Type: new \nAbstract: The realization of FDI and DDI from January to December 2022 reached Rp1,207.2 trillion. The largest FDI investment realization by sector was led by the Basic Metal, Metal Goods, Non-Machinery, and Equipment Industry sector, followed by the Mining sector and the Electricity, Gas, and Water sector. The uneven amount of FDI investment realization in each industry and the impact of the COVID-19 pandemic in Indonesia are the main issues addressed in this study. This study aims to identify the factors that influence the entry of FDI into industries in Indonesia and measure the extent of these factors' influence on the entry of FDI. In this study, classical assumption tests and hypothesis tests are conducted to investigate whether the research model is robust enough to provide strategic options nationally. Moreover, this study uses the ordinary least squares (OLS) method. The results show that the electricity factor does not influence FDI inflows in the three industries. The Human Development Index (HDI) factor has a significant negative effect on FDI in the Mining Industry and a significant positive effect on FDI in the Basic Metal, Metal Goods, Non-Machinery, and Equipment Industries. However, HDI does not influence FDI in the Electricity, Gas, and Water Industries in Indonesia."}, "https://arxiv.org/abs/2408.02028": {"title": "Multivariate Information Measures: A Copula-based Approach", "link": "https://arxiv.org/abs/2408.02028", "description": "arXiv:2408.02028v1 Announce Type: new \nAbstract: Multivariate datasets are common in various real-world applications. Recently, copulas have received significant attention for modeling dependencies among random variables. A copula-based information measure is required to quantify the uncertainty inherent in these dependencies. This paper introduces a multivariate variant of the cumulative copula entropy and explores its various properties, including bounds, stochastic orders, and convergence-related results. Additionally, we define a cumulative copula information generating function and derive it for several well-known families of multivariate copulas. A fractional generalization of the multivariate cumulative copula entropy is also introduced and examined. We present a non-parametric estimator of the cumulative copula entropy using empirical beta copula. Furthermore, we propose a new distance measure between two copulas based on the Kullback-Leibler divergence and discuss a goodness-of-fit test based on this measure."}, "https://arxiv.org/abs/2408.02159": {"title": "SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems", "link": "https://arxiv.org/abs/2408.02159", "description": "arXiv:2408.02159v1 Announce Type: new \nAbstract: This paper introduces a new addition to the SPINEX (Similarity-based Predictions with Explainable Neighbors Exploration) family, tailored specifically for time series and forecasting analysis. 
This new algorithm leverages the concept of similarity and higher-order temporal interactions across multiple time scales to enhance predictive accuracy and interpretability in forecasting. To evaluate the effectiveness of SPINEX, we present comprehensive benchmarking experiments comparing it against 18 algorithms and across 49 synthetic and real datasets characterized by varying trends, seasonality, and noise levels. Our performance assessment focused on forecasting accuracy and computational efficiency. Our findings reveal that SPINEX consistently ranks among the top 5 performers in forecasting precision and has a superior ability to handle complex temporal dynamics compared to commonly adopted algorithms. Moreover, the algorithm's explainability features, Pareto efficiency, and medium complexity (on the order of O(log n)) are demonstrated through detailed visualizations to enhance the prediction and decision-making process. We note that integrating similarity-based concepts opens new avenues for research in predictive analytics, promising more accurate and transparent decision making."}, "https://arxiv.org/abs/2408.02331": {"title": "Explaining and Connecting Kriging with Gaussian Process Regression", "link": "https://arxiv.org/abs/2408.02331", "description": "arXiv:2408.02331v1 Announce Type: new \nAbstract: Kriging and Gaussian Process Regression are statistical methods that allow one to predict the outcome of a random process or a random field by using a sample of correlated observations. In other words, the random process or random field is partially observed, and by using a sample a prediction is made, pointwise or as a whole, where the latter can be thought of as a reconstruction. In addition, the techniques provide a measure of uncertainty for the prediction. The methods have different origins. Kriging comes from geostatistics, a field which started to develop around 1950, oriented towards mining valuation problems, whereas Gaussian Process Regression gained popularity in the area of machine learning in the last decade of the previous century. In the literature, the methods are usually presented as being the same technique. However, beyond this affirmation, the techniques have not yet been compared on a thorough mathematical basis, nor has it been explained why and under which conditions this affirmation holds. Furthermore, Kriging has many variants, so this affirmation should be made precise. In this paper, this gap is filled. It is shown, step by step, how both methods are deduced from first principles (with a major focus on Kriging), what the mathematical connection between them is, and which Kriging variant corresponds to which Gaussian Process Regression set-up. The three most widely used versions of Kriging are considered: Simple Kriging, Ordinary Kriging and Universal Kriging. It is found that, despite their closeness, the techniques differ in their approach and assumptions, in a similar way to how the Least Squares method, the Best Linear Unbiased Estimator method, and the Likelihood method differ in regression. 
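As a concrete instance of the correspondence discussed above, the Simple Kriging predictor with a known zero mean coincides with the Gaussian process posterior mean under the same covariance; a minimal sketch, assuming scikit-learn is available and using an arbitrary RBF kernel and toy data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(0, 10, 7).reshape(-1, 1)

kernel, noise = RBF(length_scale=2.0), 0.1**2

# Simple Kriging with known zero mean: weights solve (K + noise*I) w = k_*,
# and the prediction is w^T y.
K = kernel(X) + noise * np.eye(len(X))
k_star = kernel(X, X_star)
kriging_pred = np.linalg.solve(K, k_star).T @ y

# GP regression with the same fixed kernel and noise (no hyperparameter fitting).
gp = GaussianProcessRegressor(kernel=kernel, alpha=noise, optimizer=None).fit(X, y)
print(np.allclose(kriging_pred, gp.predict(X_star)))   # True: identical predictors
```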
I hope this work can serve for a deeper understanding of the relationship between Kriging and Gaussian Process Regression, as well as a cohesive introductory resource for researchers."}, "https://arxiv.org/abs/2408.02343": {"title": "Unified Principal Components Analysis of Irregularly Observed Functional Time Series", "link": "https://arxiv.org/abs/2408.02343", "description": "arXiv:2408.02343v1 Announce Type: new \nAbstract: Irregularly observed functional time series (FTS) are increasingly available in many real-world applications. To analyze FTS, it is crucial to account for both serial dependencies and the irregularly observed nature of functional data. However, existing methods for FTS often rely on specific model assumptions in capturing serial dependencies, or cannot handle the irregular observational scheme of functional data. To solve these issues, one can perform dimension reduction on FTS via functional principal component analysis (FPCA) or dynamic FPCA. Nonetheless, these methods may either be not theoretically optimal or too redundant to represent serially dependent functional data. In this article, we introduce a novel dimension reduction method for FTS based on dynamic FPCA. Through a new concept called optimal functional filters, we unify the theories of FPCA and dynamic FPCA, providing a parsimonious and optimal representation for FTS adapting to its serial dependence structure. This framework is referred to as principal analysis via dependency-adaptivity (PADA). Under a hierarchical Bayesian model, we establish an estimation procedure for dimension reduction via PADA. Our method can be used for both sparsely and densely observed FTS, and is capable of predicting future functional data. We investigate the theoretical properties of PADA and demonstrate its effectiveness through extensive simulation studies. Finally, we illustrate our method via dimension reduction and prediction of daily PM2.5 data."}, "https://arxiv.org/abs/2408.02393": {"title": "Graphical Modelling without Independence Assumptions for Uncentered Data", "link": "https://arxiv.org/abs/2408.02393", "description": "arXiv:2408.02393v1 Announce Type: new \nAbstract: The independence assumption is a useful tool to increase the tractability of one's modelling framework. However, this assumption does not match reality; failing to take dependencies into account can cause models to fail dramatically. The field of multi-axis graphical modelling (also called multi-way modelling, Kronecker-separable modelling) has seen growth over the past decade, but these models require that the data have zero mean. In the multi-axis case, inference is typically done in the single sample scenario, making mean inference impossible.\n In this paper, we demonstrate how the zero-mean assumption can cause egregious modelling errors, as well as propose a relaxation to the zero-mean assumption that allows the avoidance of such errors. Specifically, we propose the \"Kronecker-sum-structured mean\" assumption, which leads to models with nonconvex-but-unimodal log-likelihoods that can be solved efficiently with coordinate descent."}, "https://arxiv.org/abs/2408.02513": {"title": "The appeal of the gamma family distribution to protect the confidentiality of contingency tables", "link": "https://arxiv.org/abs/2408.02513", "description": "arXiv:2408.02513v1 Announce Type: new \nAbstract: Administrative databases, such as the English School Census (ESC), are rich sources of information that are potentially useful for researchers. 
For such data sources to be made available, however, strict guarantees of privacy would be required. To achieve this, synthetic data methods can be used. Such methods, when protecting the confidentiality of tabular data (contingency tables), often utilise the Poisson or Poisson-mixture distributions, such as the negative binomial (NBI). These distributions, however, are either equidispersed (in the case of the Poisson) or overdispersed (e.g. in the case of the NBI), which results in excessive noise being applied to large low-risk counts. This paper proposes the use of the (discretized) gamma family (GAF) distribution, which allows noise to be applied in a more bespoke fashion. Specifically, it allows less noise to be applied as cell counts become larger, providing an optimal balance in relation to the risk-utility trade-off. We illustrate the suitability of the GAF distribution on an administrative-type data set that is reminiscent of the ESC."}, "https://arxiv.org/abs/2408.02573": {"title": "Testing identifying assumptions in Tobit Models", "link": "https://arxiv.org/abs/2408.02573", "description": "arXiv:2408.02573v1 Announce Type: new \nAbstract: This paper develops sharp testable implications for Tobit and IV-Tobit models' identifying assumptions: linear index specification, (joint) normality of latent errors, and treatment (instrument) exogeneity and relevance. The new sharp testable equalities can detect all possible observable violations of the identifying conditions. We propose a testing procedure for the model's validity using existing inference methods for intersection bounds. Simulation results suggest proper size in large samples and that the test has power to detect large violations of the exogeneity assumption and violations in the error structure. Finally, we review and propose new alternative paths to partially identify the parameters of interest under less restrictive assumptions."}, "https://arxiv.org/abs/2408.02594": {"title": "Time-series imputation using low-rank matrix completion", "link": "https://arxiv.org/abs/2408.02594", "description": "arXiv:2408.02594v1 Announce Type: new \nAbstract: We investigate the use of matrix completion methods for time-series imputation. Specifically, we consider low-rank completion of the block-Hankel matrix representation of a time series. Simulation experiments are used to compare the method with five recognised imputation techniques with varying levels of computational effort. The Hankel Imputation (HI) method is seen to perform competitively at interpolating missing time-series data, and shows particular potential for reproducing sharp peaks in the data."}, "https://arxiv.org/abs/2408.02667": {"title": "Evaluating and Utilizing Surrogate Outcomes in Covariate-Adjusted Response-Adaptive Designs", "link": "https://arxiv.org/abs/2408.02667", "description": "arXiv:2408.02667v1 Announce Type: new \nAbstract: This manuscript explores the intersection of surrogate outcomes and adaptive designs in statistical research. While surrogate outcomes have long been studied for their potential to substitute for long-term primary outcomes, current surrogate evaluation methods do not directly account for the potential benefits of using surrogate outcomes to adapt randomization probabilities in adaptive randomized trials that aim to learn and respond to treatment effect heterogeneity. In this context, surrogate outcomes can benefit participants in the trial directly (i.e. 
improve expected outcome of newly-enrolled participants) by allowing for more rapid adaptation of randomization probabilities, particularly when surrogates enable earlier detection of heterogeneous treatment effects and/or indicate the optimal (individualized) treatment with stronger signals. Our study introduces a novel approach for surrogate evaluation that quantifies both of these benefits in the context of sequential adaptive experiment designs. We also propose a new Covariate-Adjusted Response-Adaptive (CARA) design that incorporates an Online Superlearner to assess and adaptively choose surrogate outcomes for updating treatment randomization probabilities. We introduce a Targeted Maximum Likelihood Estimator that addresses data dependency challenges in adaptively collected data and achieves asymptotic normality under reasonable assumptions without relying on parametric model assumptions. The robust performance of our adaptive design with Online Superlearner is presented via simulations. Our framework not only contributes a method to more comprehensively quantifying the benefits of candidate surrogate outcomes and choosing between them, but also offers an easily generalizable tool for evaluating various adaptive designs and making inferences, providing insights into alternative choices of designs."}, "https://arxiv.org/abs/2408.01582": {"title": "Conformal Diffusion Models for Individual Treatment Effect Estimation and Inference", "link": "https://arxiv.org/abs/2408.01582", "description": "arXiv:2408.01582v1 Announce Type: cross \nAbstract: Estimating treatment effects from observational data is of central interest across numerous application domains. Individual treatment effect offers the most granular measure of treatment effect on an individual level, and is the most useful to facilitate personalized care. However, its estimation and inference remain underdeveloped due to several challenges. In this article, we propose a novel conformal diffusion model-based approach that addresses those intricate challenges. We integrate the highly flexible diffusion modeling, the model-free statistical inference paradigm of conformal inference, along with propensity score and covariate local approximation that tackle distributional shifts. We unbiasedly estimate the distributions of potential outcomes for individual treatment effect, construct an informative confidence interval, and establish rigorous theoretical guarantees. We demonstrate the competitive performance of the proposed method over existing solutions through extensive numerical studies."}, "https://arxiv.org/abs/2408.01617": {"title": "Review and Demonstration of a Mixture Representation for Simulation from Densities Involving Sums of Powers", "link": "https://arxiv.org/abs/2408.01617", "description": "arXiv:2408.01617v1 Announce Type: cross \nAbstract: Penalized and robust regression, especially when approached from a Bayesian perspective, can involve the problem of simulating a random variable $\\boldsymbol z$ from a posterior distribution that includes a term proportional to a sum of powers, $\\|\\boldsymbol z \\|^q_q$, on the log scale. However, many popular gradient-based methods for Markov Chain Monte Carlo simulation from such posterior distributions use Hamiltonian Monte Carlo and accordingly require conditions on the differentiability of the unnormalized posterior distribution that do not hold when $q \\leq 1$ (Plummer, 2023). 
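For orientation on the $q = 1$ special case that comes up next, the Laplace distribution's exponential scale-mixture-of-normals representation can be verified numerically; a minimal sketch (this is the classical identity behind, e.g., the Bayesian lasso, not the Devroye (2009) construction used for general $0 < q < 2$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, n = 2.0, 200_000

# Draw the latent variance V ~ Exponential(rate = lam^2 / 2), then Z | V ~ N(0, V).
v = rng.exponential(scale=2.0 / lam**2, size=n)
z = rng.normal(0.0, np.sqrt(v))

# The marginal of Z is Laplace(0, scale = 1/lam); the KS statistic should be tiny.
print(stats.kstest(z, "laplace", args=(0.0, 1.0 / lam)))
```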
This is limiting; the setting where $q \leq 1$ includes widely used sparsity-inducing penalized regression models and heavy-tailed robust regression models. In the special case where $q = 1$, a latent variable representation that facilitates simulation from such a posterior distribution is well known. However, the setting where $q < 1$ has not been treated as thoroughly. In this note, we review the availability of a latent variable representation described in Devroye (2009), show how it can be used to simulate from such posterior distributions when $0 < q < 2$, and demonstrate its utility in the context of estimating the parameters of a Bayesian penalized regression model."}, "https://arxiv.org/abs/2408.01926": {"title": "Efficient Decision Trees for Tensor Regressions", "link": "https://arxiv.org/abs/2408.01926", "description": "arXiv:2408.01926v1 Announce Type: cross \nAbstract: We propose the tensor-input tree (TT) method for scalar-on-tensor and tensor-on-tensor regression problems. We first address the scalar-on-tensor problem by proposing scalar-output regression tree models whose input variables are tensors (i.e., multi-way arrays). We devise and implement fast randomized and deterministic algorithms for efficient fitting of scalar-on-tensor trees, making TT competitive against tensor-input GP models. Based on scalar-on-tensor tree models, we extend our method to tensor-on-tensor problems using additive tree ensemble approaches. Theoretical justification and extensive experiments on real and synthetic datasets are provided to illustrate the performance of TT."}, "https://arxiv.org/abs/2408.02060": {"title": "Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection", "link": "https://arxiv.org/abs/2408.02060", "description": "arXiv:2408.02060v1 Announce Type: cross \nAbstract: We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop a test statistic that is asymptotically normal, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape."}, "https://arxiv.org/abs/2408.02106": {"title": "A Functional Data Approach for Structural Health Monitoring", "link": "https://arxiv.org/abs/2408.02106", "description": "arXiv:2408.02106v1 Announce Type: cross \nAbstract: Structural Health Monitoring (SHM) is increasingly applied in civil engineering. One of its primary purposes is detecting and assessing changes in structure conditions to increase safety and reduce potential maintenance downtime. Recent advancements, especially in sensor technology, facilitate data measurements, collection, and process automation, leading to large data streams. We propose a function-on-function regression framework for (nonlinear) modeling of the sensor data and adjusting for covariate-induced variation. Our approach is particularly suited for long-term monitoring when several months or years of training data are available. It combines highly flexible yet interpretable semi-parametric modeling with functional principal component analysis and uses the corresponding out-of-sample phase-II scores for monitoring. 
The method proposed can also be described as a combination of an ``input-output'' and an ``output-only'' method."}, "https://arxiv.org/abs/2408.02122": {"title": "Graph-Enabled Fast MCMC Sampling with an Unknown High-Dimensional Prior Distribution", "link": "https://arxiv.org/abs/2408.02122", "description": "arXiv:2408.02122v1 Announce Type: cross \nAbstract: Posterior sampling is a task of central importance in Bayesian inference. For many applications in Bayesian meta-analysis and Bayesian transfer learning, the prior distribution is unknown and needs to be estimated from samples. In practice, the prior distribution can be high-dimensional, adding to the difficulty of efficient posterior inference. In this paper, we propose a novel Markov chain Monte Carlo algorithm, which we term graph-enabled MCMC, for posterior sampling with unknown and potentially high-dimensional prior distributions. The algorithm is based on constructing a geometric graph from prior samples and subsequently uses the graph structure to guide the transition of the Markov chain. Through extensive theoretical and numerical studies, we demonstrate that our graph-enabled MCMC algorithm provides reliable approximation to the posterior distribution and is highly computationally efficient."}, "https://arxiv.org/abs/2408.02391": {"title": "Kullback-Leibler-based characterizations of score-driven updates", "link": "https://arxiv.org/abs/2408.02391", "description": "arXiv:2408.02391v1 Announce Type: cross \nAbstract: Score-driven models have been applied in some 400 published articles over the last decade. Much of this literature cites the optimality result in Blasques et al. (2015), which, roughly, states that sufficiently small score-driven updates are unique in locally reducing the Kullback-Leibler (KL) divergence relative to the true density for every observation. This is at odds with other well-known optimality results; the Kalman filter, for example, is optimal in a mean squared error sense, but may move in the wrong direction for atypical observations. We show that score-driven filters are, similarly, not guaranteed to improve the localized KL divergence at every observation. The seemingly stronger result in Blasques et al. (2015) is due to their use of an improper (localized) scoring rule. Even as a guaranteed improvement for every observation is unattainable, we prove that sufficiently small score-driven updates are unique in reducing the KL divergence relative to the true density in expectation. This positive$-$albeit weaker$-$result justifies the continued use of score-driven models and places their information-theoretic properties on solid footing."}, "https://arxiv.org/abs/1904.08895": {"title": "The uniform general signed rank test and its design sensitivity", "link": "https://arxiv.org/abs/1904.08895", "description": "arXiv:1904.08895v2 Announce Type: replace \nAbstract: A sensitivity analysis in an observational study tests whether the qualitative conclusions of an analysis would change if we were to allow for the possibility of limited bias due to confounding. The design sensitivity of a hypothesis test quantifies the asymptotic performance of the test in a sensitivity analysis against a particular alternative. We propose a new, non-asymptotic, distribution-free test, the uniform general signed rank test, for observational studies with paired data, and examine its performance under Rosenbaum's sensitivity analysis model. 
Our test can be viewed as adaptively choosing from among a large underlying family of signed rank tests, and we show that the uniform test achieves design sensitivity equal to the maximum design sensitivity over the underlying family of signed rank tests. Our test thus achieves superior, and sometimes infinite, design sensitivity, indicating it will perform well in sensitivity analyses on large samples. We support this conclusion with simulations and a data example, showing that the advantages of our test extend to moderate sample sizes as well."}, "https://arxiv.org/abs/2010.13599": {"title": "Design-Based Inference for Spatial Experiments under Unknown Interference", "link": "https://arxiv.org/abs/2010.13599", "description": "arXiv:2010.13599v5 Announce Type: replace \nAbstract: We consider design-based causal inference for spatial experiments in which treatments may have effects that bleed out and feed back in complex ways. Such spatial spillover effects violate the standard ``no interference'' assumption for standard causal inference methods. The complexity of spatial spillover effects also raises the risk of misspecification and bias in model-based analyses. We offer an approach for robust inference in such settings without having to specify a parametric outcome model. We define a spatial ``average marginalized effect'' (AME) that characterizes how, in expectation, units of observation that are a specified distance from an intervention location are affected by treatment at that location, averaging over effects emanating from other intervention nodes. We show that randomization is sufficient for non-parametric identification of the AME even if the nature of interference is unknown. Under mild restrictions on the extent of interference, we establish asymptotic distributions of estimators and provide methods for both sample-theoretic and randomization-based inference. We show conditions under which the AME recovers a structural effect. We illustrate our approach with a simulation study. Then we re-analyze a randomized field experiment and a quasi-experiment on forest conservation, showing how our approach offers robust inference on policy-relevant spillover effects."}, "https://arxiv.org/abs/2201.01010": {"title": "A Double Robust Approach for Non-Monotone Missingness in Multi-Stage Data", "link": "https://arxiv.org/abs/2201.01010", "description": "arXiv:2201.01010v2 Announce Type: replace \nAbstract: Multivariate missingness with a non-monotone missing pattern is complicated to deal with in empirical studies. The traditional Missing at Random (MAR) assumption is difficult to justify in such cases. Previous studies have strengthened the MAR assumption, suggesting that the missing mechanism of any variable is random when conditioned on a uniform set of fully observed variables. However, empirical evidence indicates that this assumption may be violated for variables collected at different stages. This paper proposes a new MAR-type assumption that fits non-monotone missing scenarios involving multi-stage variables. Based on this assumption, we construct an Augmented Inverse Probability Weighted GMM (AIPW-GMM) estimator. This estimator features an asymmetric format for the augmentation term, guarantees double robustness, and achieves the closed-form semiparametric efficiency bound. We apply this method to cases of missingness in both endogenous regressor and outcome, using the Oregon Health Insurance Experiment as an example. 
We check the correlation between missing probabilities and partially observed variables to justify the assumption. Moreover, we find that excluding incomplete data results in a loss of efficiency and insignificant estimators. The proposed estimator reduces the standard error by more than 50% for the estimated effects of the Oregon Health Plan on the elderly."}, "https://arxiv.org/abs/2211.14903": {"title": "Inference in Cluster Randomized Trials with Matched Pairs", "link": "https://arxiv.org/abs/2211.14903", "description": "arXiv:2211.14903v4 Announce Type: replace \nAbstract: This paper studies inference in cluster randomized trials where treatment status is determined according to a \"matched pairs\" design. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the cluster; by a \"matched pairs\" design, we mean that a sample of clusters is paired according to baseline, cluster-level covariates and, within each pair, one cluster is selected at random for treatment. We study the large-sample behavior of a weighted difference-in-means estimator and derive two distinct sets of results depending on if the matching procedure does or does not match on cluster size. We then propose a single variance estimator which is consistent in either regime. Combining these results establishes the asymptotic exactness of tests based on these estimators. Next, we consider the properties of two common testing procedures based on t-tests constructed from linear regressions, and argue that both are generally conservative in our framework. We additionally study the behavior of a randomization test which permutes the treatment status for clusters within pairs, and establish its finite-sample and asymptotic validity for testing specific null hypotheses. Finally, we propose a covariate-adjusted estimator which adjusts for additional baseline covariates not used for treatment assignment, and establish conditions under which such an estimator leads to strict improvements in precision. A simulation study confirms the practical relevance of our theoretical results."}, "https://arxiv.org/abs/2307.00575": {"title": "Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model", "link": "https://arxiv.org/abs/2307.00575", "description": "arXiv:2307.00575v2 Announce Type: replace \nAbstract: This paper introduces a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data. To enhance the understanding of the framework, we introduce a class of matrix-variate spiked covariance models that serve as inspiration for the development of the MOP-UP algorithm. The MOP-UP algorithm consists of two steps: Average Subspace Capture (ASC) and Alternating Projection (AP). These steps are specifically designed to capture the row-wise and column-wise dimension-reduced subspaces which contain the most informative features of the data. ASC utilizes a novel average projection operator as initialization and achieves exact recovery in the noiseless setting. We analyze the convergence and non-asymptotic error bounds of MOP-UP, introducing a blockwise matrix eigenvalue perturbation bound that proves the desired bound, where classic perturbation bounds fail. The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets. 
Lastly, we discuss generalizations of our approach to higher-order data."}, "https://arxiv.org/abs/2308.14996": {"title": "The projected dynamic linear model for time series on the sphere", "link": "https://arxiv.org/abs/2308.14996", "description": "arXiv:2308.14996v2 Announce Type: replace \nAbstract: Time series on the unit n-sphere arise in directional statistics, compositional data analysis, and many scientific fields. There are few models for such data, and the ones that exist suffer from several limitations: they are often computationally challenging to fit, many of them apply only to the circular case of n=2, and they are usually based on families of distributions that are not flexible enough to capture the complexities observed in real data. Furthermore, there is little work on Bayesian methods for spherical time series. To address these shortcomings, we propose a state space model based on the projected normal distribution that can be applied to spherical time series of arbitrary dimension. We describe how to perform fully Bayesian offline inference for this model using a simple and efficient Gibbs sampling algorithm, and we develop a Rao-Blackwellized particle filter to perform online inference for streaming data. In analyses of wind direction and energy market time series, we show that the proposed model outperforms competitors in terms of point, set, and density forecasting."}, "https://arxiv.org/abs/2309.03714": {"title": "An efficient joint model for high dimensional longitudinal and survival data via generic association features", "link": "https://arxiv.org/abs/2309.03714", "description": "arXiv:2309.03714v2 Announce Type: replace \nAbstract: This paper introduces a prognostic method called FLASH that addresses the problem of joint modelling of longitudinal data and censored durations when a large number of both longitudinal and time-independent features are available. In the literature, standard joint models are either of the shared random effect or joint latent class type. Combining ideas from both worlds and using appropriate regularisation techniques, we define a new model with the ability to automatically identify significant prognostic longitudinal features in a high-dimensional context, which is of increasing importance in many areas such as personalised medicine or churn prediction. We develop an estimation methodology based on the EM algorithm and provide an efficient implementation. The statistical performance of the method is demonstrated both in extensive Monte Carlo simulation studies and on publicly available real-world datasets. Our method significantly outperforms the state-of-the-art joint models in predicting the latent class membership probability in terms of the C-index in a so-called ``real-time'' prediction setting, with a computational speed that is orders of magnitude faster than competing methods. In addition, our model automatically identifies significant features that are relevant from a practical perspective, making it interpretable."}, "https://arxiv.org/abs/2310.06926": {"title": "Bayesian inference and cure rate modeling for event history data", "link": "https://arxiv.org/abs/2310.06926", "description": "arXiv:2310.06926v2 Announce Type: replace \nAbstract: Estimating model parameters of a general family of cure models is always a challenging task mainly due to flatness and multimodality of the likelihood function. In this work, we propose a fully Bayesian approach in order to overcome these issues. 
Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. It is demonstrated via simulations that the proposed algorithm freely explores the multimodal posterior distribution and produces robust point estimates, while it outperforms maximum likelihood estimation via the Expectation-Maximization algorithm. A by-product of our Bayesian implementation is to control the False Discovery Rate when classifying items as cured or not. Finally, the proposed method is illustrated in a real dataset which refers to recidivism for offenders released from prison; the event of interest is whether the offender was re-incarcerated after probation or not."}, "https://arxiv.org/abs/2307.01357": {"title": "Adaptive Principal Component Regression with Applications to Panel Data", "link": "https://arxiv.org/abs/2307.01357", "description": "arXiv:2307.01357v3 Announce Type: replace-cross \nAbstract: Principal component regression (PCR) is a popular technique for fixed-design error-in-variables regression, a generalization of the linear regression setting in which the observed covariates are corrupted with random noise. We provide the first time-uniform finite sample guarantees for (regularized) PCR whenever data is collected adaptively. Since the proof techniques for analyzing PCR in the fixed design setting do not readily extend to the online setting, our results rely on adapting tools from modern martingale concentration to the error-in-variables setting. We demonstrate the usefulness of our bounds by applying them to the domain of panel data, a ubiquitous setting in econometrics and statistics. As our first application, we provide a framework for experiment design in panel data settings when interventions are assigned adaptively. Our framework may be thought of as a generalization of the synthetic control and synthetic interventions frameworks, where data is collected via an adaptive intervention assignment policy. Our second application is a procedure for learning such an intervention assignment policy in a setting where units arrive sequentially to be treated. In addition to providing theoretical performance guarantees (as measured by regret), we show that our method empirically outperforms a baseline which does not leverage error-in-variables regression."}, "https://arxiv.org/abs/2312.04972": {"title": "Comparison of Probabilistic Structural Reliability Methods for Ultimate Limit State Assessment of Wind Turbines", "link": "https://arxiv.org/abs/2312.04972", "description": "arXiv:2312.04972v2 Announce Type: replace-cross \nAbstract: The probabilistic design of offshore wind turbines aims to ensure structural safety in a cost-effective way. This involves conducting structural reliability assessments for different design options and considering different structural responses. There are several structural reliability methods, and this paper will apply and compare different approaches in some simplified case studies. In particular, the well known environmental contour method will be compared to a more novel approach based on sequential sampling and Gaussian processes regression for an ultimate limit state case study. 
For one of the case studies, results will also be compared to results from a brute force simulation approach. Interestingly, the comparison turns out very differently for the two case studies. In one of the cases the environmental contour method agrees well with the sequential sampling method, but in the other, results vary considerably. This can probably be explained by the violation of some of the assumptions associated with the environmental contour approach, i.e. that the short-term variability of the response is large compared to the long-term variability of the environmental conditions. Results from this simple comparison study suggest that the sequential sampling method can be a robust and computationally effective approach for structural reliability assessment."}, "https://arxiv.org/abs/2408.02757": {"title": "A nonparametric test for diurnal variation in spot correlation processes", "link": "https://arxiv.org/abs/2408.02757", "description": "arXiv:2408.02757v1 Announce Type: new \nAbstract: The association between log-price increments of exchange-traded equities, as measured by their spot correlation estimated from high-frequency data, exhibits a pronounced upward-sloping and almost piecewise linear relationship at the intraday horizon. There is notably lower (on average less positive) correlation in the morning than in the afternoon. We develop a nonparametric testing procedure to detect such deterministic variation in a correlation process. The test statistic has a known distribution under the null hypothesis, whereas it diverges under the alternative. It is robust against stochastic correlation. We run a Monte Carlo simulation to assess the finite sample properties of the test statistic, which are close to the large sample predictions, even for small sample sizes and realistic levels of diurnal variation. In an application, we implement the test on a monthly basis for a high-frequency dataset covering the stock market over an extended period. The test leads to rejection of the null most of the time. This suggests diurnal variation in the correlation process is a nontrivial effect in practice."}, "https://arxiv.org/abs/2408.02770": {"title": "Measuring the Impact of New Risk Factors Within Survival Models", "link": "https://arxiv.org/abs/2408.02770", "description": "arXiv:2408.02770v1 Announce Type: new \nAbstract: Survival is poor for patients with metastatic cancer, and it is vital to examine new biomarkers that can improve patient prognostication and identify those who would benefit from more aggressive therapy. In metastatic prostate cancer, two new assays have become available: one that quantifies the number of cancer cells circulating in the peripheral blood, and the other a marker of the aggressiveness of the disease. It is critical to determine the magnitude of the effect of these biomarkers on the discrimination of a model-based risk score. To do so, analysts frequently consider the discrimination of two separate survival models: one that includes both the new and standard factors and a second that includes the standard factors alone. However, this analysis is ultimately incorrect for many of the scale-transformation models ubiquitous in survival analysis, as the reduced model is misspecified if the full model is specified correctly. To circumvent this issue, we developed a projection-based approach to estimate the impact of the two prostate cancer biomarkers. 
The results indicate that the new biomarkers can influence model discrimination and justify their inclusion in the risk model; however, the hunt remains for an applicable model to risk-stratify patients with metastatic prostate cancer."}, "https://arxiv.org/abs/2408.02821": {"title": "Continuous Monitoring via Repeated Significance", "link": "https://arxiv.org/abs/2408.02821", "description": "arXiv:2408.02821v1 Announce Type: new \nAbstract: Requiring statistical significance at multiple interim analyses to declare a statistically significant result for an AB test allows less stringent requirements for significance at each interim analysis. Requiring repeated significance competes well with methods built on assumptions about the test -- assumptions that may be impossible to evaluate a priori and may require extra data to evaluate empirically.\n Instead, requiring repeated significance allows the data itself to prove directly that the required results are not due to chance alone. We explain how to apply tests with repeated significance to continuously monitor unbounded tests -- tests that do not have an a priori bound on running time or number of observations. We show that it is impossible to maintain a constant requirement for significance for unbounded tests, but that we can come arbitrarily close to that goal."}, "https://arxiv.org/abs/2408.02830": {"title": "Setting the duration of online A/B experiments", "link": "https://arxiv.org/abs/2408.02830", "description": "arXiv:2408.02830v1 Announce Type: new \nAbstract: In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as T gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 -- as for experiments where users shuffle in and out of the experiment across days -- the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube."}, "https://arxiv.org/abs/2408.03024": {"title": "Weighted shape-constrained estimation for the autocovariance sequence from a reversible Markov chain", "link": "https://arxiv.org/abs/2408.03024", "description": "arXiv:2408.03024v1 Announce Type: new \nAbstract: We present a novel weighted $\\ell_2$ projection method for estimating autocovariance sequences and spectral density functions from reversible Markov chains. 
Berg and Song (2023) introduced a least-squares shape-constrained estimation approach for the autocovariance function by projecting an initial estimate onto a shape-constrained space using an $\\ell_2$ projection. While the least-squares objective is commonly used in shape-constrained regression, it can be suboptimal due to correlation and unequal variances in the input function. To address this, we propose a weighted least-squares method that defines a weighted norm on transformed data. Specifically, we transform an input autocovariance sequence into the Fourier domain and apply weights based on the asymptotic variance of the sample periodogram, leveraging the asymptotic independence of periodogram ordinates. Our proposal can equivalently be viewed as estimating a spectral density function by applying shape constraints to its Fourier series. We demonstrate that our weighted approach yields strongly consistent estimates for both the spectral density and the autocovariance sequence. Empirical studies show its effectiveness in uncertainty quantification for Markov chain Monte Carlo estimation, outperforming the unweighted moment LS estimator and other state-of-the-art methods."}, "https://arxiv.org/abs/2408.03137": {"title": "Efficient Asymmetric Causality Tests", "link": "https://arxiv.org/abs/2408.03137", "description": "arXiv:2408.03137v1 Announce Type: new \nAbstract: Asymmetric causality tests are increasingly gaining popularity in different scientific fields. This approach corresponds better to reality since logical reasons behind asymmetric behavior exist and need to be considered in empirical investigations. Hatemi-J (2012) introduced the asymmetric causality tests via partial cumulative sums for positive and negative components of the variables operating within the vector autoregressive (VAR) model. However, since the residuals across the equations in the VAR model are not independent, the ordinary least squares method for estimating the parameters is not efficient. Additionally, asymmetric causality tests mean having different causal parameters (i.e., for positive or negative components); thus, it is crucial to assess not only whether these causal parameters are individually statistically significant, but also whether their difference is statistically significant. Consequently, tests of the difference between estimated causal parameters should explicitly be conducted, which are neglected in the existing literature. The purpose of the current paper is to deal with these issues explicitly. An application is provided, and ten different hypotheses pertinent to the asymmetric causal interaction between the two largest financial markets worldwide are efficiently tested within a multivariate setting."}, "https://arxiv.org/abs/2408.03138": {"title": "Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data", "link": "https://arxiv.org/abs/2408.03138", "description": "arXiv:2408.03138v1 Announce Type: new \nAbstract: It is crucial to assess the predictive performance of a model in order to establish its practicality and relevance in real-world scenarios, particularly for high-dimensional data analysis. Among data splitting or resampling methods, cross-validation (CV) is extensively used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model among competing alternatives. 
K-fold cross-validation is a popular CV method, but its limitation is that the risk estimates are highly dependent on the partitioning of the data (for training and testing). Here, the issues regarding the reproducibility of the K-fold CV estimator are demonstrated in hypothesis testing, wherein different partitions lead to notably disparate conclusions. This study presents an alternative novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation for determining the difference in prediction error between two model-fitting algorithms. A naive implementation of the exhaustive nested cross-validation is computationally costly. Here, we address concerns regarding computational complexity by devising a computationally tractable closed-form expression for the proposed cross-validation estimator using ridge regularization. Our study also investigates strategies aimed at enhancing statistical power within high-dimensional scenarios while controlling the Type I error rate. To illustrate the practical utility of our method, we apply it to an RNA sequencing study and demonstrate its effectiveness in the context of biological data analysis."}, "https://arxiv.org/abs/2408.03268": {"title": "Regression analysis of elliptically symmetric direction data", "link": "https://arxiv.org/abs/2408.03268", "description": "arXiv:2408.03268v1 Announce Type: new \nAbstract: A comprehensive toolkit is developed for regression analysis of directional data based on a flexible class of angular Gaussian distributions. Informative testing procedures for isotropy and covariate effects on the directional response are proposed. Moreover, a prediction region that achieves the smallest volume in a class of ellipsoidal prediction regions of the same coverage probability is constructed. The efficacy of these inference procedures is demonstrated in simulation experiments. Finally, this new toolkit is used to analyze directional data originating from a hydrology study and a bioinformatics application."}, "https://arxiv.org/abs/2408.02679": {"title": "Visual Analysis of Multi-outcome Causal Graphs", "link": "https://arxiv.org/abs/2408.02679", "description": "arXiv:2408.02679v1 Announce Type: cross \nAbstract: We introduce a visual analysis method for multiple causal graphs with different outcome variables, namely, multi-outcome causal graphs. Multi-outcome causal graphs are important in healthcare for understanding multimorbidity and comorbidity. To support the visual analysis, we collaborated with medical experts to devise two comparative visualization techniques at different stages of the analysis process. First, a progressive visualization method is proposed for comparing multiple state-of-the-art causal discovery algorithms. The method can handle mixed-type datasets comprising both continuous and categorical variables and assist in the creation of a fine-tuned causal graph of a single outcome. Second, a comparative graph layout technique and specialized visual encodings are devised for the quick comparison of multiple causal graphs. In our visual analysis approach, analysts start by building individual causal graphs for each outcome variable, and then, multi-outcome causal graphs are generated and visualized with our comparative technique for analyzing differences and commonalities of these causal graphs. 
Evaluation includes quantitative measurements on benchmark datasets, a case study with a medical expert, and expert user studies with real-world health research data."}, "https://arxiv.org/abs/2408.03051": {"title": "The multivariate fractional Ornstein-Uhlenbeck process", "link": "https://arxiv.org/abs/2408.03051", "description": "arXiv:2408.03051v1 Announce Type: cross \nAbstract: Starting from the notion of multivariate fractional Brownian Motion introduced in [F. Lavancier, A. Philippe, and D. Surgailis. Covariance function of vector self-similar processes. Statistics & Probability Letters, 2009] we define a multivariate version of the fractional Ornstein-Uhlenbeck process. This multivariate Gaussian process is stationary, ergodic and allows for different Hurst exponents on each component. We characterize its correlation matrix and its short and long time asymptotics. Besides the marginal parameters, the cross correlation between one-dimensional marginal components is ruled by two parameters. We consider the problem of their inference, proposing two types of estimator, constructed from discrete observations of the process. We establish their asymptotic theory, in one case in the long time asymptotic setting, in the other case in the infill and long time asymptotic setting. The limit behavior can be asymptotically Gaussian or non-Gaussian, depending on the values of the Hurst exponents of the marginal components. The technical core of the paper relies on the analysis of asymptotic properties of functionals of Gaussian processes, that we establish using Malliavin calculus and Stein's method. We provide numerical experiments that support our theoretical analysis and also suggest a conjecture on the application of one of these estimators to the multivariate fractional Brownian Motion."}, "https://arxiv.org/abs/2306.15581": {"title": "Advances in projection predictive inference", "link": "https://arxiv.org/abs/2306.15581", "description": "arXiv:2306.15581v2 Announce Type: replace \nAbstract: The concepts of Bayesian prediction, model comparison, and model selection have developed significantly over the last decade. As a result, the Bayesian community has witnessed a rapid growth in theoretical and applied contributions to building and selecting predictive models. Projection predictive inference in particular has shown promise to this end, finding application across a broad range of fields. It is less prone to over-fitting than na\\\"ive selection based purely on cross-validation or information criteria performance metrics, and has been known to out-perform other methods in terms of predictive performance. We survey the core concept and contemporary contributions to projection predictive inference, and present a safe, efficient, and modular workflow for prediction-oriented model selection therein. We also provide an interpretation of the projected posteriors achieved by projection predictive inference in terms of their limitations in causal settings."}, "https://arxiv.org/abs/2307.10694": {"title": "PySDTest: a Python/Stata Package for Stochastic Dominance Tests", "link": "https://arxiv.org/abs/2307.10694", "description": "arXiv:2307.10694v2 Announce Type: replace \nAbstract: We introduce PySDTest, a Python/Stata package for statistical tests of stochastic dominance. PySDTest implements various testing procedures such as Barrett and Donald (2003), Linton et al. (2005), Linton et al. (2010), and Donald and Hsu (2016), along with their extensions. 
Users can flexibly combine several resampling methods and test statistics, including the numerical delta method (D\\\"umbgen, 1993; Hong and Li, 2018; Fang and Santos, 2019). The package allows for testing advanced hypotheses on stochastic dominance relations, such as stochastic maximality among multiple prospects. We first provide an overview of the concepts of stochastic dominance and testing methods. Then, we offer practical guidance for using the package and the Stata command pysdtest. We apply PySDTest to investigate the portfolio choice problem between the daily returns of Bitcoin and the S&P 500 index as an empirical illustration. Our findings indicate that the S&P 500 index returns second-order stochastically dominate the Bitcoin returns."}, "https://arxiv.org/abs/2401.01833": {"title": "Credible Distributions of Overall Ranking of Entities", "link": "https://arxiv.org/abs/2401.01833", "description": "arXiv:2401.01833v2 Announce Type: replace \nAbstract: Inference on overall ranking of a set of entities, such as athletes or players, schools and universities, hospitals, cities, restaurants, movies or books, companies, states, countries or subpopulations, based on appropriate characteristics or performances, is an important problem. Estimation of ranks based on point estimates of means does not account for the uncertainty in those estimates. Treating estimated ranks without any regard for uncertainty is problematic. We propose a novel solution using the Bayesian approach. It is easily implementable, competitive with a popular frequentist method, more effective and informative. Using suitable joint credible sets for entity means, we appropriately create {\\it credible distributions} (CDs, a phrase we coin), which are probability distributions, for the rank vector of entities. As a byproduct, the supports of the CDs are credible sets for overall ranking. We evaluate our proposed procedure in terms of accuracy and stability using a number of applications and a simulation study. While the frequentist approach cannot utilize covariates, the proposed method handles them routinely to its benefit."}, "https://arxiv.org/abs/2408.03415": {"title": "A Novel Approximate Bayesian Inference Method for Compartmental Models in Epidemiology using Stan", "link": "https://arxiv.org/abs/2408.03415", "description": "arXiv:2408.03415v1 Announce Type: new \nAbstract: Mechanistic compartmental models are widely used in epidemiology to study the dynamics of infectious disease transmission. These models have significantly contributed to designing and evaluating effective control strategies during pandemics. However, the increasing complexity and the number of parameters needed to describe rapidly evolving transmission scenarios present significant challenges for parameter estimation due to intractable likelihoods. To overcome this issue, likelihood-free methods have proven effective for accurately and efficiently fitting these models to data. In this study, we focus on approximate Bayesian computation (ABC) and synthetic likelihood methods for parameter inference. We develop a method that employs ABC to select the most informative subset of summary statistics, which are then used to construct a synthetic likelihood for posterior sampling. Posterior sampling is performed using Hamiltonian Monte Carlo as implemented in the Stan software. 
The proposed algorithm is demonstrated through simulation studies, showing promising results for inference in a simulated epidemic scenario."}, "https://arxiv.org/abs/2408.03463": {"title": "Identifying treatment response subgroups in observational time-to-event data", "link": "https://arxiv.org/abs/2408.03463", "description": "arXiv:2408.03463v1 Announce Type: new \nAbstract: Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily focus on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. Furthermore, the patient cohort of an RCT is often constrained by cost, and is not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. Therefore, when applied to observational studies, such approaches suffer from significant statistical biases because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes."}, "https://arxiv.org/abs/2408.03530": {"title": "Robust Identification in Randomized Experiments with Noncompliance", "link": "https://arxiv.org/abs/2408.03530", "description": "arXiv:2408.03530v1 Announce Type: new \nAbstract: This paper considers a robust identification of causal parameters in a randomized experiment setting with noncompliance where the standard local average treatment effect assumptions could be violated. Following Li, K\\'edagni, and Mourifi\\'e (2024), we propose a misspecification robust bound for a real-valued vector of various causal parameters. We discuss identification under two sets of weaker assumptions: random assignment and exclusion restriction (without monotonicity), and random assignment and monotonicity (without exclusion restriction). We introduce two causal parameters: the local average treatment-controlled direct effect (LATCDE), and the local average instrument-controlled direct effect (LAICDE). Under the random assignment and monotonicity assumptions, we derive sharp bounds on the local average treatment-controlled direct effects for the always-takers and never-takers, respectively, and the total average controlled direct effect for the compliers. Additionally, we show that the intent-to-treat effect can be expressed as a convex weighted average of these three effects. 
Finally, we apply our method to the proximity to college instrument and find that growing up near a four-year college increases the wage of never-takers (who represent more than 70% of the population) by a range of 4.15% to 27.07%."}, "https://arxiv.org/abs/2408.03590": {"title": "Sensitivity analysis using the Metamodel of Optimal Prognosis", "link": "https://arxiv.org/abs/2408.03590", "description": "arXiv:2408.03590v1 Announce Type: new \nAbstract: In real-world applications within the virtual prototyping process, it is not always possible to reduce the complexity of the physical models and to obtain numerical models which can be solved quickly. Usually, every single numerical simulation takes hours or even days. Despite the progress in numerical methods and high performance computing, in such cases it is not possible to explore various model configurations; hence, efficient surrogate models are required. Generally, the available meta-model techniques show several advantages and disadvantages depending on the investigated problem. In this paper we present an automatic approach for the selection of the most suitable meta-model for the problem at hand. Together with an automatic reduction of the variable space using advanced filter techniques, an efficient approximation is enabled even for high-dimensional problems. These filter techniques enable a reduction of the high-dimensional variable space to a much smaller subspace, where meta-model-based sensitivity analyses are carried out to assess the influence of important variables and to identify the optimal subspace with a corresponding surrogate model that enables the most accurate probabilistic analysis. For this purpose we investigate variance-based and moment-free sensitivity measures in combination with advanced meta-models such as moving least squares and kriging."}, "https://arxiv.org/abs/2408.03602": {"title": "Piecewise Constant Hazard Estimation with the Fused Lasso", "link": "https://arxiv.org/abs/2408.03602", "description": "arXiv:2408.03602v1 Announce Type: new \nAbstract: In applied time-to-event analysis, a flexible parametric approach is to model the hazard rate as a piecewise constant function of time. However, the change points and values of the piecewise constant hazard are usually unknown and need to be estimated. In this paper, we develop a fully data-driven procedure for piecewise constant hazard estimation. We work in a general counting process framework which nests a wide range of popular models in time-to-event analysis including Cox's proportional hazards model with potentially high-dimensional covariates, competing risks models as well as more general multi-state models. To construct our estimator, we set up a regression model for the increments of the Breslow estimator and then use fused lasso techniques to approximate the piecewise constant signal in this regression model. In the theoretical part of the paper, we derive the convergence rate of our estimator as well as some results on how well the change points of the piecewise constant hazard are approximated by our method. 
We complement the theory by both simulations and a real data example, illustrating that our results apply in rather general event histories such as multi-state models."}, "https://arxiv.org/abs/2408.03738": {"title": "Parameter estimation for the generalized extreme value distribution: a method that combines bootstrapping and r largest order statistics", "link": "https://arxiv.org/abs/2408.03738", "description": "arXiv:2408.03738v1 Announce Type: new \nAbstract: A critical problem in extreme value theory (EVT) is the estimation of parameters for the limit probability distributions. Block maxima (BM), an approach in EVT that seeks estimates of parameters of the generalized extreme value distribution (GEV), can be generalized to take into account not just the maximum realization from a given dataset, but the r largest order statistics for a given r. In this work we propose a parameter estimation method that combines the r largest order statistic (r-LOS) extension of BM with permutation bootstrapping: surrogate realizations are obtained by randomly reordering the original data set, and then r-LOS is applied to these shuffled measurements - the mean estimate computed from these surrogate realizations is the desired estimate. We used synthetic observations and real meteorological time series to verify the performance of our method; we found that the combination of r-LOS and bootstrapping resulted in estimates more accurate than when either approach was implemented separately."}, "https://arxiv.org/abs/2408.03777": {"title": "Combining BART and Principal Stratification to estimate the effect of intermediate on primary outcomes with application to estimating the effect of family planning on employment in sub-Saharan Africa", "link": "https://arxiv.org/abs/2408.03777", "description": "arXiv:2408.03777v1 Announce Type: new \nAbstract: There is interest in learning about the causal effect of family planning (FP) on empowerment related outcomes. Experimental data related to this question are available from trials in which FP programs increase access to FP. While program assignment is unconfounded, FP uptake and subsequent empowerment may share common causes. We use principal stratification to estimate the causal effect of an intermediate FP outcome on a primary outcome of interest, among women affected by a FP program. Within strata defined by the potential reaction to the program, FP uptake is unconfounded. To minimize the need for parametric assumptions, we propose to use Bayesian Additive Regression Trees (BART) for modeling stratum membership and outcomes of interest. We refer to the combined approach as Prince BART. We evaluate Prince BART through a simulation study and use it to assess the causal effect of modern contraceptive use on employment in six cities in Nigeria, based on quasi-experimental data from a FP program trial during the first half of the 2010s. We show that findings differ between Prince BART and alternative modeling approaches based on parametric assumptions."}, "https://arxiv.org/abs/2408.03930": {"title": "Robust Estimation of Regression Models with Potentially Endogenous Outliers via a Modern Optimization Lens", "link": "https://arxiv.org/abs/2408.03930", "description": "arXiv:2408.03930v1 Announce Type: new \nAbstract: This paper addresses the robust estimation of linear regression models in the presence of potentially endogenous outliers. 
Through Monte Carlo simulations, we demonstrate that existing $L_1$-regularized estimation methods, including the Huber estimator and the least absolute deviation (LAD) estimator, exhibit significant bias when outliers are endogenous. Motivated by this finding, we investigate $L_0$-regularized estimation methods. We propose systematic heuristic algorithms, notably an iterative hard-thresholding algorithm and a local combinatorial search refinement, to solve the combinatorial optimization problem of the \\(L_0\\)-regularized estimation efficiently. Our Monte Carlo simulations yield two key results: (i) The local combinatorial search algorithm substantially improves solution quality compared to the initial projection-based hard-thresholding algorithm while offering greater computational efficiency than directly solving the mixed integer optimization problem. (ii) The $L_0$-regularized estimator demonstrates superior performance in terms of bias reduction, estimation accuracy, and out-of-sample prediction errors compared to $L_1$-regularized alternatives. We illustrate the practical value of our method through an empirical application to stock return forecasting."}, "https://arxiv.org/abs/2408.03425": {"title": "Sequential Conditional Transport on Probabilistic Graphs for Interpretable Counterfactual Fairness", "link": "https://arxiv.org/abs/2408.03425", "description": "arXiv:2408.03425v1 Announce Type: cross \nAbstract: In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, as suggested in Ple\\v{c}ko and Meinshausen (2020) and optimal transport, as in De Lara et al. (2024). We extend \"Knothe's rearrangement\" Bonnotte (2013) and \"triangular transport\" Zech and Marzouk (2022a) to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss individual fairness. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets."}, "https://arxiv.org/abs/2408.03608": {"title": "InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation", "link": "https://arxiv.org/abs/2408.03608", "description": "arXiv:2408.03608v1 Announce Type: cross \nAbstract: Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training ones. Domain generalization (DG) solves this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, homeostatic score, through causal perturbation (HoPer) to construct a prototype classifier in test time. 
Experimental results across multiple cross-domain tasks confirm the efficacy of InPer."}, "https://arxiv.org/abs/2408.03626": {"title": "On the choice of the non-trainable internal weights in random feature maps", "link": "https://arxiv.org/abs/2408.03626", "description": "arXiv:2408.03626v1 Announce Type: cross \nAbstract: The computationally cheap machine learning architecture of random feature maps can be viewed as a single-layer feedforward network in which the weights of the hidden layer are random but fixed and only the outer weights are learned via linear regression. The internal weights are typically chosen from a prescribed distribution. The choice of the internal weights significantly impacts the accuracy of random feature maps. We address here the task of how to best select the internal weights. In particular, we consider the forecasting problem whereby random feature maps are used to learn a one-step propagator map for a dynamical system. We provide a computationally cheap hit-and-run algorithm to select good internal weights which lead to good forecasting skill. We show that the number of good features is the main factor controlling the forecasting skill of random feature maps and acts as an effective feature dimension. Lastly, we compare random feature maps with single-layer feedforward neural networks in which the internal weights are now learned using gradient descent. We find that random feature maps have superior forecasting capabilities whilst having several orders of magnitude lower computational cost."}, "https://arxiv.org/abs/2103.02235": {"title": "Prewhitened Long-Run Variance Estimation Robust to Nonstationarity", "link": "https://arxiv.org/abs/2103.02235", "description": "arXiv:2103.02235v3 Announce Type: replace \nAbstract: We introduce a nonparametric nonlinear VAR prewhitened long-run variance (LRV) estimator for the construction of standard errors robust to autocorrelation and heteroskedasticity that can be used for hypothesis testing in a variety of contexts including the linear regression model. Existing methods either are theoretically valid only under stationarity and have poor finite-sample properties under nonstationarity (i.e., fixed-b methods), or are theoretically valid under the null hypothesis but lead to tests that are not consistent under nonstationary alternative hypothesis (i.e., both fixed-b and traditional HAC estimators). The proposed estimator accounts explicitly for nonstationarity, unlike previous prewhitened procedures which are known to be unreliable, and leads to tests with accurate null rejection rates and good monotonic power. We also establish MSE bounds for LRV estimation that are sharper than previously established and use them to determine the data-dependent bandwidths."}, "https://arxiv.org/abs/2103.02981": {"title": "Theory of Evolutionary Spectra for Heteroskedasticity and Autocorrelation Robust Inference in Possibly Misspecified and Nonstationary Models", "link": "https://arxiv.org/abs/2103.02981", "description": "arXiv:2103.02981v2 Announce Type: replace \nAbstract: We develop a theory of evolutionary spectra for heteroskedasticity and autocorrelation robust (HAR) inference when the data may not satisfy second-order stationarity. Nonstationarity is a common feature of economic time series which may arise either from parameter variation or model misspecification. In such a context, the theories that support HAR inference are either not applicable or do not provide accurate approximations. 
HAR tests standardized by existing long-run variance estimators then may display size distortions and little or no power. This issue can be more severe for methods that use long bandwidths (i.e., fixed-b HAR tests). We introduce a class of nonstationary processes that have a time-varying spectral representation which evolves continuously except at a finite number of time points. We present an extension of the classical heteroskedasticity and autocorrelation consistent (HAC) estimators that applies two smoothing procedures. One is over the lagged autocovariances, akin to classical HAC estimators, and the other is over time. The latter element is important to flexibly account for nonstationarity. We name them double kernel HAC (DK-HAC) estimators. We show the consistency of the estimators and obtain an optimal DK-HAC estimator under the mean squared error (MSE) criterion. Overall, HAR tests standardized by the proposed DK-HAC estimators are competitive with fixed-b HAR tests, when the latter work well, with regard to size control even when there is strong dependence. Notably, in those empirically relevant situations in which previous HAR tests are undersized and have little or no power, the DK-HAC estimator leads to tests that have good size and power."}, "https://arxiv.org/abs/2111.14590": {"title": "The Fixed-b Limiting Distribution and the ERP of HAR Tests Under Nonstationarity", "link": "https://arxiv.org/abs/2111.14590", "description": "arXiv:2111.14590v2 Announce Type: replace \nAbstract: We show that the nonstandard limiting distribution of HAR test statistics under fixed-b asymptotics is not pivotal (even after studentization) when the data are nonstationary. It takes the form of a complicated function of Gaussian processes and depends on the integrated local long-run variance and on the second moments of the relevant series (e.g., of the regressors and errors for the case of the linear regression model). Hence, existing fixed-b inference methods based on stationarity are not theoretically valid in general. The nuisance parameters entering the fixed-b limiting distribution can be consistently estimated under small-b asymptotics but only with a nonparametric rate of convergence. Hence, we show that the error in rejection probability (ERP) is an order of magnitude larger than that under stationarity and is also larger than that of HAR tests based on HAC estimators under conventional asymptotics. These theoretical results reconcile with recent finite-sample evidence in Casini (2021) and Casini, Deng and Perron (2021) who show that fixed-b HAR tests can perform poorly when the data are nonstationary. They can be conservative under the null hypothesis and have non-monotonic power under the alternative hypothesis irrespective of how large the sample size is."}, "https://arxiv.org/abs/2211.16121": {"title": "Bayesian Multivariate Quantile Regression with alternative Time-varying Volatility Specifications", "link": "https://arxiv.org/abs/2211.16121", "description": "arXiv:2211.16121v2 Announce Type: replace \nAbstract: This article proposes a novel Bayesian multivariate quantile regression to forecast the tail behavior of energy commodities, where the homoskedasticity assumption is relaxed to allow for time-varying volatility. 
In particular, we exploit the mixture representation of the multivariate asymmetric Laplace likelihood and the Cholesky-type decomposition of the scale matrix to introduce stochastic volatility and GARCH processes and then provide an efficient MCMC to estimate them. The proposed models outperform the homoskedastic benchmark mainly when predicting the distribution's tails. We provide a model combination using a quantile score-based weighting scheme, which leads to improved performances, notably when no single model uniformly outperforms the other across quantiles, time, or variables."}, "https://arxiv.org/abs/2307.15348": {"title": "The curse of isotropy: from principal components to principal subspaces", "link": "https://arxiv.org/abs/2307.15348", "description": "arXiv:2307.15348v3 Announce Type: replace \nAbstract: This paper raises an important issue about the interpretation of principal component analysis. The curse of isotropy states that a covariance matrix with repeated eigenvalues yields rotation-invariant eigenvectors. In other words, principal components associated with equal eigenvalues show large intersample variability and are arbitrary combinations of potentially more interpretable components. However, empirical eigenvalues are never exactly equal in practice due to sampling errors. Therefore, most users overlook the problem. In this paper, we propose to identify datasets that are likely to suffer from the curse of isotropy by introducing a generative Gaussian model with repeated eigenvalues and comparing it to traditional models via the principle of parsimony. This yields an explicit criterion to detect the curse of isotropy in practice. We notably argue that in a dataset with 1000 samples, all the eigenvalue pairs with a relative eigengap lower than 21% should be assumed equal. This demonstrates that the curse of isotropy cannot be overlooked. In this context, we propose to transition from fuzzy principal components to much-more-interpretable principal subspaces. The final methodology (principal subspace analysis) is extremely simple and shows promising results on a variety of datasets from different fields."}, "https://arxiv.org/abs/2309.03742": {"title": "Efficient estimation and correction of selection-induced bias with order statistics", "link": "https://arxiv.org/abs/2309.03742", "description": "arXiv:2309.03742v3 Announce Type: replace \nAbstract: Model selection aims to identify a sufficiently well performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are marred by excessive noise. In finite data regimes, cross-validated estimates can encourage the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible in the case of few models, when the pool of candidates grows, and model selection decisions are compounded (as in step-wise selection), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach to estimate and correct selection-induced bias based on order statistics. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. 
This work represents a light-weight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretic assumptions, and we provide a diagnostic to help understand when these may not be valid and when to fall back on safer, albeit more computationally expensive approaches. The accompanying code facilitates its practical implementation and fosters further exploration in this area."}, "https://arxiv.org/abs/2311.15988": {"title": "A novel CFA+EFA model to detect aberrant respondents", "link": "https://arxiv.org/abs/2311.15988", "description": "arXiv:2311.15988v2 Announce Type: replace \nAbstract: Aberrant respondents are common but yet extremely detrimental to the quality of social surveys or questionnaires. Recently, factor mixture models have been employed to identify individuals providing deceptive or careless responses. We propose a comprehensive factor mixture model for continuous outcomes that combines confirmatory and exploratory factor models to classify both the non-aberrant and aberrant respondents. The flexibility of the proposed {classification model} allows for the identification of two of the most common aberrant response styles, namely faking and careless responding. We validated our approach by means of two simulations and two case studies. The results indicate the effectiveness of the proposed model in dealing with aberrant responses in social and behavioural surveys."}, "https://arxiv.org/abs/2312.17623": {"title": "Decision Theory for Treatment Choice Problems with Partial Identification", "link": "https://arxiv.org/abs/2312.17623", "description": "arXiv:2312.17623v2 Announce Type: replace \nAbstract: We apply classical statistical decision theory to a large class of treatment choice problems with partial identification, revealing important theoretical and practical challenges but also interesting research opportunities. The challenges are: In a general class of problems with Gaussian likelihood, all decision rules are admissible; it is maximin-welfare optimal to ignore all data; and, for severe enough partial identification, there are infinitely many minimax-regret optimal decision rules, all of which sometimes randomize the policy recommendation. The opportunities are: We introduce a profiled regret criterion that can reveal important differences between rules and render some of them inadmissible; and we uniquely characterize the minimax-regret optimal rule that least frequently randomizes. We apply our results to aggregation of experimental estimates for policy adoption, to extrapolation of Local Average Treatment Effects, and to policy making in the presence of omitted variable bias."}, "https://arxiv.org/abs/2106.02031": {"title": "Change-Point Analysis of Time Series with Evolutionary Spectra", "link": "https://arxiv.org/abs/2106.02031", "description": "arXiv:2106.02031v3 Announce Type: replace-cross \nAbstract: This paper develops change-point methods for the spectrum of a locally stationary time series. We focus on series with a bounded spectral density that change smoothly under the null hypothesis but exhibits change-points or becomes less smooth under the alternative. We address two local problems. The first is the detection of discontinuities (or breaks) in the spectrum at unknown dates and frequencies. The second involves abrupt yet continuous changes in the spectrum over a short time period at an unknown frequency without signifying a break. 
Both problems can be cast as changes in the degree of smoothness of the spectral density over time. We consider estimation and minimax-optimal testing. We determine the optimal rate for the minimax distinguishable boundary, i.e., the minimum break magnitude such that we are able to uniformly control type I and type II errors. We propose a novel procedure for the estimation of the change-points based on a wild sequential top-down algorithm and show its consistency under shrinking shifts and a possibly growing number of change-points. Our method can be used across many fields and a companion program is made available in popular software packages."}, "https://arxiv.org/abs/2408.04056": {"title": "Testing for a general changepoint in psychometric studies: changes detection and sample size planning", "link": "https://arxiv.org/abs/2408.04056", "description": "arXiv:2408.04056v1 Announce Type: new \nAbstract: This paper introduces a new method for change detection in psychometric studies based on the recently introduced pseudo Score statistic, for which the sampling distribution under the alternative hypothesis has been determined. Our approach has the advantage of simplicity in its computation, eliminating the need for resampling or simulations to obtain critical values. Additionally, it comes with a known null/alternative distribution, facilitating easy calculations for power levels and sample size planning. The paper also discusses power analysis in segmented regression, namely the estimation of sample size or power level when the study data being collected focuses on a covariate expected to affect the mean response via a piecewise relationship with an unknown breakpoint. We present simulation results showing that our method outperforms other Tests for a Change Point (TFCP) with both normally distributed and binary data, and we carry out an analysis of real SAT Critical Reading data. The proposed test contributes to the framework of psychometric research, and it is available on the Comprehensive R Archive Network (CRAN) and in a more user-friendly Shiny App, both illustrated at the end of the paper."}, "https://arxiv.org/abs/2408.04176": {"title": "A robust approach for generalized linear models based on maximum Lq-likelihood procedure", "link": "https://arxiv.org/abs/2408.04176", "description": "arXiv:2408.04176v1 Announce Type: new \nAbstract: In this paper we propose a procedure for robust estimation in the context of generalized linear models based on the maximum Lq-likelihood method. Alongside this, an estimation algorithm that represents a natural extension of the usual iteratively weighted least squares method in generalized linear models is presented. It is through the discussion of the asymptotic distribution of the proposed estimator and a set of statistics for testing linear hypotheses that it is possible to define standardized residuals using the mean-shift outlier model. In addition, robust versions of the deviance function and the Akaike information criterion are defined with the aim of providing tools for model selection. Finally, the performance of the proposed methodology is illustrated through a simulation study and analysis of a real dataset."}, "https://arxiv.org/abs/2408.04213": {"title": "Hypothesis testing for general network models", "link": "https://arxiv.org/abs/2408.04213", "description": "arXiv:2408.04213v1 Announce Type: new \nAbstract: Network data have attracted considerable attention in modern statistics. 
In research on complex network data, one key issue is finding the underlying connection structure given a network sample. The methods that have been proposed in the literature usually assume that the underlying structure is a known model. In practice, however, the true model is usually unknown, and network learning procedures based on these methods may suffer from model misspecification. To handle this issue, based on random matrix theory, we first give a spectral property of the normalized adjacency matrix under a mild condition. Further, we establish a general goodness-of-fit test procedure for unweighted and undirected networks. We prove that the null distribution of the proposed statistic converges in distribution to the standard normal distribution. Theoretically, this testing procedure is suitable for nearly all popular network models, such as stochastic block models and latent space models. Further, we apply the proposed method to the degree-corrected mixed membership model and give a sequential estimator of the number of communities. Both simulation studies and real-world data examples indicate that the proposed method works well."}, "https://arxiv.org/abs/2408.04327": {"title": "BayesFBHborrow: An R Package for Bayesian borrowing for time-to-event data from a flexible baseline hazard", "link": "https://arxiv.org/abs/2408.04327", "description": "arXiv:2408.04327v1 Announce Type: new \nAbstract: There is currently a focus on statistical methods which can use external trial information to help accelerate the discovery, development and delivery of medicine. Bayesian methods facilitate borrowing which is \"dynamic\" in the sense that the similarity of the data helps to determine how much information is used. We propose a Bayesian semiparametric model, which allows the baseline hazard to take any form through an ensemble average. We introduce priors to smooth the posterior baseline hazard, improving both model estimation and borrowing characteristics. A \"lump-and-smear\" borrowing prior accounts for non-exchangeable historical data and helps reduce the maximum type I error in the presence of prior-data conflict. In this article, we present BayesFBHborrow, an R package, which enables the user to perform Bayesian borrowing with a historical control dataset in a semiparametric time-to-event model. User-defined hyperparameters smooth an ensemble averaged posterior baseline hazard. The model offers the specification of lump-and-smear priors on the commensurability parameter, where the associated hyperparameters can be chosen according to the user's tolerance for differences between the log baseline hazards. We demonstrate the performance of our Bayesian flexible baseline hazard model on a simulated and a real-world dataset."}, "https://arxiv.org/abs/2408.04419": {"title": "Analysing symbolic data by pseudo-marginal methods", "link": "https://arxiv.org/abs/2408.04419", "description": "arXiv:2408.04419v1 Announce Type: new \nAbstract: Symbolic data analysis (SDA) aggregates large individual-level datasets into a small number of distributional summaries, such as random rectangles or random histograms. Inference is carried out using these summaries in place of the original dataset, resulting in computational gains at the loss of some information. In likelihood-based SDA, the likelihood function is characterised by an integral with a large exponent, which limits the method's utility as for typical models the integral is unavailable in closed form. 
In addition, the likelihood function is known to produce biased parameter estimates in some circumstances. Our article develops a Bayesian framework for SDA methods in these settings that resolves the issues resulting from integral intractability and biased parameter estimation using pseudo-marginal Markov chain Monte Carlo methods. We develop an exact but computationally expensive method based on path sampling and the block-Poisson estimator, and a much faster, but approximate, method based on Taylor expansion. Through simulation and real-data examples we demonstrate the performance of the developed methods, showing large reductions in computation time compared to the full-data analysis, with only a small loss of information."}, "https://arxiv.org/abs/2408.04552": {"title": "Semiparametric Estimation of Individual Coefficients in a Dyadic Link Formation Model Lacking Observable Characteristics", "link": "https://arxiv.org/abs/2408.04552", "description": "arXiv:2408.04552v1 Announce Type: new \nAbstract: Dyadic network formation models have wide applicability in economic research, yet are difficult to estimate in the presence of individual-specific effects and in the absence of distributional assumptions regarding the model noise component. The availability of (continuously distributed) individual or link characteristics generally facilitates estimation. Yet, while data on social networks has recently become more abundant, the characteristics of the entities involved in the link may not be measured. Adapting the procedure of \\citet{KS}, I propose to use network data alone in a semiparametric estimation of the individual fixed effect coefficients, which carry the interpretation of the individual relative popularity. This makes it possible to anticipate how a newly arriving individual will connect in a pre-existing group. The estimator, needed for its fast convergence, fails to implement the monotonicity assumption regarding the model noise component, thereby potentially reversing the order of the fixed effect coefficients. This and other numerical issues can be conveniently tackled by my novel, data-driven way of normalising the fixed effects, which proves to outperform a conventional standardisation in many cases. I demonstrate that the normalised coefficients converge both at the same rate and to the same limiting distribution as if the true error distribution were known. The cost of semiparametric estimation is thus purely computational, while the potential benefits are large whenever the errors have a strongly convex or strongly concave distribution."}, "https://arxiv.org/abs/2408.04313": {"title": "Better Locally Private Sparse Estimation Given Multiple Samples Per User", "link": "https://arxiv.org/abs/2408.04313", "description": "arXiv:2408.04313v1 Announce Type: cross \nAbstract: Previous studies yielded discouraging results for item-level locally differentially private linear regression with the $s^*$-sparsity assumption, where the minimax rate for $nm$ samples is $\\mathcal{O}(s^{*}d / nm\\varepsilon^2)$. This can be challenging for high-dimensional data, where the dimension $d$ is extremely large. In this work, we investigate user-level locally differentially private sparse linear regression. We show that with $n$ users each contributing $m$ samples, the linear dependency of dimension $d$ can be eliminated, yielding an error upper bound of $\\mathcal{O}(s^{*2} / nm\\varepsilon^2)$. 
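As an aside on the two minimax rates quoted in the locally private sparse regression abstract above (arXiv:2408.04313): the user-level bound trades the ambient dimension $d$ for an extra factor of the sparsity $s^*$. A minimal numeric sketch, with all values hypothetical and chosen only to illustrate the scaling:

```python
# Illustrative only: compare the item-level bound O(s* d / (n m eps^2))
# with the user-level bound O(s*^2 / (n m eps^2)) quoted in the abstract.
n, m = 10_000, 20        # users and samples per user (hypothetical)
d, s_star = 50_000, 10   # ambient dimension and sparsity (hypothetical)
eps = 1.0                # privacy budget

item_level = s_star * d / (n * m * eps**2)   # grows linearly in d
user_level = s_star**2 / (n * m * eps**2)    # free of d
print(f"item-level ~ {item_level:.4f}, user-level ~ {user_level:.6f}")
```

With these made-up numbers the user-level bound is smaller by a factor of d/s* = 5000, which is the point of the comparison.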
We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space, which is extendable to general sparse estimation problems with tight error bounds. Experiments on both synthetic and real datasets demonstrate the superiority of the proposed methods. Both the theoretical and empirical results suggest that, with the same number of samples, locally private sparse estimation is better conducted when multiple samples per user are available."}, "https://arxiv.org/abs/2408.04617": {"title": "Difference-in-Differences for Health Policy and Practice: A Review of Modern Methods", "link": "https://arxiv.org/abs/2408.04617", "description": "arXiv:2408.04617v1 Announce Type: cross \nAbstract: Difference-in-differences (DiD) is the most popular observational causal inference method in health policy, employed to evaluate the real-world impact of policies and programs. To estimate treatment effects, DiD relies on the \"parallel trends assumption\", that on average treatment and comparison groups would have had parallel trajectories in the absence of an intervention. Historically, DiD has been considered broadly applicable and straightforward to implement, but recent years have seen rapid advancements in DiD methods. This paper reviews and synthesizes these innovations for medical and health policy researchers. We focus on four topics: (1) assessing the parallel trends assumption in health policy contexts; (2) relaxing the parallel trends assumption when appropriate; (3) employing estimators to account for staggered treatment timing; and (4) conducting robust inference for analyses in which normal-based clustered standard errors are inappropriate. For each, we explain challenges and common pitfalls in traditional DiD and modern methods available to address these issues."}, "https://arxiv.org/abs/2010.01800": {"title": "Robust and Efficient Estimation of Potential Outcome Means under Random Assignment", "link": "https://arxiv.org/abs/2010.01800", "description": "arXiv:2010.01800v2 Announce Type: replace \nAbstract: We study efficiency improvements in randomized experiments for estimating a vector of potential outcome means using regression adjustment (RA) when there are more than two treatment levels. We show that linear RA which estimates separate slopes for each assignment level is never worse, asymptotically, than using the subsample averages. We also show that separate RA improves over pooled RA except in the obvious case where slope parameters in the linear projections are identical across the different assignment levels. We further characterize the class of nonlinear RA methods that preserve consistency of the potential outcome means despite arbitrary misspecification of the conditional mean functions. Finally, we apply these regression adjustment techniques to efficiently estimate the lower bound mean willingness to pay for an oil spill prevention program in California."}, "https://arxiv.org/abs/2408.04730": {"title": "Vela: A Data-Driven Proposal for Joint Collaboration in Space Exploration", "link": "https://arxiv.org/abs/2408.04730", "description": "arXiv:2408.04730v1 Announce Type: new \nAbstract: The UN Office of Outer Space Affairs identifies synergy of space development activities and international cooperation through data and infrastructure sharing in their Sustainable Development Goal 17 (SDG17). 
Current multilateral space exploration paradigms, however, are divided between the Artemis and the Roscosmos-CNSA programs to return to the moon and establish permanent human settlements. As space agencies work to expand human presence in space, economic resource consolidation in pursuit of technologically ambitious space expeditions is the most sensible path to accomplish SDG17. This paper compiles a budget dataset for the top five federally-funded space agencies: CNSA, ESA, JAXA, NASA, and Roscosmos. Using time-series econometric analysis methods in STATA, this work analyzes each agency's economic contributions toward space exploration. The dataset results are used to propose a multinational space mission, Vela, for the development of an orbiting space station around Mars in the late 2030s. A distribution of economic resources and technological capabilities by the respective space programs is proposed to ensure programmatic redundancy and increase the odds of success on the given timeline."}, "https://arxiv.org/abs/2408.04854": {"title": "A propensity score weighting approach to integrate aggregated data in random-effect individual-level data meta-analysis", "link": "https://arxiv.org/abs/2408.04854", "description": "arXiv:2408.04854v1 Announce Type: new \nAbstract: In evidence synthesis, collecting individual participant data (IPD) across eligible studies is the most reliable way to investigate the treatment effects in different subgroups defined by participant characteristics. Nonetheless, access to all IPD from all studies might be very challenging due to privacy concerns. To overcome this, many approaches such as multilevel modeling have been proposed to incorporate the vast amount of aggregated data from the literature into IPD meta-analysis. These methods, however, often rely on specifying separate models for trial-level versus patient-level data, which likely suffers from ecological bias when there are non-linearities in the outcome generating mechanism. In this paper, we introduce a novel method to combine aggregated data and IPD in meta-analysis that is free from ecological bias. The proposed approach relies on modeling the study membership given covariates, then using inverse weighting to estimate the trial-specific coefficients in the individual-level outcome model of studies without accessible IPD. The weights derived from this approach also shed light on the similarity in the case-mix across studies, which is useful to assess whether eligible trials are sufficiently similar to be meta-analyzed. We evaluate the proposed method using synthetic data, then apply it to a real-world meta-analysis comparing the chance of response between guselkumab and adalimumab among patients with psoriasis."}, "https://arxiv.org/abs/2408.04933": {"title": "Variance-based sensitivity analysis in the presence of correlated input variables", "link": "https://arxiv.org/abs/2408.04933", "description": "arXiv:2408.04933v1 Announce Type: new \nAbstract: In this paper we propose an extension of the classical Sobol' estimator for the estimation of variance-based sensitivity indices. The approach assumes a linear correlation model between the input variables, which is used to decompose the contribution of an input variable into a correlated and an uncorrelated part. 
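As background for the extension described in the variance-based sensitivity abstract just above (arXiv:2408.04933), the sketch below shows the classical pick-freeze (Saltelli-type) estimator of first-order Sobol' indices for independent inputs; the paper's decomposition into correlated and uncorrelated contributions is not implemented here, and the Ishigami test function is only a stand-in for the model response.

```python
import numpy as np

rng = np.random.default_rng(0)

def ishigami(x):
    # Standard sensitivity-analysis test function; a stand-in for any model.
    return (np.sin(x[:, 0]) + 7.0 * np.sin(x[:, 1]) ** 2
            + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0]))

def first_order_sobol(f, dim, n=200_000):
    """Pick-freeze estimate of first-order Sobol' indices, assuming
    independent U(-pi, pi) inputs (the classical, uncorrelated setting)."""
    A = rng.uniform(-np.pi, np.pi, size=(n, dim))
    B = rng.uniform(-np.pi, np.pi, size=(n, dim))
    yA, yB = f(A), f(B)
    var_y = np.var(np.concatenate([yA, yB]))
    S = np.empty(dim)
    for i in range(dim):
        ABi = A.copy()
        ABi[:, i] = B[:, i]              # replace input i of A by that of B
        S[i] = np.mean(yB * (f(ABi) - yA)) / var_y
    return S

print(first_order_sobol(ishigami, dim=3))   # roughly 0.31, 0.44, 0.00
```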
This method provides sampling matrices following the original joint probability distribution which are used directly to compute the model output without any assumptions or approximations of the model response function."}, "https://arxiv.org/abs/2408.05106": {"title": "Spatial Deconfounding is Reasonable Statistical Practice: Interpretations, Clarifications, and New Benefits", "link": "https://arxiv.org/abs/2408.05106", "description": "arXiv:2408.05106v1 Announce Type: new \nAbstract: The spatial linear mixed model (SLMM) consists of fixed and spatial random effects that can be confounded. Restricted spatial regression (RSR) models restrict the spatial random effects to be in the orthogonal column space of the covariates, which \"deconfounds\" the SLMM. Recent articles have shown that the RSR generally performs worse than the SLMM under a certain interpretation of the RSR. We show that every additive model can be reparameterized as a deconfounded model leading to what we call the linear reparameterization of additive models (LRAM). Under this reparameterization the coefficients of the covariates (referred to as deconfounded regression effects) are different from the (confounded) regression effects in the SLMM. It is shown that under the LRAM interpretation, existing deconfounded spatial models produce estimated deconfounded regression effects, spatial prediction, and spatial prediction variances equivalent to that of SLMM in Bayesian contexts. Furthermore, a general RSR (GRSR) and the SLMM produce identical inferences on confounded regression effects. While our results are in complete agreement with recent criticisms, our new results under the LRAM interpretation provide clarifications that lead to different and sometimes contrary conclusions. Additionally, we discuss the inferential and computational benefits to deconfounding, which we illustrate via a simulation."}, "https://arxiv.org/abs/2408.05209": {"title": "What are the real implications for $CO_2$ as generation from renewables increases?", "link": "https://arxiv.org/abs/2408.05209", "description": "arXiv:2408.05209v1 Announce Type: new \nAbstract: Wind and solar electricity generation account for 14% of total electricity generation in the United States and are expected to continue to grow in the next decades. In low carbon systems, generation from renewable energy sources displaces conventional fossil fuel power plants resulting in lower system-level emissions and emissions intensity. However, we find that intermittent generation from renewables changes the way conventional thermal power plants operate, and that the displacement of generation is not 1 to 1 as expected. Our work provides a method that allows policy and decision makers to continue to track the effect of additional renewable capacity and the resulting thermal power plant operational responses."}, "https://arxiv.org/abs/2408.04907": {"title": "Causal Discovery of Linear Non-Gaussian Causal Models with Unobserved Confounding", "link": "https://arxiv.org/abs/2408.04907", "description": "arXiv:2408.04907v1 Announce Type: cross \nAbstract: We consider linear non-Gaussian structural equation models that involve latent confounding. In this setting, the causal structure is identifiable, but, in general, it is not possible to identify the specific causal effects. Instead, a finite number of different causal effects result in the same observational distribution. 
Most existing algorithms for identifying these causal effects use overcomplete independent component analysis (ICA), which often suffers from convergence to local optima. Furthermore, the number of latent variables must be known a priori. To address these issues, we propose an algorithm that operates recursively rather than using overcomplete ICA. The algorithm first infers a source, estimates the effect of the source and its latent parents on their descendants, and then eliminates their influence from the data. For both source identification and effect size estimation, we use rank conditions on matrices formed from higher-order cumulants. We prove asymptotic correctness under the mild assumption that locally, the number of latent variables never exceeds the number of observed variables. Simulation studies demonstrate that our method achieves comparable performance to overcomplete ICA even though it does not know the number of latents in advance."}, "https://arxiv.org/abs/2206.03038": {"title": "Asymptotic Distribution-free Change-point Detection for Modern Data Based on a New Ranking Scheme", "link": "https://arxiv.org/abs/2206.03038", "description": "arXiv:2206.03038v3 Announce Type: replace \nAbstract: Change-point detection (CPD) involves identifying distributional changes in a sequence of independent observations. Among nonparametric methods, rank-based methods are attractive due to their robustness and effectiveness and have been extensively studied for univariate data. However, they are not well explored for high-dimensional or non-Euclidean data. This paper proposes a new method, Rank INduced by Graph Change-Point Detection (RING-CPD), which utilizes graph-induced ranks to handle high-dimensional and non-Euclidean data. The new method is asymptotically distribution-free under the null hypothesis, and an analytic $p$-value approximation is provided for easy type-I error control. Simulation studies show that RING-CPD effectively detects change points across a wide range of alternatives and is also robust to heavy-tailed distribution and outliers. The new method is illustrated by the detection of seizures in a functional connectivity network dataset, changes of digit images, and travel pattern changes in the New York City Taxi dataset."}, "https://arxiv.org/abs/2306.15199": {"title": "A new classification framework for high-dimensional data", "link": "https://arxiv.org/abs/2306.15199", "description": "arXiv:2306.15199v2 Announce Type: replace \nAbstract: Classification, a fundamental problem in many fields, faces significant challenges when handling a large number of features, a scenario commonly encountered in modern applications, such as identifying tumor subtypes from genomic data or categorizing customer attitudes based on online reviews. We propose a novel framework that utilizes the ranks of pairwise distances among observations and identifies consistent patterns in moderate- to high- dimensional data, which previous methods have overlooked. The proposed method exhibits superior performance across a variety of scenarios, from high-dimensional data to network data. 
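The classification framework just described is built on the ranks of pairwise distances among observations. A minimal sketch of that building block alone follows (synthetic data; the classification rule constructed from these ranks in the paper is not reproduced here).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1000))      # 50 observations in 1000 dimensions

D = squareform(pdist(X))             # pairwise Euclidean distances
# For each observation, rank all observations (including itself, rank 1)
# by their distance to it; these ranks are the raw ingredient such
# rank-based procedures work with.
R = np.vstack([rankdata(row) for row in D])
print(R.shape)                       # (50, 50)
```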
We further explore a typical setting to investigate key quantities that play essential roles in our framework, which reveal the framework's capabilities in distinguishing differences in the first and/or second moment, as well as distinctions in higher moments."}, "https://arxiv.org/abs/2312.05365": {"title": "Product Centered Dirichlet Processes for Dependent Clustering", "link": "https://arxiv.org/abs/2312.05365", "description": "arXiv:2312.05365v2 Announce Type: replace \nAbstract: While there is an immense literature on Bayesian methods for clustering, the multiview case has received little attention. This problem focuses on obtaining distinct but statistically dependent clusterings in a common set of entities for different data types. For example, clustering patients into subgroups with subgroup membership varying according to the domain of the patient variables. A challenge is how to model the across-view dependence between the partitions of patients into subgroups. The complexities of the partition space make standard methods to model dependence, such as correlation, infeasible. In this article, we propose CLustering with Independence Centering (CLIC), a clustering prior that uses a single parameter to explicitly model dependence between clusterings across views. CLIC is induced by the product centered Dirichlet process (PCDP), a novel hierarchical prior that bridges between independent and equivalent partitions. We show appealing theoretic properties, provide a finite approximation and prove its accuracy, present a marginal Gibbs sampler for posterior computation, and derive closed form expressions for the marginal and joint partition distributions for the CLIC model. On synthetic data and in an application to epidemiology, CLIC accurately characterizes view-specific partitions while providing inference on the dependence level."}, "https://arxiv.org/abs/2408.05297": {"title": "Bootstrap Matching: a robust and efficient correction for non-random A/B test, and its applications", "link": "https://arxiv.org/abs/2408.05297", "description": "arXiv:2408.05297v1 Announce Type: new \nAbstract: A/B testing, a widely used form of Randomized Controlled Trial (RCT), is a fundamental tool in business data analysis and experimental design. However, despite its intent to maintain randomness, A/B testing often faces challenges that compromise this randomness, leading to significant limitations in practice. In this study, we introduce Bootstrap Matching, an innovative approach that integrates Bootstrap resampling, Matching techniques, and high-dimensional hypothesis testing to address the shortcomings of A/B tests when true randomization is not achieved. Unlike traditional methods such as Difference-in-Differences (DID) and Propensity Score Matching (PSM), Bootstrap Matching is tailored for large-scale datasets, offering enhanced robustness and computational efficiency. 
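For intuition about the Bootstrap Matching abstract above, the sketch below pairs nearest-neighbour matching on covariates with a simple bootstrap over units; it is a generic illustration only, not the paper's algorithm (which additionally involves high-dimensional hypothesis testing), and the naive combination of matching and the bootstrap has known theoretical caveats.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Synthetic A/B data in which assignment depends on a covariate, so the raw
# difference in means is confounded (all numbers are hypothetical).
n = 5_000
X = rng.normal(size=(n, 3))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 1.0 * T + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

def matched_att(X, T, Y):
    """ATT from 1-nearest-neighbour matching of treated units to controls."""
    nn = NearestNeighbors(n_neighbors=1).fit(X[T == 0])
    idx = nn.kneighbors(X[T == 1], return_distance=False).ravel()
    return np.mean(Y[T == 1] - Y[T == 0][idx])

# Bootstrap over units to attach an interval to the matched estimate.
boot = []
for _ in range(200):
    b = rng.integers(0, n, size=n)
    boot.append(matched_att(X[b], T[b], Y[b]))
print(matched_att(X, T, Y), np.percentile(boot, [2.5, 97.5]))
```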
We illustrate the effectiveness of this methodology through a real-world application in online advertising and further discuss its potential applications in digital marketing, empirical economics, clinical trials, and high-dimensional bioinformatics."}, "https://arxiv.org/abs/2408.05342": {"title": "Optimal Treatment Allocation Strategies for A/B Testing in Partially Observable Time Series Experiments", "link": "https://arxiv.org/abs/2408.05342", "description": "arXiv:2408.05342v1 Announce Type: new \nAbstract: Time series experiments, in which experimental units receive a sequence of treatments over time, are frequently employed in many technological companies to evaluate the performance of a newly developed policy, product, or treatment relative to a baseline control. Many existing A/B testing solutions assume a fully observable experimental environment that satisfies the Markov condition, which often does not hold in practice. This paper studies the optimal design for A/B testing in partially observable environments. We introduce a controlled (vector) autoregressive moving average model to capture partial observability. We introduce a small signal asymptotic framework to simplify the analysis of asymptotic mean squared errors of average treatment effect estimators under various designs. We develop two algorithms to estimate the optimal design: one utilizing constrained optimization and the other employing reinforcement learning. We demonstrate the superior performance of our designs using a dispatch simulator and two real datasets from a ride-sharing company."}, "https://arxiv.org/abs/2408.05386": {"title": "Debiased Estimating Equation Method for Versatile and Efficient Mendelian Randomization Using Large Numbers of Correlated Weak and Invalid Instruments", "link": "https://arxiv.org/abs/2408.05386", "description": "arXiv:2408.05386v1 Announce Type: new \nAbstract: Mendelian randomization (MR) is a widely used tool for causal inference in the presence of unobserved confounders, which uses single nucleotide polymorphisms (SNPs) as instrumental variables (IVs) to estimate causal effects. However, SNPs often have weak effects on complex traits, leading to bias and low statistical efficiency in existing MR analysis due to weak instruments that are often used. The linkage disequilibrium (LD) among SNPs poses additional statistical hurdles. Specifically, existing MR methods often restrict analysis to independent SNPs via LD clumping and result in efficiency loss in estimating the causal effect. To address these issues, we propose the Debiased Estimating Equation Method (DEEM), a summary statistics-based MR approach that incorporates a large number of correlated weak-effect and invalid SNPs. DEEM not only effectively eliminates the weak IV bias but also improves the statistical efficiency of the causal effect estimation by leveraging information from many correlated SNPs. DEEM is a versatile method that allows for pleiotropic effects, adjusts for Winner's curse, and is applicable to both two-sample and one-sample MR analyses. Asymptotic analyses of the DEEM estimator demonstrate its attractive theoretical properties. 
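For context on the DEEM abstract above: the baseline it improves upon is summary-statistics MR restricted to independent (LD-clumped) SNPs, most commonly the fixed-effect inverse-variance-weighted (IVW) estimator. A minimal sketch of that classical baseline with made-up summary statistics; DEEM itself is not implemented here.

```python
import numpy as np

def ivw_mr(gamma, Gamma, se_Gamma):
    """Classical fixed-effect IVW estimate of the causal effect from summary
    statistics (gamma: SNP-exposure effects, Gamma: SNP-outcome effects),
    assuming independent and valid instruments."""
    ratio = Gamma / gamma                  # per-SNP Wald ratios
    w = (gamma / se_Gamma) ** 2            # first-order inverse-variance weights
    beta = np.sum(w * ratio) / np.sum(w)
    return beta, np.sqrt(1.0 / np.sum(w))

# Hypothetical summary statistics for 30 clumped SNPs with true effect 0.3.
rng = np.random.default_rng(3)
gamma = rng.normal(0.08, 0.02, size=30)
se_Gamma = np.full(30, 0.01)
Gamma = 0.3 * gamma + rng.normal(0.0, se_Gamma)
print(ivw_mr(gamma, Gamma, se_Gamma))      # estimate close to 0.3, with its SE
```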
Through extensive simulations and two real data examples, we demonstrate that DEEM improves the efficiency and robustness of MR analysis compared with existing methods."}, "https://arxiv.org/abs/2408.05665": {"title": "Change-Point Detection in Time Series Using Mixed Integer Programming", "link": "https://arxiv.org/abs/2408.05665", "description": "arXiv:2408.05665v1 Announce Type: new \nAbstract: We use cutting-edge mixed integer optimization (MIO) methods to develop a framework for detection and estimation of structural breaks in time series regression models. The framework is constructed based on the least squares problem subject to a penalty on the number of breakpoints. We restate the $l_0$-penalized regression problem as a quadratic programming problem with integer- and real-valued arguments and show that MIO is capable of finding provably optimal solutions using a well-known optimization solver. Compared to the popular $l_1$-penalized regression (LASSO) and other classical methods, the MIO framework permits simultaneous estimation of the number and location of structural breaks as well as regression coefficients, while accommodating the option of specifying a given or minimal number of breaks. We derive the asymptotic properties of the estimator and demonstrate its effectiveness through extensive numerical experiments, confirming a more accurate estimation of multiple breaks as compared to popular non-MIO alternatives. Two empirical examples demonstrate usefulness of the framework in applications from business and economic statistics."}, "https://arxiv.org/abs/2408.05688": {"title": "Bank Cost Efficiency and Credit Market Structure Under a Volatile Exchange Rate", "link": "https://arxiv.org/abs/2408.05688", "description": "arXiv:2408.05688v1 Announce Type: new \nAbstract: We study the impact of exchange rate volatility on cost efficiency and market structure in a cross-section of banks that have non-trivial exposures to foreign currency (FX) operations. We use unique data on quarterly revaluations of FX assets and liabilities (Revals) that Russian banks were reporting between 2004 Q1 and 2020 Q2. {\\it First}, we document that Revals constitute the largest part of the banks' total costs, 26.5\\% on average, with considerable variation across banks. {\\it Second}, we find that stochastic estimates of cost efficiency are both severely downward biased -- by 30\\% on average -- and generally not rank preserving when Revals are ignored, except for the tails, as our nonparametric copulas reveal. To ensure generalizability to other emerging market economies, we suggest a two-stage approach that does not rely on Revals but is able to shrink the downward bias in cost efficiency estimates by two-thirds. {\\it Third}, we show that Revals are triggered by the mismatch in the banks' FX operations, which, in turn, is driven by household FX deposits and the instability of Ruble's exchange rate. {\\it Fourth}, we find that the failure to account for Revals leads to the erroneous conclusion that the credit market is inefficient, which is driven by the upper quartile of the banks' distribution by total assets. 
Revals have considerable negative implications for financial stability which can be attenuated by the cross-border diversification of bank assets."}, "https://arxiv.org/abs/2408.05747": {"title": "Addressing Outcome Reporting Bias in Meta-analysis: A Selection Model Perspective", "link": "https://arxiv.org/abs/2408.05747", "description": "arXiv:2408.05747v1 Announce Type: new \nAbstract: Outcome Reporting Bias (ORB) poses significant threats to the validity of meta-analytic findings. It occurs when researchers selectively report outcomes based on the significance or direction of results, potentially leading to distorted treatment effect estimates. Despite its critical implications, ORB remains an under-recognized issue, with few comprehensive adjustment methods available. The goal of this research is to investigate ORB-adjustment techniques through a selection model lens, thereby extending some of the existing methodological approaches available in the literature. To gain a better insight into the effects of ORB in meta-analysis of clinical trials, specifically in the presence of heterogeneity, and to assess the effectiveness of ORB-adjustment techniques, we apply the methodology to real clinical data affected by ORB and conduct a simulation study focusing on treatment effect estimation with a secondary interest in heterogeneity quantification."}, "https://arxiv.org/abs/2408.05816": {"title": "BOP2-TE: Bayesian Optimal Phase 2 Design for Jointly Monitoring Efficacy and Toxicity with Application to Dose Optimization", "link": "https://arxiv.org/abs/2408.05816", "description": "arXiv:2408.05816v1 Announce Type: new \nAbstract: We propose a Bayesian optimal phase 2 design for jointly monitoring efficacy and toxicity, referred to as BOP2-TE, to improve the operating characteristics of the BOP2 design proposed by Zhou et al. (2017). BOP2-TE utilizes a Dirichlet-multinomial model to jointly model the distribution of toxicity and efficacy endpoints, making go/no-go decisions based on the posterior probability of toxicity and futility. In comparison to the original BOP2 and other existing designs, BOP2-TE offers the advantage of providing rigorous type I error control in cases where the treatment is toxic and futile, effective but toxic, or safe but futile, while optimizing power when the treatment is effective and safe. As a result, BOP2-TE enhances trial safety and efficacy. We also explore the incorporation of BOP2-TE into multiple-dose randomized trials for dose optimization, and consider a seamless design that integrates phase I dose finding with phase II randomized dose optimization. BOP2-TE is user-friendly, as its decision boundary can be determined prior to the trial's onset. Simulations demonstrate that BOP2-TE possesses desirable operating characteristics. We have developed a user-friendly web application as part of the BOP2 app, which is freely available at www.trialdesign.org."}, "https://arxiv.org/abs/2408.05847": {"title": "Correcting invalid regression discontinuity designs with multiple time period data", "link": "https://arxiv.org/abs/2408.05847", "description": "arXiv:2408.05847v1 Announce Type: new \nAbstract: A common approach to Regression Discontinuity (RD) designs relies on a continuity assumption of the mean potential outcomes at the cutoff defining the RD design. In practice, this assumption is often implausible when changes other than the intervention of interest occur at the cutoff (e.g., other policies are implemented at the same cutoff). 
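For reference, the continuity-based estimator that the RD abstract above starts from is the textbook sharp-RD local linear fit at the cutoff; a minimal sketch with a triangular kernel and a fixed bandwidth is below (synthetic data, and none of the paper's multi-period bias corrections are included).

```python
import numpy as np

def sharp_rd_local_linear(r, y, cutoff=0.0, h=0.5):
    """Sharp-RD effect as the difference of local-linear fits at the cutoff,
    estimated separately on each side with a triangular kernel."""
    def fit_at_cutoff(mask):
        x = r[mask] - cutoff
        w = 1.0 - np.abs(x) / h                    # triangular kernel weights
        X = np.column_stack([np.ones_like(x), x])
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[mask]))
        return beta[0]                             # intercept = fit at cutoff
    left = fit_at_cutoff((r < cutoff) & (r > cutoff - h))
    right = fit_at_cutoff((r >= cutoff) & (r < cutoff + h))
    return right - left

rng = np.random.default_rng(4)
r = rng.uniform(-1, 1, size=4_000)                 # running variable
y = 0.5 * r + 2.0 * (r >= 0) + rng.normal(0, 1, size=r.size)
print(sharp_rd_local_linear(r, y))                 # close to the true jump of 2
```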
When the continuity assumption is implausible, researchers often retreat to ad-hoc analyses that are not supported by any theory and yield results with unclear causal interpretation. These analyses seek to exploit additional data where either all units are treated or all units are untreated (regardless of their running variable value), for example, when data from multiple time periods are available. We first derive the bias of RD designs when the continuity assumption does not hold. We then present a theoretical foundation for analyses using multiple time periods by means of a general identification framework incorporating data from additional time periods to overcome the bias. We discuss this framework under various RD designs, and also extend our work to carry-over effects and time-varying running variables. We develop local linear regression estimators, bias correction procedures, and standard errors that are robust to bias-correction for the multiple-period setup. The approach is illustrated using an application that studied the effect of new fiscal laws on the debt of Italian municipalities."}, "https://arxiv.org/abs/2408.05862": {"title": "Censored and extreme losses: functional convergence and applications to tail goodness-of-fit", "link": "https://arxiv.org/abs/2408.05862", "description": "arXiv:2408.05862v1 Announce Type: new \nAbstract: This paper establishes the functional convergence of the Extreme Nelson--Aalen and Extreme Kaplan--Meier estimators, which are designed to capture the heavy-tailed behaviour of censored losses. The resulting limit representations can be used to obtain the distributions of pathwise functionals with respect to the so-called tail process. For instance, we may recover the convergence of a censored Hill estimator, and we further investigate two goodness-of-fit statistics for the tail of the loss distribution. Using the latter limit theorems, we propose two rules for selecting a suitable number of order statistics, both based on test statistics derived from the functional convergence results. The effectiveness of these selection rules is investigated through simulations and an application to a real dataset comprising French motor insurance claim sizes."}, "https://arxiv.org/abs/2408.05887": {"title": "Statistically Optimal Uncertainty Quantification for Expensive Black-Box Models", "link": "https://arxiv.org/abs/2408.05887", "description": "arXiv:2408.05887v1 Announce Type: new \nAbstract: Uncertainty quantification, by means of confidence interval (CI) construction, has been a fundamental problem in statistics and also important in risk-aware decision-making. In this paper, we revisit the basic problem of CI construction, but in the setting of expensive black-box models. This means we are confined to using a small number of model runs, without the ability to obtain auxiliary model information such as gradients. In this case, there exist classical methods based on data splitting, and newer methods based on suitable resampling. However, while all these resulting CIs have similarly accurate coverage in large samples, their efficiencies in terms of interval length differ, and a systematic understanding of which method and configuration attains the shortest interval appears open. Motivated by this, we create a theoretical framework to study the statistical optimality of CI tightness under computational constraints. 
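One of the classical data-splitting constructions referred to in the black-box CI abstract above is the batch-means t-interval: split the available model runs into a few batches and form a Student-t interval from the batch means. A minimal sketch under that reading (the simulator and all numbers are hypothetical):

```python
import numpy as np
from scipy import stats

def batch_means_ci(runs, n_batches=5, alpha=0.05):
    """t-interval built from the means of equal-sized batches of model runs."""
    means = np.array([b.mean()
                      for b in np.array_split(np.asarray(runs), n_batches)])
    centre = means.mean()
    half = (stats.t.ppf(1 - alpha / 2, df=n_batches - 1)
            * means.std(ddof=1) / np.sqrt(n_batches))
    return centre - half, centre + half

# Pretend only 30 runs of an expensive simulator are affordable.
rng = np.random.default_rng(5)
runs = rng.normal(loc=1.7, scale=0.4, size=30)
print(batch_means_ci(runs))
```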
Our theory shows that standard batching, but also carefully constructed new formulas using uneven-size or overlapping batches, batched jackknife, and the so-called cheap bootstrap and its weighted generalizations, are statistically optimal. Our developments build on a new bridge of the classical notion of uniformly most accurate unbiasedness with batching and resampling, by viewing model runs as asymptotically Gaussian \"data\", as well as a suitable notion of homogeneity for CIs."}, "https://arxiv.org/abs/2408.05896": {"title": "Scalable recommender system based on factor analysis", "link": "https://arxiv.org/abs/2408.05896", "description": "arXiv:2408.05896v1 Announce Type: new \nAbstract: Recommender systems have become crucial in the modern digital landscape, where personalized content, products, and services are essential for enhancing user experience. This paper explores statistical models for recommender systems, focusing on crossed random effects models and factor analysis. We extend the crossed random effects model to include random slopes, enabling the capture of varying covariate effects among users and items. Additionally, we investigate the use of factor analysis in recommender systems, particularly for settings with incomplete data. The paper also discusses scalable solutions using the Expectation Maximization (EM) and variational EM algorithms for parameter estimation, highlighting the application of these models to predict user-item interactions effectively."}, "https://arxiv.org/abs/2408.05961": {"title": "Cross-Spectral Analysis of Bivariate Graph Signals", "link": "https://arxiv.org/abs/2408.05961", "description": "arXiv:2408.05961v1 Announce Type: new \nAbstract: With the advancements in technology and monitoring tools, we often encounter multivariate graph signals, which can be seen as the realizations of multivariate graph processes, and revealing the relationship between their constituent quantities is one of the important problems. To address this issue, we propose a cross-spectral analysis tool for bivariate graph signals. The main goal of this study is to extend the scope of spectral analysis of graph signals to multivariate graph signals. In this study, we define joint weak stationarity graph processes and introduce graph cross-spectral density and coherence for multivariate graph processes. We propose several estimators for the cross-spectral density and investigate the theoretical properties of the proposed estimators. Furthermore, we demonstrate the effectiveness of the proposed estimators through numerical experiments, including simulation studies and a real data application. Finally, as an interesting extension, we discuss robust spectral analysis of graph signals in the presence of outliers."}, "https://arxiv.org/abs/2408.06046": {"title": "Identifying Total Causal Effects in Linear Models under Partial Homoscedasticity", "link": "https://arxiv.org/abs/2408.06046", "description": "arXiv:2408.06046v1 Announce Type: new \nAbstract: A fundamental challenge of scientific research is inferring causal relations based on observed data. One commonly used approach involves utilizing structural causal models that postulate noisy functional relations among interacting variables. A directed graph naturally represents these models and reflects the underlying causal structure. 
However, classical identifiability results suggest that, without conducting additional experiments, this causal graph can only be identified up to a Markov equivalence class of indistinguishable models. Recent research has shown that focusing on linear relations with equal error variances can enable the identification of the causal structure from mere observational data. Nonetheless, practitioners are often primarily interested in the effects of specific interventions, rendering the complete identification of the causal structure unnecessary. In this work, we investigate the extent to which less restrictive assumptions of partial homoscedasticity are sufficient for identifying the causal effects of interest. Furthermore, we construct mathematically rigorous confidence regions for total causal effects under structure uncertainty and explore the performance gain of relying on stricter error assumptions in a simulation study."}, "https://arxiv.org/abs/2408.06211": {"title": "Extreme-based causal effect learning with endogenous exposures and a light-tailed error", "link": "https://arxiv.org/abs/2408.06211", "description": "arXiv:2408.06211v1 Announce Type: new \nAbstract: Endogeneity poses significant challenges in causal inference across various research domains. This paper proposes a novel approach to identify and estimate causal effects in the presence of endogeneity. We consider a structural equation with endogenous exposures and an additive error term. Assuming the light-tailedness of the error term, we show that the causal effect can be identified by contrasting extreme conditional quantiles of the outcome given the exposures. Unlike many existing results, our identification approach does not rely on additional parametric assumptions or auxiliary variables. Building on the identification result, we develop an EXtreme-based Causal Effect Learning (EXCEL) method that estimates the causal effect using extreme quantile regression. We establish the consistency of the EXCEL estimator under a general additive structural equation and demonstrate its asymptotic normality in the linear model setting. These results reveal that extreme quantile regression is invulnerable to endogeneity when the error term is light-tailed, which is not appreciated in the literature to our knowledge. The EXCEL method is applied to causal inference problems with invalid instruments to construct a valid confidence set for the causal effect. Simulations and data analysis of an automobile sale dataset show the effectiveness of our method in addressing endogeneity."}, "https://arxiv.org/abs/2408.06263": {"title": "Optimal Integrative Estimation for Distributed Precision Matrices with Heterogeneity Adjustment", "link": "https://arxiv.org/abs/2408.06263", "description": "arXiv:2408.06263v1 Announce Type: new \nAbstract: Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring communication efficiency pose significant challenges on distributed statistical analysis. In this article, we focus on integrative estimation of distributed heterogeneous precision matrices, a crucial task related to joint precision matrix estimation where computation-efficient algorithms and statistical optimality theories are still underdeveloped. 
To tackle these challenges, we introduce a novel HEterogeneity-adjusted Aggregating and Thresholding (HEAT) approach for distributed integrative estimation. HEAT is designed to be both communication- and computation-efficient, and we demonstrate its statistical optimality by establishing the convergence rates and the corresponding minimax lower bounds under various integrative losses. To enhance the optimality of HEAT, we further propose an iterative HEAT (IteHEAT) approach. By iteratively refining the higher-order errors of HEAT estimators through multi-round communications, IteHEAT achieves geometric contraction rates of convergence. Extensive simulations and real data applications validate the numerical performance of HEAT and IteHEAT methods."}, "https://arxiv.org/abs/2408.06323": {"title": "Infer-and-widen versus split-and-condition: two tales of selective inference", "link": "https://arxiv.org/abs/2408.06323", "description": "arXiv:2408.06323v1 Announce Type: new \nAbstract: Recent attention has focused on the development of methods for post-selection inference. However, the connections between these methods, and the extent to which one might be preferred to another, remain unclear. In this paper, we classify existing methods for post-selection inference into one of two frameworks: infer-and-widen or split-and-condition. The infer-and-widen framework produces confidence intervals whose midpoints are biased due to selection, and must be wide enough to account for this bias. By contrast, split-and-condition directly adjusts the intervals' midpoints to account for selection. We compare the two frameworks in three vignettes: the winner's curse, maximal contrasts, and inference after the lasso. Our results are striking: in each of these examples, a split-and-condition strategy leads to confidence intervals that are much narrower than the state-of-the-art infer-and-widen proposal, when methods are tuned to yield identical selection events. Furthermore, even an ``oracle\" infer-and-widen confidence interval -- the narrowest possible interval that could be theoretically attained via infer-and-widen -- is not necessarily narrower than a feasible split-and-condition method. Taken together, these results point to split-and-condition as the most promising framework for post-selection inference in real-world settings."}, "https://arxiv.org/abs/2408.05428": {"title": "Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression", "link": "https://arxiv.org/abs/2408.05428", "description": "arXiv:2408.05428v1 Announce Type: cross \nAbstract: In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements act as instrumental variables (IVs), facilitating the identification of causal effects through leveraging exogenous perturbations in discrete treatment scenarios. However, real-world applications of encouragement designs often face challenges such as incomplete randomization, limited experimental data, and significantly fewer encouragements compared to treatments, hindering precise causal effect estimation. 
To address this, this paper introduces novel theories and algorithms for identifying the Conditional Average Treatment Effect (CATE) using variations in encouragement. Further, by leveraging both observational and encouragement data, we propose a generalized IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR), to effectively estimate the causal effects. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of EnCounteR over existing methods."}, "https://arxiv.org/abs/2408.05514": {"title": "Testing Elliptical Models in High Dimensions", "link": "https://arxiv.org/abs/2408.05514", "description": "arXiv:2408.05514v1 Announce Type: cross \nAbstract: Due to the broad applications of elliptical models, there is a long line of research on goodness-of-fit tests for empirically validating them. However, the existing literature on this topic is generally confined to low-dimensional settings, and to the best of our knowledge, there are no established goodness-of-fit tests for elliptical models that are supported by theoretical guarantees in high dimensions. In this paper, we propose a new goodness-of-fit test for this problem, and our main result shows that the test is asymptotically valid when the dimension and sample size diverge proportionally. Remarkably, it also turns out that the asymptotic validity of the test requires no assumptions on the population covariance matrix. With regard to numerical performance, we confirm that the empirical level of the test is close to the nominal level across a range of conditions, and that the test is able to reliably detect non-elliptical distributions. Moreover, when the proposed test is specialized to the problem of testing normality in high dimensions, we show that it compares favorably with a state-of-the-art method, and hence, this way of using the proposed test is of independent interest."}, "https://arxiv.org/abs/2408.05584": {"title": "Dynamical causality under invisible confounders", "link": "https://arxiv.org/abs/2408.05584", "description": "arXiv:2408.05584v1 Announce Type: cross \nAbstract: Causality inference is prone to spurious causal interactions, due to the substantial confounders in a complex system. While many existing methods based on the statistical methods or dynamical methods attempt to address misidentification challenges, there remains a notable lack of effective methods to infer causality, in particular in the presence of invisible/unobservable confounders. As a result, accurately inferring causation with invisible confounders remains a largely unexplored and outstanding issue in data science and AI fields. In this work, we propose a method to overcome such challenges to infer dynamical causality under invisible confounders (CIC method) and further reconstruct the invisible confounders from time-series data by developing an orthogonal decomposition theorem in a delay embedding space. The core of our CIC method lies in its ability to decompose the observed variables not in their original space but in their delay embedding space into the common and private subspaces respectively, thereby quantifying causality between those variables both theoretically and computationally. This theoretical foundation ensures the causal detection for any high-dimensional system even with only two observed variables under many invisible confounders, which is actually a long-standing problem in the field. 
In addition to the invisible confounder problem, such a decomposition makes the intertwined variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out using various real datasets, and the experimental results demonstrate its effectiveness in reconstructing real biological networks even with unobserved confounders."}, "https://arxiv.org/abs/2408.05647": {"title": "Controlling for discrete unmeasured confounding in nonlinear causal models", "link": "https://arxiv.org/abs/2408.05647", "description": "arXiv:2408.05647v1 Announce Type: cross \nAbstract: Unmeasured confounding is a major challenge for identifying causal relationships from non-experimental data. Here, we propose a method that can accommodate unmeasured discrete confounding. Extending recent identifiability results in deep latent variable models, we show theoretically that confounding can be detected and corrected under the assumption that the observed data is a piecewise affine transformation of a latent Gaussian mixture model and that the identity of the mixture components is confounded. We provide a flow-based algorithm to estimate this model and perform deconfounding. Experimental results on synthetic and real-world data provide support for the effectiveness of our approach."}, "https://arxiv.org/abs/2408.05854": {"title": "On the Robustness of Kernel Goodness-of-Fit Tests", "link": "https://arxiv.org/abs/2408.05854", "description": "arXiv:2408.05854v1 Announce Type: cross \nAbstract: Goodness-of-fit testing is often criticized for its lack of practical relevance; since ``all models are wrong'', the null hypothesis that the data conform to our model is ultimately always rejected when the sample size is large enough. Despite this, probabilistic models are still used extensively, raising the more pertinent question of whether the model is good enough for a specific task. This question can be formalized as a robust goodness-of-fit testing problem by asking whether the data were generated by a distribution corresponding to our model up to some mild perturbation. In this paper, we show that existing kernel goodness-of-fit tests are not robust according to common notions of robustness including qualitative and quantitative robustness. We also show that robust techniques based on tilted kernels from the parameter estimation literature are not sufficient for ensuring both types of robustness in the context of goodness-of-fit testing. We therefore propose the first robust kernel goodness-of-fit test which resolves this open problem using kernel Stein discrepancy balls, which encompass perturbation models such as Huber contamination models and density uncertainty bands."}, "https://arxiv.org/abs/2408.06103": {"title": "Method-of-Moments Inference for GLMs and Doubly Robust Functionals under Proportional Asymptotics", "link": "https://arxiv.org/abs/2408.06103", "description": "arXiv:2408.06103v1 Announce Type: cross \nAbstract: In this paper, we consider the estimation of regression coefficients and the signal-to-noise ratio (SNR) in high-dimensional Generalized Linear Models (GLMs), and explore their implications in inferring popular estimands such as average treatment effects in high-dimensional observational studies. 
Under the ``proportional asymptotic'' regime and Gaussian covariates with known (population) covariance $\\Sigma$, we derive Consistent and Asymptotically Normal (CAN) estimators of our targets of inference through a Method-of-Moments type of estimators that bypasses estimation of high dimensional nuisance functions and hyperparameter tuning altogether. Additionally, under non-Gaussian covariates, we demonstrate universality of our results under certain additional assumptions on the regression coefficients and $\\Sigma$. We also demonstrate that knowing $\\Sigma$ is not essential to our proposed methodology when the sample covariance matrix estimator is invertible. Finally, we complement our theoretical results with numerical experiments and comparisons with existing literature."}, "https://arxiv.org/abs/2408.06277": {"title": "Multi-marginal Schr\\\"odinger Bridges with Iterative Reference", "link": "https://arxiv.org/abs/2408.06277", "description": "arXiv:2408.06277v1 Announce Type: cross \nAbstract: Practitioners frequently aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For instance, in single-cell sequencing, scientists would like to learn how gene expression evolves over time. But sequencing any cell destroys that cell. So we cannot access any cell's full trajectory, but we can access snapshot samples from many cells. Stochastic differential equations are commonly used to analyze systems with full individual-trajectory access; since here we have only sample snapshots, these methods are inapplicable. The deep learning community has recently explored using Schr\\\"odinger bridges (SBs) and their extensions to estimate these dynamics. However, these methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic within the SB, which is often just set to be Brownian motion. But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model class for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a class of reference dynamics, not a single fixed one. In particular, we suggest an iterative projection method inspired by Schr\\\"odinger bridges; we alternate between learning a piecewise SB on the unobserved trajectories and using the learned SB to refine our best guess for the dynamics within the reference class. We demonstrate the advantages of our method via a well-known simulated parametric model from ecology, simulated and real data from systems biology, and real motion-capture data."}, "https://arxiv.org/abs/2108.12682": {"title": "Feature Selection in High-dimensional Spaces Using Graph-Based Methods", "link": "https://arxiv.org/abs/2108.12682", "description": "arXiv:2108.12682v3 Announce Type: replace \nAbstract: High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. 
Our algorithm can be applied in a completely nonparametric setup without any distributional assumptions on the data, and it aims at outputting those features in the data, that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests on subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability, while ensuring optimal control on false-discovery. We show the superior performance of our method over other existing ones through synthetic data, and demonstrate the utility of this method on several real-life datasets from the domains of climate change and biology, wherein our algorithm is not only able to detect known features expected to be associated with the underlying process, but also discovers novel targets that can be subsequently studied."}, "https://arxiv.org/abs/2210.17121": {"title": "Powerful Spatial Multiple Testing via Borrowing Neighboring Information", "link": "https://arxiv.org/abs/2210.17121", "description": "arXiv:2210.17121v2 Announce Type: replace \nAbstract: Clustered effects are often encountered in multiple hypothesis testing of spatial signals. In this paper, we propose a new method, termed \\textit{two-dimensional spatial multiple testing} (2d-SMT) procedure, to control the false discovery rate (FDR) and improve the detection power by exploiting the spatial information encoded in neighboring observations. The proposed method provides a novel perspective of utilizing spatial information by gathering signal patterns and spatial dependence into an auxiliary statistic. 2d-SMT rejects the null when a primary statistic at the location of interest and the auxiliary statistic constructed based on nearby observations are greater than their corresponding cutoffs. 2d-SMT can also be combined with different variants of the weighted BH procedures to improve the detection power further. A fast algorithm is developed to accelerate the search for optimal cutoffs in 2d-SMT. In theory, we establish the asymptotic FDR control of 2d-SMT under weak spatial dependence. Extensive numerical experiments demonstrate that the 2d-SMT method combined with various weighted BH procedures achieves the most competitive performance in FDR and power trade-off."}, "https://arxiv.org/abs/2310.14419": {"title": "Variable Selection and Minimax Prediction in High-dimensional Functional Linear Model", "link": "https://arxiv.org/abs/2310.14419", "description": "arXiv:2310.14419v3 Announce Type: replace \nAbstract: High-dimensional functional data have become increasingly prevalent in modern applications such as high-frequency financial data and neuroimaging data analysis. We investigate a class of high-dimensional linear regression models, where each predictor is a random element in an infinite-dimensional function space, and the number of functional predictors $p$ can potentially be ultra-high. Assuming that each of the unknown coefficient functions belongs to some reproducing kernel Hilbert space (RKHS), we regularize the fitting of the model by imposing a group elastic-net type of penalty on the RKHS norms of the coefficient functions. We show that our loss function is Gateaux sub-differentiable, and our functional elastic-net estimator exists uniquely in the product RKHS. 
Under suitable sparsity assumptions and a functional version of the irrepresentable condition, we derive a non-asymptotic tail bound for variable selection consistency of our method. Allowing the number of true functional predictors $q$ to diverge with the sample size, we also show that a post-selection refined estimator can achieve the oracle minimax optimal prediction rate. The proposed methods are illustrated through simulation studies and a real-data application from the Human Connectome Project."}, "https://arxiv.org/abs/2209.00991": {"title": "E-backtesting", "link": "https://arxiv.org/abs/2209.00991", "description": "arXiv:2209.00991v4 Announce Type: replace-cross \nAbstract: In the recent Basel Accords, the Expected Shortfall (ES) replaces the Value-at-Risk (VaR) as the standard risk measure for market risk in the banking sector, making it the most important risk measure in financial regulation. One of the most challenging tasks in risk modeling practice is to backtest ES forecasts provided by financial institutions. To design a model-free backtesting procedure for ES, we make use of the recently developed techniques of e-values and e-processes. Backtest e-statistics are introduced to formulate e-processes for risk measure forecasts, and unique forms of backtest e-statistics for VaR and ES are characterized using recent results on identification functions. For a given backtest e-statistic, a few criteria for optimally constructing the e-processes are studied. The proposed method can be naturally applied to many other risk measures and statistical quantities. We conduct extensive simulation studies and data analysis to illustrate the advantages of the model-free backtesting method, and compare it with the ones in the literature."}, "https://arxiv.org/abs/2408.06464": {"title": "Causal Graph Aided Causal Discovery in an Observational Aneurysmal Subarachnoid Hemorrhage Study", "link": "https://arxiv.org/abs/2408.06464", "description": "arXiv:2408.06464v1 Announce Type: new \nAbstract: Causal inference methods for observational data are increasingly recognized as a valuable complement to randomized clinical trials (RCTs). They can, under strong assumptions, emulate RCTs or help refine their focus. Our approach to causal inference uses causal directed acyclic graphs (DAGs). We are motivated by a concern that many observational studies in medicine begin without a clear definition of their objectives, without awareness of the scientific potential, and without tools to identify the necessary in itinere adjustments. We present and illustrate methods that provide \"midway insights\" during the study's course, identify meaningful causal questions within the study's reach, and point to the necessary database enhancements for these questions to be meaningfully tackled. The method hinges on concepts of identification and positivity. These concepts are illustrated through an analysis of data generated by patients with aneurysmal Subarachnoid Hemorrhage (aSAH) halfway through a study, focusing in particular on the consequences of external ventricular drain (EVD) in strata of the aSAH population. 
In addition, we propose a method for multicenter studies, to monitor the impact of changes in practice at an individual center's level, by leveraging principles of instrumental variable (IV) inference."}, "https://arxiv.org/abs/2408.06504": {"title": "Generative Bayesian Modeling with Implicit Priors", "link": "https://arxiv.org/abs/2408.06504", "description": "arXiv:2408.06504v1 Announce Type: new \nAbstract: Generative models are a cornerstone of Bayesian data analysis, enabling predictive simulations and model validation. However, in practice, manually specified priors often lead to unreasonable simulation outcomes, a common obstacle for full Bayesian simulations. As a remedy, we propose to add small portions of real or simulated data, which creates implicit priors that improve the stability of Bayesian simulations. We formalize this approach, providing a detailed process for constructing implicit priors and relating them to existing research in this area. We also integrate the implicit priors into simulation-based calibration, a prominent Bayesian simulation task. Through two case studies, we demonstrate that implicit priors quickly improve simulation stability and model calibration. Our findings provide practical guidelines for implementing implicit priors in various Bayesian modeling scenarios, showcasing their ability to generate more reliable simulation outcomes."}, "https://arxiv.org/abs/2408.06517": {"title": "Post-selection inference for high-dimensional mediation analysis with survival outcomes", "link": "https://arxiv.org/abs/2408.06517", "description": "arXiv:2408.06517v1 Announce Type: new \nAbstract: It is of substantial scientific interest to detect mediators that lie in the causal pathway from an exposure to a survival outcome. However, with high-dimensional mediators, as often encountered in modern genomic data settings, there is a lack of powerful methods that can provide valid post-selection inference for the identified marginal mediation effect. To resolve this challenge, we develop a post-selection inference procedure for the maximally selected natural indirect effect using a semiparametric efficient influence function approach. To this end, we establish the asymptotic normality of a stabilized one-step estimator that takes the selection of the mediator into account. Simulation studies show that our proposed method has good empirical performance. We further apply our proposed approach to a lung cancer dataset and find multiple DNA methylation CpG sites that might mediate the effect of cigarette smoking on lung cancer survival."}, "https://arxiv.org/abs/2408.06519": {"title": "An unbounded intensity model for point processes", "link": "https://arxiv.org/abs/2408.06519", "description": "arXiv:2408.06519v1 Announce Type: new \nAbstract: We develop a model for point processes on the real line, where the intensity can be locally unbounded without inducing an explosion. In contrast to an orderly point process, for which the probability of observing more than one event over a short time interval is negligible, the bursting intensity causes an extreme clustering of events around the singularity. We propose a nonparametric approach to detect such bursts in the intensity. It relies on a heavy traffic condition, which admits inference for point processes over a finite time interval. With Monte Carlo evidence, we show that our testing procedure exhibits size control under the null, whereas it has high rejection rates under the alternative. 
We implement our approach on high-frequency data for the EUR/USD spot exchange rate, where the test statistic captures abnormal surges in trading activity. We detect a nontrivial amount of intensity bursts in these data and describe their basic properties. Trading activity during an intensity burst is positively related to volatility, illiquidity, and the probability of observing a drift burst. The latter effect is reinforced if the order flow is imbalanced or the price elasticity of the limit order book is large."}, "https://arxiv.org/abs/2408.06539": {"title": "Conformal predictive intervals in survival analysis: a re-sampling approach", "link": "https://arxiv.org/abs/2408.06539", "description": "arXiv:2408.06539v1 Announce Type: new \nAbstract: The distribution-free method of conformal prediction (Vovk et al., 2005) has gained considerable attention in computer science, machine learning, and statistics. Candes et al. (2023) extended this method to right-censored survival data, addressing right-censoring complexity by creating a covariate shift setting, extracting a subcohort of subjects with censoring times exceeding a fixed threshold. Their approach only estimates the lower prediction bound for type I censoring, where all subjects have available censoring times regardless of their failure status. In medical applications, we often encounter more general right-censored data, observing only the minimum of failure time and censoring time. Subjects with observed failure times have unavailable censoring times. To address this, we propose a bootstrap method to construct one- as well as two-sided conformal predictive intervals for general right-censored survival data under different working regression models. Through simulations, our method demonstrates excellent average coverage for the lower bound and good coverage for the two-sided predictive interval, regardless of whether the working model is correctly specified or not, particularly under moderate censoring. We further extend the proposed method in several directions for medical applications. We apply this method to predict breast cancer patients' future survival times based on tumour characteristics and treatment."}, "https://arxiv.org/abs/2408.06612": {"title": "Double Robust high dimensional alpha test for linear factor pricing model", "link": "https://arxiv.org/abs/2408.06612", "description": "arXiv:2408.06612v1 Announce Type: new \nAbstract: In this paper, we investigate alpha testing for high-dimensional linear factor pricing models. We propose a spatial sign-based max-type test to handle sparse alternative cases. Additionally, we prove that this test is asymptotically independent of the spatial-sign-based sum-type test proposed by Liu et al. (2023). Based on this result, we introduce a Cauchy Combination test procedure that combines both the max-type and sum-type tests. Simulation studies and real data applications demonstrate that the newly proposed test procedure is robust not only for heavy-tailed distributions but also for the sparsity of the alternative hypothesis."}, "https://arxiv.org/abs/2408.06624": {"title": "Estimation and Inference of Average Treatment Effect in Percentage Points under Heterogeneity", "link": "https://arxiv.org/abs/2408.06624", "description": "arXiv:2408.06624v1 Announce Type: new \nAbstract: In semi-log regression models with heterogeneous treatment effects, the average treatment effect (ATE) in log points and its exponential transformation minus one underestimate the ATE in percentage points. 
I propose new estimation and inference methods for the ATE in percentage points, with inference utilizing the Fenton-Wilkinson approximation. These methods are particularly relevant for staggered difference-in-differences designs, where treatment effects often vary across groups and periods. I prove the methods' large-sample properties and demonstrate their finite-sample performance through simulations, revealing substantial discrepancies between conventional and proposed measures. Two empirical applications further underscore the practical importance of these methods."}, "https://arxiv.org/abs/2408.06739": {"title": "Considerations for missing data, outliers and transformations in permutation testing for ANOVA, ASCA(+) and related factorizations", "link": "https://arxiv.org/abs/2408.06739", "description": "arXiv:2408.06739v1 Announce Type: new \nAbstract: Multifactorial experimental designs allow us to assess the contribution of several factors, and potentially their interactions, to one or several responses of interest. Following the principles of the partition of the variance advocated by Sir R.A. Fisher, the experimental responses are factored into the quantitative contribution of main factors and interactions. A popular approach to perform this factorization in both ANOVA and ASCA(+) is through General Linear Models. Subsequently, different inferential approaches can be used to identify whether the contributions are statistically significant or not. Unfortunately, the performance of inferential approaches in terms of Type I and Type II errors can be heavily affected by missing data, outliers and/or the departure from normality of the distribution of the responses, which are commonplace problems in modern analytical experiments. In this paper, we study these problems and suggest good practices for application."}, "https://arxiv.org/abs/2408.06760": {"title": "Stratification in Randomised Clinical Trials and Analysis of Covariance: Some Simple Theory and Recommendations", "link": "https://arxiv.org/abs/2408.06760", "description": "arXiv:2408.06760v1 Announce Type: new \nAbstract: A simple device for balancing for a continuous covariate in clinical trials is to stratify by whether the covariate is above or below some target value, typically the predicted median. This raises an issue as to which model should be used for modelling the effect of treatment on the outcome variable, $Y$. Should one fit the stratum indicator, $S$, the continuous covariate, $X$, both, or neither?\n This question has been investigated in the literature using simulations targeting the overall effect on inferences about treatment. However, when a covariate is added to a model there are three consequences for inference: 1) the mean square error effect, 2) the variance inflation factor, and 3) second-order precision. We consider that it is valuable to consider these three factors separately even if, ultimately, it is their joint effect that matters.\n We present some simple theory, concentrating in particular on the variance inflation factor, that may be used to guide trialists in their choice of model. We also consider the case where the precise form of the relationship between the outcome and the covariate is not known. 
We conclude by recommending that the continuous covariate should always be in the model but that, depending on circumstances, there may be some justification for fitting the stratum indicator also."}, "https://arxiv.org/abs/2408.06769": {"title": "Joint model for interval-censored semi-competing events and longitudinal data with subject-specific within and between visits variabilities", "link": "https://arxiv.org/abs/2408.06769", "description": "arXiv:2408.06769v1 Announce Type: new \nAbstract: Dementia currently affects about 50 million people worldwide, and this number is rising. Since there is still no cure, the primary focus remains on preventing modifiable risk factors such as cardiovascular factors. It is now recognized that high blood pressure is a risk factor for dementia. An increasing number of studies suggest that blood pressure variability may also be a risk factor for dementia. However, these studies have significant methodological weaknesses and fail to distinguish between long-term and short-term variability. The aim of this work was to propose a new joint model that combines a mixed-effects model, which handles the residual variance distinguishing inter-visit variability from intra-visit variability, and an illness-death model that allows for interval censoring and semi-competing risks. A subject-specific random effect is included in the model for both variances. Risks can simultaneously depend on the current value and slope of the marker, as well as on each of the two components of the residual variance. The model estimation is performed by maximizing the likelihood function using the Marquardt-Levenberg algorithm. A simulation study validates the estimation procedure, which is implemented in an R package. The model was estimated using data from the Three-City (3C) cohort to study the impact of intra- and inter-visit blood pressure variability on the risk of dementia and death."}, "https://arxiv.org/abs/2408.06977": {"title": "Endogeneity Corrections in Binary Outcome Models with Nonlinear Transformations: Identification and Inference", "link": "https://arxiv.org/abs/2408.06977", "description": "arXiv:2408.06977v1 Announce Type: new \nAbstract: For binary outcome models, an endogeneity correction based on nonlinear rank-based transformations is proposed. Identification without external instruments is achieved under one of two assumptions: either the endogenous regressor is a nonlinear function of one component of the error term conditionally on exogenous regressors, or the dependence between the endogenous regressor and the exogenous regressor is nonlinear. Under these conditions, we prove consistency and asymptotic normality. Monte Carlo simulations and an application to German insolvency data illustrate the usefulness of the method."}, "https://arxiv.org/abs/2408.06987": {"title": "Optimal Network Pairwise Comparison", "link": "https://arxiv.org/abs/2408.06987", "description": "arXiv:2408.06987v1 Announce Type: new \nAbstract: We are interested in the problem of two-sample network hypothesis testing: given two networks with the same set of nodes, we wish to test whether the underlying Bernoulli probability matrices of the two networks are the same or not. We propose Interlacing Balance Measure (IBM) as a new two-sample testing approach. We consider the {\\it Degree-Corrected Mixed-Membership (DCMM)} model for undirected networks, where we allow severe degree heterogeneity, mixed-memberships, flexible sparsity levels, and weak signals. 
In such a broad setting, how to find a test that has a tractable limiting null and optimal testing performances is a challenging problem. We show that IBM is such a test: in a broad DCMM setting with only mild regularity conditions, IBM has $N(0,1)$ as the limiting null and achieves the optimal phase transition.\n While the above is for undirected networks, IBM is a unified approach and is directly implementable for directed networks. For a broad directed-DCMM (extension of DCMM for directed networks) setting, we show that IBM has $N(0, 1/2)$ as the limiting null and continues to achieve the optimal phase transition. We have also applied IBM to the Enron email network and a gene co-expression network, with interesting results."}, "https://arxiv.org/abs/2408.07006": {"title": "The Complexities of Differential Privacy for Survey Data", "link": "https://arxiv.org/abs/2408.07006", "description": "arXiv:2408.07006v1 Announce Type: new \nAbstract: The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies, and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies."}, "https://arxiv.org/abs/2408.07025": {"title": "GARCH copulas and GARCH-mimicking copulas", "link": "https://arxiv.org/abs/2408.07025", "description": "arXiv:2408.07025v1 Announce Type: new \nAbstract: The bivariate copulas that describe the dependencies and partial dependencies of lagged variables in strictly stationary, first-order GARCH-type processes are investigated. It is shown that the copulas of symmetric GARCH processes are jointly symmetric but non-exchangeable, while the copulas of processes with symmetric innovation distributions and asymmetric leverage effects have weaker h-symmetry; copulas with asymmetric innovation distributions have neither form of symmetry. Since the actual copulas are typically inaccessible, due to the unknown functional forms of the marginal distributions of GARCH processes, methods of mimicking them are proposed. These rely on constructions that combine standard bivariate copulas for positive dependence with two uniformity-preserving transformations known as v-transforms. A variety of new copulas are introduced and the ones providing the best fit to simulated data from GARCH-type processes are identified. 
A method of constructing tractable simplified d-vines using linear v-transforms is described and shown to coincide with the vt-d-vine model when the two v-transforms are identical."}, "https://arxiv.org/abs/2408.07066": {"title": "Conformal prediction after efficiency-oriented model selection", "link": "https://arxiv.org/abs/2408.07066", "description": "arXiv:2408.07066v1 Announce Type: new \nAbstract: Given a family of pretrained models and a hold-out set, how can we construct a valid conformal prediction set while selecting a model that minimizes the width of the set? If we use the same hold-out data set both to select a model (the model that yields the smallest conformal prediction sets) and then to construct a conformal prediction set based on that selected model, we suffer a loss of coverage due to selection bias. Alternatively, we could further split the data to perform selection and calibration separately, but this comes at a steep cost if the size of the dataset is limited. In this paper, we address the challenge of constructing a valid prediction set after efficiency-oriented model selection. Our novel methods can be implemented efficiently and admit finite-sample validity guarantees without invoking additional sample-splitting. We show that our methods yield prediction sets with asymptotically optimal size under a certain notion of continuity for the model class. The improved efficiency of the prediction sets constructed by our methods is further demonstrated through applications to synthetic datasets in various settings and a real data example."}, "https://arxiv.org/abs/2003.00593": {"title": "True and false discoveries with independent and sequential e-values", "link": "https://arxiv.org/abs/2003.00593", "description": "arXiv:2003.00593v2 Announce Type: replace \nAbstract: In this paper we use e-values in the context of multiple hypothesis testing assuming that the base tests produce independent, or sequential, e-values. Our simulation and empirical studies and theoretical considerations suggest that, under this assumption, our new algorithms are superior to the known algorithms using independent p-values and to our recent algorithms designed for e-values without the assumption of independence."}, "https://arxiv.org/abs/2211.04459": {"title": "flexBART: Flexible Bayesian regression trees with categorical predictors", "link": "https://arxiv.org/abs/2211.04459", "description": "arXiv:2211.04459v3 Announce Type: replace \nAbstract: Most implementations of Bayesian additive regression trees (BART) one-hot encode categorical predictors, replacing each one with several binary indicators, one for every level or category. Regression trees built with these indicators partition the discrete set of categorical levels by repeatedly removing one level at a time. Unfortunately, the vast majority of partitions cannot be built with this strategy, severely limiting BART's ability to partially pool data across groups of levels. Motivated by analyses of baseball data and neighborhood-level crime dynamics, we overcame this limitation by re-implementing BART with regression trees that can assign multiple levels to both branches of a decision tree node. To model spatial data aggregated into small regions, we further proposed a new decision rule prior that creates spatially contiguous regions by deleting a random edge from a random spanning tree of a suitably defined network. 
Our re-implementation, which is available in the flexBART package, often yields improved out-of-sample predictive performance and scales better to larger datasets than existing implementations of BART."}, "https://arxiv.org/abs/2311.12726": {"title": "Assessing variable importance in survival analysis using machine learning", "link": "https://arxiv.org/abs/2311.12726", "description": "arXiv:2311.12726v3 Announce Type: replace \nAbstract: Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials."}, "https://arxiv.org/abs/2408.07185": {"title": "A Sparse Grid Approach for the Nonparametric Estimation of High-Dimensional Random Coefficient Models", "link": "https://arxiv.org/abs/2408.07185", "description": "arXiv:2408.07185v1 Announce Type: new \nAbstract: A severe limitation of many nonparametric estimators for random coefficient models is the exponential increase of the number of parameters in the number of random coefficients included into the model. This property, known as the curse of dimensionality, restricts the application of such estimators to models with moderately few random coefficients. This paper proposes a scalable nonparametric estimator for high-dimensional random coefficient models. The estimator uses a truncated tensor product of one-dimensional hierarchical basis functions to approximate the underlying random coefficients' distribution. Due to the truncation, the number of parameters increases at a much slower rate than in the regular tensor product basis, rendering the nonparametric estimation of high-dimensional random coefficient models feasible. The derived estimator allows estimating the underlying distribution with constrained least squares, making the approach computationally simple and fast. Monte Carlo experiments and an application to data on the regulation of air pollution illustrate the good performance of the estimator."}, "https://arxiv.org/abs/2408.07193": {"title": "A comparison of methods for estimating the average treatment effect on the treated for externally controlled trials", "link": "https://arxiv.org/abs/2408.07193", "description": "arXiv:2408.07193v1 Announce Type: new \nAbstract: While randomized trials may be the gold standard for evaluating the effectiveness of the treatment intervention, in some special circumstances, single-arm clinical trials utilizing external control may be considered. 
The causal treatment effect of interest for single-arm studies is usually the average treatment effect on the treated (ATT) rather than the average treatment effect (ATE). Although methods have been developed to estimate the ATT, the selection and use of these methods require a thorough comparison and an in-depth understanding of their advantages and disadvantages. In this study, we conducted simulations under different identifiability assumptions to compare the performance metrics (e.g., bias, standard deviation (SD), mean squared error (MSE), type I error rate) for a variety of methods, including the regression model, propensity score matching, Mahalanobis distance matching, coarsened exact matching, inverse probability weighting, augmented inverse probability weighting (AIPW), AIPW with SuperLearner, and targeted maximum likelihood estimator (TMLE) with SuperLearner.\n Our simulation results demonstrate that the doubly robust methods in general have smaller biases than other methods. In terms of SD, nonmatching methods in general have smaller SDs than matching-based methods. MSE performance reflects a trade-off between bias and SD, and no method consistently performs better in terms of MSE. The identifiability assumptions are critical to the models' performance: violation of the positivity assumption can lead to a significant inflation of type I errors in some methods; violation of the unconfoundedness assumption can lead to a large bias for all methods... (Further details are available in the main body of the paper)."}, "https://arxiv.org/abs/2408.07231": {"title": "Estimating the FDR of variable selection", "link": "https://arxiv.org/abs/2408.07231", "description": "arXiv:2408.07231v1 Announce Type: new \nAbstract: We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter."}, "https://arxiv.org/abs/2408.07240": {"title": "Sensitivity of MCMC-based analyses to small-data removal", "link": "https://arxiv.org/abs/2408.07240", "description": "arXiv:2408.07240v1 Announce Type: new \nAbstract: If the conclusion of a data analysis is sensitive to dropping very few data points, that conclusion might hinge on the particular data at hand rather than representing a more broadly applicable truth. How could we check whether this sensitivity holds? One idea is to consider every small subset of data, drop it from the dataset, and re-run our analysis. But running MCMC to approximate a Bayesian posterior is already very expensive; running multiple times is prohibitive, and the number of re-runs needed here is combinatorially large. Recent work proposes a fast and accurate approximation to find the worst-case dropped data subset, but that work was developed for problems based on estimating equations -- and does not directly handle Bayesian posterior approximations using MCMC. We make two principal contributions in the present work. 
We adapt the existing data-dropping approximation to estimators computed via MCMC. Observing that Monte Carlo errors induce variability in the approximation, we use a variant of the bootstrap to quantify this uncertainty. We demonstrate how to use our approximation in practice to determine whether there is non-robustness in a problem. Empirically, our method is accurate in simple models, such as linear regression. In models with complicated structure, such as hierarchical models, the performance of our method is mixed."}, "https://arxiv.org/abs/2408.07365": {"title": "Fast Bayesian inference in a class of sparse linear mixed effects models", "link": "https://arxiv.org/abs/2408.07365", "description": "arXiv:2408.07365v1 Announce Type: new \nAbstract: Linear mixed effects models are widely used in statistical modelling. We consider a mixed effects model with Bayesian variable selection in the random effects using spike-and-slab priors and develop a variational Bayes inference scheme that can be applied to large data sets. An EM algorithm is proposed for the model with normal errors where the posterior distribution of the variable inclusion parameters is approximated using an Occam's window approach. Placing this approach within a variational Bayes scheme also allows the algorithm to be extended to the model with skew-t errors. The performance of the algorithm is evaluated in a simulation study, and the method is applied to a longitudinal model for elite athlete performance in the 100 metre sprint and weightlifting."}, "https://arxiv.org/abs/2408.07463": {"title": "A novel framework for quantifying nominal outlyingness", "link": "https://arxiv.org/abs/2408.07463", "description": "arXiv:2408.07463v1 Announce Type: new \nAbstract: Outlier detection is an important data mining tool that becomes particularly challenging when dealing with nominal data. First and foremost, flagging observations as outlying requires a well-defined notion of nominal outlyingness. This paper presents a definition of nominal outlyingness and introduces a general framework for quantifying outlyingness of nominal data. The proposed framework makes use of ideas from the association rule mining literature and can be used for calculating scores that indicate how outlying a nominal observation is. Methods for determining the involved hyperparameter values are presented and the concepts of variable contributions and outlyingness depth are introduced, in an attempt to enhance interpretability of the results. An implementation of the framework is tested on five real-world data sets and the key findings are outlined. The ideas presented can serve as a tool for assessing the degree to which an observation differs from the rest of the data, under the assumption of sequences of nominal levels having been generated from a Multinomial distribution with varying event probabilities."}, "https://arxiv.org/abs/2408.07678": {"title": "Your MMM is Broken: Identification of Nonlinear and Time-varying Effects in Marketing Mix Models", "link": "https://arxiv.org/abs/2408.07678", "description": "arXiv:2408.07678v1 Announce Type: new \nAbstract: Recent years have seen a resurgence in interest in marketing mix models (MMMs), which are aggregate-level models of marketing effectiveness. Often these models incorporate nonlinear effects, and either implicitly or explicitly assume that marketing effectiveness varies over time. 
In this paper, we show that nonlinear and time-varying effects are often not identifiable from standard marketing mix data: while certain data patterns may be suggestive of nonlinear effects, such patterns may also emerge under simpler models that incorporate dynamics in marketing effectiveness. This lack of identification is problematic because nonlinearities and dynamics suggest fundamentally different optimal marketing allocations. We examine this identification issue through theory and simulations, wherein we explore the exact conditions under which conflation between the two types of models is likely to occur. In doing so, we introduce a flexible Bayesian nonparametric model that allows us to both flexibly simulate and estimate different data-generating processes. We show that conflating the two types of effects is especially likely in the presence of autocorrelated marketing variables, which are common in practice, especially given the widespread use of stock variables to capture long-run effects of advertising. We illustrate these ideas through numerous empirical applications to real-world marketing mix data, showing the prevalence of the conflation issue in practice. Finally, we show how marketers can avoid this conflation, by designing experiments that strategically manipulate spending in ways that pin down model form."}, "https://arxiv.org/abs/2408.07209": {"title": "Local linear smoothing for regression surfaces on the simplex using Dirichlet kernels", "link": "https://arxiv.org/abs/2408.07209", "description": "arXiv:2408.07209v1 Announce Type: cross \nAbstract: This paper introduces a local linear smoother for regression surfaces on the simplex. The estimator solves a least-squares regression problem weighted by a locally adaptive Dirichlet kernel, ensuring excellent boundary properties. Asymptotic results for the bias, variance, mean squared error, and mean integrated squared error are derived, generalizing the univariate results of Chen (2002). A simulation study shows that the proposed local linear estimator with Dirichlet kernel outperforms its only direct competitor in the literature, the Nadaraya-Watson estimator with Dirichlet kernel due to Bouzebda, Nezzal, and Elhattab (2024)."}, "https://arxiv.org/abs/2408.07219": {"title": "Causal Effect Estimation using identifiable Variational AutoEncoder with Latent Confounders and Post-Treatment Variables", "link": "https://arxiv.org/abs/2408.07219", "description": "arXiv:2408.07219v1 Announce Type: cross \nAbstract: Estimating causal effects from observational data is challenging, especially in the presence of latent confounders. Much work has been done on addressing this challenge, but most of the existing research ignores the bias introduced by the post-treatment variables. In this paper, we propose a novel method of joint Variational AutoEncoder (VAE) and identifiable Variational AutoEncoder (iVAE) for learning the representations of latent confounders and latent post-treatment variables from their proxy variables, termed CPTiVAE, to achieve unbiased causal effect estimation from observational data. We further prove the identifiability in terms of the representation of latent post-treatment variables. Extensive experiments on synthetic and semi-synthetic datasets demonstrate that the CPTiVAE outperforms the state-of-the-art methods in the presence of latent confounders and post-treatment variables. 
We further apply CPTiVAE to a real-world dataset to show its potential application."}, "https://arxiv.org/abs/2408.07298": {"title": "Improving the use of social contact studies in epidemic modelling", "link": "https://arxiv.org/abs/2408.07298", "description": "arXiv:2408.07298v1 Announce Type: cross \nAbstract: Social contact studies, investigating social contact patterns in a population sample, have been an important contribution, allowing epidemic models to better fit real-life epidemics. A contact matrix $M$, having the \\emph{mean} number of contacts between individuals of different age groups as its elements, is estimated and used in combination with a multitype epidemic model to produce better data fitting and also give more appropriate expressions for $R_0$ and other model outcomes. However, $M$ does not capture \\emph{variation} in contacts \\emph{within} each age group, which is often large in empirical settings. Here such variation within age groups is included in a simple way by dividing each age group into two halves: the socially active and the socially less active. The extended contact matrix, and its associated epidemic model, empirically show that acknowledging variation in social activity within age groups has a substantial impact on modelling outcomes such as $R_0$ and the final fraction $\\tau$ getting infected. In fact, the variation in social activity within age groups is often more important for data fitting than the division into different age groups itself. However, a difficulty with heterogeneity in social activity is that social contact studies typically lack information on whether mixing with respect to social activity is assortative or not, i.e.\\ do the socially active tend to mix more with other socially active individuals or more with the socially less active? The analyses show that accounting for heterogeneity in social activity improves the analyses irrespective of whether such mixing is assortative or not, but the different assumptions give rather different outputs. Future social contact studies should hence also try to infer the degree of assortativity of contacts with respect to social activity."}, "https://arxiv.org/abs/2408.07575": {"title": "A General Framework for Constraint-based Causal Learning", "link": "https://arxiv.org/abs/2408.07575", "description": "arXiv:2408.07575v1 Announce Type: cross \nAbstract: By representing any constraint-based causal learning algorithm via a placeholder property, we decompose the correctness condition into a part relating the distribution and the true causal graph, and a part that depends solely on the distribution. This provides a general framework to obtain correctness conditions for causal learning, and has the following implications. We provide exact correctness conditions for the PC algorithm, which are then related to correctness conditions of some other existing causal discovery algorithms. We show that the sparsest Markov representation condition is the weakest correctness condition resulting from existing notions of minimality for maximal ancestral graphs and directed acyclic graphs. 
We also reason that additional knowledge than just Pearl-minimality is necessary for causal learning beyond faithfulness."}, "https://arxiv.org/abs/2201.00468": {"title": "A General Framework for Treatment Effect Estimation in Semi-Supervised and High Dimensional Settings", "link": "https://arxiv.org/abs/2201.00468", "description": "arXiv:2201.00468v3 Announce Type: replace \nAbstract: In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two such estimands: (a) the average treatment effect and (b) the quantile treatment effect, as prototype cases, in an SS setting, characterized by two available data sets: (i) a labeled data set of size $n$, providing observations for a response and a set of high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size $N$, much larger than $n$, but without the response observed. Using these two data sets, we develop a family of SS estimators which are ensured to be: (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement of robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency."}, "https://arxiv.org/abs/2201.10208": {"title": "Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings", "link": "https://arxiv.org/abs/2201.10208", "description": "arXiv:2201.10208v2 Announce Type: replace \nAbstract: We consider quantile estimation in a semi-supervised setting, characterized by two available data sets: (i) a small or moderate sized labeled data set containing observations for a response and a set of possibly high dimensional covariates, and (ii) a much larger unlabeled data set where only the covariates are observed. We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets, to improve the estimation accuracy compared to the supervised estimator, i.e., the sample quantile from the labeled data. These estimators use a flexible imputation strategy applied to the estimating equation along with a debiasing step that allows for full robustness against misspecification of the imputation model. Further, a one-step update strategy is adopted to enable easy implementation of our method and handle the complexity from the non-linear nature of the quantile estimating equation. 
Under mild assumptions, our estimators are fully robust to the choice of the nuisance imputation model, in the sense of always maintaining root-n consistency and asymptotic normality, while having improved efficiency relative to the supervised estimator. They also attain semi-parametric optimality if the relation between the response and the covariates is correctly specified via the imputation model. As an illustration of estimating the nuisance imputation function, we consider kernel smoothing type estimators on lower dimensional and possibly estimated transformations of the high dimensional covariates, and we establish novel results on their uniform convergence rates in high dimensions, involving responses indexed by a function class and usage of dimension reduction techniques. These results may be of independent interest. Numerical results on both simulated and real data confirm our semi-supervised approach's improved performance, in terms of both estimation and inference."}, "https://arxiv.org/abs/2206.01076": {"title": "Likelihood-based Inference for Random Networks with Changepoints", "link": "https://arxiv.org/abs/2206.01076", "description": "arXiv:2206.01076v3 Announce Type: replace \nAbstract: Generative, temporal network models play an important role in analyzing the dependence structure and evolution patterns of complex networks. Due to the complicated nature of real network data, it is often naive to assume that the underlying data-generative mechanism itself is invariant with time. Such observation leads to the study of changepoints or sudden shifts in the distributional structure of the evolving network. In this paper, we propose a likelihood-based methodology to detect changepoints in undirected, affine preferential attachment networks, and establish a hypothesis testing framework to detect a single changepoint, together with a consistent estimator for the changepoint. Such results require establishing consistency and asymptotic normality of the MLE under the changepoint regime, which suffers from long range dependence. The methodology is then extended to the multiple changepoint setting via both a sliding window method and a more computationally efficient score statistic. We also compare the proposed methodology with previously developed non-parametric estimators of the changepoint via simulation, and the methods developed herein are applied to modeling the popularity of a topic in a Twitter network over time."}, "https://arxiv.org/abs/2306.03073": {"title": "Inference for Local Projections", "link": "https://arxiv.org/abs/2306.03073", "description": "arXiv:2306.03073v2 Announce Type: replace \nAbstract: Inference for impulse responses estimated with local projections presents interesting challenges and opportunities. Analysts typically want to assess the precision of individual estimates, explore the dynamic evolution of the response over particular regions, and generally determine whether the impulse generates a response that is any different from the null of no effect. Each of these goals requires a different approach to inference. 
In this article, we provide an overview of results that have appeared in the literature in the past 20 years along with some new procedures that we introduce here."}, "https://arxiv.org/abs/2310.17712": {"title": "Community Detection Guarantees Using Embeddings Learned by Node2Vec", "link": "https://arxiv.org/abs/2310.17712", "description": "arXiv:2310.17712v2 Announce Type: replace-cross \nAbstract: Embedding the nodes of a large network into an Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state of the art performance. With the exception of spectral clustering methods, there is little theoretical understanding for commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that the use of $k$-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate this result empirically, and examine how this relates to other embedding tools for network data."}, "https://arxiv.org/abs/2408.07738": {"title": "A discrete-time survival model to handle interval-censored covariates", "link": "https://arxiv.org/abs/2408.07738", "description": "arXiv:2408.07738v1 Announce Type: new \nAbstract: Methods are lacking to handle the problem of survival analysis in the presence of an interval-censored covariate, specifically the case in which the conditional hazard of the primary event of interest depends on the occurrence of a secondary event, the observation time of which is subject to interval censoring. We propose and study a flexible class of discrete-time parametric survival models that handle the censoring problem through joint modeling of the interval-censored secondary event, the outcome, and the censoring mechanism. We apply this model to the research question that motivated the methodology, estimating the effect of HIV status on all-cause mortality in a prospective cohort study in South Africa."}, "https://arxiv.org/abs/2408.07842": {"title": "Quantile and Distribution Treatment Effects on the Treated with Possibly Non-Continuous Outcomes", "link": "https://arxiv.org/abs/2408.07842", "description": "arXiv:2408.07842v1 Announce Type: new \nAbstract: Quantile and Distribution Treatment effects on the Treated (QTT/DTT) for non-continuous outcomes are either not identified or inference thereon is infeasible using existing methods. By introducing functional index parallel trends and no anticipation assumptions, this paper identifies and provides uniform inference procedures for QTT/DTT. The inference procedure applies under both the canonical two-group and staggered treatment designs with balanced panels, unbalanced panels, or repeated cross-sections. 
Monte Carlo experiments demonstrate the proposed method's robust and competitive performance, while an empirical application illustrates its practical utility."}, "https://arxiv.org/abs/2408.08135": {"title": "Combined p-value functions for meta-analysis", "link": "https://arxiv.org/abs/2408.08135", "description": "arXiv:2408.08135v1 Announce Type: new \nAbstract: P-value functions are modern statistical tools that unify effect estimation and hypothesis testing and can provide alternative point and interval estimates compared to standard meta-analysis methods, using any of the many p-value combination procedures available (Xie et al., 2011, JASA). We provide a systematic comparison of different combination procedures, both from a theoretical perspective and through simulation. We show that many prominent p-value combination methods (e.g. Fisher's method) are not invariant to the orientation of the underlying one-sided p-values. Only Edgington's method, a lesser-known combination method based on the sum of p-values, is orientation-invariant and provides confidence intervals not restricted to be symmetric around the point estimate. Adjustments for heterogeneity can also be made and results from a simulation study indicate that the approach can compete with more standard meta-analytic methods."}, "https://arxiv.org/abs/2408.08177": {"title": "Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain", "link": "https://arxiv.org/abs/2408.08177", "description": "arXiv:2408.08177v1 Announce Type: new \nAbstract: Principal component analysis has been a main tool in multivariate analysis for estimating a low dimensional linear subspace that explains most of the variability in the data. However, in high-dimensional regimes, naive estimates of the principal loadings are not consistent and difficult to interpret. In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process, particularly if the principal components are interpretable in that they are sparse in coordinates and localized in frequency bands. In this paper, we introduce a formulation and consistent estimation procedure for interpretable principal component analysis for high-dimensional time series in the frequency domain. An efficient frequency-sequential algorithm is developed to compute sparse-localized estimates of the low-dimensional principal subspaces of the signal process. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a study of first episode psychosis."}, "https://arxiv.org/abs/2408.08200": {"title": "Analysing kinematic data from recreational runners using functional data analysis", "link": "https://arxiv.org/abs/2408.08200", "description": "arXiv:2408.08200v1 Announce Type: new \nAbstract: We present a multivariate functional mixed effects model for kinematic data from a large number of recreational runners. The runners' sagittal plane hip and knee angles are modelled jointly as a bivariate function with random effects functions used to account for the dependence among measurements from either side of the body. The model is fitted by first applying multivariate functional principal component analysis (mv-FPCA) and then modelling the mv-FPCA scores using scalar linear mixed effects models. 
Simulation and bootstrap approaches are introduced to construct simultaneous confidence bands for the fixed effects functions, and covariance functions are reconstructed to summarise the variability structure in the data and thoroughly investigate the suitability of the proposed model. In our scientific application, we observe a statistically significant effect of running speed on both the hip and knee angles. We also observe strong within-subject correlations, reflecting the highly idiosyncratic nature of running technique. Our approach is more generally applicable to modelling multiple streams of smooth kinematic or kinetic data measured repeatedly for multiple subjects in complex experimental designs."}, "https://arxiv.org/abs/2408.08259": {"title": "Incorporating Local Step-Size Adaptivity into the No-U-Turn Sampler using Gibbs Self Tuning", "link": "https://arxiv.org/abs/2408.08259", "description": "arXiv:2408.08259v1 Announce Type: new \nAbstract: Adapting the step size locally in the no-U-turn sampler (NUTS) is challenging because the step-size and path-length tuning parameters are interdependent. The determination of an optimal path length requires a predefined step size, while the ideal step size must account for errors along the selected path. Ensuring reversibility further complicates this tuning problem. In this paper, we present a method for locally adapting the step size in NUTS that is an instance of the Gibbs self-tuning (GIST) framework. Our approach guarantees reversibility with an acceptance probability that depends exclusively on the conditional distribution of the step size. We validate our step-size-adaptive NUTS method on Neal's funnel density and a high-dimensional normal distribution, demonstrating its effectiveness in challenging scenarios."}, "https://arxiv.org/abs/2408.07765": {"title": "A Bayesian Classification Trees Approach to Treatment Effect Variation with Noncompliance", "link": "https://arxiv.org/abs/2408.07765", "description": "arXiv:2408.07765v1 Announce Type: cross \nAbstract: Estimating varying treatment effects in randomized trials with noncompliance is inherently challenging since variation comes from two separate sources: variation in the impact itself and variation in the compliance rate. In this setting, existing frequentist and flexible machine learning methods are highly sensitive to the weak instruments problem, in which the compliance rate is (locally) close to zero. Bayesian approaches, on the other hand, can naturally account for noncompliance via imputation. We propose a Bayesian machine learning approach that combines the best features of both approaches. Our main methodological contribution is to present a Bayesian Causal Forest model for binary response variables in scenarios with noncompliance by repeatedly imputing individuals' compliance types, allowing us to flexibly estimate varying treatment effects among compliers. Simulation studies demonstrate the usefulness of our approach when compliance and treatment effects are heterogeneous. We apply the method to detect and analyze heterogeneity in the treatment effects in the Illinois Workplace Wellness Study, which not only features heterogeneous and one-sided compliance but also several binary outcomes of interest. We demonstrate the methodology on three outcomes one year after intervention. 
We confirm a null effect on the presence of a chronic condition, discover meaningful heterogeneity in a \"bad health\" outcome that cancels out to null in classical partial effect estimates, and find substantial heterogeneity in individuals' perception of management prioritization of health and safety."}, "https://arxiv.org/abs/1902.07447": {"title": "Eliciting ambiguity with mixing bets", "link": "https://arxiv.org/abs/1902.07447", "description": "arXiv:1902.07447v5 Announce Type: replace \nAbstract: Preferences for mixing can reveal ambiguity perception and attitude on a single event. The validity of the approach is discussed for multiple preference classes including maxmin, maxmax, variational, and smooth second-order preferences. An experimental implementation suggests that participants perceive almost as much ambiguity for the stock index and actions of other participants as for the Ellsberg urn, indicating the importance of ambiguity in real-world decision-making."}, "https://arxiv.org/abs/2009.09614": {"title": "Spillovers of Program Benefits with Missing Network Links", "link": "https://arxiv.org/abs/2009.09614", "description": "arXiv:2009.09614v3 Announce Type: replace \nAbstract: The issue of missing network links in partially observed networks is frequently neglected in empirical studies. This paper addresses this issue when investigating the spillovers of program benefits in the presence of network interactions. Our method is flexible enough to account for non-i.i.d. missing links. It relies on two network measures that can be easily constructed based on the incoming and outgoing links of the same observed network. The treatment and spillover effects can be point identified and consistently estimated if network degrees are bounded for all units. We also demonstrate the bias reduction property of our method if network degrees of some units are unbounded. Monte Carlo experiments and a naturalistic simulation on real-world network data are implemented to verify the finite-sample performance of our method. We also re-examine the spillover effects of home computer use on children's self-empowered learning."}, "https://arxiv.org/abs/2010.02297": {"title": "Testing homogeneity in dynamic discrete games in finite samples", "link": "https://arxiv.org/abs/2010.02297", "description": "arXiv:2010.02297v3 Announce Type: replace \nAbstract: The literature on dynamic discrete games often assumes that the conditional choice probabilities and the state transition probabilities are homogeneous across markets and over time. We refer to this as the \"homogeneity assumption\" in dynamic discrete games. This assumption enables empirical studies to estimate the game's structural parameters by pooling data from multiple markets and from many time periods. In this paper, we propose a hypothesis test to evaluate whether the homogeneity assumption holds in the data. Our hypothesis test is the result of an approximate randomization test, implemented via a Markov chain Monte Carlo (MCMC) algorithm. We show that our hypothesis test becomes valid as the (user-defined) number of MCMC draws diverges, for any fixed number of markets, time periods, and players. 
We apply our test to the empirical study of the U.S.\\ Portland cement industry in Ryan (2012)."}, "https://arxiv.org/abs/2208.02806": {"title": "A tree perspective on stick-breaking models in covariate-dependent mixtures", "link": "https://arxiv.org/abs/2208.02806", "description": "arXiv:2208.02806v4 Announce Type: replace \nAbstract: Stick-breaking (SB) processes are often adopted in Bayesian mixture models for generating mixing weights. When covariates influence the sizes of clusters, SB mixtures are particularly convenient as they can leverage their connection to binary regression to ease both the specification of covariate effects and posterior computation. Existing SB models are typically constructed based on continually breaking a single remaining piece of the unit stick. We view this from a dyadic tree perspective in terms of a lopsided bifurcating tree that extends only in one side. We show that two unsavory characteristics of SB models are in fact largely due to this lopsided tree structure. We consider a generalized class of SB models with alternative bifurcating tree structures and examine the influence of the underlying tree topology on the resulting Bayesian analysis in terms of prior assumptions, posterior uncertainty, and computational effectiveness. In particular, we provide evidence that a balanced tree topology, which corresponds to continually breaking all remaining pieces of the unit stick, can resolve or mitigate these undesirable properties of SB models that rely on a lopsided tree."}, "https://arxiv.org/abs/2212.10301": {"title": "Probabilistic Quantile Factor Analysis", "link": "https://arxiv.org/abs/2212.10301", "description": "arXiv:2212.10301v3 Announce Type: replace \nAbstract: This paper extends quantile factor analysis to a probabilistic variant that incorporates regularization and computationally efficient variational approximations. We establish through synthetic and real data experiments that the proposed estimator can, in many cases, achieve better accuracy than a recently proposed loss-based estimator. We contribute to the factor analysis literature by extracting new indexes of \\emph{low}, \\emph{medium}, and \\emph{high} economic policy uncertainty, as well as \\emph{loose}, \\emph{median}, and \\emph{tight} financial conditions. We show that the high uncertainty and tight financial conditions indexes have superior predictive ability for various measures of economic activity. In a high-dimensional exercise involving about 1000 daily financial series, we find that quantile factors also provide superior out-of-sample information compared to mean or median factors."}, "https://arxiv.org/abs/2305.00578": {"title": "A new clustering framework", "link": "https://arxiv.org/abs/2305.00578", "description": "arXiv:2305.00578v2 Announce Type: replace \nAbstract: Detecting clusters is a critical task in various fields, including statistics, engineering and bioinformatics. Our focus is primarily on the modern high-dimensional scenario, where traditional methods often fail due to the curse of dimensionality. In this study, we introduce a non-parametric framework for clustering that is applicable to any number of dimensions. Simulation results demonstrate that this new framework surpasses existing methods across a wide range of settings. 
We illustrate the proposed method with real data applications in distinguishing cancer tissues from normal tissues through gene expression data."}, "https://arxiv.org/abs/2401.00987": {"title": "Inverting estimating equations for causal inference on quantiles", "link": "https://arxiv.org/abs/2401.00987", "description": "arXiv:2401.00987v2 Announce Type: replace \nAbstract: The causal inference literature frequently focuses on estimating the mean of the potential outcome, whereas quantiles of the potential outcome may carry important additional information. We propose a unified approach, based on the inverse estimating equations, to generalize a class of causal inference solutions from estimating the mean of the potential outcome to its quantiles. We assume that a moment function is available to identify the mean of the threshold-transformed potential outcome, based on which a convenient construction of the estimating equation of quantiles of potential outcome is proposed. In addition, we give a general construction of the efficient influence functions of the mean and quantiles of potential outcomes, and explicate their connection. We motivate estimators for the quantile estimands with the efficient influence function, and develop their asymptotic properties when either parametric models or data-adaptive machine learners are used to estimate the nuisance functions. A broad implication of our results is that one can rework the existing result for mean causal estimands to facilitate causal inference on quantiles. Our general results are illustrated by several analytical and numerical examples."}, "https://arxiv.org/abs/2408.08481": {"title": "A Multivariate Multilevel Longitudinal Functional Model for Repeatedly Observed Human Movement Data", "link": "https://arxiv.org/abs/2408.08481", "description": "arXiv:2408.08481v1 Announce Type: new \nAbstract: Biomechanics and human movement research often involves measuring multiple kinematic or kinetic variables regularly throughout a movement, yielding data that present as smooth, multivariate, time-varying curves and are naturally amenable to functional data analysis. It is now increasingly common to record the same movement repeatedly for each individual, resulting in curves that are serially correlated and can be viewed as longitudinal functional data. We present a new approach for modelling multivariate multilevel longitudinal functional data, with application to kinematic data from recreational runners collected during a treadmill run. For each stride, the runners' hip, knee and ankle angles are modelled jointly as smooth multivariate functions that depend on subject-specific covariates. Longitudinally varying multivariate functional random effects are used to capture the dependence among adjacent strides and changes in the multivariate functions over the course of the treadmill run. A basis modelling approach is adopted to fit the model -- we represent each observation using a multivariate functional principal components basis and model the basis coefficients using scalar longitudinal mixed effects models. The predicted random effects are used to understand and visualise changes in the multivariate functional data over the course of the treadmill run. In our application, our method quantifies the effects of scalar covariates on the multivariate functional data, revealing a statistically significant effect of running speed at the hip, knee and ankle joints. 
Analysis of the predicted random effects reveals that individuals' kinematics are generally stable, but certain individuals who exhibit strong changes during the run can also be identified. A simulation study is presented to demonstrate the efficacy of the proposed methodology under realistic data-generating scenarios."}, "https://arxiv.org/abs/2408.08580": {"title": "Revisiting the Many Instruments Problem using Random Matrix Theory", "link": "https://arxiv.org/abs/2408.08580", "description": "arXiv:2408.08580v1 Announce Type: new \nAbstract: We use recent results from the theory of random matrices to improve instrumental variables estimation with many instruments. In settings where the first-stage parameters are dense, we show that Ridge lowers the implicit price of a bias adjustment. This comes along with improved (finite-sample) properties in the second stage regression. Our theoretical results nest existing results on bias approximation and bias adjustment. Moreover, they extend these results to settings with more instruments than observations."}, "https://arxiv.org/abs/2408.08630": {"title": "Spatial Principal Component Analysis and Moran Statistics for Multivariate Functional Areal Data", "link": "https://arxiv.org/abs/2408.08630", "description": "arXiv:2408.08630v1 Announce Type: new \nAbstract: In this article, we present the bivariate and multivariate functional Moran's I statistics and multivariate functional areal spatial principal component analysis (mfasPCA). These methods are the first of their kind in the field of multivariate areal spatial functional data analysis. The multivariate functional Moran's I statistic is employed to assess spatial autocorrelation, while mfasPCA is utilized for dimension reduction in both univariate and multivariate functional areal data. Through simulation studies and real-world examples, we demonstrate that the multivariate functional Moran's I statistic and mfasPCA are powerful tools for evaluating spatial autocorrelation in univariate and multivariate functional areal data analysis."}, "https://arxiv.org/abs/2408.08636": {"title": "Augmented Binary Method for Basket Trials (ABBA)", "link": "https://arxiv.org/abs/2408.08636", "description": "arXiv:2408.08636v1 Announce Type: new \nAbstract: In several clinical areas, traditional clinical trials often use a responder outcome, a composite endpoint that involves dichotomising a continuous measure. An augmented binary method that improves power whilst retaining the original responder endpoint has previously been proposed. The method leverages information from the undichotomised component to improve power. We extend this method to basket trials, which are gaining popularity in many clinical areas. For clinical areas where response outcomes are used, we propose the new Augmented Binary method for BAsket trials (ABBA), which enhances efficiency by borrowing information on the treatment effect between subtrials. The method is developed within a latent variable framework using a Bayesian hierarchical modelling approach. We investigate the properties of the proposed methodology by analysing point estimates and credible intervals in various simulation scenarios, comparing them to the standard analysis for basket trials that assumes a binary outcome. Our method results in a narrower 95% high density interval of the posterior distribution of the log odds ratio and an increase in power when the treatment effect is consistent across subtrials. 
We illustrate our approach using real data from two clinical trials in rheumatology."}, "https://arxiv.org/abs/2408.08771": {"title": "Dynamic factor analysis for sparse and irregular longitudinal data: an application to metabolite measurements in a COVID-19 study", "link": "https://arxiv.org/abs/2408.08771", "description": "arXiv:2408.08771v1 Announce Type: new \nAbstract: It is of scientific interest to identify essential biomarkers in biological processes underlying diseases to facilitate precision medicine. Factor analysis (FA) has long been used to address this goal: by assuming latent biological pathways drive the activity of measurable biomarkers, a biomarker is more influential if its absolute factor loading is larger. Although correlation between biomarkers has been properly handled under this framework, correlations between latent pathways are often overlooked, as one classical assumption in FA is the independence between factors. However, this assumption may not be realistic in the context of pathways, as existing biological knowledge suggests that pathways interact with one another rather than functioning independently. Motivated by sparsely and irregularly collected longitudinal measurements of metabolites in a COVID-19 study of large sample size, we propose a dynamic factor analysis model that can account for the potential cross-correlations between pathways, through a multi-output Gaussian processes (MOGP) prior on the factor trajectories. To mitigate overfitting caused by the sparsity of longitudinal measurements, we introduce a roughness penalty upon MOGP hyperparameters and allow for non-zero mean functions. To estimate these hyperparameters, we develop a stochastic expectation maximization (StEM) algorithm that scales well to the large sample size. In our simulation studies, StEM leads to more accurate and stable estimates of the MOGP hyperparameters than a comparator algorithm used in previous research, across all sample sizes considered. Application to the motivating example identifies a kynurenine pathway that affects the clinical severity of patients with COVID-19. In particular, a novel biomarker, taurine, is discovered, which has been receiving increased attention clinically, yet its role was overlooked in a previous analysis."}, "https://arxiv.org/abs/2408.08388": {"title": "Classification of High-dimensional Time Series in Spectral Domain using Explainable Features", "link": "https://arxiv.org/abs/2408.08388", "description": "arXiv:2408.08388v1 Announce Type: cross \nAbstract: Interpretable classification of time series presents significant challenges in high dimensions. Traditional feature selection methods in the frequency domain often assume sparsity in spectral density matrices (SDMs) or their inverses, which can be restrictive for real-world applications. In this article, we propose a model-based approach for classifying high-dimensional stationary time series by assuming sparsity in the difference between inverse SDMs. Our approach emphasizes the interpretability of model parameters, making it especially suitable for fields like neuroscience, where understanding differences in brain network connectivity across various states is crucial. The estimators for model parameters demonstrate consistency under appropriate conditions. We further propose using standard deep learning optimizers for parameter estimation, employing techniques such as mini-batching and learning rate scheduling. 
Additionally, we introduce a method to screen the most discriminatory frequencies for classification, which exhibits the sure screening property under general conditions. The flexibility of the proposed model allows the significance of covariates to vary across frequencies, enabling nuanced inferences and deeper insights into the underlying problem. The novelty of our method lies in the interpretability of the model parameters, addressing critical needs in neuroscience. The proposed approaches have been evaluated on simulated examples and the `Alert-vs-Drowsy' EEG dataset."}, "https://arxiv.org/abs/2408.08450": {"title": "Smooth and shape-constrained quantile distributed lag models", "link": "https://arxiv.org/abs/2408.08450", "description": "arXiv:2408.08450v1 Announce Type: cross \nAbstract: Exposure to environmental pollutants during the gestational period can significantly impact infant health outcomes, such as birth weight and neurological development. Identifying critical windows of susceptibility, which are specific periods during pregnancy when exposure has the most profound effects, is essential for developing targeted interventions. Distributed lag models (DLMs) are widely used in environmental epidemiology to analyze the temporal patterns of exposure and their impact on health outcomes. However, traditional DLMs focus on modeling the conditional mean, which may fail to capture heterogeneity in the relationship between predictors and the outcome. Moreover, when modeling the distribution of health outcomes like gestational birthweight, it is the extreme quantiles that are of most clinical relevance. We introduce two new quantile distributed lag model (QDLM) estimators designed to address the limitations of existing methods by leveraging smoothness and shape constraints, such as unimodality and concavity, to enhance interpretability and efficiency. We apply our QDLM estimators to the Colorado birth cohort data, demonstrating their effectiveness in identifying critical windows of susceptibility and informing public health interventions."}, "https://arxiv.org/abs/2408.08764": {"title": "Generalized logistic model for $r$ largest order statistics, with hydrological application", "link": "https://arxiv.org/abs/2408.08764", "description": "arXiv:2408.08764v1 Announce Type: cross \nAbstract: The effective use of available information in extreme value analysis is critical because extreme values are scarce. Thus, using the $r$ largest order statistics (rLOS) instead of the block maxima is encouraged. Based on the four-parameter kappa model for the rLOS (rK4D), we introduce a new distribution for the rLOS as a special case of the rK4D. That is the generalized logistic model for rLOS (rGLO). This distribution can be useful when the generalized extreme value model for rLOS is not efficient enough to capture the variability of extreme values. Moreover, the rGLO enriches a pool of candidate distributions to determine the best model to yield accurate and robust quantile estimates. We derive the joint probability density function and the marginal and conditional distribution functions of the new model. Maximum likelihood estimation, the delta method, profile likelihood, order selection by the entropy difference test, cross-validated likelihood criteria, and model averaging are considered for inference. 
The usefulness and practical effectiveness of the rGLO are illustrated by a Monte Carlo simulation and an application to extreme streamflow data in Bevern Stream, UK."}, "https://arxiv.org/abs/2007.12031": {"title": "The r-largest four parameter kappa distribution", "link": "https://arxiv.org/abs/2007.12031", "description": "arXiv:2007.12031v2 Announce Type: replace \nAbstract: The generalized extreme value distribution (GEVD) has been widely used to model extreme events in many areas. It is, however, limited to using only block maxima, which motivated the extension of the GEVD to deal with $r$-largest order statistics (rGEVD). The rGEVD, which uses more than one extreme per block, can significantly improve the performance of the GEVD. The four parameter kappa distribution (K4D) is a generalization of some three-parameter distributions including the GEVD. It can be useful in fitting data when the three parameters in the GEVD are not sufficient to capture the variability of the extreme observations. The K4D still uses only block maxima. In this study, we thus extend the K4D to deal with $r$-largest order statistics, in analogy to how the GEVD is extended to the rGEVD. The new distribution is called the $r$-largest four parameter kappa distribution (rK4D). We derive a joint probability density function (PDF) of the rK4D, and the marginal and conditional cumulative distribution functions and PDFs. The maximum likelihood method is considered to estimate the parameters. The usefulness and some practical concerns of the rK4D are illustrated by applying it to Venice sea-level data. This example study shows that the rK4D gives a better fit but larger variances of the parameter estimates than the rGEVD. Some new $r$-largest distributions are derived as special cases of the rK4D, such as the $r$-largest logistic (rLD), generalized logistic (rGLD), and generalized Gumbel distributions (rGGD)."}, "https://arxiv.org/abs/2210.10768": {"title": "Anytime-valid off-policy inference for contextual bandits", "link": "https://arxiv.org/abs/2210.10768", "description": "arXiv:2210.10768v3 Announce Type: replace \nAbstract: Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as ``off-policy evaluation'' (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time-series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. 
These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and if they are, we do not need to know these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data."}, "https://arxiv.org/abs/2301.01345": {"title": "Inspecting discrepancy between multivariate distributions using half-space depth based information criteria", "link": "https://arxiv.org/abs/2301.01345", "description": "arXiv:2301.01345v3 Announce Type: replace \nAbstract: This article inspects whether a multivariate distribution is different from a specified distribution or not, and it also tests the equality of two multivariate distributions. In the course of this study, a graphical tool-kit using well-known half-space depth based information criteria is proposed, which is a two-dimensional plot, regardless of the dimension of the data, and it is even useful in comparing high-dimensional distributions. The simple interpretability of the proposed graphical tool-kit motivates us to formulate test statistics to carry out the corresponding testing of hypothesis problems. It is established that the proposed tests based on the same information criteria are consistent, and moreover, the asymptotic distributions of the test statistics under contiguous/local alternatives are derived, which enable us to compute the asymptotic power of these tests. Furthermore, it is observed that the computations associated with the proposed tests are unburdensome. Besides, these tests perform better than many other tests available in the literature when data are generated from various distributions such as heavy tailed distributions, which indicates that the proposed methodology is robust as well. Finally, the usefulness of the proposed graphical tool-kit and tests is shown on two benchmark real data sets."}, "https://arxiv.org/abs/2310.11357": {"title": "A Pseudo-likelihood Approach to Under-5 Mortality Estimation", "link": "https://arxiv.org/abs/2310.11357", "description": "arXiv:2310.11357v2 Announce Type: replace \nAbstract: Accurate and precise estimates of the under-5 mortality rate (U5MR) are an important health summary for countries. However, full survival curves allow us to better understand the pattern of mortality in children under five. Modern demographic methods for estimating a full mortality schedule for children have been developed for countries with good vital registration and reliable census data, but perform poorly in many low- and middle-income countries (LMICs). In these countries, the need to utilize nationally representative surveys to estimate the U5MR requires additional care to mitigate potential biases in survey data, acknowledge the survey design, and handle the usual characteristics of survival data, for example, censoring and truncation. In this paper, we develop parametric and non-parametric pseudo-likelihood approaches to estimating child mortality across calendar time from complex survey data. We show that the parametric approach is particularly useful in scenarios where data are sparse and parsimonious models allow efficient estimation. 
We compare a variety of parametric models to two existing methods for obtaining a full survival curve for children under the age of 5, and argue that a parametric pseudo-likelihood approach is advantageous in LMICs. We apply our proposed approaches to survey data from four LMICs."}, "https://arxiv.org/abs/2311.11256": {"title": "Bayesian Modeling of Incompatible Spatial Data: A Case Study Involving Post-Adrian Storm Forest Damage Assessment", "link": "https://arxiv.org/abs/2311.11256", "description": "arXiv:2311.11256v2 Announce Type: replace \nAbstract: Modeling incompatible spatial data, i.e., data with different spatial resolutions, is a pervasive challenge in remote sensing data analysis. Typical approaches to addressing this challenge aggregate information to a common coarse resolution, i.e., compatible resolutions, prior to modeling. Such pre-processing aggregation simplifies analysis, but potentially causes information loss and hence compromised inference and predictive performance. To avoid losing potential information provided by finer spatial resolution data and improve predictive performance, we propose a new Bayesian method that constructs a latent spatial process model at the finest spatial resolution. This model is tailored to settings where the outcome variable is measured at a coarser spatial resolution than the predictor variables -- a configuration seen increasingly when high spatial resolution remotely sensed predictors are used in analysis. A key contribution of this work is an efficient algorithm that enables full Bayesian inference using finer resolution data while optimizing computational and storage costs. The proposed method is applied to a forest damage assessment for the 2018 Adrian storm in Carinthia, Austria, that uses high-resolution laser imaging detection and ranging (LiDAR) measurements and relatively coarse resolution forest inventory measurements. Extensive simulation studies demonstrate that the proposed approach substantially improves inference for small prediction units."}, "https://arxiv.org/abs/2408.08908": {"title": "Panel Data Unit Root testing: Overview", "link": "https://arxiv.org/abs/2408.08908", "description": "arXiv:2408.08908v1 Announce Type: new \nAbstract: This review discusses methods of testing for a panel unit root. Modern approaches to testing in cross-sectionally correlated panels are discussed, preceded by an analysis of independent panels. In addition, methods for testing in the presence of non-linearity in the data (for example, structural breaks) are presented, as well as methods for testing in short panels, when the time dimension is small and finite. In conclusion, links to existing packages that implement some of the described methods are provided."}, "https://arxiv.org/abs/2408.08990": {"title": "Adaptive Uncertainty Quantification for Generative AI", "link": "https://arxiv.org/abs/2408.08990", "description": "arXiv:2408.08990v1 Announce Type: new \nAbstract: This work is concerned with conformal prediction in contemporary applications (including generative AI) where a black-box model has been trained on data that are not accessible to the user. Mirroring split-conformal inference, we design a wrapper around a black-box algorithm that calibrates conformity scores. This calibration is local and proceeds in two stages by first adaptively partitioning the predictor space into groups and then calibrating sectionally group by group. 
Adaptive partitioning (self-grouping) is achieved by fitting a robust regression tree to the conformity scores on the calibration set. This new tree variant is designed in such a way that adding a single new observation does not change the tree fit with overwhelmingly large probability. This add-one-in robustness property allows us to establish a finite-sample group-conditional coverage guarantee, a refinement of the marginal guarantee. In addition, unlike traditional split-conformal inference, adaptive splitting and within-group calibration yield adaptive bands that can stretch and shrink locally. We demonstrate the benefits of local tightening on several simulated as well as real examples using non-parametric regression. Finally, we consider two contemporary classification applications for obtaining uncertainty quantification around GPT-4o predictions. We conformalize skin disease diagnoses based on self-reported symptoms as well as predicted states of U.S. legislators based on summaries of their ideology. We demonstrate substantial local tightening of the uncertainty sets while attaining similar marginal coverage."}, "https://arxiv.org/abs/2408.09008": {"title": "Approximations to worst-case data dropping: unmasking failure modes", "link": "https://arxiv.org/abs/2408.09008", "description": "arXiv:2408.09008v1 Announce Type: new \nAbstract: A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose using additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one outlier may hide or conceal the effect of another outlier. Based on the failures we identify, we provide recommendations for users and suggest directions for future improvements."}, "https://arxiv.org/abs/2408.09012": {"title": "Autoregressive models for panel data causal inference with application to state-level opioid policies", "link": "https://arxiv.org/abs/2408.09012", "description": "arXiv:2408.09012v1 Announce Type: new \nAbstract: Motivated by the study of state opioid policies, we propose a novel approach that uses autoregressive models for causal effect estimation in settings with panel data and staggered treatment adoption. Specifically, we seek to estimate the impact of key opioid-related policies by quantifying the effects of must-access prescription drug monitoring programs (PDMPs), naloxone access laws (NALs), and medical marijuana laws on opioid prescribing. Existing methods, such as differences-in-differences and synthetic controls, are challenging to apply in these types of dynamic policy landscapes where multiple policies are implemented over time and sample sizes are small. Autoregressive models are an alternative strategy that have been used to estimate policy effects in similar settings, but until this paper have lacked formal justification. 
We outline a set of assumptions that tie these models to causal effects, and we study the biases of estimates based on this approach when key causal assumptions are violated. In a set of simulation studies that mirror the structure of our application, we also show that our proposed estimators frequently outperform existing estimators. In short, we justify the use of autoregressive models to provide robust evidence on the effectiveness of four state policies in combating the opioid crisis."}, "https://arxiv.org/abs/2408.09096": {"title": "Dynamic linear regression models for forecasting time series with semi long memory errors", "link": "https://arxiv.org/abs/2408.09096", "description": "arXiv:2408.09096v1 Announce Type: new \nAbstract: Dynamic linear regression models forecast the values of a time series based on a linear combination of a set of exogenous time series while incorporating a time series process for the error term. This error process is often assumed to follow an autoregressive integrated moving average (ARIMA) model, or seasonal variants thereof, which are unable to capture a long-range dependency structure of the error process. We propose a novel dynamic linear regression model that incorporates the long-range dependency feature of the errors. We demonstrate that the proposed error process may (i) have a significant impact on the posterior uncertainty of the estimated regression parameters and (ii) improve the model's forecasting ability. We develop a Markov chain Monte Carlo method to fit general dynamic linear regression models based on a frequency domain approach that enables fast, asymptotically exact Bayesian inference for large datasets. We demonstrate that our approximate algorithm is faster than the traditional time domain approaches, such as the Kalman filter and the multivariate Gaussian likelihood, while retaining a high accuracy when approximating the posterior distribution. We illustrate the method in simulated examples and two energy forecasting applications."}, "https://arxiv.org/abs/2408.09155": {"title": "Learning Robust Treatment Rules for Censored Data", "link": "https://arxiv.org/abs/2408.09155", "description": "arXiv:2408.09155v1 Announce Type: new \nAbstract: There is a fast-growing literature on estimating optimal treatment rules directly by maximizing the expected outcome. In biomedical studies and operations applications, censored survival outcomes are frequently observed, in which case the restricted mean survival time and survival probability are of great interest. In this paper, we propose two robust criteria for learning optimal treatment rules with censored survival outcomes; the former targets an optimal treatment rule maximizing the restricted mean survival time, where the restriction is specified by a given quantile such as the median; the latter targets an optimal treatment rule maximizing buffered survival probabilities, where the predetermined threshold is adjusted to account for the restricted mean survival time. We provide theoretical justifications for the proposed optimal treatment rules and develop a sampling-based difference-of-convex algorithm for learning them. In simulation studies, our estimators show improved performance compared to existing methods. 
We also demonstrate the proposed method using AIDS clinical trial data."}, "https://arxiv.org/abs/2408.09187": {"title": "Externally Valid Selection of Experimental Sites via the k-Median Problem", "link": "https://arxiv.org/abs/2408.09187", "description": "arXiv:2408.09187v1 Announce Type: new \nAbstract: We present a decision-theoretic justification for viewing the question of how to best choose where to experiment in order to optimize external validity as a k-median (clustering) problem, a popular problem in computer science and operations research. We present conditions under which minimizing the worst-case, welfare-based regret among all nonrandom schemes that select k sites to experiment is approximately equal - and sometimes exactly equal - to finding the k most central vectors of baseline site-level covariates. The k-median problem can be formulated as a linear integer program. Two empirical applications illustrate the theoretical and computational benefits of the suggested procedure."}, "https://arxiv.org/abs/2408.09271": {"title": "Counterfactual and Synthetic Control Method: Causal Inference with Instrumented Principal Component Analysis", "link": "https://arxiv.org/abs/2408.09271", "description": "arXiv:2408.09271v1 Announce Type: new \nAbstract: The fundamental problem of causal inference lies in the absence of counterfactuals. Traditional methodologies impute the missing counterfactuals implicitly or explicitly based on untestable or overly stringent assumptions. Synthetic control method (SCM) utilizes a weighted average of control units to impute the missing counterfactual for the treated unit. Although SCM relaxes some strict assumptions, it still requires the treated unit to be inside the convex hull formed by the controls, avoiding extrapolation. In recent advances, researchers have modeled the entire data generating process (DGP) to explicitly impute the missing counterfactual. This paper expands the interactive fixed effect (IFE) model by instrumenting covariates into factor loadings, adding additional robustness. This methodology offers multiple benefits: firstly, it incorporates the strengths of previous SCM approaches, such as the relaxation of the untestable parallel trends assumption (PTA). Secondly, it does not require the targeted outcomes to be inside the convex hull formed by the controls. Thirdly, it eliminates the need for correct model specification required by the IFE model. Finally, it inherits the ability of principal component analysis (PCA) to effectively handle high-dimensional data and enhances the value extracted from numerous covariates."}, "https://arxiv.org/abs/2408.09415": {"title": "An exhaustive selection of sufficient adjustment sets for causal inference", "link": "https://arxiv.org/abs/2408.09415", "description": "arXiv:2408.09415v1 Announce Type: new \nAbstract: A subvector of predictor that satisfies the ignorability assumption, whose index set is called a sufficient adjustment set, is crucial for conducting reliable causal inference based on observational data. In this paper, we propose a general family of methods to detect all such sets for the first time in the literature, with no parametric assumptions on the outcome models and with flexible parametric and semiparametric assumptions on the predictor within the treatment groups; the latter induces desired sample-level accuracy. 
We show that the collection of sufficient adjustment sets can uniquely facilitate multiple types of studies in causal inference, including sharpening the estimation of the average causal effect and recovering fundamental connections between the outcome and the treatment hidden in the dependence structure of the predictor. These findings are illustrated by simulation studies and a real data example."}, "https://arxiv.org/abs/2408.09418": {"title": "Grade of membership analysis for multi-layer categorical data", "link": "https://arxiv.org/abs/2408.09418", "description": "arXiv:2408.09418v1 Announce Type: new \nAbstract: Consider a group of individuals (subjects) participating in the same psychological tests with numerous questions (items) at different times. The observed responses can be recorded in multiple response matrices over time, named multi-layer categorical data. Assuming that each subject has a common mixed membership shared across all layers, enabling it to be affiliated with multiple latent classes with varying weights, the objective of the grade of membership (GoM) analysis is to estimate these mixed memberships from the data. When the test is conducted only once, the data becomes traditional single-layer categorical data. The GoM model is a popular choice for describing single-layer categorical data with a latent mixed membership structure. However, GoM cannot handle multi-layer categorical data. In this work, we propose a new model, multi-layer GoM, which extends GoM to multi-layer categorical data. To estimate the common mixed memberships, we propose a new approach, GoM-DSoG, based on a debiased sum of Gram matrices. We establish GoM-DSoG's per-subject convergence rate under the multi-layer GoM model. Our theoretical results suggest that fewer no-responses, more subjects, more items, and more layers are beneficial for GoM analysis. We also propose an approach to select the number of latent classes. Extensive experimental studies verify the theoretical findings and show GoM-DSoG's superiority over its competitors, as well as the accuracy of our method in determining the number of latent classes."}, "https://arxiv.org/abs/2408.09560": {"title": "Deep Learning for the Estimation of Heterogeneous Parameters in Discrete Choice Models", "link": "https://arxiv.org/abs/2408.09560", "description": "arXiv:2408.09560v1 Announce Type: new \nAbstract: This paper studies the finite sample performance of the flexible estimation approach of Farrell, Liang, and Misra (2021a), who propose to use deep learning for the estimation of heterogeneous parameters in economic models, in the context of discrete choice models. The approach combines the structure imposed by economic models with the flexibility of deep learning, which assures the interpretability of results on the one hand, and allows estimating flexible functional forms of observed heterogeneity on the other hand. For inference after the estimation with deep learning, Farrell et al. (2021a) derive an influence function that can be applied to many quantities of interest. We conduct a series of Monte Carlo experiments that investigate the impact of regularization on the proposed estimation and inference procedure in the context of discrete choice models. The results show that the deep learning approach generally leads to precise estimates of the true average parameters and that regular robust standard errors lead to invalid inference results, showing the need for the influence function approach for inference. 
Without regularization, the influence function approach can lead to substantial bias and large estimated standard errors caused by extreme outliers. Regularization mitigates this problem and stabilizes the estimation procedure, but at the expense of inducing an additional bias. This bias, in combination with the decreasing variance associated with increasing regularization, leads to the construction of invalid inferential statements in our experiments. Repeated sample splitting, unlike regularization, stabilizes the estimation approach without introducing an additional bias, thereby allowing for the construction of valid inferential statements."}, "https://arxiv.org/abs/2408.09598": {"title": "Anytime-Valid Inference for Double/Debiased Machine Learning of Causal Parameters", "link": "https://arxiv.org/abs/2408.09598", "description": "arXiv:2408.09598v1 Announce Type: new \nAbstract: Double (debiased) machine learning (DML) has seen widespread use in recent years for learning causal/structural parameters, in part due to its flexibility and adaptability to high-dimensional nuisance functions as well as its ability to avoid bias from regularization or overfitting. However, the classic double-debiased framework is only valid asymptotically for a predetermined sample size, thus lacking the flexibility of collecting more data if sharper inference is needed, or stopping data collection early if useful inferences can be made earlier than expected. This can be of particular concern in large scale experimental studies with huge financial costs or human lives at stake, as well as in observational studies where the lengths of confidence intervals do not shrink to zero even with increasing sample size, due to partial identifiability of a structural parameter. In this paper, we present time-uniform counterparts to the asymptotic DML results, enabling valid inference and confidence intervals for structural parameters to be constructed at any arbitrary (possibly data-dependent) stopping time. We provide conditions which are only slightly stronger than the standard DML conditions, but offer the stronger guarantee of anytime-valid inference. This facilitates the transformation of any existing DML method to provide anytime-valid guarantees with minimal modifications, making it highly adaptable and easy to use. We illustrate our procedure using two instances: a) local average treatment effect in online experiments with non-compliance, and b) partial identification of the average treatment effect in observational studies with potential unmeasured confounding."}, "https://arxiv.org/abs/2408.09607": {"title": "Experimental Design For Causal Inference Through An Optimization Lens", "link": "https://arxiv.org/abs/2408.09607", "description": "arXiv:2408.09607v1 Announce Type: new \nAbstract: The study of experimental design offers tremendous benefits for answering causal questions across a wide range of applications, including agricultural experiments, clinical trials, industrial experiments, social experiments, and digital experiments. Although valuable in such applications, the costs of experiments often drive experimenters to seek more efficient designs. Recently, experimenters have started to examine such efficiency questions from an optimization perspective, as experimental design problems are fundamentally decision-making problems. This perspective offers a lot of flexibility in leveraging various existing optimization tools to study experimental design problems. 
This manuscript thus aims to examine the foundations of experimental design problems in the context of causal inference as viewed through an optimization lens."}, "https://arxiv.org/abs/2408.09619": {"title": "Statistical Inference for Regression with Imputed Binary Covariates with Application to Emotion Recognition", "link": "https://arxiv.org/abs/2408.09619", "description": "arXiv:2408.09619v1 Announce Type: new \nAbstract: In the flourishing live streaming industry, accurate recognition of streamers' emotions has become a critical research focus, with profound implications for audience engagement and content optimization. However, precise emotion coding typically requires manual annotation by trained experts, making it extremely expensive and time-consuming to obtain complete observational data for large-scale studies. Motivated by this challenge in streamer emotion recognition, we develop here a novel imputation method together with a principled statistical inference procedure for analyzing partially observed binary data. Specifically, we assume for each observation an auxiliary feature vector, which is sufficiently cheap to be fully collected for the whole sample. We next assume a small pilot sample with both the target binary covariates (i.e., the emotion status) and the auxiliary features fully observed, whose size could be considerably smaller than that of the whole sample. Thereafter, a regression model can be constructed for the target binary covariates and the auxiliary features. This enables us to impute the missing binary features using the fully observed auxiliary features for the entire sample. We establish the associated asymptotic theory for principled statistical inference and present extensive simulation experiments, demonstrating the effectiveness and theoretical soundness of our proposed method. Furthermore, we validate our approach using a comprehensive dataset on emotion recognition in live streaming, demonstrating that our imputation method yields smaller standard errors and is more statistically efficient than using pilot data only. Our findings have significant implications for enhancing user experience and optimizing engagement on streaming platforms."}, "https://arxiv.org/abs/2408.09631": {"title": "Penalized Likelihood Approach for the Four-parameter Kappa Distribution", "link": "https://arxiv.org/abs/2408.09631", "description": "arXiv:2408.09631v1 Announce Type: new \nAbstract: The four-parameter kappa distribution (K4D) is a generalized form of some commonly used distributions such as generalized logistic, generalized Pareto, generalized Gumbel, and generalized extreme value (GEV) distributions. Owing to its flexibility, the K4D is widely applied in modeling in several fields such as hydrology and climatic change. For the estimation of the four parameters, the maximum likelihood approach and the method of L-moments are usually employed. The L-moment estimator (LME) method works well for some parameter spaces, with up to a moderate sample size, but computing the appropriate estimates is sometimes not feasible. Meanwhile, the maximum likelihood estimator (MLE) is optimal for large samples and applicable to a very wide range of situations, including non-stationary data. However, the MLE of the K4D shows substantially poor performance with small sample sizes, in the form of a large variance of the estimator. 
We therefore propose a maximum penalized likelihood estimation (MPLE) of the K4D by adjusting the existing penalty functions that restrict the parameter space. Eighteen combinations of penalties for the two shape parameters are considered and compared. The MPLE retains modeling flexibility and large sample optimality while also improving on small sample properties. The properties of the proposed estimator are verified through a Monte Carlo simulation, and an application is demonstrated using Thailand's annual maximum temperature data. Based on this study, we suggest using combinations of penalty functions in general."}, "https://arxiv.org/abs/2408.09634": {"title": "Branch and Bound to Assess Stability of Regression Coefficients in Uncertain Models", "link": "https://arxiv.org/abs/2408.09634", "description": "arXiv:2408.09634v1 Announce Type: new \nAbstract: It can be difficult to interpret a coefficient of an uncertain model. A slope coefficient of a regression model may change as covariates are added or removed from the model. In the context of high-dimensional data, there are too many model extensions to check. However, as we show here, it is possible to efficiently search, with a branch and bound algorithm, for maximum and minimum values of that adjusted slope coefficient over a discrete space of regularized regression models. Here we introduce our algorithm, along with supporting mathematical results, an example application, and a link to our computer code, to help researchers summarize high-dimensional data and assess the stability of regression coefficients in uncertain models."}, "https://arxiv.org/abs/2408.09755": {"title": "Ensemble Prediction via Covariate-dependent Stacking", "link": "https://arxiv.org/abs/2408.09755", "description": "arXiv:2408.09755v1 Announce Type: new \nAbstract: This paper presents a novel approach to ensemble prediction called \"Covariate-dependent Stacking\" (CDST). Unlike traditional stacking methods, CDST allows model weights to vary flexibly as a function of covariates, thereby enhancing predictive performance in complex scenarios. We formulate the covariate-dependent weights through combinations of basis functions, estimate them by optimizing a cross-validation criterion, and develop an Expectation-Maximization algorithm, ensuring computational efficiency. To analyze the theoretical properties, we establish an oracle inequality regarding the expected loss to be minimized for estimating model weights. Through comprehensive simulation studies and an application to large-scale land price prediction, we demonstrate that CDST consistently outperforms conventional model averaging methods, particularly on datasets where some models fail to capture the underlying complexity. Our findings suggest that CDST is especially valuable for, but not limited to, spatio-temporal prediction problems, offering a powerful tool for researchers and practitioners in various fields of data analysis."}, "https://arxiv.org/abs/2408.09760": {"title": "Regional and spatial dependence of poverty factors in Thailand, and its use into Bayesian hierarchical regression analysis", "link": "https://arxiv.org/abs/2408.09760", "description": "arXiv:2408.09760v1 Announce Type: new \nAbstract: Poverty is a serious issue that hinders human progress. The simplest solution is to apply a one-size-fits-all policy to alleviate it. Nevertheless, each region has its own unique issues, which require tailored solutions. 
From a spatial analysis perspective, neighboring regions can provide useful information for analyzing the issues of a given region. In this work, we propose inferred regional boundaries for Thailand that explain poverty dynamics better than the usual government administrative regions. The proposed regions maximize a trade-off between poverty-related features and geographical coherence. We use spatial analysis together with Moran's cluster algorithms and Bayesian hierarchical regression models, with the potential to assist the implementation of the right policies to alleviate poverty. We find that all variables considered show positive spatial autocorrelation. The results of the analysis illustrate that 1) Northern, Northeastern, and to a lesser extent Northcentral Thailand are the regions that require more attention regarding poverty issues, 2) Northcentral, Northeastern, Northern and Southern Thailand present dramatically low levels of education, income and savings compared with large cities such as Bangkok-Pattaya and Central Thailand, and 3) Bangkok-Pattaya is the only region whose average years of education is above 12 years, which corresponds approximately to a complete senior high school education."}, "https://arxiv.org/abs/2408.09770": {"title": "Shift-Dispersion Decompositions of Wasserstein and Cram\\'er Distances", "link": "https://arxiv.org/abs/2408.09770", "description": "arXiv:2408.09770v1 Announce Type: new \nAbstract: Divergence functions are measures of distance or dissimilarity between probability distributions that serve various purposes in statistics and applications. We propose decompositions of Wasserstein and Cram\\'er distances$-$which compare two distributions by integrating over their differences in distribution or quantile functions$-$into directed shift and dispersion components. These components are obtained by dividing the differences between the quantile functions into contributions arising from shift and dispersion, respectively. Our decompositions add information on how the distributions differ in a condensed form and consequently enhance the interpretability of the underlying divergences. We show that our decompositions satisfy a number of natural properties and are unique in doing so in location-scale families. The decompositions allow us to derive sensitivities of the divergence measures to changes in location and dispersion, and they give rise to weak stochastic order relations that are linked to the usual stochastic and the dispersive order. Our theoretical developments are illustrated in two applications, where we focus on forecast evaluation of temperature extremes and on the design of probabilistic surveys in economics."}, "https://arxiv.org/abs/2408.09868": {"title": "Weak instruments in multivariable Mendelian randomization: methods and practice", "link": "https://arxiv.org/abs/2408.09868", "description": "arXiv:2408.09868v1 Announce Type: new \nAbstract: The method of multivariable Mendelian randomization uses genetic variants to instrument multiple exposures, to estimate the effect that a given exposure has on an outcome conditional on all other exposures included in a linear model. Unfortunately, the inclusion of every additional exposure makes a weak instruments problem more likely, because we require conditionally strong genetic predictors of each exposure. This issue is well appreciated in practice, with different versions of F-statistics routinely reported as measures of instrument strength. 
Less transparently, however, these F-statistics are sometimes used to guide instrument selection, and even to decide whether to report empirical results. Rather than discarding findings with low F-statistics, weak instrument-robust methods can provide valid inference under weak instruments. For multivariable Mendelian randomization with two-sample summary data, we encourage use of the inference strategy of Andrews (2018) that reports both robust and non-robust confidence sets, along with a statistic that measures how reliable the non-robust confidence set is in terms of coverage. We also propose a novel adjusted-Kleibergen statistic that corrects for overdispersion heterogeneity in genetic associations with the outcome."}, "https://arxiv.org/abs/2408.09876": {"title": "Improving Genomic Prediction using High-dimensional Secondary Phenotypes: the Genetic Latent Factor Approach", "link": "https://arxiv.org/abs/2408.09876", "description": "arXiv:2408.09876v1 Announce Type: new \nAbstract: Decreasing costs and new technologies have led to an increase in the amount of data available to plant breeding programs. High-throughput phenotyping (HTP) platforms routinely generate high-dimensional datasets of secondary features that may be used to improve genomic prediction accuracy. However, integration of this data comes with challenges such as multicollinearity, parameter estimation in $p > n$ settings, and the computational complexity of many standard approaches. Several methods have emerged to analyze such data, but interpretation of model parameters often remains challenging.\n We propose genetic factor best linear unbiased prediction (gfBLUP), a seven-step prediction pipeline that reduces the dimensionality of the original secondary HTP data using generative factor analysis. In short, gfBLUP uses redundancy filtered and regularized genetic and residual correlation matrices to fit a maximum likelihood factor model and estimate genetic latent factor scores. These latent factors are subsequently used in multi-trait genomic prediction. Our approach performs on par or better than alternatives in extensive simulations and a real-world application, while producing easily interpretable and biologically relevant parameters. We discuss several possible extensions and highlight gfBLUP as the basis for a flexible and modular multi-trait genomic prediction framework."}, "https://arxiv.org/abs/2408.10091": {"title": "Non-Plug-In Estimators Could Outperform Plug-In Estimators: a Cautionary Note and a Diagnosis", "link": "https://arxiv.org/abs/2408.10091", "description": "arXiv:2408.10091v1 Announce Type: new \nAbstract: Objectives: Highly flexible nonparametric estimators have gained popularity in causal inference and epidemiology. Popular examples of such estimators include targeted maximum likelihood estimators (TMLE) and double machine learning (DML). TMLE is often argued or suggested to be better than DML estimators and several other estimators in small to moderate samples -- even if they share the same large-sample properties -- because TMLE is a plug-in estimator and respects the known bounds on the parameter, while other estimators might fall outside the known bounds and yield absurd estimates. However, this argument is not a rigorously proven result and may fail in certain cases. Methods: In a carefully chosen simulation setting, I compare the performance of several versions of TMLE and DML estimators of the average treatment effect among treated in small to moderate samples. 
Results: In this simulation setting, DML estimators outperform some versions of TMLE in small samples. TMLE fluctuations are unstable, and hence empirically checking the magnitude of the TMLE fluctuation might flag cases where TMLE could perform poorly. Conclusions: As a plug-in estimator, TMLE is not guaranteed to outperform non-plug-in counterparts such as DML estimators in small samples. Checking the fluctuation magnitude might be a useful diagnosis for TMLE. More rigorous theoretical justification is needed to understand and compare the finite-sample performance of these highly flexible estimators in general."}, "https://arxiv.org/abs/2408.10142": {"title": "Insights of the Intersection of Phase-Type Distributions and Positive Systems", "link": "https://arxiv.org/abs/2408.10142", "description": "arXiv:2408.10142v1 Announce Type: new \nAbstract: In this paper, we consider the relationship between phase-type distributions and positive systems through practical examples. Phase-type distributions, commonly used in modelling dynamic systems, represent the temporal evolution of a set of variables based on their phase. On the other hand, positive systems, prevalent in a wide range of disciplines, are those where the involved variables maintain non-negative values over time. Through some examples, we demonstrate how phase-type distributions can be useful in describing and analyzing positive systems, providing a perspective on their dynamic behavior. Our main objective is to establish clear connections between these seemingly different concepts, highlighting their relevance and utility in various fields of study. The findings presented here contribute to a better understanding of the interaction between phase-type distribution theory and positive system theory, opening new opportunities for future research in this exciting interdisciplinary field."}, "https://arxiv.org/abs/2408.10149": {"title": "A non-parametric U-statistic testing approach for multi-arm clinical trials with multivariate longitudinal data", "link": "https://arxiv.org/abs/2408.10149", "description": "arXiv:2408.10149v1 Announce Type: new \nAbstract: Randomized clinical trials (RCTs) often involve multiple longitudinal primary outcomes to comprehensively assess treatment efficacy. The Longitudinal Rank-Sum Test (LRST), a robust U-statistics-based, non-parametric, rank-based method, effectively controls Type I error and enhances statistical power by leveraging the temporal structure of the data without relying on distributional assumptions. However, the LRST is limited to two-arm comparisons. To address the need for comparing multiple doses against a control group in many RCTs, we extend the LRST to a multi-arm setting. This novel multi-arm LRST provides a flexible and powerful approach for evaluating treatment efficacy across multiple arms and outcomes, with a strong capability for detecting the most effective dose in multi-arm trials. Extensive simulations demonstrate that this method maintains excellent Type I error control while providing greater power compared to the two-arm LRST with multiplicity adjustments.
Application to the Bapineuzumab (Bapi) 301 trial further validates the multi-arm LRST's practical utility and robustness, confirming its efficacy in complex clinical trial analyses."}, "https://arxiv.org/abs/2408.09060": {"title": "[Invited Discussion] Randomization Tests to Address Disruptions in Clinical Trials: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions", "link": "https://arxiv.org/abs/2408.09060", "description": "arXiv:2408.09060v1 Announce Type: cross \nAbstract: Disruptions in clinical trials may be due to external events like pandemics, warfare, and natural disasters. Resulting complications may lead to unforeseen intercurrent events (events that occur after treatment initiation and affect the interpretation of the clinical question of interest or the existence of the measurements associated with it). In Uschner et al. (2023), several example clinical trial disruptions are described: treatment effect drift, population shift, change of care, change of data collection, and change of availability of study medication. A complex randomized controlled trial (RCT) setting with (planned or unplanned) intercurrent events is then described, and randomization tests are presented as a means for non-parametric inference that is robust to violations of assumptions typically made in clinical trials. While estimation methods like Targeted Learning (TL) are valid in such settings, we do not see where the authors make the case that one should prefer a randomization test in such disrupted RCTs. In this discussion, we comment on the appropriateness of TL and the accompanying TL Roadmap in the context of disrupted clinical trials. We highlight a few key articles related to the broad applicability of TL for RCTs and real-world data (RWD) analyses with intercurrent events. We begin by introducing TL and motivating its utility in Section 2, and then in Section 3 we provide a brief overview of the TL Roadmap. In Section 4 we revisit the example clinical trial disruptions presented in Uschner et al. (2023), discussing considerations and solutions based on the principles of TL. We request, in an authors' rejoinder, a clear theoretical demonstration with specific examples in this setting that a randomization test is the only valid inferential method relative to one based on following the TL Roadmap."}, "https://arxiv.org/abs/2408.09185": {"title": "Method of Moments Estimation for Affine Stochastic Volatility Models", "link": "https://arxiv.org/abs/2408.09185", "description": "arXiv:2408.09185v1 Announce Type: cross \nAbstract: We develop moment estimators for the parameters of affine stochastic volatility models. We first address the challenge of calculating moments for the models by introducing a recursive equation for deriving closed-form expressions for moments of any order. Consequently, we propose our moment estimators. We then establish a central limit theorem for our estimators and derive the explicit formulas for the asymptotic covariance matrix. Finally, we provide numerical results to validate our method."}, "https://arxiv.org/abs/2408.09532": {"title": "Deep Limit Model-free Prediction in Regression", "link": "https://arxiv.org/abs/2408.09532", "description": "arXiv:2408.09532v1 Announce Type: cross \nAbstract: In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction intervals under a general regression setting.
Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, these classical methods rely heavily on correct model specification. Even for the non-parametric approach, some additive form is often assumed. A newly proposed Model-free prediction principle sheds light on a prediction procedure without any model assumption. Previous work regarding this principle has shown better performance than other standard alternatives. Recently, the DNN, a prominent machine learning method, has received increasing attention due to its strong performance in practice. Guided by the Model-free prediction idea, we attempt to apply a fully connected forward DNN to map X and some appropriate reference random variable Z to Y. The targeted DNN is trained by minimizing a specially designed loss function so that the randomness of Y conditional on X is outsourced to Z through the trained DNN. Our method is more stable and accurate compared to other DNN-based counterparts, especially for optimal point predictions. With a specific prediction procedure, our prediction interval can capture the estimation variability so that it can deliver a better coverage rate in finite samples. The superior performance of our method is verified by simulation and empirical studies."}, "https://arxiv.org/abs/2408.09537": {"title": "Sample-Optimal Large-Scale Optimal Subset Selection", "link": "https://arxiv.org/abs/2408.09537", "description": "arXiv:2408.09537v1 Announce Type: cross \nAbstract: Ranking and selection (R&S) conventionally aims to select the unique best alternative with the largest mean performance from a finite set of alternatives. However, to better support decision making, it may be more informative to deliver a small menu of alternatives whose mean performances are among the top $m$. Such a problem, called optimal subset selection (OSS), is generally more challenging to address than the conventional R&S. This challenge becomes even more significant when the number of alternatives is considerably large. Thus, the focus of this paper is on addressing the large-scale OSS problem. To achieve this goal, we design a top-$m$ greedy selection mechanism that keeps sampling the current top $m$ alternatives with top $m$ running sample means and propose the explore-first top-$m$ greedy (EFG-$m$) procedure. Through an extended boundary-crossing framework, we prove that the EFG-$m$ procedure is both sample optimal and consistent in terms of the probability of good selection, confirming its effectiveness in solving large-scale OSS problems. Surprisingly, we also demonstrate that the EFG-$m$ procedure achieves an indifference-based ranking within the selected subset of alternatives at no extra cost. This is highly beneficial as it delivers deeper insights to decision-makers, enabling more informed decision-making. Lastly, numerical experiments validate our results and demonstrate the efficiency of our procedures."}, "https://arxiv.org/abs/2408.09582": {"title": "A Likelihood-Free Approach to Goal-Oriented Bayesian Optimal Experimental Design", "link": "https://arxiv.org/abs/2408.09582", "description": "arXiv:2408.09582v1 Announce Type: cross \nAbstract: Conventional Bayesian optimal experimental design seeks to maximize the expected information gain (EIG) on model parameters. However, the end goal of the experiment often is not to learn the model parameters, but to predict downstream quantities of interest (QoIs) that depend on the learned parameters.
Designs that offer high EIG for parameters, however, may not translate to high EIG for QoIs. Goal-oriented optimal experimental design (GO-OED) thus directly targets maximizing the EIG of QoIs.\n We introduce LF-GO-OED (likelihood-free goal-oriented optimal experimental design), a computational method for conducting GO-OED with nonlinear observation and prediction models. LF-GO-OED is specifically designed to accommodate implicit models, where the likelihood is intractable. In particular, it builds a density ratio estimator from samples generated from approximate Bayesian computation (ABC), thereby sidestepping the need for likelihood evaluations or density estimations. The overall method is validated against existing methods on benchmark problems, and demonstrated on scientific applications in epidemiology and neuroscience."}, "https://arxiv.org/abs/2408.09618": {"title": "kendallknight: Efficient Implementation of Kendall's Correlation Coefficient Computation", "link": "https://arxiv.org/abs/2408.09618", "description": "arXiv:2408.09618v1 Announce Type: cross \nAbstract: The kendallknight package introduces an efficient implementation of Kendall's correlation coefficient computation, significantly improving the processing time for large datasets without sacrificing accuracy. The kendallknight package, following Knight (1966) and subsequent literature, reduces the computational complexity, resulting in drastic reductions in computation time, transforming operations that would take minutes or hours into milliseconds or minutes, while maintaining precision and correctly handling edge cases and errors. The package is particularly advantageous in econometric and statistical contexts where rapid and accurate calculation of Kendall's correlation coefficient is desirable. Benchmarks demonstrate substantial performance gains over the base R implementation, especially for large datasets."}, "https://arxiv.org/abs/2408.10136": {"title": "Robust spectral clustering with rank statistics", "link": "https://arxiv.org/abs/2408.10136", "description": "arXiv:2408.10136v1 Announce Type: cross \nAbstract: This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile.\n Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit.
Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure."}, "https://arxiv.org/abs/2112.09259": {"title": "Robustness, Heterogeneous Treatment Effects and Covariate Shifts", "link": "https://arxiv.org/abs/2112.09259", "description": "arXiv:2112.09259v2 Announce Type: replace \nAbstract: This paper studies the robustness of estimated policy effects to changes in the distribution of covariates. Robustness to covariate shifts is important, for example, when evaluating the external validity of quasi-experimental results, which are often used as a benchmark for evidence-based policy-making. I propose a novel scalar robustness metric. This metric measures the magnitude of the smallest covariate shift needed to invalidate a claim on the policy effect (for example, $ATE \\geq 0$) supported by the quasi-experimental evidence. My metric links the heterogeneity of policy effects and robustness in a flexible, nonparametric way and does not require functional form assumptions. I cast the estimation of the robustness metric as a de-biased GMM problem. This approach guarantees a parametric convergence rate for the robustness metric while allowing for machine learning-based estimators of policy effect heterogeneity (for example, lasso, random forest, boosting, neural nets). I apply my procedure to the Oregon Health Insurance experiment. I study the robustness of policy effects estimates of health-care utilization and financial strain outcomes, relative to a shift in the distribution of context-specific covariates. Such covariates are likely to differ across US states, making quantification of robustness an important exercise for adoption of the insurance policy in states other than Oregon. I find that the effect on outpatient visits is the most robust among the metrics of health-care utilization considered."}, "https://arxiv.org/abs/2209.07672": {"title": "Nonparametric Estimation via Partial Derivatives", "link": "https://arxiv.org/abs/2209.07672", "description": "arXiv:2209.07672v2 Announce Type: replace \nAbstract: Traditional nonparametric estimation methods often lead to a slow convergence rate in large dimensions and require unrealistically enormous sizes of datasets for reliable conclusions. We develop an approach based on partial derivatives, either observed or estimated, to effectively estimate the function at near-parametric convergence rates. The novel approach and computational algorithm could lead to methods useful to practitioners in many areas of science and engineering. Our theoretical results reveal a behavior universal to this class of nonparametric estimation problems. We explore a general setting involving tensor product spaces and build upon the smoothing spline analysis of variance (SS-ANOVA) framework. 
For $d$-dimensional models under full interaction, the optimal rates with gradient information on $p$ covariates are identical to those for the $(d-p)$-interaction models without gradients and, therefore, the models are immune to the \"curse of interaction.\" For additive models, the optimal rates using gradient information are root-$n$, thus achieving the \"parametric rate.\" We demonstrate aspects of the theoretical results through synthetic and real data applications."}, "https://arxiv.org/abs/2212.01943": {"title": "Unbiased Test Error Estimation in the Poisson Means Problem via Coupled Bootstrap Techniques", "link": "https://arxiv.org/abs/2212.01943", "description": "arXiv:2212.01943v3 Announce Type: replace \nAbstract: We propose a coupled bootstrap (CB) method for the test error of an arbitrary algorithm that estimates the mean in a Poisson sequence, often called the Poisson means problem. The idea behind our method is to generate two carefully-designed data vectors from the original data vector, by using synthetic binomial noise. One such vector acts as the training sample and the second acts as the test sample. To stabilize the test error estimate, we average this over $B$ bootstrap draws of the synthetic noise. A key property of the CB estimator is that it is unbiased for the test error in a Poisson problem where the original mean has been shrunken by a small factor, driven by the success probability $p$ in the binomial noise. Further, in the limit as $B \\to \\infty$ and $p \\to 0$, we show that the CB estimator recovers a known unbiased estimator for test error based on Hudson's lemma, under no assumptions on the given algorithm for estimating the mean (in particular, no smoothness assumptions). Our methodology applies to two central loss functions that can be used to define test error: Poisson deviance and squared loss. Via a bias-variance decomposition, for each loss function, we analyze the effects of the binomial success probability and the number of bootstrap samples on the accuracy of the estimator. We also investigate our method empirically across a variety of settings, using simulated as well as real data."}, "https://arxiv.org/abs/2306.08719": {"title": "Off-policy Evaluation in Doubly Inhomogeneous Environments", "link": "https://arxiv.org/abs/2306.08719", "description": "arXiv:2306.08719v4 Announce Type: replace \nAbstract: This work aims to study off-policy evaluation (OPE) under scenarios where two key reinforcement learning (RL) assumptions -- temporal stationarity and individual homogeneity -- are both violated. To handle the ``double inhomogeneities\", we propose a class of latent factor models for the reward and observation transition functions, under which we develop a general OPE framework that consists of both model-based and model-free approaches. To our knowledge, this is the first paper that develops statistically sound OPE methods in offline RL with double inhomogeneities. It contributes to a deeper understanding of OPE in environments where standard RL assumptions are not met, and provides several practical approaches in these settings. We establish the theoretical properties of the proposed value estimators and empirically show that our approach outperforms competing methods that ignore either temporal nonstationarity or individual heterogeneity.
Finally, we illustrate our method on a data set from the Medical Information Mart for Intensive Care."}, "https://arxiv.org/abs/2309.04957": {"title": "Winner's Curse Free Robust Mendelian Randomization with Summary Data", "link": "https://arxiv.org/abs/2309.04957", "description": "arXiv:2309.04957v2 Announce Type: replace \nAbstract: In the past decade, the increased availability of genome-wide association studies summary data has popularized Mendelian Randomization (MR) for conducting causal inference. MR analyses, incorporating genetic variants as instrumental variables, are known for their robustness against reverse causation bias and unmeasured confounders. Nevertheless, classical MR analyses utilizing summary data may still produce biased causal effect estimates due to the winner's curse and pleiotropic issues. To address these two issues and establish valid causal conclusions, we propose a unified robust Mendelian Randomization framework with summary data, which systematically removes the winner's curse and screens out invalid genetic instruments with pleiotropic effects. Different from existing robust MR literature, our framework delivers valid statistical inference on the causal effect without requiring the genetic pleiotropy effects to follow any parametric distribution or relying on a perfect instrument screening property. Under appropriate conditions, we show that our proposed estimator converges to a normal distribution and its variance can be well estimated. We demonstrate the performance of our proposed estimator through Monte Carlo simulations and two case studies. The codes implementing the procedures are available at https://github.com/ChongWuLab/CARE/."}, "https://arxiv.org/abs/2310.13826": {"title": "A p-value for Process Tracing and other N=1 Studies", "link": "https://arxiv.org/abs/2310.13826", "description": "arXiv:2310.13826v2 Announce Type: replace \nAbstract: We introduce a method for calculating \\(p\\)-values when testing causal theories about a single case, for instance when conducting process tracing. As in Fisher's (1935) original design, our \\(p\\)-value indicates how frequently one would find the same or more favorable evidence while entertaining a rival theory (the null) for the sake of argument. We use an urn model to represent the null distribution and calibrate it to privilege false negative errors and reduce false positive errors. We also present an approach to sensitivity analysis and to representing the evidentiary weight of different observations. Our test suits any type of evidence, such as data from interviews and archives, observed in any combination. We apply our hypothesis test in two studies: a process tracing classic about the cause of the cholera outbreak in Soho (Snow 1855) and a recent process-tracing-based explanation of the cause of a welfare policy shift in Uruguay (Rossel, Antia, and Manzi 2023)."}, "https://arxiv.org/abs/2311.17575": {"title": "Identifying Causal Effects of Nonbinary, Ordered Treatments using Multiple Instrumental Variables", "link": "https://arxiv.org/abs/2311.17575", "description": "arXiv:2311.17575v2 Announce Type: replace \nAbstract: This paper introduces a novel method for identifying causal effects of ordered, nonbinary treatments using multiple binary instruments. Extending the two-stage least squares (TSLS) framework, the approach accommodates ordered treatments under any monotonicity assumption.
The key contribution is the identification of a new causal parameter that simplifies the interpretation of causal effects and is broadly applicable due to a mild monotonicity assumption, offering a compelling alternative to TSLS. The paper builds upon recent causal machine learning methodology for estimation and demonstrates how causal forests can detect local violations of the underlying monotonicity assumption. The methodology is applied to estimate the returns to education using the seminal dataset of Card (1995) and to evaluate the impact of an additional child on female labor market outcomes using the data from Angrist and Evans (1998)."}, "https://arxiv.org/abs/2312.12786": {"title": "Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets", "link": "https://arxiv.org/abs/2312.12786", "description": "arXiv:2312.12786v2 Announce Type: replace \nAbstract: Development of comprehensive prediction models is often of great interest in many disciplines of science, but datasets with information on all desired features often have small sample sizes. We describe a transfer learning approach for building high-dimensional generalized linear models using data from a main study with detailed information on all predictors and an external, potentially much larger, study that has ascertained a more limited set of predictors. We propose using the external dataset to build a reduced model and then \"transfer\" the information on underlying parameters for the analysis of the main study through a set of calibration equations which can account for the study-specific effects of design variables. We then propose a penalized generalized method of moments framework for inference and a one-step estimation method that could be implemented using the standard glmnet package. We develop asymptotic theory and conduct extensive simulation studies to investigate both predictive performance and post-selection inference properties of the proposed method. Finally, we illustrate an application of the proposed method for the development of risk models for five common diseases using the UK Biobank study, combining information on low-dimensional risk factors and high-throughput proteomic biomarkers."}, "https://arxiv.org/abs/2204.00180": {"title": "Measuring Diagnostic Test Performance Using Imperfect Reference Tests: A Partial Identification Approach", "link": "https://arxiv.org/abs/2204.00180", "description": "arXiv:2204.00180v4 Announce Type: replace-cross \nAbstract: Diagnostic tests are almost never perfect. Studies quantifying their performance use knowledge of the true health status, measured with a reference diagnostic test. Researchers commonly assume that the reference test is perfect, which is often not the case in practice. When the assumption fails, conventional studies identify \"apparent\" performance or performance with respect to the reference, but not true performance. This paper provides the smallest possible bounds on the measures of true performance - sensitivity (true positive rate) and specificity (true negative rate), or equivalently false positive and negative rates, in standard settings. Implied bounds on policy-relevant parameters are derived: 1) Prevalence in screened populations; 2) Predictive values. Methods for inference based on moment inequalities are used to construct uniformly consistent confidence sets in level over a relevant family of data distributions.
Emergency Use Authorization (EUA) and independent study data for the BinaxNOW COVID-19 antigen test demonstrate that the bounds can be very informative. Analysis reveals that the estimated false negative rates for symptomatic and asymptomatic patients are up to 3.17 and 4.59 times higher than the frequently cited \"apparent\" false negative rate. Further applicability of the results in the context of imperfect proxies such as survey responses and imputed protected classes is indicated."}, "https://arxiv.org/abs/2309.06673": {"title": "Ridge detection for nonstationary multicomponent signals with time-varying wave-shape functions and its applications", "link": "https://arxiv.org/abs/2309.06673", "description": "arXiv:2309.06673v2 Announce Type: replace-cross \nAbstract: We introduce a novel ridge detection algorithm for time-frequency (TF) analysis, particularly tailored for intricate nonstationary time series encompassing multiple non-sinusoidal oscillatory components. The algorithm is rooted in the distinctive geometric patterns that emerge in the TF domain due to such non-sinusoidal oscillations. We term this method \\textit{shape-adaptive mode decomposition-based multiple harmonic ridge detection} (\\textsf{SAMD-MHRD}). A swift implementation is available when supplementary information is at hand. We demonstrate the practical utility of \\textsf{SAMD-MHRD} through its application to a real-world challenge. We employ it to devise a cutting-edge walking activity detection algorithm, leveraging accelerometer signals from an inertial measurement unit across diverse body locations of a moving subject."}, "https://arxiv.org/abs/2408.10396": {"title": "Highly Multivariate High-dimensionality Spatial Stochastic Processes-A Mixed Conditional Approach", "link": "https://arxiv.org/abs/2408.10396", "description": "arXiv:2408.10396v1 Announce Type: new \nAbstract: We propose a hybrid mixed spatial graphical model framework and novel concepts, e.g., cross-Markov Random Field (cross-MRF), to comprehensively address all feature aspects of highly multivariate high-dimensionality (HMHD) spatial data class when constructing the desired joint variance and precision matrix (where both p and n are large). Specifically, the framework accommodates any customized conditional independence (CI) among any number of p variate fields at the first stage, alleviating dynamic memory burden. Meanwhile, it facilitates parallel generation of covariance and precision matrix, with the latter's generation order scaling only linearly in p. In the second stage, we demonstrate the multivariate Hammersley-Clifford theorem from a column-wise conditional perspective and unearth the existence of cross-MRF. The link of the mixed spatial graphical framework and the cross-MRF allows for a mixed conditional approach, resulting in the sparsest possible representation of the precision matrix via accommodating the doubly CI among both p and n, with the highest possible exact-zero-value percentage. We also explore the possibility of the co-existence of geostatistical and MRF modelling approaches in one unified framework, imparting a potential solution to an open problem. 
The derived theories are illustrated with 1D simulation and 2D real-world spatial data."}, "https://arxiv.org/abs/2408.10401": {"title": "Spatial Knockoff Bayesian Variable Selection in Genome-Wide Association Studies", "link": "https://arxiv.org/abs/2408.10401", "description": "arXiv:2408.10401v1 Announce Type: new \nAbstract: High-dimensional variable selection has emerged as one of the prevailing statistical challenges in the big data revolution. Many variable selection methods have been adapted for identifying single nucleotide polymorphisms (SNPs) linked to phenotypic variation in genome-wide association studies. We develop a Bayesian variable selection regression model for identifying SNPs linked to phenotypic variation. We modify our Bayesian variable selection regression models to control the false discovery rate of SNPs using a knockoff variable approach. We reduce spurious associations by regressing the phenotype of interest against a set of basis functions that account for the relatedness of individuals. Using a restricted regression approach, we simultaneously estimate the SNP-level effects while removing variation in the phenotype that can be explained by population structure. We also accommodate the spatial structure among causal SNPs by modeling their inclusion probabilities jointly with a reduced rank Gaussian process. In a simulation study, we demonstrate that our spatial Bayesian variable selection regression model controls the false discovery rate and increases power when the relevant SNPs are clustered. We conclude with an analysis of Arabidopsis thaliana flowering time, a polygenic trait that is confounded with population structure, and find the discoveries of our method cluster near described flowering time genes."}, "https://arxiv.org/abs/2408.10478": {"title": "On a fundamental difference between Bayesian and frequentist approaches to robustness", "link": "https://arxiv.org/abs/2408.10478", "description": "arXiv:2408.10478v1 Announce Type: new \nAbstract: Heavy-tailed models are often used as a way to gain robustness against outliers in Bayesian analyses. On the other side, in frequentist analyses, M-estimators are often employed. In this paper, the two approaches are reconciled by considering M-estimators as maximum likelihood estimators of heavy-tailed models. We realize that, even from this perspective, there is a fundamental difference in that frequentists do not require these heavy-tailed models to be proper. It is shown what the difference between improper and proper heavy-tailed models can be in terms of estimation results through two real-data analyses based on linear regression. The findings of this paper make us ponder on the use of improper heavy-tailed data models in Bayesian analyses, an approach which is seen to fit within the generalized Bayesian framework of Bissiri et al. (2016) when combined with proper prior distributions yielding proper (generalized) posterior distributions."}, "https://arxiv.org/abs/2408.10509": {"title": "Continuous difference-in-differences with double/debiased machine learning", "link": "https://arxiv.org/abs/2408.10509", "description": "arXiv:2408.10509v1 Announce Type: new \nAbstract: This paper extends difference-in-differences to settings involving continuous treatments. Specifically, the average treatment effect on the treated (ATT) at any level of continuous treatment intensity is identified using a conditional parallel trends assumption. 
In this framework, estimating the ATTs requires first estimating infinite-dimensional nuisance parameters, especially the conditional density of the continuous treatment, which can introduce significant biases. To address this challenge, estimators for the causal parameters are proposed under the double/debiased machine learning framework. We show that these estimators are asymptotically normal and provide consistent variance estimators. To illustrate the effectiveness of our methods, we re-examine the study by Acemoglu and Finkelstein (2008), which assessed the effects of the 1983 Medicare Prospective Payment System (PPS) reform. By reinterpreting their research design using a difference-in-differences approach with continuous treatment, we nonparametrically estimate the treatment effects of the 1983 PPS reform, thereby providing a more detailed understanding of its impact."}, "https://arxiv.org/abs/2408.10542": {"title": "High-Dimensional Covariate-Augmented Overdispersed Multi-Study Poisson Factor Model", "link": "https://arxiv.org/abs/2408.10542", "description": "arXiv:2408.10542v1 Announce Type: new \nAbstract: Factor analysis for high-dimensional data is a canonical problem in statistics and has a wide range of applications. However, there is currently no factor model tailored to effectively analyze high-dimensional count responses with corresponding covariates across multiple studies, such as the single-cell sequencing dataset from a case-control study. In this paper, we introduce factor models designed to jointly analyze multiple studies by extracting study-shared and specified factors. Our factor models account for heterogeneous noises and overdispersion among counts with augmented covariates. We propose an efficient and speedy variational estimation procedure for estimating model parameters, along with a novel criterion for selecting the optimal number of factors and the rank of regression coefficient matrix. The consistency and asymptotic normality of estimators are systematically investigated by connecting variational likelihood and profile M-estimation. Extensive simulations and an analysis of a single-cell sequencing dataset are conducted to demonstrate the effectiveness of the proposed multi-study Poisson factor model."}, "https://arxiv.org/abs/2408.10558": {"title": "Multi-Attribute Preferences: A Transfer Learning Approach", "link": "https://arxiv.org/abs/2408.10558", "description": "arXiv:2408.10558v1 Announce Type: new \nAbstract: This contribution introduces a novel statistical learning methodology based on the Bradley-Terry method for pairwise comparisons, where the novelty arises from the method's capacity to estimate the worth of objects for a primary attribute by incorporating data of secondary attributes. These attributes are properties on which objects are evaluated in a pairwise fashion by individuals. By assuming that the main interest of practitioners lies in the primary attribute, and the secondary attributes only serve to improve estimation of the parameters underlying the primary attribute, this paper utilises the well-known transfer learning framework. To wit, the proposed method first estimates a biased worth vector using data pertaining to both the primary attribute and the set of informative secondary attributes, which is followed by a debiasing step based on a penalised likelihood of the primary attribute. When the set of informative secondary attributes is unknown, we allow for their estimation by a data-driven algorithm. 
Theoretically, we show that, under mild conditions, the $\\ell_\\infty$ and $\\ell_2$ rates are improved compared to fitting a Bradley-Terry model on just the data pertaining to the primary attribute. The favourable (comparative) performance under more general settings is shown by means of a simulation study. To illustrate the usage and interpretation of the method, the proposed method is applied to consumer preference data pertaining to a cassava-derived food product: eba. An R package containing the proposed methodology can be found at https://CRAN.R-project.org/package=BTTL."}, "https://arxiv.org/abs/2408.10570": {"title": "A two-sample test based on averaged Wilcoxon rank sums over interpoint distances", "link": "https://arxiv.org/abs/2408.10570", "description": "arXiv:2408.10570v1 Announce Type: new \nAbstract: An important class of two-sample multivariate homogeneity tests is based on identifying differences between the distributions of interpoint distances. While generating distances from point clouds offers a straightforward and intuitive way for dimensionality reduction, it also introduces dependencies into the resulting distance samples. We propose a simple test based on Wilcoxon's rank sum statistic for which we prove asymptotic normality under the null hypothesis and fixed alternatives under mild conditions on the underlying distributions of the point clouds. Furthermore, we show consistency of the test and derive a variance approximation that allows the construction of a computationally feasible, distribution-free test with good finite sample performance. The power and robustness of the test for high-dimensional data and low sample sizes are demonstrated by numerical simulations. Finally, we apply the proposed test to case-control testing on microarray data in genetic studies, which is a notoriously challenging setting with a high number of variables and low sample sizes."}, "https://arxiv.org/abs/2408.10650": {"title": "Principal component analysis for max-stable distributions", "link": "https://arxiv.org/abs/2408.10650", "description": "arXiv:2408.10650v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is one of the most popular dimension reduction techniques in statistics and is especially powerful when a multivariate distribution is concentrated near a lower-dimensional subspace. Multivariate extreme value distributions have turned out to provide challenges for the application of PCA since their constrained support impedes the detection of lower-dimensional structures and heavy tails can imply that second moments do not exist, thereby preventing the application of classical variance-based techniques for PCA. We adapt PCA to max-stable distributions using a regression setting and employ max-linear maps to project the random vector to a lower-dimensional space while preserving max-stability. We also provide a characterization of those distributions which allow for a perfect reconstruction from the lower-dimensional representation.
Finally, we demonstrate how an optimal projection matrix can be consistently estimated and show viability in practice with a simulation study and application to a benchmark dataset."}, "https://arxiv.org/abs/2408.10686": {"title": "Gradient Wild Bootstrap for Instrumental Variable Quantile Regressions with Weak and Few Clusters", "link": "https://arxiv.org/abs/2408.10686", "description": "arXiv:2408.10686v1 Announce Type: new \nAbstract: We study the gradient wild bootstrap-based inference for instrumental variable quantile regressions in the framework of a small number of large clusters in which the number of clusters is viewed as fixed, and the number of observations for each cluster diverges to infinity. For Wald inference, we show that our wild bootstrap Wald test, with or without studentization using the cluster-robust covariance estimator (CRVE), controls size asymptotically up to a small error as long as the parameter of the endogenous variable is strongly identified in at least one of the clusters. We further show that the wild bootstrap Wald test with CRVE studentization is more powerful for distant local alternatives than that without. Lastly, we develop a wild bootstrap Anderson-Rubin (AR) test for weak-identification-robust inference. We show it controls size asymptotically up to a small error, even under weak or partial identification for all clusters. We illustrate the good finite-sample performance of the new inference methods using simulations and provide an empirical application to a well-known dataset about US local labor markets."}, "https://arxiv.org/abs/2408.10825": {"title": "Conditional nonparametric variable screening by neural factor regression", "link": "https://arxiv.org/abs/2408.10825", "description": "arXiv:2408.10825v1 Announce Type: new \nAbstract: High-dimensional covariates often admit a linear factor structure. To effectively screen correlated covariates in high dimensions, we propose a conditional variable screening test based on non-parametric regression using neural networks due to their representation power. We ask whether individual covariates have additional contributions given the latent factors or, more generally, a set of variables. Our test statistics are based on the estimated partial derivative of the regression function with respect to the candidate variable for screening and an observable proxy for the latent factors. Hence, our test reveals how much predictors contribute additionally to the non-parametric regression after accounting for the latent factors. Our derivative estimator is the convolution of a deep neural network regression estimator and a smoothing kernel. We demonstrate that when the neural network size diverges with the sample size, unlike estimating the regression function itself, it is necessary to smooth the partial derivative of the neural network estimator to recover the desired convergence rate for the derivative. Moreover, our screening test achieves asymptotic normality under the null after finely centering our test statistics, which makes the biases negligible, as well as consistency against local alternatives under mild conditions.
We demonstrate the performance of our test in a simulation study and two real-world applications."}, "https://arxiv.org/abs/2408.10915": {"title": "Neural Networks for Parameter Estimation in Geometrically Anisotropic Geostatistical Models", "link": "https://arxiv.org/abs/2408.10915", "description": "arXiv:2408.10915v1 Announce Type: new \nAbstract: This article presents a neural network approach for estimating the covariance function of spatial Gaussian random fields defined in a portion of the Euclidean plane. Our proposal builds upon recent contributions, expanding from the purely isotropic setting to encompass geometrically anisotropic correlation structures, i.e., random fields with correlation ranges that vary across different directions. We conduct experiments with both simulated and real data to assess the performance of the methodology and to provide guidelines to practitioners."}, "https://arxiv.org/abs/2408.11003": {"title": "DEEPEAST technique to enhance power in two-sample tests via the same-attraction function", "link": "https://arxiv.org/abs/2408.11003", "description": "arXiv:2408.11003v1 Announce Type: new \nAbstract: Data depth has emerged as an invaluable nonparametric measure for the ranking of multivariate samples. The main contribution of depth-based two-sample comparisons is the introduction of the Q statistic (Liu and Singh, 1993), a quality index. Unlike traditional methods, data depth does not require the assumption of normal distributions and adheres to four fundamental properties. Many existing two-sample homogeneity tests, which assess mean and/or scale changes in distributions, often suffer from low statistical power or indeterminate asymptotic distributions. To overcome these challenges, we introduce a DEEPEAST (depth-explored same-attraction sample-to-sample central-outward ranking) technique for improving statistical power in two-sample tests via the same-attraction function. We propose two novel and powerful depth-based test statistics: the sum test statistic and the product test statistic, which are rooted in Q statistics, share a \"common attractor\" and are applicable across all depth functions. We further derive the asymptotic distributions of these statistics for various depth functions. To assess the power gain, we apply three depth functions: Mahalanobis depth (Liu and Singh, 1993), Spatial depth (Brown, 1958; Gower, 1974), and Projection depth (Liu, 1992). Through two-sample simulations, we demonstrate that our sum and product statistics, utilizing a strategic block permutation algorithm, exhibit superior power performance and compare favourably with popular methods in the literature. Our tests are further validated through analysis of Raman spectral data acquired from cellular and tissue samples, highlighting the effective discrimination between healthy and cancerous samples."}, "https://arxiv.org/abs/2408.11012": {"title": "Discriminant Analysis in stationary time series based on robust cepstral coefficients", "link": "https://arxiv.org/abs/2408.11012", "description": "arXiv:2408.11012v1 Announce Type: new \nAbstract: Time series analysis is crucial in fields like finance, economics, environmental science, and biomedical engineering, aiding in forecasting, pattern identification, and understanding underlying mechanisms. While traditional time-domain methods focus on trends and seasonality, they often miss periodicities better captured in the frequency domain.
Analyzing time series in the frequency domain uncovers spectral properties, offering deeper insights into underlying processes, aiding in differentiating data-generating processes of various populations, and assisting in classification. Common approaches use smoothed estimators, such as the smoothed periodogram, to minimize bias by averaging spectra from individual replicates within a population. However, these methods struggle with spectral variability among replicates, and abrupt values can skew estimators, complicating discrimination and classification. There's a gap in the literature for methods that account for within-population spectral variability, separate white noise effects from autocorrelations, and employ robust estimators in the presence of outliers. This paper fills that gap by introducing a robust framework for classifying time series groups. The process involves transforming time series into the frequency domain using the Fourier Transform, computing the power spectrum, and using the inverse Fourier Transform to obtain the cepstrum. To enhance spectral estimates' robustness and consistency, we apply the multitaper periodogram and the M-periodogram. These features are then used in Linear Discriminant Analysis (LDA) to improve classification accuracy and interpretability, offering a powerful tool for precise temporal pattern distinction and resilience to data anomalies."}, "https://arxiv.org/abs/2408.10251": {"title": "Impossible temperatures are not as rare as you think", "link": "https://arxiv.org/abs/2408.10251", "description": "arXiv:2408.10251v1 Announce Type: cross \nAbstract: The last decade has seen numerous record-shattering heatwaves in all corners of the globe. In the aftermath of these devastating events, there is interest in identifying worst-case thresholds or upper bounds that quantify just how hot temperatures can become. Generalized Extreme Value theory provides a data-driven estimate of extreme thresholds; however, upper bounds may be exceeded by future events, which undermines attribution and planning for heatwave impacts. Here, we show how the occurrence and relative probability of observed events that exceed a priori upper bound estimates, so-called \"impossible\" temperatures, has changed over time. We find that many unprecedented events are actually within data-driven upper bounds, but only when using modern spatial statistical methods. Furthermore, there are clear connections between anthropogenic forcing and the \"impossibility\" of the most extreme temperatures. Robust understanding of heatwave thresholds provides critical information about future record-breaking events and how their extremity relates to historical measurements."}, "https://arxiv.org/abs/2408.10610": {"title": "On the Approximability of Stationary Processes using the ARMA Model", "link": "https://arxiv.org/abs/2408.10610", "description": "arXiv:2408.10610v1 Announce Type: cross \nAbstract: We identify certain gaps in the literature on the approximability of stationary random variables using the Autoregressive Moving Average (ARMA) model. To quantify approximability, we propose that an ARMA model be viewed as an approximation of a stationary random variable. We map these stationary random variables to Hardy space functions, and formulate a new function approximation problem that corresponds to random variable approximation, and thus to ARMA. Based on this Hardy space formulation we identify a class of stationary processes where approximation guarantees are feasible. 
We also identify an idealized stationary random process for which we conjecture that a good ARMA approximation is not possible. Next, we provide a constructive proof that Pad\\'e approximations do not always correspond to the best ARMA approximation. Finally, we note that the spectral methods adopted in this paper can be seen as a generalization of unit root methods for stationary processes even when an ARMA model is not defined."}, "https://arxiv.org/abs/2208.05949": {"title": "Valid Inference After Causal Discovery", "link": "https://arxiv.org/abs/2208.05949", "description": "arXiv:2208.05949v3 Announce Type: replace \nAbstract: Causal discovery and causal effect estimation are two fundamental tasks in causal inference. While many methods have been developed for each task individually, statistical challenges arise when applying these methods jointly: estimating causal effects after running causal discovery algorithms on the same data leads to \"double dipping,\" invalidating the coverage guarantees of classical confidence intervals. To this end, we develop tools for valid post-causal-discovery inference. Across empirical studies, we show that a naive combination of causal discovery and subsequent inference algorithms leads to highly inflated miscoverage rates; on the other hand, applying our method provides reliable coverage while achieving more accurate causal discovery than data splitting."}, "https://arxiv.org/abs/2208.08693": {"title": "Matrix Quantile Factor Model", "link": "https://arxiv.org/abs/2208.08693", "description": "arXiv:2208.08693v3 Announce Type: replace \nAbstract: This paper introduces a matrix quantile factor model for matrix-valued data with low-rank structure. We estimate the row and column factor spaces via minimizing the empirical check loss function with orthogonal rotation constraints. We show that the estimates converge at rate $(\\min\\{p_1p_2,p_2T,p_1T\\})^{-1/2}$ in the average Frobenius norm, where $p_1$, $p_2$ and $T$ are the row dimensionality, column dimensionality and length of the matrix sequence, respectively. This rate is faster than that of the quantile estimates via ``flattening\" the matrix model into a large vector model. To derive the central limit theorem, we introduce a novel augmented Lagrangian function, which is equivalent to the original constrained empirical check loss minimization problem. Via the equivalence, we prove that the Hessian matrix of the augmented Lagrangian function is locally positive definite, resulting in a locally convex penalized loss function around the true factors and their loadings. This easily leads to a feasible second-order expansion of the score function and readily established central limit theorems of the smoothed estimates of the loadings. We provide three consistent criteria to determine the pair of row and column factor numbers. Extensive simulation studies and an empirical study justify our theory."}, "https://arxiv.org/abs/2307.02188": {"title": "Improving Algorithms for Fantasy Basketball", "link": "https://arxiv.org/abs/2307.02188", "description": "arXiv:2307.02188v4 Announce Type: replace \nAbstract: Fantasy basketball has a rich underlying mathematical structure which makes optimal drafting strategy unclear. A central issue for category leagues is how to aggregate a player's statistics from all categories into a single number representing general value. It is shown that under a simplified model of fantasy basketball, a novel metric dubbed the \"G-score\" is appropriate for this purpose. 
The traditional metric used by analysts, \"Z-score\", is a special case of the G-score under the condition that future player performances are known exactly. The distinction between Z-score and G-score is particularly meaningful for head-to-head formats, because there is a large degree of uncertainty in player performance from one week to another. Simulated fantasy basketball seasons with head-to-head scoring provide evidence that G-scores do in fact outperform Z-scores in that context."}, "https://arxiv.org/abs/2310.02273": {"title": "A New measure of income inequality", "link": "https://arxiv.org/abs/2310.02273", "description": "arXiv:2310.02273v2 Announce Type: replace \nAbstract: A new measure of income inequality that captures the heavy tail behavior of the income distribution is proposed. We discuss two different approaches to find the estimators of the proposed measure. We show that these estimators are consistent and have an asymptotically normal distribution. We also obtain a jackknife empirical likelihood (JEL) confidence interval of the income inequality measure. A Monte Carlo simulation study is conducted to evaluate the finite sample properties of the estimators and the JEL-based confidence interval. Finally, we use our measure to study the income inequality of three states in India."}, "https://arxiv.org/abs/2305.00050": {"title": "Causal Reasoning and Large Language Models: Opening a New Frontier for Causality", "link": "https://arxiv.org/abs/2305.00050", "description": "arXiv:2305.00050v3 Announce Type: replace-cross \nAbstract: The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a \"behavioral\" study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and 4 outperform existing algorithms on a pairwise causal discovery task (97%, 13 points gain), a counterfactual reasoning task (92%, 20 points gain) and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone, especially since LLMs generalize to novel datasets that were created after the training cutoff date.\n That said, LLMs exhibit unpredictable failure modes, and we discuss the kinds of errors that may be improved and what the fundamental limits of LLM-based answers are. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques. 
Code and datasets are available at https://github.com/py-why/pywhy-llm."}, "https://arxiv.org/abs/2305.03205": {"title": "Risk management in the use of published statistical results for policy decisions", "link": "https://arxiv.org/abs/2305.03205", "description": "arXiv:2305.03205v2 Announce Type: replace-cross \nAbstract: Statistical inferential results generally come with a measure of reliability for decision-making purposes. For a policy implementer, the value of implementing published policy research depends critically upon this reliability. For a policy researcher, the value of policy implementation may depend weakly or not at all upon the policy's outcome. Some researchers might benefit from overstating the reliability of statistical results. Implementers may find it difficult or impossible to determine whether researchers are overstating reliability. This information asymmetry between researchers and implementers can lead to an adverse selection problem where, at best, the full benefits of a policy are not realized or, at worst, a policy is deemed too risky to implement at any scale. Researchers can remedy this by guaranteeing the policy outcome. Researchers can overcome their own risk aversion and wealth constraints by exchanging risks with other researchers or offering only partial insurance. The problem and remedy are illustrated using a confidence interval for the success probability of a binomial policy outcome."}, "https://arxiv.org/abs/2408.11193": {"title": "Inference with Many Weak Instruments and Heterogeneity", "link": "https://arxiv.org/abs/2408.11193", "description": "arXiv:2408.11193v1 Announce Type: new \nAbstract: This paper considers inference in a linear instrumental variable regression model with many potentially weak instruments and treatment effect heterogeneity. I show that existing tests can be arbitrarily oversized in this setup. Then, I develop a valid procedure that is robust to weak instrument asymptotics and heterogeneous treatment effects. The procedure targets a JIVE estimand, calculates an LM statistic, and compares it with critical values from a normal distribution. To establish this procedure's validity, this paper shows that the LM statistic is asymptotically normal and a leave-three-out variance estimator is unbiased and consistent. The power of the LM test is also close to a power envelope in an empirical application."}, "https://arxiv.org/abs/2408.11272": {"title": "High-Dimensional Overdispersed Generalized Factor Model with Application to Single-Cell Sequencing Data Analysis", "link": "https://arxiv.org/abs/2408.11272", "description": "arXiv:2408.11272v1 Announce Type: new \nAbstract: The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. 
To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM."}, "https://arxiv.org/abs/2408.11315": {"title": "Locally Adaptive Random Walk Stochastic Volatility", "link": "https://arxiv.org/abs/2408.11315", "description": "arXiv:2408.11315v1 Announce Type: new \nAbstract: We introduce a novel Bayesian framework for estimating time-varying volatility by extending the Random Walk Stochastic Volatility (RWSV) model with a new Dynamic Shrinkage Process (DSP) in (log) variances. Unlike classical Stochastic Volatility or GARCH-type models with restrictive parametric stationarity assumptions, our proposed Adaptive Stochastic Volatility (ASV) model provides smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty (vol of vol). We derive the theoretical properties of the proposed global-local shrinkage prior. Through simulation studies, we demonstrate that ASV exhibits remarkable misspecification resilience with low prediction error across various data-generating scenarios. Furthermore, ASV's capacity to yield locally smooth and interpretable estimates facilitates a clearer understanding of underlying patterns and trends in volatility. Additionally, we propose and illustrate an extension for Bayesian Trend Filtering simultaneously in both mean and variance. Finally, we show that this attribute makes ASV a robust tool applicable across a wide range of disciplines, including finance, environmental science, epidemiology, and medicine, among others."}, "https://arxiv.org/abs/2408.11497": {"title": "Climate Change in Austria: Precipitation and Dry Spells over 50 years", "link": "https://arxiv.org/abs/2408.11497", "description": "arXiv:2408.11497v1 Announce Type: new \nAbstract: We propose a spatio-temporal generalised additive model (GAM) to study whether precipitation patterns have changed between two 10-year time periods in the last 50 years in Austria. In particular, we model three scenarios: monthly mean and monthly maximum precipitation as well as the maximum length of a dry spell per month with a gamma, blended generalised extreme value and negative binomial distribution, respectively, over the periods 1973-1982 and 2013-2022. In order to model the spatial dependencies in the data more realistically, we intend to take the mountainous landscape of Austria into account. Therefore, we have chosen a non-stationary version of the Mat\\'ern covariance function, which accounts for elevation differences, as a spatial argument of the latent field in the GAM. The temporal part of the latent field is captured by an AR(1) process. We use the stochastic partial differential equation approach in combination with integrated nested Laplace approximation to perform computationally efficient inference. 
The model outputs are visualised and support existing climate change studies in the Alpine region obtained with, for example, projections from regional climate models."}, "https://arxiv.org/abs/2408.11519": {"title": "Towards an Inclusive Approach to Corporate Social Responsibility (CSR) in Morocco: CGEM's Commitment", "link": "https://arxiv.org/abs/2408.11519", "description": "arXiv:2408.11519v1 Announce Type: new \nAbstract: Corporate social responsibility encourages companies to integrate social and environmental concerns into their activities and their relations with stakeholders. It encompasses all actions aimed at the social good, above and beyond corporate interests and legal requirements. Various international organizations, authors and researchers have explored the notion of CSR and proposed a range of definitions reflecting their perspectives on the concept. In Morocco, although Moroccan companies are not overwhelmingly embracing CSR, several factors are encouraging them to integrate the CSR approach not only into their discourse, but also into their strategies. The CGEM is actively involved in promoting CSR within Moroccan companies, awarding the \"CGEM Label for CSR\" to companies that meet the criteria set out in the CSR Charter. The process of labeling Moroccan companies is in full expansion. The graphs presented in this article are broken down according to several criteria, such as company size, sector of activity and listing on the Casablanca Stock Exchange, in order to provide an overview of CSR-labeled companies in Morocco. The approach adopted for this article is a qualitative one aimed at presenting, firstly, the different definitions of the CSR concept and its evolution over time. In this way, the study focuses on the Moroccan context to dissect and analyze the state of progress of CSR integration in Morocco and the various efforts made by the CGEM to implement it. According to the data, 124 Moroccan companies have been awarded the CSR label. For a label in existence since 2006, this figure reflects a certain reluctance on the part of Moroccan companies to fully implement the CSR approach in their strategies. Nevertheless, Morocco is in a transitional phase, marked by the gradual adoption of various socially responsible practices."}, "https://arxiv.org/abs/2408.11594": {"title": "On the handling of method failure in comparison studies", "link": "https://arxiv.org/abs/2408.11594", "description": "arXiv:2408.11594v1 Announce Type: new \nAbstract: Comparison studies in methodological research are intended to compare methods in an evidence-based manner, offering guidance to data analysts to select a suitable method for their application. To provide trustworthy evidence, they must be carefully designed, implemented, and reported, especially given the many decisions made in planning and running. A common challenge in comparison studies is to handle the ``failure'' of one or more methods to produce a result for some (real or simulated) data sets, such that their performances cannot be measured in those instances. Despite an increasing emphasis on this topic in recent literature (focusing on non-convergence as a common manifestation), there is little guidance on proper handling and interpretation, and reporting of the chosen approach is often neglected. This paper aims to fill this gap and provides practical guidance for handling method failure in comparison studies. 
In particular, we show that the popular approaches of discarding data sets yielding failure (either for all methods or for the failing methods only) and imputing are inappropriate in most cases. We also discuss how method failure in published comparison studies -- in various contexts from classical statistics and predictive modeling -- may manifest differently, but is often caused by a complex interplay of several aspects. Building on this, we provide recommendations derived from realistic considerations on suitable fallbacks when encountering method failure, hence avoiding the need for discarding data sets or imputation. Finally, we illustrate our recommendations and the dangers of inadequate handling of method failure through two illustrative comparison studies."}, "https://arxiv.org/abs/2408.11621": {"title": "Robust Bayes Treatment Choice with Partial Identification", "link": "https://arxiv.org/abs/2408.11621", "description": "arXiv:2408.11621v1 Announce Type: new \nAbstract: We study a class of binary treatment choice problems with partial identification, through the lens of robust (multiple prior) Bayesian analysis. We use a convenient set of prior distributions to derive ex-ante and ex-post robust Bayes decision rules, both for decision makers who can randomize and for decision makers who cannot.\n Our main messages are as follows: First, ex-ante and ex-post robust Bayes decision rules do not tend to agree in general, whether or not randomized rules are allowed. Second, randomized treatment assignment for some data realizations can be optimal in both ex-ante and, perhaps more surprisingly, ex-post problems. Therefore, excluding randomized rules from consideration usually entails a loss of generality, even when regret is evaluated ex-post.\n We apply our results to a stylized problem where a policy maker uses experimental data to choose whether to implement a new policy in a population of interest, but is concerned about the external validity of the experiment at hand (Stoye, 2012); and to the aggregation of data generated by multiple randomized control trials in different sites to make a policy choice in a population for which no experimental data are available (Manski, 2020; Ishihara and Kitagawa, 2021)."}, "https://arxiv.org/abs/2408.11672": {"title": "Evidential Analysis: An Alternative to Hypothesis Testing in Normal Linear Models", "link": "https://arxiv.org/abs/2408.11672", "description": "arXiv:2408.11672v1 Announce Type: new \nAbstract: Statistical hypothesis testing, as formalized by 20th Century statisticians and taught in college statistics courses, has been a cornerstone of 100 years of scientific progress. Nevertheless, the methodology is increasingly questioned in many scientific disciplines. We demonstrate in this paper how many of the worrisome aspects of statistical hypothesis testing can be ameliorated with concepts and methods from evidential analysis. The model family we treat is the familiar normal linear model with fixed effects, embracing multiple regression and analysis of variance, a warhorse of everyday science in labs and field stations. Questions about study design, the applicability of the null hypothesis, the effect size, error probabilities, evidence strength, and model misspecification become more naturally housed in an evidential setting. 
We provide a completely worked example featuring a 2-way analysis of variance."}, "https://arxiv.org/abs/2408.11676": {"title": "L2-Convergence of the Population Principal Components in the Approximate Factor Model", "link": "https://arxiv.org/abs/2408.11676", "description": "arXiv:2408.11676v1 Announce Type: new \nAbstract: We prove that under the condition that the eigenvalues are asymptotically well separated and stable, the normalised principal components of an r-static factor sequence converge in mean square. Consequently, we have a generic interpretation of the principal components estimator as the normalised principal components of the statically common space. We illustrate why this can be useful for the interpretation of the PC-estimated factors, developing an asymptotic theory without rotation matrices and avoiding singularity issues in factor augmented regressions."}, "https://arxiv.org/abs/2408.11718": {"title": "Scalable and non-iterative graphical model estimation", "link": "https://arxiv.org/abs/2408.11718", "description": "arXiv:2408.11718v1 Announce Type: new \nAbstract: Graphical models have found widespread applications in many areas of modern statistics and machine learning. Iterative Proportional Fitting (IPF) and its variants have become the default method for undirected graphical model estimation, and are thus ubiquitous in the field. As the IPF is an iterative approach, it is not always readily scalable to modern high-dimensional data regimes. In this paper we propose a novel and fast non-iterative method for positive definite graphical model estimation in high dimensions, one that directly addresses the shortcomings of IPF and its variants. In addition, the proposed method has a number of other attractive properties. First, we show formally that as the dimension p grows, the proportion of graphs for which the proposed method will outperform the state-of-the-art in terms of computational complexity and performance tends to 1, affirming its efficacy in modern settings. Second, the proposed approach can be readily combined with scalable non-iterative thresholding-based methods for high-dimensional sparsity selection. Third, the proposed method has high-dimensional statistical guarantees. Moreover, our numerical experiments also show that the proposed method achieves scalability without compromising on statistical precision. Fourth, unlike the IPF, which depends on the Gaussian likelihood, the proposed method is much more robust."}, "https://arxiv.org/abs/2408.11803": {"title": "Bayesian Nonparametric Risk Assessment in Developmental Toxicity Studies with Ordinal Responses", "link": "https://arxiv.org/abs/2408.11803", "description": "arXiv:2408.11803v1 Announce Type: new \nAbstract: We develop a nonparametric Bayesian modeling framework for clustered ordinal responses in developmental toxicity studies, which typically exhibit extensive heterogeneity. The primary focus of these studies is to examine the dose-response relationship, which is depicted by the (conditional) probability of an endpoint across the dose (toxin) levels. Standard parametric approaches, limited in terms of the response distribution and/or the dose-response relationship, hinder reliable uncertainty quantification in this context. We propose nonparametric mixture models that are built from dose-dependent stick-breaking process priors, leveraging the continuation-ratio logits representation of the multinomial distribution to formulate the mixture kernel. 
We further elaborate the modeling approach, amplifying the mixture models with an overdispersed kernel which offers enhanced control of variability. We conduct a simulation study to demonstrate the benefits of both the discrete nonparametric mixing structure and the overdispersed kernel in delivering coherent uncertainty quantification. Further illustration is provided with different forms of risk assessment, using data from a toxicity experiment on the effects of ethylene glycol."}, "https://arxiv.org/abs/2408.11808": {"title": "Distance Correlation in Multiple Biased Sampling Models", "link": "https://arxiv.org/abs/2408.11808", "description": "arXiv:2408.11808v1 Announce Type: new \nAbstract: Testing the independence between random vectors is a fundamental problem in statistics. Distance correlation, a recently popular dependence measure, is universally consistent for testing independence against all distributions with finite moments. However, when data are subject to selection bias or collected from multiple sources or schemes, spurious dependence may arise. This creates a need for methods that can effectively utilize data from different sources and correct these biases. In this paper, we study the estimation of distance covariance and distance correlation under multiple biased sampling models, which provide a natural framework for addressing these issues. Theoretical properties, including the strong consistency and asymptotic null distributions of the distance covariance and correlation estimators, and the rate at which the test statistic diverges under sequences of alternatives approaching the null, are established. A weighted permutation procedure is proposed to determine the critical value of the independence test. Simulation studies demonstrate that our approach improves both the estimation of distance correlation and the power of the test."}, "https://arxiv.org/abs/2408.11164": {"title": "The Ensemble Epanechnikov Mixture Filter", "link": "https://arxiv.org/abs/2408.11164", "description": "arXiv:2408.11164v1 Announce Type: cross \nAbstract: In the high-dimensional setting, Gaussian mixture kernel density estimates become increasingly suboptimal. In this work we aim to show that it is practical to instead use the optimal multivariate Epanechnikov kernel. We make use of this optimal Epanechnikov mixture kernel density estimate for the sequential filtering scenario through what we term the ensemble Epanechnikov mixture filter (EnEMF). We provide a practical implementation of the EnEMF that is as cost efficient as the comparable ensemble Gaussian mixture filter. We show on a static example that the EnEMF is robust to growth in dimension, and also that the EnEMF has a significant reduction in error per particle on the 40-variable Lorenz '96 system."}, "https://arxiv.org/abs/2408.11753": {"title": "Small Sample Behavior of Wasserstein Projections, Connections to Empirical Likelihood, and Other Applications", "link": "https://arxiv.org/abs/2408.11753", "description": "arXiv:2408.11753v1 Announce Type: cross \nAbstract: The empirical Wasserstein projection (WP) distance quantifies the Wasserstein distance from the empirical distribution to a set of probability measures satisfying given expectation constraints. 
The WP is a powerful tool because it mitigates the curse of dimensionality inherent in the Wasserstein distance, making it valuable for various tasks, including constructing statistics for hypothesis testing, optimally selecting the ambiguity size in Wasserstein distributionally robust optimization, and studying algorithmic fairness. While the weak convergence analysis of the WP as the sample size $n$ grows is well understood, higher-order (i.e., sharp) asymptotics of WP remain unknown. In this paper, we study the second-order asymptotic expansion and the Edgeworth expansion of WP, both expressed as power series of $n^{-1/2}$. These expansions are essential to develop improved confidence level accuracy and a power expansion analysis for the WP-based tests for moment equations null against local alternative hypotheses. As a by-product, we obtain insightful criteria for comparing the power of the Empirical Likelihood and Hotelling's $T^2$ tests against the WP-based test. This insight provides the first comprehensive guideline for selecting the most powerful local test among WP-based, empirical-likelihood-based, and Hotelling's $T^2$ tests for a null. Furthermore, we introduce Bartlett-type corrections to improve the approximation to WP distance quantiles and, thus, improve the coverage in WP applications."}, "https://arxiv.org/abs/1511.04745": {"title": "The matryoshka doll prior: principled penalization in Bayesian selection", "link": "https://arxiv.org/abs/1511.04745", "description": "arXiv:1511.04745v2 Announce Type: replace \nAbstract: This paper introduces a general and principled construction of model space priors with a focus on regression problems. The proposed formulation regards each model as a ``local'' null hypothesis whose alternatives are the set of models that nest it. A simple proportionality principle yields a natural isomorphism of model spaces induced by conditioning on predictor inclusion before or after observing data. This isomorphism produces the Poisson distribution as the unique limiting distribution over model dimension under mild assumptions. We compare this model space prior theoretically and in simulations to widely adopted Beta-Binomial constructions and show that the proposed prior yields a ``just-right'' penalization profile."}, "https://arxiv.org/abs/2304.03809": {"title": "Estimating Shapley Effects in Big-Data Emulation and Regression Settings using Bayesian Additive Regression Trees", "link": "https://arxiv.org/abs/2304.03809", "description": "arXiv:2304.03809v2 Announce Type: replace \nAbstract: Shapley effects are a particularly interpretable approach to assessing how a function depends on its various inputs. The existing literature contains various estimators for this class of sensitivity indices in the context of nonparametric regression where the function is observed with noise, but there does not seem to be an estimator that is computationally tractable for input dimensions in the hundreds scale. This article provides such an estimator that is computationally tractable on this scale. The estimator uses a metamodel-based approach by first fitting a Bayesian Additive Regression Trees model which is then used to compute Shapley-effect estimates. This article also establishes a theoretical guarantee of posterior consistency on a large function class for this Shapley-effect estimator. 
Finally, this paper explores the performance of these Shapley-effect estimators on four different test functions for various input dimensions, including $p=500$."}, "https://arxiv.org/abs/2306.01566": {"title": "Fatigue detection via sequential testing of biomechanical data using martingale statistic", "link": "https://arxiv.org/abs/2306.01566", "description": "arXiv:2306.01566v2 Announce Type: replace \nAbstract: Injuries to the knee joint are very common for long-distance and frequent runners, an issue which is often attributed to fatigue. We address the problem of fatigue detection from biomechanical data from different sources, consisting of lower extremity joint angles and ground reaction forces from running athletes, with the goal of better understanding the impact of fatigue on the biomechanics of runners in general and on an individual level. This is done by sequentially testing for change in a datastream using a simple martingale test statistic. Time-uniform probabilistic martingale bounds are provided which are used as thresholds for the test statistic. Sharp bounds can be developed by a hybrid of a piecewise-linear bound and a law-of-iterated-logarithm bound over all time regimes, where the probability of an early detection is controlled in a uniform way. If the underlying distribution of the data gradually changes over the course of a run, then a timely upcrossing of the martingale over these bounds is expected. The methods are developed for a setting where change sets in gradually in an incoming stream of data. Parameter selection for the bounds is based on simulations, and a methodological comparison is made with respect to existing advances. The algorithms presented here can be easily adapted to an online change-detection setting. Finally, we provide a detailed data analysis based on extensive measurements of several athletes and benchmark the fatigue detection results with the runners' individual feedback over the course of the data collection. Qualitative conclusions on the biomechanical profiles of the athletes can be made based on the shape of the martingale trajectories even in the absence of an upcrossing of the threshold."}, "https://arxiv.org/abs/2311.01638": {"title": "Inference on summaries of a model-agnostic longitudinal variable importance trajectory with application to suicide prevention", "link": "https://arxiv.org/abs/2311.01638", "description": "arXiv:2311.01638v2 Announce Type: replace \nAbstract: Risk of suicide attempt varies over time. Understanding the importance of risk factors measured at a mental health visit can help clinicians evaluate future risk and provide appropriate care during the visit. In prediction settings where data are collected over time, such as in mental health care, it is often of interest to understand both the importance of variables for predicting the response at each time point and the importance summarized over the time series. Building on recent advances in estimation and inference for variable importance measures, we define summaries of variable importance trajectories and corresponding estimators. The same approaches for inference can be applied to these measures regardless of the choice of the algorithm(s) used to estimate the prediction function. We propose a nonparametric efficient estimation and inference procedure as well as a null hypothesis testing procedure that are valid even when complex machine learning tools are used for prediction. 
Through simulations, we demonstrate that our proposed procedures have good operating characteristics. We use these approaches to analyze electronic health records data from two large health systems to investigate the longitudinal importance of risk factors for suicide attempt to inform future suicide prevention research and clinical workflow."}, "https://arxiv.org/abs/2401.07000": {"title": "Counterfactual Slopes and Their Applications in Social Stratification", "link": "https://arxiv.org/abs/2401.07000", "description": "arXiv:2401.07000v2 Announce Type: replace \nAbstract: This paper addresses two prominent theses in social stratification research, the great equalizer thesis and Mare's (1980) school transition thesis. Both theses are premised on a descriptive regularity: the association between socioeconomic background and an outcome variable changes when conditioning on an intermediate treatment. However, if the descriptive regularity is driven by differential selection into treatment, then the two theses do not have substantive interpretations. We propose a set of novel counterfactual slope estimands, which capture the two theses under the hypothetical scenario where differential selection into treatment is eliminated. Thus, we use the counterfactual slopes to construct selection-free tests for the two theses. Compared with the existing literature, we are the first to explicitly provide nonparametric and causal estimands, which enable us to conduct more principled analysis. We are also the first to develop flexible, efficient, and robust estimators for the two theses based on efficient influence functions. We apply our framework to a nationally representative dataset in the United States and re-evaluate the two theses. Findings from our selection-free tests show that the descriptive regularity is sometimes misleading for substantive interpretations."}, "https://arxiv.org/abs/2212.05053": {"title": "Joint Spectral Clustering in Multilayer Degree-Corrected Stochastic Blockmodels", "link": "https://arxiv.org/abs/2212.05053", "description": "arXiv:2212.05053v2 Announce Type: replace-cross \nAbstract: Modern network datasets are often composed of multiple layers, either as different views, time-varying observations, or independent sample units, resulting in collections of networks over the same set of vertices but with potentially different connectivity patterns on each network. These data require models and methods that are flexible enough to capture local and global differences across the networks, while at the same time being parsimonious and tractable to yield computationally efficient and theoretically sound solutions that are capable of aggregating information across the networks. This paper considers the multilayer degree-corrected stochastic blockmodel, where a collection of networks share the same community structure, but degree-corrections and block connection probability matrices are permitted to be different. We establish the identifiability of this model and propose a spectral clustering algorithm for community detection in this setting. Our theoretical results demonstrate that the misclustering error rate of the algorithm improves exponentially with multiple network realizations, even in the presence of significant layer heterogeneity with respect to degree corrections, signal strength, and spectral properties of the block connection probability matrices. 
Simulation studies show that this approach improves on existing multilayer community detection methods in this challenging regime. Furthermore, in a case study of US airport data through January 2016 -- September 2021, we find that this methodology identifies meaningful community structure and trends in airport popularity influenced by pandemic impacts on travel."}, "https://arxiv.org/abs/2308.09104": {"title": "Spike-and-slab shrinkage priors for structurally sparse Bayesian neural networks", "link": "https://arxiv.org/abs/2308.09104", "description": "arXiv:2308.09104v2 Announce Type: replace-cross \nAbstract: Network complexity and computational efficiency have become increasingly significant aspects of deep learning. Sparse deep learning addresses these challenges by recovering a sparse representation of the underlying target function by reducing heavily over-parameterized deep neural networks. Specifically, deep neural architectures compressed via structured sparsity (e.g. node sparsity) provide low latency inference, higher data throughput, and reduced energy consumption. In this paper, we explore two well-established shrinkage techniques, Lasso and Horseshoe, for model compression in Bayesian neural networks. To this end, we propose structurally sparse Bayesian neural networks which systematically prune excessive nodes with (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS) priors, and develop computationally tractable variational inference including continuous relaxation of Bernoulli variables. We establish the contraction rates of the variational posterior of our proposed models as a function of the network topology, layer-wise node cardinalities, and bounds on the network weights. We empirically demonstrate the competitive performance of our models compared to the baseline models in prediction accuracy, model compression, and inference latency."}, "https://arxiv.org/abs/2408.11951": {"title": "SPORTSCausal: Spill-Over Time Series Causal Inference", "link": "https://arxiv.org/abs/2408.11951", "description": "arXiv:2408.11951v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) have long been the gold standard for causal inference across various fields, including business analysis, economic studies, sociology, clinical research, and network learning. The primary advantage of RCTs over observational studies lies in their ability to significantly reduce noise from individual variance. However, RCTs depend on strong assumptions, such as group independence, time independence, and group randomness, which are not always feasible in real-world applications. Traditional inferential methods, including analysis of covariance (ANCOVA), often fail when these assumptions do not hold. In this paper, we propose a novel approach named \\textbf{Sp}ill\\textbf{o}ve\\textbf{r} \\textbf{T}ime \\textbf{S}eries \\textbf{Causal} (\\verb+SPORTSCausal+), which enables the estimation of treatment effects without relying on these stringent assumptions. We demonstrate the practical applicability of \\verb+SPORTSCausal+ through a real-world budget-control experiment. In this experiment, data was collected from both a 5\\% live experiment and a 50\\% live experiment using the same treatment. 
Due to the spillover effect, the vanilla estimation of the treatment effect was not robust across different treatment sizes, whereas \\verb+SPORTSCausal+ provided a robust estimation."}, "https://arxiv.org/abs/2408.11994": {"title": "Fast and robust cross-validation-based scoring rule inference for spatial statistics", "link": "https://arxiv.org/abs/2408.11994", "description": "arXiv:2408.11994v1 Announce Type: new \nAbstract: Scoring rules are aimed at evaluation of the quality of predictions, but can also be used for estimation of parameters in statistical models. We propose estimating parameters of multivariate spatial models by maximising the average leave-one-out cross-validation score. This method, LOOS, thus optimises predictions instead of maximising the likelihood. The method allows for fast computations for Gaussian models with sparse precision matrices, such as spatial Markov models. It also makes it possible to tailor the estimator's robustness to outliers and their sensitivity to spatial variations of uncertainty through the choice of the scoring rule which is used in the maximisation. The effects of the choice of scoring rule which is used in LOOS are studied by simulation in terms of computation time, statistical efficiency, and robustness. Various popular scoring rules and a new scoring rule, the root score, are compared to maximum likelihood estimation. The results confirmed that for spatial Markov models the computation time for LOOS was much smaller than for maximum likelihood estimation. Furthermore, the standard deviations of parameter estimates were smaller for maximum likelihood estimation, although the differences often were small. The simulations also confirmed that the usage of a robust scoring rule results in robust LOOS estimates and that the robustness provides better predictive quality for spatial data with outliers. Finally, the new inference method was applied to ERA5 temperature reanalysis data for the contiguous United States and the average July temperature for the years 1940 to 2023, and this showed that the LOOS estimator provided parameter estimates that were more than a hundred times faster to compute compared to maximum-likelihood estimation, and resulted in a model with better predictive performance."}, "https://arxiv.org/abs/2408.12078": {"title": "L1 Prominence Measures for Directed Graphs", "link": "https://arxiv.org/abs/2408.12078", "description": "arXiv:2408.12078v1 Announce Type: new \nAbstract: We introduce novel measures, L1 prestige and L1 centrality, for quantifying the prominence of each vertex in a strongly connected and directed graph by utilizing the concept of L1 data depth (Vardi and Zhang, Proc. Natl. Acad. Sci. U.S.A.\\ 97(4):1423--1426, 2000). The former measure quantifies the degree of prominence of each vertex in receiving choices, whereas the latter measure evaluates the degree of importance in giving choices. The proposed measures can handle graphs with both edge and vertex weights, as well as undirected graphs. However, examining a graph using a measure defined over a single `scale' inevitably leads to a loss of information, as each vertex may exhibit distinct structural characteristics at different levels of locality. To this end, we further develop local versions of the proposed measures with a tunable locality parameter. Using these tools, we present a multiscale network analysis framework that provides much richer structural information about each vertex than a single-scale inspection. 
By applying the proposed measures to the networks constructed from the Seoul Mobility Flow Data, it is demonstrated that these measures accurately depict and uncover the inherent characteristics of individual city regions."}, "https://arxiv.org/abs/2408.12098": {"title": "Temporal discontinuity trials and randomization: success rates versus design strength", "link": "https://arxiv.org/abs/2408.12098", "description": "arXiv:2408.12098v1 Announce Type: new \nAbstract: We consider the following comparative effectiveness scenario. There are two treatments for a particular medical condition: a randomized experiment has demonstrated mediocre effectiveness for the first treatment, while a non-randomized study of the second treatment reports a much higher success rate. On what grounds might one justifiably prefer the second treatment over the first treatment, given only the information from those two studies, including design details? This situation occurs in reality and warrants study. We consider a particular example involving studies of treatments for Crohn's disease. In order to help resolve these cases of asymmetric evidence, we make three contributions and apply them to our example. First, we demonstrate the potential to improve success rates above those found in a randomized trial, given heterogeneous effects. Second, we prove that deliberate treatment assignment can be more efficient than randomization when study results are to be transported to formulate an intervention policy on a wider population. Third, we provide formal conditions under which a temporal-discontinuity design approximates a randomized trial, and we introduce a novel design parameter to inform researchers about the strength of that approximation. Overall, our results indicate that while randomization certainly provides special advantages, other study designs such as temporal-discontinuity designs also have distinct advantages, and can produce valuable evidence that informs treatment decisions and intervention policy."}, "https://arxiv.org/abs/2408.12272": {"title": "Decorrelated forward regression for high dimensional data analysis", "link": "https://arxiv.org/abs/2408.12272", "description": "arXiv:2408.12272v1 Announce Type: new \nAbstract: Forward regression is a crucial methodology for automatically identifying important predictors from a large pool of potential covariates. In contexts with moderate predictor correlation, forward selection techniques can achieve screening consistency. However, this property gradually becomes invalid in the presence of substantially correlated variables, especially in high-dimensional datasets where strong correlations exist among predictors. This dilemma is encountered by other model selection methods in literature as well. To address these challenges, we introduce a novel decorrelated forward (DF) selection framework for generalized mean regression models, including prevalent models, such as linear, logistic, Poisson, and quasi likelihood. The DF selection framework stands out because of its ability to convert generalized mean regression models into linear ones, thus providing a clear interpretation of the forward selection process. It also offers a closed-form expression for forward iteration, to improve practical applicability and efficiency. Theoretically, we establish the screening consistency of DF selection and determine the upper bound of the selected submodel's size. 
To reduce the computational burden, we develop a thresholding DF algorithm that provides a stopping rule for the forward-searching process. Simulations and two real data applications show the outstanding performance of our method compared with some existing model selection methods."}, "https://arxiv.org/abs/2408.12286": {"title": "Momentum Informed Inflation-at-Risk", "link": "https://arxiv.org/abs/2408.12286", "description": "arXiv:2408.12286v1 Announce Type: new \nAbstract: Growth-at-Risk has recently become a key measure of macroeconomic tail risk and, as such, has been researched extensively. Surprisingly, the same cannot be said for Inflation-at-Risk, where both tails, deflation and high inflation, are of key concern to policymakers, yet which has received comparatively little research attention. This paper tackles this gap and provides estimates of Inflation-at-Risk. The key insight of the paper is that inflation is best characterised by a combination of two types of nonlinearities: quantile variation, and conditioning on the momentum of inflation."}, "https://arxiv.org/abs/2408.12339": {"title": "Inference for decorated graphs and application to multiplex networks", "link": "https://arxiv.org/abs/2408.12339", "description": "arXiv:2408.12339v1 Announce Type: new \nAbstract: A graphon is a limiting object used to describe the behaviour of large networks through a function that captures the probability of edge formation between nodes. Although the merits of graphons to describe large and unlabelled networks are clear, they traditionally are used for describing only binary edge information, which limits their utility for more complex relational data. Decorated graphons were introduced to extend the graphon framework by incorporating richer relationships, such as edge weights and types. This specificity in modelling connections provides more granular insight into network dynamics. Yet, there are no existing inference techniques for decorated graphons. We develop such an estimation method, extending existing techniques from traditional graphon estimation to accommodate these richer interactions. We derive the rate of convergence for our method and show that it is consistent with traditional non-parametric theory when the decoration space is finite. Simulations confirm that these theoretical rates are achieved in practice. Our method, tested on synthetic and empirical data, effectively captures additional edge information, resulting in improved network models. This advancement extends the scope of graphon estimation to encompass more complex networks, such as multiplex networks and attributed graphs, thereby increasing our understanding of their underlying structures."}, "https://arxiv.org/abs/2408.12347": {"title": "Preregistration does not improve the transparent evaluation of severity in Popper's philosophy of science or when deviations are allowed", "link": "https://arxiv.org/abs/2408.12347", "description": "arXiv:2408.12347v1 Announce Type: new \nAbstract: One justification for preregistering research hypotheses, methods, and analyses is that it improves the transparent evaluation of the severity of hypothesis tests. In this article, I consider two cases in which preregistration does not improve this evaluation. First, I argue that, although preregistration can facilitate the transparent evaluation of severity in Mayo's error statistical philosophy of science, it does not facilitate this evaluation in Popper's theory-centric philosophy. 
To illustrate, I show that associated concerns about Type I error rate inflation are only relevant in the error statistical approach and not in a theory-centric approach. Second, I argue that a preregistered test procedure that allows deviations in its implementation does not provide a more transparent evaluation of Mayoian severity than a non-preregistered procedure. In particular, I argue that sample-based validity-enhancing deviations cause an unknown inflation of the test procedure's Type I (familywise) error rate and, consequently, an unknown reduction in its capability to license inferences severely. I conclude that preregistration does not improve the transparent evaluation of severity in Popper's philosophy of science or when deviations are allowed."}, "https://arxiv.org/abs/2408.12482": {"title": "Latent Gaussian Graphical Models with Golazo Penalty", "link": "https://arxiv.org/abs/2408.12482", "description": "arXiv:2408.12482v1 Announce Type: new \nAbstract: The existence of latent variables in practical problems is common, for example when some variables are difficult or expensive to measure, or simply unknown. When latent variables are unaccounted for, structure learning for Gaussian graphical models can be blurred by additional correlation between the observed variables that is incurred by the latent variables. A standard approach for this problem is a latent version of the graphical lasso that splits the inverse covariance matrix into a sparse and a low-rank part that are penalized separately. In this paper we propose a generalization of this via the flexible Golazo penalty. This allows us to introduce latent versions of, for example, the adaptive lasso, positive dependence constraints, or predetermined sparsity patterns, and combinations of those. We develop an algorithm for the latent Gaussian graphical model with the Golazo penalty and demonstrate it on simulated and real data."}, "https://arxiv.org/abs/2408.12541": {"title": "Clarifying the Role of the Mantel-Haenszel Risk Difference Estimator in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2408.12541", "description": "arXiv:2408.12541v1 Announce Type: new \nAbstract: The Mantel-Haenszel (MH) risk difference estimator, commonly used in randomized clinical trials for binary outcomes, calculates a weighted average of stratum-specific risk difference estimators. Traditionally, this method requires the stringent assumption that risk differences are homogeneous across strata, also known as the common risk difference assumption. In our article, we relax this assumption and adopt a modern perspective, viewing the MH risk difference estimator as an approach for covariate adjustment in randomized clinical trials, distinguishing its use from that in meta-analysis and observational studies. We demonstrate that the MH risk difference estimator consistently estimates the average treatment effect within a standard super-population framework, which is often the primary interest in randomized clinical trials, in addition to estimating a weighted average of stratum-specific risk differences. We rigorously study its properties under both the large-stratum and sparse-stratum asymptotic regimes. Furthermore, for either estimand, we propose a unified robust variance estimator that improves over the popular variance estimators by Greenland and Robins (1985) and Sato et al. (1989) and has provable consistency across both asymptotic regimes, regardless of assuming common risk differences. 
Extensions of our theoretical results also provide new insights into the Cochran-Mantel-Haenszel test and the post-stratification estimator. Our findings are thoroughly validated through simulations and a clinical trial example."}, "https://arxiv.org/abs/2408.12577": {"title": "Integrating an agent-based behavioral model in microtransit forecasting and revenue management", "link": "https://arxiv.org/abs/2408.12577", "description": "arXiv:2408.12577v1 Announce Type: new \nAbstract: As an IT-enabled multi-passenger mobility service, microtransit has the potential to improve accessibility, reduce congestion, and enhance flexibility in transportation options. However, due to its heterogeneous impacts on different communities and population segments, there is a need for better tools in microtransit forecasting and revenue management, especially when actual usage data are limited. We propose a novel framework based on an agent-based mixed logit model estimated with microtransit usage data and synthetic trip data. The framework involves estimating a lower-branch mode choice model with synthetic trip data, combining lower-branch parameters with microtransit data to estimate an upper-branch ride pass subscription model, and applying the nested model to evaluate microtransit pricing and subsidy policies. The framework enables further decision-support analysis to consider diverse travel patterns and heterogeneous tastes of the total population. We test the framework in a case study with synthetic trip data from Replica Inc. and microtransit data from Arlington Via. The lower-branch model results in a rho-square value of 0.603 on weekdays and 0.576 on weekends. Predictions made by the upper-branch model closely match the marginal subscription data. In a ride pass pricing policy scenario, we show that a discount in the weekly pass (from $25 to $18.9) and the monthly pass (from $80 to $71.5) would surprisingly increase total revenue by $102/day. In an event- or place-based subsidy policy scenario, we show that a 100% fare discount would reduce 80 car trips during peak hours at AT&T Stadium, requiring a subsidy of $32,068/year."}, "https://arxiv.org/abs/2408.11967": {"title": "Valuing an Engagement Surface using a Large Scale Dynamic Causal Model", "link": "https://arxiv.org/abs/2408.11967", "description": "arXiv:2408.11967v1 Announce Type: cross \nAbstract: With recent rapid growth in online shopping, AI-powered Engagement Surfaces (ES) have become ubiquitous across retail services. These engagement surfaces perform an increasing range of functions, including recommending new products for purchase, reminding customers of their orders and providing delivery notifications. Understanding the causal effect of engagement surfaces on value driven for customers and businesses remains an open scientific question. In this paper, we develop a dynamic causal model at scale to disentangle value attributable to an ES, and to assess its effectiveness. 
We demonstrate the application of this model to inform business decision-making by understanding returns on investment in the ES, and identifying product lines and features where the ES adds the most value."}, "https://arxiv.org/abs/2408.12004": {"title": "CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies", "link": "https://arxiv.org/abs/2408.12004", "description": "arXiv:2408.12004v1 Announce Type: cross \nAbstract: When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising. Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements, so too often they must revert to the baseline to maintain safety. We overcome these issues by leveraging the most powerful safety test in the asymptotic regime and allowing for multiple candidates to be tested for improvement over the baseline. We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level, even in moderate sample sizes. We present CSPI and CSPI-MT, two novel heuristics for selecting cutoff(s) to maximize the policy improvement from baseline. We demonstrate through both synthetic and external datasets that our approaches improve both the detection rates of safe policies and the realized improvement, particularly under stringent safety requirements and low signal-to-noise conditions."}, "https://arxiv.org/abs/2408.12014": {"title": "An Econometric Analysis of Large Flexible Cryptocurrency-mining Consumers in Electricity Markets", "link": "https://arxiv.org/abs/2408.12014", "description": "arXiv:2408.12014v1 Announce Type: cross \nAbstract: In recent years, power grids have seen a surge in large cryptocurrency mining firms, with individual consumption levels reaching 700MW. This study examines the behavior of these firms in Texas, focusing on how their consumption is influenced by cryptocurrency conversion rates, electricity prices, local weather, and other factors. We transform the skewed electricity consumption data of these firms, perform correlation analysis, and apply a seasonal autoregressive moving average model for analysis. Our findings reveal that, surprisingly, short-term mining electricity consumption is not correlated with cryptocurrency conversion rates. Instead, the primary influencers are the temperature and electricity prices. These firms also respond to avoid transmission and distribution network (T\\&D) charges -- famously known as four Coincident peak (4CP) charges -- during summer times. As the scale of these firms is likely to surge in future years, the developed electricity consumption model can be used to generate public, synthetic datasets to understand the overall impact on power grid. 
The developed model could also lead to better pricing mechanisms to effectively use the flexibility of these resources towards improving power grid reliability."}, "https://arxiv.org/abs/2408.12210": {"title": "Enhancing Causal Discovery in Financial Networks with Piecewise Quantile Regression", "link": "https://arxiv.org/abs/2408.12210", "description": "arXiv:2408.12210v1 Announce Type: cross \nAbstract: Financial networks can be constructed using statistical dependencies found within the price series of speculative assets. Across the various methods used to infer these networks, there is a general reliance on predictive modelling to capture cross-correlation effects. These methods usually model the flow of mean-response information, or the propagation of volatility and risk within the market. Such techniques, though insightful, don't fully capture the broader distribution-level causality that is possible within speculative markets. This paper introduces a novel approach, combining quantile regression with a piecewise linear embedding scheme - allowing us to construct causality networks that identify the complex tail interactions inherent to financial markets. Applying this method to 260 cryptocurrency return series, we uncover significant tail-tail causal effects and substantial causal asymmetry. We identify a propensity for coins to be self-influencing, with comparatively sparse cross variable effects. Assessing all link types in conjunction, Bitcoin stands out as the primary influencer - a nuance that is missed in conventional linear mean-response analyses. Our findings introduce a comprehensive framework for modelling distributional causality, paving the way towards more holistic representations of causality in financial markets."}, "https://arxiv.org/abs/2408.12288": {"title": "Demystifying Functional Random Forests: Novel Explainability Tools for Model Transparency in High-Dimensional Spaces", "link": "https://arxiv.org/abs/2408.12288", "description": "arXiv:2408.12288v1 Announce Type: cross \nAbstract: The advent of big data has raised significant challenges in analysing high-dimensional datasets across various domains such as medicine, ecology, and economics. Functional Data Analysis (FDA) has proven to be a robust framework for addressing these challenges, enabling the transformation of high-dimensional data into functional forms that capture intricate temporal and spatial patterns. However, despite advancements in functional classification methods and very high performance demonstrated by combining FDA and ensemble methods, a critical gap persists in the literature concerning the transparency and interpretability of black-box models, e.g. Functional Random Forests (FRF). In response to this need, this paper introduces a novel suite of explainability tools to illuminate the inner mechanisms of FRF. We propose using Functional Partial Dependence Plots (FPDPs), Functional Principal Component (FPC) Probability Heatmaps, various model-specific and model-agnostic FPCs' importance metrics, and the FPC Internal-External Importance and Explained Variance Bubble Plot. These tools collectively enhance the transparency of FRF models by providing a detailed analysis of how individual FPCs contribute to model predictions. 
By applying these methods to an ECG dataset, we demonstrate the effectiveness of these tools in revealing critical patterns and improving the explainability of FRF."}, "https://arxiv.org/abs/2408.12296": {"title": "Multiple testing for signal-agnostic searches of new physics with machine learning", "link": "https://arxiv.org/abs/2408.12296", "description": "arXiv:2408.12296v1 Announce Type: cross \nAbstract: In this work, we address the question of how to enhance signal-agnostic searches by leveraging multiple testing strategies. Specifically, we consider hypothesis tests relying on machine learning, where model selection can introduce a bias towards specific families of new physics signals. We show that it is beneficial to combine different tests, characterised by distinct choices of hyperparameters, and that performances comparable to the best available test are generally achieved while providing a more uniform response to various types of anomalies. Focusing on the New Physics Learning Machine, a methodology to perform a signal-agnostic likelihood-ratio test, we explore a number of approaches to multiple testing, such as combining p-values and aggregating test statistics."}, "https://arxiv.org/abs/2408.12346": {"title": "A logical framework for data-driven reasoning", "link": "https://arxiv.org/abs/2408.12346", "description": "arXiv:2408.12346v1 Announce Type: cross \nAbstract: We introduce and investigate a family of consequence relations with the goal of capturing certain important patterns of data-driven inference. The inspiring idea for our framework is the fact that data may reject, possibly to some degree, and possibly by mistake, any given scientific hypothesis. There is no general agreement in science about how to do this, which motivates putting forward a logical formulation of the problem. We do so by investigating distinct definitions of \"rejection degrees\" each yielding a consequence relation. Our investigation leads to novel variations on the theme of rational consequence relations, prominent among non-monotonic logics."}, "https://arxiv.org/abs/2408.12564": {"title": "Factor Adjusted Spectral Clustering for Mixture Models", "link": "https://arxiv.org/abs/2408.12564", "description": "arXiv:2408.12564v1 Announce Type: cross \nAbstract: This paper studies a factor modeling-based approach for clustering high-dimensional data generated from a mixture of strongly correlated variables. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, wireless sensing, etc., with factor modeling being one of the popular techniques for explaining the common dependence. Standard techniques for clustering high-dimensional data, e.g., naive spectral clustering, often fail to yield insightful results as their performances heavily depend on the mixture components having a weakly correlated structure. To address the clustering problem in the presence of a latent factor model, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which uses an additional data denoising step via eliminating the factor component to cope with the data dependency. We prove this method achieves an exponentially low mislabeling rate, with respect to the signal to noise ratio under a general set of assumptions. Our assumption bridges many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. 
The FASC algorithm is also computationally efficient, requiring only near-linear sample complexity with respect to the data dimension. We also show the applicability of the FASC algorithm with real data experiments and numerical studies, and establish that FASC provides significant results in many cases where traditional spectral clustering fails."}, "https://arxiv.org/abs/2105.08868": {"title": "Markov-Restricted Analysis of Randomized Trials with Non-Monotone Missing Binary Outcomes", "link": "https://arxiv.org/abs/2105.08868", "description": "arXiv:2105.08868v2 Announce Type: replace \nAbstract: Scharfstein et al. (2021) developed a sensitivity analysis model for analyzing randomized trials with repeatedly measured binary outcomes that are subject to nonmonotone missingness. Their approach becomes computationally intractable when the number of measurements is large (e.g., greater than 15). In this paper, we address this problem by introducing mth-order Markovian restrictions. We establish identification results for the joint distribution of the binary outcomes by representing the model as a directed acyclic graph (DAG). We develop a novel estimation strategy for a smooth functional of the joint distribution. We illustrate our methodology in the context of a randomized trial designed to evaluate a web-delivered psychosocial intervention to reduce substance use, assessed by evaluating abstinence twice weekly for 12 weeks, among patients entering outpatient addiction treatment."}, "https://arxiv.org/abs/2209.07111": {"title": "$\\rho$-GNF: A Copula-based Sensitivity Analysis to Unobserved Confounding Using Normalizing Flows", "link": "https://arxiv.org/abs/2209.07111", "description": "arXiv:2209.07111v2 Announce Type: replace \nAbstract: We propose a novel sensitivity analysis to unobserved confounding in observational studies using copulas and normalizing flows. Using the idea of interventional equivalence of structural causal models, we develop $\\rho$-GNF ($\\rho$-graphical normalizing flow), where $\\rho{\\in}[-1,+1]$ is a bounded sensitivity parameter. This parameter represents the back-door non-causal association due to unobserved confounding, which is encoded with a Gaussian copula. In other words, the $\\rho$-GNF enables scholars to estimate the average causal effect (ACE) as a function of $\\rho$, while accounting for various assumed strengths of the unobserved confounding. The output of the $\\rho$-GNF is what we denote as the $\\rho_{curve}$ that provides the bounds for the ACE given an interval of assumed $\\rho$ values. In particular, the $\\rho_{curve}$ enables scholars to identify the confounding strength required to nullify the ACE, similar to other sensitivity analysis methods (e.g., the E-value). Leveraging experiments on simulated and real-world data, we show the benefits of $\\rho$-GNF. One benefit is that the $\\rho$-GNF uses a Gaussian copula, which is commonly used in many applied settings, to encode the distribution of the unobserved causes. This distributional assumption produces narrower ACE bounds compared to other popular sensitivity analysis methods."}, "https://arxiv.org/abs/2309.13159": {"title": "Estimating a k-modal nonparametric mixed logit model with market-level data", "link": "https://arxiv.org/abs/2309.13159", "description": "arXiv:2309.13159v2 Announce Type: replace \nAbstract: We propose a group-level agent-based mixed (GLAM) logit model that is estimated using market-level choice share data.
The model non-parametrically represents taste heterogeneity through market-specific parameters by solving a multiagent inverse utility maximization problem, addressing the limitations of existing market-level choice models with parametric taste heterogeneity. A case study of mode choice in New York State is conducted using synthetic population data of 53.55 million trips made by 19.53 million residents in 2019. These trips are aggregated based on population segments and census block group-level origin-destination (OD) pairs, resulting in 120,740 markets/agents. We benchmark in-sample and out-of-sample predictive performance of the GLAM logit model against multinomial logit, nested logit, inverse product differentiation logit, and random coefficient logit (RCL) models. The results show that GLAM logit outperforms benchmark models, improving the overall in-sample predictive accuracy from 78.7% to 96.71% and out-of-sample accuracy from 65.30% to 81.78%. The price elasticities and diversion ratios retrieved from GLAM logit and benchmark models exhibit similar substitution patterns among the six travel modes. GLAM logit is scalable and computationally efficient, taking less than one-tenth of the time taken to estimate the RCL model. The agent-specific parameters in GLAM logit provide additional insights such as value-of-time (VOT) across segments and regions, which have been further utilized to analyze NYS travelers' mode choice response to congestion pricing. The agent-specific parameters in GLAM logit can also be seamlessly integrated into supply-side optimization models for revenue management and system design."}, "https://arxiv.org/abs/2401.08175": {"title": "Bayesian Function-on-Function Regression for Spatial Functional Data", "link": "https://arxiv.org/abs/2401.08175", "description": "arXiv:2401.08175v2 Announce Type: replace \nAbstract: Spatial functional data arise in many settings, such as particulate matter curves observed at monitoring stations and age population curves at each areal unit. Most existing functional regression models have limited applicability because they do not consider spatial correlations. Although functional kriging methods can predict the curves at unobserved spatial locations, they are based on variogram fittings rather than constructing hierarchical statistical models. In this manuscript, we propose a Bayesian framework for spatial function-on-function regression that can carry out parameter estimation and prediction. However, the proposed model has computational and inferential challenges because the model needs to account for within- and between-curve dependencies. Furthermore, high-dimensional and spatially correlated parameters can lead to the slow mixing of Markov chain Monte Carlo algorithms. To address these issues, we first utilize a basis transformation approach to simplify the covariance and apply projection methods for dimension reduction. We also develop a simultaneous band score for the proposed model to detect the significant region in the regression function.
We apply our method to both areal and point-level spatial functional data, showing the proposed method is computationally efficient and provides accurate estimations and predictions."}, "https://arxiv.org/abs/2204.06544": {"title": "Features of the Earth's seasonal hydroclimate: Characterizations and comparisons across the Koppen-Geiger climates and across continents", "link": "https://arxiv.org/abs/2204.06544", "description": "arXiv:2204.06544v2 Announce Type: replace-cross \nAbstract: Detailed investigations of time series features across climates, continents and variable types can progress our understanding and modelling ability of the Earth's hydroclimate and its dynamics. They can also improve our comprehension of the climate classification systems appearing in their core. Still, such investigations for seasonal hydroclimatic temporal dependence, variability and change are currently missing from the literature. Herein, we propose and apply at the global scale a methodological framework for filling this specific gap. We analyse over 13 000 earth-observed quarterly temperature, precipitation and river flow time series. We adopt the Koppen-Geiger climate classification system and define continental-scale geographical regions for conducting upon them seasonal hydroclimatic feature summaries. The analyses rely on three sample autocorrelation features, a temporal variation feature, a spectral entropy feature, a Hurst feature, a trend strength feature and a seasonality strength feature. We find notable differences to characterize the magnitudes of these features across the various Koppen-Geiger climate classes, as well as between continental-scale geographical regions. We, therefore, deem that the consideration of the comparative summaries could be beneficial in water resources engineering contexts. Lastly, we apply explainable machine learning to compare the investigated features with respect to how informative they are in distinguishing either the main Koppen-Geiger climates or the continental-scale regions. In this regard, the sample autocorrelation, temporal variation and seasonality strength features are found to be more informative than the spectral entropy, Hurst and trend strength features at the seasonal time scale."}, "https://arxiv.org/abs/2206.06885": {"title": "Neural interval-censored survival regression with feature selection", "link": "https://arxiv.org/abs/2206.06885", "description": "arXiv:2206.06885v3 Announce Type: replace-cross \nAbstract: Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances on sparse neural network architectures, ii) a regression model targeting prediction of the interval-censored response. 
To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring non-linear relationships."}, "https://arxiv.org/abs/2304.06353": {"title": "Bayesian mixture models for phylogenetic source attribution from consensus sequences and time since infection estimates", "link": "https://arxiv.org/abs/2304.06353", "description": "arXiv:2304.06353v2 Announce Type: replace-cross \nAbstract: In stopping the spread of infectious diseases, pathogen genomic data can be used to reconstruct transmission events and characterize population-level sources of infection. Most approaches for identifying transmission pairs do not account for the time passing since divergence of pathogen variants in individuals, which is problematic in viruses with high within-host evolutionary rates. This prompted us to consider possible transmission pairs in terms of phylogenetic data and additional estimates of time since infection derived from clinical biomarkers. We develop Bayesian mixture models with an evolutionary clock as signal component and additional mixed effects or covariate random functions describing the mixing weights to classify potential pairs into likely and unlikely transmission pairs. We demonstrate that although sources cannot be identified at the individual level with certainty, even with the additional data on time elapsed, inferences into the population-level sources of transmission are possible, and more accurate than using only phylogenetic data without time since infection estimates. We apply the approach to estimate age-specific sources of HIV infection in Amsterdam MSM transmission networks between 2010-2021. This study demonstrates that infection time estimates provide informative data to characterize transmission sources, and shows how phylogenetic source attribution can then be done with multi-dimensional mixture models."}, "https://arxiv.org/abs/2408.12863": {"title": "Machine Learning and the Yield Curve: Tree-Based Macroeconomic Regime Switching", "link": "https://arxiv.org/abs/2408.12863", "description": "arXiv:2408.12863v1 Announce Type: new \nAbstract: We explore tree-based macroeconomic regime-switching in the context of the dynamic Nelson-Siegel (DNS) yield-curve model. In particular, we customize the tree-growing algorithm to partition macroeconomic variables based on the DNS model's marginal likelihood, thereby identifying regime-shifting patterns in the yield curve. Compared to traditional Markov-switching models, our model offers clear economic interpretation via macroeconomic linkages and ensures computational simplicity. In an empirical application to U.S. 
Treasury bond yields, we find (1) important yield curve regime switching, and (2) evidence that macroeconomic variables have predictive power for the yield curve when the short rate is high, but not in other regimes, thereby refining the notion of yield curve ``macro-spanning\"."}, "https://arxiv.org/abs/2408.13000": {"title": "Air-HOLP: Adaptive Regularized Feature Screening for High Dimensional Data", "link": "https://arxiv.org/abs/2408.13000", "description": "arXiv:2408.13000v1 Announce Type: new \nAbstract: Handling high-dimensional datasets presents substantial computational challenges, particularly when the number of features far exceeds the number of observations and when features are highly correlated. A modern approach to mitigate these issues is feature screening. In this work, the High-dimensional Ordinary Least-squares Projection (HOLP) feature screening method is advanced by employing adaptive ridge regularization. The impact of the ridge penalty on the Ridge-HOLP method is examined and Air-HOLP is proposed, a data-adaptive advance to Ridge-HOLP where the ridge-regularization parameter is selected iteratively and optimally for better feature screening performance. The proposed method addresses the challenges of penalty selection in high dimensions by offering a computationally efficient and stable alternative to traditional methods like bootstrapping and cross-validation. Air-HOLP is evaluated using simulated data and a prostate cancer genetic dataset. The empirical results demonstrate that Air-HOLP has improved performance over a large range of simulation settings. We provide R codes implementing the Air-HOLP feature screening method and integrating it into existing feature screening methods that utilize the HOLP formula."}, "https://arxiv.org/abs/2408.13047": {"title": "Difference-in-differences with as few as two cross-sectional units -- A new perspective to the democracy-growth debate", "link": "https://arxiv.org/abs/2408.13047", "description": "arXiv:2408.13047v1 Announce Type: new \nAbstract: Pooled panel analyses tend to mask heterogeneity in unit-specific treatment effects. For example, existing studies on the impact of democracy on economic growth do not reach a consensus as empirical findings are substantially heterogeneous in the country composition of the panel. In contrast to pooled panel analyses, this paper proposes a Difference-in-Differences (DiD) estimator that exploits the temporal dimension in the data and estimates unit-specific average treatment effects on the treated (ATT) with as few as two cross-sectional units. Under weak identification and temporal dependence conditions, the DiD estimator is asymptotically normal. The estimator is further complemented with a test of identification granted at least two candidate control units. Empirical results using the DiD estimator suggest Benin's economy would have been 6.3% smaller on average over the 1993-2018 period had she not democratised."}, "https://arxiv.org/abs/2408.13143": {"title": "A Restricted Latent Class Model with Polytomous Ordinal Correlated Attributes and Respondent-Level Covariates", "link": "https://arxiv.org/abs/2408.13143", "description": "arXiv:2408.13143v1 Announce Type: new \nAbstract: We present an exploratory restricted latent class model where response data is for a single time point, polytomous, and differing across items, and where latent classes reflect a multi-attribute state where each attribute is ordinal. 
Our model extends previous work to allow for correlation of the attributes through a multivariate probit and to allow for respondent-specific covariates. We demonstrate that the model recovers parameters well in a variety of realistic scenarios, and apply the model to the analysis of a particular dataset designed to diagnose depression. The application demonstrates the utility of the model in identifying the latent structure of depression beyond single-factor approaches which have been used in the past."}, "https://arxiv.org/abs/2408.13162": {"title": "A latent space model for multivariate count data time series analysis", "link": "https://arxiv.org/abs/2408.13162", "description": "arXiv:2408.13162v1 Announce Type: new \nAbstract: Motivated by a dataset of burglaries in Chicago, USA, we introduce a novel framework to analyze time series of count data combining common multivariate time series models with latent position network models. This novel methodology allows us to gain a new latent variable perspective on the crime dataset that we consider, enabling us to disentangle and explain the complex patterns exhibited by the data, while providing a natural time series framework that can be used to make future predictions. Our model is underpinned by two well-known statistical approaches: a log-linear vector autoregressive model, which is prominent in the literature on multivariate count time series, and a latent projection model, which is a popular latent variable model for networks. The role of the projection model is to characterize the interaction parameters of the vector autoregressive model, thus uncovering the underlying network that is associated with the pairwise relationships between the time series. Estimation and inferential procedures are performed using an optimization algorithm and a Hamiltonian Monte Carlo procedure for efficient Bayesian inference. We also include a simulation study to illustrate the merits of our methodology in recovering consistent parameter estimates, and in making accurate future predictions for the time series. As we demonstrate in our application to the crime dataset, this new methodology can provide very meaningful model-based interpretations of the data, and it can be generalized to other time series contexts and applications."}, "https://arxiv.org/abs/2408.13022": {"title": "Estimation of ratios of normalizing constants using stochastic approximation : the SARIS algorithm", "link": "https://arxiv.org/abs/2408.13022", "description": "arXiv:2408.13022v1 Announce Type: cross \nAbstract: Computing ratios of normalizing constants plays an important role in statistical modeling. Two important examples are hypothesis testing in latent variable models, and model comparison in Bayesian statistics. In both examples, the likelihood ratio and the Bayes factor are defined as the ratio of the normalizing constants of posterior distributions. We propose in this article a novel methodology that estimates this ratio using the stochastic approximation principle. Our estimator is consistent and asymptotically Gaussian. Its asymptotic variance is smaller than that of the popular optimal bridge sampling estimator. Furthermore, it is much more robust to little overlap between the two unnormalized distributions considered. Thanks to its online definition, our procedure can be integrated into an estimation process in latent variable models, thereby reducing the computational effort.
The performance of the estimator is illustrated through a simulation study and compared to two other estimators: the ratio importance sampling and the optimal bridge sampling estimators."}, "https://arxiv.org/abs/2408.13176": {"title": "Non-parametric estimators of scaled cash flows", "link": "https://arxiv.org/abs/2408.13176", "description": "arXiv:2408.13176v1 Announce Type: cross \nAbstract: In multi-state life insurance, incidental policyholder behavior gives rise to expected cash flows that are not easily targeted by classic non-parametric estimators if data is subject to sampling effects. We introduce a scaled version of the classic Aalen--Johansen estimator that overcomes this challenge. Strong uniform consistency and asymptotic normality are established under entirely random right-censoring, subject to lax moment conditions on the multivariate counting process. In a simulation study, the estimator outperforms earlier proposals from the literature. Finally, we showcase the potential of the presented method in other areas of actuarial science."}, "https://arxiv.org/abs/2408.13179": {"title": "Augmented Functional Random Forests: Classifier Construction and Unbiased Functional Principal Components Importance through Ad-Hoc Conditional Permutations", "link": "https://arxiv.org/abs/2408.13179", "description": "arXiv:2408.13179v1 Announce Type: cross \nAbstract: This paper introduces a novel supervised classification strategy that integrates functional data analysis (FDA) with tree-based methods, addressing the challenges of high-dimensional data and enhancing the classification performance of existing functional classifiers. Specifically, we propose augmented versions of functional classification trees and functional random forests, incorporating a new tool for assessing the importance of functional principal components. This tool provides an ad-hoc method for determining unbiased permutation feature importance in functional data, particularly when dealing with correlated features derived from successive derivatives. Our study demonstrates that these additional features can significantly enhance the predictive power of functional classifiers. Experimental evaluations on both real-world and simulated datasets showcase the effectiveness of the proposed methodology, yielding promising results compared to existing methods."}, "https://arxiv.org/abs/1911.04696": {"title": "Extended MinP Tests for Global and Multiple testing", "link": "https://arxiv.org/abs/1911.04696", "description": "arXiv:1911.04696v2 Announce Type: replace \nAbstract: Empirical economic studies often involve multiple propositions or hypotheses, with researchers aiming to assess both the collective and individual evidence against these propositions or hypotheses. To rigorously assess this evidence, practitioners frequently employ tests with quadratic test statistics, such as $F$-tests and Wald tests, or tests based on minimum/maximum type test statistics. This paper introduces a combination test that merges these two classes of tests using the minimum $p$-value principle.
The proposed test capitalizes on the global power advantages of both constituent tests while retaining the benefits of the stepdown procedure from minimum/maximum type tests."}, "https://arxiv.org/abs/2308.02974": {"title": "Combining observational and experimental data for causal inference considering data privacy", "link": "https://arxiv.org/abs/2308.02974", "description": "arXiv:2308.02974v2 Announce Type: replace \nAbstract: Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational data sets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this paper, we explore disclosure limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve generalizability of treatment effect estimates when a randomized experiment (RCT) is not representative of the population of interest, and to increase precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT."}, "https://arxiv.org/abs/2408.13393": {"title": "WASP: Voting-based ex Ante method for Selecting joint Prediction strategy", "link": "https://arxiv.org/abs/2408.13393", "description": "arXiv:2408.13393v1 Announce Type: new \nAbstract: This paper addresses the topic of choosing a prediction strategy when using parametric or nonparametric regression models. It emphasizes the importance of ex ante prediction accuracy, ensemble approaches, and forecasting not only the values of the dependent variable but also a function of these values, such as total income or median loss. It proposes a method for selecting a strategy for predicting the vector of functions of the dependent variable using various ex ante accuracy measures. The final decision is made through voting, where the candidates are prediction strategies and the voters are diverse prediction models with their respective prediction errors. Because the method is based on a Monte Carlo simulation, it allows for new scenarios, not previously observed, to be considered. The first part of the article provides a detailed theoretical description of the proposed method, while the second part presents its practical use in managing a portfolio of communication insurance. The example uses data from the Polish insurance market. All calculations are performed using the R programme."}, "https://arxiv.org/abs/2408.13409": {"title": "Leveraging external data in the analysis of randomized controlled trials: a comparative analysis", "link": "https://arxiv.org/abs/2408.13409", "description": "arXiv:2408.13409v1 Announce Type: new \nAbstract: The use of patient-level information from previous studies, registries, and other external datasets can support the analysis of single-arm and randomized clinical trials to evaluate and test experimental treatments. 
However, the integration of external data in the analysis of clinical trials can also compromise the scientific validity of the results due to selection bias, study-to-study differences, unmeasured confounding, and other distortion mechanisms. Therefore, leveraging external data in the analysis of a clinical trial requires the use of appropriate methods that can detect, prevent or mitigate the risks of bias and potential distortion mechanisms. We review several methods that have been previously proposed to leverage external datasets, such as matching procedures or random effects modeling. Different methods present distinct trade-offs between risks and efficiencies. We conduct a comparative analysis of statistical methods to leverage external data and analyze randomized clinical trials. Multiple operating characteristics are discussed, such as the control of false positive results, power, and the bias of the treatment effect estimates, across candidate statistical methods. We compare the statistical methods through a broad set of simulation scenarios. We then compare the methods using a collection of datasets with individual patient-level information from several glioblastoma studies in order to provide recommendations for future glioblastoma trials."}, "https://arxiv.org/abs/2408.13411": {"title": "Estimating the Effective Sample Size for an inverse problem in subsurface flows", "link": "https://arxiv.org/abs/2408.13411", "description": "arXiv:2408.13411v1 Announce Type: new \nAbstract: The Effective Sample Size (ESS) and Integrated Autocorrelation Time (IACT) are two popular criteria for comparing Markov Chain Monte Carlo (MCMC) algorithms and detecting their convergence. Our goal is to assess those two quantities in the context of an inverse problem in subsurface flows. We begin by presenting a review of some popular methods for their estimation, and then simulate their sample distributions on AR(1) sequences for which the exact values were known. We find that those ESS estimators may not be statistically consistent, because their variance grows linearly in the number of sample values of the MCMC. Next, we analyze the output of two distinct MCMC algorithms for the Bayesian approach to the simulation of an elliptic inverse problem. Here, the estimators cannot even agree about the order of magnitude of the ESS. Our conclusion is that the ESS has major limitations and should not be used on MCMC outputs of complex models."}, "https://arxiv.org/abs/2408.13414": {"title": "Epistemically robust selection of fitted models", "link": "https://arxiv.org/abs/2408.13414", "description": "arXiv:2408.13414v1 Announce Type: new \nAbstract: Fitting models to data is an important part of the practice of science, made almost ubiquitous by advances in machine learning. Very often however, fitted solutions are not unique, but form an ensemble of candidate models -- qualitatively different, yet with comparable quantitative performance. One then needs a criterion which can select the best candidate models, or at least falsify (reject) the worst ones. Because standard statistical approaches to model selection rely on assumptions which are usually invalid in scientific contexts, they tend to be overconfident, rejecting models based on little more than statistical noise. The ideal objective for fitting models is generally considered to be the risk: this is the theoretical average loss of a model (assuming unlimited data). 
In this work we develop a nonparametric method for estimating, for each candidate model, the epistemic uncertainty on its risk: in other words we associate to each model a distribution of scores which accounts for expected modelling errors. We then propose that a model falsification criterion should mirror established experimental practice: a falsification result should be accepted only if it is reproducible across experimental variations. The strength of this approach is illustrated using examples from physics and neuroscience."}, "https://arxiv.org/abs/2408.13437": {"title": "Cross-sectional Dependence in Idiosyncratic Volatility", "link": "https://arxiv.org/abs/2408.13437", "description": "arXiv:2408.13437v1 Announce Type: new \nAbstract: This paper introduces an econometric framework for analyzing cross-sectional dependence in the idiosyncratic volatilities of assets using high frequency data. We first consider the estimation of standard measures of dependence in the idiosyncratic volatilities such as covariances and correlations. Naive estimators of these measures are biased due to the use of the error-laden estimates of idiosyncratic volatilities. We provide bias-corrected estimators and the relevant asymptotic theory. Next, we introduce an idiosyncratic volatility factor model, in which we decompose the variation in idiosyncratic volatilities into two parts: the variation related to the systematic factors such as the market volatility, and the residual variation. Again, naive estimators of the decomposition are biased, and we provide bias-corrected estimators. We also provide the asymptotic theory that allows us to test whether the residual (non-systematic) components of the idiosyncratic volatilities exhibit cross-sectional dependence. We apply our methodology to the S&P 100 index constituents, and document strong cross-sectional dependence in their idiosyncratic volatilities. We consider two different sets of idiosyncratic volatility factors, and find that neither can fully account for the cross-sectional dependence in idiosyncratic volatilities. For each model, we map out the network of dependencies in residual (non-systematic) idiosyncratic volatilities across all stocks."}, "https://arxiv.org/abs/2408.13453": {"title": "Unifying design-based and model-based sampling theory -- some suggestions to clear the cobwebs", "link": "https://arxiv.org/abs/2408.13453", "description": "arXiv:2408.13453v1 Announce Type: new \nAbstract: This paper gives a holistic overview of both the design-based and model-based paradigms for sampling theory. Both methods are presented within a unified framework with a simple consistent notation, and the differences in the two paradigms are explained within this common framework. We examine the different definitions of the \"population variance\" within the two paradigms and examine the use of Bessel's correction for a population variance. We critique some messy aspects of the presentation of the design-based paradigm and implore readers to avoid the standard presentation of this framework in favour of a more explicit presentation that includes explicit conditioning in probability statements. 
We also discuss a number of confusions that arise from the standard presentation of the design-based paradigm and argue that Bessel's correction should be applied to the population variance."}, "https://arxiv.org/abs/2408.13474": {"title": "Ridge, lasso, and elastic-net estimations of the modified Poisson and least-squares regressions for binary outcome data", "link": "https://arxiv.org/abs/2408.13474", "description": "arXiv:2408.13474v1 Announce Type: new \nAbstract: Logistic regression is a standard method in multivariate analysis for binary outcome data in epidemiological and clinical studies; however, the resultant odds-ratio estimates fail to provide directly interpretable effect measures. The modified Poisson and least-squares regressions are alternative standard methods that can provide risk-ratio and risk difference estimates without computational problems. However, the bias and invalid inference problems of these regression analyses under small or sparse data conditions (i.e., the \"separation\" problem) have been insufficiently investigated. We show that the separation problem can adversely affect the inferences of the modified Poisson and least-squares regressions, and to address these issues, we apply the ridge, lasso, and elastic-net estimating approaches to the two regression methods. As the methods are not founded on the maximum likelihood principle, we propose regularized quasi-likelihood approaches based on the estimating equations for these generalized linear models. The methods provide stable shrinkage estimates of risk ratios and risk differences even under separation conditions, and the lasso and elastic-net approaches enable simultaneous variable selection. We provide a bootstrap method to calculate the confidence intervals on the basis of the regularized quasi-likelihood estimation. The proposed methods are applied to a hematopoietic stem cell transplantation cohort study and the National Child Development Survey. We also provide an R package, regconfint, to implement these methods with simple commands."}, "https://arxiv.org/abs/2408.13514": {"title": "Cross Sectional Regression with Cluster Dependence: Inference based on Averaging", "link": "https://arxiv.org/abs/2408.13514", "description": "arXiv:2408.13514v1 Announce Type: new \nAbstract: We re-investigate the asymptotic properties of the traditional OLS (pooled) estimator, $\\hat{\\beta}_P$, with cluster dependence. The present study considers various scenarios under different restrictions on the cluster sizes and number of clusters. It is shown that $\\hat{\\beta}_P$ could be inconsistent in many realistic situations. We propose a simple estimator, $\\hat{\\beta}_A$, based on data averaging. The asymptotic properties of $\\hat{\\beta}_A$ are studied. It is shown that $\\hat{\\beta}_A$ is consistent even when $\\hat{\\beta}_P$ is inconsistent. It is further shown that the proposed estimator $\\hat{\\beta}_A$ could be more efficient as compared to $\\hat{\\beta}_P$ in many practical scenarios. As a consequence of averaging, we show that $\\hat{\\beta}_A$ retains consistency and asymptotic normality under the classical measurement error problem, circumventing the use of IV. A detailed simulation study shows the efficacy of $\\hat{\\beta}_A$.
It is also seen that $\\hat{\\beta}_A$ yields better goodness of fit."}, "https://arxiv.org/abs/2408.13555": {"title": "Local statistical moments to capture Kramers-Moyal coefficients", "link": "https://arxiv.org/abs/2408.13555", "description": "arXiv:2408.13555v1 Announce Type: new \nAbstract: This study introduces an innovative local statistical moment approach for estimating Kramers-Moyal coefficients, effectively bridging the gap between nonparametric and parametric methodologies. These coefficients play a crucial role in characterizing stochastic processes. Our proposed approach provides a versatile framework for localized coefficient estimation, combining the flexibility of nonparametric methods with the interpretability of global parametric approaches. We showcase the efficacy of our approach through use cases involving both stationary and non-stationary time series analysis. Additionally, we demonstrate its applicability to real-world complex systems, specifically in the energy conversion process analysis of a wind turbine."}, "https://arxiv.org/abs/2408.13596": {"title": "Robust Principal Components by Casewise and Cellwise Weighting", "link": "https://arxiv.org/abs/2408.13596", "description": "arXiv:2408.13596v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is a fundamental tool for analyzing multivariate data. Here the focus is on dimension reduction to the principal subspace, characterized by its projection matrix. The classical principal subspace can be strongly affected by the presence of outliers. Traditional robust approaches consider casewise outliers, that is, cases generated by an unspecified outlier distribution that differs from that of the clean cases. But there may also be cellwise outliers, which are suspicious entries that can occur anywhere in the data matrix. Another common issue is that some cells may be missing. This paper proposes a new robust PCA method, called cellPCA, that can simultaneously deal with casewise outliers, cellwise outliers, and missing cells. Its single objective function combines two robust loss functions that together mitigate the effect of casewise and cellwise outliers. The objective function is minimized by an iteratively reweighted least squares (IRLS) algorithm. Residual cellmaps and enhanced outlier maps are proposed for outlier detection. The casewise and cellwise influence functions of the principal subspace are derived, and its asymptotic distribution is obtained. Extensive simulations and two real data examples illustrate the performance of cellPCA."}, "https://arxiv.org/abs/2408.13606": {"title": "Influence Networks: Bayesian Modeling and Diffusion", "link": "https://arxiv.org/abs/2408.13606", "description": "arXiv:2408.13606v1 Announce Type: new \nAbstract: In this article, we adapt a Bayesian latent space model based on projections in a novel way to analyze influence networks. By appropriately reparameterizing the model, we establish a formal metric for quantifying each individual's influencing capacity and estimating their latent position embedded in a social space. This modeling approach introduces a novel mechanism for fully characterizing the diffusion of an idea based on the estimated latent characteristics. It assumes that each individual takes one of the following states: unknown, undecided, supporting, or rejecting an idea. This approach is demonstrated using an influence network from Twitter (now $\\mathbb{X}$) related to the 2022 Tax Reform in Colombia.
An exhaustive simulation exercise is also performed to evaluate the proposed diffusion process."}, "https://arxiv.org/abs/2408.13649": {"title": "Tree-structured Markov random fields with Poisson marginal distributions", "link": "https://arxiv.org/abs/2408.13649", "description": "arXiv:2408.13649v1 Announce Type: new \nAbstract: A new family of tree-structured Markov random fields for a vector of discrete counting random variables is introduced. According to the characteristics of the family, the marginal distributions of the Markov random fields are all Poisson with the same mean, and are untied from the strength or structure of their built-in dependence. This key feature is uncommon for Markov random fields and most convenient for applications purposes. The specific properties of this new family confer a straightforward sampling procedure and analytic expressions for the joint probability mass function and the joint probability generating function of the vector of counting random variables, thus granting computational methods that scale well to vectors of high dimension. We study the distribution of the sum of random variables constituting a Markov random field from the proposed family, analyze a random variable's individual contribution to that sum through expected allocations, and establish stochastic orderings to assess a wide understanding of their behavior."}, "https://arxiv.org/abs/2408.13949": {"title": "Inference on Consensus Ranking of Distributions", "link": "https://arxiv.org/abs/2408.13949", "description": "arXiv:2408.13949v1 Announce Type: new \nAbstract: Instead of testing for unanimous agreement, I propose learning how broad of a consensus favors one distribution over another (of earnings, productivity, asset returns, test scores, etc.). Specifically, given a sample from each of two distributions, I propose statistical inference methods to learn about the set of utility functions for which the first distribution has higher expected utility than the second distribution. With high probability, an \"inner\" confidence set is contained within this true set, while an \"outer\" confidence set contains the true set. Such confidence sets can be formed by inverting a proposed multiple testing procedure that controls the familywise error rate. Theoretical justification comes from empirical process results, given that very large classes of utility functions are generally Donsker (subject to finite moments). The theory additionally justifies a uniform (over utility functions) confidence band of expected utility differences, as well as tests with a utility-based \"restricted stochastic dominance\" as either the null or alternative hypothesis. Simulated and empirical examples illustrate the methodology."}, "https://arxiv.org/abs/2408.13971": {"title": "Endogenous Treatment Models with Social Interactions: An Application to the Impact of Exercise on Self-Esteem", "link": "https://arxiv.org/abs/2408.13971", "description": "arXiv:2408.13971v1 Announce Type: new \nAbstract: We address the estimation of endogenous treatment models with social interactions in both the treatment and outcome equations. We model the interactions between individuals in an internally consistent manner via a game theoretic approach based on discrete Bayesian games. This introduces a substantial computational burden in estimation which we address through a sequential version of the nested fixed point algorithm. 
We also provide some relevant treatment effects, and procedures for their estimation, which capture the impact on both the individual and the total sample. Our empirical application examines the impact of an individual's exercise frequency on her level of self-esteem. We find that an individual's exercise frequency is influenced by her expectation of her friends'. We also find that an individual's level of self-esteem is affected by her level of exercise and, at relatively lower levels of self-esteem, by the expectation of her friends' self-esteem."}, "https://arxiv.org/abs/2408.14015": {"title": "Robust likelihood ratio tests for composite nulls and alternatives", "link": "https://arxiv.org/abs/2408.14015", "description": "arXiv:2408.14015v1 Announce Type: new \nAbstract: We propose an e-value based framework for testing composite nulls against composite alternatives when an $\\epsilon$ fraction of the data can be arbitrarily corrupted. Our tests are inherently sequential, being valid at arbitrary data-dependent stopping times, but they are new even for fixed sample sizes, giving type-I error control without any regularity conditions. We achieve this by modifying and extending a proposal by Huber (1965) in the point null versus point alternative case. Our test statistic is a nonnegative supermartingale under the null, even with a sequentially adaptive contamination model where the conditional distribution of each observation given the past data lies within an $\\epsilon$ (total variation) ball of the null. The test is powerful within an $\\epsilon$ ball of the alternative. As a consequence, one obtains anytime-valid p-values that enable continuous monitoring of the data, and adaptive stopping. We analyze the growth rate of our test supermartingale and demonstrate that as $\\epsilon\\to 0$, it approaches a certain Kullback-Leibler divergence between the null and alternative, which is the optimal non-robust growth rate. A key step is the derivation of a robust Reverse Information Projection (RIPr). Simulations validate the theory and demonstrate excellent practical performance."}, "https://arxiv.org/abs/2408.14036": {"title": "Robust subgroup-classifier learning and testing in change-plane regressions", "link": "https://arxiv.org/abs/2408.14036", "description": "arXiv:2408.14036v1 Announce Type: new \nAbstract: Considered here are robust subgroup-classifier learning and testing in change-plane regressions with heavy-tailed errors, which can identify subgroups as a basis for making optimal recommendations for individualized treatment. A new subgroup classifier is proposed by smoothing the indicator function, which is learned by minimizing the smoothed Huber loss. Nonasymptotic properties and the Bahadur representation of estimators are established, in which the proposed estimators of the grouping difference parameter and baseline parameter achieve sub-Gaussian tails. The hypothesis test considered here belongs to the class of test problems for which some parameters are not identifiable under the null hypothesis. The classic supremum of the squared score test statistic may lose power in practice when the dimension of the grouping parameter is large, so to overcome this drawback and make full use of the data's heavy-tailed error distribution, a robust weighted average of the squared score test statistic is proposed, which achieves a closed form when an appropriate weight is chosen. 
Asymptotic distributions of the proposed robust test statistic are derived under the null and alternative hypotheses. The proposed robust subgroup classifier and test statistic perform well on finite samples, and their performance is further demonstrated by applying them to a medical dataset. The proposed procedure can be applied directly to recommend optimal individualized treatments."}, "https://arxiv.org/abs/2408.14038": {"title": "Jackknife Empirical Likelihood Method for U Statistics Based on Multivariate Samples and its Applications", "link": "https://arxiv.org/abs/2408.14038", "description": "arXiv:2408.14038v1 Announce Type: new \nAbstract: Empirical likelihood (EL) and its extension via the jackknife empirical likelihood (JEL) method provide robust alternatives to parametric approaches in contexts with uncertain data distributions. This paper explores the theoretical foundations and practical applications of JEL in the context of multivariate sample-based U-statistics. In this study we develop the JEL method for multivariate U-statistics with three (or more) samples. This study enhances the JEL method's capability to handle complex data structures while preserving the computational efficiency of the empirical likelihood method. To demonstrate the applications of the JEL method, we compute confidence intervals for differences in VUS measurements, which have potential applications in classification problems. Monte Carlo simulation studies are conducted to evaluate the efficiency of the JEL, normal approximation, and kernel-based confidence intervals. These studies validate the superior performance of the JEL approach in terms of coverage probability and computational efficiency compared to the other two methods. Additionally, a real data application illustrates the practical utility of the approach. The JEL method developed here has potential applications in dealing with complex data structures."}, "https://arxiv.org/abs/2408.14214": {"title": "Modeling the Dynamics of Growth in Master-Planned Communities", "link": "https://arxiv.org/abs/2408.14214", "description": "arXiv:2408.14214v1 Announce Type: new \nAbstract: This paper describes how a time-varying Markov model was used to forecast housing development at a master-planned community during a transition from high to low growth. Our approach draws on detailed historical data to model the dynamics of the market participants, producing results that are entirely data-driven and free of bias. While traditional time series forecasting methods often struggle to account for nonlinear regime changes in growth, our approach successfully captures the onset of buildout as well as external economic shocks, such as the 1990 and 2008-2011 recessions and the 2021 post-pandemic boom.\n This research serves as a valuable tool for urban planners, homeowner associations, and property stakeholders aiming to navigate the complexities of growth at master-planned communities during periods of both system stability and instability."}, "https://arxiv.org/abs/2408.14402": {"title": "A quasi-Bayesian sequential approach to deconvolution density estimation", "link": "https://arxiv.org/abs/2408.14402", "description": "arXiv:2408.14402v1 Announce Type: new \nAbstract: Density deconvolution addresses the estimation of the unknown (probability) density function $f$ of a random signal from data that are observed with independent additive random noise.
This is a classical problem in statistics, for which frequentist and Bayesian nonparametric approaches are available to deal with static or batch data. In this paper, we consider the problem of density deconvolution in a streaming or online setting where noisy data arrive progressively, with no predetermined sample size, and we develop a sequential nonparametric approach to estimate $f$. By relying on a quasi-Bayesian sequential approach, often referred to as Newton's algorithm, we obtain estimates of $f$ that are of easy evaluation, computationally efficient, and with a computational cost that remains constant as the amount of data increases, which is critical in the streaming setting. Large sample asymptotic properties of the proposed estimates are studied, yielding provable guarantees with respect to the estimation of $f$ at a point (local) and on an interval (uniform). In particular, we establish local and uniform central limit theorems, providing corresponding asymptotic credible intervals and bands. We validate empirically our methods on synthetic and real data, by considering the common setting of Laplace and Gaussian noise distributions, and make a comparison with respect to the kernel-based approach and a Bayesian nonparametric approach with a Dirichlet process mixture prior."}, "https://arxiv.org/abs/2408.14410": {"title": "Generalized Bayesian nonparametric clustering framework for high-dimensional spatial omics data", "link": "https://arxiv.org/abs/2408.14410", "description": "arXiv:2408.14410v1 Announce Type: new \nAbstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has transformed genomic research by enabling high-throughput gene expression profiling while preserving spatial context. Identifying spatial domains within SRT data is a critical task, with numerous computational approaches currently available. However, most existing methods rely on a multi-stage process that involves ad-hoc dimension reduction techniques to manage the high dimensionality of SRT data. These low-dimensional embeddings are then subjected to model-based or distance-based clustering methods. Additionally, many approaches depend on arbitrarily specifying the number of clusters (i.e., spatial domains), which can result in information loss and suboptimal downstream analysis. To address these limitations, we propose a novel Bayesian nonparametric mixture of factor analysis (BNPMFA) model, which incorporates a Markov random field-constrained Gibbs-type prior for partitioning high-dimensional spatial omics data. This new prior effectively integrates the spatial constraints inherent in SRT data while simultaneously inferring cluster membership and determining the optimal number of spatial domains. We have established the theoretical identifiability of cluster membership within this framework. The efficacy of our proposed approach is demonstrated through realistic simulations and applications to two SRT datasets. 
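The quasi-Bayesian (Newton-type) recursion described in the deconvolution abstract admits a short grid-based sketch. Everything concrete below is an assumption of the illustration: a known Laplace noise scale, a bounded grid for the support of the signal density, and learning weights a_i = 1/(i+1). The point is only to show the constant per-observation cost of the recursive update, not to reproduce the paper's estimator or its credible bands.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stream: bimodal signal X observed with additive Laplace noise.
n = 2000
x = np.where(rng.random(n) < 0.5, rng.normal(-1.5, 0.5, n), rng.normal(1.5, 0.5, n))
y = x + rng.laplace(scale=0.4, size=n)

# Grid-based Newton-type recursion; grid, noise scale, and weights are assumed.
grid = np.linspace(-5, 5, 501)
dgrid = grid[1] - grid[0]
f = np.full(grid.size, 1.0 / (grid[-1] - grid[0]))   # uniform initial guess

def laplace_pdf(z, scale=0.4):
    return np.exp(-np.abs(z) / scale) / (2.0 * scale)

for i, yi in enumerate(y, start=1):
    lik = laplace_pdf(yi - grid)            # noise density at y_i - theta
    post = lik * f
    post /= post.sum() * dgrid              # normalized one-step "posterior"
    a = 1.0 / (i + 1)
    f = (1.0 - a) * f + a * post            # constant cost per observation

print("estimated mean of the signal:", round(float(np.sum(grid * f) * dgrid), 3))
```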
Our results show that the BNPMFA model not only surpasses state-of-the-art methods in clustering accuracy and estimating the number of clusters but also offers novel insights for identifying cellular regions within tissue samples."}, "https://arxiv.org/abs/2401.05218": {"title": "Invariant Causal Prediction with Locally Linear Models", "link": "https://arxiv.org/abs/2401.05218", "description": "arXiv:2401.05218v1 Announce Type: cross \nAbstract: We consider the task of identifying the causal parents of a target variable among a set of candidate variables from observational data. Our main assumption is that the candidate variables are observed in different environments which may, for example, correspond to different settings of a machine or different time intervals in a dynamical process. Under certain assumptions different environments can be regarded as interventions on the observed system. We assume a linear relationship between target and covariates, which can be different in each environment with the only restriction that the causal structure is invariant across environments. This is an extension of the ICP ($\\textbf{I}$nvariant $\\textbf{C}$ausal $\\textbf{P}$rediction) principle by Peters et al. [2016], who assumed a fixed linear relationship across all environments. Within our proposed setting we provide sufficient conditions for identifiability of the causal parents and introduce a practical method called LoLICaP ($\\textbf{Lo}$cally $\\textbf{L}$inear $\\textbf{I}$nvariant $\\textbf{Ca}$usal $\\textbf{P}$rediction), which is based on a hypothesis test for parent identification using a ratio of minimum and maximum statistics. We then show in a simplified setting that the statistical power of LoLICaP converges exponentially fast in the sample size, and finally we analyze the behavior of LoLICaP experimentally in more general settings."}, "https://arxiv.org/abs/2408.13448": {"title": "Efficient Reinforced DAG Learning without Acyclicity Constraints", "link": "https://arxiv.org/abs/2408.13448", "description": "arXiv:2408.13448v1 Announce Type: cross \nAbstract: Unraveling cause-effect structures embedded in mere observational data is of great scientific interest, owing to the wealth of knowledge that can benefit from such structures. Recently, reinforcement learning (RL) has emerged as an enhancement of classical techniques to search for the most probable causal explanation in the form of a directed acyclic graph (DAG). Yet, effectively exploring the DAG space is challenging due to the vast number of candidates and the intricate constraint of acyclicity. In this study, we present REACT (REinforced DAG learning without acyclicity ConstrainTs), a novel causal discovery approach fueled by the RL machinery with an efficient DAG generation policy. Through a novel parametrization of DAGs, which allows for directly mapping a real-valued vector to an adjacency matrix representing a valid DAG in a single step without enforcing any acyclicity constraint, we are able to navigate the search space much more effectively with policy gradient methods. In addition, our comprehensive numerical evaluations on a diverse set of both synthetic and real data confirm the effectiveness of our method compared with state-of-the-art baselines."}, "https://arxiv.org/abs/2408.13556": {"title": "What if? 
Causal Machine Learning in Supply Chain Risk Management", "link": "https://arxiv.org/abs/2408.13556", "description": "arXiv:2408.13556v1 Announce Type: cross \nAbstract: The ultimate goal for developing machine learning models in supply chain management is to make optimal interventions. However, most machine learning models identify correlations in data rather than inferring causation, making it difficult to systematically plan for better outcomes. In this article, we propose and evaluate the use of causal machine learning for developing supply chain risk intervention models, and demonstrate its use with a case study in supply chain risk management in the maritime engineering sector. Our findings highlight that causal machine learning enhances decision-making processes by identifying changes that can be achieved under different supply chain interventions, allowing \"what-if\" scenario planning. We therefore propose different machine learning development pathways for predicting risk and for planning interventions to minimise risk, and we outline key steps for supply chain researchers to explore causal machine learning."}, "https://arxiv.org/abs/2408.13642": {"title": "Change Point Detection in Pairwise Comparison Data with Covariates", "link": "https://arxiv.org/abs/2408.13642", "description": "arXiv:2408.13642v1 Announce Type: cross \nAbstract: This paper introduces the novel piecewise stationary covariate-assisted ranking estimation (PS-CARE) model for analyzing time-evolving pairwise comparison data, enhancing item ranking accuracy through the integration of covariate information. By partitioning the data into distinct, stationary segments, the PS-CARE model adeptly detects temporal shifts in item rankings, known as change points, whose number and positions are initially unknown. Leveraging the minimum description length (MDL) principle, this paper establishes a statistically consistent model selection criterion to estimate these unknowns. The practical optimization of this MDL criterion is done with the pruned exact linear time (PELT) algorithm. Empirical evaluations reveal the method's promising performance in accurately locating change points across various simulated scenarios. An application to an NBA dataset yielded meaningful insights that aligned with significant historical events, highlighting the method's practical utility and the MDL criterion's effectiveness in capturing temporal ranking changes. To the best of the authors' knowledge, this research pioneers change point detection in pairwise comparison data with covariate information, representing a significant leap forward in the field of dynamic ranking analysis."}, "https://arxiv.org/abs/2408.14012": {"title": "Bayesian Cointegrated Panels in Digital Marketing", "link": "https://arxiv.org/abs/2408.14012", "description": "arXiv:2408.14012v1 Announce Type: cross \nAbstract: In this paper, we fully develop and apply a novel extension of Bayesian cointegrated panels modeling in digital marketing, particularly in modeling a system where key ROI metrics, such as the clicks or impressions of a given digital campaign, are considered. Thus, in this context our goal is to evaluate how the system reacts to investment perturbations due to changes in the investment strategy and its impact on the visibility of specific campaigns. To do so, we fit the model using a set of real marketing data with different investment campaigns over the same geographic territory. 
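The PS-CARE abstract above optimizes an MDL criterion with the pruned exact linear time (PELT) algorithm. As a generic illustration of PELT-style change point detection, the sketch below segments a toy univariate signal with the ruptures package; the cost model, penalty value, and signal are assumptions, and the covariate-assisted ranking model itself is not implemented.

```python
import numpy as np
import ruptures as rpt  # pip install ruptures

rng = np.random.default_rng(2)

# Piecewise-stationary toy signal with change points at 100 and 250.
signal = np.concatenate([
    rng.normal(0.0, 1.0, 100),
    rng.normal(3.0, 1.0, 150),
    rng.normal(1.0, 1.0, 120),
])

# PELT searches over segmentations efficiently for a given penalty, which
# plays the role of the model-complexity term in criteria such as MDL.
algo = rpt.Pelt(model="l2", min_size=10).fit(signal)
breakpoints = algo.predict(pen=10)
print("estimated segment ends:", breakpoints)
```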
By employing forecast error variance decomposition, our findings indicate that clicks and impressions have a significant impact on session generation. Also, we evaluate our approach through a comprehensive simulation study that considers different processes. The results indicate that our proposal has substantial capabilities in terms of estimability and accuracy."}, "https://arxiv.org/abs/2408.14073": {"title": "Score-based change point detection via tracking the best of infinitely many experts", "link": "https://arxiv.org/abs/2408.14073", "description": "arXiv:2408.14073v1 Announce Type: cross \nAbstract: We suggest a novel algorithm for online change point detection based on sequential score function estimation and tracking the best expert approach. The core of the procedure is a version of the fixed share forecaster for the case of infinite number of experts and quadratic loss functions. The algorithm shows a promising performance in numerical experiments on artificial and real-world data sets. We also derive new upper bounds on the dynamic regret of the fixed share forecaster with varying parameter, which are of independent interest."}, "https://arxiv.org/abs/2408.14408": {"title": "Consistent diffusion matrix estimation from population time series", "link": "https://arxiv.org/abs/2408.14408", "description": "arXiv:2408.14408v1 Announce Type: cross \nAbstract: Progress on modern scientific questions regularly depends on using large-scale datasets to understand complex dynamical systems. An especially challenging case that has grown to prominence with advances in single-cell sequencing technologies is learning the behavior of individuals from population snapshots. In the absence of individual-level time series, standard stochastic differential equation models are often nonidentifiable because intrinsic diffusion cannot be distinguished from measurement noise. Despite the difficulty, accurately recovering diffusion terms is required to answer even basic questions about the system's behavior. We show how to combine population-level time series with velocity measurements to build a provably consistent estimator of the diffusion matrix."}, "https://arxiv.org/abs/2204.05793": {"title": "Coarse Personalization", "link": "https://arxiv.org/abs/2204.05793", "description": "arXiv:2204.05793v3 Announce Type: replace \nAbstract: Advances in estimating heterogeneous treatment effects enable firms to personalize marketing mix elements and target individuals at an unmatched level of granularity, but feasibility constraints limit such personalization. In practice, firms choose which unique treatments to offer and which individuals to offer these treatments with the goal of maximizing profits: we call this the coarse personalization problem. We propose a two-step solution that makes segmentation and targeting decisions in concert. First, the firm personalizes by estimating conditional average treatment effects. Second, the firm discretizes by utilizing treatment effects to choose which unique treatments to offer and who to assign to these treatments. We show that a combination of available machine learning tools for estimating heterogeneous treatment effects and a novel application of optimal transport methods provides a viable and efficient solution. With data from a large-scale field experiment for promotions management, we find that our methodology outperforms extant approaches that segment on consumer characteristics or preferences and those that only search over a prespecified grid. 
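The two-step coarse personalization idea above (estimate conditional effects, then discretize them into a small menu of treatments) can be mimicked with a deliberately simplified sketch: one-dimensional k-means on hypothetical estimated CATEs stands in for the paper's optimal-transport discretization, and the number of unique treatments K and the simulated effects are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Step 1 (assumed done elsewhere): estimated conditional average treatment
# effects, one per customer; here they are simply simulated.
tau_hat = np.concatenate([rng.normal(0.2, 0.05, 4000),
                          rng.normal(1.0, 0.10, 3000),
                          rng.normal(2.5, 0.20, 3000)])

# Step 2 (simplified): discretize into K unique treatments by clustering the
# estimated effects; the paper instead solves an optimal-transport problem.
K = 5
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(tau_hat.reshape(-1, 1))
assignment = km.labels_                                  # treatment arm per customer
treatment_levels = np.sort(km.cluster_centers_.ravel())  # the K offered treatments
print("chosen treatment levels:", np.round(treatment_levels, 2))
```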
Using our procedure, the firm recoups over 99.5% of its expected incremental profits under fully granular personalization while offering only five unique treatments. We conclude by discussing how coarse personalization arises in other domains."}, "https://arxiv.org/abs/2211.11673": {"title": "Asymptotically Normal Estimation of Local Latent Network Curvature", "link": "https://arxiv.org/abs/2211.11673", "description": "arXiv:2211.11673v4 Announce Type: replace \nAbstract: Network data, commonly used throughout the physical, social, and biological sciences, consist of nodes (individuals) and the edges (interactions) between them. One way to represent network data's complex, high-dimensional structure is to embed the graph into a low-dimensional geometric space. The curvature of this space, in particular, provides insights about the structure in the graph, such as the propensity to form triangles or present tree-like structures. We derive an estimating function for curvature based on triangle side lengths and the length of the midpoint of a side to the opposing corner. We construct an estimator where the only input is a distance matrix and also establish asymptotic normality. We next introduce a novel latent distance matrix estimator for networks and an efficient algorithm to compute the estimate via solving iterative quadratic programs. We apply this method to the Los Alamos National Laboratory Unified Network and Host dataset and show how curvature estimates can be used to detect a red-team attack faster than naive methods, as well as discover non-constant latent curvature in co-authorship networks in physics. The code for this paper is available at https://github.com/SteveJWR/netcurve, and the methods are implemented in the R package https://github.com/SteveJWR/lolaR."}, "https://arxiv.org/abs/2302.02310": {"title": "$\\ell_1$-penalized Multinomial Regression: Estimation, inference, and prediction, with an application to risk factor identification for different dementia subtypes", "link": "https://arxiv.org/abs/2302.02310", "description": "arXiv:2302.02310v2 Announce Type: replace \nAbstract: High-dimensional multinomial regression models are very useful in practice but have received less research attention than logistic regression models, especially from the perspective of statistical inference. In this work, we analyze the estimation and prediction error of the contrast-based $\\ell_1$-penalized multinomial regression model and extend the debiasing method to the multinomial case, providing a valid confidence interval for each coefficient and $p$-value of the individual hypothesis test. We also examine cases of model misspecification and non-identically distributed data to demonstrate the robustness of our method when some assumptions are violated. We apply the debiasing method to identify important predictors in the progression into dementia of different subtypes. Results from extensive simulations show the superiority of the debiasing method compared to other inference methods."}, "https://arxiv.org/abs/2305.04113": {"title": "Inferring Covariance Structure from Multiple Data Sources via Subspace Factor Analysis", "link": "https://arxiv.org/abs/2305.04113", "description": "arXiv:2305.04113v3 Announce Type: replace \nAbstract: Factor analysis provides a canonical framework for imposing lower-dimensional structure such as sparse covariance in high-dimensional data. 
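The estimation step of an l1-penalized multinomial regression can be sketched with scikit-learn's saga solver, as below; the simulated design, sparsity pattern, and regularization strength are assumptions, and the paper's debiasing step, which is what delivers confidence intervals and p-values, is not implemented here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Hypothetical high-dimensional design with 3 outcome classes.
n, p, k = 300, 200, 3
X = rng.normal(size=(n, p))
beta = np.zeros((p, k))
beta[:5, 1] = 1.5          # a small set of truly relevant predictors
beta[5:10, 2] = -1.5
logits = X @ beta
prob = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(k, p=pi) for pi in prob])

# l1-penalized multinomial fit; saga supports the multinomial + l1 combination.
fit = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
fit.fit(StandardScaler().fit_transform(X), y)
print("nonzero coefficients per class:", (np.abs(fit.coef_) > 1e-8).sum(axis=1))
# The paper's debiasing step, which yields per-coefficient confidence
# intervals and p-values, is not part of this sketch.
```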
High-dimensional data on the same set of variables are often collected under different conditions, for instance in reproducing studies across research groups. In such cases, it is natural to seek to learn the shared versus condition-specific structure. Existing hierarchical extensions of factor analysis have been proposed, but face practical issues including identifiability problems. To address these shortcomings, we propose a class of SUbspace Factor Analysis (SUFA) models, which characterize variation across groups at the level of a lower-dimensional subspace. We prove that the proposed class of SUFA models lead to identifiability of the shared versus group-specific components of the covariance, and study their posterior contraction properties. Taking a Bayesian approach, these contributions are developed alongside efficient posterior computation algorithms. Our sampler fully integrates out latent variables, is easily parallelizable and has complexity that does not depend on sample size. We illustrate the methods through application to integration of multiple gene expression datasets relevant to immunology."}, "https://arxiv.org/abs/2308.05534": {"title": "Collective Outlier Detection and Enumeration with Conformalized Closed Testing", "link": "https://arxiv.org/abs/2308.05534", "description": "arXiv:2308.05534v2 Announce Type: replace \nAbstract: This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set."}, "https://arxiv.org/abs/2311.04159": {"title": "Uncertainty Quantification using Simulation Output: Batching as an Inferential Device", "link": "https://arxiv.org/abs/2311.04159", "description": "arXiv:2311.04159v2 Announce Type: replace \nAbstract: We present batching as an omnibus device for uncertainty quantification using simulation output. We consider the classical context of a simulationist performing uncertainty quantification on an estimator $\\theta_n$ (of an unknown fixed quantity $\\theta$) using only the output data $(Y_1,Y_2,\\ldots,Y_n)$ gathered from a simulation. By uncertainty quantification, we mean approximating the sampling distribution of the error $\\theta_n-\\theta$ toward: (A) estimating an assessment functional $\\psi$, e.g., bias, variance, or quantile; or (B) constructing a $(1-\\alpha)$-confidence region on $\\theta$. We argue that batching is a remarkably simple and effective device for this purpose, and is especially suited for handling dependent output data such as what one frequently encounters in simulation contexts. We demonstrate that if the number of batches and the extent of their overlap are chosen appropriately, batching retains bootstrap's attractive theoretical properties of strong consistency and higher-order accuracy. 
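The batching idea above reduces, in its simplest non-overlapping form, to the classical batch-means confidence interval, sketched below for dependent simulation output; the AR(1) output stream, the number of batches, and the confidence level are assumptions, and the overlapping-batch and general-functional versions studied in the paper are not covered.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Dependent simulation output: an AR(1) stream with steady-state mean 2.0.
n, phi, mu = 20_000, 0.8, 2.0
Y = np.empty(n)
Y[0] = mu
for t in range(1, n):
    Y[t] = mu + phi * (Y[t - 1] - mu) + rng.normal()

def batch_means_ci(y, n_batches=30, alpha=0.05):
    """Non-overlapping batch-means (1 - alpha) confidence interval for the mean."""
    m = len(y) // n_batches                             # batch size
    means = y[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    half = stats.t.ppf(1 - alpha / 2, n_batches - 1) * means.std(ddof=1) / np.sqrt(n_batches)
    return means.mean() - half, means.mean() + half

print("95% CI for the steady-state mean:", batch_means_ci(Y))
```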
For constructing confidence regions, we characterize two limiting distributions associated with a Studentized statistic. Our extensive numerical experience confirms theoretical insight, especially about the effects of batch size and batch overlap."}, "https://arxiv.org/abs/2302.02560": {"title": "Causal Estimation of Exposure Shifts with Neural Networks", "link": "https://arxiv.org/abs/2302.02560", "description": "arXiv:2302.02560v4 Announce Type: replace-cross \nAbstract: A fundamental task in causal inference is estimating the effect of distribution shift in the treatment variable. We refer to this problem as shift-response function (SRF) estimation. Existing neural network methods for causal inference lack theoretical guarantees and practical implementations for SRF estimation. In this paper, we introduce Targeted Regularization for Exposure Shifts with Neural Networks (TRESNET), a method to estimate SRFs with robustness and efficiency guarantees. Our contributions are twofold. First, we propose a targeted regularization loss for neural networks with theoretical properties that ensure double robustness and asymptotic efficiency specific to SRF estimation. Second, we extend targeted regularization to support loss functions from the exponential family to accommodate non-continuous outcome distributions (e.g., discrete counts). We conduct benchmark experiments demonstrating TRESNET's broad applicability and competitiveness. We then apply our method to a key policy question in public health to estimate the causal effect of revising the US National Ambient Air Quality Standards (NAAQS) for PM 2.5 from 12 ${\\mu}g/m^3$ to 9 ${\\mu}g/m^3$. This change has been recently proposed by the US Environmental Protection Agency (EPA). Our goal is to estimate the reduction in deaths that would result from this anticipated revision using data consisting of 68 million individuals across the U.S."}, "https://arxiv.org/abs/2408.14604": {"title": "Co-factor analysis of citation networks", "link": "https://arxiv.org/abs/2408.14604", "description": "arXiv:2408.14604v1 Announce Type: new \nAbstract: One compelling use of citation networks is to characterize papers by their relationships to the surrounding literature. We propose a method to characterize papers by embedding them into two distinct \"co-factor\" spaces: one describing how papers send citations, and the other describing how papers receive citations. This approach presents several challenges. First, older documents cannot cite newer documents, and thus it is not clear that co-factors are even identifiable. We resolve this challenge by developing a co-factor model for asymmetric adjacency matrices with missing lower triangles and showing that identification is possible. We then frame estimation as a matrix completion problem and develop a specialized implementation of matrix completion because prior implementations are memory bound in our setting. Simulations show that our estimator has promising finite sample properties, and that naive approaches fail to recover latent co-factor structure. We leverage our estimator to investigate 237,794 papers published in statistics journals from 1898 to 2022, resulting in the most comprehensive topic model of the statistics literature to date. 
We find interpretable co-factors corresponding to many statistical subfields, including time series, variable selection, spatial methods, graphical models, GLM(M)s, causal inference, multiple testing, quantile regression, resampling, semi-parametrics, dimension reduction, and several more."}, "https://arxiv.org/abs/2408.14661": {"title": "Non-Parametric Bayesian Inference for Partial Orders with Ties from Rank Data observed with Mallows Noise", "link": "https://arxiv.org/abs/2408.14661", "description": "arXiv:2408.14661v1 Announce Type: new \nAbstract: Partial orders may be used for modeling and summarising ranking data when the underlying order relations are less strict than a total order. They are a natural choice when the data are lists recording individuals' positions in queues in which queue order is constrained by a social hierarchy, as it may be appropriate to model the social hierarchy as a partial order and the lists as random linear extensions respecting the partial order. In this paper, we set up a new prior model for partial orders incorporating ties by clustering tied actors using a Poisson Dirichlet process. The family of models is projective. We perform Bayesian inference with different choices of noisy observation model. In particular, we propose a Mallow's observation model for our partial orders and give a recursive likelihood evaluation algorithm. We demonstrate our model on the 'Royal Acta' (Bishop) list data where we find the model is favored over well-known alternatives which fit only total orders."}, "https://arxiv.org/abs/2408.14669": {"title": "Inspection-Guided Randomization: A Flexible and Transparent Restricted Randomization Framework for Better Experimental Design", "link": "https://arxiv.org/abs/2408.14669", "description": "arXiv:2408.14669v1 Announce Type: new \nAbstract: Randomized experiments are considered the gold standard for estimating causal effects. However, out of the set of possible randomized assignments, some may be likely to produce poor effect estimates and misleading conclusions. Restricted randomization is an experimental design strategy that filters out undesirable treatment assignments, but its application has primarily been limited to ensuring covariate balance in two-arm studies where the target estimand is the average treatment effect. Other experimental settings with different design desiderata and target effect estimands could also stand to benefit from a restricted randomization approach. We introduce Inspection-Guided Randomization (IGR), a transparent and flexible framework for restricted randomization that filters out undesirable treatment assignments by inspecting assignments against analyst-specified, domain-informed design desiderata. In IGR, the acceptable treatment assignments are locked in ex ante and pre-registered in the trial protocol, thus safeguarding against $p$-hacking and promoting reproducibility. 
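The inspect-then-randomize logic of IGR can be illustrated for a two-arm study where the analyst-specified desideratum is a cap on absolute standardized mean differences of baseline covariates; the covariates, the 0.2 threshold, and the candidate-pool size below are assumptions, and the paper's group-formation and interference settings are richer than this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p = 60, 4
X = rng.normal(size=(n, p))                  # baseline covariates

def std_mean_diffs(x, z):
    """Absolute standardized mean differences between arms z == 1 and z == 0."""
    d = x[z == 1].mean(0) - x[z == 0].mean(0)
    s = np.sqrt((x[z == 1].var(0, ddof=1) + x[z == 0].var(0, ddof=1)) / 2)
    return np.abs(d / s)

# Inspect a pool of candidate assignments and keep those meeting the
# analyst-specified desideratum (all SMDs below 0.2, an assumed threshold).
n_candidates, accepted = 10_000, []
for _ in range(n_candidates):
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, n // 2, replace=False)] = 1
    if std_mean_diffs(X, z).max() < 0.20:
        accepted.append(z)

# The accepted set would be locked in ex ante and pre-registered; the realized
# assignment is then a uniform draw from it.
print(f"accepted {len(accepted)} of {n_candidates} candidate assignments")
z_final = accepted[rng.integers(len(accepted))]
```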
Through illustrative simulation studies motivated by education and behavioral health interventions, we demonstrate how IGR can be used to improve effect estimates compared to benchmark designs in group formation experiments and experiments with interference."}, "https://arxiv.org/abs/2408.14671": {"title": "Double/Debiased CoCoLASSO of Treatment Effects with Mismeasured High-Dimensional Control Variables", "link": "https://arxiv.org/abs/2408.14671", "description": "arXiv:2408.14671v1 Announce Type: new \nAbstract: We develop an estimator for treatment effects in high-dimensional settings with additive measurement error, a prevalent challenge in modern econometrics. We introduce the Double/Debiased Convex Conditioned LASSO (Double/Debiased CoCoLASSO), which extends the double/debiased machine learning framework to accommodate mismeasured covariates. Our principal contributions are threefold. (1) We construct a Neyman-orthogonal score function that remains valid under measurement error, incorporating a bias correction term to account for error-induced correlations. (2) We propose a method of moments estimator for the measurement error variance, enabling implementation without prior knowledge of the error covariance structure. (3) We establish the $\\sqrt{N}$-consistency and asymptotic normality of our estimator under general conditions, allowing for both the number of covariates and the magnitude of measurement error to increase with the sample size. Our theoretical results demonstrate the estimator's efficiency within the class of regularized high-dimensional estimators accounting for measurement error. Monte Carlo simulations corroborate our asymptotic theory and illustrate the estimator's robust performance across various levels of measurement error. Notably, our covariance-oblivious approach nearly matches the efficiency of methods that assume known error variance."}, "https://arxiv.org/abs/2408.14691": {"title": "Effects Among the Affected", "link": "https://arxiv.org/abs/2408.14691", "description": "arXiv:2408.14691v1 Announce Type: new \nAbstract: Many interventions are both beneficial to initiate and harmful to stop. Traditionally, to determine whether to deploy that intervention in a time-limited way depends on if, on average, the increase in the benefits of starting it outweigh the increase in the harms of stopping it. We propose a novel causal estimand that provides a more nuanced understanding of the effects of such treatments, particularly, how response to an earlier treatment (e.g., treatment initiation) modifies the effect of a later treatment (e.g., treatment discontinuation), thus learning if there are effects among the (un)affected. Specifically, we consider a marginal structural working model summarizing how the average effect of a later treatment varies as a function of the (estimated) conditional average effect of an earlier treatment. We allow for estimation of this conditional average treatment effect using machine learning, such that the causal estimand is a data-adaptive parameter. We show how a sequentially randomized design can be used to identify this causal estimand, and we describe a targeted maximum likelihood estimator for the resulting statistical estimand, with influence curve-based inference. 
Throughout, we use the Adaptive Strategies for Preventing and Treating Lapses of Retention in HIV Care trial (NCT02338739) as an illustrative example, showing that discontinuation of conditional cash transfers for HIV care adherence was most harmful among those who most had an increase in benefits from them initially."}, "https://arxiv.org/abs/2408.14710": {"title": "On the distinction between the per-protocol effect and the effect of the treatment strategy", "link": "https://arxiv.org/abs/2408.14710", "description": "arXiv:2408.14710v1 Announce Type: new \nAbstract: In randomized trials, the per-protocol effect, that is, the effect of being assigned a treatment strategy and receiving treatment according to the assigned strategy, is sometimes thought to reflect the effect of the treatment strategy itself, without intervention on assignment. Here, we argue by example that this is not necessarily the case. We examine a causal structure for a randomized trial where these two causal estimands -- the per-protocol effect and the effect of the treatment strategy -- are not equal, and where their corresponding identifying observed data functionals are not the same, but both require information on assignment for identification. Our example highlights the conceptual difference between the per-protocol effect and the effect of the treatment strategy itself, the conditions under which the observed data functionals for these estimands are equal, and suggests that in some cases their identification requires information on assignment, even when assignment is randomized. An implication of these findings is that in observational analyses that aim to emulate a target randomized trial in which an analog of assignment is well-defined, the effect of the treatment strategy is not necessarily an observational analog of the per-protocol effect. Furthermore, either of these effects may be unidentifiable without information on treatment assignment, unless one makes additional assumptions; informally, that assignment does not affect the outcome except through treatment (i.e., an exclusion-restriction assumption), and that assignment is not a confounder of the treatment outcome association conditional on other variables in the analysis."}, "https://arxiv.org/abs/2408.14766": {"title": "Differentially Private Estimation of Weighted Average Treatment Effects for Binary Outcomes", "link": "https://arxiv.org/abs/2408.14766", "description": "arXiv:2408.14766v1 Announce Type: new \nAbstract: In the social and health sciences, researchers often make causal inferences using sensitive variables. These researchers, as well as the data holders themselves, may be ethically and perhaps legally obligated to protect the confidentiality of study participants' data. It is now known that releasing any statistics, including estimates of causal effects, computed with confidential data leaks information about the underlying data values. Thus, analysts may desire to use causal estimators that can provably bound this information leakage. Motivated by this goal, we develop algorithms for estimating weighted average treatment effects with binary outcomes that satisfy the criterion of differential privacy. We present theoretical results on the accuracy of several differentially private estimators of weighted average treatment effects. 
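A bare-bones version of differentially private effect estimation for binary outcomes is sketched below with the Laplace mechanism; it is not the authors' weighted estimator. Treating group sizes as fixed and public and using a deliberately conservative sensitivity bound are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def dp_difference_in_means(y1, y0, epsilon):
    """epsilon-DP difference in group means for binary outcomes.

    Group sizes are treated as fixed and public. Replacing one record changes
    each group mean by at most 1/n_g, so 1/n1 + 1/n0 is a (conservative)
    sensitivity bound; Laplace noise with scale sensitivity/epsilon then
    yields epsilon-differential privacy.
    """
    sens = 1.0 / len(y1) + 1.0 / len(y0)
    return (y1.mean() - y0.mean()) + rng.laplace(scale=sens / epsilon)

y_treat = rng.binomial(1, 0.55, size=5000)   # hypothetical treated outcomes
y_ctrl = rng.binomial(1, 0.45, size=5000)    # hypothetical control outcomes
print("non-private estimate:", y_treat.mean() - y_ctrl.mean())
print("epsilon = 1 private estimate:", dp_difference_in_means(y_treat, y_ctrl, 1.0))
```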
We illustrate the empirical performance of these estimators using simulated data and a causal analysis using data on education and income."}, "https://arxiv.org/abs/2408.15052": {"title": "stopp: An R Package for Spatio-Temporal Point Pattern Analysis", "link": "https://arxiv.org/abs/2408.15052", "description": "arXiv:2408.15052v1 Announce Type: new \nAbstract: stopp is a novel R package specifically designed for the analysis of spatio-temporal point patterns which might have occurred in a subset of the Euclidean space or on some specific linear network, such as roads of a city. It represents the first package providing a comprehensive modelling framework for spatio-temporal Poisson point processes. While many specialized models exist in the scientific literature for analyzing complex spatio-temporal point patterns, we address the lack of general software for comparing simpler alternative models and their goodness of fit. The package's main functionalities include modelling and diagnostics, together with exploratory analysis tools and the simulation of point processes. A particular focus is given to local first-order and second-order characteristics. The package aggregates existing methods within one coherent framework, including those we proposed in recent papers, and it aims to welcome many further proposals and extensions from the R community."}, "https://arxiv.org/abs/2408.15058": {"title": "Competing risks models with two time scales", "link": "https://arxiv.org/abs/2408.15058", "description": "arXiv:2408.15058v1 Announce Type: new \nAbstract: Competing risks models can involve more than one time scale. A relevant example is the study of mortality after a cancer diagnosis, where time since diagnosis but also age may jointly determine the hazards of death due to different causes. Multiple time scales have rarely been explored in the context of competing events. Here, we propose a model in which the cause-specific hazards vary smoothly over two times scales. It is estimated by two-dimensional $P$-splines, exploiting the equivalence between hazard smoothing and Poisson regression. The data are arranged on a grid so that we can make use of generalized linear array models for efficient computations. The R-package TwoTimeScales implements the model.\n As a motivating example we analyse mortality after diagnosis of breast cancer and we distinguish between death due to breast cancer and all other causes of death. The time scales are age and time since diagnosis. We use data from the Surveillance, Epidemiology and End Results (SEER) program. In the SEER data, age at diagnosis is provided with a last open-ended category, leading to coarsely grouped data. We use the two-dimensional penalised composite link model to ungroup the data before applying the competing risks model with two time scales."}, "https://arxiv.org/abs/2408.15061": {"title": "Diagnosing overdispersion in longitudinal analyses with grouped nominal polytomous data", "link": "https://arxiv.org/abs/2408.15061", "description": "arXiv:2408.15061v1 Announce Type: new \nAbstract: Experiments in Agricultural Sciences often involve the analysis of longitudinal nominal polytomous variables, both in individual and grouped structures. Marginal and mixed-effects models are two common approaches. The distributional assumptions induce specific mean-variance relationships, however, in many instances, the observed variability is greater than assumed by the model. 
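The equivalence between hazard smoothing and Poisson regression invoked in the two-time-scale abstract can be seen in one dimension with a piecewise-exponential model: interval-specific event counts are Poisson with the log exposure time as an offset. The simulated survival data, the interval grid, and the single time scale are assumptions; the paper's two-dimensional P-splines and array computations are not shown.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)

# Simulated survival times with an increasing hazard, censored at t = 3.
n = 3000
t_true = rng.weibull(1.5, size=n) * 2.0
time = np.minimum(t_true, 3.0)
event = (t_true <= 3.0).astype(int)

# Per interval: event count d_j and total exposure (time at risk) R_j.
edges = np.linspace(0, 3, 16)
d = np.zeros(len(edges) - 1)
R = np.zeros(len(edges) - 1)
for j, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
    at_risk = time > lo
    R[j] = (np.clip(time[at_risk], lo, hi) - lo).sum()
    d[j] = ((event == 1) & (time > lo) & (time <= hi)).sum()

# Poisson regression of event counts on interval dummies with log-exposure
# offset: the fitted rates equal the piecewise-constant hazard estimates d_j/R_j.
fit = sm.GLM(d, np.eye(len(d)), family=sm.families.Poisson(), offset=np.log(R)).fit()
print("hazard per interval:", np.round(np.exp(fit.params), 3))
```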
This characterizes overdispersion, whose identification is crucial for choosing an appropriate modeling framework so that inferences remain reliable. We propose an initial exploration of constructing a longitudinal multinomial dispersion index as a descriptive and diagnostic tool. This index is calculated as the ratio between the observed and assumed variances. The performance of this index was evaluated through a simulation study, employing statistical techniques to assess its initial performance in different scenarios. We identified that as the index approaches one, it is more likely that this corresponds to a high degree of overdispersion. Conversely, values closer to zero indicate a low degree of overdispersion. As a case study, we present an application in animal science, in which the behaviour of pigs (grouped in stalls) is evaluated, considering three response categories."}, "https://arxiv.org/abs/2408.14625": {"title": "A Bayesian approach for fitting semi-Markov mixture models of cancer latency to individual-level data", "link": "https://arxiv.org/abs/2408.14625", "description": "arXiv:2408.14625v1 Announce Type: cross \nAbstract: Multi-state models of cancer natural history are widely used for designing and evaluating cancer early detection strategies. Calibrating such models against longitudinal data from screened cohorts is challenging, especially when fitting non-Markovian mixture models against individual-level data. Here, we consider a family of semi-Markov mixture models of cancer natural history and introduce an efficient data-augmented Markov chain Monte Carlo sampling algorithm for fitting these models to individual-level screening and cancer diagnosis histories. Our fully Bayesian approach supports rigorous uncertainty quantification and model selection through leave-one-out cross-validation, and it enables the estimation of screening-related overdiagnosis rates. We demonstrate the effectiveness of our approach using synthetic data, showing that the sampling algorithm efficiently explores the joint posterior distribution of model parameters and latent variables. Finally, we apply our method to data from the US Breast Cancer Surveillance Consortium and estimate the extent of breast cancer overdiagnosis associated with mammography screening. The sampler and model comparison method are available in the R package baclava."}, "https://arxiv.org/abs/2302.01269": {"title": "Adjusting for Incomplete Baseline Covariates in Randomized Controlled Trials: A Cross-World Imputation Framework", "link": "https://arxiv.org/abs/2302.01269", "description": "arXiv:2302.01269v2 Announce Type: replace \nAbstract: In randomized controlled trials, adjusting for baseline covariates is often applied to improve the precision of treatment effect estimation. However, missingness in covariates is common. Recently, Zhao & Ding (2022) studied two simple strategies, the single imputation method and missingness indicator method (MIM), to deal with missing covariates, and showed that both methods can provide efficiency gains. To better understand and compare these two strategies, we propose and investigate a novel imputation framework termed cross-world imputation (CWI), which includes single imputation and MIM as special cases. Through the lens of CWI, we show that MIM implicitly searches for the optimal CWI values and thus achieves optimal efficiency. 
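The missingness indicator method (MIM) discussed in the cross-world imputation abstract is easy to sketch for a randomized trial with one partially missing baseline covariate: impute a constant and add the missingness indicator to the regression adjustment. The data-generating process and the OLS adjustment below are assumptions, and the CWI framework itself is not implemented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)

n = 2000
z = rng.binomial(1, 0.5, n)                    # randomized treatment
x = rng.normal(size=n)                         # baseline covariate
y = 1.0 * z + 0.8 * x + rng.normal(size=n)     # outcome, true effect = 1
miss = rng.random(n) < 0.3                     # 30% of x missing

# Missingness indicator method: impute a constant, add the indicator, adjust.
x_imp = np.where(miss, 0.0, x)
design = sm.add_constant(np.column_stack([z, x_imp, miss.astype(float)]))
fit = sm.OLS(y, design).fit(cov_type="HC1")
print("adjusted treatment effect:", round(fit.params[1], 3),
      "SE:", round(fit.bse[1], 3))
```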
We also derive conditions under which the single imputation method, by searching for the optimal single imputation values, can achieve the same efficiency as the MIM."}, "https://arxiv.org/abs/2310.12391": {"title": "Online Semiparametric Regression via Sequential Monte Carlo", "link": "https://arxiv.org/abs/2310.12391", "description": "arXiv:2310.12391v2 Announce Type: replace \nAbstract: We develop and describe online algorithms for performing online semiparametric regression analyses. Earlier work on this topic is in Luts, Broderick & Wand (J. Comput. Graph. Statist., 2014) where online mean field variational Bayes was employed. In this article we instead develop sequential Monte Carlo approaches to circumvent well-known inaccuracies inherent in variational approaches. Even though sequential Monte Carlo is not as fast as online mean field variational Bayes, it can be a viable alternative for applications where the data rate is not overly high. For Gaussian response semiparametric regression models our new algorithms share the online mean field variational Bayes property of only requiring updating and storage of sufficient statistics quantities of streaming data. In the non-Gaussian case accurate real-time semiparametric regression requires the full data to be kept in storage. The new algorithms allow for new options concerning accuracy/speed trade-offs for online semiparametric regression."}, "https://arxiv.org/abs/2308.15062": {"title": "Forecasting with Feedback", "link": "https://arxiv.org/abs/2308.15062", "description": "arXiv:2308.15062v3 Announce Type: replace-cross \nAbstract: Systematically biased forecasts are typically interpreted as evidence of forecasters' irrationality and/or asymmetric loss. In this paper we propose an alternative explanation: when forecasts inform economic policy decisions, and the resulting actions affect the realization of the forecast target itself, forecasts may be optimally biased even under quadratic loss. The result arises in environments in which the forecaster is uncertain about the decision maker's reaction to the forecast, which is presumably the case in most applications. We illustrate the empirical relevance of our theory by reviewing some stylized properties of Green Book inflation forecasts and relating them to the predictions from our model. Our results point out that the presence of policy feedback poses a challenge to traditional tests of forecast rationality."}, "https://arxiv.org/abs/2310.09702": {"title": "Inference with Mondrian Random Forests", "link": "https://arxiv.org/abs/2310.09702", "description": "arXiv:2310.09702v2 Announce Type: replace-cross \nAbstract: Random forests are popular methods for regression and classification analysis, and many different variants have been proposed in recent years. One interesting example is the Mondrian random forest, in which the underlying constituent trees are constructed via a Mondrian process. We give precise bias and variance characterizations, along with a Berry-Esseen-type central limit theorem, for the Mondrian random forest regression estimator. By combining these results with a carefully crafted debiasing approach and an accurate variance estimator, we present valid statistical inference methods for the unknown regression function. These methods come with explicitly characterized error bounds in terms of the sample size, tree complexity parameter, and number of trees in the forest, and include coverage error rates for feasible confidence interval estimators. 
Our novel debiasing procedure for the Mondrian random forest also allows it to achieve the minimax-optimal point estimation convergence rate in mean squared error for multivariate $\\beta$-H\\\"older regression functions, for all $\\beta > 0$, provided that the underlying tuning parameters are chosen appropriately. Efficient and implementable algorithms are devised for both batch and online learning settings, and we carefully study the computational complexity of different Mondrian random forest implementations. Finally, simulations with synthetic data validate our theory and methodology, demonstrating their excellent finite-sample properties."}, "https://arxiv.org/abs/2408.15452": {"title": "The effects of data preprocessing on probability of default model fairness", "link": "https://arxiv.org/abs/2408.15452", "description": "arXiv:2408.15452v1 Announce Type: new \nAbstract: In the context of financial credit risk evaluation, the fairness of machine learning models has become a critical concern, especially given the potential for biased predictions that disproportionately affect certain demographic groups. This study investigates the impact of data preprocessing, with a specific focus on Truncated Singular Value Decomposition (SVD), on the fairness and performance of probability of default models. Using a comprehensive dataset sourced from Kaggle, various preprocessing techniques, including SVD, were applied to assess their effect on model accuracy, discriminatory power, and fairness."}, "https://arxiv.org/abs/2408.15454": {"title": "BayesSRW: Bayesian Sampling and Re-weighting approach for variance reduction", "link": "https://arxiv.org/abs/2408.15454", "description": "arXiv:2408.15454v1 Announce Type: new \nAbstract: In this paper, we address the challenge of sampling in scenarios where limited resources prevent exhaustive measurement across all subjects. We consider a setting where samples are drawn from multiple groups, each following a distribution with unknown mean and variance parameters. We introduce a novel sampling strategy, motivated simply by Cauchy-Schwarz inequality, which minimizes the variance of the population mean estimator by allocating samples proportionally to both the group size and the standard deviation. This approach improves the efficiency of sampling by focusing resources on groups with greater variability, thereby enhancing the precision of the overall estimate. Additionally, we extend our method to a two-stage sampling procedure in a Bayes approach, named BayesSRW, where a preliminary stage is used to estimate the variance, which then informs the optimal allocation of the remaining sampling budget. Through simulation examples, we demonstrate the effectiveness of our approach in reducing estimation uncertainty and providing more reliable insights in applications ranging from user experience surveys to high-dimensional peptide array studies."}, "https://arxiv.org/abs/2408.15502": {"title": "ROMI: A Randomized Two-Stage Basket Trial Design to Optimize Doses for Multiple Indications", "link": "https://arxiv.org/abs/2408.15502", "description": "arXiv:2408.15502v1 Announce Type: new \nAbstract: Optimizing doses for multiple indications is challenging. The pooled approach of finding a single optimal biological dose (OBD) for all indications ignores that dose-response or dose-toxicity curves may differ between indications, resulting in varying OBDs. Conversely, indication-specific dose optimization often requires a large sample size. 
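The BayesSRW allocation rule (samples proportional to group size times standard deviation) coincides, in its single-stage non-Bayesian form, with classical Neyman allocation. The sketch below shows only that allocation step; the group sizes, pilot standard deviations, and budget are assumptions, and the two-stage Bayesian variance estimation is not included.

```python
import numpy as np

# Hypothetical strata: sizes N_g and pilot estimates of within-group std dev.
N = np.array([10_000, 5_000, 2_000])
sigma = np.array([1.0, 4.0, 0.5])
budget = 600                                  # total measurements allowed

# Allocate proportionally to N_g * sigma_g, which minimizes the variance of
# the stratified mean estimator (the Cauchy-Schwarz argument in the abstract).
weights = N * sigma
n_g = np.maximum(1, np.round(budget * weights / weights.sum())).astype(int)

print("size-and-spread allocation:", n_g)
print("proportional-to-size only :", np.round(budget * N / N.sum()).astype(int))
```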
To address this challenge, we propose a Randomized two-stage basket trial design that Optimizes doses in Multiple Indications (ROMI). In stage 1, for each indication, response and toxicity are evaluated for a high dose, which may be a previously obtained MTD, with a rule that stops accrual to indications where the high dose is unsafe or ineffective. Indications not terminated proceed to stage 2, where patients are randomized between the high dose and a specified lower dose. A latent-cluster Bayesian hierarchical model is employed to borrow information between indications, while considering the potential heterogeneity of OBD across indications. Indication-specific utilities are used to quantify response-toxicity trade-offs. At the end of stage 2, for each indication with at least one acceptable dose, the dose with the highest posterior mean utility is selected as optimal. Two versions of ROMI are presented, one using only stage 2 data for dose optimization and the other optimizing doses using data from both stages. Simulations show that both versions have desirable operating characteristics compared to designs that either ignore indications or optimize dose independently for each indication."}, "https://arxiv.org/abs/2408.15607": {"title": "Comparing restricted mean survival times in small sample clinical trials using pseudo-observations", "link": "https://arxiv.org/abs/2408.15607", "description": "arXiv:2408.15607v1 Announce Type: new \nAbstract: The widely used proportional hazard assumption cannot be assessed reliably in small-scale clinical trials and might often in fact be unjustified, e.g. due to delayed treatment effects. An alternative to the hazard ratio as an effect measure is the difference in restricted mean survival time (RMST), which does not rely on model assumptions. Although an asymptotic test for two-sample comparisons of the RMST exists, it has been shown to suffer from an inflated type I error rate in samples of small or moderate sizes. Recently, permutation tests, including the studentized permutation test, have been introduced to address this issue. In this paper, we propose two methods based on pseudo-observations (PO) regression models as alternatives for such scenarios and assess their properties in comparison to previously proposed approaches in an extensive simulation study. Furthermore, we apply the proposed PO methods to data from a clinical trial and, by doing so, point out some extensions that might be very useful for practical applications, such as covariate adjustment."}, "https://arxiv.org/abs/2408.15617": {"title": "Network Representation of Higher-Order Interactions Based on Information Dynamics", "link": "https://arxiv.org/abs/2408.15617", "description": "arXiv:2408.15617v1 Announce Type: new \nAbstract: Many complex systems in science and engineering are modeled as networks whose nodes and links depict the temporal evolution of each system unit and the dynamic interaction between pairs of units, which are assessed respectively using measures of auto- and cross-correlation or variants thereof. However, a growing body of work is documenting that this standard network representation can neglect potentially crucial information shared by three or more dynamic processes in the form of higher-order interactions (HOIs). While several measures, mostly derived from information theory, are available to assess HOIs in network systems mapped by multivariate time series, none of them is able to provide a compact and detailed representation of higher-order interdependencies. 
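Restricted mean survival time and its jackknife pseudo-observations, the building blocks of the PO methods above, can be computed from scratch as sketched below; the simulated data and the truncation time tau are assumptions, and the pseudo-observation regression models and small-sample tests studied in the paper are not implemented.

```python
import numpy as np

rng = np.random.default_rng(10)

def rmst(time, event, tau):
    """Restricted mean survival time up to tau: area under the Kaplan-Meier curve."""
    order = np.argsort(time)
    t, e = time[order], event[order]
    n = len(t)
    surv, points = 1.0, [(0.0, 1.0)]
    for i in range(n):
        if e[i] == 1:
            surv *= 1.0 - 1.0 / (n - i)       # n - i subjects still at risk
            points.append((t[i], surv))
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for ti, si in points[1:] + [(tau, None)]:
        ti = min(ti, tau)
        area += prev_s * (ti - prev_t)
        if ti >= tau:
            break
        prev_t, prev_s = ti, si
    return area

def pseudo_observations(time, event, tau):
    """Leave-one-out jackknife pseudo-values: n*theta_hat - (n-1)*theta_hat_{-i}."""
    n, full = len(time), rmst(time, event, tau)
    keep = np.ones(n, dtype=bool)
    po = np.empty(n)
    for i in range(n):
        keep[i] = False
        po[i] = n * full - (n - 1) * rmst(time[keep], event[keep], tau)
        keep[i] = True
    return po

t = np.minimum(rng.exponential(2.0, 80), 3.0)  # administrative censoring at 3
e = (t < 3.0).astype(int)
print("RMST(tau = 2.5):", round(rmst(t, e, 2.5), 3))
print("mean pseudo-observation:", round(pseudo_observations(t, e, 2.5).mean(), 3))
```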
In this work, we fill this gap by introducing a framework for the assessment of HOIs in dynamic network systems at different levels of resolution. The framework is grounded on the dynamic implementation of the O-information, a new measure assessing HOIs in dynamic networks, which is here used together with its local counterpart and its gradient to quantify HOIs respectively for the network as a whole, for each link, and for each node. The integration of these measures into the conventional network representation results in a tool for the representation of HOIs as networks, which is defined formally using measures of information dynamics, implemented in its linear version by using vector regression models and statistical validation techniques, illustrated in simulated network systems, and finally applied to an illustrative example in the field of network physiology."}, "https://arxiv.org/abs/2408.15623": {"title": "Correlation-Adjusted Simultaneous Testing for Ultra High-dimensional Grouped Data", "link": "https://arxiv.org/abs/2408.15623", "description": "arXiv:2408.15623v1 Announce Type: new \nAbstract: Epigenetics plays a crucial role in understanding the underlying molecular processes of several types of cancer as well as the determination of innovative therapeutic tools. To investigate the complex interplay between genetics and environment, we develop a novel procedure to identify differentially methylated probes (DMPs) among cases and controls. Statistically, this translates to an ultra high-dimensional testing problem with sparse signals and an inherent grouping structure. When the total number of variables being tested is massive and typically exhibits some degree of dependence, existing group-wise multiple comparisons adjustment methods lead to inflated false discoveries. We propose a class of Correlation-Adjusted Simultaneous Testing (CAST) procedures incorporating the general dependence among probes within and between genes to control the false discovery rate (FDR). Simulations demonstrate that CASTs have superior empirical power while maintaining the FDR compared to the benchmark group-wise. Moreover, while the benchmark fails to control FDR for small-sized grouped correlated data, CAST exhibits robustness in controlling FDR across varying group sizes. In bladder cancer data, the proposed CAST method confirms some existing differentially methylated probes implicated with the disease (Langevin, et. al., 2014). However, CAST was able to detect novel DMPs that the previous study (Langevin, et. al., 2014) failed to identify. The CAST method can accurately identify significant potential biomarkers and facilitates informed decision-making aligned with precision medicine in the context of complex data analysis."}, "https://arxiv.org/abs/2408.15670": {"title": "Adaptive Weighted Random Isolation (AWRI): a simple design to estimate causal effects under network interference", "link": "https://arxiv.org/abs/2408.15670", "description": "arXiv:2408.15670v1 Announce Type: new \nAbstract: Recently, causal inference under interference has gained increasing attention in the literature. In this paper, we focus on randomized designs for estimating the total treatment effect (TTE), defined as the average difference in potential outcomes between fully treated and fully controlled groups. We propose a simple design called weighted random isolation (WRI) along with a restricted difference-in-means estimator (RDIM) for TTE estimation. 
Additionally, we derive a novel mean squared error surrogate for the RDIM estimator, supported by a network-adaptive weight selection algorithm. This can help us determine a fair weight for the WRI design, thereby effectively reducing the bias. Our method accommodates directed networks, extending previous frameworks. Extensive simulations demonstrate that the proposed method outperforms nine established methods across a wide range of scenarios."}, "https://arxiv.org/abs/2408.15701": {"title": "Robust discriminant analysis", "link": "https://arxiv.org/abs/2408.15701", "description": "arXiv:2408.15701v1 Announce Type: new \nAbstract: Discriminant analysis (DA) is one of the most popular methods for classification due to its conceptual simplicity, low computational cost, and often solid performance. In its standard form, DA uses the arithmetic mean and sample covariance matrix to estimate the center and scatter of each class. We discuss and illustrate how this makes standard DA very sensitive to suspicious data points, such as outliers and mislabeled cases. We then present an overview of techniques for robust DA, which are more reliable in the presence of deviating cases. In particular, we review DA based on robust estimates of location and scatter, along with graphical diagnostic tools for visualizing the results of DA."}, "https://arxiv.org/abs/2408.15806": {"title": "Bayesian analysis of product feature allocation models", "link": "https://arxiv.org/abs/2408.15806", "description": "arXiv:2408.15806v1 Announce Type: new \nAbstract: Feature allocation models are an extension of Bayesian nonparametric clustering models, where individuals can share multiple features. We study a broad class of models whose probability distribution has a product form, which includes the popular Indian buffet process. This class plays a prominent role among existing priors, and it shares structural characteristics with Gibbs-type priors in the species sampling framework. We develop a general theory for the entire class, obtaining closed form expressions for the predictive structure and the posterior law of the underlying stochastic process. Additionally, we describe the distribution for the number of features and the number of hitherto unseen features in a future sample, leading to the $\\alpha$-diversity for feature models. We also examine notable novel examples, such as mixtures of Indian buffet processes and beta Bernoulli models, where the latter entails a finite random number of features. This methodology finds significant applications in ecology, allowing the estimation of species richness for incidence data, as we demonstrate by analyzing plant diversity in Danish forests and trees in Barro Colorado Island."}, "https://arxiv.org/abs/2408.15862": {"title": "Marginal homogeneity tests with panel data", "link": "https://arxiv.org/abs/2408.15862", "description": "arXiv:2408.15862v1 Announce Type: new \nAbstract: A panel dataset satisfies marginal homogeneity if the time-specific marginal distributions are homogeneous or time-invariant. Marginal homogeneity is relevant in economic settings such as dynamic discrete games. In this paper, we propose several tests for the hypothesis of marginal homogeneity and investigate their properties. We consider an asymptotic framework in which the number of individuals n in the panel diverges, and the number of periods T is fixed. 
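A non-studentized Cramer-von Mises-type permutation test for marginal homogeneity can be sketched as below for T = 2 periods, the case in which the abstract notes the non-studentized permutation test is asymptotically exact; the particular statistic (evaluated at the pooled sample points), the panel data, and the number of permutations are assumptions, and the studentized and bootstrap versions are not implemented.

```python
import numpy as np

rng = np.random.default_rng(11)

def cvm_statistic(Y):
    """Cramer-von Mises-type distance between time-specific marginals.

    Each period's empirical CDF is compared with the pooled empirical CDF,
    evaluated at the pooled sample points (non-studentized version).
    """
    n, T = Y.shape
    pooled = np.sort(Y.ravel())
    F_pool = np.searchsorted(pooled, pooled, side="right") / pooled.size
    stat = 0.0
    for t in range(T):
        F_t = np.searchsorted(np.sort(Y[:, t]), pooled, side="right") / n
        stat += n * np.mean((F_t - F_pool) ** 2)
    return stat

def permutation_test(Y, n_perm=999):
    """Permute time labels within each individual and recompute the statistic."""
    obs = cvm_statistic(Y)
    T = Y.shape[1]
    count = 0
    for _ in range(n_perm):
        perm = np.array([row[rng.permutation(T)] for row in Y])
        count += cvm_statistic(perm) >= obs
    return obs, (1 + count) / (1 + n_perm)

# Panel with n individuals and T = 2 periods; period 2 is shifted, so H0 fails.
n = 300
Y = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.3, 1.0, n)])
stat, pval = permutation_test(Y)
print(f"CvM-type statistic: {stat:.3f}, permutation p-value: {pval:.3f}")
```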
We implement our tests by comparing a studentized or non-studentized T-sample version of the Cramer-von Mises statistic with a suitable critical value. We propose three methods to construct the critical value: asymptotic approximations, the bootstrap, and time permutations. We show that the first two methods result in asymptotically exact hypothesis tests. The permutation test based on a non-studentized statistic is asymptotically exact when T=2, but is asymptotically invalid when T>2. In contrast, the permutation test based on a studentized statistic is always asymptotically exact. Finally, under a time-exchangeability assumption, the permutation test is exact in finite samples, both with and without studentization."}, "https://arxiv.org/abs/2408.15964": {"title": "On harmonic oscillator hazard functions", "link": "https://arxiv.org/abs/2408.15964", "description": "arXiv:2408.15964v1 Announce Type: new \nAbstract: We propose a parametric hazard model obtained by enforcing positivity in the damped harmonic oscillator. The resulting model has closed-form hazard and cumulative hazard functions, facilitating likelihood and Bayesian inference on the parameters. We show that this model can capture a range of hazard shapes, such as increasing, decreasing, unimodal, bathtub, and oscillatory patterns, and characterize the tails of the corresponding survival function. We illustrate the use of this model in survival analysis using real data."}, "https://arxiv.org/abs/2408.15979": {"title": "Comparing the Pearson and Spearman Correlation Coefficients Across Distributions and Sample Sizes: A Tutorial Using Simulations and Empirical Data", "link": "https://arxiv.org/abs/2408.15979", "description": "arXiv:2408.15979v1 Announce Type: new \nAbstract: The Pearson product-moment correlation coefficient (rp) and the Spearman rank correlation coefficient (rs) are widely used in psychological research. We compare rp and rs on 3 criteria: variability, bias with respect to the population value, and robustness to an outlier. Using simulations across low (N = 5) to high (N = 1,000) sample sizes we show that, for normally distributed variables, rp and rs have similar expected values but rs is more variable, especially when the correlation is strong. However, when the variables have high kurtosis, rp is more variable than rs. Next, we conducted a sampling study of a psychometric dataset featuring symmetrically distributed data with light tails, and of 2 Likert-type survey datasets, 1 with light-tailed and the other with heavy-tailed distributions. Consistent with the simulations, rp had lower variability than rs in the psychometric dataset. In the survey datasets with heavy-tailed variables in particular, rs had lower variability than rp, and often corresponded more accurately to the population Pearson correlation coefficient (Rp) than rp did. The simulations and the sampling studies showed that variability in terms of standard deviations can be reduced by about 20% by choosing rs instead of rp. In comparison, increasing the sample size by a factor of 2 results in a 41% reduction of the standard deviations of rs and rp. 
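As an illustrative aside on the Pearson/Spearman comparison above: a minimal simulation sketch of the variability comparison, assuming only numpy and scipy are available; the sample size, correlation, and replication count below are arbitrary choices, not those used in the study.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, rho, reps = 50, 0.6, 2000  # illustrative settings only
cov = [[1.0, rho], [rho, 1.0]]
rp_vals, rs_vals = [], []
for _ in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    rp_vals.append(stats.pearsonr(x, y)[0])   # Pearson r_p
    rs_vals.append(stats.spearmanr(x, y)[0])  # Spearman r_s
# Under normality, the abstract expects r_s to show the larger spread.
print(np.std(rp_vals), np.std(rs_vals))

Replacing the normal draws with a heavy-tailed distribution (for example a multivariate t) is the natural way to probe the reversal described above.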
In conclusion, rp is suitable for light-tailed distributions, whereas rs is preferable when variables feature heavy-tailed distributions or when outliers are present, as is often the case in psychological research."}, "https://arxiv.org/abs/2408.15260": {"title": "Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data", "link": "https://arxiv.org/abs/2408.15260", "description": "arXiv:2408.15260v1 Announce Type: cross \nAbstract: Synthetic Data is not new, but recent advances in Generative AI have raised interest in expanding the research toolbox, creating new opportunities and risks. This article provides a taxonomy of the full breadth of the Synthetic Data domain. We discuss its place in the research ecosystem by linking the advances in computational social science with the idea of the Fourth Paradigm of scientific discovery that integrates the elements of the evolution from empirical to theoretic to computational models. Further, leveraging the framework of Truth, Beauty, and Justice, we discuss how evaluation criteria vary across use cases as the information is used to add value and draw insights. Building a framework to organize different types of synthetic data, we end by describing the opportunities and challenges with detailed examples of using Generative AI to create synthetic quantitative and qualitative datasets and discuss the broader spectrum including synthetic populations, expert systems, survey data replacement, and personabots."}, "https://arxiv.org/abs/2408.15387": {"title": "Semiparametric Modelling of Cancer Mortality Trends in Colombia", "link": "https://arxiv.org/abs/2408.15387", "description": "arXiv:2408.15387v1 Announce Type: cross \nAbstract: In this paper, we compare semiparametric and parametric model fits for mortality from breast and cervical cancer in women and from prostate and lung cancer in men, according to age and period of death. Semiparametric models were fitted to the number of deaths from the two localizations of greatest mortality by sex: breast and cervix in women; prostate and lungs in men. Fits of different semiparametric models were compared, which included using different distributions and variable combinations in the parametric and non-parametric parts, for location as well as for scale. Finally, the semiparametric model with the best fit was selected and compared to the traditional model; that is, the generalized linear model with Poisson response and logarithmic link. The best results for the four kinds of cancer were obtained for the selected semiparametric model by comparing it to the traditional Poisson model, based upon AIC and the envelope correlation between the estimated and observed log rates. In general, we observe that the estimated rate increases with age; with respect to period, breast cancer and stomach cancer in men show a tendency to rise over time, whereas the rate for cervical cancer remains virtually constant and that for lung cancer in men tends to decrease as of 2007."}, "https://arxiv.org/abs/2408.15451": {"title": "Certified Causal Defense with Generalizable Robustness", "link": "https://arxiv.org/abs/2408.15451", "description": "arXiv:2408.15451v1 Announce Type: cross \nAbstract: While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense.
Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials."}, "https://arxiv.org/abs/2408.15612": {"title": "Cellwise robust and sparse principal component analysis", "link": "https://arxiv.org/abs/2408.15612", "description": "arXiv:2408.15612v1 Announce Type: cross \nAbstract: A first proposal of a sparse and cellwise robust PCA method is presented. Robustness to single outlying cells in the data matrix is achieved by substituting the squared loss function for the approximation error by a robust version. The integration of a sparsity-inducing $L_1$ or elastic net penalty offers additional modeling flexibility. For the resulting challenging optimization problem, an algorithm based on Riemannian stochastic gradient descent is developed, with the advantage of being scalable to high-dimensional data, both in terms of many variables as well as observations. The resulting method is called SCRAMBLE (Sparse Cellwise Robust Algorithm for Manifold-based Learning and Estimation). Simulations reveal the superiority of this approach in comparison to established methods, both in the casewise and cellwise robustness paradigms. Two applications from the field of tribology underline the advantages of a cellwise robust and sparse PCA method."}, "https://arxiv.org/abs/2305.14265": {"title": "Adapting to Misspecification", "link": "https://arxiv.org/abs/2305.14265", "description": "arXiv:2305.14265v4 Announce Type: replace \nAbstract: Empirical research typically involves a robustness-efficiency tradeoff. A researcher seeking to estimate a scalar parameter can invoke strong assumptions to motivate a restricted estimator that is precise but may be heavily biased, or they can relax some of these assumptions to motivate a more robust, but variable, unrestricted estimator. When a bound on the bias of the restricted estimator is available, it is optimal to shrink the unrestricted estimator towards the restricted estimator. For settings where a bound on the bias of the restricted estimator is unknown, we propose adaptive estimators that minimize the percentage increase in worst case risk relative to an oracle that knows the bound. 
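As a worked aside on the shrinkage idea in the "Adapting to Misspecification" abstract above, under simplifying assumptions not made in that paper (independent estimators, known variances): if Y_U is unbiased for theta with variance v_U, and Y_R has variance v_R and bias bounded in absolute value by B, then the worst-case mean squared error of the combination w Y_R + (1 - w) Y_U is

w^2 (B^2 + v_R) + (1 - w)^2 v_U ,

which is minimized at

w^\ast = \frac{v_U}{B^2 + v_R + v_U}, \qquad \hat{\theta}(w^\ast) = w^\ast Y_R + (1 - w^\ast) Y_U .

As B grows, w^\ast shrinks toward zero and the combination reverts to the unrestricted estimator; the paper's adaptive estimators address the harder case where no bound B is available.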
We show that adaptive estimators solve a weighted convex minimax problem and provide lookup tables facilitating their rapid computation. Revisiting some well known empirical studies where questions of model specification arise, we examine the advantages of adapting to -- rather than testing for -- misspecification."}, "https://arxiv.org/abs/2309.00870": {"title": "Robust estimation for number of factors in high dimensional factor modeling via Spearman correlation matrix", "link": "https://arxiv.org/abs/2309.00870", "description": "arXiv:2309.00870v2 Announce Type: replace \nAbstract: Determining the number of factors in high-dimensional factor modeling is essential but challenging, especially when the data are heavy-tailed. In this paper, we introduce a new estimator based on the spectral properties of Spearman sample correlation matrix under the high-dimensional setting, where both dimension and sample size tend to infinity proportionally. Our estimator is robust against heavy tails in either the common factors or idiosyncratic errors. The consistency of our estimator is established under mild conditions. Numerical experiments demonstrate the superiority of our estimator compared to existing methods."}, "https://arxiv.org/abs/2310.04576": {"title": "Challenges in Statistically Rejecting the Perfect Competition Hypothesis Using Imperfect Competition Data", "link": "https://arxiv.org/abs/2310.04576", "description": "arXiv:2310.04576v4 Announce Type: replace \nAbstract: We theoretically prove why statistically rejecting the null hypothesis of perfect competition is challenging, known as a common problem in the literature. We also assess the finite sample performance of the conduct parameter test in homogeneous goods markets, showing that statistical power increases with the number of markets, a larger conduct parameter, and a stronger demand rotation instrument. However, even with a moderate number of markets and five firms, rejecting the null hypothesis of perfect competition remains difficult, irrespective of instrument strength or the use of optimal instruments. Our findings suggest that empirical results failing to reject perfect competition are due to the limited number of markets rather than methodological shortcomings."}, "https://arxiv.org/abs/2002.09377": {"title": "Misspecification-robust likelihood-free inference in high dimensions", "link": "https://arxiv.org/abs/2002.09377", "description": "arXiv:2002.09377v4 Announce Type: replace-cross \nAbstract: Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy to a useful tool for practitioners. However, models with more than a handful of parameters still generally remain a challenge for the Approximate Bayesian Computation (ABC) based inference. To advance the possibilities for performing likelihood-free inference in higher dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation based approach to approximate discrepancy functions in a probabilistic manner which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher dimensional parameter spaces by using separate acquisition functions and discrepancies for each parameter. The efficient additive acquisition structure is combined with exponentiated loss -likelihood to provide a misspecification-robust characterisation of the marginal posterior distribution for all model parameters. 
The method successfully performs computationally efficient inference in a 100-dimensional space on canonical examples and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space."}, "https://arxiv.org/abs/2204.06943": {"title": "Option Pricing with Time-Varying Volatility Risk Aversion", "link": "https://arxiv.org/abs/2204.06943", "description": "arXiv:2204.06943v3 Announce Type: replace-cross \nAbstract: We introduce a pricing kernel with time-varying volatility risk aversion that can explain the observed time variation in the shape of the pricing kernel. Dynamic volatility risk aversion, combined with the Heston-Nandi GARCH model, leads to a convenient option pricing model, denoted DHNG. The variance risk ratio emerges as a fundamental variable, and we show that it is closely related to economic fundamentals and common measures of sentiment and uncertainty. DHNG yields a closed-form pricing formula for the VIX, and we propose a novel approximation method that provides analytical expressions for option prices. We estimate the model using S&P 500 returns, the VIX, and option prices, and find that dynamic volatility risk aversion leads to a substantial reduction in VIX and option pricing errors."}, "https://arxiv.org/abs/2408.16023": {"title": "Inferring the parameters of Taylor's law in ecology", "link": "https://arxiv.org/abs/2408.16023", "description": "arXiv:2408.16023v1 Announce Type: new \nAbstract: Taylor's power law (TL) or fluctuation scaling has been verified empirically for the abundances of many species, human and non-human, and in many other fields including physics, meteorology, computer science, and finance. TL asserts that the variance is directly proportional to a power of the mean, exactly for population moments and, whether or not population moments exist, approximately for sample moments. In many papers, linear regression of log variance as a function of log mean is used to estimate TL's parameters. We provide some statistical guarantees with large-sample asymptotics for this kind of inference under general conditions, and we derive confidence intervals for the parameters. In many ecological applications, the means and variances are estimated over time or across space from arrays of abundance data collected at different locations and time points. When the ratio between the time-series length and the number of spatial points converges to a constant as both become large, the usual normalized statistics are asymptotically biased. We provide a bias correction to get correct confidence intervals. TL, widely studied in multiple sciences, is a source of challenging new statistical problems in a nonstationary spatiotemporal framework. We illustrate our results with both simulated and real data sets."}, "https://arxiv.org/abs/2408.16039": {"title": "Group Difference in Differences can Identify Effect Heterogeneity in Non-Canonical Settings", "link": "https://arxiv.org/abs/2408.16039", "description": "arXiv:2408.16039v1 Announce Type: new \nAbstract: Consider a very general setting in which data on an outcome of interest is collected in two `groups' at two time periods, with certain group-periods deemed `treated' and others `untreated'. 
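As an aside on the Taylor's law abstract above: the log-log regression estimate of TL's parameters mentioned there can be sketched as follows, a minimal illustration using numpy only, where the abundance array (time points by spatial locations) is hypothetical.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical abundance data: 100 time points at 20 spatial locations.
abundance = rng.poisson(lam=rng.uniform(1, 50, size=20), size=(100, 20))
means = abundance.mean(axis=0)
variances = abundance.var(axis=0, ddof=1)
# TL: variance ~ a * mean^b, i.e. log(variance) = log(a) + b * log(mean).
slope_b, intercept_log_a = np.polyfit(np.log(means), np.log(variances), 1)
print(slope_b, np.exp(intercept_log_a))

The abstract's point is that, in the spatiotemporal setting it considers, the usual normalized statistics for this regression can be asymptotically biased, hence the bias-corrected confidence intervals proposed there.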
A special case is the canonical Difference-in-Differences (DiD) setting in which one group is treated only in the second period while the other is treated in neither period. Then it is well known that under a parallel trends assumption across the two groups the classic DiD formula (the average change in the outcome across periods in the treated group minus the average change in the outcome across periods in the untreated group) identifies the average treatment effect on the treated in the second period. But other relations between group, period, and treatment are possible. For example, the groups might be demographic (or other baseline covariate) categories with all units in both groups treated in the second period and none treated in the first, i.e., a pre-post design. Or one group might be treated in both periods while the other is treated in neither. In these non-canonical settings (lacking a control group or a pre-period), some researchers still compute DiD estimators, while others avoid causal inference altogether. In this paper, we will elucidate the group-period-treatment scenarios and corresponding parallel trends assumptions under which a DiD formula identifies meaningful causal estimands and what those causal estimands are. We find that in non-canonical settings, under a group parallel trends assumption the DiD formula identifies effect heterogeneity in the treated across groups or across time periods (depending on the setting)."}, "https://arxiv.org/abs/2408.16129": {"title": "Direct-Assisted Bayesian Unit-level Modeling for Small Area Estimation of Rare Event Prevalence", "link": "https://arxiv.org/abs/2408.16129", "description": "arXiv:2408.16129v1 Announce Type: new \nAbstract: Small area estimation using survey data can be achieved by using either a design-based or a model-based inferential approach. With respect to assumptions, design-based direct estimators are generally preferable because of their consistency and asymptotic normality. However, when data are sparse at the desired area level, as is often the case when measuring rare events for example, these direct estimators can have extremely large uncertainty, making a model-based approach preferable. A model-based approach with a random spatial effect borrows information from surrounding areas at the cost of inducing shrinkage towards the local average. As a result, estimates may be over-smoothed and inconsistent with design-based estimates at higher area levels when aggregated. We propose a unit-level Bayesian model for small area estimation of rare event prevalence which uses design-based direct estimates at a higher area level to increase accuracy, precision, and consistency in aggregation. After introducing the model and its implementation, we conduct a simulation study to compare its properties to alternative models and apply it to the estimation of the neonatal mortality rate in Zambia, using 2014 DHS data."}, "https://arxiv.org/abs/2408.16153": {"title": "Statistical comparison of quality attributes: a range-based approach", "link": "https://arxiv.org/abs/2408.16153", "description": "arXiv:2408.16153v1 Announce Type: new \nAbstract: A novel approach for comparing quality attributes of different products when there is considerable product-related variability is proposed. In such a case, the whole range of possible realizations must be considered.
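For reference, the classic DiD formula described verbally in the Group Difference-in-Differences abstract above can be written, in notation introduced here only for illustration, as

\hat{\tau}_{\mathrm{DiD}} = \left(\bar{Y}_{\text{treated},\,t=2} - \bar{Y}_{\text{treated},\,t=1}\right) - \left(\bar{Y}_{\text{untreated},\,t=2} - \bar{Y}_{\text{untreated},\,t=1}\right),

which, under the parallel trends assumption across the two groups, identifies the average treatment effect on the treated in the second period in the canonical setting.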
Looking, for example, at the respective information published by agencies like the EMA or the FDA, one can see that commonly accepted tests together with the proper statistical framework are not yet available. This work attempts to close this gap in the treatment of range-based comparisons. The question of when two products can be considered similar with respect to a certain property is discussed and a framework for such a statistical comparison is presented, which is based on the proposed concept of kappa-cover. Assuming normally distributed quality attributes, a statistical test termed covering-test is proposed. Simulations show that this test possesses desirable statistical properties with respect to small sample size and power. In order to demonstrate the usefulness of the suggested concept, the proposed test is applied to a data set from the pharmaceutical industry."}, "https://arxiv.org/abs/2408.16330": {"title": "Sensitivity Analysis for Dynamic Discrete Choice Models", "link": "https://arxiv.org/abs/2408.16330", "description": "arXiv:2408.16330v1 Announce Type: new \nAbstract: In dynamic discrete choice models, some parameters, such as the discount factor, are fixed rather than estimated. This paper proposes two sensitivity analysis procedures for dynamic discrete choice models with respect to the fixed parameters. First, I develop a local sensitivity measure that estimates the change in the target parameter for a unit change in the fixed parameter. This measure is fast to compute as it does not require model re-estimation. Second, I propose a global sensitivity analysis procedure that uses model primitives to study the relationship between target parameters and fixed parameters. I show how to apply the sensitivity analysis procedures of this paper through two empirical applications."}, "https://arxiv.org/abs/2408.16381": {"title": "Uncertainty quantification for intervals", "link": "https://arxiv.org/abs/2408.16381", "description": "arXiv:2408.16381v1 Announce Type: new \nAbstract: Data following an interval structure are increasingly prevalent in many scientific applications. In medicine, clinical events are often monitored between two clinical visits, making the exact time of the event unknown and generating outcomes with a range format. As interest in automating healthcare decisions grows, uncertainty quantification via predictive regions becomes essential for developing reliable and trustworthy predictive algorithms. However, the statistical literature currently lacks a general methodology for interval targets, especially when these outcomes are incomplete due to censoring. We propose an uncertainty quantification algorithm and establish its theoretical properties using empirical process arguments based on a newly developed class of functions specifically designed for interval data structures. Although this paper primarily focuses on deriving predictive regions for interval-censored data, the approach can also be applied to other statistical modeling tasks, such as goodness-of-fit assessments.
Finally, the applicability of the methods developed here is illustrated through various biomedical applications, including two clinical examples: i) sleep time and its link with cardiovascular diseases, and ii) survival time and physical activity values."}, "https://arxiv.org/abs/2408.16384": {"title": "Nonparametric goodness of fit tests for Pareto type-I distribution with complete and censored data", "link": "https://arxiv.org/abs/2408.16384", "description": "arXiv:2408.16384v1 Announce Type: new \nAbstract: Two new goodness of fit tests for the Pareto type-I distribution for complete and right censored data are proposed using a fixed point characterization based on a Stein-type identity. The asymptotic distributions of the test statistics under both the null and alternative hypotheses are obtained. The performance of the proposed tests is evaluated and compared with existing tests through a Monte Carlo simulation experiment. The newly proposed tests exhibit greater power than existing tests for the Pareto type-I distribution. Finally, the methodology is applied to real-world data sets."}, "https://arxiv.org/abs/2408.16485": {"title": "A multiple imputation approach to distinguish curative from life-prolonging effects in the presence of missing covariates", "link": "https://arxiv.org/abs/2408.16485", "description": "arXiv:2408.16485v1 Announce Type: new \nAbstract: Medical advancements have increased cancer survival rates and the possibility of finding a cure. Hence, it is crucial to evaluate the impact of treatments in terms of both curing the disease and prolonging survival. We may use a Cox proportional hazards (PH) cure model to achieve this. However, a significant challenge in applying such a model is the potential presence of partially observed covariates in the data. We aim to refine the methods for imputing partially observed covariates based on multiple imputation and fully conditional specification (FCS) approaches. To be more specific, we consider a more general case, where different covariate vectors are used to model the cure probability and the survival of patients who are not cured. We also propose an approximation of the exact conditional distribution using a regression approach, which helps draw imputed values at a lower computational cost. To assess its effectiveness, we compare the proposed approach with a complete case analysis and an analysis without any missing covariates. We discuss the application of these techniques to a real-world dataset from the BO06 clinical trial on osteosarcoma."}, "https://arxiv.org/abs/2408.16670": {"title": "A Causal Framework for Evaluating Heterogeneous Policy Mechanisms Using Difference-in-Differences", "link": "https://arxiv.org/abs/2408.16670", "description": "arXiv:2408.16670v1 Announce Type: new \nAbstract: In designing and evaluating public policies, policymakers and researchers often hypothesize about the mechanisms through which a policy may affect a population and aim to assess these mechanisms in practice. For example, when studying an excise tax on sweetened beverages, researchers might explore how cross-border shopping, economic competition, and store-level price changes differentially affect store sales. However, many policy evaluation designs, including the difference-in-differences (DiD) approach, traditionally target the average effect of the intervention rather than the underlying mechanisms.
Extensions of these approaches to evaluate policy mechanisms often involve exploratory subgroup analyses or outcome models parameterized by mechanism-specific variables. However, neither approach studies the mechanisms within a causal framework, limiting the analysis to associative relationships between mechanisms and outcomes, which may be confounded by differences among sub-populations exposed to varying levels of the mechanisms. Therefore, rigorous mechanism evaluation requires robust techniques to adjust for confounding and accommodate the interconnected relationship between stores within competitive economic landscapes. In this paper, we present a framework for evaluating policy mechanisms by studying Philadelphia beverage tax. Our approach builds on recent advancements in causal effect curve estimators under DiD designs, offering tools and insights for assessing policy mechanisms complicated by confounding and network interference."}, "https://arxiv.org/abs/2408.16708": {"title": "Effect Aliasing in Observational Studies", "link": "https://arxiv.org/abs/2408.16708", "description": "arXiv:2408.16708v1 Announce Type: new \nAbstract: In experimental design, aliasing of effects occurs in fractional factorial experiments, where certain low order factorial effects are indistinguishable from certain high order interactions: low order contrasts may be orthogonal to one another, while their higher order interactions are aliased and not identified. In observational studies, aliasing occurs when certain combinations of covariates -- e.g., time period and various eligibility criteria for treatment -- perfectly predict the treatment that an individual will receive, so a covariate combination is aliased with a particular treatment. In this situation, when a contrast among several groups is used to estimate a treatment effect, collections of individuals defined by contrast weights may be balanced with respect to summaries of low-order interactions between covariates and treatments, but necessarily not balanced with respect to summaries of high-order interactions between covariates and treatments. We develop a theory of aliasing in observational studies, illustrate that theory in an observational study whose aliasing is more robust than conventional difference-in-differences, and develop a new form of matching to construct balanced confounded factorial designs from observational data."}, "https://arxiv.org/abs/2408.16763": {"title": "Finite Sample Valid Inference via Calibrated Bootstrap", "link": "https://arxiv.org/abs/2408.16763", "description": "arXiv:2408.16763v1 Announce Type: new \nAbstract: While widely used as a general method for uncertainty quantification, the bootstrap method encounters difficulties that raise concerns about its validity in practical applications. This paper introduces a new resampling-based method, termed $\\textit{calibrated bootstrap}$, designed to generate finite sample-valid parametric inference from a sample of size $n$. The central idea is to calibrate an $m$-out-of-$n$ resampling scheme, where the calibration parameter $m$ is determined against inferential pivotal quantities derived from the cumulative distribution functions of loss functions in parameter estimation. The method comprises two algorithms. 
The first, named $\\textit{resampling approximation}$ (RA), employs a $\\textit{stochastic approximation}$ algorithm to find the value of the calibration parameter $m=m_\\alpha$ for a given $\\alpha$ in a manner that ensures the resulting $m$-out-of-$n$ bootstrapped $1-\\alpha$ confidence set is valid. The second algorithm, termed $\\textit{distributional resampling}$ (DR), is developed to further select samples of bootstrapped estimates from the RA step when constructing $1-\\alpha$ confidence sets for a range of $\\alpha$ values is of interest. The proposed method is illustrated and compared to existing methods using linear regression with and without $L_1$ penalty, within the context of a high-dimensional setting and a real-world data application. The paper concludes with remarks on a few open problems worthy of consideration."}, "https://arxiv.org/abs/1001.2055": {"title": "Reversible jump Markov chain Monte Carlo and multi-model samplers", "link": "https://arxiv.org/abs/1001.2055", "description": "arXiv:1001.2055v2 Announce Type: replace \nAbstract: To appear in the second edition of the MCMC handbook, S. P. Brooks, A. Gelman, G. Jones and X.-L. Meng (eds), Chapman & Hall."}, "https://arxiv.org/abs/2304.01098": {"title": "The synthetic instrument: From sparse association to sparse causation", "link": "https://arxiv.org/abs/2304.01098", "description": "arXiv:2304.01098v2 Announce Type: replace \nAbstract: In many observational studies, researchers are often interested in studying the effects of multiple exposures on a single outcome. Standard approaches for high-dimensional data such as the lasso assume the associations between the exposures and the outcome are sparse. These methods, however, do not estimate the causal effects in the presence of unmeasured confounding. In this paper, we consider an alternative approach that assumes the causal effects in view are sparse. We show that with sparse causation, the causal effects are identifiable even with unmeasured confounding. At the core of our proposal is a novel device, called the synthetic instrument, that in contrast to standard instrumental variables, can be constructed using the observed exposures directly. We show that under linear structural equation models, the problem of causal effect estimation can be formulated as an $\\ell_0$-penalization problem, and hence can be solved efficiently using off-the-shelf software. Simulations show that our approach outperforms state-of-art methods in both low-dimensional and high-dimensional settings. We further illustrate our method using a mouse obesity dataset."}, "https://arxiv.org/abs/2306.07181": {"title": "Bayesian estimation of covariate assisted principal regression for brain functional connectivity", "link": "https://arxiv.org/abs/2306.07181", "description": "arXiv:2306.07181v2 Announce Type: replace \nAbstract: This paper presents a Bayesian reformulation of covariate-assisted principal (CAP) regression of Zhao and others (2021), which aims to identify components in the covariance of response signal that are associated with covariates in a regression framework. We introduce a geometric formulation and reparameterization of individual covariance matrices in their tangent space. By mapping the covariance matrices to the tangent space, we leverage Euclidean geometry to perform posterior inference. 
This approach enables joint estimation of all parameters and uncertainty quantification within a unified framework, fusing dimension reduction for covariance matrices with regression model estimation. We validate the proposed method through simulation studies and apply it to analyze associations between covariates and brain functional connectivity, utilizing data from the Human Connectome Project."}, "https://arxiv.org/abs/2310.12000": {"title": "Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models", "link": "https://arxiv.org/abs/2310.12000", "description": "arXiv:2310.12000v2 Announce Type: replace \nAbstract: Latent Gaussian process (GP) models are flexible probabilistic non-parametric function models. Vecchia approximations are accurate approximations for GPs to overcome computational bottlenecks for large data, and the Laplace approximation is a fast method with asymptotic convergence guarantees to approximate marginal likelihoods and posterior predictive distributions for non-Gaussian likelihoods. Unfortunately, the computational complexity of combined Vecchia-Laplace approximations grows faster than linearly in the sample size when used in combination with direct solver methods such as the Cholesky decomposition. Computations with Vecchia-Laplace approximations can thus become prohibitively slow precisely when the approximations are usually the most accurate, i.e., on large data sets. In this article, we present iterative methods to overcome this drawback. Among other things, we introduce and analyze several preconditioners, derive new convergence results, and propose novel methods for accurately approximating predictive variances. We analyze our proposed methods theoretically and in experiments with simulated and real-world data. In particular, we obtain a speed-up of an order of magnitude compared to Cholesky-based calculations and a threefold increase in prediction accuracy in terms of the continuous ranked probability score compared to a state-of-the-art method on a large satellite data set. All methods are implemented in a free C++ software library with high-level Python and R packages."}, "https://arxiv.org/abs/2401.11272": {"title": "Asymptotics for non-degenerate multivariate $U$-statistics with estimated nuisance parameters under the null and local alternative hypotheses", "link": "https://arxiv.org/abs/2401.11272", "description": "arXiv:2401.11272v2 Announce Type: replace-cross \nAbstract: The large-sample behavior of non-degenerate multivariate $U$-statistics of arbitrary degree is investigated under the assumption that their kernel depends on parameters that can be estimated consistently. Mild regularity conditions are given which guarantee that once properly normalized, such statistics are asymptotically multivariate Gaussian both under the null hypothesis and sequences of local alternatives. The work of Randles (1982, \\emph{Ann. Statist.}) is extended in three ways: the data and the kernel values can be multivariate rather than univariate, the limiting behavior under local alternatives is studied for the first time, and the effect of knowing some of the nuisance parameters is quantified. 
These results can be applied to a broad range of goodness-of-fit testing contexts, as shown in two specific examples."}, "https://arxiv.org/abs/2408.16963": {"title": "Nonparametric Density Estimation for Data Scattered on Irregular Spatial Domains: A Likelihood-Based Approach Using Bivariate Penalized Spline Smoothing", "link": "https://arxiv.org/abs/2408.16963", "description": "arXiv:2408.16963v1 Announce Type: new \nAbstract: Accurately estimating data density is crucial for making informed decisions and modeling in various fields. This paper presents a novel nonparametric density estimation procedure that utilizes bivariate penalized spline smoothing over triangulation for data scattered over irregular spatial domains. The approach is likelihood-based with a regularization term that addresses the roughness of the logarithm of density based on a second-order differential operator. The proposed method offers greater efficiency and flexibility in estimating density over complex domains and has been theoretically supported by establishing the asymptotic convergence rate under mild natural conditions. Through extensive simulation studies and a real-world application that analyzes motor vehicle theft data from Portland City, Oregon, we demonstrate the advantages of the proposed method over existing techniques detailed in the literature."}, "https://arxiv.org/abs/2408.17022": {"title": "Non-parametric Monitoring of Spatial Dependence", "link": "https://arxiv.org/abs/2408.17022", "description": "arXiv:2408.17022v1 Announce Type: new \nAbstract: In process monitoring applications, measurements are often taken regularly or randomly from different spatial locations in two or three dimensions. Here, we consider streams of regular, rectangular data sets and use spatial ordinal patterns (SOPs) as a non-parametric approach to detect spatial dependencies. A key feature of our proposed SOP charts is that they are distribution-free and do not require prior Phase-I analysis. We conduct an extensive simulation study, demonstrating the superiority and effectiveness of the proposed charts compared to traditional parametric approaches. We apply the SOP-based control charts to detect heavy rainfall in Germany, war-related fires in (eastern) Ukraine, and manufacturing defects in textile production. The wide range of applications and insights illustrate the broad utility of our non-parametric approach."}, "https://arxiv.org/abs/2408.17040": {"title": "Model-based clustering for covariance matrices via penalized Wishart mixture models", "link": "https://arxiv.org/abs/2408.17040", "description": "arXiv:2408.17040v1 Announce Type: new \nAbstract: Covariance matrices provide a valuable source of information about complex interactions and dependencies within the data. However, from a clustering perspective, this information has often been underutilized and overlooked. Indeed, commonly adopted distance-based approaches tend to rely primarily on mean levels to characterize and differentiate between groups. Recently, there have been promising efforts to cluster covariance matrices directly, thereby distinguishing groups solely based on the relationships between variables. From a model-based perspective, a probabilistic formalization has been provided by considering a mixture model with component densities following a Wishart distribution. Notwithstanding, this approach faces challenges when dealing with a large number of variables, as the number of parameters to be estimated increases quadratically. 
To address this issue, we propose a sparse Wishart mixture model, which assumes that the component scale matrices possess a cluster-dependent degree of sparsity. Model estimation is performed by maximizing a penalized log-likelihood, enforcing a covariance graphical lasso penalty on the component scale matrices. This penalty not only reduces the number of non-zero parameters, mitigating the challenges of high-dimensional settings, but also enhances the interpretability of results by emphasizing the most relevant relationships among variables. The proposed methodology is tested on both simulated and real data, demonstrating its ability to unravel the complexities of neuroimaging data and effectively cluster subjects based on the relational patterns among distinct brain regions."}, "https://arxiv.org/abs/2408.17153": {"title": "Scalable Bayesian Clustering for Integrative Analysis of Multi-View Data", "link": "https://arxiv.org/abs/2408.17153", "description": "arXiv:2408.17153v1 Announce Type: new \nAbstract: In the era of Big Data, scalable and accurate clustering algorithms for high-dimensional data are essential. We present new Bayesian Distance Clustering (BDC) models and inference algorithms with improved scalability while maintaining the predictive accuracy of modern Bayesian non-parametric models. Unlike traditional methods, BDC models the distance between observations rather than the observations directly, offering a compromise between the scalability of distance-based methods and the enhanced predictive power and probabilistic interpretation of model-based methods. However, existing BDC models still rely on performing inference on the partition model to group observations into clusters. The support of this partition model grows exponentially with the dataset's size, complicating posterior space exploration and leading to many costly likelihood evaluations. Inspired by K-medoids, we propose using tessellations in discrete space to simplify inference by focusing the learning task on finding the best tessellation centers, or \"medoids.\" Additionally, we extend our models to effectively handle multi-view data, such as data comprised of clusters that evolve across time, enhancing their applicability to complex datasets. The real data application in numismatics demonstrates the efficacy of our approach."}, "https://arxiv.org/abs/2408.17187": {"title": "State Space Model of Realized Volatility under the Existence of Dependent Market Microstructure Noise", "link": "https://arxiv.org/abs/2408.17187", "description": "arXiv:2408.17187v1 Announce Type: new \nAbstract: Volatility refers to the degree of variation of a stock price and is an important quantity in finance. Realized Volatility (RV) is an estimator of the volatility calculated using high-frequency observed prices. RV has lately attracted considerable attention in econometrics and mathematical finance. However, it is known that high-frequency data includes observation errors called market microstructure noise (MN). Nagakura and Watanabe [2015] proposed a state space model that resolves RV into the true volatility and the influence of MN.
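As a brief aside on the realized volatility abstract above: RV is commonly computed as the sum of squared intraday log returns; a minimal sketch with a hypothetical price vector and no correction for microstructure noise follows.

import numpy as np

# Hypothetical high-frequency prices observed within one trading day.
prices = np.array([100.0, 100.2, 99.9, 100.4, 100.1, 100.6])
log_returns = np.diff(np.log(prices))
realized_variance = np.sum(log_returns ** 2)  # its square root is often reported as RV
print(realized_variance)

The state space decomposition discussed in the abstract then treats this quantity as a noisy measurement of the latent daily volatility.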
In this paper, we assume a dependent MN that is autocorrelated and correlated with returns, as reported by Hansen and Lunde [2006], extend the results of Nagakura and Watanabe [2015], and compare the models using both simulated and actual data."}, "https://arxiv.org/abs/2408.17188": {"title": "A note on promotion time cure models with a new biological consideration", "link": "https://arxiv.org/abs/2408.17188", "description": "arXiv:2408.17188v1 Announce Type: new \nAbstract: We introduce a generalized promotion time cure model motivated by a new biological consideration. The new approach is flexible enough to model heterogeneous survival data, in particular for addressing intra-sample heterogeneity."}, "https://arxiv.org/abs/2408.17205": {"title": "Estimation and inference of average treatment effects under heterogeneous additive treatment effect model", "link": "https://arxiv.org/abs/2408.17205", "description": "arXiv:2408.17205v1 Announce Type: new \nAbstract: Randomized experiments are the gold standard for estimating treatment effects, yet network interference challenges the validity of traditional estimators by violating the stable unit treatment value assumption and introducing bias. While cluster randomized experiments mitigate this bias, they encounter limitations in handling network complexity and fail to distinguish between direct and indirect effects. To address these challenges, we develop a design-based asymptotic theory for the existing Horvitz--Thompson estimators of the direct, indirect, and global average treatment effects under Bernoulli trials. We assume the heterogeneous additive treatment effect model with a hidden network that drives interference. Observing that these estimators are inconsistent in dense networks, we introduce novel eigenvector-based regression adjustment estimators to ensure consistency. We establish the asymptotic normality of the proposed estimators and provide conservative variance estimators under the design-based inference framework, offering robust conclusions independent of the underlying stochastic processes of the network and model parameters. Our method's adaptability is demonstrated across various interference structures, including partial interference and local interference in a two-sided marketplace. Numerical studies further illustrate the efficacy of the proposed estimators, offering practical insights into handling network interference."}, "https://arxiv.org/abs/2408.17257": {"title": "Likelihood estimation for stochastic differential equations with mixed effects", "link": "https://arxiv.org/abs/2408.17257", "description": "arXiv:2408.17257v1 Announce Type: new \nAbstract: Stochastic differential equations provide a powerful and versatile tool for modelling dynamic phenomena affected by random noise. In the case of repeated observations of time series for several experimental units, it is often the case that some of the parameters vary between the individual experimental units, which has motivated a considerable interest in stochastic differential equations with mixed effects, where a subset of the parameters are random. These models enable simultaneous representation of randomness in the dynamics and variability between experimental units. When the data are observations at discrete time points, the likelihood function is only rarely explicitly available, so numerical methods are needed for likelihood-based inference.
We present Gibbs samplers and stochastic EM-algorithms based on the simple methods for simulation of diffusion bridges in Bladt and S{\\o}rensen (2014). These methods are easy to implement and have no tuning parameters. They are, moreover, computationally efficient at low sampling frequencies because the computing time increases linearly with the time between observations. The algorithms are shown to simplify considerably for exponential families of diffusion processes. In a simulation study, the estimation methods are shown to work well for Ornstein-Uhlenbeck processes and t-diffusions with mixed effects. Finally, the Gibbs sampler is applied to neuronal data."}, "https://arxiv.org/abs/2408.17278": {"title": "Incorporating Memory into Continuous-Time Spatial Capture-Recapture Models", "link": "https://arxiv.org/abs/2408.17278", "description": "arXiv:2408.17278v1 Announce Type: new \nAbstract: Obtaining reliable and precise estimates of wildlife species abundance and distribution is essential for the conservation and management of animal populations and natural reserves. Remote sensors such as camera traps are increasingly employed to gather data on uniquely identifiable individuals. Spatial capture-recapture (SCR) models provide estimates of population and spatial density from such data. These models introduce spatial correlation between observations of the same individual through a latent activity center. However SCR models assume that observations are independent over time and space, conditional on their given activity center, so that observed sightings at a given time and location do not influence the probability of being seen at future times and/or locations. With detectors like camera traps, this is ecologically unrealistic given the smooth movement of animals over space through time. We propose a new continuous-time modeling framework that incorporates both an individual's (latent) activity center and (known) previous location and time of detection. We demonstrate that standard SCR models can produce substantially biased density estimates when there is correlation in the times and locations of detections, and that our new model performs substantially better than standard SCR models on data simulated through a movement model as well as in a real camera trap study of American martens where an improvement in model fit is observed when incorporating the observed locations and times of previous observations."}, "https://arxiv.org/abs/2408.17346": {"title": "On Nonparanormal Likelihoods", "link": "https://arxiv.org/abs/2408.17346", "description": "arXiv:2408.17346v1 Announce Type: new \nAbstract: Nonparanormal models describe the joint distribution of multivariate responses via latent Gaussian, and thus parametric, copulae while allowing flexible nonparametric marginals. Some aspects of such distributions, for example conditional independence, are formulated parametrically. Other features, such as marginal distributions, can be formulated non- or semiparametrically. Such models are attractive when multivariate normality is questionable.\n Most estimation procedures perform two steps, first estimating the nonparametric part. The copula parameters come second, treating the marginal estimates as known. This is sufficient for some applications. For other applications, e.g. 
when a semiparametric margin features parameters of interest or when standard errors are important, a simultaneous estimation of all parameters might be more advantageous.\n We present suitable parameterisations of nonparanormal models, possibly including semiparametric effects, and define four novel nonparanormal log-likelihood functions. In general, the corresponding one-step optimization problems are shown to be non-convex. In some cases, however, biconvex problems emerge. Several convex approximations are discussed.\n From a low-level computational point of view, the core contribution is the score function for multivariate normal log-probabilities computed via Genz' procedure. We present transformation discriminant analysis when some biomarkers are subject to limit-of-detection problems as an application and illustrate possible empirical gains in semiparametric efficient polychoric correlation analysis."}, "https://arxiv.org/abs/2408.17385": {"title": "Comparing Propensity Score-Based Methods in Estimating the Treatment Effects: A Simulation Study", "link": "https://arxiv.org/abs/2408.17385", "description": "arXiv:2408.17385v1 Announce Type: new \nAbstract: In observational studies, the recorded treatment assignment is not purely random, but it is influenced by external factors such as patient characteristics, reimbursement policies, and existing guidelines. Therefore, the treatment effect can be estimated only after accounting for confounding factors. Propensity score (PS) methods are a family of methods that is widely used for this purpose. Although they are all based on the estimation of the a posteriori probability of treatment assignment given patient covariates, they estimate the treatment effect from different statistical points of view and are, thus, relatively hard to compare. In this work, we propose a simulation experiment in which a hypothetical cohort of subjects is simulated in seven scenarios of increasing complexity of the associations between covariates and treatment, but where the two main definitions of treatment effect (average treatment effect, ATE, and average effect of the treatment on the treated, ATT) coincide. Our purpose is to compare the performance of a wide array of PS-based methods (matching, stratification, and inverse probability weighting) in estimating the treatment effect and their robustness in different scenarios. We find that inverse probability weighting provides estimates of the treatment effect that are closer to the expected value by weighting all subjects of the starting population. Conversely, matching and stratification ensure that the subpopulation that generated the final estimate is made up of real instances drawn from the starting population, and, thus, provide a higher degree of control on the validity domain of the estimates."}, "https://arxiv.org/abs/2408.17392": {"title": "Dual-criterion Dose Finding Designs Based on Dose-Limiting Toxicity and Tolerability", "link": "https://arxiv.org/abs/2408.17392", "description": "arXiv:2408.17392v1 Announce Type: new \nAbstract: The primary objective of Phase I oncology trials is to assess the safety and tolerability of novel therapeutics. Conventional dose escalation methods identify the maximum tolerated dose (MTD) based on dose-limiting toxicity (DLT). However, as cancer therapies have evolved from chemotherapy to targeted therapies, these traditional methods have become problematic. 
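As an aside on the propensity-score comparison abstract above: the inverse probability weighting estimator it refers to weights every subject by the inverse of the estimated probability of the treatment actually received. A minimal sketch, assuming hypothetical numpy arrays X (covariates), treatment (0/1), and outcome, and using scikit-learn only for the propensity model:

import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, treatment, outcome):
    """Hajek-style inverse probability weighted estimate of the ATE (illustrative only)."""
    ps = LogisticRegression(max_iter=1000).fit(X, treatment).predict_proba(X)[:, 1]
    w_treated = treatment / ps                # weights for treated subjects
    w_control = (1 - treatment) / (1 - ps)    # weights for control subjects
    mu_treated = np.sum(w_treated * outcome) / np.sum(w_treated)
    mu_control = np.sum(w_control * outcome) / np.sum(w_control)
    return mu_treated - mu_control

Matching and stratification, by contrast, estimate the effect on the subset of real units retained by the matching or stratification step, which is the trade-off the abstract highlights.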
Many targeted therapies rarely produce DLT and are administered over multiple cycles, potentially resulting in the accumulation of lower-grade toxicities, which can lead to intolerance, such as dose reduction or interruption. To address this issue, we propose dual-criterion designs that find the MTD based on both DLT and non-DLT-caused intolerance. We consider model-based and model-assisted designs that allow real-time decision-making in the presence of pending data due to long event assessment windows. Compared to DLT-based methods, our approaches exhibit superior operating characteristics when intolerance is the primary driver for determining the MTD and comparable operating characteristics when DLT is the primary driver."}, "https://arxiv.org/abs/2408.17410": {"title": "Family of multivariate extended skew-elliptical distributions: Statistical properties, inference and application", "link": "https://arxiv.org/abs/2408.17410", "description": "arXiv:2408.17410v1 Announce Type: new \nAbstract: In this paper we propose a family of multivariate asymmetric distributions over an arbitrary subset of the set of real numbers, which is defined in terms of the well-known elliptically symmetric distributions. We explore essential properties, including the characterization of the density function for various distribution types, as well as other key aspects such as identifiability, quantiles, stochastic representation, conditional and marginal distributions, moments, Kullback-Leibler Divergence, and parameter estimation. A Monte Carlo simulation study is performed to examine the performance of the developed parameter estimation method. Finally, the proposed models are used to analyze socioeconomic data."}, "https://arxiv.org/abs/2408.17426": {"title": "Weighted Regression with Sybil Networks", "link": "https://arxiv.org/abs/2408.17426", "description": "arXiv:2408.17426v1 Announce Type: new \nAbstract: In many online domains, Sybil networks -- or cases where a single user assumes multiple identities -- are a pervasive feature. This complicates experiments, as off-the-shelf regression estimators at least assume known network topologies (if not fully independent observations), whereas Sybil network topologies in practice are often unknown. The literature has exclusively focused on techniques to detect Sybil networks, leading many experimenters to subsequently exclude suspected networks entirely before estimating treatment effects. I present a more efficient solution in the presence of these suspected Sybil networks: a weighted regression framework that applies weights based on the probabilities that sets of observations are controlled by single actors. I show in the paper that the MSE-minimizing solution is to set the weight matrix equal to the inverse of the expected network topology. I demonstrate the methodology on simulated data, and then I apply the technique to a competition with suspected Sybil networks run on the Sui blockchain and show reductions in the standard error of the estimate by 6 - 24%."}, "https://arxiv.org/abs/2408.16791": {"title": "Multi-faceted Neuroimaging Data Integration via Analysis of Subspaces", "link": "https://arxiv.org/abs/2408.16791", "description": "arXiv:2408.16791v1 Announce Type: cross \nAbstract: Neuroimaging studies, such as the Human Connectome Project (HCP), often collect multi-faceted and multi-block data to study the complex human brain.
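As an aside on the Sybil-network weighted regression abstract above: the prescription of a weight matrix equal to the inverse of the expected network topology takes the familiar generalized least squares form. A minimal sketch, with hypothetical design matrix X, outcome y, and expected-topology matrix:

import numpy as np

def sybil_weighted_ols(X, y, expected_topology):
    """Weighted least squares with the weight matrix set to the inverse of the expected topology."""
    W = np.linalg.inv(expected_topology)          # assumes the expected topology matrix is invertible
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ y)      # (X'WX)^{-1} X'Wy

Ordinary least squares is recovered when the expected topology is the identity, i.e. when no observations are suspected of belonging to a common Sybil actor.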
However, these data are often analyzed in a pairwise fashion, which can hinder our understanding of how different brain-related measures interact with each other. In this study, we comprehensively analyze the multi-block HCP data using the Data Integration via Analysis of Subspaces (DIVAS) method. We integrate structural and functional brain connectivity, substance use, cognition, and genetics in an exhaustive five-block analysis. This gives rise to the important finding that genetics is the single data modality most predictive of brain connectivity, outside of brain connectivity itself. Nearly 14\\% of the variation in functional connectivity (FC) and roughly 12\\% of the variation in structural connectivity (SC) is attributed to shared spaces with genetics. Moreover, investigations of shared space loadings provide interpretable associations between particular brain regions and drivers of variability, such as alcohol consumption in the substance-use data block. Novel Jackstraw hypothesis tests are developed for the DIVAS framework to establish statistically significant loadings. For example, in the (FC, SC, and Substance Use) shared space, these novel hypothesis tests highlight largely negative functional and structural connections suggesting the brain's role in physiological responses to increased substance use. Furthermore, our findings have been validated using a subset of genetically relevant siblings or twins not studied in the main analysis."}, "https://arxiv.org/abs/2408.17087": {"title": "On the choice of the two tuning parameters for nonparametric estimation of an elliptical distribution generator", "link": "https://arxiv.org/abs/2408.17087", "description": "arXiv:2408.17087v1 Announce Type: cross \nAbstract: Elliptical distributions are a simple and flexible class of distributions that depend on a one-dimensional function, called the density generator. In this article, we study the non-parametric estimator of this generator that was introduced by Liebscher (2005). This estimator depends on two tuning parameters: a bandwidth $h$ -- as usual in kernel smoothing -- and an additional parameter $a$ that control the behavior near the center of the distribution. We give an explicit expression for the asymptotic MSE at a point $x$, and derive explicit expressions for the optimal tuning parameters $h$ and $a$. Estimation of the derivatives of the generator is also discussed. A simulation study shows the performance of the new methods."}, "https://arxiv.org/abs/2408.17230": {"title": "cosimmr: an R package for fast fitting of Stable Isotope Mixing Models with covariates", "link": "https://arxiv.org/abs/2408.17230", "description": "arXiv:2408.17230v1 Announce Type: cross \nAbstract: The study of animal diets and the proportional contribution that different foods make to their diets is an important task in ecology. Stable Isotope Mixing Models (SIMMs) are an important tool for studying an animal's diet and understanding how the animal interacts with its environment. We present cosimmr, a new R package designed to include covariates when estimating diet proportions in SIMMs, with simple functions to produce plots and summary statistics. The inclusion of covariates allows for users to perform a more in-depth analysis of their system and to gain new insights into the diets of the organisms being studied. 
A common problem with the previous generation of SIMMs is that they are very slow to produce a posterior distribution of dietary estimates, especially for more complex model structures, such as when covariates are included. The widely-used Markov chain Monte Carlo (MCMC) algorithm used by many traditional SIMMs often requires a very large number of iterations to reach convergence. In contrast, cosimmr uses Fixed Form Variational Bayes (FFVB), which we demonstrate gives up to an order of magnitude speed improvement with no discernible loss of accuracy. We provide a full mathematical description of the model, which includes corrections for trophic discrimination and concentration dependence, and evaluate its performance against the state of the art MixSIAR model. Whilst MCMC is guaranteed to converge to the posterior distribution in the long term, FFVB converges to an approximation of the posterior distribution, which may lead to sub-optimal performance. However we show that the package produces equivalent results in a fraction of the time for all the examples on which we test. The package is designed to be user-friendly and is based on the existing simmr framework."}, "https://arxiv.org/abs/2207.08868": {"title": "Isotonic propensity score matching", "link": "https://arxiv.org/abs/2207.08868", "description": "arXiv:2207.08868v2 Announce Type: replace \nAbstract: We propose a one-to-many matching estimator of the average treatment effect based on propensity scores estimated by isotonic regression. The method relies on the monotonicity assumption on the propensity score function, which can be justified in many applications in economics. We show that the nature of the isotonic estimator can help us to fix many problems of existing matching methods, including efficiency, choice of the number of matches, choice of tuning parameters, robustness to propensity score misspecification, and bootstrap validity. As a by-product, a uniformly consistent isotonic estimator is developed for our proposed matching method."}, "https://arxiv.org/abs/2211.04012": {"title": "A functional regression model for heterogeneous BioGeoChemical Argo data in the Southern Ocean", "link": "https://arxiv.org/abs/2211.04012", "description": "arXiv:2211.04012v2 Announce Type: replace \nAbstract: Leveraging available measurements of our environment can help us understand complex processes. One example is Argo Biogeochemical data, which aims to collect measurements of oxygen, nitrate, pH, and other variables at varying depths in the ocean. We focus on the oxygen data in the Southern Ocean, which has implications for ocean biology and the Earth's carbon cycle. Systematic monitoring of such data has only recently begun to be established, and the data is sparse. In contrast, Argo measurements of temperature and salinity are much more abundant. In this work, we introduce and estimate a functional regression model describing dependence in oxygen, temperature, and salinity data at all depths covered by the Argo data simultaneously. Our model elucidates important aspects of the joint distribution of temperature, salinity, and oxygen. Due to fronts that establish distinct spatial zones in the Southern Ocean, we augment this functional regression model with a mixture component. By modelling spatial dependence in the mixture component and in the data itself, we provide predictions onto a grid and improve location estimates of fronts. 
Our approach is scalable to the size of the Argo data, and we demonstrate its success in cross-validation and a comprehensive interpretation of the model."}, "https://arxiv.org/abs/2301.10468": {"title": "Model selection-based estimation for generalized additive models using mixtures of g-priors: Towards systematization", "link": "https://arxiv.org/abs/2301.10468", "description": "arXiv:2301.10468v4 Announce Type: replace \nAbstract: We explore the estimation of generalized additive models using basis expansion in conjunction with Bayesian model selection. Although Bayesian model selection is useful for regression splines, it has traditionally been applied mainly to Gaussian regression owing to the availability of a tractable marginal likelihood. We extend this method to handle an exponential family of distributions by using the Laplace approximation of the likelihood. Although this approach works well with any Gaussian prior distribution, consensus has not been reached on the best prior for nonparametric regression with basis expansions. Our investigation indicates that the classical unit information prior may not be ideal for nonparametric regression. Instead, we find that mixtures of g-priors are more effective. We evaluate various mixtures of g-priors to assess their performance in estimating generalized additive models. Additionally, we compare several priors for knots to determine the most effective strategy. Our simulation studies demonstrate that model selection-based approaches outperform other Bayesian methods."}, "https://arxiv.org/abs/2311.00878": {"title": "Backward Joint Model for the Dynamic Prediction of Both Competing Risk and Longitudinal Outcomes", "link": "https://arxiv.org/abs/2311.00878", "description": "arXiv:2311.00878v2 Announce Type: replace \nAbstract: Joint modeling is a useful approach to dynamic prediction of clinical outcomes using longitudinally measured predictors. When the outcomes are competing risk events, fitting the conventional shared random effects joint model often involves intensive computation, especially when multiple longitudinal biomarkers are used as predictors, as is often desired in prediction problems. This paper proposes a new joint model for the dynamic prediction of competing risk outcomes. The model factorizes the likelihood into the distribution of the competing risks data and the distribution of longitudinal data given the competing risks data. It extends the basic idea of the recently published backward joint model (BJM) to the competing risk setting, and we call this model crBJM. This model also enables the prediction of future longitudinal data trajectories conditional on being at risk at a future time, a practically important problem that has not been studied in the statistical literature. The model fitting with the EM algorithm is efficient, stable and computationally fast, with a one-dimensional integral in the E-step and convex optimization for most parameters in the M-step, regardless of the number of longitudinal predictors. The model also comes with a consistent albeit less efficient estimation method that can be quickly implemented with standard software, ideal for model building and diagnostics. 
We study the numerical properties of the proposed method using simulations and illustrate its use in a chronic kidney disease study."}, "https://arxiv.org/abs/2312.01162": {"title": "Inference on many jumps in nonparametric panel regression models", "link": "https://arxiv.org/abs/2312.01162", "description": "arXiv:2312.01162v2 Announce Type: replace \nAbstract: We investigate the significance of change-point or jump effects within fully nonparametric regression contexts, with a particular focus on panel data scenarios where data generation processes vary across individual or group units, and error terms may display complex dependency structures. In our setting the threshold effect depends on a specific covariate, and we permit the true nonparametric regression to vary based on additional latent variables. We propose two uniform testing procedures: one to assess the existence of change-point effects and another to evaluate the uniformity of such effects across units. Even though the underlying data generation processes are neither independent nor identically distributed, our approach involves deriving a straightforward analytical expression to approximate the variance-covariance structure of change-point effects under general dependency conditions. Notably, when Gaussian approximations are made to these test statistics, the intricate dependency structures within the data can be safely disregarded owing to the localized nature of the statistics. This finding bears significant implications for obtaining critical values. Through extensive simulations, we demonstrate that our tests exhibit excellent control over size and reasonable power performance in finite samples, irrespective of strong cross-sectional and weak serial dependency within the data. Furthermore, applying our tests to two datasets reveals the existence of significant nonsmooth effects in both cases."}, "https://arxiv.org/abs/2312.13097": {"title": "Power calculation for cross-sectional stepped wedge cluster randomized trials with a time-to-event endpoint", "link": "https://arxiv.org/abs/2312.13097", "description": "arXiv:2312.13097v2 Announce Type: replace \nAbstract: Stepped wedge cluster randomized trials (SW-CRTs) are a form of randomized trial whereby clusters are progressively transitioned from control to intervention, with the timing of transition randomized for each cluster. An important task at the design stage is to ensure that the planned trial has sufficient power to observe a clinically meaningful effect size. While methods for determining study power have been well-developed for SW-CRTs with continuous and binary outcomes, limited methods for power calculation are available for SW-CRTs with censored time-to-event outcomes. In this article, we propose a stratified marginal Cox model to account for secular trend in cross-sectional SW-CRTs, and derive an explicit expression of the robust sandwich variance to facilitate power calculations without the need for computationally intensive simulations. Power formulas based on both the Wald and robust score tests are developed and validated via simulation under different finite-sample scenarios. Finally, we illustrate our methods in the context of a SW-CRT testing the effect of a new electronic reminder system on time to catheter removal in hospital settings. 
We also offer an R Shiny application to facilitate sample size and power calculations using our proposed methods."}, "https://arxiv.org/abs/2302.05955": {"title": "Recursive Estimation of Conditional Kernel Mean Embeddings", "link": "https://arxiv.org/abs/2302.05955", "description": "arXiv:2302.05955v2 Announce Type: replace-cross \nAbstract: Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces."}, "https://arxiv.org/abs/2409.00291": {"title": "Variable selection in the joint frailty model of recurrent and terminal events using Broken Adaptive Ridge regression", "link": "https://arxiv.org/abs/2409.00291", "description": "arXiv:2409.00291v1 Announce Type: new \nAbstract: We introduce a novel method to simultaneously perform variable selection and estimation in the joint frailty model of recurrent and terminal events using the Broken Adaptive Ridge Regression penalty. The BAR penalty can be summarized as an iteratively reweighted squared $L_2$-penalized regression, which approximates the $L_0$-regularization method. Our method allows for the number of covariates to diverge with the sample size. Under certain regularity conditions, we prove that the BAR estimator implemented under the model framework is consistent and asymptotically normally distributed, which are known as the oracle properties in the variable selection literature. In our simulation studies, we compare our proposed method to the Minimum Information Criterion (MIC) method. We apply our method on the Medical Information Mart for Intensive Care (MIMIC-III) database, with the aim of investigating which variables affect the risks of repeated ICU admissions and death during ICU stay."}, "https://arxiv.org/abs/2409.00379": {"title": "Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance", "link": "https://arxiv.org/abs/2409.00379", "description": "arXiv:2409.00379v1 Announce Type: new \nAbstract: Static supervised learning-in which experimental data serves as a training sample for the estimation of an optimal treatment assignment policy-is a commonly assumed framework of policy learning. An arguably more realistic but challenging scenario is a dynamic setting in which the planner performs experimentation and exploitation simultaneously with subjects that arrive sequentially. This paper studies bandit algorithms for learning an optimal individualised treatment assignment policy. 
Specifically, we study the applicability of the EXP4.P (Exponential weighting for Exploration and Exploitation with Experts) algorithm developed by Beygelzimer et al. (2011) to policy learning. Assuming that the class of policies has a finite Vapnik-Chervonenkis dimension and that the number of subjects to be allocated is known, we present a high probability welfare-regret bound of the algorithm. To implement the algorithm, we use an incremental enumeration algorithm for hyperplane arrangements. We perform extensive numerical analysis to assess the algorithm's sensitivity to its tuning parameters and its welfare-regret performance. Further simulation exercises are calibrated to the National Job Training Partnership Act (JTPA) Study sample to determine how the algorithm performs when applied to economic data. Our findings highlight various computational challenges and suggest that the limited welfare gain from the algorithm is due to substantial heterogeneity in causal effects in the JTPA data."}, "https://arxiv.org/abs/2409.00453": {"title": "Bayesian nonparametric mixtures of categorical directed graphs for heterogeneous causal inference", "link": "https://arxiv.org/abs/2409.00453", "description": "arXiv:2409.00453v1 Announce Type: new \nAbstract: Quantifying causal effects of exposures on outcomes, such as a treatment and a disease respectively, is a crucial issue in medical science for the administration of effective therapies. Importantly, any related causal analysis should account for all those variables, e.g. clinical features, that can act as risk factors involved in the occurrence of a disease. In addition, the selection of targeted strategies for therapy administration requires quantifying such treatment effects at the personalized level rather than at the population level. We address these issues by proposing a methodology based on categorical Directed Acyclic Graphs (DAGs), which provide an effective tool to infer causal relationships and causal effects between variables. In addition, we account for population heterogeneity by considering a Dirichlet Process mixture of categorical DAGs, which clusters individuals into homogeneous groups characterized by common causal structures, dependence parameters and causal effects. We develop computational strategies for Bayesian posterior inference, from which a battery of causal effects at subject-specific level is recovered. Our methodology is evaluated through simulations and applied to a dataset of breast cancer patients to investigate cardiotoxic side effects that can be induced by the administered anticancer therapies."}, "https://arxiv.org/abs/2409.00470": {"title": "Examining the robustness of a model selection procedure in the binary latent block model through a language placement test data set", "link": "https://arxiv.org/abs/2409.00470", "description": "arXiv:2409.00470v1 Announce Type: new \nAbstract: When entering a French university, the students' foreign language level is assessed through a placement test. In this work, we model the placement test results using binary latent block models, which allow us to simultaneously form homogeneous groups of students and of items. However, a major difficulty in latent block models is to correctly select the number of groups of rows and the number of groups of columns. The first purpose of this paper is to tune the number of initializations needed to limit the initial values problem in the estimation algorithm in order to propose a model selection procedure in the placement test context. 
Computational studies based on simulated data sets and on two placement test data sets are investigated. The second purpose is to investigate the robustness of the proposed model selection procedure in terms of stability of the students groups when the number of students varies."}, "https://arxiv.org/abs/2409.00679": {"title": "Exact Exploratory Bi-factor Analysis: A Constraint-based Optimisation Approach", "link": "https://arxiv.org/abs/2409.00679", "description": "arXiv:2409.00679v1 Announce Type: new \nAbstract: Bi-factor analysis is a form of confirmatory factor analysis widely used in psychological and educational measurement. The use of a bi-factor model requires the specification of an explicit bi-factor structure on the relationship between the observed variables and the group factors. In practice, the bi-factor structure is sometimes unknown, in which case an exploratory form of bi-factor analysis is needed to find the bi-factor structure. Unfortunately, there are few methods for exploratory bi-factor analysis, with the exception of a rotation-based method proposed in Jennrich and Bentler (2011, 2012). However, this method only finds approximate bi-factor structures, as it does not yield an exact bi-factor loading structure, even after applying hard thresholding. In this paper, we propose a constraint-based optimisation method that learns an exact bi-factor loading structure from data, overcoming the issue with the rotation-based method. The key to the proposed method is a mathematical characterisation of the bi-factor loading structure as a set of equality constraints, which allows us to formulate the exploratory bi-factor analysis problem as a constrained optimisation problem in a continuous domain and solve the optimisation problem with an augmented Lagrangian method. The power of the proposed method is shown via simulation studies and a real data example. Extending the proposed method to exploratory hierarchical factor analysis is also discussed. The codes are available on ``https://anonymous.4open.science/r/Bifactor-ALM-C1E6\"."}, "https://arxiv.org/abs/2409.00817": {"title": "Structural adaptation via directional regularity: rate accelerated estimation in multivariate functional data", "link": "https://arxiv.org/abs/2409.00817", "description": "arXiv:2409.00817v1 Announce Type: new \nAbstract: We introduce directional regularity, a new definition of anisotropy for multivariate functional data. Instead of taking the conventional view which determines anisotropy as a notion of smoothness along a dimension, directional regularity additionally views anisotropy through the lens of directions. We show that faster rates of convergence can be obtained through a change-of-basis by adapting to the directional regularity of a multivariate process. An algorithm for the estimation and identification of the change-of-basis matrix is constructed, made possible due to the unique replication structure of functional data. Non-asymptotic bounds are provided for our algorithm, supplemented by numerical evidence from an extensive simulation study. 
We discuss two possible applications of the directional regularity approach, and advocate its consideration as a standard pre-processing step in multivariate functional data analysis."}, "https://arxiv.org/abs/2409.01017": {"title": "Linear spline index regression model: Interpretability, nonlinearity and dimension reduction", "link": "https://arxiv.org/abs/2409.01017", "description": "arXiv:2409.01017v1 Announce Type: new \nAbstract: Inspired by the complexity of certain real-world datasets, this article introduces a novel flexible linear spline index regression model. The model posits piecewise linear effects of an index on the response, with continuous changes occurring at knots. Significantly, it possesses the interpretability of linear models, captures nonlinear effects similar to nonparametric models, and achieves dimension reduction like single-index models. In addition, the locations and number of knots remain unknown, which further enhances the adaptability of the model in practical applications. We propose a new method that combines penalized approaches and convolution techniques to simultaneously estimate the unknown parameters and determine the number of knots. Noteworthy is that the proposed method allows the number of knots to diverge with the sample size. We demonstrate that the proposed estimators can identify the number of knots with a probability approaching one and estimate the coefficients as efficiently as if the number of knots is known in advance. We also introduce a procedure to test the presence of knots. Simulation studies and two real datasets are employed to assess the finite sample performance of the proposed method."}, "https://arxiv.org/abs/2409.01208": {"title": "Statistical Jump Model for Mixed-Type Data with Missing Data Imputation", "link": "https://arxiv.org/abs/2409.01208", "description": "arXiv:2409.01208v1 Announce Type: new \nAbstract: In this paper, we address the challenge of clustering mixed-type data with temporal evolution by introducing the statistical jump model for mixed-type data. This novel framework incorporates regime persistence, enhancing interpretability and reducing the frequency of state switches, and efficiently handles missing data. The model is easily interpretable through its state-conditional means and modes, making it accessible to practitioners and policymakers. We validate our approach through extensive simulation studies and an empirical application to air quality data, demonstrating its superiority in inferring persistent air quality regimes compared to the traditional air quality index. Our contributions include a robust method for mixed-type temporal clustering, effective missing data management, and practical insights for environmental monitoring."}, "https://arxiv.org/abs/2409.01248": {"title": "Nonparametric Estimation of Path-specific Effects in Presence of Nonignorable Missing Covariates", "link": "https://arxiv.org/abs/2409.01248", "description": "arXiv:2409.01248v1 Announce Type: new \nAbstract: The path-specific effect (PSE) is of primary interest in mediation analysis when multiple intermediate variables between treatment and outcome are observed, as it can isolate the specific effect through each mediator, thus mitigating potential bias arising from other intermediate variables serving as mediator-outcome confounders. 
However, estimation and inference of PSE become challenging in the presence of nonignorable missing covariates, a situation particularly common in epidemiological research involving sensitive patient information. In this paper, we propose a fully nonparametric methodology to address this challenge. We establish identification for PSE by expressing it as a functional of observed data and demonstrate that the associated nuisance functions can be uniquely determined through sequential optimization problems by leveraging a shadow variable. Then we propose a sieve-based regression imputation approach for estimation. We establish the large-sample theory for the proposed estimator, and introduce a robust and efficient approach to make inference for PSE. The proposed method is applied to the NHANES dataset to investigate the mediation roles of dyslipidemia and obesity in the pathway from Type 2 diabetes mellitus to cardiovascular disease."}, "https://arxiv.org/abs/2409.01266": {"title": "Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions", "link": "https://arxiv.org/abs/2409.01266", "description": "arXiv:2409.01266v1 Announce Type: new \nAbstract: Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods."}, "https://arxiv.org/abs/2409.01295": {"title": "Pearson's Correlation under the scope: Assessment of the efficiency of Pearson's correlation to select predictor variables for linear models", "link": "https://arxiv.org/abs/2409.01295", "description": "arXiv:2409.01295v1 Announce Type: new \nAbstract: This article examines the limitations of Pearson's correlation in selecting predictor variables for linear models. Using mtcars and iris datasets from R, this paper demonstrates the limitation of this correlation measure when selecting a proper independent variable to model miles per gallon (mpg) from mtcars data and the petal length from the iris data. 
This paper presents the findings by reporting Pearson's correlation values for two potential predictor variables for each response variable, then builds a linear model to predict the response variable using each predictor variable. The error metrics for each model are then reported to evaluate how reliable Pearson's correlation is in selecting the best predictor variable. The results show that Pearson's correlation can be misleading if used to select the predictor variable to build a linear model for a dependent variable."}, "https://arxiv.org/abs/2409.01444": {"title": "A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions", "link": "https://arxiv.org/abs/2409.01444", "description": "arXiv:2409.01444v1 Announce Type: new \nAbstract: Prediction models inform important clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. However, changes in the distribution of the data impact model performance. In healthcare, a typical change is a shift in case-mix: for example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital.\n This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not.\n A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. This framework provides critical insights for evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task."}, "https://arxiv.org/abs/2409.01521": {"title": "Modelling Volatilities of High-dimensional Count Time Series with Network Structure and Asymmetry", "link": "https://arxiv.org/abs/2409.01521", "description": "arXiv:2409.01521v1 Announce Type: new \nAbstract: Modelling high-dimensional volatilities is a challenging topic, especially for high-dimensional discrete-valued time series data. This paper proposes a threshold spatial GARCH-type model for high-dimensional count data with network structure. The proposed model can simplify the parameterization by making use of the network structure in the data, and can capture the asymmetry in the dynamics of volatilities by adopting a threshold structure. Our model is called the Poisson Threshold Network GARCH model because the conditional distributions are assumed to be Poisson. Asymptotic theory of our maximum likelihood estimator (MLE) for the proposed spatial model is derived when both sample size and network dimension go to infinity. We obtain asymptotic statistical inference by investigating the weak dependence among components of the model and using limit theorems for weakly dependent random fields. 
Simulations are conducted to test the theoretical results, and the model is fitted to real count data as illustration of the proposed methodology."}, "https://arxiv.org/abs/2409.01599": {"title": "Multivariate Inference of Network Moments by Subsampling", "link": "https://arxiv.org/abs/2409.01599", "description": "arXiv:2409.01599v1 Announce Type: new \nAbstract: In this paper, we study the characterization of a network population by analyzing a single observed network, focusing on the counts of multiple network motifs or their corresponding multivariate network moments. We introduce an algorithm based on node subsampling to approximate the nontrivial joint distribution of the network moments, and prove its asymptotic accuracy. By examining the joint distribution of these moments, our approach captures complex dependencies among network motifs, making a significant advancement over earlier methods that rely on individual motifs marginally. This enables more accurate and robust network inference. Through real-world applications, such as comparing coexpression networks of distinct gene sets and analyzing collaboration patterns within the statistical community, we demonstrate that the multivariate inference of network moments provides deeper insights than marginal approaches, thereby enhancing our understanding of network mechanisms."}, "https://arxiv.org/abs/2409.01735": {"title": "Multi-objective Bayesian optimization for Likelihood-Free inference in sequential sampling models of decision making", "link": "https://arxiv.org/abs/2409.01735", "description": "arXiv:2409.01735v1 Announce Type: new \nAbstract: Joint modeling of different data sources in decision-making processes is crucial for understanding decision dynamics in consumer behavior models. Sequential Sampling Models (SSMs), grounded in neuro-cognitive principles, provide a systematic approach to combining information from multi-source data, such as those based on response times and choice outcomes. However, parameter estimation of SSMs is challenging due to the complexity of joint likelihood functions. Likelihood-Free inference (LFI) approaches enable Bayesian inference in complex models with intractable likelihoods, like SSMs, and only require the ability to simulate synthetic data from the model. Extending a popular approach to simulation efficient LFI for single-source data, we propose Multi-objective Bayesian Optimization for Likelihood-Free Inference (MOBOLFI) to estimate the parameters of SSMs calibrated using multi-source data. MOBOLFI models a multi-dimensional discrepancy between observed and simulated data, using a discrepancy for each data source. Multi-objective Bayesian Optimization is then used to ensure simulation efficient approximation of the SSM likelihood. The use of a multivariate discrepancy allows for approximations to individual data source likelihoods in addition to the joint likelihood, enabling both the detection of conflicting information and a deeper understanding of the importance of different data sources in estimating individual SSM parameters. We illustrate the advantages of our approach in comparison with the use of a single discrepancy in a simple synthetic data example and an SSM example with real-world data assessing preferences of ride-hailing drivers in Singapore to rent electric vehicles. 
Although we focus on applications to SSMs, our approach applies to the Likelihood-Free calibration of other models using multi-source data."}, "https://arxiv.org/abs/2409.01794": {"title": "Estimating Joint interventional distributions from marginal interventional data", "link": "https://arxiv.org/abs/2409.01794", "description": "arXiv:2409.01794v1 Announce Type: new \nAbstract: In this paper we show how to exploit interventional data to acquire the joint conditional distribution of all the variables using the Maximum Entropy principle. To this end, we extend the Causal Maximum Entropy method to make use of interventional data in addition to observational data. Using Lagrange duality, we prove that the solution to the Causal Maximum Entropy problem with interventional constraints lies in the exponential family, as in the Maximum Entropy solution. Our method allows us to perform two tasks of interest when marginal interventional distributions are provided for any subset of the variables. First, we show how to perform causal feature selection from a mixture of observational and single-variable interventional data, and, second, how to infer joint interventional distributions. For the former task, we show on synthetically generated data, that our proposed method outperforms the state-of-the-art method on merging datasets, and yields comparable results to the KCI-test which requires access to joint observations of all variables."}, "https://arxiv.org/abs/2409.01874": {"title": "Partial membership models for soft clustering of multivariate football player performance data", "link": "https://arxiv.org/abs/2409.01874", "description": "arXiv:2409.01874v1 Announce Type: new \nAbstract: The standard mixture modelling framework has been widely used to study heterogeneous populations, by modelling them as being composed of a finite number of homogeneous sub-populations. However, the standard mixture model assumes that each data point belongs to one and only one mixture component, or cluster, but when data points have fractional membership in multiple clusters this assumption is unrealistic. It is in fact conceptually very different to represent an observation as partly belonging to multiple groups instead of belonging to one group with uncertainty. For this purpose, various soft clustering approaches, or individual-level mixture models, have been developed. In this context, Heller et al (2008) formulated the Bayesian partial membership model (PM) as an alternative structure for individual-level mixtures, which also captures partial membership in the form of attribute specific mixtures, but does not assume a factorization over attributes. Our work proposes using the PM for soft clustering of count data arising in football performance analysis and compare the results with those achieved with the mixed membership model and finite mixture model. Learning and inference are carried out using Markov chain Monte Carlo methods. The method is applied on Serie A football player data from the 2022/2023 football season, to estimate the positions on the field where the players tend to play, in addition to their primary position, based on their playing style. The application of partial membership model to football data could have practical implications for coaches, talent scouts, team managers and analysts. 
These stakeholders can utilize the findings to make informed decisions related to team strategy, talent acquisition, and statistical research, ultimately enhancing performance and understanding in the field of football."}, "https://arxiv.org/abs/2409.01908": {"title": "Bayesian CART models for aggregate claim modeling", "link": "https://arxiv.org/abs/2409.01908", "description": "arXiv:2409.01908v1 Announce Type: new \nAbstract: This paper proposes three types of Bayesian CART (or BCART) models for aggregate claim amount, namely, frequency-severity models, sequential models and joint models. We propose a general framework for the BCART models applicable to data with multivariate responses, which is particularly useful for the joint BCART models with a bivariate response: the number of claims and aggregate claim amount. To facilitate frequency-severity modeling, we investigate BCART models for the right-skewed and heavy-tailed claim severity data by using various distributions. We discover that the Weibull distribution is superior to gamma and lognormal distributions, due to its ability to capture different tail characteristics in tree models. Additionally, we find that sequential BCART models and joint BCART models, which incorporate dependence between the number of claims and average severity, are beneficial and thus preferable to the frequency-severity BCART models in which independence is assumed. The performance of these models is illustrated by carefully designed simulations and real insurance data."}, "https://arxiv.org/abs/2409.01911": {"title": "Variable selection in convex nonparametric least squares via structured Lasso: An application to the Swedish electricity market", "link": "https://arxiv.org/abs/2409.01911", "description": "arXiv:2409.01911v1 Announce Type: new \nAbstract: We study the problem of variable selection in convex nonparametric least squares (CNLS). Whereas the least absolute shrinkage and selection operator (Lasso) is a popular technique for least squares, its variable selection performance is unknown in CNLS problems. In this work, we investigate the performance of the Lasso CNLS estimator and find that it is usually unable to select variables efficiently. Exploiting the unique structure of the subgradients in CNLS, we develop a structured Lasso by combining the $\\ell_1$-norm and the $\\ell_{\\infty}$-norm. To improve its predictive performance, we propose a relaxed version of the structured Lasso where we can control the two effects--variable selection and model shrinkage--using an additional tuning parameter. A Monte Carlo study is conducted to verify the finite sample performance of the proposed approaches. In the application to Swedish electricity distribution networks, when the regression model is assumed to be semi-nonparametric, our methods are extended to the doubly penalized CNLS estimators. 
The results from the simulation and application confirm that the proposed structured Lasso performs favorably, generally leading to sparser and more accurate predictive models, relative to the other variable selection methods in the literature."}, "https://arxiv.org/abs/2409.01926": {"title": "$Q_B$ Optimal Two-Level Designs for the Baseline Parameterization", "link": "https://arxiv.org/abs/2409.01926", "description": "arXiv:2409.01926v1 Announce Type: new \nAbstract: We have established the association matrix that expresses the estimator of effects under the baseline parameterization, which has been considered in some recent literature, in an equivalent form as a linear combination of estimators of effects under the traditional centered parameterization. This allows the generalization of the $Q_B$ criterion, which evaluates designs under model uncertainty in the traditional centered parameterization, to be applicable to the baseline parameterization. Some optimal designs under the baseline parameterization seen in the previous literature are evaluated and it is shown that, at a given prior probability of a main effect being in the best model, the design converges to $Q_B$ optimality as the probability of an interaction being in the best model converges to 0 from above. The $Q_B$ optimal designs for two setups of factors and run sizes at various priors are found by an extended coordinate exchange algorithm and the evaluation of their performance is discussed. Comparisons have been made to those optimal designs restricted to level balance and orthogonality conditions."}, "https://arxiv.org/abs/2409.01943": {"title": "Spatially-dependent Indian Buffet Processes", "link": "https://arxiv.org/abs/2409.01943", "description": "arXiv:2409.01943v1 Announce Type: new \nAbstract: We develop a new stochastic process called spatially-dependent Indian buffet processes (SIBP) for spatially correlated binary matrices and propose general spatial factor models for various multivariate response variables. We introduce spatial dependency through the stick-breaking representation of the original Indian buffet process (IBP) and a latent Gaussian process for the logit-transformed breaking proportion to capture underlying spatial correlation. We show that the marginal limiting properties of the number of non-zero entries under SIBP are the same as those in the original IBP, while the joint probability is affected by the spatial correlation. Using binomial expansion and Polya-gamma data augmentation, we provide a novel Gibbs sampling algorithm for posterior computation. The usefulness of the SIBP is demonstrated through simulation studies and two applications for large-dimensional multinomial data of areal dialects and geographical distribution of multiple tree species."}, "https://arxiv.org/abs/2409.01983": {"title": "Formalizing the causal interpretation in accelerated failure time models with unmeasured heterogeneity", "link": "https://arxiv.org/abs/2409.01983", "description": "arXiv:2409.01983v1 Announce Type: new \nAbstract: In the presence of unmeasured heterogeneity, the hazard ratio for exposure has a complex causal interpretation. To address this, accelerated failure time (AFT) models, which assess the effect on the survival time ratio scale, are often suggested as a better alternative. AFT models also allow for straightforward confounder adjustment. 
In this work, we formalize the causal interpretation of the acceleration factor in AFT models using structural causal models and data under independent censoring. We prove that the acceleration factor is a valid causal effect measure, even in the presence of frailty and treatment effect heterogeneity. Through simulations, we show that the acceleration factor better captures the causal effect than the hazard ratio when both AFT and proportional hazards models apply. Additionally, we extend the interpretation to systems with time-dependent acceleration factors, revealing the challenge of distinguishing between a time-varying homogeneous effect and unmeasured heterogeneity. While the causal interpretation of acceleration factors is promising, we caution practitioners about potential challenges in estimating these factors in the presence of effect heterogeneity."}, "https://arxiv.org/abs/2409.02087": {"title": "Objective Weights for Scoring: The Automatic Democratic Method", "link": "https://arxiv.org/abs/2409.02087", "description": "arXiv:2409.02087v1 Announce Type: new \nAbstract: When comparing performance (of products, services, entities, etc.), multiple attributes are involved. This paper deals with a way of weighting these attributes when one is seeking an overall score. It presents an objective approach to generating the weights in a scoring formula which avoids personal judgement. The first step is to find the maximum possible score for each assessed entity. These upper bound scores are found using Data Envelopment Analysis. In the second step the weights in the scoring formula are found by regressing the unique DEA scores on the attribute data. Reasons for using least squares and avoiding other distance measures are given. The method is tested on data where the true scores and weights are known. The method enables the construction of an objective scoring formula which has been generated from the data arising from all assessed entities and is, in that sense, democratic."}, "https://arxiv.org/abs/2409.00013": {"title": "CEopt: A MATLAB Package for Non-convex Optimization with the Cross-Entropy Method", "link": "https://arxiv.org/abs/2409.00013", "description": "arXiv:2409.00013v1 Announce Type: cross \nAbstract: This paper introduces CEopt (https://ceopt.org), a MATLAB tool leveraging the Cross-Entropy method for non-convex optimization. Due to the relative simplicity of the algorithm, it provides a kind of transparent ``gray-box'' optimization solver, with intuitive control parameters. Unique in its approach, CEopt effectively handles both equality and inequality constraints using an augmented Lagrangian method, offering robustness and scalability for moderately sized complex problems. Through select case studies, the package's applicability and effectiveness in various optimization scenarios are showcased, marking CEopt as a practical addition to optimization research and application toolsets."}, "https://arxiv.org/abs/2409.00417": {"title": "Learning linear acyclic causal model including Gaussian noise using ancestral relationships", "link": "https://arxiv.org/abs/2409.00417", "description": "arXiv:2409.00417v1 Announce Type: cross \nAbstract: This paper discusses algorithms for learning causal DAGs. The PC algorithm makes no assumptions other than the faithfulness to the causal model and can identify only up to the Markov equivalence class. 
LiNGAM assumes linearity and continuous non-Gaussian disturbances for the causal model, and the causal DAG defining LiNGAM is shown to be fully identifiable. The PC-LiNGAM, a hybrid of the PC algorithm and LiNGAM, can identify up to the distribution-equivalence pattern of a linear causal model, even in the presence of Gaussian disturbances. However, in the worst case, the PC-LiNGAM has factorial time complexity for the number of variables. In this paper, we propose an algorithm for learning the distribution-equivalence patterns of a linear causal model with a lower time complexity than PC-LiNGAM, using the causal ancestor finding algorithm in Maeda and Shimizu, which is generalized to account for Gaussian disturbances."}, "https://arxiv.org/abs/2409.00582": {"title": "CRUD-Capable Mobile Apps with R and shinyMobile: a Case Study in Rapid Prototyping", "link": "https://arxiv.org/abs/2409.00582", "description": "arXiv:2409.00582v1 Announce Type: cross \nAbstract: \"Harden\" is a Progressive Web Application (PWA) for Ecological Momentary Assessment (EMA) developed mostly in R, which runs on all platforms with an internet connection, including iOS and Android. It leverages the shinyMobile package for creating a reactive mobile user interface (UI), PostgreSQL for the database backend, and Google Cloud Run for scalable hosting in the cloud, with serverless execution. Using this technology stack, it was possible to rapidly prototype a fully CRUD-capable (Create, Read, Update, Delete) mobile app, with persistent user data across sessions, interactive graphs, and real-time statistical calculation. This framework is compared with current alternative frameworks for creating data science apps; it is argued that the shinyMobile package provides one of the most efficient methods for rapid prototyping and creation of statistical mobile apps that require advanced graphing capabilities. This paper outlines the methodology used to create the Harden application, and discusses the advantages and limitations of the shinyMobile approach to app development. It is hoped that this information will encourage other programmers versed in R to consider developing mobile apps with this framework."}, "https://arxiv.org/abs/2409.00704": {"title": "Stochastic Monotonicity and Random Utility Models: The Good and The Ugly", "link": "https://arxiv.org/abs/2409.00704", "description": "arXiv:2409.00704v1 Announce Type: cross \nAbstract: When it comes to structural estimation of risk preferences from data on choices, random utility models have long been one of the standard research tools in economics. A recent literature has challenged these models, pointing out some concerning monotonicity and, thus, identification problems. In this paper, we take a second look and point out that some of the criticism - while extremely valid - may have gone too far, demanding monotonicity of choice probabilities in decisions where it is not so clear whether it should be imposed. We introduce a new class of random utility models based on carefully constructed generalized risk premia which always satisfy our relaxed monotonicity criteria. Moreover, we show that some of the models used in applied research like the certainty-equivalent-based random utility model for CARA utility actually lie in this class of monotonic stochastic choice models. 
We conclude that not all random utility models are bad."}, "https://arxiv.org/abs/2409.00908": {"title": "EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification", "link": "https://arxiv.org/abs/2409.00908", "description": "arXiv:2409.00908v1 Announce Type: cross \nAbstract: Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely \\textsc{EnsLoss}, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration on preserving the ``legitimacy'' of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Therefore, inspired by Dropout, \\textsc{EnsLoss} enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of \\textsc{EnsLoss} compared to fixed loss methods is demonstrated through experiments on a broad range of 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. Python repository and source code are available on \\textsc{GitHub} at \\url{https://github.com/statmlben/rankseg}."}, "https://arxiv.org/abs/2409.01220": {"title": "Simultaneous Inference for Non-Stationary Random Fields, with Application to Gridded Data Analysis", "link": "https://arxiv.org/abs/2409.01220", "description": "arXiv:2409.01220v1 Announce Type: cross \nAbstract: Current statistics literature on statistical inference of random fields typically assumes that the fields are stationary or focuses on models of non-stationary Gaussian fields with parametric/semiparametric covariance families, which may not be sufficiently flexible to tackle complex modern-era random field data. This paper performs simultaneous nonparametric statistical inference for a general class of non-stationary and non-Gaussian random fields by modeling the fields as nonlinear systems with location-dependent transformations of an underlying `shift random field'. Asymptotic results, including concentration inequalities and Gaussian approximation theorems for high dimensional sparse linear forms of the random field, are derived. A computationally efficient locally weighted multiplier bootstrap algorithm is proposed and theoretically verified as a unified tool for the simultaneous inference of the aforementioned non-stationary non-Gaussian random field. 
Simulations and real-life data examples demonstrate good performances and broad applications of the proposed algorithm."}, "https://arxiv.org/abs/2409.01243": {"title": "Sample Complexity of the Sign-Perturbed Sums Method", "link": "https://arxiv.org/abs/2409.01243", "description": "arXiv:2409.01243v1 Announce Type: cross \nAbstract: We study the sample complexity of the Sign-Perturbed Sums (SPS) method, which constructs exact, non-asymptotic confidence regions for the true system parameters under mild statistical assumptions, such as independent and symmetric noise terms. The standard version of SPS deals with linear regression problems, however, it can be generalized to stochastic linear (dynamical) systems, even with closed-loop setups, and to nonlinear and nonparametric problems, as well. Although the strong consistency of the method was rigorously proven, the sample complexity of the algorithm was only analyzed so far for scalar linear regression problems. In this paper we study the sample complexity of SPS for general linear regression problems. We establish high probability upper bounds for the diameters of SPS confidence regions for finite sample sizes and show that the SPS regions shrink at the same, optimal rate as the classical asymptotic confidence ellipsoids. Finally, the difference between the theoretical bounds and the empirical sizes of SPS confidence regions is investigated experimentally."}, "https://arxiv.org/abs/2409.01464": {"title": "Stein transport for Bayesian inference", "link": "https://arxiv.org/abs/2409.01464", "description": "arXiv:2409.01464v1 Announce Type: cross \nAbstract: We introduce $\\textit{Stein transport}$, a novel methodology for Bayesian inference designed to efficiently push an ensemble of particles along a predefined curve of tempered probability distributions. The driving vector field is chosen from a reproducing kernel Hilbert space and can be derived either through a suitable kernel ridge regression formulation or as an infinitesimal optimal transport map in the Stein geometry. The update equations of Stein transport resemble those of Stein variational gradient descent (SVGD), but introduce a time-varying score function as well as specific weights attached to the particles. While SVGD relies on convergence in the long-time limit, Stein transport reaches its posterior approximation at finite time $t=1$. Studying the mean-field limit, we discuss the errors incurred by regularisation and finite-particle effects, and we connect Stein transport to birth-death dynamics and Fisher-Rao gradient flows. In a series of experiments, we show that in comparison to SVGD, Stein transport not only often reaches more accurate posterior approximations with a significantly reduced computational budget, but that it also effectively mitigates the variance collapse phenomenon commonly observed in SVGD."}, "https://arxiv.org/abs/2409.01570": {"title": "Smoothed Robust Phase Retrieval", "link": "https://arxiv.org/abs/2409.01570", "description": "arXiv:2409.01570v1 Announce Type: cross \nAbstract: The phase retrieval problem in the presence of noise aims to recover the signal vector of interest from a set of quadratic measurements with infrequent but arbitrary corruptions, and it plays an important role in many scientific applications. 
However, the essential geometric structure of the nonconvex robust phase retrieval based on the $\\ell_1$-loss is largely unknown to study spurious local solutions, even under the ideal noiseless setting, and its intrinsic nonsmooth nature also impacts the efficiency of optimization algorithms. This paper introduces the smoothed robust phase retrieval (SRPR) based on a family of convolution-type smoothed loss functions. Theoretically, we prove that the SRPR enjoys a benign geometric structure with high probability: (1) under the noiseless situation, the SRPR has no spurious local solutions, and the target signals are global solutions, and (2) under the infrequent but arbitrary corruptions, we characterize the stationary points of the SRPR and prove its benign landscape, which is the first landscape analysis of phase retrieval with corruption in the literature. Moreover, we prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless situation. Experiments on both simulated datasets and image recovery are provided to demonstrate the numerical performance of the SRPR."}, "https://arxiv.org/abs/2409.02086": {"title": "Taming Randomness in Agent-Based Models using Common Random Numbers", "link": "https://arxiv.org/abs/2409.02086", "description": "arXiv:2409.02086v1 Announce Type: cross \nAbstract: Random numbers are at the heart of every agent-based model (ABM) of health and disease. By representing each individual in a synthetic population, agent-based models enable detailed analysis of intervention impact and parameter sensitivity. Yet agent-based modeling has a fundamental signal-to-noise problem, in which small differences between simulations cannot be reliably differentiated from stochastic noise resulting from misaligned random number realizations. We introduce a novel methodology that eliminates noise due to misaligned random numbers, a first for agent-based modeling. Our approach enables meaningful individual-level analysis between ABM scenarios because all differences are driven by mechanistic effects rather than random number noise. A key result is that many fewer simulations are needed for some applications. We demonstrate the benefits of our approach on three disparate examples and discuss limitations."}, "https://arxiv.org/abs/2011.11558": {"title": "Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis", "link": "https://arxiv.org/abs/2011.11558", "description": "arXiv:2011.11558v3 Announce Type: replace \nAbstract: $n$-gram profiles have been successfully and widely used to analyse long sequences of potentially differing lengths for clustering or classification. Mainly, machine learning algorithms have been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. A novel class of Bayesian generative models designed for $n$-gram profiles used as binary attributes have been designed to address this. The flexibility of the proposed modelling allows to consider a straightforward approach to feature selection in the generative model. 
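The common-random-numbers idea from the agent-based-modeling abstract above can be sketched by giving every agent a dedicated random stream keyed by its id, so that two scenarios consume exactly the same draws. The toy epidemic loop below is a generic illustration of aligned random numbers, not the paper's methodology.

```python
import numpy as np

def run_scenario(n_agents, n_days, infect_prob, master_seed=2024):
    """Toy epidemic: each agent gets an RNG stream keyed by (master_seed, agent id),
    so two scenarios differing only in `infect_prob` see identical random draws."""
    infections = np.zeros(n_agents, dtype=int)
    for agent in range(n_agents):
        agent_rng = np.random.default_rng([master_seed, agent])  # per-agent stream
        draws = agent_rng.random(n_days)
        infections[agent] = np.sum(draws < infect_prob)
    return infections

baseline     = run_scenario(1000, 30, infect_prob=0.02)
intervention = run_scenario(1000, 30, infect_prob=0.01)
# Because the draws are aligned, every difference is a mechanistic effect of the
# lower infection probability rather than random-number noise.
print((intervention <= baseline).all(), baseline.sum() - intervention.sum())
```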
Furthermore, a slice sampling algorithm is derived for a fast inferential procedure, which is applied to synthetic and real data scenarios and shows that feature selection can improve classification accuracy."}, "https://arxiv.org/abs/2205.01016": {"title": "Evidence Estimation in Gaussian Graphical Models Using a Telescoping Block Decomposition of the Precision Matrix", "link": "https://arxiv.org/abs/2205.01016", "description": "arXiv:2205.01016v4 Announce Type: replace \nAbstract: Marginal likelihood, also known as model evidence, is a fundamental quantity in Bayesian statistics. It is used for model selection using Bayes factors or for empirical Bayes tuning of prior hyper-parameters. Yet, the calculation of evidence has remained a longstanding open problem in Gaussian graphical models. Currently, the only feasible solutions that exist are for special cases such as the Wishart or G-Wishart, in moderate dimensions. We develop an approach based on a novel telescoping block decomposition of the precision matrix that allows the estimation of evidence by application of Chib's technique under a very broad class of priors under mild requirements. Specifically, the requirements are: (a) the priors on the diagonal terms on the precision matrix can be written as gamma or scale mixtures of gamma random variables and (b) those on the off-diagonal terms can be represented as normal or scale mixtures of normal. This includes structured priors such as the Wishart or G-Wishart, and more recently introduced element-wise priors, such as the Bayesian graphical lasso and the graphical horseshoe. Among these, the true marginal is known in an analytically closed form for Wishart, providing a useful validation of our approach. For the general setting of the other three, and several more priors satisfying conditions (a) and (b) above, the calculation of evidence has remained an open question that this article resolves under a unifying framework."}, "https://arxiv.org/abs/2207.11686": {"title": "Inference for linear functionals of high-dimensional longitudinal proteomics data using generalized estimating equations", "link": "https://arxiv.org/abs/2207.11686", "description": "arXiv:2207.11686v3 Announce Type: replace \nAbstract: Regression analysis of correlated data, where multiple correlated responses are recorded on the same unit, is ubiquitous in many scientific areas. With the advent of new technologies, in particular high-throughput omics profiling assays, such correlated data increasingly consist of large number of variables compared with the available sample size. Motivated by recent longitudinal proteomics studies of COVID-19, we propose a novel inference procedure for linear functionals of high-dimensional regression coefficients in generalized estimating equations, which are widely used to analyze correlated data. Our estimator for this more general inferential target, obtained via constructing projected estimating equations, is shown to be asymptotically normally distributed under mild regularity conditions. We also introduce a data-driven cross-validation procedure to select the tuning parameter for estimating the projection direction, which is not addressed in the existing procedures. 
We illustrate the utility of the proposed procedure in providing confidence intervals for associations of individual proteins and severe COVID risk scores obtained based on high-dimensional proteomics data, and demonstrate its robust finite-sample performance, especially in estimation bias and confidence interval coverage, via extensive simulations."}, "https://arxiv.org/abs/2207.12804": {"title": "Large-Scale Low-Rank Gaussian Process Prediction with Support Points", "link": "https://arxiv.org/abs/2207.12804", "description": "arXiv:2207.12804v2 Announce Type: replace \nAbstract: Low-rank approximation is a popular strategy to tackle the \"big n problem\" associated with large-scale Gaussian process regressions. Basis functions for developing low-rank structures are crucial and should be carefully specified. Predictive processes simplify the problem by inducing basis functions with a covariance function and a set of knots. The existing literature suggests certain practical implementations of knot selection and covariance estimation; however, theoretical foundations explaining the influence of these two factors on predictive processes are lacking. In this paper, the asymptotic prediction performance of the predictive process and Gaussian process predictions is derived and the impacts of the selected knots and estimated covariance are studied. We suggest the use of support points as knots, which best represent data locations. Extensive simulation studies demonstrate the superiority of support points and verify our theoretical results. Real data of precipitation and ozone are used as examples, and the efficiency of our method over other widely used low-rank approximation methods is verified."}, "https://arxiv.org/abs/2212.12041": {"title": "Estimating network-mediated causal effects via principal components network regression", "link": "https://arxiv.org/abs/2212.12041", "description": "arXiv:2212.12041v3 Announce Type: replace \nAbstract: We develop a method to decompose causal effects on a social network into an indirect effect mediated by the network, and a direct effect independent of the social network. To handle the complexity of network structures, we assume that latent social groups act as causal mediators. We develop principal components network regression models to differentiate the social effect from the non-social effect. Fitting the regression models is as simple as principal components analysis followed by ordinary least squares estimation. We prove asymptotic theory for regression coefficients from this procedure and show that it is widely applicable, allowing for a variety of distributions on the regression errors and network edges. We carefully characterize the counterfactual assumptions necessary to use the regression models for causal inference, and show that current approaches to causal network regression may result in over-control bias. 
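Because the predictive-process approximation described above is fully determined by a covariance function and a set of knots, a short sketch of the resulting low-rank regression may be useful. The squared-exponential kernel and the equally spaced knots below stand in for the estimated covariance and support-point knots studied in the paper.

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.2, var=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / length**2)

def predictive_process_fit(x, y, knots, noise_var=0.05, jitter=1e-8):
    """Low-rank GP regression: replace K(x, x) by K(x, m) K(m, m)^{-1} K(m, x)."""
    K_nm = sq_exp_kernel(x, knots)
    K_mm = sq_exp_kernel(knots, knots) + jitter * np.eye(len(knots))
    # Posterior mean via the m-dimensional knot space only (m << n)
    A = noise_var * K_mm + K_nm.T @ K_nm
    alpha = np.linalg.solve(A, K_nm.T @ y)          # m-vector of basis weights
    def predict(x_new):
        return sq_exp_kernel(x_new, knots) @ alpha
    return predict

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)
knots = np.linspace(0, 1, 25)                       # stand-in for support-point knots
predict = predictive_process_fit(x, y, knots)
print(np.round(predict(np.array([0.1, 0.5, 0.9])), 2))
```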
The structure of our method is very general, so that it is applicable to many types of structured data beyond social networks, such as text, areal data, psychometrics, images and omics."}, "https://arxiv.org/abs/2301.06742": {"title": "Robust Realized Integrated Beta Estimator with Application to Dynamic Analysis of Integrated Beta", "link": "https://arxiv.org/abs/2301.06742", "description": "arXiv:2301.06742v2 Announce Type: replace \nAbstract: In this paper, we develop a robust non-parametric realized integrated beta estimator using high-frequency financial data contaminated by microstructure noises, which is robust to the stylized features, such as the time-varying beta and the dependence structure of microstructure noises. With this robust realized integrated beta estimator, we investigate dynamic structures of integrated betas and find an auto-regressive--moving-average (ARMA) structure. To model this dynamic structure, we utilize the ARMA model for daily integrated market betas. We call this the dynamic realized beta (DR Beta). We further introduce a high-frequency data generating process by filling the gap between the high-frequency-based non-parametric estimator and low-frequency dynamic structure. Then, we propose a quasi-likelihood procedure for estimating the model parameters with the robust realized integrated beta estimator as the proxy. We also establish asymptotic theorems for the proposed estimator and conduct a simulation study to check the performance of finite samples of the estimator. The empirical study with the S&P 500 index and the top 50 large trading volume stocks from the S&P 500 illustrates that the proposed DR Beta model with the robust realized beta estimator effectively accounts for dynamics in the market beta of individual stocks and better predicts future market betas."}, "https://arxiv.org/abs/2306.02360": {"title": "Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors", "link": "https://arxiv.org/abs/2306.02360", "description": "arXiv:2306.02360v2 Announce Type: replace \nAbstract: Dirichlet process mixtures are particularly sensitive to the value of the precision parameter controlling the behavior of the latent partition. Randomization of the precision through a prior distribution is a common solution, which leads to more robust inferential procedures. However, existing prior choices do not allow for transparent elicitation, due to the lack of analytical results. We introduce and investigate a novel prior for the Dirichlet process precision, the Stirling-gamma distribution. We study the distributional properties of the induced random partition, with an emphasis on the number of clusters. Our theoretical investigation clarifies the reasons of the improved robustness properties of the proposed prior. Moreover, we show that, under specific choices of its hyperparameters, the Stirling-gamma distribution is conjugate to the random partition of a Dirichlet process. We illustrate with an ecological application the usefulness of our approach for the detection of communities of ant workers."}, "https://arxiv.org/abs/2307.00224": {"title": "Flexible Bayesian Modeling for Longitudinal Binary and Ordinal Responses", "link": "https://arxiv.org/abs/2307.00224", "description": "arXiv:2307.00224v2 Announce Type: replace \nAbstract: Longitudinal studies with binary or ordinal responses are widely encountered in various disciplines, where the primary focus is on the temporal evolution of the probability of each response category. 
Traditional approaches build from the generalized mixed effects modeling framework. Even amplified with nonparametric priors placed on the fixed or random effects, such models are restrictive due to the implied assumptions on the marginal expectation and covariance structure of the responses. We tackle the problem from a functional data analysis perspective, treating the observations for each subject as realizations from subject-specific stochastic processes at the measured times. We develop the methodology focusing initially on binary responses, for which we assume the stochastic processes have Binomial marginal distributions. Leveraging the logits representation, we model the discrete space processes through sequences of continuous space processes. We utilize a hierarchical framework to model the mean and covariance kernel of the continuous space processes nonparametrically and simultaneously through a Gaussian process prior and an Inverse-Wishart process prior, respectively. The prior structure results in flexible inference for the evolution and correlation of binary responses, while allowing for borrowing of strength across all subjects. The modeling approach can be naturally extended to ordinal responses. Here, the continuation-ratio logits factorization of the multinomial distribution is key for efficient modeling and inference, including a practical way of dealing with unbalanced longitudinal data. The methodology is illustrated with synthetic data examples and an analysis of college students' mental health status data."}, "https://arxiv.org/abs/2308.12470": {"title": "Scalable Estimation of Multinomial Response Models with Random Consideration Sets", "link": "https://arxiv.org/abs/2308.12470", "description": "arXiv:2308.12470v3 Announce Type: replace \nAbstract: A common assumption in the fitting of unordered multinomial response models for $J$ mutually exclusive categories is that the responses arise from the same set of $J$ categories across subjects. However, when responses measure a choice made by the subject, it is more appropriate to condition the distribution of multinomial responses on a subject-specific consideration set, drawn from the power set of $\\{1,2,\\ldots,J\\}$. This leads to a mixture of multinomial response models governed by a probability distribution over the $J^{\\ast} = 2^J -1$ consideration sets. We introduce a novel method for estimating such generalized multinomial response models based on the fundamental result that any mass distribution over $J^{\\ast}$ consideration sets can be represented as a mixture of products of $J$ component-specific inclusion-exclusion probabilities. Moreover, under time-invariant consideration sets, the conditional posterior distribution of consideration sets is sparse. These features enable a scalable MCMC algorithm for sampling the posterior distribution of parameters, random effects, and consideration sets. Under regularity conditions, the posterior distributions of the marginal response probabilities and the model parameters satisfy consistency. 
The methodology is demonstrated in a longitudinal data set on weekly cereal purchases that cover $J = 101$ brands, a dimension substantially beyond the reach of existing methods."}, "https://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "https://arxiv.org/abs/2310.06330", "description": "arXiv:2310.06330v2 Announce Type: replace \nAbstract: Markov chain Monte Carlo (MCMC) is a commonly used method for approximating expectations with respect to probability distributions. Uncertainty assessment for MCMC estimators is essential in practical applications. Moreover, for multivariate functions of a Markov chain, it is important to estimate not only the auto-correlation for each component but also to estimate cross-correlations, in order to better assess sample quality, improve estimates of effective sample size, and use more effective stopping rules. Berg and Song [2022] introduced the moment least squares (momentLS) estimator, a shape-constrained estimator for the autocovariance sequence from a reversible Markov chain, for univariate functions of the Markov chain. Based on this sequence estimator, they proposed an estimator of the asymptotic variance of the sample mean from MCMC samples. In this study, we propose novel autocovariance sequence and asymptotic variance estimators for Markov chain functions with multiple components, based on the univariate momentLS estimators from Berg and Song [2022]. We demonstrate strong consistency of the proposed auto(cross)-covariance sequence and asymptotic variance matrix estimators. We conduct empirical comparisons of our method with other state-of-the-art approaches on simulated and real-data examples, using popular samplers including the random-walk Metropolis sampler and the No-U-Turn sampler from STAN."}, "https://arxiv.org/abs/2210.04482": {"title": "Leave-group-out cross-validation for latent Gaussian models", "link": "https://arxiv.org/abs/2210.04482", "description": "arXiv:2210.04482v5 Announce Type: replace-cross \nAbstract: Evaluating the predictive performance of a statistical model is commonly done using cross-validation. Although the leave-one-out method is frequently employed, its application is justified primarily for independent and identically distributed observations. However, this method tends to mimic interpolation rather than prediction when dealing with dependent observations. This paper proposes a modified cross-validation for dependent observations. This is achieved by excluding an automatically determined set of observations from the training set to mimic a more reasonable prediction scenario. Also, within the framework of latent Gaussian models, we illustrate a method to adjust the joint posterior for this modified cross-validation to avoid model refitting. This new approach is accessible in the R-INLA package (www.r-inla.org)."}, "https://arxiv.org/abs/2212.02935": {"title": "A multi-language toolkit for supporting automated checking of research outputs", "link": "https://arxiv.org/abs/2212.02935", "description": "arXiv:2212.02935v2 Announce Type: replace-cross \nAbstract: This article presents the automatic checking of research outputs package acro, which assists researchers and data governance teams by automatically applying best-practice principles-based statistical disclosure control (SDC) techniques on-the-fly as researchers conduct their analyses. 
acro distinguishes between: research output that is safe to publish; output that requires further analysis; and output that cannot be published because it creates substantial risk of disclosing private data. This is achieved through the use of a lightweight Python wrapper that sits over well-known analysis tools that produce outputs such as tables, plots, and statistical models. This adds functionality to (i) identify potentially disclosive outputs against a range of commonly used disclosure tests; (ii) apply disclosure mitigation strategies where required; (iii) report reasons for applying SDC; and (iv) produce simple summary documents trusted research environment staff can use to streamline their workflow. The major analytical programming languages used by researchers are supported: Python, R, and Stata. The acro code and documentation are available under an MIT license at https://github.com/AI-SDC/ACRO"}, "https://arxiv.org/abs/2312.04444": {"title": "Parameter Inference for Hypo-Elliptic Diffusions under a Weak Design Condition", "link": "https://arxiv.org/abs/2312.04444", "description": "arXiv:2312.04444v2 Announce Type: replace-cross \nAbstract: We address the problem of parameter estimation for degenerate diffusion processes defined via the solution of Stochastic Differential Equations (SDEs) with diffusion matrix that is not full-rank. For this class of hypo-elliptic diffusions recent works have proposed contrast estimators that are asymptotically normal, provided that the step-size in-between observations $\\Delta=\\Delta_n$ and their total number $n$ satisfy $n \\to \\infty$, $n \\Delta_n \\to \\infty$, $\\Delta_n \\to 0$, and additionally $\\Delta_n = o (n^{-1/2})$. This latter restriction places a requirement for a so-called `rapidly increasing experimental design'. In this paper, we overcome this limitation and develop a general contrast estimator satisfying asymptotic normality under the weaker design condition $\\Delta_n = o(n^{-1/p})$ for general $p \\ge 2$. Such a result has been obtained for elliptic SDEs in the literature, but its derivation in a hypo-elliptic setting is highly non-trivial. We provide numerical results to illustrate the advantages of the developed theory."}, "https://arxiv.org/abs/2401.12967": {"title": "Measure transport with kernel mean embeddings", "link": "https://arxiv.org/abs/2401.12967", "description": "arXiv:2401.12967v2 Announce Type: replace-cross \nAbstract: Kalman filters constitute a scalable and robust methodology for approximate Bayesian inference, matching first and second order moments of the target posterior. To improve the accuracy in nonlinear and non-Gaussian settings, we extend this principle to include more or different characteristics, based on kernel mean embeddings (KMEs) of probability measures into reproducing kernel Hilbert spaces. Focusing on the continuous-time setting, we develop a family of interacting particle systems (termed $\\textit{KME-dynamics}$) that bridge between prior and posterior, and that include the Kalman-Bucy filter as a special case. KME-dynamics does not require the score of the target, but rather estimates the score implicitly and intrinsically, and we develop links to score-based generative modeling and importance reweighting. A variant of KME-dynamics has recently been derived from an optimal transport and Fisher-Rao gradient flow perspective by Maurais and Marzouk, and we expose further connections to (kernelised) diffusion maps, leading to a variational formulation of regression type. 
Finally, we conduct numerical experiments on toy examples and the Lorenz 63 and 96 models, comparing our results against the ensemble Kalman filter and the mapping particle filter (Pulido and van Leeuwen, 2019, J. Comput. Phys.). Our experiments show particular promise for a hybrid modification (called Kalman-adjusted KME-dynamics)."}, "https://arxiv.org/abs/2409.02204": {"title": "Moment-type estimators for a weighted exponential family", "link": "https://arxiv.org/abs/2409.02204", "description": "arXiv:2409.02204v1 Announce Type: new \nAbstract: In this paper, we propose and study closed-form moment type estimators for a weighted exponential family. We also develop a bias-reduced version of these proposed closed-form estimators using bootstrap techniques. The estimators are evaluated using Monte Carlo simulation. This shows favourable results for the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2409.02209": {"title": "Estimand-based Inference in Presence of Long-Term Survivors", "link": "https://arxiv.org/abs/2409.02209", "description": "arXiv:2409.02209v1 Announce Type: new \nAbstract: In this article, we develop nonparametric inference methods for comparing survival data across two samples, which are beneficial for clinical trials of novel cancer therapies where long-term survival is a critical outcome. These therapies, including immunotherapies or other advanced treatments, aim to establish durable effects. They often exhibit distinct survival patterns such as crossing or delayed separation and potentially leveling-off at the tails of survival curves, clearly violating the proportional hazards assumption and rendering the hazard ratio inappropriate for measuring treatment effects. The proposed methodology utilizes the mixture cure framework to separately analyze the cure rates of long-term survivors and the survival functions of susceptible individuals. We evaluate a nonparametric estimator for the susceptible survival function in the one-sample setting. Under sufficient follow-up, it is expressed as a location-scale-shift variant of the Kaplan-Meier (KM) estimator. It retains several desirable features of the KM estimator, including inverse-probability-censoring weighting, product-limit estimation, self-consistency, and nonparametric efficiency. In scenarios of insufficient follow-up, it can easily be adapted by incorporating a suitable cure rate estimator. In the two-sample setting, besides using the difference in cure rates to measure the long-term effect, we propose a graphical estimand to compare the relative treatment effects on susceptible subgroups. This process, inspired by Kendall's tau, compares the order of survival times among susceptible individuals. The proposed methods' large-sample properties are derived for further inference, and the finite-sample properties are examined through extensive simulation studies. The proposed methodology is applied to analyze the digitized data from the CheckMate 067 immunotherapy clinical trial."}, "https://arxiv.org/abs/2409.02258": {"title": "Generalized implementation of invariant coordinate selection with positive semi-definite scatter matrices", "link": "https://arxiv.org/abs/2409.02258", "description": "arXiv:2409.02258v1 Announce Type: new \nAbstract: Invariant coordinate selection (ICS) is an unsupervised multivariate data transformation useful in many contexts such as outlier detection or clustering. 
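The bootstrap bias reduction mentioned in the moment-type-estimator abstract above is, in its generic form, the classic correction theta_bc = 2*theta_hat - mean(bootstrap estimates). The sketch below applies it to a closed-form exponential-rate estimator as a stand-in; the weighted exponential family of the paper is not reproduced here.

```python
import numpy as np

def moment_estimator(x):
    """Closed-form moment-type estimator; the exponential rate 1 / mean(x) stands in
    for the weighted-exponential-family estimators of the abstract."""
    return 1.0 / np.mean(x)

def bootstrap_bias_reduced(x, estimator, n_boot=2000, rng=None):
    """Classic bootstrap bias correction: 2 * theta_hat - mean of bootstrap estimates."""
    rng = np.random.default_rng(rng)
    theta_hat = estimator(x)
    boot = np.array([
        estimator(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)
    ])
    return 2.0 * theta_hat - boot.mean()

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0 / 3.0, size=40)       # true rate = 3
print("plain:", round(moment_estimator(x), 3),
      "bias-reduced:", round(bootstrap_bias_reduced(x, moment_estimator, rng=2), 3))
```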
It is based on the simultaneous diagonalization of two affine equivariant and positive definite scatter matrices. Its classical implementation relies on a non-symmetric eigenvalue problem (EVP) by diagonalizing one scatter relatively to the other. In case of collinearity, at least one of the scatter matrices is singular and the problem cannot be solved. To address this limitation, three approaches are proposed based on: a Moore-Penrose pseudo inverse (GINV), a dimension reduction (DR), and a generalized singular value decomposition (GSVD). Their properties are investigated theoretically and in different empirical applications. Overall, the extension based on GSVD seems the most promising even if it restricts the choice of scatter matrices that can be expressed as cross-products. In practice, some of the approaches also look suitable in the context of data in high dimension low sample size (HDLSS)."}, "https://arxiv.org/abs/2409.02269": {"title": "Simulation-calibration testing for inference in Lasso regressions", "link": "https://arxiv.org/abs/2409.02269", "description": "arXiv:2409.02269v1 Announce Type: new \nAbstract: We propose a test of the significance of a variable appearing on the Lasso path and use it in a procedure for selecting one of the models of the Lasso path, controlling the Family-Wise Error Rate. Our null hypothesis depends on a set A of already selected variables and states that it contains all the active variables. We focus on the regularization parameter value from which a first variable outside A is selected. As the test statistic, we use this quantity's conditional p-value, which we define conditional on the non-penalized estimated coefficients of the model restricted to A. We estimate this by simulating outcome vectors and then calibrating them on the observed outcome's estimated coefficients. We adapt the calibration heuristically to the case of generalized linear models in which it turns into an iterative stochastic procedure. We prove that the test controls the risk of selecting a false positive in linear models, both under the null hypothesis and, under a correlation condition, when A does not contain all active variables. We assess the performance of our procedure through extensive simulation studies. We also illustrate it in the detection of exposures associated with drug-induced liver injuries in the French pharmacovigilance database."}, "https://arxiv.org/abs/2409.02311": {"title": "Distribution Regression Difference-In-Differences", "link": "https://arxiv.org/abs/2409.02311", "description": "arXiv:2409.02311v1 Announce Type: new \nAbstract: We provide a simple distribution regression estimator for treatment effects in the difference-in-differences (DiD) design. Our procedure is particularly useful when the treatment effect differs across the distribution of the outcome variable. Our proposed estimator easily incorporates covariates and, importantly, can be extended to settings where the treatment potentially affects the joint distribution of multiple outcomes. Our key identifying restriction is that the counterfactual distribution of the treated in the untreated state has no interaction effect between treatment and time. This assumption results in a parallel trend assumption on a transformation of the distribution. We highlight the relationship between our procedure and assumptions with the changes-in-changes approach of Athey and Imbens (2006). 
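One of the three remedies listed in the ICS abstract above, the Moore-Penrose route (GINV), can be sketched as an eigen-decomposition of pinv(S1) @ S2. Taking COV and the fourth-moment scatter COV4 as the two scatters is a common default and an assumption on my part, not a prescription from the paper.

```python
import numpy as np

def cov4(X):
    """Fourth-moment scatter: (1/(p+2)) * mean of r_i^2 (x_i - mu)(x_i - mu)^T,
    with r_i the Mahalanobis distance; a standard companion scatter for ICS."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    r2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return (Xc * r2[:, None]).T @ Xc / (n * (p + 2))

def ics_ginv(X):
    """ICS scores via the Moore-Penrose route: eigen-decompose pinv(S1) @ S2."""
    S1 = np.cov(X, rowvar=False)
    S2 = cov4(X)
    eigval, B = np.linalg.eig(np.linalg.pinv(S1) @ S2)
    order = np.argsort(eigval.real)[::-1]
    B = B[:, order].real
    return (X - X.mean(axis=0)) @ B         # invariant coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:15] += np.array([6.0, 0, 0, 0])          # a small cluster of outliers
Z = ics_ginv(X)
print(Z.shape)                              # extreme first/last coordinates flag outliers
```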
We also reexamine two existing empirical examples which highlight the utility of our approach."}, "https://arxiv.org/abs/2409.02331": {"title": "A parameterization of anisotropic Gaussian fields with penalized complexity priors", "link": "https://arxiv.org/abs/2409.02331", "description": "arXiv:2409.02331v1 Announce Type: new \nAbstract: Gaussian random fields (GFs) are fundamental tools in spatial modeling and can be represented flexibly and efficiently as solutions to stochastic partial differential equations (SPDEs). The SPDEs depend on specific parameters, which enforce various field behaviors and can be estimated using Bayesian inference. However, the likelihood typically only provides limited insights into the covariance structure under in-fill asymptotics. In response, it is essential to leverage priors to achieve appropriate, meaningful covariance structures in the posterior. This study introduces a smooth, invertible parameterization of the correlation length and diffusion matrix of an anisotropic GF and constructs penalized complexity (PC) priors for the model when the parameters are constant in space. The formulated prior is weakly informative, effectively penalizing complexity by pushing the correlation range toward infinity and the anisotropy to zero."}, "https://arxiv.org/abs/2409.02372": {"title": "A Principal Square Response Forward Regression Method for Dimension Reduction", "link": "https://arxiv.org/abs/2409.02372", "description": "arXiv:2409.02372v1 Announce Type: new \nAbstract: Dimension reduction techniques, such as Sufficient Dimension Reduction (SDR), are indispensable for analyzing high-dimensional datasets. This paper introduces a novel SDR method named Principal Square Response Forward Regression (PSRFR) for estimating the central subspace of the response variable Y, given the vector of predictor variables $\\bm{X}$. We provide a computational algorithm for implementing PSRFR and establish its consistency and asymptotic properties. Monte Carlo simulations are conducted to assess the performance, efficiency, and robustness of the proposed method. Notably, PSRFR exhibits commendable performance in scenarios where the variance of each component becomes increasingly dissimilar, particularly when the predictor variables follow an elliptical distribution. Furthermore, we illustrate and validate the effectiveness of PSRFR using a real-world dataset concerning wine quality. Our findings underscore the utility and reliability of the PSRFR method in practical applications of dimension reduction for high-dimensional data analysis."}, "https://arxiv.org/abs/2409.02397": {"title": "High-dimensional Bayesian Model for Disease-Specific Gene Detection in Spatial Transcriptomics", "link": "https://arxiv.org/abs/2409.02397", "description": "arXiv:2409.02397v1 Announce Type: new \nAbstract: Identifying disease-indicative genes is critical for deciphering disease mechanisms and has attracted significant interest in biomedical research. Spatial transcriptomics offers unprecedented insights for the detection of disease-specific genes by enabling within-tissue contrasts. However, this new technology poses challenges for conventional statistical models developed for RNA-sequencing, as these models often neglect the spatial organization of tissue spots. 
In this article, we propose a Bayesian shrinkage model to characterize the relationship between high-dimensional gene expressions and the disease status of each tissue spot, incorporating spatial correlation among these spots through autoregressive terms. Our model adopts a hierarchical structure to facilitate the analysis of multiple correlated samples and is further extended to accommodate the missing data within tissues. To ensure the model's applicability to datasets of varying sizes, we carry out two computational frameworks for Bayesian parameter estimation, tailored to both small and large sample scenarios. Simulation studies are conducted to evaluate the performance of the proposed model. The proposed model is applied to analyze the data arising from a HER2-positive breast cancer study."}, "https://arxiv.org/abs/2409.02573": {"title": "Fitting an Equation to Data Impartially", "link": "https://arxiv.org/abs/2409.02573", "description": "arXiv:2409.02573v1 Announce Type: new \nAbstract: We consider the problem of fitting a relationship (e.g. a potential scientific law) to data involving multiple variables. Ordinary (least squares) regression is not suitable for this because the estimated relationship will differ according to which variable is chosen as being dependent, and the dependent variable is unrealistically assumed to be the only variable which has any measurement error (noise). We present a very general method for estimating a linear functional relationship between multiple noisy variables, which are treated impartially, i.e. no distinction between dependent and independent variables. The data are not assumed to follow any distribution, but all variables are treated as being equally reliable. Our approach extends the geometric mean functional relationship to multiple dimensions. This is especially useful with variables measured in different units, as it is naturally scale-invariant, whereas orthogonal regression is not. This is because our approach is not based on minimizing distances, but on the symmetric concept of correlation. The estimated coefficients are easily obtained from the covariances or correlations, and correspond to geometric means of associated least squares coefficients. The ease of calculation will hopefully allow widespread application of impartial fitting to estimate relationships in a neutral way."}, "https://arxiv.org/abs/2409.02642": {"title": "The Application of Green GDP and Its Impact on Global Economy and Environment: Analysis of GGDP based on SEEA model", "link": "https://arxiv.org/abs/2409.02642", "description": "arXiv:2409.02642v1 Announce Type: new \nAbstract: This paper presents an analysis of Green Gross Domestic Product (GGDP) using the System of Environmental-Economic Accounting (SEEA) model to evaluate its impact on global climate mitigation and economic health. GGDP is proposed as a superior measure to traditional GDP by incorporating natural resource consumption, environmental pollution control, and degradation factors. The study develops a GGDP model and employs grey correlation analysis and grey prediction models to assess its relationship with these factors. Key findings demonstrate that replacing GDP with GGDP can positively influence climate change, particularly in reducing CO2 emissions and stabilizing global temperatures. The analysis further explores the implications of GGDP adoption across developed and developing countries, with specific predictions for China and the United States. 
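For the impartial-fitting abstract above, the bivariate special case of the geometric mean functional relationship is easy to state: the slope is sign(r) * sd(y)/sd(x), the geometric mean of the y-on-x slope and the reciprocal of the x-on-y slope. The sketch below covers only this two-variable case; the paper's multivariate extension is not reproduced.

```python
import numpy as np

def geometric_mean_fit(x, y):
    """Two-variable 'impartial' line: slope = sign(r) * sd(y) / sd(x), the geometric
    mean of the y-on-x OLS slope and the inverse of the x-on-y OLS slope.
    Only the bivariate special case is sketched; the paper treats p variables."""
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept

rng = np.random.default_rng(0)
truth = 2.5
x_true = rng.uniform(0, 10, 200)
x = x_true + rng.normal(scale=1.0, size=200)            # measurement noise in x ...
y = truth * x_true + rng.normal(scale=2.5, size=200)    # ... and in y
print(geometric_mean_fit(x, y))   # slope near 2.5, unlike the attenuated y-on-x OLS slope
```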
The results indicate a potential increase in economic levels for developing countries, while developed nations may experience a decrease. Additionally, the shift to GGDP is shown to significantly reduce natural resource depletion and population growth rates in the United States, suggesting broader environmental and economic benefits. This paper highlights the universal applicability of the GGDP model and its potential to enhance environmental and economic policies globally."}, "https://arxiv.org/abs/2409.02662": {"title": "The Impact of Data Elements on Narrowing the Urban-Rural Consumption Gap in China: Mechanisms and Policy Analysis", "link": "https://arxiv.org/abs/2409.02662", "description": "arXiv:2409.02662v1 Announce Type: new \nAbstract: The urban-rural consumption gap, as one of the important indicators in social development, directly reflects the imbalance in urban and rural economic and social development. Data elements, as an important component of New Quality Productivity, are of significant importance in promoting economic development and improving people's living standards in the information age. This study, through the analysis of fixed-effects regression models, system GMM regression models, and the intermediate effect model, found that the development level of data elements to some extent promotes the narrowing of the urban-rural consumption gap. At the same time, the intermediate variable of urban-rural income gap plays an important role between data elements and consumption gap, with a significant intermediate effect. The results of the study indicate that the advancement of data elements can promote the balance of urban and rural residents' consumption levels by reducing the urban-rural income gap, providing theoretical support and policy recommendations for achieving common prosperity and promoting coordinated urban-rural development. Building upon this, this paper emphasizes the complex correlation between the development of data elements and the urban-rural consumption gap, and puts forward policy suggestions such as promoting the development of the data element market, strengthening the construction of the digital economy and e-commerce, and promoting integrated urban-rural development. Overall, the development of data elements is not only an important path to reducing the urban-rural consumption gap but also one of the key drivers for promoting the balanced development of China's economic and social development. This study has a certain theoretical and practical significance for understanding the mechanism of the urban-rural consumption gap and improving policies for urban-rural economic development."}, "https://arxiv.org/abs/2409.02705": {"title": "A family of toroidal diffusions with exact likelihood inference", "link": "https://arxiv.org/abs/2409.02705", "description": "arXiv:2409.02705v1 Announce Type: new \nAbstract: We provide a class of diffusion processes for continuous time-varying multivariate angular data with explicit transition probability densities, enabling exact likelihood inference. The presented diffusions are time-reversible and can be constructed for any pre-specified stationary distribution on the torus, including highly-multimodal mixtures. We give results on asymptotic likelihood theory allowing one-sample inference and tests of linear hypotheses for $k$ groups of diffusions, including homogeneity. We show that exact and direct diffusion bridge simulation is possible too. 
A class of circular jump processes with similar properties is also proposed. Several numerical experiments illustrate the methodology for the circular and two-dimensional torus cases. The new family of diffusions is applied (i) to test several homogeneity hypotheses on the movement of ants and (ii) to simulate bridges between the three-dimensional backbones of two related proteins."}, "https://arxiv.org/abs/2409.02872": {"title": "Momentum Dynamics in Competitive Sports: A Multi-Model Analysis Using TOPSIS and Logistic Regression", "link": "https://arxiv.org/abs/2409.02872", "description": "arXiv:2409.02872v1 Announce Type: new \nAbstract: This paper explores the concept of \"momentum\" in sports competitions through the use of the TOPSIS model and 0-1 logistic regression model. First, the TOPSIS model is employed to evaluate the performance of two tennis players, with visualizations used to analyze the situation's evolution at every moment in the match, explaining how \"momentum\" manifests in sports. Then, the 0-1 logistic regression model is utilized to verify the impact of \"momentum\" on match outcomes, demonstrating that fluctuations in player performance and the successive occurrence of successes are not random. Additionally, this paper examines the indicators that influence the reversal of game situations by analyzing key match data and testing the accuracy of the models with match data. The findings show that the model accurately explains the conditions during matches and can be generalized to other sports competitions. Finally, the strengths, weaknesses, and potential future improvements of the model are discussed."}, "https://arxiv.org/abs/2409.02888": {"title": "Cost-Effectiveness Analysis for Disease Prevention -- A Case Study on Colorectal Cancer Screening", "link": "https://arxiv.org/abs/2409.02888", "description": "arXiv:2409.02888v1 Announce Type: new \nAbstract: Cancer Screening has been widely recognized as an effective strategy for preventing the disease. Despite its effectiveness, determining when to start screening is complicated, because starting too early increases the number of screenings over lifetime and thus costs but starting too late may miss the cancer that could have been prevented. Therefore, to make an informed recommendation on the age to start screening, it is necessary to conduct cost-effectiveness analysis to assess the gain in life years relative to the cost of screenings. As more large-scale observational studies become accessible, there is growing interest in evaluating cost-effectiveness based on empirical evidence. In this paper, we propose a unified measure for evaluating cost-effectiveness and a causal analysis for the continuous intervention of screening initiation age, under the multi-state modeling with semi-competing risks. Extensive simulation results show that the proposed estimators perform well in realistic scenarios. We perform a cost-effectiveness analysis of the colorectal cancer screening, utilizing data from the large-scale Women's Health Initiative. 
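A generic TOPSIS scorer clarifies the mechanics used in the momentum abstract above. The per-moment statistics, weights, and benefit/cost labels in the example are hypothetical and not taken from the paper's data.

```python
import numpy as np

def topsis(decision_matrix, weights, benefit=None):
    """Generic TOPSIS: returns closeness-to-ideal scores in [0, 1], one per alternative.
    `benefit[j]` is True when larger values of criterion j are better."""
    X = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    benefit = np.ones(X.shape[1], dtype=bool) if benefit is None else np.asarray(benefit)

    V = w * X / np.linalg.norm(X, axis=0)                 # weighted vector normalization
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_plus = np.linalg.norm(V - ideal, axis=1)
    d_minus = np.linalg.norm(V - anti, axis=1)
    return d_minus / (d_plus + d_minus)

# Hypothetical per-moment stats for two players: points won, winners, unforced errors
moments = [[12, 4, 2],
           [10, 6, 5]]
print(topsis(moments, weights=[0.5, 0.3, 0.2], benefit=[True, True, False]))
```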
Our analysis reveals that initiating screening at age 50 years yields the highest quality-adjusted life years with an acceptable incremental cost-effectiveness ratio compared to no screening, providing real-world evidence in support of screening recommendation for colorectal cancer."}, "https://arxiv.org/abs/2409.02116": {"title": "Discovering Candidate Genes Regulated by GWAS Signals in Cis and Trans", "link": "https://arxiv.org/abs/2409.02116", "description": "arXiv:2409.02116v1 Announce Type: cross \nAbstract: Understanding the genetic underpinnings of complex traits and diseases has been greatly advanced by genome-wide association studies (GWAS). However, a significant portion of trait heritability remains unexplained, known as ``missing heritability\". Most GWAS loci reside in non-coding regions, posing challenges in understanding their functional impact. Integrating GWAS with functional genomic data, such as expression quantitative trait loci (eQTLs), can bridge this gap. This study introduces a novel approach to discover candidate genes regulated by GWAS signals in both cis and trans. Unlike existing eQTL studies that focus solely on cis-eQTLs or consider cis- and trans-QTLs separately, we utilize adaptive statistical metrics that can reflect both the strong, sparse effects of cis-eQTLs and the weak, dense effects of trans-eQTLs. Consequently, candidate genes regulated by the joint effects can be prioritized. We demonstrate the efficiency of our method through theoretical and numerical analyses and apply it to adipose eQTL data from the METabolic Syndrome in Men (METSIM) study, uncovering genes playing important roles in the regulatory networks influencing cardiometabolic traits. Our findings offer new insights into the genetic regulation of complex traits and present a practical framework for identifying key regulatory genes based on joint eQTL effects."}, "https://arxiv.org/abs/2409.02135": {"title": "Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling", "link": "https://arxiv.org/abs/2409.02135", "description": "arXiv:2409.02135v1 Announce Type: cross \nAbstract: Learning-based methods have gained attention as general-purpose solvers because they can automatically learn problem-specific heuristics, reducing the need for manually crafted heuristics. However, these methods often face challenges with scalability. To address these issues, the improved Sampling algorithm for Combinatorial Optimization (iSCO) using discrete Langevin dynamics has been proposed, demonstrating better performance than several learning-based solvers. This study proposes a different approach that integrates gradient-based update through continuous relaxation, combined with Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function from a simple convex form, where half-integral solutions dominate, to the original objective function, where the variables are restricted to 0 or 1. Furthermore, we incorporate parallel run communication leveraging GPUs, enhancing exploration capabilities and accelerating convergence. Numerical experiments demonstrate that our approach is a competitive general-purpose solver, achieving comparable performance to iSCO across various benchmark problems. 
Notably, our method exhibits superior trade-offs between speed and solution quality for large-scale instances compared to iSCO, commercial solvers, and specialized algorithms."}, "https://arxiv.org/abs/2409.02332": {"title": "Double Machine Learning at Scale to Predict Causal Impact of Customer Actions", "link": "https://arxiv.org/abs/2409.02332", "description": "arXiv:2409.02332v1 Announce Type: cross \nAbstract: Causal Impact (CI) of customer actions are broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate the CI values across 100s of customer actions of business interest and 100s of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundred of actions and millions of customers). We outline the DML methodology and implementation, and associated benefits over the traditional potential outcomes based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in the computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, ability to onboard new use cases, and improves accessibility of underlying code for partner teams."}, "https://arxiv.org/abs/2409.02604": {"title": "Hypothesizing Missing Causal Variables with LLMs", "link": "https://arxiv.org/abs/2409.02604", "description": "arXiv:2409.02604v1 Announce Type: cross \nAbstract: Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establish the relationship between the cause and the effect. Motivated by the scientific discovery process, in this work, we formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We design a benchmark with varying difficulty levels and knowledge assumptions about the causal graph. With the growing interest in using Large Language Models (LLMs) to assist in scientific discovery, we benchmark open-source and closed models on our testbed. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. In contrast, they underperform in hypothesizing the cause and effect variables themselves. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model."}, "https://arxiv.org/abs/2409.02708": {"title": "Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit", "link": "https://arxiv.org/abs/2409.02708", "description": "arXiv:2409.02708v1 Announce Type: cross \nAbstract: Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. 
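The Spark-based production system in the double machine learning abstract above is not reproduced here, but the core cross-fitted partialling-out estimator it operationalizes can be sketched on a single machine with scikit-learn. The random-forest nuisance learners and the partially linear model are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partialling_out(y, t, X, n_folds=5, seed=0):
    """Cross-fitted double ML for y = theta * t + g(X) + e:
    residualize y and t on X out-of-fold, then regress residual on residual."""
    y_res = np.zeros_like(y, dtype=float)
    t_res = np.zeros_like(t, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        m_y = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X[train], y[train])
        m_t = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X[train], t[train])
        y_res[test] = y[test] - m_y.predict(X[test])
        t_res[test] = t[test] - m_t.predict(X[test])
    theta = np.sum(t_res * y_res) / np.sum(t_res ** 2)
    eps = y_res - theta * t_res
    var = np.mean(t_res ** 2 * eps ** 2) / np.mean(t_res ** 2) ** 2 / len(y)
    return theta, np.sqrt(var)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
t = X[:, 0] + rng.normal(size=2000)                  # treatment depends on a confounder
y = 1.5 * t + 2 * X[:, 0] + rng.normal(size=2000)    # true causal effect = 1.5
print(dml_partialling_out(y, t, X))                  # estimate and standard error
```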
One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects."}, "https://arxiv.org/abs/2103.05909": {"title": "A variational inference framework for inverse problems", "link": "https://arxiv.org/abs/2103.05909", "description": "arXiv:2103.05909v4 Announce Type: replace \nAbstract: A framework is presented for fitting inverse problem models via variational Bayes approximations. This methodology guarantees flexibility to statistical model specification for a broad range of applications, good accuracy and reduced model fitting times. The message passing and factor graph fragment approach to variational Bayes that is also described facilitates streamlined implementation of approximate inference algorithms and allows for supple inclusion of numerous response distributions and penalizations into the inverse problem model. Models for one- and two-dimensional response variables are examined and an infrastructure is laid down where efficient algorithm updates based on nullifying weak interactions between variables can also be derived for inverse problems in higher dimensions. An image processing application and a simulation exercise motivated by biomedical problems reveal the computational advantage offered by efficient implementation of variational Bayes over Markov chain Monte Carlo."}, "https://arxiv.org/abs/2112.02822": {"title": "A stableness of resistance model for nonresponse adjustment with callback data", "link": "https://arxiv.org/abs/2112.02822", "description": "arXiv:2112.02822v4 Announce Type: replace \nAbstract: Nonresponse arises frequently in surveys and follow-ups are routinely made to increase the response rate. In order to monitor the follow-up process, callback data have been used in social sciences and survey studies for decades. In modern surveys, the availability of callback data is increasing because the response rate is decreasing and follow-ups are essential to collect maximum information. Although callback data are helpful to reduce the bias in surveys, such data have not been widely used in statistical analysis until recently. We propose a stableness of resistance assumption for nonresponse adjustment with callback data. We establish the identification and the semiparametric efficiency theory under this assumption, and propose a suite of semiparametric estimation methods including doubly robust estimators, which generalize existing parametric approaches for callback data analysis. 
We apply the approach to a Consumer Expenditure Survey dataset. The results suggest an association between nonresponse and high housing expenditures."}, "https://arxiv.org/abs/2208.13323": {"title": "Safe Policy Learning under Regression Discontinuity Designs with Multiple Cutoffs", "link": "https://arxiv.org/abs/2208.13323", "description": "arXiv:2208.13323v4 Announce Type: replace \nAbstract: The regression discontinuity (RD) design is widely used for program evaluation with observational data. The primary focus of the existing literature has been the estimation of the local average treatment effect at the existing treatment cutoff. In contrast, we consider policy learning under the RD design. Because the treatment assignment mechanism is deterministic, learning better treatment cutoffs requires extrapolation. We develop a robust optimization approach to finding optimal treatment cutoffs that improve upon the existing ones. We first decompose the expected utility into point-identifiable and unidentifiable components. We then propose an efficient doubly-robust estimator for the identifiable parts. To account for the unidentifiable components, we leverage the existence of multiple cutoffs that are common under the RD design. Specifically, we assume that the heterogeneity in the conditional expectations of potential outcomes across different groups vary smoothly along the running variable. Under this assumption, we minimize the worst case utility loss relative to the status quo policy. The resulting new treatment cutoffs have a safety guarantee that they will not yield a worse overall outcome than the existing cutoffs. Finally, we establish the asymptotic regret bounds for the learned policy using semi-parametric efficiency theory. We apply the proposed methodology to empirical and simulated data sets."}, "https://arxiv.org/abs/2301.02739": {"title": "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values", "link": "https://arxiv.org/abs/2301.02739", "description": "arXiv:2301.02739v3 Announce Type: replace \nAbstract: Many testing problems are readily amenable to randomised tests such as those employing data splitting. However despite their usefulness in principle, randomised tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilise the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realisations such as through random data splits. We develop rank-transformed subsampling as a general method for delivering large sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a wide range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomised trial and calibrating cross-fit double machine learning confidence intervals. In contrast to existing p-value aggregation schemes that can be highly conservative, our method enjoys type-I error control that asymptotically approaches the nominal level. 
Moreover, compared to using the ordinary subsampling, we show that our rank transform can remove the first-order bias in approximating the null under alternatives and greatly improve power."}, "https://arxiv.org/abs/2409.02942": {"title": "Categorical Data Analysis", "link": "https://arxiv.org/abs/2409.02942", "description": "arXiv:2409.02942v1 Announce Type: new \nAbstract: Categorical data are common in educational and social science research; however, methods for its analysis are generally not covered in introductory statistics courses. This chapter overviews fundamental concepts and methods in categorical data analysis. It describes and illustrates the analysis of contingency tables given different sampling processes and distributions, estimation of probabilities, hypothesis testing, measures of associations, and tests of no association with nominal variables, as well as the test of linear association with ordinal variables. Three data sets illustrate fatal police shootings in the United States, clinical trials of the Moderna vaccine, and responses to General Social Survey questions."}, "https://arxiv.org/abs/2409.02950": {"title": "On Inference of Weitzman Overlapping Coefficient in Two Weibull Distributions", "link": "https://arxiv.org/abs/2409.02950", "description": "arXiv:2409.02950v1 Announce Type: new \nAbstract: Studying overlapping coefficients has recently become of great benefit, especially after its use in goodness-of-fit tests. These coefficients are defined as the amount of similarity between two statistical distributions. This research examines the estimation of one of these overlapping coefficients, which is the Weitzman coefficient {\\Delta}, assuming two Weibull distributions and without using any restrictions on the parameters of these distributions. We studied the relative bias and relative mean square error of the resulting estimator by implementing a simulation study. The results show the importance of the resulting estimator."}, "https://arxiv.org/abs/2409.03069": {"title": "Discussion of \"Data fission: splitting a single data point\"", "link": "https://arxiv.org/abs/2409.03069", "description": "arXiv:2409.03069v1 Announce Type: new \nAbstract: Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operations in practice, leaving the reader unsure of how to apply data fission outside of the Gaussian and Poisson settings. In this discussion, we describe how our own work provides P1 fission operations in a wide variety of families and offers insight into when P1 fission is possible. We also provide guidance on how to actually apply P2 fission in practice, with a special focus on logistic regression. Finally, we interpret P2 fission as a remedy for distributional misspecification when carrying out P1 fission operations."}, "https://arxiv.org/abs/2409.03085": {"title": "Penalized Subgrouping of Heterogeneous Time Series", "link": "https://arxiv.org/abs/2409.03085", "description": "arXiv:2409.03085v1 Announce Type: new \nAbstract: Interest in the study and analysis of dynamic processes in the social, behavioral, and health sciences has burgeoned in recent years due to the increased availability of intensive longitudinal data. 
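The Weitzman coefficient in the abstract above is the integral of min(f1, f2). The sketch below evaluates it numerically for two Weibull densities and forms a plug-in estimate from fitted maximum-likelihood parameters; the estimator studied in the paper and its bias analysis may differ in details, so this is only an illustration of the target quantity.

```python
import numpy as np
from scipy import integrate, stats

def weitzman_delta(shape1, scale1, shape2, scale2):
    """Weitzman overlapping coefficient Delta = integral of min(f1, f2) over [0, inf)."""
    f1 = stats.weibull_min(shape1, scale=scale1).pdf
    f2 = stats.weibull_min(shape2, scale=scale2).pdf
    value, _ = integrate.quad(lambda x: min(f1(x), f2(x)), 0, np.inf)
    return value

# Plug-in estimate from data: fit each sample by maximum likelihood, then evaluate Delta.
x1 = stats.weibull_min(1.5, scale=2.0).rvs(200, random_state=0)
x2 = stats.weibull_min(2.5, scale=2.5).rvs(200, random_state=1)
c1, _, s1 = stats.weibull_min.fit(x1, floc=0)
c2, _, s2 = stats.weibull_min.fit(x2, floc=0)
print("true   :", round(weitzman_delta(1.5, 2.0, 2.5, 2.5), 3))
print("plug-in:", round(weitzman_delta(c1, s1, c2, s2), 3))
```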
However, how best to model and account for the persistent heterogeneity characterizing such processes remains an open question. The multi-VAR framework, a recent methodological development built on the vector autoregressive model, accommodates heterogeneous dynamics in multiple-subject time series through structured penalization. In the original multi-VAR proposal, individual-level transition matrices are decomposed into common and unique dynamics, allowing for generalizable and person-specific features. The current project extends this framework to allow additionally for the identification and penalized estimation of subgroup-specific dynamics; that is, patterns of dynamics that are shared across subsets of individuals. The performance of the proposed subgrouping extension is evaluated in the context of both a simulation study and empirical application, and results are compared to alternative methods for subgrouping multiple-subject, multivariate time series."}, "https://arxiv.org/abs/2409.03126": {"title": "Co-Developing Causal Graphs with Domain Experts Guided by Weighted FDR-Adjusted p-values", "link": "https://arxiv.org/abs/2409.03126", "description": "arXiv:2409.03126v1 Announce Type: new \nAbstract: This paper proposes an approach facilitating co-design of causal graphs between subject matter experts and statistical modellers. Modern causal analysis, starting with the formulation of causal graphs, provides benefits for robust analysis and well-grounded decision support. Moreover, this process can enrich the discovery and planning phase of data science projects.\n The key premise is that plotting relevant statistical information on a causal graph structure can facilitate an intuitive discussion between domain experts and modellers. Furthermore, hand-crafting causal graphs, integrating human expertise with robust statistical methodology, helps ensure responsible AI practices.\n The paper focuses on using multiplicity-adjusted p-values, controlling for the false discovery rate (FDR), as an aid for co-designing the graph. A family of hypotheses relevant to causal graph construction is identified, including assessing correlation strengths, directions of causal effects, and how well an estimated structural causal model induces the observed covariance structure.\n An iterative flow is described where an initial causal graph is drafted based on expert beliefs about likely causal relationships. The subject matter expert's beliefs, communicated as ranked scores, could be incorporated into the control of the measure proposed by Benjamini and Kling, the FDCR (False Discovery Cost Rate). The FDCR-adjusted p-values then provide feedback on which parts of the graph are supported or contradicted by the data. This co-design process continues, adding, removing, or revising arcs in the graph, until the expert and modeller converge on a satisfactory causal structure grounded in both domain knowledge and data evidence."}, "https://arxiv.org/abs/2409.03136": {"title": "A New Forward Discriminant Analysis Framework Based On Pillai's Trace and ULDA", "link": "https://arxiv.org/abs/2409.03136", "description": "arXiv:2409.03136v1 Announce Type: new \nAbstract: Linear discriminant analysis (LDA), a traditional classification tool, suffers from limitations such as sensitivity to noise and computational challenges when dealing with non-invertible within-class scatter matrices. 
Traditional stepwise LDA frameworks, which iteratively select the most informative features, often exacerbate these issues by relying heavily on Wilks' $\\Lambda$, potentially causing premature stopping of the selection process. This paper introduces a novel forward discriminant analysis framework that integrates Pillai's trace with Uncorrelated Linear Discriminant Analysis (ULDA) to address these challenges, and offers a unified and stand-alone classifier. Through simulations and real-world datasets, the new framework demonstrates effective control of Type I error rates and improved classification accuracy, particularly in cases involving perfect group separations. The results highlight the potential of this approach as a robust alternative to the traditional stepwise LDA framework."}, "https://arxiv.org/abs/2409.03181": {"title": "Wrapped Gaussian Process Functional Regression Model for Batch Data on Riemannian Manifolds", "link": "https://arxiv.org/abs/2409.03181", "description": "arXiv:2409.03181v1 Announce Type: new \nAbstract: Regression is an essential and fundamental methodology in statistical analysis. The majority of the literature focuses on linear and nonlinear regression in the context of the Euclidean space. However, regression models in non-Euclidean spaces deserve more attention due to the collection of increasing volumes of manifold-valued data. In this context, this paper proposes a concurrent functional regression model for batch data on Riemannian manifolds by estimating both mean structure and covariance structure simultaneously. The response variable is assumed to follow a wrapped Gaussian process distribution. Nonlinear relationships between manifold-valued response variables and multiple Euclidean covariates can be captured by this model in which the covariates can be functional and/or scalar. The performance of our model has been tested on both simulated data and real data, showing it is an effective and efficient tool in conducting functional data regression on Riemannian manifolds."}, "https://arxiv.org/abs/2409.03292": {"title": "Directional data analysis using the spherical Cauchy and the Poisson-kernel based distribution", "link": "https://arxiv.org/abs/2409.03292", "description": "arXiv:2409.03292v1 Announce Type: new \nAbstract: The spherical Cauchy distribution and the Poisson-kernel based distribution were both proposed in 2020 for the analysis of directional data. The paper explores both of them under various frameworks. Alternative parametrizations that offer numerical and estimation advantages, including a straightforward Newton-Raphson algorithm to estimate the parameters, are suggested; these further facilitate a more straightforward formulation under the regression setting. A two-sample location test based on the log-likelihood ratio is suggested, concluding with discriminant analysis. 
The two distributions are put to the test-bed for all aforementioned cases, through simulation studies and via real data examples comparing and illustrating their performance."}, "https://arxiv.org/abs/2409.03296": {"title": "An Efficient Two-Dimensional Functional Mixed-Effect Model Framework for Repeatedly Measured Functional Data", "link": "https://arxiv.org/abs/2409.03296", "description": "arXiv:2409.03296v1 Announce Type: new \nAbstract: With the rapid development of wearable device technologies, accelerometers can record minute-by-minute physical activity for consecutive days, which provides important insight into a dynamic association between the intensity of physical activity and mental health outcomes for large-scale population studies. Using Shanghai school adolescent cohort we estimate the effect of health assessment results on physical activity profiles recorded by accelerometers throughout a week, which is recognized as repeatedly measured functional data. To achieve this goal, we propose an innovative two-dimensional functional mixed-effect model (2dFMM) for the specialized data, which smoothly varies over longitudinal day observations with covariate-dependent mean and covariance functions. The modeling framework characterizes the longitudinal and functional structures while incorporating two-dimensional fixed effects for covariates of interest. We also develop a fast three-stage estimation procedure to provide accurate fixed-effect inference for model interpretability and improve computational efficiency when encountering large datasets. We find strong evidence of intraday and interday varying significant associations between physical activity and mental health assessments among our cohort population, which shed light on possible intervention strategies targeting daily physical activity patterns to improve school adolescent mental health. Our method is also used in environmental data to illustrate the wide applicability. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2409.03502": {"title": "Testing Whether Reported Treatment Effects are Unduly Dependent on the Specific Outcome Measure Used", "link": "https://arxiv.org/abs/2409.03502", "description": "arXiv:2409.03502v1 Announce Type: new \nAbstract: This paper addresses the situation in which treatment effects are reported using educational or psychological outcome measures comprised of multiple questions or ``items.'' A distinction is made between a treatment effect on the construct being measured, which is referred to as impact, and item-specific treatment effects that are not due to impact, which are referred to as differential item functioning (DIF). By definition, impact generalizes to other measures of the same construct (i.e., measures that use different items), while DIF is dependent upon the specific items that make up the outcome measure. To distinguish these two cases, two estimators of impact are compared: an estimator that naively aggregates over items, and a less efficient one that is highly robust to DIF. The null hypothesis that both are consistent estimators of the true treatment impact leads to a Hausman-like specification test of whether the naive estimate is affected by item-level variation that would not be expected to generalize beyond the specific outcome measure used. The performance of the test is illustrated with simulation studies and a re-analysis of 34 item-level datasets from 22 randomized evaluations of educational interventions. 
In the empirical example, the dependence of reported effect sizes on the type of outcome measure (researcher-developed or independently developed) was substantially reduced after accounting for DIF. Implications for the ongoing debate about the role of researcher-developed assessments in education sciences are discussed."}, "https://arxiv.org/abs/2409.03513": {"title": "Bias correction of posterior means using MCMC outputs", "link": "https://arxiv.org/abs/2409.03513", "description": "arXiv:2409.03513v1 Announce Type: new \nAbstract: We propose algorithms for addressing the bias of the posterior mean when used as an estimator of parameters. These algorithms build upon the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)) and can be implemented using the posterior covariance and third-order combined cumulants easily calculated from MCMC outputs. Two algorithms are introduced: the first algorithm utilises the output of a single-run MCMC with the original likelihood and prior to estimate the bias. A notable feature of the algorithm is its ability to estimate definitional bias (Efron (2015)), which is crucial for Bayesian estimators. The second algorithm is designed for high-dimensional and sparse data settings, where a ``quasi-prior'' for bias correction is introduced. The quasi-prior is iteratively refined using the output of the first algorithm as a measure of the residual bias at each step. These algorithms have been successfully implemented and tested for parameter estimation in the Weibull distribution and logistic regression in moderately high-dimensional settings."}, "https://arxiv.org/abs/2409.03572": {"title": "Extrinsic Principal Component Analysis", "link": "https://arxiv.org/abs/2409.03572", "description": "arXiv:2409.03572v1 Announce Type: new \nAbstract: One develops a fast computational methodology for principal component analysis on manifolds. Instead of estimating intrinsic principal components on an object space with a Riemannian structure, one embeds the object space in a numerical space, and the resulting chord distance is used. This method helps us analyze high-dimensional, and theoretically even infinite-dimensional, data from a new perspective. We define the extrinsic principal sub-manifolds of a random object on a Hilbert manifold embedded in a Hilbert space, and the sample counterparts. The resulting extrinsic principal components are useful for data dimension reduction. For application, one retains a very small number of such extrinsic principal components for a sample of contour shape data extracted from imaging data."}, "https://arxiv.org/abs/2409.03575": {"title": "Detecting Spatial Dependence in Transcriptomics Data using Vectorised Persistence Diagrams", "link": "https://arxiv.org/abs/2409.03575", "description": "arXiv:2409.03575v1 Announce Type: new \nAbstract: Evaluating spatial patterns in data is an integral task across various domains, including geostatistics, astronomy, and spatial tissue biology. The analysis of transcriptomics data in particular relies on methods for detecting spatially-dependent features that exhibit significant spatial patterns for both explanatory analysis and feature selection. However, given the complex and high-dimensional nature of these data, there is a need for robust, stable, and reliable descriptors of spatial dependence. We leverage the stability and multiscale properties of persistent homology to address this task. 
To this end, we introduce a novel framework using functional topological summaries, such as Betti curves and persistence landscapes, for identifying and describing non-random patterns in spatial data. In particular, we propose a non-parametric one-sample permutation test for spatial dependence and investigate its utility across both simulated and real spatial omics data. Our vectorised approach outperforms baseline methods at accurately detecting spatial dependence. Further, we find that our method is more robust to outliers than alternative tests using Moran's I."}, "https://arxiv.org/abs/2409.03606": {"title": "Performance of Empirical Risk Minimization For Principal Component Regression", "link": "https://arxiv.org/abs/2409.03606", "description": "arXiv:2409.03606v1 Announce Type: new \nAbstract: This paper establishes bounds on the predictive performance of empirical risk minimization for principal component regression. Our analysis is nonparametric, in the sense that the relation between the prediction target and the predictors is not specified. In particular, we do not rely on the assumption that the prediction target is generated by a factor model. In our analysis we consider the cases in which the largest eigenvalues of the covariance matrix of the predictors grow linearly in the number of predictors (strong signal regime) or sublinearly (weak signal regime). The main result of this paper shows that empirical risk minimization for principal component regression is consistent for prediction and, under appropriate conditions, it achieves optimal performance (up to a logarithmic factor) in both the strong and weak signal regimes."}, "https://arxiv.org/abs/2302.10687": {"title": "Boosting the Power of Kernel Two-Sample Tests", "link": "https://arxiv.org/abs/2302.10687", "description": "arXiv:2302.10687v2 Announce Type: replace \nAbstract: The kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces. In this paper we propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance. We derive the asymptotic null distribution of the proposed test statistic and use a multiplier bootstrap approach to efficiently compute the rejection region. The resulting test is universally consistent and, since it is obtained by aggregating over a collection of kernels/bandwidths, is more powerful in detecting a wide range of alternatives in finite samples. We also derive the distribution of the test statistic for both fixed and local contiguous alternatives. The latter, in particular, implies that the proposed test is statistically efficient, that is, it has non-trivial asymptotic (Pitman) efficiency. The consistency properties of the Mahalanobis and other natural aggregation methods are also explored when the number of kernels is allowed to grow with the sample size. Extensive numerical experiments are performed on both synthetic and real-world datasets to illustrate the efficacy of the proposed method over single kernel tests. The computational complexity of the proposed method is also studied, both theoretically and in simulations. 
Our asymptotic results rely on deriving the joint distribution of MMD estimates using the framework of multiple stochastic integrals, which is more broadly useful, specifically, in understanding the efficiency properties of recently proposed adaptive MMD tests based on kernel aggregation and also in developing more computationally efficient (linear time) tests that combine multiple kernels. We conclude with an application of the Mahalanobis aggregation method for kernels with diverging scaling parameters."}, "https://arxiv.org/abs/2401.04156": {"title": "LASPATED: a Library for the Analysis of SPAtio-TEmporal Discrete data", "link": "https://arxiv.org/abs/2401.04156", "description": "arXiv:2401.04156v2 Announce Type: replace \nAbstract: We describe methods, tools, and a software library called LASPATED, available on GitHub (at https://github.com/vguigues/) to fit models using spatio-temporal data and space-time discretization. A video tutorial for this library is available on YouTube. We consider two types of methods to estimate a non-homogeneous Poisson process in space and time. The methods approximate the arrival intensity function of the Poisson process by discretizing space and time, and estimating arrival intensity as a function of subregion and time interval. With such methods, it is typical that the dimension of the estimator is large relative to the amount of data, and therefore the performance of the estimator can be improved by using additional data. The first method uses additional data to add a regularization term to the likelihood function for calibrating the intensity of the Poisson process. The second method uses additional data to estimate arrival intensity as a function of covariates. We describe a Python package to perform various types of space and time discretization. We also describe two packages for the calibration of the models, one in Matlab and one in C++. We demonstrate the advantages of our methods compared to basic maximum likelihood estimation with simulated and real data. The experiments with real data calibrate models of the arrival process of emergencies to be handled by the Rio de Janeiro emergency medical service."}, "https://arxiv.org/abs/2409.03805": {"title": "Exploratory Visual Analysis for Increasing Data Readiness in Artificial Intelligence Projects", "link": "https://arxiv.org/abs/2409.03805", "description": "arXiv:2409.03805v1 Announce Type: new \nAbstract: We present experiences and lessons learned from increasing data readiness of heterogeneous data for artificial intelligence projects using visual analysis methods. Increasing the data readiness level involves understanding both the data as well as the context in which it is used, which are challenges well suited to visual analysis. For this purpose, we contribute a mapping between data readiness aspects and visual analysis techniques suitable for different data types. We use the defined mapping to increase data readiness levels in use cases involving time-varying data, including numerical, categorical, and text. In addition to the mapping, we extend the data readiness concept to better take aspects of the task and solution into account and explicitly address distribution shifts during data collection time. 
We report on our experiences in using the presented visual analysis techniques to aid future artificial intelligence projects in raising the data readiness level."}, "https://arxiv.org/abs/2409.03962": {"title": "Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria", "link": "https://arxiv.org/abs/2409.03962", "description": "arXiv:2409.03962v1 Announce Type: new \nAbstract: The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when using flexible statistical and machine learning models for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extend classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, efficiency, and staying within the bounds of the target parameter space. We establish conditions for nuisance functional estimates in terms of L2(P)-norms to achieve root-n consistent causal effect estimates. To facilitate practical application, we have developed the flexCausal package in R."}, "https://arxiv.org/abs/2409.03979": {"title": "Extreme Quantile Treatment Effects under Endogeneity: Evaluating Policy Effects for the Most Vulnerable Individuals", "link": "https://arxiv.org/abs/2409.03979", "description": "arXiv:2409.03979v1 Announce Type: new \nAbstract: We introduce a novel method for estimating and conducting inference about extreme quantile treatment effects (QTEs) in the presence of endogeneity. Our approach is applicable to a broad range of empirical research designs, including instrumental variables design and regression discontinuity design, among others. By leveraging regular variation and subsampling, the method ensures robust performance even in extreme tails, where data may be sparse or entirely absent. Simulation studies confirm the theoretical robustness of our approach. Applying our method to assess the impact of job training provided by the Job Training Partnership Act (JTPA), we find significantly negative QTEs for the lowest quantiles (i.e., the most disadvantaged individuals), contrasting with previous literature that emphasizes positive QTEs for intermediate quantiles."}, "https://arxiv.org/abs/2409.04079": {"title": "Fitting the Discrete Swept Skeletal Representation to Slabular Objects", "link": "https://arxiv.org/abs/2409.04079", "description": "arXiv:2409.04079v1 Announce Type: new \nAbstract: Statistical shape analysis of slabular objects like groups of hippocampi is highly useful for medical researchers as it can be useful for diagnoses and understanding diseases. 
This work proposes a novel object representation based on locally parameterized discrete swept skeletal structures. Further, model fitting and analysis of such representations are discussed. The model fitting procedure is based on boundary division and surface flattening. The quality of the model fitting is evaluated based on the symmetry and tidiness of the skeletal structure as well as the volume of the implied boundary. The power of the method is demonstrated by visual inspection and statistical analysis of a synthetic and an actual data set in comparison with an available skeletal representation."}, "https://arxiv.org/abs/2409.04126": {"title": "Incorporating external data for analyzing randomized clinical trials: A transfer learning approach", "link": "https://arxiv.org/abs/2409.04126", "description": "arXiv:2409.04126v1 Announce Type: new \nAbstract: Randomized clinical trials are the gold standard for analyzing treatment effects, but high costs and ethical concerns can limit recruitment, potentially leading to invalid inferences. Incorporating external trial data with similar characteristics into the analysis using transfer learning appears promising for addressing these issues. In this paper, we present a formal framework for applying transfer learning to the analysis of clinical trials, considering three key perspectives: transfer algorithm, theoretical foundation, and inference method. For the algorithm, we adopt a parameter-based transfer learning approach to enhance the lasso-adjusted stratum-specific estimator developed for estimating treatment effects. A key component in constructing the transfer learning estimator is deriving the regression coefficient estimates within each stratum, accounting for the bias between source and target data. To provide a theoretical foundation, we derive the $l_1$ convergence rate for the estimated regression coefficients and establish the asymptotic normality of the transfer learning estimator. Our results show that when external trial data resembles current trial data, the sample size requirements can be reduced compared to using only the current trial data. Finally, we propose a consistent nonparametric variance estimator to facilitate inference. Numerical studies demonstrate the effectiveness and robustness of our proposed estimator across various scenarios."}, "https://arxiv.org/abs/2409.04162": {"title": "Modelling multivariate spatio-temporal data with identifiable variational autoencoders", "link": "https://arxiv.org/abs/2409.04162", "description": "arXiv:2409.04162v1 Announce Type: new \nAbstract: Modelling multivariate spatio-temporal data with complex dependency structures is a challenging task but can be simplified by assuming that the original variables are generated from independent latent components. If these components are found, they can be modelled univariately. Blind source separation aims to recover the latent components by estimating the unmixing transformation based on the observed data only. The current methods for spatio-temporal blind source separation are restricted to linear unmixing, and nonlinear variants have not been implemented. In this paper, we extend identifiable variational autoencoder to the nonlinear nonstationary spatio-temporal blind source separation setting and demonstrate its performance using comprehensive simulation studies. Additionally, we introduce two alternative methods for the latent dimension estimation, which is a crucial task in order to obtain the correct latent representation. 
Finally, we illustrate the proposed methods using a meteorological application, where we estimate the latent dimension and the latent components, interpret the components, and show how nonstationarity can be accounted for and prediction accuracy can be improved by using the proposed nonlinear blind source separation method as a preprocessing method."}, "https://arxiv.org/abs/2409.04256": {"title": "The $\\infty$-S test via regression quantile affine LASSO", "link": "https://arxiv.org/abs/2409.04256", "description": "arXiv:2409.04256v1 Announce Type: new \nAbstract: The nonparametric sign test dates back to the early 18th century with a data analysis by John Arbuthnot. It is an alternative to Gosset's more recent $t$-test for consistent differences between two sets of observations. Fisher's $F$-test is a generalization of the $t$-test to linear regression and linear null hypotheses. Only the sign test is robust to non-Gaussianity. Gutenbrunner et al. [1993] derived a version of the sign test for linear null hypotheses in the spirit of the $F$-test, which requires the difficult estimation of the sparsity function. We propose instead a new sign test called the $\\infty$-S test via the convex analysis of a point estimator that thresholds the estimate towards the null hypothesis of the test."}, "https://arxiv.org/abs/2409.04378": {"title": "An MPEC Estimator for the Sequential Search Model", "link": "https://arxiv.org/abs/2409.04378", "description": "arXiv:2409.04378v1 Announce Type: new \nAbstract: This paper proposes a constrained maximum likelihood estimator for sequential search models, using the MPEC (Mathematical Programming with Equilibrium Constraints) approach. This method enhances numerical accuracy while avoiding ad hoc components and errors related to equilibrium conditions. Monte Carlo simulations show that the estimator performs better in small samples, with lower bias and root-mean-squared error, though less effectively in large samples. Despite these mixed results, the MPEC approach remains valuable for identifying candidate parameters comparable to the benchmark, without relying on ad hoc look-up tables, as it generates the table through solved equilibrium constraints."}, "https://arxiv.org/abs/2409.04412": {"title": "Robust Elicitable Functionals", "link": "https://arxiv.org/abs/2409.04412", "description": "arXiv:2409.04412v1 Announce Type: new \nAbstract: Elicitable functionals and (strict) consistent scoring functions are of interest due to their utility in determining (uniquely) optimal forecasts, and thus the ability to effectively backtest predictions. However, in practice, assuming that a distribution is correctly specified is too strong a belief to reliably hold. To remediate this, we incorporate a notion of statistical robustness into the framework of elicitable functionals, meaning that our robust functional accounts for \"small\" misspecifications of a baseline distribution. Specifically, we propose a robustified version of elicitable functionals by using the Kullback-Leibler divergence to quantify potential misspecifications from a baseline distribution. We show that the robust elicitable functionals admit unique solutions lying at the boundary of the uncertainty region. Since every elicitable functional possesses infinitely many scoring functions, we propose the class of b-homogeneous strictly consistent scoring functions, for which the robust functionals maintain desirable statistical properties. 
We show the applicability of the REF in two examples: in the reinsurance setting and in robust regression problems."}, "https://arxiv.org/abs/2409.03876": {"title": "A tutorial on panel data analysis using partially observed Markov processes via the R package panelPomp", "link": "https://arxiv.org/abs/2409.03876", "description": "arXiv:2409.03876v1 Announce Type: cross \nAbstract: The R package panelPomp supports analysis of panel data via a general class of partially observed Markov process models (PanelPOMP). This package tutorial describes how the mathematical concept of a PanelPOMP is represented in the software and demonstrates typical use-cases of panelPomp. Monte Carlo methods used for POMP models require adaptation for PanelPOMP models due to the higher dimensionality of panel data. The package takes advantage of recent advances for PanelPOMP, including an iterated filtering algorithm, Monte Carlo adjusted profile methodology and block optimization methodology to assist with the large parameter spaces that can arise with panel models. In addition, tools for manipulation of models and data are provided that take advantage of the panel structure."}, "https://arxiv.org/abs/2409.04001": {"title": "Over-parameterized regression methods and their application to semi-supervised learning", "link": "https://arxiv.org/abs/2409.04001", "description": "arXiv:2409.04001v1 Announce Type: cross \nAbstract: The minimum norm least squares method is an estimation strategy for the over-parameterized case and, in machine learning, is known as a helpful tool for understanding the nature of deep learning. In this paper, to apply it in the context of non-parametric regression problems, we established several methods which are based on thresholding of SVD (singular value decomposition) components, which are referred to as SVD regression methods. We considered several such methods: singular-value-based thresholding, hard-thresholding with cross validation, universal thresholding and bridge thresholding. Information on output samples is not utilized in the first method, while it is utilized in the other methods. We then applied them to semi-supervised learning, in which unlabeled input samples are incorporated into kernel functions in a regressor. The experimental results for real data showed that, depending on the datasets, the SVD regression methods are superior to a naive ridge regression method. Unfortunately, there was no clear advantage of the methods utilizing information on output samples. Furthermore, depending on the dataset, the incorporation of unlabeled input samples into kernels is found to have certain advantages."}, "https://arxiv.org/abs/2409.04365": {"title": "Leveraging Machine Learning for Official Statistics: A Statistical Manifesto", "link": "https://arxiv.org/abs/2409.04365", "description": "arXiv:2409.04365v1 Announce Type: cross \nAbstract: It is important for official statistics production to apply ML with statistical rigor, as it presents both opportunities and challenges. Although machine learning has enjoyed rapid technological advances in recent years, its application does not possess the methodological robustness necessary to produce high quality statistical results. In order to account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. 
As a means of ensuring that ML models are both internally valid as well as externally valid, the TMLE model addresses issues such as representativeness and measurement errors. There are several case studies presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics."}, "https://arxiv.org/abs/2409.04377": {"title": "Local times of self-intersection and sample path properties of Volterra Gaussian processes", "link": "https://arxiv.org/abs/2409.04377", "description": "arXiv:2409.04377v1 Announce Type: cross \nAbstract: We study a Volterra Gaussian process of the form $X(t)=\\int^t_0K(t,s)d{W(s)},$ where $W$ is a Wiener process and $K$ is a continuous kernel. In dimension one, we prove a law of the iterated logarithm, discuss the existence of local times and verify a continuous dependence between the local time and the kernel that generates the process. Furthermore, we prove the existence of the Rosen renormalized self-intersection local times for a planar Gaussian Volterra process."}, "https://arxiv.org/abs/1808.04945": {"title": "A confounding bridge approach for double negative control inference on causal effects", "link": "https://arxiv.org/abs/1808.04945", "description": "arXiv:1808.04945v4 Announce Type: replace \nAbstract: Unmeasured confounding is a key challenge for causal inference. In this paper, we establish a framework for unmeasured confounding adjustment with negative control variables. A negative control outcome is associated with the confounder but not causally affected by the exposure in view, and a negative control exposure is correlated with the primary exposure or the confounder but does not causally affect the outcome of interest. We introduce an outcome confounding bridge function that depicts the relationship between the confounding effects on the primary outcome and the negative control outcome, and we incorporate a negative control exposure to identify the bridge function and the average causal effect. We also consider the extension to the positive control setting by allowing for nonzero causal effect of the primary exposure on the control outcome. We illustrate our approach with simulations and apply it to a study about the short-term effect of air pollution on mortality. Although a standard analysis shows a significant acute effect of PM2.5 on mortality, our analysis indicates that this effect may be confounded, and after double negative control adjustment, the effect is attenuated toward zero."}, "https://arxiv.org/abs/2311.11290": {"title": "Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias", "link": "https://arxiv.org/abs/2311.11290", "description": "arXiv:2311.11290v2 Announce Type: replace \nAbstract: Firth (1993, Biometrika) shows that the maximum Jeffreys' prior penalized likelihood estimator in logistic regression has asymptotic bias decreasing with the square of the number of observations when the number of parameters is fixed, which is an order faster than the typical rate from maximum likelihood. The widespread use of that estimator in applied work is supported by the results in Kosmidis and Firth (2021, Biometrika), who show that it takes finite values, even in cases where the maximum likelihood estimate does not exist. 
Kosmidis and Firth (2021, Biometrika) also provide empirical evidence that the estimator has good bias properties in high-dimensional settings where the number of parameters grows asymptotically linearly but slower than the number of observations. We design and carry out a large-scale computer experiment covering a wide range of such high-dimensional settings and produce strong empirical evidence for a simple rescaling of the maximum Jeffreys' prior penalized likelihood estimator that delivers high accuracy in signal recovery, in terms of aggregate bias, in the presence of an intercept parameter. The rescaled estimator is effective even in cases where estimates from maximum likelihood and other recently proposed corrective methods based on approximate message passing do not exist."}, "https://arxiv.org/abs/2210.12277": {"title": "The Stochastic Proximal Distance Algorithm", "link": "https://arxiv.org/abs/2210.12277", "description": "arXiv:2210.12277v4 Announce Type: replace-cross \nAbstract: Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\\rho \\rightarrow \\infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks."}, "https://arxiv.org/abs/2211.03933": {"title": "A Hypergraph-Based Machine Learning Ensemble Network Intrusion Detection System", "link": "https://arxiv.org/abs/2211.03933", "description": "arXiv:2211.03933v3 Announce Type: replace-cross \nAbstract: Network intrusion detection systems (NIDS) to detect malicious attacks continue to meet challenges. NIDS are often developed offline while they face auto-generated port scan infiltration attempts, resulting in a significant time lag from adversarial adaption to NIDS response. To address these challenges, we use hypergraphs focused on internet protocol addresses and destination ports to capture evolving patterns of port scan attacks. The derived set of hypergraph-based metrics are then used to train an ensemble machine learning (ML) based NIDS that allows for real-time adaption in monitoring and detecting port scanning activities, other types of attacks, and adversarial intrusions at high accuracy, precision and recall performances. This ML adapting NIDS was developed through the combination of (1) intrusion examples, (2) NIDS update rules, (3) attack threshold choices to trigger NIDS retraining requests, and (4) a production environment with no prior knowledge of the nature of network traffic. 40 scenarios were auto-generated to evaluate the ML ensemble NIDS comprising three tree-based models. The resulting ML Ensemble NIDS was extended and evaluated with the CIC-IDS2017 dataset. 
Results show that under the model settings of an Update-ALL-NIDS rule (specifically retrain and update all the three models upon the same NIDS retraining request) the proposed ML ensemble NIDS evolved intelligently and produced the best results with nearly 100% detection performance throughout the simulation."}, "https://arxiv.org/abs/2306.15422": {"title": "Debiasing Piecewise Deterministic Markov Process samplers using couplings", "link": "https://arxiv.org/abs/2306.15422", "description": "arXiv:2306.15422v2 Announce Type: replace-cross \nAbstract: Monte Carlo methods -- such as Markov chain Monte Carlo (MCMC) and piecewise deterministic Markov process (PDMP) samplers -- provide asymptotically exact estimators of expectations under a target distribution. There is growing interest in alternatives to this asymptotic regime, in particular in constructing estimators that are exact in the limit of an infinite amount of computing processors, rather than in the limit of an infinite number of Markov iterations. In particular, Jacob et al. (2020) introduced coupled MCMC estimators to remove the non-asymptotic bias, resulting in MCMC estimators that can be embarrassingly parallelised. In this work, we extend the estimators of Jacob et al. (2020) to the continuous-time context and derive couplings for the bouncy, the boomerang and the coordinate samplers. Some preliminary empirical results are included that demonstrate the reasonable scaling of our method with the dimension of the target."}, "https://arxiv.org/abs/2312.11437": {"title": "Clustering Consistency of General Nonparametric Classification Methods in Cognitive Diagnosis", "link": "https://arxiv.org/abs/2312.11437", "description": "arXiv:2312.11437v2 Announce Type: replace-cross \nAbstract: Cognitive diagnosis models have been popularly used in fields such as education, psychology, and social sciences. While parametric likelihood estimation is a prevailing method for fitting cognitive diagnosis models, nonparametric methodologies are attracting increasing attention due to their ease of implementation and robustness, particularly when sample sizes are relatively small. However, existing clustering consistency results of the nonparametric estimation methods often rely on certain restrictive conditions, which may not be easily satisfied in practice. In this article, the clustering consistency of the general nonparametric classification method is reestablished under weaker and more practical conditions."}, "https://arxiv.org/abs/2409.04589": {"title": "Lee Bounds with Multilayered Sample Selection", "link": "https://arxiv.org/abs/2409.04589", "description": "arXiv:2409.04589v1 Announce Type: new \nAbstract: This paper investigates the causal effect of job training on wage rates in the presence of firm heterogeneity. When training affects worker sorting to firms, sample selection is no longer binary but is \"multilayered\". This paper extends the canonical Heckman (1979) sample selection model - which assumes selection is binary - to a setting where it is multilayered, and shows that in this setting Lee bounds set identifies a total effect that combines a weighted-average of the causal effect of job training on wage rates across firms with a weighted-average of the contrast in wages between different firms for a fixed level of training. Thus, Lee bounds set identifies a policy-relevant estimand only when firms pay homogeneous wages and/or when job training does not affect worker sorting across firms. 
We derive sharp closed-form bounds for the causal effect of job training on wage rates at each firm which leverage information on firm-specific wages. We illustrate our partial identification approach with an empirical application to the Job Corps Study. Results show that while conventional Lee bounds are strictly positive, our within-firm bounds include 0 showing that canonical Lee bounds may be capturing a pure sorting effect of job training."}, "https://arxiv.org/abs/2409.04684": {"title": "Establishing the Parallels and Differences Between Right-Censored and Missing Covariates", "link": "https://arxiv.org/abs/2409.04684", "description": "arXiv:2409.04684v1 Announce Type: new \nAbstract: While right-censored time-to-event outcomes have been studied for decades, handling time-to-event covariates, also known as right-censored covariates, is now of growing interest. So far, the literature has treated right-censored covariates as distinct from missing covariates, overlooking the potential applicability of estimators to both scenarios. We bridge this gap by establishing connections between right-censored and missing covariates under various assumptions about censoring and missingness, allowing us to identify parallels and differences to determine when estimators can be used in both contexts. These connections reveal adaptations to five estimators for right-censored covariates in the unexplored area of informative covariate right-censoring and to formulate a new estimator for this setting, where the event time depends on the censoring time. We establish the asymptotic properties of the six estimators, evaluate their robustness under incorrect distributional assumptions, and establish their comparative efficiency. We conducted a simulation study to confirm our theoretical results, and then applied all estimators to a Huntington disease observational study to analyze cognitive impairments as a function of time to clinical diagnosis."}, "https://arxiv.org/abs/2409.04729": {"title": "A Unified Framework for Cluster Methods with Tensor Networks", "link": "https://arxiv.org/abs/2409.04729", "description": "arXiv:2409.04729v1 Announce Type: new \nAbstract: Markov Chain Monte Carlo (MCMC), and Tensor Networks (TN) are two powerful frameworks for numerically investigating many-body systems, each offering distinct advantages. MCMC, with its flexibility and theoretical consistency, is well-suited for simulating arbitrary systems by sampling. TN, on the other hand, provides a powerful tensor-based language for capturing the entanglement properties intrinsic to many-body systems, offering a universal representation of these systems. In this work, we leverage the computational strengths of TN to design a versatile cluster MCMC sampler. Specifically, we propose a general framework for constructing tensor-based cluster MCMC methods, enabling arbitrary cluster updates by utilizing TNs to compute the distributions required in the MCMC sampler. Our framework unifies several existing cluster algorithms as special cases and allows for natural extensions. We demonstrate our method by applying it to the simulation of the two-dimensional Edwards-Anderson Model and the three-dimensional Ising Model. This work is dedicated to the memory of Prof. 
David Draper."}, "https://arxiv.org/abs/2409.04836": {"title": "Spatial Interference Detection in Treatment Effect Model", "link": "https://arxiv.org/abs/2409.04836", "description": "arXiv:2409.04836v1 Announce Type: new \nAbstract: Modeling the interference effect is an important issue in the field of causal inference. Existing studies rely on explicit and often homogeneous assumptions regarding interference structures. In this paper, we introduce a low-rank and sparse treatment effect model that leverages data-driven techniques to identify the locations of interference effects. A profiling algorithm is proposed to estimate the model coefficients, and based on these estimates, global test and local detection methods are established to detect the existence of interference and the interference neighbor locations for each unit. We derive the non-asymptotic bound of the estimation error, and establish theoretical guarantees for the global test and the accuracy of the detection method in terms of Jaccard index. Simulations and real data examples are provided to demonstrate the usefulness of the proposed method."}, "https://arxiv.org/abs/2409.04874": {"title": "Improving the Finite Sample Performance of Double/Debiased Machine Learning with Propensity Score Calibration", "link": "https://arxiv.org/abs/2409.04874", "description": "arXiv:2409.04874v1 Announce Type: new \nAbstract: Machine learning techniques are widely used for estimating causal effects. Double/debiased machine learning (DML) (Chernozhukov et al., 2018) uses a double-robust score function that relies on the prediction of nuisance functions, such as the propensity score, which is the probability of treatment assignment conditional on covariates. Estimators relying on double-robust score functions are highly sensitive to errors in propensity score predictions. Machine learners increase the severity of this problem as they tend to over- or underestimate these probabilities. Several calibration approaches have been proposed to improve probabilistic forecasts of machine learners. This paper investigates the use of probability calibration approaches within the DML framework. Simulation results demonstrate that calibrating propensity scores may significantly reduce the root mean squared error of DML estimates of the average treatment effect in finite samples. We showcase it in an empirical example and provide conditions under which calibration does not alter the asymptotic properties of the DML estimator."}, "https://arxiv.org/abs/2409.04876": {"title": "DEPLOYERS: An agent based modeling tool for multi country real world data", "link": "https://arxiv.org/abs/2409.04876", "description": "arXiv:2409.04876v1 Announce Type: new \nAbstract: We present recent progress in the design and development of DEPLOYERS, an agent-based macroeconomic modeling (ABM) framework, capable of deploying and simulating a full economic system (individual workers, goods and services firms, government, central and private banks, financial market, external sectors) whose structure and activity analysis reproduce the desired calibration data, which can be, for example, a Social Accounting Matrix (SAM), a Supply-Use Table (SUT), or an Input-Output Table (IOT). Here we extend our previous work to a multi-country version and show an example using data from a 46-country, 64-sector FIGARO Inter-Country IOT. 
The simulation of each country runs on a separate thread or CPU core to simulate the activity of one step (month, week, or day) and then interacts (updates imports, exports, transfer) with that country's foreign partners, and proceeds to the next step. This interaction can be chosen to be aggregated (a single row and column IO account) or disaggregated (64 rows and columns) with each partner. A typical run simulates thousands of individuals and firms engaged in their monthly activity and then records the results, much like a survey of the country's economic system. This data can then be subjected to, for example, an Input-Output analysis to find out the sources of observed stylized effects as a function of time in the detailed and realistic modeling environment that can be easily implemented in an ABM framework."}, "https://arxiv.org/abs/2409.04933": {"title": "Marginal Structural Modeling of Representative Treatment Trajectories", "link": "https://arxiv.org/abs/2409.04933", "description": "arXiv:2409.04933v1 Announce Type: new \nAbstract: Marginal structural models (MSMs) are widely used in observational studies to estimate the causal effect of time-varying treatments. Despite its popularity, limited attention has been paid to summarizing the treatment history in the outcome model, which proves particularly challenging when individuals' treatment trajectories exhibit complex patterns over time. Commonly used metrics such as the average treatment level fail to adequately capture the treatment history, hindering causal interpretation. For scenarios where treatment histories exhibit distinct temporal patterns, we develop a new approach to parameterize the outcome model. We apply latent growth curve analysis to identify representative treatment trajectories from the observed data and use the posterior probability of latent class membership to summarize the different treatment trajectories. We demonstrate its use in parameterizing the MSMs, which facilitates the interpretations of the results. We apply the method to analyze data from an existing cohort of lung transplant recipients to estimate the effect of Tacrolimus concentrations on the risk of incident chronic kidney disease."}, "https://arxiv.org/abs/2409.04970": {"title": "A response-adaptive multi-arm design for continuous endpoints based on a weighted information measure", "link": "https://arxiv.org/abs/2409.04970", "description": "arXiv:2409.04970v1 Announce Type: new \nAbstract: Multi-arm trials are gaining interest in practice given the statistical and logistical advantages that they can offer. The standard approach is to use a fixed (throughout the trial) allocation ratio, but there is a call for making it adaptive and skewing the allocation of patients towards better performing arms. However, among other challenges, it is well-known that these approaches might suffer from lower statistical power. We present a response-adaptive design for continuous endpoints which explicitly allows to control the trade-off between the number of patients allocated to the 'optimal' arm and the statistical power. Such a balance is achieved through the calibration of a tuning parameter, and we explore various strategies to effectively select it. The proposed criterion is based on a context-dependent information measure which gives a greater weight to those treatment arms which have characteristics close to a pre-specified clinical target. 
We also introduce a simulation-based hypothesis testing procedure which focuses on selecting the target arm, discussing strategies to effectively control the type-I error rate. The potential advantage of the proposed criterion over currently used alternatives is evaluated in simulations, and its practical implementation is illustrated in the context of early Phase IIa proof-of-concept oncology clinical trials."}, "https://arxiv.org/abs/2409.04981": {"title": "Forecasting Age Distribution of Deaths: Cumulative Distribution Function Transformation", "link": "https://arxiv.org/abs/2409.04981", "description": "arXiv:2409.04981v1 Announce Type: new \nAbstract: Like density functions, period life-table death counts are nonnegative and have a constrained integral, and thus live in a constrained nonlinear space. Implementing established modelling and forecasting methods without obeying these constraints can be problematic for such nonlinear data. We introduce cumulative distribution function transformation to forecast the life-table death counts. Using the Japanese life-table death counts obtained from the Japanese Mortality Database (2024), we evaluate the point and interval forecast accuracies of the proposed approach, which compares favourably to an existing compositional data analytic approach. The improved forecast accuracy of life-table death counts is of great interest to demographers for estimating age-specific survival probabilities and life expectancy and actuaries for determining temporary annuity prices for different ages and maturities."}, "https://arxiv.org/abs/2409.04995": {"title": "Projective Techniques in Consumer Research: A Mixed Methods-Focused Review and Empirical Reanalysis", "link": "https://arxiv.org/abs/2409.04995", "description": "arXiv:2409.04995v1 Announce Type: new \nAbstract: This article gives an integrative review of research using projective methods in the consumer research domain. We give a general historical overview of the use of projective methods, both in psychology and in consumer research applications, and discuss the reliability and validity aspects and measurement for projective techniques. We review the literature on projective techniques in the areas of marketing, hospitality & tourism, and consumer & food science, with a mixed methods research focus on the interplay of qualitative and quantitative techniques. We review the use of several quantitative techniques used for structuring and analyzing projective data and run an empirical reanalysis of previously gathered data. We give recommendations for improved rigor and for potential future work involving mixed methods in projective techniques."}, "https://arxiv.org/abs/2409.05038": {"title": "An unbiased rank-based estimator of the Mann-Whitney variance including the case of ties", "link": "https://arxiv.org/abs/2409.05038", "description": "arXiv:2409.05038v1 Announce Type: new \nAbstract: Many estimators of the variance of the well-known unbiased and uniform most powerful estimator $\\hat{\\theta}$ of the Mann-Whitney effect, $\\theta = P(X < Y) + \\frac{1}{2} P(X=Y)$, are considered in the literature. Some of these estimators are only valid in case of no ties or are biased in case of small sample sizes where the amount of the bias is not discussed. Here we derive an unbiased estimator that is based on different rankings, the so-called 'placements' (Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does not require the assumption of continuous distribution functions and is also valid in the case of ties. 
Moreover, it is shown that this estimator is non-negative and has a sharp upper bound which may be considered an empirical version of the well-known Birnbaum-Klose inequality. The derivation of this estimator provides an option to compute the biases of some commonly used estimators in the literature. Simulations demonstrate that, for small sample sizes, the biases of these estimators depend on the underlying distribution functions and thus are not under control. This means that in the case of a biased estimator, simulation results for the type-I error of a test or the coverage probability of a confidence interval depend not only on the quality of the approximation of $\\hat{\\theta}$ by a normal distribution but also on an additional unknown bias caused by the variance estimator. Finally, it is shown that this estimator is $L_2$-consistent."}, "https://arxiv.org/abs/2409.05160": {"title": "Inference for Large Scale Regression Models with Dependent Errors", "link": "https://arxiv.org/abs/2409.05160", "description": "arXiv:2409.05160v1 Announce Type: new \nAbstract: The exponential growth in data sizes and storage costs has brought considerable challenges to the data science community, requiring solutions to run learning methods on such data. While machine learning has scaled to achieve predictive accuracy in big data settings, statistical inference and uncertainty quantification tools are still lagging. Priority scientific fields collect vast amounts of data to understand phenomena typically studied with statistical methods like regression. In this setting, regression parameter estimation can benefit from efficient computational procedures, but the main challenge lies in computing error process parameters with complex covariance structures. Identifying and estimating these structures is essential for inference and often used for uncertainty quantification in machine learning with Gaussian Processes. However, estimating these structures becomes burdensome as data scales, requiring approximations that compromise the reliability of outputs. These approximations are even more unreliable when complexities like long-range dependencies or missing data are present. This work defines and proves the statistical properties of the Generalized Method of Wavelet Moments with Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid method for estimating and delivering inference for linear models using stochastic processes in the presence of data complexities like latent dependence structures and missing data. Applied examples from Earth Sciences and extensive simulations highlight the advantages of the GMWMX."}, "https://arxiv.org/abs/2409.05161": {"title": "Really Doing Great at Model Evaluation for CATE Estimation? A Critical Consideration of Current Model Evaluation Practices in Treatment Effect Estimation", "link": "https://arxiv.org/abs/2409.05161", "description": "arXiv:2409.05161v1 Announce Type: new \nAbstract: This paper critically examines current methodologies for evaluating models in Conditional and Average Treatment Effect (CATE/ATE) estimation, identifying several key pitfalls in existing practices. The current over-reliance on specific metrics and empirical means, together with the lack of statistical tests, necessitates a more rigorous evaluation approach. We propose an automated algorithm for selecting appropriate statistical tests, addressing the trade-offs and assumptions inherent in these tests. 
Additionally, we emphasize the importance of reporting empirical standard deviations alongside performance metrics and advocate for using Squared Error for Coverage (SEC) and Absolute Error for Coverage (AEC) metrics and empirical histograms of the coverage results as supplementary metrics. These enhancements provide a more comprehensive understanding of model performance in heterogeneous data-generating processes (DGPs). The practical implications are demonstrated through two examples, showcasing the benefits of these methodological improvements, which can significantly improve the robustness and accuracy of future research in statistical models for CATE and ATE estimation."}, "https://arxiv.org/abs/2409.05184": {"title": "Difference-in-Differences with Multiple Events", "link": "https://arxiv.org/abs/2409.05184", "description": "arXiv:2409.05184v1 Announce Type: new \nAbstract: Confounding events with correlated timing violate the parallel trends assumption in Difference-in-Differences (DiD) designs. I show that the standard staggered DiD estimator is biased in the presence of confounding events. Identification can be achieved with units not yet treated by either event as controls and a double DiD design using variation in treatment timing. I apply this method to examine the effect of states' staggered minimum wage raise on teen employment from 2010 to 2020. The Medicaid expansion under the ACA confounded the raises, leading to a spurious negative estimate."}, "https://arxiv.org/abs/2409.05271": {"title": "Priors from Envisioned Posterior Judgments: A Novel Elicitation Approach With Application to Bayesian Clinical Trials", "link": "https://arxiv.org/abs/2409.05271", "description": "arXiv:2409.05271v1 Announce Type: new \nAbstract: The uptake of formalized prior elicitation from experts in Bayesian clinical trials has been limited, largely due to the challenges associated with complex statistical modeling, the lack of practical tools, and the cognitive burden on experts required to quantify their uncertainty using probabilistic language. Additionally, existing methods do not address prior-posterior coherence, i.e., does the posterior distribution, obtained mathematically from combining the estimated prior with the trial data, reflect the expert's actual posterior beliefs? We propose a new elicitation approach that seeks to ensure prior-posterior coherence and reduce the expert's cognitive burden. This is achieved by eliciting responses about the expert's envisioned posterior judgments under various potential data outcomes and inferring the prior distribution by minimizing the discrepancies between these responses and the expected responses obtained from the posterior distribution. The feasibility and potential value of the new approach are illustrated through an application to a real trial currently underway."}, "https://arxiv.org/abs/2409.05276": {"title": "An Eigengap Ratio Test for Determining the Number of Communities in Network Data", "link": "https://arxiv.org/abs/2409.05276", "description": "arXiv:2409.05276v1 Announce Type: new \nAbstract: To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. 
However, to our knowledge, existing methods for estimating the number of network communities often require model estimations or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that addresses these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test."}, "https://arxiv.org/abs/2409.05412": {"title": "Censored Data Forecasting: Applying Tobit Exponential Smoothing with Time Aggregation", "link": "https://arxiv.org/abs/2409.05412", "description": "arXiv:2409.05412v1 Announce Type: new \nAbstract: This study introduces a novel approach to forecasting by Tobit Exponential Smoothing with time aggregation constraints. This model, a particular case of the Tobit Innovations State Space system, handles censored observed time series effectively, such as sales data, with known and potentially variable censoring levels over time. The paper provides a comprehensive analysis of the model structure, including its representation in system equations and the optimal recursive estimation of states. It also explores the benefits of time aggregation in state space systems, particularly for inventory management and demand forecasting. Through a series of case studies, the paper demonstrates the effectiveness of the model across various scenarios, including hourly and daily censoring levels. The results highlight the model's ability to produce accurate forecasts and confidence bands comparable to those from uncensored models, even under severe censoring conditions. The study further discusses the implications for inventory policy, emphasizing the importance of avoiding spiral-down effects in demand estimation. The paper concludes by showcasing the superiority of the proposed model over standard methods, particularly in reducing lost sales and excess stock, thereby optimizing inventory costs. This research contributes to the field of forecasting by offering a robust model that effectively addresses the challenges of censored data and time aggregation."}, "https://arxiv.org/abs/2409.05632": {"title": "Efficient nonparametric estimators of discrimination measures with censored survival data", "link": "https://arxiv.org/abs/2409.05632", "description": "arXiv:2409.05632v1 Announce Type: new \nAbstract: Discrimination measures such as concordance statistics (e.g. the c-index or the concordance probability) and the cumulative-dynamic time-dependent area under the ROC-curve (AUC) are widely used in the medical literature for evaluating the predictive accuracy of a scoring rule which relates a set of prognostic markers to the risk of experiencing a particular event. Often the scoring rule being evaluated in terms of discriminatory ability is the linear predictor of a survival regression model such as the Cox proportional hazards model. 
This has the undesirable feature that the scoring rule depends on the censoring distribution when the model is misspecified. In this work we focus on linear scoring rules where the coefficient vector is a nonparametric estimand defined in the setting where there is no censoring. We propose so-called debiased estimators of the aforementioned discrimination measures for this class of scoring rules. The proposed estimators make efficient use of the data and minimize bias by allowing for the use of data-adaptive methods for model fitting. Moreover, the estimators do not rely on correct specification of the censoring model to produce consistent estimation. We compare the estimators to existing methods in a simulation study, and we illustrate the method by an application to a brain cancer study."}, "https://arxiv.org/abs/2409.05713": {"title": "The Surprising Robustness of Partial Least Squares", "link": "https://arxiv.org/abs/2409.05713", "description": "arXiv:2409.05713v1 Announce Type: new \nAbstract: Partial least squares (PLS) is a simple factorisation method that works well with high dimensional problems in which the number of observations is limited given the number of independent variables. In this article, we show that PLS can perform better than ordinary least squares (OLS), least absolute shrinkage and selection operator (LASSO) and ridge regression in forecasting quarterly gross domestic product (GDP) growth, covering the period from 2000 to 2023. In fact, through dimension reduction, PLS proved to be effective in lowering the out-of-sample forecasting error, especially since 2020. For the period 2000-2019, the four methods produce similar results, suggesting that PLS is a valid regularisation technique like LASSO or ridge."}, "https://arxiv.org/abs/2409.04500": {"title": "Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm", "link": "https://arxiv.org/abs/2409.04500", "description": "arXiv:2409.04500v1 Announce Type: cross \nAbstract: Estimating the effect of treatments from natural experiments, where treatments are pre-assigned, is an important and well-studied problem. We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit. Surprisingly, applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy. To address this, we create a benchmark to evaluate estimator accuracy using synthetic outcomes, whose design was guided by domain experts. The benchmark extensively explores performance as real-world conditions like sample size, treatment correlation, and propensity score accuracy vary. Based on our benchmark, we observe that the class of doubly robust treatment effect estimators, which are based on simple and intuitive regression adjustment, generally outperform other more complicated estimators by orders of magnitude. To better support our theoretical understanding of doubly robust estimators, we derive a closed form expression for the variance of any such estimator that uses dataset splitting to obtain an unbiased estimate. This expression motivates the design of a new doubly robust estimator that uses a novel loss function when fitting functions for regression adjustment. 
We release the dataset and benchmark in a Python package; the package is built in a modular way to facilitate new datasets and estimators."}, "https://arxiv.org/abs/2409.04789": {"title": "forester: A Tree-Based AutoML Tool in R", "link": "https://arxiv.org/abs/2409.04789", "description": "arXiv:2409.04789v1 Announce Type: cross \nAbstract: The majority of automated machine learning (AutoML) solutions are developed in Python; however, a large percentage of data scientists are associated with the R language. Unfortunately, there are limited R solutions available. Moreover, their high entry level means they are not accessible to everyone, due to the required knowledge about machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user's proficiency in the area of machine learning.\n The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification and regression tasks, and partially supports survival analysis. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis."}, "https://arxiv.org/abs/2409.05036": {"title": "Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian Cox point processes", "link": "https://arxiv.org/abs/2409.05036", "description": "arXiv:2409.05036v1 Announce Type: cross \nAbstract: Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which disease spreads, defined as the rate of change between time and space. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of people infected can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference by employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. The velocity is then calculated using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to better examine the spread of the disease throughout the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic."}, "https://arxiv.org/abs/2409.05192": {"title": "Bellwether Trades: Characteristics of Trades influential in Predicting Future Price Movements in Markets", "link": "https://arxiv.org/abs/2409.05192", "description": "arXiv:2409.05192v1 Announce Type: cross \nAbstract: In this study, we leverage powerful non-linear machine learning methods to identify the characteristics of trades that contain valuable information. First, we demonstrate the effectiveness of our optimized neural network predictor in accurately predicting future market movements. 
Then, we utilize the information from this successful neural network predictor to pinpoint the individual trades within each data point (trading window) that had the most impact on the optimized neural network's prediction of future price movements. This approach helps us uncover important insights about the heterogeneity in information content provided by trades of different sizes, venues, trading contexts, and over time."}, "https://arxiv.org/abs/2409.05354": {"title": "Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design", "link": "https://arxiv.org/abs/2409.05354", "description": "arXiv:2409.05354v1 Announce Type: cross \nAbstract: This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive, algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) $\\mathcal{O}(T^2)$ computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods."}, "https://arxiv.org/abs/2409.05529": {"title": "Bootstrapping Estimators based on the Block Maxima Method", "link": "https://arxiv.org/abs/2409.05529", "description": "arXiv:2409.05529v1 Announce Type: cross \nAbstract: The block maxima method is a standard approach for analyzing the extremal behavior of a potentially multivariate time series. It has recently been found that the classical approach based on disjoint block maxima may be universally improved by considering sliding block maxima instead. However, the asymptotic variance formula for estimators based on sliding block maxima involves an integral over the covariance of a certain family of multivariate extreme value distributions, which makes its estimation, and inference in general, an intricate problem. As an alternative, one may rely on bootstrap approximations: we show that naive block-bootstrap approaches from time series analysis are inconsistent even in i.i.d.\\ situations, and provide a consistent alternative based on resampling circular block maxima. As a by-product, we show consistency of the classical resampling bootstrap for disjoint block maxima, and that estimators based on circular block maxima have the same asymptotic variance as their sliding block maxima counterparts. The finite sample properties are illustrated by Monte Carlo experiments, and the methods are demonstrated by a case study of precipitation extremes."}, "https://arxiv.org/abs/2409.05630": {"title": "Multilevel testing of constraints induced by structural equation modeling in fMRI effective connectivity analysis: A proof of concept", "link": "https://arxiv.org/abs/2409.05630", "description": "arXiv:2409.05630v1 Announce Type: cross \nAbstract: In functional MRI (fMRI), effective connectivity analysis aims at inferring the causal influences that brain regions exert on one another. A common method for this type of analysis is structural equation modeling (SEM). We here propose a novel method to test the validity of a given model of structural equation. 
Given a structural model in the form of a directed graph, the method extracts the set of all constraints of conditional independence induced by the absence of links between pairs of regions in the model and tests for their validity in a Bayesian framework, either individually (constraint by constraint), jointly (e.g., by gathering all constraints associated with a given missing link), or globally (i.e., all constraints associated with the structural model). This approach has two main advantages. First, it only tests what is testable from observational data and does not allow for false causal interpretation. Second, it makes it possible to test each constraint (or group of constraints) separately and, therefore, quantify to what extent each constraint (or, e.g., missing link) is respected in the data. We validate our approach using a simulation study and illustrate its potential benefits through the reanalysis of published data."}, "https://arxiv.org/abs/2409.05715": {"title": "Uniform Estimation and Inference for Nonparametric Partitioning-Based M-Estimators", "link": "https://arxiv.org/abs/2409.05715", "description": "arXiv:2409.05715v1 Announce Type: cross \nAbstract: This paper presents uniform estimation and inference theory for a large class of nonparametric partitioning-based M-estimators. The main theoretical results include: (i) uniform consistency for convex and non-convex objective functions; (ii) optimal uniform Bahadur representations; (iii) optimal uniform (and mean square) convergence rates; (iv) valid strong approximations and feasible uniform inference methods; and (v) extensions to functional transformations of underlying estimators. Uniformity is established over both the evaluation point of the nonparametric functional parameter and a Euclidean parameter indexing the class of loss functions. The results also account explicitly for the smoothness degree of the loss function (if any), and allow for a possibly non-identity (inverse) link function. We illustrate the main theoretical and methodological results with four substantive applications: quantile regression, distribution regression, $L_p$ regression, and logistic regression; many other possibly non-smooth, nonlinear, generalized, robust M-estimation settings are covered by our theoretical results. We provide detailed comparisons with the existing literature and demonstrate substantive improvements: we achieve the best (in some cases optimal) known results under improved (in some cases minimal) requirements in terms of regularity conditions and side rate restrictions. The supplemental appendix reports other technical results that may be of independent interest."}, "https://arxiv.org/abs/2409.05729": {"title": "Efficient estimation with incomplete data via generalised ANOVA decomposition", "link": "https://arxiv.org/abs/2409.05729", "description": "arXiv:2409.05729v1 Announce Type: cross \nAbstract: We study the efficient estimation of a class of mean functionals in settings where a complete multivariate dataset is complemented by additional datasets recording subsets of the variables of interest. These datasets are allowed to have a general, in particular non-monotonic, structure. Our main contribution is to characterise the asymptotic minimal mean squared error for these problems and to introduce an estimator whose risk approximately matches this lower bound. 
We show that the efficient rescaled variance can be expressed as the minimal value of a quadratic optimisation problem over a function space, thus establishing a fundamental link between these estimation problems and the theory of generalised ANOVA decompositions. Our estimation procedure uses iterated nonparametric regression to mimic an approximate influence function derived through gradient descent. We prove that this estimator is approximately normally distributed, provide an estimator of its variance and thus develop confidence intervals of asymptotically minimal width. Finally we study a more direct estimator, which can be seen as a U-statistic with a data-dependent kernel, showing that it is also efficient under stronger regularity conditions."}, "https://arxiv.org/abs/2409.05798": {"title": "Enhancing Preference-based Linear Bandits via Human Response Time", "link": "https://arxiv.org/abs/2409.05798", "description": "arXiv:2409.05798v1 Announce Type: cross \nAbstract: Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences (\"easy\" queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated."}, "https://arxiv.org/abs/2209.05399": {"title": "Principles of Statistical Inference in Online Problems", "link": "https://arxiv.org/abs/2209.05399", "description": "arXiv:2209.05399v2 Announce Type: replace \nAbstract: To investigate a dilemma of statistical and computational efficiency faced by long-run variance estimators, we propose a decomposition of kernel weights in a quadratic form and some online inference principles. These proposals allow us to characterize efficient online long-run variance estimators. Our asymptotic theory and simulations show that this principle-driven approach leads to online estimators with a uniformly lower mean squared error than all existing works. We also discuss practical enhancements such as mini-batch and automatic updates to handle fast streaming data and optimal parameters tuning. Beyond variance estimation, we consider the proposals in the context of online quantile regression, online change point detection, Markov chain Monte Carlo convergence diagnosis, and stochastic approximation. 
Substantial improvements in computational cost and finite-sample statistical properties are observed when we apply our principle-driven variance estimator to original and modified inference procedures."}, "https://arxiv.org/abs/2211.13610": {"title": "Cross-Sectional Dynamics Under Network Structure: Theory and Macroeconomic Applications", "link": "https://arxiv.org/abs/2211.13610", "description": "arXiv:2211.13610v4 Announce Type: replace \nAbstract: Many environments in economics feature a cross-section of units linked by bilateral ties. I develop a framework for studying dynamics of cross-sectional variables that exploits this network structure. The Network-VAR (NVAR) is a vector autoregression in which innovations transmit cross-sectionally via bilateral links and which can accommodate rich patterns of how network effects of higher order accumulate over time. It can be used to estimate dynamic network effects, with the network given or inferred from dynamic cross-correlations in the data. It also offers a dimensionality-reduction technique for modeling high-dimensional (cross-sectional) processes, owing to networks' ability to summarize complex relations among variables (units) by relatively few bilateral links. In a first application, consistent with an RBC economy with lagged input-output conversion, I estimate how sectoral productivity shocks transmit along supply chains and affect sectoral prices in the US economy. In a second application, I forecast monthly industrial production growth across 44 countries by assuming and estimating a network underlying the dynamics. This reduces out-of-sample mean squared errors by up to 23% relative to a factor model, consistent with an equivalence result I derive."}, "https://arxiv.org/abs/2212.10042": {"title": "Guarantees for Comprehensive Simulation Assessment of Statistical Methods", "link": "https://arxiv.org/abs/2212.10042", "description": "arXiv:2212.10042v3 Announce Type: replace \nAbstract: Simulation can evaluate a statistical method for properties such as Type I Error, FDR, or bias on a grid of hypothesized parameter values. But what about the gaps between the grid-points? Continuous Simulation Extension (CSE) is a proof-by-simulation framework which can supplement simulations with (1) confidence bands valid over regions of parameter space or (2) calibration of rejection thresholds to provide rigorous proof of strong Type I Error control. CSE extends simulation estimates at grid-points into bounds over nearby space using a model shift bound related to the Renyi divergence, which we analyze for models in exponential family or canonical GLM form. CSE can work with adaptive sampling, nuisance parameters, administrative censoring, multiple arms, multiple testing, Bayesian randomization, Bayesian decision-making, and inference algorithms of arbitrary complexity. As a case study, we calibrate for strong Type I Error control a Phase II/III Bayesian selection design with 4 unknown statistical parameters. Potential applications include calibration of new statistical procedures or streamlining regulatory review of adaptive trial designs. 
Our open-source software implementation imprint is available at https://github.com/Confirm-Solutions/imprint"}, "https://arxiv.org/abs/2409.06172": {"title": "Nonparametric Inference for Balance in Signed Networks", "link": "https://arxiv.org/abs/2409.06172", "description": "arXiv:2409.06172v1 Announce Type: new \nAbstract: In many real-world networks, relationships often go beyond simple dyadic presence or absence; they can be positive, like friendship, alliance, and mutualism, or negative, characterized by enmity, disputes, and competition. To understand the formation mechanism of such signed networks, the social balance theory sheds light on the dynamics of positive and negative connections. In particular, it characterizes the proverbs, \"a friend of my friend is my friend\" and \"an enemy of my enemy is my friend\". In this work, we propose a nonparametric inference approach for assessing empirical evidence for the balance theory in real-world signed networks. We first characterize the generating process of signed networks with node exchangeability and propose a nonparametric sparse signed graphon model. Under this model, we construct confidence intervals for the population parameters associated with balance theory and establish their theoretical validity. Our inference procedure is as computationally efficient as a simple normal approximation but offers higher-order accuracy. By applying our method, we find strong real-world evidence for balance theory in signed networks across various domains, extending its applicability beyond social psychology."}, "https://arxiv.org/abs/2409.06180": {"title": "Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach", "link": "https://arxiv.org/abs/2409.06180", "description": "arXiv:2409.06180v1 Announce Type: new \nAbstract: Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment."}, "https://arxiv.org/abs/2409.06288": {"title": "Ensemble Doubly Robust Bayesian Inference via Regression Synthesis", "link": "https://arxiv.org/abs/2409.06288", "description": "arXiv:2409.06288v1 Announce Type: new \nAbstract: The doubly robust estimator, which models both the propensity score and outcomes, is a popular approach to estimate the average treatment effect in the potential outcome setting. 
The primary appeal of this estimator is its theoretical property, wherein the estimator achieves consistency as long as either the propensity score or the outcome model is correctly specified. In most applications, however, both are misspecified, leading to considerable bias that cannot be checked. In this paper, we propose a Bayesian ensemble approach that synthesizes multiple models for both the propensity score and outcomes, which we call doubly robust Bayesian regression synthesis. Our approach applies Bayesian updating to the ensemble model weights that adapt at the unit level, incorporating data heterogeneity, to significantly mitigate misspecification bias. Theoretically, we show that our proposed approach is consistent regarding the estimation of both the propensity score and outcomes, ensuring that the doubly robust estimator is consistent, even if no single model is correctly specified. An efficient algorithm for posterior computation facilitates the characterization of uncertainty regarding the treatment effect. Our proposed approach is compared against standard and state-of-the-art methods through two comprehensive simulation studies, where we find that our approach is superior in all cases. An empirical study on the impact of maternal smoking on birth weight highlights the practical applicability of our proposed method."}, "https://arxiv.org/abs/2409.06413": {"title": "This is not normal! (Re-) Evaluating the lower $n$ guidelines for regression analysis", "link": "https://arxiv.org/abs/2409.06413", "description": "arXiv:2409.06413v1 Announce Type: new \nAbstract: The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \\geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound for the number of observations required for regression analysis by exploring how different distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient that either the dependent or independent variable follow a symmetric distribution for the t-values to converge to the t-distribution at much smaller sample sizes than $n=30$. This is contrary to previous guidance which suggests that the error term needs to be normally distributed for this convergence to happen at low $n$. On the other hand, if both dependent and independent variables are highly skewed the required sample size is substantially higher. In cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n\\geq30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics. 
This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis."}, "https://arxiv.org/abs/2409.06654": {"title": "Estimation and Inference for Causal Functions with Multiway Clustered Data", "link": "https://arxiv.org/abs/2409.06654", "description": "arXiv:2409.06654v1 Announce Type: new \nAbstract: This paper proposes methods of estimation and uniform inference for a general class of causal functions, such as the conditional average treatment effects and the continuous treatment effects, under multiway clustering. The causal function is identified as a conditional expectation of an adjusted (Neyman-orthogonal) signal that depends on high-dimensional nuisance parameters. We propose a two-step procedure where the first step uses machine learning to estimate the high-dimensional nuisance parameters. The second step projects the estimated Neyman-orthogonal signal onto a dictionary of basis functions whose dimension grows with the sample size. For this two-step procedure, we propose both the full-sample and the multiway cross-fitting estimation approaches. A functional limit theory is derived for these estimators. To construct the uniform confidence bands, we develop a novel resampling procedure, called the multiway cluster-robust sieve score bootstrap, that extends the sieve score bootstrap (Chen and Christensen, 2018) to the novel setting with multiway clustering. Extensive numerical simulations showcase that our methods achieve desirable finite-sample behaviors. We apply the proposed methods to analyze the causal relationship between mistrust levels in Africa and the historical slave trade. Our analysis rejects the null hypothesis of uniformly zero effects and reveals heterogeneous treatment effects, with significant impacts at higher levels of trade volumes."}, "https://arxiv.org/abs/2409.06680": {"title": "Sequential stratified inference for the mean", "link": "https://arxiv.org/abs/2409.06680", "description": "arXiv:2409.06680v1 Announce Type: new \nAbstract: We develop conservative tests for the mean of a bounded population using data from a stratified sample. The sample may be drawn sequentially, with or without replacement. The tests are \"anytime valid,\" allowing optional stopping and continuation in each stratum. We call this combination of properties sequential, finite-sample, nonparametric validity. The methods express a hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. They test each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. The $P$-value of the global null hypothesis is then the maximum $P$-value of any intersection hypothesis in the union. This approach has three primary moving parts: (i) the rule for deciding which stratum to draw from next to test each intersection null, given the sample so far; (ii) the form of the TSM for each null in each stratum; and (iii) the method of combining evidence across strata. These choices interact. We examine the performance of a variety of rules with differing computational complexity. Approximately optimal methods have a prohibitive computational cost, while naive rules may be inconsistent -- they will never reject for some alternative populations, no matter how large the sample. We present a method that is statistically comparable to optimal methods in examples where optimal methods are computable, but computationally tractable for arbitrarily many strata. 
In numerical examples its expected sample size is substantially smaller than that of previous methods."}, "https://arxiv.org/abs/2409.05934": {"title": "Predicting Electricity Consumption with Random Walks on Gaussian Processes", "link": "https://arxiv.org/abs/2409.05934", "description": "arXiv:2409.05934v1 Announce Type: cross \nAbstract: We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm called \\textsc{Domino} (ranDOM walk on gaussIaN prOcesses) and present numerical experiments to support its merits."}, "https://arxiv.org/abs/2409.06157": {"title": "Causal Analysis of Shapley Values: Conditional vs", "link": "https://arxiv.org/abs/2409.06157", "description": "arXiv:2409.06157v1 Announce Type: cross \nAbstract: Shapley values, a game theoretic concept, has been one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches, conditional and marginal, to calculating Shapley values can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to the situation in the literature where contradictory recommendations regarding choice of an approach are provided by different authors. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one."}, "https://arxiv.org/abs/2409.06271": {"title": "A new paradigm for global sensitivity analysis", "link": "https://arxiv.org/abs/2409.06271", "description": "arXiv:2409.06271v1 Announce Type: cross \nAbstract:
Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope (for instance, the analysis is limited to the output's variance and the inputs have to be mutually independent) and leads to sensitivity indices whose interpretation is not fully clear, especially for interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed, but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.
"}, "https://arxiv.org/abs/2409.06565": {"title": "Enzyme kinetic reactions as interacting particle systems: Stochastic averaging and parameter inference", "link": "https://arxiv.org/abs/2409.06565", "description": "arXiv:2409.06565v1 Announce Type: cross \nAbstract: We consider a stochastic model of multistage Michaelis--Menten (MM) type enzyme kinetic reactions describing the conversion of substrate molecules to a product through several intermediate species. The high-dimensional, multiscale nature of these reaction networks presents significant computational challenges, especially in statistical estimation of reaction rates. This difficulty is amplified when direct data on system states are unavailable, and one only has access to a random sample of product formation times. To address this, we proceed in two stages. First, under certain technical assumptions akin to those made in the Quasi-steady-state approximation (QSSA) literature, we prove two asymptotic results: a stochastic averaging principle that yields a lower-dimensional model, and a functional central limit theorem that quantifies the associated fluctuations. Next, for statistical inference of the parameters of the original MM reaction network, we develop a mathematical framework involving an interacting particle system (IPS) and prove a propagation of chaos result that allows us to write a product-form likelihood function. The novelty of the IPS-based inference method is that it does not require information about the state of the system and works with only a random sample of product formation times. We provide numerical examples to illustrate the efficacy of the theoretical results."}, "https://arxiv.org/abs/2409.07018": {"title": "Clustered Factor Analysis for Multivariate Spatial Data", "link": "https://arxiv.org/abs/2409.07018", "description": "arXiv:2409.07018v1 Announce Type: new \nAbstract: Factor analysis has been extensively used to reveal the dependence structures among multivariate variables, offering valuable insight in various fields. However, it cannot incorporate the spatial heterogeneity that is typically present in spatial data. To address this issue, we introduce an effective method specifically designed to discover the potential dependence structures in multivariate spatial data. Our approach assumes that spatial locations can be approximately divided into a finite number of clusters, with locations within the same cluster sharing similar dependence structures. By leveraging an iterative algorithm that combines spatial clustering with factor analysis, we simultaneously detect spatial clusters and estimate a unique factor model for each cluster. The proposed method is evaluated through comprehensive simulation studies, demonstrating its flexibility. In addition, we apply the proposed method to a dataset of railway station attributes in the Tokyo metropolitan area, highlighting its practical applicability and effectiveness in uncovering complex spatial dependencies."}, "https://arxiv.org/abs/2409.07087": {"title": "Testing for a Forecast Accuracy Breakdown under Long Memory", "link": "https://arxiv.org/abs/2409.07087", "description": "arXiv:2409.07087v1 Announce Type: new \nAbstract: We propose a test to detect a forecast accuracy breakdown in a long memory time series and provide theoretical and simulation evidence on the memory transfer from the time series to the forecast residuals. 
The proposed method uses a double sup-Wald test against the alternative of a structural break in the mean of an out-of-sample loss series. To address the problem of estimating the long-run variance under long memory, a robust estimator is applied. The corresponding breakpoint results from a long memory robust CUSUM test. The finite sample size and power properties of the test are derived in a Monte Carlo simulation. A monotonic power function is obtained for the fixed forecasting scheme. In our practical application, we find that the global energy crisis that began in 2021 led to a forecast break in European electricity prices, while the results for the U.S. are mixed."}, "https://arxiv.org/abs/2409.07125": {"title": "Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning", "link": "https://arxiv.org/abs/2409.07125", "description": "arXiv:2409.07125v1 Announce Type: new \nAbstract: Modeling with multi-omics data presents multiple challenges such as the high-dimensionality of the problem ($p \\gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data from the integration of two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables."}, "https://arxiv.org/abs/2409.07176": {"title": "Non-parametric estimation of transition intensities in interval censored Markov multi-state models without loops", "link": "https://arxiv.org/abs/2409.07176", "description": "arXiv:2409.07176v1 Announce Type: new \nAbstract: Panel data arises when transitions between different states are interval-censored in multi-state data. The analysis of such data using non-parametric multi-state models was not possible until recently, but is very desirable as it allows for more flexibility than its parametric counterparts. The single available result to date has some unique drawbacks. We propose a non-parametric estimator of the transition intensities for panel data using an Expectation Maximisation algorithm. The method allows for a mix of interval-censored and right-censored (exactly observed) transitions. A condition to check for the convergence of the algorithm to the non-parametric maximum likelihood estimator is given. A simulation study comparing the proposed estimator to a consistent estimator is performed, and shown to yield near identical estimates at smaller computational cost. A data set on the emergence of teeth in children is analysed. Code to perform the analyses is publicly available."}, "https://arxiv.org/abs/2409.07233": {"title": "Extended-support beta regression for $[0, 1]$ responses", "link": "https://arxiv.org/abs/2409.07233", "description": "arXiv:2409.07233v1 Announce Type: new \nAbstract: We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of $(0, 1)$. 
Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both $0$ and $1$ -- known as the heteroscedastic two-limit tobit model in the econometrics literature -- are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models."}, "https://arxiv.org/abs/2409.07263": {"title": "Order selection in GARMA models for count time series: a Bayesian perspective", "link": "https://arxiv.org/abs/2409.07263", "description": "arXiv:2409.07263v1 Announce Type: new \nAbstract: Estimation in GARMA models has traditionally been carried out under the frequentist approach. To date, Bayesian approaches for such estimation have been relatively limited. In the context of GARMA models for count time series, Bayesian estimation achieves satisfactory results in terms of point estimation. Model selection in this context often relies on the use of information criteria. Despite its prominence in the literature, the use of information criteria for model selection in GARMA models for count time series has been shown to present poor performance in simulations, especially in terms of their ability to correctly identify models, even under large sample sizes. In this study, we address the problem of order selection in GARMA models for count time series, adopting a Bayesian perspective through the application of the Reversible Jump Markov Chain Monte Carlo approach. Monte Carlo simulation studies are conducted to assess the finite sample performance of the developed ideas, including point and interval inference, sensitivity analysis, effects of burn-in and thinning, as well as the choice of related priors and hyperparameters. Two real-data applications are presented, one considering automobile production in Brazil and the other considering bus exportation in Brazil before and after the COVID-19 pandemic, showcasing the method's capabilities and further exploring its flexibility."}, "https://arxiv.org/abs/2409.07350": {"title": "Local Effects of Continuous Instruments without Positivity", "link": "https://arxiv.org/abs/2409.07350", "description": "arXiv:2409.07350v1 Announce Type: new \nAbstract: Instrumental variables have become a popular study design for the estimation of treatment effects in the presence of unobserved confounders. In the canonical instrumental variables design, the instrument is a binary variable, and most extant methods are tailored to this context. In many settings, however, the instrument is a continuous measure. 
Standard estimation methods can be applied with continuous instruments, but they require strong assumptions regarding functional form. Moreover, while some recent work has introduced more flexible approaches for continuous instruments, these methods require an assumption known as positivity that is unlikely to hold in many applications. We derive a novel family of causal estimands using a stochastic dynamic intervention framework that considers a range of intervention distributions that are absolutely continuous with respect to the observed distribution of the instrument. These estimands focus on a specific form of local effect but do not require a positivity assumption. Next, we develop doubly robust estimators for these estimands that allow for estimation of the nuisance functions via nonparametric estimators. We use empirical process theory and sample splitting to derive asymptotic properties of the proposed estimators under weak conditions. In addition, we derive methods for profiling the principal strata as well as a method for sensitivity analysis for assessing robustness to an underlying monotonicity assumption. We evaluate our methods via simulation and demonstrate their feasibility using an application on the effectiveness of surgery for specific emergency conditions."}, "https://arxiv.org/abs/2409.07380": {"title": "Multi-source Stable Variable Importance Measure via Adversarial Machine Learning", "link": "https://arxiv.org/abs/2409.07380", "description": "arXiv:2409.07380v1 Announce Type: new \nAbstract: As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial in reliable and generalizable decision-making. In this paper, we propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single source variable importance. Numerical studies with various types of data generation setups and machine learning implementation are conducted to justify the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations."}, "https://arxiv.org/abs/2409.07391": {"title": "Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap", "link": "https://arxiv.org/abs/2409.07391", "description": "arXiv:2409.07391v1 Announce Type: new \nAbstract: To estimate the average treatment effect in real-world populations, observational studies are typically designed around real-world cohorts. However, even when study samples from these designs represent the population, unmeasured confounders can introduce bias. Sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on the strict mathematical assumptions of other existing methods. 
This article introduces a new approach that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even with limited overlap due to inclusion/exclusion criteria. Theoretical proof and simulations show that this method provides a tighter bound width than existing approaches. We also apply this method to both a trial dataset and a real-world drug effectiveness comparison dataset for practical analysis."}, "https://arxiv.org/abs/2409.07111": {"title": "Local Sequential MCMC for Data Assimilation with Applications in Geoscience", "link": "https://arxiv.org/abs/2409.07111", "description": "arXiv:2409.07111v1 Announce Type: cross \nAbstract: This paper presents a new data assimilation (DA) scheme based on a sequential Markov Chain Monte Carlo (SMCMC) DA technique [Ruzayqat et al. 2024] which is provably convergent and has been recently used for filtering, particularly for high-dimensional non-linear, and potentially, non-Gaussian state-space models. Unlike particle filters, which can be considered exact methods and can be used for filtering non-linear, non-Gaussian models, SMCMC does not assign weights to the samples/particles, and therefore, the method does not suffer from the issue of weight-degeneracy when a relatively small number of samples is used. We design a localization approach within the SMCMC framework that focuses on regions where observations are located and restricts the transition densities included in the filtering distribution of the state to these regions. This immensely reduces the effective degrees of freedom and thus improves the efficiency. We test the new technique on a high-dimensional ($d \\sim 10^4 - 10^5$) linear Gaussian model and on non-linear shallow water models with Gaussian noise, using real and synthetic observations. For two of the numerical examples, the observations mimic the data generated by the Surface Water and Ocean Topography (SWOT) mission led by NASA, which is a swath of ocean height observations that changes location at every assimilation time step. We also use a set of ocean drifters' real observations in which the drifters move according to the ocean kinematics and are assumed to have uncertain locations at the time of assimilation. We show that when higher accuracy is required, the proposed algorithm is superior in terms of efficiency and accuracy to competing ensemble methods and the original SMCMC filter."}, "https://arxiv.org/abs/2409.07389": {"title": "Dynamic Bayesian Networks, Elicitation and Data Embedding for Secure Environments", "link": "https://arxiv.org/abs/2409.07389", "description": "arXiv:2409.07389v1 Announce Type: cross \nAbstract: Serious crime modelling typically needs to be undertaken securely behind a firewall where police knowledge and capabilities can remain undisclosed. Data informing an ongoing incident is often sparse, with a large proportion of relevant data only coming to light after the incident culminates or after police intervene - by which point it is too late to make use of the data to aid real-time decision making for the incident in question. Much of the data that is available to police to support real-time decision making is highly confidential so cannot be shared with academics, and is therefore missing to them. In this paper, we describe the development of a formal protocol where a graphical model is used as a framework for securely translating a model designed by an academic team to a model for use by a police team. 
We then show, for the first time, how libraries of these models can be built and used for real-time decision support to circumvent the challenges of data missingness and tardiness seen in such a secure environment. The parallel development described by this protocol ensures that any sensitive information collected by police, and missing to academics, remains secured behind a firewall. The protocol nevertheless guides police so that they are able to combine the typically incomplete data streams that are open source with their more sensitive information in a formal and justifiable way. We illustrate the application of this protocol by describing how a new entry - a suspected vehicle attack - can be embedded into such a police library of criminal plots."}, "https://arxiv.org/abs/2211.04027": {"title": "Bootstraps for Dynamic Panel Threshold Models", "link": "https://arxiv.org/abs/2211.04027", "description": "arXiv:2211.04027v3 Announce Type: replace \nAbstract: This paper develops valid bootstrap inference methods for the dynamic short panel threshold regression. We demonstrate that the standard nonparametric bootstrap is inconsistent for the first-differenced generalized method of moments (GMM) estimator. The inconsistency is due to an $n^{1/4}$-consistent non-normal asymptotic distribution for the threshold estimate when the parameter resides within the continuity region of the parameter space. It stems from the rank deficiency of the approximate Jacobian of the sample moment conditions on the continuity region. To address this, we propose a grid bootstrap to construct confidence intervals of the threshold, a residual bootstrap to construct confidence intervals of the coefficients, and a bootstrap for testing continuity. They are shown to be valid under uncertain continuity, while the grid bootstrap is additionally shown to be uniformly valid. A set of Monte Carlo experiments demonstrate that the proposed bootstraps perform well in the finite samples and improve upon the standard nonparametric bootstrap."}, "https://arxiv.org/abs/2302.02468": {"title": "Circular and Spherical Projected Cauchy Distributions: A Novel Framework for Circular and Directional Data Modeling", "link": "https://arxiv.org/abs/2302.02468", "description": "arXiv:2302.02468v4 Announce Type: replace \nAbstract: We introduce a novel family of projected distributions on the circle and the sphere, namely the circular and spherical projected Cauchy distributions, as promising alternatives for modelling circular and spherical data. The circular distribution encompasses the wrapped Cauchy distribution as a special case, while featuring a more convenient parameterisation. We also propose a generalised wrapped Cauchy distribution that includes an extra parameter, enhancing the fit of the distribution. In the spherical context, we impose two conditions on the scatter matrix of the Cauchy distribution, resulting in an elliptically symmetric distribution. Our projected distributions exhibit attractive properties, such as a closed-form normalising constant and straightforward random value generation. The distribution parameters can be estimated using maximum likelihood, and we assess their bias through numerical studies. 
Further, we compare our proposed distributions to existing models with real datasets, demonstrating equal or superior fitting both with and without covariates."}, "https://arxiv.org/abs/2302.14423": {"title": "The First-stage F Test with Many Weak Instruments", "link": "https://arxiv.org/abs/2302.14423", "description": "arXiv:2302.14423v2 Announce Type: replace \nAbstract: A widely adopted approach for detecting weak instruments is to use the first-stage $F$ statistic. While this method was developed with a fixed number of instruments, its performance with many instruments remains insufficiently explored. We show that the first-stage $F$ test exhibits distorted sizes for detecting many weak instruments, regardless of the choice of pretested estimators or Wald tests. These distortions occur due to the inadequate approximation using classical noncentral Chi-squared distributions. As a byproduct of our main result, we present an alternative approach to pre-test many weak instruments with the corrected first-stage $F$ statistic. An empirical illustration with Angrist and Krueger (1991)'s returns to education data confirms its usefulness."}, "https://arxiv.org/abs/2303.10525": {"title": "Robustifying likelihoods by optimistically re-weighting data", "link": "https://arxiv.org/abs/2303.10525", "description": "arXiv:2303.10525v2 Announce Type: replace \nAbstract: Likelihood-based inferences have been remarkably successful in wide-spanning application areas. However, even after due diligence in selecting a good model for the data at hand, there is inevitably some amount of model misspecification: outliers, data contamination or inappropriate parametric assumptions such as Gaussianity mean that most models are at best rough approximations of reality. A significant practical concern is that for certain inferences, even small amounts of model misspecification may have a substantial impact; a problem we refer to as brittleness. This article attempts to address the brittleness problem in likelihood-based inferences by choosing the most model-friendly data generating process in a distance-based neighborhood of the empirical measure. This leads to a new Optimistically Weighted Likelihood (OWL), which robustifies the original likelihood by formally accounting for a small amount of model misspecification. Focusing on total variation (TV) neighborhoods, we study theoretical properties, develop estimation algorithms and illustrate the methodology in applications to mixture models and regression."}, "https://arxiv.org/abs/2211.15769": {"title": "Graphical models for infinite measures with applications to extremes", "link": "https://arxiv.org/abs/2211.15769", "description": "arXiv:2211.15769v2 Announce Type: replace-cross \nAbstract: Conditional independence and graphical models are well studied for probability distributions on product spaces. We propose a new notion of conditional independence for any measure $\\Lambda$ on the punctured Euclidean space $\\mathbb R^d\\setminus \\{0\\}$ that explodes at the origin. The importance of such measures stems from their connection to infinitely divisible and max-infinitely divisible distributions, where they appear as L\\'evy measures and exponent measures, respectively. We characterize independence and conditional independence for $\\Lambda$ in various ways through kernels and factorization of a modified density, including a Hammersley-Clifford type theorem for undirected graphical models. 
As opposed to the classical conditional independence, our notion is intimately connected to the support of the measure $\\Lambda$. Our general theory unifies and extends recent approaches to graphical modeling in the fields of extreme value analysis and L\\'evy processes. Our results for the corresponding undirected and directed graphical models lay the foundation for new statistical methodology in these areas."}, "https://arxiv.org/abs/2409.07559": {"title": "Spatial Deep Convolutional Neural Networks", "link": "https://arxiv.org/abs/2409.07559", "description": "arXiv:2409.07559v1 Announce Type: new \nAbstract: Spatial prediction problems often use Gaussian process models, which can be computationally burdensome in high dimensions. Specification of an appropriate covariance function for the model can be challenging when complex non-stationarities exist. Recent work has shown that pre-computed spatial basis functions and a feed-forward neural network can capture complex spatial dependence structures while remaining computationally efficient. This paper builds on this literature by tailoring spatial basis functions for use in convolutional neural networks. Through both simulated and real data, we demonstrate that this approach yields more accurate spatial predictions than existing methods. Uncertainty quantification is also considered."}, "https://arxiv.org/abs/2409.07568": {"title": "Debiased high-dimensional regression calibration for errors-in-variables log-contrast models", "link": "https://arxiv.org/abs/2409.07568", "description": "arXiv:2409.07568v1 Announce Type: new \nAbstract: Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work aims to tackle the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by mismeasured or contaminated data. We introduce a calibration approach tailored for the linear log-contrast model. Under relatively lenient conditions regarding the sparsity level of the parameter, we have established the asymptotic normality of the estimator for inference. Numerical experiments and an application in microbiome study have demonstrated the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability for a wide range of research contexts."}, "https://arxiv.org/abs/2409.07617": {"title": "Determining number of factors under stability considerations", "link": "https://arxiv.org/abs/2409.07617", "description": "arXiv:2409.07617v1 Announce Type: new \nAbstract: This paper proposes a novel method for determining the number of factors in linear factor models under stability considerations. An instability measure is proposed based on the principal angle between the estimated loading spaces obtained by data splitting. Based on this measure, criteria for determining the number of factors are proposed and shown to be consistent. This consistency is obtained using results from random matrix theory, especially the complete delocalization of non-outlier eigenvectors. 
The advantage of the proposed methods over the existing ones is shown via weaker asymptotic requirements for consistency, simulation studies and a real data example."}, "https://arxiv.org/abs/2409.07738": {"title": "A model-based approach for clustering binned data", "link": "https://arxiv.org/abs/2409.07738", "description": "arXiv:2409.07738v1 Announce Type: new \nAbstract: Binned data often appears in different fields of research, and it is generated by summarizing the original data into a sequence of pairs of bins (or their midpoints) and frequencies. There may be different reasons to provide only this summary, but more importantly, it is necessary to be able to perform statistical analyses based only on it. We present a Bayesian nonparametric model for clustering applicable to binned data. Clusters are modeled via random partitions, and within them a model-based approach is assumed. Inferences are performed by a Markov chain Monte Carlo method and the complete proposal is tested using simulated and real data. Having particular interest in studying marine populations, we analyze samples of Lobatus (Strombus) gigas' lengths and find the presence of up to three cohorts over the year."}, "https://arxiv.org/abs/2409.07745": {"title": "Generalized Independence Test for Modern Data", "link": "https://arxiv.org/abs/2409.07745", "description": "arXiv:2409.07745v1 Announce Type: new \nAbstract: The test of independence is a crucial component of modern data analysis. However, traditional methods often struggle with the complex dependency structures found in high-dimensional data. To overcome this challenge, we introduce a novel test statistic that captures intricate relationships using similarity and dissimilarity information derived from the data. The statistic exhibits strong power across a broad range of alternatives for high-dimensional data, as demonstrated in extensive simulation studies. Under mild conditions, we show that the new test statistic converges to the $\\chi^2_4$ distribution under the permutation null distribution, ensuring straightforward type I error control. Furthermore, our research advances the moment method in proving the joint asymptotic normality of multiple double-indexed permutation statistics. We showcase the practical utility of this new test with an application to the Genotype-Tissue Expression dataset, where it effectively measures associations between human tissues."}, "https://arxiv.org/abs/2409.07795": {"title": "Robust and efficient estimation in the presence of a randomly censored covariate", "link": "https://arxiv.org/abs/2409.07795", "description": "arXiv:2409.07795v1 Announce Type: new \nAbstract: In Huntington's disease research, a current goal is to understand how symptoms change prior to a clinical diagnosis. Statistically, this entails modeling symptom severity as a function of the covariate 'time until diagnosis', which is often heavily right-censored in observational studies. Existing estimators that handle right-censored covariates have varying statistical efficiency and robustness to misspecified models for nuisance distributions (those of the censored covariate and censoring variable). On one extreme, complete case estimation, which utilizes uncensored data only, is free of nuisance distribution models but discards informative censored observations. On the other extreme, maximum likelihood estimation is maximally efficient but inconsistent when the covariate's distribution is misspecified. 
We propose a semiparametric estimator that is robust and efficient. When the nuisance distributions are modeled parametrically, the estimator is doubly robust, i.e., consistent if at least one distribution is correctly specified, and semiparametric efficient if both models are correctly specified. When the nuisance distributions are estimated via nonparametric or machine learning methods, the estimator is consistent and semiparametric efficient. We show empirically that the proposed estimator, implemented in the R package sparcc, has its claimed properties, and we apply it to study Huntington's disease symptom trajectories using data from the Enroll-HD study."}, "https://arxiv.org/abs/2409.07859": {"title": "Bootstrap Adaptive Lasso Solution Path Unit Root Tests", "link": "https://arxiv.org/abs/2409.07859", "description": "arXiv:2409.07859v1 Announce Type: new \nAbstract: We propose sieve wild bootstrap analogues to the adaptive Lasso solution path unit root tests of Arnold and Reinschl\\\"ussel (2024) arXiv:2404.06205 to improve finite sample properties and extend their applicability to a generalised framework, allowing for non-stationary volatility. Numerical evidence shows the bootstrap to improve the tests' precision for error processes that promote spurious rejections of the unit root null, depending on the detrending procedure. The bootstrap mitigates finite-sample size distortions and restores asymptotically valid inference when the data features time-varying unconditional variance. We apply the bootstrap tests to real residential property prices of the top six Eurozone economies and find evidence of stationarity to be period-specific, supporting the conjecture that exuberance in the housing market characterises the development of Euro-era residential property prices in the recent past."}, "https://arxiv.org/abs/2409.07881": {"title": "Cellwise outlier detection in heterogeneous populations", "link": "https://arxiv.org/abs/2409.07881", "description": "arXiv:2409.07881v1 Announce Type: new \nAbstract: Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations can encompass valuable information in some observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The proposal is estimated via an Expectation-Maximization (EM) algorithm with an additional step for flagging the contaminated cells of a data matrix and then imputing -- instead of discarding -- them before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real data sets."}, "https://arxiv.org/abs/2409.07917": {"title": "Multiple tests for restricted mean time lost with competing risks data", "link": "https://arxiv.org/abs/2409.07917", "description": "arXiv:2409.07917v1 Announce Type: new \nAbstract: Easy-to-interpret effect estimands are highly desirable in survival analysis. In the competing risks framework, one good candidate is the restricted mean time lost (RMTL). 
It is defined as the area under the cumulative incidence function up to a prespecified time point and, thus, it summarizes the cumulative incidence function into a meaningful estimand. While existing RMTL-based tests are limited to two-sample comparisons and mostly to two event types, we aim to develop general contrast tests for factorial designs and an arbitrary number of event types based on a Wald-type test statistic. Furthermore, we avoid the often-made, rather restrictive continuity assumption on the event time distribution. This allows for ties in the data, which often occur in practical applications, e.g., when event times are measured in whole days. In addition, we develop more reliable tests for RMTL comparisons that are based on a permutation approach to improve the small sample performance. In a second step, multiple tests for RMTL comparisons are developed to test several null hypotheses simultaneously. Here, we incorporate the asymptotically exact dependence structure between the local test statistics to gain more power. The small sample performance of the proposed testing procedures is analyzed in simulations and finally illustrated by analyzing a real data example about leukemia patients who underwent bone marrow transplantation."}, "https://arxiv.org/abs/2409.07956": {"title": "Community detection in multi-layer networks by regularized debiased spectral clustering", "link": "https://arxiv.org/abs/2409.07956", "description": "arXiv:2409.07956v1 Announce Type: new \nAbstract: Community detection is a crucial problem in the analysis of multi-layer networks. In this work, we introduce a new method, called regularized debiased sum of squared adjacency matrices (RDSoS), to detect latent communities in multi-layer networks. RDSoS is developed based on a novel regularized Laplacian matrix that regularizes the debiased sum of squared adjacency matrices. In contrast, the classical regularized Laplacian matrix typically regularizes the adjacency matrix of a single-layer network. Therefore, at a high level, our regularized Laplacian matrix extends the classical regularized Laplacian matrix to multi-layer networks. We establish the consistency property of RDSoS under the multi-layer stochastic block model (MLSBM) and further extend RDSoS and its theoretical results to the degree-corrected version of the MLSBM. The effectiveness of the proposed methods is evaluated and demonstrated through synthetic and real datasets."}, "https://arxiv.org/abs/2409.08112": {"title": "Review of Recent Advances in Gaussian Process Regression Methods", "link": "https://arxiv.org/abs/2409.08112", "description": "arXiv:2409.08112v1 Announce Type: new \nAbstract: Gaussian process (GP) methods have been widely studied recently, especially for large-scale systems with big data and even more extreme cases when data is sparse. Key advantages of these methods are that: 1) they provide inherent ways to assess the impact of uncertainties (especially in the data and environment) on the solutions, 2) they have efficient factorisation-based implementations, and 3) they can be implemented easily in a distributed manner and hence provide scalable solutions. This paper reviews the recently developed key factorised GP methods such as the hierarchical off-diagonal low-rank approximation methods and GP with Kronecker structures. 
An example illustrates the performance of these methods with respect to accuracy and computational complexity."}, "https://arxiv.org/abs/2409.08158": {"title": "Trends and biases in the social cost of carbon", "link": "https://arxiv.org/abs/2409.08158", "description": "arXiv:2409.08158v1 Announce Type: new \nAbstract: An updated and extended meta-analysis confirms that the central estimate of the social cost of carbon is around $200/tC with a large, right-skewed uncertainty and trending up. The pure rate of time preference and the inverse of the elasticity of intertemporal substitution are key assumptions, the total impact of 2.5K warming less so. The social cost of carbon is much higher if climate change is assumed to affect economic growth rather than the level of output and welfare. The literature is dominated by a relatively small network of authors, based in a few countries. Publication and citation bias have pushed the social cost of carbon up."}, "https://arxiv.org/abs/2409.07679": {"title": "Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning", "link": "https://arxiv.org/abs/2409.07679", "description": "arXiv:2409.07679v1 Announce Type: cross \nAbstract: We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strengths of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the \"notorious\" issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strengths of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase."}, "https://arxiv.org/abs/2409.07874": {"title": "Fused $L_{1/2}$ prior for large scale linear inverse problem with Gibbs bouncy particle sampler", "link": "https://arxiv.org/abs/2409.07874", "description": "arXiv:2409.07874v1 Announce Type: cross \nAbstract: In this paper, we study a Bayesian approach for solving large-scale linear inverse problems arising in various scientific and engineering fields. We propose a fused $L_{1/2}$ prior with edge-preserving and sparsity-promoting properties and show that it can be formulated as a Gaussian mixture Markov random field. Since the density function of this family of priors is neither log-concave nor Lipschitz, gradient-based Markov chain Monte Carlo methods cannot be applied to sample the posterior. Thus, we present a Gibbs sampler in which all the conditional posteriors involved have closed form expressions. The Gibbs sampler works well for small-size problems, but it is computationally intractable for large-scale problems due to the need to sample from a high-dimensional Gaussian distribution. 
To reduce the computational burden, we construct a Gibbs bouncy particle sampler (Gibbs-BPS) based on a piecewise deterministic Markov process. This new sampler combines elements of the Gibbs sampler with the bouncy particle sampler, and its computational complexity is an order of magnitude smaller. We show that the new sampler converges to the target distribution. With computed tomography examples, we demonstrate that the proposed method shows competitive performance with existing popular Bayesian methods and is highly efficient in large-scale problems."}, "https://arxiv.org/abs/2409.07879": {"title": "Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series", "link": "https://arxiv.org/abs/2409.07879", "description": "arXiv:2409.07879v1 Announce Type: cross \nAbstract: Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14\\%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis."}, "https://arxiv.org/abs/2409.08059": {"title": "Causal inference and racial bias in policing: New estimands and the importance of mobility data", "link": "https://arxiv.org/abs/2409.08059", "description": "arXiv:2409.08059v1 Announce Type: cross \nAbstract: Studying racial bias in policing is a critically important problem, but one that comes with a number of inherent difficulties due to the nature of the available data. In this manuscript we tackle multiple key issues in the causal analysis of racial bias in policing. First, we formalize race and place policing, the idea that individuals of one race are policed differently when they are in neighborhoods primarily made up of individuals of other races. We develop an estimand to study this question rigorously, show the assumptions necessary for causal identification, and develop sensitivity analyses to assess robustness to violations of key assumptions. Additionally, we investigate difficulties with existing estimands targeting racial bias in policing. We show for these estimands, and the estimands developed in this manuscript, that estimation can benefit from incorporating mobility data into analyses. We apply these ideas to a study in New York City, where we find a large amount of racial bias, as well as race and place policing, and that these findings are robust to large violations of untestable assumptions. 
We additionally show that mobility data can have substantial impacts on the resulting estimates, suggesting it should be used whenever possible in subsequent studies."}, "https://arxiv.org/abs/2409.08201": {"title": "Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study", "link": "https://arxiv.org/abs/2409.08201", "description": "arXiv:2409.08201v1 Announce Type: cross \nAbstract: The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the distribution of test statistics for the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. All results from numerical experiments were obtained from a synthetic dataset generated using the Smirnov transform (Inverse Transform Sampling) and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods. All necessary materials (source code, example scripts, dataset, and samples) are available on GitHub and Hugging Face."}, "https://arxiv.org/abs/2211.01727": {"title": "Bayesian inference of vector autoregressions with tensor decompositions", "link": "https://arxiv.org/abs/2211.01727", "description": "arXiv:2211.01727v5 Announce Type: replace \nAbstract: Vector autoregressions (VARs) are popular models for analyzing multivariate economic time series. However, VARs can be over-parameterized if the numbers of variables and lags are moderately large. Tensor VAR, a recent solution to over-parameterization, treats the coefficient matrix as a third-order tensor and estimates the corresponding tensor decomposition to achieve parsimony. In this paper, we employ the Tensor VAR structure with a CANDECOMP/PARAFAC (CP) decomposition and conduct Bayesian inference to estimate parameters. Firstly, we determine the rank by imposing the Multiplicative Gamma Prior on the tensor margins, i.e. elements in the decomposition, and accelerate the computation with an adaptive inferential scheme. Secondly, to obtain interpretable margins, we propose an interweaving algorithm to improve the mixing of margins and identify the margins using a post-processing procedure. In an application to US macroeconomic data, our models outperform standard VARs in point and density forecasting and yield a summary of the dynamics of the US economy."}, "https://arxiv.org/abs/2301.11711": {"title": "ADDIS-Graphs for online error control with application to platform trials", "link": "https://arxiv.org/abs/2301.11711", "description": "arXiv:2301.11711v3 Announce Type: replace \nAbstract: In contemporary research, online error control is often required, where an error criterion, such as familywise error rate (FWER) or false discovery rate (FDR), shall remain under control while testing an a priori unbounded sequence of hypotheses. The existing online literature mainly considered large-scale designs and constructed blackbox-like algorithms for these. 
However, smaller studies, such as platform trials, require high flexibility and easy interpretability to take study objectives into account and facilitate communication. Another challenge in platform trials is that due to the shared control arm some of the p-values are dependent and significance levels need to be prespecified before the decisions for all the past treatments are available. We propose ADDIS-Graphs with FWER control that, due to their graphical structure, perfectly adapt to such settings and provably uniformly improve the state-of-the-art method. We introduce several extensions of these ADDIS-Graphs, including the incorporation of information about the joint distribution of the p-values and a version for FDR control."}, "https://arxiv.org/abs/2306.07017": {"title": "Multivariate extensions of the Multilevel Best Linear Unbiased Estimator for ensemble-variational data assimilation", "link": "https://arxiv.org/abs/2306.07017", "description": "arXiv:2306.07017v2 Announce Type: replace \nAbstract: Multilevel estimators aim at reducing the variance of Monte Carlo statistical estimators, by combining samples generated with simulators of different costs and accuracies. In particular, the recent work of Schaden and Ullmann (2020) on the multilevel best linear unbiased estimator (MLBLUE) introduces a framework unifying several multilevel and multifidelity techniques. The MLBLUE is reintroduced here using a variance minimization approach rather than the regression approach of Schaden and Ullmann. We then discuss possible extensions of the scalar MLBLUE to a multidimensional setting, i.e. from the expectation of scalar random variables to the expectation of random vectors. Several estimators of increasing complexity are proposed: a) multilevel estimators with scalar weights, b) with element-wise weights, c) with spectral weights and d) with general matrix weights. The computational cost of each method is discussed. We finally extend the MLBLUE to the estimation of second-order moments in the multidimensional case, i.e. to the estimation of covariance matrices. The multilevel estimators proposed are d) a multilevel estimator with scalar weights and e) with element-wise weights. In large-dimension applications such as data assimilation for geosciences, the latter estimator is computationally unaffordable. As a remedy, we also propose f) a multilevel covariance matrix estimator with optimal multilevel localization, inspired by the optimal localization theory of M\\'en\\'etrier and Aulign\\'e (2015). Some practical details on weighted MLMC estimators of covariance matrices are given in the appendix."}, "https://arxiv.org/abs/2308.01747": {"title": "Fusion regression methods with repeated functional data", "link": "https://arxiv.org/abs/2308.01747", "description": "arXiv:2308.01747v3 Announce Type: replace \nAbstract: Linear regression and classification methods with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions related by some neighborhood structure (spatial, group, etc.). Two regression methods based on fusion penalties are proposed to consider the dependence induced by this structure. These methods aim to obtain parsimonious coefficient regression functions, by determining if close conditions are associated with common regression coefficient functions. The first method is a generalization to functional data of the variable fusion methodology based on the 1-nearest neighbor. 
The second one relies on the group fusion lasso penalty, which assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. Numerical simulations and an application to electroencephalography data are presented."}, "https://arxiv.org/abs/2310.11741": {"title": "Graph of Graphs: From Nodes to Supernodes in Graphical Models", "link": "https://arxiv.org/abs/2310.11741", "description": "arXiv:2310.11741v2 Announce Type: replace \nAbstract: High-dimensional data analysis typically focuses on low-dimensional structure, often to aid interpretation and computational efficiency. Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data by representing variables as nodes and dependencies as edges. Inference is often focused on individual edges in the latent graph. Nonetheless, there is increasing interest in determining more complex structures, such as communities of nodes, for multiple reasons, including more effective information retrieval and better interpretability. In this work, we propose a hierarchical graphical model where we first cluster nodes and then, at the higher level, investigate the relationships among groups of nodes. Specifically, nodes are partitioned into supernodes with a data-coherent size-biased tessellation prior which combines ideas from Bayesian nonparametrics and Voronoi tessellations. This construct also allows accounting for the dependence of nodes within supernodes. At the higher level, dependence structure among supernodes is modeled through a Gaussian graphical model, where the focus of inference is on superedges. We provide theoretical justification for our modeling choices. We design tailored Markov chain Monte Carlo schemes, which also enable parallel computations. We demonstrate the effectiveness of our approach for large-scale structure learning in simulations and a transcriptomics application."}, "https://arxiv.org/abs/2312.16160": {"title": "SymmPI: Predictive Inference for Data with Group Symmetries", "link": "https://arxiv.org/abs/2312.16160", "description": "arXiv:2312.16160v3 Announce Type: replace \nAbstract: Quantifying the uncertainty of predictions is a core problem in modern statistics. Methods for predictive inference have been developed under a variety of assumptions, often -- for instance, in standard conformal prediction -- relying on the invariance of the distribution of the data under special groups of transformations such as permutation groups. Moreover, many existing methods for predictive inference aim to predict unobserved outcomes in sequences of feature-outcome observations. Meanwhile, there is interest in predictive inference under more general observation models (e.g., for partially observed features) and for data satisfying more general distributional symmetries (e.g., rotationally invariant or coordinate-independent observations in physics). Here we propose SymmPI, a methodology for predictive inference when data distributions have general group symmetries in arbitrary observation models. Our methods leverage the novel notion of distributional equivariant transformations, which process the data while preserving their distributional invariances. We show that SymmPI has valid coverage under distributional invariance and characterize its performance under distribution shift, recovering recent results as special cases. 
We apply SymmPI to predict unobserved values associated to vertices in a network, where the distribution is unchanged under relabelings that keep the network structure unchanged. In several simulations in a two-layer hierarchical model, and in an empirical data analysis example, SymmPI performs favorably compared to existing methods."}, "https://arxiv.org/abs/2310.02008": {"title": "fmeffects: An R Package for Forward Marginal Effects", "link": "https://arxiv.org/abs/2310.02008", "description": "arXiv:2310.02008v2 Announce Type: replace-cross \nAbstract: Forward marginal effects have recently been introduced as a versatile and effective model-agnostic interpretation method particularly suited for non-linear and non-parametric prediction models. They provide comprehensible model explanations of the form: if we change feature values by a pre-specified step size, what is the change in the predicted outcome? We present the R package fmeffects, the first software implementation of the theory surrounding forward marginal effects. The relevant theoretical background, package functionality and handling, as well as the software design and options for future extensions are discussed in this paper."}, "https://arxiv.org/abs/2409.08347": {"title": "Substitution in the perturbed utility route choice model", "link": "https://arxiv.org/abs/2409.08347", "description": "arXiv:2409.08347v1 Announce Type: new \nAbstract: This paper considers substitution patterns in the perturbed utility route choice model. We provide a general result that determines the marginal change in link flows following a marginal change in link costs across the network. We give a general condition on the network structure under which all paths are necessarily substitutes and an example in which some paths are complements. The presence of complementarity contradicts a result in a previous paper in this journal; we point out and correct the error."}, "https://arxiv.org/abs/2409.08354": {"title": "Bayesian Dynamic Factor Models for High-dimensional Matrix-valued Time Series", "link": "https://arxiv.org/abs/2409.08354", "description": "arXiv:2409.08354v1 Announce Type: new \nAbstract: High-dimensional matrix-valued time series are of significant interest in economics and finance, with prominent examples including cross region macroeconomic panels and firms' financial data panels. We introduce a class of Bayesian matrix dynamic factor models that utilize matrix structures to identify more interpretable factor patterns and factor impacts. Our model accommodates time-varying volatility, adjusts for outliers, and allows cross-sectional correlations in the idiosyncratic components. To determine the dimension of the factor matrix, we employ an importance-sampling estimator based on the cross-entropy method to estimate marginal likelihoods. Through a series of Monte Carlo experiments, we show the properties of the factor estimators and the performance of the marginal likelihood estimator in correctly identifying the true dimensions of the factor matrices. 
Applying our model to a macroeconomic dataset and a financial dataset, we demonstrate its ability in unveiling interesting features within matrix-valued time series."}, "https://arxiv.org/abs/2409.08756": {"title": "Cubature-based uncertainty estimation for nonlinear regression models", "link": "https://arxiv.org/abs/2409.08756", "description": "arXiv:2409.08756v1 Announce Type: new \nAbstract: Calibrating model parameters to measured data by minimizing loss functions is an important step in obtaining realistic predictions from model-based approaches, e.g., for process optimization. This is applicable to both knowledge-driven and data-driven model setups. Due to measurement errors, the calibrated model parameters also carry uncertainty. In this contribution, we use cubature formulas based on sparse grids to calculate the variance of the regression results. The number of cubature points is close to the theoretical minimum required for a given level of exactness. We present exact benchmark results, which we also compare to other cubatures. This scheme is then applied to estimate the prediction uncertainty of the NRTL model, calibrated to observations from different experimental designs."}, "https://arxiv.org/abs/2409.08773": {"title": "The Clustered Dose-Response Function Estimator for continuous treatment with heterogeneous treatment effects", "link": "https://arxiv.org/abs/2409.08773", "description": "arXiv:2409.08773v1 Announce Type: new \nAbstract: Many treatments are non-randomly assigned, continuous in nature, and exhibit heterogeneous effects even at identical treatment intensities. Taken together, these characteristics pose significant challenges for identifying causal effects, as no existing estimator can provide an unbiased estimate of the average causal dose-response function. To address this gap, we introduce the Clustered Dose-Response Function (Cl-DRF), a novel estimator designed to discern the continuous causal relationships between treatment intensity and the dependent variable across different subgroups. This approach leverages both theoretical and data-driven sources of heterogeneity and operates under relaxed versions of the conditional independence and positivity assumptions, which are required to be met only within each identified subgroup. To demonstrate the capabilities of the Cl-DRF estimator, we present both simulation evidence and an empirical application examining the impact of European Cohesion funds on economic growth."}, "https://arxiv.org/abs/2409.08779": {"title": "The underreported death toll of wars: a probabilistic reassessment from a structured expert elicitation", "link": "https://arxiv.org/abs/2409.08779", "description": "arXiv:2409.08779v1 Announce Type: new \nAbstract: Event datasets including those provided by Uppsala Conflict Data Program (UCDP) are based on reports from the media and international organizations, and are likely to suffer from reporting bias. Since the UCDP has strict inclusion criteria, they most likely under-estimate conflict-related deaths, but we do not know by how much. Here, we provide a generalizable, cross-national measure of uncertainty around UCDP reported fatalities that is more robust and realistic than UCDP's documented low and high estimates, and make available a dataset and R package accounting for the measurement uncertainty. 
We use a structured expert elicitation combined with statistical modelling to derive a distribution of plausible number of fatalities given the number of battle-related deaths and the type of violence documented by the UCDP. The results can help scholars understand the extent of bias affecting their empirical analyses of organized violence and contribute to improve the accuracy of conflict forecasting systems."}, "https://arxiv.org/abs/2409.08821": {"title": "High-dimensional regression with a count response", "link": "https://arxiv.org/abs/2409.08821", "description": "arXiv:2409.08821v1 Announce Type: new \nAbstract: We consider high-dimensional regression with a count response modeled by Poisson or negative binomial generalized linear model (GLM). We propose a penalized maximum likelihood estimator with a properly chosen complexity penalty and establish its adaptive minimaxity across models of various sparsity. To make the procedure computationally feasible for high-dimensional data we consider its LASSO and SLOPE convex surrogates. Their performance is illustrated through simulated and real-data examples."}, "https://arxiv.org/abs/2409.08838": {"title": "Angular Co-variance using intrinsic geometry of torus: Non-parametric change points detection in meteorological data", "link": "https://arxiv.org/abs/2409.08838", "description": "arXiv:2409.08838v1 Announce Type: new \nAbstract: In many temporal datasets, the parameters of the underlying distribution may change abruptly at unknown times. Detecting these changepoints is crucial for numerous applications. While this problem has been extensively studied for linear data, there has been remarkably less research on bivariate angular data. For the first time, we address the changepoint problem for the mean direction of toroidal and spherical data, which are types of bivariate angular data. By leveraging the intrinsic geometry of a curved torus, we introduce the concept of the ``square'' of an angle. This leads us to define the ``curved dispersion matrix'' for bivariate angular random variables, analogous to the dispersion matrix for bivariate linear random variables. Using this analogous measure of the ``Mahalanobis distance,'' we develop two new non-parametric tests to identify changes in the mean direction parameters for toroidal and spherical distributions. We derive the limiting distributions of the test statistics and evaluate their power surface and contours through extensive simulations. We also apply the proposed methods to detect changes in mean direction for hourly wind-wave direction measurements and the path of the cyclonic storm ``Biporjoy,'' which occurred between 6th and 19th June 2023 over the Arabian Sea, western coast of India."}, "https://arxiv.org/abs/2409.08863": {"title": "Change point analysis with irregular signals", "link": "https://arxiv.org/abs/2409.08863", "description": "arXiv:2409.08863v1 Announce Type: new \nAbstract: This paper considers the problem of testing and estimation of change point where signals after the change point can be highly irregular, which departs from the existing literature that assumes signals after the change point to be piece-wise constant or vary smoothly. A two-step approach is proposed to effectively estimate the location of the change point. The first step consists of a preliminary estimation of the change point that allows us to obtain unknown parameters for the second step. In the second step we use a new procedure to determine the position of the change point. 
We show that, under suitable conditions, the desirable $\\mathcal{O}_P(1)$ rate of convergence of the estimated change point can be obtained. We apply our method to analyze the Baidu search index of COVID-19 related symptoms and find 8 December 2019 to be the starting date of the COVID-19 pandemic."}, "https://arxiv.org/abs/2409.08912": {"title": "Joint spatial modeling of mean and non-homogeneous variance combining semiparametric SAR and GAMLSS models for hedonic prices", "link": "https://arxiv.org/abs/2409.08912", "description": "arXiv:2409.08912v1 Announce Type: new \nAbstract: In the context of spatial econometrics, it is very useful to have methodologies that allow modeling the spatial dependence of the observed variables and obtaining more precise predictions of both the mean and the variability of the response variable, which is valuable in territorial planning and public policies. This paper proposes a new methodology that jointly models the mean and the variance. It also allows the spatial dependence of the dependent variable to be modeled as a function of covariates and semiparametric effects to be included in both models. The algorithms developed are based on generalized additive models that allow the inclusion of non-parametric terms in both the mean and the variance, maintaining the traditional theoretical framework of spatial regression. The theoretical development of the estimation of this model is carried out, and the estimators are shown to have desirable statistical properties. A simulation study is developed to verify that the proposed method has a remarkable predictive capacity in terms of the mean square error and shows a notable improvement in the estimation of the spatial autoregressive parameter, compared to other traditional methods and some recent developments. The model is also tested on data from the construction of a hedonic price model for the city of Bogota, highlighting as the main result the ability to model the variability of housing prices and the richness of the analysis obtained."}, "https://arxiv.org/abs/2409.08924": {"title": "Regression-based proximal causal inference for right-censored time-to-event data", "link": "https://arxiv.org/abs/2409.08924", "description": "arXiv:2409.08924v1 Announce Type: new \nAbstract: Unmeasured confounding is one of the major concerns in causal inference from observational data. Proximal causal inference (PCI) is an emerging methodological framework to detect and potentially account for confounding bias by carefully leveraging a pair of negative control exposure (NCE) and outcome (NCO) variables, also known as treatment and outcome confounding proxies. Although regression-based PCI is well developed for binary and continuous outcomes, analogous PCI regression methods for right-censored time-to-event outcomes are currently lacking. In this paper, we propose a novel two-stage regression PCI approach for right-censored survival data under an additive hazard structural model. We provide theoretical justification for the proposed approach tailored to different types of NCOs, including continuous, count, and right-censored time-to-event variables. We illustrate the approach with an evaluation of the effectiveness of right heart catheterization among critically ill patients using data from the SUPPORT study. 
Our method is implemented in the open-access R package 'pci2s'."}, "https://arxiv.org/abs/2409.08965": {"title": "Dynamic Bayesian Networks with Conditional Dynamics in Edge Addition and Deletion", "link": "https://arxiv.org/abs/2409.08965", "description": "arXiv:2409.08965v1 Announce Type: new \nAbstract: This study presents a dynamic Bayesian network framework that facilitates intuitive gradual edge changes. We use two conditional dynamics to model the edge addition and deletion, and edge selection separately. Unlike previous research that uses a mixture network approach, which restricts the number of possible edge changes, or structural priors to induce gradual changes, which can lead to unclear network evolution, our model induces more frequent and intuitive edge change dynamics. We employ Markov chain Monte Carlo (MCMC) sampling to estimate the model structures and parameters and demonstrate the model's effectiveness in a portfolio selection application."}, "https://arxiv.org/abs/2409.08350": {"title": "An efficient heuristic for approximate maximum flow computations", "link": "https://arxiv.org/abs/2409.08350", "description": "arXiv:2409.08350v1 Announce Type: cross \nAbstract: Several concepts borrowed from graph theory are routinely used to better understand the inner workings of the (human) brain. To this end, a connectivity network of the brain is built first, which then allows one to assess quantities such as information flow and information routing via shortest path and maximum flow computations. Since brain networks typically contain several thousand nodes and edges, computational scaling is a key research area. In this contribution, we focus on approximate maximum flow computations in large brain networks. By combining graph partitioning with maximum flow computations, we propose a new approximation algorithm for the computation of the maximum flow with runtime O(|V||E|^2/k^2) compared to the usual runtime of O(|V||E|^2) for the Edmonds-Karp algorithm, where $V$ is the set of vertices, $E$ is the set of edges, and $k$ is the number of partitions. We assess both accuracy and runtime of the proposed algorithm on simulated graphs as well as on graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com)."}, "https://arxiv.org/abs/2409.08908": {"title": "Tracing the impacts of Mount Pinatubo eruption on global climate using spatially-varying changepoint detection", "link": "https://arxiv.org/abs/2409.08908", "description": "arXiv:2409.08908v1 Announce Type: cross \nAbstract: Significant events such as volcanic eruptions can have global and long lasting impacts on climate. These global impacts, however, are not uniform across space and time. Understanding how the Mt. Pinatubo eruption affects global and regional climate is of great interest for predicting impact on climate due to similar events. We propose a Bayesian framework to simultaneously detect and estimate spatially-varying temporal changepoints for regional climate impacts. Our approach takes into account the diffusing nature of the changes caused by the volcanic eruption and leverages spatial correlation. We illustrate our method on simulated datasets and compare it with an existing changepoint detection method. Finally, we apply our method on monthly stratospheric aerosol optical depth and surface temperature data from 1985 to 1995 to detect and estimate changepoints following the 1991 Mt. 
Pinatubo eruption."}, "https://arxiv.org/abs/2409.08925": {"title": "Multi forests: Variable importance for multi-class outcomes", "link": "https://arxiv.org/abs/2409.08925", "description": "arXiv:2409.08925v1 Announce Type: cross \nAbstract: In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), like permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance compared to conventional RFs. This is, however, not a limiting factor given the algorithm's primary purpose of calculating the multi-class VIM."}, "https://arxiv.org/abs/2409.08928": {"title": "Self-Organized State-Space Models with Artificial Dynamics", "link": "https://arxiv.org/abs/2409.08928", "description": "arXiv:2409.08928v1 Announce Type: cross \nAbstract: In this paper we consider a state-space model (SSM) parametrized by some parameter $\\theta$, and our aim is to perform joint parameter and state inference. A simple idea to perform this task, which almost dates back to the origin of the Kalman filter, is to replace the static parameter $\\theta$ by a Markov chain $(\\theta_t)_{t\\geq 0}$ on the parameter space and then to apply a standard filtering algorithm to the extended, or self-organized SSM. However, the practical implementation of this idea in a theoretically justified way has remained an open problem. In this paper we fill this gap by introducing various possible constructions of the Markov chain $(\\theta_t)_{t\\geq 0}$ that ensure the validity of the self-organized SSM (SO-SSM) for joint parameter and state inference. Notably, we show that theoretically valid SO-SSMs can be defined even if $\\|\\mathrm{Var}(\\theta_{t}|\\theta_{t-1})\\|$ converges to 0 slowly as $t\\rightarrow\\infty$. This result is important since, as illustrated in our numerical experiments, such models can be efficiently approximated using standard particle filter algorithms. 
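A minimal sketch of that particle-filter approximation, assuming a toy linear-Gaussian state-space model with a single unknown autoregressive parameter and an ad hoc slowly decaying jitter schedule; the paper's specific constructions of the artificial dynamics are not reproduced here.

import numpy as np

rng = np.random.default_rng(1)

# Toy SSM: x_t = theta * x_{t-1} + N(0, 1),  y_t = x_t + N(0, 0.5^2),  true theta = 0.7
T, theta_true = 300, 0.7
x = np.zeros(T)
for t in range(1, T):
    x[t] = theta_true * x[t - 1] + rng.normal()
y = x + 0.5 * rng.normal(size=T)

N = 2000                        # number of particles
theta = rng.uniform(-1, 1, N)   # particles for the static parameter
xs = rng.normal(size=N)         # particles for the latent state

for t in range(1, T):
    # artificial dynamics: jitter theta with a slowly decaying standard deviation
    sigma_t = 0.1 * t ** (-0.6)
    theta = theta + sigma_t * rng.normal(size=N)
    # propagate the state and weight by the observation density
    xs = theta * xs + rng.normal(size=N)
    logw = -0.5 * ((y[t] - xs) / 0.5) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # multinomial resampling of the extended particle (theta, x)
    idx = rng.choice(N, size=N, p=w)
    theta, xs = theta[idx], xs[idx]

print("posterior mean of theta:", theta.mean())   # should land near 0.7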
While the idea studied in this work was first introduced for online inference in SSMs, it has also been proved to be useful for computing the maximum likelihood estimator (MLE) of a given SSM, since iterated filtering algorithms can be seen as particle filters applied to SO-SSMs for which the target parameter value is the MLE of interest. Based on this observation, we also derive constructions of $(\\theta_t)_{t\\geq 0}$ and theoretical results tailored to these specific applications of SO-SSMs, and as a result, we introduce new iterated filtering algorithms. From a practical point of view, the algorithms introduced in this work have the merit of being simple to implement and only requiring minimal tuning to perform well."}, "https://arxiv.org/abs/2102.07008": {"title": "A Distance Covariance-based Estimator", "link": "https://arxiv.org/abs/2102.07008", "description": "arXiv:2102.07008v2 Announce Type: replace \nAbstract: This paper introduces an estimator that considerably weakens the conventional relevance condition of instrumental variable (IV) methods, allowing for instruments that are weakly correlated, uncorrelated, or even mean-independent but not independent of endogenous covariates. Under the relevance condition, the estimator achieves consistent estimation and reliable inference without requiring instrument excludability, and it remains robust even when the first moment of the disturbance term does not exist. In contrast to conventional IV methods, it maximises the set of feasible instruments in any empirical setting. Under a weak conditional median independence condition on pairwise differences in disturbances and mild regularity assumptions, identification holds, and the estimator is consistent and asymptotically normal."}, "https://arxiv.org/abs/2208.14960": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case", "link": "https://arxiv.org/abs/2208.14960", "description": "arXiv:2208.14960v4 Announce Type: replace \nAbstract: Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. 
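As a small concrete instance of such constructions on a compact space, the sketch below builds a rotation-invariant (stationary) kernel on the two-sphere from a truncated Legendre expansion and draws a prior sample along a great circle; the particular Matern-like spectral decay is an illustrative assumption, not a kernel taken from the paper.

import numpy as np
from numpy.polynomial.legendre import legval

def sphere_kernel(X, Y, num_levels=30, lengthscale=0.5):
    # Zonal kernel on the unit sphere: k(x, y) = sum_l a_l * (2l + 1) * P_l(<x, y>),
    # with assumed spectral weights a_l that decay like a Matern spectrum.
    ell = np.arange(num_levels)
    a = (1.0 + lengthscale ** 2 * ell * (ell + 1)) ** (-2.0)
    coeffs = a * (2 * ell + 1)
    cos_angle = np.clip(X @ Y.T, -1.0, 1.0)
    return legval(cos_angle, coeffs)

# Sample a Gaussian process prior along a great circle of the sphere.
t = np.linspace(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
K = sphere_kernel(X, X)
rng = np.random.default_rng(2)
f = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
print(f[:4].round(2))   # a few values of one prior draw

Because the kernel depends on the inputs only through the angle between them, it is invariant to rotations of the sphere, which is the notion of stationarity on homogeneous spaces discussed in the abstract.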
Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners."}, "https://arxiv.org/abs/2209.03318": {"title": "On the Wasserstein median of probability measures", "link": "https://arxiv.org/abs/2209.03318", "description": "arXiv:2209.03318v3 Announce Type: replace \nAbstract: The primary way to summarize a finite collection of random objects is with measures of central tendency, such as mean and median. In the field of optimal transport, the Wasserstein barycenter corresponds to the Fr\\'{e}chet or geometric mean of a set of probability measures, which is defined as a minimizer of the sum of squared distances to each element in a given set with respect to the Wasserstein distance of order 2. We introduce the Wasserstein median as a robust alternative to the Wasserstein barycenter. The Wasserstein median corresponds to the Fr\\'{e}chet median under the 2-Wasserstein metric. The existence and consistency of the Wasserstein median are first established, along with its robustness property. In addition, we present a general computational pipeline that employs any recognized algorithms for the Wasserstein barycenter in an iterative fashion and demonstrate its convergence. The utility of the Wasserstein median as a robust measure of central tendency is demonstrated using real and simulated data."}, "https://arxiv.org/abs/2209.13918": {"title": "Inference in generalized linear models with robustness to misspecified variances", "link": "https://arxiv.org/abs/2209.13918", "description": "arXiv:2209.13918v3 Announce Type: replace \nAbstract: Generalized linear models usually assume a common dispersion parameter, an assumption that is seldom true in practice. Consequently, standard parametric methods may suffer appreciable loss of type I error control. As an alternative, we present a semi-parametric group-invariance method based on sign flipping of score contributions. Our method requires only the correct specification of the mean model, but is robust against any misspecification of the variance. We present tests for single as well as multiple regression coefficients. The test is asymptotically valid but shows excellent performance in small samples. We illustrate the method using RNA sequencing count data, for which it is difficult to model the overdispersion correctly. The method is available in the R library flipscores."}, "https://arxiv.org/abs/2211.07351": {"title": "A Tutorial on Asymptotic Properties for Biostatisticians with Applications to COVID-19 Data", "link": "https://arxiv.org/abs/2211.07351", "description": "arXiv:2211.07351v2 Announce Type: replace \nAbstract: Asymptotic properties of statistical estimators play a significant role both in practice and in theory. However, many asymptotic results in statistics rely heavily on the independent and identically distributed (iid) assumption, which is not realistic when we have fixed designs. In this article, we build a roadmap of general procedures for deriving asymptotic properties under fixed designs, where the observations need not be iid. We further illustrate these procedures in a range of statistical applications. 
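As a small worked example of the kind of fixed-design calculation such a roadmap covers, the sketch below fits a Poisson regression by Newton-Raphson on simulated data (not the COVID-19 dataset) and reports the model-based asymptotic covariance (X' W X)^{-1} evaluated at the MLE; the design matrix is treated as fixed, and all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # fixed (non-random) design
beta_true = np.array([0.5, 1.2])
y = rng.poisson(np.exp(X @ beta_true))

# Newton-Raphson iterations for the Poisson MLE
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)
    info = X.T @ (X * mu[:, None])          # Fisher information X' W X with W = diag(mu)
    beta = beta + np.linalg.solve(info, score)

# Model-based asymptotic covariance of the MLE under the fixed design
cov = np.linalg.inv(X.T @ (X * np.exp(X @ beta)[:, None]))
se = np.sqrt(np.diag(cov))
print("estimate:", beta, "standard errors:", se)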
Finally, we apply our results to Poisson regression using a COVID-19 dataset as an illustration to demonstrate the power of these results in practice."}, "https://arxiv.org/abs/2311.18699": {"title": "Correlated Bayesian Additive Regression Trees with Gaussian Process for Regression Analysis of Dependent Data", "link": "https://arxiv.org/abs/2311.18699", "description": "arXiv:2311.18699v2 Announce Type: replace \nAbstract: Bayesian Additive Regression Trees (BART) has gained widespread popularity, prompting the development of various extensions for different applications. However, limited attention has been given to analyzing dependent data. Based on a general correlated error assumption and an innovative dummy representation, we introduce a novel extension of BART, called Correlated BART (CBART), designed to handle correlated errors. By integrating CBART with a Gaussian process (GP), we propose the CBART-GP model, in which the CBART and GP components are loosely coupled, allowing them to be estimated and applied independently. CBART captures the covariate mean function E[y|x]=f(x), while the Gaussian process models the dependency structure in the response $y$. We also develop a computationally efficient approach, named two-stage analysis of variance with weighted residuals, for the estimation of CBART-GP. Simulation studies demonstrate the superiority of CBART-GP over other models, and a real-world application illustrates its practical applicability."}, "https://arxiv.org/abs/2401.01949": {"title": "Adjacency Matrix Decomposition Clustering for Human Activity Data", "link": "https://arxiv.org/abs/2401.01949", "description": "arXiv:2401.01949v2 Announce Type: replace \nAbstract: Mobile apps and wearable devices accurately and continuously measure human activity; patterns within this data can provide a wealth of information applicable to fields such as transportation and health. Despite the potential utility of this data, there has been limited development of analysis methods for sequences of daily activities. In this paper, we propose a novel clustering method and cluster evaluation metric for human activity data that leverages an adjacency matrix representation to cluster the data without the calculation of a distance matrix. Our technique is substantially faster than conventional methods based on computing pairwise distances via sequence alignment algorithms and also enhances interpretability of results. We compare our method to distance-based hierarchical clustering and nTreeClus through simulation studies and an application to data collected by Daynamica, an app that turns sensor data into a daily summary of a user's activities. Among days that contain a large portion of time spent at home, our method distinguishes days that also contain multiple hours of travel or other activities, while both comparison methods fail to identify these patterns. We further identify which day patterns classified by our method are associated with higher concern for contracting COVID-19 with implications for public health messaging."}, "https://arxiv.org/abs/2312.15469": {"title": "Efficient Estimation of the Central Mean Subspace via Smoothed Gradient Outer Products", "link": "https://arxiv.org/abs/2312.15469", "description": "arXiv:2312.15469v2 Announce Type: replace-cross \nAbstract: We consider the problem of sufficient dimension reduction (SDR) for multi-index models. 
The estimators of the central mean subspace in prior works either have slow (non-parametric) convergence rates, or rely on stringent distributional conditions (e.g., the covariate distribution $P_{\\mathbf{X}}$ being elliptical symmetric). In this paper, we show that a fast parametric convergence rate of form $C_d \\cdot n^{-1/2}$ is achievable via estimating the \\emph{expected smoothed gradient outer product}, for a general class of distribution $P_{\\mathbf{X}}$ admitting Gaussian or heavier distributions. When the link function is a polynomial with a degree of at most $r$ and $P_{\\mathbf{X}}$ is the standard Gaussian, we show that the prefactor depends on the ambient dimension $d$ as $C_d \\propto d^r$."}, "https://arxiv.org/abs/2409.09065": {"title": "Automatic Pricing and Replenishment Strategies for Vegetable Products Based on Data Analysis and Nonlinear Programming", "link": "https://arxiv.org/abs/2409.09065", "description": "arXiv:2409.09065v1 Announce Type: new \nAbstract: In the field of fresh produce retail, vegetables generally have a relatively limited shelf life, and their quality deteriorates with time. Most vegetable varieties, if not sold on the day of delivery, become difficult to sell the following day. Therefore, retailers usually perform daily quantitative replenishment based on historical sales data and demand conditions. Vegetable pricing typically uses a \"cost-plus pricing\" method, with retailers often discounting products affected by transportation loss and quality decline. In this context, reliable market demand analysis is crucial as it directly impacts replenishment and pricing decisions. Given the limited retail space, a rational sales mix becomes essential. This paper first uses data analysis and visualization techniques to examine the distribution patterns and interrelationships of vegetable sales quantities by category and individual item, based on provided data on vegetable types, sales records, wholesale prices, and recent loss rates. Next, it constructs a functional relationship between total sales volume and cost-plus pricing for vegetable categories, forecasts future wholesale prices using the ARIMA model, and establishes a sales profit function and constraints. A nonlinear programming model is then developed and solved to provide daily replenishment quantities and pricing strategies for each vegetable category for the upcoming week. Further, we optimize the profit function and constraints based on the actual sales conditions and requirements, providing replenishment quantities and pricing strategies for individual items on July 1 to maximize retail profit. Finally, to better formulate replenishment and pricing decisions for vegetable products, we discuss and forecast the data that retailers need to collect and analyses how the collected data can be applied to the above issues."}, "https://arxiv.org/abs/2409.09178": {"title": "Identification of distributions for risks based on the first moment and c-statistic", "link": "https://arxiv.org/abs/2409.09178", "description": "arXiv:2409.09178v1 Announce Type: new \nAbstract: We show that for any family of distributions with support on [0,1] with strictly monotonic cumulative distribution function (CDF) that has no jumps and is quantile-identifiable (i.e., any two distinct quantiles identify the distribution), knowing the first moment and c-statistic is enough to identify the distribution. 
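A minimal numerical sketch of how that identification result can be used in practice for the beta family: given a target mean and c-statistic, solve for the two parameters, with the c-statistic of a candidate distribution computed by Monte Carlo (risks drawn from the distribution, outcomes drawn as Bernoulli at those risks) and the parameters recovered by one-dimensional root finding. This is an illustration under those assumptions, not the authors' implementation.

import numpy as np
from scipy import optimize, stats

def cstat_beta(a, b, n=200_000, seed=4):
    # Monte Carlo c-statistic of predicted risks p ~ Beta(a, b) against outcomes y ~ Bernoulli(p)
    rng = np.random.default_rng(seed)
    p = rng.beta(a, b, n)
    y = rng.binomial(1, p)
    ranks = stats.rankdata(p)
    n1 = y.sum()
    # Mann-Whitney form of the AUC / c-statistic
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * (n - n1))

def beta_from_mean_cstat(mean, cstat):
    # Solve for (a, b) such that Beta(a, b) has the given mean and c-statistic.
    def gap(log_ab):                       # parametrize by the concentration a + b on the log scale
        ab = np.exp(log_ab)
        a, b = mean * ab, (1 - mean) * ab  # the mean is matched by construction
        return cstat_beta(a, b) - cstat
    ab = np.exp(optimize.brentq(gap, np.log(0.05), np.log(500.0)))
    return mean * ab, (1 - mean) * ab

print(beta_from_mean_cstat(mean=0.2, cstat=0.75))

The concentration a + b controls the spread of the risks and hence the c-statistic, so a single root-finding pass over it suffices once the mean is fixed.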
The derivations motivate numerical algorithms for mapping a given pair of expected value and c-statistic to the parameters of specified two-parameter distributions for probabilities. We implemented these algorithms in R and in a simulation study evaluated their numerical accuracy for common families of distributions for risks (beta, logit-normal, and probit-normal). An area of application for these developments is in risk prediction modeling (e.g., sample size calculations and Value of Information analysis), where one might need to estimate the parameters of the distribution of predicted risks from the reported summary statistics."}, "https://arxiv.org/abs/2409.09236": {"title": "Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times", "link": "https://arxiv.org/abs/2409.09236", "description": "arXiv:2409.09236v1 Announce Type: new \nAbstract: While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of the dynamic policies, this paper constructs a novel OPE framework, which concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and with the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments."}, "https://arxiv.org/abs/2409.09243": {"title": "Unconditional Randomization Tests for Interference", "link": "https://arxiv.org/abs/2409.09243", "description": "arXiv:2409.09243v1 Announce Type: new \nAbstract: In social networks or spatial experiments, one unit's outcome often depends on another's treatment, a phenomenon called interference. Researchers are interested in not only the presence and magnitude of interference but also its pattern based on factors like distance, neighboring units, and connection strength. However, the non-random nature of these factors and complex correlations across units pose challenges for inference. This paper introduces the partial null randomization tests (PNRT) framework to address these issues. The proposed method is finite-sample valid and applicable with minimal network structure assumptions, utilizing randomization testing and pairwise comparisons. Unlike existing conditional randomization tests, PNRT avoids the need for conditioning events, making it more straightforward to implement. 
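For context, the re-randomization machinery underlying such tests can be illustrated with a basic Fisher randomization test of the sharp null of no treatment effect on any unit, which in particular rules out interference; the PNRT framework's partial null hypotheses and pairwise comparisons are not reproduced in this toy sketch, and the design and data below are simulated.

import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 80                          # units and number treated
A = np.zeros(n, dtype=int)
A[rng.choice(n, m, replace=False)] = 1  # completely randomized assignment
Y = rng.normal(size=n) + 0.4 * A        # observed outcomes

def stat(y, a):
    return y[a == 1].mean() - y[a == 0].mean()

obs = stat(Y, A)
draws = []
for _ in range(5000):
    A_star = np.zeros(n, dtype=int)
    A_star[rng.choice(n, m, replace=False)] = 1   # re-randomize under the known design
    draws.append(stat(Y, A_star))                 # Y is held fixed under the sharp null
p_value = (np.sum(np.abs(draws) >= abs(obs)) + 1) / (len(draws) + 1)
print("randomization p-value:", p_value)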
Simulations demonstrate the method's desirable power properties and its applicability to general interference scenarios."}, "https://arxiv.org/abs/2409.09310": {"title": "Exact Posterior Mean and Covariance for Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2409.09310", "description": "arXiv:2409.09310v1 Announce Type: new \nAbstract: A novel method is proposed for the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response does not follow normal. The research solves a long-standing problem in Bayesian statistics when an intractable integral appears in the posterior distribution. It is well-known that the posterior distribution of the random effects given the response in a GLMM when the response does not follow normal contains intractable integrals. Previous methods rely on Monte Carlo simulations for the posterior distributions. They do not provide the exact posterior mean and covariance of the random effects given the response. The special integral computation (SIC) method is proposed to overcome the difficulty. The SIC method does not use the posterior distribution in the computation. It devises an optimization problem to reach the task. An advantage is that the computation of the posterior distribution is unnecessary. The proposed SIC avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution."}, "https://arxiv.org/abs/2409.09355": {"title": "A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions", "link": "https://arxiv.org/abs/2409.09355", "description": "arXiv:2409.09355v1 Announce Type: new \nAbstract: Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. A situation of the latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered."}, "https://arxiv.org/abs/2409.09440": {"title": "Group Sequential Testing of a Treatment Effect Using a Surrogate Marker", "link": "https://arxiv.org/abs/2409.09440", "description": "arXiv:2409.09440v1 Announce Type: new \nAbstract: The identification of surrogate markers is motivated by their potential to make decisions sooner about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. 
Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our testing procedure using a simulation study and illustrate the method using data from two distinct AIDS clinical trials."}, "https://arxiv.org/abs/2409.09512": {"title": "Doubly robust and computationally efficient high-dimensional variable selection", "link": "https://arxiv.org/abs/2409.09512", "description": "arXiv:2409.09512v1 Announce Type: new \nAbstract: The variable selection problem is to discover which of a large set of predictors is associated with an outcome of interest, conditionally on the other predictors. This problem has been widely studied, but existing approaches lack either power against complex alternatives, robustness to model misspecification, computational efficiency, or quantification of evidence against individual hypotheses. We present tower PCM (tPCM), a statistically and computationally efficient solution to the variable selection problem that does not suffer from these shortcomings. tPCM adapts the best aspects of two existing procedures that are based on similar functionals: the holdout randomization test (HRT) and the projected covariance measure (PCM). The former is a model-X test that utilizes many resamples and few machine learning fits, while the latter is an asymptotic doubly-robust style test for a single hypothesis that requires no resamples and many machine learning fits. Theoretically, we demonstrate the validity of tPCM, and perhaps surprisingly, the asymptotic equivalence of HRT, PCM, and tPCM. In so doing, we clarify the relationship between two methods from two separate literatures. An extensive simulation study verifies that tPCM can have significant computational savings compared to HRT and PCM, while maintaining nearly identical power."}, "https://arxiv.org/abs/2409.09577": {"title": "Structural counterfactual analysis in macroeconomics: theory and inference", "link": "https://arxiv.org/abs/2409.09577", "description": "arXiv:2409.09577v1 Announce Type: new \nAbstract: We propose a structural model-free methodology to analyze two types of macroeconomic counterfactuals related to policy path deviation: hypothetical trajectory and policy intervention. Our model-free approach is built on a structural vector moving-average (SVMA) model that relies solely on the identification of policy shocks, thereby eliminating the need to specify an entire structural model. Analytical solutions are derived for the counterfactual parameters, and statistical inference for these parameter estimates is provided using the Delta method. By utilizing external instruments, we introduce a projection-based method for the identification, estimation, and inference of these parameters. This approach connects our counterfactual analysis with the Local Projection literature. 
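A minimal sketch of the local-projection ingredient referred to just above, assuming a directly observed, already identified shock series and a simulated AR(1) outcome; the paper's SVMA-based counterfactual parameters and Delta-method inference are not implemented here.

import numpy as np

rng = np.random.default_rng(6)
T, H = 400, 12
shock = rng.normal(size=T)           # identified policy shock (e.g., from an external instrument)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + shock[t] + 0.3 * rng.normal()

irf = []
for h in range(H + 1):
    # local projection: regress y_{t+h} on shock_t (plus an intercept), one OLS per horizon
    yy = y[h:]
    xx = np.column_stack([np.ones(T - h), shock[: T - h]])
    b, *_ = np.linalg.lstsq(xx, yy, rcond=None)
    irf.append(b[1])
print(np.round(irf, 2))   # roughly 1, 0.6, 0.36, ... for this AR(1) data-generating process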
A simulation-based approach with a nonlinear model is provided to aid in addressing Lucas' critique. The innovative model-free methodology is applied in three counterfactual studies of U.S. monetary policy: (1) a historical scenario analysis for a hypothetical interest rate path in the post-pandemic era, (2) a future scenario analysis under either hawkish or dovish interest rate policy, and (3) an evaluation of the policy intervention effect of an oil price shock by zeroing out the systematic responses of the interest rate."}, "https://arxiv.org/abs/2409.09660": {"title": "On the Proofs of the Predictive Synthesis Formula", "link": "https://arxiv.org/abs/2409.09660", "description": "arXiv:2409.09660v1 Announce Type: new \nAbstract: Bayesian predictive synthesis is useful in synthesizing multiple predictive distributions coherently. However, the proof for the fundamental equation of the synthesized predictive density has been missing. In this technical report, we review the line of research on predictive synthesis, then fill the gap between the known results and the equation used in modern applications. We provide two proofs and clarify the structure of predictive synthesis."}, "https://arxiv.org/abs/2409.09865": {"title": "A general approach to fitting multistate cure models based on an extended-long-format data structure", "link": "https://arxiv.org/abs/2409.09865", "description": "arXiv:2409.09865v1 Announce Type: new \nAbstract: A multistate cure model is a statistical framework used to analyze and represent the transitions that individuals undergo between different states over time, taking into account the possibility of being cured by initial treatment. This model is particularly useful in pediatric oncology, where a fraction of the patient population achieves cure through treatment and therefore will never experience certain events. Our study develops a generalized algorithm based on the extended long data format, an extension of the long data format in which a transition can be split into up to two rows, each with an assigned weight reflecting the posterior probability of its cure status. The multistate cure model is fit on top of the existing frameworks of multistate models and mixture cure models. The proposed algorithm makes use of the Expectation-Maximization (EM) algorithm and weighted likelihood representation such that it is easy to implement with standard packages. As an example, the proposed algorithm is applied to data from the European Society for Blood and Marrow Transplantation (EBMT). Standard errors of the estimated parameters are obtained via a non-parametric bootstrap procedure, while the method involving the calculation of the second-derivative matrix of the observed log-likelihood is also presented."}, "https://arxiv.org/abs/2409.09884": {"title": "Dynamic quantification of player value for fantasy basketball", "link": "https://arxiv.org/abs/2409.09884", "description": "arXiv:2409.09884v1 Announce Type: new \nAbstract: Previous work on fantasy basketball quantifies player value for category leagues without taking draft circumstances into account. Quantifying value in this way is convenient, but inherently limited as a strategy, because it precludes the possibility of dynamic adaptation. This work introduces a framework for dynamic algorithms, dubbed \"H-scoring\", and describes an implementation of the framework for head-to-head formats, dubbed $H_0$. 
$H_0$ models many of the main aspects of category league strategy including category weighting, positional assignments, and format-specific objectives. Head-to-head simulations provide evidence that $H_0$ outperforms static ranking lists. Category-level results from the simulations reveal that one component of $H_0$'s strategy is punting a subset of categories, which it learns to do implicitly."}, "https://arxiv.org/abs/2409.09962": {"title": "A Simple and Adaptive Confidence Interval when Nuisance Parameters Satisfy an Inequality", "link": "https://arxiv.org/abs/2409.09962", "description": "arXiv:2409.09962v1 Announce Type: new \nAbstract: Inequalities may appear in many models. They can be as simple as assuming a parameter is nonnegative, possibly a regression coefficient or a treatment effect. This paper focuses on the case that there is only one inequality and proposes a confidence interval that is particularly attractive, called the inequality-imposed confidence interval (IICI). The IICI is simple. It does not require simulations or tuning parameters. The IICI is adaptive. It reduces to the usual confidence interval (calculated by adding and subtracting the standard error times the $1 - \\alpha/2$ standard normal quantile) when the inequality is sufficiently slack. When the inequality is sufficiently violated, the IICI reduces to an equality-imposed confidence interval (the usual confidence interval for the submodel where the inequality holds with equality). Also, the IICI is uniformly valid and has (weakly) shorter length than the usual confidence interval; it is never longer. The first empirical application considers a linear regression when a coefficient is known to be nonpositive. A second empirical application considers an instrumental variables regression when the endogeneity of a regressor is known to be nonnegative."}, "https://arxiv.org/abs/2409.10001": {"title": "Generalized Matrix Factor Model", "link": "https://arxiv.org/abs/2409.10001", "description": "arXiv:2409.10001v1 Announce Type: new \nAbstract: This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in the matrix factor modeling and leads to feasible central limit theorems of the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constraint maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. 
An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix."}, "https://arxiv.org/abs/2409.10030": {"title": "On LASSO Inference for High Dimensional Predictive Regression", "link": "https://arxiv.org/abs/2409.10030", "description": "arXiv:2409.10030v1 Announce Type: new \nAbstract: LASSO introduces shrinkage bias into estimated coefficients, which can adversely affect the desirable asymptotic normality and invalidate the standard inferential procedure based on the $t$-statistic. The desparsified LASSO has emerged as a well-known remedy for this issue. In the context of high dimensional predictive regression, the desparsified LASSO faces an additional challenge: the Stambaugh bias arising from nonstationary regressors. To restore the standard inferential procedure, we propose a novel estimator called IVX-desparsified LASSO (XDlasso). XDlasso eliminates the shrinkage bias and the Stambaugh bias simultaneously and does not require prior knowledge about the identities of nonstationary and stationary regressors. We establish the asymptotic properties of XDlasso for hypothesis testing, and our theoretical findings are supported by Monte Carlo simulations. Applying our method to real-world applications from the FRED-MD database -- which includes a rich set of control variables -- we investigate two important empirical questions: (i) the predictability of the U.S. stock returns based on the earnings-price ratio, and (ii) the predictability of the U.S. inflation using the unemployment rate."}, "https://arxiv.org/abs/2409.10174": {"title": "Information criteria for the number of directions of extremes in high-dimensional data", "link": "https://arxiv.org/abs/2409.10174", "description": "arXiv:2409.10174v1 Announce Type: new \nAbstract: In multivariate extreme value analysis, the estimation of the dependence structure in extremes is a challenging task, especially in the context of high-dimensional data. Therefore, a common approach is to reduce the model dimension by considering only the directions in which extreme values occur. In this paper, we use the concept of sparse regular variation recently introduced by Meyer and Wintenberger (2021) to derive information criteria for the number of directions in which extreme events occur, such as a Bayesian information criterion (BIC), a mean-squared error-based information criterion (MSEIC), and a quasi-Akaike information criterion (QAIC) based on the Gaussian likelihood function. As is typical in extreme value analysis, a challenging task is the choice of the number $k_n$ of observations used for the estimation. Therefore, for all information criteria, we present a two-step procedure to estimate both the number of directions of extremes and an optimal choice of $k_n$. We prove that the AIC of Meyer and Wintenberger (2023) and the MSEIC are inconsistent information criteria for the number of extreme directions whereas the BIC and the QAIC are consistent information criteria. 
Finally, the performance of the different information criteria is compared in a simulation study and applied on wind speed data."}, "https://arxiv.org/abs/2409.10221": {"title": "bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R", "link": "https://arxiv.org/abs/2409.10221", "description": "arXiv:2409.10221v1 Announce Type: new \nAbstract: The family of cure models provides a unique opportunity to simultaneously model both the proportion of cured subjects (those not facing the event of interest) and the distribution function of time-to-event for susceptibles (those facing the event). In practice, the application of cure models is mainly facilitated by the availability of various R packages. However, most of these packages primarily focus on the mixture or promotion time cure rate model. This article presents a fully Bayesian approach implemented in R to estimate a general family of cure rate models in the presence of covariates. It builds upon the work by Papastamoulis and Milienos (2024) by additionally considering various options for describing the promotion time, including the Weibull, exponential, Gompertz, log-logistic and finite mixtures of gamma distributions, among others. Moreover, the user can choose any proper distribution function for modeling the promotion time (provided that some specific conditions are met). Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. The package is illustrated on a real dataset analyzing the duration of the first marriage under the presence of various covariates such as the race, age and the presence of kids."}, "https://arxiv.org/abs/2409.10318": {"title": "Systematic comparison of Bayesian basket trial designs with unequal sample sizes and proposal of a new method based on power priors", "link": "https://arxiv.org/abs/2409.10318", "description": "arXiv:2409.10318v1 Announce Type: new \nAbstract: Basket trials examine the efficacy of an intervention in multiple patient subgroups simultaneously. The division into subgroups, called baskets, is based on matching medical characteristics, which may result in small sample sizes within baskets that are also likely to differ. Sparse data complicate statistical inference. Several Bayesian methods have been proposed in the literature that allow information sharing between baskets to increase statistical power. In this work, we provide a systematic comparison of five different Bayesian basket trial designs when sample sizes differ between baskets. We consider the power prior approach with both known and new weighting methods, a design by Fujikawa et al., as well as models based on Bayesian hierarchical modeling and Bayesian model averaging. The results of our simulation study show a high sensitivity to changing sample sizes for Fujikawa's design and the power prior approach. Limiting the amount of shared information was found to be decisive for the robustness to varying basket sizes. 
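A minimal conjugate sketch of the power-prior borrowing idea being compared, assuming binary responses, a Beta(1, 1) initial prior, and a fixed discounting weight for the other baskets; the known and newly proposed weighting methods studied in the paper are more elaborate than this fixed-delta version.

import numpy as np
from scipy import stats

# responders x and sample sizes n in 4 baskets of unequal size (illustrative numbers)
x = np.array([3, 7, 1, 10])
n = np.array([10, 24, 8, 30])
a0, b0 = 1.0, 1.0     # Beta(1, 1) prior on each basket's response rate
delta = 0.3           # power-prior discounting weight for external baskets (assumed fixed)

for i in range(len(x)):
    others = np.arange(len(x)) != i
    # own data enter fully; the other baskets' binomial likelihoods are raised to the power delta
    a = a0 + x[i] + delta * x[others].sum()
    b = b0 + (n[i] - x[i]) + delta * (n[others] - x[others]).sum()
    post = stats.beta(a, b)
    print(f"basket {i}: posterior mean {post.mean():.3f}, P(rate > 0.2) = {1 - post.cdf(0.2):.3f}")

With binomial likelihoods and a conjugate prior, raising the external likelihood to a power delta in [0, 1] simply discounts the borrowed successes and failures, which is what limits the amount of shared information.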
In combination with the power prior approach, this resulted in the best performance and the most reliable detection of an effect of the treatment under investigation, as well as of its absence."}, "https://arxiv.org/abs/2409.10352": {"title": "Partial Ordering Bayesian Logistic Regression Model for Phase I Combination Trials and Computationally Efficient Approach to Operational Prior Specification", "link": "https://arxiv.org/abs/2409.10352", "description": "arXiv:2409.10352v1 Announce Type: new \nAbstract: Recent years have seen increased interest in combining drug agents and/or schedules. Several methods for Phase I combination-escalation trials have been proposed, among which the partial ordering continual reassessment method (POCRM) has gained great attention for its simplicity and good operational characteristics. However, the one-parameter nature of the POCRM makes it restrictive in more complicated settings such as the inclusion of a control group. This paper proposes a Bayesian partial ordering logistic model (POBLRM), which combines partial ordering and the more flexible (than CRM) two-parameter logistic model. Simulation studies show that the POBLRM performs similarly to the POCRM in non-randomised settings. When patients are randomised between the experimental dose-combinations and a control, performance is drastically improved.\n Most designs require specifying hyper-parameters, often chosen from statistical considerations (operational prior). The conventional \"grid search\" calibration approach requires large simulations, which are computationally costly. A novel \"cyclic calibration\" is proposed to reduce the computation from multiplicative to additive. Furthermore, calibration processes should consider wide ranges of scenarios of true toxicity probabilities to avoid bias. A method to reduce scenarios based on scenario-complexities is suggested. This can reduce the computation by more than 500-fold while maintaining operational characteristics similar to those of the grid search."}, "https://arxiv.org/abs/2409.10448": {"title": "Why you should also use OLS estimation of tail exponents", "link": "https://arxiv.org/abs/2409.10448", "description": "arXiv:2409.10448v1 Announce Type: new \nAbstract: Even though practitioners often estimate Pareto exponents running OLS rank-size regressions, the usual recommendation is to use the Hill MLE with a small-sample correction instead, due to its unbiasedness and efficiency. In this paper, we advocate that you should also apply OLS in empirical applications. On the one hand, we demonstrate that, with a small-sample correction, the OLS estimator is also unbiased. On the other hand, we show that the MLE assigns significantly greater weight to smaller observations. This suggests that the OLS estimator may outperform the MLE in cases where the distribution is (i) strictly Pareto but only in the upper tail or (ii) regularly varying rather than strictly Pareto. We substantiate our theoretical findings with Monte Carlo simulations and real-world applications, demonstrating the practical relevance of the OLS method in estimating tail exponents."}, "https://arxiv.org/abs/2409.09066": {"title": "Replicating The Log of Gravity", "link": "https://arxiv.org/abs/2409.09066", "description": "arXiv:2409.09066v1 Announce Type: cross \nAbstract: This document replicates the main results from Santos Silva and Tenreyro (2006) in R. The original results were obtained in TSP back in 2006. The idea here is to be explicit regarding the conceptual approach to regression in R. 
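Since the original replication is in R, here is a hedged Python analogue of the PPML estimator at the core of the Log of Gravity, namely a Poisson GLM fit to trade levels with robust standard errors; the data below are simulated and the variable names and coefficient values are made up for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
log_dist = rng.normal(8, 1, n)
log_gdp_o = rng.normal(24, 1, n)
log_gdp_d = rng.normal(24, 1, n)
mu = np.exp(-32 + 0.8 * log_gdp_o + 0.8 * log_gdp_d - 1.0 * log_dist)
trade = rng.poisson(mu * rng.gamma(1.0, 1.0, n))      # heteroskedastic flows with many zeros

X = sm.add_constant(np.column_stack([log_gdp_o, log_gdp_d, log_dist]))
# PPML: Poisson pseudo-maximum likelihood on the level of trade, with robust (sandwich) errors
ppml = sm.GLM(trade, X, family=sm.families.Poisson()).fit(cov_type="HC1")
print(ppml.params)   # consistent for the elasticities even though trade is not actually Poisson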
For most of the replication I used base R without external libraries except when it was absolutely necessary. The findings are consistent with the original article and reveal that the replication effort is minimal, without the need to contact the authors for clarifications or incur into data transformations or filtering not mentioned in the article."}, "https://arxiv.org/abs/2409.09894": {"title": "Estimating Wage Disparities Using Foundation Models", "link": "https://arxiv.org/abs/2409.09894", "description": "arXiv:2409.09894v1 Announce Type: cross \nAbstract: One thread of empirical work in social science focuses on decomposing group differences in outcomes into unexplained components and components explained by observable factors. In this paper, we study gender wage decompositions, which require estimating the portion of the gender wage gap explained by career histories of workers. Classical methods for decomposing the wage gap employ simple predictive models of wages which condition on a small set of simple summaries of labor history. The problem is that these predictive models cannot take advantage of the full complexity of a worker's history, and the resulting decompositions thus suffer from omitted variable bias (OVB), where covariates that are correlated with both gender and wages are not included in the model. Here we explore an alternative methodology for wage gap decomposition that employs powerful foundation models, such as large language models, as the predictive engine. Foundation models excel at making accurate predictions from complex, high-dimensional inputs. We use a custom-built foundation model, designed to predict wages from full labor histories, to decompose the gender wage gap. We prove that the way such models are usually trained might still lead to OVB, but develop fine-tuning algorithms that empirically mitigate this issue. Our model captures a richer representation of career history than simple models and predicts wages more accurately. In detail, we first provide a novel set of conditions under which an estimator of the wage gap based on a fine-tuned foundation model is $\\sqrt{n}$-consistent. Building on the theory, we then propose methods for fine-tuning foundation models that minimize OVB. Using data from the Panel Study of Income Dynamics, we find that history explains more of the gender wage gap than standard econometric models can measure, and we identify elements of history that are important for reducing OVB."}, "https://arxiv.org/abs/2409.09903": {"title": "Learning large softmax mixtures with warm start EM", "link": "https://arxiv.org/abs/2409.09903", "description": "arXiv:2409.09903v1 Announce Type: cross \nAbstract: Mixed multinomial logits are discrete mixtures introduced several decades ago to model the probability of choosing an attribute from $p$ possible candidates, in heterogeneous populations. The model has recently attracted attention in the AI literature, under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number $p$ of vectors in $\\mathbb{R}^L$ to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via algorithms whose running time scales polynomially in $L$, are not known. 
This paper provides a solution to this problem for contemporary applications, such as large language models, in which the mixture has a large number $p$ of support points, and the size $N$ of the sample observed from the mixture is also large. Our proposed estimator combines two classical estimators, obtained respectively via a method of moments (MoM) and the expectation-maximization (EM) algorithm. Although both estimator types have been studied, from a theoretical perspective, for Gaussian mixtures, no similar results exist for softmax mixtures for either procedure. We develop a new MoM parameter estimator based on latent moment estimation that is tailored to our model, and provide the first theoretical analysis for a MoM-based procedure in softmax mixtures. Although consistent, MoM for softmax mixtures can exhibit poor numerical performance, as observed for other mixture models. Nevertheless, as MoM is provably in a neighborhood of the target, it can be used as a warm start for any iterative algorithm. We study in detail the EM algorithm, and provide its first theoretical analysis for softmax mixtures. Our final proposal for parameter estimation is the EM algorithm with a MoM warm start."}, "https://arxiv.org/abs/2409.09973": {"title": "Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data", "link": "https://arxiv.org/abs/2409.09973", "description": "arXiv:2409.09973v1 Announce Type: cross \nAbstract: We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation."}, "https://arxiv.org/abs/2409.10374": {"title": "Nonlinear Causality in Brain Networks: With Application to Motor Imagery vs Execution", "link": "https://arxiv.org/abs/2409.10374", "description": "arXiv:2409.10374v1 Announce Type: cross \nAbstract: One fundamental challenge of data-driven analysis in neuroscience is modeling causal interactions and exploring the connectivity of nodes in a brain network. 
Various statistical methods, relying on different perspectives and employing different data modalities, are being developed to examine and comprehend the underlying causal structures inherent to brain dynamics. This study introduces a novel statistical approach, TAR4C, to dissect causal interactions in multichannel EEG recordings. TAR4C uses the threshold autoregressive model to describe the causal interaction between nodes or clusters of nodes in a brain network. The perspective involves testing whether one node, which may represent a brain region, can control the dynamics of the other. The node that has such an impact on the other is called a threshold variable and can be classified as causative because it operates as an instantaneous switching mechanism that regulates the time-varying autoregressive structure of the other. This statistical concept is commonly referred to as threshold non-linearity. Once threshold non-linearity has been verified between a pair of nodes, the subsequent essential facet of TAR modeling is to assess the predictive ability of the causal node for the current activity of the other and represent causal interactions in autoregressive terms. This predictive ability is what underlies Granger causality. The TAR4C approach can discover non-linear and time-dependent causal interactions without negating the G-causality perspective. The efficacy of the proposed approach is exemplified by analyzing EEG signals recorded during a motor movement/imagery experiment. The similarities and differences between the causal interactions manifesting during the execution and the imagery of a given motor movement are demonstrated by analyzing EEG recordings from multiple subjects."}, "https://arxiv.org/abs/2103.01604": {"title": "Theory of Low Frequency Contamination from Nonstationarity and Misspecification: Consequences for HAR Inference", "link": "https://arxiv.org/abs/2103.01604", "description": "arXiv:2103.01604v3 Announce Type: replace \nAbstract: We establish theoretical results about the low frequency contamination (i.e., long memory effects) induced by general nonstationarity for estimates such as the sample autocovariance and the periodogram, and deduce consequences for heteroskedasticity and autocorrelation robust (HAR) inference. We present explicit expressions for the asymptotic bias of these estimates. We distinguish cases where this contamination only occurs as a small-sample problem and cases where the contamination continues to hold asymptotically. We show theoretically that nonparametric smoothing over time is robust to low frequency contamination. Our results provide new insights on the debate between consistent versus inconsistent long-run variance (LRV) estimation. Existing LRV estimators tend to be inflated when the data are nonstationary. This results in HAR tests that can be undersized and exhibit dramatic power losses. Our theory indicates that long-bandwidth or fixed-b HAR tests suffer more from low frequency contamination relative to HAR tests based on HAC estimators, whereas recently introduced double kernel HAC estimators do not suffer from this problem. 
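For reference, a compact sketch of the kind of kernel-based LRV/HAC estimator whose behavior is being analyzed, namely a Bartlett-kernel (Newey-West) long-run variance for a scalar series on simulated AR(1) errors; the DK-HAC estimators discussed in the abstract are not implemented here, and the bandwidth is chosen arbitrarily.

import numpy as np

def newey_west_lrv(v, bandwidth):
    # Bartlett-kernel (Newey-West) long-run variance of a scalar series v_t
    v = np.asarray(v, dtype=float) - np.mean(v)
    T = len(v)
    lrv = np.dot(v, v) / T                      # Gamma_0
    for j in range(1, bandwidth + 1):
        w = 1.0 - j / (bandwidth + 1.0)         # Bartlett weights
        gamma_j = np.dot(v[j:], v[:-j]) / T
        lrv += 2.0 * w * gamma_j
    return lrv

rng = np.random.default_rng(8)
e = np.zeros(1000)
for t in range(1, 1000):                        # AR(1) errors with rho = 0.5
    e[t] = 0.5 * e[t - 1] + rng.normal()
print(newey_west_lrv(e, bandwidth=12))          # population long-run variance is 1/(1-0.5)^2 = 4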
Finally, we present second-order Edgeworth expansions under nonstationarity about the distribution of HAC and DK-HAC estimators and about the corresponding t-test in the linear regression model."}, "https://arxiv.org/abs/2103.15298": {"title": "Empirical Welfare Maximization with Constraints", "link": "https://arxiv.org/abs/2103.15298", "description": "arXiv:2103.15298v2 Announce Type: replace \nAbstract: Empirical Welfare Maximization (EWM) is a framework that can be used to select welfare program eligibility policies based on data. This paper extends EWM by allowing for uncertainty in estimating the budget needed to implement the selected policy, in addition to its welfare. Due to the additional estimation error, I show there exist no rules that achieve the highest welfare possible while satisfying a budget constraint uniformly over a wide range of DGPs. This differs from the setting without a budget constraint where uniformity is achievable. I propose an alternative trade-off rule and illustrate it with Medicaid expansion, a setting with imperfect take-up and varying program costs."}, "https://arxiv.org/abs/2306.00453": {"title": "A Gaussian Sliding Windows Regression Model for Hydrological Inference", "link": "https://arxiv.org/abs/2306.00453", "description": "arXiv:2306.00453v3 Announce Type: replace \nAbstract: Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the understanding of time lags associated with the delay between rainfall occurrence and subsequent changes in streamflow is of high practical importance. Since water can take a variety of flow paths to generate streamflow, a series of distinct runoff pulses may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with interpretable parametrization, thus preventing novel insights about the dynamics of distinct flow paths from being formed. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, straightforward process inference can be achieved since each window can represent one flow path. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling."}, "https://arxiv.org/abs/2306.15947": {"title": "Separable pathway effects of semi-competing risks using multi-state models", "link": "https://arxiv.org/abs/2306.15947", "description": "arXiv:2306.15947v2 Announce Type: replace \nAbstract: Semi-competing risks refer to the phenomenon where a primary event (such as mortality) can ``censor'' an intermediate event (such as relapse of a disease), but not vice versa. Under the multi-state model, the primary event consists of two specific types: the direct outcome event and an indirect outcome event developed from intermediate events. Within this framework, we show that the total treatment effect on the cumulative incidence of the primary event can be decomposed into three separable pathway effects, capturing treatment effects on population-level transition rates between states. 
We next propose two estimators for the counterfactual cumulative incidences of the primary event under hypothetical treatment components. One estimator is given by the generalized Nelson--Aalen estimator with inverse probability weighting under covariates isolation, and the other is based on the efficient influence function. The asymptotic normality of these estimators is established. The first estimator only involves a propensity score model and avoids modeling the cause-specific hazards. The second estimator has robustness against the misspecification of submodels. As an illustration of its potential usefulness, the proposed method is applied to compare effects of different allogeneic stem cell transplantation types on overall survival after transplantation."}, "https://arxiv.org/abs/2311.00528": {"title": "On the Comparative Analysis of Average Treatment Effects Estimation via Data Combination", "link": "https://arxiv.org/abs/2311.00528", "description": "arXiv:2311.00528v2 Announce Type: replace \nAbstract: There is growing interest in exploring causal effects in target populations via data combination. However, most approaches are tailored to specific settings and lack comprehensive comparative analyses. In this article, we focus on a typical scenario involving a source dataset and a target dataset. We first design six settings under covariate shift and conduct a comparative analysis by deriving the semiparametric efficiency bounds for the ATE in the target population. We then extend this analysis to six new settings that incorporate both covariate shift and posterior drift. Our study uncovers the key factors that influence efficiency gains and the ``effective sample size\" when combining two datasets, with a particular emphasis on the roles of the variance ratio of potential outcomes between datasets and the derivatives of the posterior drift function. To the best of our knowledge, this is the first paper that explicitly explores the role of the posterior drift functions in causal inference. Additionally, we propose novel methods for conducting sensitivity analysis to address violations of transportability between the two datasets. We empirically validate our findings by constructing locally efficient estimators and conducting extensive simulations. We demonstrate the proposed methods in two real-world applications."}, "https://arxiv.org/abs/2110.10650": {"title": "Attention Overload", "link": "https://arxiv.org/abs/2110.10650", "description": "arXiv:2110.10650v4 Announce Type: replace-cross \nAbstract: We introduce an Attention Overload Model that captures the idea that alternatives compete for the decision maker's attention, and hence the attention that each alternative receives decreases as the choice problem becomes larger. Using this nonparametric restriction on the random attention formation, we show that a fruitful revealed preference theory can be developed and provide testable implications on the observed choice behavior that can be used to (point or partially) identify the decision maker's preference and attention frequency. We then enhance our attention overload model to accommodate heterogeneous preferences. Due to the nonparametric nature of our identifying assumption, we must discipline the amount of heterogeneity in the choice model: we propose the idea of List-based Attention Overload, where alternatives are presented to the decision makers as a list that correlates with both heterogeneous preferences and random attention. 
We show that preference and attention frequencies are (point or partially) identifiable under nonparametric assumptions on the list and attention formation mechanisms, even when the true underlying list is unknown to the researcher. Building on our identification results, for both preference and attention frequencies, we develop econometric methods for estimation and inference that are valid in settings with a large number of alternatives and choice problems, a distinctive feature of the economic environment we consider. We provide a software package in R implementing our empirical methods, and illustrate them in a simulation study."}, "https://arxiv.org/abs/2206.10240": {"title": "Core-Elements for Large-Scale Least Squares Estimation", "link": "https://arxiv.org/abs/2206.10240", "description": "arXiv:2206.10240v4 Announce Type: replace-cross \nAbstract: The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive applications in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\\mathrm{nnz}(X)+rp^2)$ computational cost, where $X$ is an $n\\times p$ predictor matrix, $r$ is the number of elements selected from each column of $X$, and $\\mathrm{nnz}(\\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors."}, "https://arxiv.org/abs/2307.02616": {"title": "Federated Epidemic Surveillance", "link": "https://arxiv.org/abs/2307.02616", "description": "arXiv:2307.02616v2 Announce Type: replace-cross \nAbstract: Epidemic surveillance is a challenging task, especially when crucial data is fragmented across institutions and data custodians are unable or unwilling to share it. This study aims to explore the feasibility of a simple federated surveillance approach. The idea is to conduct hypothesis tests for a rise in counts behind each custodian's firewall and then combine p-values from these tests using techniques from meta-analysis. We propose a hypothesis testing framework to identify surges in epidemic-related data streams and conduct experiments on real and semi-synthetic data to assess the power of different p-value combination methods to detect surges without needing to combine the underlying counts. 
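The p-value combination step described in the Federated Epidemic Surveillance abstract can be sketched with the standard meta-analysis combiners available in SciPy. The Poisson baseline, surge factor, and number of data custodians below are made-up illustration values, not the paper's experimental setup.

```python
import numpy as np
from scipy.stats import poisson, combine_pvalues

rng = np.random.default_rng(2)
baseline, surge = 50, 1.4
# each custodian tests H0: current count ~ Poisson(baseline) against a rise
counts = rng.poisson(baseline * surge, size=5)     # 5 custodians, true surge present
pvals = poisson.sf(counts - 1, baseline)           # one-sided exceedance p-values

# combine the per-custodian p-values without sharing the underlying counts
for method in ("fisher", "stouffer", "tippett"):
    stat, p = combine_pvalues(pvals, method=method)
    print(method, round(float(p), 5))
```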
Our findings show that relatively simple combination methods achieve a high degree of fidelity and suggest that infectious disease outbreaks can be detected without needing to share even aggregate data across institutions."}, "https://arxiv.org/abs/2307.10272": {"title": "A Shrinkage Likelihood Ratio Test for High-Dimensional Subgroup Analysis with a Logistic-Normal Mixture Model", "link": "https://arxiv.org/abs/2307.10272", "description": "arXiv:2307.10272v2 Announce Type: replace-cross \nAbstract: In subgroup analysis, testing the existence of a subgroup with a differential treatment effect serves as protection against spurious subgroup discovery. Despite its importance, this hypothesis testing possesses a complicated nature: parameter characterizing subgroup classification is not identified under the null hypothesis of no subgroup. Due to this irregularity, the existing methods have the following two limitations. First, the asymptotic null distribution of test statistics often takes an intractable form, which necessitates computationally demanding resampling methods to calculate the critical value. Second, the dimension of personal attributes characterizing subgroup membership is not allowed to be of high dimension. To solve these two problems simultaneously, this study develops a shrinkage likelihood ratio test for the existence of a subgroup using a logistic-normal mixture model. The proposed test statistics are built on a modified likelihood function that shrinks possibly high-dimensional unidentified parameters toward zero under the null hypothesis while retaining power under the alternative."}, "https://arxiv.org/abs/2401.04778": {"title": "Generative neural networks for characteristic functions", "link": "https://arxiv.org/abs/2401.04778", "description": "arXiv:2401.04778v2 Announce Type: replace-cross \nAbstract: We provide a simulation algorithm to simulate from a (multivariate) characteristic function, which is only accessible in a black-box format. The method is based on a generative neural network, whose loss function exploits a specific representation of the Maximum-Mean-Discrepancy metric to directly incorporate the targeted characteristic function. The algorithm is universal in the sense that it is independent of the dimension and that it does not require any assumptions on the given characteristic function. Furthermore, finite sample guarantees on the approximation quality in terms of the Maximum-Mean Discrepancy metric are derived. The method is illustrated in a simulation study."}, "https://arxiv.org/abs/2409.10750": {"title": "GPT takes the SAT: Tracing changes in Test Difficulty and Math Performance of Students", "link": "https://arxiv.org/abs/2409.10750", "description": "arXiv:2409.10750v1 Announce Type: new \nAbstract: Scholastic Aptitude Test (SAT) is crucial for college admissions but its effectiveness and relevance are increasingly questioned. This paper enhances Synthetic Control methods by introducing \"Transformed Control\", a novel method that employs Large Language Models (LLMs) powered by Artificial Intelligence to generate control groups. We utilize OpenAI's API to generate a control group where GPT-4, or ChatGPT, takes multiple SATs annually from 2008 to 2023. This control group helps analyze shifts in SAT math difficulty over time, starting from the baseline year of 2008. Using parallel trends, we calculate the Average Difference in Scores (ADS) to assess changes in high school students' math performance. 
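For the "Generative neural networks for characteristic functions" abstract, the loss idea can be sketched without the neural network: a Monte-Carlo estimate of a weighted L2 distance between the empirical characteristic function of candidate samples and a black-box target characteristic function, which corresponds to an MMD with a Gaussian kernel up to constants. The toy target and frequency scale below are assumptions; this is not the authors' exact representation.

```python
import numpy as np

rng = np.random.default_rng(3)

def cf_loss(samples, target_cf, n_freq=512, scale=1.0):
    """Monte-Carlo weighted L2 distance between the empirical CF of `samples`
    and a black-box target CF, with frequencies drawn from N(0, scale^2 I)."""
    d = samples.shape[1]
    t = rng.normal(scale=scale, size=(n_freq, d))
    emp_cf = np.exp(1j * samples @ t.T).mean(axis=0)   # empirical CF at each frequency
    return np.mean(np.abs(emp_cf - target_cf(t)) ** 2)

# toy target: standard bivariate Gaussian, phi(t) = exp(-||t||^2 / 2)
target_cf = lambda t: np.exp(-0.5 * (t ** 2).sum(axis=1))
good = rng.normal(size=(2000, 2))
bad = rng.standard_t(df=3, size=(2000, 2))
print(cf_loss(good, target_cf), cf_loss(bad, target_cf))   # smaller loss for the matching sample
```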
Our results indicate a significant decrease in the difficulty of the SAT math section over time, alongside a decline in students' math performance. The analysis shows a 71-point drop in the rigor of SAT math from 2008 to 2023, with student performance decreasing by 36 points, resulting in a 107-point total divergence in average student math performance. We investigate possible mechanisms for this decline in math proficiency, such as changing university selection criteria, increased screen time, grade inflation, and worsening adolescent mental health. Disparities among demographic groups show a 104-point drop for White students, 84 points for Black students, and 53 points for Asian students. Male students saw a 117-point reduction, while female students had a 100-point decrease."}, "https://arxiv.org/abs/2409.10771": {"title": "Flexible survival regression with variable selection for heterogeneous population", "link": "https://arxiv.org/abs/2409.10771", "description": "arXiv:2409.10771v1 Announce Type: new \nAbstract: Survival regression is widely used to model time-to-event data, to explore how covariates may influence the occurrence of events. Modern datasets often encompass a vast number of covariates across many subjects, with only a subset of the covariates significantly affecting survival. Additionally, subjects often belong to an unknown number of latent groups, where covariate effects on survival differ significantly across groups. The proposed methodology addresses both challenges by simultaneously identifying the latent sub-groups in the heterogeneous population and evaluating covariate significance within each sub-group. This approach is shown to enhance the predictive accuracy for time-to-event outcomes, by uncovering varying risk profiles within the underlying heterogeneous population, and is thereby helpful for devising targeted disease management strategies."}, "https://arxiv.org/abs/2409.10812": {"title": "Statistical Inference for Chi-square Statistics or F-Statistics Based on Multiple Imputation", "link": "https://arxiv.org/abs/2409.10812", "description": "arXiv:2409.10812v1 Announce Type: new \nAbstract: Missing data is a common issue in medical, psychiatric, and social studies. In the literature, Multiple Imputation (MI) was proposed to multiply impute datasets and combine analysis results from imputed datasets for statistical inference using Rubin's rule. However, Rubin's rule only works for combined inference on statistical tests with point and variance estimates and cannot be applied to combine general F-statistics or Chi-square statistics. In this manuscript, we provide a solution to combine F-test statistics from multiply imputed datasets, when the F-statistic has an explicit fractional form (that is, both the numerator and denominator of the F-statistic are reported). Then we extend the method to combine Chi-square statistics from multiply imputed datasets. Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA and Type-III tests of fixed effects in mixed effects models, which do not have the explicit fractional form. 
SAS macros are also developed to facilitate applications."}, "https://arxiv.org/abs/2409.10820": {"title": "Simple robust two-stage estimation and inference for generalized impulse responses and multi-horizon causality", "link": "https://arxiv.org/abs/2409.10820", "description": "arXiv:2409.10820v1 Announce Type: new \nAbstract: This paper introduces a novel two-stage estimation and inference procedure for generalized impulse responses (GIRs). GIRs encompass all coefficients in a multi-horizon linear projection model of future outcomes of y on lagged values (Dufour and Renault, 1998), which include the Sims' impulse response. The conventional use of Least Squares (LS) with heteroskedasticity- and autocorrelation-consistent covariance estimation is less precise and often results in unreliable finite sample tests, further complicated by the selection of bandwidth and kernel functions. Our two-stage method surpasses the LS approach in terms of estimation efficiency and inference robustness. The robustness stems from our proposed covariance matrix estimates, which eliminate the need to correct for serial correlation in the multi-horizon projection residuals. Our method accommodates non-stationary data and allows the projection horizon to grow with sample size. Monte Carlo simulations demonstrate our two-stage method outperforms the LS method. We apply the two-stage method to investigate the GIRs, implement multi-horizon Granger causality test, and find that economic uncertainty exerts both short-run (1-3 months) and long-run (30 months) effects on economic activities."}, "https://arxiv.org/abs/2409.10835": {"title": "BMRMM: An R Package for Bayesian Markov (Renewal) Mixed Models", "link": "https://arxiv.org/abs/2409.10835", "description": "arXiv:2409.10835v1 Announce Type: new \nAbstract: We introduce the BMRMM package implementing Bayesian inference for a class of Markov renewal mixed models which can characterize the stochastic dynamics of a collection of sequences, each comprising alternative instances of categorical states and associated continuous duration times, while being influenced by a set of exogenous factors as well as a 'random' individual. The default setting flexibly models the state transition probabilities using mixtures of Dirichlet distributions and the duration times using mixtures of gamma kernels while also allowing variable selection for both. Modeling such data using simpler Markov mixed models also remains an option, either by ignoring the duration times altogether or by replacing them with instances of an additional category obtained by discretizing them by a user-specified unit. The option is also useful when data on duration times may not be available in the first place. We demonstrate the package's utility using two data sets."}, "https://arxiv.org/abs/2409.10855": {"title": "Calibrated Multivariate Regression with Localized PIT Mappings", "link": "https://arxiv.org/abs/2409.10855", "description": "arXiv:2409.10855v1 Announce Type: new \nAbstract: Calibration ensures that predicted uncertainties align with observed uncertainties. While there is an extensive literature on recalibration methods for univariate probabilistic forecasts, work on calibration for multivariate forecasts is much more limited. This paper introduces a novel post-hoc recalibration approach that addresses multivariate calibration for potentially misspecified models. 
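For the generalized impulse response abstract above, the conventional multi-horizon projections can be sketched as plain OLS local projections of y at horizon h on current and lagged values. This is the Least Squares baseline the paper improves upon, not the proposed two-stage estimator, and the AR(1) data-generating process is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
T, p, H = 500, 2, 12
y = np.zeros(T)
for t in range(1, T):                       # simple AR(1) data-generating process
    y[t] = 0.6 * y[t - 1] + rng.normal()

def local_projection_irf(y, p=2, H=12):
    """OLS multi-horizon projections of y_{t+h} on (1, y_t, ..., y_{t-p+1})."""
    irf = []
    for h in range(H + 1):
        X = np.column_stack([np.ones(len(y) - h - p + 1)] +
                            [y[p - 1 - l: len(y) - h - l] for l in range(p)])
        target = y[p - 1 + h:]
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        irf.append(beta[1])                 # coefficient on y_t at horizon h
    return np.array(irf)

print(local_projection_irf(y).round(2))     # roughly 0.6**h for the AR(1) example
```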
Our method involves constructing local mappings between vectors of marginal probability integral transform values and the space of observations, providing a flexible and model-free solution applicable to continuous, discrete, and mixed responses. We present two versions of our approach: one uses K-nearest neighbors, and the other uses normalizing flows. Each method has its own strengths in different situations. We demonstrate the effectiveness of our approach on two real data applications: recalibrating a deep neural network's currency exchange rate forecast and improving a regression model for childhood malnutrition in India for which the multivariate response has both discrete and continuous components."}, "https://arxiv.org/abs/2409.10860": {"title": "Cointegrated Matrix Autoregression Models", "link": "https://arxiv.org/abs/2409.10860", "description": "arXiv:2409.10860v1 Announce Type: new \nAbstract: We propose a novel cointegrated autoregressive model for matrix-valued time series, with bi-linear cointegrating vectors corresponding to the rows and columns of the matrix data. Compared to the traditional cointegration analysis, our proposed matrix cointegration model better preserves the inherent structure of the data and enables corresponding interpretations. To estimate the cointegrating vectors as well as other coefficients, we introduce two types of estimators based on least squares and maximum likelihood. We investigate the asymptotic properties of the cointegrated matrix autoregressive model in the presence of a trend and establish the asymptotic distributions for the cointegrating vectors, as well as other model parameters. We conduct extensive simulations to demonstrate its superior performance over traditional methods. In addition, we apply our proposed model to Fama-French portfolios and develop an effective pairs trading strategy."}, "https://arxiv.org/abs/2409.10943": {"title": "Comparison of g-estimation approaches for handling symptomatic medication at multiple timepoints in Alzheimer's Disease with a hypothetical strategy", "link": "https://arxiv.org/abs/2409.10943", "description": "arXiv:2409.10943v1 Announce Type: new \nAbstract: For handling intercurrent events in clinical trials, one of the strategies outlined in the ICH E9(R1) addendum targets the hypothetical scenario of non-occurrence of the intercurrent event. While this strategy is often implemented by setting data after the intercurrent event to missing even if they have been collected, g-estimation allows for more efficient estimation by using the information contained in post-IE data. As g-estimation methods have largely been developed outside of randomised clinical trials, optimisations for the application in clinical trials are possible. In this work, we describe and investigate the performance of modifications to the established g-estimation methods, leveraging the assumption that some intercurrent events are expected to have the same impact on the outcome regardless of the timing of their occurrence. 
In a simulation study in Alzheimer's disease, the modifications show a substantial efficiency advantage for the estimation of an estimand that applies the hypothetical strategy to the use of symptomatic treatment while retaining unbiasedness and adequate type I error control."}, "https://arxiv.org/abs/2409.11040": {"title": "Estimation and imputation of missing data in longitudinal models with Zero-Inflated Poisson response variable", "link": "https://arxiv.org/abs/2409.11040", "description": "arXiv:2409.11040v1 Announce Type: new \nAbstract: This research deals with the estimation and imputation of missing data in longitudinal models with a Poisson response variable inflated with zeros. A methodology is proposed that is based on the use of maximum likelihood, assuming that data are missing at random and that there is a correlation between the response variables. At each time point, the expectation maximization (EM) algorithm is used: in step E, a weighted regression is carried out, conditioned on the previous times that are taken as covariates. In step M, the estimation and imputation of the missing data are performed. The good performance of the methodology in different loss scenarios is demonstrated in a simulation study that compares the proposed approach with a model using only the complete data and with imputing missing data by each individual's mode. Furthermore, the methodology is tested on real data from a study of corn growth, illustrating the algorithm in a practical scenario."}, "https://arxiv.org/abs/2409.11134": {"title": "E-Values for Exponential Families: the General Case", "link": "https://arxiv.org/abs/2409.11134", "description": "arXiv:2409.11134v1 Announce Type: new \nAbstract: We analyze common types of e-variables and e-processes for composite exponential family nulls: the optimal e-variable based on the reverse information projection (RIPr), the conditional (COND) e-variable, and the universal inference (UI) and sequentialized RIPr e-processes. We characterize the RIPr prior for simple and Bayes-mixture based alternatives, either precisely (for Gaussian nulls and alternatives) or in an approximate sense (general exponential families). We provide conditions under which the RIPr e-variable is (again exactly vs. approximately) equal to the COND e-variable. Based on these and other interrelations which we establish, we determine the e-power of the four e-statistics as a function of sample size, exactly for Gaussian and up to $o(1)$ in general. For $d$-dimensional null and alternative, the e-power of UI tends to be smaller by a term of $(d/2) \\log n + O(1)$ than that of the COND e-variable, which is the clear winner."}, "https://arxiv.org/abs/2409.11162": {"title": "Chasing Shadows: How Implausible Assumptions Skew Our Understanding of Causal Estimands", "link": "https://arxiv.org/abs/2409.11162", "description": "arXiv:2409.11162v1 Announce Type: new \nAbstract: The ICH E9 (R1) addendum on estimands, coupled with recent advancements in causal inference, has prompted a shift towards using model-free treatment effect estimands that are more closely aligned with the underlying scientific question. This represents a departure from traditional, model-dependent approaches where the statistical model often overshadows the inquiry itself. While this shift is a positive development, it has unintentionally led to the prioritization of an estimand's theoretical appeal over its practical learnability from data under plausible assumptions. 
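The EM idea in the zero-inflated Poisson abstract above can be illustrated on a plain zero-inflated Poisson mixture, without the longitudinal and missing-data machinery of the paper: the E-step assigns each observed zero a posterior probability of being a structural zero, and the M-step updates the mixing weight and the Poisson mean. True parameter values below are assumed for the simulation.

```python
import numpy as np

rng = np.random.default_rng(11)
n, pi_true, lam_true = 2000, 0.3, 2.5
structural_zero = rng.random(n) < pi_true
y = np.where(structural_zero, 0, rng.poisson(lam_true, n))

pi, lam = 0.5, y.mean()                 # crude starting values
for _ in range(200):
    # E-step: posterior probability that an observed zero is a structural zero
    w = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
    # M-step: update the mixing weight and the Poisson mean given the responsibilities
    pi = w.mean()
    lam = np.sum((1 - w) * y) / np.sum(1 - w)

print(round(pi, 3), round(lam, 3))      # close to the assumed true values 0.3 and 2.5
```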
We illustrate this by scrutinizing assumptions in the recent clinical trials literature on principal stratum estimands, demonstrating that some popular assumptions are not only implausible but often inevitably violated. We advocate for a more balanced approach to estimand formulation, one that carefully considers both the scientific relevance and the practical feasibility of estimation under realistic conditions."}, "https://arxiv.org/abs/2409.11167": {"title": "Poisson and Gamma Model Marginalisation and Marginal Likelihood calculation using Moment-generating Functions", "link": "https://arxiv.org/abs/2409.11167", "description": "arXiv:2409.11167v1 Announce Type: new \nAbstract: We present a new analytical method to derive the likelihood function that has the population of parameters marginalised out in Bayesian hierarchical models. This method is also useful to find the marginal likelihoods in Bayesian models or in random-effect linear mixed models. The key to this method is to take high-order (sometimes fractional) derivatives of the prior moment-generating function if particular existence and differentiability conditions hold.\n In particular, this analytical method assumes that the likelihood is either Poisson or gamma. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.\n We also present some examples validating this new analytical method."}, "https://arxiv.org/abs/2409.11265": {"title": "Performance of Cross-Validated Targeted Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2409.11265", "description": "arXiv:2409.11265v1 Announce Type: new \nAbstract: Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for statistical inference. However, in situations where there is not differentiability due to data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, TMLE variance can suffer from inflation of the type I error and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve on performance compared to TMLE in settings of positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings.\n Methods: We utilised the data-generating mechanism as described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. Then, we evaluated the respective statistical performances of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods.\n Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees using standard TMLE with ensemble super learner-based initial estimates increases bias and variance leading to invalid statistical inference.\n Conclusions: It has been shown that when using CVTMLE the Donsker class condition is no longer necessary to obtain valid statistical inference when using regression trees and under either data sparsity or near-positivity violations. 
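The Poisson case of the moment-generating-function marginalisation described above can be checked symbolically: with M the prior MGF, the marginal pmf is M^(y)(-1)/y!, and for a gamma prior this reproduces the negative binomial. A small sympy verification under assumed prior parameters:

```python
import sympy as sp
from scipy.stats import nbinom

a, b = sp.Rational(2), sp.Rational(3)           # gamma prior: shape a, rate b (assumed)
t, y = sp.symbols("t"), 4                       # observed Poisson count
M = (1 - t / b) ** (-a)                         # gamma moment-generating function
marginal = sp.diff(M, t, y).subs(t, -1) / sp.factorial(y)   # M^(y)(-1) / y!
print(marginal, float(marginal))

# the Poisson-gamma marginal is negative binomial with n = a and p = b / (b + 1)
print(nbinom.pmf(y, 2, 3 / 4))                  # matches the symbolic value 45/4096
```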
We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference in cases where the super learner library uses more flexible candidates and is prone to overfitting."}, "https://arxiv.org/abs/2409.11385": {"title": "Probability-scale residuals for event-time data", "link": "https://arxiv.org/abs/2409.11385", "description": "arXiv:2409.11385v1 Announce Type: new \nAbstract: The probability-scale residual (PSR) is defined as $E\\{sign(y, Y^*)\\}$, where $y$ is the observed outcome and $Y^*$ is a random variable from the fitted distribution. The PSR is particularly useful for ordinal and censored outcomes for which fitted values are not available without additional assumptions. Previous work has defined the PSR for continuous, binary, ordinal, right-censored, and current status outcomes; however, development of the PSR has not yet been considered for data subject to general interval censoring. We develop extensions of the PSR, first to mixed-case interval-censored data, and then to data subject to several types of common censoring schemes. We derive the statistical properties of the PSR and show that our more general PSR encompasses several previously defined PSR for continuous and censored outcomes as special cases. The performance of the residual is illustrated in real data from the Caribbean, Central, and South American Network for HIV Epidemiology."}, "https://arxiv.org/abs/2409.11080": {"title": "Data-driven stochastic 3D modeling of the nanoporous binder-conductive additive phase in battery cathodes", "link": "https://arxiv.org/abs/2409.11080", "description": "arXiv:2409.11080v1 Announce Type: cross \nAbstract: A stochastic 3D modeling approach for the nanoporous binder-conductive additive phase in hierarchically structured cathodes of lithium-ion batteries is presented. The binder-conductive additive phase of these electrodes consists of carbon black, polyvinylidene difluoride binder and graphite particles. For its stochastic 3D modeling, a three-step procedure based on methods from stochastic geometry is used. First, the graphite particles are described by a Boolean model with ellipsoidal grains. Second, the mixture of carbon black and binder is modeled by an excursion set of a Gaussian random field in the complement of the graphite particles. Third, large pore regions within the mixture of carbon black and binder are described by a Boolean model with spherical grains. The model parameters are calibrated to 3D image data of cathodes in lithium-ion batteries acquired by focused ion beam scanning electron microscopy. Subsequently, model validation is performed by comparing model realizations with measured image data in terms of various morphological descriptors that are not used for model fitting. Finally, we use the stochastic 3D model for predictive simulations, where we generate virtual, yet realistic, image data of nanoporous binder-conductive additives with varying amounts of graphite particles. Based on these virtual nanostructures, we can investigate structure-property relationships. 
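For the probability-scale residual abstract above, the continuous-outcome case reduces to 2*F_hat(y) - 1 under the fitted distribution. A minimal sketch under a fitted normal linear model follows; the interval-censoring extensions developed in the paper are not attempted here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# fit a normal linear model by OLS
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma = (y - X @ beta).std(ddof=2)

# probability-scale residuals for a continuous outcome: 2 * F_hat(y) - 1
psr = 2 * norm.cdf(y, loc=X @ beta, scale=sigma) - 1
print(psr[:5], psr.mean())      # PSRs lie in (-1, 1) and average near 0
```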
In particular, we quantitatively study the influence of graphite particles on effective transport properties in the nanoporous binder-conductive additive phase, which have a crucial impact on electrochemical processes in the cathode and thus on the performance of battery cells."}, "https://arxiv.org/abs/2305.12616": {"title": "Conformal Prediction With Conditional Guarantees", "link": "https://arxiv.org/abs/2305.12616", "description": "arXiv:2305.12616v4 Announce Type: replace \nAbstract: We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of pre-specified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions."}, "https://arxiv.org/abs/2307.12325": {"title": "A Robust Framework for Graph-based Two-Sample Tests Using Weights", "link": "https://arxiv.org/abs/2307.12325", "description": "arXiv:2307.12325v3 Announce Type: replace \nAbstract: Graph-based tests are a class of non-parametric two-sample tests useful for analyzing high-dimensional data. The framework offers both flexibility and power in a wide-range of testing scenarios. The test statistics are constructed from similarity graphs (such as K-nearest neighbor graphs) and consequently, their performance is sensitive to the structure of the graph. When the graph has problematic structures, as is common for high-dimensional data, this can result in poor or unstable performance among existing graph-based tests. We address this challenge and develop graph-based test statistics that are robust to problematic structures of the graph. The limiting null distribution of the robust test statistics is derived. We illustrate the new tests via simulation studies and a real-world application on Chicago taxi trip-data."}, "https://arxiv.org/abs/2012.04485": {"title": "Occupational segregation in a Roy model with composition preferences", "link": "https://arxiv.org/abs/2012.04485", "description": "arXiv:2012.04485v3 Announce Type: replace-cross \nAbstract: We propose a model of labor market sector self-selection that combines comparative advantage, as in the Roy model, and sector composition preference. Two groups choose between two sectors based on heterogeneous potential incomes and group compositions in each sector. 
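The conformal prediction abstract above mentions that the proposed procedures slot into existing split conformal pipelines. For orientation, here is a baseline split conformal interval with marginal coverage only; it is the kind of pipeline being extended, not the paper's conditional-guarantee method. The cubic fit and noise level are assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
n, alpha = 1000, 0.1
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# split: fit a simple model on one half, calibrate residuals on the other
idx = rng.permutation(n)
tr, cal = idx[: n // 2], idx[n // 2:]
coef = np.polyfit(x[tr], y[tr], deg=3)
pred = lambda z: np.polyval(coef, z)

scores = np.abs(y[cal] - pred(x[cal]))                    # conformity scores
k = int(np.ceil((len(cal) + 1) * (1 - alpha)))            # finite-sample quantile index
q = np.sort(scores)[k - 1] if k <= len(cal) else np.inf   # infinite width if k exceeds the calibration size

x_new = np.array([0.0, 1.5])
print(list(zip(pred(x_new) - q, pred(x_new) + q)))        # marginal 90% prediction intervals
```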
Potential incomes incorporate group specific human capital accumulation and wage discrimination. Composition preferences are interpreted as reflecting group specific amenity preferences as well as homophily and aversion to minority status. We show that occupational segregation is amplified by the composition preferences and we highlight a resulting tension between redistribution and diversity. The model also exhibits tipping from extreme compositions to more balanced ones. Tipping occurs when a small nudge, associated with affirmative action, pushes the system to a very different equilibrium, and when the set of equilibria changes abruptly when a parameter governing the relative importance of pecuniary and composition preferences crosses a threshold."}, "https://arxiv.org/abs/2306.02821": {"title": "A unified analysis of likelihood-based estimators in the Plackett--Luce model", "link": "https://arxiv.org/abs/2306.02821", "description": "arXiv:2306.02821v3 Announce Type: replace-cross \nAbstract: The Plackett--Luce model has been extensively used for rank aggregation in social choice theory. A central question in this model concerns estimating the utility vector that governs the model's likelihood. In this paper, we investigate the asymptotic theory of utility vector estimation by maximizing different types of likelihood, such as full, marginal, and quasi-likelihood. Starting from interpreting the estimating equations of these estimators to gain some initial insights, we analyze their asymptotic behavior as the number of compared objects increases. In particular, we establish both the uniform consistency and asymptotic normality of these estimators and discuss the trade-off between statistical efficiency and computational complexity. For generality, our results are proven for deterministic graph sequences under appropriate graph topology conditions. These conditions are shown to be revealing and sharp when applied to common sampling scenarios, such as nonuniform random hypergraph models and hypergraph stochastic block models. Numerical results are provided to support our findings."}, "https://arxiv.org/abs/2306.07819": {"title": "False discovery proportion envelopes with m-consistency", "link": "https://arxiv.org/abs/2306.07819", "description": "arXiv:2306.07819v2 Announce Type: replace-cross \nAbstract: We provide new non-asymptotic false discovery proportion (FDP) confidence envelopes in several multiple testing settings relevant for modern high dimensional-data methods. We revisit the multiple testing scenarios considered in the recent work of Katsevich and Ramdas (2020): top-$k$, preordered (including knockoffs), online. Our emphasis is on obtaining FDP confidence bounds that both have non-asymptotic coverage and are asymptotically accurate in a specific sense, as the number $m$ of tested hypotheses grows. Namely, we introduce and study the property (which we call $m$-consistency) that the confidence bound converges to or below the desired level $\\alpha$ when applied to a specific reference $\\alpha$-level false discovery rate (FDR) controlling procedure. In this perspective, we derive new bounds that provide improvements over existing ones, both theoretically and practically, and are suitable for situations where at least a moderate number of rejections is expected. These improvements are illustrated with numerical experiments and real data examples. In particular, the improvement is significant in the knockoffs setting, which shows the impact of the method for a practical use. 
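For the Plackett--Luce abstract above, full-likelihood estimation of the utility vector can be sketched directly: simulate rankings from assumed utilities and maximise the Plackett--Luce log-likelihood numerically, fixing one utility at zero for identifiability. This is a generic full-likelihood sketch, not the marginal or quasi-likelihood variants analysed in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(7)
true_u = np.array([1.2, 0.6, -0.3, 0.0])        # assumed utilities, last fixed at 0

def sample_ranking(u, rng):
    items, order = list(range(len(u))), []
    while items:                                 # sequential choice defines the ranking
        p = np.exp(u[items]); p /= p.sum()
        order.append(items.pop(rng.choice(len(items), p=p)))
    return order

def pl_nll(theta, rankings):
    u = np.append(theta, 0.0)                    # identifiability: last utility is 0
    nll = 0.0
    for r in rankings:
        for k in range(len(r) - 1):
            nll += logsumexp(u[r[k:]]) - u[r[k]]
    return nll

rankings = [sample_ranking(true_u, rng) for _ in range(500)]
res = minimize(pl_nll, np.zeros(3), args=(rankings,), method="BFGS")
print(np.append(res.x, 0.0).round(2), true_u)    # estimates close to the assumed utilities
```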
As side results, we introduce a new confidence envelope for the empirical cumulative distribution function of i.i.d. uniform variables, and we provide new power results in sparse cases, both being of independent interest."}, "https://arxiv.org/abs/2409.11497": {"title": "Decomposing Gaussians with Unknown Covariance", "link": "https://arxiv.org/abs/2409.11497", "description": "arXiv:2409.11497v1 Announce Type: new \nAbstract: Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable."}, "https://arxiv.org/abs/2409.11525": {"title": "Interpretability Indices and Soft Constraints for Factor Models", "link": "https://arxiv.org/abs/2409.11525", "description": "arXiv:2409.11525v1 Announce Type: new \nAbstract: Factor analysis is a way to characterize the relationships between many (observable) variables in terms of a smaller number of unobservable random variables which are called factors. However, the application of factor models and its success can be subjective or difficult to gauge, since infinitely many factor models that produce the same correlation matrix can be fit given sample data. Thus, there is a need to operationalize a criterion that measures how meaningful or \"interpretable\" a factor model is in order to select the best among many factor models.\n While there are already techniques that aim to measure and enhance interpretability, new indices, as well as rotation methods via mathematical optimization based on them, are proposed to measure interpretability. The proposed methods directly incorporate semantics with the help of natural language processing and are generalized to incorporate any \"prior information\". Moreover, the indices allow for complete or partial specification of relationships at a pairwise level. 
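The known-covariance baseline that the "Decomposing Gaussians with Unknown Covariance" abstract generalises can be shown in a few lines: adding and subtracting independent Gaussian noise with the same covariance yields two independent pieces, even from a single observation. The mean and covariance below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(d, d))
Sigma = A @ A.T                                       # known covariance
X = rng.multivariate_normal(mu, Sigma)                # a single observation (n = 1)

# classical decomposition that requires knowing Sigma:
# with Z ~ N(0, Sigma) independent of X, X + Z and X - Z are independent,
# each distributed N(mu, 2 * Sigma)
Z = rng.multivariate_normal(np.zeros(d), Sigma)
X1, X2 = X + Z, X - Z
print(X1, X2, (X1 + X2) / 2 - X)    # averaging the two parts recovers X exactly
```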
Aside from these, two other main benefits of the proposed methods are that they do not require the estimation of factor scores, which avoids the factor score indeterminacy problem, and that no additional explanatory variables are necessary.\n The implementation of the proposed methods is written in Python 3 and is made available together with several helper functions through the package interpretablefa on the Python Package Index. The methods' application is demonstrated here using data on the Experiences in Close Relationships Scale, obtained from the Open-Source Psychometrics Project."}, "https://arxiv.org/abs/2409.11701": {"title": "Bias Reduction in Matched Observational Studies with Continuous Treatments: Calipered Non-Bipartite Matching and Bias-Corrected Estimation and Inference", "link": "https://arxiv.org/abs/2409.11701", "description": "arXiv:2409.11701v1 Announce Type: new \nAbstract: Matching is a commonly used causal inference framework in observational studies. By pairing individuals with different treatment values but with the same values of covariates (i.e., exact matching), the sample average treatment effect (SATE) can be consistently estimated and inferred using the classic Neyman-type (difference-in-means) estimator and confidence interval. However, inexact matching typically exists in practice and may cause substantial bias for the downstream treatment effect estimation and inference. Many methods have been proposed to reduce bias due to inexact matching in the binary treatment case. However, to our knowledge, no existing work has systematically investigated bias due to inexact matching in the continuous treatment case. To fill this blank, we propose a general framework for reducing bias in inexactly matched observational studies with continuous treatments. In the matching stage, we propose a carefully formulated caliper that incorporates the information of both the paired covariates and treatment doses to better tailor matching for the downstream SATE estimation and inference. In the estimation and inference stage, we propose a bias-corrected Neyman estimator paired with the corresponding bias-corrected variance estimator to leverage the information on propensity density discrepancies after inexact matching to further reduce the bias due to inexact matching. We apply our proposed framework to COVID-19 social mobility data to showcase differences between classic and bias-corrected SATE estimation and inference."}, "https://arxiv.org/abs/2409.11967": {"title": "Incremental effects for continuous exposures", "link": "https://arxiv.org/abs/2409.11967", "description": "arXiv:2409.11967v1 Announce Type: new \nAbstract: Causal inference problems often involve continuous treatments, such as dose, duration, or frequency. However, continuous exposures bring many challenges, both with identification and estimation. For example, identifying standard dose-response estimands requires that everyone has some chance of receiving any particular level of the exposure (i.e., positivity). In this work, we explore an alternative approach: rather than estimating dose-response curves, we consider stochastic interventions based on exponentially tilting the treatment distribution by some parameter $\\delta$, which we term an incremental effect. This increases or decreases the likelihood a unit receives a given treatment level, and crucially, does not require positivity for identification. 
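The exponential tilt in the incremental effects abstract above can be illustrated with importance weights proportional to exp(delta * a); a marginal tilt of the exposure distribution is sketched here, whereas the paper tilts the conditional treatment density given covariates. The exposure distribution and delta values are assumed.

```python
import numpy as np

rng = np.random.default_rng(9)
a = rng.gamma(shape=2.0, scale=1.0, size=5000)     # observed continuous exposures

def tilted_weights(a, delta):
    """Weights representing q_delta(a) proportional to exp(delta * a) * p(a)."""
    w = np.exp(delta * a)
    return w / w.mean()

for delta in (-0.5, 0.0, 0.5):
    w = tilted_weights(a, delta)
    print(delta, round(np.average(a, weights=w), 3))   # tilted mean exposure shifts with delta
```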
We begin by deriving the efficient influence function and semiparametric efficiency bound for these incremental effects under continuous exposures. We then show that estimation of the incremental effect is dependent on the size of the exponential tilt, as measured by $\\delta$. In particular, we derive new minimax lower bounds illustrating how the best possible root mean squared error scales with an effective sample size of $n/\\delta$, instead of usual sample size $n$. Further, we establish new convergence rates and bounds on the bias of double machine learning-style estimators. Our novel analysis gives a better dependence on $\\delta$ compared to standard analyses, by using mixed supremum and $L_2$ norms, instead of just $L_2$ norms from Cauchy-Schwarz bounds. Finally, we show that taking $\\delta \\to \\infty$ gives a new estimator of the dose-response curve at the edge of the support, and we give a detailed study of convergence rates in this regime."}, "https://arxiv.org/abs/2409.12081": {"title": "Optimising the Trade-Off Between Type I and Type II Errors: A Review and Extensions", "link": "https://arxiv.org/abs/2409.12081", "description": "arXiv:2409.12081v1 Announce Type: new \nAbstract: In clinical studies upon which decisions are based there are two types of errors that can be made: a type I error arises when the decision is taken to declare a positive outcome when the truth is in fact negative, and a type II error arises when the decision is taken to declare a negative outcome when the truth is in fact positive. Commonly the primary analysis of such a study entails a two-sided hypothesis test with a type I error rate of 5% and the study is designed to have a sufficiently low type II error rate, for example 10% or 20%. These values are arbitrary and often do not reflect the clinical, or regulatory, context of the study and ignore both the relative costs of making either type of error and the sponsor's prior belief that the drug is superior to either placebo, or a standard of care if relevant. This simplistic approach has recently been challenged by numerous authors both from a frequentist and Bayesian perspective since when resources are constrained there will be a need to consider a trade-off between type I and type II errors. In this paper we review proposals to utilise the trade-off by formally acknowledging the costs to optimise the choice of error rates for simple, point null and alternative hypotheses and extend the results to composite, or interval hypotheses, showing links to the Probability of Success of a clinical study."}, "https://arxiv.org/abs/2409.12173": {"title": "Poisson approximate likelihood compared to the particle filter", "link": "https://arxiv.org/abs/2409.12173", "description": "arXiv:2409.12173v1 Announce Type: new \nAbstract: Filtering algorithms are fundamental for inference on partially observed stochastic dynamic systems, since they provide access to the likelihood function and hence enable likelihood-based or Bayesian inference. A novel Poisson approximate likelihood (PAL) filter was introduced by Whitehouse et al. (2023). PAL employs a Poisson approximation to conditional densities, offering a fast approximation to the likelihood function for a certain subset of partially observed Markov process models. A central piece of evidence for PAL is the comparison in Table 1 of Whitehouse et al. (2023), which claims a large improvement for PAL over a standard particle filter algorithm. 
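One way to operationalise the type I / type II trade-off discussed in the error-optimisation abstract above is to choose the significance level that minimises an expected decision cost for a one-sided z-test. The effect size, error costs, and prior probability below are hypothetical, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

# one-sided z-test of H0: mu = 0 vs H1: mu = delta with n observations
delta, sigma, n = 0.3, 1.0, 100
prior_h1 = 0.5                    # assumed prior belief that the treatment works
c_type1, c_type2 = 3.0, 1.0       # assumed relative costs of the two errors

alphas = np.linspace(1e-4, 0.2, 2000)
z_alpha = norm.ppf(1 - alphas)
power = 1 - norm.cdf(z_alpha - delta * np.sqrt(n) / sigma)
expected_cost = (1 - prior_h1) * c_type1 * alphas + prior_h1 * c_type2 * (1 - power)

best = np.argmin(expected_cost)
print(round(alphas[best], 4), round(power[best], 3))   # cost-minimising alpha and its power
```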
This evidence, based on a model and data from a previous scientific study by Stocks et al. (2020), might suggest that researchers confronted with similar models should use PAL rather than particle filter methods. Taken at face value, this evidence also reduces the credibility of Stocks et al. (2020) by indicating a shortcoming with the numerical methods that they used. However, we show that the comparison of log-likelihood values made by Whitehouse et al. (2023) is flawed because their PAL calculations were carried out using a dataset scaled differently from the previous study. If PAL and the particle filter are applied to the same data, the advantage claimed for PAL disappears. On simulations where the model is correctly specified, the particle filter outperforms PAL."}, "https://arxiv.org/abs/2409.11658": {"title": "Forecasting age distribution of life-table death counts via {\\alpha}-transformation", "link": "https://arxiv.org/abs/2409.11658", "description": "arXiv:2409.11658v1 Announce Type: cross \nAbstract: We introduce a compositional power transformation, known as an {\\alpha}-transformation, to model and forecast a time series of life-table death counts, possibly with zero counts observed at older ages. As a generalisation of the isometric log-ratio transformation (i.e., {\\alpha} = 0), the {\\alpha} transformation relies on the tuning parameter {\\alpha}, which can be determined in a data-driven manner. Using the Australian age-specific period life-table death counts from 1921 to 2020, the {\\alpha} transformation can produce more accurate short-term point and interval forecasts than the log-ratio transformation. The improved forecast accuracy of life-table death counts is of great importance to demographers and government planners for estimating survival probabilities and life expectancy and actuaries for determining annuity prices and reserves for various initial ages and maturity terms."}, "https://arxiv.org/abs/2204.02872": {"title": "Cluster randomized trials designed to support generalizable inferences", "link": "https://arxiv.org/abs/2204.02872", "description": "arXiv:2204.02872v2 Announce Type: replace \nAbstract: Background: When planning a cluster randomized trial, evaluators often have access to an enumerated cohort representing the target population of clusters. Practicalities of conducting the trial, such as the need to oversample clusters with certain characteristics to improve trial economy or to support inference about subgroups of clusters, may preclude simple random sampling from the cohort into the trial, and thus interfere with the goal of producing generalizable inferences about the target population.\n Methods: We describe a nested trial design where the randomized clusters are embedded within a cohort of trial-eligible clusters from the target population and where clusters are selected for inclusion in the trial with known sampling probabilities that may depend on cluster characteristics (e.g., allowing clusters to be chosen to facilitate trial conduct or to examine hypotheses related to their characteristics). We develop and evaluate methods for analyzing data from this design to generalize causal inferences to the target population underlying the cohort.\n Results: We present identification and estimation results for the expectation of the average potential outcome and for the average treatment effect, in the entire target population of clusters and in its non-randomized subset. 
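For the life-table forecasting abstract above, one common definition of the compositional alpha-transformation (following Tsagris and co-authors, up to scaling conventions) can be sketched with the Helmert sub-matrix; letting alpha go to 0 recovers the isometric log-ratio transform mentioned in the abstract. The toy composition is assumed.

```python
import numpy as np
from scipy.linalg import helmert

def alpha_transform(x, alpha):
    """Power-transform and normalise, centre, scale by 1/alpha, then map to
    R^(D-1) with the Helmert sub-matrix; alpha = 0 gives the ilr transform."""
    x = np.asarray(x, dtype=float)
    D = x.shape[-1]
    H = helmert(D)                         # (D-1) x D orthonormal contrast matrix
    if alpha == 0.0:
        clr = np.log(x) - np.log(x).mean(axis=-1, keepdims=True)
        return clr @ H.T
    u = x ** alpha
    u = u / u.sum(axis=-1, keepdims=True)
    return ((D * u - 1.0) / alpha) @ H.T

dx = np.array([[0.5, 0.3, 0.15, 0.05]])    # toy death-count composition (assumed)
print(alpha_transform(dx, 0.0))            # isometric log-ratio coordinates
print(alpha_transform(dx, 0.1))            # close to the ilr for small alpha
```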
In simulation studies we show that all the estimators have low bias but markedly different precision.\n Conclusions: Cluster randomized trials where clusters are selected for inclusion with known sampling probabilities that depend on cluster characteristics, combined with efficient estimation methods, can precisely quantify treatment effects in the target population, while addressing objectives of trial conduct that require oversampling clusters on the basis of their characteristics."}, "https://arxiv.org/abs/2307.15205": {"title": "A new robust graph for graph-based methods", "link": "https://arxiv.org/abs/2307.15205", "description": "arXiv:2307.15205v3 Announce Type: replace \nAbstract: Graph-based two-sample tests and change-point detection are powerful tools for analyzing high-dimensional and non-Euclidean data, as they do not impose distributional assumptions and perform effectively across a wide range of scenarios. These methods utilize a similarity graph constructed from the observations, with $K$-nearest neighbor graphs or $K$-minimum spanning trees being the current state-of-the-art choices. However, in high-dimensional settings, these graphs tend to form hubs -- nodes with disproportionately large degrees -- and graph-based methods are sensitive to hubs. To address this issue, we propose a robust graph that is significantly less prone to forming hubs in high-dimensional settings. Incorporating this robust graph can substantially improve the power of graph-based methods across various scenarios. Furthermore, we establish a theoretical foundation for graph-based methods using the proposed robust graph, demonstrating its consistency under fixed alternatives in both low-dimensional and high-dimensional contexts."}, "https://arxiv.org/abs/2401.04036": {"title": "A regularized MANOVA test for semicontinuous high-dimensional data", "link": "https://arxiv.org/abs/2401.04036", "description": "arXiv:2401.04036v2 Announce Type: replace \nAbstract: We propose a MANOVA test for semicontinuous data that is applicable also when the dimensionality exceeds the sample size. The test statistic is obtained as a likelihood ratio, where numerator and denominator are computed at the maxima of penalized likelihood functions under each hypothesis. Closed form solutions for the regularized estimators allow us to avoid computational overheads. We derive the null distribution using a permutation scheme. The power and level of the resulting test are evaluated in a simulation study. We illustrate the new methodology with two original data analyses, one regarding microRNA expression in human blastocyst cultures, and another regarding alien plant species invasion in the island of Socotra (Yemen)."}, "https://arxiv.org/abs/2205.08609": {"title": "Bagged Polynomial Regression and Neural Networks", "link": "https://arxiv.org/abs/2205.08609", "description": "arXiv:2205.08609v2 Announce Type: replace-cross \nAbstract: Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of \\textit{bagged} polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. 
We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset. We demonstrate that BPR performs as well as neural networks in crop classification using satellite data, a setting where prediction accuracy is critical and interpretability is often required for addressing research questions."}, "https://arxiv.org/abs/2311.04318": {"title": "Estimation for multistate models subject to reporting delays and incomplete event adjudication", "link": "https://arxiv.org/abs/2311.04318", "description": "arXiv:2311.04318v2 Announce Type: replace-cross \nAbstract: Complete observation of event histories is often impossible due to sampling effects such as right-censoring and left-truncation, but also due to reporting delays and incomplete event adjudication. This is for example the case for health insurance claims and during interim stages of clinical trials. In this paper, we develop a parametric method that takes the aforementioned effects into account, treating the latter two as partially exogenous. The method, which takes the form of a two-step M-estimation procedure, is applicable to multistate models in general, including competing risks and recurrent event models. The effect of reporting delays is derived via thinning, offering an alternative to existing results for Poisson models. To address incomplete event adjudication, we propose an imputed likelihood approach which, compared to existing methods, has the advantage of allowing for dependencies between the event history and adjudication processes as well as allowing for unreported events and multiple event types. We establish consistency and asymptotic normality under standard identifiability, integrability, and smoothness conditions, and we demonstrate the validity of the percentile bootstrap. Finally, a simulation study shows favorable finite sample performance of our method compared to other alternatives, while an application to disability insurance data illustrates its practical potential."}, "https://arxiv.org/abs/2312.11582": {"title": "Shapley-PC: Constraint-based Causal Structure Learning with Shapley Values", "link": "https://arxiv.org/abs/2312.11582", "description": "arXiv:2312.11582v2 Announce Type: replace-cross \nAbstract: Causal Structure Learning (CSL), also referred to as causal discovery, amounts to extracting causal relations among variables in data. CSL enables the estimation of causal effects from observational data alone, avoiding the need to perform real life experiments. Constraint-based CSL leverages conditional independence tests to perform causal discovery. We propose Shapley-PC, a novel method to improve constraint-based CSL algorithms by using Shapley values over the possible conditioning sets, to decide which variables are responsible for the observed conditional (in)dependences. 
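The bagged polynomial regression abstract above admits a minimal scikit-learn sketch: polynomial features inside a bagged linear regression, compared with a single polynomial fit on simulated data. The data-generating process, degree, and bagging hyperparameters are assumed, and no claim is made about matching the paper's benchmarks.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(10)
X = rng.uniform(-2, 2, size=(2000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0, 0.2, 2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

poly_reg = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
bagged = BaggingRegressor(poly_reg, n_estimators=50, max_samples=0.7, random_state=0)

for name, model in [("single polynomial", poly_reg), ("bagged polynomial", bagged)]:
    model.fit(X_tr, y_tr)
    print(name, round(mean_squared_error(y_te, model.predict(X_te)), 4))
```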
We prove soundness, completeness and asymptotic consistency of Shapley-PC and run a simulation study showing that our proposed algorithm is superior to existing versions of PC."}, "https://arxiv.org/abs/2409.12275": {"title": "Simultaneous Estimation of Many Sparse Networks via Hierarchical Poisson Log-Normal Model", "link": "https://arxiv.org/abs/2409.12275", "description": "arXiv:2409.12275v1 Announce Type: new \nAbstract: The advancement of single-cell RNA-sequencing (scRNA-seq) technologies allows us to study individual-level, cell-type-specific gene expression networks by direct inference of genes' conditional independence structures. scRNA-seq data facilitates the analysis of gene expression data across different conditions or samples, enabling simultaneous estimation of condition- or sample-specific gene networks. Since the scRNA-seq data are count data with many zeros, existing network inference methods based on Gaussian graphs cannot be applied to such single-cell data directly. We propose a hierarchical Poisson Log-Normal model to simultaneously estimate many such networks to effectively incorporate the shared network structures. We develop an efficient simultaneous estimation method that uses the variational EM and alternating direction method of multipliers (ADMM) algorithms, optimized for parallel processing. Simulation studies show this method outperforms traditional methods in network structure recovery and parameter estimation across various network models. We apply the method to two single-cell RNA-seq datasets, a yeast single-cell gene expression dataset measured under 11 different environmental conditions, and a single-cell gene expression dataset from 13 inflammatory bowel disease patients. We demonstrate that simultaneous estimation can uncover a wider range of conditional dependence networks among genes, offering deeper insights into gene expression mechanisms."}, "https://arxiv.org/abs/2409.12348": {"title": "Heckman Selection Contaminated Normal Model", "link": "https://arxiv.org/abs/2409.12348", "description": "arXiv:2409.12348v1 Announce Type: new \nAbstract: The Heckman selection model is one of the most renowned econometric models in the analysis of data with sample selection. This model is designed to rectify sample selection biases based on the assumption of bivariate normal error terms. However, real data diverge from this assumption in the presence of heavy tails and/or atypical observations. Recently, this assumption has been relaxed via a more flexible Student's t-distribution, which has appealing statistical properties. This paper introduces a novel Heckman selection model using a bivariate contaminated normal distribution for the error terms. We present an efficient ECM algorithm for parameter estimation with closed-form expressions at the E-step based on truncated multinormal distribution formulas. The identifiability of the proposed model is also discussed, and its properties have been examined. Through simulation studies, we compare our proposed model with the normal and Student's t counterparts and investigate the finite-sample properties under varying missing rates. Results obtained from two real data analyses showcase the usefulness and effectiveness of our model. 
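For readers unfamiliar with the error distribution used above: a bivariate contaminated normal is a two-component scale mixture of normals, where a small fraction of draws comes from a variance-inflated component. The simulation sketch below illustrates this generic construction; the mixing proportion, inflation factor, and correlation are illustrative values, not parameters from the paper.

```python
# Simulate bivariate contaminated-normal error terms: with probability (1 - nu)
# draw from N(0, Sigma); with probability nu draw from the inflated N(0, kappa * Sigma).
import numpy as np

rng = np.random.default_rng(1)
n, nu, kappa = 10_000, 0.10, 9.0            # contamination share and scale inflation (illustrative)
rho = 0.5
Sigma = np.array([[1.0, rho], [rho, 1.0]])  # correlation links selection and outcome errors

contaminated = rng.uniform(size=n) < nu
errors = np.where(
    contaminated[:, None],
    rng.multivariate_normal(np.zeros(2), kappa * Sigma, size=n),
    rng.multivariate_normal(np.zeros(2), Sigma, size=n),
)
# Heavy tails show up as excess kurtosis relative to the Gaussian value of 3.
print("empirical kurtosis of the outcome error:",
      round(np.mean(errors[:, 1] ** 4) / np.var(errors[:, 1]) ** 2, 2))
```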
The proposed algorithms are implemented in the R package HeckmanEM."}, "https://arxiv.org/abs/2409.12353": {"title": "A Way to Synthetic Triple Difference", "link": "https://arxiv.org/abs/2409.12353", "description": "arXiv:2409.12353v1 Announce Type: new \nAbstract: This paper introduces a novel approach that combines synthetic control with triple difference to address violations of the parallel trends assumption. While synthetic control has been widely applied to improve causal estimates in difference-in-differences (DID) frameworks, its use in triple-difference models has been underexplored. By transforming triple difference into a DID structure, this paper extends the applicability of synthetic control to a triple-difference framework, enabling more robust estimates when parallel trends are violated across multiple dimensions. The empirical example focuses on China's \"4+7 Cities\" Centralized Drug Procurement pilot program. Based on the proposed procedure for synthetic triple difference, I find that the program can promote pharmaceutical innovation in terms of the number of patent applications even based on the recommended clustered standard error. This method contributes to improving causal inference in policy evaluations and offers a valuable tool for researchers dealing with heterogeneous treatment effects across subgroups."}, "https://arxiv.org/abs/2409.12498": {"title": "Neymanian inference in randomized experiments", "link": "https://arxiv.org/abs/2409.12498", "description": "arXiv:2409.12498v1 Announce Type: new \nAbstract: In his seminal work in 1923, Neyman studied the variance estimation problem for the difference-in-means estimator of the average treatment effect in completely randomized experiments. He proposed a variance estimator that is conservative in general and unbiased when treatment effects are homogeneous. While widely used under complete randomization, there is no unique or natural way to extend this estimator to more complex designs. To this end, we show that Neyman's estimator can be alternatively derived in two ways, leading to two novel variance estimation approaches: the imputation approach and the contrast approach. While both approaches recover Neyman's estimator under complete randomization, they yield fundamentally different variance estimators for more general designs. In the imputation approach, the variance is expressed as a function of observed and missing potential outcomes and then estimated by imputing the missing potential outcomes, akin to Fisherian inference. In the contrast approach, the variance is expressed as a function of several unobservable contrasts of potential outcomes and then estimated by exchanging each unobservable contrast with an observable contrast. Unlike the imputation approach, the contrast approach does not require separately estimating the missing potential outcome for each unit. We examine the theoretical properties of both approaches, showing that for a large class of designs, each produces conservative variance estimators that are unbiased in finite samples or asymptotically under homogeneous treatment effects."}, "https://arxiv.org/abs/2409.12592": {"title": "Choice of the hypothesis matrix for using the Anova-type-statistic", "link": "https://arxiv.org/abs/2409.12592", "description": "arXiv:2409.12592v1 Announce Type: new \nAbstract: Initially developed in Brunner et al. 
(1997), the Anova-type-statistic (ATS) is one of the most used quadratic forms for testing multivariate hypotheses for a variety of different parameter vectors $\\boldsymbol{\\theta}\\in\\mathbb{R}^d$. Such tests can be based on several versions of ATS and in most settings, they are preferable over those based on other quadratic forms, as for example the Wald-type-statistic (WTS). However, the same null hypothesis $\\boldsymbol{H}\\boldsymbol{\\theta}=\\boldsymbol{y}$ can be expressed by a multitude of hypothesis matrices $\\boldsymbol{H}\\in\\mathbb{R}^{m\\times d}$ and corresponding vectors $\\boldsymbol{y}\\in\\mathbb{R}^m$, which leads to different values of the test statistic, as it can be seen in simple examples. Since this can entail distinct test decisions, it remains to investigate under which conditions tests using different hypothesis matrices coincide. Here, the dimensions of the different hypothesis matrices can be substantially different, which has exceptional potential to save computation effort.\n In this manuscript, we show that for the Anova-type-statistic and some versions thereof, it is possible for each hypothesis $\\boldsymbol{H}\\boldsymbol{\\theta}=\\boldsymbol{y}$ to construct a companion matrix $\\boldsymbol{L}$ with a minimal number of rows, which not only tests the same hypothesis but also always yields the same test decisions. This allows a substantial reduction of computation time, which is investigated in several conducted simulations."}, "https://arxiv.org/abs/2409.12611": {"title": "Parameters on the boundary in predictive regression", "link": "https://arxiv.org/abs/2409.12611", "description": "arXiv:2409.12611v1 Announce Type: new \nAbstract: We consider bootstrap inference in predictive (or Granger-causality) regressions when the parameter of interest may lie on the boundary of the parameter space, here defined by means of a smooth inequality constraint. For instance, this situation occurs when the definition of the parameter space allows for the cases of either no predictability or sign-restricted predictability. We show that in this context constrained estimation gives rise to bootstrap statistics whose limit distribution is, in general, random, and thus distinct from the limit null distribution of the original statistics of interest. This is due to both (i) the possible location of the true parameter vector on the boundary of the parameter space, and (ii) the possible non-stationarity of the posited predicting (resp. Granger-causing) variable. We discuss a modification of the standard fixed-regressor wild bootstrap scheme where the bootstrap parameter space is shifted by a data-dependent function in order to eliminate the portion of limiting bootstrap randomness attributable to the boundary, and prove validity of the associated bootstrap inference under non-stationarity of the predicting variable as the only remaining source of limiting bootstrap randomness. Our approach, which is initially presented in a simple location model, has bearing on inference in parameter-on-the-boundary situations beyond the predictive regression problem."}, "https://arxiv.org/abs/2409.12662": {"title": "Testing for equal predictive accuracy with strong dependence", "link": "https://arxiv.org/abs/2409.12662", "description": "arXiv:2409.12662v1 Announce Type: new \nAbstract: We analyse the properties of the Diebold and Mariano (1995) test in the presence of autocorrelation in the loss differential. 
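For context, the Diebold and Mariano (1995) statistic scales the mean loss differential by a HAC estimate of its long-run variance. The sketch below is the textbook construction with a Bartlett-kernel variance and squared-error loss; it is meant only as background and is not the paper's analysis.

```python
# Textbook Diebold-Mariano test: t-statistic for the mean loss differential,
# with a Newey-West (Bartlett-kernel) long-run variance to handle autocorrelation.
import numpy as np
from scipy import stats

def dm_test(e1, e2, lag=None):
    d = e1 ** 2 - e2 ** 2                       # squared-error loss differential
    T = d.size
    if lag is None:
        lag = int(np.floor(T ** (1 / 3)))
    d_c = d - d.mean()
    lrv = np.mean(d_c ** 2)                     # lag-0 autocovariance
    for k in range(1, lag + 1):
        gamma_k = np.mean(d_c[k:] * d_c[:-k])
        lrv += 2 * (1 - k / (lag + 1)) * gamma_k  # Bartlett weights
    dm = d.mean() / np.sqrt(lrv / T)
    return dm, 2 * stats.norm.sf(abs(dm))       # statistic and two-sided p-value

rng = np.random.default_rng(2)
y = rng.standard_normal(500)
e1, e2 = y - 0.1, y + rng.normal(0, 0.2, 500)   # two toy forecast-error series
print(dm_test(e1, e2))
```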
We show that the power of the Diebold and Mariano (1995) test decreases as the dependence increases, making it more difficult to obtain statistically significant evidence of superior predictive ability against less accurate benchmarks. We also find that, after a certain threshold, the test has no power and the correct null hypothesis is spuriously rejected. Taken together, these results caution practitioners to carefully consider the dependence properties of the loss differential before applying the Diebold and Mariano (1995) test."}, "https://arxiv.org/abs/2409.12848": {"title": "Bridging the Gap Between Design and Analysis: Randomization Inference and Sensitivity Analysis for Matched Observational Studies with Treatment Doses", "link": "https://arxiv.org/abs/2409.12848", "description": "arXiv:2409.12848v1 Announce Type: new \nAbstract: Matching is a commonly used causal inference study design in observational studies. Through matching on measured confounders between different treatment groups, valid randomization inferences can be conducted under the no unmeasured confounding assumption, and sensitivity analysis can be further performed to assess the sensitivity of randomization inference results to potential unmeasured confounding. However, for many common matching designs, there is still a lack of valid downstream randomization inference and sensitivity analysis approaches. Specifically, in matched observational studies with treatment doses (e.g., continuous or ordinal treatments), with the exception of some special cases such as pair matching, there is no existing randomization inference or sensitivity analysis approach for studying analogs of the sample average treatment effect (Neyman-type weak nulls), and no existing valid sensitivity analysis approach for testing the sharp null of no effect for any subject (Fisher's sharp null) when the outcome is non-binary. To fill these gaps, we propose new methods for randomization inference and sensitivity analysis that can work for general matching designs with treatment doses, applicable to general types of outcome variables (e.g., binary, ordinal, or continuous), and cover both Fisher's sharp null and Neyman-type weak nulls. We illustrate our approaches via comprehensive simulation studies and a real-data application."}, "https://arxiv.org/abs/2409.12856": {"title": "Scaleable Dynamic Forecast Reconciliation", "link": "https://arxiv.org/abs/2409.12856", "description": "arXiv:2409.12856v1 Announce Type: new \nAbstract: We introduce a dynamic approach to probabilistic forecast reconciliation at scale. Our model differs from the existing literature in this area in several important ways. Firstly, we explicitly allow the weights allocated to the base forecasts in forming the combined, reconciled forecasts to vary over time. Secondly, we drop the assumption, near ubiquitous in the literature, that in-sample base forecasts are appropriate for determining these weights, and use out-of-sample forecasts instead. Most existing probabilistic reconciliation approaches rely on time-consuming, sampling-based techniques, and therefore do not scale well (or at all) to large data sets. 
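(Background for the reconciliation abstract above: most reconciliation schemes ultimately apply a linear projection of incoherent base forecasts onto the coherent subspace spanned by the aggregation matrix. The MinT-style projection below shows that generic building block on a toy two-bottom-series hierarchy; it is not the paper's dynamic, out-of-sample-weighted method.)

```python
# MinT-style forecast reconciliation: y_tilde = S (S' W^{-1} S)^{-1} S' W^{-1} y_hat,
# where S maps bottom-level series to the full hierarchy and W is an error covariance.
import numpy as np

S = np.array([[1, 1],                     # total = bottom1 + bottom2
              [1, 0],
              [0, 1]], dtype=float)
y_hat = np.array([10.5, 6.0, 5.0])        # incoherent base forecasts (total != sum of bottoms)
W = np.diag([2.0, 1.0, 1.0])              # illustrative base-forecast error covariance

Winv = np.linalg.inv(W)
G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)   # (S' W^{-1} S)^{-1} S' W^{-1}
y_tilde = S @ G @ y_hat
print(y_tilde)                            # reconciled forecasts satisfy the aggregation constraint
```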
We address this problem in two main ways, firstly by utilising a closed-form estimator of the covariance structure appropriate to hierarchical forecasting problems, and secondly by decomposing large hierarchies into components which can be reconciled separately."}, "https://arxiv.org/abs/2409.12928": {"title": "A general condition for bias attenuation by a nondifferentially mismeasured confounder", "link": "https://arxiv.org/abs/2409.12928", "description": "arXiv:2409.12928v1 Announce Type: new \nAbstract: In real-world studies, the collected confounders may suffer from measurement error. Although mismeasurement of confounders is typically unintentional -- originating from sources such as human oversight or imprecise machinery -- deliberate mismeasurement also occurs and is becoming increasingly common. For example, in the 2020 U.S. Census, noise was added to measurements to assuage privacy concerns. Sensitive variables such as income or age are oftentimes partially censored and are only known up to a range of values. In such settings, obtaining valid estimates of the causal effect of a binary treatment can be impossible, as mismeasurement of confounders constitutes a violation of the no unmeasured confounding assumption. A natural question is whether the common practice of simply adjusting for the mismeasured confounder is justifiable. In this article, we answer this question in the affirmative and demonstrate that in many realistic scenarios not covered by previous literature, adjusting for the mismeasured confounders reduces bias compared to not adjusting."}, "https://arxiv.org/abs/2409.12328": {"title": "SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems", "link": "https://arxiv.org/abs/2409.12328", "description": "arXiv:2409.12328v1 Announce Type: cross \nAbstract: Stochastic optimization problems in large-scale multi-stakeholder networked systems (e.g., power grids and supply chains) rely on data-driven scenarios to encapsulate complex spatiotemporal interdependencies. However, centralized aggregation of stakeholder data is challenging due to the existence of data silos resulting from computational and logistical bottlenecks. In this paper, we present SplitVAEs, a decentralized scenario generation framework that leverages variational autoencoders to generate high-quality scenarios without moving stakeholder data. With the help of experiments on distributed memory systems, we demonstrate the broad applicability of SplitVAEs in a variety of domain areas that are dominated by a large number of stakeholders. Our experiments indicate that SplitVAEs can learn spatial and temporal interdependencies in large-scale networks to generate scenarios that match the joint historical distribution of stakeholder data in a decentralized manner. Our experiments show that SplitVAEs deliver robust performance compared to centralized, state-of-the-art benchmark methods while significantly reducing data transmission costs, leading to a scalable, privacy-enhancing alternative to scenario generation."}, "https://arxiv.org/abs/2409.12890": {"title": "Stable and Robust Hyper-Parameter Selection Via Robust Information Sharing Cross-Validation", "link": "https://arxiv.org/abs/2409.12890", "description": "arXiv:2409.12890v1 Announce Type: cross \nAbstract: Robust estimators for linear regression require non-convex objective functions to shield against adverse effects of outliers. 
This non-convexity brings challenges, particularly when combined with penalization in high-dimensional settings. Selecting hyper-parameters for the penalty based on a finite sample is a critical task. In practice, cross-validation (CV) is the prevalent strategy with good performance for convex estimators. Applied with robust estimators, however, CV often gives sub-par results due to the interplay between multiple local minima and the penalty. The best local minimum attained on the full training data may not be the minimum with the desired statistical properties. Furthermore, there may be a mismatch between this minimum and the minima attained in the CV folds. This paper introduces a novel adaptive CV strategy that tracks multiple minima for each combination of hyper-parameters and subsets of the data. A matching scheme is presented for correctly evaluating minima computed on the full training data using the best-matching minima from the CV folds. It is shown that the proposed strategy reduces the variability of the estimated performance metric, leads to smoother CV curves, and therefore substantially increases the reliability and utility of robust penalized estimators."}, "https://arxiv.org/abs/2211.09099": {"title": "Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2211.09099", "description": "arXiv:2211.09099v3 Announce Type: replace \nAbstract: The Brazil Bolsa Familia (BF) program is a conditional cash transfer program aimed at reducing short-term poverty by direct cash transfers and at fighting long-term poverty by increasing human capital among poor Brazilian people. Eligibility for Bolsa Familia benefits depends on a cutoff rule, which classifies the BF study as a regression discontinuity (RD) design. Extracting causal information from RD studies is challenging. Following Li et al. (2015) and Branson and Mealli (2019), we formally describe the BF RD design as a local randomized experiment within the potential outcome approach. Under this framework, causal effects can be identified and estimated on a subpopulation where a local overlap assumption, a local SUTVA and a local ignorability assumption hold. We first discuss the potential advantages of this framework over local regression methods based on continuity assumptions, which concern the definition of the causal estimands, the design and the analysis of the study, and the interpretation and generalizability of the results. A critical issue of this local randomization approach is how to choose subpopulations for which we can draw valid causal inference. We propose a Bayesian model-based finite mixture approach to clustering to classify observations into subpopulations where the RD assumptions hold or do not hold. This approach has important advantages: a) it accounts for the uncertainty in the subpopulation membership, which is typically neglected; b) it does not impose any constraint on the shape of the subpopulation; c) it is scalable to high-dimensional settings; d) it allows targeting alternative causal estimands beyond the average treatment effect (ATE); and e) it is robust to a certain degree of manipulation/selection of the running variable. 
We apply our proposed approach to assess causal effects of the Bolsa Familia program on leprosy incidence in 2009."}, "https://arxiv.org/abs/2309.15793": {"title": "Targeting relative risk heterogeneity with causal forests", "link": "https://arxiv.org/abs/2309.15793", "description": "arXiv:2309.15793v2 Announce Type: replace \nAbstract: The estimation of heterogeneous treatment effects (HTE) across different subgroups in a population is of significant interest in clinical trial analysis. State-of-the-art HTE estimation methods, including causal forests (Wager and Athey, 2018), generally rely on recursive partitioning for non-parametric identification of relevant covariates and interactions. However, like many other methods in this area, causal forests partition subgroups based on differences in absolute risk. This can dilute statistical power by masking variability in the relative risk, which is often a more appropriate quantity of clinical interest. In this work, we propose and implement a methodology for modifying causal forests to target relative risk, using a novel node-splitting procedure based on exhaustive generalized linear model comparison. We present results that suggest relative risk causal forests can capture otherwise undetected sources of heterogeneity."}, "https://arxiv.org/abs/2306.06291": {"title": "Optimal Multitask Linear Regression and Contextual Bandits under Sparse Heterogeneity", "link": "https://arxiv.org/abs/2306.06291", "description": "arXiv:2306.06291v2 Announce Type: replace-cross \nAbstract: Large and complex datasets are often collected from several, possibly heterogeneous sources. Multitask learning methods improve efficiency by leveraging commonalities across datasets while accounting for possible differences among them. Here, we study multitask linear regression and contextual bandits under sparse heterogeneity, where the source/task-associated parameters are equal to a global parameter plus a sparse task-specific term. We propose a novel two-stage estimator called MOLAR that leverages this structure by first constructing a covariate-wise weighted median of the task-wise linear regression estimates and then shrinking the task-wise estimates towards the weighted median. Compared to task-wise least squares estimates, MOLAR improves the dependence of the estimation error on the data dimension. Extensions of MOLAR to generalized linear models and constructing confidence intervals are discussed in the paper. We then apply MOLAR to develop methods for sparsely heterogeneous multitask contextual bandits, obtaining improved regret guarantees over single-task bandit methods. We further show that our methods are minimax optimal by providing a number of lower bounds. Finally, we support the efficiency of our methods by performing experiments on both synthetic data and the PISA dataset on student educational outcomes from heterogeneous countries."}, "https://arxiv.org/abs/2409.13041": {"title": "Properly constrained reference priors decay rates for efficient and robust posterior inference", "link": "https://arxiv.org/abs/2409.13041", "description": "arXiv:2409.13041v1 Announce Type: new \nAbstract: In Bayesian analysis, reference priors are widely recognized for their objective nature. Yet, they often lead to intractable and improper priors, which complicates their application. Besides, informed prior elicitation methods are penalized by the subjectivity of the choices they require. In this paper, we aim at proposing a reconciliation of the aforementioned aspects. 
Leveraging the objective aspect of reference prior theory, we introduce two strategies for incorporating constraints to build tractable reference priors. One provides a simple and easy-to-compute solution when the improper aspect is not questioned, and the other introduces constraints to ensure that the reference prior is proper, or that it yields a proper posterior. Our methodology emphasizes the central role of Jeffreys prior decay rates in this process, and the practical applicability of our results is demonstrated using an example taken from the literature."}, "https://arxiv.org/abs/2409.13060": {"title": "Forecasting Causal Effects of Future Interventions: Confounding and Transportability Issues", "link": "https://arxiv.org/abs/2409.13060", "description": "arXiv:2409.13060v1 Announce Type: new \nAbstract: Recent developments in causal inference allow us to transport a causal effect of a time-fixed treatment from a randomized trial to a target population across space but within the same time frame. In contrast to transportability across space, transporting causal effects across time or forecasting causal effects of future interventions is more challenging due to time-varying confounders and time-varying effect modifiers. In this article, we seek to formally clarify the causal estimands for forecasting causal effects over time and the structural assumptions required to identify these estimands. Specifically, we develop a set of novel nonparametric identification formulas--g-computation formulas--for these causal estimands, and lay out the conditions required to accurately forecast causal effects from a past observed sample to a future population in a future time window. Our overarching objective is to leverage modern causal inference theory to provide a theoretical framework for investigating whether the effects seen in a past sample would carry over to a new future population. Throughout the article, a working example addressing the effect of public policies or social events on COVID-related deaths is considered to contextualize the developments of analytical results."}, "https://arxiv.org/abs/2409.13097": {"title": "Incremental Causal Effect for Time to Treatment Initialization", "link": "https://arxiv.org/abs/2409.13097", "description": "arXiv:2409.13097v1 Announce Type: new \nAbstract: We consider time to treatment initialization. This can commonly occur in preventive medicine, such as disease screening and vaccination; it can also occur with non-fatal health conditions such as HIV infection without the onset of AIDS; or in the tech industry, where items wait to be reviewed manually as abusive or not. While traditional causal inference has focused on `when to treat' and its effects, including their possible dependence on subject characteristics, we consider the incremental causal effect when the intensity of time to treatment initialization is intervened upon. We provide identification of the incremental causal effect without the commonly required positivity assumption, as well as an estimation framework using inverse probability weighting. 
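(Generic background on the weighting step mentioned above: a standard inverse-probability-weighted comparison of treatment arms, with Hajek normalization, looks like the sketch below. This is the textbook IPW estimator for a binary treatment, not the paper's incremental estimator for intervened treatment-initiation intensities.)

```python
# Generic inverse-probability-weighted (IPW) comparison of two treatment arms:
# each outcome is weighted by the inverse of its estimated propensity of the arm received.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
x = rng.standard_normal((n, 2))
p = 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.25 * x[:, 1])))   # true propensity score
a = rng.binomial(1, p)
y = 1.0 * a + x[:, 0] + rng.standard_normal(n)            # true treatment effect of 1.0

p_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
w1 = a / p_hat
w0 = (1 - a) / (1 - p_hat)
mu1 = np.sum(w1 * y) / np.sum(w1)                         # Hajek-normalised weighted means
mu0 = np.sum(w0 * y) / np.sum(w0)
print("IPW ATE estimate:", round(mu1 - mu0, 3))           # should be close to 1.0
```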
We illustrate our approach via simulation, and apply it to a rheumatoid arthritis study to evaluate the incremental effect of time to start methotrexate on joint pain."}, "https://arxiv.org/abs/2409.13140": {"title": "Non-parametric Replication of Instrumental Variable Estimates Across Studies", "link": "https://arxiv.org/abs/2409.13140", "description": "arXiv:2409.13140v1 Announce Type: new \nAbstract: Replicating causal estimates across different cohorts is crucial for increasing the integrity of epidemiological studies. However, strong assumptions regarding unmeasured confounding and effect modification often hinder this goal. By employing an instrumental variable (IV) approach and targeting the local average treatment effect (LATE), these assumptions can be relaxed to some degree; however, little work has addressed the replicability of IV estimates. In this paper, we propose a novel survey weighted LATE (SWLATE) estimator that incorporates unknown sampling weights and leverages machine learning for flexible modeling of nuisance functions, including the weights. Our approach, based on influence function theory and cross-fitting, provides a doubly-robust and efficient framework for valid inference, aligned with the growing \"double machine learning\" literature. We further extend our method to provide bounds on a target population ATE. The effectiveness of our approach, particularly in non-linear settings, is demonstrated through simulations and applied to a Mendelian randomization analysis of the relationship between triglycerides and cognitive decline."}, "https://arxiv.org/abs/2409.13190": {"title": "Nonparametric Causal Survival Analysis with Clustered Interference", "link": "https://arxiv.org/abs/2409.13190", "description": "arXiv:2409.13190v1 Announce Type: new \nAbstract: Inferring treatment effects on a survival time outcome based on data from an observational study is challenging due to the presence of censoring and possible confounding. An additional challenge occurs when a unit's treatment affects the outcome of other units, i.e., there is interference. In some settings, units may be grouped into clusters such that it is reasonable to assume interference only occurs within clusters, i.e., there is clustered interference. In this paper, methods are developed which can accommodate confounding, censored outcomes, and clustered interference. The approach avoids parametric assumptions and permits inference about counterfactual scenarios corresponding to any stochastic policy which modifies the propensity score distribution, and thus may have application across diverse settings. The proposed nonparametric sample splitting estimators allow for flexible data-adaptive estimation of nuisance functions and are consistent and asymptotically normal with parametric convergence rates. Simulation studies demonstrate the finite sample performance of the proposed estimators, and the methods are applied to a cholera vaccine study in Bangladesh."}, "https://arxiv.org/abs/2409.13267": {"title": "Spatial Sign based Principal Component Analysis for High Dimensional Data", "link": "https://arxiv.org/abs/2409.13267", "description": "arXiv:2409.13267v1 Announce Type: new \nAbstract: This article focuses on the robust principal component analysis (PCA) of high-dimensional data with elliptical distributions. We investigate the PCA of the sample spatial-sign covariance matrix in both nonsparse and sparse contexts, referring to them as SPCA and SSPCA, respectively. 
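A bare-bones version of PCA on the sample spatial-sign covariance matrix, the object underlying SPCA above, is sketched below; the coordinatewise-median centering is a simplifying assumption of this sketch rather than a detail taken from the paper.

```python
# PCA of the sample spatial-sign covariance matrix: project each centered
# observation onto the unit sphere before forming the covariance, which
# downweights heavy-tailed observations.
import numpy as np

def spatial_sign_pca(X, center=None):
    if center is None:
        center = np.median(X, axis=0)            # simple robust centering (sketch choice)
    Z = X - center
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    U = Z / np.where(norms == 0, 1.0, norms)     # spatial signs z / ||z||
    S = U.T @ U / X.shape[0]                     # spatial-sign covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(4)
X = rng.standard_t(df=2, size=(300, 20))         # heavy-tailed data
X[:, 0] += 3 * rng.standard_normal(300)          # extra spread along coordinate 0
vals, vecs = spatial_sign_pca(X)
print(np.argmax(np.abs(vecs[:, 0])))             # leading direction typically loads on coordinate 0
```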
We present both nonasymptotic and asymptotic analyses to quantify the theoretical performance of SPCA and SSPCA. In sparse settings, we demonstrate that SSPCA, implemented through a combinatoric program, achieves the optimal rate of convergence. Our proposed SSPCA method is computationally efficient and exhibits robustness against heavy-tailed distributions compared to existing methods. Simulation studies and real-world data applications further validate the superiority of our approach."}, "https://arxiv.org/abs/2409.13300": {"title": "A Two-stage Inference Procedure for Sample Local Average Treatment Effects in Randomized Experiments", "link": "https://arxiv.org/abs/2409.13300", "description": "arXiv:2409.13300v1 Announce Type: new \nAbstract: In a given randomized experiment, individuals are often volunteers and can differ in important ways from a population of interest. It is thus of interest to focus on the sample at hand. This paper focuses on inference about the sample local average treatment effect (LATE) in randomized experiments with non-compliance. We present a two-stage procedure that provides an asymptotically correct coverage rate for the sample LATE in randomized experiments. The procedure uses a first-stage test to decide whether the instrument is strong or weak, and uses different confidence sets depending on the first-stage result. Proofs for the procedure are developed for settings with and without regression adjustment and for two experimental designs (complete randomization and Mahalanobis distance based rerandomization). The finite sample performance of the methods is studied using extensive Monte Carlo simulations, and the methods are applied to data from a voter encouragement experiment."}, "https://arxiv.org/abs/2409.13336": {"title": "Two-level D- and A-optimal main-effects designs with run sizes one and two more than a multiple of four", "link": "https://arxiv.org/abs/2409.13336", "description": "arXiv:2409.13336v1 Announce Type: new \nAbstract: For run sizes that are a multiple of four, the literature offers many two-level designs that are D- and A-optimal for the main-effects model and minimize the aliasing between main effects and interaction effects and among interaction effects. For run sizes that are not a multiple of four, no conclusive results are known. In this paper, we propose two algorithms that generate all non-isomorphic D- and A-optimal main-effects designs for run sizes that are one and two more than a multiple of four. We enumerate all such designs for run sizes up to 18, report the numbers of designs we obtained, and identify those that minimize the aliasing between main effects and interaction effects and among interaction effects. Finally, we compare the minimally aliased designs we found with benchmark designs from the literature."}, "https://arxiv.org/abs/2409.13458": {"title": "Interpretable meta-analysis of model or marker performance", "link": "https://arxiv.org/abs/2409.13458", "description": "arXiv:2409.13458v1 Announce Type: new \nAbstract: Conventional meta-analysis of model performance conducted using data sources from different underlying populations often results in estimates that cannot be interpreted in the context of a well-defined target population. In this manuscript we develop methods for meta-analysis of several measures of model performance that are interpretable in the context of a well-defined target population when the populations underlying the data sources used in the meta-analysis are heterogeneous. 
This includes developing identifiability conditions as well as inverse-weighting, outcome-model, and doubly robust estimators. We illustrate the methods using simulations and data from two large lung cancer screening trials."}, "https://arxiv.org/abs/2409.13479": {"title": "Efficiency gain in association studies based on population surveys by augmenting outcome data from the target population", "link": "https://arxiv.org/abs/2409.13479", "description": "arXiv:2409.13479v1 Announce Type: new \nAbstract: Routinely collected nation-wide registers contain socio-economic and health-related information from a large number of individuals. However, important information on lifestyle, biological and other risk factors is available at most for small samples of the population through surveys. A majority of health surveys lack detailed medical information necessary for assessing the disease burden. Hence, traditionally data from the registers and the surveys are combined to obtain the necessary information for the survey sample. Our idea is to base analyses on a combined sample obtained by adding a (large) sample of individuals from the population to the survey sample. The main objective is to assess the bias and gain in efficiency of such combined analyses with a binary or time-to-event outcome. We employ (i) the complete-case analysis (CCA) using the respondents of the survey, (ii) analysis of the full survey sample with both unit- and item-nonresponse under the missing at random (MAR) assumption and (iii) analysis of the combined sample under a mixed type of missing data mechanism. We handle the missing data using multiple imputation (MI)-based analysis in (ii) and (iii). We utilize simulated as well as empirical data on ischemic heart disease obtained from the Finnish population. Our results suggested that the MI methods improved the efficiency of the estimates when we used the combined data for a binary outcome, but in the case of a time-to-event outcome the CCA was at least as good as the MI using the larger datasets, in terms of the mean absolute and squared errors. Increasing the participation in the surveys and having good statistical methods for handling missing covariate data when the outcome is time-to-event would be needed for implementation of the proposed ideas."}, "https://arxiv.org/abs/2409.13516": {"title": "Dynamic tail risk forecasting: what do realized skewness and kurtosis add?", "link": "https://arxiv.org/abs/2409.13516", "description": "arXiv:2409.13516v1 Announce Type: new \nAbstract: This paper compares the accuracy of tail risk forecasts with a focus on including realized skewness and kurtosis in \"additive\" and \"multiplicative\" models. Utilizing a panel of 960 US stocks, we conduct diagnostic tests, employ scoring functions, and implement rolling window forecasting to evaluate the performance of Value at Risk (VaR) and Expected Shortfall (ES) forecasts. Additionally, we examine the impact of the window length on forecast accuracy. We propose model specifications that incorporate realized skewness and kurtosis for enhanced precision. 
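For reference, realized skewness and kurtosis are commonly computed from intraday returns as standardized realized third and fourth moments; the sketch below follows that common convention and is not specific to the paper's model specifications.

```python
# Daily realized variance, skewness and kurtosis from N intraday returns r_1..r_N:
#   RV  = sum r_i^2
#   RSk = sqrt(N) * sum r_i^3 / RV^(3/2)
#   RKu = N * sum r_i^4 / RV^2
import numpy as np

def realized_moments(r):
    n = r.size
    rv = np.sum(r ** 2)
    rsk = np.sqrt(n) * np.sum(r ** 3) / rv ** 1.5
    rku = n * np.sum(r ** 4) / rv ** 2
    return rv, rsk, rku

rng = np.random.default_rng(5)
intraday = 0.001 * rng.standard_t(df=4, size=78)   # e.g. 78 five-minute returns in one day
print(realized_moments(intraday))
```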
Our findings provide insights into the importance of considering skewness and kurtosis in tail risk modeling, contributing to the existing literature and offering practical implications for risk practitioners and researchers."}, "https://arxiv.org/abs/2409.13531": {"title": "A simple but powerful tail index regression", "link": "https://arxiv.org/abs/2409.13531", "description": "arXiv:2409.13531v1 Announce Type: new \nAbstract: This paper introduces a flexible framework for the estimation of the conditional tail index of heavy tailed distributions. In this framework, the tail index is computed from an auxiliary linear regression model that facilitates estimation and inference based on established econometric methods, such as ordinary least squares (OLS), least absolute deviations, or M-estimation. We show theoretically and via simulations that OLS provides interesting results. Our Monte Carlo results highlight the adequate finite sample properties of the OLS tail index estimator computed from the proposed new framework and contrast its behavior to that of tail index estimates obtained by maximum likelihood estimation of exponential regression models, which is one of the approaches currently in use in the literature. An empirical analysis of the impact of determinants of the conditional left- and right-tail indexes of commodities' return distributions highlights the empirical relevance of our proposed approach. The novel framework's flexibility allows for extensions and generalizations in various directions, empowering researchers and practitioners to straightforwardly explore a wide range of research questions."}, "https://arxiv.org/abs/2409.13688": {"title": "Morphological Detection and Classification of Microplastics and Nanoplastics Emerged from Consumer Products by Deep Learning", "link": "https://arxiv.org/abs/2409.13688", "description": "arXiv:2409.13688v1 Announce Type: cross \nAbstract: Plastic pollution presents an escalating global issue, impacting health and environmental systems, with micro- and nanoplastics found across mediums from potable water to air. Traditional methods for studying these contaminants are labor-intensive and time-consuming, necessitating a shift towards more efficient technologies. In response, this paper introduces micro- and nanoplastics (MiNa), a novel and open-source dataset engineered for the automatic detection and classification of micro and nanoplastics using object detection algorithms. The dataset, comprising scanning electron microscopy images simulated under realistic aquatic conditions, categorizes plastics by polymer type across a broad size spectrum. We demonstrate the application of state-of-the-art detection algorithms on MiNa, assessing their effectiveness and identifying the unique challenges and potential of each method. The dataset not only fills a critical gap in available resources for microplastic research but also provides a robust foundation for future advancements in the field."}, "https://arxiv.org/abs/2109.00404": {"title": "Perturbation graphs, invariant prediction and causal relations in psychology", "link": "https://arxiv.org/abs/2109.00404", "description": "arXiv:2109.00404v2 Announce Type: replace \nAbstract: Networks (graphs) in psychology are often restricted to settings without interventions. Here we consider a framework borrowed from biology that involves multiple interventions from different contexts (observations and experiments) in a single analysis. The method is called perturbation graphs. 
In gene regulatory networks, the induced change in one gene is measured on all other genes in the analysis, thereby assessing possible causal relations. This is repeated for each gene in the analysis. A perturbation graph leads to the correct set of causes (not necessarily direct causes). Subsequent pruning of paths in the graph (called transitive reduction) should reveal direct causes. We show that transitive reduction will not in general lead to the correct underlying graph. We also show that invariant causal prediction is a generalisation of the perturbation graph method, where including additional variables does reveal direct causes, thereby replacing transitive reduction. We conclude that perturbation graphs provide a promising new tool for experimental designs in psychology, and combined with invariant causal prediction make it possible to reveal direct causes instead of causal paths. As an illustration we apply the perturbation graphs and invariant causal prediction to a data set about attitudes on meat consumption and to a time series of a patient diagnosed with major depressive disorder."}, "https://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "https://arxiv.org/abs/2302.11505", "description": "arXiv:2302.11505v4 Announce Type: replace \nAbstract: This paper studies settings where the analyst is interested in identifying and estimating the average causal effect of a binary treatment on an outcome. We consider a setup in which the outcome is not immediately realized after the treatment assignment, a feature that is ubiquitous in empirical settings. The period between the treatment and the realization of the outcome allows other observed actions to occur and affect the outcome. In this context, we study several regression-based estimands routinely used in empirical work to capture the average treatment effect and shed light on interpreting them in terms of ceteris paribus effects, indirect causal effects, and selection terms. We obtain three main and related takeaways. First, the three most popular estimands do not generally satisfy what we call \\emph{strong sign preservation}, in the sense that these estimands may be negative even when the treatment positively affects the outcome conditional on any possible combination of other actions. Second, the most popular regression that includes the other actions as controls satisfies strong sign preservation \\emph{if and only if} these actions are mutually exclusive binary variables. Finally, we show that a linear regression that fully stratifies the other actions leads to estimands that satisfy strong sign preservation."}, "https://arxiv.org/abs/2308.04374": {"title": "Are Information criteria good enough to choose the right number of regimes in Hidden Markov Models?", "link": "https://arxiv.org/abs/2308.04374", "description": "arXiv:2308.04374v2 Announce Type: replace \nAbstract: Selecting the number of regimes in Hidden Markov models is an important problem. There are many criteria that are used to select this number, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), integrated completed likelihood (ICL), deviance information criterion (DIC), and Watanabe-Akaike information criterion (WAIC), to name a few. 
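As a reminder of how such criteria are computed for a fitted K-regime HMM, the snippet below evaluates AIC and BIC from maximized log-likelihoods; the parameter count assumes a univariate Gaussian emission per regime and the log-likelihood values are purely illustrative.

```python
# AIC and BIC for candidate numbers of HMM regimes, given maximized log-likelihoods:
#   AIC = -2 logL + 2 k,   BIC = -2 logL + k log(n)
import numpy as np

def n_params(K):
    # Gaussian-emission HMM: (K-1) initial probs + K(K-1) transition probs + 2K emission params.
    return (K - 1) + K * (K - 1) + 2 * K

def aic_bic(loglik, K, n):
    k = n_params(K)
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

n = 500
# Hypothetical maximized log-likelihoods for K = 1..4 regimes (illustrative numbers only).
for K, ll in zip(range(1, 5), [-812.4, -760.1, -751.8, -749.9]):
    aic, bic = aic_bic(ll, K, n)
    print(K, round(aic, 1), round(bic, 1))   # choose the K minimizing the criterion
```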
In this article, we introduce goodness-of-fit tests for general Hidden Markov models with covariates, where the distribution of the observations is arbitrary, i.e., continuous, discrete, or a mixture of both. Then, a selection procedure is proposed based on this goodness-of-fit test. The main aim of this article is to compare the classical information criteria with the new criterion, when the outcome is either continuous, discrete or zero-inflated. Numerical experiments assess the finite sample performance of the goodness-of-fit tests, and comparisons between the different criteria are made."}, "https://arxiv.org/abs/2310.16989": {"title": "Randomization Inference When N Equals One", "link": "https://arxiv.org/abs/2310.16989", "description": "arXiv:2310.16989v2 Announce Type: replace \nAbstract: N-of-1 experiments, where a unit serves as its own control and treatment in different time windows, have been used in certain medical contexts for decades. However, due to effects that accumulate over long time windows and interventions that have complex evolution, a lack of robust inference tools has limited the widespread applicability of such N-of-1 designs. This work combines techniques from experiment design in causal inference and system identification from control theory to provide such an inference framework. We derive a model of the dynamic interference effect that arises in linear time-invariant dynamical systems. We show that a family of causal estimands analogous to those studied in potential outcomes is estimable via a standard estimator derived from the method of moments. We derive formulae for higher moments of this estimator and describe conditions under which N-of-1 designs may provide faster ways to estimate the effects of interventions in dynamical systems. We also provide conditions under which our estimator is asymptotically normal and derive valid confidence intervals for this setting."}, "https://arxiv.org/abs/2209.05894": {"title": "Nonparametric estimation of trawl processes: Theory and Applications", "link": "https://arxiv.org/abs/2209.05894", "description": "arXiv:2209.05894v2 Announce Type: replace-cross \nAbstract: Trawl processes belong to the class of continuous-time, strictly stationary, infinitely divisible processes; they are defined as L\\'{e}vy bases evaluated over deterministic trawl sets. This article presents the first nonparametric estimator of the trawl function characterising the trawl set and the serial correlation of the process. Moreover, it establishes a detailed asymptotic theory for the proposed estimator, including a law of large numbers and a central limit theorem for various asymptotic relations between an in-fill and a long-span asymptotic regime. In addition, it develops consistent estimators for both the asymptotic bias and variance, which are subsequently used for establishing feasible central limit theorems which can be applied to data. A simulation study shows the good finite sample performance of the proposed estimators. 
The new methodology is applied to forecasting high-frequency financial spread data from a limit order book and to estimating the busy-time distribution of a stochastic queue."}, "https://arxiv.org/abs/2309.05630": {"title": "Boundary Peeling: Outlier Detection Method Using One-Class Peeling", "link": "https://arxiv.org/abs/2309.05630", "description": "arXiv:2309.05630v2 Announce Type: replace-cross \nAbstract: Unsupervised outlier detection constitutes a crucial phase within data analysis and remains a dynamic realm of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce One-Class Boundary Peeling, an unsupervised outlier detection algorithm. One-class Boundary Peeling uses the average signed distance from iteratively-peeled, flexible boundaries generated by one-class support vector machines. One-class Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In synthetic data simulations One-Class Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers, as compared to benchmark methods. One-Class Boundary Peeling performs competitively in terms of correct classification, AUC, and processing time using common benchmark data sets."}, "https://arxiv.org/abs/2409.13819": {"title": "Supervised low-rank approximation of high-dimensional multivariate functional data via tensor decomposition", "link": "https://arxiv.org/abs/2409.13819", "description": "arXiv:2409.13819v1 Announce Type: new \nAbstract: Motivated by the challenges of analyzing high-dimensional ($p \\gg n$) sequencing data from longitudinal microbiome studies, where samples are collected at multiple time points from each subject, we propose supervised functional tensor singular value decomposition (SupFTSVD), a novel dimensionality reduction method that leverages auxiliary information in the dimensionality reduction of high-dimensional functional tensors. Although multivariate functional principal component analysis is a natural choice for dimensionality reduction of multivariate functional data, it becomes computationally burdensome in high-dimensional settings. Low-rank tensor decomposition is a feasible alternative and has gained popularity in recent literature, but existing methods in this realm are often incapable of simultaneously utilizing the temporal structure of the data and subject-level auxiliary information. SupFTSVD overcomes these limitations by generating low-rank representations of high-dimensional functional tensors while incorporating subject-level auxiliary information and accounting for the functional nature of the data. Moreover, SupFTSVD produces low-dimensional representations of subjects, features, and time, as well as subject-specific trajectories, providing valuable insights into the biological significance of variations within the data. In simulation studies, we demonstrate that our method achieves notable improvement in tensor approximation accuracy and loading estimation by utilizing auxiliary information. 
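(Returning to the One-Class Boundary Peeling abstract above: its outlier score averages signed distances to iteratively peeled one-class SVM boundaries. The loop below is only a rough sketch of that idea; the peeling rule of removing the current support vectors, and all hyperparameters, are assumptions, not details from the paper.)

```python
# Rough sketch of boundary peeling: repeatedly fit a one-class SVM, record every
# point's signed distance to the current boundary, then peel off the boundary
# (support-vector) points and refit on the remainder. Averaged distances serve
# as outlier scores; lower (more negative) averages suggest outliers.
import numpy as np
from sklearn.svm import OneClassSVM

def boundary_peeling_scores(X, n_peels=5):
    scores = np.zeros(X.shape[0])
    active = np.arange(X.shape[0])
    done = 0
    for _ in range(n_peels):
        if active.size < 10:
            break
        svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X[active])
        scores += svm.decision_function(X)       # signed distances for *all* points
        peeled = active[svm.support_]            # peel the current boundary points
        active = np.setdiff1d(active, peeled)
        done += 1
    return scores / max(done, 1)

rng = np.random.default_rng(6)
X = np.vstack([rng.standard_normal((200, 5)), rng.standard_normal((5, 5)) + 6.0])
s = boundary_peeling_scores(X)
print(np.argsort(s)[:5])   # most outlying indices; should include the shifted points 200..204
```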
Finally, we apply SupFTSVD to two longitudinal microbiome studies and reveal biologically meaningful patterns in the data."}, "https://arxiv.org/abs/2409.13858": {"title": "BRcal: An R Package to Boldness-Recalibrate Probability Predictions", "link": "https://arxiv.org/abs/2409.13858", "description": "arXiv:2409.13858v1 Announce Type: new \nAbstract: When probability predictions are too cautious for decision making, boldness-recalibration enables responsible emboldening while maintaining the probability of calibration required by the user. We introduce BRcal, an R package implementing boldness-recalibration and supporting methodology as recently proposed. The BRcal package provides direct control of the calibration-boldness tradeoff and visualizes how different calibration levels change individual predictions. We describe the implementation details in BRcal related to non-linear optimization of boldness with a non-linear inequality constraint on calibration. Package functionality is demonstrated via a real-world case study involving housing foreclosure predictions. The BRcal package is available on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/BRcal/index.html) and on Github (https://github.com/apguthrie/BRcal)."}, "https://arxiv.org/abs/2409.13873": {"title": "Jointly modeling time-to-event and longitudinal data with individual-specific change points: a case study in modeling tumor burden", "link": "https://arxiv.org/abs/2409.13873", "description": "arXiv:2409.13873v1 Announce Type: new \nAbstract: In oncology clinical trials, tumor burden (TB) stands as a crucial longitudinal biomarker, reflecting the toll a tumor takes on a patient's prognosis. With certain treatments, the disease's natural progression shows the tumor burden initially receding before rising once more. Biologically, the point of change may differ between individuals and must have occurred between the baseline measurement and progression time of the patient, implying a random effects model obeying a bound constraint. However, in practice, patients may drop out of the study due to progression or death, presenting a non-ignorable missing data problem. In this paper, we introduce a novel joint model that combines time-to-event data and longitudinal data, where the latter is parameterized by a random change point augmented by random pre-slope and post-slope dynamics. Importantly, the model is equipped to incorporate covariates in both the longitudinal and survival submodels, adding significant flexibility. Adopting a Bayesian approach, we propose an efficient Hamiltonian Monte Carlo algorithm for parameter inference. We demonstrate the superiority of our approach compared to a longitudinal-only model via simulations and apply our method to a data set in oncology."}, "https://arxiv.org/abs/2409.13946": {"title": "Chauhan Weighted Trajectory Analysis of combined efficacy and safety outcomes for risk-benefit analysis", "link": "https://arxiv.org/abs/2409.13946", "description": "arXiv:2409.13946v1 Announce Type: new \nAbstract: Analyzing and effectively communicating the efficacy and toxicity of treatment is the basis of risk benefit analysis (RBA). More efficient and objective tools are needed. We apply Chauhan Weighted Trajectory Analysis (CWTA) to perform RBA with superior objectivity, power, and clarity.\n We used CWTA to perform 1000-fold simulations of RCTs using ordinal endpoints for both treatment efficacy and toxicity. 
RCTs were simulated with 1:1 allocation at defined sample sizes and hazard ratios. We studied the simplest case of 3 levels each of toxicity and efficacy and the general case of the advanced cancer trial, with efficacy graded by five RECIST 1.1 health statuses and toxicity by the six-point CTCAE scale (6 x 5 matrix). The latter model was applied to a real-world dose escalation phase I trial in advanced cancer.\n Simulations in both the 3 x 3 and the 6 x 5 advanced cancer matrix confirmed that drugs with both superior efficacy and toxicity profiles synergize for greater statistical power with CWTA-RBA. The CWTA-RBA 6 x 5 matrix reduced sample size requirements over CWTA efficacy-only analysis. Application to the dose finding phase I clinical trial provided objective, statistically significant validation for the selected dose.\n CWTA-RBA, by incorporating both drug efficacy and toxicity, provides a single test statistic and plot that analyzes and effectively communicates therapeutic risks and benefits. CWTA-RBA requires fewer patients than CWTA efficacy-only analysis when the experimental drug is both more effective and less toxic. CWTA-RBA facilitates the objective and efficient assessment of new therapies throughout the drug development pathway. Furthermore, several advantages over competing tests in communicating risk-benefit will assist regulatory review, clinical adoption, and understanding of therapeutic risks and benefits by clinicians and patients alike."}, "https://arxiv.org/abs/2409.13963": {"title": "Functional Factor Modeling of Brain Connectivity", "link": "https://arxiv.org/abs/2409.13963", "description": "arXiv:2409.13963v1 Announce Type: new \nAbstract: Many fMRI analyses examine functional connectivity, or statistical dependencies among remote brain regions. Yet popular methods for studying whole-brain functional connectivity often yield results that are difficult to interpret. Factor analysis offers a natural framework in which to study such dependencies, particularly given its emphasis on interpretability. However, multivariate factor models break down when applied to functional and spatiotemporal data, like fMRI. We present a factor model for discretely-observed multidimensional functional data that is well-suited to the study of functional connectivity. Unlike classical factor models which decompose a multivariate observation into a \"common\" term that captures covariance between observed variables and an uncorrelated \"idiosyncratic\" term that captures variance unique to each observed variable, our model decomposes a functional observation into two uncorrelated components: a \"global\" term that captures long-range dependencies and a \"local\" term that captures short-range dependencies. We show that if the global covariance is smooth with finite rank and the local covariance is banded with potentially infinite rank, then this decomposition is identifiable. Under these conditions, recovery of the global covariance amounts to rank-constrained matrix completion, which we exploit to formulate consistent loading estimators. 
We study these estimators, and their more interpretable post-processed counterparts, through simulations, then use our approach to uncover a rich covariance structure in a collection of resting-state fMRI scans."}, "https://arxiv.org/abs/2409.13990": {"title": "Batch Predictive Inference", "link": "https://arxiv.org/abs/2409.13990", "description": "arXiv:2409.13990v1 Announce Type: new \nAbstract: Constructing prediction sets with coverage guarantees for unobserved outcomes is a core problem in modern statistics. Methods for predictive inference have been developed for a wide range of settings, but usually only consider test data points one at a time. Here we study the problem of distribution-free predictive inference for a batch of multiple test points, aiming to construct prediction sets for functions -- such as the mean or median -- of any number of unobserved test datapoints. This setting includes constructing simultaneous prediction sets with a high probability of coverage, and selecting datapoints satisfying a specified condition while controlling the number of false claims.\n For the general task of predictive inference on a function of a batch of test points, we introduce a methodology called batch predictive inference (batch PI), and provide a distribution-free coverage guarantee under exchangeability of the calibration and test data. Batch PI requires the quantiles of a rank ordering function defined on certain subsets of ranks. While computing these quantiles is NP-hard in general, we show that it can be done efficiently in many cases of interest, most notably for batch score functions with a compositional structure -- which includes examples of interest such as the mean -- via a dynamic programming algorithm that we develop. Batch PI has advantages over naive approaches (such as partitioning the calibration data or directly extending conformal prediction) in many settings, as it can deliver informative prediction sets even using small calibration sample sizes. We illustrate that our procedures provide informative inference across the use cases mentioned above, through experiments on both simulated data and a drug-target interaction dataset."}, "https://arxiv.org/abs/2409.14032": {"title": "Refitted cross-validation estimation for high-dimensional subsamples from low-dimension full data", "link": "https://arxiv.org/abs/2409.14032", "description": "arXiv:2409.14032v1 Announce Type: new \nAbstract: The technique of subsampling has been extensively employed to address the challenges posed by limited computing resources and to meet the need for expedited data analysis. Various subsampling methods have been developed to meet the challenges characterized by a large sample size with a small number of parameters. However, direct applications of these subsampling methods may not be suitable when the dimension is also high and the available computing facilities are only able to analyze a subsample of size similar to, or even smaller than, the dimension. In this case, although there is no high-dimensional problem in the full data, the subsample may have a sample size similar to, or even smaller than, the number of parameters, making it a high-dimensional problem. We call this scenario the high-dimensional subsample from low-dimension full data problem. In this paper, we tackle this problem by proposing a novel subsampling-based approach that combines penalty-based dimension reduction and refitted cross-validation. 
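The combination just described can be sketched in the spirit of classical refitted cross-validation: select variables by a penalized fit on one half of the subsample, refit by ordinary least squares on the other half using only the selected variables, and swap the roles. The choice of the lasso penalty and the focus on error-variance estimation below are illustrative assumptions, not details from the paper.

```python
# Refitted cross-validation sketch: lasso selection on one half, OLS refit on the
# other half, then average the two refitted error-variance estimates.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def rcv_sigma2(X, y, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    estimates = []
    for sel_idx, fit_idx in (halves, halves[::-1]):
        sel = np.flatnonzero(LassoCV(cv=5).fit(X[sel_idx], y[sel_idx]).coef_)
        if sel.size == 0:
            resid = y[fit_idx] - y[fit_idx].mean()
        else:
            ols = LinearRegression().fit(X[fit_idx][:, sel], y[fit_idx])
            resid = y[fit_idx] - ols.predict(X[fit_idx][:, sel])
        estimates.append(np.sum(resid ** 2) / max(fit_idx.size - sel.size - 1, 1))
    return np.mean(estimates)

rng = np.random.default_rng(7)
n, p = 200, 400                                   # subsample smaller than the dimension
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)   # true sigma^2 = 1
print(round(rcv_sigma2(X, y), 3))
```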
The asymptotic normality of the refitted cross-validation subsample estimator is established, which plays a crucial role in statistical inference. The proposed method demonstrates appealing performance in numerical experiments on simulated data and a real data application."}, "https://arxiv.org/abs/2409.14049": {"title": "Adaptive radar detection of subspace-based distributed target in power heterogeneous clutter", "link": "https://arxiv.org/abs/2409.14049", "description": "arXiv:2409.14049v1 Announce Type: new \nAbstract: This paper investigates the problem of adaptive detection of distributed targets in power heterogeneous clutter. In the considered scenario, all the data share the identical structure of clutter covariance matrix, but with varying and unknown power mismatches. To address this problem, we iteratively estimate all the unknowns, including the coordinate matrix of the target, the clutter covariance matrix, and the corresponding power mismatches, and propose three detectors based on the generalized likelihood ratio test (GLRT), Rao and the Wald tests. The results from simulated and real data both illustrate that the detectors based on GLRT and Rao test have higher probabilities of detection (PDs) than the existing competitors. Among them, the Rao test-based detector exhibits the best overall detection performance. We also analyze the impact of the target extended dimensions, the signal subspace dimensions, and the number of training samples on the detection performance. Furthermore, simulation experiments also demonstrate that the proposed detectors have a constant false alarm rate (CFAR) property for the structure of clutter covariance matrix."}, "https://arxiv.org/abs/2409.14167": {"title": "Skew-symmetric approximations of posterior distributions", "link": "https://arxiv.org/abs/2409.14167", "description": "arXiv:2409.14167v1 Announce Type: new \nAbstract: Routinely-implemented deterministic approximations of posterior distributions from, e.g., Laplace method, variational Bayes and expectation-propagation, generally rely on symmetric approximating densities, often taken to be Gaussian. This choice facilitates optimization and inference, but typically affects the quality of the overall approximation. In fact, even in basic parametric models, the posterior distribution often displays asymmetries that yield bias and reduced accuracy when considering symmetric approximations. Recent research has moved towards more flexible approximating densities that incorporate skewness. However, current solutions are model-specific, lack general supporting theory, increase the computational complexity of the optimization problem, and do not provide a broadly-applicable solution to include skewness in any symmetric approximation. This article addresses such a gap by introducing a general and provably-optimal strategy to perturb any off-the-shelf symmetric approximation of a generic posterior distribution. Crucially, this novel perturbation is derived without additional optimization steps, and yields a similarly-tractable approximation within the class of skew-symmetric densities that provably enhances the finite-sample accuracy of the original symmetric approximation, and, under suitable assumptions, improves its convergence rate to the exact posterior by at least a $\\sqrt{n}$ factor, in asymptotic regimes. 
These advancements are illustrated in numerical studies focusing on skewed perturbations of state-of-the-art Gaussian approximations."}, "https://arxiv.org/abs/2409.14202": {"title": "Mining Causality: AI-Assisted Search for Instrumental Variables", "link": "https://arxiv.org/abs/2409.14202", "description": "arXiv:2409.14202v1 Announce Type: new \nAbstract: The instrumental variables (IVs) method is a leading empirical strategy for causal inference. Finding IVs is a heuristic and creative process, and justifying its validity (especially exclusion restrictions) is largely rhetorical. We propose using large language models (LLMs) to search for new IVs through narratives and counterfactual reasoning, similar to how a human researcher would. The stark difference, however, is that LLMs can accelerate this process exponentially and explore an extremely large search space. We demonstrate how to construct prompts to search for potentially valid IVs. We argue that multi-step prompting is useful and role-playing prompts are suitable for mimicking the endogenous decisions of economic agents. We apply our method to three well-known examples in economics: returns to schooling, production functions, and peer effects. We then extend our strategy to finding (i) control variables in regression and difference-in-differences and (ii) running variables in regression discontinuity designs."}, "https://arxiv.org/abs/2409.14255": {"title": "On the asymptotic distributions of some test statistics for two-way contingency tables", "link": "https://arxiv.org/abs/2409.14255", "description": "arXiv:2409.14255v1 Announce Type: new \nAbstract: Pearson's Chi-square test is a widely used tool for analyzing categorical data, yet its statistical power has remained theoretically underexplored. Due to the difficulties in obtaining its power function in the usual manner, Cochran (1952) suggested the derivation of its Pitman limiting power, which is later implemented by Mitra (1958) and Meng & Chapman (1966). Nonetheless, this approach is suboptimal for practical power calculations under fixed alternatives. In this work, we solve this long-standing problem by establishing the asymptotic normality of the Chi-square statistic under fixed alternatives and deriving an explicit formula for its variance. For finite samples, we suggest a second-order expansion based on the multivariate delta method to improve the approximations. As a further contribution, we obtain the power functions of two distance covariance tests. We apply our findings to study the statistical power of these tests under different simulation settings."}, "https://arxiv.org/abs/2409.14256": {"title": "POI-SIMEX for Conditionally Poisson Distributed Biomarkers from Tissue Histology", "link": "https://arxiv.org/abs/2409.14256", "description": "arXiv:2409.14256v1 Announce Type: new \nAbstract: Covariate measurement error in regression analysis is an important issue that has been studied extensively under the classical additive and the Berkson error models. Here, we consider cases where covariates are derived from tumor tissue histology, and in particular tissue microarrays. In such settings, biomarkers are evaluated from tissue cores that are subsampled from a larger tissue section so that these biomarkers are only estimates of the true cell densities. The resulting measurement error is non-negligible but is seldom accounted for in the analysis of cancer studies involving tissue microarrays. 
To adjust for this type of measurement error, we assume that these discrete-valued biomarkers are conditionally Poisson distributed, based on a Poisson process model governing the spatial locations of marker-positive cells. Existing methods for addressing conditional Poisson surrogates, particularly in the absence of internal validation data, are limited. We extend the simulation extrapolation (SIMEX) algorithm to accommodate the conditional Poisson case (POI-SIMEX), where measurement errors are non-Gaussian with heteroscedastic variance. The proposed estimator is shown to be strongly consistent in a linear regression model under the assumption of a conditional Poisson distribution for the observed biomarker. Simulation studies evaluate the performance of POI-SIMEX, comparing it with the naive method and an alternative corrected likelihood approach in linear regression and survival analysis contexts. POI-SIMEX is then applied to a study of high-grade serous cancer, examining the association between survival and the presence of triple-positive biomarker (CD3+CD8+FOXP3+ cells)"}, "https://arxiv.org/abs/2409.14263": {"title": "Potential root mean square error skill score", "link": "https://arxiv.org/abs/2409.14263", "description": "arXiv:2409.14263v1 Announce Type: new \nAbstract: Consistency, in a narrow sense, denotes the alignment between the forecast-optimization strategy and the verification directive. The current recommended deterministic solar forecast verification practice is to report the skill score based on root mean square error (RMSE), which would violate the notion of consistency if the forecasts are optimized under another strategy such as minimizing the mean absolute error (MAE). This paper overcomes such difficulty by proposing a so-called \"potential RMSE skill score,\" which depends only on: (1) the crosscorrelation between forecasts and observations, and (2) the autocorrelation of observations. While greatly simplifying the calculation, the new skill score does not discriminate inconsistent forecasts as much, e.g., even MAE-optimized forecasts can attain a high RMSE skill score."}, "https://arxiv.org/abs/2409.14397": {"title": "High-Dimensional Tensor Classification with CP Low-Rank Discriminant Structure", "link": "https://arxiv.org/abs/2409.14397", "description": "arXiv:2409.14397v1 Announce Type: new \nAbstract: Tensor classification has become increasingly crucial in statistics and machine learning, with applications spanning neuroimaging, computer vision, and recommendation systems. However, the high dimensionality of tensors presents significant challenges in both theory and practice. To address these challenges, we introduce a novel data-driven classification framework based on linear discriminant analysis (LDA) that exploits the CP low-rank structure in the discriminant tensor. Our approach includes an advanced iterative projection algorithm for tensor LDA and incorporates a novel initialization scheme called Randomized Composite PCA (\\textsc{rc-PCA}). \\textsc{rc-PCA}, potentially of independent interest beyond tensor classification, relaxes the incoherence and eigen-ratio assumptions of existing algorithms and provides a warm start close to the global optimum. We establish global convergence guarantees for the tensor estimation algorithm using \\textsc{rc-PCA} and develop new perturbation analyses for noise with cross-correlation, extending beyond the traditional i.i.d. assumption. 
This theoretical advancement has potential applications across various fields dealing with correlated data and allows us to derive statistical upper bounds on tensor estimation errors. Additionally, we confirm the rate-optimality of our classifier by establishing minimax optimal misclassification rates across a wide class of parameter spaces. Extensive simulations and real-world applications validate our method's superior performance.\n Keywords: Tensor classification; Linear discriminant analysis; Tensor iterative projection; CP low-rank; High-dimensional data; Minimax optimality."}, "https://arxiv.org/abs/2409.14646": {"title": "Scalable Expectation Propagation for Mixed-Effects Regression", "link": "https://arxiv.org/abs/2409.14646", "description": "arXiv:2409.14646v1 Announce Type: new \nAbstract: Mixed-effects regression models represent a useful subclass of regression models for grouped data; the introduction of random effects allows for the correlation between observations within each group to be conveniently captured when inferring the fixed effects. At a time where such regression models are being fit to increasingly large datasets with many groups, it is ideal if (a) the time it takes to make the inferences scales linearly with the number of groups and (b) the inference workload can be distributed across multiple computational nodes in a numerically stable way, if the dataset cannot be stored in one location. Current Bayesian inference approaches for mixed-effects regression models do not seem to account for both challenges simultaneously. To address this, we develop an expectation propagation (EP) framework in this setting that is both scalable and numerically stable when distributed for the case where there is only one grouping factor. The main technical innovations lie in the sparse reparameterisation of the EP algorithm, and a moment propagation (MP) based refinement for multivariate random effect factor approximations. Experiments are conducted to show that this EP framework achieves linear scaling, while having comparable accuracy to other scalable approximate Bayesian inference (ABI) approaches."}, "https://arxiv.org/abs/2409.14684": {"title": "Consistent Order Determination of Markov Decision Process", "link": "https://arxiv.org/abs/2409.14684", "description": "arXiv:2409.14684v1 Announce Type: new \nAbstract: The Markov assumption in Markov Decision Processes (MDPs) is fundamental in reinforcement learning, influencing both theoretical research and practical applications. Existing methods that rely on the Bellman equation benefit tremendously from this assumption for policy evaluation and inference. Testing the Markov assumption or selecting the appropriate order is important for further analysis. Existing tests primarily utilize sequential hypothesis testing methodology, increasing the tested order if the previously-tested one is rejected. However, this methodology accumulates type-I and type-II errors across the sequential testing procedure, which causes inconsistent order estimation even with large sample sizes. To tackle this challenge, we develop a procedure that consistently distinguishes the true order from others. We first propose a novel estimator that equivalently represents the Markov assumption of any order. Based on this estimator, we construct a signal function and an associated signal statistic to achieve estimation consistency. Additionally, the curve pattern of the signal statistic facilitates easy visualization, assisting the order determination process in practice. 
Numerical studies validate the efficacy of our approach."}, "https://arxiv.org/abs/2409.14706": {"title": "Analysis of Stepped-Wedge Cluster Randomized Trials when treatment effect varies by exposure time or calendar time", "link": "https://arxiv.org/abs/2409.14706", "description": "arXiv:2409.14706v1 Announce Type: new \nAbstract: Stepped-wedge cluster randomized trials (SW-CRTs) are traditionally analyzed with models that assume an immediate and sustained treatment effect. Previous work has shown that making such an assumption in the analysis of SW-CRTs when the true underlying treatment effect varies by exposure time can produce severely misleading estimates. Alternatively, the true underlying treatment effect might vary by calendar time. Comparatively less work has examined treatment effect structure misspecification in this setting. Here, we evaluate the behavior of the mixed effects model-based immediate treatment effect, exposure time-averaged treatment effect, and calendar time-averaged treatment effect estimators in different scenarios where they are misspecified for the true underlying treatment effect structure. We prove that the immediate treatment effect estimator can be relatively robust to bias when estimating a true underlying calendar time-averaged treatment effect estimand. However, when there is a true underlying calendar (exposure) time-varying treatment effect, misspecifying an analysis with an exposure (calendar) time-averaged treatment effect estimator can yield severely misleading estimates and even converge to a value of the opposite sign of the true calendar (exposure) time-averaged treatment effect estimand. Researchers should carefully consider how the treatment effect may vary as a function of exposure time and/or calendar time in the analysis of SW-CRTs."}, "https://arxiv.org/abs/2409.14770": {"title": "Clinical research and methodology What usage and what hierarchical order for secondary endpoints?", "link": "https://arxiv.org/abs/2409.14770", "description": "arXiv:2409.14770v1 Announce Type: new \nAbstract: In a randomised clinical trial, when the result of the primary endpoint shows a significant benefit, the secondary endpoints are scrutinised to identify additional effects of the treatment. However, this approach entails a risk of concluding that there is a benefit for one of these endpoints when such benefit does not exist (inflation of type I error risk). There are mainly two methods used to control the risk of drawing erroneous conclusions for secondary endpoints. The first method consists of distributing the risk over several co-primary endpoints, so as to maintain an overall risk of 5%. The second is the hierarchical test procedure, which consists of first establishing a hierarchy of the endpoints, then evaluating each endpoint in succession according to this hierarchy while the endpoints continue to show statistical significance. This simple method makes it possible to show the additional advantages of treatments and to identify the factors that differentiate them."}, "https://arxiv.org/abs/2409.14776": {"title": "Inequality Sensitive Optimal Treatment Assignment", "link": "https://arxiv.org/abs/2409.14776", "description": "arXiv:2409.14776v1 Announce Type: new \nAbstract: The egalitarian equivalent, $ee$, of a societal distribution of outcomes with mean $m$ is the outcome level such that the evaluator is indifferent between the distribution of outcomes and a society in which everyone obtains an outcome of $ee$. 
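Before the egalitarian-equivalent abstract above continues, the sketch below makes the definition of $ee$ concrete for one common choice of evaluator, an Atkinson/CRRA social welfare function, for which $ee = (E[Y^{1-\rho}])^{1/(1-\rho)}$. The welfare function and the lognormal outcome distribution are assumptions for illustration only and are not taken from the paper.

```python
# Minimal sketch: egalitarian equivalent under an assumed Atkinson/CRRA
# evaluator, ee = (E[Y^(1-rho)])^(1/(1-rho)); illustrative, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.8, size=200_000)    # simulated outcomes

def egalitarian_equivalent(y, rho):
    if np.isclose(rho, 1.0):
        return np.exp(np.mean(np.log(y)))                # limiting case rho -> 1
    return np.mean(y ** (1.0 - rho)) ** (1.0 / (1.0 - rho))

m = y.mean()
for rho in [0.0, 0.5, 1.0, 2.0]:                         # rho = 0: no aversion, ee = m
    print(f"rho={rho:3.1f}  ee={egalitarian_equivalent(y, rho):.3f}  mean={m:.3f}")
```

With any inequality aversion (rho > 0) the printed $ee$ falls below the mean, matching the abstract's statement that $ee < m$ for an inequality averse evaluator.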
For an inequality averse evaluator, $ee < m$. In this paper, I extend the optimal treatment choice framework in Manski (2024) to the case where the welfare evaluation is made using egalitarian equivalent measures, and derive optimal treatment rules for the Bayesian, maximin and minimax regret inequality averse evaluators. I illustrate how the methodology operates in the context of the JobCorps education and training program for disadvantaged youth (Schochet, Burghardt, and McConnell 2008) and in Meager (2022)'s Bayesian meta analysis of the microcredit literature."}, "https://arxiv.org/abs/2409.14806": {"title": "Rescaled Bayes factors: a class of e-variables", "link": "https://arxiv.org/abs/2409.14806", "description": "arXiv:2409.14806v1 Announce Type: new \nAbstract: A class of e-variables is introduced and analyzed. Some examples are presented."}, "https://arxiv.org/abs/2409.14926": {"title": "Early and Late Buzzards: Comparing Different Approaches for Quantile-based Multiple Testing in Heavy-Tailed Wildlife Research Data", "link": "https://arxiv.org/abs/2409.14926", "description": "arXiv:2409.14926v1 Announce Type: new \nAbstract: In medical, ecological and psychological research, there is a need for methods to handle multiple testing, for example to consider group comparisons with more than two groups. Typical approaches that deal with multiple testing are mean or variance based which can be less effective in the context of heavy-tailed and skewed data. Here, the median is the preferred measure of location and the interquartile range (IQR) is an adequate alternative to the variance. Therefore, it may be fruitful to formulate research questions of interest in terms of the median or the IQR. For this reason, we compare different inference approaches for two-sided and non-inferiority hypotheses formulated in terms of medians or IQRs in an extensive simulation study. We consider multiple contrast testing procedures combined with a bootstrap method as well as testing procedures with Bonferroni correction. As an example of a multiple testing problem based on heavy-tailed data we analyse an ecological trait variation in early and late breeding in a medium-sized bird of prey."}, "https://arxiv.org/abs/2409.14937": {"title": "Risk Estimate under a Nonstationary Autoregressive Model for Data-Driven Reproduction Number Estimation", "link": "https://arxiv.org/abs/2409.14937", "description": "arXiv:2409.14937v1 Announce Type: new \nAbstract: COVID-19 pandemic has brought to the fore epidemiological models which, though describing a rich variety of behaviors, have previously received little attention in the signal processing literature. During the pandemic, several works successfully leveraged state-of-the-art signal processing strategies to robustly infer epidemiological indicators despite the low quality of COVID-19 data. In the present work, a novel nonstationary autoregressive model is introduced, encompassing, but not reducing to, one of the most popular models for the propagation of viral epidemics. Using a variational framework, penalized likelihood estimators of the parameters of this new model are designed. In practice, the main bottleneck is that the estimation accuracy strongly depends on hyperparameters tuning. Without available ground truth, hyperparameters are selected by minimizing specifically designed data-driven oracles, used as proxy for the estimation error. 
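As background for the reproduction-number abstract above (which continues below), the sketch simulates counts from a Poisson renewal/autoregressive epidemic model with a time-varying reproduction number, a popular baseline of the kind the abstract alludes to. The serial-interval weights and the R_t path are assumed values for illustration; this is not the paper's generalized nonstationary model or its risk estimators.

```python
# Minimal sketch of a Poisson renewal/autoregressive epidemic model,
# I_t ~ Poisson(R_t * sum_s phi_s * I_{t-s}); parameter values are assumed.
import numpy as np

rng = np.random.default_rng(2)

T = 120
phi = np.array([0.1, 0.3, 0.3, 0.2, 0.1])      # assumed serial-interval weights (sum to 1)
R = np.where(np.arange(T) < 60, 1.6, 0.8)      # assumed R_t: growth phase, then decline

I = np.zeros(T)
I[:len(phi)] = 10                              # seed infections
for t in range(len(phi), T):
    pressure = np.dot(phi, I[t - len(phi):t][::-1])   # sum_s phi_s * I_{t-s}
    I[t] = rng.poisson(R[t] * pressure)

print("peak daily count:", int(I.max()), "at day", int(I.argmax()))
```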
Focusing on the nonstationary autoregressive Poisson model, the Stein's Unbiased Risk Estimate formalism is generalized to construct asymptotically unbiased risk estimators based on the derivation of an original autoregressive counterpart of Stein's lemma. The accuracy of these oracles and of the resulting estimates are assessed through intensive Monte Carlo simulations on synthetic data. Then, elaborating on recent epidemiological models, a novel weekly scaled Poisson model is proposed, enabling to better account for intrinsic variability of the contamination while being robust to reporting errors. Finally, the overall data-driven procedure is particularized to the estimation of COVID-19 reproduction number and exemplified on real COVID-19 infection counts in different countries and at different stages of the pandemic, demonstrating its ability to yield consistent estimates."}, "https://arxiv.org/abs/2409.15070": {"title": "Non-linear dependence and Granger causality: A vine copula approach", "link": "https://arxiv.org/abs/2409.15070", "description": "arXiv:2409.15070v1 Announce Type: new \nAbstract: Inspired by Jang et al. (2022), we propose a Granger causality-in-the-mean test for bivariate $k-$Markov stationary processes based on a recently introduced class of non-linear models, i.e., vine copula models. By means of a simulation study, we show that the proposed test improves on the statistical properties of the original test in Jang et al. (2022), constituting an excellent tool for testing Granger causality in the presence of non-linear dependence structures. Finally, we apply our test to study the pairwise relationships between energy consumption, GDP and investment in the U.S. and, notably, we find that Granger-causality runs two ways between GDP and energy consumption."}, "https://arxiv.org/abs/2409.15145": {"title": "Adaptive weight selection for time-to-event data under non-proportional hazards", "link": "https://arxiv.org/abs/2409.15145", "description": "arXiv:2409.15145v1 Announce Type: new \nAbstract: When planning a clinical trial for a time-to-event endpoint, we require an estimated effect size and need to consider the type of effect. Usually, an effect of proportional hazards is assumed with the hazard ratio as the corresponding effect measure. Thus, the standard procedure for survival data is generally based on a single-stage log-rank test. Knowing that the assumption of proportional hazards is often violated and sufficient knowledge to derive reasonable effect sizes is usually unavailable, such an approach is relatively rigid. We introduce a more flexible procedure by combining two methods designed to be more robust in case we have little to no prior knowledge. First, we employ a more flexible adaptive multi-stage design instead of a single-stage design. Second, we apply combination-type tests in the first stage of our suggested procedure to benefit from their robustness under uncertainty about the deviation pattern. We can then use the data collected during this period to choose a more specific single-weighted log-rank test for the subsequent stages. In this step, we employ Royston-Parmar spline models to extrapolate the survival curves to make a reasonable decision. Based on a real-world data example, we show that our approach can save a trial that would otherwise end with an inconclusive result. 
Additionally, our simulation studies demonstrate a sufficient power performance while maintaining more flexibility."}, "https://arxiv.org/abs/2409.14079": {"title": "Grid Point Approximation for Distributed Nonparametric Smoothing and Prediction", "link": "https://arxiv.org/abs/2409.14079", "description": "arXiv:2409.14079v1 Announce Type: cross \nAbstract: Kernel smoothing is a widely used nonparametric method in modern statistical analysis. The problem of efficiently conducting kernel smoothing for a massive dataset on a distributed system is a problem of great importance. In this work, we find that the popularly used one-shot type estimator is highly inefficient for prediction purposes. To this end, we propose a novel grid point approximation (GPA) method, which has the following advantages. First, the resulting GPA estimator is as statistically efficient as the global estimator under mild conditions. Second, it requires no communication and is extremely efficient in terms of computation for prediction. Third, it is applicable to the case where the data are not randomly distributed across different machines. To select a suitable bandwidth, two novel bandwidth selectors are further developed and theoretically supported. Extensive numerical studies are conducted to corroborate our theoretical findings. Two real data examples are also provided to demonstrate the usefulness of our GPA method."}, "https://arxiv.org/abs/2409.14284": {"title": "Survey Data Integration for Distribution Function Estimation", "link": "https://arxiv.org/abs/2409.14284", "description": "arXiv:2409.14284v1 Announce Type: cross \nAbstract: Integration of probabilistic and non-probabilistic samples for the estimation of finite population totals (or means) has recently received considerable attention in the field of survey sampling; yet, to the best of our knowledge, this framework has not been extended to cumulative distribution function (CDF) estimation. To address this gap, we propose a novel CDF estimator that integrates data from probability samples with data from (potentially large) nonprobability samples. Assuming that a set of shared covariates are observed in both samples, while the response variable is observed only in the latter, the proposed estimator uses a survey-weighted empirical CDF of regression residuals trained on the convenience sample to estimate the CDF of the response variable. Under some regularity conditions, we show that our CDF estimator is both design-consistent for the finite population CDF and asymptotically normally distributed. Additionally, we define and study a quantile estimator based on the proposed CDF estimator. Furthermore, we use both the bootstrap and asymptotic formulae to estimate their respective sampling variances. Our empirical results show that the proposed CDF estimator is robust to model misspecification under ignorability, and robust to ignorability under model misspecification. When both assumptions are violated, our residual-based CDF estimator still outperforms its 'plug-in' mass imputation and naive siblings, albeit with noted decreases in efficiency."}, "https://arxiv.org/abs/2409.14326": {"title": "Optimal sequencing depth for single-cell RNA-sequencing in Wasserstein space", "link": "https://arxiv.org/abs/2409.14326", "description": "arXiv:2409.14326v1 Announce Type: cross \nAbstract: How many samples should one collect for an empirical distribution to be as close as possible to the true population? 
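As a quick numerical illustration of the question just posed in the single-cell abstract above (before it continues), the sketch below tracks the 1-Wasserstein distance between an empirical sample and a large reference sample standing in for the true population, showing how it shrinks as the sample grows. The Gaussian population and the reference-sample proxy are illustrative assumptions, unrelated to the paper's bounds or its sequencing-depth trade-off.

```python
# Minimal sketch: W1 distance between an empirical distribution and a large
# reference sample (a stand-in for the true population) shrinks with n.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
reference = rng.normal(size=1_000_000)          # proxy for the true population

for n in [50, 500, 5_000, 50_000]:
    sample = rng.normal(size=n)
    print(f"n={n:6d}  W1 ~ {wasserstein_distance(sample, reference):.4f}")
```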
This question is not trivial in the context of single-cell RNA-sequencing. With limited sequencing depth, profiling more cells comes at the cost of fewer reads per cell. Therefore, one must strike a balance between the number of cells sampled and the accuracy of each measured gene expression profile. In this paper, we analyze an empirical distribution of cells and obtain upper and lower bounds on the Wasserstein distance to the true population. Our analysis holds for general, non-parametric distributions of cells, and is validated by simulation experiments on a real single-cell dataset."}, "https://arxiv.org/abs/2409.14593": {"title": "Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies", "link": "https://arxiv.org/abs/2409.14593", "description": "arXiv:2409.14593v1 Announce Type: cross \nAbstract: Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm."}, "https://arxiv.org/abs/2409.14734": {"title": "The continuous-time limit of quasi score-driven volatility models", "link": "https://arxiv.org/abs/2409.14734", "description": "arXiv:2409.14734v1 Announce Type: cross \nAbstract: This paper explores the continuous-time limit of a class of Quasi Score-Driven (QSD) models that characterize volatility. As the sampling frequency increases and the time interval tends to zero, the model weakly converges to a continuous-time stochastic volatility model where the two Brownian motions are correlated, thereby capturing the leverage effect in the market. Subsequently, we identify that a necessary condition for non-degenerate correlation is that the distribution of the driving innovations differs from the distribution used to compute the score, with at least one of the two being asymmetric. We then illustrate this with two typical examples. As an application, the QSD model is used as an approximation for correlated stochastic volatility diffusions and quasi maximum likelihood estimation is performed. 
Simulation results confirm the method's effectiveness, particularly in estimating the correlation coefficient."}, "https://arxiv.org/abs/2111.06985": {"title": "Nonparametric Bayesian Knockoff Generators for Feature Selection Under Complex Data Structure", "link": "https://arxiv.org/abs/2111.06985", "description": "arXiv:2111.06985v2 Announce Type: replace \nAbstract: The recent proliferation of high-dimensional data, such as electronic health records and genetics data, offers new opportunities to find novel predictors of outcomes. Presented with a large set of candidate features, interest often lies in selecting the ones most likely to be predictive of an outcome for further study. Controlling the false discovery rate (FDR) at a specified level is often desired in evaluating these variables. Knockoff filtering is an innovative strategy for conducting FDR-controlled feature selection. This paper proposes a nonparametric Bayesian model for generating high-quality knockoff copies that can improve the accuracy of predictive feature identification for variables arising from complex distributions, which can be skewed, highly dispersed and/or a mixture of distributions. This paper provides a detailed description of how to generate knockoff copies from a GDPM model via MCMC posterior sampling. Additionally, we provide a theoretical guarantee on the robustness of the knockoff procedure. Through simulations, the method is shown to identify important features with accurate FDR control and improved power over the popular second-order Gaussian knockoff generator. Furthermore, the model is compared with a finite Gaussian mixture knockoff generator in terms of FDR and power. The proposed technique is applied for detecting genes predictive of survival in ovarian cancer patients using data from The Cancer Genome Atlas (TCGA)."}, "https://arxiv.org/abs/2201.05430": {"title": "Detecting Multiple Structural Breaks in Systems of Linear Regression Equations with Integrated and Stationary Regressors", "link": "https://arxiv.org/abs/2201.05430", "description": "arXiv:2201.05430v4 Announce Type: replace \nAbstract: In this paper, we propose a two-step procedure based on the group LASSO estimator in combination with a backward elimination algorithm to detect multiple structural breaks in linear regressions with multivariate responses. Applying the two-step estimator, we jointly detect the number and location of structural breaks, and provide consistent estimates of the coefficients. Our framework is flexible enough to allow for a mix of integrated and stationary regressors, as well as deterministic terms. Using simulation experiments, we show that the proposed two-step estimator performs competitively against the likelihood-based approach (Qu and Perron, 2007; Li and Perron, 2017; Oka and Perron, 2018) in finite samples. However, the two-step estimator is computationally much more efficient. An economic application to the identification of structural breaks in the term structure of interest rates illustrates this methodology."}, "https://arxiv.org/abs/2203.09330": {"title": "Fighting Noise with Noise: Causal Inference with Many Candidate Instruments", "link": "https://arxiv.org/abs/2203.09330", "description": "arXiv:2203.09330v3 Announce Type: replace \nAbstract: Instrumental variable methods provide useful tools for inferring causal effects in the presence of unmeasured confounding. To apply these methods with large-scale data sets, a major challenge is to find valid instruments from a possibly large candidate set. 
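Before the many-instruments abstract above continues, the sketch below sets up a plain two-stage least squares (2SLS) baseline once a set of instruments has been chosen, to fix ideas about why valid instruments matter. The simulated design is an assumption for illustration, and the paper's pseudo-variable screening step is not reproduced here.

```python
# Minimal sketch of two-stage least squares with one endogenous exposure and
# several instruments; the design is assumed, not the paper's screening method.
import numpy as np

rng = np.random.default_rng(4)
n, k = 5_000, 10
Z = rng.standard_normal((n, k))                 # candidate instruments
u = rng.standard_normal(n)                      # unmeasured confounder
x = Z @ rng.uniform(0.2, 0.5, size=k) + u + rng.standard_normal(n)
y = 1.5 * x + 2.0 * u + rng.standard_normal(n)  # true causal effect: 1.5

# First stage: project the exposure onto the instruments.
gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
x_hat = Z @ gamma

# Second stage: regress the outcome on the fitted exposure (with an intercept).
X2 = np.column_stack([np.ones(n), x_hat])
beta, *_ = np.linalg.lstsq(X2, y, rcond=None)
print("naive OLS slope:", np.polyfit(x, y, 1)[0].round(3))   # biased by u
print("2SLS slope     :", beta[1].round(3))                  # close to 1.5
```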
In practice, most of the candidate instruments are often not relevant for studying a particular exposure of interest. Moreover, not all relevant candidate instruments are valid as they may directly influence the outcome of interest. In this article, we propose a data-driven method for causal inference with many candidate instruments that addresses these two challenges simultaneously. A key component of our proposal involves using pseudo variables, known to be irrelevant, to remove variables from the original set that exhibit spurious correlations with the exposure. Synthetic data analyses show that the proposed method performs favourably compared to existing methods. We apply our method to a Mendelian randomization study estimating the effect of obesity on health-related quality of life."}, "https://arxiv.org/abs/2401.11827": {"title": "Flexible Models for Simple Longitudinal Data", "link": "https://arxiv.org/abs/2401.11827", "description": "arXiv:2401.11827v2 Announce Type: replace \nAbstract: We propose a new method for modelling simple longitudinal data. We aim to do this in a flexible manner (without restrictive assumptions about the shapes of individual trajectories), while exploiting structural similarities between the trajectories. Hierarchical models (such as linear mixed models, generalised additive mixed models and hierarchical generalised additive models) are commonly used to model longitudinal data, but fail to meet one or other of these requirements: either they make restrictive assumptions about the shape of individual trajectories, or fail to exploit structural similarities between trajectories. Functional principal components analysis promises to fulfil both requirements, and methods for functional principal components analysis have been developed for longitudinal data. However, we find that existing methods sometimes give poor-quality estimates of individual trajectories, particularly when the number of observations on each individual is small. We develop a new approach, which we call hierarchical modelling with functional principal components. Inference is conducted based on the full likelihood of all unknown quantities, with a penalty term to control the balance between fit to the data and smoothness of the trajectories. We run simulation studies to demonstrate that the new method substantially improves the quality of inference relative to existing methods across a range of examples, and apply the method to data on changes in body composition in adolescent girls."}, "https://arxiv.org/abs/2409.15530": {"title": "Identifying Elasticities in Autocorrelated Time Series Using Causal Graphs", "link": "https://arxiv.org/abs/2409.15530", "description": "arXiv:2409.15530v1 Announce Type: new \nAbstract: The price elasticity of demand can be estimated from observational data using instrumental variables (IV). However, naive IV estimators may be inconsistent in settings with autocorrelated time series. We argue that causal time graphs can simplify IV identification and help select consistent estimators. To do so, we propose to first model the equilibrium condition by an unobserved confounder, deriving a directed acyclic graph (DAG) while maintaining the assumption of a simultaneous determination of prices and quantities. We then exploit recent advances in graphical inference to derive valid IV estimators, including estimators that achieve consistency by simultaneously estimating nuisance effects. 
We further argue that observing significant differences between the estimates of presumably valid estimators can help to reject false model assumptions, thereby improving our understanding of underlying economic dynamics. We apply this approach to the German electricity market, estimating the price elasticity of demand on simulated and real-world data. The findings underscore the importance of accounting for structural autocorrelation in IV-based analysis."}, "https://arxiv.org/abs/2409.15597": {"title": "Higher-criticism for sparse multi-sensor change-point detection", "link": "https://arxiv.org/abs/2409.15597", "description": "arXiv:2409.15597v1 Announce Type: new \nAbstract: We present a procedure based on higher criticism (Donoho \\& Jin 2004) to address the sparse multi-sensor quickest change-point detection problem. Namely, we aim to detect a change in the distribution of the multi-sensor data that might affect a few sensors out of potentially many, while those affected sensors, if they exist, are unknown to us in advance. Our procedure involves testing for a change point in individual sensors and combining multiple tests using higher criticism. As a by-product, our procedure also indicates a set of sensors suspected to be affected by the change. We demonstrate the effectiveness of our method compared to other procedures using extensive numerical evaluations. We analyze our procedure under a theoretical framework involving normal data sensors that might experience a change in both mean and variance. We consider individual tests based on the likelihood ratio or the generalized likelihood ratio statistics and show that our procedure attains the information-theoretic limits of detection. These limits coincide with existing literature when the change is only in the mean."}, "https://arxiv.org/abs/2409.15663": {"title": "BARD: A seamless two-stage dose optimization design integrating backfill and adaptive randomization", "link": "https://arxiv.org/abs/2409.15663", "description": "arXiv:2409.15663v1 Announce Type: new \nAbstract: One common approach for dose optimization is a two-stage design, which initially conducts dose escalation to identify the maximum tolerated dose (MTD), followed by a randomization stage where patients are assigned to two or more doses to further assess and compare their risk-benefit profiles to identify the optimal dose. A limitation of this approach is its requirement for a relatively large sample size. To address this challenge, we propose a seamless two-stage design, BARD (Backfill and Adaptive Randomization for Dose Optimization), which incorporates two key features to reduce sample size and shorten trial duration. The first feature is the integration of backfilling into the stage 1 dose escalation, enhancing patient enrollment and data generation without prolonging the trial. The second feature involves seamlessly combining patients treated in stage 1 with those in stage 2, enabled by covariate-adaptive randomization, to inform the optimal dose and thereby reduce the sample size. Our simulation study demonstrates that BARD reduces the sample size, improves the accuracy of identifying the optimal dose, and maintains covariate balance in randomization, allowing for unbiased comparisons between doses. 
BARD designs offer an efficient solution to meet the dose optimization requirements set by Project Optimus, with software freely available at www.trialdesign.org."}, "https://arxiv.org/abs/2409.15676": {"title": "TUNE: Algorithm-Agnostic Inference after Changepoint Detection", "link": "https://arxiv.org/abs/2409.15676", "description": "arXiv:2409.15676v1 Announce Type: new \nAbstract: In multiple changepoint analysis, assessing the uncertainty of detected changepoints is crucial for enhancing detection reliability -- a topic that has garnered significant attention. Despite advancements through selective p-values, current methodologies often rely on stringent assumptions tied to specific changepoint models and detection algorithms, potentially compromising the accuracy of post-detection statistical inference. We introduce TUNE (Thresholding Universally and Nullifying change Effect), a novel algorithm-agnostic approach that uniformly controls error probabilities across detected changepoints. TUNE sets a universal threshold for multiple test statistics, applicable across a wide range of algorithms, and directly controls the family-wise error rate without the need for selective p-values. Through extensive theoretical and numerical analyses, TUNE demonstrates versatility, robustness, and competitive power, offering a viable and reliable alternative for model-agnostic post-detection inference."}, "https://arxiv.org/abs/2409.15756": {"title": "A penalized online sequential test of heterogeneous treatment effects for generalized linear models", "link": "https://arxiv.org/abs/2409.15756", "description": "arXiv:2409.15756v1 Announce Type: new \nAbstract: Identification of heterogeneous treatment effects (HTEs) has been increasingly popular and critical in various penalized strategy decisions using the A/B testing approach, especially in the scenario of a consecutive online collection of samples. However, in high-dimensional settings, such an identification remains challenging in the sense of lack of detection power of HTEs with insufficient sample instances for each batch sequentially collected online. In this article, a novel high-dimensional test is proposed, named as the penalized online sequential test (POST), to identify HTEs and select useful covariates simultaneously under continuous monitoring in generalized linear models (GLMs), which achieves high detection power and controls the Type I error. A penalized score test statistic is developed along with an extended p-value process for the online collection of samples, and the proposed POST method is further extended to multiple online testing scenarios, where both high true positive rates and under-controlled false discovery rates are achieved simultaneously. Asymptotic results are established and justified to guarantee properties of the POST, and its performance is evaluated through simulations and analysis of real data, compared with the state-of-the-art online test methods. 
Our findings indicate that the POST method exhibits selection consistency and superb detection power of HTEs as well as excellent control over the Type I error, which endows our method with the capability for timely and efficient inference for online A/B testing in high-dimensional GLMs framework."}, "https://arxiv.org/abs/2409.15995": {"title": "Robust Inference for Non-Linear Regression Models with Applications in Enzyme Kinetics", "link": "https://arxiv.org/abs/2409.15995", "description": "arXiv:2409.15995v1 Announce Type: new \nAbstract: Despite linear regression being the most popular statistical modelling technique, in real-life we often need to deal with situations where the true relationship between the response and the covariates is nonlinear in parameters. In such cases one needs to adopt appropriate non-linear regression (NLR) analysis, having wider applications in biochemical and medical studies among many others. In this paper we propose a new improved robust estimation and testing methodologies for general NLR models based on the minimum density power divergence approach and apply our proposal to analyze the widely popular Michaelis-Menten (MM) model in enzyme kinetics. We establish the asymptotic properties of our proposed estimator and tests, along with their theoretical robustness characteristics through influence function analysis. For the particular MM model, we have further empirically justified the robustness and the efficiency of our proposed estimator and the testing procedure through extensive simulation studies and several interesting real data examples of enzyme-catalyzed (biochemical) reactions."}, "https://arxiv.org/abs/2409.16003": {"title": "Easy Conditioning far Beyond Gaussian", "link": "https://arxiv.org/abs/2409.16003", "description": "arXiv:2409.16003v1 Announce Type: new \nAbstract: Estimating and sampling from conditional densities plays a critical role in statistics and data science, with a plethora of applications. Numerous methods are available ranging from simple fitting approaches to sophisticated machine learning algorithms. However, selecting from among these often involves a trade-off between conflicting objectives of efficiency, flexibility and interpretability. Starting from well known easy conditioning results in the Gaussian case, we show, thanks to results pertaining to stability by mixing and marginal transformations, that the latter carry over far beyond the Gaussian case. This enables us to flexibly model multivariate data by accommodating broad classes of multi-modal dependence structures and marginal distributions, while enjoying fast conditioning of fitted joint distributions. In applications, we primarily focus on conditioning via Gaussian versus Gaussian mixture copula models, comparing different fitting implementations for the latter. Numerical experiments with simulated and real data demonstrate the relevance of the approach for conditional sampling, evaluated using multivariate scoring rules."}, "https://arxiv.org/abs/2409.16132": {"title": "Large Bayesian Tensor VARs with Stochastic Volatility", "link": "https://arxiv.org/abs/2409.16132", "description": "arXiv:2409.16132v1 Announce Type: new \nAbstract: We consider Bayesian tensor vector autoregressions (TVARs) in which the VAR coefficients are arranged as a three-dimensional array or tensor, and this coefficient tensor is parameterized using a low-rank CP decomposition. 
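To fix ideas for the tensor VAR abstract above before it continues, the sketch below builds a VAR(p) coefficient tensor from a rank-R CP decomposition and simulates from it. The dimensions, rank, and scaling (used only to keep the simulated process stable) are assumptions for illustration; the paper's priors and stochastic volatility specification are not included.

```python
# Minimal sketch: parameterize VAR(p) coefficients as a CP low-rank tensor
# A[i, j, l] = sum_r lam_r * U[i, r] * V[j, r] * W[l, r] and simulate data.
import numpy as np

rng = np.random.default_rng(5)
N, p, R = 8, 4, 2                               # variables, lags, CP rank

lam = np.array([1.0, 0.5])
U, V, W = (rng.standard_normal((N, R)),
           rng.standard_normal((N, R)),
           rng.standard_normal((p, R)))
A = 0.1 * np.einsum('r,ir,jr,lr->ijl', lam, U, V, W)   # N x N x p coefficient tensor

T = 300
y = np.zeros((T, N))
for t in range(p, T):
    mean_t = sum(A[:, :, l] @ y[t - l - 1] for l in range(p))   # sum over lags
    y[t] = mean_t + rng.standard_normal(N)

print("simulated series shape:", y.shape, "| max |y|:", np.abs(y).max().round(2))
```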
We develop a family of TVARs using a general stochastic volatility specification, which includes a wide variety of commonly-used multivariate stochastic volatility and COVID-19 outlier-augmented models. In a forecasting exercise involving 40 US quarterly variables, we show that these TVARs outperform the standard Bayesian VAR with the Minnesota prior. The results also suggest that the parsimonious common stochastic volatility model tends to forecast better than the more flexible Cholesky stochastic volatility model."}, "https://arxiv.org/abs/2409.16276": {"title": "Bayesian Variable Selection and Sparse Estimation for High-Dimensional Graphical Models", "link": "https://arxiv.org/abs/2409.16276", "description": "arXiv:2409.16276v1 Announce Type: new \nAbstract: We introduce a novel Bayesian approach for both covariate selection and sparse precision matrix estimation in the context of high-dimensional Gaussian graphical models involving multiple responses. Our approach provides a sparse estimation of the three distinct sparsity structures: the regression coefficient matrix, the conditional dependency structure among responses, and between responses and covariates. This contrasts with existing methods, which typically focus on any two of these structures but seldom achieve simultaneous sparse estimation for all three. A key aspect of our method is that it leverages the structural sparsity information gained from the presence of irrelevant covariates in the dataset to introduce covariate-level sparsity in the precision and regression coefficient matrices. This is achieved through a Bayesian conditional random field model using a hierarchical spike and slab prior setup. Despite the non-convex nature of the problem, we establish statistical accuracy for points in the high posterior density region, including the maximum-a-posteriori (MAP) estimator. We also present an efficient Expectation-Maximization (EM) algorithm for computing the estimators. Through simulation experiments, we demonstrate the competitive performance of our method, particularly in scenarios with weak signal strength in the precision matrices. Finally, we apply our method to a bike-share dataset, showcasing its predictive performance."}, "https://arxiv.org/abs/2409.15532": {"title": "A theory of generalised coordinates for stochastic differential equations", "link": "https://arxiv.org/abs/2409.15532", "description": "arXiv:2409.15532v1 Announce Type: cross \nAbstract: Stochastic differential equations are ubiquitous modelling tools in physics and the sciences. In most modelling scenarios, random fluctuations driving dynamics or motion have some non-trivial temporal correlation structure, which renders the SDE non-Markovian; a phenomenon commonly known as ``colored'' noise. Thus, an important objective is to develop effective tools for mathematically and numerically studying (possibly non-Markovian) SDEs. In this report, we formalise a mathematical theory for analysing and numerically studying SDEs based on so-called `generalised coordinates of motion'. Like the theory of rough paths, we analyse SDEs pathwise for any given realisation of the noise, not solely probabilistically. Like the established theory of Markovian realisation, we realise non-Markovian SDEs as a Markov process in an extended space. Unlike the established theory of Markovian realisation however, the Markovian realisations here are accurate on short timescales and may be exact globally in time, when flows and fluctuations are analytic. 
This theory is exact for SDEs with analytic flows and fluctuations, and is approximate when flows and fluctuations are differentiable. It provides useful analysis tools, which we employ to solve linear SDEs with analytic fluctuations. It may also be useful for studying rougher SDEs, as these may be identified as the limit of smoother ones. This theory supplies effective, computationally straightforward methods for simulation, filtering and control of SDEs; amongst others, we re-derive generalised Bayesian filtering, a state-of-the-art method for time-series analysis. Looking forward, this report suggests that generalised coordinates have far-reaching applications throughout stochastic differential equations."}, "https://arxiv.org/abs/2409.15677": {"title": "Smoothing the Conditional Value-at-Risk based Pickands Estimators", "link": "https://arxiv.org/abs/2409.15677", "description": "arXiv:2409.15677v1 Announce Type: cross \nAbstract: We incorporate the conditional value-at-risk (CVaR) quantity into a generalized class of Pickands estimators. By introducing CVaR, the newly developed estimators not only retain the desirable properties of consistency, location, and scale invariance inherent to Pickands estimators, but also achieve a reduction in mean squared error (MSE). To address the issue of sensitivity to the choice of the number of top order statistics used for the estimation, and ensure robust estimation, which are crucial in practice, we first propose a beta measure, which is a modified beta density function, to smooth the estimator. Then, we develop an algorithm to approximate the asymptotic mean squared error (AMSE) and determine the optimal beta measure that minimizes AMSE. A simulation study involving a wide range of distributions shows that our estimators have good and highly stable finite-sample performance and compare favorably with the other estimators."}, "https://arxiv.org/abs/2409.15682": {"title": "Linear Contextual Bandits with Interference", "link": "https://arxiv.org/abs/2409.15682", "description": "arXiv:2409.15682v1 Announce Type: cross \nAbstract: Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings, where multiple units are present in the same round, potential interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision-making process. Although some prior work has explored multi-agent and adversarial bandits in interference-aware settings, the effect of interference in CB, as well as the underlying theory, remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in Linear CB (LinCB), bridging the gap between causal inference and online decision-making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite sample upper bounds, and asymptotic properties. 
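As background for the interference-aware contextual bandit abstract above (which continues below), the sketch implements the standard LinUCB update that linear contextual bandit methods of this kind build on. The reward model, features, and exploration constant are illustrative assumptions, and the interference adjustment proposed in the paper is not included.

```python
# Minimal sketch of standard LinUCB for a linear contextual bandit (no
# interference modelling); parameters are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(6)
d, n_arms, T, alpha = 5, 3, 2_000, 1.0
theta_true = rng.uniform(-1, 1, size=(n_arms, d))       # unknown to the learner

A = np.stack([np.eye(d) for _ in range(n_arms)])        # per-arm ridge Gram matrices
b = np.zeros((n_arms, d))
total_reward = 0.0

for t in range(T):
    x = rng.standard_normal(d)                          # context for this round
    ucb = np.empty(n_arms)
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]
        ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
    a_star = int(np.argmax(ucb))
    r = theta_true[a_star] @ x + 0.1 * rng.standard_normal()
    A[a_star] += np.outer(x, x)                         # rank-one design update
    b[a_star] += r * x
    total_reward += r

print("average reward:", round(total_reward / T, 3))
```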
The effectiveness of our approach is demonstrated through simulations and synthetic data generated based on the MovieLens data."}, "https://arxiv.org/abs/2409.15844": {"title": "Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection", "link": "https://arxiv.org/abs/2409.15844", "description": "arXiv:2409.15844v1 Announce Type: cross \nAbstract: We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and hyperparameter tuning for engineering systems, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds."}, "https://arxiv.org/abs/2409.16044": {"title": "Stable Survival Extrapolation via Transfer Learning", "link": "https://arxiv.org/abs/2409.16044", "description": "arXiv:2409.16044v1 Announce Type: cross \nAbstract: The mean survival is the key ingredient of the decision process in several applications, including health economic evaluations. It is defined as the area under the complete survival curve, thus necessitating extrapolation of the observed data. In this article we employ a Bayesian mortality model and transfer its projections in order to construct the baseline population that acts as an anchor of the survival model. This can be seen as an implicit bias-variance trade-off in unseen data. We then propose extrapolation methods based on flexible parametric polyhazard models which can naturally accommodate diverse shapes, including non-proportional hazards and crossing survival curves while typically maintaining a natural interpretation. We estimate the mean survival and related estimands in three cases, namely breast cancer, cardiac arrhythmia and advanced melanoma. Specifically, we evaluate the survival disadvantage of triple negative breast cancer cases, the efficacy of combining immunotherapy with an mRNA cancer therapeutic for advanced melanoma treatment and the suitability of implantable cardioverter defibrillators for cardiac arrhythmia. The latter is conducted in a competing risks context illustrating how working on the cause-specific hazard alone minimizes potential instability. The results suggest that the proposed approach offers a flexible, interpretable and robust solution when survival extrapolation is required."}, "https://arxiv.org/abs/2210.09339": {"title": "Probability Weighted Clustered Coefficients Regression Models in Complex Survey Sampling", "link": "https://arxiv.org/abs/2210.09339", "description": "arXiv:2210.09339v3 Announce Type: replace \nAbstract: Regression analysis is commonly conducted in survey sampling. However, existing methods fail when the relationships vary across different areas or domains. In this paper, we propose a unified framework to study the group-wise covariate effect under complex survey sampling based on pairwise penalties, and the associated objective function is solved by the alternating direction method of multipliers. 
Theoretical properties of the proposed method are investigated under some general conditions. Numerical experiments demonstrate the superiority of the proposed method in terms of identifying groups and estimation efficiency for both linear regression models and logistic regression models."}, "https://arxiv.org/abs/2305.08834": {"title": "Elastic Bayesian Model Calibration", "link": "https://arxiv.org/abs/2305.08834", "description": "arXiv:2305.08834v2 Announce Type: replace \nAbstract: Functional data are ubiquitous in scientific modeling. For instance, quantities of interest are modeled as functions of time, space, energy, density, etc. Uncertainty quantification methods for computer models with functional response have resulted in tools for emulation, sensitivity analysis, and calibration that are widely used. However, many of these tools do not perform well when the computer model's parameters control both the amplitude variation of the functional output and its alignment (or phase variation). This paper introduces a framework for Bayesian model calibration when the model responses are misaligned functional data. The approach generates two types of data out of the misaligned functional responses: (1) aligned functions so that the amplitude variation is isolated and (2) warping functions that isolate the phase variation. These two types of data are created for the computer simulation data (both of which may be emulated) and the experimental data. The calibration approach uses both types so that it seeks to match both the amplitude and phase of the experimental data. The framework is careful to respect constraints that arise especially when modeling phase variation, and is framed in a way that it can be done with readily available calibration software. We demonstrate the techniques on two simulated data examples and on two dynamic material science problems: a strength model calibration using flyer plate experiments and an equation of state model calibration using experiments performed on the Sandia National Laboratories' Z-machine."}, "https://arxiv.org/abs/2109.13819": {"title": "Perturbation theory for killed Markov processes and quasi-stationary distributions", "link": "https://arxiv.org/abs/2109.13819", "description": "arXiv:2109.13819v2 Announce Type: replace-cross \nAbstract: Motivated by recent developments of quasi-stationary Monte Carlo methods, we investigate the stability of quasi-stationary distributions of killed Markov processes under perturbations of the generator. We first consider a general bounded self-adjoint perturbation operator, and after that, study a particular unbounded perturbation corresponding to truncation of the killing rate. In both scenarios, we quantify the difference between eigenfunctions of the smallest eigenvalue of the perturbed and unperturbed generators in a Hilbert space norm. As a consequence, L1 norm estimates of the difference of the resulting quasi-stationary distributions in terms of the perturbation are provided."}, "https://arxiv.org/abs/2110.03200": {"title": "High Dimensional Logistic Regression Under Network Dependence", "link": "https://arxiv.org/abs/2110.03200", "description": "arXiv:2110.03200v3 Announce Type: replace-cross \nAbstract: Logistic regression is a key method for modeling the probability of a binary outcome based on a collection of covariates. 
However, the classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure, such as over a temporal/spatial domain or on a social network. This necessitates the development of models that can simultaneously handle both the network `peer-effect' and the effect of high-dimensional covariates. In this paper, we develop a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture the pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We propose a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates (the regression coefficients), which, in addition to handling the high-dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Under various standard regularity conditions, we show that the corresponding estimate attains the classical high-dimensional rate of consistency. Our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical (independent) logistic regression, when the true parameter is sparse and the underlying network is not too dense. We also develop an efficient algorithm for computing the estimates and validate our theoretical results in numerical experiments. An application to selecting genes in clustering spatial transcriptomics data is also discussed."}, "https://arxiv.org/abs/2302.09034": {"title": "Bayesian Mixtures Models with Repulsive and Attractive Atoms", "link": "https://arxiv.org/abs/2302.09034", "description": "arXiv:2302.09034v3 Announce Type: replace-cross \nAbstract: The study of almost surely discrete random probability measures is an active line of research in Bayesian nonparametrics. The idea of assuming interaction across the atoms of the random probability measure has recently spurred significant interest in the context of Bayesian mixture models. This allows the definition of priors that encourage well-separated and interpretable clusters. In this work, we provide a unified framework for the construction and the Bayesian analysis of random probability measures with interacting atoms, encompassing both repulsive and attractive behaviours. Specifically, we derive closed-form expressions for the posterior distribution, the marginal and predictive distributions, which were not previously available except for the case of measures with i.i.d. atoms. We show how these quantities are fundamental both for prior elicitation and to develop new posterior simulation algorithms for hierarchical mixture models. Our results are obtained without any assumption on the finite point process that governs the atoms of the random measure. Their proofs rely on analytical tools borrowed from the Palm calculus theory, which might be of independent interest. We specialise our treatment to the classes of Poisson, Gibbs, and determinantal point processes, as well as in the case of shot-noise Cox processes. 
Finally, we illustrate the performance of different modelling strategies on simulated and real datasets."}, "https://arxiv.org/abs/2409.16409": {"title": "Robust Mean Squared Prediction Error Estimators of EBLUP of a Small Area Mean Under the Fay-Herriot Model", "link": "https://arxiv.org/abs/2409.16409", "description": "arXiv:2409.16409v1 Announce Type: new \nAbstract: In this paper we derive a second-order unbiased (or nearly unbiased) mean squared prediction error (MSPE) estimator of empirical best linear unbiased predictor (EBLUP) of a small area mean for a non-normal extension to the well-known Fay-Herriot model. Specifically, we derive our MSPE estimator essentially assuming certain moment conditions on both the sampling and random effects distributions. The normality-based Prasad-Rao MSPE estimator has a surprising robustness property in that it remains second-order unbiased under the non-normality of random effects when a simple method-of-moments estimator is used for the variance component and the sampling error distribution is normal. We show that the normality-based MSPE estimator is no longer second-order unbiased when the sampling error distribution is non-normal or when the Fay-Herriot moment method is used to estimate the variance component, even when the sampling error distribution is normal. It is interesting to note that when the simple method-of-moments estimator is used for the variance component, our proposed MSPE estimator does not require the estimation of kurtosis of the random effects. Results of a simulation study on the accuracy of the proposed MSPE estimator, under non-normality of both sampling and random effects distributions, are also presented."}, "https://arxiv.org/abs/2409.16463": {"title": "Double-Estimation-Friendly Inference for High Dimensional Misspecified Measurement Error Models", "link": "https://arxiv.org/abs/2409.16463", "description": "arXiv:2409.16463v1 Announce Type: new \nAbstract: In this paper, we introduce an innovative testing procedure for assessing individual hypotheses in high-dimensional linear regression models with measurement errors. This method remains robust even when either the X-model or Y-model is misspecified. We develop a double robust score function that maintains a zero expectation if one of the models is incorrect, and we construct a corresponding score test. We first show the asymptotic normality of our approach in a low-dimensional setting, and then extend it to the high-dimensional models. Our analysis of high-dimensional settings explores scenarios both with and without the sparsity condition, establishing asymptotic normality and non-trivial power performance under local alternatives. Simulation studies and real data analysis demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2409.16683": {"title": "Robust Max Statistics for High-Dimensional Inference", "link": "https://arxiv.org/abs/2409.16683", "description": "arXiv:2409.16683v1 Announce Type: new \nAbstract: Although much progress has been made in the theory and application of bootstrap approximations for max statistics in high dimensions, the literature has largely been restricted to cases involving light-tailed data. To address this issue, we propose an approach to inference based on robust max statistics, and we show that their distributions can be accurately approximated via bootstrapping when the data are both high-dimensional and heavy-tailed. 
In particular, the data are assumed to satisfy an extended version of the well-established $L^{4}$-$L^2$ moment equivalence condition, as well as a weak variance decay condition. In this setting, we show that near-parametric rates of bootstrap approximation can be achieved in the Kolmogorov metric, independently of the data dimension. Moreover, this theoretical result is complemented by favorable empirical results involving both synthetic data and an application to financial data."}, "https://arxiv.org/abs/2409.16829": {"title": "Conditional Testing based on Localized Conformal p-values", "link": "https://arxiv.org/abs/2409.16829", "description": "arXiv:2409.16829v1 Announce Type: new \nAbstract: In this paper, we address conditional testing problems through the conformal inference framework. We define the localized conformal p-values by inverting prediction intervals and prove their theoretical properties. These defined p-values are then applied to several conditional testing problems to illustrate their practicality. Firstly, we propose a conditional outlier detection procedure to test for outliers in the conditional distribution with finite-sample false discovery rate (FDR) control. We also introduce a novel conditional label screening problem with the goal of screening multivariate response variables and propose a screening procedure to control the family-wise error rate (FWER). Finally, we consider the two-sample conditional distribution test and define a weighted U-statistic through the aggregation of localized p-values. Numerical simulations and real-data examples validate the superior performance of our proposed strategies."}, "https://arxiv.org/abs/2409.17039": {"title": "A flexiable approach: variable selection procedures with multilayer FDR control via e-values", "link": "https://arxiv.org/abs/2409.17039", "description": "arXiv:2409.17039v1 Announce Type: new \nAbstract: Consider a scenario where a large number of explanatory features targeting a response variable are analyzed, such that these features are partitioned into different groups according to their domain-specific structures. Furthermore, there may be several such partitions. Such multiple partitions may exist in many real-life scenarios. One such example is spatial genome-wide association studies. Researchers may not only be interested in identifying the features relevant to the response but also aim to determine the relevant groups within each partition. A group is considered relevant if it contains at least one relevant feature. To ensure the replicability of the findings at various resolutions, it is essential to provide false discovery rate (FDR) control for findings at multiple layers simultaneously. This paper presents a general approach that leverages various existing controlled selection procedures to generate more stable results using multilayer FDR control. The key contributions of our proposal are the development of a generalized e-filter that provides multilayer FDR control and the construction of a specific type of generalized e-values to evaluate feature importance. A primary application of our method is an improved version of Data Splitting (DS), called the eDS-filter. Furthermore, we combine the eDS-filter with the version of the group knockoff filter (gKF), resulting in a more flexible approach called the eDS+gKF filter. Simulation studies demonstrate that the proposed methods effectively control the FDR at multiple levels while maintaining or even improving power compared to other approaches. 
Finally, we apply the proposed method to analyze HIV mutation data."}, "https://arxiv.org/abs/2409.17129": {"title": "Bayesian Bivariate Conway-Maxwell-Poisson Regression Model for Correlated Count Data in Sports", "link": "https://arxiv.org/abs/2409.17129", "description": "arXiv:2409.17129v1 Announce Type: new \nAbstract: Count data play a crucial role in sports analytics, providing valuable insights into various aspects of the game. Models that accurately capture the characteristics of count data are essential for making reliable inferences. In this paper, we propose the use of the Conway-Maxwell-Poisson (CMP) model for analyzing count data in sports. The CMP model offers flexibility in modeling data with different levels of dispersion. Here we consider a bivariate CMP model that models the potential correlation between home and away scores by incorporating a random effect specification. We illustrate the advantages of the CMP model through simulations. We then analyze data from baseball and soccer games before, during, and after the COVID-19 pandemic. The performance of our proposed CMP model matches or outperforms standard Poisson and Negative Binomial models, providing a good fit and an accurate estimation of the observed effects in count data with any level of dispersion. The results highlight the robustness and flexibility of the CMP model in analyzing count data in sports, making it a suitable default choice for modeling a diverse range of count data types in sports, where the data dispersion may vary."}, "https://arxiv.org/abs/2409.16407": {"title": "Towards Representation Learning for Weighting Problems in Design-Based Causal Inference", "link": "https://arxiv.org/abs/2409.16407", "description": "arXiv:2409.16407v1 Announce Type: cross \nAbstract: Reweighting a distribution to minimize a distance to a target distribution is a powerful and flexible strategy for estimating a wide range of causal effects, but can be challenging in practice because optimal weights typically depend on knowledge of the underlying data generating process. In this paper, we focus on design-based weights, which do not incorporate outcome information; prominent examples include prospective cohort studies, survey weighting, and the weighting portion of augmented weighting estimators. In such applications, we explore the central role of representation learning in finding desirable weights in practice. Unlike the common approach of assuming a well-specified representation, we highlight the error due to the choice of a representation and outline a general framework for finding suitable representations that minimize this error. Building on recent work that combines balancing weights and neural networks, we propose an end-to-end estimation procedure that learns a flexible representation, while retaining promising theoretical properties. We show that this approach is competitive in a range of common causal inference tasks."}, "https://arxiv.org/abs/1507.08689": {"title": "Multiple Outlier Detection in Samples with Exponential & Pareto Tails: Redeeming the Inward Approach & Detecting Dragon Kings", "link": "https://arxiv.org/abs/1507.08689", "description": "arXiv:1507.08689v2 Announce Type: replace \nAbstract: We introduce two ratio-based robust test statistics, max-robust-sum (MRS) and sum-robust-sum (SRS), designed to enhance the robustness of outlier detection in samples with exponential or Pareto tails. 
We also reintroduce the inward sequential testing method -- formerly relegated since the introduction of outward testing -- and show that MRS and SRS tests reduce susceptibility of the inward approach to masking, making the inward test as powerful as, and potentially less error-prone than, outward tests. Moreover, inward testing does not require the complicated type I error control of outward tests. A comprehensive comparison of the test statistics is done, considering performance of the proposed tests in both block and sequential tests, and contrasting their performance with classical test statistics across various data scenarios. In five case studies -- financial crashes, nuclear power generation accidents, stock market returns, epidemic fatalities, and city sizes -- significant outliers are detected and related to the concept of `Dragon King' events, defined as meaningful outliers that arise from a unique generating mechanism."}, "https://arxiv.org/abs/2108.02196": {"title": "Synthetic Controls for Experimental Design", "link": "https://arxiv.org/abs/2108.02196", "description": "arXiv:2108.02196v4 Announce Type: replace \nAbstract: This article studies experimental design in settings where the experimental units are large aggregate entities (e.g., markets), and only one or a small number of units can be exposed to the treatment. In such settings, randomization of the treatment may result in treated and control groups with very different characteristics at baseline, inducing biases. We propose a variety of experimental non-randomized synthetic control designs (Abadie, Diamond and Hainmueller, 2010, Abadie and Gardeazabal, 2003) that select the units to be treated, as well as the untreated units to be used as a control group. Average potential outcomes are estimated as weighted averages of the outcomes of treated units for potential outcomes with treatment, and weighted averages of the outcomes of control units for potential outcomes without treatment. We analyze the properties of estimators based on synthetic control designs and propose new inferential techniques. We show that in experimental settings with aggregate units, synthetic control designs can substantially reduce estimation biases in comparison to randomization of the treatment."}, "https://arxiv.org/abs/2209.06294": {"title": "Graph-constrained Analysis for Multivariate Functional Data", "link": "https://arxiv.org/abs/2209.06294", "description": "arXiv:2209.06294v3 Announce Type: replace \nAbstract: Functional Gaussian graphical models (GGM) used for analyzing multivariate functional data customarily estimate an unknown graphical model representing the conditional relationships between the functional variables. However, in many applications of multivariate functional data, the graph is known and existing functional GGM methods cannot preserve a given graphical constraint. In this manuscript, we demonstrate how to conduct multivariate functional analysis that exactly conforms to a given inter-variable graph. We first show the equivalence between partially separable functional GGM and graphical Gaussian processes (GP), proposed originally for constructing optimal covariance functions for multivariate spatial data that retain the conditional independence relations in a given graphical model. The theoretical connection helps design a new algorithm that leverages Dempster's covariance selection to calculate the maximum likelihood estimate of the covariance function for multivariate functional data under graphical constraints. 
We also show that the finite term truncation of functional GGM basis expansion used in practice is equivalent to a low-rank graphical GP, which is known to oversmooth marginal distributions. To remedy this, we extend our algorithm to better preserve marginal distributions while still respecting the graph and retaining computational scalability. The insights obtained from the new results presented in this manuscript will help practitioners better understand the relationship between these graphical models and decide on the appropriate method for their specific multivariate data analysis task. The benefits of the proposed algorithms are illustrated using empirical experiments and an application to functional modeling of neuroimaging data using the connectivity graph among regions of the brain."}, "https://arxiv.org/abs/2305.12336": {"title": "Estimation of finite population proportions for small areas -- a statistical data integration approach", "link": "https://arxiv.org/abs/2305.12336", "description": "arXiv:2305.12336v2 Announce Type: replace \nAbstract: Empirical best prediction (EBP) is a well-known method for producing reliable proportion estimates when the primary data source provides only a small or no sample from finite populations. There are potential challenges in implementing existing EBP methodology, such as limited auxiliary variables in the frame (not adequate for building a reasonable working predictive model) or the inability to accurately link the sample to the finite population frame due to absence of identifiers. In this paper, we propose a new data linkage approach where the finite population frame is replaced by a big probability sample, having a large set of auxiliary variables but not the outcome binary variable of interest. We fit an assumed model on the small probability sample and then impute the outcome variable for all units of the big sample to obtain standard weighted proportions. We develop a new adjusted maximum likelihood (ML) method so that the estimate of model variance does not fall on the boundary, which is otherwise encountered with the commonly used ML method. We also propose an estimator of the mean squared prediction error using a parametric bootstrap method and address computational issues by developing an efficient Expectation Maximization algorithm. The proposed methodology is illustrated in the context of election projection for small areas."}, "https://arxiv.org/abs/2306.03663": {"title": "Bayesian inference for group-level cortical surface image-on-scalar-regression with Gaussian process priors", "link": "https://arxiv.org/abs/2306.03663", "description": "arXiv:2306.03663v2 Announce Type: replace \nAbstract: In regression-based analyses of group-level neuroimage data, researchers typically fit a series of marginal general linear models to image outcomes at each spatially-referenced pixel. Spatial regularization of effects of interest is usually induced indirectly by applying spatial smoothing to the data during preprocessing. While this procedure often works well, resulting inference can be poorly calibrated. Spatial modeling of effects of interest leads to more powerful analyses; however, the number of locations in a typical neuroimage can preclude standard computation with explicitly spatial models. Here we contribute a Bayesian spatial regression model for group-level neuroimaging analyses. We induce regularization of spatially varying regression coefficient functions through Gaussian process priors. 
When combined with a simple nonstationary model for the error process, our prior hierarchy can lead to more data-adaptive smoothing than standard methods. We achieve computational tractability through Vecchia approximation of our prior which, critically, can be constructed for a wide class of spatial correlation functions and results in prior models that retain full spatial rank. We outline several ways to work with our model in practice and compare performance against standard vertex-wise analyses. Finally we illustrate our method in an analysis of cortical surface fMRI task contrast data from a large cohort of children enrolled in the Adolescent Brain Cognitive Development study."}, "https://arxiv.org/abs/1910.03821": {"title": "Quasi Maximum Likelihood Estimation and Inference of Large Approximate Dynamic Factor Models via the EM algorithm", "link": "https://arxiv.org/abs/1910.03821", "description": "arXiv:1910.03821v5 Announce Type: replace-cross \nAbstract: We study estimation of large Dynamic Factor models implemented through the Expectation Maximization (EM) algorithm, jointly with the Kalman smoother. We prove that as both the cross-sectional dimension, $n$, and the sample size, $T$, diverge to infinity: (i) the estimated loadings are $\\sqrt T$-consistent, asymptotically normal and equivalent to their Quasi Maximum Likelihood estimates; (ii) the estimated factors are $\\sqrt n$-consistent, asymptotically normal and equivalent to their Weighted Least Squares estimates. Moreover, the estimated loadings are asymptotically as efficient as those obtained by Principal Components analysis, while the estimated factors are more efficient if the idiosyncratic covariance is sparse enough."}, "https://arxiv.org/abs/2401.11948": {"title": "The Ensemble Kalman Filter for Dynamic Inverse Problems", "link": "https://arxiv.org/abs/2401.11948", "description": "arXiv:2401.11948v2 Announce Type: replace-cross \nAbstract: In inverse problems, the goal is to estimate unknown model parameters from noisy observational data. Traditionally, inverse problems are solved under the assumption of a fixed forward operator describing the observation model. In this article, we consider the extension of this approach to situations where we have a dynamic forward model, motivated by applications in scientific computation and engineering. We specifically consider this extension for a derivative-free optimizer, the ensemble Kalman inversion (EKI). We introduce and justify a new methodology called dynamic-EKI, which is a particle-based method with a changing forward operator. We analyze our new method, presenting results related to the control of our particle system through its covariance structure. This analysis includes moment bounds and an ensemble collapse, which are essential for demonstrating a convergence result. We establish convergence in expectation and validate our theoretical findings through experiments with dynamic-EKI applied to a 2D Darcy flow partial differential equation."}, "https://arxiv.org/abs/2409.17195": {"title": "When Sensitivity Bias Varies Across Subgroups: The Impact of Non-uniform Polarity in List Experiments", "link": "https://arxiv.org/abs/2409.17195", "description": "arXiv:2409.17195v1 Announce Type: new \nAbstract: Survey researchers face the problem of sensitivity bias: since people are reluctant to reveal socially undesirable or otherwise risky traits, aggregate estimates of these traits will be biased. List experiments offer a solution by conferring respondents greater privacy. 
However, little is known about how list experiments fare when sensitivity bias varies across respondent subgroups. For example, a trait that is socially undesirable to one group may be socially desirable in a second group, leading sensitivity bias to be negative in the first group, while it is positive in the second. Or a trait may not be sensitive in one group, leading sensitivity bias to be zero in one group and non-zero in another. We use Monte Carlo simulations to explore what happens when the polarity (sign) of sensitivity bias is non-uniform. We find that a general diagnostic test yields false positives and that commonly used estimators return biased estimates of the prevalence of the sensitive trait, coefficients of covariates, and sensitivity bias itself. The bias is worse when polarity runs in opposite directions across subgroups, and as the difference in subgroup sizes increases. Significantly, non-uniform polarity could explain why some list experiments appear to 'fail'. By defining and systematically investigating the problem of non-uniform polarity, we hope to save some studies from the file-drawer and provide some guidance for future research."}, "https://arxiv.org/abs/2409.17298": {"title": "Sparsity, Regularization and Causality in Agricultural Yield: The Case of Paddy Rice in Peru", "link": "https://arxiv.org/abs/2409.17298", "description": "arXiv:2409.17298v1 Announce Type: new \nAbstract: This study introduces a novel approach that integrates agricultural census data with remotely sensed time series to develop precise predictive models for paddy rice yield across various regions of Peru. By utilizing sparse regression and Elastic-Net regularization techniques, the study identifies causal relationships between key remotely sensed variables -- such as NDVI, precipitation, and temperature -- and agricultural yield. To further enhance prediction accuracy, the first- and second-order dynamic transformations (velocity and acceleration) of these variables are applied, capturing non-linear patterns and delayed effects on yield. The findings highlight the improved predictive performance when combining regularization techniques with climatic and geospatial variables, enabling more precise forecasts of yield variability. The results confirm the existence of causal relationships in the Granger sense, emphasizing the value of this methodology for strategic agricultural management. This contributes to more efficient and sustainable production in paddy rice cultivation."}, "https://arxiv.org/abs/2409.17404": {"title": "Bayesian Covariate-Dependent Graph Learning with a Dual Group Spike-and-Slab Prior", "link": "https://arxiv.org/abs/2409.17404", "description": "arXiv:2409.17404v1 Announce Type: new \nAbstract: Covariate-dependent graph learning has gained increasing interest in the graphical modeling literature for the analysis of heterogeneous data. This task, however, poses challenges to modeling, computational efficiency, and interpretability. The parameter of interest can be naturally represented as a three-dimensional array with elements that can be grouped according to two directions, corresponding to node level and covariate level, respectively. In this article, we propose a novel dual group spike-and-slab prior that enables multi-level selection at covariate-level and node-level, as well as individual (local) level sparsity. We introduce a nested strategy with specific choices to address distinct challenges posed by the various grouping directions. 
For posterior inference, we develop a tuning-free Gibbs sampler for all parameters, which mitigates the difficulties of parameter tuning often encountered in high-dimensional graphical models and facilitates routine implementation. Through simulation studies, we demonstrate that the proposed model outperforms existing methods in its accuracy of graph recovery. We show the practical utility of our model via an application to microbiome data where we seek to better understand the interactions among microbes as well as how these are affected by relevant covariates."}, "https://arxiv.org/abs/2409.17441": {"title": "Factor pre-training in Bayesian multivariate logistic models", "link": "https://arxiv.org/abs/2409.17441", "description": "arXiv:2409.17441v1 Announce Type: new \nAbstract: This article focuses on inference in logistic regression for high-dimensional binary outcomes. A popular approach induces dependence across the outcomes by including latent factors in the linear predictor. Bayesian approaches are useful for characterizing uncertainty in inferring the regression coefficients, factors and loadings, while also incorporating hierarchical and shrinkage structure. However, Markov chain Monte Carlo algorithms for posterior computation face challenges in scaling to high-dimensional outcomes. Motivated by applications in ecology, we exploit a blessing of dimensionality to motivate pre-estimation of the latent factors. Conditionally on the factors, the outcomes are modeled via independent logistic regressions. We implement Gaussian approximations in parallel in inferring the posterior on the regression coefficients and loadings, including a simple adjustment to obtain credible intervals with valid frequentist coverage. We show posterior concentration properties and excellent empirical performance in simulations. The methods are applied to insect biodiversity data in Madagascar."}, "https://arxiv.org/abs/2409.17631": {"title": "Invariant Coordinate Selection and Fisher discriminant subspace beyond the case of two groups", "link": "https://arxiv.org/abs/2409.17631", "description": "arXiv:2409.17631v1 Announce Type: new \nAbstract: Invariant Coordinate Selection (ICS) is a multivariate technique that relies on the simultaneous diagonalization of two scatter matrices. It serves various purposes, including its use as a dimension reduction tool prior to clustering or outlier detection. Unlike methods such as Principal Component Analysis, ICS has a theoretical foundation that explains why and when the identified subspace should contain relevant information. These general results have been examined in detail primarily for specific scatter combinations within a two-cluster framework. In this study, we expand these investigations to include more clusters and scatter combinations. The case of three clusters in particular is studied at length. Based on these expanded theoretical insights and supported by numerical studies, we conclude that ICS is indeed suitable for recovering Fisher's discriminant subspace under very general settings and cases of failure seem rare."}, "https://arxiv.org/abs/2409.17706": {"title": "Stationarity of Manifold Time Series", "link": "https://arxiv.org/abs/2409.17706", "description": "arXiv:2409.17706v1 Announce Type: new \nAbstract: In modern interdisciplinary research, manifold time series data have been garnering more attention. 
A critical question in analyzing such data is ``stationarity'', which reflects the underlying dynamic behavior and is crucial across various fields like cell biology, neuroscience and empirical finance. Yet, there has been an absence of a formal definition of stationarity that is tailored to manifold time series. This work bridges this gap by proposing the first definitions of first-order and second-order stationarity for manifold time series. Additionally, we develop novel statistical procedures to test the stationarity of manifold time series and study their asymptotic properties. Our methods account for the curved nature of manifolds, leading to a more intricate analysis than that in Euclidean space. The effectiveness of our methods is evaluated through numerical simulations and their practical merits are demonstrated through analyzing a cell-type proportion time series dataset from a paper recently published in Cell. The first-order stationarity test result aligns with the biological findings of this paper, while the second-order stationarity test provides numerical support for a critical assumption made therein."}, "https://arxiv.org/abs/2409.17751": {"title": "Granger Causality for Mixed Time Series Generalized Linear Models: A Case Study on Multimodal Brain Connectivity", "link": "https://arxiv.org/abs/2409.17751", "description": "arXiv:2409.17751v1 Announce Type: new \nAbstract: This paper is motivated by studies in neuroscience experiments to understand interactions between nodes in a brain network using different types of data modalities that capture different distinct facets of brain activity. To assess Granger-causality, we introduce a flexible framework through a general class of models that accommodates mixed types of data (binary, count, continuous, and positive components) formulated in a generalized linear model (GLM) fashion. Statistical inference for causality is performed based on both frequentist and Bayesian approaches, with a focus on the latter. Here, we develop a procedure for conducting inference through the proposed Bayesian mixed time series model. By introducing spike and slab priors for some parameters in the model, our inferential approach guides causality order selection and provides proper uncertainty quantification. The proposed methods are then utilized to study the rat spike train and local field potentials (LFP) data recorded during the olfaction working memory task. The proposed methodology provides critical insights into the causal relationship between the rat spiking activity and LFP spectral power. Specifically, power in the LFP beta band is predictive of spiking activity 300 milliseconds later, providing a novel analytical tool for this area of emerging interest in neuroscience and demonstrating its usefulness and flexibility in the study of causality in general."}, "https://arxiv.org/abs/2409.17968": {"title": "Nonparametric Inference Framework for Time-dependent Epidemic Models", "link": "https://arxiv.org/abs/2409.17968", "description": "arXiv:2409.17968v1 Announce Type: new \nAbstract: Compartmental models, especially the Susceptible-Infected-Removed (SIR) model, have long been used to understand the behaviour of various diseases. Allowing parameters, such as the transmission rate, to be time-dependent functions makes it possible to adjust for and make inferences about changes in the process due to mitigation strategies or evolutionary changes of the infectious agent. 
In this article, we attempt to build a nonparametric inference framework for stochastic SIR models with time dependent infection rate. The framework includes three main steps: likelihood approximation, parameter estimation and confidence interval construction. The likelihood function of the stochastic SIR model, which is often intractable, can be approximated using methods such as diffusion approximation or tau leaping. The infection rate is modelled by a B-spline basis whose knot location and number of knots are determined by a fast knot placement method followed by a criterion-based model selection procedure. Finally, a point-wise confidence interval is built using a parametric bootstrap procedure. The performance of the framework is observed through various settings for different epidemic patterns. The model is then applied to the Ontario COVID-19 data across multiple waves."}, "https://arxiv.org/abs/2409.18005": {"title": "Collapsible Kernel Machine Regression for Exposomic Analyses", "link": "https://arxiv.org/abs/2409.18005", "description": "arXiv:2409.18005v1 Announce Type: new \nAbstract: An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibility comes at the cost of low power and difficult interpretation, particularly in exposomic analyses when the number of exposures is large. We propose a flexible framework that allows for separate selection of additive and non-additive effects, unifying additive models and kernel machine regression. The proposed approach yields increased power and simpler interpretation when there is little evidence of interaction. Further, it allows users to specify separate priors for additive and non-additive effects, and allows for tests of non-additive interaction. We extend the approach to the class of multiple index models, in which the special case of kernel machine-distributed lag models are nested. We apply the method to motivating data from a subcohort of the Human Early Life Exposome (HELIX) study containing 65 mixture components grouped into 13 distinct exposure classes."}, "https://arxiv.org/abs/2409.18091": {"title": "Incorporating sparse labels into biologging studies using hidden Markov models with weighted likelihoods", "link": "https://arxiv.org/abs/2409.18091", "description": "arXiv:2409.18091v1 Announce Type: new \nAbstract: Ecologists often use a hidden Markov model to decode a latent process, such as a sequence of an animal's behaviours, from an observed biologging time series. Modern technological devices such as video recorders and drones now allow researchers to directly observe an animal's behaviour. Using these observations as labels of the latent process can improve a hidden Markov model's accuracy when decoding the latent process. However, many wild animals are observed infrequently. Including such rare labels often has a negligible influence on parameter estimates, which in turn does not meaningfully improve the accuracy of the decoded latent process. We introduce a weighted likelihood approach that increases the relative influence of labelled observations. 
We use this approach to develop two hidden Markov models to decode the foraging behaviour of killer whales (Orcinus orca) off the coast of British Columbia, Canada. Using cross-validated evaluation metrics, we show that our weighted likelihood approach produces more accurate and understandable decoded latent processes compared to existing methods. Thus, our method effectively leverages sparse labels to enhance researchers' ability to accurately decode hidden processes across various fields."}, "https://arxiv.org/abs/2409.18117": {"title": "Formulating the Proxy Pattern-Mixture Model as a Selection Model to Assist with Sensitivity Analysis", "link": "https://arxiv.org/abs/2409.18117", "description": "arXiv:2409.18117v1 Announce Type: new \nAbstract: Proxy pattern-mixture models (PPMM) have previously been proposed as a model-based framework for assessing the potential for nonignorable nonresponse in sample surveys and nonignorable selection in nonprobability samples. One defining feature of the PPMM is the single sensitivity parameter, $\\phi$, that ranges from 0 to 1 and governs the degree of departure from ignorability. While this sensitivity parameter is attractive in its simplicity, it may also be of interest to describe departures from ignorability in terms of how the odds of response (or selection) depend on the outcome being measured. In this paper, we re-express the PPMM as a selection model, using the known relationship between pattern-mixture models and selection models, in order to better understand the underlying assumptions of the PPMM and the implied effect of the outcome on nonresponse. The selection model that corresponds to the PPMM is a quadratic function of the survey outcome and proxy variable, and the magnitude of the effect depends on the value of the sensitivity parameter, $\\phi$ (missingness/selection mechanism), the differences in the proxy means and standard deviations for the respondent and nonrespondent populations, and the strength of the proxy, $\\rho^{(1)}$. Large values of $\\phi$ (beyond $0.5$) often result in unrealistic selection mechanisms, and the corresponding selection model can be used to establish more realistic bounds on $\\phi$. We illustrate the results using data from the U.S. Census Household Pulse Survey."}, "https://arxiv.org/abs/2409.17544": {"title": "Optimizing the Induced Correlation in Omnibus Joint Graph Embeddings", "link": "https://arxiv.org/abs/2409.17544", "description": "arXiv:2409.17544v1 Announce Type: cross \nAbstract: Theoretical and empirical evidence suggests that joint graph embedding algorithms induce correlation across the networks in the embedding space. In the Omnibus joint graph embedding framework, previous results explicitly delineated the dual effects of the algorithm-induced and model-inherent correlations on the correlation across the embedded networks. Accounting for and mitigating the algorithm-induced correlation is key to subsequent inference, as sub-optimal Omnibus matrix constructions have been demonstrated to lead to loss in inference fidelity. This work presents the first efforts to automate the Omnibus construction in order to address two key questions in this joint embedding framework: the correlation-to-OMNI problem and the flat correlation problem. In the flat correlation problem, we seek to understand the minimum algorithm-induced flat correlation (i.e., the same across all graph pairs) produced by a generalized Omnibus embedding. 
Working in a subspace of the fully general Omnibus matrices, we prove both a lower bound for this flat correlation and that the classical Omnibus construction induces the maximal flat correlation. In the correlation-to-OMNI problem, we present an algorithm -- named corr2Omni -- that, from a given matrix of estimated pairwise graph correlations, estimates the matrix of generalized Omnibus weights that induces optimal correlation in the embedding space. Moreover, in both simulated and real data settings, we demonstrate the increased effectiveness of our corr2Omni algorithm versus the classical Omnibus construction."}, "https://arxiv.org/abs/2409.17804": {"title": "Enriched Functional Tree-Based Classifiers: A Novel Approach Leveraging Derivatives and Geometric Features", "link": "https://arxiv.org/abs/2409.17804", "description": "arXiv:2409.17804v1 Announce Type: cross \nAbstract: The positioning of this research falls within the scalar-on-function classification literature, a field of significant interest across various domains, particularly in statistics, mathematics, and computer science. This study introduces an advanced methodology for supervised classification by integrating Functional Data Analysis (FDA) with tree-based ensemble techniques for classifying high-dimensional time series. The proposed framework, Enriched Functional Tree-Based Classifiers (EFTCs), leverages derivative and geometric features, benefiting from the diversity inherent in ensemble methods to further enhance predictive performance and reduce variance. While our approach has been tested on the enrichment of Functional Classification Trees (FCTs), Functional K-NN (FKNN), Functional Random Forest (FRF), Functional XGBoost (FXGB), and Functional LightGBM (FLGBM), it could be extended to other tree-based and non-tree-based classifiers, with appropriate considerations emerging from this investigation. Through extensive experimental evaluations on seven real-world datasets and six simulated scenarios, this proposal demonstrates fascinating improvements over traditional approaches, providing new insights into the application of FDA in complex, high-dimensional learning problems."}, "https://arxiv.org/abs/2409.18118": {"title": "Slowly Scaling Per-Record Differential Privacy", "link": "https://arxiv.org/abs/2409.18118", "description": "arXiv:2409.18118v1 Announce Type: cross \nAbstract: We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released.\n Formal privacy mechanisms generally add randomness, or \"noise,\" to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. 
While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data.\n We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility."}, "https://arxiv.org/abs/2309.02073": {"title": "Debiased regression adjustment in completely randomized experiments with moderately high-dimensional covariates", "link": "https://arxiv.org/abs/2309.02073", "description": "arXiv:2309.02073v2 Announce Type: replace \nAbstract: The completely randomized experiment is the gold standard for causal inference. When the covariate information for each experimental candidate is available, one typical way is to include it in covariate adjustments for more accurate treatment effect estimation. In this paper, we investigate this problem under the randomization-based framework, i.e., the covariates and potential outcomes of all experimental candidates are assumed to be deterministic quantities and the randomness comes solely from the treatment assignment mechanism. Under this framework, to achieve asymptotically valid inference, existing estimators usually require either (i) that the dimension of covariates $p$ grows at a rate no faster than $O(n^{3 / 4})$ as sample size $n \\to \\infty$; or (ii) certain sparsity constraints on the linear representations of potential outcomes constructed via possibly high-dimensional covariates. In this paper, we consider the moderately high-dimensional regime where $p$ is allowed to be in the same order of magnitude as $n$. We develop a novel debiased estimator with a corresponding inference procedure and establish its asymptotic normality under mild assumptions. Our estimator is model-free and does not require any sparsity constraint on the potential outcome's linear representations. We also discuss its asymptotic efficiency improvements over the unadjusted treatment effect estimator under different dimensionality constraints. Numerical analysis confirms that compared to other regression adjustment based treatment effect estimators, our debiased estimator performs well in moderately high dimensions."}, "https://arxiv.org/abs/2310.13387": {"title": "Assumption violations in causal discovery and the robustness of score matching", "link": "https://arxiv.org/abs/2310.13387", "description": "arXiv:2310.13387v2 Announce Type: replace \nAbstract: When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. 
data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices."}, "https://arxiv.org/abs/2312.10176": {"title": "Spectral estimation for spatial point processes and random fields", "link": "https://arxiv.org/abs/2312.10176", "description": "arXiv:2312.10176v2 Announce Type: replace \nAbstract: Spatial variables can be observed in many different forms, such as regularly sampled random fields (lattice data), point processes, and randomly sampled spatial processes. Joint analysis of such collections of observations is clearly desirable, but complicated by the lack of an easily implementable analysis framework. It is well known that Fourier transforms provide such a framework, but its form has eluded data analysts. We formalize it by providing a multitaper analysis framework using coupled discrete and continuous data tapers, combined with the discrete Fourier transform for inference. Using this set of tools is important, as it forms the backbone for practical spectral analysis. In higher dimensions it is important not to be constrained to Cartesian product domains, and so we develop the methodology for spectral analysis using irregular domain data tapers, and the tapered discrete Fourier transform. We discuss its fast implementation, and the asymptotic as well as large finite domain properties. Estimators of partial association between different spatial processes are provided as are principled methods to determine their significance, and we demonstrate their practical utility on a large-scale ecological dataset."}, "https://arxiv.org/abs/2401.01645": {"title": "Model Averaging and Double Machine Learning", "link": "https://arxiv.org/abs/2401.01645", "description": "arXiv:2401.01645v2 Announce Type: replace \nAbstract: This paper discusses pairing double/debiased machine learning (DDML) with stacking, a model averaging method for combining multiple candidate learners, to estimate structural parameters. In addition to conventional stacking, we consider two stacking variants available for DDML: short-stacking exploits the cross-fitting step of DDML to substantially reduce the computational burden and pooled stacking enforces common stacking weights over cross-fitting folds. Using calibrated simulation studies and two applications estimating gender gaps in citations and wages, we show that DDML with stacking is more robust to partially unknown functional forms than common alternative approaches based on single pre-selected learners. 
We provide Stata and R software implementing our proposals."}, "https://arxiv.org/abs/2401.07400": {"title": "Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data", "link": "https://arxiv.org/abs/2401.07400", "description": "arXiv:2401.07400v2 Announce Type: replace \nAbstract: Investigating the relationship, particularly the lead-lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, due to technical reasons, the time points at which observations are made are not at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pair-wise time series when considering their strength of lead-lag effects with external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods."}, "https://arxiv.org/abs/2107.07575": {"title": "Optimal tests of the composite null hypothesis arising in mediation analysis", "link": "https://arxiv.org/abs/2107.07575", "description": "arXiv:2107.07575v2 Announce Type: replace-cross \nAbstract: The indirect effect of an exposure on an outcome through an intermediate variable can be identified by a product of regression coefficients under certain causal and regression modeling assumptions. In this context, the null hypothesis of no indirect effect is a composite null hypothesis, as the null holds if either regression coefficient is zero. A consequence is that traditional hypothesis tests are severely underpowered near the origin (i.e., when both coefficients are small with respect to standard errors). We propose hypothesis tests that (i) preserve level alpha type 1 error, (ii) meaningfully improve power when both true underlying effects are small relative to sample size, and (iii) preserve power when at least one is not. One approach gives a closed-form test that is minimax optimal with respect to local power over the alternative parameter space. Another uses sparse linear programming to produce an approximately optimal test for a Bayes risk criterion. We discuss adaptations for performing large-scale hypothesis testing as well as modifications that yield improved interpretability. We provide an R package that implements the minimax optimal test."}, "https://arxiv.org/abs/2205.13469": {"title": "Proximal Estimation and Inference", "link": "https://arxiv.org/abs/2205.13469", "description": "arXiv:2205.13469v3 Announce Type: replace-cross \nAbstract: We build a unifying convex analysis framework characterizing the statistical properties of a large class of penalized estimators, both under a regular and an irregular design. 
Our framework interprets penalized estimators as proximal estimators, defined by a proximal operator applied to a corresponding initial estimator. We characterize the asymptotic properties of proximal estimators, showing that their asymptotic distribution follows a closed-form formula depending only on (i) the asymptotic distribution of the initial estimator, (ii) the estimator's limit penalty subgradient and (iii) the inner product defining the associated proximal operator. In parallel, we characterize the Oracle features of proximal estimators from the properties of their penalty's subgradients. We exploit our approach to systematically cover linear regression settings with a regular or irregular design. For these settings, we build new $\\sqrt{n}-$consistent, asymptotically normal Ridgeless-type proximal estimators, which feature the Oracle property and are shown to perform satisfactorily in practically relevant Monte Carlo settings."}, "https://arxiv.org/abs/2409.18358": {"title": "A Capture-Recapture Approach to Facilitate Causal Inference for a Trial-eligible Observational Cohort", "link": "https://arxiv.org/abs/2409.18358", "description": "arXiv:2409.18358v1 Announce Type: new \nAbstract: Background: We extend recently proposed design-based capture-recapture methods for prevalence estimation among registry participants, in order to support causal inference among a trial-eligible target population. The proposed design for CRC analysis integrates an observational study cohort with a randomized trial involving a small representative study sample, and enhances the generalizability and transportability of the findings. Methods: We develop a novel CRC-type estimator derived via multinomial distribution-based maximum-likelihood that exploits the design to deliver benefits in terms of validity and efficiency for comparing the effects of two treatments on a binary outcome. Additionally, the design enables a direct standardization-type estimator for efficient estimation of general means (e.g., of biomarker levels) under a specific treatment, and for their comparison across treatments. For inference, we propose a tailored Bayesian credible interval approach to improve coverage properties in conjunction with the proposed CRC estimator for binary outcomes, along with a bootstrap percentile interval approach for use in the case of continuous outcomes. Results: Simulations demonstrate the proposed estimators derived from the CRC design. The multinomial-based maximum-likelihood estimator shows benefits in terms of validity and efficiency in treatment effect comparisons, while the direct standardization-type estimator allows comprehensive comparison of treatment effects within the target population. Conclusion: The extended CRC methods provide a useful framework for causal inference in a trial-eligible target population by integrating observational and randomized trial data. The novel estimators enhance the generalizability and transportability of findings, offering efficient and valid tools for treatment effect comparisons on both binary and continuous outcomes."}, "https://arxiv.org/abs/2409.18392": {"title": "PNOD: An Efficient Projected Newton Framework for Exact Optimal Experimental Designs", "link": "https://arxiv.org/abs/2409.18392", "description": "arXiv:2409.18392v1 Announce Type: new \nAbstract: Computing the exact optimal experimental design has been a longstanding challenge in various scientific fields. 
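To make the "proximal estimator" viewpoint of the abstract above concrete, the sketch below applies the familiar L1 proximal operator (soft-thresholding) to an initial OLS estimator. This is a generic illustration of a penalized estimator written as a proximal operator of an initial estimator, not the Ridgeless-type constructions studied in the paper.

```python
# Sketch: a penalized estimator viewed as prox_{lam*penalty} applied to an
# initial estimator -- here the L1 prox (soft-thresholding) of an OLS fit.
import numpy as np

def prox_l1(beta, lam):
    # Proximal operator of lam*||.||_1 under the Euclidean inner product.
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]   # initial (OLS) estimator
beta_prox = prox_l1(beta_init, lam=0.1)            # proximal estimator
print("initial :", beta_init.round(2))
print("proximal:", beta_prox.round(2))
```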
This problem, when formulated using a specific information function, becomes a mixed-integer nonlinear programming (MINLP) problem, which is typically NP-hard, thus making the computation of a globally optimal solution extremely difficult. The branch and bound (BnB) method is a widely used approach for solving such MINLPs, but its practical efficiency heavily relies on the ability to solve continuous relaxations effectively within the BnB search tree. In this paper, we propose a novel projected Newton framework, combined with a vertex exchange method for efficiently solving the associated subproblems, designed to enhance the BnB method. This framework offers strong convergence guarantees by utilizing recent advances in solving self-concordant optimization and convex quadratic programming problems. Extensive numerical experiments on A-optimal and D-optimal design problems, two of the most commonly used models, demonstrate the framework's promising numerical performance. Specifically, our framework significantly improves the efficiency of node evaluation within the BnB search tree and enhances the accuracy of solutions compared to state-of-the-art methods. The proposed framework is implemented in an open source Julia package called \\texttt{PNOD.jl}, which opens up possibilities for its application in a wide range of real-world scenarios."}, "https://arxiv.org/abs/2409.18527": {"title": "Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations", "link": "https://arxiv.org/abs/2409.18527", "description": "arXiv:2409.18527v1 Announce Type: new \nAbstract: Simulation studies are commonly used in methodological research for the empirical evaluation of data analysis methods. They generate artificial data sets under specified mechanisms and compare the performance of methods across conditions. However, simulation repetitions do not always produce valid outputs, e.g., due to non-convergence or other algorithmic failures. This phenomenon complicates the interpretation of results, especially when its occurrence differs between methods and conditions. Despite the potentially serious consequences of such \"missingness\", quantitative data on its prevalence and specific guidance on how to deal with it are currently limited. To this end, we reviewed 482 simulation studies published in various methodological journals and systematically assessed the prevalence and handling of missingness. We found that only 23.0% (111/482) of the reviewed simulation studies mention missingness, with even fewer reporting frequency (92/482 = 19.1%) or how it was handled (67/482 = 13.9%). We propose a classification of missingness and possible solutions. We give various recommendations, most notably to always quantify and report missingness, even if none was observed, to align missingness handling with study goals, and to share code and data for reproduction and reanalysis. Using a case study on publication bias adjustment methods, we illustrate common pitfalls and solutions."}, "https://arxiv.org/abs/2409.18550": {"title": "Iterative Trace Minimization for the Reconciliation of Very Short Hierarchical Time Series", "link": "https://arxiv.org/abs/2409.18550", "description": "arXiv:2409.18550v1 Announce Type: new \nAbstract: Time series often appear in an additive hierarchical structure. In such cases, time series on higher levels are the sums of their subordinate time series. This hierarchical structure places a natural constraint on forecasts. 
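As context for the exact-design abstract above, the toy sketch below enumerates the A-criterion over all size-k subsets of a small candidate pool; this brute force is exactly what becomes infeasible at realistic sizes and what BnB-type frameworks such as the one described are designed to avoid. The candidate matrix and design size are made up for illustration.

```python
# Toy illustration of exact A-optimal design as a combinatorial problem:
# brute-force enumeration of all size-k designs from a small candidate pool.
import itertools
import numpy as np

rng = np.random.default_rng(3)
candidates = rng.normal(size=(12, 3))   # 12 candidate design points, 3 parameters
k = 6                                   # exact design size

def a_criterion(rows):
    M = candidates[list(rows)].T @ candidates[list(rows)]   # information matrix
    return np.trace(np.linalg.inv(M))                       # A-optimality: minimise trace(M^-1)

best = min(itertools.combinations(range(12), k), key=a_criterion)
print("best design rows:", best, " A-value:", round(a_criterion(best), 3))
```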
However, univariate forecasting techniques are incapable of ensuring this forecast coherence. An obvious solution is to forecast only bottom time series and obtain higher level forecasts through aggregation. This approach is also known as the bottom-up approach. In their seminal paper, \\citet{Wickramasuriya2019} propose an optimal reconciliation approach named MinT. It tries to minimize the trace of the underlying covariance matrix of all forecast errors. The MinT algorithm has demonstrated superior performance to the bottom-up and other approaches and enjoys great popularity. This paper provides a simulation study examining the performance of MinT for very short time series and larger hierarchical structures. This scenario makes the covariance estimation required by MinT difficult. A novel iterative approach is introduced which significantly reduces the number of estimated parameters. This approach is capable of improving forecast accuracy further. The application of MinTit is also demonstrated in a case study on a semiconductor dataset based on data provided by the World Semiconductor Trade Statistics (WSTS), a premier provider of semiconductor market data."}, "https://arxiv.org/abs/2409.18603": {"title": "Which depth to use to construct functional boxplots?", "link": "https://arxiv.org/abs/2409.18603", "description": "arXiv:2409.18603v1 Announce Type: new \nAbstract: This paper answers the question of which functional depth to use to construct a boxplot for functional data. It shows that integrated depths, e.g., the popular modified band depth, do not result in well-defined boxplots. Instead, we argue that infimal depths are the only functional depths that provide a valid construction of a functional boxplot. We also show that the properties of the boxplot are completely determined by properties of the one-dimensional depth function used in defining the infimal depth for functional data. Our claims are supported by (i) a motivating example, (ii) theoretical results concerning the properties of the boxplot, and (iii) a simulation study."}, "https://arxiv.org/abs/2409.18640": {"title": "Time-Varying Multi-Seasonal AR Models", "link": "https://arxiv.org/abs/2409.18640", "description": "arXiv:2409.18640v1 Announce Type: new \nAbstract: We propose a seasonal AR model with time-varying parameter processes in both the regular and seasonal parameters. The model is parameterized to guarantee stability at every time point and can accommodate multiple seasonal periods. The time evolution is modeled by dynamic shrinkage processes to allow for long periods of essentially constant parameters, periods of rapid change as well as abrupt jumps. A Gibbs sampler is developed with a particle Gibbs update step for the AR parameter trajectories. The near-degeneracy of the model, caused by the dynamic shrinkage processes, is shown to pose a challenge for particle methods. To address this, a more robust, faster and accurate approximate sampler based on the extended Kalman filter is proposed. The model and the numerical effectiveness of the Gibbs sampler are investigated on simulated and real data. An application to more than a century of monthly US industrial production data shows clear and interesting changes in seasonality over time, particularly during the Great Depression and the recent Covid-19 pandemic. 
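For readers unfamiliar with MinT, referenced in the reconciliation abstract above, the following sketch applies the standard MinT mapping to a toy two-level hierarchy. The hierarchy, the base forecasts and the plain sample covariance used for W are illustrative assumptions; the paper's iterative estimation of W is not reproduced.

```python
# Sketch: the MinT reconciliation step for a tiny hierarchy (total = A + B).
# Base forecasts y_hat for [total, A, B] are mapped to coherent forecasts via
#   y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat,
# with W an estimate of the base forecast error covariance.
import numpy as np

S = np.array([[1, 1],    # total = A + B
              [1, 0],    # A
              [0, 1]])   # B

rng = np.random.default_rng(4)
errors = rng.normal(size=(50, 3))          # toy in-sample base forecast errors
W = np.cov(errors, rowvar=False)           # covariance estimate required by MinT

y_hat = np.array([10.4, 6.1, 3.2])         # incoherent base forecasts (6.1 + 3.2 != 10.4)
Winv = np.linalg.inv(W)
P = np.linalg.inv(S.T @ Winv @ S) @ S.T @ Winv
y_tilde = S @ (P @ y_hat)
print("reconciled:", y_tilde.round(3),
      " coherent:", np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2]))
```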
Keywords: Bayesian inference; Extended Kalman filter; Locally stationary processes; Particle MCMC; Seasonality."}, "https://arxiv.org/abs/2409.18712": {"title": "Computational and Numerical Properties of a Broadband Subspace-Based Likelihood Ratio Test", "link": "https://arxiv.org/abs/2409.18712", "description": "arXiv:2409.18712v1 Announce Type: new \nAbstract: This paper investigates the performance of a likelihood ratio test in combination with a polynomial subspace projection approach to detect weak transient signals in broadband array data. Based on previous empirical evidence that a likelihood ratio test is advantageously applied in a lower-dimensional subspace, we present analysis that highlights how the polynomial subspace projection whitens a crucial part of the signals, enabling a detector to operate with a shortened temporal window. This reduction in temporal correlation, together with a spatial compaction of the data, also leads to both computational and numerical advantages over a likelihood ratio test that is directly applied to the array data. The results of our analysis are illustrated by examples and simulations."}, "https://arxiv.org/abs/2409.18719": {"title": "New flexible versions of extended generalized Pareto model for count data", "link": "https://arxiv.org/abs/2409.18719", "description": "arXiv:2409.18719v1 Announce Type: new \nAbstract: Accurate modeling is essential for integer-valued real-world phenomena, including the distribution of entire data, zero-inflated (ZI) data, and discrete exceedances. The Poisson and Negative Binomial distributions, along with their ZI variants, are considered suitable for modeling the entire data distribution, but they fail to capture the heavy tail behavior effectively alongside the bulk of the distribution. In contrast, the discrete generalized Pareto distribution (DGPD) is preferred for high threshold exceedances, but it becomes less effective for low threshold exceedances. However, in some applications, the selection of a suitable high threshold is challenging, and the asymptotic conditions required for using DGPD are not always met. To address these limitations, extended versions of DGPD are proposed. These extensions are designed to model one of three scenarios: first, the entire distribution of the data, including both bulk and tail and bypassing the threshold selection step; second, the entire distribution along with ZI; and third, the tail of the distribution for low threshold exceedances. The proposed extensions offer improved estimates across all three scenarios compared to existing models, providing more accurate and reliable results in simulation studies and real data applications."}, "https://arxiv.org/abs/2409.18782": {"title": "Non-parametric efficient estimation of marginal structural models with multi-valued time-varying treatments", "link": "https://arxiv.org/abs/2409.18782", "description": "arXiv:2409.18782v1 Announce Type: new \nAbstract: Marginal structural models are a popular method for estimating causal effects in the presence of time-varying exposures. In spite of their popularity, no scalable non-parametric estimator exists for marginal structural models with multi-valued and time-varying treatments. In this paper, we use machine learning together with recent developments in semiparametric efficiency theory for longitudinal studies to propose such an estimator. 
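As background for the extended-DGPD abstract above, here is the usual discrete generalized Pareto pmf obtained by discretising the GPD survival function; the parameterisation is the common textbook one, and the paper's extended versions are not implemented here.

```python
# Sketch: the (standard) discrete generalized Pareto distribution obtained by
# discretising the GPD survival function, P(X >= k) = (1 + xi*k/sigma)^(-1/xi)
# for k = 0, 1, 2, ... (shown here for xi > 0 only).
import numpy as np

def dgpd_pmf(k, xi, sigma):
    k = np.asarray(k, dtype=float)
    surv = lambda x: (1 + xi * x / sigma) ** (-1 / xi)
    return surv(k) - surv(k + 1)

ks = np.arange(0, 10)
pmf = dgpd_pmf(ks, xi=0.3, sigma=2.0)
print(pmf.round(4), "partial sum over shown support:", round(pmf.sum(), 4))
```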
The proposed estimator is based on a study of the non-parametric identifying functional, including first order von-Mises expansions as well as the efficient influence function and the efficiency bound. We show conditions under which the proposed estimator is efficient, asymptotically normal, and sequentially doubly robust in the sense that it is consistent if, for each time point, either the outcome or the treatment mechanism is consistently estimated. We perform a simulation study to illustrate the properties of the estimators, and present the results of our motivating study on a COVID-19 dataset studying the impact of mobility on the cumulative number of observed cases."}, "https://arxiv.org/abs/2409.18908": {"title": "Inference with Sequential Monte-Carlo Computation of $p$-values: Fast and Valid Approaches", "link": "https://arxiv.org/abs/2409.18908", "description": "arXiv:2409.18908v1 Announce Type: new \nAbstract: Hypothesis tests calibrated by (re)sampling methods (such as permutation, rank and bootstrap tests) are useful tools for statistical analysis, at the computational cost of requiring Monte-Carlo sampling for calibration. It is common and almost universal practice to execute such tests with a predetermined and large number of Monte-Carlo samples, and disregard any randomness from this sampling at the time of drawing and reporting inference. At best, this approach leads to computational inefficiency, and at worst to invalid inference. That being said, a number of approaches in the literature have been proposed to adaptively guide analysts in choosing the number of Monte-Carlo samples, by sequentially deciding when to stop collecting samples and draw inference. These works introduce varying competing notions of what constitutes \"valid\" inference, complicating the landscape for analysts seeking suitable methodology. Furthermore, the majority of these approaches solely guarantee a meaningful estimate of the testing outcome, not the $p$-value itself $\\unicode{x2014}$ which is insufficient for many practical applications. In this paper, we survey the relevant literature, and build bridges between the scattered validity notions, highlighting some of their complementary roles. We also introduce a new practical methodology that provides an estimate of the $p$-value of the Monte-Carlo test, endowed with practically relevant validity guarantees. Moreover, our methodology is sequential, updating the $p$-value estimate after each new Monte-Carlo sample has been drawn, while retaining important validity guarantees regardless of the selected stopping time. We conclude this paper with a set of recommendations for the practitioner, both in terms of selection of methodology and manner of reporting results."}, "https://arxiv.org/abs/2409.18321": {"title": "Local Prediction-Powered Inference", "link": "https://arxiv.org/abs/2409.18321", "description": "arXiv:2409.18321v1 Announce Type: cross \nAbstract: To infer a function value at a specific point $x$, it is essential to assign higher weights to the points closer to $x$, which is called local polynomial / multivariable regression. In many practical cases, a limited sample size may ruin this method, but such conditions can be improved by the Prediction-Powered Inference (PPI) technique. This paper introduces a specific algorithm for local multivariable regression using PPI, which can significantly reduce the variance of the estimates without enlarging the error. 
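Referring back to the Monte-Carlo p-value abstract above, the baseline that the sequential procedures refine is the classical fixed-B permutation p-value with the "+1" correction, sketched below on made-up two-sample data.

```python
# Sketch: the classical Monte-Carlo permutation p-value with the "+1" correction,
# p_hat = (1 + #{permuted stats >= observed}) / (B + 1), which is valid for any
# fixed B. The sequential procedures in the paper refine *when to stop* sampling.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.4, 1, 30)     # toy two-sample data
y = rng.normal(0.0, 1, 30)

def stat(a, b):
    return abs(a.mean() - b.mean())

obs = stat(x, y)
pooled = np.concatenate([x, y])
B = 999
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)
    count += stat(perm[:30], perm[30:]) >= obs

p_hat = (1 + count) / (B + 1)
print("Monte-Carlo p-value ~", round(p_hat, 4))
```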
We analyze the confidence intervals, bias correction, and coverage probabilities, establishing the correctness and advantages of our algorithm, and we support these conclusions with numerical simulations and real-data experiments. A further contribution relative to PPI is improved computational efficiency and explainability, obtained by taking the dependency structure of the dependent variable into account."}, "https://arxiv.org/abs/2409.18374": {"title": "Adaptive Learning of the Latent Space of Wasserstein Generative Adversarial Networks", "link": "https://arxiv.org/abs/2409.18374", "description": "arXiv:2409.18374v1 Announce Type: cross \nAbstract: Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have gained a lot of interest due to their impressive performance in many fields. However, many data such as natural images usually do not populate the ambient Euclidean space but instead reside in a lower-dimensional manifold. Thus an inappropriate choice of the latent dimension fails to uncover the structure of the data, possibly resulting in a mismatch of latent representations and poor generative quality. Towards addressing these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN) that fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network in such a way that the intrinsic dimension of the learned encoding distribution is equal to the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, and simultaneously generate high-quality synthetic data by sampling from the learned latent distribution."}, "https://arxiv.org/abs/2409.18643": {"title": "Tail Risk Analysis for Financial Time Series", "link": "https://arxiv.org/abs/2409.18643", "description": "arXiv:2409.18643v1 Announce Type: cross \nAbstract: This book chapter illustrates how to apply extreme value statistics to financial time series data. Such data often exhibits strong serial dependence, which complicates assessment of tail risks. We discuss the two main approaches to tail risk estimation, unconditional and conditional quantile forecasting. We use the S&P 500 index as a case study to assess serial (extremal) dependence, perform an unconditional and conditional risk analysis, and apply backtesting methods. Additionally, the chapter explores the impact of serial dependence on multivariate tail dependence."}, "https://arxiv.org/abs/1709.01050": {"title": "Modeling Interference Via Symmetric Treatment Decomposition", "link": "https://arxiv.org/abs/1709.01050", "description": "arXiv:1709.01050v3 Announce Type: replace \nAbstract: Classical causal inference assumes treatments meant for a given unit do not have an effect on other units. This assumption is violated in interference problems, where new types of spillover causal effects arise, and causal inference becomes much more difficult. 
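As a concrete companion to the tail-risk chapter abstract above, the sketch below performs a plain peaks-over-threshold estimate of unconditional Value-at-Risk by fitting a GPD to exceedances. The threshold choice and toy loss series are assumptions, and the serial-dependence adjustments discussed in the chapter are omitted.

```python
# Sketch: unconditional tail risk (VaR) via peaks-over-threshold:
# fit a GPD to losses above a threshold u and invert the tail formula
#   VaR_p = u + (sigma/xi) * (((1 - p)/zeta_u)^(-xi) - 1),  zeta_u = P(L > u).
import numpy as np
from scipy.stats import genpareto, t as student_t

losses = student_t.rvs(df=4, size=5000, random_state=42)   # toy heavy-tailed losses

u = np.quantile(losses, 0.95)                # threshold
exc = losses[losses > u] - u                 # exceedances
xi, _, sigma = genpareto.fit(exc, floc=0)    # fit GPD with location pinned at 0
zeta_u = exc.size / losses.size

p = 0.99
var_p = u + (sigma / xi) * (((1 - p) / zeta_u) ** (-xi) - 1)
print("95% threshold:", round(u, 3), "  estimated 99% VaR:", round(var_p, 3))
```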
In addition, interference introduces a unique complication where variables may transmit treatment influences to each other, which is a relationship that has some features of a causal one, but is symmetric.\n In this paper, we develop a new approach to decomposing the spillover effect into unit-specific components that extends the DAG based treatment decomposition approach to mediation of Robins and Richardson to causal models that admit stable symmetric relationships among variables in a network. We discuss two interpretations of such models: a network structural model interpretation, and an interpretation based on equilibrium of structural equation models discussed in (Lauritzen and Richardson, 2002). We show that both interpretations yield identical identification theory, and give conditions for components of the spillover effect to be identified.\n We discuss statistical inference for identified components of the spillover effect, including a maximum likelihood estimator, and a doubly robust estimator for the special case of two interacting outcomes. We verify consistency and robustness of our estimators via a simulation study, and illustrate our method by assessing the causal effect of education attainment on depressive symptoms using the data on households from the Wisconsin Longitudinal Study."}, "https://arxiv.org/abs/2203.08879": {"title": "A Simple and Computationally Trivial Estimator for Grouped Fixed Effects Models", "link": "https://arxiv.org/abs/2203.08879", "description": "arXiv:2203.08879v4 Announce Type: replace \nAbstract: This paper introduces a new fixed effects estimator for linear panel data models with clustered time patterns of unobserved heterogeneity. The method avoids non-convex and combinatorial optimization by combining a preliminary consistent estimator of the slope coefficient, an agglomerative pairwise-differencing clustering of cross-sectional units, and a pooled ordinary least squares regression. Asymptotic guarantees are established in a framework where $T$ can grow at any power of $N$, as both $N$ and $T$ approach infinity. Unlike most existing approaches, the proposed estimator is computationally straightforward and does not require a known upper bound on the number of groups. As existing approaches, this method leads to a consistent estimation of well-separated groups and an estimator of common parameters asymptotically equivalent to the infeasible regression controlling for the true groups. An application revisits the statistical association between income and democracy."}, "https://arxiv.org/abs/2307.07068": {"title": "Scalable Resampling in Massive Generalized Linear Models via Subsampled Residual Bootstrap", "link": "https://arxiv.org/abs/2307.07068", "description": "arXiv:2307.07068v2 Announce Type: replace \nAbstract: Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple and versatile scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. 
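For orientation, the classical residual bootstrap that the subsampled variant (SRB) above is designed to accelerate looks as follows for ordinary least squares; the SRB's subsampling step itself is not reproduced here.

```python
# Sketch: the classical residual bootstrap for OLS -- the computational
# baseline that SRB speeds up for massive GLMs.
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

B = 500
boot = np.empty((B, beta_hat.size))
for b in range(B):
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)  # resample residuals
    boot[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]

se = boot.std(axis=0, ddof=1)
print("bootstrap standard errors:", se.round(3))
```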
We prove consistency and distributional results that establish that the SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository."}, "https://arxiv.org/abs/2401.07294": {"title": "Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies", "link": "https://arxiv.org/abs/2401.07294", "description": "arXiv:2401.07294v3 Announce Type: replace \nAbstract: Metamodels, or the regression analysis of Monte Carlo simulation results, provide a powerful tool to summarize simulation findings. However, an underutilized approach is the multilevel metamodel (MLMM) that accounts for the dependent data structure that arises from fitting multiple models to the same simulated data set. In this study, we articulate the theoretical rationale for the MLMM and illustrate how it can improve the interpretability of simulation results, better account for complex simulation designs, and provide new insights into the generalizability of simulation findings."}, "https://arxiv.org/abs/2401.11263": {"title": "Estimating Heterogeneous Treatment Effects on Survival Outcomes Using Counterfactual Censoring Unbiased Transformations", "link": "https://arxiv.org/abs/2401.11263", "description": "arXiv:2401.11263v2 Announce Type: replace \nAbstract: Methods for estimating heterogeneous treatment effects (HTE) from observational data have largely focused on continuous or binary outcomes, with less attention paid to survival outcomes and almost none to settings with competing risks. In this work, we develop censoring unbiased transformations (CUTs) for survival outcomes both with and without competing risks. After converting time-to-event outcomes using these CUTs, direct application of HTE learners for continuous outcomes yields consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. Our CUTs enable application of a much larger set of state of the art HTE learners for censored outcomes than had previously been available, especially in competing risks settings. We provide generic model-free learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies."}, "https://arxiv.org/abs/2212.09900": {"title": "Policy learning \"without\" overlap: Pessimism and generalized empirical Bernstein's inequality", "link": "https://arxiv.org/abs/2212.09900", "description": "arXiv:2212.09900v3 Announce Type: replace-cross \nAbstract: This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. 
As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions.\n In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL."}, "https://arxiv.org/abs/2401.03820": {"title": "Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices", "link": "https://arxiv.org/abs/2401.03820", "description": "arXiv:2401.03820v2 Announce Type: replace-cross \nAbstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We propose computationally efficient differentially private estimators and prove their minimax optimality for sub-Gaussian distributions, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, compared to the existing literature, our results accommodate a diverging rank, a broader range of signal strengths, and remain valid even when the sample size is much smaller than the dimension, provided the signal strength is sufficiently strong. Both simulation studies and real data experiments demonstrate the merits of our method."}, "https://arxiv.org/abs/2409.19230": {"title": "Propensity Score Augmentation in Matching-based Estimation of Causal Effects", "link": "https://arxiv.org/abs/2409.19230", "description": "arXiv:2409.19230v1 Announce Type: new \nAbstract: When assessing the causal effect of a binary exposure using observational data, confounder imbalance across exposure arms must be addressed. 
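Referring back to the offline policy learning abstract above, the sketch below conveys the flavour of pessimism: among a small policy class, pick the one maximising a lower confidence bound on an inverse-propensity-weighted value estimate rather than the point estimate. The data-generating process, policy class and LCB width are illustrative assumptions, not the PPL construction.

```python
# Sketch: generic pessimistic policy choice from logged data -- maximise an
# LCB on the IPW value estimate instead of the point estimate.
import numpy as np

rng = np.random.default_rng(10)
n = 5000
X = rng.uniform(-1, 1, size=n)
p1 = 0.2 + 0.6 * (X > 0)                   # known behaviour propensity for action 1
A = rng.binomial(1, p1)
R = A * X + 0.5 * rng.normal(size=n)       # action 1 is good only when X > 0

policies = {"always_0": lambda x: np.zeros_like(x, dtype=int),
            "always_1": lambda x: np.ones_like(x, dtype=int),
            "threshold": lambda x: (x > 0).astype(int)}

def lcb(policy, c=2.0):
    prop = np.where(A == 1, p1, 1 - p1)    # propensity of the logged action
    vals = (A == policy(X)) / prop * R     # IPW contributions to the policy value
    return vals.mean() - c * vals.std(ddof=1) / np.sqrt(n)

print({name: round(lcb(pi), 3) for name, pi in policies.items()})
```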
Matching methods, including propensity score-based matching, can be used to deconfound the causal relationship of interest. They have been particularly popular in practice, at least in part due to their simplicity and interpretability. However, these methods can suffer from low statistical efficiency compared to many competing methods. In this work, we propose a novel matching-based estimator of the average treatment effect based on a suitably-augmented propensity score model. Our procedure can be shown to have greater statistical efficiency than traditional matching estimators whenever prognostic variables are available, and in some cases, can nearly reach the nonparametric efficiency bound. In addition to a theoretical study, we provide numerical results to illustrate our findings. Finally, we use our novel procedure to estimate the effect of circumcision on risk of HIV-1 infection using vaccine efficacy trial data."}, "https://arxiv.org/abs/2409.19241": {"title": "Estimating Interpretable Heterogeneous Treatment Effect with Causal Subgroup Discovery in Survival Outcomes", "link": "https://arxiv.org/abs/2409.19241", "description": "arXiv:2409.19241v1 Announce Type: new \nAbstract: Estimating heterogeneous treatment effect (HTE) for survival outcomes has gained increasing attention, as it captures the variation in treatment efficacy across patients or subgroups in delaying disease progression. However, most existing methods focus on post-hoc subgroup identification rather than simultaneously estimating HTE and selecting relevant subgroups. In this paper, we propose an interpretable HTE estimation framework that integrates three meta-learners that simultaneously estimate CATE for survival outcomes and identify predictive subgroups. We evaluated the performance of our method through comprehensive simulation studies across various randomized clinical trial (RCT) settings. Additionally, we demonstrated its application in a large RCT for age-related macular degeneration (AMD), a polygenic progressive eye disease, to estimate the HTE of an antioxidant and mineral supplement on time-to-AMD progression and to identify genetics-based subgroups with enhanced treatment effects. Our method offers a direct interpretation of the estimated HTE and provides evidence to support precision healthcare."}, "https://arxiv.org/abs/2409.19287": {"title": "Factors in Fashion: Factor Analysis towards the Mode", "link": "https://arxiv.org/abs/2409.19287", "description": "arXiv:2409.19287v1 Announce Type: new \nAbstract: The modal factor model represents a new factor model for dimension reduction in high dimensional panel data. Unlike the approximate factor model that targets for the mean factors, it captures factors that influence the conditional mode of the distribution of the observables. Statistical inference is developed with the aid of mode estimation, where the modal factors and the loadings are estimated through maximizing a kernel-type objective function. An easy-to-implement alternating maximization algorithm is designed to obtain the estimators numerically. Two model selection criteria are further proposed to determine the number of factors. The asymptotic properties of the proposed estimators are established under some regularity conditions. Simulations demonstrate the nice finite sample performance of our proposed estimators, even in the presence of heavy-tailed and asymmetric idiosyncratic error distributions. 
Finally, the application to inflation forecasting illustrates the practical merits of modal factors."}, "https://arxiv.org/abs/2409.19400": {"title": "The co-varying ties between networks and item responses via latent variables", "link": "https://arxiv.org/abs/2409.19400", "description": "arXiv:2409.19400v1 Announce Type: new \nAbstract: Relationships among teachers are known to influence their teaching-related perceptions. We study whether and how teachers' advising relationships (networks) are related to their perceptions of satisfaction, students, and influence over educational policies, recorded as their responses to a questionnaire (item responses). We propose a novel joint model of network and item responses (JNIRM) with correlated latent variables to understand these co-varying ties. This methodology allows the analyst to test and interpret the dependence between a network and item responses. Using JNIRM, we discover that teachers' advising relationships contribute to their perceptions of satisfaction and students more often than their perceptions of influence over educational policies. In addition, we observe that the complementarity principle applies in certain schools, where teachers tend to seek advice from those who are different from them. JNIRM shows superior parameter estimation and model fit over separately modeling the network and item responses with latent variable models."}, "https://arxiv.org/abs/2409.19673": {"title": "Priors for Reducing Asymptotic Bias of the Posterior Mean", "link": "https://arxiv.org/abs/2409.19673", "description": "arXiv:2409.19673v1 Announce Type: new \nAbstract: It is shown that the first-order term of the asymptotic bias of the posterior mean is removed by a suitable choice of a prior density. In regular statistical models including exponential families, and linear and logistic regression models, such a prior is given by the squared Jeffreys prior. We also explain the relationship between the proposed prior distribution, the moment matching prior, and the prior distribution that reduces the bias term of the posterior mode."}, "https://arxiv.org/abs/2409.19712": {"title": "Posterior Conformal Prediction", "link": "https://arxiv.org/abs/2409.19712", "description": "arXiv:2409.19712v1 Announce Type: new \nAbstract: Conformal prediction is a popular technique for constructing prediction intervals with distribution-free coverage guarantees. The coverage is marginal, meaning it only holds on average over the entire population but not necessarily for any specific subgroup. This article introduces a new method, posterior conformal prediction (PCP), which generates prediction intervals with both marginal and approximate conditional validity for clusters (or subgroups) naturally discovered in the data. PCP achieves these guarantees by modelling the conditional conformity score distribution as a mixture of cluster distributions. Compared to other methods with approximate conditional validity, this approach produces tighter intervals, particularly when the test data is drawn from clusters that are well represented in the validation data. PCP can also be applied to guarantee conditional coverage on user-specified subgroups, in which case it achieves robust coverage on smaller subgroups within the specified subgroups. In classification, the theory underlying PCP allows for adjusting the coverage level based on the classifier's confidence, achieving significantly smaller sets than standard conformal prediction sets. 
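As background for the posterior conformal prediction abstract above, the sketch below is the vanilla split-conformal baseline with marginal coverage that PCP refines; the regressor and data are arbitrary choices, and PCP's mixture modelling of conformity scores is not implemented.

```python
# Sketch: vanilla split conformal prediction -- the marginal-coverage baseline.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(8)
n = 2000
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)

idx = rng.permutation(n)
tr, cal, te = idx[:1000], idx[1000:1500], idx[1500:]

model = GradientBoostingRegressor().fit(X[tr], y[tr])
scores = np.abs(y[cal] - model.predict(X[cal]))          # conformity scores
alpha = 0.1
q = np.quantile(scores, np.ceil((1 - alpha) * (len(cal) + 1)) / len(cal), method="higher")

pred = model.predict(X[te])
covered = np.mean((y[te] >= pred - q) & (y[te] <= pred + q))
print("empirical coverage ~", round(covered, 3))
```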
We evaluate the performance of PCP on diverse datasets from socio-economic, scientific and healthcare applications."}, "https://arxiv.org/abs/2409.19729": {"title": "Prior Sensitivity Analysis without Model Re-fit", "link": "https://arxiv.org/abs/2409.19729", "description": "arXiv:2409.19729v1 Announce Type: new \nAbstract: Prior sensitivity analysis is a fundamental method to check the effects of prior distributions on the posterior distribution in Bayesian inference. Exploring the posteriors under several alternative priors can be computationally intensive, particularly for complex latent variable models. To address this issue, we propose a novel method for quantifying the prior sensitivity that does not require model re-fit. Specifically, we present a method to compute the Hellinger and Kullback-Leibler distances between two posterior distributions with base and alternative priors, as well as posterior expectations under the alternative prior, using Monte Carlo integration based only on the base posterior distribution. This method significantly reduces computational costs in prior sensitivity analysis. We also extend the above approach for assessing the influence of hyperpriors in general latent variable models. We demonstrate the proposed method through examples of a simple normal distribution model, hierarchical binomial-beta model, and Gaussian process regression model."}, "https://arxiv.org/abs/2409.19812": {"title": "Compound e-values and empirical Bayes", "link": "https://arxiv.org/abs/2409.19812", "description": "arXiv:2409.19812v1 Announce Type: new \nAbstract: We explicitly define the notion of (exact or approximate) compound e-values which have been implicitly presented and extensively used in the recent multiple testing literature. We show that every FDR controlling procedure can be recovered by instantiating the e-BH procedure with certain compound e-values. Since compound e-values are closed under averaging, this allows for combination and derandomization of any FDR procedure. We then point out and exploit the connections to empirical Bayes. In particular, we use the fundamental theorem of compound decision theory to derive the log-optimal simple separable compound e-value for testing a set of point nulls against point alternatives: it is a ratio of mixture likelihoods. We extend universal inference to the compound setting. As one example, we construct approximate compound e-values for multiple t-tests, where the (nuisance) variances may be different across hypotheses. We provide connections to related notions in the literature stated in terms of p-values."}, "https://arxiv.org/abs/2409.19910": {"title": "Subset Simulation for High-dimensional and Multi-modal Bayesian Inference", "link": "https://arxiv.org/abs/2409.19910", "description": "arXiv:2409.19910v1 Announce Type: new \nAbstract: Bayesian analysis plays a crucial role in estimating distribution of unknown parameters for given data and model. Due to the curse of dimensionality, it is difficult to infer high-dimensional problems, especially when multiple modes exist. This paper introduces an efficient Bayesian posterior sampling algorithm that draws inspiration from subset simulation (SuS). It is based on a new interpretation of evidence from the perspective of structural reliability estimation, regarding the likelihood function as a limit state function. The posterior samples can be obtained following the principle of importance resampling as a postprocessing procedure. 
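To make the no-re-fit idea of the prior-sensitivity abstract above concrete, the sketch below re-weights draws from a base posterior by the prior ratio to approximate a posterior expectation under an alternative prior. The conjugate normal example is an assumption for illustration, and the paper's Hellinger/KL distance estimators are not reproduced.

```python
# Sketch: approximating a posterior expectation under an alternative prior by
# importance-weighting base-posterior draws with w(theta) ~ pi_alt / pi_base,
# so no model re-fit is needed.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
y = rng.normal(1.0, 1.0, size=20)                    # data, known variance 1

# Base prior: theta ~ N(0, 10^2)  =>  conjugate normal posterior, sampled directly.
prec_post = 1 / 10**2 + len(y)
mu_post = y.sum() / prec_post
draws = rng.normal(mu_post, np.sqrt(1 / prec_post), size=20000)

# Alternative prior: theta ~ N(0, 1). Re-weight the base draws.
logw = norm.logpdf(draws, 0, 1) - norm.logpdf(draws, 0, 10)
w = np.exp(logw - logw.max())
w /= w.sum()

print("posterior mean under base prior:", round(draws.mean(), 3))
print("posterior mean under alt prior :", round(np.sum(w * draws), 3))
```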
The estimation variance is derived to quantify the inherent uncertainty associated with the SuS estimator of evidence. The effective sample size is introduced to measure the quality of the posterior sampling. Three benchmark examples are first considered to illustrate the performance of the proposed algorithm by comparing it with two state-of-the-art algorithms. It is then used for finite element (FE) model updating, showing its applicability in practical engineering problems. The proposed SuS algorithm exhibits comparable or even better performance in evidence estimation and posterior sampling, compared to the aBUS and MULTINEST algorithms, especially when the dimension of unknown parameters is high. In the application of the proposed algorithm for FE model updating, satisfactory results are obtained when the configuration (number and location) of the sensor system is appropriate, underscoring the importance of adequate sensor placement around critical degrees of freedom."}, "https://arxiv.org/abs/2409.20084": {"title": "Conformal prediction for functional Ordinary kriging", "link": "https://arxiv.org/abs/2409.20084", "description": "arXiv:2409.20084v1 Announce Type: new \nAbstract: Functional Ordinary Kriging is the most widely used method to predict a curve at a given spatial point. However, uncertainty quantification remains an open issue. In this article a distribution-free prediction method based on two different modulation functions and two conformity scores is proposed. Through simulations and benchmark data analyses, we demonstrate the advantages of our approach when compared to standard methods."}, "https://arxiv.org/abs/2409.20199": {"title": "Synthetic Difference in Differences for Repeated Cross-Sectional Data", "link": "https://arxiv.org/abs/2409.20199", "description": "arXiv:2409.20199v1 Announce Type: new \nAbstract: The synthetic difference-in-differences method provides an efficient approach to estimate a causal effect with a latent factor model. However, it relies on the use of panel data. This paper presents an adaptation of the synthetic difference-in-differences method for repeated cross-sectional data. The treatment is considered to be at the group level so that it is possible to aggregate data by group to compute the two types of synthetic difference-in-differences weights on these aggregated data. Then, I develop and compute a third type of weight that accounts for the differing numbers of observations in each cross-section. Simulation results show that the performance of the synthetic difference-in-differences estimator is improved when using the third type of weights on repeated cross-sectional data."}, "https://arxiv.org/abs/2409.20262": {"title": "Bootstrap-based goodness-of-fit test for parametric families of conditional distributions", "link": "https://arxiv.org/abs/2409.20262", "description": "arXiv:2409.20262v1 Announce Type: new \nAbstract: In various scientific fields, researchers are interested in exploring the relationship between some response variable Y and a vector of covariates X. In order to make use of a specific model for the dependence structure, it first has to be checked whether the conditional density function of Y given X fits into a given parametric family. We propose a consistent bootstrap-based goodness-of-fit test for this purpose. The test statistic traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. 
As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine the critical value. A simulation study shows that, in some cases, the new method is more sensitive to deviations from the parametric model than other tests found in the literature. We also apply our method to real-world datasets."}, "https://arxiv.org/abs/2409.20415": {"title": "New Tests of Equal Forecast Accuracy for Factor-Augmented Regressions with Weaker Loadings", "link": "https://arxiv.org/abs/2409.20415", "description": "arXiv:2409.20415v1 Announce Type: new \nAbstract: We provide the theoretical foundation for the recently proposed tests of equal forecast accuracy and encompassing by Pitarakis (2023a) and Pitarakis (2023b), when the competing forecast specification is that of a factor-augmented regression model, whose loadings are allowed to be homogeneously/heterogeneously weak. This should be of interest to practitioners, as at the moment there is no theory available to justify the use of these simple and powerful tests in such a context."}, "https://arxiv.org/abs/2409.19060": {"title": "CURATE: Scaling-up Differentially Private Causal Graph Discovery", "link": "https://arxiv.org/abs/2409.19060", "description": "arXiv:2409.19060v1 Announce Type: cross \nAbstract: Causal Graph Discovery (CGD) is the process of estimating the underlying probabilistic graphical model that represents the joint distribution of features of a dataset. CGD algorithms are broadly classified into two categories: (i) constraint-based algorithms (the outcome depends on conditional independence (CI) tests), and (ii) score-based algorithms (the outcome depends on an optimized score function). Since sensitive features of observational data are prone to privacy leakage, Differential Privacy (DP) has been adopted to ensure user privacy in CGD. Adding the same amount of noise at every step of this sequential estimation process affects the predictive performance of the algorithms. As the initial CI tests in constraint-based algorithms and the later iterations of the optimization process in score-based algorithms are crucial, they need to be more accurate and less noisy. Based on this key observation, we present CURATE (CaUsal gRaph AdapTivE privacy), a DP-CGD framework with adaptive privacy budgeting. In contrast to existing DP-CGD algorithms with uniform privacy budgeting across all iterations, CURATE allows adaptive privacy budgeting by minimizing the error probability (for constraint-based algorithms) and maximizing the number of iterations of the optimization problem (for score-based algorithms), while keeping the cumulative leakage bounded. To validate our framework, we present a comprehensive set of experiments on several datasets and show that CURATE achieves higher utility compared to existing DP-CGD algorithms with less privacy leakage."}, "https://arxiv.org/abs/2409.19642": {"title": "Solving Fredholm Integral Equations of the Second Kind via Wasserstein Gradient Flows", "link": "https://arxiv.org/abs/2409.19642", "description": "arXiv:2409.19642v1 Announce Type: cross \nAbstract: Motivated by a recent method for approximate solution of Fredholm equations of the first kind, we develop a corresponding method for a class of Fredholm equations of the \\emph{second kind}. In particular, we consider the class of equations for which the solution is a probability measure. 
The approach centres around specifying a functional whose gradient flow admits a minimizer corresponding to a regularized version of the solution of the underlying equation and using a mean-field particle system to approximately simulate that flow. Theoretical support for the method is presented, along with some illustrative numerical results."}, "https://arxiv.org/abs/2409.19777": {"title": "Automatic debiasing of neural networks via moment-constrained learning", "link": "https://arxiv.org/abs/2409.19777", "description": "arXiv:2409.19777v1 Announce Type: cross \nAbstract: Causal and nonparametric estimands in economics and biostatistics can often be viewed as the mean of a linear functional applied to an unknown outcome regression function. Naively learning the regression function and taking a sample mean of the target functional results in biased estimators, and a rich debiasing literature has developed where one additionally learns the so-called Riesz representer (RR) of the target estimand (targeted learning, double ML, automatic debiasing etc.). Learning the RR via its derived functional form can be challenging, e.g. due to extreme inverse probability weights or the need to learn conditional density functions. Such challenges have motivated recent advances in automatic debiasing (AD), where the RR is learned directly via minimization of a bespoke loss. We propose moment-constrained learning as a new RR learning approach that addresses some shortcomings in AD, constraining the predicted moments and improving the robustness of RR estimates to optimization hyperparameters. Though our approach is not tied to a particular class of learner, we illustrate it using neural networks, and evaluate it on the problems of average treatment/derivative effect estimation using semi-synthetic data. Our numerical experiments show improved performance versus state-of-the-art benchmarks."}, "https://arxiv.org/abs/2409.20187": {"title": "Choosing DAG Models Using Markov and Minimal Edge Count in the Absence of Ground Truth", "link": "https://arxiv.org/abs/2409.20187", "description": "arXiv:2409.20187v1 Announce Type: cross \nAbstract: We give a novel nonparametric pointwise consistent statistical test (the Markov Checker) of the Markov condition for directed acyclic graph (DAG) or completed partially directed acyclic graph (CPDAG) models given a dataset. We also introduce the Cross-Algorithm Frugality Search (CAFS) for rejecting DAG models that either do not pass the Markov Checker test or that are not edge minimal. Edge minimality has been used previously by Raskutti and Uhler as a nonparametric simplicity criterion, though CAFS readily generalizes to other simplicity conditions. Reference to the ground truth is not necessary for CAFS, so it is useful for finding causal structure learning algorithms and tuning parameter settings that output causal models that are approximately true from a given data set. We provide a software tool for this analysis that is suitable for even quite large or dense models, provided a suitably fast pointwise consistent test of conditional independence is available. 
In addition, we show in simulation that the CAFS procedure can pick approximately correct models without knowing the ground truth."}, "https://arxiv.org/abs/2110.11856": {"title": "L-2 Regularized maximum likelihood for $\\beta$-model in large and sparse networks", "link": "https://arxiv.org/abs/2110.11856", "description": "arXiv:2110.11856v4 Announce Type: replace \nAbstract: The $\\beta$-model is a powerful tool for modeling large and sparse networks driven by degree heterogeneity, where many network models become infeasible due to computational challenges and network sparsity. However, existing estimation algorithms for the $\\beta$-model do not scale up. Also, theoretical understanding remains limited to dense networks. This paper brings several significant improvements over existing results to address the urgent needs of practice. We propose a new $\\ell_2$-penalized MLE algorithm that can comfortably handle sparse networks of millions of nodes with much-improved memory parsimony. We establish the first rate-optimal error bounds and high-dimensional asymptotic normality results for $\\beta$-models, under much weaker network sparsity assumptions than the best existing results.\n We apply our method to large COVID-19 network data sets and discover meaningful results."}, "https://arxiv.org/abs/2112.09170": {"title": "Reinforcing RCTs with Multiple Priors while Learning about External Validity", "link": "https://arxiv.org/abs/2112.09170", "description": "arXiv:2112.09170v5 Announce Type: replace \nAbstract: This paper introduces a framework for incorporating prior information into the design of sequential experiments. These sources may include past experiments, expert opinions, or the experimenter's intuition. We model the problem using a multi-prior Bayesian approach, mapping each source to a Bayesian model and aggregating them based on posterior probabilities. Policies are evaluated on three criteria: learning the parameters of payoff distributions, the probability of choosing the wrong treatment, and average rewards. Our framework demonstrates several desirable properties, including robustness to sources lacking external validity, while maintaining strong finite sample performance."}, "https://arxiv.org/abs/2210.05558": {"title": "Causal and counterfactual views of missing data models", "link": "https://arxiv.org/abs/2210.05558", "description": "arXiv:2210.05558v2 Announce Type: replace \nAbstract: It is often said that the fundamental problem of causal inference is a missing data problem -- the comparison of responses to two hypothetical treatment assignments is made difficult because for every experimental unit only one potential response is observed. In this paper, we consider the implications of the converse view: that missing data problems are a form of causal inference. We make explicit how the missing data problem of recovering the complete data law from the observed law can be viewed as identification of a joint distribution over counterfactual variables corresponding to values had we (possibly contrary to fact) been able to observe them. Drawing analogies with causal inference, we show how identification assumptions in missing data can be encoded in terms of graphical models defined over counterfactual and observed variables. We review recent results in missing data identification from this viewpoint. 
In doing so, we note interesting similarities and differences between missing data and causal identification theories."}, "https://arxiv.org/abs/2210.15829": {"title": "Estimation of Heterogeneous Treatment Effects Using a Conditional Moment Based Approach", "link": "https://arxiv.org/abs/2210.15829", "description": "arXiv:2210.15829v3 Announce Type: replace \nAbstract: We propose a new estimator for heterogeneous treatment effects in a partially linear model (PLM) with multiple exogenous covariates and a potentially endogenous treatment variable. Our approach integrates a Robinson transformation to handle the nonparametric component, the Smooth Minimum Distance (SMD) method to leverage conditional mean independence restrictions, and a Neyman-Orthogonalized first-order condition (FOC). By employing regularized model selection techniques like the Lasso method, our estimator accommodates numerous covariates while exhibiting reduced bias, consistency, and asymptotic normality. Simulations demonstrate its robust performance with diverse instrument sets compared to traditional GMM-type estimators. Applying this method to estimate Medicaid's heterogeneous treatment effects from the Oregon Health Insurance Experiment reveals more robust and reliable results than conventional GMM approaches."}, "https://arxiv.org/abs/2301.10228": {"title": "Augmenting a simulation campaign for hybrid computer model and field data experiments", "link": "https://arxiv.org/abs/2301.10228", "description": "arXiv:2301.10228v2 Announce Type: replace \nAbstract: The Kennedy and O'Hagan (KOH) calibration framework uses coupled Gaussian processes (GPs) to meta-model an expensive simulator (first GP), tune its ``knobs\" (calibration inputs) to best match observations from a real physical/field experiment and correct for any modeling bias (second GP) when predicting under new field conditions (design inputs). There are well-established methods for placement of design inputs for data-efficient planning of a simulation campaign in isolation, i.e., without field data: space-filling, or via criterion like minimum integrated mean-squared prediction error (IMSPE). Analogues within the coupled GP KOH framework are mostly absent from the literature. Here we derive a closed form IMSPE criterion for sequentially acquiring new simulator data for KOH. We illustrate how acquisitions space-fill in design space, but concentrate in calibration space. Closed form IMSPE precipitates a closed-form gradient for efficient numerical optimization. We demonstrate that our KOH-IMSPE strategy leads to a more efficient simulation campaign on benchmark problems, and conclude with a showcase on an application to equilibrium concentrations of rare earth elements for a liquid-liquid extraction reaction."}, "https://arxiv.org/abs/2312.02860": {"title": "Spectral Deconfounding for High-Dimensional Sparse Additive Models", "link": "https://arxiv.org/abs/2312.02860", "description": "arXiv:2312.02860v2 Announce Type: replace \nAbstract: Many high-dimensional data sets suffer from hidden confounding which affects both the predictors and the response of interest. In such situations, standard regression methods or algorithms lead to biased estimates. This paper substantially extends previous work on spectral deconfounding for high-dimensional linear models to the nonlinear setting and with this, establishes a proof of concept that spectral deconfounding is valid for general nonlinear models. 
Concretely, we propose an algorithm to estimate high-dimensional sparse additive models in the presence of hidden dense confounding: arguably, this is a simple yet practically useful nonlinear scope. We prove consistency and convergence rates for our method and evaluate it on synthetic data and a genetic data set."}, "https://arxiv.org/abs/2312.06289": {"title": "A graphical framework for interpretable correlation matrix models", "link": "https://arxiv.org/abs/2312.06289", "description": "arXiv:2312.06289v2 Announce Type: replace \nAbstract: In this work, we present a new approach for constructing models for correlation matrices with a user-defined graphical structure. The graphical structure makes correlation matrices interpretable and avoids the quadratic increase of parameters as a function of the dimension. We suggest an automatic approach to define a prior using a natural sequence of simpler models within the Penalized Complexity framework for the unknown parameters in these models.\n We illustrate this approach with three applications: a multivariate linear regression of four biomarkers, a multivariate disease mapping, and a multivariate longitudinal joint modelling. Each application underscores our method's intuitive appeal, signifying a substantial advancement toward a more cohesive and enlightening model that facilitates a meaningful interpretation of correlation matrices."}, "https://arxiv.org/abs/2008.07063": {"title": "To Bag is to Prune", "link": "https://arxiv.org/abs/2008.07063", "description": "arXiv:2008.07063v5 Announce Type: replace-cross \nAbstract: It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent \"true\" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts -- or better."}, "https://arxiv.org/abs/2010.13604": {"title": "A Sparse Beta Regression Model for Network Analysis", "link": "https://arxiv.org/abs/2010.13604", "description": "arXiv:2010.13604v3 Announce Type: replace-cross \nAbstract: For statistical analysis of network data, the $\\beta$-model has emerged as a useful tool, thanks to its flexibility in incorporating nodewise heterogeneity and theoretical tractability. To generalize the $\\beta$-model, this paper proposes the Sparse $\\beta$-Regression Model (S$\\beta$RM) that unites two research themes developed recently in modelling homophily and sparsity. In particular, we employ differential heterogeneity that assigns weights only to important nodes and propose penalized likelihood with an $\\ell_1$ penalty for parameter estimation. While our estimation method is closely related to the LASSO method for logistic regression, we develop new theory emphasizing the use of our model for dealing with a parameter regime that can handle sparse networks usually seen in practice. 
More interestingly, the resulting inference on the homophily parameter demands no debiasing normally employed in LASSO type estimation. We provide extensive simulation and data analysis to illustrate the use of the model. As a special case of our model, we extend the Erd\\H{o}s-R\\'{e}nyi model by including covariates and develop the associated statistical inference for sparse networks, which may be of independent interest."}, "https://arxiv.org/abs/2303.02887": {"title": "Empirical partially Bayes multiple testing and compound $\\chi^2$ decisions", "link": "https://arxiv.org/abs/2303.02887", "description": "arXiv:2303.02887v2 Announce Type: replace-cross \nAbstract: A common task in high-throughput biology is to screen for associations across thousands of units of interest, e.g., genes or proteins. Often, the data for each unit are modeled as Gaussian measurements with unknown mean and variance and are summarized as per-unit sample averages and sample variances. The downstream goal is multiple testing for the means. In this domain, it is routine to \"moderate\" (that is, to shrink) the sample variances through parametric empirical Bayes methods before computing p-values for the means. Such an approach is asymmetric in that a prior is posited and estimated for the nuisance parameters (variances) but not the primary parameters (means). Our work initiates the formal study of this paradigm, which we term \"empirical partially Bayes multiple testing.\" In this framework, if the prior for the variances were known, one could proceed by computing p-values conditional on the sample variances -- a strategy called partially Bayes inference by Sir David Cox. We show that these conditional p-values satisfy an Eddington/Tweedie-type formula and are approximated at nearly-parametric rates when the prior is estimated by nonparametric maximum likelihood. The estimated p-values can be used with the Benjamini-Hochberg procedure to guarantee asymptotic control of the false discovery rate. Even in the compound setting, wherein the variances are fixed, the approach retains asymptotic type-I error guarantees."}, "https://arxiv.org/abs/2410.00116": {"title": "Bayesian Calibration in a multi-output transposition context", "link": "https://arxiv.org/abs/2410.00116", "description": "arXiv:2410.00116v1 Announce Type: new \nAbstract: Bayesian calibration is an effective approach for ensuring that numerical simulations accurately reflect the behavior of physical systems. However, because numerical models are never perfect, a discrepancy known as model error exists between the model outputs and the observed data, and must be quantified. Conventional methods can not be implemented in transposition situations, such as when a model has multiple outputs but only one is experimentally observed. To account for the model error in this context, we propose augmenting the calibration process by introducing additional input numerical parameters through a hierarchical Bayesian model, which includes hyperparameters for the prior distribution of the calibration variables. Importance sampling estimators are used to avoid increasing computational costs. Performance metrics are introduced to assess the proposed probabilistic model and the accuracy of its predictions. The method is applied on a computer code with three outputs that models the Taylor cylinder impact test. The outputs are considered as the observed variables one at a time, to work with three different transposition situations. 
The proposed method is compared with other approaches that embed model errors to demonstrate the significance of the hierarchical formulation."}, "https://arxiv.org/abs/2410.00125": {"title": "Relative Cumulative Residual Information Measure", "link": "https://arxiv.org/abs/2410.00125", "description": "arXiv:2410.00125v1 Announce Type: new \nAbstract: In this paper, we develop a relative cumulative residual information (RCRI) measure that quantifies the divergence between two survival functions. The dynamic relative cumulative residual information (DRCRI) measure is also introduced. We establish some characterization results under the proportional hazards model assumption. Additionally, we obtain the non-parametric estimators of RCRI and DRCRI measures based on the kernel density type estimator for the survival function. The effectiveness of the estimators is assessed through an extensive Monte Carlo simulation study. We consider the data from the third Gaia data release (Gaia DR3) to demonstrate the use of the proposed measure. For this study, we have collected epoch photometry data for the objects Gaia DR3 4111834567779557376 and Gaia DR3 5090605830056251776."}, "https://arxiv.org/abs/2410.00142": {"title": "On the posterior property of the Rician distribution", "link": "https://arxiv.org/abs/2410.00142", "description": "arXiv:2410.00142v1 Announce Type: new \nAbstract: The Rician distribution, a well-known statistical distribution frequently encountered in fields like magnetic resonance imaging and wireless communications, is particularly useful for describing many real phenomena such as signal processing data. In this paper, we introduce objective Bayesian inference for the Rician distribution parameters; specifically, the Jeffreys rule and Jeffreys priors are derived. We prove that the first prior leads to an improper posterior, while the Jeffreys prior leads to a proper distribution. To evaluate the effectiveness of our proposed Bayesian estimation method, we perform extensive numerical simulations and compare the results with those obtained from traditional moment-based and maximum likelihood estimators. Our simulations illustrate that the Bayesian estimators derived from the Jeffreys prior provide nearly unbiased estimates, showcasing the advantages of our approach over classical techniques."}, "https://arxiv.org/abs/2410.00183": {"title": "Generalised mixed effects models for changepoint analysis of biomedical time series data", "link": "https://arxiv.org/abs/2410.00183", "description": "arXiv:2410.00183v1 Announce Type: new \nAbstract: Motivated by two distinct types of biomedical time series data, digital health monitoring and neuroimaging, we develop a novel approach for changepoint analysis that uses a generalised linear mixed model framework. The generalised linear mixed model framework lets us incorporate structure that is usually present in biomedical time series data. We embed the mixed model in a dynamic programming algorithm for detecting multiple changepoints in the fMRI data. We evaluate the performance of our proposed method across several scenarios using simulations. 
Finally, we show the utility of our proposed method on our two distinct motivating applications."}, "https://arxiv.org/abs/2410.00217": {"title": "Valid Inference on Functions of Causal Effects in the Absence of Microdata", "link": "https://arxiv.org/abs/2410.00217", "description": "arXiv:2410.00217v1 Announce Type: new \nAbstract: Economists are often interested in functions of multiple causal effects, a leading example of which is evaluating the cost-effectiveness of a government policy. In such settings, the benefits and costs might be captured by multiple causal effects and aggregated into a scalar measure of cost-effectiveness. Oftentimes, the microdata underlying these estimates is not accessible; only the published estimates and their corresponding standard errors are available for post-hoc analysis. We provide a method to conduct inference on functions of causal effects when the only information available is the point estimates and their corresponding standard errors. We apply our method to conduct inference on the Marginal Value of Public Funds (MVPF) for 8 different policies, and show that even in the absence of any microdata, it is possible to conduct valid and meaningful inference on the MVPF."}, "https://arxiv.org/abs/2410.00259": {"title": "Robust Emax Model Fitting: Addressing Nonignorable Missing Binary Outcome in Dose-Response Analysis", "link": "https://arxiv.org/abs/2410.00259", "description": "arXiv:2410.00259v1 Announce Type: new \nAbstract: The Binary Emax model is widely employed in dose-response analysis during drug development, where missing data often pose significant challenges. Addressing nonignorable missing binary responses, where the likelihood of missing data is related to unobserved outcomes, is particularly important, yet existing methods often lead to biased estimates. This issue is compounded when using the regulatory-recommended imputing as treatment failure approach, known as non-responder imputation. Moreover, the problem of separation, where a predictor perfectly distinguishes between outcome classes, can further complicate likelihood maximization. In this paper, we introduce a penalized likelihood-based method that integrates a modified Expectation-Maximization algorithm in the spirit of Ibrahim and Lipsitz to effectively manage both nonignorable missing data and separation issues. Our approach applies a noninformative Jeffreys prior to the likelihood, reducing bias in parameter estimation. Simulation studies demonstrate that our method outperforms existing methods, such as NRI, and the superiority is further supported by its application to data from a Phase II clinical trial. Additionally, we have developed an R package, ememax, to facilitate the implementation of the proposed method."}, "https://arxiv.org/abs/2410.00300": {"title": "Visualization for departures from symmetry with the power-divergence-type measure in two-way contingency tables", "link": "https://arxiv.org/abs/2410.00300", "description": "arXiv:2410.00300v1 Announce Type: new \nAbstract: When the row and column variables consist of the same category in a two-way contingency table, it is specifically called a square contingency table. Since it is clear that the square contingency tables have an association structure, a primary objective is to examine symmetric relationships and transitions between variables. 
While various models and measures have been proposed to analyze these structures and to understand changes in behavior between two time points or cohorts, a detailed investigation of individual categories and their interrelationships, such as shifts in brand preferences, is also necessary. This paper proposes a novel approach to correspondence analysis (CA) for evaluating departures from symmetry in square contingency tables with nominal categories, using a power-divergence-type measure. The approach ensures that well-known divergences can also be visualized and, regardless of the divergence used, the CA plot consists of two principal axes with equal contribution rates. Additionally, the scaling is independent of sample size, making it well-suited for comparing departures from symmetry across multiple contingency tables. Confidence regions are also constructed to enhance the accuracy of the CA plot."}, "https://arxiv.org/abs/2410.00338": {"title": "The generalized Nelson--Aalen estimator by inverse probability of treatment weighting", "link": "https://arxiv.org/abs/2410.00338", "description": "arXiv:2410.00338v1 Announce Type: new \nAbstract: Inverse probability of treatment weighting (IPTW) has been widely applied in causal inference. For time-to-event outcomes, IPTW is performed by weighting the event counting process and at-risk process, resulting in a generalized Nelson--Aalen estimator for population-level hazards. In the presence of competing events, we adopt the counterfactual cumulative incidence of a primary event as the estimand. When the propensity score is estimated, we derive the influence function of the hazard estimator, and then establish the asymptotic property of the incidence estimator. We show that the uncertainty in the estimated propensity score contributes to an additional variation in the IPTW estimator of the cumulative incidence. However, through simulation and real-data application, we find that the additional variation is usually small."}, "https://arxiv.org/abs/2410.00370": {"title": "Covariate Adjusted Functional Mixed Membership Models", "link": "https://arxiv.org/abs/2410.00370", "description": "arXiv:2410.00370v1 Announce Type: new \nAbstract: Mixed membership models are a flexible class of probabilistic data representations used for unsupervised and semi-supervised learning, allowing each observation to partially belong to multiple clusters or features. In this manuscript, we extend the framework of functional mixed membership models to allow for covariate-dependent adjustments. The proposed model utilizes a multivariate Karhunen-Lo\\`eve decomposition, which allows for a scalable and flexible model. Within this framework, we establish a set of sufficient conditions ensuring the identifiability of the mean, covariance, and allocation structure up to a permutation of the labels. This manuscript is primarily motivated by studies on functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD). Specifically, we are interested in characterizing the heterogeneity of alpha oscillations for typically developing (TD) children and children with ASD. Since alpha oscillations are known to change as children develop, we aim to characterize the heterogeneity of alpha oscillations conditionally on the age of the child. 
Using the proposed framework, we were able to gain novel information on the developmental trajectories of alpha oscillations for children with ASD and how the developmental trajectories differ between TD children and children with ASD."}, "https://arxiv.org/abs/2410.00566": {"title": "Research Frontiers in Ambit Stochastics: In memory of Ole E", "link": "https://arxiv.org/abs/2410.00566", "description": "arXiv:2410.00566v1 Announce Type: new \nAbstract: This article surveys key aspects of ambit stochastics and remembers Ole E. Barndorff-Nielsen's important contributions to the foundation and advancement of this new research field over the last two decades. It also highlights some of the emerging trends in ambit stochastics."}, "https://arxiv.org/abs/2410.00574": {"title": "Asymmetric GARCH modelling without moment conditions", "link": "https://arxiv.org/abs/2410.00574", "description": "arXiv:2410.00574v1 Announce Type: new \nAbstract: There is a serious and long-standing restriction in the literature on heavy-tailed phenomena in that moment conditions, which are unrealistic, are almost always assumed in modelling such phenomena. Further, the issue of stability is often insufficiently addressed. To this end, we develop a comprehensive statistical inference for an asymmetric generalized autoregressive conditional heteroskedasticity model with standardized non-Gaussian symmetric stable innovation (sAGARCH) in a unified framework, covering both the stationary case and the explosive case. We consider first the maximum likelihood estimation of the model including the asymptotic properties of the estimator of the stable exponent parameter among others. We then propose a modified Kolmogorov-type test statistic for diagnostic checking, as well as those for strict stationarity and asymmetry testing. We conduct Monte Carlo simulation studies to examine the finite-sample performance of our entire statistical inference procedure. We include empirical examples of stock returns to highlight the usefulness and merits of our sAGARCH model."}, "https://arxiv.org/abs/2410.00662": {"title": "Bias in mixed models when analysing longitudinal data subject to irregular observation: when should we worry about it and how can recommended visit intervals help in specifying joint models when needed?", "link": "https://arxiv.org/abs/2410.00662", "description": "arXiv:2410.00662v1 Announce Type: new \nAbstract: In longitudinal studies using routinely collected data, such as electronic health records (EHRs), patients tend to have more measurements when they are unwell; this informative observation pattern may lead to bias. While semi-parametric approaches to modelling longitudinal data subject to irregular observation are known to be sensitive to misspecification of the visit process, parametric models may provide a more robust alternative. Robustness of parametric models on the outcome alone has been assessed under the assumption that the visit intensity is independent of the time since the last visit, given the covariates and random effects. However, this assumption of a memoryless visit process may not be realistic in the context of EHR data. In a special case which includes memory embedded into the visit process, we derive an expression for the bias in parametric models for the outcome alone and use this to identify factors that lead to increasing bias. Using simulation studies, we show that this bias is often small in practice. 
We suggest diagnostics for identifying the specific cases when the outcome model may be susceptible to meaningful bias, and propose a novel joint model of the outcome and visit processes that can eliminate or reduce the bias. We apply these diagnostics and the joint model to a study of juvenile dermatomyositis. We recommend that future studies using EHR data avoid relying only on the outcome model and instead first evaluate its appropriateness with our proposed diagnostics, applying our proposed joint model if necessary."}, "https://arxiv.org/abs/2410.00733": {"title": "A Nonparametric Test of Heterogeneous Treatment Effects under Interference", "link": "https://arxiv.org/abs/2410.00733", "description": "arXiv:2410.00733v1 Announce Type: new \nAbstract: Statistical inference of heterogeneous treatment effects (HTEs) across predefined subgroups is challenging when units interact because treatment effects may vary by pre-treatment variables, post-treatment exposure variables (that measure the exposure to other units' treatment statuses), or both. Thus, the conventional HTEs testing procedures may be invalid under interference. In this paper, I develop statistical methods to infer HTEs and disentangle the drivers of treatment effects heterogeneity in populations where units interact. Specifically, I incorporate clustered interference into the potential outcomes model and propose kernel-based test statistics for the null hypotheses of (i) no HTEs by treatment assignment (or post-treatment exposure variables) for all pre-treatment variables values and (ii) no HTEs by pre-treatment variables for all treatment assignment vectors. I recommend a multiple-testing algorithm to disentangle the source of heterogeneity in treatment effects. I prove the asymptotic properties of the proposed test statistics. Finally, I illustrate the application of the test procedures in an empirical setting using an experimental data set from a Chinese weather insurance program."}, "https://arxiv.org/abs/2410.00781": {"title": "Modeling Neural Switching via Drift-Diffusion Models", "link": "https://arxiv.org/abs/2410.00781", "description": "arXiv:2410.00781v1 Announce Type: new \nAbstract: Neural encoding, or neural representation, is a field in neuroscience that focuses on characterizing how information is encoded in the spiking activity of neurons. Currently, little is known about how sensory neurons can preserve information from multiple stimuli given their broad receptive fields. Multiplexing is a neural encoding theory that posits that neurons temporally switch between encoding various stimuli in their receptive field. Here, we construct a statistically falsifiable single-neuron model for multiplexing using a competition-based framework. The spike train models are constructed using drift-diffusion models, implying an integrate-and-fire framework to model the temporal dynamics of the membrane potential of the neuron. In addition to a multiplexing-specific model, we develop alternative models that represent alternative encoding theories (normalization, winner-take-all, subadditivity, etc.) with some level of abstraction. Using information criteria, we perform model comparison to determine whether the data favor multiplexing over alternative theories of neural encoding. 
Analysis of spike trains from the inferior colliculus of two macaque monkeys provides tenable evidence of multiplexing and offers new insight into the timescales at which switching occurs."}, "https://arxiv.org/abs/2410.00845": {"title": "Control Variate-based Stochastic Sampling from the Probability Simplex", "link": "https://arxiv.org/abs/2410.00845", "description": "arXiv:2410.00845v1 Announce Type: new \nAbstract: This paper presents a control variate-based Markov chain Monte Carlo algorithm for efficient sampling from the probability simplex, with a focus on applications in large-scale Bayesian models such as latent Dirichlet allocation. Standard Markov chain Monte Carlo methods, particularly those based on Langevin diffusions, suffer from significant discretization errors near the boundaries of the simplex, which are exacerbated in sparse data settings. To address this issue, we propose an improved approach based on the stochastic Cox--Ingersoll--Ross process, which eliminates discretization errors and enables exact transition densities. Our key contribution is the integration of control variates, which significantly reduces the variance of the stochastic gradient estimator in the Cox--Ingersoll--Ross process, thereby enhancing the accuracy and computational efficiency of the algorithm. We provide a theoretical analysis showing the variance reduction achieved by the control variates approach and demonstrate the practical advantages of our method in data subsampling settings. Empirical results on large datasets show that the proposed method outperforms existing approaches in both accuracy and scalability."}, "https://arxiv.org/abs/2410.00002": {"title": "Machine Learning and Econometric Approaches to Fiscal Policies: Understanding Industrial Investment Dynamics in Uruguay (1974-2010)", "link": "https://arxiv.org/abs/2410.00002", "description": "arXiv:2410.00002v1 Announce Type: cross \nAbstract: This paper examines the impact of fiscal incentives on industrial investment in Uruguay from 1974 to 2010. Using a mixed-method approach that combines econometric models with machine learning techniques, the study investigates both the short-term and long-term effects of fiscal benefits on industrial investment. The results confirm the significant role of fiscal incentives in driving long-term industrial growth, while also highlighting the importance of a stable macroeconomic environment, public investment, and access to credit. Machine learning models provide additional insights into nonlinear interactions between fiscal benefits and other macroeconomic factors, such as exchange rates, emphasizing the need for tailored fiscal policies. The findings have important policy implications, suggesting that fiscal incentives, when combined with broader economic reforms, can effectively promote industrial development in emerging economies."}, "https://arxiv.org/abs/2410.00301": {"title": "Network Science in Psychology", "link": "https://arxiv.org/abs/2410.00301", "description": "arXiv:2410.00301v1 Announce Type: cross \nAbstract: Social network analysis can answer research questions such as why or how individuals interact or form relationships and how those relationships impact other outcomes. Despite the breadth of methods available to address psychological research questions, social network analysis is not yet a standard practice in psychological research. 
To promote the use of social network analysis in psychological research, we present an overview of network methods, situating each method within the context of research studies and questions in psychology."}, "https://arxiv.org/abs/2410.00865": {"title": "How should we aggregate ratings? Accounting for personal rating scales via Wasserstein barycenters", "link": "https://arxiv.org/abs/2410.00865", "description": "arXiv:2410.00865v1 Announce Type: cross \nAbstract: A common method of making quantitative conclusions in qualitative situations is to collect numerical ratings on a linear scale. We investigate the problem of calculating aggregate numerical ratings from individual numerical ratings and propose a new, non-parametric model for the problem. We show that, with minimal modeling assumptions, the equal-weights average is inconsistent for estimating the quality of items. Analyzing the problem from the perspective of optimal transport, we derive an alternative rating estimator, which we show is asymptotically consistent almost surely and in $L^p$ for estimating quality, with an optimal rate of convergence. Further, we generalize Kendall's W, a non-parametric coefficient of preference concordance between raters, from the special case of rankings to the more general case of arbitrary numerical ratings. Along the way, we prove Glivenko--Cantelli-type theorems for uniform convergence of the cumulative distribution functions and quantile functions for Wasserstein-2 Fr\\'echet means on [0,1]."}, "https://arxiv.org/abs/2212.08766": {"title": "Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio", "link": "https://arxiv.org/abs/2212.08766", "description": "arXiv:2212.08766v2 Announce Type: replace \nAbstract: In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or \"feature statistic\" to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one's ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a \"local dependence\" condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is misspecified. 
We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso."}, "https://arxiv.org/abs/2401.08626": {"title": "Validation and Comparison of Non-Stationary Cognitive Models: A Diffusion Model Application", "link": "https://arxiv.org/abs/2401.08626", "description": "arXiv:2401.08626v3 Announce Type: replace-cross \nAbstract: Cognitive processes undergo various fluctuations and transient states across different temporal scales. Superstatistics are emerging as a flexible framework for incorporating such non-stationary dynamics into existing cognitive model classes. In this work, we provide the first experimental validation of superstatistics and formal comparison of four non-stationary diffusion decision models in a specifically designed perceptual decision-making task. Task difficulty and speed-accuracy trade-off were systematically manipulated to induce expected changes in model parameters. To validate our models, we assess whether the inferred parameter trajectories align with the patterns and sequences of the experimental manipulations. To address computational challenges, we present novel deep learning techniques for amortized Bayesian estimation and comparison of models with time-varying parameters. Our findings indicate that transition models incorporating both gradual and abrupt parameter shifts provide the best fit to the empirical data. Moreover, we find that the inferred parameter trajectories closely mirror the sequence of experimental manipulations. Posterior re-simulations further underscore the ability of the models to faithfully reproduce critical data patterns. Accordingly, our results suggest that the inferred non-stationary dynamics may reflect actual changes in the targeted psychological constructs. We argue that our initial experimental validation paves the way for the widespread application of superstatistics in cognitive modeling and beyond."}, "https://arxiv.org/abs/2410.00931": {"title": "A simple emulator that enables interpretation of parameter-output relationships, applied to two climate model PPEs", "link": "https://arxiv.org/abs/2410.00931", "description": "arXiv:2410.00931v1 Announce Type: new \nAbstract: We present a new additive method, nicknamed sage for Simplified Additive Gaussian processes Emulator, to emulate climate model Perturbed Parameter Ensembles (PPEs). It estimates the value of a climate model output as the sum of additive terms. Each additive term is the mean of a Gaussian Process, and corresponds to the impact of a parameter or parameter group on the variable of interest. This design caters to the sparsity of PPEs which are characterized by limited ensemble members and high dimensionality of the parameter space. sage quantifies the variability explained by different parameters and parameter groups, providing additional insights on the parameter-climate model output relationship. We apply the method to two climate model PPEs and compare it to a fully connected Neural Network. The two methods have comparable performance with both PPEs, but sage provides insights on parameter and parameter group importance as well as diagnostics useful for optimizing PPE design. Insights gained are valid regardless of the emulator method used, and have not been previously addressed. 
Our work highlights that analyzing the PPE used to train an emulator is different from analyzing data generated from an emulator trained on the PPE, as the former provides more insights on the data structure in the PPE which could help inform the emulator design."}, "https://arxiv.org/abs/2410.00971": {"title": "Data-Driven Random Projection and Screening for High-Dimensional Generalized Linear Models", "link": "https://arxiv.org/abs/2410.00971", "description": "arXiv:2410.00971v1 Announce Type: new \nAbstract: We address the challenge of correlated predictors in high-dimensional GLMs, where regression coefficients range from sparse to dense, by proposing a data-driven random projection method. This is particularly relevant for applications where the number of predictors is (much) larger than the number of observations and the underlying structure -- whether sparse or dense -- is unknown. We achieve this by using ridge-type estimates for variable screening and random projection to incorporate information about the response-predictor relationship when performing dimensionality reduction. We demonstrate that a ridge estimator with a small penalty is effective for random projection and screening, but the penalty value must be carefully selected. Unlike in linear regression, where penalties approaching zero work well, this approach leads to overfitting in non-Gaussian families. Instead, we recommend a data-driven method for penalty selection. This data-driven random projection improves prediction performance over conventional random projections, even surpassing benchmarks like elastic net. Furthermore, an ensemble of multiple such random projections combined with probabilistic variable screening delivers the best aggregated results in prediction and variable ranking across varying sparsity levels in simulations at a rather low computational cost. Finally, three applications with count and binary responses demonstrate the method's advantages in interpretability and prediction accuracy."}, "https://arxiv.org/abs/2410.00985": {"title": "Nonparametric tests of treatment effect homogeneity for policy-makers", "link": "https://arxiv.org/abs/2410.00985", "description": "arXiv:2410.00985v1 Announce Type: new \nAbstract: Recent work has focused on nonparametric estimation of conditional treatment effects, but inference has remained relatively unexplored. We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalized decision rule differs from using a rule that discards covariates. The proposal is thus relevant for guiding treatment policies. The utility of the proposal is borne out in simulation studies and a re-analysis of an AIDS clinical trial."}, "https://arxiv.org/abs/2410.01008": {"title": "Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data", "link": "https://arxiv.org/abs/2410.01008", "description": "arXiv:2410.01008v1 Announce Type: new \nAbstract: The Tweedie exponential dispersion family is a popular choice among many to model insurance losses that consist of zero-inflated semicontinuous data. 
In such data, it is often important to obtain credibility (inference) of the most important features that describe the endogenous variables. Post-selection inference is the standard procedure in statistics to obtain confidence intervals of model parameters after performing a feature extraction procedure. For a linear model, the lasso estimate often has non-negligible estimation bias for large coefficients corresponding to exogenous variables. To have valid inference on those coefficients, it is necessary to correct the bias of the lasso estimate. Traditional statistical methods, such as hypothesis testing or standard confidence interval construction, might lead to incorrect conclusions during post-selection, as they are generally too optimistic. Here we discuss a few methodologies for constructing confidence intervals of the coefficients after feature selection in the Generalized Linear Model (GLM) family with application to insurance data."}, "https://arxiv.org/abs/2410.01051": {"title": "A class of priors to perform asymmetric Bayesian wavelet shrinkage", "link": "https://arxiv.org/abs/2410.01051", "description": "arXiv:2410.01051v1 Announce Type: new \nAbstract: This paper proposes a class of asymmetric priors to perform Bayesian wavelet shrinkage in the standard nonparametric regression model with Gaussian error. The priors are composed of mixtures of a point mass function at zero and one of the following distributions: asymmetric beta, Kumaraswamy, asymmetric triangular or skew normal. Statistical properties of the associated shrinkage rules such as squared bias, variance and risks are obtained numerically and discussed. Monte Carlo simulation studies are described to evaluate the performances of the rules against standard techniques. An application of the asymmetric rules to a stock market index time series is also illustrated."}, "https://arxiv.org/abs/2410.01063": {"title": "Functional summary statistics and testing for independence in marked point processes on the surface of three dimensional convex shapes", "link": "https://arxiv.org/abs/2410.01063", "description": "arXiv:2410.01063v1 Announce Type: new \nAbstract: The fundamental functional summary statistics used for studying spatial point patterns are developed for marked homogeneous and inhomogeneous point processes on the surface of a sphere. These are extended to point processes on the surface of three dimensional convex shapes given that the bijective mapping from the shape to the sphere is known. These functional summary statistics are used to test for independence between the marginals of multi-type spatial point processes with methods for sampling the null distribution proposed and discussed. This is illustrated on both simulated data and the RNGC galaxy point pattern, revealing attractive dependencies between different galaxy types."}, "https://arxiv.org/abs/2410.01133": {"title": "A subcopula characterization of dependence for the Multivariate Bernoulli Distribution", "link": "https://arxiv.org/abs/2410.01133", "description": "arXiv:2410.01133v1 Announce Type: new \nAbstract: By applying Sklar's theorem to the Multivariate Bernoulli Distribution (MBD), a framework is proposed that decouples the marginal distributions from the dependence structure, providing a clearer understanding of how binary variables interact. Explicit formulas are derived under the MBD using subcopulas to introduce dependence measures for interactions of all orders, not just pairwise. 
A Bayesian inference approach is also applied to estimate the parameters of the MBD, offering practical tools for parameter estimation and dependence analysis in real-world applications. The results obtained contribute to the application of subcopulas to multivariate binary data, with a real data example of comorbidities in COVID-19 patients."}, "https://arxiv.org/abs/2410.01159": {"title": "Partially Identified Heterogeneous Treatment Effect with Selection: An Application to Gender Gaps", "link": "https://arxiv.org/abs/2410.01159", "description": "arXiv:2410.01159v1 Announce Type: new \nAbstract: This paper addresses the sample selection model within the context of the gender gap problem, where even random treatment assignment is affected by selection bias. By offering a robust alternative free from distributional or specification assumptions, we bound the treatment effect under the sample selection model with an exclusion restriction, an assumption whose validity is tested in the literature. This exclusion restriction allows for further segmentation of the population into distinct types based on observed and unobserved characteristics. For each type, we derive the proportions and bound the gender gap accordingly. Notably, trends in type proportions and gender gap bounds reveal an increasing proportion of always-working individuals over time, alongside variations in bounds, including a general decline across time and consistently higher bounds for those in high-potential wage groups. Further analysis, considering additional assumptions, highlights persistent gender gaps for some types, while other types exhibit differing or inconclusive trends. This underscores the necessity of separating individuals by type to understand the heterogeneous nature of the gender gap."}, "https://arxiv.org/abs/2410.01163": {"title": "Perturbation-Robust Predictive Modeling of Social Effects by Network Subspace Generalized Linear Models", "link": "https://arxiv.org/abs/2410.01163", "description": "arXiv:2410.01163v1 Announce Type: new \nAbstract: Network-linked data, where multivariate observations are interconnected by a network, are becoming increasingly prevalent in fields such as sociology and biology. These data often exhibit inherent noise and complex relational structures, complicating conventional modeling and statistical inference. Motivated by empirical challenges in analyzing such data sets, this paper introduces a family of network subspace generalized linear models designed for analyzing noisy, network-linked data. We propose a model inference method based on subspace-constrained maximum likelihood, which emphasizes flexibility in capturing network effects and provides a robust inference framework against network perturbations. We establish the asymptotic distributions of the estimators under network perturbations, demonstrating the method's accuracy through extensive simulations involving random network models and deep-learning-based embedding algorithms. 
The proposed methodology is applied to a comprehensive analysis of a large-scale study on school conflicts, where it identifies significant social effects, offering meaningful and interpretable insights into student behaviors."}, "https://arxiv.org/abs/2410.01175": {"title": "Forecasting short-term inflation in Argentina with Random Forest Models", "link": "https://arxiv.org/abs/2410.01175", "description": "arXiv:2410.01175v1 Announce Type: new \nAbstract: This paper examines the performance of Random Forest models in forecasting short-term monthly inflation in Argentina, based on a database of monthly indicators since 1962. It is found that these models achieve forecast accuracy that is statistically comparable to the consensus of market analysts' expectations surveyed by the Central Bank of Argentina (BCRA) and to traditional econometric models. One advantage of Random Forest models is that, as they are non-parametric, they allow for the exploration of nonlinear effects in the predictive power of certain macroeconomic variables on inflation. Among other findings, the relative importance of the exchange rate gap in forecasting inflation increases when the gap between the parallel and official exchange rates exceeds 60%. The predictive power of the exchange rate on inflation rises when the BCRA's net international reserves are negative or close to zero (specifically, below USD 2 billion). The relative importance of inflation inertia and the nominal interest rate in forecasting the following month's inflation increases when the nominal levels of inflation and/or interest rates rise."}, "https://arxiv.org/abs/2410.01283": {"title": "Bayesian estimation for novel geometric INGARCH model", "link": "https://arxiv.org/abs/2410.01283", "description": "arXiv:2410.01283v1 Announce Type: new \nAbstract: This paper introduces an integer-valued generalized autoregressive conditional heteroskedasticity (INGARCH) model based on the novel geometric distribution and discusses some of its properties. The parameter estimation problem of the model is studied by conditional maximum likelihood and by a Bayesian approach using the Hamiltonian Monte Carlo (HMC) algorithm. The results of the simulation studies and real data analysis affirm the good performance of the estimators and the model."}, "https://arxiv.org/abs/2410.01475": {"title": "Exploring Learning Rate Selection in Generalised Bayesian Inference using Posterior Predictive Checks", "link": "https://arxiv.org/abs/2410.01475", "description": "arXiv:2410.01475v1 Announce Type: new \nAbstract: Generalised Bayesian Inference (GBI) attempts to address model misspecification in a standard Bayesian setup by tempering the likelihood. The likelihood is raised to a fractional power, called the learning rate, which reduces its importance in the posterior and has been established as a method to address certain kinds of model misspecification. Posterior Predictive Checks (PPC) attempt to detect model misspecification by locating a diagnostic, computed on the observed data, within the posterior predictive distribution of the diagnostic. This can be used to construct a hypothesis test where a small $p$-value indicates potential misfit. The recent Embedded Diachronic Sense Change (EDiSC) model suffers from misspecification and benefits from likelihood tempering. Using EDiSC as a case study, this exploratory work examines whether PPC could be used in a novel way to set the learning rate in a GBI setup. 
Specifically, the learning rate selected is the lowest value for which a hypothesis test using the log likelihood diagnostic is not rejected at the 10% level. The experimental results are promising, though not definitive, and indicate the need for further research along the lines suggested here."}, "https://arxiv.org/abs/2410.01658": {"title": "Smaller Confidence Intervals From IPW Estimators via Data-Dependent Coarsening", "link": "https://arxiv.org/abs/2410.01658", "description": "arXiv:2410.01658v1 Announce Type: new \nAbstract: Inverse propensity-score weighted (IPW) estimators are prevalent in causal inference for estimating average treatment effects in observational studies. Under unconfoundedness, given accurate propensity scores and $n$ samples, the size of confidence intervals of IPW estimators scales down with $n$, and, several of their variants improve the rate of scaling. However, neither IPW estimators nor their variants are robust to inaccuracies: even if a single covariate has an $\\varepsilon>0$ additive error in the propensity score, the size of confidence intervals of these estimators can increase arbitrarily. Moreover, even without errors, the rate with which the confidence intervals of these estimators go to zero with $n$ can be arbitrarily slow in the presence of extreme propensity scores (those close to 0 or 1).\n We introduce a family of Coarse IPW (CIPW) estimators that captures existing IPW estimators and their variants. Each CIPW estimator is an IPW estimator on a coarsened covariate space, where certain covariates are merged. Under mild assumptions, e.g., Lipschitzness in expected outcomes and sparsity of extreme propensity scores, we give an efficient algorithm to find a robust estimator: given $\\varepsilon$-inaccurate propensity scores and $n$ samples, its confidence interval size scales with $\\varepsilon+1/\\sqrt{n}$. In contrast, under the same assumptions, existing estimators' confidence interval sizes are $\\Omega(1)$ irrespective of $\\varepsilon$ and $n$. Crucially, our estimator is data-dependent and we show that no data-independent CIPW estimator can be robust to inaccuracies."}, "https://arxiv.org/abs/2410.01783": {"title": "On metric choice in dimension reduction for Fr\\'echet regression", "link": "https://arxiv.org/abs/2410.01783", "description": "arXiv:2410.01783v1 Announce Type: new \nAbstract: Fr\\'echet regression is becoming a mainstay in modern data analysis for analyzing non-traditional data types belonging to general metric spaces. This novel regression method utilizes the pairwise distances between the random objects, which makes the choice of metric crucial in the estimation. In this paper, the effect of metric choice on the estimation of the dimension reduction subspace for the regression between random responses and Euclidean predictors is investigated. Extensive numerical studies illustrate how different metrics affect the central and central mean space estimates for regression involving responses belonging to some popular metric spaces versus Euclidean predictors. 
An analysis of the distributions of glycaemia based on continuous glucose monitoring data demonstrates how metric choice can influence findings in real applications."}, "https://arxiv.org/abs/2410.01168": {"title": "MDDC: An R and Python Package for Adverse Event Identification in Pharmacovigilance Data", "link": "https://arxiv.org/abs/2410.01168", "description": "arXiv:2410.01168v1 Announce Type: cross \nAbstract: The safety of medical products continues to be a significant health concern worldwide. Spontaneous reporting systems (SRS) and pharmacovigilance databases are essential tools for postmarketing surveillance of medical products. Various SRS are employed globally, such as the Food and Drug Administration Adverse Event Reporting System (FAERS), EudraVigilance, and VigiBase. In the pharmacovigilance literature, numerous methods have been proposed to assess product - adverse event pairs for potential signals. In this paper, we introduce an R and Python package that implements a novel pattern discovery method for postmarketing adverse event identification, named Modified Detecting Deviating Cells (MDDC). The package also includes a data generation function that considers adverse events as groups, as well as additional utility functions. We illustrate the usage of the package through the analysis of real datasets derived from the FAERS database."}, "https://arxiv.org/abs/2410.01194": {"title": "Maximum Ideal Likelihood Estimator: A New Estimation and Inference Framework for Latent Variable Models", "link": "https://arxiv.org/abs/2410.01194", "description": "arXiv:2410.01194v1 Announce Type: cross \nAbstract: In this paper, a new estimation framework, Maximum Ideal Likelihood Estimator (MILE), is proposed for general parametric models with latent variables and missing values. Instead of focusing on the marginal likelihood of the observed data as in many traditional approaches, the MILE directly considers the joint distribution of the complete dataset by treating the latent variables as parameters (the ideal likelihood). The MILE framework remains valid, even when traditional methods are not applicable, e.g., non-finite conditional expectation of the marginal likelihood function, via different optimization techniques and algorithms. The statistical properties of the MILE, such as the asymptotic equivalence to the Maximum Likelihood Estimation (MLE), are proved under some mild conditions, which facilitate statistical inference and prediction. Simulation studies illustrate that MILE outperforms traditional approaches with computational feasibility and scalability using existing and our proposed algorithms."}, "https://arxiv.org/abs/2410.01265": {"title": "Transformers Handle Endogeneity in In-Context Linear Regression", "link": "https://arxiv.org/abs/2410.01265", "description": "arXiv:2410.01265v1 Announce Type: cross \nAbstract: We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares $(\\textsf{2SLS})$ solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. 
Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the $\\textsf{2SLS}$ method, in the presence of endogeneity."}, "https://arxiv.org/abs/2410.01427": {"title": "Regularized e-processes: anytime valid inference with knowledge-based efficiency gains", "link": "https://arxiv.org/abs/2410.01427", "description": "arXiv:2410.01427v1 Announce Type: cross \nAbstract: Classical statistical methods have theoretical justification when the sample size is predetermined by the data-collection plan. In applications, however, it's often the case that sample sizes aren't predetermined; instead, investigators might use the data observed along the way to make on-the-fly decisions about when to stop data collection. Since those methods designed for static sample sizes aren't reliable when sample sizes are dynamic, there's been a recent surge of interest in e-processes and the corresponding tests and confidence sets that are anytime valid in the sense that their justification holds up for arbitrary dynamic data-collection plans. But if the investigator has relevant-yet-incomplete prior information about the quantity of interest, then there's an opportunity for efficiency gain, but existing approaches can't accommodate this. Here I build a new, regularized e-process framework that features a knowledge-based, imprecise-probabilistic regularization that offers improved efficiency. A generalized version of Ville's inequality is established, ensuring that inference based on the regularized e-process remains anytime valid in a novel, knowledge-dependent sense. In addition to anytime valid hypothesis tests and confidence sets, the proposed regularized e-processes facilitate possibility-theoretic uncertainty quantification with strong frequentist-like calibration properties and other Bayesian-like features: satisfies the likelihood principle, avoids sure-loss, and offers formal decision-making with reliability guarantees."}, "https://arxiv.org/abs/2410.01480": {"title": "Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales", "link": "https://arxiv.org/abs/2410.01480", "description": "arXiv:2410.01480v1 Announce Type: cross \nAbstract: Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. 
Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study."}, "https://arxiv.org/abs/2410.01522": {"title": "Uncertainty quantification in neutron and gamma time correlation measurements", "link": "https://arxiv.org/abs/2410.01522", "description": "arXiv:2410.01522v1 Announce Type: cross \nAbstract: Neutron noise analysis is a predominant technique for fissile matter identification with passive methods. Quantifying the uncertainties associated with the estimated nuclear parameters is crucial for decision-making. A conservative uncertainty quantification procedure is possible by solving a Bayesian inverse problem with the help of statistical surrogate models but generally leads to large uncertainties due to the surrogate models' errors. In this work, we develop two methods for robust uncertainty quantification in neutron and gamma noise analysis based on the resolution of Bayesian inverse problems. We show that the uncertainties can be reduced by including information on gamma correlations. The investigation of a joint analysis of the neutron and gamma observations is also conducted with the help of active learning strategies to fine-tune surrogate models. We test our methods on a model of the SILENE reactor core, using simulated and real-world measurements."}, "https://arxiv.org/abs/2311.07736": {"title": "Use of Expected Utility (EU) to Evaluate Artificial Intelligence-Enabled Rule-Out Devices for Mammography Screening", "link": "https://arxiv.org/abs/2311.07736", "description": "arXiv:2311.07736v2 Announce Type: replace \nAbstract: Background: An artificial intelligence (AI)-enabled rule-out device may autonomously remove patient images unlikely to have cancer from radiologist review. Many published studies evaluate this type of device by retrospectively applying the AI to large datasets and use sensitivity and specificity as the performance metrics. However, these metrics have fundamental shortcomings because they are bound to have opposite changes with the rule-out application of AI. Method: We reviewed two performance metrics to compare the screening performance between the radiologist-with-rule-out-device and radiologist-without-device workflows: positive/negative predictive values (PPV/NPV) and expected utility (EU). We applied both methods to a recent study that reported improved performance in the radiologist-with-device workflow using a retrospective U.S. dataset. We then applied the EU method to a European study based on the reported recall and cancer detection rates at different AI thresholds to compare the potential utility among different thresholds. Results: For the U.S. study, neither PPV/NPV nor EU can demonstrate significant improvement for any of the algorithm thresholds reported. For the study using European data, we found that EU is lower as AI rules out more patients including false-negative cases and reduces the overall screening performance. Conclusions: Due to the nature of the retrospective simulated study design, sensitivity and specificity can be ambiguous in evaluating a rule-out device. We showed that using PPV/NPV or EU can resolve the ambiguity. 
The EU method can be applied with only recall rates and cancer detection rates, which is convenient as ground truth is often unavailable for non-recalled patients in screening mammography."}, "https://arxiv.org/abs/2401.01977": {"title": "Conformal causal inference for cluster randomized trials: model-robust inference without asymptotic approximations", "link": "https://arxiv.org/abs/2401.01977", "description": "arXiv:2401.01977v2 Announce Type: replace \nAbstract: Traditional statistical inference in cluster randomized trials typically invokes the asymptotic theory that requires the number of clusters to approach infinity. In this article, we propose an alternative conformal causal inference framework for analyzing cluster randomized trials that achieves the target inferential goal in finite samples without the need for asymptotic approximations. Different from traditional inference focusing on estimating the average treatment effect, our conformal causal inference aims to provide prediction intervals for the difference of counterfactual outcomes, thereby providing a new decision-making tool for clusters and individuals in the same target population. We prove that this framework is compatible with arbitrary working outcome models -- including data-adaptive machine learning methods that maximally leverage information from baseline covariates, and enjoys robustness against misspecification of working outcome models. Under our conformal causal inference framework, we develop efficient computation algorithms to construct prediction intervals for treatment effects at both the cluster and individual levels, and further extend to address inferential targets defined based on pre-specified covariate subgroups. Finally, we demonstrate the properties of our methods via simulations and a real data application based on a completed cluster randomized trial for treating chronic pain."}, "https://arxiv.org/abs/2304.06251": {"title": "Importance is Important: Generalized Markov Chain Importance Sampling Methods", "link": "https://arxiv.org/abs/2304.06251", "description": "arXiv:2304.06251v2 Announce Type: replace-cross \nAbstract: We show that for any multiple-try Metropolis algorithm, one can always accept the proposal and evaluate the importance weight that is needed to correct for the bias without extra computational cost. This results in a general, convenient, and rejection-free Markov chain Monte Carlo (MCMC) sampling scheme. By further leveraging the importance sampling perspective on Metropolis--Hastings algorithms, we propose an alternative MCMC sampler on discrete spaces that is also outside the Metropolis--Hastings framework, along with a general theory on its complexity. Numerical examples suggest that the proposed algorithms are consistently more efficient than the original Metropolis--Hastings versions."}, "https://arxiv.org/abs/2304.09398": {"title": "Minimax Signal Detection in Sparse Additive Models", "link": "https://arxiv.org/abs/2304.09398", "description": "arXiv:2304.09398v2 Announce Type: replace-cross \nAbstract: Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. 
Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature."}, "https://arxiv.org/abs/2304.13831": {"title": "Random evolutionary games and random polynomials", "link": "https://arxiv.org/abs/2304.13831", "description": "arXiv:2304.13831v2 Announce Type: replace-cross \nAbstract: In this paper, we discover that the class of random polynomials arising from the equilibrium analysis of random asymmetric evolutionary games is \\textit{exactly} the Kostlan-Shub-Smale system of random polynomials, revealing an intriguing connection between evolutionary game theory and the theory of random polynomials. Through this connection, we analytically characterize the statistics of the number of internal equilibria of random asymmetric evolutionary games, namely its mean value, probability distribution, central limit theorem and universality phenomena. Biologically, these quantities enable prediction of the levels of social and biological diversity as well as the overall complexity in a dynamical system. By comparing symmetric and asymmetric random games, we establish that symmetry in group interactions increases the expected number of internal equilibria. Our research establishes new theoretical understanding of asymmetric evolutionary games and highlights the significance of symmetry and asymmetry in group interactions."}, "https://arxiv.org/abs/2410.02403": {"title": "A unified approach to penalized likelihood estimation of covariance matrices in high dimensions", "link": "https://arxiv.org/abs/2410.02403", "description": "arXiv:2410.02403v1 Announce Type: new \nAbstract: We consider the problem of estimation of a covariance matrix for Gaussian data in a high dimensional setting. Existing approaches include maximum likelihood estimation under a pre-specified sparsity pattern, l_1-penalized loglikelihood optimization and ridge regularization of the sample covariance. We show that these three approaches can be addressed in a unified way, by considering the constrained optimization of an objective function that involves two suitably defined penalty terms. This unified procedure exploits the advantages of each individual approach, while bringing novelty in the combination of the three. We provide an efficient algorithm for the optimization of the regularized objective function and describe the relationship between the two penalty terms, thereby highlighting the importance of the joint application of the three methods. A simulation study shows that the sparse estimates of covariance matrices returned by the procedure are stable and accurate, both in low and high dimensional settings, and that their calculation is more efficient than that of existing approaches under a partially known sparsity pattern. An illustration on sonar data is presented for the identification of the covariance structure among signals bounced off a certain material. 
The method is implemented in the publicly available R package gicf."}, "https://arxiv.org/abs/2410.02448": {"title": "Bayesian Calibration and Uncertainty Quantification for a Large Nutrient Load Impact Model", "link": "https://arxiv.org/abs/2410.02448", "description": "arXiv:2410.02448v1 Announce Type: new \nAbstract: Nutrient load simulators are large, deterministic models that simulate the hydrodynamics and biogeochemical processes in aquatic ecosystems. They are central tools for planning cost-efficient actions to fight eutrophication since they allow scenario predictions on impacts of nutrient load reductions to, e.g., harmful algal biomass growth. Because these simulators are computationally heavy, however, the uncertainties related to these predictions are typically not rigorously assessed. In this work, we developed a novel Bayesian computational approach for estimating the uncertainties in predictions of the Finnish coastal nutrient load model FICOS. First, we constructed a likelihood function for the multivariate spatiotemporal outputs of the FICOS model. Then, we used Bayes optimization to locate the posterior mode for the model parameters conditional on long-term monitoring data. After that, we constructed a space-filling design for FICOS model runs around the posterior mode and used it to train a Gaussian process emulator for the (log) posterior density of the model parameters. We then integrated over this (approximate) parameter posterior to produce probabilistic predictions for algal biomass and chlorophyll a concentration under alternative nutrient load reduction scenarios. Our computational algorithm allowed for fast posterior inference, and the Gaussian process emulator had good predictive accuracy within the highest posterior probability mass region. The posterior predictive scenarios showed that the probability of reaching the EU's Water Framework Directive objectives in the Finnish Archipelago Sea is generally low even under large load reductions."}, "https://arxiv.org/abs/2410.02649": {"title": "Stochastic Gradient Variational Bayes in the Stochastic Blockmodel", "link": "https://arxiv.org/abs/2410.02649", "description": "arXiv:2410.02649v1 Announce Type: new \nAbstract: Stochastic variational Bayes algorithms have become very popular in the machine learning literature, particularly in the context of nonparametric Bayesian inference. These algorithms replace the true but intractable posterior distribution with the best (in the sense of Kullback-Leibler divergence) member of a tractable family of distributions, using stochastic gradient algorithms to perform the optimization step. Stochastic variational Bayes inference implicitly trades off computational speed for accuracy, but the loss of accuracy is highly model (and even dataset) specific. In this paper we carry out an empirical evaluation of this trade-off in the context of stochastic blockmodels, which are a widely used class of probabilistic models for network and relational data. 
Our experiments indicate that, in the context of stochastic blockmodels, relatively large subsamples are required for these algorithms to find accurate approximations of the posterior, and that even then the quality of the approximations provided by stochastic gradient variational algorithms can be highly variable."}, "https://arxiv.org/abs/2410.02655": {"title": "Exact Bayesian Inference for Multivariate Spatial Data of Any Size with Application to Air Pollution Monitoring", "link": "https://arxiv.org/abs/2410.02655", "description": "arXiv:2410.02655v1 Announce Type: new \nAbstract: Fine particulate matter and aerosol optical thickness are of interest to atmospheric scientists for understanding air quality and its various health/environmental impacts. The available data are extremely large, making uncertainty quantification in a fully Bayesian framework quite difficult, as traditional implementations do not scale reasonably to the size of the data. We specifically consider roughly 8 million observations obtained from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. To analyze data on this scale, we introduce Scalable Multivariate Exact Posterior Regression (SM-EPR) which combines the recently introduced data subset approach and Exact Posterior Regression (EPR). EPR is a new Bayesian hierarchical model where it is possible to sample independent replicates of fixed and random effects directly from the posterior without the use of Markov chain Monte Carlo (MCMC) or approximate Bayesian techniques. We extend EPR to the multivariate spatial context, where the multiple variables may be distributed according to different distributions. The combination of the data subset approach with EPR allows one to perform exact Bayesian inference without MCMC for effectively any sample size. We demonstrate our new SM-EPR method using this motivating big remote sensing data application and provide several simulations."}, "https://arxiv.org/abs/2410.02727": {"title": "Regression Discontinuity Designs Under Interference", "link": "https://arxiv.org/abs/2410.02727", "description": "arXiv:2410.02727v1 Announce Type: new \nAbstract: We extend the continuity-based framework to Regression Discontinuity Designs (RDDs) to identify and estimate causal effects in the presence of interference when units are connected through a network. In this setting, assignment to an \"effective treatment,\" which comprises the individual treatment and a summary of the treatment of interfering units (e.g., friends, classmates), is determined by the unit's score and the scores of other interfering units, leading to a multiscore RDD with potentially complex, multidimensional boundaries. We characterize these boundaries and derive generalized continuity assumptions to identify the proposed causal estimands, i.e., point and boundary causal effects. Additionally, we develop a distance-based nonparametric estimator, derive its asymptotic properties under restrictions on the network degree distribution, and introduce a novel variance estimator that accounts for network correlation. 
Finally, we apply our methodology to the PROGRESA/Oportunidades dataset to estimate the direct and indirect effects of receiving cash transfers on children's school attendance."}, "https://arxiv.org/abs/2410.01826": {"title": "Shocks-adaptive Robust Minimum Variance Portfolio for a Large Universe of Assets", "link": "https://arxiv.org/abs/2410.01826", "description": "arXiv:2410.01826v1 Announce Type: cross \nAbstract: This paper proposes a robust, shocks-adaptive portfolio in a large-dimensional assets universe where the number of assets could be comparable to or even larger than the sample size. It is well documented that portfolios based on optimizations are sensitive to outliers in return data. We deal with outliers by proposing a robust factor model, contributing methodologically through the development of a robust principal component analysis (PCA) for factor model estimation and a shrinkage estimation for the random error covariance matrix. This approach extends the well-regarded Principal Orthogonal Complement Thresholding (POET) method (Fan et al., 2013), enabling it to effectively handle heavy tails and sudden shocks in data. The novelty of the proposed robust method is its adaptiveness to both global and idiosyncratic shocks, without the need to distinguish them, which is useful in forming portfolio weights when facing outliers. We develop the theoretical results of the robust factor model and the robust minimum variance portfolio. Numerical and empirical results show the superior performance of the new portfolio."}, "https://arxiv.org/abs/2410.01839": {"title": "A Divide-and-Conquer Approach to Persistent Homology", "link": "https://arxiv.org/abs/2410.01839", "description": "arXiv:2410.01839v1 Announce Type: cross \nAbstract: Persistent homology is a tool of topological data analysis that has been used in a variety of settings to characterize different dimensional holes in data. However, persistent homology computations can be memory intensive with a computational complexity that does not scale well as the data size becomes large. In this work, we propose a divide-and-conquer (DaC) method to mitigate these issues. The proposed algorithm efficiently finds small, medium, and large-scale holes by partitioning data into sub-regions and uses a Vietoris-Rips filtration. Furthermore, we provide theoretical results that quantify the bottleneck distance between DaC and the true persistence diagram and the recovery probability of holes in the data. We empirically verify that the rate coincides with our theoretical rate, and find that the memory and computational complexity of DaC outperforms an alternative method that relies on a clustering preprocessing step to reduce the memory and computational complexity of the persistent homology computations. Finally, we test our algorithm using spatial data of the locations of lakes in Wisconsin, where the classical persistent homology is computationally infeasible."}, "https://arxiv.org/abs/2410.02015": {"title": "Instrumental variables: A non-asymptotic viewpoint", "link": "https://arxiv.org/abs/2410.02015", "description": "arXiv:2410.02015v1 Announce Type: cross \nAbstract: We provide a non-asymptotic analysis of the linear instrumental variable estimator allowing for the presence of exogeneous covariates. In addition, we introduce a novel measure of the strength of an instrument that can be used to derive non-asymptotic confidence intervals. 
For strong instruments, these non-asymptotic intervals match the asymptotic ones exactly up to higher-order corrections; for weaker instruments, our intervals involve adaptive adjustments to the instrument strength, and thus remain valid even when asymptotic predictions break down. We illustrate our results via an analysis of the effect of PM2.5 pollution on various health conditions, using wildfire smoke exposure as an instrument. Our analysis shows that exposure to PM2.5 pollution leads to statistically significant increases in the incidence of health conditions such as asthma, heart disease, and strokes."}, "https://arxiv.org/abs/2410.02025": {"title": "A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models", "link": "https://arxiv.org/abs/2410.02025", "description": "arXiv:2410.02025v1 Announce Type: cross \nAbstract: In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings."}, "https://arxiv.org/abs/2410.02050": {"title": "A fast, flexible simulation framework for Bayesian adaptive designs -- the R package BATSS", "link": "https://arxiv.org/abs/2410.02050", "description": "arXiv:2410.02050v1 Announce Type: cross \nAbstract: The use of Bayesian adaptive designs for randomised controlled trials has been hindered by the lack of software readily available to statisticians. We have developed a new software package (Bayesian Adaptive Trials Simulator Software - BATSS) for the statistical software R, which provides a flexible structure for the fast simulation of Bayesian adaptive designs for clinical trials. We illustrate how the BATSS package can be used to define and evaluate the operating characteristics of Bayesian adaptive designs for various types of primary outcomes (e.g., those that follow a normal, binary, Poisson or negative binomial distribution) and can incorporate the most common types of adaptations: stopping treatments (or the entire trial) for efficacy or futility, and Bayesian response-adaptive randomisation based on user-defined adaptation rules. 
Other important features of this highly modular package include: the use of (Integrated Nested) Laplace approximations to compute posterior distributions, parallel processing on a computer or a cluster, customisability, adjustment for covariates and a wide range of available conditional distributions for the response."}, "https://arxiv.org/abs/2410.02208": {"title": "Fast nonparametric feature selection with error control using integrated path stability selection", "link": "https://arxiv.org/abs/2410.02208", "description": "arXiv:2410.02208v1 Announce Type: cross \nAbstract: Feature selection can greatly improve performance and interpretability in machine learning problems. However, existing nonparametric feature selection methods either lack theoretical error control or fail to accurately control errors in practice. Many methods are also slow, especially in high dimensions. In this paper, we introduce a general feature selection method that applies integrated path stability selection to thresholding to control false positives and the false discovery rate. The method also estimates q-values, which are better suited to high-dimensional data than p-values. We focus on two special cases of the general method based on gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive simulations with RNA sequencing data show that IPSSGB and IPSSRF have better error control, detect more true positives, and are faster than existing methods. We also use both methods to detect microRNAs and genes related to ovarian cancer, finding that they make better predictions with fewer features than other methods."}, "https://arxiv.org/abs/2410.02306": {"title": "Choosing alpha post hoc: the danger of multiple standard significance thresholds", "link": "https://arxiv.org/abs/2410.02306", "description": "arXiv:2410.02306v1 Announce Type: cross \nAbstract: A fundamental assumption of classical hypothesis testing is that the significance threshold $\\alpha$ is chosen independently from the data. The validity of confidence intervals likewise relies on choosing $\\alpha$ beforehand. We point out that the independence of $\\alpha$ is guaranteed in practice because, in most fields, there exists one standard $\\alpha$ that everyone uses -- so that $\\alpha$ is automatically independent of everything. However, there have been recent calls to decrease $\\alpha$ from $0.05$ to $0.005$. We note that this may lead to multiple accepted standard thresholds within one scientific field. For example, different journals may require different significance thresholds. As a consequence, some researchers may be tempted to conveniently choose their $\\alpha$ based on their p-value. We use examples to illustrate that this severely invalidates hypothesis tests, and mention some potential solutions."}, "https://arxiv.org/abs/2410.02629": {"title": "Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression", "link": "https://arxiv.org/abs/2410.02629", "description": "arXiv:2410.02629v1 Announce Type: cross \nAbstract: This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. 
These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates."}, "https://arxiv.org/abs/2203.14760": {"title": "Functional principal component analysis for longitudinal observations with sampling at random", "link": "https://arxiv.org/abs/2203.14760", "description": "arXiv:2203.14760v2 Announce Type: replace \nAbstract: Functional principal component analysis has been shown to be invaluable for revealing variation modes of longitudinal outcomes, which serves as important building blocks for forecasting and model building. Decades of research have advanced methods for functional principal component analysis often assuming independence between the observation times and longitudinal outcomes. Yet such assumptions are fragile in real-world settings where observation times may be driven by outcome-related reasons. Rather than ignoring the informative observation time process, we explicitly model the observational times by a counting process dependent on time-varying prognostic factors. Identification of the mean, covariance function, and functional principal components ensues via inverse intensity weighting. We propose using weighted penalized splines for estimation and establish consistency and convergence rates for the weighted estimators. Simulation studies demonstrate that the proposed estimators are substantially more accurate than the existing ones in the presence of a correlation between the observation time process and the longitudinal outcome process. We further examine the finite-sample performance of the proposed method using the Acute Infection and Early Disease Research Program study."}, "https://arxiv.org/abs/2310.07399": {"title": "Randomized Runge-Kutta-Nystr\\\"om Methods for Unadjusted Hamiltonian and Kinetic Langevin Monte Carlo", "link": "https://arxiv.org/abs/2310.07399", "description": "arXiv:2310.07399v2 Announce Type: replace-cross \nAbstract: We introduce $5/2$- and $7/2$-order $L^2$-accurate randomized Runge-Kutta-Nystr\\\"{o}m methods, tailored for approximating Hamiltonian flows within non-reversible Markov chain Monte Carlo samplers, such as unadjusted Hamiltonian Monte Carlo and unadjusted kinetic Langevin Monte Carlo. We establish quantitative $5/2$-order $L^2$-accuracy upper bounds under gradient and Hessian Lipschitz assumptions on the potential energy function. The numerical experiments demonstrate the superior efficiency of the proposed unadjusted samplers on a variety of well-behaved, high-dimensional target distributions."}, "https://arxiv.org/abs/2410.02905": {"title": "Multiscale Multi-Type Spatial Bayesian Analysis of Wildfires and Population Change That Avoids MCMC and Approximating the Posterior Distribution", "link": "https://arxiv.org/abs/2410.02905", "description": "arXiv:2410.02905v1 Announce Type: new \nAbstract: In recent years, wildfires have significantly increased in the United States (U.S.), making certain areas harder to live in. 
This motivates us to jointly analyze active fires and population changes in the U.S. from July 2020 to June 2021. The available data are recorded on different scales (or spatial resolutions) and by different types of distributions (referred to as multi-type data). Moreover, wildfires are known to have a feedback mechanism that creates signal-to-noise dependence. We analyze point-referenced remote sensing fire data from the National Aeronautics and Space Administration (NASA) and county-level population change data provided by the U.S. Census Bureau's Population Estimates Program (PEP). To do this, we develop a multiscale multi-type spatial Bayesian hierarchical model that assumes the average number of fires is zero-inflated normal, the incidence of fire is Bernoulli, and the percentage population change is normally distributed. This high-dimensional dataset makes Markov chain Monte Carlo (MCMC) implementation infeasible. We bypass MCMC by extending a computationally efficient Bayesian framework to directly sample from the exact posterior distribution, referred to as Exact Posterior Regression (EPR), which includes a term to model feedback. A simulation study is included to compare our new EPR method to the traditional Bayesian model fitted via MCMC. In our analysis, we obtained predictions of wildfire probabilities, identified several useful covariates, and found that regions with many fires were directly related to population change."}, "https://arxiv.org/abs/2410.02918": {"title": "Moving sum procedure for multiple change point detection in large factor models", "link": "https://arxiv.org/abs/2410.02918", "description": "arXiv:2410.02918v1 Announce Type: new \nAbstract: The paper proposes a moving sum methodology for detecting multiple change points in high-dimensional time series under a factor model, where changes are attributed to those in loadings as well as emergence or disappearance of factors. We establish the asymptotic null distribution of the proposed test for family-wise error control, and show the consistency of the procedure for multiple change point estimation. Simulation studies and an application to a large dataset of volatilities demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2410.02920": {"title": "Statistical Inference with Nonignorable Non-Probability Survey Samples", "link": "https://arxiv.org/abs/2410.02920", "description": "arXiv:2410.02920v1 Announce Type: new \nAbstract: Statistical inference with non-probability survey samples is an emerging topic in survey sampling and official statistics and has gained increased attention from researchers and practitioners in the field. Much of the existing literature, however, assumes that the participation mechanism for non-probability samples is ignorable. In this paper, we develop a pseudo-likelihood approach to estimate participation probabilities for nonignorable non-probability samples when auxiliary information is available from an existing reference probability sample. We further construct three estimators for the finite population mean using regression-based prediction, inverse probability weighting (IPW), and augmented IPW, and study their asymptotic properties. Variance estimation for the proposed methods is considered within the same framework. 
The efficiency of our proposed methods is demonstrated through simulation studies and a real data analysis using the ESPACOV survey on the effects of the COVID-19 pandemic in Spain."}, "https://arxiv.org/abs/2410.02929": {"title": "Identifying Hierarchical Structures in Network Data", "link": "https://arxiv.org/abs/2410.02929", "description": "arXiv:2410.02929v1 Announce Type: new \nAbstract: In this paper, we introduce a hierarchical extension of the stochastic blockmodel to identify multilevel community structures in networks. We also present a Markov chain Monte Carlo (MCMC) algorithm and a variational Bayes algorithm to fit the model and obtain approximate posterior inference. Through simulated and real datasets, we demonstrate that the model successfully identifies communities and supercommunities when they exist in the data. Additionally, we observe that the model returns a single supercommunity when there is no evidence of multilevel community structure. As expected in the case of the single-level stochastic blockmodel, we observe that the MCMC algorithm consistently outperforms its variational Bayes counterpart. Therefore, we recommend using MCMC whenever the network size allows for computational feasibility."}, "https://arxiv.org/abs/2410.02941": {"title": "Efficient collaborative learning of the average treatment effect under data sharing constraints", "link": "https://arxiv.org/abs/2410.02941", "description": "arXiv:2410.02941v1 Announce Type: new \nAbstract: Driven by the need to generate real-world evidence from multi-site collaborative studies, we introduce an efficient collaborative learning approach to evaluate the average treatment effect in a multi-site setting under data sharing constraints. Specifically, the proposed method operates in a federated manner, using individual-level data from a user-defined target population and summary statistics from other source populations, to construct an efficient estimator for the average treatment effect on the target population of interest. Our federated approach does not require iterative communications between sites, making it particularly suitable for research consortia with limited resources for developing automated data-sharing infrastructures. Compared to existing data integration methods in causal inference, it allows for shifts in the distributions of outcomes, treatments, and baseline covariates, and achieves the semiparametric efficiency bound under appropriate conditions. We illustrate the magnitude of efficiency gains from incorporating extra data sources by examining the effect of insulin vs. non-insulin treatments on heart failure for patients with type II diabetes using electronic health record data collected from the All of Us program."}, "https://arxiv.org/abs/2410.02965": {"title": "BSNMani: Bayesian Scalar-on-network Regression with Manifold Learning", "link": "https://arxiv.org/abs/2410.02965", "description": "arXiv:2410.02965v1 Announce Type: new \nAbstract: Brain connectivity analysis is crucial for understanding brain structure and neurological function, shedding light on the mechanisms of mental illness. To study the association between individual brain connectivity networks and clinical characteristics, we develop BSNMani: a Bayesian scalar-on-network regression model with manifold learning. 
BSNMani comprises two components: the network manifold learning model for brain connectivity networks, which extracts shared connectivity structures and subject-specific network features, and the joint predictive model for clinical outcomes, which studies the association between clinical phenotypes and subject-specific network features while adjusting for potential confounding covariates. For posterior computation, we develop a novel two-stage hybrid algorithm combining Metropolis-Adjusted Langevin Algorithm (MALA) and Gibbs sampling. Our method is not only able to extract meaningful subnetwork features that reveal shared connectivity patterns, but can also reveal their association with clinical phenotypes, further enabling clinical outcome prediction. We demonstrate our method through simulations and through its application to real resting-state fMRI data from a study focusing on Major Depressive Disorder (MDD). Our approach sheds light on the intricate interplay between brain connectivity and clinical features, offering insights that can contribute to our understanding of psychiatric and neurological disorders, as well as mental health."}, "https://arxiv.org/abs/2410.02982": {"title": "Imputing Missing Values with External Data", "link": "https://arxiv.org/abs/2410.02982", "description": "arXiv:2410.02982v1 Announce Type: new \nAbstract: Missing data is a common challenge across scientific disciplines. Current imputation methods require the availability of individual data to impute missing values. Often, however, missingness requires using external data for the imputation. In this paper, we introduce a new Stata command, mi impute from, designed to impute missing values using linear predictors and their related covariance matrix from imputation models estimated in one or multiple external studies. This allows for the imputation of any missing values without sharing individual data between studies. We describe the underlying method and present the syntax of mi impute from alongside practical examples of missing data in collaborative research projects."}, "https://arxiv.org/abs/2410.03012": {"title": "Determining adequate consistency levels for aggregation of expert estimates", "link": "https://arxiv.org/abs/2410.03012", "description": "arXiv:2410.03012v1 Announce Type: new \nAbstract: To obtain reliable results of expertise, which usually use individual and group expert pairwise comparisons, it is important to summarize (aggregate) expert estimates provided that they are sufficiently consistent. There are several ways to determine the threshold level of consistency sufficient for aggregation of estimates. They can be used for different consistency indices, but none of them relates the threshold value to the requirements for the reliability of the expertise's results. Therefore, a new approach to determining this consistency threshold is required. The proposed approach is based on simulation modeling of expert pairwise comparisons and a targeted search for the most inconsistent among the modeled pairwise comparison matrices. Thus, the search for the least consistent matrix is carried out for a given perturbation of the perfectly consistent matrix. 
This allows for determining the consistency threshold corresponding to a given permissible relative deviation of the resulting weight of an alternative from its hypothetical reference value."}, "https://arxiv.org/abs/2410.03091": {"title": "Time-In-Range Analyses of Functional Data Subject to Missing with Applications to Inpatient Continuous Glucose Monitoring", "link": "https://arxiv.org/abs/2410.03091", "description": "arXiv:2410.03091v1 Announce Type: new \nAbstract: Continuous glucose monitoring (CGM) has been increasingly used in US hospitals for the care of patients with diabetes. Time in range (TIR), which measures the percent of time over a specified time window with glucose values within a target range, has served as a pivotal CGM metric for assessing glycemic control. However, inpatient CGM is prone to a prevailing issue: a limited length of hospital stay can cause insufficient CGM sampling, leading to a scenario with functional data plagued by complex missingness. Current analyses of inpatient CGM studies, however, ignore this issue and typically compute the TIR as the proportion of available CGM glucose values in range. As shown by simulation studies, this can result in considerably biased estimation and inference, largely owing to the nonstationary nature of inpatient CGM trajectories. In this work, we develop a rigorous statistical framework that confers valid inference on TIR in realistic inpatient CGM settings. Our proposals utilize a novel probabilistic representation of TIR, which enables leveraging the technique of inverse probability weighting and semiparametric survival modeling to obtain unbiased estimators of mean TIR that properly account for incompletely observed CGM trajectories. We establish desirable asymptotic properties of the proposed estimators. Results from our numerical studies demonstrate good finite-sample performance of the proposed method as well as its advantages over existing approaches. The proposed method is generally applicable to other functional data settings similar to CGM."}, "https://arxiv.org/abs/2410.03096": {"title": "Expected value of sample information calculations for risk prediction model development", "link": "https://arxiv.org/abs/2410.03096", "description": "arXiv:2410.03096v1 Announce Type: new \nAbstract: Uncertainty around predictions from a model due to the finite size of the development sample has traditionally been approached using classical inferential techniques. The finite size of the sample will result in a discrepancy between the final model and the correct model that maps covariates to predicted risks. From a decision-theoretic perspective, this discrepancy might affect the subsequent treatment decisions, and thus is associated with utility loss. From this perspective, procuring more development data is associated with an expected gain in the utility of using the model. In this work, we define the Expected Value of Sample Information (EVSI) as the expected gain in clinical utility, defined in net benefit (NB) terms in net true positive units, by procuring a further development sample of a given size. We propose a bootstrap-based algorithm for EVSI computations, and show its feasibility and face validity in a case study. 
Decision-theoretic metrics can complement classical inferential methods when designing studies that are aimed at developing risk prediction models."}, "https://arxiv.org/abs/2410.03239": {"title": "A new GARCH model with a deterministic time-varying intercept", "link": "https://arxiv.org/abs/2410.03239", "description": "arXiv:2410.03239v1 Announce Type: new \nAbstract: It is common for long financial time series to exhibit gradual change in the unconditional volatility. We propose a new model that captures this type of nonstationarity in a parsimonious way. The model augments the volatility equation of a standard GARCH model by a deterministic time-varying intercept. It captures structural change that slowly affects the amplitude of a time series while keeping the short-run dynamics constant. We parameterize the intercept as a linear combination of logistic transition functions. We show that the model can be derived from a multiplicative decomposition of volatility and preserves the financial motivation of variance decomposition. We use the theory of locally stationary processes to show that the quasi maximum likelihood estimator (QMLE) of the parameters of the model is consistent and asymptotically normally distributed. We examine the quality of the asymptotic approximation in a small simulation study. An empirical application to Oracle Corporation stock returns demonstrates the usefulness of the model. We find that the persistence implied by the GARCH parameter estimates is reduced by including a time-varying intercept in the volatility equation."}, "https://arxiv.org/abs/2410.03393": {"title": "General linear hypothesis testing in ill-conditioned functional response model", "link": "https://arxiv.org/abs/2410.03393", "description": "arXiv:2410.03393v1 Announce Type: new \nAbstract: The paper concerns inference in the ill-conditioned functional response model, which is a part of functional data analysis. In this regression model, the functional response is modeled using several independent scalar variables. To verify linear hypotheses, we develop new test statistics by aggregating pointwise statistics using either integral or supremum. The new tests are scale-invariant, in contrast to the existing ones. To construct tests, we use different bootstrap methods. The performance of the new tests is compared with the performance of known tests through a simulation study and an application to a real data example."}, "https://arxiv.org/abs/2410.03557": {"title": "Robust Bond Risk Premia Predictability Test in the Quantiles", "link": "https://arxiv.org/abs/2410.03557", "description": "arXiv:2410.03557v1 Announce Type: new \nAbstract: Different from existing literature on testing the macro-spanning hypothesis of bond risk premia, which only considers mean regressions, this paper investigates whether the yield curve represented by CP factor (Cochrane and Piazzesi, 2005) contains all available information about future bond returns in a predictive quantile regression with many other macroeconomic variables. In this study, we introduce the Trend in Debt Holding (TDH) as a novel predictor, testing it alongside established macro indicators such as Trend Inflation (TI) (Cieslak and Povala, 2015), and macro factors from Ludvigson and Ng (2009). A significant challenge in this study is the invalidity of traditional quantile model inference approaches, given the high persistence of many macro variables involved. 
Furthermore, the existing methods addressing this issue do not perform well in the marginal test with many highly persistent predictors. Thus, we suggest a robust inference approach, whose size and power performance are shown to be better than those of existing tests. Using data from 1980-2022, we find that the macro-spanning hypothesis is strongly supported at central quantiles: the CP factor has predictive power while all other macro variables have negligible predictive power in this case. On the other hand, evidence against the macro-spanning hypothesis is found at tail quantiles, where TDH has predictive power at right-tail quantiles while TI has predictive power at both tails. Finally, we show that the in-sample and out-of-sample predictive performance of the proposed method is better than that of existing methods."}, "https://arxiv.org/abs/2410.03593": {"title": "Robust Quickest Correlation Change Detection in High-Dimensional Random Vectors", "link": "https://arxiv.org/abs/2410.03593", "description": "arXiv:2410.03593v1 Announce Type: new \nAbstract: Detecting changes in high-dimensional vectors presents significant challenges, especially when the post-change distribution is unknown and time-varying. This paper introduces a novel robust algorithm for correlation change detection in high-dimensional data. The approach utilizes the summary statistic of the maximum magnitude correlation coefficient to detect the change. This summary statistic captures the level of correlation present in the data and also has an asymptotic density. The robust test is designed using the asymptotic density. The proposed approach is robust because it can help detect a change in correlation level from some known level to unknown, time-varying levels. The proposed test is also computationally efficient and valid for a broad class of data distributions. The effectiveness of the proposed algorithm is demonstrated on simulated data."}, "https://arxiv.org/abs/2410.03619": {"title": "Functional Singular Value Decomposition", "link": "https://arxiv.org/abs/2410.03619", "description": "arXiv:2410.03619v1 Announce Type: new \nAbstract: Heterogeneous functional data are commonly seen in time series and longitudinal data analysis. To capture the statistical structures of such data, we propose the framework of Functional Singular Value Decomposition (FSVD), a unified framework with structure-adaptive interpretability for the analysis of heterogeneous functional data. We establish the mathematical foundation of FSVD by proving its existence and providing its fundamental properties using operator theory. We then develop an implementation approach for noisy and irregularly observed functional data based on a novel joint kernel ridge regression scheme and provide theoretical guarantees for its convergence and estimation accuracy. The framework of FSVD also introduces the concepts of intrinsic basis functions and intrinsic basis vectors, which represent two fundamental statistical structures for random functions and connect FSVD to various tasks including functional principal component analysis, factor models, functional clustering, and functional completion. We compare the performance of FSVD with existing methods in several tasks through extensive simulation studies. 
To demonstrate the value of FSVD in real-world datasets, we apply it to extract temporal patterns from a COVID-19 case count dataset and perform data completion on an electronic health record dataset."}, "https://arxiv.org/abs/2410.03648": {"title": "Spatial Hyperspheric Models for Compositional Data", "link": "https://arxiv.org/abs/2410.03648", "description": "arXiv:2410.03648v1 Announce Type: new \nAbstract: Compositional data are an increasingly prevalent data source in spatial statistics. Analysis of such data is typically done on log-ratio transformations or via Dirichlet regression. However, these approaches often make unnecessarily strong assumptions (e.g., strictly positive components, exclusively negative correlations). An alternative approach uses square-root transformed compositions and directional distributions. Such distributions naturally allow for zero-valued components and positive correlations, yet they may include support outside the non-negative orthant and are not generative for compositional data. To overcome this challenge, we truncate the elliptically symmetric angular Gaussian (ESAG) distribution to the non-negative orthant. Additionally, we propose a spatial hyperspheric regression that contains fixed and random multivariate spatial effects. The proposed method also contains a term that can be used to propagate uncertainty that may arise from precursory stochastic models (i.e., machine learning classification). We demonstrate our method on a simulation study and on classified bioacoustic signals of the Dryobates pubescens (downy woodpecker)."}, "https://arxiv.org/abs/2410.02774": {"title": "Estimating the Unobservable Components of Electricity Demand Response with Inverse Optimization", "link": "https://arxiv.org/abs/2410.02774", "description": "arXiv:2410.02774v1 Announce Type: cross \nAbstract: Understanding and predicting the electricity demand responses to prices are critical activities for system operators, retailers, and regulators. While conventional machine learning and time series analyses have been adequate for the routine demand patterns that have adapted only slowly over many years, the emergence of active consumers with flexible assets, such as solar-plus-storage systems and electric vehicles, introduces new challenges. These active consumers exhibit more complex consumption patterns, the drivers of which are often unobservable to the retailers and system operators. In practice, system operators and retailers can only monitor the net demand (metered at grid connection points), which reflects the overall energy consumption or production exchanged with the grid. As a result, all \"behind-the-meter\" activities, such as the use of flexibility, remain hidden from these entities. Such behind-the-meter behavior may be controlled by third-party agents or incentivized by tariffs; in either case, the retailer's revenue and the system loads would be impacted by these activities behind the meter, but their details can only be inferred. We define the main components of net demand as baseload, flexible, and self-generation, each having nonlinear responses to market price signals. As flexible demand response and self-generation are increasing, this raises a pressing question of whether existing methods still perform well and, if not, whether there is an alternative way to understand and project the unobserved components of behavior. In response to this practical challenge, we evaluate the potential of a data-driven inverse optimization (IO) methodology. 
This approach characterizes decomposed consumption patterns without requiring direct observation of behind-the-meter behavior or device-level metering [...]"}, "https://arxiv.org/abs/2410.02799": {"title": "A Data Envelopment Analysis Approach for Assessing Fairness in Resource Allocation: Application to Kidney Exchange Programs", "link": "https://arxiv.org/abs/2410.02799", "description": "arXiv:2410.02799v1 Announce Type: cross \nAbstract: Kidney exchange programs have significantly increased transplantation rates but raise pressing questions about fairness in organ allocation. We present a novel framework leveraging Data Envelopment Analysis (DEA) to evaluate multiple fairness criteria--Priority, Access, and Outcome--within a single model, capturing complexities that may be overlooked in single-metric analyses. Using data from the United Network for Organ Sharing, we analyze these criteria individually, measuring Priority fairness through waitlist durations, Access fairness through Kidney Donor Profile Index scores, and Outcome fairness through graft lifespan. We then apply our DEA model to demonstrate significant disparities in kidney allocation efficiency across ethnic groups. To quantify uncertainty, we employ conformal prediction within the DEA framework, yielding group-conditional prediction intervals with finite-sample coverage guarantees. Our findings show notable differences in efficiency distributions between ethnic groups. Our study provides a rigorous framework for evaluating fairness in complex resource allocation systems, where resource scarcity and mutual compatibility constraints exist. All code for using the proposed method and reproducing results is available on GitHub."}, "https://arxiv.org/abs/2410.02835": {"title": "The MLE is minimax optimal for LGC", "link": "https://arxiv.org/abs/2410.02835", "description": "arXiv:2410.02835v1 Announce Type: cross \nAbstract: We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Maximum Likelihood Estimator (MLE). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the MLE. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions -- which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures."}, "https://arxiv.org/abs/2410.03346": {"title": "Implementing Response-Adaptive Randomisation in Stratified Rare-disease Trials: Design Challenges and Practical Solutions", "link": "https://arxiv.org/abs/2410.03346", "description": "arXiv:2410.03346v1 Announce Type: cross \nAbstract: Although response-adaptive randomisation (RAR) has gained substantial attention in the literature, it still has limited use in clinical trials. Amongst other reasons, the implementation of RAR in the real world raises important practical questions, often neglected. Motivated by an innovative phase-II stratified RAR trial, this paper addresses two challenges: (1) How to ensure that RAR allocations are both desirable and faithful to target probabilities, even in small samples? and (2) What adaptations to trigger after interim analyses in the presence of missing data? 
We propose a Mapping strategy that discretises the randomisation probabilities into a vector of allocation ratios, resulting in improved frequentist errors. Under the implementation of Mapping, we analyse the impact of missing data on operating characteristics by examining selected scenarios. Finally, we discuss additional concerns including: pooling data across trial strata, analysing the level of blinding in the trial, and reporting safety results."}, "https://arxiv.org/abs/2410.03630": {"title": "Is Gibbs sampling faster than Hamiltonian Monte Carlo on GLMs?", "link": "https://arxiv.org/abs/2410.03630", "description": "arXiv:2410.03630v1 Announce Type: cross \nAbstract: The Hamiltonian Monte Carlo (HMC) algorithm is often lauded for its ability to effectively sample from high-dimensional distributions. In this paper we challenge the presumed domination of HMC for the Bayesian analysis of GLMs. By utilizing the structure of the compute graph rather than the graphical model, we reduce the time per sweep of a full-scan Gibbs sampler from $O(d^2)$ to $O(d)$, where $d$ is the number of GLM parameters. Our simple changes to the implementation of the Gibbs sampler allow us to perform Bayesian inference on high-dimensional GLMs that are practically infeasible with traditional Gibbs sampler implementations. We empirically demonstrate a substantial increase in effective sample size per time when comparing our Gibbs algorithms to state-of-the-art HMC algorithms. While Gibbs is superior in terms of dimension scaling, neither Gibbs nor HMC dominate the other: we provide numerical and theoretical evidence that HMC retains an edge in certain circumstances thanks to its advantageous condition number scaling. Interestingly, for GLMs of fixed data size, we observe that increasing dimensionality can stabilize or even decrease condition number, shedding light on the empirical advantage of our efficient Gibbs sampler."}, "https://arxiv.org/abs/2410.03651": {"title": "Minimax-optimal trust-aware multi-armed bandits", "link": "https://arxiv.org/abs/2410.03651", "description": "arXiv:2410.03651v1 Announce Type: cross \nAbstract: Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, it assumes that the recommended and actually implemented policy differs depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. 
A simulation study is conducted to illustrate the benefits of our proposed algorithm when dealing with the trust issue."}, "https://arxiv.org/abs/2207.08373": {"title": "Functional varying-coefficient model under heteroskedasticity with application to DTI data", "link": "https://arxiv.org/abs/2207.08373", "description": "arXiv:2207.08373v2 Announce Type: replace \nAbstract: In this paper, we develop a multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local-linear generalized method of moments (GMM) based on continuous moment conditions. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values, thereby solving the challenges of using an inverse covariance operator directly. We propose an optimal instrumental variable that minimizes the asymptotic variance function among the class of all local-linear GMM estimators, and it outperforms the initial estimates, which do not incorporate the spatial dependence. Our proposed method significantly improves the accuracy of the estimation under heteroskedasticity, and its asymptotic properties have been investigated. Extensive simulation studies illustrate the finite-sample performance, and the efficacy of the proposed method is confirmed by real data analysis."}, "https://arxiv.org/abs/2410.03814": {"title": "Graphical models for inference: A model comparison approach for analyzing bacterial conjugation", "link": "https://arxiv.org/abs/2410.03814", "description": "arXiv:2410.03814v1 Announce Type: new \nAbstract: We present a proof-of-concept of a model comparison approach for analyzing spatio-temporal observations of interacting populations. Our model variants are a collection of structurally similar Bayesian networks. Their distinct Noisy-Or conditional probability distributions describe interactions within the population, with each distribution corresponding to a specific mechanism of interaction. To determine which distributions most accurately represent the underlying mechanisms, we examine the accuracy of each Bayesian network with respect to observational data. We implement such a system for observations of bacterial populations engaged in conjugation, a type of horizontal gene transfer that allows microbes to share genetic material with nearby cells through physical contact. Evaluating cell-specific factors that affect conjugation is generally difficult because of the stochastic nature of the process. Our approach provides a new method for gaining insight into this process. We compare eight model variations for each of three experimental trials and rank them using two different metrics."}, "https://arxiv.org/abs/2410.03911": {"title": "Efficient Bayesian Additive Regression Models For Microbiome Studies", "link": "https://arxiv.org/abs/2410.03911", "description": "arXiv:2410.03911v1 Announce Type: new \nAbstract: Statistical analysis of microbiome data is challenging. Bayesian multinomial logistic-normal (MLN) models have gained popularity due to their ability to account for the count compositional nature of these data. However, these models are often computationally intractable to infer. Recently, we developed a computationally efficient and accurate approach to inferring MLN models with a Marginally Latent Matrix-T Process (MLTP) form: MLN-MLTPs. Our approach is based on a novel sampler with a marginal Laplace approximation -- called the \\textit{Collapse-Uncollapse} (CU) sampler. 
However, existing work with MLTPs has been limited to linear models or models of a single non-linear process. Moreover, existing methods lack an efficient means of estimating model hyperparameters. This article addresses both deficiencies. We introduce a new class of MLN Additive Gaussian Process models (\\textit{MultiAddGPs}) for deconvolution of overlapping linear and non-linear processes. We show that MultiAddGPs are examples of MLN-MLTPs and derive an efficient CU sampler for this model class. Moreover, we derive efficient Maximum Marginal Likelihood estimation for hyperparameters in MLTP models by taking advantage of Laplace approximation in the CU sampler. We demonstrate our approach using simulated and real data studies. Our models produce novel biological insights from a previously published artificial gut study."}, "https://arxiv.org/abs/2410.04020": {"title": "\"6 choose 4\": A framework to understand and facilitate discussion of strategies for overall survival safety monitoring", "link": "https://arxiv.org/abs/2410.04020", "description": "arXiv:2410.04020v1 Announce Type: new \nAbstract: Advances in anticancer therapies have significantly contributed to declining death rates in certain disease and clinical settings. However, they have also made it difficult to power a clinical trial in these settings with overall survival (OS) as the primary efficacy endpoint. A number of statistical approaches have therefore been recently proposed for the pre-specified analysis of OS as a safety endpoint (Shan, 2023; Fleming et al., 2024; Rodriguez et al., 2024). In this paper, we provide a simple, unifying framework that includes the aforementioned approaches (and a couple others) as special cases. By highlighting each approach's focus, priority, tolerance for risk, and strengths or challenges for practical implementation, this framework can help to facilitate discussions between stakeholders on \"fit-for-purpose OS data collection and assessment of harm\" (American Association for Cancer Research, 2024). We apply this framework to a real clinical trial in large B-cell lymphoma to illustrate its application and value. Several recommendations and open questions are also raised."}, "https://arxiv.org/abs/2410.04028": {"title": "Penalized Sparse Covariance Regression with High Dimensional Covariates", "link": "https://arxiv.org/abs/2410.04028", "description": "arXiv:2410.04028v1 Announce Type: new \nAbstract: Covariance regression offers an effective way to model the large covariance matrix with the auxiliary similarity matrices. In this work, we propose a sparse covariance regression (SCR) approach to handle the potentially high-dimensional predictors (i.e., similarity matrices). Specifically, we use the penalization method to identify the informative predictors and estimate their associated coefficients simultaneously. We first investigate the Lasso estimator and subsequently consider the folded concave penalized estimation methods (e.g., SCAD and MCP). However, the theoretical analysis of the existing penalization methods is primarily based on i.i.d. data, which is not directly applicable to our scenario. To address this difficulty, we establish the non-asymptotic error bounds by exploiting the spectral properties of the covariance matrix and similarity matrices. Then, we derive the estimation error bound for the Lasso estimator and establish the desirable oracle property of the folded concave penalized estimator. 
Extensive simulation studies are conducted to corroborate our theoretical results. We also illustrate the usefulness of the proposed method by applying it to a Chinese stock market dataset."}, "https://arxiv.org/abs/2410.04082": {"title": "Jackknife empirical likelihood ratio test for log symmetric distribution using probability weighted moments", "link": "https://arxiv.org/abs/2410.04082", "description": "arXiv:2410.04082v1 Announce Type: new \nAbstract: Log symmetric distributions are useful in modeling data which show high skewness and have found applications in various fields. Using a recent characterization for log symmetric distributions, we propose a goodness of fit test for testing log symmetry. The asymptotic distributions of the test statistics under both the null and alternative distributions are obtained. As the normal-based test is difficult to implement, we also propose a jackknife empirical likelihood (JEL) ratio test for testing log symmetry. We conduct a Monte Carlo simulation to evaluate the performance of the JEL ratio test. Finally, we illustrate our methodology using different data sets."}, "https://arxiv.org/abs/2410.04132": {"title": "A Novel Unit Distribution Named As Median Based Unit Rayleigh (MBUR): Properties and Estimations", "link": "https://arxiv.org/abs/2410.04132", "description": "arXiv:2410.04132v1 Announce Type: new \nAbstract: The continual emergence of new distributions is essential for understanding the world and the environment surrounding us. In this paper, the author discusses a new distribution defined on the interval (0,1), covering the methodology for deriving its PDF, some of its properties, and related functions. A simulation and a real data analysis are also highlighted."}, "https://arxiv.org/abs/2410.04165": {"title": "How to Compare Copula Forecasts?", "link": "https://arxiv.org/abs/2410.04165", "description": "arXiv:2410.04165v1 Announce Type: new \nAbstract: This paper lays out a principled approach to compare copula forecasts via strictly consistent scores. We first establish the negative result that, in general, copulas fail to be elicitable, implying that copula predictions cannot sensibly be compared on their own. A notable exception is on Fr\\'echet classes, that is, when the marginal distribution structure is given and fixed, in which case we give suitable scores for the copula forecast comparison. As a remedy for the general non-elicitability of copulas, we establish novel multi-objective scores for copula forecasts along with marginal forecasts. They give rise to two-step tests of equal or superior predictive ability which admit attribution of the forecast ranking to the accuracy of the copulas or the marginals. Simulations show that our two-step tests work well in terms of size and power. We illustrate our new methodology via an empirical example using copula forecasts for international stock market indices."}, "https://arxiv.org/abs/2410.04170": {"title": "Physics-encoded Spatio-temporal Regression", "link": "https://arxiv.org/abs/2410.04170", "description": "arXiv:2410.04170v1 Announce Type: new \nAbstract: Physics-informed methods have gained great success in analyzing data with partial differential equation (PDE) constraints, which are ubiquitous when modeling dynamical systems. Different from the common penalty-based approach, this work promotes adherence to the underlying physical mechanism that facilitates statistical procedures. 
The motivating application concerns modeling fluorescence recovery after photobleaching, which is used for characterization of diffusion processes. We propose a physics-encoded regression model for handling spatio-temporally distributed data, which enables principled interpretability, parsimonious computation and efficient estimation by exploiting the structure of solutions of a governing evolution equation. The rate of convergence attaining the minimax optimality is theoretically demonstrated, generalizing the result obtained for the spatial regression. We conduct simulation studies to assess the performance of our proposed estimator and illustrate its usage in the aforementioned real data example."}, "https://arxiv.org/abs/2410.04312": {"title": "Adjusting for Spatial Correlation in Machine and Deep Learning", "link": "https://arxiv.org/abs/2410.04312", "description": "arXiv:2410.04312v1 Announce Type: new \nAbstract: Spatial data display correlation between observations collected at neighboring locations. Generally, machine and deep learning methods either do not account for this correlation or do so indirectly through correlated features and thereby forfeit predictive accuracy. To remedy this shortcoming, we propose preprocessing the data using a spatial decorrelation transform derived from properties of a multivariate Gaussian distribution and Vecchia approximations. The transformed data can then be ported into a machine or deep learning tool. After model fitting on the transformed data, the output can be spatially re-correlated via the corresponding inverse transformation. We show that including this spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets."}, "https://arxiv.org/abs/2410.04330": {"title": "Inference in High-Dimensional Linear Projections: Multi-Horizon Granger Causality and Network Connectedness", "link": "https://arxiv.org/abs/2410.04330", "description": "arXiv:2410.04330v1 Announce Type: new \nAbstract: This paper presents a Wald test for multi-horizon Granger causality within a high-dimensional sparse Vector Autoregression (VAR) framework. The null hypothesis focuses on the causal coefficients of interest in a local projection (LP) at a given horizon. Nevertheless, the post-double-selection method on LP may not be applicable in this context, as a sparse VAR model does not necessarily imply a sparse LP for horizon h>1. To validate the proposed test, we develop two types of de-biased estimators for the causal coefficients of interest, both relying on first-step machine learning estimators of the VAR slope parameters. The first estimator is derived from the Least Squares method, while the second is obtained through a two-stage approach that offers potential efficiency gains. We further derive heteroskedasticity- and autocorrelation-consistent (HAC) inference for each estimator. Additionally, we propose a robust inference method for the two-stage estimator, eliminating the need to correct for serial correlation in the projection residuals. Monte Carlo simulations show that the two-stage estimator with robust inference outperforms the Least Squares method in terms of the Wald test size, particularly for longer projection horizons. We apply our methodology to analyze the interconnectedness of policy-related economic uncertainty among a large set of countries in both the short and long run. Specifically, we construct a causal network to visualize how economic uncertainty spreads across countries over time. 
Our empirical findings reveal, among other insights, that in the short run (1 and 3 months), the U.S. influences China, while in the long run (9 and 12 months), China influences the U.S. Identifying these connections can help anticipate a country's potential vulnerabilities and propose proactive solutions to mitigate the transmission of economic uncertainty."}, "https://arxiv.org/abs/2410.04356": {"title": "Subspace decompositions for association structure learning in multivariate categorical response regression", "link": "https://arxiv.org/abs/2410.04356", "description": "arXiv:2410.04356v1 Announce Type: new \nAbstract: Modeling the complex relationships between multiple categorical response variables as a function of predictors is a fundamental task in the analysis of categorical data. However, existing methods can be difficult to interpret and may lack flexibility. To address these challenges, we introduce a penalized likelihood method for multivariate categorical response regression that relies on a novel subspace decomposition to parameterize interpretable association structures. Our approach models the relationships between categorical responses by identifying mutual, joint, and conditionally independent associations, which yields a linear problem within a tensor product space. We establish theoretical guarantees for our estimator, including error bounds in high-dimensional settings, and demonstrate the method's interpretability and prediction accuracy through comprehensive simulation studies."}, "https://arxiv.org/abs/2410.04359": {"title": "Efficient, Cross-Fitting Estimation of Semiparametric Spatial Point Processes", "link": "https://arxiv.org/abs/2410.04359", "description": "arXiv:2410.04359v1 Announce Type: new \nAbstract: We study a broad class of models called semiparametric spatial point processes where the first-order intensity function contains both a parametric component and a nonparametric component. We propose a novel, spatial cross-fitting estimator of the parametric component based on random thinning, a common simulation technique in point processes. The proposed estimator is shown to be consistent and in many settings, asymptotically Normal. Also, we generalize the notion of semiparametric efficiency lower bound in i.i.d. settings to spatial point processes and show that the proposed estimator achieves the efficiency lower bound if the process is Poisson. Next, we present a new spatial kernel regression estimator that can estimate the nonparametric component of the intensity function at the desired rates for inference. Despite the dependence induced by the point process, we show that our estimator can be computed using existing software for generalized partial linear models in i.i.d. settings. We conclude with a small simulation study and a re-analysis of the spatial distribution of rainforest trees."}, "https://arxiv.org/abs/2410.04384": {"title": "Statistical Inference for Four-Regime Segmented Regression Models", "link": "https://arxiv.org/abs/2410.04384", "description": "arXiv:2410.04384v1 Announce Type: new \nAbstract: Segmented regression models offer model flexibility and interpretability as compared to the global parametric and the nonparametric models, and yet are challenging in both estimation and inference. We consider a four-regime segmented model for temporally dependent data with segmenting boundaries depending on multivariate covariates with non-diminishing boundary effects. 
A mixed-integer quadratic programming algorithm is formulated to facilitate the least squares estimation of the regression and the boundary parameters. The rates of convergence and the asymptotic distributions of the least squares estimators are obtained for the regression and the boundary coefficients, respectively. We propose a smoothed regression bootstrap to facilitate inference on the parameters and a model selection procedure to select the most suitable model within the model class with at most four segments. Numerical simulations and a case study on air pollution in Beijing are conducted to demonstrate the proposed approach, which shows that the segmented models with three or four regimes are suitable for the modeling of the meteorological effects on the PM2.5 concentration."}, "https://arxiv.org/abs/2410.04390": {"title": "Approximate Maximum Likelihood Inference for Acoustic Spatial Capture-Recapture with Unknown Identities, Using Monte Carlo Expectation Maximization", "link": "https://arxiv.org/abs/2410.04390", "description": "arXiv:2410.04390v1 Announce Type: new \nAbstract: Acoustic spatial capture-recapture (ASCR) surveys with an array of synchronized acoustic detectors can be an effective way of estimating animal density or call density. However, constructing the capture histories required for ASCR analysis is challenging, as recognizing which detections at different detectors are of which calls is not a trivial task. Because calls from different distances take different times to arrive at detectors, the order in which calls are detected is not necessarily the same as the order in which they are made, and without knowing which detections are of the same call, we do not know how many different calls are detected. We propose a Monte Carlo expectation-maximization (MCEM) estimation method to resolve this unknown call identity problem. To implement the MCEM method in this context, we sample the latent variables from a complete-data likelihood model in the expectation step and use a semi-complete-data likelihood or conditional likelihood in the maximization step. We use a parametric bootstrap to obtain confidence intervals. When we apply our method to a survey of moss frogs, it gives an estimate within 15% of the estimate obtained using data with call capture histories constructed by experts, and unlike this latter estimate, our confidence interval incorporates the uncertainty about call identities. Simulations show it to have a low bias (6%) and coverage probabilities close to the nominal 95% value."}, "https://arxiv.org/abs/2410.04398": {"title": "Transfer Learning with General Estimating Equations", "link": "https://arxiv.org/abs/2410.04398", "description": "arXiv:2410.04398v1 Announce Type: new \nAbstract: We consider statistical inference for parameters defined by general estimating equations under covariate shift transfer learning. Different from the commonly used density ratio weighting approach, we undertake a set of formulations to make the statistical inference semiparametrically efficient and simple to implement. It starts with reconstructing the estimating equations to make them Neyman orthogonal, which facilitates more robustness against errors in the estimation of two key nuisance functions, the density ratio and the conditional mean of the moment function. We present a divergence-based method to estimate the density ratio function, which is amenable to machine learning algorithms including deep learning. 
To address the challenge that the conditional mean is parametric-dependent, we adopt a nonparametric multiple-imputation strategy that avoids regression at all possible parameter values. With the estimated nuisance functions and the orthogonal estimation equation, the inference for the target parameter is formulated via the empirical likelihood without sample splittings. We show that the proposed estimator attains the semiparametric efficiency bound, and the inference can be conducted with the Wilks' theorem. The proposed method is further evaluated by simulations and an empirical study on a transfer learning inference for ground-level ozone pollution"}, "https://arxiv.org/abs/2410.04431": {"title": "A Structural Approach to Growth-at-Risk", "link": "https://arxiv.org/abs/2410.04431", "description": "arXiv:2410.04431v1 Announce Type: new \nAbstract: We identify the structural impulse responses of quantiles of the outcome variable to a shock. Our estimation strategy explicitly distinguishes treatment from control variables, allowing us to model responses of unconditional quantiles while using controls for identification. Disentangling the effect of adding control variables on identification versus interpretation brings our structural quantile impulse responses conceptually closer to structural mean impulse responses. Applying our methodology to study the impact of financial shocks on lower quantiles of output growth confirms that financial shocks have an outsized effect on growth-at-risk, but the magnitude of our estimates is more extreme than in previous studies."}, "https://arxiv.org/abs/2410.04438": {"title": "A Reflection on the Impact of Misspecifying Unidentifiable Causal Inference Models in Surrogate Endpoint Evaluation", "link": "https://arxiv.org/abs/2410.04438", "description": "arXiv:2410.04438v1 Announce Type: new \nAbstract: Surrogate endpoints are often used in place of expensive, delayed, or rare true endpoints in clinical trials. However, regulatory authorities require thorough evaluation to accept these surrogate endpoints as reliable substitutes. One evaluation approach is the information-theoretic causal inference framework, which quantifies surrogacy using the individual causal association (ICA). Like most causal inference methods, this approach relies on models that are only partially identifiable. For continuous outcomes, a normal model is often used. Based on theoretical elements and a Monte Carlo procedure we studied the impact of model misspecification across two scenarios: 1) the true model is based on a multivariate t-distribution, and 2) the true model is based on a multivariate log-normal distribution. In the first scenario, the misspecification has a negligible impact on the results, while in the second, it has a significant impact when the misspecification is detectable using the observed data. Finally, we analyzed two data sets using the normal model and several D-vine copula models that were indistinguishable from the normal model based on the data at hand. We observed that the results may vary when different models are used."}, "https://arxiv.org/abs/2410.04496": {"title": "Hybrid Monte Carlo for Failure Probability Estimation with Gaussian Process Surrogates", "link": "https://arxiv.org/abs/2410.04496", "description": "arXiv:2410.04496v1 Announce Type: new \nAbstract: We tackle the problem of quantifying failure probabilities for expensive computer experiments with stochastic inputs. 
The computational cost of evaluating the computer simulation prohibits direct Monte Carlo (MC) and necessitates a statistical surrogate model. Surrogate-informed importance sampling -- which leverages the surrogate to identify suspected failures, fits a bias distribution to these locations, then calculates failure probabilities using a weighted average -- is popular, but it is data hungry and can provide erroneous results when budgets are limited. Instead, we propose a hybrid MC scheme which first uses the uncertainty quantification (UQ) of a Gaussian process (GP) surrogate to identify areas of high classification uncertainty, then combines surrogate predictions in certain regions with true simulator evaluation in uncertain regions. We also develop a stopping criterion which informs the allocation of a fixed budget of simulator evaluations between surrogate training and failure probability estimation. Our method is agnostic to surrogate choice (as long as UQ is provided); we showcase functionality with both GPs and deep GPs. It is also agnostic to design choices; we deploy contour locating sequential designs throughout. With these tools, we are able to effectively estimate small failure probabilities with only hundreds of simulator evaluations. We validate our method on a variety of synthetic benchmarks before deploying it on an expensive computer experiment of fluid flow around an airfoil."}, "https://arxiv.org/abs/2410.04561": {"title": "A Bayesian Method for Adverse Effects Estimation in Observational Studies with Truncation by Death", "link": "https://arxiv.org/abs/2410.04561", "description": "arXiv:2410.04561v1 Announce Type: new \nAbstract: Death among subjects is common in observational studies evaluating the causal effects of interventions among geriatric or severely ill patients. High mortality rates complicate the comparison of the prevalence of adverse events (AEs) between interventions. This problem is often referred to as outcome \"truncation\" by death. A possible solution is to estimate the survivor average causal effect (SACE), an estimand that evaluates the effects of interventions among those who would have survived under both treatment assignments. However, because the SACE does not include subjects who would have died under one or both arms, it does not consider the relationship between AEs and death. We propose a Bayesian method which imputes the unobserved mortality and AE outcomes for each participant under the intervention they did not receive. Using the imputed outcomes we define a composite ordinal outcome for each patient, combining the occurrence of death and the AE in an increasing scale of severity. This allows for the comparison of the effects of the interventions on death and the AE simultaneously among the entire sample. We implement the procedure to analyze the incidence of heart failure among geriatric patients being treated for Type II diabetes with sulfonylureas or dipeptidyl peptidase-4 inhibitors."}, "https://arxiv.org/abs/2410.04667": {"title": "A Finite Mixture Hidden Markov Model for Intermittently Observed Disease Process with Heterogeneity and Partially Known Disease Type", "link": "https://arxiv.org/abs/2410.04667", "description": "arXiv:2410.04667v1 Announce Type: new \nAbstract: Continuous-time multistate models are widely used for analyzing interval-censored data on disease progression over time. 
Sometimes, diseases manifest differently and what appears to be a coherent collection of symptoms is the expression of multiple distinct disease subtypes. To address this complexity, we propose a mixture hidden Markov model, where the observation process encompasses states representing common symptomatic stages across these diseases, and each underlying process corresponds to a distinct disease subtype. Our method models both the overall and the type-specific disease incidence/prevalence accounting for sampling conditions and exactly observed death times. Additionally, it can utilize partially available disease-type information, which offers insights into the pathway through specific hidden states in the disease process, to aid in the estimation. We present both a frequentist and a Bayesian way to obtain the estimates. The finite sample performance is evaluated through simulation studies. We demonstrate our method using the Nun Study and model the development and progression of dementia, encompassing both Alzheimer's disease (AD) and non-AD dementia."}, "https://arxiv.org/abs/2410.04676": {"title": "Democratizing Strategic Planning in Master-Planned Communities", "link": "https://arxiv.org/abs/2410.04676", "description": "arXiv:2410.04676v1 Announce Type: new \nAbstract: This paper introduces a strategic planning tool for master-planned communities designed specifically to quantify residents' subjective preferences about large investments in amenities and infrastructure projects. Drawing on data obtained from brief online surveys, the tool ranks alternative plans by considering the aggregate anticipated utilization of each proposed amenity and cost sensitivity to it (or risk sensitivity for infrastructure plans). In addition, the tool estimates the percentage of households that favor the preferred plan and predicts whether residents would actually be willing to fund the project. The mathematical underpinnings of the tool are borrowed from utility theory, incorporating exponential functions to model diminishing marginal returns on quality, cost, and risk mitigation."}, "https://arxiv.org/abs/2410.04696": {"title": "Efficient Input Uncertainty Quantification for Ratio Estimator", "link": "https://arxiv.org/abs/2410.04696", "description": "arXiv:2410.04696v1 Announce Type: new \nAbstract: We study the construction of a confidence interval (CI) for a simulation output performance measure that accounts for input uncertainty when the input models are estimated from finite data. In particular, we focus on performance measures that can be expressed as a ratio of two dependent simulation outputs' means. We adopt the parametric bootstrap method to mimic input data sampling and construct the percentile bootstrap CI after estimating the ratio at each bootstrap sample. The standard estimator, which takes the ratio of two sample averages, tends to exhibit large finite-sample bias and variance, leading to overcoverage of the percentile bootstrap CI. To address this, we propose two new ratio estimators that replace the sample averages with pooled mean estimators via the $k$-nearest neighbor ($k$NN) regression: the $k$NN estimator and the $k$LR estimator. The $k$NN estimator performs well in low dimensions but its theoretical performance guarantee degrades as the dimension increases. 
The $k$LR estimator combines the likelihood ratio (LR) method with the $k$NN regression, leveraging the strengths of both while mitigating their weaknesses; the LR method removes dependence on dimension, while the variance inflation introduced by the LR is controlled by $k$NN. Based on asymptotic analyses and finite-sample heuristics, we propose an experiment design that maximizes the efficiency of the proposed estimators and demonstrate their empirical performances using three examples including one in the enterprise risk management application."}, "https://arxiv.org/abs/2410.04700": {"title": "Nonparametric tests for interaction in two-way ANOVA with balanced replications", "link": "https://arxiv.org/abs/2410.04700", "description": "arXiv:2410.04700v1 Announce Type: new \nAbstract: Nonparametric procedures are more powerful for detecting interaction in two-way ANOVA when the data are non-normal. In this paper, we compute null critical values for the aligned rank-based tests (APCSSA/APCSSM) where the levels of the factors are between 2 and 6. We compare the performance of these new procedures with the ANOVA F-test for interaction, the adjusted rank transform test (ART), Conover's rank transform procedure (RT), and a rank-based ANOVA test (raov) using Monte Carlo simulations. The new procedures APCSSA/APCSSM are comparable with existing competitors in all settings. Even though there is no single dominant test in detecting interaction effects for non-normal data, nonparametric procedure APCSSM is the most highly recommended procedure for Cauchy errors settings."}, "https://arxiv.org/abs/2410.04864": {"title": "Order of Addition in Mixture-Amount Experiments", "link": "https://arxiv.org/abs/2410.04864", "description": "arXiv:2410.04864v1 Announce Type: new \nAbstract: In a mixture experiment, we study the behavior and properties of $m$ mixture components, where the primary focus is on the proportions of the components that make up the mixture rather than the total amount. Mixture-amount experiments are specialized types of mixture experiments where both the proportions of the components in the mixture and the total amount of the mixture are of interest. Such experiments consider the total amount of the mixture as a variable. In this paper, we consider an Order-of-Addition (OofA) mixture-amount experiment in which the response depends on both the mixture amounts of components and their order of addition. Full Mixture OofA designs are constructed to maintain orthogonality between the mixture-amount model terms and the effects of the order of addition. These designs enable the estimation of mixture-component model parameters and the order-of-addition effects. The G-efficiency criterion assesses how well the design supports precise and unbiased estimation of the model parameters. The Fraction of Design Space (FDS) plot is used to provide a visual assessment of the prediction capabilities of a design across the entire design space."}, "https://arxiv.org/abs/2410.04922": {"title": "Random-projection ensemble dimension reduction", "link": "https://arxiv.org/abs/2410.04922", "description": "arXiv:2410.04922v1 Announce Type: new \nAbstract: We introduce a new framework for dimension reduction in the context of high-dimensional regression. Our proposal is to aggregate an ensemble of random projections, which have been carefully chosen based on the empirical regression performance after being applied to the covariates. 
More precisely, we consider disjoint groups of independent random projections, apply a base regression method after each projection, and retain the projection in each group based on the empirical performance. We aggregate the selected projections by taking the singular value decomposition of their empirical average and then output the leading order singular vectors. A particularly appealing aspect of our approach is that the singular values provide a measure of the relative importance of the corresponding projection directions, which can be used to select the final projection dimension. We investigate in detail (and provide default recommendations for) various aspects of our general framework, including the projection distribution and the base regression method, as well as the number of random projections used. Additionally, we investigate the possibility of further reducing the dimension by applying our algorithm twice in cases where projection dimension recommended in the initial application is too large. Our theoretical results show that the error of our algorithm stabilises as the number of groups of projections increases. We demonstrate the excellent empirical performance of our proposal in a large numerical study using simulated and real data."}, "https://arxiv.org/abs/2410.04996": {"title": "Assumption-Lean Post-Integrated Inference with Negative Control Outcomes", "link": "https://arxiv.org/abs/2410.04996", "description": "arXiv:2410.04996v1 Announce Type: new \nAbstract: Data integration has become increasingly common in aligning multiple heterogeneous datasets. With high-dimensional outcomes, data integration methods aim to extract low-dimensional embeddings of observations to remove unwanted variations, such as batch effects and unmeasured covariates, inherent in data collected from different sources. However, multiple hypothesis testing after data integration can be substantially biased due to the data-dependent integration processes. To address this challenge, we introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes. By leveraging causal interpretations, we derive nonparametric identification conditions that form the basis of our PII approach.\n Our assumption-lean semiparametric inference method extends robustness and generality to projected direct effect estimands that account for mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide deterministic quantifications of the bias of target estimands induced by estimated embeddings and finite-sample linear expansions of the estimators with uniform concentration bounds on the residuals for all outcomes.\n The proposed doubly robust estimators are consistent and efficient under minimal assumptions, facilitating data-adaptive estimation with machine learning algorithms. Using random forests, we evaluate empirical statistical errors in simulations and analyze single-cell CRISPR perturbed datasets with potential unmeasured confounders."}, "https://arxiv.org/abs/2410.05008": {"title": "Testing procedures based on maximum likelihood estimation for Marked Hawkes processes", "link": "https://arxiv.org/abs/2410.05008", "description": "arXiv:2410.05008v1 Announce Type: new \nAbstract: The Hawkes model is a past-dependent point process, widely used in various fields for modeling temporal clustering of events. 
Extending this framework, the multidimensional marked Hawkes process incorporates multiple interacting event types and additional marks, enhancing its capability to model complex dependencies in multivariate time series data. However, increasing the complexity of the model also increases the computational cost of the associated estimation methods and may induce overfitting of the model. Therefore, it is essential to find a trade-off between accuracy and artificial complexity of the model. In order to find the appropriate version of Hawkes processes, we address, in this paper, the tasks of model fit evaluation and parameter testing for marked Hawkes processes. This article focuses on parametric Hawkes processes with exponential memory kernels, a popular variant for its theoretical and practical advantages. Our work introduces robust testing methodologies for assessing model parameters and complexity, building upon and extending previous theoretical frameworks. We then validate the practical robustness of these tests through comprehensive numerical studies, especially in scenarios where theoretical guarantees remain incomplete."}, "https://arxiv.org/abs/2410.05082": {"title": "Large datasets for the Euro Area and its member countries and the dynamic effects of the common monetary policy", "link": "https://arxiv.org/abs/2410.05082", "description": "arXiv:2410.05082v1 Announce Type: new \nAbstract: We present and describe a new publicly available large dataset which encompasses quarterly and monthly macroeconomic time series for both the Euro Area (EA) as a whole and its ten primary member countries. The dataset, which is called EA-MD-QD, includes more than 800 time series and spans the period from January 2000 to the latest available month. Since January 2024, EA-MD-QD is updated on a monthly basis and constantly revised, making it an essential resource for conducting policy analysis related to economic outcomes in the EA. To illustrate the usefulness of EA-MD-QD, we study the country-specific impulse responses to the EA-wide monetary policy shock by means of the Common Component VAR plus either Instrumental Variables or Sign Restrictions identification schemes. The results reveal asymmetries in the transmission of the monetary policy shock across countries, particularly between core and peripheral countries. Additionally, we find comovements across Euro Area countries' business cycles to be driven mostly by real variables, compared to nominal ones."}, "https://arxiv.org/abs/2410.05169": {"title": "False Discovery Rate Control for Fast Screening of Large-Scale Genomics Biobanks", "link": "https://arxiv.org/abs/2410.05169", "description": "arXiv:2410.05169v1 Announce Type: new \nAbstract: Genomics biobanks are information treasure troves with thousands of phenotypes (e.g., diseases, traits) and millions of single nucleotide polymorphisms (SNPs). The development of methodologies that provide reproducible discoveries is essential for the understanding of complex diseases and precision drug development. Without statistical reproducibility guarantees, valuable efforts are spent on researching false positives. Therefore, scalable multivariate and high-dimensional false discovery rate (FDR)-controlling variable selection methods are urgently needed, especially for complex polygenic diseases and traits. In this work, we propose the Screen-T-Rex selector, a fast FDR-controlling method based on the recently developed T-Rex selector. 
The method is tailored to screening large-scale biobanks, and it does not require choosing additional parameters (sparsity parameter, target FDR level, etc.). Numerical simulations and a real-world HIV-1 drug resistance example demonstrate that the performance of the Screen-T-Rex selector is superior, and its computation time is multiple orders of magnitude lower compared to current benchmark knockoff methods."}, "https://arxiv.org/abs/2410.05211": {"title": "The Informed Elastic Net for Fast Grouped Variable Selection and FDR Control in Genomics Research", "link": "https://arxiv.org/abs/2410.05211", "description": "arXiv:2410.05211v1 Announce Type: new \nAbstract: Modern genomics research relies on genome-wide association studies (GWAS) to identify the few genetic variants among potentially millions that are associated with diseases of interest. Only reproducible discoveries of groups of associations improve our understanding of complex polygenic diseases and enable the development of new drugs and personalized medicine. Thus, fast multivariate variable selection methods that have a high true positive rate (TPR) while controlling the false discovery rate (FDR) are crucial. Recently, the T-Rex+GVS selector, a version of the T-Rex selector that uses the elastic net (EN) as a base selector to perform grouped variable selection, was proposed. Although it significantly increased the TPR in simulated GWAS compared to the original T-Rex, its comparably high computational cost limits scalability. Therefore, we propose the informed elastic net (IEN), a new base selector that significantly reduces computation time while retaining the grouped variable selection property. We quantify its grouping effect and derive its formulation as a Lasso-type optimization problem, which is solved efficiently within the T-Rex framework by the terminated LARS algorithm. Numerical simulations and a GWAS study demonstrate that the proposed T-Rex+GVS (IEN) exhibits the desired grouping effect, reduces computation time, and achieves the same TPR as T-Rex+GVS (EN) but with lower FDR, which makes it a promising method for large-scale GWAS."}, "https://arxiv.org/abs/2410.05212": {"title": "$\\texttt{rdid}$ and $\\texttt{rdidstag}$: Stata commands for robust difference-in-differences", "link": "https://arxiv.org/abs/2410.05212", "description": "arXiv:2410.05212v1 Announce Type: new \nAbstract: This article provides a Stata package for the implementation of the robust difference-in-differences (RDID) method developed in Ban and K\\'edagni (2023). It contains three main commands: $\\texttt{rdid}$, $\\texttt{rdid_dy}$, $\\texttt{rdidstag}$, which we describe in the introduction and the main text. We illustrate these commands through simulations and empirical examples."}, "https://arxiv.org/abs/2410.04554": {"title": "Fast algorithm for sparse least trimmed squares via trimmed-regularized reformulation", "link": "https://arxiv.org/abs/2410.04554", "description": "arXiv:2410.04554v1 Announce Type: cross \nAbstract: The least trimmed squares (LTS) is a reasonable formulation of robust regression, but it suffers from a high computational cost due to the nonconvexity and nonsmoothness of its objective function. The most frequently used FAST-LTS algorithm is particularly slow when a sparsity-inducing penalty such as the $\\ell_1$ norm is added. This paper proposes a computationally inexpensive algorithm for the sparse LTS, which is based on the proximal gradient method with a reformulation technique. 
The proposed method is equipped with theoretical convergence guarantees that compare favorably with those of existing methods. Numerical experiments show that our method efficiently yields small objective values."}, "https://arxiv.org/abs/2410.04655": {"title": "Graph Fourier Neural Kernels (G-FuNK): Learning Solutions of Nonlinear Diffusive Parametric PDEs on Multiple Domains", "link": "https://arxiv.org/abs/2410.04655", "description": "arXiv:2410.04655v1 Announce Type: cross \nAbstract: Predicting time-dependent dynamics of complex systems governed by non-linear partial differential equations (PDEs) with varying parameters and domains is a challenging task motivated by applications across various fields. We introduce a novel family of neural operators based on our Graph Fourier Neural Kernels, designed to learn solution generators for nonlinear PDEs in which the highest-order term is diffusive, across multiple domains and parameters. G-FuNK combines components that are parameter- and domain-adapted with others that are not. The domain-adapted components are constructed using a weighted graph on the discretized domain, where the graph Laplacian approximates the highest-order diffusive term, ensuring boundary condition compliance and capturing the parameter and domain-specific behavior. Meanwhile, the learned components transfer across domains and parameters via Fourier Neural Operators. This approach naturally embeds geometric and directional information, improving generalization to new test domains without the need to retrain the network. To handle temporal dynamics, our method incorporates an integrated ODE solver to predict the evolution of the system. Experiments show G-FuNK's capability to accurately approximate heat, reaction diffusion, and cardiac electrophysiology equations across various geometries and anisotropic diffusivity fields. G-FuNK achieves low relative errors on unseen domains and fiber fields, significantly accelerating predictions compared to traditional finite-element solvers."}, "https://arxiv.org/abs/2410.05263": {"title": "Regression Conformal Prediction under Bias", "link": "https://arxiv.org/abs/2410.05263", "description": "arXiv:2410.05263v1 Announce Type: cross \nAbstract: Uncertainty quantification is crucial to account for the imperfect predictions of machine learning algorithms for high-impact applications. Conformal prediction (CP) is a powerful framework for uncertainty quantification that generates calibrated prediction intervals with valid coverage. In this work, we study how CP intervals are affected by bias - the systematic deviation of a prediction from ground truth values - a phenomenon prevalent in many real-world applications. We investigate the influence of bias on interval lengths of two different types of adjustments -- symmetric adjustments, the conventional method where both sides of the interval are adjusted equally, and asymmetric adjustments, a more flexible method where the interval can be adjusted unequally in positive or negative directions. We present theoretical and empirical analyses characterizing how symmetric and asymmetric adjustments impact the \"tightness\" of CP intervals for regression tasks. 
Specifically for absolute residual and quantile-based non-conformity scores, we prove: 1) the upper bound of symmetrically adjusted interval lengths increases by $2|b|$ where $b$ is a globally applied scalar value representing bias, 2) asymmetrically adjusted interval lengths are not affected by bias, and 3) conditions when asymmetrically adjusted interval lengths are guaranteed to be smaller than symmetric ones. Our analyses suggest that even if predictions exhibit significant drift from ground truth values, asymmetrically adjusted intervals are still able to maintain the same tightness and validity of intervals as if the drift had never happened, while symmetric ones significantly inflate the lengths. We demonstrate our theoretical results with two real-world prediction tasks: sparse-view computed tomography (CT) reconstruction and time-series weather forecasting. Our work paves the way for more bias-robust machine learning systems."}, "https://arxiv.org/abs/2302.01861": {"title": "Covariance Matrix Estimation for High-Throughput Biomedical Data with Interconnected Communities", "link": "https://arxiv.org/abs/2302.01861", "description": "arXiv:2302.01861v3 Announce Type: replace \nAbstract: Estimating a covariance matrix is central to high-dimensional data analysis. Empirical analyses of high-dimensional biomedical data, including genomics, proteomics, microbiome, and neuroimaging, among others, consistently reveal strong modularity in the dependence patterns. In these analyses, intercorrelated high-dimensional biomedical features often form communities or modules that can be interconnected with others. While the interconnected community structure has been extensively studied in biomedical research (e.g., gene co-expression networks), its potential to assist in the estimation of covariance matrices remains largely unexplored. To address this gap, we propose a procedure that leverages the commonly observed interconnected community structure in high-dimensional biomedical data to estimate large covariance and precision matrices. We derive the uniformly minimum-variance unbiased estimators for covariance and precision matrices in closed forms and provide theoretical results on their asymptotic properties. Our proposed method enhances the accuracy of covariance- and precision-matrix estimation and demonstrates superior performance compared to the competing methods in both simulations and real data analyses."}, "https://arxiv.org/abs/2305.14506": {"title": "Confidence Sets for Causal Orderings", "link": "https://arxiv.org/abs/2305.14506", "description": "arXiv:2305.14506v2 Announce Type: replace \nAbstract: Causal discovery procedures aim to deduce causal relationships among variables in a multivariate dataset. While various methods have been proposed for estimating a single causal model or a single equivalence class of models, less attention has been given to quantifying uncertainty in causal discovery in terms of confidence statements. A primary challenge in causal discovery of directed acyclic graphs is determining a causal ordering among the variables, and our work offers a framework for constructing confidence sets of causal orderings that the data do not rule out. Our methodology specifically applies to identifiable structural equation models with additive errors and is based on a residual bootstrap procedure to test the goodness-of-fit of causal orderings. 
We demonstrate the asymptotic validity of the confidence set constructed using this goodness-of-fit test and explain how the confidence set may be used to form sub/supersets of ancestral relationships as well as confidence intervals for causal effects that incorporate model uncertainty."}, "https://arxiv.org/abs/2306.14851": {"title": "Stability-Adjusted Cross-Validation for Sparse Linear Regression", "link": "https://arxiv.org/abs/2306.14851", "description": "arXiv:2306.14851v2 Announce Type: replace-cross \nAbstract: Given a high-dimensional covariate matrix and a response vector, ridge-regularized sparse linear regression selects a subset of features that explains the relationship between covariates and the response in an interpretable manner. To select the sparsity and robustness of linear regressors, techniques like k-fold cross-validation are commonly used for hyperparameter tuning. However, cross-validation substantially increases the computational cost of sparse regression as it requires solving many mixed-integer optimization problems (MIOs). Additionally, validation metrics often serve as noisy estimators of test set errors, with different hyperparameter combinations leading to models with different noise levels. Therefore, optimizing over these metrics is vulnerable to out-of-sample disappointment, especially in underdetermined settings. To improve upon this state of affairs, we make two key contributions. First, motivated by the generalization theory literature, we propose selecting hyperparameters that minimize a weighted sum of a cross-validation metric and a model's output stability, thus reducing the risk of poor out-of-sample performance. Second, we leverage ideas from the mixed-integer optimization literature to obtain computationally tractable relaxations of k-fold cross-validation metrics and the output stability of regressors, facilitating hyperparameter selection after solving fewer MIOs. These relaxations result in an efficient cyclic coordinate descent scheme, achieving lower validation errors than via traditional methods such as grid search. On synthetic datasets, our confidence adjustment procedure improves out-of-sample performance by 2%-5% compared to minimizing the k-fold error alone. On 13 real-world datasets, our confidence adjustment procedure reduces test set error by 2%, on average."}, "https://arxiv.org/abs/2410.05512": {"title": "A New Method for Multinomial Inference using Dempster-Shafer Theory", "link": "https://arxiv.org/abs/2410.05512", "description": "arXiv:2410.05512v1 Announce Type: new \nAbstract: A new method for multinomial inference is proposed by representing the cell probabilities as unordered segments on the unit interval and following Dempster-Shafer (DS) theory. The resulting DS posterior is then strengthened to improve symmetry and learning properties with the final posterior model being characterized by a Dirichlet distribution. In addition to computational simplicity, the new model has desirable invariance properties related to category permutations, refinements, and coarsenings. Furthermore, posterior inference on relative probabilities amongst certain cells depends only on data for the cells in question. Finally, the model is quite flexible with regard to parameterization and the range of testable assertions. 
Comparisons are made to existing methods and illustrated with two examples."}, "https://arxiv.org/abs/2410.05594": {"title": "Comparing HIV Vaccine Immunogenicity across Trials with Different Populations and Study Designs", "link": "https://arxiv.org/abs/2410.05594", "description": "arXiv:2410.05594v1 Announce Type: new \nAbstract: Safe and effective preventive vaccines have the potential to help stem the HIV epidemic. The efficacy of such vaccines is typically measured in randomized, double-blind phase IIb/III trials and described as a reduction in newly acquired HIV infections. However, such trials are often expensive, time-consuming, and/or logistically challenging. These challenges lead to a great interest in immune responses induced by vaccination, and in identifying which immune responses predict vaccine efficacy. These responses are termed vaccine correlates of protection. Studies of vaccine-induced immunogenicity vary in size and design, ranging from small, early phase trials, to case-control studies nested in a broader late-phase randomized trial. Moreover, trials can be conducted in geographically diverse study populations across the world. Such diversity presents a challenge for objectively comparing vaccine-induced immunogenicity. To address these practical challenges, we propose a framework that is capable of identifying appropriate causal estimands and estimators, which can be used to provide standardized comparisons of vaccine-induced immunogenicity across trials. We evaluate the performance of the proposed estimands via extensive simulation studies. Our estimators are well-behaved and enjoy robustness properties. The proposed technique is applied to compare vaccine immunogenicity using data from three recent HIV vaccine trials."}, "https://arxiv.org/abs/2410.05630": {"title": "Navigating Inflation in Ghana: How Can Machine Learning Enhance Economic Stability and Growth Strategies", "link": "https://arxiv.org/abs/2410.05630", "description": "arXiv:2410.05630v1 Announce Type: new \nAbstract: Inflation remains a persistent challenge for many African countries. This research investigates the critical role of machine learning (ML) in understanding and managing inflation in Ghana, emphasizing its significance for the country's economic stability and growth. Utilizing a comprehensive dataset spanning from 2010 to 2022, the study aims to employ advanced ML models, particularly those adept in time series forecasting, to predict future inflation trends. The methodology is designed to provide accurate and reliable inflation forecasts, offering valuable insights for policymakers and advocating for a shift towards data-driven approaches in economic decision-making. This study aims to significantly advance the academic field of economic analysis by applying machine learning (ML) and offering practical guidance for integrating advanced technological tools into economic governance, ultimately demonstrating ML's potential to enhance Ghana's economic resilience and support sustainable development through effective inflation management."}, "https://arxiv.org/abs/2410.05634": {"title": "Identification and estimation for matrix time series CP-factor models", "link": "https://arxiv.org/abs/2410.05634", "description": "arXiv:2410.05634v1 Announce Type: new \nAbstract: We investigate the identification and the estimation for matrix time series CP-factor models. Unlike the generalized eigenanalysis-based method of Chang et al. 
(2023) which requires the two factor loading matrices to be full-ranked, the newly proposed estimation can handle rank-deficient factor loading matrices. The estimation procedure consists of the spectral decomposition of several matrices and a matrix joint diagonalization algorithm, resulting in low computational cost. The theoretical guarantee established without the stationarity assumption shows that the proposed estimation exhibits a faster convergence rate than that of Chang et al. (2023). In fact the new estimator is free from the adverse impact of any eigen-gaps, unlike most eigenanalysis-based methods such as that of Chang et al. (2023). Furthermore, in terms of the error rates of the estimation, the proposed procedure is equivalent to handling a vector time series of dimension $\\max(p,q)$ instead of $p \\times q$, where $(p, q)$ are the dimensions of the matrix time series concerned. We have achieved this without assuming the \"near orthogonality\" of the loadings under various incoherence conditions often imposed in the CP-decomposition literature, see Han and Zhang (2022), Han et al. (2024) and the references within. Illustration with both simulated and real matrix time series data shows the usefulness of the proposed approach."}, "https://arxiv.org/abs/2410.05741": {"title": "The Transmission of Monetary Policy via Common Cycles in the Euro Area", "link": "https://arxiv.org/abs/2410.05741", "description": "arXiv:2410.05741v1 Announce Type: new \nAbstract: We use a FAVAR model with proxy variables and sign restrictions to investigate the role of the euro area common output and inflation cycles in the transmission of monetary policy shocks. We find that common cycles explain most of the variation in output and inflation across member countries, while Southern European economies show larger deviations from the cycles in the aftermath of the financial crisis. Building on this evidence, we show that monetary policy is homogeneously propagated to member countries via the common cycles. In contrast, country-specific transmission channels lead to heterogeneous country responses to monetary policy shocks. Consequently, our empirical results suggest that the divergent effects of ECB monetary policy are due to heterogeneous country-specific exposures to financial markets and not due to dis-synchronized economies of the euro area."}, "https://arxiv.org/abs/2410.05858": {"title": "Detecting dependence structure: visualization and inference", "link": "https://arxiv.org/abs/2410.05858", "description": "arXiv:2410.05858v1 Announce Type: new \nAbstract: Identifying dependency between two random variables is a fundamental problem. Clear interpretability and ability of a procedure to provide information on the form of the possible dependence is particularly important in evaluating dependencies. We introduce a new estimator of the quantile dependence function and pertinent local acceptance regions. This leads to insightful visualization and evaluation of underlying dependence structure. We also propose a test of independence of two random variables, pertinent to this new estimator. Our procedures are based on ranks and we derive a finite-sample theory that guarantees the inferential validity of our solutions at any given sample size. The procedures are simple to implement and computationally efficient. Large-sample consistency of the proposed test is also proved. 
We show that, in terms of power, the new test is one of the best statistics for independence testing when considering a wide range of alternative models. Finally, we demonstrate the use of our approach to visualize dependence structure and to detect local departures from independence through analyzing some datasets."}, "https://arxiv.org/abs/2410.05861": {"title": "Persistence-Robust Break Detection in Predictive Quantile and CoVaR Regressions", "link": "https://arxiv.org/abs/2410.05861", "description": "arXiv:2410.05861v1 Announce Type: new \nAbstract: Forecasting risk (as measured by quantiles) and systemic risk (as measured by Adrian and Brunnermeier's (2016) CoVaR) is important in economics and finance. However, past research has shown that predictive relationships may be unstable over time. Therefore, this paper develops structural break tests in predictive quantile and CoVaR regressions. These tests can detect changes in the forecasting power of covariates, and are based on the principle of self-normalization. We show that our tests are valid irrespective of whether the predictors are stationary or near-stationary, rendering the tests suitable for a range of practical applications. Simulations illustrate the good finite-sample properties of our tests. Two empirical applications concerning equity premium and systemic risk forecasting models show the usefulness of the tests."}, "https://arxiv.org/abs/2410.05893": {"title": "Model Uncertainty and Missing Data: An Objective Bayesian Perspective", "link": "https://arxiv.org/abs/2410.05893", "description": "arXiv:2410.05893v1 Announce Type: new \nAbstract: The interplay between missing data and model uncertainty -- two classic statistical problems -- leads to primary questions that we formally address from an objective Bayesian perspective. For the general regression problem, we discuss the probabilistic justification of Rubin's rules applied to the usual components of Bayesian variable selection, arguing that prior predictive marginals should be central to the pursued methodology. In the regression settings, we explore the conditions of prior distributions that make the missing data mechanism ignorable. Moreover, when comparing multiple linear models, we provide a complete methodology for dealing with special cases, such as variable selection or uncertainty regarding model errors. In numerous simulation experiments, we demonstrate that our method outperforms or equals others, in consistently producing results close to those obtained using the full dataset. In general, the difference increases with the percentage of missing data and the correlation between the variables used for imputation. Finally, we summarize possible directions for future research."}, "https://arxiv.org/abs/2410.06125": {"title": "Dynamic graphical models: Theory, structure and counterfactual forecasting", "link": "https://arxiv.org/abs/2410.06125", "description": "arXiv:2410.06125v1 Announce Type: new \nAbstract: Simultaneous graphical dynamic linear models (SGDLMs) provide advances in flexibility, parsimony and scalability of multivariate time series analysis, with proven utility in forecasting. Core theoretical aspects of such models are developed, including new results linking dynamic graphical and latent factor models. Methodological developments extend existing Bayesian sequential analyses for model marginal likelihood evaluation and counterfactual forecasting.
The latter, involving new Bayesian computational developments for missing data in SGDLMs, is motivated by causal applications. A detailed example illustrating the models and new methodology concerns global macroeconomic time series with complex, time-varying cross-series relationships and primary interests in potential causal effects."}, "https://arxiv.org/abs/2410.06281": {"title": "Sequential Design with Derived Win Statistics", "link": "https://arxiv.org/abs/2410.06281", "description": "arXiv:2410.06281v1 Announce Type: new \nAbstract: The Win Ratio has gained significant traction in cardiovascular trials as a novel method for analyzing composite endpoints (Pocock and others, 2012). Compared with conventional approaches based on time to the first event, the Win Ratio accommodates the varying priorities and types of outcomes among components, potentially offering greater statistical power by fully utilizing the information contained within each outcome. However, studies using the Win Ratio have largely been confined to fixed designs, limiting flexibility for early decisions, such as stopping for futility or efficacy. Our study proposes a sequential design framework incorporating multiple interim analyses based on Win Ratio or Net Benefit statistics. Moreover, we provide rigorous proof of the canonical joint distribution for sequential Win Ratio and Net Benefit statistics, and an algorithm for sample size determination is developed. We also provide results from a finite sample simulation study, which show that our proposed method controls the Type I error, maintains the power level, and has a smaller average sample size than the fixed design. A real cardiovascular study is used to illustrate the proposed method."}, "https://arxiv.org/abs/2410.06326": {"title": "A convex formulation of covariate-adjusted Gaussian graphical models via natural parametrization", "link": "https://arxiv.org/abs/2410.06326", "description": "arXiv:2410.06326v1 Announce Type: new \nAbstract: Gaussian graphical models (GGMs) are widely used for recovering the conditional independence structure among random variables. Recently, several key advances have been made to exploit an additional set of variables for better estimating the GGMs of the variables of interest. For example, in co-expression quantitative trait locus (eQTL) studies, both the mean expression level of genes as well as their pairwise conditional independence structure may be adjusted by genetic variants local to those genes. Existing methods to estimate covariate-adjusted GGMs either allow only the mean to depend on covariates or suffer from poor scaling assumptions due to the inherent non-convexity of simultaneously estimating the mean and precision matrix. In this paper, we propose a convex formulation that jointly estimates the covariate-adjusted mean and precision matrix by utilizing the natural parametrization of the multivariate Gaussian likelihood. This convexity yields theoretically better performance as the sparsity and dimension of the covariates grow large relative to the number of samples.
We verify our theoretical results with numerical simulations and perform a reanalysis of an eQTL study of glioblastoma multiforme (GBM), an aggressive form of brain cancer."}, "https://arxiv.org/abs/2410.06377": {"title": "A Non-parametric Direct Learning Approach to Heterogeneous Treatment Effect Estimation under Unmeasured Confounding", "link": "https://arxiv.org/abs/2410.06377", "description": "arXiv:2410.06377v1 Announce Type: new \nAbstract: In many social, behavioral, and biomedical sciences, treatment effect estimation is a crucial step in understanding the impact of an intervention, policy, or treatment. In recent years, an increasing emphasis has been placed on heterogeneity in treatment effects, leading to the development of various methods for estimating Conditional Average Treatment Effects (CATE). These approaches hinge on a crucial identifying condition of no unmeasured confounding, an assumption that is not always guaranteed in observational studies or randomized control trials with non-compliance. In this paper, we propose a general framework for estimating CATE with a possible unmeasured confounder using Instrumental Variables. We also construct estimators that exhibit greater efficiency and robustness against various scenarios of model misspecification. The efficacy of the proposed framework is demonstrated through simulation studies and a real data example."}, "https://arxiv.org/abs/2410.06394": {"title": "Nested Compound Random Measures", "link": "https://arxiv.org/abs/2410.06394", "description": "arXiv:2410.06394v1 Announce Type: new \nAbstract: Nested nonparametric processes are vectors of random probability measures widely used in the Bayesian literature to model the dependence across distinct, though related, groups of observations. These processes allow a two-level clustering, both at the observational and group levels. Several alternatives have been proposed starting from the nested Dirichlet process by Rodr\\'iguez et al. (2008). However, most of the available models are neither computationally efficient nor mathematically tractable. In the present paper, we aim to introduce a range of nested processes that are mathematically tractable, flexible, and computationally efficient. Our proposal builds upon Compound Random Measures, which are vectors of dependent random measures introduced earlier by Griffin and Leisen (2017). We provide a complete investigation of theoretical properties of our model. In particular, we prove a general posterior characterization for vectors of Compound Random Measures, which is interesting per se and still not available in the current literature. Based on our theoretical results and the available posterior representation, we develop the first Ferguson & Klass algorithm for nested nonparametric processes. We specialize our general theorems and algorithms in noteworthy examples. We finally test the model's performance on different simulated scenarios, and we exploit the construction to study air pollution in different provinces of an Italian region (Lombardy).
We empirically show how nested processes based on Compound Random Measures outperform other Bayesian competitors."}, "https://arxiv.org/abs/2410.06451": {"title": "False Discovery Rate Control via Data Splitting for Testing-after-Clustering", "link": "https://arxiv.org/abs/2410.06451", "description": "arXiv:2410.06451v1 Announce Type: new \nAbstract: Testing for differences in features between clusters in various applications often leads to inflated false positives when practitioners use the same dataset to identify clusters and then test features, an issue commonly known as ``double dipping''. To address this challenge, inspired by data-splitting strategies for controlling the false discovery rate (FDR) in regressions (Dai et al., 2023), we present a novel method that applies data-splitting to control FDR while maintaining high power in unsupervised clustering. We first divide the dataset into two halves, then apply the conventional testing-after-clustering procedure to each half separately and combine the resulting test statistics to form a new statistic for each feature. The new statistic can help control the FDR due to its property of having a sampling distribution that is symmetric around zero for any null feature. To further enhance stability and power, we suggest multiple data splitting, which involves repeatedly splitting the data and combining results. Our proposed data-splitting methods are mathematically proven to asymptotically control FDR in Gaussian settings. Through extensive simulations and analyses of single-cell RNA sequencing (scRNA-seq) datasets, we demonstrate that the data-splitting methods are easy to implement, adaptable to existing single-cell data analysis pipelines, and often outperform other approaches when dealing with weak signals and high correlations among features."}, "https://arxiv.org/abs/2410.06484": {"title": "Model-assisted and Knowledge-guided Transfer Regression for the Underrepresented Population", "link": "https://arxiv.org/abs/2410.06484", "description": "arXiv:2410.06484v1 Announce Type: new \nAbstract: Covariate shift and outcome model heterogeneity are two prominent challenges in leveraging external sources to improve risk modeling for underrepresented cohorts when accurate labels are scarce. We consider the transfer learning problem targeting some unlabeled minority sample encountering (i) covariate shift to the labeled source sample collected on a different cohort; and (ii) outcome model heterogeneity with some majority sample informative to the targeted minority model. In this scenario, we develop a novel model-assisted and knowledge-guided transfer learning targeting underrepresented population (MAKEUP) approach for high-dimensional regression models. Our MAKEUP approach includes a model-assisted debiasing step in response to the covariate shift, accompanied by a knowledge-guided sparsifying procedure leveraging the majority data to enhance learning on the minority group. We also develop a model selection method to avoid negative knowledge transfer that can work in the absence of gold standard labels on the target sample. Theoretical analyses show that MAKEUP provides efficient estimation for the target model on the minority group. It maintains robustness to the high complexity and misspecification of the nuisance models used for covariate shift correction, as well as adaptivity to the model heterogeneity and potential negative transfer between the majority and minority groups.
Numerical studies demonstrate similar advantages in finite sample settings over existing approaches. We also illustrate our approach through a real-world application about the transfer learning of Type II diabetes genetic risk models on some underrepresented ancestry group."}, "https://arxiv.org/abs/2410.06564": {"title": "Green bubbles: a four-stage paradigm for detection and propagation", "link": "https://arxiv.org/abs/2410.06564", "description": "arXiv:2410.06564v1 Announce Type: new \nAbstract: Climate change has emerged as a significant global concern, attracting increasing attention worldwide. While green bubbles may be examined through a social bubble hypothesis, it is essential not to neglect a Climate Minsky moment triggered by sudden asset price changes. The significant increase in green investments highlights the urgent need for a comprehensive understanding of these market dynamics. Therefore, the current paper introduces a novel paradigm for studying such phenomena. Focusing on the renewable energy sector, Statistical Process Control (SPC) methodologies are employed to identify green bubbles within time series data. Furthermore, search volume indexes and social factors are incorporated into established econometric models to reveal potential implications for the financial system. Inspired by Joseph Schumpeter's perspectives on business cycles, this study recognizes green bubbles as a necessary evil for facilitating a successful transition towards a more sustainable future."}, "https://arxiv.org/abs/2410.06580": {"title": "When Does Interference Matter? Decision-Making in Platform Experiments", "link": "https://arxiv.org/abs/2410.06580", "description": "arXiv:2410.06580v1 Announce Type: new \nAbstract: This paper investigates decision-making in A/B experiments for online platforms and marketplaces. In such settings, due to constraints on inventory, A/B experiments typically lead to biased estimators because of interference; this phenomenon has been well studied in recent literature. By contrast, there has been relatively little discussion of the impact of interference on decision-making. In this paper, we analyze a benchmark Markovian model of an inventory-constrained platform, where arriving customers book listings that are limited in supply; our analysis builds on a self-contained analysis of general A/B experiments for Markov chains. We focus on the commonly used frequentist hypothesis testing approach for making launch decisions based on data from customer-randomized experiments, and we study the impact of interference on (1) false positive probability and (2) statistical power.\n We obtain three main findings. First, we show that for {\\em monotone} treatments -- i.e., those where the treatment changes booking probabilities in the same direction relative to control in all states -- the false positive probability of the na\\\"ive difference-in-means estimator with classical variance estimation is correctly controlled. This result stems from a novel analysis of A/A experiments with arbitrary dependence structures, which may be of independent interest. Second, we demonstrate that for monotone treatments, the statistical power of this na\\\"ive approach is higher than that of any similar pipeline using a debiased estimator. Taken together, these two findings suggest that platforms may be better off *not* debiasing when treatments are monotone. 
Finally, using simulations, we investigate false positive probability and statistical power when treatments are non-monotone, and we show that the performance of the na\\\"ive approach can be arbitrarily worse than a debiased approach in such cases."}, "https://arxiv.org/abs/2410.06726": {"title": "Sharp Bounds of the Causal Effect Under MNAR Confounding", "link": "https://arxiv.org/abs/2410.06726", "description": "arXiv:2410.06726v1 Announce Type: new \nAbstract: We report bounds for any contrast between the probabilities of the counterfactual outcome under exposure and non-exposure when the confounders are missing not at random. We assume that the missingness mechanism is outcome-independent, and prove that our bounds are arbitrarily sharp, i.e., practically attainable or logically possible."}, "https://arxiv.org/abs/2410.06875": {"title": "Group Shapley Value and Counterfactual Simulations in a Structural Model", "link": "https://arxiv.org/abs/2410.06875", "description": "arXiv:2410.06875v1 Announce Type: new \nAbstract: We propose a variant of the Shapley value, the group Shapley value, to interpret counterfactual simulations in structural economic models by quantifying the importance of different components. Our framework compares two sets of parameters, partitioned into multiple groups, and applying group Shapley value decomposition yields unique additive contributions to the changes between these sets. The relative contributions sum to one, enabling us to generate an importance table that is as easily interpretable as a regression table. The group Shapley value can be characterized as the solution to a constrained weighted least squares problem. Using this property, we develop robust decomposition methods to address scenarios where inputs for the group Shapley value are missing. We first apply our methodology to a simple Roy model and then illustrate its usefulness by revisiting two published papers."}, "https://arxiv.org/abs/2410.06939": {"title": "Direct Estimation for Commonly Used Pattern-Mixture Models in Clinical Trials", "link": "https://arxiv.org/abs/2410.06939", "description": "arXiv:2410.06939v1 Announce Type: new \nAbstract: Pattern-mixture models have received increasing attention as they are commonly used to assess treatment effects in primary or sensitivity analyses for clinical trials with nonignorable missing data. Pattern-mixture models have traditionally been implemented using multiple imputation, where the variance estimation may be a challenge because Rubin's approach of combining between- and within-imputation variance may not provide consistent variance estimation, while bootstrap methods may be time-consuming. Direct likelihood-based approaches have been proposed in the literature and implemented for some pattern-mixture models, but the assumptions are sometimes restrictive, and the theoretical framework is fragile. In this article, we propose an analytical framework for an efficient direct likelihood estimation method for commonly used pattern-mixture models corresponding to return-to-baseline, jump-to-reference, placebo washout, and retrieved dropout imputations. A parsimonious tipping point analysis is also discussed and implemented. Results from simulation studies demonstrate that the proposed methods provide consistent estimators.
We further illustrate the utility of the proposed methods using data from a clinical trial evaluating a treatment for type 2 diabetes."}, "https://arxiv.org/abs/2410.07091": {"title": "Collusion Detection with Graph Neural Networks", "link": "https://arxiv.org/abs/2410.07091", "description": "arXiv:2410.07091v1 Announce Type: new \nAbstract: Collusion is a complex phenomenon in which companies secretly collaborate to engage in fraudulent practices. This paper presents an innovative methodology for detecting and predicting collusion patterns in different national markets using neural networks (NNs) and graph neural networks (GNNs). GNNs are particularly well suited to this task because they can exploit the inherent network structures present in collusion and many other economic problems. Our approach consists of two phases: In Phase I, we develop and train models on individual market datasets from Japan, the United States, two regions in Switzerland, Italy, and Brazil, focusing on predicting collusion in single markets. In Phase II, we extend the models' applicability through zero-shot learning, employing a transfer learning approach that can detect collusion in markets in which training data is unavailable. This phase also incorporates out-of-distribution (OOD) generalization to evaluate the models' performance on unseen datasets from other countries and regions. In our empirical study, we show that GNNs outperform NNs in detecting complex collusive patterns. This research contributes to the ongoing discourse on preventing collusion and optimizing detection methodologies, providing valuable guidance on the use of NNs and GNNs in economic applications to enhance market fairness and economic welfare."}, "https://arxiv.org/abs/2410.07161": {"title": "Spatiotemporal Modeling and Forecasting at Scale with Dynamic Generalized Linear Models", "link": "https://arxiv.org/abs/2410.07161", "description": "arXiv:2410.07161v1 Announce Type: new \nAbstract: Spatiotemporal data consisting of timestamps, GPS coordinates, and IDs occurs in many settings. Modeling approaches for this type of data must address challenges in terms of sensor noise, uneven sampling rates, and non-persistent IDs. In this work, we characterize and forecast human mobility at scale with dynamic generalized linear models (DGLMs). We represent mobility data as occupancy counts of spatial cells over time and use DGLMs to model the occupancy counts for each spatial cell in an area of interest. DGLMs are flexible to varying numbers of occupancy counts across spatial cells, are dynamic, and easily incorporate daily and weekly seasonality in the aggregate-level behavior. Our overall approach is robust to various types of noise and scales linearly in the number of spatial cells, time bins, and agents. Our results show that DGLMs provide accurate occupancy count forecasts over a variety of spatial resolutions and forecast horizons. We also present scaling results for spatiotemporal data consisting of hundreds of millions of observations. 
Our approach is flexible to support several downstream applications, including characterizing human mobility, forecasting occupancy counts, and anomaly detection for aggregate-level behaviors."}, "https://arxiv.org/abs/2410.05419": {"title": "Refining Counterfactual Explanations With Joint-Distribution-Informed Shapley Towards Actionable Minimality", "link": "https://arxiv.org/abs/2410.05419", "description": "arXiv:2410.05419v1 Announce Type: cross \nAbstract: Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanations that are presented to users and stakeholders. We address this problem by proposing a method that minimizes the required feature changes while maintaining the validity of CE, without imposing restrictions on models or CE algorithms, whether instance- or group-based. The key innovation lies in computing a joint distribution between observed and counterfactual data and leveraging it to inform Shapley values for feature attributions (FA). We demonstrate that optimal transport (OT) effectively derives this distribution, especially when the alignment between observed and counterfactual data is unclear in used CE methods. Additionally, a counterintuitive finding is uncovered: it may be misleading to rely on an exact alignment defined by the CE generation mechanism in conducting FA. Our proposed method is validated on extensive experiments across multiple datasets, showcasing its effectiveness in refining CE towards greater actionable efficiency."}, "https://arxiv.org/abs/2410.05444": {"title": "Online scalable Gaussian processes with conformal prediction for guaranteed coverage", "link": "https://arxiv.org/abs/2410.05444", "description": "arXiv:2410.05444v1 Announce Type: cross \nAbstract: The Gaussian process (GP) is a Bayesian nonparametric paradigm that is widely adopted for uncertainty quantification (UQ) in a number of safety-critical applications, including robotics, healthcare, as well as surveillance. The consistency of the resulting uncertainty values, however, hinges on the premise that the learning function conforms to the properties specified by the GP model, such as smoothness, periodicity and more, which may not be satisfied in practice, especially with data arriving on the fly. To combat such model mis-specification, we propose to wed the GP with the prevailing conformal prediction (CP), a distribution-free post-processing framework that produces prediction sets with provably valid coverage under the sole assumption of data exchangeability. However, this assumption is usually violated in the online setting, where a prediction set is sought before revealing the true label. To ensure long-term coverage guarantee, we will adaptively set the key threshold parameter based on the feedback whether the true label falls inside the prediction set.
Numerical results demonstrate the merits of the online GP-CP approach relative to existing alternatives in the long-term coverage performance."}, "https://arxiv.org/abs/2410.05458": {"title": "Testing Credibility of Public and Private Surveys through the Lens of Regression", "link": "https://arxiv.org/abs/2410.05458", "description": "arXiv:2410.05458v1 Announce Type: cross \nAbstract: Testing whether a sample survey is a credible representation of the population is an important question to ensure the validity of any downstream research. While this problem, in general, does not have an efficient solution, one might take a task-based approach and aim to understand whether a certain data analysis tool, like linear regression, would yield similar answers both on the population and the sample survey. In this paper, we design an algorithm to test the credibility of a sample survey in terms of linear regression. In other words, we design an algorithm that can certify if a sample survey is good enough to guarantee the correctness of data analysis done using linear regression tools. Nowadays, one is naturally concerned about data privacy in surveys. Thus, we further test the credibility of surveys published in a differentially private manner. Specifically, we focus on Local Differential Privacy (LDP), which is a standard technique to ensure privacy in surveys where the survey participants might not trust the aggregator. We extend our algorithm to work even when the data analysis has been done using surveys with LDP. In the process, we also propose an algorithm that learns, with high probability, the guarantees of a linear regression model on a survey published with LDP. Our algorithm also serves as a mechanism to learn linear regression models from data corrupted with noise coming from any subexponential distribution. We prove that it achieves the optimal estimation error bound for $\\ell_1$ linear regression, which might be of broader interest. We prove the theoretical correctness of our algorithms while trying to reduce the sample complexity for both public and private surveys. We also numerically demonstrate the performance of our algorithms on real and synthetic datasets."}, "https://arxiv.org/abs/2410.05484": {"title": "Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning", "link": "https://arxiv.org/abs/2410.05484", "description": "arXiv:2410.05484v1 Announce Type: cross \nAbstract: Despite their success and widespread adoption, the opaque nature of deep neural networks (DNNs) continues to hinder trust, especially in critical applications. Current interpretability solutions often yield inconsistent or oversimplified explanations, or require model changes that compromise performance. In this work, we introduce TRACER, a novel method grounded in causal inference theory designed to estimate the causal dynamics underpinning DNN decisions without altering their architecture or compromising their performance. Our approach systematically intervenes on input features to observe how specific changes propagate through the network, affecting internal activations and final outputs. Based on this analysis, we determine the importance of individual features, and construct a high-level causal map by grouping functionally similar layers into cohesive causal nodes, providing a structured and interpretable view of how different parts of the network influence the decisions.
TRACER further enhances explainability by generating counterfactuals that reveal possible model biases and offer contrastive explanations for misclassifications. Through comprehensive evaluations across diverse datasets, we demonstrate TRACER's effectiveness over existing methods and show its potential for creating highly compressed yet accurate models, illustrating its dual versatility in both understanding and optimizing DNNs."}, "https://arxiv.org/abs/2410.05548": {"title": "Scalable Inference for Bayesian Multinomial Logistic-Normal Dynamic Linear Models", "link": "https://arxiv.org/abs/2410.05548", "description": "arXiv:2410.05548v1 Announce Type: cross \nAbstract: Many scientific fields collect longitudinal count compositional data. Each observation is a multivariate count vector, where the total counts are arbitrary, and the information lies in the relative frequency of the counts. Multiple authors have proposed Bayesian Multinomial Logistic-Normal Dynamic Linear Models (MLN-DLMs) as a flexible approach to modeling these data. However, adoption of these methods has been limited by computational challenges. This article develops an efficient and accurate approach to posterior state estimation, called $\\textit{Fenrir}$. Our approach relies on a novel algorithm for MAP estimation and an accurate approximation to a key posterior marginal of the model. As there are no equivalent methods against which we can compare, we also develop an optimized Stan implementation of MLN-DLMs. Our experiments suggest that Fenrir can be three orders of magnitude more efficient than Stan and can even be incorporated into larger sampling schemes for joint inference of model hyperparameters. Our methods are made available to the community as a user-friendly software library written in C++ with an R interface."}, "https://arxiv.org/abs/2410.05567": {"title": "With random regressors, least squares inference is robust to correlated errors with unknown correlation structure", "link": "https://arxiv.org/abs/2410.05567", "description": "arXiv:2410.05567v1 Announce Type: cross \nAbstract: Linear regression is arguably the most widely used statistical method. With fixed regressors and correlated errors, the conventional wisdom is to modify the variance-covariance estimator to accommodate the known correlation structure of the errors. We depart from the literature by showing that with random regressors, linear regression inference is robust to correlated errors with unknown correlation structure. The existing theoretical analyses for linear regression are no longer valid because even the asymptotic normality of the least-squares coefficients breaks down in this regime. We first prove the asymptotic normality of the t statistics by establishing their Berry-Esseen bounds based on a novel probabilistic analysis of self-normalized statistics. We then study the local power of the corresponding t tests and show that, perhaps surprisingly, error correlation can even enhance power in the regime of weak signals. 
Overall, our results show that linear regression is applicable more broadly than the conventional theory suggests, and further demonstrate the value of randomization to ensure robustness of inference."}, "https://arxiv.org/abs/2410.05668": {"title": "Diversity and Inclusion Index with Networks and Similarity: Analysis and its Application", "link": "https://arxiv.org/abs/2410.05668", "description": "arXiv:2410.05668v1 Announce Type: cross \nAbstract: In recent years, the concepts of ``diversity'' and ``inclusion'' have attracted considerable attention across a range of fields, encompassing both social and biological disciplines. To fully understand these concepts, it is critical to not only examine the number of categories but also the similarities and relationships among them. In this study, I introduce a novel index for diversity and inclusion that considers similarities and network connections. I analyzed the properties of these indices and investigated their mathematical relationships using established measures of diversity and networks. Moreover, I developed a methodology for estimating similarities based on the utility of diversity. I also created a method for visualizing proportions, similarities, and network connections. Finally, I evaluated the correlation with external metrics using real-world data, confirming that both the proposed indices and our index can be effectively utilized. This study contributes to a more nuanced understanding of diversity and inclusion analysis."}, "https://arxiv.org/abs/2410.05753": {"title": "Pathwise Gradient Variance Reduction with Control Variates in Variational Inference", "link": "https://arxiv.org/abs/2410.05753", "description": "arXiv:2410.05753v1 Announce Type: cross \nAbstract: Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution. In these cases, pathwise and score-function gradient estimators are the most common approaches. The pathwise estimator is often favoured for its substantially lower variance compared to the score-function estimator, which typically requires variance reduction techniques. However, recent research suggests that even pathwise gradient estimators could benefit from variance reduction. In this work, we review existing control-variates-based variance reduction methods for pathwise gradient estimators to assess their effectiveness. Notably, these methods often rely on integrand approximations and are applicable only to simple variational families. To address this limitation, we propose applying zero-variance control variates to pathwise gradient estimators. This approach offers the advantage of requiring minimal assumptions about the variational distribution, other than being able to sample from it."}, "https://arxiv.org/abs/2410.05757": {"title": "Temperature Optimization for Bayesian Deep Learning", "link": "https://arxiv.org/abs/2410.05757", "description": "arXiv:2410.05757v1 Announce Type: cross \nAbstract: The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term `CPE' suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. 
In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily focuses on predictive performance of the PPD, the latter emphasizes calibrated uncertainty and robustness to model misspecification; these distinct objectives lead to different temperature preferences."}, "https://arxiv.org/abs/2410.05843": {"title": "A time warping model for seasonal data with application to age estimation from narwhal tusks", "link": "https://arxiv.org/abs/2410.05843", "description": "arXiv:2410.05843v1 Announce Type: cross \nAbstract: Signals with varying periodicity frequently appear in real-world phenomena, necessitating the development of efficient modelling techniques to map the measured nonlinear timeline to linear time. Here we propose a regression model that allows for a representation of periodic and dynamic patterns observed in time series data. The model incorporates a hidden strictly increasing stochastic process that represents the instantaneous frequency, allowing the model to adapt and accurately capture varying time scales. A case study focusing on age estimation of narwhal tusks is presented, where cyclic element signals associated with annual growth layer groups are analyzed. We apply the methodology to data from one such tusk collected in West Greenland and use the fitted model to estimate the age of the narwhal. The proposed method is validated using simulated signals with known cycle counts, and practical considerations and modelling challenges are discussed in detail. This research contributes to the field of time series analysis, providing a tool and valuable insights for understanding and modeling complex cyclic patterns in diverse domains."}, "https://arxiv.org/abs/2410.06163": {"title": "Likelihood-based Differentiable Structure Learning", "link": "https://arxiv.org/abs/2410.06163", "description": "arXiv:2410.06163v1 Announce Type: cross \nAbstract: Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identify the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold.
These theoretical results are validated empirically, showing how this can be done using standard gradient-based optimizers, thus paving the way for differentiable structure learning under general models and losses."}, "https://arxiv.org/abs/2410.06349": {"title": "Robust Domain Generalisation with Causal Invariant Bayesian Neural Networks", "link": "https://arxiv.org/abs/2410.06349", "description": "arXiv:2410.06349v1 Announce Type: cross \nAbstract: Deep neural networks can obtain impressive performance on various tasks under the assumption that their training domain is identical to their target domain. Performance can drop dramatically when this assumption does not hold. One explanation for this discrepancy is the presence of spurious domain-specific correlations in the training data that the network exploits. Causal mechanisms, on the other hand, can be made invariant under distribution changes as they allow disentangling the factors of distribution underlying the data generation. Yet, learning causal mechanisms to improve out-of-distribution generalisation remains an under-explored area. We propose a Bayesian neural architecture that disentangles the learning of the data distribution from the inference process mechanisms. We show theoretically and experimentally that our model approximates reasoning under causal interventions. We demonstrate the performance of our method, outperforming point-estimate counterparts, on out-of-distribution image recognition tasks where the data distribution acts as strong adversarial confounders."}, "https://arxiv.org/abs/2410.06381": {"title": "Statistical Inference for Low-Rank Tensors: Heteroskedasticity, Subgaussianity, and Applications", "link": "https://arxiv.org/abs/2410.06381", "description": "arXiv:2410.06381v1 Announce Type: cross \nAbstract: In this paper, we consider inference and uncertainty quantification for low Tucker rank tensors with additive noise in the high-dimensional regime. Focusing on the output of the higher-order orthogonal iteration (HOOI) algorithm, a commonly used algorithm for tensor singular value decomposition, we establish non-asymptotic distributional theory and study how to construct confidence regions and intervals for both the estimated singular vectors and the tensor entries in the presence of heteroskedastic subgaussian noise, which are further shown to be optimal for homoskedastic Gaussian noise. Furthermore, as a byproduct of our theoretical results, we establish the entrywise convergence of HOOI when initialized via diagonal deletion. To further illustrate the utility of our theoretical results, we then consider several concrete statistical inference tasks. First, in the tensor mixed-membership blockmodel, we consider a two-sample test for equality of membership profiles, and we propose a test statistic with consistency under local alternatives that exhibits a power improvement relative to the corresponding matrix test considered in several previous works. Next, we consider simultaneous inference for small collections of entries of the tensor, and we obtain consistent confidence regions. Finally, focusing on the particular case of testing whether entries of the tensor are equal, we propose a consistent test statistic that shows how index overlap results in different asymptotic standard deviations.
All of our proposed procedures are fully data-driven, adaptive to noise distribution and signal strength, and do not rely on sample-splitting, and our main results highlight the effect of higher-order structures on estimation relative to the matrix setting. Our theoretical results are demonstrated through numerical simulations."}, "https://arxiv.org/abs/2410.06407": {"title": "A Skewness-Based Criterion for Addressing Heteroscedastic Noise in Causal Discovery", "link": "https://arxiv.org/abs/2410.06407", "description": "arXiv:2410.06407v1 Announce Type: cross \nAbstract: Real-world data often violates the equal-variance assumption (homoscedasticity), making it essential to account for heteroscedastic noise in causal discovery. In this work, we explore heteroscedastic symmetric noise models (HSNMs), where the effect $Y$ is modeled as $Y = f(X) + \\sigma(X)N$, with $X$ as the cause and $N$ as independent noise following a symmetric distribution. We introduce a novel criterion for identifying HSNMs based on the skewness of the score (i.e., the gradient of the log density) of the data distribution. This criterion establishes a computationally tractable measurement that is zero in the causal direction but nonzero in the anticausal direction, enabling the causal direction discovery. We extend this skewness-based criterion to the multivariate setting and propose SkewScore, an algorithm that handles heteroscedastic noise without requiring the extraction of exogenous noise. We also conduct a case study on the robustness of SkewScore in a bivariate model with a latent confounder, providing theoretical insights into its performance. Empirical studies further validate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2410.06774": {"title": "Retrieved dropout imputation considering administrative study withdrawal", "link": "https://arxiv.org/abs/2410.06774", "description": "arXiv:2410.06774v1 Announce Type: cross \nAbstract: The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E9 (R1) Addendum provides a framework for defining estimands in clinical trials. The treatment policy strategy is the most commonly used approach to handle intercurrent events in defining estimands. Imputing missing values for potential outcomes under the treatment policy strategy has been discussed in the literature. Missing values as a result of administrative study withdrawals (such as site closures due to business reasons, COVID-19 control measures, and geopolitical conflicts) are often imputed in the same way as other missing values occurring after intercurrent events related to safety or efficacy. Some research suggests using a hypothetical strategy to handle the treatment discontinuations due to administrative study withdrawal in defining the estimands and imputing the missing values based on completer data assuming missing at random, but this approach ignores the fact that subjects might experience other intercurrent events had they not had the administrative study withdrawal. In this article, we consider that administrative study withdrawal censors the normal, real-world-like intercurrent events and propose two methods for handling the corresponding missing values under the retrieved dropout imputation framework. Simulation shows the two methods perform well.
We also applied the methods to actual clinical trial data evaluating an anti-diabetes treatment."}, "https://arxiv.org/abs/1906.09581": {"title": "Resistant convex clustering: How does the fusion penalty enhance resistance?", "link": "https://arxiv.org/abs/1906.09581", "description": "arXiv:1906.09581v3 Announce Type: replace \nAbstract: Convex clustering is a convex relaxation of the $k$-means and hierarchical clustering. It involves solving a convex optimization problem with the objective function being a squared error loss plus a fusion penalty that encourages the estimated centroids for observations in the same cluster to be identical. However, when data are contaminated, convex clustering with a squared error loss fails even when there is only one arbitrary outlier. To address this challenge, we propose a resistant convex clustering method. Theoretically, we show that the new estimator is resistant to arbitrary outliers: it does not break down until more than half of the observations are arbitrary outliers. Perhaps surprisingly, the fusion penalty can help enhance resistance by fusing the estimators to the cluster centers of uncontaminated samples, but not the other way around. Numerical studies demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2103.08035": {"title": "Testing for the Network Small-World Property", "link": "https://arxiv.org/abs/2103.08035", "description": "arXiv:2103.08035v2 Announce Type: replace \nAbstract: Researchers have long observed that the ``small-world\" property, which combines the concepts of high transitivity or clustering with a low average path length, is ubiquitous for networks obtained from a variety of disciplines, including social sciences, biology, neuroscience, and ecology. However, we find several shortcomings of the currently prevalent definition and detection methods rendering the concept less powerful. First, the widely used \\textit{small world coefficient} metric combines high transitivity with a low average path length in a single measure that confounds the two separate aspects. We find that the value of the metric is dominated by transitivity, and in several cases, networks get flagged as ``small world\" solely because of their high transitivity. Second, the detection methods lack a formal statistical inference. Third, the comparison is typically performed against simplistic random graph models as the baseline, ignoring well-known network characteristics, and risks confounding the small world property with other network properties. We decouple the properties of high transitivity and low average path length as separate events to test for. Then we define the property as a statistical test between a suitable null hypothesis and a superimposed alternative hypothesis. We propose a parametric bootstrap test with several null hypothesis models to allow a wide range of background structures in the network. In addition to the bootstrap tests, we also propose an asymptotic test under the Erd\\\"{o}s-Ren\'{y}i null model for which we provide theoretical guarantees on the asymptotic level and power. Our theoretical results include asymptotic distributions of clustering coefficient for various asymptotic growth rates on the probability of an edge.
Applying the proposed methods to a large number of network datasets, we uncover new insights about their small-world property."}, "https://arxiv.org/abs/2209.01754": {"title": "Learning from a Biased Sample", "link": "https://arxiv.org/abs/2209.01754", "description": "arXiv:2209.01754v3 Announce Type: replace \nAbstract: The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction."}, "https://arxiv.org/abs/2301.07855": {"title": "Digital Divide: Empirical Study of CIUS 2020", "link": "https://arxiv.org/abs/2301.07855", "description": "arXiv:2301.07855v3 Announce Type: replace \nAbstract: As Canada and other major economies consider implementing \"digital money\" or Central Bank Digital Currencies, understanding how demographic and geographic factors influence public engagement with digital technologies becomes increasingly important. This paper uses data from the 2020 Canadian Internet Use Survey and employs survey-adapted Lasso inference methods to identify individual socio-economic and demographic characteristics determining the digital divide in Canada. We also introduce a score to measure and compare the digital literacy of various segments of Canadian population. Our findings reveal that disparities in the use of e.g. online banking, emailing, and digital payments exist across different demographic and socio-economic groups. In addition, we document the effects of COVID-19 pandemic on internet use in Canada and describe changes in the characteristics of Canadian internet users over the last decade."}, "https://arxiv.org/abs/2306.04315": {"title": "Inferring unknown unknowns: Regularized bias-aware ensemble Kalman filter", "link": "https://arxiv.org/abs/2306.04315", "description": "arXiv:2306.04315v3 Announce Type: replace \nAbstract: Because of physical assumptions and numerical approximations, low-order models are affected by uncertainties in the state and parameters, and by model biases. 
Model biases, also known as model errors or systematic errors, are difficult to infer because they are `unknown unknowns', i.e., we do not necessarily know their functional form a priori. With biased models, data assimilation methods may be ill-posed because either (i) they are 'bias-unaware' because the estimators are assumed unbiased, (ii) they rely on an a priori parametric model for the bias, or (iii) they can infer model biases that are not unique for the same model and data. First, we design a data assimilation framework to perform combined state, parameter, and bias estimation. Second, we propose a mathematical solution with a sequential method, i.e., the regularized bias-aware ensemble Kalman Filter (r-EnKF), which requires a model of the bias and its gradient (i.e., the Jacobian). Third, we propose an echo state network as the model bias estimator. We derive the Jacobian of the network, and design a robust training strategy with data augmentation to accurately infer the bias in different scenarios. Fourth, we apply the r-EnKF to nonlinearly coupled oscillators (with and without time-delay) affected by different forms of bias. The r-EnKF infers in real-time parameters and states, and a unique bias. The applications that we showcase are relevant to acoustics, thermoacoustics, and vibrations; however, the r-EnKF opens new opportunities for combined state, parameter and bias estimation for real-time and on-the-fly prediction in nonlinear systems."}, "https://arxiv.org/abs/2307.00214": {"title": "Utilizing a Capture-Recapture Strategy to Accelerate Infectious Disease Surveillance", "link": "https://arxiv.org/abs/2307.00214", "description": "arXiv:2307.00214v2 Announce Type: replace \nAbstract: Monitoring key elements of disease dynamics (e.g., prevalence, case counts) is of great importance in infectious disease prevention and control, as emphasized during the COVID-19 pandemic. To facilitate this effort, we propose a new capture-recapture (CRC) analysis strategy that takes misclassification into account from easily-administered, imperfect diagnostic test kits, such as the Rapid Antigen Test-kits or saliva tests. Our method is based on a recently proposed \"anchor stream\" design, whereby an existing voluntary surveillance data stream is augmented by a smaller and judiciously drawn random sample. It incorporates manufacturer-specified sensitivity and specificity parameters to account for imperfect diagnostic results in one or both data streams. For inference to accompany case count estimation, we improve upon traditional Wald-type confidence intervals by developing an adapted Bayesian credible interval for the CRC estimator that yields favorable frequentist coverage properties. When feasible, the proposed design and analytic strategy provides a more efficient solution than traditional CRC methods or random sampling-based biased-corrected estimation to monitor disease prevalence while accounting for misclassification. 
We demonstrate the benefits of this approach through simulation studies that underscore its potential utility in practice for economical disease monitoring among a registered closed population."}, "https://arxiv.org/abs/2308.02293": {"title": "Outlier-Robust Neural Network Training: Efficient Optimization of Transformed Trimmed Loss with Variation Regularization", "link": "https://arxiv.org/abs/2308.02293", "description": "arXiv:2308.02293v3 Announce Type: replace \nAbstract: In this study, we consider outlier-robust predictive modeling using highly-expressive neural networks. To this end, we employ (1) a transformed trimmed loss (TTL), which is a computationally feasible variant of the classical trimmed loss, and (2) a higher-order variation regularization (HOVR) of the prediction model. Note that using only TTL to train the neural network may leave it vulnerable to outliers, as its high expressive power causes it to overfit even the outliers perfectly. However, simultaneously introducing HOVR constrains the effective degrees of freedom, thereby avoiding fitting outliers. We provide a new, efficient stochastic algorithm for optimization and its theoretical convergence guarantee. (*Two authors contributed equally to this work.)"}, "https://arxiv.org/abs/2308.05073": {"title": "Harmonized Estimation of Subgroup-Specific Treatment Effects in Randomized Trials: The Use of External Control Data", "link": "https://arxiv.org/abs/2308.05073", "description": "arXiv:2308.05073v2 Announce Type: replace \nAbstract: Subgroup analyses of randomized controlled trials (RCTs) constitute an important component of the drug development process in precision medicine. In particular, subgroup analyses of early-stage trials often influence the design and eligibility criteria of subsequent confirmatory trials and ultimately influence which subpopulations will receive the treatment after regulatory approval. However, subgroup analyses are often complicated by small sample sizes, which leads to substantial uncertainty about subgroup-specific treatment effects. We explore the use of external control (EC) data to augment RCT subgroup analyses. We define and discuss harmonized estimators of subpopulation-specific treatment effects that leverage EC data. Our approach can be used to modify any subgroup-specific treatment effect estimates that are obtained by combining RCT and EC data, such as linear regression. We alter these subgroup-specific estimates to make them coherent with a robust estimate of the average effect in the randomized population based only on RCT data. The weighted average of the resulting subgroup-specific harmonized estimates matches the RCT-only estimate of the overall effect in the randomized population. We discuss the proposed harmonized estimators through analytic results and simulations, and investigate standard performance metrics. The method is illustrated with a case study in oncology."}, "https://arxiv.org/abs/2310.15266": {"title": "Causal progress with imperfect placebo treatments and outcomes", "link": "https://arxiv.org/abs/2310.15266", "description": "arXiv:2310.15266v2 Announce Type: replace \nAbstract: In the quest to make defensible causal claims from observational data, it is sometimes possible to leverage information from \"placebo treatments\" and \"placebo outcomes\".
Existing approaches employing such information focus largely on point identification and assume (i) \"perfect placebos\", meaning placebo treatments have precisely zero effect on the outcome and the real treatment has precisely zero effect on a placebo outcome; and (ii) \"equiconfounding\", meaning that the treatment-outcome relationship where one is a placebo suffers the same amount of confounding as does the real treatment-outcome relationship, on some scale. We instead consider an omitted variable bias framework, in which users can postulate ranges of values for the degree of unequal confounding and the degree of placebo imperfection. Once postulated, these assumptions identify or bound the linear estimates of treatment effects. Our approach also does not require using both a placebo treatment and placebo outcome, as some others do. While applicable in many settings, one ubiquitous use-case for this approach is to employ pre-treatment outcomes as (perfect) placebo outcomes, as in difference-in-differences. The parallel trends assumption in this setting is identical to the equiconfounding assumption, on a particular scale, which our framework allows the user to relax. Finally, we demonstrate the use of our framework with two applications and a simulation, employing an R package that implements these approaches."}, "https://arxiv.org/abs/2311.16984": {"title": "FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings", "link": "https://arxiv.org/abs/2311.16984", "description": "arXiv:2311.16984v3 Announce Type: replace \nAbstract: External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients' rights by strictly controlling data processing make pooling data from multiple sources in a central server often difficult. To address these limitations, we develop a new method, 'FedECA', that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity culminating in a real-world use-case in which FedECA provides evidence for a differential effect between two drugs that would otherwise have gone unnoticed. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development."}, "https://arxiv.org/abs/2312.06018": {"title": "A Multivariate Polya Tree Model for Meta-Analysis with Event Time Distributions", "link": "https://arxiv.org/abs/2312.06018", "description": "arXiv:2312.06018v2 Announce Type: replace \nAbstract: We develop a non-parametric Bayesian prior for a family of random probability measures by extending the Polya tree ($PT$) prior to a joint prior for a set of probability measures $G_1,\\dots,G_n$, suitable for meta-analysis with event time outcomes. In the application to meta-analysis $G_i$ is the event time distribution specific to study $i$. The proposed model defines a regression on study-specific covariates by introducing increased correlation for any pair of studies with similar characteristics. 
The desired multivariate $PT$ model is constructed by introducing a hierarchical prior on the conditional splitting probabilities in the $PT$ construction for each of the $G_i$. The hierarchical prior replaces the independent beta priors for the splitting probability in the $PT$ construction with a Gaussian process prior for corresponding (logit) splitting probabilities across all studies. The Gaussian process is indexed by study-specific covariates, introducing the desired dependence with increased correlation for similar studies. The main feature of the proposed construction is (conditionally) conjugate posterior updating with commonly reported inference summaries for event time data. The construction is motivated by a meta-analysis over cancer immunotherapy studies."}, "https://arxiv.org/abs/2401.03834": {"title": "On the error control of invariant causal prediction", "link": "https://arxiv.org/abs/2401.03834", "description": "arXiv:2401.03834v3 Announce Type: replace \nAbstract: Invariant causal prediction (ICP, Peters et al. (2016)) provides a novel way for identifying causal predictors of a response by utilizing heterogeneous data from different environments. One notable advantage of ICP is that it guarantees to make no false causal discoveries with high probability. Such a guarantee, however, can be overly conservative in some applications, resulting in few or no causal discoveries. This raises a natural question: Can we use less conservative error control guarantees for ICP so that more causal information can be extracted from data? We address this question in the paper. We focus on two commonly used and more liberal guarantees: false discovery rate control and simultaneous true discovery bound. Unexpectedly, we find that false discovery rate does not seem to be a suitable error criterion for ICP. The simultaneous true discovery bound, on the other hand, proves to be an ideal choice, enabling users to explore potential causal predictors and extract more causal information. Importantly, the additional information comes for free, in the sense that no extra assumptions are required and the discoveries from the original ICP approach are fully retained. We demonstrate the practical utility of our method through simulations and a real dataset about the educational attainment of teenagers in the US."}, "https://arxiv.org/abs/2410.07229": {"title": "Fast spatio-temporally varying coefficient modeling with reluctant interaction selection", "link": "https://arxiv.org/abs/2410.07229", "description": "arXiv:2410.07229v1 Announce Type: new \nAbstract: Spatially and temporally varying coefficient (STVC) models are currently attracting attention as a flexible tool to explore the spatio-temporal patterns in regression coefficients. However, these models often struggle with balancing computational efficiency and model flexibility. To address this challenge, this study develops a fast and flexible method for STVC modeling. For enhanced flexibility in modeling, we assume multiple processes in each varying coefficient, including purely spatial, purely temporal, and spatio-temporal interaction processes with or without time cyclicity. While considering multiple processes can be time consuming, we combine a pre-conditioning method with a model selection procedure, inspired by reluctant interaction modeling. This approach allows us to computationally efficiently select and specify the latent space-time structure. 
Monte Carlo experiments demonstrate that the proposed method outperforms alternatives in terms of coefficient estimation accuracy and computational efficiency. Finally, we apply the proposed method to crime analysis using a sample size of 279,360, confirming that the proposed method provides reasonable estimates of varying coefficients. The STVC model is implemented in an R package spmoran."}, "https://arxiv.org/abs/2410.07242": {"title": "Dynamic borrowing from historical controls via the synthetic prior with covariates in randomized clinical trials", "link": "https://arxiv.org/abs/2410.07242", "description": "arXiv:2410.07242v1 Announce Type: new \nAbstract: Motivated by a rheumatoid arthritis clinical trial, we propose a new Bayesian method called SPx, standing for synthetic prior with covariates, to borrow information from historical trials to reduce the control group size in a new trial. The method involves a novel use of Bayesian model averaging to balance between multiple possible relationships between the historical and new trial data, allowing the historical data to be dynamically trusted or discounted as appropriate. We require only trial-level summary statistics, which are available more often than patient-level data. Through simulations and an application to the rheumatoid arthritis trial we show that SPx can substantially reduce the control group size while maintaining Frequentist properties."}, "https://arxiv.org/abs/2410.07293": {"title": "Feature-centric nonlinear autoregressive models", "link": "https://arxiv.org/abs/2410.07293", "description": "arXiv:2410.07293v1 Announce Type: new \nAbstract: We propose a novel feature-centric approach to surrogate modeling of dynamical systems driven by time-varying exogenous excitations. This approach, named Functional Nonlinear AutoRegressive with eXogenous inputs (F-NARX), aims to approximate the system response based on temporal features of both the exogenous inputs and the system response, rather than on their values at specific time lags. This is a major step away from the discrete-time-centric approach of classical NARX models, which attempts to determine the relationship between selected time steps of the input/output time series. By modeling the system in a time-feature space instead of the original time axis, F-NARX can provide more stable long-term predictions and drastically reduce the reliance of the model performance on the time discretization of the problem. F-NARX, like NARX, acts as a framework and is not tied to a single implementation. In this work, we introduce an F-NARX implementation based on principal component analysis and polynomial basis functions. To further improve prediction accuracy and computational efficiency, we also introduce a strategy to identify and fit a sparse model structure, thanks to a modified hybrid least angle regression approach that minimizes the expected forecast error, rather than the one-step-ahead prediction error. Since F-NARX is particularly well-suited to modeling engineering structures typically governed by physical processes, we investigate the behavior and capabilities of our F-NARX implementation on two case studies: an eight-story building under wind loading and a three-story steel frame under seismic loading. 
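To make the contrast with F-NARX concrete, the sketch below fits a classical discrete-time NARX surrogate: the current response is regressed on lagged responses and lagged exogenous inputs through polynomial features and ordinary least squares. This is only a minimal sketch of the baseline the abstract departs from, not the authors' implementation; the toy system and lag orders are illustrative.

```python
# Minimal sketch of a classical (discrete-time) NARX surrogate: regress y_t on
# lagged responses and lagged exogenous inputs with polynomial features + OLS.
import numpy as np

def build_lagged_design(y, x, ny=2, nx=2):
    """Stack [y_{t-1..t-ny}, x_{t-1..t-nx}] rows for each usable time step t."""
    T, p = len(y), max(ny, nx)
    rows, targets = [], []
    for t in range(p, T):
        rows.append(np.concatenate([y[t - ny:t][::-1], x[t - nx:t][::-1]]))
        targets.append(y[t])
    return np.asarray(rows), np.asarray(targets)

rng = np.random.default_rng(0)
x = rng.standard_normal(500)                      # exogenous excitation
y = np.zeros(500)
for t in range(2, 500):                           # toy system to emulate
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.5 * x[t - 1] + 0.01 * rng.standard_normal()

Z, target = build_lagged_design(y, x)
Z = np.column_stack([Z, Z**2])                    # simple polynomial basis
design = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
print("fitted NARX coefficients:", np.round(coef, 3))
```

Long-horizon forecasts from such a model are produced by feeding predictions back in as lagged responses, which is exactly where the time-discretization sensitivity criticized in the abstract shows up.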
Our results demonstrate that F-NARX has several favorable properties over classical NARX, making it well suited to emulate engineering systems with high accuracy over extended time periods."}, "https://arxiv.org/abs/2410.07357": {"title": "Penalized regression with negative-unlabeled data: An approach to developing a long COVID research index", "link": "https://arxiv.org/abs/2410.07357", "description": "arXiv:2410.07357v1 Announce Type: new \nAbstract: Moderate to severe post-acute sequelae of SARS-CoV-2 infection (PASC), also called long COVID, is estimated to impact as many as 10% of SARS-CoV-2 infected individuals, representing a chronic condition with a substantial global public health burden. An expansive literature has identified over 200 long-term and persistent symptoms associated with a history of SARS-CoV-2 infection; yet there is still no clear consensus on a syndrome definition. Such a definition is a critical first step in future studies of risk and resiliency factors, mechanisms of disease, and interventions for both treatment and prevention. We recently applied a strategy for defining a PASC research index based on a Lasso-penalized logistic regression on history of SARS-CoV-2 infection. In the current paper we formalize and evaluate this approach through theoretical derivations and simulation studies. We demonstrate that this approach appropriately selects symptoms associated with PASC and results in a score that has high discriminatory power for detecting PASC. An application to data on participants enrolled in the RECOVER (Researching COVID to Enhance Recovery) Adult Cohort is presented to illustrate our findings."}, "https://arxiv.org/abs/2410.07374": {"title": "Predicting Dengue Outbreaks: A Dynamic Approach with Variable Length Markov Chains and Exogenous Factors", "link": "https://arxiv.org/abs/2410.07374", "description": "arXiv:2410.07374v1 Announce Type: new \nAbstract: Variable Length Markov Chains with Exogenous Covariates (VLMCX) are stochastic models that use Generalized Linear Models to compute transition probabilities, taking into account both the state history and time-dependent exogenous covariates. The beta-context algorithm selects a relevant finite suffix (context) for predicting the next symbol. This algorithm estimates flexible tree-structured models by aggregating irrelevant states in the process history and enables the model to incorporate exogenous covariates over time.\n This research uses data from multiple sources to extend the beta-context algorithm to incorporate time-dependent and time-invariant exogenous covariates. Within this approach, we have a distinct Markov chain for every data source, allowing for a comprehensive understanding of the process behavior across multiple situations, such as different geographic locations. Despite using data from different sources, we assume that all sources are independent and share identical parameters - we explore contexts within each data source and combine them to compute transition probabilities, deriving a unified tree. This approach eliminates the necessity for spatial-dependent structural considerations within the model. 
Furthermore, we incorporate modifications in the estimation procedure to address contexts that appear with low frequency.\n Our motivation was to investigate the impact of previous dengue rates, weather conditions, and socioeconomic factors on subsequent dengue rates across various municipalities in Brazil, providing insights into dengue transmission dynamics."}, "https://arxiv.org/abs/2410.07443": {"title": "On the Lower Confidence Band for the Optimal Welfare", "link": "https://arxiv.org/abs/2410.07443", "description": "arXiv:2410.07443v1 Announce Type: new \nAbstract: This article addresses the question of reporting a lower confidence band (LCB) for optimal welfare in policy learning problems. A straightforward procedure inverts a one-sided t-test based on an efficient estimator of the optimal welfare. We argue that in an empirically relevant class of data-generating processes, an LCB corresponding to suboptimal welfare may exceed the straightforward LCB, with the average difference of order $N^{-1/2}$. We relate this result to a lack of uniformity in the so-called margin assumption, commonly imposed in policy learning and debiased inference. We advocate for using uniformly valid asymptotic approximations and show how existing methods for inference in moment inequality models can be used to construct valid and tight LCBs for the optimal welfare. We illustrate our findings in the context of the National JTPA study."}, "https://arxiv.org/abs/2410.07454": {"title": "Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning", "link": "https://arxiv.org/abs/2410.07454", "description": "arXiv:2410.07454v1 Announce Type: new \nAbstract: A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating a reliable knowledge graph by leveraging multi-source information from the existing literature, however, is challenging, especially with a large number of nodes and heterogeneous relations. In this paper, we propose a general theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into the initial entity embeddings of a neural network that approximates the score function for the knowledge graph and continuously trains the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with ReLU activation functions, in scenarios where the embedding parameters are either fixed or trainable. Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. 
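As rough orientation for the preceding abstract's idea of a score function over knowledge-graph triples with entity embeddings initialized from pretrained representations, the sketch below fits a generic embedding-based score (a DistMult-style bilinear form) to a few observed facts by stochastic gradient steps. It is not the RENKI architecture; all dimensions, facts, and names are illustrative.

```python
# Minimal sketch of an embedding-based knowledge-graph score function
# (DistMult-style bilinear score), trained to fit observed triples.
import numpy as np

rng = np.random.default_rng(1)
n_entities, n_relations, dim = 100, 5, 16

# Entity embeddings could be initialized from pretrained representations;
# here they are random for illustration.
E = rng.normal(scale=0.1, size=(n_entities, dim))
R = rng.normal(scale=0.1, size=(n_relations, dim))

def score(h, r, t):
    """DistMult-style score for the triple (head, relation, tail)."""
    return float(np.sum(E[h] * R[r] * E[t]))

def sgd_step(h, r, t, label, lr=0.05):
    """One logistic-loss gradient step on a single (triple, label) pair."""
    s = score(h, r, t)
    g = 1.0 / (1.0 + np.exp(-s)) - label   # d(loss)/d(score)
    grad_h = g * R[r] * E[t]
    grad_r = g * E[h] * E[t]
    grad_t = g * E[h] * R[r]
    E[h] -= lr * grad_h
    R[r] -= lr * grad_r
    E[t] -= lr * grad_t

# Fit a few observed facts (positives) against random corruptions (negatives).
facts = [(0, 1, 2), (3, 0, 4), (2, 1, 7)]
for _ in range(200):
    for (h, r, t) in facts:
        sgd_step(h, r, t, label=1.0)
        sgd_step(h, r, int(rng.integers(n_entities)), label=0.0)
print("score of an observed fact:", round(score(0, 1, 2), 3))
```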
The experiments justify our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models."}, "https://arxiv.org/abs/2410.07483": {"title": "Doubly robust estimation and sensitivity analysis with outcomes truncated by death in multi-arm clinical trials", "link": "https://arxiv.org/abs/2410.07483", "description": "arXiv:2410.07483v1 Announce Type: new \nAbstract: In clinical trials, the observation of participant outcomes may frequently be hindered by death, leading to ambiguity in defining a scientifically meaningful final outcome for those who die. Principal stratification methods are valuable tools for addressing the average causal effect among always-survivors, i.e., the average treatment effect among a subpopulation in the principal strata of those who would survive regardless of treatment assignment. Although robust methods for the truncation-by-death problem in two-arm clinical trials have been previously studied, their extension to multi-arm clinical trials remains unknown. In this article, we study the identification of a class of survivor average causal effect estimands with multiple treatments under monotonicity and principal ignorability, and first propose simple weighting and regression approaches. As a further improvement, we then derive the efficient influence function to motivate doubly robust estimators for the survivor average causal effects in multi-arm clinical trials. We also articulate sensitivity methods under violations of key causal assumptions. Extensive simulations are conducted to investigate the finite-sample performance of the proposed methods, and a real data example is used to illustrate how to operationalize the proposed estimators and the sensitivity methods in practice."}, "https://arxiv.org/abs/2410.07487": {"title": "Learning associations of COVID-19 hospitalizations with wastewater viral signals by Markov modulated models", "link": "https://arxiv.org/abs/2410.07487", "description": "arXiv:2410.07487v1 Announce Type: new \nAbstract: Viral signal in wastewater offers a promising opportunity to assess and predict the burden of infectious diseases. That has driven the widespread adoption and development of wastewater monitoring tools by public health organizations. Recent research highlights a strong correlation between COVID-19 hospitalizations and wastewater viral signals, and validates that increases in wastewater measurements may offer early warnings of an increase in hospital admissions. Previous studies (e.g. Peng et al. 2023) utilize distributed lag models to explore associations of COVID-19 hospitalizations with lagged SARS-CoV-2 wastewater viral signals. However, the conventional distributed lag models assume the duration time of the lag to be fixed, which is not always plausible. This paper presents Markov-modulated models with distributed lasting time, treating the duration of the lag as a random variable defined by a hidden Markov chain. We evaluate exposure effects over the duration time and estimate the distribution of the lasting time using the wastewater data and COVID-19 hospitalization records from Ottawa, Canada, from June 2020 to November 2022. The different COVID-19 waves are accommodated in the statistical learning. In particular, two strategies for comparing the associations over different time intervals are exemplified using the Ottawa data. 
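For contrast with the Markov-modulated approach above, the sketch below fits the conventional fixed-lag distributed lag regression that the abstract describes as the baseline: hospitalizations regressed on the wastewater signal at lags 0 through L, with L fixed in advance. Data are synthetic and the linear-least-squares fit is purely illustrative.

```python
# Minimal sketch of a conventional distributed lag regression with a FIXED
# maximum lag L: regress hospitalizations on the wastewater signal at lags 0..L.
import numpy as np

rng = np.random.default_rng(2)
T, L = 300, 7
wastewater = np.abs(rng.standard_normal(T))            # toy viral signal
hosp = np.zeros(T)
for t in range(L, T):                                   # true effect acts at lag 3
    hosp[t] = 2.0 + 1.5 * wastewater[t - 3] + rng.normal(scale=0.3)

# Design matrix with columns [1, x_t, x_{t-1}, ..., x_{t-L}]
X = np.column_stack([np.ones(T - L)] + [wastewater[L - l:T - l] for l in range(L + 1)])
y = hosp[L:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated lag coefficients:", np.round(beta[1:], 2))   # peak near lag 3
```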
Of note, the proposed Markov modulated models, an extension of distributed lag models, are potentially applicable to many different problems where the lag time is not fixed."}, "https://arxiv.org/abs/2410.07555": {"title": "A regression framework for studying relationships among attributes under network interference", "link": "https://arxiv.org/abs/2410.07555", "description": "arXiv:2410.07555v1 Announce Type: new \nAbstract: To understand how the interconnected and interdependent world of the twenty-first century operates and make model-based predictions, joint probability models for networks and interdependent outcomes are needed. We propose a comprehensive regression framework for networks and interdependent outcomes with multiple advantages, including interpretability, scalability, and provable theoretical guarantees. The regression framework can be used for studying relationships among attributes of connected units and captures complex dependencies among connections and attributes, while retaining the virtues of linear regression, logistic regression, and other regression models by being interpretable and widely applicable. On the computational side, we show that the regression framework is amenable to scalable statistical computing based on convex optimization of pseudo-likelihoods using minorization-maximization methods. On the theoretical side, we establish convergence rates for pseudo-likelihood estimators based on a single observation of dependent connections and attributes. We demonstrate the regression framework using simulations and an application to hate speech on the social media platform X in the six months preceding the insurrection at the U.S. Capitol on January 6, 2021."}, "https://arxiv.org/abs/2410.07934": {"title": "panelPomp: Analysis of Panel Data via Partially Observed Markov Processes in R", "link": "https://arxiv.org/abs/2410.07934", "description": "arXiv:2410.07934v1 Announce Type: new \nAbstract: Panel data arise when time series measurements are collected from multiple, dynamically independent but structurally related systems. In such cases, each system's time series can be modeled as a partially observed Markov process (POMP), and the ensemble of these models is called a PanelPOMP. If the time series are relatively short, statistical inference for each time series must draw information from across the entire panel. Every time series has a name, called its unit label, which may correspond to an object on which that time series was collected. Differences between units may be of direct inferential interest or may be a nuisance for studying the commonalities. The R package panelPomp supports analysis of panel data via a general class of PanelPOMP models. This includes a suite of tools for manipulation of models and data that take advantage of the panel structure. The panelPomp package currently emphasizes recent advances enabling likelihood-based inference via simulation-based algorithms. However, the general framework provided by panelPomp supports development of additional, new inference methodology for panel data."}, "https://arxiv.org/abs/2410.07996": {"title": "Smoothed pseudo-population bootstrap methods with applications to finite population quantiles", "link": "https://arxiv.org/abs/2410.07996", "description": "arXiv:2410.07996v1 Announce Type: new \nAbstract: This paper introduces smoothed pseudo-population bootstrap methods for the purposes of variance estimation and the construction of confidence intervals for finite population quantiles. In an i.i.d. 
context, it has been shown that resampling from a smoothed estimate of the distribution function instead of the usual empirical distribution function can improve the convergence rate of the bootstrap variance estimator of a sample quantile. We extend the smoothed bootstrap to the survey sampling framework by implementing it in pseudo-population bootstrap methods for high entropy, single-stage survey designs, such as simple random sampling without replacement and Poisson sampling. Given a kernel function and a bandwidth, it consists of smoothing the pseudo-population from which bootstrap samples are drawn using the original sampling design. Given that the implementation of the proposed algorithms requires the specification of the bandwidth, we develop a plug-in selection method along with a grid search selection method based on a bootstrap estimate of the mean squared error. Simulation results suggest a gain in efficiency associated with the smoothed approach as compared to the standard pseudo-population bootstrap for estimating the variance of a quantile estimator, together with mixed results regarding confidence interval coverage."}, "https://arxiv.org/abs/2410.08078": {"title": "Negative Control Outcome Adjustment in Early-Phase Randomized Trials: Estimating Vaccine Effects on Immune Responses in HIV Exposed Uninfected Infants", "link": "https://arxiv.org/abs/2410.08078", "description": "arXiv:2410.08078v1 Announce Type: new \nAbstract: Adjustment for prognostic baseline covariates when estimating treatment effects in randomized trials can reduce bias due to covariate imbalance and yield guaranteed efficiency gains in large samples. Gains in precision and reductions in finite-sample bias are arguably most important in the resource-limited setting of early-phase trials. Despite their favorable large-sample properties, the utility of covariate-adjusted estimators in early-phase trials is complicated by precision loss due to adjustment for weakly prognostic covariates, and by type I error rate inflation and undercoverage of asymptotic confidence intervals in finite samples. We propose adjustment for a valid negative control outcome (NCO), or an auxiliary post-randomization outcome assumed completely unaffected by treatment but correlated with the outcome of interest. We articulate the causal assumptions that permit adjustment for NCOs, describe when NCO adjustment may improve upon adjustment for baseline covariates, illustrate performance and provide practical recommendations regarding model selection and finite-sample variance corrections in early-phase trials using numerical experiments, and demonstrate superior performance of NCO adjustment in the reanalysis of two early-phase vaccine trials in HIV exposed uninfected (HEU) infants. In early-phase studies without knowledge of baseline predictors of outcomes, we advocate for eschewing baseline covariate adjustment in favor of adjustment for NCOs believed to be unaffected by the experimental intervention."}, "https://arxiv.org/abs/2410.08080": {"title": "Bayesian Nonparametric Sensitivity Analysis of Multiple Comparisons Under Dependence", "link": "https://arxiv.org/abs/2410.08080", "description": "arXiv:2410.08080v1 Announce Type: new \nAbstract: This short communication introduces a sensitivity analysis method for Multiple Testing Procedures (MTPs), based on marginal $p$-values and the Dirichlet process prior distribution. 
The method measures each $p$-value's insensitivity towards a significance decision, with respect to the entire space of MTPs controlling either the family-wise error rate (FWER) or the false discovery rate (FDR) under arbitrary dependence between $p$-values, supported by this nonparametric prior. The sensitivity analysis method is illustrated through 1,081 hypothesis tests of the effects of the COVID-19 pandemic on educational processes for 15-year-old students, performed on a 2022 public dataset. Software code for the method is provided."}, "https://arxiv.org/abs/2410.07191": {"title": "Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving", "link": "https://arxiv.org/abs/2410.07191", "description": "arXiv:2410.07191v1 Announce Type: cross \nAbstract: Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose $\\textit{Causal tRajecTory predICtion}$ $\\textbf{(CRiTIC)}$, a novel model that utilizes a $\\textit{Causal Discovery Network}$ to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel $\\textit{Causal Attention Gating}$ mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to $\\textbf{54%}$ without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to $\\textbf{29%}$ improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://critic-model.github.io/."}, "https://arxiv.org/abs/2410.07234": {"title": "A Dynamic Approach to Stock Price Prediction: Comparing RNN and Mixture of Experts Models Across Different Volatility Profiles", "link": "https://arxiv.org/abs/2410.07234", "description": "arXiv:2410.07234v1 Announce Type: cross \nAbstract: This study evaluates the effectiveness of a Mixture of Experts (MoE) model for stock price prediction by comparing it to a Recurrent Neural Network (RNN) and a linear regression model. The MoE framework combines an RNN for volatile stocks and a linear model for stable stocks, dynamically adjusting the weight of each model through a gating network. Results indicate that the MoE approach significantly improves predictive accuracy across different volatility profiles. The RNN effectively captures non-linear patterns for volatile companies but tends to overfit stable data, whereas the linear model performs well for predictable trends. The MoE model's adaptability allows it to outperform each individual model, reducing errors such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). 
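To make the mixture-of-experts prediction step described above concrete, the sketch below blends the outputs of two toy experts with a softmax gate computed from the input features. It is a minimal illustration of the gating idea only (no RNN, no training loop), not the architecture evaluated in the paper; experts, weights, and features are made up.

```python
# Minimal sketch of a mixture-of-experts prediction: a gating function maps
# the features to expert weights, and the prediction is the weighted average
# of the expert outputs.
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_predict(x, experts, gate_weights):
    """Blend expert predictions with a softmax gate computed from features x."""
    gate = softmax(gate_weights @ x)                   # one weight per expert
    preds = np.array([expert(x) for expert in experts])
    return float(gate @ preds), gate

# Two toy experts: a nonlinear expert (volatile regime) and a linear expert.
experts = [lambda x: np.tanh(x).sum(), lambda x: 0.5 * x.sum()]
gate_weights = np.array([[1.0, -0.5, 0.2],             # rows: experts, cols: features
                         [-1.0, 0.5, -0.2]])

x = np.array([0.3, -1.2, 0.8])                          # e.g. recent return/volatility features
yhat, gate = moe_predict(x, experts, gate_weights)
print("gate:", np.round(gate, 2), "prediction:", round(yhat, 3))
```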
Future work should focus on enhancing the gating mechanism and validating the model with real-world datasets to optimize its practical applicability."}, "https://arxiv.org/abs/2410.07899": {"title": "A multivariate spatial regression model using signatures", "link": "https://arxiv.org/abs/2410.07899", "description": "arXiv:2410.07899v1 Announce Type: cross \nAbstract: We propose a spatial autoregressive model for a multivariate response variable and functional covariates. The approach is based on the notion of signature, which represents a function as an infinite series of its iterated integrals and presents the advantage of being applicable to a wide range of processes. We have provided theoretical guarantees for the choice of the signature truncation order, and we have shown in a simulation study that this approach outperforms existing approaches in the literature."}, "https://arxiv.org/abs/2308.12506": {"title": "General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays", "link": "https://arxiv.org/abs/2308.12506", "description": "arXiv:2308.12506v4 Announce Type: replace \nAbstract: We present a general central limit theorem with simple, easy-to-check covariance-based sufficient conditions for triangular arrays of random vectors when all variables could be interdependent. The result is constructed from Stein's method, but the conditions are distinct from related work. We show that these covariance conditions nest standard assumptions studied in the literature such as $M$-dependence, mixing random fields, non-mixing autoregressive processes, and dependency graphs, which themselves need not imply each other. This permits researchers to work with high-level but intuitive conditions based on overall correlation instead of more complicated and restrictive conditions such as strong mixing in random fields that may not have any obvious micro-foundation. As examples of the implications, we show how the theorem implies asymptotic normality in estimating: treatment effects with spillovers in more settings than previously admitted, covariance matrices, processes with global dependencies such as epidemic spread and information diffusion, and spatial process with Mat\\'{e}rn dependencies."}, "https://arxiv.org/abs/2309.07365": {"title": "Addressing selection bias in cluster randomized experiments via weighting", "link": "https://arxiv.org/abs/2309.07365", "description": "arXiv:2309.07365v2 Announce Type: replace \nAbstract: In cluster randomized experiments, individuals are often recruited after the cluster treatment assignment, and data are typically only available for the recruited sample. Post-randomization recruitment can lead to selection bias, inducing systematic differences between the overall and the recruited populations, and between the recruited intervention and control arms. In this setting, we define causal estimands for the overall and the recruited populations. We prove, under the assumption of ignorable recruitment, that the average treatment effect on the recruited population can be consistently estimated from the recruited sample using inverse probability weighting. Generally we cannot identify the average treatment effect on the overall population. 
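The inverse probability weighting mentioned in the preceding abstract can be illustrated with a short sketch: recruited individuals are reweighted by the inverse of their recruitment probability before arm means are compared. The estimands and weights in the paper differ in detail, recruitment probabilities would be modeled rather than known, and the data below are synthetic.

```python
# Minimal sketch of inverse probability weighting (Hajek form) to correct for
# selective post-randomization recruitment; probabilities treated as known.
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
z = rng.binomial(1, 0.5, n)                                   # randomized arm
x = rng.standard_normal(n)                                    # individual covariate
p_recruit = 1.0 / (1.0 + np.exp(-(-0.2 + 0.5 * z + 0.8 * x))) # recruitment probability
recruited = rng.random(n) < p_recruit
y = 1.0 + 0.8 * z + 0.5 * x + rng.standard_normal(n)          # true effect = 0.8

w = 1.0 / p_recruit[recruited]                                # inverse probability weights
zr, yr = z[recruited], y[recruited]
mu1 = np.sum(w * zr * yr) / np.sum(w * zr)
mu0 = np.sum(w * (1 - zr) * yr) / np.sum(w * (1 - zr))
print("IPW (Hajek) treatment effect estimate:", round(mu1 - mu0, 3))  # approx 0.8
```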
Nonetheless, we show, via a principal stratification formulation, that one can use weighting of the recruited sample to identify treatment effects on two meaningful subpopulations of the overall population: individuals who would be recruited into the study regardless of the assignment, and individuals who would be recruited into the study under treatment but not under control. We develop an estimation strategy and a sensitivity analysis approach for checking the ignorable recruitment assumption. The proposed methods are applied to the ARTEMIS cluster randomized trial, where removing co-payment barriers increases the persistence of P2Y12 inhibitor among the always-recruited population."}, "https://arxiv.org/abs/2310.09428": {"title": "Sparse higher order partial least squares for simultaneous variable selection, dimension reduction, and tensor denoising", "link": "https://arxiv.org/abs/2310.09428", "description": "arXiv:2310.09428v2 Announce Type: replace \nAbstract: Partial Least Squares (PLS) regression emerged as an alternative to ordinary least squares for addressing multicollinearity in a wide range of scientific applications. As multidimensional tensor data is becoming more widespread, tensor adaptations of PLS have been developed. In this paper, we first establish the statistical behavior of Higher Order PLS (HOPLS) of Zhao et al. (2012), by showing that the consistency of the HOPLS estimator cannot be guaranteed as the tensor dimensions and the number of features increase faster than the sample size. To tackle this issue, we propose Sparse Higher Order Partial Least Squares (SHOPS) regression and an accompanying algorithm. SHOPS simultaneously accommodates variable selection, dimension reduction, and tensor response denoising. We further establish the asymptotic results of the SHOPS algorithm under a high-dimensional regime. The results also complete the unknown theoretic properties of SPLS algorithm (Chun and Kele\\c{s}, 2010). We verify these findings through comprehensive simulation experiments, and application to an emerging high-dimensional biological data analysis."}, "https://arxiv.org/abs/2302.00860": {"title": "Modeling Causal Mechanisms with Diffusion Models for Interventional and Counterfactual Queries", "link": "https://arxiv.org/abs/2302.00860", "description": "arXiv:2302.00860v3 Announce Type: replace-cross \nAbstract: We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings. These encodings enable us to directly sample under interventions and perform abduction for counterfactuals. Diffusion models are a natural fit here, since they can encode each node to a latent representation that acts as a proxy for exogenous noise. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. 
Furthermore, we provide theoretical results that offer a methodology for analyzing counterfactual estimation in general encoder-decoder models, which could be useful in settings beyond our proposed approach."}, "https://arxiv.org/abs/2410.08283": {"title": "Adaptive sparsening and smoothing of the treatment model for longitudinal causal inference using outcome-adaptive LASSO and marginal fused LASSO", "link": "https://arxiv.org/abs/2410.08283", "description": "arXiv:2410.08283v1 Announce Type: new \nAbstract: Causal variable selection in time-varying treatment settings is challenging due to evolving confounding effects. Existing methods mainly focus on time-fixed exposures and are not directly applicable to time-varying scenarios. We propose a novel two-step procedure for variable selection when modeling the treatment probability at each time point. We first introduce a novel approach to longitudinal confounder selection using a Longitudinal Outcome Adaptive LASSO (LOAL) that data-adaptively selects covariates, with theoretical justification in terms of variance reduction for the estimator of the causal effect. We then propose an Adaptive Fused LASSO that can collapse treatment model parameters over time points with the goal of simplifying the models in order to improve the efficiency of the estimator while minimizing model misspecification bias compared with naive pooled logistic regression models. Our simulation studies highlight the need for and usefulness of the proposed approach in practice. We applied our method to data from the Nicotine Dependence in Teens study to estimate the effect of the timing of alcohol initiation during adolescence on depressive symptoms in early adulthood."}, "https://arxiv.org/abs/2410.08382": {"title": "Bivariate Variable Ranking for censored time-to-event data via Copula Link Based Additive models", "link": "https://arxiv.org/abs/2410.08382", "description": "arXiv:2410.08382v1 Announce Type: new \nAbstract: In this paper, we present a variable ranking approach based on a novel measure to select important variables in bivariate Copula Link-Based Additive Models (Marra & Radice, 2020). The proposal allows for identifying two sets of relevant covariates for the two time-to-event outcomes without neglecting the dependency structure that may exist between the two survival times. The suggested procedure is evaluated via a simulation study and then applied to analyze the Age-Related Eye Disease Study dataset. The algorithm is implemented in a new R package, called BRBVS."}, "https://arxiv.org/abs/2410.08422": {"title": "Principal Component Analysis in the Graph Frequency Domain", "link": "https://arxiv.org/abs/2410.08422", "description": "arXiv:2410.08422v1 Announce Type: new \nAbstract: We propose a novel principal component analysis in the graph frequency domain for dimension reduction of multivariate data residing on graphs. The proposed method not only effectively reduces the dimensionality of multivariate graph signals, but also provides a closed-form reconstruction of the original data. In addition, we investigate several propositions related to principal components and the reconstruction errors, and introduce a graph spectral envelope that aids in identifying common graph frequencies in multivariate graph signals. 
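A minimal way to picture "PCA in the graph frequency domain", as described in the preceding abstract, is to transform graph signals with the graph Fourier transform (the eigenbasis of the graph Laplacian) and then run ordinary PCA on the spectral coefficients. The sketch below does exactly that on a random graph; it omits the paper's additional contributions (closed-form reconstruction, spectral envelope) and uses illustrative sizes.

```python
# Minimal sketch: graph Fourier transform (Laplacian eigenbasis) followed by
# ordinary PCA on the spectral coefficients of multivariate graph signals.
import numpy as np

rng = np.random.default_rng(4)
n_nodes, n_signals = 20, 50

# Random undirected graph and its combinatorial Laplacian L = D - A.
A = (rng.random((n_nodes, n_nodes)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T
L = np.diag(A.sum(1)) - A

# Graph Fourier transform: project signals onto the Laplacian eigenbasis.
_, U = np.linalg.eigh(L)
X = rng.standard_normal((n_signals, n_nodes))        # rows = graph signals
X_hat = X @ U                                        # spectral coefficients

# Ordinary PCA on the (centered) spectral coefficients.
X_hat_c = X_hat - X_hat.mean(0)
cov = np.cov(X_hat_c, rowvar=False)
pc_vals, pc_vecs = np.linalg.eigh(cov)
print("top-3 variance shares:", np.round(pc_vals[::-1][:3] / pc_vals.sum(), 3))
```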
We demonstrate the validity of the proposed method through a simulation study and further analyze the boarding and alighting patterns of Seoul Metropolitan Subway passengers."}, "https://arxiv.org/abs/2410.08441": {"title": "A scientific review on advances in statistical methods for crossover design", "link": "https://arxiv.org/abs/2410.08441", "description": "arXiv:2410.08441v1 Announce Type: new \nAbstract: A comprehensive review of the literature on crossover design is needed to highlight its evolution, applications, and methodological advancements across various fields. Given its widespread use in clinical trials and other research domains, understanding this design's challenges, assumptions, and innovations is essential for optimizing its implementation and ensuring accurate, unbiased results. This article extensively reviews the history and statistical inference methods for crossover designs. A primary focus is given to the AB-BA design as it is the most widely used design in the literature. Extension from two periods to higher-order designs is discussed, and a general inference procedure for continuous responses is studied. Analysis of multivariate and categorical responses is also reviewed in this context. A number of open problems in this area are shortlisted."}, "https://arxiv.org/abs/2410.08488": {"title": "Fractional binomial regression model for count data with excess zeros", "link": "https://arxiv.org/abs/2410.08488", "description": "arXiv:2410.08488v1 Announce Type: new \nAbstract: This paper proposes a new generalized linear model with a fractional binomial distribution.\n Zero-inflated Poisson/negative binomial distributions are used for count data with many zeros. To analyze the association of such a count variable with covariates, zero-inflated Poisson/negative binomial regression models are widely used. In this work, we develop a regression model with the fractional binomial distribution that can serve as an additional tool for modeling the count response variable with covariates. Data analysis results show that on some occasions, our model outperforms the existing zero-inflated regression models."}, "https://arxiv.org/abs/2410.08492": {"title": "Exact MLE for Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2410.08492", "description": "arXiv:2410.08492v1 Announce Type: new \nAbstract: Exact MLE for generalized linear mixed models (GLMMs) is a long-standing problem unsolved until today. The proposed research solves the problem. In this problem, the main difficulty is caused by intractable integrals in the likelihood function when the response is not normally distributed and the prior distribution for the random effects is normal. Previous methods use Laplace approximations or Monte Carlo simulations to compute the MLE approximately. These methods cannot provide the exact MLEs of the parameters and the hyperparameters. The exact MLE problem has remained unsolved until the proposed work. The idea is to construct a sequence of mathematical functions in the optimization procedure. Optimization of these functions can be carried out numerically. The result can lead to the exact MLEs of the parameters and hyperparameters. 
Because computing the likelihood is unnecessary, the proposed method avoids the main difficulty caused by the intractable integrals in the likelihood function."}, "https://arxiv.org/abs/2410.08523": {"title": "Parametric multi-fidelity Monte Carlo estimation with applications to extremes", "link": "https://arxiv.org/abs/2410.08523", "description": "arXiv:2410.08523v1 Announce Type: new \nAbstract: In a multi-fidelity setting, data are available from two sources, high- and low-fidelity. Low-fidelity data are larger in size and can be leveraged to make more efficient inference about quantities of interest, e.g. the mean, for high-fidelity variables. In this work, such a multi-fidelity setting is studied when the goal is to fit a parametric model to high-fidelity data more efficiently. Three multi-fidelity parameter estimation methods are considered, namely joint maximum likelihood, (multi-fidelity) moment estimation, and (multi-fidelity) marginal maximum likelihood, and are illustrated on several parametric models, with the focus on parametric families used in extreme value analysis. An application is also provided concerning quantification of occurrences of extreme ship motions generated by two computer codes of varying fidelity."}, "https://arxiv.org/abs/2410.08733": {"title": "An alignment-agnostic methodology for the analysis of designed separations data", "link": "https://arxiv.org/abs/2410.08733", "description": "arXiv:2410.08733v1 Announce Type: new \nAbstract: Chemical separations data are typically analysed in the time domain using methods that integrate the discrete elution bands. Integrating the same chemical components across several samples must account for retention time drift over the course of an entire experiment as the physical characteristics of the separation are altered through several cycles of use. Failure to consistently integrate the components within a matrix of $M \\times N$ samples and variables creates artifacts that have a profound effect on the analysis and interpretation of the data. This work presents an alternative where the raw separations data are analysed in the frequency domain to account for the offset of the chromatographic peaks as a matrix of complex Fourier coefficients. We present a generalization of the permutation testing and visualization steps in ANOVA-Simultaneous Component Analysis (ASCA) to handle complex matrices, and use this method to analyze a synthetic dataset with known significant factors and compare the interpretation of a real dataset via its peak table and frequency domain representations."}, "https://arxiv.org/abs/2410.08782": {"title": "Half-KFN: An Enhanced Detection Method for Subtle Covariate Drift", "link": "https://arxiv.org/abs/2410.08782", "description": "arXiv:2410.08782v1 Announce Type: new \nAbstract: Detecting covariate drift is a common task of significant practical value in supervised learning. Once covariate drift occurs, the models may no longer be applicable, and hence numerous studies have been devoted to the advancement of detection methods. However, current research methods are not particularly effective in handling subtle covariate drift when dealing with small proportions of drift samples. In this paper, inspired by the $k$-nearest neighbor (KNN) approach, a novel method called Half $k$-farthest neighbor (Half-KFN) is proposed in response to specific scenarios. 
Compared to traditional ones, Half-KFN exhibits higher power due to the inherent capability of the farthest neighbors which could better characterize the nature of drift. Furthermore, with larger sample sizes, the employment of the bootstrap for hypothesis testing is recommended. It is leveraged to calculate $p$-values dramatically faster than permutation tests, with speed undergoing an exponential growth as sample size increases. Numerical experiments on simulated and real data are conducted to evaluate our proposed method, and the results demonstrate that it consistently displays superior sensitivity and rapidity in covariate drift detection across various cases."}, "https://arxiv.org/abs/2410.08803": {"title": "Generalised logistic regression with vine copulas", "link": "https://arxiv.org/abs/2410.08803", "description": "arXiv:2410.08803v1 Announce Type: new \nAbstract: We propose a generalisation of the logistic regression model, that aims to account for non-linear main effects and complex interactions, while keeping the model inherently explainable. This is obtained by starting with log-odds that are linear in the covariates, and adding non-linear terms that depend on at least two covariates. More specifically, we use a generative specification of the model, consisting of a combination of certain margins on natural exponential form, combined with vine copulas. The estimation of the model is however based on the discriminative likelihood, and dependencies between covariates are included in the model, only if they contribute significantly to the distinction between the two classes. Further, a scheme for model selection and estimation is presented. The methods described in this paper are implemented in the R package LogisticCopula. In order to assess the performance of our model, we ran an extensive simulation study. The results from the study, as well as from a couple of examples on real data, showed that our model performs at least as well as natural competitors, especially in the presence of non-linearities and complex interactions, even when $n$ is not large compared to $p$."}, "https://arxiv.org/abs/2410.08849": {"title": "Causal inference targeting a concentration index for studies of health inequalities", "link": "https://arxiv.org/abs/2410.08849", "description": "arXiv:2410.08849v1 Announce Type: new \nAbstract: A concentration index, a standardized covariance between a health variable and relative income ranks, is often used to quantify income-related health inequalities. There is a lack of formal approach to study the effect of an exposure, e.g., education, on such measures of inequality. In this paper we contribute by filling this gap and developing the necessary theory and method. Thus, we define a counterfactual concentration index for different levels of an exposure. We give conditions for their identification, and then deduce their efficient influence function. This allows us to propose estimators, which are regular asymptotic linear under certain conditions. In particular, these estimators are $\\sqrt n$-consistent and asymptotically normal, as well as locally efficient. The implementation of the estimators is based on the fit of several nuisance functions. The estimators proposed have rate robustness properties allowing for convergence rates slower than $\\sqrt{n}$-rate for some of the nuisance function fits. The relevance of the asymptotic results for finite samples is studied with simulation experiments. 
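The concentration index targeted in the preceding abstract has a simple descriptive form: twice the covariance between the health variable and the fractional income rank, divided by the mean of the health variable. The sketch below computes that descriptive index on synthetic data; it is not the paper's counterfactual or causally adjusted version.

```python
# Minimal sketch of the (relative) concentration index: 2 * cov(health, income
# rank) / mean(health), computed on synthetic data with a pro-rich gradient.
import numpy as np

def concentration_index(health, income):
    n = len(health)
    ranks = np.argsort(np.argsort(income))             # 0..n-1
    frac_rank = (ranks + 0.5) / n                       # fractional income rank
    return 2.0 * np.cov(health, frac_rank, bias=True)[0, 1] / health.mean()

rng = np.random.default_rng(5)
income = np.exp(rng.normal(10, 0.5, 5000))
health = 50 + 5 * (np.log(income) - 10) + rng.normal(0, 3, 5000)   # pro-rich gradient
print("concentration index:", round(concentration_index(health, income), 3))
```

A positive value indicates that the health variable is concentrated among richer individuals, which is what the synthetic gradient above produces.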
We also present a case study of the effect of education on income-related health inequalities for a Swedish cohort."}, "https://arxiv.org/abs/2410.09027": {"title": "Variance reduction combining pre-experiment and in-experiment data", "link": "https://arxiv.org/abs/2410.09027", "description": "arXiv:2410.09027v1 Announce Type: new \nAbstract: Online controlled experiments (A/B testing) are essential in data-driven decision-making for many companies. Increasing the sensitivity of these experiments, particularly with a fixed sample size, relies on reducing the variance of the estimator for the average treatment effect (ATE). Existing methods like CUPED and CUPAC use pre-experiment data to reduce variance, but their effectiveness depends on the correlation between the pre-experiment data and the outcome. In contrast, in-experiment data is often more strongly correlated with the outcome and thus more informative. In this paper, we introduce a novel method that combines both pre-experiment and in-experiment data to achieve greater variance reduction than CUPED and CUPAC, without introducing bias or additional computation complexity. We also establish asymptotic theory and provide consistent variance estimators for our method. Applying this method to multiple online experiments at Etsy, we reach substantial variance reduction over CUPAC with the inclusion of only a few in-experiment covariates. These results highlight the potential of our approach to significantly improve experiment sensitivity and accelerate decision-making."}, "https://arxiv.org/abs/2410.09039": {"title": "Semi-Supervised Learning of Noisy Mixture of Experts Models", "link": "https://arxiv.org/abs/2410.09039", "description": "arXiv:2410.09039v1 Announce Type: new \nAbstract: The mixture of experts (MoE) model is a versatile framework for predictive modeling that has gained renewed interest in the age of large language models. A collection of predictive ``experts'' is learned along with a ``gating function'' that controls how much influence each expert is given when a prediction is made. This structure allows relatively simple models to excel in complex, heterogeneous data settings. In many contemporary settings, unlabeled data are widely available while labeled data are difficult to obtain. Semi-supervised learning methods seek to leverage the unlabeled data. We propose a novel method for semi-supervised learning of MoE models. We start from a semi-supervised MoE model that was developed by oceanographers that makes the strong assumption that the latent clustering structure in unlabeled data maps directly to the influence that the gating function should give each expert in the supervised task. We relax this assumption, imagining a noisy connection between the two, and propose an algorithm based on least trimmed squares, which succeeds even in the presence of misaligned data. Our theoretical analysis characterizes the conditions under which our approach yields estimators with a near-parametric rate of convergence. Simulated and real data examples demonstrate the method's efficacy."}, "https://arxiv.org/abs/2410.08362": {"title": "Towards Optimal Environmental Policies: Policy Learning under Arbitrary Bipartite Network Interference", "link": "https://arxiv.org/abs/2410.08362", "description": "arXiv:2410.08362v1 Announce Type: cross \nAbstract: The substantial effect of air pollution on cardiovascular disease and mortality burdens is well-established. 
Emissions-reducing interventions on coal-fired power plants -- a major source of hazardous air pollution -- have proven to be an effective, but costly, strategy for reducing pollution-related health burdens. Targeting the power plants that achieve maximum health benefits while satisfying realistic cost constraints is challenging. The primary difficulty lies in quantifying the health benefits of intervening at particular plants. This is further complicated because interventions are applied to power plants, while health impacts occur in potentially distant communities, a setting known as bipartite network interference (BNI). In this paper, we introduce novel policy learning methods based on Q- and A-Learning to determine the optimal policy under arbitrary BNI. We derive asymptotic properties and demonstrate finite sample efficacy in simulations. We apply our novel methods to a comprehensive dataset of Medicare claims, power plant data, and pollution transport networks. Our goal is to determine the optimal strategy for installing power plant scrubbers to minimize ischemic heart disease (IHD) hospitalizations under various cost constraints. We find that annual IHD hospitalization rates could be reduced by 20.66 to 44.51 per 10,000 person-years through optimal policies under different cost constraints."}, "https://arxiv.org/abs/2410.08378": {"title": "Deep Generative Quantile Bayes", "link": "https://arxiv.org/abs/2410.08378", "description": "arXiv:2410.08378v1 Announce Type: cross \nAbstract: We develop a multivariate posterior sampling procedure through deep generative quantile learning. Simulation proceeds implicitly through a push-forward mapping that can transform i.i.d. random vector samples from the posterior. We utilize Monge-Kantorovich depth in multivariate quantiles to directly sample from Bayesian credible sets, a unique feature not offered by typical posterior sampling methods. To enhance the training of the quantile mapping, we design a neural network that automatically performs summary statistic extraction. This additional neural network structure has performance benefits, including support shrinkage (i.e., contraction of our posterior approximation) as the observation sample size increases. We demonstrate the usefulness of our approach on several examples where the absence of likelihood renders classical MCMC infeasible. Finally, we provide the following frequentist theoretical justifications for our quantile learning framework: consistency of the estimated vector quantile, of the recovered posterior distribution, and of the corresponding Bayesian credible sets."}, "https://arxiv.org/abs/2410.08574": {"title": "Change-point detection in regression models for ordered data via the max-EM algorithm", "link": "https://arxiv.org/abs/2410.08574", "description": "arXiv:2410.08574v1 Announce Type: cross \nAbstract: We consider the problem of breakpoint detection in a regression modeling framework. To that end, we introduce a novel method, the max-EM algorithm, which combines a constrained Hidden Markov Model with the Classification-EM (CEM) algorithm. This algorithm has linear complexity and provides accurate breakpoint detection and parameter estimation. We derive a theoretical result that shows that the likelihood of the data as a function of the regression parameters and the breakpoint locations is increased at each step of the algorithm. We also present two initialization methods for the location of the breakpoints in order to deal with local maxima issues. 
Finally, a statistical test in the one-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method that includes the initialization procedure and the max-EM algorithm has a strong performance both in terms of parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and a strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing and the health disease data, where the interest of the method to detect heterogeneity in the distribution of the data is illustrated."}, "https://arxiv.org/abs/2410.08831": {"title": "Distribution-free uncertainty quantification for inverse problems: application to weak lensing mass mapping", "link": "https://arxiv.org/abs/2410.08831", "description": "arXiv:2410.08831v1 Announce Type: cross \nAbstract: In inverse problems, distribution-free uncertainty quantification (UQ) aims to obtain error bars with coverage guarantees that are independent of any prior assumptions about the data distribution. In the context of mass mapping, uncertainties could lead to errors that affect our understanding of the underlying mass distribution, or could propagate to cosmological parameter estimation, thereby impacting the precision and reliability of cosmological models. Current surveys, such as Euclid or Rubin, will provide new weak lensing datasets of very high quality. Accurately quantifying uncertainties in mass maps is therefore critical to perform reliable cosmological parameter inference. In this paper, we extend the conformalized quantile regression (CQR) algorithm, initially proposed for scalar regression, to inverse problems. We compare our approach with another distribution-free approach based on risk-controlling prediction sets (RCPS). Both methods are based on a calibration dataset, and offer finite-sample coverage guarantees that are independent of the data distribution. Furthermore, they are applicable to any mass mapping method, including blackbox predictors. In our experiments, we apply UQ to three mass-mapping methods: the Kaiser-Squires inversion, iterative Wiener filtering, and the MCALens algorithm. Our experiments reveal that RCPS tends to produce overconservative confidence bounds with small calibration sets, whereas CQR is designed to avoid this issue. Although the expected miscoverage rate is guaranteed to stay below a user-prescribed threshold regardless of the mass mapping method, selecting an appropriate reconstruction algorithm remains crucial for obtaining accurate estimates, especially around peak-like structures, which are particularly important for inferring cosmological parameters. Additionally, the choice of mass mapping method influences the size of the error bars."}, "https://arxiv.org/abs/2410.08939": {"title": "Linear-cost unbiased posterior estimates for crossed effects and matrix factorization models via couplings", "link": "https://arxiv.org/abs/2410.08939", "description": "arXiv:2410.08939v1 Announce Type: cross \nAbstract: We design and analyze unbiased Markov chain Monte Carlo (MCMC) schemes based on couplings of blocked Gibbs samplers (BGSs), whose total computational costs scale linearly with the number of parameters and data points. Our methodology is designed for and applicable to high-dimensional BGS with conditionally independent blocks, which are often encountered in Bayesian modeling. 
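The mass-mapping entry above (arXiv:2410.08831) extends conformalized quantile regression (CQR) to inverse problems; the scalar-regression version it starts from is easy to state. A minimal sketch of the CQR calibration step, assuming `q_lo` and `q_hi` are lower and upper quantile predictions from any fitted model; the inverse-problem extension in the entry is not reproduced here.

```python
import numpy as np

def cqr_calibrate(q_lo_cal, q_hi_cal, y_cal, alpha=0.1):
    # Conformity scores: how far the truth falls outside the predicted interval.
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(y_cal)
    # Finite-sample corrected quantile level, as in split-conformal calibration.
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0))

def cqr_interval(q_lo_test, q_hi_test, qhat):
    # Widen (or shrink, if qhat < 0) the raw quantile band by the calibrated margin.
    return q_lo_test - qhat, q_hi_test + qhat
```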
We provide bounds on the expected number of iterations needed for coalescence for Gaussian targets, which imply that practical two-step coupling strategies achieve coalescence times that match the relaxation times of the original BGS scheme up to a logarithmic factor. To illustrate the practical relevance of our methodology, we apply it to high-dimensional crossed random effect and probabilistic matrix factorization models, for which we develop a novel BGS scheme with improved convergence speed. Our methodology provides unbiased posterior estimates at linear cost (usually requiring only a few BGS iterations for problems with thousands of parameters), matching state-of-the-art procedures for both frequentist and Bayesian estimation of those models."}, "https://arxiv.org/abs/2205.03505": {"title": "A Flexible Quasi-Copula Distribution for Statistical Modeling", "link": "https://arxiv.org/abs/2205.03505", "description": "arXiv:2205.03505v2 Announce Type: replace \nAbstract: Copulas, generalized estimating equations, and generalized linear mixed models promote the analysis of grouped data where non-normal responses are correlated. Unfortunately, parameter estimation remains challenging in these three frameworks. Based on prior work of Tonda, we derive a new class of probability density functions that allow explicit calculation of moments, marginal and conditional distributions, and the score and observed information needed in maximum likelihood estimation. We also illustrate how the new distribution flexibly models longitudinal data following a non-Gaussian distribution. Finally, we conduct a tri-variate genome-wide association analysis on dichotomized systolic and diastolic blood pressure and body mass index data from the UK-Biobank, showcasing the modeling prowess and computational scalability of the new distribution."}, "https://arxiv.org/abs/2307.14160": {"title": "On the application of Gaussian graphical models to paired data problems", "link": "https://arxiv.org/abs/2307.14160", "description": "arXiv:2307.14160v2 Announce Type: replace \nAbstract: Gaussian graphical models are nowadays commonly applied to the comparison of groups sharing the same variables, by jointy learning their independence structures. We consider the case where there are exactly two dependent groups and the association structure is represented by a family of coloured Gaussian graphical models suited to deal with paired data problems. To learn the two dependent graphs, together with their across-graph association structure, we implement a fused graphical lasso penalty. We carry out a comprehensive analysis of this approach, with special attention to the role played by some relevant submodel classes. In this way, we provide a broad set of tools for the application of Gaussian graphical models to paired data problems. These include results useful for the specification of penalty values in order to obtain a path of lasso solutions and an ADMM algorithm that solves the fused graphical lasso optimization problem. Finally, we present an application of our method to cancer genomics where it is of interest to compare cancer cells with a control sample from histologically normal tissues adjacent to the tumor. 
All the methods described in this article are implemented in the $\\texttt{R}$ package $\\texttt{pdglasso}$ available at: https://github.com/savranciati/pdglasso."}, "https://arxiv.org/abs/2401.10824": {"title": "A versatile trivariate wrapped Cauchy copula with applications to toroidal and cylindrical data", "link": "https://arxiv.org/abs/2401.10824", "description": "arXiv:2401.10824v2 Announce Type: replace \nAbstract: In this paper, we propose a new flexible distribution for data on the three-dimensional torus which we call a trivariate wrapped Cauchy copula. Our trivariate copula has several attractive properties. It has a simple form of density and is unimodal. Its parameters are interpretable and allow an adjustable degree of dependence between every pair of variables, and these can be easily estimated. The conditional distributions of the model are well-studied bivariate wrapped Cauchy distributions. Furthermore, the distribution can be easily simulated. Parameter estimation via maximum likelihood for the distribution is given and we highlight the simple implementation procedure to obtain these estimates. We compare our model to its competitors for analysing trivariate data and provide some evidence of its advantages. Another interesting feature of this model is that it can be extended to a cylindrical copula; we describe this new cylindrical copula and then give its properties. We illustrate our trivariate wrapped Cauchy copula on data from protein bioinformatics of conformational angles, and our cylindrical copula using climate data related to a buoy in the Adriatic Sea. The paper is motivated by these real trivariate datasets, but we indicate how the model can be extended to multivariate copulas."}, "https://arxiv.org/abs/1609.07630": {"title": "Low-complexity Image and Video Coding Based on an Approximate Discrete Tchebichef Transform", "link": "https://arxiv.org/abs/1609.07630", "description": "arXiv:1609.07630v4 Announce Type: replace-cross \nAbstract: The usage of linear transformations has great relevance for data decorrelation applications, like image and video compression. In that sense, the discrete Tchebichef transform (DTT) possesses useful coding and decorrelation properties. The DTT transform kernel does not depend on the input data and fast algorithms can be developed for real-time applications. However, the DTT fast algorithm presented in the literature possesses high computational complexity. In this work, we introduce a new low-complexity approximation for the DTT. The fast algorithm of the proposed transform is multiplication-free and requires a reduced number of additions and bit-shifting operations. Image and video compression simulations in popular standards show good performance of the proposed transform. Regarding hardware resource consumption, the FPGA implementation shows a 43.1% reduction in configurable logic blocks, and the ASIC place-and-route realization shows a 57.7% reduction in the area-time figure when compared with the 2-D version of the exact DTT."}, "https://arxiv.org/abs/2305.12407": {"title": "Federated Offline Policy Learning", "link": "https://arxiv.org/abs/2305.12407", "description": "arXiv:2305.12407v2 Announce Type: replace-cross \nAbstract: We consider the problem of learning personalized decision policies from observational bandit feedback data across multiple heterogeneous data sources. 
In our approach, we introduce a novel regret analysis that establishes finite-sample upper bounds on distinguishing notions of global regret for all data sources on aggregate and of local regret for any given data source. We characterize these regret bounds by expressions of source heterogeneity and distribution shift. Moreover, we examine the practical considerations of this problem in the federated setting where a central server aims to train a policy on data distributed across the heterogeneous sources without collecting any of their raw data. We present a policy learning algorithm amenable to federation based on the aggregation of local policies trained with doubly robust offline policy evaluation strategies. Our analysis and supporting experimental results provide insights into tradeoffs in the participation of heterogeneous data sources in offline policy learning."}, "https://arxiv.org/abs/2306.01890": {"title": "Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning", "link": "https://arxiv.org/abs/2306.01890", "description": "arXiv:2306.01890v3 Announce Type: replace-cross \nAbstract: Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. In many algorithms, a predefined distance measurement is used to cluster data points based on their dissimilarity. While there exist numerous distance-based measures for data with pure numerical attributes and several ordered and unordered categorical metrics, an efficient and accurate distance for mixed-type data that utilizes the continuous and discrete properties simultaneously is an open problem. Many metrics convert numerical attributes to categorical ones or vice versa. They handle the data points as a single attribute type or calculate a distance between each attribute separately and add them up. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with cross-validated optimal bandwidth selection. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and improves clustering accuracy when utilized in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data."}, "https://arxiv.org/abs/2312.15595": {"title": "Zero-Inflated Bandits", "link": "https://arxiv.org/abs/2312.15595", "description": "arXiv:2312.15595v2 Announce Type: replace-cross \nAbstract: Many real applications of bandits have sparse non-zero rewards, leading to slow learning speed. Using problem-specific structures for careful distribution modeling is known to be critical to estimation efficiency in statistics, yet is under-explored in bandits. We initiate the study of zero-inflated bandits, where the reward is modeled as a classic semi-parametric distribution called the zero-inflated distribution. We design Upper Confidence Bound- and Thompson Sampling-type algorithms for this specific structure. We derive the regret bounds under both multi-armed bandits with general reward assumptions and contextual generalized linear bandits with sub-Gaussian rewards. In many settings, the regret rates of our algorithms are either minimax optimal or state-of-the-art. 
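To make the zero-inflated reward structure of the bandit entry above (arXiv:2312.15595, continued below) concrete, here is a minimal Thompson-sampling sketch for one conjugate instantiation that I chose for illustration (Bernoulli zero-inflation times an Exponential positive part); the entry's UCB- and TS-type algorithms and their reward assumptions are more general than this.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.10, 0.30, 0.15])      # probability of a non-zero reward per arm
mean_pos = np.array([2.0, 1.0, 1.2])       # mean of the positive (Exponential) part
# Expected rewards p * mean: 0.20, 0.30, 0.18, so arm 1 is best.

K = len(p_true)
a = np.ones(K); b = np.ones(K)             # Beta posterior for the non-zero probability
shape = np.ones(K); rate = np.ones(K)      # Gamma posterior for the Exponential rate

for t in range(20_000):
    p_s = rng.beta(a, b)
    lam_s = rng.gamma(shape, 1.0 / rate)   # numpy gamma uses a scale parameter = 1 / rate
    arm = int(np.argmax(p_s / lam_s))      # posterior sample of E[reward] = p / lambda
    nonzero = rng.random() < p_true[arm]
    r = rng.exponential(mean_pos[arm]) if nonzero else 0.0
    a[arm] += nonzero; b[arm] += 1 - nonzero
    if nonzero:
        shape[arm] += 1.0; rate[arm] += r  # conjugate update for the Exponential rate

print("pulls per arm:", (a + b - 2).astype(int))  # the best arm dominates
```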
The superior empirical performance of our methods is demonstrated via numerical studies."}, "https://arxiv.org/abs/2410.09217": {"title": "Flexibly Modeling Shocks to Demographic and Health Indicators with Bayesian Shrinkage Priors", "link": "https://arxiv.org/abs/2410.09217", "description": "arXiv:2410.09217v1 Announce Type: new \nAbstract: Demographic and health indicators may exhibit small or large short-term shocks; for example, armed conflicts, epidemics, or famines may cause shocks in period measures of life expectancy. Statistical models for estimating historical trends and generating future projections of these indicators for a large number of populations may be biased or not well probabilistically calibrated if they do not account for the presence of shocks. We propose a flexible method for modeling shocks when producing estimates and projections for multiple populations. The proposed approach makes no assumptions about the shape or duration of a shock, and requires no prior knowledge of when shocks may have occurred. Our approach is based on the modeling of shocks in the level of the indicator of interest. We use Bayesian shrinkage priors such that shock terms are shrunk to zero unless the data suggest otherwise. The method is demonstrated in a model for male period life expectancy at birth. We use as a starting point an existing projection model and expand it by including the shock terms, modeled by the Bayesian shrinkage priors. Out-of-sample validation exercises find that including shocks in the model results in sharper uncertainty intervals without sacrificing empirical coverage or prediction error."}, "https://arxiv.org/abs/2410.09267": {"title": "Experimentation on Endogenous Graphs", "link": "https://arxiv.org/abs/2410.09267", "description": "arXiv:2410.09267v1 Announce Type: new \nAbstract: We study experimentation under endogenous network interference. Interference patterns are mediated by an endogenous graph, where edges can be formed or eliminated as a result of treatment. We show that conventional estimators are biased in these circumstances, and present a class of unbiased, consistent and asymptotically normal estimators of total treatment effects in the presence of such interference. Our results apply both to bipartite experimentation, in which the units of analysis and measurement differ, and the standard network experimentation case, in which they are the same."}, "https://arxiv.org/abs/2410.09274": {"title": "Hierarchical Latent Class Models for Mortality Surveillance Using Partially Verified Verbal Autopsies", "link": "https://arxiv.org/abs/2410.09274", "description": "arXiv:2410.09274v1 Announce Type: new \nAbstract: Monitoring data on causes of death is an important part of understanding the burden of diseases and effects of public health interventions. Verbal autopsy (VA) is a well-established method for gathering information about deaths outside of hospitals by conducting an interview with family members or caregivers of a deceased person. Existing cause-of-death assignment algorithms using VA data require either domain knowledge about the symptom-cause relationship, or large training datasets. When a new disease emerges, however, only limited information on the symptom-cause relationship exists and training data are usually lacking, making it challenging to evaluate the impact of the disease. In this paper, we propose a novel Bayesian framework to estimate the fraction of deaths due to an emerging disease using VAs collected with partially verified cause of death. 
We use a latent class model to capture the distribution of symptoms and their dependence in a parsimonious way. We discuss potential sources of bias that may occur due to the cause-of-death verification process and adapt our framework to account for the verification mechanism. We also develop structured priors to improve prevalence estimation for sub-populations. We demonstrate the performance of our model using a mortality surveillance dataset that includes suspected COVID-19 related deaths in Brazil in 2021."}, "https://arxiv.org/abs/2410.09278": {"title": "Measurement Error Correction for Spatially Defined Environmental Exposures in Survival Analysis", "link": "https://arxiv.org/abs/2410.09278", "description": "arXiv:2410.09278v1 Announce Type: new \nAbstract: Environmental exposures are often defined using buffer zones around geocoded home addresses, but these static boundaries can miss dynamic daily activity patterns, leading to biased results. This paper presents a novel measurement error correction method for spatially defined environmental exposures within a survival analysis framework using the Cox proportional hazards model. The method corrects high-dimensional surrogate exposures from geocoded residential data at multiple buffer radii by applying principal component analysis for dimension reduction and leveraging external GPS-tracked validation datasets containing true exposure measurements. We also derive the asymptotic properties and variances of the proposed estimators. Extensive simulations are conducted to evaluate the performance of the proposed estimators, demonstrating their ability to improve accuracy in estimated exposure effects. An illustrative application assesses the impact of greenness exposure on depression incidence in the Nurses' Health Study (NHS). The results demonstrate that correcting for measurement error significantly enhances the accuracy of exposure estimates. This method offers a critical advancement for accurately assessing the health impacts of environmental exposures, outperforming traditional static buffer approaches."}, "https://arxiv.org/abs/2410.09282": {"title": "Anytime-Valid Continuous-Time Confidence Processes for Inhomogeneous Poisson Processes", "link": "https://arxiv.org/abs/2410.09282", "description": "arXiv:2410.09282v1 Announce Type: new \nAbstract: Motivated by monitoring the arrival of incoming adverse events such as customer support calls or crash reports from users exposed to an experimental product change, we consider sequential hypothesis testing of continuous-time inhomogeneous Poisson point processes. Specifically, we provide an interval-valued confidence process $C^\\alpha(t)$ over continuous time $t$ for the cumulative arrival rate $\\Lambda(t) = \\int_0^t \\lambda(s) \\mathrm{d}s$ with a continuous-time anytime-valid coverage guarantee $\\mathbb{P}[\\Lambda(t) \\in C^\\alpha(t) \\, \\forall t >0] \\geq 1-\\alpha$. We extend our results to compare two independent arrival processes by constructing multivariate confidence processes and a closed-form $e$-process for testing the equality of rates with a time-uniform Type-I error guarantee at a nominal $\\alpha$. We characterize the asymptotic growth rate of the proposed $e$-process under the alternative and show that it has power 1 when the average rates of the two Poisson processes differ in the limit. 
We also observe a complementary relationship between our multivariate confidence process and the universal inference $e$-process for testing composite null hypotheses."}, "https://arxiv.org/abs/2410.09504": {"title": "Bayesian Transfer Learning for Artificially Intelligent Geospatial Systems: A Predictive Stacking Approach", "link": "https://arxiv.org/abs/2410.09504", "description": "arXiv:2410.09504v1 Announce Type: new \nAbstract: Building artificially intelligent geospatial systems requires rapid delivery of spatial data analysis at massive scales with minimal human intervention. Depending upon their intended use, data analysis may also entail model assessment and uncertainty quantification. This article devises transfer learning frameworks for deployment in artificially intelligent systems, where a massive data set is split into smaller data sets that stream into the analytical framework to propagate learning and assimilate inference for the entire data set. Specifically, we introduce Bayesian predictive stacking for multivariate spatial data and demonstrate its effectiveness in rapidly analyzing massive data sets. Furthermore, we make inference feasible in a reasonable amount of time, and without excessively demanding hardware settings. We illustrate the effectiveness of this approach in extensive simulation experiments and subsequently analyze massive data sets in climate science on sea surface temperatures and on vegetation index."}, "https://arxiv.org/abs/2410.09506": {"title": "Distribution-Aware Mean Estimation under User-level Local Differential Privacy", "link": "https://arxiv.org/abs/2410.09506", "description": "arXiv:2410.09506v1 Announce Type: new \nAbstract: We consider the problem of mean estimation under user-level local differential privacy, where $n$ users are contributing through their local pool of data samples. Previous work assumes that the number of data samples is the same across users. In contrast, we consider a more general and realistic scenario where each user $u \\in [n]$ owns $m_u$ data samples drawn from some generative distribution $\\mu$; $m_u$ being unknown to the statistician but drawn from a known distribution $M$ over $\\mathbb{N}^\\star$. Based on a distribution-aware mean estimation algorithm, we establish an $M$-dependent upper bound on the worst-case risk over $\\mu$ for the task of mean estimation. We then derive a lower bound. The two bounds are asymptotically matching up to logarithmic factors and reduce to known bounds when $m_u = m$ for any user $u$."}, "https://arxiv.org/abs/2410.09552": {"title": "Model-based clustering of time-dependent observations with common structural changes", "link": "https://arxiv.org/abs/2410.09552", "description": "arXiv:2410.09552v1 Announce Type: new \nAbstract: We propose a novel model-based clustering approach for samples of time series. We assume as a unique commonality that two observations belong to the same group if structural changes in their behaviours happen at the same time. We resort to a latent representation of structural changes in each time series based on random orders to induce ties among different observations. Such an approach results in a general modeling strategy and can be combined with many time-dependent models known in the literature. 
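For the user-level local differential privacy entry above (arXiv:2410.09506), a minimal baseline sketch: each user releases a clipped average of their own samples through the Laplace mechanism and the server averages the reports. This deliberately ignores the heterogeneous sample counts $m_u$ that the distribution-aware estimator exploits, and the clipping bound `B` is an assumed constant.

```python
import numpy as np

def ldp_mean(user_samples, eps, B=1.0, rng=None):
    """Estimate a population mean from user-level data under eps-LDP.

    Each user contributes one report: their local mean, clipped to [-B, B],
    plus Laplace noise with scale 2B/eps (the sensitivity of a clipped mean).
    """
    rng = rng or np.random.default_rng(0)
    reports = [
        np.clip(np.mean(x), -B, B) + rng.laplace(scale=2 * B / eps)
        for x in user_samples
    ]
    return float(np.mean(reports))

# Users with different numbers of samples drawn from the same distribution.
rng = np.random.default_rng(3)
users = [rng.normal(0.3, 1.0, size=rng.integers(1, 50)) for _ in range(2000)]
print(ldp_mean(users, eps=1.0, rng=rng))   # noisy estimate of the true mean 0.3
```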
Our studies have been motivated by an epidemiological problem, where we want to provide clusters of different countries of the European Union, where two countries belong to the same cluster if the spreading processes of the COVID-19 virus had structural changes at the same time."}, "https://arxiv.org/abs/2410.09665": {"title": "ipd: An R Package for Conducting Inference on Predicted Data", "link": "https://arxiv.org/abs/2410.09665", "description": "arXiv:2410.09665v1 Announce Type: new \nAbstract: Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recent proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage. Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage `vignette' are available at github.com/ipd-tools/ipd. Contact: jtleek@fredhutch.org and tylermc@uw.edu"}, "https://arxiv.org/abs/2410.09712": {"title": "Random effects model-based sufficient dimension reduction for independent clustered data", "link": "https://arxiv.org/abs/2410.09712", "description": "arXiv:2410.09712v1 Announce Type: new \nAbstract: Sufficient dimension reduction (SDR) is a popular class of regression methods which aim to find a small number of linear combinations of covariates that capture all the information of the responses i.e., a central subspace. The majority of current methods for SDR focus on the setting of independent observations, while the few techniques that have been developed for clustered data assume the linear transformation is identical across clusters. In this article, we introduce random effects SDR, where cluster-specific random effect central subspaces are assumed to follow a distribution on the Grassmann manifold, and the random effects distribution is characterized by a covariance matrix that captures the heterogeneity between clusters in the SDR process itself. We incorporate random effect SDR within a model-based inverse regression framework. Specifically, we propose a random effects principal fitted components model, where a two-stage algorithm is used to estimate the overall fixed effect central subspace, and predict the cluster-specific random effect central subspaces. We demonstrate the consistency of the proposed estimators, while simulation studies demonstrate the superior performance of the proposed approach compared to global and cluster-specific SDR approaches. We also present extensions of the above model to handle mixed predictors, demonstrating how random effects SDR can be achieved in the case of mixed continuous and binary covariates. 
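The sufficient dimension reduction entry above (arXiv:2410.09712, continued below) builds on model-based inverse regression. For orientation only, here is a minimal sketch of classical sliced inverse regression (SIR) for independent data, which estimates a central subspace but has none of the random-effects structure or principal fitted components machinery the entry introduces.

```python
import numpy as np

def sir(X, y, n_slices=10, n_directions=2):
    """Classical sliced inverse regression (Li, 1991) for independent observations."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(cov))   # whitening: Z = (X - mu) L has identity covariance
    Z = (X - mu) @ L
    # Slice on the response and average the whitened predictors within each slice.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale.
    vals, vecs = np.linalg.eigh(M)
    directions = L @ vecs[:, ::-1][:, :n_directions]
    return directions / np.linalg.norm(directions, axis=0)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 6))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
y = (X @ b) ** 3 + 0.5 * rng.normal(size=1000)   # monotone link, single index
print(sir(X, y, n_directions=1).ravel())         # roughly proportional to b, up to sign
```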
Applying the proposed methods to study the longitudinal association between the life expectancy of women and socioeconomic variables across 117 countries, we find log income per capita, infant mortality, and income inequality are the main drivers of a two-dimensional fixed effect central subspace, although there is considerable heterogeneity in how the country-specific central subspaces are driven by the predictors."}, "https://arxiv.org/abs/2410.09808": {"title": "Optimal item calibration in the context of the Swedish Scholastic Aptitude Test", "link": "https://arxiv.org/abs/2410.09808", "description": "arXiv:2410.09808v1 Announce Type: new \nAbstract: Large scale achievement tests require the existence of item banks with items for use in future tests. Before an item is included into the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of the future achievement tests, it is important to perform this calibration well and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate calibration items to examinees with the most suited ability. Theoretical evidence shows advantages with using ability-dependent allocation of calibration items. However, it is not clear whether these theoretical results hold also in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify for what kind of items the method works well. This enables targeting specific item types for optimal calibration. We also discuss possibilities for improvements of the method."}, "https://arxiv.org/abs/2410.09810": {"title": "Doubly unfolded adjacency spectral embedding of dynamic multiplex graphs", "link": "https://arxiv.org/abs/2410.09810", "description": "arXiv:2410.09810v1 Announce Type: new \nAbstract: Many real-world networks evolve dynamically over time and present different types of connections between nodes, often called layers. In this work, we propose a latent position model for these objects, called the dynamic multiplex random dot product graph (DMPRDPG), which uses an inner product between layer-specific and time-specific latent representations of the nodes to obtain edge probabilities. We further introduce a computationally efficient spectral embedding method for estimation of DMPRDPG parameters, called doubly unfolded adjacency spectral embedding (DUASE). The DUASE estimates are proved to be consistent and asymptotically normally distributed, demonstrating the optimality properties of the proposed estimator. A key strength of our method is the encoding of time-specific node representations and layer-specific effects in separate latent spaces, which allows the model to capture complex behaviours while maintaining relatively low dimensionality. The embedding method we propose can also be efficiently used for subsequent inference tasks. In particular, we highlight the use of the ISOMAP algorithm in conjunction with DUASE as a way to efficiently capture trends and global changepoints within a network, and the use of DUASE for graph clustering. 
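The DMPRDPG entry above (arXiv:2410.09810, continued below) builds on unfolded adjacency spectral embeddings. The following is a minimal single-unfolding sketch in that spirit, assuming the per-layer/time adjacency matrices are simply concatenated column-wise before a truncated SVD; the doubly unfolded construction that separates layer-specific and time-specific representations is not reproduced here.

```python
import numpy as np

def unfolded_ase(adjs, d):
    """Spectral embedding of a sequence of adjacency matrices on shared nodes.

    adjs: list of (n, n) arrays (one per layer/time point); d: embedding dimension.
    Returns a shared (n, d) left embedding and one (n, d) right embedding per matrix.
    """
    A = np.hstack(adjs)                              # n x (n * len(adjs)) unfolding
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    left = U[:, :d] * np.sqrt(s[:d])                 # shared node representation
    rights = np.split(Vt[:d].T * np.sqrt(s[:d]), len(adjs), axis=0)
    return left, rights

rng = np.random.default_rng(5)
n = 60
z = rng.integers(0, 2, n)                            # two latent communities
P = np.where(z[:, None] == z[None, :], 0.4, 0.1)
adjs = [(rng.random((n, n)) < P).astype(float) for _ in range(3)]
left, rights = unfolded_ase(adjs, d=2)
print(left[:5])                                      # community structure separates in 2-D
```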
Applications on real-world networks describing geopolitical interactions between countries and financial news reporting demonstrate practical uses of our method."}, "https://arxiv.org/abs/2410.09825": {"title": "Nickell Meets Stambaugh: A Tale of Two Biases in Panel Predictive Regressions", "link": "https://arxiv.org/abs/2410.09825", "description": "arXiv:2410.09825v1 Announce Type: new \nAbstract: In panel predictive regressions with persistent covariates, coexistence of the Nickell bias and the Stambaugh bias imposes challenges for hypothesis testing. This paper introduces a new estimator, the IVX-X-Jackknife (IVXJ), which effectively removes this composite bias and reinstates standard inferential procedures. The IVXJ estimator is inspired by the IVX technique in time series. In panel data where the cross section is of the same order as the time dimension, the bias of the baseline panel IVX estimator can be corrected via an analytical formula by leveraging an innovative X-Jackknife scheme that divides the time dimension into the odd and even indices. IVXJ is the first procedure that achieves unified inference across a wide range of modes of persistence in panel predictive regressions, whereas such unified inference is unattainable for the popular within-group estimator. Extended to accommodate long-horizon predictions with multiple regressions, IVXJ is used to examine the impact of debt levels on financial crises by panel local projection. Our empirics provide comparable results across different categories of debt."}, "https://arxiv.org/abs/2410.09884": {"title": "Detecting Structural Shifts and Estimating Change-Points in Interval-Based Time Series", "link": "https://arxiv.org/abs/2410.09884", "description": "arXiv:2410.09884v1 Announce Type: new \nAbstract: This paper addresses the open problem of conducting change-point analysis for interval-valued time series data using the maximum likelihood estimation (MLE) framework. Motivated by financial time series, we analyze data that includes daily opening (O), up (U), low (L), and closing (C) values, rather than just a closing value as traditionally used. To tackle this, we propose a fundamental model based on stochastic differential equations, which also serves as a transformation of other widely used models, such as the log-transformed geometric Brownian motion model. We derive the joint distribution for these interval-valued observations using the reflection principle and Girsanov's theorem. The MLE is obtained by optimizing the log-likelihood function through first and second-order derivative calculations, utilizing the Newton-Raphson algorithm. We further propose a novel parametric bootstrap method to compute confidence intervals, addressing challenges related to temporal dependency and interval-based data relationships. The performance of the model is evaluated through extensive simulations and real data analysis using S&P500 returns during the 2022 Russo-Ukrainian War. 
The results demonstrate that the proposed OULC model consistently outperforms the traditional OC model, offering more accurate and reliable change-point detection and parameter estimates."}, "https://arxiv.org/abs/2410.09892": {"title": "A Bayesian promotion time cure model with current status data", "link": "https://arxiv.org/abs/2410.09892", "description": "arXiv:2410.09892v1 Announce Type: new \nAbstract: Analysis of lifetime data from epidemiological studies or destructive testing often involves current status censoring, wherein individuals are examined only once and their event status is recorded only at that specific time point. In practice, some of these individuals may never experience the event of interest, leading to current status data with a cured fraction. Cure models are used to estimate the proportion of non-susceptible individuals, the distribution of susceptible ones, and covariate effects. Motivated from a biological interpretation of cancer metastasis, promotion time cure model is a popular alternative to the mixture cure rate model for analysing such data. The current study is the first to put forth a Bayesian inference procedure for analysing current status data with a cure fraction, resorting to a promotion time cure model. An adaptive Metropolis-Hastings algorithm is utilised for posterior computation. Simulation studies prove our approach's efficiency, while analyses of lung tumor and breast cancer data illustrate its practical utility. This approach has the potential to improve clinical cure rates through the incorporation of prior knowledge regarding the disease dynamics and therapeutic options."}, "https://arxiv.org/abs/2410.09898": {"title": "A Bayesian Joint Modelling of Current Status and Current Count Data", "link": "https://arxiv.org/abs/2410.09898", "description": "arXiv:2410.09898v1 Announce Type: new \nAbstract: Current status censoring or case I interval censoring takes place when subjects in a study are observed just once to check if a particular event has occurred. If the event is recurring, the data are classified as current count data; if non-recurring, they are classified as current status data. Several instances of dependence of these recurring and non-recurring events are observable in epidemiology and pathology. Estimation of the degree of this dependence and identification of major risk factors for the events are the major objectives of such studies. The current study proposes a Bayesian method for the joint modelling of such related events, employing a shared frailty-based semiparametric regression model. Computational implementation makes use of an adaptive Metropolis-Hastings algorithm. Simulation studies are put into use to show the effectiveness of the method proposed and fracture-osteoporosis data are worked through to highlight its application."}, "https://arxiv.org/abs/2410.09952": {"title": "Large Scale Longitudinal Experiments: Estimation and Inference", "link": "https://arxiv.org/abs/2410.09952", "description": "arXiv:2410.09952v1 Announce Type: new \nAbstract: Large-scale randomized experiments are seldom analyzed using panel regression methods because of computational challenges arising from the presence of millions of nuisance parameters. We leverage Mundlak's insight that unit intercepts can be eliminated by using carefully chosen averages of the regressors to rewrite several common estimators in a form that is amenable to weighted-least squares estimation with frequency weights. 
This renders regressions involving arbitrary strata intercepts tractable with very large datasets, optionally with the key compression step computed out-of-memory in SQL. We demonstrate that these methods yield more precise estimates than other commonly used estimators, and also find that the compression strategy greatly increases computational efficiency. We provide in-memory (pyfixest) and out-of-memory (duckreg) Python libraries to implement these estimators."}, "https://arxiv.org/abs/2410.10025": {"title": "Sparse Multivariate Linear Regression with Strongly Associated Response Variables", "link": "https://arxiv.org/abs/2410.10025", "description": "arXiv:2410.10025v1 Announce Type: new \nAbstract: We propose new methods for multivariate linear regression when the regression coefficient matrix is sparse and the error covariance matrix is dense. We assume that the error covariance matrix has equicorrelation across the response variables. Two procedures are proposed: one is based on constant marginal response variance (compound symmetry), and the other is based on general varying marginal response variance. Two approximate procedures are also developed for high dimensions. We propose an approximation to the Gaussian validation likelihood for tuning parameter selection. Extensive numerical experiments illustrate when our procedures outperform relevant competitors as well as their robustness to model misspecification."}, "https://arxiv.org/abs/2410.10345": {"title": "Effective Positive Cauchy Combination Test", "link": "https://arxiv.org/abs/2410.10345", "description": "arXiv:2410.10345v1 Announce Type: new \nAbstract: In the field of multiple hypothesis testing, combining p-values represents a fundamental statistical method. The Cauchy combination test (CCT) (Liu and Xie, 2020) excels among numerous methods for combining p-values with powerful and computationally efficient performance. However, large p-values may diminish the significance of testing, even when extremely small p-values exist. We propose a novel approach named the positive Cauchy combination test (PCCT) to surmount this flaw. Building on the relationship between the PCCT and CCT methods, we obtain critical values by applying the Cauchy distribution to the PCCT statistic. We find, however, that the PCCT tends to be effective only when the significance level is substantially small or the test statistics are strongly correlated. Otherwise, it becomes challenging to control type I errors, a problem that also pertains to the CCT. Thanks to the theories of stable distributions and the generalized central limit theorem, we have demonstrated critical values under weak dependence, which effectively controls type I errors for any given significance level. For more general scenarios, we correct the test statistic using the generalized mean method, which can control the size under any dependence structure and cannot be further optimized. Our method exhibits excellent performance, as demonstrated through comprehensive simulation studies. We further validate the effectiveness of our proposed method by applying it to a genetic dataset."}, "https://arxiv.org/abs/2410.10354": {"title": "Bayesian nonparametric modeling of heterogeneous populations of networks", "link": "https://arxiv.org/abs/2410.10354", "description": "arXiv:2410.10354v1 Announce Type: new \nAbstract: The increasing availability of multiple network data has highlighted the need for statistical models for heterogeneous populations of networks. 
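The PCCT entry above (arXiv:2410.10345) modifies the Cauchy combination test; the baseline CCT it builds on has a one-line statistic. A minimal sketch of the baseline only; the positive-part modification and the generalized-mean correction from the entry are not implemented here.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvals, weights=None):
    """Cauchy combination test (Liu and Xie, 2020): combine p-values into one."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights, float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # heavy-tailed Cauchy transform of each p-value
    return cauchy.sf(t)                          # approximate combined p-value

print(cauchy_combination([0.8, 0.9, 1e-6]))      # dominated by the smallest p-value
```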
A convenient framework makes use of metrics to measure similarity between networks. In this context, we propose a novel Bayesian nonparametric model that identifies clusters of networks characterized by similar connectivity patterns. Our approach relies on a location-scale Dirichlet process mixture of centered Erd\\H{o}s--R\\'enyi kernels, with components parametrized by a unique network representative, or mode, and a univariate measure of dispersion around the mode. We demonstrate that this model has full support in the Kullback--Leibler sense and is strongly consistent. An efficient Markov chain Monte Carlo scheme is proposed for posterior inference and clustering of multiple network data. The performance of the model is validated through extensive simulation studies, showing improvements over state-of-the-art methods. Additionally, we present an effective strategy to extend the application of the proposed model to datasets with a large number of nodes. We illustrate our approach with the analysis of human brain network data."}, "https://arxiv.org/abs/2410.10482": {"title": "Regression Model for Speckled Data with Extremely Variability", "link": "https://arxiv.org/abs/2410.10482", "description": "arXiv:2410.10482v1 Announce Type: new \nAbstract: Synthetic aperture radar (SAR) is an efficient and widely used remote sensing tool. However, data extracted from SAR images are contaminated with speckle, which precludes the application of techniques based on the assumption of additive and normally distributed noise. One of the most successful approaches to describing such data is the multiplicative model, where intensities can follow a variety of distributions with positive support. The $\\mathcal{G}^0_I$ model is among the most successful ones. Although several estimation methods for the $\\mathcal{G}^0_I$ parameters have been proposed, there is no work exploring a regression structure for this model. Such a structure could allow us to infer unobserved values from available ones. In this work, we propose a $\\mathcal{G}^0_I$ regression model and use it to describe the influence of intensities from other polarimetric channels. We derive some theoretical properties for the new model: Fisher information matrix, residual measures, and influential tools. Maximum likelihood point and interval estimation methods are proposed and evaluated by Monte Carlo experiments. Results from simulated and actual data show that the new model can be helpful for SAR image analysis."}, "https://arxiv.org/abs/2410.10618": {"title": "An Approximate Identity Link Function for Bayesian Generalized Linear Models", "link": "https://arxiv.org/abs/2410.10618", "description": "arXiv:2410.10618v1 Announce Type: new \nAbstract: In this note, we consider using a link function that has heavier tails than the usual exponential link function. We construct efficient Gibbs algorithms for Poisson and Multinomial models based on this link function by introducing gamma and inverse Gaussian latent variables and show that the algorithms generate geometrically ergodic Markov chains in simple settings. Our algorithms can be used for more complicated models with many parameters. We fit our simple Poisson model to a real dataset and confirm that the posterior distribution has similar implications to those under the usual Poisson regression model based on the exponential link function. 
Although less interpretable, our models are potentially more tractable or flexible from a computational point of view in some cases."}, "https://arxiv.org/abs/2410.10619": {"title": "Partially exchangeable stochastic block models for multilayer networks", "link": "https://arxiv.org/abs/2410.10619", "description": "arXiv:2410.10619v1 Announce Type: new \nAbstract: Multilayer networks generalize single-layered connectivity data in several directions. These generalizations include, among others, settings where multiple types of edges are observed among the same set of nodes (edge-colored networks) or where a single notion of connectivity is measured between nodes belonging to different pre-specified layers (node-colored networks). While progress has been made in statistical modeling of edge-colored networks, principled approaches that flexibly account for both within and across layer block-connectivity structures while incorporating layer information through a rigorous probabilistic construction are still lacking for node-colored multilayer networks. We fill this gap by introducing a novel class of partially exchangeable stochastic block models specified in terms of a hierarchical random partition prior for the allocation of nodes to groups, whose number is learned by the model. This goal is achieved without jeopardizing probabilistic coherence, uncertainty quantification and derivation of closed-form predictive within- and across-layer co-clustering probabilities. Our approach facilitates prior elicitation, the understanding of theoretical properties and the development of yet-unexplored predictive strategies for both the connections and the allocations of future incoming nodes. Posterior inference proceeds via a tractable collapsed Gibbs sampler, while performance is illustrated in simulations and in a real-world criminal network application. The notable gains achieved over competitors clarify the importance of developing general stochastic block models based on suitable node-exchangeability structures coherent with the type of multilayer network being analyzed."}, "https://arxiv.org/abs/2410.10633": {"title": "Missing data imputation using a truncated infinite factor model with application to metabolomics data", "link": "https://arxiv.org/abs/2410.10633", "description": "arXiv:2410.10633v1 Announce Type: new \nAbstract: In metabolomics, the study of small molecules in biological samples, data are often acquired through mass spectrometry. The resulting data contain highly correlated variables, typically with a larger number of variables than observations. Missing data are prevalent, and imputation is critical as data acquisition can be difficult and expensive, and many analysis methods necessitate complete data. In such data, missing at random (MAR) missingness occurs due to acquisition or processing error, while missing not at random (MNAR) missingness occurs when true values lie below the threshold for detection. Existing imputation methods generally assume one missingness type, or impute values outside the physical constraints of the data, which lack utility. A truncated factor analysis model with an infinite number of factors (tIFA) is proposed to facilitate imputation in metabolomics data, in a statistically and physically principled manner. Truncated distributional assumptions underpin tIFA, ensuring cognisance of the data's physical constraints when imputing. 
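The tIFA entry above (arXiv:2410.10633, continued below) imputes values while respecting the data's physical constraints, for example metabolite abundances known only to lie below a detection limit. As a much simpler stand-in for that idea, here is a truncated-normal draw below a known limit; the infinite factor structure and the MAR/MNAR classification of tIFA are not shown, and `mu`, `sigma`, and `lod` are illustrative inputs.

```python
import numpy as np
from scipy.stats import truncnorm

def impute_below_lod(mu, sigma, lod, size, rng=None):
    """Draw imputations for values known only to lie below a detection limit."""
    a, b = -np.inf, (lod - mu) / sigma        # standardized truncation bounds
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size,
                         random_state=rng or np.random.default_rng(0))

print(impute_below_lod(mu=1.0, sigma=0.5, lod=0.4, size=5))  # all draws fall below 0.4
```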
Further, tIFA allows for both MAR and MNAR missingness, and a Bayesian inferential approach provides uncertainty quantification for imputed values and missingness types. The infinite factor model parsimoniously models the high-dimensional, multicollinear data, with nonparametric shrinkage priors obviating the need for model selection tools to infer the number of latent factors. A simulation study is performed to assess the performance of tIFA and an application to a urinary metabolomics dataset results in a full dataset with practically useful imputed values, and associated uncertainty, ready for use in metabolomics analyses. Open-source R code accompanies tIFA, facilitating its widespread use."}, "https://arxiv.org/abs/2410.10723": {"title": "Statistically and computationally efficient conditional mean imputation for censored covariates", "link": "https://arxiv.org/abs/2410.10723", "description": "arXiv:2410.10723v1 Announce Type: new \nAbstract: Censored, missing, and error-prone covariates are all coarsened data types for which the true values are unknown. Many methods to handle the unobserved values, including imputation, are shared between these data types, with nuances based on the mechanism dominating the unobserved values and any other available information. For example, in prospective studies, the time to a specific disease diagnosis will be incompletely observed if only some patients are diagnosed by the end of the follow-up. Specifically, some times will be randomly right-censored, and patients' disease-free follow-up times must be incorporated into their imputed values. Assuming noninformative censoring, these censored values are replaced with their conditional means, the calculations of which require (i) estimating the conditional distribution of the censored covariate and (ii) integrating over the corresponding survival function. Semiparametric approaches are common, which estimate the distribution with a Cox proportional hazards model and then the integral with the trapezoidal rule. While these approaches offer robustness, they come at the cost of statistical and computational efficiency. We propose a general framework for parametric conditional mean imputation of censored covariates that offers better statistical precision and requires less computational strain by modeling the survival function parametrically, where conditional means often have an analytic solution. The framework is implemented in the open-source R package, speedyCMI."}, "https://arxiv.org/abs/2410.10749": {"title": "Testing the order of fractional integration in the presence of smooth trends, with an application to UK Great Ratios", "link": "https://arxiv.org/abs/2410.10749", "description": "arXiv:2410.10749v1 Announce Type: new \nAbstract: This note proposes semi-parametric tests for investigating whether a stochastic process is fractionally integrated of order $\\delta$, where $|\\delta| < 1/2$, when smooth trends are present in the model. We combine the semi-parametric approach by Iacone, Nielsen & Taylor (2022) to model the short range dependence with the use of Chebyshev polynomials by Cuestas & Gil-Alana to describe smooth trends. Our proposed statistics have standard limiting null distributions and match the asymptotic local power of infeasible tests based on unobserved errors. We also establish the conditions under which an information criterion can consistently estimate the order of the Chebyshev polynomial. 
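The censored-covariate entry above (arXiv:2410.10723) notes that parametric survival models often give analytic conditional means; the exponential model is the simplest case, where memorylessness gives E[X | X > c] = c + 1/rate. A minimal sketch of that special case only; the speedyCMI package supports richer parametric families and covariate-dependent models.

```python
import numpy as np

def exponential_cmi(t, event):
    """Conditional mean imputation for a right-censored covariate under an Exp(rate) model.

    t: observed value (true value if event == 1, censoring time if event == 0).
    """
    t = np.asarray(t, float); event = np.asarray(event, int)
    rate = event.sum() / t.sum()            # MLE of the exponential rate under right censoring
    imputed = t.copy()
    imputed[event == 0] = t[event == 0] + 1.0 / rate   # memorylessness: E[X | X > c] = c + 1/rate
    return imputed, rate

rng = np.random.default_rng(6)
x = rng.exponential(2.0, size=5000)         # true covariate, mean 2
c = rng.exponential(3.0, size=5000)         # censoring times
t, event = np.minimum(x, c), (x <= c).astype(int)
imputed, rate = exponential_cmi(t, event)
print(round(1 / rate, 2), round(imputed.mean(), 2))   # both close to 2
```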
The finite sample performance is evaluated using simulations, and an empirical application is given for the UK Great Ratios."}, "https://arxiv.org/abs/2410.10772": {"title": "Peer effects in the linear-in-means model may be inestimable even when identified", "link": "https://arxiv.org/abs/2410.10772", "description": "arXiv:2410.10772v1 Announce Type: new \nAbstract: Linear-in-means models are widely used to investigate peer effects. Identifying peer effects in these models is challenging, but conditions for identification are well-known. However, even when peer effects are identified, they may not be estimable, due to an asymptotic colinearity issue: as sample size increases, peer effects become more and more linearly dependent. We show that asymptotic colinearity occurs whenever nodal covariates are independent of the network and the minimum degree of the network is growing. Asymptotic colinearity can cause estimators to be inconsistent or to converge at slower than expected rates. We also demonstrate that dependence between nodal covariates and network structure can alleviate colinearity issues in random dot product graphs. These results suggest that linear-in-means models are less reliable for studying peer influence than previously believed."}, "https://arxiv.org/abs/2410.09227": {"title": "Fast Data-independent KLT Approximations Based on Integer Functions", "link": "https://arxiv.org/abs/2410.09227", "description": "arXiv:2410.09227v1 Announce Type: cross \nAbstract: The Karhunen-Lo\\`eve transform (KLT) stands as a well-established discrete transform, demonstrating optimal characteristics in data decorrelation and dimensionality reduction. Its ability to condense energy compression into a select few main components has rendered it instrumental in various applications within image compression frameworks. However, computing the KLT depends on the covariance matrix of the input data, which makes it difficult to develop fast algorithms for its implementation. Approximations for the KLT, utilizing specific rounding functions, have been introduced to reduce its computational complexity. Therefore, our paper introduces a category of low-complexity, data-independent KLT approximations, employing a range of round-off functions. The design methodology of the approximate transform is defined for any block-length $N$, but emphasis is given to transforms of $N = 8$ due to its wide use in image and video compression. The proposed transforms perform well when compared to the exact KLT and approximations considering classical performance measures. For particular scenarios, our proposed transforms demonstrated superior performance when compared to KLT approximations documented in the literature. We also developed fast algorithms for the proposed transforms, further reducing the arithmetic cost associated with their implementation. Evaluation of field programmable gate array (FPGA) hardware implementation metrics was conducted. Practical applications in image encoding showed the relevance of the proposed transforms. In fact, we showed that one of the proposed transforms outperformed the exact KLT given certain compression ratios."}, "https://arxiv.org/abs/2410.09835": {"title": "Knockoffs for exchangeable categorical covariates", "link": "https://arxiv.org/abs/2410.09835", "description": "arXiv:2410.09835v1 Announce Type: cross \nAbstract: Let $X=(X_1,\\ldots,X_p)$ be a $p$-variate random vector and $F$ a fixed finite set. 
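For context on the KLT-approximation entry above (arXiv:2410.09227): the exact, data-dependent KLT is simply the eigenbasis of the input covariance, which is what the fixed, multiplication-free approximations aim to mimic. A minimal sketch of the exact transform for 8-point blocks; the proposed integer approximations themselves are not reproduced.

```python
import numpy as np

def klt_basis(blocks):
    """Exact KLT: eigenvectors of the sample covariance of the input blocks."""
    cov = np.cov(blocks, rowvar=False)                 # blocks: (num_blocks, 8)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, ::-1]                               # columns sorted by decreasing variance

rng = np.random.default_rng(7)
# Synthetic, highly correlated 8-sample blocks (an AR(1)-like signal).
rho = 0.95
cov = rho ** np.abs(np.subtract.outer(np.arange(8), np.arange(8)))
blocks = rng.multivariate_normal(np.zeros(8), cov, size=4000)

T = klt_basis(blocks)
coeffs = blocks @ T
energy = np.var(coeffs, axis=0)
print(np.round(energy / energy.sum(), 3))   # energy compacts into the first few coefficients
```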
In a number of applications, mainly in genetics, it turns out that $X_i\\in F$ for each $i=1,\\ldots,p$. Despite the latter fact, to obtain a knockoff $\\widetilde{X}$ (in the sense of \\cite{CFJL18}), $X$ is usually modeled as an absolutely continuous random vector. While comprehensible from the point of view of applications, this approximate procedure does not make sense theoretically, since $X$ is supported by the finite set $F^p$. In this paper, explicit formulae for the joint distribution of $(X,\\widetilde{X})$ are provided when $P(X\\in F^p)=1$ and $X$ is exchangeable or partially exchangeable. In fact, when $X_i\\in F$ for all $i$, there seem to be various reasons for assuming $X$ exchangeable or partially exchangeable. The robustness of $\\widetilde{X}$, with respect to the de Finetti's measure $\\pi$ of $X$, is investigated as well. Let $\\mathcal{L}_\\pi(\\widetilde{X}\\mid X=x)$ denote the conditional distribution of $\\widetilde{X}$, given $X=x$, when the de Finetti's measure is $\\pi$. It is shown that $$\\norm{\\mathcal{L}_{\\pi_1}(\\widetilde{X}\\mid X=x)-\\mathcal{L}_{\\pi_2}(\\widetilde{X}\\mid X=x)}\\le c(x)\\,\\norm{\\pi_1-\\pi_2}$$ where $\\norm{\\cdot}$ is total variation distance and $c(x)$ a suitable constant. Finally, a numerical experiment is performed. Overall, the knockoffs of this paper outperform the alternatives (i.e., the knockoffs obtained by giving $X$ an absolutely continuous distribution) as regards the false discovery rate but are slightly weaker in terms of power."}, "https://arxiv.org/abs/2410.10282": {"title": "Exact MCMC for Intractable Proposals", "link": "https://arxiv.org/abs/2410.10282", "description": "arXiv:2410.10282v1 Announce Type: cross \nAbstract: Accept-reject based Markov chain Monte Carlo (MCMC) methods are the workhorse algorithm for Bayesian inference. These algorithms, like Metropolis-Hastings, require the choice of a proposal distribution which is typically informed by the desired target distribution. Surprisingly, proposal distributions with unknown normalizing constants are not uncommon, even though for such a choice of a proposal, the Metropolis-Hastings acceptance ratio cannot be evaluated exactly. Across the literature, authors resort to approximation methods that yield inexact MCMC or develop specialized algorithms to combat this problem. We show how Bernoulli factory MCMC algorithms, originally proposed for doubly intractable target distributions, can quite naturally be adapted to this situation. We present three diverse and relevant examples demonstrating the usefulness of the Bernoulli factory approach to this problem."}, "https://arxiv.org/abs/2410.10309": {"title": "Optimal lower bounds for logistic log-likelihoods", "link": "https://arxiv.org/abs/2410.10309", "description": "arXiv:2410.10309v1 Announce Type: cross \nAbstract: The logit transform is arguably the most widely-employed link function beyond linear settings. This transformation routinely appears in regression models for binary data and provides, either explicitly or implicitly, a core building-block within state-of-the-art methodologies for both classification and regression. Its widespread use, combined with the lack of analytical solutions for the optimization of general losses involving the logit transform, still motivates active research in computational statistics. 
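The logistic lower-bound entry above (arXiv:2410.10309, continued just below with tangent quadratic minorizers) improves on the classical tangent quadratic bound, which is the Jaakkola-Jordan minorizer. A minimal numerical check of that classical bound; the entry's sharper piecewise-quadratic bound is not implemented here.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)     # log(1 / (1 + exp(-x))), numerically stable

def jj_quadratic_bound(x, xi):
    """Jaakkola-Jordan tangent quadratic lower bound on log_sigmoid, tangent at |x| = xi."""
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return log_sigmoid(xi) + 0.5 * (x - xi) - lam * (x ** 2 - xi ** 2)

x = np.linspace(-10, 10, 2001)
for xi in (0.5, 2.0, 5.0):
    gap = log_sigmoid(x) - jj_quadratic_bound(x, xi)
    assert gap.min() > -1e-10         # the bound holds globally
    print(f"xi = {xi}: max gap = {gap.max():.3f}, gap near x = xi: {gap[x >= xi][0]:.2e}")
```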
Among the directions explored, a central one has focused on the design of tangent lower bounds for logistic log-likelihoods that can be tractably optimized, while providing a tight approximation of these log-likelihoods. Although progress along these lines has led to the development of effective minorize-maximize (MM) algorithms for point estimation and coordinate ascent variational inference schemes for approximate Bayesian inference under several logit models, the overarching focus in the literature has been on tangent quadratic minorizers. In fact, it is still unclear whether tangent lower bounds sharper than quadratic ones can be derived without undermining the tractability of the resulting minorizer. This article addresses such a challenging question through the design and study of a novel piece-wise quadratic lower bound that uniformly improves any tangent quadratic minorizer, including the sharpest ones, while admitting a direct interpretation in terms of the classical generalized lasso problem. As illustrated in a ridge logistic regression, this unique connection facilitates more effective implementations than those provided by available piece-wise bounds, while improving the convergence speed of quadratic ones."}, "https://arxiv.org/abs/2410.10538": {"title": "Data-Driven Approaches for Modelling Target Behaviour", "link": "https://arxiv.org/abs/2410.10538", "description": "arXiv:2410.10538v1 Announce Type: cross \nAbstract: The performance of tracking algorithms strongly depends on the chosen model assumptions regarding the target dynamics. If there is a strong mismatch between the chosen model and the true object motion, the track quality may be poor or the track is easily lost. Still, the true dynamics might not be known a priori or it is too complex to be expressed in a tractable mathematical formulation. This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion based on training data. The first method builds on Gaussian Processes (GPs) for predicting the object motion, the second learns the parameters of an Interacting Multiple Model (IMM) filter and the third uses a Long Short-Term Memory (LSTM) network as a motion model. All methods are compared against an Extended Kalman Filter (EKF) with an analytic motion model as a benchmark and their respective strengths are highlighted in one simulated and two real-world scenarios."}, "https://arxiv.org/abs/2410.10641": {"title": "Echo State Networks for Spatio-Temporal Area-Level Data", "link": "https://arxiv.org/abs/2410.10641", "description": "arXiv:2410.10641v1 Announce Type: cross \nAbstract: Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can be extremely useful for policymakers to develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model's computational efficiency during training. 
We demonstrate the effectiveness of our approach using Eurostat's tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts."}, "https://arxiv.org/abs/2410.10649": {"title": "Vecchia Gaussian Processes: Probabilistic Properties, Minimax Rates and Methodological Developments", "link": "https://arxiv.org/abs/2410.10649", "description": "arXiv:2410.10649v1 Announce Type: cross \nAbstract: Gaussian Processes (GPs) are widely used to model dependency in spatial statistics and machine learning, yet the exact computation suffers an intractable time complexity of $O(n^3)$. Vecchia approximation allows scalable Bayesian inference of GPs in $O(n)$ time by introducing sparsity in the spatial dependency structure that is characterized by a directed acyclic graph (DAG). Despite the popularity in practice, it is still unclear how to choose the DAG structure and there are still no theoretical guarantees in nonparametric settings. In this paper, we systematically study the Vecchia GPs as standalone stochastic processes and uncover important probabilistic properties and statistical results in methodology and theory. For probabilistic properties, we prove that the conditional distributions of the Mat\\'{e}rn GPs, as well as the Vecchia approximations of the Mat\\'{e}rn GPs, can be characterized by polynomials. This allows us to prove a series of results regarding the small ball probabilities and RKHSs of Vecchia GPs. For statistical methodology, we provide a principled guideline to choose parent sets as norming sets with fixed cardinality and provide detailed algorithms following such guidelines. For statistical theory, we prove posterior contraction rates for applying Vecchia GPs to regression problems, where minimax optimality is achieved by optimally tuned GPs via either oracle rescaling or hierarchical Bayesian methods. Our theory and methodology are demonstrated with numerical studies, where we also provide efficient implementation of our methods in C++ with R interfaces."}, "https://arxiv.org/abs/2410.10704": {"title": "Estimation beyond Missing (Completely) at Random", "link": "https://arxiv.org/abs/2410.10704", "description": "arXiv:2410.10704v1 Announce Type: cross \nAbstract: We study the effects of missingness on the estimation of population parameters. Moving beyond restrictive missing completely at random (MCAR) assumptions, we first formulate a missing data analogue of Huber's arbitrary $\\epsilon$-contamination model. For mean estimation with respect to squared Euclidean error loss, we show that the minimax quantiles decompose as a sum of the corresponding minimax quantiles under a heterogeneous, MCAR assumption, and a robust error term, depending on $\\epsilon$, that reflects the additional error incurred by departure from MCAR.\n We next introduce natural classes of realisable $\\epsilon$-contamination models, where an MCAR version of a base distribution $P$ is contaminated by an arbitrary missing not at random (MNAR) version of $P$. These classes are rich enough to capture various notions of biased sampling and sensitivity conditions, yet we show that they enjoy improved minimax performance relative to our earlier arbitrary contamination classes for both parametric and nonparametric classes of base distributions. 
For instance, with a univariate Gaussian base distribution, consistent mean estimation over realisable $\\epsilon$-contamination classes is possible even when $\\epsilon$ and the proportion of missingness converge (slowly) to 1. Finally, we extend our results to the setting of departures from missing at random (MAR) in normal linear regression with a realisable missing response."}, "https://arxiv.org/abs/1904.04484": {"title": "Meta-analysis of Bayesian analyses", "link": "https://arxiv.org/abs/1904.04484", "description": "arXiv:1904.04484v2 Announce Type: replace \nAbstract: Meta-analysis aims to generalize results from multiple related statistical analyses through a combined analysis. While the natural outcome of a Bayesian study is a posterior distribution, traditional Bayesian meta-analyses proceed by combining summary statistics (i.e., point-valued estimates) computed from data. In this paper, we develop a framework for combining posterior distributions from multiple related Bayesian studies into a meta-analysis. Importantly, the method is capable of reusing pre-computed posteriors from computationally costly analyses, without needing the implementation details from each study. Besides providing a consensus across studies, the method enables updating the local posteriors post-hoc and therefore refining them by sharing statistical strength between the studies, without rerunning the original analyses. We illustrate the wide applicability of the framework by combining results from likelihood-free Bayesian analyses, which would be difficult to carry out using standard methodology."}, "https://arxiv.org/abs/2210.05026": {"title": "Uncertainty Quantification in Synthetic Controls with Staggered Treatment Adoption", "link": "https://arxiv.org/abs/2210.05026", "description": "arXiv:2210.05026v4 Announce Type: replace \nAbstract: We propose principled prediction intervals to quantify the uncertainty of a large class of synthetic control predictions (or estimators) in settings with staggered treatment adoption, offering precise non-asymptotic coverage probability guarantees. From a methodological perspective, we provide a detailed discussion of different causal quantities to be predicted, which we call causal predictands, allowing for multiple treated units with treatment adoption at possibly different points in time. From a theoretical perspective, our uncertainty quantification methods improve on prior literature by (i) covering a large class of causal predictands in staggered adoption settings, (ii) allowing for synthetic control methods with possibly nonlinear constraints, (iii) proposing scalable robust conic optimization methods and principled data-driven tuning parameter selection, and (iv) offering valid uniform inference across post-treatment periods. We illustrate our methodology with an empirical application studying the effects of economic liberalization on real GDP per capita for Sub-Saharan African countries. Companion general-purpose software packages are provided in Python, R, and Stata."}, "https://arxiv.org/abs/2211.11884": {"title": "Parameter Estimation in Nonlinear Multivariate Stochastic Differential Equations Based on Splitting Schemes", "link": "https://arxiv.org/abs/2211.11884", "description": "arXiv:2211.11884v3 Announce Type: replace \nAbstract: The likelihood functions for discretely observed nonlinear continuous-time models based on stochastic differential equations are not available except for a few cases. 
Various parameter estimation techniques have been proposed, each with advantages, disadvantages, and limitations depending on the application. Most applications still use the Euler-Maruyama discretization, despite many proofs of its bias. More sophisticated methods, such as Kessler's Gaussian approximation, Ozaki's Local Linearization, A\\\"it-Sahalia's Hermite expansions, or MCMC methods, might be complex to implement, do not scale well with increasing model dimension, or can be numerically unstable. We propose two efficient and easy-to-implement likelihood-based estimators based on the Lie-Trotter (LT) and the Strang (S) splitting schemes. We prove that S has $L^p$ convergence rate of order 1, a property already known for LT. We show that the estimators are consistent and asymptotically efficient under the less restrictive one-sided Lipschitz assumption. A numerical study on the 3-dimensional stochastic Lorenz system complements our theoretical findings. The simulation shows that the S estimator performs the best when measured on precision and computational speed compared to the state-of-the-art."}, "https://arxiv.org/abs/2212.05524": {"title": "Bayesian inference for partial orders from random linear extensions: power relations from 12th Century Royal Acta", "link": "https://arxiv.org/abs/2212.05524", "description": "arXiv:2212.05524v3 Announce Type: replace \nAbstract: In the eleventh and twelfth centuries in England, Wales and Normandy, Royal Acta were legal documents in which witnesses were listed in order of social status. Any bishops present were listed as a group. For our purposes, each witness-list is an ordered permutation of bishop names with a known date or date-range. Changes over time in the order bishops are listed may reflect changes in their authority. Historians would like to detect and quantify these changes. There is no reason to assume that the underlying social order which constrains bishop-order within lists is a complete order. We therefore model the evolving social order as an evolving partial ordered set or {\\it poset}.\n We construct a Hidden Markov Model for these data. The hidden state is an evolving poset (the evolving social hierarchy) and the emitted data are random total orders (dated lists) respecting the poset present at the time the order was observed. This generalises existing models for rank-order data such as Mallows and Plackett-Luce. We account for noise via a random ``queue-jumping'' process. Our latent-variable prior for the random process of posets is marginally consistent. A parameter controls poset depth and actor-covariates inform the position of actors in the hierarchy. We fit the model, estimate posets and find evidence for changes in status over time. We interpret our results in terms of court politics. Simpler models, based on Bucket Orders and vertex-series-parallel orders, are rejected. We compare our results with a time-series extension of the Plackett-Luce model. Our software is publicly available."}, "https://arxiv.org/abs/2301.03799": {"title": "Tensor Formulation of the General Linear Model with Einstein Notation", "link": "https://arxiv.org/abs/2301.03799", "description": "arXiv:2301.03799v2 Announce Type: replace \nAbstract: The general linear model is a universally accepted method to conduct and test multiple linear regression models. Using this model one has the ability to simultaneously regress covariates among different groups of data. 
Moreover, there are hundreds of applications and statistical tests associated with the general linear model. However, the conventional matrix formulation is relatively inelegant which yields multiple difficulties including slow computation speed due to a large number of computations, increased memory usage due to needlessly large data structures, and organizational inconsistency. This is due to the fundamental incongruence between the degrees of freedom of the information the data structures in the conventional formulation of the general linear model are intended to represent and the rank of the data structures themselves. Here, I briefly suggest an elegant reformulation of the general linear model which involves the use of tensors and multidimensional arrays as opposed to exclusively flat structures in the conventional formulation. To demonstrate the efficacy of this approach I translate a few common applications of the general linear model from the conventional formulation to the tensor formulation."}, "https://arxiv.org/abs/2310.09345": {"title": "A Unified Bayesian Framework for Modeling Measurement Error in Multinomial Data", "link": "https://arxiv.org/abs/2310.09345", "description": "arXiv:2310.09345v2 Announce Type: replace \nAbstract: Measurement error in multinomial data is a well-known and well-studied inferential problem that is encountered in many fields, including engineering, biomedical and omics research, ecology, finance, official statistics, and social sciences. Methods developed to accommodate measurement error in multinomial data are typically equipped to handle false negatives or false positives, but not both. We provide a unified framework for accommodating both forms of measurement error using a Bayesian hierarchical approach. We demonstrate the proposed method's performance on simulated data and apply it to acoustic bat monitoring and official crime data."}, "https://arxiv.org/abs/2311.12252": {"title": "Partial identification and unmeasured confounding with multiple treatments and multiple outcomes", "link": "https://arxiv.org/abs/2311.12252", "description": "arXiv:2311.12252v2 Announce Type: replace \nAbstract: In this work, we develop a framework for partial identification of causal effects in settings with multiple treatments and multiple outcomes. We highlight several advantages of jointly analyzing causal effects across multiple estimands under a \"factor confounding assumption\" where residual dependence amongst treatments and outcomes is assumed to be driven by unmeasured confounding. In this setting, we show that joint partial identification regions for multiple estimands can be more informative than considering partial identification for individual estimands one at a time. We additionally show how assumptions related to the strength of confounding or the magnitude of plausible effect sizes for any one estimand can reduce the partial identification regions for other estimands. As a special case of this result, we explore how negative control assumptions reduce partial identification regions and discuss conditions under which point identification can be obtained. We develop novel computational approaches to finding partial identification regions under a variety of these assumptions. 
Lastly, we demonstrate our approach in an analysis of the causal effects of multiple air pollutants on several health outcomes in the United States using claims data from Medicare, where we find that some exposures have effects that are robust to the presence of unmeasured confounding."}, "https://arxiv.org/abs/1908.09173": {"title": "Welfare Analysis in Dynamic Models", "link": "https://arxiv.org/abs/1908.09173", "description": "arXiv:1908.09173v2 Announce Type: replace-cross \nAbstract: This paper provides welfare metrics for dynamic choice. We give estimation and inference methods for functions of the expected value of dynamic choice. These parameters include average value by group, average derivatives with respect to endowments, and structural decompositions. The example of dynamic discrete choice is considered. We give dual and doubly robust representations of these parameters. A least squares estimator of the dynamic Riesz representer for the parameter of interest is given. Debiased machine learners are provided and asymptotic theory given."}, "https://arxiv.org/abs/2303.06198": {"title": "Deflated HeteroPCA: Overcoming the curse of ill-conditioning in heteroskedastic PCA", "link": "https://arxiv.org/abs/2303.06198", "description": "arXiv:2303.06198v2 Announce Type: replace-cross \nAbstract: This paper is concerned with estimating the column subspace of a low-rank matrix $\\boldsymbol{X}^\\star \\in \\mathbb{R}^{n_1\\times n_2}$ from contaminated data. How to obtain optimal statistical accuracy while accommodating the widest range of signal-to-noise ratios (SNRs) becomes particularly challenging in the presence of heteroskedastic noise and unbalanced dimensionality (i.e., $n_2\\gg n_1$). While the state-of-the-art algorithm $\\textsf{HeteroPCA}$ emerges as a powerful solution for solving this problem, it suffers from \"the curse of ill-conditioning,\" namely, its performance degrades as the condition number of $\\boldsymbol{X}^\\star$ grows. In order to overcome this critical issue without compromising the range of allowable SNRs, we propose a novel algorithm, called $\\textsf{Deflated-HeteroPCA}$, that achieves near-optimal and condition-number-free theoretical guarantees in terms of both $\\ell_2$ and $\\ell_{2,\\infty}$ statistical accuracy. The proposed algorithm divides the spectrum of $\\boldsymbol{X}^\\star$ into well-conditioned and mutually well-separated subblocks, and applies $\\textsf{HeteroPCA}$ to conquer each subblock successively. Further, an application of our algorithm and theory to two canonical examples -- the factor model and tensor PCA -- leads to remarkable improvement for each application."}, "https://arxiv.org/abs/2410.11093": {"title": "rquest: An R package for hypothesis tests and confidence intervals for quantiles and summary measures based on quantiles", "link": "https://arxiv.org/abs/2410.11093", "description": "arXiv:2410.11093v1 Announce Type: new \nAbstract: Sample quantiles, such as the median, are often better suited than the sample mean for summarising location characteristics of a data set. Similarly, linear combinations of sample quantiles and ratios of such linear combinations, e.g. the interquartile range and quantile-based skewness measures, are often used to quantify characteristics such as spread and skew. While often reported, it is uncommon to accompany quantile estimates with confidence intervals or standard errors. 
The rquest package provides a simple way to conduct hypothesis tests and derive confidence intervals for quantiles, linear combinations of quantiles, ratios of dependent linear combinations (e.g., Bowley's measure of skewness) and differences and ratios of all of the above for comparisons between independent samples. Many commonly used measures based on quantiles are included, although it is also very simple for users to define their own. Additionally, quantile-based measures of inequality are also considered. The methods are based on recent research showing that reliable distribution-free confidence intervals can be obtained, even for moderate sample sizes. Several examples are provided herein."}, "https://arxiv.org/abs/2410.11103": {"title": "Models for spatiotemporal data with some missing locations and application to emergency calls models calibration", "link": "https://arxiv.org/abs/2410.11103", "description": "arXiv:2410.11103v1 Announce Type: new \nAbstract: We consider two classes of models for spatiotemporal data: one without covariates and one with covariates. If $\\mathcal{T}$ is a partition of time and $\\mathcal{I}$ a partition of the studied area into zones and if $\\mathcal{C}$ is the set of arrival types, we assume that the process of arrivals for time interval $t \\in \\mathcal{T}$, zone $i \\in \\mathcal{I}$, and arrival type $c \\in \\mathcal{C}$ is Poisson with some intensity $\\lambda_{c,i,t}$. We discussed the calibration and implementation of such models in \\cite{laspatedpaper, laspatedmanual} with corresponding software LASPATED (Library for the Analysis of SPAtioTEmporal Discrete data) available on GitHub at https://github.com/vguigues/LASPATED. In this paper, we discuss the extension of these models when some of the locations are missing in the historical data. We propose three models to deal with missing locations and implemented them both in Matlab and C++. The corresponding code is available on GitHub as an extension of LASPATED at https://github.com/vguigues/LASPATED/Missing_Data. We tested our implementation using the process of emergency calls to an Emergency Health Service where many calls come with missing locations and show the importance and benefit of using models that consider missing locations, rather than discarding the calls with missing locations for the calibration of statistical models for such calls."}, "https://arxiv.org/abs/2410.11151": {"title": "Discovering the critical number of respondents to validate an item in a questionnaire: The Binomial Cut-level Content Validity proposal", "link": "https://arxiv.org/abs/2410.11151", "description": "arXiv:2410.11151v1 Announce Type: new \nAbstract: The question that drives this research is: \"How to discover the number of respondents that are necessary to validate items of a questionnaire as actually essential to reach the questionnaire's proposal?\" Among the efforts in this subject, \\cite{Lawshe1975, Wilson2012, Ayre_CVR_2014} approached this issue by proposing and refining the Content Validation Ratio (CVR), which seeks to identify items that are actually essential. Despite their contribution, these studies do not check whether an item validated as \"essential\" could also be validated as \"not essential\" by the same sample, which would be a paradox. Another issue is the assignment of a 50\\% probability that a respondent randomly checks an item as essential, even though an evaluator has three options to choose from.
Our proposal addresses these issues, making it possible to verify whether a paradoxical situation occurs and providing more precise recommendations on whether an item should be retained in or discarded from a questionnaire."}, "https://arxiv.org/abs/2410.11192": {"title": "Multi-scale tests of independence powerful for detecting explicit or implicit functional relationship", "link": "https://arxiv.org/abs/2410.11192", "description": "arXiv:2410.11192v1 Announce Type: new \nAbstract: In this article, we consider the problem of testing the independence between two random variables. Our primary objective is to develop tests that are highly effective at detecting associations arising from an explicit or implicit functional relationship between two variables. We adopt a multi-scale approach by analyzing neighborhoods of varying sizes within the dataset and aggregating the results. We introduce a general testing framework designed to enhance the power of existing independence tests to achieve our objective. Additionally, we propose a novel test method that is powerful as well as computationally efficient. The performance of these tests is compared with existing methods using various simulated datasets."}, "https://arxiv.org/abs/2410.11263": {"title": "Closed-form estimation and inference for panels with attrition and refreshment samples", "link": "https://arxiv.org/abs/2410.11263", "description": "arXiv:2410.11263v1 Announce Type: new \nAbstract: It has long been established that, if a panel dataset suffers from attrition, auxiliary (refreshment) sampling restores full identification under additional assumptions that still allow for nontrivial attrition mechanisms. Such identification results, however, rely on implausible assumptions about the attrition process or lead to theoretically and computationally challenging estimation procedures. We propose an alternative identifying assumption that, despite its nonparametric nature, suggests a simple estimation algorithm based on a transformation of the empirical cumulative distribution function of the data. This estimation procedure requires neither tuning parameters nor optimization in the first step, i.e., it has a closed form. We prove that our estimator is consistent and asymptotically normal and demonstrate its good performance in simulations. We provide an empirical illustration with income data from the Understanding America Study."}, "https://arxiv.org/abs/2410.11320": {"title": "Regularized Estimation of High-Dimensional Matrix-Variate Autoregressive Models", "link": "https://arxiv.org/abs/2410.11320", "description": "arXiv:2410.11320v1 Announce Type: new \nAbstract: Matrix-variate time series data are increasingly popular in economics, statistics, and environmental studies, among other fields. This paper develops regularized estimation methods for analyzing high-dimensional matrix-variate time series using bilinear matrix-variate autoregressive models. The bilinear autoregressive structure is widely used for matrix-variate time series, as it reduces model complexity while capturing interactions between rows and columns. However, when dealing with large dimensions, the commonly used iterated least-squares method results in numerous estimated parameters, making interpretation difficult. To address this, we propose two regularized estimation methods to further reduce model dimensionality. The first assumes banded autoregressive coefficient matrices, where each data point interacts only with nearby points.
A two-step estimation method is used: first, traditional iterated least-squares is applied for initial estimates, followed by a banded iterated least-squares approach. A Bayesian Information Criterion (BIC) is introduced to estimate the bandwidth of the coefficient matrices. The second method assumes sparse autoregressive matrices, applying the LASSO technique for regularization. We derive asymptotic properties for both methods as the dimensions diverge and the sample size $T\\rightarrow\\infty$. Simulations and real data examples demonstrate the effectiveness of our methods, comparing their forecasting performance against common autoregressive models in the literature."}, "https://arxiv.org/abs/2410.11408": {"title": "Aggregation Trees", "link": "https://arxiv.org/abs/2410.11408", "description": "arXiv:2410.11408v1 Announce Type: new \nAbstract: Uncovering the heterogeneous effects of particular policies or \"treatments\" is a key concern for researchers and policymakers. A common approach is to report average treatment effects across subgroups based on observable covariates. However, the choice of subgroups is crucial as it poses the risk of $p$-hacking and requires balancing interpretability with granularity. This paper proposes a nonparametric approach to construct heterogeneous subgroups. The approach enables a flexible exploration of the trade-off between interpretability and the discovery of more granular heterogeneity by constructing a sequence of nested groupings, each with an optimality property. By integrating our approach with \"honesty\" and debiased machine learning, we provide valid inference about the average treatment effect of each group. We validate the proposed methodology through an empirical Monte-Carlo study and apply it to revisit the impact of maternal smoking on birth weight, revealing systematic heterogeneity driven by parental and birth-related characteristics."}, "https://arxiv.org/abs/2410.11438": {"title": "Effect modification and non-collapsibility leads to conflicting treatment decisions: a review of marginal and conditional estimands and recommendations for decision-making", "link": "https://arxiv.org/abs/2410.11438", "description": "arXiv:2410.11438v1 Announce Type: new \nAbstract: Effect modification occurs when a covariate alters the relative effectiveness of treatment compared to control. It is widely understood that, when effect modification is present, treatment recommendations may vary by population and by subgroups within the population. Population-adjustment methods are increasingly used to adjust for differences in effect modifiers between study populations and to produce population-adjusted estimates in a relevant target population for decision-making. It is also widely understood that marginal and conditional estimands for non-collapsible effect measures, such as odds ratios or hazard ratios, do not in general coincide even without effect modification. However, the consequences of both non-collapsibility and effect modification together are little-discussed in the literature.\n In this paper, we set out the definitions of conditional and marginal estimands, illustrate their properties when effect modification is present, and discuss the implications for decision-making. In particular, we show that effect modification can result in conflicting treatment rankings between conditional and marginal estimates. 
This is because conditional and marginal estimands correspond to different decision questions that are no longer aligned when effect modification is present. For time-to-event outcomes, the presence of covariates implies that marginal hazard ratios are time-varying, and effect modification can cause marginal hazard curves to cross. We conclude with practical recommendations for decision-making in the presence of effect modification, based on pragmatic comparisons of both conditional and marginal estimates in the decision target population. Currently, multilevel network meta-regression is the only population-adjustment method capable of producing both conditional and marginal estimates, in any decision target population."}, "https://arxiv.org/abs/2410.11482": {"title": "Scalable likelihood-based estimation and variable selection for the Cox model with incomplete covariates", "link": "https://arxiv.org/abs/2410.11482", "description": "arXiv:2410.11482v1 Announce Type: new \nAbstract: Regression analysis with missing data is a long-standing and challenging problem, particularly when there are many missing variables with arbitrary missing patterns. Likelihood-based methods, although theoretically appealing, are often computationally inefficient or even infeasible when dealing with a large number of missing variables. In this paper, we consider the Cox regression model with incomplete covariates that are missing at random. We develop an expectation-maximization (EM) algorithm for nonparametric maximum likelihood estimation, employing a transformation technique in the E-step so that it involves only a one-dimensional integration. This innovation makes our methods scalable with respect to the dimension of the missing variables. In addition, for variable selection, we extend the proposed EM algorithm to accommodate a LASSO penalty in the likelihood. We demonstrate the feasibility and advantages of the proposed methods over existing methods by large-scale simulation studies and apply the proposed methods to a cancer genomic study."}, "https://arxiv.org/abs/2410.11562": {"title": "Functional Data Analysis on Wearable Sensor Data: A Systematic Review", "link": "https://arxiv.org/abs/2410.11562", "description": "arXiv:2410.11562v1 Announce Type: new \nAbstract: Wearable devices and sensors have recently become a popular way to collect data, especially in the health sciences. The use of sensors allows patients to be monitored over a period of time with a high observation frequency. Due to the continuous-on-time structure of the data, novel statistical methods are recommended for the analysis of sensor data. One of the popular approaches in the analysis of wearable sensor data is functional data analysis. The main objective of this paper is to review functional data analysis methods applied to wearable device data according to the type of sensor. In addition, we introduce several freely available software packages and open databases of wearable device data to facilitate access to sensor data in different fields."}, "https://arxiv.org/abs/2410.11637": {"title": "Prediction-Centric Uncertainty Quantification via MMD", "link": "https://arxiv.org/abs/2410.11637", "description": "arXiv:2410.11637v1 Announce Type: new \nAbstract: Deterministic mathematical models, such as those specified via differential equations, are a powerful tool to communicate scientific insight. However, such models are necessarily simplified descriptions of the real world. 
Generalised Bayesian methodologies have been proposed for inference with misspecified models, but these are typically associated with vanishing parameter uncertainty as more data are observed. In the context of a misspecified deterministic mathematical model, this has the undesirable consequence that posterior predictions become deterministic and certain, while being incorrect. Taking this observation as a starting point, we propose Prediction-Centric Uncertainty Quantification, where a mixture distribution based on the deterministic model confers improved uncertainty quantification in the predictive context. Computation of the mixing distribution is cast as a (regularised) gradient flow of the maximum mean discrepancy (MMD), enabling consistent numerical approximations to be obtained. Results are reported on both a toy model from population ecology and a real model of protein signalling in cell biology."}, "https://arxiv.org/abs/2410.11713": {"title": "Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing", "link": "https://arxiv.org/abs/2410.11713", "description": "arXiv:2410.11713v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are the gold standard for causal inference on treatment effects. However, they can be underpowered due to small population sizes in rare diseases and limited number of patients willing to participate due to questions regarding equipoise among treatment groups in common diseases. Hybrid controlled trials use external controls (ECs) from historical studies or large observational databases to enhance statistical efficiency, and as such, are considered for drug evaluation in rare diseases or indications associated with low trial participation. However, non-randomized ECs can introduce biases that compromise validity and inflate type I errors for treatment discovery, particularly in small samples. To address this, we extend the Fisher randomization test (FRT) to hybrid controlled trials. Our approach involves a test statistic combining RCT and EC data and is based solely on randomization in the RCT. This method strictly controls the type I error rate, even with biased ECs, and improves power by incorporating unbiased ECs. To mitigate the power loss caused by biased ECs, we introduce conformal selective borrowing, which uses conformal p-values to individually select unbiased ECs, offering the flexibility to use either computationally efficient parametric models or off-the-shelf machine learning models to construct the conformal score function, along with model-agnostic reliability. We identify a risk-benefit trade-off in the power of FRT associated with different selection thresholds for conformal p-values. We offer both fixed and adaptive selection threshold options, enabling robust performance across varying levels of hidden bias. The advantages of our method are demonstrated through simulations and an application to a small lung cancer RCT with ECs from the National Cancer Database. Our method is available in the R package intFRT on GitHub."}, "https://arxiv.org/abs/2410.11716": {"title": "Randomization-based Inference for MCP-Mod", "link": "https://arxiv.org/abs/2410.11716", "description": "arXiv:2410.11716v1 Announce Type: new \nAbstract: Dose selection is critical in pharmaceutical drug development, as it directly impacts therapeutic efficacy and patient safety of a drug. 
The Generalized Multiple Comparison Procedures and Modeling (MCP-Mod) approach is commonly used in Phase II trials for testing and estimation of dose-response relationships. However, its effectiveness in small sample sizes, particularly with binary endpoints, is hindered by issues like complete separation in logistic regression, leading to non-existence of estimates. Motivated by an actual clinical trial using the MCP-Mod approach, this paper introduces penalized maximum likelihood estimation (MLE) and randomization-based inference techniques to address these challenges. Randomization-based inference allows for exact finite sample inference, while population-based inference for MCP-Mod typically relies on asymptotic approximations. Simulation studies demonstrate that randomization-based tests can enhance statistical power in small to medium-sized samples while maintaining control over type-I error rates, even in the presence of time trends. Our results show that residual-based randomization tests using penalized MLEs not only improve computational efficiency but also outperform standard randomization-based methods, making them an adequate choice for dose-finding analyses within the MCP-Mod framework. Additionally, we apply these methods to pharmacometric settings, demonstrating their effectiveness in such scenarios. The results in this paper underscore the potential of randomization-based inference for the analysis of dose-finding trials, particularly in small sample contexts."}, "https://arxiv.org/abs/2410.11743": {"title": "Addressing the Null Paradox in Epidemic Models: Correcting for Collider Bias in Causal Inference", "link": "https://arxiv.org/abs/2410.11743", "description": "arXiv:2410.11743v1 Announce Type: new \nAbstract: We address the null paradox in epidemic models, where standard methods estimate a non-zero treatment effect despite the true effect being zero. This occurs when epidemic models mis-specify how causal effects propagate over time, especially when covariates act as colliders between past interventions and latent variables, leading to spurious correlations. Standard approaches like maximum likelihood and Bayesian methods can misinterpret these biases, inferring false causal relationships. While semi-parametric models and inverse propensity weighting offer potential solutions, they often limit the ability of domain experts to incorporate epidemic-specific knowledge. To resolve this, we propose an alternative estimating equation that corrects for collider bias while allowing for statistical inference with frequentist guarantees, previously unavailable for complex models like SEIR."}, "https://arxiv.org/abs/2410.11797": {"title": "Unraveling Heterogeneous Treatment Effects in Networks: A Non-Parametric Approach Based on Node Connectivity", "link": "https://arxiv.org/abs/2410.11797", "description": "arXiv:2410.11797v1 Announce Type: new \nAbstract: In network settings, interference between units makes causal inference more challenging as outcomes may depend on the treatments received by others in the network. Typical estimands in network settings focus on treatment effects aggregated across individuals in the population. We propose a framework for estimating node-wise counterfactual means, allowing for more granular insights into the impact of network structure on treatment effect heterogeneity. 
We develop a doubly robust and non-parametric estimation procedure, KECENI (Kernel Estimation of Causal Effect under Network Interference), which offers consistency and asymptotic normality under network dependence. The utility of this method is demonstrated through an application to microfinance data, revealing the impact of network characteristics on treatment effects."}, "https://arxiv.org/abs/2410.11113": {"title": "Statistical Properties of Deep Neural Networks with Dependent Data", "link": "https://arxiv.org/abs/2410.11113", "description": "arXiv:2410.11113v1 Announce Type: cross \nAbstract: This paper establishes statistical properties of deep neural network (DNN) estimators under dependent data. Two general results for nonparametric sieve estimators directly applicable to DNN estimators are given. The first establishes rates for convergence in probability under nonstationary data. The second provides non-asymptotic probability bounds on $\\mathcal{L}^{2}$-errors under stationary $\\beta$-mixing data. I apply these results to DNN estimators in both regression and classification contexts imposing only a standard H\\\"older smoothness assumption. These results are then used to demonstrate how asymptotic inference can be conducted on the finite dimensional parameter of a partially linear regression model after first-stage DNN estimation of infinite dimensional parameters. The DNN architectures considered are common in applications, featuring fully connected feedforward networks with any continuous piecewise linear activation function, unbounded weights, and a width and depth that grow with sample size. The framework provided also offers potential for research into other DNN architectures and time-series applications."}, "https://arxiv.org/abs/2410.11238": {"title": "Impact of existence and nonexistence of pivot on the coverage of empirical best linear prediction intervals for small areas", "link": "https://arxiv.org/abs/2410.11238", "description": "arXiv:2410.11238v1 Announce Type: cross \nAbstract: We advance the theory of parametric bootstrap in constructing highly efficient empirical best (EB) prediction intervals of small area means. The coverage error of such a prediction interval is of the order $O(m^{-3/2})$, where $m$ is the number of small areas to be pooled using a linear mixed normal model. In the context of an area level model where the random effects follow a non-normal known distribution except possibly for unknown hyperparameters, we analytically show that the order of the coverage error of the empirical best linear (EBL) prediction interval remains the same even if we relax the normality of the random effects by assuming only the existence of a pivot for suitably standardized random effects when hyperparameters are known. Recognizing the challenge of showing the existence of a pivot, we develop a simple moment-based method to claim non-existence of a pivot. We show that the existing parametric bootstrap EBL prediction interval fails to achieve the desired order of the coverage error, i.e., $O(m^{-3/2})$, in the absence of a pivot. We obtain the surprising result that the order $O(m^{-1})$ term is always positive under certain conditions, indicating possible overcoverage of the existing parametric bootstrap EBL prediction interval. In general, we analytically show for the first time that the coverage problem can be corrected by adopting a suitably devised double parametric bootstrap.
Our Monte Carlo simulations show that our proposed single bootstrap method performs reasonably well when compared to rival methods."}, "https://arxiv.org/abs/2410.11429": {"title": "Filtering coupled Wright-Fisher diffusions", "link": "https://arxiv.org/abs/2410.11429", "description": "arXiv:2410.11429v1 Announce Type: cross \nAbstract: Coupled Wright-Fisher diffusions have been recently introduced to model the temporal evolution of finitely-many allele frequencies at several loci. These are vectors of multidimensional diffusions whose dynamics are weakly coupled among loci through interaction coefficients, which make the reproductive rates for each allele depend on its frequencies at several loci. Here we consider the problem of filtering a coupled Wright-Fisher diffusion with parent-independent mutation, when this is seen as an unobserved signal in a hidden Markov model. We assume individuals are sampled multinomially at discrete times from the underlying population, whose type configuration at the loci is described by the diffusion states, and adapt recently introduced duality methods to derive the filtering and smoothing distributions. These respectively provide the conditional distribution of the diffusion states given past data, and that conditional on the entire dataset, and are key to be able to perform parameter inference on models of this type. We show that for this model these distributions are countable mixtures of tilted products of Dirichlet kernels, and describe their mixing weights and how these can be updated sequentially. The evaluation of the weights involves the transition probabilities of the dual process, which are not available in closed form. We lay out pseudo codes for the implementation of the algorithms, discuss how to handle the unavailable quantities, and briefly illustrate the procedure with synthetic data."}, "https://arxiv.org/abs/2210.00200": {"title": "Semiparametric Efficient Fusion of Individual Data and Summary Statistics", "link": "https://arxiv.org/abs/2210.00200", "description": "arXiv:2210.00200v4 Announce Type: replace \nAbstract: Suppose we have individual data from an internal study and various summary statistics from relevant external studies. External summary statistics have the potential to improve statistical inference for the internal population; however, it may lead to efficiency loss or bias if not used properly. We study the fusion of individual data and summary statistics in a semiparametric framework to investigate the efficient use of external summary statistics. Under a weak transportability assumption, we establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution, which is no larger than that using only internal data and underpins the potential efficiency gain of integrating individual data and summary statistics. We propose a data-fused efficient estimator that achieves this efficiency bound. In addition, an adaptive fusion estimator is proposed to eliminate the bias of the original data-fused estimator when the transportability assumption fails. We establish the asymptotic oracle property of the adaptive fusion estimator. 
Simulations and an application to a Helicobacter pylori infection dataset demonstrate the promising numerical performance of the proposed method."}, "https://arxiv.org/abs/2308.11026": {"title": "Harnessing The Collective Wisdom: Fusion Learning Using Decision Sequences From Diverse Sources", "link": "https://arxiv.org/abs/2308.11026", "description": "arXiv:2308.11026v2 Announce Type: replace \nAbstract: Learning from the collective wisdom of crowds is related to the statistical notion of fusion learning from multiple data sources or studies. However, fusing inferences from diverse sources is challenging since cross-source heterogeneity and potential data-sharing complicate statistical inference. Moreover, studies may rely on disparate designs, employ myriad modeling techniques, and prevailing data privacy norms may forbid sharing even summary statistics across the studies for an overall analysis. We propose an Integrative Ranking and Thresholding (IRT) framework for fusion learning in multiple testing. IRT operates under the setting where from each study a triplet is available: the vector of binary accept-reject decisions on the tested hypotheses, its False Discovery Rate (FDR) level and the hypotheses tested by it. Under this setting, IRT constructs an aggregated and nonparametric measure of evidence against each null hypothesis, which facilitates ranking the hypotheses in the order of their likelihood of being rejected. We show that IRT guarantees overall FDR control if the studies control their respective FDR at the desired levels. IRT is extremely flexible, and a comprehensive numerical study demonstrates its practical relevance for pooling inferences. A real data illustration and extensions to alternative forms of Type I error control are discussed."}, "https://arxiv.org/abs/2309.10792": {"title": "Frequentist Inference for Semi-mechanistic Epidemic Models with Interventions", "link": "https://arxiv.org/abs/2309.10792", "description": "arXiv:2309.10792v2 Announce Type: replace \nAbstract: The effect of public health interventions on an epidemic is often estimated by adding the intervention to epidemic models. During the Covid-19 epidemic, numerous papers used such methods for making scenario predictions. The majority of these papers use Bayesian methods to estimate the parameters of the model. In this paper we show how to use frequentist methods for estimating these effects, which avoids having to specify prior distributions. We also use model-free shrinkage methods to improve estimation when there are many different geographic regions. This allows us to borrow strength from different regions while still getting confidence intervals with correct coverage and without having to specify a hierarchical model. Throughout, we focus on a semi-mechanistic model which provides a simple, tractable alternative to compartmental methods."}, "https://arxiv.org/abs/2304.01010": {"title": "COBRA: Comparison-Optimal Betting for Risk-limiting Audits", "link": "https://arxiv.org/abs/2304.01010", "description": "arXiv:2304.01010v2 Announce Type: replace-cross \nAbstract: Risk-limiting audits (RLAs) can provide routine, affirmative evidence that reported election outcomes are correct by checking a random sample of cast ballots. An efficient RLA requires checking relatively few ballots. Here we construct highly efficient RLAs by optimizing supermartingale tuning parameters--$\\textit{bets}$--for ballot-level comparison audits.
The exactly optimal bets depend on the true rate of errors in cast-vote records (CVRs)--digital receipts detailing how machines tabulated each ballot. We evaluate theoretical and simulated workloads for audits of contests with a range of diluted margins and CVR error rates. Compared to bets recommended in past work, using these optimal bets can dramatically reduce expected workloads--by 93% on average over our simulated audits. Because the exactly optimal bets are unknown in practice, we offer some strategies for approximating them. As with the ballot-polling RLAs described in ALPHA and RiLACs, adapting bets to previously sampled data or diversifying them over a range of suspected error rates can lead to substantially more efficient audits than fixing bets to $\\textit{a priori}$ values, especially when those values are far from correct. We sketch extensions to other designs and social choice functions, and conclude with some recommendations for real-world comparison audits."}, "https://arxiv.org/abs/2410.11892": {"title": "Copula based joint regression models for correlated data: an analysis in the bivariate case", "link": "https://arxiv.org/abs/2410.11892", "description": "arXiv:2410.11892v1 Announce Type: new \nAbstract: Regression analysis of non-normal correlated data is commonly performed using generalized linear mixed models (GLMM) and generalized estimating equations (GEE). The recent development of generalized joint regression models (GJRM) presents an alternative to these approaches by using copulas to flexibly model response variables and their dependence structures. This paper provides a simulation study that compares the GJRM with alternative methods. We focus on the case of the marginal distributions having the same form, for example, in models for longitudinal data.\n We find that for the normal model with identity link, all models provide accurate estimates of the parameters of interest. However, for non-normal models and when a non-identity link function is used, GLMMs in general provide biased estimates of marginal model parameters with inaccurately low standard errors. GLMM bias is more pronounced when the marginal distributions are more skewed or highly correlated. However, in the case that a GLMM parameter is estimated independently of the random effect term, we show it is possible to extract accurate parameter estimates, shown for a longitudinal time parameter with a logarithmic link model. In contrast, we find that GJRM and GEE provide unbiased estimates for all parameters with accurate standard errors when using a logarithmic link. In addition, we show that GJRM provides a model fit comparable to GLMM. In a real-world study of doctor visits, we further demonstrate that the GJRM provides better model fits than a comparable GEE or GLM, due to its greater flexibility in choice of marginal distribution and copula fit to dependence structures. We conclude that the GJRM provides a superior approach to current popular models for analysis of non-normal correlated data."}, "https://arxiv.org/abs/2410.11897": {"title": "A Structural Text-Based Scaling Model for Analyzing Political Discourse", "link": "https://arxiv.org/abs/2410.11897", "description": "arXiv:2410.11897v1 Announce Type: new \nAbstract: Scaling political actors based on their individual characteristics and behavior helps profiling and grouping them as well as understanding changes in the political landscape. 
In this paper we introduce the Structural Text-Based Scaling (STBS) model to infer ideological positions of speakers for latent topics from text data. We expand the usual Poisson factorization specification for topic modeling of text data and use flexible shrinkage priors to induce sparsity and enhance interpretability. We also incorporate speaker-specific covariates to assess their association with ideological positions. Applying STBS to U.S. Senate speeches from Congress session 114, we identify immigration and gun violence as the most polarizing topics between the two major parties in Congress. Additionally, we find that, in discussions about abortion, the gender of the speaker significantly influences their position, with female speakers focusing more on women's health. We also see that a speaker's region of origin influences their ideological position more than their religious affiliation."}, "https://arxiv.org/abs/2410.11902": {"title": "Stochastic Nonlinear Model Updating in Structural Dynamics Using a Novel Likelihood Function within the Bayesian-MCMC Framework", "link": "https://arxiv.org/abs/2410.11902", "description": "arXiv:2410.11902v1 Announce Type: new \nAbstract: The study presents a novel approach for stochastic nonlinear model updating in structural dynamics, employing a Bayesian framework integrated with Markov Chain Monte Carlo (MCMC) sampling for parameter estimation by using an approximated likelihood function. The proposed methodology is applied to both numerical and experimental cases. The paper commences by introducing Bayesian inference and its constituents: the likelihood function, prior distribution, and posterior distribution. The resonant decay method is employed to extract backbone curves, which capture the non-linear behaviour of the system. A mathematical model based on a single degree of freedom (SDOF) system is formulated, and backbone curves are obtained from time response data. Subsequently, MCMC sampling is employed to estimate the parameters using both numerical and experimental data. The obtained results demonstrate the convergence of the Markov chain, present parameter trace plots, and provide estimates of posterior distributions of updated parameters along with their uncertainties. Experimental validation is performed on a cantilever beam system equipped with permanent magnets and electromagnets. The proposed methodology demonstrates promising results in estimating parameters of stochastic non-linear dynamical systems. Through the use of the proposed likelihood functions using backbone curves, the probability distributions of both linear and non-linear parameters are simultaneously identified. Based on this view, the necessity to segregate stochastic linear and non-linear model updating is eliminated."}, "https://arxiv.org/abs/2410.11916": {"title": "\"The Simplest Idea One Can Have\" for Seamless Forecasts with Postprocessing", "link": "https://arxiv.org/abs/2410.11916", "description": "arXiv:2410.11916v1 Announce Type: new \nAbstract: Seamless forecasts are based on a combination of different sources to produce the best possible forecasts. Statistical multimodel postprocessing helps to combine various sources to achieve these seamless forecasts. However, when one of the combined sources of the forecast is not available due to reaching the end of its forecasting horizon, forecasts can be temporally inconsistent and sudden drops in skill can be observed. 
To obtain a seamless forecast, the output of multimodel postprocessing is often blended across these transitions, although this unnecessarily worsens the forecasts immediately before the transition. Additionally, large differences between the latest observation and the first forecasts can be present. This paper presents an idea to preserve a smooth temporal prediction until the end of the forecast range and increase its predictability. This optimal seamless forecast is simply accomplished by not excluding any model from the multimodel by using the latest possible lead time as model persistence into the future. Furthermore, the gap between the latest available observation and the first model step is seamlessly closed with the persistence of the observation by using the latest observation as additional predictor. With this idea, no visible jump in forecasts is observed and the verification presents a seamless quality in terms of scores. The benefit of accounting for observation and forecast persistence in multimodel postprocessing is illustrated using a simple temperature example with linear regression but can also be extended to other predictors and postprocessing methods."}, "https://arxiv.org/abs/2410.12093": {"title": "A Unified Framework for Causal Estimand Selection", "link": "https://arxiv.org/abs/2410.12093", "description": "arXiv:2410.12093v1 Announce Type: new \nAbstract: To determine the causal effect of a treatment using observational data, it is important to balance the covariate distributions between treated and control groups. However, achieving balance can be difficult when treated and control groups lack overlap. In the presence of limited overlap, researchers typically choose between two types of methods: 1) methods (e.g., inverse propensity score weighting) that imply traditional estimands (e.g., ATE) but whose estimators are at risk of variance inflation and considerable statistical bias; and 2) methods (e.g., overlap weighting) which imply a different estimand, thereby changing the target population to reduce variance. In this work, we introduce a framework for characterizing estimands by their target populations and the statistical performance of their estimators. We introduce a bias decomposition that encapsulates bias due to 1) the statistical bias of the estimator; and 2) estimand mismatch, i.e., deviation from the population of interest. We propose a design-based estimand selection procedure that helps navigate the tradeoff between these two sources of bias and variance of the resulting estimators. Our procedure allows the analyst to incorporate their domain-specific preference for preservation of the original population versus reduction of statistical bias. We demonstrate how to select an estimand based on these preferences by applying our framework to a right heart catheterization study."}, "https://arxiv.org/abs/2410.12098": {"title": "Testing Identifying Assumptions in Parametric Separable Models: A Conditional Moment Inequality Approach", "link": "https://arxiv.org/abs/2410.12098", "description": "arXiv:2410.12098v1 Announce Type: new \nAbstract: In this paper, we propose a simple method for testing identifying assumptions in parametric separable models, namely treatment exogeneity, instrument validity, and/or homoskedasticity. 
We show that the testable implications can be written in the intersection bounds framework, which is easy to implement using the inference method proposed in Chernozhukov, Lee, and Rosen (2013), and the Stata package of Chernozhukov et al. (2015). Monte Carlo simulations confirm that our test is consistent and controls size. We use our proposed method to test the validity of some commonly used instrumental variables, such as the average price in other markets in Nevo and Rosen (2012) and the Bartik instrument in Card (2009); the test rejects both instrumental variable models. When the identifying assumptions are rejected, we discuss solutions that allow researchers to identify some causal parameters of interest after relaxing functional form assumptions. We show that the IV model is nontestable if no functional form assumption is made on the outcome equation, when there exists a one-to-one mapping between the continuous treatment variable, the instrument, and the first-stage unobserved heterogeneity."}, "https://arxiv.org/abs/2410.12108": {"title": "A General Latent Embedding Approach for Modeling Non-uniform High-dimensional Sparse Hypergraphs with Multiplicity", "link": "https://arxiv.org/abs/2410.12108", "description": "arXiv:2410.12108v1 Announce Type: new \nAbstract: Recent research has shown growing interest in modeling hypergraphs, which capture polyadic interactions among entities beyond traditional dyadic relations. However, most existing methodologies for hypergraphs face significant limitations, including their heavy reliance on uniformity restrictions for hyperlink orders and their inability to account for repeated observations of identical hyperlinks. In this work, we introduce a novel and general latent embedding approach that addresses these challenges through the integration of latent embeddings, vertex degree heterogeneity parameters, and an order-adjusting parameter. Theoretically, we investigate the identifiability conditions for the latent embeddings and associated parameters, and we establish the convergence rates of their estimators along with asymptotic distributions. Computationally, we employ a projected gradient ascent algorithm for parameter estimation. Comprehensive simulation studies demonstrate the effectiveness of the algorithm and validate the theoretical findings. Moreover, an application to a co-citation hypergraph illustrates the advantages of the proposed method."}, "https://arxiv.org/abs/2410.12117": {"title": "Empirical Bayes estimation via data fission", "link": "https://arxiv.org/abs/2410.12117", "description": "arXiv:2410.12117v1 Announce Type: new \nAbstract: We demonstrate how data fission, a method for creating synthetic replicates from single observations, can be applied to empirical Bayes estimation. This extends recent work on empirical Bayes with multiple replicates to the classical single-replicate setting. The key insight is that after data fission, empirical Bayes estimation can be cast as a general regression problem."}, "https://arxiv.org/abs/2410.12151": {"title": "Root cause discovery via permutations and Cholesky decomposition", "link": "https://arxiv.org/abs/2410.12151", "description": "arXiv:2410.12151v1 Announce Type: new \nAbstract: This work is motivated by the following problem: Can we identify the disease-causing gene in a patient affected by a monogenic disorder? This problem is an instance of root cause discovery. 
In particular, we aim to identify the intervened variable in one interventional sample using a set of observational samples as reference. We consider a linear structural equation model where the causal ordering is unknown. We begin by examining a simple method that uses squared z-scores and characterize the conditions under which this method succeeds and fails, showing that it generally cannot identify the root cause. We then prove, without additional assumptions, that the root cause is identifiable even if the causal ordering is not. Two key ingredients of this identifiability result are the use of permutations and the Cholesky decomposition, which allow us to exploit an invariant property across different permutations to discover the root cause. Furthermore, we characterize permutations that yield the correct root cause and, based on this, propose a valid method for root cause discovery. We also adapt this approach to high-dimensional settings. Finally, we evaluate the performance of our methods through simulations and apply the high-dimensional method to discover disease-causing genes in the gene expression dataset that motivates this work."}, "https://arxiv.org/abs/2410.12300": {"title": "Sparse Causal Effect Estimation using Two-Sample Summary Statistics in the Presence of Unmeasured Confounding", "link": "https://arxiv.org/abs/2410.12300", "description": "arXiv:2410.12300v1 Announce Type: new \nAbstract: Observational genome-wide association studies are now widely used for causal inference in genetic epidemiology. To maintain privacy, such data is often only publicly available as summary statistics, and often studies for the endogenous covariates and the outcome are available separately. This has necessitated methods tailored to two-sample summary statistics. Current state-of-the-art methods modify linear instrumental variable (IV) regression -- with genetic variants as instruments -- to account for unmeasured confounding. However, since the endogenous covariates can be high dimensional, standard IV assumptions are generally insufficient to identify all causal effects simultaneously. We ensure identifiability by assuming the causal effects are sparse and propose a sparse causal effect two-sample IV estimator, spaceTSIV, adapting the spaceIV estimator by Pfister and Peters (2022) for two-sample summary statistics. We provide two methods, based on L0- and L1-penalization, respectively. We prove identifiability of the sparse causal effects in the two-sample setting and consistency of spaceTSIV. The performance of spaceTSIV is compared with existing two-sample IV methods in simulations. Finally, we showcase our methods using real proteomic and gene-expression data for drug-target discovery."}, "https://arxiv.org/abs/2410.12333": {"title": "Quantifying Treatment Effects: Estimating Risk Ratios in Causal Inference", "link": "https://arxiv.org/abs/2410.12333", "description": "arXiv:2410.12333v1 Announce Type: new \nAbstract: Randomized Controlled Trials (RCT) are the current gold standards to empirically measure the effect of a new drug. However, they may be of limited size and resorting to complementary non-randomized data, referred to as observational, is promising, as additional sources of evidence. In both RCT and observational data, the Risk Difference (RD) is often used to characterize the effect of a drug. Additionally, medical guidelines recommend to also report the Risk Ratio (RR), which may provide a different comprehension of the effect of the same drug. 
While different methods have been proposed and studied to estimate the RD, few methods exist to estimate the RR. In this paper, we propose estimators of the RR both in RCT and observational data and provide both asymptotic and finite-sample analyses. We show that, even in an RCT, estimating the treatment allocation probability or adjusting for covariates leads to lower asymptotic variance. In observational studies, we propose weighting and outcome modeling estimators and derive their asymptotic bias and variance for well-specified models. Using semi-parametric theory, we define two doubly robust estimators with minimal variances among unbiased estimators. We support our theoretical analysis with empirical evaluations and illustrate our findings through experiments."}, "https://arxiv.org/abs/2410.12374": {"title": "Forecasting Densities of Fatalities from State-based Conflicts using Observed Markov Models", "link": "https://arxiv.org/abs/2410.12374", "description": "arXiv:2410.12374v1 Announce Type: new \nAbstract: In this contribution to the VIEWS 2023 prediction challenge, we propose using an observed Markov model for making predictions of densities of fatalities from armed conflicts. The observed Markov model can be conceptualized as a two-stage model. The first stage involves a standard Markov model, where the latent states are pre-defined based on domain knowledge about conflict states. The second stage is a set of regression models conditional on the latent Markov-states which predict the number of fatalities. In the VIEWS 2023/24 prediction competition, we use a random forest classifier for modeling the transitions between the latent Markov states and a quantile regression forest to model the fatalities conditional on the latent states. For the predictions, we dynamically simulate latent state paths and randomly draw fatalities for each country-month from the conditional distribution of fatalities given the latent states. Interim evaluation of out-of-sample performance indicates that the observed Markov model produces well-calibrated forecasts which outperform the benchmark models and are among the top performing models across the evaluation metrics."}, "https://arxiv.org/abs/2410.12386": {"title": "Bias correction of quadratic spectral estimators", "link": "https://arxiv.org/abs/2410.12386", "description": "arXiv:2410.12386v1 Announce Type: new \nAbstract: The three cardinal, statistically consistent, families of non-parametric estimators of the power spectral density of a time series are lag-window, multitaper and Welch estimators. However, when estimating power spectral densities from a finite sample, each can be subject to non-ignorable bias. Astfalck et al. (2024) developed a method that offers significant bias reduction for finite samples for Welch's estimator, which this article extends to the larger family of quadratic estimators, thus offering similar theory for bias correction of lag-window and multitaper estimators as well as combinations thereof. Importantly, this theory may be used in conjunction with any and all tapers and lag-sequences designed for bias reduction, and so should be seen as an extension to valuable work in these fields, rather than a supplanting methodology. The order of computation is larger than O(n log n) typical in spectral analyses, but not insurmountable in practice. 
Simulation studies support the theory with comparisons across variations of quadratic estimators."}, "https://arxiv.org/abs/2410.12454": {"title": "Conditional Outcome Equivalence: A Quantile Alternative to CATE", "link": "https://arxiv.org/abs/2410.12454", "description": "arXiv:2410.12454v1 Announce Type: new \nAbstract: Conditional quantile treatment effect (CQTE) can provide insight into the effect of a treatment beyond the conditional average treatment effect (CATE). This ability to provide information over multiple quantiles of the response makes CQTE especially valuable in cases where the effect of a treatment is not well-modelled by a location shift, even conditionally on the covariates. Nevertheless, the estimation of CQTE is challenging and often depends upon the smoothness of the individual quantiles as a function of the covariates rather than smoothness of the CQTE itself. This is in stark contrast to CATE where it is possible to obtain high-quality estimates which have less dependency upon the smoothness of the nuisance parameters when the CATE itself is smooth. Moreover, relative smoothness of the CQTE lacks the interpretability of smoothness of the CATE, making it less clear whether it is a reasonable assumption to make. We combine the desirable properties of CATE and CQTE by considering a new estimand, the conditional quantile comparator (CQC). The CQC not only retains information about the whole treatment distribution, similar to CQTE, but also has more natural examples of smoothness and is able to leverage simplicity in an auxiliary estimand. We provide finite sample bounds on the error of our estimator, demonstrating its ability to exploit simplicity. We validate our theory in numerical simulations which show that our method produces more accurate estimates than baselines. Finally, we apply our methodology to a study on the effect of employment incentives on earnings across different age groups. We see that our method is able to reveal heterogeneity of the effect across different quantiles."}, "https://arxiv.org/abs/2410.12590": {"title": "Flexible and Efficient Estimation of Causal Effects with Error-Prone Exposures: A Control Variates Approach for Measurement Error", "link": "https://arxiv.org/abs/2410.12590", "description": "arXiv:2410.12590v1 Announce Type: new \nAbstract: Exposure measurement error is a ubiquitous but often overlooked challenge in causal inference with observational data. Existing methods accounting for exposure measurement error largely rely on restrictive parametric assumptions, while emerging data-adaptive estimation approaches allow for less restrictive assumptions but at the cost of flexibility, as they are typically tailored towards rigidly-defined statistical quantities. There remains a critical need for assumption-lean estimation methods that are both flexible and possess desirable theoretical properties across a variety of study designs. In this paper, we introduce a general framework for estimation of causal quantities in the presence of exposure measurement error, adapted from the control variates approach of Yang and Ding (2019). Our method can be implemented in various two-phase sampling study designs, where one obtains gold-standard exposure measurements for a small subset of the full study sample, called the validation data. 
The control variates framework leverages both the error-prone and error-free exposure measurements by augmenting an initial consistent estimator from the validation data with a variance reduction term formed from the full data. We show that our method inherits double-robustness properties under standard causal assumptions. Simulation studies show that our approach performs favorably compared to leading methods under various two-phase sampling schemes. We illustrate our method with observational electronic health record data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic."}, "https://arxiv.org/abs/2410.12709": {"title": "A Simple Interactive Fixed Effects Estimator for Short Panels", "link": "https://arxiv.org/abs/2410.12709", "description": "arXiv:2410.12709v1 Announce Type: new \nAbstract: We study the interactive effects (IE) model as an extension of the conventional additive effects (AE) model. For the AE model, the fixed effects estimator can be obtained by applying least squares to a regression that adds a linear projection of the fixed effect on the explanatory variables (Mundlak, 1978; Chamberlain, 1984). In this paper, we develop a novel estimator -- the projection-based IE (PIE) estimator -- for the IE model that is based on a similar approach. We show that, for the IE model, fixed effects estimators that have appeared in the literature are not equivalent to our PIE estimator, though both can be expressed as a generalized within estimator. Unlike the fixed effects estimators for the IE model, the PIE estimator is consistent for a fixed number of time periods with no restrictions on serial correlation or conditional heteroskedasticity in the errors. We also derive a statistic for testing the consistency of the two-way fixed effects estimator in the possible presence of interactive effects. Moreover, although the PIE estimator is the solution to a high-dimensional nonlinear least squares problem, we show that it can be computed by iterating between two steps, both of which have simple analytical solutions. The computational simplicity is an important advantage relative to other strategies that have been proposed for estimating the IE model for short panels. Finally, we compare the finite sample performance of IE estimators through simulations."}, "https://arxiv.org/abs/2410.12731": {"title": "Counterfactual Analysis in Empirical Games", "link": "https://arxiv.org/abs/2410.12731", "description": "arXiv:2410.12731v1 Announce Type: new \nAbstract: We address counterfactual analysis in empirical models of games with partially identified parameters, and multiple equilibria and/or randomized strategies, by constructing and analyzing the counterfactual predictive distribution set (CPDS). This framework accommodates various outcomes of interest, including behavioral and welfare outcomes. It allows a variety of changes to the environment to generate the counterfactual, including modifications of the utility functions, the distribution of utility determinants, the number of decision makers, and the solution concept. We use a Bayesian approach to summarize statistical uncertainty. We establish conditions under which the population CPDS is sharp from the point of view of identification. We also establish conditions under which the posterior CPDS is consistent if the posterior distribution for the underlying model parameter is consistent. 
Consequently, our results can be employed to conduct counterfactual analysis after a preliminary step of identifying and estimating the underlying model parameter based on the existing literature. Our consistency results involve the development of a new general theory for Bayesian consistency of posterior distributions for mappings of sets. Although we primarily focus on a model of a strategic game, our approach is applicable to other structural models with similar features."}, "https://arxiv.org/abs/2410.12740": {"title": "Just Ramp-up: Unleash the Potential of Regression-based Estimator for A/B Tests under Network Interference", "link": "https://arxiv.org/abs/2410.12740", "description": "arXiv:2410.12740v1 Announce Type: new \nAbstract: Recent research in causal inference under network interference has explored various experimental designs and estimation techniques to address this issue. However, existing methods, which typically rely on single experiments, often reach a performance bottleneck and face limitations in handling diverse interference structures. In contrast, we propose leveraging multiple experiments to overcome these limitations. In industry, the use of sequential experiments, often known as the ramp-up process, where traffic to the treatment gradually increases, is common due to operational needs like risk management and cost control. Our approach shifts the focus from operational aspects to the statistical advantages of merging data from multiple experiments. By combining data from sequentially conducted experiments, we aim to estimate the global average treatment effect (GATE) more effectively. In this paper, we begin by analyzing the bias and variance of the linear regression estimator for GATE under general linear network interference. We demonstrate that bias plays a dominant role in the bias-variance tradeoff and highlight the intrinsic bias reduction achieved by merging data from experiments with strictly different treatment proportions. Here, the improvement introduced by merging two steps of experimental data is essential. In addition, we show that merging more steps of experimental data is unnecessary under general linear interference, while it can become beneficial when nonlinear interference occurs. Furthermore, we look into a more advanced estimator based on graph neural networks. Through extensive simulation studies, we show that the regression-based estimator benefits remarkably from training on merged experiment data, achieving outstanding statistical performance."}, "https://arxiv.org/abs/2410.12047": {"title": "Testing Causal Explanations: A Case Study for Understanding the Effect of Interventions on Chronic Kidney Disease", "link": "https://arxiv.org/abs/2410.12047", "description": "arXiv:2410.12047v1 Announce Type: cross \nAbstract: Randomized controlled trials (RCTs) are the standard for evaluating the effectiveness of clinical interventions. To address the limitations of RCTs on real-world populations, we developed a methodology that uses a large observational electronic health record (EHR) dataset. Principles of regression discontinuity (RD) were used to derive randomized data subsets to test expert-driven interventions using dynamic Bayesian Network (DBN) do-operations. 
This combined method was applied to a chronic kidney disease (CKD) cohort of more than two million individuals and used to understand the associational and causal relationships of CKD variables with respect to a surrogate outcome of >=40% decline in estimated glomerular filtration rate (eGFR). The associational and causal analyses depicted similar findings across DBNs from two independent healthcare systems. The associational analysis showed that the most influential variables were eGFR, urine albumin-to-creatinine ratio, and pulse pressure, whereas the causal analysis showed eGFR as the most influential variable, followed by modifiable factors such as medications that may impact kidney function over time. This methodology demonstrates how real-world EHR data can be used to provide population-level insights to inform improved healthcare delivery."}, "https://arxiv.org/abs/2410.12201": {"title": "SAT: Data-light Uncertainty Set Merging via Synthetics, Aggregation, and Test Inversion", "link": "https://arxiv.org/abs/2410.12201", "description": "arXiv:2410.12201v1 Announce Type: cross \nAbstract: The integration of uncertainty sets has diverse applications but also presents challenges, particularly when only initial sets and their control levels are available, along with potential dependencies. Examples include merging confidence sets from different distributed sites with communication constraints, as well as combining conformal prediction sets generated by different learning algorithms or data splits. In this article, we introduce an efficient and flexible Synthetic, Aggregation, and Test inversion (SAT) approach to merge various potentially dependent uncertainty sets into a single set. The proposed method constructs a novel class of synthetic test statistics, aggregates them, and then derives merged sets through test inversion. Our approach leverages the duality between set estimation and hypothesis testing, ensuring reliable coverage in dependent scenarios. The procedure is data-light, meaning it relies solely on initial sets and control levels without requiring raw data, and it adapts to any user-specified initial uncertainty sets, accommodating potentially varying coverage levels. Theoretical analyses and numerical experiments confirm that SAT provides finite-sample coverage guarantees and achieves small set sizes."}, "https://arxiv.org/abs/2410.12209": {"title": "Global Censored Quantile Random Forest", "link": "https://arxiv.org/abs/2410.12209", "description": "arXiv:2410.12209v1 Announce Type: cross \nAbstract: In recent years, censored quantile regression has enjoyed an increasing popularity for survival analysis while many existing works rely on linearity assumptions. In this work, we propose a Global Censored Quantile Random Forest (GCQRF) for predicting a conditional quantile process on data subject to right censoring, a forest-based flexible, competitive method able to capture complex nonlinear relationships. Taking into account the randomness in trees and connecting the proposed method to a randomized incomplete infinite degree U-process (IDUP), we quantify the prediction process' variation without assuming an infinite forest and establish its weak convergence. Moreover, feature importance ranking measures based on out-of-sample predictive accuracy are proposed. 
We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives and illustrate the use of the proposed importance ranking measures on both simulated and real data."}, "https://arxiv.org/abs/2410.12224": {"title": "Causally-Aware Unsupervised Feature Selection Learning", "link": "https://arxiv.org/abs/2410.12224", "description": "arXiv:2410.12224v1 Announce Type: cross \nAbstract: Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization."}, "https://arxiv.org/abs/2410.12367": {"title": "Adaptive and Stratified Subsampling Techniques for High Dimensional Non-Standard Data Environments", "link": "https://arxiv.org/abs/2410.12367", "description": "arXiv:2410.12367v1 Announce Type: cross \nAbstract: This paper addresses the challenge of estimating high-dimensional parameters in non-standard data environments, where traditional methods often falter due to issues such as heavy-tailed distributions, data contamination, and dependent observations. We propose robust subsampling techniques, specifically Adaptive Importance Sampling (AIS) and Stratified Subsampling, designed to enhance the reliability and efficiency of parameter estimation. Under some clearly outlined conditions, we establish consistency and asymptotic normality for the proposed estimators, providing non-asymptotic error bounds that quantify their performance. Our theoretical foundations are complemented by controlled experiments demonstrating the superiority of our methods over conventional approaches. 
By bridging the gap between theory and practice, this work offers significant contributions to robust statistical estimation, paving the way for advancements in various applied domains."}, "https://arxiv.org/abs/2410.12690": {"title": "Local transfer learning Gaussian process modeling, with applications to surrogate modeling of expensive computer simulators", "link": "https://arxiv.org/abs/2410.12690", "description": "arXiv:2410.12690v1 Announce Type: cross \nAbstract: A critical bottleneck for scientific progress is the costly nature of computer simulations for complex systems. Surrogate models provide an appealing solution: such models are trained on simulator evaluations, then used to emulate and quantify uncertainty on the expensive simulator at unexplored inputs. In many applications, one often has available data on related systems. For example, in designing a new jet turbine, there may be existing studies on turbines with similar configurations. A key question is how information from such \"source\" systems can be transferred for effective surrogate training on the \"target\" system of interest. We thus propose a new LOcal transfer Learning Gaussian Process (LOL-GP) model, which leverages a carefully-designed Gaussian process to transfer such information for surrogate modeling. The key novelty of the LOL-GP is a latent regularization model, which identifies regions where transfer should be performed and regions where it should be avoided. This \"local transfer\" property is desirable in scientific systems: at certain parameters, such systems may behave similarly and thus transfer is beneficial; at other parameters, they may behave differently and thus transfer is detrimental. By accounting for local transfer, the LOL-GP can rectify a critical limitation of \"negative transfer\" in existing transfer learning models, where the transfer of information worsens predictive performance. We derive a Gibbs sampling algorithm for efficient posterior predictive sampling on the LOL-GP, for both the multi-source and multi-fidelity transfer settings. We then show, via a suite of numerical experiments and an application for jet turbine design, the improved surrogate performance of the LOL-GP over existing methods."}, "https://arxiv.org/abs/2306.06932": {"title": "Revisiting Whittaker-Henderson Smoothing", "link": "https://arxiv.org/abs/2306.06932", "description": "arXiv:2306.06932v3 Announce Type: replace \nAbstract: Introduced nearly a century ago, Whittaker-Henderson smoothing is still widely used by actuaries for constructing one-dimensional and two-dimensional experience tables for mortality, disability and other Life Insurance risks. Our paper reframes this smoothing technique within a modern statistical framework and addresses six questions of practical relevance regarding its use. Firstly, we adopt a Bayesian view of this smoothing method to build credible intervals. Next, we shed light on the choice of the observation and weight vectors to which the smoothing should be applied by linking it to a maximum likelihood estimator introduced in the context of duration models. We then enhance the precision of the smoothing by relaxing an implicit asymptotic normal approximation on which it relies. Afterward, we select the smoothing parameters based on maximizing a marginal likelihood function. We later improve numerical performance in the presence of many observation points and, consequently, many parameters. 
Finally, we extrapolate the results of the smoothing while preserving, through the use of constraints, consistency between estimated and predicted values."}, "https://arxiv.org/abs/2401.14910": {"title": "Modeling Extreme Events: Univariate and Multivariate Data-Driven Approaches", "link": "https://arxiv.org/abs/2401.14910", "description": "arXiv:2401.14910v2 Announce Type: replace \nAbstract: This article summarizes the contribution of team genEVA to the EVA (2023) Conference Data Challenge. The challenge comprises four individual tasks, with two focused on univariate extremes and two related to multivariate extremes. In the first univariate assignment, we estimate a conditional extremal quantile using a quantile regression approach with neural networks. For the second, we develop a fine-tuning procedure for improved extremal quantile estimation with a given conservative loss function. In the first multivariate sub-challenge, we approximate the data-generating process with a copula model. In the remaining task, we use clustering to separate a high-dimensional problem into approximately independent components. Overall, competitive results were achieved for all challenges, and our approaches for the univariate tasks yielded the most accurate quantile estimates in the competition."}, "https://arxiv.org/abs/1207.0258": {"title": "Control of Probability Flow in Markov Chain Monte Carlo -- Nonreversibility and Lifting", "link": "https://arxiv.org/abs/1207.0258", "description": "arXiv:1207.0258v2 Announce Type: replace-cross \nAbstract: The Markov Chain Monte Carlo (MCMC) method is widely used in various fields as a powerful numerical integration technique for systems with many degrees of freedom. In MCMC methods, probabilistic state transitions can be considered as a random walk in state space, and random walks allow for sampling from complex distributions. However, paradoxically, it is necessary to carefully suppress the randomness of the random walk to improve computational efficiency. By breaking detailed balance, we can create a probability flow in the state space and perform more efficient sampling along this flow. Motivated by this idea, practical and efficient nonreversible MCMC methods have been developed over the past ten years. In particular, the lifting technique, which introduces probability flows in an extended state space, has been applied to various systems and has proven more efficient than conventional reversible updates. We review and discuss several practical approaches to implementing nonreversible MCMC methods, including the shift method in the cumulative distribution and the directed-worm algorithm."}, "https://arxiv.org/abs/1905.10806": {"title": "Score-Driven Exponential Random Graphs: A New Class of Time-Varying Parameter Models for Dynamical Networks", "link": "https://arxiv.org/abs/1905.10806", "description": "arXiv:1905.10806v3 Announce Type: replace-cross \nAbstract: Motivated by the increasing abundance of data describing real-world networks that exhibit dynamical features, we propose an extension of the Exponential Random Graph Models (ERGMs) that accommodates the time variation of its parameters. Inspired by the fast-growing literature on Dynamic Conditional Score models, each parameter evolves according to an updating rule driven by the score of the ERGM distribution. We demonstrate the flexibility of score-driven ERGMs (SD-ERGMs) as data-generating processes and filters and show the advantages of the dynamic version over the static one. 
We discuss two applications to temporal networks from financial and political systems. First, we consider the prediction of future links in the Italian interbank credit network. Second, we show that the SD-ERGM allows discriminating between static and time-varying parameters when used to model the U.S. Congress co-voting network dynamics."}, "https://arxiv.org/abs/2401.13943": {"title": "Is the age pension in Australia sustainable and fair? Evidence from forecasting the old-age dependency ratio using the Hamilton-Perry model", "link": "https://arxiv.org/abs/2401.13943", "description": "arXiv:2401.13943v2 Announce Type: replace-cross \nAbstract: The age pension aims to assist eligible elderly Australians who meet specific age and residency criteria in maintaining basic living standards. In designing efficient pension systems, government policymakers seek to satisfy the expectations of the overall aging population in Australia. However, the population's unique demographic characteristics at the state and territory level are often overlooked due to the lack of available data. We use the Hamilton-Perry model, which requires minimal input, to model and forecast the evolution of age-specific populations at the state level. We also integrate the obtained sub-national demographic information to determine sustainable pension ages up to 2051. We also investigate pension welfare distribution in all states and territories to identify disadvantaged residents under the current pension system. Using the sub-national mortality data for Australia from 1971 to 2021 obtained from AHMD (2023), we implement the Hamilton-Perry model with the help of functional time series forecasting techniques. With forecasts of age-specific population sizes for each state and territory, we compute the old-age dependency ratio to determine the nationwide sustainable pension age."}, "https://arxiv.org/abs/2410.12930": {"title": "Probabilistic inference when the population space is open", "link": "https://arxiv.org/abs/2410.12930", "description": "arXiv:2410.12930v1 Announce Type: new \nAbstract: In using observed data to make inferences about a population quantity, it is commonly assumed that the sampling distribution from which the data were drawn belongs to a given parametric family of distributions, or at least, a given finite set of such families, i.e. the population space is assumed to be closed. Here, we address the problem of how to determine an appropriate post-data distribution for a given population quantity when such an assumption about the underlying sampling distribution is not made, i.e. when the population space is open. The strategy used to address this problem is based on the fact that even though, due to an open population space being non-measurable, we are not able to place a post-data distribution over all the sampling distributions contained in such a population space, it is possible to partition this type of space into a finite, countable or uncountable number of subsets such that a distribution can be placed over a variable that simply indicates which of these subsets contains the true sampling distribution. Moreover, it is argued that, by using sampling distributions that belong to a number of parametric families, it is possible to adequately and elegantly represent the sampling distributions that belong to each of the subsets of such a partition. 
Since a statistical model is conceived as being a model of a population space rather than a model of a sampling distribution, it is also argued that neither the type of models that are put forward nor the expression of pre-data knowledge via such models can be directly brought into question by the data. Finally, the case is made that, as well as not being required in the modelling process that is proposed, the standard practice of using P values to measure the absolute compatibility of an individual or family of sampling distributions with observed data is neither meaningful nor useful."}, "https://arxiv.org/abs/2410.13050": {"title": "Improved control of Dirichlet location and scale near the boundary", "link": "https://arxiv.org/abs/2410.13050", "description": "arXiv:2410.13050v1 Announce Type: new \nAbstract: Dirichlet distributions are commonly used for modeling vectors in a probability simplex. When used as a prior or a proposal distribution, it is natural to set the mean of a Dirichlet to be equal to the location where one wants the distribution to be centered. However, if the mean is near the boundary of the probability simplex, then a Dirichlet distribution becomes highly concentrated either (i) at the mean or (ii) extremely close to the boundary. Consequently, centering at the mean provides poor control over the location and scale near the boundary. In this article, we introduce a method for improved control over the location and scale of Beta and Dirichlet distributions. Specifically, given a target location point and a desired scale, we maximize the density at the target location point while constraining a specified measure of scale. We consider various choices of scale constraint, such as fixing the concentration parameter, the mean cosine error, or the variance in the Beta case. In several examples, we show that this maximum density method provides superior performance for constructing priors, defining Metropolis-Hastings proposals, and generating simulated probability vectors."}, "https://arxiv.org/abs/2410.13113": {"title": "Analyzing longitudinal electronic health records data with clinically informative visiting process: possible choices and comparisons", "link": "https://arxiv.org/abs/2410.13113", "description": "arXiv:2410.13113v1 Announce Type: new \nAbstract: Analyzing longitudinal electronic health records (EHR) data is challenging due to clinically informative observation processes, where the timing and frequency of patient visits often depend on their underlying health condition. Traditional longitudinal data analysis rarely considers observation times as stochastic or models them as functions of covariates. In this work, we evaluate the impact of informative visiting processes on two common statistical tasks: (1) estimating the effect of an exposure on a longitudinal biomarker, and (2) assessing the effect of longitudinal biomarkers on a time-to-event outcome, such as disease diagnosis. The methods we consider include simple summaries of the observed longitudinal data, imputation, inverse weighting by the estimated intensity of the visit process, and joint modeling. To our knowledge, this is the most comprehensive evaluation of inferential methods specifically tailored for EHR-like visiting processes. For the first task, we speed up certain methods by a factor of around 18, making them more suitable for large-scale EHR data analysis. 
For the second task, where no method currently accounts for the visiting process in joint models of longitudinal and time-to-event data, we propose incorporating the historical number of visits to adjust for informative visiting. Using data from the longitudinal biobank at the University of Michigan Health System, we investigate two case studies: 1) the association between genetic variants and lab markers with repeated measures (known as the LabWAS); and 2) the association between cardiometabolic health markers and time to hypertension diagnosis. We show how accounting for informative visiting processes affects the analysis results. We develop an R package CIMPLE (Clinically Informative Missingness handled through Probabilities, Likelihood, and Estimating equations) that integrates all these methods."}, "https://arxiv.org/abs/2410.13115": {"title": "Online conformal inference for multi-step time series forecasting", "link": "https://arxiv.org/abs/2410.13115", "description": "arXiv:2410.13115v1 Announce Type: new \nAbstract: We consider the problem of constructing distribution-free prediction intervals for multi-step time series forecasting, with a focus on the temporal dependencies inherent in multi-step forecast errors. We establish that the optimal $h$-step-ahead forecast errors exhibit serial correlation up to lag $(h-1)$ under a general non-stationary autoregressive data generating process. To leverage these properties, we propose the Autocorrelated Multi-step Conformal Prediction (AcMCP) method, which effectively incorporates autocorrelations in multi-step forecast errors, resulting in more statistically efficient prediction intervals. This method ensures theoretical long-run coverage guarantees for multi-step prediction intervals, though we note that increased forecasting horizons may exacerbate deviations from the target coverage, particularly in the context of limited sample sizes. Additionally, we extend several easy-to-implement conformal prediction methods, originally designed for single-step forecasting, to accommodate multi-step scenarios. Through empirical evaluations, including simulations and applications to data, we demonstrate that AcMCP achieves coverage that closely aligns with the target within local windows, while providing adaptive prediction intervals that effectively respond to varying conditions."}, "https://arxiv.org/abs/2410.13142": {"title": "Agnostic Characterization of Interference in Randomized Experiments", "link": "https://arxiv.org/abs/2410.13142", "description": "arXiv:2410.13142v1 Announce Type: new \nAbstract: We give an approach for characterizing interference by lower bounding the number of units whose outcome depends on certain groups of treated individuals, such as depending on the treatment of others, or others who are at least a certain distance away. The approach is applicable to randomized experiments with binary-valued outcomes. Asymptotically conservative point estimates and one-sided confidence intervals may be constructed with no assumptions beyond the known randomization design, allowing the approach to be used when interference is poorly understood, or when an observed network might only be a crude proxy for the underlying social mechanisms. Point estimates are equal to Hajek-weighted comparisons of units with differing levels of treatment exposure. 
Empirically, we find that the sizes of our interval estimates are competitive with (and often smaller than) those of the EATE, an assumption-lean treatment effect, suggesting that the proposed estimands may be intrinsically easier to estimate than treatment effects."}, "https://arxiv.org/abs/2410.13164": {"title": "Markov Random Fields with Proximity Constraints for Spatial Data", "link": "https://arxiv.org/abs/2410.13164", "description": "arXiv:2410.13164v1 Announce Type: new \nAbstract: The conditional autoregressive (CAR) model, simultaneous autoregressive (SAR) model, and their variants have become the predominant strategies for modeling regional or areal-referenced spatial data. The overwhelmingly wide use of the CAR/SAR model motivates the need for new classes of models for areal-referenced data. Thus, we develop a novel class of Markov random fields based on truncating the full-conditional distribution. We define this truncation in two ways, leading to versions of what we call the truncated autoregressive (TAR) model. First, we truncate the full conditional distribution so that a response at one location is close to the average of its neighbors. This strategy establishes relationships between the TAR and CAR models. Second, we truncate on the joint distribution of the data process in a similar way. This specification leads to a connection between the TAR and SAR models. Our Bayesian implementation does not use Markov chain Monte Carlo (MCMC) for Bayesian computation but instead generates samples directly from the posterior distribution. Moreover, TAR does not have a range parameter, which arises in the CAR/SAR models and can be difficult to learn. We present the results of the proposed truncated autoregressive model on several simulated datasets and on a dataset of average property prices."}, "https://arxiv.org/abs/2410.13170": {"title": "Adaptive LAD-Based Bootstrap Unit Root Tests under Unconditional Heteroskedasticity", "link": "https://arxiv.org/abs/2410.13170", "description": "arXiv:2410.13170v1 Announce Type: new \nAbstract: This paper explores testing unit roots based on least absolute deviations (LAD) regression under unconditional heteroskedasticity. We first derive the asymptotic properties of the LAD estimator for a first-order autoregressive process with the coefficient (local to) unity under unconditional heteroskedasticity and weak dependence, revealing that the limiting distribution of the LAD estimator (and consequently of the derived test statistics) is closely associated with unknown time-varying variances. To conduct feasible LAD-based unit root tests under heteroskedasticity and serial dependence, we develop an adaptive block bootstrap procedure, which accommodates time-varying volatility and serial dependence, both of unknown forms, to compute critical values for LAD-based tests. Its asymptotic validity is established. We then extend the testing procedure to allow for deterministic components. Simulation results indicate that, in the presence of unconditional heteroskedasticity and serial dependence, the classic LAD-based tests demonstrate severe size distortion, whereas the proposed LAD-based bootstrap tests exhibit good size-control capability. Additionally, the newly developed tests show superior testing power in heavy-tailed cases compared to the considered benchmarks. 
Finally, an empirical analysis of real effective exchange rates of 16 EU countries is conducted to illustrate the applicability of the newly proposed tests."}, "https://arxiv.org/abs/2410.13261": {"title": "Novel Bayesian algorithms for ARFIMA long-memory processes: a comparison between MCMC and ABC approaches", "link": "https://arxiv.org/abs/2410.13261", "description": "arXiv:2410.13261v1 Announce Type: new \nAbstract: This paper presents a comparative study of two Bayesian approaches - Markov Chain Monte Carlo (MCMC) and Approximate Bayesian Computation (ABC) - for estimating the parameters of autoregressive fractionally-integrated moving average (ARFIMA) models, which are widely used to capture long memory in time series data. We propose a novel MCMC algorithm that filters the time series into distinct long-memory and ARMA components, and benchmark it against standard approaches. Additionally, a new ABC method is proposed, using three different summary statistics for posterior estimation. The methods are implemented and evaluated through an extensive simulation study, as well as applied to a real-world financial dataset, specifically the quarterly U.S. Gross National Product (GNP) series. The results demonstrate the effectiveness of the Bayesian methods in estimating long-memory and short-memory parameters, with the filtered MCMC showing superior performance in various metrics. This study enhances our understanding of Bayesian techniques in ARFIMA modeling, providing insights into their advantages and limitations when applied to complex time series data."}, "https://arxiv.org/abs/2410.13420": {"title": "Spatial Proportional Hazards Model with Differential Regularization", "link": "https://arxiv.org/abs/2410.13420", "description": "arXiv:2410.13420v1 Announce Type: new \nAbstract: This paper introduces a novel Spatial Proportional Hazards model that incorporates spatial dependence through differential regularization. We address limitations of existing methods that overlook domain geometry by proposing an approach based on the Generalized Spatial Regression with PDE Penalization. Our method handles complex-shaped domains, enabling accurate modeling of spatial fields in survival data. Using a penalized log-likelihood functional, we estimate both covariate effects and the spatial field. The methodology is implemented via finite element methods, efficiently handling irregular domain geometries. We demonstrate its efficacy through simulations and apply it to real-world data."}, "https://arxiv.org/abs/2410.13522": {"title": "Fair comparisons of causal parameters with many treatments and positivity violations", "link": "https://arxiv.org/abs/2410.13522", "description": "arXiv:2410.13522v1 Announce Type: new \nAbstract: Comparing outcomes across treatments is essential in medicine and public policy. To do so, researchers typically estimate a set of parameters, possibly counterfactual, with each targeting a different treatment. Treatment-specific means (TSMs) are commonly used, but their identification requires a positivity assumption -- that every subject has a non-zero probability of receiving each treatment. This assumption is often implausible, especially when treatment can take many values. Causal parameters based on dynamic stochastic interventions can be robust to positivity violations. However, comparing these parameters may be unfair because they may depend on outcomes under non-target treatments. 
To address this, and clarify when fair comparisons are possible, we propose a fairness criterion: if the conditional TSM for one treatment is greater than that for another, then the corresponding causal parameter should also be greater. We derive two intuitive properties equivalent to this criterion and show that only a mild positivity assumption is needed to identify fair parameters. We then provide examples that satisfy this criterion and are identifiable under the milder positivity assumption. These parameters are non-smooth, making standard nonparametric efficiency theory inapplicable, so we propose smooth approximations of them. We then develop doubly robust-style estimators that attain parametric convergence rates under nonparametric conditions. We illustrate our methods with an analysis of dialysis providers in New York State."}, "https://arxiv.org/abs/2410.13658": {"title": "The Subtlety of Optimal Paternalism in a Population with Bounded Rationality", "link": "https://arxiv.org/abs/2410.13658", "description": "arXiv:2410.13658v1 Announce Type: new \nAbstract: We consider a utilitarian planner with the power to design a discrete choice set for a heterogeneous population with bounded rationality. We find that optimal paternalism is subtle. The policy that most effectively constrains or influences choices depends on the preference distribution of the population and on the choice probabilities conditional on preferences that measure the suboptimality of behavior. We first consider the planning problem in abstraction. We next examine policy choice when individuals measure utility with additive random error and maximize mismeasured rather than actual utility. We then analyze a class of problems of binary treatment choice under uncertainty. Here we suppose that a planner can mandate a treatment conditional on publicly observed personal covariates or can decentralize decision making, enabling persons to choose their own treatments. Bounded rationality may take the form of deviations between subjective personal beliefs and objective probabilities of uncertain outcomes. We apply our analysis to clinical decision making in medicine. Having documented that optimization of paternalism requires the planner to possess extensive knowledge that is rarely available, we address the difficult problem of paternalistic policy choice when the planner is boundedly rational."}, "https://arxiv.org/abs/2410.13693": {"title": "A multiscale method for data collected from network edges via the line graph", "link": "https://arxiv.org/abs/2410.13693", "description": "arXiv:2410.13693v1 Announce Type: new \nAbstract: Data collected over networks can be modelled as noisy observations of an unknown function over the nodes of a graph or network structure, fully described by its nodes and their connections, the edges. In this context, function estimation has been proposed in the literature and typically makes use of the network topology such as relative node arrangement, often using given or artificially constructed node Euclidean coordinates. However, networks that arise in fields such as hydrology (for example, river networks) present features that challenge these established modelling setups since the target function may naturally live on edges (e.g., river flow) and/or the node-oriented modelling uses noisy edge data as weights. This work tackles these challenges and develops a novel lifting scheme along with its associated (second) generation wavelets that permit data decomposition across the network edges. 
The transform, which we refer to under the acronym LG-LOCAAT, makes use of a line graph construction that first maps the data in the line graph domain. We thoroughly investigate the proposed algorithm's properties and illustrate its performance versus existing methodologies. We conclude with an application pertaining to hydrology that involves the denoising of a water quality index over the England river network, backed up by a simulation study for a river flow dataset."}, "https://arxiv.org/abs/2410.13723": {"title": "A Subsequence Approach to Topological Data Analysis for Irregularly-Spaced Time Series", "link": "https://arxiv.org/abs/2410.13723", "description": "arXiv:2410.13723v1 Announce Type: new \nAbstract: A time-delay embedding (TDE), grounded in the framework of Takens's Theorem, provides a mechanism to represent and analyze the inherent dynamics of time-series data. Recently, topological data analysis (TDA) methods have been applied to study this time series representation mainly through the lens of persistent homology. Current literature on the fusion of TDE and TDA are adept at analyzing uniformly-spaced time series observations. This work introduces a novel {\\em subsequence} embedding method for irregularly-spaced time-series data. We show that this method preserves the original state space topology while reducing spurious homological features. Theoretical stability results and convergence properties of the proposed method in the presence of noise and varying levels of irregularity in the spacing of the time series are established. Numerical studies and an application to real data illustrates the performance of the proposed method."}, "https://arxiv.org/abs/2410.13744": {"title": "Inferring the dynamics of quasi-reaction systems via nonlinear local mean-field approximations", "link": "https://arxiv.org/abs/2410.13744", "description": "arXiv:2410.13744v1 Announce Type: new \nAbstract: In the modelling of stochastic phenomena, such as quasi-reaction systems, parameter estimation of kinetic rates can be challenging, particularly when the time gap between consecutive measurements is large. Local linear approximation approaches account for the stochasticity in the system but fail to capture the nonlinear nature of the underlying process. At the mean level, the dynamics of the system can be described by a system of ODEs, which have an explicit solution only for simple unitary systems. An analytical solution for generic quasi-reaction systems is proposed via a first order Taylor approximation of the hazard rate. This allows a nonlinear forward prediction of the future dynamics given the current state of the system. Predictions and corresponding observations are embedded in a nonlinear least-squares approach for parameter estimation. The performance of the algorithm is compared to existing SDE and ODE-based methods via a simulation study. Besides the increased computational efficiency of the approach, the results show an improvement in the kinetic rate estimation, particularly for data observed at large time intervals. Additionally, the availability of an explicit solution makes the method robust to stiffness, which is often present in biological systems. 
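The subsequence-embedding entry above builds on the classical time-delay embedding from Takens's theorem. As a point of reference only (this is the standard uniform-delay construction, not the proposed subsequence method for irregular spacing), the delay-coordinate map can be sketched as:

    import numpy as np

    def time_delay_embedding(x, dim=3, delay=1):
        # Stack delay coordinates (x_t, x_{t+delay}, ..., x_{t+(dim-1)*delay}) as rows.
        x = np.asarray(x, dtype=float)
        n = len(x) - (dim - 1) * delay
        if n <= 0:
            raise ValueError("series too short for this embedding")
        return np.column_stack([x[i * delay: i * delay + n] for i in range(dim)])

    # cloud = time_delay_embedding(signal, dim=3, delay=5)
    # The resulting point cloud is what persistent homology is then computed on.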
An illustration on Rhesus Macaque data shows the applicability of the approach to the study of cell differentiation."}, "https://arxiv.org/abs/2410.13112": {"title": "Distributional Matrix Completion via Nearest Neighbors in the Wasserstein Space", "link": "https://arxiv.org/abs/2410.13112", "description": "arXiv:2410.13112v1 Announce Type: cross \nAbstract: We introduce the problem of distributional matrix completion: Given a sparsely observed matrix of empirical distributions, we seek to impute the true distributions associated with both observed and unobserved matrix entries. This is a generalization of traditional matrix completion where the observations per matrix entry are scalar valued. To do so, we utilize tools from optimal transport to generalize the nearest neighbors method to the distributional setting. Under a suitable latent factor model on probability distributions, we establish that our method recovers the distributions in the Wasserstein norm. We demonstrate through simulations that our method is able to (i) provide better distributional estimates for an entry compared to using observed samples for that entry alone, (ii) yield accurate estimates of distributional quantities such as standard deviation and value-at-risk, and (iii) inherently support heteroscedastic noise. We also prove novel asymptotic results for Wasserstein barycenters over one-dimensional distributions."}, "https://arxiv.org/abs/2410.13495": {"title": "On uniqueness of the set of k-means", "link": "https://arxiv.org/abs/2410.13495", "description": "arXiv:2410.13495v1 Announce Type: cross \nAbstract: We provide necessary and sufficient conditions for the uniqueness of the k-means set of a probability distribution. This uniqueness problem is related to the choice of k: depending on the underlying distribution, some values of this parameter could lead to multiple sets of k-means, which hampers the interpretation of the results and/or the stability of the algorithms. We give a general assessment on consistency of the empirical k-means adapted to the setting of non-uniqueness and determine the asymptotic distribution of the within cluster sum of squares (WCSS). We also provide statistical characterizations of k-means uniqueness in terms of the asymptotic behavior of the empirical WCSS. As a consequence, we derive a bootstrap test for uniqueness of the set of k-means. The results are illustrated with examples of different types of non-uniqueness and we check by simulations the performance of the proposed methodology."}, "https://arxiv.org/abs/2410.13681": {"title": "Ab initio nonparametric variable selection for scalable Symbolic Regression with large $p$", "link": "https://arxiv.org/abs/2410.13681", "description": "arXiv:2410.13681v1 Announce Type: cross \nAbstract: Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which are common in modern scientific applications. This ``large $p$'' setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. 
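For the distributional matrix completion entry above, a useful background fact is that for one-dimensional distributions the 2-Wasserstein barycenter has a closed form: its quantile function is the (weighted) average of the input quantile functions. A minimal sketch with equal weights (a generic illustration, not the paper's nearest-neighbour estimator):

    import numpy as np

    def w2_barycenter_1d(samples, n_grid=101):
        # Average the empirical quantile functions on a common probability grid.
        qs = np.linspace(0.0, 1.0, n_grid)
        curves = np.stack([np.quantile(np.asarray(s, dtype=float), qs) for s in samples])
        return qs, curves.mean(axis=0)

    # qs, bary_quantiles = w2_barycenter_1d([entry_samples_1, entry_samples_2, entry_samples_3])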
To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 17 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets."}, "https://arxiv.org/abs/2410.13735": {"title": "Optimizing Probabilistic Conformal Prediction with Vectorized Non-Conformity Scores", "link": "https://arxiv.org/abs/2410.13735", "description": "arXiv:2410.13735v1 Announce Type: cross \nAbstract: Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency -- the size of the uncertainty set -- is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets."}, "https://arxiv.org/abs/2303.14801": {"title": "FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions", "link": "https://arxiv.org/abs/2303.14801", "description": "arXiv:2303.14801v3 Announce Type: replace \nAbstract: Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse high dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach to the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficients' estimation. 
The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study. Complete FAStEN code is provided at https://github.com/IBM/funGCN."}, "https://arxiv.org/abs/2312.17716": {"title": "Dependent Random Partitions by Shrinking Toward an Anchor", "link": "https://arxiv.org/abs/2312.17716", "description": "arXiv:2312.17716v2 Announce Type: replace \nAbstract: Although exchangeable processes from Bayesian nonparametrics have been used as a generating mechanism for random partition models, we deviate from this paradigm to explicitly incorporate clustering information in the formulation of our random partition model. Our shrinkage partition distribution takes any partition distribution and shrinks its probability mass toward an anchor partition. We show how this provides a framework to model hierarchically-dependent and temporally-dependent random partitions. The shrinkage parameter controls the degree of dependence, accommodating at its extremes both independence and complete equality. Since a priori knowledge of items may vary, our formulation allows the degree of shrinkage toward the anchor to be item-specific. Our random partition model has a tractable normalizing constant which allows for standard Markov chain Monte Carlo algorithms for posterior sampling. We prove intuitive theoretical properties for our distribution and compare it to related partition distributions. We show that our model provides better out-of-sample fit in a real data application."}, "https://arxiv.org/abs/2301.06615": {"title": "Data-Driven Estimation of Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2301.06615", "description": "arXiv:2301.06615v2 Announce Type: replace-cross \nAbstract: Estimating how a treatment affects different individuals, known as heterogeneous treatment effect estimation, is an important problem in empirical sciences. In the last few years, there has been a considerable interest in adapting machine learning algorithms to the problem of estimating heterogeneous effects from observational and experimental data. However, these algorithms often make strong assumptions about the observed features in the data and ignore the structure of the underlying causal model, which can lead to biased estimation. At the same time, the underlying causal mechanism is rarely known in real-world datasets, making it hard to take it into consideration. In this work, we provide a survey of state-of-the-art data-driven methods for heterogeneous treatment effect estimation using machine learning, broadly categorizing them as methods that focus on counterfactual prediction and methods that directly estimate the causal effect. We also provide an overview of a third category of methods which rely on structural causal models and learn the model structure from data. Our empirical evaluation under various underlying structural model mechanisms shows the advantages and deficiencies of existing estimators and of the metrics for measuring their performance."}, "https://arxiv.org/abs/2307.12395": {"title": "Concentration inequalities for high-dimensional linear processes with dependent innovations", "link": "https://arxiv.org/abs/2307.12395", "description": "arXiv:2307.12395v2 Announce Type: replace-cross \nAbstract: We develop concentration inequalities for the $l_\\infty$ norm of vector linear processes with sub-Weibull, mixingale innovations. 
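The concentration-inequality entry above bounds the entrywise error of the lag-$h$ sample autocovariance matrix of a vector linear process. For concreteness, the estimator being controlled is of the following standard form (a sketch; normalization conventions differ across papers):

    import numpy as np

    def lag_autocovariance(X, h):
        # X: (T, d) array of observations; returns the d x d lag-h sample autocovariance.
        X = np.asarray(X, dtype=float)
        T = X.shape[0]
        Xc = X - X.mean(axis=0)
        return Xc[h:].T @ Xc[:T - h] / (T - h)

    # Gamma_hat_2 = lag_autocovariance(data, h=2)  # its entrywise max norm is what the bound controls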
This inequality is used to obtain a concentration bound for the maximum entrywise norm of the lag-$h$ autocovariance matrix of linear processes. We apply these inequalities to sparse estimation of large-dimensional VAR(p) systems and heterocedasticity and autocorrelation consistent (HAC) high-dimensional covariance estimation."}, "https://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation", "link": "https://arxiv.org/abs/2310.06653", "description": "arXiv:2310.06653v2 Announce Type: replace-cross \nAbstract: In clinical trials, patients may discontinue treatments prematurely, breaking the initial randomization and, thus, challenging inference. Stakeholders in drug development are generally interested in going beyond the Intention-To-Treat (ITT) analysis, which provides valid causal estimates of the effect of treatment assignment but does not inform on the effect of the actual treatment receipt. Our study is motivated by an RCT in oncology, where patients assigned the investigational treatment may discontinue it due to adverse events. We propose adopting a principal stratum strategy and decomposing the overall ITT effect into principal causal effects for groups of patients defined by their potential discontinuation behavior. We first show how to implement a principal stratum strategy to assess causal effects on a survival outcome in the presence of continuous time treatment discontinuation, its advantages, and the conclusions one can draw. Our strategy deals with the time-to-event intermediate variable that may not be defined for patients who would not discontinue; moreover, discontinuation time and the primary endpoint are subject to censoring. We employ a flexible model-based Bayesian approach to tackle these complexities, providing easily interpretable results. We apply this Bayesian principal stratification framework to analyze synthetic data of the motivating oncology trial. We simulate data under different assumptions that reflect real scenarios where patients' behavior depends on critical baseline covariates. Supported by a simulation study, we shed light on the role of covariates in this framework: beyond making structural and parametric assumptions more credible, they lead to more precise inference and can be used to characterize patients' discontinuation behavior, which could help inform clinical practice and future protocols."}, "https://arxiv.org/abs/2410.13949": {"title": "Modeling Zero-Inflated Correlated Dental Data through Gaussian Copulas and Approximate Bayesian Computation", "link": "https://arxiv.org/abs/2410.13949", "description": "arXiv:2410.13949v1 Announce Type: new \nAbstract: We develop a new longitudinal count data regression model that accounts for zero-inflation and spatio-temporal correlation across responses. This project is motivated by an analysis of Iowa Fluoride Study (IFS) data, a longitudinal cohort study with data on caries (cavity) experience scores measured for each tooth across five time points. To that end, we use a hurdle model for zero-inflation with two parts: the presence model indicating whether a count is non-zero through logistic regression and the severity model that considers the non-zero counts through a shifted Negative Binomial distribution allowing overdispersion. To incorporate dependence across measurement occasion and teeth, these marginal models are embedded within a Gaussian copula that introduces spatio-temporal correlations. 
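The hurdle structure in the dental-caries entry above combines a logistic presence part with a shifted negative binomial severity part. A minimal simulation of a single marginal count under that structure (ignoring the Gaussian-copula dependence, and with hypothetical parameter values) could look like:

    import numpy as np

    def simulate_hurdle_counts(n, p_presence=0.3, nb_size=2.0, nb_prob=0.5, seed=0):
        # Presence: Bernoulli(p_presence); severity: 1 + NegativeBinomial(nb_size, nb_prob).
        rng = np.random.default_rng(seed)
        presence = rng.binomial(1, p_presence, size=n)
        severity = 1 + rng.negative_binomial(nb_size, nb_prob, size=n)
        return presence * severity

    # counts = simulate_hurdle_counts(10_000)
    # np.mean(counts == 0) is close to 1 - p_presence, the zero-inflation level.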
A distinct advantage of this formulation is that it allows us to determine covariate effects with population-level (marginal) interpretations in contrast to mixed model choices. Standard Bayesian sampling from such a model is infeasible, so we use approximate Bayesian computing for inference. This approach is applied to the IFS data to gain insight into the risk factors for dental caries and the correlation structure across teeth and time."}, "https://arxiv.org/abs/2410.14002": {"title": "A note on Bayesian R-squared for generalized additive mixed models", "link": "https://arxiv.org/abs/2410.14002", "description": "arXiv:2410.14002v1 Announce Type: new \nAbstract: We present a novel Bayesian framework to decompose the posterior predictive variance in a fitted Generalized Additive Mixed Model (GAMM) into explained and unexplained components. This decomposition enables a rigorous definition of Bayesian $R^{2}$. We show that the new definition aligns with the intuitive Bayesian $R^{2}$ proposed by Gelman, Goodrich, Gabry, and Vehtari (2019) [\\emph{The American Statistician}, \\textbf{73}(3), 307-309], but extends its applicability to a broader class of models. Furthermore, we introduce a partial Bayesian $R^{2}$ to quantify the contribution of individual model terms to the explained variation in the posterior predictions"}, "https://arxiv.org/abs/2410.14301": {"title": "Confidence interval for the sensitive fraction in Item Count Technique model", "link": "https://arxiv.org/abs/2410.14301", "description": "arXiv:2410.14301v1 Announce Type: new \nAbstract: The problem is in the estimation of the fraction of population with a sensitive characteristic. We consider the Item Count Technique an indirect method of questioning designed to protect respondents' privacy. The exact confidence interval for the sensitive fraction is constructed. The length of the proposed CI depends on both the given parameter of the model and the sample size. For these CI the model's parameter is established in relation to the provided level of the privacy protection of the interviewee. The optimal sample size for obtaining a CI of a given length is discussed in the context."}, "https://arxiv.org/abs/2410.14308": {"title": "Adaptive L-statistics for high dimensional test problem", "link": "https://arxiv.org/abs/2410.14308", "description": "arXiv:2410.14308v1 Announce Type: new \nAbstract: In this study, we focus on applying L-statistics to the high-dimensional one-sample location test problem. Intuitively, an L-statistic with $k$ parameters tends to perform optimally when the sparsity level of the alternative hypothesis matches $k$. We begin by deriving the limiting distributions for both L-statistics with fixed parameters and those with diverging parameters. To ensure robustness across varying sparsity levels of alternative hypotheses, we first establish the asymptotic independence between L-statistics with fixed and diverging parameters. Building on this, we propose a Cauchy combination test that integrates L-statistics with different parameters. Both simulation results and real-data applications highlight the advantages of our proposed methods."}, "https://arxiv.org/abs/2410.14317": {"title": "Identification of a Rank-dependent Peer Effect Model", "link": "https://arxiv.org/abs/2410.14317", "description": "arXiv:2410.14317v1 Announce Type: new \nAbstract: This paper develops an econometric model to analyse heterogeneity in peer effects in network data with endogenous spillover across units. 
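The adaptive L-statistics entry above combines statistics tuned to different sparsity levels through a Cauchy combination test. The combination step itself is a standard generic device; a sketch of that step alone, applied to the p-values of the individual statistics (not the paper's full procedure):

    import numpy as np

    def cauchy_combination(pvalues, weights=None):
        # Transform each p-value to a Cauchy variate, average, and map back to a p-value.
        p = np.asarray(pvalues, dtype=float)
        w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, dtype=float)
        t = np.sum(w * np.tan((0.5 - p) * np.pi))
        return 0.5 - np.arctan(t) / np.pi

    # combined_p = cauchy_combination([0.04, 0.20, 0.008])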
We introduce a rank-dependent peer effect model that captures how the relative ranking of a peer outcome shapes the influence units have on one another, by modeling the peer effect to be linear in ordered peer outcomes. In contrast to the traditional linear-in-means model, our approach allows for greater flexibility in peer effects by accounting for the distribution of peer outcomes as well as the size of peer groups. Under a minimal condition, the rank-dependent peer effect model admits a unique equilibrium and is therefore tractable. Our simulations show that estimation performs well in finite samples given sufficient covariate strength. We then apply our model to educational data from Norway, where we see that higher-performing students disproportionately drive GPA spillovers."}, "https://arxiv.org/abs/2410.14459": {"title": "To Vary or Not To Vary: A Simple Empirical Bayes Factor for Testing Variance Components", "link": "https://arxiv.org/abs/2410.14459", "description": "arXiv:2410.14459v1 Announce Type: new \nAbstract: Random effects are a flexible addition to statistical models to capture structural heterogeneity in the data, such as spatial dependencies, individual differences, temporal dependencies, or non-linear effects. Testing for the presence (or absence) of random effects is an important but challenging endeavor, however, as testing a variance component, which must be non-negative, is a boundary problem. Various methods exist which have potential shortcomings or limitations. As a flexible alternative, we propose an empirical Bayes factor (EBF) for testing for the presence of random effects. Rather than testing whether a variance component equals zero or not, the proposed EBF tests the equivalent assumption of whether all random effects are zero. The Bayes factor is `empirical' because the distribution of the random effects on the lower level, which serves as a prior, is estimated from the data as it is part of the model. Empirical Bayes factors can be computed using the output from classical (MLE) or Bayesian (MCMC) approaches. Analyses on synthetic data were carried out to assess the general behavior of the criterion. To illustrate the methodology, the EBF is used for testing random effects under various models including logistic crossed mixed effects models, spatial random effects models, dynamic structural equation models, random intercept cross-lagged panel models, and nonlinear regression models."}, "https://arxiv.org/abs/2410.14507": {"title": "Bin-Conditional Conformal Prediction of Fatalities from Armed Conflict", "link": "https://arxiv.org/abs/2410.14507", "description": "arXiv:2410.14507v1 Announce Type: new \nAbstract: Forecasting of armed conflicts is an important area of research that has the potential to save lives and prevent suffering. However, most existing forecasting models provide only point predictions without any individual-level uncertainty estimates. In this paper, we introduce a novel extension of the conformal prediction algorithm which we call bin-conditional conformal prediction. This method allows users to obtain individual-level prediction intervals for any arbitrary prediction model while maintaining a specific level of coverage across user-defined ranges of values. We apply the bin-conditional conformal prediction algorithm to forecast fatalities from armed conflict. Our results demonstrate that the method provides well-calibrated uncertainty estimates for the predicted number of fatalities. 
Compared to standard conformal prediction, the bin-conditional method offers improved calibration of coverage rates across different values of the outcome, but at the cost of wider prediction intervals."}, "https://arxiv.org/abs/2410.14513": {"title": "GARCH option valuation with long-run and short-run volatility components: A novel framework ensuring positive variance", "link": "https://arxiv.org/abs/2410.14513", "description": "arXiv:2410.14513v1 Announce Type: new \nAbstract: Christoffersen, Jacobs, Ornthanalai, and Wang (2008) (CJOW) proposed an improved Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model for valuing European options, where the return volatility is comprised of two distinct components. Empirical studies indicate that the model developed by CJOW outperforms widely-used single-component GARCH models and provides a superior fit to options data compared with models that combine conditional heteroskedasticity with Poisson-normal jumps. However, a significant limitation of this model is that it allows the variance process to become negative. Oh and Park [2023] partially addressed this issue by developing a related model, yet the positivity of the volatility components is not guaranteed, both theoretically and empirically. In this paper we introduce a new GARCH model that improves upon the models by CJOW and Oh and Park [2023], ensuring the positivity of the return volatility. In comparison to the two earlier GARCH approaches, our novel methodology shows comparable in-sample performance on returns data and superior performance on S&P500 options data."}, "https://arxiv.org/abs/2410.14533": {"title": "The Traveling Bandit: A Framework for Bayesian Optimization with Movement Costs", "link": "https://arxiv.org/abs/2410.14533", "description": "arXiv:2410.14533v1 Announce Type: new \nAbstract: This paper introduces a framework for Bayesian Optimization (BO) with metric movement costs, addressing a critical challenge in practical applications where input alterations incur varying costs. Our approach is a convenient plug-in that seamlessly integrates with the existing literature on batched algorithms, where designs within batches are observed following the solution of a Traveling Salesman Problem. The proposed method provides a theoretical guarantee of convergence in terms of movement costs for BO. Empirically, our method effectively reduces average movement costs over time while maintaining comparable regret performance to conventional BO methods. This framework also shows promise for broader applications in various bandit settings with movement costs."}, "https://arxiv.org/abs/2410.14585": {"title": "A GARCH model with two volatility components and two driving factors", "link": "https://arxiv.org/abs/2410.14585", "description": "arXiv:2410.14585v1 Announce Type: new \nAbstract: We introduce a novel GARCH model that integrates two sources of uncertainty to better capture the rich, multi-component dynamics often observed in the volatility of financial assets. This model provides a quasi closed-form representation of the characteristic function for future log-returns, from which semi-analytical formulas for option pricing can be derived. A theoretical analysis is conducted to establish sufficient conditions for strict stationarity and geometric ergodicity, while also obtaining the continuous-time diffusion limit of the model. 
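Both GARCH entries above build multi-component extensions of the single-component baseline. For orientation, the plain GARCH(1,1) conditional-variance recursion that these models generalize can be sketched as follows (a generic sketch with illustrative parameter values, not either paper's two-component specification):

    import numpy as np

    def garch11_variance(returns, omega, alpha, beta):
        # h_t = omega + alpha * r_{t-1}^2 + beta * h_{t-1}; positivity needs omega, alpha, beta >= 0.
        r = np.asarray(returns, dtype=float)
        h = np.empty(len(r))
        h[0] = r.var()
        for t in range(1, len(r)):
            h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
        return h

    # h = garch11_variance(daily_returns, omega=1e-6, alpha=0.08, beta=0.90)  # illustrative values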
Empirical evaluations, conducted both in-sample and out-of-sample using S\\&P500 time series data, show that our model outperforms widely used single-factor models in predicting returns and option prices."}, "https://arxiv.org/abs/2410.14046": {"title": "Tensor Decomposition with Unaligned Observations", "link": "https://arxiv.org/abs/2410.14046", "description": "arXiv:2410.14046v1 Announce Type: cross \nAbstract: This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset."}, "https://arxiv.org/abs/2410.14333": {"title": "Predicting the trajectory of intracranial pressure in patients with traumatic brain injury: evaluation of a foundation model for time series", "link": "https://arxiv.org/abs/2410.14333", "description": "arXiv:2410.14333v1 Announce Type: cross \nAbstract: Patients with traumatic brain injury (TBI) often experience pathological increases in intracranial pressure (ICP), leading to intracranial hypertension (tIH), a common and serious complication. Early warning of an impending rise in ICP could potentially improve patient outcomes by enabling preemptive clinical intervention. However, the limited availability of patient data poses a challenge in developing reliable prediction models. In this study, we aim to determine whether foundation models, which leverage transfer learning, may offer a promising solution."}, "https://arxiv.org/abs/2410.14483": {"title": "Spectral Representations for Accurate Causal Uncertainty Quantification with Gaussian Processes", "link": "https://arxiv.org/abs/2410.14483", "description": "arXiv:2410.14483v1 Announce Type: cross \nAbstract: Accurate uncertainty quantification for causal effects is essential for robust decision making in complex systems, but remains challenging in non-parametric settings. One promising framework represents conditional distributions in a reproducing kernel Hilbert space and places Gaussian process priors on them to infer posteriors on causal effects, but requires restrictive nuclear dominant kernels and approximations that lead to unreliable uncertainty estimates. In this work, we introduce a method, IMPspec, that addresses these limitations via a spectral representation of the Hilbert space. We show that posteriors in this model can be obtained explicitly, by extending a result in Hilbert space regression theory. We also learn the spectral representation to optimise posterior calibration. 
Our method achieves state-of-the-art performance in uncertainty quantification and causal Bayesian optimisation across simulations and a healthcare application."}, "https://arxiv.org/abs/2410.14490": {"title": "Matrix normal distribution and elliptic distribution", "link": "https://arxiv.org/abs/2410.14490", "description": "arXiv:2410.14490v1 Announce Type: cross \nAbstract: In this paper, we introduce the matrix normal distribution according to the tensor decomposition of its covariance. Based on the canonical diagonal form, the moment generating function of sample covariance matrix and the distribution of latent roots are explicitly calculated. We also discuss the connections between matrix normal distributions, elliptic distributions, and their relevance to multivariate analysis and matrix variate distributions."}, "https://arxiv.org/abs/2203.09380": {"title": "Identifiability of Sparse Causal Effects using Instrumental Variables", "link": "https://arxiv.org/abs/2203.09380", "description": "arXiv:2203.09380v4 Announce Type: replace \nAbstract: Exogenous heterogeneity, for example, in the form of instrumental variables can help us learn a system's underlying causal structure and predict the outcome of unseen intervention experiments. In this paper, we consider linear models in which the causal effect from covariates $X$ on a response $Y$ is sparse. We provide conditions under which the causal coefficient becomes identifiable from the observed distribution. These conditions can be satisfied even if the number of instruments is as small as the number of causal parents. We also develop graphical criteria under which identifiability holds with probability one if the edge coefficients are sampled randomly from a distribution that is absolutely continuous with respect to Lebesgue measure and $Y$ is childless. As an estimator, we propose spaceIV and prove that it consistently estimates the causal effect if the model is identifiable and evaluate its performance on simulated data. If identifiability does not hold, we show that it may still be possible to recover a subset of the causal parents."}, "https://arxiv.org/abs/2310.19051": {"title": "A Survey of Methods for Estimating Hurst Exponent of Time Sequence", "link": "https://arxiv.org/abs/2310.19051", "description": "arXiv:2310.19051v2 Announce Type: replace \nAbstract: The Hurst exponent is a significant indicator for characterizing the self-similarity and long-term memory properties of time sequences. It has wide applications in physics, technologies, engineering, mathematics, statistics, economics, psychology and so on. Currently, available methods for estimating the Hurst exponent of time sequences can be divided into different categories: time-domain methods and spectrum-domain methods based on the representation of time sequence, linear regression methods and Bayesian methods based on parameter estimation methods. Although various methods are discussed in literature, there are still some deficiencies: the descriptions of the estimation algorithms are just mathematics-oriented and the pseudo-codes are missing; the effectiveness and accuracy of the estimation algorithms are not clear; the classification of estimation methods is not considered and there is a lack of guidance for selecting the estimation methods. In this work, the emphasis is put on thirteen dominant methods for estimating the Hurst exponent. 
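The Hurst-exponent survey above covers, among its thirteen methods, the classical rescaled-range (R/S) estimator. A compact sketch of that one estimator, using dyadic block sizes and an illustrative minimum block length (this is not the survey's pseudo-code, and it needs a reasonably long series):

    import numpy as np

    def hurst_rs(x, min_block=8):
        # Rescaled-range analysis: regress log(R/S) on log(block size); the slope estimates H.
        x = np.asarray(x, dtype=float)
        n = len(x)
        sizes, rs_values = [], []
        size = min_block
        while size <= n // 2:
            ratios = []
            for i in range(n // size):
                seg = x[i * size:(i + 1) * size]
                dev = np.cumsum(seg - seg.mean())
                s = seg.std(ddof=1)
                if s > 0:
                    ratios.append((dev.max() - dev.min()) / s)
            sizes.append(size)
            rs_values.append(np.mean(ratios))
            size *= 2
        slope, _ = np.polyfit(np.log(sizes), np.log(rs_values), 1)
        return slope

    # H is roughly 0.5 for white noise and above 0.5 for persistent (long-memory) series.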
For the purpose of decreasing the difficulty of implementing the estimation methods with computer programs, the mathematical principles are discussed briefly and the pseudo-codes of algorithms are presented with necessary details. It is expected that the survey could help the researchers to select, implement and apply the estimation algorithms of interest in practical situations in an easy way."}, "https://arxiv.org/abs/2311.16440": {"title": "Inference for Low-rank Models without Estimating the Rank", "link": "https://arxiv.org/abs/2311.16440", "description": "arXiv:2311.16440v2 Announce Type: replace \nAbstract: This paper studies the inference about linear functionals of high-dimensional low-rank matrices. While most existing inference methods would require consistent estimation of the true rank, our procedure is robust to rank misspecification, making it a promising approach in applications where rank estimation can be unreliable. We estimate the low-rank spaces using pre-specified weighting matrices, known as diversified projections. A novel statistical insight is that, unlike the usual statistical wisdom that overfitting mainly introduces additional variances, the over-estimated low-rank space also gives rise to a non-negligible bias due to an implicit ridge-type regularization. We develop a new inference procedure and show that the central limit theorem holds as long as the pre-specified rank is no smaller than the true rank. In one of our applications, we study multiple testing with incomplete data in the presence of confounding factors and show that our method remains valid as long as the number of controlled confounding factors is at least as large as the true number, even when no confounding factors are present."}, "https://arxiv.org/abs/2410.14789": {"title": "Differentially Private Covariate Balancing Causal Inference", "link": "https://arxiv.org/abs/2410.14789", "description": "arXiv:2410.14789v1 Announce Type: new \nAbstract: Differential privacy is the leading mathematical framework for privacy protection, providing a probabilistic guarantee that safeguards individuals' private information when publishing statistics from a dataset. This guarantee is achieved by applying a randomized algorithm to the original data, which introduces unique challenges in data analysis by distorting inherent patterns. In particular, causal inference using observational data in privacy-sensitive contexts is challenging because it requires covariate balance between treatment groups, yet checking the true covariates is prohibited to prevent leakage of sensitive information. In this article, we present a differentially private two-stage covariate balancing weighting estimator to infer causal effects from observational data. Our algorithm produces both point and interval estimators with statistical guarantees, such as consistency and rate optimality, under a given privacy budget."}, "https://arxiv.org/abs/2410.14857": {"title": "A New One Parameter Unit Distribution: Median Based Unit Rayleigh (MBUR): Parametric Quantile Regression Model", "link": "https://arxiv.org/abs/2410.14857", "description": "arXiv:2410.14857v1 Announce Type: new \nAbstract: Parametric quantile regression is illustrated for the one parameter new unit Rayleigh distribution called Median Based Unit Rayleigh distribution (MBUR) distribution. The estimation process using re-parameterized maximum likelihood function is highlighted with real dataset example. 
Inference and goodness of fit are also explored."}, "https://arxiv.org/abs/2410.14866": {"title": "Fast and Optimal Changepoint Detection and Localization using Bonferroni Triplets", "link": "https://arxiv.org/abs/2410.14866", "description": "arXiv:2410.14866v1 Announce Type: new \nAbstract: The paper considers the problem of detecting and localizing changepoints in a sequence of independent observations. We propose to evaluate a local test statistic on a triplet of time points, for each such triplet in a particular collection. This collection is sparse enough so that the results of the local tests can simply be combined with a weighted Bonferroni correction. This results in a simple and fast method, {\\sl Lean Bonferroni Changepoint detection} (LBD), that provides finite sample guarantees for the existence of changepoints as well as simultaneous confidence intervals for their locations. LBD is free of tuning parameters, and we show that LBD allows optimal inference for the detection of changepoints. To this end, we provide a lower bound for the critical constant that measures the difficulty of the changepoint detection problem, and we show that LBD attains this critical constant. We illustrate LBD for a number of distributional settings, namely when the observations are homoscedastic normal with known or unknown variance, for observations from a natural exponential family, and in a nonparametric setting where we assume only exchangeability for segments without a changepoint."}, "https://arxiv.org/abs/2410.14871": {"title": "Learning the Effect of Persuasion via Difference-In-Differences", "link": "https://arxiv.org/abs/2410.14871", "description": "arXiv:2410.14871v1 Announce Type: new \nAbstract: The persuasion rate is a key parameter for measuring the causal effect of a directional message on influencing the recipient's behavior. Its identification analysis has largely relied on the availability of credible instruments, but the requirement is not always satisfied in observational studies. Therefore, we develop a framework for identifying, estimating, and conducting inference for the average persuasion rates on the treated using a difference-in-differences approach. The average treatment effect on the treated is a standard parameter with difference-in-differences, but it underestimates the persuasion rate in our setting. Our estimation and inference methods include regression-based approaches and semiparametrically efficient estimators. Beginning with the canonical two-period case, we extend the framework to staggered treatment settings, where we show how to conduct rich analyses like the event-study design. We revisit previous studies of the British election and the Chinese curriculum reform to illustrate the usefulness of our methodology."}, "https://arxiv.org/abs/2410.14985": {"title": "Stochastic Loss Reserving: Dependence and Estimation", "link": "https://arxiv.org/abs/2410.14985", "description": "arXiv:2410.14985v1 Announce Type: new \nAbstract: Nowadays insurers have to account for potentially complex dependence between risks. In the field of loss reserving, there are many parametric and non-parametric models attempting to capture dependence between business lines. One common approach has been to use additive background risk models (ABRMs) which provide rich and interpretable dependence structures via a common shock model. Unfortunately, ABRMs are often restrictive. Models that capture necessary features may have parameters that are impractical to estimate. 
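The persuasion-rate entry above starts from the canonical two-period difference-in-differences design before moving to persuasion rates on the treated and staggered adoption. The canonical two-by-two contrast it builds on is simply the following (a generic sketch, not the paper's persuasion-rate estimator):

    import numpy as np

    def did_2x2(y, treated, post):
        # (post - pre) change for the treated group minus the same change for the control group.
        y, treated, post = (np.asarray(a) for a in (y, treated, post))
        change_treated = y[(treated == 1) & (post == 1)].mean() - y[(treated == 1) & (post == 0)].mean()
        change_control = y[(treated == 0) & (post == 1)].mean() - y[(treated == 0) & (post == 0)].mean()
        return change_treated - change_control

    # att_hat = did_2x2(outcome, treated_indicator, post_indicator)  # placeholder array names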
For example models without a closed-form likelihood function for lack of a probability density function (e.g. some Tweedie, Stable Distributions, etc).\n We apply a modification of the continuous generalised method of moments (CGMM) of [Carrasco and Florens, 2000] which delivers comparable estimators to the MLE to loss reserving. We examine models such as the one proposed by [Avanzi et al., 2016] and a related but novel one derived from the stable family of distributions. Our CGMM method of estimation provides conventional non-Bayesian estimates in the case where MLEs are impractical."}, "https://arxiv.org/abs/2410.15090": {"title": "Fast and Efficient Bayesian Analysis of Structural Vector Autoregressions Using the R Package bsvars", "link": "https://arxiv.org/abs/2410.15090", "description": "arXiv:2410.15090v1 Announce Type: new \nAbstract: The R package bsvars provides a wide range of tools for empirical macroeconomic and financial analyses using Bayesian Structural Vector Autoregressions. It uses frontier econometric techniques and C++ code to ensure fast and efficient estimation of these multivariate dynamic structural models, possibly with many variables, complex identification strategies, and non-linear characteristics. The models can be identified using adjustable exclusion restrictions and heteroskedastic or non-normal shocks. They feature a flexible three-level equation-specific local-global hierarchical prior distribution for the estimated level of shrinkage for autoregressive and structural parameters. Additionally, the package facilitates predictive and structural analyses such as impulse responses, forecast error variance and historical decompositions, forecasting, statistical verification of identification and hypotheses on autoregressive parameters, and analyses of structural shocks, volatilities, and fitted values. These features differentiate bsvars from existing R packages that either focus on a specific structural model, do not consider heteroskedastic shocks, or lack the implementation using compiled code."}, "https://arxiv.org/abs/2410.15097": {"title": "Predictive Quantile Regression with High-Dimensional Predictors: The Variable Screening Approach", "link": "https://arxiv.org/abs/2410.15097", "description": "arXiv:2410.15097v1 Announce Type: new \nAbstract: This paper advances a variable screening approach to enhance conditional quantile forecasts using high-dimensional predictors. We have refined and augmented the quantile partial correlation (QPC)-based variable screening proposed by Ma et al. (2017) to accommodate $\\beta$-mixing time-series data. Our approach is inclusive of i.i.d scenarios but introduces new convergence bounds for time-series contexts, suggesting the performance of QPC-based screening is influenced by the degree of time-series dependence. Through Monte Carlo simulations, we validate the effectiveness of QPC under weak dependence. Our empirical assessment of variable selection for growth-at-risk (GaR) forecasting underscores the method's advantages, revealing that specific labor market determinants play a pivotal role in forecasting GaR. 
While prior empirical research has predominantly considered a limited set of predictors, we employ the comprehensive Fred-QD dataset, retaining a richer breadth of information for GaR forecasts."}, "https://arxiv.org/abs/2410.15102": {"title": "Bayesian-based Propensity Score Subclassification Estimator", "link": "https://arxiv.org/abs/2410.15102", "description": "arXiv:2410.15102v1 Announce Type: new \nAbstract: Subclassification estimators are one of the methods used to estimate causal effects of interest using the propensity score. This method is more stable compared to other weighting methods, such as inverse probability weighting estimators, in terms of the variance of the estimators. In subclassification estimators, the number of strata is traditionally set at five, and this number is not typically chosen based on data information. Even when the number of strata is selected, the uncertainty from the selection process is often not properly accounted for. In this study, we propose a novel Bayesian-based subclassification estimator that can assess the uncertainty in the number of strata, rather than selecting a single optimal number, using a Bayesian paradigm. To achieve this, we apply a general Bayesian procedure that does not rely on a likelihood function. This procedure allows us to avoid making strong assumptions about the outcome model, maintaining the same flexibility as traditional causal inference methods. With the proposed Bayesian procedure, it is expected that uncertainties from the design phase can be appropriately reflected in the analysis phase, which is sometimes overlooked in non-Bayesian contexts."}, "https://arxiv.org/abs/2410.15381": {"title": "High-dimensional prediction for count response via sparse exponential weights", "link": "https://arxiv.org/abs/2410.15381", "description": "arXiv:2410.15381v1 Announce Type: new \nAbstract: Count data is prevalent in various fields like ecology, medical research, and genomics. In high-dimensional settings, where the number of features exceeds the sample size, feature selection becomes essential. While frequentist methods like Lasso have advanced in handling high-dimensional count data, Bayesian approaches remain under-explored with no theoretical results on prediction performance. This paper introduces a novel probabilistic machine learning framework for high-dimensional count data prediction. We propose a pseudo-Bayesian method that integrates a scaled Student prior to promote sparsity and uses an exponential weight aggregation procedure. A key contribution is a novel risk measure tailored to count data prediction, with theoretical guarantees for prediction risk using PAC-Bayesian bounds. Our results include non-asymptotic oracle inequalities, demonstrating rate-optimal prediction error without prior knowledge of sparsity. We implement this approach efficiently using Langevin Monte Carlo method. Simulations and a real data application highlight the strong performance of our method compared to the Lasso in various settings."}, "https://arxiv.org/abs/2410.15383": {"title": "Probabilities for asymmetric p-outside values", "link": "https://arxiv.org/abs/2410.15383", "description": "arXiv:2410.15383v1 Announce Type: new \nAbstract: In 2017-2020 Jordanova and co-authors investigate probabilities for p-outside values and determine them in many particular cases. They show that these probabilities are closely related to the concept for heavy tails. Tukey's boxplots are very popular and useful in practice. 
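The Bayesian subclassification entry above targets the classical propensity-score subclassification estimator, for which five strata is the traditional choice. That classical estimator, the quantity whose number of strata the paper treats as uncertain, can be sketched as follows (placeholder array names, not the paper's Bayesian procedure):

    import numpy as np

    def subclassification_ate(y, treat, pscore, n_strata=5):
        # Stratify on propensity-score quantiles and average within-stratum mean differences.
        y, treat, pscore = (np.asarray(a) for a in (y, treat, pscore))
        edges = np.quantile(pscore, np.linspace(0.0, 1.0, n_strata + 1))
        strata = np.clip(np.searchsorted(edges, pscore, side="right") - 1, 0, n_strata - 1)
        estimate = 0.0
        for s in range(n_strata):
            in_s = strata == s
            t, c = in_s & (treat == 1), in_s & (treat == 0)
            if t.any() and c.any():
                estimate += (in_s.sum() / len(y)) * (y[t].mean() - y[c].mean())
        return estimate

    # ate_hat = subclassification_ate(outcome, treatment, estimated_pscore)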
Analogously to the chi-square-criterion, the relative frequencies of the events an observation to fall in different their parts, compared with the corresponding probabilities an observation of a fixed probability distribution to fall in the same parts, help the practitioners to find the accurate probability distribution of the observed random variable. These open the door to work with the distribution sensitive estimators which in many cases are more accurate, especially for small sample investigations. All these methods, however, suffer from the disadvantage that they use inter quantile range in a symmetric way. The concept for outside values should take into account the form of the distribution. Therefore, here, we give possibility for more asymmetry in analysis of the tails of the distributions. We suggest new theoretical and empirical box-plots and characteristics of the tails of the distributions. These are theoretical asymmetric p-outside values functions. We partially investigate some of their properties and give some examples. It turns out that they do not depend on the center and the scaling factor of the distribution. Therefore, they are very appropriate for comparison of the tails of the distribution, and later on, for estimation of the parameters, which govern the tail behaviour of the cumulative distribution function."}, "https://arxiv.org/abs/2410.15421": {"title": "A New Framework for Bayesian Function Registration", "link": "https://arxiv.org/abs/2410.15421", "description": "arXiv:2410.15421v1 Announce Type: new \nAbstract: Function registration, also referred to as alignment, has been one of the fundamental problems in the field of functional data analysis. Classical registration methods such as the Fisher-Rao alignment focus on estimating optimal time warping function between functions. In recent studies, a model on time warping has attracted more attention, and it can be used as a prior term to combine with the classical method (as a likelihood term) in a Bayesian framework. The Bayesian approaches have been shown improvement over the classical methods. However, its prior model on time warping is often based a nonlinear approximation, which may introduce inaccuracy and inefficiency. To overcome these problems, we propose a new Bayesian approach by adopting a prior which provides a linear representation and various stochastic processes (Gaussian or non-Gaussian) can be effectively utilized on time warping. No linearization approximation is needed in the time warping computation, and the posterior can be obtained via a conventional Markov Chain Monte Carlo approach. We thoroughly investigate the impact of the prior on the performance of functional registration with multiple simulation examples, which demonstrate the superiority of the new framework over the previous methods. We finally utilize the new method in a real dataset and obtain desirable alignment result."}, "https://arxiv.org/abs/2410.15477": {"title": "Randomization Inference for Before-and-After Studies with Multiple Units: An Application to a Criminal Procedure Reform in Uruguay", "link": "https://arxiv.org/abs/2410.15477", "description": "arXiv:2410.15477v1 Announce Type: new \nAbstract: We study the immediate impact of a new code of criminal procedure on crime. 
In November 2017, Uruguay switched from an inquisitorial system (where a single judge leads the investigation and decides the appropriate punishment for a particular crime) to an adversarial system (where the investigation is now led by prosecutors and the judge plays an overseeing role). To analyze the short-term effects of this reform, we develop a randomization-based approach for before-and-after studies with multiple units. Our framework avoids parametric time series assumptions and eliminates extrapolation by basing statistical inferences on finite-sample methods that rely only on the time periods closest to the time of the policy intervention. A key identification assumption underlying our method is that there would have been no time trends in the absence of the intervention, which is most plausible in a small window around the time of the reform. We also discuss several falsification methods to assess the plausibility of this assumption. Using our proposed inferential approach, we find statistically significant short-term causal effects of the crime reform. Our unbiased estimate shows an average increase of approximately 25 police reports per day in the week following the implementation of the new adversarial system in Montevideo, representing an 8 percent increase compared to the previous week under the old system."}, "https://arxiv.org/abs/2410.15530": {"title": "Simultaneous Inference in Multiple Matrix-Variate Graphs for High-Dimensional Neural Recordings", "link": "https://arxiv.org/abs/2410.15530", "description": "arXiv:2410.15530v1 Announce Type: new \nAbstract: As large-scale neural recordings become common, many neuroscientific investigations are focused on identifying functional connectivity from spatio-temporal measurements in two or more brain areas across multiple sessions. Spatial-temporal data in neural recordings can be represented as matrix-variate data, with time as the first dimension and space as the second. In this paper, we exploit the multiple matrix-variate Gaussian Graphical model to encode the common underlying spatial functional connectivity across multiple sessions of neural recordings. By effectively integrating information across multiple graphs, we develop a novel inferential framework that allows simultaneous testing to detect meaningful connectivity for a target edge subset of arbitrary size. Our test statistics are based on a group penalized regression approach and a high-dimensional Gaussian approximation technique. The validity of simultaneous testing is demonstrated theoretically under mild assumptions on sample size and non-stationary autoregressive temporal dependence. Our test is nearly optimal in achieving the testable region boundary. Additionally, our method involves only convex optimization and parametric bootstrap, making it computationally attractive. We demonstrate the efficacy of the new method through both simulations and an experimental study involving multiple local field potential (LFP) recordings in the Prefrontal Cortex (PFC) and visual area V4 during a memory-guided saccade task."}, "https://arxiv.org/abs/2410.15560": {"title": "Ablation Studies for Novel Treatment Effect Estimation Models", "link": "https://arxiv.org/abs/2410.15560", "description": "arXiv:2410.15560v1 Announce Type: new \nAbstract: Ablation studies are essential for understanding the contribution of individual components within complex models, yet their application in nonparametric treatment effect estimation remains limited. 
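The randomization-based before-and-after entry above bases inference on finite-sample reasoning within a narrow window around the reform. As a generic illustration of that style of inference only (a plain permutation test on a short pre/post window, not the paper's multi-unit procedure):

    import numpy as np

    def permutation_pvalue(pre, post, n_perm=10_000, seed=0):
        # Two-sided permutation test for a shift in means between pre- and post-intervention windows.
        rng = np.random.default_rng(seed)
        pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
        observed = post.mean() - pre.mean()
        pooled, n_post = np.concatenate([pre, post]), len(post)
        hits = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled)
            if abs(perm[:n_post].mean() - perm[n_post:].mean()) >= abs(observed):
                hits += 1
        return (hits + 1) / (n_perm + 1)

    # p = permutation_pvalue(reports_week_before, reports_week_after)  # placeholder array names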
This paper emphasizes the importance of ablation studies by examining the Bayesian Causal Forest (BCF) model, particularly the inclusion of the estimated propensity score $\\hat{\\pi}(x_i)$ intended to mitigate regularization-induced confounding (RIC). Through a partial ablation study utilizing five synthetic data-generating processes with varying baseline and propensity score complexities, we demonstrate that excluding $\\hat{\\pi}(x_i)$ does not diminish the model's performance in estimating average and conditional average treatment effects or in uncertainty quantification. Moreover, omitting $\\hat{\\pi}(x_i)$ reduces computational time by approximately 21\\%. These findings suggest that the BCF model's inherent flexibility suffices in adjusting for confounding without explicitly incorporating the propensity score. The study advocates for the routine use of ablation studies in treatment effect estimation to ensure model components are essential and to prevent unnecessary complexity."}, "https://arxiv.org/abs/2410.15596": {"title": "Assessing mediation in cross-sectional stepped wedge cluster randomized trials", "link": "https://arxiv.org/abs/2410.15596", "description": "arXiv:2410.15596v1 Announce Type: new \nAbstract: Mediation analysis has been comprehensively studied for independent data but relatively little work has been done for correlated data, especially for the increasingly adopted stepped wedge cluster randomized trials (SW-CRTs). Motivated by challenges in underlying the effect mechanisms in pragmatic and implementation science clinical trials, we develop new methods for mediation analysis in SW-CRTs. Specifically, based on a linear and generalized linear mixed models, we demonstrate how to estimate the natural indirect effect and mediation proportion in typical SW-CRTs with four data types, including both continuous and binary mediators and outcomes. Furthermore, to address the emerging challenges in exposure-time treatment effect heterogeneity, we derive the mediation expressions in SW-CRTs when the total effect varies as a function of the exposure time. The cluster jackknife approach is considered for inference across all data types and treatment effect structures. We conduct extensive simulations to evaluate the finite-sample performances of proposed mediation estimators and demonstrate the proposed approach in a real data example. A user-friendly R package mediateSWCRT has been developed to facilitate the practical implementation of the estimators."}, "https://arxiv.org/abs/2410.15634": {"title": "Distributionally Robust Instrumental Variables Estimation", "link": "https://arxiv.org/abs/2410.15634", "description": "arXiv:2410.15634v1 Announce Type: new \nAbstract: Instrumental variables (IV) estimation is a fundamental method in econometrics and statistics for estimating causal effects in the presence of unobserved confounding. However, challenges such as untestable model assumptions and poor finite sample properties have undermined its reliability in practice. Viewing common issues in IV estimation as distributional uncertainties, we propose DRIVE, a distributionally robust framework of the classical IV estimation method. When the ambiguity set is based on a Wasserstein distance, DRIVE minimizes a square root ridge regularized variant of the two stage least squares (TSLS) objective. 
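The DRIVE entry above regularizes the two-stage least squares objective. As a baseline reference, the unregularized TSLS estimator it starts from can be sketched as follows (generic; exogenous controls are assumed to be included in the instrument matrix):

    import numpy as np

    def tsls(y, X, Z):
        # Stage 1: project the regressors X on the instruments Z.
        # Stage 2: regress y on the projected regressors.
        y, X, Z = (np.asarray(a, dtype=float) for a in (y, X, Z))
        first_stage = np.linalg.lstsq(Z, X, rcond=None)[0]
        X_hat = Z @ first_stage
        return np.linalg.lstsq(X_hat, y, rcond=None)[0]

    # beta_hat = tsls(outcome, endogenous_regressors, instruments)  # placeholder array names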
We develop a novel asymptotic theory for this regularized regression estimator based on the square root ridge, showing that it achieves consistency without requiring the regularization parameter to vanish. This result follows from a fundamental property of the square root ridge, which we call ``delayed shrinkage''. This novel property, which also holds for a class of generalized method of moments (GMM) estimators, ensures that the estimator is robust to distributional uncertainties that persist in large samples. We further derive the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter based on theoretical results. Simulation studies confirm the superior finite sample performance of Wasserstein DRIVE. Thanks to its regularization and robustness properties, Wasserstein DRIVE could be preferable in practice, particularly when the practitioner is uncertain about model assumptions or distributional shifts in data."}, "https://arxiv.org/abs/2410.15705": {"title": "Variable screening for covariate dependent extreme value index estimation", "link": "https://arxiv.org/abs/2410.15705", "description": "arXiv:2410.15705v1 Announce Type: new \nAbstract: One of the main topics of extreme value analysis is to estimate the extreme value index, an important parameter that controls the tail behavior of the distribution. In many cases, estimating the extreme value index of the target variable associated with covariates is useful. Although the estimation of the covariate-dependent extreme value index has been developed by numerous researchers, no results have been presented regarding covariate selection. This paper proposes a sure independence screening method for covariate-dependent extreme value index estimation. For the screening, the marginal utility between the target variable and each covariate is calculated using the conditional Pickands estimator. A single-index model that uses the covariates selected by screening is further provided to estimate the extreme value index after screening. Monte Carlo simulations confirmed the finite sample performance of the proposed method. In addition, a real-data application is presented."}, "https://arxiv.org/abs/2410.15713": {"title": "Nonparametric method of structural break detection in stochastic time series regression model", "link": "https://arxiv.org/abs/2410.15713", "description": "arXiv:2410.15713v1 Announce Type: new \nAbstract: We propose a nonparametric algorithm to detect structural breaks in the conditional mean and/or variance of a time series. Our method does not assume any specific parametric form for the dependence structure of the regressor, the time series model, or the distribution of the model noise. This flexibility allows our algorithm to be applicable to a wide range of time series structures commonly encountered in financial econometrics. The effectiveness of the proposed algorithm is validated through an extensive simulation study and a real data application in detecting structural breaks in the mean and volatility of Bitcoin returns. 
The algorithm's ability to identify structural breaks in the data highlights its practical utility in econometric analysis and financial modeling."}, "https://arxiv.org/abs/2410.15734": {"title": "A Kernelization-Based Approach to Nonparametric Binary Choice Models", "link": "https://arxiv.org/abs/2410.15734", "description": "arXiv:2410.15734v1 Announce Type: new \nAbstract: We propose a new estimator for nonparametric binary choice models that does not impose a parametric structure on either the systematic function of covariates or the distribution of the error term. A key advantage of our approach is its computational efficiency. For instance, even when assuming a normal error distribution as in probit models, commonly used sieves for approximating an unknown function of covariates can lead to a large-dimensional optimization problem when the number of covariates is moderate. Our approach, motivated by kernel methods in machine learning, views certain reproducing kernel Hilbert spaces as special sieve spaces, coupled with spectral cut-off regularization for dimension reduction. We establish the consistency of the proposed estimator for both the systematic function of covariates and the distribution function of the error term, and asymptotic normality of the plug-in estimator for weighted average partial derivatives. Simulation studies show that, compared to parametric estimation methods, the proposed method effectively improves finite sample performance in cases of misspecification, and has a rather mild efficiency loss if the model is correctly specified. Using administrative data on the grant decisions of US asylum applications to immigration courts, along with nine case-day variables on weather and pollution, we re-examine the effect of outdoor temperature on court judges' \"mood\", and thus, their grant decisions."}, "https://arxiv.org/abs/2410.15874": {"title": "A measure of departure from symmetry via the Fisher-Rao distance for contingency tables", "link": "https://arxiv.org/abs/2410.15874", "description": "arXiv:2410.15874v1 Announce Type: new \nAbstract: A measure of asymmetry is a quantification method that allows for the comparison of categorical evaluations before and after treatment effects or among different target populations, irrespective of sample size. We focus on square contingency tables that summarize survey results between two time points or cohorts, represented by the same categorical variables. We propose a measure to evaluate the degree of departure from a symmetry model using cosine similarity. This proposal is based on the Fisher-Rao distance, allowing asymmetry to be interpreted as a geodesic distance between two distributions. Various measures of asymmetry have been proposed, but visualizing the relationship of these quantification methods on a two-dimensional plane demonstrates that the proposed measure provides the geometrically simplest and most natural quantification. Moreover, the visualized figure indicates that the proposed method for measuring departures from symmetry is less affected by very few cells with extreme asymmetry. 
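The asymmetry measure described above is built on the Fisher-Rao geometry and cosine similarity. As a minimal sketch, the snippet below computes the Fisher-Rao (Bhattacharyya-angle) geodesic distance between the upper- and lower-triangular cell distributions of a square contingency table; the paper's actual measure and its normalization may differ.

```python
# Minimal sketch of a Fisher-Rao-style asymmetry quantification for a square
# contingency table; the paper's exact measure may be defined differently.
import numpy as np

def fisher_rao_asymmetry(table):
    table = np.asarray(table, dtype=float)
    iu = np.triu_indices_from(table, k=1)        # cells above the diagonal
    upper = table[iu]
    lower = table.T[iu]                          # mirrored cells below the diagonal
    p = upper / upper.sum()
    q = lower / lower.sum()
    cos_sim = np.sum(np.sqrt(p * q))             # Bhattacharyya coefficient (cosine similarity)
    return 2.0 * np.arccos(np.clip(cos_sim, 0.0, 1.0))   # geodesic distance; 0 means symmetry

# A symmetric table gives 0; growing asymmetry increases the distance.
symmetric = [[10, 5, 2], [5, 8, 4], [2, 4, 6]]
skewed    = [[10, 9, 6], [1, 8, 7], [1, 1, 6]]
print(fisher_rao_asymmetry(symmetric), fisher_rao_asymmetry(skewed))
```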
A simulation study shows that for square contingency tables with an underlying asymmetry model, our method can directly extract and quantify only the asymmetric structure of the model, and can more sensitively detect departures from symmetry than divergence-type measures."}, "https://arxiv.org/abs/2410.15968": {"title": "A Causal Transformation Model for Time-to-Event Data Affected by Unobserved Confounding: Revisiting the Illinois Reemployment Bonus Experiment", "link": "https://arxiv.org/abs/2410.15968", "description": "arXiv:2410.15968v1 Announce Type: new \nAbstract: Motivated by studies investigating causal effects in survival analysis, we propose a transformation model to quantify the impact of a binary treatment on a time-to-event outcome. The approach is based on a flexible linear transformation structural model that links a monotone function of the time-to-event with the propensity for treatment through a bivariate Gaussian distribution. The model equations are specified as functions of additive predictors, allowing the impacts of observed confounders to be accounted for flexibly. Furthermore, the effect of the instrumental variable may be regularized through a ridge penalty, while interactions between the treatment and modifier variables can be incorporated into the model to acknowledge potential variations in treatment effects across different subgroups. The baseline survival function is estimated in a flexible manner using monotonic P-splines, while unobserved confounding is captured through the dependence parameter of the bivariate Gaussian. Parameter estimation is achieved via a computationally efficient and stable penalized maximum likelihood estimation approach and intervals constructed using the related inferential results. We revisit a dataset from the Illinois Reemployment Bonus Experiment to estimate the causal effect of a cash bonus on unemployment duration, unveiling new insights. The modeling framework is incorporated into the R package GJRM, enabling researchers and practitioners to fit the proposed causal survival model and obtain easy-to-interpret numerical and visual summaries."}, "https://arxiv.org/abs/2410.16017": {"title": "Semiparametric Bayesian Inference for a Conditional Moment Equality Model", "link": "https://arxiv.org/abs/2410.16017", "description": "arXiv:2410.16017v1 Announce Type: new \nAbstract: Conditional moment equality models are regularly encountered in empirical economics, yet they are difficult to estimate. These models map a conditional distribution of data to a structural parameter via the restriction that a conditional mean equals zero. Using this observation, I introduce a Bayesian inference framework in which an unknown conditional distribution is replaced with a nonparametric posterior, and structural parameter inference is then performed using an implied posterior. The method has the same flexibility as frequentist semiparametric estimators and does not require converting conditional moments to unconditional moments. Importantly, I prove a semiparametric Bernstein-von Mises theorem, providing conditions under which, in large samples, the posterior for the structural parameter is approximately normal, centered at an efficient estimator, and has variance equal to the Chamberlain (1987) semiparametric efficiency bound. As byproducts, I show that Bayesian uncertainty quantification methods are asymptotically optimal frequentist confidence sets and derive low-level sufficient conditions for Gaussian process priors. 
The latter sheds light on a key prior stability condition and relates to the numerical aspects of the paper in which these priors are used to predict the welfare effects of price changes."}, "https://arxiv.org/abs/2410.16076": {"title": "Improving the (approximate) sequential probability ratio test by avoiding overshoot", "link": "https://arxiv.org/abs/2410.16076", "description": "arXiv:2410.16076v1 Announce Type: new \nAbstract: The sequential probability ratio test (SPRT) by Wald (1945) is a cornerstone of sequential analysis. Based on desired type-I, II error levels $\\alpha, \\beta \\in (0,1)$, it stops when the likelihood ratio statistic crosses certain upper and lower thresholds, guaranteeing optimality of the expected sample size. However, these thresholds are not available in closed form and the test is often applied with approximate thresholds $(1-\\beta)/\\alpha$ and $\\beta/(1-\\alpha)$ (approximate SPRT). When $\\beta > 0$, this guarantees neither type I/II error control at $\\alpha,\\beta$ nor optimality. When $\\beta=0$ (power-one SPRT), it guarantees type I error control at $\\alpha$ that is in general conservative, and thus not optimal. The looseness in both cases is caused by overshoot: the test statistic overshoots the thresholds at the stopping time. One standard way to address this is to calculate the right thresholds numerically, but many papers and software packages do not do this. In this paper, we describe a different way to improve the approximate SPRT: we change the test statistic to avoid overshoot. Our technique uniformly improves power-one SPRTs $(\\beta=0)$ for simple nulls and alternatives, or for one-sided nulls and alternatives in exponential families. When $\\beta > 0$, our techniques provide valid type I error guarantees and type II error similar to Wald's, but often need fewer samples. These improved sequential tests can also be used for deriving tighter parametric confidence sequences, and can be extended to nontrivial settings like sampling without replacement and conformal martingales."}, "https://arxiv.org/abs/2410.16096": {"title": "Dynamic Time Warping-based imputation of long gaps in human mobility trajectories", "link": "https://arxiv.org/abs/2410.16096", "description": "arXiv:2410.16096v1 Announce Type: new \nAbstract: Individual mobility trajectories are difficult to measure and often incur long periods of missingness. Aggregation of this mobility data without accounting for the missingness leads to erroneous results, underestimating travel behavior. This paper proposes Dynamic Time Warping-Based Multiple Imputation (DTWBMI) as a method of filling long gaps in human mobility trajectories in order to use the available data to the fullest extent. This method reduces spatiotemporal trajectories to time series of particular travel behavior, then selects candidates for multiple imputation on the basis of the dynamic time warping distance between the potential donor series and the series preceding and following the gap in the recipient series, and finally imputes values multiple times. A simulation study designed to establish optimal parameters for DTWBMI provides two versions of the method. These two methods are applied to a real-world dataset of individual mobility trajectories with simulated missingness and compared against other methods of handling missingness. Linear interpolation outperforms DTWBMI and other methods when gaps are short and data are limited. 
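The donor-selection step of the DTW-based imputation described above can be illustrated with a plain dynamic-programming DTW distance: candidate donor series are ranked by their distance to the observed context around the gap. This is a sketch of the idea only, not the DTWBMI implementation; the window construction and `k` are assumptions.

```python
# Sketch of DTW-based donor selection: rank candidate donor windows by their
# dynamic time warping distance to the observed context around the gap.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def rank_donors(context, donors, k=5):
    """Return indices of the k donor series closest to the gap's observed context."""
    dists = [dtw_distance(context, d) for d in donors]
    return np.argsort(dists)[:k]

# Toy usage: the recipient's pre- and post-gap values form the context.
rng = np.random.default_rng(1)
context = np.sin(np.linspace(0, 3, 30)) + 0.1 * rng.normal(size=30)
donors = [np.sin(np.linspace(0, 3, 30) + s) for s in (0.0, 0.5, 2.0)]
print(rank_donors(context, donors, k=2))
```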
DTWBMI outperforms other methods when gaps become longer and when more data are available."}, "https://arxiv.org/abs/2410.16112": {"title": "Dynamic Biases of Static Panel Data Estimators", "link": "https://arxiv.org/abs/2410.16112", "description": "arXiv:2410.16112v1 Announce Type: new \nAbstract: This paper identifies an important bias - termed dynamic bias - in fixed effects panel estimators that arises when dynamic feedback is ignored in the estimating equation. Dynamic feedback occurs if past outcomes impact current outcomes, a feature of many settings ranging from economic growth to agricultural and labor markets. When estimating equations omit past outcomes, dynamic bias can lead to significantly inaccurate treatment effect estimates, even with randomly assigned treatments. This dynamic bias in simulations is larger than Nickell bias. I show that dynamic bias stems from the estimation of fixed effects, as their estimation generates confounding in the data. To recover consistent treatment effects, I develop a flexible estimator that provides fixed-T bias correction. I apply this approach to study the impact of temperature shocks on GDP, a canonical example where economic theory points to an important feedback from past to future outcomes. Accounting for dynamic bias lowers the estimated effects of higher yearly temperatures on GDP growth by 10% and GDP levels by 120%."}, "https://arxiv.org/abs/2410.16214": {"title": "Asymmetries in Financial Spillovers", "link": "https://arxiv.org/abs/2410.16214", "description": "arXiv:2410.16214v1 Announce Type: new \nAbstract: This paper analyzes nonlinearities in the international transmission of financial shocks originating in the US. To do so, we develop a flexible nonlinear multi-country model. Our framework is capable of producing asymmetries in the responses to financial shocks for shock size and sign, and over time. We show that international reactions to US-based financial shocks are asymmetric along these dimensions. Particularly, we find that adverse shocks trigger stronger declines in output, inflation, and stock markets than benign shocks. Further, we investigate time variation in the estimated dynamic effects and characterize the responsiveness of three major central banks to financial shocks."}, "https://arxiv.org/abs/2410.14783": {"title": "High-Dimensional Tensor Discriminant Analysis with Incomplete Tensors", "link": "https://arxiv.org/abs/2410.14783", "description": "arXiv:2410.14783v1 Announce Type: cross \nAbstract: Tensor classification has gained prominence across various fields, yet the challenge of handling partially observed tensor data in real-world applications remains largely unaddressed. This paper introduces a novel approach to tensor classification with incomplete data, framed within the tensor high-dimensional linear discriminant analysis. Specifically, we consider a high-dimensional tensor predictor with missing observations under the Missing Completely at Random (MCR) assumption and employ the Tensor Gaussian Mixture Model to capture the relationship between the tensor predictor and class label. We propose the Tensor LDA-MD algorithm, which manages high-dimensional tensor predictors with missing entries by leveraging the low-rank structure of the discriminant tensor. A key feature of our approach is a novel covariance estimation method under the tensor-based MCR model, supported by theoretical results that allow for correlated entries under mild conditions. 
Our work establishes the convergence rate of the estimation error of the discriminant tensor with incomplete data and minimax optimal bounds for the misclassification rate, addressing key gaps in the literature. Additionally, we derive large deviation results for the generalized mode-wise (separable) sample covariance matrix and its inverse, which are crucial tools in our analysis and hold independent interest. Our method demonstrates excellent performance in simulations and real data analysis, even with significant proportions of missing data. This research advances high-dimensional LDA and tensor learning, providing practical tools for applications with incomplete data and a solid theoretical foundation for classification accuracy in complex settings."}, "https://arxiv.org/abs/2410.14812": {"title": "Isolated Causal Effects of Natural Language", "link": "https://arxiv.org/abs/2410.14812", "description": "arXiv:2410.14812v1 Announce Type: cross \nAbstract: As language technologies become widespread, it is important to understand how variations in language affect reader perceptions -- formalized as the isolated causal effect of some focal language-encoded intervention on an external outcome. A core challenge of estimating isolated effects is the need to approximate all non-focal language outside of the intervention. In this paper, we introduce a formal estimation framework for isolated causal effects and explore how different approximations of non-focal language impact effect estimates. Drawing on the principle of omitted variable bias, we present metrics for evaluating the quality of isolated effect estimation and non-focal language approximation along the axes of fidelity and overlap. In experiments on semi-synthetic and real-world data, we validate the ability of our framework to recover ground truth isolated effects, and we demonstrate the utility of our proposed metrics as measures of quality for both isolated effect estimates and non-focal language approximations."}, "https://arxiv.org/abs/2410.14843": {"title": "Predictive variational inference: Learn the predictively optimal posterior distribution", "link": "https://arxiv.org/abs/2410.14843", "description": "arXiv:2410.14843v1 Announce Type: cross \nAbstract: Vanilla variational inference finds an optimal approximation to the Bayesian posterior distribution, but even the exact Bayesian posterior is often not meaningful under model misspecification. We propose predictive variational inference (PVI): a general inference framework that seeks and samples from an optimal posterior density such that the resulting posterior predictive distribution is as close to the true data generating process as possible, where this closeness is measured by multiple scoring rules. By optimizing this objective, predictive variational inference is generally not the same as, and does not even attempt to approximate, the Bayesian posterior, even asymptotically. Rather, we interpret it as an implicit hierarchical expansion. Further, the learned posterior uncertainty detects heterogeneity of parameters among the population, enabling automatic model diagnosis. This framework applies to both likelihood-exact and likelihood-free models. 
We demonstrate its application in real data examples."}, "https://arxiv.org/abs/2410.14904": {"title": "Switchback Price Experiments with Forward-Looking Demand", "link": "https://arxiv.org/abs/2410.14904", "description": "arXiv:2410.14904v1 Announce Type: cross \nAbstract: We consider a retailer running a switchback experiment for the price of a single product, with infinite supply. In each period, the seller chooses a price $p$ from a set of predefined prices that consist of a reference price and a few discounted price levels. The goal is to estimate the demand gradient at the reference price point, with the goal of adjusting the reference price to improve revenue after the experiment. In our model, in each period, a unit mass of buyers arrives on the market, with values distributed based on a time-varying process. Crucially, buyers are forward looking with a discounted utility and will choose to not purchase now if they expect to face a discounted price in the near future. We show that forward-looking demand introduces bias in naive estimators of the demand gradient, due to intertemporal interference. Furthermore, we prove that there is no estimator that uses data from price experiments with only two price points that can recover the correct demand gradient, even in the limit of an infinitely long experiment with an infinitesimal price discount. Moreover, we characterize the form of the bias of naive estimators. Finally, we show that with a simple three price level experiment, the seller can remove the bias due to strategic forward-looking behavior and construct an estimator for the demand gradient that asymptotically recovers the truth."}, "https://arxiv.org/abs/2410.15049": {"title": "Modeling Time-Varying Effects of Mobile Health Interventions Using Longitudinal Functional Data from HeartSteps Micro-Randomized Trial", "link": "https://arxiv.org/abs/2410.15049", "description": "arXiv:2410.15049v1 Announce Type: cross \nAbstract: To optimize mobile health interventions and advance domain knowledge on intervention design, it is critical to understand how the intervention effect varies over time and with contextual information. This study aims to assess how a push notification suggesting physical activity influences individuals' step counts using data from the HeartSteps micro-randomized trial (MRT). The statistical challenges include the time-varying treatments and longitudinal functional step count measurements. We propose the first semiparametric causal excursion effect model with varying coefficients to model the time-varying effects within a decision point and across decision points in an MRT. The proposed model incorporates double time indices to accommodate the longitudinal functional outcome, enabling the assessment of time-varying effect moderation by contextual variables. We propose a two-stage causal effect estimator that is robust against a misspecified high-dimensional outcome regression nuisance model. We establish asymptotic theory and conduct simulation studies to validate the proposed estimator. 
Our analysis provides new insights into individuals' change in response profiles (such as how soon a response occurs) due to the activity suggestions, how such changes differ by the type of suggestions received, and how such changes depend on other contextual information such as being recently sedentary and the day being a weekday."}, "https://arxiv.org/abs/2410.15057": {"title": "Asymptotic Time-Uniform Inference for Parameters in Averaged Stochastic Approximation", "link": "https://arxiv.org/abs/2410.15057", "description": "arXiv:2410.15057v1 Announce Type: cross \nAbstract: We study time-uniform statistical inference for parameters in stochastic approximation (SA), which encompasses a bunch of applications in optimization and machine learning. To that end, we analyze the almost-sure convergence rates of the averaged iterates to a scaled sum of Gaussians in both linear and nonlinear SA problems. We then construct three types of asymptotic confidence sequences that are valid uniformly across all times with coverage guarantees, in an asymptotic sense that the starting time is sufficiently large. These coverage guarantees remain valid if the unknown covariance matrix is replaced by its plug-in estimator, and we conduct experiments to validate our methodology."}, "https://arxiv.org/abs/2410.15166": {"title": "Joint Probability Estimation of Many Binary Outcomes via Localized Adversarial Lasso", "link": "https://arxiv.org/abs/2410.15166", "description": "arXiv:2410.15166v1 Announce Type: cross \nAbstract: In this work we consider estimating the probability of many (possibly dependent) binary outcomes which is at the core of many applications, e.g., multi-level treatments in causal inference, demands for bundle of products, etc. Without further conditions, the probability distribution of an M dimensional binary vector is characterized by exponentially in M coefficients which can lead to a high-dimensional problem even without the presence of covariates. Understanding the (in)dependence structure allows us to substantially improve the estimation as it allows for an effective factorization of the probability distribution. In order to estimate the probability distribution of a M dimensional binary vector, we leverage a Bahadur representation that connects the sparsity of its coefficients with independence across the components. We propose to use regularized and adversarial regularized estimators to obtain an adaptive estimator with respect to the dependence structure which allows for rates of convergence to depend on this intrinsic (lower) dimension. These estimators are needed to handle several challenges within this setting, including estimating nuisance parameters, estimating covariates, and nonseparable moment conditions. Our main results consider the presence of (low dimensional) covariates for which we propose a locally penalized estimator. We provide pointwise rates of convergence addressing several issues in the theoretical analyses as we strive for making a computationally tractable formulation. We apply our results in the estimation of causal effects with multiple binary treatments and show how our estimators can improve the finite sample performance when compared with non-adaptive estimators that try to estimate all the probabilities directly. 
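To make the dimensionality point above concrete, the toy comparison below contrasts the direct cell-frequency estimator of a joint distribution over $2^M$ binary outcomes with a factorized product-of-marginals estimate when the components happen to be independent. It only illustrates why exploiting the (in)dependence structure helps; it is not the regularized or adversarial estimator proposed in the paper.

```python
# Toy contrast: direct estimation of all 2**M joint probabilities versus a
# factorized (independence-based) estimate. Illustration only.
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, n = 10, 500
marginals = rng.uniform(0.2, 0.8, size=M)
data = (rng.uniform(size=(n, M)) < marginals).astype(int)   # independent components

cells = list(itertools.product([0, 1], repeat=M))
true_p = np.array([np.prod(np.where(c, marginals, 1 - marginals)) for c in cells])

# Direct estimator: empirical frequency of each of the 2**M cells.
counts = {c: 0 for c in cells}
for row in data:
    counts[tuple(row)] += 1
direct = np.array([counts[c] / n for c in cells])

# Factorized estimator: plug estimated marginals into the independence formula.
phat = data.mean(axis=0)
factorized = np.array([np.prod(np.where(c, phat, 1 - phat)) for c in cells])

print("TV error, direct:    ", 0.5 * np.abs(direct - true_p).sum())
print("TV error, factorized:", 0.5 * np.abs(factorized - true_p).sum())
```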
We also provide simulations that are consistent with our theoretical findings."}, "https://arxiv.org/abs/2410.15180": {"title": "HACSurv: A Hierarchical Copula-based Approach for Survival Analysis with Dependent Competing Risks", "link": "https://arxiv.org/abs/2410.15180", "description": "arXiv:2410.15180v1 Announce Type: cross \nAbstract: In survival analysis, subjects often face competing risks; for example, individuals with cancer may also suffer from heart disease or other illnesses, which can jointly influence the prognosis of risks and censoring. Traditional survival analysis methods often treat competing risks as independent and fail to accommodate the dependencies between different conditions. In this paper, we introduce HACSurv, a survival analysis method that learns hierarchical Archimedean copula structures and cause-specific survival functions from data with competing risks. HACSurv employs a flexible dependency structure using hierarchical Archimedean copulas to represent the relationships between competing risks and censoring. By capturing the dependencies between risks and censoring, HACSurv achieves better survival predictions and offers insights into risk interactions. Experiments on synthetic datasets demonstrate that our method can accurately identify the complex dependency structure and precisely predict survival distributions, whereas the compared methods exhibit significant deviations between their predictions and the true distributions. Experiments on multiple real-world datasets also demonstrate that our method achieves better survival prediction compared to previous state-of-the-art methods."}, "https://arxiv.org/abs/2410.15244": {"title": "Extensions on low-complexity DCT approximations for larger blocklengths based on minimal angle similarity", "link": "https://arxiv.org/abs/2410.15244", "description": "arXiv:2410.15244v1 Announce Type: cross \nAbstract: The discrete cosine transform (DCT) is a central tool for image and video coding because it can be related to the Karhunen-Lo\\`eve transform (KLT), which is the optimal transform in terms of retained transform coefficients and data decorrelation. In this paper, we introduce 16-, 32-, and 64-point low-complexity DCT approximations by individually minimizing the angles between the rows of the exact DCT matrix and those of the matrices induced by the approximate transforms. According to some classical figures of merit, the proposed transforms outperformed the DCT approximations already known in the literature. Fast algorithms were also developed for the low-complexity transforms, striking a good balance between performance and computational cost. Practical applications in image encoding showed the relevance of the transforms in this context. In fact, the experiments showed that the proposed transforms outperformed the known approximations in the literature for blocklengths of 16, 32, and 64."}, "https://arxiv.org/abs/2410.15491": {"title": "Structural Causality-based Generalizable Concept Discovery Models", "link": "https://arxiv.org/abs/2410.15491", "description": "arXiv:2410.15491v1 Announce Type: cross \nAbstract: The rising need for explainable deep neural network architectures has led to the use of semantic concepts as explainable units. Several approaches utilizing disentangled representation learning estimate the generative factors and utilize them as concepts for explaining DNNs. 
However, even though the generative factors for a dataset remain fixed, concepts are not fixed entities and vary based on downstream tasks. In this paper, we propose a disentanglement mechanism utilizing a variational autoencoder (VAE) for learning mutually independent generative factors for a given dataset and subsequently learning task-specific concepts using a structural causal model (SCM). Our method assumes generative factors and concepts to form a bipartite graph, with directed causal edges from generative factors to concepts. Experiments are conducted on datasets with known generative factors: D-sprites and Shapes3D. On specific downstream tasks, our proposed method successfully learns task-specific concepts which are explained well by the causal edges from the generative factors. Lastly, separate from current causal concept discovery methods, our methodology is generalizable to an arbitrary number of concepts and flexible enough to handle any downstream task."}, "https://arxiv.org/abs/2410.15564": {"title": "Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits", "link": "https://arxiv.org/abs/2410.15564", "description": "arXiv:2410.15564v1 Announce Type: cross \nAbstract: In multi-armed bandits, the tasks of reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a practical bandit inference objective that aims to label arms with means above a threshold as quickly as possible. We show that GAI can be efficiently solved by combining a reward-maximizing sampling algorithm with a novel nonparametric anytime-valid sequential test for labeling arm means. We first establish that our sequential test maintains error control under highly nonparametric assumptions and asymptotically achieves the minimax optimal e-power, a notion of power for anytime-valid tests. Next, by pairing regret-minimizing sampling schemes with our sequential test, we provide an approach that achieves minimax optimal stopping times for labeling arms with means above a threshold, under an error probability constraint. Our empirical results validate our approach beyond the minimax setting, reducing the expected number of samples for all stopping times by at least 50% across both synthetic and real-world settings."}, "https://arxiv.org/abs/2410.15648": {"title": "Linking Model Intervention to Causal Interpretation in Model Explanation", "link": "https://arxiv.org/abs/2410.15648", "description": "arXiv:2410.15648v1 Announce Type: cross \nAbstract: Intervention intuition is often used in model explanation where the intervention effect of a feature on the outcome is quantified by the difference of a model prediction when the feature value is changed from the current value to the baseline value. Such a model intervention effect of a feature is inherently associational. In this paper, we study the conditions under which an intuitive model intervention effect has a causal interpretation, i.e., when it indicates whether a feature is a direct cause of the outcome. This work links the model intervention effect to the causal interpretation of a model. Such an interpretation capability is important since it indicates whether a machine learning model is trustworthy to domain experts. 
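The model intervention effect defined above (the change in prediction when a feature is moved from its current value to a baseline) is straightforward to compute; whether it carries a causal meaning is exactly what the paper's conditions address. A minimal sketch, with an assumed baseline of zero and a generic scikit-learn regressor:

```python
# Compute the associational "model intervention effect" of a feature: the mean
# change in prediction when that feature is set to a baseline value.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def model_intervention_effect(model, X, feature, baseline=0.0):
    X_base = X.copy()
    X_base[:, feature] = baseline               # switch the feature to its baseline value
    return np.mean(model.predict(X) - model.predict(X_base))

# Toy usage: x0 drives y directly, x1 is merely correlated with x0.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=1.0, size=1000)
x1 = x0 + 0.5 * rng.normal(size=1000)
X = np.column_stack([x0, x1])
y = 2.0 * x0 + rng.normal(size=1000)

model = GradientBoostingRegressor().fit(X, y)
print("effect of x0:", model_intervention_effect(model, X, 0))
print("effect of x1:", model_intervention_effect(model, X, 1))
```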
The conditions also reveal the limitations of using a model intervention effect for causal interpretation in an environment with unobserved features. Experiments on semi-synthetic datasets have been conducted to validate theorems and show the potential for using the model intervention effect for model interpretation."}, "https://arxiv.org/abs/2410.15655": {"title": "Accounting for Missing Covariates in Heterogeneous Treatment Estimation", "link": "https://arxiv.org/abs/2410.15655", "description": "arXiv:2410.15655v1 Announce Type: cross \nAbstract: Many applications of causal inference require using treatment effects estimated on a study population to make decisions in a separate target population. We consider the challenging setting where there are covariates that are observed in the target population that were not seen in the original study. Our goal is to estimate the tightest possible bounds on heterogeneous treatment effects conditioned on such newly observed covariates. We introduce a novel partial identification strategy based on ideas from ecological inference; the main idea is that estimates of conditional treatment effects for the full covariate set must marginalize correctly when restricted to only the covariates observed in both populations. Furthermore, we introduce a bias-corrected estimator for these bounds and prove that it enjoys fast convergence rates and statistical guarantees (e.g., asymptotic normality). Experimental results on both real and synthetic data demonstrate that our framework can produce bounds that are much tighter than would otherwise be possible."}, "https://arxiv.org/abs/2410.15711": {"title": "Quantiles and Quantile Regression on Riemannian Manifolds: a measure-transportation-based approach", "link": "https://arxiv.org/abs/2410.15711", "description": "arXiv:2410.15711v1 Announce Type: cross \nAbstract: Increased attention has been given recently to the statistical analysis of variables with values on nonlinear manifolds. A natural but nontrivial problem in that context is the definition of quantile concepts. We are proposing a solution for compact Riemannian manifolds without boundaries; typical examples are polyspheres, hyperspheres, and toro\\\"{\\i}dal manifolds equipped with their Riemannian metrics. Our concept of quantile function comes along with a concept of distribution function and, in the empirical case, ranks and signs. The absence of a canonical ordering is offset by resorting to the data-driven ordering induced by optimal transports. Theoretical properties, such as the uniform convergence of the empirical distribution and conditional (and unconditional) quantile functions and distribution-freeness of ranks and signs, are established. Statistical inference applications, from goodness-of-fit to distribution-free rank-based testing, are without number. Of particular importance is the case of quantile regression with directional or toro\\\"{\\i}dal multiple output, which is given special attention in this paper. Extensive simulations are carried out to illustrate these novel concepts."}, "https://arxiv.org/abs/2410.15931": {"title": "Towards more realistic climate model outputs: A multivariate bias correction based on zero-inflated vine copulas", "link": "https://arxiv.org/abs/2410.15931", "description": "arXiv:2410.15931v1 Announce Type: cross \nAbstract: Climate model large ensembles are an essential research tool for analysing and quantifying natural climate variability and providing robust information for rare extreme events. 
The models' simulated representations of reality are susceptible to bias due to incomplete understanding of physical processes. This paper aims to correct the bias of five climate variables from the CRCM5 Large Ensemble over Central Europe at a 3-hourly temporal resolution. At this high temporal resolution, two variables, precipitation and radiation, exhibit a high share of zero inflation. We propose a novel bias-correction method, VBC (Vine copula bias correction), that models and transfers multivariate dependence structures for zero-inflated margins in the data from its error-prone model domain to a reference domain. VBC estimates the model and reference distribution using vine copulas and corrects the model distribution via (inverse) Rosenblatt transformation. To deal with the variables' zero-inflated nature, we develop a new vine density decomposition that accommodates such variables and employs an adequately randomized version of the Rosenblatt transform. This novel approach allows for more accurate modelling of multivariate zero-inflated climate data. Compared with state-of-the-art correction methods, VBC is generally the best-performing correction and the most accurate method for correcting zero-inflated events."}, "https://arxiv.org/abs/2410.15938": {"title": "Quantifying world geography as seen through the lens of Soviet propaganda", "link": "https://arxiv.org/abs/2410.15938", "description": "arXiv:2410.15938v1 Announce Type: cross \nAbstract: Cultural data typically contains a variety of biases. In particular, geographical locations are unequally portrayed in media, creating a distorted representation of the world. Identifying and measuring such biases is crucial for understanding both the data and the socio-cultural processes that have produced them. Here we suggest measuring geographical biases in a large historical news media corpus by studying the representation of cities. Leveraging ideas of quantitative urban science, we develop a mixed quantitative-qualitative procedure, which allows us to get robust quantitative estimates of the biases. These biases can be further qualitatively interpreted, resulting in a hermeneutic feedback loop. We apply this procedure to a corpus of the Soviet newsreel series 'Novosti Dnya' (News of the Day) and show that city representation grows super-linearly with city size, and is further biased by city specialization and geographical location. This allows us to systematically identify geographical regions that are explicitly or covertly emphasized by Soviet propaganda and to quantify their importance."}, "https://arxiv.org/abs/1903.00037": {"title": "Sequential and Simultaneous Distance-based Dimension Reduction", "link": "https://arxiv.org/abs/1903.00037", "description": "arXiv:1903.00037v3 Announce Type: replace \nAbstract: This paper introduces a method called Sequential and Simultaneous Distance-based Dimension Reduction ($S^2D^2R$) that performs simultaneous dimension reduction for a pair of random vectors based on Distance Covariance (dCov). Compared with Sufficient Dimension Reduction (SDR) and Canonical Correlation Analysis (CCA)-based approaches, $S^2D^2R$ is a model-free approach that does not impose dimensional or distributional restrictions on variables and is more sensitive to nonlinear relationships. Theoretically, we establish a non-asymptotic error bound to guarantee the performance of $S^2D^2R$. Numerically, $S^2D^2R$ performs comparably to or better than other state-of-the-art algorithms and is computationally faster. 
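$S^2D^2R$ above is built on distance covariance. For reference, the snippet below computes the standard biased sample dCov from double-centered pairwise distance matrices; the sequential/simultaneous projection search of the paper is not reproduced here.

```python
# Standard biased sample distance covariance via double-centered distance matrices.
import numpy as np
from scipy.spatial.distance import cdist

def distance_covariance(X, Y):
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)

    def centered(D):
        return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

    A = centered(cdist(X, X))
    B = centered(cdist(Y, Y))
    return np.sqrt(np.clip(np.mean(A * B), 0.0, None))  # clip guards round-off

# dCov detects the nonlinear dependence that correlation misses.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = x ** 2 + 0.1 * rng.normal(size=(500, 1))
print(distance_covariance(x, y), np.corrcoef(x.ravel(), y.ravel())[0, 1])
```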
All codes of our $S^2D^2R$ method can be found on Github, including an R package named S2D2R."}, "https://arxiv.org/abs/2204.08100": {"title": "Multi-Model Subset Selection", "link": "https://arxiv.org/abs/2204.08100", "description": "arXiv:2204.08100v3 Announce Type: replace \nAbstract: The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the L0-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by \"blackbox\" multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding L0-penalized regression models by extending recent developments in L0-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of predictors and the response variable. Empirical studies and theoretical knowledge about ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. The optimization algorithms are implemented in publicly available software packages."}, "https://arxiv.org/abs/2206.09819": {"title": "Double soft-thresholded model for multi-group scalar on vector-valued image regression", "link": "https://arxiv.org/abs/2206.09819", "description": "arXiv:2206.09819v4 Announce Type: replace \nAbstract: In this paper, we develop a novel spatial variable selection method for scalar on vector-valued image regression in a multi-group setting. Here, 'vector-valued image' refers to the imaging datasets that contain vector-valued information at each pixel/voxel location, such as in RGB color images, multimodal medical images, DTI imaging, etc. The focus of this work is to identify the spatial locations in the image having an important effect on the scalar outcome measure. Specifically, the overall effect of each voxel is of interest. We thus develop a novel shrinkage prior by soft-thresholding the \\ell_2 norm of a latent multivariate Gaussian process. It will allow us to estimate sparse and piecewise-smooth spatially varying vector-valued regression coefficient functions. For posterior inference, an efficient MCMC algorithm is developed. We establish the posterior contraction rate for parameter estimation and consistency for variable selection of the proposed Bayesian model, assuming that the true regression coefficients are Holder smooth. 
Finally, we demonstrate the advantages of the proposed method in simulation studies and further illustrate in an ADNI dataset for modeling MMSE scores based on DTI-based vector-valued imaging markers."}, "https://arxiv.org/abs/2208.04669": {"title": "Boosting with copula-based components", "link": "https://arxiv.org/abs/2208.04669", "description": "arXiv:2208.04669v2 Announce Type: replace \nAbstract: The authors propose new additive models for binary outcomes, where the components are copula-based regression models (Noh et al, 2013), and designed such that the model may capture potentially complex interaction effects. The models do not require discretisation of continuous covariates, and are therefore suitable for problems with many such covariates. A fitting algorithm, and efficient procedures for model selection and evaluation of the components are described. Software is provided in the R-package copulaboost. Simulations and illustrations on data sets indicate that the method's predictive performance is either better than or comparable to the other methods."}, "https://arxiv.org/abs/2305.05106": {"title": "Mixed effects models for extreme value index regression", "link": "https://arxiv.org/abs/2305.05106", "description": "arXiv:2305.05106v4 Announce Type: replace \nAbstract: Extreme value theory (EVT) provides an elegant mathematical tool for the statistical analysis of rare events. When data are collected from multiple population subgroups, because some subgroups may have less data available for extreme value analysis, a scientific interest of many researchers would be to improve the estimates obtained directly from each subgroup. To achieve this, we incorporate the mixed effects model (MEM) into the regression technique in EVT. In small area estimation, the MEM has attracted considerable attention as a primary tool for producing reliable estimates for subgroups with small sample sizes, i.e., ``small areas.'' The key idea of MEM is to incorporate information from all subgroups into a single model and to borrow strength from all subgroups to improve estimates for each subgroup. Using this property, in extreme value analysis, the MEM may contribute to reducing the bias and variance of the direct estimates from each subgroup. This prompts us to evaluate the effectiveness of the MEM for EVT through theoretical studies and numerical experiments, including its application to the risk assessment of a number of stocks in the cryptocurrency market."}, "https://arxiv.org/abs/2307.12892": {"title": "A Statistical View of Column Subset Selection", "link": "https://arxiv.org/abs/2307.12892", "description": "arXiv:2307.12892v2 Announce Type: replace \nAbstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. 
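Column subset selection as discussed above is often approximated in practice by a QR factorization with column pivoting, which greedily picks representative columns. The sketch below shows that standard baseline; it is not the likelihood-based selection or the subset-size test developed in the paper.

```python
# Greedy column subset selection via QR with column pivoting (standard baseline).
import numpy as np
from scipy.linalg import qr

def css_pivoted_qr(X, k):
    """Return the indices of k columns selected by QR with column pivoting."""
    _, _, piv = qr(X - X.mean(axis=0), mode="economic", pivoting=True)
    return piv[:k]

# Toy usage: columns 3 and 4 are near-duplicates of columns 0 and 1, so the
# selected subset should keep one representative from each redundant pair.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([base,
                     base[:, 0] + 0.01 * rng.normal(size=200),
                     base[:, 1] + 0.01 * rng.normal(size=200)])
print(css_pivoted_qr(X, k=3))
```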
Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework."}, "https://arxiv.org/abs/2309.02631": {"title": "A Bayesian Nonparametric Method to Adjust for Unmeasured Confounding with Negative Controls", "link": "https://arxiv.org/abs/2309.02631", "description": "arXiv:2309.02631v2 Announce Type: replace \nAbstract: Unmeasured confounding bias threatens the validity of observational studies. While sensitivity analyses and study designs have been proposed to address this issue, they often overlook the growing availability of auxiliary data. Using negative controls from these data is a promising new approach to reduce unmeasured confounding bias. In this article, we develop a Bayesian nonparametric method to estimate a causal exposure-response function (CERF) leveraging information from negative controls to adjust for unmeasured confounding. We model the CERF as a mixture of linear models. This strategy captures the potential nonlinear shape of CERFs while maintaining computational efficiency, and it leverages closed-form results that hold under the linear model assumption. We assess the performance of our method through simulation studies. We found that the proposed method can recover the true shape of the CERF in the presence of unmeasured confounding under assumptions. To show the practical utility of our approach, we apply it to adjust for a possible unmeasured confounder when evaluating the relationship between long-term exposure to ambient $PM_{2.5}$ and cardiovascular hospitalization rates among the elderly in the continental US. We implement our estimation procedure in open-source software and have made the code publicly available to ensure reproducibility."}, "https://arxiv.org/abs/2401.00354": {"title": "Maximum Likelihood Estimation under the Emax Model: Existence, Geometry and Efficiency", "link": "https://arxiv.org/abs/2401.00354", "description": "arXiv:2401.00354v2 Announce Type: replace \nAbstract: This study focuses on the estimation of the Emax dose-response model, a widely utilized framework in clinical trials, agriculture, and environmental experiments. Existing challenges in obtaining maximum likelihood estimates (MLE) for model parameters are often ascribed to computational issues but, in reality, stem from the absence of a MLE. Our contribution provides a new understanding and control of all the experimental situations that practitioners might face, guiding them in the estimation process. We derive the exact MLE for a three-point experimental design and we identify the two scenarios where the MLE fails. To address these challenges, we propose utilizing Firth's modified score, providing its analytical expression as a function of the experimental design. Through a simulation study, we demonstrate that, in one of the problematic cases, the Firth modification yields a finite estimate. For the remaining case, we introduce a design-augmentation strategy akin to a hypothesis test."}, "https://arxiv.org/abs/2302.12093": {"title": "Experimenting under Stochastic Congestion", "link": "https://arxiv.org/abs/2302.12093", "description": "arXiv:2302.12093v4 Announce Type: replace-cross \nAbstract: We study randomized experiments in a service system when stochastic congestion can arise from temporarily limited supply or excess demand. 
Such congestion gives rise to cross-unit interference between the waiting customers, and analytic strategies that do not account for this interference may be biased. In current practice, one of the most widely used ways to address stochastic congestion is to use switchback experiments that alternatively turn a target intervention on and off for the whole system. We find, however, that under a queueing model for stochastic congestion, the standard way of analyzing switchbacks is inefficient, and that estimators that leverage the queueing model can be materially more accurate. Additionally, we show how the queueing model enables estimation of total policy gradients from unit-level randomized experiments, thus giving practitioners an alternative experimental approach they can use without needing to pre-commit to a fixed switchback length before data collection."}, "https://arxiv.org/abs/2306.15709": {"title": "Privacy-Preserving Community Detection for Locally Distributed Multiple Networks", "link": "https://arxiv.org/abs/2306.15709", "description": "arXiv:2306.15709v2 Announce Type: replace-cross \nAbstract: Modern multi-layer networks are commonly stored and analyzed in a local and distributed fashion because of the privacy, ownership, and communication costs. The literature on the model-based statistical methods for community detection based on these data is still limited. This paper proposes a new method for consensus community detection and estimation in a multi-layer stochastic block model using locally stored and computed network data with privacy protection. A novel algorithm named privacy-preserving Distributed Spectral Clustering (ppDSC) is developed. To preserve the edges' privacy, we adopt the randomized response (RR) mechanism to perturb the network edges, which satisfies the strong notion of differential privacy. The ppDSC algorithm is performed on the squared RR-perturbed adjacency matrices to prevent possible cancellation of communities among different layers. To remove the bias incurred by RR and the squared network matrices, we develop a two-step bias-adjustment procedure. Then we perform eigen-decomposition on the debiased matrices, aggregation of the local eigenvectors using an orthogonal Procrustes transformation, and k-means clustering. We provide theoretical analysis on the statistical errors of ppDSC in terms of eigen-vector estimation. In addition, the blessings and curses of network heterogeneity are well-explained by our bounds."}, "https://arxiv.org/abs/2407.01032": {"title": "Overcoming Common Flaws in the Evaluation of Selective Classification Systems", "link": "https://arxiv.org/abs/2407.01032", "description": "arXiv:2407.01032v2 Announce Type: replace-cross \nAbstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. 
We propose the Area under the Generalized Risk Coverage curve ($\\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets."}, "https://arxiv.org/abs/2410.16391": {"title": "Causal Data Fusion for Panel Data without Pre-Intervention Period", "link": "https://arxiv.org/abs/2410.16391", "description": "arXiv:2410.16391v1 Announce Type: new \nAbstract: Traditional panel data causal inference frameworks, such as difference-in-differences and synthetic control methods, rely on pre-intervention data to estimate counterfactuals, which may not be available in real-world settings when interventions are implemented in response to sudden events. In this paper, we introduce two data fusion methods for causal inference from panel data in scenarios where pre-intervention data is unavailable. These methods leverage auxiliary reference domains with related panel data to estimate causal effects in the target domain, overcoming the limitations imposed by the absence of pre-intervention data. We show the efficacy of these methods by obtaining converging bounds on the absolute bias as well as through simulations, showing their robustness in a variety of panel data settings. Our findings provide a framework for applying causal inference in urgent and data-constrained environments, such as public health crises or epidemiological shocks."}, "https://arxiv.org/abs/2410.16477": {"title": "Finite-Sample and Distribution-Free Fair Classification: Optimal Trade-off Between Excess Risk and Fairness, and the Cost of Group-Blindness", "link": "https://arxiv.org/abs/2410.16477", "description": "arXiv:2410.16477v1 Announce Type: new \nAbstract: Algorithmic fairness in machine learning has recently garnered significant attention. However, two pressing challenges remain: (1) The fairness guarantees of existing fair classification methods often rely on specific data distribution assumptions and large sample sizes, which can lead to fairness violations when the sample size is moderate-a common situation in practice. (2) Due to legal and societal considerations, using sensitive group attributes during decision-making (referred to as the group-blind setting) may not always be feasible.\n In this work, we quantify the impact of enforcing algorithmic fairness and group-blindness in binary classification under group fairness constraints. Specifically, we propose a unified framework for fair classification that provides distribution-free and finite-sample fairness guarantees with controlled excess risk. This framework is applicable to various group fairness notions in both group-aware and group-blind scenarios. Furthermore, we establish a minimax lower bound on the excess risk, showing the minimax optimality of our proposed algorithm up to logarithmic factors. 
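The selective-classification metric proposed above is an area under a risk-coverage-type curve. As background, the sketch below computes the classical construction (average selective risk across coverage levels obtained by thresholding a confidence score); the generalized risk term that defines AUGRC in the paper may be weighted differently.

```python
# Classical risk-coverage curve area from confidence scores and per-sample failures.
import numpy as np

def area_under_risk_coverage(confidence, failure):
    """confidence: higher = more confident; failure: 1 if the prediction is wrong."""
    order = np.argsort(-confidence)                 # accept the most confident samples first
    failures = np.asarray(failure, dtype=float)[order]
    n = len(failures)
    coverage = np.arange(1, n + 1) / n
    selective_risk = np.cumsum(failures) / np.arange(1, n + 1)
    return np.trapz(selective_risk, coverage)

# Toy usage: failures concentrated among low-confidence samples give a small area.
rng = np.random.default_rng(0)
conf = rng.uniform(size=2000)
fail = (rng.uniform(size=2000) < 0.4 * (1 - conf)).astype(int)
print(area_under_risk_coverage(conf, fail))
```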
Through extensive simulation studies and real data analysis, we further demonstrate the superior performance of our algorithm compared to existing methods, and provide empirical support for our theoretical findings."}, "https://arxiv.org/abs/2410.16526": {"title": "A Dynamic Spatiotemporal and Network ARCH Model with Common Factors", "link": "https://arxiv.org/abs/2410.16526", "description": "arXiv:2410.16526v1 Announce Type: new \nAbstract: We introduce a dynamic spatiotemporal volatility model that extends traditional approaches by incorporating spatial, temporal, and spatiotemporal spillover effects, along with volatility-specific observed and latent factors. The model offers a more general network interpretation, making it applicable for studying various types of network spillovers. The primary innovation lies in incorporating volatility-specific latent factors into the dynamic spatiotemporal volatility model. Using Bayesian estimation via the Markov Chain Monte Carlo (MCMC) method, the model offers a robust framework for analyzing the spatial, temporal, and spatiotemporal effects of a log-squared outcome variable on its volatility. We recommend using the deviance information criterion (DIC) and a regularized Bayesian MCMC method to select the number of relevant factors in the model. The model's flexibility is demonstrated through two applications: a spatiotemporal model applied to the U.S. housing market and another applied to financial stock market networks, both highlighting the model's ability to capture varying degrees of interconnectedness. In both applications, we find strong spatial/network interactions with relatively stronger spillover effects in the stock market."}, "https://arxiv.org/abs/2410.16577": {"title": "High-dimensional Grouped-regression using Bayesian Sparse Projection-posterior", "link": "https://arxiv.org/abs/2410.16577", "description": "arXiv:2410.16577v1 Announce Type: new \nAbstract: We consider a novel Bayesian approach to estimation, uncertainty quantification, and variable selection for a high-dimensional linear regression model under sparsity. The number of predictors can be nearly exponentially large relative to the sample size. We put a conjugate normal prior initially disregarding sparsity, but for making an inference, instead of the original multivariate normal posterior, we use the posterior distribution induced by a map transforming the vector of regression coefficients to a sparse vector obtained by minimizing the sum of squares of deviations plus a suitably scaled $\\ell_1$-penalty on the vector. We show that the resulting sparse projection-posterior distribution contracts around the true value of the parameter at the optimal rate adapted to the sparsity of the vector. We show that the true sparsity structure gets a large sparse projection-posterior probability. We further show that an appropriately recentred credible ball has the correct asymptotic frequentist coverage. Finally, we describe how the computational burden can be distributed to many machines, each dealing with only a small fraction of the whole dataset. We conduct a comprehensive simulation study under a variety of settings and found that the proposed method performs well for finite sample sizes. We also apply the method to several real datasets, including the ADNI data, and compare its performance with the state-of-the-art methods. 
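A simplified sketch of the sparse projection-posterior idea described above: sample from a conjugate (non-sparse) normal posterior and push every draw through an $\ell_1$-penalized projection. If the deviations are measured directly in coefficient space, the projection reduces to componentwise soft-thresholding, which is the assumption made below; the paper's map may instead weight deviations through the design matrix, and the fixed `sigma2`, `tau2`, and `lam` values are illustrative.

```python
# Sketch: conjugate normal posterior draws followed by an l1 (soft-threshold) projection.
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def sparse_projection_posterior(X, y, sigma2=1.0, tau2=10.0, lam=0.1, ndraws=1000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Conjugate normal posterior for beta under a N(0, tau2 I) prior.
    prec = X.T @ X / sigma2 + np.eye(p) / tau2
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ y / sigma2
    draws = rng.multivariate_normal(mean, cov, size=ndraws)
    return soft_threshold(draws, lam)               # projected (sparse) draws

# Toy usage: only the first coefficient is nonzero in truth.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 1.5 + rng.normal(size=100)
proj = sparse_projection_posterior(X, y, lam=0.2)
print("posterior selection frequencies:", (proj != 0).mean(axis=0))
```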
We implemented the method in the \\texttt{R} package called \\texttt{sparseProj}, and all computations have been carried out using this package."}, "https://arxiv.org/abs/2410.16608": {"title": "Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective", "link": "https://arxiv.org/abs/2410.16608", "description": "arXiv:2410.16608v1 Announce Type: new \nAbstract: Visualizing high-dimensional data is an important routine for understanding biomedical data and interpreting deep learning models. Neighbor embedding methods, such as t-SNE, UMAP, and LargeVis, among others, are a family of popular visualization methods which reduce high-dimensional data to two dimensions. However, recent studies suggest that these methods often produce visual artifacts, potentially leading to incorrect scientific conclusions. Recognizing that the current limitation stems from a lack of data-independent notions of embedding maps, we introduce a novel conceptual and computational framework, LOO-map, that learns the embedding maps based on a classical statistical idea known as the leave-one-out. LOO-map extends the embedding over a discrete set of input points to the entire input space, enabling a systematic assessment of map continuity, and thus the reliability of the visualizations. We find for many neighbor embedding methods, their embedding maps can be intrinsically discontinuous. The discontinuity induces two types of observed map distortion: ``overconfidence-inducing discontinuity,\" which exaggerates cluster separation, and ``fracture-inducing discontinuity,\" which creates spurious local structures. Building upon LOO-map, we propose two diagnostic point-wise scores -- perturbation score and singularity score -- to address these limitations. These scores can help identify unreliable embedding points, detect out-of-distribution data, and guide hyperparameter selection. Our approach is flexible and works as a wrapper around many neighbor embedding algorithms. We test our methods across multiple real-world datasets from computer vision and single-cell omics to demonstrate their effectiveness in enhancing the interpretability and accuracy of visualizations."}, "https://arxiv.org/abs/2410.16627": {"title": "Enhancing Computational Efficiency in High-Dimensional Bayesian Analysis: Applications to Cancer Genomics", "link": "https://arxiv.org/abs/2410.16627", "description": "arXiv:2410.16627v1 Announce Type: new \nAbstract: In this study, we present a comprehensive evaluation of the Two-Block Gibbs (2BG) sampler as a robust alternative to the traditional Three-Block Gibbs (3BG) sampler in Bayesian shrinkage models. Through extensive simulation studies, we demonstrate that the 2BG sampler exhibits superior computational efficiency and faster convergence rates, particularly in high-dimensional settings where the ratio of predictors to samples is large. We apply these findings to real-world data from the NCI-60 cancer cell panel, leveraging gene expression data to predict protein expression levels. Our analysis incorporates feature selection, identifying key genes that influence protein expression while shedding light on the underlying genetic mechanisms in cancer cells. The results indicate that the 2BG sampler not only produces more effective samples than the 3BG counterpart but also significantly reduces computational costs, thereby enhancing the applicability of Bayesian methods in high-dimensional data analysis. 
This contribution extends the understanding of shrinkage techniques in statistical modeling and offers valuable insights for cancer genomics research."}, "https://arxiv.org/abs/2410.16656": {"title": "Parsimonious Dynamic Mode Decomposition: A Robust and Automated Approach for Optimally Sparse Mode Selection in Complex Systems", "link": "https://arxiv.org/abs/2410.16656", "description": "arXiv:2410.16656v1 Announce Type: new \nAbstract: This paper introduces the Parsimonious Dynamic Mode Decomposition (parsDMD), a novel algorithm designed to automatically select an optimally sparse subset of dynamic modes for both spatiotemporal and purely temporal data. By incorporating time-delay embedding and leveraging Orthogonal Matching Pursuit (OMP), parsDMD ensures robustness against noise and effectively handles complex, nonlinear dynamics. The algorithm is validated on a diverse range of datasets, including standing wave signals, identifying hidden dynamics, fluid dynamics simulations (flow past a cylinder and transonic buffet), and atmospheric sea-surface temperature (SST) data. ParsDMD addresses a significant limitation of the traditional sparsity-promoting DMD (spDMD), which requires manual tuning of sparsity parameters through a rigorous trial-and-error process to balance between single-mode and all-mode solutions. In contrast, parsDMD autonomously determines the optimally sparse subset of modes without user intervention, while maintaining minimal computational complexity. Comparative analyses demonstrate that parsDMD consistently outperforms spDMD by providing more accurate mode identification and effective reconstruction in noisy environments. These advantages render parsDMD an effective tool for real-time diagnostics, forecasting, and reduced-order model construction across various disciplines."}, "https://arxiv.org/abs/2410.16716": {"title": "A class of modular and flexible covariate-based covariance functions for nonstationary spatial modeling", "link": "https://arxiv.org/abs/2410.16716", "description": "arXiv:2410.16716v1 Announce Type: new \nAbstract: The assumptions of stationarity and isotropy often stated over spatial processes have not aged well during the last two decades, partly explained by the combination of computational developments and the increasing availability of high-resolution spatial data. While a plethora of approaches have been developed to relax these assumptions, it is often a costly tradeoff between flexibility and a diversity of computational challenges. In this paper, we present a class of covariance functions that relies on fixed, observable spatial information that provides a convenient tradeoff while offering an extra layer of numerical and visual representation of the flexible spatial dependencies. This model allows for separate parametric structures for different sources of nonstationarity, such as marginal standard deviation, geometric anisotropy, and smoothness. It simplifies to a Mat\\'ern covariance function in its basic form and is adaptable for large datasets, enhancing flexibility and computational efficiency. 
We analyze the capabilities of the presented model through simulation studies and an application to Swiss precipitation data."}, "https://arxiv.org/abs/2410.16722": {"title": "Robust Variable Selection for High-dimensional Regression with Missing Data and Measurement Errors", "link": "https://arxiv.org/abs/2410.16722", "description": "arXiv:2410.16722v1 Announce Type: new \nAbstract: In this paper, we focus on robust variable selection in the presence of missing data and measurement errors, both of which can distort the observed data distribution. We propose an exponential loss function with a tuning parameter for data affected by missingness and measurement error; by adjusting the parameter, the loss function remains robust under a wide range of data distributions. We use inverse probability weighting and additive error models to address missing data and measurement errors, and we find that the Atan penalty performs better. Monte Carlo simulations are used to assess the validity of the robust variable selection procedure, and the findings are validated on a breast cancer dataset."}, "https://arxiv.org/abs/2410.16806": {"title": "Simplified vine copula models: state of science and affairs", "link": "https://arxiv.org/abs/2410.16806", "description": "arXiv:2410.16806v1 Announce Type: new \nAbstract: Vine copula models have become highly popular practical tools for modeling multivariate dependencies. To maintain tractability, a commonly employed simplifying assumption is that conditional copulas remain unchanged by the conditioning variables. This assumption has sparked a somewhat polarizing debate within our community. The fact that much of this dispute occurs outside the public record has placed the field in an unfortunate position, impeding scientific progress. In this article, I will review what we know about the flexibility and limitations of simplified vine copula models, explore the broader implications, and offer my own, hopefully reconciling, perspective on the issue."}, "https://arxiv.org/abs/2410.16903": {"title": "Second-order characteristics for spatial point processes with graph-valued marks", "link": "https://arxiv.org/abs/2410.16903", "description": "arXiv:2410.16903v1 Announce Type: new \nAbstract: The immense progress in data collection and storage capacities has yielded rather complex, challenging spatial event-type data, where each event location is augmented by a non-simple mark. Despite the growing interest in analysing such complex event patterns, the methodology for such analysis is not embedded well in the literature. In particular, the literature lacks statistical methods to analyse marks which are characterised by an inherent relational structure, i.e.\\ where the mark is graph-valued. Motivated by epidermal nerve fibre data, we introduce different mark summary characteristics, which investigate the average variation or association between pairs of graph-valued marks, and apply some of the methods to the nerve data."}, "https://arxiv.org/abs/2410.16998": {"title": "Identifying Conduct Parameters with Separable Demand: A Counterexample to Lau (1982)", "link": "https://arxiv.org/abs/2410.16998", "description": "arXiv:2410.16998v1 Announce Type: new \nAbstract: We provide a counterexample to the conduct parameter identification result established in the foundational work of Lau (1982), which generalizes the identification theorem of Bresnahan (1982) by relaxing the linearity assumptions. 
We identify a separable demand function that still permits identification and validate this case both theoretically and through numerical simulations."}, "https://arxiv.org/abs/2410.17046": {"title": "Mesoscale two-sample testing for network data", "link": "https://arxiv.org/abs/2410.17046", "description": "arXiv:2410.17046v1 Announce Type: new \nAbstract: Networks arise naturally in many scientific fields as a representation of pairwise connections. Statistical network analysis has most often considered a single large network, but it is common in a number of applications, for example, neuroimaging, to observe multiple networks on a shared node set. When these networks are grouped by case-control status or another categorical covariate, the classical statistical question of two-sample comparison arises. In this work, we address the problem of testing for statistically significant differences in a given arbitrary subset of connections. This general framework allows an analyst to focus on a single node, a specific region of interest, or compare whole networks. Our ability to conduct \"mesoscale\" testing on a meaningful group of edges is particularly relevant for applications such as neuroimaging and distinguishes our approach from prior work, which tends to focus either on a single node or the whole network. In this mesoscale setting, we develop statistically sound projection-based tests for two-sample comparison in both weighted and binary edge networks. Our approach can leverage all available network information, and learn informative projections which improve testing power when low-dimensional latent network structure is present."}, "https://arxiv.org/abs/2410.17105": {"title": "General Seemingly Unrelated Local Projections", "link": "https://arxiv.org/abs/2410.17105", "description": "arXiv:2410.17105v1 Announce Type: new \nAbstract: We provide a framework for efficiently estimating impulse response functions with Local Projections (LPs). Our approach offers a Bayesian treatment for LPs with Instrumental Variables, accommodating multiple shocks and instruments per shock, accounts for autocorrelation in multi-step forecasts by jointly modeling all LPs as a seemingly unrelated system of equations, defines a flexible yet parsimonious joint prior for impulse responses based on a Gaussian Process, allows for joint inference about the entire vector of impulse responses, and uses all available data across horizons by imputing missing values."}, "https://arxiv.org/abs/2410.17108": {"title": "A general framework for probabilistic model uncertainty", "link": "https://arxiv.org/abs/2410.17108", "description": "arXiv:2410.17108v1 Announce Type: new \nAbstract: Existing approaches to model uncertainty typically either compare models using a quantitative model selection criterion or evaluate posterior model probabilities having set a prior. In this paper, we propose an alternative strategy which views missing observations as the source of model uncertainty, where the true model would be identified with the complete data. To quantify model uncertainty, it is then necessary to provide a probability distribution for the missing observations conditional on what has been observed. This can be set sequentially using one-step-ahead predictive densities, which recursively sample from the best model according to some consistent model selection criterion. Repeated predictive sampling of the missing data, to give a complete dataset and hence a best model each time, provides our measure of model uncertainty. 
This approach bypasses the need for subjective prior specification or integration over parameter spaces, addressing issues with standard methods such as the Bayes factor. Predictive resampling also suggests an alternative view of hypothesis testing as a decision problem based on a population statistic, where we directly index the probabilities of competing models. In addition to hypothesis testing, we provide illustrations from density estimation and variable selection, demonstrating our approach on a range of standard problems."}, "https://arxiv.org/abs/2410.17153": {"title": "A Bayesian Perspective on the Maximum Score Problem", "link": "https://arxiv.org/abs/2410.17153", "description": "arXiv:2410.17153v1 Announce Type: new \nAbstract: This paper presents a Bayesian inference framework for a linear index threshold-crossing binary choice model that satisfies a median independence restriction. The key idea is that the model is observationally equivalent to a probit model with nonparametric heteroskedasticity. Consequently, Gibbs sampling techniques from Albert and Chib (1993) and Chib and Greenberg (2013) lead to a computationally attractive Bayesian inference procedure in which a Gaussian process forms a conditionally conjugate prior for the natural logarithm of the skedastic function."}, "https://arxiv.org/abs/2410.16307": {"title": "Functional Clustering of Discount Functions for Behavioral Investor Profiling", "link": "https://arxiv.org/abs/2410.16307", "description": "arXiv:2410.16307v1 Announce Type: cross \nAbstract: Classical finance models are based on the premise that investors act rationally and utilize all available information when making portfolio decisions. However, these models often fail to capture the anomalies observed in intertemporal choices and decision-making under uncertainty, particularly when accounting for individual differences in preferences and consumption patterns. Such limitations hinder traditional finance theory's ability to address key questions like: How do personal preferences shape investment choices? What drives investor behaviour? And how do individuals select their portfolios? One prominent contribution is Pompian's model of four Behavioral Investor Types (BITs), which links behavioural finance studies with Keirsey's temperament theory, highlighting the role of personality in financial decision-making. Yet, traditional parametric models struggle to capture how these distinct temperaments influence intertemporal decisions, such as how individuals evaluate trade-offs between present and future outcomes. To address this gap, the present study employs Functional Data Analysis (FDA) to specifically investigate temporal discounting behaviours revealing nuanced patterns in how different temperaments perceive and manage uncertainty over time. Our findings show heterogeneity within each temperament, suggesting that investor profiles are far more diverse than previously thought. This refined classification provides deeper insights into the role of temperament in shaping intertemporal financial decisions, offering practical implications for financial advisors to better tailor strategies to individual risk preferences and decision-making styles."}, "https://arxiv.org/abs/2410.16333": {"title": "Conformal Predictive Portfolio Selection", "link": "https://arxiv.org/abs/2410.16333", "description": "arXiv:2410.16333v1 Announce Type: cross \nAbstract: This study explores portfolio selection using predictive models for portfolio returns. 
Portfolio selection is a fundamental task in finance, and various methods have been developed to achieve this goal. For example, the mean-variance approach constructs portfolios by balancing the trade-off between the mean and variance of asset returns, while the quantile-based approach optimizes portfolios by accounting for tail risk. These traditional methods often rely on distributional information estimated from historical data. However, a key concern is the uncertainty of future portfolio returns, which may not be fully captured by simple reliance on historical data, such as using the sample average. To address this, we propose a framework for predictive portfolio selection using conformal inference, called Conformal Predictive Portfolio Selection (CPPS). Our approach predicts future portfolio returns, computes corresponding prediction intervals, and selects the desirable portfolio based on these intervals. The framework is flexible and can accommodate a variety of predictive models, including autoregressive (AR) models, random forests, and neural networks. We demonstrate the effectiveness of our CPPS framework using an AR model and validate its performance through empirical studies, showing that it provides superior returns compared to simpler strategies."}, "https://arxiv.org/abs/2410.16419": {"title": "Data Augmentation of Multivariate Sensor Time Series using Autoregressive Models and Application to Failure Prognostics", "link": "https://arxiv.org/abs/2410.16419", "description": "arXiv:2410.16419v1 Announce Type: cross \nAbstract: This work presents a novel data augmentation solution for non-stationary multivariate time series and its application to failure prognostics. The method extends previous work from the authors which is based on time-varying autoregressive processes. It can be employed to extract key information from a limited number of samples and generate new synthetic samples in a way that potentially improves the performance of PHM solutions. This is especially valuable in situations of data scarcity which are very usual in PHM, especially for failure prognostics. The proposed approach is tested based on the CMAPSS dataset, commonly employed for prognostics experiments and benchmarks. An AutoML approach from PHM literature is employed for automating the design of the prognostics solution. The empirical evaluation provides evidence that the proposed method can substantially improve the performance of PHM solutions."}, "https://arxiv.org/abs/2410.16523": {"title": "Efficient Neural Network Training via Subset Pretraining", "link": "https://arxiv.org/abs/2410.16523", "description": "arXiv:2410.16523v1 Announce Type: cross \nAbstract: In training neural networks, it is common practice to use partial gradients computed over batches, mostly very small subsets of the training set. This approach is motivated by the argument that such a partial gradient is close to the true one, with precision growing only with the square root of the batch size. A theoretical justification is with the help of stochastic approximation theory. However, the conditions for the validity of this theory are not satisfied in the usual learning rate schedules. Batch processing is also difficult to combine with efficient second-order optimization methods. This proposal is based on another hypothesis: the loss minimum of the training set can be expected to be well-approximated by the minima of its subsets. 
Such subset minima can be computed in a fraction of the time necessary for optimizing over the whole training set. This hypothesis has been tested with the help of the MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks, optionally extended by training data augmentation. The experiments have confirmed that results equivalent to conventional training can be reached. In summary, even small subsets are representative if the overdetermination ratio for the given model parameter set sufficiently exceeds unity. The computing expense can be reduced to a tenth or less."}, "https://arxiv.org/abs/2410.16554": {"title": "On the breakdown point of transport-based quantiles", "link": "https://arxiv.org/abs/2410.16554", "description": "arXiv:2410.16554v1 Announce Type: cross \nAbstract: Recent work has used optimal transport ideas to generalize the notion of (center-outward) quantiles to dimension $d\\geq 2$. We study the robustness properties of these transport-based quantiles by deriving their breakdown point, roughly, the smallest amount of contamination required to make these quantiles take arbitrarily aberrant values. We prove that the transport median defined in Chernozhukov et al.~(2017) and Hallin et al.~(2021) has breakdown point of $1/2$. Moreover, a point in the transport depth contour of order $\\tau\\in [0,1/2]$ has breakdown point of $\\tau$. This shows that the multivariate transport depth shares the same breakdown properties as its univariate counterpart. Our proof relies on a general argument connecting the breakdown point of transport maps evaluated at a point to the Tukey depth of that point in the reference measure."}, "https://arxiv.org/abs/2410.16702": {"title": "HDNRA: An R package for HDLSS location testing with normal-reference approaches", "link": "https://arxiv.org/abs/2410.16702", "description": "arXiv:2410.16702v1 Announce Type: cross \nAbstract: The challenge of location testing for high-dimensional data in statistical inference is notable. Existing literature suggests various methods, many of which impose strong regularity conditions on underlying covariance matrices to ensure asymptotic normal distribution of test statistics, leading to difficulties in size control. To address this, a recent set of tests employing the normal-reference approach has been proposed. Moreover, the availability of tests for high-dimensional location testing in R packages implemented in C++ is limited. This paper introduces the latest methods utilizing normal-reference approaches to test the equality of mean vectors in high-dimensional samples with potentially different covariance matrices. We present an R package named HDNRA to illustrate the implementation of these tests, extending beyond the two-sample problem to encompass general linear hypothesis testing (GLHT). The package offers easy and user-friendly access to these tests, with its core implemented in C++ using Rcpp, OpenMP and RcppArmadillo for efficient execution. 
Theoretical properties of these normal-reference tests are revisited, and examples based on real datasets using different tests are provided."}, "https://arxiv.org/abs/2106.13856": {"title": "Nonparametric inference on counterfactuals in first-price auctions", "link": "https://arxiv.org/abs/2106.13856", "description": "arXiv:2106.13856v3 Announce Type: replace \nAbstract: In a classical model of the first-price sealed-bid auction with independent private values, we develop nonparametric estimators for several policy-relevant targets, such as the bidder's surplus and auctioneer's revenue under counterfactual reserve prices. Motivated by the linearity of these targets in the quantile function of bidders' values, we propose an estimator of the latter and derive its Bahadur-Kiefer expansion. This makes it possible to construct uniform confidence bands and test complex hypotheses about the auction design. Using the data on U.S. Forest Service timber auctions, we test whether setting zero reserve prices in these auctions was revenue maximizing."}, "https://arxiv.org/abs/2305.00694": {"title": "Scaling of Piecewise Deterministic Monte Carlo for Anisotropic Targets", "link": "https://arxiv.org/abs/2305.00694", "description": "arXiv:2305.00694v2 Announce Type: replace \nAbstract: Piecewise deterministic Markov processes (PDMPs) are a type of continuous-time Markov process that combine deterministic flows with jumps. Recently, PDMPs have garnered attention within the Monte Carlo community as a potential alternative to traditional Markov chain Monte Carlo (MCMC) methods. The Zig-Zag sampler and the Bouncy Particle Sampler are commonly used examples of the PDMP methodology which have also yielded impressive theoretical properties, but little is known about their robustness to extreme dependence or anisotropy of the target density. It turns out that PDMPs may suffer from poor mixing due to anisotropy and this paper investigates this effect in detail in the stylised but important Gaussian case. To this end, we employ a multi-scale analysis framework in this paper. Our results show that when the Gaussian target distribution has two scales, of order $1$ and $\\epsilon$, the computational cost of the Bouncy Particle Sampler is of order $\\epsilon^{-1}$, and the computational cost of the Zig-Zag sampler is $\\epsilon^{-2}$. In comparison, the cost of the traditional MCMC methods such as RWM is of order $\\epsilon^{-2}$, at least when the dimensionality of the small component is more than $1$. Therefore, there is a robustness advantage to using PDMPs in this context."}, "https://arxiv.org/abs/2309.07107": {"title": "Reducing Symbiosis Bias Through Better A/B Tests of Recommendation Algorithms", "link": "https://arxiv.org/abs/2309.07107", "description": "arXiv:2309.07107v3 Announce Type: replace \nAbstract: It is increasingly common in digital environments to use A/B tests to compare the performance of recommendation algorithms. However, such experiments often violate the stable unit treatment value assumption (SUTVA), particularly SUTVA's \"no hidden treatments\" assumption, due to the shared data between algorithms being compared. This results in a novel form of bias, which we term \"symbiosis bias,\" where the performance of each algorithm is influenced by the training data generated by its competitor. In this paper, we investigate three experimental designs--cluster-randomized, data-diverted, and user-corpus co-diverted experiments--aimed at mitigating symbiosis bias. 
We present a theoretical model of symbiosis bias and simulate the impact of each design in dynamic recommendation environments. Our results show that while each design reduces symbiosis bias to some extent, they also introduce new challenges, such as reduced training data in data-diverted experiments. We further validate the existence of symbiosis bias using data from a large-scale A/B test conducted on a global recommender system, demonstrating that symbiosis bias affects treatment effect estimates in the field. Our findings provide actionable insights for researchers and practitioners seeking to design experiments that accurately capture algorithmic performance without bias in treatment effect estimates introduced by shared data."}, "https://arxiv.org/abs/2410.17399": {"title": "An Anatomy of Event Studies: Hypothetical Experiments, Exact Decomposition, and Weighting Diagnostics", "link": "https://arxiv.org/abs/2410.17399", "description": "arXiv:2410.17399v1 Announce Type: new \nAbstract: In recent decades, event studies have emerged as a central methodology in health and social research for evaluating the causal effects of staggered interventions. In this paper, we analyze event studies from the perspective of experimental design and focus on the use of information across units and time periods in the construction of effect estimators. As a particular case of this approach, we offer a novel decomposition of the classical dynamic two-way fixed effects (TWFE) regression estimator for event studies. Our decomposition is expressed in closed form and reveals in finite samples the hypothetical experiment that TWFE regression adjustments approximate. This decomposition offers insights into how standard regression estimators borrow information across different units and times, clarifying and supplementing the notion of forbidden comparison noted in the literature. We propose a robust weighting approach for estimation in event studies, which allows investigators to progressively build larger valid weighted contrasts by leveraging increasingly stronger assumptions on the treatment assignment and potential outcomes mechanisms. This weighting approach also allows for the generalization of treatment effect estimates to a target population. We provide diagnostics and visualization tools and illustrate these methods in a case study of the impact of divorce reforms on female suicide."}, "https://arxiv.org/abs/2410.17601": {"title": "Flexible Approach for Statistical Disclosure Control in Geospatial Data", "link": "https://arxiv.org/abs/2410.17601", "description": "arXiv:2410.17601v1 Announce Type: new \nAbstract: We develop a flexible approach by combining the Quadtree-based method with suppression to maximize the utility of the grid data and simultaneously to reduce the risk of disclosing private information from individual units. To protect data confidentiality, we produce a high resolution grid from geo-reference data with a minimum size of 1 km nested in grids with increasingly larger resolution on the basis of statistical disclosure control methods (i.e threshold and concentration rule). While our implementation overcomes certain weaknesses of Quadtree-based method by accounting for irregularly distributed and relatively isolated marginal units, it also allows creating joint aggregation of several variables. 
The method is illustrated by relying on synthetic data of the Danish agricultural census 2020 for a set of key agricultural indicators, such as the number of agricultural holdings, the utilized agricultural area and the number of organic farms. We demonstrate the need to assess the reliability of indicators when using a sub-sample of synthetic data, followed by an example that presents the same approach for generating a ratio (i.e., the share of organic farming). The methodology is provided as the open-source \textit{R}-package \textit{MRG} that is adaptable for use with other geo-referenced survey data subject to confidentiality or other privacy restrictions."}, "https://arxiv.org/abs/2410.17604": {"title": "Ranking of Multi-Response Experiment Treatments", "link": "https://arxiv.org/abs/2410.17604", "description": "arXiv:2410.17604v1 Announce Type: new \nAbstract: We present a probabilistic ranking model to identify the optimal treatment in multiple-response experiments. In contemporary practice, treatments are applied to individuals with the goal of achieving multiple ideal properties on them simultaneously. However, often there are competing properties, and the optimality of one cannot be achieved without compromising the optimality of another. Typically, we still want to know which treatment is the overall best. In our framework, we first formulate overall optimality in terms of treatment ranks. Then we infer the latent ranking that allows us to report treatments from optimal to least optimal, given the desired ideal properties. We demonstrate through simulations and real data analysis how we can achieve reliability of inferred ranks in practice. We adopt a Bayesian approach and derive an associated Markov Chain Monte Carlo algorithm to fit our model to data. Finally, we discuss the prospects of adoption of our method as a standard tool for experiment evaluation in trials-based research."}, "https://arxiv.org/abs/2410.17680": {"title": "Unraveling Residualization: enhancing its application and exposing its relationship with the FWL theorem", "link": "https://arxiv.org/abs/2410.17680", "description": "arXiv:2410.17680v1 Announce Type: new \nAbstract: The residualization procedure has been applied in many different fields to estimate models with multicollinearity. However, there exists a lack of understanding of this methodology, and some authors discourage its use. This paper aims to contribute to a better understanding of the residualization procedure to promote its adequate application and interpretation in statistics and data science. We highlight its interesting potential application, not only to mitigate multicollinearity but also when the study is oriented to the analysis of the isolated effect of independent variables. The relation between the residualization methodology and the Frisch-Waugh-Lovell (FWL) theorem is also analyzed, concluding that, although both provide the same estimations, the interpretation of the estimated coefficients is different. These different interpretations justify the application of the residualization methodology regardless of the FWL theorem. A real data example is presented to better illustrate the contribution of this paper."}, "https://arxiv.org/abs/2410.17864": {"title": "Longitudinal Causal Inference with Selective Eligibility", "link": "https://arxiv.org/abs/2410.17864", "description": "arXiv:2410.17864v1 Announce Type: new \nAbstract: Dropout often threatens the validity of causal inference in longitudinal studies. 
While existing studies have focused on the problem of missing outcomes caused by treatment, we study an important but overlooked source of dropout, selective eligibility. For example, patients may become ineligible for subsequent treatments due to severe side effects or complete recovery. Selective eligibility differs from the problem of ``truncation by death'' because dropout occurs after observing the outcome but before receiving the subsequent treatment. This difference makes the standard approach to dropout inapplicable. We propose a general methodological framework for longitudinal causal inference with selective eligibility. By focusing on subgroups of units who would become eligible for treatment given a specific treatment history, we define the time-specific eligible treatment effect (ETE) and expected number of outcome events (EOE) under a treatment sequence of interest. Assuming a generalized version of sequential ignorability, we derive two nonparametric identification formulae, each leveraging different parts of the observed data distribution. We then derive the efficient influence function of each causal estimand, yielding the corresponding doubly robust estimator. Finally, we apply the proposed methodology to an impact evaluation of a pre-trial risk assessment instrument in the criminal justice system, in which selective eligibility arises due to recidivism."}, "https://arxiv.org/abs/2410.18021": {"title": "Deep Nonparametric Inference for Conditional Hazard Function", "link": "https://arxiv.org/abs/2410.18021", "description": "arXiv:2410.18021v1 Announce Type: new \nAbstract: We propose a novel deep learning approach to nonparametric statistical inference for the conditional hazard function of survival time with right-censored data. We use a deep neural network (DNN) to approximate the logarithm of a conditional hazard function given covariates and obtain a DNN likelihood-based estimator of the conditional hazard function. Such an estimation approach renders model flexibility and hence relaxes structural and functional assumptions on conditional hazard or survival functions. We establish the nonasymptotic error bound and functional asymptotic normality of the proposed estimator. Subsequently, we develop new one-sample tests for goodness-of-fit evaluation and two-sample tests for treatment comparison. Both simulation studies and real application analysis show superior performances of the proposed estimators and tests in comparison with existing methods."}, "https://arxiv.org/abs/2410.17398": {"title": "Sacred and Profane: from the Involutive Theory of MCMC to Helpful Hamiltonian Hacks", "link": "https://arxiv.org/abs/2410.17398", "description": "arXiv:2410.17398v1 Announce Type: cross \nAbstract: In the first edition of this Handbook, two remarkable chapters consider seemingly distinct yet deeply connected subjects ..."}, "https://arxiv.org/abs/2410.17692": {"title": "Asymptotics for parametric martingale posteriors", "link": "https://arxiv.org/abs/2410.17692", "description": "arXiv:2410.17692v1 Announce Type: cross \nAbstract: The martingale posterior framework is a generalization of Bayesian inference where one elicits a sequence of one-step ahead predictive densities instead of the likelihood and prior. Posterior sampling then involves the imputation of unseen observables, and can then be carried out in an expedient and parallelizable manner using predictive resampling without requiring Markov chain Monte Carlo. 
Recent work has investigated the use of plug-in parametric predictive densities, combined with stochastic gradient descent, to specify a parametric martingale posterior. This paper investigates the asymptotic properties of this class of parametric martingale posteriors. In particular, two central limit theorems based on martingale limit theory are introduced and applied. The first is a predictive central limit theorem, which enables a significant acceleration of the predictive resampling scheme through a hybrid sampling algorithm based on a normal approximation. The second is a Bernstein-von Mises result, which is novel for martingale posteriors, and provides methodological guidance on attaining desirable frequentist properties. We demonstrate the utility of the theoretical results in simulations and a real data example."}, "https://arxiv.org/abs/2209.04977": {"title": "Semi-supervised Triply Robust Inductive Transfer Learning", "link": "https://arxiv.org/abs/2209.04977", "description": "arXiv:2209.04977v2 Announce Type: replace \nAbstract: In this work, we propose a Semi-supervised Triply Robust Inductive transFer LEarning (STRIFLE) approach, which integrates heterogeneous data from a label-rich source population and a label-scarce target population and utilizes a large amount of unlabeled data simultaneously to improve the learning accuracy in the target population. Specifically, we consider a high dimensional covariate shift setting and employ two nuisance models, a density ratio model and an imputation model, to combine transfer learning and surrogate-assisted semi-supervised learning strategies effectively and achieve triple robustness. While the STRIFLE approach assumes the target and source populations to share the same conditional distribution of outcome Y given both the surrogate features S and predictors X, it allows the true underlying model of Y|X to differ between the two populations due to the potential covariate shift in S and X. Different from double robustness, even if both nuisance models are misspecified or the distribution of Y|(S, X) is not the same between the two populations, the triply robust STRIFLE estimator can still partially use the source population when the shifted source population and the target population share enough similarities. Moreover, it is guaranteed to be no worse than the target-only surrogate-assisted semi-supervised estimator with an additional error term from transferability detection. These desirable properties of our estimator are established theoretically and verified in finite samples via extensive simulation studies. We utilize the STRIFLE estimator to train a Type II diabetes polygenic risk prediction model for the African American target population by transferring knowledge from electronic health records linked genomic data observed in a larger European source population."}, "https://arxiv.org/abs/2301.08056": {"title": "Geodesic slice sampling on the sphere", "link": "https://arxiv.org/abs/2301.08056", "description": "arXiv:2301.08056v3 Announce Type: replace \nAbstract: Probability measures on the sphere form an important class of statistical models and are used, for example, in modeling directional data or shapes. Due to their widespread use, but also as an algorithmic building block, efficient sampling of distributions on the sphere is highly desirable. We propose a shrinkage based and an idealized geodesic slice sampling Markov chain, designed to generate approximate samples from distributions on the sphere. 
In particular, the shrinkage-based version of the algorithm can be implemented such that it runs efficiently in any dimension and has no tuning parameters. We verify reversibility and prove that under weak regularity conditions geodesic slice sampling is uniformly ergodic. Numerical experiments show that the proposed slice samplers achieve excellent mixing on challenging targets including the Bingham distribution and mixtures of von Mises-Fisher distributions. In these settings our approach outperforms standard samplers such as random-walk Metropolis-Hastings and Hamiltonian Monte Carlo."}, "https://arxiv.org/abs/2302.07227": {"title": "Transport map unadjusted Langevin algorithms: learning and discretizing perturbed samplers", "link": "https://arxiv.org/abs/2302.07227", "description": "arXiv:2302.07227v4 Announce Type: replace \nAbstract: Langevin dynamics are widely used in sampling high-dimensional, non-Gaussian distributions whose densities are known up to a normalizing constant. In particular, there is strong interest in unadjusted Langevin algorithms (ULA), which directly discretize Langevin dynamics to estimate expectations over the target distribution. We study the use of transport maps that approximately normalize a target distribution as a way to precondition and accelerate the convergence of Langevin dynamics. We show that in continuous time, when a transport map is applied to Langevin dynamics, the result is a Riemannian manifold Langevin dynamics (RMLD) with metric defined by the transport map. We also show that applying a transport map to an irreversibly-perturbed ULA results in a geometry-informed irreversible perturbation (GiIrr) of the original dynamics. These connections suggest more systematic ways of learning metrics and perturbations, and also yield alternative discretizations of the RMLD described by the map, which we study. Under appropriate conditions, these discretized processes can be endowed with non-asymptotic bounds describing convergence to the target distribution in 2-Wasserstein distance. Illustrative numerical results complement our theoretical claims."}, "https://arxiv.org/abs/2306.07362": {"title": "Large-Scale Multiple Testing of Composite Null Hypotheses Under Heteroskedasticity", "link": "https://arxiv.org/abs/2306.07362", "description": "arXiv:2306.07362v2 Announce Type: replace \nAbstract: Heteroskedasticity poses several methodological challenges in designing valid and powerful procedures for simultaneous testing of composite null hypotheses. In particular, the conventional practice of standardizing or re-scaling heteroskedastic test statistics in this setting may severely affect the power of the underlying multiple testing procedure. Additionally, when the inferential parameter of interest is correlated with the variance of the test statistic, methods that ignore this dependence may fail to control the type I error at the desired level. We propose a new Heteroskedasticity Adjusted Multiple Testing (HAMT) procedure that avoids data reduction by standardization, and directly incorporates the side information from the variances into the testing procedure. Our approach relies on an improved nonparametric empirical Bayes deconvolution estimator that offers a practical strategy for capturing the dependence between the inferential parameter of interest and the variance of the test statistic. We develop theory to show that HAMT is asymptotically valid and optimal for FDR control. 
Simulation results demonstrate that HAMT outperforms existing procedures with substantial power gain across many settings at the same FDR level. The method is illustrated on an application involving the detection of engaged users on a mobile game app."}, "https://arxiv.org/abs/2312.01925": {"title": "Coefficient Shape Alignment in Multiple Functional Linear Regression", "link": "https://arxiv.org/abs/2312.01925", "description": "arXiv:2312.01925v4 Announce Type: replace \nAbstract: In multivariate functional data analysis, different functional covariates often exhibit homogeneity. The covariates with pronounced homogeneity can be analyzed jointly within the same group, offering a parsimonious approach to modeling multivariate functional data. In this paper, a novel grouped multiple functional regression model with a new regularization approach termed {\\it ``coefficient shape alignment\"} is developed to tackle functional covariates homogeneity. The modeling procedure includes two steps: first aggregate covariates into disjoint groups using the new regularization approach; then the grouped multiple functional regression model is established based on the detected grouping structure. In this grouped model, the coefficient functions of covariates in the same group share the same shape, invariant to scaling. The new regularization approach works by penalizing differences in the shape of the coefficients. We establish conditions under which the true grouping structure can be accurately identified and derive the asymptotic properties of the model estimates. Extensive simulation studies are conducted to assess the finite-sample performance of the proposed methods. The practical applicability of the model is demonstrated through real data analysis in the context of sugar quality evaluation. This work offers a novel framework for analyzing the homogeneity of functional covariates and constructing parsimonious models for multivariate functional data."}, "https://arxiv.org/abs/2410.18144": {"title": "Using Platt's scaling for calibration after undersampling -- limitations and how to address them", "link": "https://arxiv.org/abs/2410.18144", "description": "arXiv:2410.18144v1 Announce Type: new \nAbstract: When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. 
If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study."}, "https://arxiv.org/abs/2410.18159": {"title": "On the Existence of One-Sided Representations in the Generalised Dynamic Factor Model", "link": "https://arxiv.org/abs/2410.18159", "description": "arXiv:2410.18159v1 Announce Type: new \nAbstract: We consider the generalised dynamic factor model (GDFM) and assume that the dynamic common component is purely non-deterministic. We show that then the common shocks (and therefore the dynamic common component) can always be represented in terms of current and past observed variables. Hence, we further generalise existing results on the so called One-Sidedness problem of the GDFM. We may conclude that the existence of a one-sided representation that is causally subordinated to the observed variables is in the very nature of the GDFM and the lack of one-sidedness is an artefact of the chosen representation."}, "https://arxiv.org/abs/2410.18261": {"title": "Detecting Spatial Outliers: the Role of the Local Influence Function", "link": "https://arxiv.org/abs/2410.18261", "description": "arXiv:2410.18261v1 Announce Type: new \nAbstract: In the analysis of large spatial datasets, identifying and treating spatial outliers is essential for accurately interpreting geographical phenomena. While spatial correlation measures, particularly Local Indicators of Spatial Association (LISA), are widely used to detect spatial patterns, the presence of abnormal observations frequently distorts the landscape and conceals critical spatial relationships. These outliers can significantly impact analysis due to the inherent spatial dependencies present in the data. Traditional influence function (IF) methodologies, commonly used in statistical analysis to measure the impact of individual observations, are not directly applicable in the spatial context because the influence of an observation is determined not only by its own value but also by its spatial location, its connections with neighboring regions, and the values of those neighboring observations. In this paper, we introduce a local version of the influence function (LIF) that accounts for these spatial dependencies. Through the analysis of both simulated and real-world datasets, we demonstrate how the LIF provides a more nuanced and accurate detection of spatial outliers compared to traditional LISA measures and local impact assessments, improving our understanding of spatial patterns."}, "https://arxiv.org/abs/2410.18272": {"title": "Partially Identified Rankings from Pairwise Interactions", "link": "https://arxiv.org/abs/2410.18272", "description": "arXiv:2410.18272v1 Announce Type: new \nAbstract: This paper considers the problem of ranking objects based on their latent merits using data from pairwise interactions. Existing approaches rely on the restrictive assumption that all the interactions are either observed or missed randomly. We investigate what can be inferred about rankings when this assumption is relaxed. First, we demonstrate that in parametric models, such as the popular Bradley-Terry-Luce model, rankings are point-identified if and only if the tournament graph is connected. Second, we show that in nonparametric models based on strong stochastic transitivity, rankings in a connected tournament are only partially identified. 
Finally, we propose two statistical tests to determine whether a ranking belongs to the identified set. One test is valid in finite samples but computationally intensive, while the other is easy to implement and valid asymptotically. We illustrate our procedure using Brazilian employer-employee data to test whether male and female workers rank firms differently when making job transitions."}, "https://arxiv.org/abs/2410.18338": {"title": "Robust function-on-function interaction regression", "link": "https://arxiv.org/abs/2410.18338", "description": "arXiv:2410.18338v1 Announce Type: new \nAbstract: A function-on-function regression model with quadratic and interaction effects of the covariates provides a more flexible model. Despite several attempts to estimate the model's parameters, almost all existing estimation strategies are non-robust against outliers. Outliers in the quadratic and interaction effects may deteriorate the model structure more severely than their effects in the main effect. We propose a robust estimation strategy based on the robust functional principal component decomposition of the function-valued variables and $\tau$-estimator. The performance of the proposed method relies on the truncation parameters in the robust functional principal component decomposition of the function-valued variables. A robust Bayesian information criterion is used to determine the optimum truncation constants. A forward stepwise variable selection procedure is employed to determine relevant main, quadratic, and interaction effects to address a possible model misspecification. The finite-sample performance of the proposed method is investigated via a series of Monte-Carlo experiments. The proposed method's asymptotic consistency and influence function are also studied in the supplement, and its empirical performance is further investigated using a U.S. COVID-19 dataset."}, "https://arxiv.org/abs/2410.18381": {"title": "Inference on High Dimensional Selective Labeling Models", "link": "https://arxiv.org/abs/2410.18381", "description": "arXiv:2410.18381v1 Announce Type: new \nAbstract: A class of simultaneous equation models arises in the many domains where observed binary outcomes are themselves a consequence of the existing choices of one of the agents in the model. These models are gaining increasing interest in the computer science and machine learning literatures, where they refer to the potentially endogenous sample selection as the {\em selective labels} problem. Empirical settings for such models arise in fields as diverse as criminal justice, health care, and insurance. For important recent work in this area, see, for example, Lakkaraju et al. (2017), Kleinberg et al. (2018), and Coston et al. (2021), where the authors focus on judicial bail decisions, and where one observes the outcome of whether a defendant failed to return for their court appearance only if the judge in the case decides to release the defendant on bail. Identifying and estimating such models can be computationally challenging for two reasons. One is the nonconcavity of the bivariate likelihood function, and the other is the large number of covariates in each equation. Despite these challenges, in this paper we propose a novel distribution-free estimation procedure that is computationally friendly in settings with many covariates. The new method combines the semiparametric batched gradient descent algorithm introduced in Khan et al. (2023) with a novel sorting algorithm incorporated to control for selection bias. 
Asymptotic properties of the new procedure are established under increasing dimension conditions in both equations, and its finite sample properties are explored through a simulation study and an application using judicial bail data."}, "https://arxiv.org/abs/2410.18409": {"title": "Doubly protected estimation for survival outcomes utilizing external controls for randomized clinical trials", "link": "https://arxiv.org/abs/2410.18409", "description": "arXiv:2410.18409v1 Announce Type: new \nAbstract: Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach incorporates machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework."}, "https://arxiv.org/abs/2410.18437": {"title": "Studentized Tests of Independence: Random-Lifter approach", "link": "https://arxiv.org/abs/2410.18437", "description": "arXiv:2410.18437v1 Announce Type: new \nAbstract: The exploration of associations between random objects with complex geometric structures has catalyzed the development of various novel statistical tests encompassing distance-based and kernel-based statistics. These methods have various strengths and limitations. One problem is that their test statistics tend to converge to asymptotic null distributions involving second-order Wiener chaos, which are hard to compute and need approximation or permutation techniques that use much computing power to build rejection regions. In this work, we take an entirely different and novel strategy by using the so-called ``Random-Lifter''. This method is engineered to yield test statistics with the standard normal limit under null distributions without the need for sample splitting. In other words, we set our sights on having simple limiting distributions and finding the proper statistics through reverse engineering. We use the Central Limit Theorems (CLTs) for degenerate U-statistics derived from our novel association measures to do this. As a result, the asymptotic distributions of our proposed tests are straightforward to compute. Our test statistics also have the minimax property. We further substantiate that our method maintains competitive power against existing methods with minimal adjustments to constant factors. 
Both numerical simulations and real-data analysis corroborate the efficacy of the Random-Lifter method."}, "https://arxiv.org/abs/2410.18445": {"title": "Inferring Latent Graphs from Stationary Signals Using a Graphical Autoregressive Model", "link": "https://arxiv.org/abs/2410.18445", "description": "arXiv:2410.18445v1 Announce Type: new \nAbstract: Graphs are an intuitive way to represent relationships between variables in fields such as finance and neuroscience. However, these graphs often need to be inferred from data. In this paper, we propose a novel framework to infer a latent graph by treating the observed multidimensional data as graph-referenced stationary signals. Specifically, we introduce the graphical autoregressive model (GAR), where the inverse covariance matrix of the observed signals is expressed as a second-order polynomial of the normalized graph Laplacian of the latent graph. The GAR model extends the autoregressive model from time series analysis to general undirected graphs, offering a new approach to graph inference. To estimate the latent graph, we develop a three-step procedure based on penalized maximum likelihood, supported by theoretical analysis and numerical experiments. Simulation studies and an application to S&P 500 stock price data show that the GAR model can outperform Gaussian graphical models when it fits the observed data well. Our results suggest that the GAR model offers a promising new direction for inferring latent graphs across diverse applications. Codes and example scripts are available at https://github.com/jed-harwood/SGM ."}, "https://arxiv.org/abs/2410.18486": {"title": "Evolving Voices Based on Temporal Poisson Factorisation", "link": "https://arxiv.org/abs/2410.18486", "description": "arXiv:2410.18486v1 Announce Type: new \nAbstract: The world is evolving and so is the vocabulary used to discuss topics in speech. Analysing political speech data from more than 30 years requires the use of flexible topic models to uncover the latent topics and their change in prevalence over time as well as the change in the vocabulary of the topics. We propose the temporal Poisson factorisation (TPF) model as an extension to the Poisson factorisation model to model sparse count data matrices obtained based on the bag-of-words assumption from text documents with time stamps. We discuss and empirically compare different model specifications for the time-varying latent variables consisting either of a flexible auto-regressive structure of order one or a random walk. Estimation is based on variational inference where we consider a combination of coordinate ascent updates with automatic differentiation using batching of documents. Suitable variational families are proposed to ease inference. We compare results obtained using independent univariate variational distributions for the time-varying latent variables to those obtained with a multivariate variant. We discuss in detail the results of the TPF model when analysing speeches from 18 sessions in the U.S. Senate (1981-2016)."}, "https://arxiv.org/abs/2410.18696": {"title": "Latent Functional PARAFAC for modeling multidimensional longitudinal data", "link": "https://arxiv.org/abs/2410.18696", "description": "arXiv:2410.18696v1 Announce Type: new \nAbstract: In numerous settings, it is increasingly common to deal with longitudinal data organized as high-dimensional multi-dimensional arrays, also known as tensors. 
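As an illustration of the precision structure described in the graphical autoregressive (GAR) entry above, the following minimal numpy sketch builds a precision matrix as a second-order polynomial of a normalized graph Laplacian and draws stationary signals from it. The toy adjacency matrix and the coefficients theta0, theta1, theta2 are illustrative assumptions, not values from the paper.

```python
# Sketch (not the authors' code): a GAR-style precision matrix
# Theta = theta0*I + theta1*L + theta2*L@L, with L the normalized graph
# Laplacian of an assumed latent graph, and stationary signals drawn from it.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 5-node undirected adjacency matrix (assumed)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(5) - D_inv_sqrt @ A @ D_inv_sqrt                # normalized Laplacian

theta0, theta1, theta2 = 1.0, 0.5, 0.25                    # illustrative coefficients
Theta = theta0 * np.eye(5) + theta1 * L + theta2 * L @ L   # precision matrix

Sigma = np.linalg.inv(Theta)                               # implied covariance
X = rng.multivariate_normal(np.zeros(5), Sigma, size=200)  # simulated signals
print(np.round(Theta, 3), X.shape)
```

The paper's estimation problem runs in the opposite direction, recovering the latent graph from such signals via penalized maximum likelihood.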
Within this framework, the time-continuous property of longitudinal data often implies a smooth functional structure on one of the tensor modes. To help researchers investigate such data, we introduce a new tensor decomposition approach based on the CANDECOMP/PARAFAC decomposition. Our approach allows for representing a high-dimensional functional tensor as a low-dimensional set of functions and feature matrices. Furthermore, to capture the underlying randomness of the statistical setting more efficiently, we introduce a probabilistic latent model in the decomposition. A covariance-based block-relaxation algorithm is derived to obtain estimates of model parameters. Thanks to the covariance formulation of the solving procedure and thanks to the probabilistic modeling, the method can be used in sparse and irregular sampling schemes, making it applicable in numerous settings. We apply our approach to help characterize multiple neurocognitive scores observed over time in the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Finally, intensive simulations show a notable advantage of our method in reconstructing tensors."}, "https://arxiv.org/abs/2410.18734": {"title": "Response Surface Designs for Crossed and Nested Multi-Stratum Structures", "link": "https://arxiv.org/abs/2410.18734", "description": "arXiv:2410.18734v1 Announce Type: new \nAbstract: Response surface designs are usually described as being run under complete randomization of the treatment combinations to the experimental units. In practice, however, it is often necessary or beneficial to run them under some kind of restriction to the randomization, leading to multi-stratum designs. In particular, some factors are often hard to set, so they cannot have their levels reset for each experimental unit. This paper presents a general solution to designing response surface experiments in any multi-stratum structure made up of crossing and/or nesting of unit factors. A stratum-by-stratum approach to constructing designs using compound optimal design criteria is used and illustrated. It is shown that good designs can be found even for large experiments in complex structures."}, "https://arxiv.org/abs/2410.18939": {"title": "Adaptive partition Factor Analysis", "link": "https://arxiv.org/abs/2410.18939", "description": "arXiv:2410.18939v1 Announce Type: new \nAbstract: Factor Analysis has traditionally been utilized across diverse disciplines to extrapolate latent traits that influence the behavior of multivariate observed variables. Historically, the focus has been on analyzing data from a single study, neglecting the potential study-specific variations present in data from multiple studies. Multi-study factor analysis has emerged as a recent methodological advancement that addresses this gap by distinguishing between latent traits shared across studies and study-specific components arising from artifactual or population-specific sources of variation. In this paper, we extend the current methodologies by introducing novel shrinkage priors for the latent factors, thereby accommodating a broader spectrum of scenarios -- from the absence of study-specific latent factors to models in which factors pertain only to small subgroups nested within or shared between the studies. For the proposed construction we provide conditions for identifiability of factor loadings and guidelines to perform straightforward posterior computation via Gibbs sampling. 
Through comprehensive simulation studies, we demonstrate that our proposed method exhibits competitive performance across a variety of scenarios compared to existing methods, while providing richer insights. The practical benefits of our approach are further illustrated through applications to bird species co-occurrence data and ovarian cancer gene expression data."}, "https://arxiv.org/abs/2410.18243": {"title": "Saddlepoint Monte Carlo and its Application to Exact Ecological Inference", "link": "https://arxiv.org/abs/2410.18243", "description": "arXiv:2410.18243v1 Announce Type: cross \nAbstract: Assuming X is a random vector and A a non-invertible matrix, one sometimes needs to perform inference while only having access to samples of Y = AX. The corresponding likelihood is typically intractable. One may still be able to perform exact Bayesian inference using a pseudo-marginal sampler, but this requires an unbiased estimator of the intractable likelihood.\n We propose saddlepoint Monte Carlo, a method for obtaining an unbiased estimate of the density of Y with very low variance, for any model belonging to an exponential family. Our method relies on importance sampling of the characteristic function, with insights brought by the standard saddlepoint approximation scheme with exponential tilting. We show that saddlepoint Monte Carlo makes it possible to perform exact inference on particularly challenging problems and datasets. We focus on the ecological inference problem, where one observes only aggregates at a fine level. We present in particular a study of the carryover of votes between the two rounds of various French elections, using the finest available data (number of votes for each candidate in about 60,000 polling stations over most of the French territory).\n We show that existing, popular approximate methods for ecological inference can lead to substantial bias, to which saddlepoint Monte Carlo is immune. We also present original results for the 2024 legislative elections on political centre-to-left and left-to-centre conversion rates when the far-right is present in the second round. Finally, we discuss other exciting applications for saddlepoint Monte Carlo, such as dealing with aggregate data in privacy or inverse problems."}, "https://arxiv.org/abs/2410.18268": {"title": "Stabilizing black-box model selection with the inflated argmax", "link": "https://arxiv.org/abs/2410.18268", "description": "arXiv:2410.18268v1 Announce Type: cross \nAbstract: Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. This paper presents a new approach to stabilizing model selection that leverages a combination of bagging and an \"inflated\" argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. 
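To illustrate the instability that motivates the inflated-argmax entry above, and the bagging ingredient it builds on, here is a hedged sklearn sketch that records how often each variable enters a LASSO support across bootstrap resamples. It shows only the bagging half of the idea; the inflated argmax operation itself is not reproduced, and the data-generating process and penalty level are assumptions.

```python
# Sketch: bagged LASSO supports (illustrates the bagging ingredient only,
# not the paper's inflated-argmax construction).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 80, 20
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)      # strongly correlated pair
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

B = 200
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)               # bootstrap resample
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    freq += (model.coef_ != 0)

freq /= B                                          # how often each variable was selected
print(np.round(freq, 2))
```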
In addition to developing theoretical guarantees, we illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable and (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances. In both settings, the proposed method yields stable and compact collections of selected models, outperforming a variety of benchmarks."}, "https://arxiv.org/abs/2410.18435": {"title": "Forecasting Australian fertility by age, region, and birthplace", "link": "https://arxiv.org/abs/2410.18435", "description": "arXiv:2410.18435v1 Announce Type: cross \nAbstract: Fertility differentials by urban-rural residence and nativity of women in Australia significantly impact population composition at sub-national levels. We aim to provide consistent fertility forecasts for Australian women characterized by age, region, and birthplace. Age-specific fertility rates at the national and sub-national levels obtained from census data between 1981-2011 are jointly modeled and forecast by the grouped functional time series method. Forecasts for women of each region and birthplace are reconciled following the chosen hierarchies to ensure that results at various disaggregation levels consistently sum up to the respective national total. Coupling the region of residence disaggregation structure with the trace minimization reconciliation method produces the most accurate point and interval forecasts. In addition, age-specific fertility rates disaggregated by the birthplace of women show significant heterogeneity that supports the application of the grouped forecasting method."}, "https://arxiv.org/abs/2410.18844": {"title": "Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints", "link": "https://arxiv.org/abs/2410.18844", "description": "arXiv:2410.18844v1 Announce Type: cross \nAbstract: Pure exploration in bandits models multiple real-world problems, such as tuning hyper-parameters or conducting user studies, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an $r$$\\textit{-good feasible policy}$. First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. We show how this lower bound evolves with the sequential estimation of constraints. Second, we leverage the Lagrangian lower bound and the properties of convex optimisation to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. To this end, we propose a constraint-adaptive stopping rule, and while tracking the lower bound, use pessimistic estimate of the feasible set at each step. We show that these algorithms achieve asymptotically optimal sample complexity upper bounds up to constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate efficient performance of LAGEX and LATS with respect to baselines."}, "https://arxiv.org/abs/2202.04146": {"title": "A Neural Phillips Curve and a Deep Output Gap", "link": "https://arxiv.org/abs/2202.04146", "description": "arXiv:2202.04146v2 Announce Type: replace \nAbstract: Many problems plague empirical Phillips curves (PCs). 
Among them is the hurdle that the two key components, inflation expectations and the output gap, are both unobserved. Traditional remedies include proxying for the absentees or extracting them via assumptions-heavy filtering procedures. I propose an alternative route: a Hemisphere Neural Network (HNN) whose architecture yields a final layer where components can be interpreted as latent states within a Neural PC. There are benefits. First, HNN conducts the supervised estimation of nonlinearities that arise when translating a high-dimensional set of observed regressors into latent states. Second, forecasts are economically interpretable. Among other findings, the contribution of real activity to inflation appears understated in traditional PCs. In contrast, HNN captures the 2021 upswing in inflation and attributes it to a large positive output gap starting from late 2020. The unique path of HNN's gap comes from dispensing with unemployment and GDP in favor of an amalgam of nonlinearly processed alternative tightness indicators."}, "https://arxiv.org/abs/2210.08149": {"title": "Distance and Kernel-Based Measures for Global and Local Two-Sample Conditional Distribution Testing", "link": "https://arxiv.org/abs/2210.08149", "description": "arXiv:2210.08149v2 Announce Type: replace \nAbstract: Testing the equality of two conditional distributions is crucial in various modern applications, including transfer learning and causal inference. Despite its importance, this fundamental problem has received surprisingly little attention in the literature. This work aims to present a unified framework based on distance and kernel methods for both global and local two-sample conditional distribution testing. To this end, we introduce distance and kernel-based measures that characterize the homogeneity of two conditional distributions. Drawing from the concept of conditional U-statistics, we propose consistent estimators for these measures. Theoretically, we derive the convergence rates and the asymptotic distributions of the estimators under both the null and alternative hypotheses. Utilizing these measures, along with a local bootstrap approach, we develop global and local tests that can detect discrepancies between two conditional distributions at global and local levels, respectively. Our tests demonstrate reliable performance through simulations and real data analyses."}, "https://arxiv.org/abs/2401.12031": {"title": "Multi-objective optimisation using expected quantile improvement for decision making in disease outbreaks", "link": "https://arxiv.org/abs/2401.12031", "description": "arXiv:2401.12031v2 Announce Type: replace \nAbstract: Optimization under uncertainty is important in many applications, particularly to inform policy and decision making in areas such as public health. A key source of uncertainty arises from the incorporation of environmental variables as inputs into computational models or simulators. Such variables represent uncontrollable features of the optimization problem and reliable decision making must account for the uncertainty they propagate to the simulator outputs. Often, multiple, competing objectives are defined from these outputs such that the final optimal decision is a compromise between different goals.\n Here, we present emulation-based optimization methodology for such problems that extends expected quantile improvement (EQI) to address multi-objective optimization. 
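The emulation-based treatment of environmental uncertainty described in the expected quantile improvement entry above can be sketched, in a heavily simplified form, as fitting a Gaussian process emulator over (decision, environment) inputs and ranking candidate decisions by a quantile of the emulator output over Monte Carlo draws of the environment. The toy simulator, kernel, and 90% quantile are assumptions, and the criterion below is a stand-in rather than the paper's EQI.

```python
# Sketch: GP emulation with environmental uncertainty integrated out by Monte Carlo.
# A simplified quantile criterion stands in for the paper's EQI extension.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

def simulator(x, e):                         # toy simulator (assumed)
    return (x - 2.0) ** 2 + 0.5 * np.sin(3 * e) * x

# Training design over decision variable x and environmental variable e
X_train = rng.uniform([0, -1], [4, 1], size=(40, 2))
y_train = simulator(X_train[:, 0], X_train[:, 1])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6).fit(X_train, y_train)

candidates = np.linspace(0, 4, 41)           # candidate decisions
e_mc = rng.normal(0.0, 0.5, size=200)        # Monte Carlo draws of the environment
scores = []
for x in candidates:
    inputs = np.column_stack([np.full_like(e_mc, x), e_mc])
    mu, sd = gp.predict(inputs, return_std=True)
    scores.append(np.quantile(mu + sd, 0.9)) # pessimistic 90% quantile over environment
best = candidates[int(np.argmin(scores))]
print("robust choice of x:", best)
```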
Focusing on the practically important case of two objectives, we use a sequential design strategy to identify the Pareto front of optimal solutions. Uncertainty from the environmental variables is integrated out using Monte Carlo samples from the simulator. Interrogation of the expected output from the simulator is facilitated by use of (Gaussian process) emulators. The methodology is demonstrated on an optimization problem from public health involving the dispersion of anthrax spores across a spatial terrain. Environmental variables include meteorological features that impact the dispersion, and the methodology identifies the Pareto front even when there is considerable input uncertainty."}, "https://arxiv.org/abs/2410.19019": {"title": "Median Based Unit Weibull (MBUW): a new unit distribution Properties", "link": "https://arxiv.org/abs/2410.19019", "description": "arXiv:2410.19019v1 Announce Type: new \nAbstract: A new two-parameter unit Weibull distribution is defined on the unit interval (0,1). The methodology for deducing its PDF, some of its properties, and related functions are discussed. The paper is supplemented with many figures illustrating the new distribution and how it can be suitable for fitting a wide range of skewed data. The new distribution is given the nickname (Attia)."}, "https://arxiv.org/abs/2410.19028": {"title": "Enhancing Approximate Modular Bayesian Inference by Emulating the Conditional Posterior", "link": "https://arxiv.org/abs/2410.19028", "description": "arXiv:2410.19028v1 Announce Type: new \nAbstract: In modular Bayesian analyses, complex models are composed of distinct modules, each representing different aspects of the data or prior information. In this context, fully Bayesian approaches can sometimes lead to undesirable feedback between modules, compromising the integrity of the inference. This paper focuses on the \"cut-distribution\", which prevents unwanted influence between modules by \"cutting\" feedback. The multiple imputation (DS) algorithm is standard practice for approximating the cut-distribution, but it can be computationally intensive, especially when the number of imputations required is large. An enhanced method is proposed, the Emulating the Conditional Posterior (ECP) algorithm, which leverages emulation to increase the number of imputations. Through numerical experiments it is demonstrated that the ECP algorithm outperforms the traditional DS approach in terms of accuracy and computational efficiency, particularly when resources are constrained. It is also shown how the DS algorithm can be improved using ideas from design of experiments. This work also provides practical recommendations on algorithm choice based on the computational demands of sampling from the prior and cut-distributions."}, "https://arxiv.org/abs/2410.19031": {"title": "Model-free Variable Selection and Inference for High-dimensional Data", "link": "https://arxiv.org/abs/2410.19031", "description": "arXiv:2410.19031v1 Announce Type: new \nAbstract: Statistical inference is challenging in high-dimensional data analysis. Existing post-selection inference requires an explicitly specified regression model as well as sparsity in the regression model. The performance of such procedures can be poor under either misspecified nonlinear models or a violation of the sparsity assumption. In this paper, we propose a sufficient dimension association (SDA) technique that measures the association between each predictor and the response variable conditioning on other predictors. 
Our proposed SDA method requires neither a specific form of regression model nor sparsity in the regression. Alternatively, our method assumes normalized or Gaussian-distributed predictors with a Markov blanket property. We propose an estimator for the SDA and prove asymptotic properties for the estimator. For simultaneous hypothesis testing and variable selection, we construct test statistics based on the Kolmogorov-Smirnov principle and the Cram{\\\"e}r-von-Mises principle. A multiplier bootstrap approach is used for computing critical values and $p$-values. Extensive simulation studies have been conducted to show the validity and superiority of our SDA method. Gene expression data from the Alzheimer Disease Neuroimaging Initiative are used to demonstrate a real application."}, "https://arxiv.org/abs/2410.19060": {"title": "Heterogeneous Intertemporal Treatment Effects via Dynamic Panel Data Models", "link": "https://arxiv.org/abs/2410.19060", "description": "arXiv:2410.19060v1 Announce Type: new \nAbstract: We study the identification and estimation of heterogeneous, intertemporal treatment effects (TE) when potential outcomes depend on past treatments. First, applying a dynamic panel data model to observed outcomes, we show that instrument-based GMM estimators, such as Arellano and Bond (1991), converge to a non-convex (negatively weighted) aggregate of TE plus non-vanishing trends. We then provide restrictions on sequential exchangeability (SE) of treatment and TE heterogeneity that reduce the GMM estimand to a convex (positively weighted) aggregate of TE. Second, we introduce an adjusted inverse-propensity-weighted (IPW) estimator for a new notion of average treatment effect (ATE) over past observed treatments. Third, we show that when potential outcomes are generated by dynamic panel data models with homogeneous TE, such GMM estimators converge to causal parameters (even when SE is generically violated without conditioning on individual fixed effects). Finally, we motivate SE and compare it with parallel trends (PT) in various settings with observational data (when treatments are dynamic, rational choices under learning) or experimental data (when treatments are sequentially randomized)."}, "https://arxiv.org/abs/2410.19073": {"title": "Doubly Robust Nonparametric Efficient Estimation for Provider Evaluation", "link": "https://arxiv.org/abs/2410.19073", "description": "arXiv:2410.19073v1 Announce Type: new \nAbstract: Provider profiling has the goal of identifying healthcare providers with exceptional patient outcomes. When evaluating providers, adjustment is necessary to control for differences in case-mix between different providers. Direct and indirect standardization are two popular risk adjustment methods. In causal terms, direct standardization examines a counterfactual in which the entire target population is treated by one provider. Indirect standardization, commonly expressed as a standardized outcome ratio, examines the counterfactual in which the population treated by a provider had instead been randomly assigned to another provider. Our first contribution is to present nonparametric efficiency bound for direct and indirectly standardized provider metrics by deriving their efficient influence functions. Our second contribution is to propose fully nonparametric estimators based on targeted minimum loss-based estimation that achieve the efficiency bounds. The finite-sample performance of the estimator is investigated through simulation studies. 
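For context on the indirect standardization targeted by the provider-profiling entry above, a minimal sketch of the classical standardized outcome ratio (observed over case-mix-expected events per provider) is given below; the simulated data and logistic risk model are assumptions, and the paper's efficient TMLE-based estimator is not reproduced here.

```python
# Sketch: indirectly standardized outcome ratio per provider (observed / expected),
# with "expected" taken from a risk model fit on all providers pooled.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
provider = rng.integers(0, 5, size=n)                  # 5 providers (assumed)
severity = rng.normal(size=n) + 0.3 * provider         # case-mix differs by provider
p_event = 1 / (1 + np.exp(-(-1.0 + 1.2 * severity)))
y = rng.binomial(1, p_event)

risk_model = LogisticRegression().fit(severity.reshape(-1, 1), y)
expected = risk_model.predict_proba(severity.reshape(-1, 1))[:, 1]

for k in range(5):
    mask = provider == k
    oe = y[mask].sum() / expected[mask].sum()          # standardized outcome ratio
    print(f"provider {k}: O/E = {oe:.2f}")
```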
We apply our methods to evaluate dialysis facilities in New York State in terms of unplanned readmission rates using a large Medicare claims dataset. A software implementation of our methods is available in the R package TargetedRisk."}, "https://arxiv.org/abs/2410.19140": {"title": "Enhancing Spatial Functional Linear Regression with Robust Dimension Reduction Methods", "link": "https://arxiv.org/abs/2410.19140", "description": "arXiv:2410.19140v1 Announce Type: new \nAbstract: This paper introduces a robust estimation strategy for the spatial functional linear regression model using dimension reduction methods, specifically functional principal component analysis (FPCA) and functional partial least squares (FPLS). These techniques are designed to address challenges associated with spatially correlated functional data, particularly the impact of outliers on parameter estimation. By projecting the infinite-dimensional functional predictor onto a finite-dimensional space defined by orthonormal basis functions and employing M-estimation to mitigate outlier effects, our approach improves the accuracy and reliability of parameter estimates in the spatial functional linear regression context. Simulation studies and empirical data analysis substantiate the effectiveness of our methods, while an appendix explores the Fisher consistency and influence function of the FPCA-based approach. The rfsac package in R implements these robust estimation strategies, ensuring practical applicability for researchers and practitioners."}, "https://arxiv.org/abs/2410.19154": {"title": "Cross Spline Net and a Unified World", "link": "https://arxiv.org/abs/2410.19154", "description": "arXiv:2410.19154v1 Announce Type: new \nAbstract: In today's machine learning world for tabular data, XGBoost and fully connected neural networks (FCNN) are two of the most popular methods due to their good model performance and ease of use. However, they are highly complicated, hard to interpret, and prone to overfitting. In this paper, we propose a new modeling framework called cross spline net (CSN) that is based on a combination of spline transformations and a cross network (Wang et al. 2017, 2021). We show that CSN is as performant and convenient to use, while being less complicated, more interpretable, and more robust. Moreover, the CSN framework is flexible, as the spline layer can be configured differently to yield different models. With different choices of the spline layer, we can reproduce or approximate a set of non-neural network models, including linear and spline-based statistical models, trees, rule-fit, tree ensembles (gradient boosting trees, random forests), oblique trees/forests, multivariate adaptive regression splines (MARS), SVM with polynomial kernel, etc. Therefore, CSN provides a unified modeling framework that puts the above set of non-neural network models under the same neural network framework. By using scalable and powerful gradient descent algorithms available in neural network libraries, CSN avoids some pitfalls (such as being ad-hoc, greedy or non-scalable) in the case-specific optimization methods used in the above non-neural network models. We use a special type of CSN, TreeNet, to illustrate our point, and compare TreeNet with XGBoost and FCNN to show its benefits. 
We believe CSN will provide a flexible and convenient framework for practitioners to build performant, robust and more interpretable models."}, "https://arxiv.org/abs/2410.19190": {"title": "A novel longitudinal rank-sum test for multiple primary endpoints in clinical trials: Applications to neurodegenerative disorders", "link": "https://arxiv.org/abs/2410.19190", "description": "arXiv:2410.19190v1 Announce Type: new \nAbstract: Neurodegenerative disorders such as Alzheimer's disease (AD) present a significant global health challenge, characterized by cognitive decline, functional impairment, and other debilitating effects. Current AD clinical trials often assess multiple longitudinal primary endpoints to comprehensively evaluate treatment efficacy. Traditional methods, however, may fail to capture global treatment effects, require larger sample sizes due to multiplicity adjustments, and may not fully exploit multivariate longitudinal data. To address these limitations, we introduce the Longitudinal Rank Sum Test (LRST), a novel nonparametric rank-based omnibus test statistic. The LRST enables a comprehensive assessment of treatment efficacy across multiple endpoints and time points without multiplicity adjustments, effectively controlling Type I error while enhancing statistical power. It offers flexibility against various data distributions encountered in AD research and maximizes the utilization of longitudinal data. Extensive simulations and real-data applications demonstrate the LRST's performance, underscoring its potential as a valuable tool in AD clinical trials."}, "https://arxiv.org/abs/2410.19212": {"title": "Inference on Multiple Winners with Applications to Microcredit and Economic Mobility", "link": "https://arxiv.org/abs/2410.19212", "description": "arXiv:2410.19212v1 Announce Type: new \nAbstract: While policymakers and researchers are often concerned with conducting inference based on a data-dependent selection, a strictly larger class of inference problems arises when considering multiple data-dependent selections, such as when selecting on statistical significance or quantiles. Given this, we study the problem of conducting inference on multiple selections, which we dub the inference on multiple winners problem. In this setting, we encounter both selective and multiple testing problems, making existing approaches either not applicable or too conservative. Instead, we propose a novel, two-step approach to the inference on multiple winners problem, with the first step modeling the selection of winners, and the second step using this model to conduct inference only on the set of likely winners. Our two-step approach reduces over-coverage error by up to 96%. We apply our two-step approach to revisit the winner's curse in the creating moves to opportunity (CMTO) program, and to study external validity issues in the microcredit literature. In the CMTO application, we find that, after correcting for the inference on multiple winners problem, we fail to reject the possibility of null effects in the majority of census tracts selected by the CMTO program. In our microcredit application, we find that heterogeneity in treatment effect estimates remains largely unaffected even after our proposed inference corrections."}, "https://arxiv.org/abs/2410.19226": {"title": "Deep Transformation Model", "link": "https://arxiv.org/abs/2410.19226", "description": "arXiv:2410.19226v1 Announce Type: new \nAbstract: There has been a significant recent surge in deep neural network (DNN) techniques. 
Most of the existing DNN techniques have restricted model formats/assumptions. To overcome their limitations, we propose the nonparametric transformation model, which encompasses many popular models as special cases and hence is less sensitive to model mis-specification. This model also has the potential of accommodating heavy-tailed errors, a robustness property not broadly shared. Accordingly, a new loss function, which fundamentally differs from the existing ones, is developed. For computational feasibility, we further develop a double rectified linear unit (DReLU)-based estimator. To accommodate the scenario with a diverging number of input variables and/or noises, we propose variable selection based on group penalization. We further expand the scope to coherently accommodate censored survival data. The estimation and variable selection properties are rigorously established. Extensive numerical studies, including simulations and data analyses, establish the satisfactory practical utility of the proposed methods."}, "https://arxiv.org/abs/2410.19271": {"title": "Feed-Forward Panel Estimation for Discrete-time Survival Analysis of Recurrent Events with Frailty", "link": "https://arxiv.org/abs/2410.19271", "description": "arXiv:2410.19271v1 Announce Type: new \nAbstract: In recurrent survival analysis where the event of interest can occur multiple times for each subject, frailty models play a crucial role by capturing unobserved heterogeneity at the subject level within a population. Frailty models traditionally face challenges due to the lack of a closed-form solution for the maximum likelihood estimation that is unconditional on frailty. In this paper, we propose a novel method: Feed-Forward Panel estimation for discrete-time Survival Analysis (FFPSurv). Our model uses variational Bayesian inference to sequentially update the posterior distribution of frailty as recurrent events are observed, and derives a closed form for the panel likelihood, effectively addressing the limitation of existing frailty models. We demonstrate the efficacy of our method through extensive experiments on numerical examples and real-world recurrent survival data. Furthermore, we mathematically prove that our model is identifiable under minor assumptions."}, "https://arxiv.org/abs/2410.19407": {"title": "Insights into regression-based cross-temporal forecast reconciliation", "link": "https://arxiv.org/abs/2410.19407", "description": "arXiv:2410.19407v1 Announce Type: new \nAbstract: Cross-temporal forecast reconciliation aims to ensure consistency across forecasts made at different temporal and cross-sectional levels. We explore the relationships between sequential, iterative, and optimal combination approaches, and discuss the conditions under which a sequential reconciliation approach (either first-cross-sectional-then-temporal, or first-temporal-then-cross-sectional) is equivalent to a fully (i.e., cross-temporally) coherent iterative heuristic. Furthermore, we show that for specific patterns of the error covariance matrix in the regression model on which the optimal combination approach grounds, iterative reconciliation naturally converges to the optimal combination solution, regardless the order of application of the uni-dimensional cross-sectional and temporal reconciliation approaches. Theoretical and empirical properties of the proposed approaches are investigated through a forecasting experiment using a dataset of hourly photovoltaic power generation. 
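The projection at the core of regression-based reconciliation discussed above can be illustrated with an OLS-type combination on a tiny temporal hierarchy: incoherent base forecasts are mapped onto the coherent subspace spanned by the summing matrix. The two-level hierarchy and identity weighting below are assumptions; the paper's covariance specifications and cross-temporal structure are not reproduced.

```python
# Sketch: OLS-type reconciliation of base forecasts through a summing matrix S.
# Bottom level: 4 sub-periods; aggregate level: their total (a tiny temporal hierarchy).
import numpy as np

S = np.vstack([np.ones((1, 4)),        # total = sum of the 4 sub-periods
               np.eye(4)])             # bottom series map to themselves

y_hat = np.array([10.5, 2.0, 3.1, 2.4, 2.6])    # incoherent base forecasts [total, b1..b4]

P = S @ np.linalg.solve(S.T @ S, S.T)           # projection onto the coherent subspace
y_tilde = P @ y_hat                              # reconciled forecasts

print("base total vs sum of base bottom :", y_hat[0], y_hat[1:].sum())
print("reconciled total vs sum of bottom:", y_tilde[0].round(3), y_tilde[1:].sum().round(3))
```

Weighted variants replace the identity weighting implicit in this projection with an estimate of the base-forecast error covariance.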
The study presents a comprehensive framework for understanding and enhancing cross-temporal forecast reconciliation, considering both forecast accuracy and the often overlooked computational aspects, showing that significant improvement can be achieved in terms of memory space and computation time, two particularly important aspects in the high-dimensional contexts that usually arise in cross-temporal forecast reconciliation."}, "https://arxiv.org/abs/2410.19469": {"title": "Unified Causality Analysis Based on the Degrees of Freedom", "link": "https://arxiv.org/abs/2410.19469", "description": "arXiv:2410.19469v1 Announce Type: new \nAbstract: Temporally evolving systems are typically modeled by dynamic equations. A key challenge in accurate modeling is understanding the causal relationships between subsystems, as well as identifying the presence and influence of unobserved hidden drivers on the observed dynamics. This paper presents a unified method capable of identifying fundamental causal relationships between pairs of systems, whether deterministic or stochastic. Notably, the method also uncovers hidden common causes beyond the observed variables. By analyzing the degrees of freedom in the system, our approach provides a more comprehensive understanding of both causal influence and hidden confounders. This unified framework is validated through theoretical models and simulations, demonstrating its robustness and potential for broader application."}, "https://arxiv.org/abs/2410.19523": {"title": "OCEAN: Flexible Feature Set Aggregation for Analysis of Multi-omics Data", "link": "https://arxiv.org/abs/2410.19523", "description": "arXiv:2410.19523v1 Announce Type: new \nAbstract: Integrated analysis of multi-omics datasets holds great promise for uncovering complex biological processes. However, the large dimension of omics data poses significant interpretability and multiple testing challenges. Simultaneous Enrichment Analysis (SEA) was introduced to address these issues in single-omics analysis, providing an in-built multiple testing correction and enabling simultaneous feature set testing. In this paper, we introduce OCEAN, an extension of SEA to multi-omics data. OCEAN is a flexible approach to analyze potentially all possible two-way feature sets from any pair of genomics datasets. We also propose two new error rates which are in line with the two-way structure of the data and facilitate interpretation of the results. The power and utility of OCEAN is demonstrated by analyzing copy number and gene expression data for breast and colon cancer."}, "https://arxiv.org/abs/2410.19682": {"title": "trajmsm: An R package for Trajectory Analysis and Causal Modeling", "link": "https://arxiv.org/abs/2410.19682", "description": "arXiv:2410.19682v1 Announce Type: new \nAbstract: The R package trajmsm provides functions designed to simplify the estimation of the parameters of a model combining latent class growth analysis (LCGA), a trajectory analysis technique, and marginal structural models (MSMs) called LCGA-MSM. LCGA summarizes similar patterns of change over time into a few distinct categories called trajectory groups, which are then included as \"treatments\" in the MSM. MSMs are a class of causal models that correctly handle treatment-confounder feedback. The parameters of LCGA-MSMs can be consistently estimated using different estimators, such as inverse probability weighting (IPW), g-computation, and pooled longitudinal targeted maximum likelihood estimation (pooled LTMLE). 
These three estimators of the parameters of LCGA-MSMs are currently implemented in our package. In the context of a time-dependent outcome, we previously proposed a combination of LCGA and history-restricted MSMs (LCGA-HRMSMs). Our package provides additional functions to estimate the parameters of such models. Version 0.1.3 of the package is currently available on CRAN."}, "https://arxiv.org/abs/2410.18993": {"title": "Deterministic Fokker-Planck Transport -- With Applications to Sampling, Variational Inference, Kernel Mean Embeddings & Sequential Monte Carlo", "link": "https://arxiv.org/abs/2410.18993", "description": "arXiv:2410.18993v1 Announce Type: cross \nAbstract: The Fokker-Planck equation can be reformulated as a continuity equation, which naturally suggests using the associated velocity field in particle flow methods. While the resulting probability flow ODE offers appealing properties - such as defining a gradient flow of the Kullback-Leibler divergence between the current and target densities with respect to the 2-Wasserstein distance - it relies on evaluating the current probability density, which is intractable in most practical applications. By closely examining the drawbacks of approximating this density via kernel density estimation, we uncover opportunities to turn these limitations into advantages in contexts such as variational inference, kernel mean embeddings, and sequential Monte Carlo."}, "https://arxiv.org/abs/2410.19125": {"title": "A spectral method for multi-view subspace learning using the product of projections", "link": "https://arxiv.org/abs/2410.19125", "description": "arXiv:2410.19125v1 Announce Type: cross \nAbstract: Multi-view data provides complementary information on the same set of observations, with multi-omics and multimodal sensor data being common examples. Analyzing such data typically requires distinguishing between shared (joint) and unique (individual) signal subspaces from noisy, high-dimensional measurements. Despite many proposed methods, the conditions for reliably identifying joint and individual subspaces remain unclear. We rigorously quantify these conditions, which depend on the ratio of the signal rank to the ambient dimension, principal angles between true subspaces, and noise levels. Our approach characterizes how spectrum perturbations of the product of projection matrices, derived from each view's estimated subspaces, affect subspace separation. Using these insights, we provide an easy-to-use and scalable estimation algorithm. In particular, we employ rotational bootstrap and random matrix theory to partition the observed spectrum into joint, individual, and noise subspaces. Diagnostic plots visualize this partitioning, providing practical and interpretable insights into the estimation performance. In simulations, our method estimates joint and individual subspaces more accurately than existing approaches. Applications to multi-omics data from colorectal cancer patients and nutrigenomic study of mice demonstrate improved performance in downstream predictive tasks."}, "https://arxiv.org/abs/2410.19300": {"title": "Golden Ratio-Based Sufficient Dimension Reduction", "link": "https://arxiv.org/abs/2410.19300", "description": "arXiv:2410.19300v1 Announce Type: cross \nAbstract: Many machine learning applications deal with high dimensional data. 
To make computations feasible and learning more efficient, it is often desirable to reduce the dimensionality of the input variables by finding linear combinations of the predictors that can retain as much original information as possible in the relationship between the response and the original predictors. We propose a neural network based sufficient dimension reduction method that not only identifies the structural dimension effectively, but also estimates the central space well. It takes advantage of the approximation capabilities of neural networks for functions in Barron classes and leads to reduced computational cost compared to other dimension reduction methods in the literature. Additionally, the framework can be extended to practical dimension reduction tasks, making the methodology more applicable in applied settings."}, "https://arxiv.org/abs/2410.19412": {"title": "Robust Time Series Causal Discovery for Agent-Based Model Validation", "link": "https://arxiv.org/abs/2410.19412", "description": "arXiv:2410.19412v1 Announce Type: cross \nAbstract: Agent-Based Model (ABM) validation is crucial as it helps ensure the reliability of simulations, and causal discovery has become a powerful tool in this context. However, current causal discovery methods often face accuracy and robustness challenges when applied to complex and noisy time series data, which is typical in ABM scenarios. This study addresses these issues by proposing a Robust Cross-Validation (RCV) approach to enhance causal structure learning for ABM validation. We develop RCV-VarLiNGAM and RCV-PCMCI, novel extensions of two prominent causal discovery algorithms. These aim to better reduce the impact of noise and give more reliable causal relations, even with high-dimensional, time-dependent data. The proposed approach is then integrated into an enhanced ABM validation framework, which is designed to handle diverse data and model structures.\n The approach is evaluated using synthetic datasets and a complex simulated fMRI dataset. The results demonstrate greater reliability in causal structure identification. The study examines how various characteristics of datasets affect the performance of established causal discovery methods. These characteristics include linearity, noise distribution, stationarity, and causal structure density. This analysis is then extended to the RCV method to see how it compares in these different situations. This examination helps confirm whether the results are consistent with existing literature and also reveals the strengths and weaknesses of the novel approaches.\n By tackling key methodological challenges, the study aims to enhance ABM validation by presenting a more resilient validation framework. These improvements increase the reliability of model-driven decision making processes in complex systems analysis."}, "https://arxiv.org/abs/2410.19596": {"title": "On the robustness of semi-discrete optimal transport", "link": "https://arxiv.org/abs/2410.19596", "description": "arXiv:2410.19596v1 Announce Type: cross \nAbstract: We derive the breakdown point for solutions of semi-discrete optimal transport problems, which characterizes the robustness of the multivariate quantiles based on optimal transport proposed in Ghosal and Sen (2022). We do so under very mild assumptions: the absolutely continuous reference measure is only assumed to have a support that is compact and convex, whereas the target measure is a general discrete measure on a finite number, $n$ say, of atoms. 
The breakdown point depends on the target measure only through its probability weights (hence not on the location of the atoms) and involves the geometry of the reference measure through the Tukey (1975) concept of halfspace depth. Remarkably, depending on this geometry, the breakdown point of the optimal transport median can be strictly smaller than the breakdown point of the univariate median or the breakdown point of the spatial median, namely~$\\lceil n/2\\rceil /2$. In the context of robust location estimation, our results provide a subtle insight into how to perform multivariate trimming when constructing trimmed means based on optimal transport."}, "https://arxiv.org/abs/2204.11979": {"title": "Semi-Parametric Sensitivity Analysis for Trials with Irregular and Informative Assessment Times", "link": "https://arxiv.org/abs/2204.11979", "description": "arXiv:2204.11979v4 Announce Type: replace \nAbstract: Many trials are designed to collect outcomes at or around pre-specified times after randomization. If there is variability in the times when participants are actually assessed, this can pose a challenge to learning the effect of treatment, since not all participants have outcome assessments at the times of interest. Furthermore, observed outcome values may not be representative of all participants' outcomes at a given time. Methods have been developed that account for some types of such irregular and informative assessment times; however, since these methods rely on untestable assumptions, sensitivity analyses are needed. We develop a methodology that is benchmarked at the explainable assessment (EA) assumption, under which assessment and outcomes at each time are related only through data collected prior to that time. Our method uses an exponential tilting assumption, governed by a sensitivity analysis parameter, that posits deviations from the EA assumption. Our inferential strategy is based on a new influence function-based, augmented inverse intensity-weighted estimator. Our approach allows for flexible semiparametric modeling of the observed data, which is separated from specification of the sensitivity parameter. We apply our method to a randomized trial of low-income individuals with uncontrolled asthma, and we illustrate implementation of our estimation procedure in detail."}, "https://arxiv.org/abs/2312.05858": {"title": "Causal inference and policy evaluation without a control group", "link": "https://arxiv.org/abs/2312.05858", "description": "arXiv:2312.05858v2 Announce Type: replace \nAbstract: Without a control group, the most widespread methodologies for estimating causal effects cannot be applied. To fill this gap, we propose the Machine Learning Control Method, a new approach for causal panel analysis that estimates causal parameters without relying on untreated units. We formalize identification within the potential outcomes framework and then provide estimation based on machine learning algorithms. To illustrate the practical relevance of our method, we present simulation evidence, a replication study, and an empirical application on the impact of the COVID-19 crisis on educational inequality. 
We implement the proposed approach in the companion R package MachineControl"}, "https://arxiv.org/abs/2312.14013": {"title": "Two-Stage Pseudo Maximum Likelihood Estimation of Semiparametric Copula-based Regression Models for Semi-Competing Risks Data", "link": "https://arxiv.org/abs/2312.14013", "description": "arXiv:2312.14013v2 Announce Type: replace \nAbstract: We propose a two-stage estimation procedure for a copula-based model with semi-competing risks data, where the non-terminal event is subject to dependent censoring by the terminal event, and both events are subject to independent censoring. With a copula-based model, the marginal survival functions of individual event times are specified by semiparametric transformation models, and the dependence between the bivariate event times is specified by a parametric copula function. For the estimation procedure, in the first stage, the parameters associated with the marginal of the terminal event are estimated using only the corresponding observed outcomes, and in the second stage, the marginal parameters for the non-terminal event time and the copula parameter are estimated together via maximizing a pseudo-likelihood function based on the joint distribution of the bivariate event times. We derived the asymptotic properties of the proposed estimator and provided an analytic variance estimator for inference. Through simulation studies, we showed that our approach leads to consistent estimates with less computational cost and more robustness than the one-stage procedure developed in Chen (2012), where all parameters were estimated simultaneously. In addition, our approach demonstrates more desirable finite-sample performances over another existing two-stage estimation method proposed in Zhu et al. (2021). An R package PMLE4SCR is developed to implement our proposed method."}, "https://arxiv.org/abs/2401.13179": {"title": "Realized Stochastic Volatility Model with Skew-t Distributions for Improved Volatility and Quantile Forecasting", "link": "https://arxiv.org/abs/2401.13179", "description": "arXiv:2401.13179v2 Announce Type: replace \nAbstract: Forecasting volatility and quantiles of financial returns is essential for accurately measuring financial tail risks, such as value-at-risk and expected shortfall. The critical elements in these forecasts involve understanding the distribution of financial returns and accurately estimating volatility. This paper introduces an advancement to the traditional stochastic volatility model, termed the realized stochastic volatility model, which integrates realized volatility as a precise estimator of volatility. To capture the well-known characteristics of return distribution, namely skewness and heavy tails, we incorporate three types of skew-t distributions. Among these, two distributions include the skew-normal feature, offering enhanced flexibility in modeling the return distribution. We employ a Bayesian estimation approach using the Markov chain Monte Carlo method and apply it to major stock indices. 
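For reference, a commonly used realized stochastic volatility specification consistent with the description above is sketched below, with daily return $y_t$, log realized volatility $x_t$, and latent log volatility $h_t$; the skew-t innovation reflects the paper's extension, while the exact parameterization shown here is an assumption.

```latex
\begin{aligned}
y_t &= \exp(h_t/2)\,\varepsilon_t, \qquad \varepsilon_t \sim \text{skew-}t,\\
x_t &= \xi + h_t + u_t, \qquad u_t \sim \mathcal{N}(0, \sigma_u^2),\\
h_{t+1} &= \mu + \phi\,(h_t - \mu) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \sigma_\eta^2).
\end{aligned}
```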
Our empirical analysis, utilizing data from US and Japanese stock indices, indicates that the inclusion of both skewness and heavy tails in daily returns significantly improves the accuracy of volatility and quantile forecasts."}, "https://arxiv.org/abs/2207.10464": {"title": "Rate-optimal estimation of mixed semimartingales", "link": "https://arxiv.org/abs/2207.10464", "description": "arXiv:2207.10464v2 Announce Type: replace-cross \nAbstract: Consider the sum $Y=B+B(H)$ of a Brownian motion $B$ and an independent fractional Brownian motion $B(H)$ with Hurst parameter $H\\in(0,1)$. Even though $B(H)$ is not a semimartingale, it was shown in [\\textit{Bernoulli} \\textbf{7} (2001) 913--934] that $Y$ is a semimartingale if $H>3/4$. Moreover, $Y$ is locally equivalent to $B$ in this case, so $H$ cannot be consistently estimated from local observations of $Y$. This paper pivots on another unexpected feature in this model: if $B$ and $B(H)$ become correlated, then $Y$ will never be a semimartingale, and $H$ can be identified, regardless of its value. This and other results will follow from a detailed statistical analysis of a more general class of processes called \\emph{mixed semimartingales}, which are semiparametric extensions of $Y$ with stochastic volatility in both the martingale and the fractional component. In particular, we derive consistent estimators and feasible central limit theorems for all parameters and processes that can be identified from high-frequency observations. We further show that our estimators achieve optimal rates in a minimax sense."}, "https://arxiv.org/abs/2410.19947": {"title": "Testing the effects of an unobservable factor: Do marriage prospects affect college major choice?", "link": "https://arxiv.org/abs/2410.19947", "description": "arXiv:2410.19947v1 Announce Type: new \nAbstract: Motivated by studying the effects of marriage prospects on students' college major choices, this paper develops a new econometric test for analyzing the effects of an unobservable factor in a setting where this factor potentially influences both agents' decisions and a binary outcome variable. Our test is built upon a flexible copula-based estimation procedure and leverages the ordered nature of latent utilities of the polychotomous choice model. Using the proposed method, we demonstrate that marriage prospects significantly influence the college major choices of college graduates participating in the National Longitudinal Study of Youth (97) Survey. Furthermore, we validate the robustness of our findings with alternative tests that use stated marriage expectation measures from our data, thereby demonstrating the applicability and validity of our testing procedure in real-life scenarios."}, "https://arxiv.org/abs/2410.20029": {"title": "Jacobian-free Efficient Pseudo-Likelihood (EPL) Algorithm", "link": "https://arxiv.org/abs/2410.20029", "description": "arXiv:2410.20029v1 Announce Type: new \nAbstract: This study proposes a simple procedure to compute Efficient Pseudo Likelihood (EPL) estimator proposed by Dearing and Blevins (2024) for estimating dynamic discrete games, without computing Jacobians of equilibrium constraints. EPL estimator is efficient, convergent, and computationally fast. However, the original algorithm requires deriving and coding the Jacobians, which are cumbersome and prone to coding mistakes especially when considering complicated models. 
The current study proposes to avoid the computation of Jacobians by combining the ideas of numerical derivatives (for computing Jacobian-vector products) and the Krylov method (for solving linear equations). It shows good computational performance of the proposed method by numerical experiments."}, "https://arxiv.org/abs/2410.20045": {"title": "Accurate Inference for Penalized Logistic Regression", "link": "https://arxiv.org/abs/2410.20045", "description": "arXiv:2410.20045v1 Announce Type: new \nAbstract: Inference for high-dimensional logistic regression models using penalized methods has been a challenging research problem. As an illustration, a major difficulty is the significant bias of the Lasso estimator, which limits its direct application in inference. Although various bias corrected Lasso estimators have been proposed, they often still exhibit substantial biases in finite samples, undermining their inference performance. These finite sample biases become particularly problematic in one-sided inference problems, such as one-sided hypothesis testing. This paper proposes a novel two-step procedure for accurate inference in high-dimensional logistic regression models. In the first step, we propose a Lasso-based variable selection method to select a suitable submodel of moderate size for subsequent inference. In the second step, we introduce a bias corrected estimator to fit the selected submodel. We demonstrate that the resulting estimator from this two-step procedure has a small bias order and enables accurate inference. Numerical studies and an analysis of alcohol consumption data are included, where our proposed method is compared to alternative approaches. Our results indicate that the proposed method exhibits significantly smaller biases than alternative methods in finite samples, thereby leading to improved inference performance."}, "https://arxiv.org/abs/2410.20138": {"title": "Functional Mixture Regression Control Chart", "link": "https://arxiv.org/abs/2410.20138", "description": "arXiv:2410.20138v1 Announce Type: new \nAbstract: Industrial applications often exhibit multiple in-control patterns due to varying operating conditions, which makes a single functional linear model (FLM) inadequate to capture the complexity of the true relationship between a functional quality characteristic and covariates, which gives rise to the multimode profile monitoring problem. This issue is clearly illustrated in the resistance spot welding (RSW) process in the automotive industry, where different operating conditions lead to multiple in-control states. In these states, factors such as electrode tip wear and dressing may influence the functional quality characteristic differently, resulting in distinct FLMs across subpopulations. To address this problem, this article introduces the functional mixture regression control chart (FMRCC) to monitor functional quality characteristics with multiple in-control patterns and covariate information, modeled using a mixture of FLMs. A monitoring strategy based on the likelihood ratio test is proposed to monitor any deviation from the estimated in-control heterogeneous population. 
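The Jacobian-free idea described in the EPL entry above (finite-difference Jacobian-vector products combined with a Krylov solver) can be sketched on a generic nonlinear system; the toy map G, step size, and iteration counts are assumptions, and the dynamic-game equilibrium constraints of the paper are not reproduced.

```python
# Sketch: solve J(x) d = -G(x) without forming J, using finite-difference
# Jacobian-vector products inside GMRES (a Krylov method).
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def G(x):                                    # illustrative nonlinear system (assumed)
    return np.array([x[0] ** 2 + x[1] - 3.0,
                     x[0] + x[1] ** 2 - 5.0])

x = np.array([1.0, 1.0])
for _ in range(15):                          # Newton iterations
    gx = G(x)
    if np.linalg.norm(gx) < 1e-9:
        break
    eps = 1e-7
    def jvp(v, x=x, gx=gx):                  # J(x) v ~ (G(x + eps*v) - G(x)) / eps
        return (G(x + eps * v) - gx) / eps
    J = LinearOperator((2, 2), matvec=jvp)
    d, info = gmres(J, -gx)                  # Krylov solve for the Newton step
    x = x + d
print("solution:", x, "residual norm:", np.linalg.norm(G(x)))
```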
An extensive Monte Carlo simulation study is performed to compare the FMRCC with competing monitoring schemes that have already appeared in the literature, and a case study in the monitoring of an RSW process in the automotive industry, which motivated this research, illustrates its practical applicability."}, "https://arxiv.org/abs/2410.20169": {"title": "Robust Bayes-assisted Confidence Regions", "link": "https://arxiv.org/abs/2410.20169", "description": "arXiv:2410.20169v1 Announce Type: new \nAbstract: The Frequentist, Assisted by Bayes (FAB) framework aims to construct confidence regions that leverage information about parameter values in the form of a prior distribution. FAB confidence regions (FAB-CRs) have smaller volume for values of the parameter that are likely under the prior, while maintaining exact frequentist coverage. This work introduces several methodological and theoretical contributions to the FAB framework. For Gaussian likelihoods, we show that the posterior mean of the parameter of interest is always contained in the FAB-CR. As such, the posterior mean constitutes a natural notion of FAB estimator to be reported alongside the FAB-CR. More generally, we show that for a likelihood in the natural exponential family, a transformation of the posterior mean of the natural parameter is always contained in the FAB-CR. For Gaussian likelihoods, we show that power law tails conditions on the marginal likelihood induce robust FAB-CRs, that are uniformly bounded and revert to the standard frequentist confidence intervals for extreme observations. We translate this result into practice by proposing a class of shrinkage priors for the FAB framework that satisfy this condition without sacrificing analytical tractability. The resulting FAB estimators are equal to prominent Bayesian shrinkage estimators, including the horseshoe estimator, thereby establishing insightful connections between robust FAB-CRs and Bayesian shrinkage methods."}, "https://arxiv.org/abs/2410.20208": {"title": "scpQCA: Enhancing mvQCA Applications through Set-Covering-Based QCA Method", "link": "https://arxiv.org/abs/2410.20208", "description": "arXiv:2410.20208v1 Announce Type: new \nAbstract: In fields such as sociology, political science, public administration, and business management, particularly in the direction of international relations, Qualitative Comparative Analysis (QCA) has been widely adopted as a research method. This article addresses the limitations of the QCA method in its application, specifically in terms of low coverage, factor limitations, and value limitations. scpQCA enhances the coverage of results and expands the tolerance of the QCA method for multi-factor and multi-valued analyses by maintaining the consistency threshold. To validate these capabilities, we conducted experiments on both random data and specific case datasets, utilizing different approaches of CCM (Configurational Comparative Methods) such as scpQCA, CNA, and QCApro, and presented the different results. 
In addition, the robustness of scpQCA has been examined from both internal and external perspectives across different case datasets, thereby demonstrating its extensive applicability and advantages over existing QCA algorithms."}, "https://arxiv.org/abs/2410.20319": {"title": "High-dimensional partial linear model with trend filtering", "link": "https://arxiv.org/abs/2410.20319", "description": "arXiv:2410.20319v1 Announce Type: new \nAbstract: We study the high-dimensional partial linear model, where the linear part has a high-dimensional sparse regression coefficient and the nonparametric part includes a function whose derivatives are of bounded total variation. We expand upon the univariate trend filtering to develop partial linear trend filtering--a doubly penalized least squares estimation approach based on $\\ell_1$ penalty and total variation penalty. Analogous to the advantages of trend filtering in univariate nonparametric regression, partial linear trend filtering not only can be efficiently computed, but also achieves the optimal error rate for estimating the nonparametric function. This in turn leads to the oracle rate for the linear part as if the underlying nonparametric function were known. We compare the proposed approach with a standard smoothing spline based method, and show both empirically and theoretically that the former outperforms the latter when the underlying function possesses heterogeneous smoothness. We apply our approach to the IDATA study to investigate the relationship between metabolomic profiles and ultra-processed food (UPF) intake, efficiently identifying key metabolites associated with UPF consumption and demonstrating strong predictive performance."}, "https://arxiv.org/abs/2410.20443": {"title": "A Robust Topological Framework for Detecting Regime Changes in Multi-Trial Experiments with Application to Predictive Maintenance", "link": "https://arxiv.org/abs/2410.20443", "description": "arXiv:2410.20443v1 Announce Type: new \nAbstract: We present a general and flexible framework for detecting regime changes in complex, non-stationary data across multi-trial experiments. Traditional change point detection methods focus on identifying abrupt changes within a single time series (single trial), targeting shifts in statistical properties such as the mean, variance, and spectrum over time within that sole trial. In contrast, our approach considers changes occurring across trials, accommodating changes that may arise within individual trials due to experimental inconsistencies, such as varying delays or event duration. By leveraging diverse metrics to analyze time-frequency characteristics, specifically topological changes in the spectrum and spectrograms, our approach offers a comprehensive framework for detecting such variations. Our approach can handle different scenarios with various statistical assumptions, including varying levels of stationarity within and across trials, making our framework highly adaptable. We validate our approach through simulations using time-varying autoregressive processes that exhibit different regime changes. Our results demonstrate the effectiveness of detecting changes across trials under diverse conditions. Furthermore, we illustrate the effectiveness of our method by applying it to predictive maintenance using the NASA bearing dataset. 
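The doubly penalized least-squares formulation in the partial linear trend filtering abstract above lends itself to a direct convex-programming illustration. The following sketch uses cvxpy on simulated data; the penalty levels `lam1` and `lam2` and the piecewise-linear signal are illustrative choices, not the paper's tuned estimator.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 200, 50
t = np.linspace(0.0, 1.0, n)                   # ordered scalar covariate for the nonparametric part
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
f_true = np.where(t < 0.5, 2 * t, 2 - 2 * t)   # piecewise-linear signal
y = X @ beta_true + f_true + 0.3 * rng.standard_normal(n)

D = np.diff(np.eye(n), n=2, axis=0)            # second differences: penalizes changes in slope

beta, theta = cp.Variable(p), cp.Variable(n)
lam1, lam2 = 0.05, 1.0                         # illustrative tuning parameters
objective = cp.Minimize(cp.sum_squares(y - X @ beta - theta)
                        + lam1 * cp.norm1(beta) + lam2 * cp.norm1(D @ theta))
cp.Problem(objective).solve()
print(np.round(beta.value[:5], 2))             # sparse linear part
```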
By analyzing the time-frequency characteristics of vibration signals recorded by accelerometers, our approach accurately identifies bearing failures, showcasing its strong potential for early fault detection in mechanical systems."}, "https://arxiv.org/abs/2410.20628": {"title": "International vulnerability of inflation", "link": "https://arxiv.org/abs/2410.20628", "description": "arXiv:2410.20628v1 Announce Type: new \nAbstract: In a globalised world, inflation in a given country may be becoming less responsive to domestic economic activity, while being increasingly determined by international conditions. Consequently, understanding the international sources of vulnerability of domestic inflation is turning fundamental for policy makers. In this paper, we propose the construction of Inflation-at-risk and Deflation-at-risk measures of vulnerability obtained using factor-augmented quantile regressions estimated with international factors extracted from a multi-level Dynamic Factor Model with overlapping blocks of inflations corresponding to economies grouped either in a given geographical region or according to their development level. The methodology is implemented to inflation observed monthly from 1999 to 2022 for over 115 countries. We conclude that, in a large number of developed countries, international factors are relevant to explain the right tail of the distribution of inflation, and, consequently, they are more relevant for the vulnerability related to high inflation than for average or low inflation. However, while inflation of developing low-income countries is hardly affected by international conditions, the results for middle-income countries are mixed. Finally, based on a rolling-window out-of-sample forecasting exercise, we show that the predictive power of international factors has increased in the most recent years of high inflation."}, "https://arxiv.org/abs/2410.20641": {"title": "The Curious Problem of the Normal Inverse Mean", "link": "https://arxiv.org/abs/2410.20641", "description": "arXiv:2410.20641v1 Announce Type: new \nAbstract: In astronomical observations, the estimation of distances from parallaxes is a challenging task due to the inherent measurement errors and the non-linear relationship between the parallax and the distance. This study leverages ideas from robust Bayesian inference to tackle these challenges, investigating a broad class of prior densities for estimating distances with a reduced bias and variance. Through theoretical analysis, simulation experiments, and the application to data from the Gaia Data Release 1 (GDR1), we demonstrate that heavy-tailed priors provide more reliable distance estimates, particularly in the presence of large fractional parallax errors. Theoretical results highlight the \"curse of a single observation,\" where the likelihood dominates the posterior, limiting the impact of the prior. Nevertheless, heavy-tailed priors can delay the explosion of posterior risk, offering a more robust framework for distance estimation. 
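As a rough illustration of the Inflation-at-Risk idea described above, one can extract common factors by principal components and run quantile regressions in the tails. This stand-in uses plain PCA and statsmodels' QuantReg on simulated panel data rather than the paper's multi-level dynamic factor model; all data and lags are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
T, N = 240, 30                                      # months x countries (synthetic)
common = rng.standard_normal((T, 2))
panel = common @ rng.standard_normal((2, N)) + rng.standard_normal((T, N))

factors = PCA(n_components=2).fit_transform(panel)  # stand-in for international factors
y = panel[:, 0]                                     # one country's inflation series

X = sm.add_constant(factors[:-1])                   # lagged factors as predictors
iar = sm.QuantReg(y[1:], X).fit(q=0.90)             # Inflation-at-Risk (right tail)
dar = sm.QuantReg(y[1:], X).fit(q=0.10)             # Deflation-at-Risk (left tail)
print(np.round(iar.params, 2), np.round(dar.params, 2))
```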
The findings suggest that reciprocal invariant priors, with polynomial decay in their tails, such as the Half-Cauchy and Product Half-Cauchy, are particularly well-suited for this task, providing a balance between bias reduction and variance control."}, "https://arxiv.org/abs/2410.20671": {"title": "Statistical Inference in High-dimensional Poisson Regression with Applications to Mediation Analysis", "link": "https://arxiv.org/abs/2410.20671", "description": "arXiv:2410.20671v1 Announce Type: new \nAbstract: Large-scale datasets with count outcome variables are widely present in various applications, and the Poisson regression model is among the most popular models for handling count outcomes. This paper considers the high-dimensional sparse Poisson regression model and proposes bias-corrected estimators for both linear and quadratic transformations of high-dimensional regression vectors. We establish the asymptotic normality of the estimators, construct asymptotically valid confidence intervals, and conduct related hypothesis testing. We apply the devised methodology to high-dimensional mediation analysis with count outcome, with particular application of testing for the existence of interaction between the treatment variable and high-dimensional mediators. We demonstrate the proposed methods through extensive simulation studies and application to real-world epigenetic data."}, "https://arxiv.org/abs/2410.20860": {"title": "Robust Network Targeting with Multiple Nash Equilibria", "link": "https://arxiv.org/abs/2410.20860", "description": "arXiv:2410.20860v1 Announce Type: new \nAbstract: Many policy problems involve designing individualized treatment allocation rules to maximize the equilibrium social welfare of interacting agents. Focusing on large-scale simultaneous decision games with strategic complementarities, we develop a method to estimate an optimal treatment allocation rule that is robust to the presence of multiple equilibria. Our approach remains agnostic about changes in the equilibrium selection mechanism under counterfactual policies, and we provide a closed-form expression for the boundary of the set-identified equilibrium outcomes. To address the incompleteness that arises when an equilibrium selection mechanism is not specified, we use the maximin welfare criterion to select a policy, and implement this policy using a greedy algorithm. We establish a performance guarantee for our method by deriving a welfare regret bound, which accounts for sampling uncertainty and the use of the greedy algorithm. We demonstrate our method with an application to the microfinance dataset of Banerjee et al. (2013)."}, "https://arxiv.org/abs/2410.20885": {"title": "A Distributed Lag Approach to the Generalised Dynamic Factor Model (GDFM)", "link": "https://arxiv.org/abs/2410.20885", "description": "arXiv:2410.20885v1 Announce Type: new \nAbstract: We provide estimation and inference for the Generalised Dynamic Factor Model (GDFM) under the assumption that the dynamic common component can be expressed in terms of a finite number of lags of contemporaneously pervasive factors. 
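The parallax-to-distance problem discussed in the "Normal Inverse Mean" abstract above can be visualized with a small grid computation: a Gaussian likelihood in the parallax combined with either a flat or a heavy-tailed prior on distance. The numbers and the Half-Cauchy scale below are illustrative only, not the paper's analysis of the Gaia data.

```python
import numpy as np
from scipy.stats import norm, halfcauchy

parallax, sigma = 0.5, 0.25            # observed parallax and its error (illustrative units)
d = np.linspace(0.01, 20.0, 5000)      # grid of candidate distances

likelihood = norm.pdf(parallax, loc=1.0 / d, scale=sigma)   # Normal(parallax | 1/d, sigma)
for name, prior in [("uniform", np.ones_like(d)),
                    ("half-Cauchy", halfcauchy.pdf(d, scale=2.0))]:
    post = likelihood * prior
    post /= post.sum()                 # normalize on the grid
    print(name, "posterior mean distance:", round(float((d * post).sum()), 2))
```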
The proposed estimator is simply an OLS regression of the observed variables on factors extracted via static principal components and therefore avoids frequency domain techniques entirely."}, "https://arxiv.org/abs/2410.20908": {"title": "Making all pairwise comparisons in multi-arm clinical trials without control treatment", "link": "https://arxiv.org/abs/2410.20908", "description": "arXiv:2410.20908v1 Announce Type: new \nAbstract: The standard paradigm for confirmatory clinical trials is to compare experimental treatments with a control, for example the standard of care or a placebo. However, it is not always the case that a suitable control exists. Efficient statistical methodology is well studied in the setting of randomised controlled trials. This is not the case if one wishes to compare several experimental treatments with no control arm. We propose hypothesis testing methods suitable for use in such a setting. These methods are efficient, ensuring the error rate is controlled at exactly the desired rate with no conservatism. This in turn yields an improvement in power when compared with standard methods one might otherwise consider using, such as a Bonferroni adjustment. The proposed testing procedure is also highly flexible. We show how it may be extended for use in multi-stage adaptive trials, covering the majority of scenarios in which one might consider the use of such procedures in the clinical trials setting. With such a highly flexible nature, these methods may also be applied more broadly outside of a clinical trials setting."}, "https://arxiv.org/abs/2410.20915": {"title": "On Spatio-Temporal Stochastic Frontier Models", "link": "https://arxiv.org/abs/2410.20915", "description": "arXiv:2410.20915v1 Announce Type: new \nAbstract: In the literature on stochastic frontier models until the early 2000s, the joint consideration of spatial and temporal dimensions was often inadequately addressed, if not completely neglected. However, from an evolutionary economics perspective, the production process of the decision-making units constantly changes over both dimensions: it is not stable over time due to managerial enhancements and/or internal or external shocks, and is influenced by the nearest territorial neighbours. This paper proposes an extension of the Fusco and Vidoli [2013] SEM-like approach, which globally accounts for spatial and temporal effects in the inefficiency term. In particular, coherently with the stochastic panel frontier literature, two different versions of the model are proposed: the time-invariant and the time-varying spatial stochastic frontier models. In order to evaluate the inferential properties of the proposed estimators, we first run Monte Carlo experiments and we then present the results of an application to a set of commonly referenced data, demonstrating robustness and stability of estimates across all scenarios."}, "https://arxiv.org/abs/2410.20918": {"title": "Almost goodness-of-fit tests", "link": "https://arxiv.org/abs/2410.20918", "description": "arXiv:2410.20918v1 Announce Type: new \nAbstract: We introduce the almost goodness-of-fit test, a procedure to decide if a (parametric) model provides a good representation of the probability distribution generating the observed sample. We consider the approximate model determined by an M-estimator of the parameters as the best representative of the unknown distribution within the parametric class. 
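The distributed-lag GDFM estimator described above, static principal-component factors followed by OLS on current and lagged factors, is simple enough to sketch directly. The toy data-generating process and the single lag below are assumptions for illustration, not the paper's inference procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, r, lags = 300, 40, 2, 1
f = rng.standard_normal((T, r))                      # latent factors (toy DGP)
load0, load1 = rng.standard_normal((r, N)), rng.standard_normal((r, N))
X = f @ load0 + 0.5 * np.roll(f, 1, axis=0) @ load1 + rng.standard_normal((T, N))

Xc = X - X.mean(axis=0)
_, eigvec = np.linalg.eigh(Xc.T @ Xc / T)
F = Xc @ eigvec[:, -r:]                              # static principal-component factors

# OLS of each observed variable on current and lagged factors.
Z = np.column_stack([np.ones(T - lags)] + [F[lags - k:T - k] for k in range(lags + 1)])
B, *_ = np.linalg.lstsq(Z, Xc[lags:], rcond=None)
common_component = Z @ B
print(common_component.shape)                        # (T - lags, N)
```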
The objective is the approximate validation of a distribution or an entire parametric family up to a pre-specified threshold value, the margin of error. The methodology also allows quantifying the percentage improvement of the proposed model compared to a non-informative (constant) one. The test statistic is the $\\mathrm{L}^p$-distance between the empirical distribution function and the corresponding one of the estimated (parametric) model. The value of the parameter $p$ allows modulating the impact of the tails of the distribution in the validation of the model. By deriving the asymptotic distribution of the test statistic, as well as proving the consistency of its bootstrap approximation, we present an easy-to-implement and flexible method. The performance of the proposal is illustrated with a simulation study and the analysis of a real dataset."}, "https://arxiv.org/abs/2410.21098": {"title": "Single CASANOVA? Not in multiple comparisons", "link": "https://arxiv.org/abs/2410.21098", "description": "arXiv:2410.21098v1 Announce Type: new \nAbstract: When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any groups but rather the location. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply some corrections or introduce tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated [1] and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent [2]. We propose two new tests based on combined weighted log-rank tests. One as a simple multiple contrast test of weighted log-rank tests and one as an extension of the so-called CASANOVA test [3]. The latter was introduced for factorial designs. We propose a new multiple contrast test based on the CASANOVA approach. Our test promises to be more powerful under crossing hazards and eliminates the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power."}, "https://arxiv.org/abs/2410.21104": {"title": "Topological Identification of Agent Status in Information Contagions: Application to Financial Markets", "link": "https://arxiv.org/abs/2410.21104", "description": "arXiv:2410.21104v1 Announce Type: new \nAbstract: Cascade models serve as effective tools for understanding the propagation of information and diseases within social networks. Nevertheless, their applicability becomes constrained when the states of the agents (nodes) are hidden and can only be inferred through indirect observations or symptoms. This study proposes a Mapper-based strategy to infer the status of agents within a hidden information cascade model using expert knowledge. To verify and demonstrate the method we identify agents who are likely to take advantage of information obtained from an inside information network. We do this using data on insider networks and stock market transactions. 
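A bare-bones version of the almost goodness-of-fit statistic above, an L^p distance between the empirical distribution function and a fitted parametric model with a parametric-bootstrap reference, might look as follows. The discrete approximation of the L^p distance, the normal model, and the MLE-type fit standing in for the M-estimator are simplifications, not the paper's construction.

```python
import numpy as np
from scipy import stats

def lp_stat(x, p=2):
    # Discrete L^p distance between the empirical CDF and a fitted normal CDF.
    x = np.sort(x)
    n = x.size
    mu, sd = x.mean(), x.std(ddof=1)            # simple stand-in for the M-estimator
    ecdf = np.arange(1, n + 1) / n
    dist = (np.mean(np.abs(ecdf - stats.norm.cdf(x, mu, sd)) ** p)) ** (1 / p)
    return dist, mu, sd

rng = np.random.default_rng(3)
x = rng.gamma(4.0, 1.0, size=300)               # mildly skewed data
t_obs, mu, sd = lp_stat(x)

boot = [lp_stat(rng.normal(mu, sd, size=x.size))[0] for _ in range(500)]
print("observed statistic:", round(t_obs, 4),
      "| bootstrap 95% quantile:", round(float(np.quantile(boot, 0.95)), 4))
```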
Recognizing the sensitive nature of allegations of insider trading, we design a conservative approach to minimize false positives, ensuring that innocent agents are not wrongfully implicated. The Mapper-based results systematically outperform other methods, such as clustering and unsupervised anomaly detection, on synthetic data. We also apply the method to empirical data and verify the results using a statistical validation method based on persistence homology. Our findings highlight that the proposed Mapper-based technique successfully identifies a subpopulation of opportunistic agents within the information cascades. The adaptability of this method to diverse data types and sizes is demonstrated, with potential for tailoring for specific applications."}, "https://arxiv.org/abs/2410.21105": {"title": "Difference-in-Differences with Time-varying Continuous Treatments using Double/Debiased Machine Learning", "link": "https://arxiv.org/abs/2410.21105", "description": "arXiv:2410.21105v1 Announce Type: new \nAbstract: We propose a difference-in-differences (DiD) method for a time-varying continuous treatment and multiple time periods. Our framework assesses the average treatment effect on the treated (ATET) when comparing two non-zero treatment doses. The identification is based on a conditional parallel trend assumption imposed on the mean potential outcome under the lower dose, given observed covariates and past treatment histories. We employ kernel-based ATET estimators for repeated cross-sections and panel data adopting the double/debiased machine learning framework to control for covariates and past treatment histories in a data-adaptive manner. We also demonstrate the asymptotic normality of our estimation approach under specific regularity conditions. In a simulation study, we find a compelling finite sample performance of undersmoothed versions of our estimators in setups with several thousand observations."}, "https://arxiv.org/abs/2410.21166": {"title": "A Componentwise Estimation Procedure for Multivariate Location and Scatter: Robustness, Efficiency and Scalability", "link": "https://arxiv.org/abs/2410.21166", "description": "arXiv:2410.21166v1 Announce Type: new \nAbstract: Covariance matrix estimation is an important problem in multivariate data analysis, both from theoretical as well as applied points of view. Many simple and popular covariance matrix estimators are known to be severely affected by model misspecification and the presence of outliers in the data; on the other hand robust estimators with reasonably high efficiency are often computationally challenging for modern large and complex datasets. In this work, we propose a new, simple, robust and highly efficient method for estimation of the location vector and the scatter matrix for elliptically symmetric distributions. The proposed estimation procedure is designed in the spirit of the minimum density power divergence (DPD) estimation approach with appropriate modifications which makes our proposal (sequential minimum DPD estimation) computationally very economical and scalable to large as well as higher dimensional datasets. Consistency and asymptotic normality of the proposed sequential estimators of the multivariate location and scatter are established along with asymptotic positive definiteness of the estimated scatter matrix. Robustness of our estimators are studied by means of influence functions. All theoretical results are illustrated further under multivariate normality. 
A large-scale simulation study is presented to assess finite sample performances and scalability of our method in comparison to the usual maximum likelihood estimator (MLE), the ordinary minimum DPD estimator (MDPDE) and other popular non-parametric methods. The applicability of our method is further illustrated with a real dataset on credit card transactions."}, "https://arxiv.org/abs/2410.21213": {"title": "Spatial causal inference in the presence of preferential sampling to study the impacts of marine protected areas", "link": "https://arxiv.org/abs/2410.21213", "description": "arXiv:2410.21213v1 Announce Type: new \nAbstract: Marine Protected Areas (MPAs) have been established globally to conserve marine resources. Given their maintenance costs and impact on commercial fishing, it is critical to evaluate their effectiveness to support future conservation. In this paper, we use data collected from the Australian coast to estimate the effect of MPAs on biodiversity. Environmental studies such as these are often observational, and processes of interest exhibit spatial dependence, which presents challenges in estimating the causal effects. Spatial data can also be subject to preferential sampling, where the sampling locations are related to the response variable, further complicating inference and prediction. To address these challenges, we propose a spatial causal inference method that simultaneously accounts for unmeasured spatial confounders in both the sampling process and the treatment allocation. We prove the identifiability of key parameters in the model and the consistency of the posterior distributions of those parameters. We show via simulation studies that the causal effect of interest can be reliably estimated under the proposed model. The proposed method is applied to assess the effect of MPAs on fish biomass. We find evidence of preferential sampling and that properly accounting for this source of bias impacts the estimate of the causal effect."}, "https://arxiv.org/abs/2410.21263": {"title": "Adaptive Transfer Clustering: A Unified Framework", "link": "https://arxiv.org/abs/2410.21263", "description": "arXiv:2410.21263v1 Announce Type: new \nAbstract: We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios."}, "https://arxiv.org/abs/2410.19780": {"title": "Sampling from Bayesian Neural Network Posteriors with Symmetric Minibatch Splitting Langevin Dynamics", "link": "https://arxiv.org/abs/2410.19780", "description": "arXiv:2410.19780v1 Announce Type: cross \nAbstract: We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. 
For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias $O(h^2 d^{1/2})$ in dimension $d>0$ with stepsize $h>0$, despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging."}, "https://arxiv.org/abs/2410.19923": {"title": "Language Agents Meet Causality -- Bridging LLMs and Causal World Models", "link": "https://arxiv.org/abs/2410.19923", "description": "arXiv:2410.19923v1 Announce Type: cross \nAbstract: Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRLs with LLMs to enable causally-aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons."}, "https://arxiv.org/abs/2410.19952": {"title": "L\\'evy graphical models", "link": "https://arxiv.org/abs/2410.19952", "description": "arXiv:2410.19952v1 Announce Type: cross \nAbstract: Conditional independence and graphical models are crucial concepts for sparsity and statistical modeling in higher dimensions. For L\\'evy processes, a widely applied class of stochastic processes, these notions have not been studied. By the L\\'evy-It\\^o decomposition, a multivariate L\\'evy process can be decomposed into the sum of a Brownian motion part and an independent jump process. We show that conditional independence statements between the marginal processes can be studied separately for these two parts. While the Brownian part is well-understood, we derive a novel characterization of conditional independence between the sample paths of the jump process in terms of the L\\'evy measure. We define L\\'evy graphical models as L\\'evy processes that satisfy undirected or directed Markov properties. We prove that the graph structure is invariant under changes of the univariate marginal processes. 
L\\'evy graphical models allow the construction of flexible, sparse dependence models for L\\'evy processes in large dimensions, which are interpretable thanks to the underlying graph. For trees, we develop statistical methodology to learn the underlying structure from low- or high-frequency observations of the L\\'evy process and show consistent graph recovery. We apply our method to model stock returns from U.S. companies to illustrate the advantages of our approach."}, "https://arxiv.org/abs/2410.20089": {"title": "Sample Efficient Bayesian Learning of Causal Graphs from Interventions", "link": "https://arxiv.org/abs/2410.20089", "description": "arXiv:2410.20089v1 Announce Type: cross \nAbstract: Causal discovery is a fundamental problem with applications spanning various areas in science and engineering. It is well understood that solely using observational data, one can only orient the causal graph up to its Markov equivalence class, necessitating interventional data to learn the complete causal graph. Most works in the literature design causal discovery policies with perfect interventions, i.e., they have access to infinite interventional samples. This study considers a Bayesian approach for learning causal graphs with limited interventional samples, mirroring real-world scenarios where such samples are usually costly to obtain. By leveraging the recent result of Wien\\\"obst et al. (2023) on uniform DAG sampling in polynomial time, we can efficiently enumerate all the cut configurations and their corresponding interventional distributions of a target set, and further track their posteriors. Given any number of interventional samples, our proposed algorithm randomly intervenes on a set of target vertices that cut all the edges in the graph and returns a causal graph according to the posterior of each target set. When the number of interventional samples is large enough, we show theoretically that our proposed algorithm will return the true causal graph with high probability. We compare our algorithm against various baseline methods on simulated datasets, demonstrating its superior accuracy measured by the structural Hamming distance between the learned DAG and the ground truth. Additionally, we present a case study showing how this algorithm could be modified to answer more general causal questions without learning the whole graph. As an example, we illustrate that our method can be used to estimate the causal effect of a variable that cannot be intervened."}, "https://arxiv.org/abs/2410.20318": {"title": "Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds", "link": "https://arxiv.org/abs/2410.20318", "description": "arXiv:2410.20318v1 Announce Type: cross \nAbstract: We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (-within-Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. 
More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach by fitting the categorical data of a mice protein dataset and the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered."}, "https://arxiv.org/abs/2410.20896": {"title": "BSD: a Bayesian framework for parametric models of neural spectra", "link": "https://arxiv.org/abs/2410.20896", "description": "arXiv:2410.20896v1 Announce Type: cross \nAbstract: The analysis of neural power spectra plays a crucial role in understanding brain function and dysfunction. While recent efforts have led to the development of methods for decomposing spectral data, challenges remain in performing statistical analysis and group-level comparisons. Here, we introduce Bayesian Spectral Decomposition (BSD), a Bayesian framework for analysing neural spectral power. BSD allows for the specification, inversion, comparison, and analysis of parametric models of neural spectra, addressing limitations of existing methods. We first establish the face validity of BSD on simulated data and show how it outperforms an established method (\\fooof{}) for peak detection on artificial spectral data. We then demonstrate the efficacy of BSD on a group-level study of EEG spectra in 204 healthy subjects from the LEMON dataset. Our results not only highlight the effectiveness of BSD in model selection and parameter estimation, but also illustrate how BSD enables straightforward group-level regression of the effect of continuous covariates such as age. By using Bayesian inference techniques, BSD provides a robust framework for studying neural spectral data and their relationship to brain function and dysfunction."}, "https://arxiv.org/abs/2208.05543": {"title": "A novel decomposition to explain heterogeneity in observational and randomized studies of causality", "link": "https://arxiv.org/abs/2208.05543", "description": "arXiv:2208.05543v2 Announce Type: replace \nAbstract: This paper introduces a novel decomposition framework to explain heterogeneity in causal effects observed across different studies, considering both observational and randomized settings. We present a formal decomposition of between-study heterogeneity, identifying sources of variability in treatment effects across studies. The proposed methodology allows for robust estimation of causal parameters under various assumptions, addressing differences in pre-treatment covariate distributions, mediating variables, and the outcome mechanism. Our approach is validated through a simulation study and applied to data from the Moving to Opportunity (MTO) study, demonstrating its practical relevance. 
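For the BSD abstract above, a point-estimate caricature of the kind of parametric spectral model being inverted, an aperiodic 1/f trend plus a Gaussian peak in log-power, can be fit with nonlinear least squares. This is only a sketch of the model class on synthetic data, not the Bayesian specification, inversion, or model-comparison machinery described in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def spectrum_model(f, offset, exponent, height, cf, bw):
    # Aperiodic 1/f trend plus one Gaussian peak, in log10 power.
    return offset - exponent * np.log10(f) + height * np.exp(-0.5 * ((f - cf) / bw) ** 2)

rng = np.random.default_rng(4)
freqs = np.linspace(2.0, 40.0, 200)
log_power = spectrum_model(freqs, 1.0, 1.2, 0.6, 10.0, 1.5) \
            + 0.05 * rng.standard_normal(freqs.size)

p0 = [1.0, 1.0, 0.5, 9.0, 2.0]                  # rough starting values
params, _ = curve_fit(spectrum_model, freqs, log_power, p0=p0)
print(dict(zip(["offset", "exponent", "height", "cf", "bw"], np.round(params, 2))))
```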
This work contributes to the broader understanding of causal inference in multi-study environments, with potential applications in evidence synthesis and policy-making."}, "https://arxiv.org/abs/2211.03274": {"title": "A General Framework for Cutting Feedback within Modularised Bayesian Inference", "link": "https://arxiv.org/abs/2211.03274", "description": "arXiv:2211.03274v3 Announce Type: replace \nAbstract: Standard Bayesian inference can build models that combine information from various sources, but this inference may not be reliable if components of a model are misspecified. Cut inference, as a particular type of modularized Bayesian inference, is an alternative which splits a model into modules and cuts the feedback from the suspect module. Previous studies have focused on a two-module case, but a more general definition of a \"module\" remains unclear. We present a formal definition of a \"module\" and discuss its properties. We formulate methods for identifying modules; determining the order of modules; and building the cut distribution that should be used for cut inference within an arbitrary directed acyclic graph structure. We justify the cut distribution by showing that it not only cuts the feedback but also is the best approximation satisfying this condition to the joint distribution in the Kullback-Leibler divergence. We also extend cut inference for the two-module case to a general multiple-module case via a sequential splitting technique and demonstrate this via illustrative applications."}, "https://arxiv.org/abs/2310.16638": {"title": "Double Debiased Covariate Shift Adaptation Robust to Density-Ratio Estimation", "link": "https://arxiv.org/abs/2310.16638", "description": "arXiv:2310.16638v3 Announce Type: replace \nAbstract: Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. 
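The doubly robust covariate shift idea above combines a density-ratio weight with an outcome (loss) regression so that consistency survives misspecification of one nuisance. The sketch below applies the standard doubly robust combination to test-risk estimation of a fixed predictor; the deliberately simple linear nuisances and the simulated shift are assumptions, and the paper's parameter-level estimator is more involved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n_tr, n_te = 2000, 2000
x_tr = rng.normal(0.0, 1.0, (n_tr, 1))
x_te = rng.normal(0.8, 1.0, (n_te, 1))                    # shifted test covariates
y_tr = x_tr[:, 0] ** 2 + rng.normal(0.0, 0.5, n_tr)

predictor = LinearRegression().fit(x_tr, y_tr)            # model whose test risk we want
loss_tr = (y_tr - predictor.predict(x_tr)) ** 2

# Density ratio w(x) = p_test(x) / p_train(x) from a train-vs-test classifier.
clf = LogisticRegression().fit(np.vstack([x_tr, x_te]),
                               np.r_[np.zeros(n_tr), np.ones(n_te)])
prob_te = clf.predict_proba(x_tr)[:, 1]
w = prob_te / (1.0 - prob_te) * (n_tr / n_te)

# Outcome (loss) regression, then the doubly robust combination.
m = LinearRegression().fit(x_tr, loss_tr)
risk_ipw = np.mean(w * loss_tr)
risk_dr = m.predict(x_te).mean() + np.mean(w * (loss_tr - m.predict(x_tr)))
print(round(risk_ipw, 3), round(risk_dr, 3))
```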
Finally, we confirm the soundness of our proposed method via simulation studies."}, "https://arxiv.org/abs/2008.13619": {"title": "A model and method for analyzing the precision of binary measurement methods based on beta-binomial distributions, and related statistical tests", "link": "https://arxiv.org/abs/2008.13619", "description": "arXiv:2008.13619v3 Announce Type: replace-cross \nAbstract: This study developed a new statistical model and method for analyzing the precision of binary measurement methods from collaborative studies. The model is based on beta-binomial distributions. In other words, it assumes that the sensitivity of each laboratory obeys a beta distribution, and the binary measured values under a given sensitivity follow a binomial distribution. We propose the key precision measures of repeatability and reproducibility for the model, and provide their unbiased estimates. Further, through consideration of a number of statistical test methods for homogeneity of proportions, we propose appropriate methods for determining laboratory effects in the new model. Finally, we apply the results to real-world examples in the fields of food safety and chemical risk assessment and management."}, "https://arxiv.org/abs/2304.01111": {"title": "Theoretical guarantees for neural control variates in MCMC", "link": "https://arxiv.org/abs/2304.01111", "description": "arXiv:2304.01111v2 Announce Type: replace-cross \nAbstract: In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory."}, "https://arxiv.org/abs/2401.01426": {"title": "Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference", "link": "https://arxiv.org/abs/2401.01426", "description": "arXiv:2401.01426v2 Announce Type: replace-cross \nAbstract: Sound and complete algorithms have been proposed to compute identifiable causal queries using the causal structure and data. However, most of these algorithms assume accurate estimation of the data distribution, which is impractical for high-dimensional variables such as images. On the other hand, modern deep generative architectures can be trained to sample from high-dimensional distributions. However, training these networks are typically very costly. Thus, it is desirable to leverage pre-trained models to answer causal queries using such high-dimensional data. To address this, we propose modular training of deep causal generative models that not only makes learning more efficient, but also allows us to utilize large, pre-trained conditional generative models. To the best of our knowledge, our algorithm, Modular-DCM is the first algorithm that, given the causal structure, uses adversarial training to learn the network weights, and can make use of pre-trained models to provably sample from any identifiable causal query in the presence of latent confounders. With extensive experiments on the Colored-MNIST dataset, we demonstrate that our algorithm outperforms the baselines. 
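For the beta-binomial precision model above, a crude method-of-moments illustration shows how lab-level sensitivities feed repeatability- and reproducibility-type variance components. The laboratory counts are hypothetical, and the paper's unbiased estimators and homogeneity tests are not reproduced here.

```python
import numpy as np

x = np.array([18, 20, 15, 19, 17, 12, 20, 16])   # positives per laboratory (hypothetical)
n = 20                                            # replicates per laboratory
p_hat = x / n

p_bar = p_hat.mean()
var_within = p_bar * (1 - p_bar)                          # repeatability-type component
var_between = max(p_hat.var(ddof=1) - var_within / n, 0)  # crude between-lab component

print("mean sensitivity:", round(p_bar, 3))
print("repeatability variance:", round(var_within, 4))
print("reproducibility variance:", round(var_within + var_between, 4))
```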
We also show our algorithm's convergence on the COVIDx dataset and its utility with a causal invariant prediction problem on CelebA-HQ."}, "https://arxiv.org/abs/2410.21343": {"title": "Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2410.21343", "description": "arXiv:2410.21343v1 Announce Type: new \nAbstract: Data from observational studies (OSs) is widely available and readily obtainable yet frequently contains confounding biases. On the other hand, data derived from randomized controlled trials (RCTs) helps to reduce these biases; however, it is expensive to gather, resulting in a tiny size of randomized data. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require \\textit{complete} observational data, meaning that both treated subjects and untreated subjects must be included in OSs. This prerequisite confines the applicability of such methods to very specific situations, given that including all subjects, whether treated or untreated, in observational studies is not consistently achievable. In our paper, we propose a resilient approach to \\textbf{C}ombine \\textbf{I}ncomplete \\textbf{O}bservational data and randomized data for HTE estimation, which we abbreviate as \\textbf{CIO}. The CIO is capable of estimating HTEs efficiently regardless of the completeness of the observational data, be it full or partial. Concretely, a confounding bias function is first derived using the pseudo-experimental group from OSs, in conjunction with the pseudo-control group from RCTs, via an effect estimation procedure. This function is subsequently utilized as a corrective residual to rectify the observed outcomes of observational data during the HTE estimation by combining the available observational data and all the randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets."}, "https://arxiv.org/abs/2410.21350": {"title": "Enhanced sequential directional importance sampling for structural reliability analysis", "link": "https://arxiv.org/abs/2410.21350", "description": "arXiv:2410.21350v1 Announce Type: new \nAbstract: Sequential directional importance sampling (SDIS) is an efficient adaptive simulation method for estimating failure probabilities. It expresses the failure probability as the product of a group of integrals that are easy to estimate, wherein the first one is estimated with Monte Carlo simulation (MCS), and all the subsequent ones are estimated with directional importance sampling. In this work, we propose an enhanced SDIS method for structural reliability analysis. We discuss the efficiency of MCS for estimating the first integral in standard SDIS and propose using Subset Simulation as an alternative method. Additionally, we propose a Kriging-based active learning algorithm tailored to identify multiple roots in certain important directions within a specified search interval. The performance of the enhanced SDIS is demonstrated through various complex benchmark problems. 
The results show that the enhanced SDIS is a versatile reliability analysis method that can efficiently and robustly solve challenging reliability problems."}, "https://arxiv.org/abs/2410.21355": {"title": "Generalized Method of Moments and Percentile Method: Estimating parameters of the Novel Median Based Unit Weibull Distribution", "link": "https://arxiv.org/abs/2410.21355", "description": "arXiv:2410.21355v1 Announce Type: new \nAbstract: The Median Based Unit Weibull is a new two-parameter unit Weibull distribution defined on the unit interval (0,1). Estimation of the parameters using MLE encounters some problems, such as large variance. Using the generalized method of moments (GMM) and the percentile method may ameliorate this condition. This paper introduces GMM and percentile methods for estimating the parameters of the new distribution, with an illustrative real data analysis."}, "https://arxiv.org/abs/2410.21464": {"title": "Causal Bootstrap for General Randomized Designs", "link": "https://arxiv.org/abs/2410.21464", "description": "arXiv:2410.21464v1 Announce Type: new \nAbstract: We distinguish between two sources of uncertainty in experimental causal inference: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a super-population. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. Most bootstrap procedures used by practitioners primarily estimate sampling uncertainty. Other methods for quantifying design uncertainty also fall short, because they are restricted to common designs and estimators, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer programming approach, which allows us to estimate design uncertainty for any known and probabilistic assignment mechanisms, and linear-in-treatment and quadratic-in-treatment estimators. We include asymptotic validity results and demonstrate the refined confidence intervals achieved by accurately accounting for non-standard design uncertainty through simulations of geographical experiments."}, "https://arxiv.org/abs/2410.21469": {"title": "Hybrid Bayesian Smoothing on Surfaces", "link": "https://arxiv.org/abs/2410.21469", "description": "arXiv:2410.21469v1 Announce Type: new \nAbstract: Modeling spatial processes that exhibit both smooth and rough features poses a significant challenge. This is especially true in fields where complex physical variables are observed across spatial domains. Traditional spatial techniques, such as Gaussian processes (GPs), are ill-suited to capture sharp transitions and discontinuities in spatial fields. In this paper, we propose a new approach incorporating non-Gaussian processes (NGPs) into a hybrid model which identifies both smooth and rough components. Specifically, we model the rough process using scaled mixtures of Gaussian distributions in a Bayesian hierarchical model (BHM).\n Our motivation comes from the Community Earth System Model Large Ensemble (CESM-LE), where we seek to emulate climate sensitivity fields that exhibit complex spatial patterns, including abrupt transitions at ocean-land boundaries. We demonstrate that traditional GP models fail to capture such abrupt changes. Our proposed hybrid model is implemented through a full Gibbs sampler. 
This significantly improves model interpretability and accurate recovery of process parameters.\n Through a multi-factor simulation study, we evaluate the performance of several scaled mixtures designed to model the rough process. The results highlight the advantages of using these heavier tailed priors as a replacement to the Bayesian fused LASSO. One prior in particular, the normal Jeffrey's prior stands above the rest. We apply our model to the CESM-LE dataset, demonstrating its ability to better represent the mean function and its uncertainty in climate sensitivity fields.\n This work combines the strengths of GPs for smooth processes with the flexibility of NGPs for abrupt changes. We provide a computationally efficient Gibbs sampler and include additional strategies for accelerating Monte Carlo Markov Chain (MCMC) sampling."}, "https://arxiv.org/abs/2410.21498": {"title": "Bayesian Nonparametric Models for Multiple Raters: a General Statistical Framework", "link": "https://arxiv.org/abs/2410.21498", "description": "arXiv:2410.21498v1 Announce Type: new \nAbstract: Rating procedure is crucial in many applied fields (e.g., educational, clinical, emergency). It implies that a rater (e.g., teacher, doctor) rates a subject (e.g., student, doctor) on a rating scale. Given raters variability, several statistical methods have been proposed for assessing and improving the quality of ratings. The analysis and the estimate of the Intraclass Correlation Coefficient (ICC) are major concerns in such cases. As evidenced by the literature, ICC might differ across different subgroups of raters and might be affected by contextual factors and subject heterogeneity. Model estimation in the presence of heterogeneity has been one of the recent challenges in this research line. Consequently, several methods have been proposed to address this issue under a parametric multilevel modelling framework, in which strong distributional assumptions are made. We propose a more flexible model under the Bayesian nonparametric (BNP) framework, in which most of those assumptions are relaxed. By eliciting hierarchical discrete nonparametric priors, the model accommodates clusters among raters and subjects, naturally accounts for heterogeneity, and improves estimate accuracy. We propose a general BNP heteroscedastic framework to analyze rating data and possible latent differences among subjects and raters. The estimated densities are used to make inferences about the rating process and the quality of the ratings. By exploiting a stick-breaking representation of the Dirichlet Process a general class of ICC indices might be derived for these models. Theoretical results about the ICC are provided together with computational strategies. Simulations and a real-world application are presented and possible future directions are discussed."}, "https://arxiv.org/abs/2410.21505": {"title": "Economic Diversification and Social Progress in the GCC Countries: A Study on the Transition from Oil-Dependency to Knowledge-Based Economies", "link": "https://arxiv.org/abs/2410.21505", "description": "arXiv:2410.21505v1 Announce Type: new \nAbstract: The Gulf Cooperation Council countries -- Oman, Bahrain, Kuwait, UAE, Qatar, and Saudi Arabia -- holds strategic significance due to its large oil reserves. However, these nations face considerable challenges in shifting from oil-dependent economies to more diversified, knowledge-based systems. 
This study examines the progress of Gulf Cooperation Council (GCC) countries in achieving economic diversification and social development, focusing on the Social Progress Index (SPI), which provides a broader measure of societal well-being beyond just economic growth. Using data from the World Bank, covering 2010 to 2023, the study employs the XGBoost machine learning model to forecast SPI values for the period of 2024 to 2026. Key components of the methodology include data preprocessing, feature selection, and the simulation of independent variables through ARIMA modeling. The results highlight significant improvements in education, healthcare, and women's rights, contributing to enhanced SPI performance across the GCC countries. However, notable challenges persist in areas like personal rights and inclusivity. The study further indicates that despite economic setbacks caused by global disruptions, including the COVID-19 pandemic and oil price volatility, GCC nations are expected to see steady improvements in their SPI scores through 2027. These findings underscore the critical importance of economic diversification, investment in human capital, and ongoing social reforms to reduce dependence on hydrocarbons and build knowledge-driven economies. This research offers valuable insights for policymakers aiming to strengthen both social and economic resilience in the region while advancing long-term sustainable development goals."}, "https://arxiv.org/abs/2410.21516": {"title": "Forecasting Political Stability in GCC Countries", "link": "https://arxiv.org/abs/2410.21516", "description": "arXiv:2410.21516v1 Announce Type: new \nAbstract: Political stability is crucial for the socioeconomic development of nations, particularly in geopolitically sensitive regions such as the Gulf Cooperation Council Countries, Saudi Arabia, UAE, Kuwait, Qatar, Oman, and Bahrain. This study focuses on predicting the political stability index for these six countries using machine learning techniques. The study uses data from the World Banks comprehensive dataset, comprising 266 indicators covering economic, political, social, and environmental factors. Employing the Edit Distance on Real Sequence method for feature selection and XGBoost for model training, the study forecasts political stability trends for the next five years. The model achieves high accuracy, with mean absolute percentage error values under 10, indicating reliable predictions. The forecasts suggest that Oman, the UAE, and Qatar will experience relatively stable political conditions, while Saudi Arabia and Bahrain may continue to face negative political stability indices. The findings underscore the significance of economic factors such as GDP and foreign investment, along with variables related to military expenditure and international tourism, as key predictors of political stability. These results provide valuable insights for policymakers, enabling proactive measures to enhance governance and mitigate potential risks."}, "https://arxiv.org/abs/2410.21559": {"title": "Enhancing parameter estimation in finite mixture of generalized normal distributions", "link": "https://arxiv.org/abs/2410.21559", "description": "arXiv:2410.21559v1 Announce Type: new \nAbstract: Mixtures of generalized normal distributions (MGND) have gained popularity for modelling datasets with complex statistical behaviours. 
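The forecasting recipe described in the two GCC abstracts above, projecting covariates forward and then predicting the index with XGBoost, can be sketched as follows. The covariate names, the ARIMA(1,1,0) order, and the random placeholder data are assumptions, not the World Bank series or the papers' tuned pipelines (which also include feature selection steps such as Edit Distance on Real Sequence).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from xgboost import XGBRegressor

rng = np.random.default_rng(6)
years = pd.RangeIndex(2010, 2024)
X_hist = pd.DataFrame(rng.standard_normal((len(years), 3)).cumsum(axis=0),
                      index=years, columns=["gdp", "fdi", "tourism"])   # placeholder covariates
y_hist = 0.5 * X_hist["gdp"] - 0.3 * X_hist["fdi"] + rng.normal(0, 0.1, len(years))

model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_hist, y_hist)

# Project each covariate forward with a small ARIMA model, then predict the index.
horizon = 3
X_future = pd.DataFrame(
    {c: ARIMA(X_hist[c].to_numpy(), order=(1, 1, 0)).fit().forecast(steps=horizon)
     for c in X_hist.columns},
    index=pd.RangeIndex(2024, 2024 + horizon))
print(model.predict(X_future))                    # forecast over the horizon
```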
However, the estimation of the shape parameter within the maximum likelihood framework is quite complex, presenting the risk of numerical and degeneracy issues. This study introduces an expectation conditional maximization algorithm that includes an adaptive step size function within Newton-Raphson updates of the shape parameter and a modified criterion for stopping the EM iterations. Extensive simulations show the effectiveness of the proposed algorithm in overcoming the limitations of existing approaches, especially in scenarios with high shape parameter values, high parameter overlap, and low sample sizes. A detailed comparative analysis with a mixture of normals and Student-t distributions revealed that the MGND model exhibited superior goodness-of-fit performance when used to fit the density of the returns of 50 stocks belonging to the Euro Stoxx index."}, "https://arxiv.org/abs/2410.21603": {"title": "Approximate Bayesian Computation with Statistical Distances for Model Selection", "link": "https://arxiv.org/abs/2410.21603", "description": "arXiv:2410.21603v1 Announce Type: new \nAbstract: Model selection is a key task in statistics, playing a critical role across various scientific disciplines. While no model can fully capture the complexities of a real-world data-generating process, identifying the model that best approximates it can provide valuable insights. Bayesian statistics offers a flexible framework for model selection by updating prior beliefs as new data becomes available, allowing for ongoing refinement of candidate models. This is typically achieved by calculating posterior probabilities, which quantify the support for each model given the observed data. However, in cases where likelihood functions are intractable, exact computation of these posterior probabilities becomes infeasible. Approximate Bayesian Computation (ABC) has emerged as a likelihood-free method; it is traditionally used with summary statistics to reduce data dimensionality, but this often results in information loss that is difficult to quantify, particularly in model selection contexts. Recent advancements propose the use of full data approaches based on statistical distances, offering a promising alternative that bypasses the need for summary statistics and potentially allows recovery of the exact posterior distribution. Despite these developments, full data ABC approaches have not yet been widely applied to model selection problems. This paper seeks to address this gap by investigating the performance of ABC with statistical distances in model selection. Through simulation studies and an application to toad movement models, this work explores whether full data approaches can overcome the limitations of summary statistic-based ABC for model choice."}, "https://arxiv.org/abs/2410.21832": {"title": "Robust Estimation and Model Selection for the Controlled Directed Effect with Unmeasured Mediator-Outcome Confounders", "link": "https://arxiv.org/abs/2410.21832", "description": "arXiv:2410.21832v1 Announce Type: new \nAbstract: The Controlled Direct Effect (CDE) is one of the causal estimands used to evaluate both exposure and mediation effects on an outcome. When unmeasured confounders exist between the mediator and the outcome, the ordinary identification assumption does not hold. In this manuscript, we consider a condition that identifies the CDE in the presence of unmeasured confounders. 
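In the spirit of the ABC model-selection abstract above, a rejection-style sketch with a full-data statistical distance is shown below; the two simulators, the prior, the tolerance, and the Wasserstein distance are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
observed = rng.normal(0.0, 1.0, size=500)     # pretend these are the observed data

# Two hypothetical candidate simulators, each drawing a location parameter from a N(0, 1) prior.
simulators = {
    "normal": lambda: rng.normal(rng.normal(0, 1), 1.0, size=observed.size),
    "laplace": lambda: rng.laplace(rng.normal(0, 1), 1.0, size=observed.size),
}

accepted = {name: 0 for name in simulators}
n_draws, tol = 2000, 0.15
for _ in range(n_draws):
    name = rng.choice(list(simulators))                           # model drawn from a uniform prior
    if wasserstein_distance(observed, simulators[name]()) < tol:  # keep if the full-data distance is small
        accepted[name] += 1

total = sum(accepted.values())
print({name: count / total for name, count in accepted.items()} if total else accepted)
```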
The key assumptions are: 1) the random allocation of the exposure, and 2) the existence of instrumental variables directly related to the mediator. Under these conditions, we propose a novel doubly robust estimation method, which works well if either the propensity score model or the baseline outcome model is correctly specified. Additionally, we propose a Generalized Information Criterion (GIC)-based model selection criterion for CDE that ensures model selection consistency. Our proposed procedure and related methods are applied to both simulated and real datasets to confirm their performance. Our proposed method can select the correct model with high probability and accurately estimate the CDE."}, "https://arxiv.org/abs/2410.21858": {"title": "Joint Estimation of Conditional Mean and Covariance for Unbalanced Panels", "link": "https://arxiv.org/abs/2410.21858", "description": "arXiv:2410.21858v1 Announce Type: new \nAbstract: We propose a novel nonparametric kernel-based estimator of cross-sectional conditional mean and covariance matrices for large unbalanced panels. We show its consistency and provide finite-sample guarantees. In an empirical application, we estimate conditional mean and covariance matrices for a large unbalanced panel of monthly stock excess returns given macroeconomic and firm-specific covariates from 1962 to 2021. The estimator performs well with respect to statistical measures. It is informative for empirical asset pricing, generating conditional mean-variance efficient portfolios with substantial out-of-sample Sharpe ratios far beyond equal-weighted benchmarks."}, "https://arxiv.org/abs/2410.21914": {"title": "Bayesian Stability Selection and Inference on Inclusion Probabilities", "link": "https://arxiv.org/abs/2410.21914", "description": "arXiv:2410.21914v1 Announce Type: new \nAbstract: Stability selection is a versatile framework for structure estimation and variable selection in high-dimensional settings, primarily grounded in frequentist principles. In this paper, we propose an enhanced methodology that integrates Bayesian analysis to refine the inference of inclusion probabilities within the stability selection framework. Traditional approaches rely on selection frequencies for decision-making, often disregarding domain-specific knowledge and failing to account for the inherent uncertainty in the variable selection process. Our methodology uses prior information to derive posterior distributions of inclusion probabilities, thereby improving both inference and decision-making. We present a two-step process for engaging with domain experts, enabling statisticians to elucidate prior distributions informed by expert knowledge while allowing experts to control the weight of their input on the final results. Using posterior distributions, we offer Bayesian credible intervals to quantify uncertainty in the variable selection process. In addition, we highlight how selection frequencies can be uninformative or even misleading when covariates are correlated with each other, and demonstrate how domain expertise can alleviate such issues. 
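One simple way to attach a posterior to an inclusion probability from stability-selection counts, in the spirit of the abstract above, is a conjugate Beta-Binomial update; the counts and the expert-informed prior below are hypothetical, and this is not necessarily the authors' exact formulation.

```python
from scipy.stats import beta

# Hypothetical selection count for one covariate across B subsampled model fits.
B, selected = 100, 62

# Expert-informed Beta prior on the inclusion probability; the Beta is conjugate to the
# binomial selection count, so the posterior is available in closed form.
a_prior, b_prior = 2.0, 2.0
a_post, b_post = a_prior + selected, b_prior + (B - selected)

posterior_mean = a_post / (a_post + b_post)
credible_interval = beta.ppf([0.025, 0.975], a_post, b_post)
print(posterior_mean, credible_interval)
```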
Our approach preserves the versatility of stability selection and is suitable for a broad range of structure estimation challenges."}, "https://arxiv.org/abs/2410.21954": {"title": "Inference of a Susceptible-Infectious stochastic model", "link": "https://arxiv.org/abs/2410.21954", "description": "arXiv:2410.21954v1 Announce Type: new \nAbstract: We consider a time-inhomogeneous diffusion process able to describe the dynamics of infected people in a susceptible-infectious epidemic model in which the transmission intensity function is time-dependent. Such a model is well suited to describe some classes of micro-parasitic infections in which individuals never acquire lasting immunity and, over the course of the epidemic, everyone eventually becomes infected. The stochastic process related to the deterministic model is transformable into a non-homogeneous Wiener process, so its probability distribution can be obtained. Here we focus on inference for this process, providing an estimation procedure for the parameters involved. We point out that the time dependence in the infinitesimal moments of the diffusion process makes classical inference methods inapplicable. The proposed procedure is based on the Generalized Method of Moments and finds suitable estimates for the infinitesimal drift and variance of the transformed process. Several simulation studies are conducted to test the procedure; these include the time-homogeneous case, for which a comparison with the results obtained by applying the MLE is made, and cases in which the intensity function is time-dependent, with particular attention to periodic cases. Finally, we apply the estimation procedure to a real dataset."}, "https://arxiv.org/abs/2410.21989": {"title": "On the Consistency of Partial Ordering Continual Reassessment Method with Model and Ordering Misspecification", "link": "https://arxiv.org/abs/2410.21989", "description": "arXiv:2410.21989v1 Announce Type: new \nAbstract: One of the aims of Phase I clinical trial designs in oncology is typically to find the maximum tolerated doses. A number of innovative dose-escalation designs have been proposed in the literature to achieve this goal efficiently. Although the sample size of Phase I trials is usually small, the asymptotic properties (e.g. consistency) of dose-escalation designs can provide useful guidance on the design parameters and improve fundamental understanding of these designs. For the first proposed model-based monotherapy dose-escalation design, the Continual Reassessment Method (CRM), sufficient consistency conditions were previously derived and have greatly influenced how these studies are run in practice. At the same time, there is an increasing interest in Phase I combination-escalation trials in which two or more drugs are combined. The monotherapy dose-escalation design cannot generally be applied in this case due to uncertainty about the monotonic ordering between some of the combinations, and, as a result, specialised designs have been proposed. However, the theoretical and asymptotic properties of these proposals have not been evaluated. In this paper, we derive the consistency conditions of the Partial Ordering CRM (POCRM) design when there is uncertainty in the monotonic ordering, with a focus on dual-agent combination-escalation trials. 
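As a toy illustration of moment-based estimation for a Wiener-type process, loosely related to the susceptible-infectious abstract above (time-homogeneous case only, simulated increments, identity weighting matrix; not the paper's time-dependent procedure):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
dt, n = 0.1, 500
true_mu, true_sigma2 = 0.8, 0.5
increments = rng.normal(true_mu * dt, np.sqrt(true_sigma2 * dt), size=n)  # simulated Wiener increments

def gmm_objective(params):
    mu, sigma2 = params
    # Moment conditions: E[dX] = mu*dt and E[dX^2] = sigma2*dt + (mu*dt)^2.
    g = np.array([
        increments.mean() - mu * dt,
        (increments**2).mean() - (sigma2 * dt + (mu * dt) ** 2),
    ])
    return g @ g

fit = minimize(gmm_objective, x0=[0.0, 1.0], method="Nelder-Mead")
print(fit.x)   # rough estimates of (mu, sigma2)
```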
Based on the derived consistency condition, we provide guidance on how the design parameters and ordering of the POCRM should be defined."}, "https://arxiv.org/abs/2410.22248": {"title": "Model-free Estimation of Latent Structure via Multiscale Nonparametric Maximum Likelihood", "link": "https://arxiv.org/abs/2410.22248", "description": "arXiv:2410.22248v1 Announce Type: new \nAbstract: Multivariate distributions often carry latent structures that are difficult to identify and estimate, and which better reflect the data generating mechanism than extrinsic structures exhibited simply by the raw data. In this paper, we propose a model-free approach for estimating such latent structures whenever they are present, without assuming they exist a priori. Given an arbitrary density $p_0$, we construct a multiscale representation of the density and propose data-driven methods for selecting representative models that capture meaningful discrete structure. Our approach uses a nonparametric maximum likelihood estimator to estimate the latent structure at different scales and we further characterize their asymptotic limits. By carrying out such a multiscale analysis, we obtain coarse-to-fine structures inherent in the original distribution, which are integrated via a model selection procedure to yield an interpretable discrete representation of it. As an application, we design a clustering algorithm based on the proposed procedure and demonstrate its effectiveness in capturing a wide range of latent structures."}, "https://arxiv.org/abs/2410.22300": {"title": "A Latent Variable Model with Change Points and Its Application to Time Pressure Effects in Educational Assessment", "link": "https://arxiv.org/abs/2410.22300", "description": "arXiv:2410.22300v1 Announce Type: new \nAbstract: Educational assessments are valuable tools for measuring student knowledge and skills, but their validity can be compromised when test takers exhibit changes in response behavior due to factors such as time pressure. To address this issue, we introduce a novel latent factor model with change-points for item response data, designed to detect and account for individual-level shifts in response patterns during testing. This model extends traditional Item Response Theory (IRT) by incorporating person-specific change-points, which enables simultaneous estimation of item parameters, person latent traits, and the location of behavioral changes. We evaluate the proposed model through extensive simulation studies, which demonstrate its ability to accurately recover item parameters, change-point locations, and individual ability estimates under various conditions. Our findings show that accounting for change-points significantly reduces bias in ability estimates, particularly for respondents affected by time pressure. Application of the model to two real-world educational testing datasets reveals distinct patterns of change-point occurrence between high-stakes and lower-stakes tests, providing insights into how test-taking behavior evolves during the tests. 
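A toy version of person-specific change-point detection, loosely related to the change-point abstract above: a single examinee, Bernoulli responses, and a profile-likelihood scan over candidate change-points. The full latent factor model is not shown and the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
# One examinee's responses to 40 items; the success probability drops after item 25 (time pressure).
responses = np.concatenate([rng.binomial(1, 0.8, 25), rng.binomial(1, 0.4, 15)])

def bernoulli_loglik(x):
    p = np.clip(x.mean(), 1e-6, 1 - 1e-6)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Profile log-likelihood over candidate change-points, keeping at least 5 items per segment.
candidates = range(5, responses.size - 5)
scores = [bernoulli_loglik(responses[:t]) + bernoulli_loglik(responses[t:]) for t in candidates]
print(list(candidates)[int(np.argmax(scores))])   # estimated change-point location
```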
This approach offers a more nuanced understanding of test-taking dynamics, with important implications for test design, scoring, and interpretation."}, "https://arxiv.org/abs/2410.22333": {"title": "Hypothesis tests and model parameter estimation on data sets with missing correlation information", "link": "https://arxiv.org/abs/2410.22333", "description": "arXiv:2410.22333v1 Announce Type: new \nAbstract: Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix between all data points is not always available, either because a result was published without a covariance matrix or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that will behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor to ensure that the results remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits. It then presents some example applications of the methods to real neutrino interaction data and model comparisons."}, "https://arxiv.org/abs/2410.21790": {"title": "Reconstructing East Asian Temperatures from 1368 to 1911 Using Historical Documents, Climate Models, and Data Assimilation", "link": "https://arxiv.org/abs/2410.21790", "description": "arXiv:2410.21790v1 Announce Type: cross \nAbstract: We present a novel approach for reconstructing annual temperatures in East Asia from 1368 to 1911, leveraging the Reconstructed East Asian Climate Historical Encoded Series (REACHES). The lack of instrumental data during this period poses significant challenges to understanding past climate conditions. REACHES digitizes historical documents from the Ming and Qing dynasties of China, converting qualitative descriptions into a four-level ordinal temperature scale. However, these index-based data are biased toward abnormal or extreme weather phenomena, leading to data gaps that likely correspond to normal conditions. To address this bias and reconstruct historical temperatures at any point within East Asia, including locations without direct historical data, we employ a three-tiered statistical framework. First, we perform kriging to interpolate temperature data across East Asia, adopting a zero-mean assumption to handle missing information. Next, we utilize the Last Millennium Ensemble (LME) reanalysis data and apply quantile mapping to calibrate the kriged REACHES data to Celsius temperature scales. Finally, we introduce a novel Bayesian data assimilation method that integrates the kriged Celsius data with LME simulations to enhance reconstruction accuracy. We model the LME data at each geographic location using a flexible nonstationary autoregressive time series model and employ regularized maximum likelihood estimation with a fused lasso penalty. The resulting dynamic distribution serves as a prior, which is refined via Kalman filtering by incorporating the kriged Celsius REACHES data to yield posterior temperature estimates. 
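The quantile-mapping calibration step described in the temperature-reconstruction abstract above can be sketched with empirical quantiles; the arrays below are synthetic stand-ins for the kriged index values and the reanalysis temperatures, and the kriging and Kalman-filter assimilation steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(5)
index_values = rng.integers(-2, 3, size=300).astype(float)   # stand-in for kriged ordinal index data
model_temps = rng.normal(14.0, 2.0, size=1000)               # stand-in for reanalysis temperatures (Celsius)

def quantile_map(x, source, target):
    """Map x from the source distribution onto the target distribution by matching empirical quantiles."""
    ranks = np.searchsorted(np.sort(source), x, side="right") / len(source)
    return np.quantile(target, np.clip(ranks, 0.0, 1.0))

calibrated = quantile_map(index_values, source=index_values, target=model_temps)
print(calibrated[:5])   # index values expressed on the temperature scale
```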
This comprehensive integration of historical documentation, contemporary climate models, and advanced statistical methods improves the accuracy of historical temperature reconstructions and provides a crucial resource for future environmental and climate studies."}, "https://arxiv.org/abs/2410.22119": {"title": "Deep Q-Exponential Processes", "link": "https://arxiv.org/abs/2410.22119", "description": "arXiv:2410.22119v1 Announce Type: cross \nAbstract: Motivated by deep neural networks, the deep Gaussian process (DGP) generalizes the standard GP by stacking multiple layers of GPs. Despite the enhanced expressiveness, GP, as an $L_2$ regularization prior, tends to be over-smooth and sub-optimal for inhomogeneous subjects, such as images with edges. Recently, Q-exponential process (Q-EP) has been proposed as an $L_q$ relaxation to GP and demonstrated with more desirable regularization properties through a parameter $q>0$ with $q=2$ corresponding to GP. Sharing the similar tractability of posterior and predictive distributions with GP, Q-EP can also be stacked to improve its modeling flexibility. In this paper, we generalize Q-EP to deep Q-EP to enjoy both proper regularization and improved expressiveness. The generalization is realized by introducing shallow Q-EP as a latent variable model and then building a hierarchy of the shallow Q-EP layers. Sparse approximation by inducing points and scalable variational strategy are applied to facilitate the inference. We demonstrate the numerical advantages of the proposed deep Q-EP model by comparing with multiple state-of-the-art deep probabilistic models."}, "https://arxiv.org/abs/2303.06384": {"title": "Measuring Information Transfer Between Nodes in a Brain Network through Spectral Transfer Entropy", "link": "https://arxiv.org/abs/2303.06384", "description": "arXiv:2303.06384v3 Announce Type: replace \nAbstract: Brain connectivity characterizes interactions between different regions of a brain network during resting-state or performance of a cognitive task. In studying brain signals such as electroencephalograms (EEG), one formal approach to investigating connectivity is through an information-theoretic causal measure called transfer entropy (TE). To enhance the functionality of TE in brain signal analysis, we propose a novel methodology that captures cross-channel information transfer in the frequency domain. Specifically, we introduce a new measure, the spectral transfer entropy (STE), to quantify the magnitude and direction of information flow from a band-specific oscillation of one channel to another band-specific oscillation of another channel. The main advantage of our proposed approach is that it formulates TE in a novel way to perform inference on band-specific oscillations while maintaining robustness to the inherent problems associated with filtering. In addition, an advantage of STE is that it allows adjustments for multiple comparisons to control false positive rates. Another novel contribution is a simple yet efficient method for estimating STE using vine copula theory. This method can produce an exact zero estimate of STE (which is the boundary point of the parameter space) without the need for bias adjustments. With the vine copula representation, a null copula model, which exhibits zero STE, is defined, thus enabling straightforward significance testing through standard resampling. 
Lastly, we demonstrate the advantage of the proposed STE measure through numerical experiments and provide interesting and novel findings on the analysis of EEG data in a visual-memory experiment."}, "https://arxiv.org/abs/2401.06447": {"title": "Uncertainty-aware multi-fidelity surrogate modeling with noisy data", "link": "https://arxiv.org/abs/2401.06447", "description": "arXiv:2401.06447v2 Announce Type: replace \nAbstract: Emulating high-accuracy computationally expensive models is crucial for tasks requiring numerous model evaluations, such as uncertainty quantification and optimization. When lower-fidelity models are available, they can be used to improve the predictions of high-fidelity models. Multi-fidelity surrogate models combine information from sources of varying fidelities to construct an efficient surrogate model. However, in real-world applications, uncertainty is present in both high- and low-fidelity models due to measurement or numerical noise, as well as lack of knowledge due to the limited experimental design budget. This paper introduces a comprehensive framework for multi-fidelity surrogate modeling that handles noise-contaminated data and is able to estimate the underlying noise-free high-fidelity model. Our methodology quantitatively incorporates the different types of uncertainty affecting the problem and emphasizes delivering precise estimates of the uncertainty in its predictions both with respect to the underlying high-fidelity model and unseen noise-contaminated high-fidelity observations, presented through confidence and prediction intervals, respectively. Additionally, the proposed framework offers a natural approach to combining physical experiments and computational models by treating noisy experimental data as high-fidelity sources and white-box computational models as their low-fidelity counterparts. The effectiveness of our methodology is showcased through synthetic examples and a wind turbine application."}, "https://arxiv.org/abs/2410.22481": {"title": "Bayesian Counterfactual Prediction Models for HIV Care Retention with Incomplete Outcome and Covariate Information", "link": "https://arxiv.org/abs/2410.22481", "description": "arXiv:2410.22481v1 Announce Type: new \nAbstract: Like many chronic diseases, human immunodeficiency virus (HIV) is managed over time at regular clinic visits. At each visit, patient features are assessed, treatments are prescribed, and a subsequent visit is scheduled. There is a need for data-driven methods for both predicting retention and recommending scheduling decisions that optimize retention. Prediction models can be useful for estimating retention rates across a range of scheduling options. However, training such models with electronic health records (EHR) involves several complexities. First, formal causal inference methods are needed to adjust for observed confounding when estimating retention rates under counterfactual scheduling decisions. Second, competing events such as death preclude retention, while censoring events render retention missing. Third, inconsistent monitoring of features such as viral load and CD4 count leads to covariate missingness. This paper presents an all-in-one approach for both predicting HIV retention and optimizing scheduling while accounting for these complexities. We formulate and identify causal retention estimands in terms of potential return-time under a hypothetical scheduling decision. 
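The distinction drawn in the multi-fidelity abstract above between confidence intervals (for the underlying noise-free model) and prediction intervals (for new noisy observations) can be illustrated with an ordinary linear model standing in for the surrogate; everything below is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 30)
y = 2.0 + 3.0 * x + rng.normal(0, 0.3, size=x.size)          # noisy "high-fidelity" observations

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
dof = x.size - 2
sigma2 = resid @ resid / dof                                 # estimated noise variance
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 0.5])                                    # prediction at x = 0.5
var_mean = sigma2 * x0 @ XtX_inv @ x0                        # variance of the fitted (noise-free) mean
t = stats.t.ppf(0.975, dof)
ci = x0 @ coef + np.array([-1, 1]) * t * np.sqrt(var_mean)            # confidence interval
pi = x0 @ coef + np.array([-1, 1]) * t * np.sqrt(sigma2 + var_mean)   # prediction interval
print(ci, pi)   # the prediction interval is wider because it also covers observation noise
```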
Flexible Bayesian approaches are used to model the observed return-time distribution while accounting for competing and censoring events and form posterior point and uncertainty estimates for these estimands. We address the urgent need for data-driven decision support in HIV care by applying our method to EHR from the Academic Model Providing Access to Healthcare (AMPATH) - a consortium of clinics that treat HIV in Western Kenya."}, "https://arxiv.org/abs/2410.22501": {"title": "Order of Addition in Orthogonally Blocked Mixture and Component-Amount Designs", "link": "https://arxiv.org/abs/2410.22501", "description": "arXiv:2410.22501v1 Announce Type: new \nAbstract: Mixture experiments often involve process variables, such as different chemical reactors in a laboratory or varying mixing speeds in a production line. Organizing the runs in orthogonal blocks allows the mixture model to be fitted independently of the process effects, ensuring clearer insights into the role of each mixture component. Current literature on mixture designs in orthogonal blocks ignores the order of addition of mixture components in mixture blends. This paper considers the order of addition of components in mixture and mixture-amount experiments, using the variable total amount taken into orthogonal blocks. The response depends on both the mixture proportions or the amounts of the components and the order of their addition. Mixture designs in orthogonal blocks are constructed to enable the estimation of mixture or component-amount model parameters and the order-of-addition effects. The G-efficiency criterion is used to assess how well the design supports precise and unbiased estimation of the model parameters. The fraction of the Design Space plot is used to provide a visual assessment of the prediction capabilities of a design across the entire design space."}, "https://arxiv.org/abs/2410.22534": {"title": "Bayesian shared parameter joint models for heterogeneous populations", "link": "https://arxiv.org/abs/2410.22534", "description": "arXiv:2410.22534v1 Announce Type: new \nAbstract: Joint models (JMs) for longitudinal and time-to-event data are an important class of biostatistical models in health and medical research. When the study population consists of heterogeneous subgroups, the standard JM may be inadequate and lead to misleading results. Joint latent class models (JLCMs) and their variants have been proposed to incorporate latent class structures into JMs. JLCMs are useful for identifying latent subgroup structures, obtaining a more nuanced understanding of the relationships between longitudinal outcomes, and improving prediction performance. We consider the generic form of JLCM, which poses significant computational challenges for both frequentist and Bayesian approaches due to the numerical intractability and multimodality of the associated model's likelihood or posterior. Focusing on the less explored Bayesian paradigm, we propose a new Bayesian inference framework to tackle key limitations in the existing method. Our algorithm leverages state-of-the-art Markov chain Monte Carlo techniques and parallel computing for parameter estimation and model selection. Through a simulation study, we demonstrate the feasibility and superiority of our proposed method over the existing approach. Our simulations also generate important computational insights and practical guidance for implementing such complex models. 
We illustrate our method using data from the PAQUID prospective cohort study, where we jointly investigate the association between a repeatedly measured cognitive score and the risk of dementia and the latent class structure defined from the longitudinal outcomes."}, "https://arxiv.org/abs/2410.22574": {"title": "Inference in Partially Linear Models under Dependent Data with Deep Neural Networks", "link": "https://arxiv.org/abs/2410.22574", "description": "arXiv:2410.22574v1 Announce Type: new \nAbstract: I consider inference in a partially linear regression model under stationary $\\beta$-mixing data after first stage deep neural network (DNN) estimation. Using the DNN results of Brown (2024), I show that the estimator for the finite dimensional parameter, constructed using DNN-estimated nuisance components, achieves $\\sqrt{n}$-consistency and asymptotic normality. By avoiding sample splitting, I address one of the key challenges in applying machine learning techniques to econometric models with dependent data. In a future version of this work, I plan to extend these results to obtain general conditions for semiparametric inference after DNN estimation of nuisance components, which will allow for considerations such as more efficient estimation procedures, and instrumental variable settings."}, "https://arxiv.org/abs/2410.22617": {"title": "Bayesian Inference for Relational Graph in a Causal Vector Autoregressive Time Series", "link": "https://arxiv.org/abs/2410.22617", "description": "arXiv:2410.22617v1 Announce Type: new \nAbstract: We propose a method for simultaneously estimating a contemporaneous graph structure and autocorrelation structure for a causal high-dimensional vector autoregressive process (VAR). The graph is estimated by estimating the stationary precision matrix using a Bayesian framework. We introduce a novel parameterization that is convenient for jointly estimating the precision matrix and the autocovariance matrices. The methodology based on the new parameterization has several desirable properties. A key feature of the proposed methodology is that it maintains causality of the process in its estimates and also provides a fast feasible way for computing the reduced rank likelihood for a high-dimensional Gaussian VAR. We use sparse priors along with the likelihood under the new parameterization to obtain the posterior of the graphical parameters as well as that of the temporal parameters. An efficient Markov Chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We also establish theoretical consistency properties for the high-dimensional posterior. The proposed methodology shows excellent performance in simulations and real data applications."}, "https://arxiv.org/abs/2410.22675": {"title": "Clustering Computer Mouse Tracking Data with Informed Hierarchical Shrinkage Partition Priors", "link": "https://arxiv.org/abs/2410.22675", "description": "arXiv:2410.22675v1 Announce Type: new \nAbstract: Mouse-tracking data, which record computer mouse trajectories while participants perform an experimental task, provide valuable insights into subjects' underlying cognitive processes. Neuroscientists are interested in clustering the subjects' responses during computer mouse-tracking tasks to reveal patterns of individual decision-making behaviors and identify population subgroups with similar neurobehavioral responses. These data can be combined with neuro-imaging data to provide additional information for personalized interventions. 
In this article, we develop a novel hierarchical shrinkage partition (HSP) prior for clustering summary statistics derived from the trajectories of mouse-tracking data. The HSP model defines a subject cluster as a set of subjects that gives rise to similar (rather than identical) nested partitions of the conditions. The proposed model can incorporate prior information about the partitioning of either subjects or conditions to facilitate clustering, and it allows for deviations of the nested partitions within each subject group. These features distinguish the HSP model from other bi-clustering methods that typically create identical nested partitions of conditions within a subject group. Furthermore, it differs from existing nested clustering methods, which define clusters based on common parameters in the sampling model and identify subject groups by different distributions. We illustrate the unique features of the HSP model on a mouse tracking dataset from a pilot study and in simulation studies. Our results show the ability and effectiveness of the proposed exploratory framework in clustering and revealing possibly different behavioral patterns across subject groups."}, "https://arxiv.org/abs/2410.22751": {"title": "Novel Subsampling Strategies for Heavily Censored Reliability Data", "link": "https://arxiv.org/abs/2410.22751", "description": "arXiv:2410.22751v1 Announce Type: new \nAbstract: Computational capability often falls short when confronted with massive data, posing a common challenge in establishing a statistical model or statistical inference method dealing with big data. While subsampling techniques have been extensively developed to downsize the data volume, there is a notable gap in addressing the unique challenge of handling extensive reliability data, in which a common situation is that a large proportion of data is censored. In this article, we propose an efficient subsampling method for reliability analysis in the presence of censored data, with the aim of estimating the parameters of the lifetime distribution. Moreover, a novel method is proposed for subsampling from severely censored data, i.e., data in which only a tiny proportion of observations are complete. The subsampling-based estimators are given, and their asymptotic properties are derived. The optimal subsampling probabilities are derived through the L-optimality criterion, which minimizes the trace of the product of the asymptotic covariance matrix and a constant matrix. Efficient algorithms are proposed to implement the proposed subsampling methods to address the challenge that the optimal subsampling strategy depends on unknown parameters estimated from the full data. A real-world hard drive dataset and simulation studies are employed to demonstrate the superior performance of the proposed methods."}, "https://arxiv.org/abs/2410.22824": {"title": "Surface data imputation with stochastic processes", "link": "https://arxiv.org/abs/2410.22824", "description": "arXiv:2410.22824v1 Announce Type: new \nAbstract: Spurious measurements in surface data are common in technical surfaces. Excluding or ignoring these spurious points may lead to incorrect surface characterization if these points inherit features of the surface. Therefore, data imputation must be applied to ensure that the estimated data points at spurious measurements do not strongly deviate from the true surface and its characteristics. 
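A minimal sketch of subsampling with inverse-probability weighting for heavily censored lifetime data, in the spirit of the reliability abstract above; exponential lifetimes and ad hoc subsampling probabilities are assumed here rather than the paper's L-optimal ones.

```python
import numpy as np

rng = np.random.default_rng(7)
N, true_rate = 100_000, 0.02
lifetimes = rng.exponential(1.0 / true_rate, size=N)
censor_times = rng.exponential(30.0, size=N)
t = np.minimum(lifetimes, censor_times)                  # observed times
delta = (lifetimes <= censor_times).astype(float)        # 1 = failure observed, 0 = censored

# Ad hoc inclusion probabilities that oversample the informative uncensored records.
prob = np.where(delta == 1, 0.04, 0.01)
keep = rng.random(N) < prob
w = 1.0 / prob[keep]                                     # inverse-probability weights

# Weighted maximum likelihood for the exponential rate: sum(w * delta) / sum(w * t).
rate_hat = np.sum(w * delta[keep]) / np.sum(w * t[keep])
print(rate_hat)   # close to true_rate despite using only a small fraction of the data
```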
Traditional surface data imputation methods rely on simple assumptions and ignore existing knowledge of the surface, yielding suboptimal estimates. In this paper, we propose using stochastic processes for data imputation. This approach, which originates from surface simulation, allows for the straightforward integration of a priori knowledge. We employ Gaussian processes for surface data imputation; both the surfaces and the missing features are generated artificially. Our results demonstrate that the proposed method fills the missing values and interpolates data points with better alignment to the measured surface compared to traditional approaches, particularly when surface features are missing."}, "https://arxiv.org/abs/2410.22989": {"title": "Propensity Score Methods for Local Test Score Equating: Stratification and Inverse Probability Weighting", "link": "https://arxiv.org/abs/2410.22989", "description": "arXiv:2410.22989v1 Announce Type: new \nAbstract: In test equating, ensuring score comparability across different test forms is crucial but particularly challenging when test groups are non-equivalent and no anchor test is available. Local test equating aims to satisfy Lord's equity requirement by conditioning equating transformations on individual-level information, typically using anchor test scores as proxies for latent ability. However, anchor tests are not always available in practice. This paper introduces two novel propensity score-based methods for local equating: stratification and inverse probability weighting (IPW). These methods use covariates to account for group differences, with propensity scores serving as proxies for latent ability differences between test groups. The stratification method partitions examinees into comparable groups based on similar propensity scores, while IPW assigns weights inversely proportional to the probability of group membership. We evaluate these methods through empirical analysis and simulation studies. Results indicate both methods can effectively adjust for group differences, with their relative performance depending on the strength of covariate-ability correlations. The study extends local equating methodology to cases where only covariate information is available, providing testing programs with new tools for ensuring fair score comparability."}, "https://arxiv.org/abs/2410.23081": {"title": "General Bayesian quantile regression for counts via generative modeling", "link": "https://arxiv.org/abs/2410.23081", "description": "arXiv:2410.23081v1 Announce Type: new \nAbstract: Although quantile regression has emerged as a powerful tool for understanding various quantiles of a response variable conditioned on a set of covariates, the development of quantile regression for count responses has received far less attention. This paper proposes a new Bayesian approach to quantile regression for count data, which provides a more flexible and interpretable alternative to the existing approaches. The proposed approach associates the continuous latent variable with the discrete response and nonparametrically estimates the joint distribution of the latent variable and a set of covariates. Then, by regressing the estimated continuous conditional quantile on the covariates, the posterior distributions of the covariate effects on the conditional quantiles are obtained through general Bayesian updating via simple optimization. 
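The propensity-score stratification and IPW ideas in the equating abstract above can be sketched as follows; the covariates, group assignment, and scores are simulated, and the snippet only compares score distributions within strata rather than constructing an equating transformation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 2000
covariates = rng.normal(size=(n, 3))                                           # background variables
group = rng.binomial(1, 1 / (1 + np.exp(-(covariates @ [0.8, -0.5, 0.3]))))    # which test form was taken
score = covariates @ [2.0, 1.0, 0.5] + 0.5 * group + rng.normal(0, 1, n)       # observed test score

ps = LogisticRegression().fit(covariates, group).predict_proba(covariates)[:, 1]

# Stratification: compare the two forms within propensity-score quintiles.
strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
for s in range(5):
    in_s = strata == s
    gap = score[in_s & (group == 1)].mean() - score[in_s & (group == 0)].mean()
    print(f"stratum {s}: mean score gap = {gap:.2f}")

# IPW alternative: weight each examinee by the inverse probability of the form actually taken.
ipw = np.where(group == 1, 1 / ps, 1 / (1 - ps))
```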
The simulation study and real data analysis demonstrate that the proposed method overcomes the existing limitations and enhances quantile estimation and interpretation of variable relationships, making it a valuable tool for practitioners handling count data."}, "https://arxiv.org/abs/2410.23246": {"title": "Progression: an extrapolation principle for regression", "link": "https://arxiv.org/abs/2410.23246", "description": "arXiv:2410.23246v1 Announce Type: new \nAbstract: The problem of regression extrapolation, or out-of-distribution generalization, arises when predictions are required at test points outside the range of the training data. In such cases, the non-parametric guarantees for regression methods from both statistics and machine learning typically fail. Based on the theory of tail dependence, we propose a novel statistical extrapolation principle. After a suitable, data-adaptive marginal transformation, it assumes a simple relationship between predictors and the response at the boundary of the training predictor samples. This assumption holds for a wide range of models, including non-parametric regression functions with additive noise. Our semi-parametric method, progression, leverages this extrapolation principle and offers guarantees on the approximation error beyond the training data range. We demonstrate how this principle can be effectively integrated with existing approaches, such as random forests and additive models, to improve extrapolation performance on out-of-distribution samples."}, "https://arxiv.org/abs/2410.22502": {"title": "Gender disparities in rehospitalisations after coronary artery bypass grafting: evidence from a functional causal mediation analysis of the MIMIC-IV data", "link": "https://arxiv.org/abs/2410.22502", "description": "arXiv:2410.22502v1 Announce Type: cross \nAbstract: Hospital readmissions following coronary artery bypass grafting (CABG) not only impose a substantial cost burden on healthcare systems but also serve as a potential indicator of the quality of medical care. Previous studies of gender effects on complications after CABG surgery have consistently revealed that women tend to suffer worse outcomes. To better understand the causal pathway from gender to the number of rehospitalisations, we study the postoperative central venous pressure (CVP), frequently recorded over patients' intensive care unit (ICU) stay after the CABG surgery, as a functional mediator. Confronted with time-varying CVP measurements and zero-inflated rehospitalisation counts within 60 days following discharge, we propose a parameter-simulating quasi-Bayesian Monte Carlo approximation method that accommodates a functional mediator and a zero-inflated count outcome for causal mediation analysis. We find a causal relationship between the female gender and increased rehospitalisation counts after CABG, and that time-varying central venous pressure mediates this causal effect."}, "https://arxiv.org/abs/2410.22591": {"title": "FGCE: Feasible Group Counterfactual Explanations for Auditing Fairness", "link": "https://arxiv.org/abs/2410.22591", "description": "arXiv:2410.22591v1 Announce Type: cross \nAbstract: This paper introduces the first graph-based framework for generating group counterfactual explanations to audit model fairness, a crucial aspect of trustworthy machine learning. Counterfactual explanations are instrumental in understanding and mitigating unfairness by revealing how inputs should change to achieve a desired outcome. 
Our framework, named Feasible Group Counterfactual Explanations (FGCEs), captures real-world feasibility constraints and constructs subgroups with similar counterfactuals, setting it apart from existing methods. It also addresses key trade-offs in counterfactual generation, including the balance between the number of counterfactuals, their associated costs, and the breadth of coverage achieved. To evaluate these trade-offs and assess fairness, we propose measures tailored to group counterfactual generation. Our experimental results on benchmark datasets demonstrate the effectiveness of our approach in managing feasibility constraints and trade-offs, as well as the potential of our proposed metrics in identifying and quantifying fairness issues."}, "https://arxiv.org/abs/2410.22647": {"title": "Adaptive Robust Confidence Intervals", "link": "https://arxiv.org/abs/2410.22647", "description": "arXiv:2410.22647v1 Announce Type: cross \nAbstract: This paper studies the construction of adaptive confidence intervals under Huber's contamination model when the contamination proportion is unknown. For the robust confidence interval of a Gaussian mean, we show that the optimal length of an adaptive interval must be exponentially wider than that of a non-adaptive one. An optimal construction is achieved through simultaneous uncertainty quantification of quantiles at all levels. The results are further extended beyond the Gaussian location model by addressing a general family of robust hypothesis testing. In contrast to adaptive robust estimation, our findings reveal that the optimal length of an adaptive robust confidence interval critically depends on the distribution's shape."}, "https://arxiv.org/abs/2410.22703": {"title": "On tail inference in scale-free inhomogeneous random graphs", "link": "https://arxiv.org/abs/2410.22703", "description": "arXiv:2410.22703v1 Announce Type: cross \nAbstract: Both empirical and theoretical investigations of scale-free network models have found that large degrees in a network exert an outsized impact on its structure. However, the tools used to infer the tail behavior of degree distributions in scale-free networks often lack a strong theoretical foundation. In this paper, we introduce a new framework for analyzing the asymptotic distribution of estimators for degree tail indices in scale-free inhomogeneous random graphs. Our framework leverages the relationship between the large weights and large degrees of Norros-Reittu and Chung-Lu random graphs. In particular, we determine a rate for the number of nodes $k(n) \\rightarrow \\infty$ such that for all $i = 1, \\dots, k(n)$, the node with the $i$-th largest weight will have the $i$-th largest degree with high probability. Such alignment of upper-order statistics is then employed to establish the asymptotic normality of three different tail index estimators based on the upper degrees. These results suggest potential applications of the framework to threshold selection and goodness-of-fit testing in scale-free networks, issues that have long challenged the network science community."}, "https://arxiv.org/abs/2410.22754": {"title": "An Overview of Causal Inference using Kernel Embeddings", "link": "https://arxiv.org/abs/2410.22754", "description": "arXiv:2410.22754v1 Announce Type: cross \nAbstract: Kernel embeddings have emerged as a powerful tool for representing probability measures in a variety of statistical inference problems. 
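A standard Hill-type estimate of a degree tail index based on the upper-order statistics, as referenced in the scale-free graph abstract above; the degree sequence and the choice of k below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
degrees = np.floor(rng.pareto(2.5, size=5000) * 5 + 1)   # synthetic Pareto-like degree sequence

def hill_estimator(x, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    x = np.sort(x)[::-1]
    gamma = np.mean(np.log(x[:k]) - np.log(x[k]))
    return 1.0 / gamma

print(hill_estimator(degrees, k=200))   # should be in the neighbourhood of 2.5
```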
By mapping probability measures into a reproducing kernel Hilbert space (RKHS), kernel embeddings enable flexible representations of complex relationships between variables. They serve as a mechanism for efficiently transferring the representation of a distribution downstream to other tasks, such as hypothesis testing or causal effect estimation. In the context of causal inference, the main challenges include identifying causal associations and estimating the average treatment effect from observational data, where confounding variables may obscure direct cause-and-effect relationships. Kernel embeddings provide a robust nonparametric framework for addressing these challenges. They allow for the representations of distributions of observational data and their seamless transformation into representations of interventional distributions to estimate relevant causal quantities. We overview recent research that leverages the expressiveness of kernel embeddings in tandem with causal inference."}, "https://arxiv.org/abs/2410.23147": {"title": "FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection", "link": "https://arxiv.org/abs/2410.23147", "description": "arXiv:2410.23147v1 Announce Type: cross \nAbstract: Traditional decision trees are limited by axis-orthogonal splits, which can perform poorly when true decision boundaries are oblique. While oblique decision tree methods address this limitation, they often face high computational costs, difficulties with multi-class classification, and a lack of effective feature selection. In this paper, we introduce LDATree and FoLDTree, two novel frameworks that integrate Uncorrelated Linear Discriminant Analysis (ULDA) and Forward ULDA into a decision tree structure. These methods enable efficient oblique splits, handle missing values, support feature selection, and provide both class labels and probabilities as model outputs. Through evaluations on simulated and real-world datasets, LDATree and FoLDTree consistently outperform axis-orthogonal and other oblique decision tree methods, achieving accuracy levels comparable to the random forest. The results highlight the potential of these frameworks as robust alternatives to traditional single-tree methods."}, "https://arxiv.org/abs/2410.23155": {"title": "QWO: Speeding Up Permutation-Based Causal Discovery in LiGAMs", "link": "https://arxiv.org/abs/2410.23155", "description": "arXiv:2410.23155v1 Announce Type: cross \nAbstract: Causal discovery is essential for understanding relationships among variables of interest in many scientific domains. In this paper, we focus on permutation-based methods for learning causal graphs in Linear Gaussian Acyclic Models (LiGAMs), where the permutation encodes a causal ordering of the variables. Existing methods in this setting are not scalable due to their high computational complexity. These methods are comprised of two main components: (i) constructing a specific DAG, $\\mathcal{G}^\\pi$, for a given permutation $\\pi$, which represents the best structure that can be learned from the available data while adhering to $\\pi$, and (ii) searching over the space of permutations (i.e., causal orders) to minimize the number of edges in $\\mathcal{G}^\\pi$. We introduce QWO, a novel approach that significantly enhances the efficiency of computing $\\mathcal{G}^\\pi$ for a given permutation $\\pi$. 
QWO has a speed-up of $O(n^2)$ ($n$ is the number of variables) compared to the state-of-the-art BIC-based method, making it highly scalable. We show that our method is theoretically sound and can be integrated into existing search strategies such as GRASP and hill-climbing-based methods to improve their performance."}, "https://arxiv.org/abs/2410.23174": {"title": "On the fundamental limitations of multiproposal Markov chain Monte Carlo algorithms", "link": "https://arxiv.org/abs/2410.23174", "description": "arXiv:2410.23174v1 Announce Type: cross \nAbstract: We study multiproposal Markov chain Monte Carlo algorithms, such as Multiple-try or generalised Metropolis-Hastings schemes, which have recently received renewed attention due to their amenability to parallel computing. First, we prove that no multiproposal scheme can speed-up convergence relative to the corresponding single proposal scheme by more than a factor of $K$, where $K$ denotes the number of proposals at each iteration. This result applies to arbitrary target distributions and it implies that serial multiproposal implementations are always less efficient than single proposal ones. Secondly, we consider log-concave distributions over Euclidean spaces, proving that, in this case, the speed-up is at most logarithmic in $K$, which implies that even parallel multiproposal implementations are fundamentally limited in the computational gain they can offer. Crucially, our results apply to arbitrary multiproposal schemes and purely rely on the two-step structure of the associated kernels (i.e. first generate $K$ candidate points, then select one among those). Our theoretical findings are validated through numerical simulations."}, "https://arxiv.org/abs/2410.23412": {"title": "BAMITA: Bayesian Multiple Imputation for Tensor Arrays", "link": "https://arxiv.org/abs/2410.23412", "description": "arXiv:2410.23412v1 Announce Type: new \nAbstract: Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework, that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and used to infer trends in species diversity for the population. 
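A compact sketch of a multiple-try Metropolis update, one of the multiproposal schemes analysed in the abstract above, using symmetric Gaussian proposals and target-density weights in the standard Liu-Liang-Wong construction; the one-dimensional target is a toy choice.

```python
import numpy as np

rng = np.random.default_rng(10)

def target(x):
    return np.exp(-0.5 * x**2)              # unnormalized standard normal density

def mtm_step(x, K=5, step=1.0):
    """One multiple-try Metropolis update with K candidates and weights w(y) = target(y)."""
    ys = x + step * rng.normal(size=K)                       # K candidate points
    wy = target(ys)
    y = rng.choice(ys, p=wy / wy.sum())                      # select one candidate
    xs = np.append(y + step * rng.normal(size=K - 1), x)     # reference points around y, plus current x
    accept = min(1.0, wy.sum() / target(xs).sum())
    return y if rng.random() < accept else x

chain = np.empty(5000)
chain[0] = 3.0
for i in range(1, chain.size):
    chain[i] = mtm_step(chain[i - 1])
print(chain.mean(), chain.std())            # roughly 0 and 1 for the standard normal target
```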
Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation."}, "https://arxiv.org/abs/2410.23587": {"title": "Fractional Moments by the Moment-Generating Function", "link": "https://arxiv.org/abs/2410.23587", "description": "arXiv:2410.23587v1 Announce Type: new \nAbstract: We introduce a novel method for obtaining a wide variety of moments of a random variable with a well-defined moment-generating function (MGF). We derive new expressions for fractional moments and fractional absolute moments, both central and non-central moments. The new moment expressions are relatively simple integrals that involve the MGF, but do not require its derivatives. We label the new method CMGF because it uses a complex extension of the MGF and can be used to obtain complex moments. We illustrate the new method with three applications where the MGF is available in closed-form, while the corresponding densities and the derivatives of the MGF are either unavailable or very difficult to obtain."}, "https://arxiv.org/abs/2410.23590": {"title": "The Nudge Average Treatment Effect", "link": "https://arxiv.org/abs/2410.23590", "description": "arXiv:2410.23590v1 Announce Type: new \nAbstract: The instrumental variable method is a prominent approach to recover, under certain conditions, valid inference about a treatment causal effect even when unmeasured confounding might be present. In a groundbreaking paper, Imbens and Angrist (1994) established that a valid instrument nonparametrically identifies the average causal effect among compliers, also known as the local average treatment effect, under a certain monotonicity assumption which rules out the existence of so-called defiers. An often-cited attractive property of monotonicity is that it facilitates a causal interpretation of the instrumental variable estimand without restricting the degree of heterogeneity of the treatment causal effect. In this paper, we introduce an alternative, equally straightforward and interpretable condition for identification, which accommodates both the presence of defiers and heterogeneous treatment effects. Mainly, we show that under our new conditions, the instrumental variable estimand recovers the average causal effect for the subgroup of units for whom the treatment is manipulable by the instrument, a subgroup which may consist of both defiers and compliers, therefore recovering an effect estimand we aptly call the Nudge Average Treatment Effect."}, "https://arxiv.org/abs/2410.23706": {"title": "Asynchronous Jump Testing and Estimation in High Dimensions Under Complex Temporal Dynamics", "link": "https://arxiv.org/abs/2410.23706", "description": "arXiv:2410.23706v1 Announce Type: new \nAbstract: Most high dimensional changepoint detection methods assume the error process is stationary and changepoints occur synchronously across dimensions. The violation of these assumptions, which in applied settings is increasingly likely as the dimensionality of the time series being analyzed grows, can dramatically curtail the sensitivity or the accuracy of these methods. We propose AJDN (Asynchronous Jump Detection under Nonstationary noise). AJDN is a high dimensional multiscale jump detection method that tests and estimates jumps in an otherwise smoothly varying mean function for high dimensional time series with nonstationary noise where the jumps across dimensions may not occur at the same time. 
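The sample analogue of the instrumental variable estimand discussed in the Nudge Average Treatment Effect abstract above is the familiar Wald ratio; the toy data-generating process below has a homogeneous effect and no defiers, so it only illustrates the ratio itself, not the paper's identification argument.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000
z = rng.binomial(1, 0.5, n)                                   # randomized instrument (the "nudge")
u = rng.normal(size=n)                                        # unmeasured confounder
d = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * z + u))))         # treatment take-up, shifted by the nudge
y = 2.0 * d + u + rng.normal(size=n)                          # outcome with a treatment effect of 2

# Wald ratio: difference in outcome means divided by difference in take-up rates.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(wald)   # close to 2 in this toy design
```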
AJDN is correct in the sense that it detects the correct number of jumps with a prescribed probability asymptotically and its accuracy in estimating the locations of the jumps is asymptotically nearly optimal under the asynchronous jump assumption. Through a simulation study we demonstrate AJDN's robustness across a wide variety of stationary and nonstationary high dimensional time series, and we show its strong performance relative to some existing high dimensional changepoint detection methods. We apply AJDN to a seismic time series to demonstrate its ability to accurately detect jumps in real-world high dimensional time series with complex temporal dynamics."}, "https://arxiv.org/abs/2410.23785": {"title": "Machine Learning Debiasing with Conditional Moment Restrictions: An Application to LATE", "link": "https://arxiv.org/abs/2410.23785", "description": "arXiv:2410.23785v1 Announce Type: new \nAbstract: Models with Conditional Moment Restrictions (CMRs) are popular in economics. These models involve finite and infinite dimensional parameters. The infinite dimensional components include conditional expectations, conditional choice probabilities, or policy functions, which might be flexibly estimated using Machine Learning tools. This paper presents a characterization of locally debiased moments for regular models defined by general semiparametric CMRs with possibly different conditioning variables. These moments are appealing as they are known to be less affected by first-step bias. Additionally, we study their existence and relevance. Such results apply to a broad class of smooth functionals of finite and infinite dimensional parameters that do not necessarily appear in the CMRs. As a leading application of our theory, we characterize debiased machine learning for settings of treatment effects with endogeneity, giving necessary and sufficient conditions. We present a large class of relevant debiased moments in this context. We then propose the Compliance Machine Learning Estimator (CML), based on a practically convenient orthogonal relevant moment. We show that the resulting estimand can be written as a convex combination of conditional local average treatment effects (LATE). Altogether, CML enjoys three appealing properties in the LATE framework: (1) local robustness to first-stage estimation, (2) an estimand that can be identified under a minimal relevance condition, and (3) a meaningful causal interpretation. Our numerical experimentation shows satisfactory relative performance of such an estimator. Finally, we revisit the Oregon Health Insurance Experiment, analyzed by Finkelstein et al. (2012). We find that the use of machine learning and CML suggest larger positive effects on health care utilization than previously determined."}, "https://arxiv.org/abs/2410.23786": {"title": "Conformal inference for cell type annotation with graph-structured constraints", "link": "https://arxiv.org/abs/2410.23786", "description": "arXiv:2410.23786v1 Announce Type: new \nAbstract: Conformal inference is a method that provides prediction sets for machine learning models, operating independently of the underlying distributional assumptions and relying solely on the exchangeability of training and test data. Despite its wide applicability and popularity, its application in graph-structured problems remains underexplored. 
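For background to the graph-constrained conformal abstract above, a plain split-conformal prediction set for classification is sketched below with synthetic "cell types" and a random forest; the paper's graph constraints and non-exchangeability adjustment are not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 3000) > 0).astype(int)   # two synthetic "cell types"

X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Calibration scores: one minus the estimated probability of the true class.
cal_scores = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
alpha = 0.1
qhat = np.quantile(cal_scores, np.ceil((len(cal_scores) + 1) * (1 - alpha)) / len(cal_scores))

# Prediction set for a new point: every class whose score is below the calibrated threshold.
x_new = rng.normal(size=(1, 10))
print(np.where(1.0 - clf.predict_proba(x_new)[0] <= qhat)[0])
```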
This paper addresses this gap by developing an approach that leverages the rich information encoded in the graph structure of predicted classes to enhance the interpretability of conformal sets. Using a motivating example from genomics, specifically imaging-based spatial transcriptomics data and single-cell RNA sequencing data, we demonstrate how incorporating graph-structured constraints can improve the interpretation of cell type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the distribution of the response variable changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://github.com/ccb-hms/scConform."}, "https://arxiv.org/abs/2410.23852": {"title": "Estimation and Inference in Dyadic Network Formation Models with Nontransferable Utilities", "link": "https://arxiv.org/abs/2410.23852", "description": "arXiv:2410.23852v1 Announce Type: new \nAbstract: This paper studies estimation and inference in a dyadic network formation model with observed covariates, unobserved heterogeneity, and nontransferable utilities. In the presence of high-dimensional fixed effects, the maximum likelihood estimator is numerically difficult to compute and suffers from the incidental parameter bias. We propose an easy-to-compute one-step estimator for the homophily parameter of interest, which is further refined to achieve $\\sqrt{N}$-consistency via split-network jackknife and efficiency by the bootstrap aggregating (bagging) technique. We establish consistency for the estimator of the fixed effects and prove asymptotic normality for the unconditional average partial effects. Simulation studies show that our method works well with finite samples, and an empirical application using the risk-sharing data from Nyakatoke highlights the importance of employing proper statistical inferential procedures."}, "https://arxiv.org/abs/2410.24003": {"title": "On testing for independence between generalized error models of several time series", "link": "https://arxiv.org/abs/2410.24003", "description": "arXiv:2410.24003v1 Announce Type: new \nAbstract: We propose new copula-based models for multivariate time series having continuous or discrete distributions, or a mixture of both. These models include stochastic volatility models and regime-switching models. We also propose statistics for testing independence between the generalized errors of these models, extending previous results of Duchesne, Ghoudi and Remillard (2012) obtained for stochastic volatility models. We define families of empirical processes constructed from lagged generalized errors, and we show that their joint asymptotic distributions are Gaussian and independent of the estimated parameters of the individual time series. Moebius transformations of the empirical processes are used to obtain tractable covariances. Several test statistics are then proposed, based on Cramer-von Mises statistics and dependence measures, as well as graphical methods to visualize the dependence. In addition, numerical experiments are performed to assess the power of the proposed tests. Finally, to show the usefulness of our methodologies, examples of applications for financial data and crime data are given to cover both discrete and continuous cases. 
All developed methodologies are implemented in the CRAN package IndGenErrors."}, "https://arxiv.org/abs/2410.24094": {"title": "Adaptive Sphericity Tests for High Dimensional Data", "link": "https://arxiv.org/abs/2410.24094", "description": "arXiv:2410.24094v1 Announce Type: new \nAbstract: In this paper, we investigate sphericity testing in high-dimensional settings, where existing methods primarily rely on sum-type test procedures that often underperform under sparse alternatives. To address this limitation, we propose two max-type test procedures utilizing the sample covariance matrix and the sample spatial-sign covariance matrix, respectively. Furthermore, we introduce two Cauchy combination test procedures that integrate both sum-type and max-type tests, demonstrating their superiority across a wide range of sparsity levels in the alternative hypothesis. Our simulation studies corroborate these findings, highlighting the enhanced performance of our proposed methodologies in high-dimensional sphericity testing."}, "https://arxiv.org/abs/2410.24136": {"title": "Two-sided conformalized survival analysis", "link": "https://arxiv.org/abs/2410.24136", "description": "arXiv:2410.24136v1 Announce Type: new \nAbstract: This paper presents a novel method using conformal prediction to generate two-sided or one-sided prediction intervals for survival times. Specifically, the method provides both lower and upper predictive bounds for individuals deemed sufficiently similar to the non-censored population, while returning only a lower bound for others. The prediction intervals offer finite-sample coverage guarantees, requiring no distributional assumptions other than the sampled data points are independent and identically distributed. The performance of the procedure is assessed using both synthetic and real-world datasets."}, "https://arxiv.org/abs/2410.24163": {"title": "Improve the Precision of Area Under the Curve Estimation for Recurrent Events Through Covariate Adjustment", "link": "https://arxiv.org/abs/2410.24163", "description": "arXiv:2410.24163v1 Announce Type: new \nAbstract: The area under the curve (AUC) of the mean cumulative function (MCF) has recently been introduced as a novel estimand for evaluating treatment effects in recurrent event settings, capturing a totality of evidence in relation to disease progression. While the Lin-Wei-Yang-Ying (LWYY) model is commonly used for analyzing recurrent events, it relies on the proportional rate assumption between treatment arms, which is often violated in practice. In contrast, the AUC under MCFs does not depend on such proportionality assumptions and offers a clinically interpretable measure of treatment effect. To improve the precision of the AUC estimation while preserving its unconditional interpretability, we propose a nonparametric covariate adjustment approach. This approach guarantees efficiency gain compared to unadjusted analysis, as demonstrated by theoretical asymptotic distributions, and is universally applicable to various randomization schemes, including both simple and covariate-adaptive designs. Extensive simulations across different scenarios further support its advantage in increasing statistical power. 
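The adaptive sphericity-testing entry above (arXiv:2410.24094) merges sum-type and max-type procedures through Cauchy combination tests. The sketch below shows only the generic Cauchy combination rule with placeholder p-values, not the paper's sphericity statistics.

```python
# Generic Cauchy combination of p-values. Under the null, each tan((0.5 - p) * pi)
# is standard Cauchy, and a weighted average of (possibly dependent) standard
# Cauchy variables is still approximately standard Cauchy, which yields the
# combined p-value below.
import numpy as np

def cauchy_combination(pvalues, weights=None):
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    stat = np.sum(w * np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(stat) / np.pi

# Example: a sum-type p-value and a max-type p-value (placeholder numbers).
print(cauchy_combination([0.30, 0.002]))   # combined p-value driven by the smaller one
```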
Our findings highlight the importance of covariate adjustment for the analysis of AUC in recurrent event settings, offering practical guidance for its application in randomized clinical trials."}, "https://arxiv.org/abs/2410.24194": {"title": "Bayesian hierarchical models with calibrated mixtures of g-priors for assessing treatment effect moderation in meta-analysis", "link": "https://arxiv.org/abs/2410.24194", "description": "arXiv:2410.24194v1 Announce Type: new \nAbstract: Assessing treatment effect moderation is critical in biomedical research and many other fields, as it guides personalized intervention strategies to improve participant's outcomes. Individual participant-level data meta-analysis (IPD-MA) offers a robust framework for such assessments by leveraging data from multiple trials. However, its performance is often compromised by challenges such as high between-trial variability. Traditional Bayesian shrinkage methods have gained popularity, but are less suitable in this context, as their priors do not discern heterogeneous studies. In this paper, we propose the calibrated mixtures of g-priors methods in IPD-MA to enhance efficiency and reduce risk in the estimation of moderation effects. Our approach incorporates a trial-level sample size tuning function, and a moderator-level shrinkage parameter in the prior, offering a flexible spectrum of shrinkage levels that enables practitioners to evaluate moderator importance, from conservative to optimistic perspectives. Compared with existing Bayesian shrinkage methods, our extensive simulation studies demonstrate that the calibrated mixtures of g-priors exhibit superior performances in terms of efficiency and risk metrics, particularly under high between-trial variability, high model sparsity, weak moderation effects and correlated design matrices. We further illustrate their application in assessing effect moderators of two active treatments for major depressive disorder, using IPD from four randomized controlled trials."}, "https://arxiv.org/abs/2410.23525": {"title": "On the consistency of bootstrap for matching estimators", "link": "https://arxiv.org/abs/2410.23525", "description": "arXiv:2410.23525v1 Announce Type: cross \nAbstract: In a landmark paper, Abadie and Imbens (2008) showed that the naive bootstrap is inconsistent when applied to nearest neighbor matching estimators of the average treatment effect with a fixed number of matches. Since then, this finding has inspired numerous efforts to address the inconsistency issue, typically by employing alternative bootstrap methods. In contrast, this paper shows that the naive bootstrap is provably consistent for the original matching estimator, provided that the number of matches, $M$, diverges. The bootstrap inconsistency identified by Abadie and Imbens (2008) thus arises solely from the use of a fixed $M$."}, "https://arxiv.org/abs/2410.23580": {"title": "Bayesian Hierarchical Model for Synthesizing Registry and Survey Data on Female Breast Cancer Prevalence", "link": "https://arxiv.org/abs/2410.23580", "description": "arXiv:2410.23580v1 Announce Type: cross \nAbstract: In public health, it is critical for policymakers to assess the relationship between the disease prevalence and associated risk factors or clinical characteristics, facilitating effective resources allocation. 
However, for diseases like female breast cancer (FBC), reliable prevalence data at specific geographical levels, such as the county-level, are limited because the gold standard data typically come from long-term cancer registries, which do not necessarily collect needed risk factors. In addition, it remains unclear whether fitting each model separately or jointly results in better estimation. In this paper, we identify two data sources to produce reliable county-level prevalence estimates in Missouri, USA: the population-based Missouri Cancer Registry (MCR) and the survey-based Missouri County-Level Study (CLS). We propose a two-stage Bayesian model to synthesize these sources, accounting for their differences in the methodological design, case definitions, and collected information. The first stage involves estimating the county-level FBC prevalence using the raking method for CLS data and the counting method for MCR data, calibrating the differences in the methodological design and case definition. The second stage includes synthesizing two sources with different sets of covariates using a Bayesian generalized linear mixed model with a Zellner-Siow prior for the coefficients. Our data analyses demonstrate that using both data sources yields better results than using only one of them, and that including a data-source membership indicator matters when there exist systematic differences between these sources. Finally, we translate results into policy making and discuss methodological differences for data synthesis of registry and survey data."}, "https://arxiv.org/abs/2410.23614": {"title": "Hypothesis testing with e-values", "link": "https://arxiv.org/abs/2410.23614", "description": "arXiv:2410.23614v1 Announce Type: cross \nAbstract: This book is written to offer a humble, but unified, treatment of e-values in hypothesis testing. The book is organized into three parts: Fundamental Concepts, Core Ideas, and Advanced Topics. The first part includes three chapters that introduce the basic concepts. The second part includes five chapters of core ideas such as universal inference, log-optimality, e-processes, operations on e-values, and e-values in multiple testing. The third part contains five chapters of advanced topics. We hope that, by putting the materials together in this book, the concept of e-values becomes more accessible for educational, research, and practical use."}, "https://arxiv.org/abs/2410.23838": {"title": "Zero-inflated stochastic block modeling of efficiency-security tradeoffs in weighted criminal networks", "link": "https://arxiv.org/abs/2410.23838", "description": "arXiv:2410.23838v1 Announce Type: cross \nAbstract: Criminal networks arise from the unique attempt to balance a need of establishing frequent ties among affiliates to facilitate the coordination of illegal activities, with the necessity to sparsify the overall connectivity architecture to hide from law enforcement. This efficiency-security tradeoff is also combined with the creation of groups of redundant criminals that exhibit similar connectivity patterns, thus guaranteeing resilient network architectures. State-of-the-art models for such data are not designed to infer these unique structures. In contrast to such solutions we develop a computationally-tractable Bayesian zero-inflated Poisson stochastic block model (ZIP-SBM), which identifies groups of redundant criminals with similar connectivity patterns, and infers both overt and covert block interactions within and across such groups. 
This is accomplished by modeling weighted ties (corresponding to counts of interactions among pairs of criminals) via zero-inflated Poisson distributions with block-specific parameters that quantify complex patterns in the excess of zero ties in each block (security) relative to the distribution of the observed weighted ties within that block (efficiency). The performance of ZIP-SBM is illustrated in simulations and in a study of summits co-attendances in a complex Mafia organization, where we unveil efficiency-security structures adopted by the criminal organization that were hidden to previous analyses."}, "https://arxiv.org/abs/2410.23975": {"title": "Average Controlled and Average Natural Micro Direct Effects in Summary Causal Graphs", "link": "https://arxiv.org/abs/2410.23975", "description": "arXiv:2410.23975v1 Announce Type: cross \nAbstract: In this paper, we investigate the identifiability of average controlled direct effects and average natural direct effects in causal systems represented by summary causal graphs, which are abstractions of full causal graphs, often used in dynamic systems where cycles and omitted temporal information complicate causal inference. Unlike in the traditional linear setting, where direct effects are typically easier to identify and estimate, non-parametric direct effects, which are crucial for handling real-world complexities, particularly in epidemiological contexts where relationships between variables (e.g, genetic, environmental, and behavioral factors) are often non-linear, are much harder to define and identify. In particular, we give sufficient conditions for identifying average controlled micro direct effect and average natural micro direct effect from summary causal graphs in the presence of hidden confounding. Furthermore, we show that the conditions given for the average controlled micro direct effect become also necessary in the setting where there is no hidden confounding and where we are only interested in identifiability by adjustment."}, "https://arxiv.org/abs/2410.24056": {"title": "A Martingale-Free Introduction to Conditional Gaussian Nonlinear Systems", "link": "https://arxiv.org/abs/2410.24056", "description": "arXiv:2410.24056v1 Announce Type: cross \nAbstract: The Conditional Gaussian Nonlinear System (CGNS) is a broad class of nonlinear stochastic dynamical systems. Given the trajectories for a subset of state variables, the remaining follow a Gaussian distribution. Despite the conditionally linear structure, the CGNS exhibits strong nonlinearity, thus capturing many non-Gaussian characteristics observed in nature through its joint and marginal distributions. Desirably, it enjoys closed analytic formulae for the time evolution of its conditional Gaussian statistics, which facilitate the study of data assimilation and other related topics. In this paper, we develop a martingale-free approach to improve the understanding of CGNSs. This methodology provides a tractable approach to proving the time evolution of the conditional statistics by deriving results through time discretization schemes, with the continuous-time regime obtained via a formal limiting process as the discretization time-step vanishes. This discretized approach further allows for developing analytic formulae for optimal posterior sampling of unobserved state variables with correlated noise. These tools are particularly valuable for studying extreme events and intermittency and apply to high-dimensional systems. 
Moreover, the approach improves the understanding of different sampling methods in characterizing uncertainty. The effectiveness of the framework is demonstrated through a physics-constrained, triad-interaction climate model with cubic nonlinearity and state-dependent cross-interacting noise."}, "https://arxiv.org/abs/2410.24145": {"title": "Conformal prediction of circular data", "link": "https://arxiv.org/abs/2410.24145", "description": "arXiv:2410.24145v1 Announce Type: cross \nAbstract: Split conformal prediction techniques are applied to regression problems with circular responses by introducing a suitable conformity score, leading to prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under exchangeable data. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear response regression model into one suitable for circular responses. When random forests serve as basis models in this projection procedure, we harness the out-of-bag dynamics to eliminate the necessity for a separate calibration sample in the construction of prediction sets. For synthetic and real datasets the resulting projected random forests model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, when compared to the split conformal prediction sets generated by two existing alternative models."}, "https://arxiv.org/abs/2006.00371": {"title": "Ridge Regularization: an Essential Concept in Data Science", "link": "https://arxiv.org/abs/2006.00371", "description": "arXiv:2006.00371v2 Announce Type: replace \nAbstract: Ridge or more formally $\\ell_2$ regularization shows up in many areas of statistics and machine learning. It is one of those essential devices that any good data scientist needs to master for their craft. In this brief ridge fest I have collected together some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics."}, "https://arxiv.org/abs/2111.10721": {"title": "Identifying Dynamic Discrete Choice Models with Hyperbolic Discounting", "link": "https://arxiv.org/abs/2111.10721", "description": "arXiv:2111.10721v4 Announce Type: replace \nAbstract: We study identification of dynamic discrete choice models with hyperbolic discounting. We show that the standard discount factor, present bias factor, and instantaneous utility functions for the sophisticated agent are point-identified from observed conditional choice probabilities and transition probabilities in a finite horizon model. The main idea to achieve identification is to exploit variation in the observed conditional choice probabilities over time. We present the estimation method and demonstrate a good performance of the estimator by simulation."}, "https://arxiv.org/abs/2309.02674": {"title": "Denoising and Multilinear Projected-Estimation of High-Dimensional Matrix-Variate Factor Time Series", "link": "https://arxiv.org/abs/2309.02674", "description": "arXiv:2309.02674v2 Announce Type: replace \nAbstract: This paper proposes a new multi-linear projection method for denoising and estimation of high-dimensional matrix-variate factor time series. It assumes that a $p_1\\times p_2$ matrix-variate time series consists of a dynamically dependent, lower-dimensional matrix-variate factor process and a $p_1\\times p_2$ matrix idiosyncratic series. 
In addition, the latter series assumes a matrix-variate factor structure such that its row and column covariances may have diverging/spiked eigenvalues to accommodate the case of low signal-to-noise ratio often encountered in applications. We use an iterative projection procedure to reduce the dimensions and noise effects in estimating front and back loading matrices and to obtain faster convergence rates than those of the traditional methods available in the literature. We further introduce a two-way projected Principal Component Analysis to mitigate the diverging noise effects, and implement a high-dimensional white-noise testing procedure to estimate the dimension of the matrix factor process. Asymptotic properties of the proposed method are established if the dimensions and sample size go to infinity. We also use simulations and real examples to assess the performance of the proposed method in finite samples and to compare its forecasting ability with some existing ones in the literature. The proposed method fares well in out-of-sample forecasting. In a supplement, we demonstrate the efficacy of the proposed approach even when the idiosyncratic terms exhibit serial correlations with or without a diverging white noise effect."}, "https://arxiv.org/abs/2311.10900": {"title": "Max-Rank: Efficient Multiple Testing for Conformal Prediction", "link": "https://arxiv.org/abs/2311.10900", "description": "arXiv:2311.10900v3 Announce Type: replace \nAbstract: Multiple hypothesis testing (MHT) commonly arises in various scientific fields, from genomics to psychology, where testing many hypotheses simultaneously increases the risk of Type-I errors. These errors can mislead scientific inquiry, rendering MHT corrections essential. In this paper, we address MHT within the context of conformal prediction, a flexible method for predictive uncertainty quantification. Some conformal prediction settings can require simultaneous testing, and positive dependencies among tests typically exist. We propose a novel correction named $\\texttt{max-rank}$ that leverages these dependencies, whilst ensuring that the joint Type-I error rate is efficiently controlled. Inspired by permutation-based corrections from Westfall & Young (1993), $\\texttt{max-rank}$ exploits rank order information to improve performance, and readily integrates with any conformal procedure. We demonstrate both its theoretical and empirical advantages over the common Bonferroni correction and its compatibility with conformal prediction, highlighting the potential to enhance predictive uncertainty quantification."}, "https://arxiv.org/abs/2312.04601": {"title": "Weak Supervision Performance Evaluation via Partial Identification", "link": "https://arxiv.org/abs/2312.04601", "description": "arXiv:2312.04601v2 Announce Type: replace-cross \nAbstract: Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fr\\'echet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. 
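For the weak-supervision evaluation entry above (arXiv:2312.04601), a toy illustration of the partial-identification idea: with only the marginal class distribution of the predictions and (assumed known) true-label prevalences, elementary Frechet inequalities already bound the accuracy. These bounds are not necessarily sharp and are not the paper's optimization-based estimator.

```python
# Toy partial identification of accuracy without ground truth: the Frechet
# inequalities bound each agreement probability P(Yhat = k, Y = k) from the
# marginals alone, and summing them bounds the accuracy. The entry above tightens
# and generalizes this via scalable convex optimization.
import numpy as np

def accuracy_bounds(p_pred, p_true):
    """Lower/upper bounds on P(Yhat = Y) from predicted and true class marginals."""
    p_pred, p_true = np.asarray(p_pred), np.asarray(p_true)
    lower = np.maximum(0.0, p_pred + p_true - 1.0).sum()
    upper = np.minimum(p_pred, p_true).sum()
    return lower, upper

# Three classes: predicted label frequencies vs. assumed true prevalences.
print(accuracy_bounds([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # (0.0, 0.9)
```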
Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications."}, "https://arxiv.org/abs/2411.00191": {"title": "Sharp Bounds on the Variance of General Regression Adjustment in Randomized Experiments", "link": "https://arxiv.org/abs/2411.00191", "description": "arXiv:2411.00191v1 Announce Type: new \nAbstract: Building on statistical foundations laid by Neyman [1923] a century ago, a growing literature focuses on problems of causal inference that arise in the context of randomized experiments where the target of inference is the average treatment effect in a finite population and random assignment determines which subjects are allocated to one of the experimental conditions. In this framework, variances of average treatment effect estimators remain unidentified because they depend on the covariance between treated and untreated potential outcomes, which are never jointly observed. Aronow et al. [2014] provide an estimator for the variance of the difference-in-means estimator that is asymptotically sharp. In practice, researchers often use some form of covariate adjustment, such as linear regression when estimating the average treatment effect. Here we extend the Aronow et al. [2014] result, providing asymptotically sharp variance bounds for general regression adjustment. We apply these results to linear regression adjustment and show benefits both in a simulation as well as an empirical application."}, "https://arxiv.org/abs/2411.00256": {"title": "Bayesian Smoothing and Feature Selection Using variational Automatic Relevance Determination", "link": "https://arxiv.org/abs/2411.00256", "description": "arXiv:2411.00256v1 Announce Type: new \nAbstract: This study introduces Variational Automatic Relevance Determination (VARD), a novel approach tailored for fitting sparse additive regression models in high-dimensional settings. VARD distinguishes itself by its ability to independently assess the smoothness of each feature while enabling precise determination of whether a feature's contribution to the response is zero, linear, or nonlinear. Further, an efficient coordinate descent algorithm is introduced to implement VARD. Empirical evaluations on simulated and real-world data underscore VARD's superiority over alternative variable selection methods for additive models."}, "https://arxiv.org/abs/2411.00346": {"title": "Estimating Broad Sense Heritability via Kernel Ridge Regression", "link": "https://arxiv.org/abs/2411.00346", "description": "arXiv:2411.00346v1 Announce Type: new \nAbstract: The broad sense genetic heritability, which quantifies the total proportion of phenotypic variation in a population due to genetic factors, is crucial for understanding trait inheritance. While many existing methods focus on estimating narrow sense heritability, which accounts only for additive genetic variation, this paper introduces a kernel ridge regression approach to estimate broad-sense heritability. We provide both upper and lower bounds for the estimator. The effectiveness of the proposed method was evaluated through extensive simulations of both synthetic data and real data from the 1000 Genomes Project. 
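For the broad-sense heritability entry above (arXiv:2411.00346), the sketch below shows only a generic kernel ridge regression fit with a linear, genomic-relatedness-style kernel on simulated genotypes; it is not the paper's heritability estimator or its upper and lower bounds.

```python
# Generic kernel ridge regression with a linear kernel built from standardized
# genotypes -- the basic fitting machinery behind kernel-based heritability
# analyses, not the broad-sense heritability estimator of the entry above.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
n, p = 200, 500
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)       # toy genotype matrix
Z = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)         # standardize markers
K = Z @ Z.T / p                                            # GRM-style linear kernel

beta = rng.normal(scale=0.1, size=p)
y = Z @ beta + rng.normal(scale=1.0, size=n)               # phenotype = signal + noise

model = KernelRidge(alpha=1.0, kernel="precomputed").fit(K, y)
fitted = model.predict(K)
print(np.corrcoef(y, fitted)[0, 1])                        # in-sample fit quality
```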
Additionally, the estimator was applied to data from the Alzheimer's Disease Neuroimaging Initiative to demonstrate its practical utility."}, "https://arxiv.org/abs/2411.00358": {"title": "Inference in a Stationary/Nonstationary Autoregressive Time-Varying-Parameter Model", "link": "https://arxiv.org/abs/2411.00358", "description": "arXiv:2411.00358v1 Announce Type: new \nAbstract: This paper considers nonparametric estimation and inference in first-order autoregressive (AR(1)) models with deterministically time-varying parameters. A key feature of the proposed approach is to allow for time-varying stationarity in some time periods, time-varying nonstationarity (i.e., unit root or local-to-unit root behavior) in other periods, and smooth transitions between the two. The estimation of the AR parameter at any time point is based on a local least squares regression method, where the relevant initial condition is endogenous. We obtain limit distributions for the AR parameter estimator and t-statistic at a given point $\\tau$ in time when the parameter exhibits unit root, local-to-unity, or stationary/stationary-like behavior at time $\\tau$. These results are used to construct confidence intervals and median-unbiased interval estimators for the AR parameter at any specified point in time. The confidence intervals have correct asymptotic coverage probabilities with the coverage holding uniformly over stationary and nonstationary behavior of the observations."}, "https://arxiv.org/abs/2411.00429": {"title": "Unbiased mixed variables distance", "link": "https://arxiv.org/abs/2411.00429", "description": "arXiv:2411.00429v1 Announce Type: new \nAbstract: Defining a distance in a mixed setting requires the quantification of observed differences of variables of different types and of variables that are measured on different scales. There exist several proposals for mixed variable distances, however, such distances tend to be biased towards specific variable types and measurement units. That is, the variable types and scales influence the contribution of individual variables to the overall distance. In this paper, we define unbiased mixed variable distances for which the contributions of individual variables to the overall distance are not influenced by measurement types or scales. We define the relevant concepts to quantify such biases and we provide a general formulation that can be used to construct unbiased mixed variable distances."}, "https://arxiv.org/abs/2411.00471": {"title": "Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models", "link": "https://arxiv.org/abs/2411.00471", "description": "arXiv:2411.00471v1 Announce Type: new \nAbstract: This paper introduces Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models. These priors are extensions of traditional mixtures of $g$ priors that allow for differential shrinkage for various (data-selected) blocks of parameters while fully accounting for the predictors' correlation structure, providing a bridge between the literatures on model selection and continuous shrinkage priors. We show that Dirichlet process mixtures of block $g$ priors are consistent in various senses and, in particular, that they avoid the conditional Lindley ``paradox'' highlighted by Som et al.(2016). Further, we develop a Markov chain Monte Carlo algorithm for posterior inference that requires only minimal ad-hoc tuning. 
Finally, we investigate the empirical performance of the prior in various real and simulated datasets. In the presence of a small number of very large effects, Dirichlet process mixtures of block $g$ priors lead to higher power for detecting smaller but significant effects with only a minimal increase in the number of false discoveries."}, "https://arxiv.org/abs/2411.00520": {"title": "Calibrated quantile prediction for Growth-at-Risk", "link": "https://arxiv.org/abs/2411.00520", "description": "arXiv:2411.00520v1 Announce Type: new \nAbstract: Accurate computation of robust estimates for extremal quantiles of empirical distributions is an essential task for a wide range of applicative fields, including economic policymaking and the financial industry. Such estimates are particularly critical in calculating risk measures, such as Growth-at-Risk (GaR). This work proposes a conformal framework to estimate calibrated quantiles, and presents an extensive simulation study and a real-world analysis of GaR to examine its benefits with respect to the state of the art. Our findings show that CP methods consistently improve the calibration and robustness of quantile estimates at all levels. The calibration gains are especially pronounced at extremal quantiles, which are critical for risk assessment and where traditional methods tend to fall short. In addition, we introduce a novel property that guarantees coverage under the exchangeability assumption, providing a valuable tool for managing risks by quantifying and controlling the likelihood of future extreme observations."}, "https://arxiv.org/abs/2411.00644": {"title": "What can we learn from marketing skills as a bipartite network from accredited programs?", "link": "https://arxiv.org/abs/2411.00644", "description": "arXiv:2411.00644v1 Announce Type: new \nAbstract: The relationship between professional skills and higher education programs is modeled as a non-directed bipartite network with binary entries representing the links between 28 skills (as captured by the occupational information network, O*NET) and 258 graduate program summaries (as captured by commercial brochures of graduate programs in marketing with accreditation standards of the Association to Advance Collegiate Schools of Business). While descriptive analysis for skills suggests a qualitative lack of alignment between the job demands captured by O*NET, inferential analyses based on exponential random graph model estimates show that skills' popularity and homophily coexist with a systematic yet weak alignment to job demands for marketing managers."}, "https://arxiv.org/abs/2411.00534": {"title": "Change-point detection in functional time series: Applications to age-specific mortality and fertility", "link": "https://arxiv.org/abs/2411.00534", "description": "arXiv:2411.00534v1 Announce Type: cross \nAbstract: We consider determining change points in a time series of age-specific mortality and fertility curves observed over time. We propose two detection methods for identifying these change points. The first method uses a functional cumulative sum statistic to pinpoint the change point. The second method computes a univariate time series of integrated squared forecast errors after fitting a functional time-series model before applying a change-point detection method to the errors to determine the change point. 
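For the change-point entry above (arXiv:2411.00534), a minimal univariate CUSUM locator applied to a toy series of squared forecast errors illustrates the second detection route; the functional cumulative sum statistic of the first route is not reproduced here.

```python
# Minimal univariate CUSUM change-point locator, applied to a toy series of
# squared forecast errors -- a stand-in for the second detection route of the
# change-point entry above.
import numpy as np

def cusum_changepoint(x):
    """Return the index maximizing the standardized CUSUM statistic, and its value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    total = x.sum()
    k = np.arange(1, n)                          # candidate split points 1..n-1
    partial = np.cumsum(x)[:-1]
    stat = np.abs(partial - k / n * total) / (np.sqrt(n) * x.std())
    return int(k[np.argmax(stat)]), float(stat.max())

rng = np.random.default_rng(2)
errors = np.concatenate([rng.normal(0, 1, 150) ** 2, rng.normal(0, 2, 100) ** 2])
print(cusum_changepoint(errors))                 # estimated split near index 150
```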
Using Australian age-specific fertility and mortality data, we apply these methods to locate the change points and identify the optimal training period to achieve improved forecast accuracy."}, "https://arxiv.org/abs/2411.00621": {"title": "Nonparametric estimation of Hawkes processes with RKHSs", "link": "https://arxiv.org/abs/2411.00621", "description": "arXiv:2411.00621v1 Announce Type: cross \nAbstract: This paper addresses nonparametric estimation of nonlinear multivariate Hawkes processes, where the interaction functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). Motivated by applications in neuroscience, the model allows complex interaction functions, in order to express exciting and inhibiting effects, but also a combination of both (which is particularly interesting to model the refractory period of neurons), and considers in return that conditional intensities are rectified by the ReLU function. The latter feature incurs several methodological challenges, for which workarounds are proposed in this paper. In particular, it is shown that a representer theorem can be obtained for approximated versions of the log-likelihood and the least-squares criteria. Based on it, we propose an estimation method, that relies on two simple approximations (of the ReLU function and of the integral operator). We provide an approximation bound, justifying the negligible statistical effect of these approximations. Numerical results on synthetic data confirm this fact as well as the good asymptotic behavior of the proposed estimator. It also shows that our method achieves a better performance compared to related nonparametric estimation techniques and suits neuronal applications."}, "https://arxiv.org/abs/1801.09236": {"title": "Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms", "link": "https://arxiv.org/abs/1801.09236", "description": "arXiv:1801.09236v4 Announce Type: replace \nAbstract: Differential privacy (DP), provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector $T$, a function of sensitive data under DP, via the class of $K$-norm mechanisms with the goal of minimizing the noise added to achieve privacy. First, we introduce the sensitivity space of $T$, which extends the concepts of sensitivity polytope and sensitivity hull to the setting of arbitrary statistics $T$. We then propose a framework consisting of three methods for comparing the $K$-norm mechanisms: 1) a multivariate extension of stochastic dominance, 2) the entropy of the mechanism, and 3) the conditional variance given a direction, to identify the optimal $K$-norm mechanism. In all of these criteria, the optimal $K$-norm mechanism is generated by the convex hull of the sensitivity space. Using our methodology, we extend the objective perturbation and functional mechanisms and apply these tools to logistic and linear regression, allowing for private releases of statistical results. 
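For the K-norm mechanism entry above (arXiv:1801.09236), the sketch shows only the textbook Laplace mechanism for a vector statistic with a stated l1 sensitivity, i.e. the kind of baseline release that K-norm mechanisms are designed to improve upon; the sensitivity value here is illustrative.

```python
# Textbook epsilon-differentially-private release of a vector statistic via the
# Laplace mechanism: add independent Laplace(l1_sensitivity / epsilon) noise to
# each coordinate. Baseline only; not the optimal K-norm construction of the
# entry above.
import numpy as np

def laplace_mechanism(statistic, l1_sensitivity, epsilon, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    scale = l1_sensitivity / epsilon
    return statistic + rng.laplace(loc=0.0, scale=scale, size=np.shape(statistic))

# Example: a vector of sample proportions over n = 1000 records; changing one
# record moves at most two proportions by 1/1000 each, so the l1 sensitivity is 2/1000.
T = np.array([0.62, 0.31, 0.07])
print(laplace_mechanism(T, l1_sensitivity=2.0 / 1000, epsilon=1.0,
                        rng=np.random.default_rng(3)))
```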
Via simulations and an application to a housing price dataset, we demonstrate that our proposed methodology offers a substantial improvement in utility for the same level of risk."}, "https://arxiv.org/abs/2205.07090": {"title": "Evaluating Forecasts with scoringutils in R", "link": "https://arxiv.org/abs/2205.07090", "description": "arXiv:2205.07090v2 Announce Type: replace \nAbstract: Evaluating forecasts is essential to understand and improve forecasting and make forecasts useful to decision makers. A variety of R packages provide a broad variety of scoring rules, visualisations and diagnostic tools. One particular challenge, which scoringutils aims to address, is handling the complexity of evaluating and comparing forecasts from several forecasters across multiple dimensions such as time, space, and different types of targets. scoringutils extends the existing landscape by offering a convenient and flexible data.table-based framework for evaluating and comparing probabilistic forecasts (forecasts represented by a full predictive distribution). Notably, scoringutils is the first package to offer extensive support for probabilistic forecasts in the form of predictive quantiles, a format that is currently used by several infectious disease Forecast Hubs. The package is easily extendable, meaning that users can supply their own scoring rules or extend existing classes to handle new types of forecasts. scoringutils provides broad functionality to check the data and diagnose issues, to visualise forecasts and missing data, to transform data before scoring, to handle missing forecasts, to aggregate scores, and to visualise the results of the evaluation. The paper presents the package and its core functionality and illustrates common workflows using example data of forecasts for COVID-19 cases and deaths submitted to the European COVID-19 Forecast Hub."}, "https://arxiv.org/abs/2301.10592": {"title": "Hierarchical Regularizers for Reverse Unrestricted Mixed Data Sampling Regressions", "link": "https://arxiv.org/abs/2301.10592", "description": "arXiv:2301.10592v2 Announce Type: replace \nAbstract: Reverse Unrestricted MIxed DAta Sampling (RU-MIDAS) regressions are used to model high-frequency responses by means of low-frequency variables. However, due to the periodic structure of RU-MIDAS regressions, the dimensionality grows quickly if the frequency mismatch between the high- and low-frequency variables is large. Additionally the number of high-frequency observations available for estimation decreases. We propose to counteract this reduction in sample size by pooling the high-frequency coefficients and further reduce the dimensionality through a sparsity-inducing convex regularizer that accounts for the temporal ordering among the different lags. To this end, the regularizer prioritizes the inclusion of lagged coefficients according to the recency of the information they contain. 
We demonstrate the proposed method on two empirical applications, one on realized volatility forecasting with macroeconomic data and another on demand forecasting for a bicycle-sharing system with ridership data on other transportation types."}, "https://arxiv.org/abs/2411.00795": {"title": "Simulations for estimation of random effects and overall effect in three-level meta-analysis of standardized mean differences using constant and inverse-variance weights", "link": "https://arxiv.org/abs/2411.00795", "description": "arXiv:2411.00795v1 Announce Type: new \nAbstract: We consider a three-level meta-analysis of standardized mean differences. The standard method of estimation uses inverse-variance weights and REML/PL estimation of variance components for the random effects. We introduce new moment-based point and interval estimators for the two variance components and related estimators of the overall mean. Similar to traditional analysis of variance, our method is based on two conditional $Q$ statistics with effective-sample-size weights. We study, by simulation, bias and coverage of these new estimators. For comparison, we also study bias and coverage of the REML/PL-based approach as implemented in {\\it rma.mv} in {\\it metafor}. Our results demonstrate that the new methods are often considerably better and do not have convergence problems, which plague the standard analysis."}, "https://arxiv.org/abs/2411.00886": {"title": "The ET Interview: Professor Joel L. Horowitz", "link": "https://arxiv.org/abs/2411.00886", "description": "arXiv:2411.00886v1 Announce Type: new \nAbstract: Joel L. Horowitz has made profound contributions to many areas in econometrics and statistics. These include bootstrap methods, semiparametric and nonparametric estimation, specification testing, nonparametric instrumental variables estimation, high-dimensional models, functional data analysis, and shape restrictions, among others. Originally trained as a physicist, Joel made a pivotal transition to econometrics, greatly benefiting our profession. Throughout his career, he has collaborated extensively with a diverse range of coauthors, including students, departmental colleagues, and scholars from around the globe. Joel was born in 1941 in Pasadena, California. He attended Stanford for his undergraduate studies and obtained his Ph.D. in physics from Cornell in 1967. He has been Charles E. and Emma H. Morrison Professor of Economics at Northwestern University since 2001. Prior to that, he was a faculty member at the University of Iowa (1982-2001). He has served as a co-editor of Econometric Theory (1992-2000) and Econometrica (2000-2004). He is a Fellow of the Econometric Society and of the American Statistical Association, and an elected member of the International Statistical Institute. The majority of this interview took place in London during June 2022."}, "https://arxiv.org/abs/2411.00921": {"title": "Differentially Private Algorithms for Linear Queries via Stochastic Convex Optimization", "link": "https://arxiv.org/abs/2411.00921", "description": "arXiv:2411.00921v1 Announce Type: new \nAbstract: This article establishes a method to answer a finite set of linear queries on a given dataset while ensuring differential privacy. To achieve this, we formulate the corresponding task as a saddle-point problem, i.e. 
an optimization problem whose solution corresponds to a distribution minimizing the difference between answers to the linear queries based on the true distribution and answers from a differentially private distribution. Against this background, we establish two new algorithms for corresponding differentially private data release: the first is based on the differentially private Frank-Wolfe method, the second combines randomized smoothing with stochastic convex optimization techniques for a solution to the saddle-point problem. While previous works assess the accuracy of differentially private algorithms with reference to the empirical data distribution, a key contribution of our work is a more natural evaluation of the proposed algorithms' accuracy with reference to the true data-generating distribution."}, "https://arxiv.org/abs/2411.00947": {"title": "Asymptotic theory for the quadratic assignment procedure", "link": "https://arxiv.org/abs/2411.00947", "description": "arXiv:2411.00947v1 Announce Type: new \nAbstract: The quadratic assignment procedure (QAP) is a popular tool for analyzing network data in medical and social sciences. To test the association between two network measurements represented by two symmetric matrices, QAP calculates the $p$-value by permuting the units, or equivalently, by simultaneously permuting the rows and columns of one matrix. Its extension to the regression setting, known as the multiple regression QAP, has also gained popularity, especially in psychometrics. However, the statistics theory for QAP has not been fully established in the literature. We fill the gap in this paper. We formulate the network models underlying various QAPs. We derive (a) the asymptotic sampling distributions of some canonical test statistics and (b) the corresponding asymptotic permutation distributions induced by QAP under strong and weak null hypotheses. Task (a) relies on applying the theory of U-statistics, and task (b) relies on applying the theory of double-indexed permutation statistics. The combination of tasks (a) and (b) provides a relatively complete picture of QAP. Overall, our asymptotic theory suggests that using properly studentized statistics in QAP is a robust choice in that it is finite-sample exact under the strong null hypothesis and preserves the asymptotic type one error rate under the weak null hypothesis."}, "https://arxiv.org/abs/2411.00950": {"title": "A Semiparametric Approach to Causal Inference", "link": "https://arxiv.org/abs/2411.00950", "description": "arXiv:2411.00950v1 Announce Type: new \nAbstract: In causal inference, an important problem is to quantify the effects of interventions or treatments. Many studies focus on estimating the mean causal effects; however, these estimands may offer limited insight since two distributions can share the same mean yet exhibit significant differences. Examining the causal effects from a distributional perspective provides a more thorough understanding. In this paper, we employ a semiparametric density ratio model (DRM) to characterize the counterfactual distributions, introducing a framework that assumes a latent structure shared by these distributions. Our model offers flexibility by avoiding strict parametric assumptions on the counterfactual distributions. Specifically, the DRM incorporates a nonparametric component that can be estimated through the method of empirical likelihood (EL), using the data from all the groups stemming from multiple interventions. 
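For the QAP entry above (arXiv:2411.00947), a minimal quadratic assignment permutation test is sketched below with a plain off-diagonal correlation as the statistic; the entry itself recommends properly studentized statistics, which are not implemented here.

```python
# Minimal quadratic assignment procedure (QAP): permute rows and columns of one
# matrix simultaneously and recompute the off-diagonal correlation to build a
# permutation p-value. Plain correlation is used only to keep the sketch short.
import numpy as np

def qap_pvalue(A, B, n_perm=2000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    mask = ~np.eye(n, dtype=bool)                      # off-diagonal entries only
    obs = np.corrcoef(A[mask], B[mask])[0, 1]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        Bp = B[np.ix_(perm, perm)]                     # simultaneous row/column permutation
        if abs(np.corrcoef(A[mask], Bp[mask])[0, 1]) >= abs(obs):
            count += 1
    return obs, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(4)
n = 30
A = rng.normal(size=(n, n)); A = (A + A.T) / 2
B = 0.5 * A + rng.normal(size=(n, n)); B = (B + B.T) / 2
print(qap_pvalue(A, B, rng=rng))                       # small p-value: A and B are related
```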
Consequently, the EL-DRM framework enables inference of the counterfactual distribution functions and their functionals, facilitating direct and transparent causal inference from a distributional perspective. Numerical studies on both synthetic and real-world data validate the effectiveness of our approach."}, "https://arxiv.org/abs/2411.01052": {"title": "Multivariate Gini-type discrepancies", "link": "https://arxiv.org/abs/2411.01052", "description": "arXiv:2411.01052v1 Announce Type: new \nAbstract: Measuring distances in a multidimensional setting is a challenging problem, which appears in many fields of science and engineering. In this paper, to measure the distance between two multivariate distributions, we introduce a new measure of discrepancy which is scale invariant and which, in the case of two independent copies of the same distribution, and after normalization, coincides with the scaling invariant multidimensional version of the Gini index recently proposed in [34]. A byproduct of the analysis is an easy-to-handle discrepancy metric, obtained by application of the theory to a pair of Gaussian multidimensional densities. The obtained metric does improve the standard metrics, based on the mean squared error, as it is scale invariant. The importance of this theoretical finding is illustrated by means of a real problem that concerns measuring the importance of Environmental, Social and Governance factors for the growth of small and medium enterprises."}, "https://arxiv.org/abs/2411.01064": {"title": "Empirical Welfare Analysis with Hedonic Budget Constraints", "link": "https://arxiv.org/abs/2411.01064", "description": "arXiv:2411.01064v1 Announce Type: new \nAbstract: We analyze demand settings where heterogeneous consumers maximize utility for product attributes subject to a nonlinear budget constraint. We develop nonparametric methods for welfare-analysis of interventions that change the constraint. Two new findings are Roy's identity for smooth, nonlinear budgets, which yields a Partial Differential Equation system, and a Slutsky-like symmetry condition for demand. Under scalar unobserved heterogeneity and single-crossing preferences, the coefficient functions in the PDEs are nonparametrically identified, and under symmetry, lead to path-independent, money-metric welfare. We illustrate our methods with welfare evaluation of a hypothetical change in relationship between property rent and neighborhood school-quality using British microdata."}, "https://arxiv.org/abs/2411.01065": {"title": "Local Indicators of Mark Association for Spatial Marked Point Processes", "link": "https://arxiv.org/abs/2411.01065", "description": "arXiv:2411.01065v1 Announce Type: new \nAbstract: The emergence of distinct local mark behaviours is becoming increasingly common in the applications of spatial marked point processes. This dynamic highlights the limitations of existing global mark correlation functions in accurately identifying the true patterns of mark associations/variations among points as distinct mark behaviours might dominate one another, giving rise to an incomplete understanding of mark associations. In this paper, we introduce a family of local indicators of mark association (LIMA) functions for spatial marked point processes. These functions are defined on general state spaces and can include marks that are either real-valued or function-valued. 
Unlike global mark correlation functions, which are often distorted by the existence of distinct mark behaviours, LIMA functions reliably identify all types of mark associations and variations among points. Additionally, they accurately determine the interpoint distances where individual points show significant mark associations. Through simulation studies, featuring various scenarios, and four real applications in forestry, criminology, and urban mobility, we study spatial marked point processes in $\\R^2$ and on linear networks with either real-valued or function-valued marks, demonstrating that LIMA functions significantly outperform the existing global mark correlation functions."}, "https://arxiv.org/abs/2411.01249": {"title": "A novel method for synthetic control with interference", "link": "https://arxiv.org/abs/2411.01249", "description": "arXiv:2411.01249v1 Announce Type: new \nAbstract: Synthetic control methods have been widely used for evaluating policy effects in comparative case studies. However, most existing synthetic control methods implicitly rule out interference effects on the untreated units, and their validity may be jeopardized in the presence of interference. In this paper, we propose a novel synthetic control method, which admits interference but does not require modeling the interference structure. Identification of the effects is achieved under the assumption that the number of interfered units is no larger than half of the total number of units minus the dimension of confounding factors. We propose consistent and asymptotically normal estimation and establish statistical inference for the direct and interference effects averaged over post-intervention periods. We evaluate the performance of our method and compare it to competing methods via numerical experiments. A real data analysis, evaluating the effects of the announcement of relocating the US embassy to Jerusalem on the number of Middle Eastern conflicts, provides empirical evidence that the announcement not only increases the number of conflicts in Israel-Palestine but also has interference effects on several other Middle Eastern countries."}, "https://arxiv.org/abs/2411.01250": {"title": "Hierarchical and Density-based Causal Clustering", "link": "https://arxiv.org/abs/2411.01250", "description": "arXiv:2411.01250v1 Announce Type: new \nAbstract: Understanding treatment effect heterogeneity is vital for scientific and policy research. However, identifying and evaluating heterogeneous treatment effects pose significant challenges due to the typically unknown subgroup structure. Recently, a novel approach, causal k-means clustering, has emerged to assess heterogeneity of treatment effect by applying the k-means algorithm to unknown counterfactual regression functions. In this paper, we expand upon this framework by integrating hierarchical and density-based clustering algorithms. We propose plug-in estimators that are simple and readily implementable using off-the-shelf algorithms. Unlike k-means clustering, which requires the margin condition, our proposed estimators do not rely on strong structural assumptions on the outcome process. We go on to study their rate of convergence, and show that under the minimal regularity conditions, the additional cost of causal clustering is essentially the estimation error of the outcome regression functions. 
Our findings significantly extend the capabilities of the causal clustering framework, thereby contributing to the progression of methodologies for identifying homogeneous subgroups in treatment response, consequently facilitating more nuanced and targeted interventions. The proposed methods also open up new avenues for clustering with generic pseudo-outcomes. We explore finite sample properties via simulation, and illustrate the proposed methods in voting and employment projection datasets."}, "https://arxiv.org/abs/2411.01317": {"title": "Distributed Pseudo-Likelihood Method for Community Detection in Large-Scale Networks", "link": "https://arxiv.org/abs/2411.01317", "description": "arXiv:2411.01317v1 Announce Type: new \nAbstract: This paper proposes a distributed pseudo-likelihood method (DPL) to conveniently identify the community structure of large-scale networks. Specifically, we first propose a block-wise splitting method to divide large-scale network data into several subnetworks and distribute them among multiple workers. For simplicity, we assume the classical stochastic block model. Then, the DPL algorithm is iteratively implemented for the distributed optimization of the sum of the local pseudo-likelihood functions. At each iteration, the worker updates its local community labels and communicates with the master. The master then broadcasts the combined estimator to each worker for the new iterative steps. Based on the distributed system, DPL significantly reduces the computational complexity of the traditional pseudo-likelihood method using a single machine. Furthermore, to ensure statistical accuracy, we theoretically discuss the requirements of the worker sample size. Moreover, we extend the DPL method to estimate degree-corrected stochastic block models. The superior performance of the proposed distributed algorithm is demonstrated through extensive numerical studies and real data analysis."}, "https://arxiv.org/abs/2411.01381": {"title": "Modeling the restricted mean survival time using pseudo-value random forests", "link": "https://arxiv.org/abs/2411.01381", "description": "arXiv:2411.01381v1 Announce Type: new \nAbstract: The restricted mean survival time (RMST) has become a popular measure to summarize event times in longitudinal studies. Defined as the area under the survival function up to a time horizon $\\tau$ > 0, the RMST can be interpreted as the life expectancy within the time interval [0, $\\tau$]. In addition to its straightforward interpretation, the RMST also allows for the definition of valid estimands for the causal analysis of treatment contrasts in medical studies. In this work, we introduce a non-parametric approach to model the RMST conditional on a set of baseline variables (including, e.g., treatment variables and confounders). Our method is based on a direct modeling strategy for the RMST, using leave-one-out jackknife pseudo-values within a random forest regression framework. In this way, it can be employed to obtain precise estimates of both patient-specific RMST values and confounder-adjusted treatment contrasts. Since our method (termed \"pseudo-value random forest\", PVRF) is model-free, RMST estimates are not affected by restrictive assumptions like the proportional hazards assumption. Particularly, PVRF offers a high flexibility in detecting relevant covariate effects from higher-dimensional data, thereby expanding the range of existing pseudo-value modeling techniques for RMST estimation. 
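For the pseudo-value random forest entry above (arXiv:2411.01381), the sketch computes leave-one-out jackknife pseudo-values of the RMST from a simple Kaplan-Meier helper and regresses them with an off-the-shelf random forest. The helper and simulated data are stand-ins, not the paper's implementation.

```python
# Jackknife pseudo-values of the restricted mean survival time (RMST) turn a
# censored outcome into a complete-data regression target, which can then be fed
# to any off-the-shelf regressor (a random forest here).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rmst_km(time, event, tau):
    """Area under the Kaplan-Meier curve on [0, tau]."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    at_risk = len(t)
    surv, grid, s = [1.0], [0.0], 1.0
    for ti, di in zip(t, d):
        if ti > tau:
            break
        if di == 1:
            s *= 1.0 - 1.0 / at_risk      # ties handled one event at a time
            surv.append(s)
            grid.append(float(ti))
        at_risk -= 1
    grid.append(float(tau))
    return float(np.sum(np.array(surv) * np.diff(grid)))

def rmst_pseudo_values(time, event, tau):
    """Leave-one-out jackknife pseudo-values: n*RMST(all) - (n-1)*RMST(without i)."""
    n = len(time)
    full = rmst_km(time, event, tau)
    keep = np.ones(n, dtype=bool)
    pv = np.empty(n)
    for i in range(n):
        keep[i] = False
        pv[i] = n * full - (n - 1) * rmst_km(time[keep], event[keep], tau)
        keep[i] = True
    return pv

# Toy data: covariate-dependent exponential event times with random censoring.
rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 3))
latent = rng.exponential(scale=np.exp(0.5 * X[:, 0] + 1.0))
censor = rng.exponential(scale=3.0, size=n)
time = np.minimum(latent, censor)
event = (latent <= censor).astype(int)

tau = 2.0
pv = rmst_pseudo_values(time, event, tau)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, pv)
print(forest.predict(X[:5]))                  # individual RMST estimates on [0, tau]
```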
We investigate the properties of our method using simulations and illustrate its use by an application to data from the SUCCESS-A breast cancer trial. Our numerical experiments demonstrate that PVRF yields accurate estimates of both patient-specific RMST values and RMST-based treatment contrasts."}, "https://arxiv.org/abs/2411.01382": {"title": "On MCMC mixing under unidentified nonparametric models with an application to survival predictions under transformation models", "link": "https://arxiv.org/abs/2411.01382", "description": "arXiv:2411.01382v1 Announce Type: new \nAbstract: The multi-modal posterior under unidentified nonparametric models yields poor mixing of Markov Chain Monte Carlo (MCMC), which is a stumbling block to Bayesian predictions. In this article, we conceptualize a prior informativeness threshold that is essentially the variance of posterior modes and expressed by the uncertainty hyperparameters of nonparametric priors. The threshold plays the role of a lower bound of the within-chain MCMC variance to ensure MCMC mixing, and engines prior modification through hyperparameter tuning to descend the mode variance. Our method distinguishes from existing postprocessing methods in that it directly samples well-mixed MCMC chains on the unconstrained space, and inherits the original posterior predictive distribution in predictive inference. Our method succeeds in Bayesian survival predictions under an unidentified nonparametric transformation model, guarded by the inferential theories of the posterior variance, under elicitation of two delicate nonparametric priors. Comprehensive simulations and real-world data analysis demonstrate that our method achieves MCMC mixing and outperforms existing approaches in survival predictions."}, "https://arxiv.org/abs/2411.01383": {"title": "Automated Analysis of Experiments using Hierarchical Garrote", "link": "https://arxiv.org/abs/2411.01383", "description": "arXiv:2411.01383v1 Announce Type: new \nAbstract: In this work, we propose an automatic method for the analysis of experiments that incorporates hierarchical relationships between the experimental variables. We use a modified version of nonnegative garrote method for variable selection which can incorporate hierarchical relationships. The nonnegative garrote method requires a good initial estimate of the regression parameters for it to work well. To obtain the initial estimate, we use generalized ridge regression with the ridge parameters estimated from a Gaussian process prior placed on the underlying input-output relationship. The proposed method, called HiGarrote, is fast, easy to use, and requires no manual tuning. Analysis of several real experiments are presented to demonstrate its benefits over the existing methods."}, "https://arxiv.org/abs/2411.01498": {"title": "Educational Effects in Mathematics: Conditional Average Treatment Effect depending on the Number of Treatments", "link": "https://arxiv.org/abs/2411.01498", "description": "arXiv:2411.01498v1 Announce Type: new \nAbstract: This study examines the educational effect of the Academic Support Center at Kogakuin University. Following the initial assessment, it was suggested that group bias had led to an underestimation of the Center's true impact. To address this issue, the authors applied the theory of causal inference. By using T-learner, the conditional average treatment effect (CATE) of the Center's face-to-face (F2F) personal assistance program was evaluated. 
Extending T-learner, the authors produced a new CATE function that depends on the number of treatments (F2F sessions) and used the estimated function to predict the CATE performance of F2F assistance."}, "https://arxiv.org/abs/2411.01528": {"title": "Enhancing Forecasts Using Real-Time Data Flow and Hierarchical Forecast Reconciliation, with Applications to the Energy Sector", "link": "https://arxiv.org/abs/2411.01528", "description": "arXiv:2411.01528v1 Announce Type: new \nAbstract: A novel framework for hierarchical forecast updating is presented, addressing a critical gap in the forecasting literature. By assuming a temporal hierarchy structure, the innovative approach extends hierarchical forecast reconciliation to effectively manage the challenge posed by partially observed data. This crucial extension allows one, in conjunction with real-time data, to obtain updated and coherent forecasts across the entire temporal hierarchy, thereby enhancing decision-making accuracy. The framework involves updating base models in response to new data, which produces revised base forecasts. A subsequent pruning step integrates the newly available data, allowing for the application of any forecast reconciliation method to obtain fully updated reconciled forecasts. Additionally, the framework not only ensures coherence among forecasts but also improves overall accuracy throughout the hierarchy. Its inherent flexibility and interpretability enable users to perform hierarchical forecast updating concisely. The methodology is extensively demonstrated in a simulation study with various settings and comparing different data-generating processes, hierarchies, and reconciliation methods. Practical applicability is illustrated through two case studies in the energy sector, energy generation and solar power data, where the framework yields superior results compared to base models that do not incorporate new data, leading to more precise decision-making outcomes."}, "https://arxiv.org/abs/2411.01588": {"title": "Statistical Inference on High Dimensional Gaussian Graphical Regression Models", "link": "https://arxiv.org/abs/2411.01588", "description": "arXiv:2411.01588v1 Announce Type: new \nAbstract: Gaussian graphical regressions have emerged as a powerful approach for regressing the precision matrix of a Gaussian graphical model on covariates, which, unlike traditional Gaussian graphical models, can help determine how graphs are modulated by high dimensional subject-level covariates, and recover both the population-level and subject-level graphs. To fit the model, a multi-task learning approach achieves lower error rates compared to node-wise regressions. However, due to the high complexity and dimensionality of the Gaussian graphical regression problem, the important task of statistical inference remains unexplored. We propose a class of debiased estimators based on multi-task learners for statistical inference in Gaussian graphical regressions. We show that debiasing can be performed quickly and separately for the multi-task learners. In a key debiasing step that estimates the inverse covariance matrix, we propose a novel projection technique that dramatically reduces computational costs in optimization to scale only with the sample size $n$. 
We show that our debiased estimators enjoy a fast convergence rate and asymptotically follow a normal distribution, enabling valid statistical inference such as constructing confidence intervals and performing hypothesis testing. Simulation studies confirm the practical utility of the proposed approach, and we further apply it to analyze gene co-expression graph data from a brain cancer study, revealing meaningful biological relationships."}, "https://arxiv.org/abs/2411.01617": {"title": "Changes-In-Changes For Discrete Treatment", "link": "https://arxiv.org/abs/2411.01617", "description": "arXiv:2411.01617v1 Announce Type: new \nAbstract: This paper generalizes the changes-in-changes (CIC) model to handle discrete treatments with more than two categories, extending the binary case of Athey and Imbens (2006). While the original CIC model is well-suited for binary treatments, it cannot accommodate multi-category discrete treatments often found in economic and policy settings. Although recent work has extended CIC to continuous treatments, there remains a gap for multi-category discrete treatments. I introduce a generalized CIC model that adapts the rank invariance assumption to multiple treatment levels, allowing for robust modeling while capturing the distinct effects of varying treatment intensities."}, "https://arxiv.org/abs/2411.01686": {"title": "FRODO: A novel approach to micro-macro multilevel regressio", "link": "https://arxiv.org/abs/2411.01686", "description": "arXiv:2411.01686v1 Announce Type: new \nAbstract: Within the field of hierarchical modelling, little attention is paid to micro-macro models: those in which group-level outcomes are dependent on covariates measured at the level of individuals within groups. Although such models are perhaps underrepresented in the literature, they have applications in economics, epidemiology, and the social sciences. Despite the strong mathematical similarities between micro-macro and measurement error models, few efforts have been made to apply the much better-developed methodology of the latter to the former. Here, we present a new empirical Bayesian technique for micro-macro data, called FRODO (Functional Regression On Densities of Observations). The method jointly infers group-specific densities for multilevel covariates and uses them as functional predictors in a functional linear regression, resulting in a model that is analogous to a generalized additive model (GAM). In doing so, it achieves a level of generality comparable to more sophisticated methods developed for errors-in-variables models, while further leveraging the larger group sizes characteristic of multilevel data to provide richer information about the within-group covariate distributions. After explaining the hierarchical structure of FRODO, its power and versatility are demonstrated on several simulated datasets, showcasing its ability to accommodate a wide variety of covariate distributions and regression models."}, "https://arxiv.org/abs/2411.01697": {"title": "A probabilistic diagnostic for Laplace approximations: Introduction and experimentation", "link": "https://arxiv.org/abs/2411.01697", "description": "arXiv:2411.01697v1 Announce Type: new \nAbstract: Many models require integrals of high-dimensional functions: for instance, to obtain marginal likelihoods. Such integrals may be intractable, or too expensive to compute numerically. Instead, we can use the Laplace approximation (LA). 
The LA is exact if the function is proportional to a normal density; its effectiveness therefore depends on the function's true shape. Here, we propose the use of the probabilistic numerical framework to develop a diagnostic for the LA and its underlying shape assumptions, modelling the function and its integral as a Gaussian process and devising a \"test\" by conditioning on a finite number of function values. The test is decidedly non-asymptotic and is not intended as a full substitute for numerical integration - rather, it is simply intended to test the feasibility of the assumptions underpinning the LA with minimal computation. We discuss approaches to optimize and design the test, apply it to known sample functions, and highlight the challenges of high dimensions."}, "https://arxiv.org/abs/2411.01704": {"title": "Understanding the decision-making process of choice modellers", "link": "https://arxiv.org/abs/2411.01704", "description": "arXiv:2411.01704v1 Announce Type: new \nAbstract: Discrete Choice Modelling serves as a robust framework for modelling human choice behaviour across various disciplines. Building a choice model is a semi-structured research process that involves a combination of a priori assumptions, behavioural theories, and statistical methods. This complex set of decisions, coupled with diverse workflows, can lead to substantial variability in model outcomes. To better understand these dynamics, we developed the Serious Choice Modelling Game, which simulates the real-world modelling process and tracks modellers' decisions in real time using a stated preference dataset. Participants were asked to develop choice models to estimate Willingness to Pay values to inform policymakers about strategies for reducing noise pollution. The game recorded actions across multiple phases, including descriptive analysis, model specification, and outcome interpretation, allowing us to analyse both individual decisions and differences in modelling approaches. While our findings reveal a strong preference for using data visualisation tools in descriptive analysis, they also identify gaps in the handling of missing values before model specification. We also found significant variation in the modelling approach, even when modellers were working with the same choice dataset. Despite the availability of more complex models, simpler models such as Multinomial Logit were often preferred, suggesting that modellers tend to avoid complexity when time and resources are limited. Participants who engaged in more comprehensive data exploration and iterative model comparison tended to achieve better model fit and parsimony, which demonstrates that the methodological choices made throughout the workflow have significant implications, particularly when modelling outcomes are used for policy formulation."}, "https://arxiv.org/abs/2411.01723": {"title": "Comparing multilevel and fixed effect approaches in the generalized linear model setting", "link": "https://arxiv.org/abs/2411.01723", "description": "arXiv:2411.01723v1 Announce Type: new \nAbstract: We extend prior work comparing linear multilevel models (MLM) and fixed effect (FE) models to the generalized linear model (GLM) setting, where the coefficient on a treatment variable is of primary interest. This leads to three key insights. (i) First, as in the linear setting, MLM can be thought of as a regularized form of FE. This explains why MLM can show large biases in its treatment coefficient estimates when group-level confounding is present. 
However, unlike the linear setting, there is not an exact equivalence between MLM and regularized FE coefficient estimates in GLMs. (ii) Second, we study a generalization of \"bias-corrected MLM\" (bcMLM) to the GLM setting. Neither FE nor bcMLM entirely solves MLM's bias problem in GLMs, but bcMLM tends to show less bias than does FE. (iii) Third, and finally, just like in the linear setting, MLM's default standard errors can misspecify the true intragroup dependence structure in the GLM setting, which can lead to downwardly biased standard errors. A cluster bootstrap is a more agnostic alternative. Ultimately, for non-linear GLMs, we recommend bcMLM for estimating the treatment coefficient, and a cluster bootstrap for standard errors and confidence intervals. If a bootstrap is not computationally feasible, then we recommend FE with cluster-robust standard errors."}, "https://arxiv.org/abs/2411.01732": {"title": "Alignment and matching tests for high-dimensional tensor signals via tensor contraction", "link": "https://arxiv.org/abs/2411.01732", "description": "arXiv:2411.01732v1 Announce Type: new \nAbstract: We consider two hypothesis testing problems for low-rank and high-dimensional tensor signals, namely the tensor signal alignment and tensor signal matching problems. These problems are challenging due to the high dimension of tensors and lack of meaningful test statistics. By exploiting a recent tensor contraction method, we propose and validate relevant test statistics using eigenvalues of a data matrix resulting from the tensor contraction. The matrix has a long range dependence among its entries, which makes the analysis of the matrix challenging, involved and distinct from standard random matrix theory. Our approach provides a novel framework for addressing hypothesis testing problems in the context of high-dimensional tensor signals."}, "https://arxiv.org/abs/2411.01799": {"title": "Estimating Nonseparable Selection Models: A Functional Contraction Approach", "link": "https://arxiv.org/abs/2411.01799", "description": "arXiv:2411.01799v1 Announce Type: new \nAbstract: We propose a new method for estimating nonseparable selection models. We show that, given the selection rule and the observed selected outcome distribution, the potential outcome distribution can be characterized as the fixed point of an operator, and we prove that this operator is a functional contraction. We propose a two-step semiparametric maximum likelihood estimator to estimate the selection model and the potential outcome distribution. The consistency and asymptotic normality of the estimator are established. Our approach performs well in Monte Carlo simulations and is applicable in a variety of empirical settings where only a selected sample of outcomes is observed. Examples include consumer demand models with only transaction prices, auctions with incomplete bid data, and Roy models with data on accepted wages."}, "https://arxiv.org/abs/2411.01820": {"title": "Dynamic Supervised Principal Component Analysis for Classification", "link": "https://arxiv.org/abs/2411.01820", "description": "arXiv:2411.01820v1 Announce Type: new \nAbstract: This paper introduces a novel framework for dynamic classification in high dimensional spaces, addressing the evolving nature of class distributions over time or other index variables. Traditional discriminant analysis techniques are adapted to learn dynamic decision rules with respect to the index variable. 
In particular, we propose and study a new supervised dimension reduction method employing kernel smoothing to identify the optimal subspace, and provide a comprehensive examination of this approach for both linear discriminant analysis and quadratic discriminant analysis. We illustrate the effectiveness of the proposed methods through numerical simulations and real data examples. The results show considerable improvements in classification accuracy and computational efficiency. This work contributes to the field by offering a robust and adaptive solution to the challenges of scalability and non-staticity in high-dimensional data classification."}, "https://arxiv.org/abs/2411.01864": {"title": "On the Asymptotic Properties of Debiased Machine Learning Estimators", "link": "https://arxiv.org/abs/2411.01864", "description": "arXiv:2411.01864v1 Announce Type: new \nAbstract: This paper studies the properties of debiased machine learning (DML) estimators under a novel asymptotic framework, offering insights for improving the performance of these estimators in applications. DML is an estimation method suited to economic models where the parameter of interest depends on unknown nuisance functions that must be estimated. It requires weaker conditions than previous methods while still ensuring standard asymptotic properties. Existing theoretical results do not distinguish between two alternative versions of DML estimators, DML1 and DML2. Under a new asymptotic framework, this paper demonstrates that DML2 asymptotically dominates DML1 in terms of bias and mean squared error, formalizing a previous conjecture based on simulation results regarding their relative performance. Additionally, this paper provides guidance for improving the performance of DML2 in applications."}, "https://arxiv.org/abs/2411.02123": {"title": "Uncertainty quantification and multi-stage variable selection for personalized treatment regimes", "link": "https://arxiv.org/abs/2411.02123", "description": "arXiv:2411.02123v1 Announce Type: new \nAbstract: A dynamic treatment regime is a sequence of medical decisions that adapts to the evolving clinical status of a patient over time. To facilitate personalized care, it is crucial to assess the probability of each available treatment option being optimal for a specific patient, while also identifying the key prognostic factors that determine the optimal sequence of treatments. This task has become increasingly challenging due to the growing number of individual prognostic factors typically available. In response to these challenges, we propose a Bayesian model for optimizing dynamic treatment regimes that addresses the uncertainty in identifying optimal decision sequences and incorporates dimensionality reduction to manage high-dimensional individual covariates. The first task is achieved through a suitable augmentation of the model to handle counterfactual variables. For the second, we introduce a novel class of spike-and-slab priors for the multi-stage selection of significant factors, to favor the sharing of information across stages. 
The effectiveness of the proposed approach is demonstrated through extensive simulation studies and illustrated using clinical trial data on severe acute arterial hypertension."}, "https://arxiv.org/abs/2411.02231": {"title": "Sharp Bounds for Continuous-Valued Treatment Effects with Unobserved Confounders", "link": "https://arxiv.org/abs/2411.02231", "description": "arXiv:2411.02231v1 Announce Type: new \nAbstract: In causal inference, treatment effects are typically estimated under the ignorability, or unconfoundedness, assumption, which is often unrealistic in observational data. By relaxing this assumption and conducting a sensitivity analysis, we introduce novel bounds and derive confidence intervals for the Average Potential Outcome (APO) - a standard metric for evaluating continuous-valued treatment or exposure effects. We demonstrate that these bounds are sharp under a continuous sensitivity model, in the sense that they give the smallest possible interval under this model, and propose a doubly robust version of our estimators. In a comparative analysis with the method of Jesson et al. (2022) (arXiv:2204.10022), using both simulated and real datasets, we show that our approach not only yields sharper bounds but also achieves good coverage of the true APO, with significantly reduced computation times."}, "https://arxiv.org/abs/2411.02239": {"title": "Powerful batch conformal prediction for classification", "link": "https://arxiv.org/abs/2411.02239", "description": "arXiv:2411.02239v1 Announce Type: new \nAbstract: In a supervised classification split conformal/inductive framework with $K$ classes, a calibration sample of $n$ labeled examples is observed for inference on the label of a new unlabeled example. In this work, we explore the case where a \"batch\" of $m$ independent such unlabeled examples is given, and a multivariate prediction set with $1-\\alpha$ coverage should be provided for this batch. Hence, the batch prediction set takes the form of a collection of label vectors of size $m$, while the calibration sample only contains univariate labels. Using the Bonferroni correction consists in concatenating the individual prediction sets at level $1-\\alpha/m$ (Vovk 2013). We propose a uniformly more powerful solution, based on specific combinations of conformal $p$-values that exploit the Simes inequality (Simes 1986). Intuitively, the pooled evidence of fairly \"easy\" examples of the batch can help provide narrower batch prediction sets. We also introduced adaptive versions of the novel procedure that are particularly effective when the batch prediction set is expected to be large. The theoretical guarantees are provided when all examples are iid, as well as more generally when iid is assumed only conditionally within each class. In particular, our results are also valid under a label distribution shift since the distribution of the labels need not be the same in the calibration sample and in the new `batch'. The usefulness of the method is illustrated on synthetic and real data examples."}, "https://arxiv.org/abs/2411.02276": {"title": "A Bayesian Model for Co-clustering Ordinal Data with Informative Missing Entries", "link": "https://arxiv.org/abs/2411.02276", "description": "arXiv:2411.02276v1 Announce Type: new \nAbstract: Several approaches have been proposed in the literature for clustering multivariate ordinal data. These methods typically treat missing values as absent information, rather than recognizing them as valuable for profiling population characteristics. 
To address this gap, we introduce a Bayesian nonparametric model for co-clustering multivariate ordinal data that treats censored observations as informative, rather than merely missing. We demonstrate that this offers a significant improvement in understanding the underlying structure of the data. Our model exploits the flexibility of two independent Dirichlet processes, allowing us to infer potentially distinct subpopulations that characterize the latent structure of both subjects and variables. The ordinal nature of the data is addressed by introducing latent variables, while a matrix factorization specification is adopted to handle the high dimensionality of the data in a parsimonious way. The conjugate structure of the model enables an explicit derivation of the full conditional distributions of all the random variables in the model, which facilitates seamless posterior inference using a Gibbs sampling algorithm. We demonstrate the method's performance through simulations and by analyzing politician and movie ratings data."}, "https://arxiv.org/abs/2411.02396": {"title": "Fusion of Tree-induced Regressions for Clinico-genomic Data", "link": "https://arxiv.org/abs/2411.02396", "description": "arXiv:2411.02396v1 Announce Type: new \nAbstract: Cancer prognosis is often based on a set of omics covariates and a set of established clinical covariates such as age and tumor stage. Combining these two sets poses challenges. First, dimension difference: clinical covariates should be favored because they are low-dimensional and usually have stronger prognostic ability than high-dimensional omics covariates. Second, interactions: genetic profiles and their prognostic effects may vary across patient subpopulations. Last, redundancy: a (set of) gene(s) may encode similar prognostic information as a clinical covariate. To address these challenges, we combine regression trees, employing clinical covariates only, with a fusion-like penalized regression framework in the leaf nodes for the omics covariates. The fusion penalty controls the variability in genetic profiles across subpopulations. We prove that the shrinkage limit of the proposed method equals a benchmark model: a ridge regression with penalized omics covariates and unpenalized clinical covariates. Furthermore, the proposed method allows researchers to evaluate, for different subpopulations, whether the overall omics effect enhances prognosis compared to only employing clinical covariates. In an application to colorectal cancer prognosis based on established clinical covariates and 20,000+ gene expressions, we illustrate the features of our method."}, "https://arxiv.org/abs/2411.00794": {"title": "HOUND: High-Order Universal Numerical Differentiator for a Parameter-free Polynomial Online Approximation", "link": "https://arxiv.org/abs/2411.00794", "description": "arXiv:2411.00794v1 Announce Type: cross \nAbstract: This paper introduces a scalar numerical differentiator, represented as a system of nonlinear differential equations of any high order. We derive the explicit solution for this system and demonstrate that, with a suitable choice of differentiator order, the error converges to zero for polynomial signals with additive white noise. In more general cases, the error remains bounded, provided that the highest estimated derivative is also bounded. A notable advantage of this numerical differentiation method is that it does not require tuning parameters based on the specific characteristics of the signal being differentiated. 
We propose a discretization method for the equations that implements a cumulative smoothing algorithm for time series. This algorithm operates online, without the need for data accumulation, and it solves both interpolation and extrapolation problems without fitting any coefficients to the data."}, "https://arxiv.org/abs/2411.00839": {"title": "CausAdv: A Causal-based Framework for Detecting Adversarial Examples", "link": "https://arxiv.org/abs/2411.00839", "description": "arXiv:2411.00839v1 Announce Type: cross \nAbstract: Deep learning has led to tremendous success in many real-world applications of computer vision, thanks to sophisticated architectures such as convolutional neural networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations in inputs. These inputs appear almost indistinguishable from natural images, yet they are incorrectly classified by CNN architectures. This vulnerability of adversarial examples has led researchers to focus on enhancing the robustness of deep learning models in general, and CNNs in particular, by creating defense and detection methods to distinguish adversarial inputs from natural ones. In this paper, we address the adversarial robustness of CNNs through causal reasoning.\n We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. Then we perform statistical analysis on the filters' CI of every sample, whether clean or adversarial, to demonstrate how adversarial examples indeed exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarial detection without the need to train a separate detector. In addition, we illustrate the efficiency of causal explanations as a helpful detection technique through visualizing the causal features. The results can be reproduced using the code available in the repository: https://github.com/HichemDebbi/CausAdv."}, "https://arxiv.org/abs/2411.00945": {"title": "Higher-Order Causal Message Passing for Experimentation with Complex Interference", "link": "https://arxiv.org/abs/2411.00945", "description": "arXiv:2411.00945v1 Announce Type: cross \nAbstract: Accurate estimation of treatment effects is essential for decision-making across various scientific fields. This task, however, becomes challenging in areas like social sciences and online marketplaces, where treating one experimental unit can influence outcomes for others through direct or indirect interactions. Such interference can lead to biased treatment effect estimates, particularly when the structure of these interactions is unknown. We address this challenge by introducing a new class of estimators based on causal message-passing, specifically designed for settings with pervasive, unknown interference. Our estimator draws on information from the sample mean and variance of unit outcomes and treatments over time, enabling efficient use of observed data to estimate the evolution of the system state. Concretely, we construct non-linear features from the moments of unit outcomes and treatments and then learn a function that maps these features to future mean and variance of unit outcomes. This allows for the estimation of the treatment effect over time. 
Extensive simulations across multiple domains, using synthetic and real network data, demonstrate the efficacy of our approach in estimating total treatment effect dynamics, even in cases where interference exhibits non-monotonic behavior in the probability of treatment."}, "https://arxiv.org/abs/2411.01100": {"title": "Transfer Learning Between U", "link": "https://arxiv.org/abs/2411.01100", "description": "arXiv:2411.01100v1 Announce Type: cross \nAbstract: For the 2024 U.S. presidential election, would negative, digital ads against Donald Trump impact voter turnout in Pennsylvania (PA), a key \"tipping point\" state? The gold standard to address this question, a randomized experiment where voters get randomized to different ads, yields unbiased estimates of the ad effect, but is very expensive. Instead, we propose a less-than-ideal, but significantly cheaper and likely faster framework based on transfer learning, where we transfer knowledge from a past ad experiment in 2020 to evaluate ads for 2024. A key component of our framework is a sensitivity analysis that quantifies the unobservable differences between past and future elections, which can be calibrated in a data-driven manner. We propose two estimators of the 2024 ad effect: a simple regression estimator with bootstrap, which we recommend for practitioners in this field, and an estimator based on the efficient influence function for broader applications. Using our framework, we estimate the effect of running a negative, digital ad campaign against Trump on voter turnout in PA for the 2024 election. Our findings indicate effect heterogeneity across counties of PA and among important subgroups stratified by gender, urbanicity, and education attainment."}, "https://arxiv.org/abs/2411.01292": {"title": "Causal reasoning in difference graphs", "link": "https://arxiv.org/abs/2411.01292", "description": "arXiv:2411.01292v1 Announce Type: cross \nAbstract: In epidemiology, understanding causal mechanisms across different populations is essential for designing effective public health interventions. Recently, difference graphs have been introduced as a tool to visually represent causal variations between two distinct populations. While there has been progress in inferring these graphs from data through causal discovery methods, there remains a gap in systematically leveraging their potential to enhance causal reasoning. This paper addresses that gap by establishing conditions for identifying causal changes and effects using difference graphs and observational data. It specifically focuses on identifying total causal changes and total effects in a nonparametric framework, as well as direct causal changes and direct effects in a linear context. In doing so, it provides a novel approach to causal reasoning that holds potential for various public health applications."}, "https://arxiv.org/abs/2411.01295": {"title": "Marginal Causal Flows for Validation and Inference", "link": "https://arxiv.org/abs/2411.01295", "description": "arXiv:2411.01295v1 Announce Type: cross \nAbstract: Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. 
In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets."}, "https://arxiv.org/abs/2411.01319": {"title": "Efficient Nested Estimation of CoVaR: A Decoupled Approach", "link": "https://arxiv.org/abs/2411.01319", "description": "arXiv:2411.01319v1 Announce Type: cross \nAbstract: This paper addresses the estimation of the systemic risk measure known as CoVaR, which quantifies the risk of a financial portfolio conditional on another portfolio being at risk. We identify two principal challenges: conditioning on a zero-probability event and the repricing of portfolios. To tackle these issues, we propose a decoupled approach utilizing smoothing techniques and develop a model-independent theoretical framework grounded in a functional perspective. We demonstrate that the rate of convergence of the decoupled estimator can achieve approximately $O_{\\rm P}(\\Gamma^{-1/2})$, where $\\Gamma$ represents the computational budget. Additionally, we establish the smoothness of the portfolio loss functions, highlighting its crucial role in enhancing sample efficiency. Our numerical results confirm the effectiveness of the decoupled estimators and provide practical insights for the selection of appropriate smoothing techniques."}, "https://arxiv.org/abs/2411.01394": {"title": "Centrality in Collaboration: A Novel Algorithm for Social Partitioning Gradients in Community Detection for Multiple Oncology Clinical Trial Enrollments", "link": "https://arxiv.org/abs/2411.01394", "description": "arXiv:2411.01394v1 Announce Type: cross \nAbstract: Patients at a comprehensive cancer center who do not achieve cure or remission following standard treatments often become candidates for clinical trials. Patients who participate in a clinical trial may be suitable for other studies. A key factor influencing patient enrollment in subsequent clinical trials is the structured collaboration between oncologists and most responsible physicians. Possible identification of these collaboration networks can be achieved through the analysis of patient movements between clinical trial intervention types with social network analysis and community detection algorithms. In the detection of oncologist working groups, the present study evaluates three community detection algorithms: Girvan-Newman, Louvain and an algorithm developed by the author. Girvan-Newman identifies each intervention as their own community, while Louvain groups interventions in a manner that is difficult to interpret. In contrast, the author's algorithm groups interventions in a way that is both intuitive and informative, with a gradient effect that is particularly useful for epidemiological research. 
This lays the groundwork for future subgroup analysis of clustered interventions."}, "https://arxiv.org/abs/2411.01694": {"title": "Modeling Home Range and Intra-Specific Spatial Interaction in Wild Animal Populations", "link": "https://arxiv.org/abs/2411.01694", "description": "arXiv:2411.01694v1 Announce Type: cross \nAbstract: Interactions among individuals of the same species of wild animals are an important component of population dynamics. An interaction can be either static (based on overlap of space use) or dynamic (based on movement). The goal of this work is to determine the level of static interactions between individuals of the same species of wild animals using 95\\% and 50\\% home ranges, as well as to model their movement interactions, which could include attraction, avoidance (or repulsion), or lack of interaction, in order to gain new insights and improve our understanding of ecological processes. Home range estimation methods (minimum convex polygon, kernel density estimator, and autocorrelated kernel density estimator), inhomogeneous multitype (or cross-type) summary statistics, and envelope testing methods (pointwise and global envelope tests) were proposed to study the nature of the same-species wild-animal spatial interactions. Using GPS collar data, we applied the methods to quantify both static and dynamic interactions between black bears in southern Alabama, USA. In general, our findings suggest that the black bears in our dataset showed no significant preference to live together or apart, i.e., there was no significant deviation from independence toward association or avoidance (i.e., segregation) between the bears."}, "https://arxiv.org/abs/2411.01772": {"title": "Joint optimization for production operations considering reworking", "link": "https://arxiv.org/abs/2411.01772", "description": "arXiv:2411.01772v1 Announce Type: cross \nAbstract: In pursuit of enhancing the comprehensive efficiency of production systems, our study focused on the joint optimization problem of scheduling and machine maintenance in scenarios where product rework occurs. The primary challenge lies in the interdependence between product quality (Q), machine reliability (R), and production scheduling (P), compounded by the uncertainties from machine degradation and product quality, which are prevalent in sophisticated manufacturing systems. To address this issue, we investigated the dynamic relationship among these three aspects, named the QRP co-effect. On this basis, we constructed an optimization model that integrates production scheduling, machine maintenance, and product rework decisions, encompassing the context of stochastic degradation and product quality uncertainties within a mixed-integer programming problem. To effectively solve this problem, we proposed a dual-module solving framework that integrates planning and evaluation for solution improvement via dynamic communication. By analyzing the structural properties of this joint optimization problem, we devised an efficient solving algorithm with an interactive mechanism that leverages in-situ condition information regarding the production system's state and computational resources. The proposed methodology has been validated through comparative and ablation experiments. 
The experimental results demonstrated the significant enhancement of production system efficiency, along with a reduction in machine maintenance costs in scenarios involving rework."}, "https://arxiv.org/abs/2411.01881": {"title": "Causal Discovery and Classification Using Lempel-Ziv Complexity", "link": "https://arxiv.org/abs/2411.01881", "description": "arXiv:2411.01881v1 Announce Type: cross \nAbstract: Inferring causal relationships in the decision-making processes of machine learning algorithms is a crucial step toward achieving explainable Artificial Intelligence (AI). In this research, we introduce a novel causality measure and a distance metric derived from Lempel-Ziv (LZ) complexity. We explore how the proposed causality measure can be used in decision trees by enabling splits based on features that most strongly cause the outcome. We further evaluate the effectiveness of the causality-based decision tree and the distance-based decision tree in comparison to a traditional decision tree using Gini impurity. While the proposed methods demonstrate comparable classification performance overall, the causality-based decision tree significantly outperforms both the distance-based decision tree and the Gini-based decision tree on datasets generated from causal models. This result indicates that the proposed approach can capture insights beyond those of classical decision trees, especially in causally structured data. Based on the features used in the LZ causal measure-based decision tree, we introduce a causal strength for each feature in the dataset so as to infer the predominant causal variables for the occurrence of the outcome."}, "https://arxiv.org/abs/2411.02221": {"title": "Targeted Learning for Variable Importance", "link": "https://arxiv.org/abs/2411.02221", "description": "arXiv:2411.02221v1 Announce Type: cross \nAbstract: Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results."}, "https://arxiv.org/abs/2411.02374": {"title": "Identifying Economic Factors Affecting Unemployment Rates in the United States", "link": "https://arxiv.org/abs/2411.02374", "description": "arXiv:2411.02374v1 Announce Type: cross \nAbstract: In this study, we seek to understand how macroeconomic factors such as GDP, inflation, Unemployment Insurance, and the S&P 500 index, as well as microeconomic factors such as health, race, and educational attainment, impacted the unemployment rate for about 20 years in the United States. 
Our research question is to identify which factor(s) contributed the most to the unemployment rate surge using linear regression. Results from our studies showed that GDP (negative), inflation (positive), Unemployment Insurance (contrary to popular opinion; negative), and S&P 500 index (negative) were all significant factors, with inflation being the most important one. As for health issue factors, our model produced correlation scores for occurrences of Cardiovascular Disease, Neurological Disease, and Interpersonal Violence with unemployment. Race as a factor showed huge discrepancies in the unemployment rate between Black Americans and their counterparts. Asians had the lowest unemployment rate throughout the years. As for educational attainment, results showed that having a higher educational attainment significantly reduced one's chance of unemployment. People with higher degrees had the lowest unemployment rate. Results of this study will be beneficial for policymakers and researchers in understanding the unemployment rate during the pandemic."}, "https://arxiv.org/abs/2411.02380": {"title": "Robust Bayesian regression in astronomy", "link": "https://arxiv.org/abs/2411.02380", "description": "arXiv:2411.02380v1 Announce Type: cross \nAbstract: Model mis-specification (e.g. the presence of outliers) is commonly encountered in astronomical analyses, often requiring the use of ad hoc algorithms (e.g. sigma-clipping). We develop and implement a generic Bayesian approach to linear regression, based on Student's t-distributions, that is robust to outliers and mis-specification of the noise model. Our method is validated using simulated datasets with various degrees of model mis-specification; the derived constraints are shown to be systematically less biased than those from a similar model using normal distributions. We demonstrate that, for a dataset without outliers, a worst-case inference using t-distributions would give unbiased results with $\\lesssim\\!10$ per cent increase in the reported parameter uncertainties. We also compare with existing analyses of real-world datasets, finding qualitatively different results where normal distributions have been used and agreement where more robust methods have been applied. A Python implementation of this model, t-cup, is made available for others to use."}, "https://arxiv.org/abs/2204.10291": {"title": "Structural Nested Mean Models Under Parallel Trends Assumptions", "link": "https://arxiv.org/abs/2204.10291", "description": "arXiv:2204.10291v5 Announce Type: replace \nAbstract: We link and extend two approaches to estimating time-varying treatment effects on repeated continuous outcomes--time-varying Difference in Differences (DiD; see Roth et al. (2023) and Chaisemartin et al. (2023) for reviews) and Structural Nested Mean Models (SNMMs; see Vansteelandt and Joffe (2014) for a review). In particular, we show that SNMMs, which were previously only known to be nonparametrically identified under a no unobserved confounding assumption, are also identified under a generalized version of the parallel trends assumption typically used to justify time-varying DiD methods. Because SNMMs model a broader set of causal estimands, our results allow practitioners of existing time-varying DiD approaches to address additional types of substantive questions under similar assumptions. 
SNMMs enable estimation of time-varying effect heterogeneity, lasting effects of a `blip' of treatment at a single time point, effects of sustained interventions (possibly on continuous or multi-dimensional treatments) when treatment repeatedly changes value in the data, controlled direct effects, effects of dynamic treatment strategies that depend on covariate history, and more. Our results also allow analysts who apply SNMMs under the no unobserved confounding assumption to estimate some of the same causal effects under alternative identifying assumptions. We provide a method for sensitivity analysis to violations of our parallel trends assumption. We further explain how to estimate optimal treatment regimes via optimal regime SNMMs under parallel trends assumptions plus an assumption that there is no effect modification by unobserved confounders. Finally, we illustrate our methods with real data applications estimating effects of Medicaid expansion on uninsurance rates, effects of floods on flood insurance take-up, and effects of sustained changes in temperature on crop yields."}, "https://arxiv.org/abs/2212.05515": {"title": "Reliability Study of Battery Lives: A Functional Degradation Analysis Approach", "link": "https://arxiv.org/abs/2212.05515", "description": "arXiv:2212.05515v2 Announce Type: replace \nAbstract: Renewable energy is critical for combating climate change, whose first step is the storage of electricity generated from renewable energy sources. Li-ion batteries are a popular kind of storage units. Their continuous usage through charge-discharge cycles eventually leads to degradation. This can be visualized in plotting voltage discharge curves (VDCs) over discharge cycles. Studies of battery degradation have mostly concentrated on modeling degradation through one scalar measurement summarizing each VDC. Such simplification of curves can lead to inaccurate predictive models. Here we analyze the degradation of rechargeable Li-ion batteries from a NASA data set through modeling and predicting their full VDCs. With techniques from longitudinal and functional data analysis, we propose a new two-step predictive modeling procedure for functional responses residing on heterogeneous domains. We first predict the shapes and domain end points of VDCs using functional regression models. Then we integrate these predictions to perform a degradation analysis. Our approach is fully functional, allows the incorporation of usage information, produces predictions in a curve form, and thus provides flexibility in the assessment of battery degradation. Through extensive simulation studies and cross-validated data analysis, our approach demonstrates better prediction than the existing approach of modeling degradation directly with aggregated data."}, "https://arxiv.org/abs/2309.13441": {"title": "Anytime valid and asymptotically optimal inference driven by predictive recursion", "link": "https://arxiv.org/abs/2309.13441", "description": "arXiv:2309.13441v4 Announce Type: replace \nAbstract: Distinguishing two candidate models is a fundamental and practically important statistical problem. Error rate control is crucial to the testing logic but, in complex nonparametric settings, can be difficult to achieve, especially when the stopping rule that determines the data collection process is not available. This paper proposes an e-process construction based on the predictive recursion (PR) algorithm originally designed to recursively fit nonparametric mixture models. 
The resulting PRe-process affords anytime valid inference and is asymptotically efficient in the sense that its growth rate is first-order optimal relative to PR's mixture model."}, "https://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "https://arxiv.org/abs/2310.04660", "description": "arXiv:2310.04660v2 Announce Type: replace \nAbstract: Many scientific questions in biomedical, environmental, and psychological research involve understanding the effects of multiple factors on outcomes. While factorial experiments are ideal for this purpose, randomized controlled treatment assignment is generally infeasible in many empirical studies. Therefore, investigators must rely on observational data, where drawing reliable causal inferences for multiple factors remains challenging. As the number of treatment combinations grows exponentially with the number of factors, some treatment combinations can be rare or missing by chance in observed data, further complicating factorial effects estimation. To address these challenges, we propose a novel weighting method tailored to observational studies with multiple factors. Our approach uses weighted observational data to emulate a randomized factorial experiment, enabling simultaneous estimation of the effects of multiple factors and their interactions. Our investigations reveal a crucial nuance: achieving balance among covariates, as in single-factor scenarios, is necessary but insufficient for unbiasedly estimating factorial effects; balancing the factors is also essential in multi-factor settings. Moreover, we extend our weighting method to handle missing treatment combinations in observed data. Finally, we study the asymptotic behavior of the new weighting estimators and propose a consistent variance estimator, providing reliable inferences on factorial effects in observational studies."}, "https://arxiv.org/abs/2310.18836": {"title": "Cluster-Randomized Trials with Cross-Cluster Interference", "link": "https://arxiv.org/abs/2310.18836", "description": "arXiv:2310.18836v3 Announce Type: replace \nAbstract: The literature on cluster-randomized trials typically assumes no interference across clusters. This may be implausible when units are irregularly distributed in space without well-separated communities, in which case clusters may not represent significant geographic, social, or economic divisions. In this paper, we develop methods for reducing bias due to cross-cluster interference. First, we propose an estimation strategy that excludes units not surrounded by clusters assigned to the same treatment arm. We show that this substantially reduces asymptotic bias relative to conventional difference-in-means estimators without substantial cost to variance. Second, we formally establish a bias-variance trade-off in the choice of clusters: constructing fewer, larger clusters reduces bias due to interference but increases variance. We provide a rule for choosing the number of clusters to balance the asymptotic orders of the bias and variance of our estimator. 
Finally, we consider unsupervised learning for cluster construction and provide theoretical guarantees for $k$-medoids."}, "https://arxiv.org/abs/2311.11858": {"title": "Theory coherent shrinkage of Time-Varying Parameters in VARs", "link": "https://arxiv.org/abs/2311.11858", "description": "arXiv:2311.11858v2 Announce Type: replace \nAbstract: This paper introduces a novel theory-coherent shrinkage prior for Time-Varying Parameter VARs (TVP-VARs). The prior centers the time-varying parameters on a path implied a priori by an underlying economic theory, chosen to describe the dynamics of the macroeconomic variables in the system. Leveraging information from conventional economic theory using this prior significantly improves inference precision and forecast accuracy compared to the standard TVP-VAR. In an application, I use this prior to incorporate information from a New Keynesian model that includes both the Zero Lower Bound (ZLB) and forward guidance into a medium-scale TVP-VAR model. This approach leads to more precise estimates of the impulse response functions, revealing a distinct propagation of risk premium shocks inside and outside the ZLB in US data."}, "https://arxiv.org/abs/2401.02694": {"title": "Nonconvex High-Dimensional Time-Varying Coefficient Estimation for Noisy High-Frequency Observations with a Factor Structure", "link": "https://arxiv.org/abs/2401.02694", "description": "arXiv:2401.02694v2 Announce Type: replace \nAbstract: In this paper, we propose a novel high-dimensional time-varying coefficient estimator for noisy high-frequency observations with a factor structure. In high-frequency finance, we often observe that noises dominate the signal of underlying true processes and that covariates exhibit a factor structure due to their strong dependence. Thus, we cannot apply usual regression procedures to analyze high-frequency observations. To handle the noises, we first employ a smoothing method for the observed dependent and covariate processes. Then, to handle the strong dependence of the covariate processes, we apply Principal Component Analysis (PCA) and transform the highly correlated covariate structure into a weakly correlated structure. However, the variables from PCA still contain non-negligible noises. To manage these non-negligible noises and the high dimensionality, we propose a nonconvex penalized regression method for each local coefficient. This method produces consistent but biased local coefficient estimators. To estimate the integrated coefficients, we propose a debiasing scheme and obtain a debiased integrated coefficient estimator using debiased local coefficient estimators. Then, to further account for the sparsity structure of the coefficients, we apply a thresholding scheme to the debiased integrated coefficient estimator. We call this scheme the Factor Adjusted Thresholded dEbiased Nonconvex LASSO (FATEN-LASSO) estimator. Furthermore, this paper establishes the concentration properties of the FATEN-LASSO estimator and discusses a nonconvex optimization algorithm."}, "https://arxiv.org/abs/1706.09072": {"title": "Decomposing Network Influence: Social Influence Regression", "link": "https://arxiv.org/abs/1706.09072", "description": "arXiv:1706.09072v2 Announce Type: replace-cross \nAbstract: Understanding network influence and its determinants are key challenges in political science and network analysis. 
Traditional latent variable models position actors within a social space based on network dependencies but often do not elucidate the underlying factors driving these interactions. To overcome this limitation, we propose the Social Influence Regression (SIR) model, an extension of vector autoregression tailored for relational data that incorporates exogenous covariates into the estimation of influence patterns. The SIR model captures influence dynamics via a pair of $n \\times n$ matrices that quantify how the actions of one actor affect the future actions of another. This framework not only provides a statistical mechanism for explaining actor influence based on observable traits but also improves computational efficiency through an iterative block coordinate descent method. We showcase the SIR model's capabilities by applying it to monthly conflict events between countries, using data from the Integrated Crisis Early Warning System (ICEWS). Our findings demonstrate the SIR model's ability to elucidate complex influence patterns within networks by linking them to specific covariates. This paper's main contributions are: (1) introducing a model that explains third-order dependencies through exogenous covariates and (2) offering an efficient estimation approach that scales effectively with large, complex networks."}, "https://arxiv.org/abs/1806.02935": {"title": "Causal effects based on distributional distances", "link": "https://arxiv.org/abs/1806.02935", "description": "arXiv:1806.02935v3 Announce Type: replace-cross \nAbstract: Comparing counterfactual distributions can provide more nuanced and valuable measures for causal effects, going beyond typical summary statistics such as averages. In this work, we consider characterizing causal effects via distributional distances, focusing on two kinds of target parameters. The first is the counterfactual outcome density. We propose a doubly robust-style estimator for the counterfactual density and study its rates of convergence and limiting distributions. We analyze asymptotic upper bounds on the $L_q$ and the integrated $L_q$ risks of the proposed estimator, and propose a bootstrap-based confidence band. The second is a novel distributional causal effect defined by the $L_1$ distance between different counterfactual distributions. We study three approaches for estimating the proposed distributional effect: smoothing the counterfactual density, smoothing the $L_1$ distance, and imposing a margin condition. For each approach, we analyze asymptotic properties and error bounds of the proposed estimator, and discuss potential advantages and disadvantages. We go on to present a bootstrap approach for obtaining confidence intervals, and propose a test of no distributional effect. We conclude with a numerical illustration and a real-world example."}, "https://arxiv.org/abs/2411.02531": {"title": "Comment on 'Sparse Bayesian Factor Analysis when the Number of Factors is Unknown' by S", "link": "https://arxiv.org/abs/2411.02531", "description": "arXiv:2411.02531v1 Announce Type: new \nAbstract: The techniques suggested in Fr\\\"uhwirth-Schnatter et al. (2024) concern sparsity and factor selection and have enormous potential beyond standard factor analysis applications. We show how these techniques can be applied to Latent Space (LS) models for network data. These models suffer from well-known identification issues of the latent factors due to likelihood invariance to factor translation, reflection, and rotation (see Hoff et al., 2002). 
A set of observables can be instrumental in identifying the latent factors via auxiliary equations (see Liu et al., 2021). These, in turn, share many analogies with the equations used in factor modeling, and we argue that the factor loading restrictions may be beneficial for achieving identification."}, "https://arxiv.org/abs/2411.02675": {"title": "Does Regression Produce Representative Causal Rankings?", "link": "https://arxiv.org/abs/2411.02675", "description": "arXiv:2411.02675v1 Announce Type: new \nAbstract: We examine the challenges in ranking multiple treatments based on their estimated effects when using linear regression or its popular double-machine-learning variant, the Partially Linear Model (PLM), in the presence of treatment effect heterogeneity. We demonstrate by example that overlap-weighting performed by linear models like PLM can produce Weighted Average Treatment Effects (WATE) that have rankings that are inconsistent with the rankings of the underlying Average Treatment Effects (ATE). We define this as ranking reversals and derive a necessary and sufficient condition for ranking reversals under the PLM. We conclude with several simulation studies illustrating conditions under which ranking reversals occur."}, "https://arxiv.org/abs/2411.02755": {"title": "Restricted Win Probability with Bayesian Estimation for Implementing the Estimand Framework in Clinical Trials With a Time-to-Event Outcome", "link": "https://arxiv.org/abs/2411.02755", "description": "arXiv:2411.02755v1 Announce Type: new \nAbstract: We propose a restricted win probability estimand for comparing treatments in a randomized trial with a time-to-event outcome. We also propose Bayesian estimators for this summary measure as well as the unrestricted win probability. Bayesian estimation is scalable and facilitates seamless handling of censoring mechanisms as compared to related non-parametric pairwise approaches like win ratios. Unlike the log-rank test, these measures effectuate the estimand framework as they reflect a clearly defined population quantity related to the probability of a later event time with the potential restriction that event times exceeding a pre-specified time are deemed equivalent. We compare efficacy with established methods using computer simulation and apply the proposed approach to 304 reconstructed datasets from oncology trials. We show that the proposed approach has more power than the log-rank test in early treatment difference scenarios, and at least as much power as the win ratio in all scenarios considered. We also find that the proposed approach's statistical significance is concordant with the log-rank test for the vast majority of the oncology datasets examined. The proposed approach offers an interpretable, efficient alternative for trials with time-to-event outcomes that aligns with the estimand framework."}, "https://arxiv.org/abs/2411.02763": {"title": "Identifying nonlinear relations among random variables: A network analytic approach", "link": "https://arxiv.org/abs/2411.02763", "description": "arXiv:2411.02763v1 Announce Type: new \nAbstract: Nonlinear relations between variables, such as the curvilinear relationship between childhood trauma and resilience in patients with schizophrenia and the moderation relationship between mentalizing, and internalizing and externalizing symptoms and quality of life in youths, are more prevalent than our current methods have been able to detect. 
Although there has been a rise in network models, network construction for the standard Gaussian graphical model depends solely upon linearity. While nonlinear models are an active field of study in psychological methodology, many of these models require the analyst to specify the functional form of the relation. When performing more exploratory modeling, such as with cross-sectional network psychometrics, specifying the functional form a nonlinear relation might take becomes infeasible given the number of possible relations modeled. Here, we apply a nonparametric approach to identifying nonlinear relations using partial distance correlations. We found that partial distance correlations excel overall at identifying nonlinear relations regardless of functional form when compared with Pearson's and Spearman's partial correlations. Through simulation studies and an empirical example, we show that partial distance correlations can be used to identify possible nonlinear relations in psychometric networks, enabling researchers to then explore the shape of these relations with more confirmatory models."}, "https://arxiv.org/abs/2411.02769": {"title": "Assessment of Misspecification in CDMs Using a Generalized Information Matrix Test", "link": "https://arxiv.org/abs/2411.02769", "description": "arXiv:2411.02769v1 Announce Type: new \nAbstract: If the probability model is correctly specified, then we can estimate the covariance matrix of the asymptotic maximum likelihood estimate distribution using either the first or second derivatives of the likelihood function. Therefore, if the determinants of these two different covariance matrix estimation formulas differ, this indicates model misspecification. This misspecification detection strategy is the basis of the Determinant Information Matrix Test ($GIMT_{Det}$). To investigate the performance of the $GIMT_{Det}$, a Deterministic Input Noisy And gate (DINA) Cognitive Diagnostic Model (CDM) was fit to the Fraction-Subtraction dataset. Next, various misspecified versions of the original DINA CDM were fit to bootstrap data sets generated by sampling from the original fitted DINA CDM. The $GIMT_{Det}$ showed good discrimination performance for larger levels of misspecification. In addition, the $GIMT_{Det}$ did not detect model misspecification when model misspecification was not present and additionally did not detect model misspecification when the level of misspecification was very low. However, the $GIMT_{Det}$ discrimination performance was highly variable across different misspecification strategies when the misspecification level was moderately sized. The proposed new misspecification detection methodology is promising but additional empirical studies are required to further characterize its strengths and limitations."}, "https://arxiv.org/abs/2411.02771": {"title": "Automatic doubly robust inference for linear functionals via calibrated debiased machine learning", "link": "https://arxiv.org/abs/2411.02771", "description": "arXiv:2411.02771v1 Announce Type: new \nAbstract: In causal inference, many estimands of interest can be expressed as a linear functional of the outcome regression function; this includes, for example, average causal effects of static, dynamic and stochastic interventions. 
For learning such estimands, in this work, we propose novel debiased machine learning estimators that are doubly robust asymptotically linear, thus providing not only doubly robust consistency but also facilitating doubly robust inference (e.g., confidence intervals and hypothesis tests). To do so, we first establish a key link between calibration, a machine learning technique typically used in prediction and classification tasks, and the conditions needed to achieve doubly robust asymptotic linearity. We then introduce calibrated debiased machine learning (C-DML), a unified framework for doubly robust inference, and propose a specific C-DML estimator that integrates cross-fitting, isotonic calibration, and debiased machine learning estimation. A C-DML estimator maintains asymptotic linearity when either the outcome regression or the Riesz representer of the linear functional is estimated sufficiently well, allowing the other to be estimated at arbitrarily slow rates or even inconsistently. We propose a simple bootstrap-assisted approach for constructing doubly robust confidence intervals. Our theoretical and empirical results support the use of C-DML to mitigate bias arising from the inconsistent or slow estimation of nuisance functions."}, "https://arxiv.org/abs/2411.02804": {"title": "Beyond the Traditional VIX: A Novel Approach to Identifying Uncertainty Shocks in Financial Markets", "link": "https://arxiv.org/abs/2411.02804", "description": "arXiv:2411.02804v1 Announce Type: new \nAbstract: We introduce a new identification strategy for uncertainty shocks to explain macroeconomic volatility in financial markets. The Chicago Board Options Exchange Volatility Index (VIX) measures market expectations of future volatility, but traditional methods based on second-moment shocks and time-varying volatility of the VIX often fail to capture the non-Gaussian, heavy-tailed nature of asset returns. To address this, we construct a revised VIX by fitting a double-subordinated Normal Inverse Gaussian Levy process to S&P 500 option prices, providing a more comprehensive measure of volatility that reflects the extreme movements and heavy tails observed in financial data. Using an axiomatic approach, we introduce a general family of risk-reward ratios, computed with our revised VIX and fitted over a fractional time series to more accurately identify uncertainty shocks in financial markets."}, "https://arxiv.org/abs/2411.02811": {"title": "Temporal Wasserstein Imputation: Versatile Missing Data Imputation for Time Series", "link": "https://arxiv.org/abs/2411.02811", "description": "arXiv:2411.02811v1 Announce Type: new \nAbstract: Missing data often significantly hamper standard time series analysis, yet in practice they are frequently encountered. In this paper, we introduce temporal Wasserstein imputation, a novel method for imputing missing data in time series. Unlike existing techniques, our approach is fully nonparametric, circumventing the need for model specification prior to imputation, making it suitable for potential nonlinear dynamics. Its principled algorithmic implementation can seamlessly handle univariate or multivariate time series with any missing pattern. In addition, the plausible range and side information of the missing entries (such as box constraints) can easily be incorporated. As a key advantage, our method mitigates the distributional bias typical of many existing approaches, ensuring more reliable downstream statistical analysis using the imputed series. 
Leveraging the benign landscape of the optimization formulation, we establish the convergence of an alternating minimization algorithm to critical points. Furthermore, we provide conditions under which the marginal distributions of the underlying time series can be identified. Our numerical experiments, including extensive simulations covering linear and nonlinear time series models and an application to a real-world groundwater dataset laden with missing data, corroborate the practical usefulness of the proposed method."}, "https://arxiv.org/abs/2411.02924": {"title": "A joint model of correlated ordinal and continuous variables", "link": "https://arxiv.org/abs/2411.02924", "description": "arXiv:2411.02924v1 Announce Type: new \nAbstract: In this paper we build a joint model which can accommodate binary, ordinal and continuous responses, by assuming that the errors of the continuous variables and the errors underlying the ordinal and binary outcomes follow a multivariate normal distribution. We employ composite likelihood methods to estimate the model parameters and use composite likelihood inference for model comparison and uncertainty quantification. The complementary R package mvordnorm implements estimation of this model using composite likelihood methods and is available for download from Github. We present two use-cases in the area of risk management to illustrate our approach."}, "https://arxiv.org/abs/2411.02956": {"title": "On Distributional Discrepancy for Experimental Design with General Assignment Probabilities", "link": "https://arxiv.org/abs/2411.02956", "description": "arXiv:2411.02956v1 Announce Type: new \nAbstract: We investigate experimental design for randomized controlled trials (RCTs) with both equal and unequal treatment-control assignment probabilities. Our work makes progress on the connection between the distributional discrepancy minimization (DDM) problem introduced by Harshaw et al. (2024) and the design of RCTs. We make two main contributions: First, we prove that approximating the optimal solution of the DDM problem within even a certain constant error is NP-hard. Second, we introduce a new Multiplicative Weights Update (MWU) algorithm for the DDM problem, which improves the Gram-Schmidt walk algorithm used by Harshaw et al. (2024) when assignment probabilities are unequal. Building on the framework of Harshaw et al. (2024) and our MWU algorithm, we then develop the MWU design, which reduces the worst-case mean-squared error in estimating the average treatment effect. Finally, we present a comprehensive simulation study comparing our design with commonly used designs."}, "https://arxiv.org/abs/2411.03014": {"title": "Your copula is a classifier in disguise: classification-based copula density estimation", "link": "https://arxiv.org/abs/2411.03014", "description": "arXiv:2411.03014v1 Announce Type: new \nAbstract: We propose reinterpreting copula density estimation as a discriminative task. Under this novel estimation scheme, we train a classifier to distinguish samples from the joint density from those of the product of independent marginals, recovering the copula density in the process. We derive equivalences between well-known copula classes and classification problems naturally arising in our interpretation. Furthermore, we show our estimator achieves theoretical guarantees akin to maximum likelihood estimation. By identifying a connection with density ratio estimation, we benefit from the rich literature and models available for such problems. 
Empirically, we demonstrate the applicability of our approach by estimating copulas of real and high-dimensional datasets, outperforming competing copula estimators in density evaluation as well as sampling."}, "https://arxiv.org/abs/2411.03100": {"title": "Modeling sparsity in count-weighted networks", "link": "https://arxiv.org/abs/2411.03100", "description": "arXiv:2411.03100v1 Announce Type: new \nAbstract: Community detection methods have been extensively studied to recover community structures in network data. While many models and methods focus on binary data, real-world networks also present the strength of connections, which could be considered in the network analysis. We propose a probabilistic model for generating weighted networks that allows us to control network sparsity and incorporates degree corrections for each node. We propose a community detection method based on the Variational Expectation-Maximization (VEM) algorithm. We show that the proposed method works well in practice for simulated networks. We analyze the Brazilian airport network to compare the community structures before and during the COVID-19 pandemic."}, "https://arxiv.org/abs/2411.03208": {"title": "Randomly assigned first differences?", "link": "https://arxiv.org/abs/2411.03208", "description": "arXiv:2411.03208v1 Announce Type: new \nAbstract: I consider treatment-effect estimation with panel data with two periods, using a first difference regression of the outcome evolution $\\Delta Y_g$ on the treatment evolution $\\Delta D_g$. A design-based justification of this estimator assumes that $\\Delta D_g$ is as good as randomly assigned. This note shows that if one posits a causal model in levels between the treatment and the outcome, then $\\Delta D_g$ randomly assigned implies that $\\Delta D_g$ is independent of $D_{g,1}$. This is a very strong, fully testable condition. If $\\Delta D_g$ is not independent of $D_{g,1}$, the first-difference estimator can still identify a convex combination of effects if $\\Delta D_g$ is randomly assigned and treatment effects do not change over time, or if the treatment path $(D_{g,1},D_{g,2})$ is randomly assigned and any of the following conditions holds: i) $D_{g,1}$ and $D_{g,2}$ are binary; ii) $D_{g,1}$ and $D_{g,2}$ have the same variance; iii) $D_{g,1}$ and $D_{g,2}$ are uncorrelated or negatively correlated. I use these results to revisit Acemoglu et al. (2016)."}, "https://arxiv.org/abs/2411.03304": {"title": "Bayesian Controlled FDR Variable Selection via Knockoffs", "link": "https://arxiv.org/abs/2411.03304", "description": "arXiv:2411.03304v1 Announce Type: new \nAbstract: In many research fields, researchers aim to identify significant associations between a set of explanatory variables and a response while controlling the false discovery rate (FDR). To this aim, we develop a fully Bayesian generalization of the classical model-X knockoff filter. The knockoff filter introduces controlled noise in the model in the form of cleverly constructed copies of the predictors as auxiliary variables. In our approach we consider the joint model of the covariates and the response and incorporate the conditional independence structure of the covariates into the prior distribution of the auxiliary knockoff variables. We further incorporate the estimation of a graphical model among the covariates, which in turn aids knockoff generation and improves the estimation of the covariate effects on the response. 
We use a modified spike-and-slab prior on the regression coefficients, which avoids the increase of the model dimension as typical in the classical knockoff filter. Our model performs variable selection using an upper bound on the posterior probability of non-inclusion. We show how our model construction leads to valid model-X knockoffs and demonstrate that the proposed characterization is sufficient for controlling the BFDR at an arbitrary level, in finite samples. We also show that the model selection is robust to the estimation of the precision matrix. We use simulated data to demonstrate that our proposal increases the stability of the selection with respect to classical knockoff methods, as it relies on the entire posterior distribution of the knockoff variables instead of a single sample. With respect to Bayesian variable selection methods, we show that our selection procedure achieves comparable or better performances, while maintaining control over the FDR. Finally, we show the usefulness of the proposed model with an application to real data."}, "https://arxiv.org/abs/2411.02726": {"title": "Elliptical Wishart distributions: information geometry, maximum likelihood estimator, performance analysis and statistical learning", "link": "https://arxiv.org/abs/2411.02726", "description": "arXiv:2411.02726v1 Announce Type: cross \nAbstract: This paper deals with Elliptical Wishart distributions - which generalize the Wishart distribution - in the context of signal processing and machine learning. Two algorithms to compute the maximum likelihood estimator (MLE) are proposed: a fixed point algorithm and a Riemannian optimization method based on the derived information geometry of Elliptical Wishart distributions. The existence and uniqueness of the MLE are characterized as well as the convergence of both estimation algorithms. Statistical properties of the MLE are also investigated such as consistency, asymptotic normality and an intrinsic version of Fisher efficiency. On the statistical learning side, novel classification and clustering methods are designed. For the $t$-Wishart distribution, the performance of the MLE and statistical learning algorithms is evaluated on both simulated and real EEG and hyperspectral data, showcasing the interest of our proposed methods."}, "https://arxiv.org/abs/2411.02909": {"title": "When is it worthwhile to jackknife? Breaking the quadratic barrier for Z-estimators", "link": "https://arxiv.org/abs/2411.02909", "description": "arXiv:2411.02909v1 Announce Type: cross \nAbstract: Resampling methods are especially well-suited to inference with estimators that provide only \"black-box'' access. Jackknife is a form of resampling, widely used for bias correction and variance estimation, that is well-understood under classical scaling where the sample size $n$ grows for a fixed problem. We study its behavior in application to estimating functionals using high-dimensional $Z$-estimators, allowing both the sample size $n$ and problem dimension $d$ to diverge. We begin by showing that the plug-in estimator based on the $Z$-estimate suffers from a quadratic breakdown: while it is $\\sqrt{n}$-consistent and asymptotically normal whenever $n \\gtrsim d^2$, it fails for a broad class of problems whenever $n \\lesssim d^2$. We then show that under suitable regularity conditions, applying a jackknife correction yields an estimate that is $\\sqrt{n}$-consistent and asymptotically normal whenever $n\\gtrsim d^{3/2}$. 
This provides strong motivation for the use of jackknife in high-dimensional problems where the dimension is moderate relative to sample size. We illustrate consequences of our general theory for various specific $Z$-estimators, including non-linear functionals in linear models; generalized linear models; and the inverse propensity score weighting (IPW) estimate for the average treatment effect, among others."}, "https://arxiv.org/abs/2411.03021": {"title": "Testing Generalizability in Causal Inference", "link": "https://arxiv.org/abs/2411.03021", "description": "arXiv:2411.03021v1 Announce Type: cross \nAbstract: Ensuring robust model performance across diverse real-world scenarios requires addressing both transportability across domains with covariate shifts and extrapolation beyond observed data ranges. However, there is no formal procedure for statistically evaluating generalizability in machine learning algorithms, particularly in causal inference. Existing methods often rely on arbitrary metrics like AUC or MSE and focus predominantly on toy datasets, providing limited insights into real-world applicability. To address this gap, we propose a systematic and quantitative framework for evaluating model generalizability under covariate distribution shifts, specifically within causal inference settings. Our approach leverages the frugal parameterization, allowing for flexible simulations from fully and semi-synthetic benchmarks, offering comprehensive evaluations for both mean and distributional regression methods. By basing simulations on real data, our method ensures more realistic evaluations, which is often missing in current work relying on simplified datasets. Furthermore, using simulations and statistical testing, our framework is robust and avoids over-reliance on conventional metrics. Grounded in real-world data, it provides realistic insights into model performance, bridging the gap between synthetic evaluations and practical applications."}, "https://arxiv.org/abs/2411.03026": {"title": "Robust Market Interventions", "link": "https://arxiv.org/abs/2411.03026", "description": "arXiv:2411.03026v1 Announce Type: cross \nAbstract: A large differentiated oligopoly yields inefficient market equilibria. An authority with imprecise information about the primitives of the market aims to design tax/subsidy interventions that increase efficiency robustly, i.e., with high probability. We identify a condition on demand that guarantees the existence of such interventions, and we show how to construct them using noisy estimates of demand complementarities and substitutabilities across products. The analysis works by deriving a novel description of the incidence of market interventions in terms of spectral statistics of a Slutsky matrix. Our notion of recoverable structure ensures that parts of the spectrum that are useful for the design of interventions are statistically recoverable from noisy demand estimates."}, "https://arxiv.org/abs/2109.00408": {"title": "How to Detect Network Dependence in Latent Factor Models? A Bias-Corrected CD Test", "link": "https://arxiv.org/abs/2109.00408", "description": "arXiv:2109.00408v4 Announce Type: replace \nAbstract: In a recent paper Juodis and Reese (2022) (JR) show that the application of the CD test proposed by Pesaran (2004) to residuals from panels with latent factors results in over-rejection. They propose a randomized test statistic to correct for over-rejection, and add a screening component to achieve power. 
This paper considers the same problem but from a different perspective, and shows that the standard CD test remains valid if the latent factors are weak in the sense that their strength is less than half. In the case where latent factors are strong, we propose a bias-corrected version, CD*, which is shown to be asymptotically standard normal under the null of error cross-sectional independence and to have power against network type alternatives. This result is shown to hold for pure latent factor models as well as for panel regression models with latent factors. The case where the errors are serially correlated is also considered. Small sample properties of the CD* test are investigated by Monte Carlo experiments, which show that it has the correct size for strong and weak factors as well as for Gaussian and non-Gaussian errors. In contrast, it is found that JR's test tends to over-reject in the case of panels with non-Gaussian errors, and has low power against spatial network alternatives. In an empirical application, using the CD* test, it is shown that there remains spatial error dependence in a panel data model for real house price changes across 377 Metropolitan Statistical Areas in the U.S., even after the effects of latent factors are filtered out."}, "https://arxiv.org/abs/2208.03632": {"title": "Quantile Random-Coefficient Regression with Interactive Fixed Effects: Heterogeneous Group-Level Policy Evaluation", "link": "https://arxiv.org/abs/2208.03632", "description": "arXiv:2208.03632v3 Announce Type: replace \nAbstract: We propose a quantile random-coefficient regression with interactive fixed effects to study the effects of group-level policies that are heterogeneous across individuals. Our approach is the first to use a latent factor structure to handle the unobservable heterogeneities in the random coefficient. The asymptotic properties and an inferential method for the policy estimators are established. The model is applied to evaluate the effect of the minimum wage policy on earnings between 1967 and 1980 in the United States. Our results suggest that the minimum wage policy has significant and persistent positive effects on black workers and female workers up to the median. Our results also indicate that the policy helps reduce income disparity up to the median between two groups: black, female workers versus white, male workers. However, the policy is shown to have little effect on narrowing the income gap between low- and high-income workers within the subpopulations."}, "https://arxiv.org/abs/2311.05914": {"title": "Efficient Case-Cohort Design using Balanced Sampling", "link": "https://arxiv.org/abs/2311.05914", "description": "arXiv:2311.05914v2 Announce Type: replace \nAbstract: A case-cohort design is a two-phase sampling design frequently used to analyze censored survival data in a cost-effective way, where a subcohort is usually selected using simple random sampling or stratified simple random sampling. In this paper, we propose an efficient sampling procedure based on balanced sampling when selecting a subcohort in a case-cohort design. A sample selected via a balanced sampling procedure automatically calibrates auxiliary variables. When fitting a Cox model, calibrating sampling weights has been shown to lead to more efficient estimators of the regression coefficients (Breslow et al., 2009a, b). The reduced variability over its counterpart with simple random sampling is shown via extensive simulation experiments. 
The proposed design and estimation procedure are also illustrated with the well-known National Wilms Tumor Study dataset."}, "https://arxiv.org/abs/2311.16260": {"title": "Using Multiple Outcomes to Improve the Synthetic Control Method", "link": "https://arxiv.org/abs/2311.16260", "description": "arXiv:2311.16260v2 Announce Type: replace \nAbstract: When there are multiple outcome series of interest, Synthetic Control analyses typically proceed by estimating separate weights for each outcome. In this paper, we instead propose estimating a common set of weights across outcomes, by balancing either a vector of all outcomes or an index or average of them. Under a low-rank factor model, we show that these approaches lead to lower bias bounds than separate weights, and that averaging leads to further gains when the number of outcomes grows. We illustrate this via a re-analysis of the impact of the Flint water crisis on educational outcomes."}, "https://arxiv.org/abs/2208.06115": {"title": "A Nonparametric Approach with Marginals for Modeling Consumer Choice", "link": "https://arxiv.org/abs/2208.06115", "description": "arXiv:2208.06115v5 Announce Type: replace-cross \nAbstract: Given data on the choices made by consumers for different offer sets, a key challenge is to develop parsimonious models that describe and predict consumer choice behavior while being amenable to prescriptive tasks such as pricing and assortment optimization. The marginal distribution model (MDM) is one such model, which requires only the specification of marginal distributions of the random utilities. This paper aims to establish necessary and sufficient conditions for given choice data to be consistent with the MDM hypothesis, inspired by the usefulness of similar characterizations for the random utility model (RUM). This endeavor leads to an exact characterization of the set of choice probabilities that the MDM can represent. Verifying the consistency of choice data with this characterization is equivalent to solving a polynomial-sized linear program. Since the analogous verification task for RUM is computationally intractable and neither of these models subsumes the other, MDM is helpful in striking a balance between tractability and representational power. The characterization is then used with robust optimization for making data-driven sales and revenue predictions for new unseen assortments. When the choice data lacks consistency with the MDM hypothesis, finding the best-fitting MDM choice probabilities reduces to solving a mixed integer convex program. Numerical results using real world data and synthetic data demonstrate that MDM exhibits competitive representational power and prediction performance compared to RUM and parametric models while being significantly faster in computation than RUM."}, "https://arxiv.org/abs/2411.03388": {"title": "Extending Cluster-Weighted Factor Analyzers for multivariate prediction and high-dimensional interpretability", "link": "https://arxiv.org/abs/2411.03388", "description": "arXiv:2411.03388v1 Announce Type: new \nAbstract: Cluster-weighted factor analyzers (CWFA) are a versatile class of mixture models designed to estimate the joint distribution of a random vector that includes a response variable along with a set of explanatory variables. They are particularly valuable in situations involving high dimensionality. This paper enhances CWFA models in two notable ways. First, it enables the prediction of multiple response variables while considering their potential interactions. 
Second, it identifies factors associated with disjoint groups of explanatory variables, thereby improving interpretability. This development leads to the introduction of the multivariate cluster-weighted disjoint factor analyzers (MCWDFA) model. An alternating expectation-conditional maximization algorithm is employed for parameter estimation. The effectiveness of the proposed model is assessed through an extensive simulation study that examines various scenarios. The proposal is applied to crime data from the United States, sourced from the UCI Machine Learning Repository, with the aim of capturing potential latent heterogeneity within communities and identifying groups of socio-economic features that are similarly associated with factors predicting crime rates. Results provide valuable insights into the underlying structures influencing crime rates, which may potentially be helpful for effective cluster-specific policymaking and social interventions."}, "https://arxiv.org/abs/2411.03489": {"title": "A Bayesian nonparametric approach to mediation and spillover effects with multiple mediators in cluster-randomized trials", "link": "https://arxiv.org/abs/2411.03489", "description": "arXiv:2411.03489v1 Announce Type: new \nAbstract: Cluster randomized trials (CRTs) with multiple unstructured mediators present significant methodological challenges for causal inference due to within-cluster correlation, interference among units, and the complexity introduced by multiple mediators. Existing causal mediation methods often fall short in simultaneously addressing these complexities, particularly in disentangling mediator-specific effects under interference that are central to studying complex mechanisms. To address this gap, we propose new causal estimands for spillover mediation effects that differentiate the roles of each individual's own mediator and the spillover effects resulting from interactions among individuals within the same cluster. We establish identification results for each estimand and, to flexibly model the complex data structures inherent in CRTs, we develop a new Bayesian nonparametric prior -- the Nested Dependent Dirichlet Process Mixture -- designed to flexibly capture the outcome and mediator surfaces at different levels. We conduct extensive simulations across various scenarios to evaluate the frequentist performance of our methods, compare them with a Bayesian parametric counterpart, and illustrate our new methods in an analysis of a completed CRT."}, "https://arxiv.org/abs/2411.03504": {"title": "Exact Designs for OofA Experiments Under a Transition-Effect Model", "link": "https://arxiv.org/abs/2411.03504", "description": "arXiv:2411.03504v1 Announce Type: new \nAbstract: In the chemical, pharmaceutical, and food industries, sometimes the order of adding a set of components has an impact on the final product. These are instances of the order-of-addition (OofA) problem, which aims to find the optimal sequence of the components. Extensive research on this topic has been conducted, but almost all designs are found by optimizing the $D-$optimality criterion. However, when prediction of the response is important, there is still a need for $I-$optimal designs. A new model for OofA experiments is presented that uses transition effects to model the effect of order on the response, and the model is extended to cover cases where block-wise constraints are placed on the order of addition. 
Several algorithms are used to find both $D-$ and $I-$efficient designs under this new model for many run sizes and for large numbers of components. Finally, two examples are shown to illustrate the effectiveness of the proposed designs and model at identifying the optimal order of addition, even under block-wise constraints."}, "https://arxiv.org/abs/2411.03530": {"title": "Improving precision of A/B experiments using trigger intensity", "link": "https://arxiv.org/abs/2411.03530", "description": "arXiv:2411.03530v1 Announce Type: new \nAbstract: In industry, an online randomized controlled experiment (a.k.a. A/B experiment) is a standard approach to measure the impact of a causal change. These experiments have small treatment effects to reduce the potential blast radius. As a result, these experiments often lack statistical significance due to a low signal-to-noise ratio. To improve the precision (or reduce standard error), we introduce the idea of trigger observations where the output of the treatment and the control model are different. We show that the evaluation with full information about trigger observations (full knowledge) improves the precision in comparison to a baseline method. However, detecting all such trigger observations is a costly affair, hence we propose a sampling-based evaluation method (partial knowledge) to reduce the cost. The randomness of sampling introduces bias in the estimated outcome. We theoretically analyze this bias and show that the bias is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods using simulation and empirical data. In simulation, evaluation with full knowledge reduces the standard error by as much as 85%. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%."}, "https://arxiv.org/abs/2411.03625": {"title": "Identification and Inference in General Bunching Designs", "link": "https://arxiv.org/abs/2411.03625", "description": "arXiv:2411.03625v1 Announce Type: new \nAbstract: This paper develops a formal econometric framework and tools for the identification and inference of a structural parameter in general bunching designs. We present both point and partial identification results, which generalize previous approaches in the literature. The key assumption for point identification is the analyticity of the counterfactual density, which defines a broader class of distributions than many well-known parametric families. In the partial identification approach, the analyticity condition is relaxed and various shape restrictions can be incorporated, including those found in the literature. Both of our identification results account for observable heterogeneity in the model, which has previously been permitted only in limited ways. We provide a suite of counterfactual estimation and inference methods, termed the generalized polynomial strategy. Our method restores the merits of the original polynomial strategy proposed by Chetty et al. (2011) while addressing several weaknesses in the widespread practice. The efficacy of the proposed method is demonstrated compared to a version of the polynomial estimator in a series of Monte Carlo studies within the augmented isoelastic model. 
We revisit the data used in Saez (2010) and find substantially different results relative to those from the polynomial strategy."}, "https://arxiv.org/abs/2411.03848": {"title": "Monotone Missing Data: A Blessing and a Curse", "link": "https://arxiv.org/abs/2411.03848", "description": "arXiv:2411.03848v1 Announce Type: new \nAbstract: Monotone missingness is commonly encountered in practice where a missing measurement compels another measurement to be missing. In graphical missing data models, monotonicity has implications for the identifiability of the full law, i.e., the joint distribution of actual variables and response indicators. In the general nonmonotone case, the full law is known to be nonparametrically identifiable if and only if neither colluders nor self-censoring edges are present in the graph. We show that monotonicity may enable the identification of the full law despite colluders and prevent the identification under mediated (pathwise) self-censoring. The results emphasize the importance of proper treatment of monotone missingness in the analysis of incomplete data."}, "https://arxiv.org/abs/2411.03992": {"title": "Sparse Bayesian joint modal estimation for exploratory item factor analysis", "link": "https://arxiv.org/abs/2411.03992", "description": "arXiv:2411.03992v1 Announce Type: new \nAbstract: This study presents a scalable Bayesian estimation algorithm for sparse estimation in exploratory item factor analysis based on a classical Bayesian estimation method, namely Bayesian joint modal estimation (BJME). BJME estimates the model parameters and factor scores that maximize the complete-data joint posterior density. Simulation studies show that the proposed algorithm has high computational efficiency and accuracy in variable selection over latent factors and the recovery of the model parameters. Moreover, we conducted a real data analysis using large-scale data from a psychological assessment that targeted the Big Five personality traits. This result indicates that the proposed algorithm achieves computationally efficient parameter estimation and extracts the interpretable factor loading structure."}, "https://arxiv.org/abs/2411.04002": {"title": "Federated mixed effects logistic regression based on one-time shared summary statistics", "link": "https://arxiv.org/abs/2411.04002", "description": "arXiv:2411.04002v1 Announce Type: new \nAbstract: Upholding data privacy, especially in medical research, has become tantamount to facing difficulties in accessing individual-level patient data. Estimating mixed effects binary logistic regression models involving data from multiple data providers like hospitals thus becomes more challenging. Federated learning has emerged as an option to preserve the privacy of individual observations while still estimating a global model that can be interpreted on the individual level, but it usually involves iterative communication between the data providers and the data analyst. In this paper, we present a strategy to estimate a mixed effects binary logistic regression model that requires data providers to share summary statistics only once. It involves generating pseudo-data whose summary statistics match those of the actual data and using these in the model estimation process instead of the actual unavailable data. Our strategy is able to include multiple predictors which can be a combination of continuous and categorical variables. 
Through simulation, we show that our approach estimates the true model at least as well as the one which requires the pooled individual observations. An illustrative example using real data is provided. Unlike typical federated learning algorithms, our approach eliminates infrastructure requirements and security issues while being communication efficient and accounting for heterogeneity."}, "https://arxiv.org/abs/2411.03333": {"title": "Analysis of Bipartite Networks in Anime Series: Textual Analysis, Topic Clustering, and Modeling", "link": "https://arxiv.org/abs/2411.03333", "description": "arXiv:2411.03333v1 Announce Type: cross \nAbstract: This article analyzes a specific bipartite network that shows the relationships between users and anime, examining how the descriptions of anime influence the formation of user communities. In particular, we introduce a new variable that quantifies the frequency with which words from a description appear in specific word clusters. These clusters are generated from a bigram analysis derived from all descriptions in the database. This approach fully characterizes the dynamics of these communities and shows how textual content affects the cohesion and structure of the social network among anime enthusiasts. Our findings suggest that there may be significant implications for the design of recommendation systems and the enhancement of user experience on anime platforms."}, "https://arxiv.org/abs/2411.03623": {"title": "Asymptotic analysis of estimators of ergodic stochastic differential equations", "link": "https://arxiv.org/abs/2411.03623", "description": "arXiv:2411.03623v1 Announce Type: cross \nAbstract: The paper studies asymptotic properties of estimators of multidimensional stochastic differential equations driven by Brownian motions from high-frequency discrete data. Consistency and central limit properties of a class of estimators of the diffusion parameter and an approximate maximum likelihood estimator of the drift parameter based on a discretized likelihood function have been established in a suitable scaling regime involving the time-gap between the observations and the overall time span. Our framework is more general than that typically considered in the literature and, thus, has the potential to be applicable to a wider range of stochastic models."}, "https://arxiv.org/abs/2411.03641": {"title": "Constrained Multi-objective Bayesian Optimization through Optimistic Constraints Estimation", "link": "https://arxiv.org/abs/2411.03641", "description": "arXiv:2411.03641v1 Announce Type: cross \nAbstract: Multi-objective Bayesian optimization has been widely adopted in scientific experiment design, including drug discovery and hyperparameter optimization. In practice, regulatory or safety concerns often impose additional thresholds on certain attributes of the experimental outcomes. Previous work has primarily focused on constrained single-objective optimization tasks or active search under constraints. We propose CMOBO, a sample-efficient constrained multi-objective Bayesian optimization algorithm that balances learning of the feasible region (defined on multiple unknowns) with multi-objective optimization within the feasible region in a principled manner. 
We provide both theoretical justification and empirical evidence, demonstrating the efficacy of our approach on various synthetic benchmarks and real-world applications."}, "https://arxiv.org/abs/2202.06420": {"title": "Statistical Inference for Cell Type Deconvolution", "link": "https://arxiv.org/abs/2202.06420", "description": "arXiv:2202.06420v4 Announce Type: replace \nAbstract: Integrating data from different platforms, such as bulk and single-cell RNA sequencing, is crucial for improving the accuracy and interpretability of complex biological analyses like cell type deconvolution. However, this task is complicated by measurement and biological heterogeneity between target and reference datasets. For the problem of cell type deconvolution, existing methods often neglect the correlation and uncertainty in cell type proportion estimates, possibly leading to an additional concern of false positives in downstream comparisons across multiple individuals. We introduce MEAD, a comprehensive statistical framework that not only estimates cell type proportions but also provides asymptotically valid statistical inference on the estimates. One of our key contributions is the identifiability result, which rigorously establishes the conditions under which cell type proportions are identifiable despite arbitrary heterogeneity of measurement biases between platforms. MEAD also supports the comparison of cell type proportions across individuals after deconvolution, accounting for gene-gene correlations and biological variability. Through simulations and real-data analysis, MEAD demonstrates superior reliability for inferring cell type compositions in complex biological systems."}, "https://arxiv.org/abs/2205.15422": {"title": "Profile Monitoring via Eigenvector Perturbation", "link": "https://arxiv.org/abs/2205.15422", "description": "arXiv:2205.15422v2 Announce Type: replace \nAbstract: In Statistical Process Control, control charts are often used to detect undesirable behavior of sequentially observed quality characteristics. Designing a control chart with desirably low False Alarm Rate (FAR) and detection delay ($ARL_1$) is an important challenge especially when the sampling rate is high and the control chart has an In-Control Average Run Length, called $ARL_0$, of 200 or more, as commonly found in practice. Unfortunately, arbitrary reduction of the FAR typically increases the $ARL_1$. Motivated by eigenvector perturbation theory, we propose the Eigenvector Perturbation Control Chart for computationally fast nonparametric profile monitoring. Our simulation studies show that it outperforms the competition and achieves both $ARL_1 \\approx 1$ and $ARL_0 > 10^6$."}, "https://arxiv.org/abs/2302.09103": {"title": "Multiple change-point detection for some point processes", "link": "https://arxiv.org/abs/2302.09103", "description": "arXiv:2302.09103v3 Announce Type: replace \nAbstract: The aim of change-point detection is to identify behavioral shifts within time series data. This article focuses on scenarios where the data is derived from an inhomogeneous Poisson process or a marked Poisson process. We present a methodology for detecting multiple offline change-points using a minimum contrast estimator. Specifically, we address how to manage the continuous nature of the process given the available discrete observations. Additionally, we select the appropriate number of changes via a cross-validation procedure which is particularly effective given the characteristics of the Poisson process. 
Lastly, we show how to apply this methodology to self-exciting processes with changes in the intensity. Through experiments with both simulated and real datasets, we showcase the advantages of the proposed method, which has been implemented in the R package \\texttt{CptPointProcess}."}, "https://arxiv.org/abs/2311.01147": {"title": "Variational Inference for Sparse Poisson Regression", "link": "https://arxiv.org/abs/2311.01147", "description": "arXiv:2311.01147v4 Announce Type: replace \nAbstract: We utilize the non-conjugate VB method for the sparse Poisson regression model. To provide an approximated conjugacy in the model, the likelihood is approximated by a quadratic function, which provides the conjugacy of the approximation component with the Gaussian prior on the regression coefficient. Three sparsity-enforcing priors are used for this problem. The proposed models are compared with each other and two frequentist sparse Poisson methods (LASSO and SCAD) to evaluate the estimation, prediction, and sparsity performance of the proposed methods. Through a simulated data example, the accuracy of the VB methods is assessed relative to the corresponding benchmark MCMC methods. It can be observed that the proposed VB methods provide a good approximation to the posterior distribution of the parameters, while the VB methods are much faster than the MCMC ones. Using several benchmark count response data sets, the prediction performance of the proposed methods is evaluated in real-world applications."}, "https://arxiv.org/abs/2212.01887": {"title": "The Optimality of Blocking Designs in Equally and Unequally Allocated Randomized Experiments with General Response", "link": "https://arxiv.org/abs/2212.01887", "description": "arXiv:2212.01887v3 Announce Type: replace-cross \nAbstract: We consider the performance of the difference-in-means estimator in a two-arm randomized experiment under common experimental endpoints such as continuous (regression), incidence, proportion and survival. We examine performance under both equal and unequal allocation to treatment groups and we consider both the Neyman randomization model and the population model. We show that in the Neyman model, where the only source of randomness is the treatment manipulation, there is no free lunch: complete randomization is minimax for the estimator's mean squared error. In the population model, where each subject experiences response noise with zero mean, the optimal design is the deterministic perfect-balance allocation. However, this allocation is generally NP-hard to compute and moreover, depends on unknown response parameters. When considering the tail criterion of Kapelner et al. (2021), we show the optimal design is less random than complete randomization and more random than the deterministic perfect-balance allocation. We prove that Fisher's blocking design provides the asymptotically optimal degree of experimental randomness. Theoretical results are supported by simulations in all considered experimental settings."}, "https://arxiv.org/abs/2411.04166": {"title": "Kernel density estimation with polyspherical data and its applications", "link": "https://arxiv.org/abs/2411.04166", "description": "arXiv:2411.04166v1 Announce Type: new \nAbstract: A kernel density estimator for data on the polysphere $\\mathbb{S}^{d_1}\\times\\cdots\\times\\mathbb{S}^{d_r}$, with $r,d_1,\\ldots,d_r\\geq 1$, is presented in this paper. 
We derive the main asymptotic properties of the estimator, including mean square error, normality, and optimal bandwidths. We address the kernel theory of the estimator beyond the von Mises-Fisher kernel, introducing new kernels that are more efficient and investigating normalizing constants, moments, and sampling methods thereof. Plug-in and cross-validated bandwidth selectors are also obtained. As a spin-off of the kernel density estimator, we propose a nonparametric $k$-sample test based on the Jensen-Shannon divergence. Numerical experiments illuminate the asymptotic theory of the kernel density estimator and demonstrate the superior performance of the $k$-sample test with respect to parametric alternatives in certain scenarios. Our smoothing methodology is applied to the analysis of the morphology of a sample of hippocampi of infants embedded on the high-dimensional polysphere $(\\mathbb{S}^2)^{168}$ via skeletal representations ($s$-reps)."}, "https://arxiv.org/abs/2411.04228": {"title": "dsld: A Socially Relevant Tool for Teaching Statistics", "link": "https://arxiv.org/abs/2411.04228", "description": "arXiv:2411.04228v1 Announce Type: new \nAbstract: The growing power of data science can play a crucial role in addressing social discrimination, necessitating nuanced understanding and effective mitigation strategies of potential biases. Data Science Looks At Discrimination (dsld) is an R and Python package designed to provide users with a comprehensive toolkit of statistical and graphical methods for assessing possible discrimination related to protected groups, such as race, gender, and age. Our software offers techniques for discrimination analysis by identifying and mitigating confounding variables, along with methods for reducing bias in predictive models.\n In educational settings, dsld offers instructors powerful tools to teach important statistical principles through motivating real world examples of discrimination analysis. The inclusion of an 80-page Quarto book further supports users, from statistics educators to legal professionals, in effectively applying these analytical tools to real world scenarios."}, "https://arxiv.org/abs/2411.04229": {"title": "Detecting State Changes in Functional Neuronal Connectivity using Factorial Switching Linear Dynamical Systems", "link": "https://arxiv.org/abs/2411.04229", "description": "arXiv:2411.04229v1 Announce Type: new \nAbstract: A key question in brain sciences is how to identify time-evolving functional connectivity, such as that obtained from recordings of neuronal activity over time. We wish to explain the observed phenomena in terms of latent states which, in the case of neuronal activity, might correspond to subnetworks of neurons within a brain or organoid. Many existing approaches assume that only one latent state can be active at a time, in contrast to our domain knowledge. We propose a switching dynamical system based on the factorial hidden Markov model. Unlike existing approaches, our model acknowledges that neuronal activity can be caused by multiple subnetworks, which may be activated either jointly or independently. A change in one part of the network does not mean that the entire connectivity pattern will change. We pair our model with a scalable variational inference algorithm, using a concrete relaxation of the underlying factorial hidden Markov model, to effectively infer the latent states and model parameters. 
We show that our algorithm can recover ground-truth structure and yield insights about the maturation of neuronal activity in microelectrode array recordings from in vitro neuronal cultures."}, "https://arxiv.org/abs/2411.04239": {"title": "An Adversarial Approach to Identification and Inference", "link": "https://arxiv.org/abs/2411.04239", "description": "arXiv:2411.04239v1 Announce Type: new \nAbstract: We introduce a novel framework to characterize identified sets of structural and counterfactual parameters in econometric models. Our framework centers on a discrepancy function, which we construct using insights from convex analysis. The zeros of the discrepancy function determine the identified set, which may be a singleton. The discrepancy function has an adversarial game interpretation: a critic maximizes the discrepancy between data and model features, while a defender minimizes it by adjusting the probability measure of the unobserved heterogeneity. Our approach enables fast computation via linear programming. We use the sample analog of the discrepancy function as a test statistic, and show that it provides asymptotically valid inference for the identified set. Applied to nonlinear panel models with fixed effects, it offers a unified approach for identifying both structural and counterfactual parameters across exogeneity conditions, including strict and sequential, without imposing parametric restrictions on the distribution of error terms or functional form assumptions."}, "https://arxiv.org/abs/2411.04286": {"title": "Bounded Rationality in Central Bank Communication", "link": "https://arxiv.org/abs/2411.04286", "description": "arXiv:2411.04286v1 Announce Type: new \nAbstract: This study explores the influence of FOMC sentiment on market expectations, focusing on cognitive differences between experts and non-experts. Using sentiment analysis of FOMC minutes, we integrate these insights into a bounded rationality model to examine the impact on inflation expectations. Results show that experts form more conservative expectations, anticipating FOMC stabilization actions, while non-experts react more directly to inflation concerns. A lead-lag analysis indicates that institutions adjust faster, though the gap with individual investors narrows in the short term. These findings highlight the need for tailored communication strategies to better align public expectations with policy goals."}, "https://arxiv.org/abs/2411.04310": {"title": "Mediation analysis of community context effects on heart failure using the survival R2D2 prior", "link": "https://arxiv.org/abs/2411.04310", "description": "arXiv:2411.04310v1 Announce Type: new \nAbstract: Congestive heart failure (CHF) is a leading cause of morbidity, mortality and healthcare costs, impacting $>$23 million individuals worldwide. Large electronic health records data provide an opportunity to improve clinical management of diseases, but statistical inference on large amounts of relevant personal data is still a challenge. Thus, accurately identifying influential risk factors is pivotal to reducing the dimensionality of information. Bayesian variable selection in survival regression is a common approach towards solving this problem. In this paper, we propose placing a beta prior directly on the model coefficient of determination (Bayesian $R^2$), which induces a prior on the global variance of the predictors and provides shrinkage. 
Through reparameterization using an auxiliary variable, we are able to update a majority of the parameters with Gibbs sampling, simplifying computation and accelerating convergence. Performance gains over competing variable selection methods are showcased through an extensive simulation study. Finally, the method is applied in a mediation analysis to identify community context attributes impacting time to first congestive heart failure diagnosis of patients enrolled in the University of North Carolina Cardiovascular Device Surveillance Registry. The model has high predictive performance, with a C-index of over 0.7, and we find that factors associated with higher socioeconomic inequality and air pollution increase the risk of heart failure."}, "https://arxiv.org/abs/2411.04312": {"title": "Lee Bounds with a Continuous Treatment in Sample Selection", "link": "https://arxiv.org/abs/2411.04312", "description": "arXiv:2411.04312v1 Announce Type: new \nAbstract: Sample selection problems arise when treatment affects both the outcome and the researcher's ability to observe it. This paper generalizes Lee (2009) bounds for the average treatment effect of a binary treatment to a continuous/multivalued treatment. We evaluate the Job Corps program to study the causal effect of training hours on wages. To identify the average treatment effect of always-takers who are selected regardless of the treatment values, we assume that if a subject is selected at some sufficient treatment values, then they remain selected at all treatment values. For example, if program participants are employed with one month of training, then they remain employed with any training hours. This sufficient treatment values assumption includes the monotone assumption on the treatment effect on selection as a special case. We further allow for the conditional independence assumption and allow subjects with different pretreatment covariates to have different sufficient treatment values. The estimation and inference theory utilizes the orthogonal moment function and cross-fitting for double debiased machine learning."}, "https://arxiv.org/abs/2411.04380": {"title": "Identification of Long-Term Treatment Effects via Temporal Links, Observational, and Experimental Data", "link": "https://arxiv.org/abs/2411.04380", "description": "arXiv:2411.04380v1 Announce Type: new \nAbstract: Recent literature proposes combining short-term experimental and long-term observational data to provide credible alternatives to conventional observational studies for identification of long-term average treatment effects (LTEs). I show that experimental data have an auxiliary role in this context. They bring no identifying power without additional modeling assumptions. When modeling assumptions are imposed, experimental data serve to amplify their identifying power. If the assumptions fail, adding experimental data may only yield results that are farther from the truth. Motivated by this, I introduce two assumptions on treatment response that may be defensible based on economic theory or intuition. To utilize them, I develop a novel two-step identification approach that centers on bounding temporal link functions -- the relationship between short-term and mean long-term potential outcomes. 
The approach provides sharp bounds on LTEs for a general class of assumptions, and allows for imperfect experimental compliance -- extending existing results."}, "https://arxiv.org/abs/2411.04411": {"title": "Parsimoniously Fitting Large Multivariate Random Effects in glmmTMB", "link": "https://arxiv.org/abs/2411.04411", "description": "arXiv:2411.04411v1 Announce Type: new \nAbstract: Multivariate random effects with unstructured variance-covariance matrices of large dimensions, $q$, can be a major challenge to estimate. In this paper, we introduce a new implementation of a reduced-rank approach to fit large dimensional multivariate random effects by writing them as a linear combination of $d < q$ latent variables. By adding reduced-rank functionality to the package glmmTMB, we extend the mixed models available to include random effects of dimensions that were previously not possible. We apply the reduced-rank random effect to two examples, estimating a generalized latent variable model for multivariate abundance data and a random-slopes model."}, "https://arxiv.org/abs/2411.04450": {"title": "Partial Identification of Distributional Treatment Effects in Panel Data using Copula Equality Assumptions", "link": "https://arxiv.org/abs/2411.04450", "description": "arXiv:2411.04450v1 Announce Type: new \nAbstract: This paper aims to partially identify the distributional treatment effects (DTEs) that depend on the unknown joint distribution of treated and untreated potential outcomes. We construct the DTE bounds using panel data and allow individuals to switch between the treated and untreated states more than once over time. Individuals are grouped based on their past treatment history, and DTEs are allowed to be heterogeneous across different groups. We provide two alternative group-wise copula equality assumptions to bound the unknown joint distribution and the DTEs, both of which leverage information from the past observations. Testability of these two assumptions is also discussed, and test results are presented. We apply this method to study the treatment effect heterogeneity of exercising on adults' body weight. These results demonstrate that our method improves the identification power of the DTE bounds compared to existing methods."}, "https://arxiv.org/abs/2411.04520": {"title": "A Structured Estimator for large Covariance Matrices in the Presence of Pairwise and Spatial Covariates", "link": "https://arxiv.org/abs/2411.04520", "description": "arXiv:2411.04520v1 Announce Type: new \nAbstract: We consider the problem of estimating a high-dimensional covariance matrix from a small number of observations when covariates on pairs of variables are available and the variables can have spatial structure. This is motivated by the problem arising in demography of estimating the covariance matrix of the total fertility rate (TFR) of 195 different countries when only 11 observations are available. We construct an estimator for high-dimensional covariance matrices by exploiting information about pairwise covariates, such as whether pairs of variables belong to the same cluster, or spatial structure of the variables, and interactions between the covariates. We reformulate the problem in terms of a mixed effects model. This requires the estimation of only a small number of parameters, which are easy to interpret and which can be selected using standard procedures. The estimator is consistent under general conditions, and asymptotically normal. 
It works if the mean and variance structure of the data is already specified or if some of the data are missing. We assess its performance under our model assumptions, as well as under model misspecification, using simulations. We find that it outperforms several popular alternatives. We apply it to the TFR dataset and draw some conclusions."}, "https://arxiv.org/abs/2411.04640": {"title": "The role of expansion strategies and operational attributes on hotel performance: a compositional approach", "link": "https://arxiv.org/abs/2411.04640", "description": "arXiv:2411.04640v1 Announce Type: new \nAbstract: This study aims to explore the impact of expansion strategies and specific attributes of hotel establishments on the performance of international hotel chains, focusing on four key performance indicators: RevPAR, efficiency, occupancy, and asset turnover. Data were collected from 255 hotels across various international hotel chains, providing a comprehensive assessment of how different expansion strategies and hotel attributes influence performance. The research employs compositional data analysis (CoDA) to address the methodological limitations of traditional financial ratios in statistical analysis. The findings indicate that ownership-based expansion strategies result in higher operational performance, as measured by revenue per available room, but yield lower economic performance due to the high capital investment required. Non-ownership strategies, such as management contracts and franchising, show superior economic efficiency, offering more flexibility and reduced financial risk. This study contributes to the hospitality management literature by applying CoDA, a novel methodological approach in this field, to examine the performance of different hotel expansion strategies with a sound and more appropriate method. The insights provided can guide hotel managers and investors in making informed decisions to optimize both operational and economic performance."}, "https://arxiv.org/abs/2411.04909": {"title": "Doubly robust inference with censoring unbiased transformations", "link": "https://arxiv.org/abs/2411.04909", "description": "arXiv:2411.04909v1 Announce Type: new \nAbstract: This paper extends doubly robust censoring unbiased transformations to a broad class of censored data structures under the assumption of coarsening at random and positivity. This includes the classic survival and competing risks setting, but also encompasses multiple events. A doubly robust representation for the conditional bias of the transformed data is derived. This leads to rate double robustness and oracle efficiency properties for estimating conditional expectations when combined with cross-fitting and linear smoothers. Simulation studies demonstrate favourable performance of the proposed method relative to existing approaches. An application of the methods to a regression discontinuity design with censored data illustrates its practical utility."}, "https://arxiv.org/abs/2411.04236": {"title": "Differentially Private Finite Population Estimation via Survey Weight Regularization", "link": "https://arxiv.org/abs/2411.04236", "description": "arXiv:2411.04236v1 Announce Type: cross \nAbstract: In general, it is challenging to release differentially private versions of survey-weighted statistics with low error for acceptable privacy loss. 
This is because weighted statistics from complex sample survey data can be more sensitive to individual survey response and weight values than unweighted statistics, resulting in differentially private mechanisms that can add substantial noise to the unbiased estimate of the finite population quantity. On the other hand, simply disregarding the survey weights adds noise to a biased estimator, which also can result in an inaccurate estimate. Thus, the problem of releasing an accurate survey-weighted estimate essentially involves a trade-off among bias, precision, and privacy. We leverage this trade-off to develop a differentially private method for estimating finite population quantities. The key step is to privately estimate a hyperparameter that determines how much to regularize or shrink survey weights as a function of privacy loss. We illustrate the differentially private finite population estimation using the Panel Study of Income Dynamics. We show that optimal strategies for releasing DP survey-weighted mean income estimates require orders-of-magnitude less noise than naively using the original survey weights without modification."}, "https://arxiv.org/abs/2411.04522": {"title": "Testing for changes in the error distribution in functional linear models", "link": "https://arxiv.org/abs/2411.04522", "description": "arXiv:2411.04522v1 Announce Type: cross \nAbstract: We consider linear models with scalar responses and covariates from a separable Hilbert space. The aim is to detect change points in the error distribution, based on sequential residual empirical distribution functions. Expansions for those estimated functions are more challenging in models with infinite-dimensional covariates than in regression models with scalar or vector-valued covariates due to a slower rate of convergence of the parameter estimators. Yet the suggested change point test is asymptotically distribution-free and consistent for one-change point alternatives. In the latter case we also show consistency of a change point estimator."}, "https://arxiv.org/abs/2411.04729": {"title": "Conjugate gradient methods for high-dimensional GLMMs", "link": "https://arxiv.org/abs/2411.04729", "description": "arXiv:2411.04729v1 Announce Type: cross \nAbstract: Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multi partite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. 
Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle."}, "https://arxiv.org/abs/2204.04119": {"title": "Using negative controls to identify causal effects with invalid instrumental variables", "link": "https://arxiv.org/abs/2204.04119", "description": "arXiv:2204.04119v4 Announce Type: replace \nAbstract: Many proposals for the identification of causal effects require an instrumental variable that satisfies strong, untestable unconfoundedness and exclusion restriction assumptions. In this paper, we show how one can potentially identify causal effects under violations of these assumptions by harnessing a negative control population or outcome. This strategy allows one to leverage sub-populations for whom the exposure is degenerate, and requires that the instrument-outcome association satisfies a certain parallel trend condition. We develop the semiparametric efficiency theory for a general instrumental variable model, and obtain a multiply robust, locally efficient estimator of the average treatment effect in the treated. The utility of the estimators is demonstrated in simulation studies and an analysis of the Life Span Study."}, "https://arxiv.org/abs/2305.05934": {"title": "Does Principal Component Analysis Preserve the Sparsity in Sparse Weak Factor Models?", "link": "https://arxiv.org/abs/2305.05934", "description": "arXiv:2305.05934v2 Announce Type: replace \nAbstract: This paper studies the principal component (PC) method-based estimation of weak factor models with sparse loadings. We uncover an intrinsic near-sparsity preservation property for the PC estimators of loadings, which comes from the approximately upper triangular (block) structure of the rotation matrix. It implies an asymmetric relationship among factors: the rotated loadings for a stronger factor can be contaminated by those from a weaker one, but the loadings for a weaker factor are almost free of the impact of those from a stronger one. More importantly, the finding implies that there is no need to use complicated penalties to sparsify the loading estimators. Instead, we adopt a simple screening method to recover the sparsity and construct estimators for various factor strengths. In addition, for sparse weak factor models, we provide a singular value thresholding-based approach to determine the number of factors and establish uniform convergence rates for PC estimators, which complement Bai and Ng (2023). The accuracy and efficiency of the proposed estimators are investigated via Monte Carlo simulations. The application to the FRED-QD dataset reveals the underlying factor strengths and loading sparsity as well as their dynamic features."}, "https://arxiv.org/abs/1602.00856": {"title": "Bayesian Dynamic Quantile Model Averaging", "link": "https://arxiv.org/abs/1602.00856", "description": "arXiv:1602.00856v2 Announce Type: replace-cross \nAbstract: This article introduces a novel dynamic framework for Bayesian model averaging in time-varying parameter quantile regressions. By employing sequential Markov chain Monte Carlo, we combine empirical estimates derived from dynamically chosen quantile regressions, thereby facilitating a comprehensive understanding of the quantile model instabilities. 
The effectiveness of our methodology is initially validated through the examination of simulated datasets and, subsequently, by two applications to the US inflation rates and to the US real estate market. Our empirical findings suggest that a more intricate and nuanced analysis is needed when examining different sub-period regimes, since the determinants of inflation and real estate prices are clearly shown to be time-varying. In conclusion, we suggest that our proposed approach could offer valuable insights to aid decision making in a rapidly changing environment."}, "https://arxiv.org/abs/2411.05215": {"title": "Misclassification of Vaccination Status in Electronic Health Records: A Bayesian Approach in Cluster Randomized Trials", "link": "https://arxiv.org/abs/2411.05215", "description": "arXiv:2411.05215v1 Announce Type: new \nAbstract: Misclassification in binary outcomes is not uncommon, and statistical methods to investigate its impact on policy-driving study results are lacking. While misclassifying binary outcomes is a statistically ubiquitous phenomenon, we focus on misclassification in a public health application: vaccinations. One such study design in public health that addresses policy is the cluster controlled randomized trial (CCRT). A CCRT that measures the impact of a novel behavioral intervention on increasing vaccine uptake can be severely biased when the supporting data are incomplete vaccination records. In particular, these vaccine records may be especially prone to negative misclassification, that is, a clinic's record of an individual patient's vaccination status may indicate unvaccinated when, in reality, this patient was vaccinated outside of the clinic. With large nation-wide endeavors to encourage vaccinations without a gold-standard vaccine record system, sensitivity analyses that incorporate misclassification rates are promising for robust inference. In this work, we introduce a novel extension of Bayesian logistic regression where we perturb the clinic size and vaccination count with random draws from expert-elicited prior distributions. These prior distributions represent the misclassification rates for each clinic that stochastically add unvaccinated counts to the observed vaccinated counts. These prior distributions are assigned for each clinic (the first level in a group-level randomized trial). We demonstrate this method with a data application from a CCRT evaluating the influence of a behavioral intervention on vaccination uptake among U.S. veterans. A simulation study is carried out demonstrating its estimation properties."}, "https://arxiv.org/abs/2411.05220": {"title": "Inference for Treatment Effects Conditional on Generalized Principal Strata using Instrumental Variables", "link": "https://arxiv.org/abs/2411.05220", "description": "arXiv:2411.05220v1 Announce Type: new \nAbstract: In a setting with a multi-valued outcome, treatment and instrument, this paper considers the problem of inference for a general class of treatment effect parameters. The class of parameters considered consists of those that can be expressed as the expectation of a function of the response type conditional on a generalized principal stratum. Here, the response type simply refers to the vector of potential outcomes and potential treatments, and a generalized principal stratum is a set of possible values for the response type. 
In addition to instrument exogeneity, the main substantive restriction imposed rules out certain values for the response types in the sense that they are assumed to occur with probability zero. It is shown through a series of examples that this framework includes a wide variety of parameters and assumptions that have been considered in the previous literature. A key result in our analysis is a characterization of the identified set for such parameters under these assumptions in terms of existence of a non-negative solution to linear systems of equations with a special structure. We propose methods for inference exploiting this special structure and recent results in Fang et al. (2023)."}, "https://arxiv.org/abs/2411.05246": {"title": "Caliper Synthetic Matching: Generalized Radius Matching with Local Synthetic Controls", "link": "https://arxiv.org/abs/2411.05246", "description": "arXiv:2411.05246v1 Announce Type: new \nAbstract: Matching promises transparent causal inferences for observational data, making it an intuitive approach for many applications. In practice, however, standard matching methods often perform poorly compared to modern approaches such as response-surface modeling and optimizing balancing weights. We propose Caliper Synthetic Matching (CSM) to address these challenges while preserving simple and transparent matches and match diagnostics. CSM extends Coarsened Exact Matching by incorporating general distance metrics, adaptive calipers, and locally constructed synthetic controls. We show that CSM can be viewed as a monotonic imbalance bounding matching method, so that it inherits the usual bounds on imbalance and bias enjoyed by MIB methods. We further provide a bound on a measure of joint covariate imbalance. Using a simulation study, we illustrate how CSM can even outperform modern matching methods in certain settings, and finally illustrate its use in an empirical example. Overall, we find CSM allows for many of the benefits of matching while avoiding some of the costs."}, "https://arxiv.org/abs/2411.05315": {"title": "Differentiable Calibration of Inexact Stochastic Simulation Models via Kernel Score Minimization", "link": "https://arxiv.org/abs/2411.05315", "description": "arXiv:2411.05315v1 Announce Type: new \nAbstract: Stochastic simulation models are generative models that mimic complex systems to help with decision-making. The reliability of these models heavily depends on well-calibrated input model parameters. However, in many practical scenarios, only output-level data are available to learn the input model parameters, which is challenging due to the often intractable likelihood of the stochastic simulation model. Moreover, stochastic simulation models are frequently inexact, with discrepancies between the model and the target system. No existing methods can effectively learn and quantify the uncertainties of input parameters using only output-level data. In this paper, we propose to learn differentiable input parameters of stochastic simulation models using output-level data via kernel score minimization with stochastic gradient descent. We quantify the uncertainties of the learned input parameters using a frequentist confidence set procedure based on a new asymptotic normality result that accounts for model inexactness. 
The proposed method is evaluated on exact and inexact G/G/1 queueing models."}, "https://arxiv.org/abs/2411.05601": {"title": "Detecting Cointegrating Relations in Non-stationary Matrix-Valued Time Series", "link": "https://arxiv.org/abs/2411.05601", "description": "arXiv:2411.05601v1 Announce Type: new \nAbstract: This paper proposes a Matrix Error Correction Model to identify cointegration relations in matrix-valued time series. We hereby allow separate cointegrating relations along the rows and columns of the matrix-valued time series and use information criteria to select the cointegration ranks. Through Monte Carlo simulations and a macroeconomic application, we demonstrate that our approach provides a reliable estimation of the number of cointegrating relationships."}, "https://arxiv.org/abs/2411.05629": {"title": "Nowcasting distributions: a functional MIDAS model", "link": "https://arxiv.org/abs/2411.05629", "description": "arXiv:2411.05629v1 Announce Type: new \nAbstract: We propose a functional MIDAS model to leverage high-frequency information for forecasting and nowcasting distributions observed at a lower frequency. We approximate the low-frequency distribution using Functional Principal Component Analysis and consider a group lasso spike-and-slab prior to identify the relevant predictors in the finite-dimensional SUR-MIDAS approximation of the functional MIDAS model. In our application, we use the model to nowcast the U.S. households' income distribution. Our findings indicate that the model enhances forecast accuracy for the entire target distribution and for key features of the distribution that signal changes in inequality."}, "https://arxiv.org/abs/2411.05695": {"title": "Firm Heterogeneity and Macroeconomic Fluctuations: a Functional VAR model", "link": "https://arxiv.org/abs/2411.05695", "description": "arXiv:2411.05695v1 Announce Type: new \nAbstract: We develop a Functional Augmented Vector Autoregression (FunVAR) model to explicitly incorporate firm-level heterogeneity observed in more than one dimension and study its interaction with aggregate macroeconomic fluctuations. Our methodology employs dimensionality reduction techniques for tensor data objects to approximate the joint distribution of firm-level characteristics. More broadly, our framework can be used for assessing predictions from structural models that account for micro-level heterogeneity observed on multiple dimensions. Leveraging firm-level data from the Compustat database, we use the FunVAR model to analyze the propagation of total factor productivity (TFP) shocks, examining their impact on both macroeconomic aggregates and the cross-sectional distribution of capital and labor across firms."}, "https://arxiv.org/abs/2411.05324": {"title": "SASWISE-UE: Segmentation and Synthesis with Interpretable Scalable Ensembles for Uncertainty Estimation", "link": "https://arxiv.org/abs/2411.05324", "description": "arXiv:2411.05324v1 Announce Type: cross \nAbstract: This paper introduces an efficient sub-model ensemble framework aimed at enhancing the interpretability of medical deep learning models, thus increasing their clinical applicability. By generating uncertainty maps, this framework enables end-users to evaluate the reliability of model outputs. We developed a strategy to develop diverse models from a single well-trained checkpoint, facilitating the training of a model family. 
This involves producing multiple outputs from a single input, fusing them into a final output, and estimating uncertainty based on output disagreements. Implemented using U-Net and UNETR models for segmentation and synthesis tasks, this approach was tested on CT body segmentation and MR-CT synthesis datasets. It achieved a mean Dice coefficient of 0.814 in segmentation and a Mean Absolute Error of 88.17 HU in synthesis, improved from 89.43 HU by pruning. Additionally, the framework was evaluated under corruption and undersampling, maintaining correlation between uncertainty and error, which highlights its robustness. These results suggest that the proposed approach not only maintains the performance of well-trained models but also enhances interpretability through effective uncertainty estimation, applicable to both convolutional and transformer models in a range of imaging tasks."}, "https://arxiv.org/abs/2411.05580": {"title": "Increasing power and robustness in screening trials by testing stored specimens in the control arm", "link": "https://arxiv.org/abs/2411.05580", "description": "arXiv:2411.05580v1 Announce Type: cross \nAbstract: Background: Screening trials require large sample sizes and long time-horizons to demonstrate mortality reductions. We recently proposed increasing statistical power by testing stored control-arm specimens, called the Intended Effect (IE) design. To evaluate feasibility of the IE design, the US National Cancer Institute (NCI) is collecting blood specimens in the control-arm of the NCI Vanguard Multicancer Detection pilot feasibility trial. However, key assumptions of the IE design require more investigation and relaxation. Methods: We relax the IE design to (1) reduce costs by testing only a stratified sample of control-arm specimens by incorporating inverse-probability sampling weights, (2) correct for potential loss-of-signal in stored control-arm specimens, and (3) correct for non-compliance with control-arm specimen collections. We also examine sensitivity to unintended effects of screening. Results: In simulations, testing all primary-outcome control-arm specimens and a 50% sample of the rest maintains nearly all the power of the IE while only testing half the control-arm specimens. Power remains increased from the IE analysis (versus the standard analysis) even if unintended effects exist. The IE design is robust to some loss-of-signal scenarios, but otherwise requires retest-positive fractions that correct bias at a small loss of power. The IE can be biased and lose power under control-arm non-compliance scenarios, but corrections correct bias and can increase power. Conclusions: The IE design can be made more cost-efficient and robust to loss-of-signal. Unintended effects will not typically reduce the power gain over the standard trial design. Non-compliance with control-arm specimen collections can cause bias and loss of power that can be mitigated by corrections. Although promising, practical experience with the IE design in screening trials is necessary."}, "https://arxiv.org/abs/2411.05625": {"title": "Cross-validating causal discovery via Leave-One-Variable-Out", "link": "https://arxiv.org/abs/2411.05625", "description": "arXiv:2411.05625v1 Announce Type: cross \nAbstract: We propose a new approach to falsify causal discovery algorithms without ground truth, which is based on testing the causal model on a pair of variables that has been dropped when learning the causal model. 
To this end, we use the \"Leave-One-Variable-Out (LOVO)\" prediction where $Y$ is inferred from $X$ without any joint observations of $X$ and $Y$, given only training data from $X,Z_1,\\dots,Z_k$ and from $Z_1,\\dots,Z_k,Y$. We demonstrate that causal models on the two subsets, in the form of Acyclic Directed Mixed Graphs (ADMGs), often entail conclusions on the dependencies between $X$ and $Y$, enabling this type of prediction. The prediction error can then be estimated since the joint distribution $P(X, Y)$ is assumed to be available, and $X$ and $Y$ have only been omitted for the purpose of falsification. After presenting this graphical method, which is applicable to general causal discovery algorithms, we illustrate how to construct a LOVO predictor tailored towards algorithms relying on specific a priori assumptions, such as linear additive noise models. Simulations indicate that the LOVO prediction error is indeed correlated with the accuracy of the causal outputs, affirming the method's effectiveness."}, "https://arxiv.org/abs/2411.05758": {"title": "On the limiting variance of matching estimators", "link": "https://arxiv.org/abs/2411.05758", "description": "arXiv:2411.05758v1 Announce Type: cross \nAbstract: This paper examines the limiting variance of nearest neighbor matching estimators for average treatment effects with a fixed number of matches. We present, for the first time, a closed-form expression for this limit. Here the key is the establishment of the limiting second moment of the catchment area's volume, which resolves a question of Abadie and Imbens. At the core of our approach is a new universality theorem on the measures of high-order Voronoi cells, extending a result by Devroye, Gy\\\"orfi, Lugosi, and Walk."}, "https://arxiv.org/abs/2302.10160": {"title": "Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift", "link": "https://arxiv.org/abs/2302.10160", "description": "arXiv:2302.10160v3 Announce Type: replace \nAbstract: We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that our estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance."}, "https://arxiv.org/abs/2309.16861": {"title": "Demystifying Spatial Confounding", "link": "https://arxiv.org/abs/2309.16861", "description": "arXiv:2309.16861v2 Announce Type: replace \nAbstract: Spatial confounding is a fundamental issue in spatial regression models which arises because spatial random effects, included to approximate unmeasured spatial variation, are typically not independent of covariates in the model. This can lead to significant bias in covariate effect estimates. 
The problem is complex and has been the topic of extensive research with sometimes puzzling and seemingly contradictory results. Here, we develop a broad theoretical framework that brings mathematical clarity to the mechanisms of spatial confounding, providing explicit analytical expressions for the resulting bias. We see that the problem is directly linked to spatial smoothing and identify exactly how the size and occurrence of bias relate to the features of the spatial model as well as the underlying confounding scenario. Using our results, we can explain subtle and counter-intuitive behaviours. Finally, we propose a general approach for dealing with spatial confounding bias in practice, applicable for any spatial model specification. When a covariate has non-spatial information, we show that a general form of the so-called spatial+ method can be used to eliminate bias. When no such information is present, the situation is more challenging but, under the assumption of unconfounded high frequencies, we develop a procedure in which multiple capped versions of spatial+ are applied to assess the bias in this case. We illustrate our approach with an application to air temperature in Germany."}, "https://arxiv.org/abs/2411.05839": {"title": "Simulation Studies For Goodness-of-Fit and Two-Sample Methods For Univariate Data", "link": "https://arxiv.org/abs/2411.05839", "description": "arXiv:2411.05839v1 Announce Type: new \nAbstract: We present the results of a large number of simulation studies regarding the power of various goodness-of-fit as well as nonparametric two-sample tests for univariate data. This includes both continuous and discrete data. In general no single method can be relied upon to provide good power, any one method may be quite good for some combination of null hypothesis and alternative and may fail badly for another. Based on the results of these studies we propose a fairly small number of methods chosen such that for any of the case studies included here at least one of the methods has good power.\n The studies were carried out using the R packages R2sample and Rgof, available from CRAN."}, "https://arxiv.org/abs/2411.06129": {"title": "Rational Expectations Nonparametric Empirical Bayes", "link": "https://arxiv.org/abs/2411.06129", "description": "arXiv:2411.06129v1 Announce Type: new \nAbstract: We examine the uniqueness of the posterior distribution within an Empirical Bayes framework using a discretized prior. To achieve this, we impose Rational Expectations conditions on the prior, focusing on coherence and stability properties. We derive the conditions necessary for posterior uniqueness when observations are drawn from either discrete or continuous distributions. Additionally, we discuss the properties of our discretized prior as an approximation of the true underlying prior."}, "https://arxiv.org/abs/2411.06150": {"title": "It's About Time: What A/B Test Metrics Estimate", "link": "https://arxiv.org/abs/2411.06150", "description": "arXiv:2411.06150v1 Announce Type: new \nAbstract: Online controlled experiments, or A/B tests, are large-scale randomized trials in digital environments. This paper investigates the estimands of the difference-in-means estimator in these experiments, focusing on scenarios with repeated measurements on users. We compare cumulative metrics that use all post-exposure data for each user to windowed metrics that measure each user over a fixed time window. We analyze the estimands and highlight trade-offs between the two types of metrics. 
Our findings reveal that while cumulative metrics eliminate the need for pre-defined measurement windows, they result in estimands that are more intricately tied to the experiment intake and runtime. This complexity can lead to counter-intuitive practical consequences, such as decreased statistical power with more observations. However, cumulative metrics offer earlier results and can quickly detect strong initial signals. We conclude that neither metric type is universally superior. The optimal choice depends on the temporal profile of the treatment effect, the distribution of exposure, and the stopping time of the experiment. This research provides insights for experimenters to make informed decisions about how to define metrics based on their specific experimental contexts and objectives."}, "https://arxiv.org/abs/2411.06298": {"title": "Efficient subsampling for high-dimensional data", "link": "https://arxiv.org/abs/2411.06298", "description": "arXiv:2411.06298v1 Announce Type: new \nAbstract: In the field of big data analytics, the search for efficient subdata selection methods that enable robust statistical inferences with minimal computational resources is of high importance. A procedure prior to subdata selection could perform variable selection, as only a subset of a large number of variables is active. We propose an approach for settings where both the size of the full dataset and the number of variables are large. This approach first identifies the active variables by applying a procedure inspired by random LASSO (Least Absolute Shrinkage and Selection Operator) and then selects subdata based on leverage scores to build a predictive model. Our proposed approach outperforms approaches that already exist in the current literature, including the usage of the full dataset, in both variable selection and prediction, while also exhibiting significant improvements in computing time. Simulation experiments as well as a real data application are provided."}, "https://arxiv.org/abs/2411.06327": {"title": "Return-forecasting and Volatility-forecasting Power of On-chain Activities in the Cryptocurrency Market", "link": "https://arxiv.org/abs/2411.06327", "description": "arXiv:2411.06327v1 Announce Type: new \nAbstract: We investigate the return-forecasting and volatility-forecasting power of intraday on-chain flow data for BTC, ETH, and USDT, and the associated option strategies. First, we find that USDT net inflow into cryptocurrency exchanges positively forecasts future returns of both BTC and ETH, with the strongest effect at the 1-hour frequency. Second, we find that ETH net inflow into cryptocurrency exchanges negatively forecasts future returns of ETH. Third, we find that BTC net inflow into cryptocurrency exchanges does not significantly forecast future returns of BTC. Finally, we confirm that selling 0DTE ETH call options is a profitable trading strategy when the net inflow into cryptocurrency exchanges is high. Our study lends new insights into the emerging literature that studies the on-chain activities and their asset-pricing impact in the cryptocurrency market."}, "https://arxiv.org/abs/2411.06342": {"title": "Stabilized Inverse Probability Weighting via Isotonic Calibration", "link": "https://arxiv.org/abs/2411.06342", "description": "arXiv:2411.06342v1 Announce Type: new \nAbstract: Inverse weighting with an estimated propensity score is widely used by estimation methods in causal inference to adjust for confounding bias. 
However, directly inverting propensity score estimates can lead to instability, bias, and excessive variability due to large inverse weights, especially when treatment overlap is limited. In this work, we propose a post-hoc calibration algorithm for inverse propensity weights that generates well-calibrated, stabilized weights from user-supplied, cross-fitted propensity score estimates. Our approach employs a variant of isotonic regression with a loss function specifically tailored to the inverse propensity weights. Through theoretical analysis and empirical studies, we demonstrate that isotonic calibration improves the performance of doubly robust estimators of the average treatment effect."}, "https://arxiv.org/abs/2411.06591": {"title": "Analysis of spatially clustered survival data with unobserved covariates using SBART", "link": "https://arxiv.org/abs/2411.06591", "description": "arXiv:2411.06591v1 Announce Type: new \nAbstract: Usual parametric and semi-parametric regression methods are inappropriate and inadequate for large clustered survival studies when the appropriate functional forms of the covariates and their interactions in hazard functions are unknown, and random cluster effects and cluster-level covariates are spatially correlated. We present a general nonparametric method for such studies under the Bayesian ensemble learning paradigm called Soft Bayesian Additive Regression Trees. Our methodological and computational challenges include large number of clusters, variable cluster sizes, and proper statistical augmentation of the unobservable cluster-level covariate using a data registry different from the main survival study. We use an innovative 3-step approach based on latent variables to address our computational challenges. We illustrate our method and its advantages over existing methods by assessing the impacts of intervention in some county-level and patient-level covariates to mitigate existing racial disparity in breast cancer survival in 67 Florida counties (clusters) using two different data resources. Florida Cancer Registry (FCR) is used to obtain clustered survival data with patient-level covariates, and the Behavioral Risk Factor Surveillance Survey (BRFSS) is used to obtain further data information on an unobservable county-level covariate of Screening Mammography Utilization (SMU)."}, "https://arxiv.org/abs/2411.06716": {"title": "Parameter Estimation for Partially Observed McKean-Vlasov Diffusions", "link": "https://arxiv.org/abs/2411.06716", "description": "arXiv:2411.06716v1 Announce Type: new \nAbstract: In this article we consider likelihood-based estimation of static parameters for a class of partially observed McKean-Vlasov (POMV) diffusion process with discrete-time observations over a fixed time interval. In particular, using the framework of [5] we develop a new randomized multilevel Monte Carlo method for estimating the parameters, based upon Markovian stochastic approximation methodology. New Markov chain Monte Carlo algorithms for the POMV model are introduced facilitating the application of [5]. We prove, under assumptions, that the expectation of our estimator is biased, but with expected small and controllable bias. 
Our approach is implemented on several examples."}, "https://arxiv.org/abs/2411.06913": {"title": "BudgetIV: Optimal Partial Identification of Causal Effects with Mostly Invalid Instruments", "link": "https://arxiv.org/abs/2411.06913", "description": "arXiv:2411.06913v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are widely used to estimate causal effects in the presence of unobserved confounding between exposure and outcome. An IV must affect the outcome exclusively through the exposure and be unconfounded with the outcome. We present a framework for relaxing either or both of these strong assumptions with tuneable and interpretable budget constraints. Our algorithm returns a feasible set of causal effects that can be identified exactly given relevant covariance parameters. The feasible set may be disconnected but is a finite union of convex subsets. We discuss conditions under which this set is sharp, i.e., contains all and only effects consistent with the background assumptions and the joint distribution of observable variables. Our method applies to a wide class of semiparametric models, and we demonstrate how its ability to select specific subsets of instruments confers an advantage over convex relaxations in both linear and nonlinear settings. We also adapt our algorithm to form confidence sets that are asymptotically valid under a common statistical assumption from the Mendelian randomization literature."}, "https://arxiv.org/abs/2411.07028": {"title": "Estimating abilities with an Elo-informed growth model", "link": "https://arxiv.org/abs/2411.07028", "description": "arXiv:2411.07028v1 Announce Type: new \nAbstract: An intelligent tutoring system (ITS) aims to provide instructions and exercises tailored to the ability of a student. To do this, the ITS needs to estimate the ability based on student input. Rather than including frequent full-scale tests to update our ability estimate, we want to base estimates on the outcomes of practice exercises that are part of the learning process. A challenge with this approach is that the ability changes as the student learns, which makes traditional item response theory (IRT) models inappropriate. Most IRT models estimate an ability based on a test result, and assume that the ability is constant throughout a test.\n We review some existing methods for measuring abilities that change throughout the measurement period, and propose a new method which we call the Elo-informed growth model. This method assumes that the abilities for a group of respondents who are all in the same stage of the learning process follow a distribution that can be estimated. The method does not assume a particular shape of the growth curve. It performs better than the standard Elo algorithm when the measured outcomes are far apart in time, or when the ability change is rapid."}, "https://arxiv.org/abs/2411.07221": {"title": "Self-separated and self-connected models for mediator and outcome missingness in mediation analysis", "link": "https://arxiv.org/abs/2411.07221", "description": "arXiv:2411.07221v1 Announce Type: new \nAbstract: Missing data is a common problem that challenges the study of effects of treatments. In the context of mediation analysis, this paper addresses missingness in the two key variables, mediator and outcome, focusing on identification. 
We consider self-separated missingness models where identification is achieved by conditional independence assumptions only and self-connected missingness models where identification relies on so-called shadow variables. The first class is somewhat limited as it is constrained by the need to remove a certain number of connections from the model. The second class turns out to include substantial variation in the position of the shadow variable in the causal structure (vis-a-vis the mediator and outcome) and the corresponding implications for the model. In constructing the models, to improve plausibility, we pay close attention to allowing, where possible, dependencies due to unobserved causes of the missingness. In this exploration, we develop theory where needed. This results in templates for identification in this mediation setting, generally useful identification techniques, and perhaps most significantly, synthesis and substantial expansion of shadow variable theory."}, "https://arxiv.org/abs/2411.05869": {"title": "Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data", "link": "https://arxiv.org/abs/2411.05869", "description": "arXiv:2411.05869v1 Announce Type: cross \nAbstract: The Gaussian process (GP) is a widely used probabilistic machine learning method for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear processes. Unlike many other machine learning methods, GPs include an implicit characterization of uncertainty, making them extremely useful across many areas of science, technology, and engineering. Traditional implementations of GPs involve stationary kernels (also termed covariance functions) that limit their flexibility and exact methods for inference that prevent application to data sets with more than about ten thousand points. Modern approaches to address stationarity assumptions generally fail to accommodate large data sets, while all attempts to address scalability focus on approximating the Gaussian likelihood, which can involve subjectivity and lead to inaccuracies. In this work, we explicitly derive an alternative kernel that can discover and encode both sparsity and nonstationarity. We embed the kernel within a fully Bayesian GP model and leverage high-performance computing resources to enable the analysis of massive data sets. We demonstrate the favorable performance of our novel kernel relative to existing exact and approximate GP methods across a variety of synthetic data examples. Furthermore, we conduct space-time prediction based on more than one million measurements of daily maximum temperature and verify that our results outperform state-of-the-art methods in the Earth sciences. More broadly, having access to exact GPs that use ultra-scalable, sparsity-discovering, nonstationary kernels allows GP methods to truly compete with a wide variety of machine learning methods."}, "https://arxiv.org/abs/2411.05870": {"title": "An Adaptive Online Smoother with Closed-Form Solutions and Information-Theoretic Lag Selection for Conditional Gaussian Nonlinear Systems", "link": "https://arxiv.org/abs/2411.05870", "description": "arXiv:2411.05870v1 Announce Type: cross \nAbstract: Data assimilation (DA) combines partial observations with a dynamical model to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. 
It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the DA to utilize only observations within a nearby window, significantly reducing computational storage. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and the adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, its advantage of computational storage is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence."}, "https://arxiv.org/abs/2411.06140": {"title": "Deep Nonparametric Conditional Independence Tests for Images", "link": "https://arxiv.org/abs/2411.06140", "description": "arXiv:2411.06140v1 Announce Type: cross \nAbstract: Conditional independence tests (CITs) test for conditional dependence between random variables. As existing CITs are limited in their applicability to complex, high-dimensional variables such as images, we introduce deep nonparametric CITs (DNCITs). The DNCITs combine embedding maps, which extract feature representations of high-dimensional variables, with nonparametric CITs applicable to these feature representations. For the embedding maps, we derive general properties on their parameter estimators to obtain valid DNCITs and show that these properties include embedding maps learned through (conditional) unsupervised or transfer learning. For the nonparametric CITs, appropriate tests are selected and adapted to be applicable to feature representations. Through simulations, we investigate the performance of the DNCITs for different embedding maps and nonparametric CITs under varying confounder dimensions and confounder relationships. We apply the DNCITs to brain MRI scans and behavioral traits, given confounders, of healthy individuals from the UK Biobank (UKB), confirming null results from a number of ambiguous personality neuroscience studies with a larger data set and with our more powerful tests. In addition, in a confounder control study, we apply the DNCITs to brain MRI scans and a confounder set to test for sufficient confounder control, leading to a potential reduction in the confounder dimension under improved confounder control compared to existing state-of-the-art confounder control studies for the UKB. 
Finally, we provide an R package implementing the DNCITs."}, "https://arxiv.org/abs/2411.06324": {"title": "Amortized Bayesian Local Interpolation NetworK: Fast covariance parameter estimation for Gaussian Processes", "link": "https://arxiv.org/abs/2411.06324", "description": "arXiv:2411.06324v1 Announce Type: cross \nAbstract: Gaussian processes (GPs) are a ubiquitous tool for geostatistical modeling with high levels of flexibility and interpretability, and the ability to make predictions at unseen spatial locations through a process called Kriging. Estimation of Kriging weights relies on the inversion of the process' covariance matrix, creating a computational bottleneck for large spatial datasets. In this paper, we propose an Amortized Bayesian Local Interpolation NetworK (A-BLINK) for fast covariance parameter estimation, which uses two pre-trained deep neural networks to learn a mapping from spatial location coordinates and covariance function parameters to Kriging weights and the spatial variance, respectively. The fast prediction time of these networks allows us to bypass the matrix inversion step, creating large computational speedups over competing methods in both frequentist and Bayesian settings, and also provides full posterior inference and predictions using Markov chain Monte Carlo sampling methods. We show significant increases in computational efficiency over comparable scalable GP methodology in an extensive simulation study with lower parameter estimation error. The efficacy of our approach is also demonstrated using a temperature dataset of US climate normals for 1991--2020 based on over 7,000 weather stations."}, "https://arxiv.org/abs/2411.06518": {"title": "Causal Representation Learning from Multimodal Biological Observations", "link": "https://arxiv.org/abs/2411.06518", "description": "arXiv:2411.06518v1 Announce Type: cross \nAbstract: Prevalent in biological applications (e.g., human phenotype measurements), multimodal datasets can provide valuable insights into the underlying biological mechanisms. However, current machine learning models designed to analyze such datasets still lack interpretability and theoretical guarantees, which are essential to biological applications. Recent advances in causal representation learning have shown promise in uncovering the interpretable latent causal variables with formal theoretical certificates. Unfortunately, existing works for multimodal distributions either rely on restrictive parametric assumptions or provide rather coarse identification results, limiting their applicability to biological research which favors a detailed understanding of the mechanisms.\n In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biological datasets. Theoretically, we consider a flexible nonparametric latent distribution (c.f., parametric assumptions in prior work) permitting causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work. Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities, which, as we will discuss, is natural for a large collection of biological systems. Empirically, we propose a practical framework to instantiate our theoretical insights. 
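A note on the A-BLINK abstract above: the Kriging weights it amortizes are classically obtained by solving a linear system in the covariance matrix. Below is a minimal sketch of that baseline computation, with an assumed exponential covariance and synthetic locations (not the paper's networks or data), to show the O(n^3) solve that the pre-trained networks bypass.

```python
import numpy as np

def exp_kernel(A, B, range_=0.5, var=1.0):
    # Exponential (Matern-1/2) covariance between two sets of 2-D locations.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return var * np.exp(-d / range_)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))          # observed spatial locations
x_new = rng.uniform(size=(1, 2))        # prediction location
nugget = 0.1

K = exp_kernel(X, X) + nugget * np.eye(len(X))
k_star = exp_kernel(X, x_new)

# Classical Kriging weights: solve K w = k*, the O(n^3) step that amortized
# approaches replace with a learned mapping from locations and parameters.
w = np.linalg.solve(K, k_star)

y = rng.normal(size=len(X))             # toy responses
y_hat = w.T @ y                         # Kriging prediction at x_new
print(y_hat)
```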
We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established medical research, validating our theoretical and methodological framework."}, "https://arxiv.org/abs/2411.06593": {"title": "Algebraic and Statistical Properties of the Partially Regularized Ordinary Least Squares Interpolator", "link": "https://arxiv.org/abs/2411.06593", "description": "arXiv:2411.06593v1 Announce Type: cross \nAbstract: Modern deep learning has revealed a surprising statistical phenomenon known as benign overfitting, with high-dimensional linear regression being a prominent example. This paper contributes to ongoing research on the ordinary least squares (OLS) interpolator, focusing on the partial regression setting, where only a subset of coefficients is implicitly regularized. On the algebraic front, we extend Cochran's formula and the leave-one-out residual formula for the partial regularization framework. On the stochastic front, we leverage our algebraic results to design several homoskedastic variance estimators under the Gauss-Markov model. These estimators serve as a basis for conducting statistical inference, albeit with slight conservatism in their performance. Through simulations, we study the finite-sample properties of these variance estimators across various generative models."}, "https://arxiv.org/abs/2411.06631": {"title": "SequentialSamplingModels", "link": "https://arxiv.org/abs/2411.06631", "description": "arXiv:2411.06631v1 Announce Type: cross \nAbstract: Sequential sampling models (SSMs) are a widely used framework describing decision-making as a stochastic, dynamic process of evidence accumulation. SSMs' popularity across cognitive science has driven the development of various software packages that lower the barrier for simulating, estimating, and comparing existing SSMs. Here, we present a software tool, SequentialSamplingModels.jl (SSM.jl), designed to make SSM simulations more accessible to Julia users, and to integrate with the Julia ecosystem. We demonstrate the basic use of SSM.jl for simulation, plotting, and Bayesian inference."}, "https://arxiv.org/abs/2208.10974": {"title": "Beta-Sorted Portfolios", "link": "https://arxiv.org/abs/2208.10974", "description": "arXiv:2208.10974v3 Announce Type: replace \nAbstract: Beta-sorted portfolios -- portfolios comprised of assets with similar covariation to selected risk factors -- are a popular tool in empirical finance to analyze models of (conditional) expected returns. Despite their widespread use, little is known of their statistical properties in contrast to comparable procedures such as two-pass regressions. We formally investigate the properties of beta-sorted portfolio returns by casting the procedure as a two-step nonparametric estimator with a nonparametric first step and a beta-adaptive portfolio construction. Our framework rationalizes the well-known estimation algorithm with precise economic and statistical assumptions on the general data generating process and characterizes its key features. We study beta-sorted portfolios for both a single cross-section as well as for aggregation over time (e.g., the grand mean), offering conditions that ensure consistency and asymptotic normality along with new uniform inference procedures allowing for uncertainty quantification and testing of various relevant hypotheses in financial applications.
We also highlight some limitations of current empirical practices and discuss what inferences can and cannot be drawn from returns to beta-sorted portfolios for either a single cross-section or across the whole sample. Finally, we illustrate the functionality of our new procedures in an empirical application."}, "https://arxiv.org/abs/2212.08446": {"title": "Score function-based tests for ultrahigh-dimensional linear models", "link": "https://arxiv.org/abs/2212.08446", "description": "arXiv:2212.08446v2 Announce Type: replace \nAbstract: In this paper, we investigate score function-based tests to check the significance of an ultrahigh-dimensional sub-vector of the model coefficients when the nuisance parameter vector is also ultrahigh-dimensional in linear models. We first reanalyze and extend a recently proposed score function-based test to derive, under weaker conditions, its limiting distributions under the null and local alternative hypotheses. As it may fail to work when the correlation between testing covariates and nuisance covariates is high, we propose an orthogonalized score function-based test with two merits: debiasing to make the non-degenerate error term degenerate and reducing the asymptotic variance to enhance power performance. Simulations evaluate the finite-sample performances of the proposed tests, and a real data analysis illustrates its application."}, "https://arxiv.org/abs/2304.00874": {"title": "A mixture transition distribution modeling for higher-order circular Markov processes", "link": "https://arxiv.org/abs/2304.00874", "description": "arXiv:2304.00874v5 Announce Type: replace \nAbstract: The stationary higher-order Markov process for circular data is considered. We employ the mixture transition distribution (MTD) model to express the transition density of the process on the circle. The underlying circular transition distribution is based on Wehrly and Johnson's bivariate joint circular models. The structures of the circular autocorrelation function together with the circular partial autocorrelation function are found to be similar to those of the autocorrelation and partial autocorrelation functions of the real-valued autoregressive process when the underlying binding density has zero sine moments. The validity of the model is assessed by applying it to some Monte Carlo simulations and real directional data."}, "https://arxiv.org/abs/2306.02711": {"title": "Truly Multivariate Structured Additive Distributional Regression", "link": "https://arxiv.org/abs/2306.02711", "description": "arXiv:2306.02711v2 Announce Type: replace \nAbstract: Generalized additive models for location, scale and shape (GAMLSS) are a popular extension to mean regression models where each parameter of an arbitrary distribution is modelled through covariates. While such models have been developed for univariate and bivariate responses, the truly multivariate case remains extremely challenging for both computational and theoretical reasons. Alternative approaches to GAMLSS may allow for higher dimensional response vectors to be modelled jointly but often assume a fixed dependence structure not depending on covariates or are limited with respect to modelling flexibility or computational aspects. We contribute to this gap in the literature and propose a truly multivariate distributional model, which allows one to benefit from the flexibility of GAMLSS even when the response has dimension larger than two or three. 
Building on copula regression, we model the dependence structure of the response through a Gaussian copula, while the marginal distributions can vary across components. Our model is highly parameterized but estimation becomes feasible with Bayesian inference employing shrinkage priors. We demonstrate the competitiveness of our approach in a simulation study and illustrate how it complements existing models using the examples of childhood malnutrition and a yet-unexplored data set on traffic detection in Berlin."}, "https://arxiv.org/abs/2306.16402": {"title": "Guidance on Individualized Treatment Rule Estimation in High Dimensions", "link": "https://arxiv.org/abs/2306.16402", "description": "arXiv:2306.16402v2 Announce Type: replace \nAbstract: Individualized treatment rules, cornerstones of precision medicine, inform patient treatment decisions with the goal of optimizing patient outcomes. These rules are generally unknown functions of patients' pre-treatment covariates, meaning they must be estimated from clinical or observational study data. Myriad methods have been developed to learn these rules, and these procedures are demonstrably successful in traditional asymptotic settings with a moderate number of covariates. The finite-sample performance of these methods in high-dimensional covariate settings, which are increasingly the norm in modern clinical trials, has not been well characterized, however. We perform a comprehensive comparison of state-of-the-art individualized treatment rule estimators, assessing performance on the basis of the estimators' accuracy, interpretability, and computational efficacy. Sixteen data-generating processes with continuous outcomes and binary treatment assignments are considered, reflecting a diversity of randomized and observational studies. We summarize our findings and provide succinct advice to practitioners needing to estimate individualized treatment rules in high dimensions. All code is made publicly available, facilitating modifications and extensions to our simulation study. A novel pre-treatment covariate filtering procedure is also proposed and is shown to improve estimators' accuracy and interpretability."}, "https://arxiv.org/abs/2308.02005": {"title": "Randomization-Based Inference for Average Treatment Effect in Inexactly Matched Observational Studies", "link": "https://arxiv.org/abs/2308.02005", "description": "arXiv:2308.02005v3 Announce Type: replace \nAbstract: Matching is a widely used causal inference study design in observational studies. It seeks to mimic a randomized experiment by forming matched sets of treated and control units based on proximity in covariates. Ideally, treated units are exactly matched with controls for the covariates, and randomization-based inference for the treatment effect can then be conducted as in a randomized experiment under the ignorability assumption. However, matching is typically inexact when continuous covariates or many covariates exist. Previous studies have routinely ignored inexact matching in the downstream randomization-based inference as long as some covariate balance criteria are satisfied. Some recent studies found that this routine practice can cause severe bias. They proposed new inference methods for correcting for bias due to inexact matching. However, these inference methods focus on the constant treatment effect (i.e., Fisher's sharp null) and are not directly applicable to the average treatment effect (i.e., Neyman's weak null).
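Relatedly, the Gaussian-copula construction described in the multivariate distributional regression abstract above can be sketched in a few lines. The correlation matrix and marginal distributions below are illustrative placeholders, not the fitted model from the paper.

```python
import numpy as np
from scipy.stats import gamma, multivariate_normal, norm

# Gaussian copula: dependence comes from a latent correlated normal vector,
# while each margin can be a different distribution (here gamma and normal).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
z = multivariate_normal(mean=np.zeros(3), cov=R).rvs(size=5000, random_state=0)
u = norm.cdf(z)                                   # uniform margins, Gaussian dependence
y = np.column_stack([
    gamma(a=2.0).ppf(u[:, 0]),
    norm(loc=1.0, scale=2.0).ppf(u[:, 1]),
    gamma(a=5.0, scale=0.5).ppf(u[:, 2]),
])
print(np.corrcoef(y, rowvar=False).round(2))      # induced cross-component correlation
```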
To address this problem, we propose a new framework - inverse post-matching probability weighting (IPPW) - for randomization-based average treatment effect inference under inexact matching. Compared with the routinely used randomization-based inference framework based on the difference-in-means estimator, our proposed IPPW framework can substantially reduce bias due to inexact matching and improve the coverage rate. We have also developed an open-source R package RIIM (Randomization-Based Inference under Inexact Matching) for implementing our methods."}, "https://arxiv.org/abs/2311.09446": {"title": "Scalable simulation-based inference for implicitly defined models using a metamodel for Monte Carlo log-likelihood estimator", "link": "https://arxiv.org/abs/2311.09446", "description": "arXiv:2311.09446v2 Announce Type: replace \nAbstract: Models implicitly defined through a random simulator of a process have become widely used in scientific and industrial applications in recent years. However, simulation-based inference methods for such implicit models, like approximate Bayesian computation (ABC), often scale poorly as data size increases. We develop a scalable inference method for implicitly defined models using a metamodel for the Monte Carlo log-likelihood estimator derived from simulations. This metamodel characterizes both statistical and simulation-based randomness in the distribution of the log-likelihood estimator across different parameter values. Our metamodel-based method quantifies uncertainty in parameter estimation in a principled manner, leveraging the local asymptotic normality of the mean function of the log-likelihood estimator. We apply this method to construct accurate confidence intervals for parameters of partially observed Markov process models where the Monte Carlo log-likelihood estimator is obtained using the bootstrap particle filter. We numerically demonstrate that our method enables accurate and highly scalable parameter inference across several examples, including a mechanistic compartment model for infectious diseases."}, "https://arxiv.org/abs/2106.16149": {"title": "When Frictions are Fractional: Rough Noise in High-Frequency Data", "link": "https://arxiv.org/abs/2106.16149", "description": "arXiv:2106.16149v4 Announce Type: replace-cross \nAbstract: The analysis of high-frequency financial data is often impeded by the presence of noise. This article is motivated by intraday return data in which market microstructure noise appears to be rough, that is, best captured by a continuous-time stochastic process that locally behaves as fractional Brownian motion. Assuming that the underlying efficient price process follows a continuous It\\^o semimartingale, we derive consistent estimators and asymptotic confidence intervals for the roughness parameter of the noise and the integrated price and noise volatilities, in all cases where these quantities are identifiable. 
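The scalable simulation-based inference abstract above builds its metamodel on Monte Carlo log-likelihoods obtained from the bootstrap particle filter. A generic sketch of that estimator follows; the model functions (`init`, `transition`, `log_obs_dens`) are hypothetical user-supplied callables, and the toy usage assumes an AR(1) state with Gaussian observation noise.

```python
import numpy as np

def bootstrap_pf_loglik(y, n_particles, init, transition, log_obs_dens, rng):
    """Monte Carlo log-likelihood estimate for a state-space model."""
    x = init(n_particles, rng)
    loglik = 0.0
    for yt in y:
        x = transition(x, rng)                      # propagate particles
        logw = log_obs_dens(yt, x)                  # weight by observation density
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())              # log of mean unnormalized weight
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = x[idx]                                  # multinomial resampling
    return loglik

# Example: AR(1) latent state, standard normal observation noise.
rng = np.random.default_rng(0)
ll = bootstrap_pf_loglik(
    y=rng.normal(size=50),
    n_particles=500,
    init=lambda n, r: r.normal(size=n),
    transition=lambda x, r: 0.8 * x + r.normal(scale=0.5, size=x.shape),
    log_obs_dens=lambda yt, x: -0.5 * (yt - x) ** 2 - 0.5 * np.log(2 * np.pi),
    rng=rng,
)
print(ll)
```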
In addition to desirable features such as serial dependence of increments, compatibility between different sampling frequencies and diurnal effects, the rough noise model can further explain divergence rates in volatility signature plots that vary considerably over time and between assets."}, "https://arxiv.org/abs/2208.00830": {"title": "Short-time expansion of characteristic functions in a rough volatility setting with applications", "link": "https://arxiv.org/abs/2208.00830", "description": "arXiv:2208.00830v2 Announce Type: replace-cross \nAbstract: We derive a higher-order asymptotic expansion of the conditional characteristic function of the increment of an It\\^o semimartingale over a shrinking time interval. The spot characteristics of the It\\^o semimartingale are allowed to have dynamics of general form. In particular, their paths can be rough, that is, exhibit local behavior like that of a fractional Brownian motion, while at the same time have jumps with arbitrary degree of activity. The expansion result shows the distinct roles played by the different features of the spot characteristics dynamics. As an application of our result, we construct a nonparametric estimator of the Hurst parameter of the diffusive volatility process from portfolios of short-dated options written on an underlying asset."}, "https://arxiv.org/abs/2311.01762": {"title": "Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel", "link": "https://arxiv.org/abs/2311.01762", "description": "arXiv:2311.01762v2 Announce Type: replace-cross \nAbstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes solving a system of linear equations, or iteratively through gradient descent. Using the iterative approach opens up the possibility of changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translation-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that, using a decreasing bandwidth, we are able to achieve zero training error in combination with good generalization, as well as a double descent behavior, phenomena that do not occur for KRR with a constant bandwidth but are known to appear for neural networks."}, "https://arxiv.org/abs/2312.09862": {"title": "Wasserstein-based Minimax Estimation of Dependence in Multivariate Regularly Varying Extremes", "link": "https://arxiv.org/abs/2312.09862", "description": "arXiv:2312.09862v2 Announce Type: replace-cross \nAbstract: We present the first minimax risk bounds for estimators of the spectral measure in multivariate linear factor models, where observations are linear combinations of regularly varying latent factors. Non-asymptotic convergence rates are derived for the multivariate Peak-over-Threshold estimator in terms of the $p$-th order Wasserstein distance, and information-theoretic lower bounds for the minimax risks are established.
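A minimal sketch of the gradient-descent KRR idea from the abstract above (arXiv:2311.01762), with an assumed Gaussian kernel, a geometric bandwidth-decay schedule, and a fixed step size; the paper's actual update scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)

def rbf(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

alpha = np.zeros(len(X))          # dual coefficients of the KRR solution
lam, lr, bw = 1e-3, 0.01, 1.0

for epoch in range(500):
    K = rbf(X, X, bw)
    # Gradient of 0.5*||K a - y||^2 / n + 0.5*lam*a'K a with respect to a.
    grad = K @ (K @ alpha - y) / len(X) + lam * (K @ alpha)
    alpha -= lr * grad
    bw *= 0.995                   # let the bandwidth shrink during training

print(np.mean((rbf(X, X, bw) @ alpha - y) ** 2))   # training error
```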
The convergence rate of the estimator is shown to be minimax optimal under a class of Pareto-type models analogous to the standard class used in the setting of one-dimensional observations known as the Hall-Welsh class. When the estimator is minimax inefficient, a novel two-step estimator is introduced and demonstrated to attain the minimax lower bound. Our analysis bridges the gaps in understanding trade-offs between estimation bias and variance in multivariate extreme value theory."}, "https://arxiv.org/abs/2411.07399": {"title": "Cumulative differences between subpopulations versus body mass index in the Behavioral Risk Factor Surveillance System data", "link": "https://arxiv.org/abs/2411.07399", "description": "arXiv:2411.07399v1 Announce Type: new \nAbstract: Prior works have demonstrated many advantages of cumulative statistics over the classical methods of reliability diagrams, ECEs (empirical, estimated, or expected calibration errors), and ICIs (integrated calibration indices). The advantages pertain to assessing calibration of predicted probabilities, comparison of responses from a subpopulation to the responses from the full population, and comparison of responses from one subpopulation to those from a separate subpopulation. The cumulative statistics include graphs of cumulative differences as a function of the scalar covariate, as well as metrics due to Kuiper and to Kolmogorov and Smirnov that summarize the graphs into single scalar statistics and associated P-values (also known as \"attained significance levels\" for significance tests). However, the prior works have not yet treated data from biostatistics.\n Fortunately, the advantages of the cumulative statistics extend to the Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention. This is unsurprising, since the mathematics is the same as in earlier works. Nevertheless, detailed analysis of the BRFSS data is revealing and corroborates the findings of earlier works.\n Two methodological extensions beyond prior work that facilitate analysis of the BRFSS are (1) empirical estimators of uncertainty for graphs of the cumulative differences between two subpopulations, such that the estimators are valid for any real-valued responses, and (2) estimators of the weighted average treatment effect for the differences in the responses between the subpopulations. Both of these methods concern the case in which none of the covariate's observed values for one subpopulation is equal to any of the covariate's values for the other subpopulation. The data analysis presented reports results for this case as well as several others."}, "https://arxiv.org/abs/2411.07604": {"title": "Dynamic Evolutionary Game Analysis of How Fintech in Banking Mitigates Risks in Agricultural Supply Chain Finance", "link": "https://arxiv.org/abs/2411.07604", "description": "arXiv:2411.07604v1 Announce Type: new \nAbstract: This paper explores the impact of banking fintech on reducing financial risks in the agricultural supply chain, focusing on the secondary allocation of commercial credit. The study constructs a three-player evolutionary game model involving banks, core enterprises, and SMEs to analyze how fintech innovations, such as big data credit assessment, blockchain, and AI-driven risk evaluation, influence financial risks and access to credit. 
The findings reveal that banking fintech reduces financing costs and mitigates financial risks by improving transaction reliability, enhancing risk identification, and minimizing information asymmetry. By optimizing cooperation between banks, core enterprises, and SMEs, fintech solutions enhance the stability of the agricultural supply chain, contributing to rural revitalization goals and sustainable agricultural development. The study provides new theoretical insights and practical recommendations for improving agricultural finance systems and reducing financial risks.\n Keywords: banking fintech, agricultural supply chain, financial risk, commercial credit, SMEs, evolutionary game model, big data, blockchain, AI-driven risk evaluation."}, "https://arxiv.org/abs/2411.07617": {"title": "Semi-supervised learning using copula-based regression and model averaging", "link": "https://arxiv.org/abs/2411.07617", "description": "arXiv:2411.07617v1 Announce Type: new \nAbstract: The available data in semi-supervised learning usually consists of relatively small sized labeled data and much larger sized unlabeled data. How to effectively exploit unlabeled data is the key issue. In this paper, we write the regression function in the form of a copula and marginal distributions, and the unlabeled data can be exploited to improve the estimation of the marginal distributions. The predictions based on different copulas are weighted, where the weights are obtained by minimizing an asymptotic unbiased estimator of the prediction risk. Error-ambiguity decomposition of the prediction risk is performed such that unlabeled data can be exploited to improve the prediction risk estimation. We demonstrate the asymptotic normality of copula parameters and regression function estimators of the candidate models under the semi-supervised framework, as well as the asymptotic optimality and weight consistency of the model averaging estimator. Our model averaging estimator achieves faster convergence rates of asymptotic optimality and weight consistency than the supervised counterpart. Extensive simulation experiments and the California housing dataset demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2411.07651": {"title": "Quasi-Bayes empirical Bayes: a sequential approach to the Poisson compound decision problem", "link": "https://arxiv.org/abs/2411.07651", "description": "arXiv:2411.07651v1 Announce Type: new \nAbstract: The Poisson compound decision problem is a classical problem in statistics, for which parametric and nonparametric empirical Bayes methodologies are available to estimate the Poisson's means in static or batch domains. In this paper, we consider the Poisson compound decision problem in a streaming or online domain. By relying on a quasi-Bayesian approach, often referred to as Newton's algorithm, we obtain sequential Poisson's mean estimates that are of easy evaluation, computationally efficient and with a constant computational cost as data increase, which is desirable for streaming data. Large sample asymptotic properties of the proposed estimates are investigated, also providing frequentist guarantees in terms of a regret analysis. 
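A minimal sketch of the kind of quasi-Bayesian (Newton-style) recursion the Poisson compound decision abstract above refers to, on an assumed grid and with an assumed learning-rate sequence; it is not the authors' exact implementation.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
true_means = rng.gamma(shape=2.0, scale=2.0, size=1000)
x_stream = rng.poisson(true_means)            # streaming counts

grid = np.linspace(0.05, 20, 400)             # support for the mixing distribution
pi = np.full(len(grid), 1 / len(grid))        # initial quasi-posterior weights

estimates = []
for n, x in enumerate(x_stream, start=1):
    lik = poisson.pmf(x, grid)
    post = lik * pi
    post /= post.sum()
    estimates.append(np.sum(grid * post))     # plug-in estimate of the current mean
    eps = 1.0 / (n + 1)                       # assumed learning-rate sequence
    pi = (1 - eps) * pi + eps * post          # Newton-style recursive update

print(np.mean((np.array(estimates) - true_means) ** 2))
```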
We validate our methodology empirically, on both synthetic and real data, comparing against the most popular alternatives."}, "https://arxiv.org/abs/2411.07808": {"title": "Spatial Competition on Psychological Pricing Strategies: Preliminary Evidence from an Online Marketplace", "link": "https://arxiv.org/abs/2411.07808", "description": "arXiv:2411.07808v1 Announce Type: new \nAbstract: According to Kadir et al. (2023), online marketplaces are used to buy and sell products and services, as well as to exchange money and data between users or the platform. Due to the large product selection, low costs and the ease of shopping without physical restrictions as well as the technical possibilities, online marketplaces have grown rapidly (Kadir et al., 2023). Online marketplaces are also used in the consumer-to-consumer (C2C) sector and thus offer a broad user group a marketplace, for example for used products. This article focuses on Willhaben.at (2024), a leading C2C marketplace in Austria, as stated by Obersteiner, Schmied, and Pamperl (2023). The empirical analysis in this course essay centers on the offer ads of Woom Bikes, a standardised product which is sold on Willhaben. Through web scraping, a dataset of approximately 826 observations was created, focusing on mid-to-high price segment bicycles, which, as we claim, are characterized by price stability and uniformity. This analysis aims to analyse ad listing prices through predictive models using Willhaben product listing attributes and the spatial distribution of one of the product attributes."}, "https://arxiv.org/abs/2411.07817": {"title": "Impact of R&D and AI Investments on Economic Growth and Credit Rating", "link": "https://arxiv.org/abs/2411.07817", "description": "arXiv:2411.07817v1 Announce Type: new \nAbstract: The research and development (R&D) phase is essential for fostering innovation and aligns with long-term strategies in both public and private sectors. This study addresses two primary research questions: (1) assessing the relationship between R&D investments and GDP through regression analysis, and (2) estimating the economic value added (EVA) that Georgia must generate to progress from a BB to a BBB credit rating. Using World Bank data from 2014-2022, this analysis found that increasing R&D, with an emphasis on AI, by 30-35% has a measurable impact on GDP. Regression results reveal a coefficient of 7.02%, indicating that a 10% increase in R&D leads to a 0.70% GDP rise, with an 81.1% determination coefficient and a strong 90.1% correlation.\n Georgia's EVA model was calculated to determine the additional value needed for a BBB rating, comparing indicators from Greece, Hungary, India, and Kazakhstan as benchmarks. Key economic indicators considered were nominal GDP, GDP per capita, real GDP growth, and fiscal indicators (government balance/GDP, debt/GDP). The EVA model projects that to achieve a BBB rating within nine years, Georgia requires $61.7 billion in investments. Utilizing EVA and comprehensive economic indicators will support informed decision-making and enhance the analysis of Georgia's economic trajectory."}, "https://arxiv.org/abs/2411.07874": {"title": "Changepoint Detection in Complex Models: Cross-Fitting Is Needed", "link": "https://arxiv.org/abs/2411.07874", "description": "arXiv:2411.07874v1 Announce Type: new \nAbstract: Changepoint detection is commonly approached by minimizing the sum of in-sample losses to quantify the model's overall fit across distinct data segments.
However, we observe that flexible modeling techniques, particularly those involving hyperparameter tuning or model selection, often lead to inaccurate changepoint estimation due to biases that distort the target of in-sample loss minimization. To mitigate this issue, we propose a novel cross-fitting methodology that incorporates out-of-sample loss evaluations using independent samples separate from those used for model fitting. This approach ensures consistent changepoint estimation, contingent solely upon the models' predictive accuracy across nearly homogeneous data segments. Extensive numerical experiments demonstrate that our proposed cross-fitting strategy significantly enhances the reliability and adaptability of changepoint detection in complex scenarios."}, "https://arxiv.org/abs/2411.07952": {"title": "Matching $\\leq$ Hybrid $\\leq$ Difference in Differences", "link": "https://arxiv.org/abs/2411.07952", "description": "arXiv:2411.07952v1 Announce Type: new \nAbstract: Since LaLonde's (1986) seminal paper, there has been ongoing interest in estimating treatment effects using pre- and post-intervention data. Scholars have traditionally used experimental benchmarks to evaluate the accuracy of alternative econometric methods, including Matching, Difference-in-Differences (DID), and their hybrid forms (e.g., Heckman et al., 1998b; Dehejia and Wahba, 2002; Smith and Todd, 2005). We revisit these methodologies in the evaluation of job training and educational programs using four datasets (LaLonde, 1986; Heckman et al., 1998a; Smith and Todd, 2005; Chetty et al., 2014a; Athey et al., 2020), and show that the inequality relationship, Matching $\\leq$ Hybrid $\\leq$ DID, appears as a consistent norm, rather than a mere coincidence. We provide a formal theoretical justification for this puzzling phenomenon under plausible conditions such as negative selection, by generalizing the classical bracketing (Angrist and Pischke, 2009, Section 5). Consequently, when treatments are expected to be non-negative, DID tends to provide optimistic estimates, while Matching offers more conservative ones. Keywords: bias, difference in differences, educational program, job training program, matching."}, "https://arxiv.org/abs/2411.07978": {"title": "Doubly Robust Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2411.07978", "description": "arXiv:2411.07978v1 Announce Type: new \nAbstract: This study introduces a doubly robust (DR) estimator for regression discontinuity (RD) designs. In RD designs, treatment effects are estimated in a quasi-experimental setting where treatment assignment depends on whether a running variable surpasses a predefined cutoff. A common approach in RD estimation is to apply nonparametric regression methods, such as local linear regression. In such an approach, the validity relies heavily on the consistency of nonparametric estimators and is limited by the nonparametric convergence rate, thereby preventing $\\sqrt{n}$-consistency. To address these issues, we propose the DR-RD estimator, which combines two distinct estimators for the conditional expected outcomes. If either of these estimators is consistent, the treatment effect estimator remains consistent. 
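For concreteness alongside the Matching <= Hybrid <= DID abstract above, here are toy versions of the three estimators being compared. These stripped-down definitions (scalar covariate, one pre and one post period, 1-nearest-neighbour matching) are illustrative only and are not the specifications used in the cited studies.

```python
import numpy as np

def did(y0_t, y1_t, y0_c, y1_c):
    # Difference-in-differences: change for treated minus change for controls.
    return (y1_t - y0_t).mean() - (y1_c - y0_c).mean()

def matching(x_t, y1_t, x_c, y1_c):
    # Nearest-neighbour matching on the covariate, post-period outcomes only.
    idx = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
    return (y1_t - y1_c[idx]).mean()

def hybrid(x_t, y0_t, y1_t, x_c, y0_c, y1_c):
    # Matching on the covariate combined with differencing out pre-period levels.
    idx = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
    return ((y1_t - y0_t) - (y1_c[idx] - y0_c[idx])).mean()
```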
Furthermore, due to the debiasing effect, our proposed estimator achieves $\\sqrt{n}$-consistency if both regression estimators satisfy certain mild conditions, which also simplifies statistical inference."}, "https://arxiv.org/abs/2411.07984": {"title": "Scalable piecewise smoothing with BART", "link": "https://arxiv.org/abs/2411.07984", "description": "arXiv:2411.07984v1 Announce Type: new \nAbstract: Although it is an extremely effective, easy-to-use, and increasingly popular tool for nonparametric regression, the Bayesian Additive Regression Trees (BART) model is limited by the fact that it can only produce discontinuous output. Initial attempts to overcome this limitation were based on regression trees that output Gaussian Processes instead of constants. Unfortunately, implementations of these extensions cannot scale to large datasets. We propose ridgeBART, an extension of BART built with trees that output linear combinations of ridge functions (i.e., a composition of an affine transformation of the inputs and a non-linearity); that is, we build a Bayesian ensemble of localized neural networks with a single hidden layer. We develop a new MCMC sampler that updates trees in linear time and establish nearly minimax-optimal posterior contraction rates for estimating Sobolev and piecewise anisotropic Sobolev functions. We demonstrate ridgeBART's effectiveness on synthetic data and use it to estimate the probability that a professional basketball player makes a shot from any location on the court in a spatially smooth fashion."}, "https://arxiv.org/abs/2411.07800": {"title": "Kernel-based retrieval models for hyperspectral image data optimized with Kernel Flows", "link": "https://arxiv.org/abs/2411.07800", "description": "arXiv:2411.07800v1 Announce Type: cross \nAbstract: Kernel-based statistical methods are efficient, but their performance depends heavily on the selection of kernel parameters. In the literature, optimization studies on kernel-based chemometric methods are limited and often reduced to grid searches. Previously, the authors introduced Kernel Flows (KF) to learn kernel parameters for Kernel Partial Least-Squares (K-PLS) regression. KF is easy to implement and helps minimize overfitting. In cases of high collinearity between spectra and biogeophysical quantities in spectroscopy, simpler methods like Principal Component Regression (PCR) may be more suitable. In this study, we propose a new KF-type approach to optimize Kernel Principal Component Regression (K-PCR) and test it alongside KF-PLS. Both methods are benchmarked against non-linear regression techniques using two hyperspectral remote sensing datasets."}, "https://arxiv.org/abs/2411.08019": {"title": "Language Models as Causal Effect Generators", "link": "https://arxiv.org/abs/2411.08019", "description": "arXiv:2411.08019v1 Announce Type: cross \nAbstract: We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure.
We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure."}, "https://arxiv.org/abs/2304.03247": {"title": "A Bayesian Framework for Causal Analysis of Recurrent Events with Timing Misalignment", "link": "https://arxiv.org/abs/2304.03247", "description": "arXiv:2304.03247v2 Announce Type: replace \nAbstract: Observational studies of recurrent event rates are common in biomedical statistics. Broadly, the goal is to estimate differences in event rates under two treatments within a defined target population over a specified followup window. Estimation with observational data is challenging because, while membership in the target population is defined in terms of eligibility criteria, treatment is rarely observed exactly at the time of eligibility. Ad-hoc solutions to this timing misalignment can induce bias by incorrectly attributing prior event counts and person-time to treatment. Even if eligibility and treatment are aligned, a terminal event process (e.g. death) often stops the recurrent event process of interest. In practice, both processes can be censored so that events are not observed over the entire followup window. Our approach addresses misalignment by casting it as a time-varying treatment problem: some patients are on treatment at eligibility while others are off treatment but may switch to treatment at a specified time - if they survive long enough. We define and identify an average causal effect estimand under right-censoring. Estimation is done using a g-computation procedure with a joint semiparametric Bayesian model for the death and recurrent event processes. We apply the method to contrast hospitalization rates among patients with different opioid treatments using Medicare insurance claims data."}, "https://arxiv.org/abs/2306.04182": {"title": "Simultaneous Estimation and Dataset Selection for Transfer Learning in High Dimensions by a Non-convex Penalty", "link": "https://arxiv.org/abs/2306.04182", "description": "arXiv:2306.04182v3 Announce Type: replace \nAbstract: In this paper, we propose to estimate model parameters and identify informative source datasets simultaneously for high-dimensional transfer learning problems with the aid of a non-convex penalty, in contrast to the separate useful dataset selection and transfer learning procedures in the existing literature. To numerically solve the non-convex problem with respect to two specific statistical models, namely the sparse linear regression and the generalized low-rank trace regression models, we adopt the difference of convex (DC) programming with the alternating direction method of multipliers (ADMM) procedures. We theoretically justify the proposed algorithm from both statistical and computational perspectives. 
Extensive numerical results are reported alongside to validate the theoretical assertions. An \\texttt{R} package \\texttt{MHDTL} is developed to implement the proposed methods."}, "https://arxiv.org/abs/2306.13362": {"title": "Factor-augmented sparse MIDAS regressions with an application to nowcasting", "link": "https://arxiv.org/abs/2306.13362", "description": "arXiv:2306.13362v3 Announce Type: replace \nAbstract: This article investigates factor-augmented sparse MIDAS (Mixed Data Sampling) regressions for high-dimensional time series data, which may be observed at different frequencies. Our novel approach integrates sparse and dense dimensionality reduction techniques. We derive the convergence rate of our estimator under misspecification, $\\tau$-mixing dependence, and polynomial tails. Our method's finite sample performance is assessed via Monte Carlo simulations. We apply the methodology to nowcasting U.S. GDP growth and demonstrate that it outperforms both sparse regression and standard factor-augmented regression during the COVID-19 pandemic. To ensure the robustness of these results, we also implement factor-augmented sparse logistic regression, which further confirms the superior accuracy of our nowcast probabilities during recessions. These findings indicate that recessions are influenced by both idiosyncratic (sparse) and common (dense) shocks."}, "https://arxiv.org/abs/2309.08039": {"title": "Flexible Functional Treatment Effect Estimation", "link": "https://arxiv.org/abs/2309.08039", "description": "arXiv:2309.08039v2 Announce Type: replace \nAbstract: We study treatment effect estimation with functional treatments where the average potential outcome functional is a function of functions, in contrast to continuous treatment effect estimation where the target is a function of real numbers. By considering a flexible scalar-on-function marginal structural model, a weight-modified kernel ridge regression (WMKRR) is adopted for estimation. The weights are constructed by directly minimizing the uniform balancing error resulting from a decomposition of the WMKRR estimator, instead of being estimated under a particular treatment selection model. Despite the complex structure of the uniform balancing error derived under WMKRR, finite-dimensional convex algorithms can be applied to efficiently solve for the proposed weights thanks to a representer theorem. The optimal convergence rate is shown to be attainable by the proposed WMKRR estimator without any smoothness assumption on the true weight function. Corresponding empirical performance is demonstrated by a simulation study and a real data application."}, "https://arxiv.org/abs/2310.01589": {"title": "Exact Gradient Evaluation for Adaptive Quadrature Approximate Marginal Likelihood in Mixed Models for Grouped Data", "link": "https://arxiv.org/abs/2310.01589", "description": "arXiv:2310.01589v2 Announce Type: replace \nAbstract: A method is introduced for approximate marginal likelihood inference via adaptive Gaussian quadrature in mixed models with a single grouping factor. The core technical contribution is an algorithm for computing the exact gradient of the approximate log-marginal likelihood. This leads to efficient maximum likelihood via quasi-Newton optimization that is demonstrated to be faster than existing approaches based on finite-differenced gradients or derivative-free optimization. 
The method is specialized to Bernoulli mixed models with multivariate, correlated Gaussian random effects; here computations are performed using an inverse log-Cholesky parameterization of the Gaussian density that involves no matrix decomposition during model fitting, while Wald confidence intervals are provided for variance parameters on the original scale. Simulations give evidence of these intervals attaining nominal coverage if enough quadrature points are used, for data comprised of a large number of very small groups exhibiting large between-group heterogeneity. The Laplace approximation is well-known to give especially poor coverage and high bias for data comprised of a large number of small groups. Adaptive quadrature mitigates this, and the methods in this paper improve the computational feasibility of this more accurate method. All results may be reproduced using code available at \\url{https://github.com/awstringer1/aghmm-paper-code}."}, "https://arxiv.org/abs/2411.08150": {"title": "Targeted Maximum Likelihood Estimation for Integral Projection Models in Population Ecology", "link": "https://arxiv.org/abs/2411.08150", "description": "arXiv:2411.08150v1 Announce Type: new \nAbstract: Integral projection models (IPMs) are widely used to study population growth and the dynamics of demographic structure (e.g. age and size distributions) within a population. These models use data on individuals' growth, survival, and reproduction to predict changes in the population from one time point to the next and use these in turn to ask about long-term growth rates, the sensitivity of that growth rate to environmental factors, and aspects of the long-term population such as how much reproduction concentrates in a few individuals; these quantities are not directly measurable from data and must be inferred from the model. Building IPMs requires us to develop models for individual fates over the next time step -- Did they survive? How much did they grow or shrink? Did they reproduce? -- conditional on their initial state as well as on environmental covariates in a manner that accounts for the unobservable quantities that we are ultimately interested in estimating. Targeted maximum likelihood estimation (TMLE) methods are particularly well-suited to a framework in which we are largely interested in the consequences of models. These build machine learning-based models that estimate the probability distribution of the data we observe and define a target of inference as a function of these. The initial estimate for the distribution is then modified by tilting in the direction of the efficient influence function to both de-bias the parameter estimate and provide more accurate inference. In this paper, we employ TMLE to develop robust and efficient estimators for properties derived from a fitted IPM. Mathematically, we derive the efficient influence function and formulate the paths for the least favorable sub-models. Empirically, we conduct extensive simulations using real data from both long-term studies of Idaho steppe plant communities and experimental Rotifer populations."}, "https://arxiv.org/abs/2411.08188": {"title": "MSTest: An R-Package for Testing Markov Switching Models", "link": "https://arxiv.org/abs/2411.08188", "description": "arXiv:2411.08188v1 Announce Type: new \nAbstract: We present the R package MSTest, which implements hypothesis testing procedures to identify the number of regimes in Markov switching models.
These models have wide-ranging applications in economics, finance, and numerous other fields. The MSTest package includes the Monte Carlo likelihood ratio test procedures proposed by Rodriguez-Rondon and Dufour (2024), the moment-based tests of Dufour and Luger (2017), the parameter stability tests of Carrasco, Hu, and Ploberger (2014), and the likelihood ratio test of Hansen (1992). Additionally, the package enables users to simulate and estimate univariate and multivariate Markov switching and hidden Markov processes, using the expectation-maximization (EM) algorithm or maximum likelihood estimation (MLE). We demonstrate the functionality of the MSTest package through both simulation experiments and an application to U.S. GNP growth data."}, "https://arxiv.org/abs/2411.08256": {"title": "$K$-means clustering for sparsely observed longitudinal data", "link": "https://arxiv.org/abs/2411.08256", "description": "arXiv:2411.08256v1 Announce Type: new \nAbstract: In longitudinal data analysis, observation points of repeated measurements over time often vary among subjects except in well-designed experimental studies. Additionally, measurements for each subject are typically obtained at only a few time points. From such sparsely observed data, identifying underlying cluster structures can be challenging. This paper proposes a fast and simple clustering method that generalizes the classical $k$-means method to identify cluster centers in sparsely observed data. The proposed method employs the basis function expansion to model the cluster centers, providing an effective way to estimate cluster centers from fragmented data. We establish the statistical consistency of the proposed method, as with the classical $k$-means method. Through numerical experiments, we demonstrate that the proposed method performs competitively with, or even outperforms, existing clustering methods. Moreover, the proposed method offers significant gains in computational efficiency due to its simplicity. Applying the proposed method to real-world data illustrates its effectiveness in identifying cluster structures in sparsely observed data."}, "https://arxiv.org/abs/2411.08262": {"title": "Adaptive Shrinkage with a Nonparametric Bayesian Lasso", "link": "https://arxiv.org/abs/2411.08262", "description": "arXiv:2411.08262v1 Announce Type: new \nAbstract: Modern approaches to perform Bayesian variable selection rely mostly on the use of shrinkage priors. That said, an ideal shrinkage prior should be adaptive to different signal levels, ensuring that small effects are ruled out, while keeping relatively intact the important ones. With this task in mind, we develop the nonparametric Bayesian Lasso, an adaptive and flexible shrinkage prior for Bayesian regression and variable selection, particularly useful when the number of predictors is comparable or larger than the number of available data points. We build on spike-and-slab Lasso ideas and extend them by placing a Dirichlet Process prior on the shrinkage parameters. The result is a prior on the regression coefficients that can be seen as an infinite mixture of Double Exponential densities, all offering different amounts of regularization, ensuring a more adaptive and flexible shrinkage. We also develop an efficient Markov chain Monte Carlo algorithm for posterior inference. 
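A minimal sketch of the basis-expansion k-means idea described in the sparse longitudinal clustering abstract above, using an assumed polynomial basis and synthetic fragments; the paper's basis choice and theory are not reproduced here.

```python
import numpy as np

def basis(t, K=5):
    # Simple polynomial basis on [0, 1]; any basis expansion could be used.
    return np.vstack([t ** k for k in range(K)]).T

def sparse_kmeans(subjects, n_clusters, K=5, n_iter=20, seed=0):
    """subjects: list of (times, values) pairs with few, irregular observations."""
    rng = np.random.default_rng(seed)
    coefs = rng.normal(size=(n_clusters, K))               # centre coefficients
    labels = np.zeros(len(subjects), dtype=int)
    for _ in range(n_iter):
        # Assignment: nearest centre, evaluated at each subject's own time points.
        for i, (t, y) in enumerate(subjects):
            B = basis(t, K)
            labels[i] = ((y[None, :] - coefs @ B.T) ** 2).mean(axis=1).argmin()
        # Update: least squares on the pooled fragments within each cluster.
        for c in range(n_clusters):
            idx = [i for i in range(len(subjects)) if labels[i] == c]
            if idx:
                B = np.vstack([basis(subjects[i][0], K) for i in idx])
                y = np.concatenate([subjects[i][1] for i in idx])
                coefs[c] = np.linalg.lstsq(B, y, rcond=None)[0]
    return labels, coefs

rng = np.random.default_rng(1)
subjects = []
for i in range(150):
    t = np.sort(rng.uniform(size=rng.integers(3, 6)))      # 3-5 observations each
    mean = np.sin(2 * np.pi * t) if i % 2 else 2.0 * t
    subjects.append((t, mean + 0.2 * rng.normal(size=len(t))))
labels, _ = sparse_kmeans(subjects, n_clusters=2)
```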
Through various simulation exercises and real-world data analyses, we demonstrate that our proposed method leads to a better recovery of the true regression coefficients, a better variable selection, and better out-of-sample predictions, highlighting the benefits of the nonparametric Bayesian Lasso over existing shrinkage priors."}, "https://arxiv.org/abs/2411.08315": {"title": "Optimal individualized treatment regimes for survival data with competing risks", "link": "https://arxiv.org/abs/2411.08315", "description": "arXiv:2411.08315v1 Announce Type: new \nAbstract: Precision medicine leverages patient heterogeneity to estimate individualized treatment regimens: formalized, data-driven approaches designed to match patients with optimal treatments. In the presence of competing events, where multiple causes of failure can occur and one cause precludes others, it is crucial to assess the risk of the specific outcome of interest, such as one type of failure over another. This helps clinicians tailor interventions based on the factors driving that particular cause, leading to more precise treatment strategies. Currently, no precision medicine methods simultaneously account for both survival and competing risk endpoints. To address this gap, we develop a nonparametric individualized treatment regime estimator. Our two-phase method accounts for both overall survival from all events and the cumulative incidence of a main event of interest. Additionally, we introduce a multi-utility value function that incorporates both outcomes. We develop random survival and random cumulative incidence forests to construct individual survival and cumulative incidence curves. Simulation studies demonstrated that our proposed method performs well; we applied it to a cohort of peripheral artery disease patients at high risk for limb loss and mortality."}, "https://arxiv.org/abs/2411.08352": {"title": "Imputation-based randomization tests for randomized experiments with interference", "link": "https://arxiv.org/abs/2411.08352", "description": "arXiv:2411.08352v1 Announce Type: new \nAbstract: The presence of interference renders classic Fisher randomization tests infeasible due to nuisance unknowns. To address this issue, we propose imputing the nuisance unknowns and computing Fisher randomization p-values multiple times, then averaging them. We term this approach the imputation-based randomization test and provide theoretical results on its asymptotic validity. Our method leverages the merits of randomization and the flexibility of the Bayesian framework: for multiple imputations, we can either employ the empirical distribution of observed outcomes to achieve robustness against model mis-specification or utilize a parametric model to incorporate prior information. Simulation results demonstrate that our method effectively controls the type I error rate and significantly enhances the testing power compared to existing randomization tests for randomized experiments with interference.
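A schematic sketch of the averaging idea in the imputation-based randomization test abstract above; `impute` and `test_stat` are hypothetical user-supplied callables and the re-randomization step is simplified to permutations, so this illustrates only the averaging logic, not the paper's treatment of interference.

```python
import numpy as np

def imputation_randomization_pvalue(y_obs, z_obs, impute, test_stat,
                                    n_imp=50, n_perm=500, seed=0):
    """Average Fisher randomization p-values over imputations of nuisance unknowns."""
    rng = np.random.default_rng(seed)
    t_obs = test_stat(y_obs, z_obs)
    pvals = []
    for _ in range(n_imp):
        # impute() returns a 'science table': a function mapping an assignment
        # vector to the outcomes it would imply under the null hypothesis.
        y_under = impute(y_obs, z_obs, rng)
        t_null = np.array([
            test_stat(y_under(z_new), z_new)
            for z_new in (rng.permutation(z_obs) for _ in range(n_perm))
        ])
        pvals.append(np.mean(t_null >= t_obs))
    return float(np.mean(pvals))
```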
We apply our method to a two-round randomized experiment with multiple treatments and one-way interference, where existing randomization tests exhibit limited power."}, "https://arxiv.org/abs/2411.08390": {"title": "Expected Information Gain Estimation via Density Approximations: Sample Allocation and Dimension Reduction", "link": "https://arxiv.org/abs/2411.08390", "description": "arXiv:2411.08390v1 Announce Type: new \nAbstract: Computing expected information gain (EIG) from prior to posterior (equivalently, mutual information between candidate observations and model parameters or other quantities of interest) is a fundamental challenge in Bayesian optimal experimental design. We formulate flexible transport-based schemes for EIG estimation in general nonlinear/non-Gaussian settings, compatible with both standard and implicit Bayesian models. These schemes are representative of two-stage methods for estimating or bounding EIG using marginal and conditional density estimates. In this setting, we analyze the optimal allocation of samples between training (density estimation) and approximation of the outer prior expectation. We show that with this optimal sample allocation, the MSE of the resulting EIG estimator converges more quickly than that of a standard nested Monte Carlo scheme. We then address the estimation of EIG in high dimensions, by deriving gradient-based upper bounds on the mutual information lost by projecting the parameters and/or observations to lower-dimensional subspaces. Minimizing these upper bounds yields projectors and hence low-dimensional EIG approximations that outperform approximations obtained via other linear dimension reduction schemes. Numerical experiments on a PDE-constrained Bayesian inverse problem also illustrate a favorable trade-off between dimension truncation and the modeling of non-Gaussianity, when estimating EIG from finite samples in high dimensions."}, "https://arxiv.org/abs/2411.08491": {"title": "Covariate Adjustment in Randomized Experiments Motivated by Higher-Order Influence Functions", "link": "https://arxiv.org/abs/2411.08491", "description": "arXiv:2411.08491v1 Announce Type: new \nAbstract: Higher-Order Influence Functions (HOIF), developed in a series of papers over the past twenty years, is a fundamental theoretical device for constructing rate-optimal causal-effect estimators from observational studies. However, the value of HOIF for analyzing well-conducted randomized controlled trials (RCTs) has not been explicitly explored. In the recent US Food \\& Drug Administration (FDA) and European Medicines Agency (EMA) guidelines on the practice of covariate adjustment in analyzing RCTs, in addition to the simple, unadjusted difference-in-means estimator, it was also recommended to report the estimator adjusting for baseline covariates via a simple parametric working model, such as a linear model. In this paper, we show that an HOIF-motivated estimator for the treatment-specific mean has significantly improved statistical properties compared to popular adjusted estimators in practice when the number of baseline covariates $p$ is relatively large compared to the sample size $n$. We also characterize the conditions under which the HOIF-motivated estimator improves upon the unadjusted estimator. Furthermore, we demonstrate that a novel debiased adjusted estimator proposed recently by Lu et al. is, in fact, another HOIF-motivated estimator in disguise.
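For reference beside the EIG abstract above, here is the standard nested Monte Carlo baseline that the paper's transport-based schemes are compared against, on a toy linear-Gaussian design where the exact EIG is known in closed form; this is the textbook estimator, not the paper's method.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma = 0.5                      # observation noise; prior is theta ~ N(0, 1)
N, M = 2000, 2000                # outer and inner Monte Carlo sample sizes

theta_outer = rng.normal(size=N)
y = theta_outer + sigma * rng.normal(size=N)      # one simulated observation per draw
theta_inner = rng.normal(size=M)                  # fresh prior draws for the marginal

log_lik = norm.logpdf(y, loc=theta_outer, scale=sigma)
log_marg = np.array([np.log(np.mean(norm.pdf(yi, loc=theta_inner, scale=sigma)))
                     for yi in y])
eig_hat = np.mean(log_lik - log_marg)             # nested Monte Carlo EIG estimate

print(eig_hat, 0.5 * np.log(1 + 1 / sigma ** 2))  # compare with the exact value
```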
Finally, simulation studies are conducted to corroborate our theoretical findings."}, "https://arxiv.org/abs/2411.08495": {"title": "Confidence intervals for adaptive trial designs I: A methodological review", "link": "https://arxiv.org/abs/2411.08495", "description": "arXiv:2411.08495v1 Announce Type: new \nAbstract: Regulatory guidance notes the need for caution in the interpretation of confidence intervals (CIs) constructed during and after an adaptive clinical trial. Conventional CIs of the treatment effects are prone to undercoverage (as well as other undesirable properties) in many adaptive designs, because they do not take into account the potential and realised trial adaptations. This paper is the first in a two-part series that explores CIs for adaptive trials. It provides a comprehensive review of the methods to construct CIs for adaptive designs, while the second paper illustrates how to implement these in practice and proposes a set of guidelines for trial statisticians. We describe several classes of techniques for constructing CIs for adaptive clinical trials, before providing a systematic literature review of available methods, classified by the type of adaptive design. As part of this, we assess, through a proposed traffic light system, which of several desirable features of CIs (such as achieving nominal coverage and consistency with the hypothesis test decision) each of these methods holds."}, "https://arxiv.org/abs/2411.08524": {"title": "Evaluating Parameter Uncertainty in the Poisson Lognormal Model with Corrected Variational Estimators", "link": "https://arxiv.org/abs/2411.08524", "description": "arXiv:2411.08524v1 Announce Type: new \nAbstract: Count data analysis is essential across diverse fields, from ecology and accident analysis to single-cell RNA sequencing (scRNA-seq) and metagenomics. While log transformations are computationally efficient, model-based approaches such as the Poisson-Log-Normal (PLN) model provide robust statistical foundations and are more amenable to extensions. The PLN model, with its latent Gaussian structure, not only captures overdispersion but also enables correlation between variables and inclusion of covariates, making it suitable for multivariate count data analysis. Variational approximations are a golden standard to estimate parameters of complex latent variable models such as PLN, maximizing a surrogate likelihood. However, variational estimators lack theoretical statistical properties such as consistency and asymptotic normality. In this paper, we investigate the consistency and variance estimation of PLN parameters using M-estimation theory. We derive the Sandwich estimator, previously studied in Westling and McCormick (2019), specifically for the PLN model. We compare this approach to the variational Fisher Information method, demonstrating the Sandwich estimator's effectiveness in terms of coverage through simulation studies. Finally, we validate our method on a scRNA-seq dataset."}, "https://arxiv.org/abs/2411.08625": {"title": "A Conjecture on Group Decision Accuracy in Voter Networks through the Regularized Incomplete Beta Function", "link": "https://arxiv.org/abs/2411.08625", "description": "arXiv:2411.08625v1 Announce Type: new \nAbstract: This paper presents a conjecture on the regularized incomplete beta function in the context of majority decision systems modeled through a voter framework. 
We examine a network where voters interact, with some voters fixed in their decisions while others are free to change their states based on the influence of their neighbors. We demonstrate that as the number of free voters increases, the probability of selecting the correct majority outcome converges to $1-I_{0.5}(\\alpha,\\beta)$, where $I_{0.5}(\\alpha,\\beta)$ is the regularized incomplete beta function. The conjecture posits that when $\\alpha > \\beta$, $1-I_{0.5}(\\alpha,\\beta) > \\alpha/(\\alpha+\\beta)$, meaning the group's decision accuracy exceeds that of an individual voter. We provide partial results, including a proof for integer values of $\\alpha$ and $\\beta$, and support the general case using a probability bound. This work extends Condorcet's Jury Theorem by incorporating voter dependence driven by network dynamics, showing that group decision accuracy can exceed individual accuracy under certain conditions."}, "https://arxiv.org/abs/2411.08698": {"title": "Statistical Operating Characteristics of Current Early Phase Dose Finding Designs with Toxicity and Efficacy in Oncology", "link": "https://arxiv.org/abs/2411.08698", "description": "arXiv:2411.08698v1 Announce Type: new \nAbstract: Traditional phase I dose finding cancer clinical trial designs aim to determine the maximum tolerated dose (MTD) of the investigational cytotoxic agent based on a single toxicity outcome, assuming a monotone dose-response relationship. However, this assumption might not always hold for newly emerging therapies such as immuno-oncology therapies and molecularly targeted therapies, making conventional dose finding trial designs based on toxicity no longer appropriate. To tackle this issue, numerous early phase dose finding clinical trial designs have been developed to identify the optimal biological dose (OBD), which takes both toxicity and efficacy outcomes into account. In this article, we review the current model-assisted dose finding designs, BOIN-ET, BOIN12, UBI, TEPI-2, PRINTE, STEIN, and uTPI to identify the OBD and compare their operating characteristics. Extensive simulation studies and a case study using a CAR T-cell therapy phase I trial have been conducted to compare the performance of the aforementioned designs under different possible dose-response relationship scenarios. The simulation results demonstrate that the performance of different designs varies depending on the particular dose-response relationship and the specific metric considered. Based on our simulation results and practical considerations, STEIN, PRINTE, and BOIN12 outperform the other designs from different perspectives."}, "https://arxiv.org/abs/2411.08771": {"title": "Confidence intervals for adaptive trial designs II: Case study and practical guidance", "link": "https://arxiv.org/abs/2411.08771", "description": "arXiv:2411.08771v1 Announce Type: new \nAbstract: In adaptive clinical trials, the conventional confidence interval (CI) for a treatment effect is prone to undesirable properties such as undercoverage and potential inconsistency with the final hypothesis testing decision. Accordingly, as is stated in recent regulatory guidance on adaptive designs, there is the need for caution in the interpretation of CIs constructed during and after an adaptive clinical trial. However, it may be unclear which of the available CIs in the literature are preferable. This paper is the second in a two-part series that explores CIs for adaptive trials. 
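Editorial note: the voter-network abstract above states that the limiting group accuracy is $1-I_{0.5}(\alpha,\beta)$ and conjectures that it exceeds the individual accuracy $\alpha/(\alpha+\beta)$ whenever $\alpha > \beta$. A quick numerical check of that inequality (my own illustration, not code from the paper) can be run with scipy.special.betainc, which evaluates the regularized incomplete beta function $I_x(a,b)$.

```python
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(1)
for _ in range(5):
    beta = rng.uniform(0.5, 5.0)
    alpha = beta + rng.uniform(0.1, 5.0)      # enforce alpha > beta
    group = 1.0 - betainc(alpha, beta, 0.5)   # limiting group accuracy
    individual = alpha / (alpha + beta)       # single-voter accuracy
    print(f"alpha={alpha:.2f} beta={beta:.2f} "
          f"group={group:.4f} individual={individual:.4f} "
          f"conjecture holds: {group > individual}")
```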
Part I provided a methodological review of approaches to construct CIs for adaptive designs. In this paper (part II), we present an extended case study based around a two-stage group sequential trial, including a comprehensive simulation study of the proposed CIs for this setting. This facilitates an expanded description of considerations around what makes for an effective CI procedure following an adaptive trial. We show that the CIs can have notably different properties. Finally, we propose a set of guidelines for researchers around the choice of CIs and the reporting of CIs following an adaptive design."}, "https://arxiv.org/abs/2411.08778": {"title": "Causal-DRF: Conditional Kernel Treatment Effect Estimation using Distributional Random Forest", "link": "https://arxiv.org/abs/2411.08778", "description": "arXiv:2411.08778v1 Announce Type: new \nAbstract: The conditional average treatment effect (CATE) is a commonly targeted statistical parameter for measuring the mean effect of a treatment conditional on covariates. However, the CATE will fail to capture effects of treatments beyond differences in conditional expectations. Inspired by causal forests for CATE estimation, we develop a forest-based method to estimate the conditional kernel treatment effect (CKTE), based on the recently introduced Distributional Random Forest (DRF) algorithm. Adapting the splitting criterion of DRF, we show how one forest fit can be used to obtain a consistent and asymptotically normal estimator of the CKTE, as well as an approximation of its sampling distribution. This allows one to study the difference in distribution between the control and treatment groups and thus yields a more comprehensive understanding of the treatment effect. In particular, this enables the construction of a conditional kernel-based test for distributional effects with provably valid type-I error. We show the effectiveness of the proposed estimator in simulations and apply it to the 1991 Survey of Income and Program Participation (SIPP) pension data to study the effect of 401(k) eligibility on wealth."}, "https://arxiv.org/abs/2411.08861": {"title": "Interaction Testing in Variation Analysis", "link": "https://arxiv.org/abs/2411.08861", "description": "arXiv:2411.08861v1 Announce Type: new \nAbstract: Relationships of cause and effect are of prime importance for explaining scientific phenomena. Often, rather than just understanding the effects of causes, researchers also wish to understand how a cause $X$ affects an outcome $Y$ mechanistically -- i.e., what causal pathways are activated between $X$ and $Y$. For analyzing such questions, a range of methods has been developed over decades under the rubric of causal mediation analysis. Traditional mediation analysis focuses on decomposing the average treatment effect (ATE) into direct and indirect effects, and therefore focuses on the ATE as the central quantity. This corresponds to providing explanations for associations in the interventional regime, such as when the treatment $X$ is randomized. Commonly, however, it is of interest to explain associations in the observational regime, and not just in the interventional regime. In this paper, we introduce \text{variation analysis}, an extension of mediation analysis that focuses on the total variation (TV) measure between $X$ and $Y$, written as $\mathrm{E}[Y \mid X=x_1] - \mathrm{E}[Y \mid X=x_0]$. 
The TV measure encompasses both causal and confounded effects, as opposed to the ATE which only encompasses causal (direct and mediated) variations. In this way, the TV measure is suitable for providing explanations in the natural regime and answering questions such as ``why is $X$ associated with $Y$?''. Our focus is on decomposing the TV measure, in a way that explicitly includes direct, indirect, and confounded variations. Furthermore, we also decompose the TV measure to include interaction terms between these different pathways. Subsequently, interaction testing is introduced, involving hypothesis tests to determine if interaction terms are significantly different from zero. If interactions are not significant, more parsimonious decompositions of the TV measure can be used."}, "https://arxiv.org/abs/2411.07031": {"title": "Evaluating the Accuracy of Chatbots in Financial Literature", "link": "https://arxiv.org/abs/2411.07031", "description": "arXiv:2411.07031v1 Announce Type: cross \nAbstract: We evaluate the reliability of two chatbots, ChatGPT (4o and o1-preview versions), and Gemini Advanced, in providing references on financial literature and employing novel methodologies. Alongside the conventional binary approach commonly used in the literature, we developed a nonbinary approach and a recency measure to assess how hallucination rates vary with how recent a topic is. After analyzing 150 citations, ChatGPT-4o had a hallucination rate of 20.0% (95% CI, 13.6%-26.4%), while the o1-preview had a hallucination rate of 21.3% (95% CI, 14.8%-27.9%). In contrast, Gemini Advanced exhibited higher hallucination rates: 76.7% (95% CI, 69.9%-83.4%). While hallucination rates increased for more recent topics, this trend was not statistically significant for Gemini Advanced. These findings emphasize the importance of verifying chatbot-provided references, particularly in rapidly evolving fields."}, "https://arxiv.org/abs/2411.08338": {"title": "Quantifying uncertainty in the numerical integration of evolution equations based on Bayesian isotonic regression", "link": "https://arxiv.org/abs/2411.08338", "description": "arXiv:2411.08338v1 Announce Type: cross \nAbstract: This paper presents a new Bayesian framework for quantifying discretization errors in numerical solutions of ordinary differential equations. By modelling the errors as random variables, we impose a monotonicity constraint on the variances, referred to as discretization error variances. The key to our approach is the use of a shrinkage prior for the variances coupled with variable transformations. This methodology extends existing Bayesian isotonic regression techniques to tackle the challenge of estimating the variances of a normal distribution. An additional key feature is the use of a Gaussian mixture model for the $\\log$-$\\chi^2_1$ distribution, enabling the development of an efficient Gibbs sampling algorithm for the corresponding posterior."}, "https://arxiv.org/abs/2007.12922": {"title": "Data fusion methods for the heterogeneity of treatment effect and confounding function", "link": "https://arxiv.org/abs/2007.12922", "description": "arXiv:2007.12922v3 Announce Type: replace \nAbstract: The heterogeneity of treatment effect (HTE) lies at the heart of precision medicine. Randomized controlled trials are gold-standard for treatment effect estimation but are typically underpowered for heterogeneous effects. 
In contrast, large observational studies have high predictive power but are often confounded due to the lack of randomization of treatment. We show that an observational study, even subject to hidden confounding, may be used to empower trials in estimating the HTE using the notion of confounding function. The confounding function summarizes the impact of unmeasured confounders on the difference between the observed treatment effect and the causal treatment effect, given the observed covariates, which is unidentifiable based only on the observational study. Coupling the trial and observational study, we show that the HTE and confounding function are identifiable. We then derive the semiparametric efficient scores and the integrative estimators of the HTE and confounding function. We clarify the conditions under which the integrative estimator of the HTE is strictly more efficient than the trial estimator. Finally, we illustrate the integrative estimators via simulation and an application."}, "https://arxiv.org/abs/2105.10809": {"title": "Exact PPS Sampling with Bounded Sample Size", "link": "https://arxiv.org/abs/2105.10809", "description": "arXiv:2105.10809v2 Announce Type: replace \nAbstract: Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number $n$ of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified \"weight\" (also called its \"size\"). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value $n$. The sample size is exactly equal to $n$ if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics over the sample, while allowing the user complete control over the sample contents. The method is both simple to implement and efficient, being a one-pass streaming algorithm with an amortized processing time of $O(1)$ per item."}, "https://arxiv.org/abs/2206.04902": {"title": "Forecasting macroeconomic data with Bayesian VARs: Sparse or dense? It depends!", "link": "https://arxiv.org/abs/2206.04902", "description": "arXiv:2206.04902v4 Announce Type: replace \nAbstract: Vector autoregressions (VARs) are widely applied when it comes to modeling and forecasting macroeconomic variables. In high dimensions, however, they are prone to overfitting. Bayesian methods, more concretely shrinkage priors, have been shown to be successful in improving prediction performance. In the present paper, we introduce the semi-global framework, in which we replace the traditional global shrinkage parameter with group-specific shrinkage parameters. We show how this framework can be applied to various shrinkage priors, such as global-local priors and stochastic search variable selection priors. We demonstrate the virtues of the proposed framework in an extensive simulation study and in an empirical application forecasting data of the US economy. 
Further, we shed more light on the ongoing ``Illusion of Sparsity'' debate, finding that forecasting performances under sparse/dense priors vary across evaluated economic variables and across time frames. Dynamic model averaging, however, can combine the merits of both worlds."}, "https://arxiv.org/abs/2306.16642": {"title": "Improving randomized controlled trial analysis via data-adaptive borrowing", "link": "https://arxiv.org/abs/2306.16642", "description": "arXiv:2306.16642v2 Announce Type: replace \nAbstract: In recent years, real-world external controls have grown in popularity as a tool to empower randomized placebo-controlled trials, particularly in rare diseases or cases where balanced randomization is unethical or impractical. However, as external controls are not always comparable to the trials, direct borrowing without scrutiny may heavily bias the treatment effect estimator. Our paper proposes a data-adaptive integrative framework capable of preventing unknown biases of the external controls. The adaptive nature is achieved by dynamically sorting out a comparable subset of the external controls via bias penalization. Our proposed method can simultaneously achieve (a) the semiparametric efficiency bound when the external controls are comparable and (b) selective borrowing that mitigates the impact of the existence of incomparable external controls. Furthermore, we establish statistical guarantees, including consistency, asymptotic distribution, and inference, providing type-I error control and good power. Extensive simulations and two real-data applications show that the proposed method leads to improved performance over the trial-only estimator across various bias-generating scenarios."}, "https://arxiv.org/abs/2310.09673": {"title": "Robust Quickest Change Detection in Non-Stationary Processes", "link": "https://arxiv.org/abs/2310.09673", "description": "arXiv:2310.09673v2 Announce Type: replace \nAbstract: Optimal algorithms are developed for robust detection of changes in non-stationary processes. These are processes in which the distribution of the data after change varies with time. The decision-maker does not have access to precise information on the post-change distribution. It is shown that if the post-change non-stationary family has a distribution that is least favorable in a well-defined sense, then the algorithms designed using the least favorable distributions are robust and optimal. Non-stationary processes are encountered in public health monitoring and space and military applications. The robust algorithms are applied to real and simulated data to show their effectiveness."}, "https://arxiv.org/abs/2401.01804": {"title": "Efficient Computation of Confidence Sets Using Classification on Equidistributed Grids", "link": "https://arxiv.org/abs/2401.01804", "description": "arXiv:2401.01804v2 Announce Type: replace \nAbstract: Economic models produce moment inequalities, which can be used to form tests of the true parameters. Confidence sets (CS) of the true parameters are derived by inverting these tests. However, they often lack analytical expressions, necessitating a grid search to obtain the CS numerically by retaining the grid points that pass the test. When the statistic is not asymptotically pivotal, constructing the critical value for each grid point in the parameter space adds to the computational burden. In this paper, we convert the computational issue into a classification problem by using a support vector machine (SVM) classifier. 
Its decision function provides a faster and more systematic way of dividing the parameter space into two regions: inside vs. outside of the confidence set. We label those points in the CS as 1 and those outside as -1. Researchers can train the SVM classifier on a grid of manageable size and use it to determine whether points on denser grids are in the CS or not. We establish certain conditions for the grid so that there is a tuning that allows us to asymptotically reproduce the test in the CS. This means that in the limit, a point is classified as belonging to the confidence set if and only if it is labeled as 1 by the SVM."}, "https://arxiv.org/abs/2104.04716": {"title": "Selecting Penalty Parameters of High-Dimensional M-Estimators using Bootstrapping after Cross-Validation", "link": "https://arxiv.org/abs/2104.04716", "description": "arXiv:2104.04716v5 Announce Type: replace-cross \nAbstract: We develop a new method for selecting the penalty parameter for $\\ell_{1}$-penalized M-estimators in high dimensions, which we refer to as bootstrapping after cross-validation. We derive rates of convergence for the corresponding $\\ell_1$-penalized M-estimator and also for the post-$\\ell_1$-penalized M-estimator, which refits the non-zero entries of the former estimator without penalty in the criterion function. We demonstrate via simulations that our methods are not dominated by cross-validation in terms of estimation errors and can outperform cross-validation in terms of inference. As an empirical illustration, we revisit Fryer Jr (2019), who investigated racial differences in police use of force, and confirm his findings."}, "https://arxiv.org/abs/2411.08929": {"title": "Power and Sample Size Calculations for Cluster Randomized Hybrid Type 2 Effectiveness-Implementation Studies", "link": "https://arxiv.org/abs/2411.08929", "description": "arXiv:2411.08929v1 Announce Type: new \nAbstract: Hybrid studies allow investigators to simultaneously study an intervention effectiveness outcome and an implementation research outcome. In particular, type 2 hybrid studies support research that places equal importance on both outcomes rather than focusing on one and secondarily on the other (i.e., type 1 and type 3 studies). Hybrid 2 studies introduce the statistical issue of multiple testing, complicated by the fact that they are typically also cluster randomized trials. Standard statistical methods do not apply in this scenario. Here, we describe the design methodologies available for validly powering hybrid type 2 studies and producing reliable sample size calculations in a cluster-randomized design with a focus on binary outcomes. Through a literature search, 18 publications were identified that included methods relevant to the design of hybrid 2 studies. Five methods were identified, two of which did not account for clustering but are extended in this article to do so, namely the combined outcomes approach and the single 1-degree of freedom combined test. Procedures for powering hybrid 2 studies using these five methods are described and illustrated using input parameters inspired by a study from the Community Intervention to Reduce CardiovascuLar Disease in Chicago (CIRCL-Chicago) Implementation Research Center. In this illustrative example, the intervention effectiveness outcome was controlled blood pressure, and the implementation outcome was reach. 
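To make the SVM-based confidence-set construction described above concrete, here is a hedged sketch (not the paper's code): label a coarse parameter grid as inside (+1) or outside (-1) the confidence set via the moment-inequality test, train a classifier, and use its decision function to classify a much denser grid. `passes_test` and the SVM tuning values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def svm_confidence_set(coarse_grid, dense_grid, passes_test):
    # passes_test(theta) is a stand-in for the (expensive) test at one grid point.
    labels = np.array([1 if passes_test(theta) else -1 for theta in coarse_grid])
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # tuning is problem-specific
    clf.fit(coarse_grid, labels)
    return clf.decision_function(dense_grid) >= 0     # True = classified inside the CS

# Toy usage: a circular "confidence set" in two dimensions.
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    coarse = rng.uniform(-2, 2, size=(400, 2))
    dense = rng.uniform(-2, 2, size=(20000, 2))
    in_cs = svm_confidence_set(coarse, dense, lambda th: np.sum(th**2) <= 1.0)
    print("fraction of dense grid classified inside:", in_cs.mean())
```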
The conjunctive test resulted in higher power than the popular p-value adjustment methods, and the newly extended combined outcomes and single 1-DF test were found to be the most powerful among all of the tests."}, "https://arxiv.org/abs/2411.08984": {"title": "Using Principal Progression Rate to Quantify and Compare Disease Progression in Comparative Studies", "link": "https://arxiv.org/abs/2411.08984", "description": "arXiv:2411.08984v1 Announce Type: new \nAbstract: In comparative studies of progressive diseases, such as randomized controlled trials (RCTs), the mean Change From Baseline (CFB) of a continuous outcome at a pre-specified follow-up time across subjects in the target population is a standard estimand used to summarize the overall disease progression. Despite its simplicity in interpretation, the mean CFB may not efficiently capture important features of the trajectory of the mean outcome relevant to the evaluation of the treatment effect of an intervention. Additionally, the estimation of the mean CFB does not use all longitudinal data points. To address these limitations, we propose a class of estimands called Principal Progression Rate (PPR). The PPR is a weighted average of local or instantaneous slope of the trajectory of the population mean during the follow-up. The flexibility of the weight function allows the PPR to cover a broad class of intuitive estimands, including the mean CFB, the slope of ordinary least-square fit to the trajectory, and the area under the curve. We showed that properly chosen PPRs can enhance statistical power over the mean CFB by amplifying the signal of treatment effect and/or improving estimation precision. We evaluated different versions of PPRs and the performance of their estimators through numerical studies. A real dataset was analyzed to demonstrate the advantage of using alternative PPR over the mean CFB."}, "https://arxiv.org/abs/2411.09017": {"title": "Debiased machine learning for counterfactual survival functionals based on left-truncated right-censored data", "link": "https://arxiv.org/abs/2411.09017", "description": "arXiv:2411.09017v1 Announce Type: new \nAbstract: Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data are subject to covariate-dependent left truncation and right censoring and when baseline covariates suffice to deconfound the relationship between exposure and survival time. Our inferential procedures explicitly allow the integration of flexible machine learning tools for nuisance estimation, and enjoy certain robustness properties. The approach we propose can be directly used to make pointwise or uniform inference on smooth summaries of the joint counterfactual survival time and covariate distribution, and can be valuable even in the absence of interventions, when summaries of a marginal survival distribution are of interest. 
We showcase how our procedures can be used to learn a variety of inferential targets and illustrate their performance in simulation studies."}, "https://arxiv.org/abs/2411.09097": {"title": "On the Selection Stability of Stability Selection and Its Applications", "link": "https://arxiv.org/abs/2411.09097", "description": "arXiv:2411.09097v1 Announce Type: new \nAbstract: Stability selection is a widely adopted resampling-based framework for high-dimensional structure estimation and variable selection. However, the concept of 'stability' is often narrowly addressed, primarily through examining selection frequencies, or 'stability paths'. This paper seeks to broaden the use of an established stability estimator to evaluate the overall stability of the stability selection framework, moving beyond single-variable analysis. We suggest that the stability estimator offers two advantages: it can serve as a reference to reflect the robustness of the outcomes obtained and help identify an optimal regularization value to improve stability. By determining this value, we aim to calibrate key stability selection parameters, namely, the decision threshold and the expected number of falsely selected variables, within established theoretical bounds. Furthermore, we explore a novel selection criterion based on this regularization value. With the asymptotic distribution of the stability estimator previously established, convergence to true stability is ensured, allowing us to observe stability trends over successive sub-samples. This approach sheds light on the required number of sub-samples addressing a notable gap in prior studies. The 'stabplot' package is developed to facilitate the use of the plots featured in this manuscript, supporting their integration into further statistical analysis and research workflows."}, "https://arxiv.org/abs/2411.09218": {"title": "On the (Mis)Use of Machine Learning with Panel Data", "link": "https://arxiv.org/abs/2411.09218", "description": "arXiv:2411.09218v1 Announce Type: new \nAbstract: Machine Learning (ML) is increasingly employed to inform and support policymaking interventions. This methodological article cautions practitioners about common but often overlooked pitfalls associated with the uncritical application of supervised ML algorithms to panel data. Ignoring the cross-sectional and longitudinal structure of this data can lead to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of ML models. After clarifying these issues, we provide practical guidelines and best practices for applied researchers to ensure the correct implementation of supervised ML in panel data environments, emphasizing the need to define ex ante the primary goal of the analysis and align the ML pipeline accordingly. An empirical application based on over 3,000 US counties from 2000 to 2019 illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks."}, "https://arxiv.org/abs/2411.09221": {"title": "Difference-in-Differences with Sample Selection", "link": "https://arxiv.org/abs/2411.09221", "description": "arXiv:2411.09221v1 Announce Type: new \nAbstract: Endogenous treatment and sample selection are two concomitant sources of endogeneity that challenge the validity of causal inference. 
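For readers unfamiliar with the selection frequencies ("stability paths") that the stability-selection abstract above builds on, the following schematic computes them with repeated half-sample lasso fits. It illustrates the basic resampling recipe only, not the paper's stability estimator or its 'stabplot' package; all names and tuning values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.1, n_subsamples=100, seed=5):
    # Fraction of half-size subsamples in which each variable enters the lasso fit.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        fit = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx])
        counts += (fit.coef_ != 0)
    return counts / n_subsamples

# Variables with a frequency above a chosen decision threshold (e.g. 0.6) are kept.
```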
In this paper, we focus on the partial identification of treatment effects within a standard two-period difference-in-differences framework when the outcome is observed for an endogenously selected subpopulation. The identification strategy embeds Lee's (2009) bounding approach based on principal stratification, which divides the population into latent subgroups based on selection behaviour in counterfactual treatment states in both periods. We establish identification results for four latent types and illustrate the proposed approach by applying it to estimate 1) the effect of a job training program on earnings and 2) the effect of a working-from-home policy on employee performance."}, "https://arxiv.org/abs/2411.09452": {"title": "Sparse Interval-valued Time Series Modeling with Machine Learning", "link": "https://arxiv.org/abs/2411.09452", "description": "arXiv:2411.09452v1 Announce Type: new \nAbstract: By treating intervals as inseparable sets, this paper proposes sparse machine learning regressions for high-dimensional interval-valued time series. With LASSO or adaptive LASSO techniques, we develop a penalized minimum distance estimation, which covers point-based estimators as special cases. We establish the consistency and oracle properties of the proposed penalized estimator, regardless of whether the number of predictors is diverging with the sample size. Monte Carlo simulations demonstrate the favorable finite sample properties of the proposed estimation. Empirical applications to interval-valued crude oil price forecasting and sparse index-tracking portfolio construction illustrate the robustness and effectiveness of our method against competing approaches, including random forest and multilayer perceptron for interval-valued data. Our findings highlight the potential of machine learning techniques in interval-valued time series analysis, offering new insights for financial forecasting and portfolio management."}, "https://arxiv.org/abs/2411.09579": {"title": "Propensity Score Matching: Should We Use It in Designing Observational Studies?", "link": "https://arxiv.org/abs/2411.09579", "description": "arXiv:2411.09579v1 Announce Type: new \nAbstract: Propensity Score Matching (PSM) stands as a widely embraced method in comparative effectiveness research. PSM crafts matched datasets, mimicking some attributes of randomized designs, from observational data. In a valid PSM design where all baseline confounders are measured and matched, the confounders would be balanced, allowing the treatment status to be considered as if it were randomly assigned. Nevertheless, recent research has unveiled a different facet of PSM, termed \"the PSM paradox.\" As PSM approaches exact matching by progressively pruning matched sets in order of decreasing propensity score distance, it can paradoxically lead to greater covariate imbalance, heightened model dependence, and increased bias, contrary to its intended purpose. Methods: We used analytic formulas, simulation, and the literature to demonstrate that this paradox stems from the misuse of metrics for assessing chance imbalance and bias. Results: Firstly, matched pairs typically exhibit different covariate values despite having identical propensity scores. However, this disparity represents a \"chance\" difference and will average to zero over a large number of matched pairs. 
Common distance metrics cannot capture this ``chance\" nature in covariate imbalance, instead reflecting increasing variability in chance imbalance as units are pruned and the sample size diminishes. Secondly, the largest estimate among numerous fitted models, because of uncertainty among researchers over the correct model, was used to determine statistical bias. This cherry-picking procedure ignores the most significant benefit of matching design-reducing model dependence based on its robustness against model misspecification bias. Conclusions: We conclude that the PSM paradox is not a legitimate concern and should not stop researchers from using PSM designs."}, "https://arxiv.org/abs/2411.08998": {"title": "Microfoundation Inference for Strategic Prediction", "link": "https://arxiv.org/abs/2411.08998", "description": "arXiv:2411.08998v1 Announce Type: cross \nAbstract: Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adaptation of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents' responses as a cost-adjusted utility maximization problem and propose estimates for said cost. Our approach leverages optimal transport to align pre-model exposure (ex ante) and post-model exposure (ex post) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit-scoring dataset."}, "https://arxiv.org/abs/2411.09100": {"title": "General linear threshold models with application to influence maximization", "link": "https://arxiv.org/abs/2411.09100", "description": "arXiv:2411.09100v1 Announce Type: cross \nAbstract: A number of models have been developed for information spread through networks, often for solving the Influence Maximization (IM) problem. IM is the task of choosing a fixed number of nodes to \"seed\" with information in order to maximize the spread of this information through the network, with applications in areas such as marketing and public health. Most methods for this problem rely heavily on the assumption of known strength of connections between network members (edge weights), which is often unrealistic. In this paper, we develop a likelihood-based approach to estimate edge weights from the fully and partially observed information diffusion paths. We also introduce a broad class of information diffusion models, the general linear threshold (GLT) model, which generalizes the well-known linear threshold (LT) model by allowing arbitrary distributions of node activation thresholds. We then show our weight estimator is consistent under the GLT and some mild assumptions. For the special case of the standard LT model, we also present a much faster expectation-maximization approach for weight estimation. Finally, we prove that for the GLT models, the IM problem can be solved by a natural greedy algorithm with standard optimality guarantees if all node threshold distributions have concave cumulative distribution functions. 
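The greedy influence-maximization step mentioned in the preceding sentence can be illustrated with a compact Monte Carlo sketch of the standard linear threshold model: thresholds are drawn uniformly, a node activates once the summed weights of its active in-neighbours cross its threshold, and seeds are added one at a time by estimated spread of the augmented seed set. This is a generic illustration assuming `weights` lists every node as a key; it is not the paper's implementation.

```python
import numpy as np

def lt_spread(weights, seeds, rng):
    # One simulated cascade; weights[u] maps in-neighbour v -> edge weight w(v, u).
    thresholds = {u: rng.uniform() for u in weights}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for u, in_w in weights.items():
            if u in active:
                continue
            if sum(w for v, w in in_w.items() if v in active) >= thresholds[u]:
                active.add(u)
                changed = True
    return len(active)

def greedy_im(weights, k, n_sim=200, seed=3):
    rng = np.random.default_rng(seed)
    seeds = set()
    for _ in range(k):
        best, best_gain = None, -np.inf
        for u in weights:
            if u in seeds:
                continue
            gain = np.mean([lt_spread(weights, seeds | {u}, rng) for _ in range(n_sim)])
            if gain > best_gain:
                best, best_gain = u, gain
        seeds.add(best)
    return seeds
```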
Extensive experiments on synthetic and real-world networks demonstrate that the flexibility in the choice of threshold distribution combined with the estimation of edge weights significantly improves the quality of IM solutions, spread prediction, and the estimates of the node activation probabilities."}, "https://arxiv.org/abs/2411.09225": {"title": "fdesigns: Bayesian Optimal Designs of Experiments for Functional Models in R", "link": "https://arxiv.org/abs/2411.09225", "description": "arXiv:2411.09225v1 Announce Type: cross \nAbstract: This paper describes the R package fdesigns that implements a methodology for identifying Bayesian optimal experimental designs for models whose factor settings are functions, known as profile factors. This type of experiment involves factors that vary dynamically over time, presenting unique challenges in both estimation and design due to the infinite-dimensional nature of functions. The package fdesigns implements a dimension reduction method leveraging basis functions of the B-spline basis system. The package fdesigns contains functions that effectively reduce the design problem to the optimisation of basis coefficients for functional linear and functional generalised linear models, and it accommodates various options. Applications of the fdesigns package are demonstrated through a series of examples that showcase its capabilities in identifying optimal designs for functional linear and generalised linear models. The examples highlight how the package's functions can be used to efficiently design experiments involving both profile and scalar factors, including interactions and polynomial effects."}, "https://arxiv.org/abs/2411.09258": {"title": "On Asymptotic Optimality of Least Squares Model Averaging When True Model Is Included", "link": "https://arxiv.org/abs/2411.09258", "description": "arXiv:2411.09258v1 Announce Type: cross \nAbstract: Asymptotic optimality is a key theoretical property in model averaging. Due to technical difficulties, existing studies rely on restricted weight sets or the assumption that there is no true model with fixed dimensions in the candidate set. The focus of this paper is to overcome these difficulties. Surprisingly, we discover that when the penalty factor in the weight selection criterion diverges with a certain order and the true model dimension is fixed, asymptotic loss optimality does not hold, but asymptotic risk optimality does. This result differs from the corresponding result of Fang et al. (2023, Econometric Theory 39, 412-441) and reveals that using the discrete weight set of Hansen (2007, Econometrica 75, 1175-1189) can yield opposite asymptotic properties compared to using the usual weight set. Simulation studies illustrate the theoretical findings in a variety of settings."}, "https://arxiv.org/abs/2411.09353": {"title": "Monitoring time to event in registry data using CUSUMs based on excess hazard models", "link": "https://arxiv.org/abs/2411.09353", "description": "arXiv:2411.09353v1 Announce Type: cross \nAbstract: An aspect of interest in surveillance of diseases is whether the survival time distribution changes over time. By following data in health registries over time, this can be monitored, either in real time or retrospectively. With relevant risk factors registered, these can be taken into account in the monitoring as well. A challenge in monitoring survival times based on registry data is that data on cause of death might either be missing or uncertain. 
To quantify the burden of disease in such cases, excess hazard methods can be used, where the total hazard is modelled as the population hazard plus the excess hazard due to the disease.\n We propose a CUSUM procedure for monitoring for changes in the survival time distribution in cases where use of excess hazard models is relevant. The procedure is based on a survival log-likelihood ratio and extends previously suggested methods for monitoring of time to event to the excess hazard setting. The procedure takes into account changes in the population risk over time, as well as changes in the excess hazard which is explained by observed covariates. Properties, challenges and an application to cancer registry data will be presented."}, "https://arxiv.org/abs/2411.09514": {"title": "On importance sampling and independent Metropolis-Hastings with an unbounded weight function", "link": "https://arxiv.org/abs/2411.09514", "description": "arXiv:2411.09514v1 Announce Type: cross \nAbstract: Importance sampling and independent Metropolis-Hastings (IMH) are among the fundamental building blocks of Monte Carlo methods. Both require a proposal distribution that globally approximates the target distribution. The Radon-Nikodym derivative of the target distribution relative to the proposal is called the weight function. Under the weak assumption that the weight is unbounded but has a number of finite moments under the proposal distribution, we obtain new results on the approximation error of importance sampling and of the particle independent Metropolis-Hastings algorithm (PIMH), which includes IMH as a special case. For IMH and PIMH, we show that the common random numbers coupling is maximal. Using that coupling we derive bounds on the total variation distance of a PIMH chain to the target distribution. The bounds are sharp with respect to the number of particles and the number of iterations. Our results allow a formal comparison of the finite-time biases of importance sampling and IMH. We further consider bias removal techniques using couplings of PIMH, and provide conditions under which the resulting unbiased estimators have finite moments. We compare the asymptotic efficiency of regular and unbiased importance sampling estimators as the number of particles goes to infinity."}, "https://arxiv.org/abs/2310.08479": {"title": "Personalised dynamic super learning: an application in predicting hemodiafiltration convection volumes", "link": "https://arxiv.org/abs/2310.08479", "description": "arXiv:2310.08479v2 Announce Type: replace \nAbstract: Obtaining continuously updated predictions is a major challenge for personalised medicine. Leveraging combinations of parametric regressions and machine learning approaches, the personalised online super learner (POSL) can achieve such dynamic and personalised predictions. We adapt POSL to predict a repeated continuous outcome dynamically and propose a new way to validate such personalised or dynamic prediction models. We illustrate its performance by predicting the convection volume of patients undergoing hemodiafiltration. POSL outperformed its candidate learners with respect to median absolute error, calibration-in-the-large, discrimination, and net benefit. 
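The importance-sampling/IMH abstract above turns entirely on the weight function, i.e. the target-to-proposal density ratio. Below is a minimal independent Metropolis-Hastings sketch in which the acceptance ratio depends only on that weight; `log_target`, `log_proposal`, and `sample_proposal` are user-supplied stand-ins, and this is a generic illustration rather than code from the paper.

```python
import numpy as np

def imh(log_target, log_proposal, sample_proposal, n_iter, seed=4):
    rng = np.random.default_rng(seed)
    x = sample_proposal(rng)
    log_w_x = log_target(x) - log_proposal(x)     # log weight of current state
    chain = []
    for _ in range(n_iter):
        y = sample_proposal(rng)                  # proposal does not depend on x
        log_w_y = log_target(y) - log_proposal(y)
        if np.log(rng.uniform()) < log_w_y - log_w_x:
            x, log_w_x = y, log_w_y               # accept with prob min(1, w_y / w_x)
        chain.append(x)
    return np.array(chain)
```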
We finally discuss the choices and challenges underlying the use of POSL."}, "https://arxiv.org/abs/2401.06403": {"title": "Fourier analysis of spatial point processes", "link": "https://arxiv.org/abs/2401.06403", "description": "arXiv:2401.06403v2 Announce Type: replace \nAbstract: In this article, we develop comprehensive frequency domain methods for estimating and inferring the second-order structure of spatial point processes. The main element here is on utilizing the discrete Fourier transform (DFT) of the point pattern and its tapered counterpart. Under second-order stationarity, we show that both the DFTs and the tapered DFTs are asymptotically jointly independent Gaussian even when the DFTs share the same limiting frequencies. Based on these results, we establish an $\\alpha$-mixing central limit theorem for a statistic formulated as a quadratic form of the tapered DFT. As applications, we derive the asymptotic distribution of the kernel spectral density estimator and establish a frequency domain inferential method for parametric stationary point processes. For the latter, the resulting model parameter estimator is computationally tractable and yields meaningful interpretations even in the case of model misspecification. We investigate the finite sample performance of our estimator through simulations, considering scenarios of both correctly specified and misspecified models. Furthermore, we extend our proposed DFT-based frequency domain methods to a class of non-stationary spatial point processes."}, "https://arxiv.org/abs/2401.07344": {"title": "Robust Genomic Prediction and Heritability Estimation using Density Power Divergence", "link": "https://arxiv.org/abs/2401.07344", "description": "arXiv:2401.07344v2 Announce Type: replace \nAbstract: This manuscript delves into the intersection of genomics and phenotypic prediction, focusing on the statistical innovation required to navigate the complexities introduced by noisy covariates and confounders. The primary emphasis is on the development of advanced robust statistical models tailored for genomic prediction from single nucleotide polymorphism data in plant and animal breeding and multi-field trials. The manuscript highlights the significance of incorporating all estimated effects of marker loci into the statistical framework and aiming to reduce the high dimensionality of data while preserving critical information. This paper introduces a new robust statistical framework for genomic prediction, employing one-stage and two-stage linear mixed model analyses along with utilizing the popular robust minimum density power divergence estimator (MDPDE) to estimate genetic effects on phenotypic traits. The study illustrates the superior performance of the proposed MDPDE-based genomic prediction and associated heritability estimation procedures over existing competitors through extensive empirical experiments on artificial datasets and application to a real-life maize breeding dataset. 
The results showcase the robustness and accuracy of the proposed MDPDE-based approaches, especially in the presence of data contamination, emphasizing their potential applications in improving breeding programs and advancing genomic prediction of phenotyping traits."}, "https://arxiv.org/abs/2011.14762": {"title": "M-Variance Asymptotics and Uniqueness of Descriptors", "link": "https://arxiv.org/abs/2011.14762", "description": "arXiv:2011.14762v2 Announce Type: replace-cross \nAbstract: Asymptotic theory for M-estimation problems usually focuses on the asymptotic convergence of the sample descriptor, defined as the minimizer of the sample loss function. Here, we explore a related question and formulate asymptotic theory for the minimum value of sample loss, the M-variance. Since the loss function value is always a real number, the asymptotic theory for the M-variance is comparatively simple. M-variance often satisfies a standard central limit theorem, even in situations where the asymptotics of the descriptor is more complicated as for example in case of smeariness, or if no asymptotic distribution can be given as can be the case if the descriptor space is a general metric space. We use the asymptotic results for the M-variance to formulate a hypothesis test to systematically determine for a given sample whether the underlying population loss function may have multiple global minima. We discuss three applications of our test to data, each of which presents a typical scenario in which non-uniqueness of descriptors may occur. These model scenarios are the mean on a non-euclidean space, non-linear regression and Gaussian mixture clustering."}, "https://arxiv.org/abs/2411.09771": {"title": "Bayesian estimation of finite mixtures of Tobit models", "link": "https://arxiv.org/abs/2411.09771", "description": "arXiv:2411.09771v1 Announce Type: new \nAbstract: This paper outlines a Bayesian approach to estimate finite mixtures of Tobit models. The method consists of an MCMC approach that combines Gibbs sampling with data augmentation and is simple to implement. I show through simulations that the flexibility provided by this method is especially helpful when censoring is not negligible. In addition, I demonstrate the broad utility of this methodology with applications to a job training program, labor supply, and demand for medical care. I find that this approach allows for non-trivial additional flexibility that can alter results considerably and beyond improving model fit."}, "https://arxiv.org/abs/2411.09808": {"title": "Sharp Testable Implications of Encouragement Designs", "link": "https://arxiv.org/abs/2411.09808", "description": "arXiv:2411.09808v1 Announce Type: new \nAbstract: This paper studies the sharp testable implications of an additive random utility model with a discrete multi-valued treatment and a discrete multi-valued instrument, in which each value of the instrument only weakly increases the utility of one choice. Borrowing the terminology used in randomized experiments, we call such a setting an encouragement design. We derive inequalities in terms of the conditional choice probabilities that characterize when the distribution of the observed data is consistent with such a model. 
Through a novel constructive argument, we further show these inequalities are sharp in the sense that any distribution of the observed data that satisfies these inequalities is generated by this additive random utility model."}, "https://arxiv.org/abs/2411.10009": {"title": "Semiparametric inference for impulse response functions using double/debiased machine learning", "link": "https://arxiv.org/abs/2411.10009", "description": "arXiv:2411.10009v1 Announce Type: new \nAbstract: We introduce a double/debiased machine learning (DML) estimator for the impulse response function (IRF) in settings where a time series of interest is subjected to multiple discrete treatments, assigned over time, which can have a causal effect on future outcomes. The proposed estimator can rely on fully nonparametric relations between treatment and outcome variables, opening up the possibility to use flexible machine learning approaches to estimate IRFs. To this end, we extend the theory of DML from an i.i.d. to a time series setting and show that the proposed DML estimator for the IRF is consistent and asymptotically normally distributed at the parametric rate, allowing for semiparametric inference for dynamic effects in a time series setting. The properties of the estimator are validated numerically in finite samples by applying it to learn the IRF in the presence of serial dependence in both the confounder and observation innovation processes. We also illustrate the methodology empirically by applying it to the estimation of the effects of macroeconomic shocks."}, "https://arxiv.org/abs/2411.10064": {"title": "Adaptive Physics-Guided Neural Network", "link": "https://arxiv.org/abs/2411.10064", "description": "arXiv:2411.10064v1 Announce Type: new \nAbstract: This paper introduces an adaptive physics-guided neural network (APGNN) framework for predicting quality attributes from image data by integrating physical laws into deep learning models. The APGNN adaptively balances data-driven and physics-informed predictions, enhancing model accuracy and robustness across different environments. Our approach is evaluated on both synthetic and real-world datasets, with comparisons to conventional data-driven models such as ResNet. For the synthetic data, 2D domains were generated using three distinct governing equations: the diffusion equation, the advection-diffusion equation, and the Poisson equation. Non-linear transformations were applied to these domains to emulate complex physical processes in image form.\n In real-world experiments, the APGNN consistently demonstrated superior performance in the diverse thermal image dataset. On the cucumber dataset, characterized by low material diversity and controlled conditions, APGNN and PGNN showed similar performance, both outperforming the data-driven ResNet. However, in the more complex thermal dataset, particularly for outdoor materials with higher environmental variability, APGNN outperformed both PGNN and ResNet by dynamically adjusting its reliance on physics-based versus data-driven insights. This adaptability allowed APGNN to maintain robust performance across structured, low-variability settings and more heterogeneous scenarios. 
These findings underscore the potential of adaptive physics-guided learning to integrate physical constraints effectively, even in challenging real-world contexts with diverse environmental conditions."}, "https://arxiv.org/abs/2411.10089": {"title": "G-computation for increasing performances of clinical trials with individual randomization and binary response", "link": "https://arxiv.org/abs/2411.10089", "description": "arXiv:2411.10089v1 Announce Type: new \nAbstract: In a clinical trial, the random allocation aims to balance prognostic factors between arms, preventing true confounding. However, residual differences due to chance may introduce near-confounders. Adjusting for prognostic factors is therefore recommended, especially because of the related increase in power. In this paper, we hypothesized that G-computation associated with machine learning could be a suitable method for randomized clinical trials even with small sample sizes. It allows for flexible estimation of the outcome model, even when the covariates' relationships with outcomes are complex. Through simulations, penalized regressions (Lasso, Elasticnet) and algorithm-based methods (neural network, support vector machine, super learner) were compared. Penalized regressions reduced variance but may introduce a slight increase in bias. The associated reductions in sample size ranged from 17\% to 54\%. In contrast, algorithm-based methods, while effective for larger and more complex data structures, underestimated the standard deviation, especially with small sample sizes. In conclusion, G-computation with penalized models, particularly Elasticnet with splines when appropriate, represents a relevant approach for increasing the power of RCTs and accounting for potential near-confounders."}, "https://arxiv.org/abs/2411.10121": {"title": "Quadratic Form based Multiple Contrast Tests for Comparison of Group Means", "link": "https://arxiv.org/abs/2411.10121", "description": "arXiv:2411.10121v1 Announce Type: new \nAbstract: Comparing the mean vectors across different groups is a cornerstone in the realm of multivariate statistics, with quadratic forms commonly serving as test statistics. However, when the overall hypothesis is rejected, identifying specific vector components or determining the groups among which differences exist requires additional investigations. Conversely, employing multiple contrast tests (MCT) allows conclusions about which components or groups contribute to these differences. However, they come with a trade-off, as MCT lose some benefits inherent to quadratic forms. In this paper, we combine both approaches to obtain a quadratic form based multiple contrast test that leverages the advantages of both. To understand its theoretical properties, we investigate its asymptotic distribution in a semiparametric model. We thereby focus on two common quadratic forms - the Wald-type statistic and the Anova-type statistic - although our findings are applicable to any quadratic form.\n Furthermore, we employ Monte-Carlo and resampling techniques to enhance the test's performance in small sample scenarios. 
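As a concrete reference point for the Wald-type quadratic form mentioned in the multiple-contrast-test abstract above, here is a small numerical sketch for independent multivariate groups. The block-diagonal covariance construction and scaling are my assumptions for a standard one-way layout, not the paper's estimator.

```python
import numpy as np

def wald_type_statistic(groups, C):
    # groups: list of (n_i x d) arrays; C: contrast matrix acting on the stacked means.
    means = np.concatenate([g.mean(axis=0) for g in groups])   # stacked mean vector
    N = sum(len(g) for g in groups)
    d = groups[0].shape[1]
    # Block-diagonal estimate of Cov(sqrt(N) * stacked means).
    S = np.zeros((len(means), len(means)))
    for i, g in enumerate(groups):
        S[i*d:(i+1)*d, i*d:(i+1)*d] = N * np.cov(g, rowvar=False, ddof=1) / len(g)
    Cm = C @ means
    # T = N * (C m)' (C S C')^+ (C m), with a pseudo-inverse for rank-deficient contrasts.
    return float(N * Cm @ np.linalg.pinv(C @ S @ C.T) @ Cm)
```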
Through an extensive simulation study, we assess the performance of our proposed tests against existing alternatives, highlighting their advantages."}, "https://arxiv.org/abs/2411.10204": {"title": "Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport", "link": "https://arxiv.org/abs/2411.10204", "description": "arXiv:2411.10204v1 Announce Type: new \nAbstract: Wasserstein distances form a family of metrics on spaces of probability measures that have recently seen many applications. However, statistical analysis in these spaces is complex due to the nonlinearity of Wasserstein spaces. One potential solution to this problem is Linear Optimal Transport (LOT). This method allows one to find a Euclidean embedding, called LOT embedding, of measures in some Wasserstein spaces, but some information is lost in this embedding. So, to understand whether statistical analysis relying on LOT embeddings can make valid inferences about original data, it is helpful to quantify how well these embeddings describe that data. To answer this question, we present a decomposition of the Fr\\'echet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures. We then extend this decomposition to the Fused Gromov-Wasserstein setting. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained by the embedding, and the classification accuracy of machine learning classifiers built on the embedded data. We use the MNIST handwritten digits dataset, IMDB-50000 dataset, and Diffusion Tensor MRI images for these experiments. Our results illustrate the effectiveness of low dimensional LOT embeddings in terms of the percentage of variance explained and the classification accuracy of models built on the embedded data."}, "https://arxiv.org/abs/2411.10218": {"title": "Bayesian Adaptive Tucker Decompositions for Tensor Factorization", "link": "https://arxiv.org/abs/2411.10218", "description": "arXiv:2411.10218v1 Announce Type: new \nAbstract: Tucker tensor decomposition offers a more effective representation for multiway data compared to the widely used PARAFAC model. However, its flexibility brings the challenge of selecting the appropriate latent multi-rank. To overcome the issue of pre-selecting the latent multi-rank, we introduce a Bayesian adaptive Tucker decomposition model that infers the multi-rank automatically via an infinite increasing shrinkage prior. The model introduces local sparsity in the core tensor, inducing rich and at the same time parsimonious dependency structures. Posterior inference proceeds via an efficient adaptive Gibbs sampler, supporting both continuous and binary data and allowing for straightforward missing data imputation when dealing with incomplete multiway data. We discuss fundamental properties of the proposed modeling framework, providing theoretical justification. 
Simulation studies and applications to chemometrics and complex ecological data offer compelling evidence of its advantages over existing tensor factorization methods."}, "https://arxiv.org/abs/2411.10312": {"title": "Generalized Conditional Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2411.10312", "description": "arXiv:2411.10312v1 Announce Type: new \nAbstract: We propose generalized conditional functional principal components analysis (GC-FPCA) for the joint modeling of the fixed and random effects of non-Gaussian functional outcomes. The method scales up to very large functional data sets by estimating the principal components of the covariance matrix on the linear predictor scale conditional on the fixed effects. This is achieved by combining three modeling innovations: (1) fit local generalized linear mixed models (GLMMs) conditional on covariates in windows along the functional domain; (2) conduct a functional principal component analysis (FPCA) on the person-specific functional effects obtained by assembling the estimated random effects from the local GLMMs; and (3) fit a joint functional mixed effects model conditional on covariates and the estimated principal components from the previous step. GC-FPCA was motivated by modeling the minute-level active/inactive profiles over the day ($1{,}440$ 0/1 measurements per person) for $8{,}700$ study participants in the National Health and Nutrition Examination Survey (NHANES) 2011-2014. We show that state-of-the-art approaches cannot handle data of this size and complexity, while GC-FPCA can."}, "https://arxiv.org/abs/2411.10381": {"title": "An Instrumental Variables Framework to Unite Spatial Confounding Methods", "link": "https://arxiv.org/abs/2411.10381", "description": "arXiv:2411.10381v1 Announce Type: new \nAbstract: Studies investigating the causal effects of spatially varying exposures on health$\\unicode{x2013}$such as air pollution, green space, or crime$\\unicode{x2013}$often rely on observational and spatially indexed data. A prevalent challenge is unmeasured spatial confounding, where an unobserved spatially varying variable affects both exposure and outcome, leading to biased causal estimates and invalid confidence intervals. In this paper, we introduce a general framework based on instrumental variables (IV) that encompasses and unites most of the existing methods designed to account for an unmeasured spatial confounder. We show that a common feature of all existing methods is their reliance on small-scale variation in exposure, which functions as an IV. In this framework, we outline the underlying assumptions and the estimation strategy of each method. Furthermore, we demonstrate that the IV can be used to identify and estimate the exposure-response curve under more relaxed assumptions. We conclude by estimating the exposure-response curve between long-term exposure to fine particulate matter and all-cause mortality among 33,454 zip codes in the United States while adjusting for unmeasured spatial confounding."}, "https://arxiv.org/abs/2411.10415": {"title": "Dynamic Causal Effects in a Nonlinear World: the Good, the Bad, and the Ugly", "link": "https://arxiv.org/abs/2411.10415", "description": "arXiv:2411.10415v1 Announce Type: new \nAbstract: Applied macroeconomists frequently use impulse response estimators motivated by linear models. We study whether the estimands of such procedures have a causal interpretation when the true data generating process is in fact nonlinear. 
We show that vector autoregressions and linear local projections onto observed shocks or proxies identify weighted averages of causal effects regardless of the extent of nonlinearities. By contrast, identification approaches that exploit heteroskedasticity or non-Gaussianity of latent shocks are highly sensitive to departures from linearity. Our analysis is based on new results on the identification of marginal treatment effects through weighted regressions, which may also be of interest to researchers outside macroeconomics."}, "https://arxiv.org/abs/2411.09726": {"title": "Spatio-Temporal Jump Model for Urban Thermal Comfort Monitoring", "link": "https://arxiv.org/abs/2411.09726", "description": "arXiv:2411.09726v1 Announce Type: cross \nAbstract: Thermal comfort is essential for well-being in urban spaces, especially as cities face increasing heat from urbanization and climate change. Existing thermal comfort models usually overlook temporal dynamics alongside spatial dependencies. We address this problem by introducing a spatio-temporal jump model that clusters data with persistence across both spatial and temporal dimensions. This framework enhances interpretability, minimizes abrupt state changes, and easily handles missing data. We validate our approach through extensive simulations, demonstrating its accuracy in recovering the true underlying partition. When applied to hourly environmental data gathered from a set of weather stations located across the city of Singapore, our proposal identifies meaningful thermal comfort regimes, demonstrating its effectiveness in dynamic urban settings and suitability for real-world monitoring. The comparison of these regimes with feedback on thermal preference indicates the potential of an unsupervised approach to avoid extensive surveys."}, "https://arxiv.org/abs/2411.10314": {"title": "Estimating the Cost of Informal Care with a Novel Two-Stage Approach to Individual Synthetic Control", "link": "https://arxiv.org/abs/2411.10314", "description": "arXiv:2411.10314v1 Announce Type: cross \nAbstract: Informal carers provide the majority of care for people living with challenges related to older age, long-term illness, or disability. However, the care they provide often results in a significant income penalty for carers, a factor largely overlooked in the economics literature and policy discourse. Leveraging data from the UK Household Longitudinal Study, this paper provides the first robust causal estimates of the caring income penalty using a novel individual synthetic control based method that accounts for unit-level heterogeneity in post-treatment trajectories over time. Our baseline estimates identify an average relative income gap of up to 45%, with an average decrease of {\\pounds}162 in monthly income, peaking at {\\pounds}192 per month after 4 years, based on the difference between informal carers providing the highest-intensity of care and their synthetic counterparts. We find that the income penalty is more pronounced for women than for men, and varies by ethnicity and age."}, "https://arxiv.org/abs/2009.00401": {"title": "Time-Varying Parameters as Ridge Regressions", "link": "https://arxiv.org/abs/2009.00401", "description": "arXiv:2009.00401v4 Announce Type: replace \nAbstract: Time-varying parameters (TVPs) models are frequently used in economics to capture structural change. I highlight a rather underutilized fact -- that these are actually ridge regressions. 
Instantly, this makes computations, tuning, and implementation much easier than in the state-space paradigm. Among other things, solving the equivalent dual ridge problem is computationally very fast even in high dimensions, and the crucial \"amount of time variation\" is tuned by cross-validation. Evolving volatility is dealt with using a two-step ridge regression. I consider extensions that incorporate sparsity (the algorithm selects which parameters vary and which do not) and reduced-rank restrictions (variation is tied to a factor model). To demonstrate the usefulness of the approach, I use it to study the evolution of monetary policy in Canada using large time-varying local projections. The application requires the estimation of about 4600 TVPs, a task well within the reach of the new method."}, "https://arxiv.org/abs/2309.00141": {"title": "Causal Inference under Network Interference Using a Mixture of Randomized Experiments", "link": "https://arxiv.org/abs/2309.00141", "description": "arXiv:2309.00141v2 Announce Type: replace \nAbstract: In randomized experiments, the classic Stable Unit Treatment Value Assumption (SUTVA) posits that the outcome for one experimental unit is unaffected by the treatment assignments of other units. However, this assumption is frequently violated in settings such as online marketplaces and social networks, where interference between units is common. We address the estimation of the total treatment effect in a network interference model by employing a mixed randomization design that combines two widely used experimental methods: Bernoulli randomization, where treatment is assigned independently to each unit, and cluster-based randomization, where treatment is assigned at the aggregate level. The mixed randomization design simultaneously incorporates both methods, thereby mitigating the bias present in cluster-based designs. We propose an unbiased estimator for the total treatment effect under this mixed design and show that its variance is bounded by $O(d^2 n^{-1} p^{-1} (1-p)^{-1})$, where $d$ is the maximum degree of the network, $n$ is the network size, and $p$ is the treatment probability. Additionally, we establish a lower bound of $\\Omega(d^{1.5} n^{-1} p^{-1} (1-p)^{-1})$ for the variance of any mixed design. Moreover, when the interference weights on the network's edges are unknown, we propose a weight-invariant design that achieves a variance bound of $O(d^3 n^{-1} p^{-1} (1-p)^{-1})$, which is aligned with the estimator introduced by Cortez-Rodriguez et al. (2023) under similar conditions."}, "https://arxiv.org/abs/2311.14204": {"title": "Reproducible Aggregation of Sample-Split Statistics", "link": "https://arxiv.org/abs/2311.14204", "description": "arXiv:2311.14204v3 Announce Type: replace \nAbstract: Statistical inference is often simplified by sample-splitting. This simplification comes at the cost of the introduction of randomness not native to the data. We propose a simple procedure for sequentially aggregating statistics constructed with multiple splits of the same sample. The user specifies a bound and a nominal error rate. If the procedure is implemented twice on the same data, the nominal error rate approximates the chance that the results differ by more than the bound. 
We illustrate the application of the procedure to several widely applied econometric methods."}, "https://arxiv.org/abs/2411.10494": {"title": "Efficient inference for differential equation models without numerical solvers", "link": "https://arxiv.org/abs/2411.10494", "description": "arXiv:2411.10494v1 Announce Type: new \nAbstract: Parameter inference is essential when we wish to interpret observational data using mathematical models. Standard inference methods for differential equation models typically rely on obtaining repeated numerical solutions of the differential equation(s). Recent results have explored how numerical truncation error can have major, detrimental, and sometimes hidden impacts on likelihood-based inference by introducing false local maxima into the log-likelihood function. We present a straightforward approach for inference that eliminates the need for solving the underlying differential equations, thereby completely avoiding the impact of truncation error. Open-access Jupyter notebooks, available on GitHub, allow others to implement this method for a broad class of widely-used models to interpret biological data."}, "https://arxiv.org/abs/2411.10600": {"title": "Monetary Incentives, Landowner Preferences: Estimating Cross-Elasticities in Farmland Conversion to Renewable Energy", "link": "https://arxiv.org/abs/2411.10600", "description": "arXiv:2411.10600v1 Announce Type: new \nAbstract: This study examines the impact of monetary factors on the conversion of farmland to renewable energy generation, specifically solar and wind, in the context of expanding U.S. energy production. We propose a new econometric method that accounts for the diverse circumstances of landowners, including their unordered alternative land use options, non-monetary benefits from farming, and the influence of local regulations. We demonstrate that identifying the cross elasticity of landowners' farming income in relation to the conversion of farmland to renewable energy requires an understanding of their preferences. By utilizing county legislation that we assume to be shaped by land-use preferences, we estimate the cross-elasticities of farming income. Our findings indicate that monetary incentives may only influence landowners' decisions in areas with potential for future residential development, underscoring the importance of considering both preferences and regulatory contexts."}, "https://arxiv.org/abs/2411.10615": {"title": "Inference for overparametrized hierarchical Archimedean copulas", "link": "https://arxiv.org/abs/2411.10615", "description": "arXiv:2411.10615v1 Announce Type: new \nAbstract: Hierarchical Archimedean copulas (HACs) are multivariate uniform distributions constructed by nesting Archimedean copulas into one another, and provide a flexible approach to modeling non-exchangeable data. However, this flexibility in the model structure may lead to over-fitting when the model estimation procedure is not performed properly. In this paper, we examine the problem of structure estimation and, more generally, the selection of a parsimonious model from the hypothesis testing perspective. Formal tests for structural hypotheses concerning HACs have been lacking so far, most likely due to the restrictions on their associated parameter space, which hinder the use of standard inference methodology. 
Building on previously developed asymptotic methods for these non-standard parameter spaces, we provide an asymptotic stochastic representation for the maximum likelihood estimators of (potentially) overparametrized HACs, which we then use to formulate a likelihood ratio test for certain common structural hypotheses. Additionally, we also derive analytical expressions for the first- and second-order partial derivatives of two-level HACs based on Clayton and Gumbel generators, as well as general numerical approximation schemes for the Fisher information matrix."}, "https://arxiv.org/abs/2411.10620": {"title": "Doubly Robust Estimation of Causal Excursion Effects in Micro-Randomized Trials with Missing Longitudinal Outcomes", "link": "https://arxiv.org/abs/2411.10620", "description": "arXiv:2411.10620v1 Announce Type: new \nAbstract: Micro-randomized trials (MRTs) are increasingly utilized for optimizing mobile health interventions, with the causal excursion effect (CEE) as a central quantity for evaluating interventions under policies that deviate from the experimental policy. However, MRT often contains missing data due to reasons such as missed self-reports or participants not wearing sensors, which can bias CEE estimation. In this paper, we propose a two-stage, doubly robust estimator for CEE in MRTs when longitudinal outcomes are missing at random, accommodating continuous, binary, and count outcomes. Our two-stage approach allows for both parametric and nonparametric modeling options for two nuisance parameters: the missingness model and the outcome regression. We demonstrate that our estimator is doubly robust, achieving consistency and asymptotic normality if either the missingness or the outcome regression model is correctly specified. Simulation studies further validate the estimator's desirable finite-sample performance. We apply the method to HeartSteps, an MRT for developing mobile health interventions that promote physical activity."}, "https://arxiv.org/abs/2411.10623": {"title": "Sensitivity Analysis for Observational Studies with Flexible Matched Designs", "link": "https://arxiv.org/abs/2411.10623", "description": "arXiv:2411.10623v1 Announce Type: new \nAbstract: Observational studies provide invaluable opportunities to draw causal inference, but they may suffer from biases due to pretreatment difference between treated and control units. Matching is a popular approach to reduce observed covariate imbalance. To tackle unmeasured confounding, a sensitivity analysis is often conducted to investigate how robust a causal conclusion is to the strength of unmeasured confounding. For matched observational studies, Rosenbaum proposed a sensitivity analysis framework that uses the randomization of treatment assignments as the \"reasoned basis\" and imposes no model assumptions on the potential outcomes as well as their dependence on the observed and unobserved confounding factors. However, this otherwise appealing framework requires exact matching to guarantee its validity, which is hard to achieve in practice. In this paper we provide an alternative inferential framework that shares the same procedure as Rosenbaum's approach but relies on a different justification. 
Our framework allows flexible matching algorithms and utilizes an alternative source of randomness, in particular random permutations of potential outcomes instead of treatment assignments, to guarantee statistical validity."}, "https://arxiv.org/abs/2411.10628": {"title": "Feature Importance of Climate Vulnerability Indicators with Gradient Boosting across Five Global Cities", "link": "https://arxiv.org/abs/2411.10628", "description": "arXiv:2411.10628v1 Announce Type: new \nAbstract: Efforts are needed to identify and measure both communities' exposure to climate hazards and the social vulnerabilities that interact with these hazards, but the science of validating hazard vulnerability indicators is still in its infancy. Progress is needed to improve: 1) the selection of variables that are used as proxies to represent hazard vulnerability; 2) the applicability and scale for which these indicators are intended, including their transnational applicability. We administered an international urban survey in Buenos Aires, Argentina; Johannesburg, South Africa; London, United Kingdom; New York City, United States; and Seoul, South Korea in order to collect data on exposure to various types of extreme weather events, socioeconomic characteristics commonly used as proxies for vulnerability (i.e., income, education level, gender, and age), and additional characteristics not often included in existing composite indices (i.e., queer identity, disability identity, non-dominant primary language, and self-perceptions of both discrimination and vulnerability to flood risk). We then use feature importance analysis with gradient-boosted decision trees to measure the importance that these variables have in predicting exposure to various types of extreme weather events. Our results show that non-traditional variables were more relevant to self-reported exposure to extreme weather events than traditionally employed variables such as income or age. Furthermore, differences in variable relevance across different types of hazards and across urban contexts suggest that vulnerability indicators need to be fit to context and should not be used in a one-size-fits-all fashion."}, "https://arxiv.org/abs/2411.10647": {"title": "False Discovery Control in Multiple Testing: A Brief Overview of Theories and Methodologies", "link": "https://arxiv.org/abs/2411.10647", "description": "arXiv:2411.10647v1 Announce Type: new \nAbstract: As the volume and complexity of data continue to expand across various scientific disciplines, the need for robust methods to account for the multiplicity of comparisons has grown widespread. A popular measure of type 1 error rate in the multiple testing literature is the false discovery rate (FDR). The FDR provides a powerful and practical approach to large-scale multiple testing and has been successfully used in a wide range of applications. The concept of FDR has gained wide acceptance in the statistical community and various methods have been proposed to control the FDR. In this work, we review the latest developments in FDR control methodologies. 
We also develop a conceptual framework to better describe this vast literature; understand its intuition and key ideas; and provide guidance for the researcher interested in both the application and development of the methodology."}, "https://arxiv.org/abs/2411.10648": {"title": "Subsampling-based Tests in Mediation Analysis", "link": "https://arxiv.org/abs/2411.10648", "description": "arXiv:2411.10648v1 Announce Type: new \nAbstract: Testing for a mediation effect poses a challenge since the null hypothesis (i.e., the absence of mediation effects) is composite, making most existing mediation tests quite conservative and often underpowered. In this work, we propose a subsampling-based procedure to construct a test statistic whose asymptotic null distribution is pivotal and remains the same regardless of the three null cases encountered in mediation analysis. The method, when combined with the popular Sobel test, leads to accurate size control under the null. We further introduce a Cauchy combination test to construct p-values from different subsample splits, which reduces variability in the testing results and increases detection power. Through numerical studies, our approach has demonstrated a more accurate size and higher detection power than the competing classical and contemporary methods."}, "https://arxiv.org/abs/2411.10768": {"title": "Building Interpretable Climate Emulators for Economics", "link": "https://arxiv.org/abs/2411.10768", "description": "arXiv:2411.10768v1 Announce Type: new \nAbstract: This paper presents a framework for developing efficient and interpretable carbon-cycle emulators (CCEs) as part of climate emulators in Integrated Assessment Models, enabling economists to custom-build CCEs accurately calibrated to advanced climate science. We propose a generalized multi-reservoir linear box-model CCE that preserves key physical quantities and can be tailored to specific use cases. Three CCEs are presented for illustration: the 3SR model (replicating DICE-2016), the 4PR model (including the land biosphere), and the 4PR-X model (accounting for dynamic land-use changes like deforestation that impact the reservoir's storage capacity). Evaluation of these models within the DICE framework shows that land-use changes in the 4PR-X model significantly impact atmospheric carbon and temperatures -- emphasizing the importance of using tailored climate emulators. By providing a transparent and flexible tool for policy analysis, our framework allows economists to assess the economic impacts of climate policies more accurately."}, "https://arxiv.org/abs/2411.10801": {"title": "Improving Causal Estimation by Mixing Samples to Address Weak Overlap in Observational Studies", "link": "https://arxiv.org/abs/2411.10801", "description": "arXiv:2411.10801v1 Announce Type: new \nAbstract: Sufficient overlap of propensity scores is one of the most critical assumptions in observational studies. In this article, we examine the severity of the consequences for statistical inference when this assumption fails under weighting, one of the most dominant causal inference methodologies. Then we propose a simple, yet novel remedy: \"mixing\" the treated and control groups in the observed dataset. We state that our strategy has three key advantages: (1) improved estimator accuracy, especially under weak overlap, (2) an identical target population for the treatment effect, and (3) high flexibility. 
We introduce a property of the mixed sample that offers safer inference when applied to both traditional and modern weighting methods. We illustrate this with several extensive simulation studies and guide readers through a real-data analysis for practice."}, "https://arxiv.org/abs/2411.10858": {"title": "Scalable Gaussian Process Regression Via Median Posterior Inference for Estimating Multi-Pollutant Mixture Health Effects", "link": "https://arxiv.org/abs/2411.10858", "description": "arXiv:2411.10858v1 Announce Type: new \nAbstract: Humans are exposed to complex mixtures of environmental pollutants rather than single chemicals, necessitating methods to quantify the health effects of such mixtures. Research on environmental mixtures provides insights into realistic exposure scenarios, informing regulatory policies that better protect public health. However, statistical challenges, including complex correlations among pollutants and nonlinear multivariate exposure-response relationships, complicate such analyses. A popular Bayesian semi-parametric Gaussian process regression framework (Coull et al., 2015) addresses these challenges by modeling exposure-response functions with Gaussian processes and performing feature selection to manage high-dimensional exposures while accounting for confounders. Originally designed for small to moderate-sized cohort studies, this framework does not scale well to massive datasets. To address this, we propose a divide-and-conquer strategy, partitioning data, computing posterior distributions in parallel, and combining results using the generalized median. While we focus on Gaussian process models for environmental mixtures, the proposed distributed computing strategy is broadly applicable to other Bayesian models with computationally prohibitive full-sample Markov Chain Monte Carlo fitting. We provide theoretical guarantees for the convergence of the proposed posterior distributions to those derived from the full sample. We apply this method to estimate associations between a mixture of ambient air pollutants and ~650,000 birthweights recorded in Massachusetts during 2001-2012. Our results reveal negative associations between birthweight and traffic pollution markers, including elemental and organic carbon and PM2.5, and positive associations with ozone and vegetation greenness."}, "https://arxiv.org/abs/2411.10908": {"title": "The Conflict Graph Design: Estimating Causal Effects under Arbitrary Neighborhood Interference", "link": "https://arxiv.org/abs/2411.10908", "description": "arXiv:2411.10908v1 Announce Type: new \nAbstract: A fundamental problem in network experiments is selecting an appropriate experimental design in order to precisely estimate a given causal effect of interest. In fact, optimal rates of estimation remain unknown for essentially all causal effects in network experiments. In this work, we propose a general approach for constructing experiment designs under network interference with the goal of precisely estimating a pre-specified causal effect. A central aspect of our approach is the notion of a conflict graph, which captures the fundamental unobservability associated with the causal effect and the underlying network. We refer to our experimental design as the Conflict Graph Design. In order to estimate effects, we propose a modified Horvitz--Thompson estimator. 
We show that its variance under the Conflict Graph Design is bounded as $O(\\lambda(H) / n )$, where $\\lambda(H)$ is the largest eigenvalue of the adjacency matrix of the conflict graph. These rates depend on both the underlying network and the particular causal effect under investigation. Not only does this yield the best known rates of estimation for several well-studied causal effects (e.g. the global and direct effects) but it also provides new methods for effects which have received less attention from the perspective of experiment design (e.g. spill-over effects). Our results corroborate two implicitly understood points in the literature: (1) that in order to increase precision, experiment designs should be tailored to specific causal effects of interest and (2) that \"more local\" effects are easier to estimate than \"more global\" effects. In addition to point estimation, we construct conservative variance estimators which facilitate the construction of asymptotically valid confidence intervals for the causal effect of interest."}, "https://arxiv.org/abs/2411.10959": {"title": "Program Evaluation with Remotely Sensed Outcomes", "link": "https://arxiv.org/abs/2411.10959", "description": "arXiv:2411.10959v1 Announce Type: new \nAbstract: While traditional program evaluations typically rely on surveys to measure outcomes, certain economic outcomes such as living standards or environmental quality may be infeasible or costly to collect. As a result, recent empirical work estimates treatment effects using remotely sensed variables (RSVs), such as mobile phone activity or satellite images, instead of ground-truth outcome measurements. Common practice predicts the economic outcome from the RSV, using an auxiliary sample of labeled RSVs, and then uses such predictions as the outcome in the experiment. We prove that this approach leads to biased estimates of treatment effects when the RSV is a post-outcome variable. We nonparametrically identify the treatment effect, using an assumption that reflects the logic of recent empirical research: the conditional distribution of the RSV remains stable across both samples, given the outcome and treatment. Our results do not require researchers to know or consistently estimate the relationship between the RSV, outcome, and treatment, which is typically mis-specified with unstructured data. We form a representation of the RSV for downstream causal inference by predicting the outcome and predicting the treatment, with better predictions leading to more precise causal estimates. We re-evaluate the efficacy of a large-scale public program in India, showing that the program's measured effects on local consumption and poverty can be replicated using satellite"}, "https://arxiv.org/abs/2411.10971": {"title": "A novel density-based approach for estimating unknown means, distribution visualisations and meta-analyses of quantiles", "link": "https://arxiv.org/abs/2411.10971", "description": "arXiv:2411.10971v1 Announce Type: new \nAbstract: In meta-analysis with continuous outcomes, the use of effect sizes based on the means is the most common. It is often found, however, that only the quantile summary measures are reported in some studies, and in certain scenarios, a meta-analysis of the quantiles themselves is of interest. We propose a novel density-based approach to support the implementation of a comprehensive meta-analysis, when only the quantile summary measures are reported. 
The proposed approach uses flexible quantile-based distributions and percentile matching to estimate the unknown parameters without making any prior assumptions about the underlying distributions. Using simulated and real data, we show that the proposed novel density-based approach works as well as or better than the widely-used methods in estimating the means using quantile summaries without assuming a distribution a priori, and provides a novel tool for distribution visualisations. In addition to this, we introduce quantile-based meta-analysis methods for situations where a comparison of quantiles between groups is itself of interest and found to be more suitable. Using both real and simulated data, we also demonstrate the applicability of these quantile-based methods."}, "https://arxiv.org/abs/2411.10980": {"title": "A joint modeling approach to treatment effects estimation with unmeasured confounders", "link": "https://arxiv.org/abs/2411.10980", "description": "arXiv:2411.10980v1 Announce Type: new \nAbstract: Estimating treatment effects using observational data often relies on the assumption of no unmeasured confounders. However, unmeasured confounding variables may exist in many real-world problems. Failing to incorporate the unmeasured confounding effect can lead to biased estimation. To address this problem, this paper proposes a new mixed-effects joint modeling approach to identifying and estimating the OR function and the PS function in the presence of unmeasured confounders in longitudinal data settings. As a result, we can obtain the estimators of the average treatment effect and heterogeneous treatment effects. In our proposed setting, we allow interaction effects of the treatment and unmeasured confounders on the outcome. Moreover, we propose a new Laplacian-variant EM algorithm to estimate the parameters in the joint models. We apply the method to a real-world application from the CitieS-Health Barcelona Panel Study, in which we study the effect of short-term air pollution exposure on mental health."}, "https://arxiv.org/abs/2411.11058": {"title": "Econometrics and Formalism of Psychological Archetypes of Scientific Workers with Introverted Thinking Type", "link": "https://arxiv.org/abs/2411.11058", "description": "arXiv:2411.11058v1 Announce Type: new \nAbstract: The chronological hierarchy and classification of psychological types of individuals are examined. The anomalous nature of psychological activity in individuals involved in scientific work is highlighted. Certain aspects of the introverted thinking type in scientific activities are analyzed. For the first time, psychological archetypes of scientists with pronounced introversion are postulated in the context of twelve hypotheses about the specifics of professional attributes of introverted scientific activities.\n A linear regression and Bayesian equation are proposed for quantitatively assessing the econometric degree of introversion in scientific employees, considering a wide range of characteristics inherent to introverts in scientific processing. 
Specifically, expressions for a comprehensive assessment of introversion in a linear model and the posterior probability of the econometric (scientometric) degree of introversion in a Bayesian model are formulated.\n The models are based on several econometric (scientometric) hypotheses regarding various aspects of professional activities of introverted scientists, such as a preference for solo publications, low social activity, narrow specialization, high research depth, and so forth. Empirical data and multiple linear regression methods can be used to calibrate the equations. The model can be applied to gain a deeper understanding of the psychological characteristics of scientific employees, which is particularly useful in ergonomics and the management of scientific teams and projects. The proposed method also provides scientists with pronounced introversion the opportunity to develop their careers, focusing on individual preferences and features."}, "https://arxiv.org/abs/2411.11248": {"title": "A model-free test of the time-reversibility of climate change processes", "link": "https://arxiv.org/abs/2411.11248", "description": "arXiv:2411.11248v1 Announce Type: new \nAbstract: Time-reversibility is a crucial feature of many time series models, while time-irreversibility is the rule rather than the exception in real-life data. Testing the null hypothesis of time-reversibility, therefore, should be an important step preliminary to the identification and estimation of most traditional time-series models. Existing procedures, however, mostly consist of testing necessary but not sufficient conditions, leading to under-rejection, or sufficient but non-necessary ones, which leads to over-rejection. Moreover, they generally are model-based. In contrast, the copula spectrum studied by Goto et al. ($\\textit{Ann. Statist.}$ 2022, $\\textbf{50}$: 3563--3591) allows for a model-free necessary and sufficient time-reversibility condition. A test based on this copula-spectrum-based characterization has been proposed by the authors. This paper illustrates the performance of this test with an application to the analysis of climatic data."}, "https://arxiv.org/abs/2411.11270": {"title": "Unbiased Approximations for Stationary Distributions of McKean-Vlasov SDEs", "link": "https://arxiv.org/abs/2411.11270", "description": "arXiv:2411.11270v1 Announce Type: new \nAbstract: We consider the development of unbiased estimators to approximate the stationary distribution of McKean-Vlasov stochastic differential equations (MVSDEs). These are an important class of processes, which frequently appear in applications such as mathematical finance, biology and opinion dynamics. Typically the stationary distribution is unknown and indeed one cannot simulate such processes exactly. As a result one commonly requires a time-discretization scheme which results in a discretization bias and a bias from not being able to simulate the associated stationary distribution. To overcome this bias, we present a new unbiased estimator taking motivation from the literature on unbiased Monte Carlo. We prove the unbiasedness of our estimator, under assumptions. In order to prove this, we develop ergodicity results for various discrete-time processes, through an appropriate discretization scheme, towards the invariant measure. Numerous numerical experiments are provided, on a range of MVSDEs, to demonstrate the effectiveness of our unbiased estimator. 
Such examples include the Curie-Weiss model, a 3D neuroscience model and a parameter estimation problem."}, "https://arxiv.org/abs/2411.11301": {"title": "Subgroup analysis in multi level hierarchical cluster randomized trials", "link": "https://arxiv.org/abs/2411.11301", "description": "arXiv:2411.11301v1 Announce Type: new \nAbstract: Cluster or group randomized trials (CRTs) are increasingly used for both behavioral and system-level interventions, where entire clusters are randomly assigned to a study condition or intervention. Apart from the assigned cluster-level analysis, investigating whether an intervention has a differential effect for specific subgroups remains an important issue, though it is often considered an afterthought in pivotal clinical trials. Determining such subgroup effects in a CRT is a challenging task due to its inherent nested cluster structure. Motivated by a real-life HIV prevention CRT, we consider a three-level cross-sectional CRT, where randomization is carried out at the highest level and subgroups may exist at different levels of the hierarchy. We employ a linear mixed-effects model to estimate the subgroup-specific effects through their maximum likelihood estimators (MLEs). Consequently, we develop a consistent test for the significance of the differential intervention effect between two subgroups at different levels of the hierarchy, which is the key methodological contribution of this work. We also derive explicit formulae for sample size determination to detect a differential intervention effect between two subgroups, aiming to achieve a given statistical power in the case of a planned confirmatory subgroup analysis. The application of our methodology is illustrated through extensive simulation studies using synthetic data, as well as with real-world data from an HIV prevention CRT in The Bahamas."}, "https://arxiv.org/abs/2411.11498": {"title": "Efficient smoothness selection for nonparametric Markov-switching models via quasi restricted maximum likelihood", "link": "https://arxiv.org/abs/2411.11498", "description": "arXiv:2411.11498v1 Announce Type: new \nAbstract: Markov-switching models are powerful tools that allow capturing complex patterns from time series data driven by latent states. Recent work has highlighted the benefits of estimating components of these models nonparametrically, enhancing their flexibility and reducing biases, which in turn can improve state decoding, forecasting, and overall inference. Formulating such models using penalized splines is straightforward, but practically feasible methods for a data-driven smoothness selection in these models are still lacking. Traditional techniques, such as cross-validation and information criteria-based selection, suffer from major drawbacks, most importantly their reliance on computationally expensive grid search methods, hampering practical usability for Markov-switching models. Michelot (2022) suggested treating spline coefficients as random effects with a multivariate normal distribution and using the R package TMB (Kristensen et al., 2016) for marginal likelihood maximization. While this method avoids grid search and typically results in adequate smoothness selection, it entails a nested optimization problem, thus being computationally demanding. We propose to exploit the simple structure of penalized splines treated as random effects, thereby greatly reducing the computational burden while potentially improving fixed effects parameter estimation accuracy. 
Our proposed method offers a reliable and efficient mechanism for smoothness selection, rendering the estimation of Markov-switching models involving penalized splines feasible for complex data structures."}, "https://arxiv.org/abs/2411.11559": {"title": "Treatment Effect Estimators as Weighted Outcomes", "link": "https://arxiv.org/abs/2411.11559", "description": "arXiv:2411.11559v1 Announce Type: new \nAbstract: Estimators weighting observed outcomes to form an effect estimate have a long tradition. The corresponding outcome weights are utilized in established procedures, e.g. to check covariate balancing, to characterize target populations, or to detect and manage extreme weights. This paper provides a general framework to derive the functional form of such weights. It establishes when and how numerical equivalence between an original estimator representation as moment condition and a unique weighted representation can be obtained. The framework is applied to derive novel outcome weights for the leading cases of double machine learning and generalized random forests, with existing results recovered as special cases. The analysis highlights that implementation choices determine (i) the availability of outcome weights and (ii) their properties. Notably, standard implementations of partially linear regression based estimators - like causal forest - implicitly apply outcome weights that do not sum to (minus) one in the (un)treated group as usually considered desirable."}, "https://arxiv.org/abs/2411.11580": {"title": "Metric Oja Depth, New Statistical Tool for Estimating the Most Central Objects", "link": "https://arxiv.org/abs/2411.11580", "description": "arXiv:2411.11580v1 Announce Type: new \nAbstract: The Oja depth (simplicial volume depth) is one of the classical statistical techniques for measuring the central tendency of data in multivariate space. Despite the widespread emergence of object data like images, texts, matrices or graphs, a well-developed and suitable version of Oja depth for object data is lacking. To address this shortcoming, in this study we propose a novel measure of statistical depth, the metric Oja depth applicable to any object data. Then, we develop two competing strategies for optimizing metric depth functions, i.e., finding the deepest objects with respect to them. Finally, we compare the performance of the metric Oja depth with three other depth functions (half-space, lens, and spatial) in diverse data scenarios.\n Keywords: Object Data, Metric Oja depth, Statistical depth, Optimization, Genetic algorithm, Metric statistics"}, "https://arxiv.org/abs/2411.11675": {"title": "Nonparametric Bayesian approach for dynamic borrowing of historical control data", "link": "https://arxiv.org/abs/2411.11675", "description": "arXiv:2411.11675v1 Announce Type: new \nAbstract: When incorporating historical control data into the analysis of current randomized controlled trial data, it is critical to account for differences between the datasets. When the cause of the difference is an unmeasured factor and adjustment for observed covariates only is insufficient, it is desirable to use a dynamic borrowing method that reduces the impact of heterogeneous historical controls. We propose a nonparametric Bayesian approach for borrowing historical controls that are homogeneous with the current control. Additionally, to emphasize the resolution of conflicts between the historical controls and current control, we introduce a method based on the dependent Dirichlet process mixture. 
The proposed methods can be implemented using the same procedure, regardless of whether the outcome data comprise aggregated study-level data or individual participant data. We also develop a novel index of similarity between the historical and current control data, based on the posterior distribution of the parameter of interest. We conduct a simulation study and analyze clinical trial examples to evaluate the performance of the proposed methods compared to existing methods. The proposed method based on the dependent Dirichlet process mixture can more accurately borrow from homogeneous historical controls while reducing the impact of heterogeneous historical controls compared to the typical Dirichlet process mixture. The proposed methods outperform existing methods in scenarios with heterogeneous historical controls, in which the meta-analytic approach is ineffective."}, "https://arxiv.org/abs/2411.11728": {"title": "Davis-Kahan Theorem in the two-to-infinity norm and its application to perfect clustering", "link": "https://arxiv.org/abs/2411.11728", "description": "arXiv:2411.11728v1 Announce Type: new \nAbstract: Many statistical applications, such as Principal Component Analysis, matrix completion, tensor regression and many others, rely on accurate estimation of leading eigenvectors of a matrix. The Davis-Kahan theorem is known to be instrumental for bounding above the distances between matrices $U$ and $\\widehat{U}$ of population eigenvectors and their sample versions. While those distances can be measured in various metrics, recent developments have shown the advantages of evaluating the deviation in the two-to-infinity norm. The purpose of this paper is to provide upper bounds for the distances between $U$ and $\\widehat{U}$ in the two-to-infinity norm for a variety of possible scenarios and competitive approaches. Although this problem has been studied by several authors, the difference between this paper and its predecessors is that the upper bounds are obtained with no or mild probabilistic assumptions on the error distributions. Those bounds are subsequently refined when some generic probabilistic assumptions on the errors hold. In addition, the paper provides alternative methods for evaluation of $\\widehat{U}$ and, therefore, enables one to compare the resulting accuracies. As an example of an application of the results in the paper, we derive sufficient conditions for perfect clustering in a generic setting, and then employ them in various scenarios."}, "https://arxiv.org/abs/2411.11737": {"title": "Randomization-based Z-estimation for evaluating average and individual treatment effects", "link": "https://arxiv.org/abs/2411.11737", "description": "arXiv:2411.11737v1 Announce Type: new \nAbstract: Randomized experiments have been the gold standard for drawing causal inference. The conventional model-based approach has been one of the most popular ways for analyzing treatment effects from randomized experiments, which is often carried out through inference for certain model parameters. In this paper, we provide a systematic investigation of model-based analyses for treatment effects under the randomization-based inference framework. This framework does not impose any distributional assumptions on the outcomes, covariates and their dependence, and utilizes only randomization as the \"reasoned basis\". We first derive the asymptotic theory for Z-estimation in completely randomized experiments, and propose sandwich-type conservative covariance estimation. 
We then apply the developed theory to analyze both average and individual treatment effects in randomized experiments. For the average treatment effect, we consider three estimation strategies: model-based, model-imputed, and model-assisted, where the first two can be sensitive to model misspecification or require specific ways for parameter estimation. The model-assisted approach is robust to arbitrary model misspecification and always provides consistent average treatment effect estimation. We propose optimal ways to conduct model-assisted estimation using generally nonlinear least squares for parameter estimation. For the individual treatment effects, we propose to directly model the relationship between individual effects and covariates, and discuss the model's identifiability, inference and interpretation, allowing for model misspecification."}, "https://arxiv.org/abs/2411.10646": {"title": "Wasserstein Spatial Depth", "link": "https://arxiv.org/abs/2411.10646", "description": "arXiv:2411.10646v1 Announce Type: cross \nAbstract: Modeling observations as random distributions embedded within Wasserstein spaces is becoming increasingly popular across scientific fields, as it captures the variability and geometric structure of the data more effectively. However, the distinct geometry and unique properties of Wasserstein space pose challenges to the application of conventional statistical tools, which are primarily designed for Euclidean spaces. Consequently, adapting and developing new methodologies for analysis within Wasserstein spaces has become essential. The space of distributions on $\\mathbb{R}^d$ with $d>1$ is not linear, and ''mimics'' the geometry of a Riemannian manifold. In this paper, we extend the concept of statistical depth to distribution-valued data, introducing the notion of {\\it Wasserstein spatial depth}. This new measure provides a way to rank and order distributions, enabling the development of order-based clustering techniques and inferential tools. We show that Wasserstein spatial depth (WSD) preserves critical properties of conventional statistical depths, notably, ranging within $[0,1]$, transformation invariance, vanishing at infinity, reaching a maximum at the geometric median, and continuity. Additionally, the population WSD has a straightforward plug-in estimator based on sampled empirical distributions. We establish the estimator's consistency and asymptotic normality. Extensive simulations and a real-data application showcase the practical efficacy of WSD."}, "https://arxiv.org/abs/2411.10982": {"title": "Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations", "link": "https://arxiv.org/abs/2411.10982", "description": "arXiv:2411.10982v1 Announce Type: cross \nAbstract: We propose and study a minimalist approach towards synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with a contingent clustering step or log transformation to handle nonlinearity) and an XGBoost decoder, which is SOTA for structured data regression and classification tasks. We study and contrast the methodologies with (variational) autoencoders in several toy low dimensional scenarios to derive necessary intuitions. The framework is applied to high dimensional simulated credit scoring data which parallels real-life financial applications. We apply the method to robustness testing to demonstrate practical use cases. 
The case study result suggests that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, guarantees interpretability all the way through, does not require extra tuning, and provides unique benefits."}, "https://arxiv.org/abs/2411.11132": {"title": "Variational Bayesian Bow tie Neural Networks with Shrinkage", "link": "https://arxiv.org/abs/2411.11132", "description": "arXiv:2411.11132v1 Announce Type: cross \nAbstract: Despite the dominant role of deep models in machine learning, limitations persist, including overconfident predictions, susceptibility to adversarial attacks, and underestimation of variability in predictions. The Bayesian paradigm provides a natural framework to overcome such issues and has become the gold standard for uncertainty estimation with deep models, also providing improved accuracy and a framework for tuning critical hyperparameters. However, exact Bayesian inference is challenging, typically involving variational algorithms that impose strong independence and distributional assumptions. Moreover, existing methods are sensitive to the architectural choice of the network. We address these issues by constructing a relaxed version of the standard feed-forward rectified neural network, and employing Polya-Gamma data augmentation tricks to render a conditionally linear and Gaussian model. Additionally, we use sparsity-promoting priors on the weights of the neural network for data-driven architectural design. To approximate the posterior, we derive a variational inference algorithm that avoids distributional assumptions and independence across layers and is a faster alternative to the usual Markov Chain Monte Carlo schemes."}, "https://arxiv.org/abs/2411.11203": {"title": "Debiasing Watermarks for Large Language Models via Maximal Coupling", "link": "https://arxiv.org/abs/2411.11203", "description": "arXiv:2411.11203v1 Announce Type: cross \nAbstract: Watermarking language models is essential for distinguishing between human and machine-generated text and thus maintaining the integrity and trustworthiness of digital communication. We present a novel green/red list watermarking approach that partitions the token set into ``green'' and ``red'' lists, subtly increasing the generation probability for green tokens. To correct token distribution bias, our method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal. Theoretical analysis confirms this approach's unbiased nature and robust detection capabilities. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability, and it demonstrates resilience to targeted modifications aimed at improving text quality. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality."}, "https://arxiv.org/abs/2411.11271": {"title": "Mean Estimation in Banach Spaces Under Infinite Variance and Martingale Dependence", "link": "https://arxiv.org/abs/2411.11271", "description": "arXiv:2411.11271v1 Announce Type: cross \nAbstract: We consider estimating the shared mean of a sequence of heavy-tailed random variables taking values in a Banach space. In particular, we revisit and extend a simple truncation-based mean estimator by Catoni and Giulini. 
While existing truncation-based approaches require a bound on the raw (non-central) second moment of observations, our results hold under a bound on either the central or non-central $p$th moment for some $p > 1$. In particular, our results hold for distributions with infinite variance. The main contributions of the paper follow from exploiting connections between truncation-based mean estimation and the concentration of martingales in 2-smooth Banach spaces. We prove two types of time-uniform bounds on the distance between the estimator and unknown mean: line-crossing inequalities, which can be optimized for a fixed sample size $n$, and non-asymptotic law of the iterated logarithm type inequalities, which match the tightness of line-crossing inequalities at all points in time up to a doubly logarithmic factor in $n$. Our results do not depend on the dimension of the Banach space, hold under martingale dependence, and all constants in the inequalities are known and small."}, "https://arxiv.org/abs/2411.11748": {"title": "Debiased Regression for Root-N-Consistent Conditional Mean Estimation", "link": "https://arxiv.org/abs/2411.11748", "description": "arXiv:2411.11748v1 Announce Type: cross \nAbstract: This study introduces a debiasing method for regression estimators, including high-dimensional and nonparametric regression estimators. For example, nonparametric regression methods allow for the estimation of regression functions in a data-driven manner with minimal assumptions; however, these methods typically fail to achieve $\\sqrt{n}$-consistency in their convergence rates, and many, including those in machine learning, lack guarantees that their estimators asymptotically follow a normal distribution. To address these challenges, we propose a debiasing technique for nonparametric estimators by adding a bias-correction term to the original estimators, extending the conventional one-step estimator used in semiparametric analysis. Specifically, for each data point, we estimate the conditional expected residual of the original nonparametric estimator, which can, for instance, be computed using kernel (Nadaraya-Watson) regression, and incorporate it as a bias-reduction term. Our theoretical analysis demonstrates that the proposed estimator achieves $\\sqrt{n}$-consistency and asymptotic normality under a mild convergence rate condition for both the original nonparametric estimator and the conditional expected residual estimator. Notably, this approach remains model-free as long as the original estimator and the conditional expected residual estimator satisfy the convergence rate condition. The proposed method offers several advantages, including improved estimation accuracy and simplified construction of confidence intervals."}, "https://arxiv.org/abs/2411.11824": {"title": "Theoretical Foundations of Conformal Prediction", "link": "https://arxiv.org/abs/2411.11824", "description": "arXiv:2411.11824v1 Announce Type: cross \nAbstract: This book is about conformal prediction and related inferential techniques that build on permutation tests and exchangeability. These techniques are useful in a diverse array of tasks, including hypothesis testing and providing uncertainty quantification guarantees for machine learning systems. Much of the current interest in conformal prediction is due to its ability to integrate into complex machine learning workflows, solving the problem of forming prediction sets without any assumptions on the form of the data generating distribution. 
Since contemporary machine learning algorithms have generally proven difficult to analyze directly, conformal prediction's main appeal is its ability to provide formal, finite-sample guarantees when paired with such methods.\n The goal of this book is to teach the reader about the fundamental technical arguments that arise when researching conformal prediction and related questions in distribution-free inference. Many of these proof strategies, especially the more recent ones, are scattered among research papers, making it difficult for researchers to understand where to look, which results are important, and how exactly the proofs work. We hope to bridge this gap by curating what we believe to be some of the most important results in the literature and presenting their proofs in a unified language, with illustrations, and with an eye towards pedagogy."}, "https://arxiv.org/abs/2310.08115": {"title": "Model-Agnostic Covariate-Assisted Inference on Partially Identified Causal Effects", "link": "https://arxiv.org/abs/2310.08115", "description": "arXiv:2310.08115v2 Announce Type: replace \nAbstract: Many causal estimands are only partially identifiable since they depend on the unobservable joint distribution between potential outcomes. Stratification on pretreatment covariates can yield sharper bounds; however, unless the covariates are discrete with relatively small support, this approach typically requires binning covariates or estimating the conditional distributions of the potential outcomes given the covariates. Binning can result in substantial efficiency loss and become challenging to implement, even with a moderate number of covariates. Estimating conditional distributions, on the other hand, may yield invalid inference if the distributions are inaccurately estimated, such as when a misspecified model is used or when the covariates are high-dimensional. In this paper, we propose a unified and model-agnostic inferential approach for a wide class of partially identified estimands. Our method, based on duality theory for optimal transport problems, has four key properties. First, in randomized experiments, our approach can wrap around any estimates of the conditional distributions and provide uniformly valid inference, even if the initial estimates are arbitrarily inaccurate. A simple extension of our method to observational studies is doubly robust in the usual sense. Second, if nuisance parameters are estimated at semiparametric rates, our estimator is asymptotically unbiased for the sharp partial identification bound. Third, we can apply the multiplier bootstrap to select covariates and models without sacrificing validity, even if the true model is not selected. Finally, our method is computationally efficient. Overall, in three empirical applications, our method consistently reduces the width of estimated identified sets and confidence intervals without making additional structural assumptions."}, "https://arxiv.org/abs/2312.06265": {"title": "Type I Error Rates are Not Usually Inflated", "link": "https://arxiv.org/abs/2312.06265", "description": "arXiv:2312.06265v4 Announce Type: replace \nAbstract: The inflation of Type I error rates is thought to be one of the causes of the replication crisis. Questionable research practices such as p-hacking are thought to inflate Type I error rates above their nominal level, leading to unexpectedly high levels of false positives in the literature and, consequently, unexpectedly low replication rates. In this article, I offer an alternative view. 
I argue that questionable and other research practices do not usually inflate relevant Type I error rates. I begin by introducing the concept of Type I error rates and distinguishing between statistical errors and theoretical errors. I then illustrate my argument with respect to model misspecification, multiple testing, selective inference, forking paths, exploratory analyses, p-hacking, optional stopping, double dipping, and HARKing. In each case, I demonstrate that relevant Type I error rates are not usually inflated above their nominal level, and in the rare cases that they are, the inflation is easily identified and resolved. I conclude that the replication crisis may be explained, at least in part, by researchers' misinterpretation of statistical errors and their underestimation of theoretical errors."}, "https://arxiv.org/abs/2411.12086": {"title": "A Comparison of Zero-Inflated Models for Modern Biomedical Data", "link": "https://arxiv.org/abs/2411.12086", "description": "arXiv:2411.12086v1 Announce Type: new \nAbstract: Many data sets cannot be accurately described by standard probability distributions due to the excess number of zero values present. For example, zero-inflation is prevalent in microbiome data and single-cell RNA sequencing data, which serve as our real data examples. Several models have been proposed to address zero-inflated datasets including the zero-inflated negative binomial, hurdle negative binomial model, and the truncated latent Gaussian copula model. This study aims to compare various models and determine which one performs optimally under different conditions using both simulation studies and real data analyses. We are particularly interested in investigating how dependence among the variables, level of zero-inflation or deflation, and variance of the data affects model selection."}, "https://arxiv.org/abs/2411.12184": {"title": "Testability of Instrumental Variables in Additive Nonlinear, Non-Constant Effects Models", "link": "https://arxiv.org/abs/2411.12184", "description": "arXiv:2411.12184v1 Announce Type: new \nAbstract: We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, in certain conditions, AIT conditions are necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. 
Experimental results on both synthetic and three different real-world datasets show the effectiveness of our proposed condition."}, "https://arxiv.org/abs/2411.12294": {"title": "Adaptive Forward Stepwise Regression", "link": "https://arxiv.org/abs/2411.12294", "description": "arXiv:2411.12294v1 Announce Type: new \nAbstract: This paper proposes a sparse regression method that continuously interpolates between Forward Stepwise selection (FS) and the LASSO. When tuned appropriately, our solutions are much sparser than typical LASSO fits but, unlike FS fits, benefit from the stabilizing effect of shrinkage. Our method, Adaptive Forward Stepwise Regression (AFS), addresses this need for sparser models with shrinkage. We show its connection with boosting via a soft-thresholding viewpoint and demonstrate the ease of adapting the method to classification tasks. In both simulations and real data, our method has lower mean squared error and fewer selected features across multiple settings compared to popular sparse modeling procedures."}, "https://arxiv.org/abs/2411.12367": {"title": "Left-truncated discrete lifespans: The AFiD enterprise panel", "link": "https://arxiv.org/abs/2411.12367", "description": "arXiv:2411.12367v1 Announce Type: new \nAbstract: Our model for the lifespan of an enterprise is the geometric distribution. We do not formulate a model for enterprise foundation, but assume that foundations and lifespans are independent. We aim to fit the model to information about foundation and closure of German enterprises in the AFiD panel. The lifespan for an enterprise that has been founded before the first wave of the panel is either left truncated, when the enterprise is contained in the panel, or missing, when it already closed down before the first wave. Marginalizing the likelihood to that part of the enterprise history after the first wave contributes to the aim of a closed-form estimate and standard error. Invariance under the foundation distribution is achieved by conditioning on observability of the enterprises. The conditional marginal likelihood can be written as a function of a martingale. The latter arises when calculating the compensator, with respect to some filtration, of a process that counts the closures. The estimator itself can then also be written as a martingale transform and consistency as well as asymptotic normality are easily proven. The life expectancy of German enterprises, estimated from the demographic information about 1.4 million enterprises for the years 2018 and 2019, is ten years. The width of the confidence interval is two months. Closure after the last wave is taken into account as right censored."}, "https://arxiv.org/abs/2411.12407": {"title": "Bayesian multilevel compositional data analysis with the R package multilevelcoda", "link": "https://arxiv.org/abs/2411.12407", "description": "arXiv:2411.12407v1 Announce Type: new \nAbstract: Multilevel compositional data, such as data sampled over time that are non-negative and sum to a constant value, are common in various fields. However, there is currently no software specifically built to model compositional data in a multilevel framework. The R package multilevelcoda implements a collection of tools for modelling compositional data in a Bayesian multivariate, multilevel pipeline. The user-friendly setup only requires the data, model formula, and minimal specification of the analysis.
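multilevelcoda itself is an R package, so no Python interface is implied; purely as a language-neutral illustration of the compositional side of such pipelines, here is a centered log-ratio transform of daily behaviour proportions. The choice of the CLR (rather than the ILR coordinates a Bayesian multilevel model would more likely use) and the toy proportions are assumptions of this sketch.

import numpy as np

def clr(compositions):
    # Centered log-ratio transform: rows are positive parts that sum to a
    # constant (e.g. proportions of the day spent in each behaviour).
    x = np.asarray(compositions, dtype=float)
    logs = np.log(x)
    return logs - logs.mean(axis=1, keepdims=True)

# Three days of sleep / sedentary / active proportions for one person.
days = np.array([[0.35, 0.40, 0.25],
                 [0.30, 0.45, 0.25],
                 [0.40, 0.35, 0.25]])
print(clr(days))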
This paper outlines the statistical theory underlying the Bayesian compositional multilevel modelling approach and details the implementation of the functions available in multilevelcoda, using an example dataset of compositional daily sleep-wake behaviours. This innovative method can be used to gain robust answers to scientific questions using the increasingly available multilevel compositional data from intensive, longitudinal studies."}, "https://arxiv.org/abs/2411.12423": {"title": "Nonstationary functional time series forecasting", "link": "https://arxiv.org/abs/2411.12423", "description": "arXiv:2411.12423v1 Announce Type: new \nAbstract: We propose a nonstationary functional time series forecasting method with an application to age-specific mortality rates observed over the years. The method begins by taking the first-order differencing and estimates its long-run covariance function. Through eigen-decomposition, we obtain a set of estimated functional principal components and their associated scores for the differenced series. These components allow us to reconstruct the original functional data and compute the residuals. To model the temporal patterns in the residuals, we again perform dynamic functional principal component analysis and extract its estimated principal components and the associated scores for the residuals. As a byproduct, we introduce a geometrically decaying weighted approach to assign higher weights to the most recent data than those from the distant past. Using the Swedish age-specific mortality rates from 1751 to 2022, we demonstrate that the weighted dynamic functional factor model can produce more accurate point and interval forecasts, particularly for male series exhibiting higher volatility."}, "https://arxiv.org/abs/2411.12477": {"title": "Robust Bayesian causal estimation for causal inference in medical diagnosis", "link": "https://arxiv.org/abs/2411.12477", "description": "arXiv:2411.12477v1 Announce Type: new \nAbstract: Causal effect estimation is a critical task in statistical learning that aims to find the causal effect on subjects by identifying causal links between a number of predictor (or, explanatory) variables and the outcome of a treatment. In a regressional framework, we assign a treatment and outcome model to estimate the average causal effect. Additionally, for high dimensional regression problems, variable selection methods are also used to find a subset of predictor variables that maximises the predictive performance of the underlying model for better estimation of the causal effect. In this paper, we propose a different approach. We focus on the variable selection aspects of high dimensional causal estimation problem. We suggest a cautious Bayesian group LASSO (least absolute shrinkage and selection operator) framework for variable selection using prior sensitivity analysis. We argue that in some cases, abstaining from selecting (or, rejecting) a predictor is beneficial and we should gather more information to obtain a more decisive result. We also show that for problems with very limited information, expert elicited variable selection can give us a more stable causal effect estimation as it avoids overfitting. 
Lastly, we carry a comparative study with synthetic dataset and show the applicability of our method in real-life situations."}, "https://arxiv.org/abs/2411.12479": {"title": "Graph-based Square-Root Estimation for Sparse Linear Regression", "link": "https://arxiv.org/abs/2411.12479", "description": "arXiv:2411.12479v1 Announce Type: new \nAbstract: Sparse linear regression is one of the classic problems in the field of statistics, which has deep connections and high intersections with optimization, computation, and machine learning. To address the effective handling of high-dimensional data, the diversity of real noise, and the challenges in estimating standard deviation of the noise, we propose a novel and general graph-based square-root estimation (GSRE) model for sparse linear regression. Specifically, we use square-root-loss function to encourage the estimators to be independent of the unknown standard deviation of the error terms and design a sparse regularization term by using the graphical structure among predictors in a node-by-node form. Based on the predictor graphs with special structure, we highlight the generality by analyzing that the model in this paper is equivalent to several classic regression models. Theoretically, we also analyze the finite sample bounds, asymptotic normality and model selection consistency of GSRE method without relying on the standard deviation of error terms. In terms of computation, we employ the fast and efficient alternating direction method of multipliers. Finally, based on a large number of simulated and real data with various types of noise, we demonstrate the performance advantages of the proposed method in estimation, prediction and model selection."}, "https://arxiv.org/abs/2411.12555": {"title": "Multivariate and Online Transfer Learning with Uncertainty Quantification", "link": "https://arxiv.org/abs/2411.12555", "description": "arXiv:2411.12555v1 Announce Type: new \nAbstract: Untreated periodontitis causes inflammation within the supporting tissue of the teeth and can ultimately lead to tooth loss. Modeling periodontal outcomes is beneficial as they are difficult and time consuming to measure, but disparities in representation between demographic groups must be considered. There may not be enough participants to build group specific models and it can be ineffective, and even dangerous, to apply a model to participants in an underrepresented group if demographic differences were not considered during training. We propose an extension to RECaST Bayesian transfer learning framework. Our method jointly models multivariate outcomes, exhibiting significant improvement over the previous univariate RECaST method. Further, we introduce an online approach to model sequential data sets. Negative transfer is mitigated to ensure that the information shared from the other demographic groups does not negatively impact the modeling of the underrepresented participants. The Bayesian framework naturally provides uncertainty quantification on predictions. Especially important in medical applications, our method does not share data between domains. 
We demonstrate the effectiveness of our method in both predictive performance and uncertainty quantification on simulated data and on a database of dental records from the HealthPartners Institute."}, "https://arxiv.org/abs/2411.12578": {"title": "Robust Inference for High-dimensional Linear Models with Heavy-tailed Errors via Partial Gini Covariance", "link": "https://arxiv.org/abs/2411.12578", "description": "arXiv:2411.12578v1 Announce Type: new \nAbstract: This paper introduces the partial Gini covariance, a novel dependence measure that addresses the challenges of high-dimensional inference with heavy-tailed errors, often encountered in fields like finance, insurance, climate, and biology. Conventional high-dimensional regression inference methods suffer from inaccurate type I errors and reduced power in heavy-tailed contexts, limiting their effectiveness. Our proposed approach leverages the partial Gini covariance to construct a robust statistical inference framework that requires minimal tuning and does not impose restrictive moment conditions on error distributions. Unlike traditional methods, it circumvents the need for estimating the density of random errors and enhances the computational feasibility and robustness. Extensive simulations demonstrate the proposed method's superior power and robustness over standard high-dimensional inference approaches, such as those based on the debiased Lasso. The asymptotic relative efficiency analysis provides additional theoretical insight on the improved efficiency of the new approach in the heavy-tailed setting. Additionally, the partial Gini covariance extends to the multivariate setting, enabling chi-square testing for a group of coefficients. We illustrate the method's practical application with a real-world data example."}, "https://arxiv.org/abs/2411.12585": {"title": "Semiparametric quantile functional regression analysis of adolescent physical activity distributions in the presence of missing data", "link": "https://arxiv.org/abs/2411.12585", "description": "arXiv:2411.12585v1 Announce Type: new \nAbstract: In the age of digital healthcare, passively collected physical activity profiles from wearable sensors are a preeminent tool for evaluating health outcomes. In order to fully leverage the vast amounts of data collected through wearable accelerometers, we propose to use quantile functional regression to model activity profiles as distributional outcomes through quantile responses, which can be used to evaluate activity level differences across covariates based on any desired distributional summary. Our proposed framework addresses two key problems not handled in existing distributional regression literature. First, we use spline mixed model formulations in the basis space to model nonparametric effects of continuous predictors on the distributional response. Second, we address the underlying missingness problem that is common in these types of wearable data but typically not addressed. We show that the missingness can induce bias in the subject-specific distributional summaries that leads to biased distributional regression estimates and even bias the frequently used scalar summary measures, and introduce a nonparametric function-on-function modeling approach that adjusts for each subject's missingness profile to address this problem. 
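The entry above treats each activity profile as a distributional outcome via its quantile function; a minimal sketch of forming such a quantile response on a common probability grid is below. The gamma-distributed toy accelerometer values, the grid size, and the absence of any missing-data adjustment are all simplifications relative to the paper.

import numpy as np

def quantile_profile(values, n_grid=101):
    # Empirical quantile function of one subject-day of activity values,
    # evaluated on a common probability grid so profiles are comparable.
    p_grid = np.linspace(0.0, 1.0, n_grid)
    return p_grid, np.quantile(np.asarray(values, dtype=float), p_grid)

rng = np.random.default_rng(1)
minutes = rng.gamma(shape=2.0, scale=30.0, size=1440)   # one day of per-minute counts
p, q = quantile_profile(minutes)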
We evaluate our nonparametric modeling and missing data adjustment using simulation studies based on realistically simulated activity profiles and use it to gain insights into adolescent activity profiles from the Teen Environment and Neighborhood study."}, "https://arxiv.org/abs/2411.11847": {"title": "Modelling financial returns with mixtures of generalized normal distributions", "link": "https://arxiv.org/abs/2411.11847", "description": "arXiv:2411.11847v1 Announce Type: cross \nAbstract: This PhD Thesis presents an investigation into the analysis of financial returns using mixture models, focusing on mixtures of generalized normal distributions (MGND) and their extensions. The study addresses several critical issues encountered in the estimation process and proposes innovative solutions to enhance accuracy and efficiency. In Chapter 2, the focus lies on the MGND model and its estimation via expectation conditional maximization (ECM) and generalized expectation maximization (GEM) algorithms. A thorough exploration reveals a degeneracy issue when estimating the shape parameter. Several algorithms are proposed to overcome this critical issue. Chapter 3 extends the theoretical perspective by applying the MGND model on several stock market indices. A two-step approach is proposed for identifying turmoil days and estimating returns and volatility. Chapter 4 introduces constrained mixture of generalized normal distributions (CMGND), enhancing interpretability and efficiency by imposing constraints on parameters. Simulation results highlight the benefits of constrained parameter estimation. Finally, Chapter 5 introduces generalized normal distribution-hidden Markov models (GND-HMMs) able to capture the dynamic nature of financial returns. This manuscript contributes to the statistical modelling of financial returns by offering flexible, parsimonious, and interpretable frameworks. The proposed mixture models capture complex patterns in financial data, thereby facilitating more informed decision-making in financial analysis and risk management."}, "https://arxiv.org/abs/2411.12036": {"title": "Prediction-Guided Active Experiments", "link": "https://arxiv.org/abs/2411.12036", "description": "arXiv:2411.12036v1 Announce Type: cross \nAbstract: In this work, we introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE), which leverages predictions from an existing machine learning model to guide sampling and experimentation. Specifically, at each time step, an experimental unit is sampled according to a designated sampling distribution, and the actual outcome is observed based on an experimental probability. Otherwise, only a prediction for the outcome is available. We begin by analyzing the non-adaptive case, where full information on the joint distribution of the predictor and the actual outcome is assumed. For this scenario, we derive an optimal experimentation strategy by minimizing the semi-parametric efficiency bound for the class of regular estimators. We then introduce an estimator that meets this efficiency bound, achieving asymptotic optimality. Next, we move to the adaptive case, where the predictor is continuously updated with newly sampled data. We show that the adaptive version of the estimator remains efficient and attains the same semi-parametric bound under certain regularity assumptions.
Finally, we validate PGAE's performance through simulations and a semi-synthetic experiment using data from the US Census Bureau. The results underscore the PGAE framework's effectiveness and superiority compared to other existing methods."}, "https://arxiv.org/abs/2411.12119": {"title": "Asymptotics in Multiple Hypotheses Testing under Dependence: beyond Normality", "link": "https://arxiv.org/abs/2411.12119", "description": "arXiv:2411.12119v1 Announce Type: cross \nAbstract: Correlated observations are ubiquitous phenomena in a plethora of scientific avenues. Tackling this dependence among test statistics has been one of the pertinent problems in simultaneous inference. However, very little literature exists that elucidates the effect of correlation on different testing procedures under general distributional assumptions. In this work, we address this gap in a unified way by considering the multiple testing problem under a general correlated framework. We establish an upper bound on the family-wise error rate (FWER) of Bonferroni's procedure for equicorrelated test statistics. Consequently, we find that for a quite general class of distributions, Bonferroni FWER asymptotically tends to zero when the number of hypotheses approaches infinity. We extend this result to general positively correlated elliptically contoured setups. We also present examples of distributions for which Bonferroni FWER has a strictly positive limit under equicorrelation. We extend the limiting zero results to the class of step-down procedures under quite general correlated setups. Specifically, the probability of rejecting at least one hypothesis approaches zero asymptotically for any step-down procedure. The results obtained in this work generalize existing results for correlated Normal test statistics and facilitate new insights into the performances of multiple testing procedures under dependence."}, "https://arxiv.org/abs/2411.12258": {"title": "E-STGCN: Extreme Spatiotemporal Graph Convolutional Networks for Air Quality Forecasting", "link": "https://arxiv.org/abs/2411.12258", "description": "arXiv:2411.12258v1 Announce Type: cross \nAbstract: Modeling and forecasting air quality plays a crucial role in informed air pollution management and protecting public health. The air quality data of a region, collected through various pollution monitoring stations, display nonlinearity, nonstationarity, and a highly dynamic nature, and exhibit intense stochastic spatiotemporal correlation. Geometric deep learning models such as Spatiotemporal Graph Convolutional Networks (STGCN) can capture spatial dependence while forecasting temporal time series data for different sensor locations. Another key characteristic often ignored by these models is the presence of extreme observations in the air pollutant levels for severely polluted cities worldwide. Extreme value theory is a commonly used statistical method to predict the expected number of violations of the National Ambient Air Quality Standards for air pollutant concentration levels. This study develops an extreme value theory-based STGCN model (E-STGCN) for air pollution data to incorporate extreme behavior across pollutant concentrations. Along with spatial and temporal components, E-STGCN uses generalized Pareto distribution to investigate the extreme behavior of different air pollutants and incorporate it inside graph convolutional networks. The proposal is then applied to analyze air pollution data (PM2.5, PM10, and NO2) of 37 monitoring stations across Delhi, India.
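The E-STGCN entry combines graph convolutions with a generalized Pareto tail model; the graph-network part is beyond a short snippet, but the peaks-over-threshold ingredient can be sketched with scipy. The lognormal toy series, the 95% threshold, and the exceedance level below are assumptions made purely for illustration.

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(2)
pm25 = rng.lognormal(mean=3.5, sigma=0.6, size=5000)      # toy pollutant series

u = np.quantile(pm25, 0.95)                               # high threshold
exceedances = pm25[pm25 > u] - u
shape, _, scale = genpareto.fit(exceedances, floc=0)      # GPD fit to the tail

# Estimated probability that a new observation exceeds a given level,
# via the usual peaks-over-threshold decomposition.
level = 250.0
p_exceed = (exceedances.size / pm25.size) * genpareto.sf(level - u, shape, loc=0, scale=scale)
print(p_exceed)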
The forecasting performance for different test horizons is evaluated compared to benchmark forecasters (both temporal and spatiotemporal). It was found that E-STGCN has consistent performance across all the seasons in Delhi, India, and the robustness of our results has also been evaluated empirically. Moreover, combined with conformal prediction, E-STGCN can also produce probabilistic prediction intervals."}, "https://arxiv.org/abs/2411.12623": {"title": "Random signed measures", "link": "https://arxiv.org/abs/2411.12623", "description": "arXiv:2411.12623v1 Announce Type: cross \nAbstract: Point processes and, more generally, random measures are ubiquitous in modern statistics. However, they can only take positive values, which is a severe limitation in many situations. In this work, we introduce and study random signed measures, also known as real-valued random measures, and apply them to construct various Bayesian non-parametric models. In particular, we provide an existence result for random signed measures, allowing us to obtain a canonical definition for them and solve a 70-year-old open problem. Further, we provide a representation of completely random signed measures (CRSMs), which extends the celebrated Kingman's representation for completely random measures (CRMs) to the real-valued case. We then introduce specific classes of random signed measures, including the Skellam point process, which plays the role of the Poisson point process in the real-valued case, and the Gaussian random measure. We use the theoretical results to develop two Bayesian nonparametric models -- one for topic modeling and the other for random graphs -- and to investigate mean function estimation in Bayesian nonparametric regression."}, "https://arxiv.org/abs/2411.12674": {"title": "OrigamiPlot: An R Package and Shiny Web App Enhanced Visualizations for Multivariate Data", "link": "https://arxiv.org/abs/2411.12674", "description": "arXiv:2411.12674v1 Announce Type: cross \nAbstract: We introduce OrigamiPlot, an open-source R package and Shiny web application designed to enhance the visualization of multivariate data. This package implements the origami plot, a novel visualization technique proposed by Duan et al. in 2023, which improves upon traditional radar charts by ensuring that the area of the connected region is invariant to the ordering of attributes, addressing a key limitation of radar charts. The software facilitates multivariate decision-making by supporting comparisons across multiple objects and attributes, offering customizable features such as auxiliary axes and weighted attributes for enhanced clarity. Through the R package and user-friendly Shiny interface, researchers can efficiently create and customize plots without requiring extensive programming knowledge. Demonstrated using network meta-analysis as a real-world example, OrigamiPlot proves to be a versatile tool for visualizing multivariate data across various fields.
This package opens new opportunities for simplifying decision-making processes with complex data."}, "https://arxiv.org/abs/2009.00085": {"title": "Identification of Semiparametric Panel Multinomial Choice Models with Infinite-Dimensional Fixed Effects", "link": "https://arxiv.org/abs/2009.00085", "description": "arXiv:2009.00085v2 Announce Type: replace \nAbstract: This paper proposes a robust method for semiparametric identification and estimation in panel multinomial choice models, where we allow for infinite-dimensional fixed effects that enter into consumer utilities in an additively nonseparable way, thus incorporating rich forms of unobserved heterogeneity. Our identification strategy exploits multivariate monotonicity in parametric indexes, and uses the logical contraposition of an intertemporal inequality on choice probabilities to obtain identifying restrictions. We provide a consistent estimation procedure, and demonstrate the practical advantages of our method with Monte Carlo simulations and an empirical illustration on popcorn sales with the Nielsen data."}, "https://arxiv.org/abs/2110.00982": {"title": "Identification and Estimation in a Time-Varying Endogenous Random Coefficient Panel Data Model", "link": "https://arxiv.org/abs/2110.00982", "description": "arXiv:2110.00982v2 Announce Type: replace \nAbstract: This paper proposes a correlated random coefficient linear panel data model, where regressors can be correlated with time-varying and individual-specific random coefficients through both a fixed effect and a time-varying random shock. I develop a new panel data-based identification method to identify the average partial effect and the local average response function. The identification strategy employs a sufficient statistic to control for the fixed effect and a conditional control variable for the random shock. Conditional on these two controls, the residual variation in the regressors is driven solely by the exogenous instrumental variables, and thus can be exploited to identify the parameters of interest. The constructive identification analysis leads to three-step series estimators, for which I establish rates of convergence and asymptotic normality. To illustrate the method, I estimate a heterogeneous Cobb-Douglas production function for manufacturing firms in China, finding substantial variations in output elasticities across firms."}, "https://arxiv.org/abs/2310.20460": {"title": "Aggregating Dependent Signals with Heavy-Tailed Combination Tests", "link": "https://arxiv.org/abs/2310.20460", "description": "arXiv:2310.20460v2 Announce Type: replace \nAbstract: Combining dependent p-values to evaluate the global null hypothesis presents a longstanding challenge in statistical inference, particularly when aggregating results from diverse methods to boost signal detection. P-value combination tests using heavy-tailed distribution based transformations, such as the Cauchy combination test and the harmonic mean p-value, have recently garnered significant interest for their potential to efficiently handle arbitrary p-value dependencies. Despite their growing popularity in practical applications, there is a gap in comprehensive theoretical and empirical evaluations of these methods. 
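For reference, the Cauchy combination test named in the entry above is only a few lines of code; the equal weights and example p-values here are arbitrary, and the truncation the paper recommends in practice is not applied in this sketch.

import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvalues, weights=None):
    # T = sum_i w_i * tan((0.5 - p_i) * pi); the combined p-value is the
    # standard Cauchy upper tail probability of T.
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, dtype=float)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))
    return cauchy.sf(t)

print(cauchy_combination([0.01, 0.20, 0.64, 0.03]))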
This paper conducts an extensive investigation, revealing that, theoretically, while these combination tests are asymptotically valid for pairwise quasi-asymptotically independent test statistics, such as bivariate normal variables, they are also asymptotically equivalent to the Bonferroni test under the same conditions. However, extensive simulations unveil their practical utility, especially in scenarios where stringent type-I error control is not necessary and signals are dense. Both the heaviness of the distribution and its support substantially impact the tests' non-asymptotic validity and power, and we recommend using a truncated Cauchy distribution in practice. Moreover, we show that under the violation of quasi-asymptotic independence among test statistics, these tests remain valid and, in fact, can be considerably less conservative than the Bonferroni test. We also present two case studies in genetics and genomics, showcasing the potential of the combination tests to significantly enhance statistical power while effectively controlling type-I errors."}, "https://arxiv.org/abs/2305.01277": {"title": "Zero-Truncated Modelling in a Meta-Analysis on Suicide Data after Bariatric Surgery", "link": "https://arxiv.org/abs/2305.01277", "description": "arXiv:2305.01277v2 Announce Type: replace-cross \nAbstract: Meta-analysis is a well-established method for integrating results from several independent studies to estimate a common quantity of interest. However, meta-analysis is prone to selection bias, notably when particular studies are systematically excluded. This can lead to bias in estimating the quantity of interest. Motivated by a meta-analysis to estimate the rate of completed-suicide after bariatric surgery, where studies which reported no suicides were excluded, a novel zero-truncated count modelling approach was developed. This approach addresses heterogeneity, both observed and unobserved, through covariate and overdispersion modelling, respectively. Additionally, through the Horvitz-Thompson estimator, an approach is developed to estimate the number of excluded studies, a quantity of potential interest for researchers. Uncertainty quantification for both estimation of suicide rates and number of excluded studies is achieved through a parametric bootstrapping approach."}, "https://arxiv.org/abs/2411.12871": {"title": "Modelling Directed Networks with Reciprocity", "link": "https://arxiv.org/abs/2411.12871", "description": "arXiv:2411.12871v1 Announce Type: new \nAbstract: Asymmetric relational data is increasingly prevalent across diverse fields, underscoring the need for directed network models to address the complex challenges posed by their unique structures. Unlike undirected models, directed models can capture reciprocity, the tendency of nodes to form mutual links. In this work, we address a fundamental question: what is the effective sample size for modeling reciprocity? We examine this by analyzing the Bernoulli model with reciprocity, allowing for varying sparsity levels between non-reciprocal and reciprocal effects. We then extend this framework to a model that incorporates node-specific heterogeneity and link-specific reciprocity using covariates. Our findings reveal intriguing interplays between non-reciprocal and reciprocal effects in sparse networks.
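The reciprocity that the entry above models can at least be computed descriptively from an adjacency matrix, even though the paper's likelihood-based estimator is more involved; the sketch below only reports the share of directed edges whose reverse edge is also present.

import numpy as np

def edge_reciprocity(adj):
    # Fraction of directed edges i -> j whose reverse edge j -> i exists.
    A = (np.asarray(adj) > 0).astype(int)
    np.fill_diagonal(A, 0)
    mutual = int(np.sum(A * A.T))       # reciprocated edges, counted per direction
    edges = int(A.sum())
    return mutual / edges if edges else 0.0

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 0, 0]])
print(edge_reciprocity(A))              # 2 of the 3 directed edges are mutual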
We propose a straightforward inference procedure based on maximum likelihood estimation that operates without prior knowledge of sparsity levels, whether covariates are included or not."}, "https://arxiv.org/abs/2411.12889": {"title": "Goodness-of-fit tests for generalized Poisson distributions", "link": "https://arxiv.org/abs/2411.12889", "description": "arXiv:2411.12889v1 Announce Type: new \nAbstract: This paper presents and examines computationally convenient goodness-of-fit tests for the family of generalized Poisson distributions, which encompasses notable distributions such as the Compound Poisson and the Katz distributions. The tests are consistent against fixed alternatives and their null distribution can be consistently approximated by a parametric bootstrap. The goodness of the bootstrap estimator and the power for finite sample sizes are numerically assessed through an extensive simulation experiment, including comparisons with other tests. In many cases, the novel tests either outperform or match the performance of existing ones. Real data applications are considered for illustrative purposes."}, "https://arxiv.org/abs/2411.12944": {"title": "From Estimands to Robust Inference of Treatment Effects in Platform Trials", "link": "https://arxiv.org/abs/2411.12944", "description": "arXiv:2411.12944v1 Announce Type: new \nAbstract: A platform trial is an innovative clinical trial design that uses a master protocol (i.e., one overarching protocol) to evaluate multiple treatments in an ongoing manner and can accelerate the evaluation of new treatments. However, the flexibility that marks the potential of platform trials also creates inferential challenges. Two key challenges are the precise definition of treatment effects and the robust and efficient inference on these effects. To address these challenges, we first define a clinically meaningful estimand that characterizes the treatment effect as a function of the expected outcomes under two given treatments among concurrently eligible patients. Then, we develop weighting and post-stratification methods for estimation of treatment effects with minimal assumptions. To fully leverage the efficiency potential of data from concurrently eligible patients, we also consider a model-assisted approach for baseline covariate adjustment to gain efficiency while maintaining robustness against model misspecification. We derive and compare asymptotic distributions of proposed estimators in theory and propose robust variance estimators. The proposed estimators are empirically evaluated in a simulation study and illustrated using the SIMPLIFY trial. Our methods are implemented in the R package RobinCID."}, "https://arxiv.org/abs/2411.13131": {"title": "Bayesian Parameter Estimation of Normal Distribution from Sample Mean and Extreme Values", "link": "https://arxiv.org/abs/2411.13131", "description": "arXiv:2411.13131v1 Announce Type: new \nAbstract: This paper proposes a Bayesian method for estimating the parameters of a normal distribution when only limited summary statistics (sample mean, minimum, maximum, and sample size) are available. To estimate the parameters of a normal distribution, we introduce a data augmentation approach using the Gibbs sampler, where intermediate values are treated as missing values and samples from a truncated normal distribution conditional on the observed sample mean, minimum, and maximum values. 
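Only one ingredient of the sampler described above is easy to show in isolation: drawing the unobserved interior values from a normal truncated to the observed range. The parameter values below are placeholders, and the paper's additional conditioning on the observed sample mean, as well as the conjugate updates of the mean and variance within the Gibbs sweep, are omitted from this sketch.

import numpy as np
from scipy.stats import truncnorm

def impute_interior_values(n, mu, sigma, x_min, x_max, rng):
    # Draw the n - 2 unobserved values from N(mu, sigma^2) truncated to
    # [x_min, x_max]; truncnorm expects standardized bounds.
    a, b = (x_min - mu) / sigma, (x_max - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=n - 2, random_state=rng)

rng = np.random.default_rng(3)
interior = impute_interior_values(n=30, mu=10.0, sigma=2.0, x_min=5.1, x_max=14.8, rng=rng)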
Through simulation studies, we demonstrate that our method achieves estimation accuracy comparable to theoretical expectations."}, "https://arxiv.org/abs/2411.13372": {"title": "Clustering with Potential Multidimensionality: Inference and Practice", "link": "https://arxiv.org/abs/2411.13372", "description": "arXiv:2411.13372v1 Announce Type: new \nAbstract: We show how clustering standard errors in one or more dimensions can be justified in M-estimation when there is sampling or assignment uncertainty. Since existing procedures for variance estimation are either conservative or invalid, we propose a variance estimator that refines a conservative procedure and remains valid. We then interpret environments where clustering is frequently employed in empirical work from our design-based perspective and provide insights on their estimands and inference procedures."}, "https://arxiv.org/abs/2411.13432": {"title": "Spatial error models with heteroskedastic normal perturbations and joint modeling of mean and variance", "link": "https://arxiv.org/abs/2411.13432", "description": "arXiv:2411.13432v1 Announce Type: new \nAbstract: This work presents the spatial error model with heteroskedasticity, which allows the joint modeling of the parameters associated with both the mean and the variance, within a traditional approach to spatial econometrics. The estimation algorithm is based on the log-likelihood function and incorporates the use of GAMLSS models in an iterative form. Two theoretical results show the advantages of the model to the usual models of spatial econometrics and allow obtaining the bias of weighted least squares estimators. The proposed methodology is tested through simulations, showing notable results in terms of the ability to recover all parameters and the consistency of its estimates. Finally, this model is applied to identify the factors associated with school desertion in Colombia."}, "https://arxiv.org/abs/2411.13542": {"title": "The R\\'enyi Outlier Test", "link": "https://arxiv.org/abs/2411.13542", "description": "arXiv:2411.13542v1 Announce Type: new \nAbstract: Cox and Kartsonaki proposed a simple outlier test for a vector of p-values based on the R\\'enyi transformation that is fast for large $p$ and numerically stable for very small p-values -- key properties for large data analysis. We propose and implement a generalization of this procedure we call the R\\'enyi Outlier Test (ROT). This procedure maintains the key properties of the original but is much more robust to uncertainty in the number of outliers expected a priori among the p-values. The ROT can also account for two types of prior information that are common in modern data analysis. The first is the prior probability that a given p-value may be outlying. The second is an estimate of how far of an outlier a p-value might be, conditional on it being an outlier; in other words, an estimate of effect size. Using a series of pre-calculated spline functions, we provide a fast and numerically stable implementation of the ROT in our R package renyi."}, "https://arxiv.org/abs/2411.12845": {"title": "Underlying Core Inflation with Multiple Regimes", "link": "https://arxiv.org/abs/2411.12845", "description": "arXiv:2411.12845v1 Announce Type: cross \nAbstract: This paper introduces a new approach for estimating core inflation indicators based on common factors across a broad range of price indices. 
Specifically, by utilizing procedures for detecting multiple regimes in high-dimensional factor models, we propose two types of core inflation indicators: one incorporating multiple structural breaks and another based on Markov switching. The structural breaks approach can eliminate revisions for past regimes, though it functions as an offline indicator, as real-time detection of breaks is not feasible with this method. On the other hand, the Markov switching approach can reduce revisions while being useful in real time, making it a simple and robust core inflation indicator suitable for real-time monitoring and as a short-term guide for monetary policy. Additionally, this approach allows us to estimate the probability of being in different inflationary regimes. To demonstrate the effectiveness of these indicators, we apply them to Canadian price data. To compare the real-time performance of the Markov switching approach to the benchmark model without regime-switching, we assess their abilities to forecast headline inflation and minimize revisions. We find that the Markov switching model delivers superior predictive accuracy and significantly reduces revisions during periods of substantial inflation changes. Hence, our findings suggest that accounting for time-varying factors and parameters enhances inflation signal accuracy and reduces data requirements, especially following sudden economic shifts."}, "https://arxiv.org/abs/2411.12965": {"title": "On adaptivity and minimax optimality of two-sided nearest neighbors", "link": "https://arxiv.org/abs/2411.12965", "description": "arXiv:2411.12965v1 Announce Type: cross \nAbstract: Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a \\Holder function class that is less smooth than Lipschitz. Our results establish following favorable properties for a suitable two-sided NN: (1) The mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps."}, "https://arxiv.org/abs/2411.13080": {"title": "Distribution-free Measures of Association based on Optimal Transport", "link": "https://arxiv.org/abs/2411.13080", "description": "arXiv:2411.13080v1 Announce Type: cross \nAbstract: In this paper we propose and study a class of nonparametric, yet interpretable measures of association between two random vectors $X$ and $Y$ taking values in $\\mathbb{R}^{d_1}$ and $\\mathbb{R}^{d_2}$ respectively ($d_1, d_2\\ge 1$). 
These nonparametric measures -- defined using the theory of reproducing kernel Hilbert spaces coupled with optimal transport -- capture the strength of dependence between $X$ and $Y$ and have the property that they are 0 if and only if the variables are independent and 1 if and only if one variable is a measurable function of the other. Further, these population measures can be consistently estimated using the general framework of geometric graphs which include $k$-nearest neighbor graphs and minimum spanning trees. Additionally, these measures can also be readily used to construct an exact finite sample distribution-free test of mutual independence between $X$ and $Y$. In fact, as far as we are aware, these are the only procedures that possess all the above mentioned desirable properties. The correlation coefficient proposed in Dette et al. (2013), Chatterjee (2021), Azadkia and Chatterjee (2021), at the population level, can be seen as a special case of this general class of measures."}, "https://arxiv.org/abs/2411.13293": {"title": "Revealed Information", "link": "https://arxiv.org/abs/2411.13293", "description": "arXiv:2411.13293v1 Announce Type: cross \nAbstract: An analyst observes the frequency with which a decision maker (DM) takes actions, but does not observe the frequency of actions conditional on the payoff-relevant state. We ask when can the analyst rationalize the DM's choices as if the DM first learns something about the state before taking action. We provide a support function characterization of the triples of utility functions, prior beliefs, and (marginal) distributions over actions such that the DM's action distribution is consistent with information given the agent's prior and utility function. Assumptions on the cardinality of the state space and the utility function allow us to refine this characterization, obtaining a sharp system of finitely many inequalities the utility function, prior, and action distribution must satisfy. We apply our characterization to study comparative statics and ring-network games, and to identify conditions under which a data set is consistent with a public information structure in first-order Bayesian persuasion games. We characterize the set of distributions over posterior beliefs that are consistent with the DM's choices. Assuming the first-order approach applies, we extend our results to settings with a continuum of actions and/or states.%"}, "https://arxiv.org/abs/2104.07773": {"title": "Jointly Modeling and Clustering Tensors in High Dimensions", "link": "https://arxiv.org/abs/2104.07773", "description": "arXiv:2104.07773v3 Announce Type: replace \nAbstract: We consider the problem of jointly modeling and clustering populations of tensors by introducing a high-dimensional tensor mixture model with heterogeneous covariances. To effectively tackle the high dimensionality of tensor objects, we employ plausible dimension reduction assumptions that exploit the intrinsic structures of tensors such as low-rankness in the mean and separability in the covariance. In estimation, we develop an efficient high-dimensional expectation-conditional-maximization (HECM) algorithm that breaks the intractable optimization in the M-step into a sequence of much simpler conditional optimization problems, each of which is convex, admits regularization and has closed-form updating formulas. 
Our theoretical analysis is challenged by both the non-convexity in the EM-type estimation and having access to only the solutions of conditional maximizations in the M-step, leading to the notion of dual non-convexity. We demonstrate that the proposed HECM algorithm, with an appropriate initialization, converges geometrically to a neighborhood that is within statistical precision of the true parameter. The efficacy of our proposed method is demonstrated through comparative numerical experiments and an application to a medical study, where our proposal achieves an improved clustering accuracy over existing benchmarking methods."}, "https://arxiv.org/abs/2307.16720": {"title": "Clustering multivariate functional data using the epigraph and hypograph indices: a case study on Madrid air quality", "link": "https://arxiv.org/abs/2307.16720", "description": "arXiv:2307.16720v4 Announce Type: replace \nAbstract: With the rapid growth of data generation, advancements in functional data analysis (FDA) have become essential, especially for approaches that handle multiple variables at the same time. This paper introduces a novel formulation of the epigraph and hypograph indices, along with their generalized expressions, specifically designed for multivariate functional data (MFD). These new definitions account for interrelationships between variables, enabling effective clustering of MFD based on the original data curves and their first two derivatives. The methodology developed here has been tested on simulated datasets, demonstrating strong performance compared to state-of-the-art methods. Its practical utility is further illustrated with two environmental datasets: the Canadian weather dataset and a 2023 air quality study in Madrid. These applications highlight the potential of the method as a great tool for analyzing complex environmental data, offering valuable insights for researchers and policymakers in climate and environmental research."}, "https://arxiv.org/abs/2308.05858": {"title": "Inconsistency and Acausality of Model Selection in Bayesian Inverse Problems", "link": "https://arxiv.org/abs/2308.05858", "description": "arXiv:2308.05858v3 Announce Type: replace \nAbstract: Bayesian inference paradigms are regarded as powerful tools for solution of inverse problems. However, when applied to inverse problems in physical sciences, Bayesian formulations suffer from a number of inconsistencies that are often overlooked. A well known, but mostly neglected, difficulty is connected to the notion of conditional probability densities. Borel, and later Kolmogorov's (1933/1956), found that the traditional definition of conditional densities is incomplete: In different parameterizations it leads to different results. We will show an example where two apparently correct procedures applied to the same problem lead to two widely different results. Another type of inconsistency involves violation of causality. This problem is found in model selection strategies in Bayesian inversion, such as Hierarchical Bayes and Trans-Dimensional Inversion where so-called hyperparameters are included as variables to control either the number (or type) of unknowns, or the prior uncertainties on data or model parameters. For Hierarchical Bayes we demonstrate that the calculated 'prior' distributions of data or model parameters are not prior-, but posterior information. 
In fact, the calculated 'standard deviations' of the data are a measure of the inability of the forward function to model the data, rather than uncertainties of the data. For trans-dimensional inverse problems we show that the so-called evidence is, in fact, not a measure of the success of fitting the data for the given choice (or number) of parameters, as often claimed. We also find that the notion of Natural Parsimony is ill-defined, because of its dependence on the parameter prior. Based on this study, we find that careful rethinking of Bayesian inversion practices is required, with special emphasis on ways of avoiding the Borel-Kolmogorov inconsistency, and on the way we interpret model selection results."}, "https://arxiv.org/abs/2311.15485": {"title": "Calibrated Generalized Bayesian Inference", "link": "https://arxiv.org/abs/2311.15485", "description": "arXiv:2311.15485v2 Announce Type: replace \nAbstract: We provide a simple and general solution for accurate uncertainty quantification of Bayesian inference in misspecified or approximate models, and for generalized posteriors more generally. While existing solutions are based on explicit Gaussian posterior approximations, or post-processing procedures, we demonstrate that correct uncertainty quantification can be achieved by substituting the usual posterior with an intuitively appealing alternative posterior that conveys the same information. This solution applies to both likelihood-based and loss-based posteriors, and we formally demonstrate the reliable uncertainty quantification of this approach. The new approach is demonstrated through a range of examples, including linear models, and doubly intractable models."}, "https://arxiv.org/abs/1710.06078": {"title": "Estimate exponential memory decay in Hidden Markov Model and its applications", "link": "https://arxiv.org/abs/1710.06078", "description": "arXiv:1710.06078v2 Announce Type: replace-cross \nAbstract: Inference in hidden Markov model has been challenging in terms of scalability due to dependencies in the observation data. In this paper, we utilize the inherent memory decay in hidden Markov models, such that the forward and backward probabilities can be carried out with subsequences, enabling efficient inference over long sequences of observations. We formulate this forward filtering process in the setting of the random dynamical system and there exist Lyapunov exponents in the i.i.d random matrices production. And the rate of the memory decay is known as $\\lambda_2-\\lambda_1$, the gap of the top two Lyapunov exponents almost surely. An efficient and accurate algorithm is proposed to numerically estimate the gap after the soft-max parametrization. The length of subsequences $B$ given the controlled error $\\epsilon$ is $B=\\log(\\epsilon)/(\\lambda_2-\\lambda_1)$. We theoretically prove the validity of the algorithm and demonstrate the effectiveness with numerical examples. The method developed here can be applied to widely used algorithms, such as mini-batch stochastic gradient method. 
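The abstract above reduces the subsequence length to the gap of the top two Lyapunov exponents; the standard QR iteration for estimating that gap from products of i.i.d. random matrices is sketched below. Gaussian matrices stand in for the soft-max-parametrized HMM matrices the paper works with, and T, epsilon, and the 3x3 dimension are illustrative choices only.

import numpy as np

def top_lyapunov_exponents(sample_matrix, T=20000, k=2, seed=0):
    # Benettin-style QR iteration: accumulate log |diag(R)| along the
    # product of i.i.d. random matrices to estimate the top k exponents.
    rng = np.random.default_rng(seed)
    d = sample_matrix(rng).shape[0]
    Q = np.eye(d)[:, :k]
    sums = np.zeros(k)
    for _ in range(T):
        Q, R = np.linalg.qr(sample_matrix(rng) @ Q)
        sums += np.log(np.abs(np.diag(R)))
    return sums / T

lam = top_lyapunov_exponents(lambda rng: rng.standard_normal((3, 3)))
eps = 1e-6
B = np.log(eps) / (lam[1] - lam[0])     # subsequence length from the abstract's formula
print(lam, B)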
Moreover, the continuity of Lyapunov spectrum ensures the estimated $B$ could be reused for the nearby parameter during the inference."}, "https://arxiv.org/abs/2211.12692": {"title": "Empirical Bayes estimation: When does $g$-modeling beat $f$-modeling in theory (and in practice)?", "link": "https://arxiv.org/abs/2211.12692", "description": "arXiv:2211.12692v2 Announce Type: replace-cross \nAbstract: Empirical Bayes (EB) is a popular framework for large-scale inference that aims to find data-driven estimators to compete with the Bayesian oracle that knows the true prior. Two principled approaches to EB estimation have emerged over the years: $f$-modeling, which constructs an approximate Bayes rule by estimating the marginal distribution of the data, and $g$-modeling, which estimates the prior from data and then applies the learned Bayes rule. For the Poisson model, the prototypical examples are the celebrated Robbins estimator and the nonparametric MLE (NPMLE), respectively. It has long been recognized in practice that the Robbins estimator, while being conceptually appealing and computationally simple, lacks robustness and can be easily derailed by ``outliers'', unlike the NPMLE which provides more stable and interpretable fit thanks to its Bayes form. On the other hand, not only do the existing theories shed little light on this phenomenon, but they all point to the opposite, as both methods have recently been shown optimal in terms of regret (excess over the Bayes risk) for compactly supported and subexponential priors.\n In this paper we provide a theoretical justification for the superiority of $g$-modeling over $f$-modeling for heavy-tailed data by considering priors with bounded $p>1$th moment. We show that with mild regularization, any $g$-modeling method that is Hellinger rate-optimal in density estimation achieves an optimal total regret $\\tilde \\Theta(n^{\\frac{3}{2p+1}})$; in particular, the special case of NPMLE succeeds without regularization. In contrast, there exists an $f$-modeling estimator whose density estimation rate is optimal but whose EB regret is suboptimal by a polynomial factor. These results show that the proper Bayes form provides a ``general recipe of success'' for optimal EB estimation that applies to all $g$-modeling (but not $f$-modeling) methods."}, "https://arxiv.org/abs/2411.13570": {"title": "Inconsistency and Acausality in Bayesian Inference for Physical Problems", "link": "https://arxiv.org/abs/2411.13570", "description": "arXiv:2411.13570v1 Announce Type: new \nAbstract: Bayesian inference is used to estimate continuous parameter values given measured data in many fields of science. The method relies on conditional probability densities to describe information about both data and parameters, yet the notion of conditional densities is inadmissible: probabilities of the same physical event, computed from conditional densities under different parameterizations, may be inconsistent. We show that this inconsistency, together with acausality in hierarchical methods, invalidates a variety of commonly applied Bayesian methods when applied to problems in the physical world, including trans-dimensional inference, general Bayesian dimensionality reduction methods, and hierarchical and empirical Bayes. Models in parameter spaces of different dimensionalities cannot be compared, invalidating the concept of natural parsimony, the probabilistic counterpart to Occam's Razor.
Bayes theorem itself is inadmissible, and Bayesian inference applied to parameters that characterize physical properties requires reformulation."}, "https://arxiv.org/abs/2411.13692": {"title": "Randomized Basket Trial with an Interim Analysis (RaBIt) and Applications in Mental Health", "link": "https://arxiv.org/abs/2411.13692", "description": "arXiv:2411.13692v1 Announce Type: new \nAbstract: Basket trials can efficiently evaluate a single treatment across multiple diseases with a common shared target. Prior methods for randomized basket trials required baskets to have the same sample and effect sizes. To that end, we developed a general randomized basket trial with an interim analysis (RaBIt) that allows for unequal sample sizes and effect sizes per basket. RaBIt is characterized by pruning at an interim stage and then analyzing a pooling of the remaining baskets. We derived the analytical power and type 1 error for the design. We first show that our results are consistent with the prior methods when the sample and effect sizes were the same across baskets. As we adjust the sample allocation between baskets, our threshold for the final test statistic becomes more stringent in order to maintain the same overall type 1 error. Finally, we notice that if we fix a sample size for the baskets proportional to their accrual rate, then at the cost of an almost negligible amount of power, the trial overall is expected to take substantially less time than the non-generalized version."}, "https://arxiv.org/abs/2411.13748": {"title": "An Economical Approach to Design Posterior Analyses", "link": "https://arxiv.org/abs/2411.13748", "description": "arXiv:2411.13748v1 Announce Type: new \nAbstract: To design Bayesian studies, criteria for the operating characteristics of posterior analyses - such as power and the type I error rate - are often assessed by estimating sampling distributions of posterior probabilities via simulation. In this paper, we propose an economical method to determine optimal sample sizes and decision criteria for such studies. Using our theoretical results that model posterior probabilities as a function of the sample size, we assess operating characteristics throughout the sample size space given simulations conducted at only two sample sizes. These theoretical results are used to construct bootstrap confidence intervals for the optimal sample sizes and decision criteria that reflect the stochastic nature of simulation-based design. We also repurpose the simulations conducted in our approach to efficiently investigate various sample sizes and decision criteria using contour plots. The broad applicability and wide impact of our methodology is illustrated using two clinical examples."}, "https://arxiv.org/abs/2411.13764": {"title": "Selective inference is easier with p-values", "link": "https://arxiv.org/abs/2411.13764", "description": "arXiv:2411.13764v1 Announce Type: new \nAbstract: Selective inference is a subfield of statistics that enables valid inference after selection of a data-dependent question. In this paper, we introduce selectively dominant p-values, a class of p-values that allow practitioners to easily perform inference after arbitrary selection procedures. 
Unlike a traditional p-value, whose distribution must stochastically dominate the uniform distribution under the null, a selectively dominant p-value must have a post-selection distribution that stochastically dominates that of a uniform having undergone the same selection process; moreover, this property must hold simultaneously for all possible selection processes. Despite the strength of this condition, we show that all commonly used p-values (e.g., p-values from two-sided testing in parametric families, one-sided testing in monotone likelihood ratio and exponential families, $F$-tests for linear regression, and permutation tests) are selectively dominant. By recasting two canonical selective inference problems-inference on winners and rank verification-in our selective dominance framework, we provide simpler derivations, a deeper conceptual understanding, and new generalizations and variations of these methods. Additionally, we use our insights to introduce selective variants of methods that combine p-values, such as Fisher's combination test."}, "https://arxiv.org/abs/2411.13810": {"title": "Dynamic spatial interaction models for a leader's resource allocation and followers' multiple activities", "link": "https://arxiv.org/abs/2411.13810", "description": "arXiv:2411.13810v1 Announce Type: new \nAbstract: This paper introduces a novel spatial interaction model to explore the decision-making processes of two types of agents-a leader and followers-with central and local governments serving as empirical representations. The model accounts for three key features: (i) resource allocations from the leader to the followers and the resulting strategic interactions, (ii) followers' choices across multiple activities, and (iii) interactions among these activities. We develop a network game to examine the micro-foundations of these processes. In this game, followers engage in multiple activities, while the leader allocates resources by monitoring the externalities arising from followers' interactions. The game's unique NE is the foundation for our econometric framework, providing equilibrium measures to understand the short-term impacts of changes in followers' characteristics and their long-term consequences. To estimate the agent payoff parameters, we employ the QML estimation method and examine the asymptotic properties of the QML estimator to ensure robust statistical inferences. Empirically, we investigate interactions among U.S. states in public welfare expenditures (PWE) and housing and community development expenditures (HCDE), focusing on how federal grants influence these expenditures and the interactions among state governments. Our findings reveal positive spillovers in states' PWEs, complementarity between the two expenditures within states, and negative cross-variable spillovers between them. Additionally, we observe positive effects of federal grants on both expenditures. Counterfactual simulations indicate that federal interventions lead to a 6.46% increase in social welfare by increasing the states' efforts on PWE and HCDE. 
However, due to the limited flexibility in federal grants, their magnitudes are smaller than the proportion of federal grants within the states' total revenues."}, "https://arxiv.org/abs/2411.13822": {"title": "High-Dimensional Extreme Quantile Regression", "link": "https://arxiv.org/abs/2411.13822", "description": "arXiv:2411.13822v1 Announce Type: new \nAbstract: The estimation of conditional quantiles at extreme tails is of great interest in numerous applications. Various methods that integrate regression analysis with an extrapolation strategy derived from extreme value theory have been proposed to estimate extreme conditional quantiles in scenarios with a fixed number of covariates. However, these methods prove ineffective in high-dimensional settings, where the number of covariates increases with the sample size. In this article, we develop new estimation methods tailored for extreme conditional quantiles with high-dimensional covariates. We establish the asymptotic properties of the proposed estimators and demonstrate their superior performance through simulation studies, particularly in scenarios of growing dimension and high dimension where existing methods may fail. Furthermore, the analysis of auto insurance data validates the efficacy of our methods in estimating extreme conditional insurance claims and selecting important variables."}, "https://arxiv.org/abs/2411.13829": {"title": "Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference", "link": "https://arxiv.org/abs/2411.13829", "description": "arXiv:2411.13829v1 Announce Type: new \nAbstract: When using multiple imputation (MI) for missing data, maintaining compatibility between the imputation model and substantive analysis is important for avoiding bias. For example, some causal inference methods incorporate an outcome model with exposure-confounder interactions that must be reflected in the imputation model. Two approaches for compatible imputation with multivariable missingness have been proposed: Substantive-Model-Compatible Fully Conditional Specification (SMCFCS) and a stacked-imputation-based approach (SMC-stack). If the imputation model is correctly specified, both approaches are guaranteed to be unbiased under the \"missing at random\" assumption. However, this assumption is violated when the outcome causes its own missingness, which is common in practice. In such settings, sensitivity analyses are needed to assess the impact of alternative assumptions on results. An appealing solution for sensitivity analysis is delta-adjustment using MI, specifically \"not-at-random\" (NAR)FCS. However, the issue of imputation model compatibility has not been considered in sensitivity analysis, with a naive implementation of NARFCS being susceptible to bias. To address this gap, we propose two approaches for compatible sensitivity analysis when the outcome causes its own missingness. The proposed approaches, NAR-SMCFCS and NAR-SMC-stack, extend SMCFCS and SMC-stack, respectively, with delta-adjustment for the outcome. We evaluate these approaches using a simulation study that is motivated by a case study, to which the methods were also applied. The simulation results confirmed that a naive implementation of NARFCS produced bias in effect estimates, while NAR-SMCFCS and NAR-SMC-stack were approximately unbiased. 
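For readers unfamiliar with the extrapolation strategy referenced in the extreme quantile regression abstract above, the following sketch shows the classical fixed-dimension building block (Hill tail-index estimation plus Weissman extrapolation) on simulated Pareto data; it is background illustration only, not the proposed high-dimensional estimator.

```python
import numpy as np

def hill_weissman_quantile(x, k, p):
    """Hill estimator of the tail index from the top-k order statistics,
    followed by Weissman extrapolation to the (1 - p) quantile, p << k/n."""
    xs = np.sort(x)
    n = xs.size
    threshold = xs[n - k - 1]                       # the (n - k)-th order statistic
    gamma_hat = np.mean(np.log(xs[n - k:]) - np.log(threshold))   # Hill estimator
    return threshold * (k / (n * p)) ** gamma_hat   # Weissman extreme quantile

rng = np.random.default_rng(0)
x = rng.pareto(a=3.0, size=5000) + 1.0              # Pareto(3): true tail index 1/3
q_hat = hill_weissman_quantile(x, k=200, p=1e-4)
q_true = (1e-4) ** (-1.0 / 3.0)                     # exact upper 1e-4 quantile
print(q_hat, q_true)
```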
The proposed compatible approaches provide promising avenues for conducting sensitivity analysis to missingness assumptions in causal inference."}, "https://arxiv.org/abs/2411.14010": {"title": "Spectral domain likelihoods for Bayesian inference in time-varying parameter models", "link": "https://arxiv.org/abs/2411.14010", "description": "arXiv:2411.14010v1 Announce Type: new \nAbstract: Inference for locally stationary processes is often based on some local Whittle-type approximation of the likelihood function defined in the frequency domain. The main reasons for using such a likelihood approximation is that i) it has substantially lower computational cost and better scalability to long time series compared to the time domain likelihood, particularly when used for Bayesian inference via Markov Chain Monte Carlo (MCMC), ii) convenience when the model itself is specified in the frequency domain, and iii) it provides access to bootstrap and subsampling MCMC which exploits the asymptotic independence of Fourier transformed data. Most of the existing literature compares the asymptotic performance of the maximum likelihood estimator (MLE) from such frequency domain likelihood approximation with the exact time domain MLE. Our article uses three simulation studies to assess the finite-sample accuracy of several frequency domain likelihood functions when used to approximate the posterior distribution in time-varying parameter models. The methods are illustrated on an application to egg price data."}, "https://arxiv.org/abs/2411.14016": {"title": "Robust Mutual Fund Selection with False Discovery Rate Control", "link": "https://arxiv.org/abs/2411.14016", "description": "arXiv:2411.14016v1 Announce Type: new \nAbstract: In this article, we address the challenge of identifying skilled mutual funds among a large pool of candidates, utilizing the linear factor pricing model. Assuming observable factors with a weak correlation structure for the idiosyncratic error, we propose a spatial-sign based multiple testing procedure (SS-BH). When latent factors are present, we first extract them using the elliptical principle component method (He et al. 2022) and then propose a factor-adjusted spatial-sign based multiple testing procedure (FSS-BH). Simulation studies demonstrate that our proposed FSS-BH procedure performs exceptionally well across various applications and exhibits robustness to variations in the covariance structure and the distribution of the error term. Additionally, real data application further highlights the superiority of the FSS-BH procedure."}, "https://arxiv.org/abs/2411.14185": {"title": "A note on numerical evaluation of conditional Akaike information for nonlinear mixed-effects models", "link": "https://arxiv.org/abs/2411.14185", "description": "arXiv:2411.14185v1 Announce Type: new \nAbstract: We propose two methods to evaluate the conditional Akaike information (cAI) for nonlinear mixed-effects models with no restriction on cluster size. Method 1 is designed for continuous data and includes formulae for the derivatives of fixed and random effects estimators with respect to observations. Method 2, compatible with any type of observation, requires modeling the marginal (or prior) distribution of random effects as a multivariate normal distribution. 
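As background for the frequency-domain likelihood approximations discussed in the abstract above, here is a minimal Whittle-likelihood sketch for a stationary AR(1): the periodogram is matched to the parametric spectral density at the Fourier frequencies. It is a toy, stationary illustration, not the locally stationary, time-varying-parameter setting of the abstract.

```python
import numpy as np
from scipy.optimize import minimize

def whittle_negloglik(params, x):
    """Whittle (frequency-domain) negative log-likelihood for a stationary AR(1)."""
    phi, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)
    n = x.size
    j = np.arange(1, (n - 1) // 2 + 1)
    omega = 2.0 * np.pi * j / n
    I = np.abs(np.fft.fft(x)[j]) ** 2 / (2.0 * np.pi * n)     # periodogram
    # AR(1) spectral density: sigma^2 / (2*pi*(1 - 2*phi*cos(w) + phi^2))
    f = sigma2 / (2.0 * np.pi * (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2))
    return np.sum(np.log(f) + I / f)

rng = np.random.default_rng(2)
n, phi_true = 2000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.normal()

res = minimize(whittle_negloglik, x0=np.array([0.0, 0.0]), args=(x,))
print("phi_hat =", res.x[0], "sigma2_hat =", np.exp(res.x[1]))
```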
Simulations show that Method 1 performs well with Gaussian data but struggles with skewed continuous distributions, whereas Method 2 consistently performs well across various distributions, including normal, gamma, negative binomial, and Tweedie, with flexible link functions. Based on our findings, we recommend Method 2 as a distributionally robust cAI criterion for model selection in nonlinear mixed-effects models."}, "https://arxiv.org/abs/2411.14265": {"title": "A Bayesian mixture model for Poisson network autoregression", "link": "https://arxiv.org/abs/2411.14265", "description": "arXiv:2411.14265v1 Announce Type: new \nAbstract: In this paper, we propose a new Bayesian Poisson network autoregression mixture model (PNARM). Our model combines ideas from the models of Dahl 2008, Ren et al. 2024 and Armillotta and Fokianos 2024, as it is motivated by the following aims. We consider the problem of modelling multivariate count time series since they arise in many real-world data sets, but has been studied less than its Gaussian-distributed counterpart (Fokianos 2024). Additionally, we assume that the time series occur on the nodes of a known underlying network where the edges dictate the form of the structural vector autoregression model, as a means of imposing sparsity. A further aim is to accommodate heterogeneous node dynamics, and to develop a probabilistic model for clustering nodes that exhibit similar behaviour. We develop an MCMC algorithm for sampling from the model's posterior distribution. The model is applied to a data set of COVID-19 cases in the counties of the Republic of Ireland."}, "https://arxiv.org/abs/2411.14285": {"title": "Stochastic interventions, sensitivity analysis, and optimal transport", "link": "https://arxiv.org/abs/2411.14285", "description": "arXiv:2411.14285v1 Announce Type: new \nAbstract: Recent methodological research in causal inference has focused on effects of stochastic interventions, which assign treatment randomly, often according to subject-specific covariates. In this work, we demonstrate that the usual notion of stochastic interventions have a surprising property: when there is unmeasured confounding, bounds on their effects do not collapse when the policy approaches the observational regime. As an alternative, we propose to study generalized policies, treatment rules that can depend on covariates, the natural value of treatment, and auxiliary randomness. We show that certain generalized policy formulations can resolve the \"non-collapsing\" bound issue: bounds narrow to a point when the target treatment distribution approaches that in the observed data. Moreover, drawing connections to the theory of optimal transport, we characterize generalized policies that minimize worst-case bound width in various sensitivity analysis models, as well as corresponding sharp bounds on their causal effects. These optimal policies are new, and can have a more parsimonious interpretation compared to their usual stochastic policy analogues. 
Finally, we develop flexible, efficient, and robust estimators for the sharp nonparametric bounds that emerge from the framework."}, "https://arxiv.org/abs/2411.14323": {"title": "Estimands and Their Implications for Evidence Synthesis for Oncology: A Simulation Study of Treatment Switching in Meta-Analysis", "link": "https://arxiv.org/abs/2411.14323", "description": "arXiv:2411.14323v1 Announce Type: new \nAbstract: The ICH E9(R1) addendum provides guidelines on accounting for intercurrent events in clinical trials using the estimands framework. However, there has been limited attention on the estimands framework for meta-analysis. Using treatment switching, a well-known intercurrent event that occurs frequently in oncology, we conducted a simulation study to explore the bias introduced by pooling together estimates targeting different estimands in a meta-analysis of randomized clinical trials (RCTs) that allowed for treatment switching. We simulated overall survival data of a collection of RCTs that allowed patients in the control group to switch to the intervention treatment after disease progression under fixed-effects and random-effects models. For each RCT, we calculated effect estimates for a treatment policy estimand that ignored treatment switching, and a hypothetical estimand that accounted for treatment switching by censoring switchers at the time of switching. Then, we performed random-effects and fixed-effects meta-analyses to pool together RCT effect estimates while varying the proportions of treatment policy and hypothetical effect estimates. We compared the results of meta-analyses that pooled different types of effect estimates with those that pooled only treatment policy or hypothetical estimates. We found that pooling estimates targeting different estimands results in pooled estimators that reflect neither the treatment policy estimand nor the hypothetical estimand. This finding shows that pooling estimates of varying target estimands can generate misleading results, even under a random-effects model. Adopting the estimands framework for meta-analysis may improve alignment between meta-analytic results and the clinical research question of interest."}, "https://arxiv.org/abs/2411.13763": {"title": "Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data", "link": "https://arxiv.org/abs/2411.13763", "description": "arXiv:2411.13763v1 Announce Type: cross \nAbstract: In measurement-constrained problems, despite the availability of large datasets, we may only be able to afford to observe the labels on a small portion of the large dataset. This poses a critical question: which data points are most beneficial to label given a budget constraint? In this paper, we focus on the estimation of the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter $\\theta$ in a linear threshold $\\theta^T Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\\theta^T Z$ and a binary outcome $Y$ is minimized. We propose a novel $K$-step active subsampling algorithm to estimate $\\theta$, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to $\\beta\\geq 1$, the smoothness of the conditional density of $X$ given $Y$ and $Z$. 
For $\\beta>(1+\\sqrt{3})/2$, we show that the two-step algorithm yields an estimator with the parametric convergence rate $O_p((s \\log d /N)^{1/2})$ in $l_2$ norm. The rate of our estimator is strictly faster than the minimax optimal rate with $N$ i.i.d. samples drawn from the population. For the other two scenarios $1<\\beta\\leq (1+\\sqrt{3})/2$ and $\\beta=1$, the estimator from the two-step algorithm is sub-optimal. The former requires to run $K>2$ steps to attain the same parametric rate, whereas in the latter case only a near parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply the method to analyze a large diabetes dataset."}, "https://arxiv.org/abs/2411.13868": {"title": "Robust Detection of Watermarks for Large Language Models Under Human Edits", "link": "https://arxiv.org/abs/2411.13868", "description": "arXiv:2411.13868v1 Announce Type: cross \nAbstract: Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \\textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families."}, "https://arxiv.org/abs/2102.08591": {"title": "Data-Driven Logistic Regression Ensembles With Applications in Genomics", "link": "https://arxiv.org/abs/2102.08591", "description": "arXiv:2102.08591v5 Announce Type: replace \nAbstract: Advances in data collecting technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Statistical tools used to discover patterns between the expression of certain genes and the presence of diseases should ideally perform well in terms of both prediction accuracy and identification of key biomarkers. We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. 
The ensembles are comprised of a relatively small number of highly accurate and interpretable models that are learned directly from the data by minimizing a global objective function. We derive the asymptotic properties of our method and develop an efficient algorithm to compute the ensembles. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical genomics datasets involving common diseases such as cancer, multiple sclerosis and psoriasis. In several applications our method could identify key biomarkers that were absent in state-of-the-art competitor methods. We develop a variable importance ranking tool that may guide the focus of researchers on the most promising genes. Based on numerical experiments we provide guidelines for the choice of the number of models in our ensembles."}, "https://arxiv.org/abs/2411.14542": {"title": "Combining missing data imputation and internal validation in clinical risk prediction models", "link": "https://arxiv.org/abs/2411.14542", "description": "arXiv:2411.14542v1 Announce Type: new \nAbstract: Methods to handle missing data have been extensively explored in the context of estimation and descriptive studies, with multiple imputation being the most widely used method in clinical research. However, in the context of clinical risk prediction models, where the goal is often to achieve high prediction accuracy and to make predictions for future patients, there are different considerations regarding the handling of missing data. As a result, deterministic imputation is better suited to the setting of clinical risk prediction models, since the outcome is not included in the imputation model and the imputation method can be easily applied to future patients. In this paper, we provide a tutorial demonstrating how to conduct bootstrapping followed by deterministic imputation of missing data to construct and internally validate the performance of a clinical risk prediction model in the presence of missing data. Extensive simulation study results are provided to help guide decision-making in real-world applications."}, "https://arxiv.org/abs/2411.14548": {"title": "A Random-Effects Approach to Linear Mixed Model Analysis of Incomplete Longitudinal Data", "link": "https://arxiv.org/abs/2411.14548", "description": "arXiv:2411.14548v1 Announce Type: new \nAbstract: We propose a random-effects approach to missing values for linear mixed model (LMM) analysis. The method converts a LMM with missing covariates to another LMM without missing covariates. The standard LMM analysis tools for longitudinal data then apply. Performance of the method is evaluated empirically, and compared with alternative approaches, including the popular MICE procedure of multiple imputation. Theoretical explanations are given for the patterns observed in the simulation studies. A real-data example is discussed."}, "https://arxiv.org/abs/2411.14570": {"title": "Gradient-based optimization for variational empirical Bayes multiple regression", "link": "https://arxiv.org/abs/2411.14570", "description": "arXiv:2411.14570v1 Announce Type: new \nAbstract: Variational empirical Bayes (VEB) methods provide a practically attractive approach to fitting large, sparse, multiple regression models. These methods usually use coordinate ascent to optimize the variational objective function, an approach known as coordinate ascent variational inference (CAVI). 
Here we propose alternative optimization approaches based on gradient-based (quasi-Newton) methods, which we call gradient-based variational inference (GradVI). GradVI exploits a recent result from Kim et al. [arXiv:2208.10910] which writes the VEB regression objective function as a penalized regression. Unfortunately the penalty function is not available in closed form, and we present and compare two approaches to dealing with this problem. In simple situations where CAVI performs well, we show that GradVI produces similar predictive performance, and GradVI converges in fewer iterations when the predictors are highly correlated. Furthermore, unlike CAVI, the key computations in GradVI are simple matrix-vector products, and so GradVI is much faster than CAVI in settings where the design matrix admits fast matrix-vector products (e.g., as we show here, trend filtering applications) and lends itself to parallelized implementations in ways that CAVI does not. GradVI is also very flexible, and could exploit automatic differentiation to easily implement different prior families. Our methods are implemented in an open-source Python software, GradVI (available from https://github.com/stephenslab/gradvi )."}, "https://arxiv.org/abs/2411.14635": {"title": "Modelling Loss of Complexity in Intermittent Time Series and its Application", "link": "https://arxiv.org/abs/2411.14635", "description": "arXiv:2411.14635v1 Announce Type: new \nAbstract: In this paper, we develop a nonparametric relative entropy (RlEn) for modelling loss of complexity in intermittent time series. This technique consists of two steps. First, we fit a nonlinear autoregressive model in which the lag order is determined by a Bayesian Information Criterion (BIC), and the complexity of each intermittent time series is obtained by our novel relative entropy. Second, change-points in complexity are detected by using a cumulative sum (CUSUM) based method. Using simulations, and comparing against the popular approximate entropy (ApEn) method, the performance of RlEn was assessed for its (1) ability to localise complexity change-points in intermittent time series; (2) ability to faithfully estimate underlying nonlinear models. The performance of the proposal was then examined in a real analysis of fatigue-induced changes in the complexity of human motor outputs. The results demonstrated that the proposed method outperformed ApEn in accurately detecting complexity changes in intermittent time series segments."}, "https://arxiv.org/abs/2411.14674": {"title": "Summarizing Bayesian Nonparametric Mixture Posterior -- Sliced Optimal Transport Metrics for Gaussian Mixtures", "link": "https://arxiv.org/abs/2411.14674", "description": "arXiv:2411.14674v1 Announce Type: new \nAbstract: Existing methods to summarize posterior inference for mixture models focus on identifying a point estimate of the implied random partition for clustering, with density estimation as a secondary goal (Wade and Ghahramani, 2018; Dahl et al., 2022). We propose a novel approach for summarizing posterior inference in nonparametric Bayesian mixture models, prioritizing density estimation of the mixing measure (or mixture) as an inference target. One of the key features is the model-agnostic nature of the approach, which remains valid under arbitrarily complex dependence structures in the underlying sampling model. Using a decision-theoretic framework, our method identifies a point estimate by minimizing posterior expected loss. 
The loss function is defined as a discrepancy between mixing measures. Estimating the mixing measure implies inference on the mixture density. Exploiting the discrete nature of the mixing measure, we use a version of the sliced Wasserstein distance. We introduce two specific variants for Gaussian mixtures. The first, mixed sliced Wasserstein, applies generalized geodesic projections on the product of the Euclidean space and the manifold of symmetric positive definite matrices. The second, sliced mixture Wasserstein, leverages the linearity of Gaussian mixture measures for efficient projection."}, "https://arxiv.org/abs/2411.14690": {"title": "Deep Gaussian Process Emulation and Uncertainty Quantification for Large Computer Experiments", "link": "https://arxiv.org/abs/2411.14690", "description": "arXiv:2411.14690v1 Announce Type: new \nAbstract: Computer models are used as a way to explore complex physical systems. Stationary Gaussian process emulators, with their accompanying uncertainty quantification, are popular surrogates for computer models. However, many computer models are not well represented by stationary Gaussian process models. Deep Gaussian processes have been shown to be capable of capturing non-stationary behaviors and abrupt regime changes in the computer model response. In this paper, we explore the properties of two deep Gaussian process formulations within the context of computer model emulation. For one of these formulations, we introduce a new parameter that controls the amount of smoothness in the deep Gaussian process layers. We adapt a stochastic variational approach to inference for this model, allowing for prior specification and posterior exploration of the smoothness of the response surface. Our approach can be applied to a large class of computer models, and scales to arbitrarily large simulation designs. The proposed methodology was motivated by the need to emulate an astrophysical model of the formation of binary black hole mergers."}, "https://arxiv.org/abs/2411.14763": {"title": "From Replications to Revelations: Heteroskedasticity-Robust Inference", "link": "https://arxiv.org/abs/2411.14763", "description": "arXiv:2411.14763v1 Announce Type: new \nAbstract: We compare heteroskedasticity-robust inference methods with a large-scale Monte Carlo study based on regressions from 155 reproduction packages of leading economic journals. The results confirm established wisdom and uncover new insights. Among well-established methods, HC2 standard errors with the degrees-of-freedom specification proposed by Bell and McCaffrey (2002) perform best. To further improve the accuracy of t-tests, we propose a novel degrees-of-freedom specification based on partial leverages. We also show how HC2 to HC4 standard errors can be refined by more effectively addressing the 15.6% of cases where at least one observation exhibits a leverage of one."}, "https://arxiv.org/abs/2411.14864": {"title": "Bayesian optimal change point detection in high-dimensions", "link": "https://arxiv.org/abs/2411.14864", "description": "arXiv:2411.14864v1 Announce Type: new \nAbstract: We propose the first Bayesian methods for detecting change points in high-dimensional mean and covariance structures. These methods are constructed using pairwise Bayes factors, leveraging modularization to identify significant changes in individual components efficiently. 
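The heteroskedasticity-robust inference abstract above refers to HC2 standard errors and observation leverages. A minimal numpy sketch of the HC2 sandwich estimator follows; the Bell and McCaffrey degrees-of-freedom correction and the proposed partial-leverage refinement are not reproduced, and observations with leverage equal to one would need special handling, as the abstract notes.

```python
import numpy as np

def hc2_standard_errors(X, y):
    """OLS with HC2 heteroskedasticity-robust standard errors.
    HC2 rescales squared residuals by 1/(1 - h_i); it is undefined when an
    observation has leverage h_i = 1 (the problematic cases noted above)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_i
    omega = resid ** 2 / (1.0 - h)                    # HC2 weighting
    meat = X.T @ (omega[:, None] * X)
    V = XtX_inv @ meat @ XtX_inv                      # sandwich covariance
    return beta, np.sqrt(np.diag(V))

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (1 + np.abs(X[:, 1]))
beta, se = hc2_standard_errors(X, y)
print(beta, se)
```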
We establish that the proposed methods consistently detect and estimate change points under much milder conditions than existing approaches in the literature. Additionally, we demonstrate that their localization rates are nearly optimal in terms of rates. The practical performance of the proposed methods is evaluated through extensive simulation studies, where they are compared to state-of-the-art techniques. The results show comparable or superior performance across most scenarios. Notably, the methods effectively detect change points whenever signals of sufficient magnitude are present, irrespective of the number of signals. Finally, we apply the proposed methods to genetic and financial datasets, illustrating their practical utility in real-world applications."}, "https://arxiv.org/abs/2411.14999": {"title": "The EE-Classifier: A classification method for functional data based on extremality indexes", "link": "https://arxiv.org/abs/2411.14999", "description": "arXiv:2411.14999v1 Announce Type: new \nAbstract: Functional data analysis has gained significant attention due to its wide applicability. This research explores the extension of statistical analysis methods for functional data, with a primary focus on supervised classification techniques. It provides a review on the existing depth-based methods used in functional data samples. Building on this foundation, it introduces an extremality-based approach, which takes the modified epigraph and hypograph indexes properties as classification techniques. To demonstrate the effectiveness of the classifier, it is applied to both real-world and synthetic data sets. The results show its efficacy in accurately classifying functional data. Additionally, the classifier is used to analyze the fluctuations in the S\\&P 500 stock value. This research contributes to the field of functional data analysis by introducing a new extremality-based classifier. The successful application to various data sets shows its potential for supervised classification tasks and provides valuable insights into financial data analysis."}, "https://arxiv.org/abs/2411.14472": {"title": "Exploring the Potential Role of Generative AI in the TRAPD Procedure for Survey Translation", "link": "https://arxiv.org/abs/2411.14472", "description": "arXiv:2411.14472v1 Announce Type: cross \nAbstract: This paper explores and assesses in what ways generative AI can assist in translating survey instruments. Writing effective survey questions is a challenging and complex task, made even more difficult for surveys that will be translated and deployed in multiple linguistic and cultural settings. Translation errors can be detrimental, with known errors rendering data unusable for its intended purpose and undetected errors leading to incorrect conclusions. A growing number of institutions face this problem as surveys deployed by private and academic organizations globalize, and the success of their current efforts depends heavily on researchers' and translators' expertise and the amount of time each party has to contribute to the task. Thus, multilinguistic and multicultural surveys produced by teams with limited expertise, budgets, or time are at significant risk for translation-based errors in their data. We implement a zero-shot prompt experiment using ChatGPT to explore generative AI's ability to identify features of questions that might be difficult to translate to a linguistic audience other than the source language. 
We find that ChatGPT can provide meaningful feedback on translation issues, including common source survey language, inconsistent conceptualization, sensitivity and formality issues, and nonexistent concepts. In addition, we provide detailed information on the practicality of the approach, including accessing the necessary software, associated costs, and computational run times. Lastly, based on our findings, we propose avenues for future research that integrate AI into survey translation practices."}, "https://arxiv.org/abs/2411.15075": {"title": "The Effects of Major League Baseball's Ban on Infield Shifts: A Quasi-Experimental Analysis", "link": "https://arxiv.org/abs/2411.15075", "description": "arXiv:2411.15075v1 Announce Type: cross \nAbstract: From 2020 to 2023, Major League Baseball changed rules affecting team composition, player positioning, and game time. Understanding the effects of these rules is crucial for leagues, teams, players, and other relevant parties to assess their impact and to advocate either for further changes or undoing previous ones. Panel data and quasi-experimental methods provide useful tools for causal inference in these settings. I demonstrate this potential by analyzing the effect of the 2023 shift ban at both the league-wide and player-specific levels. Using difference-in-differences analysis, I show that the policy increased batting average on balls in play and on-base percentage for left-handed batters by a modest amount (nine points). For individual players, synthetic control analyses identify several players whose offensive performance (on-base percentage, on-base plus slugging percentage, and weighted on-base average) improved substantially (over 70 points in several cases) because of the rule change, and other players with previously high shift rates for whom it had little effect. This article both estimates the impact of this specific rule change and demonstrates how these methods for causal inference are potentially valuable for sports analytics -- at the player, team, and league levels -- more broadly."}, "https://arxiv.org/abs/2305.03134": {"title": "Debiased Inference for Dynamic Nonlinear Panels with Multi-dimensional Heterogeneities", "link": "https://arxiv.org/abs/2305.03134", "description": "arXiv:2305.03134v3 Announce Type: replace \nAbstract: We introduce a generic class of dynamic nonlinear heterogeneous parameter models that incorporate individual and time effects in both the intercept and slope. To address the incidental parameter problem inherent in this class of models, we develop an analytical bias correction procedure to construct a bias-corrected likelihood. The resulting maximum likelihood estimators are automatically bias-corrected. Moreover, likelihood-based tests statistics -- including likelihood-ratio, Lagrange-multiplier, and Wald tests -- follow the limiting chi-square distribution under the null hypothesis. 
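To illustrate the limiting chi-square behaviour of likelihood-based tests mentioned in the panel-model abstract above, here is a self-contained toy example (a two-group Poisson rate comparison, not the paper's dynamic nonlinear panel) that computes a likelihood-ratio statistic and its chi-square p-value.

```python
import numpy as np
from scipy.stats import chi2

def pois_loglik(y, lam):
    # Poisson log-likelihood up to an additive constant not involving lam
    return np.sum(y * np.log(lam) - lam)

rng = np.random.default_rng(4)
y0 = rng.poisson(3.0, size=400)        # group 0
y1 = rng.poisson(3.0, size=400)        # group 1 (same rate, so H0 is true)

# Unrestricted model: separate rates; restricted model: one common rate
ll_unres = pois_loglik(y0, y0.mean()) + pois_loglik(y1, y1.mean())
y_all = np.concatenate([y0, y1])
ll_res = pois_loglik(y_all, y_all.mean())

lr_stat = 2.0 * (ll_unres - ll_res)
p_value = chi2.sf(lr_stat, df=1)       # chi-square(1) limiting distribution
print(lr_stat, p_value)
```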
Simulations demonstrate the effectiveness of the proposed correction method, and an empirical application on the labor force participation of single mothers underscores its practical importance."}, "https://arxiv.org/abs/2309.10284": {"title": "Rank-adaptive covariance testing with applications to genomics and neuroimaging", "link": "https://arxiv.org/abs/2309.10284", "description": "arXiv:2309.10284v2 Announce Type: replace \nAbstract: In biomedical studies, testing for differences in covariance offers scientific insights beyond mean differences, especially when differences are driven by complex joint behavior between features. However, when differences in joint behavior are weakly dispersed across many dimensions and arise from differences in low-rank structures within the data, as is often the case in genomics and neuroimaging, existing two-sample covariance testing methods may suffer from power loss. The Ky-Fan(k) norm, defined as the sum of the top k singular values, is a simple and intuitive matrix norm able to capture signals caused by differences in low-rank structures between matrices, but its statistical properties in hypothesis testing have not been studied well. In this paper, we investigate the behavior of the Ky-Fan(k) norm in two-sample covariance testing. Ultimately, we propose a novel methodology, Rank-Adaptive Covariance Testing (RACT), which is able to leverage differences in low-rank structures found in the covariance matrices of two groups in order to maximize power. RACT uses permutation for statistical inference, ensuring exact Type I error control. We validate RACT in simulation studies and evaluate its performance when testing for differences in gene expression networks between two types of lung cancer, as well as testing for covariance heterogeneity in diffusion tensor imaging (DTI) data taken on two different scanner types."}, "https://arxiv.org/abs/2311.13202": {"title": "Robust Multi-Model Subset Selection", "link": "https://arxiv.org/abs/2311.13202", "description": "arXiv:2311.13202v3 Announce Type: replace \nAbstract: Outlying observations can be challenging to handle and adversely affect subsequent analyses, particularly in complex high-dimensional datasets. Although outliers are not always undesired anomalies in the data and may possess valuable insights, only methods that are robust to outliers are able to accurately identify them and resist their influence. In this paper, we propose a method that generates an ensemble of sparse and diverse predictive models that are resistant to outliers. We show that the ensembles generally outperform single-model sparse and robust methods in high-dimensional prediction tasks. Cross-validation is used to tune model parameters to control levels of sparsity, diversity and resistance to outliers. We establish the finite-sample breakdown point of the ensembles and the models that comprise them, and we develop a tailored computing algorithm to learn the ensembles by leveraging recent developments in L0 optimization. 
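A schematic version of the idea in the rank-adaptive covariance testing abstract above: use the Ky-Fan(k) norm of the difference of sample covariance matrices as a test statistic and calibrate it by permutation. The code is an illustrative sketch under those assumptions, not the adaptive RACT procedure itself.

```python
import numpy as np

def ky_fan_k(M, k):
    """Ky-Fan(k) norm: the sum of the top-k singular values of M."""
    return np.linalg.svd(M, compute_uv=False)[:k].sum()

def covariance_permutation_test(X1, X2, k=3, n_perm=999, seed=0):
    """Permutation test for a covariance difference using a Ky-Fan(k)
    statistic on the difference of the sample covariance matrices."""
    rng = np.random.default_rng(seed)
    stat = ky_fan_k(np.cov(X1.T) - np.cov(X2.T), k)
    Z = np.vstack([X1, X2])
    n1 = X1.shape[0]
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(Z.shape[0])
        perm_stats[b] = ky_fan_k(np.cov(Z[idx[:n1]].T) - np.cov(Z[idx[n1:]].T), k)
    return (1 + np.sum(perm_stats >= stat)) / (n_perm + 1)   # permutation p-value

rng = np.random.default_rng(5)
X1 = rng.normal(size=(100, 20))
u = rng.normal(size=20)
u /= np.linalg.norm(u)
X2 = rng.normal(size=(100, 20)) + 2.0 * rng.normal(size=(100, 1)) * u  # rank-one shift
print(covariance_permutation_test(X1, X2))
```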
Our extensive numerical experiments on synthetic and artificially contaminated real datasets from bioinformatics and cheminformatics demonstrate the competitive advantage of our method over state-of-the-art single-model methods."}, "https://arxiv.org/abs/2309.03122": {"title": "Bayesian Evidence Synthesis for Modeling SARS-CoV-2 Transmission", "link": "https://arxiv.org/abs/2309.03122", "description": "arXiv:2309.03122v2 Announce Type: replace-cross \nAbstract: The acute phase of the Covid-19 pandemic has made apparent the need for decision support based upon accurate epidemic modeling. This process is substantially hampered by under-reporting of cases and related data incompleteness issues. In this article we adopt the Bayesian paradigm and synthesize publicly available data via a discrete-time stochastic epidemic modeling framework. The models allow for estimating the total number of infections while accounting for the endemic phase of the pandemic. We assess the prediction of the infection rate utilizing mobility information, notably the principal components of the mobility data. We evaluate variational Bayes in this context and find that Hamiltonian Monte Carlo offers a robust inference alternative for such models. We elaborate upon vector analysis of the epidemic dynamics, thus enriching the traditional tools used for decision making. In particular, we show how certain 2-dimensional plots on the phase plane may yield intuitive information regarding the speed and the type of transmission dynamics. We investigate the potential of a two-stage analysis as a consequence of cutting feedback, for inference on certain functionals of the model parameters. Finally, we show that a point mass on critical parameters is overly restrictive and investigate informative priors as a suitable alternative."}, "https://arxiv.org/abs/2311.12214": {"title": "Random Fourier Signature Features", "link": "https://arxiv.org/abs/2311.12214", "description": "arXiv:2311.12214v2 Announce Type: replace-cross \nAbstract: Tensor algebras give rise to one of the most powerful measures of similarity for sequences of arbitrary length called the signature kernel accompanied with attractive theoretical guarantees from stochastic analysis. Previous algorithms to compute the signature kernel scale quadratically in terms of the length and the number of the sequences. To mitigate this severe computational bottleneck, we develop a random Fourier feature-based acceleration of the signature kernel acting on the inherently non-Euclidean domain of sequences. We show uniform approximation guarantees for the proposed unbiased estimator of the signature kernel, while keeping its computation linear in the sequence length and number. In addition, combined with recent advances on tensor projections, we derive two even more scalable time series features with favourable concentration properties and computational complexity both in time and memory. Our empirical results show that the reduction in computational cost comes at a negligible price in terms of accuracy on moderate-sized datasets, and it enables one to scale to large datasets up to a million time series."}, "https://arxiv.org/abs/2411.15326": {"title": "Scalar-on-Shape Regression Models for Functional Data Analysis", "link": "https://arxiv.org/abs/2411.15326", "description": "arXiv:2411.15326v1 Announce Type: new \nAbstract: Functional data contains two components: shape (or amplitude) and phase. 
This paper focuses on a branch of functional data analysis (FDA), namely Shape-Based FDA, that isolates and focuses on shapes of functions. Specifically, it studies Scalar-on-Shape (ScoSh) regression models that incorporate the shapes of predictor functions and discard their phases. This aspect sets ScoSh models apart from the traditional Scalar-on-Function (ScoF) regression models that incorporate full predictor functions. ScoSh is motivated by object data analysis, {\\it e.g.}, for neuro-anatomical objects, where object morphologies are relevant and their parameterizations are arbitrary. ScoSh also differs from methods that arbitrarily pre-register data and use it in subsequent analysis. In contrast, ScoSh models perform registration during regression, using the (non-parametric) Fisher-Rao inner product and nonlinear index functions to capture complex predictor-response relationships. This formulation results in novel concepts of {\\it regression phase} and {\\it regression mean} of functions. Regression phases are time-warpings of predictor functions that optimize prediction errors, and regression means are optimal regression coefficients. We demonstrate practical applications of the ScoSh model using extensive simulated and real-data examples, including predicting COVID outcomes when daily rate curves are predictors."}, "https://arxiv.org/abs/2411.15398": {"title": "The ultimate issue error: mistaking parameters for hypotheses", "link": "https://arxiv.org/abs/2411.15398", "description": "arXiv:2411.15398v1 Announce Type: new \nAbstract: In a criminal investigation, an inferential error occurs when the probability that a suspect is the source of some evidence -- such as a fingerprint -- is taken as the probability of guilt. This is known as the ultimate issue error, and the same error occurs in statistical inference when the probability that a parameter equals some value is incorrectly taken to be the probability of a hypothesis. Almost all statistical inference in the social and biological sciences is subject to this error, and replacing every instance of \"hypothesis testing\" with \"parameter testing\" in these fields would more accurately describe the target of inference. Parameter values and quantities derived from them, such as p-values or Bayes factors, have no direct quantitative relationship with scientific hypotheses. Here, we describe the problem, its consequences, and suggest options for improving scientific inference."}, "https://arxiv.org/abs/2411.15499": {"title": "Asymmetric Errors", "link": "https://arxiv.org/abs/2411.15499", "description": "arXiv:2411.15499v1 Announce Type: new \nAbstract: We present a procedure for handling asymmetric errors. Many results in particle physics are presented as values with different positive and negative errors, and there is no consistent procedure for handling them. We consider the difference between errors quoted using pdfs and using likelihoods, and the difference between the rms spread of a measurement and the 68\\% central confidence region. 
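The distinction drawn just above between the rms spread and the 68% central confidence region is easy to see numerically for a skewed measurement distribution; the short sketch below (a lognormal toy example, not the paper's procedure) reports both.

```python
import numpy as np

rng = np.random.default_rng(6)
samples = rng.lognormal(mean=0.0, sigma=0.75, size=100000)   # skewed "measurement"

centre = np.median(samples)                  # reported central value
rms = samples.std()                          # symmetric rms spread
lo, hi = np.percentile(samples, [16, 84])    # 68% central interval (asymmetric)

print("value = %.3f +/- %.3f  (rms)" % (centre, rms))
print("value = %.3f +%.3f / -%.3f  (68%% central)" % (centre, hi - centre, centre - lo))
```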
We provide a comprehensive analysis of the possibilities, and software tools to enable their use."}, "https://arxiv.org/abs/2411.15511": {"title": "Forecasting with Markovian max-stable fields in space and time: An application to wind gust speeds", "link": "https://arxiv.org/abs/2411.15511", "description": "arXiv:2411.15511v1 Announce Type: new \nAbstract: Hourly maxima of 3-second wind gust speeds are prominent indicators of the severity of wind storms, and accurately forecasting them is thus essential for populations, civil authorities and insurance companies. Space-time max-stable models appear as natural candidates for this, but those explored so far are not suited for forecasting and, more generally, the forecasting literature for max-stable fields is limited. To fill this gap, we consider a specific space-time max-stable model, more precisely a max-autoregressive model with advection, that is well-adapted to model and forecast atmospheric variables. We apply it, as well as our related forecasting strategy, to reanalysis 3-second wind gust data for France in 1999, and show good performance compared to a competitor model. On top of demonstrating the practical relevance of our model, we meticulously study its theoretical properties and show the consistency and asymptotic normality of the space-time pairwise likelihood estimator which is used to calibrate the model."}, "https://arxiv.org/abs/2411.15566": {"title": "Sensitivity Analysis on Interaction Effects of Policy-Augmented Bayesian Networks", "link": "https://arxiv.org/abs/2411.15566", "description": "arXiv:2411.15566v1 Announce Type: new \nAbstract: Biomanufacturing plays an important role in supporting public health and the growth of the bioeconomy. Modeling and studying the interaction effects among various input variables is very critical for obtaining a scientific understanding and process specification in biomanufacturing. In this paper, we use the ShapleyOwen indices to measure the interaction effects for the policy-augmented Bayesian network (PABN) model, which characterizes the risk- and science-based understanding of production bioprocess mechanisms. In order to facilitate efficient interaction effect quantification, we propose a sampling-based simulation estimation framework. In addition, to further improve the computational efficiency, we develop a non-nested simulation algorithm with sequential sampling, which can dynamically allocate the simulation budget to the interactions with high uncertainty and therefore estimate the interaction effects more accurately under a total fixed budget setting."}, "https://arxiv.org/abs/2411.15625": {"title": "Canonical Correlation Analysis: review", "link": "https://arxiv.org/abs/2411.15625", "description": "arXiv:2411.15625v1 Announce Type: new \nAbstract: For over a century canonical correlations, variables, and related concepts have been studied across various fields, with contributions dating back to Jordan [1875] and Hotelling [1936]. This text surveys the evolution of canonical correlation analysis, a fundamental statistical tool, beginning with its foundational theorems and progressing to recent developments and open research problems. Along the way we introduce and review methods, notions, and fundamental concepts from linear algebra, random matrix theory, and high-dimensional statistics, placing particular emphasis on rigorous mathematical treatment.\n The survey is intended for technically proficient graduate students and other researchers with an interest in this area. 
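As a concrete anchor for the canonical correlation survey described above, this is the textbook computation the theory builds on: canonical correlations are the singular values of the whitened cross-covariance matrix. A minimal numpy sketch, assuming well-conditioned sample covariances:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations as the singular values of the whitened
    cross-covariance matrix Sxx^{-1/2} Sxy Syy^{-1/2}."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)          # S symmetric positive definite
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)

rng = np.random.default_rng(7)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.5 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([z + 0.5 * rng.normal(size=(500, 1)), rng.normal(size=(500, 3))])
print(canonical_correlations(X, Y))   # one large correlation, the rest near zero
```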
The content is organized into five chapters, supplemented by six sets of exercises found in Chapter 6. These exercises introduce additional material, reinforce key concepts, and serve to bridge ideas across chapters. We recommend the following sequence: first, solve Problem Set 0, then proceed with Chapter 1, solve Problem Set 1, and so on through the text."}, "https://arxiv.org/abs/2411.15691": {"title": "Data integration using covariate summaries from external sources", "link": "https://arxiv.org/abs/2411.15691", "description": "arXiv:2411.15691v1 Announce Type: new \nAbstract: In modern data analysis, information is frequently collected from multiple sources, often leading to challenges such as data heterogeneity and imbalanced sample sizes across datasets. Robust and efficient data integration methods are crucial for improving the generalization and transportability of statistical findings. In this work, we address scenarios where, in addition to having full access to individualized data from a primary source, supplementary covariate information from external sources is also available. While traditional data integration methods typically require individualized covariates from external sources, such requirements can be impractical due to limitations related to accessibility, privacy, storage, and cost. Instead, we propose novel data integration techniques that rely solely on external summary statistics, such as sample means and covariances, to construct robust estimators for the mean outcome under both homogeneous and heterogeneous data settings. Additionally, we extend this framework to causal inference, enabling the estimation of average treatment effects for both generalizability and transportability."}, "https://arxiv.org/abs/2411.15713": {"title": "Bayesian High-dimensional Grouped-regression using Sparse Projection-posterior", "link": "https://arxiv.org/abs/2411.15713", "description": "arXiv:2411.15713v1 Announce Type: new \nAbstract: We present a novel Bayesian approach for high-dimensional grouped regression under sparsity. We leverage a sparse projection method that uses a sparsity-inducing map to derive an induced posterior on a lower-dimensional parameter space. Our method introduces three distinct projection maps based on popular penalty functions: the Group LASSO Projection Posterior, Group SCAD Projection Posterior, and Adaptive Group LASSO Projection Posterior. Each projection map is constructed to immerse dense posterior samples into a structured, sparse space, allowing for effective group selection and estimation in high-dimensional settings. We derive optimal posterior contraction rates for estimation and prediction, proving that the methods are model selection consistent. Additionally, we propose a Debiased Group LASSO Projection Map, which ensures exact coverage of credible sets. Our methodology is particularly suited for applications in nonparametric additive models, where we apply it with B-spline expansions to capture complex relationships between covariates and response. Extensive simulations validate our theoretical findings, demonstrating the robustness of our approach across different settings. 
Finally, we illustrate the practical utility of our method with an application to brain MRI volume data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), where our model identifies key brain regions associated with Alzheimer's progression."}, "https://arxiv.org/abs/2411.15771": {"title": "A flexible and general semi-supervised approach to multiple hypothesis testing", "link": "https://arxiv.org/abs/2411.15771", "description": "arXiv:2411.15771v1 Announce Type: new \nAbstract: Standard multiple testing procedures are designed to report a list of discoveries, or suspected false null hypotheses, given the hypotheses' p-values or test scores. Recently there has been a growing interest in enhancing such procedures by combining additional information with the primary p-value or score. Specifically, such so-called ``side information'' can be leveraged to improve the separation between true and false nulls along additional ``dimensions'' thereby increasing the overall sensitivity. In line with this idea, we develop RESET (REScoring via Estimating and Training) which uses a unique data-splitting protocol that subsequently allows any semi-supervised learning approach to factor in the available side-information while maintaining finite-sample error rate control. Our practical implementation, RESET Ensemble, selects from an ensemble of classification algorithms so that it is compatible to a range of multiple testing scenarios without the need for the user to select the appropriate one. We apply RESET to both p-value and competition based multiple testing problems and show that RESET is (1) power-wise competitive, (2) fast compared to most tools and (3) is able to uniquely achieve finite sample FDR or FDP control, depending on the user's preference."}, "https://arxiv.org/abs/2411.15797": {"title": "Utilization and Profitability of Tractor Services for Maize Farming in Ejura-Sekyedumase Municipality, Ghana", "link": "https://arxiv.org/abs/2411.15797", "description": "arXiv:2411.15797v1 Announce Type: new \nAbstract: Maize farming is a major livelihood activity for many farmers in Ghana. Unfortunately, farmers usually do not obtain the expected returns on their investment due to reliance on rudimentary, labor-intensive, and inefficient methods of production. Using cross-sectional data from 359 maize farmers, this study investigates the profitability and determinants of the use of tractor services for maize production in Ejura-Sekyedumase, Ashanti Region of Ghana. Results from descriptive and profitability analyses reveal that tractor services such as ploughing and shelling are widely used, but their profitability varies significantly among farmers. Key factors influencing profitability include farm size, fertilizer quantity applied, and farmer experience. Results from a multivariate probit analysis also showed that farming experience, fertilizer quantity, and profit per acre have a positive influence on tractor service use for shelling, while household size, farm size, and FBO have a negative influence. Farming experience, fertilizer quantity, and profit per acre positively influence tractor service use for ploughing, while farm size has a negative influence. A t-test result reveals a statistically significant difference in profit between farmers who use tractor services and those who do not. Specifically, farmers who utilize tractor services on their maize farm had a return to cost of 9 percent more than those who do not (p-value < 0.05). 
The Kendall's test result showed a moderate agreement among the maize farmers (with financial issues ranked first) regarding constraints on their ability to access/utilize tractor services on their farms."}, "https://arxiv.org/abs/2411.15819": {"title": "A Copula-Based Approach to Modelling and Testing for Heavy-tailed Data with Bivariate Heteroscedastic Extremes", "link": "https://arxiv.org/abs/2411.15819", "description": "arXiv:2411.15819v1 Announce Type: new \nAbstract: Heteroscedasticity and correlated data pose challenges for extreme value analysis, particularly in two-sample testing problems for tail behaviors. In this paper, we propose a novel copula-based multivariate model for independent but not identically distributed heavy-tailed data with heterogeneous marginal distributions and a varying copula structure. The proposed model encompasses classical models with independent and identically distributed data and some models with a mixture of correlation. To understand the tail behavior, we introduce the quasi-tail copula, which integrates both marginal heteroscedasticity and the dependence structure of the varying copula, and further propose an estimation approach. We then establish the joint asymptotic properties for the Hill estimator, scedasis functions, and quasi-tail copula. In addition, a multiplier bootstrap method is applied to estimate their complex covariance. Moreover, it is of practical interest to develop four typical two-sample testing problems under the new model, which include the equivalence of the extreme value indices and scedasis functions. Finally, we conduct simulation studies to validate our tests and apply the new model to data from the stock market."}, "https://arxiv.org/abs/2411.15822": {"title": "Semi-parametric least-area linear-circular regression through M\\\"obius transformation", "link": "https://arxiv.org/abs/2411.15822", "description": "arXiv:2411.15822v1 Announce Type: new \nAbstract: This paper introduces a new area-based regression model where the responses are angular variables and the predictors are linear. The regression curve is formulated using a generalized M\\\"obius transformation that maps the real axis to the circle. A novel area-based loss function is introduced for parameter estimation, utilizing the intrinsic geometry of a curved torus. The model is semi-parametric, requiring no specific distributional assumptions for the angular error. Extensive simulation studies are performed with von Mises and wrapped Cauchy distributions as angular errors. The practical utility of the model is illustrated through real data analysis of two well-known cryptocurrencies, Bitcoin and Ethereum."}, "https://arxiv.org/abs/2411.15826": {"title": "Expert-elicitation method for non-parametric joint priors using normalizing flows", "link": "https://arxiv.org/abs/2411.15826", "description": "arXiv:2411.15826v1 Announce Type: new \nAbstract: We propose an expert-elicitation method for learning non-parametric joint prior distributions using normalizing flows. Normalizing flows are a class of generative models that enable exact, single-step density evaluation and can capture complex density functions through specialized deep neural networks. Building on our previously introduced simulation-based framework, we adapt and extend the methodology to accommodate non-parametric joint priors. Our framework thus supports the development of elicitation methods for learning both parametric and non-parametric priors, as well as independent or joint priors for model parameters. 
To evaluate the performance of the proposed method, we perform four simulation studies and present an evaluation pipeline that incorporates diagnostics and additional evaluation tools to support decision-making at each stage of the elicitation process."}, "https://arxiv.org/abs/2411.15908": {"title": "Selective Inference for Time-Varying Effect Moderation", "link": "https://arxiv.org/abs/2411.15908", "description": "arXiv:2411.15908v1 Announce Type: new \nAbstract: Causal effect moderation investigates how the effect of interventions (or treatments) on outcome variables changes based on observed characteristics of individuals, known as potential effect moderators. With advances in data collection, datasets containing many observed features as potential moderators have become increasingly common. High-dimensional analyses often lack interpretability, with important moderators masked by noise, while low-dimensional, marginal analyses yield many false positives due to strong correlations with true moderators. In this paper, we propose a two-step method for selective inference on time-varying causal effect moderation that addresses the limitations of both high-dimensional and marginal analyses. Our method first selects a relatively smaller, more interpretable model to estimate a linear causal effect moderation using a Gaussian randomization approach. We then condition on the selection event to construct a pivot, enabling uniformly asymptotic semi-parametric inference in the selected model. Through simulations and real data analyses, we show that our method consistently achieves valid coverage rates, even when existing conditional methods and common sample splitting techniques fail. Moreover, our method yields shorter, bounded intervals, unlike existing methods that may produce infinitely long intervals."}, "https://arxiv.org/abs/2411.15996": {"title": "Homeopathic Modernization and the Middle Science Trap: conceptual context of ergonomics, econometrics and logic of some national scientific case", "link": "https://arxiv.org/abs/2411.15996", "description": "arXiv:2411.15996v1 Announce Type: new \nAbstract: This article analyses the structural and institutional barriers hindering the development of scientific systems in transition economies, such as Kazakhstan. The main focus is on the concept of the \"middle science trap,\" which is characterized by steady growth in quantitative indicators (publications, grants) but a lack of qualitative advancement. Excessive bureaucracy, weak integration into the international scientific community, and ineffective science management are key factors limiting development. This paper proposes an approach of \"homeopathic modernization,\" which focuses on minimal yet strategically significant changes aimed at reducing bureaucratic barriers and enhancing the effectiveness of the scientific ecosystem. A comparative analysis of international experience (China, India, and the European Union) is provided, demonstrating how targeted reforms in the scientific sector can lead to significant results. Social and cultural aspects, including the influence of mentality and institutional structure, are also examined, and practical recommendations for reforming the scientific system in Kazakhstan and Central Asia are offered. 
The conclusions of the article could be useful for developing national science modernization programs, particularly in countries with high levels of bureaucracy and conservatism."}, "https://arxiv.org/abs/2411.16153": {"title": "Analysis of longitudinal data with destructive sampling using linear mixed models", "link": "https://arxiv.org/abs/2411.16153", "description": "arXiv:2411.16153v1 Announce Type: new \nAbstract: This paper proposes an analysis methodology for the case of longitudinal data with destructive sampling of observational units, which come from experimental units that are measured at all time points of the analysis. A linear mixed model is proposed and compared with regression models with fixed and mixed effects, including one similar to that used for so-called pseudo-panel data, and with a multivariate analysis of variance model, all of which are common in statistics. To compare the models, the mean square error was used, demonstrating the advantage of the proposed methodology. In addition, an application to real-life data on the scores of the Saber 11 tests taken by students in Colombia illustrates the advantage of using this methodology in practical scenarios."}, "https://arxiv.org/abs/2411.16192": {"title": "Modeling large dimensional matrix time series with partially known and latent factors", "link": "https://arxiv.org/abs/2411.16192", "description": "arXiv:2411.16192v1 Announce Type: new \nAbstract: This article considers modeling large-dimensional matrix time series by introducing a regression term into the matrix factor model. This extends the classic matrix factor model to incorporate information from known factors or useful covariates. We establish the convergence rates of the coefficient matrix, the loading matrices, and the signal part. The theoretical results coincide with the rates in Wang et al. (2019). We conduct numerical studies to verify the performance of our estimation procedure in finite samples. Finally, we demonstrate the superiority of our proposed model using daily stock return data."}, "https://arxiv.org/abs/2411.16220": {"title": "On the achievability of efficiency bounds for covariate-adjusted response-adaptive randomization", "link": "https://arxiv.org/abs/2411.16220", "description": "arXiv:2411.16220v1 Announce Type: new \nAbstract: In the context of precision medicine, covariate-adjusted response-adaptive randomization (CARA) has garnered much attention from both academia and industry due to its benefits in providing ethical and tailored treatment assignments based on patients' profiles while still preserving favorable statistical properties. Recent years have seen substantial progress in understanding inference for various adaptive experimental designs. In particular, research has focused on two important perspectives: how to obtain robust inference in the presence of model misspecification, and what is the smallest variance, i.e., the efficiency bound, that an estimator can achieve. Notably, Armstrong (2022) derived the asymptotic efficiency bound for any randomization procedure that assigns treatments depending on covariates and accrued responses, thus including CARA, among others. However, to the best of our knowledge, no existing literature has addressed whether and how the asymptotic efficiency bound can be achieved under CARA. 
In this paper, by connecting two strands of literature on adaptive randomization, namely robust inference and efficiency bound, we provide a definitive answer to this question for an important practical scenario where only discrete covariates are observed and used to form stratification. We consider a specific type of CARA, i.e., a stratified version of doubly-adaptive biased coin design, and prove that the stratified difference-in-means estimator achieves Armstrong (2022)'s efficiency bound, with possible ethical constraints on treatment assignments. Our work provides new insights and demonstrates the potential for more research regarding the design and analysis of CARA that maximizes efficiency while adhering to ethical considerations. Future studies could explore how to achieve the asymptotic efficiency bound for general CARA with continuous covariates, which remains an open question."}, "https://arxiv.org/abs/2411.16311": {"title": "Bayesian models for missing and misclassified variables using integrated nested Laplace approximations", "link": "https://arxiv.org/abs/2411.16311", "description": "arXiv:2411.16311v1 Announce Type: new \nAbstract: Misclassified variables used in regression models, either as a covariate or as the response, may lead to biased estimators and incorrect inference. Even though Bayesian models to adjust for misclassification error exist, it has not been shown how these models can be implemented using integrated nested Laplace approximation (INLA), a popular framework for fitting Bayesian models due to its computational efficiency. Since INLA requires the latent field to be Gaussian, and the Bayesian models adjusting for covariate misclassification error necessarily introduce a latent categorical variable, it is not obvious how to fit these models in INLA. Here, we show how INLA can be combined with importance sampling to overcome this limitation. We also discuss how to account for a misclassified response variable using INLA directly without any additional sampling procedure. The proposed methods are illustrated through a number of simulations and applications to real-world data, and all examples are presented with detailed code in the supporting information."}, "https://arxiv.org/abs/2411.16429": {"title": "Multivariate Adjustments for Average Equivalence Testing", "link": "https://arxiv.org/abs/2411.16429", "description": "arXiv:2411.16429v1 Announce Type: new \nAbstract: Multivariate (average) equivalence testing is widely used to assess whether the means of two conditions of interest are `equivalent' for different outcomes simultaneously. The multivariate Two One-Sided Tests (TOST) procedure is typically used in this context by checking if, outcome by outcome, the marginal $100(1-2\\alpha$)\\% confidence intervals for the difference in means between the two conditions of interest lie within pre-defined lower and upper equivalence limits. This procedure, known to be conservative in the univariate case, leads to a rapid power loss when the number of outcomes increases, especially when one or more outcome variances are relatively large. In this work, we propose a finite-sample adjustment for this procedure, the multivariate $\\alpha$-TOST, that consists in a correction of $\\alpha$, the significance level, taking the (arbitrary) dependence between the outcomes of interest into account and making it uniformly more powerful than the conventional multivariate TOST. 
We present an iterative algorithm allowing to efficiently define $\\alpha^{\\star}$, the corrected significance level, a task that proves challenging in the multivariate setting due to the inter-relationship between $\\alpha^{\\star}$ and the sets of values belonging to the null hypothesis space and defining the test size. We study the operating characteristics of the multivariate $\\alpha$-TOST both theoretically and via an extensive simulation study considering cases relevant for real-world analyses -- i.e.,~relatively small sample sizes, unknown and heterogeneous variances, and different correlation structures -- and show the superior finite-sample properties of the multivariate $\\alpha$-TOST compared to its conventional counterpart. We finally re-visit a case study on ticlopidine hydrochloride and compare both methods when simultaneously assessing bioequivalence for multiple pharmacokinetic parameters."}, "https://arxiv.org/abs/2411.16662": {"title": "A Supervised Machine Learning Approach for Assessing Grant Peer Review Reports", "link": "https://arxiv.org/abs/2411.16662", "description": "arXiv:2411.16662v1 Announce Type: new \nAbstract: Peer review in grant evaluation informs funding decisions, but the contents of peer review reports are rarely analyzed. In this work, we develop a thoroughly tested pipeline to analyze the texts of grant peer review reports using methods from applied Natural Language Processing (NLP) and machine learning. We start by developing twelve categories reflecting content of grant peer review reports that are of interest to research funders. This is followed by multiple human annotators' iterative annotation of these categories in a novel text corpus of grant peer review reports submitted to the Swiss National Science Foundation. After validating the human annotation, we use the annotated texts to fine-tune pre-trained transformer models to classify these categories at scale, while conducting several robustness and validation checks. Our results show that many categories can be reliably identified by human annotators and machine learning approaches. However, the choice of text classification approach considerably influences the classification performance. We also find a high correspondence between out-of-sample classification performance and human annotators' perceived difficulty in identifying categories. Our results and publicly available fine-tuned transformer models will allow researchers and research funders and anybody interested in peer review to examine and report on the contents of these reports in a structured manner. Ultimately, we hope our approach can contribute to ensuring the quality and trustworthiness of grant peer review."}, "https://arxiv.org/abs/2411.15306": {"title": "Heavy-tailed Contamination is Easier than Adversarial Contamination", "link": "https://arxiv.org/abs/2411.15306", "description": "arXiv:2411.15306v1 Announce Type: cross \nAbstract: A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data allowing outliers to naturally occur as part of the data generating process. 
In the first setting, the goal is to develop estimators robust to the largest fraction of outliers while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount.\n Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high-probability."}, "https://arxiv.org/abs/2411.15590": {"title": "From Complexity to Parsimony: Integrating Latent Class Analysis to Uncover Multimodal Learning Patterns in Collaborative Learning", "link": "https://arxiv.org/abs/2411.15590", "description": "arXiv:2411.15590v1 Announce Type: cross \nAbstract: Multimodal Learning Analytics (MMLA) leverages advanced sensing technologies and artificial intelligence to capture complex learning processes, but integrating diverse data sources into cohesive insights remains challenging. This study introduces a novel methodology for integrating latent class analysis (LCA) within MMLA to map monomodal behavioural indicators into parsimonious multimodal ones. Using a high-fidelity healthcare simulation context, we collected positional, audio, and physiological data, deriving 17 monomodal indicators. LCA identified four distinct latent classes: Collaborative Communication, Embodied Collaboration, Distant Interaction, and Solitary Engagement, each capturing unique monomodal patterns. Epistemic network analysis compared these multimodal indicators with the original monomodal indicators and found that the multimodal approach was more parsimonious while offering higher explanatory power regarding students' task and collaboration performances. The findings highlight the potential of LCA in simplifying the analysis of complex multimodal data while capturing nuanced, cross-modality behaviours, offering actionable insights for educators and enhancing the design of collaborative learning interventions. This study proposes a pathway for advancing MMLA, making it more parsimonious and manageable, and aligning with the principles of learner-centred education."}, "https://arxiv.org/abs/2411.15624": {"title": "Trans-Glasso: A Transfer Learning Approach to Precision Matrix Estimation", "link": "https://arxiv.org/abs/2411.15624", "description": "arXiv:2411.15624v1 Announce Type: cross \nAbstract: Precision matrix estimation is essential in various fields, yet it is challenging when samples for the target study are limited. Transfer learning can enhance estimation accuracy by leveraging data from related source studies. 
We propose Trans-Glasso, a two-step transfer learning method for precision matrix estimation. First, we obtain initial estimators using a multi-task learning objective that captures shared and unique features across studies. Then, we refine these estimators through differential network estimation to adjust for structural differences between the target and source precision matrices. Under the assumption that most entries of the target precision matrix are shared with source matrices, we derive non-asymptotic error bounds and show that Trans-Glasso achieves minimax optimality under certain conditions. Extensive simulations demonstrate Trans-Glasso's superior performance compared to baseline methods, particularly in small-sample settings. We further validate Trans-Glasso in applications to gene networks across brain tissues and protein networks for various cancer subtypes, showcasing its effectiveness in biological contexts. Additionally, we derive the minimax optimal rate for differential network estimation, representing the first such guarantee in this area."}, "https://arxiv.org/abs/2411.15674": {"title": "Quantile deep learning models for multi-step ahead time series prediction", "link": "https://arxiv.org/abs/2411.15674", "description": "arXiv:2411.15674v1 Announce Type: cross \nAbstract: Uncertainty quantification is crucial in time series prediction, and quantile regression offers a valuable mechanism for uncertainty quantification which is useful for extreme value forecasting. Although deep learning models have been prominent in multi-step ahead prediction, the development and evaluation of quantile deep learning models have been limited. We present a novel quantile regression deep learning framework for multi-step time series prediction. In this way, we elevate the capabilities of deep learning models by incorporating quantile regression, thus providing a more nuanced understanding of predictive values. We provide an implementation of prominent deep learning models for multi-step ahead time series prediction and evaluate their performance under high volatility and extreme conditions. We include multivariate and univariate modelling strategies and provide a comparison with conventional deep learning models from the literature. Our models are tested on two cryptocurrencies: Bitcoin and Ethereum, using daily close-price data and selected benchmark time series datasets. The results show that integrating a quantile loss function with deep learning provides additional predictions for selected quantiles without a loss in the prediction accuracy when compared to the literature. Our quantile model has the ability to handle volatility more effectively and provides additional information for decision-making and uncertainty quantification through the use of quantiles when compared to conventional deep learning models."}, "https://arxiv.org/abs/2411.16244": {"title": "What events matter for exchange rate volatility?", "link": "https://arxiv.org/abs/2411.16244", "description": "arXiv:2411.16244v1 Announce Type: cross \nAbstract: This paper expands on stochastic volatility models by proposing a data-driven method to select the macroeconomic events most likely to impact volatility. The paper identifies and quantifies the effects of macroeconomic events across multiple countries on exchange rate volatility using high-frequency currency returns, while accounting for persistent stochastic volatility effects and seasonal components capturing time-of-day patterns. 
Given the hundreds of macroeconomic announcements and their lags, we rely on sparsity-based methods to select relevant events for the model. We contribute to the exchange rate literature in four ways: First, we identify the macroeconomic events that drive currency volatility, estimate their effects and connect them to macroeconomic fundamentals. Second, we find a link between intraday seasonality, trading volume, and the opening hours of major markets across the globe. We provide a simple labor-based explanation for this observed pattern. Third, we show that including macroeconomic events and seasonal components is crucial for forecasting exchange rate volatility. Fourth, our proposed model yields the lowest volatility and highest Sharpe ratio in portfolio allocations when compared to standard SV and GARCH models."}, "https://arxiv.org/abs/2411.16366": {"title": "Connections between sequential Bayesian inference and evolutionary dynamics", "link": "https://arxiv.org/abs/2411.16366", "description": "arXiv:2411.16366v1 Announce Type: cross \nAbstract: It has long been posited that there is a connection between the dynamical equations describing evolutionary processes in biology and sequential Bayesian learning methods. This manuscript describes new research in which this precise connection is rigorously established in the continuous time setting. Here we focus on a partial differential equation known as the Kushner-Stratonovich equation describing the evolution of the posterior density in time. Of particular importance is a piecewise smooth approximation of the observation path, from which the discrete-time filtering equations are derived; these are shown to converge to a Stratonovich interpretation of the Kushner-Stratonovich equation. This smooth formulation will then be used to draw precise connections between nonlinear stochastic filtering and replicator-mutator dynamics. Additionally, gradient flow formulations will be investigated as well as a form of replicator-mutator dynamics which is shown to be beneficial for the misspecified model filtering problem. It is hoped this work will spur further research into exchanges between sequential learning and evolutionary biology and inspire new algorithms in filtering and sampling."}, "https://arxiv.org/abs/1904.01047": {"title": "Dynamically Optimal Treatment Allocation", "link": "https://arxiv.org/abs/1904.01047", "description": "arXiv:1904.01047v5 Announce Type: replace \nAbstract: Dynamic decisions are pivotal to economic policy making. We show how existing evidence from randomized control trials can be utilized to guide personalized decisions in challenging dynamic environments with budget and capacity constraints. Recent advances in reinforcement learning now enable the solution of many complex, real-world problems for the first time. We allow for restricted classes of policy functions and prove that their regret decays at rate n^(-0.5), the same as in the static case. 
Applying our methods to job training, we find that by exploiting the problem's dynamic structure, we achieve significantly higher welfare compared to static approaches."}, "https://arxiv.org/abs/2208.09542": {"title": "Improving knockoffs with conditional calibration", "link": "https://arxiv.org/abs/2208.09542", "description": "arXiv:2208.09542v3 Announce Type: replace \nAbstract: The knockoff filter of Barber and Candes (arXiv:1404.5609) is a flexible framework for multiple testing in supervised learning models, based on introducing synthetic predictor variables to control the false discovery rate (FDR). Using the conditional calibration framework of Fithian and Lei (arXiv:2007.10438), we introduce the calibrated knockoff procedure, a method that uniformly improves the power of any fixed-X or model-X knockoff procedure. We show theoretically and empirically that the improvement is especially notable in two contexts where knockoff methods can be nearly powerless: when the rejection set is small, and when the structure of the design matrix in fixed-X knockoffs prevents us from constructing good knockoff variables. In these contexts, calibrated knockoffs even outperform competing FDR-controlling methods like the (dependence-adjusted) Benjamini-Hochberg procedure in many scenarios."}, "https://arxiv.org/abs/2301.10911": {"title": "Posterior risk of modular and semi-modular Bayesian inference", "link": "https://arxiv.org/abs/2301.10911", "description": "arXiv:2301.10911v3 Announce Type: replace \nAbstract: Modular Bayesian methods perform inference in models that are specified through a collection of coupled sub-models, known as modules. These modules often arise from modelling different data sources or from combining domain knowledge from different disciplines. ``Cutting feedback'' is a Bayesian inference method that ensures misspecification of one module does not affect inferences for parameters in other modules, and produces what is known as the cut posterior. However, choosing between the cut posterior and the standard Bayesian posterior is challenging. When misspecification is not severe, cutting feedback can greatly increase posterior uncertainty without a large reduction of estimation bias, leading to a bias-variance trade-off. This trade-off motivates semi-modular posteriors, which interpolate between standard and cut posteriors based on a tuning parameter. In this work, we provide the first precise formulation of the bias-variance trade-off that is present in cutting feedback, and we propose a new semi-modular posterior that takes advantage of it. Under general regularity conditions, we prove that this semi-modular posterior is more accurate than the cut posterior according to a notion of posterior risk. An important implication of this result is that point inferences made under the cut posterior are inadmissible. The new method is demonstrated in a number of examples."}, "https://arxiv.org/abs/2307.11201": {"title": "Choosing the Right Approach at the Right Time: A Comparative Analysis of Causal Effect Estimation using Confounder Adjustment and Instrumental Variables", "link": "https://arxiv.org/abs/2307.11201", "description": "arXiv:2307.11201v3 Announce Type: replace \nAbstract: In observational studies, potential unobserved confounding is a major barrier in isolating the average causal effect (ACE). In these scenarios, two main approaches are often used: confounder adjustment for causality (CAC) and instrumental variable analysis for causation (IVAC). 
Nevertheless, both are subject to untestable assumptions and, therefore, it may be unclear in which assumption violation scenarios one method is superior to the other in terms of mitigating inconsistency for the ACE. Although general guidelines exist, direct theoretical comparisons of the trade-offs between CAC and the IVAC assumptions are limited. Using ordinary least squares (OLS) for CAC and two-stage least squares (2SLS) for IVAC, we analytically compare the relative inconsistency for the ACE of each approach under a variety of assumption violation scenarios and discuss rules of thumb for practice. Additionally, a sensitivity framework is proposed to guide analysts in determining which approach may result in less inconsistency for estimating the ACE with a given dataset. We demonstrate our findings both through simulation and by revisiting Card's analysis of the effect of educational attainment on earnings, which has been the subject of previous discussion on instrument validity. The implications of our findings on causal inference practice are discussed, providing guidance for analysts to judge whether CAC or IVAC may be more appropriate for a given situation."}, "https://arxiv.org/abs/2308.13713": {"title": "Causally Sound Priors for Binary Experiments", "link": "https://arxiv.org/abs/2308.13713", "description": "arXiv:2308.13713v3 Announce Type: replace \nAbstract: We introduce the BREASE framework for the Bayesian analysis of randomized controlled trials with a binary treatment and a binary outcome. Approaching the problem from a causal inference perspective, we propose parameterizing the likelihood in terms of the baseline risk, efficacy, and adverse side effects of the treatment, along with a flexible, yet intuitive and tractable jointly independent beta prior distribution on these parameters, which we show to be a generalization of the Dirichlet prior for the joint distribution of potential outcomes. Our approach has a number of desirable characteristics when compared to current mainstream alternatives: (i) it naturally induces prior dependence between expected outcomes in the treatment and control groups; (ii) as the baseline risk, efficacy and risk of adverse side effects are quantities commonly present in the clinicians' vocabulary, the hyperparameters of the prior are directly interpretable, thus facilitating the elicitation of prior knowledge and sensitivity analysis; and (iii) we provide analytical formulae for the marginal likelihood, Bayes factor, and other posterior quantities, as well as an exact posterior sampling algorithm and an accurate and fast data-augmented Gibbs sampler in cases where traditional MCMC fails. Empirical examples demonstrate the utility of our methods for estimation, hypothesis testing, and sensitivity analysis of treatment effects."}, "https://arxiv.org/abs/2311.12597": {"title": "Optimal Functional Bilinear Regression with Two-way Functional Covariates via Reproducing Kernel Hilbert Space", "link": "https://arxiv.org/abs/2311.12597", "description": "arXiv:2311.12597v2 Announce Type: replace \nAbstract: Traditional functional linear regression usually takes a one-dimensional functional predictor as input and estimates the continuous coefficient function. Modern applications often generate two-dimensional covariates, which become matrices when observed at grid points. 
To avoid the inefficiency of the classical method involving estimation of a two-dimensional coefficient function, we propose a functional bilinear regression model, and introduce an innovative three-term penalty to impose a roughness penalty in the estimation. The proposed estimator exhibits a minimax optimal property for prediction under the framework of reproducing kernel Hilbert space. An iterative generalized cross-validation approach is developed to choose tuning parameters, which significantly improves the computational efficiency over the traditional cross-validation approach. The statistical and computational advantages of the proposed method over existing methods are further demonstrated via simulated experiments, the Canadian weather data, and biochemical long-range infrared light detection and ranging data."}, "https://arxiv.org/abs/2411.16831": {"title": "Measuring Statistical Evidence: A Short Report", "link": "https://arxiv.org/abs/2411.16831", "description": "This short text tried to establish a big picture of what statistics is about and how an ideal inference method should behave. Moreover, by examining shortcomings of some of the currently used methods for measuring evidence and utilizing some intuitive principles, we motivated the Relative Belief Ratio as the primary method of characterizing statistical evidence. A number of topics have been omitted in the interest of brevity, and the reader is strongly advised to refer to (Evans, 2015) as the primary source for further reading on the subject.", "authors": "Zamani"}, "https://arxiv.org/abs/2411.16902": {"title": "Bounding causal effects with an unknown mixture of informative and non-informative censoring", "link": "https://arxiv.org/abs/2411.16902", "description": "In experimental and observational data settings, researchers often have limited knowledge of the reasons for censored or missing outcomes. To address this uncertainty, we propose bounds on causal effects for censored outcomes, accommodating the scenario where censoring is an unobserved mixture of informative and non-informative components. Within this mixed censoring framework, we explore several assumptions to derive bounds on causal effects, including bounds expressed as a function of user-specified sensitivity parameters. We develop influence-function based estimators of these bounds to enable flexible, non-parametric, and machine learning based estimation, achieving root-n convergence rates and asymptotic normality under relatively mild conditions. We further consider the identification and estimation of bounds for other causal quantities that remain meaningful when informative censoring reflects a competing risk, such as death. We conduct extensive simulation studies and illustrate our methodology with a study on the causal effect of antipsychotic drugs on diabetes risk using a health insurance dataset.", "authors": "Rubinstein, Agniel, Han et al"}, "https://arxiv.org/abs/2411.16906": {"title": "A Binary IV Model for Persuasion: Profiling Persuasion Types among Compliers", "link": "https://arxiv.org/abs/2411.16906", "description": "In an empirical study of persuasion, researchers often use a binary instrument to encourage individuals to consume information and take some action. We show that, with a binary Imbens-Angrist instrumental variable model and the monotone treatment response assumption, it is possible to identify the joint distribution of potential outcomes among compliers. 
This is necessary to identify the percentage of mobilised voters and their statistical characteristics, defined by the moments of the joint distribution of treatment and covariates. Specifically, we develop a method that enables researchers to identify the statistical characteristics of persuasion types: always-voters, never-voters, and mobilised voters among compliers. These findings extend the kappa weighting results in Abadie (2003). We also provide a sharp test for the two sets of identification assumptions. The test boils down to testing whether there exists a nonnegative solution to a possibly under-determined system of linear equations with known coefficients. An application based on Green et al. (2003) is provided.", "authors": "Yu"}, "https://arxiv.org/abs/2411.16938": {"title": "Fragility Index for Time-to-Event Endpoints in Single-Arm Clinical Trials", "link": "https://arxiv.org/abs/2411.16938", "description": "The reliability of clinical trial outcomes is crucial, especially in guiding medical decisions. In this paper, we introduce the Fragility Index (FI) for time-to-event endpoints in single-arm clinical trials - a novel metric designed to quantify the robustness of study conclusions. The FI represents the smallest number of censored observations that, when reclassified as uncensored events, causes the posterior probability of the median survival time exceeding a specified threshold to fall below a predefined confidence level. While drug effectiveness is typically assessed by determining whether the posterior probability exceeds a specified confidence level, the FI offers a complementary measure, indicating how robust these conclusions are to potential shifts in the data. Using a Bayesian approach, we develop a practical framework for computing the FI based on the exponential survival model. To facilitate the application of our method, we developed an R package fi, which provides a tool to compute the Fragility Index. Through real-world case studies involving time-to-event data from single-arm clinical trials, we demonstrate the utility of this index. Our findings highlight how the FI can be a valuable tool for assessing the robustness of survival analyses in single-arm studies, aiding researchers and clinicians in making more informed decisions.", "authors": "Maity, Garg, Basu"}, "https://arxiv.org/abs/2411.16978": {"title": "Normal Approximation for U-Statistics with Cross-Sectional Dependence", "link": "https://arxiv.org/abs/2411.16978", "description": "We apply Stein's method to investigate the normal approximation for both non-degenerate and degenerate U-statistics with cross-sectionally dependent underlying processes in the Wasserstein metric. We show that the convergence rates depend on the mixing rates, the sparsity of the cross-sectional dependence, and the moments of the kernel functions. Conditions are derived for central limit theorems to hold as corollaries. We demonstrate one application of the theoretical results with a nonparametric specification test for data with cross-sectional dependence.", "authors": "Liu"}, "https://arxiv.org/abs/2411.17013": {"title": "Conditional Extremes with Graphical Models", "link": "https://arxiv.org/abs/2411.17013", "description": "Multivariate extreme value analysis quantifies the probability and magnitude of joint extreme events. 
River discharges from the upper Danube River basin provide a challenging dataset for such analysis because the data, which is measured on a spatial network, exhibits both asymptotic dependence and asymptotic independence. To account for both features, we extend the conditional multivariate extreme value model (CMEVM) with a new approach for the residual distribution. This allows sparse (graphical) dependence structures and fully parametric prediction. Our approach fills a current gap in statistical methodology for graphical extremes, where existing models require asymptotic independence. Further, the model can be used to learn the graphical dependence structure when it is unknown a priori. To support inference in high dimensions, we propose a stepwise inference procedure that is computationally efficient and loses no information or predictive power. We show our method is flexible and accurately captures the extremal dependence for the upper Danube River basin discharges.", "authors": "Farrell, Eastoe, Lee"}, "https://arxiv.org/abs/2411.17024": {"title": "Detecting Outliers in Multiple Sampling Results Without Thresholds", "link": "https://arxiv.org/abs/2411.17024", "description": "Bayesian statistics emphasizes the importance of prior distributions, yet finding an appropriate one is practically challenging. When multiple sample results are taken regarding the frequency of the same event, these samples may be influenced by different selection effects. In the absence of suitable prior distributions to correct for these selection effects, it is necessary to exclude outlier sample results to avoid compromising the final result. However, defining outliers based on different thresholds may change the result, which makes the result less persuasive. This work proposes a definition of outliers without the need to set thresholds.", "authors": "Shen"}, "https://arxiv.org/abs/2411.17033": {"title": "Quantile Graph Discovery through QuACC: Quantile Association via Conditional Concordance", "link": "https://arxiv.org/abs/2411.17033", "description": "Graphical structure learning is an effective way to assess and visualize cross-biomarker dependencies in biomedical settings. Standard approaches to estimating graphs rely on conditional independence tests that may not be sensitive to associations that manifest at the tails of joint distributions, i.e., they may miss connections among variables that exhibit associations mainly at lower or upper quantiles. In this work, we propose a novel measure of quantile-specific conditional association called QuACC: Quantile Association via Conditional Concordance. For a pair of variables and a conditioning set, QuACC quantifies agreement between the residuals from two quantile regression models, which may be linear or more complex, e.g., quantile forests. Using this measure as the basis for a test of null (quantile) association, we introduce a new class of quantile-specific graphical models. Through simulation, we show our method is powerful for detecting dependencies that manifest at the tails of distributions. 
We apply our method to biobank data from All of Us and identify quantile-specific patterns of conditional association in a multivariate setting.", "authors": "Khan, Malinsky, Picard et al"}, "https://arxiv.org/abs/2411.17035": {"title": "Achieving Privacy Utility Balance for Multivariate Time Series Data", "link": "https://arxiv.org/abs/2411.17035", "description": "Utility-preserving data privatization is of utmost importance for data-producing agencies. The popular noise-addition privacy mechanism distorts autocorrelation patterns in time series data, thereby marring utility; in response, McElroy et al. (2023) introduced all-pass filtering (FLIP) as a utility-preserving time series data privatization method. Adapting this concept to multivariate data is more complex, and in this paper we propose a multivariate all-pass (MAP) filtering method, employing an optimization algorithm to achieve the best balance between data utility and privacy protection. To test the effectiveness of our approach, we apply MAP filtering to both simulated and real data, sourced from the U.S. Census Bureau's Quarterly Workforce Indicator (QWI) dataset.", "authors": "Hore, McElroy, Roy"}, "https://arxiv.org/abs/2411.17393": {"title": "Patient recruitment forecasting in clinical trials using time-dependent Poisson-gamma model and homogeneity testing criteria", "link": "https://arxiv.org/abs/2411.17393", "description": "Clinical trials in the modern era are characterized by their complexity and high costs and usually involve recruiting hundreds or thousands of patients across multiple clinical centres in many countries, as typically a rather large sample size is required in order to prove the efficiency of a particular drug. As the imperative to recruit vast numbers of patients across multiple clinical centres has become a major challenge, accurate forecasting of patient recruitment is one of the key factors for the operational success of clinical trials. A classic Poisson-gamma (PG) recruitment model assumes time-homogeneous recruitment rates. However, there can be potential time-trends in the recruitment driven by various factors, e.g. seasonal changes, exhaustion of patients on particular treatments in some centres, etc. Recently, a few authors considered some extensions of the PG model to time-dependent rates under some particular assumptions. In this paper, a natural generalization of the original PG model to a PG model with non-homogeneous time-dependent rates is introduced. A new analytic methodology is also proposed for modelling/forecasting patient recruitment using a Poisson-gamma approximation of recruitment processes in different countries and globally. The properties of some tests on homogeneity of the rates (a non-parametric one using a Poisson model and two parametric tests using the Poisson and PG models) are investigated. Techniques for modelling and simulating recruitment using the time-dependent model are discussed. For re-projection of the remaining recruitment, it is proposed to use a moving window and re-estimate parameters at every interim time. 
The results are supported by simulation of some artificial data sets.", "authors": "Anisimov, Oliver"}, "https://arxiv.org/abs/2411.17394": {"title": "Variable selection via fused sparse-group lasso penalized multi-state models incorporating molecular data", "link": "https://arxiv.org/abs/2411.17394", "description": "In multi-state models based on high-dimensional data, effective modeling strategies are required to determine an optimal, ideally parsimonious model. In particular, linking covariate effects across transitions is needed to conduct joint variable selection. A useful technique to reduce model complexity is to address homogeneous covariate effects for distinct transitions. We integrate this approach to data-driven variable selection by extended regularization methods within multi-state model building. We propose the fused sparse-group lasso (FSGL) penalized Cox-type regression in the framework of multi-state models combining the penalization concepts of pairwise differences of covariate effects along with transition grouping. For optimization, we adapt the alternating direction method of multipliers (ADMM) algorithm to transition-specific hazards regression in the multi-state setting. In a simulation study and application to acute myeloid leukemia (AML) data, we evaluate the algorithm's ability to select a sparse model incorporating relevant transition-specific effects and similar cross-transition effects. We investigate settings in which the combined penalty is beneficial compared to global lasso regularization.", "authors": "Miah, Goeman, Putter et al"}, "https://arxiv.org/abs/2411.17402": {"title": "Receiver operating characteristic curve analysis with non-ignorable missing disease status", "link": "https://arxiv.org/abs/2411.17402", "description": "This article considers the receiver operating characteristic (ROC) curve analysis for medical data with non-ignorable missingness in the disease status. In the framework of the logistic regression models for both the disease status and the verification status, we first establish the identifiability of model parameters, and then propose a likelihood method to estimate the model parameters, the ROC curve, and the area under the ROC curve (AUC) for the biomarker. The asymptotic distributions of these estimators are established. Via extensive simulation studies, we compare our method with competing methods in the point estimation and assess the accuracy of confidence interval estimation under various scenarios. To illustrate the application of the proposed method in practical data, we apply our method to the National Alzheimer's Coordinating Center data set.", "authors": "Hu, Yu, Li"}, "https://arxiv.org/abs/2411.17464": {"title": "A new test for assessing the covariate effect in ROC curves", "link": "https://arxiv.org/abs/2411.17464", "description": "The ROC curve is a statistical tool that analyses the accuracy of a diagnostic test in which a variable is used to decide whether an individual is healthy or not. Along with that diagnostic variable it is usual to have information of some other covariates. In some situations it is advisable to incorporate that information into the study, as the performance of the ROC curves can be affected by them. Using the covariate-adjusted, the covariate-specific or the pooled ROC curves we discuss how to decide if we can exclude the covariates from our study or not, and the implications this may have in further analyses of the ROC curve. 
A new test for comparing the covariate-adjusted and the pooled ROC curve is proposed, and the problem is illustrated by analysing a real database.", "authors": "Fanjul-Hevia, Pardo-Fern\\'andez, Gonz\\'alez-Manteiga"}, "https://arxiv.org/abs/2411.17533": {"title": "Simplifying Causal Mediation Analysis for Time-to-Event Outcomes using Pseudo-Values", "link": "https://arxiv.org/abs/2411.17533", "description": "Mediation analysis for survival outcomes is challenging. Most existing methods quantify the treatment effect using the hazard ratio (HR) and attempt to decompose the HR into the direct effect of treatment plus an indirect, or mediated, effect. However, the HR is not expressible as an expectation, which complicates this decomposition, both in terms of estimation and interpretation. Here, we present an alternative approach which leverages pseudo-values to simplify estimation and inference. Pseudo-values take censoring into account during their construction, and once derived, can be modeled in the same way as any continuous outcome. Thus, pseudo-values enable mediation analysis for a survival outcome to fit seamlessly into standard mediation software (e.g. CMAverse in R). Pseudo-values are easy to calculate via a leave-one-observation-out procedure (i.e. jackknifing) and the calculation can be accelerated when the influence function of the estimator is known. Mediation analysis for causal effects defined by survival probabilities, restricted mean survival time, and cumulative incidence functions - in the presence of competing risks - can all be performed within this framework. Extensive simulation studies demonstrate that the method is unbiased across 324 scenarios/estimands and controls the type-I error at the nominal level under the null of no mediation. We illustrate the approach using data from the PARADIGMS clinical trial for the treatment of pediatric multiple sclerosis using fingolimod. In particular, we evaluate whether an imaging biomarker lies on the causal path between treatment and time-to-relapse, which aids in justifying this biomarker as a surrogate outcome. Our approach greatly simplifies mediation analysis for survival data and provides a decomposition of the total effect that is both intuitive and interpretable.", "authors": "Ocampo, Giudice, H\\\"aring et al"}, "https://arxiv.org/abs/2411.17618": {"title": "Valid Bayesian Inference based on Variance Weighted Projection for High-Dimensional Logistic Regression with Binary Covariates", "link": "https://arxiv.org/abs/2411.17618", "description": "We address the challenge of conducting inference for a categorical treatment effect related to a binary outcome variable while taking into account high-dimensional baseline covariates. The conventional technique used to establish orthogonality for the treatment effect from nuisance variables in continuous cases is inapplicable in the context of binary treatment. To overcome this obstacle, an orthogonal score tailored specifically to this scenario is formulated which is based on a variance-weighted projection. Additionally, a novel Bayesian framework is proposed to facilitate valid inference for the desired low-dimensional parameter within the complex framework of high-dimensional logistic regression. We provide uniform convergence results, affirming the validity of credible intervals derived from the posterior distribution. 
The effectiveness of the proposed method is demonstrated through comprehensive simulation studies and real data analysis.", "authors": "Ojha, Narisetty"}, "https://arxiv.org/abs/2411.17639": {"title": "Intrepid MCMC: Metropolis-Hastings with Exploration", "link": "https://arxiv.org/abs/2411.17639", "description": "In engineering examples, one often encounters the need to sample from unnormalized distributions with complex shapes that may also be implicitly defined through a physical or numerical simulation model, making it computationally expensive to evaluate the associated density function. For such cases, MCMC has proven to be an invaluable tool. Random-walk Metropolis Methods (also known as Metropolis-Hastings (MH)), in particular, are highly popular for their simplicity, flexibility, and ease of implementation. However, most MH algorithms suffer from significant limitations when attempting to sample from distributions with multiple modes (particularly disconnected ones). In this paper, we present Intrepid MCMC - a novel MH scheme that utilizes a simple coordinate transformation to significantly improve the mode-finding ability and convergence rate to the target distribution of random-walk Markov chains while retaining most of the simplicity of the vanilla MH paradigm. Through multiple examples, we showcase the improvement in the performance of Intrepid MCMC over vanilla MH for a wide variety of target distribution shapes. We also provide an analysis of the mixing behavior of the Intrepid Markov chain, as well as the efficiency of our algorithm for increasing dimensions. A thorough discussion is presented on the practical implementation of the Intrepid MCMC algorithm. Finally, its utility is highlighted through a Bayesian parameter inference problem for a two-degree-of-freedom oscillator under free vibration.", "authors": "Chakroborty, Shields"}, "https://arxiv.org/abs/2411.17054": {"title": "Optimal Estimation of Shared Singular Subspaces across Multiple Noisy Matrices", "link": "https://arxiv.org/abs/2411.17054", "description": "Estimating singular subspaces from noisy matrices is a fundamental problem with wide-ranging applications across various fields. Driven by the challenges of data integration and multi-view analysis, this study focuses on estimating shared (left) singular subspaces across multiple matrices within a low-rank matrix denoising framework. A common approach for this task is to perform singular value decomposition on the stacked matrix (Stack-SVD), which is formed by concatenating all the individual matrices. We establish that Stack-SVD achieves minimax rate-optimality when the true (left) singular subspaces of the signal matrices are identical. Our analysis reveals some phase transition phenomena in the estimation problem as a function of the underlying signal-to-noise ratio, highlighting how the interplay among multiple matrices collectively determines the fundamental limits of estimation. We then tackle the more complex scenario where the true singular subspaces are only partially shared across matrices. For various cases of partial sharing, we rigorously characterize the conditions under which Stack-SVD remains effective, achieves minimax optimality, or fails to deliver consistent estimates, offering theoretical insights into its practical applicability. To overcome Stack-SVD's limitations in partial sharing scenarios, we propose novel estimators and an efficient algorithm to identify shared and unshared singular vectors, and prove their minimax rate-optimality. 
Extensive simulation studies and real-world data applications demonstrate the numerous advantages of our proposed approaches.", "authors": "Ma, Ma"}, "https://arxiv.org/abs/2411.17087": {"title": "On the Symmetry of Limiting Distributions of M-estimators", "link": "https://arxiv.org/abs/2411.17087", "description": "Many functionals of interest in statistics and machine learning can be written as minimizers of expected loss functions. Such functionals are called $M$-estimands, and can be estimated by $M$-estimators -- minimizers of empirical average losses. Traditionally, statistical inference (e.g., hypothesis tests and confidence sets) for $M$-estimands is obtained by proving asymptotic normality of $M$-estimators centered at the target. However, asymptotic normality is only one of several possible limiting distributions and (asymptotically) valid inference becomes significantly difficult with non-normal limits. In this paper, we provide conditions for the symmetry of three general classes of limiting distributions, enabling inference using HulC (Kuchibhotla et al. (2024)).", "authors": "Bhowmick, Kuchibhotla"}, "https://arxiv.org/abs/2411.17136": {"title": "Autoencoder Enhanced Realised GARCH on Volatility Forecasting", "link": "https://arxiv.org/abs/2411.17136", "description": "Realised volatility has become increasingly prominent in volatility forecasting due to its ability to capture intraday price fluctuations. With a growing variety of realised volatility estimators, each with unique advantages and limitations, selecting an optimal estimator may introduce challenges. In this thesis, aiming to synthesise the impact of various realised volatility measures on volatility forecasting, we propose an extension of the Realised GARCH model that incorporates an autoencoder-generated synthetic realised measure, combining the information from multiple realised measures in a nonlinear manner. Our proposed model extends existing linear methods, such as Principal Component Analysis and Independent Component Analysis, to reduce the dimensionality of realised measures. The empirical evaluation, conducted across four major stock markets from January 2000 to June 2022 and including the period of COVID-19, demonstrates both the feasibility of applying an autoencoder to synthesise volatility measures and the superior effectiveness of the proposed model in one-step-ahead rolling volatility forecasting. The model exhibits enhanced flexibility in parameter estimations across each rolling window, outperforming traditional linear approaches. These findings indicate that nonlinear dimension reduction offers further adaptability and flexibility in improving the synthetic realised measure, with promising implications for future volatility forecasting applications.", "authors": "Zhao, Wang, Gerlach et al"}, "https://arxiv.org/abs/2411.17224": {"title": "Double robust estimation of functional outcomes with data missing at random", "link": "https://arxiv.org/abs/2411.17224", "description": "We present and study semi-parametric estimators for the mean of functional outcomes in situations where some of these outcomes are missing and covariate information is available on all units. Assuming that the missingness mechanism depends only on the covariates (missing at random assumption), we present two estimators for the functional mean parameter, using working models for the functional outcome given the covariates, and the probability of missingness given the covariates. 
We contribute by establishing that both these estimators have Gaussian processes as limiting distributions and explicitly give their covariance functions. One of the estimators is double robust in the sense that the limiting distribution holds whenever at least one of the nuisance models is correctly specified. These results allow us to present simultaneous confidence bands for the mean function with asymptotically guaranteed coverage. A Monte Carlo study shows the finite sample properties of the proposed functional estimators and their associated simultaneous inference. The use of the method is illustrated in an application where the mean of counterfactual outcomes is targeted.", "authors": "Liu, Ecker, Schelin et al"}, "https://arxiv.org/abs/2411.17395": {"title": "Asymptotics for estimating a diverging number of parameters -- with and without sparsity", "link": "https://arxiv.org/abs/2411.17395", "description": "We consider high-dimensional estimation problems where the number of parameters diverges with the sample size. General conditions are established for consistency, uniqueness, and asymptotic normality in both unpenalized and penalized estimation settings. The conditions are weak and accommodate a broad class of estimation problems, including ones with non-convex and group structured penalties. The wide applicability of the results is illustrated through diverse examples, including generalized linear models, multi-sample inference, and stepwise estimation procedures.", "authors": "Gauss, Nagler"}, "https://arxiv.org/abs/2110.07361": {"title": "Distribution-Free Bayesian multivariate predictive inference", "link": "https://arxiv.org/abs/2110.07361", "description": "We introduce a comprehensive Bayesian multivariate predictive inference framework. The basis for our framework is a hierarchical Bayesian model, that is a mixture of finite Polya trees corresponding to multiple dyadic partitions of the unit cube. Given a sample of observations from an unknown multivariate distribution, the posterior predictive distribution is used to model and generate future observations from the unknown distribution. We illustrate the implementation of our methodology and study its performance on simulated examples. We introduce an algorithm for constructing conformal prediction sets, that provide finite sample probability assurances for future observations, with our Bayesian model.", "authors": "Yekutieli"}, "https://arxiv.org/abs/2302.03237": {"title": "Examination of Nonlinear Longitudinal Processes with Latent Variables, Latent Processes, Latent Changes, and Latent Classes in the Structural Equation Modeling Framework: The R package nlpsem", "link": "https://arxiv.org/abs/2302.03237", "description": "We introduce the R package nlpsem, a comprehensive toolkit for analyzing longitudinal processes within the structural equation modeling (SEM) framework, incorporating individual measurement occasions. This package emphasizes nonlinear longitudinal models, especially intrinsic ones, across four key scenarios: (1) univariate longitudinal processes with latent variables, optionally including covariates such as time-invariant covariates (TICs) and time-varying covariates (TVCs); (2) multivariate longitudinal analyses to explore correlations or unidirectional relationships between longitudinal variables; (3) multiple-group frameworks for comparing manifest classes in scenarios (1) and (2); and (4) mixture models for scenarios (1) and (2), accommodating latent class heterogeneity. 
Built on the OpenMx R package, nlpsem supports flexible model designs and uses the full information maximum likelihood method for parameter estimation. A notable feature is its algorithm for determining initial values directly from raw data, enhancing computational efficiency and convergence. Furthermore, nlpsem provides tools for goodness-of-fit tests, cluster analyses, visualization, derivation of p-values and three types of confidence intervals, as well as model selection for nested models using likelihood ratio tests and for non-nested models based on criteria such as Akaike Information Criterion and Bayesian Information Criterion. This article serves as a companion document to the nlpsem R package, providing a comprehensive guide to its modeling capabilities, estimation methods, implementation features, and application examples using synthetic intelligence growth data.", "authors": "Liu"}, "https://arxiv.org/abs/2209.14295": {"title": "Label Noise Robustness of Conformal Prediction", "link": "https://arxiv.org/abs/2209.14295", "description": "We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. We further extend our theory and formulate the requirements for correctly controlling a general loss function, such as the false negative proportion, with noisy labels. Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity.", "authors": "Einbinder, Feldman, Bates et al"}, "https://arxiv.org/abs/2411.17743": {"title": "Ranking probabilistic forecasting models with different loss functions", "link": "https://arxiv.org/abs/2411.17743", "description": "In this study, we introduced various statistical performance metrics, based on the pinball loss and the empirical coverage, for the ranking of probabilistic forecasting models. We tested the ability of the proposed metrics to determine the top performing forecasting model and investigated which metric corresponds to the highest average per-trade profit in the out-of-sample period. Our findings show that for the considered trading strategy, ranking the forecasting models according to the coverage of quantile forecasts used in the trading hours exhibits a superior economic performance.", "authors": "Serafin, Uniejewski"}, "https://arxiv.org/abs/2411.17841": {"title": "Defective regression models for cure rate modeling in Marshall-Olkin family", "link": "https://arxiv.org/abs/2411.17841", "description": "Regression models have a substantial impact on the interpretation of treatments, genetic characteristics and other covariates in survival analysis. In many datasets, the description of censoring and the survival curve reveals the presence of a cure fraction in the data, which leads to alternative modelling. 
The most common approach to introducing covariates under parametric estimation is through cure rate models and their variations, although the use of defective distributions has introduced a more parsimonious and integrated approach. A defective distribution is given by a density function whose integral is not one after changing the domain of one of the parameters. In this work, we introduce two new defective regression models for long-term survival data in the Marshall-Olkin family: the Marshall-Olkin Gompertz and the Marshall-Olkin inverse Gaussian. The estimation process is conducted using maximum likelihood estimation and Bayesian inference. We evaluate the asymptotic properties of the classical approach in Monte Carlo studies as well as the behavior of Bayes estimates with vague information. The application of both models under classical and Bayesian inference is provided in a study of time until death from colon cancer with a dichotomous covariate. The Marshall-Olkin Gompertz regression presented the best fit, and we present global diagnostics and a residual analysis for this proposal.", "authors": "Neto, Tomazella"}, "https://arxiv.org/abs/2411.17859": {"title": "Sparse twoblock dimension reduction for simultaneous compression and variable selection in two blocks of variables", "link": "https://arxiv.org/abs/2411.17859", "description": "A method is introduced to perform simultaneous sparse dimension reduction on two blocks of variables. Beyond dimension reduction, it also yields an estimator for multivariate regression with the capability to intrinsically deselect uninformative variables in both independent and dependent blocks. An algorithm is provided that leads to a straightforward implementation of the method. The benefits of simultaneous sparse dimension reduction are shown to carry through to enhanced capability to predict a set of multivariate dependent variables jointly. Both in a simulation study and in two chemometric applications, the new method outperforms its dense counterpart, as well as multivariate partial least squares.", "authors": "Serneels"}, "https://arxiv.org/abs/2411.17905": {"title": "Repeated sampling of different individuals but the same clusters to improve precision of difference-in-differences estimators: the DISC design", "link": "https://arxiv.org/abs/2411.17905", "description": "We describe the DISC (Different Individuals, Same Clusters) design, a sampling scheme that can improve the precision of difference-in-differences (DID) estimators in settings involving repeated sampling of a population at multiple time points. Although cohort designs typically lead to more efficient DID estimators relative to repeated cross-sectional (RCS) designs, they are often impractical due to high rates of loss-to-follow-up, individuals leaving the risk set, or other reasons. The DISC design represents a hybrid between a cohort sampling design and an RCS sampling design, an alternative strategy in which the researcher takes a single sample of clusters, but then takes different cross-sectional samples of individuals within each cluster at two or more time points. We show that the DISC design can yield DID estimators with much higher precision relative to an RCS design, particularly if random cluster effects are present in the data-generating mechanism. 
For example, for a design in which 40 clusters and 25 individuals per cluster are sampled (for a total sample size of n=1,000), the variance of a commonly-used DID treatment effect estimator is 2.3 times higher in the RCS design for an intraclass correlation coefficient (ICC) of 0.05, 3.8 times higher for an ICC of 0.1, and 7.3 times higher for an ICC of 0.2.", "authors": "Downey, Kenny"}, "https://arxiv.org/abs/2411.17910": {"title": "Bayesian Variable Selection for High-Dimensional Mediation Analysis: Application to Metabolomics Data in Epidemiological Studies", "link": "https://arxiv.org/abs/2411.17910", "description": "In epidemiological research, causal models incorporating potential mediators along a pathway are crucial for understanding how exposures influence health outcomes. This work is motivated by integrated epidemiological and blood biomarker studies, investigating the relationship between long-term adherence to a Mediterranean diet and cardiometabolic health, with plasma metabolomes as potential mediators. Analyzing causal mediation in such high-dimensional omics data presents substantial challenges, including complex dependencies among mediators and the need for advanced regularization or Bayesian techniques to ensure stable and interpretable estimation and selection of indirect effects. To this end, we propose a novel Bayesian framework for identifying active pathways and estimating indirect effects in the presence of high-dimensional multivariate mediators. Our approach adopts a multivariate stochastic search variable selection method, tailored for such complex mediation scenarios. Central to our method is the introduction of a set of priors for the selection: a Markov random field prior and sequential subsetting Bernoulli priors. The first prior's Markov property leverages the inherent correlations among mediators, thereby increasing power to detect mediated effects. The sequential subsetting aspect of the second prior encourages the simultaneous selection of relevant mediators and their corresponding indirect effects from the two model parts, providing a more coherent and efficient variable selection framework, specific to mediation analysis. Comprehensive simulation studies demonstrate that the proposed method provides superior power in detecting active mediating pathways. We further illustrate the practical utility of the method through its application to metabolome data from two cohort studies, highlighting its effectiveness in real data settings.", "authors": "Bae, Kim, Wang et al"}, "https://arxiv.org/abs/2411.17983": {"title": "Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization", "link": "https://arxiv.org/abs/2411.17983", "description": "Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select ``interesting'' instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite sample. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. 
However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting.\n This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.", "authors": "Bai, Jin"}, "https://arxiv.org/abs/2411.18012": {"title": "Bayesian Inference of Spatially Varying Correlations via the Thresholded Correlation Gaussian Process", "link": "https://arxiv.org/abs/2411.18012", "description": "A central question in multimodal neuroimaging analysis is to understand the association between two imaging modalities and to identify brain regions where such an association is statistically significant. In this article, we propose a Bayesian nonparametric spatially varying correlation model to make inference of such regions. We build our model based on the thresholded correlation Gaussian process (TCGP). It ensures piecewise smoothness, sparsity, and jump discontinuity of spatially varying correlations, and is well applicable even when the number of subjects is limited or the signal-to-noise ratio is low. We study the identifiability of our model, establish the large support property, and derive the posterior consistency and selection consistency. We also develop a highly efficient Gibbs sampler and its variant to compute the posterior distribution. We illustrate the method with both simulations and an analysis of functional magnetic resonance imaging data from the Human Connectome Project.", "authors": "Li, Li, Kang"}, "https://arxiv.org/abs/2411.18334": {"title": "Multi-response linear regression estimation based on low-rank pre-smoothing", "link": "https://arxiv.org/abs/2411.18334", "description": "Pre-smoothing is a technique aimed at increasing the signal-to-noise ratio in data to improve subsequent estimation and model selection in regression problems. However, pre-smoothing has thus far been limited to the univariate response regression setting. Motivated by the widespread interest in multi-response regression analysis in many scientific applications, this article proposes a technique for data pre-smoothing in this setting based on low-rank approximation. We establish theoretical results on the performance of the proposed methodology, and quantify its benefit empirically in a number of simulated experiments. 
We also demonstrate our proposed low-rank pre-smoothing technique on real data arising from the environmental and biological sciences.", "authors": "Tian, Gibberd, Nunes et al"}, "https://arxiv.org/abs/2411.18351": {"title": "On an EM-based closed-form solution for 2 parameter IRT models", "link": "https://arxiv.org/abs/2411.18351", "description": "It is a well-known issue that in Item Response Theory models there is no closed-form for the maximum likelihood estimators of the item parameters. Parameter estimation is therefore typically achieved by means of numerical methods like gradient search. The present work has a two-fold aim: On the one hand, we revise the fundamental notions associated to the item parameter estimation in 2 parameter Item Response Theory models from the perspective of the complete-data likelihood. On the other hand, we argue that, within an Expectation-Maximization approach, a closed-form for discrimination and difficulty parameters can actually be obtained that simply corresponds to the Ordinary Least Square solution.", "authors": "Center, Tuebingen), Center et al"}, "https://arxiv.org/abs/2411.18398": {"title": "Derivative Estimation of Multivariate Functional Data", "link": "https://arxiv.org/abs/2411.18398", "description": "Existing approaches for derivative estimation are restricted to univariate functional data. We propose two methods to estimate the principal components and scores for the derivatives of multivariate functional data. As a result, the derivatives can be reconstructed by a multivariate Karhunen-Lo\\`eve expansion. The first approach is an extended version of multivariate functional principal component analysis (MFPCA) which incorporates the derivatives, referred to as derivative MFPCA (DMFPCA). The second approach is based on the derivation of multivariate Karhunen-Lo\\`eve (DMKL) expansion. We compare the performance of the two proposed methods with a direct approach in simulations. The simulation results indicate that DMFPCA outperforms DMKL and the direct approach, particularly for densely observed data. We apply DMFPCA and DMKL methods to coronary angiogram data to recover derivatives of diameter and quantitative flow ratio. We obtain the multivariate functional principal components and scores of the derivatives, which can be used to classify patterns of coronary artery disease.", "authors": "Zhu, Golovkine, Bargary et al"}, "https://arxiv.org/abs/2411.18416": {"title": "Probabilistic size-and-shape functional mixed models", "link": "https://arxiv.org/abs/2411.18416", "description": "The reliable recovery and uncertainty quantification of a fixed effect function $\\mu$ in a functional mixed model, for modelling population- and object-level variability in noisily observed functional data, is a notoriously challenging task: variations along the $x$ and $y$ axes are confounded with additive measurement error, and cannot in general be disentangled. The question then as to what properties of $\\mu$ may be reliably recovered becomes important. We demonstrate that it is possible to recover the size-and-shape of a square-integrable $\\mu$ under a Bayesian functional mixed model. The size-and-shape of $\\mu$ is a geometric property invariant to a family of space-time unitary transformations, viewed as rotations of the Hilbert space, that jointly transform the $x$ and $y$ axes. 
A random object-level unitary transformation then captures size-and-shape \\emph{preserving} deviations of $\\mu$ from an individual function, while a random linear term and measurement error capture size-and-shape \\emph{altering} deviations. The model is regularized by appropriate priors on the unitary transformations, posterior summaries of which may then be suitably interpreted as optimal data-driven rotations of a fixed orthonormal basis for the Hilbert space. Our numerical experiments demonstrate the utility of the proposed model and its superiority over the current state of the art.", "authors": "Wang, Bharath, Chkrebtii et al"}, "https://arxiv.org/abs/2411.18433": {"title": "A Latent Space Approach to Inferring Distance-Dependent Reciprocity in Directed Networks", "link": "https://arxiv.org/abs/2411.18433", "description": "Reciprocity, or the stochastic tendency for actors to form mutual relationships, is an essential characteristic of directed network data. Existing latent space approaches to modeling directed networks are severely limited by the assumption that reciprocity is homogeneous across the network. In this work, we introduce a new latent space model for directed networks that can model heterogeneous reciprocity patterns that arise from the actors' latent distances. Furthermore, existing edge-independent latent space models are nested within the proposed model class, which allows for meaningful model comparisons. We introduce a Bayesian inference procedure to infer the model parameters using Hamiltonian Monte Carlo. Lastly, we use the proposed method to infer different reciprocity patterns in an advice network among lawyers, an information-sharing network between employees at a manufacturing company, and a friendship network between high school students.", "authors": "Loyal, Wu, Stewart"}, "https://arxiv.org/abs/2411.18481": {"title": "Bhirkuti's Test of Bias Acceptance: Examining in Psychometric Simulations", "link": "https://arxiv.org/abs/2411.18481", "description": "This study introduces Bhirkuti's Test of Bias Acceptance, a systematic graphical framework for evaluating bias and determining its acceptability under varying experimental conditions. Absolute Relative Bias (ARB), while useful for understanding bias, is sensitive to outliers and population parameter magnitudes, often overstating bias for small values and understating it for larger ones. Similarly, Relative Efficiency (RE) can be influenced by variance differences and outliers, occasionally producing counterintuitive values exceeding 100%, which complicates interpretation. By addressing the limitations of traditional metrics such as Absolute Relative Bias (ARB) and Relative Efficiency (RE), the proposed graphical methodology framework leverages ridgeline plots and standardized estimates to provide a comprehensive visualization of parameter estimate distributions. Ridgeline plots constructed this way offer a robust alternative by visualizing full distributions, highlighting variability, trends, outliers, and descriptive features, and facilitating more informed decision-making. This study employs multivariate Latent Growth Models (LGM) and Monte Carlo simulations to examine the performance of growth curve modeling under planned missing data designs, focusing on parameter estimate recovery and efficiency. 
By combining innovative visualization techniques with rigorous simulation methods, Bhirkuti's Test of Bias Acceptance provides a versatile and interpretable toolset for advancing quantitative research in bias evaluation and efficiency assessment.", "authors": "Bhusal, Little"}, "https://arxiv.org/abs/2411.18510": {"title": "A subgroup-aware scoring approach to the study of effect modification in observational studies", "link": "https://arxiv.org/abs/2411.18510", "description": "Effect modification means the size of a treatment effect varies with an observed covariate. Generally speaking, a larger treatment effect with more stable error terms is less sensitive to bias. Thus, we might be able to conclude that a study is less sensitive to unmeasured bias by using these subgroups experiencing larger treatment effects. Lee et al. (2018) proposed the submax method that leverages the joint distribution of test statistics from subgroups to draw a firmer conclusion if effect modification occurs. However, one version of the submax method uses M-statistics as the test statistics and is implemented in the R package submax (Rosenbaum, 2017). The scaling factor in the M-statistics is computed using all observations combined across subgroups. We show that this combining can confuse effect modification with outliers. We propose a novel group M-statistic that scores the matched pairs in each subgroup to tackle the issue. We examine our novel scoring strategy in extensive settings to show the superior performance. The proposed method is applied to an observational study of the effect of a malaria prevention treatment in West Africa.", "authors": "Fan, Small"}, "https://arxiv.org/abs/2411.18549": {"title": "Finite population inference for skewness measures", "link": "https://arxiv.org/abs/2411.18549", "description": "In this article we consider Bowley's skewness measure and the Groeneveld-Meeden $b_{3}$ index in the context of finite population sampling. We employ the functional delta method to obtain asymptotic variance formulae for plug-in estimators and propose corresponding variance estimators. We then consider plug-in estimators based on the H\\'{a}jek cdf-estimator and on a Deville-S\\\"arndal type calibration estimator and test the performance of normal confidence intervals.", "authors": "Pasquazzi"}, "https://arxiv.org/abs/2411.16552": {"title": "When Is Heterogeneity Actionable for Personalization?", "link": "https://arxiv.org/abs/2411.16552", "description": "Targeting and personalization policies can be used to improve outcomes beyond the uniform policy that assigns the best performing treatment in an A/B test to everyone. Personalization relies on the presence of heterogeneity of treatment effects, yet, as we show in this paper, heterogeneity alone is not sufficient for personalization to be successful. We develop a statistical model to quantify \"actionable heterogeneity,\" or the conditions when personalization is likely to outperform the best uniform policy. We show that actionable heterogeneity can be visualized as crossover interactions in outcomes across treatments and depends on three population-level parameters: within-treatment heterogeneity, cross-treatment correlation, and the variation in average responses. Our model can be used to predict the expected gain from personalization prior to running an experiment and also allows for sensitivity analysis, providing guidance on how changing treatments can affect the personalization gain. 
To validate our model, we apply five common personalization approaches to two large-scale field experiments with many interventions that encouraged flu vaccination. We find an 18% gain from personalization in one and a more modest 4% gain in the other, which is consistent with our model. Counterfactual analysis shows that this difference in the gains from personalization is driven by a drastic difference in within-treatment heterogeneity. However, reducing cross-treatment correlation holds a larger potential to further increase personalization gains. Our findings provide a framework for assessing the potential from personalization and offer practical recommendations for improving gains from targeting in multi-intervention settings.", "authors": "Shchetkina, Berman"}, "https://arxiv.org/abs/2411.17808": {"title": "spar: Sparse Projected Averaged Regression in R", "link": "https://arxiv.org/abs/2411.17808", "description": "Package spar for R builds ensembles of predictive generalized linear models with high-dimensional predictors. It employs an algorithm utilizing variable screening and random projection tools to efficiently handle the computational challenges associated with large sets of predictors. The package is designed with a strong focus on extensibility. Screening and random projection techniques are implemented as S3 classes with user-friendly constructor functions, enabling users to easily integrate and develop new procedures. This design enhances the package's adaptability and makes it a powerful tool for a variety of high-dimensional applications.", "authors": "Parzer, Vana-G\\\"ur, Filzmoser"}, "https://arxiv.org/abs/2411.17989": {"title": "Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery", "link": "https://arxiv.org/abs/2411.17989", "description": "As the significance of understanding the cause-and-effect relationships among variables increases in the development of modern systems and algorithms, learning causality from observational data has become a preferred and efficient approach over conducting randomized control trials. However, purely observational data could be insufficient to reconstruct the true causal graph. Consequently, many researchers tried to utilise some form of prior knowledge to improve causal discovery process. In this context, the impressive capabilities of large language models (LLMs) have emerged as a promising alternative to the costly acquisition of prior expert knowledge. In this work, we further explore the potential of using LLMs to enhance causal discovery approaches, particularly focusing on score-based methods, and we propose a general framework to utilise the capacity of not only one but multiple LLMs to augment the discovery process.", "authors": "Li, Liu, Wang et al"}, "https://arxiv.org/abs/2411.18008": {"title": "Causal and Local Correlations Based Network for Multivariate Time Series Classification", "link": "https://arxiv.org/abs/2411.18008", "description": "Recently, time series classification has attracted the attention of a large number of researchers, and hundreds of methods have been proposed. However, these methods often ignore the spatial correlations among dimensions and the local correlations among features. To address this issue, the causal and local correlations based network (CaLoNet) is proposed in this study for multivariate time series classification. First, pairwise spatial correlations between dimensions are modeled using causality modeling to obtain the graph structure. 
Then, a relationship extraction network is used to fuse local correlations to obtain long-term dependency features. Finally, the graph structure and long-term dependency features are integrated into the graph neural network. Experiments on the UEA datasets show that CaLoNet can obtain competitive performance compared with state-of-the-art methods.", "authors": "Du, Wei, Zheng et al"}, "https://arxiv.org/abs/2411.18502": {"title": "Isometry pursuit", "link": "https://arxiv.org/abs/2411.18502", "description": "Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.", "authors": "Koelle, Meila"}, "https://arxiv.org/abs/2411.18569": {"title": "A Flexible Defense Against the Winner's Curse", "link": "https://arxiv.org/abs/2411.18569", "description": "Across science and policy, decision-makers often need to draw conclusions about the best candidate among competing alternatives. For instance, researchers may seek to infer the effectiveness of the most successful treatment or determine which demographic group benefits most from a specific treatment. Similarly, in machine learning, practitioners are often interested in the population performance of the model that performs best empirically. However, cherry-picking the best candidate leads to the winner's curse: the observed performance for the winner is biased upwards, rendering conclusions based on standard measures of uncertainty invalid. We introduce the zoom correction, a novel approach for valid inference on the winner. Our method is flexible: it can be employed in both parametric and nonparametric settings, can handle arbitrary dependencies between candidates, and automatically adapts to the level of selection bias. The method easily extends to important related problems, such as inference on the top k winners, inference on the value and identity of the population winner, and inference on \"near-winners.\"", "authors": "Zrnic, Fithian"}, "https://arxiv.org/abs/2307.16353": {"title": "Single Proxy Synthetic Control", "link": "https://arxiv.org/abs/2307.16353", "description": "Synthetic control methods are widely used to estimate the treatment effect on a single treated unit in time-series settings. A common approach to estimate synthetic control weights is to regress the treated unit's pre-treatment outcome and covariates' time series measurements on those of untreated units via ordinary least squares. However, this approach can perform poorly if the pre-treatment fit is not near perfect, whether the weights are normalized or not. In this paper, we introduce a single proxy synthetic control approach, which views the outcomes of untreated units as proxies of the treatment-free potential outcome of the treated unit, a perspective we leverage to construct a valid synthetic control. Under this framework, we establish an alternative identification strategy and corresponding estimation methods for synthetic controls and the treatment effect on the treated unit. 
Notably, unlike existing proximal synthetic control methods, which require two types of proxies for identification, ours relies on a single type of proxy, thus facilitating its practical relevance. Additionally, we adapt a conformal inference approach to perform inference about the treatment effect, obviating the need for a large number of post-treatment observations. Lastly, our framework can accommodate time-varying covariates and nonlinear models. We demonstrate the proposed approach in a simulation study and a real-world application.", "authors": "Park, Tchetgen"}, "https://arxiv.org/abs/1903.04209": {"title": "From interpretability to inference: an estimation framework for universal approximators", "link": "https://arxiv.org/abs/1903.04209", "description": "We present a novel framework for estimation and inference for the broad class of universal approximators. Estimation is based on the decomposition of model predictions into Shapley values. Inference relies on analyzing the bias and variance properties of individual Shapley components. We show that Shapley value estimation is asymptotically unbiased, and we introduce Shapley regressions as a tool to uncover the true data generating process from noisy data alone. The well-known case of the linear regression is the special case in our framework if the model is linear in parameters. We present theoretical, numerical, and empirical results for the estimation of heterogeneous treatment effects as our guiding example.", "authors": "Joseph"}, "https://arxiv.org/abs/2211.13612": {"title": "Joint modeling of wind speed and wind direction through a conditional approach", "link": "https://arxiv.org/abs/2211.13612", "description": "Atmospheric near surface wind speed and wind direction play an important role in many applications, ranging from air quality modeling, building design, wind turbine placement to climate change research. It is therefore crucial to accurately estimate the joint probability distribution of wind speed and direction. In this work we develop a conditional approach to model these two variables, where the joint distribution is decomposed into the product of the marginal distribution of wind direction and the conditional distribution of wind speed given wind direction. To accommodate the circular nature of wind direction a von Mises mixture model is used; the conditional wind speed distribution is modeled as a directional dependent Weibull distribution via a two-stage estimation procedure, consisting of a directional binned Weibull parameter estimation, followed by a harmonic regression to estimate the dependence of the Weibull parameters on wind direction. A Monte Carlo simulation study indicates that our method outperforms an alternative method that uses periodic spline quantile regression in terms of estimation efficiency. We illustrate our method by using the output from a regional climate model to investigate how the joint distribution of wind speed and direction may change under some future climate scenarios.", "authors": "Murphy, Huang, Bessac et al"}, "https://arxiv.org/abs/2411.18646": {"title": "Temporal Models for Demographic and Global Health Outcomes in Multiple Populations: Introducing the Normal-with-Optional-Shrinkage Data Model Class", "link": "https://arxiv.org/abs/2411.18646", "description": "Statistical models are used to produce estimates of demographic and global health indicators in populations with limited data. 
Such models integrate multiple data sources to produce estimates and forecasts with uncertainty based on model assumptions. Model assumptions can be divided into assumptions that describe latent trends in the indicator of interest versus assumptions on the data generating process of the observed data, conditional on the latent process value. Focusing on the latter, we introduce a class of data models that can be used to combine data from multiple sources with various reporting issues. The proposed data model accounts for sampling errors and differences in observational uncertainty based on survey characteristics. In addition, the data model employs horseshoe priors to produce estimates that are robust to outlying observations. We refer to the data model class as the normal-with-optional-shrinkage (NOS) set up. We illustrate the use of the NOS data model for the estimation of modern contraceptive use and other family planning indicators at the national level for countries globally, using survey data.", "authors": "Alkema, Susmann, Ray"}, "https://arxiv.org/abs/2411.18739": {"title": "A Bayesian semi-parametric approach to causal mediation for longitudinal mediators and time-to-event outcomes with application to a cardiovascular disease cohort study", "link": "https://arxiv.org/abs/2411.18739", "description": "Causal mediation analysis of observational data is an important tool for investigating the potential causal effects of medications on disease-related risk factors, and on time-to-death (or disease progression) through these risk factors. However, when analyzing data from a cohort study, such analyses are complicated by the longitudinal structure of the risk factors and the presence of time-varying confounders. Leveraging data from the Atherosclerosis Risk in Communities (ARIC) cohort study, we develop a causal mediation approach, using (semi-parametric) Bayesian Additive Regression Tree (BART) models for the longitudinal and survival data. Our framework allows for time-varying exposures, confounders, and mediators, all of which can either be continuous or binary. We also identify and estimate direct and indirect causal effects in the presence of a competing event. We apply our methods to assess how medication, prescribed to target cardiovascular disease (CVD) risk factors, affects the time-to-CVD death.", "authors": "Bhandari, Daniels, Josefsson et al"}, "https://arxiv.org/abs/2411.18772": {"title": "Difference-in-differences Design with Outcomes Missing Not at Random", "link": "https://arxiv.org/abs/2411.18772", "description": "This paper addresses one of the most prevalent problems encountered by political scientists working with difference-in-differences (DID) design: missingness in panel data. A common practice for handling missing data, known as complete case analysis, is to drop cases with any missing values over time. A more principled approach involves using nonparametric bounds on causal effects or applying inverse probability weighting based on baseline covariates. Yet, these methods are general remedies that often under-utilize the assumptions already imposed on panel structure for causal identification. In this paper, I outline the pitfalls of complete case analysis and propose an alternative identification strategy based on principal strata. 
To be specific, I impose a parallel trends assumption within each latent group that shares the same missingness pattern (e.g., always-respondents, if-treated-respondents) and leverage missingness rates over time to estimate the proportions of these groups. Building on this, I tailor Lee bounds, well-known nonparametric bounds under selection bias, to partially identify the causal effect within the DID design. Unlike complete case analysis, the proposed method does not require independence between treatment selection and missingness patterns, nor does it assume homogeneous effects across these patterns.", "authors": "Shin"}, "https://arxiv.org/abs/2411.18773": {"title": "Inference on Dynamic Spatial Autoregressive Models with Change Point Detection", "link": "https://arxiv.org/abs/2411.18773", "description": "We analyze a varying-coefficient dynamic spatial autoregressive model with spatial fixed effects. One salient feature of the model is the incorporation of multiple spatial weight matrices through their linear combinations with varying coefficients, which help solve the problem of choosing the most \"correct\" one for applied econometricians who often face the availability of multiple expert spatial weight matrices. We estimate and make inferences on the model coefficients and coefficients in basis expansions of the varying coefficients through penalized estimation, establishing the oracle properties of the estimators and the consistency of the overall estimated spatial weight matrix, which can be time-dependent. We further consider two applications of our model to change point detection in dynamic spatial autoregressive models, providing theoretical justifications for consistent change point location estimation and practical implementations. Simulation experiments demonstrate the performance of our proposed methodology, and a real data analysis is also carried out.", "authors": "Cen, Chen, Lam"}, "https://arxiv.org/abs/2411.18826": {"title": "Improved order selection method for hidden Markov models: a case study with movement data", "link": "https://arxiv.org/abs/2411.18826", "description": "Hidden Markov models (HMMs) are a versatile statistical framework commonly used in ecology to characterize behavioural patterns from animal movement data. In HMMs, the observed data depend on a finite number of underlying hidden states, generally interpreted as the animal's unobserved behaviour. The number of states is a crucial parameter, controlling the trade-off between ecological interpretability of behaviours (fewer states) and the goodness of fit of the model (more states). Selecting the number of states, commonly referred to as order selection, is notoriously challenging. Common model selection metrics, such as AIC and BIC, often perform poorly in determining the number of states, particularly when models are misspecified. Building on existing methods for HMMs and mixture models, we propose a double penalized likelihood maximum estimate (DPMLE) for the simultaneous estimation of the number of states and parameters of non-stationary HMMs. The DPMLE differs from traditional information criteria by using two penalty functions on the stationary probabilities and state-dependent parameters. For non-stationary HMMs, forward and backward probabilities are used to approximate stationary probabilities. Using a simulation study that includes scenarios with additional complexity in the data, we compare the performance of our method with that of AIC and BIC. 
We also illustrate how the DPMLE differs from AIC and BIC using narwhal (Monodon monoceros) movement data. The proposed method outperformed AIC and BIC in identifying the correct number of states under model misspecification. Furthermore, its capacity to handle non-stationary dynamics allowed for more realistic modeling of complex movement data, offering deeper insights into narwhal behaviour. Our method is a powerful tool for order selection in non-stationary HMMs, with potential applications extending beyond the field of ecology.", "authors": "Dupont, Marcoux, Hussey et al"}, "https://arxiv.org/abs/2411.18838": {"title": "Contrasting the optimal resource allocation to cybersecurity and cyber insurance using prospect theory versus expected utility theory", "link": "https://arxiv.org/abs/2411.18838", "description": "Protecting against cyber-threats is vital for every organization and can be done by investing in cybersecurity controls and purchasing cyber insurance. However, these are interlinked since insurance premiums could be reduced by investing more in cybersecurity controls. The expected utility theory and the prospect theory are two alternative theories explaining decision-making under risk and uncertainty, which can inform strategies for optimizing resource allocation. While the former is considered a rational approach, research has shown that most people make decisions consistent with the latter, including on insurance uptakes. We compare and contrast these two approaches to provide important insights into how the two approaches could lead to different optimal allocations resulting in differing risk exposure as well as financial costs. We introduce the concept of a risk curve and show that identifying the nature of the risk curve is a key step in deriving the optimal resource allocation.", "authors": "Joshi, Yang, Slapnicar et al"}, "https://arxiv.org/abs/2411.18864": {"title": "Redesigning the ensemble Kalman filter with a dedicated model of epistemic uncertainty", "link": "https://arxiv.org/abs/2411.18864", "description": "The problem of incorporating information from observations received serially in time is widespread in the field of uncertainty quantification. Within a probabilistic framework, such problems can be addressed using standard filtering techniques. However, in many real-world problems, some (or all) of the uncertainty is epistemic, arising from a lack of knowledge, and is difficult to model probabilistically. This paper introduces a possibilistic ensemble Kalman filter designed for this setting and characterizes some of its properties. Using possibility theory to describe epistemic uncertainty is appealing from a philosophical perspective, and it is easy to justify certain heuristics often employed in standard ensemble Kalman filters as principled approaches to capturing uncertainty within it. The possibilistic approach motivates a robust mechanism for characterizing uncertainty which shows good performance with small sample sizes, and can outperform standard ensemble Kalman filters at given sample size, even when dealing with genuinely aleatoric uncertainty.", "authors": "Kimchaiwong, Houssineau, Johansen"}, "https://arxiv.org/abs/2411.18879": {"title": "Learning treatment effects under covariate dependent left truncation and right censoring", "link": "https://arxiv.org/abs/2411.18879", "description": "In observational studies with delayed entry, causal inference for time-to-event outcomes can be challenging. 
The challenges arise because, in addition to the potential confounding bias from observational data, the collected data often also suffers from selection bias due to left truncation, where only subjects with time-to-event (such as death) greater than the enrollment times are included, as well as bias from informative right censoring. To estimate the treatment effects on time-to-event outcomes in such settings, inverse probability weighting (IPW) is often employed. However, IPW is sensitive to model misspecifications, which makes it vulnerable, especially when faced with three sources of bias. Moreover, IPW is inefficient. To address these challenges, we propose a doubly robust framework to handle covariate dependent left truncation and right censoring that can be applied to a wide range of estimation problems, including estimating average treatment effect (ATE) and conditional average treatment effect (CATE). For average treatment effect, we develop estimators that enjoy model double robustness and rate double robustness. For conditional average treatment effect, we develop orthogonal and doubly robust learners that can achieve the oracle rate of convergence. Our framework represents the first attempt to construct doubly robust estimators and orthogonal learners for treatment effects that account for all three sources of bias: confounding, selection from covariate-induced dependent left truncation, and informative right censoring. We apply the proposed methods to analyze the effect of midlife alcohol consumption on late-life cognitive impairment, using data from the Honolulu Asia Aging Study.", "authors": "Wang, Ying, Xu"}, "https://arxiv.org/abs/2411.18957": {"title": "Bayesian Cluster Weighted Gaussian Models", "link": "https://arxiv.org/abs/2411.18957", "description": "We introduce a novel class of Bayesian mixtures for normal linear regression models which incorporates a further Gaussian random component for the distribution of the predictor variables. The proposed cluster-weighted model aims to encompass potential heterogeneity in the distribution of the response variable as well as in the multivariate distribution of the covariates for detecting signals relevant to the underlying latent structure. Of particular interest are potential signals originating from: (i) the linear predictor structures of the regression models and (ii) the covariance structures of the covariates. We model these two components using a lasso shrinkage prior for the regression coefficients and a graphical-lasso shrinkage prior for the covariance matrices. A fully Bayesian approach is followed for estimating the number of clusters, by treating the number of mixture components as random and implementing a trans-dimensional telescoping sampler. Alternative Bayesian approaches based on overfitting mixture models or using information criteria to select the number of components are also considered. The proposed method is compared against an EM type implementation, mixtures of regressions and mixtures of experts. The method is illustrated using a set of simulation studies and a biomedical dataset.", "authors": "Papastamoulis, Perrakis"}, "https://arxiv.org/abs/2411.18978": {"title": "Warfare Ignited Price Contagion Dynamics in Early Modern Europe", "link": "https://arxiv.org/abs/2411.18978", "description": "Economic historians have long studied market integration and contagion dynamics during periods of warfare and global stress, but there is a lack of model-based evidence on these phenomena. 
This paper uses an econometric contagion model, the Diebold-Yilmaz framework, to examine the dynamics of economic shocks across European markets in the early modern period. Our findings suggest that violent conflicts, especially the Thirty Years' War, significantly increased food price spillover across cities, causing widespread disruptions across Europe. We also demonstrate the ability of this framework to capture relevant historical dynamics between the main trade centers of the period.", "authors": "Esmaili, Puma, Ludlow et al"}, "https://arxiv.org/abs/2411.19104": {"title": "Algorithmic modelling of a complex redundant multi-state system subject to multiple events, preventive maintenance, loss of units and a multiple vacation policy through a MMAP", "link": "https://arxiv.org/abs/2411.19104", "description": "A complex multi-state redundant system undergoing preventive maintenance and experiencing multiple events is being considered in a continuous time frame. The online unit is susceptible to various types of failures, both internal and external in nature, with multiple degradation levels present, both internally and externally. Random inspections are continuously monitoring these degradation levels, and if they reach a critical state, the unit is directed to a repair facility for preventive maintenance. The repair facility is managed by a single repairperson, who follows a multiple vacation policy dependent on the operational status of the units. The repairperson is responsible for two primary tasks: corrective repairs and preventive maintenance. The time durations within the system follow phase-type distributions, and the model is constructed using Markovian Arrival Processes with marked arrivals. A variety of performance measures, including transient and stationary distributions, are calculated using matrix-analytic methods. This approach enables the expression of key results and overall system behaviour in a matrix-algorithmic format. In order to optimize the model, costs and rewards are integrated into the analysis. A numerical example is presented to showcase the model's flexibility and effectiveness in real-world applications.", "authors": "Ruiz-Castro, Zapata-Ceballos"}, "https://arxiv.org/abs/2411.19138": {"title": "Nonparametric estimation on the circle based on Fej\\'er polynomials", "link": "https://arxiv.org/abs/2411.19138", "description": "This paper presents a comprehensive study of nonparametric estimation techniques on the circle using Fej\\'er polynomials, which are analogues of Bernstein polynomials for periodic functions. Building upon Fej\\'er's uniform approximation theorem, the paper introduces circular density and distribution function estimators based on Fej\\'er kernels. It establishes their theoretical properties, including uniform strong consistency and asymptotic expansions. The proposed methods are extended to account for measurement errors by incorporating classical and Berkson error models, adjusting the Fej\\'er estimator to mitigate their effects. Simulation studies analyze the finite-sample performance of these estimators under various scenarios, including mixtures of circular distributions and measurement error models. 
An application to rainfall data illustrates the practical use of the proposed estimators, demonstrating their robustness and effectiveness in the presence of rounding-induced Berkson errors.", "authors": "Klar, Milo\\v{s}evi\\'c, Obradovi\\'c"}, "https://arxiv.org/abs/2411.19176": {"title": "A Unified Bayesian Framework for Mortality Model Selection", "link": "https://arxiv.org/abs/2411.19176", "description": "In recent years, a wide range of mortality models has been proposed to address the diverse factors influencing mortality rates, which has highlighted the need to perform model selection. Traditional mortality model selection methods, such as AIC and BIC, often require fitting multiple models independently and ranking them based on these criteria. This process can fail to account for uncertainties in model selection, which can lead to overly optimistic prediction intervals, and it disregards the potential insights from combining models. To address these limitations, we propose a novel Bayesian model selection framework that integrates model selection and parameter estimation into the same process. This requires creating a model-building framework that gives rise to different models by choosing different parametric forms for each term. Inference is performed using the reversible jump Markov chain Monte Carlo algorithm, which allows for transitions between models of different dimensions, as is the case for the models considered here. We develop modelling frameworks for data stratified by age and period and for data stratified by age, period and product. Our results are presented in two case studies.", "authors": "Diana, Tze, Pittea"}, "https://arxiv.org/abs/2411.19184": {"title": "Flexible space-time models for extreme data", "link": "https://arxiv.org/abs/2411.19184", "description": "Extreme Value Analysis is an essential methodology in the study of rare and extreme events, which hold significant interest in various fields, particularly in the context of environmental sciences. Models that employ the exceedances of values above suitably selected high thresholds possess the advantage of capturing the \"sub-asymptotic\" dependence of data. This paper presents an extension of spatial random scale mixture models to the spatio-temporal domain. A comprehensive framework for characterizing the dependence structure of extreme events across both dimensions is provided. In particular, the model is capable of distinguishing between asymptotic dependence and independence, both in space and time, through the use of parametric inference. The high complexity of the likelihood function for the proposed model necessitates a simulation approach based on neural networks for parameter estimation, which leverages summaries of the sub-asymptotic dependence present in the data. The effectiveness of the model in assessing the limiting dependence structure of spatio-temporal processes is demonstrated through both simulation studies and an application to rainfall datasets.", "authors": "Dell'Oro, Gaetan"}, "https://arxiv.org/abs/2411.19205": {"title": "On the application of Jammalamadaka-Jim\\'enez Gamero-Meintanis test for circular regression model assessment", "link": "https://arxiv.org/abs/2411.19205", "description": "We study a circular-circular multiplicative regression model, characterized by an angular error distribution assumed to be wrapped Cauchy. 
We propose a specification procedure for this model, focusing on adapting a recently proposed goodness-of-fit test for circular distributions. We derive its limiting properties and study the power performance of the test through extensive simulations, including the adaptation of some other well-known goodness-of-fit tests for this type of data. To emphasize the practical relevance of our methodology, we apply it to several small real-world datasets and to wind direction measurements in the Black Forest region of southwestern Germany, demonstrating the power and versatility of the presented approach.", "authors": "Halaj, Klar, Milo\\v{s}evi\\'c et al"}, "https://arxiv.org/abs/2411.19225": {"title": "Sparse optimization for estimating the cross-power spectrum in linear inverse models : from theory to the application in brain connectivity", "link": "https://arxiv.org/abs/2411.19225", "description": "In this work we present a computationally efficient linear optimization approach for estimating the cross-power spectrum of a hidden multivariate stochastic process from that of another observed process. Sparsity in the resulting estimator of the cross-power spectrum is induced through $\\ell_1$ regularization, and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is used for computing such an estimator. With respect to a standard implementation, we prove that a proper initialization step is sufficient to guarantee the required symmetric and antisymmetric properties of the involved quantities. Further, we show how structural properties of the forward operator can be exploited within the FISTA update in order to make our approach suitable also for large-scale problems such as those arising in the context of brain functional connectivity.\n The effectiveness of the proposed approach is shown in a practical scenario where we aim at quantifying the statistical relationships between brain regions in the context of non-invasive electromagnetic field recordings. Our results show that our method provides higher specificity than classical approaches based on a two-step procedure, in which the hidden process describing the brain activity is first estimated through a linear optimization step and the cortical cross-power spectrum is then computed from the estimated time series.", "authors": "Carini, Furci, Sommariva"}, "https://arxiv.org/abs/2411.19368": {"title": "Distribution-Free Calibration of Statistical Confidence Sets", "link": "https://arxiv.org/abs/2411.19368", "description": "Constructing valid confidence sets is a crucial task in statistical inference, yet traditional methods often face challenges when dealing with complex models or limited observed sample sizes. These challenges are frequently encountered in modern applications, such as Likelihood-Free Inference (LFI). In these settings, confidence sets may fail to maintain a confidence level close to the nominal value. In this paper, we introduce two novel methods, TRUST and TRUST++, for calibrating confidence sets to achieve distribution-free conditional coverage. These methods rely entirely on simulated data from the statistical model to perform calibration. Leveraging insights from conformal prediction techniques adapted to the statistical inference context, our methods ensure both finite-sample local coverage and asymptotic conditional coverage as the number of simulations increases, even if n is small. 
They effectively handle nuisance parameters and provide computationally efficient uncertainty quantification for the estimated confidence sets. This allows users to assess whether additional simulations are necessary for robust inference. Through theoretical analysis and experiments on models with both tractable and intractable likelihoods, we demonstrate that our methods outperform existing approaches, particularly in small-sample regimes. This work bridges the gap between conformal prediction and statistical inference, offering practical tools for constructing valid confidence sets in complex models.", "authors": "Cabezas, Soares, Ramos et al"}, "https://arxiv.org/abs/2411.19384": {"title": "Random Effects Misspecification and its Consequences for Prediction in Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2411.19384", "description": "When fitting generalized linear mixed models (GLMMs), one important decision to make relates to the choice of the random effects distribution. As the random effects are unobserved, misspecification of this distribution is a real possibility. In this article, we investigate the consequences of random effects misspecification for point prediction and prediction inference in GLMMs, a topic on which there is considerably less research compared to consequences for parameter estimation and inference. We use theory, simulation, and a real application to explore the effect of using the common normality assumption for the random effects distribution when the correct specification is a mixture of normal distributions, focusing on the impacts on point prediction, mean squared prediction errors (MSEPs), and prediction intervals. We found that the optimal shrinkage is different under the two random effect distributions, so is impacted by misspecification. The unconditional MSEPs for the random effects are almost always larger under the misspecified normal random effects distribution, especially when cluster sizes are small. Results for the MSEPs conditional on the random effects are more complicated, but they remain generally larger under the misspecified distribution when the true random effect is close to the mean of one of the component distributions in the true mixture distribution. Results for prediction intervals indicate that overall coverage probability is not greatly impacted by misspecification.", "authors": "Vu, Hui, Muller et al"}, "https://arxiv.org/abs/2411.19425": {"title": "Bayesian Hierarchical Modeling for Predicting Spatially Correlated Curves in Irregular Domains: A Case Study on PM10 Pollution", "link": "https://arxiv.org/abs/2411.19425", "description": "This study presents a Bayesian hierarchical model for analyzing spatially correlated functional data and handling irregularly spaced observations. The model uses Bernstein polynomial (BP) bases combined with autoregressive random effects, allowing for nuanced modeling of spatial correlations between sites and dependencies of observations within curves. Moreover, the proposed procedure introduces a distinct structure for the random effect component compared to previous works. Simulation studies conducted under various challenging scenarios verify the model's robustness, demonstrating its capacity to accurately recover spatially dependent curves and predict observations at unmonitored locations. The model's performance is further supported by its application to real-world data, specifically PM$_{10}$ particulate matter measurements from a monitoring network in Mexico City. 
This application is of practical importance, as particles can penetrate the respiratory system and aggravate various health conditions. The model effectively predicts concentrations at unmonitored sites, with uncertainty estimates that reflect spatial variability across the domain. This new methodology provides a flexible framework for functional data analysis (FDA) in spatial contexts and addresses challenges in analyzing irregular domains, with potential applications in environmental monitoring.", "authors": "Moreno, Dias"}, "https://arxiv.org/abs/2411.19448": {"title": "Unsupervised Variable Selection for Ultrahigh-Dimensional Clustering Analysis", "link": "https://arxiv.org/abs/2411.19448", "description": "Compared to supervised variable selection, research on unsupervised variable selection lags far behind. A forward partial-variable clustering full-variable loss (FPCFL) method is proposed to address the corresponding challenges. An advantage is that the FPCFL method can distinguish active, redundant, and uninformative variables, a distinction that previous methods cannot achieve. Theoretical and simulation studies show that the performance of a clustering method using all the variables can be worse if many uninformative variables are involved; better results are expected if the uninformative variables are excluded. The research addresses a previous concern about how variable selection affects the performance of clustering. Rather than attempting to select all the relevant variables, as many previous methods do, the proposed method selects a subset that induces an equally good result. This phenomenon does not appear in supervised variable selection problems.", "authors": "Zhang, Huang"}, "https://arxiv.org/abs/2411.19523": {"title": "Density-Calibrated Conformal Quantile Regression", "link": "https://arxiv.org/abs/2411.19523", "description": "This paper introduces the Density-Calibrated Conformal Quantile Regression (CQR-d) method, a novel approach for constructing prediction intervals that adapts to varying uncertainty across the feature space. Building upon conformal quantile regression, CQR-d incorporates local information through a weighted combination of local and global conformity scores, where the weights are determined by local data density. We prove that CQR-d provides valid marginal coverage at level $1 - \\alpha - \\epsilon$, where $\\epsilon$ represents a small tolerance from numerical optimization. Through extensive simulation studies and an application to a heteroscedastic dataset available in R, we demonstrate that CQR-d maintains the desired coverage while producing substantially narrower prediction intervals compared to standard conformal quantile regression (CQR). Notably, in our application to heteroscedastic data, CQR-d achieves an $8.6\\%$ reduction in average interval width while maintaining comparable coverage. The method's effectiveness is particularly pronounced in settings with clear local uncertainty patterns, making it a valuable tool for prediction tasks in heterogeneous data environments.", "authors": "Lu"}, "https://arxiv.org/abs/2411.19572": {"title": "Canonical correlation analysis of stochastic trends via functional approximation", "link": "https://arxiv.org/abs/2411.19572", "description": "This paper proposes a novel canonical correlation analysis for semiparametric inference in $I(1)/I(0)$ systems via functional approximation. 
The approach can be applied coherently to panels of $p$ variables with a generic number $s$ of stochastic trends, as well as to subsets or aggregations of variables. This study discusses inferential tools on $s$ and on the loading matrix $\\psi$ of the stochastic trends (and on their duals $r$ and $\\beta$, the cointegration rank and the cointegrating matrix): asymptotically pivotal test sequences and consistent estimators of $s$ and $r$, $T$-consistent, mixed Gaussian and efficient estimators of $\\psi$ and $\\beta$, Wald tests thereof, and misspecification tests for checking model assumptions. Monte Carlo simulations show that these tools have reliable performance uniformly in $s$ for small, medium and large-dimensional systems, with $p$ ranging from 10 to 300. An empirical analysis of 20 exchange rates illustrates the methods.", "authors": "Franchi, Georgiev, Paruolo"}, "https://arxiv.org/abs/2411.19633": {"title": "Isotropy testing in spatial point patterns: nonparametric versus parametric replication under misspecification", "link": "https://arxiv.org/abs/2411.19633", "description": "Several hypothesis testing methods have been proposed to validate the assumption of isotropy in spatial point patterns. A majority of these methods are characterised by an unknown distribution of the test statistic under the null hypothesis of isotropy. Parametric approaches to approximating the distribution involve simulation of patterns from a user-specified isotropic model. Alternatively, nonparametric replicates of the test statistic under isotropy can be used to waive the need for specifying a model. In this paper, we first develop a general framework which allows for the integration of a selected nonparametric replication method into isotropy testing. We then conduct a large simulation study comprising application-like scenarios to assess the performance of tests with different parametric and nonparametric replication methods. In particular, we explore distortions in test size and power caused by model misspecification, and demonstrate the advantages of nonparametric replication in such scenarios.", "authors": "Pypkowski, Sykulski, Martin"}, "https://arxiv.org/abs/2411.19789": {"title": "Adjusting auxiliary variables under approximate neighborhood interference", "link": "https://arxiv.org/abs/2411.19789", "description": "Randomized experiments are the gold standard for causal inference. However, traditional assumptions, such as the Stable Unit Treatment Value Assumption (SUTVA), often fail in real-world settings where interference between units is present. Network interference, in particular, has garnered significant attention. Structural models, like the linear-in-means model, are commonly used to describe interference; but they rely on the correct specification of the model, which can be restrictive. Recent advancements in the literature, such as the Approximate Neighborhood Interference (ANI) framework, offer more flexible approaches by assuming negligible interference from distant units. In this paper, we introduce a general framework for regression adjustment for the network experiments under the ANI assumption. This framework expands traditional regression adjustment by accounting for imbalances in network-based covariates, ensuring precision improvement, and providing shorter confidence intervals. 
We establish the validity of our approach using a design-based inference framework, which relies solely on randomization of treatment assignments for inference without requiring correctly specified outcome models.", "authors": "Lu, Wang, Zhang"}, "https://arxiv.org/abs/2411.19871": {"title": "Thompson, Ulam, or Gauss? Multi-criteria recommendations for posterior probability computation methods in Bayesian response-adaptive trials", "link": "https://arxiv.org/abs/2411.19871", "description": "To implement a Bayesian response-adaptive trial it is necessary to evaluate a sequence of posterior probabilities. This sequence is often approximated by simulation due to the unavailability of closed-form formulae to compute it exactly. Approximating these probabilities by simulation can be computationally expensive and impact the accuracy or the range of scenarios that may be explored. An alternative approximation method based on Gaussian distributions can be faster but its accuracy is not guaranteed. The literature lacks practical recommendations for selecting approximation methods and comparing their properties, particularly considering trade-offs between computational speed and accuracy. In this paper, we focus on the case where the trial has a binary endpoint with Beta priors. We first outline an efficient way to compute the posterior probabilities exactly for any number of treatment arms. Then, using exact probability computations, we show how to benchmark calculation methods based on considerations of computational speed, patient benefit, and inferential accuracy. This is done through a range of simulations in the two-armed case, as well as an analysis of the three-armed Established Status Epilepticus Treatment Trial. Finally, we provide practical guidance for which calculation method is most appropriate in different settings, and how to choose the number of simulations if the simulation-based approximation method is used.", "authors": "Kaddaj, Pin, Baas et al"}, "https://arxiv.org/abs/2411.19933": {"title": "Transfer Learning for High-dimensional Quantile Regression with Distribution Shift", "link": "https://arxiv.org/abs/2411.19933", "description": "Information from related source studies can often enhance the findings of a target study. However, the distribution shift between target and source studies can severely impact the efficiency of knowledge transfer. In the high-dimensional regression setting, existing transfer approaches mainly focus on the parameter shift. In this paper, we focus on the high-dimensional quantile regression with knowledge transfer under three types of distribution shift: parameter shift, covariate shift, and residual shift. We propose a novel transferable set and a new transfer framework to address the above three discrepancies. Non-asymptotic estimation error bounds and source detection consistency are established to validate the availability and superiority of our method in the presence of distribution shift. Additionally, an orthogonal debiased approach is proposed for statistical inference with knowledge transfer, leading to sharper asymptotic results. 
Extensive simulation results as well as real data applications further demonstrate the effectiveness of our proposed procedure.", "authors": "Bai, Zhang, Yang et al"}, "https://arxiv.org/abs/2411.18830": {"title": "Double Descent in Portfolio Optimization: Dance between Theoretical Sharpe Ratio and Estimation Accuracy", "link": "https://arxiv.org/abs/2411.18830", "description": "We study the relationship between model complexity and out-of-sample performance in the context of mean-variance portfolio optimization. Representing model complexity by the number of assets, we find that the performance of low-dimensional models initially improves with complexity but then declines due to overfitting. As model complexity becomes sufficiently high, the performance improves with complexity again, resulting in a double ascent Sharpe ratio curve similar to the double descent phenomenon observed in artificial intelligence. The underlying mechanisms involve an intricate interaction between the theoretical Sharpe ratio and estimation accuracy. In high-dimensional models, the theoretical Sharpe ratio approaches its upper limit, and the overfitting problem is reduced because there are more parameters than data restrictions, which allows us to choose well-behaved parameters based on inductive bias.", "authors": "Lu, Yang, Zhang"}, "https://arxiv.org/abs/2411.18986": {"title": "ZIPG-SK: A Novel Knockoff-Based Approach for Variable Selection in Multi-Source Count Data", "link": "https://arxiv.org/abs/2411.18986", "description": "The rapid development of sequencing technology has generated complex, highly skewed, and zero-inflated multi-source count data. This has posed significant challenges in variable selection, which is crucial for uncovering shared disease mechanisms, such as tumor development and metabolic dysregulation. In this study, we propose a novel variable selection method called Zero-Inflated Poisson-Gamma based Simultaneous knockoff (ZIPG-SK) for multi-source count data. To address the highly skewed and zero-inflated properties of count data, we introduce a Gaussian copula based on the ZIPG distribution for constructing knockoffs, while also incorporating the information of covariates. This method successfully detects common features related to the results in multi-source data while controlling the false discovery rate (FDR). Additionally, our proposed method effectively combines e-values to enhance power. Extensive simulations demonstrate the superiority of our method over Simultaneous Knockoff and other existing methods in processing count data, as it improves power across different scenarios. Finally, we validated the method by applying it to two real-world multi-source datasets: colorectal cancer (CRC) and type 2 diabetes (T2D). The identified variable characteristics are consistent with existing studies and provided additional insights.", "authors": "Tang, Mao, Ma et al"}, "https://arxiv.org/abs/2411.19223": {"title": "On the Unknowable Limits to Prediction", "link": "https://arxiv.org/abs/2411.19223", "description": "This short Correspondence critiques the classic dichotomization of prediction error into reducible and irreducible components, noting that certain types of error can be eliminated at differential speeds. 
We propose an improved analytical framework that better distinguishes epistemic from aleatoric uncertainty, emphasizing that predictability depends on information sets and cautioning against premature claims of unpredictability.", "authors": "Yan, Rahal"}, "https://arxiv.org/abs/2411.19878": {"title": "Nonparametric Estimation for a Log-concave Distribution Function with Interval-censored Data", "link": "https://arxiv.org/abs/2411.19878", "description": "We consider the nonparametric maximum likelihood estimation for the underlying event time based on mixed-case interval-censored data, under a log-concavity assumption on its distribution function. This generalized framework relaxes the assumptions of a log-concave density function or a concave distribution function considered in the literature. A log-concave distribution function is fulfilled by many common parametric families in survival analysis and also allows for multi-modal and heavy-tailed distributions. We establish the existence, uniqueness and consistency of the log-concave nonparametric maximum likelihood estimator. A computationally efficient procedure that combines an active set algorithm with the iterative convex minorant algorithm is proposed. Numerical studies demonstrate the advantages of incorporating additional shape constraint compared to the unconstrained nonparametric maximum likelihood estimator. The results also show that our method achieves a balance between efficiency and robustness compared to assuming log-concavity in the density. An R package iclogcondist is developed to implement our proposed method.", "authors": "Chu, Ling, Yuan"}, "https://arxiv.org/abs/1902.07770": {"title": "Cross Validation for Penalized Quantile Regression with a Case-Weight Adjusted Solution Path", "link": "https://arxiv.org/abs/1902.07770", "description": "Cross validation is widely used for selecting tuning parameters in regularization methods, but it is computationally intensive in general. To lessen its computational burden, approximation schemes such as generalized approximate cross validation (GACV) are often employed. However, such approximations may not work well when non-smooth loss functions are involved. As a case in point, approximate cross validation schemes for penalized quantile regression do not work well for extreme quantiles. In this paper, we propose a new algorithm to compute the leave-one-out cross validation scores exactly for quantile regression with ridge penalty through a case-weight adjusted solution path. Resorting to the homotopy technique in optimization, we introduce a case weight for each individual data point as a continuous embedding parameter and decrease the weight gradually from one to zero to link the estimators based on the full data and those with a case deleted. This allows us to design a solution path algorithm to compute all leave-one-out estimators very efficiently from the full-data solution. We show that the case-weight adjusted solution path is piecewise linear in the weight parameter, and using the solution path, we examine case influences comprehensively and observe that different modes of case influences emerge, depending on the specified quantiles, data dimensions and penalty parameter. 
We further illustrate the utility of the proposed algorithm in real-world applications.", "authors": "Tu, Zhu, Lee et al"}, "https://arxiv.org/abs/2206.12152": {"title": "Estimation and Inference in High-Dimensional Panel Data Models with Interactive Fixed Effects", "link": "https://arxiv.org/abs/2206.12152", "description": "We develop new econometric methods for estimation and inference in high-dimensional panel data models with interactive fixed effects. Our approach can be regarded as a non-trivial extension of the very popular common correlated effects (CCE) approach. Roughly speaking, we proceed as follows: We first construct a projection device to eliminate the unobserved factors from the model by applying a dimensionality reduction transform to the matrix of cross-sectionally averaged covariates. The unknown parameters are then estimated by applying lasso techniques to the projected model. For inference purposes, we derive a desparsified version of our lasso-type estimator. While the original CCE approach is restricted to the low-dimensional case where the number of regressors is small and fixed, our methods can deal with both low- and high-dimensional situations where the number of regressors is large and may even exceed the overall sample size. We derive theory for our estimation and inference methods both in the large-T-case, where the time series length T tends to infinity, and in the small-T-case, where T is a fixed natural number. Specifically, we derive the convergence rate of our estimator and show that its desparsified version is asymptotically normal under suitable regularity conditions. The theoretical analysis of the paper is complemented by a simulation study and an empirical application to characteristic-based asset pricing.", "authors": "Linton, Ruecker, Vogt et al"}, "https://arxiv.org/abs/2310.09257": {"title": "Reconstruct Ising Model with Global Optimality via SLIDE", "link": "https://arxiv.org/abs/2310.09257", "description": "The reconstruction of interaction networks between random events is a critical problem arising in statistical physics, politics, sociology, biology, psychology, and beyond. The Ising model lays the foundation for this reconstruction process, but finding the underlying Ising model from the fewest observed samples in a computationally efficient manner has remained challenging for half a century. Using sparsity learning, we present an approach named SLIDE whose sample complexity is globally optimal. Furthermore, a tuning-free algorithm is developed to give a statistically consistent solution of SLIDE in polynomial time with high probability. On extensive benchmark cases, the SLIDE approach demonstrates dominant performance in reconstructing underlying Ising models, confirming its superior statistical properties. An application to U.S. Senate voting in the last six congresses reveals that both Republicans and Democrats noticeably assemble in each congress; interestingly, the assembling of Democrats is particularly pronounced in the latest congress.", "authors": "Chen, Zhu, Zhu et al"}, "https://arxiv.org/abs/2412.00066": {"title": "New Axioms for Dependence Measurement and Powerful Tests", "link": "https://arxiv.org/abs/2412.00066", "description": "We build a context-free, comprehensive, flexible, and sound footing for measuring the dependence of two variables based on three new axioms, updating Renyi's (1959) seven postulates. 
We illustrate the superior footing of the axioms using Vinod's (2014) asymmetric matrix of generalized correlation coefficients R*. We list five limitations explaining the poorer footing of the axiom-failing Hellinger correlation proposed in 2022. We also describe a new implementation of a one-sided test with Taraldsen's (2021) exact density. This paper provides a new table for more powerful one-sided tests using the exact Taraldsen density and includes a published example where using Taraldsen's method makes a practical difference. The code implementing all our proposals is available in R packages.", "authors": "Vinod"}, "https://arxiv.org/abs/2412.00106": {"title": "Scalable computation of the maximum flow in large brain connectivity networks", "link": "https://arxiv.org/abs/2412.00106", "description": "We are interested in computing an approximation of the maximum flow in large (brain) connectivity networks. The maximum flow in such networks is of interest in order to better understand the routing of information in the human brain. However, the runtime of $O(|V||E|^2)$ for the classic Edmonds-Karp algorithm renders computations of the maximum flow on networks with millions of vertices infeasible, where $V$ is the set of vertices and $E$ is the set of edges. In this contribution, we propose a new Monte Carlo algorithm which is capable of computing an approximation of the maximum flow in networks with millions of vertices via subsampling. Apart from giving a point estimate of the maximum flow, our algorithm also returns valid confidence bounds for the true maximum flow. Importantly, its runtime only scales as $O(B \\cdot |\\tilde{V}| |\\tilde{E}|^2)$, where $B$ is the number of Monte Carlo samples, $\\tilde{V}$ is the set of subsampled vertices, and $\\tilde{E}$ is the edge set induced by $\\tilde{V}$. Choosing $B \\in O(|V|)$ and $|\\tilde{V}| \\in O(\\sqrt{|V|})$ (implying $|\\tilde{E}| \\in O(|V|)$) yields an algorithm with runtime $O(|V|^{3.5})$ while still guaranteeing the usual \"root-n\" convergence of the confidence interval of the maximum flow estimate. We evaluate our proposed algorithm with respect to both accuracy and runtime on simulated graphs as well as graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com).", "authors": "Qian, Hahn"}, "https://arxiv.org/abs/2412.00228": {"title": "A Doubly Robust Method to Counteract Outcome-Dependent Selection Bias in Multi-Cohort EHR Studies", "link": "https://arxiv.org/abs/2412.00228", "description": "Selection bias can hinder accurate estimation of association parameters in binary disease risk models using non-probability samples like electronic health records (EHRs). The issue is compounded when participants are recruited from multiple clinics or centers with varying selection mechanisms that may depend on the disease or outcome of interest. Traditional inverse-probability-weighted (IPW) methods, based on constructed parametric selection models, often struggle with misspecifications when selection mechanisms vary across cohorts. This paper introduces a new Joint Augmented Inverse Probability Weighted (JAIPW) method, which integrates individual-level data from multiple cohorts collected under potentially outcome-dependent selection mechanisms, with data from an external probability sample. JAIPW offers double robustness by incorporating a flexible auxiliary score model to address potential misspecifications in the selection models. 
We outline the asymptotic properties of the JAIPW estimator, and our simulations reveal that JAIPW achieves up to five times lower relative bias and three times lower root mean square error (RMSE) compared to the best-performing joint IPW methods under scenarios with misspecified selection models. Applying JAIPW to the Michigan Genomics Initiative (MGI), a multi-clinic EHR-linked biobank, combined with external national probability samples, resulted in cancer-sex association estimates more closely aligned with national estimates. We also analyzed the association between cancer and polygenic risk scores (PRS) in MGI to illustrate a situation where the exposure is not available in the external probability sample.", "authors": "Kundu, Shi, Salvatore et al"}, "https://arxiv.org/abs/2412.00233": {"title": "Peer Effects and Herd Behavior: An Empirical Study Based on the \"Double 11\" Shopping Festival", "link": "https://arxiv.org/abs/2412.00233", "description": "This study employs a Bayesian Probit model to empirically analyze peer effects and herd behavior among consumers during the \"Double 11\" shopping festival, using data collected through a questionnaire survey. The results demonstrate that peer effects significantly influence consumer decision-making, with the probability of participation in the shopping event increasing notably when roommates are involved. Additionally, factors such as gender, online shopping experience, and fashion consciousness significantly impact consumers' herd behavior. This research not only enhances the understanding of online shopping behavior among college students but also provides empirical evidence for e-commerce platforms to formulate targeted marketing strategies. Finally, the study discusses the fragility of online consumption activities, the need for adjustments in corporate marketing strategies, and the importance of promoting a healthy online culture.", "authors": "Wang"}, "https://arxiv.org/abs/2412.00280": {"title": "Benchmarking covariates balancing methods, a simulation study", "link": "https://arxiv.org/abs/2412.00280", "description": "Causal inference in observational studies has advanced significantly since Rosenbaum and Rubin introduced propensity score methods. Inverse probability of treatment weighting (IPTW) is widely used to handle confounding bias. However, newer methods, such as energy balancing (EB), kernel optimal matching (KOM), and covariate balancing propensity score by tailored loss function (TLF), offer model-free or non-parametric alternatives. Despite these developments, guidance remains limited in selecting the most suitable method for treatment effect estimation in practical applications. This study compares IPTW with EB, KOM, and TLF, focusing on their ability to estimate treatment effects, since this is the primary objective in many applications. Monte Carlo simulations are used to assess the ability of these balancing methods, combined with different estimators, to estimate the average treatment effect. We compare these methods across a range of scenarios varying the sample size, the degree of confounding, and the proportion of treated subjects. In our simulations, we observe no significant advantages in using EB, KOM, or TLF methods over IPTW. Moreover, these recent methods make obtaining confidence intervals with nominal coverage difficult. 
We also compare the methods on the PROBITsim dataset and obtain results similar to those of our simulations.", "authors": "Peyrot, Porcher, Petit"}, "https://arxiv.org/abs/2412.00304": {"title": "Sparse Bayesian Factor Models with Mass-Nonlocal Factor Scores", "link": "https://arxiv.org/abs/2412.00304", "description": "Bayesian factor models are widely used for dimensionality reduction and pattern discovery in high-dimensional datasets across diverse fields. These models typically focus on imposing priors on the factor loadings to induce sparsity and improve interpretability. However, the factor scores, which play a critical role in individual-level associations with the factors, have received less attention and are typically assumed to follow a standard multivariate normal distribution. This oversimplification fails to capture the heterogeneity observed in real-world applications. We propose the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN), a novel framework that addresses these limitations by introducing a mass-nonlocal prior for the factor scores. This prior provides a more flexible posterior distribution that captures individual heterogeneity while assigning positive probability to zero values. The zero entries in the score matrix characterize the sparsity, offering a robust and novel approach for determining the optimal number of factors. Model parameters are estimated using a fast and efficient Gibbs sampler. Extensive simulations demonstrate that BFMAN outperforms standard Bayesian sparse factor models in factor recovery, sparsity detection, and score estimation. We apply BFMAN to the Hispanic Community Health Study/Study of Latinos and identify dietary patterns and their associations with cardiovascular outcomes, showcasing the model's ability to uncover meaningful insights into diet.", "authors": "Zorzetto, Huang, Vito"}, "https://arxiv.org/abs/2412.00311": {"title": "Disentangling The Effects of Air Pollution on Social Mobility: A Bayesian Principal Stratification Approach", "link": "https://arxiv.org/abs/2412.00311", "description": "Principal stratification provides a robust framework for causal inference, enabling the investigation of the causal link between air pollution exposure and social mobility, mediated by education level. Studying the causal mechanisms through which air pollution affects social mobility is crucial for highlighting the role of education as a mediator and for offering evidence that can inform policies aimed at reducing both environmental and educational inequalities for more equitable social outcomes. In this paper, we introduce a novel Bayesian nonparametric model for principal stratification, leveraging the dependent Dirichlet process to flexibly model the distribution of potential outcomes. By incorporating confounders and potential outcomes for the post-treatment variable in the Bayesian mixture model for the final outcome, our approach improves the accuracy of missing data imputation and allows for the characterization of treatment effects. 
We assess the performance of our method through a simulation study and demonstrate its application in evaluating the principal causal effects of air pollution on social mobility in the United States.", "authors": "Zorzetto, Torre, Petrone et al"}, "https://arxiv.org/abs/2412.00607": {"title": "Risk models from tree-structured Markov random fields following multivariate Poisson distributions", "link": "https://arxiv.org/abs/2412.00607", "description": "We propose risk models for a portfolio of risks, each following a compound Poisson distribution, with dependencies introduced through a family of tree-based Markov random fields with Poisson marginal distributions inspired in C\\^ot\\'e et al. (2024b, arXiv:2408.13649). The diversity of tree topologies allows for the construction of risk models under several dependence schemes. We study the distribution of the random vector of risks and of the aggregate claim amount of the portfolio. We perform two risk management tasks: the assessment of the global risk of the portfolio and its allocation to each component. Numerical examples illustrate the findings and the efficiency of the computation methods developed throughout. We also show that the discussed family of Markov random fields is a subfamily of the multivariate Poisson distribution constructed through common shocks.", "authors": "Cossette, C\\^ot\\'e, Dubeau et al"}, "https://arxiv.org/abs/2412.00634": {"title": "Optimization of Delivery Routes for Fresh E-commerce in Pre-warehouse Mode", "link": "https://arxiv.org/abs/2412.00634", "description": "With the development of the economy, fresh food e-commerce has experienced rapid growth. One of the core competitive advantages of fresh food e-commerce platforms lies in selecting an appropriate logistics distribution model. This study focuses on the front warehouse model, aiming to minimize distribution costs. Considering the perishable nature and short shelf life of fresh food, a distribution route optimization model is constructed, and the saving mileage method is designed to determine the optimal distribution scheme. The results indicate that under certain conditions, different distribution schemes significantly impact the performance of fresh food e-commerce platforms. Based on a review of domestic and international research, this paper takes Dingdong Maicai as an example to systematically introduce the basic concepts of distribution route optimization in fresh food e-commerce platforms under the front warehouse model, analyze the advantages of logistics distribution, and thoroughly examine the importance of distribution routes for fresh products.", "authors": "Harward, Lin, Wang et al"}, "https://arxiv.org/abs/2412.00885": {"title": "Bayesian feature selection in joint models with application to a cardiovascular disease cohort study", "link": "https://arxiv.org/abs/2412.00885", "description": "Cardiovascular disease (CVD) cohorts collect data longitudinally to study the association between CVD risk factors and event times. An important area of scientific research is to better understand what features of CVD risk factor trajectories are associated with the disease. We develop methods for feature selection in joint models where feature selection is viewed as a bi-level variable selection problem with multiple features nested within multiple longitudinal risk factors. 
We modify a previously proposed Bayesian sparse group selection (BSGS) prior, which has not been implemented in joint models until now, to better represent prior beliefs when selecting features both at the group level (longitudinal risk factor) and within group (features of a longitudinal risk factor). One of the advantages of our method over the BSGS method is the ability to account for correlation among the features within a risk factor. As a result, it selects important features similarly, but excludes the unimportant features within risk factors more efficiently than BSGS. We evaluate our prior via simulations and apply our method to data from the Atherosclerosis Risk in Communities (ARIC) study, a population-based, prospective cohort study consisting of over 15,000 men and women aged 45-64, measured at baseline and at six additional times. We evaluate which CVD risk factors and which characteristics of their trajectories (features) are associated with death from CVD. We find that systolic and diastolic blood pressure, glucose, and total cholesterol are important risk factors with different important features associated with CVD death in both men and women.", "authors": "Islam, Daniels, Aghabazaz et al"}, "https://arxiv.org/abs/2412.00926": {"title": "A sensitivity analysis approach to principal stratification with a continuous longitudinal intermediate outcome: Applications to a cohort stepped wedge trial", "link": "https://arxiv.org/abs/2412.00926", "description": "Causal inference in the presence of intermediate variables is a challenging problem in many applications. Principal stratification (PS) provides a framework to estimate principal causal effects (PCE) in such settings. However, existing PS methods primarily focus on settings with binary intermediate variables. We propose a novel approach to estimate PCE with continuous intermediate variables in the context of stepped wedge cluster randomized trials (SW-CRTs). Our method leverages the time-varying treatment assignment in SW-CRTs to calibrate sensitivity parameters and identify the PCE under realistic assumptions. We demonstrate the application of our approach using data from a cohort SW-CRT evaluating the effect of a crowdsourcing intervention on HIV testing uptake among men who have sex with men in China, with social norms as a continuous intermediate variable. The proposed methodology expands the scope of PS to accommodate continuous variables and provides a practical tool for causal inference in SW-CRTs.", "authors": "Yang, Daniels, Li"}, "https://arxiv.org/abs/2412.00945": {"title": "Generalized spatial autoregressive model", "link": "https://arxiv.org/abs/2412.00945", "description": "This paper presents the generalized spatial autoregression (GSAR) model, a significant advance in spatial econometrics for non-normal response variables belonging to the exponential family. The GSAR model extends the logistic SAR, probit SAR, and Poisson SAR approaches by offering greater flexibility in modeling spatial dependencies while ensuring computational feasibility. Fundamentally, theoretical results are established on the convergence, efficiency, and consistency of the estimates obtained by the model. In addition, it improves the statistical properties of existing methods and extends them to new distributions. Simulation samples show the theoretical results and allow a visual comparison with existing methods. An empirical application is made to Republican voting patterns in the United States. 
The GSAR model outperforms standard spatial models by capturing nuanced spatial autocorrelation and accommodating regional heterogeneity, leading to more robust inferences. These findings underline the potential of the GSAR model as an analytical tool for researchers working with categorical or count data or skewed distributions with spatial dependence in diverse domains, such as political science, epidemiology, and market research. In addition, the R codes for estimating the model are provided, which allows its adaptability in these scenarios.", "authors": "Cruz, Toloza-Delgado, Melo"}, "https://arxiv.org/abs/2412.01008": {"title": "Multiple Testing in Generalized Universal Inference", "link": "https://arxiv.org/abs/2412.01008", "description": "Compared to p-values, e-values provably guarantee safe, valid inference. If the goal is to test multiple hypotheses simultaneously, one can construct e-values for each individual test and then use the recently developed e-BH procedure to properly correct for multiplicity. Standard e-value constructions, however, require distributional assumptions that may not be justifiable. This paper demonstrates that the generalized universal inference framework can be used along with the e-BH procedure to control frequentist error rates in multiple testing when the quantities of interest are minimizers of risk functions, thereby avoiding the need for distributional assumptions. We demonstrate the validity and power of this approach via a simulation study, testing the significance of a predictor in quantile regression.", "authors": "Dey, Martin, Williams"}, "https://arxiv.org/abs/2412.01030": {"title": "Iterative Distributed Multinomial Regression", "link": "https://arxiv.org/abs/2412.01030", "description": "This article introduces an iterative distributed computing estimator for the multinomial logistic regression model with large choice sets. Compared to the maximum likelihood estimator, the proposed iterative distributed estimator achieves significantly faster computation and, when initialized with a consistent estimator, attains asymptotic efficiency under a weak dominance condition. Additionally, we propose a parametric bootstrap inference procedure based on the iterative distributed estimator and establish its consistency. Extensive simulation studies validate the effectiveness of the proposed methods and highlight the computational efficiency of the iterative distributed estimator.", "authors": "Fan, Okar, Shi"}, "https://arxiv.org/abs/2412.01084": {"title": "Stochastic Search Variable Selection for Bayesian Generalized Linear Mixed Effect Models", "link": "https://arxiv.org/abs/2412.01084", "description": "Variable selection remains a difficult problem, especially for generalized linear mixed models (GLMMs). While some frequentist approaches to simultaneously select joint fixed and random effects exist, primarily through the use of penalization, existing approaches for Bayesian GLMMs exist only for special cases, like that of logistic regression. In this work, we apply the Stochastic Search Variable Selection (SSVS) approach for the joint selection of fixed and random effects proposed in Yang et al. (2020) for linear mixed models to Bayesian GLMMs. We show that while computational issues remain, SSVS serves as a feasible and effective approach to jointly select fixed and random effects. We demonstrate the effectiveness of the proposed methodology to both simulated and real data. 
Furthermore, we study the role hyperparameters play in the model selection.", "authors": "Ding, Laga"}, "https://arxiv.org/abs/2412.01208": {"title": "Locally robust semiparametric estimation of sample selection models without exclusion restrictions", "link": "https://arxiv.org/abs/2412.01208", "description": "Existing identification and estimation methods for semiparametric sample selection models rely heavily on exclusion restrictions. However, it is difficult in practice to find a credible excluded variable that has a correlation with selection but no correlation with the outcome. In this paper, we establish a new identification result for a semiparametric sample selection model without the exclusion restriction. The key identifying assumptions are nonlinearity on the selection equation and linearity on the outcome equation. The difference in the functional form plays the role of an excluded variable and provides identification power. According to the identification result, we propose to estimate the model by a partially linear regression with a nonparametrically generated regressor. To accommodate modern machine learning methods in generating the regressor, we construct an orthogonalized moment by adding the first-step influence function and develop a locally robust estimator by solving the cross-fitted orthogonalized moment condition. We prove root-n-consistency and asymptotic normality of the proposed estimator under mild regularity conditions. A Monte Carlo simulation shows the satisfactory performance of the estimator in finite samples, and an application to wage regression illustrates its usefulness in the absence of exclusion restrictions.", "authors": "Pan, Zhang"}, "https://arxiv.org/abs/2412.01302": {"title": "The Deep Latent Position Block Model For The Block Clustering And Latent Representation Of Networks", "link": "https://arxiv.org/abs/2412.01302", "description": "The increased quantity of data has led to a soaring use of networks to model relationships between different objects, represented as nodes. Since the number of nodes can be particularly large, the network information must be summarised through node clustering methods. In order to make the results interpretable, a relevant visualisation of the network is also required. To tackle both issues, we propose a new methodology called deep latent position block model (Deep LPBM) which simultaneously provides a network visualisation coherent with block modelling, allowing a clustering more general than community detection methods, as well as a continuous representation of nodes in a latent space given by partial membership vectors. Our methodology is based on a variational autoencoder strategy, relying on a graph convolutional network, with a specifically designed decoder. The inference is done using both variational and stochastic approximations. In order to efficiently select the number of clusters, we provide a comparison of three model selection criteria. An extensive benchmark as well as an evaluation of the partial memberships are provided. 
We conclude with an analysis of a French political blogosphere network and a comparison with another methodology to illustrate the insights provided by Deep LPBM results.", "authors": "(SU, LPSM), (UCA et al"}, "https://arxiv.org/abs/2412.01367": {"title": "From rotational to scalar invariance: Enhancing identifiability in score-driven factor models", "link": "https://arxiv.org/abs/2412.01367", "description": "We show that, for a certain class of scaling matrices including the commonly used inverse square-root of the conditional Fisher Information, score-driven factor models are identifiable up to a multiplicative scalar constant under very mild restrictions. This result has no analogue in parameter-driven models, as it exploits the different structure of the score-driven factor dynamics. Consequently, score-driven models offer a clear advantage in terms of economic interpretability compared to parameter-driven factor models, which are identifiable only up to orthogonal transformations. Our restrictions are order-invariant and can be generalized to score-driven factor models with dynamic loadings and nonlinear factor models. We extensively test the identification strategy using simulated and real data. The empirical analysis on financial and macroeconomic data reveals a substantial increase in log-likelihood ratios and significantly improved out-of-sample forecast performance when switching from the classical restrictions adopted in the literature to our more flexible specifications.", "authors": "Buccheri, Corsi, Dzuverovic"}, "https://arxiv.org/abs/2412.01464": {"title": "Nonparametric directional variogram estimation in the presence of outlier blocks", "link": "https://arxiv.org/abs/2412.01464", "description": "This paper proposes robust estimators of the variogram, a statistical tool that is commonly used in geostatistics to capture the spatial dependence structure of data. The new estimators are based on the highly robust minimum covariance determinant estimator and estimate the directional variogram for several lags jointly. Simulations and breakdown considerations confirm the good robustness properties of the new estimators. While Genton's estimator, based on robust estimation of the variance of pairwise sums and differences, performs well in the case of isolated outliers, the new estimators based on robust estimation of multivariate variance and covariance matrices outperform the established alternatives in the presence of outlier blocks in the data. The methods are illustrated by an application to satellite data, where outlier blocks may occur because of, e.g., clouds.", "authors": "Gierse, Fried"}, "https://arxiv.org/abs/2412.01603": {"title": "A Dimension-Agnostic Bootstrap Anderson-Rubin Test For Instrumental Variable Regressions", "link": "https://arxiv.org/abs/2412.01603", "description": "Weak-identification-robust Anderson-Rubin (AR) tests for instrumental variable (IV) regressions are typically developed separately depending on whether the number of IVs is treated as fixed or increasing with the sample size. These tests rely on distinct test statistics and critical values. To apply them, researchers are forced to take a stance on the asymptotic behavior of the number of IVs, which can be ambiguous when the number is moderate. In this paper, we propose a bootstrap-based, dimension-agnostic AR test. 
By deriving strong approximations for the test statistic and its bootstrap counterpart, we show that our new test has a correct asymptotic size regardless of whether the number of IVs is fixed or increasing -- allowing, but not requiring, the number of IVs to exceed the sample size. We also analyze the power properties of the proposed uniformly valid test under both fixed and increasing numbers of IVs.", "authors": "Lim, Wang, Zhang"}, "https://arxiv.org/abs/2412.00658": {"title": "Probabilistic Predictions of Option Prices Using Multiple Sources of Data", "link": "https://arxiv.org/abs/2412.00658", "description": "A new modular approximate Bayesian inferential framework is proposed that enables fast calculation of probabilistic predictions of future option prices. We exploit multiple information sources, including daily spot returns, high-frequency spot data and option prices. A benefit of this modular Bayesian approach is that it allows us to work with the theoretical option pricing model, without needing to specify an arbitrary statistical model that links the theoretical prices to their observed counterparts. We show that our approach produces accurate probabilistic predictions of option prices in realistic scenarios and, despite not explicitly modelling pricing errors, the method is shown to be robust to their presence. Predictive accuracy based on the Heston stochastic volatility model, with predictions produced via rapid real-time updates, is illustrated empirically for short-maturity options.", "authors": "Maneesoonthorn, Frazier, Martin"}, "https://arxiv.org/abs/2412.00753": {"title": "The ecological forecast horizon revisited: Potential, actual and relative system predictability", "link": "https://arxiv.org/abs/2412.00753", "description": "Ecological forecasts are model-based statements about currently unknown ecosystem states in time or space. For a model forecast to be useful to inform decision-makers, model validation and verification determine adequateness. The measure of forecast goodness that can be translated into a limit up to which a forecast is acceptable is known as the `forecast horizon'. While verification of meteorological models follows strict criteria with established metrics and forecast horizons, assessments of ecological forecasting models still remain experiment-specific and forecast horizons are rarely reported. As such, users of ecological forecasts remain uninformed of how far into the future statements can be trusted. In this work, we synthesise existing approaches, define empirical forecast horizons in a unified framework for assessing ecological predictability and offer recipes on their computation. We distinguish upper and lower boundary estimates of predictability limits, reflecting the model's potential and actual forecast horizon, and show how a benchmark model can help determine its relative forecast horizon. The approaches are demonstrated with four case studies from population, ecosystem, and earth system research.", "authors": "Wesselkamp, Albrecht, Pinnington et al"}, "https://arxiv.org/abs/2412.01344": {"title": "Practical Performative Policy Learning with Strategic Agents", "link": "https://arxiv.org/abs/2412.01344", "description": "This paper studies the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes, inducing an endogenous distribution shift. 
There has been growing interest in training machine learning models in strategic environments, including strategic classification and performative prediction. However, existing approaches often rely on restrictive parametric assumptions: micro-level utility models in strategic classification and macro-level data distribution maps in performative prediction, severely limiting scalability and generalizability. We approach this problem as a complex causal inference task, relaxing parametric assumptions on both micro-level agent behavior and macro-level data distribution. Leveraging bounded rationality, we uncover a practical low-dimensional structure in distribution shifts and construct an effective mediator in the causal path from the deployed model to the shifted data. We then propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map. Our algorithm efficiently utilizes batch feedback and limited manipulation patterns. Our approach achieves high sample efficiency compared to methods reliant on bandit feedback or zero-order optimization. We also provide theoretical guarantees for algorithmic convergence. Extensive and challenging experiments on high-dimensional settings demonstrate our method's practical efficacy.", "authors": "Chen, Chen, Li"}, "https://arxiv.org/abs/2412.01399": {"title": "Navigating Challenges in Spatio-temporal Modelling of Antarctic Krill Abundance: Addressing Zero-inflated Data and Misaligned Covariates", "link": "https://arxiv.org/abs/2412.01399", "description": "Antarctic krill (Euphausia superba) are among the most abundant species on our planet and serve as a vital food source for many marine predators in the Southern Ocean. In this paper, we utilise statistical spatio-temporal methods to combine data from various sources and resolutions, aiming to accurately model krill abundance. Our focus lies in fitting the model to a dataset comprising acoustic measurements of krill biomass. To achieve this, we integrate climate covariates obtained from satellite imagery and from drifting surface buoys (also known as drifters). Additionally, we use sparsely collected krill biomass data obtained from net fishing efforts (KRILLBASE) for validation. However, integrating these multiple heterogeneous data sources presents significant modelling challenges, including spatio-temporal misalignment and inflated zeros in the observed data. To address these challenges, we fit a Hurdle-Gamma model to jointly describe the occurrence of zeros and the krill biomass for the non-zero observations, while also accounting for misaligned and heterogeneous data sources, including drifters. Therefore, our work presents a comprehensive framework for analysing and predicting krill abundance in the Southern Ocean, leveraging information from various sources and formats. This is crucial due to the impact of krill fishing, as understanding their distribution is essential for informed management decisions and fishing regulations aimed at protecting the species.", "authors": "Amaral, Sykulski, Cavan et al"}, "https://arxiv.org/abs/2206.10475": {"title": "New possibilities in identification of binary choice models with fixed effects", "link": "https://arxiv.org/abs/2206.10475", "description": "We study the identification of binary choice models with fixed effects. We provide a condition called sign saturation and show that this condition is sufficient for the identification of the model. 
In particular, we can guarantee identification even with bounded regressors. We also show that without this condition, the model is not identified unless the error distribution belongs to a small class. The same sign saturation condition is also essential for identifying the sign of treatment effects. A test is provided to check the sign saturation condition and can be implemented using existing algorithms for the maximum score estimator.", "authors": "Zhu"}, "https://arxiv.org/abs/2209.04787": {"title": "A time-varying bivariate copula joint model for longitudinal and time-to-event data", "link": "https://arxiv.org/abs/2209.04787", "description": "A time-varying bivariate copula joint model, which models the repeatedly measured longitudinal outcome at each time point and the survival data jointly by both the random effects and time-varying bivariate copulas, is proposed in this paper. A regular joint model normally supposes there exist subject-specific latent random effects or classes shared by the longitudinal and time-to-event processes and the two processes are conditionally independent given these latent variables. Under this assumption, the joint likelihood of the two processes is straightforward to derive and their association, as well as heterogeneity among the population, are naturally introduced by the unobservable latent variables. However, because of the unobservable nature of these latent variables, the conditional independence assumption is difficult to verify. Therefore, besides the random effects, a time-varying bivariate copula is introduced to account for the extra time-dependent association between the two processes. The proposed model includes a regular joint model as a special case under some copulas. Simulation studies indicate that the parameter estimators in the proposed model are robust against copula misspecification and that it has superior performance in predicting survival probabilities compared to the regular joint model. A real data application to the primary biliary cirrhosis (PBC) data is performed.", "authors": "Zhang, Charalambous, Foster"}, "https://arxiv.org/abs/2308.02450": {"title": "Composite Quantile Factor Model", "link": "https://arxiv.org/abs/2308.02450", "description": "This paper introduces the composite quantile factor model for factor analysis in high-dimensional panel data. We propose to estimate the factors and factor loadings across multiple quantiles of the data, allowing the estimates to better adapt to features of the data at different quantiles while still modeling the mean of the data. We develop the limiting distribution of the estimated factors and factor loadings, and an information criterion for consistent factor number selection is also discussed. Simulations show that the proposed estimator and the information criterion have good finite sample properties for several non-normal distributions under consideration. We also consider an empirical study on the factor analysis for 246 quarterly macroeconomic variables. A companion R package cqrfactor is developed.", "authors": "Huang"}, "https://arxiv.org/abs/2309.01637": {"title": "The Robust F-Statistic as a Test for Weak Instruments", "link": "https://arxiv.org/abs/2309.01637", "description": "Montiel Olea and Pflueger (2013) proposed the effective F-statistic as a test for weak instruments in terms of the Nagar bias of the two-stage least squares (2SLS) estimator relative to a benchmark worst-case bias.
We show that their methodology applies to a class of linear generalized method of moments (GMM) estimators with an associated class of generalized effective F-statistics. The standard nonhomoskedasticity robust F-statistic is a member of this class. The associated GMMf estimator, with the extension f for first-stage, is a novel and unusual estimator as the weight matrix is based on the first-stage residuals. As the robust F-statistic can also be used as a test for underidentification, expressions for the calculation of the weak-instruments critical values in terms of the Nagar bias of the GMMf estimator relative to the benchmark simplify and no simulation methods or Patnaik (1949) distributional approximations are needed. In the grouped-data IV designs of Andrews (2018), where the robust F-statistic is large but the effective F-statistic is small, the GMMf estimator is shown to behave much better in terms of bias than the 2SLS estimator, as expected by the weak-instruments test results.", "authors": "Windmeijer"}, "https://arxiv.org/abs/2312.13148": {"title": "Partially factorized variational inference for high-dimensional mixed models", "link": "https://arxiv.org/abs/2312.13148", "description": "While generalized linear mixed models are a fundamental tool in applied statistics, many specifications, such as those involving categorical factors with many levels or interaction terms, can be computationally challenging to estimate due to the need to compute or approximate high-dimensional integrals. Variational inference is a popular way to perform such computations, especially in the Bayesian context. However, naive use of such methods can provide unreliable uncertainty quantification. We show that this is indeed the case for mixed models, proving that standard mean-field variational inference dramatically underestimates posterior uncertainty in high-dimensions. We then show how appropriately relaxing the mean-field assumption leads to methods whose uncertainty quantification does not deteriorate in high-dimensions, and whose total computational cost scales linearly with the number of parameters and observations. Our theoretical and numerical results focus on mixed models with Gaussian or binomial likelihoods, and rely on connections to random graph theory to obtain sharp high-dimensional asymptotic analysis. We also provide generic results, which are of independent interest, relating the accuracy of variational inference to the convergence rate of the corresponding coordinate ascent algorithm that is used to find it. Our proposed methodology is implemented in the R package, see https://github.com/mgoplerud/vglmer . Numerical results with simulated and real data examples illustrate the favourable computation cost versus accuracy trade-off of our approach compared to various alternatives.", "authors": "Goplerud, Papaspiliopoulos, Zanella"}, "https://arxiv.org/abs/2303.03092": {"title": "Environment Invariant Linear Least Squares", "link": "https://arxiv.org/abs/2303.03092", "description": "This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. 
The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we further show that the $\\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.", "authors": "Fan, Fang, Gu et al"}, "https://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "https://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated with a class of partially observed stochastic differential equations (SDE) driven by jump processes. Such models are routinely found in applications; here we focus on the case of neuroscience. The data are assumed to be observed regularly in time and driven by the SDE model with unknown parameters. In practice, the SDE may not have an analytically tractable solution, and this leads naturally to a time-discretization. We adapt the multilevel Markov chain Monte Carlo method of [11], which works with a hierarchy of time discretizations, and show empirically and theoretically that this is preferable to using a single time discretization. The improvement is in terms of the computational cost needed to obtain a pre-specified numerical error. Our approach is illustrated on models that are found in neuroscience.", "authors": "Maama, Jasra, Kamatani"}, "https://arxiv.org/abs/2311.15257": {"title": "Career Modeling with Missing Data and Traces", "link": "https://arxiv.org/abs/2311.15257", "description": "Many social scientists study the career trajectories of populations of interest, such as economic and administrative elites. However, data to document such processes are rarely completely available, which motivates the adoption of inference tools that can account for large numbers of missing values. Taking the example of public-private paths of elite civil servants in France, we introduce binary Markov switching models to perform Bayesian data augmentation. Our procedure relies on two data sources: (1) detailed observations of a small number of individual trajectories, and (2) less informative ``traces'' left by all individuals, which we model for imputation of missing data. An advantage of this model class is that it maintains the properties of hidden Markov models and enables a tailored sampler to target the posterior, while allowing for varying parameters across individuals and time.
We provide two applied studies which demonstrate this can be used to properly test substantive hypotheses, and expand the social scientific literature in various ways. We notably show that the rate at which ENA graduates exit the French public sector has not increased since 1990, but that the rate at which they come back has increased.", "authors": "Voldoire, Ryder, Lahfa"}, "https://arxiv.org/abs/2412.02014": {"title": "Dynamic Prediction of High-density Generalized Functional Data with Fast Generalized Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2412.02014", "description": "Dynamic prediction, which typically refers to the prediction of future outcomes using historical records, is often of interest in biomedical research. For datasets with large sample sizes, high measurement density, and complex correlation structures, traditional methods are often infeasible because of the computational burden associated with both data scale and model complexity. Moreover, many models do not directly facilitate out-of-sample predictions for generalized outcomes. To address these issues, we develop a novel approach for dynamic predictions based on a recently developed method estimating complex patterns of variation for exponential family data: fast Generalized Functional Principal Components Analysis (fGFPCA). Our method is able to handle large-scale, high-density repeated measures much more efficiently with its implementation feasible even on personal computational resources (e.g., a standard desktop or laptop computer). The proposed method makes highly flexible and accurate predictions of future trajectories for data that exhibit high degrees of nonlinearity, and allows for out-of-sample predictions to be obtained without reestimating any parameters. A simulation study is designed and implemented to illustrate the advantages of this method. To demonstrate its practical utility, we also conducted a case study to predict diurnal active/inactive patterns using accelerometry data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014. Both the simulation study and the data application demonstrate the better predictive performance and high computational efficiency of the proposed method compared to existing methods. The proposed method also obtains more personalized prediction that improves as more information becomes available, which is an essential goal of dynamic prediction that other methods fail to achieve.", "authors": "Jin, Leroux"}, "https://arxiv.org/abs/2412.02105": {"title": "The causal effects of modified treatment policies under network interference", "link": "https://arxiv.org/abs/2412.02105", "description": "Modified treatment policies are a widely applicable class of interventions used to study the causal effects of continuous exposures. Approaches to evaluating their causal effects assume no interference, meaning that such effects cannot be learned from data in settings where the exposure of one unit affects the outcome of others, as is common in spatial or network data. We introduce a new class of intervention, induced modified treatment policies, which we show identify such causal effects in the presence of network interference. Building on recent developments in network causal inference, we provide flexible, semi-parametric efficient estimators of the identified statistical estimand. Simulation experiments demonstrate that an induced modified treatment policy can eliminate causal (or identification) bias resulting from interference. 
We use the methods developed to evaluate the effect of zero-emission vehicle uptake on air pollution in California, strengthening prior evidence.", "authors": "Balkus, Delaney, Hejazi"}, "https://arxiv.org/abs/2412.02151": {"title": "Efficient Analysis of Latent Spaces in Heterogeneous Networks", "link": "https://arxiv.org/abs/2412.02151", "description": "This work proposes a unified framework for efficient estimation under latent space modeling of heterogeneous networks. We consider a class of latent space models that decompose latent vectors into shared and network-specific components across networks. We develop a novel procedure that first identifies the shared latent vectors and further refines estimates through efficient score equations to achieve statistical efficiency. Oracle error rates for estimating the shared and heterogeneous latent vectors are established simultaneously. The analysis framework offers remarkable flexibility, accommodating various types of edge weights under exponential family distributions.", "authors": "Tian, Sun, He"}, "https://arxiv.org/abs/2412.02182": {"title": "Searching for local associations while controlling the false discovery rate", "link": "https://arxiv.org/abs/2412.02182", "description": "We introduce local conditional hypotheses that express how the relation between explanatory variables and outcomes changes across different contexts, described by covariates. By expanding upon the model-X knockoff filter, we show how to adaptively discover these local associations, all while controlling the false discovery rate. Our enhanced inferences can help explain sample heterogeneity and uncover interactions, making better use of the capabilities offered by modern machine learning models. Specifically, our method is able to leverage any model for the identification of data-driven hypotheses pertaining to different contexts. Then, it rigorously tests these hypotheses without succumbing to selection bias. Importantly, our approach is efficient and does not require sample splitting. We demonstrate the effectiveness of our method through numerical experiments and by studying the genetic architecture of Waist-Hip-Ratio across different sexes in the UKBiobank.", "authors": "Gablenz, Sesia, Sun et al"}, "https://arxiv.org/abs/2412.02183": {"title": "Endogenous Interference in Randomized Experiments", "link": "https://arxiv.org/abs/2412.02183", "description": "This paper investigates the identification and inference of treatment effects in randomized controlled trials with social interactions. Two key network features characterize the setting and introduce endogeneity: (1) latent variables may affect both network formation and outcomes, and (2) the intervention may alter network structure, mediating treatment effects. I make three contributions. First, I define parameters within a post-treatment network framework, distinguishing direct effects of treatment from indirect effects mediated through changes in network structure. I provide a causal interpretation of the coefficients in a linear outcome model. For estimation and inference, I focus on a specific form of peer effects, represented by the fraction of treated friends. Second, in the absence of endogeneity, I establish the consistency and asymptotic normality of ordinary least squares estimators.
Third, if endogeneity is present, I propose addressing it through shift-share instrumental variables, demonstrating the consistency and asymptotic normality of instrumental variable estimators in relatively sparse networks. For denser networks, I propose a denoised estimator based on eigendecomposition to restore consistency. Finally, I revisit Prina (2015) as an empirical illustration, demonstrating that treatment can influence outcomes both directly and through network structure changes.", "authors": "Gao"}, "https://arxiv.org/abs/2412.02333": {"title": "Estimation of a multivariate von Mises distribution for contaminated torus data", "link": "https://arxiv.org/abs/2412.02333", "description": "The occurrence of atypical circular observations on the torus can badly affect parameters estimation of the multivariate von Mises distribution. This paper addresses the problem of robust fitting of the multivariate von Mises model using the weighted likelihood methodology. The key ingredients are non-parametric density estimation for multivariate circular data and the definition of appropriate weighted estimating equations. The finite sample behavior of the proposed weighted likelihood estimator has been investigated by Monte Carlo numerical studies and real data examples.", "authors": "Bertagnolli, Greco, Agostinelli"}, "https://arxiv.org/abs/2412.02513": {"title": "Quantile-Crossing Spectrum and Spline Autoregression Estimation", "link": "https://arxiv.org/abs/2412.02513", "description": "The quantile-crossing spectrum is the spectrum of quantile-crossing processes created from a time series by the indicator function that shows whether or not the time series lies above or below a given quantile at a given time. This bivariate function of frequency and quantile level provides a richer view of serial dependence than that offered by the ordinary spectrum. We propose a new method for estimating the quantile-crossing spectrum as a bivariate function of frequency and quantile level. The proposed method, called spline autoregression (SAR), jointly fits an AR model to the quantile-crossing series across multiple quantiles; the AR coefficients are represented as spline functions of the quantile level and penalized for their roughness. Numerical experiments show that when the underlying spectrum is smooth in quantile level the proposed method is able to produce more accurate estimates in comparison with the alternative that ignores the smoothness.", "authors": "Li"}, "https://arxiv.org/abs/2412.02654": {"title": "Simple and Effective Portfolio Construction with Crypto Assets", "link": "https://arxiv.org/abs/2412.02654", "description": "We consider the problem of constructing a portfolio that combines traditional financial assets with crypto assets. We show that despite the documented attributes of crypto assets, such as high volatility, heavy tails, excess kurtosis, and skewness, a simple extension of traditional risk allocation provides robust solutions for integrating these emerging assets into broader investment strategies. 
Examination of the risk allocation holdings suggests an even simpler method, analogous to the traditional 60/40 stocks/bonds allocation, involving a fixed allocation to crypto and traditional assets, dynamically diluted with cash to achieve a target risk level.", "authors": "Johansson, Boyd"}, "https://arxiv.org/abs/2412.02660": {"title": "A Markowitz Approach to Managing a Dynamic Basket of Moving-Band Statistical Arbitrages", "link": "https://arxiv.org/abs/2412.02660", "description": "We consider the problem of managing a portfolio of moving-band statistical arbitrages (MBSAs), inspired by the Markowitz optimization framework. We show how to manage a dynamic basket of MBSAs, and illustrate the method on recent historical data, showing that it can perform very well in terms of risk-adjusted return, essentially uncorrelated with the market.", "authors": "Johansson, Schmelzer, Boyd"}, "https://arxiv.org/abs/2412.01953": {"title": "The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications", "link": "https://arxiv.org/abs/2412.01953", "description": "Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.", "authors": "Brouillard, Squires, Wahl et al"}, "https://arxiv.org/abs/2412.02251": {"title": "Selective Reviews of Bandit Problems in AI via a Statistical View", "link": "https://arxiv.org/abs/2412.02251", "description": "Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. We also extend the discussion to $K$-armed contextual bandits and SCAB, examining their methodologies, regret analyses, and discussing the relation between the SCAB problems and the functional data analysis. 
Finally, we highlight recent advances and ongoing challenges in the field.", "authors": "Zhou, Wei, Zhang"}, "https://arxiv.org/abs/2412.02355": {"title": "TITE-CLRM: Towards efficient time-to-event dose-escalation guidance of multi-cycle cancer therapies", "link": "https://arxiv.org/abs/2412.02355", "description": "Treatment of cancer has rapidly evolved over time in quite dramatic ways, for example from chemotherapies, targeted therapies to immunotherapies and chimeric antigen receptor T-cells. Nonetheless, the basic design of early phase I trials in oncology still follows pre-dominantly a dose-escalation design. These trials monitor safety over the first treatment cycle in order to escalate the dose of the investigated drug. However, over time studying additional factors such as drug combinations and/or variation in the timing of dosing became important as well. Existing designs were continuously enhanced and expanded to account for increased trial complexity. With toxicities occurring at later stages beyond the first cycle and the need to treat patients over multiple cycles, the focus on the first treatment cycle only is becoming a limitation in nowadays multi-cycle treatment therapies. Here we introduce a multi-cycle time-to-event model (TITE-CLRM: Time-Interval-To-Event Complementary-Loglog Regression Model) allowing guidance of dose-escalation trials studying multi-cycle therapies. The challenge lies in balancing the need to monitor safety of longer treatment periods with the need to continuously enroll patients safely. The proposed multi-cycle time to event model is formulated as an extension to established concepts like the escalation with over dose control principle. The model is motivated from a current drug development project and evaluated in a simulation study.", "authors": "Widmer, Weber, Xu et al"}, "https://arxiv.org/abs/2412.02380": {"title": "Use of surrogate endpoints in health technology assessment: a review of selected NICE technology appraisals in oncology", "link": "https://arxiv.org/abs/2412.02380", "description": "Objectives: Surrogate endpoints, used to substitute for and predict final clinical outcomes, are increasingly being used to support submissions to health technology assessment agencies. The increase in use of surrogate endpoints has been accompanied by literature describing frameworks and statistical methods to ensure their robust validation. The aim of this review was to assess how surrogate endpoints have recently been used in oncology technology appraisals by the National Institute for Health and Care Excellence (NICE) in England and Wales.\n Methods: This paper identified technology appraisals in oncology published by NICE between February 2022 and May 2023. Data are extracted on methods for the use and validation of surrogate endpoints.\n Results: Of the 47 technology appraisals in oncology available for review, 18 (38 percent) utilised surrogate endpoints, with 37 separate surrogate endpoints being discussed. However, the evidence supporting the validity of the surrogate relationship varied significantly across putative surrogate relationships with 11 providing RCT evidence, 7 providing evidence from observational studies, 12 based on clinical opinion and 7 providing no evidence for the use of surrogate endpoints.\n Conclusions: This review supports the assertion that surrogate endpoints are frequently used in oncology technology appraisals in England and Wales. 
Despite increasing availability of statistical methods and guidance on appropriate validation of surrogate endpoints, this review highlights that use and validation of surrogate endpoints can vary between technology appraisals which can lead to uncertainty in decision-making.", "authors": "Wheaton, Bujkiewicz"}, "https://arxiv.org/abs/2412.02439": {"title": "Nature versus nurture in galaxy formation: the effect of environment on star formation with causal machine learning", "link": "https://arxiv.org/abs/2412.02439", "description": "Understanding how galaxies form and evolve is at the heart of modern astronomy. With the advent of large-scale surveys and simulations, remarkable progress has been made in the last few decades. Despite this, the physical processes behind the phenomena, and particularly their importance, remain far from known, as correlations have primarily been established rather than the underlying causality. We address this challenge by applying the causal inference framework. Specifically, we tackle the fundamental open question of whether galaxy formation and evolution depends more on nature (i.e., internal processes) or nurture (i.e., external processes), by estimating the causal effect of environment on star-formation rate in the IllustrisTNG simulations. To do so, we develop a comprehensive causal model and employ cutting-edge techniques from epidemiology to overcome the long-standing problem of disentangling nature and nurture. We find that the causal effect is negative and substantial, with environment suppressing the SFR by a maximal factor of $\\sim100$. While the overall effect at $z=0$ is negative, in the early universe, environment is discovered to have a positive impact, boosting star formation by a factor of $\\sim10$ at $z\\sim1$ and by even greater amounts at higher redshifts. Furthermore, we show that: (i) nature also plays an important role, as ignoring it underestimates the causal effect in intermediate-density environments by a factor of $\\sim2$, (ii) controlling for the stellar mass at a snapshot in time, as is common in the literature, is not only insufficient to disentangle nature and nurture but actually has an adverse effect, though (iii) stellar mass is an adequate proxy of the effects of nature. Finally, this work may prove a useful blueprint for extracting causal insights in other fields that deal with dynamical systems with closed feedback loops, such as the Earth's climate.", "authors": "Mucesh, Hartley, Gilligan-Lee et al"}, "https://arxiv.org/abs/2412.02640": {"title": "On the optimality of coin-betting for mean estimation", "link": "https://arxiv.org/abs/2412.02640", "description": "Confidence sequences are sequences of confidence sets that adapt to incoming data while maintaining validity. Recent advances have introduced an algorithmic formulation for constructing some of the tightest confidence sequences for bounded real random variables. These approaches use a coin-betting framework, where a player sequentially bets on differences between potential mean values and observed data. 
This letter establishes that such coin-betting formulation is optimal among all possible algorithmic frameworks for constructing confidence sequences that build on e-variables and sequential hypothesis testing.", "authors": "Clerico"}, "https://arxiv.org/abs/1907.00287": {"title": "Estimating Treatment Effect under Additive Hazards Models with High-dimensional Covariates", "link": "https://arxiv.org/abs/1907.00287", "description": "Estimating causal effects for survival outcomes in the high-dimensional setting is an extremely important topic for many biomedical applications as well as areas of social sciences. We propose a new orthogonal score method for treatment effect estimation and inference that results in asymptotically valid confidence intervals assuming only good estimation properties of the hazard outcome model and the conditional probability of treatment. This guarantee allows us to provide valid inference for the conditional treatment effect under the high-dimensional additive hazards model under considerably more generality than existing approaches. In addition, we develop a new Hazards Difference (HDi), estimator. We showcase that our approach has double-robustness properties in high dimensions: with cross-fitting, the HDi estimate is consistent under a wide variety of treatment assignment models; the HDi estimate is also consistent when the hazards model is misspecified and instead the true data generating mechanism follows a partially linear additive hazards model. We further develop a novel sparsity doubly robust result, where either the outcome or the treatment model can be a fully dense high-dimensional model. We apply our methods to study the treatment effect of radical prostatectomy versus conservative management for prostate cancer patients using the SEER-Medicare Linked Data.", "authors": "Hou, Bradic, Xu"}, "https://arxiv.org/abs/2305.19242": {"title": "Nonstationary Gaussian Process Surrogates", "link": "https://arxiv.org/abs/2305.19242", "description": "We provide a survey of nonstationary surrogate models which utilize Gaussian processes (GPs) or variations thereof, including nonstationary kernel adaptations, partition and local GPs, and spatial warpings through deep Gaussian processes. We also overview publicly available software implementations and conclude with a bake-off involving an 8-dimensional satellite drag computer experiment. Code for this example is provided in a public git repository.", "authors": "Booth, Cooper, Gramacy"}, "https://arxiv.org/abs/2401.05315": {"title": "Multi-resolution filters via linear projection for large spatio-temporal datasets", "link": "https://arxiv.org/abs/2401.05315", "description": "Advances in compact sensing devices mounted on satellites have facilitated the collection of large spatio-temporal datasets with coordinates. Since such datasets are often incomplete and noisy, it is useful to create the prediction surface of a spatial field. To this end, we consider an online filtering inference by using the Kalman filter based on linear Gaussian state-space models. However, the Kalman filter is impractically time-consuming when the number of locations in spatio-temporal datasets is large. To address this problem, we propose a multi-resolution filter via linear projection (MRF-lp), a fast computation method for online filtering inference. 
In the MRF-lp, by carrying out a multi-resolution approximation via linear projection (MRA-lp), the forecast covariance matrix can be approximated while capturing both the large- and small-scale spatial variations. As a result of this approximation, our proposed MRF-lp preserves a block-sparse structure of some matrices appearing in the MRF-lp through time, which leads to the scalability of this algorithm. Additionally, we discuss extensions of the MRF-lp to a nonlinear and non-Gaussian case. Simulation studies and real data analysis for total precipitable water vapor demonstrate that our proposed approach performs well compared with the related methods.", "authors": "Hirano, Ishihara"}, "https://arxiv.org/abs/2312.00710": {"title": "SpaCE: The Spatial Confounding Environment", "link": "https://arxiv.org/abs/2312.00710", "description": "Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and the evaluation of machine learning and causal inference models. The SpaCE project provides several dozen datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.", "authors": "Tec, Trisovic, Audirac et al"}, "https://arxiv.org/abs/2412.02767": {"title": "Endogenous Heteroskedasticity in Linear Models", "link": "https://arxiv.org/abs/2412.02767", "description": "Linear regressions with endogeneity are widely used to estimate causal effects. This paper studies a statistical framework that has two common issues: endogeneity of the regressors, and heteroskedasticity that is allowed to depend on endogenous regressors, i.e., endogenous heteroskedasticity. We show that the presence of such conditional heteroskedasticity in the structural regression renders the two-stage least squares estimator inconsistent. To solve this issue, we propose sufficient conditions together with a control function approach to identify and estimate the causal parameters of interest. We establish statistical properties of the estimator, namely consistency and asymptotic normality, and propose valid inference procedures. Monte Carlo simulations provide evidence of the finite sample performance of the proposed methods, and evaluate different implementation procedures.
We revisit an empirical application about job training to illustrate the methods.", "authors": "Alejo, Galvao, Martinez-Iriarte et al"}, "https://arxiv.org/abs/2412.02791": {"title": "Chain-linked Multiple Matrix Integration via Embedding Alignment", "link": "https://arxiv.org/abs/2412.02791", "description": "Motivated by the increasing demand for multi-source data integration in various scientific fields, in this paper we study matrix completion in scenarios where the data exhibits certain block-wise missing structures -- specifically, where only a few noisy submatrices representing (overlapping) parts of the full matrix are available. We propose the Chain-linked Multiple Matrix Integration (CMMI) procedure to efficiently combine the information that can be extracted from these individual noisy submatrices. CMMI begins by deriving entity embeddings for each observed submatrix, then aligns these embeddings using overlapping entities between pairs of submatrices, and finally aggregates them to reconstruct the entire matrix of interest. We establish, under mild regularity conditions, entrywise error bounds and normal approximations for the CMMI estimates. Simulation studies and real data applications show that CMMI is computationally efficient and effective in recovering the full matrix, even when overlaps between the observed submatrices are minimal.", "authors": "Zheng, Tang"}, "https://arxiv.org/abs/2412.02945": {"title": "Detection of Multiple Influential Observations on Model Selection", "link": "https://arxiv.org/abs/2412.02945", "description": "Outlying observations are frequently encountered in a wide spectrum of scientific domains, posing significant challenges for the generalizability of statistical models and the reproducibility of downstream analysis. These observations can be identified through influential diagnosis, which refers to the detection of observations that are unduly influential on diverse facets of statistical inference. To date, methods for identifying observations influencing the choice of a stochastically selected submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors p exceeds the sample size n. Recently we proposed an improved diagnostic measure to handle this setting. However, its distributional properties and approximations have not yet been explored. To address this shortcoming, the notion of exchangeability is revived, and used to determine the exact finite- and large-sample distributions of our assessment metric. This forms the foundation for the introduction of both parametric and non-parametric approaches for its approximation and the establishment of thresholds for diagnosis. The resulting framework is extended to logistic regression models, followed by a simulation study conducted to assess the performance of various detection procedures. Finally the framework is applied to data from an fMRI study of thermal pain, with the goal of identifying outlying subjects that could distort the formulation of statistical models using functional brain activity in predicting physical pain ratings. Both linear and logistic regression models are used to demonstrate the benefits of detection and compare the performances of different detection procedures. 
In particular, two additional influential observations are identified, which are not discovered by previous studies.", "authors": "Zhang, Asgharian, Lindquist"}, "https://arxiv.org/abs/2412.02970": {"title": "Uncovering dynamics between SARS-CoV-2 wastewater concentrations and community infections via Bayesian spatial functional concurrent regression", "link": "https://arxiv.org/abs/2412.02970", "description": "Monitoring wastewater concentrations of SARS-CoV-2 yields a low-cost, noninvasive method for tracking disease prevalence and provides early warning signs of upcoming outbreaks in the serviced communities. There is tremendous clinical and public health interest in understanding the exact dynamics between wastewater viral loads and infection rates in the population. As both data sources may contain substantial noise and missingness, in addition to spatial and temporal dependencies, properly modeling this relationship must address these numerous complexities simultaneously while providing interpretable and clear insights. We propose a novel Bayesian functional concurrent regression model that accounts for both spatial and temporal correlations while estimating the dynamic effects between wastewater concentrations and positivity rates over time. We explicitly model the time lag between the two series and provide full posterior inference on the possible delay between spikes in wastewater concentrations and subsequent outbreaks. We estimate a time lag likely between 5 to 11 days between spikes in wastewater levels and reported clinical positivity rates. Additionally, we find a dynamic relationship between wastewater concentration levels and the strength of its association with positivity rates that fluctuates between outbreaks and non-outbreaks.", "authors": "Sun, Schedler, Kowal et al"}, "https://arxiv.org/abs/2412.02986": {"title": "Bayesian Transfer Learning for Enhanced Estimation and Inference", "link": "https://arxiv.org/abs/2412.02986", "description": "Transfer learning enhances model performance in a target population with limited samples by leveraging knowledge from related studies. While many works focus on improving predictive performance, challenges of statistical inference persist. Bayesian approaches naturally offer uncertainty quantification for parameter estimates, yet existing Bayesian transfer learning methods are typically limited to single-source scenarios or require individual-level data. We introduce TRansfer leArning via guideD horseshoE prioR (TRADER), a novel approach enabling multi-source transfer through pre-trained models in high-dimensional linear regression. TRADER shrinks target parameters towards a weighted average of source estimates, accommodating sources with different scales. Theoretical investigation shows that TRADER achieves faster posterior contraction rates than standard continuous shrinkage priors when sources align well with the target while preventing negative transfer from heterogeneous sources. The analysis of finite-sample marginal posterior behavior reveals that TRADER achieves desired frequentist coverage probabilities, even for coefficients with moderate signal strength--a scenario where standard continuous shrinkage priors struggle. 
Extensive numerical studies and a real-data application estimating the association between blood glucose and insulin use in the Hispanic diabetic population demonstrate that TRADER improves estimation and inference accuracy over continuous shrinkage priors using target data alone, while outperforming a state-of-the-art transfer learning method that requires individual-level data.", "authors": "Lai, Padilla, Gu"}, "https://arxiv.org/abs/2412.03042": {"title": "On a penalised likelihood approach for joint modelling of longitudinal covariates and partly interval-censored data -- an application to the Anti-PD1 brain collaboration trial", "link": "https://arxiv.org/abs/2412.03042", "description": "This article considers the joint modeling of longitudinal covariates and partly-interval censored time-to-event data. Longitudinal time-varying covariates play a crucial role in obtaining accurate clinically relevant predictions using a survival regression model. However, these covariates are often measured at limited time points and may be subject to measurement error. Further methodological challenges arise from the fact that, in many clinical studies, the event times of interest are interval-censored. A model that simultaneously accounts for all these factors is expected to improve the accuracy of survival model estimations and predictions. In this article, we consider joint models that combine longitudinal time-varying covariates with the Cox model for time-to-event data which is subject to interval censoring. The proposed model employs a novel penalised likelihood approach for estimating all parameters, including the random effects. The covariance matrix of the estimated parameters can be obtained from the penalised log-likelihood. The performance of the model is compared to an existing method under various scenarios. The simulation results demonstrated that our new method can provide reliable inferences when dealing with interval-censored data. Data from the Anti-PD1 brain collaboration clinical trial in advanced melanoma is used to illustrate the application of the new method.", "authors": "Webb, Zou, Lo et al"}, "https://arxiv.org/abs/2412.03185": {"title": "Information borrowing in Bayesian clinical trials: choice of tuning parameters for the robust mixture prior", "link": "https://arxiv.org/abs/2412.03185", "description": "Borrowing historical data for use in clinical trials has increased in recent years. This is accomplished in the Bayesian framework by specification of informative prior distributions. One such approach is the robust mixture prior arising as a weighted mixture of an informative prior and a robust prior inducing dynamic borrowing that allows to borrow most when the current and external data are observed to be similar. The robust mixture prior requires the choice of three additional quantities: the mixture weight, and the mean and dispersion of the robust component. Some general guidance is available, but a case-by-case study of the impact of these quantities on specific operating characteristics seems lacking. We focus on evaluating the impact of parameter choices for the robust component of the mixture prior in one-arm and hybrid-control trials. The results show that all three quantities can strongly impact the operating characteristics. In particular, as already known, variance of the robust component is linked to robustness. Less known, however, is that its location can have a strong impact on Type I error rate and MSE which can even become unbounded. 
Further, the impact of the weight choice is strongly linked with the robust component's location and variance. Recommendations are provided for the choice of the robust component parameters, the prior weight, and an alternative functional form for this component, as well as considerations to keep in mind when evaluating operating characteristics.", "authors": "Weru, Kopp-Schneider, Wiesenfarth et al"}, "https://arxiv.org/abs/2412.03246": {"title": "Nonparametric estimation of the Patient Weighted While-Alive Estimand", "link": "https://arxiv.org/abs/2412.03246", "description": "In clinical trials with recurrent events, such as repeated hospitalizations terminating with death, it is important to consider the patient's overall event history for a thorough assessment of treatment effects. The occurrence of fewer events due to early deaths can lead to misinterpretation, emphasizing the importance of a while-alive strategy as suggested in Schmidli et al. (2023). We focus in this paper on the patient weighted while-alive estimand, represented as the expected number of events divided by the time alive within a target window, and develop efficient estimation for this estimand. We derive its efficient influence function and develop a one-step estimator, initially applied to the irreversible illness-death model. For the broader context of recurrent events, due to the increased complexity, the one-step estimator is practically intractable. We therefore suggest an alternative estimator that is also expected to have high efficiency, focusing on the randomized treatment setting. We compare the efficiency of these two estimators in the illness-death setting. Additionally, we apply our proposed estimator to a real-world case study involving metastatic colorectal cancer patients, demonstrating the practical applicability and benefits of the while-alive approach.", "authors": "Ragni, Martinussen, Scheike"}, "https://arxiv.org/abs/2412.03429": {"title": "Coherent forecast combination for linearly constrained multiple time series", "link": "https://arxiv.org/abs/2412.03429", "description": "Linearly constrained multiple time series may be encountered in many practical contexts, such as the National Accounts (e.g., GDP disaggregated by Income, Expenditure and Output), and multilevel frameworks where the variables are organized according to hierarchies or groupings, like the total energy consumption of a country disaggregated by region and energy sources. In these cases, when multiple incoherent base forecasts for each individual variable are available, a forecast combination-and-reconciliation approach, that we call coherent forecast combination, may be used to improve the accuracy of the base forecasts and achieve coherence in the final result. In this paper, we develop an optimization-based technique that combines multiple unbiased base forecasts while satisfying the constraints valid for the series. We present closed-form expressions for the coherent combined forecast vector and its error covariance matrix in the general case where a different number of forecasts is available for each variable. We also discuss practical issues related to the covariance matrix that is part of the optimal solution.
Through simulations and a forecasting experiment on the daily Australian electricity generation hierarchical time series, we show that the proposed methodology, in addition to adhering to sound statistical principles, may yield significant improvements over base forecasts, single-task combination, and single-expert reconciliation approaches.", "authors": "Girolimetto, Fonzo"}, "https://arxiv.org/abs/2412.03484": {"title": "Visualisation for Exploratory Modelling Analysis of Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2412.03484", "description": "When developing Bayesian hierarchical models, selecting the most appropriate hierarchical structure can be a challenging task, and visualisation remains an underutilised tool in this context. In this paper, we consider visualisations for the display of hierarchical models in data space and compare a collection of multiple models via their parameter and hyper-parameter estimates. Specifically, with the aim of aiding model choice, we propose new visualisations to explore how the choice of Bayesian hierarchical modelling structure impacts parameter distributions. The visualisations are designed using a robust set of principles to provide richer comparisons that extend beyond the conventional plots and numerical summaries typically used. As a case study, we investigate five Bayesian hierarchical models fit using the brms R package, a high-level interface to Stan for Bayesian modelling, to model country mathematics trends from the PISA (Programme for International Student Assessment) database. Our case study demonstrates that by adhering to these principles, researchers can create visualisations that not only help them make more informed choices between Bayesian hierarchical model structures but also enable them to effectively communicate the rationale for those choices.", "authors": "Akinfenwa, Cahill, Hurley"}, "https://arxiv.org/abs/2412.02869": {"title": "Constrained Identifiability of Causal Effects", "link": "https://arxiv.org/abs/2412.02869", "description": "We study the identification of causal effects in the presence of different types of constraints (e.g., logical constraints) in addition to the causal graph. These constraints impose restrictions on the models (parameterizations) induced by the causal graph, reducing the set of models considered by the identifiability problem. We formalize the notion of constrained identifiability, which takes a set of constraints as another input to the classical definition of identifiability. We then introduce a framework for testing constrained identifiability by employing tractable Arithmetic Circuits (ACs), which enables us to accommodate constraints systematically. We show that this AC-based approach is at least as complete as existing algorithms (e.g., do-calculus) for testing classical identifiability, which only assumes the constraint of strict positivity. We use examples to demonstrate the effectiveness of this AC-based approach by showing that unidentifiable causal effects may become identifiable under different types of constraints.", "authors": "Chen, Darwiche"}, "https://arxiv.org/abs/2412.02878": {"title": "Modeling and Discovering Direct Causes for Predictive Models", "link": "https://arxiv.org/abs/2412.02878", "description": "We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. 
The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We show two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.", "authors": "Chen, Bhatia"}, "https://arxiv.org/abs/2412.02893": {"title": "Removing Spurious Correlation from Neural Network Interpretations", "link": "https://arxiv.org/abs/2412.02893", "description": "The existing algorithms for identification of neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that adjusting for the effect of conversation topic, toxicity becomes less localized.", "authors": "Fotouhi, Bahadori, Feyisetan et al"}, "https://arxiv.org/abs/2412.03491": {"title": "Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications", "link": "https://arxiv.org/abs/2412.03491", "description": "Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.", "authors": "Sauer, Boulesteix, Han{\\ss}um et al"}, "https://arxiv.org/abs/2412.03528": {"title": "The R.O.A.D. to clinical trial emulation", "link": "https://arxiv.org/abs/2412.03528", "description": "Observational studies provide the only evidence on the effectiveness of interventions when randomized controlled trials (RCTs) are impractical due to cost, ethical concerns, or time constraints. 
While many methodologies aim to draw causal inferences from observational data, there is a growing trend to model observational study designs after RCTs, a strategy known as \"target trial emulation.\" Despite its potential, causal inference through target trial emulation cannot fully address the confounding bias in real-world data due to the lack of randomization. In this work, we present a novel framework for target trial emulation that aims to overcome several key limitations, including confounding bias. The framework proceeds as follows: First, we apply the eligibility criteria of a specific trial to an observational cohort. We then \"correct\" this cohort by extracting a subset that matches both the distribution of covariates and the baseline prognosis of the control group in the target RCT. Next, we address unmeasured confounding by adjusting the prognosis estimates of the treated group to align with those observed in the trial. Following trial emulation, we go a step further by leveraging the emulated cohort to train optimal decision trees, to identify subgroups of patients with heterogeneity in treatment effects (HTE). The absence of confounding is verified using two external models, and the validity of the treatment recommendations is independently confirmed by the team responsible for the original trial we emulate. To our knowledge, this is the first framework to successfully address both observed and unobserved confounding, a challenge that has historically limited the use of randomized trial emulation and causal inference. Additionally, our framework holds promise in advancing precision medicine by identifying patient subgroups that benefit most from specific treatments.", "authors": "Bertsimas, Koulouras, Nagata et al"}, "https://arxiv.org/abs/2307.05708": {"title": "Bayesian inference on the order of stationary vector autoregressions", "link": "https://arxiv.org/abs/2307.05708", "description": "Vector autoregressions (VARs) are a widely used tool for modelling multivariate time-series. It is common to assume a VAR is stationary; this can be enforced by imposing the stationarity condition which restricts the parameter space of the autoregressive coefficients to the stationary region. However, implementing this constraint is difficult due to the complex geometry of the stationary region. Fortunately, recent work has provided a solution for autoregressions of fixed order $p$ based on a reparameterization in terms of a set of interpretable and unconstrained transformed partial autocorrelation matrices. In this work, focus is placed on the difficult problem of allowing $p$ to be unknown, developing a prior and computational inference that takes full account of order uncertainty. Specifically, the multiplicative gamma process is used to build a prior which encourages increasing shrinkage of the partial autocorrelations with increasing lag. Identifying the lag beyond which the partial autocorrelations become equal to zero then determines $p$. Based on classic time-series theory, a principled choice of truncation criterion identifies whether a partial autocorrelation matrix is effectively zero. Posterior inference utilizes Hamiltonian Monte Carlo via Stan. 
The work is illustrated in a substantive application to neural activity data to investigate ultradian brain rhythms.", "authors": "Binks, Heaps, Panagiotopoulou et al"}, "https://arxiv.org/abs/2401.13208": {"title": "Assessing Influential Observations in Pain Prediction using fMRI Data", "link": "https://arxiv.org/abs/2401.13208", "description": "Neuroimaging data allows researchers to model the relationship between multivariate patterns of brain activity and outcomes related to mental states and behaviors. However, the existence of outlying participants can potentially undermine the generalizability of these models and jeopardize the validity of downstream statistical analysis. To date, the ability to detect and account for participants unduly influencing various model selection approaches has been sorely lacking. Motivated by a task-based functional magnetic resonance imaging (fMRI) study of thermal pain, we propose a diagnostic measure applicable to a number of different model selectors and establish its asymptotic distribution. A high-dimensional clustering procedure is further combined with this measure to detect multiple influential observations. In a series of simulations, our proposed method demonstrates clear advantages over existing methods in terms of improved detection performance, leading to enhanced predictive and variable selection outcomes. Application of our method to data from the thermal pain study illustrates the influence of outlying participants, in particular with regard to differences in activation between low and intense pain conditions. This allows for the selection of an interpretable model with high prediction power after removal of the detected observations. Though inspired by the fMRI-based thermal pain study, our methods are broadly applicable to other high-dimensional data types.", "authors": "Zhang, Asgharian, Lindquist"}, "https://arxiv.org/abs/2210.07114": {"title": "A Tutorial on Statistical Models Based on Counting Processes", "link": "https://arxiv.org/abs/2210.07114", "description": "Since the famous paper written by Kaplan and Meier in 1958, survival analysis has become one of the most important fields in statistics. Nowadays it is one of the most important statistical tools for analyzing epidemiological and clinical data, including data from the COVID-19 pandemic. This article reviews some of the most celebrated and important results and methods in survival analysis, including consistency, asymptotic normality, and bias and variance estimation; the treatment parallels the monograph Statistical Models Based on Counting Processes. Other models and results, such as semi-Markov models and Turnbull's estimator, that fall outside the classical counting process martingale framework are also discussed.", "authors": "Cui"}, "https://arxiv.org/abs/2412.03596": {"title": "SMART-MC: Sparse Matrix Estimation with Covariate-Based Transitions in Markov Chain Modeling of Multiple Sclerosis Disease Modifying Therapies", "link": "https://arxiv.org/abs/2412.03596", "description": "A Markov model is a widely used tool for modeling sequences of events from a finite state-space and hence can be employed to identify the transition probabilities across treatments based on treatment sequence data. To understand how patient-level covariates impact these treatment transitions, the transition probabilities are modeled as a function of patient covariates. 
This approach enables the visualization of the effect of patient-level covariates on the treatment transitions across patient visits. The proposed method automatically estimates the entries of the transition matrix with smaller numbers of empirical transitions as constant; the user can set desired cutoff of the number of empirical transition counts required for a particular transition probability to be estimated as a function of covariates. Firstly, this strategy automatically enforces the final estimated transition matrix to contain zeros at the locations corresponding to zero empirical transition counts, avoiding further complicated model constructs to handle sparsity, in an efficient manner. Secondly, it restricts estimation of transition probabilities as a function of covariates, when the number of empirical transitions is particularly small, thus avoiding the identifiability issue which might arise due to the p>n scenario when estimating each transition probability as a function of patient covariates. To optimize the multi-modal likelihood, a parallelized scalable global optimization routine is also developed. The proposed method is applied to understand how the transitions across disease modifying therapies (DMTs) in Multiple Sclerosis (MS) patients are influenced by patient-level demographic and clinical phenotypes.", "authors": "Kim, Xia, Das"}, "https://arxiv.org/abs/2412.03668": {"title": "Hidden Markov graphical models with state-dependent generalized hyperbolic distributions", "link": "https://arxiv.org/abs/2412.03668", "description": "In this paper we develop a novel hidden Markov graphical model to investigate time-varying interconnectedness between different financial markets. To identify conditional correlation structures under varying market conditions and accommodate stylized facts embedded in financial time series, we rely upon the generalized hyperbolic family of distributions with time-dependent parameters evolving according to a latent Markov chain. We exploit its location-scale mixture representation to build a penalized EM algorithm for estimating the state-specific sparse precision matrices by means of an $L_1$ penalty. The proposed approach leads to regime-specific conditional correlation graphs that allow us to identify different degrees of network connectivity of returns over time. The methodology's effectiveness is validated through simulation exercises under different scenarios. In the empirical analysis we apply our model to daily returns of a large set of market indexes, cryptocurrencies and commodity futures over the period 2017-2023.", "authors": "Foroni, Merlo, Petrella"}, "https://arxiv.org/abs/2412.03731": {"title": "Using a Two-Parameter Sensitivity Analysis Framework to Efficiently Combine Randomized and Non-randomized Studies", "link": "https://arxiv.org/abs/2412.03731", "description": "Causal inference is vital for informed decision-making across fields such as biomedical research and social sciences. Randomized controlled trials (RCTs) are considered the gold standard for the internal validity of inferences, whereas observational studies (OSs) often provide the opportunity for greater external validity. However, both data sources have inherent limitations preventing their use for broadly valid statistical inferences: RCTs may lack generalizability due to their selective eligibility criterion, and OSs are vulnerable to unobserved confounding. 
This paper proposes an innovative approach to integrate RCT and OS that borrows the other study's strengths to remedy each study's limitations. The method uses a novel triplet matching algorithm to align RCT and OS samples and a new two-parameter sensitivity analysis framework to quantify internal and external biases. This combined approach yields causal estimates that are more robust to hidden biases than OSs alone and provides reliable inferences about the treatment effect in the general population. We apply this method to investigate the effects of lactation on maternal health using a small RCT and a long-term observational health records dataset from the California National Primate Research Center. This application demonstrates the practical utility of our approach in generating scientifically sound and actionable causal estimates.", "authors": "Yu, Karmakar, Vandeleest et al"}, "https://arxiv.org/abs/2412.03797": {"title": "A Two-stage Approach for Variable Selection in Joint Modeling of Multiple Longitudinal Markers and Competing Risk Outcomes", "link": "https://arxiv.org/abs/2412.03797", "description": "Background: In clinical and epidemiological research, the integration of longitudinal measurements and time-to-event outcomes is vital for understanding relationships and improving risk prediction. However, as the number of longitudinal markers increases, joint model estimation becomes more complex, leading to long computation times and convergence issues. This study introduces a novel two-stage Bayesian approach for variable selection in joint models, illustrated through a practical application.\n Methods: Our approach conceptualizes the analysis in two stages. In the first stage, we estimate one-marker joint models for each longitudinal marker related to the event, allowing for bias reduction from informative dropouts through individual marker trajectory predictions. The second stage employs a proportional hazard model that incorporates expected current values of all markers as time-dependent covariates. We explore continuous and Dirac spike-and-slab priors for variable selection, utilizing Markov chain Monte Carlo (MCMC) techniques.\n Results: The proposed method addresses the challenges of parameter estimation and risk prediction with numerous longitudinal markers, demonstrating robust performance through simulation studies. We further validate our approach by predicting dementia risk using the Three-City (3C) dataset, a longitudinal cohort study from France.\n Conclusions: This two-stage Bayesian method offers an efficient process for variable selection in joint modeling, enhancing risk prediction capabilities in longitudinal studies. The accompanying R package VSJM, which is freely available at https://github.com/tbaghfalaki/VSJM, facilitates implementation, making this approach accessible for diverse clinical applications.", "authors": "Baghfalaki, Hashemi, Tzourio et al"}, "https://arxiv.org/abs/2412.03827": {"title": "Optimal Correlation for Bernoulli Trials with Covariates", "link": "https://arxiv.org/abs/2412.03827", "description": "Given covariates for $n$ units, each of which is to receive a treatment with probability $1/2$, we study the question of how best to correlate their treatment assignments to minimize the variance of the IPW estimator of the average treatment effect. 
Past work by Bai (2022) found that the optimal stratified experiment is a matched-pair design, where the matching depends on oracle knowledge of the distributions of potential outcomes given covariates. We show that, in the strictly broader class of all admissible correlation structures, the optimal design is to divide the units into two clusters and uniformly assign treatment to exactly one of the two clusters. This design can be computed by solving a 0-1 knapsack problem that uses the same oracle information and can result in an arbitrarily large variance improvement. A shift-invariant version can be constructed by ensuring that exactly half of the units are treated. A method with just two clusters is not robust to a bad proxy for the oracle, and we mitigate this with a hybrid that uses $O(n^\\alpha)$ clusters for $0<\\alpha<1$. Under certain assumptions, we also derive a CLT for the IPW estimator under our design and a consistent estimator of the variance. We compare our proposed designs to the optimal stratified design in simulated examples and find improved performance.", "authors": "Morrison, Owen"}, "https://arxiv.org/abs/2412.03833": {"title": "A Note on the Identifiability of the Degree-Corrected Stochastic Block Model", "link": "https://arxiv.org/abs/2412.03833", "description": "In this short note, we address the identifiability issues inherent in the Degree-Corrected Stochastic Block Model (DCSBM). We provide a rigorous proof demonstrating that the parameters of the DCSBM are identifiable up to a scaling factor and a permutation of the community labels, under a mild condition.", "authors": "Park, Zhao, Hao"}, "https://arxiv.org/abs/2412.03918": {"title": "Selection of Ultrahigh-Dimensional Interactions Using $L_0$ Penalty", "link": "https://arxiv.org/abs/2412.03918", "description": "Selecting interactions from an ultrahigh-dimensional statistical model with $n$ observations and $p$ variables when $p\\gg n$ is difficult because the number of candidates for interactions is $p(p-1)/2$ and a selected model should satisfy the strong hierarchical (SH) restriction. A new method called SHL0 is proposed to overcome the difficulty. The objective function of the SHL0 method is composed of a loglikelihood function and an $L_0$ penalty. A well-known approach in theoretical computer science called local combinatorial optimization is used to optimize the objective function. We show that any local solution of the SHL0 is consistent and enjoys the oracle properties, implying that it is unnecessary to use a global solution in practice. Three additional advantages are: a tuning parameter is used to penalize the main effects and interactions; the tuning parameter can be derived in closed form; and the idea can be extended to arbitrary ultrahigh-dimensional statistical models. The proposed method is more flexible than previous methods for selecting interactions. 
A simulation study shows that the proposed SHL0 outperforms its competitors.", "authors": "Zhang"}, "https://arxiv.org/abs/2412.03975": {"title": "A shiny app for modeling the lifetime in primary breast cancer patients through phase-type distributions", "link": "https://arxiv.org/abs/2412.03975", "description": "Phase-type distributions (PHDs), which are defined as the distribution of the lifetime up to absorption in an absorbing Markov chain, are an appropriate candidate for modeling the lifetime of any system, since any non-negative probability distribution can be approximated by a PHD with sufficient precision. Despite the potential of PHDs, user-friendly statistical programs do not include a module in their interfaces for handling them. Thus, researchers must turn to other statistical software, such as R, Matlab or Python, that requires writing code chunks and functions. This can be an important handicap for researchers who do not have sufficient knowledge of programming environments. In this paper, a new interactive web application developed with shiny is introduced to fit PHDs to an experimental dataset. This open-access app does not require any knowledge of programming or advanced mathematical concepts. Users can easily compare the graphical fit of several PHDs, estimate their parameters, and assess the goodness of fit with just a few clicks. All these functionalities are demonstrated by means of a numerical simulation and by modeling the lifetime since diagnosis in primary breast cancer patients.", "authors": "Acal, Contreras, Montero et al"}, "https://arxiv.org/abs/2412.04109": {"title": "Pseudo-Observations for Bivariate Survival Data", "link": "https://arxiv.org/abs/2412.04109", "description": "The pseudo-observations approach has been gaining popularity as a method to estimate covariate effects on censored survival data. It is used regularly to estimate covariate effects on quantities such as survival probabilities, restricted mean life, cumulative incidence, and others. In this work, we propose to generalize the pseudo-observations approach to situations where a bivariate failure-time variable is observed, subject to right censoring. The idea is to first estimate the joint survival function of both failure times and then use it to define the relevant pseudo-observations. Once the pseudo-observations are calculated, they are used as the response in a generalized linear model. We consider two common nonparametric estimators of the joint survival function: the estimator of Lin and Ying (1993) and the Dabrowska estimator (Dabrowska, 1988). For both estimators, we show that our bivariate pseudo-observations approach produces regression estimates that are consistent and asymptotically normal. Our proposed method enables estimation of covariate effects on quantities such as the joint survival probability at a fixed bivariate time point, or simultaneously at several time points, and consequently can estimate covariate-adjusted conditional survival probabilities. We demonstrate the method using simulations and an analysis of two real-world datasets.", "authors": "Travis-Lumer, Mandel, Betensky"}, "https://arxiv.org/abs/2412.04265": {"title": "On Extrapolation of Treatment Effects in Multiple-Cutoff Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2412.04265", "description": "Regression discontinuity (RD) designs typically identify the treatment effect at a single cutoff point. 
But when and how can we learn about treatment effects away from the cutoff? This paper addresses this question within a multiple-cutoff RD framework. We begin by examining the plausibility of the constant bias assumption proposed by Cattaneo, Keele, Titiunik, and Vazquez-Bare (2021) through the lens of rational decision-making behavior, which suggests that a kind of similarity between groups and whether individuals can influence the running variable are important factors. We then introduce an alternative set of assumptions and propose a broadly applicable partial identification strategy. The potential applicability and usefulness of the proposed bounds are illustrated through two empirical examples.", "authors": "Okamoto, Ozaki"}, "https://arxiv.org/abs/2412.04275": {"title": "Scoping review of methodology for aiding generalisability and transportability of clinical prediction models", "link": "https://arxiv.org/abs/2412.04275", "description": "Generalisability and transportability of clinical prediction models (CPMs) refer to their ability to maintain predictive performance when applied to new populations. While CPMs may show good generalisability or transportability to a specific new population, it is rare for a CPM to be developed using methods that prioritise good generalisability or transportability. There is an emerging literature on such techniques; therefore, this scoping review aims to summarise the main methodological approaches, assumptions, advantages, disadvantages and future development of methodology aiding generalisability/transportability. Relevant articles were systematically searched from the MEDLINE, Embase, medRxiv and arXiv databases until September 2023 using a predefined set of search terms. Extracted information included methodology description, assumptions, applied examples, advantages and disadvantages. The searches found 1,761 articles; 172 were retained for full text screening; 18 were finally included. We categorised the methodologies according to whether they were data-driven or knowledge-driven, and whether they are specifically tailored to a target population. Data-driven approaches range from data augmentation to ensemble methods and density ratio weighting, while knowledge-driven strategies rely on causal methodology. Future research could focus on comparing such methodologies on simulated and real datasets to identify their strengths and specific applicability, as well as on synthesising these approaches to enhance their practical usefulness.", "authors": "Ploddi, Sperrin, Martin et al"}, "https://arxiv.org/abs/2412.04293": {"title": "Cubic-based Prediction Approach for Large Volatility Matrix using High-Frequency Financial Data", "link": "https://arxiv.org/abs/2412.04293", "description": "In this paper, we develop a novel method for predicting future large volatility matrices based on high-dimensional factor-based It\\^o processes. Several studies have proposed volatility matrix prediction methods using parametric models to account for volatility dynamics. However, these methods often impose restrictions, such as constant eigenvectors over time. To generalize the factor structure, we construct a cubic (order-3 tensor) form of an integrated volatility matrix process, which can be decomposed into low-rank tensor and idiosyncratic tensor components. To predict conditional expected large volatility matrices, we introduce the Projected Tensor Principal Orthogonal componEnt Thresholding (PT-POET) procedure and establish its asymptotic properties. 
Finally, the advantages of PT-POET are also verified by a simulation study and illustrated by applying minimum variance portfolio allocation using high-frequency trading data.", "authors": "Choi, Kim"}, "https://arxiv.org/abs/2412.03723": {"title": "Bayesian Perspective for Orientation Estimation in Cryo-EM and Cryo-ET", "link": "https://arxiv.org/abs/2412.03723", "description": "Accurate orientation estimation is a crucial component of 3D molecular structure reconstruction, both in single-particle cryo-electron microscopy (cryo-EM) and in the increasingly popular field of cryo-electron tomography (cryo-ET). The dominant method, which involves searching for an orientation with maximum cross-correlation relative to given templates, falls short, particularly in low signal-to-noise environments. In this work, we propose a Bayesian framework to develop a more accurate and flexible orientation estimation approach, with the minimum mean square error (MMSE) estimator as a key example. This method effectively accommodates varying structural conformations and arbitrary rotational distributions. Through simulations, we demonstrate that our estimator consistently outperforms the cross-correlation-based method, especially in challenging conditions with low signal-to-noise ratios, and offer a theoretical framework to support these improvements. We further show that integrating our estimator into the iterative refinement in the 3D reconstruction pipeline markedly enhances overall accuracy, revealing substantial benefits across the algorithmic workflow. Finally, we show empirically that the proposed Bayesian approach enhances robustness against the ``Einstein from Noise'' phenomenon, reducing model bias and improving reconstruction reliability. These findings indicate that the proposed Bayesian framework could substantially advance cryo-EM and cryo-ET by enhancing the accuracy, robustness, and reliability of 3D molecular structure reconstruction, thereby facilitating deeper insights into complex biological systems.", "authors": "Xu, Balanov, Bendory"}, "https://arxiv.org/abs/2412.04493": {"title": "Robust Quickest Change Detection in Multi-Stream Non-Stationary Processes", "link": "https://arxiv.org/abs/2412.04493", "description": "The problem of robust quickest change detection (QCD) in non-stationary processes under a multi-stream setting is studied. In classical QCD theory, optimal solutions are developed to detect a sudden change in the distribution of stationary data. Most studies have focused on single-stream data. In non-stationary processes, the data distribution both before and after change varies with time and is not precisely known. The multi-dimension data even complicates such issues. It is shown that if the non-stationary family for each dimension or stream has a least favorable law (LFL) or distribution in a well-defined sense, then the algorithm designed using the LFLs is robust optimal. The notion of LFL defined in this work differs from the classical definitions due to the dependence of the post-change model on the change point. Examples of multi-stream non-stationary processes encountered in public health monitoring and aviation applications are provided. 
Our robust algorithm is applied to simulated and real data to show its effectiveness.", "authors": "Hou, Bidkhori, Banerjee"}, "https://arxiv.org/abs/2412.04605": {"title": "Semiparametric Bayesian Difference-in-Differences", "link": "https://arxiv.org/abs/2412.04605", "description": "This paper studies semiparametric Bayesian inference for the average treatment effect on the treated (ATT) within the difference-in-differences research design. We propose two new Bayesian methods with frequentist validity. The first one places a standard Gaussian process prior on the conditional mean function of the control group. We obtain asymptotic equivalence of our Bayesian estimator and an efficient frequentist estimator by establishing a semiparametric Bernstein-von Mises (BvM) theorem. The second method is a double robust Bayesian procedure that adjusts the prior distribution of the conditional mean function and subsequently corrects the posterior distribution of the resulting ATT. We establish a semiparametric BvM result under double robust smoothness conditions; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score, and vice versa. Monte Carlo simulations and an empirical application demonstrate that the proposed Bayesian DiD methods exhibit strong finite-sample performance compared to existing frequentist methods. Finally, we outline an extension to difference-in-differences with multiple periods and staggered entry.", "authors": "Breunig, Liu, Yu"}, "https://arxiv.org/abs/2412.04640": {"title": "Multi-Quantile Estimators for the parameters of Generalized Extreme Value distribution", "link": "https://arxiv.org/abs/2412.04640", "description": "We introduce and study Multi-Quantile estimators for the parameters $( \\xi, \\sigma, \\mu)$ of Generalized Extreme Value (GEV) distributions to provide a robust approach to extreme value modeling. Unlike classical estimators, such as the Maximum Likelihood Estimation (MLE) estimator and the Probability Weighted Moments (PWM) estimator, which impose strict constraints on the shape parameter $\\xi$, our estimators are always asymptotically normal and consistent across all values of the GEV parameters. The asymptotic variances of our estimators decrease with the number of quantiles increasing and can approach the Cram\\'er-Rao lower bound very closely whenever it exists. Our Multi-Quantile Estimators thus offer a more flexible and efficient alternative for practical applications. We also discuss how they can be implemented in the context of Block Maxima method.", "authors": "Lin, Kong, Azencott"}, "https://arxiv.org/abs/2412.04663": {"title": "Fairness-aware Principal Component Analysis for Mortality Forecasting and Annuity Pricing", "link": "https://arxiv.org/abs/2412.04663", "description": "Fairness-aware statistical learning is critical for data-driven decision-making to mitigate discrimination against protected attributes, such as gender, race, and ethnicity. This is especially important for high-stake decision-making, such as insurance underwriting and annuity pricing. This paper proposes a new fairness-regularized principal component analysis - Fair PCA, in the context of high-dimensional factor models. An efficient gradient descent algorithm is constructed with adaptive selection criteria for hyperparameter tuning. The Fair PCA is applied to mortality modelling to mitigate gender discrimination in annuity pricing. 
The model performance has been validated through both simulation studies and empirical data analysis.", "authors": "Huang, Shen, Yang et al"}, "https://arxiv.org/abs/2412.04736": {"title": "Modeling High-Dimensional Dependent Data in the Presence of Many Explanatory Variables and Weak Signals", "link": "https://arxiv.org/abs/2412.04736", "description": "This article considers a novel and widely applicable approach to modeling high-dimensional dependent data when a large number of explanatory variables are available and the signal-to-noise ratio is low. We postulate that a $p$-dimensional response series is the sum of a linear regression with many observable explanatory variables and an error term driven by some latent common factors and an idiosyncratic noise. The common factors have dynamic dependence whereas the covariance matrix of the idiosyncratic noise can have diverging eigenvalues to handle the situation of low signal-to-noise ratio commonly encountered in applications. The regression coefficient matrix is estimated using penalized methods when the dimensions involved are high. We apply factor modeling to the regression residuals, employ a high-dimensional white noise testing procedure to determine the number of common factors, and adopt a projected Principal Component Analysis when the signal-to-noise ratio is low. We establish asymptotic properties of the proposed method, both for fixed and diverging numbers of regressors, as $p$ and the sample size $T$ approach infinity. Finally, we use simulations and empirical applications to demonstrate the efficacy of the proposed approach in finite samples.", "authors": "Gao, Tsay"}, "https://arxiv.org/abs/2412.04744": {"title": "Marginally interpretable spatial logistic regression with bridge processes", "link": "https://arxiv.org/abs/2412.04744", "description": "In including random effects to account for dependent observations, the odds ratio interpretation of logistic regression coefficients is changed from population-averaged to subject-specific. This is unappealing in many applications, motivating a rich literature on methods that maintain the marginal logistic regression structure without random effects, such as generalized estimating equations. However, for spatial data, random effect approaches are appealing in providing a full probabilistic characterization of the data that can be used for prediction. We propose a new class of spatial logistic regression models that maintain both population-averaged and subject-specific interpretations through a novel class of bridge processes for spatial random effects. These processes are shown to have appealing computational and theoretical properties, including a scale mixture of normal representation. The new methodology is illustrated with simulations and an analysis of childhood malaria prevalence data in the Gambia.", "authors": "Lee, Dunson"}, "https://arxiv.org/abs/2412.04773": {"title": "Robust and Optimal Tensor Estimation via Robust Gradient Descent", "link": "https://arxiv.org/abs/2412.04773", "description": "Low-rank tensor models are widely used in statistics and machine learning. However, most existing methods rely heavily on the assumption that data follows a sub-Gaussian distribution. To address the challenges associated with heavy-tailed distributions encountered in real-world applications, we propose a novel robust estimation procedure based on truncated gradient descent for general low-rank tensor models. 
We establish the computational convergence of the proposed method and derive optimal statistical rates under heavy-tailed distributional settings of both covariates and noise for various low-rank models. Notably, the statistical error rates are governed by a local moment condition, which captures the distributional properties of tensor variables projected onto certain low-dimensional local regions. Furthermore, we present numerical results to demonstrate the effectiveness of our method.", "authors": "Zhang, Wang, Li et al"}, "https://arxiv.org/abs/2412.04803": {"title": "Regression Analysis of Cure Rate Models with Competing Risks Subjected to Interval Censoring", "link": "https://arxiv.org/abs/2412.04803", "description": "In this work, we present two defective regression models for the analysis of interval-censored competing risk data in the presence of cured individuals, viz., defective Gompertz and defective inverse Gaussian regression models. The proposed models enable us to estimate the cure fraction directly from the model. Simultaneously, we estimate the regression parameters corresponding to each cause of failure using the method of maximum likelihood. The finite sample behaviour of the proposed models is evaluated through Monte Carlo simulation studies. We illustrate the practical applicability of the models using a real-life data set on HIV patients.", "authors": "K., P., Sankaran"}, "https://arxiv.org/abs/2412.04816": {"title": "Linear Regressions with Combined Data", "link": "https://arxiv.org/abs/2412.04816", "description": "We study best linear predictions in a context where the outcome of interest and some of the covariates are observed in two different datasets that cannot be matched. Traditional approaches obtain point identification by relying, often implicitly, on exclusion restrictions. We show that without such restrictions, coefficients of interest can still be partially identified and we derive a constructive characterization of the sharp identified set. We then build on this characterization to develop computationally simple and asymptotically normal estimators of the corresponding bounds. We show that these estimators exhibit good finite sample performances.", "authors": "D'Haultfoeuille, Gaillac, Maurel"}, "https://arxiv.org/abs/2412.04956": {"title": "Fast Estimation of the Composite Link Model for Multidimensional Grouped Counts", "link": "https://arxiv.org/abs/2412.04956", "description": "This paper presents a significant advancement in the estimation of the Composite Link Model within a penalized likelihood framework, specifically designed to address indirect observations of grouped count data. While the model is effective in these contexts, its application becomes computationally challenging in large, high-dimensional settings. To overcome this, we propose a reformulated iterative estimation procedure that leverages Generalized Linear Array Models, enabling the disaggregation and smooth estimation of latent distributions in multidimensional data. Through applications to high-dimensional mortality datasets, we demonstrate the model's capability to capture fine-grained patterns while comparing its computational performance to the conventional algorithm. 
The proposed methodology offers notable improvements in computational speed, storage efficiency, and practical applicability, making it suitable for a wide range of fields where high-dimensional data are provided in grouped formats.", "authors": "Camarda, Durb\\'an"}, "https://arxiv.org/abs/2412.05018": {"title": "Application of generalized linear models in big data: a divide and recombine (D&R) approach", "link": "https://arxiv.org/abs/2412.05018", "description": "D&R is a statistical approach designed to handle large and complex datasets. It partitions the dataset into several manageable subsets and subsequently applies the analytic method to each subset independently to obtain results. Finally, the results from each subset are combined to yield the results for the entire dataset. D&R strategies can be implemented to fit GLMs to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a different division method named sequential partitioning for D&R-based estimators on GLMs. We show that the obtained D&R estimator with the proposed standard error attains efficiency equivalent to that of the full-data estimate. We illustrate this on a large synthetic dataset and verify that the results from D&R are accurate and identical to those from other available R packages.", "authors": "Nayem, Biswas"}, "https://arxiv.org/abs/2412.05032": {"title": "moonboot: An R Package Implementing m-out-of-n Bootstrap Methods", "link": "https://arxiv.org/abs/2412.05032", "description": "The m-out-of-n bootstrap is a possible workaround to compute confidence intervals for bootstrap inconsistent estimators, because it works under weaker conditions than the n-out-of-n bootstrap. It has the disadvantage, however, that it requires knowledge of an appropriate scaling factor $\\tau_n$ and that the coverage probability for finite n depends on the choice of m. This article presents an R package moonboot which implements the computation of m-out-of-n bootstrap confidence intervals and provides functions for estimating the parameters $\\tau_n$ and m. By means of Monte Carlo simulations, we evaluate the different methods and compare them for different estimators.", "authors": "Dalitz, L\\\"ogler"}, "https://arxiv.org/abs/2412.05195": {"title": "Piecewise-linear modeling of multivariate geometric extremes", "link": "https://arxiv.org/abs/2412.05195", "description": "A recent development in extreme value modeling uses the geometry of the dataset to perform inference on the multivariate tail. A key quantity in this inference is the gauge function, whose values define this geometry. Methodology proposed to date for capturing the gauge function either lacks flexibility due to parametric specifications, or relies on complex neural network specifications in dimensions greater than three. We propose a semiparametric gauge function that is piecewise-linear, making it simple to interpret while providing a good approximation for the true underlying gauge function. This linearity also makes optimization tasks computationally inexpensive. 
The piecewise-linear gauge function can be used to define both a radial and an angular model, allowing for the joint fitting of extremal pseudo-polar coordinates, a key aspect of this geometric framework. We further expand the toolkit for geometric extremal modeling through the estimation of high radial quantiles at given angular values via kernel density estimation. We apply the new methodology to air pollution data, which exhibits a complex extremal dependence structure.", "authors": "Campbell, Wadsworth"}, "https://arxiv.org/abs/2412.05199": {"title": "Energy Based Equality of Distributions Testing for Compositional Data", "link": "https://arxiv.org/abs/2412.05199", "description": "Not many tests exist for testing the equality of two or more multivariate distributions with compositional data, perhaps due to their constrained sample space. At the moment, there is only one suggested test, which relies upon random projections. We propose a novel test termed the $\\alpha$-Energy Based Test ($\\alpha$-EBT) to compare the multivariate distributions of two (or more) compositional data sets. Similar to the aforementioned test, the new test makes no parametric assumptions about the data and, based on simulation studies, it exhibits higher power levels.", "authors": "Sevinc, Tsagris"}, "https://arxiv.org/abs/2412.05108": {"title": "Constructing optimal treatment length strategies to maximize quality-adjusted lifetimes", "link": "https://arxiv.org/abs/2412.05108", "description": "Real-world clinical decision making is a complex process that involves balancing the risks and benefits of treatments. Quality-adjusted lifetime is a composite outcome that combines patient quantity and quality of life, making it an attractive outcome in clinical research. We propose methods for constructing optimal treatment length strategies to maximize this outcome. Existing methods for estimating optimal treatment strategies for survival outcomes cannot be applied to a quality-adjusted lifetime due to induced informative censoring. We propose a weighted estimating equation that adjusts for both confounding and informative censoring. We also propose a nonparametric estimator of the mean counterfactual quality-adjusted lifetime survival curve under a given treatment length strategy, where the weights are estimated using an undersmoothed sieve-based estimator. We show that the estimator is asymptotically linear and provide a data-dependent undersmoothing criterion. We apply our method to obtain the optimal time for percutaneous endoscopic gastrostomy insertion in patients with amyotrophic lateral sclerosis.", "authors": "Sun, Ertefaie, Duttweiler et al"}, "https://arxiv.org/abs/2412.05174": {"title": "Compound Gaussian Radar Clutter Model With Positive Tempered Alpha-Stable Texture", "link": "https://arxiv.org/abs/2412.05174", "description": "The compound Gaussian (CG) family of distributions has achieved great success in modeling sea clutter. This work develops a flexible-tailed CG model to improve generality in clutter modeling, by introducing the positive tempered $\\alpha$-stable (PT$\\alpha$S) distribution to model clutter texture. The PT$\\alpha$S distribution exhibits widely tunable tails by tempering the heavy tails of the positive $\\alpha$-stable (P$\\alpha$S) distribution, thus providing greater flexibility in texture modeling. 
Specifically, we first develop a bivariate isotropic CG-PT$\\alpha$S complex clutter model that is defined by an explicit characteristic function, based on which the corresponding amplitude model is derived. Then, we prove that the amplitude model can be expressed as a scale mixture of Rayleighs, just as the successful compound K and Pareto models. Furthermore, a characteristic function-based method is developed to estimate the parameters of the amplitude model. Finally, real-world sea clutter data analysis indicates the amplitude model's flexibility in modeling clutter data with various tail behaviors.", "authors": "Liao, Xie, Zhou"}, "https://arxiv.org/abs/2007.10432": {"title": "Treatment Effects with Targeting Instruments", "link": "https://arxiv.org/abs/2007.10432", "description": "Multivalued treatments are commonplace in applications. We explore the use of discrete-valued instruments to control for selection bias in this setting. Our discussion revolves around the concept of targeting: which instruments target which treatments. It allows us to establish conditions under which counterfactual averages and treatment effects are point- or partially-identified for composite complier groups. We illustrate the usefulness of our framework by applying it to data from the Head Start Impact Study. Under a plausible positive selection assumption, we derive informative bounds that suggest less beneficial effects of Head Start expansions than the parametric estimates of Kline and Walters (2016).", "authors": "Lee, Salani\\'e"}, "https://arxiv.org/abs/2412.05397": {"title": "Network Structural Equation Models for Causal Mediation and Spillover Effects", "link": "https://arxiv.org/abs/2412.05397", "description": "Social network interference induces spillover effects from neighbors' exposures, and the complexity of statistical analysis increases when mediators are involved with network interference. To address various technical challenges, we develop a theoretical framework employing a structural graphical modeling approach to investigate both mediation and interference effects within network data. Our framework enables us to capture the multifaceted mechanistic pathways through which neighboring units' exposures and mediators exert direct and indirect influences on an individual's outcome. We extend the exposure mapping paradigm in the context of a random-effects network structural equation models (REN-SEM), establishing its capacity to delineate spillover effects of interest. Our proposed methodology contributions include maximum likelihood estimation for REN-SEM and inference procedures with theoretical guarantees. Such guarantees encompass consistent asymptotic variance estimators, derived under a non-i.i.d. asymptotic theory. The robustness and practical utility of our methodology are demonstrated through simulation experiments and a real-world data analysis of the Twitch Gamers Network Dataset, underscoring its effectiveness in capturing the intricate dynamics of network-mediated exposure effects. 
This work is the first to provide a rigorous theoretical framework and analytic toolbox for the mediation analysis of network data, including a robust assessment of the interplay between mediation and interference.", "authors": "Kundu, Song"}, "https://arxiv.org/abs/2412.05463": {"title": "The BPgWSP test: a Bayesian Weibull Shape Parameter signal detection test for adverse drug reactions", "link": "https://arxiv.org/abs/2412.05463", "description": "We develop a Bayesian Power generalized Weibull shape parameter (PgWSP) test as a statistical method for signal detection of possible drug-adverse event associations using electronic health records for pharmacovigilance. The Bayesian approach allows the incorporation of prior knowledge about the likely time of occurrence along with time-to-event data. The test is based on the shape parameters of the Power generalized Weibull (PgW) distribution. When both shape parameters are equal to one, the PgW distribution reduces to an exponential distribution, i.e. a constant hazard function. This is interpreted as no temporal association between drug and adverse event. The Bayesian PgWSP test involves comparing a region of practical equivalence (ROPE) around one, reflecting the null hypothesis, with estimated credibility intervals reflecting the posterior means of the shape parameters. The decision to raise a signal is based on the outcomes of the ROPE test and the selected combination rule for these outcomes. The development of the test requires a simulation study for tuning of the ROPE and credibility intervals to optimize the specificity and sensitivity of the test. Samples are generated under various conditions, including differences in sample size, prevalence of adverse drug reactions (ADRs), and the proportion of adverse events. We explore prior assumptions reflecting the belief in the presence or absence of ADRs at different points in the observation period. Various types of ROPE, credibility intervals, and combination rules are assessed, and optimal tuning parameters are identified based on the area under the curve. The tuned Bayesian PgWSP test is illustrated in a case study in which the time-dependent correlation between the intake of bisphosphonates and four adverse events is investigated.", "authors": "Dyck, Sauzet"}, "https://arxiv.org/abs/2412.05508": {"title": "Optimizing Returns from Experimentation Programs", "link": "https://arxiv.org/abs/2412.05508", "description": "Experimentation in online digital platforms is used to inform decision making. Specifically, the goal of many experiments is to optimize a metric of interest. Null hypothesis statistical testing can be ill-suited to this task, as it is indifferent to the magnitude of effect sizes and opportunity costs. Given access to a pool of related past experiments, we discuss how experimentation practice should change when the goal is optimization. We survey the literature on empirical Bayes analyses of A/B test portfolios, and single out the A/B Testing Problem (Azevedo et al., 2020), which treats experimentation as a constrained optimization problem, as a starting point. We show that the framework can be solved with dynamic programming and implemented by appropriately tuning $p$-value thresholds. Furthermore, we develop several extensions of the A/B Testing Problem and discuss the implications of these results for experimentation programs in industry. 
For example, under no-cost assumptions, firms should be testing many more ideas, reducing test allocation sizes, and relaxing $p$-value thresholds away from $p = 0.05$.", "authors": "Sudijono, Ejdemyr, Lal et al"}, "https://arxiv.org/abs/2412.05621": {"title": "Minimum Sliced Distance Estimation in a Class of Nonregular Econometric Models", "link": "https://arxiv.org/abs/2412.05621", "description": "This paper proposes minimum sliced distance estimation in structural econometric models with possibly parameter-dependent supports. In contrast to likelihood-based estimation, we show that under mild regularity conditions, the minimum sliced distance estimator is asymptotically normally distributed leading to simple inference regardless of the presence/absence of parameter dependent supports. We illustrate the performance of our estimator on an auction model.", "authors": "Fan, Park"}, "https://arxiv.org/abs/2412.05664": {"title": "Property of Inverse Covariance Matrix-based Financial Adjacency Matrix for Detecting Local Groups", "link": "https://arxiv.org/abs/2412.05664", "description": "In financial applications, we often observe both global and local factors that are modeled by a multi-level factor model. When detecting unknown local group memberships under such a model, employing a covariance matrix as an adjacency matrix for local group memberships is inadequate due to the predominant effect of global factors. Thus, to detect a local group structure more effectively, this study introduces an inverse covariance matrix-based financial adjacency matrix (IFAM) that utilizes negative values of the inverse covariance matrix. We show that IFAM ensures that the edge density between different groups vanishes, while that within the same group remains non-vanishing. This reduces falsely detected connections and helps identify local group membership accurately. To estimate IFAM under the multi-level factor model, we introduce a factor-adjusted GLASSO estimator to address the prevalent global factor effect in the inverse covariance matrix. An empirical study using returns from international stocks across 20 financial markets demonstrates that incorporating IFAM effectively detects latent local groups, which helps improve the minimum variance portfolio allocation performance.", "authors": "Oh, Kim"}, "https://arxiv.org/abs/2412.05673": {"title": "A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors", "link": "https://arxiv.org/abs/2412.05673", "description": "This paper presents a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, achieving a balance between quadratic and absolute linear loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data effectively. The generalized Bayesian approach constructs a working likelihood utilizing the SPH loss that facilitates efficient and stable estimation while providing rigorous estimation uncertainty quantification for all model parameters. Notably, this allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. 
By specifying appropriate prior distributions for the regression coefficients -- e.g., ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings -- the framework ensures principled inference. We establish rigorous theoretical guarantees for the accurate estimation of underlying model parameters and the correct selection of predictor variables under sparsity assumptions for a wide range of data-generating setups. Extensive simulation studies demonstrate the superiority of our approach compared to traditional quadratic and absolute linear loss-based Bayesian regression methods, highlighting its flexibility and robustness in high-dimensional and challenging data contexts.", "authors": "Chakraborty, Khare, Michailidis"}, "https://arxiv.org/abs/2412.05687": {"title": "Bootstrap Model Averaging", "link": "https://arxiv.org/abs/2412.05687", "description": "Model averaging has gained significant attention in recent years due to its ability to fuse information from different models. The critical challenge in frequentist model averaging is the choice of the weight vector. The bootstrap method, known for its favorable properties, presents a new solution. In this paper, we propose a bootstrap model averaging approach that selects the weights by minimizing a bootstrap criterion. Our weight selection criterion can also be interpreted as a form of bootstrap aggregating. We demonstrate that the resultant estimator is asymptotically optimal in the sense that it achieves the lowest possible squared error loss. Furthermore, we establish the convergence rate of bootstrap weights tending to the theoretically optimal weights. Additionally, we derive the limiting distribution for our proposed model averaging estimator. Through simulation studies and empirical applications, we show that our proposed method often has better performance than other commonly used model selection and model averaging methods, and bootstrap variants.", "authors": "Song, Zou, Wan"}, "https://arxiv.org/abs/2412.05736": {"title": "Convolution Mode Regression", "link": "https://arxiv.org/abs/2412.05736", "description": "For highly skewed or fat-tailed distributions, mean- or median-based methods often fail to capture the central tendencies in the data. Despite being a viable alternative, estimating the conditional mode given certain covariates (or mode regression) presents significant challenges. Nonparametric approaches suffer from the \"curse of dimensionality\", while semiparametric strategies often lead to non-convex optimization problems. In order to avoid these issues, we propose a novel mode regression estimator that relies on an intermediate step of inverting the conditional quantile density. In contrast to existing approaches, we employ a convolution-type smoothed variant of the quantile regression. Our estimator converges uniformly over the design points of the covariates and, unlike previous quantile-based mode regressions, is uniform with respect to the smoothing bandwidth. 
Additionally, the Convolution Mode Regression is dimension-free, involves no optimization issues, and preliminary simulations suggest the estimator is normally distributed in finite samples.", "authors": "Finn, Horta"}, "https://arxiv.org/abs/2412.05744": {"title": "PICS: A sequential approach to obtain optimal designs for non-linear models leveraging closed-form solutions for faster convergence", "link": "https://arxiv.org/abs/2412.05744", "description": "D-Optimal designs for estimating parameters of response models are derived by maximizing the determinant of the Fisher information matrix. For non-linear models, the Fisher information matrix depends on the unknown parameter vector of interest, leading to the circular situation that, in order to obtain the D-optimal design, one needs knowledge of the very parameter to be estimated. One solution to this problem is to choose the design points sequentially, optimizing the D-optimality criterion using parameter estimates based on available data, followed by updating the parameter estimates using maximum likelihood estimation. On the other hand, there are many non-linear models for which closed-form results for D-optimal designs are available, but because such solutions involve the parameters to be estimated, they can only be used by substituting \"guestimates\" of parameters. In this paper, a hybrid sequential strategy called PICS (Plug into closed-form solution) is proposed that replaces the optimization of the objective function at every single step by a draw from the probability distribution induced by the known optimal design by plugging in the current estimates. Under regularity conditions, asymptotic normality of the sequence of estimators generated by this approach is established. The usefulness of this approach, in terms of saving computational time and achieving greater estimation efficiency compared to the standard sequential approach, is demonstrated with simulations conducted on two different sets of models.", "authors": "Ghosh, Khamaru, Dasgupta"}, "https://arxiv.org/abs/2412.05765": {"title": "A Two-stage Joint Modeling Approach for Multiple Longitudinal Markers and Time-to-event Data", "link": "https://arxiv.org/abs/2412.05765", "description": "Collecting multiple longitudinal measurements and time-to-event outcomes is a common practice in clinical and epidemiological studies, often focusing on exploring associations between them. Joint modeling is the standard analytical tool for such data, with several R packages available. However, as the number of longitudinal markers increases, the computational burden and convergence challenges make joint modeling increasingly impractical.\n This paper introduces a novel two-stage Bayesian approach to estimate joint models for multiple longitudinal measurements and time-to-event outcomes. The method builds on the standard two-stage framework but improves the initial stage by estimating a separate one-marker joint model for the event and each longitudinal marker, rather than relying on mixed models. These estimates are used to derive predictions of individual marker trajectories, avoiding biases from informative dropouts. In the second stage, a proportional hazards model is fitted, incorporating the predicted current values and slopes of the markers as time-dependent covariates. 
To address uncertainty in the first-stage predictions, a multiple imputation technique is employed when estimating the Cox model in the second stage.\n This two-stage method allows for the analysis of numerous longitudinal markers, which is often infeasible with traditional multi-marker joint modeling. The paper evaluates the approach through simulation studies and applies it to the PBC2 dataset and a real-world dementia dataset containing 17 longitudinal markers. An R package, TSJM, implementing the method is freely available on GitHub: https://github.com/tbaghfalaki/TSJM.", "authors": "Baghfalaki, Hashemi, Helmer et al"}, "https://arxiv.org/abs/2412.05794": {"title": "Bundle Choice Model with Endogenous Regressors: An Application to Soda Tax", "link": "https://arxiv.org/abs/2412.05794", "description": "This paper proposes a Bayesian factor-augmented bundle choice model to estimate joint consumption as well as the substitutability and complementarity of multiple goods in the presence of endogenous regressors. The model extends the two primary treatments of endogeneity in existing bundle choice models: (1) endogenous market-level prices and (2) time-invariant unobserved individual heterogeneity. A Bayesian sparse factor approach is employed to capture high-dimensional error correlations that induce taste correlation and endogeneity. Time-varying factor loadings allow for more general individual-level and time-varying heterogeneity and endogeneity, while the sparsity induced by the shrinkage prior on loadings balances flexibility with parsimony. Applied to a soda tax in the context of complementarities, the new approach captures broader effects of the tax that were previously overlooked. Results suggest that a soda tax could yield additional health benefits by marginally decreasing the consumption of salty snacks along with sugary drinks, extending the health benefits beyond the reduction in sugar consumption alone.", "authors": "Sun"}, "https://arxiv.org/abs/2412.05836": {"title": "Modeling time to failure using a temporal sequence of events", "link": "https://arxiv.org/abs/2412.05836", "description": "In recent years, the requirement for real-time understanding of machine behavior has become an important objective in industrial sectors to reduce the cost of unscheduled downtime and to maximize production with expected quality. The vast majority of high-end machines are equipped with a number of sensors that can record event logs over time. In this paper, we consider an injection molding (IM) machine that manufactures plastic bottles for soft drinks. We have analyzed the machine log data with a sequence of three types of events, ``running with alert'', ``running without alert'', and ``failure''. A failure event leads to downtime of the machine and necessitates maintenance. The sensors are capable of capturing the corresponding operational conditions of the machine as well as the defined states of events. This paper presents a new model to predict a) time to failure of the IM machine and b) identification of important sensors in the system that may explain the events which, in turn, lead to failure. 
The proposed method is more efficient than the popular competitor and can help reduce the downtime costs by controlling operational parameters in advance to prevent failures from occurring too soon.", "authors": "Pal, Koley, Ranjan et al"}, "https://arxiv.org/abs/2412.05905": {"title": "Fast QR updating methods for statistical applications", "link": "https://arxiv.org/abs/2412.05905", "description": "This paper introduces fast R updating algorithms designed for statistical applications, including regression, filtering, and model selection, where data structures change frequently. Although traditional QR decomposition is essential for matrix operations, it becomes computationally intensive when dynamically updating the design matrix in statistical models. The proposed algorithms efficiently update the R matrix without recalculating Q, significantly reducing computational costs. These algorithms provide a scalable solution for high-dimensional regression models, enhancing the feasibility of large-scale statistical analyses and model selection in data-intensive fields. Comprehensive simulation studies and real-world data applications reveal that the methods significantly reduce computational time while preserving accuracy. An extensive discussion highlights the versatility of fast R updating algorithms, illustrating their benefits across a wide range of models and applications in statistics and machine learning.", "authors": "Bernardi, Busatto, Cattelan"}, "https://arxiv.org/abs/2412.05919": {"title": "Estimating Spillover Effects in the Presence of Isolated Nodes", "link": "https://arxiv.org/abs/2412.05919", "description": "In estimating spillover effects under network interference, practitioners often use linear regression with either the number or fraction of treated neighbors as regressors. An often overlooked fact is that the latter is undefined for units without neighbors (``isolated nodes\"). The common practice is to impute this fraction as zero for isolated nodes. This paper shows that such practice introduces bias through theoretical derivations and simulations. Causal interpretations of the commonly used spillover regression coefficients are also provided.", "authors": "Kim"}, "https://arxiv.org/abs/2412.05998": {"title": "B-MASTER: Scalable Bayesian Multivariate Regression Analysis for Selecting Targeted Essential Regressors to Identify the Key Genera in Microbiome-Metabolite Relation Dynamics", "link": "https://arxiv.org/abs/2412.05998", "description": "The gut microbiome significantly influences responses to cancer therapies, including immunotherapies, primarily through its impact on the metabolome. Despite some existing studies addressing the effects of specific microbial genera on individual metabolites, there is little to no prior work focused on identifying the key microbiome components at the genus level that shape the overall metabolome profile. To bridge this gap, we introduce B-MASTER (Bayesian Multivariate regression Analysis for Selecting Targeted Essential Regressors), a fully Bayesian framework incorporating an L1 penalty to promote sparsity in the coefficient matrix and an L2 penalty to shrink coefficients for non-major covariate components simultaneously, thereby isolating essential regressors. The method is complemented with a scalable Gibbs sampling algorithm, whose computational speed increases linearly with the number of parameters and remains largely unaffected by sample size and data-specific characteristics for models of fixed dimensions. 
Notably, B-MASTER achieves full posterior inference for models with up to four million parameters within a practical time-frame. Using this approach, we identify key microbial genera influencing the overall metabolite profile, conduct an in-depth analysis of their effects on the most abundant metabolites, and investigate metabolites differentially abundant in colorectal cancer patients. These results provide foundational insights into the impact of the microbiome at the genus level on metabolite profiles relevant to cancer, a relationship that remains largely unexplored in the existing literature.", "authors": "Das, Dey, Peterson et al"}, "https://arxiv.org/abs/2412.06018": {"title": "Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research", "link": "https://arxiv.org/abs/2412.06018", "description": "Longitudinal passive sensing studies for health and behavior outcomes often have missing and incomplete data. Handling missing data effectively is thus a critical data processing and modeling step. Our formative interviews with researchers working in longitudinal health and behavior passive sensing revealed a recurring theme: most researchers consider imputation a low-priority step in their analysis and inference pipeline, opting to use simple and off-the-shelf imputation strategies without comprehensively evaluating its impact on study outcomes. Through this paper, we call attention to the importance of imputation. Using publicly available passive sensing datasets for depression, we show that prioritizing imputation can significantly impact the study outcomes -- with our proposed imputation strategies resulting in up to 31% improvement in AUROC to predict depression over the original imputation strategy. We conclude by discussing the challenges and opportunities with effective imputation in longitudinal sensing studies.", "authors": "Choube, Majethia, Bhattacharya et al"}, "https://arxiv.org/abs/2412.06020": {"title": "New Additive OCBA Procedures for Robust Ranking and Selection", "link": "https://arxiv.org/abs/2412.06020", "description": "Robust ranking and selection (R&S) is an important and challenging variation of conventional R&S that seeks to select the best alternative among a finite set of alternatives. It captures the common input uncertainty in the simulation model by using an ambiguity set to include multiple possible input distributions and shifts to select the best alternative with the smallest worst-case mean performance over the ambiguity set. In this paper, we aim at developing new fixed-budget robust R&S procedures to minimize the probability of incorrect selection (PICS) under a limited sampling budget. Inspired by an additive upper bound of the PICS, we derive a new asymptotically optimal solution to the budget allocation problem. Accordingly, we design a new sequential optimal computing budget allocation (OCBA) procedure to solve robust R&S problems efficiently. We then conduct a comprehensive numerical study to verify the superiority of our robust OCBA procedure over existing ones. 
The numerical study also provides insights into the budget allocation behaviors that lead to enhanced efficiency.", "authors": "Wan, Li, Hong"}, "https://arxiv.org/abs/2412.06027": {"title": "A Generalized Mixture Cure Model Incorporating Known Cured Individuals", "link": "https://arxiv.org/abs/2412.06027", "description": "Mixture Cure (MC) models constitute an appropriate and easily interpretable method when studying a time-to-event variable in a population comprised of both susceptible and cured individuals. In the literature, those models usually assume that the latter are unobservable. However, there are cases in which a cured individual may be identified. For example, when studying the distant metastasis during the lifetime or the miscarriage during pregnancy, individuals that have died without a metastasis or have given birth are certainly non-susceptible. The same also holds when studying the x-year overall survival or the death during hospital stay. Common MC models ignore this information and consider them all censored, thus risking the assignment of low immune probabilities to cured individuals. In this study, we consider an MC model that incorporates known information on cured individuals, with the time to cure identification being either deterministic or stochastic. We use the expectation-maximization algorithm to derive the maximum likelihood estimators. Furthermore, we compare different strategies that account for cure information such as (1) assigning infinite times to event for known cured cases and adjusting the traditional model and (2) considering only the probability of cure identification but ignoring the time until that happens. Theoretical results and simulations demonstrate the value of the proposed model especially when the time to cure identification is stochastic, increasing precision and decreasing the mean squared error. On the other hand, the traditional models that ignore the known cured information perform well when the cure is achieved after a known cutoff point. Moreover, simulations are used to compare the different strategies as possible alternatives to the complete-information model.", "authors": "Karakatsoulis"}, "https://arxiv.org/abs/2412.06092": {"title": "Density forecast transformations", "link": "https://arxiv.org/abs/2412.06092", "description": "The popular choice of using a $direct$ forecasting scheme implies that the individual predictions do not contain information on cross-horizon dependence. However, this dependence is needed if the forecaster has to construct, based on $direct$ density forecasts, predictive objects that are functions of several horizons ($e.g.$ when constructing annual-average growth rates from quarter-on-quarter growth rates). To address this issue we propose to use copulas to combine the individual $h$-step-ahead predictive distributions into a joint predictive distribution. Our method is particularly appealing to practitioners for whom changing the $direct$ forecasting specification is too costly. In a Monte Carlo study, we demonstrate that our approach leads to a better approximation of the true density than an approach that ignores the potential dependence. 
We show the superior performance of our method in several empirical examples, where we construct (i) quarterly forecasts using month-on-month $direct$ forecasts, (ii) annual-average forecasts using monthly year-on-year $direct$ forecasts, and (iii) annual-average forecasts using quarter-on-quarter $direct$ forecasts.", "authors": "Mogliani, Odendahl"}, "https://arxiv.org/abs/2412.06098": {"title": "Bayesian Clustering Prior with Overlapping Indices for Effective Use of Multisource External Data", "link": "https://arxiv.org/abs/2412.06098", "description": "The use of external data in clinical trials offers numerous advantages, such as reducing the number of patients, increasing study power, and shortening trial durations. In Bayesian inference, information in external data can be transferred into an informative prior for future borrowing (i.e., prior synthesis). However, multisource external data often exhibits heterogeneity, which can lead to information distortion during the prior synthesis. Clustering helps identify the heterogeneity, enhancing the congruence between the synthesized prior and the external data, thereby preventing information distortion. Obtaining optimal clustering is challenging due to the trade-off between congruence with external data and robustness to future data. We introduce two overlapping indices: the overlapping clustering index (OCI) and the overlapping evidence index (OEI). Using these indices alongside a K-Means algorithm, the optimal clustering of external data can be identified by balancing the trade-off. Based on the clustering result, we propose a prior synthesis framework to effectively borrow information from multisource external data. By incorporating the (robust) meta-analytic predictive prior into this framework, we develop (robust) Bayesian clustering MAP priors. Simulation studies and real-data analysis demonstrate their superiority over commonly used priors in the presence of heterogeneity. Since the Bayesian clustering priors are constructed without needing data from the prospective study to be conducted, they can be applied to both study design and data analysis in clinical trials or experiments.", "authors": "Lu, Lee"}, "https://arxiv.org/abs/2412.06114": {"title": "Randomized interventional effects for semicompeting risks, with application to allogeneic stem cell transplantation study", "link": "https://arxiv.org/abs/2412.06114", "description": "In clinical studies, the risk of the terminal event can be modified by an intermediate event, resulting in semicompeting risks. To study the treatment effect on the terminal event mediated by the intermediate event, it is of interest to decompose the total effect into direct and indirect effects. Three typical frameworks for causal mediation analysis have been proposed: natural effects, separable effects and interventional effects. In this article, we extend the interventional approach to time-to-event data, and compare it with other frameworks. With no time-varying confounders, these three frameworks lead to the same identification formula. With time-varying confounders, the interventional effects framework outperforms the other two because it requires weaker assumptions and fewer restrictions on time-varying confounders for identification. We present the identification formula for interventional effects and discuss some variants of the identification assumptions. 
As an illustration, we study the effect of transplant modalities on death mediated by relapse in an allogeneic stem cell transplantation study for treating leukemia, in the presence of a time-varying confounder.", "authors": "Deng, Wang"}, "https://arxiv.org/abs/2412.06119": {"title": "Sandwich regression for accurate and robust estimation in generalized linear multilevel and longitudinal models", "link": "https://arxiv.org/abs/2412.06119", "description": "Generalized linear models are a popular tool in applied statistics, with their maximum likelihood estimators enjoying asymptotic Gaussianity and efficiency. As all models are wrong, it is desirable to understand these estimators' behaviours under model misspecification. We study semiparametric multilevel generalized linear models, where only the conditional mean of the response is taken to follow a specific parametric form. Pre-existing estimators from mixed effects models and generalized estimating equations require specification of a conditional covariance, which when misspecified can result in inefficient estimates of fixed effects parameters. It is nevertheless often computationally attractive to consider a restricted, finite-dimensional class of estimators, as these models naturally imply. We introduce sandwich regression, which selects the estimator of minimal variance within a parametric class of estimators over all distributions in the full semiparametric model. We demonstrate numerically on simulated and real data the attractive improvements our sandwich regression approach enjoys over classical mixed effects models and generalized estimating equations.", "authors": "Young, Shah"}, "https://arxiv.org/abs/2412.06311": {"title": "SID: A Novel Class of Nonparametric Tests of Independence for Censored Outcomes", "link": "https://arxiv.org/abs/2412.06311", "description": "We propose a new class of metrics, called the survival independence divergence (SID), to test dependence between a right-censored outcome and covariates. A key technique for deriving the SIDs is to use a counting process strategy, which equivalently transforms the intractable independence test due to the presence of censoring into a test problem for complete observations. The SIDs are equal to zero if and only if the right-censored response and covariates are independent, and they are capable of detecting various types of nonlinear dependence. We propose empirical estimates of the SIDs and establish their asymptotic properties. We further develop a wild bootstrap method to estimate the critical values and show the consistency of the bootstrap tests. The numerical studies demonstrate that our SID-based tests are highly competitive with existing methods in a wide range of settings.", "authors": "Li, Liu, You et al"}, "https://arxiv.org/abs/2412.06442": {"title": "Efficiency of nonparametric superiority tests based on restricted mean survival time versus the log-rank test under proportional hazards", "link": "https://arxiv.org/abs/2412.06442", "description": "Background: For RCTs with time-to-event endpoints, proportional hazard (PH) models are typically used to estimate treatment effects and log-rank tests are commonly used for hypothesis testing. There is growing support for replacing this approach with a model-free estimand and assumption-lean analysis method. One alternative is to base the analysis on the difference in restricted mean survival time (RMST) at a specific time, a single-number summary measure that can be defined without any restrictive assumptions on the outcome model. 
In a simple setting without covariates, an assumption-lean analysis can be achieved using nonparametric methods such as Kaplan Meier estimation. The main advantage of moving to a model-free summary measure and assumption-lean analysis is that the validity and interpretation of conclusions do not depend on the PH assumption. The potential disadvantage is that the nonparametric analysis may lose efficiency under PH. There is disagreement in recent literature on this issue. Methods: Asymptotic results and simulations are used to compare the efficiency of a log-rank test against a nonparametric analysis of the difference in RMST in a superiority trial under PH. Previous studies have separately examined the effect of event rates and the censoring distribution on relative efficiency. This investigation clarifies conflicting results from earlier research by exploring the joint effect of event rate and censoring distribution together. Several illustrative examples are provided. Results: In scenarios with high event rates and/or substantial censoring across a large proportion of the study window, and when both methods make use of the same amount of data, relative efficiency is close to unity. However, in cases with low event rates but when censoring is concentrated at the end of the study window, the PH analysis has a considerable efficiency advantage.", "authors": "Magirr, Wang, Deng et al"}, "https://arxiv.org/abs/2412.06628": {"title": "Partial identification of principal causal effects under violations of principal ignorability", "link": "https://arxiv.org/abs/2412.06628", "description": "Principal stratification is a general framework for studying causal mechanisms involving post-treatment variables. When estimating principal causal effects, the principal ignorability assumption is commonly invoked, which we study in detail in this manuscript. Our first key contribution is studying a commonly used strategy of using parametric models to jointly model the outcome and principal strata without requiring the principal ignorability assumption. We show that even if the joint distribution of principal strata is known, this strategy necessarily leads to only partial identification of causal effects, even under very simple and correctly specified outcome models. While principal ignorability can lead to point identification in this setting, we discuss alternative, weaker assumptions and show how they lead to more informative partial identification regions. An additional contribution is that we provide theoretical support to strategies used in the literature for identifying association parameters that govern the joint distribution of principal strata. We prove that this is possible, but only if the principal ignorability assumption is violated. Additionally, due to partial identifiability of causal effects even when these association parameters are known, we show that these association parameters are only identifiable under strong parametric constraints. Lastly, we extend these results to more flexible semiparametric and nonparametric Bayesian models.", "authors": "Wu, Antonelli"}, "https://arxiv.org/abs/2412.06688": {"title": "Probabilistic Targeted Factor Analysis", "link": "https://arxiv.org/abs/2412.06688", "description": "We develop a probabilistic variant of Partial Least Squares (PLS) we call Probabilistic Targeted Factor Analysis (PTFA), which can be used to extract common factors in predictors that are useful to predict a set of predetermined target variables. 
Along with the technique, we provide an efficient expectation-maximization (EM) algorithm to learn the parameters and forecast the targets of interest. We develop a number of extensions to missing-at-random data, stochastic volatility, and mixed-frequency data for real-time forecasting. In a simulation exercise, we show that PTFA outperforms PLS at recovering the common underlying factors affecting both features and target variables delivering better in-sample fit, and providing valid forecasts under contamination such as measurement error or outliers. Finally, we provide two applications in Economics and Finance where PTFA performs competitively compared with PLS and Principal Component Analysis (PCA) at out-of-sample forecasting.", "authors": "Herculano, Montoya-Bland\\'on"}, "https://arxiv.org/abs/2412.06766": {"title": "Testing Mutual Independence in Metric Spaces Using Distance Profiles", "link": "https://arxiv.org/abs/2412.06766", "description": "This paper introduces a novel unified framework for testing mutual independence among a vector of random objects that may reside in different metric spaces, including some existing methodologies as special cases. The backbone of the proposed tests is the notion of joint distance profiles, which uniquely characterize the joint law of random objects under a mild condition on the joint law or on the metric spaces. Our test statistics measure the difference of the joint distance profiles of each data point with respect to the joint law and the product of marginal laws of the vector of random objects, where flexible data-adaptive weight profiles are incorporated for power enhancement. We derive the limiting distribution of the test statistics under the null hypothesis of mutual independence and show that the proposed tests with specific weight profiles are asymptotically distribution-free if the marginal distance profiles are continuous. We also establish the consistency of the tests under sequences of alternative hypotheses converging to the null. Furthermore, since the asymptotic tests with non-trivial weight profiles require the knowledge of the underlying data distribution, we adopt a permutation scheme to approximate the $p$-values and provide theoretical guarantees that the permutation-based tests control the type I error rate under the null and are consistent under the alternatives. We demonstrate the power of the proposed tests across various types of data objects through simulations and real data applications, where our tests are shown to have superior performance compared with popular existing approaches.", "authors": "Chen, Dubey"}, "https://arxiv.org/abs/1510.04315": {"title": "Deriving Priorities From Inconsistent PCM using the Network Algorithms", "link": "https://arxiv.org/abs/1510.04315", "description": "In several multiobjective decision problems Pairwise Comparison Matrices (PCM) are applied to evaluate the decision variants. The problem that arises very often is the inconsistency of a given PCM. In such a situation it is important to approximate the PCM with a consistent one. The most common way is to minimize the Euclidean distance between the matrices. In the paper we consider the problem of minimizing the maximum distance. After applying the logarithmic transformation we are able to formulate the obtained subproblem as a Shortest Path Problem and solve it more efficiently. 
We analyze and completely characterize the form of the set of optimal solutions and provide an algorithm that results in a unique, Pareto-efficient solution.", "authors": "Anholcer, F\\\"ul\\\"op"}, "https://arxiv.org/abs/2412.05506": {"title": "Ranking of Large Language Model with Nonparametric Prompts", "link": "https://arxiv.org/abs/2412.05506", "description": "We consider the inference for the ranking of large language models (LLMs). Alignment arises as a big challenge to mitigate hallucinations in the use of LLMs. Ranking LLMs has been shown as a well-performing tool to improve alignment based on the best-of-$N$ policy. In this paper, we propose a new inferential framework for testing hypotheses and constructing confidence intervals of the ranking of language models. We consider the widely adopted Bradley-Terry-Luce (BTL) model, where each item is assigned a positive preference score that determines its pairwise comparisons' outcomes. We further extend it into the contextual setting, where the score of each model varies with the prompt. We show the convergence rate of our estimator. By extending the current Gaussian multiplier bootstrap theory to accommodate the supremum of not identically distributed empirical processes, we construct the confidence interval for ranking and propose a valid testing procedure. We also introduce the confidence diagram as a global ranking property. We conduct numerical experiments to assess the performance of our method.", "authors": "Wang, Han, Fang et al"}, "https://arxiv.org/abs/2412.05608": {"title": "Exact distribution-free tests of spherical symmetry applicable to high dimensional data", "link": "https://arxiv.org/abs/2412.05608", "description": "We develop some graph-based tests for spherical symmetry of a multivariate distribution using a method based on data augmentation. These tests are constructed using a new notion of signs and ranks that are computed along a path obtained by optimizing an objective function based on pairwise dissimilarities among the observations in the augmented data set. The resulting tests based on these signs and ranks have the exact distribution-free property, and irrespective of the dimension of the data, the null distributions of the test statistics remain the same. These tests can be conveniently used for high-dimensional data, even when the dimension is much larger than the sample size. Under appropriate regularity conditions, we prove the consistency of these tests in high dimensional asymptotic regime, where the dimension grows to infinity while the sample size may or may not grow with the dimension. We also propose a generalization of our methods to take care of the situations, where the center of symmetry is not specified by the null hypothesis. Several simulated data sets and a real data set are analyzed to demonstrate the utility of the proposed tests.", "authors": "Banerjee, Ghosh"}, "https://arxiv.org/abs/2412.05759": {"title": "Leveraging Black-box Models to Assess Feature Importance in Unconditional Distribution", "link": "https://arxiv.org/abs/2412.05759", "description": "Understanding how changes in explanatory features affect the unconditional distribution of the outcome is important in many applications. However, existing black-box predictive models are not readily suited for analyzing such questions. 
In this work, we develop an approximation method to compute the feature importance curves relevant to the unconditional distribution of outcomes, while leveraging the power of pre-trained black-box predictive models. The feature importance curves measure the changes across quantiles of outcome distribution given an external impact of change in the explanatory features. Through extensive numerical experiments and real data examples, we demonstrate that our approximation method produces sparse and faithful results, and is computationally efficient.", "authors": "Zhou, Li"}, "https://arxiv.org/abs/2412.06042": {"title": "Infinite Mixture Models for Improved Modeling of Across-Site Evolutionary Variation", "link": "https://arxiv.org/abs/2412.06042", "description": "Scientific studies in many areas of biology routinely employ evolutionary analyses based on the probabilistic inference of phylogenetic trees from molecular sequence data. Evolutionary processes that act at the molecular level are highly variable, and properly accounting for heterogeneity in evolutionary processes is crucial for more accurate phylogenetic inference. Nucleotide substitution rates and patterns are known to vary among sites in multiple sequence alignments, and such variation can be modeled by partitioning alignments into categories corresponding to different substitution models. Determining $\\textit{a priori}$ appropriate partitions can be difficult, however, and better model fit can be achieved through flexible Bayesian infinite mixture models that simultaneously infer the number of partitions, the partition that each site belongs to, and the evolutionary parameters corresponding to each partition. Here, we consider several different types of infinite mixture models, including classic Dirichlet process mixtures, as well as novel approaches for modeling across-site evolutionary variation: hierarchical models for data with a natural group structure, and infinite hidden Markov models that account for spatial patterns in alignments. In analyses of several viral data sets, we find that different types of infinite mixture models emerge as the best choices in different scenarios. To enable these models to scale efficiently to large data sets, we adapt efficient Markov chain Monte Carlo algorithms and exploit opportunities for parallel computing. We implement this infinite mixture modeling framework in BEAST X, a widely-used software package for Bayesian phylogenetic inference.", "authors": "Gill, Baele, Suchard et al"}, "https://arxiv.org/abs/2412.06087": {"title": "Ethnography and Machine Learning: Synergies and New Directions", "link": "https://arxiv.org/abs/2412.06087", "description": "Ethnography (social scientific methods that illuminate how people understand, navigate and shape the real world contexts in which they live their lives) and machine learning (computational techniques that use big data and statistical learning models to perform quantifiable tasks) are each core to contemporary social science. Yet these tools have remained largely separate in practice. This chapter draws on a growing body of scholarship that argues that ethnography and machine learning can be usefully combined, particularly for large comparative studies. 
Specifically, this paper (a) explains the value (and challenges) of using machine learning alongside qualitative field research for certain types of projects, (b) discusses recent methodological trends to this effect, (c) provides examples that illustrate workflow drawn from several large projects, and (d) concludes with a roadmap for enabling productive coevolution of field methods and machine learning.", "authors": "Li, Abramson"}, "https://arxiv.org/abs/2412.06582": {"title": "Optimal estimation in private distributed functional data analysis", "link": "https://arxiv.org/abs/2412.06582", "description": "We systematically investigate the preservation of differential privacy in functional data analysis, beginning with functional mean estimation and extending to varying coefficient model estimation. Our work introduces a distributed learning framework involving multiple servers, each responsible for collecting several sparsely observed functions. This hierarchical setup introduces a mixed notion of privacy. Within each function, user-level differential privacy is applied to $m$ discrete observations. At the server level, central differential privacy is deployed to account for the centralised nature of data collection. Across servers, only private information is exchanged, adhering to federated differential privacy constraints. To address this complex hierarchy, we employ minimax theory to reveal several fundamental phenomena: from sparse to dense functional data analysis, from user-level to central and federated differential privacy costs, and the intricate interplay between different regimes of functional data analysis and privacy preservation.\n To the best of our knowledge, this is the first study to rigorously examine functional data estimation under multiple privacy constraints. Our theoretical findings are complemented by efficient private algorithms and extensive numerical evidence, providing a comprehensive exploration of this challenging problem.", "authors": "Xue, Lin, Yu"}, "https://arxiv.org/abs/2205.03970": {"title": "Policy Choice in Time Series by Empirical Welfare Maximization", "link": "https://arxiv.org/abs/2205.03970", "description": "This paper develops a novel method for policy choice in a dynamic setting where the available data is a multi-variate time series. Building on the statistical treatment choice framework, we propose Time-series Empirical Welfare Maximization (T-EWM) methods to estimate an optimal policy rule by maximizing an empirical welfare criterion constructed using nonparametric potential outcome time series. We characterize conditions under which T-EWM consistently learns a policy choice that is optimal in terms of conditional welfare given the time-series history. We derive a nonasymptotic upper bound for conditional welfare regret. To illustrate the implementation and uses of T-EWM, we perform simulation studies and apply the method to estimate optimal restriction rules against Covid-19.", "authors": "Kitagawa, Wang, Xu"}, "https://arxiv.org/abs/2207.12081": {"title": "A unified quantile framework for nonlinear heterogeneous transcriptome-wide associations", "link": "https://arxiv.org/abs/2207.12081", "description": "Transcriptome-wide association studies (TWAS) are powerful tools for identifying gene-level associations by integrating genome-wide association studies and gene expression data. 
However, most TWAS methods focus on linear associations between genes and traits, ignoring the complex nonlinear relationships that may be present in biological systems. To address this limitation, we propose a novel framework, QTWAS, which integrates a quantile-based gene expression model into the TWAS model, allowing for the discovery of nonlinear and heterogeneous gene-trait associations. Via comprehensive simulations and applications to both continuous and binary traits, we demonstrate that the proposed model is more powerful than conventional TWAS in identifying gene-trait associations.", "authors": "Wang, Ionita-Laza, Wei"}, "https://arxiv.org/abs/2211.05844": {"title": "Quantile Fourier Transform, Quantile Series, and Nonparametric Estimation of Quantile Spectra", "link": "https://arxiv.org/abs/2211.05844", "description": "A nonparametric method is proposed for estimating the quantile spectra and cross-spectra introduced in Li (2012; 2014) as bivariate functions of frequency and quantile level. The method is based on the quantile discrete Fourier transform (QDFT) defined by trigonometric quantile regression and the quantile series (QSER) defined by the inverse Fourier transform of the QDFT. A nonparametric spectral estimator is constructed from the autocovariance function of the QSER using the lag-window (LW) approach. Smoothing techniques are also employed to reduce the statistical variability of the LW estimator across quantiles when the underlying spectrum varies smoothly with respect to the quantile level. The performance of the proposed estimation method is evaluated through a simulation study.", "authors": "Li"}, "https://arxiv.org/abs/2211.13478": {"title": "A New Spatio-Temporal Model Exploiting Hamiltonian Equations", "link": "https://arxiv.org/abs/2211.13478", "description": "The solutions of Hamiltonian equations are known to describe the underlying phase space of a mechanical system. In this article, we propose a novel spatio-temporal model using a strategic modification of the Hamiltonian equations, incorporating appropriate stochasticity via Gaussian processes. The resultant spatio-temporal process, continuously varying with time, turns out to be nonparametric, non-stationary, non-separable, and non-Gaussian. Additionally, the lagged correlations converge to zero as the spatio-temporal lag goes to infinity. We investigate the theoretical properties of the new spatio-temporal process, including its continuity and smoothness properties. We derive methods for complete Bayesian inference using MCMC techniques in the Bayesian paradigm. The performance of our method has been compared with that of a non-stationary Gaussian process (GP) using two simulation studies, where our method shows a significant improvement over the non-stationary GP. Further, applying our new model to two real data sets revealed encouraging performance.", "authors": "Mazumder, Banerjee, Bhattacharya"}, "https://arxiv.org/abs/2305.07549": {"title": "Distribution free MMD tests for model selection with estimated parameters", "link": "https://arxiv.org/abs/2305.07549", "description": "There exist some testing procedures based on the maximum mean discrepancy (MMD) to address the challenge of model specification. However, they ignore the presence of estimated parameters in the case of composite null hypotheses. In this paper, we first illustrate the effect of parameter estimation in model specification tests based on the MMD. 
Second, we propose simple model specification and model selection tests in the case of models with estimated parameters. All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. A simulation study and a real data analysis illustrate the performance of our tests in terms of power and level.", "authors": "Br\\\"uck, Fermanian, Min"}, "https://arxiv.org/abs/2306.11689": {"title": "Statistical Tests for Replacing Human Decision Makers with Algorithms", "link": "https://arxiv.org/abs/2306.11689", "description": "This paper proposes a statistical framework of using artificial intelligence to improve human decision making. The performance of each human decision maker is benchmarked against that of machine predictions. We replace the diagnoses made by a subset of the decision makers with the recommendation from the machine learning algorithm. We apply both a heuristic frequentist approach and a Bayesian posterior loss function approach to abnormal birth detection using a nationwide dataset of doctor diagnoses from prepregnancy checkups of reproductive age couples and pregnancy outcomes. We find that our algorithm on a test dataset results in a higher overall true positive rate and a lower false positive rate than the diagnoses made by doctors only.", "authors": "Feng, Hong, Tang et al"}, "https://arxiv.org/abs/2308.08152": {"title": "Estimating Effects of Long-Term Treatments", "link": "https://arxiv.org/abs/2308.08152", "description": "Estimating the effects of long-term treatments through A/B testing is challenging. Treatments, such as updates to product functionalities, user interface designs, and recommendation algorithms, are intended to persist within the system for a long duration of time after their initial launches. However, due to the constraints of conducting long-term experiments, practitioners often rely on short-term experimental results to make product launch decisions. It remains open how to accurately estimate the effects of long-term treatments using short-term experimental data. To address this question, we introduce a longitudinal surrogate framework that decomposes the long-term effects into functions based on user attributes, short-term metrics, and treatment assignments. We outline identification assumptions, estimation strategies, inferential techniques, and validation methods under this framework. Empirically, we demonstrate that our approach outperforms existing solutions by using data from two real-world experiments, each involving more than a million users on WeChat, one of the world's largest social networking platforms.", "authors": "(Jingjing), (Jingjing), (Jingjing) et al"}, "https://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "https://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set of likelihood components. This approach provides a flexible framework for drawing inference when the likelihood function of a statistical model is computationally intractable. While composite likelihood has computational advantages, it can still be demanding when dealing with numerous likelihood components and a large sample size. This paper tackles this challenge by employing an approximation of the conventional composite likelihood estimator, which is derived from an optimization procedure relying on stochastic gradients. 
This novel estimator is shown to be asymptotically normally distributed around the true parameter. In particular, based on the relative divergence rate of the sample size and the number of optimization iterations, the variance of the limiting distribution is shown to compound two sources of uncertainty: the sampling variability of the data and the optimization noise, with the latter depending on the sampling distribution used to construct the stochastic gradients. The advantages of the proposed framework are illustrated through simulation studies on two working examples: an Ising model for binary data and a gamma frailty model for count data. Finally, a real-data application is presented, showing its effectiveness in a large-scale mental health survey.", "authors": "Alfonzetti, Bellio, Chen et al"}, "https://arxiv.org/abs/2211.16182": {"title": "Residual permutation test for regression coefficient testing", "link": "https://arxiv.org/abs/2211.16182", "description": "We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates $p$ can be up to a constant fraction of sample size $n$. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noises, whenever $p < n / 2$. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noises with bounded $(1+t)$-th order moment when the true coefficient is at least of order $n^{-t/(1+t)}$ for $t \\in [0,1]$. We further prove that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.", "authors": "Wen, Wang, Wang"}, "https://arxiv.org/abs/2311.17797": {"title": "Learning to Simulate: Generative Metamodeling via Quantile Regression", "link": "https://arxiv.org/abs/2311.17797", "description": "Stochastic simulation models effectively capture complex system dynamics but are often too slow for real-time decision-making. Traditional metamodeling techniques learn relationships between simulator inputs and a single output summary statistic, such as the mean or median. These techniques enable real-time predictions without additional simulations. However, they require prior selection of one appropriate output summary statistic, limiting their flexibility in practical applications. We propose a new concept: generative metamodeling. It aims to construct a \"fast simulator of the simulator,\" generating random outputs significantly faster than the original simulator while preserving approximately equal conditional distributions. Generative metamodels enable rapid generation of numerous random outputs upon input specification, facilitating immediate computation of any summary statistic for real-time decision-making. We introduce a new algorithm, quantile-regression-based generative metamodeling (QRGMM), and establish its distributional convergence and convergence rate. 
Extensive numerical experiments demonstrate QRGMM's efficacy compared to other state-of-the-art generative algorithms in practical real-time decision-making scenarios.", "authors": "Hong, Hou, Zhang et al"}, "https://arxiv.org/abs/2401.13009": {"title": "Comparative Study of Causal Discovery Methods for Cyclic Models with Hidden Confounders", "link": "https://arxiv.org/abs/2401.13009", "description": "Nowadays, the need for causal discovery is ubiquitous. A better understanding of not just the stochastic dependencies between parts of a system, but also the actual cause-effect relations, is essential for all parts of science. Thus, the need for reliable methods to detect causal directions is growing constantly. In the last 50 years, many causal discovery algorithms have emerged, but most of them are applicable only under the assumption that the systems have no feedback loops and that they are causally sufficient, i.e. that there are no unmeasured subsystems that can affect multiple measured variables. This is unfortunate since those restrictions can often not be presumed in practice. Feedback is an integral feature of many processes, and real-world systems are rarely completely isolated and fully measured. Fortunately, in recent years, several techniques, that can cope with cyclic, causally insufficient systems, have been developed. And with multiple methods available, a practical application of those algorithms now requires knowledge of the respective strengths and weaknesses. Here, we focus on the problem of causal discovery for sparse linear models which are allowed to have cycles and hidden confounders. We have prepared a comprehensive and thorough comparative study of four causal discovery techniques: two versions of the LLC method [10] and two variants of the ASP-based algorithm [11]. The evaluation investigates the performance of those techniques for various experiments with multiple interventional setups and different dataset sizes.", "authors": "Lorbeer, Mohsen"}, "https://arxiv.org/abs/2412.06848": {"title": "Application of Random Matrix Theory in High-Dimensional Statistics", "link": "https://arxiv.org/abs/2412.06848", "description": "This review article provides an overview of random matrix theory (RMT) with a focus on its growing impact on the formulation and inference of statistical models and methodologies. Emphasizing applications within high-dimensional statistics, we explore key theoretical results from RMT and their role in addressing challenges associated with high-dimensional data. The discussion highlights how advances in RMT have significantly influenced the development of statistical methods, particularly in areas such as covariance matrix inference, principal component analysis (PCA), signal processing, and changepoint detection, demonstrating the close interplay between theory and practice in modern high-dimensional statistical inference.", "authors": "Bhattacharyya, Chattopadhyay, Basu"}, "https://arxiv.org/abs/2412.06870": {"title": "Variable Selection for Comparing High-dimensional Time-Series Data", "link": "https://arxiv.org/abs/2412.06870", "description": "Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. 
In applications where one time series is an output from a computationally expensive simulator, the approach may be used for validating the simulator against real data, for comparing the outputs of two simulators, and for validating a machine learning-based emulator against the simulator. With the proposed approach, the entire time interval is split into multiple subintervals, and on each subinterval, the two sample sets are compared to select variables that distinguish their distributions and a two-sample test is performed. The validity and limitations of the proposed approach are investigated in synthetic data experiments. Its usefulness is demonstrated in an application with a particle-based fluid simulator, where a deep neural network model is compared against the simulator, and in an application with a microscopic traffic simulator, where the effects of changing the simulator's parameters on traffic flows are analysed.", "authors": "Mitsuzawa, Grossi, Bortoli et al"}, "https://arxiv.org/abs/2412.07031": {"title": "Large Language Models: An Applied Econometric Framework", "link": "https://arxiv.org/abs/2412.07031", "description": "Large language models (LLMs) are being used in economics research to form predictions, label text, simulate human responses, generate hypotheses, and even produce data for times and places where such data don't exist. While these uses are creative, are they valid? When can we abstract away from the inner workings of an LLM and simply rely on their outputs? We develop an econometric framework to answer this question. Our framework distinguishes between two types of empirical tasks. Using LLM outputs for prediction problems (including hypothesis generation) is valid under one condition: no \"leakage\" between the LLM's training dataset and the researcher's sample. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed by some text or from human subjects) requires an additional assumption: LLM outputs must be as good as the gold standard measurements they replace. Otherwise estimates can be biased, even if LLM outputs are highly accurate but not perfectly so. We document the extent to which these conditions are violated and the implications for research findings in illustrative applications to finance and political economy. We also provide guidance to empirical researchers. The only way to ensure no training leakage is to use open-source LLMs with documented training data and published weights. The only way to deal with LLM measurement error is to collect validation data and model the error structure. A corollary is that if such conditions can't be met for a candidate LLM application, our strong advice is: don't.", "authors": "Ludwig, Mullainathan, Rambachan"}, "https://arxiv.org/abs/2412.07170": {"title": "On the Consistency of Bayesian Adaptive Testing under the Rasch Model", "link": "https://arxiv.org/abs/2412.07170", "description": "This study establishes the consistency of Bayesian adaptive testing methods under the Rasch model, addressing a gap in the literature on their large-sample guarantees. Although Bayesian approaches are recognized for their finite-sample performance and capability to circumvent issues such as the cold-start problem, rigorous proofs of their asymptotic properties, particularly in non-i.i.d. structures, remain lacking. 
We derive conditions under which the posterior distributions of latent traits converge to the true values for a sequence of given items, and demonstrate that Bayesian estimators remain robust under the mis-specification of the prior. Our analysis then extends to adaptive item selection methods in which items are chosen endogenously during the test. Additionally, we develop a Bayesian decision-theoretical framework for the item selection problem and propose a novel selection that aligns the test process with optimal estimator performance. These theoretical results provide a foundation for Bayesian methods in adaptive testing, complementing prior evidence of their finite-sample advantages.", "authors": "Yang, Wei, Chen"}, "https://arxiv.org/abs/2412.07184": {"title": "Automatic Doubly Robust Forests", "link": "https://arxiv.org/abs/2412.07184", "description": "This paper proposes the automatic Doubly Robust Random Forest (DRRF) algorithm for estimating the conditional expectation of a moment functional in the presence of high-dimensional nuisance functions. DRRF combines the automatic debiasing framework using the Riesz representer (Chernozhukov et al., 2022) with non-parametric, forest-based estimation methods for the conditional moment (Athey et al., 2019; Oprescu et al., 2019). In contrast to existing methods, DRRF does not require prior knowledge of the form of the debiasing term nor impose restrictive parametric or semi-parametric assumptions on the target quantity. Additionally, it is computationally efficient for making predictions at multiple query points and significantly reduces runtime compared to methods such as Orthogonal Random Forest (Oprescu et al., 2019). We establish the consistency and asymptotic normality results of DRRF estimator under general assumptions, allowing for the construction of valid confidence intervals. Through extensive simulations in heterogeneous treatment effect (HTE) estimation, we demonstrate the superior performance of DRRF over benchmark approaches in terms of estimation accuracy, robustness, and computational efficiency.", "authors": "Chen, Duan, Chernozhukov et al"}, "https://arxiv.org/abs/2412.07352": {"title": "Inference after discretizing unobserved heterogeneity", "link": "https://arxiv.org/abs/2412.07352", "description": "We consider a linear panel data model with nonseparable two-way unobserved heterogeneity corresponding to a linear version of the model studied in Bonhomme et al. (2022). We show that inference is possible in this setting using a straightforward two-step estimation procedure inspired by existing discretization approaches. In the first step, we construct a discrete approximation of the unobserved heterogeneity by (k-means) clustering observations separately across the individual ($i$) and time ($t$) dimensions. In the second step, we estimate a linear model with two-way group fixed effects specific to each cluster. Our approach shares similarities with methods from the double machine learning literature, as the underlying moment conditions exhibit the same type of bias-reducing properties. We provide a theoretical analysis of a cross-fitted version of our estimator, establishing its asymptotic normality at parametric rate under the condition $\\max(N,T)=o(\\min(N,T)^3)$. 
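A schematic version of the two-step procedure just described (k-means discretisation of the two-way heterogeneity, then a regression with interacted group fixed effects) might look as follows; the simulated panel, the number of clusters, and the omission of the cross-fitting analysed in the paper are all simplifications made for illustration.

```python
# Schematic two-step estimator in the spirit of the discretization approach:
# (1) k-means cluster units and time periods separately,
# (2) OLS of y on x plus fixed effects for each (unit-group x time-group) cell.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
N, T, beta = 200, 50, 1.0
alpha, gamma = rng.normal(size=(N, 1)), rng.normal(size=(1, T))
X = rng.normal(size=(N, T)) + 0.5 * alpha * gamma      # covariate correlated with heterogeneity
Y = beta * X + alpha * gamma + rng.normal(size=(N, T))

# Step 1: discretize the two-way unobserved heterogeneity with k-means along each dimension.
gi = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(Y)      # rows: units
gt = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(Y.T)    # columns: periods

# Step 2: OLS of y on x and interacted group dummies.
df = pd.DataFrame({
    "y": Y.ravel(),
    "x": X.ravel(),
    "cell": [f"{gi[i]}_{gt[t]}" for i in range(N) for t in range(T)],
})
D = pd.get_dummies(df["cell"], dtype=float)
Z = np.column_stack([df["x"].to_numpy(), D.to_numpy()])
coef, *_ = np.linalg.lstsq(Z, df["y"].to_numpy(), rcond=None)
print("estimated beta:", coef[0])
```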
Simulation studies demonstrate that our methodology achieves excellent finite-sample performance, even when $T$ is negligible with respect to $N$.", "authors": "Beyhum, Mugnier"}, "https://arxiv.org/abs/2412.07404": {"title": "Median Based Unit Weibull Distribution (MBUW): Does the Higher Order Probability Weighted Moments (PWM) Add More Information over the Lower Order PWM in Parameter Estimation", "link": "https://arxiv.org/abs/2412.07404", "description": "In the present paper, Probability weighted moments (PWMs) method for parameter estimation of the median based unit weibull (MBUW) distribution is discussed. The most widely used first order PWMs is compared with the higher order PWMs for parameter estimation of (MBUW) distribution. Asymptotic distribution of this PWM estimator is derived. This comparison is illustrated using real data analysis.", "authors": "Attia"}, "https://arxiv.org/abs/2412.07495": {"title": "A comparison of Kaplan--Meier-based inverse probability of censoring weighted regression methods", "link": "https://arxiv.org/abs/2412.07495", "description": "Weighting with the inverse probability of censoring is an approach to deal with censoring in regression analyses where the outcome may be missing due to right-censoring. In this paper, three separate approaches involving this idea in a setting where the Kaplan--Meier estimator is used for estimating the censoring probability are compared. In more detail, the three approaches involve weighted regression, regression with a weighted outcome, and regression of a jack-knife pseudo-observation based on a weighted estimator. Expressions of the asymptotic variances are given in each case and the expressions are compared to each other and to the uncensored case. In terms of low asymptotic variance, a clear winner cannot be found. Which approach will have the lowest asymptotic variance depends on the censoring distribution. Expressions of the limit of the standard sandwich variance estimator in the three cases are also provided, revealing an overestimation under the implied assumptions.", "authors": "Overgaard"}, "https://arxiv.org/abs/2412.07524": {"title": "Gaussian Process with dissolution spline kernel", "link": "https://arxiv.org/abs/2412.07524", "description": "In-vitro dissolution testing is a critical component in the quality control of manufactured drug products. The $\\mathrm{f}_2$ statistic is the standard for assessing similarity between two dissolution profiles. However, the $\\mathrm{f}_2$ statistic has known limitations: it lacks an uncertainty estimate, is a discrete-time metric, and is a biased measure, calculating the differences between profiles at discrete time points. To address these limitations, we propose a Gaussian Process (GP) with a dissolution spline kernel for dissolution profile comparison. The dissolution spline kernel is a new spline kernel using logistic functions as its basis functions, enabling the GP to capture the expected monotonic increase in dissolution curves. This results in better predictions of dissolution curves. This new GP model reduces bias in the $\\mathrm{f}_2$ calculation by allowing predictions to be interpolated in time between observed values, and provides uncertainty quantification. We assess the model's performance through simulations and real datasets, demonstrating its improvement over a previous GP-based model introduced for dissolution testing. We also show that the new model can be adapted to include dissolution-specific covariates. 
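The following sketch conveys the general idea of a covariance built from logistic basis functions and used for Gaussian-process prediction of a monotone, dissolution-like curve. The exact dissolution spline kernel is defined in the paper; the basis centres, scale, prior weight and noise level below are arbitrary assumptions made for illustration.

```python
# Generic GP regression with a finite-basis kernel built from logistic functions,
# loosely inspired by the idea of a spline kernel with logistic basis functions.
import numpy as np

rng = np.random.default_rng(2)

def logistic_basis(t, centres, scale=0.08):
    # one column per basis function phi_j(t) = 1 / (1 + exp(-(t - c_j) / scale))
    return 1.0 / (1.0 + np.exp(-(t[:, None] - centres[None, :]) / scale))

def kernel(t1, t2, centres, weight):
    # finite-basis covariance k(t, t') = weight * sum_j phi_j(t) phi_j(t')
    return weight * logistic_basis(t1, centres) @ logistic_basis(t2, centres).T

t_obs = np.array([0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0])          # scaled sampling times
y_obs = 100 / (1 + np.exp(-(t_obs - 0.25) / 0.07)) + rng.normal(0, 2, t_obs.size)

centres, weight, noise_var = np.linspace(0, 1, 12), 200.0, 4.0
t_new = np.linspace(0, 1, 101)

K = kernel(t_obs, t_obs, centres, weight) + noise_var * np.eye(t_obs.size)
K_star = kernel(t_new, t_obs, centres, weight)
y_bar = y_obs.mean()
alpha = np.linalg.solve(K, y_obs - y_bar)                         # GP regression on centred data
mean = y_bar + K_star @ alpha                                     # posterior mean curve
cov = kernel(t_new, t_new, centres, weight) - K_star @ np.linalg.solve(K, K_star.T)
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(np.round(mean[::25], 1), np.round(sd[::25], 1))
```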
Applying the model to real ibuprofen dissolution data under various conditions, we demonstrate its ability to extrapolate curve shapes across different experimental settings.", "authors": "Murphy, Bachiller, D'Arcy et al"}, "https://arxiv.org/abs/2412.07604": {"title": "Nested exemplar latent space models for dimension reduction in dynamic networks", "link": "https://arxiv.org/abs/2412.07604", "description": "Dynamic latent space models are widely used for characterizing changes in networks and relational data over time. These models assign to each node latent attributes that characterize connectivity with other nodes, with these latent attributes dynamically changing over time. Node attributes can be organized as a three-way tensor with modes corresponding to nodes, latent space dimension, and time. Unfortunately, as the number of nodes and time points increases, the number of elements of this tensor becomes enormous, leading to computational and statistical challenges, particularly when data are sparse. We propose a new approach for massively reducing dimensionality by expressing the latent node attribute tensor as low rank. This leads to an interesting new {\\em nested exemplar} latent space model, which characterizes the node attribute tensor as dependent on low-dimensional exemplar traits for each node, weights for each latent space dimension, and exemplar curves characterizing time variation. We study properties of this framework, including expressivity, and develop efficient Bayesian inference algorithms. The approach leads to substantial advantages in simulations and applications to ecological networks.", "authors": "Kampe, Silva, Dunson et al"}, "https://arxiv.org/abs/2412.07611": {"title": "Deep Partially Linear Transformation Model for Right-Censored Survival Data", "link": "https://arxiv.org/abs/2412.07611", "description": "Although the Cox proportional hazards model is well established and extensively used in the analysis of survival data, the proportional hazards (PH) assumption may not always hold in practical scenarios. The semiparametric transformation model extends the conventional Cox model and also includes many other survival models as special cases. This paper introduces a deep partially linear transformation model (DPLTM) as a general and flexible framework for estimation, inference and prediction. The proposed method is capable of avoiding the curse of dimensionality while still retaining the interpretability of some covariates of interest. We derive the overall convergence rate of the maximum likelihood estimators, the minimax lower bound of the nonparametric deep neural network (DNN) estimator, the asymptotic normality and the semiparametric efficiency of the parametric estimator. Comprehensive simulation studies demonstrate the impressive performance of the proposed estimation procedure in terms of both estimation accuracy and prediction power, which is further validated by an application to a real-world dataset.", "authors": "Yin, Zhang, Yu"}, "https://arxiv.org/abs/2412.07649": {"title": "Machine Learning the Macroeconomic Effects of Financial Shocks", "link": "https://arxiv.org/abs/2412.07649", "description": "We propose a method to learn the nonlinear impulse responses to structural shocks using neural networks, and apply it to uncover the effects of US financial shocks. The results reveal substantial asymmetries with respect to the sign of the shock. 
Adverse financial shocks have powerful effects on the US economy, while benign shocks trigger much smaller reactions. Instead, with respect to the size of the shocks, we find no discernible asymmetries.", "authors": "Hauzenberger, Huber, Klieber et al"}, "https://arxiv.org/abs/2412.07194": {"title": "Emperical Study on Various Symmetric Distributions for Modeling Time Series", "link": "https://arxiv.org/abs/2412.07194", "description": "This study evaluated probability distributions for modeling time series with abrupt structural changes. The Pearson type VII distribution, with an adjustable shape parameter $b$, proved versatile. The generalized Laplace distribution performed similarly to the Pearson model, occasionally surpassing it in terms of likelihood and AIC. Mixture models, including the mixture of $\\delta$-function and Gaussian distribution, showed potential but were less stable. Pearson type VII and extended Laplace models were deemed more reliable for general cases. Model selection depends on data characteristics and goals.", "authors": "Mathmatics)"}, "https://arxiv.org/abs/2412.07657": {"title": "Probabilistic Modelling of Multiple Long-Term Condition Onset Times", "link": "https://arxiv.org/abs/2412.07657", "description": "The co-occurrence of multiple long-term conditions (MLTC), or multimorbidity, in an individual can reduce their lifespan and severely impact their quality of life. Exploring the longitudinal patterns, e.g. clusters, of disease accrual can help better understand the genetic and environmental drivers of multimorbidity, and potentially identify individuals who may benefit from early targeted intervention. We introduce $\\textit{probabilistic modelling of onset times}$, or $\\texttt{ProMOTe}$, for clustering and forecasting MLTC trajectories. $\\texttt{ProMOTe}$ seamlessly learns from incomplete and unreliable disease trajectories that are commonplace in Electronic Health Records but often ignored in existing longitudinal clustering methods. We analyse data from 150,000 individuals in the UK Biobank and identify 50 clusters showing patterns of disease accrual that have also been reported by some recent studies. We further discuss the forecasting capabilities of the model given the history of disease accrual.", "authors": "Richards, Fleetwood, Prigge et al"}, "https://arxiv.org/abs/1912.03158": {"title": "High-frequency and heteroskedasticity identification in multicountry models: Revisiting spillovers of monetary shocks", "link": "https://arxiv.org/abs/1912.03158", "description": "We explore the international transmission of monetary policy and central bank information shocks originating from the United States and the euro area. Employing a panel vector autoregression, we use macroeconomic and financial variables across several major economies to address both static and dynamic spillovers. To identify structural shocks, we introduce a novel approach that combines external instruments with heteroskedasticity-based identification and sign restrictions. Our results suggest significant spillovers from European Central Bank and Federal Reserve policies to each other's economies, global aggregates, and other countries. 
These effects are more pronounced for central bank information shocks than for pure monetary policy shocks, and the dominance of the US in the global economy is reflected in our findings.", "authors": "Pfarrhofer, Stelzer"}, "https://arxiv.org/abs/2108.05990": {"title": "Nearly Optimal Learning using Sparse Deep ReLU Networks in Regularized Empirical Risk Minimization with Lipschitz Loss", "link": "https://arxiv.org/abs/2108.05990", "description": "We propose a sparse deep ReLU network (SDRN) estimator of the regression function obtained from regularized empirical risk minimization with a Lipschitz loss function. Our framework can be applied to a variety of regression and classification problems. We establish novel non-asymptotic excess risk bounds for our SDRN estimator when the regression function belongs to a Sobolev space with mixed derivatives. We obtain a new nearly optimal risk rate in the sense that the SDRN estimator can achieve nearly the same optimal minimax convergence rate as one-dimensional nonparametric regression with the dimension only involved in a logarithm term when the feature dimension is fixed. The estimator has a slightly slower rate when the dimension grows with the sample size. We show that the depth of the SDRN estimator grows with the sample size in logarithmic order, and the total number of nodes and weights grows in polynomial order of the sample size to have the nearly optimal risk rate. The proposed SDRN can go deeper with fewer parameters to well estimate the regression and overcome the overfitting problem encountered by conventional feed-forward neural networks.", "authors": "Huang, Liu, Ma"}, "https://arxiv.org/abs/2301.07513": {"title": "A Bayesian Nonparametric Stochastic Block Model for Directed Acyclic Graphs", "link": "https://arxiv.org/abs/2301.07513", "description": "Random graphs have been widely used in statistics, for example in network and social interaction analysis. In some applications, data may contain an inherent hierarchical ordering among its vertices, which prevents any directed edge between pairs of vertices that do not respect this order. For example, in bibliometrics, older papers cannot cite newer ones. In such situations, the resulting graph forms a Directed Acyclic Graph. In this article, we propose an extension of the popular Stochastic Block Model (SBM) to account for the presence of a latent hierarchical ordering in the data. The proposed approach includes a topological ordering in the likelihood of the model, which allows a directed edge to have positive probability only if the corresponding pair of vertices respect the order. This latent ordering is treated as an unknown parameter and endowed with a prior distribution. We describe how to formalize the model and perform posterior inference for a Bayesian nonparametric version of the SBM in which both the hierarchical ordering and the number of latent blocks are learnt from the data. Finally, an illustration with a real-world dataset from bibliometrics is presented. Additional supplementary materials are available online.", "authors": "Lee, Battiston"}, "https://arxiv.org/abs/2312.13195": {"title": "Principal Component Copulas for Capital Modelling and Systemic Risk", "link": "https://arxiv.org/abs/2312.13195", "description": "We introduce a class of copulas that we call Principal Component Copulas (PCCs). 
This class combines the strong points of copula-based techniques with principal component-based models, which results in flexibility when modelling tail dependence along the most important directions in high-dimensional data. We obtain theoretical results for PCCs that are important for practical applications. In particular, we derive tractable expressions for the high-dimensional copula density, which can be represented in terms of characteristic functions. We also develop algorithms to perform Maximum Likelihood and Generalized Method of Moment estimation in high-dimensions and show very good performance in simulation experiments. Finally, we apply the copula to the international stock market in order to study systemic risk. We find that PCCs lead to excellent performance on measures of systemic risk due to their ability to distinguish between parallel market movements, which increase systemic risk, and orthogonal movements, which reduce systemic risk. As a result, we consider the PCC promising for internal capital models, which financial institutions use to protect themselves against systemic risk.", "authors": "Gubbels, Ypma, Oosterlee"}, "https://arxiv.org/abs/2412.07787": {"title": "Anomaly Detection in California Electricity Price Forecasting: Enhancing Accuracy and Reliability Using Principal Component Analysis", "link": "https://arxiv.org/abs/2412.07787", "description": "Accurate and reliable electricity price forecasting has significant practical implications for grid management, renewable energy integration, power system planning, and price volatility management. This study focuses on enhancing electricity price forecasting in California's grid, addressing challenges from complex generation data and heteroskedasticity. Utilizing principal component analysis (PCA), we analyze CAISO's hourly electricity prices and demand from 2016-2021 to improve day-ahead forecasting accuracy. Initially, we apply traditional outlier analysis with the interquartile range method, followed by robust PCA (RPCA) for more effective outlier elimination. This approach improves data symmetry and reduces skewness. We then construct multiple linear regression models using both raw and PCA-transformed features. The model with transformed features, refined through traditional and SAS Sparse Matrix outlier removal methods, shows superior forecasting performance. The SAS Sparse Matrix method, in particular, significantly enhances model accuracy. Our findings demonstrate that PCA-based methods are key in advancing electricity price forecasting, supporting renewable integration and grid management in day-ahead markets.\n Keywords: Electricity price forecasting, principal component analysis (PCA), power system planning, heteroskedasticity, renewable energy integration.", "authors": "Nyangon, Akintunde"}, "https://arxiv.org/abs/2412.07824": {"title": "Improved Small Area Inference from Data Integration Using Global-Local Priors", "link": "https://arxiv.org/abs/2412.07824", "description": "We present and apply methodology to improve inference for small area parameters by using data from several sources. This work extends Cahoy and Sedransk (2023) who showed how to integrate summary statistics from several sources. Our methodology uses hierarchical global-local prior distributions to make inferences for the proportion of individuals in Florida's counties who do not have health insurance. 
Results from an extensive simulation study show that this methodology will provide improved inference by using several data sources. Among the five model variants evaluated, the ones using horseshoe priors for all variances have better performance than the ones using lasso priors for the local variances.", "authors": "Cahoy, Sedransk"}, "https://arxiv.org/abs/2412.07905": {"title": "Spectral Differential Network Analysis for High-Dimensional Time Series", "link": "https://arxiv.org/abs/2412.07905", "description": "Spectral networks derived from multivariate time series data arise in many domains, from brain science to Earth science. Often, it is of interest to study how these networks change under different conditions. For instance, to better understand epilepsy, it would be interesting to capture the changes in the brain connectivity network as a patient experiences a seizure, using electroencephalography data. A common approach relies on estimating the networks in each condition and calculating their difference. Such estimates may behave poorly in high dimensions as the networks themselves may not be sparse in structure while their difference may be. We build upon this observation to develop an estimator of the difference in inverse spectral densities across two conditions. Using an L1 penalty on the difference, consistency is established by only requiring the difference to be sparse. We illustrate the method on synthetic data experiments, on experiments with electroencephalography data, and on experiments with optogenetic stimulation and micro-electrocorticography data.", "authors": "Hellstern, Kim, Harchaoui et al"}, "https://arxiv.org/abs/2412.07957": {"title": "Spatial scale-aware tail dependence modeling for high-dimensional spatial extremes", "link": "https://arxiv.org/abs/2412.07957", "description": "Extreme events over large spatial domains may exhibit highly heterogeneous tail dependence characteristics, yet most existing spatial extremes models yield only one dependence class over the entire spatial domain. To accurately characterize \"data-level dependence'' in analysis of extreme events, we propose a mixture model that achieves flexible dependence properties and allows high-dimensional inference for extremes of spatial processes. We modify the popular random scale construction that multiplies a Gaussian random field by a single radial variable; we allow the radial variable to vary smoothly across space and add non-stationarity to the Gaussian process. As the level of extremeness increases, this single model exhibits both asymptotic independence at long ranges and either asymptotic dependence or independence at short ranges. We make joint inference on the dependence model and a marginal model using a copula approach within a Bayesian hierarchical model. Three different simulation scenarios show close to nominal frequentist coverage rates. Lastly, we apply the model to a dataset of extreme summertime precipitation over the central United States. We find that the joint tail of precipitation exhibits non-stationary dependence structure that cannot be captured by limiting extreme value models or current state-of-the-art sub-asymptotic models.", "authors": "Shi, Zhang, Risser et al"}, "https://arxiv.org/abs/2412.07987": {"title": "Hypothesis Testing for High-Dimensional Matrix-Valued Data", "link": "https://arxiv.org/abs/2412.07987", "description": "This paper addresses hypothesis testing for the mean of matrix-valued data in high-dimensional settings. 
We investigate the minimum discrepancy test, originally proposed by Cragg (1997), which serves as a rank test for lower-dimensional matrices. We evaluate the performance of this test as the matrix dimensions increase proportionally with the sample size, and identify its limitations when matrix dimensions significantly exceed the sample size. To address these challenges, we propose a new test statistic tailored for high-dimensional matrix rank testing. The oracle version of this statistic is analyzed to highlight its theoretical properties. Additionally, we develop a novel approach for constructing a sparse singular value decomposition (SVD) estimator for singular vectors, providing a comprehensive examination of its theoretical aspects. Using the sparse SVD estimator, we explore the properties of the sample version of our proposed statistic. The paper concludes with simulation studies and two case studies involving surveillance video data, demonstrating the practical utility of our proposed methods.", "authors": "Cui, Li, Li et al"}, "https://arxiv.org/abs/2412.08042": {"title": "Robust and efficient estimation of time-varying treatment effects using marginal structural models dependent on partial treatment history", "link": "https://arxiv.org/abs/2412.08042", "description": "Inverse probability (IP) weighting of marginal structural models (MSMs) can provide consistent estimators of time-varying treatment effects under correct model specifications and identifiability assumptions, even in the presence of time-varying confounding. However, this method has two problems: (i) inefficiency due to IP-weights cumulating all time points and (ii) bias and inefficiency due to the MSM misspecification. To address these problems, we propose new IP-weights for estimating the parameters of the MSM dependent on partial treatment history and closed testing procedures for selecting the MSM under known IP-weights. In simulation studies, our proposed methods outperformed existing methods in terms of both performance in estimating time-varying treatment effects and in selecting the correct MSM. Our proposed methods were also applied to real data of hemodialysis patients with reasonable results.", "authors": "Seya, Taguri, Ishii"}, "https://arxiv.org/abs/2412.08051": {"title": "Two-way Node Popularity Model for Directed and Bipartite Networks", "link": "https://arxiv.org/abs/2412.08051", "description": "There has been extensive research on community detection in directed and bipartite networks. However, these studies often fail to consider the popularity of nodes in different communities, which is a common phenomenon in real-world networks. To address this issue, we propose a new probabilistic framework called the Two-Way Node Popularity Model (TNPM). The TNPM also accommodates edges from different distributions within a general sub-Gaussian family. We introduce the Delete-One-Method (DOM) for model fitting and community structure identification, and provide a comprehensive theoretical analysis with novel technical skills dealing with sub-Gaussian generalization. Additionally, we propose the Two-Stage Divided Cosine Algorithm (TSDC) to handle large-scale networks more efficiently. Our proposed methods offer multi-folded advantages in terms of estimation accuracy and computational efficiency, as demonstrated through extensive numerical studies. 
We apply our methods to two real-world applications, uncovering interesting findings.", "authors": "Jing, Li, Wang et al"}, "https://arxiv.org/abs/2412.08088": {"title": "Dynamic Classification of Latent Disease Progression with Auxiliary Surrogate Labels", "link": "https://arxiv.org/abs/2412.08088", "description": "Disease progression prediction based on patients' evolving health information is challenging when true disease states are unknown due to diagnostic capabilities or high costs. For example, the absence of gold-standard neurological diagnoses hinders distinguishing Alzheimer's disease (AD) from related conditions such as AD-related dementias (ADRDs), including Lewy body dementia (LBD). Combining temporally dependent surrogate labels and health markers may improve disease prediction. However, existing literature models informative surrogate labels and observed variables that reflect the underlying states using purely generative approaches, limiting the ability to predict future states. We propose integrating the conventional hidden Markov model as a generative model with a time-varying discriminative classification model to simultaneously handle potentially misspecified surrogate labels and incorporate important markers of disease progression. We develop an adaptive forward-backward algorithm with subjective labels for estimation, and utilize the modified posterior and Viterbi algorithms to predict the progression of future states or new patients based on objective markers only. Importantly, the adaptation eliminates the need to model the marginal distribution of longitudinal markers, a requirement in traditional algorithms. Asymptotic properties are established, and significant improvement with finite samples is demonstrated via simulation studies. Analysis of the neuropathological dataset of the National Alzheimer's Coordinating Center (NACC) shows much improved accuracy in distinguishing LBD from AD.", "authors": "Cai, Zeng, Marder et al"}, "https://arxiv.org/abs/2412.08225": {"title": "Improving Active Learning with a Bayesian Representation of Epistemic Uncertainty", "link": "https://arxiv.org/abs/2412.08225", "description": "A popular strategy for active learning is to specifically target a reduction in epistemic uncertainty, since aleatoric uncertainty is often considered as being intrinsic to the system of interest and therefore not reducible. Yet, distinguishing these two types of uncertainty remains challenging and there is no single strategy that consistently outperforms the others. We propose to use a particular combination of probability and possibility theories, with the aim of using the latter to specifically represent epistemic uncertainty, and we show how this combination leads to new active learning strategies that have desirable properties. 
In order to demonstrate the efficiency of these strategies in non-trivial settings, we introduce the notion of a possibilistic Gaussian process (GP) and consider GP-based multiclass and binary classification problems, for which the proposed methods display a strong performance for both simulated and real datasets.", "authors": "Thomas, Houssineau"}, "https://arxiv.org/abs/2412.08458": {"title": "Heavy Tail Robust Estimation and Inference for Average Treatment Effects", "link": "https://arxiv.org/abs/2412.08458", "description": "We study the probability tail properties of Inverse Probability Weighting (IPW) estimators of the Average Treatment Effect (ATE) when there is limited overlap between the covariate distributions of the treatment and control groups. Under unconfoundedness of treatment assignment conditional on covariates, such limited overlap is manifested in the propensity score for certain units being very close (but not equal) to 0 or 1. This renders IPW estimators possibly heavy tailed, and with a slower than sqrt(n) rate of convergence. Trimming or truncation is ultimately based on the covariates, ignoring important information about the inverse probability weighted random variable Z that identifies ATE by E[Z]= ATE. We propose a tail-trimmed IPW estimator whose performance is robust to limited overlap. In terms of the propensity score, which is generally unknown, we plug-in its parametric estimator in the infeasible Z, and then negligibly trim the resulting feasible Z adaptively by its large values. Trimming leads to bias if Z has an asymmetric distribution and an infinite variance, hence we estimate and remove the bias using important improvements on existing theory and methods. Our estimator sidesteps dimensionality, bias and poor correspondence properties associated with trimming by the covariates or propensity score. Monte Carlo experiments demonstrate that trimming by the covariates or the propensity score requires the removal of a substantial portion of the sample to render a low bias and close to normal estimator, while our estimator has low bias and mean-squared error, and is close to normal, based on the removal of very few sample extremes.", "authors": "Hill, Chaudhuri"}, "https://arxiv.org/abs/2412.08498": {"title": "A robust, scalable K-statistic for quantifying immune cell clustering in spatial proteomics data", "link": "https://arxiv.org/abs/2412.08498", "description": "Spatial summary statistics based on point process theory are widely used to quantify the spatial organization of cell populations in single-cell spatial proteomics data. Among these, Ripley's $K$ is a popular metric for assessing whether cells are spatially clustered or are randomly dispersed. However, the key assumption of spatial homogeneity is frequently violated in spatial proteomics data, leading to overestimates of cell clustering and colocalization. To address this, we propose a novel $K$-based method, termed \\textit{KAMP} (\\textbf{K} adjustment by \\textbf{A}nalytical \\textbf{M}oments of the \\textbf{P}ermutation distribution), for quantifying the spatial organization of cells in spatial proteomics samples. \\textit{KAMP} leverages background cells in each sample along with a new closed-form representation of the first and second moments of the permutation distribution of Ripley's $K$ to estimate an empirical null model. 
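To convey the permutation-null idea behind the KAMP approach described just above, the sketch below compares Ripley's K for the immune cells with K recomputed after randomly relabelling which of the observed cells are immune, so that the possibly inhomogeneous cell locations are preserved. Monte Carlo permutations stand in here for the paper's closed-form moments, the K estimator is a naive one without edge correction, and the simulated cells and labels are purely illustrative.

```python
# Sketch of a permutation null for Ripley's K: relabel which observed cells are
# "immune" and recompute K, keeping all cell locations fixed.
import numpy as np

def ripley_k(points, r, area):
    # naive O(n^2) Ripley's K at radius r (no edge correction, fine for a sketch)
    n = len(points)
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    pairs = (d < r).sum() - n                  # exclude self-pairs
    intensity = n / area
    return pairs / (n * intensity)

rng = np.random.default_rng(3)
all_cells = rng.uniform(0, 1, size=(1500, 2))            # all cells (background + immune)
is_immune = rng.uniform(size=1500) < 0.2                 # observed immune labels (toy)
r, area = 0.05, 1.0

k_obs = ripley_k(all_cells[is_immune], r, area)
k_perm = np.array([
    ripley_k(all_cells[rng.permutation(is_immune)], r, area) for _ in range(200)
])
print("K observed:", k_obs)
print("permutation null mean/sd:", k_perm.mean(), k_perm.std())
print("approx. p-value:", (k_perm >= k_obs).mean())
```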
Our method is robust to inhomogeneity, computationally efficient even in large datasets, and provides approximate $p$-values for testing spatial clustering and colocalization. Methodological developments are motivated by a spatial proteomics study of 103 women with ovarian cancer, where our analysis using \\textit{KAMP} shows a positive association between immune cell clustering and overall patient survival. Notably, we also find evidence that using $K$ without correcting for sample inhomogeneity may bias hazard ratio estimates in downstream analyses. \\textit{KAMP} completes this analysis in just 5 minutes, compared to 538 minutes for the only competing method that adequately addresses inhomogeneity.", "authors": "Wrobel, Song"}, "https://arxiv.org/abs/2412.08533": {"title": "Rate accelerated inference for integrals of multivariate random functions", "link": "https://arxiv.org/abs/2412.08533", "description": "The computation of integrals is a fundamental task in the analysis of functional data, which are typically considered as random elements in a space of squared integrable functions. Borrowing ideas from recent advances in the Monte Carlo integration literature, we propose effective unbiased estimation and inference procedures for integrals of uni- and multivariate random functions. Several applications to key problems in functional data analysis (FDA) involving random design points are studied and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and the usual algorithms for numerical integration. Moreover, the proposed estimator facilitates effective inference by generally providing better coverage with shorter confidence and prediction intervals, in both noisy and noiseless setups.", "authors": "Patilea, Wang"}, "https://arxiv.org/abs/2412.08567": {"title": "Identifiability of the instrumental variable model with the treatment and outcome missing not at random", "link": "https://arxiv.org/abs/2412.08567", "description": "The instrumental variable model of Imbens and Angrist (1994) and Angrist et al. (1996) allow for the identification of the local average treatment effect, also known as the complier average causal effect. However, many empirical studies are challenged by the missingness in the treatment and outcome. Generally, the complier average causal effect is not identifiable without further assumptions when the treatment and outcome are missing not at random. We study its identifiability even when the treatment and outcome are missing not at random. We review the existing results and provide new findings to unify the identification analysis in the literature.", "authors": "Zuo, Ding, Yang"}, "https://arxiv.org/abs/2412.07929": {"title": "Dirichlet-Neumann Averaging: The DNA of Efficient Gaussian Process Simulation", "link": "https://arxiv.org/abs/2412.07929", "description": "Gaussian processes (GPs) and Gaussian random fields (GRFs) are essential for modelling spatially varying stochastic phenomena. Yet, the efficient generation of corresponding realisations on high-resolution grids remains challenging, particularly when a large number of realisations are required. This paper presents two novel contributions. First, we propose a new methodology based on Dirichlet-Neumann averaging (DNA) to generate GPs and GRFs with isotropic covariance on regularly spaced grids. 
The combination of discrete cosine and sine transforms in the DNA sampling approach allows for rapid evaluations without the need for modification or padding of the desired covariance function. While this introduces an error in the covariance, our numerical experiments show that this error is negligible for most relevant applications, representing a trade-off between efficiency and precision. We provide explicit error estimates for Mat\\'ern covariances. The second contribution links our new methodology to the stochastic partial differential equation (SPDE) approach for sampling GRFs. We demonstrate that the concepts developed in our methodology can also guide the selection of boundary conditions in the SPDE framework. We prove that averaging specific GRFs sampled via the SPDE approach yields genuinely isotropic realisations without domain extension, with the error bounds established in the first part remaining valid.", "authors": "Mathematics, Heidelberg, Computing) et al"}, "https://arxiv.org/abs/2412.08064": {"title": "Statistical Convergence Rates of Optimal Transport Map Estimation between General Distributions", "link": "https://arxiv.org/abs/2412.08064", "description": "This paper studies the convergence rates of optimal transport (OT) map estimators, a topic of growing interest in statistics, machine learning, and various scientific fields. Despite recent advancements, existing results rely on regularity assumptions that are very restrictive in practice and much stricter than those in Brenier's Theorem, including the compactness and convexity of the probability support and the bi-Lipschitz property of the OT maps. We aim to broaden the scope of OT map estimation and fill this gap between theory and practice. Given the strong convexity assumption on Brenier's potential, we first establish the non-asymptotic convergence rates for the original plug-in estimator without requiring restrictive assumptions on probability measures. Additionally, we introduce a sieve plug-in estimator and establish its convergence rates without the strong convexity assumption on Brenier's potential, enabling the widely used cases such as the rank functions of normal or t-distributions. We also establish new Poincar\\'e-type inequalities, which are proved given sufficient conditions on the local boundedness of the probability density and mild topological conditions of the support, and these new inequalities enable us to achieve faster convergence rates for the Donsker function class. Moreover, we develop scalable algorithms to efficiently solve the OT map estimation using neural networks and present numerical experiments to demonstrate the effectiveness and robustness.", "authors": "Ding, Li, Xue"}, "https://arxiv.org/abs/2203.06743": {"title": "Bayesian Analysis of Sigmoidal Gaussian Cox Processes via Data Augmentation", "link": "https://arxiv.org/abs/2203.06743", "description": "Many models for point process data are defined through a thinning procedure where locations of a base process (often Poisson) are either kept (observed) or discarded (thinned). In this paper, we go back to the fundamentals of the distribution theory for point processes to establish a link between the base thinning mechanism and the joint density of thinned and observed locations in any of such models. In practice, the marginal model of observed points is often intractable, but thinned locations can be instantiated from their conditional distribution and typical data augmentation schemes can be employed to circumvent this problem. 
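The thinning construction referenced above can be made concrete with a short simulation: a homogeneous Poisson base process is generated at an upper-bound rate, and each location is kept with probability equal to the target intensity divided by that rate. In the sigmoidal Cox process the function g below would be a Gaussian-process draw; here it is a fixed function, and all rates are arbitrary, purely for illustration.

```python
# Minimal sketch of simulating an inhomogeneous point process by thinning a
# homogeneous Poisson base process on the unit square.
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam_star = 400.0                                   # upper bound on the intensity
g = lambda x: 3.0 * np.sin(2 * np.pi * x[:, 0]) + 2.0 * (x[:, 1] - 0.5)
intensity = lambda x: lam_star * sigmoid(g(x))     # bounded by lam_star by construction

# base homogeneous Poisson process on [0, 1]^2 (area 1)
n_base = rng.poisson(lam_star * 1.0)
base = rng.uniform(0, 1, size=(n_base, 2))

# independent thinning: keep each location with probability intensity / lam_star
keep = rng.uniform(size=n_base) < intensity(base) / lam_star
observed, thinned = base[keep], base[~keep]
print(len(observed), "observed points,", len(thinned), "thinned points")
```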
Such approaches have been employed in the recent literature, but some inconsistencies have been introduced across the different publications. We concentrate on an example: the so-called sigmoidal Gaussian Cox process. We apply our approach to resolve contradicting viewpoints in the data augmentation step of the inference procedures therein. We also provide a multitype extension to this process and conduct Bayesian inference on data consisting of positions of two different species of trees in Lansing Woods, Michigan. The emphasis is put on intertype dependence modeling with Bayesian uncertainty quantification.", "authors": "Alie, Stephens, Schmidt"}, "https://arxiv.org/abs/2210.06639": {"title": "Robust Estimation and Inference in Panels with Interactive Fixed Effects", "link": "https://arxiv.org/abs/2210.06639", "description": "We consider estimation and inference for a regression coefficient in panels with interactive fixed effects (i.e., with a factor structure). We demonstrate that existing estimators and confidence intervals (CIs) can be heavily biased and size-distorted when some of the factors are weak. We propose estimators with improved rates of convergence and bias-aware CIs that remain valid uniformly, regardless of factor strength. Our approach applies the theory of minimax linear estimation to form a debiased estimate, using a nuclear norm bound on the error of an initial estimate of the interactive fixed effects. Our resulting bias-aware CIs take into account the remaining bias caused by weak factors. Monte Carlo experiments show substantial improvements over conventional methods when factors are weak, with minimal costs to estimation accuracy when factors are strong.", "authors": "Armstrong, Weidner, Zeleneev"}, "https://arxiv.org/abs/2301.04876": {"title": "Interacting Treatments with Endogenous Takeup", "link": "https://arxiv.org/abs/2301.04876", "description": "We study causal inference in randomized experiments (or quasi-experiments) following a $2\\times 2$ factorial design. There are two treatments, denoted $A$ and $B$, and units are randomly assigned to one of four categories: treatment $A$ alone, treatment $B$ alone, joint treatment, or none. Allowing for endogenous non-compliance with the two binary instruments representing the intended assignment, as well as unrestricted interference across the two treatments, we derive the causal interpretation of various instrumental variable estimands under more general compliance conditions than in the literature. In general, if treatment takeup is driven by both instruments for some units, it becomes difficult to separate treatment interaction from treatment effect heterogeneity. We provide auxiliary conditions and various bounding strategies that may help zero in on causally interesting parameters. As an empirical illustration, we apply our results to a program randomly offering two different treatments, namely tutoring and financial incentives, to first year college students, in order to assess the treatments' effects on academic performance.", "authors": "Kormos, Lieli, Huber"}, "https://arxiv.org/abs/2308.01605": {"title": "Causal thinking for decision making on Electronic Health Records: why and how", "link": "https://arxiv.org/abs/2308.01605", "description": "Accurate predictions, as with machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, prediction can be driven by shortcuts in the data, such as racial biases. Causal thinking is needed for data-driven decisions. 
Here, we give an introduction to the key elements, focusing on routinely-collected data, electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care: temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision making from real-life patient records by emulating a randomized trial before individualizing decisions, eg with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the various choices in studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV). We study the impact of various choices at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.", "authors": "(SODA), (MIT, USZ) et al"}, "https://arxiv.org/abs/2312.04078": {"title": "Methods for Quantifying Dataset Similarity: a Review, Taxonomy and Comparison", "link": "https://arxiv.org/abs/2312.04078", "description": "Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer-learning. In simulation studies, the similarity between distributions of simulated datasets and real datasets, for which the performance of methods is assessed, is crucial. In two- or $k$-sample testing, it is checked, whether the underlying distributions of two or more datasets coincide.\n Extremely many approaches for quantifying dataset similarity have been proposed in the literature. We examine more than 100 methods and provide a taxonomy, classifying them into ten classes. In an extensive review of these methods the main underlying ideas, formal definitions, and important properties are introduced.\n We compare the 118 methods in terms of their applicability, interpretability, and theoretical properties, in order to provide recommendations for selecting an appropriate dataset similarity measure based on the specific goal of the dataset comparison and on the properties of the datasets at hand. An online tool facilitates the choice of the appropriate dataset similarity measure.", "authors": "Stolte, Kappenberg, Rahnenf\\\"uhrer et al"}, "https://arxiv.org/abs/2303.08653": {"title": "On the robustness of posterior means", "link": "https://arxiv.org/abs/2303.08653", "description": "Consider a normal location model $X \\mid \\theta \\sim N(\\theta, \\sigma^2)$ with known $\\sigma^2$. Suppose $\\theta \\sim G_0$, where the prior $G_0$ has zero mean and variance bounded by $V$. Let $G_1$ be a possibly misspecified prior with zero mean and variance bounded by $V$. 
We show that the squared error Bayes risk of the posterior mean under $G_1$ is bounded, subject to an additional tail condition on $G_1$, uniformly over $G_0, G_1, \\sigma^2 > 0$.", "authors": "Chen"}, "https://arxiv.org/abs/2412.08762": {"title": "Modeling EEG Spectral Features through Warped Functional Mixed Membership Models", "link": "https://arxiv.org/abs/2412.08762", "description": "A common concern in the field of functional data analysis is the challenge of temporal misalignment, which is typically addressed using curve registration methods. Currently, most of these methods assume the data is governed by a single common shape or a finite mixture of population level shapes. We introduce more flexibility using mixed membership models. Individual observations are assumed to partially belong to different clusters, allowing variation across multiple functional features. We propose a Bayesian hierarchical model to estimate the underlying shapes, as well as the individual time-transformation functions and levels of membership. Motivating this work is data from EEG signals in children with autism spectrum disorder (ASD). Our method agrees with the neuroimaging literature, recovering the 1/f pink noise feature distinctly from the peak in the alpha band. Furthermore, the introduction of a regression component in the estimation of time-transformation functions quantifies the effect of age and clinical designation on the location of the peak alpha frequency (PAF).", "authors": "Landry, Senturk, Jeste et al"}, "https://arxiv.org/abs/2412.08827": {"title": "A Debiased Estimator for the Mediation Functional in Ultra-High-Dimensional Setting in the Presence of Interaction Effects", "link": "https://arxiv.org/abs/2412.08827", "description": "Mediation analysis is crucial in many fields of science for understanding the mechanisms or processes through which an independent variable affects an outcome, thereby providing deeper insights into causal relationships and improving intervention strategies. Despite advances in analyzing the mediation effect with fixed/low-dimensional mediators and covariates, our understanding of estimation and inference of mediation functional in the presence of (ultra)-high-dimensional mediators and covariates is still limited. In this paper, we present an estimator for mediation functional in a high-dimensional setting that accommodates the interaction between covariates and treatment in generating mediators, as well as interactions between both covariates and treatment and mediators and treatment in generating the response. We demonstrate that our estimator is $\\sqrt{n}$-consistent and asymptotically normal, thus enabling reliable inference on direct and indirect treatment effects with asymptotically valid confidence intervals. A key technical contribution of our work is to develop a multi-step debiasing technique, which may also be valuable in other statistical settings with similar structural complexities where accurate estimation depends on debiasing.", "authors": "Bo, Ghassami, Mukherjee"}, "https://arxiv.org/abs/2412.08828": {"title": "A Two-Stage Approach for Segmenting Spatial Point Patterns Applied to Multiplex Imaging", "link": "https://arxiv.org/abs/2412.08828", "description": "Recent advances in multiplex imaging have enabled researchers to locate different types of cells within a tissue sample. 
This is especially relevant for tumor immunology, as clinical regimes corresponding to different stages of disease or responses to treatment may manifest as different spatial arrangements of tumor and immune cells. Spatial point pattern modeling can be used to partition multiplex tissue images according to these regimes. To this end, we propose a two-stage approach: first, local intensities and pair correlation functions are estimated from the spatial point pattern of cells within each image, and the pair correlation functions are reduced in dimension via spectral decomposition of the covariance function. Second, the estimates are clustered in a Bayesian hierarchical model with spatially-dependent cluster labels. The clusters correspond to regimes of interest that are present across subjects; the cluster labels segment the spatial point patterns according to those regimes. Through Markov Chain Monte Carlo sampling, we jointly estimate and quantify uncertainty in the cluster assignment and spatial characteristics of each cluster. Simulations demonstrate the performance of the method, and it is applied to a set of multiplex immunofluorescence images of diseased pancreatic tissue.", "authors": "Sheng, Reich, Staicu et al"}, "https://arxiv.org/abs/2412.08831": {"title": "Panel Stochastic Frontier Models with Latent Group Structures", "link": "https://arxiv.org/abs/2412.08831", "description": "Stochastic frontier models have attracted significant interest over the years due to their unique feature of including a distinct inefficiency term alongside the usual error term. To effectively separate these two components, strong distributional assumptions are often necessary. To overcome this limitation, numerous studies have sought to relax or generalize these models for more robust estimation. In line with these efforts, we introduce a latent group structure that accommodates heterogeneity across firms, addressing not only the stochastic frontiers but also the distribution of the inefficiency term. This framework accounts for the distinctive features of stochastic frontier models, and we propose a practical estimation procedure to implement it. Simulation studies demonstrate the strong performance of our proposed method, which is further illustrated through an application to study the cost efficiency of the U.S. commercial banking sector.", "authors": "Tomioka, Yang, Zhang"}, "https://arxiv.org/abs/2412.08857": {"title": "Dynamic prediction of an event using multiple longitudinal markers: a model averaging approach", "link": "https://arxiv.org/abs/2412.08857", "description": "Dynamic event prediction, using joint modeling of survival time and longitudinal variables, is extremely useful in personalized medicine. However, the estimation of joint models including many longitudinal markers is still a computational challenge because of the high number of random effects and parameters to be estimated. In this paper, we propose a model averaging strategy to combine predictions from several joint models for the event, including one longitudinal marker only or pairwise longitudinal markers. The prediction is computed as the weighted mean of the predictions from the one-marker or two-marker models, with the time-dependent weights estimated by minimizing the time-dependent Brier score. This method enables us to combine a large number of predictions issued from joint models to achieve a reliable and accurate individual prediction. 
Advantages and limits of the proposed methods are highlighted in a simulation study by comparison with the predictions from well-specified and misspecified all-marker joint models as well as the one-marker and two-marker joint models. Using the PBC2 data set, the method is used to predict the risk of death in patients with primary biliary cirrhosis. The method is also used to analyze a French cohort study called the 3C data. In our study, seventeen longitudinal markers are considered to predict the risk of death.", "authors": "Hashemi, Baghfalaki, Philipps et al"}, "https://arxiv.org/abs/2412.08916": {"title": "Beyond forecast leaderboards: Measuring individual model importance based on contribution to ensemble accuracy", "link": "https://arxiv.org/abs/2412.08916", "description": "Ensemble forecasts often outperform forecasts from individual standalone models, and have been used to support decision-making and policy planning in various fields. As collaborative forecasting efforts to create effective ensembles grow, so does interest in understanding individual models' relative importance in the ensemble. To this end, we propose two practical methods that measure the difference between ensemble performance when a given model is or is not included in the ensemble: a leave-one-model-out algorithm and a leave-all-subsets-of-models-out algorithm, which is based on the Shapley value. We explore the relationship between these metrics, forecast accuracy, and the similarity of errors, both analytically and through simulations. We illustrate this measure of the value a component model adds to an ensemble in the presence of other models using US COVID-19 death forecasts. This study offers valuable insight into individual models' unique features within an ensemble, which standard accuracy metrics alone cannot reveal.", "authors": "Kim, Ray, Reich"}, "https://arxiv.org/abs/2412.08934": {"title": "A cheat sheet for probability distributions of orientational data", "link": "https://arxiv.org/abs/2412.08934", "description": "The need for statistical models of orientations arises in many applications in engineering and computer science. Orientational data appear as sets of angles, unit vectors, rotation matrices or quaternions. In the field of directional statistics, a lot of advances have been made in modelling such types of data. However, only a few of these tools are used in engineering and computer science applications. Hence, this paper aims to serve as a cheat sheet for those probability distributions of orientations. Models for 1-DOF, 2-DOF and 3-DOF orientations are discussed. For each of them, expressions for the density function, fitting to data, and sampling are presented. The paper is written with a compromise between engineering and statistics in terms of notation and terminology. A Python library with functions for some of these models is provided. Using this library, two examples of applications to real data are presented.", "authors": "Lopez-Custodio"}, "https://arxiv.org/abs/2412.09166": {"title": "The square array design", "link": "https://arxiv.org/abs/2412.09166", "description": "This paper is concerned with the construction of augmented row-column designs for unreplicated trials. 
The idea is predicated on the representation of a $k \\times t$ equireplicate incomplete-block design with $t$ treatments in $t$ blocks of size $k$, termed an auxiliary block design, as a $t \\times t$ square array design with $k$ controls, where $k 0$) on the rate at which spillovers decay with the ``distance'' between units, defined in a generalized way to encompass spatial and quasi-spatial settings, e.g. where the economically relevant concept of distance is a gravity equation. Over all estimators linear in the outcomes and all cluster-randomized designs the optimal geometric rate of convergence is $n^{-\\frac{1}{2+\\frac{1}{\\eta}}}$, and this rate can be achieved using a generalized ``Scaling Clusters'' design that we provide. We then introduce the additional assumption, implicit in the OLS estimators used in recent applied studies, that potential outcomes are linear in population treatment assignments. These estimators are inconsistent for our estimand, but a refined OLS estimator is consistent and rate optimal, and performs better than IPW estimators when clusters must be small. Its finite-sample performance can be improved by incorporating prior information about the structure of spillovers. As a robust alternative to the linear approach we also provide a method to select estimator-design pairs that minimize a notion of worst-case risk when the data generating process is unknown. Finally, we provide asymptotically valid inference methods.", "authors": "Faridani, Niehaus"}, "https://arxiv.org/abs/2310.04578": {"title": "A Double Machine Learning Approach for the Evaluation of COVID-19 Vaccine Effectiveness under the Test-Negative Design: Analysis of Qu\\'ebec Administrative Data", "link": "https://arxiv.org/abs/2310.04578", "description": "The test-negative design (TND), which is routinely used for monitoring seasonal flu vaccine effectiveness (VE), has recently become integral to COVID-19 vaccine surveillance, notably in Qu\\'ebec, Canada. Some studies have addressed the identifiability and estimation of causal parameters under the TND, but efficiency bounds for nonparametric estimators of the target parameter under the unconfoundedness assumption have not yet been investigated. Motivated by the goal of improving adjustment for measured confounders when estimating COVID-19 VE among community-dwelling people aged $\\geq 60$ years in Qu\\'ebec, we propose a one-step doubly robust and locally efficient estimator called TNDDR (TND doubly robust), which utilizes cross-fitting (sample splitting) and can incorporate machine learning techniques to estimate the nuisance functions and thus improve control for measured confounders. We derive the efficient influence function (EIF) for the marginal expectation of the outcome under a vaccination intervention, explore the von Mises expansion, and establish the conditions for $\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR. The proposed estimator is supported by both theoretical and empirical justifications.", "authors": "Jiang, Talbot, Carazo et al"}, "https://arxiv.org/abs/2401.05256": {"title": "Tests of Missing Completely At Random based on sample covariance matrices", "link": "https://arxiv.org/abs/2401.05256", "description": "We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). 
We relax the problem of testing MCAR to the problem of testing the compatibility of a collection of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Our first contributions are to define a natural measure of the incompatibility of a collection of correlation matrices, which can be characterised as the optimal value of a Semi-definite Programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By analysing the concentration properties of the natural plug-in estimator for this measure, we propose a novel hypothesis test, which is calibrated via a bootstrap procedure and demonstrates power against any distribution with incompatible covariance matrices. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy tailed. Furthermore, tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest.", "authors": "Bordino, Berrett"}, "https://arxiv.org/abs/2412.09697": {"title": "Evaluating time-specific treatment effects in matched-pairs studies", "link": "https://arxiv.org/abs/2412.09697", "description": "This study develops methods for evaluating a treatment effect on a time-to-event outcome in matched-pair studies. While most methods for paired right-censored outcomes allow determining an overall treatment effect over the course of follow-up, they generally lack in providing detailed insights into how the effect changes over time. To address this gap, we propose time-specific and overall tests for paired right-censored outcomes under randomization inference. We further extend our tests to matched observational studies by developing corresponding sensitivity analysis methods to take into account departures from randomization. Simulations demonstrate the robustness of our approach against various non-proportional hazards alternatives, including a crossing survival curves scenario. We demonstrate the application of our methods using a matched observational study from the Korean Longitudinal Study of Aging (KLoSA) data, focusing on the effect of social engagement on survival.", "authors": "Lee, Lee"}, "https://arxiv.org/abs/2412.09729": {"title": "Doubly Robust Conformalized Survival Analysis with Right-Censored Data", "link": "https://arxiv.org/abs/2412.09729", "description": "We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for type-I censoring. This method imputes unobserved censoring times using a suitable model, and then analyzes the imputed data using weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. 
Empirical studies on simulated and real data sets demonstrate that our method is more robust than existing approaches in challenging settings where the survival model may be inaccurate, while achieving comparable performance in easier scenarios.", "authors": "Sesia, Svetnik"}, "https://arxiv.org/abs/2412.09786": {"title": "A class of nonparametric methods for evaluating the effect of continuous treatments on survival outcomes", "link": "https://arxiv.org/abs/2412.09786", "description": "In randomized trials and observational studies, it is often necessary to evaluate the extent to which an intervention affects a time-to-event outcome, which is only partially observed due to right censoring. For instance, in infectious disease studies, it is frequently of interest to characterize the relationship between risk of acquisition of infection with a pathogen and a previously measured biomarker of an immune response against that pathogen induced by prior infection and/or vaccination. It is common to conduct inference within a causal framework, wherein we desire to make inferences about the counterfactual probability of survival through a given time point, at any given exposure level. To determine whether a causal effect is present, one can assess if this quantity differs by exposure level. Recent work shows that, under typical causal assumptions, summaries of the counterfactual survival distribution are identifiable. Moreover, when the treatment is multi-level, these summaries are also pathwise differentiable in a nonparametric probability model, making it possible to construct estimators thereof that are unbiased and approximately normal. In cases where the treatment is continuous, the target estimand is no longer pathwise differentiable, rendering it difficult to construct well-behaved estimators without strong parametric assumptions. In this work, we extend beyond the traditional setting with multilevel interventions to develop approaches to nonparametric inference with a continuous exposure. We introduce methods for testing whether the counterfactual probability of survival through a given time point remains constant across the range of the continuous exposure levels. The performance of our proposed methods is evaluated via numerical studies, and we apply our method to data from a recent pair of efficacy trials of an HIV monoclonal antibody.", "authors": "Jin, Gilbert, Hudson"}, "https://arxiv.org/abs/2412.09792": {"title": "Flexible Bayesian Nonparametric Product Mixtures for Multi-scale Functional Clustering", "link": "https://arxiv.org/abs/2412.09792", "description": "There is a rich literature on clustering functional data with applications to time-series modeling, trajectory data, and even spatio-temporal applications. However, existing methods routinely perform global clustering that enforces identical atom values within the same cluster. Such grouping may be inadequate for high-dimensional functions, where the clustering patterns may change between the more dominant high-level features and the finer resolution local features. While there is some limited literature on local clustering approaches to deal with the above problems, these methods are typically not scalable to high-dimensional functions, and their theoretical properties are not well-investigated. Focusing on basis expansions for high-dimensional functions, we propose a flexible non-parametric Bayesian approach for multi-resolution clustering. 
The proposed method imposes independent Dirichlet process (DP) priors on different subsets of basis coefficients that ultimately results in a product of DP mixture priors inducing local clustering. We generalize the approach to incorporate spatially correlated error terms when modeling random spatial functions to provide improved model fitting. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for implementation. We show posterior consistency properties under the local clustering approach that asymptotically recovers the true density of random functions. Extensive simulations illustrate the improved clustering and function estimation under the proposed method compared to classical approaches. We apply the proposed approach to a spatial transcriptomics application where the goal is to infer clusters of genes with distinct spatial patterns of expressions. Our method makes an important contribution by expanding the limited literature on local clustering methods for high-dimensional functions with theoretical guarantees.", "authors": "Yao, Kundu"}, "https://arxiv.org/abs/2412.09794": {"title": "Sequential Change Point Detection in High-dimensional Vector Auto-regressive Models", "link": "https://arxiv.org/abs/2412.09794", "description": "Sequential (online) change-point detection involves continuously monitoring time-series data and triggering an alarm when shifts in the data distribution are detected. We propose an algorithm for real-time identification of alterations in the transition matrices of high-dimensional vector autoregressive models. The algorithm estimates transition matrices and error term variances using regularization techniques applied to training data, then computes a specific test statistic to detect changes in transition matrices as new data batches arrive. We establish the asymptotic normality of the test statistic under the scenario of no change points, subject to mild conditions. An alarm is raised when the calculated test statistic exceeds a predefined quantile of the standard normal distribution. We demonstrate that, as the size of the change (jump size) increases, the test power approaches one. The effectiveness of the algorithm is validated empirically across various simulation scenarios. Finally, we present two applications of the proposed methodology: analyzing shocks in S&P 500 data and detecting the timing of seizures in EEG data.", "authors": "Tian, Safikhani"}, "https://arxiv.org/abs/2412.09804": {"title": "Unified optimal model averaging with a general loss function based on cross-validation", "link": "https://arxiv.org/abs/2412.09804", "description": "Studying unified model averaging estimation for situations with complicated data structures, we propose a novel model averaging method based on cross-validation (MACV). MACV unifies a large class of new and existing model averaging estimators and covers a very general class of loss functions. Furthermore, to reduce the computational burden caused by the conventional leave-subject/one-out cross validation, we propose a SEcond-order-Approximated Leave-one/subject-out (SEAL) cross validation, which largely improves the computation efficiency. In the context of non-independent and non-identically distributed random variables, we establish the unified theory for analyzing the asymptotic behaviors of the proposed MACV and SEAL methods, where the number of candidate models is allowed to diverge with sample size. 
To demonstrate the breadth of the proposed methodology, we exemplify four optimal model averaging estimators under four important situations, i.e., longitudinal data with discrete responses, within-cluster correlation structure modeling, conditional prediction in spatial data, and quantile regression with a potential correlation structure. We conduct extensive simulation studies and analyze real-data examples to illustrate the advantages of the proposed methods.", "authors": "Yu, Zhang, Liang"}, "https://arxiv.org/abs/2412.09830": {"title": "$L$-estimation of Claim Severity Models Weighted by Kumaraswamy Density", "link": "https://arxiv.org/abs/2412.09830", "description": "Statistical modeling of claim severity distributions is essential in insurance and risk management, where achieving a balance between robustness and efficiency in parameter estimation is critical against model contaminations. Two \\( L \\)-estimators, the method of trimmed moments (MTM) and the method of winsorized moments (MWM), are commonly used in the literature, but they are constrained by rigid weighting schemes that either discard or uniformly down-weight extreme observations, limiting their customized adaptability. This paper proposes a flexible robust \\( L \\)-estimation framework weighted by Kumaraswamy densities, offering smoothly varying observation-specific weights that preserve valuable information while improving robustness and efficiency. The framework is developed for parametric claim severity models, including Pareto, lognormal, and Fr{\\'e}chet distributions, with theoretical justifications on asymptotic normality and variance-covariance structures. Through simulations and application to a U.S. indemnity loss dataset, the proposed method demonstrates superior performance over MTM, MWM, and MLE approaches, particularly in handling outliers and heavy-tailed distributions, making it a flexible and reliable alternative for loss severity modeling.", "authors": "Poudyal, Aryal, Pokhrel"}, "https://arxiv.org/abs/2412.09845": {"title": "Addressing Positivity Violations in Extending Inference to a Target Population", "link": "https://arxiv.org/abs/2412.09845", "description": "Enhancing the external validity of trial results is essential for their applicability to real-world populations. However, violations of the positivity assumption can limit both the generalizability and transportability of findings. To address positivity violations in estimating the average treatment effect for a target population, we propose a framework that integrates characterizing the underrepresented group and performing sensitivity analysis for inference in the original target population. Our approach helps identify limitations in trial sampling and improves the robustness of trial findings for real-world populations. We apply this approach to extend findings from phase IV trials of treatments for opioid use disorder to a real-world population based on the 2021 Treatment Episode Data Set.", "authors": "Lu, Basu"}, "https://arxiv.org/abs/2412.09872": {"title": "Tail Risk Equivalent Level Transition and Its Application for Estimating Extreme $L_p$-quantiles", "link": "https://arxiv.org/abs/2412.09872", "description": "$L_p$-quantile has recently been receiving growing attention in risk management since it has desirable properties as a risk measure and is a generalization of two widely applied risk measures, Value-at-Risk and Expectile. 
The statistical methodology for $L_p$-quantile is not only feasible but also straightforward to implement as it represents a specific form of M-quantile using $p$-power loss function. In this paper, we introduce the concept of Tail Risk Equivalent Level Transition (TRELT) to capture changes in tail risk when we make a risk transition between two $L_p$-quantiles. TRELT is motivated by PELVE in Li and Wang (2023) but for tail risk. As it remains unknown in theory how this transition works, we investigate the existence, uniqueness, and asymptotic properties of TRELT (as well as dual TRELT) for $L_p$-quantiles. In addition, we study the inference methods for TRELT and extreme $L_p$-quantiles by using this risk transition, which turns out to be a novel extrapolation method in extreme value theory. The asymptotic properties of the proposed estimators are established, and both simulation studies and real data analysis are conducted to demonstrate their empirical performance.", "authors": "Zhong, Hou"}, "https://arxiv.org/abs/2412.10039": {"title": "Are you doing better than random guessing? A call for using negative controls when evaluating causal discovery algorithms", "link": "https://arxiv.org/abs/2412.10039", "description": "New proposals for causal discovery algorithms are typically evaluated using simulations and a few select real data examples with known data generating mechanisms. However, there does not exist a general guideline for how such evaluation studies should be designed, and therefore, comparing results across different studies can be difficult. In this article, we propose a common evaluation baseline by posing the question: Are we doing better than random guessing? For the task of graph skeleton estimation, we derive exact distributional results under random guessing for the expected behavior of a range of typical causal discovery evaluation metrics (including precision and recall). We show that these metrics can achieve very large values under random guessing in certain scenarios, and hence warn against using them without also reporting negative control results, i.e., performance under random guessing. We also propose an exact test of overall skeleton fit, and showcase its use on a real data application. Finally, we propose a general pipeline for using random controls beyond the skeleton estimation task, and apply it both in a simulated example and a real data application.", "authors": "Petersen"}, "https://arxiv.org/abs/2412.10196": {"title": "High-dimensional Statistics Applications to Batch Effects in Metabolomics", "link": "https://arxiv.org/abs/2412.10196", "description": "Batch effects are inevitable in large-scale metabolomics. Prior to formal data analysis, batch effect correction (BEC) is applied to prevent from obscuring biological variations, and batch effect evaluation (BEE) is used for correction assessment. However, existing BEE algorithms neglect covariances between the variables, and existing BEC algorithms might fail to adequately correct the covariances. Therefore, we resort to recent advancements in high-dimensional statistics, and respectively propose \"quality control-based simultaneous tests (QC-ST)\" and \"covariance correction (CoCo)\". Validated by the simulation data, QC-ST can simultaneously detect the statistical significance of QC samples' mean vectors and covariance matrices across different batches, and has a satisfactory statistical performance in empirical sizes, empirical powers, and computational speed. 
Then, we apply four QC-based BEC algorithms to two large cohort datasets, and find that extreme gradient boosting (XGBoost) performs best in relative standard deviation (RSD) and dispersion-ratio (D-ratio). After prepositive BEC, if QC-ST still suggests that batch effects between some pair of batches are significant, CoCo should be implemented. After CoCo (if necessary), the four metrics (i.e., RSD, D-ratio, classification performance, and QC-ST) might be further improved. In summary, under the guidance of QC-ST, we can develop a matching strategy to integrate multiple BEC algorithms more rationally and flexibly, and minimize batch effects for reliable biological conclusions.", "authors": "Guo"}, "https://arxiv.org/abs/2412.10213": {"title": "Collaborative Design of Controlled Experiments in the Presence of Subject Covariates", "link": "https://arxiv.org/abs/2412.10213", "description": "We consider the optimal experimental design problem of allocating subjects to treatment or control when subjects participate in multiple, separate controlled experiments within a short time-frame and subject covariate information is available. Here, in addition to subject covariates, we consider the dependence among the responses coming from the subject's random effect across experiments. In this setting, the goal of the allocation is to provide precise estimates of treatment effects for each experiment. Deriving the precision matrix of the treatment effects and using D-optimality as our allocation criterion, we demonstrate the advantage of collaboratively designing and analyzing multiple experiments over traditional independent design and analysis, and propose two randomized algorithms to provide solutions to the D-optimality problem for collaborative design. The first algorithm decomposes the D-optimality problem into a sequence of subproblems, where each subproblem is a quadratic binary program that can be solved through a semi-definite relaxation based randomized algorithm with performance guarantees. The second algorithm involves solving a single semi-definite program, and randomly generating allocations for each experiment from the solution of this program. We showcase the performance of these algorithms through a simulation study, finding that our algorithms outperform covariate-agnostic methods when there are a large number of covariates.", "authors": "Fisher, Zhang, Kang et al"}, "https://arxiv.org/abs/2412.10245": {"title": "Regression trees for nonparametric diagnostics of sequential positivity violations in longitudinal causal inference", "link": "https://arxiv.org/abs/2412.10245", "description": "Sequential positivity is often a necessary assumption for drawing causal inferences, such as through marginal structural modeling. Unfortunately, verification of this assumption can be challenging because it usually relies on multiple parametric propensity score models, which are unlikely to all be correctly specified. Therefore, we propose a new algorithm, called \"sequential Positivity Regression Tree\" (sPoRT), to check this assumption with greater ease under either static or dynamic treatment strategies. This algorithm also identifies the subgroups found to be violating this assumption, allowing for insights about the nature of the violations and potential solutions. We first present different versions of sPoRT based on either stratifying or pooling over time. Finally, we illustrate its use in a real-life application of HIV-positive children in Southern Africa with and without pooling over time. 
An R notebook showing how to use sPoRT is available at github.com/ArthurChatton/sPoRT-notebook.", "authors": "Chatton, Schomaker, Luque-Fernandez et al"}, "https://arxiv.org/abs/2412.10252": {"title": "What if we had built a prediction model with a survival super learner instead of a Cox model 10 years ago?", "link": "https://arxiv.org/abs/2412.10252", "description": "Objective: This study sought to compare the drop in predictive performance over time according to the modeling approach (regression versus machine learning) used to build a kidney transplant failure prediction model with a time-to-event outcome.\n Study Design and Setting: The Kidney Transplant Failure Score (KTFS) was used as a benchmark. We reused the data from which it was developed (DIVAT cohort, n=2,169) to build another prediction algorithm using a survival super learner combining (semi-)parametric and non-parametric methods. Performance in DIVAT was estimated for the two prediction models using internal validation. Then, the drop in predictive performance was evaluated in the same geographical population approximately ten years later (EKiTE cohort, n=2,329).\n Results: In DIVAT, the super learner achieved better discrimination than the KTFS, with a tAUROC of 0.83 (0.79-0.87) compared to 0.76 (0.70-0.82). While the discrimination remained stable for the KTFS, it was not the case for the super learner, with a drop to 0.80 (0.76-0.83). Regarding calibration, the survival SL overestimated graft survival at development, while the KTFS underestimated graft survival ten years later. Brier score values were similar regardless of the approach and the timing.\n Conclusion: The more flexible SL provided superior discrimination on the population used to fit it compared to a Cox model and similar discrimination when applied to a future dataset of the same population. Both methods are subject to calibration drift over time. However, weak calibration on the population used to develop the prediction model was correct only for the Cox model, and recalibration should be considered in the future to correct the calibration drift.", "authors": "Chatton, Pilote, Feugo et al"}, "https://arxiv.org/abs/2412.10304": {"title": "A Neyman-Orthogonalization Approach to the Incidental Parameter Problem", "link": "https://arxiv.org/abs/2412.10304", "description": "A popular approach to perform inference on a target parameter in the presence of nuisance parameters is to construct estimating equations that are orthogonal to the nuisance parameters, in the sense that their expected first derivative is zero. Such first-order orthogonalization may, however, not suffice when the nuisance parameters are very imprecisely estimated. Leading examples where this is the case are models for panel and network data that feature fixed effects. In this paper, we show how, in the conditional-likelihood setting, estimating equations can be constructed that are orthogonal to any chosen order. Combining these equations with sample splitting yields higher-order bias-corrected estimators of target parameters. 
In an empirical application, we apply our method to a fixed-effect model of team production and obtain estimates of complementarity in production and impacts of counterfactual re-allocations.", "authors": "Bonhomme, Jochmans, Weidner"}, "https://arxiv.org/abs/2412.10005": {"title": "Matrix Completion via Residual Spectral Matching", "link": "https://arxiv.org/abs/2412.10005", "description": "Noisy matrix completion has attracted significant attention due to its applications in recommendation systems, signal processing and image restoration. Most existing works rely on (weighted) least squares methods under various low-rank constraints. However, minimizing the sum of squared residuals is not always efficient, as it may ignore the potential structural information in the residuals. In this study, we propose a novel residual spectral matching criterion that incorporates not only the numerical but also locational information of residuals. This criterion is the first in noisy matrix completion to adopt the perspective of low-rank perturbation of random matrices and exploit the spectral properties of sparse random matrices. We derive optimal statistical properties by analyzing the spectral properties of sparse random matrices and bounding the effects of low-rank perturbations and partial observations. Additionally, we propose algorithms that efficiently approximate solutions by constructing easily computable pseudo-gradients. The iterative process of the proposed algorithms ensures convergence at a rate consistent with the optimal statistical error bound. Our method and algorithms demonstrate improved numerical performance in both simulated and real data examples, particularly in environments with high noise levels.", "authors": "Chen, Yao"}, "https://arxiv.org/abs/2412.10053": {"title": "De-Biasing Structure Function Estimates From Sparse Time Series of the Solar Wind: A Data-Driven Approach", "link": "https://arxiv.org/abs/2412.10053", "description": "Structure functions, which represent the moments of the increments of a stochastic process, are essential complementary statistics to power spectra for analysing the self-similar behaviour of a time series. However, many real-world environmental datasets, such as those collected by spacecraft monitoring the solar wind, contain gaps, which inevitably corrupt the statistics. The nature of this corruption for structure functions remains poorly understood - indeed, often overlooked. Here we simulate gaps in a large set of magnetic field intervals from Parker Solar Probe in order to characterize the behaviour of the structure function of a sparse time series of solar wind turbulence. We quantify the resultant error with regards to the overall shape of the structure function, and its slope in the inertial range. Noting the consistent underestimation of the true curve when using linear interpolation, we demonstrate the ability of an empirical correction factor to de-bias these estimates. This correction, \"learnt\" from the data from a single spacecraft, is shown to generalize well to data from a solar wind regime elsewhere in the heliosphere, producing smaller errors, on average, for missing fractions >25%. Given this success, we apply the correction to gap-affected Voyager intervals from the inner heliosheath and local interstellar medium, obtaining spectral indices similar to those from previous studies. 
This work provides a tool for future studies of fragmented solar wind time series, such as those from Voyager, MAVEN, and OMNI, as well as sparsely-sampled astrophysical and geophysical processes more generally.", "authors": "Wrench, Parashar"}, "https://arxiv.org/abs/2412.10069": {"title": "Multiscale Dynamical Indices Reveal Scale-Dependent Atmospheric Dynamics", "link": "https://arxiv.org/abs/2412.10069", "description": "Geophysical systems are inherently complex and span multiple spatial and temporal scales, making their dynamics challenging to understand and predict. This challenge is especially pronounced for extreme events, which are primarily governed by their instantaneous properties rather than their average characteristics. Advances in dynamical systems theory, including the development of local dynamical indices such as local dimension and inverse persistence, have provided powerful tools for studying these short-lasting phenomena. However, existing applications of such indices often rely on predefined fixed spatial domains and scales, with limited discussion on the influence of spatial scales on the results. In this work, we present a novel spatially multiscale methodology that applies a sliding window method to compute dynamical indices, enabling the exploration of scale-dependent properties. Applying this framework to high-impact European summertime heatwaves, we reconcile previously different perspectives, thereby underscoring the importance of spatial scales in such analyses. Furthermore, we emphasize that our novel methodology has broad applicability to other atmospheric phenomena, as well as to other geophysical and spatio-temporal systems.", "authors": "Dong, Messori, Faranda et al"}, "https://arxiv.org/abs/2412.10119": {"title": "AMUSE: Adaptive Model Updating using a Simulated Environment", "link": "https://arxiv.org/abs/2412.10119", "description": "Prediction models frequently face the challenge of concept drift, in which the underlying data distribution changes over time, weakening performance. Examples can include models which predict loan default, or those used in healthcare contexts. Typical management strategies involve regular model updates or updates triggered by concept drift detection. However, these simple policies do not necessarily balance the cost of model updating with improved classifier performance. We present AMUSE (Adaptive Model Updating using a Simulated Environment), a novel method leveraging reinforcement learning trained within a simulated data generating environment, to determine update timings for classifiers. The optimal updating policy depends on the current data generating process and ongoing drift process. Our key idea is that we can train an arbitrarily complex model updating policy by creating a training environment in which possible episodes of drift are simulated by a parametric model, which represents expectations of possible drift patterns. As a result, AMUSE proactively recommends updates based on estimated performance improvements, learning a policy that balances maintaining model performance with minimizing update costs. 
Empirical results confirm the effectiveness of AMUSE in simulated data.", "authors": "Chislett, Vallejos, Cannings et al"}, "https://arxiv.org/abs/2412.10288": {"title": "Performance evaluation of predictive AI models to support medical decisions: Overview and guidance", "link": "https://arxiv.org/abs/2412.10288", "description": "A myriad of measures to illustrate performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a \"proper\" measure), and (2) whether they reflect either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen exhibit one characteristic, and one measure (the F1 measure) exhibits neither. All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.", "authors": "initiative), initiative), initiative) et al"}, "https://arxiv.org/abs/2212.14622": {"title": "Identifying causal effects with subjective ordinal outcomes", "link": "https://arxiv.org/abs/2212.14622", "description": "Survey questions often ask respondents to select from ordered scales where the meanings of the categories are subjective, leaving each individual free to apply their own definitions when answering. This paper studies the use of these responses as an outcome variable in causal inference, accounting for variation in the interpretation of categories across individuals. I find that when a continuous treatment is statistically independent of both i) potential outcomes; and ii) heterogeneity in reporting styles, a nonparametric regression of response category number on that treatment variable recovers a quantity proportional to an average causal effect among individuals who are on the margin between successive response categories. The magnitude of a given regression coefficient is not meaningful on its own, but the ratio of local regression derivatives with respect to two such treatment variables identifies the relative magnitudes of convex averages of their effects. I find that comparisons involving discrete treatment variables are not as readily interpretable, but obtain a partial identification result for such cases under additional assumptions. 
I illustrate the results by revisiting the effects of income comparisons on subjective well-being, without assuming cardinality or interpersonal comparability of responses.", "authors": "Goff"}, "https://arxiv.org/abs/2412.10521": {"title": "KenCoh: A Ranked-Based Canonical Coherence", "link": "https://arxiv.org/abs/2412.10521", "description": "In this paper, we consider the problem of characterizing a robust global dependence between two brain regions where each region may contain several voxels or channels. This work is driven by experiments to investigate the dependence between two cortical regions and to identify differences in brain networks between brain states, e.g., alert and drowsy states. The most common approach to explore dependence between two groups of variables (or signals) is via canonical correlation analysis (CCA). However, it is limited to only capturing linear associations and is sensitive to outlier observations. These limitations are crucial because brain network connectivity is likely to be more complex than linear and brain signals may exhibit heavy-tailed properties. To overcome these limitations, we develop a robust method, Kendall canonical coherence (KenCoh), for learning monotonic connectivity structure among neuronal signals filtered at given frequency bands. Furthermore, we propose the KenCoh-based permutation test to investigate the differences in brain network connectivity between two different states. Our simulation study demonstrates that KenCoh is competitive with the traditional variance-covariance estimator and outperforms the latter when the underlying distributions are heavy-tailed. We apply our method to EEG recordings from a virtual-reality driving experiment. Our proposed method led to further insights into the differences in the frontal-parietal cross-dependence network when the subject is alert and when the subject is drowsy, and showed that the left-parietal channel drives this dependence at the beta-band.", "authors": "Talento, Roy, Ombao"}, "https://arxiv.org/abs/2412.10563": {"title": "Augmented two-stage estimation for treatment crossover in oncology trials: Leveraging external data for improved precision", "link": "https://arxiv.org/abs/2412.10563", "description": "Randomized controlled trials (RCTs) in oncology often allow control group participants to cross over to experimental treatments, a practice that, while often ethically necessary, complicates the accurate estimation of long-term treatment effects. When crossover rates are high or sample sizes are limited, commonly used methods for crossover adjustment (such as the rank-preserving structural failure time model, inverse probability of censoring weights, and two-stage estimation (TSE)) may produce imprecise estimates. Real-world data (RWD) can be used to develop an external control arm for the RCT, although this approach ignores evidence from trial subjects who did not cross over and ignores evidence from the data obtained prior to crossover for those subjects who did. This paper introduces augmented two-stage estimation (ATSE), a method that combines data from non-switching participants in a RCT with an external dataset, forming a 'hybrid non-switching arm'. With a simulation study, we evaluate the ATSE method's effectiveness compared to TSE crossover adjustment and an external control arm approach. 
Results indicate that, relative to TSE and the external control arm approach, ATSE can increase precision and may be less susceptible to bias due to unmeasured confounding.", "authors": "Campbell, Jansen, Cope"}, "https://arxiv.org/abs/2412.10600": {"title": "The Front-door Criterion in the Potential Outcome Framework", "link": "https://arxiv.org/abs/2412.10600", "description": "In recent years, the front-door criterion (FDC) has been increasingly noticed in economics and social science. However, most economists still resist adding this tool to their empirical toolkit. This article aims to incorporate the FDC into the framework of the potential outcome model (RCM). It redefines the key assumptions of the FDC in the language of the RCM. These assumptions are more comprehensive and detailed than the original ones in the structural causal model (SCM). The causal connotations of the FDC estimates are elaborated in detail, and the estimation bias caused by violating some key assumptions is theoretically derived. Rigorous simulation data are used to confirm the theoretical derivation. It is proved that the FDC can still provide useful insights into causal relationships even when some key assumptions are violated. The FDC is also comprehensively compared with instrumental variables (IV) from the perspective of assumptions and causal connotations. The analyses of this paper show that the FDC can serve as a powerful empirical tool. It can provide new insights into causal relationships compared with the conventional methods in social science.", "authors": "Chen"}, "https://arxiv.org/abs/2412.10608": {"title": "An overview of meta-analytic methods for economic research", "link": "https://arxiv.org/abs/2412.10608", "description": "Meta-analysis is the use of statistical methods to combine the results of individual studies to estimate the overall effect size for a specific outcome of interest. The direction and magnitude of this estimated effect, along with its confidence interval, provide insights into the phenomenon or relationship being investigated. As an extension of the standard meta-analysis, meta-regression analysis incorporates multiple moderators representing identifiable study characteristics into the meta-analysis model, thereby explaining some of the heterogeneity in true effect sizes across studies. This form of meta-analysis is especially designed to quantitatively synthesize empirical evidence in economics. This study provides an overview of the meta-analytic procedures tailored for economic research. By addressing key challenges, including between-study heterogeneity, publication bias, and effect size dependence, it aims to equip researchers with the tools and insights needed to conduct rigorous and informative meta-analytic studies in economics and related disciplines.", "authors": "Haghnejad, Farahati"}, "https://arxiv.org/abs/2412.10635": {"title": "Do LLMs Act as Repositories of Causal Knowledge?", "link": "https://arxiv.org/abs/2412.10635", "description": "Large language models (LLMs) offer the potential to automate a large number of tasks that previously have not been possible to automate, including some in science. There is considerable interest in whether LLMs can automate the process of causal inference by providing the information about causal links necessary to build a structural model. We use the case of confounding in the Coronary Drug Project (CDP), for which there are several studies listing expert-selected confounders that can serve as a ground truth. 
LLMs exhibit mediocre performance in identifying confounders in this setting, even though text about the ground truth is in their training data. Variables that experts identify as confounders are only slightly more likely to be labeled as confounders by LLMs compared to variables that experts consider non-confounders. Further, LLM judgment on confounder status is highly inconsistent across models, prompts, and irrelevant concerns like multiple-choice option ordering. LLMs do not yet have the ability to automate the reporting of causal links.", "authors": "Huntington-Klein, Murray"}, "https://arxiv.org/abs/2412.10639": {"title": "A Multiprocess State Space Model with Feedback and Switching for Patterns of Clinical Measurements Associated with COVID-19", "link": "https://arxiv.org/abs/2412.10639", "description": "Clinical measurements, such as body temperature, are often collected over time to monitor an individual's underlying health condition. These measurements exhibit complex temporal dynamics, necessitating sophisticated statistical models to capture patterns and detect deviations. We propose a novel multiprocess state space model with feedback and switching mechanisms to analyze the dynamics of clinical measurements. This model captures the evolution of time series through distinct latent processes and incorporates feedback effects in the transition probabilities between latent processes. We develop estimation methods using the EM algorithm, integrated with multiprocess Kalman filtering and multiprocess fixed-interval smoothing. A simulation study shows that the algorithm is efficient and performs well. We apply the proposed model to body temperature measurements from COVID-19-infected hemodialysis patients to examine temporal dynamics and estimate infection and recovery probabilities.", "authors": "Ma, Guo, Kotanko et al"}, "https://arxiv.org/abs/2412.10658": {"title": "Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling", "link": "https://arxiv.org/abs/2412.10658", "description": "Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but often fail to fully mine and utilize the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data in limited-data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimation method is Lipschitz continuous with respect to the data distribution and requires a sample size of $3/B$ of that required for histogram binning, where $B$ represents the number of bins. Also, a new calibration metric ($TCE_{bpm}$), which leverages the estimated calibration curve to estimate the true calibration error (TCE), is designed. $TCE_{bpm}$ is proven to be a consistent calibration measure. 
Furthermore, realistic calibration datasets can be generated by the binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified on real-world and simulated data.", "authors": "Dong, Jiang, Pan et al"}, "https://arxiv.org/abs/2412.10683": {"title": "Adaptive Nonparametric Perturbations of Parametric Bayesian Models", "link": "https://arxiv.org/abs/2412.10683", "description": "Parametric Bayesian modeling offers a powerful and flexible toolbox for scientific data analysis. Yet the model, however detailed, may still be wrong, and this can make inferences untrustworthy. In this paper we study nonparametrically perturbed parametric (NPP) Bayesian models, in which a parametric Bayesian model is relaxed via a distortion of its likelihood. We analyze the properties of NPP models when the target of inference is the true data distribution or some functional of it, such as in causal inference. We show that NPP models can offer the robustness of nonparametric models while retaining the data efficiency of parametric models, achieving fast convergence when the parametric model is close to true. To efficiently analyze data with an NPP model, we develop a generalized Bayes procedure to approximate its posterior. We demonstrate our method by estimating causal effects of gene expression from single cell RNA sequencing data. NPP modeling offers an efficient approach to robust Bayesian inference and can be used to robustify any parametric Bayesian model.", "authors": "Wu, Weinstein, Salehi et al"}, "https://arxiv.org/abs/2412.10721": {"title": "Model checking for high dimensional generalized linear models based on random projections", "link": "https://arxiv.org/abs/2412.10721", "description": "Most existing tests in the literature for model checking do not work in high-dimensional settings due to challenges arising from the \"curse of dimensionality\", or dependencies on the normality of parameter estimators. To address these challenges, we propose a new goodness-of-fit test based on random projections for generalized linear models, when the dimension of covariates may substantially exceed the sample size. The tests only require the convergence rate of parameter estimators to derive the limiting distribution. The growth rate of the dimension is allowed to be of exponential order in relation to the sample size. As random projection converts covariates to one-dimensional space, our tests can detect the local alternative departing from the null at the rate of $n^{-1/2}h^{-1/4}$ where $h$ is the bandwidth, and $n$ is the sample size. This sensitive rate is not related to the dimension of covariates, and thus the \"curse of dimensionality\" for our tests would be largely alleviated. An interesting and unexpected result is that for randomly chosen projections, the resulting test statistics can be asymptotically independent. We then propose combination methods to enhance the power of the tests. 
Detailed simulation studies and a real data analysis are conducted to illustrate the effectiveness of our methodology.", "authors": "Chen, Liu, Peng et al"}, "https://arxiv.org/abs/2412.10791": {"title": "Forecasting realized covariances using HAR-type models", "link": "https://arxiv.org/abs/2412.10791", "description": "We investigate methods for forecasting multivariate realized covariances matrices applied to a set of 30 assets that were included in the DJ30 index at some point, including two novel methods that use existing (univariate) log of realized variance models that account for attenuation bias and time-varying parameters. We consider the implications of some modeling choices within the class of heterogeneous autoregressive models. The following are our key findings. First, modeling the logs of the marginal volatilities is strongly preferred over direct modeling of marginal volatility. Thus, our proposed model that accounts for attenuation bias (for the log-response) provides superior one-step-ahead forecasts over existing multivariate realized covariance approaches. Second, accounting for measurement errors in marginal realized variances generally improves multivariate forecasting performance, but to a lesser degree than previously found in the literature. Third, time-varying parameter models based on state-space models perform almost equally well. Fourth, statistical and economic criteria for comparing the forecasting performance lead to some differences in the models' rankings, which can partially be explained by the turbulent post-pandemic data in our out-of-sample validation dataset using sub-sample analyses.", "authors": "Quiroz, Tafakori, Manner"}, "https://arxiv.org/abs/2412.10920": {"title": "Multiscale Autoregression on Adaptively Detected Timescales", "link": "https://arxiv.org/abs/2412.10920", "description": "We propose a multiscale approach to time series autoregression, in which linear regressors for the process in question include features of its own path that live on multiple timescales. We take these multiscale features to be the recent averages of the process over multiple timescales, whose number or spans are not known to the analyst and are estimated from the data via a change-point detection technique. The resulting construction, termed Adaptive Multiscale AutoRegression (AMAR) enables adaptive regularisation of linear autoregression of large orders. The AMAR model is designed to offer simplicity and interpretability on the one hand, and modelling flexibility on the other. Our theory permits the longest timescale to increase with the sample size. A simulation study is presented to show the usefulness of our approach. Some possible extensions are also discussed, including the Adaptive Multiscale Vector AutoRegressive model (AMVAR) for multivariate time series, which demonstrates promising performance in the data example on UK and US unemployment rates. The R package amar provides an efficient implementation of the AMAR framework.", "authors": "Baranowski, Chen, Fryzlewicz"}, "https://arxiv.org/abs/2412.10988": {"title": "Multiple Imputation for Nonresponse in Complex Surveys Using Design Weights and Auxiliary Margins", "link": "https://arxiv.org/abs/2412.10988", "description": "Survey data typically have missing values due to unit and item nonresponse. Sometimes, survey organizations know the marginal distributions of certain categorical variables in the survey. 
As shown in previous work, survey organizations can leverage these distributions in multiple imputation for nonignorable unit nonresponse, generating imputations that result in plausible completed-data estimates for the variables with known margins. However, this prior work does not use the design weights for unit nonrespondents; rather, it relies on a set of fabricated weights for these units. We extend this previous work to utilize the design weights for all sampled units. We illustrate the approach using simulation studies.", "authors": "Xu, Reiter"}, "https://arxiv.org/abs/2412.11136": {"title": "Minimax Regret Estimation for Generalizing Heterogeneous Treatment Effects with Multisite Data", "link": "https://arxiv.org/abs/2412.11136", "description": "To test scientific theories and develop individualized treatment rules, researchers often wish to learn heterogeneous treatment effects that can be consistently found across diverse populations and contexts. We consider the problem of generalizing heterogeneous treatment effects (HTE) based on data from multiple sites. A key challenge is that a target population may differ from the source sites in unknown and unobservable ways. This means that the estimates from site-specific models lack external validity, and a simple pooled analysis risks bias. We develop a robust CATE (conditional average treatment effect) estimation methodology with multisite data from heterogeneous populations. We propose a minimax-regret framework that learns a generalizable CATE model by minimizing the worst-case regret over a class of target populations whose CATE can be represented as convex combinations of site-specific CATEs. Using robust optimization, the proposed methodology accounts for distribution shifts in both individual covariates and treatment effect heterogeneity across sites. We show that the resulting CATE model has an interpretable closed-form solution, expressed as a weighted average of site-specific CATE models. Thus, researchers can utilize a flexible CATE estimation method within each site and aggregate site-specific estimates to produce the final model. Through simulations and a real-world application, we show that the proposed methodology improves the robustness and generalizability of existing approaches.", "authors": "Zhang, Huang, Imai"}, "https://arxiv.org/abs/2412.11140": {"title": "BUPD: A Bayesian under-parameterized basket design with the unit information prior in oncology trials", "link": "https://arxiv.org/abs/2412.11140", "description": "Basket trials in oncology enroll multiple patients with cancer harboring identical gene alterations and evaluate their response to targeted therapies across cancer types. Several existing methods have extended a Bayesian hierarchical model borrowing information on the response rates in different cancer types to account for the heterogeneity of drug effects. However, these methods rely on several pre-specified parameters to account for the heterogeneity of response rates among different cancer types. Here, we propose a novel Bayesian under-parameterized basket design with a unit information prior (BUPD) that uses only one (or two) pre-specified parameters to control the amount of information borrowed among cancer types, considering the heterogeneity of response rates. BUPD adapts the unit information prior approach, originally developed for borrowing information from historical clinical trial data, to enable mutual information borrowing between two cancer types. 
BUPD enables flexible controls of the type 1 error rate and power by explicitly specifying the strength of borrowing while providing interpretable estimations of response rates. Simulation studies revealed that BUPD reduced the type 1 error rate in scenarios with few ineffective cancer types and improved the power in scenarios with few effective cancer types better than five existing methods. This study also illustrated the efficiency of BUPD using response rates from a real basket trial.", "authors": "Kitabayashi, Sato, Hirakawa"}, "https://arxiv.org/abs/2412.11153": {"title": "Balancing Accuracy and Costs in Cross-Temporal Hierarchies: Investigating Decision-Based and Validation-Based Reconciliation", "link": "https://arxiv.org/abs/2412.11153", "description": "Wind power forecasting is essential for managing daily operations at wind farms and enabling market operators to manage power uncertainty effectively in demand planning. This paper explores advanced cross-temporal forecasting models and their potential to enhance forecasting accuracy. First, we propose a novel approach that leverages validation errors, rather than traditional in-sample errors, for covariance matrix estimation and forecast reconciliation. Second, we introduce decision-based aggregation levels for forecasting and reconciliation where certain horizons are based on the required decisions in practice. Third, we evaluate the forecasting performance of the models not only on their ability to minimize errors but also on their effectiveness in reducing decision costs, such as penalties in ancillary services. Our results show that statistical-based hierarchies tend to adopt less conservative forecasts and reduce revenue losses. On the other hand, decision-based reconciliation offers a more balanced compromise between accuracy and decision cost, making them attractive for practical use.", "authors": "Abolghasemi, Girolimetto, Fonzo"}, "https://arxiv.org/abs/2412.11179": {"title": "Treatment Evaluation at the Intensive and Extensive Margins", "link": "https://arxiv.org/abs/2412.11179", "description": "This paper provides a solution to the evaluation of treatment effects in selective samples when neither instruments nor parametric assumptions are available. We provide sharp bounds for average treatment effects under a conditional monotonicity assumption for all principal strata, i.e. units characterizing the complete intensive and extensive margins. Most importantly, we allow for a large share of units whose selection is indifferent to treatment, e.g. due to non-compliance. The existence of such a population is crucially tied to the regularity of sharp population bounds and thus conventional asymptotic inference for methods such as Lee bounds can be misleading. It can be solved using smoothed outer identification regions for inference. We provide semiparametrically efficient debiased machine learning estimators for both regular and smooth bounds that can accommodate high-dimensional covariates and flexible functional forms. Our study of active labor market policy reveals the empirical prevalence of the aforementioned indifference population and supports results from previous impact analysis under much weaker assumptions.", "authors": "Heiler, Kaufmann, Veliyev"}, "https://arxiv.org/abs/2412.11267": {"title": "P3LS: Point Process Partial Least Squares", "link": "https://arxiv.org/abs/2412.11267", "description": "Many studies collect data that can be considered as a realization of a point process. 
Included are medical imaging data where photon counts are recorded by a gamma camera from patients being injected with a gamma-emitting tracer. It is of interest to develop analytic methods that can help with diagnosis as well as in the training of inexpert radiologists. Partial least squares (PLS) is a popular analytic approach that combines features from linear modeling as well as dimension reduction to provide parsimonious prediction and classification. However, existing PLS methodologies do not include the analysis of point process predictors. In this article, we introduce point process PLS (P3LS) for analyzing latent time-varying intensity functions from collections of inhomogeneous point processes. A novel estimation procedure for $P^3LS$ is developed that utilizes the properties of log-Gaussian Cox processes, and its empirical properties are examined in simulation studies. The method is used to analyze kidney functionality in patients with renal disease in order to aid in the diagnosis of kidney obstruction.", "authors": "Namdari, Krafty, Manatunga"}, "https://arxiv.org/abs/2412.11278": {"title": "VAR models with an index structure: A survey with new results", "link": "https://arxiv.org/abs/2412.11278", "description": "The main aim of this paper is to review recent advances in the multivariate autoregressive index model [MAI], originally proposed by Reinsel (1983), and their applications to economic and financial time series. MAI has recently gained momentum because it can be seen as a link between two popular but distinct multivariate time series approaches: vector autoregressive modeling [VAR] and the dynamic factor model [DFM]. Indeed, on the one hand, the MAI is a VAR model with a peculiar reduced-rank structure; on the other hand, it allows for identification of common components and common shocks in a similar way as the DFM. The focus is on recent developments of the MAI, which include extending the original model with individual autoregressive structures, stochastic volatility, time-varying parameters, high-dimensionality, and cointegration. In addition, new insights on previous contributions and a novel model are also provided.", "authors": "Cubadda"}, "https://arxiv.org/abs/2412.11285": {"title": "Moderating the Mediation Bootstrap for Causal Inference", "link": "https://arxiv.org/abs/2412.11285", "description": "Mediation analysis is a form of causal inference that investigates indirect effects and causal mechanisms. Confidence intervals for indirect effects play a central role in conducting inference. The problem is non-standard, leading to coverage rates that deviate considerably from their nominal level. The default inference method in the mediation model is the paired bootstrap, which resamples directly from the observed data. However, a residual bootstrap that explicitly exploits the assumed causal structure (X->M->Y) could also be applied. There is also a debate about whether the bias-corrected (BC) bootstrap method is superior to the percentile method, with the former showing liberal behavior (actual coverage too low) in certain circumstances. Moreover, bootstrap methods tend to be very conservative (coverage higher than required) when mediation effects are small. Finally, iterated bootstrap methods like the double bootstrap have not been considered due to their high computational demands. We investigate the issues mentioned in the simple mediation model by a large-scale simulation. 
Results are explained using graphical methods and the newly derived finite-sample distribution. The main findings are: (i) conservative behavior of the bootstrap is caused by extreme dependence of the bootstrap distribution's shape on the estimated coefficients, and (ii) this dependence leads to counterproductive correction of the double bootstrap. The added randomness of the BC method inflates the coverage in the absence of mediation, but still leads to (invalid) liberal inference when the mediation effect is small.", "authors": "Garderen, Giersbergen"}, "https://arxiv.org/abs/2412.11326": {"title": "Spatial Cross-Recurrence Quantification Analysis for Multi-Platform Contact Tracing and Epidemiology Research", "link": "https://arxiv.org/abs/2412.11326", "description": "Contact tracing is an essential tool in slowing and containing outbreaks of contagious diseases. Current contact tracing methods range from interviews with public health personnel to Bluetooth pings from smartphones. While all methods offer various benefits, it is difficult for different methods to integrate with one another. Additionally, for contact tracing mobile applications, data privacy is a concern to many as GPS data from users is saved to either a central server or the user's device. The current paper describes a method called spatial cross-recurrence quantification analysis (SpaRQ) that can combine and analyze contact tracing data, regardless of how it has been obtained, and generate a risk profile for the user without storing GPS data. Furthermore, the plots from SpaRQ can be used to investigate the nature of the infectious agent, such as how long it can remain viable in air or on surfaces after an infected person has passed, the chance of infection based on exposure time, and what type of exposure is maximally infective.", "authors": "Patten"}, "https://arxiv.org/abs/2412.11340": {"title": "Fast Bayesian Functional Principal Components Analysis", "link": "https://arxiv.org/abs/2412.11340", "description": "Functional Principal Components Analysis (FPCA) is one of the most successful and widely used analytic tools for exploration and dimension reduction of functional data. Standard implementations of FPCA estimate the principal components from the data but ignore their sampling variability in subsequent inferences. To address this problem, we propose the Fast Bayesian Functional Principal Components Analysis (Fast BayesFPCA), which treats principal components as parameters on the Stiefel manifold. To ensure efficiency, stability, and scalability, we introduce three innovations: (1) project all eigenfunctions onto an orthonormal spline basis, reducing modeling considerations to a smaller-dimensional Stiefel manifold; (2) induce a uniform prior on the Stiefel manifold of the principal component spline coefficients via the polar representation of a matrix with entries following independent standard Normal priors; and (3) constrain sampling using the assumed FPCA structure to improve stability. We demonstrate the application of Fast BayesFPCA to characterize the variability in mealtime glucose from the Dietary Approaches to Stop Hypertension for Diabetes Continuous Glucose Monitoring (DASH4D CGM) study. 
All relevant STAN code and simulation routines are available as supplementary material.", "authors": "Sartini, Zhou, Selvin et al"}, "https://arxiv.org/abs/2412.11348": {"title": "Analyzing zero-inflated clustered longitudinal ordinal outcomes using GEE-type models with an application to dental fluorosis studies", "link": "https://arxiv.org/abs/2412.11348", "description": "Motivated by the Iowa Fluoride Study (IFS) dataset, which comprises zero-inflated multi-level ordinal responses on tooth fluorosis, we develop an estimation scheme leveraging generalized estimating equations (GEEs) and James-Stein shrinkage. Previous analyses of this cohort study primarily focused on caries (count response) or employed a Bayesian approach to the ordinal fluorosis outcome. This study is based on the expanded dataset that now includes observations for age 23, whereas earlier works were restricted to ages 9, 13, and/or 17 according to the participants' ages at the time of measurement. The adoption of a frequentist perspective enhances the interpretability to a broader audience. Over a choice of several covariance structures, separate models are formulated for the presence (zero versus non-zero score) and severity (non-zero ordinal scores) of fluorosis, which are then integrated through shared regression parameters. This comprehensive framework effectively identifies risk or protective effects of dietary and non-dietary factors on dental fluorosis.", "authors": "Sarkar, Mukherjee, Gaskins et al"}, "https://arxiv.org/abs/2412.11575": {"title": "Cost-aware Portfolios in a Large Universe of Assets", "link": "https://arxiv.org/abs/2412.11575", "description": "This paper considers the finite horizon portfolio rebalancing problem in terms of mean-variance optimization, where decisions are made based on current information on asset returns and transaction costs. The study's novelty is that the transaction costs are integrated within the optimization problem in a high-dimensional portfolio setting where the number of assets is larger than the sample size. We propose portfolio construction and rebalancing models with nonconvex penalty considering two types of transaction cost, the proportional transaction cost and the quadratic transaction cost. We establish the desired theoretical properties under mild regularity conditions. Monte Carlo simulations and empirical studies using S&P 500 and Russell 2000 stocks show the satisfactory performance of the proposed portfolio and highlight the importance of involving the transaction costs when rebalancing a portfolio.", "authors": "Fan, Medeiros, Yang et al"}, "https://arxiv.org/abs/2412.11612": {"title": "Autoregressive hidden Markov models for high-resolution animal movement data", "link": "https://arxiv.org/abs/2412.11612", "description": "New types of high-resolution animal movement data allow for increasingly comprehensive biological inference, but method development to meet the statistical challenges associated with such data is lagging behind. In this contribution, we extend the commonly applied hidden Markov models for step lengths and turning angles to address the specific requirements posed by high-resolution movement data, in particular the very strong within-state correlation induced by the momentum in the movement. The models feature autoregressive components of general order in both the step length and the turning angle variable, with the possibility to automate the selection of the autoregressive degree using a lasso approach. 
In a simulation study, we identify potential for improved inference when using the new model instead of the commonly applied basic hidden Markov model in cases where there is strong within-state autocorrelation. The practical use of the model is illustrated using high-resolution movement tracks of terns foraging near an anthropogenic structure causing turbulent water flow features.", "authors": "Stoye, Hoyer, Langrock"}, "https://arxiv.org/abs/2412.11651": {"title": "A New Sampling Method Base on Sequential Tests with Fixed Sample Size Upper Limit", "link": "https://arxiv.org/abs/2412.11651", "description": "Sequential inspection is a technique employed to monitor product quality during the production process. For smaller batch sizes, the Acceptable Quality Limit (AQL) inspection theory is typically applied, whereas for larger batch sizes, the Poisson distribution is commonly utilized to determine the sample size and rejection thresholds. However, because the rate of defective products is usually low in actual production, these methods often require more samples to draw conclusions, resulting in longer inspection times. Motivated by this, this paper proposes a sequential inspection method with a fixed upper limit of sample size. This approach not only incorporates the Poisson distribution algorithm, allowing for rapid calculation of sample size and rejection thresholds to facilitate planning, but also adapts the concept of sequential inspection to dynamically modify the sampling plan and decision-making process. This method aims to decrease the number of samples required while preserving the inspection's efficacy. Finally, this paper shows through Monte Carlo simulation that the sequential test method with a fixed sample size upper limit significantly reduces the number of samples compared to the traditional Poisson distribution algorithm, while maintaining effective inspection outcomes.", "authors": "Huang"}, "https://arxiv.org/abs/2412.11692": {"title": "A partial likelihood approach to tree-based density modeling and its application in Bayesian inference", "link": "https://arxiv.org/abs/2412.11692", "description": "Tree-based models for probability distributions are usually specified using a predetermined, data-independent collection of candidate recursive partitions of the sample space. To characterize an unknown target density in detail over the entire sample space, candidate partitions must have the capacity to expand deeply into all areas of the sample space with potential non-zero sampling probability. Such an expansive system of partitions often incurs prohibitive computational costs and makes inference prone to overfitting, especially in regions with little probability mass. Existing models typically make a compromise and rely on relatively shallow trees. This hampers one of the most desirable features of trees, their ability to characterize local features, and results in reduced statistical efficiency. Traditional wisdom suggests that this compromise is inevitable to ensure coherent likelihood-based reasoning, as a data-dependent partition system that allows deeper expansion only in regions with more observations would induce double dipping of the data and thus lead to inconsistent inference. We propose a simple strategy to restore coherency while allowing the candidate partitions to be data-dependent, using Cox's partial likelihood. 
This strategy parametrizes the tree-based sampling model according to the allocation of probability mass based on the observed data, and yet under appropriate specification, the resulting inference remains valid. Our partial likelihood approach is broadly applicable to existing likelihood-based methods and in particular to Bayesian inference on tree-based models. We give examples in density estimation in which the partial likelihood is endowed with existing priors on tree-based models and compare with the standard, full-likelihood approach. The results show substantial gains in estimation accuracy and computational efficiency from using the partial likelihood.", "authors": "Ma, Bruni"}, "https://arxiv.org/abs/2412.11790": {"title": "Variable importance measures for heterogeneous treatment effects with survival outcome", "link": "https://arxiv.org/abs/2412.11790", "description": "Treatment effect heterogeneity plays an important role in many areas of causal inference and within recent years, estimation of the conditional average treatment effect (CATE) has received much attention in the statistical community. While accurate estimation of the CATE-function through flexible machine learning procedures provides a tool for prediction of the individual treatment effect, it does not provide further insight into the driving features of potential treatment effect heterogeneity. Recent papers have addressed this problem by providing variable importance measures for treatment effect heterogeneity. Most of the suggestions have been developed for continuous or binary outcome, while little attention has been given to censored time-to-event outcome. In this paper, we extend the treatment effect variable importance measure (TE-VIM) proposed in Hines et al. (2022) to the survival setting with censored outcome. We derive an estimator for the TE-VIM for two different CATE functions based on the survival function and RMST, respectively. Along with the TE-VIM, we propose a new measure of treatment effect heterogeneity based on the best partially linear projection of the CATE and suggest accompanying estimators for that projection. All estimators are based on semiparametric efficiency theory, and we give conditions under which they are asymptotically linear. The finite sample performance of the derived estimators are investigated in a simulation study. Finally, the estimators are applied and contrasted in two real data examples.", "authors": "Ziersen, Martinussen"}, "https://arxiv.org/abs/2412.11850": {"title": "Causal Invariance Learning via Efficient Optimization of a Nonconvex Objective", "link": "https://arxiv.org/abs/2412.11850", "description": "Data from multiple environments offer valuable opportunities to uncover causal relationships among variables. Leveraging the assumption that the causal outcome model remains invariant across heterogeneous environments, state-of-the-art methods attempt to identify causal outcome models by learning invariant prediction models and rely on exhaustive searches over all (exponentially many) covariate subsets. These approaches present two major challenges: 1) determining the conditions under which the invariant prediction model aligns with the causal outcome model, and 2) devising computationally efficient causal discovery algorithms that scale polynomially, instead of exponentially, with the number of covariates. 
To address both challenges, we focus on the additive intervention regime and propose nearly necessary and sufficient conditions for ensuring that the invariant prediction model matches the causal outcome model. Exploiting the essentially necessary identifiability conditions, we introduce Negative Weight Distributionally Robust Optimization (NegDRO), a nonconvex continuous minimax optimization problem whose global optimizer recovers the causal outcome model. Unlike standard group DRO problems that maximize over the simplex, NegDRO allows negative weights on environment losses, which break the convexity. Despite its nonconvexity, we demonstrate that a standard gradient method converges to the causal outcome model, and we establish the convergence rate with respect to the sample size and the number of iterations. Our algorithm avoids exhaustive search, making it scalable especially when the number of covariates is large. The numerical results further validate the efficiency of the proposed method.", "authors": "Wang, Hu, B\\\"uhlmann et al"}, "https://arxiv.org/abs/2412.11971": {"title": "Multiplex Dirichlet stochastic block model for clustering multidimensional compositional networks", "link": "https://arxiv.org/abs/2412.11971", "description": "Network data often represent multiple types of relations, which can also denote exchanged quantities, and are typically encompassed in a weighted multiplex. Such data frequently exhibit clustering structures; however, traditional clustering methods are not well-suited for multiplex networks. Additionally, standard methods treat edge weights in their raw form, potentially biasing clustering towards a node's total weight capacity rather than reflecting cluster-related interaction patterns. To address this, we propose transforming edge weights into a compositional format, enabling the analysis of connection strengths in relative terms and removing the impact of nodes' total weights. We introduce a multiplex Dirichlet stochastic block model designed for multiplex networks with compositional layers. This model accounts for sparse compositional networks and enables joint clustering across different types of interactions. We validate the model through a simulation study and apply it to the international export data from the Food and Agriculture Organization of the United Nations.", "authors": "Promskaia, O'Hagan, Fop"}, "https://arxiv.org/abs/2412.10643": {"title": "Scientific Realism vs. Anti-Realism: Toward a Common Ground", "link": "https://arxiv.org/abs/2412.10643", "description": "The debate between scientific realism and anti-realism remains at a stalemate, with reconciliation seeming hopeless. Yet, important work remains: to seek a common ground, even if only to uncover deeper points of disagreement. I develop the idea that everyone values some truths, and use it to benefit both sides of the debate. More specifically, many anti-realists, such as instrumentalists, have yet to seriously engage with Sober's call to justify their preferred version of Ockham's razor through a positive epistemology. Meanwhile, realists face a similar challenge: providing a non-circular explanation of how their version of Ockham's razor connects to truth. Drawing insights from fields that study scientific inference -- statistics and machine learning -- I propose a common ground that addresses these challenges for both sides. 
This common ground also isolates a distinctively epistemic root of the irreconcilability in the realism debate.", "authors": "Lin"}, "https://arxiv.org/abs/2412.10753": {"title": "Posterior asymptotics of high-dimensional spiked covariance model with inverse-Wishart prior", "link": "https://arxiv.org/abs/2412.10753", "description": "We consider Bayesian inference on the spiked eigenstructures of high-dimensional covariance matrices; specifically, we focus on estimating the eigenvalues and corresponding eigenvectors of high-dimensional covariance matrices in which a few eigenvalues are significantly larger than the rest. We impose an inverse-Wishart prior distribution on the unknown covariance matrix and derive the posterior distributions of the eigenvalues and eigenvectors by transforming the posterior distribution of the covariance matrix. We prove that the posterior distribution of the spiked eigenvalues and corresponding eigenvectors converges to the true parameters under the spiked high-dimensional covariance assumption, and also that the posterior distribution of the spiked eigenvector attains the minimax optimality under the single spiked covariance model. Simulation studies and real data analysis demonstrate that our proposed method outperforms all existing methods in quantifying uncertainty.", "authors": "Lee, Park, Kim et al"}, "https://arxiv.org/abs/2412.10796": {"title": "Spatio-temporal analysis of extreme winter temperatures in Ireland", "link": "https://arxiv.org/abs/2412.10796", "description": "We analyse extreme daily minimum temperatures in winter months over the island of Ireland from 1950-2022. We model the marginal distributions of extreme winter minima using a generalised Pareto distribution (GPD), capturing temporal and spatial non-stationarities in the parameters of the GPD. We investigate two independent temporal non-stationarities in extreme winter minima. We model the long-term trend in magnitude of extreme winter minima as well as short-term, large fluctuations in magnitude caused by anomalous behaviour of the jet stream. We measure magnitudes of spatial events with a carefully chosen risk function and fit an r-Pareto process to extreme events exceeding a high-risk threshold. Our analysis is based on synoptic data observations courtesy of Met \\'Eireann and the Met Office. We show that the frequency of extreme cold winter events is decreasing over the study period. The magnitude of extreme winter events is also decreasing, indicating that winters are warming, and apparently warming at a faster rate than extreme summer temperatures. We also show that extremely cold winter temperatures are warming at a faster rate than non-extreme winter temperatures. We find that a climate model output previously shown to be informative as a covariate for modelling extremely warm summer temperatures is less effective as a covariate for extremely cold winter temperatures. However, we show that the climate model is useful for informing a non-extreme temperature model.", "authors": "Healy, Tawn, Thorne et al"}, "https://arxiv.org/abs/2412.11104": {"title": "ABC3: Active Bayesian Causal Inference with Cohn Criteria in Randomized Experiments", "link": "https://arxiv.org/abs/2412.11104", "description": "In causal inference, randomized experiment is a de facto method to overcome various theoretical issues in observational study. However, the experimental design requires expensive costs, so an efficient experimental design is necessary. 
We propose ABC3, a Bayesian active learning policy for causal inference. We show that a policy minimizing an estimation error on the conditional average treatment effect is equivalent to minimizing an integrated posterior variance, similar to the Cohn criteria (Cohn, 1994). We theoretically prove ABC3 also minimizes an imbalance between the treatment and control groups and the type 1 error probability. The imbalance-minimizing characteristic is especially notable as several works have emphasized the importance of achieving balance. Through extensive experiments on real-world data sets, ABC3 achieves the highest efficiency, while empirically showing the theoretical results hold.", "authors": "Cha, Lee"}, "https://arxiv.org/abs/2412.11743": {"title": "Generalized Bayesian deep reinforcement learning", "link": "https://arxiv.org/abs/2412.11743", "description": "Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Similar to other model-based RL approaches, it involves two key components: (1) Inferring the posterior distribution of the data generating process (DGP) modeling the true environment and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. In conjunction, to achieve scalability in the high-dimensional parameter space of the neural networks, we use gradient-based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies assuming discrete action and state spaces. Finally, we successfully extend our setup to a challenging problem with a continuous action space, without theoretical guarantees.", "authors": "Roy, Everitt, Robert et al"}, "https://arxiv.org/abs/2105.08451": {"title": "Bayesian Levy-Dynamic Spatio-Temporal Process: Towards Big Data Analysis", "link": "https://arxiv.org/abs/2105.08451", "description": "In this era of big data, all scientific disciplines are evolving fast to cope with the enormity of the available information. So is statistics, the queen of science. Big data are particularly relevant to spatio-temporal statistics, thanks to much-improved technology in satellite-based remote sensing and Geographical Information Systems. However, none of the existing approaches seem to meet the simultaneous demands of reality emulation and cheap computation. In this article, with the Levy random fields as the starting point, we construct a new Bayesian nonparametric, nonstationary and nonseparable dynamic spatio-temporal model with the additional realistic property that the lagged spatio-temporal correlations converge to zero as the lag tends to infinity. 
Although our Bayesian model seems to be intricately structured and is variable-dimensional with respect to each time index, we are able to devise a fast and efficient parallel Markov Chain Monte Carlo (MCMC) algorithm for Bayesian inference. Our simulation experiment brings out quite encouraging performance from our Bayesian Levy-dynamic approach. We finally apply our Bayesian Levy-dynamic model and methods to a sea surface temperature dataset consisting of 139,300 data points in space and time. Although not big data in the true sense, this is a large and highly structured data by any standard. Even for this large and complex data, our parallel MCMC algorithm, implemented on 80 processors, generated 110,000 MCMC realizations from the Levy-dynamic posterior within a single day, and the resultant Bayesian posterior predictive analysis turned out to be encouraging. Thus, it is not unreasonable to expect that with significantly more computing resources, it is feasible to analyse terabytes of spatio-temporal data with our new model and methods.", "authors": "Bhattacharya"}, "https://arxiv.org/abs/2303.04258": {"title": "Extremes in High Dimensions: Methods and Scalable Algorithms", "link": "https://arxiv.org/abs/2303.04258", "description": "Extreme value theory for univariate and low-dimensional observations has been explored in considerable detail, but the field is still in an early stage regarding high-dimensional settings. This paper focuses on H\\\"usler-Reiss models, a popular class of models for multivariate extremes similar to multivariate Gaussian distributions, and their domain of attraction. We develop estimators for the model parameters based on score matching, and we equip these estimators with theories and exceptionally scalable algorithms. Simulations and applications to weather extremes demonstrate the fact that the estimators can estimate a large number of parameters reliably and fast; for example, we show that H\\\"usler-Reiss models with thousands of parameters can be fitted within a couple of minutes on a standard laptop. More generally speaking, our work relates extreme value theory to modern concepts of high-dimensional statistics and convex optimization.", "authors": "Lederer, Oesting"}, "https://arxiv.org/abs/2304.06960": {"title": "Estimating Conditional Average Treatment Effects with Heteroscedasticity by Model Averaging and Matching", "link": "https://arxiv.org/abs/2304.06960", "description": "We propose a model averaging approach, combined with a partition and matching method to estimate the conditional average treatment effects under heteroskedastic error settings. The proposed approach has asymptotic optimality and consistency of weights and estimator. Numerical studies show that our method has good finite-sample performances.", "authors": "Shi, Zhang, Zhong"}, "https://arxiv.org/abs/2310.00356": {"title": "Functional conditional volatility modeling with missing data: inference and application to energy commodities", "link": "https://arxiv.org/abs/2310.00356", "description": "This paper explores the nonparametric estimation of the volatility component in a heteroscedastic scalar-on-function regression model, where the underlying discrete-time process is ergodic and subject to a missing-at-random mechanism. We first propose a simplified estimator for the regression and volatility operators, constructed solely from the observed data. 
The asymptotic properties of these estimators, including the almost sure uniform consistency rate and asymptotic distribution, are rigorously analyzed. Subsequently, the simplified estimators are employed to impute the missing data in the original process, enhancing the estimation of the regression and volatility components. The asymptotic behavior of these imputed estimators is also thoroughly investigated. A numerical comparison of the simplified and imputed estimators is presented using simulated data. Finally, the methodology is applied to real-world data to model the volatility of daily natural gas returns, utilizing intraday EU/USD exchange rate return curves sampled at a 1-hour frequency.", "authors": "Djeniah, Chaouch, Bouchentouf"}, "https://arxiv.org/abs/2311.14892": {"title": "An Identification and Dimensionality Robust Test for Instrumental Variables Models", "link": "https://arxiv.org/abs/2311.14892", "description": "Using modifications of Lindeberg's interpolation technique, I propose a new identification-robust test for the structural parameter in a heteroskedastic instrumental variables model. While my analysis allows the number of instruments to be much larger than the sample size, it does not require many instruments, making my test applicable in settings that have not been well studied. Instead, the proposed test statistic has a limiting chi-squared distribution so long as an auxiliary parameter can be consistently estimated. This is possible using machine learning methods even when the number of instruments is much larger than the sample size. To improve power, a simple combination with the sup-score statistic of Belloni et al. (2012) is proposed. I point out that first-stage F-statistics calculated on LASSO selected variables may be misleading indicators of identification strength and demonstrate favorable performance of my proposed methods in both empirical data and simulation study.", "authors": "Navjeevan"}, "https://arxiv.org/abs/2312.01411": {"title": "Bayesian inference on Cox regression models using catalytic prior distributions", "link": "https://arxiv.org/abs/2312.01411", "description": "The Cox proportional hazards model (Cox model) is a popular model for survival data analysis. When the sample size is small relative to the dimension of the model, the standard maximum partial likelihood inference is often problematic. In this work, we propose the Cox catalytic prior distributions for Bayesian inference on Cox models, which is an extension of a general class of prior distributions originally designed for stabilizing complex parametric models. The Cox catalytic prior is formulated as a weighted likelihood of the regression coefficients based on synthetic data and a surrogate baseline hazard constant. This surrogate hazard can be either provided by the user or estimated from the data, and the synthetic data are generated from the predictive distribution of a fitted simpler model. For point estimation, we derive an approximation of the marginal posterior mode, which can be computed conveniently as a regularized log partial likelihood estimator. We prove that our prior distribution is proper and the resulting estimator is consistent under mild conditions. In simulation studies, our proposed method outperforms standard maximum partial likelihood inference and is on par with existing shrinkage methods. 
We further illustrate the application of our method to a real dataset.", "authors": "Li, Huang"}, "https://arxiv.org/abs/2312.01969": {"title": "FDR Control for Online Anomaly Detection", "link": "https://arxiv.org/abs/2312.01969", "description": "A new online multiple testing procedure is described in the context of anomaly detection, which controls the False Discovery Rate (FDR). An accurate anomaly detector must control the false positive rate at a prescribed level while keeping the false negative rate as low as possible. However in the online context, such a constraint remains highly challenging due to the usual lack of FDR control: the online framework makes it impossible to use classical multiple testing approaches such as the Benjamini-Hochberg (BH) procedure, which would require knowing the entire time series. The developed strategy relies on exploiting the local control of the ``modified FDR'' (mFDR) criterion. It turns out that the local control of mFDR enables global control of the FDR over the full series up to additional modifications of the multiple testing procedures. An important ingredient in this control is the cardinality of the calibration dataset used to compute the empirical p-values. A dedicated strategy for tuning this parameter is designed for achieving the prescribed FDR control over the entire time series. The good statistical performance of the full strategy is analyzed by theoretical guarantees. Its practical behavior is assessed by several simulation experiments which support our conclusions.", "authors": "Kr\\\"onert, C\\'elisse, Hattab"}, "https://arxiv.org/abs/2401.15793": {"title": "Doubly regularized generalized linear models for spatial observations with high-dimensional covariates", "link": "https://arxiv.org/abs/2401.15793", "description": "A discrete spatial lattice can be cast as a network structure over which spatially-correlated outcomes are observed. A second network structure may also capture similarities among measured features, when such information is available. Incorporating the network structures when analyzing such doubly-structured data can improve predictive power, and lead to better identification of important features in the data-generating process. Motivated by applications in spatial disease mapping, we develop a new doubly regularized regression framework to incorporate these network structures for analyzing high-dimensional datasets. Our estimators can be easily implemented with standard convex optimization algorithms. In addition, we describe a procedure to obtain asymptotically valid confidence intervals and hypothesis tests for our model parameters. We show empirically that our framework provides improved predictive accuracy and inferential power compared to existing high-dimensional spatial methods. These advantages hold given fully accurate network information, and also with networks which are partially misspecified or uninformative. The application of the proposed method to modeling COVID-19 mortality data suggests that it can improve prediction of deaths beyond standard spatial models, and that it selects relevant covariates more often.", "authors": "Sondhi, Cheng, Shojaie"}, "https://arxiv.org/abs/2412.12316": {"title": "Estimating HIV Cross-sectional Incidence Using Recency Tests from a Non-representative Sample", "link": "https://arxiv.org/abs/2412.12316", "description": "Cross-sectional incidence estimation based on recency testing has become a widely used tool in HIV research. 
Recently, this method has gained prominence in HIV prevention trials to estimate the \"placebo\" incidence that participants might experience without preventive treatment. The application of this approach faces challenges due to non-representative sampling, as individuals aware of their HIV-positive status may be less likely to participate in screening for an HIV prevention trial. To address this, a recent phase 3 trial excluded individuals based on whether they have had a recent HIV test. To the best of our knowledge, the validity of this approach has yet to be studied. In our work, we investigate the performance of cross-sectional HIV incidence estimation when excluding individuals based on prior HIV tests in realistic trial settings. We develop a statistical framework that incorporates a testing-based criterion and possible non-representative sampling. We introduce a metric we call the effective mean duration of recent infection (MDRI) that mathematically quantifies bias in incidence estimation. We conduct an extensive simulation study to evaluate incidence estimator performance under various scenarios. Our findings reveal that when screening attendance is affected by knowledge of HIV status, incidence estimators become unreliable unless all individuals with recent HIV tests are excluded. Additionally, we identified a trade-off between bias and variability: excluding more individuals reduces bias from non-representative sampling but in many cases increases the variability of incidence estimates. These findings highlight the need for caution when applying testing-based criteria and emphasize the importance of refining incidence estimation methods to improve the design and evaluation of future HIV prevention trials.", "authors": "Pan, Bannick, Gao"}, "https://arxiv.org/abs/2412.12335": {"title": "Target Aggregate Data Adjustment Method for Transportability Analysis Utilizing Summary-Level Data from the Target Population", "link": "https://arxiv.org/abs/2412.12335", "description": "Transportability analysis is a causal inference framework used to evaluate the external validity of randomized clinical trials (RCTs) or observational studies. Most existing transportability analysis methods require individual patient-level data (IPD) for both the source and the target population, narrowing its applicability when only target aggregate-level data (AgD) is available. Besides, accounting for censoring is essential to reduce bias in longitudinal data, yet AgD-based transportability methods in the presence of censoring remain underexplored. Here, we propose a two-stage weighting framework named \"Target Aggregate Data Adjustment\" (TADA) to address the mentioned challenges simultaneously. TADA is designed as a two-stage weighting scheme to simultaneously adjust for both censoring bias and distributional imbalances of effect modifiers (EM), where the final weights are the product of the inverse probability of censoring weights and participation weights derived using the method of moments. We have conducted an extensive simulation study to evaluate TADA's performance. 
Our results indicate that TADA can effectively control the bias resulting from censoring within a non-extreme range suitable for most practical scenarios, and enhance the application and clinical interpretability of transportability analyses in settings with limited data availability.", "authors": "Yan, Vuong, Metcalfe et al"}, "https://arxiv.org/abs/2412.12365": {"title": "On the Role of Surrogates in Conformal Inference of Individual Causal Effects", "link": "https://arxiv.org/abs/2412.12365", "description": "Learning the Individual Treatment Effect (ITE) is essential for personalized decision making, yet causal inference has traditionally focused on aggregated treatment effects. While integrating conformal prediction with causal inference can provide valid uncertainty quantification for ITEs, the resulting prediction intervals are often excessively wide, limiting their practical utility. To address this limitation, we introduce \\underline{S}urrogate-assisted \\underline{C}onformal \\underline{I}nference for \\underline{E}fficient I\\underline{N}dividual \\underline{C}ausal \\underline{E}ffects (SCIENCE), a framework designed to construct more efficient prediction intervals for ITEs. SCIENCE applies to various data configurations, including semi-supervised and surrogate-assisted semi-supervised learning. It accommodates covariate shifts between source data, which contain primary outcomes, and target data, which may include only surrogate outcomes or covariates. Leveraging semi-parametric efficiency theory, SCIENCE produces rate double-robust prediction intervals under mild rate convergence conditions, permitting the use of flexible non-parametric models to estimate nuisance functions. We quantify efficiency gains by comparing semi-parametric efficiency bounds with and without the incorporation of surrogates. Simulation studies demonstrate that our surrogate-assisted intervals offer substantial efficiency improvements over existing methods while maintaining valid group-conditional coverage. Applied to the phase 3 Moderna COVE COVID-19 vaccine trial, SCIENCE illustrates how multiple surrogate markers can be leveraged to generate more efficient prediction intervals.", "authors": "Gao, Gilbert, Han"}, "https://arxiv.org/abs/2412.12380": {"title": "The Estimand Framework and Causal Inference: Complementary not Competing Paradigms", "link": "https://arxiv.org/abs/2412.12380", "description": "The creation of the ICH E9 (R1) estimands framework has led to more precise specification of the treatment effects of interest in the design and statistical analysis of clinical trials. However, it is unclear how the new framework relates to causal inference, as both approaches appear to define what is being estimated and have a quantity labelled an estimand. Using illustrative examples, we show that both approaches can be used to define a population-based summary of an effect on an outcome for a specified population and highlight the similarities and differences between these approaches. We demonstrate that the ICH E9 (R1) estimand framework offers a descriptive, structured approach that is more accessible to non-mathematicians, facilitating clearer communication of trial objectives and results. We then contrast this with the causal inference framework, which provides a mathematically precise definition of an estimand, and allows the explicit articulation of assumptions through tools such as causal graphs. 
Despite these differences, the two paradigms should be viewed as complementary rather than competing. The combined use of both approaches enhances the ability to communicate what is being estimated. We encourage those familiar with one framework to appreciate the concepts of the other to strengthen the robustness and clarity of clinical trial design, analysis, and interpretation.", "authors": "Drury, Bartlett, Wright et al"}, "https://arxiv.org/abs/2412.12405": {"title": "Generalized entropy calibration for analyzing voluntary survey data", "link": "https://arxiv.org/abs/2412.12405", "description": "Statistical analysis of voluntary survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized entropy calibration is introduced as a unified tool for calibration weighting to control the selection bias. We first establish the relationship between the generalized calibration weighting and its dual expression for regression estimation. The dual relationship is critical in identifying the implied regression model and developing model selection for calibration weighting. Also, if a linear regression model for an important study variable is available, then two-step calibration method can be used to smooth the final weights and achieve the statistical efficiency. Asymptotic properties of the proposed estimator are investigated. Results from a limited simulation study are also presented.", "authors": "Kwon, Kim, Qiu"}, "https://arxiv.org/abs/2412.12407": {"title": "Inside-out cross-covariance for spatial multivariate data", "link": "https://arxiv.org/abs/2412.12407", "description": "As the spatial features of multivariate data are increasingly central in researchers' applied problems, there is a growing demand for novel spatially-aware methods that are flexible, easily interpretable, and scalable to large data. We develop inside-out cross-covariance (IOX) models for multivariate spatial likelihood-based inference. IOX leads to valid cross-covariance matrix functions which we interpret as inducing spatial dependence on independent replicates of a correlated random vector. The resulting sample cross-covariance matrices are \"inside-out\" relative to the ubiquitous linear model of coregionalization (LMC). However, unlike LMCs, our methods offer direct marginal inference, easy prior elicitation of covariance parameters, the ability to model outcomes with unequal smoothness, and flexible dimension reduction. As a covariance model for a q-variate Gaussian process, IOX leads to scalable models for noisy vector data as well as flexible latent models. For large n cases, IOX complements Vecchia approximations and related process-based methods based on sparse graphical models. We demonstrate superior performance of IOX on synthetic datasets as well as on colorectal cancer proteomics data. 
An R package implementing the proposed methods is available at github.com/mkln/spiox.", "authors": "Peruzzi"}, "https://arxiv.org/abs/2412.12868": {"title": "Bayesian nonparametric partial clustering: Quantifying the effectiveness of agricultural subsidies across Europe", "link": "https://arxiv.org/abs/2412.12868", "description": "The global climate has underscored the need for effective policies to reduce greenhouse gas emissions from all sources, including those resulting from agricultural expansion, which is regulated by the Common Agricultural Policy (CAP) across the European Union (EU). To assess the effectiveness of these mitigation policies, statistical methods must account for the heterogeneous impact of policies across different countries. We propose a Bayesian approach that combines the multinomial logit model, which is suitable for compositional land-use data, with a Bayesian nonparametric (BNP) prior to cluster regions with similar policy impacts. To simultaneously control for other relevant factors, we distinguish between cluster-specific and global covariates, coining this approach the Bayesian nonparametric partial clustering model. We develop a novel and efficient Markov Chain Monte Carlo (MCMC) algorithm, leveraging recent advances in the Bayesian literature. Using economic, geographic, and subsidy-related data from 22 EU member states, we examine the effectiveness of policies influencing land-use decisions across Europe and highlight the diversity of the problem. Our results indicate that the impact of CAP varies widely across the EU, emphasizing the need for subsidies to be tailored to optimize their effectiveness.", "authors": "Mozdzen, Addo, Krisztin et al"}, "https://arxiv.org/abs/2412.12945": {"title": "Meta-analysis models relaxing the random effects normality assumption: methodological systematic review and simulation study", "link": "https://arxiv.org/abs/2412.12945", "description": "Random effects meta-analysis is widely used for synthesizing studies under the assumption that underlying effects come from a normal distribution. However, under certain conditions the use of alternative distributions might be more appropriate. We conducted a systematic review to identify articles introducing alternative meta-analysis models assuming non-normal between-study distributions. We identified 27 eligible articles suggesting 24 alternative meta-analysis models based on long-tail and skewed distributions, on mixtures of distributions, and on Dirichlet process priors. Subsequently, we performed a simulation study to evaluate the performance of these models and to compare them with the standard normal model. We considered 22 scenarios varying the amount of between-study variance, the shape of the true distribution, and the number of included studies. We compared 15 models implemented in the Frequentist or in the Bayesian framework. We found small differences with respect to bias between the different models but larger differences in the level of coverage probability. In scenarios with large between-study variance, all models were substantially biased in the estimation of the mean treatment effect. 
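For orientation on the meta-analysis review above, this sketch computes the standard normal random-effects (DerSimonian-Laird) pooled estimate that the 24 alternative models are benchmarked against; the long-tailed, skewed, mixture, and Dirichlet-process alternatives require purpose-built likelihoods or priors and are not attempted here. The study effects and variances are invented for illustration.

```python
# DerSimonian-Laird estimate under the standard normal random-effects model,
# i.e. the comparator against which the alternative distributions are judged.
import numpy as np

yi = np.array([0.30, 0.10, 0.45, -0.05, 0.60, 0.20])   # study effect estimates (illustrative)
vi = np.array([0.04, 0.06, 0.05, 0.03, 0.10, 0.07])    # within-study variances (illustrative)

w_fixed = 1.0 / vi
mu_fixed = np.sum(w_fixed * yi) / np.sum(w_fixed)
Q = np.sum(w_fixed * (yi - mu_fixed) ** 2)              # Cochran's Q
k = len(yi)
c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (Q - (k - 1)) / c)                      # between-study variance

w_re = 1.0 / (vi + tau2)
mu_re = np.sum(w_re * yi) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"tau^2 = {tau2:.3f}, pooled effect = {mu_re:.3f} (SE {se_re:.3f})")
```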
This implies that focusing only on the mean treatment effect of random effects meta-analysis can be misleading when substantial heterogeneity is suspected or outliers are present.", "authors": "Panagiotopoulou, Evrenoglou, Schmid et al"}, "https://arxiv.org/abs/2412.12967": {"title": "Neural Posterior Estimation for Stochastic Epidemic Modeling", "link": "https://arxiv.org/abs/2412.12967", "description": "Stochastic infectious disease models capture uncertainty in public health outcomes and have become increasingly popular in epidemiological practice. However, calibrating these models to observed data is challenging with existing methods for parameter estimation. Stochastic epidemic models are nonlinear dynamical systems with potentially large latent state spaces, resulting in computationally intractable likelihood densities. We develop an approach to calibrating complex epidemiological models to high-dimensional data using Neural Posterior Estimation, a novel technique for simulation-based inference. In NPE, a neural conditional density estimator trained on simulated data learns to \"invert\" a stochastic simulator, returning a parametric approximation to the posterior distribution. We introduce a stochastic, discrete-time Susceptible Infected (SI) model with heterogeneous transmission for healthcare-associated infections (HAIs). HAIs are a major burden on healthcare systems. They exhibit high rates of asymptomatic carriage, making it difficult to estimate infection rates. Through extensive simulation experiments, we show that NPE produces accurate posterior estimates of infection rates with greater sample efficiency compared to Approximate Bayesian Computation (ABC). We then use NPE to fit our SI model to an outbreak of carbapenem-resistant Klebsiella pneumoniae in a long-term acute care facility, finding evidence of location-based heterogeneity in patient-to-patient transmission risk. We argue that our methodology can be fruitfully applied to a wide range of mechanistic transmission models and problems in the epidemiology of infectious disease.", "authors": "Chatha, Bu, Regier et al"}, "https://arxiv.org/abs/2412.13076": {"title": "Dual Interpretation of Machine Learning Forecasts", "link": "https://arxiv.org/abs/2412.13076", "description": "Machine learning predictions are typically interpreted as the sum of contributions of predictors. Yet, each out-of-sample prediction can also be expressed as a linear combination of in-sample values of the predicted variable, with weights corresponding to pairwise proximity scores between current and past economic events. While this dual route leads nowhere in some contexts (e.g., large cross-sectional datasets), it provides sparser interpretations in settings with many regressors and little training data, like macroeconomic forecasting. In this case, the sequence of contributions can be visualized as a time series, allowing analysts to explain predictions as quantifiable combinations of historical analogies. Moreover, the weights can be viewed as those of a data portfolio, inspiring new diagnostic measures such as forecast concentration, short position, and turnover. We show how weights can be retrieved seamlessly for (kernel) ridge regression, random forest, boosted trees, and neural networks. Then, we apply these tools to analyze post-pandemic forecasts of inflation, GDP growth, and recession probabilities. 
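The dual reading described in the machine-learning-forecasts entry above is easiest to see for kernel ridge regression, where an out-of-sample forecast is literally a weighted combination of in-sample outcomes with weights k(x)'(K + λI)⁻¹. The sketch below retrieves those weights and a rough concentration/short-position diagnostic on simulated data; the kernel, bandwidth, and penalty are arbitrary choices, and the paper's treatment of random forests, boosting, and neural networks is not covered.

```python
# Dual view of kernel ridge regression: yhat(x) = w(x)' y_train with
# w(x) = (K + lam I)^{-1} k(x, X_train).  The weights act like a data portfolio.
import numpy as np

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
T, p = 120, 8                                   # short macro-style sample, many regressors
X = rng.normal(size=(T, p))
y = X[:, 0] + 0.5 * np.sin(X[:, 1]) + rng.normal(scale=0.3, size=T)
x_new = rng.normal(size=(1, p))

lam = 1.0
K = rbf(X, X)
w = np.linalg.solve(K + lam * np.eye(T), rbf(X, x_new)).ravel()   # T in-sample weights
y_hat = float(w @ y)                                              # forecast as weighted history

wn = w / w.sum()
concentration = float((wn ** 2).sum())          # Herfindahl-style forecast concentration
short_position = float(-w[w < 0].sum())         # mass placed on "shorted" observations
print(y_hat, concentration, short_position)
```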
In all cases, the approach opens the black box from a new angle and demonstrates how machine learning models leverage history partly repeating itself.", "authors": "Coulombe, Goebel, Klieber"}, "https://arxiv.org/abs/2412.13122": {"title": "Fr\\'echet Sufficient Dimension Reduction for Metric Space-Valued Data via Distance Covariance", "link": "https://arxiv.org/abs/2412.13122", "description": "We propose a novel Fr\\'echet sufficient dimension reduction (SDR) method based on kernel distance covariance, tailored for metric space-valued responses such as count data, probability densities, and other complex structures. The method leverages a kernel-based transformation to map metric space-valued responses into a feature space, enabling efficient dimension reduction. By incorporating kernel distance covariance, the proposed approach offers enhanced flexibility and adaptability for datasets with diverse and non-Euclidean characteristics. The effectiveness of the method is demonstrated through synthetic simulations and several real-world applications. In all cases, the proposed method runs faster and consistently outperforms the existing Fr\\'echet SDR approaches, demonstrating its broad applicability and robustness in addressing complex data challenges.", "authors": "Huang, Yu, Li et al"}, "https://arxiv.org/abs/2405.11284": {"title": "The Logic of Counterfactuals and the Epistemology of Causal Inference", "link": "https://arxiv.org/abs/2405.11284", "description": "The 2021 Nobel Prize in Economics recognizes a type of causal model known as the Rubin causal model, or potential outcome framework, which deserves far more attention from philosophers than it currently receives. To spark philosophers' interest, I develop a dialectic connecting the Rubin causal model to the Lewis-Stalnaker debate on a logical principle of counterfactuals: Conditional Excluded Middle (CEM). I begin by playing good cop for CEM, developing a new argument in its favor -- a Quine-Putnam-style indispensability argument. This argument is based on the observation that CEM seems to be indispensable to the Rubin causal model, which underpins our best scientific theory of causal inference in health and social sciences -- a Nobel Prize-winning theory. Indeed, CEM has long remained a core assumption of the Rubin causal model, despite challenges from within the statistics and economics communities over twenty years ago. I then switch sides to play bad cop for CEM, undermining the indispensability argument by developing a new theory of causal inference that dispenses with CEM while preserving the successes of the original theory (thanks to a new theorem proved here). The key, somewhat surprisingly, is to integrate two approaches to causal modeling: the Rubin causal model, more familiar in health and social sciences, and the causal Bayes net, more familiar in philosophy. 
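Since the Fréchet SDR entry above builds on kernel distance covariance, here is the plain (non-kernel) sample distance covariance as a reference implementation; the kernel mapping of metric-space responses and the dimension-reduction step themselves are not included, and the toy data are an assumption.

```python
# Squared sample distance covariance (Szekely-Rizzo V-statistic): double-centre
# the pairwise distance matrices of X and Y, then average their product.
import numpy as np
from scipy.spatial.distance import cdist

def dcov2(X, Y):
    a = cdist(X, X)
    b = cdist(Y, Y)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 4))
Y = np.column_stack([np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)])  # depends only on X[:, 0]
print(dcov2(X[:, [0]], Y), dcov2(X[:, [3]], Y))  # the first should clearly dominate
```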
The good cop/bad cop dialectic is concluded with a connection to broader philosophical issues, including intertheory relations, the revisability of logic, and the role of background assumptions in justifying scientific inference.", "authors": "Lin"}, "https://arxiv.org/abs/2412.12114": {"title": "Unlocking new capabilities in the analysis of GC$\\times$GC-TOFMS data with shift-invariant multi-linearity", "link": "https://arxiv.org/abs/2412.12114", "description": "This paper introduces a novel deconvolution algorithm, shift-invariant multi-linearity (SIML), which significantly enhances the analysis of data from a comprehensive two-dimensional gas chromatograph coupled to a mass spectrometric detector (GC$\\times$GC-TOFMS). Designed to address the challenges posed by retention time shifts and high noise levels, SIML incorporates wavelet-based smoothing and Fourier-Transform based shift-correction within the multivariate curve resolution-alternating least squares (MCR-ALS) framework. We benchmarked the SIML algorithm against traditional methods such as MCR-ALS and Parallel Factor Analysis 2 with flexible coupling (PARAFAC2$\\times$N) using both simulated and real GC$\\times$GC-TOFMS datasets. Our results demonstrate that SIML provides unique solutions with significantly improved robustness, particularly in low signal-to-noise ratio scenarios, where it maintains high accuracy in estimating mass spectra and concentrations. The enhanced reliability of quantitative analyses afforded by SIML underscores its potential for broad application in complex matrix analyses across environmental science, food chemistry, and biological research.", "authors": "Schneide, Armstrong, Gallagher et al"}, "https://arxiv.org/abs/2412.12198": {"title": "Pop-out vs. Glue: A Study on the pre-attentive and focused attention stages in Visual Search tasks", "link": "https://arxiv.org/abs/2412.12198", "description": "This study explores visual search asymmetry and the detection process between parallel and serial search strategies, building upon Treisman's Feature Integration Theory [3]. Our experiment examines how easy it is to locate an oblique line among vertical distractors versus a vertical line among oblique distractors, a framework previously validated by Treisman & Gormican (1988) [4] and Gupta et al. (2015) [1]. We hypothesised that an oblique target among vertical lines would produce a perceptual 'pop-out' effect, allowing for faster, parallel search, while the reverse condition would require serial search strategy. Seventy-eight participants from Utrecht University engaged in trials with varied target-distractor orientations and number of items. We measured reaction times and found a significant effect of target type on search speed: oblique targets were identified more quickly, reflecting 'pop-out' behaviour, while vertical targets demanded focused attention ('glue phase'). Our results align with past findings, supporting our hypothesis on search asymmetry and its dependency on distinct visual features. 
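The Fourier-transform shift correction mentioned in the SIML entry above can be illustrated in isolation: estimate an integer retention-time shift by FFT cross-correlation and roll the trace back into alignment. The wavelet smoothing and the MCR-ALS loop of the actual algorithm are not reproduced, and the Gaussian peak below is a made-up stand-in for a chromatographic trace.

```python
# Estimate an integer retention-time shift between two traces via FFT
# cross-correlation, then roll the shifted trace back into alignment.
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(1000)
reference = np.exp(-0.5 * ((t - 400) / 15.0) ** 2)                  # reference elution profile
shifted = np.roll(reference, 23) + 0.02 * rng.normal(size=t.size)   # shifted, noisy copy

xcorr = np.fft.ifft(np.fft.fft(shifted) * np.conj(np.fft.fft(reference))).real
est_shift = int(np.argmax(xcorr))
est_shift = est_shift - t.size if est_shift > t.size // 2 else est_shift  # wrap to signed shift
aligned = np.roll(shifted, -est_shift)
print("estimated shift:", est_shift)   # ~23 samples
```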
Future research could benefit from eye-tracking and neural network analysis, particularly for identifying the neural processing of visual features in both parallel and serial search conditions.", "authors": "Beukelman, Rodrigues"}, "https://arxiv.org/abs/2412.12807": {"title": "Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification", "link": "https://arxiv.org/abs/2412.12807", "description": "Selective classification frameworks are useful tools for automated decision making in highly risky scenarios, since they allow for a classifier to only make highly confident decisions, while abstaining from making a decision when it is not confident enough to do so, which is otherwise known as an indecision. For a given level of classification accuracy, we aim to make as many decisions as possible. For many problems, this can be achieved without abstaining from making decisions. But when the problem is hard enough, we show that we can still control the misclassification rate of a classifier up to any user specified level, while only abstaining from the minimum necessary amount of decisions, even if this level of misclassification is smaller than the Bayes optimal error rate. In many problem settings, the user could obtain a dramatic decrease in misclassification while only paying a comparatively small price in terms of indecisions.", "authors": "Ndaoud, Radchenko, Rava"}, "https://arxiv.org/abs/2412.13034": {"title": "Unified calibration and spatial mapping of fine particulate matter data from multiple low-cost air pollution sensor networks in Baltimore, Maryland", "link": "https://arxiv.org/abs/2412.13034", "description": "Low-cost air pollution sensor networks are increasingly being deployed globally, supplementing sparse regulatory monitoring with localized air quality data. In some areas, like Baltimore, Maryland, there are only few regulatory (reference) devices but multiple low-cost networks. While there are many available methods to calibrate data from each network individually, separate calibration of each network leads to conflicting air quality predictions. We develop a general Bayesian spatial filtering model combining data from multiple networks and reference devices, providing dynamic calibrations (informed by the latest reference data) and unified predictions (combining information from all available sensors) for the entire region. This method accounts for network-specific bias and noise (observation models), as different networks can use different types of sensors, and uses a Gaussian process (state-space model) to capture spatial correlations. We apply the method to calibrate PM$_{2.5}$ data from Baltimore in June and July 2023 -- a period including days of hazardous concentrations due to wildfire smoke. Our method helps mitigate the effects of preferential sampling of one network in Baltimore, results in better predictions and narrower confidence intervals. Our approach can be used to calibrate low-cost air pollution sensor data in Baltimore and any other areas with multiple low-cost networks.", "authors": "Heffernan, Koehler, Gentner et al"}, "https://arxiv.org/abs/2105.00879": {"title": "Identification and Estimation of Average Causal Effects in Fixed Effects Logit Models", "link": "https://arxiv.org/abs/2105.00879", "description": "This paper studies identification and estimation of average causal effects, such as average marginal or treatment effects, in fixed effects logit models with short panels. 
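As a bare-bones illustration of the indecision framework above, the sketch below abstains whenever the predicted class probability falls under a fixed, hand-picked threshold and reports the error rate among the decisions actually made; the paper's contribution, choosing the abstention rule so a user-specified misclassification level is met with as few indecisions as possible, is not implemented here.

```python
# Reject-option classification: abstain when the top class probability is below
# a confidence threshold, then compare error among decisions vs. no abstention.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, flip_y=0.15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte).max(axis=1)
pred = clf.predict(Xte)

tau = 0.75                                   # hand-picked confidence threshold (illustrative)
decided = proba >= tau
error_if_decided = (pred[decided] != yte[decided]).mean()
print(f"decided on {decided.mean():.1%} of cases, "
      f"error among decisions {error_if_decided:.1%}, "
      f"error with no abstention {(pred != yte).mean():.1%}")
```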
Relating the identified set of these effects to an extremal moment problem, we first show how to obtain sharp bounds on such effects simply, without any optimization. We also consider even simpler outer bounds, which, contrary to the sharp bounds, do not require any first-step nonparametric estimators. We build confidence intervals based on these two approaches and show their asymptotic validity. Monte Carlo simulations suggest that both approaches work well in practice, the second being typically competitive in terms of interval length. Finally, we show that our method is also useful to measure treatment effect heterogeneity.", "authors": "Davezies, D'Haultf{\\oe}uille, Laage"}, "https://arxiv.org/abs/2202.02029": {"title": "First-order integer-valued autoregressive processes with Generalized Katz innovations", "link": "https://arxiv.org/abs/2202.02029", "description": "A new integer--valued autoregressive process (INAR) with Generalised Lagrangian Katz (GLK) innovations is defined. This process family provides a flexible modelling framework for count data, allowing for under and over--dispersion, asymmetry, and excess of kurtosis and includes standard INAR models such as Generalized Poisson and Negative Binomial as special cases. We show that the GLK--INAR process is discrete semi--self--decomposable, infinite divisible, stable by aggregation and provides stationarity conditions. Some extensions are discussed, such as the Markov--Switching and the zero--inflated GLK--INARs. A Bayesian inference framework and an efficient posterior approximation procedure are introduced. The proposed models are applied to 130 time series from Google Trend, which proxy the worldwide public concern about climate change. New evidence is found of heterogeneity across time, countries and keywords in the persistence, uncertainty, and long--run public awareness level.", "authors": "Lopez, Bassetti, Carallo et al"}, "https://arxiv.org/abs/2311.00013": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "https://arxiv.org/abs/2311.00013", "description": "We propose two approaches to estimate semiparametric discrete choice models for bundles. Our first approach is a kernel-weighted rank estimator based on a matching-based identification strategy. We establish its complete asymptotic properties and prove the validity of the nonparametric bootstrap for inference. We then introduce a new multi-index least absolute deviations (LAD) estimator as an alternative, of which the main advantage is its capacity to estimate preference parameters on both alternative- and agent-specific regressors. Both methods can account for arbitrary correlation in disturbances across choices, with the former also allowing for interpersonal heteroskedasticity. We also demonstrate that the identification strategy underlying these procedures can be extended naturally to panel data settings, producing an analogous localized maximum score estimator and a LAD estimator for estimating bundle choice models with fixed effects. We derive the limiting distribution of the former and verify the validity of the numerical bootstrap as an inference tool. All our proposed methods can be applied to general multi-index models. 
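To see the INAR(1) mechanics behind the GLK-INAR entry above, here is a quick simulation with binomial thinning; Poisson innovations are used as a simple stand-in for the Generalised Lagrangian Katz family, which nests the Poisson and negative binomial cases.

```python
# INAR(1) via binomial thinning: X_t = alpha o X_{t-1} + eps_t, where the
# thinning operator keeps each past count with probability alpha.  Poisson
# innovations stand in for the GLK innovations of the paper above.
import numpy as np

rng = np.random.default_rng(5)
T, alpha, lam = 500, 0.6, 2.0
x = np.zeros(T, dtype=int)
for t in range(1, T):
    survivors = rng.binomial(x[t - 1], alpha)   # alpha o X_{t-1}
    x[t] = survivors + rng.poisson(lam)         # new arrivals

print("sample mean:", x.mean(), "theory:", lam / (1 - alpha))   # ~ 5
print("lag-1 autocorrelation:", np.corrcoef(x[:-1], x[1:])[0, 1])  # ~ alpha
```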
Monte Carlo experiments show that they perform well in finite samples.", "authors": "Ouyang, Yang"}, "https://arxiv.org/abs/2401.13379": {"title": "An Ising Similarity Regression Model for Modeling Multivariate Binary Data", "link": "https://arxiv.org/abs/2401.13379", "description": "Understanding the dependence structure between response variables is an important component in the analysis of correlated multivariate data. This article focuses on modeling dependence structures in multivariate binary data, motivated by a study aiming to understand how patterns in different U.S. senators' votes are determined by similarities (or lack thereof) in their attributes, e.g., political parties and social network profiles. To address such a research question, we propose a new Ising similarity regression model which regresses pairwise interaction coefficients in the Ising model against a set of similarity measures available/constructed from covariates. Model selection approaches are further developed through regularizing the pseudo-likelihood function with an adaptive lasso penalty to enable the selection of relevant similarity measures. We establish estimation and selection consistency of the proposed estimator under a general setting where the number of similarity measures and responses tend to infinity. Simulation study demonstrates the strong finite sample performance of the proposed estimator, particularly compared with several existing Ising model estimators in estimating the matrix of pairwise interaction coefficients. Applying the Ising similarity regression model to a dataset of roll call voting records of 100 U.S. senators, we are able to quantify how similarities in senators' parties, businessman occupations and social network profiles drive their voting associations.", "authors": "Tho, Hui, Zou"}}
\ No newline at end of file
+{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. In simulation studies, our joint testing procedures outperform the\nBonferroni correction. 
We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. 
We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2310.04576": {"title": "Finite Sample Performance of a Conduct Parameter Test in Homogenous Goods Markets", "link": "http://arxiv.org/abs/2310.04576", "description": "We assess the finite sample performance of the conduct parameter test in\nhomogeneous goods markets. Statistical power rises with an increase in the\nnumber of markets, a larger conduct parameter, and a stronger demand rotation\ninstrument. However, even with a moderate number of markets and five firms,\nregardless of instrument strength and the utilization of optimal instruments,\nrejecting the null hypothesis of perfect competition remains challenging. Our\nfindings indicate that empirical results that fail to reject perfect\ncompetition are a consequence of the limited number of markets rather than\nmethodological deficiencies."}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.05311": {"title": "Identification and Estimation in a Class of Potential Outcomes Models", "link": "http://arxiv.org/abs/2310.05311", "description": "This paper develops a class of potential outcomes models characterized by\nthree main features: (i) Unobserved heterogeneity can be represented by a\nvector of potential outcomes and a type describing the manner in which an\ninstrument determines the choice of treatment; (ii) The availability of an\ninstrumental variable that is conditionally independent of unobserved\nheterogeneity; and (iii) The imposition of convex restrictions on the\ndistribution of unobserved heterogeneity. 
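A simplified reading of the Pigeonhole Design entry above: discretize the covariates into pigeonholes and, as each subject arrives, assign it to whichever arm is currently under-represented within its pigeonhole, breaking ties at random. The equal-width partition below is an assumption for illustration; the paper's partitioning scheme and its analysis are richer than this.

```python
# Sequential assignment that balances treated/control counts within each
# "pigeonhole" (a cell of a coarse covariate partition); ties broken at random.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(6)
counts = defaultdict(lambda: [0, 0])                     # pigeonhole -> [control, treated]

def assign(covariates, n_bins=4):
    hole = tuple(np.digitize(covariates, np.linspace(-2, 2, n_bins - 1)))
    c, t = counts[hole]
    arm = rng.integers(2) if c == t else int(c > t)      # fill the smaller arm
    counts[hole][arm] += 1
    return arm                                           # 0 = control, 1 = treated

arms = [assign(rng.normal(size=2)) for _ in range(1000)]
imbalance = [abs(c - t) for c, t in counts.values()]
print("max within-pigeonhole imbalance:", max(imbalance))   # at most 1 by construction
```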
The proposed class of models\nencompasses multiple classical and novel research designs, yet possesses a\ncommon structure that permits a unifying analysis of identification and\nestimation. In particular, we establish that these models share a common\nnecessary and sufficient condition for identifying certain causal parameters.\nOur identification results are constructive in that they yield estimating\nmoment conditions for the parameters of interest. Focusing on a leading special\ncase of our framework, we further show how these estimating moment conditions\nmay be modified to be doubly robust. The corresponding double robust estimators\nare shown to be asymptotically normally distributed, bootstrap based inference\nis shown to be asymptotically valid, and the semi-parametric efficiency bound\nis derived for those parameters that are root-n estimable. We illustrate the\nusefulness of our results for developing, identifying, and estimating causal\nmodels through an empirical evaluation of the role of mental health as a\nmediating variable in the Moving To Opportunity experiment."}, "http://arxiv.org/abs/2310.05761": {"title": "Robust Minimum Distance Inference in Structural Models", "link": "http://arxiv.org/abs/2310.05761", "description": "This paper proposes minimum distance inference for a structural parameter of\ninterest, which is robust to the lack of identification of other structural\nnuisance parameters. Some choices of the weighting matrix lead to asymptotic\nchi-squared distributions with degrees of freedom that can be consistently\nestimated from the data, even under partial identification. In any case,\nknowledge of the level of under-identification is not required. We study the\npower of our robust test. Several examples show the wide applicability of the\nprocedure and a Monte Carlo investigates its finite sample performance. Our\nidentification-robust inference method can be applied to make inferences on\nboth calibrated (fixed) parameters and any other structural parameter of\ninterest. We illustrate the method's usefulness by applying it to a structural\nmodel on the non-neutrality of monetary policy, as in \\cite{nakamura2018high},\nwhere we empirically evaluate the validity of the calibrated parameters and we\ncarry out robust inference on the slope of the Phillips curve and the\ninformation effect."}, "http://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "http://arxiv.org/abs/2302.13066", "description": "Different proxy variables used in fiscal policy SVARs lead to contradicting\nconclusions regarding the size of fiscal multipliers. In this paper, we show\nthat the conflicting results are due to violations of the exogeneity\nassumptions, i.e. the commonly used proxies are endogenously related to the\nstructural shocks. We propose a novel approach to include proxy variables into\na Bayesian non-Gaussian SVAR, tailored to accommodate potentially endogenous\nproxy variables. Using our model, we show that increasing government spending\nis a more effective tool to stimulate the economy than reducing taxes. 
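The doubly robust moment conditions mentioned in the potential-outcomes entry above can be illustrated with the textbook AIPW estimator of an average treatment effect, which remains consistent if either the propensity or the outcome model is correct; the instrument-based setup and the paper's specific moments are not reproduced, and the data-generating process below is an assumption.

```python
# Textbook AIPW / doubly-robust ATE: combine outcome regressions with inverse
# propensity weighting; consistent if either nuisance model is well specified.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
n = 5000
X = rng.normal(size=(n, 3))
e = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.25 * X[:, 1])))       # true propensity
T = rng.binomial(1, e)
Y = 1.0 * T + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)    # true ATE = 1

ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
m0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

psi = m1 - m0 + T * (Y - m1) / ps - (1 - T) * (Y - m0) / (1 - ps)
print(f"AIPW ATE: {psi.mean():.3f} (SE {psi.std(ddof=1) / np.sqrt(n):.3f})")
```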
We\nconstruct new exogenous proxies that can be used in the traditional proxy VAR\napproach resulting in similar estimates compared to our proposed hybrid SVAR\nmodel."}, "http://arxiv.org/abs/2303.01863": {"title": "Constructing High Frequency Economic Indicators by Imputation", "link": "http://arxiv.org/abs/2303.01863", "description": "Monthly and weekly economic indicators are often taken to be the largest\ncommon factor estimated from high and low frequency data, either separately or\njointly. To incorporate mixed frequency information without directly modeling\nthem, we target a low frequency diffusion index that is already available, and\ntreat high frequency values as missing. We impute these values using multiple\nfactors estimated from the high frequency data. In the empirical examples\nconsidered, static matrix completion that does not account for serial\ncorrelation in the idiosyncratic errors yields imprecise estimates of the\nmissing values irrespective of how the factors are estimated. Single equation\nand systems-based dynamic procedures that account for serial correlation yield\nimputed values that are closer to the observed low frequency ones. This is the\ncase in the counterfactual exercise that imputes the monthly values of consumer\nsentiment series before 1978 when the data was released only on a quarterly\nbasis. This is also the case for a weekly version of the CFNAI index of\neconomic activity that is imputed using seasonally unadjusted data. The imputed\nseries reveals episodes of increased variability of weekly economic information\nthat are masked by the monthly data, notably around the 2014-15 collapse in oil\nprices."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2009.01995": {"title": "Instrument Validity for Heterogeneous Causal Effects", "link": "http://arxiv.org/abs/2009.01995", "description": "This paper provides a general framework for testing instrument validity in\nheterogeneous causal effect models. The generalization includes the cases where\nthe treatment can be multivalued ordered or unordered. Based on a series of\ntestable implications, we propose a nonparametric test which is proved to be\nasymptotically size controlled and consistent. Compared to the tests in the\nliterature, our test can be applied in more general settings and may achieve\npower improvement. Refutation of instrument validity by the test helps detect\ninvalid instruments that may yield implausible results on causal effects.\nEvidence that the test performs well on finite samples is provided via\nsimulations. 
We revisit the empirical study on return to schooling to\ndemonstrate application of the proposed test in practice. An extended\ncontinuous mapping theorem and an extended delta method, which may be of\nindependent interest, are provided to establish the asymptotic distribution of\nthe test statistic under null."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "http://arxiv.org/abs/2205.02274", "description": "Marketplace companies rely heavily on experimentation when making changes to\nthe design or operation of their platforms. The workhorse of experimentation is\nthe randomized controlled trial (RCT), or A/B test, in which users are randomly\nassigned to treatment or control groups. However, marketplace interference\ncauses the Stable Unit Treatment Value Assumption (SUTVA) to be violated,\nleading to bias in the standard RCT metric. In this work, we propose techniques\nfor platforms to run standard RCTs and still obtain meaningful estimates\ndespite the presence of marketplace interference. We specifically consider a\ngeneralized matching setting, in which the platform explicitly matches supply\nwith demand via a linear programming algorithm. Our first proposal is for the\nplatform to estimate the value of global treatment and global control via\noptimization. We prove that this approach is unbiased in the fluid limit. Our\nsecond proposal is to compare the average shadow price of the treatment and\ncontrol groups rather than the total value accrued by each group. We prove that\nthis technique corresponds to the correct first-order approximation (in a\nTaylor series sense) of the value function of interest even in a finite-size\nsystem. We then use this result to prove that, under reasonable assumptions,\nour estimator is less biased than the RCT estimator. At the heart of our result\nis the idea that it is relatively easy to model interference in matching-driven\nmarketplaces since, in such markets, the platform intermediates the spillover."}, "http://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "http://arxiv.org/abs/2208.09638", "description": "What is the purpose of pre-analysis plans, and how should they be designed?\nWe propose a principal-agent model where a decision-maker relies on selective\nbut truthful reports by an analyst. The analyst has data access, and\nnon-aligned objectives. 
In this model, the implementation of statistical\ndecision rules (tests, estimators) requires an incentive-compatible mechanism.\nWe first characterize which decision rules can be implemented. We then\ncharacterize optimal statistical decision rules subject to implementability. We\nshow that implementation requires pre-analysis plans. Focussing specifically on\nhypothesis tests, we show that optimal rejection rules pre-register a valid\ntest for the case when all data is reported, and make worst-case assumptions\nabout unreported data. Optimal tests can be found as a solution to a\nlinear-programming problem."}, "http://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "http://arxiv.org/abs/2302.11505", "description": "This paper studies settings where the analyst is interested in identifying\nand estimating the average causal effect of a binary treatment on an outcome.\nWe consider a setup in which the outcome realization does not get immediately\nrealized after the treatment assignment, a feature that is ubiquitous in\nempirical settings. The period between the treatment and the realization of the\noutcome allows other observed actions to occur and affect the outcome. In this\ncontext, we study several regression-based estimands routinely used in\nempirical work to capture the average treatment effect and shed light on\ninterpreting them in terms of ceteris paribus effects, indirect causal effects,\nand selection terms. We obtain three main and related takeaways. First, the\nthree most popular estimands do not generally satisfy what we call \\emph{strong\nsign preservation}, in the sense that these estimands may be negative even when\nthe treatment positively affects the outcome conditional on any possible\ncombination of other actions. Second, the most popular regression that includes\nthe other actions as controls satisfies strong sign preservation \\emph{if and\nonly if} these actions are mutually exclusive binary variables. Finally, we\nshow that a linear regression that fully stratifies the other actions leads to\nestimands that satisfy strong sign preservation."}, "http://arxiv.org/abs/2302.13455": {"title": "Nickell Bias in Panel Local Projection: Financial Crises Are Worse Than You Think", "link": "http://arxiv.org/abs/2302.13455", "description": "Local Projection is widely used for impulse response estimation, with the\nFixed Effect (FE) estimator being the default for panel data. This paper\nhighlights the presence of Nickell bias for all regressors in the FE estimator,\neven if lagged dependent variables are absent in the regression. This bias is\nthe consequence of the inherent panel predictive specification. We recommend\nusing the split-panel jackknife estimator to eliminate the asymptotic bias and\nrestore the standard statistical inference. 
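The split-panel (half-panel) jackknife recommended in the Nickell-bias entry above is simple to state: estimate on the full panel, estimate on the two time halves, and take twice the full-sample estimate minus the average of the half-sample estimates. The sketch below applies it to a plain dynamic panel regression, the classic Nickell case, rather than to the panel local projections the paper studies; the data-generating process is an assumption.

```python
# Half-panel jackknife: b_HPJ = 2*b_full - (b_half1 + b_half2)/2, which removes
# the leading O(1/T) Nickell-type bias of the fixed-effects estimator.
import numpy as np

def fe_beta(y, x):
    """Within (fixed-effects) slope for balanced panel arrays of shape (N, T)."""
    yd = y - y.mean(axis=1, keepdims=True)
    xd = x - x.mean(axis=1, keepdims=True)
    return (xd * yd).sum() / (xd ** 2).sum()

rng = np.random.default_rng(8)
N, T, rho = 200, 20, 0.7
alpha_i = rng.normal(size=(N, 1))
y = np.zeros((N, T))
for t in range(1, T):
    y[:, t] = rho * y[:, t - 1] + alpha_i[:, 0] + rng.normal(size=N)

b_full = fe_beta(y[:, 1:], y[:, :-1])            # FE estimate of rho (biased toward zero)
h = (T - 1) // 2
b_1 = fe_beta(y[:, 1:1 + h], y[:, :h])           # first time half
b_2 = fe_beta(y[:, 1 + h:], y[:, h:-1])          # second time half
b_hpj = 2 * b_full - 0.5 * (b_1 + b_2)
print(f"FE: {b_full:.3f}  half-panel jackknife: {b_hpj:.3f}  truth: {rho}")
```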
Revisiting three macro-finance\nstudies on the linkage between financial crises and economic contraction, we\nfind that the FE estimator substantially underestimates the post-crisis\neconomic losses."}, "http://arxiv.org/abs/2310.07151": {"title": "Identification and Estimation of a Semiparametric Logit Model using Network Data", "link": "http://arxiv.org/abs/2310.07151", "description": "This paper studies the identification and estimation of a semiparametric\nbinary network model in which the unobserved social characteristic is\nendogenous, that is, the unobserved individual characteristic influences both\nthe binary outcome of interest and how links are formed within the network. The\nexact functional form of the latent social characteristic is not known. The\nproposed estimators are obtained based on matching pairs of agents whose\nnetwork formation distributions are the same. The consistency and the\nasymptotic distribution of the estimators are proposed. The finite sample\nproperties of the proposed estimators in a Monte-Carlo simulation are assessed.\nWe conclude this study with an empirical application."}, "http://arxiv.org/abs/2310.07558": {"title": "Smootheness-Adaptive Dynamic Pricing with Nonparametric Demand Learning", "link": "http://arxiv.org/abs/2310.07558", "description": "We study the dynamic pricing problem where the demand function is\nnonparametric and H\\\"older smooth, and we focus on adaptivity to the unknown\nH\\\"older smoothness parameter $\\beta$ of the demand function. Traditionally the\noptimal dynamic pricing algorithm heavily relies on the knowledge of $\\beta$ to\nachieve a minimax optimal regret of\n$\\widetilde{O}(T^{\\frac{\\beta+1}{2\\beta+1}})$. However, we highlight the\nchallenge of adaptivity in this dynamic pricing problem by proving that no\npricing policy can adaptively achieve this minimax optimal regret without\nknowledge of $\\beta$. Motivated by the impossibility result, we propose a\nself-similarity condition to enable adaptivity. Importantly, we show that the\nself-similarity condition does not compromise the problem's inherent complexity\nsince it preserves the regret lower bound\n$\\Omega(T^{\\frac{\\beta+1}{2\\beta+1}})$. Furthermore, we develop a\nsmoothness-adaptive dynamic pricing algorithm and theoretically prove that the\nalgorithm achieves this minimax optimal regret bound without the prior\nknowledge $\\beta$."}, "http://arxiv.org/abs/1910.07452": {"title": "Identifying Network Ties from Panel Data: Theory and an Application to Tax Competition", "link": "http://arxiv.org/abs/1910.07452", "description": "Social interactions determine many economic behaviors, but information on\nsocial ties does not exist in most publicly available and widely used datasets.\nWe present results on the identification of social networks from observational\npanel data that contains no information on social ties between agents. In the\ncontext of a canonical social interactions model, we provide sufficient\nconditions under which the social interactions matrix, endogenous and exogenous\nsocial effect parameters are all globally identified. While this result is\nrelevant across different estimation strategies, we then describe how\nhigh-dimensional estimation techniques can be used to estimate the interactions\nmodel based on the Adaptive Elastic Net GMM method. We employ the method to\nstudy tax competition across US states. 
We find the identified social\ninteractions matrix implies tax competition differs markedly from the common\nassumption of competition between geographically neighboring states, providing\nfurther insights for the long-standing debate on the relative roles of factor\nmobility and yardstick competition in driving tax setting behavior across\nstates. Most broadly, our identification and application show the analysis of\nsocial interactions can be extended to economic realms where no network data\nexists."}, "http://arxiv.org/abs/2308.00913": {"title": "The Bayesian Context Trees State Space Model for time series modelling and forecasting", "link": "http://arxiv.org/abs/2308.00913", "description": "A hierarchical Bayesian framework is introduced for developing rich mixture\nmodels for real-valued time series, partly motivated by important applications\nin financial time series analysis. At the top level, meaningful discrete states\nare identified as appropriately quantised values of some of the most recent\nsamples. These observable states are described as a discrete context-tree\nmodel. At the bottom level, a different, arbitrary model for real-valued time\nseries -- a base model -- is associated with each state. This defines a very\ngeneral framework that can be used in conjunction with any existing model class\nto build flexible and interpretable mixture models. We call this the Bayesian\nContext Trees State Space Model, or the BCT-X framework. Efficient algorithms\nare introduced that allow for effective, exact Bayesian inference and learning\nin this setting; in particular, the maximum a posteriori probability (MAP)\ncontext-tree model can be identified. These algorithms can be updated\nsequentially, facilitating efficient online forecasting. The utility of the\ngeneral framework is illustrated in two particular instances: When\nautoregressive (AR) models are used as base models, resulting in a nonlinear AR\nmixture model, and when conditional heteroscedastic (ARCH) models are used,\nresulting in a mixture model that offers a powerful and systematic way of\nmodelling the well-known volatility asymmetries in financial data. In\nforecasting, the BCT-X methods are found to outperform state-of-the-art\ntechniques on simulated and real-world data, both in terms of accuracy and\ncomputational requirements. In modelling, the BCT-X structure finds natural\nstructure present in the data. In particular, the BCT-ARCH model reveals a\nnovel, important feature of stock market index data, in the form of an enhanced\nleverage effect."}, "http://arxiv.org/abs/2310.07790": {"title": "Integration or fragmentation? A closer look at euro area financial markets", "link": "http://arxiv.org/abs/2310.07790", "description": "This paper examines the degree of integration at euro area financial markets.\nTo that end, we estimate overall and country-specific integration indices based\non a panel vector-autoregression with factor stochastic volatility. Our results\nindicate a more heterogeneous bond market compared to the market for lending\nrates. At both markets, the global financial crisis and the sovereign debt\ncrisis led to a severe decline in financial integration, which fully recovered\nsince then. We furthermore identify countries that deviate from their peers\neither by responding differently to crisis events or by taking on different\nroles in the spillover network. 
The latter analysis reveals two set of\ncountries, namely a main body of countries that receives and transmits\nspillovers and a second, smaller group of spillover absorbing economies.\nFinally, we demonstrate by estimating an augmented Taylor rule that euro area\nshort-term interest rates are positively linked to the level of integration on\nthe bond market."}, "http://arxiv.org/abs/2310.07839": {"title": "Marital Sorting, Household Inequality and Selection", "link": "http://arxiv.org/abs/2310.07839", "description": "Using CPS data for 1976 to 2022 we explore how wage inequality has evolved\nfor married couples with both spouses working full time full year, and its\nimpact on household income inequality. We also investigate how marriage sorting\npatterns have changed over this period. To determine the factors driving income\ninequality we estimate a model explaining the joint distribution of wages which\naccounts for the spouses' employment decisions. We find that income inequality\nhas increased for these households and increased assortative matching of wages\nhas exacerbated the inequality resulting from individual wage growth. We find\nthat positive sorting partially reflects the correlation across unobservables\ninfluencing both members' of the marriage wages. We decompose the changes in\nsorting patterns over the 47 years comprising our sample into structural,\ncomposition and selection effects and find that the increase in positive\nsorting primarily reflects the increased skill premia for both observed and\nunobserved characteristics."}, "http://arxiv.org/abs/2310.08063": {"title": "Uniform Inference for Nonlinear Endogenous Treatment Effects with High-Dimensional Covariates", "link": "http://arxiv.org/abs/2310.08063", "description": "Nonlinearity and endogeneity are common in causal effect studies with\nobservational data. In this paper, we propose new estimation and inference\nprocedures for nonparametric treatment effect functions with endogeneity and\npotentially high-dimensional covariates. The main innovation of this paper is\nthe double bias correction procedure for the nonparametric instrumental\nvariable (NPIV) model under high dimensions. We provide a useful uniform\nconfidence band of the marginal effect function, defined as the derivative of\nthe nonparametric treatment function. The asymptotic honesty of the confidence\nband is verified in theory. Simulations and an empirical study of air pollution\nand migration demonstrate the validity of our procedures."}, "http://arxiv.org/abs/2310.08115": {"title": "Model-Agnostic Covariate-Assisted Inference on Partially Identified Causal Effects", "link": "http://arxiv.org/abs/2310.08115", "description": "Many causal estimands are only partially identifiable since they depend on\nthe unobservable joint distribution between potential outcomes. Stratification\non pretreatment covariates can yield sharper partial identification bounds;\nhowever, unless the covariates are discrete with relatively small support, this\napproach typically requires consistent estimation of the conditional\ndistributions of the potential outcomes given the covariates. Thus, existing\napproaches may fail under model misspecification or if consistency assumptions\nare violated. In this study, we propose a unified and model-agnostic\ninferential approach for a wide class of partially identified estimands, based\non duality theory for optimal transport problems. 
In randomized experiments,\nour approach can wrap around any estimates of the conditional distributions and\nprovide uniformly valid inference, even if the initial estimates are\narbitrarily inaccurate. Also, our approach is doubly robust in observational\nstudies. Notably, this property allows analysts to use the multiplier bootstrap\nto select covariates and models without sacrificing validity even if the true\nmodel is not included. Furthermore, if the conditional distributions are\nestimated at semiparametric rates, our approach matches the performance of an\noracle with perfect knowledge of the outcome model. Finally, we propose an\nefficient computational framework, enabling implementation on many practical\nproblems in causal inference."}, "http://arxiv.org/abs/2310.08173": {"title": "Structural Vector Autoregressions and Higher Moments: Challenges and Solutions in Small Samples", "link": "http://arxiv.org/abs/2310.08173", "description": "Generalized method of moments estimators based on higher-order moment\nconditions derived from independent shocks can be used to identify and estimate\nthe simultaneous interaction in structural vector autoregressions. This study\nhighlights two problems that arise when using these estimators in small\nsamples. First, imprecise estimates of the asymptotically efficient weighting\nmatrix and the asymptotic variance lead to volatile estimates and inaccurate\ninference. Second, many moment conditions lead to a small sample scaling bias\ntowards innovations with a variance smaller than the normalizing unit variance\nassumption. To address the first problem, I propose utilizing the assumption of\nindependent structural shocks to estimate the efficient weighting matrix and\nthe variance of the estimator. For the second issue, I propose incorporating a\ncontinuously updated scaling term into the weighting matrix, eliminating the\nscaling bias. To demonstrate the effectiveness of these measures, I conducted a\nMonte Carlo simulation which shows a significant improvement in the performance\nof the estimator."}, "http://arxiv.org/abs/2310.08536": {"title": "Real-time Prediction of the Great Recession and the Covid-19 Recession", "link": "http://arxiv.org/abs/2310.08536", "description": "A series of standard and penalized logistic regression models is employed to\nmodel and forecast the Great Recession and the Covid-19 recession in the US.\nThese two recessions are scrutinized by closely examining the movement of five\nchosen predictors, their regression coefficients, and the predicted\nprobabilities of recession. The empirical analysis explores the predictive\ncontent of numerous macroeconomic and financial indicators with respect to NBER\nrecession indicator. The predictive ability of the underlying models is\nevaluated using a set of statistical evaluation metrics. The results strongly\nsupport the application of penalized logistic regression models in the area of\nrecession prediction. Specifically, the analysis indicates that a mixed usage\nof different penalized logistic regression models over different forecast\nhorizons largely outperform standard logistic regression models in the\nprediction of Great recession in the US, as they achieve higher predictive\naccuracy across 5 different forecast horizons. The Great Recession is largely\npredictable, whereas the Covid-19 recession remains unpredictable, given that\nthe Covid-19 pandemic is a real exogenous event. 
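A toy version of the penalized logistic recession classifiers in the recession-prediction entry above, run on synthetic predictors rather than the paper's macro-financial indicators; the penalty strength, number of series, and evaluation metric are illustrative assumptions.

```python
# L1-penalized vs. (effectively) unpenalized logistic regression for a binary
# recession-style indicator, on synthetic predictors with only a few informative series.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
n, p = 800, 30
X = rng.normal(size=(n, p))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]        # only 3 informative series
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
plain = LogisticRegression(C=1e6, max_iter=2000).fit(Xtr, ytr)            # effectively unpenalized
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(Xtr, ytr)

print(f"AUC plain: {roc_auc_score(yte, plain.predict_proba(Xte)[:, 1]):.3f}")
print(f"AUC lasso: {roc_auc_score(yte, lasso.predict_proba(Xte)[:, 1]):.3f}")
print("non-zero coefficients kept by the penalty:", int((lasso.coef_ != 0).sum()))
```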
The results are validated by\nconstructing via principal component analysis (PCA) on a set of selected\nvariables a recession indicator that suffers less from publication lags and\nexhibits a very high correlation with the NBER recession indicator."}, "http://arxiv.org/abs/2210.11355": {"title": "Network Synthetic Interventions: A Causal Framework for Panel Data Under Network Interference", "link": "http://arxiv.org/abs/2210.11355", "description": "We propose a generalization of the synthetic controls and synthetic\ninterventions methodology to incorporate network interference. We consider the\nestimation of unit-specific potential outcomes from panel data in the presence\nof spillover across units and unobserved confounding. Key to our approach is a\nnovel latent factor model that takes into account network interference and\ngeneralizes the factor models typically used in panel data settings. We propose\nan estimator, Network Synthetic Interventions (NSI), and show that it\nconsistently estimates the mean outcomes for a unit under an arbitrary set of\ncounterfactual treatments for the network. We further establish that the\nestimator is asymptotically normal. We furnish two validity tests for whether\nthe NSI estimator reliably generalizes to produce accurate counterfactual\nestimates. We provide a novel graph-based experiment design that guarantees the\nNSI estimator produces accurate counterfactual estimates, and also analyze the\nsample complexity of the proposed design. We conclude with simulations that\ncorroborate our theoretical findings."}, "http://arxiv.org/abs/2310.08672": {"title": "Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal", "link": "http://arxiv.org/abs/2310.08672", "description": "In many settings, interventions may be more effective for some individuals\nthan others, so that targeting interventions may be beneficial. We analyze the\nvalue of targeting in the context of a large-scale field experiment with over\n53,000 college students, where the goal was to use \"nudges\" to encourage\nstudents to renew their financial-aid applications before a non-binding\ndeadline. We begin with baseline approaches to targeting. First, we target\nbased on a causal forest that estimates heterogeneous treatment effects and\nthen assigns students to treatment according to those estimated to have the\nhighest treatment effects. Next, we evaluate two alternative targeting\npolicies, one targeting students with low predicted probability of renewing\nfinancial aid in the absence of the treatment, the other targeting those with\nhigh probability. The predicted baseline outcome is not the ideal criterion for\ntargeting, nor is it a priori clear whether to prioritize low, high, or\nintermediate predicted probability. Nonetheless, targeting on low baseline\noutcomes is common in practice, for example because the relationship between\nindividual characteristics and treatment effects is often difficult or\nimpossible to estimate with historical data. We propose hybrid approaches that\nincorporate the strengths of both predictive approaches (accurate estimation)\nand causal approaches (correct criterion); we show that targeting intermediate\nbaseline outcomes is most effective, while targeting based on low baseline\noutcomes is detrimental. 
In one year of the experiment, nudging all students\nimproved early filing by an average of 6.4 percentage points over a baseline\naverage of 37% filing, and we estimate that targeting half of the students\nusing our preferred policy attains around 75% of this benefit."}, "http://arxiv.org/abs/2310.09013": {"title": "Smoothed instrumental variables quantile regression", "link": "http://arxiv.org/abs/2310.09013", "description": "In this article, I introduce the sivqr command, which estimates the\ncoefficients of the instrumental variables (IV) quantile regression model\nintroduced by Chernozhukov and Hansen (2005). The sivqr command offers several\nadvantages over the existing ivqreg and ivqreg2 commands for estimating this IV\nquantile regression model, which complements the alternative \"triangular model\"\nbehind cqiv and the \"local quantile treatment effect\" model of ivqte.\nComputationally, sivqr implements the smoothed estimator of Kaplan and Sun\n(2017), who show that smoothing improves both computation time and statistical\naccuracy. Standard errors are computed analytically or by Bayesian bootstrap;\nfor non-iid sampling, sivqr is compatible with bootstrap. I discuss syntax and\nthe underlying methodology, and I compare sivqr with other commands in an\nexample."}, "http://arxiv.org/abs/2310.09105": {"title": "Estimating Individual Responses when Tomorrow Matters", "link": "http://arxiv.org/abs/2310.09105", "description": "We propose a regression-based approach to estimate how individuals'\nexpectations influence their responses to a counterfactual change. We provide\nconditions under which average partial effects based on regression estimates\nrecover structural effects. We propose a practical three-step estimation method\nthat relies on subjective beliefs data. We illustrate our approach in a model\nof consumption and saving, focusing on the impact of an income tax that not\nonly changes current income but also affects beliefs about future income. By\napplying our approach to survey data from Italy, we find that considering\nindividuals' beliefs matters for evaluating the impact of tax policies on\nconsumption decisions."}, "http://arxiv.org/abs/2210.09828": {"title": "Modelling Large Dimensional Datasets with Markov Switching Factor Models", "link": "http://arxiv.org/abs/2210.09828", "description": "We study a novel large dimensional approximate factor model with regime\nchanges in the loadings driven by a latent first order Markov process. By\nexploiting the equivalent linear representation of the model, we first recover\nthe latent factors by means of Principal Component Analysis. We then cast the\nmodel in state-space form, and we estimate loadings and transition\nprobabilities through an EM algorithm based on a modified version of the\nBaum-Lindgren-Hamilton-Kim filter and smoother that makes use of the factors\npreviously estimated. Our approach is appealing as it provides closed form\nexpressions for all estimators. More importantly, it does not require knowledge\nof the true number of factors. We derive the theoretical properties of the\nproposed estimation procedure, and we show their good finite sample performance\nthrough a comprehensive set of Monte Carlo experiments. 
The empirical\nusefulness of our approach is illustrated through an application to a large\nportfolio of stocks."}, "http://arxiv.org/abs/2210.13562": {"title": "Prediction intervals for economic fixed-event forecasts", "link": "http://arxiv.org/abs/2210.13562", "description": "The fixed-event forecasting setup is common in economic policy. It involves a\nsequence of forecasts of the same (`fixed') predictand, so that the difficulty\nof the forecasting problem decreases over time. Fixed-event point forecasts are\ntypically published without a quantitative measure of uncertainty. To construct\nsuch a measure, we consider forecast postprocessing techniques tailored to the\nfixed-event case. We develop regression methods that impose constraints\nmotivated by the problem at hand, and use these methods to construct prediction\nintervals for gross domestic product (GDP) growth in Germany and the US."}, "http://arxiv.org/abs/2302.02747": {"title": "Testing Quantile Forecast Optimality", "link": "http://arxiv.org/abs/2302.02747", "description": "Quantile forecasts made across multiple horizons have become an important\noutput of many financial institutions, central banks and international\norganisations. This paper proposes misspecification tests for such quantile\nforecasts that assess optimality over a set of multiple forecast horizons\nand/or quantiles. The tests build on multiple Mincer-Zarnowitz quantile\nregressions cast in a moment equality framework. Our main test is for the null\nhypothesis of autocalibration, a concept which assesses optimality with respect\nto the information contained in the forecasts themselves. We provide an\nextension that allows testing for optimality with respect to larger information\nsets, as well as a multivariate extension. Importantly, our tests do not just inform\nabout general violations of optimality, but may also provide useful insights\ninto specific forms of sub-optimality. A simulation study investigates the\nfinite sample performance of our tests, and two empirical applications to\nfinancial returns and U.S. macroeconomic series illustrate that our tests can\nyield interesting insights into quantile forecast sub-optimality and its\ncauses."}, "http://arxiv.org/abs/2305.00700": {"title": "Double and Single Descent in Causal Inference with an Application to High-Dimensional Synthetic Control", "link": "http://arxiv.org/abs/2305.00700", "description": "Motivated by a recent literature on the double-descent phenomenon in machine\nlearning, we consider highly over-parameterized models in causal inference,\nincluding synthetic control with many control units. In such models, there may\nbe so many free parameters that the model fits the training data perfectly. We\nfirst investigate high-dimensional linear regression for imputing wage data and\nestimating average treatment effects, where we find that models with many more\ncovariates than sample size can outperform simple ones. We then document the\nperformance of high-dimensional synthetic control estimators with many control\nunits. We find that adding control units can help improve imputation\nperformance even beyond the point where the pre-treatment fit is perfect. We\nprovide a unified theoretical perspective on the performance of these\nhigh-dimensional models. Specifically, we show that more complex models can be\ninterpreted as model-averaging estimators over simpler ones, which we link to\nan improvement in average performance. 
This perspective yields concrete\ninsights into the use of synthetic control when control units are many relative\nto the number of pre-treatment periods."}, "http://arxiv.org/abs/2310.09398": {"title": "An In-Depth Examination of Requirements for Disclosure Risk Assessment", "link": "http://arxiv.org/abs/2310.09398", "description": "The use of formal privacy to protect the confidentiality of responses in the\n2020 Decennial Census of Population and Housing has triggered renewed interest\nand debate over how to measure the disclosure risks and societal benefits of\nthe published data products. Following long-established precedent in economics\nand statistics, we argue that any proposal for quantifying disclosure risk\nshould be based on pre-specified, objective criteria. Such criteria should be\nused to compare methodologies to identify those with the most desirable\nproperties. We illustrate this approach, using simple desiderata, to evaluate\nthe absolute disclosure risk framework, the counterfactual framework underlying\ndifferential privacy, and prior-to-posterior comparisons. We conclude that\nsatisfying all the desiderata is impossible, but counterfactual comparisons\nsatisfy the most while absolute disclosure risk satisfies the fewest.\nFurthermore, we explain that many of the criticisms levied against differential\nprivacy would be levied against any technology that is not equivalent to\ndirect, unrestricted access to confidential data. Thus, more research is\nneeded, but in the near-term, the counterfactual approach appears best-suited\nfor privacy-utility analysis."}, "http://arxiv.org/abs/2310.09545": {"title": "A Semiparametric Instrumented Difference-in-Differences Approach to Policy Learning", "link": "http://arxiv.org/abs/2310.09545", "description": "Recently, there has been a surge in methodological development for the\ndifference-in-differences (DiD) approach to evaluate causal effects. Standard\nmethods in the literature rely on the parallel trends assumption to identify\nthe average treatment effect on the treated. However, the parallel trends\nassumption may be violated in the presence of unmeasured confounding, and the\naverage treatment effect on the treated may not be useful in learning a\ntreatment assignment policy for the entire population. In this article, we\npropose a general instrumented DiD approach for learning the optimal treatment\npolicy. Specifically, we establish identification results using a binary\ninstrumental variable (IV) when the parallel trends assumption fails to hold.\nAdditionally, we construct a Wald estimator, novel inverse probability\nweighting (IPW) estimators, and a class of semiparametric efficient and\nmultiply robust estimators, with theoretical guarantees on consistency and\nasymptotic normality, even when relying on flexible machine learning algorithms\nfor nuisance parameters estimation. Furthermore, we extend the instrumented DiD\nto the panel data setting. We evaluate our methods in extensive simulations and\na real data application."}, "http://arxiv.org/abs/2310.09597": {"title": "Adaptive maximization of social welfare", "link": "http://arxiv.org/abs/2310.09597", "description": "We consider the problem of repeatedly choosing policies to maximize social\nwelfare. Welfare is a weighted sum of private utility and public revenue.\nEarlier outcomes inform later policies. Utility is not observed, but indirectly\ninferred. 
Response functions are learned through experimentation.\n\nWe derive a lower bound on regret, and a matching adversarial upper bound for\na variant of the Exp3 algorithm. Cumulative regret grows at a rate of\n$T^{2/3}$. This implies that (i) welfare maximization is harder than the\nmulti-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets),\nand (ii) our algorithm achieves the optimal rate. For the stochastic setting,\nif social welfare is concave, we can achieve a rate of $T^{1/2}$ (for\ncontinuous policy sets), using a dyadic search algorithm.\n\nWe analyze an extension to nonlinear income taxation, and sketch an extension\nto commodity taxation. We compare our setting to monopoly pricing (which is\neasier), and price setting for bilateral trade (which is harder)."}, "http://arxiv.org/abs/2008.08387": {"title": "A Novel Approach to Predictive Accuracy Testing in Nested Environments", "link": "http://arxiv.org/abs/2008.08387", "description": "We introduce a new approach for comparing the predictive accuracy of two\nnested models that bypasses the difficulties caused by the degeneracy of the\nasymptotic variance of forecast error loss differentials used in the\nconstruction of commonly used predictive comparison statistics. Our approach\ncontinues to rely on the out of sample MSE loss differentials between the two\ncompeting models, leads to nuisance parameter free Gaussian asymptotics and is\nshown to remain valid under flexible assumptions that can accommodate\nheteroskedasticity and the presence of mixed predictors (e.g. stationary and\nlocal to unit root). A local power analysis also establishes its ability to\ndetect departures from the null in both stationary and persistent settings.\nSimulations calibrated to common economic and financial applications indicate\nthat our methods have strong power with good size control across commonly\nencountered sample sizes."}, "http://arxiv.org/abs/2211.07506": {"title": "Type I Tobit Bayesian Additive Regression Trees for Censored Outcome Regression", "link": "http://arxiv.org/abs/2211.07506", "description": "Censoring occurs when an outcome is unobserved beyond some threshold value.\nMethods that do not account for censoring produce biased predictions of the\nunobserved outcome. This paper introduces Type I Tobit Bayesian Additive\nRegression Tree (TOBART-1) models for censored outcomes. Simulation results and\nreal data applications demonstrate that TOBART-1 produces accurate predictions\nof censored outcomes. TOBART-1 provides posterior intervals for the conditional\nexpectation and other quantities of interest. The error term distribution can\nhave a large impact on the expectation of the censored outcome. Therefore the\nerror is flexibly modeled as a Dirichlet process mixture of normal\ndistributions."}, "http://arxiv.org/abs/2302.02866": {"title": "Out of Sample Predictability in Predictive Regressions with Many Predictor Candidates", "link": "http://arxiv.org/abs/2302.02866", "description": "This paper is concerned with detecting the presence of out of sample\npredictability in linear predictive regressions with a potentially large set of\ncandidate predictors. We propose a procedure based on out of sample MSE\ncomparisons that is implemented in a pairwise manner using one predictor at a\ntime and resulting in an aggregate test statistic that is standard normally\ndistributed under the global null hypothesis of no linear predictability.\nPredictors can be highly persistent, purely stationary or a combination of\nboth. 
Upon rejection of the null hypothesis we subsequently introduce a\npredictor screening procedure designed to identify the most active predictors.\nAn empirical application to key predictors of US economic activity illustrates\nthe usefulness of our methods and highlights the important forward looking role\nplayed by the series of manufacturing new orders."}, "http://arxiv.org/abs/2305.12883": {"title": "Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors", "link": "http://arxiv.org/abs/2305.12883", "description": "In recent years, there has been a significant growth in research focusing on\nminimum $\\ell_2$ norm (ridgeless) interpolation least squares estimators.\nHowever, the majority of these analyses have been limited to a simple\nregression error structure, assuming independent and identically distributed\nerrors with zero mean and common variance. In this paper, we explore prediction\nrisk as well as estimation risk under more general regression error\nassumptions, highlighting the benefits of overparameterization in a finite\nsample. We find that including a large number of unimportant parameters\nrelative to the sample size can effectively reduce both risks. Notably, we\nestablish that the estimation difficulties associated with the variance\ncomponents of both risks can be summarized through the trace of the\nvariance-covariance matrix of the regression errors."}, "http://arxiv.org/abs/2203.09001": {"title": "Selection and parallel trends", "link": "http://arxiv.org/abs/2203.09001", "description": "We study the role of selection into treatment in difference-in-differences\n(DiD) designs. We derive necessary and sufficient conditions for parallel\ntrends assumptions under general classes of selection mechanisms. These\nconditions characterize the empirical content of parallel trends. For settings\nwhere the necessary conditions are questionable, we propose tools for\nselection-based sensitivity analysis. We also provide templates for justifying\nDiD in applications with and without covariates. A reanalysis of the causal\neffect of NSW training programs demonstrates the usefulness of our\nselection-based approach to sensitivity analysis."}, "http://arxiv.org/abs/2207.07318": {"title": "Flexible global forecast combinations", "link": "http://arxiv.org/abs/2207.07318", "description": "Forecast combination -- the aggregation of individual forecasts from multiple\nexperts or models -- is a proven approach to economic forecasting. To date,\nresearch on economic forecasting has concentrated on local combination methods,\nwhich handle separate but related forecasting tasks in isolation. Yet, it has\nbeen known for over two decades in the machine learning community that global\nmethods, which exploit task-relatedness, can improve on local methods that\nignore it. Motivated by the possibility for improvement, this paper introduces\na framework for globally combining forecasts while being flexible to the level\nof task-relatedness. Through our framework, we develop global versions of\nseveral existing forecast combinations. To evaluate the efficacy of these new\nglobal forecast combinations, we conduct extensive comparisons using synthetic\nand real data. 
Our real data comparisons, which involve forecasts of core\neconomic indicators in the Eurozone, provide empirical evidence that the\naccuracy of global combinations of economic forecasts can surpass local\ncombinations."}, "http://arxiv.org/abs/2210.01938": {"title": "Probability of Causation with Sample Selection: A Reanalysis of the Impacts of J\\'ovenes en Acci\\'on on Formality", "link": "http://arxiv.org/abs/2210.01938", "description": "This paper identifies the probability of causation when there is sample\nselection. We show that the probability of causation is partially identified\nfor individuals who are always observed regardless of treatment status and\nderive sharp bounds under three increasingly restrictive sets of assumptions.\nThe first set imposes an exogenous treatment and a monotone sample selection\nmechanism. To tighten these bounds, the second set also imposes the monotone\ntreatment response assumption, while the third set additionally imposes a\nstochastic dominance assumption. Finally, we use experimental data from the\nColombian job training program J\\'ovenes en Acci\\'on to empirically illustrate\nour approach's usefulness. We find that, among always-employed women, at least\n18% and at most 24% transitioned to the formal labor market because of the\nprogram."}, "http://arxiv.org/abs/2212.06312": {"title": "Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation", "link": "http://arxiv.org/abs/2212.06312", "description": "Methods for learning optimal policies use causal machine learning models to\ncreate human-interpretable rules for making choices around the allocation of\ndifferent policy interventions. However, in realistic policy-making contexts,\ndecision-makers often care about trade-offs between outcomes, not just\nsingle-mindedly maximising utility for one outcome. This paper proposes an\napproach termed Multi-Objective Policy Learning (MOPoL) which combines optimal\ndecision trees for policy learning with a multi-objective Bayesian optimisation\napproach to explore the trade-off between multiple outcomes. It does this by\nbuilding a Pareto frontier of non-dominated models for different hyperparameter\nsettings which govern outcome weighting. The key here is that a low-cost greedy\ntree can be an accurate proxy for the very computationally costly optimal tree\nfor the purposes of making decisions which means models can be repeatedly fit\nto learn a Pareto frontier. The method is applied to a real-world case-study of\nnon-price rationing of anti-malarial medication in Kenya."}, "http://arxiv.org/abs/2302.08002": {"title": "Deep Learning Enhanced Realized GARCH", "link": "http://arxiv.org/abs/2302.08002", "description": "We propose a new approach to volatility modeling by combining deep learning\n(LSTM) and realized volatility measures. This LSTM-enhanced realized GARCH\nframework incorporates and distills modeling advances from financial\neconometrics, high frequency trading data and deep learning. Bayesian inference\nvia the Sequential Monte Carlo method is employed for statistical inference and\nforecasting. The new framework can jointly model the returns and realized\nvolatility measures, has an excellent in-sample fit and superior predictive\nperformance compared to several benchmark models, while being able to adapt\nwell to the stylized facts in volatility. 
The performance of the new framework\nis tested using a wide range of metrics, from marginal likelihood and volatility\nforecasting to tail risk forecasting and option pricing. We report on a\ncomprehensive empirical study using 31 widely traded stock indices over a time\nperiod that includes the COVID-19 pandemic."}, "http://arxiv.org/abs/2303.10019": {"title": "Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices", "link": "http://arxiv.org/abs/2303.10019", "description": "This paper presents a new method for combining (or aggregating or ensembling)\nmultivariate probabilistic forecasts, considering dependencies between\nquantiles and marginals through a smoothing procedure that allows for online\nlearning. We discuss two smoothing methods: dimensionality reduction using\nbasis matrices and penalized smoothing. The new online learning algorithm\ngeneralizes the standard CRPS learning framework into multivariate dimensions.\nIt is based on Bernstein Online Aggregation (BOA) and yields optimal asymptotic\nlearning properties. The procedure uses horizontal aggregation, i.e.,\naggregation across quantiles. We provide an in-depth discussion on possible\nextensions of the algorithm and several nested cases related to the existing\nliterature on online forecast combination. We apply the proposed methodology to\nforecasting day-ahead electricity prices, which are 24-dimensional\ndistributional forecasts. The proposed method yields significant improvements\nover uniform combination in terms of continuous ranked probability score\n(CRPS). We discuss the temporal evolution of the weights and hyperparameters\nand present the results of reduced versions of the preferred model. A fast C++\nimplementation of the proposed algorithm will be made available in connection\nwith this paper as an open-source R-Package on CRAN."}, "http://arxiv.org/abs/2309.02072": {"title": "DeepVol: A Pre-Trained Universal Asset Volatility Model", "link": "http://arxiv.org/abs/2309.02072", "description": "This paper introduces DeepVol, a pre-trained deep learning volatility model\nthat is more general than traditional econometric models. DeepVol leverages the\npower of transfer learning to effectively capture and model the volatility\ndynamics of all financial assets, including previously unseen ones, using a\nsingle universal model. This contrasts with the usual practice in the\neconometrics literature, which trains a separate model for each asset. The\nintroduction of DeepVol opens up new avenues for volatility modeling in the\nfinance industry, potentially transforming the way volatility is predicted."}, "http://arxiv.org/abs/2310.11680": {"title": "Trimmed Mean Group Estimation of Average Treatment Effects in Ultra Short T Panels under Correlated Heterogeneity", "link": "http://arxiv.org/abs/2310.11680", "description": "Under correlated heterogeneity, the commonly used two-way fixed effects\nestimator is biased and can lead to misleading inference. This paper proposes a\nnew trimmed mean group (TMG) estimator which is consistent at the irregular\nrate of n^{1/3} even if the time dimension of the panel is as small as the\nnumber of its regressors. Extensions to panels with time effects are provided,\nand a Hausman-type test of correlated heterogeneity is proposed. Small sample\nproperties of the TMG estimator (with and without time effects) are\ninvestigated by Monte Carlo experiments and shown to be satisfactory and to\nperform better than other trimmed estimators proposed in the literature. 
The\nproposed test of correlated heterogeneity is also shown to have the correct\nsize and satisfactory power. The utility of the TMG approach is illustrated\nwith an empirical application."}, "http://arxiv.org/abs/2310.11962": {"title": "Machine Learning for Staggered Difference-in-Differences and Dynamic Treatment Effect Heterogeneity", "link": "http://arxiv.org/abs/2310.11962", "description": "We combine two recently proposed nonparametric difference-in-differences\nmethods, extending them to enable the examination of treatment effect\nheterogeneity in the staggered adoption setting using machine learning. The\nproposed method, machine learning difference-in-differences (MLDID), allows for\nestimation of time-varying conditional average treatment effects on the\ntreated, which can be used to conduct detailed inference on drivers of\ntreatment effect heterogeneity. We perform simulations to evaluate the\nperformance of MLDID and find that it accurately identifies the true predictors\nof treatment effect heterogeneity. We then use MLDID to evaluate the\nheterogeneous impacts of Brazil's Family Health Program on infant mortality,\nand find those in poverty and urban locations experienced the impact of the\npolicy more quickly than other subgroups."}, "http://arxiv.org/abs/2310.11969": {"title": "Survey calibration for causal inference: a simple method to balance covariate distributions", "link": "http://arxiv.org/abs/2310.11969", "description": "This paper proposes a simple method for balancing distributions of covariates\nfor causal inference based on observational studies. The method makes it\npossible to balance an arbitrary number of quantiles (e.g., medians, quartiles,\nor deciles) together with means if necessary. The proposed approach is based on\nthe theory of calibration estimators (Deville and S\\\"arndal 1992), in\nparticular, calibration estimators for quantiles, proposed by Harms and\nDuchesne (2006). By modifying the entropy balancing method and the covariate\nbalancing propensity score method, it is possible to balance the distributions\nof the treatment and control groups. The method does not require numerical\nintegration, kernel density estimation or assumptions about the distributions;\nvalid estimates can be obtained by drawing on existing asymptotic theory.\nResults of a simulation study indicate that the method efficiently estimates\naverage treatment effects on the treated (ATT), the average treatment effect\n(ATE), the quantile treatment effect on the treated (QTT) and the quantile\ntreatment effect (QTE), especially in the presence of non-linearity and\nmis-specification of the models. The proposed methods are implemented in an\nopen source R package jointCalib."}, "http://arxiv.org/abs/2203.04080": {"title": "On Robust Inference in Time Series Regression", "link": "http://arxiv.org/abs/2203.04080", "description": "Least squares regression with heteroskedasticity and autocorrelation\nconsistent (HAC) standard errors has proved very useful in cross section\nenvironments. However, several major difficulties, which are generally\noverlooked, must be confronted when transferring the HAC estimation technology\nto time series environments. First, in plausible time-series environments\ninvolving failure of strong exogeneity, OLS parameter estimates can be\ninconsistent, so that HAC inference fails even asymptotically. Second, most\neconomic time series have strong autocorrelation, which renders HAC regression\nparameter estimates highly inefficient. 
Third, strong autocorrelation similarly\nrenders HAC conditional predictions highly inefficient. Finally, the structure\nof popular HAC estimators is ill-suited for capturing the autoregressive\nautocorrelation typically present in economic time series, which produces large\nsize distortions and reduced power in HAC-based hypothesis testing, in all but\nthe largest samples. We show that all four problems are largely avoided by the\nuse of a simple dynamic regression procedure, which is easily implemented. We\ndemonstrate the advantages of dynamic regression with detailed simulations\ncovering a range of practical issues."}, "http://arxiv.org/abs/2308.15338": {"title": "Another Look at the Linear Probability Model and Nonlinear Index Models", "link": "http://arxiv.org/abs/2308.15338", "description": "We reassess the use of linear models to approximate response probabilities of\nbinary outcomes, focusing on average partial effects (APE). We confirm that\nlinear projection parameters coincide with APEs in certain scenarios. Through\nsimulations, we identify other cases where OLS does or does not approximate\nAPEs and find that having a large fraction of fitted values in [0, 1] is neither\nnecessary nor sufficient. We also show that nonlinear least squares estimation of\nthe ramp model is consistent and asymptotically normal and is equivalent to\nusing OLS on an iteratively trimmed sample to reduce bias. Our findings offer\npractical guidance for empirical research."}, "http://arxiv.org/abs/2309.10642": {"title": "Testing and correcting sample selection in academic achievement comparisons", "link": "http://arxiv.org/abs/2309.10642", "description": "Country comparisons using standardized test scores may in some cases be\nmisleading unless we make sure that the potential sample selection bias created\nby drop-outs and non-enrollment patterns does not alter the analysis. In this\npaper, I propose an answer to this issue, which consists of identifying the\ncounterfactual distribution of achievement (that is, the distribution of\nachievement if there were hypothetically no selection) from the observed\ndistribution of achievements. International comparison measures like means,\nquantiles, and inequality measures have to be computed using that\ncounterfactual distribution which is statistically closer to the observed one\nfor a low proportion of out-of-school children. I identify the quantiles of\nthat latent distribution by readjusting the percentile levels of the observed\nquantile function of achievement. Because the data on test scores is by nature\ntruncated, I have to rely on auxiliary data to borrow identification power. I\nfinally apply my method to compute selection-corrected means using PISA 2018\nand PASEC 2019 and find that rankings/comparisons can change."}, "http://arxiv.org/abs/2310.01104": {"title": "Multi-period static hedging of European options", "link": "http://arxiv.org/abs/2310.01104", "description": "We consider the hedging of European options when the price of the underlying\nasset follows a single-factor Markovian framework. By working in such a\nsetting, Carr and Wu \\cite{carr2014static} derived a spanning relation between\na given option and a continuum of shorter-term options written on the same\nasset. In this paper, we have extended their approach to simultaneously include\noptions over multiple short maturities. We then show a practical implementation\nof this with a finite set of shorter-term options to determine the hedging\nerror using a Gaussian Quadrature method. 
We perform a wide range of\nexperiments for both the \\textit{Black-Scholes} and \\textit{Merton Jump\nDiffusion} models, illustrating the comparative performance of the two methods."}, "http://arxiv.org/abs/2310.02414": {"title": "On Optimal Set Estimation for Partially Identified Binary Choice Models", "link": "http://arxiv.org/abs/2310.02414", "description": "In this paper we reconsider the notion of optimality in estimation of\npartially identified models. We illustrate the general problem in the context\nof a semiparametric binary choice model with discrete covariates as an example\nof a model which is partially identified as shown in, e.g. Bierens and Hartog\n(1988). A set estimator for the regression coefficients in the model can be\nconstructed by implementing the Maximum Score procedure proposed by Manski\n(1975). For many designs this procedure converges to the identified set for\nthese parameters, and so in one sense is optimal. But as shown in Komarova\n(2013) for other cases the Maximum Score objective function gives an outer\nregion of the identified set. This motivates alternative methods that are\noptimal in one sense that they converge to the identified region in all\ndesigns, and we propose and compare such procedures. One is a Hodges type\nestimator combining the Maximum Score estimator with existing procedures. A\nsecond is a two step estimator using a Maximum Score type objective function in\nthe second step. Lastly we propose a new random set quantile estimator,\nmotivated by definitions introduced in Molchanov (2006). Extensions of these\nideas for the cross sectional model to static and dynamic discrete panel data\nmodels are also provided."}, "http://arxiv.org/abs/2310.12825": {"title": "Nonparametric Regression with Dyadic Data", "link": "http://arxiv.org/abs/2310.12825", "description": "This paper studies the identification and estimation of a nonparametric\nnonseparable dyadic model where the structural function and the distribution of\nthe unobservable random terms are assumed to be unknown. The identification and\nthe estimation of the distribution of the unobservable random term are also\nproposed. I assume that the structural function is continuous and strictly\nincreasing in the unobservable heterogeneity. I propose suitable normalization\nfor the identification by allowing the structural function to have some\ndesirable properties such as homogeneity of degree one in the unobservable\nrandom term and some of its observables. The consistency and the asymptotic\ndistribution of the estimators are proposed. The finite sample properties of\nthe proposed estimators in a Monte-Carlo simulation are assessed."}, "http://arxiv.org/abs/2310.12863": {"title": "Moment-dependent phase transitions in high-dimensional Gaussian approximations", "link": "http://arxiv.org/abs/2310.12863", "description": "High-dimensional central limit theorems have been intensively studied with\nmost focus being on the case where the data is sub-Gaussian or sub-exponential.\nHowever, heavier tails are omnipresent in practice. In this article, we study\nthe critical growth rates of dimension $d$ below which Gaussian approximations\nare asymptotically valid but beyond which they are not. We are particularly\ninterested in how these thresholds depend on the number of moments $m$ that the\nobservations possess. For every $m\\in(2,\\infty)$, we construct i.i.d. 
random\nvectors $\\textbf{X}_1,...,\\textbf{X}_n$ in $\\mathbb{R}^d$, the entries of which\nare independent and have a common distribution (independent of $n$ and $d$)\nwith finite $m$th absolute moment, and such that the following holds: if there\nexists an $\\varepsilon\\in(0,\\infty)$ such that $d/n^{m/2-1+\\varepsilon}\\not\\to\n0$, then the Gaussian approximation error (GAE) satisfies $$\n\n\\limsup_{n\\to\\infty}\\sup_{t\\in\\mathbb{R}}\\left[\\mathbb{P}\\left(\\max_{1\\leq\nj\\leq d}\\frac{1}{\\sqrt{n}}\\sum_{i=1}^n\\textbf{X}_{ij}\\leq\nt\\right)-\\mathbb{P}\\left(\\max_{1\\leq j\\leq d}\\textbf{Z}_j\\leq\nt\\right)\\right]=1,$$ where $\\textbf{Z} \\sim\n\\mathsf{N}_d(\\textbf{0}_d,\\mathbf{I}_d)$. On the other hand, a result in\nChernozhukov et al. (2023a) implies that the left-hand side above is zero if\njust $d/n^{m/2-1-\\varepsilon}\\to 0$ for some $\\varepsilon\\in(0,\\infty)$. In\nthis sense, there is a moment-dependent phase transition at the threshold\n$d=n^{m/2-1}$ above which the limiting GAE jumps from zero to one."}, "http://arxiv.org/abs/2209.11840": {"title": "Revisiting the Analysis of Matched-Pair and Stratified Experiments in the Presence of Attrition", "link": "http://arxiv.org/abs/2209.11840", "description": "In this paper we revisit some common recommendations regarding the analysis\nof matched-pair and stratified experimental designs in the presence of\nattrition. Our main objective is to clarify a number of well-known claims about\nthe practice of dropping pairs with an attrited unit when analyzing\nmatched-pair designs. Contradictory advice appears in the literature about\nwhether or not dropping pairs is beneficial or harmful, and stratifying into\nlarger groups has been recommended as a resolution to the issue. To address\nthese claims, we derive the estimands obtained from the difference-in-means\nestimator in a matched-pair design both when the observations from pairs with\nan attrited unit are retained and when they are dropped. We find limited\nevidence to support the claims that dropping pairs helps recover the average\ntreatment effect, but we find that it may potentially help in recovering a\nconvex weighted average of conditional average treatment effects. We report\nsimilar findings for stratified designs when studying the estimands obtained\nfrom a regression of outcomes on treatment with and without strata fixed\neffects."}, "http://arxiv.org/abs/2210.04523": {"title": "An identification and testing strategy for proxy-SVARs with weak proxies", "link": "http://arxiv.org/abs/2210.04523", "description": "When proxies (external instruments) used to identify target structural shocks\nare weak, inference in proxy-SVARs (SVAR-IVs) is nonstandard and the\nconstruction of asymptotically valid confidence sets for the impulse responses\nof interest requires weak-instrument robust methods. In the presence of\nmultiple target shocks, test inversion techniques require extra restrictions on\nthe proxy-SVAR parameters other than those implied by the proxies, which may be\ndifficult to interpret and test. We show that frequentist asymptotic inference\nin these situations can be conducted through Minimum Distance estimation and\nstandard asymptotic methods if the proxy-SVAR can be identified by using\n`strong' instruments for the non-target shocks; i.e. the shocks which are not\nof primary interest in the analysis. 
The suggested identification strategy\nhinges on a novel pre-test for the null of instrument relevance based on\nbootstrap resampling, which is not subject to pre-testing issues, in the sense\nthat the validity of post-test asymptotic inferences is not affected by the\noutcomes of the test. The test is robust to conditional heteroskedasticity\nand/or zero-censored proxies, is computationally straightforward and applicable\nregardless of the number of shocks being instrumented. Some illustrative\nexamples show the empirical usefulness of the suggested identification and\ntesting strategy."}, "http://arxiv.org/abs/2301.07241": {"title": "Unconditional Quantile Partial Effects via Conditional Quantile Regression", "link": "http://arxiv.org/abs/2301.07241", "description": "This paper develops a semi-parametric procedure for estimation of\nunconditional quantile partial effects using quantile regression coefficients.\nThe estimator is based on an identification result showing that, for continuous\ncovariates, unconditional quantile effects are a weighted average of\nconditional ones at particular quantile levels that depend on the covariates.\nWe propose a two-step estimator for the unconditional effects where, in the\nfirst step, one estimates a structural quantile regression model, and in the\nsecond step, a nonparametric regression is applied to the first-step\ncoefficients. We establish the asymptotic properties of the estimator, namely\nconsistency and asymptotic normality. Monte Carlo simulations show numerical\nevidence that the estimator has very good finite sample performance and is\nrobust to the selection of bandwidth and kernel. To illustrate the proposed\nmethod, we study the canonical application of Engel's curve, i.e., food\nexpenditures as a share of income."}, "http://arxiv.org/abs/2302.04380": {"title": "Covariate Adjustment in Experiments with Matched Pairs", "link": "http://arxiv.org/abs/2302.04380", "description": "This paper studies inference on the average treatment effect in experiments\nin which treatment status is determined according to \"matched pairs\" and it is\nadditionally desired to adjust for observed, baseline covariates to gain\nfurther precision. By a \"matched pairs\" design, we mean that units are sampled\ni.i.d. from the population of interest, paired according to observed, baseline\ncovariates and finally, within each pair, one unit is selected at random for\ntreatment. Importantly, we presume that not all observed, baseline covariates\nare used in determining treatment assignment. We study a broad class of\nestimators based on a \"doubly robust\" moment condition that permits us to study\nestimators with both finite-dimensional and high-dimensional forms of covariate\nadjustment. We find that estimators with finite-dimensional, linear adjustments\nneed not lead to improvements in precision relative to the unadjusted\ndifference-in-means estimator. This phenomenon persists even if the adjustments\nare interacted with treatment; in fact, doing so leads to no changes in\nprecision. However, gains in precision can be ensured by including fixed\neffects for each of the pairs. Indeed, we show that this adjustment is the\n\"optimal\" finite-dimensional, linear adjustment. We additionally study two\nestimators with high-dimensional forms of covariate adjustment based on the\nLASSO. 
For each such estimator, we show that it leads to improvements in\nprecision relative to the unadjusted difference-in-means estimator and also\nprovide conditions under which it leads to the \"optimal\" nonparametric,\ncovariate adjustment. A simulation study confirms the practical relevance of\nour theoretical analysis, and the methods are employed to reanalyze data from\nan experiment using a \"matched pairs\" design to study the effect of\nmacroinsurance on microenterprise."}, "http://arxiv.org/abs/2305.03134": {"title": "Debiased inference for dynamic nonlinear models with two-way fixed effects", "link": "http://arxiv.org/abs/2305.03134", "description": "Panel data models often use fixed effects to account for unobserved\nheterogeneities. These fixed effects are typically incidental parameters and\ntheir estimators converge slowly relative to the square root of the sample\nsize. In the maximum likelihood context, this induces an asymptotic bias of the\nlikelihood function. Test statistics derived from the asymptotically biased\nlikelihood, therefore, no longer follow their standard limiting distributions.\nThis causes severe distortions in test sizes. We consider a generic class of\ndynamic nonlinear models with two-way fixed effects and propose an analytical\nbias correction method for the likelihood function. We formally show that the\nlikelihood ratio, the Lagrange-multiplier, and the Wald test statistics derived\nfrom the corrected likelihood follow their standard asymptotic distributions. A\nbias-corrected estimator of the structural parameters can also be derived from\nthe corrected likelihood function. We evaluate the performance of our bias\ncorrection procedure through simulations and an empirical example."}, "http://arxiv.org/abs/2310.13240": {"title": "Transparency challenges in policy evaluation with causal machine learning -- improving usability and accountability", "link": "http://arxiv.org/abs/2310.13240", "description": "Causal machine learning tools are beginning to see use in real-world policy\nevaluation tasks to flexibly estimate treatment effects. One issue with these\nmethods is that the machine learning models used are generally black boxes,\ni.e., there is no globally interpretable way to understand how a model makes\nestimates. This is a clear problem in policy evaluation applications,\nparticularly in government, because it is difficult to understand whether such\nmodels are functioning in ways that are fair, based on the correct\ninterpretation of evidence and transparent enough to allow for accountability\nif things go wrong. However, there has been little discussion of transparency\nproblems in the causal machine learning literature and how these might be\novercome. This paper explores why transparency issues are a problem for causal\nmachine learning in public policy evaluation applications and considers ways\nthese problems might be addressed through explainable AI tools and by\nsimplifying models in line with interpretable AI principles. It then applies\nthese ideas to a case-study using a causal forest model to estimate conditional\naverage treatment effects for a hypothetical change in the school leaving age\nin Australia. It shows that existing tools for understanding black-box\npredictive models are poorly suited to causal machine learning and that\nsimplifying the model to make it interpretable leads to an unacceptable\nincrease in error (in this application). 
It concludes that new tools are needed\nto properly understand causal machine learning models and the algorithms that\nfit them."}, "http://arxiv.org/abs/2308.04276": {"title": "Causal Interpretation of Linear Social Interaction Models with Endogenous Networks", "link": "http://arxiv.org/abs/2308.04276", "description": "This study investigates the causal interpretation of linear social\ninteraction models in the presence of endogeneity in network formation under a\nheterogeneous treatment effects framework. We consider an experimental setting\nin which individuals are randomly assigned to treatments while no interventions\nare made for the network structure. We show that running a linear regression\nignoring network endogeneity is not problematic for estimating the average\ndirect treatment effect. However, it leads to sample selection bias and\nnegative-weights problem for the estimation of the average spillover effect. To\novercome these problems, we propose using potential peer treatment as an\ninstrumental variable (IV), which is automatically a valid IV for actual\nspillover exposure. Using this IV, we examine two IV-based estimands and\ndemonstrate that they have a local average treatment-effect-type causal\ninterpretation for the spillover effect."}, "http://arxiv.org/abs/2309.01889": {"title": "The Local Projection Residual Bootstrap for AR(1) Models", "link": "http://arxiv.org/abs/2309.01889", "description": "This paper proposes a local projection residual bootstrap method to construct\nconfidence intervals for impulse response coefficients of AR(1) models. Our\nbootstrap method is based on the local projection (LP) approach and a residual\nbootstrap procedure. We present theoretical results for our bootstrap method\nand proposed confidence intervals. First, we prove the uniform consistency of\nthe LP-residual bootstrap over a large class of AR(1) models that allow for a\nunit root. Then, we prove the asymptotic validity of our confidence intervals\nover the same class of AR(1) models. Finally, we show that the LP-residual\nbootstrap provides asymptotic refinements for confidence intervals on a\nrestricted class of AR(1) models relative to those required for the uniform\nconsistency of our bootstrap."}, "http://arxiv.org/abs/2310.13785": {"title": "Bayesian Estimation of Panel Models under Potentially Sparse Heterogeneity", "link": "http://arxiv.org/abs/2310.13785", "description": "We incorporate a version of a spike and slab prior, comprising a pointmass at\nzero (\"spike\") and a Normal distribution around zero (\"slab\") into a dynamic\npanel data framework to model coefficient heterogeneity. In addition to\nhomogeneity and full heterogeneity, our specification can also capture sparse\nheterogeneity, that is, there is a core group of units that share common\nparameters and a set of deviators with idiosyncratic parameters. We fit a model\nwith unobserved components to income data from the Panel Study of Income\nDynamics. We find evidence for sparse heterogeneity for balanced panels\ncomposed of individuals with long employment histories."}, "http://arxiv.org/abs/2310.14068": {"title": "Unobserved Grouped Heteroskedasticity and Fixed Effects", "link": "http://arxiv.org/abs/2310.14068", "description": "This paper extends the linear grouped fixed effects (GFE) panel model to\nallow for heteroskedasticity from a discrete latent group variable. 
Key\nfeatures of GFE are preserved, such as individuals belonging to one of a finite\nnumber of groups, with group membership unrestricted and estimated. Ignoring\ngroup heteroskedasticity may lead to poor classification, which is detrimental\nto finite sample bias and standard errors of estimators. I introduce the\n\"weighted grouped fixed effects\" (WGFE) estimator that minimizes a weighted\naverage of the group sums of squared residuals. I establish $\\sqrt{NT}$-consistency\nand normality under a concept of group separation based on second moments. A\ntest of group homoskedasticity is discussed. A fast computation procedure is\nprovided. Simulations show that WGFE outperforms alternatives that exclude\nsecond moment information. I demonstrate this approach by considering the link\nbetween income and democracy and the effect of unionization on earnings."}, "http://arxiv.org/abs/2310.14142": {"title": "On propensity score matching with a diverging number of matches", "link": "http://arxiv.org/abs/2310.14142", "description": "This paper reexamines Abadie and Imbens (2016)'s work on propensity score\nmatching for average treatment effect estimation. We explore the asymptotic\nbehavior of these estimators when the number of nearest neighbors, $M$, grows\nwith the sample size. It is shown, hardly surprising but technically\nnontrivial, that the modified estimators can improve upon the original\nfixed-$M$ estimators in terms of efficiency. Additionally, we demonstrate the\npotential to attain the semiparametric efficiency lower bound when the\npropensity score achieves \"sufficient\" dimension reduction, echoing Hahn\n(1998)'s insight about the role of dimension reduction in propensity\nscore-based causal inference."}, "http://arxiv.org/abs/2310.14438": {"title": "BVARs and Stochastic Volatility", "link": "http://arxiv.org/abs/2310.14438", "description": "Bayesian vector autoregressions (BVARs) are the workhorse in macroeconomic\nforecasting. Research in the last decade has established the importance of\nallowing time-varying volatility to capture both secular and cyclical\nvariations in macroeconomic uncertainty. This recognition, together with the\ngrowing availability of large datasets, has propelled a surge in recent\nresearch in building stochastic volatility models suitable for large BVARs.\nSome of these new models are also equipped with additional features that are\nespecially desirable for large systems, such as order invariance -- i.e.,\nestimates are not dependent on how the variables are ordered in the BVAR -- and\nrobustness against COVID-19 outliers. Estimation of these large, flexible\nmodels is made possible by the recently developed equation-by-equation approach\nthat drastically reduces the computational cost of estimating large systems.\nDespite these recent advances, there remains much ongoing work, such as the\ndevelopment of parsimonious approaches for time-varying coefficients and other\ntypes of nonlinearities in large BVARs."}, "http://arxiv.org/abs/2310.14983": {"title": "Causal clustering: design of cluster experiments under network interference", "link": "http://arxiv.org/abs/2310.14983", "description": "This paper studies the design of cluster experiments to estimate the global\ntreatment effect in the presence of spillovers on a single network. We provide\nan econometric framework to choose the clustering that minimizes the worst-case\nmean-squared error of the estimated global treatment effect. 
We show that the\noptimal clustering can be approximated as the solution of a novel penalized\nmin-cut optimization problem computed via off-the-shelf semi-definite\nprogramming algorithms. Our analysis also characterizes easy-to-check\nconditions to choose between a cluster or individual-level randomization. We\nillustrate the method's properties using unique network data from the universe\nof Facebook's users and existing network data from a field experiment."}, "http://arxiv.org/abs/2004.08318": {"title": "Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions", "link": "http://arxiv.org/abs/2004.08318", "description": "We study causal inference under case-control and case-population sampling.\nSpecifically, we focus on the binary-outcome and binary-treatment case, where\nthe parameters of interest are causal relative and attributable risks defined\nvia the potential outcome framework. It is shown that strong ignorability is\nnot always as powerful as it is under random sampling and that certain\nmonotonicity assumptions yield comparable results in terms of sharp identified\nintervals. Specifically, the usual odds ratio is shown to be a sharp identified\nupper bound on causal relative risk under the monotone treatment response and\nmonotone treatment selection assumptions. We offer algorithms for inference on\nthe causal parameters that are aggregated over the true population distribution\nof the covariates. We show the usefulness of our approach by studying three\nempirical examples: the benefit of attending private school for entering a\nprestigious university in Pakistan; the relationship between staying in school\nand getting involved with drug-trafficking gangs in Brazil; and the link\nbetween physicians' hours and size of the group practice in the United States."}, "http://arxiv.org/abs/2108.07455": {"title": "Causal Inference with Noncompliance and Unknown Interference", "link": "http://arxiv.org/abs/2108.07455", "description": "We consider a causal inference model in which individuals interact in a\nsocial network and they may not comply with the assigned treatments. In\nparticular, we suppose that the form of network interference is unknown to\nresearchers. To estimate meaningful causal parameters in this situation, we\nintroduce a new concept of exposure mapping, which summarizes potentially\ncomplicated spillover effects into a fixed dimensional statistic of\ninstrumental variables. We investigate identification conditions for the\nintention-to-treat effects and the average treatment effects for compliers,\nwhile explicitly considering the possibility of misspecification of exposure\nmapping. Based on our identification results, we develop nonparametric\nestimation procedures via inverse probability weighting. Their asymptotic\nproperties, including consistency and asymptotic normality, are investigated\nusing an approximate neighborhood interference framework. For an empirical\nillustration, we apply our method to experimental data on the anti-conflict\nintervention school program. The proposed methods are readily available with\nthe companion R package latenetwork."}, "http://arxiv.org/abs/2112.03872": {"title": "Nonparametric Treatment Effect Identification in School Choice", "link": "http://arxiv.org/abs/2112.03872", "description": "This paper studies nonparametric identification and estimation of causal\neffects in centralized school assignment. 
In many centralized assignment\nsettings, students are subjected to both lottery-driven variation and\nregression discontinuity (RD) driven variation. We characterize the full set of\nidentified atomic treatment effects (aTEs), defined as the conditional average\ntreatment effect between a pair of schools, given student characteristics.\nAtomic treatment effects are the building blocks of more aggregated notions of\ntreatment contrasts, and common approaches estimating aggregations of aTEs can\nmask important heterogeneity. In particular, many aggregations of aTEs put zero\nweight on aTEs driven by RD variation, and estimators of such aggregations put\nasymptotically vanishing weight on the RD-driven aTEs. We develop a diagnostic\ntool for empirically assessing the weight put on aTEs driven by RD variation.\nLastly, we provide estimators and accompanying asymptotic results for inference\non aggregations of RD-driven aTEs."}, "http://arxiv.org/abs/2203.01425": {"title": "A Modern Gauss-Markov Theorem? Really?", "link": "http://arxiv.org/abs/2203.01425", "description": "We show that the theorems in Hansen (2021a) (the version accepted by\nEconometrica), except for one, are not new as they coincide with classical\ntheorems like the good old Gauss-Markov or Aitken Theorem, respectively; the\nexceptional theorem is incorrect. Hansen (2021b) corrects this theorem. As a\nresult, all theorems in the latter version coincide with the above mentioned\nclassical theorems. Furthermore, we also show that the theorems in Hansen\n(2022) (the version published in Econometrica) either coincide with the\nclassical theorems just mentioned, or contain extra assumptions that are alien\nto the Gauss-Markov or Aitken Theorem."}, "http://arxiv.org/abs/2204.12723": {"title": "Information-theoretic limitations of data-based price discrimination", "link": "http://arxiv.org/abs/2204.12723", "description": "This paper studies third-degree price discrimination (3PD) based on a random\nsample of valuation and covariate data, where the covariate is continuous, and\nthe distribution of the data is unknown to the seller. The main results of this\npaper are twofold. 
The first set of results is pricing strategy independent and\nreveals the fundamental information-theoretic limitation of any data-based\npricing strategy in revenue generation for two cases: 3PD and uniform pricing.\nThe second set of results proposes the $K$-markets empirical revenue\nmaximization (ERM) strategy and shows that the $K$-markets ERM and the uniform\nERM strategies achieve the optimal rate of convergence in revenue to that\ngenerated by their respective true-distribution 3PD and uniform pricing optima.\nOur theoretical and numerical results suggest that the uniform (i.e.,\n$1$-market) ERM strategy generates a larger revenue than the $K$-markets ERM\nstrategy when the sample size is small enough, and vice versa."}, "http://arxiv.org/abs/2304.12698": {"title": "Enhanced multilayer perceptron with feature selection and grid search for travel mode choice prediction", "link": "http://arxiv.org/abs/2304.12698", "description": "Accurate and reliable prediction of individual travel mode choices is crucial\nfor developing multi-mode urban transportation systems, conducting\ntransportation planning and formulating traffic demand management strategies.\nTraditional discrete choice models have dominated the modelling methods for\ndecades yet suffer from strict model assumptions and low prediction accuracy.\nIn recent years, machine learning (ML) models, such as neural networks and\nboosting models, are widely used by researchers for travel mode choice\nprediction and have yielded promising results. However, despite the superior\nprediction performance, a large body of ML methods, especially the branch of\nneural network models, is also limited by overfitting and tedious model\nstructure determination process. To bridge this gap, this study proposes an\nenhanced multilayer perceptron (MLP; a neural network) with two hidden layers\nfor travel mode choice prediction; this MLP is enhanced by XGBoost (a boosting\nmethod) for feature selection and a grid search method for optimal hidden\nneurone determination of each hidden layer. The proposed method was trained and\ntested on a real resident travel diary dataset collected in Chengdu, China."}, "http://arxiv.org/abs/2306.02584": {"title": "Synthetic Regressing Control Method", "link": "http://arxiv.org/abs/2306.02584", "description": "Estimating weights in the synthetic control method, typically resulting in\nsparse weights where only a few control units have non-zero weights, involves\nan optimization procedure that simultaneously selects and aligns control units\nto closely match the treated unit. However, this simultaneous selection and\nalignment of control units may lead to a loss of efficiency. Another concern\narising from the aforementioned procedure is its susceptibility to\nunder-fitting due to imperfect pre-treatment fit. It is not uncommon for the\nlinear combination, using nonnegative weights, of pre-treatment period outcomes\nfor the control units to inadequately approximate the pre-treatment outcomes\nfor the treated unit. To address both of these issues, this paper proposes a\nsimple and effective method called Synthetic Regressing Control (SRC). The SRC\nmethod begins by performing the univariate linear regression to appropriately\nalign the pre-treatment periods of the control units with the treated unit.\nSubsequently, a SRC estimator is obtained by synthesizing (taking a weighted\naverage) the fitted controls. 
To determine the weights in the synthesis\nprocedure, we propose an approach that utilizes a criterion of unbiased risk\nestimator. Theoretically, we show that the synthesis way is asymptotically\noptimal in the sense of achieving the lowest possible squared error. Extensive\nnumerical experiments highlight the advantages of the SRC method."}, "http://arxiv.org/abs/2308.12470": {"title": "Scalable Estimation of Multinomial Response Models with Uncertain Consideration Sets", "link": "http://arxiv.org/abs/2308.12470", "description": "A standard assumption in the fitting of unordered multinomial response models\nfor $J$ mutually exclusive nominal categories, on cross-sectional or\nlongitudinal data, is that the responses arise from the same set of $J$\ncategories between subjects. However, when responses measure a choice made by\nthe subject, it is more appropriate to assume that the distribution of\nmultinomial responses is conditioned on a subject-specific consideration set,\nwhere this consideration set is drawn from the power set of $\\{1,2,\\ldots,J\\}$.\nBecause the cardinality of this power set is exponential in $J$, estimation is\ninfeasible in general. In this paper, we provide an approach to overcoming this\nproblem. A key step in the approach is a probability model over consideration\nsets, based on a general representation of probability distributions on\ncontingency tables, which results in mixtures of independent consideration\nmodels. Although the support of this distribution is exponentially large, the\nposterior distribution over consideration sets given parameters is typically\nsparse, and is easily sampled in an MCMC scheme. We show posterior consistency\nof the parameters of the conditional response model and the distribution of\nconsideration sets. The effectiveness of the methodology is documented in\nsimulated longitudinal data sets with $J=100$ categories and real data from the\ncereal market with $J=68$ brands."}, "http://arxiv.org/abs/2310.15512": {"title": "Inference for Rank-Rank Regressions", "link": "http://arxiv.org/abs/2310.15512", "description": "Slope coefficients in rank-rank regressions are popular measures of\nintergenerational mobility, for instance in regressions of a child's income\nrank on their parent's income rank. In this paper, we first point out that\ncommonly used variance estimators such as the homoskedastic or robust variance\nestimators do not consistently estimate the asymptotic variance of the OLS\nestimator in a rank-rank regression. We show that the probability limits of\nthese estimators may be too large or too small depending on the shape of the\ncopula of child and parent incomes. Second, we derive a general asymptotic\ntheory for rank-rank regressions and provide a consistent estimator of the OLS\nestimator's asymptotic variance. We then extend the asymptotic theory to other\nregressions involving ranks that have been used in empirical work. Finally, we\napply our new inference methods to three empirical studies. We find that the\nconfidence intervals based on estimators of the correct variance may sometimes\nbe substantially shorter and sometimes substantially longer than those based on\ncommonly used variance estimators. The differences in confidence intervals\nconcern economically meaningful values of mobility and thus lead to different\nconclusions when comparing mobility in U.S. 
commuting zones with mobility in\nother countries."}, "http://arxiv.org/abs/2310.15796": {"title": "Testing for equivalence of pre-trends in Difference-in-Differences estimation", "link": "http://arxiv.org/abs/2310.15796", "description": "The plausibility of the ``parallel trends assumption'' in\nDifference-in-Differences estimation is usually assessed by a test of the null\nhypothesis that the difference between the average outcomes of both groups is\nconstant over time before the treatment. However, failure to reject the null\nhypothesis does not imply the absence of differences in time trends between\nboth groups. We provide equivalence tests that allow researchers to find\nevidence in favor of the parallel trends assumption and thus increase the\ncredibility of their treatment effect estimates. While we motivate our tests in\nthe standard two-way fixed effects model, we discuss simple extensions to\nsettings in which treatment adoption is staggered over time."}, "http://arxiv.org/abs/1712.04802": {"title": "Fisher-Schultz Lecture: Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments, with an Application to Immunization in India", "link": "http://arxiv.org/abs/1712.04802", "description": "We propose strategies to estimate and make inference on key features of\nheterogeneous effects in randomized experiments. These key features include\nbest linear predictors of the effects using machine learning proxies, average\neffects sorted by impact groups, and average characteristics of most and least\nimpacted units. The approach is valid in high dimensional settings, where the\neffects are proxied (but not necessarily consistently estimated) by predictive\nand causal machine learning methods. We post-process these proxies into\nestimates of the key features. Our approach is generic, it can be used in\nconjunction with penalized methods, neural networks, random forests, boosted\ntrees, and ensemble methods, both predictive and causal. Estimation and\ninference are based on repeated data splitting to avoid overfitting and achieve\nvalidity. We use quantile aggregation of the results across many potential\nsplits, in particular taking medians of p-values and medians and other\nquantiles of confidence intervals. We show that quantile aggregation lowers\nestimation risks over a single split procedure, and establish its principal\ninferential properties. Finally, our analysis reveals ways to build provably\nbetter machine learning proxies through causal learning: we can use the\nobjective functions that we develop to construct the best linear predictors of\nthe effects, to obtain better machine learning proxies in the initial step. We\nillustrate the use of both inferential tools and causal learners with a\nrandomized field experiment that evaluates a combination of nudges to stimulate\ndemand for immunization in India."}, "http://arxiv.org/abs/2301.09016": {"title": "Inference for Two-stage Experiments under Covariate-Adaptive Randomization", "link": "http://arxiv.org/abs/2301.09016", "description": "This paper studies inference in two-stage randomized experiments under\ncovariate-adaptive randomization. In the initial stage of this experimental\ndesign, clusters (e.g., households, schools, or graph partitions) are\nstratified and randomly assigned to control or treatment groups based on\ncluster-level covariates. 
Subsequently, an independent second-stage design is\ncarried out, wherein units within each treated cluster are further stratified\nand randomly assigned to either control or treatment groups, based on\nindividual-level covariates. Under the homogeneous partial interference\nassumption, I establish conditions under which the proposed\ndifference-in-\"average of averages\" estimators are consistent and\nasymptotically normal for the corresponding average primary and spillover\neffects and develop consistent estimators of their asymptotic variances.\nCombining these results establishes the asymptotic validity of tests based on\nthese estimators. My findings suggest that ignoring covariate information in\nthe design stage can result in efficiency loss, and commonly used inference\nmethods that ignore or improperly use covariate information can lead to either\nconservative or invalid inference. Finally, I apply these results to studying\noptimal use of covariate information under covariate-adaptive randomization in\nlarge samples, and demonstrate that a specific generalized matched-pair design\nachieves minimum asymptotic variance for each proposed estimator. The practical\nrelevance of the theoretical results is illustrated through a simulation study\nand an empirical application."}, "http://arxiv.org/abs/2306.12003": {"title": "Difference-in-Differences with Interference: A Finite Population Perspective", "link": "http://arxiv.org/abs/2306.12003", "description": "In many scenarios, such as the evaluation of place-based policies, potential\noutcomes are not only dependent upon the unit's own treatment but also its\nneighbors' treatment. Despite this, \"difference-in-differences\" (DID) type\nestimators typically ignore such interference among neighbors. I show in this\npaper that the canonical DID estimators generally fail to identify interesting\ncausal effects in the presence of neighborhood interference. To incorporate\ninterference structure into DID estimation, I propose doubly robust estimators\nfor the direct average treatment effect on the treated as well as the average\nspillover effects under a modified parallel trends assumption. When spillover\neffects are of interest, we often sample the entire population. Thus, I adopt a\nfinite population perspective in the sense that the estimands are defined as\npopulation averages and inference is conditional on the attributes of all\npopulation units. The approach in this paper relaxes common restrictions in the\nliterature, such as partial interference and correctly specified spillover\nfunctions. Moreover, robust inference is discussed based on the asymptotic\ndistribution of the proposed estimators."}, "http://arxiv.org/abs/2310.16281": {"title": "Improving Robust Decisions with Data", "link": "http://arxiv.org/abs/2310.16281", "description": "A decision-maker (DM) faces uncertainty governed by a data-generating process\n(DGP), which is only known to belong to a set of sequences of independent but\npossibly non-identical distributions. A robust decision maximizes the DM's\nexpected payoff against the worst possible DGP in this set. This paper studies\nhow such robust decisions can be improved with data, where improvement is\nmeasured by expected payoff under the true DGP. In this paper, I fully\ncharacterize when and how such an improvement can be guaranteed under all\npossible DGPs and develop inference methods to achieve it. 
These inference\nmethods are needed because, as this paper shows, common inference methods\n(e.g., maximum likelihood or Bayesian) often fail to deliver such an\nimprovement. Importantly, the developed inference methods are given by simple\naugmentations to standard inference procedures, and are thus easy to implement\nin practice."}, "http://arxiv.org/abs/2310.16290": {"title": "Fair Adaptive Experiments", "link": "http://arxiv.org/abs/2310.16290", "description": "Randomized experiments have been the gold standard for assessing the\neffectiveness of a treatment or policy. The classical complete randomization\napproach assigns treatments based on a prespecified probability and may lead to\ninefficient use of data. Adaptive experiments improve upon complete\nrandomization by sequentially learning and updating treatment assignment\nprobabilities. However, their application can also raise fairness and equity\nconcerns, as assignment probabilities may vary drastically across groups of\nparticipants. Furthermore, when treatment is expected to be extremely\nbeneficial to certain groups of participants, it is more appropriate to expose\nmany of these participants to favorable treatment. In response to these\nchallenges, we propose a fair adaptive experiment strategy that simultaneously\nenhances data use efficiency, achieves an envy-free treatment assignment\nguarantee, and improves the overall welfare of participants. An important\nfeature of our proposed strategy is that we do not impose parametric modeling\nassumptions on the outcome variables, making it more versatile and applicable\nto a wider array of applications. Through our theoretical investigation, we\ncharacterize the convergence rate of the estimated treatment effects and the\nassociated standard deviations at the group level and further prove that our\nadaptive treatment assignment algorithm, despite not having a closed-form\nexpression, approaches the optimal allocation rule asymptotically. Our proof\nstrategy takes into account the fact that the allocation decisions in our\ndesign depend on sequentially accumulated data, which poses a significant\nchallenge in characterizing the properties and conducting statistical inference\nof our method. We further provide simulation evidence to showcase the\nperformance of our fair adaptive experiment strategy."}, "http://arxiv.org/abs/2310.16638": {"title": "Covariate Shift Adaptation Robust to Density-Ratio Estimation", "link": "http://arxiv.org/abs/2310.16638", "description": "Consider a scenario where we have access to train data with both covariates\nand outcomes while test data only contains covariates. In this scenario, our\nprimary aim is to predict the missing outcomes of the test data. With this\nobjective in mind, we train parametric regression models under a covariate\nshift, where covariate distributions are different between the train and test\ndata. For this problem, existing studies have proposed covariate shift\nadaptation via importance weighting using the density ratio. This approach\naverages the train data losses, each weighted by an estimated ratio of the\ncovariate densities between the train and test data, to approximate the\ntest-data risk. 
Although it allows us to obtain a test-data risk minimizer, its\nperformance heavily relies on the accuracy of the density ratio estimation.\nMoreover, even if the density ratio can be consistently estimated, the\nestimation errors of the density ratio also yield bias in the estimators of the\nregression model's parameters of interest. To mitigate these challenges, we\nintroduce a doubly robust estimator for covariate shift adaptation via\nimportance weighting, which incorporates an additional estimator for the\nregression function. Leveraging double machine learning techniques, our\nestimator reduces the bias arising from the density ratio estimation errors. We\ndemonstrate the asymptotic distribution of the regression parameter estimator.\nNotably, our estimator remains consistent if either the density ratio estimator\nor the regression function is consistent, showcasing its robustness against\npotential errors in density ratio estimation. Finally, we confirm the soundness\nof our proposed method via simulation studies."}, "http://arxiv.org/abs/2310.16819": {"title": "CATE Lasso: Conditional Average Treatment Effect Estimation with High-Dimensional Linear Regression", "link": "http://arxiv.org/abs/2310.16819", "description": "In causal inference about two treatments, Conditional Average Treatment\nEffects (CATEs) play an important role as a quantity representing an\nindividualized causal effect, defined as a difference between the expected\noutcomes of the two treatments conditioned on covariates. This study assumes\ntwo linear regression models between a potential outcome and covariates of the\ntwo treatments and defines CATEs as a difference between the linear regression\nmodels. Then, we propose a method for consistently estimating CATEs even under\nhigh-dimensional and non-sparse parameters. In our study, we demonstrate that\ndesirable theoretical properties, such as consistency, remain attainable even\nwithout assuming sparsity explicitly if we assume a weaker assumption called\nimplicit sparsity originating from the definition of CATEs. In this assumption,\nwe suppose that parameters of linear models in potential outcomes can be\ndivided into treatment-specific and common parameters, where the\ntreatment-specific parameters take difference values between each linear\nregression model, while the common parameters remain identical. Thus, in a\ndifference between two linear regression models, the common parameters\ndisappear, leaving only differences in the treatment-specific parameters.\nConsequently, the non-zero parameters in CATEs correspond to the differences in\nthe treatment-specific parameters. Leveraging this assumption, we develop a\nLasso regression method specialized for CATE estimation and present that the\nestimator is consistent. Finally, we confirm the soundness of the proposed\nmethod by simulation studies."}, "http://arxiv.org/abs/2203.06685": {"title": "Encompassing Tests for Nonparametric Regressions", "link": "http://arxiv.org/abs/2203.06685", "description": "We set up a formal framework to characterize encompassing of nonparametric\nmodels through the L2 distance. We contrast it to previous literature on the\ncomparison of nonparametric regression models. We then develop testing\nprocedures for the encompassing hypothesis that are fully nonparametric. Our\ntest statistics depend on kernel regression, raising the issue of bandwidth's\nchoice. We investigate two alternative approaches to obtain a \"small bias\nproperty\" for our test statistics. 
We show the validity of a wild bootstrap\nmethod. We empirically study the use of a data-driven bandwidth and illustrate\nthe attractive features of our tests for small and moderate samples."}, "http://arxiv.org/abs/2212.11012": {"title": "Partly Linear Instrumental Variables Regressions without Smoothing on the Instruments", "link": "http://arxiv.org/abs/2212.11012", "description": "We consider a semiparametric partly linear model identified by instrumental\nvariables. We propose an estimation method that does not smooth on the\ninstruments and we extend the Landweber-Fridman regularization scheme to the\nestimation of this semiparametric model. We then show the asymptotic normality\nof the parametric estimator and obtain the convergence rate for the\nnonparametric estimator. Our estimator that does not smooth on the instruments\ncoincides with a typical estimator that does smooth on the instruments but\nkeeps the respective bandwidth fixed as the sample size increases. We propose a\ndata driven method for the selection of the regularization parameter, and in a\nsimulation study we show the attractive performance of our estimators."}, "http://arxiv.org/abs/2212.11112": {"title": "A Bootstrap Specification Test for Semiparametric Models with Generated Regressors", "link": "http://arxiv.org/abs/2212.11112", "description": "This paper provides a specification test for semiparametric models with\nnonparametrically generated regressors. Such variables are not observed by the\nresearcher but are nonparametrically identified and estimable. Applications of\nthe test include models with endogenous regressors identified by control\nfunctions, semiparametric sample selection models, or binary games with\nincomplete information. The statistic is built from the residuals of the\nsemiparametric model. A novel wild bootstrap procedure is shown to provide\nvalid critical values. We consider nonparametric estimators with an automatic\nbias correction that makes the test implementable without undersmoothing. In\nsimulations the test exhibits good small sample performances, and an\napplication to women's labor force participation decisions shows its\nimplementation in a real data context."}, "http://arxiv.org/abs/2305.07993": {"title": "The Nonstationary Newsvendor with (and without) Predictions", "link": "http://arxiv.org/abs/2305.07993", "description": "The classic newsvendor model yields an optimal decision for a \"newsvendor\"\nselecting a quantity of inventory, under the assumption that the demand is\ndrawn from a known distribution. Motivated by applications such as cloud\nprovisioning and staffing, we consider a setting in which newsvendor-type\ndecisions must be made sequentially, in the face of demand drawn from a\nstochastic process that is both unknown and nonstationary. All prior work on\nthis problem either (a) assumes that the level of nonstationarity is known, or\n(b) imposes additional statistical assumptions that enable accurate predictions\nof the unknown demand.\n\nWe study the Nonstationary Newsvendor, with and without predictions. We\nfirst, in the setting without predictions, design a policy which we prove (via\nmatching upper and lower bounds) achieves order-optimal regret -- ours is the\nfirst policy to accomplish this without being given the level of\nnonstationarity of the underlying demand. We then, for the first time,\nintroduce a model for generic (i.e. 
with no statistical assumptions)\npredictions with arbitrary accuracy, and propose a policy that incorporates\nthese predictions without being given their accuracy. We upper bound the regret\nof this policy, and show that it matches the best achievable regret had the\naccuracy of the predictions been known. Finally, we empirically validate our\nnew policy with experiments based on two real-world datasets containing\nthousands of time-series, showing that it succeeds in closing approximately 74%\nof the gap between the best approaches based on nonstationarity and predictions\nalone."}, "http://arxiv.org/abs/2310.16849": {"title": "Correlation structure analysis of the global agricultural futures market", "link": "http://arxiv.org/abs/2310.16849", "description": "This paper adopts the random matrix theory (RMT) to analyze the correlation\nstructure of the global agricultural futures market from 2000 to 2020. It is\nfound that the distribution of correlation coefficients is asymmetric and right\nskewed, and many eigenvalues of the correlation matrix deviate from the RMT\nprediction. The largest eigenvalue reflects a collective market effect common\nto all agricultural futures, the other largest deviating eigenvalues can be\nimplemented to identify futures groups, and there are modular structures based\non regional properties or agricultural commodities among the significant\nparticipants of their corresponding eigenvectors. Except for the smallest\neigenvalue, other smallest deviating eigenvalues represent the agricultural\nfutures pairs with highest correlations. This paper can be of reference and\nsignificance for using agricultural futures to manage risk and optimize asset\nallocation."}, "http://arxiv.org/abs/2310.16850": {"title": "The impact of the Russia-Ukraine conflict on the extreme risk spillovers between agricultural futures and spots", "link": "http://arxiv.org/abs/2310.16850", "description": "The ongoing Russia-Ukraine conflict between two major agricultural powers has\nposed significant threats and challenges to the global food system and world\nfood security. Focusing on the impact of the conflict on the global\nagricultural market, we propose a new analytical framework for tail dependence,\nand combine the Copula-CoVaR method with the ARMA-GARCH-skewed Student-t model\nto examine the tail dependence structure and extreme risk spillover between\nagricultural futures and spots over the pre- and post-outbreak periods. Our\nresults indicate that the tail dependence structures in the futures-spot\nmarkets of soybean, maize, wheat, and rice have all reacted to the\nRussia-Ukraine conflict. Furthermore, the outbreak of the conflict has\nintensified risks of the four agricultural markets in varying degrees, with the\nwheat market being affected the most. Additionally, all the agricultural\nfutures markets exhibit significant downside and upside risk spillovers to\ntheir corresponding spot markets before and after the outbreak of the conflict,\nwhereas the strengths of these extreme risk spillover effects demonstrate\nsignificant asymmetries at the directional (downside versus upside) and\ntemporal (pre-outbreak versus post-outbreak) levels."}, "http://arxiv.org/abs/2310.17278": {"title": "Dynamic Factor Models: a Genealogy", "link": "http://arxiv.org/abs/2310.17278", "description": "Dynamic factor models have been developed out of the need of analyzing and\nforecasting time series in increasingly high dimensions. 
While mathematical\nstatisticians faced with inference problems in high-dimensional observation\nspaces were focusing on the so-called spiked-model-asymptotics, econometricians\nadopted an entirely and considerably more effective asymptotic approach, rooted\nin the factor models originally considered in psychometrics. The so-called\ndynamic factor model methods, in two decades, has grown into a wide and\nsuccessful body of techniques that are widely used in central banks, financial\ninstitutions, economic and statistical institutes. The objective of this\nchapter is not an extensive survey of the topic but a sketch of its historical\ngrowth, with emphasis on the various assumptions and interpretations, and a\nfamily tree of its main variants."}, "http://arxiv.org/abs/2310.17473": {"title": "Bayesian SAR model with stochastic volatility and multiple time-varying weights", "link": "http://arxiv.org/abs/2310.17473", "description": "A novel spatial autoregressive model for panel data is introduced, which\nincorporates multilayer networks and accounts for time-varying relationships.\nMoreover, the proposed approach allows the structural variance to evolve\nsmoothly over time and enables the analysis of shock propagation in terms of\ntime-varying spillover effects. The framework is applied to analyse the\ndynamics of international relationships among the G7 economies and their impact\non stock market returns and volatilities. The findings underscore the\nsubstantial impact of cooperative interactions and highlight discernible\ndisparities in network exposure across G7 nations, along with nuanced patterns\nin direct and indirect spillover effects."}, "http://arxiv.org/abs/2310.17496": {"title": "Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach", "link": "http://arxiv.org/abs/2310.17496", "description": "In modern recommendation systems, the standard pipeline involves training\nmachine learning models on historical data to predict user behaviors and\nimprove recommendations continuously. However, these data training loops can\nintroduce interference in A/B tests, where data generated by control and\ntreatment algorithms, potentially with different distributions, are combined.\nTo address these challenges, we introduce a novel approach called weighted\ntraining. This approach entails training a model to predict the probability of\neach data point appearing in either the treatment or control data and\nsubsequently applying weighted losses during model training. We demonstrate\nthat this approach achieves the least variance among all estimators without\ncausing shifts in the training distributions. Through simulation studies, we\ndemonstrate the lower bias and variance of our approach compared to other\nmethods."}, "http://arxiv.org/abs/2310.17571": {"title": "Inside the black box: Neural network-based real-time prediction of US recessions", "link": "http://arxiv.org/abs/2310.17571", "description": "Feedforward neural network (FFN) and two specific types of recurrent neural\nnetwork, long short-term memory (LSTM) and gated recurrent unit (GRU), are used\nfor modeling US recessions in the period from 1967 to 2021. The estimated\nmodels are then employed to conduct real-time predictions of the Great\nRecession and the Covid-19 recession in US. Their predictive performances are\ncompared to those of the traditional linear models, the logistic regression\nmodel both with and without the ridge penalty. 
The out-of-sample performance\nsuggests the application of LSTM and GRU in the area of recession forecasting,\nespecially for the long-term forecasting tasks. They outperform other types of\nmodels across 5 forecasting horizons with respect to different types of\nstatistical performance metrics. Shapley additive explanations (SHAP) method is\napplied to the fitted GRUs across different forecasting horizons to gain\ninsight into the feature importance. The evaluation of predictor importance\ndiffers between the GRU and ridge logistic regression models, as reflected in\nthe variable order determined by SHAP values. When considering the top 5\npredictors, key indicators such as the S\\&P 500 index, real GDP, and private\nresidential fixed investment consistently appear for short-term forecasts (up\nto 3 months). In contrast, for longer-term predictions (6 months or more), the\nterm spread and producer price index become more prominent. These findings are\nsupported by both local interpretable model-agnostic explanations (LIME) and\nmarginal effects."}, "http://arxiv.org/abs/2205.07836": {"title": "2SLS with Multiple Treatments", "link": "http://arxiv.org/abs/2205.07836", "description": "We study what two-stage least squares (2SLS) identifies in models with\nmultiple treatments under treatment effect heterogeneity. Two conditions are\nshown to be necessary and sufficient for the 2SLS to identify positively\nweighted sums of agent-specific effects of each treatment: average conditional\nmonotonicity and no cross effects. Our identification analysis allows for any\nnumber of treatments, any number of continuous or discrete instruments, and the\ninclusion of covariates. We provide testable implications and present\ncharacterizations of choice behavior implied by our identification conditions\nand discuss how the conditions can be tested empirically."}, "http://arxiv.org/abs/2308.12485": {"title": "Optimal Shrinkage Estimation of Fixed Effects in Linear Panel Data Models", "link": "http://arxiv.org/abs/2308.12485", "description": "Shrinkage methods are frequently used to estimate fixed effects to reduce the\nnoisiness of the least squares estimators. However, widely used shrinkage\nestimators guarantee such noise reduction only under strong distributional\nassumptions. I develop an estimator for the fixed effects that obtains the best\npossible mean squared error within a class of shrinkage estimators. This class\nincludes conventional shrinkage estimators and the optimality does not require\ndistributional assumptions. The estimator has an intuitive form and is easy to\nimplement. Moreover, the fixed effects are allowed to vary with time and to be\nserially correlated, and the shrinkage optimally incorporates the underlying\ncorrelation structure in this case. In such a context, I also provide a method\nto forecast fixed effects one period ahead."}, "http://arxiv.org/abs/2310.18504": {"title": "Nonparametric Doubly Robust Identification of Causal Effects of a Continuous Treatment using Discrete Instruments", "link": "http://arxiv.org/abs/2310.18504", "description": "Many empirical applications estimate causal effects of a continuous\nendogenous variable (treatment) using a binary instrument. Estimation is\ntypically done through linear 2SLS. This approach requires a mean treatment\nchange and causal interpretation requires the LATE-type monotonicity in the\nfirst stage. 
An alternative approach is to explore distributional changes in\nthe treatment, where the first-stage restriction is treatment rank similarity.\nWe propose causal estimands that are doubly robust in that they are valid under\neither of these two restrictions. We apply the doubly robust estimation to\nestimate the impacts of sleep on well-being. Our results corroborate the usual\n2SLS estimates."}, "http://arxiv.org/abs/2310.18563": {"title": "Covariate Balancing and the Equivalence of Weighting and Doubly Robust Estimators of Average Treatment Effects", "link": "http://arxiv.org/abs/2310.18563", "description": "We show that when the propensity score is estimated using a suitable\ncovariate balancing procedure, the commonly used inverse probability weighting\n(IPW) estimator, augmented inverse probability weighting (AIPW) with linear\nconditional mean, and inverse probability weighted regression adjustment\n(IPWRA) with linear conditional mean are all numerically the same for\nestimating the average treatment effect (ATE) or the average treatment effect\non the treated (ATT). Further, suitably chosen covariate balancing weights are\nautomatically normalized, which means that normalized and unnormalized versions\nof IPW and AIPW are identical. For estimating the ATE, the weights that achieve\nthe algebraic equivalence of IPW, AIPW, and IPWRA are based on propensity\nscores estimated using the inverse probability tilting (IPT) method of Graham,\nPinto and Egel (2012). For the ATT, the weights are obtained using the\ncovariate balancing propensity score (CBPS) method developed in Imai and\nRatkovic (2014). These equivalences also make covariate balancing methods\nattractive when the treatment is confounded and one is interested in the local\naverage treatment effect."}, "http://arxiv.org/abs/2310.18836": {"title": "Design of Cluster-Randomized Trials with Cross-Cluster Interference", "link": "http://arxiv.org/abs/2310.18836", "description": "Cluster-randomized trials often involve units that are irregularly\ndistributed in space without well-separated communities. In these settings,\ncluster construction is a critical aspect of the design due to the potential\nfor cross-cluster interference. The existing literature relies on partial\ninterference models, which take clusters as given and assume no cross-cluster\ninterference. We relax this assumption by allowing interference to decay with\ngeographic distance between units. This induces a bias-variance trade-off:\nconstructing fewer, larger clusters reduces bias due to interference but\nincreases variance. We propose new estimators that exclude units most\npotentially impacted by cross-cluster interference and show that this\nsubstantially reduces asymptotic bias relative to conventional\ndifference-in-means estimators. We then study the design of clusters to\noptimize the estimators' rates of convergence. We provide formal justification\nfor a new design that chooses the number of clusters to balance the asymptotic\nbias and variance of our estimators and uses unsupervised learning to automate\ncluster construction."}, "http://arxiv.org/abs/2310.19200": {"title": "Popularity, face and voice: Predicting and interpreting livestreamers' retail performance using machine learning techniques", "link": "http://arxiv.org/abs/2310.19200", "description": "Livestreaming commerce, a hybrid of e-commerce and self-media, has expanded\nthe broad spectrum of traditional sales performance determinants. 
To\ninvestigate the factors that contribute to the success of livestreaming\ncommerce, we construct a longitudinal firm-level database with 19,175\nobservations, covering an entire livestreaming subsector. By comparing the\nforecasting accuracy of eight machine learning models, we identify a random\nforest model that provides the best prediction of gross merchandise volume\n(GMV). Furthermore, we utilize explainable artificial intelligence to open the\nblack-box of machine learning model, discovering four new facts: 1) variables\nrepresenting the popularity of livestreaming events are crucial features in\npredicting GMV. And voice attributes are more important than appearance; 2)\npopularity is a major determinant of sales for female hosts, while vocal\naesthetics is more decisive for their male counterparts; 3) merits and\ndrawbacks of the voice are not equally valued in the livestreaming market; 4)\nbased on changes of comments, page views and likes, sales growth can be divided\ninto three stages. Finally, we innovatively propose a 3D-SHAP diagram that\ndemonstrates the relationship between predicting feature importance, target\nvariable, and its predictors. This diagram identifies bottlenecks for both\nbeginner and top livestreamers, providing insights into ways to optimize their\nsales performance."}, "http://arxiv.org/abs/2310.19543": {"title": "Spectral identification and estimation of mixed causal-noncausal invertible-noninvertible models", "link": "http://arxiv.org/abs/2310.19543", "description": "This paper introduces new techniques for estimating, identifying and\nsimulating mixed causal-noncausal invertible-noninvertible models. We propose a\nframework that integrates high-order cumulants, merging both the spectrum and\nbispectrum into a single estimation function. The model that most adequately\nrepresents the data under the assumption that the error term is i.i.d. is\nselected. Our Monte Carlo study reveals unbiased parameter estimates and a high\nfrequency with which correct models are identified. We illustrate our strategy\nthrough an empirical analysis of returns from 24 Fama-French emerging market\nstock portfolios. The findings suggest that each portfolio displays noncausal\ndynamics, producing white noise residuals devoid of conditional heteroscedastic\neffects."}, "http://arxiv.org/abs/2310.19557": {"title": "A Bayesian Markov-switching SAR model for time-varying cross-price spillovers", "link": "http://arxiv.org/abs/2310.19557", "description": "The spatial autoregressive (SAR) model is extended by introducing a Markov\nswitching dynamics for the weight matrix and spatial autoregressive parameter.\nThe framework enables the identification of regime-specific connectivity\npatterns and strengths and the study of the spatiotemporal propagation of\nshocks in a system with a time-varying spatial multiplier matrix. The proposed\nmodel is applied to disaggregated CPI data from 15 EU countries to examine\ncross-price dependencies. 
The analysis identifies distinct connectivity\nstructures and spatial weights across the states, which capture shifts in\nconsumer behaviour, with marked cross-country differences in the spillover from\none price category to another."}, "http://arxiv.org/abs/2310.19747": {"title": "Characteristics of price related fluctuations in Non-Fungible Token (NFT) market", "link": "http://arxiv.org/abs/2310.19747", "description": "Non-fungible token (NFT) market is a new trading invention based on the\nblockchain technology which parallels the cryptocurrency market. In the present\nwork we study capitalization, floor price, the number of transactions, the\ninter-transaction times, and the transaction volume value of a few selected\npopular token collections. The results show that the fluctuations of all these\nquantities are characterized by heavy-tailed probability distribution\nfunctions, in most cases well described by the stretched exponentials, with a\ntrace of power-law scaling at times, long-range memory, and in several cases\neven the fractal organization of fluctuations, mostly restricted to the larger\nfluctuations, however. We conclude that the NFT market - even though young and\ngoverned by a somewhat different mechanisms of trading - shares several\nstatistical properties with the regular financial markets. However, some\ndifferences are visible in the specific quantitative indicators."}, "http://arxiv.org/abs/2310.19788": {"title": "Locally Optimal Best Arm Identification with a Fixed Budget", "link": "http://arxiv.org/abs/2310.19788", "description": "This study investigates the problem of identifying the best treatment arm, a\ntreatment arm with the highest expected outcome. We aim to identify the best\ntreatment arm with a lower probability of misidentification, which has been\nexplored under various names across numerous research fields, including\n\\emph{best arm identification} (BAI) and ordinal optimization. In our\nexperiments, the number of treatment-allocation rounds is fixed. In each round,\na decision-maker allocates a treatment arm to an experimental unit and observes\na corresponding outcome, which follows a Gaussian distribution with a variance\ndifferent among treatment arms. At the end of the experiment, we recommend one\nof the treatment arms as an estimate of the best treatment arm based on the\nobservations. The objective of the decision-maker is to design an experiment\nthat minimizes the probability of misidentifying the best treatment arm. With\nthis objective in mind, we develop lower bounds for the probability of\nmisidentification under the small-gap regime, where the gaps of the expected\noutcomes between the best and suboptimal treatment arms approach zero. Then,\nassuming that the variances are known, we design the\nGeneralized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, which is\nan extension of the Neyman allocation proposed by Neyman (1934) and the\nUniform-EBA strategy proposed by Bubeck et al. (2011). For the GNA-EBA\nstrategy, we show that the strategy is asymptotically optimal because its\nprobability of misidentification aligns with the lower bounds as the sample\nsize approaches infinity under the small-gap regime. 
We refer to such optimal\nstrategies as locally asymptotic optimal because their performance aligns with\nthe lower bounds within restricted situations characterized by the small-gap\nregime."}, "http://arxiv.org/abs/2009.00553": {"title": "A Vector Monotonicity Assumption for Multiple Instruments", "link": "http://arxiv.org/abs/2009.00553", "description": "When a researcher combines multiple instrumental variables for a single\nbinary treatment, the monotonicity assumption of the local average treatment\neffects (LATE) framework can become restrictive: it requires that all units\nshare a common direction of response even when separate instruments are shifted\nin opposing directions. What I call vector monotonicity, by contrast, simply\nassumes treatment uptake to be monotonic in all instruments, representing a\nspecial case of the partial monotonicity assumption introduced by Mogstad et\nal. (2021). I characterize the class of causal parameters that are point\nidentified under vector monotonicity, when the instruments are binary. This\nclass includes, for example, the average treatment effect among units that are\nin any way responsive to the collection of instruments, or those that are\nresponsive to a given subset of them. The identification results are\nconstructive and yield a simple estimator for the identified treatment effect\nparameters. An empirical application revisits the labor market returns to\ncollege."}, "http://arxiv.org/abs/2109.08109": {"title": "Standard Errors for Calibrated Parameters", "link": "http://arxiv.org/abs/2109.08109", "description": "Calibration, the practice of choosing the parameters of a structural model to\nmatch certain empirical moments, can be viewed as minimum distance estimation.\nExisting standard error formulas for such estimators require a consistent\nestimate of the correlation structure of the empirical moments, which is often\nunavailable in practice. Instead, the variances of the individual empirical\nmoments are usually readily estimable. Using only these variances, we derive\nconservative standard errors and confidence intervals for the structural\nparameters that are valid even under the worst-case correlation structure. In\nthe over-identified case, we show that the moment weighting scheme that\nminimizes the worst-case estimator variance amounts to a moment selection\nproblem with a simple solution. Finally, we develop tests of over-identifying\nor parameter restrictions. We apply our methods empirically to a model of menu\ncost pricing for multi-product firms and to a heterogeneous agent New Keynesian\nmodel."}, "http://arxiv.org/abs/2211.16714": {"title": "Incorporating Prior Knowledge of Latent Group Structure in Panel Data Models", "link": "http://arxiv.org/abs/2211.16714", "description": "The assumption of group heterogeneity has become popular in panel data\nmodels. We develop a constrained Bayesian grouped estimator that exploits\nresearchers' prior beliefs on groups in a form of pairwise constraints,\nindicating whether a pair of units is likely to belong to a same group or\ndifferent groups. We propose a prior to incorporate the pairwise constraints\nwith varying degrees of confidence. The whole framework is built on the\nnonparametric Bayesian method, which implicitly specifies a distribution over\nthe group partitions, and so the posterior analysis takes the uncertainty of\nthe latent group structure into account. 
Monte Carlo experiments reveal that\nadding prior knowledge yields more accurate estimates of coefficient and scores\npredictive gains over alternative estimators. We apply our method to two\nempirical applications. In a first application to forecasting U.S. CPI\ninflation, we illustrate that prior knowledge of groups improves density\nforecasts when the data is not entirely informative. A second application\nrevisits the relationship between a country's income and its democratic\ntransition; we identify heterogeneous income effects on democracy with five\ndistinct groups over ninety countries."}, "http://arxiv.org/abs/2307.01357": {"title": "Adaptive Principal Component Regression with Applications to Panel Data", "link": "http://arxiv.org/abs/2307.01357", "description": "Principal component regression (PCR) is a popular technique for fixed-design\nerror-in-variables regression, a generalization of the linear regression\nsetting in which the observed covariates are corrupted with random noise. We\nprovide the first time-uniform finite sample guarantees for online\n(regularized) PCR whenever data is collected adaptively. Since the proof\ntechniques for analyzing PCR in the fixed design setting do not readily extend\nto the online setting, our results rely on adapting tools from modern\nmartingale concentration to the error-in-variables setting. As an application\nof our bounds, we provide a framework for experiment design in panel data\nsettings when interventions are assigned adaptively. Our framework may be\nthought of as a generalization of the synthetic control and synthetic\ninterventions frameworks, where data is collected via an adaptive intervention\nassignment policy."}, "http://arxiv.org/abs/2309.06693": {"title": "Stochastic Learning of Semiparametric Monotone Index Models with Large Sample Size", "link": "http://arxiv.org/abs/2309.06693", "description": "I study the estimation of semiparametric monotone index models in the\nscenario where the number of observation points $n$ is extremely large and\nconventional approaches fail to work due to heavy computational burdens.\nMotivated by the mini-batch gradient descent algorithm (MBGD) that is widely\nused as a stochastic optimization tool in the machine learning field, I\nproposes a novel subsample- and iteration-based estimation procedure. In\nparticular, starting from any initial guess of the true parameter, I\nprogressively update the parameter using a sequence of subsamples randomly\ndrawn from the data set whose sample size is much smaller than $n$. The update\nis based on the gradient of some well-chosen loss function, where the\nnonparametric component is replaced with its Nadaraya-Watson kernel estimator\nbased on subsamples. My proposed algorithm essentially generalizes MBGD\nalgorithm to the semiparametric setup. Compared with full-sample-based method,\nthe new method reduces the computational time by roughly $n$ times if the\nsubsample size and the kernel function are chosen properly, so can be easily\napplied when the sample size $n$ is large. Moreover, I show that if I further\nconduct averages across the estimators produced during iterations, the\ndifference between the average estimator and full-sample-based estimator will\nbe $1/\\sqrt{n}$-trivial. Consequently, the average estimator is\n$1/\\sqrt{n}$-consistent and asymptotically normally distributed. 
In other\nwords, the new estimator substantially improves the computational speed, while\nat the same time maintains the estimation accuracy."}, "http://arxiv.org/abs/2310.16945": {"title": "Causal Q-Aggregation for CATE Model Selection", "link": "http://arxiv.org/abs/2310.16945", "description": "Accurate estimation of conditional average treatment effects (CATE) is at the\ncore of personalized decision making. While there is a plethora of models for\nCATE estimation, model selection is a nontrivial task, due to the fundamental\nproblem of causal inference. Recent empirical work provides evidence in favor\nof proxy loss metrics with double robust properties and in favor of model\nensembling. However, theoretical understanding is lacking. Direct application\nof prior theoretical work leads to suboptimal oracle model selection rates due\nto the non-convexity of the model selection problem. We provide regret rates\nfor the major existing CATE ensembling approaches and propose a new CATE model\nensembling approach based on Q-aggregation using the doubly robust loss. Our\nmain result shows that causal Q-aggregation achieves statistically optimal\noracle model selection regret rates of $\\frac{\\log(M)}{n}$ (with $M$ models and\n$n$ samples), with the addition of higher-order estimation error terms related\nto products of errors in the nuisance functions. Crucially, our regret rate\ndoes not require that any of the candidate CATE models be close to the truth.\nWe validate our new method on many semi-synthetic datasets and also provide\nextensions of our work to CATE model selection with instrumental variables and\nunobserved confounding."}, "http://arxiv.org/abs/2310.19992": {"title": "Robust Estimation of Realized Correlation: New Insight about Intraday Fluctuations in Market Betas", "link": "http://arxiv.org/abs/2310.19992", "description": "Time-varying volatility is an inherent feature of most economic time-series,\nwhich causes standard correlation estimators to be inconsistent. The quadrant\ncorrelation estimator is consistent but very inefficient. We propose a novel\nsubsampled quadrant estimator that improves efficiency while preserving\nconsistency and robustness. This estimator is particularly well-suited for\nhigh-frequency financial data and we apply it to a large panel of US stocks.\nOur empirical analysis sheds new light on intra-day fluctuations in market\nbetas by decomposing them into time-varying correlations and relative\nvolatility changes. Our results show that intraday variation in betas is\nprimarily driven by intraday variation in correlations."}, "http://arxiv.org/abs/2006.07691": {"title": "Synthetic Interventions", "link": "http://arxiv.org/abs/2006.07691", "description": "Consider a setting with $N$ heterogeneous units (e.g., individuals,\nsub-populations) and $D$ interventions (e.g., socio-economic policies). Our\ngoal is to learn the expected potential outcome associated with every\nintervention on every unit, totaling $N \\times D$ causal parameters. Towards\nthis, we present a causal framework, synthetic interventions (SI), to infer\nthese $N \\times D$ causal parameters while only observing each of the $N$ units\nunder at most two interventions, independent of $D$. This can be significant as\nthe number of interventions, i.e., level of personalization, grows. 
Under a\nnovel tensor factor model across units, outcomes, and interventions, we prove\nan identification result for each of these $N \\times D$ causal parameters,\nestablish finite-sample consistency of our estimator along with asymptotic\nnormality under additional conditions. Importantly, our estimator also allows\nfor latent confounders that determine how interventions are assigned. The\nestimator is further furnished with data-driven tests to examine its\nsuitability. Empirically, we validate our framework through a large-scale A/B\ntest performed on an e-commerce platform. We believe our results could have\nimplications for the design of data-efficient randomized experiments (e.g.,\nrandomized control trials) with heterogeneous units and multiple interventions."}, "http://arxiv.org/abs/2207.04481": {"title": "Detecting Grouped Local Average Treatment Effects and Selecting True Instruments", "link": "http://arxiv.org/abs/2207.04481", "description": "Under an endogenous binary treatment with heterogeneous effects and multiple\ninstruments, we propose a two-step procedure for identifying complier groups\nwith identical local average treatment effects (LATE) despite relying on\ndistinct instruments, even if several instruments violate the identifying\nassumptions. We use the fact that the LATE is homogeneous for instruments which\n(i) satisfy the LATE assumptions (instrument validity and treatment\nmonotonicity in the instrument) and (ii) generate identical complier groups in\nterms of treatment propensities given the respective instruments. We propose a\ntwo-step procedure, where we first cluster the propensity scores in the first\nstep and find groups of IVs with the same reduced form parameters in the second\nstep. Under the plurality assumption that within each set of instruments with\nidentical treatment propensities, instruments truly satisfying the LATE\nassumptions are the largest group, our procedure permits identifying these true\ninstruments in a data driven way. We show that our procedure is consistent and\nprovides consistent and asymptotically normal estimators of underlying LATEs.\nWe also provide a simulation study investigating the finite sample properties\nof our approach and an empirical application investigating the effect of\nincarceration on recidivism in the US with judge assignments serving as\ninstruments."}, "http://arxiv.org/abs/2304.09078": {"title": "Club coefficients in the UEFA Champions League: Time for shift to an Elo-based formula", "link": "http://arxiv.org/abs/2304.09078", "description": "One of the most popular club football tournaments, the UEFA Champions League,\nwill see a fundamental reform from the 2024/25 season: the traditional group\nstage will be replaced by one league where each of the 36 teams plays eight\nmatches. To guarantee that the opponents of the clubs are of the same strength\nin the new design, it is crucial to forecast the performance of the teams\nbefore the tournament as well as possible. This paper investigates whether the\ncurrently used rating of the teams, the UEFA club coefficient, can be improved\nby taking the games played in the national leagues into account. According to\nour logistic regression models, a variant of the Elo method provides a higher\naccuracy in terms of explanatory power in the Champions League matches. 
The\nUnion of European Football Associations (UEFA) is encouraged to follow the\nexample of the FIFA World Ranking and reform the calculation of the club\ncoefficients in order to avoid unbalanced schedules in the novel tournament\nformat of the Champions League."}, "http://arxiv.org/abs/2308.13564": {"title": "SGMM: Stochastic Approximation to Generalized Method of Moments", "link": "http://arxiv.org/abs/2308.13564", "description": "We introduce a new class of algorithms, Stochastic Generalized Method of\nMoments (SGMM), for estimation and inference on (overidentified) moment\nrestriction models. Our SGMM is a novel stochastic approximation alternative to\nthe popular Hansen (1982) (offline) GMM, and offers fast and scalable\nimplementation with the ability to handle streaming datasets in real time. We\nestablish the almost sure convergence, and the (functional) central limit\ntheorem for the inefficient online 2SLS and the efficient SGMM. Moreover, we\npropose online versions of the Durbin-Wu-Hausman and Sargan-Hansen tests that\ncan be seamlessly integrated within the SGMM framework. Extensive Monte Carlo\nsimulations show that as the sample size increases, the SGMM matches the\nstandard (offline) GMM in terms of estimation accuracy and gains over\ncomputational efficiency, indicating its practical value for both large-scale\nand online datasets. We demonstrate the efficacy of our approach by a proof of\nconcept using two well known empirical examples with large sample sizes."}, "http://arxiv.org/abs/2311.00013": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "http://arxiv.org/abs/2311.00013", "description": "We propose two approaches to estimate semiparametric discrete choice models\nfor bundles. Our first approach is a kernel-weighted rank estimator based on a\nmatching-based identification strategy. We establish its complete asymptotic\nproperties and prove the validity of the nonparametric bootstrap for inference.\nWe then introduce a new multi-index least absolute deviations (LAD) estimator\nas an alternative, of which the main advantage is its capacity to estimate\npreference parameters on both alternative- and agent-specific regressors. Both\nmethods can account for arbitrary correlation in disturbances across choices,\nwith the former also allowing for interpersonal heteroskedasticity. We also\ndemonstrate that the identification strategy underlying these procedures can be\nextended naturally to panel data settings, producing an analogous localized\nmaximum score estimator and a LAD estimator for estimating bundle choice models\nwith fixed effects. We derive the limiting distribution of the former and\nverify the validity of the numerical bootstrap as an inference tool. All our\nproposed methods can be applied to general multi-index models. Monte Carlo\nexperiments show that they perform well in finite samples."}, "http://arxiv.org/abs/2311.00439": {"title": "Bounds on Treatment Effects under Stochastic Monotonicity Assumption in Sample Selection Models", "link": "http://arxiv.org/abs/2311.00439", "description": "This paper discusses the partial identification of treatment effects in\nsample selection models when the exclusion restriction fails and the\nmonotonicity assumption in the selection effect does not hold exactly, both of\nwhich are key challenges in applying the existing methodologies. 
Our approach\nbuilds on the procedure of Lee (2009), who considers partial identification under\nthe monotonicity assumption, but we assume only a stochastic (and weaker)\nversion of monotonicity, which depends on a prespecified parameter $\\vartheta$\nthat represents researchers' belief in the plausibility of the monotonicity.\nUnder this assumption, we show that we can still obtain useful bounds even when\nthe monotonic behavioral model does not strictly hold. Our procedure is useful\nwhen empirical researchers anticipate that a small fraction of the population\nwill not behave monotonically in selection; it can also be an effective tool\nfor performing sensitivity analysis or examining the identification power of\nthe monotonicity assumption. Our procedure is easily extendable to other\nrelated settings; we also provide the identification result of the marginal\ntreatment effects setting as an important application. Moreover, we show that\nthe bounds can still be obtained even in the absence of the knowledge of\n$\\vartheta$ under the semiparametric models that nest the classical probit and\nlogit selection models."}, "http://arxiv.org/abs/2311.00577": {"title": "Personalized Assignment to One of Many Treatment Arms via Regularized and Clustered Joint Assignment Forests", "link": "http://arxiv.org/abs/2311.00577", "description": "We consider learning personalized assignments to one of many treatment arms\nfrom a randomized controlled trial. Standard methods that estimate\nheterogeneous treatment effects separately for each arm may perform poorly in\nthis case due to excess variance. We instead propose methods that pool\ninformation across treatment arms: First, we consider a regularized\nforest-based assignment algorithm based on greedy recursive partitioning that\nshrinks effect estimates across arms. Second, we augment our algorithm by a\nclustering scheme that combines treatment arms with consistently similar\noutcomes. In a simulation study, we compare the performance of these approaches\nto predicting arm-wise outcomes separately, and document gains from directly\noptimizing the treatment assignment with regularization and clustering. In a\ntheoretical model, we illustrate how a high number of treatment arms makes\nfinding the best arm hard, while we can achieve sizable utility gains from\npersonalization by regularized optimization."}, "http://arxiv.org/abs/2311.00662": {"title": "On Gaussian Process Priors in Conditional Moment Restriction Models", "link": "http://arxiv.org/abs/2311.00662", "description": "This paper studies quasi-Bayesian estimation and uncertainty quantification\nfor an unknown function that is identified by a nonparametric conditional\nmoment restriction model. We derive contraction rates for a class of Gaussian\nprocess priors and provide conditions under which a Bernstein-von Mises theorem\nholds for the quasi-posterior distribution. As a consequence, we show that\noptimally-weighted quasi-Bayes credible sets have exact asymptotic frequentist\ncoverage. This extends classical results on the frequentist validity of\noptimally weighted quasi-Bayes credible sets for parametric generalized method\nof moments (GMM) models."}, "http://arxiv.org/abs/2209.14502": {"title": "Fast Inference for Quantile Regression with Tens of Millions of Observations", "link": "http://arxiv.org/abs/2209.14502", "description": "Big data analytics has opened new avenues in economic research, but the\nchallenge of analyzing datasets with tens of millions of observations is\nsubstantial. 
Conventional econometric methods based on extreme estimators\nrequire large amounts of computing resources and memory, which are often not\nreadily available. In this paper, we focus on linear quantile regression\napplied to \"ultra-large\" datasets, such as U.S. decennial censuses. A fast\ninference framework is presented, utilizing stochastic subgradient descent\n(S-subGD) updates. The inference procedure handles cross-sectional data\nsequentially: (i) updating the parameter estimate with each incoming \"new\nobservation\", (ii) aggregating it as a $\\textit{Polyak-Ruppert}$ average, and\n(iii) computing a pivotal statistic for inference using only a solution path.\nThe methodology draws from time-series regression to create an asymptotically\npivotal statistic through random scaling. Our proposed test statistic is\ncalculated in a fully online fashion and critical values are calculated without\nresampling. We conduct extensive numerical studies to showcase the\ncomputational merits of our proposed inference. For inference problems as large\nas $(n, d) \\sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the\nnumber of regressors, our method generates new insights, surpassing current\ninference methods in computation. Our method specifically reveals trends in the\ngender gap in the U.S. college wage premium using millions of observations,\nwhile controlling over $10^3$ covariates to mitigate confounding effects."}, "http://arxiv.org/abs/2311.00905": {"title": "Data-Driven Fixed-Point Tuning for Truncated Realized Variations", "link": "http://arxiv.org/abs/2311.00905", "description": "Many methods for estimating integrated volatility and related functionals of\nsemimartingales in the presence of jumps require specification of tuning\nparameters for their use. In much of the available theory, tuning parameters\nare assumed to be deterministic, and their values are specified only up to\nasymptotic constraints. However, in empirical work and in simulation studies,\nthey are typically chosen to be random and data-dependent, with explicit\nchoices in practice relying on heuristics alone. In this paper, we consider\nnovel data-driven tuning procedures for the truncated realized variations of a\nsemimartingale with jumps, which are based on a type of stochastic fixed-point\niteration. Being effectively automated, our approach alleviates the need for\ndelicate decision-making regarding tuning parameters, and can be implemented\nusing information regarding sampling frequency alone. We show our methods can\nlead to asymptotically efficient estimation of integrated volatility and\nexhibit superior finite-sample performance compared to popular alternatives in\nthe literature."}, "http://arxiv.org/abs/2311.01217": {"title": "The learning effects of subsidies to bundled goods: a semiparametric approach", "link": "http://arxiv.org/abs/2311.01217", "description": "Can temporary subsidies to bundles induce long-run changes in demand due to\nlearning about the relative quality of one of its constituent goods? This paper\nprovides theoretical and experimental evidence on the role of this mechanism.\nTheoretically, we introduce a model where an agent learns about the quality of\nan innovation on an essential good through consumption. 
Our results show that\nthe contemporaneous effect of a one-off subsidy to a bundle that contains the\ninnovation may be decomposed into a direct price effect, and an indirect\nlearning motive, whereby an agent leverages the discount to increase the\ninformational bequest left to her future selves. We then assess the predictions\nof our theory in a randomised experiment in a ridesharing platform. The\nexperiment provided two-week discounts for car trips integrating with a train\nor metro station (a bundle). Given the heavy-tailed nature of our data, we\nfollow \\cite{Athey2023} and, motivated by our theory, propose a semiparametric\nmodel for treatment effects that enables the construction of more efficient\nestimators. We introduce a statistically efficient estimator for our model by\nrelying on L-moments, a robust alternative to standard moments. Our estimator\nimmediately yields a specification test for the semiparametric model; moreover,\nin our adopted parametrisation, it can be easily computed through generalized\nleast squares. Our empirical results indicate that a two-week 50\\% discount on\ncar trips integrating with train/metro leads to a contemporaneous increase in\nthe demand for integrated rides, and, consistent with our learning model,\npersistent changes in the mean and dispersion of nonintegrated rides. These\neffects persist for over four months after the discount. A simple calibration\nof our model shows that around 40\\% to 50\\% of the estimated contemporaneous\nincrease in integrated rides may be attributed to a learning motive."}, "http://arxiv.org/abs/2110.10650": {"title": "Attention Overload", "link": "http://arxiv.org/abs/2110.10650", "description": "We introduce an Attention Overload Model that captures the idea that\nalternatives compete for the decision maker's attention, and hence the\nattention that each alternative receives decreases as the choice problem\nbecomes larger. We provide testable implications on the observed choice\nbehavior that can be used to (point or partially) identify the decision maker's\npreference and attention frequency. We then enhance our attention overload\nmodel to accommodate heterogeneous preferences based on the idea of List-based\nAttention Overload, where alternatives are presented to the decision makers as\na list that correlates with both heterogeneous preferences and random\nattention. We show that preference and attention frequencies are (point or\npartially) identifiable under nonparametric assumptions on the list and\nattention formation mechanisms, even when the true underlying list is unknown\nto the researcher. Building on our identification results, we develop\neconometric methods for estimation and inference."}, "http://arxiv.org/abs/2112.13398": {"title": "Long Story Short: Omitted Variable Bias in Causal Machine Learning", "link": "http://arxiv.org/abs/2112.13398", "description": "We derive general, yet simple, sharp bounds on the size of the omitted\nvariable bias for a broad class of causal parameters that can be identified as\nlinear functionals of the conditional expectation function of the outcome. Such\nfunctionals encompass many of the traditional targets of investigation in\ncausal inference studies, such as, for example, (weighted) average of potential\noutcomes, average treatment effects (including subgroup effects, such as the\neffect on the treated), (weighted) average derivatives, and policy effects from\nshifts in covariate distribution -- all for general, nonparametric causal\nmodels. 
Our construction relies on the Riesz-Frechet representation of the\ntarget functional. Specifically, we show how the bound on the bias depends only\non the additional variation that the latent variables create both in the\noutcome and in the Riesz representer for the parameter of interest. Moreover,\nin many important cases (e.g., average treatment effects and average\nderivatives) the bound is shown to depend on easily interpretable quantities\nthat measure the explanatory power of the omitted variables. Therefore, simple\nplausibility judgments on the maximum explanatory power of omitted variables\n(in explaining treatment and outcome variation) are sufficient to place overall\nbounds on the size of the bias. Furthermore, we use debiased machine learning\nto provide flexible and efficient statistical inference on learnable components\nof the bounds. Finally, empirical examples demonstrate the usefulness of the\napproach."}, "http://arxiv.org/abs/2112.03626": {"title": "Phase transitions in nonparametric regressions", "link": "http://arxiv.org/abs/2112.03626", "description": "When the unknown regression function of a single variable is known to have\nderivatives up to the $(\\gamma+1)$th order bounded in absolute values by a\ncommon constant everywhere or a.e. (i.e., $(\\gamma+1)$th degree of smoothness),\nthe minimax optimal rate of the mean integrated squared error (MISE) is stated\nas $\\left(\\frac{1}{n}\\right)^{\\frac{2\\gamma+2}{2\\gamma+3}}$ in the literature.\nThis paper shows that: (i) if $n\\leq\\left(\\gamma+1\\right)^{2\\gamma+3}$, the\nminimax optimal MISE rate is $\\frac{\\log n}{n\\log(\\log n)}$ and the optimal\ndegree of smoothness to exploit is roughly $\\max\\left\\{ \\left\\lfloor \\frac{\\log\nn}{2\\log\\left(\\log n\\right)}\\right\\rfloor ,\\,1\\right\\} $; (ii) if\n$n>\\left(\\gamma+1\\right)^{2\\gamma+3}$, the minimax optimal MISE rate is\n$\\left(\\frac{1}{n}\\right)^{\\frac{2\\gamma+2}{2\\gamma+3}}$ and the optimal degree\nof smoothness to exploit is $\\gamma+1$. The fundamental contribution of this\npaper is a set of metric entropy bounds we develop for smooth function classes.\nSome of our bounds are original, and some of them improve and/or generalize the\nones in the literature (e.g., Kolmogorov and Tikhomirov, 1959). Our metric\nentropy bounds allow us to show phase transitions in the minimax optimal MISE\nrates associated with some commonly seen smoothness classes as well as\nnon-standard smoothness classes, and can also be of independent interest\noutside the nonparametric regression problems."}, "http://arxiv.org/abs/2206.04157": {"title": "Inference for Matched Tuples and Fully Blocked Factorial Designs", "link": "http://arxiv.org/abs/2206.04157", "description": "This paper studies inference in randomized controlled trials with multiple\ntreatments, where treatment status is determined according to a \"matched\ntuples\" design. Here, by a matched tuples design, we mean an experimental\ndesign where units are sampled i.i.d. from the population of interest, grouped\ninto \"homogeneous\" blocks with cardinality equal to the number of treatments,\nand finally, within each block, each treatment is assigned exactly once\nuniformly at random. We first study estimation and inference for matched tuples\ndesigns in the general setting where the parameter of interest is a vector of\nlinear contrasts over the collection of average potential outcomes for each\ntreatment. 
Parameters of this form include standard average treatment effects\nused to compare one treatment relative to another, but also include parameters\nwhich may be of interest in the analysis of factorial designs. We first\nestablish conditions under which a sample analogue estimator is asymptotically\nnormal and construct a consistent estimator of its corresponding asymptotic\nvariance. Combining these results establishes the asymptotic exactness of tests\nbased on these estimators. In contrast, we show that, for two common testing\nprocedures based on t-tests constructed from linear regressions, one test is\ngenerally conservative while the other is generally invalid. We go on to apply our\nresults to study the asymptotic properties of what we call \"fully-blocked\" 2^K\nfactorial designs, which are simply matched tuples designs applied to a full\nfactorial experiment. Leveraging our previous results, we establish that our\nestimator achieves a lower asymptotic variance under the fully-blocked design\nthan that under any stratified factorial design which stratifies the\nexperimental sample into a finite number of \"large\" strata. A simulation study\nand empirical application illustrate the practical relevance of our results."}, "http://arxiv.org/abs/2303.02716": {"title": "Deterministic, quenched and annealed parameter estimation for heterogeneous network models", "link": "http://arxiv.org/abs/2303.02716", "description": "At least two, different approaches to define and solve statistical models for\nthe analysis of economic systems exist: the typical, econometric one,\ninterpreting the Gravity Model specification as the expected link weight of an\narbitrary probability distribution, and the one rooted into statistical\nphysics, constructing maximum-entropy distributions constrained to satisfy\ncertain network properties. In a couple of recent, companion papers they have\nbeen successfully integrated within the framework induced by the constrained\nminimisation of the Kullback-Leibler divergence: specifically, two, broad\nclasses of models have been devised, i.e. the integrated and the conditional\nones, defined by different, probabilistic rules to place links, load them with\nweights and turn them into proper, econometric prescriptions. Still, the\nrecipes adopted by the two approaches to estimate the parameters entering into\nthe definition of each model differ. In econometrics, a likelihood that\ndecouples the binary and weighted parts of a model, treating a network as\ndeterministic, is typically maximised; to restore its random character, two\nalternatives exist: either solving the likelihood maximisation on each\nconfiguration of the ensemble and taking the average of the parameters\nafterwards or taking the average of the likelihood function and maximising the\nlatter one. The difference between these approaches lies in the order in which\nthe operations of averaging and maximisation are taken - a difference that is\nreminiscent of the quenched and annealed ways of averaging out the disorder in\nspin glasses. 
The results of the present contribution, devoted to comparing\nthese recipes in the case of continuous, conditional network models, indicate\nthat the annealed estimation recipe represents the best alternative to the\ndeterministic one."}, "http://arxiv.org/abs/2307.01284": {"title": "Does regional variation in wage levels identify the effects of a national minimum wage?", "link": "http://arxiv.org/abs/2307.01284", "description": "This paper examines the identification assumptions underlying two types of\nestimators of the causal effects of minimum wages based on regional variation\nin wage levels: the \"effective minimum wage\" and the \"fraction affected/gap\"\ndesigns. For the effective minimum wage design, I show that the identification\nassumptions emphasized by Lee (1999) are crucial for unbiased estimation but\ndifficult to satisfy in empirical applications for reasons arising from\neconomic theory. For the fraction affected design at the region level, I show\nthat economic factors such as a common trend in the dispersion of worker\nproductivity or regional convergence in GDP per capita may lead to violations\nof the \"parallel trends\" identifying assumption. The paper suggests ways to\nincrease the likelihood of detecting those issues when implementing checks for\nparallel pre-trends. I also show that this design may be subject to biases\narising from the misspecification of the treatment intensity variable,\nespecially when the minimum wage strongly affects employment and wages."}, "http://arxiv.org/abs/2311.02196": {"title": "Pooled Bewley Estimator of Long Run Relationships in Dynamic Heterogenous Panels", "link": "http://arxiv.org/abs/2311.02196", "description": "Using a transformation of the autoregressive distributed lag model due to\nBewley, a novel pooled Bewley (PB) estimator of long-run coefficients for\ndynamic panels with heterogeneous short-run dynamics is proposed. The PB\nestimator is directly comparable to the widely used Pooled Mean Group (PMG)\nestimator, and is shown to be consistent and asymptotically normal. Monte Carlo\nsimulations show good small sample performance of PB compared to the existing\nestimators in the literature, namely PMG, panel dynamic OLS (PDOLS), and panel\nfully-modified OLS (FMOLS). Application of two bias-correction methods and a\nbootstrapping of critical values to conduct inference robust to cross-sectional\ndependence of errors are also considered. The utility of the PB estimator is\nillustrated in an empirical application to the aggregate consumption function."}, "http://arxiv.org/abs/2311.02299": {"title": "The Fragility of Sparsity", "link": "http://arxiv.org/abs/2311.02299", "description": "We show, using three empirical applications, that linear regression estimates\nwhich rely on the assumption of sparsity are fragile in two ways. First, we\ndocument that different choices of the regressor matrix that don't impact\nordinary least squares (OLS) estimates, such as the choice of baseline category\nwith categorical controls, can move sparsity-based estimates two standard\nerrors or more. Second, we develop two tests of the sparsity assumption based\non comparing sparsity-based estimators with OLS. The tests tend to reject the\nsparsity assumption in all three applications. 
Unless the number of regressors\nis comparable to or exceeds the sample size, OLS yields more robust results at\nlittle efficiency cost."}, "http://arxiv.org/abs/2311.02467": {"title": "Individualized Policy Evaluation and Learning under Clustered Network Interference", "link": "http://arxiv.org/abs/2311.02467", "description": "While there now exists a large literature on policy evaluation and learning,\nmuch of prior work assumes that the treatment assignment of one unit does not\naffect the outcome of another unit. Unfortunately, ignoring interference may\nlead to biased policy evaluation and yield ineffective learned policies. For\nexample, treating influential individuals who have many friends can generate\npositive spillover effects, thereby improving the overall performance of an\nindividualized treatment rule (ITR). We consider the problem of evaluating and\nlearning an optimal ITR under clustered network (or partial) interference where\nclusters of units are sampled from a population and units may influence one\nanother within each cluster. Under this model, we propose an estimator that can\nbe used to evaluate the empirical performance of an ITR. We show that this\nestimator is substantially more efficient than the standard inverse probability\nweighting estimator, which does not impose any assumption about spillover\neffects. We derive the finite-sample regret bound for a learned ITR, showing\nthat the use of our efficient evaluation estimator leads to the improved\nperformance of learned policies. Finally, we conduct simulation and empirical\nstudies to illustrate the advantages of the proposed methodology."}, "http://arxiv.org/abs/2311.02789": {"title": "Estimation of Semiparametric Multi-Index Models Using Deep Neural Networks", "link": "http://arxiv.org/abs/2311.02789", "description": "In this paper, we consider estimation and inference for both the multi-index\nparameters and the link function involved in a class of semiparametric\nmulti-index models via deep neural networks (DNNs). We contribute to the design\nof DNN by i) providing more transparency for practical implementation, ii)\ndefining different types of sparsity, iii) showing the differentiability, iv)\npointing out the set of effective parameters, and v) offering a new variant of\nrectified linear activation function (ReLU), etc. Asymptotic properties for the\njoint estimates of both the index parameters and the link functions are\nestablished, and a feasible procedure for the purpose of inference is also\nproposed. We conduct extensive numerical studies to examine the finite-sample\nperformance of the estimation methods, and we also evaluate the empirical\nrelevance and applicability of the proposed models and estimation methods to\nreal data."}, "http://arxiv.org/abs/2201.07880": {"title": "Deep self-consistent learning of local volatility", "link": "http://arxiv.org/abs/2201.07880", "description": "We present an algorithm for the calibration of local volatility from market\noption prices through deep self-consistent learning, by approximating both\nmarket option prices and local volatility using deep neural networks,\nrespectively. Our method uses the initial-boundary value problem of the\nunderlying Dupire's partial differential equation solved by the parameterized\noption prices to bring corrections to the parameterization in a self-consistent\nway. 
By exploiting the differentiability of the neural networks, we can\nevaluate Dupire's equation locally at each strike-maturity pair; while by\nexploiting their continuity, we sample strike-maturity pairs uniformly from a\ngiven domain, going beyond the discrete points where the options are quoted.\nMoreover, the absence of arbitrage opportunities is imposed by penalizing an\nassociated loss function as a soft constraint. For comparison with existing\napproaches, the proposed method is tested on both synthetic and market option\nprices, which shows an improved performance in terms of reduced interpolation\nand reprice errors, as well as the smoothness of the calibrated local\nvolatility. An ablation study has been performed, asserting the robustness and\nsignificance of the proposed method."}, "http://arxiv.org/abs/2204.12023": {"title": "A One-Covariate-at-a-Time Method for Nonparametric Additive Models", "link": "http://arxiv.org/abs/2204.12023", "description": "This paper proposes a one-covariate-at-a-time multiple testing (OCMT)\napproach to choose significant variables in high-dimensional nonparametric\nadditive regression models. Similarly to Chudik, Kapetanios and Pesaran (2018),\nwe consider the statistical significance of individual nonparametric additive\ncomponents one at a time and take into account the multiple testing nature of\nthe problem. One-stage and multiple-stage procedures are both considered. The\nformer works well in terms of the true positive rate only if the marginal\neffects of all signals are strong enough; the latter helps to pick up hidden\nsignals that have weak marginal effects. Simulations demonstrate the good\nfinite sample performance of the proposed procedures. As an empirical\napplication, we use the OCMT procedure on a dataset we extracted from the\nLongitudinal Survey on Rural Urban Migration in China. We find that our\nprocedure works well in terms of the out-of-sample forecast root mean square\nerrors, compared with competing methods."}, "http://arxiv.org/abs/2209.03259": {"title": "A Ridge-Regularised Jackknifed Anderson-Rubin Test", "link": "http://arxiv.org/abs/2209.03259", "description": "We consider hypothesis testing in instrumental variable regression models\nwith few included exogenous covariates but many instruments -- possibly more\nthan the number of observations. We show that a ridge-regularised version of\nthe jackknifed Anderson Rubin (1949, henceforth AR) test controls asymptotic\nsize in the presence of heteroskedasticity, and when the instruments may be\narbitrarily weak. Asymptotic size control is established under weaker\nassumptions than those imposed for recently proposed jackknifed AR tests in the\nliterature. Furthermore, ridge-regularisation extends the scope of jackknifed\nAR tests to situations in which there are more instruments than observations.\nMonte-Carlo simulations indicate that our method has favourable finite-sample\nsize and power properties compared to recently proposed alternative approaches\nin the literature. 
An empirical application on the elasticity of substitution\nbetween immigrants and natives in the US illustrates the usefulness of the\nproposed method for practitioners."}, "http://arxiv.org/abs/2212.09193": {"title": "Identification of time-varying counterfactual parameters in nonlinear panel models", "link": "http://arxiv.org/abs/2212.09193", "description": "We develop a general framework for the identification of counterfactual\nparameters in a class of nonlinear semiparametric panel models with fixed\neffects and time effects. Our method applies to models for discrete outcomes\n(e.g., two-way fixed effects binary choice) or continuous outcomes (e.g.,\ncensored regression), with discrete or continuous regressors. Our results do\nnot require parametric assumptions on the error terms or time-homogeneity on\nthe outcome equation. Our main results focus on static models, with a set of\nresults applying to models without any exogeneity conditions. We show that the\nsurvival distribution of counterfactual outcomes is identified (point or\npartial) in this class of models. This parameter is a building block for most\npartial and marginal effects of interest in applied practice that are based on\nthe average structural function as defined by Blundell and Powell (2003, 2004).\nTo the best of our knowledge, ours are the first results on average partial and\nmarginal effects for binary choice and ordered choice models with two-way fixed\neffects and non-logistic errors."}, "http://arxiv.org/abs/2306.04135": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "http://arxiv.org/abs/2306.04135", "description": "We propose two approaches to estimate semiparametric discrete choice models\nfor bundles. Our first approach is a kernel-weighted rank estimator based on a\nmatching-based identification strategy. We establish its complete asymptotic\nproperties and prove the validity of the nonparametric bootstrap for inference.\nWe then introduce a new multi-index least absolute deviations (LAD) estimator\nas an alternative, of which the main advantage is its capacity to estimate\npreference parameters on both alternative- and agent-specific regressors. Both\nmethods can account for arbitrary correlation in disturbances across choices,\nwith the former also allowing for interpersonal heteroskedasticity. We also\ndemonstrate that the identification strategy underlying these procedures can be\nextended naturally to panel data settings, producing an analogous localized\nmaximum score estimator and a LAD estimator for estimating bundle choice models\nwith fixed effects. We derive the limiting distribution of the former and\nverify the validity of the numerical bootstrap as an inference tool. All our\nproposed methods can be applied to general multi-index models. Monte Carlo\nexperiments show that they perform well in finite samples."}, "http://arxiv.org/abs/2311.03471": {"title": "Optimal Estimation Methodologies for Panel Data Regression Models", "link": "http://arxiv.org/abs/2311.03471", "description": "This survey study discusses the main aspects of optimal estimation methodologies\nfor panel data regression models. In particular, we present current\nmethodological developments for modeling stationary panel data as well as\nrobust methods for estimation and inference in nonstationary panel data\nregression models. 
Some applications from the network econometrics and high\ndimensional statistics literature are also discussed within a stationary time\nseries environment."}, "http://arxiv.org/abs/2311.04073": {"title": "Debiased Fixed Effects Estimation of Binary Logit Models with Three-Dimensional Panel Data", "link": "http://arxiv.org/abs/2311.04073", "description": "Naive maximum likelihood estimation of binary logit models with fixed effects\nleads to unreliable inference due to the incidental parameter problem. We study\nthe case of three-dimensional panel data, where the model includes three sets\nof additive and overlapping unobserved effects. This encompasses models for\nnetwork panel data, where senders and receivers maintain bilateral\nrelationships over time, and fixed effects account for unobserved heterogeneity\nat the sender-time, receiver-time, and sender-receiver levels. In an asymptotic\nframework, where all three panel dimensions grow large at constant relative\nrates, we characterize the leading bias of the naive estimator. The inference\nproblem we identify is particularly severe, as it is not possible to balance\nthe order of the bias and the standard deviation. As a consequence, the naive\nestimator has a degenerating asymptotic distribution, which exacerbates the\ninference problem relative to other fixed effects estimators studied in the\nliterature. To resolve the inference problem, we derive explicit expressions to\ndebias the fixed effects estimator."}, "http://arxiv.org/abs/2207.09246": {"title": "Asymptotic Properties of Endogeneity Corrections Using Nonlinear Transformations", "link": "http://arxiv.org/abs/2207.09246", "description": "This paper considers a linear regression model with an endogenous regressor\nwhich arises from a nonlinear transformation of a latent variable. It is shown\nthat the corresponding coefficient can be consistently estimated without\nexternal instruments by adding a rank-based transformation of the regressor to\nthe model and performing standard OLS estimation. In contrast to other\napproaches, our nonparametric control function approach does not rely on a\nconformably specified copula. Furthermore, the approach allows for the presence\nof additional exogenous regressors which may be (linearly) correlated with the\nendogenous regressor(s). Consistency and asymptotic normality of the estimator\nare proved and the estimator is compared with copula based approaches by means\nof Monte Carlo simulations. An empirical application on wage data of the US\ncurrent population survey demonstrates the usefulness of our method."}, "http://arxiv.org/abs/2301.10643": {"title": "Automatic Locally Robust Estimation with Generated Regressors", "link": "http://arxiv.org/abs/2301.10643", "description": "Many economic and causal parameters of interest depend on generated\nregressors. Examples include structural parameters in models with endogenous\nvariables estimated by control functions and in models with sample selection,\ntreatment effect estimation with propensity score matching, and marginal\ntreatment effects. Inference with generated regressors is complicated by the\nvery complex expression for influence functions and asymptotic variances. To\naddress this problem, we propose Automatic Locally Robust/debiased GMM\nestimators in a general setting with generated regressors. Importantly, we\nallow for the generated regressors to be generated from machine learners, such\nas Random Forest, Neural Nets, Boosting, and many others. 
We use our results to\nconstruct novel Doubly Robust and Locally Robust estimators for the\nCounterfactual Average Structural Function and Average Partial Effects in\nmodels with endogeneity and sample selection, respectively. We provide\nsufficient conditions for the asymptotic normality of our debiased GMM\nestimators and investigate their finite sample performance through Monte Carlo\nsimulations."}, "http://arxiv.org/abs/2303.11399": {"title": "How Much Should We Trust Instrumental Variable Estimates in Political Science? Practical Advice Based on Over 60 Replicated Studies", "link": "http://arxiv.org/abs/2303.11399", "description": "Instrumental variable (IV) strategies are widely used in political science to\nestablish causal relationships. However, the identifying assumptions required\nby an IV design are demanding, and it remains challenging for researchers to\nassess their validity. In this paper, we replicate 67 papers published in three\ntop journals in political science during 2010-2022 and identify several\ntroubling patterns. First, researchers often overestimate the strength of their\nIVs due to non-i.i.d. errors, such as a clustering structure. Second, the most\ncommonly used t-test for the two-stage-least-squares (2SLS) estimates often\nseverely underestimates uncertainty. Using more robust inferential methods, we\nfind that around 19-30% of the 2SLS estimates in our sample are underpowered.\nThird, in the majority of the replicated studies, the 2SLS estimates are much\nlarger than the ordinary-least-squares estimates, and their ratio is negatively\ncorrelated with the strength of the IVs in studies where the IVs are not\nexperimentally generated, suggesting potential violations of unconfoundedness\nor the exclusion restriction. To help researchers avoid these pitfalls, we\nprovide a checklist for better practice."}, "http://arxiv.org/abs/2208.02028": {"title": "Bootstrap inference in the presence of bias", "link": "http://arxiv.org/abs/2208.02028", "description": "We consider bootstrap inference for estimators which are (asymptotically)\nbiased. We show that, even when the bias term cannot be consistently estimated,\nvalid inference can be obtained by proper implementations of the bootstrap.\nSpecifically, we show that the prepivoting approach of Beran (1987, 1988),\noriginally proposed to deliver higher-order refinements, restores bootstrap\nvalidity by transforming the original bootstrap p-value into an asymptotically\nuniform random variable. We propose two different implementations of\nprepivoting (plug-in and double bootstrap), and provide general high-level\nconditions that imply validity of bootstrap inference. To illustrate the\npractical relevance and implementation of our results, we discuss five\nexamples: (i) inference on a target parameter based on model averaging; (ii)\nridge-type regularized estimators; (iii) nonparametric regression; (iv) a\nlocation model for infinite variance data; and (v) dynamic panel data models."}, "http://arxiv.org/abs/2304.01273": {"title": "Heterogeneity-robust granular instruments", "link": "http://arxiv.org/abs/2304.01273", "description": "Granular instrumental variables (GIV) has experienced sharp growth in\nempirical macro-finance. The methodology's rise showcases granularity's\npotential for identification in a wide set of economic environments, like the\nestimation of spillovers and demand systems. 
I propose a new estimator--called\nrobust granular instrumental variables (RGIV)--that allows researchers to study\nunit-level heterogeneity in spillovers within GIV's framework. In contrast to\nGIV, RGIV also allows for unknown shock variances and does not require skewness\nof the size distribution of units. I also develop a test of overidentifying\nrestrictions that evaluates RGIV's compatibility with the data, a parameter\nrestriction test that evaluates the appropriateness of the homogeneous\nspillovers assumption, and extend the framework to allow for observable\nexplanatory variables. Applied to the Euro area, I find strong evidence of\ncountry-level heterogeneity in sovereign yield spillovers. In simulations, I\nshow that RGIV produces reliable and informative confidence intervals."}, "http://arxiv.org/abs/2309.11387": {"title": "Identifying Causal Effects in Information Provision Experiments", "link": "http://arxiv.org/abs/2309.11387", "description": "Information provision experiments are a popular way to study causal effects\nof beliefs on behavior. Researchers estimate these effects using TSLS. I show\nthat existing TSLS specifications do not estimate the average partial effect;\nthey have weights proportional to belief updating in the first-stage. If people\nwhose decisions depend on their beliefs gather information before the\nexperiment, the information treatment may shift beliefs more for people with\nweak belief effects. This attenuates TSLS estimates. I propose researchers use\na local-least-squares (LLS) estimator that I show consistently estimates the\naverage partial effect (APE) under Bayesian updating, and apply it to Settele\n(2022)."}, "http://arxiv.org/abs/2107.13737": {"title": "Design-Robust Two-Way-Fixed-Effects Regression For Panel Data", "link": "http://arxiv.org/abs/2107.13737", "description": "We propose a new estimator for average causal effects of a binary treatment\nwith panel data in settings with general treatment patterns. Our approach\naugments the popular two-way-fixed-effects specification with unit-specific\nweights that arise from a model for the assignment mechanism. We show how to\nconstruct these weights in various settings, including the staggered adoption\nsetting, where units opt into the treatment sequentially but permanently. The\nresulting estimator converges to an average (over units and time) treatment\neffect under the correct specification of the assignment model, even if the\nfixed effect model is misspecified. We show that our estimator is more robust\nthan the conventional two-way estimator: it remains consistent if either the\nassignment mechanism or the two-way regression model is correctly specified. In\naddition, the proposed estimator performs better than the two-way-fixed-effect\nestimator if the outcome model and assignment mechanism are locally\nmisspecified. This strong double robustness property underlines and quantifies\nthe benefits of modeling the assignment process and motivates using our\nestimator in practice. We also discuss an extension of our estimator to handle\ndynamic treatment effects."}, "http://arxiv.org/abs/2302.09756": {"title": "Identification-robust inference for the LATE with high-dimensional covariates", "link": "http://arxiv.org/abs/2302.09756", "description": "This paper presents an inference method for the local average treatment\neffect (LATE) in the presence of high-dimensional covariates, irrespective of\nthe strength of identification. 
We propose a novel high-dimensional conditional\ntest statistic with uniformly correct asymptotic size. We provide an\neasy-to-implement algorithm to infer the high-dimensional LATE by inverting our\ntest statistic and employing the double/debiased machine learning method.\nSimulations indicate that our test is robust against both weak identification\nand high dimensionality concerning size control and power performance,\noutperforming other conventional tests. Applying the proposed method to\nrailroad and population data to study the effect of railroad access on urban\npopulation growth, we observe that our methodology yields confidence intervals\nthat are 49% to 92% shorter than conventional results, depending on\nspecifications."}, "http://arxiv.org/abs/2311.05883": {"title": "Time-Varying Identification of Monetary Policy Shocks", "link": "http://arxiv.org/abs/2311.05883", "description": "We propose a new Bayesian heteroskedastic Markov-switching structural vector\nautoregression with data-driven time-varying identification. The model selects\nalternative exclusion restrictions over time and, as a condition for the\nsearch, allows one to verify identification through heteroskedasticity within each\nregime. Based on four alternative monetary policy rules, we show that a monthly\nsix-variable system supports time variation in US monetary policy shock\nidentification. In the sample-dominating first regime, systematic monetary\npolicy follows a Taylor rule extended by the term spread and is effective in\ncurbing inflation. In the second regime, occurring after 2000 and gaining more\npersistence after the global financial and COVID crises, the Fed acts according\nto a money-augmented Taylor rule. This regime's unconventional monetary policy\nprovides economic stimulus, features the liquidity effect, and is complemented\nby a pure term spread shock. Absent the specific monetary policy of the second\nregime, inflation would be over one percentage point higher on average after\n2008."}, "http://arxiv.org/abs/2309.14160": {"title": "Unified Inference for Dynamic Quantile Predictive Regression", "link": "http://arxiv.org/abs/2309.14160", "description": "This paper develops unified asymptotic distribution theory for dynamic\nquantile predictive regressions which is useful when examining quantile\npredictability in stock returns under possible presence of nonstationarity."}, "http://arxiv.org/abs/2311.06256": {"title": "From Deep Filtering to Deep Econometrics", "link": "http://arxiv.org/abs/2311.06256", "description": "Calculating true volatility is an essential task for option pricing and risk\nmanagement. However, it is made difficult by market microstructure noise.\nParticle filtering has been proposed to solve this problem as it has favorable\nstatistical properties, but relies on assumptions about underlying market\ndynamics. Machine learning methods have also been proposed but lack\ninterpretability, and often lag in performance. In this paper we implement the\nSV-PF-RNN: a hybrid neural network and particle filter architecture. Our\nSV-PF-RNN is designed specifically with stochastic volatility estimation in\nmind. We then show that it can improve on the performance of a basic particle\nfilter."}, "http://arxiv.org/abs/2311.06831": {"title": "Quasi-Bayes in Latent Variable Models", "link": "http://arxiv.org/abs/2311.06831", "description": "Latent variable models are widely used to account for unobserved determinants\nof economic behavior. 
Traditional nonparametric methods to estimate latent\nheterogeneity do not scale well into multidimensional settings. Distributional\nrestrictions alleviate tractability concerns but may impart non-trivial\nmisspecification bias. Motivated by these concerns, this paper introduces a\nquasi-Bayes approach to estimate a large class of multidimensional latent\nvariable models. Our approach to quasi-Bayes is novel in that we center it\naround relating the characteristic function of observables to the distribution\nof unobservables. We propose a computationally attractive class of priors that\nare supported on Gaussian mixtures and derive contraction rates for a variety\nof latent variable models."}, "http://arxiv.org/abs/2311.06891": {"title": "Design-based Estimation Theory for Complex Experiments", "link": "http://arxiv.org/abs/2311.06891", "description": "This paper considers the estimation of treatment effects in randomized\nexperiments with complex experimental designs, including cases with\ninterference between units. We develop a design-based estimation theory for\narbitrary experimental designs. Our theory facilitates the analysis of many\ndesign-estimator pairs that researchers commonly employ in practice and provides\nprocedures to consistently estimate asymptotic variance bounds. We propose new\nclasses of estimators with favorable asymptotic properties from a design-based\npoint of view. In addition, we propose a scalar measure of experimental\ncomplexity which can be linked to the design-based variance of the estimators.\nWe demonstrate the performance of our estimators using simulated datasets based\non an actual network experiment studying the effect of social networks on\ninsurance adoptions."}, "http://arxiv.org/abs/2311.07067": {"title": "High Dimensional Binary Choice Model with Unknown Heteroskedasticity or Instrumental Variables", "link": "http://arxiv.org/abs/2311.07067", "description": "This paper proposes a new method for estimating high-dimensional binary\nchoice models. The model we consider is semiparametric, placing no\ndistributional assumptions on the error term, allowing for heteroskedastic\nerrors, and permitting endogenous regressors. Our proposed approaches extend\nthe special regressor estimator originally proposed by Lewbel (2000). This\nestimator becomes impractical in high-dimensional settings due to the curse of\ndimensionality associated with high-dimensional conditional density estimation.\nTo overcome this challenge, we introduce an innovative data-driven dimension\nreduction method for nonparametric kernel estimators, which constitutes the\nmain innovation of this work. The method combines distance covariance-based\nscreening with cross-validation (CV) procedures, rendering the special\nregressor estimation feasible in high dimensions. Using the new feasible\nconditional density estimator, we address the variable and moment (instrumental\nvariable) selection problems for these models. We apply penalized least squares\n(LS) and Generalized Method of Moments (GMM) estimators with a smoothly clipped\nabsolute deviation (SCAD) penalty. A comprehensive analysis of the oracle and\nasymptotic properties of these estimators is provided. 
Monte Carlo simulations\nare employed to demonstrate the effectiveness of our proposed procedures in\nfinite sample scenarios."}, "http://arxiv.org/abs/2311.07243": {"title": "Optimal Estimation of Large-Dimensional Nonlinear Factor Models", "link": "http://arxiv.org/abs/2311.07243", "description": "This paper studies optimal estimation of large-dimensional nonlinear factor\nmodels. The key challenge is that the observed variables are possibly nonlinear\nfunctions of some latent variables where the functional forms are left\nunspecified. A local principal component analysis method is proposed to\nestimate the factor structure and recover information on latent variables and\nlatent functions, which combines $K$-nearest neighbors matching and principal\ncomponent analysis. Large-sample properties are established, including a sharp\nbound on the matching discrepancy of nearest neighbors, sup-norm error bounds\nfor estimated local factors and factor loadings, and the uniform convergence\nrate of the factor structure estimator. Under mild conditions our estimator of\nthe latent factor structure can achieve the optimal rate of uniform convergence\nfor nonparametric regression. The method is illustrated with a Monte Carlo\nexperiment and an empirical application studying the effect of tax cuts on\neconomic growth."}, "http://arxiv.org/abs/1902.09608": {"title": "On Binscatter", "link": "http://arxiv.org/abs/1902.09608", "description": "Binscatter is a popular method for visualizing bivariate relationships and\nconducting informal specification testing. We study the properties of this\nmethod formally and develop enhanced visualization and econometric binscatter\ntools. These include estimating conditional means with optimal binning and\nquantifying uncertainty. We also highlight a methodological problem related to\ncovariate adjustment that can yield incorrect conclusions. We revisit two\napplications using our methodology and find substantially different results\nrelative to those obtained using prior informal binscatter methods. General\npurpose software in Python, R, and Stata is provided. Our technical work is of\nindependent interest for the nonparametric partition-based estimation\nliterature."}, "http://arxiv.org/abs/2109.09043": {"title": "Composite Likelihood for Stochastic Migration Model with Unobserved Factor", "link": "http://arxiv.org/abs/2109.09043", "description": "We introduce the conditional Maximum Composite Likelihood (MCL) estimation\nmethod for the stochastic factor ordered Probit model of credit rating\ntransitions of firms. This model is recommended for internal credit risk\nassessment procedures in banks and financial institutions under the Basel III\nregulations. Its exact likelihood function involves a high-dimensional\nintegral, which can be approximated numerically before maximization. However,\nthe estimated migration risk and required capital tend to be sensitive to the\nquality of this approximation, potentially leading to statistical regulatory\narbitrage. The proposed conditional MCL estimator circumvents this problem and\nmaximizes the composite log-likelihood of the factor ordered Probit model. We\npresent three conditional MCL estimators of different complexity and examine\ntheir consistency and asymptotic normality when n and T tend to infinity. The\nperformance of these estimators at finite T is examined and compared with a\ngranularity-based approach in a simulation study. 
The use of the MCL estimator\nis also illustrated in an empirical application."}, "http://arxiv.org/abs/2111.01301": {"title": "Asymptotic in a class of network models with an increasing sub-Gamma degree sequence", "link": "http://arxiv.org/abs/2111.01301", "description": "For differential privacy under sub-Gamma noise, we derive the\nasymptotic properties of a class of network models with binary values with a\ngeneral link function. In this paper, we release the degree sequences of the\nbinary networks under a general noisy mechanism with the discrete Laplace\nmechanism as a special case. We establish asymptotic results, including both\nconsistency and asymptotic normality of the parameter estimator when the\nnumber of parameters goes to infinity in a class of network models. Simulations\nand a real data example are provided to illustrate the asymptotic results."}, "http://arxiv.org/abs/2309.08982": {"title": "Least squares estimation in nonlinear cohort panels with learning from experience", "link": "http://arxiv.org/abs/2309.08982", "description": "We discuss techniques of estimation and inference for nonlinear cohort panels\nwith learning from experience, showing, inter alia, the consistency and\nasymptotic normality of the nonlinear least squares estimator employed in the\nseminal paper by Malmendier and Nagel (2016, QJE). Potential pitfalls for\nhypothesis testing are identified and solutions proposed. Monte Carlo\nsimulations verify the properties of the estimator and corresponding test\nstatistics in finite samples, while an application to a panel of survey\nexpectations demonstrates the usefulness of the theory developed."}, "http://arxiv.org/abs/2311.08218": {"title": "Estimating Conditional Value-at-Risk with Nonstationary Quantile Predictive Regression Models", "link": "http://arxiv.org/abs/2311.08218", "description": "This paper develops an asymptotic distribution theory for a two-stage\ninstrumentation estimation approach in quantile predictive regressions when\nboth generated covariates and persistent predictors are used. The generated\ncovariates are obtained from an auxiliary quantile regression model and our\nmain interest is the robust estimation and inference of the primary quantile\npredictive regression in which this generated covariate is added to the set of\nnonstationary regressors. We find that the proposed doubly IVX estimator is\nrobust to the abstract degree of persistence regardless of the presence of\nthe generated regressor obtained from the first stage procedure. The asymptotic\nproperties of the two-stage IVX estimator such as mixed Gaussianity are\nestablished while the asymptotic covariance matrix is adjusted to account for\nthe first-step estimation error."}, "http://arxiv.org/abs/2302.00469": {"title": "Regression adjustment in randomized controlled trials with many covariates", "link": "http://arxiv.org/abs/2302.00469", "description": "This paper is concerned with estimation and inference on average treatment\neffects in randomized controlled trials when researchers observe potentially\nmany covariates. By employing Neyman's (1923) finite population perspective, we\npropose a bias-corrected regression adjustment estimator using cross-fitting,\nand show that the proposed estimator has favorable properties over existing\nalternatives. 
For inference, we derive the first- and second-order terms in the\nstochastic component of the regression adjustment estimators, study\nhigher-order properties of the existing inference methods, and propose a\nbias-corrected version of the HC3 standard error. The proposed methods readily\nextend to stratified experiments with large strata. Simulation studies show our\ncross-fitted estimator, combined with the bias-corrected HC3, delivers precise\npoint estimates and robust size control over a wide range of DGPs. To\nillustrate, the proposed methods are applied to a real dataset on randomized\nexperiments of incentives and services for college achievement following\nAngrist, Lang, and Oreopoulos (2009)."}, "http://arxiv.org/abs/2311.08958": {"title": "Locally Asymptotically Minimax Statistical Treatment Rules Under Partial Identification", "link": "http://arxiv.org/abs/2311.08958", "description": "Policymakers often desire a statistical treatment rule (STR) that determines\na treatment assignment rule deployed in a future population from available\ndata. With true knowledge of the data generating process, the average\ntreatment effect (ATE) is the key quantity characterizing the optimal treatment\nrule. Unfortunately, the ATE is often not point identified but partially\nidentified. Presuming the partial identification of the ATE, this study\nconducts a local asymptotic analysis and develops the locally asymptotically\nminimax (LAM) STR. The analysis does not assume full differentiability but\nonly directional differentiability of the boundary functions of the\nidentification region of the ATE. Accordingly, the study shows that the LAM STR\ndiffers from the plug-in STR. A simulation study also demonstrates that the LAM\nSTR outperforms the plug-in STR."}, "http://arxiv.org/abs/2311.08963": {"title": "Incorporating Preferences Into Treatment Assignment Problems", "link": "http://arxiv.org/abs/2311.08963", "description": "This study investigates the problem of individualizing treatment allocations\nusing stated preferences for treatments. If individuals know in advance how the\nassignment will be individualized based on their stated preferences, they may\nstate false preferences. We derive an individualized treatment rule (ITR) that\nmaximizes welfare when individuals strategically state their preferences. We\nalso show that the optimal ITR is strategy-proof, that is, individuals do not\nhave a strong incentive to lie even if they know the optimal ITR a priori.\nConstructing the optimal ITR requires information on the distribution of true\npreferences and the average treatment effect conditioned on true preferences.\nIn practice, the information must be identified and estimated from the data. As\ntrue preferences are hidden information, the identification is not\nstraightforward. We discuss two experimental designs that allow the\nidentification: strictly strategy-proof randomized controlled trials and doubly\nrandomized preference trials. Under the presumption that data comes from one of\nthese experiments, we develop data-dependent procedures for determining the ITR,\nthat is, statistical treatment rules (STRs). The maximum regret of the proposed\nSTRs converges to zero at a rate of the square root of the sample size. 
An\nempirical application demonstrates our proposed STRs."}, "http://arxiv.org/abs/2007.04267": {"title": "Difference-in-Differences Estimators of Intertemporal Treatment Effects", "link": "http://arxiv.org/abs/2007.04267", "description": "We study treatment-effect estimation using panel data. The treatment may be\nnon-binary, non-absorbing, and the outcome may be affected by treatment lags.\nWe make a parallel-trends assumption, and propose event-study estimators of the\neffect of being exposed to a weakly higher treatment dose for $\\ell$ periods.\nWe also propose normalized estimators, that estimate a weighted average of the\neffects of the current treatment and its lags. We also analyze commonly-used\ntwo-way-fixed-effects regressions. Unlike our estimators, they can be biased in\nthe presence of heterogeneous treatment effects. A local-projection version of\nthose regressions is biased even with homogeneous effects."}, "http://arxiv.org/abs/2007.10432": {"title": "Treatment Effects with Targeting Instruments", "link": "http://arxiv.org/abs/2007.10432", "description": "Multivalued treatments are commonplace in applications. We explore the use of\ndiscrete-valued instruments to control for selection bias in this setting. Our\ndiscussion revolves around the concept of targeting instruments: which\ninstruments target which treatments. It allows us to establish conditions under\nwhich counterfactual averages and treatment effects are point- or\npartially-identified for composite complier groups. We illustrate the\nusefulness of our framework by applying it to data from the Head Start Impact\nStudy. Under a plausible positive selection assumption, we derive informative\nbounds that suggest less beneficial effects of Head Start expansions than the\nparametric estimates of Kline and Walters (2016)."}, "http://arxiv.org/abs/2310.00786": {"title": "Semidiscrete optimal transport with unknown costs", "link": "http://arxiv.org/abs/2310.00786", "description": "Semidiscrete optimal transport is a challenging generalization of the\nclassical transportation problem in linear programming. The goal is to design a\njoint distribution for two random variables (one continuous, one discrete) with\nfixed marginals, in a way that minimizes expected cost. We formulate a novel\nvariant of this problem in which the cost functions are unknown, but can be\nlearned through noisy observations; however, only one function can be sampled\nat a time. We develop a semi-myopic algorithm that couples online learning with\nstochastic approximation, and prove that it achieves optimal convergence rates,\ndespite the non-smoothness of the stochastic gradient and the lack of strong\nconcavity in the objective function."}, "http://arxiv.org/abs/2311.09435": {"title": "Estimating Functionals of the Joint Distribution of Potential Outcomes with Optimal Transport", "link": "http://arxiv.org/abs/2311.09435", "description": "Many causal parameters depend on a moment of the joint distribution of\npotential outcomes. Such parameters are especially relevant in policy\nevaluation settings, where noncompliance is common and accommodated through the\nmodel of Imbens & Angrist (1994). This paper shows that the sharp identified\nset for these parameters is an interval with endpoints characterized by the\nvalue of optimal transport problems. Sample analogue estimators are proposed\nbased on the dual problem of optimal transport. These estimators are root-n\nconsistent and converge in distribution under mild assumptions. 
Inference\nprocedures based on the bootstrap are straightforward and computationally\nconvenient. The ideas and estimators are demonstrated in an application\nrevisiting the National Supported Work Demonstration job training program. I\nfind suggestive evidence that workers who would see below average earnings\nwithout treatment tend to see above average benefits from treatment."}, "http://arxiv.org/abs/2311.09972": {"title": "Inference in Auctions with Many Bidders Using Transaction Prices", "link": "http://arxiv.org/abs/2311.09972", "description": "This paper considers inference in first-price and second-price sealed-bid\nauctions with a large number of symmetric bidders having independent private\nvalues. Given the abundance of bidders in each auction, we propose an\nasymptotic framework in which the number of bidders diverges while the number\nof auctions remains fixed. This framework allows us to perform asymptotically\nexact inference on key model features using only transaction price data.\nSpecifically, we examine inference on the expected utility of the auction\nwinner, the expected revenue of the seller, and the tail properties of the\nvaluation distribution. Simulations confirm the accuracy of our inference\nmethods in finite samples. Finally, we also apply them to Hong Kong car license\nauction data."}, "http://arxiv.org/abs/2212.06080": {"title": "Logs with zeros? Some problems and solutions", "link": "http://arxiv.org/abs/2212.06080", "description": "When studying an outcome $Y$ that is weakly-positive but can equal zero (e.g.\nearnings), researchers frequently estimate an average treatment effect (ATE)\nfor a \"log-like\" transformation that behaves like $\\log(Y)$ for large $Y$ but\nis defined at zero (e.g. $\\log(1+Y)$, $\\mathrm{arcsinh}(Y)$). We argue that\nATEs for log-like transformations should not be interpreted as approximating\npercentage effects, since unlike a percentage, they depend on the units of the\noutcome. In fact, we show that if the treatment affects the extensive margin,\none can obtain a treatment effect of any magnitude simply by re-scaling the\nunits of $Y$ before taking the log-like transformation. This arbitrary\nunit-dependence arises because an individual-level percentage effect is not\nwell-defined for individuals whose outcome changes from zero to non-zero when\nreceiving treatment, and the units of the outcome implicitly determine how much\nweight the ATE for a log-like transformation places on the extensive margin. We\nfurther establish a trilemma: when the outcome can equal zero, there is no\ntreatment effect parameter that is an average of individual-level treatment\neffects, unit-invariant, and point-identified. We discuss several alternative\napproaches that may be sensible in settings with an intensive and extensive\nmargin, including (i) expressing the ATE in levels as a percentage (e.g. using\nPoisson regression), (ii) explicitly calibrating the value placed on the\nintensive and extensive margins, and (iii) estimating separate effects for the\ntwo margins (e.g. using Lee bounds). We illustrate these approaches in three\nempirical applications."}, "http://arxiv.org/abs/2308.05486": {"title": "The Distributional Impact of Money Growth and Inflation Disaggregates: A Quantile Sensitivity Analysis", "link": "http://arxiv.org/abs/2308.05486", "description": "We propose an alternative method to construct a quantile dependence system\nfor inflation and money growth. 
By considering all quantiles, we assess how\nperturbations in one variable's quantile lead to changes in the distribution of\nthe other variable. We demonstrate the construction of this relationship\nthrough a system of linear quantile regressions. The proposed framework is\nexploited to examine the distributional effects of money growth on the\ndistributions of inflation and its disaggregate measures in the United States\nand the Euro area. Our empirical analysis uncovers significant impacts of the\nupper quantile of the money growth distribution on the distribution of\ninflation and its disaggregate measures. Conversely, we find that the lower and\nmedian quantiles of the money growth distribution have a negligible influence\non the distribution of inflation and its disaggregate measures."}, "http://arxiv.org/abs/2311.10685": {"title": "High-Throughput Asset Pricing", "link": "http://arxiv.org/abs/2311.10685", "description": "We use empirical Bayes (EB) to mine for out-of-sample returns among 73,108\nlong-short strategies constructed from accounting ratios, past returns, and\nticker symbols. EB predicts returns are concentrated in accounting and past\nreturn strategies, small stocks, and pre-2004 samples. The cross-section of\nout-of-sample returns lines up closely with EB predictions. Data-mined\nportfolios have mean returns comparable with published portfolios, but the\ndata-mined returns are arguably free of data mining bias. In contrast,\ncontrolling for multiple testing following Harvey, Liu, and Zhu (2016) misses\nthe vast majority of returns. This \"high-throughput asset pricing\" provides an\nevidence-based solution for data mining bias."}, "http://arxiv.org/abs/2209.11444": {"title": "Identification of the Marginal Treatment Effect with Multivalued Treatments", "link": "http://arxiv.org/abs/2209.11444", "description": "The multinomial choice model based on utility maximization has been widely\nused to select treatments. In this paper, we establish sufficient conditions\nfor the identification of the marginal treatment effects with multivalued\ntreatments. Our result reveals treatment effects conditioned on the willingness\nto participate in treatments against a specific treatment. Further, our results\ncan identify other parameters such as the marginal distribution of potential\noutcomes."}, "http://arxiv.org/abs/2211.04027": {"title": "Bootstraps for Dynamic Panel Threshold Models", "link": "http://arxiv.org/abs/2211.04027", "description": "This paper develops valid bootstrap inference methods for the dynamic panel\nthreshold regression. For the first-differenced generalized method of moments\n(GMM) estimation for the dynamic short panel, we show that the standard\nnonparametric bootstrap is inconsistent. The inconsistency is due to an\n$n^{1/4}$-consistent non-normal asymptotic distribution for the threshold\nestimate when the parameter resides within the continuity region of the\nparameter space, which stems from the rank deficiency of the approximate\nJacobian of the sample moment conditions on the continuity region. We propose a\ngrid bootstrap to construct confidence sets for the threshold, a residual\nbootstrap to construct confidence intervals for the coefficients, and a\nbootstrap for testing continuity. They are shown to be valid under uncertain\ncontinuity. A set of Monte Carlo experiments demonstrates that the proposed\nbootstraps perform well in finite samples and improve upon the asymptotic\nnormal approximation even under a large jump at the threshold. 
An empirical\napplication to a firm investment model illustrates our methods."}, "http://arxiv.org/abs/2304.07480": {"title": "Gini-stable Lorenz curves and their relation to the generalised Pareto distribution", "link": "http://arxiv.org/abs/2304.07480", "description": "We introduce an iterative discrete information production process where we\ncan extend ordered normalised vectors by new elements based on a simple affine\ntransformation, while preserving the predefined level of inequality, G, as\nmeasured by the Gini index.\n\nThen, we derive the family of empirical Lorenz curves of the corresponding\nvectors and prove that it is stochastically ordered with respect to both the\nsample size and G, which plays the role of the uncertainty parameter. We prove\nthat asymptotically, we obtain all, and only, Lorenz curves generated by a new,\nintuitive parametrisation of the finite-mean Pickands' Generalised Pareto\nDistribution (GPD) that unifies three other families, namely: the Pareto Type\nII, exponential, and scaled beta ones. The family is not only totally ordered\nwith respect to the parameter G, but also, thanks to our derivations, has a\nnice underlying interpretation. Our result may thus shed new light on the\ngenesis of this family of distributions.\n\nOur model fits bibliometric, informetric, socioeconomic, and environmental\ndata reasonably well. It is quite user-friendly, for it only depends on the\nsample size and its Gini index."}, "http://arxiv.org/abs/2311.11637": {"title": "Modeling economies of scope in joint production: Convex regression of input distance function", "link": "http://arxiv.org/abs/2311.11637", "description": "Modeling of joint production has proved a vexing problem. This paper develops\na radial convex nonparametric least squares (CNLS) approach to estimate the\ninput distance function with multiple outputs. We document the correct input\ndistance function transformation and prove that the necessary orthogonality\nconditions can be satisfied in radial CNLS. A Monte Carlo study is performed to\ncompare the finite sample performance of radial CNLS and other deterministic\nand stochastic frontier approaches in terms of the input distance function\nestimation. We apply our novel approach to the Finnish electricity distribution\nnetwork regulation and empirically confirm that the input isoquants become more\ncurved. In addition, we introduce a weight restriction to radial CNLS to\nmitigate potential overfitting and increase the out-of-sample performance\nin energy regulation."}, "http://arxiv.org/abs/2311.11858": {"title": "Theory coherent shrinkage of Time-Varying Parameters in VARs", "link": "http://arxiv.org/abs/2311.11858", "description": "Time-Varying Parameters Vector Autoregressive (TVP-VAR) models are frequently\nused in economics to capture evolving relationships among macroeconomic\nvariables. However, TVP-VARs have a tendency to overfit the data,\nresulting in inaccurate forecasts and imprecise estimates of typical objects of\ninterest, such as the impulse response functions. This paper introduces a\nTheory Coherent Time-Varying Parameters Vector Autoregressive Model\n(TC-TVP-VAR), which leverages an arbitrary theoretical framework derived from\nan underlying economic theory to form a prior for the time-varying parameters.\nThis \"theory coherent\" shrinkage prior significantly improves inference\nprecision and forecast accuracy over the standard TVP-VAR. 
Furthermore, the\nTC-TVP-VAR can be used to perform indirect posterior inference on the deep\nparameters of the underlying economic theory. The paper reveals that using the\nclassical 3-equation New Keynesian block to form a prior for the TVP-VAR\nsubstantially enhances the forecast accuracy of output growth and of the inflation\nrate in a standard model of monetary policy. Additionally, the paper shows that\nthe TC-TVP-VAR can be used to address the inferential challenges during the\nZero Lower Bound period."}, "http://arxiv.org/abs/2007.07842": {"title": "Persistence in Financial Connectedness and Systemic Risk", "link": "http://arxiv.org/abs/2007.07842", "description": "This paper characterises dynamic linkages arising from shocks with\nheterogeneous degrees of persistence. Using frequency domain techniques, we\nintroduce measures that identify smoothly varying links of a transitory and\npersistent nature. Our approach allows us to test for statistical differences\nin such dynamic links. We document substantial differences in transitory and\npersistent linkages among US financial industry volatilities, argue that they\ntrack heterogeneously persistent sources of systemic risk, and thus may serve\nas a useful tool for market participants."}, "http://arxiv.org/abs/2205.11365": {"title": "Graph-Based Methods for Discrete Choice", "link": "http://arxiv.org/abs/2205.11365", "description": "Choices made by individuals have widespread impacts: for instance, people\nchoose between political candidates to vote for, between social media posts to\nshare, and between brands to purchase. Moreover, data on these choices are\nincreasingly abundant. Discrete choice models are a key tool for learning\nindividual preferences from such data. Additionally, social factors like\nconformity and contagion influence individual choice. Traditional methods for\nincorporating these factors into choice models do not account for the entire\nsocial network and require hand-crafted features. To overcome these\nlimitations, we use graph learning to study choice in networked contexts. We\nidentify three ways in which graph learning techniques can be used for discrete\nchoice: learning chooser representations, regularizing choice model parameters,\nand directly constructing predictions from a network. We design methods in each\ncategory and test them on real-world choice datasets, including county-level\n2016 US election results and Android app installation and usage data. We show\nthat incorporating social network structure can improve the predictions of the\nstandard econometric choice model, the multinomial logit. 
We provide evidence\nthat app installations are influenced by social context, but we find no such\neffect on app usage among the same participants, which instead is habit-driven.\nIn the election data, we highlight the additional insights a discrete choice\nframework provides over classification or regression, the typical approaches.\nOn synthetic data, we demonstrate the sample complexity benefit of using social\ninformation in choice models."}, "http://arxiv.org/abs/2306.09287": {"title": "Modelling and Forecasting Macroeconomic Risk with Time Varying Skewness Stochastic Volatility Models", "link": "http://arxiv.org/abs/2306.09287", "description": "Monitoring downside risk and upside risk to the key macroeconomic indicators\nis critical for effective policymaking aimed at maintaining economic stability.\nIn this paper I propose a parametric framework for modelling and forecasting\nmacroeconomic risk based on stochastic volatility models with Skew-Normal and\nSkew-t shocks featuring time varying skewness. Exploiting a mixture stochastic\nrepresentation of the Skew-Normal and Skew-t random variables, in the paper I\ndevelop efficient posterior simulation samplers for Bayesian estimation of both\nunivariate and VAR models of this type. In an application, I use the models to\npredict downside risk to GDP growth in the US and I show that these models\nrepresent a competitive alternative to semi-parametric approaches such as\nquantile regression. Finally, estimating a medium scale VAR on US data I show\nthat time varying skewness is a relevant feature of macroeconomic and financial\nshocks."}, "http://arxiv.org/abs/2309.07476": {"title": "Causal inference in network experiments: regression-based analysis and design-based properties", "link": "http://arxiv.org/abs/2309.07476", "description": "Investigating interference or spillover effects among units is a central task\nin many social science problems. Network experiments are powerful tools for\nthis task, which avoids endogeneity by randomly assigning treatments to units\nover networks. However, it is non-trivial to analyze network experiments\nproperly without imposing strong modeling assumptions. Previously, many\nresearchers have proposed sophisticated point estimators and standard errors\nfor causal effects under network experiments. We further show that\nregression-based point estimators and standard errors can have strong\ntheoretical guarantees if the regression functions and robust standard errors\nare carefully specified to accommodate the interference patterns under network\nexperiments. We first recall a well-known result that the Hajek estimator is\nnumerically identical to the coefficient from the weighted-least-squares fit\nbased on the inverse probability of the exposure mapping. Moreover, we\ndemonstrate that the regression-based approach offers three notable advantages:\nits ease of implementation, the ability to derive standard errors through the\nsame weighted-least-squares fit, and the capacity to integrate covariates into\nthe analysis, thereby enhancing estimation efficiency. 
Furthermore, we analyze\nthe asymptotic bias of the regression-based network-robust standard errors.\nRecognizing that the covariance estimator can be anti-conservative, we propose\nan adjusted covariance estimator to improve the empirical coverage rates.\nAlthough we focus on regression-based point estimators and standard errors, our\ntheory holds under the design-based framework, which assumes that the\nrandomness comes solely from the design of network experiments and allows for\narbitrary misspecification of the regression models."}, "http://arxiv.org/abs/2311.12267": {"title": "Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity", "link": "http://arxiv.org/abs/2311.12267", "description": "This paper studies causal representation learning, the task of recovering\nhigh-level latent variables and their causal relationships from low-level data\nthat we observe, assuming access to observations generated from multiple\nenvironments. While existing works are able to prove full identifiability of\nthe underlying data generating process, they typically assume access to\nsingle-node, hard interventions, which is rather unrealistic in practice. The\nmain contribution of this paper is to characterize a notion of identifiability\nthat is provably the best one can achieve when hard interventions are not\navailable. First, for linear causal models, we provide an identifiability\nguarantee for data observed from general environments without assuming any\nsimilarities between them. While the causal graph is shown to be fully\nrecovered, the latent variables are only identified up to an effect-domination\nambiguity (EDA). We then propose an algorithm, LiNGCReL, which is guaranteed to\nrecover the ground-truth model up to EDA, and we demonstrate its effectiveness\nvia numerical experiments. Moving on to general non-parametric causal models,\nwe prove the same identifiability guarantee assuming access to groups of soft\ninterventions. Finally, we provide counterparts of our identifiability results,\nindicating that EDA is basically inevitable in our setting."}, "http://arxiv.org/abs/2311.12671": {"title": "Predictive Density Combination Using a Tree-Based Synthesis Function", "link": "http://arxiv.org/abs/2311.12671", "description": "Bayesian predictive synthesis (BPS) provides a method for combining multiple\npredictive distributions based on agent/expert opinion analysis theory and\nencompasses a range of existing density forecast pooling methods. The key\ningredient in BPS is a ``synthesis'' function. This is typically specified\nparametrically as a dynamic linear regression. In this paper, we develop a\nnonparametric treatment of the synthesis function using regression trees. We\nshow the advantages of our tree-based approach in two macroeconomic forecasting\napplications. The first uses density forecasts for GDP growth from the euro\narea's Survey of Professional Forecasters. The second combines density\nforecasts of US inflation produced by many regression models involving\ndifferent predictors. 
Both applications demonstrate the benefits -- in terms of\nimproved forecast accuracy and interpretability -- of modeling the synthesis\nfunction nonparametrically."}, "http://arxiv.org/abs/1810.00283": {"title": "Proxy Controls and Panel Data", "link": "http://arxiv.org/abs/1810.00283", "description": "We provide new results for nonparametric identification, estimation, and\ninference of causal effects using `proxy controls': observables that are noisy\nbut informative proxies for unobserved confounding factors. Our analysis\napplies to cross-sectional settings but is particularly well-suited to panel\nmodels. Our identification results motivate a simple and `well-posed'\nnonparametric estimator. We derive convergence rates for the estimator and\nconstruct uniform confidence bands with asymptotically correct size. In panel\nsettings, our methods provide a novel approach to the difficult problem of\nidentification with non-separable, general heterogeneity and fixed $T$. In\npanels, observations from different periods serve as proxies for unobserved\nheterogeneity and our key identifying assumptions follow from restrictions on\nthe serial dependence structure. We apply our methods to two empirical\nsettings. We estimate consumer demand counterfactuals using panel data and we\nestimate causal effects of grade retention on cognitive performance."}, "http://arxiv.org/abs/2102.08809": {"title": "Testing for Nonlinear Cointegration under Heteroskedasticity", "link": "http://arxiv.org/abs/2102.08809", "description": "This article discusses tests for nonlinear cointegration in the presence of\nvariance breaks. We build on cointegration test approaches under\nheteroskedasticity (Cavaliere and Taylor, 2006, Journal of Time Series\nAnalysis) and for nonlinearity (Choi and Saikkonen, 2010, Econometric Theory)\nto propose a bootstrap test and prove its consistency. A Monte Carlo study\nshows the approach to have good finite sample properties. We provide an\nempirical application to the environmental Kuznets curves (EKC), finding that\nthe cointegration test provides little evidence for the EKC hypothesis.\nAdditionally, we examine the nonlinear relation between the US money and the\ninterest rate, finding that our test does not reject the null of a smooth\ntransition cointegrating relation."}, "http://arxiv.org/abs/2311.12878": {"title": "Adaptive Bayesian Learning with Action and State-Dependent Signal Variance", "link": "http://arxiv.org/abs/2311.12878", "description": "This manuscript presents an advanced framework for Bayesian learning by\nincorporating action and state-dependent signal variances into decision-making\nmodels. This framework is pivotal in understanding complex data-feedback loops\nand decision-making processes in various economic systems. Through a series of\nexamples, we demonstrate the versatility of this approach in different\ncontexts, ranging from simple Bayesian updating in stable environments to\ncomplex models involving social learning and state-dependent uncertainties. The\npaper uniquely contributes to the understanding of the nuanced interplay\nbetween data, actions, outcomes, and the inherent uncertainty in economic\nmodels."}, "http://arxiv.org/abs/2311.13327": {"title": "Regressions under Adverse Conditions", "link": "http://arxiv.org/abs/2311.13327", "description": "We introduce a new regression method that relates the mean of an outcome\nvariable to covariates, given the \"adverse condition\" that a distress variable\nfalls in its tail. 
This allows us to tailor classical mean regressions to adverse\neconomic scenarios, which receive increasing interest in managing macroeconomic\nand financial risks, among many others. In the terminology of the systemic risk\nliterature, our method can be interpreted as a regression for the Marginal\nExpected Shortfall. We propose a two-step procedure to estimate the new models,\nshow consistency and asymptotic normality of the estimator, and propose\nfeasible inference under weak conditions allowing for cross-sectional and time\nseries applications. The accuracy of the asymptotic approximations of the\ntwo-step estimator is verified in simulations. Two empirical applications show\nthat our regressions under adverse conditions are valuable in such diverse\nfields as the study of the relation between systemic risk and asset price\nbubbles, and dissecting macroeconomic growth vulnerabilities into individual\ncomponents."}, "http://arxiv.org/abs/2311.13575": {"title": "Large-Sample Properties of the Synthetic Control Method under Selection on Unobservables", "link": "http://arxiv.org/abs/2311.13575", "description": "We analyze the properties of the synthetic control (SC) method in settings\nwith a large number of units. We assume that the selection into treatment is\nbased on unobserved permanent heterogeneity and pretreatment information, thus\nallowing for both strictly and sequentially exogenous assignment processes.\nExploiting duality, we interpret the solution of the SC optimization problem as\nan estimator for the underlying treatment probabilities. We use this to derive\nthe asymptotic representation for the SC method and characterize sufficient\nconditions for its asymptotic normality. We show that the critical property\nthat determines the behavior of the SC method is the ability of input features\nto approximate the unobserved heterogeneity. Our results imply that the SC\nmethod delivers asymptotically normal estimators for a large class of linear\npanel data models as long as the number of pretreatment periods is large,\nmaking it a natural alternative to conventional methods built on\nDifference-in-Differences."}, "http://arxiv.org/abs/2108.00723": {"title": "Partial Identification and Inference for Conditional Distributions of Treatment Effects", "link": "http://arxiv.org/abs/2108.00723", "description": "This paper considers identification and inference for the distribution of\ntreatment effects conditional on observable covariates. Since the conditional\ndistribution of treatment effects is not point identified without strong\nassumptions, we obtain bounds on the conditional distribution of treatment\neffects by using the Makarov bounds. We also consider the case where the\ntreatment is endogenous and propose two stochastic dominance assumptions to\ntighten the bounds. We develop a nonparametric framework to estimate the bounds\nand establish the asymptotic theory that is uniformly valid over the support of\ntreatment effects. An empirical example illustrates the usefulness of the\nmethods."}, "http://arxiv.org/abs/2311.13969": {"title": "Was Javert right to be suspicious? Unpacking treatment effect heterogeneity of alternative sentences on time-to-recidivism in Brazil", "link": "http://arxiv.org/abs/2311.13969", "description": "This paper presents new econometric tools to unpack the treatment effect\nheterogeneity of punishing misdemeanor offenses on time-to-recidivism. 
We show\nhow one can identify, estimate, and make inferences on the distributional,\nquantile, and average marginal treatment effects in setups where the treatment\nselection is endogenous and the outcome of interest, usually a duration\nvariable, is potentially right-censored. We explore our proposed econometric\nmethodology to evaluate the effect of fines and community service sentences as\na form of punishment on time-to-recidivism in the State of S\\~ao Paulo, Brazil,\nbetween 2010 and 2019, leveraging the as-if random assignment of judges to\ncases. Our results highlight substantial treatment effect heterogeneity that\nother tools are not meant to capture. For instance, we find that people whom\nmost judges would punish take longer to recidivate as a consequence of the\npunishment, while people who would be punished only by strict judges recidivate\nat an earlier date than if they were not punished. This result suggests that\ndesigning sentencing guidelines that encourage strict judges to become more\nlenient could reduce recidivism."}, "http://arxiv.org/abs/2311.14032": {"title": "Counterfactual Sensitivity in Equilibrium Models", "link": "http://arxiv.org/abs/2311.14032", "description": "Counterfactuals in equilibrium models are functions of the current state of\nthe world, the exogenous change variables and the model parameters. Current\npractice treats the current state of the world, the observed data, as perfectly\nmeasured, but there is good reason to believe that they are measured with\nerror. The main aim of this paper is to provide tools for quantifying\nuncertainty about counterfactuals, when the current state of the world is\nmeasured with error. I propose two methods, a Bayesian approach and an\nadversarial approach. Both methods are practical and theoretically justified. I\napply the two methods to the application in Adao et al. (2017) and find\nnon-trivial uncertainty about counterfactuals."}, "http://arxiv.org/abs/2311.14204": {"title": "Reproducible Aggregation of Sample-Split Statistics", "link": "http://arxiv.org/abs/2311.14204", "description": "Statistical inference is often simplified by sample-splitting. This\nsimplification comes at the cost of the introduction of randomness that is not\nnative to the data. We propose a simple procedure for sequentially aggregating\nstatistics constructed with multiple splits of the same sample. The user\nspecifies a bound and a nominal error rate. If the procedure is implemented\ntwice on the same data, the nominal error rate approximates the chance that the\nresults differ by more than the bound. We provide a non-asymptotic analysis of\nthe accuracy of the nominal error rate and illustrate the application of the\nprocedure to several widely applied statistical methods."}, "http://arxiv.org/abs/2205.03288": {"title": "Leverage, Influence, and the Jackknife in Clustered Regression Models: Reliable Inference Using summclust", "link": "http://arxiv.org/abs/2205.03288", "description": "We introduce a new Stata package called summclust that summarizes the cluster\nstructure of the dataset for linear regression models with clustered\ndisturbances. The key unit of observation for such a model is the cluster. We\ntherefore propose cluster-level measures of leverage, partial leverage, and\ninfluence and show how to compute them quickly in most cases. The measures of\nleverage and partial leverage can be used as diagnostic tools to identify\ndatasets and regression designs in which cluster-robust inference is likely to\nbe challenging. 
The measures of influence can provide valuable information\nabout how the results depend on the data in the various clusters. We also show\nhow to calculate two jackknife variance matrix estimators efficiently as a\nbyproduct of our other computations. These estimators, which are already\navailable in Stata, are generally more conservative than conventional variance\nmatrix estimators. The summclust package computes all the quantities that we\ndiscuss."}, "http://arxiv.org/abs/2205.10310": {"title": "Treatment Effects in Bunching Designs: The Impact of Mandatory Overtime Pay on Hours", "link": "http://arxiv.org/abs/2205.10310", "description": "The 1938 Fair Labor Standards Act mandates overtime premium pay for most U.S.\nworkers, but it has proven difficult to assess the policy's impact on the labor\nmarket because the rule applies nationally and has varied little over time. I\nuse the extent to which firms bunch workers at the overtime threshold of 40\nhours in a week to estimate the rule's effect on hours, drawing on data from\nindividual workers' weekly paychecks. To do so, I generalize a popular\nidentification strategy that exploits bunching at kink points in a\ndecision-maker's choice set. Making only nonparametric assumptions about\npreferences and heterogeneity, I show that the average causal response among\nbunchers to the policy switch at the kink is partially identified. The bounds\nindicate a relatively small elasticity of demand for weekly hours, suggesting\nthat the overtime mandate has a discernible but limited impact on hours and\nemployment."}, "http://arxiv.org/abs/2305.19484": {"title": "A Simple Method for Predicting Covariance Matrices of Financial Returns", "link": "http://arxiv.org/abs/2305.19484", "description": "We consider the well-studied problem of predicting the time-varying\ncovariance matrix of a vector of financial returns. Popular methods range from\nsimple predictors like rolling window or exponentially weighted moving average\n(EWMA) to more sophisticated predictors such as generalized autoregressive\nconditional heteroscedastic (GARCH) type methods. Building on a specific\ncovariance estimator suggested by Engle in 2002, we propose a relatively simple\nextension that requires little or no tuning or fitting, is interpretable, and\nproduces results at least as good as MGARCH, a popular extension of GARCH that\nhandles multiple assets. To evaluate predictors, we introduce a novel approach,\nevaluating the regret of the log-likelihood over a time period such as a\nquarter. This metric allows us to see not only how well a covariance predictor\ndoes overall, but also how quickly it reacts to changes in market conditions.\nOur simple predictor outperforms MGARCH in terms of regret. We also test\ncovariance predictors on downstream applications such as portfolio optimization\nmethods that depend on the covariance matrix. For these applications, our simple\ncovariance predictor and MGARCH perform similarly."}, "http://arxiv.org/abs/2311.14698": {"title": "Business Policy Experiments using Fractional Factorial Designs: Consumer Retention on DoorDash", "link": "http://arxiv.org/abs/2311.14698", "description": "This paper investigates an approach to both speed up business decision-making\nand lower the cost of learning through experimentation by factorizing business\npolicies and employing fractional factorial experimental designs for their\nevaluation. 
We illustrate how this method integrates with advances in the\nestimation of heterogeneous treatment effects, elaborating on its advantages\nand foundational assumptions. We empirically demonstrate the implementation and\nbenefits of our approach and assess its validity in evaluating consumer\npromotion policies at DoorDash, which is one of the largest delivery platforms\nin the US. Our approach discovers a policy with 5% incremental profit at 67%\nlower implementation cost."}, "http://arxiv.org/abs/2311.14813": {"title": "A Review of Cross-Sectional Matrix Exponential Spatial Models", "link": "http://arxiv.org/abs/2311.14813", "description": "The matrix exponential spatial models exhibit similarities to the\nconventional spatial autoregressive model in spatial econometrics but offer\nanalytical, computational, and interpretive advantages. This paper provides a\ncomprehensive review of the literature on the estimation, inference, and model\nselection approaches for the cross-sectional matrix exponential spatial models.\nWe discuss summary measures for the marginal effects of regressors and detail\nthe matrix-vector product method for efficient estimation. Our aim is not only\nto summarize the main findings from the spatial econometric literature but also\nto make them more accessible to applied researchers. Additionally, we\ncontribute to the literature by introducing some new results. We propose an\nM-estimation approach for models with heteroskedastic error terms and\ndemonstrate that the resulting M-estimator is consistent and has an asymptotic\nnormal distribution. We also consider some new results for model selection\nexercises. In a Monte Carlo study, we examine the finite sample properties of\nvarious estimators from the literature alongside the M-estimator."}, "http://arxiv.org/abs/2311.14892": {"title": "An Identification and Dimensionality Robust Test for Instrumental Variables Models", "link": "http://arxiv.org/abs/2311.14892", "description": "I propose a new identification-robust test for the structural parameter in a\nheteroskedastic linear instrumental variables model. The proposed test\nstatistic is similar in spirit to a jackknife version of the K-statistic and\nthe resulting test has exact asymptotic size so long as an auxiliary parameter\ncan be consistently estimated. This is possible under approximate sparsity even\nwhen the number of instruments is much larger than the sample size. As the\nnumber of instruments is allowed, but not required, to be large, the limiting\nbehavior of the test statistic is difficult to examine via existing central\nlimit theorems. Instead, I derive the asymptotic chi-squared distribution of\nthe test statistic using a direct Gaussian approximation technique. To improve\npower against certain alternatives, I propose a simple combination with the\nsup-score statistic of Belloni et al. (2012) based on a thresholding rule. I\ndemonstrate favorable size control and power properties in a simulation study\nand apply the new methods to revisit the effect of social spillovers in movie\nconsumption."}, "http://arxiv.org/abs/2311.15458": {"title": "Causal Models for Longitudinal and Panel Data: A Survey", "link": "http://arxiv.org/abs/2311.15458", "description": "This survey discusses the recent causal panel data literature. This recent\nliterature has focused on credibly estimating causal effects of binary\ninterventions in settings with longitudinal data, with an emphasis on practical\nadvice for empirical researchers. 
It pays particular attention to heterogeneity\nin the causal effects, often in situations where few units are treated. The\nliterature has extended earlier work on difference-in-differences or\ntwo-way-fixed-effect estimators and more generally incorporated factor models\nor interactive fixed effects. It has also developed novel methods using\nsynthetic control approaches."}, "http://arxiv.org/abs/2311.15829": {"title": "(Frisch-Waugh-Lovell)': On the Estimation of Regression Models by Row", "link": "http://arxiv.org/abs/2311.15829", "description": "We demonstrate that regression models can be estimated by working\nindependently in a row-wise fashion. We document a simple procedure that\nallows a wide class of econometric estimators to be implemented\ncumulatively, where, in the limit, estimators can be produced without ever\nstoring more than a single line of data in a computer's memory. This result is\nuseful in understanding the mechanics of many common regression models. These\nprocedures can be used to speed up the computation of estimates obtained via\nOLS, IV, Ridge regression, LASSO, Elastic Net, and non-linear models including\nprobit and logit, with all common modes of inference. This has implications for\nestimation and inference with `big data', where memory constraints may imply\nthat working with all data at once is particularly costly. We additionally show\nthat even with moderately sized datasets, this method can reduce computation\ntime compared with traditional estimation routines."}, "http://arxiv.org/abs/2311.15871": {"title": "On Quantile Treatment Effects, Rank Similarity, and Variation of Instrumental Variables", "link": "http://arxiv.org/abs/2311.15871", "description": "This paper investigates how a certain relationship between observed and\ncounterfactual distributions serves as an identifying condition for treatment\neffects when the treatment is endogenous, and shows that this condition holds\nin a range of nonparametric models for treatment effects. To this end, we first\nprovide a novel characterization of the prevalent assumption restricting\ntreatment heterogeneity in the literature, namely rank similarity. Our\ncharacterization demonstrates the stringency of this assumption and allows us\nto relax it in an economically meaningful way, resulting in our identifying\ncondition. It also justifies the quest for richer exogenous variations in the\ndata (e.g., multi-valued or multiple instrumental variables) in exchange for\nweaker identifying conditions. The primary goal of this investigation is to\nprovide empirical researchers with tools that are robust and easy to implement\nbut still yield tight policy evaluations."}, "http://arxiv.org/abs/2311.15878": {"title": "Individualized Treatment Allocations with Distributional Welfare", "link": "http://arxiv.org/abs/2311.15878", "description": "In this paper, we explore optimal treatment allocation policies that target\ndistributional welfare. Most of the literature on treatment choice has considered\nutilitarian welfare based on the conditional average treatment effect (ATE).\nWhile average welfare is intuitive, it may yield undesirable allocations,\nespecially when individuals are heterogeneous (e.g., with outliers) - the very\nreason individualized treatments were introduced in the first place. This\nobservation motivates us to propose an optimal policy that allocates the\ntreatment based on the conditional \\emph{quantile of individual treatment\neffects} (QoTE). 
Depending on the choice of the quantile probability, this\ncriterion can accommodate a policymaker who is either prudent or negligent. The\nchallenge of identifying the QoTE lies in its requirement for knowledge of the\njoint distribution of the counterfactual outcomes, which is generally hard to\nrecover even with experimental data. Therefore, we introduce minimax optimal\npolicies that are robust to model uncertainty. We then propose a range of\nidentifying assumptions under which we can point or partially identify the\nQoTE. We establish the asymptotic bound on the regret of implementing the\nproposed policies. We consider both stochastic and deterministic rules. In\nsimulations and two empirical applications, we compare optimal decisions based\non the QoTE with decisions based on other criteria."}, "http://arxiv.org/abs/2311.15932": {"title": "Valid Wald Inference with Many Weak Instruments", "link": "http://arxiv.org/abs/2311.15932", "description": "This paper proposes three novel test procedures that yield valid inference in\nan environment with many weak instrumental variables (MWIV). It is observed\nthat the t statistic of the jackknife instrumental variable estimator (JIVE)\nhas an asymptotic distribution that is identical to the two-stage-least squares\n(TSLS) t statistic in the just-identified environment. Consequently, test\nprocedures that were valid for TSLS t are also valid for the JIVE t. Two such\nprocedures, i.e., VtF and conditional Wald, are adapted directly. By exploiting\na feature of MWIV environments, a third, more powerful, one-sided VtF-based\ntest procedure can be obtained."}, "http://arxiv.org/abs/2311.15952": {"title": "Robust Conditional Wald Inference for Over-Identified IV", "link": "http://arxiv.org/abs/2311.15952", "description": "For the over-identified linear instrumental variables model, researchers\ncommonly report the 2SLS estimate along with the robust standard error and seek\nto conduct inference with these quantities. If errors are homoskedastic, one\ncan control the degree of inferential distortion using the first-stage F\ncritical values from Stock and Yogo (2005), or use the robust-to-weak\ninstruments Conditional Wald critical values of Moreira (2003). If errors are\nnon-homoskedastic, these methods do not apply. We derive the generalization of\nConditional Wald critical values that is robust to non-homoskedastic errors\n(e.g., heteroskedasticity or clustered variance structures), which can also be\napplied to nonlinear weakly-identified models (e.g. weakly-identified GMM)."}, "http://arxiv.org/abs/1801.00332": {"title": "Confidence set for group membership", "link": "http://arxiv.org/abs/1801.00332", "description": "Our confidence set quantifies the statistical uncertainty from data-driven\ngroup assignments in grouped panel models. It covers the true group memberships\njointly for all units with pre-specified probability and is constructed by\ninverting many simultaneous unit-specific one-sided tests for group membership.\nWe justify our approach under $N, T \\to \\infty$ asymptotics using tools from\nhigh-dimensional statistics, some of which we extend in this paper. 
We provide\nMonte Carlo evidence that the confidence set has adequate coverage in finite\nsamples. An empirical application illustrates the use of our confidence set."}, "http://arxiv.org/abs/2004.05027": {"title": "Direct and spillover effects of a new tramway line on the commercial vitality of peripheral streets", "link": "http://arxiv.org/abs/2004.05027", "description": "In cities, the creation of public transport infrastructure such as light\nrails can cause changes on a very detailed spatial scale, with different\nstories unfolding next to each other within the same urban neighborhood. We study\nthe direct effect of a light rail line built in Florence (Italy) on the retail\ndensity of the street where it was built and its spillover effect on other\nstreets in the treated street's neighborhood. To this aim, we investigate the\nuse of Synthetic Control Group (SCG) methods in panel comparative case\nstudies where interference between the treated and the untreated units is\nplausible, an issue still little researched in the SCG methodological\nliterature. We frame our discussion in the potential outcomes approach. Under a\npartial interference assumption, we formally define relevant direct and\nspillover causal effects. We also consider the ``unrealized'' spillover effect\non the treated street in the hypothetical scenario that another street in the\ntreated unit's neighborhood had been assigned to the intervention."}, "http://arxiv.org/abs/2004.09458": {"title": "Noise-Induced Randomization in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2004.09458", "description": "Regression discontinuity designs assess causal effects in settings where\ntreatment is determined by whether an observed running variable crosses a\npre-specified threshold. Here we propose a new approach to identification,\nestimation, and inference in regression discontinuity designs that uses\nknowledge about exogenous noise (e.g., measurement error) in the running\nvariable. In our strategy, we weight treated and control units to balance a\nlatent variable of which the running variable is a noisy measure. Our approach\nis explicitly randomization-based and complements standard formal analyses that\nappeal to continuity arguments while ignoring the stochastic nature of the\nassignment mechanism."}, "http://arxiv.org/abs/2111.12258": {"title": "On Recoding Ordered Treatments as Binary Indicators", "link": "http://arxiv.org/abs/2111.12258", "description": "Researchers using instrumental variables to investigate ordered treatments\noften recode treatment into an indicator for any exposure. We investigate this\nestimand under the assumption that the instruments shift compliers from no\ntreatment to some but not from some treatment to more. We show that when there\nare extensive margin compliers only (EMCO), this estimand captures a weighted\naverage of treatment effects that can be partially unbundled into each complier\ngroup's potential outcome means. We also establish an equivalence between EMCO\nand a two-factor selection model and apply our results to study treatment\nheterogeneity in the Oregon Health Insurance Experiment."}, "http://arxiv.org/abs/2311.16260": {"title": "Using Multiple Outcomes to Improve the Synthetic Control Method", "link": "http://arxiv.org/abs/2311.16260", "description": "When there are multiple outcome series of interest, Synthetic Control\nanalyses typically proceed by estimating separate weights for each outcome. 
In\nthis paper, we instead propose estimating a common set of weights across\noutcomes, by balancing either a vector of all outcomes or an index or average\nof them. Under a low-rank factor model, we show that these approaches lead to\nlower bias bounds than separate weights, and that averaging leads to further\ngains when the number of outcomes grows. We illustrate this via simulation and\nin a re-analysis of the impact of the Flint water crisis on educational\noutcomes."}, "http://arxiv.org/abs/2311.16333": {"title": "From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks", "link": "http://arxiv.org/abs/2311.16333", "description": "We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density\nforecasting through a novel neural network architecture with dedicated mean and\nvariance hemispheres. Our architecture features several key ingredients making\nMLE work in this context. First, the hemispheres share a common core at the\nentrance of the network which accommodates for various forms of time variation\nin the error variance. Second, we introduce a volatility emphasis constraint\nthat breaks mean/variance indeterminacy in this class of overparametrized\nnonlinear models. Third, we conduct a blocked out-of-bag reality check to curb\noverfitting in both conditional moments. Fourth, the algorithm utilizes\nstandard deep learning software and thus handles large data sets - both\ncomputationally and statistically. Ergo, our Hemisphere Neural Network (HNN)\nprovides proactive volatility forecasts based on leading indicators when it\ncan, and reactive volatility based on the magnitude of previous prediction\nerrors when it must. We evaluate point and density forecasts with an extensive\nout-of-sample experiment and benchmark against a suite of models ranging from\nclassics to more modern machine learning-based offerings. In all cases, HNN\nfares well by consistently providing accurate mean/variance forecasts for all\ntargets and horizons. Studying the resulting volatility paths reveals its\nversatility, while probabilistic forecasting evaluation metrics showcase its\nenviable reliability. Finally, we also demonstrate how this machinery can be\nmerged with other structured deep learning models by revisiting Goulet Coulombe\n(2022)'s Neural Phillips Curve."}, "http://arxiv.org/abs/2311.16440": {"title": "Inference for Low-rank Models without Estimating the Rank", "link": "http://arxiv.org/abs/2311.16440", "description": "This paper studies the inference about linear functionals of high-dimensional\nlow-rank matrices. While most existing inference methods would require\nconsistent estimation of the true rank, our procedure is robust to rank\nmisspecification, making it a promising approach in applications where rank\nestimation can be unreliable. We estimate the low-rank spaces using\npre-specified weighting matrices, known as diversified projections. A novel\nstatistical insight is that, unlike the usual statistical wisdom that\noverfitting mainly introduces additional variances, the over-estimated low-rank\nspace also gives rise to a non-negligible bias due to an implicit ridge-type\nregularization. We develop a new inference procedure and show that the central\nlimit theorem holds as long as the pre-specified rank is no smaller than the\ntrue rank. Empirically, we apply our method to the U.S. 
federal grants\nallocation data and test the existence of pork-barrel politics."}, "http://arxiv.org/abs/2311.16486": {"title": "On the adaptation of causal forests to manifold data", "link": "http://arxiv.org/abs/2311.16486", "description": "Researchers often hold the belief that random forests are \"the cure to the\nworld's ills\" (Bickel, 2010). But how exactly do they achieve this? Focused on\nthe recently introduced causal forests (Athey and Imbens, 2016; Wager and\nAthey, 2018), this manuscript aims to contribute to an ongoing research trend\ntowards answering this question, proving that causal forests can adapt to the\nunknown covariate manifold structure. In particular, our analysis shows that a\ncausal forest estimator can achieve the optimal rate of convergence for\nestimating the conditional average treatment effect, with the covariate\ndimension automatically replaced by the manifold dimension. These findings\nalign with analogous observations in the realm of deep learning and resonate\nwith the insights presented in Peter Bickel's 2004 Rietz lecture."}, "http://arxiv.org/abs/2311.17021": {"title": "Optimal Categorical Instrumental Variables", "link": "http://arxiv.org/abs/2311.17021", "description": "This paper discusses estimation with a categorical instrumental variable in\nsettings with potentially few observations per category. The proposed\ncategorical instrumental variable estimator (CIV) leverages a regularization\nassumption that implies existence of a latent categorical variable with fixed\nfinite support achieving the same first stage fit as the observed instrument.\nIn asymptotic regimes that allow the number of observations per category to\ngrow at an arbitrarily small polynomial rate with the sample size, I show that when\nthe cardinality of the support of the optimal instrument is known, CIV is\nroot-n asymptotically normal, achieves the same asymptotic variance as the\noracle IV estimator that presumes knowledge of the optimal instrument, and is\nsemiparametrically efficient under homoskedasticity. Under-specifying the\nnumber of support points reduces efficiency but maintains asymptotic normality."}, "http://arxiv.org/abs/1912.10488": {"title": "Efficient and Convergent Sequential Pseudo-Likelihood Estimation of Dynamic Discrete Games", "link": "http://arxiv.org/abs/1912.10488", "description": "We propose a new sequential Efficient Pseudo-Likelihood (k-EPL) estimator for\ndynamic discrete choice games of incomplete information. k-EPL considers the\njoint behavior of multiple players simultaneously, as opposed to individual\nresponses to other agents' equilibrium play. This, in addition to reframing the\nproblem from conditional choice probability (CCP) space to value function\nspace, yields a computationally tractable, stable, and efficient estimator. We\nshow that each iteration in the k-EPL sequence is consistent and asymptotically\nefficient, so the first-order asymptotic properties do not vary across\niterations. Furthermore, we show the sequence achieves higher-order equivalence\nto the finite-sample maximum likelihood estimator with iteration and that the\nsequence of estimators converges almost surely to the maximum likelihood\nestimator at a nearly-superlinear rate when the data are generated by any\nregular Markov perfect equilibrium, including equilibria that lead to\ninconsistency of other sequential estimators. 
When utility is linear in\nparameters, k-EPL iterations are computationally simple, only requiring that\nthe researcher solve linear systems of equations to generate pseudo-regressors\nwhich are used in a static logit/probit regression. Monte Carlo simulations\ndemonstrate the theoretical results and show k-EPL's good performance in finite\nsamples in both small- and large-scale games, even when the game admits\nspurious equilibria in addition to one that generated the data. We apply the\nestimator to study the role of competition in the U.S. wholesale club industry."}, "http://arxiv.org/abs/2202.07150": {"title": "Asymptotics of Cointegration Tests for High-Dimensional VAR($k$)", "link": "http://arxiv.org/abs/2202.07150", "description": "The paper studies nonstationary high-dimensional vector autoregressions of\norder $k$, VAR($k$). Additional deterministic terms such as trend or\nseasonality are allowed. The number of time periods, $T$, and the number of\ncoordinates, $N$, are assumed to be large and of the same order. Under this\nregime the first-order asymptotics of the Johansen likelihood ratio (LR),\nPillai-Bartlett, and Hotelling-Lawley tests for cointegration are derived: the\ntest statistics converge to nonrandom integrals. For more refined analysis, the\npaper proposes and analyzes a modification of the Johansen test. The new test\nfor the absence of cointegration converges to the partial sum of the Airy$_1$\npoint process. Supporting Monte Carlo simulations indicate that the same\nbehavior persists universally in many situations beyond those considered in our\ntheorems.\n\nThe paper presents empirical implementations of the approach for the analysis\nof S$\\&$P$100$ stocks and of cryptocurrencies. The latter example has a strong\npresence of multiple cointegrating relationships, while the results for the\nformer are consistent with the null of no cointegration."}, "http://arxiv.org/abs/2311.17575": {"title": "Identifying Causal Effects of Nonbinary, Ordered Treatments using Multiple Instrumental Variables", "link": "http://arxiv.org/abs/2311.17575", "description": "This paper addresses the challenge of identifying causal effects of\nnonbinary, ordered treatments with multiple binary instruments. Next to\npresenting novel insights into the widely-applied two-stage least squares\nestimand, I show that a weighted average of local average treatment effects for\ncombined complier populations is identified under the limited monotonicity\nassumption. This novel causal parameter has an intuitive interpretation,\noffering an appealing alternative to two-stage least squares. I employ recent\nadvances in causal machine learning for estimation. I further demonstrate how\ncausal forests can be used to detect local violations of the underlying limited\nmonotonicity assumption. The methodology is applied to study the impact of\ncommunity nurseries on child health outcomes."}, "http://arxiv.org/abs/2311.17858": {"title": "On the Limits of Regression Adjustment", "link": "http://arxiv.org/abs/2311.17858", "description": "Regression adjustment, sometimes known as Controlled-experiment Using\nPre-Experiment Data (CUPED), is an important technique in internet\nexperimentation. It decreases the variance of effect size estimates, often\ncutting confidence interval widths in half or more while never making them\nworse. It does so by carefully regressing the goal metric against\npre-experiment features to reduce the variance. 
The tremendous gains of\nregression adjustment beg the question: How much better can we do by\nengineering better features from pre-experiment data, for example by using\nmachine learning techniques or synthetic controls? Could we even reduce the\nvariance in our effect sizes arbitrarily close to zero with the right\npredictors? Unfortunately, our answer is negative. A simple form of regression\nadjustment, which uses just the pre-experiment values of the goal metric,\ncaptures most of the benefit. Specifically, under a mild assumption that\nobservations closer in time are easier to predict than ones further away in\ntime, we upper bound the potential gains of more sophisticated feature\nengineering, with respect to the gains of this simple form of regression\nadjustment. The maximum reduction in variance is $50\%$ in Theorem 1, or\nequivalently, the confidence interval width can be reduced by at most an\nadditional $29\%$."}, "http://arxiv.org/abs/2110.04442": {"title": "A Primer on Deep Learning for Causal Inference", "link": "http://arxiv.org/abs/2110.04442", "description": "This review systematizes the emerging literature for causal inference using\ndeep neural networks under the potential outcomes framework. It provides an\nintuitive introduction on how deep learning can be used to estimate/predict\nheterogeneous treatment effects and extend causal inference to settings where\nconfounding is non-linear, time varying, or encoded in text, networks, and\nimages. To maximize accessibility, we also introduce prerequisite concepts from\ncausal inference and deep learning. The survey differs from other treatments of\ndeep learning and causal inference in its sharp focus on observational causal\nestimation, its extended exposition of key algorithms, and its detailed\ntutorials for implementing, training, and selecting among deep estimators in\nTensorflow 2 available at github.com/kochbj/Deep-Learning-for-Causal-Inference."}, "http://arxiv.org/abs/2111.05243": {"title": "Bounding Treatment Effects by Pooling Limited Information across Observations", "link": "http://arxiv.org/abs/2111.05243", "description": "We provide novel bounds on average treatment effects (on the treated) that\nare valid under an unconfoundedness assumption. Our bounds are designed to be\nrobust in challenging situations, for example, when the conditioning variables\ntake on a large number of different values in the observed sample, or when the\noverlap condition is violated. This robustness is achieved by only using\nlimited \"pooling\" of information across observations. Namely, the bounds are\nconstructed as sample averages over functions of the observed outcomes such\nthat the contribution of each outcome only depends on the treatment status of a\nlimited number of observations. No information pooling across observations\nleads to so-called \"Manski bounds\", while unlimited information pooling leads\nto standard inverse propensity score weighting. We explore the intermediate\nrange between these two extremes and provide corresponding inference methods.\nWe show in Monte Carlo experiments and through an empirical application that\nour bounds are indeed robust and informative in practice."}, "http://arxiv.org/abs/2303.01231": {"title": "Robust Hicksian Welfare Analysis under Individual Heterogeneity", "link": "http://arxiv.org/abs/2303.01231", "description": "Welfare effects of price changes are often estimated with cross-sections;\nthese do not identify demand with heterogeneous consumers. 
We develop a\ntheoretical method addressing this, utilizing uncompensated demand moments to\nconstruct local approximations for compensated demand moments, robust to\nunobserved preference heterogeneity. Our methodological contribution offers\nrobust approximations for average and distributional welfare estimates,\nextending to price indices, taxable income elasticities, and general\nequilibrium welfare. Our methods apply to any cross-section; we demonstrate\nthem via UK household budget survey data. We uncover an insight: simple\nnon-parametric representative agent models might be less biased than complex\nparametric models accounting for heterogeneity."}, "http://arxiv.org/abs/2305.02185": {"title": "Doubly Robust Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences", "link": "http://arxiv.org/abs/2305.02185", "description": "We consider a panel data analysis to examine the heterogeneity in treatment\neffects with respect to a pre-treatment covariate of interest in the staggered\ndifference-in-differences setting of Callaway and Sant'Anna (2021). Under\nstandard identification conditions, a doubly robust estimand conditional on the\ncovariate identifies the group-time conditional average treatment effect given\nthe covariate. Focusing on the case of a continuous covariate, we propose a\nthree-step estimation procedure based on nonparametric local polynomial\nregressions and parametric estimation methods. Using uniformly valid\ndistributional approximation results for empirical processes and multiplier\nbootstrapping, we develop doubly robust inference methods to construct uniform\nconfidence bands for the group-time conditional average treatment effect\nfunction. The accompanying R package didhetero allows for easy implementation\nof the proposed methods."}, "http://arxiv.org/abs/2311.18136": {"title": "Extrapolating Away from the Cutoff in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2311.18136", "description": "Canonical RD designs yield credible local estimates of the treatment effect\nat the cutoff under mild continuity assumptions, but they fail to identify\ntreatment effects away from the cutoff without additional assumptions. The\nfundamental challenge of identifying treatment effects away from the cutoff is\nthat the counterfactual outcome under the alternative treatment status is never\nobserved. This paper aims to provide a methodological blueprint to identify\ntreatment effects away from the cutoff in various empirical settings by\noffering a non-exhaustive list of assumptions on the counterfactual outcome.\nInstead of assuming the exact evolution of the counterfactual outcome, this\npaper bounds its variation using the data and sensitivity parameters. The\nproposed assumptions are weaker than those introduced previously in the\nliterature, resulting in partially identified treatment effects that are less\nsusceptible to assumption violations. This approach accommodates both single\ncutoff and multi-cutoff designs. The specific choice of the extrapolation\nassumption depends on the institutional background of each empirical\napplication. Additionally, researchers are recommended to conduct sensitivity\nanalysis on the chosen parameter and assess resulting shifts in conclusions.\nThe paper compares the proposed identification results with results using\nprevious methods via an empirical application and simulated data. 
It\ndemonstrates that set identification yields a more credible conclusion about\nthe sign of the treatment effect."}, "http://arxiv.org/abs/2311.18555": {"title": "Identification in Endogenous Sequential Treatment Regimes", "link": "http://arxiv.org/abs/2311.18555", "description": "This paper develops a novel nonparametric identification method for treatment\neffects in settings where individuals self-select into treatment sequences. I\npropose an identification strategy which relies on a dynamic version of\nstandard Instrumental Variables (IV) assumptions and builds on a dynamic\nversion of the Marginal Treatment Effects (MTE) as the fundamental building\nblock for treatment effects. The main contribution of the paper is to relax\nassumptions on the support of the observed variables and on unobservable gains\nof treatment that are present in the dynamic treatment effects literature.\nMonte Carlo simulation studies illustrate the desirable finite-sample\nperformance of a sieve estimator for MTEs and Average Treatment Effects (ATEs)\non a close-to-application simulation study."}, "http://arxiv.org/abs/2311.18759": {"title": "Bootstrap Inference on Partially Linear Binary Choice Model", "link": "http://arxiv.org/abs/2311.18759", "description": "The partially linear binary choice model can be used for estimating\nstructural equations where nonlinearity may appear due to diminishing marginal\nreturns, different life cycle regimes, or hectic physical phenomena. The\ninference procedure for this model based on the analytic asymptotic\napproximation could be unreliable in finite samples if the sample size is not\nsufficiently large. This paper proposes a bootstrap inference approach for the\nmodel. Monte Carlo simulations show that the proposed inference method performs\nwell in finite samples compared to the procedure based on the asymptotic\napproximation."}, "http://arxiv.org/abs/2006.01212": {"title": "New Approaches to Robust Inference on Market (Non-)Efficiency, Volatility Clustering and Nonlinear Dependence", "link": "http://arxiv.org/abs/2006.01212", "description": "Many financial and economic variables, including financial returns, exhibit\nnonlinear dependence, heterogeneity and heavy-tailedness. These properties may\nmake problematic the analysis of (non-)efficiency and volatility clustering in\neconomic and financial markets using traditional approaches that appeal to\nasymptotic normality of sample autocorrelation functions of returns and their\nsquares.\n\nThis paper presents new approaches to deal with the above problems. We\nprovide the results that motivate the use of measures of market\n(non-)efficiency and volatility clustering based on (small) powers of absolute\nreturns and their signed versions.\n\nWe further provide new approaches to robust inference on the measures in the\ncase of general time series, including GARCH-type processes. The approaches are\nbased on robust $t-$statistics tests and new results on their applicability are\npresented. In the approaches, parameter estimates (e.g., estimates of measures\nof nonlinear dependence) are computed for groups of data, and the inference is\nbased on $t-$statistics in the resulting group estimates. This results in valid\nrobust inference under heterogeneity and dependence assumptions satisfied in\nreal-world financial markets. 
Numerical results and empirical applications\nconfirm the advantages and wide applicability of the proposed approaches."}, "http://arxiv.org/abs/2312.00282": {"title": "Stochastic volatility models with skewness selection", "link": "http://arxiv.org/abs/2312.00282", "description": "This paper expands traditional stochastic volatility models by allowing for\ntime-varying skewness without imposing it. While dynamic asymmetry may capture\nthe likely direction of future asset returns, it comes at the risk of leading\nto overparameterization. Our proposed approach mitigates this concern by\nleveraging sparsity-inducing priors to automatically select the skewness\nparameter as being dynamic, static or zero in a data-driven framework. We\nconsider two empirical applications. First, in a bond yield application,\ndynamic skewness captures interest rate cycles of monetary easing and\ntightening being partially explained by central banks' mandates. In a currency\nmodeling framework, our model indicates no skewness in the carry factor after\naccounting for stochastic volatility which supports the idea of carry crashes\nbeing the result of volatility surges instead of dynamic skewness."}, "http://arxiv.org/abs/2312.00399": {"title": "GMM-lev estimation and individual heterogeneity: Monte Carlo evidence and empirical applications", "link": "http://arxiv.org/abs/2312.00399", "description": "The generalized method of moments (GMM) estimator applied to equations in\nlevels, GMM-lev, has the advantage of being able to estimate the effect of\nmeasurable time-invariant covariates using all available information. This is\nnot possible with GMM-dif, applied to equations in each period transformed into\nfirst differences, while GMM-sys uses little information, as it adds the\nequation in levels for only one period. The GMM-lev, by implying a\ntwo-component error term containing the individual heterogeneity and the shock,\nexposes the explanatory variables to possible double endogeneity. For example,\nthe estimation of true persistence could suffer from bias if instruments were\ncorrelated with the unit-specific error component. We propose to exploit the\n\\citet{Mundlak1978}'s approach together with GMM-lev estimation to capture\ninitial conditions and improve inference. Monte Carlo simulations for different\npanel types and under different double endogeneity assumptions show the\nadvantage of our approach."}, "http://arxiv.org/abs/2312.00590": {"title": "Inference on common trends in functional time series", "link": "http://arxiv.org/abs/2312.00590", "description": "This paper studies statistical inference on unit roots and cointegration for\ntime series in a Hilbert space. We develop statistical inference on the number\nof common stochastic trends that are embedded in the time series, i.e., the\ndimension of the nonstationary subspace. We also consider hypotheses on the\nnonstationary subspace itself. The Hilbert space can be of an arbitrarily large\ndimension, and our methods remain asymptotically valid even when the time\nseries of interest takes values in a subspace of possibly unknown dimension.\nThis has wide applicability in practice; for example, in the case of\ncointegrated vector time series of finite dimension, in a high-dimensional\nfactor model that includes a finite number of nonstationary factors, in the\ncase of cointegrated curve-valued (or function-valued) time series, and\nnonstationary dynamic functional factor models. 
We include two empirical\nillustrations to the term structure of interest rates and labor market indices,\nrespectively."}, "http://arxiv.org/abs/2305.17083": {"title": "A Policy Gradient Method for Confounded POMDPs", "link": "http://arxiv.org/abs/2305.17083", "description": "In this paper, we propose a policy gradient method for confounded partially\nobservable Markov decision processes (POMDPs) with continuous state and\nobservation spaces in the offline setting. We first establish a novel\nidentification result to non-parametrically estimate any history-dependent\npolicy gradient under POMDPs using the offline data. The identification enables\nus to solve a sequence of conditional moment restrictions and adopt the min-max\nlearning procedure with general function approximation for estimating the\npolicy gradient. We then provide a finite-sample non-asymptotic bound for\nestimating the gradient uniformly over a pre-specified policy class in terms of\nthe sample size, length of horizon, concentratability coefficient and the\nmeasure of ill-posedness in solving the conditional moment restrictions.\nLastly, by deploying the proposed gradient estimation in the gradient ascent\nalgorithm, we show the global convergence of the proposed algorithm in finding\nthe history-dependent optimal policy under some technical conditions. To the\nbest of our knowledge, this is the first work studying the policy gradient\nmethod for POMDPs under the offline setting."}, "http://arxiv.org/abs/2312.00955": {"title": "Identification and Inference for Synthetic Controls with Confounding", "link": "http://arxiv.org/abs/2312.00955", "description": "This paper studies inference on treatment effects in panel data settings with\nunobserved confounding. We model outcome variables through a factor model with\nrandom factors and loadings. Such factors and loadings may act as unobserved\nconfounders: when the treatment is implemented depends on time-varying factors,\nand who receives the treatment depends on unit-level confounders. We study the\nidentification of treatment effects and illustrate the presence of a trade-off\nbetween time and unit-level confounding. We provide asymptotic results for\ninference for several Synthetic Control estimators and show that different\nsources of randomness should be considered for inference, depending on the\nnature of confounding. We conclude with a comparison of Synthetic Control\nestimators with alternatives for factor models."}, "http://arxiv.org/abs/2312.01162": {"title": "Tests for Many Treatment Effects in Regression Discontinuity Panel Data Models", "link": "http://arxiv.org/abs/2312.01162", "description": "Numerous studies use regression discontinuity design (RDD) for panel data by\nassuming that the treatment effects are homogeneous across all\nindividuals/groups and pooling the data together. It is unclear how to test for\nthe significance of treatment effects when the treatments vary across\nindividuals/groups and the error terms may exhibit complicated dependence\nstructures. This paper examines the estimation and inference of multiple\ntreatment effects when the errors are not independent and identically\ndistributed, and the treatment effects vary across individuals/groups. 
We\nderive a simple analytical expression for approximating the variance-covariance\nstructure of the treatment effect estimators under general dependence\nconditions and propose two test statistics, one is to test for the overall\nsignificance of the treatment effect and the other for the homogeneity of the\ntreatment effects. We find that in the Gaussian approximations to the test\nstatistics, the dependence structures in the data can be safely ignored due to\nthe localized nature of the statistics. This has the important implication that\nthe simulated critical values can be easily obtained. Simulations demonstrate\nour tests have superb size control and reasonable power performance in finite\nsamples regardless of the presence of strong cross-section dependence or/and\nweak serial dependence in the data. We apply our tests to two datasets and find\nsignificant overall treatment effects in each case."}, "http://arxiv.org/abs/2312.01209": {"title": "A Method of Moments Approach to Asymptotically Unbiased Synthetic Controls", "link": "http://arxiv.org/abs/2312.01209", "description": "A common approach to constructing a Synthetic Control unit is to fit on the\noutcome variable and covariates in pre-treatment time periods, but it has been\nshown by Ferman and Pinto (2021) that this approach does not provide asymptotic\nunbiasedness when the fit is imperfect and the number of controls is fixed.\nMany related panel methods have a similar limitation when the number of units\nis fixed. I introduce and evaluate a new method in which the Synthetic Control\nis constructed using a General Method of Moments approach where if the\nSynthetic Control satisfies the moment conditions it must have the same\nloadings on latent factors as the treated unit. I show that a Synthetic Control\nEstimator of this form will be asymptotically unbiased as the number of\npre-treatment time periods goes to infinity, even when pre-treatment fit is\nimperfect and the set of controls is fixed. Furthermore, if both the number of\npre-treatment and post-treatment time periods go to infinity, then averages of\ntreatment effects can be consistently estimated and asymptotically valid\ninference can be conducted using a subsampling method. I conduct simulations\nand an empirical application to compare the performance of this method with\nexisting approaches in the literature."}, "http://arxiv.org/abs/2312.01881": {"title": "Bayesian Nonlinear Regression using Sums of Simple Functions", "link": "http://arxiv.org/abs/2312.01881", "description": "This paper proposes a new Bayesian machine learning model that can be applied\nto large datasets arising in macroeconomics. Our framework sums over many\nsimple two-component location mixtures. The transition between components is\ndetermined by a logistic function that depends on a single threshold variable\nand two hyperparameters. Each of these individual models only accounts for a\nminor portion of the variation in the endogenous variables. But many of them\nare capable of capturing arbitrary nonlinear conditional mean relations.\nConjugate priors enable fast and efficient inference. In simulations, we show\nthat our approach produces accurate point and density forecasts. 
In a real-data\nexercise, we forecast US macroeconomic aggregates and consider the nonlinear\neffects of financial shocks in a large-scale nonlinear VAR."}, "http://arxiv.org/abs/1905.05237": {"title": "Sustainable Investing and the Cross-Section of Returns and Maximum Drawdown", "link": "http://arxiv.org/abs/1905.05237", "description": "We use supervised learning to identify factors that predict the cross-section\nof returns and maximum drawdown for stocks in the US equity market. Our data\nrun from January 1970 to December 2019 and our analysis includes ordinary least\nsquares, penalized linear regressions, tree-based models, and neural networks.\nWe find that the most important predictors tended to be consistent across\nmodels, and that non-linear models had better predictive power than linear\nmodels. Predictive power was higher in calm periods than in stressed periods.\nEnvironmental, social, and governance indicators marginally impacted the\npredictive power of non-linear models in our data, despite their negative\ncorrelation with maximum drawdown and positive correlation with returns. Upon\nexploring whether ESG variables are captured by some models, we find that ESG\ndata contribute to the prediction nonetheless."}, "http://arxiv.org/abs/2203.08879": {"title": "A Simple and Computationally Trivial Estimator for Grouped Fixed Effects Models", "link": "http://arxiv.org/abs/2203.08879", "description": "This paper introduces a new fixed effects estimator for linear panel data\nmodels with clustered time patterns of unobserved heterogeneity. The method\navoids non-convex and combinatorial optimization by combining a preliminary\nconsistent estimator of the slope coefficient, an agglomerative\npairwise-differencing clustering of cross-sectional units, and a pooled\nordinary least squares regression. Asymptotic guarantees are established in a\nframework where $T$ can grow at any power of $N$, as both $N$ and $T$ approach\ninfinity. Unlike most existing approaches, the proposed estimator is\ncomputationally straightforward and does not require a known upper bound on the\nnumber of groups. As with existing approaches, this method leads to a consistent\nestimation of well-separated groups and an estimator of common parameters\nasymptotically equivalent to the infeasible regression controlling for the true\ngroups. An application revisits the statistical association between income and\ndemocracy."}, "http://arxiv.org/abs/2204.02346": {"title": "Finitely Heterogeneous Treatment Effect in Event-study", "link": "http://arxiv.org/abs/2204.02346", "description": "Treatment effect estimation strategies in the event-study setup, namely panel\ndata with variation in treatment timing, often use the parallel trend\nassumption that assumes mean independence of potential outcomes across\ndifferent treatment timings. In this paper, we relax the parallel trend\nassumption by assuming a latent type variable and develop a type-specific\nparallel trend assumption. With a finite support assumption on the latent type\nvariable, we show that an extremum classifier consistently estimates the type\nassignment. Based on the classification result, we propose a type-specific\ndiff-in-diff estimator for the type-specific CATT. 
By estimating the CATT with\nregard to the latent type, we study heterogeneity in treatment effect, in\naddition to heterogeneity in baseline outcomes."}, "http://arxiv.org/abs/2204.07672": {"title": "Abadie's Kappa and Weighting Estimators of the Local Average Treatment Effect", "link": "http://arxiv.org/abs/2204.07672", "description": "In this paper we study the finite sample and asymptotic properties of various\nweighting estimators of the local average treatment effect (LATE), each of\nwhich can be motivated by Abadie's (2003) kappa theorem. Our framework presumes\na binary treatment and a binary instrument, which may only be valid after\nconditioning on additional covariates. We argue that two of the estimators\nunder consideration, which are weight normalized, are generally preferable.\nSeveral other estimators, which are unnormalized, do not satisfy the properties\nof scale invariance with respect to the natural logarithm and translation\ninvariance, thereby exhibiting sensitivity to the units of measurement when\nestimating the LATE in logs and the centering of the outcome variable more\ngenerally. We also demonstrate that, when noncompliance is one sided, certain\nestimators have the advantage of being based on a denominator that is strictly\ngreater than zero by construction. This is the case for only one of the two\nnormalized estimators, and we recommend this estimator for wider use. We\nillustrate our findings with a simulation study and three empirical\napplications. The importance of normalization is particularly apparent in\napplications to real data. The simulations also suggest that covariate\nbalancing estimation of instrument propensity scores may be more robust to\nmisspecification. Software for implementing these methods is available in\nStata."}, "http://arxiv.org/abs/2208.03737": {"title": "(Functional)Characterizations vs (Finite)Tests: Partially Unifying Functional and Inequality-Based Approaches to Testing", "link": "http://arxiv.org/abs/2208.03737", "description": "Historically, testing if decision-makers obey certain choice axioms using\nchoice data takes two distinct approaches. The 'functional' approach observes\nand tests the entire 'demand' or 'choice' function, whereas the 'revealed\npreference (RP)' approach constructs inequalities to test finite choices. I\ndemonstrate that a statistical recasting of the revealed preference approach enables uniting both\napproaches. Specifically, I construct a computationally efficient algorithm to\noutput one-sided statistical tests of choice data from functional\ncharacterizations of axiomatic behavior, thus linking statistical and RP\ntesting. An application to weakly separable preferences, where RP\ncharacterizations are provably NP-Hard, demonstrates the approach's merit. I\nalso show that without assuming monotonicity, all restrictions disappear.\nHence, any ability to resolve axiomatic behavior relies on the monotonicity\nassumption."}, "http://arxiv.org/abs/2211.13610": {"title": "Cross-Sectional Dynamics Under Network Structure: Theory and Macroeconomic Applications", "link": "http://arxiv.org/abs/2211.13610", "description": "Many environments in economics feature a cross-section of units linked by\nbilateral ties. I develop a framework for studying dynamics of cross-sectional\nvariables exploiting this network structure. It is a vector autoregression in\nwhich innovations transmit cross-sectionally only via bilateral links and which\ncan accommodate rich patterns of how network effects of higher order accumulate\nover time. 
The model can be used to estimate dynamic network effects, with the\nnetwork given or inferred from dynamic cross-correlations in the data. It also\noffers a dimensionality-reduction technique for modeling high-dimensional\n(cross-sectional) processes, owing to networks' ability to summarize complex\nrelations among variables (units) by relatively few non-zero bilateral links.\nIn a first application, I estimate how sectoral productivity shocks transmit\nalong supply chain linkages and affect dynamics of sectoral prices in the US\neconomy. The analysis suggests that network positions can rationalize not only\nthe strength of a sector's impact on aggregates, but also its timing. In a\nsecond application, I model industrial production growth across 44 countries by\nassuming global business cycles are driven by bilateral links which I estimate.\nThis reduces out-of-sample mean squared errors by up to 23% relative to a\nprincipal components factor model."}, "http://arxiv.org/abs/2303.00083": {"title": "Transition Probabilities and Moment Restrictions in Dynamic Fixed Effects Logit Models", "link": "http://arxiv.org/abs/2303.00083", "description": "Dynamic logit models are popular tools in economics to measure state\ndependence. This paper introduces a new method to derive moment restrictions in\na large class of such models with strictly exogenous regressors and fixed\neffects. We exploit the common structure of logit-type transition probabilities\nand elementary properties of rational fractions, to formulate a systematic\nprocedure that scales naturally with model complexity (e.g the lag order or the\nnumber of observed time periods). We detail the construction of moment\nrestrictions in binary response models of arbitrary lag order as well as\nfirst-order panel vector autoregressions and dynamic multinomial logit models.\nIdentification of common parameters and average marginal effects is also\ndiscussed for the binary response case. Finally, we illustrate our results by\nstudying the dynamics of drug consumption amongst young people inspired by Deza\n(2015)."}, "http://arxiv.org/abs/2306.13362": {"title": "Sparse plus dense MIDAS regressions and nowcasting during the COVID pandemic", "link": "http://arxiv.org/abs/2306.13362", "description": "The common practice for GDP nowcasting in a data-rich environment is to\nemploy either sparse regression using LASSO-type regularization or a dense\napproach based on factor models or ridge regression, which differ in the way\nthey extract information from high-dimensional datasets. This paper aims to\ninvestigate whether sparse plus dense mixed frequency regression methods can\nimprove the nowcasts of the US GDP growth. We propose two novel MIDAS\nregressions and show that these novel sparse plus dense methods greatly improve\nthe accuracy of nowcasts during the COVID pandemic compared to either only\nsparse or only dense approaches. Using monthly macro and weekly financial\nseries, we further show that the improvement is particularly sharp when the\ndense component is restricted to be macro, while the sparse signal stems from\nboth macro and financial series."}, "http://arxiv.org/abs/2312.02288": {"title": "Almost Dominance: Inference and Application", "link": "http://arxiv.org/abs/2312.02288", "description": "This paper proposes a general framework for inference on three types of\nalmost dominances: Almost Lorenz dominance, almost inverse stochastic\ndominance, and almost stochastic dominance. 
We first generalize almost Lorenz\ndominance to almost upward and downward Lorenz dominances. We then provide a\nbootstrap inference procedure for the Lorenz dominance coefficients, which\nmeasure the degrees of almost Lorenz dominances. Furthermore, we propose almost\nupward and downward inverse stochastic dominances and provide inference on the\ninverse stochastic dominance coefficients. We also show that our results can\neasily be extended to almost stochastic dominance. Simulation studies\ndemonstrate the finite sample properties of the proposed estimators and the\nbootstrap confidence intervals. We apply our methods to the inequality growth\nin the United Kingdom and find evidence for almost upward inverse stochastic\ndominance."}, "http://arxiv.org/abs/2206.08052": {"title": "Likelihood ratio test for structural changes in factor models", "link": "http://arxiv.org/abs/2206.08052", "description": "A factor model with a break in its factor loadings is observationally\nequivalent to a model without changes in the loadings but a change in the\nvariance of its factors. This effectively transforms a structural change\nproblem of high dimension into a problem of low dimension. This paper considers\nthe likelihood ratio (LR) test for a variance change in the estimated factors.\nThe LR test implicitly explores a special feature of the estimated factors: the\npre-break and post-break variances can be a singular matrix under the\nalternative hypothesis, making the LR test diverge faster and thus be more\npowerful than Wald-type tests. The better power property of the LR test is also\nconfirmed by simulations. We also consider mean changes and multiple breaks. We\napply the procedure to the factor modelling and structural change of US\nemployment using monthly industry-level data."}, "http://arxiv.org/abs/2207.03035": {"title": "On the instrumental variable estimation with many weak and invalid instruments", "link": "http://arxiv.org/abs/2207.03035", "description": "We discuss the fundamental issue of identification in linear instrumental\nvariable (IV) models with unknown IV validity. With the assumption of the\n\"sparsest rule\", which is equivalent to the plurality rule but becomes\noperational in computation algorithms, we investigate and prove the advantages\nof non-convex penalized approaches over other IV estimators based on two-step\nselections, in terms of selection consistency and accommodation for\nindividually weak IVs. Furthermore, we propose a surrogate sparsest penalty\nthat aligns with the identification condition and provides oracle sparse\nstructure simultaneously. Desirable theoretical properties are derived for the\nproposed estimator with weaker IV strength conditions compared to the previous\nliterature. Finite sample properties are demonstrated using simulations and the\nselection and estimation method is applied to an empirical study concerning the\neffect of BMI on diastolic blood pressure."}, "http://arxiv.org/abs/2301.07855": {"title": "Digital Divide: Empirical Study of CIUS 2020", "link": "http://arxiv.org/abs/2301.07855", "description": "Canada and other major countries are investigating the implementation of\n``digital money'' or Central Bank Digital Currencies, necessitating answers to\nkey questions about how demographic and geographic factors influence the\npopulation's digital literacy. 
This paper uses the Canadian Internet Use Survey\n(CIUS) 2020 and survey versions of Lasso inference methods to assess the\ndigital divide in Canada and determine the relevant factors that influence it.\nWe find that a significant divide in the use of digital technologies, e.g.,\nonline banking and virtual wallet, continues to exist across different\ndemographic and geographic categories. We also create a digital divide score\nthat measures the survey respondents' digital literacy and provide multiple\ncorrespondence analyses that further corroborate these findings."}, "http://arxiv.org/abs/2312.03165": {"title": "A Theory Guide to Using Control Functions to Instrument Hazard Models", "link": "http://arxiv.org/abs/2312.03165", "description": "I develop the theory around using control functions to instrument hazard\nmodels, allowing the inclusion of endogenous (e.g., mismeasured) regressors.\nSimple discrete-data hazard models can be expressed as binary choice panel data\nmodels, and the widespread Prentice and Gloeckler (1978) discrete-data\nproportional hazards model can specifically be expressed as a complementary\nlog-log model with time fixed effects. This allows me to recast it as GMM\nestimation and its instrumented version as sequential GMM estimation in a\nZ-estimation (non-classical GMM) framework; this framework can then be\nleveraged to establish asymptotic properties and sufficient conditions. Whilst\nthis paper focuses on the Prentice and Gloeckler (1978) model, the methods and\ndiscussion developed here can be applied more generally to other hazard models\nand binary choice models. I also introduce my Stata command for estimating a\ncomplementary log-log model instrumented via control functions (available as\nivcloglog on SSC), which allows practitioners to easily instrument the Prentice\nand Gloeckler (1978) model."}, "http://arxiv.org/abs/2005.05942": {"title": "Moment Conditions for Dynamic Panel Logit Models with Fixed Effects", "link": "http://arxiv.org/abs/2005.05942", "description": "This paper investigates the construction of moment conditions in discrete\nchoice panel data with individual specific fixed effects. We describe how to\nsystematically explore the existence of moment conditions that do not depend on\nthe fixed effects, and we demonstrate how to construct them when they exist.\nOur approach is closely related to the numerical \"functional differencing\"\nconstruction in Bonhomme (2012), but our emphasis is to find explicit analytic\nexpressions for the moment functions. We first explain the construction and\ngive examples of such moment conditions in various models. Then, we focus on\nthe dynamic binary choice logit model and explore the implications of the\nmoment conditions for identification and estimation of the model parameters\nthat are common to all individuals."}, "http://arxiv.org/abs/2104.12909": {"title": "Algorithm as Experiment: Machine Learning, Market Design, and Policy Eligibility Rules", "link": "http://arxiv.org/abs/2104.12909", "description": "Algorithms make a growing portion of policy and business decisions. We\ndevelop a treatment-effect estimator using algorithmic decisions as instruments\nfor a class of stochastic and deterministic algorithms. Our estimator is\nconsistent and asymptotically normal for well-defined causal effects. A special\ncase of our setup is multidimensional regression discontinuity designs with\ncomplex boundaries. 
We apply our estimator to evaluate the Coronavirus Aid,\nRelief, and Economic Security Act, which allocated many billions of dollars\nworth of relief funding to hospitals via an algorithmic rule. The funding is\nshown to have little effect on COVID-19-related hospital activities. Naive\nestimates exhibit selection bias."}, "http://arxiv.org/abs/2312.03915": {"title": "Alternative models for FX, arbitrage opportunities and efficient pricing of double barrier options in L\\'evy models", "link": "http://arxiv.org/abs/2312.03915", "description": "We analyze the qualitative differences between prices of double barrier\nno-touch options in the Heston model and pure jump KoBoL model calibrated to\nthe same set of empirical data, and discuss the potential for arbitrage\nopportunities if the correct model is a pure jump model. We explain and\ndemonstrate with numerical examples that accurate and fast calculations of\nprices of double barrier options in jump models are extremely difficult using\nthe numerical methods available in the literature. We develop a new efficient\nmethod (GWR-SINH method) based on the Gaver-Wynn-Rho acceleration applied to\nthe Bromwich integral; the SINH-acceleration and simplified trapezoid rule are\nused to evaluate perpetual double barrier options for each value of the\nspectral parameter in the GWR-algorithm. The program in Matlab running on a Mac\nwith moderate characteristics achieves the precision of the order of E-5 and\nbetter in several dozen milliseconds; the precision E-07 is\nachievable in about 0.1 sec. We outline the extension of the GWR-SINH method to\nregime-switching models and models with stochastic parameters and stochastic\ninterest rates."}, "http://arxiv.org/abs/2312.04428": {"title": "A general framework for the generation of probabilistic socioeconomic scenarios: Quantification of national food security risk with application to the cases of Egypt and Ethiopia", "link": "http://arxiv.org/abs/2312.04428", "description": "In this work a general framework for providing detailed probabilistic\nsocioeconomic scenarios as well as estimates concerning country-level food\nsecurity risk is proposed. Our methodology builds on (a) the Bayesian\nprobabilistic version of the world population model and (b) on the\ninterdependencies of the minimum food requirements and the national food system\ncapacities on key drivers, such as: population, income, natural resources, and\nother socioeconomic and climate indicators. Model uncertainty plays an\nimportant role in such endeavours. In this perspective, the concept of the\nrecently developed convex risk measures, which mitigate the model uncertainty\neffects, is employed for the development of a framework for assessment, in the\ncontext of food security. The proposed method provides predictions and\nevaluations for food security risk both within and across probabilistic\nscenarios at country level. Our methodology is illustrated through its\nimplementation for the cases of Egypt and Ethiopia, for the time period\n2019-2050, under the combined context of the Shared Socioeconomic Pathways\n(SSPs) and the Representative Concentration Pathways (RCPs)."}, "http://arxiv.org/abs/2108.02196": {"title": "Synthetic Controls for Experimental Design", "link": "http://arxiv.org/abs/2108.02196", "description": "This article studies experimental design in settings where the experimental\nunits are large aggregate entities (e.g., markets), and only one or a small\nnumber of units can be exposed to the treatment. 
In such settings,\nrandomization of the treatment may result in treated and control groups with\nvery different characteristics at baseline, inducing biases. We propose a\nvariety of synthetic control designs (Abadie, Diamond and Hainmueller, 2010,\nAbadie and Gardeazabal, 2003) as experimental designs to select treated units\nin non-randomized experiments with large aggregate units, as well as the\nuntreated units to be used as a control group. Average potential outcomes are\nestimated as weighted averages of treated units, for potential outcomes with\ntreatment -- and control units, for potential outcomes without treatment. We\nanalyze the properties of estimators based on synthetic control designs and\npropose new inferential techniques. We show that in experimental settings with\naggregate units, synthetic control designs can substantially reduce estimation\nbiases in comparison to randomization of the treatment."}, "http://arxiv.org/abs/2108.04852": {"title": "Multiway empirical likelihood", "link": "http://arxiv.org/abs/2108.04852", "description": "This paper develops a general methodology to conduct statistical inference\nfor observations indexed by multiple sets of entities. We propose a novel\nmultiway empirical likelihood statistic that converges to a chi-square\ndistribution under the non-degenerate case, where corresponding Hoeffding type\ndecomposition is dominated by linear terms. Our methodology is related to the\nnotion of jackknife empirical likelihood but the leave-out pseudo values are\nconstructed by leaving columns or rows. We further develop a modified version\nof our multiway empirical likelihood statistic, which converges to a chi-square\ndistribution regardless of the degeneracy, and discover its desirable\nhigher-order property compared to the t-ratio by the conventional Eicker-White\ntype variance estimator. The proposed methodology is illustrated by several\nimportant statistical problems, such as bipartite network, generalized\nestimating equations, and three-way observations."}, "http://arxiv.org/abs/2306.03632": {"title": "Uniform Inference for Cointegrated Vector Autoregressive Processes", "link": "http://arxiv.org/abs/2306.03632", "description": "Uniformly valid inference for cointegrated vector autoregressive processes\nhas so far proven difficult due to certain discontinuities arising in the\nasymptotic distribution of the least squares estimator. We extend asymptotic\nresults from the univariate case to multiple dimensions and show how inference\ncan be based on these results. Furthermore, we show that lag augmentation and a\nrecent instrumental variable procedure can also yield uniformly valid tests and\nconfidence regions. We verify the theoretical findings and investigate finite\nsample properties in simulation experiments for two specific examples."}, "http://arxiv.org/abs/2005.04141": {"title": "Critical Values Robust to P-hacking", "link": "http://arxiv.org/abs/2005.04141", "description": "P-hacking is prevalent in reality but absent from classical hypothesis\ntesting theory. As a consequence, significant results are much more common than\nthey are supposed to be when the null hypothesis is in fact true. In this\npaper, we build a model of hypothesis testing with p-hacking. 
From the model,\nwe construct critical values such that, if the values are used to determine\nsignificance, and if scientists' p-hacking behavior adjusts to the new\nsignificance standards, significant results occur with the desired frequency.\nSuch robust critical values allow for p-hacking so they are larger than\nclassical critical values. To illustrate the amount of correction that\np-hacking might require, we calibrate the model using evidence from the medical\nsciences. In the calibrated model the robust critical value for any test\nstatistic is the classical critical value for the same test statistic with one\nfifth of the significance level."}, "http://arxiv.org/abs/2312.05342": {"title": "Occasionally Misspecified", "link": "http://arxiv.org/abs/2312.05342", "description": "When fitting a particular Economic model on a sample of data, the model may\nturn out to be heavily misspecified for some observations. This can happen\nbecause of unmodelled idiosyncratic events, such as an abrupt but short-lived\nchange in policy. These outliers can significantly alter estimates and\ninferences. A robust estimation is desirable to limit their influence. For\nskewed data, this induces another bias which can also invalidate the estimation\nand inferences. This paper proposes a robust GMM estimator with a simple bias\ncorrection that does not degrade robustness significantly. The paper provides\nfinite-sample robustness bounds, and asymptotic uniform equivalence with an\noracle that discards all outliers. Consistency and asymptotic normality ensue\nfrom that result. An application to the \"Price-Puzzle,\" which finds inflation\nincreases when monetary policy tightens, illustrates the concerns and the\nmethod. The proposed estimator finds the intuitive result: tighter monetary\npolicy leads to a decline in inflation."}, "http://arxiv.org/abs/2312.05373": {"title": "GCov-Based Portmanteau Test", "link": "http://arxiv.org/abs/2312.05373", "description": "We examine finite sample performance of the Generalized Covariance (GCov)\nresidual-based specification test for semiparametric models with i.i.d. errors.\nThe residual-based multivariate portmanteau test statistic follows\nasymptotically a $\\chi^2$ distribution when the model is estimated by the GCov\nestimator. The test is shown to perform well in application to the univariate\nmixed causal-noncausal MAR, double autoregressive (DAR) and multivariate Vector\nAutoregressive (VAR) models. We also introduce a bootstrap procedure that\nprovides the limiting distribution of the test statistic when the specification\ntest is applied to a model estimated by the maximum likelihood, or the\napproximate or quasi-maximum likelihood under a parametric assumption on the\nerror distribution."}, "http://arxiv.org/abs/2312.05593": {"title": "Economic Forecasts Using Many Noises", "link": "http://arxiv.org/abs/2312.05593", "description": "This paper addresses a key question in economic forecasting: does pure noise\ntruly lack predictive power? Economists typically conduct variable selection to\neliminate noises from predictors. Yet, we prove a compelling result that in\nmost economic forecasts, the inclusion of noises in predictions yields greater\nbenefits than its exclusion. Furthermore, if the total number of predictors is\nnot sufficiently large, intentionally adding more noises yields superior\nforecast performance, outperforming benchmark predictors relying on dimension\nreduction. 
The intuition lies in economic predictive signals being densely\ndistributed among regression coefficients, maintaining modest forecast bias\nwhile diversifying away overall variance, even when a significant proportion of\npredictors constitute pure noises. One of our empirical demonstrations shows\nthat intentionally adding 300~6,000 pure noises to the Welch and Goyal (2008)\ndataset achieves a noteworthy 10% out-of-sample R square accuracy in\nforecasting the annual U.S. equity premium. The performance surpasses the\nmajority of sophisticated machine learning models."}, "http://arxiv.org/abs/2312.05700": {"title": "Influence Analysis with Panel Data", "link": "http://arxiv.org/abs/2312.05700", "description": "The presence of units with extreme values in the dependent and/or independent\nvariables (i.e., vertical outliers, leveraged data) has the potential to\nseverely bias regression coefficients and/or standard errors. This is common\nwith short panel data because the researcher cannot appeal to asymptotic theory.\nExamples include cross-country studies, cell-group analyses, and field or\nlaboratory experimental studies, where the researcher is forced to use few\ncross-sectional observations repeated over time due to the structure of the\ndata or research design. Available diagnostic tools may fail to properly detect\nthese anomalies, because they are not designed for panel data. In this paper,\nwe formalise statistical measures for panel data models with fixed effects to\nquantify the degree of leverage and outlyingness of units, and the joint and\nconditional influences of pairs of units. We first develop a method to visually\ndetect anomalous units in a panel data set, and identify their type. Second, we\ninvestigate the effect of these units on LS estimates, and on other units'\ninfluence on the estimated parameters. To illustrate and validate the proposed\nmethod, we use a synthetic data set contaminated with different types of\nanomalous units. We also provide an empirical example."}, "http://arxiv.org/abs/2312.05858": {"title": "The Machine Learning Control Method for Counterfactual Forecasting", "link": "http://arxiv.org/abs/2312.05858", "description": "Without a credible control group, the most widespread methodologies for\nestimating causal effects cannot be applied. To fill this gap, we propose the\nMachine Learning Control Method (MLCM), a new approach for causal panel\nanalysis based on counterfactual forecasting with machine learning. The MLCM\nestimates policy-relevant causal parameters in short- and long-panel settings\nwithout relying on untreated units. We formalize identification in the\npotential outcomes framework and then provide estimation based on supervised\nmachine learning algorithms. To illustrate the advantages of our estimator, we\npresent simulation evidence and an empirical application on the impact of the\nCOVID-19 crisis on educational inequality in Italy. We implement the proposed\nmethod in the companion R package MachineControl."}, "http://arxiv.org/abs/2312.05898": {"title": "Dynamic Spatiotemporal ARCH Models: Small and Large Sample Results", "link": "http://arxiv.org/abs/2312.05898", "description": "This paper explores the estimation of a dynamic spatiotemporal autoregressive\nconditional heteroscedasticity (ARCH) model. 
The log-volatility term in this\nmodel can depend on (i) the spatial lag of the log-squared outcome variable,\n(ii) the time-lag of the log-squared outcome variable, (iii) the spatiotemporal\nlag of the log-squared outcome variable, (iv) exogenous variables, and (v) the\nunobserved heterogeneity across regions and time, i.e., the regional and time\nfixed effects. We examine the small and large sample properties of two\nquasi-maximum likelihood estimators and a generalized method of moments\nestimator for this model. We first summarize the theoretical properties of\nthese estimators and then compare their finite sample properties through Monte\nCarlo simulations."}, "http://arxiv.org/abs/2312.05985": {"title": "Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions", "link": "http://arxiv.org/abs/2312.05985", "description": "To address the bias of the canonical two-way fixed effects estimator for\ndifference-in-differences under staggered adoptions, Wooldridge (2021) proposed\nthe extended two-way fixed effects estimator, which adds many parameters.\nHowever, this reduces efficiency. Restricting some of these parameters to be\nequal helps, but ad hoc restrictions may reintroduce bias. We propose a machine\nlearning estimator with a single tuning parameter, fused extended two-way fixed\neffects (FETWFE), that enables automatic data-driven selection of these\nrestrictions. We prove that under an appropriate sparsity assumption FETWFE\nidentifies the correct restrictions with probability tending to one. We also\nprove the consistency, asymptotic normality, and oracle efficiency of FETWFE\nfor two classes of heterogeneous marginal treatment effect estimators under\neither conditional or marginal parallel trends, and we prove consistency for\ntwo classes of conditional average treatment effects under conditional parallel\ntrends. We demonstrate FETWFE in simulation studies and an empirical\napplication."}, "http://arxiv.org/abs/2312.06379": {"title": "Trends in Temperature Data: Micro-foundations of Their Nature", "link": "http://arxiv.org/abs/2312.06379", "description": "Determining whether Global Average Temperature (GAT) is an integrated process\nof order 1, I(1), or is a stationary process around a trend function is crucial\nfor detection, attribution, impact and forecasting studies of climate change.\nIn this paper, we investigate the nature of trends in GAT building on the\nanalysis of individual temperature grids. Our 'micro-founded' evidence suggests\nthat GAT is stationary around a non-linear deterministic trend in the form of a\nlinear function with a one-period structural break. This break can be\nattributed to a combination of individual grid breaks and the standard\naggregation method under acceleration in global warming. 
We illustrate our\nfindings using simulations."}, "http://arxiv.org/abs/2312.06402": {"title": "Structural Analysis of Vector Autoregressive Models", "link": "http://arxiv.org/abs/2312.06402", "description": "This set of lecture notes discuss key concepts for the structural analysis of\nVector Autoregressive models for the teaching of Applied Macroeconometrics\nmodule."}, "http://arxiv.org/abs/2110.12722": {"title": "Functional instrumental variable regression with an application to estimating the impact of immigration on native wages", "link": "http://arxiv.org/abs/2110.12722", "description": "Functional linear regression gets its popularity as a statistical tool to\nstudy the relationship between function-valued response and exogenous\nexplanatory variables. However, in practice, it is hard to expect that the\nexplanatory variables of interest are perfectly exogenous, due to, for example,\nthe presence of omitted variables and measurement error. Despite its empirical\nrelevance, it was not until recently that this issue of endogeneity was studied\nin the literature on functional regression, and the development in this\ndirection does not seem to sufficiently meet practitioners' needs; for example,\nthis issue has been discussed with paying particular attention on consistent\nestimation and thus distributional properties of the proposed estimators still\nremain to be further explored. To fill this gap, this paper proposes new\nconsistent FPCA-based instrumental variable estimators and develops their\nasymptotic properties in detail. Simulation experiments under a wide range of\nsettings show that the proposed estimators perform considerably well. We apply\nour methodology to estimate the impact of immigration on native wages."}, "http://arxiv.org/abs/2205.01882": {"title": "Approximating Choice Data by Discrete Choice Models", "link": "http://arxiv.org/abs/2205.01882", "description": "We obtain a necessary and sufficient condition under which random-coefficient\ndiscrete choice models, such as mixed-logit models, are rich enough to\napproximate any nonparametric random utility models arbitrarily well across\nchoice sets. The condition turns out to be the affine-independence of the set\nof characteristic vectors. When the condition fails, resulting in some random\nutility models that cannot be closely approximated, we identify preferences and\nsubstitution patterns that are challenging to approximate accurately. We also\npropose algorithms to quantify the magnitude of approximation errors."}, "http://arxiv.org/abs/2305.18114": {"title": "Identifying Dynamic LATEs with a Static Instrument", "link": "http://arxiv.org/abs/2305.18114", "description": "In many situations, researchers are interested in identifying dynamic effects\nof an irreversible treatment with a static binary instrumental variable (IV).\nFor example, in evaluations of dynamic effects of training programs, with a\nsingle lottery determining eligibility. A common approach in these situations\nis to report per-period IV estimates. Under a dynamic extension of standard IV\nassumptions, we show that such IV estimators identify a weighted sum of\ntreatment effects for different latent groups and treatment exposures. However,\nthere is possibility of negative weights. 
We consider point and partial\nidentification of dynamic treatment effects in this setting under different\nsets of assumptions."}, "http://arxiv.org/abs/2312.07520": {"title": "Estimating Counterfactual Matrix Means with Short Panel Data", "link": "http://arxiv.org/abs/2312.07520", "description": "We develop a more flexible approach for identifying and estimating average\ncounterfactual outcomes when several but not all possible outcomes are observed\nfor each unit in a large cross section. Such settings include event studies and\nstudies of outcomes of \"matches\" between agents of two types, e.g. workers and\nfirms or people and places. When outcomes are generated by a factor model that\nallows for low-dimensional unobserved confounders, our method yields\nconsistent, asymptotically normal estimates of counterfactual outcome means\nunder asymptotics that fix the number of outcomes as the cross section grows\nand general outcome missingness patterns, including those not accommodated by\nexisting methods. Our method is also computationally efficient, requiring only\na single eigendecomposition of a particular aggregation of any factor estimates\nconstructed using subsets of units with the same observed outcomes. In a\nsemi-synthetic simulation study based on matched employer-employee data, our\nmethod performs favorably compared to a Two-Way-Fixed-Effects-model-based\nestimator."}, "http://arxiv.org/abs/2211.16362": {"title": "Score-based calibration testing for multivariate forecast distributions", "link": "http://arxiv.org/abs/2211.16362", "description": "Calibration tests based on the probability integral transform (PIT) are\nroutinely used to assess the quality of univariate distributional forecasts.\nHowever, PIT-based calibration tests for multivariate distributional forecasts\nface various challenges. We propose two new types of tests based on proper\nscoring rules, which overcome these challenges. They arise from a general\nframework for calibration testing in the multivariate case, introduced in this\nwork. The new tests have good size and power properties in simulations and\nsolve various problems of existing tests. We apply the tests to forecast\ndistributions for macroeconomic and financial time series data."}, "http://arxiv.org/abs/2309.04926": {"title": "Testing for Stationary or Persistent Coefficient Randomness in Predictive Regressions", "link": "http://arxiv.org/abs/2309.04926", "description": "This study considers tests for coefficient randomness in predictive\nregressions. Our focus is on how tests for coefficient randomness are\ninfluenced by the persistence of random coefficient. We find that when the\nrandom coefficient is stationary, or I(0), Nyblom's (1989) LM test loses its\noptimality (in terms of power), which is established against the alternative of\nintegrated, or I(1), random coefficient. We demonstrate this by constructing\ntests that are more powerful than the LM test when random coefficient is\nstationary, although these tests are dominated in terms of power by the LM test\nwhen random coefficient is integrated. This implies that the best test for\ncoefficient randomness differs from context to context, and practitioners\nshould take into account the persistence of potentially random coefficient and\nchoose from several tests accordingly. We apply tests for coefficient constancy\nto real data. 
The results mostly reverse the conclusion of an earlier empirical\nstudy."}, "http://arxiv.org/abs/2312.07683": {"title": "On Rosenbaum's Rank-based Matching Estimator", "link": "http://arxiv.org/abs/2312.07683", "description": "In two influential contributions, Rosenbaum (2005, 2020) advocated for using\nthe distances between component-wise ranks, instead of the original data\nvalues, to measure covariate similarity when constructing matching estimators\nof average treatment effects. While the intuitive benefits of using covariate\nranks for matching estimation are apparent, there is no theoretical\nunderstanding of such procedures in the literature. We fill this gap by\ndemonstrating that Rosenbaum's rank-based matching estimator, when coupled with\na regression adjustment, enjoys the properties of double robustness and\nsemiparametric efficiency without the need to enforce restrictive covariate\nmoment assumptions. Our theoretical findings further emphasize the statistical\nvirtues of employing ranks for estimation and inference, more broadly aligning\nwith the insights put forth by Peter Bickel in his 2004 Rietz lecture (Bickel,\n2004)."}, "http://arxiv.org/abs/2312.07881": {"title": "Efficiency of QMLE for dynamic panel data models with interactive effects", "link": "http://arxiv.org/abs/2312.07881", "description": "This paper derives the efficiency bound for estimating the parameters of\ndynamic panel data models in the presence of an increasing number of incidental\nparameters. We study the efficiency problem by formulating the dynamic panel as\na simultaneous equations system, and show that the quasi-maximum likelihood\nestimator (QMLE) applied to the system achieves the efficiency bound.\nComparison of QMLE with fixed effects estimators is made."}, "http://arxiv.org/abs/2312.08171": {"title": "Individual Updating of Subjective Probability of Homicide Victimization: a \"Natural Experiment'' on Risk Communication", "link": "http://arxiv.org/abs/2312.08171", "description": "We investigate the dynamics of the update of subjective homicide\nvictimization risk after an informational shock by developing two econometric\nmodels able to accommodate both optimal decisions of changing prior\nexpectations which enable us to rationalize skeptical Bayesian agents with\ntheir disregard for new information. We apply our models to a unique household\ndataset (N = 4,030) that consists of socioeconomic and victimization expectation\nvariables in Brazil, coupled with an informational ``natural experiment''\nbrought by the sample design methodology, which randomized interviewers to\ninterviewees. The higher the priors about their own subjective homicide\nvictimization risk are set, the more likely individuals are to change their\ninitial perceptions. In case of an update, we find that elders and females are\nmore reluctant to change priors and choose the new response level. In addition,\neven though the respondents' level of education is not significant, the\ninterviewers' level of education has a key role in changing and updating\ndecisions. The results show that our econometric approach fits reasonably well\nthe available empirical evidence, stressing the salient role that heterogeneity\nrepresented by individual characteristics of interviewees and interviewers has\non belief updating and the lack of it, say, skepticism. 
Furthermore, we can\nrationalize skeptics through an informational quality/credibility argument."}, "http://arxiv.org/abs/2312.08174": {"title": "Double Machine Learning for Static Panel Models with Fixed Effects", "link": "http://arxiv.org/abs/2312.08174", "description": "Machine Learning (ML) algorithms are powerful data-driven tools for\napproximating high-dimensional or non-linear nuisance functions which are\nuseful in practice because the true functional form of the predictors is\nex-ante unknown. In this paper, we develop estimators of policy interventions\nfrom panel data which allow for non-linear effects of the confounding\nregressors, and investigate the performance of these estimators using three\nwell-known ML algorithms, specifically, LASSO, classification and regression\ntrees, and random forests. We use Double Machine Learning (DML) (Chernozhukov\net al., 2018) for the estimation of causal effects of homogeneous treatments\nwith unobserved individual heterogeneity (fixed effects) and no unobserved\nconfounding by extending Robinson (1988)'s partially linear regression model.\nWe develop three alternative approaches for handling unobserved individual\nheterogeneity based on extending the within-group estimator, first-difference\nestimator, and correlated random effect estimator (Mundlak, 1978) for\nnon-linear models. Using Monte Carlo simulations, we find that conventional\nleast squares estimators can perform well even if the data generating process\nis non-linear, but there are substantial performance gains in terms of bias\nreduction under a process where the true effect of the regressors is non-linear\nand discontinuous. However, for the same scenarios, we also find -- despite\nextensive hyperparameter tuning -- inference to be problematic for both\ntree-based learners because these lead to highly non-normal estimator\ndistributions and the estimator variance being severely under-estimated. This\ncontradicts the performance of trees in other circumstances and requires\nfurther investigation. Finally, we provide an illustrative example of DML for\nobservational panel data showing the impact of the introduction of the national\nminimum wage in the UK."}, "http://arxiv.org/abs/2201.11304": {"title": "Standard errors for two-way clustering with serially correlated time effects", "link": "http://arxiv.org/abs/2201.11304", "description": "We propose improved standard errors and an asymptotic distribution theory for\ntwo-way clustered panels. Our proposed estimator and theory allow for arbitrary\nserial dependence in the common time effects, which is excluded by existing\ntwo-way methods, including the popular two-way cluster standard errors of\nCameron, Gelbach, and Miller (2011) and the cluster bootstrap of Menzel (2021).\nOur asymptotic distribution theory is the first which allows for this level of\ninter-dependence among the observations. Under weak regularity conditions, we\ndemonstrate that the least squares estimator is asymptotically normal, our\nproposed variance estimator is consistent, and t-ratios are asymptotically\nstandard normal, permitting conventional inference. We present simulation\nevidence that confidence intervals constructed with our proposed standard\nerrors obtain superior coverage performance relative to existing methods. 
We\nillustrate the relevance of the proposed method in an empirical application to\na standard Fama-French three-factor regression."}, "http://arxiv.org/abs/2303.04416": {"title": "Inference on Optimal Dynamic Policies via Softmax Approximation", "link": "http://arxiv.org/abs/2303.04416", "description": "Estimating optimal dynamic policies from offline data is a fundamental\nproblem in dynamic decision making. In the context of causal inference, the\nproblem is known as estimating the optimal dynamic treatment regime. Even\nthough there exists a plethora of methods for estimation, constructing\nconfidence intervals for the value of the optimal regime and structural\nparameters associated with it is inherently harder, as it involves non-linear\nand non-differentiable functionals of unknown quantities that need to be\nestimated. Prior work resorted to sub-sample approaches that can deteriorate\nthe quality of the estimate. We show that a simple soft-max approximation to\nthe optimal treatment regime, for an appropriately fast growing temperature\nparameter, can achieve valid inference on the truly optimal regime. We\nillustrate our result for a two-period optimal dynamic regime, though our\napproach should directly extend to the finite horizon case. Our work combines\ntechniques from semi-parametric inference and $g$-estimation, together with an\nappropriate triangular array central limit theorem, as well as a novel analysis\nof the asymptotic influence and asymptotic bias of softmax approximations."}, "http://arxiv.org/abs/1904.00111": {"title": "Simple subvector inference on sharp identified set in affine models", "link": "http://arxiv.org/abs/1904.00111", "description": "This paper studies a regularized support function estimator for bounds on\ncomponents of the parameter vector in the case in which the identified set is a\npolygon. The proposed regularized estimator has three important properties: (i)\nit has a uniform asymptotic Gaussian limit in the presence of flat faces in the\nabsence of redundant (or overidentifying) constraints (or vice versa); (ii) the\nbias from regularization does not enter the first-order limiting\ndistribution; (iii) the estimator remains consistent for the sharp identified set\nfor the individual components even in the non-regular case. These properties\nare used to construct uniformly valid confidence sets for an element\n$\\theta_{1}$ of a parameter vector $\\theta\\in\\mathbb{R}^{d}$ that is partially\nidentified by affine moment equality and inequality conditions. The proposed\nconfidence sets can be computed as a solution to a small number of linear and\nconvex quadratic programs, which leads to a substantial decrease in computation\ntime and guarantees a global optimum. As a result, the method provides\nuniformly valid inference in applications in which the dimension of the\nparameter space, $d$, and the number of inequalities, $k$, were previously\ncomputationally unfeasible ($d,k=100$). The proposed approach can be extended\nto construct confidence sets for intersection bounds, to construct joint\npolygon-shaped confidence sets for multiple components of $\\theta$, and to find\nthe set of solutions to a linear program. 
Inference for coefficients in the\nlinear IV regression model with an interval outcome is used as an illustrative\nexample."}, "http://arxiv.org/abs/1911.04529": {"title": "Identification in discrete choice models with imperfect information", "link": "http://arxiv.org/abs/1911.04529", "description": "We study identification of preferences in static single-agent discrete choice\nmodels where decision makers may be imperfectly informed about the state of the\nworld. We leverage the notion of one-player Bayes Correlated Equilibrium by\nBergemann and Morris (2016) to provide a tractable characterization of the\nsharp identified set. We develop a procedure to practically construct the sharp\nidentified set following a sieve approach, and provide sharp bounds on\ncounterfactual outcomes of interest. We use our methodology and data on the\n2017 UK general election to estimate a spatial voting model under weak\nassumptions on agents' information about the returns to voting. Counterfactual\nexercises quantify the consequences of imperfect information on the well-being\nof voters and parties."}, "http://arxiv.org/abs/2312.10333": {"title": "Logit-based alternatives to two-stage least squares", "link": "http://arxiv.org/abs/2312.10333", "description": "We propose logit-based IV and augmented logit-based IV estimators that serve\nas alternatives to the traditionally used 2SLS estimator in the model where\nboth the endogenous treatment variable and the corresponding instrument are\nbinary. Our novel estimators are as easy to compute as the 2SLS estimator but\nhave an advantage over the 2SLS estimator in terms of causal interpretability.\nIn particular, in certain cases where the probability limits of both our\nestimators and the 2SLS estimator take the form of weighted-average treatment\neffects, our estimators are guaranteed to yield non-negative weights whereas\nthe 2SLS estimator is not."}, "http://arxiv.org/abs/2312.10487": {"title": "The Dynamic Triple Gamma Prior as a Shrinkage Process Prior for Time-Varying Parameter Models", "link": "http://arxiv.org/abs/2312.10487", "description": "Many current approaches to shrinkage within the time-varying parameter\nframework assume that each state is equipped with only one innovation variance\nfor all time points. Sparsity is then induced by shrinking this variance\ntowards zero. We argue that this is not sufficient if the states display large\njumps or structural changes, something which is often the case in time series\nanalysis. To remedy this, we propose the dynamic triple gamma prior, a\nstochastic process that has a well-known triple gamma marginal form, while\nstill allowing for autocorrelation. Crucially, the triple gamma has many\ninteresting limiting and special cases (including the horseshoe shrinkage\nprior) which can also be chosen as the marginal distribution. 
Not only is the\nmarginal form well understood, we further derive many interesting properties of\nthe dynamic triple gamma, which showcase its dynamic shrinkage characteristics.\nWe develop an efficient Markov chain Monte Carlo algorithm to sample from the\nposterior and demonstrate the performance through sparse covariance modeling\nand forecasting of the returns of the components of the EURO STOXX 50 index."}, "http://arxiv.org/abs/2312.10558": {"title": "Some Finite-Sample Results on the Hausman Test", "link": "http://arxiv.org/abs/2312.10558", "description": "This paper shows that the endogeneity test using the control function\napproach in linear instrumental variable models is a variant of the Hausman\ntest. Moreover, we find that the test statistics used in these tests can be\nnumerically ordered, indicating their relative power properties in finite\nsamples."}, "http://arxiv.org/abs/2312.10984": {"title": "Predicting Financial Literacy via Semi-supervised Learning", "link": "http://arxiv.org/abs/2312.10984", "description": "Financial literacy (FL) represents a person's ability to turn assets into\nincome, and understanding digital currencies has been added to the modern\ndefinition. FL can be predicted by exploiting unlabelled recorded data in\nfinancial networks via semi-supervised learning (SSL). Measuring and predicting\nFL has not been widely studied, resulting in limited understanding of customer\nfinancial engagement consequences. Previous studies have shown that low FL\nincreases the risk of social harm. Therefore, it is important to accurately\nestimate FL to allocate specific intervention programs to less financially\nliterate groups. This will not only increase company profitability, but will\nalso reduce government spending. Some studies considered predicting FL in\nclassification tasks, whereas others developed FL definitions and impacts. The\ncurrent paper investigated mechanisms to learn customer FL level from their\nfinancial data using sampling by synthetic minority over-sampling techniques\nfor regression with Gaussian noise (SMOGN). We propose the SMOGN-COREG model\nfor semi-supervised regression, applying SMOGN to deal with unbalanced datasets\nand a nonparametric multi-learner co-regression (COREG) algorithm for labeling.\nWe compared the SMOGN-COREG model with six well-known regressors on five\ndatasets to evaluate the proposed models effectiveness on unbalanced and\nunlabelled financial data. Experimental results confirmed that the proposed\nmethod outperformed the comparator models for unbalanced and unlabelled\nfinancial data. Therefore, SMOGN-COREG is a step towards using unlabelled data\nto estimate FL level."}, "http://arxiv.org/abs/2312.11283": {"title": "The 2010 Census Confidentiality Protections Failed, Here's How and Why", "link": "http://arxiv.org/abs/2312.11283", "description": "Using only 34 published tables, we reconstruct five variables (census block,\nsex, age, race, and ethnicity) in the confidential 2010 Census person records.\nUsing the 38-bin age variable tabulated at the census block level, at most\n20.1% of reconstructed records can differ from their confidential source on\neven a single value for these five variables. Using only published data, an\nattacker can verify that all records in 70% of all census blocks (97 million\npeople) are perfectly reconstructed. The tabular publications in Summary File 1\nthus have prohibited disclosure risk similar to the unreleased confidential\nmicrodata. 
Reidentification studies confirm that an attacker can, within blocks\nwith perfect reconstruction accuracy, correctly infer the actual census\nresponse on race and ethnicity for 3.4 million vulnerable population uniques\n(persons with nonmodal characteristics) with 95% accuracy, the same precision\nas the confidential data achieve and far greater than statistical baselines.\nThe flaw in the 2010 Census framework was the assumption that aggregation\nprevented accurate microdata reconstruction, justifying weaker disclosure\nlimitation methods than were applied to 2010 Census public microdata. The\nframework used for 2020 Census publications defends against attacks that are\nbased on reconstruction, as we also demonstrate here. Finally, we show that\nalternatives to the 2020 Census Disclosure Avoidance System with similar\naccuracy (enhanced swapping) also fail to protect confidentiality, and those\nthat partially defend against reconstruction attacks (incomplete suppression\nimplementations) destroy the primary statutory use case: data for redistricting\nall legislatures in the country in compliance with the 1965 Voting Rights Act."}, "http://arxiv.org/abs/1811.11603": {"title": "Distribution Regression with Sample Selection, with an Application to Wage Decompositions in the UK", "link": "http://arxiv.org/abs/1811.11603", "description": "We develop a distribution regression model under endogenous sample selection.\nThis model is a semi-parametric generalization of the Heckman selection model.\nIt accommodates much richer effects of the covariates on outcome distribution\nand patterns of heterogeneity in the selection process, and allows for drastic\ndepartures from the Gaussian error structure, while maintaining the same level\ntractability as the classical model. The model applies to continuous, discrete\nand mixed outcomes. We provide identification, estimation, and inference\nmethods, and apply them to obtain wage decomposition for the UK. Here we\ndecompose the difference between the male and female wage distributions into\ncomposition, wage structure, selection structure, and selection sorting\neffects. After controlling for endogenous employment selection, we still find\nsubstantial gender wage gap -- ranging from 21% to 40% throughout the (latent)\noffered wage distribution that is not explained by composition. We also uncover\npositive sorting for single men and negative sorting for married women that\naccounts for a substantive fraction of the gender wage gap at the top of the\ndistribution."}, "http://arxiv.org/abs/2204.01884": {"title": "Policy Learning with Competing Agents", "link": "http://arxiv.org/abs/2204.01884", "description": "Decision makers often aim to learn a treatment assignment policy under a\ncapacity constraint on the number of agents that they can treat. When agents\ncan respond strategically to such policies, competition arises, complicating\nestimation of the optimal policy. In this paper, we study capacity-constrained\ntreatment assignment in the presence of such interference. We consider a\ndynamic model where the decision maker allocates treatments at each time step\nand heterogeneous agents myopically best respond to the previous treatment\nassignment policy. When the number of agents is large but finite, we show that\nthe threshold for receiving treatment under a given policy converges to the\npolicy's mean-field equilibrium threshold. Based on this result, we develop a\nconsistent estimator for the policy gradient. 
In simulations and a\nsemi-synthetic experiment with data from the National Education Longitudinal\nStudy of 1988, we demonstrate that this estimator can be used for learning\ncapacity-constrained policies in the presence of strategic behavior."}, "http://arxiv.org/abs/2204.10359": {"title": "Boundary Adaptive Local Polynomial Conditional Density Estimators", "link": "http://arxiv.org/abs/2204.10359", "description": "We begin by introducing a class of conditional density estimators based on\nlocal polynomial techniques. The estimators are boundary adaptive and easy to\nimplement. We then study the (pointwise and) uniform statistical properties of\nthe estimators, offering characterizations of both probability concentration\nand distributional approximation. In particular, we establish uniform\nconvergence rates in probability and valid Gaussian distributional\napproximations for the Studentized t-statistic process. We also discuss\nimplementation issues such as consistent estimation of the covariance function\nfor the Gaussian approximation, optimal integrated mean squared error bandwidth\nselection, and valid robust bias-corrected inference. We illustrate the\napplicability of our results by constructing valid confidence bands and\nhypothesis tests for both parametric specification and shape constraints,\nexplicitly characterizing their approximation errors. A companion R software\npackage implementing our main results is provided."}, "http://arxiv.org/abs/2312.11710": {"title": "Real-time monitoring with RCA models", "link": "http://arxiv.org/abs/2312.11710", "description": "We propose a family of weighted statistics based on the CUSUM process of the\nWLS residuals for the online detection of changepoints in a Random Coefficient\nAutoregressive model, using both the standard CUSUM and the Page-CUSUM process.\nWe derive the asymptotics under the null of no changepoint for all possible\nweighing schemes, including the case of the standardised CUSUM, for which we\nderive a Darling-Erdos-type limit theorem; our results guarantee the\nprocedure-wise size control under both an open-ended and a closed-ended\nmonitoring. In addition to considering the standard RCA model with no\ncovariates, we also extend our results to the case of exogenous regressors. Our\nresults can be applied irrespective of (and with no prior knowledge required as\nto) whether the observations are stationary or not, and irrespective of whether\nthey change into a stationary or nonstationary regime. Hence, our methodology\nis particularly suited to detect the onset, or the collapse, of a bubble or an\nepidemic. Our simulations show that our procedures, especially when\nstandardising the CUSUM process, can ensure very good size control and short\ndetection delays. We complement our theory by studying the online detection of\nbreaks in epidemiological and housing prices series."}, "http://arxiv.org/abs/2304.05805": {"title": "GDP nowcasting with artificial neural networks: How much does long-term memory matter?", "link": "http://arxiv.org/abs/2304.05805", "description": "In our study, we apply artificial neural networks (ANNs) to nowcast quarterly\nGDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare\nthe nowcasting performance of five different ANN architectures: the multilayer\nperceptron (MLP), the one-dimensional convolutional neural network (1D CNN),\nthe Elman recurrent neural network (RNN), the long short-term memory network\n(LSTM), and the gated recurrent unit (GRU). 
The empirical analysis presents the\nresults from two distinctively different evaluation periods. The first (2012:Q1\n-- 2019:Q4) is characterized by balanced economic growth, while the second\n(2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According\nto our results, longer input sequences result in more accurate nowcasts in\nperiods of balanced economic growth. However, this effect ceases above a\nrelatively low threshold value of around six quarters (eighteen months). During\nperiods of economic turbulence (e.g., during the COVID-19 recession), longer\ninput sequences do not help the models' predictive performance; instead, they\nseem to weaken their generalization capability. Combined results from the two\nevaluation periods indicate that architectural features enabling for long-term\nmemory do not result in more accurate nowcasts. On the other hand, the 1D CNN\nhas proved to be a highly suitable model for GDP nowcasting. The network has\nshown good nowcasting performance among the competitors during the first\nevaluation period and achieved the overall best accuracy during the second\nevaluation period. Consequently, first in the literature, we propose the\napplication of the 1D CNN for economic nowcasting."}, "http://arxiv.org/abs/2312.12741": {"title": "Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances", "link": "http://arxiv.org/abs/2312.12741", "description": "We address the problem of best arm identification (BAI) with a fixed budget\nfor two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the\nbest arm, an arm with the highest expected reward, through an adaptive\nexperiment. Kaufmann et al. (2016) develops a lower bound for the probability\nof misidentifying the best arm. They also propose a strategy, assuming that the\nvariances of rewards are known, and show that it is asymptotically optimal in\nthe sense that its probability of misidentification matches the lower bound as\nthe budget approaches infinity. However, an asymptotically optimal strategy is\nunknown when the variances are unknown. For this open issue, we propose a\nstrategy that estimates variances during an adaptive experiment and draws arms\nwith a ratio of the estimated standard deviations. We refer to this strategy as\nthe Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW)\nstrategy. We then demonstrate that this strategy is asymptotically optimal by\nshowing that its probability of misidentification matches the lower bound when\nthe budget approaches infinity, and the gap between the expected rewards of two\narms approaches zero (small-gap regime). Our results suggest that under the\nworst-case scenario characterized by the small-gap regime, our strategy, which\nemploys estimated variance, is asymptotically optimal even when the variances\nare unknown."}, "http://arxiv.org/abs/2312.13195": {"title": "Principal Component Copulas for Capital Modelling", "link": "http://arxiv.org/abs/2312.13195", "description": "We introduce a class of copulas that we call Principal Component Copulas.\nThis class intends to combine the strong points of copula-based techniques with\nprincipal component-based models, which results in flexibility when modelling\ntail dependence along the most important directions in multivariate data. The\nproposed techniques have conceptual similarities and technical differences with\nthe increasingly popular class of factor copulas. 
Such copulas can generate\ncomplex dependence structures and also perform well in high dimensions. We show\nthat Principal Component Copulas give rise to practical and technical\nadvantages compared to other techniques. We perform a simulation study and\napply the copula to multivariate return data. The copula class offers the\npossibility to avoid the curse of dimensionality when estimating very large\ncopula models and it performs particularly well on aggregate measures of tail\nrisk, which is of importance for capital modeling."}, "http://arxiv.org/abs/2103.07066": {"title": "Finding Subgroups with Significant Treatment Effects", "link": "http://arxiv.org/abs/2103.07066", "description": "Researchers often run resource-intensive randomized controlled trials (RCTs)\nto estimate the causal effects of interventions on outcomes of interest. Yet\nthese outcomes are often noisy, and estimated overall effects can be small or\nimprecise. Nevertheless, we may still be able to produce reliable evidence of\nthe efficacy of an intervention by finding subgroups with significant effects.\nIn this paper, we propose a machine-learning method that is specifically\noptimized for finding such subgroups in noisy data. Unlike available methods\nfor personalized treatment assignment, our tool is fundamentally designed to\ntake significance testing into account: it produces a subgroup that is chosen\nto maximize the probability of obtaining a statistically significant positive\ntreatment effect. We provide a computationally efficient implementation using\ndecision trees and demonstrate its gain over selecting subgroups based on\npositive (estimated) treatment effects. Compared to standard tree-based\nregression and classification tools, this approach tends to yield higher power\nin detecting subgroups affected by the treatment."}, "http://arxiv.org/abs/2208.06729": {"title": "Optimal Recovery for Causal Inference", "link": "http://arxiv.org/abs/2208.06729", "description": "Problems in causal inference can be fruitfully addressed using signal\nprocessing techniques. As an example, it is crucial to successfully quantify\nthe causal effects of an intervention to determine whether the intervention\nachieved desired outcomes. We present a new geometric signal processing\napproach to classical synthetic control called ellipsoidal optimal recovery\n(EOpR), for estimating the unobservable outcome of a treatment unit. EOpR\nprovides policy evaluators with both worst-case and typical outcomes to help in\ndecision making. It is an approximation-theoretic technique that relates to the\ntheory of principal components, which recovers unknown observations given a\nlearned signal class and a set of known observations. We show EOpR can improve\npre-treatment fit and mitigate bias of the post-treatment estimate relative to\nother methods in causal inference. Beyond recovery of the unit of interest, an\nadvantage of EOpR is that it produces worst-case limits over the estimates\nproduced. 
We assess our approach on artificially-generated data, on datasets\ncommonly used in the econometrics literature, and in the context of the\nCOVID-19 pandemic, showing better performance than baseline techniques"}, "http://arxiv.org/abs/2301.01085": {"title": "The Chained Difference-in-Differences", "link": "http://arxiv.org/abs/2301.01085", "description": "This paper studies the identification, estimation, and inference of long-term\n(binary) treatment effect parameters when balanced panel data is not available,\nor consists of only a subset of the available data. We develop a new estimator:\nthe chained difference-in-differences, which leverages the overlapping\nstructure of many unbalanced panel data sets. This approach consists in\naggregating a collection of short-term treatment effects estimated on multiple\nincomplete panels. Our estimator accommodates (1) multiple time periods, (2)\nvariation in treatment timing, (3) treatment effect heterogeneity, (4) general\nmissing data patterns, and (5) sample selection on observables. We establish\nthe asymptotic properties of the proposed estimator and discuss identification\nand efficiency gains in comparison to existing methods. Finally, we illustrate\nits relevance through (i) numerical simulations, and (ii) an application about\nthe effects of an innovation policy in France."}, "http://arxiv.org/abs/2312.13939": {"title": "Binary Endogenous Treatment in Stochastic Frontier Models with an Application to Soil Conservation in El Salvador", "link": "http://arxiv.org/abs/2312.13939", "description": "Improving the productivity of the agricultural sector is part of one of the\nSustainable Development Goals set by the United Nations. To this end, many\ninternational organizations have funded training and technology transfer\nprograms that aim to promote productivity and income growth, fight poverty and\nenhance food security among smallholder farmers in developing countries.\nStochastic production frontier analysis can be a useful tool when evaluating\nthe effectiveness of these programs. However, accounting for treatment\nendogeneity, often intrinsic to these interventions, only recently has received\nany attention in the stochastic frontier literature. In this work, we extend\nthe classical maximum likelihood estimation of stochastic production frontier\nmodels by allowing both the production frontier and inefficiency to depend on a\npotentially endogenous binary treatment. We use instrumental variables to\ndefine an assignment mechanism for the treatment, and we explicitly model the\ndensity of the first and second-stage composite error terms. We provide\nempirical evidence of the importance of controlling for endogeneity in this\nsetting using farm-level data from a soil conservation program in El Salvador."}, "http://arxiv.org/abs/2312.14095": {"title": "RetailSynth: Synthetic Data Generation for Retail AI Systems Evaluation", "link": "http://arxiv.org/abs/2312.14095", "description": "Significant research effort has been devoted in recent years to developing\npersonalized pricing, promotions, and product recommendation algorithms that\ncan leverage rich customer data to learn and earn. Systematic benchmarking and\nevaluation of these causal learning systems remains a critical challenge, due\nto the lack of suitable datasets and simulation environments. In this work, we\npropose a multi-stage model for simulating customer shopping behavior that\ncaptures important sources of heterogeneity, including price sensitivity and\npast experiences. 
We embedded this model into a working simulation environment\n-- RetailSynth. RetailSynth was carefully calibrated on publicly available\ngrocery data to create realistic synthetic shopping transactions. Multiple\npricing policies were implemented within the simulator and analyzed for impact\non revenue, category penetration, and customer retention. Applied researchers\ncan use RetailSynth to validate causal demand models for multi-category retail\nand to incorporate realistic price sensitivity into emerging benchmarking\nsuites for personalized pricing, promotions, and product recommendations."}, "http://arxiv.org/abs/2201.06898": {"title": "Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period", "link": "http://arxiv.org/abs/2201.06898", "description": "We propose difference-in-differences estimators for continuous treatments\nwith heterogeneous effects. We assume that between consecutive periods, the\ntreatment of some units, the switchers, changes, while the treatment of other\nunits does not change. We show that under a parallel trends assumption, an\nunweighted and a weighted average of the slopes of switchers' potential\noutcomes can be estimated. While the former parameter may be more intuitive,\nthe latter can be used for cost-benefit analysis, and it can often be estimated\nmore precisely. We generalize our estimators to the instrumental-variable case.\nWe use our results to estimate the price-elasticity of gasoline consumption."}, "http://arxiv.org/abs/2211.14236": {"title": "Strategyproof Decision-Making in Panel Data Settings and Beyond", "link": "http://arxiv.org/abs/2211.14236", "description": "We consider the problem of decision-making using panel data, in which a\ndecision-maker gets noisy, repeated measurements of multiple units (or agents).\nWe consider a setup where there is a pre-intervention period, when the\nprincipal observes the outcomes of each unit, after which the principal uses\nthese observations to assign a treatment to each unit. Unlike this classical\nsetting, we permit the units generating the panel data to be strategic, i.e.\nunits may modify their pre-intervention outcomes in order to receive a more\ndesirable intervention. The principal's goal is to design a strategyproof\nintervention policy, i.e. a policy that assigns units to their\nutility-maximizing interventions despite their potential strategizing. We first\nidentify a necessary and sufficient condition under which a strategyproof\nintervention policy exists, and provide a strategyproof mechanism with a simple\nclosed form when one does exist. Along the way, we prove impossibility results\nfor strategic multiclass classification, which may be of independent interest.\nWhen there are two interventions, we establish that there always exists a\nstrategyproof mechanism, and provide an algorithm for learning such a\nmechanism. For three or more interventions, we provide an algorithm for\nlearning a strategyproof mechanism if there exists a sufficiently large gap in\nthe principal's rewards between different interventions. Finally, we\nempirically evaluate our model using real-world panel data collected from\nproduct sales over 18 months. 
We find that our methods compare favorably to\nbaselines which do not take strategic interactions into consideration, even in\nthe presence of model misspecification."}, "http://arxiv.org/abs/2312.14191": {"title": "Noisy Measurements Are Important, the Design of Census Products Is Much More Important", "link": "http://arxiv.org/abs/2312.14191", "description": "McCartan et al. (2023) call for \"making differential privacy work for census\ndata users.\" This commentary explains why the 2020 Census Noisy Measurement\nFiles (NMFs) are not the best focus for that plea. The August 2021 letter from\n62 prominent researchers asking for production of the direct output of the\ndifferential privacy system deployed for the 2020 Census signaled the\nengagement of the scholarly community in the design of decennial census data\nproducts. NMFs, the raw statistics produced by the 2020 Census Disclosure\nAvoidance System before any post-processing, are one component of that\ndesign--the query strategy output. The more important component is the query\nworkload output--the statistics released to the public. Optimizing the query\nworkload--the Redistricting Data (P.L. 94-171) Summary File,\nspecifically--could allow the privacy-loss budget to be more effectively\nmanaged. There could be fewer noisy measurements, no post-processing bias, and\ndirect estimates of the uncertainty from disclosure avoidance for each\npublished statistic."}, "http://arxiv.org/abs/2312.14325": {"title": "Exploring Distributions of House Prices and House Price Indices", "link": "http://arxiv.org/abs/2312.14325", "description": "We use house prices (HP) and house price indices (HPI) as a proxy to income\ndistribution. Specifically, we analyze sale prices in the 1970-2010 window of\nover 116,000 single-family homes in Hamilton County, Ohio, including Cincinnati\nmetro area of about 2.2 million people. We also analyze HPI, published by\nFederal Housing Finance Agency (FHFA), for nearly 18,000 US ZIP codes that\ncover a period of over 40 years starting in 1980's. If HP can be viewed as a\nfirst derivative of income, HPI can be viewed as its second derivative. We use\ngeneralized beta (GB) family of functions to fit distributions of HP and HPI\nsince GB naturally arises from the models of economic exchange described by\nstochastic differential equations. Our main finding is that HP and multi-year\nHPI exhibit a negative Dragon King (nDK) behavior, wherein power-law\ndistribution tail gives way to an abrupt decay to a finite upper limit value,\nwhich is similar to our recent findings for realized volatility of S\\&P500\nindex in the US stock market. This type of tail behavior is best fitted by a\nmodified GB (mGB) distribution. Tails of single-year HPI appear to show more\nconsistency with power-law behavior, which is better described by a GB Prime\n(GB2) distribution. We supplement full distribution fits by mGB and GB2 with\ndirect linear fits (LF) of the tails. Our numerical procedure relies on\nevaluation of confidence intervals (CI) of the fits, as well as of p-values\nthat give the likelihood that data come from the fitted distributions."}, "http://arxiv.org/abs/2207.09943": {"title": "Efficient Bias Correction for Cross-section and Panel Data", "link": "http://arxiv.org/abs/2207.09943", "description": "Bias correction can often improve the finite sample performance of\nestimators. 
We show that the choice of bias correction method has no effect on\nthe higher-order variance of semiparametrically efficient parametric\nestimators, so long as the estimate of the bias is asymptotically linear. It is\nalso shown that bootstrap, jackknife, and analytical bias estimates are\nasymptotically linear for estimators with higher-order expansions of a standard\nform. In particular, we find that for a variety of estimators the\nstraightforward bootstrap bias correction gives the same higher-order variance\nas more complicated analytical or jackknife bias corrections. In contrast, bias\ncorrections that do not estimate the bias at the parametric rate, such as the\nsplit-sample jackknife, result in larger higher-order variances in the i.i.d.\nsetting we focus on. For both a cross-sectional MLE and a panel model with\nindividual fixed effects, we show that the split-sample jackknife has a\nhigher-order variance term that is twice as large as that of the\n`leave-one-out' jackknife."}, "http://arxiv.org/abs/2312.15119": {"title": "Functional CLTs for subordinated stable L\\'evy models in physics, finance, and econometrics", "link": "http://arxiv.org/abs/2312.15119", "description": "We present a simple unifying treatment of a large class of applications from\nstatistical mechanics, econometrics, mathematical finance, and insurance\nmathematics, where stable (possibly subordinated) L\\'evy noise arises as a\nscaling limit of some form of continuous-time random walk (CTRW). For each\napplication, it is natural to rely on weak convergence results for stochastic\nintegrals on Skorokhod space in Skorokhod's J1 or M1 topologies. As compared to\nearlier and entirely separate works, we are able to give a more streamlined\naccount while also allowing for greater generality and providing important new\ninsights. For each application, we first make clear how the fundamental\nconclusions for J1 convergent CTRWs emerge as special cases of the same general\nprinciples, and we then illustrate how the specific settings give rise to\ndifferent results for strictly M1 convergent CTRWs."}, "http://arxiv.org/abs/2312.15494": {"title": "Variable Selection in High Dimensional Linear Regressions with Parameter Instability", "link": "http://arxiv.org/abs/2312.15494", "description": "This paper is concerned with the problem of variable selection in the\npresence of parameter instability when both the marginal effects of signals on\nthe target variable and the correlations of the covariates in the active set\ncould vary over time. We pose the issue of whether one should use weighted or\nunweighted observations at the variable selection stage in the presence of\nparameter instability, particularly when the number of potential covariates is\nlarge. We allow parameter instability to be continuous or discrete, subject to\ncertain regularity conditions. We discuss the pros and cons of Lasso and the\nOne Covariate at a time Multiple Testing (OCMT) method for variable selection\nand argue that OCMT has important advantages under parameter instability. We\nestablish three main theorems on selection, estimation post selection, and\nin-sample fit. These theorems provide justification for using unweighted\nobservations at the selection stage of OCMT and down-weighting of observations\nonly at the forecasting stage. 
It is shown that OCMT delivers better forecasts,\nin mean squared error sense, as compared to Lasso, Adaptive Lasso and boosting\nboth in Monte Carlo experiments as well as in 3 sets of empirical applications:\nforecasting monthly returns on 28 stocks from Dow Jones , forecasting quarterly\noutput growths across 33 countries, and forecasting euro area output growth\nusing surveys of professional forecasters."}, "http://arxiv.org/abs/2312.15524": {"title": "The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective", "link": "http://arxiv.org/abs/2312.15524", "description": "Large Language Models (LLMs) have demonstrated impressive potential to\nsimulate human behavior. Using a causal inference framework, we empirically and\ntheoretically analyze the challenges of conducting LLM-simulated experiments,\nand explore potential solutions. In the context of demand estimation, we show\nthat variations in the treatment included in the prompt (e.g., price of focal\nproduct) can cause variations in unspecified confounding factors (e.g., price\nof competitors, historical prices, outside temperature), introducing\nendogeneity and yielding implausibly flat demand curves. We propose a\ntheoretical framework suggesting this endogeneity issue generalizes to other\ncontexts and won't be fully resolved by merely improving the training data.\nUnlike real experiments where researchers assign pre-existing units across\nconditions, LLMs simulate units based on the entire prompt, which includes the\ndescription of the treatment. Therefore, due to associations in the training\ndata, the characteristics of individuals and environments simulated by the LLM\ncan be affected by the treatment assignment. We explore two potential\nsolutions. The first specifies all contextual variables that affect both\ntreatment and outcome, which we demonstrate to be challenging for a\ngeneral-purpose LLM. The second explicitly specifies the source of treatment\nvariation in the prompt given to the LLM (e.g., by informing the LLM that the\nstore is running an experiment). While this approach only allows the estimation\nof a conditional average treatment effect that depends on the specific\nexperimental design, it provides valuable directional results for exploratory\nanalysis."}, "http://arxiv.org/abs/2312.15595": {"title": "Zero-Inflated Bandits", "link": "http://arxiv.org/abs/2312.15595", "description": "Many real applications of bandits have sparse non-zero rewards, leading to\nslow learning rates. A careful distribution modeling that utilizes\nproblem-specific structures is known as critical to estimation efficiency in\nthe statistics literature, yet is under-explored in bandits. To fill the gap,\nwe initiate the study of zero-inflated bandits, where the reward is modeled as\na classic semi-parametric distribution called zero-inflated distribution. We\ncarefully design Upper Confidence Bound (UCB) and Thompson Sampling (TS)\nalgorithms for this specific structure. Our algorithms are suitable for a very\ngeneral class of reward distributions, operating under tail assumptions that\nare considerably less stringent than the typical sub-Gaussian requirements.\nTheoretically, we derive the regret bounds for both the UCB and TS algorithms\nfor multi-armed bandit, showing that they can achieve rate-optimal regret when\nthe reward distribution is sub-Gaussian. 
The superior empirical performance of\nthe proposed methods is shown via extensive numerical studies."}, "http://arxiv.org/abs/2312.15624": {"title": "Negative Controls for Instrumental Variable Designs", "link": "http://arxiv.org/abs/2312.15624", "description": "Studies using instrumental variables (IV) often assess the validity of their\nidentification assumptions using falsification tests. However, these tests are\noften carried out in an ad-hoc manner, without theoretical foundations. In this\npaper, we establish a theoretical framework for negative control tests, the\npredominant category of falsification tests for IV designs. These tests are\nconditional independence tests between negative control variables and either\nthe IV or the outcome (e.g., examining the ``effect'' on the lagged outcome).\nWe introduce a formal definition for threats to IV exogeneity (alternative path\nvariables) and characterize the necessary conditions that proxy variables for\nsuch unobserved threats must meet to serve as negative controls. The theory\nhighlights prevalent errors in the implementation of negative control tests and\nhow they could be corrected. Our theory can also be used to design new\nfalsification tests by identifying appropriate negative control variables,\nincluding currently underutilized types, and suggesting alternative statistical\ntests. The theory shows that all negative control tests assess IV exogeneity.\nHowever, some commonly used tests simultaneously evaluate the 2SLS functional\nform assumptions. Lastly, we show that while negative controls are useful for\ndetecting biases in IV designs, their capacity to correct or quantify such\nbiases requires additional non-trivial assumptions."}, "http://arxiv.org/abs/2312.15999": {"title": "Pricing with Contextual Elasticity and Heteroscedastic Valuation", "link": "http://arxiv.org/abs/2312.15999", "description": "We study an online contextual dynamic pricing problem, where customers decide\nwhether to purchase a product based on its features and price. We introduce a\nnovel approach to modeling a customer's expected demand by incorporating\nfeature-based price elasticity, which can be equivalently represented as a\nvaluation with heteroscedastic noise. To solve the problem, we propose a\ncomputationally efficient algorithm called \"Pricing with Perturbation (PwP)\",\nwhich enjoys an $O(\\sqrt{dT\\log T})$ regret while allowing arbitrary\nadversarial input context sequences. We also prove a matching lower bound at\n$\\Omega(\\sqrt{dT})$ to show the optimality regarding $d$ and $T$ (up to $\\log\nT$ factors). Our results shed light on the relationship between contextual\nelasticity and heteroscedastic valuation, providing insights for effective and\npractical pricing strategies."}, "http://arxiv.org/abs/2312.16099": {"title": "Direct Multi-Step Forecast based Comparison of Nested Models via an Encompassing Test", "link": "http://arxiv.org/abs/2312.16099", "description": "We introduce a novel approach for comparing out-of-sample multi-step\nforecasts obtained from a pair of nested models that is based on the forecast\nencompassing principle. Our proposed approach relies on an alternative way of\ntesting the population moment restriction implied by the forecast encompassing\nprinciple and that links the forecast errors from the two competing models in a\nparticular way. Its key advantage is that it is able to bypass the variance\ndegeneracy problem afflicting model based forecast comparisons across nested\nmodels. 
It results in a test statistic whose limiting distribution is standard\nnormal and which is particularly simple to construct and can accommodate both\nsingle period and longer-horizon prediction comparisons. Inferences are also\nshown to be robust to different predictor types, including stationary,\nhighly-persistent and purely deterministic processes. Finally, we illustrate\nthe use of our proposed approach through an empirical application that explores\nthe role of global inflation in enhancing individual country specific inflation\nforecasts."}, "http://arxiv.org/abs/2010.05117": {"title": "Combining Observational and Experimental Data to Improve Efficiency Using Imperfect Instruments", "link": "http://arxiv.org/abs/2010.05117", "description": "Randomized controlled trials generate experimental variation that can\ncredibly identify causal effects, but often suffer from limited scale, while\nobservational datasets are large, but often violate desired identification\nassumptions. To improve estimation efficiency, I propose a method that\nleverages imperfect instruments - pretreatment covariates that satisfy the\nrelevance condition but may violate the exclusion restriction. I show that\nthese imperfect instruments can be used to derive moment restrictions that, in\ncombination with the experimental data, improve estimation efficiency. I\noutline estimators for implementing this strategy, and show that my methods can\nreduce variance by up to 50%; therefore, only half of the experimental sample\nis required to attain the same statistical precision. I apply my method to a\nsearch listing dataset from Expedia that studies the causal effect of search\nrankings on clicks, and show that the method can substantially improve the\nprecision."}, "http://arxiv.org/abs/2105.12891": {"title": "Identification and Estimation of Partial Effects in Nonlinear Semiparametric Panel Models", "link": "http://arxiv.org/abs/2105.12891", "description": "Average partial effects (APEs) are often not point identified in panel models\nwith unrestricted unobserved heterogeneity, such as binary response panel model\nwith fixed effects and logistic errors. This lack of point-identification\noccurs despite the identification of these models' common coefficients. We\nprovide a unified framework to establish the point identification of various\npartial effects in a wide class of nonlinear semiparametric models under an\nindex sufficiency assumption on the unobserved heterogeneity, even when the\nerror distribution is unspecified and non-stationary. This assumption does not\nimpose parametric restrictions on the unobserved heterogeneity and\nidiosyncratic errors. We also present partial identification results when the\nsupport condition fails. We then propose three-step semiparametric estimators\nfor the APE, the average structural function, and average marginal effects, and\nshow their consistency and asymptotic normality. Finally, we illustrate our\napproach in a study of determinants of married women's labor supply."}, "http://arxiv.org/abs/2212.11833": {"title": "Efficient Sampling for Realized Variance Estimation in Time-Changed Diffusion Models", "link": "http://arxiv.org/abs/2212.11833", "description": "This paper analyzes the benefits of sampling intraday returns in intrinsic\ntime for the standard and pre-averaging realized variance (RV) estimators. 
We\ntheoretically show in finite samples and asymptotically that the RV estimator\nis most efficient under the new concept of realized business time, which\nsamples according to a combination of observed trades and estimated tick\nvariance. Our asymptotic results carry over to the pre-averaging RV estimator\nunder market microstructure noise. The analysis builds on the assumption that\nasset prices follow a diffusion that is time-changed with a jump process that\nseparately models the transaction times. This provides a flexible model that\nseparately captures the empirically varying trading intensity and tick variance\nprocesses, which are particularly relevant for disentangling the driving forces\nof the sampling schemes. Extensive simulations confirm our theoretical results\nand show that realized business time remains superior also under more general\nnoise and process specifications. An application to stock data provides\nempirical evidence for the benefits of using realized business time sampling to\nconstruct more efficient RV estimators as well as for an improved forecasting\nperformance."}, "http://arxiv.org/abs/2301.05580": {"title": "Randomization Test for the Specification of Interference Structure", "link": "http://arxiv.org/abs/2301.05580", "description": "This study considers testing the specification of spillover effects in causal\ninference. We focus on experimental settings in which the treatment assignment\nmechanism is known to researchers. We develop a new randomization test\nutilizing a hierarchical relationship between different exposures. Compared\nwith existing approaches, our approach is essentially applicable to any null\nexposure specifications and produces powerful test statistics without a priori\nknowledge of the true interference structure. As empirical illustrations, we\nrevisit two existing social network experiments: one on farmers' insurance\nadoption and the other on anti-conflict education programs."}, "http://arxiv.org/abs/2312.16214": {"title": "Stochastic Equilibrium the Lucas Critique and Keynesian Economics", "link": "http://arxiv.org/abs/2312.16214", "description": "In this paper, a mathematically rigorous solution overturns existing wisdom\nregarding New Keynesian Dynamic Stochastic General Equilibrium. I develop a\nformal concept of stochastic equilibrium. I prove uniqueness and necessity,\nwhen agents are patient, across a wide class of dynamic stochastic models.\nExistence depends on appropriately specified eigenvalue conditions. Otherwise,\nno solution of any kind exists. I construct the equilibrium for the benchmark\nCalvo New Keynesian. I provide novel comparative statics with the\nnon-stochastic model of independent mathematical interest. I uncover a\nbifurcation between neighbouring stochastic systems and approximations taken\nfrom the Zero Inflation Non-Stochastic Steady State (ZINSS). The correct\nPhillips curve agrees with the zero limit from the trend inflation framework.\nIt contains a large lagged inflation coefficient and a small response to\nexpected inflation. The response to the output gap is always muted and is zero\nat standard parameters. A neutrality result is presented to explain why and to\nalign Calvo with Taylor pricing. Present and lagged demand shocks enter the\nPhillips curve so there is no Divine Coincidence and the system is identified\nfrom structural shocks alone. The lagged inflation slope is increasing in the\ninflation response, embodying substantive policy trade-offs. 
The Taylor\nprinciple is reversed, inactive settings are necessary for existence, pointing\ntowards inertial policy. The observational equivalence idea of the Lucas\ncritique is disproven. The bifurcation results from the breakdown of the\nconstraints implied by lagged nominal rigidity, associated with cross-equation\ncancellation possible only at ZINSS. There is a dual relationship between\nrestrictions on the econometrician and constraints on repricing firms. Thus if\nthe model is correct, goodness of fit will jump."}, "http://arxiv.org/abs/2312.16307": {"title": "Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration", "link": "http://arxiv.org/abs/2312.16307", "description": "We consider a panel data setting in which one observes measurements of units\nover time, under different interventions. Our focus is on the canonical family\nof synthetic control methods (SCMs) which, after a pre-intervention time period\nwhen all units are under control, estimate counterfactual outcomes for test\nunits in the post-intervention time period under control by using data from\ndonor units who have remained under control for the entire post-intervention\nperiod. In order for the counterfactual estimate produced by synthetic control\nfor a test unit to be accurate, there must be sufficient overlap between the\noutcomes of the donor units and the outcomes of the test unit. As a result, a\ncanonical assumption in the literature on SCMs is that the outcomes for the\ntest units lie within either the convex hull or the linear span of the outcomes\nfor the donor units. However despite their ubiquity, such overlap assumptions\nmay not always hold, as is the case when, e.g., units select their own\ninterventions and different subpopulations of units prefer different\ninterventions a priori.\n\nWe shed light on this typically overlooked assumption, and we address this\nissue by incentivizing units with different preferences to take interventions\nthey would not normally consider. Specifically, we provide a SCM for\nincentivizing exploration in panel data settings which provides\nincentive-compatible intervention recommendations to units by leveraging tools\nfrom information design and online learning. Using our algorithm, we show how\nto obtain valid counterfactual estimates using SCMs without the need for an\nexplicit overlap assumption on the unit outcomes."}, "http://arxiv.org/abs/2312.16489": {"title": "Best-of-Both-Worlds Linear Contextual Bandits", "link": "http://arxiv.org/abs/2312.16489", "description": "This study investigates the problem of $K$-armed linear contextual bandits,\nan instance of the multi-armed bandit problem, under an adversarial corruption.\nAt each round, a decision-maker observes an independent and identically\ndistributed context and then selects an arm based on the context and past\nobservations. After selecting an arm, the decision-maker incurs a loss\ncorresponding to the selected arm. The decision-maker aims to minimize the\ncumulative loss over the trial. The goal of this study is to develop a strategy\nthat is effective in both stochastic and adversarial environments, with\ntheoretical guarantees. We first formulate the problem by introducing a novel\nsetting of bandits with adversarial corruption, referred to as the contextual\nadversarial regime with a self-bounding constraint. We assume linear models for\nthe relationship between the loss and the context. 
Then, we propose a strategy\nthat extends the RealLinExp3 by Neu & Olkhovskaya (2020) and the\nFollow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is\nshown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}}\n+ \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \\n\sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of\nrounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and\nsuboptimal arms for any context, and $C\in[0, T]$ is an adversarial corruption\nparameter. This regret upper bound implies\n$O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and\n$O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer\nto our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its\ntheoretical guarantees in both stochastic and adversarial regimes."}, "http://arxiv.org/abs/2312.16707": {"title": "Modeling Systemic Risk: A Time-Varying Nonparametric Causal Inference Framework", "link": "http://arxiv.org/abs/2312.16707", "description": "We propose a nonparametric and time-varying directed information graph\n(TV-DIG) framework to estimate the evolving causal structure in time series\nnetworks, thereby addressing the limitations of traditional econometric models\nin capturing high-dimensional, nonlinear, and time-varying interconnections\namong series. This framework employs an information-theoretic measure rooted in\na generalized version of Granger-causality, which is applicable to both linear\nand nonlinear dynamics. Our framework offers advancements in measuring systemic\nrisk and establishes meaningful connections with established econometric\nmodels, including vector autoregression and switching models. We evaluate the\nefficacy of our proposed model through simulation experiments and empirical\nanalysis, reporting promising results in recovering simulated time-varying\nnetworks with nonlinear and multivariate structures. We apply this framework to\nidentify and monitor the evolution of interconnectedness and systemic risk\namong major assets and industrial sectors within the financial network. We\nfocus on cryptocurrencies' potential systemic risks to financial stability,\nincluding spillover effects on other sectors during crises like the COVID-19\npandemic and the Federal Reserve's 2020 emergency response. Our findings\nreveal significant, previously underrecognized pre-2020 influences of\ncryptocurrencies on certain financial sectors, highlighting their potential\nsystemic risks and offering a systematic approach to tracking evolving\ncross-sector interactions within financial networks."}, "http://arxiv.org/abs/2312.16927": {"title": "Development of Choice Model for Brand Evaluation", "link": "http://arxiv.org/abs/2312.16927", "description": "Consumer choice modeling takes center stage as we delve into understanding\nhow personal preferences of decision makers (customers) for products influence\ndemand at the level of the individual. Contemporary choice theory is built\nupon the characteristics of the decision maker, alternatives available for the\nchoice of the decision maker, the attributes of the available alternatives and\ndecision rules that the decision maker uses to make a choice. The choice set in\nour research is represented by six major brands (products) of laundry\ndetergents in the Japanese market. 
We use panel data on the purchases of 98\nhouseholds, to which we apply a hierarchical probit model facilitated by\nMarkov Chain Monte Carlo (MCMC) simulation, in order to evaluate the brand\nvalues of the six brands. The applied model also allows us to evaluate the tangible\nand intangible brand values. These evaluated metrics help us to assess the\nbrands based on their tangible and intangible characteristics. Moreover,\nconsumer choice modeling also provides a framework for assessing the\nenvironmental performance of laundry detergent brands as the model uses the\ninformation on components (physical attributes) of laundry detergents."}, "http://arxiv.org/abs/2312.17061": {"title": "Bayesian Analysis of High Dimensional Vector Error Correction Model", "link": "http://arxiv.org/abs/2312.17061", "description": "Vector Error Correction Model (VECM) is a classic method to analyse\ncointegration relationships amongst multivariate non-stationary time series. In\nthis paper, we focus on the high-dimensional setting and seek a\nsample-size-efficient methodology to determine the level of cointegration. Our\ninvestigation centres on a Bayesian approach to analyse the cointegration\nmatrix, thereby determining the cointegration rank. We design two algorithms\nand implement them on simulated examples, yielding promising results,\nparticularly when dealing with a high number of variables and a relatively low\nnumber of observations. Furthermore, we extend this methodology to empirically\ninvestigate the constituents of the S&P 500 index, where low-volatility\nportfolios can be found during both in-sample training and out-of-sample\ntesting periods."}, "http://arxiv.org/abs/1903.08028": {"title": "State-Building through Public Land Disposal? An Application of Matrix Completion for Counterfactual Prediction", "link": "http://arxiv.org/abs/1903.08028", "description": "This paper examines how homestead policies, which opened vast frontier lands\nfor settlement, influenced the development of American frontier states. It uses\na treatment propensity-weighted matrix completion model to estimate the\ncounterfactual size of these states without homesteading. In simulation\nstudies, the method shows lower bias and variance than other estimators,\nparticularly in higher complexity scenarios. The empirical analysis reveals\nthat homestead policies significantly and persistently reduced state government\nexpenditure and revenue. These findings align with continuous\ndifference-in-differences estimates using 1.46 million land patent records.\nThis study's extension of the matrix completion method to include propensity\nscore weighting for causal effect estimation in panel data, especially in\nstaggered treatment contexts, enhances policy evaluation by improving the\nprecision of long-term policy impact assessments."}, "http://arxiv.org/abs/2011.08174": {"title": "Policy design in experiments with unknown interference", "link": "http://arxiv.org/abs/2011.08174", "description": "This paper studies experimental designs for estimation and inference on\npolicies with spillover effects. Units are organized into a finite number of\nlarge clusters and interact in unknown ways within each cluster. First, we\nintroduce a single-wave experiment that, by varying the randomization across\ncluster pairs, estimates the marginal effect of a change in treatment\nprobabilities, taking spillover effects into account. Using the marginal\neffect, we propose a test for policy optimality. 
Second, we design a\nmultiple-wave experiment to estimate welfare-maximizing treatment rules. We\nprovide strong theoretical guarantees and an implementation in a large-scale\nfield experiment."}, "http://arxiv.org/abs/2104.13367": {"title": "When should you adjust inferences for multiple hypothesis testing?", "link": "http://arxiv.org/abs/2104.13367", "description": "Multiple hypothesis testing practices vary widely, without consensus on which\nare appropriate when. We provide an economic foundation for these practices. In\nstudies of multiple treatments or sub-populations, adjustments may be\nappropriate depending on scale economies in the research production function,\nwith control of classical notions of compound errors emerging in some but not\nall cases. Studies with multiple outcomes motivate testing using a single\nindex, or adjusted tests of several indices when the intended audience is\nheterogeneous. Data on actual research costs suggest both that some adjustment\nis warranted and that standard procedures are overly conservative."}, "http://arxiv.org/abs/2303.11777": {"title": "Quasi Maximum Likelihood Estimation of High-Dimensional Factor Models: A Critical Review", "link": "http://arxiv.org/abs/2303.11777", "description": "We review Quasi Maximum Likelihood estimation of factor models for\nhigh-dimensional panels of time series. We consider two cases: (1) estimation\nwhen no dynamic model for the factors is specified \citep{baili12,baili16}; (2)\nestimation based on the Kalman smoother and the Expectation Maximization\nalgorithm, thus allowing the factor dynamics to be modelled explicitly\n\citep{DGRqml,BLqml}. Our interest is in approximate factor models, i.e., when\nwe allow the idiosyncratic components to be mildly cross-sectionally, as\nwell as serially, correlated. Although such a setting apparently makes estimation\nharder, we show, in fact, that factor models do not suffer from the {\it curse of\ndimensionality} problem, but instead enjoy a {\it blessing of\ndimensionality} property. In particular, given an approximate factor structure,\nif the cross-sectional dimension of the data, $N$, grows to infinity, we show\nthat: (i) identification of the model is still possible, (ii) the\nmis-specification error due to the use of an exact factor model log-likelihood\nvanishes. Moreover, if we also let the sample size, $T$, grow to infinity, we\ncan consistently estimate all parameters of the model and make inference.\nThe same is true for estimation of the latent factors, which can be carried out\nby weighted least-squares, linear projection, or Kalman filtering/smoothing. We\nalso compare the approaches presented with Principal Component analysis and\nthe classical, fixed $N$, exact Maximum Likelihood approach. We conclude with a\ndiscussion of the efficiency of the considered estimators."}, "http://arxiv.org/abs/2305.08559": {"title": "Designing Discontinuities", "link": "http://arxiv.org/abs/2305.08559", "description": "Discontinuities can be fairly arbitrary but can also have a significant impact\non outcomes in larger systems. Indeed, their arbitrariness is why they have\nbeen used to infer causal relationships among variables in numerous settings.\nRegression discontinuity from econometrics assumes the existence of a\ndiscontinuous variable that splits the population into distinct partitions to\nestimate the causal effects of a given phenomenon. 
Here we consider the design\nof partitions for a given discontinuous variable to optimize a certain effect\npreviously studied using regression discontinuity. To do so, we propose a\nquantization-theoretic approach to optimize the effect of interest, first\nlearning the causal effect size of a given discontinuous variable and then\napplying dynamic programming for optimal quantization design of discontinuities\nto balance the gain and loss in that effect size. We also develop a\ncomputationally-efficient reinforcement learning algorithm for the dynamic\nprogramming formulation of optimal quantization. We demonstrate our approach by\ndesigning optimal time zone borders for counterfactuals of social capital,\nsocial mobility, and health. This is based on regression discontinuity analyses\nwe perform on novel data, which may be of independent empirical interest."}, "http://arxiv.org/abs/2309.09299": {"title": "Bounds on Average Effects in Discrete Choice Panel Data Models", "link": "http://arxiv.org/abs/2309.09299", "description": "In discrete choice panel data, the estimation of average effects is crucial\nfor quantifying the effect of covariates, and for policy evaluation and\ncounterfactual analysis. This task is challenging in short panels with\nindividual-specific effects due to partial identification and the incidental\nparameter problem. While consistent estimation of the identified set is\npossible, it generally requires very large sample sizes, especially when the\nnumber of support points of the observed covariates is large, such as when the\ncovariates are continuous. In this paper, we propose estimating outer bounds on\nthe identified set of average effects. Our bounds are easy to construct,\nconverge at the parametric rate, and are computationally simple to obtain even\nin moderately large samples, independent of whether the covariates are discrete\nor continuous. We also provide asymptotically valid confidence intervals on the\nidentified set. Simulation studies confirm that our approach works well and is\ninformative in finite samples. We also consider an application to labor force\nparticipation."}, "http://arxiv.org/abs/2312.17623": {"title": "Decision Theory for Treatment Choice Problems with Partial Identification", "link": "http://arxiv.org/abs/2312.17623", "description": "We apply classical statistical decision theory to a large class of treatment\nchoice problems with partial identification, revealing important theoretical\nand practical challenges but also interesting research opportunities. The\nchallenges are: In a general class of problems with Gaussian likelihood, all\ndecision rules are admissible; it is maximin-welfare optimal to ignore all\ndata; and, for severe enough partial identification, there are infinitely many\nminimax-regret optimal decision rules, all of which sometimes randomize the\npolicy recommendation. The opportunities are: We introduce a profiled regret\ncriterion that can reveal important differences between rules and render some\nof them inadmissible; and we uniquely characterize the minimax-regret optimal\nrule that least frequently randomizes. 
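The "Designing Discontinuities" entry above applies dynamic programming to choose an optimal quantization of a continuous variable (e.g., time zone borders). The Python sketch below shows the generic one-dimensional dynamic program for partitioning sorted data into k contiguous cells; the paper's objective balances gain and loss in an estimated causal effect size, which is replaced here by within-cell squared error purely for illustration, and the function and variable names are assumptions.

import numpy as np

def optimal_partition(x, k):
    # Partition sorted 1-D data into k contiguous cells minimizing total within-cell
    # squared error via dynamic programming (a stand-in objective for illustration).
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def cell_cost(i, j):  # within-cell squared error for the segment x[i:j]
        m = j - i
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / m

    cost = np.full((k + 1, n + 1), np.inf)
    cut = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                val = cost[c - 1, i] + cell_cost(i, j)
                if val < cost[c, j]:
                    cost[c, j], cut[c, j] = val, i

    # Recover the cell boundaries by backtracking.
    cells, j = [], n
    for c in range(k, 0, -1):
        i = cut[c, j]
        cells.append((x[i], x[j - 1]))
        j = i
    return cells[::-1], cost[k, n]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(3, 1.0, 150)])
cells, total_cost = optimal_partition(data, k=3)
print("cells:", cells, "total within-cell SSE:", round(total_cost, 2))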
We apply our results to aggregation of\nexperimental estimates for policy adoption, to extrapolation of Local Average\nTreatment Effects, and to policy making in the presence of omitted variable\nbias."}, "http://arxiv.org/abs/2312.17676": {"title": "Robust Inference in Panel Data Models: Some Effects of Heteroskedasticity and Leveraged Data in Small Samples", "link": "http://arxiv.org/abs/2312.17676", "description": "When the assumption of homoskedasticity is violated, least squares\nestimators of the variance become inefficient and statistical inference\nconducted with invalid standard errors leads to misleading rejection rates.\nDespite a vast cross-sectional literature on the downward bias of robust\nstandard errors, the problem is not extensively covered in the panel data\nframework. We investigate the consequences of the simultaneous presence of\nsmall sample size, heteroskedasticity and data points that exhibit extreme\nvalues in the covariates ('good leverage points') on statistical inference.\nFocusing on one-way linear panel data models, we examine asymptotic and finite\nsample properties of a battery of heteroskedasticity-consistent estimators\nusing Monte Carlo simulations. We also propose a hybrid estimator of the\nvariance-covariance matrix. Results show that conventional standard errors are\nalways dominated by more conservative estimators of the variance, especially in\nsmall samples. In addition, all types of HC standard errors have excellent\nperformance in terms of the size and power of tests under homoskedasticity."}, "http://arxiv.org/abs/2401.00249": {"title": "Forecasting CPI inflation under economic policy and geo-political uncertainties", "link": "http://arxiv.org/abs/2401.00249", "description": "Forecasting a key macroeconomic variable, consumer price index (CPI)\ninflation, for BRIC countries using economic policy uncertainty and\ngeopolitical risk is a difficult proposition for policymakers at the central\nbanks. This study proposes a novel filtered ensemble wavelet neural network\n(FEWNet) that can produce reliable long-term forecasts for CPI inflation. The\nproposal applies a maximum overlapping discrete wavelet transform to the CPI\ninflation series to obtain high-frequency and low-frequency signals. All the\nwavelet-transformed series and filtered exogenous variables are fed into\ndownstream autoregressive neural networks to make the final ensemble forecast.\nTheoretically, we show that FEWNet reduces the empirical risk compared to\nsingle, fully connected neural networks. We also demonstrate that the\nrolling-window real-time forecasts obtained from the proposed algorithm are\nsignificantly more accurate than benchmark forecasting methods. Additionally,\nwe use conformal prediction intervals to quantify the uncertainty associated\nwith the forecasts generated by the proposed approach. The excellent\nperformance of FEWNet can be attributed to its capacity to effectively capture\nnon-linearities and long-range dependencies in the data through its adaptable\narchitecture."}, "http://arxiv.org/abs/2401.00264": {"title": "Identification of Dynamic Nonlinear Panel Models under Partial Stationarity", "link": "http://arxiv.org/abs/2401.00264", "description": "This paper studies identification for a wide range of nonlinear panel data\nmodels, including binary choice, ordered response, and other types of limited\ndependent variable models. 
Our approach accommodates dynamic models with any\nnumber of lagged dependent variables as well as other types of (potentially\ncontemporary) endogeneity. Our identification strategy relies on a partial\nstationarity condition, which not only allows for an unknown distribution of\nerrors but also for temporal dependencies in errors. We derive partial\nidentification results under flexible model specifications and provide\nadditional support conditions for point identification. We demonstrate the\nrobust finite-sample performance of our approach using Monte Carlo simulations,\nwith static and dynamic ordered choice models as illustrative examples."}, "http://arxiv.org/abs/2401.00618": {"title": "Generalized Difference-in-Differences for Ordered Choice Models: Too Many \"False Zeros\"?", "link": "http://arxiv.org/abs/2401.00618", "description": "In this paper, we develop a generalized Difference-in-Differences model for\ndiscrete, ordered outcomes, building upon elements from a continuous\nChanges-in-Changes model. We focus on outcomes derived from self-reported\nsurvey data eliciting socially undesirable, illegal, or stigmatized behaviors\nlike tax evasion, substance abuse, or domestic violence, where too many \"false\nzeros\", or more broadly, underreporting are likely. We provide\ncharacterizations for distributional parallel trends, a concept central to our\napproach, within a general threshold-crossing model framework. In cases where\noutcomes are assumed to be reported correctly, we propose a framework for\nidentifying and estimating treatment effects across the entire distribution.\nThis framework is then extended to modeling underreported outcomes, allowing\nthe reporting decision to depend on treatment status. A simulation study\ndocuments the finite sample performance of the estimators. Applying our\nmethodology, we investigate the impact of recreational marijuana legalization\nfor adults in several U.S. states on the short-term consumption behavior of\n8th-grade high-school students. The results indicate small, but significant\nincreases in consumption probabilities at each level. These effects are further\namplified upon accounting for misreporting."}, "http://arxiv.org/abs/2010.08868": {"title": "A Decomposition Approach to Counterfactual Analysis in Game-Theoretic Models", "link": "http://arxiv.org/abs/2010.08868", "description": "Decomposition methods are often used for producing counterfactual predictions\nin non-strategic settings. When the outcome of interest arises from a\ngame-theoretic setting where agents are better off by deviating from their\nstrategies after a new policy, such predictions, despite their practical\nsimplicity, are hard to justify. We present conditions in Bayesian games under\nwhich the decomposition-based predictions coincide with the equilibrium-based\nones. In many games, such coincidence follows from an invariance condition for\nequilibrium selection rules. To illustrate our message, we revisit an empirical\nanalysis in Ciliberto and Tamer (2009) on firms' entry decisions in the airline\nindustry."}, "http://arxiv.org/abs/2308.10138": {"title": "On the Inconsistency of Cluster-Robust Inference and How Subsampling Can Fix It", "link": "http://arxiv.org/abs/2308.10138", "description": "Conventional methods of cluster-robust inference are inconsistent in the\npresence of unignorably large clusters. We formalize this claim by establishing\na necessary and sufficient condition for the consistency of the conventional\nmethods. 
We find that this condition for consistency is rejected for a\nmajority of empirical research papers. In this light, we propose a novel score\nsubsampling method which is robust even under the condition under which the\nconventional methods fail. Simulation studies support these claims. With real data\nused by an empirical paper, we showcase that the conventional methods conclude\nsignificance while our proposed method concludes insignificance."}, "http://arxiv.org/abs/2401.01064": {"title": "Robust Inference for Multiple Predictive Regressions with an Application on Bond Risk Premia", "link": "http://arxiv.org/abs/2401.01064", "description": "We propose a robust hypothesis testing procedure for the predictability of\nmultiple predictors that could be highly persistent. Our method improves the\npopular extended instrumental variable (IVX) testing (Phillips and Lee, 2013;\nKostakis et al., 2015) in that, besides addressing the two bias effects found\nin Hosseinkouchack and Demetrescu (2021), we find and deal with the\nvariance-enlargement effect. We show that two types of higher-order terms\ninduce these distortion effects in the test statistic, leading to significant\nover-rejection for one-sided tests and tests in multiple predictive\nregressions. Our improved IVX-based test includes three steps to tackle all the\nissues above regarding finite sample bias and variance terms. Thus, the test\nstatistic performs well in size control, while its power performance is\ncomparable with that of the original IVX. Monte Carlo simulations and an empirical\nstudy on the predictability of bond risk premia are provided to demonstrate the\neffectiveness of the newly proposed approach."}, "http://arxiv.org/abs/2401.01565": {"title": "Classification and Treatment Learning with Constraints via Composite Heaviside Optimization: a Progressive MIP Method", "link": "http://arxiv.org/abs/2401.01565", "description": "This paper proposes a Heaviside composite optimization approach and presents\na progressive (mixed) integer programming (PIP) method for solving multi-class\nclassification and multi-action treatment problems with constraints. A\nHeaviside composite function is a composite of a Heaviside function (i.e., the\nindicator function of either the open $( \, 0,\infty )$ or closed $[ \,\n0,\infty \, )$ interval) with a possibly nondifferentiable function.\nModeling-wise, we show how Heaviside composite optimization provides a unified\nformulation for learning the optimal multi-class classification and\nmulti-action treatment rules, subject to rule-dependent constraints stipulating\na variety of domain restrictions. A Heaviside composite function has an\nequivalent discrete formulation in terms of integer variables, and the\nresulting optimization problem can in principle be solved by integer\nprogramming (IP) methods. Nevertheless, for constrained learning problems with\nlarge data sets, a straightforward application of\noff-the-shelf IP solvers is usually ineffective in achieving global optimality.\nTo alleviate such a computational burden, our major contribution is the\nproposal of the PIP method by leveraging the effectiveness of state-of-the-art\nIP solvers for problems of modest sizes. We establish the theoretical advantage\nof the PIP method through its connection to continuous optimization and show that\nthe computed solution is locally optimal for a broad class of Heaviside\ncomposite optimization problems. 
The numerical performance of the PIP method is\ndemonstrated by extensive computational experimentation."}, "http://arxiv.org/abs/2401.01645": {"title": "Model Averaging and Double Machine Learning", "link": "http://arxiv.org/abs/2401.01645", "description": "This paper discusses pairing double/debiased machine learning (DDML) with\nstacking, a model averaging method for combining multiple candidate learners,\nto estimate structural parameters. We introduce two new stacking approaches for\nDDML: short-stacking exploits the cross-fitting step of DDML to substantially\nreduce the computational burden and pooled stacking enforces common stacking\nweights over cross-fitting folds. Using calibrated simulation studies and two\napplications estimating gender gaps in citations and wages, we show that DDML\nwith stacking is more robust to partially unknown functional forms than common\nalternative approaches based on single pre-selected learners. We provide Stata\nand R software implementing our proposals."}, "http://arxiv.org/abs/2401.01804": {"title": "Efficient Computation of Confidence Sets Using Classification on Equidistributed Grids", "link": "http://arxiv.org/abs/2401.01804", "description": "Economic models produce moment inequalities, which can be used to form tests\nof the true parameters. Confidence sets (CS) of the true parameters are derived\nby inverting these tests. However, they often lack analytical expressions,\nnecessitating a grid search to obtain the CS numerically by retaining the grid\npoints that pass the test. When the statistic is not asymptotically pivotal,\nconstructing the critical value for each grid point in the parameter space adds\nto the computational burden. In this paper, we convert the computational issue\ninto a classification problem by using a support vector machine (SVM)\nclassifier. Its decision function provides a faster and more systematic way of\ndividing the parameter space into two regions: inside vs. outside of the\nconfidence set. We label those points in the CS as 1 and those outside as -1.\nResearchers can train the SVM classifier on a grid of manageable size and use\nit to determine whether points on denser grids are in the CS or not. We\nestablish certain conditions for the grid so that there is a tuning that allows\nus to asymptotically reproduce the test in the CS. This means that in the\nlimit, a point is classified as belonging to the confidence set if and only if\nit is labeled as 1 by the SVM."}, "http://arxiv.org/abs/2011.03073": {"title": "Bias correction for quantile regression estimators", "link": "http://arxiv.org/abs/2011.03073", "description": "We study the bias of classical quantile regression and instrumental variable\nquantile regression estimators. While being asymptotically first-order\nunbiased, these estimators can have non-negligible second-order biases. We\nderive a higher-order stochastic expansion of these estimators using empirical\nprocess theory. Based on this expansion, we derive an explicit formula for the\nsecond-order bias and propose a feasible bias correction procedure that uses\nfinite-difference estimators of the bias components. The proposed bias\ncorrection method performs well in simulations. 
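The "Efficient Computation of Confidence Sets Using Classification on Equidistributed Grids" entry above turns test inversion into a classification problem. A hedged Python sketch of that workflow follows: label a coarse grid by running the test, train an SVM on those labels, and let its decision function classify a much denser grid. The in_confidence_set function below is a placeholder criterion-function threshold standing in for the paper's moment-inequality test; all names and tuning values are assumptions.

import numpy as np
from sklearn.svm import SVC

# Placeholder for the test inversion: a parameter value theta is kept in the
# confidence set if a criterion function stays below a critical value.
def in_confidence_set(theta, crit=1.0):
    q = (theta[..., 0] - 0.5) ** 2 + 2.0 * np.maximum(theta[..., 1] - 0.3, 0.0) ** 2
    return np.where(q <= crit, 1, -1)

# Step 1: run the (expensive) test on a coarse grid of manageable size.
g = np.linspace(-2.0, 2.0, 25)
coarse = np.array(np.meshgrid(g, g)).reshape(2, -1).T
labels = in_confidence_set(coarse)

# Step 2: train an SVM classifier on the coarse-grid labels.
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(coarse, labels)

# Step 3: use the decision function to classify a much denser grid, avoiding a
# test (and critical value) computation at every dense grid point.
gd = np.linspace(-2.0, 2.0, 200)
dense = np.array(np.meshgrid(gd, gd)).reshape(2, -1).T
pred = clf.predict(dense)
print("share of dense grid classified as inside the CS:", (pred == 1).mean())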
We provide an empirical\nillustration using Engel's classical data on household expenditure."}, "http://arxiv.org/abs/2107.02637": {"title": "Difference-in-Differences with a Continuous Treatment", "link": "http://arxiv.org/abs/2107.02637", "description": "This paper analyzes difference-in-differences setups with a continuous\ntreatment. We show that treatment effect on the treated-type parameters can be\nidentified under a generalized parallel trends assumption that is similar to\nthe binary treatment setup. However, interpreting differences in these\nparameters across different values of the treatment can be particularly\nchallenging due to treatment effect heterogeneity. We discuss alternative,\ntypically stronger, assumptions that alleviate these challenges. We also\nprovide a variety of treatment effect decomposition results, highlighting that\nparameters associated with popular linear two-way fixed-effect (TWFE)\nspecifications can be hard to interpret, \\emph{even} when there are only two\ntime periods. We introduce alternative estimation procedures that do not suffer\nfrom these TWFE drawbacks, and show in an application that they can lead to\ndifferent conclusions."}, "http://arxiv.org/abs/2401.02428": {"title": "Cu\\'anto es demasiada inflaci\\'on? Una clasificaci\\'on de reg\\'imenes inflacionarios", "link": "http://arxiv.org/abs/2401.02428", "description": "The classifications of inflationary regimes proposed in the literature have\nmostly been based on arbitrary characterizations, subject to value judgments by\nresearchers. The objective of this study is to propose a new methodological\napproach that reduces subjectivity and improves accuracy in the construction of\nsuch regimes. The method is built upon a combination of clustering techniques\nand classification trees, which allows for an historical periodization of\nArgentina's inflationary history for the period 1943-2022. Additionally, two\nprocedures are introduced to smooth out the classification over time: a measure\nof temporal contiguity of observations and a rolling method based on the simple\nmajority rule. The obtained regimes are compared against the existing\nliterature on the inflation-relative price variability relationship, revealing\na better performance of the proposed regimes."}, "http://arxiv.org/abs/2401.02819": {"title": "Roughness Signature Functions", "link": "http://arxiv.org/abs/2401.02819", "description": "Inspired by the activity signature introduced by Todorov and Tauchen (2010),\nwhich was used to measure the activity of a semimartingale, this paper\nintroduces the roughness signature function. The paper illustrates how it can\nbe used to determine whether a discretely observed process is generated by a\ncontinuous process that is rougher than a Brownian motion, a pure-jump process,\nor a combination of the two. Further, if a continuous rough process is present,\nthe function gives an estimate of the roughness index. This is done through an\nextensive simulation study, where we find that the roughness signature function\nworks as expected on rough processes. We further derive some asymptotic\nproperties of this new signature function. The function is applied empirically\nto three different volatility measures for the S&P500 index. The three measures\nare realized volatility, the VIX, and the option-extracted volatility estimator\nof Todorov (2019). 
The realized volatility and option-extracted volatility show\nsigns of roughness, with the option-extracted volatility appearing smoother\nthan the realized volatility, while the VIX appears to be driven by a\ncontinuous martingale with jumps."}, "http://arxiv.org/abs/2401.03293": {"title": "Counterfactuals in factor models", "link": "http://arxiv.org/abs/2401.03293", "description": "We study a new model where the potential outcomes, corresponding to the\nvalues of a (possibly continuous) treatment, are linked through common factors.\nThe factors can be estimated using a panel of regressors. We propose a\nprocedure to estimate time-specific and unit-specific average marginal effects\nin this context. Our approach can be used either with high-dimensional time\nseries or with large panels. It allows for treatment effects that are heterogeneous\nacross time and units and is straightforward to implement since it only relies\non principal components analysis and elementary computations. We derive the\nasymptotic distribution of our estimator of the average marginal effect and\nhighlight its solid finite sample performance through a simulation exercise.\nThe approach can also be used to estimate average counterfactuals or adapted to\nan instrumental variables setting and we discuss these extensions. Finally, we\nillustrate our novel methodology through an empirical application on income\ninequality."}, "http://arxiv.org/abs/2401.03756": {"title": "Contextual Fixed-Budget Best Arm Identification: Adaptive Experimental Design with Policy Learning", "link": "http://arxiv.org/abs/2401.03756", "description": "Individualized treatment recommendation is a crucial task in evidence-based\ndecision-making. In this study, we formulate this task as a fixed-budget best\narm identification (BAI) problem with contextual information. In this setting,\nwe consider an adaptive experiment given multiple treatment arms. At each\nround, a decision-maker observes a context (covariate) that characterizes an\nexperimental unit and assigns the unit to one of the treatment arms. At the end\nof the experiment, the decision-maker recommends a treatment arm estimated to\nyield the highest expected outcome conditioned on a context (best treatment\narm). The effectiveness of this decision is measured in terms of the worst-case\nexpected simple regret (policy regret), which represents the largest difference\nbetween the conditional expected outcomes of the best and recommended treatment\narms given a context. Our initial step is to derive asymptotic lower bounds for\nthe worst-case expected simple regret, which also imply ideal treatment\nassignment rules. Following the lower bounds, we propose the Adaptive Sampling\n(AS)-Policy Learning recommendation (PL) strategy. Under this strategy, we\nrandomly assign a treatment arm according to a target assignment ratio at\neach round. At the end of the experiment, we train a policy, a function that\nrecommends a treatment arm given a context, by maximizing the counterfactual\nempirical policy value. Our results show that the AS-PL strategy is\nasymptotically minimax optimal, with the leading factor of its expected simple\nregret matching our established worst-case lower bound. 
This research\nhas broad implications in various domains, and in light of existing literature,\nour method can be perceived as an adaptive experimental design tailored for\npolicy learning, on-policy learning, or adaptive welfare maximization."}, "http://arxiv.org/abs/2401.03990": {"title": "Identification with possibly invalid IVs", "link": "http://arxiv.org/abs/2401.03990", "description": "This paper proposes a novel identification strategy relying on\nquasi-instrumental variables (quasi-IVs). A quasi-IV is a relevant but possibly\ninvalid IV because it is not completely exogenous and/or excluded. We show that\na variety of models with discrete or continuous endogenous treatment, which are\nusually identified with an IV - quantile models with rank invariance, additive\nmodels with homogeneous treatment effects, and local average treatment effect\nmodels - can be identified under the joint relevance of two complementary\nquasi-IVs instead. To achieve identification, we complement one excluded but\npossibly endogenous quasi-IV (e.g., ``relevant proxies'' such as previous\ntreatment choice) with one exogenous (conditional on the excluded quasi-IV) but\npossibly included quasi-IV (e.g., random assignment or exogenous market\nshocks). In practice, our identification strategy should be attractive since\ncomplementary quasi-IVs should be easier to find than standard IVs. Our\napproach also holds if either of the two quasi-IVs turns out to be a valid IV."}, "http://arxiv.org/abs/2401.04050": {"title": "Robust Estimation in Network Vector Autoregression with Nonstationary Regressors", "link": "http://arxiv.org/abs/2401.04050", "description": "This article studies identification and estimation for the network vector\nautoregressive model with nonstationary regressors. In particular, network\ndependence is characterized by a nonstochastic adjacency matrix. The\ninformation set includes a stationary regressand and a node-specific vector of\nnonstationary regressors, both observed at the same equally spaced time\nfrequencies. Our proposed econometric specification corresponds to the NVAR\nmodel under time series nonstationarity, which relies on the local-to-unity\nparametrization to capture the unknown form of persistence of these\nnode-specific regressors. Robust econometric estimation is achieved using an\nIVX-type estimator, and the asymptotic theory for the augmented vector\nof regressors is developed under a double asymptotic regime where both the\nnetwork size and the time dimension tend to infinity."}, "http://arxiv.org/abs/2103.12374": {"title": "What Do We Get from Two-Way Fixed Effects Regressions? Implications from Numerical Equivalence", "link": "http://arxiv.org/abs/2103.12374", "description": "In any multiperiod panel, a two-way fixed effects (TWFE) regression is\nnumerically equivalent to a first-difference (FD) regression that pools all\npossible between-period gaps. Building on this observation, this paper develops\nnumerical and causal interpretations of the TWFE coefficient. At the sample\nlevel, the TWFE coefficient is a weighted average of FD coefficients with\ndifferent between-period gaps. This decomposition is useful for assessing the\nsource of identifying variation for the TWFE coefficient. At the population\nlevel, a causal interpretation of the TWFE coefficient requires a common trends\nassumption for any between-period gap, and the assumption has to be conditional\non changes in time-varying covariates. 
I propose a natural generalization of\nthe TWFE estimator that can relax these requirements."}, "http://arxiv.org/abs/2107.11869": {"title": "Adaptive Estimation and Uniform Confidence Bands for Nonparametric Structural Functions and Elasticities", "link": "http://arxiv.org/abs/2107.11869", "description": "We introduce two data-driven procedures for optimal estimation and inference\nin nonparametric models using instrumental variables. The first is a\ndata-driven choice of sieve dimension for a popular class of sieve two-stage\nleast squares estimators. When implemented with this choice, estimators of both\nthe structural function $h_0$ and its derivatives (such as elasticities)\nconverge at the fastest possible (i.e., minimax) rates in sup-norm. The second\nis for constructing uniform confidence bands (UCBs) for $h_0$ and its\nderivatives. Our UCBs guarantee coverage over a generic class of\ndata-generating processes and contract at the minimax rate, possibly up to a\nlogarithmic factor. As such, our UCBs are asymptotically more efficient than\nUCBs based on the usual approach of undersmoothing. As an application, we\nestimate the elasticity of the intensive margin of firm exports in a\nmonopolistic competition model of international trade. Simulations illustrate\nthe good performance of our procedures in empirically calibrated designs. Our\nresults provide evidence against common parameterizations of the distribution\nof unobserved firm heterogeneity."}, "http://arxiv.org/abs/2301.09397": {"title": "ddml: Double/debiased machine learning in Stata", "link": "http://arxiv.org/abs/2301.09397", "description": "We introduce the package ddml for Double/Debiased Machine Learning (DDML) in\nStata. Estimators of causal parameters for five different econometric models\nare supported, allowing for flexible estimation of causal effects of endogenous\nvariables in settings with unknown functional forms and/or many exogenous\nvariables. ddml is compatible with many existing supervised machine learning\nprograms in Stata. We recommend using DDML in combination with stacking\nestimation which combines multiple machine learners into a final predictor. We\nprovide Monte Carlo evidence to support our recommendation."}, "http://arxiv.org/abs/2302.13857": {"title": "Multi-cell experiments for marginal treatment effect estimation of digital ads", "link": "http://arxiv.org/abs/2302.13857", "description": "Randomized experiments with treatment and control groups are an important\ntool to measure the impacts of interventions. However, in experimental settings\nwith one-sided noncompliance, extant empirical approaches may not produce the\nestimands a decision-maker needs to solve their problem of interest. For\nexample, these experimental designs are common in digital advertising settings,\nbut typical methods do not yield effects that inform the intensive margin --\nhow many consumers should be reached or how much should be spent on a campaign.\nWe propose a solution that combines a novel multi-cell experimental design with\nmodern estimation techniques that enables decision-makers to recover enough\ninformation to solve problems with an intensive margin. Our design is\nstraightforward to implement and does not require any additional budget to be\ncarried out. 
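The ddml entry above recommends combining double/debiased machine learning with stacking. A rough Python analogue is sketched below (the package itself is for Stata); scikit-learn objects stand in for its learners, and the simulated partially linear model, learner choices, and fold counts are assumptions. The point is the workflow: cross-fitted nuisance predictions from a stacked learner, followed by partialling-out.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Simulated partially linear model: y = theta*d + g(x) + u, with d = m(x) + v.
n, p = 1000, 10
X = rng.normal(size=(n, p))
g = np.sin(X[:, 0]) + X[:, 1] ** 2
m = 0.5 * X[:, 0] + np.cos(X[:, 2])
d = m + rng.normal(size=n)
theta_true = 1.0
y = theta_true * d + g + rng.normal(size=n)

# Stacked learner combining two candidate learners with a linear meta-learner.
def make_stack():
    return StackingRegressor(
        estimators=[("lasso", LassoCV()), ("rf", RandomForestRegressor(n_estimators=100))],
        final_estimator=LinearRegression(),
    )

# Cross-fitted nuisance predictions, then partialling-out for the target parameter.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
y_hat = cross_val_predict(make_stack(), X, y, cv=folds)
d_hat = cross_val_predict(make_stack(), X, d, cv=folds)
y_res, d_res = y - y_hat, d - d_hat
theta_hat = d_res @ y_res / (d_res @ d_res)
print("theta_hat (true value 1.0):", round(theta_hat, 3))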
We illustrate our approach through a series of simulations that\nare calibrated using an advertising experiment at Facebook, finding that our\nmethod outperforms standard techniques in generating better decisions."}, "http://arxiv.org/abs/2401.04200": {"title": "Teacher bias or measurement error?", "link": "http://arxiv.org/abs/2401.04200", "description": "In many countries, teachers' track recommendations are used to allocate\nstudents to secondary school tracks. Previous studies have shown that students\nfrom families with low socioeconomic status (SES) receive lower track\nrecommendations than their peers from high SES families, conditional on\nstandardized test scores. It is often argued this indicates teacher bias.\nHowever, this claim is invalid in the presence of measurement error in test\nscores. We discuss how measurement error in test scores generates a biased\ncoefficient of the conditional SES gap, and consider three empirical strategies\nto address this bias. Using administrative data from the Netherlands, we find\nthat measurement error explains 35 to 43% of the conditional SES gap in track\nrecommendations."}, "http://arxiv.org/abs/2401.04512": {"title": "Robust Bayesian Method for Refutable Models", "link": "http://arxiv.org/abs/2401.04512", "description": "We propose a robust Bayesian method for economic models that can be rejected\nunder some data distributions. The econometrician starts with a structural\nassumption which can be written as the intersection of several assumptions, and\nthe joint assumption is refutable. To avoid the model rejection, the\neconometrician first takes a stance on which assumption $j$ is likely to be\nviolated and considers a measurement of the degree of violation of this\nassumption $j$. She then considers a (marginal) prior belief on the degree of\nviolation $(\\pi_{m_j})$: She considers a class of prior distributions $\\pi_s$\non all economic structures such that all $\\pi_s$ have the same marginal\ndistribution $\\pi_m$. Compared to the standard nonparametric Bayesian method\nthat puts a single prior on all economic structures, the robust Bayesian method\nimposes a single marginal prior distribution on the degree of violation. As a\nresult, the robust Bayesian method allows the econometrician to take a stance\nonly on the likeliness of violation of assumption $j$. Compared to the\nfrequentist approach to relax the refutable assumption, the robust Bayesian\nmethod is transparent on the econometrician's stance of choosing models. We\nalso show that many frequentists' ways to relax the refutable assumption can be\nfound equivalent to particular choices of robust Bayesian prior classes. We use\nthe local average treatment effect (LATE) in the potential outcome framework as\nthe leading illustrating example."}, "http://arxiv.org/abs/2005.10314": {"title": "On the Nuisance of Control Variables in Regression Analysis", "link": "http://arxiv.org/abs/2005.10314", "description": "Control variables are included in regression analyses to estimate the causal\neffect of a treatment on an outcome. In this paper, we argue that the estimated\neffect sizes of controls are unlikely to have a causal interpretation\nthemselves, though. This is because even valid controls are possibly endogenous\nand represent a combination of several different causal mechanisms operating\njointly on the outcome, which is hard to interpret theoretically. 
Therefore, we\nrecommend refraining from interpreting marginal effects of controls and\nfocusing on the main variables of interest, for which a plausible\nidentification argument can be established. To prevent erroneous managerial or\npolicy implications, coefficients of control variables should be clearly marked\nas not having a causal interpretation or omitted from regression tables\naltogether. Moreover, we advise against using control variable estimates for\nsubsequent theory building and meta-analyses."}, "http://arxiv.org/abs/2401.04803": {"title": "IV Estimation of Panel Data Tobit Models with Normal Errors", "link": "http://arxiv.org/abs/2401.04803", "description": "Amemiya (1973) proposed a ``consistent initial estimator'' for the parameters\nin a censored regression model with normal errors. This paper demonstrates that\na similar approach can be used to construct moment conditions for\nfixed--effects versions of the model considered by Amemiya. This result\nsuggests estimators for models that have not previously been considered."}, "http://arxiv.org/abs/2401.04849": {"title": "A Deep Learning Representation of Spatial Interaction Model for Resilient Spatial Planning of Community Business Clusters", "link": "http://arxiv.org/abs/2401.04849", "description": "Existing Spatial Interaction Models (SIMs) are limited in capturing the\ncomplex and context-aware interactions between business clusters and trade\nareas. To address the limitation, we propose a SIM-GAT model to predict\nspatiotemporal visitation flows between community business clusters and their\ntrade areas. The model innovatively represents the integrated system of\nbusiness clusters, trade areas, and transportation infrastructure within an\nurban region using a connected graph. Then, a graph-based deep learning model,\ni.e., Graph AttenTion network (GAT), is used to capture the complexity and\ninterdependencies of business clusters. We developed this model with data\ncollected from the Miami metropolitan area in Florida. We then demonstrated its\neffectiveness in capturing varying attractiveness of business clusters to\ndifferent residential neighborhoods and across scenarios with an eXplainable AI\napproach. We contribute a novel method supplementing conventional SIMs to\npredict and analyze the dynamics of inter-connected community business\nclusters. The analysis results can inform data-evidenced and place-specific\nplanning strategies helping community business clusters better accommodate\ntheir customers across scenarios, and hence improve the resilience of community\nbusinesses."}, "http://arxiv.org/abs/2306.14653": {"title": "Optimization of the Generalized Covariance Estimator in Noncausal Processes", "link": "http://arxiv.org/abs/2306.14653", "description": "This paper investigates the performance of the Generalized Covariance\nestimator (GCov) in estimating and identifying mixed causal and noncausal\nmodels. The GCov estimator is a semi-parametric method that minimizes an\nobjective function without making any assumptions about the error distribution\nand is based on nonlinear autocovariances to identify the causal and noncausal\norders. When the number and type of nonlinear autocovariances included in the\nobjective function of a GCov estimator is insufficient/inadequate, or the error\ndensity is too close to the Gaussian, identification issues can arise. These\nissues result in local minima in the objective function, which correspond to\nparameter values associated with incorrect causal and noncausal orders. 
Then,\ndepending on the starting point and the optimization algorithm employed, the\nalgorithm can converge to a local minimum. The paper proposes the use of the\nSimulated Annealing (SA) optimization algorithm as an alternative to\nconventional numerical optimization methods. The results demonstrate that SA\nperforms well when applied to mixed causal and noncausal models, successfully\neliminating the effects of local minima. The proposed approach is illustrated\nby an empirical application involving a bivariate commodity price series."}, "http://arxiv.org/abs/2401.05517": {"title": "On Efficient Inference of Causal Effects with Multiple Mediators", "link": "http://arxiv.org/abs/2401.05517", "description": "This paper provides robust estimators and efficient inference of causal\neffects involving multiple interacting mediators. Most existing works either\nimpose a linear model assumption among the mediators or are restricted to\nhandle conditionally independent mediators given the exposure. To overcome\nthese limitations, we define causal and individual mediation effects in a\ngeneral setting, and employ a semiparametric framework to develop quadruply\nrobust estimators for these causal effects. We further establish the asymptotic\nnormality of the proposed estimators and prove their local semiparametric\nefficiencies. The proposed method is empirically validated via simulated and\nreal datasets concerning psychiatric disorders in trauma survivors."}, "http://arxiv.org/abs/2401.05784": {"title": "Covariance Function Estimation for High-Dimensional Functional Time Series with Dual Factor Structures", "link": "http://arxiv.org/abs/2401.05784", "description": "We propose a flexible dual functional factor model for modelling\nhigh-dimensional functional time series. In this model, a high-dimensional\nfully functional factor parametrisation is imposed on the observed functional\nprocesses, whereas a low-dimensional version (via series approximation) is\nassumed for the latent functional factors. We extend the classic principal\ncomponent analysis technique for the estimation of a low-rank structure to the\nestimation of a large covariance matrix of random functions that satisfies a\nnotion of (approximate) functional \"low-rank plus sparse\" structure; and\ngeneralise the matrix shrinkage method to functional shrinkage in order to\nestimate the sparse structure of functional idiosyncratic components. Under\nappropriate regularity conditions, we derive the large sample theory of the\ndeveloped estimators, including the consistency of the estimated factors and\nfunctional factor loadings and the convergence rates of the estimated matrices\nof covariance functions measured by various (functional) matrix norms.\nConsistent selection of the number of factors and a data-driven rule to choose\nthe shrinkage parameter are discussed. 
Simulation and empirical studies are\nprovided to demonstrate the finite-sample performance of the developed model\nand estimation methodology."}, "http://arxiv.org/abs/2305.16377": {"title": "Validating a dynamic input-output model for the propagation of supply and demand shocks during the COVID-19 pandemic in Belgium", "link": "http://arxiv.org/abs/2305.16377", "description": "This work validates a dynamic production network model, used to quantify the\nimpact of economic shocks caused by COVID-19 in the UK, using data for Belgium.\nBecause the model was published early during the 2020 COVID-19 pandemic, it\nrelied on several assumptions regarding the magnitude of the observed economic\nshocks, for which more accurate data have become available in the meantime. We\nrefined the propagated shocks to align with observed data collected during the\npandemic and calibrated some less well-informed parameters using 115 economic\ntime series. The refined model effectively captures the evolution of GDP,\nrevenue, and employment during the COVID-19 pandemic in Belgium at both\nindividual economic activity and aggregate levels. However, the reduction in\nbusiness-to-business demand is overestimated, revealing structural shortcomings\nin accounting for businesses' motivations to sustain trade despite the\npandemic's induced shocks. We confirm that the relaxation of the stringent\nLeontief production function by a survey on the criticality of inputs\nsignificantly improved the model's accuracy. However, despite a large dataset,\ndistinguishing between varying degrees of relaxation proved challenging.\nOverall, this work demonstrates the model's validity in assessing the impact of\neconomic shocks caused by an epidemic in Belgium."}, "http://arxiv.org/abs/2309.13251": {"title": "Nonparametric estimation of conditional densities by generalized random forests", "link": "http://arxiv.org/abs/2309.13251", "description": "Considering a continuous random variable Y together with a continuous random\nvector X, I propose a nonparametric estimator f^(.|x) for the conditional\ndensity of Y given X=x. This estimator takes the form of an exponential series\nwhose coefficients T = (T1,...,TJ) are the solution of a system of nonlinear\nequations that depends on an estimator of the conditional expectation\nE[p(Y)|X=x], where p(.) is a J-dimensional vector of basis functions. A key\nfeature is that E[p(Y)|X=x] is estimated by generalized random forest (Athey,\nTibshirani, and Wager, 2019), targeting the heterogeneity of T across x. I show\nthat f^(.|x) is uniformly consistent and asymptotically normal, while allowing\nJ to grow to infinity. I also provide a standard error formula to construct\nasymptotically valid confidence intervals. Results from Monte Carlo experiments\nand an empirical illustration are provided."}, "http://arxiv.org/abs/2401.06264": {"title": "Exposure effects are policy relevant only under strong assumptions about the interference structure", "link": "http://arxiv.org/abs/2401.06264", "description": "Savje (2023) recommends misspecified exposure effects as a way to avoid\nstrong assumptions about interference when analyzing the results of an\nexperiment. In this discussion, we highlight a key limitation of Savje's\nrecommendation. Exposure effects are not generally useful for evaluating social\npolicies without the strong assumptions that Savje seeks to avoid.\n\nOur discussion is organized as follows. 
Section 2 summarizes our position,\nsection 3 provides a concrete example, and section 4 concludes. Proofs of claims\nare in an appendix."}, "http://arxiv.org/abs/2401.06611": {"title": "Robust Analysis of Short Panels", "link": "http://arxiv.org/abs/2401.06611", "description": "Many structural econometric models include latent variables on whose\nprobability distributions one may wish to place minimal restrictions. Leading\nexamples in panel data models are individual-specific variables sometimes\ntreated as \"fixed effects\" and, in dynamic models, initial conditions. This\npaper presents a generally applicable method for characterizing sharp\nidentified sets when models place no restrictions on the probability\ndistribution of certain latent variables and no restrictions on their\ncovariation with other variables. In our analysis latent variables on which\nrestrictions are undesirable are removed, leading to econometric analysis\nrobust to misspecification of restrictions on their distributions which are\ncommonplace in the applied panel data literature. Endogenous explanatory\nvariables are easily accommodated. Examples of application to some static and\ndynamic binary, ordered and multiple discrete choice and censored panel data\nmodels are presented."}, "http://arxiv.org/abs/2308.00202": {"title": "Randomization Inference of Heterogeneous Treatment Effects under Network Interference", "link": "http://arxiv.org/abs/2308.00202", "description": "We design randomization tests of heterogeneous treatment effects when units\ninteract on a single connected network. Our modeling strategy allows network\ninterference into the potential outcomes framework using the concept of\nexposure mapping. We consider several null hypotheses representing different\nnotions of homogeneous treatment effects. However, these hypotheses are not\nsharp due to nuisance parameters and multiple potential outcomes. To address\nthe issue of multiple potential outcomes, we propose a conditional\nrandomization method that expands on existing procedures. Our conditioning\napproach permits the use of treatment assignment as a conditioning variable,\nwidening the range of application of the randomization method of inference. In\naddition, we propose techniques that overcome the nuisance parameter issue. We\nshow that our resulting testing methods based on the conditioning procedure and\nthe strategies for handling nuisance parameters are asymptotically valid. We\ndemonstrate the testing methods using a network data set and also present the\nfindings of a Monte Carlo study."}, "http://arxiv.org/abs/2401.06864": {"title": "Deep Learning With DAGs", "link": "http://arxiv.org/abs/2401.06864", "description": "Social science theories often postulate causal relationships among a set of\nvariables or events. Although directed acyclic graphs (DAGs) are increasingly\nused to represent these theories, their full potential has not yet been\nrealized in practice. As non-parametric causal models, DAGs require no\nassumptions about the functional form of the hypothesized relationships.\nNevertheless, to simplify the task of empirical evaluation, researchers tend to\ninvoke such assumptions anyway, even though they are typically arbitrary and do\nnot reflect any theoretical content or prior knowledge. Moreover, functional\nform assumptions can engender bias, whenever they fail to accurately capture\nthe complexity of the causal system under investigation. 
In this article, we\nintroduce causal-graphical normalizing flows (cGNFs), a novel approach to\ncausal inference that leverages deep neural networks to empirically evaluate\ntheories represented as DAGs. Unlike conventional approaches, cGNFs model the\nfull joint distribution of the data according to a DAG supplied by the analyst,\nwithout relying on stringent assumptions about functional form. In this way,\nthe method allows for flexible, semi-parametric estimation of any causal\nestimand that can be identified from the DAG, including total effects,\nconditional effects, direct and indirect effects, and path-specific effects. We\nillustrate the method with a reanalysis of Blau and Duncan's (1967) model of\nstatus attainment and Zhou's (2019) model of conditional versus controlled\nmobility. To facilitate adoption, we provide open-source software together with\na series of online tutorials for implementing cGNFs. The article concludes with\na discussion of current limitations and directions for future development."}, "http://arxiv.org/abs/2401.07038": {"title": "A simple stochastic nonlinear AR model with application to bubble", "link": "http://arxiv.org/abs/2401.07038", "description": "Economic and financial time series can feature locally explosive behavior\nwhen a bubble is formed. The economic or financial bubble, especially its\ndynamics, is an intriguing topic that has been attracting longstanding\nattention. To illustrate the dynamics of the local explosion itself, the paper\npresents a novel, simple, yet useful time series model, called the stochastic\nnonlinear autoregressive model, which is always strictly stationary and\ngeometrically ergodic and can create long swings or persistence observed in\nmany macroeconomic variables. When a nonlinear autoregressive coefficient is\noutside of a certain range, the model has periodically explosive behaviors and\ncan then be used to portray the bubble dynamics. Further, the quasi-maximum\nlikelihood estimation (QMLE) of our model is considered, and its strong\nconsistency and asymptotic normality are established under minimal assumptions\non innovation. A new model diagnostic checking statistic is developed for model\nfitting adequacy. In addition two methods for bubble tagging are proposed, one\nfrom the residual perspective and the other from the null-state perspective.\nMonte Carlo simulation studies are conducted to assess the performances of the\nQMLE and the two bubble tagging methods in finite samples. Finally, the\nusefulness of the model is illustrated by an empirical application to the\nmonthly Hang Seng Index."}, "http://arxiv.org/abs/2401.07152": {"title": "Inference for Synthetic Controls via Refined Placebo Tests", "link": "http://arxiv.org/abs/2401.07152", "description": "The synthetic control method is often applied to problems with one treated\nunit and a small number of control units. A common inferential task in this\nsetting is to test null hypotheses regarding the average treatment effect on\nthe treated. Inference procedures that are justified asymptotically are often\nunsatisfactory due to (1) small sample sizes that render large-sample\napproximation fragile and (2) simplification of the estimation procedure that\nis implemented in practice. An alternative is permutation inference, which is\nrelated to a common diagnostic called the placebo test. It has provable Type-I\nerror guarantees in finite samples without simplification of the method, when\nthe treatment is uniformly assigned. 
Despite this robustness, the placebo test\nsuffers from low resolution since the null distribution is constructed from\nonly $N$ reference estimates, where $N$ is the sample size. This creates a\nbarrier for statistical inference at a common level like $\\alpha = 0.05$,\nespecially when $N$ is small. We propose a novel leave-two-out procedure that\nbypasses this issue, while still maintaining the same finite-sample Type-I\nerror guarantee under uniform assignment for a wide range of $N$. Unlike the\nplacebo test whose Type-I error always equals the theoretical upper bound, our\nprocedure often achieves a lower unconditional Type-I error than theory\nsuggests; this enables useful inference in the challenging regime when $\\alpha\n< 1/N$. Empirically, our procedure achieves a higher power when the effect size\nis reasonably large and a comparable power otherwise. We generalize our\nprocedure to non-uniform assignments and show how to conduct sensitivity\nanalysis. From a methodological perspective, our procedure can be viewed as a\nnew type of randomization inference different from permutation or rank-based\ninference, which is particularly effective in small samples."}, "http://arxiv.org/abs/2401.07176": {"title": "A Note on Uncertainty Quantification for Maximum Likelihood Parameters Estimated with Heuristic Based Optimization Algorithms", "link": "http://arxiv.org/abs/2401.07176", "description": "Gradient-based solvers risk convergence to local optima, leading to incorrect\nresearcher inference. Heuristic-based algorithms are able to ``break free\" of\nthese local optima to eventually converge to the true global optimum. However,\ngiven that they do not provide the gradient/Hessian needed to approximate the\ncovariance matrix and that the significantly longer computational time they\nrequire for convergence likely precludes resampling procedures for inference,\nresearchers often are unable to quantify uncertainty in the estimates they\nderive with these methods. This note presents a simple and relatively fast\ntwo-step procedure to estimate the covariance matrix for parameters estimated\nwith these algorithms. This procedure relies on automatic differentiation, a\ncomputational means of calculating derivatives that is popular in machine\nlearning applications. A brief empirical example demonstrates the advantages of\nthis procedure relative to bootstrapping and shows the similarity in standard\nerror estimates between this procedure and that which would normally accompany\nmaximum likelihood estimation with a gradient-based algorithm."}, "http://arxiv.org/abs/2401.08290": {"title": "Causal Machine Learning for Moderation Effects", "link": "http://arxiv.org/abs/2401.08290", "description": "It is valuable for any decision maker to know the impact of decisions\n(treatments) on average and for subgroups. The causal machine learning\nliterature has recently provided tools for estimating group average treatment\neffects (GATE) to understand treatment heterogeneity better. This paper\naddresses the challenge of interpreting such differences in treatment effects\nbetween groups while accounting for variations in other covariates. We propose\na new parameter, the balanced group average treatment effect (BGATE), which\nmeasures a GATE with a specific distribution of a priori-determined covariates.\nBy taking the difference of two BGATEs, we can analyse heterogeneity more\nmeaningfully than by comparing two GATEs. 
The estimation strategy for this\nparameter is based on double/debiased machine learning for discrete treatments\nin an unconfoundedness setting, and the estimator is shown to be\n$\\sqrt{N}$-consistent and asymptotically normal under standard conditions.\nAdding additional identifying assumptions allows specific balanced differences\nin treatment effects between groups to be interpreted causally, leading to the\ncausal balanced group average treatment effect. We explore the finite sample\nproperties in a small-scale simulation study and demonstrate the usefulness of\nthese parameters in an empirical example."}, "http://arxiv.org/abs/2401.08442": {"title": "Assessing the impact of forced and voluntary behavioral changes on economic-epidemiological co-dynamics: A comparative case study between Belgium and Sweden during the 2020 COVID-19 pandemic", "link": "http://arxiv.org/abs/2401.08442", "description": "During the COVID-19 pandemic, governments faced the challenge of managing\npopulation behavior to prevent their healthcare systems from collapsing. Sweden\nadopted a strategy centered on voluntary sanitary recommendations while Belgium\nresorted to mandatory measures. Their consequences on pandemic progression and\nassociated economic impacts remain insufficiently understood. This study\nleverages the divergent policies of Belgium and Sweden during the COVID-19\npandemic to relax the unrealistic -- but persistently used -- assumption that\nsocial contacts are not influenced by an epidemic's dynamics. We develop an\nepidemiological-economic co-simulation model where pandemic-induced behavioral\nchanges are a superposition of voluntary actions driven by fear, prosocial\nbehavior or social pressure, and compulsory compliance with government\ndirectives. Our findings emphasize the importance of early responses, which\nreduce the stringency of measures necessary to safeguard healthcare systems and\nminimize ensuing economic damage. Voluntary behavioral changes lead to a\npattern of recurring epidemics, which should be regarded as the natural\nlong-term course of pandemics. Governments should carefully consider prolonging\nlockdown longer than necessary because this leads to higher economic damage and\na potentially higher second surge when measures are released. Our model can aid\npolicymakers in the selection of an appropriate long-term strategy that\nminimizes economic damage."}, "http://arxiv.org/abs/2101.00009": {"title": "Adversarial Estimation of Riesz Representers", "link": "http://arxiv.org/abs/2101.00009", "description": "Many causal and structural parameters are linear functionals of an underlying\nregression. The Riesz representer is a key component in the asymptotic variance\nof a semiparametrically estimated linear functional. We propose an adversarial\nframework to estimate the Riesz representer using general function spaces. We\nprove a nonasymptotic mean square rate in terms of an abstract quantity called\nthe critical radius, then specialize it for neural networks, random forests,\nand reproducing kernel Hilbert spaces as leading cases. Furthermore, we use\ncritical radius theory -- in place of Donsker theory -- to prove asymptotic\nnormality without sample splitting, uncovering a ``complexity-rate robustness''\ncondition. This condition has practical consequences: inference without sample\nsplitting is possible in several machine learning settings, which may improve\nfinite sample performance compared to sample splitting. 
Our estimators achieve\nnominal coverage in highly nonlinear simulations where previous methods break\ndown. They shed new light on the heterogeneous effects of matching grants."}, "http://arxiv.org/abs/2101.00399": {"title": "The Law of Large Numbers for Large Stable Matchings", "link": "http://arxiv.org/abs/2101.00399", "description": "In many empirical studies of a large two-sided matching market (such as in a\ncollege admissions problem), the researcher performs statistical inference\nunder the assumption that they observe a random sample from a large matching\nmarket. In this paper, we consider a setting in which the researcher observes\neither all or a nontrivial fraction of outcomes from a stable matching. We\nestablish a concentration inequality for empirical matching probabilities\nassuming strong correlation among the colleges' preferences while allowing\nstudents' preferences to be fully heterogeneous. Our concentration inequality\nyields laws of large numbers for the empirical matching probabilities and other\nstatistics commonly used in empirical analyses of a large matching market. To\nillustrate the usefulness of our concentration inequality, we prove consistency\nfor estimators of conditional matching probabilities and measures of positive\nassortative matching."}, "http://arxiv.org/abs/2203.08050": {"title": "Pairwise Valid Instruments", "link": "http://arxiv.org/abs/2203.08050", "description": "Finding valid instruments is difficult. We propose Validity Set Instrumental\nVariable (VSIV) estimation, a method for estimating local average treatment\neffects (LATEs) in heterogeneous causal effect models when the instruments are\npartially invalid. We consider settings with pairwise valid instruments, that\nis, instruments that are valid for a subset of instrument value pairs. VSIV\nestimation exploits testable implications of instrument validity to remove\ninvalid pairs and provides estimates of the LATEs for all remaining pairs,\nwhich can be aggregated into a single parameter of interest using\nresearcher-specified weights. We show that the proposed VSIV estimators are\nasymptotically normal under weak conditions and remove or reduce the asymptotic\nbias relative to standard LATE estimators (that is, LATE estimators that do not\nuse testable implications to remove invalid variation). We evaluate the finite\nsample properties of VSIV estimation in application-based simulations and apply\nour method to estimate the returns to college education using parental\neducation as an instrument."}, "http://arxiv.org/abs/2212.07052": {"title": "On LASSO for High Dimensional Predictive Regression", "link": "http://arxiv.org/abs/2212.07052", "description": "This paper examines LASSO, a widely-used $L_{1}$-penalized regression method,\nin high dimensional linear predictive regressions, particularly when the number\nof potential predictors exceeds the sample size and numerous unit root\nregressors are present. The consistency of LASSO is contingent upon two key\ncomponents: the deviation bound of the cross product of the regressors and the\nerror term, and the restricted eigenvalue of the Gram matrix. We present new\nprobabilistic bounds for these components, suggesting that LASSO's rates of\nconvergence are different from those typically observed in cross-sectional\ncases. When applied to a mixture of stationary, nonstationary, and cointegrated\npredictors, LASSO maintains its asymptotic guarantee if predictors are\nscale-standardized. 
Leveraging machine learning and macroeconomic domain\nexpertise, LASSO demonstrates strong performance in forecasting the\nunemployment rate, as evidenced by its application to the FRED-MD database."}, "http://arxiv.org/abs/2303.14226": {"title": "Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions", "link": "http://arxiv.org/abs/2303.14226", "description": "Consider a setting where there are $N$ heterogeneous units and $p$\ninterventions. Our goal is to learn unit-specific potential outcomes for any\ncombination of these $p$ interventions, i.e., $N \\times 2^p$ causal parameters.\nChoosing a combination of interventions is a problem that naturally arises in a\nvariety of applications such as factorial design experiments, recommendation\nengines, combination therapies in medicine, conjoint analysis, etc. Running $N\n\\times 2^p$ experiments to estimate the various parameters is likely expensive\nand/or infeasible as $N$ and $p$ grow. Further, with observational data there\nis likely confounding, i.e., whether or not a unit is seen under a combination\nis correlated with its potential outcome under that combination. To address\nthese challenges, we propose a novel latent factor model that imposes structure\nacross units (i.e., the matrix of potential outcomes is approximately rank\n$r$), and combinations of interventions (i.e., the coefficients in the Fourier\nexpansion of the potential outcomes is approximately $s$ sparse). We establish\nidentification for all $N \\times 2^p$ parameters despite unobserved\nconfounding. We propose an estimation procedure, Synthetic Combinations, and\nestablish it is finite-sample consistent and asymptotically normal under\nprecise conditions on the observation pattern. Our results imply consistent\nestimation given $\\text{poly}(r) \\times \\left( N + s^2p\\right)$ observations,\nwhile previous methods have sample complexity scaling as $\\min(N \\times s^2p, \\\n\\ \\text{poly(r)} \\times (N + 2^p))$. We use Synthetic Combinations to propose a\ndata-efficient experimental design. Empirically, Synthetic Combinations\noutperforms competing approaches on a real-world dataset on movie\nrecommendations. Lastly, we extend our analysis to do causal inference where\nthe intervention is a permutation over $p$ items (e.g., rankings)."}, "http://arxiv.org/abs/2308.15062": {"title": "Forecasting with Feedback", "link": "http://arxiv.org/abs/2308.15062", "description": "Systematically biased forecasts are typically interpreted as evidence of\nforecasters' irrationality and/or asymmetric loss. In this paper we propose an\nalternative explanation: when forecasts inform economic policy decisions, and\nthe resulting actions affect the realization of the forecast target itself,\nforecasts may be optimally biased even under quadratic loss. The result arises\nin environments in which the forecaster is uncertain about the decision maker's\nreaction to the forecast, which is presumably the case in most applications. We\nillustrate the empirical relevance of our theory by reviewing some stylized\nproperties of Green Book inflation forecasts and relating them to the\npredictions from our model. 
Our results point out that the presence of policy\nfeedback poses a challenge to traditional tests of forecast rationality."}, "http://arxiv.org/abs/2309.05639": {"title": "Forecasted Treatment Effects", "link": "http://arxiv.org/abs/2309.05639", "description": "We consider estimation and inference of the effects of a policy in the\nabsence of a control group. We obtain unbiased estimators of individual\n(heterogeneous) treatment effects and a consistent and asymptotically normal\nestimator of the average treatment effect. Our estimator averages over unbiased\nforecasts of individual counterfactuals, based on a (short) time series of\npre-treatment data. The paper emphasizes the importance of focusing on forecast\nunbiasedness rather than accuracy when the end goal is estimation of average\ntreatment effects. We show that simple basis function regressions ensure\nforecast unbiasedness for a broad class of data-generating processes for the\ncounterfactuals, even in short panels. In contrast, model-based forecasting\nrequires stronger assumptions and is prone to misspecification and estimation\nbias. We show that our method can replicate the findings of some previous\nempirical studies, but without using a control group."}, "http://arxiv.org/abs/1812.10820": {"title": "A $t$-test for synthetic controls", "link": "http://arxiv.org/abs/1812.10820", "description": "We propose a practical and robust method for making inferences on average\ntreatment effects estimated by synthetic controls. We develop a $K$-fold\ncross-fitting procedure for bias correction. To avoid the difficult estimation\nof the long-run variance, inference is based on a self-normalized\n$t$-statistic, which has an asymptotically pivotal $t$-distribution. Our\n$t$-test is easy to implement, provably robust against misspecification, and\nvalid with stationary and non-stationary data. It demonstrates an excellent\nsmall sample performance in application-based simulations and performs well\nrelative to alternative methods. We illustrate the usefulness of the $t$-test\nby revisiting the effect of carbon taxes on emissions."}, "http://arxiv.org/abs/2108.12419": {"title": "Revisiting Event Study Designs: Robust and Efficient Estimation", "link": "http://arxiv.org/abs/2108.12419", "description": "We develop a framework for difference-in-differences designs with staggered\ntreatment adoption and heterogeneous causal effects. We show that conventional\nregression-based estimators fail to provide unbiased estimates of relevant\nestimands absent strong restrictions on treatment-effect homogeneity. We then\nderive the efficient estimator addressing this challenge, which takes an\nintuitive \"imputation\" form when treatment-effect heterogeneity is\nunrestricted. We characterize the asymptotic behavior of the estimator, propose\ntools for inference, and develop tests for identifying assumptions. Our method\napplies with time-varying controls, in triple-difference designs, and with\ncertain non-binary treatments. We show the practical relevance of our results\nin a simulation study and an application. 
Studying the consumption response to\ntax rebates in the United States, we find that the notional marginal propensity\nto consume is between 8 and 11 percent in the first quarter - about half as\nlarge as benchmark estimates used to calibrate macroeconomic models - and\npredominantly occurs in the first month after the rebate."}, "http://arxiv.org/abs/2301.06720": {"title": "Testing Firm Conduct", "link": "http://arxiv.org/abs/2301.06720", "description": "Evaluating policy in imperfectly competitive markets requires understanding\nfirm behavior. While researchers test conduct via model selection and\nassessment, we present advantages of Rivers and Vuong (2002) (RV) model\nselection under misspecification. However, degeneracy of RV invalidates\ninference. With a novel definition of weak instruments for testing, we connect\ndegeneracy to instrument strength, derive weak instrument properties of RV, and\nprovide a diagnostic for weak instruments by extending the framework of Stock\nand Yogo (2005) to model selection. We test vertical conduct (Villas-Boas,\n2007) using common instrument sets. Some are weak, providing no power. Strong\ninstruments support manufacturers setting retail prices."}, "http://arxiv.org/abs/2307.13686": {"title": "Characteristics and Predictive Modeling of Short-term Impacts of Hurricanes on the US Employment", "link": "http://arxiv.org/abs/2307.13686", "description": "This study examines the short-term employment changes in the US after\nhurricane impacts. An analysis of hurricane events during 1990-2021 suggests\nthat county-level employment changes in the initial month are small on average,\nthough large employment losses (>30%) can occur after extreme cyclones. The\noverall small changes partly result from compensation among opposite changes in\nemployment sectors, such as the construction and leisure and hospitality\nsectors. An analysis of these extreme cases highlights concentrated employment\nlosses in the service-providing industries and delayed, robust employment gains\nrelated to reconstruction activities. The overall employment shock is\nnegatively correlated with the metrics of cyclone hazards (e.g., extreme wind\nand precipitation) and geospatial details of impacts (e.g., cyclone-entity\ndistance). Additionally, non-cyclone factors such as county characteristics\nalso strongly affect short-term employment changes. The findings inform\npredictive modeling of short-term employment changes and help deliver promising\nskills for service-providing industries and high-impact cyclones. Specifically,\nthe Random Forests model, which can account for nonlinear relationships,\ngreatly outperforms the multiple linear regression model commonly used by\neconomics studies. Overall, our findings may help improve post-cyclone aid\nprograms and the modeling of hurricanes socioeconomic impacts in a changing\nclimate."}, "http://arxiv.org/abs/2308.08958": {"title": "Linear Regression with Weak Exogeneity", "link": "http://arxiv.org/abs/2308.08958", "description": "This paper studies linear time series regressions with many regressors. Weak\nexogeneity is the most used identifying assumption in time series. Weak\nexogeneity requires the structural error to have zero conditional expectation\ngiven the present and past regressor values, allowing errors to correlate with\nfuture regressor realizations. We show that weak exogeneity in time series\nregressions with many controls may produce substantial biases and even render\nthe least squares (OLS) estimator inconsistent. 
The bias arises in settings\nwith many regressors because the normalized OLS design matrix remains\nasymptotically random and correlates with the regression error when only weak\n(but not strict) exogeneity holds. This bias's magnitude increases with the\nnumber of regressors and their average autocorrelation. To address this issue,\nwe propose an innovative approach to bias correction that yields a new\nestimator with improved properties relative to OLS. We establish consistency\nand conditional asymptotic Gaussianity of this new estimator and provide a\nmethod for inference."}, "http://arxiv.org/abs/2401.09874": {"title": "A Quantile Nelson-Siegel model", "link": "http://arxiv.org/abs/2401.09874", "description": "A widespread approach to modelling the interaction between macroeconomic\nvariables and the yield curve relies on three latent factors usually\ninterpreted as the level, slope, and curvature (Diebold et al., 2006). This\napproach is inherently focused on the conditional mean of the yields and\npostulates a dynamic linear model where the latent factors smoothly change over\ntime. However, periods of deep crisis, such as the Great Recession and the\nrecent pandemic, have highlighted the importance of statistical models that\naccount for asymmetric shocks and are able to forecast the tails of a\nvariable's distribution. A new version of the dynamic three-factor model is\nproposed to address this issue based on quantile regressions. The novel\napproach leverages the potential of quantile regression to model the entire\n(conditional) distribution of the yields instead of restricting to its mean. An\napplication to US data from the 1970s shows the significant heterogeneity of\nthe interactions between financial and macroeconomic variables across different\nquantiles. Moreover, an out-of-sample forecasting exercise showcases the\nproposed method's advantages in predicting the yield distribution tails\ncompared to the standard conditional mean model. Finally, by inspecting the\nposterior distribution of the three factors during the recent major crises, new\nevidence is found that supports the greater and longer-lasting negative impact\nof the great recession on the yields compared to the COVID-19 pandemic."}, "http://arxiv.org/abs/2401.10054": {"title": "Nowcasting economic activity in European regions using a mixed-frequency dynamic factor model", "link": "http://arxiv.org/abs/2401.10054", "description": "Timely information about the state of regional economies can be essential for\nplanning, implementing and evaluating locally targeted economic policies.\nHowever, European regional accounts for output are published at an annual\nfrequency and with a two-year delay. To obtain robust and more timely measures\nin a computationally efficient manner, we propose a mixed-frequency dynamic\nfactor model that accounts for national information to produce high-frequency\nestimates of the regional gross value added (GVA). We show that our model\nproduces reliable nowcasts of GVA in 162 regions across 12 European countries."}, "http://arxiv.org/abs/2108.13707": {"title": "Wild Bootstrap for Instrumental Variables Regressions with Weak and Few Clusters", "link": "http://arxiv.org/abs/2108.13707", "description": "We study the wild bootstrap inference for instrumental variable regressions\nin the framework of a small number of large clusters in which the number of\nclusters is viewed as fixed and the number of observations for each cluster\ndiverges to infinity. 
We first show that the wild bootstrap Wald test, with or\nwithout using the cluster-robust covariance estimator, controls size\nasymptotically up to a small error as long as the parameters of endogenous\nvariables are strongly identified in at least one of the clusters. Then, we\nestablish the required number of strong clusters for the test to have power\nagainst local alternatives. We further develop a wild bootstrap Anderson-Rubin\ntest for the full-vector inference and show that it controls size\nasymptotically up to a small error even under weak or partial identification in\nall clusters. We illustrate the good finite sample performance of the new\ninference methods using simulations and provide an empirical application to a\nwell-known dataset about US local labor markets."}, "http://arxiv.org/abs/2303.13598": {"title": "Bootstrap-Assisted Inference for Generalized Grenander-type Estimators", "link": "http://arxiv.org/abs/2303.13598", "description": "Westling and Carone (2020) proposed a framework for studying the large sample\ndistributional properties of generalized Grenander-type estimators, a versatile\nclass of nonparametric estimators of monotone functions. The limiting\ndistribution of those estimators is representable as the left derivative of the\ngreatest convex minorant of a Gaussian process whose covariance kernel can be\ncomplicated and whose monomial mean can be of unknown order (when the degree of\nflatness of the function of interest is unknown). The standard nonparametric\nbootstrap is unable to consistently approximate the large sample distribution\nof the generalized Grenander-type estimators even if the monomial order of the\nmean is known, making statistical inference a challenging endeavour in\napplications. To address this inferential problem, we present a\nbootstrap-assisted inference procedure for generalized Grenander-type\nestimators. The procedure relies on a carefully crafted, yet automatic,\ntransformation of the estimator. Moreover, our proposed method can be made\n``flatness robust'' in the sense that it can be made adaptive to the (possibly\nunknown) degree of flatness of the function of interest. The method requires\nonly the consistent estimation of a single scalar quantity, for which we\npropose an automatic procedure based on numerical derivative estimation and the\ngeneralized jackknife. Under random sampling, our inference method can be\nimplemented using a computationally attractive exchangeable bootstrap\nprocedure. We illustrate our methods with examples and we also provide a small\nsimulation study. The development of formal results is made possible by some\ntechnical results that may be of independent interest."}, "http://arxiv.org/abs/2401.10261": {"title": "How industrial clusters influence the growth of the regional GDP: A spatial-approach", "link": "http://arxiv.org/abs/2401.10261", "description": "In this paper, we employ spatial econometric methods to analyze panel data\nfrom German NUTS 3 regions. Our goal is to gain a deeper understanding of the\nsignificance and interdependence of industry clusters in shaping the dynamics\nof GDP. To achieve a more nuanced spatial differentiation, we introduce\nindicator matrices for each industry sector which allows for extending the\nspatial Durbin model to a new version of it. This approach is essential due to\nboth the economic importance of these sectors and the potential issue of\nomitted variables. 
Failing to account for industry sectors can lead to omitted\nvariable bias and estimation problems. To assess the effects of the major\nindustry sectors, we incorporate eight distinct branches of industry into our\nanalysis. According to prevailing economic theory, these clusters should have a\npositive impact on the regions they are associated with. Our findings indeed\nreveal highly significant impacts, which can be either positive or negative, of\nspecific sectors on local GDP growth. Spatially, we observe that direct and\nindirect effects can exhibit opposite signs, indicative of heightened\ncompetitiveness within and between industry sectors. Therefore, we recommend\nthat industry sectors should be taken into consideration when conducting\nspatial analysis of GDP. Doing so allows for a more comprehensive understanding\nof the economic dynamics at play."}, "http://arxiv.org/abs/2111.00822": {"title": "Financial-cycle ratios and medium-term predictions of GDP: Evidence from the United States", "link": "http://arxiv.org/abs/2111.00822", "description": "Using a large quarterly macroeconomic dataset for the period 1960-2017, we\ndocument the ability of specific financial ratios from the housing market and\nfirms' aggregate balance sheets to predict GDP over medium-term horizons in the\nUnited States. A cyclically adjusted house price-to-rent ratio and the\nliabilities-to-income ratio of the non-financial non-corporate business sector\nprovide the best in-sample and out-of-sample predictions of GDP growth over\nhorizons of one to five years, based on a wide variety of rankings. Small\nforecasting models that include these indicators outperform popular\nhigh-dimensional models and forecast combinations. The predictive power of the\ntwo ratios appears strong during both recessions and expansions, stable over\ntime, and consistent with well-established macro-finance theory."}, "http://arxiv.org/abs/2401.11016": {"title": "Bounding Consideration Probabilities in Consider-Then-Choose Ranking Models", "link": "http://arxiv.org/abs/2401.11016", "description": "A common theory of choice posits that individuals make choices in a two-step\nprocess, first selecting some subset of the alternatives to consider before\nmaking a selection from the resulting consideration set. However, inferring\nunobserved consideration sets (or item consideration probabilities) in this\n\"consider then choose\" setting poses significant challenges, because even\nsimple models of consideration with strong independence assumptions are not\nidentifiable, even if item utilities are known. We consider a natural extension\nof consider-then-choose models to a top-$k$ ranking setting, where we assume\nrankings are constructed according to a Plackett-Luce model after sampling a\nconsideration set. While item consideration probabilities remain non-identified\nin this setting, we prove that knowledge of item utilities allows us to infer\nbounds on the relative sizes of consideration probabilities. Additionally,\ngiven a condition on the expected consideration set size, we derive absolute\nupper and lower bounds on item consideration probabilities. We also provide\nalgorithms to tighten those bounds on consideration probabilities by\npropagating inferred constraints. Thus, we show that we can learn useful\ninformation about consideration probabilities despite not being able to\nidentify them precisely. 
We demonstrate our methods on a ranking dataset from a\npsychology experiment with two different ranking tasks (one with fixed\nconsideration sets and one with unknown consideration sets). This combination\nof data allows us to estimate utilities and then learn about unknown\nconsideration probabilities using our bounds."}, "http://arxiv.org/abs/2401.11046": {"title": "Information Based Inference in Models with Set-Valued Predictions and Misspecification", "link": "http://arxiv.org/abs/2401.11046", "description": "This paper proposes an information-based inference method for partially\nidentified parameters in incomplete models that is valid both when the model is\ncorrectly specified and when it is misspecified. Key features of the method\nare: (i) it is based on minimizing a suitably defined Kullback-Leibler\ninformation criterion that accounts for incompleteness of the model and\ndelivers a non-empty pseudo-true set; (ii) it is computationally tractable;\n(iii) its implementation is the same for both correctly and incorrectly\nspecified models; (iv) it exploits all information provided by variation in\ndiscrete and continuous covariates; (v) it relies on Rao's score statistic,\nwhich is shown to be asymptotically pivotal."}, "http://arxiv.org/abs/2401.11229": {"title": "Estimation with Pairwise Observations", "link": "http://arxiv.org/abs/2401.11229", "description": "The paper introduces a new estimation method for the standard linear\nregression model. The procedure is not driven by the optimisation of any\nobjective function; rather, it is a simple weighted average of slopes from\nobservation pairs. The paper shows that such an estimator is consistent for\ncarefully selected weights. Other properties, such as asymptotic distributions,\nhave also been derived to facilitate valid statistical inference. Unlike\ntraditional methods, such as Least Squares and Maximum Likelihood, among\nothers, the estimated residual of this estimator is not by construction\northogonal to the explanatory variables of the model. This property allows a\nwide range of practical applications, such as the testing of endogeneity,\ni.e., the correlation between the explanatory variables and the disturbance\nterms, and potentially several others."}, "http://arxiv.org/abs/2401.11422": {"title": "Local Identification in the Instrumental Variable Multivariate Quantile Regression Model", "link": "http://arxiv.org/abs/2401.11422", "description": "The instrumental variable (IV) quantile regression model introduced by\nChernozhukov and Hansen (2005) is a useful tool for analyzing quantile\ntreatment effects in the presence of endogeneity, but when outcome variables\nare multidimensional, it is silent on the joint distribution of different\ndimensions of each variable. To overcome this limitation, we propose an IV\nmodel built on the optimal-transport-based multivariate quantile that takes\ninto account the correlation between the entries of the outcome variable. We\nthen provide a local identification result for the model. Surprisingly, we find\nthat the support size of the IV required for the identification is independent\nof the dimension of the outcome vector, as long as the IV is sufficiently\ninformative. 
Our result follows from a general identification theorem that we\nestablish, which has independent theoretical significance."}, "http://arxiv.org/abs/2401.12050": {"title": "A Bracketing Relationship for Long-Term Policy Evaluation with Combined Experimental and Observational Data", "link": "http://arxiv.org/abs/2401.12050", "description": "Combining short-term experimental data with observational data enables\ncredible long-term policy evaluation. The literature offers two key but\nnon-nested assumptions, namely the latent unconfoundedness (LU; Athey et al.,\n2020) and equi-confounding bias (ECB; Ghassami et al., 2022) conditions, to\ncorrect observational selection. Committing to the wrong assumption leads to\nbiased estimation. To mitigate such risks, we provide a novel bracketing\nrelationship (cf. Angrist and Pischke, 2009) repurposed for the setting with\ndata combination: the LU-based estimand and the ECB-based estimand serve as the\nlower and upper bounds, respectively, with the true causal effect lying in\nbetween if either assumption holds. For researchers further seeking point\nestimates, our Lalonde-style exercise suggests the conservatively more robust\nLU-based lower bounds align closely with the hold-out experimental estimates\nfor educational policy evaluation. We investigate the economic substantives of\nthese findings through the lens of a nonparametric class of selection\nmechanisms and sensitivity analysis. We uncover as key the sub-martingale\nproperty and sufficient-statistics role (Chetty, 2009) of the potential\noutcomes of student test scores (Chetty et al., 2011, 2014)."}, "http://arxiv.org/abs/2401.12084": {"title": "Temporal Aggregation for the Synthetic Control Method", "link": "http://arxiv.org/abs/2401.12084", "description": "The synthetic control method (SCM) is a popular approach for estimating the\nimpact of a treatment on a single unit with panel data. Two challenges arise\nwith higher frequency data (e.g., monthly versus yearly): (1) achieving\nexcellent pre-treatment fit is typically more challenging; and (2) overfitting\nto noise is more likely. Aggregating data over time can mitigate these problems\nbut can also destroy important signal. In this paper, we bound the bias for SCM\nwith disaggregated and aggregated outcomes and give conditions under which\naggregating tightens the bounds. We then propose finding weights that balance\nboth disaggregated and aggregated series."}, "http://arxiv.org/abs/1807.11835": {"title": "The econometrics of happiness: Are we underestimating the returns to education and income?", "link": "http://arxiv.org/abs/1807.11835", "description": "This paper describes a fundamental and empirically conspicuous problem\ninherent to surveys of human feelings and opinions in which subjective\nresponses are elicited on numerical scales. The paper also proposes a solution.\nThe problem is a tendency by some individuals -- particularly those with low\nlevels of education -- to simplify the response scale by considering only a\nsubset of possible responses such as the lowest, middle, and highest. In\nprinciple, this ``focal value rounding'' (FVR) behavior renders invalid even\nthe weak ordinality assumption often used in analysis of such data. 
With\n``happiness'' or life satisfaction data as an example, descriptive methods and\na multinomial logit model both show that the effect is large and that education\nand, to a lesser extent, income level are predictors of FVR behavior.\n\nA model simultaneously accounting for the underlying wellbeing and for the\ndegree of FVR is able to estimate the latent subjective wellbeing, i.e.~the\ncounterfactual full-scale responses for all respondents, the biases associated\nwith traditional estimates, and the fraction of respondents who exhibit FVR.\nAddressing this problem helps to resolve a longstanding puzzle in the life\nsatisfaction literature, namely that the returns to education, after adjusting\nfor income, appear to be small or negative. Due to the same econometric\nproblem, the marginal utility of income in a subjective wellbeing sense has\nbeen consistently underestimated."}, "http://arxiv.org/abs/2209.04329": {"title": "Heterogeneous Treatment Effect Bounds under Sample Selection with an Application to the Effects of Social Media on Political Polarization", "link": "http://arxiv.org/abs/2209.04329", "description": "We propose a method for estimation and inference for bounds for heterogeneous\ncausal effect parameters in general sample selection models where the treatment\ncan affect whether an outcome is observed and no exclusion restrictions are\navailable. The method provides conditional effect bounds as functions of policy\nrelevant pre-treatment variables. It allows for conducting valid statistical\ninference on the unidentified conditional effects. We use a flexible\ndebiased/double machine learning approach that can accommodate non-linear\nfunctional forms and high-dimensional confounders. Easily verifiable high-level\nconditions for estimation, misspecification robust confidence intervals, and\nuniform confidence bands are provided as well. We re-analyze data from a large\nscale field experiment on Facebook on counter-attitudinal news subscription\nwith attrition. Our method yields substantially tighter effect bounds compared\nto conventional methods and suggests depolarization effects for younger users."}, "http://arxiv.org/abs/2303.07287": {"title": "Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm", "link": "http://arxiv.org/abs/2303.07287", "description": "In non-asymptotic learning, variance-type parameters of sub-Gaussian\ndistributions are of paramount importance. However, directly estimating these\nparameters using the empirical moment generating function (MGF) is infeasible.\nTo address this, we suggest using the sub-Gaussian intrinsic moment norm\n[Buldygin and Kozachenko (2000), Theorem 1.3] achieved by maximizing a sequence\nof normalized moments. Significantly, the suggested norm can not only\nreconstruct the exponential moment bounds of MGFs but also provide tighter\nsub-Gaussian concentration inequalities. In practice, we provide an intuitive\nmethod for assessing whether data with a finite sample size is sub-Gaussian,\nutilizing the sub-Gaussian plot. The intrinsic moment norm can be robustly\nestimated via a simple plug-in approach. Our theoretical findings are also\napplicable to reinforcement learning, including the multi-armed bandit\nscenario."}, "http://arxiv.org/abs/2307.10067": {"title": "Weak Factors are Everywhere", "link": "http://arxiv.org/abs/2307.10067", "description": "Factor Sequences are stochastic double sequences indexed in time and\ncross-section which have a so called factor structure. 
The term was coined by\nForni and Lippi (2001) who introduced dynamic factor sequences. We introduce\nthe distinction between dynamic- and static factor sequences which has been\noverlooked in the literature. Static factor sequences, where the static factors\nare modeled by a dynamic system, are the most common model of macro-econometric\nfactor analysis, building on Chamberlain and Rothschild (1983a); Stock and\nWatson (2002a); Bai and Ng (2002).\n\nWe show that there exist two types of common components - a dynamic and a\nstatic common component. The difference between those consists of the weak\ncommon component, which is spanned by (potentially infinitely many) weak\nfactors. We also show that the dynamic common component of a dynamic factor\nsequence is causally subordinated to the output under suitable conditions. As a\nconsequence only the dynamic common component can be interpreted as the\nprojection on the infinite past of the common innovations of the economy, i.e.\nthe part which is dynamically common. On the other hand the static common\ncomponent captures only the contemporaneous co-movement."}, "http://arxiv.org/abs/2401.12309": {"title": "Interpreting Event-Studies from Recent Difference-in-Differences Methods", "link": "http://arxiv.org/abs/2401.12309", "description": "This note discusses the interpretation of event-study plots produced by\nrecent difference-in-differences methods. I show that even when specialized to\nthe case of non-staggered treatment timing, the default plots produced by\nsoftware for three of the most popular recent methods (de Chaisemartin and\nD'Haultfoeuille, 2020; Callaway and SantAnna, 2021; Borusyak, Jaravel and\nSpiess, 2024) do not match those of traditional two-way fixed effects (TWFE)\nevent-studies: the new methods may show a kink or jump at the time of treatment\neven when the TWFE event-study shows a straight line. This difference stems\nfrom the fact that the new methods construct the pre-treatment coefficients\nasymmetrically from the post-treatment coefficients. As a result, visual\nheuristics for analyzing TWFE event-study plots should not be immediately\napplied to those from these methods. I conclude with practical recommendations\nfor constructing and interpreting event-study plots when using these methods."}, "http://arxiv.org/abs/2104.00655": {"title": "Local Projections vs", "link": "http://arxiv.org/abs/2104.00655", "description": "We conduct a simulation study of Local Projection (LP) and Vector\nAutoregression (VAR) estimators of structural impulse responses across\nthousands of data generating processes, designed to mimic the properties of the\nuniverse of U.S. macroeconomic data. Our analysis considers various\nidentification schemes and several variants of LP and VAR estimators, employing\nbias correction, shrinkage, or model averaging. A clear bias-variance trade-off\nemerges: LP estimators have lower bias than VAR estimators, but they also have\nsubstantially higher variance at intermediate and long horizons. Bias-corrected\nLP is the preferred method if and only if the researcher overwhelmingly\nprioritizes bias. 
For researchers who also care about precision, VAR methods\nare the most attractive -- Bayesian VARs at short and long horizons, and\nleast-squares VARs at intermediate and long horizons."}, "http://arxiv.org/abs/2210.13843": {"title": "GLS under Monotone Heteroskedasticity", "link": "http://arxiv.org/abs/2210.13843", "description": "The generalized least square (GLS) is one of the most basic tools in\nregression analyses. A major issue in implementing the GLS is estimation of the\nconditional variance function of the error term, which typically requires a\nrestrictive functional form assumption for parametric estimation or smoothing\nparameters for nonparametric estimation. In this paper, we propose an\nalternative approach to estimate the conditional variance function under\nnonparametric monotonicity constraints by utilizing the isotonic regression\nmethod. Our GLS estimator is shown to be asymptotically equivalent to the\ninfeasible GLS estimator with knowledge of the conditional error variance, and\ninvolves only some tuning to trim boundary observations, not only for point\nestimation but also for interval estimation or hypothesis testing. Our analysis\nextends the scope of the isotonic regression method by showing that the\nisotonic estimates, possibly with generated variables, can be employed as first\nstage estimates to be plugged in for semiparametric objects. Simulation studies\nillustrate excellent finite sample performances of the proposed method. As an\nempirical example, we revisit Acemoglu and Restrepo's (2017) study on the\nrelationship between an aging population and economic growth to illustrate how\nour GLS estimator effectively reduces estimation errors."}, "http://arxiv.org/abs/2305.04137": {"title": "Volatility of Volatility and Leverage Effect from Options", "link": "http://arxiv.org/abs/2305.04137", "description": "We propose model-free (nonparametric) estimators of the volatility of\nvolatility and leverage effect using high-frequency observations of short-dated\noptions. At each point in time, we integrate available options into estimates\nof the conditional characteristic function of the price increment until the\noptions' expiration and we use these estimates to recover spot volatility. Our\nvolatility of volatility estimator is then formed from the sample variance and\nfirst-order autocovariance of the spot volatility increments, with the latter\ncorrecting for the bias in the former due to option observation errors. The\nleverage effect estimator is the sample covariance between price increments and\nthe estimated volatility increments. The rate of convergence of the estimators\ndepends on the diffusive innovations in the latent volatility process as well\nas on the observation error in the options with strikes in the vicinity of the\ncurrent spot price. Feasible inference is developed in a way that does not\nrequire prior knowledge of the source of estimation error that is\nasymptotically dominating."}, "http://arxiv.org/abs/2401.13057": {"title": "Inference under partial identification with minimax test statistics", "link": "http://arxiv.org/abs/2401.13057", "description": "We provide a means of computing and estimating the asymptotic distributions\nof test statistics based on an outer minimization of an inner maximization.\nSuch test statistics, which arise frequently in moment models, are of special\ninterest in providing hypothesis tests under partial identification. 
Under\ngeneral conditions, we provide an asymptotic characterization of such test\nstatistics using the minimax theorem, and a means of computing critical values\nusing the bootstrap. Making some light regularity assumptions, our results\nprovide a basis for several asymptotic approximations that have been proposed\nfor partially identified hypothesis tests, and extend them by mitigating their\ndependence on local linear approximations of the parameter space. These\nasymptotic results are generally simple to state and straightforward to compute\n(e.g. adversarially)."}, "http://arxiv.org/abs/2401.13179": {"title": "Realized Stochastic Volatility Model with Skew-t Distributions for Improved Volatility and Quantile Forecasting", "link": "http://arxiv.org/abs/2401.13179", "description": "Forecasting volatility and quantiles of financial returns is essential for\naccurately measuring financial tail risks, such as value-at-risk and expected\nshortfall. The critical elements in these forecasts involve understanding the\ndistribution of financial returns and accurately estimating volatility. This\npaper introduces an advancement to the traditional stochastic volatility model,\ntermed the realized stochastic volatility model, which integrates realized\nvolatility as a precise estimator of volatility. To capture the well-known\ncharacteristics of return distribution, namely skewness and heavy tails, we\nincorporate three types of skew-t distributions. Among these, two distributions\ninclude the skew-normal feature, offering enhanced flexibility in modeling the\nreturn distribution. We employ a Bayesian estimation approach using the Markov\nchain Monte Carlo method and apply it to major stock indices. Our empirical\nanalysis, utilizing data from US and Japanese stock indices, indicates that the\ninclusion of both skewness and heavy tails in daily returns significantly\nimproves the accuracy of volatility and quantile forecasts."}, "http://arxiv.org/abs/2401.13370": {"title": "New accessibility measures based on unconventional big data sources", "link": "http://arxiv.org/abs/2401.13370", "description": "In health econometric studies we are often interested in quantifying aspects\nrelated to the accessibility to medical infrastructures. The increasing\navailability of data automatically collected through unconventional sources\n(such as webscraping, crowdsourcing or internet of things) recently opened\npreviously inconceivable opportunities to researchers interested in measuring\naccessibility and in using it as a tool for real-time monitoring, surveillance\nand health policy definition. This paper contributes to this strand of\nliterature by proposing new accessibility measures that can be continuously fed\nby automatic data collection. We present new measures of accessibility and we\nillustrate their use to study the territorial impact of supply-side shocks to\nhealth facilities. 
We also illustrate the potential of our proposal with a case\nstudy based on a huge set of data (related to the Emergency Departments in\nMilan, Italy) that have been webscraped for the purpose of this paper every 5\nminutes from November 2021 to March 2022, amounting to approximately 5 million\nobservations."}, "http://arxiv.org/abs/2401.13665": {"title": "Entrywise Inference for Causal Panel Data: A Simple and Instance-Optimal Approach", "link": "http://arxiv.org/abs/2401.13665", "description": "In causal inference with panel data under staggered adoption, the goal is to\nestimate and derive confidence intervals for potential outcomes and treatment\neffects. We propose a computationally efficient procedure, involving only\nsimple matrix algebra and singular value decomposition. We derive\nnon-asymptotic bounds on the entrywise error, establishing its proximity to a\nsuitably scaled Gaussian variable. Despite its simplicity, our procedure turns\nout to be instance-optimal, in that our theoretical scaling matches a local\ninstance-wise lower bound derived via a Bayesian Cram\\'{e}r-Rao argument. Using\nour insights, we develop a data-driven procedure for constructing entrywise\nconfidence intervals with pre-specified coverage guarantees. Our analysis is\nbased on a general inferential toolbox for the SVD algorithm applied to the\nmatrix denoising model, which might be of independent interest."}, "http://arxiv.org/abs/2307.01033": {"title": "Expected Shortfall LASSO", "link": "http://arxiv.org/abs/2307.01033", "description": "We propose an $\\ell_1$-penalized estimator for high-dimensional models of\nExpected Shortfall (ES). The estimator is obtained as the solution to a\nleast-squares problem for an auxiliary dependent variable, which is defined as\na transformation of the dependent variable and a pre-estimated tail quantile.\nLeveraging a sparsity condition, we derive a nonasymptotic bound on the\nprediction and estimator errors of the ES estimator, accounting for the\nestimation error in the dependent variable, and provide conditions under which\nthe estimator is consistent. Our estimator is applicable to heavy-tailed\ntime-series data and we find that the number of parameters in the model may\ngrow with the sample size at a rate that depends on the dependence and\nheavy-tailedness in the data. In an empirical application, we consider the\nsystemic risk measure CoES and use a set of regressors that consists of\nnonlinear transformations of a set of state variables. We find that the\nnonlinear model outperforms an unpenalized and untransformed benchmark\nconsiderably."}, "http://arxiv.org/abs/2401.14395": {"title": "Identification of Nonseparable Models with Endogenous Control Variables", "link": "http://arxiv.org/abs/2401.14395", "description": "We study identification of the treatment effects in a class of nonseparable\nmodels in the presence of potentially endogenous control variables. 
We show\nthat, provided the treatment variable and the controls are measurably separated,\nthe usual conditional independence condition or the availability of an excluded\ninstrument suffices for identification."}, "http://arxiv.org/abs/2105.08766": {"title": "Trading-off Bias and Variance in Stratified Experiments and in Matching Studies, Under a Boundedness Condition on the Magnitude of the Treatment Effect", "link": "http://arxiv.org/abs/2105.08766", "description": "I consider estimation of the average treatment effect (ATE), in a population\ncomposed of $S$ groups or units, when one has unbiased estimators of each\ngroup's conditional average treatment effect (CATE). These conditions are met\nin stratified experiments and in matching studies. I assume that each CATE is\nbounded in absolute value by $B$ standard deviations of the outcome, for some\nknown $B$. This restriction may be appealing: outcomes are often standardized\nin applied work, so researchers can use available literature to determine a\nplausible value for $B$. I derive, across all linear combinations of the CATEs'\nestimators, the minimax estimator of the ATE. In two stratified experiments, my\nestimator has half the worst-case mean-squared-error of the commonly used\nstrata-fixed effects estimator. In a matching study with limited overlap, my\nestimator achieves 56\\% of the precision gains of a commonly used trimming\nestimator, and has an 11 times smaller worst-case mean-squared-error."}, "http://arxiv.org/abs/2308.09535": {"title": "Weak Identification with Many Instruments", "link": "http://arxiv.org/abs/2308.09535", "description": "Linear instrumental variable regressions are widely used to estimate causal\neffects. Many instruments arise from the use of ``technical'' instruments and\nmore recently from the empirical strategy of ``judge design''. This paper\nsurveys and summarizes ideas from recent literature on estimation and\nstatistical inference with many instruments for a single endogenous regressor.\nWe discuss how to assess the strength of the instruments and how to conduct\nweak identification-robust inference under heteroskedasticity. We establish new\nresults for a jack-knifed version of the Lagrange Multiplier (LM) test\nstatistic. Furthermore, we extend the weak-identification-robust tests to\nsettings with both many exogenous regressors and many instruments. We propose a\ntest that properly partials out many exogenous regressors while preserving the\nre-centering property of the jack-knife. The proposed tests have correct size\nand good power properties."}, "http://arxiv.org/abs/2401.14545": {"title": "Structural Periodic Vector Autoregressions", "link": "http://arxiv.org/abs/2401.14545", "description": "While seasonality inherent to raw macroeconomic data is commonly removed by\nseasonal adjustment techniques before it is used for structural inference, this\napproach might distort valuable information contained in the data. As an\nalternative method to commonly used structural vector autoregressions (SVAR)\nfor seasonally adjusted macroeconomic data, this paper offers an approach in\nwhich the periodicity of not seasonally adjusted raw data is modeled directly\nby structural periodic vector autoregressions (SPVAR) that are based on\nperiodic vector autoregressions (PVAR) as the reduced form model. In comparison\nto a VAR, the PVAR allows not only for periodically time-varying\nintercepts but also for periodic autoregressive parameters and innovation\nvariances. 
As this larger flexibility also leads to an increased\nnumber of parameters, we propose linearly constrained estimation techniques.\nOverall, SPVARs allow one to capture seasonal effects and enable a direct and more\nrefined analysis of seasonal patterns in macroeconomic data, which can provide\nuseful insights into their dynamics. Moreover, based on such SPVARs, we propose\na general concept for structural impulse response analyses that takes seasonal\npatterns directly into account. We provide asymptotic theory for estimators of\nperiodic reduced form parameters and structural impulse responses under\nflexible linear restrictions. Further, for the construction of confidence\nintervals, we propose residual-based (seasonal) bootstrap methods that allow\nfor general forms of seasonality in the data and prove their bootstrap\nconsistency. A real data application on industrial production, inflation and\nthe federal funds rate is presented, showing that useful information about the data\nstructure can be lost when using common seasonal adjustment methods."}, "http://arxiv.org/abs/2401.14582": {"title": "High-dimensional forecasting with known knowns and known unknowns", "link": "http://arxiv.org/abs/2401.14582", "description": "Forecasts play a central role in decision making under uncertainty. After a\nbrief review of the general issues, this paper considers ways of using\nhigh-dimensional data in forecasting. We consider selecting variables from a\nknown active set, known knowns, using Lasso and OCMT, and approximating\nunobserved latent factors, known unknowns, by various means. This combines both\nsparse and dense approaches. We demonstrate the various issues involved in\nvariable selection in a high-dimensional setting with an application to\nforecasting UK inflation at different horizons over the period 2020q1-2023q1.\nThis application shows both the power of parsimonious models and the importance\nof allowing for global variables."}, "http://arxiv.org/abs/2401.15205": {"title": "csranks: An R Package for Estimation and Inference Involving Ranks", "link": "http://arxiv.org/abs/2401.15205", "description": "This article introduces the R package csranks for estimation and inference\ninvolving ranks. First, we review methods for the construction of confidence\nsets for ranks, namely marginal and simultaneous confidence sets as well as\nconfidence sets for the identities of the tau-best. Second, we review methods\nfor estimation and inference in regressions involving ranks. Third, we describe\nthe implementation of these methods in csranks and illustrate their usefulness\nin two examples: one about the quantification of uncertainty in the PISA\nranking of countries and one about the measurement of intergenerational\nmobility using rank-rank regressions."}, "http://arxiv.org/abs/2401.15253": {"title": "Testing the Exogeneity of Instrumental Variables and Regressors in Linear Regression Models Using Copulas", "link": "http://arxiv.org/abs/2401.15253", "description": "We provide a copula-based approach to test the exogeneity of instrumental\nvariables in linear regression models. We show that the exogeneity of\ninstrumental variables is equivalent to the exogeneity of their standard normal\ntransformations with the same CDF value. Then, we establish a Wald test for the\nexogeneity of the instrumental variables. We demonstrate the performance of our\ntest using simulation studies. 
Our simulations show that if the instruments are\nactually endogenous, our test rejects the exogeneity hypothesis approximately\n93% of the time at the 5% significance level. Conversely, when instruments are\ntruly exogenous, it dismisses the exogeneity assumption less than 30% of the\ntime on average for data with 200 observations and less than 2% of the time for\ndata with 1,000 observations. Our results demonstrate our test's effectiveness,\noffering significant value to applied econometricians."}, "http://arxiv.org/abs/2401.16275": {"title": "Graph Neural Networks: Theory for Estimation with Application on Network Heterogeneity", "link": "http://arxiv.org/abs/2401.16275", "description": "This paper presents a novel application of graph neural networks for modeling\nand estimating network heterogeneity. Network heterogeneity is characterized by\nvariations in a unit's decisions or outcomes that depend not only on its own\nattributes but also on the conditions of its surrounding neighborhood. We\ndelineate the convergence rate of the graph neural network estimator, as well\nas its applicability in semiparametric causal inference with heterogeneous\ntreatment effects. The finite-sample performance of our estimator is evaluated\nthrough Monte Carlo simulations. In an empirical setting related to\nmicrofinance program participation, we apply the new estimator to examine the\naverage treatment effects and outcomes of counterfactual policies, and to\npropose an enhanced strategy for selecting the initial recipients of program\ninformation in social networks."}, "http://arxiv.org/abs/2103.01280": {"title": "Dynamic covariate balancing: estimating treatment effects over time with potential local projections", "link": "http://arxiv.org/abs/2103.01280", "description": "This paper studies the estimation and inference of treatment histories in\npanel data settings when treatments change dynamically over time.\n\nWe propose a method that allows for (i) treatments to be assigned dynamically\nover time based on high-dimensional covariates, past outcomes and treatments;\n(ii) outcomes and time-varying covariates to depend on treatment trajectories;\n(iii) heterogeneity of treatment effects.\n\nOur approach recursively projects potential outcomes' expectations on past\nhistories. It then controls the bias by balancing dynamically observable\ncharacteristics. We study the asymptotic and numerical properties of the\nestimator and illustrate the benefits of the procedure in an empirical\napplication."}, "http://arxiv.org/abs/2211.00363": {"title": "Reservoir Computing for Macroeconomic Forecasting with Mixed Frequency Data", "link": "http://arxiv.org/abs/2211.00363", "description": "Macroeconomic forecasting has recently started embracing techniques that can\ndeal with large-scale datasets and series with unequal release periods.\nMIxed-DAta Sampling (MIDAS) and Dynamic Factor Models (DFM) are the two main\nstate-of-the-art approaches that allow modeling series with non-homogeneous\nfrequencies. We introduce a new framework called the Multi-Frequency Echo State\nNetwork (MFESN) based on a relatively novel machine learning paradigm called\nreservoir computing. Echo State Networks (ESN) are recurrent neural networks\nformulated as nonlinear state-space systems with random state coefficients\nwhere only the observation map is subject to estimation. 
MFESNs are\nconsiderably more efficient than DFMs and allow for incorporating many series,\nas opposed to MIDAS models, which are prone to the curse of dimensionality. All\nmethods are compared in extensive multistep forecasting exercises targeting US\nGDP growth. We find that our MFESN models achieve superior or comparable\nperformance over MIDAS and DFMs at a much lower computational cost."}, "http://arxiv.org/abs/2306.07619": {"title": "Kernel Choice Matters for Boundary Inference Using Local Polynomial Density: With Application to Manipulation Testing", "link": "http://arxiv.org/abs/2306.07619", "description": "The local polynomial density (LPD) estimator has been a useful tool for\ninference concerning boundary points of density functions. While it is commonly\nbelieved that kernel selection is not crucial for the performance of\nkernel-based estimators, this paper argues that this does not hold true for LPD\nestimators at boundary points. We find that the commonly used kernels with\ncompact support lead to larger asymptotic and finite-sample variances.\nFurthermore, we present theoretical and numerical evidence showing that such\nunfavorable variance properties negatively affect the performance of\nmanipulation testing in regression discontinuity designs, which typically\nsuffer from low power. Notably, we demonstrate that these issues of increased\nvariance and reduced power can be significantly improved just by using a kernel\nfunction with unbounded support. We recommend the use of the spline-type kernel\n(the Laplace density) and illustrate its superior performance."}, "http://arxiv.org/abs/2309.14630": {"title": "Free Discontinuity Regression: With an Application to the Economic Effects of Internet Shutdowns", "link": "http://arxiv.org/abs/2309.14630", "description": "Discontinuities in regression functions can reveal important insights. In\nmany contexts, like geographic settings, such discontinuities are multivariate\nand unknown a priori. We propose a non-parametric regression method that\nestimates the location and size of discontinuities by segmenting the regression\nsurface. This estimator is based on a convex relaxation of the Mumford-Shah\nfunctional, for which we establish identification and convergence. We use it to\nshow that an internet shutdown in India resulted in a reduction of economic\nactivity by 25--35%, greatly surpassing previous estimates and shedding new\nlight on the true cost of such shutdowns for digital economies globally."}, "http://arxiv.org/abs/2401.16844": {"title": "Congestion Pricing for Efficiency and Equity: Theory and Applications to the San Francisco Bay Area", "link": "http://arxiv.org/abs/2401.16844", "description": "Congestion pricing, while adopted by many cities to alleviate traffic\ncongestion, raises concerns about widening socioeconomic disparities due to its\ndisproportionate impact on low-income travelers. In this study, we address this\nconcern by proposing a new class of congestion pricing schemes that not only\nminimize congestion levels but also incorporate an equity objective to reduce\ncost disparities among travelers with different willingness-to-pay. Our\nanalysis builds on a congestion game model with heterogeneous traveler\npopulations. We present four pricing schemes that account for practical\nconsiderations, such as the ability to charge differentiated tolls to various\ntraveler populations and the option to toll all or only a subset of edges in\nthe network. 
We evaluate our pricing schemes in the calibrated freeway network\nof the San Francisco Bay Area. We demonstrate that the proposed congestion\npricing schemes improve both efficiency (in terms of reduced average travel\ntime) and equity (the disparities of travel costs experienced by different\npopulations) compared to the current pricing scheme. Moreover, our pricing\nschemes also generate a total revenue comparable to the current pricing scheme.\nOur results further show that pricing schemes charging differentiated prices to\ntraveler populations with varying willingness-to-pay lead to a more equitable\ndistribution of travel costs compared to those that charge a homogeneous price\nto all."}, "http://arxiv.org/abs/2401.17137": {"title": "Partial Identification of Binary Choice Models with Misreported Outcomes", "link": "http://arxiv.org/abs/2401.17137", "description": "This paper provides partial identification of various binary choice models\nwith misreported dependent variables. We propose two distinct approaches by\nexploiting different instrumental variables respectively. In the first\napproach, the instrument is assumed to only affect the true dependent variable\nbut not misreporting probabilities. The second approach uses an instrument that\ninfluences misreporting probabilities monotonically while having no effect on\nthe true dependent variable. Moreover, we derive identification results under\nadditional restrictions on misreporting, including bounded/monotone\nmisreporting probabilities. We use simulations to demonstrate the robust\nperformance of our approaches, and apply the method to study educational\nattainment."}, "http://arxiv.org/abs/2209.00197": {"title": "Switchback Experiments under Geometric Mixing", "link": "http://arxiv.org/abs/2209.00197", "description": "The switchback is an experimental design that measures treatment effects by\nrepeatedly turning an intervention on and off for a whole system. Switchback\nexperiments are a robust way to overcome cross-unit spillover effects; however,\nthey are vulnerable to bias from temporal carryovers. In this paper, we\nconsider properties of switchback experiments in Markovian systems that mix at\na geometric rate. We find that, in this setting, standard switchback designs\nsuffer considerably from carryover bias: Their estimation error decays as\n$T^{-1/3}$ in terms of the experiment horizon $T$, whereas in the absence of\ncarryovers a faster rate of $T^{-1/2}$ would have been possible. We also show,\nhowever, that judicious use of burn-in periods can considerably improve the\nsituation, and enables errors that decay as $\\log(T)^{1/2}T^{-1/2}$. Our formal\nresults are mirrored in an empirical evaluation."}, "https://arxiv.org/abs/2311.15878": {"title": "Policy Learning with Distributional Welfare", "link": "https://arxiv.org/abs/2311.15878", "description": "In this paper, we explore optimal treatment allocation policies that target distributional welfare. Most literature on treatment choice has considered utilitarian welfare based on the conditional average treatment effect (ATE). While average welfare is intuitive, it may yield undesirable allocations especially when individuals are heterogeneous (e.g., with outliers) - the very reason individualized treatments were introduced in the first place. 
This observation motivates us to propose an optimal policy that allocates the treatment based on the conditional quantile of individual treatment effects (QoTE). Depending on the choice of the quantile probability, this criterion can accommodate a policymaker who is either prudent or negligent. The challenge of identifying the QoTE lies in its requirement for knowledge of the joint distribution of the counterfactual outcomes, which is generally hard to recover even with experimental data. Therefore, we introduce minimax policies that are robust to model uncertainty. A range of identifying assumptions can be used to yield more informative policies. For both stochastic and deterministic policies, we establish the asymptotic bound on the regret of implementing the proposed policies. In simulations and two empirical applications, we compare optimal decisions based on the QoTE with decisions based on other criteria. The framework can be generalized to any setting where welfare is defined as a functional of the joint distribution of the potential outcomes."}, "https://arxiv.org/abs/2401.17595": {"title": "Marginal treatment effects in the absence of instrumental variables", "link": "https://arxiv.org/abs/2401.17595", "description": "We propose a method for defining, identifying, and estimating the marginal treatment effect (MTE) without imposing the instrumental variable (IV) assumptions of independence, exclusion, and separability (or monotonicity). Under a new definition of the MTE based on reduced-form treatment error that is statistically independent of the covariates, we find that the relationship between the MTE and standard treatment parameters holds in the absence of IVs. We provide a set of sufficient conditions ensuring the identification of the defined MTE in an environment of essential heterogeneity. The key conditions include a linear restriction on potential outcome regression functions, a nonlinear restriction on the propensity score, and a conditional mean independence restriction that will lead to additive separability. We prove this identification using the notion of semiparametric identification based on functional form. We suggest consistent semiparametric estimation procedures, and provide an empirical application for the Head Start program to illustrate the usefulness of the proposed method in analyzing heterogenous causal effects when IVs are elusive."}, "https://arxiv.org/abs/2401.17909": {"title": "Regularizing Discrimination in Optimal Policy Learning with Distributional Targets", "link": "https://arxiv.org/abs/2401.17909", "description": "A decision maker typically (i) incorporates training data to learn about the relative effectiveness of the treatments, and (ii) chooses an implementation mechanism that implies an \"optimal\" predicted outcome distribution according to some target functional. Nevertheless, a discrimination-aware decision maker may not be satisfied achieving said optimality at the cost of heavily discriminating against subgroups of the population, in the sense that the outcome distribution in a subgroup deviates strongly from the overall optimal outcome distribution. We study a framework that allows the decision maker to penalize for such deviations, while allowing for a wide range of target functionals and discrimination measures to be employed. We establish regret and consistency guarantees for empirical success policies with data-driven tuning parameters, and provide numerical results. 
Furthermore, we briefly illustrate the methods in two empirical settings."}, "https://arxiv.org/abs/2311.13969": {"title": "Was Javert right to be suspicious? Unpacking treatment effect heterogeneity of alternative sentences on time-to-recidivism in Brazil", "link": "https://arxiv.org/abs/2311.13969", "description": "This paper presents new econometric tools to unpack the treatment effect heterogeneity of punishing misdemeanor offenses on time-to-recidivism. We show how one can identify, estimate, and make inferences on the distributional, quantile, and average marginal treatment effects in setups where the treatment selection is endogenous and the outcome of interest, usually a duration variable, is potentially right-censored. We explore our proposed econometric methodology to evaluate the effect of fines and community service sentences as a form of punishment on time-to-recidivism in the State of S\\~ao Paulo, Brazil, between 2010 and 2019, leveraging the as-if random assignment of judges to cases. Our results highlight substantial treatment effect heterogeneity that other tools are not meant to capture. For instance, we find that people whom most judges would punish take longer to recidivate as a consequence of the punishment, while people who would be punished only by strict judges recidivate at an earlier date than if they were not punished. This result suggests that designing sentencing guidelines that encourage strict judges to become more lenient could reduce recidivism."}, "https://arxiv.org/abs/2402.00184": {"title": "The Heterogeneous Aggregate Valence Analysis (HAVAN) Model: A Flexible Approach to Modeling Unobserved Heterogeneity in Discrete Choice Analysis", "link": "https://arxiv.org/abs/2402.00184", "description": "This paper introduces the Heterogeneous Aggregate Valence Analysis (HAVAN) model, a novel class of discrete choice models. We adopt the term \"valence'' to encompass any latent quantity used to model consumer decision-making (e.g., utility, regret, etc.). Diverging from traditional models that parameterize heterogeneous preferences across various product attributes, HAVAN models (pronounced \"haven\") instead directly characterize alternative-specific heterogeneous preferences. This innovative perspective on consumer heterogeneity affords unprecedented flexibility and significantly reduces simulation burdens commonly associated with mixed logit models. In a simulation experiment, the HAVAN model demonstrates superior predictive performance compared to state-of-the-art artificial neural networks. This finding underscores the potential for HAVAN models to improve discrete choice modeling capabilities."}, "https://arxiv.org/abs/2402.00192": {"title": "Finite- and Large-Sample Inference for Ranks using Multinomial Data with an Application to Ranking Political Parties", "link": "https://arxiv.org/abs/2402.00192", "description": "It is common to rank different categories by means of preferences that are revealed through data on choices. A prominent example is the ranking of political candidates or parties using the estimated share of support each one receives in surveys or polls about political attitudes. Since these rankings are computed using estimates of the share of support rather than the true share of support, there may be considerable uncertainty concerning the true ranking of the political candidates or parties. In this paper, we consider the problem of accounting for such uncertainty by constructing confidence sets for the rank of each category. 
We consider both the problem of constructing marginal confidence sets for the rank of a particular category as well as simultaneous confidence sets for the ranks of all categories. A distinguishing feature of our analysis is that we exploit the multinomial structure of the data to develop confidence sets that are valid in finite samples. We additionally develop confidence sets using the bootstrap that are valid only approximately in large samples. We use our methodology to rank political parties in Australia using data from the 2019 Australian Election Survey. We find that our finite-sample confidence sets are informative across the entire ranking of political parties, even in Australian territories with few survey respondents and/or with parties that are chosen by only a small share of the survey respondents. In contrast, the bootstrap-based confidence sets may sometimes be considerably less informative. These findings motivate us to compare these methods in an empirically-driven simulation study, in which we conclude that our finite-sample confidence sets often perform better than their large-sample, bootstrap-based counterparts, especially in settings that resemble our empirical application."}, "https://arxiv.org/abs/2402.00567": {"title": "Stochastic convergence in per capita CO$_2$ emissions", "link": "https://arxiv.org/abs/2402.00567", "description": "This paper studies stochastic convergence of per capita CO$_2$ emissions in 28 OECD countries for the 1901-2009 period. The analysis is carried out at two aggregation levels, first for the whole set of countries (joint analysis) and then separately for developed and developing states (group analysis). A powerful time series methodology, adapted to a nonlinear framework that allows for quadratic trends with possibly smooth transitions between regimes, is applied. This approach provides more robust conclusions in convergence path analysis, enabling (a) robust detection of the presence, and if so, the number of changes in the level and/or slope of the trend of the series, (b) inferences on stationarity of relative per capita CO$_2$ emissions, conditionally on the presence of breaks and smooth transitions between regimes, and (c) estimation of change locations in the convergence paths. Finally, as stochastic convergence is attained when both stationarity around a trend and $\\beta$-convergence hold, the linear approach proposed by Tomljanovich and Vogelsang (2002) is extended in order to allow for more general quadratic models. Overall, joint analysis finds some evidence of stochastic convergence in per capita CO$_2$ emissions. Some dispersion in terms of $\\beta$-convergence is detected by group analysis, particularly among developed countries. This is in accordance with per capita GDP not being the sole determinant of convergence in emissions, with factors like search for more efficient technologies, fossil fuel substitution, innovation, and possibly outsources of industries, also having a crucial role."}, "https://arxiv.org/abs/2402.00584": {"title": "Arellano-Bond LASSO Estimator for Dynamic Linear Panel Models", "link": "https://arxiv.org/abs/2402.00584", "description": "The Arellano-Bond estimator can be severely biased when the time series dimension of the data, $T$, is long. The source of the bias is the large degree of overidentification. We propose a simple two-step approach to deal with this problem. The first step applies LASSO to the cross-section data at each time period to select the most informative moment conditions. 
The second step applies a linear instrumental variable estimator using the instruments constructed from the moment conditions selected in the first step. The two stages are combined using sample-splitting and cross-fitting to avoid overfitting bias. Using asymptotic sequences where the two dimensions of the panel grow with the sample size, we show that the new estimator is consistent and asymptotically normal under much weaker conditions on $T$ than the Arellano-Bond estimator. Our theory covers models with high dimensional covariates including multiple lags of the dependent variable, which are common in modern applications. We illustrate our approach with an application to the short and long-term effects of the opening of K-12 schools and other policies on the spread of COVID-19 using weekly county-level panel data from the United States."}, "https://arxiv.org/abs/2402.00788": {"title": "EU-28's progress towards the 2020 renewable energy share", "link": "https://arxiv.org/abs/2402.00788", "description": "This paper assesses the convergence of the EU-28 countries towards their common goal of 20% in the renewable energy share indicator by year 2020. The potential presence of clubs of convergence towards different steady state equilibria is also analyzed from both the standpoints of global convergence to the 20% goal and specific convergence to the various targets assigned to Member States. Two clubs of convergence are detected in the former case, each corresponding to different RES targets. A probit model is also fitted with the aim of better understanding the determinants of club membership, that seemingly include real GDP per capita, expenditure on environmental protection, energy dependence, and nuclear capacity, with all of them having statistically significant effects. Finally, convergence is also analyzed separately for the transport, heating and cooling, and electricity sectors."}, "https://arxiv.org/abs/2402.00172": {"title": "The Fourier-Malliavin Volatility (FMVol) MATLAB library", "link": "https://arxiv.org/abs/2402.00172", "description": "This paper presents the Fourier-Malliavin Volatility (FMVol) estimation library for MATLAB. This library includes functions that implement Fourier- Malliavin estimators (see Malliavin and Mancino (2002, 2009)) of the volatility and co-volatility of continuous stochastic volatility processes and second-order quantities, like the quarticity (the squared volatility), the volatility of volatility and the leverage (the covariance between changes in the process and changes in its volatility). The Fourier-Malliavin method is fully non-parametric, does not require equally-spaced observations and is robust to measurement errors, or noise, without any preliminary bias correction or pre-treatment of the observations. Further, in its multivariate version, it is intrinsically robust to irregular and asynchronous sampling. 
Although originally introduced for a specific application in financial econometrics, namely the estimation of asset volatilities, the Fourier-Malliavin method is a general method that can be applied whenever one is interested in reconstructing the latent volatility and second-order quantities of a continuous stochastic volatility process from discrete observations."}, "https://arxiv.org/abs/2010.03898": {"title": "Consistent Specification Test of the Quantile Autoregression", "link": "https://arxiv.org/abs/2010.03898", "description": "This paper proposes a test for the joint hypothesis of correct dynamic specification and no omitted latent factors for the Quantile Autoregression. If the composite null is rejected we proceed to disentangle the cause of rejection, i.e., dynamic misspecification or an omitted variable. We establish the asymptotic distribution of the test statistics under fairly weak conditions and show that factor estimation error is negligible. A Monte Carlo study shows that the suggested tests have good finite sample properties. Finally, we undertake an empirical illustration of modelling GDP growth and CPI inflation in the United Kingdom, where we find evidence that factor augmented models are correctly specified in contrast with their non-augmented counterparts when it comes to GDP growth, while also exploring the asymmetric behaviour of the growth and inflation distributions."}, "https://arxiv.org/abs/2203.12740": {"title": "Correcting Attrition Bias using Changes-in-Changes", "link": "https://arxiv.org/abs/2203.12740", "description": "Attrition is a common and potentially important threat to internal validity in treatment effect studies. We extend the changes-in-changes approach to identify the average treatment effect for respondents and the entire study population in the presence of attrition. Our method, which exploits baseline outcome data, can be applied to randomized experiments as well as quasi-experimental difference-in-difference designs. A formal comparison highlights that while widely used corrections typically impose restrictions on whether or how response depends on treatment, our proposed attrition correction exploits restrictions on the outcome model. We further show that the conditions required for our correction can accommodate a broad class of response models that depend on treatment in an arbitrary way. We illustrate the implementation of the proposed corrections in an application to a large-scale randomized experiment."}, "https://arxiv.org/abs/2209.11444": {"title": "Treatment Effects with Multidimensional Unobserved Heterogeneity: Identification of the Marginal Treatment Effect", "link": "https://arxiv.org/abs/2209.11444", "description": "This paper establishes sufficient conditions for the identification of the marginal treatment effects with multivalued treatments. Our model is based on a multinomial choice model with utility maximization. Our MTE generalizes the MTE defined in Heckman and Vytlacil (2005) in binary treatment models. As in the binary case, we can interpret the MTE as the treatment effect for persons who are indifferent between two treatments at a particular level. Our MTE enables one to obtain the treatment effects of those with specific preference orders over the choice set. 
Further, our results can identify other parameters such as the marginal distribution of potential outcomes."}, "https://arxiv.org/abs/2310.17496": {"title": "Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach", "link": "https://arxiv.org/abs/2310.17496", "description": "In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods."}, "http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. 
In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. 
Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. 
Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. 
We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. 
As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method-in-test. Still,\nmuch of the causal inference literature tends to design over-restricted or misspecified\nstudies. In this paper, we elaborate on the problem of improper simulation\ndesign for causal methods and compile a list of desiderata for an effective\nsimulation framework. We then introduce partially randomized causal simulation\n(PARCS), a simulation framework that meets those desiderata. PARCS synthesizes\ndata based on graphical causal models and a wide range of adjustable\nparameters. There is a legible mapping from usual causal assumptions to the\nparameters; thus, users can identify and specify the subset of related\nparameters and randomize the remaining ones to generate a range of complying\ndata-generating processes for their causal method. The result is a more\ncomprehensive and inclusive empirical investigation for causal claims. Using\nPARCS, we reproduce and extend the simulation studies of two well-known causal\ndiscovery and missing data analysis papers to emphasize the necessity of a\nproper simulation design. Our results show that those papers would have\nimproved and extended the findings, had they used PARCS for simulation. The\nframework is implemented as a Python package, too. 
By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. 
In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. 
By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameter estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to classical\nresults for fixed-dimensional time series, where the localization error of the\nchange-point estimator is $O_{p}(1)$, exact recovery of true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the number\nand locations of the change-point estimator under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed networks,\nwhich are frequently encountered in application domains but largely neglected in\nthe literature. The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\nfills the void left by the lack of uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of the confidence\nintervals, as well as the false discovery rate (FDR) control for multiple node\ncomparisons. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. 
Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. 
To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}, "http://arxiv.org/abs/2310.04452": {"title": "Short text classification with machine learning in the social sciences: The case of climate change on Twitter", "link": "http://arxiv.org/abs/2310.04452", "description": "To analyse large numbers of texts, social science researchers are\nincreasingly confronting the challenge of text classification. When manual\nlabeling is not possible and researchers have to find automatized ways to\nclassify texts, computer science provides a useful toolbox of machine-learning\nmethods whose performance remains understudied in the social sciences. In this\narticle, we compare the performance of the most widely used text classifiers by\napplying them to a typical research scenario in social science research: a\nrelatively small labeled dataset with infrequent occurrence of categories of\ninterest, which is a part of a large unlabeled dataset. As an example case, we\nlook at Twitter communication regarding climate change, a topic of increasing\nscholarly interest in interdisciplinary social science research. 
Using a novel\ndataset including 5,750 tweets from various international organizations\nregarding the highly ambiguous concept of climate change, we evaluate the\nperformance of methods in automatically classifying tweets based on whether\nthey are about climate change or not. In this context, we highlight two main\nfindings. First, supervised machine-learning methods perform better than\nstate-of-the-art lexicons, in particular as class balance increases. Second,\ntraditional machine-learning methods, such as logistic regression and random\nforest, perform similarly to sophisticated deep-learning methods, whilst\nrequiring much less training time and computational resources. The results have\nimportant implications for the analysis of short texts in social science\nresearch."}, "http://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "http://arxiv.org/abs/2310.04563", "description": "During the COVID-19 pandemic, implementing in-person indoor instruction in a\nsafe manner was a high priority for universities nationwide. To support this\neffort at the University, we developed a mathematical model for estimating the\nrisk of SARS-CoV-2 transmission in university classrooms. This model was used\nto design a safe classroom environment at the University during the COVID-19\npandemic that supported the higher occupancy levels needed to match\npre-pandemic numbers of in-person courses, despite a limited number of large\nclassrooms. A retrospective analysis at the end of the semester confirmed the\nmodel's assessment that the proposed classroom configuration would be safe. Our\nframework is generalizable and was also used to support reopening decisions at\nStanford University. In addition, our methods are flexible; our modeling\nframework was repurposed to plan for large university events and gatherings. We\nfound that our approach and methods work in a wide range of indoor settings and\ncould be used to support reopening planning across various industries, from\nsecondary schools to movie theaters and restaurants."}, "http://arxiv.org/abs/2310.04578": {"title": "TNDDR: Efficient and doubly robust estimation of COVID-19 vaccine effectiveness under the test-negative design", "link": "http://arxiv.org/abs/2310.04578", "description": "While the test-negative design (TND), which is routinely used for monitoring\nseasonal flu vaccine effectiveness (VE), has recently become integral to\nCOVID-19 vaccine surveillance, it is susceptible to selection bias due to\noutcome-dependent sampling. Some studies have addressed the identifiability and\nestimation of causal parameters under the TND, but efficiency bounds for\nnonparametric estimators of the target parameter under the unconfoundedness\nassumption have not yet been investigated. We propose a one-step doubly robust\nand locally efficient estimator called TNDDR (TND doubly robust), which\nutilizes sample splitting and can incorporate machine learning techniques to\nestimate the nuisance functions. 
We derive the efficient influence function\n(EIF) for the marginal expectation of the outcome under a vaccination\nintervention, explore the von Mises expansion, and establish the conditions for\n$\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR.\nThe proposed TNDDR is supported by both theoretical and empirical\njustifications, and we apply it to estimate COVID-19 VE in an administrative\ndataset of community-dwelling older people (aged $\\geq 60$y) in the province of\nQu\\'ebec, Canada."}, "http://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "http://arxiv.org/abs/2310.04660", "description": "Many scientific questions in biomedical, environmental, and psychological\nresearch involve understanding the impact of multiple factors on outcomes.\nWhile randomized factorial experiments are ideal for this purpose,\nrandomization is infeasible in many empirical studies. Therefore, investigators\noften rely on observational data, where drawing reliable causal inferences for\nmultiple factors remains challenging. As the number of treatment combinations\ngrows exponentially with the number of factors, some treatment combinations can\nbe rare or even missing by chance in observed data, further complicating\nfactorial effects estimation. To address these challenges, we propose a novel\nweighting method tailored to observational studies with multiple factors. Our\napproach uses weighted observational data to emulate a randomized factorial\nexperiment, enabling simultaneous estimation of the effects of multiple factors\nand their interactions. Our investigations reveal a crucial nuance: achieving\nbalance among covariates, as in single-factor scenarios, is necessary but\ninsufficient for unbiasedly estimating factorial effects. Our findings suggest\nthat balancing the factors is also essential in multi-factor settings.\nMoreover, we extend our weighting method to handle missing treatment\ncombinations in observed data. Finally, we study the asymptotic behavior of the\nnew weighting estimators and propose a consistent variance estimator, providing\nreliable inferences on factorial effects in observational studies."}, "http://arxiv.org/abs/2310.04709": {"title": "Time-dependent mediators in survival analysis: Graphical representation of causal assumptions", "link": "http://arxiv.org/abs/2310.04709", "description": "We study time-dependent mediators in survival analysis using a treatment\nseparation approach due to Didelez [2019] and based on earlier work by Robins\nand Richardson [2011]. This approach avoids nested counterfactuals and\ncrossworld assumptions which are otherwise common in mediation analysis. The\ncausal model of treatment, mediators, covariates, confounders and outcome is\nrepresented by causal directed acyclic graphs (DAGs). However, the DAGs tend to\nbe very complex when we have measurements at a large number of time points. We\ntherefore suggest using so-called rolled graphs in which a node represents an\nentire coordinate process instead of a single random variable, leading us to\nfar simpler graphical representations. The rolled graphs are not necessarily\nacyclic; they can be analyzed by $\\delta$-separation which is the appropriate\ngraphical separation criterion in this class of graphs and analogous to\n$d$-separation. 
In particular, $\delta$-separation is a graphical tool for\nevaluating if the conditions of the mediation analysis are met or if unmeasured\nconfounders influence the estimated effects. We also state a mediational\ng-formula. This is similar to the approach in Vansteelandt et al. [2019]\nalthough that paper has a different conceptual basis. Finally, we apply this\nframework to a statistical model based on a Cox model with an added treatment\neffect. Keywords: survival analysis; mediation; causal inference; graphical models; local\nindependence graphs"}, "http://arxiv.org/abs/2310.04919": {"title": "The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models", "link": "http://arxiv.org/abs/2310.04919", "description": "In modern scientific research, the objective is often to identify which\nvariables are associated with an outcome among a large class of potential\npredictors. This goal can be achieved by selecting variables in a manner that\ncontrols the false discovery rate (FDR), the proportion of irrelevant\npredictors among the selections. Knockoff filtering is a cutting-edge approach\nto variable selection that provides FDR control. Existing knockoff statistics\nfrequently employ linear models to assess relationships between features and\nthe response, but the linearity assumption is often violated in real-world\napplications. This may result in poor power to detect truly prognostic\nvariables. We introduce a knockoff statistic based on the conditional\nprediction function (CPF), which can pair with state-of-the-art machine learning\npredictive models, such as deep neural networks. The CPF statistics can capture\nthe nonlinear relationships between predictors and outcomes while also\naccounting for correlation between features. We illustrate the capability of\nthe CPF statistics to provide superior power over common knockoff statistics\nwith continuous, categorical, and survival outcomes using repeated simulations.\nKnockoff filtering with the CPF statistics is demonstrated using (1) a\nresidential building dataset to select predictors for the actual sales prices\nand (2) the TCGA dataset to select genes that are correlated with disease\nstaging in lung cancer patients."}, "http://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "http://arxiv.org/abs/2310.04924", "description": "Markov chain Monte Carlo significance tests were first introduced by Besag\nand Clifford in [4]. These methods produce statistically valid p-values in\nproblems where sampling from the null hypothesis is intractable. We give an\noverview of the methods of Besag and Clifford and some recent developments. A\nrange of examples and applications are discussed."}, "http://arxiv.org/abs/2310.04934": {"title": "UBSea: A Unified Community Detection Framework", "link": "http://arxiv.org/abs/2310.04934", "description": "Detecting communities in networks and graphs is an important task across many\ndisciplines such as statistics, social science and engineering. There are\ngenerally three different kinds of mixing patterns for the case of two\ncommunities: assortative mixing, disassortative mixing and core-periphery\nstructure. Modularity optimization is a classical way for fitting network\nmodels with communities. However, it can only deal with assortative mixing and\ndisassortative mixing when the mixing pattern is known and fails to discover\nthe core-periphery structure. 
In this paper, we extend modularity in a\nstrategic way and propose a new framework based on Unified Bigroups Standardized\nEdge-count Analysis (UBSea). It can address all of the aforementioned\ncommunity mixing structures. In addition, this new framework is able to\nautomatically choose the mixing type to fit the networks. Simulation studies\nshow that the new framework has superb performance in a wide range of settings\nunder the stochastic block model and the degree-corrected stochastic block\nmodel. We show that the new approach produces consistent estimates of the\ncommunities under a suitable signal-to-noise-ratio condition, for the case of a\nblock model with two communities, for both undirected and directed networks.\nThe new method is illustrated through applications to several real-world\ndatasets."}, "http://arxiv.org/abs/2310.05049": {"title": "On Estimation of Optimal Dynamic Treatment Regimes with Multiple Treatments for Survival Data-With Application to Colorectal Cancer Study", "link": "http://arxiv.org/abs/2310.05049", "description": "Dynamic treatment regimes (DTR) are sequential decision rules corresponding\nto several stages of intervention. Each rule maps patients' covariates to\noptional treatments. The optimal dynamic treatment regime is the one that\nmaximizes the mean outcome of interest if followed by the overall population.\nMotivated by a clinical study on advanced colorectal cancer with traditional\nChinese medicine, we propose a censored C-learning (CC-learning) method to\nestimate the dynamic treatment regime with multiple treatments using survival\ndata. To address the challenges of multiple stages with right censoring, we\nmodify the backward recursion algorithm in order to adapt to the flexible\nnumber and timing of treatments. For handling the problem of multiple\ntreatments, we propose a framework from the classification perspective by\ntransferring the problem of optimization with multiple treatment comparisons\ninto an example-dependent cost-sensitive classification problem. With\nclassification and regression tree (CART) as the classifier, the CC-learning\nmethod can produce an estimated optimal DTR with good interpretability. We\ntheoretically prove the optimality of our method and numerically evaluate its\nfinite sample performance through simulation. With the proposed method, we\nidentify the interpretable tree treatment regimes at each stage for the\nadvanced colorectal cancer treatment data from Xiyuan Hospital."}, "http://arxiv.org/abs/2310.05151": {"title": "Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions", "link": "http://arxiv.org/abs/2310.05151", "description": "In clinical trials of longitudinal continuous outcomes, reference-based\nimputation (RBI) has commonly been applied to handle missing outcome data in\nsettings where the estimand incorporates the effects of intercurrent events,\ne.g. treatment discontinuation. RBI was originally developed in the multiple\nimputation framework; however, recently conditional mean imputation (CMI)\ncombined with the jackknife estimator of the standard error was proposed as a\nway to obtain deterministic treatment effect estimates and correct frequentist\ninference. For both multiple imputation and CMI, a mixed model for repeated measures\n(MMRM) is often used for the imputation model, but this can be computationally\nintensive to fit to multiple data sets (e.g. 
the jackknife samples) and lead to\nconvergence issues with complex MMRM models with many parameters. Therefore, a\nstep-wise approach based on sequential linear regression (SLR) of the outcomes\nat each visit was developed for the imputation model in the multiple imputation\nframework, but similar developments in the CMI framework are lacking. In this\narticle, we fill this gap in the literature by proposing a SLR approach to\nimplement RBI in the CMI framework, and justify its validity using theoretical\nresults and simulations. We also illustrate our proposal on a real data\napplication."}, "http://arxiv.org/abs/2310.05398": {"title": "Statistical Inference for Modulation Index in Phase-Amplitude Coupling", "link": "http://arxiv.org/abs/2310.05398", "description": "Phase-amplitude coupling is a phenomenon observed in several neurological\nprocesses, where the phase of one signal modulates the amplitude of another\nsignal with a distinct frequency. The modulation index (MI) is a common\ntechnique used to quantify this interaction by assessing the Kullback-Leibler\ndivergence between a uniform distribution and the empirical conditional\ndistribution of amplitudes with respect to the phases of the observed signals.\nThe uniform distribution is an ideal representation that is expected to appear\nunder the absence of coupling. However, it does not reflect the statistical\nproperties of coupling values caused by random chance. In this paper, we\npropose a statistical framework for evaluating the significance of an observed\nMI value based on a null hypothesis that a MI value can be entirely explained\nby chance. Significance is obtained by comparing the value with a reference\ndistribution derived under the null hypothesis of independence (i.e., no\ncoupling) between signals. We derived a closed-form distribution of this null\nmodel, resulting in a scaled beta distribution. To validate the efficacy of our\nproposed framework, we conducted comprehensive Monte Carlo simulations,\nassessing the significance of MI values under various experimental scenarios,\nincluding amplitude modulation, trains of spikes, and sequences of\nhigh-frequency oscillations. Furthermore, we corroborated the reliability of\nour model by comparing its statistical significance thresholds with reported\nvalues from other research studies conducted under different experimental\nsettings. Our method offers several advantages such as meta-analysis\nreliability, simplicity and computational efficiency, as it provides p-values\nand significance levels without resorting to generating surrogate data through\nsampling procedures."}, "http://arxiv.org/abs/2310.05526": {"title": "Projecting infinite time series graphs to finite marginal graphs using number theory", "link": "http://arxiv.org/abs/2310.05526", "description": "In recent years, a growing number of method and application works have\nadapted and applied the causal-graphical-model framework to time series data.\nMany of these works employ time-resolved causal graphs that extend infinitely\ninto the past and future and whose edges are repetitive in time, thereby\nreflecting the assumption of stationary causal relationships. However, most\nresults and algorithms from the causal-graphical-model framework are not\ndesigned for infinite graphs. In this work, we develop a method for projecting\ninfinite time series graphs with repetitive edges to marginal graphical models\non a finite time window. 
These finite marginal graphs provide the answers to\n$m$-separation queries with respect to the infinite graph, a task that was\npreviously unresolved. Moreover, we argue that these marginal graphs are useful\nfor causal discovery and causal effect estimation in time series, effectively\nenabling to apply results developed for finite graphs to the infinite graphs.\nThe projection procedure relies on finding common ancestors in the\nto-be-projected graph and is, by itself, not new. However, the projection\nprocedure has not yet been algorithmically implemented for time series graphs\nsince in these infinite graphs there can be infinite sets of paths that might\ngive rise to common ancestors. We solve the search over these possibly infinite\nsets of paths by an intriguing combination of path-finding techniques for\nfinite directed graphs and solution theory for linear Diophantine equations. By\nproviding an algorithm that carries out the projection, our paper makes an\nimportant step towards a theoretically-grounded and method-agnostic\ngeneralization of a range of causal inference methods and results to time\nseries."}, "http://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "http://arxiv.org/abs/2310.05539", "description": "In response to the unique challenge created by high-dimensional mediators in\nmediation analysis, this paper presents a novel procedure for testing the\nnullity of the mediation effect in the presence of high-dimensional mediators.\nThe procedure incorporates two distinct features. Firstly, the test remains\nvalid under all cases of the composite null hypothesis, including the\nchallenging scenario where both exposure-mediator and mediator-outcome\ncoefficients are zero. Secondly, it does not impose structural assumptions on\nthe exposure-mediator coefficients, thereby allowing for an arbitrarily strong\nexposure-mediator relationship. To the best of our knowledge, the proposed test\nis the first of its kind to provably possess these two features in\nhigh-dimensional mediation analysis. The validity and consistency of the\nproposed test are established, and its numerical performance is showcased\nthrough simulation studies. The application of the proposed test is\ndemonstrated by examining the mediation effect of DNA methylation between\nsmoking status and lung cancer development."}, "http://arxiv.org/abs/2310.05548": {"title": "Cokrig-and-Regress for Spatially Misaligned Environmental Data", "link": "http://arxiv.org/abs/2310.05548", "description": "Spatially misaligned data, where the response and covariates are observed at\ndifferent spatial locations, commonly arise in many environmental studies. Much\nof the statistical literature on handling spatially misaligned data has been\ndevoted to the case of a single covariate and a linear relationship between the\nresponse and this covariate. Motivated by spatially misaligned data collected\non air pollution and weather in China, we propose a cokrig-and-regress (CNR)\nmethod to estimate spatial regression models involving multiple covariates and\npotentially non-linear associations. The CNR estimator is constructed by\nreplacing the unobserved covariates (at the response locations) by their\ncokriging predictor derived from the observed but misaligned covariates under a\nmultivariate Gaussian assumption, where a generalized Kronecker product\ncovariance is used to account for spatial correlations within and between\ncovariates. 
A parametric bootstrap approach is employed to bias-correct the CNR\nestimates of the spatial covariance parameters and for uncertainty\nquantification. Simulation studies demonstrate that CNR outperforms several\nexisting methods for handling spatially misaligned data, such as\nnearest-neighbor interpolation. Applying CNR to the spatially misaligned air\npollution and weather data in China reveals a number of non-linear\nrelationships between PM$_{2.5}$ concentration and several meteorological\ncovariates."}, "http://arxiv.org/abs/2310.05622": {"title": "A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards", "link": "http://arxiv.org/abs/2310.05622", "description": "While well-established methods for time-to-event data are available when the\nproportional hazards assumption holds, there is no consensus on the best\ninferential approach under non-proportional hazards (NPH). However, a wide\nrange of parametric and non-parametric methods for testing and estimation in\nthis scenario have been proposed. To provide recommendations on the statistical\nanalysis of clinical trials where non proportional hazards are expected, we\nconducted a comprehensive simulation study under different scenarios of\nnon-proportional hazards, including delayed onset of treatment effect, crossing\nhazard curves, subgroups with different treatment effect and changing hazards\nafter disease progression. We assessed type I error rate control, power and\nconfidence interval coverage, where applicable, for a wide range of methods\nincluding weighted log-rank tests, the MaxCombo test, summary measures such as\nthe restricted mean survival time (RMST), average hazard ratios, and milestone\nsurvival probabilities as well as accelerated failure time regression models.\nWe found a trade-off between interpretability and power when choosing an\nanalysis strategy under NPH scenarios. While analysis methods based on weighted\nlogrank tests typically were favorable in terms of power, they do not provide\nan easily interpretable treatment effect estimate. Also, depending on the\nweight function, they test a narrow null hypothesis of equal hazard functions\nand rejection of this null hypothesis may not allow for a direct conclusion of\ntreatment benefit in terms of the survival function. In contrast,\nnon-parametric procedures based on well interpretable measures as the RMST\ndifference had lower power in most scenarios. Model based methods based on\nspecific survival distributions had larger power, however often gave biased\nestimates and lower than nominal confidence interval coverage."}, "http://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "http://arxiv.org/abs/2310.05646", "description": "We study transfer learning in the context of estimating piecewise-constant\nsignals when source data, which may be relevant but disparate, are available in\naddition to the target data. We initially investigate transfer learning\nestimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for\nunisource data scenarios and then generalise these estimators to accommodate\nmultisource data. To further reduce estimation errors, especially in scenarios\nwhere some sources significantly differ from the target, we introduce an\ninformative source selection algorithm. 
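The non-proportional-hazards comparison above lists the restricted mean survival time (RMST) among the interpretable summary measures. As background only, here is a minimal sketch of RMST as the area under the Kaplan-Meier curve up to a truncation time tau; the truncation time and toy data are assumptions, and ties between event and censoring times are not handled.

```python
# Minimal sketch: RMST = area under the Kaplan-Meier step function up to tau.
import numpy as np

def rmst(time, event, tau):
    """time: follow-up times; event: 1 = event, 0 = censored; tau: truncation time."""
    order = np.argsort(time)
    time, event = np.asarray(time)[order], np.asarray(event)[order]
    at_risk = len(time)
    surv, t_prev, area = 1.0, 0.0, 0.0
    for t, d in zip(time, event):
        if t > tau:
            break
        area += surv * (t - t_prev)      # rectangle under the current survival level
        if d:                            # the KM curve drops only at event times
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1
        t_prev = t
    area += surv * (tau - t_prev)        # final piece up to tau
    return area

# toy example: exponential survival with independent censoring
rng = np.random.default_rng(1)
t = rng.exponential(12.0, 300)
c = rng.exponential(30.0, 300)
print(round(rmst(np.minimum(t, c), (t <= c).astype(int), tau=24.0), 2))
```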
We then examine these estimators with\nmultisource selection and establish their minimax optimality under specific\nregularity conditions. It is worth emphasising that, unlike the prevalent\nnarrative in the transfer learning literature that the performance is enhanced\nthrough large source sample sizes, our approaches leverage higher observation\nfrequencies and accommodate diverse frequencies across multiple sources. Our\ntheoretical findings are empirically validated through extensive numerical\nexperiments, with the code available online, see\nhttps://github.com/chrisfanwang/transferlearning"}, "http://arxiv.org/abs/2310.05685": {"title": "Post-Selection Inference for Sparse Estimation", "link": "http://arxiv.org/abs/2310.05685", "description": "When the model is not known and parameter testing or interval estimation is\nconducted after model selection, it is necessary to consider selective\ninference. This paper discusses this issue in the context of sparse estimation.\nFirstly, we describe selective inference related to Lasso as per \\cite{lee},\nand then present polyhedra and truncated distributions when applying it to\nmethods such as Forward Stepwise and LARS. Lastly, we discuss the Significance\nTest for Lasso by \\cite{significant} and the Spacing Test for LARS by\n\\cite{ryan_exact}. This paper serves as a review article.\n\nKeywords: post-selective inference, polyhedron, LARS, lasso, forward\nstepwise, significance test, spacing test."}, "http://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "http://arxiv.org/abs/2310.05921", "description": "We introduce Conformal Decision Theory, a framework for producing safe\nautonomous decisions despite imperfect machine learning predictions. Examples\nof such decisions are ubiquitous, from robot planning algorithms that rely on\npedestrian predictions, to calibrating autonomous manufacturing to exhibit high\nthroughput and low error, to the choice of trusting a nominal policy versus\nswitching to a safe backup policy at run-time. The decisions produced by our\nalgorithms are safe in the sense that they come with provable statistical\nguarantees of having low risk without any assumptions on the world model\nwhatsoever; the observations need not be I.I.D. and can even be adversarial.\nThe theory extends results from conformal prediction to calibrate decisions\ndirectly, without requiring the construction of prediction sets. Experiments\ndemonstrate the utility of our approach in robot motion planning around humans,\nautomated stock trading, and robot manufacturing"}, "http://arxiv.org/abs/2101.06950": {"title": "Learning and scoring Gaussian latent variable causal models with unknown additive interventions", "link": "http://arxiv.org/abs/2101.06950", "description": "With observational data alone, causal structure learning is a challenging\nproblem. The task becomes easier when having access to data collected from\nperturbations of the underlying system, even when the nature of these is\nunknown. Existing methods either do not allow for the presence of latent\nvariables or assume that these remain unperturbed. However, these assumptions\nare hard to justify if the nature of the perturbations is unknown. We provide\nresults that enable scoring causal structures in the setting with additive, but\nunknown interventions. 
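The Conformal Decision Theory entry above extends conformal prediction so that decisions are calibrated directly, without building prediction sets. The sketch below shows only the standard split-conformal regression interval that this line of work starts from, not the decision-theoretic algorithm itself; the model, coverage level, and toy data are assumptions.

```python
# Background sketch only: standard split-conformal prediction intervals, the
# machinery that Conformal Decision Theory builds on (the decision-calibration
# algorithm of the paper is not reproduced here).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.8, size=600)

# split into a fitting half and a calibration half
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = Ridge().fit(X_fit, y_fit)

alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))           # conformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))       # finite-sample quantile index
q = np.sort(scores)[k - 1]

x_new = rng.normal(size=(1, 3))
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```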
Specifically, we propose a maximum-likelihood estimator\nin a structural equation model that exploits system-wide invariances to output\nan equivalence class of causal structures from perturbation data. Furthermore,\nunder certain structural assumptions on the population model, we provide a\nsimple graphical characterization of all the DAGs in the interventional\nequivalence class. We illustrate the utility of our framework on synthetic data\nas well as real data involving California reservoirs and protein expressions.\nThe software implementation is available as the Python package \\emph{utlvce}."}, "http://arxiv.org/abs/2107.14151": {"title": "Modern Non-Linear Function-on-Function Regression", "link": "http://arxiv.org/abs/2107.14151", "description": "We introduce a new class of non-linear function-on-function regression models\nfor functional data using neural networks. We propose a framework using a\nhidden layer consisting of continuous neurons, called a continuous hidden\nlayer, for functional response modeling and give two model fitting strategies,\nFunctional Direct Neural Network (FDNN) and Functional Basis Neural Network\n(FBNN). Both are designed explicitly to exploit the structure inherent in\nfunctional data and capture the complex relations existing between the\nfunctional predictors and the functional response. We fit these models by\nderiving functional gradients and implement regularization techniques for more\nparsimonious results. We demonstrate the power and flexibility of our proposed\nmethod in handling complex functional models through extensive simulation\nstudies as well as real data examples."}, "http://arxiv.org/abs/2112.00832": {"title": "On the mixed-model analysis of covariance in cluster-randomized trials", "link": "http://arxiv.org/abs/2112.00832", "description": "In the analyses of cluster-randomized trials, mixed-model analysis of\ncovariance (ANCOVA) is a standard approach for covariate adjustment and\nhandling within-cluster correlations. However, when the normality, linearity,\nor the random-intercept assumption is violated, the validity and efficiency of\nthe mixed-model ANCOVA estimators for estimating the average treatment effect\nremain unclear. Under the potential outcomes framework, we prove that the\nmixed-model ANCOVA estimators for the average treatment effect are consistent\nand asymptotically normal under arbitrary misspecification of its working\nmodel. If the probability of receiving treatment is 0.5 for each cluster, we\nfurther show that the model-based variance estimator under mixed-model ANCOVA1\n(ANCOVA without treatment-covariate interactions) remains consistent,\nclarifying that the confidence interval given by standard software is\nasymptotically valid even under model misspecification. Beyond robustness, we\ndiscuss several insights on precision among classical methods for analyzing\ncluster-randomized trials, including the mixed-model ANCOVA, individual-level\nANCOVA, and cluster-level ANCOVA estimators. These insights may inform the\nchoice of methods in practice. 
Our analytical results and insights are\nillustrated via simulation studies and analyses of three cluster-randomized\ntrials."}, "http://arxiv.org/abs/2201.10770": {"title": "Confidence intervals for the Cox model test error from cross-validation", "link": "http://arxiv.org/abs/2201.10770", "description": "Cross-validation (CV) is one of the most widely used techniques in\nstatistical learning for estimating the test error of a model, but its behavior\nis not yet fully understood. It has been shown that standard confidence\nintervals for test error using estimates from CV may have coverage below\nnominal levels. This phenomenon occurs because each sample is used in both the\ntraining and testing procedures during CV and as a result, the CV estimates of\nthe errors become correlated. Without accounting for this correlation, the\nestimate of the variance is smaller than it should be. One way to mitigate this\nissue is by estimating the mean squared error of the prediction error instead\nusing nested CV. This approach has been shown to achieve superior coverage\ncompared to intervals derived from standard CV. In this work, we generalize the\nnested CV idea to the Cox proportional hazards model and explore various\nchoices of test error for this setting."}, "http://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2202.08419", "description": "In this paper, we develop a novel high-dimensional time-varying coefficient\nestimation method, based on high-dimensional Ito diffusion processes. To\naccount for high-dimensional time-varying coefficients, we first estimate local\n(or instantaneous) coefficients using a time-localized Dantzig selection scheme\nunder a sparsity condition, which results in biased local coefficient\nestimators due to the regularization. To handle the bias, we propose a\ndebiasing scheme, which provides well-performing unbiased local coefficient\nestimators. With the unbiased local coefficient estimators, we estimate the\nintegrated coefficient, and to further account for the sparsity of the\ncoefficient process, we apply thresholding schemes. We call this Thresholding\ndEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED\nestimator. In the empirical analysis, we apply the TED procedure to analyzing\nhigh-dimensional factor models using high-frequency data."}, "http://arxiv.org/abs/2206.12525": {"title": "Causality of Functional Longitudinal Data", "link": "http://arxiv.org/abs/2206.12525", "description": "\"Treatment-confounder feedback\" is the central complication to resolve in\nlongitudinal studies, to infer causality. The existing frameworks for\nidentifying causal effects for longitudinal studies with discrete repeated\nmeasures hinge heavily on assuming that time advances in discrete time steps or\ntreatment changes as a jumping process, rendering the number of \"feedbacks\"\nfinite. However, medical studies nowadays with real-time monitoring involve\nfunctional time-varying outcomes, treatment, and confounders, which leads to an\nuncountably infinite number of feedbacks between treatment and confounders.\nTherefore more general and advanced theory is needed. We generalize the\ndefinition of causal effects under user-specified stochastic treatment regimes\nto longitudinal studies with continuous monitoring and develop an\nidentification framework, allowing right censoring and truncation by death. 
We\nprovide sufficient identification assumptions including a generalized\nconsistency assumption, a sequential randomization assumption, a positivity\nassumption, and a novel \"intervenable\" assumption designed for the\ncontinuous-time case. Under these assumptions, we propose a g-computation\nprocess and an inverse probability weighting process, which suggest a\ng-computation formula and an inverse probability weighting formula for\nidentification. For practical purposes, we also construct two classes of\npopulation estimating equations to identify these two processes, respectively,\nwhich further suggest a doubly robust identification formula with extra\nrobustness against process misspecification. We prove that our framework fully\ngeneralizes the existing frameworks and is nonparametric."}, "http://arxiv.org/abs/2209.08139": {"title": "Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2209.08139", "description": "Bayesian variable selection methods are powerful techniques for fitting and\ninferring on sparse high-dimensional linear regression models. However, many\nare computationally intensive or require restrictive prior distributions on\nmodel parameters. In this paper, we propose a computationally efficient and\npowerful Bayesian approach for sparse high-dimensional linear regression.\nMinimal prior assumptions on the parameters are required through the use of\nplug-in empirical Bayes estimates of hyperparameters. Efficient maximum a\nposteriori (MAP) estimation is completed through a Parameter-Expanded\nExpectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in\na robust computationally efficient coordinate-wise optimization which -- when\nupdating the coefficient for a particular predictor -- adjusts for the impact\nof other predictor variables. The completion of the E-step uses an approach\nmotivated by the popular two-group approach to multiple testing. The result is\na PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse\nhigh-dimensional linear regression, which can be completed using one-at-a-time\nor all-at-once type optimization. We compare the empirical properties of PROBE\nto comparable approaches with numerous simulation studies and analyses of\ncancer cell drug responses. The proposed approach is implemented in the R\npackage probe."}, "http://arxiv.org/abs/2212.02709": {"title": "SURE-tuned Bridge Regression", "link": "http://arxiv.org/abs/2212.02709", "description": "Consider the {$\\ell_{\\alpha}$} regularized linear regression, also termed\nBridge regression. For $\\alpha\\in (0,1)$, Bridge regression enjoys several\nstatistical properties of interest such as sparsity and near-unbiasedness of\nthe estimates (Fan and Li, 2001). However, the main difficulty lies in the\nnon-convex nature of the penalty for these values of $\\alpha$, which makes an\noptimization procedure challenging and usually it is only possible to find a\nlocal optimum. To address this issue, Polson et al. (2013) took a sampling\nbased fully Bayesian approach to this problem, using the correspondence between\nthe Bridge penalty and a power exponential prior on the regression\ncoefficients. However, their sampling procedure relies on Markov chain Monte\nCarlo (MCMC) techniques, which are inherently sequential and not scalable to\nlarge problem dimensions. Cross validation approaches are similarly\ncomputation-intensive. 
To this end, our contribution is a novel\n\\emph{non-iterative} method to fit a Bridge regression model. The main\ncontribution lies in an explicit formula for Stein's unbiased risk estimate for\nthe out of sample prediction risk of Bridge regression, which can then be\noptimized to select the desired tuning parameters, allowing us to completely\nbypass MCMC as well as computation-intensive cross validation approaches. Our\nprocedure yields results in a fraction of the computational time compared to\niterative schemes, without any appreciable loss in statistical performance. An\nR implementation is publicly available online at:\nhttps://github.com/loriaJ/Sure-tuned_BridgeRegression ."}, "http://arxiv.org/abs/2212.03122": {"title": "Robust convex biclustering with a tuning-free method", "link": "http://arxiv.org/abs/2212.03122", "description": "Biclustering is widely used in many fields, including gene\ninformation analysis, text mining, and recommendation systems, by effectively\ndiscovering the local correlation between samples and features. However, many\nbiclustering algorithms will collapse when facing heavy-tailed data. In this\npaper, we propose a robust version of the convex biclustering algorithm with Huber\nloss. Yet, the newly introduced robustification parameter brings an extra\nburden to selecting the optimal parameters. Therefore, we propose a tuning-free\nmethod for automatically selecting the optimal robustification parameter with\nhigh efficiency. The simulation study demonstrates the superior\nperformance of our proposed method over traditional biclustering methods when\nencountering heavy-tailed noise. A real-life biomedical application is also\npresented. The R package RcvxBiclustr is available at\nhttps://github.com/YifanChen3/RcvxBiclustr."}, "http://arxiv.org/abs/2301.09661": {"title": "Estimating marginal treatment effects from observational studies and indirect treatment comparisons: When are standardization-based methods preferable to those based on propensity score weighting?", "link": "http://arxiv.org/abs/2301.09661", "description": "In light of newly developed standardization methods, we evaluate, via\nsimulation study, how propensity score weighting and standardization-based\napproaches compare for obtaining estimates of the marginal odds ratio and the\nmarginal hazard ratio. Specifically, we consider how the two approaches compare\nin two different scenarios: (1) in a single observational study, and (2) in an\nanchored indirect treatment comparison (ITC) of randomized controlled trials.\nWe present the material in such a way that the matching-adjusted indirect\ncomparison (MAIC) and the (novel) simulated treatment comparison (STC) methods\nin the ITC setting may be viewed as analogous to the propensity score weighting\nand standardization methods in the single observational study setting. Our\nresults suggest that current recommendations for conducting ITCs can be\nimproved and underscore the importance of adjusting for purely prognostic\nfactors."}, "http://arxiv.org/abs/2302.11746": {"title": "Logistic Regression and Classification with non-Euclidean Covariates", "link": "http://arxiv.org/abs/2302.11746", "description": "We introduce a logistic regression model for data pairs consisting of a\nbinary response and a covariate residing in a non-Euclidean metric space\nwithout vector structures. Based on the proposed model we also develop a binary\nclassifier for non-Euclidean objects. 
We propose a maximum likelihood estimator\nfor the non-Euclidean regression coefficient in the model, and provide upper\nbounds on the estimation error under various metric entropy conditions that\nquantify complexity of the underlying metric space. Matching lower bounds are\nderived for the important metric spaces commonly seen in statistics,\nestablishing optimality of the proposed estimator in such spaces. Similarly, an\nupper bound on the excess risk of the developed classifier is provided for\ngeneral metric spaces. A finer upper bound and a matching lower bound, and thus\noptimality of the proposed classifier, are established for Riemannian\nmanifolds. We investigate the numerical performance of the proposed estimator\nand classifier via simulation studies, and illustrate their practical merits\nvia an application to task-related fMRI data."}, "http://arxiv.org/abs/2302.13658": {"title": "Robust High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2302.13658", "description": "In this paper, we develop a novel high-dimensional coefficient estimation\nprocedure based on high-frequency data. Unlike usual high-dimensional\nregression procedure such as LASSO, we additionally handle the heavy-tailedness\nof high-frequency observations as well as time variations of coefficient\nprocesses. Specifically, we employ Huber loss and truncation scheme to handle\nheavy-tailed observations, while $\\ell_{1}$-regularization is adopted to\novercome the curse of dimensionality. To account for the time-varying\ncoefficient, we estimate local coefficients which are biased due to the\n$\\ell_{1}$-regularization. Thus, when estimating integrated coefficients, we\npropose a debiasing scheme to enjoy the law of large number property and employ\na thresholding scheme to further accommodate the sparsity of the coefficients.\nWe call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show\nthat the RED-LASSO estimator can achieve a near-optimal convergence rate. In\nthe empirical study, we apply the RED-LASSO procedure to the high-dimensional\nintegrated coefficient estimation using high-frequency trading data."}, "http://arxiv.org/abs/2307.04754": {"title": "Action-State Dependent Dynamic Model Selection", "link": "http://arxiv.org/abs/2307.04754", "description": "A model among many may only be best under certain states of the world.\nSwitching from a model to another can also be costly. Finding a procedure to\ndynamically choose a model in these circumstances requires to solve a complex\nestimation procedure and a dynamic programming problem. A Reinforcement\nlearning algorithm is used to approximate and estimate from the data the\noptimal solution to this dynamic programming problem. The algorithm is shown to\nconsistently estimate the optimal policy that may choose different models based\non a set of covariates. A typical example is the one of switching between\ndifferent portfolio models under rebalancing costs, using macroeconomic\ninformation. 
Using a set of macroeconomic variables and price data, an\nempirical application to the aforementioned portfolio problem shows superior\nperformance to choosing the best portfolio model with hindsight."}, "http://arxiv.org/abs/2307.14828": {"title": "Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley", "link": "http://arxiv.org/abs/2307.14828", "description": "Two-component mixture models have proved to be a powerful tool for modeling\nheterogeneity in several cluster analysis contexts. However, most methods based\non these models assume a constant behavior for the mixture weights, which can\nbe restrictive and unsuitable for some applications. In this paper, we relax\nthis assumption and allow the mixture weights to vary according to the index\n(e.g., time) to make the model more adaptive to a broader range of data sets.\nWe propose an efficient MCMC algorithm to jointly estimate both component\nparameters and dynamic weights from their posterior samples. We evaluate the\nmethod's performance by running Monte Carlo simulation studies under different\nscenarios for the dynamic weights. In addition, we apply the algorithm to a\ntime series that records the level reached by a river in southern Brazil. The\nTaquari River is a water body whose frequent flood inundations have caused\nvarious damage to riverside communities. Implementing a dynamic mixture model\nallows us to properly describe the flood regimes for the areas most affected by\nthese phenomena."}, "http://arxiv.org/abs/2310.06130": {"title": "Statistical inference for radially-stable generalized Pareto distributions and return level-sets in geometric extremes", "link": "http://arxiv.org/abs/2310.06130", "description": "We obtain a functional analogue of the quantile function for probability\nmeasures admitting a continuous Lebesgue density on $\\mathbb{R}^d$, and use it\nto characterize the class of non-trivial limit distributions of radially\nrecentered and rescaled multivariate exceedances in geometric extremes. A new\nclass of multivariate distributions is identified, termed radially stable\ngeneralized Pareto distributions, and is shown to admit certain stability\nproperties that permit extrapolation to extremal sets along any direction in\n$\\mathbb{R}^d$. Based on the limit Poisson point process likelihood of the\nradially renormalized point process of exceedances, we develop parsimonious\nstatistical models that exploit theoretical links between structural\nstar-bodies and are amenable to Bayesian inference. The star-bodies determine\nthe mean measure of the limit Poisson process through a hierarchical structure.\nOur framework sharpens statistical inference by suitably including additional\ninformation from the angular directions of the geometric exceedances and\nfacilitates efficient computations in dimensions $d=2$ and $d=3$. Additionally,\nit naturally leads to the notion of the return level-set, which is a canonical\nquantile set expressed in terms of its average recurrence interval, and a\ngeometric analogue of the uni-dimensional return level. 
We illustrate our\nmethods with a simulation study showing superior predictive performance of\nprobabilities of rare events, and with two case studies, one associated with\nriver flow extremes, and the other with oceanographic extremes."}, "http://arxiv.org/abs/2310.06252": {"title": "Power and sample size calculation of two-sample projection-based testing for sparsely observed functional data", "link": "http://arxiv.org/abs/2310.06252", "description": "Projection-based testing for mean trajectory differences in two groups of\nirregularly and sparsely observed functional data has garnered significant\nattention in the literature because it accommodates a wide spectrum of group\ndifferences and (non-stationary) covariance structures. This article presents\nthe derivation of the theoretical power function and the introduction of a\ncomprehensive power and sample size (PASS) calculation toolkit tailored to the\nprojection-based testing method developed by Wang (2021). Our approach\naccommodates a wide spectrum of group difference scenarios and a broad class of\ncovariance structures governing the underlying processes. Through extensive\nnumerical simulation, we demonstrate the robustness of this testing method by\nshowcasing that its statistical power remains nearly unaffected even when a\ncertain percentage of observations are missing, rendering it 'missing-immune'.\nFurthermore, we illustrate the practical utility of this test through analysis\nof two randomized controlled trials of Parkinson's disease. To facilitate\nimplementation, we provide a user-friendly R package fPASS, complete with a\ndetailed vignette to guide users through its practical application. We\nanticipate that this article will significantly enhance the usability of this\npotent statistical tool across a range of biostatistical applications, with a\nparticular focus on its relevance in the design of clinical trials."}, "http://arxiv.org/abs/2310.06315": {"title": "Ultra-high dimensional confounder selection algorithms comparison with application to radiomics data", "link": "http://arxiv.org/abs/2310.06315", "description": "Radiomics is an emerging area of medical imaging data analysis particularly\nfor cancer. It involves the conversion of digital medical images into mineable\nultra-high dimensional data. Machine learning algorithms are widely used in\nradiomics data analysis to develop powerful decision support model to improve\nprecision in diagnosis, assessment of prognosis and prediction of therapy\nresponse. However, machine learning algorithms for causal inference have not\nbeen previously employed in radiomics analysis. In this paper, we evaluate the\nvalue of machine learning algorithms for causal inference in radiomics. We\nselect three recent competitive variable selection algorithms for causal\ninference: outcome-adaptive lasso (OAL), generalized outcome-adaptive lasso\n(GOAL) and causal ball screening (CBS). We used a sure independence screening\nprocedure to propose an extension of GOAL and OAL for ultra-high dimensional\ndata, SIS + GOAL and SIS + OAL. We compared SIS + GOAL, SIS + OAL and CBS using\nsimulation study and two radiomics datasets in cancer, osteosarcoma and\ngliosarcoma. 
The two radiomics studies and the simulation study identified SIS\n+ GOAL as the optimal variable selection algorithm."}, "http://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "http://arxiv.org/abs/2310.06330", "description": "Markov chain Monte Carlo (MCMC) is a commonly used method for approximating\nexpectations with respect to probability distributions. Uncertainty assessment\nfor MCMC estimators is essential in practical applications. Moreover, for\nmultivariate functions of a Markov chain, it is important to estimate not only\nthe auto-correlation for each component but also to estimate\ncross-correlations, in order to better assess sample quality, improve estimates\nof effective sample size, and use more effective stopping rules. Berg and Song\n[2022] introduced the moment least squares (momentLS) estimator, a\nshape-constrained estimator for the autocovariance sequence from a reversible\nMarkov chain, for univariate functions of the Markov chain. Based on this\nsequence estimator, they proposed an estimator of the asymptotic variance of\nthe sample mean from MCMC samples. In this study, we propose novel\nautocovariance sequence and asymptotic variance estimators for Markov chain\nfunctions with multiple components, based on the univariate momentLS estimators\nfrom Berg and Song [2022]. We demonstrate strong consistency of the proposed\nauto(cross)-covariance sequence and asymptotic variance matrix estimators. We\nconduct empirical comparisons of our method with other state-of-the-art\napproaches on simulated and real-data examples, using popular samplers\nincluding the random-walk Metropolis sampler and the No-U-Turn sampler from\nSTAN."}, "http://arxiv.org/abs/2310.06357": {"title": "Adaptive Storey's null proportion estimator", "link": "http://arxiv.org/abs/2310.06357", "description": "False discovery rate (FDR) is a commonly used criterion in multiple testing\nand the Benjamini-Hochberg (BH) procedure is arguably the most popular approach\nwith FDR guarantee. To improve power, the adaptive BH procedure has been\nproposed by incorporating various null proportion estimators, among which\nStorey's estimator has gained substantial popularity. The performance of\nStorey's estimator hinges on a critical hyper-parameter, where a pre-fixed\nconfiguration lacks power and existing data-driven hyper-parameters compromise\nthe FDR control. In this work, we propose a novel class of adaptive\nhyper-parameters and establish the FDR control of the associated BH procedure\nusing a martingale argument. Within this class of data-driven hyper-parameters,\nwe present a specific configuration designed to maximize the number of\nrejections and characterize the convergence of this proposal to the optimal\nhyper-parameter under a commonly-used mixture model. We evaluate our adaptive\nStorey's null proportion estimator and the associated BH procedure on extensive\nsimulated data and a motivating protein dataset. Our proposal exhibits\nsignificant power gains when dealing with a considerable proportion of weak\nnon-nulls or a conservative null distribution."}, "http://arxiv.org/abs/2310.06467": {"title": "Advances in Kth nearest-neighbour clutter removal", "link": "http://arxiv.org/abs/2310.06467", "description": "We consider the problem of feature detection in the presence of clutter in\nspatial point processes. Classification methods have been developed in previous\nstudies. 
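The adaptive Storey entry above tunes the hyper-parameter of Storey's null-proportion estimator inside the Benjamini-Hochberg procedure. For orientation, here is a minimal sketch of the baseline it improves on: Storey's estimator with a fixed lambda plugged into BH. The paper's data-driven, martingale-justified choice of lambda is not reproduced, and lambda, alpha, and the toy data are assumptions.

```python
# Minimal sketch of the baseline adaptive BH procedure: Storey's null-proportion
# estimate with a fixed lambda, plugged into Benjamini-Hochberg.
import numpy as np

def storey_pi0(pvals, lam=0.5):
    # conservative estimate of the fraction of true null hypotheses
    return min(1.0, (1.0 + np.sum(pvals > lam)) / (len(pvals) * (1.0 - lam)))

def adaptive_bh(pvals, alpha=0.05, lam=0.5):
    pvals = np.asarray(pvals)
    m, pi0 = len(pvals), storey_pi0(pvals, lam)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0)   # step-up thresholds
    passed = np.nonzero(pvals[order] <= thresholds)[0]
    rejected = np.zeros(m, dtype=bool)
    if passed.size:
        rejected[order[: passed[-1] + 1]] = True           # reject up to the last crossing
    return rejected

# toy example: 900 uniform null p-values, 100 small non-null p-values
rng = np.random.default_rng(3)
p = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 5.0, size=100)])
print(adaptive_bh(p).sum(), "rejections at nominal FDR 0.05")
```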
Among these, Byers and Raftery (1998) models the observed Kth nearest\nneighbour distances as a mixture distribution and classifies the clutter and\nfeature points consequently. In this paper, we enhance such approach in two\nmanners. First, we propose an automatic procedure for selecting the number of\nnearest neighbours to consider in the classification method by means of\nsegmented regression models. Secondly, with the aim of applying the procedure\nmultiple times to get a ``better\" end result, we propose a stopping criterion\nthat minimizes the overall entropy measure of cluster separation between\nclutter and feature points. The proposed procedures are suitable for a feature\nwith clutter as two superimposed Poisson processes on any space, including\nlinear networks. We present simulations and two case studies of environmental\ndata to illustrate the method."}, "http://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "http://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated to a class\nof partially observed stochastic differential equations (SDE) driven by jump\nprocesses. Such type of models can be routinely found in applications, of which\nwe focus upon the case of neuroscience. The data are assumed to be observed\nregularly in time and driven by the SDE model with unknown parameters. In\npractice the SDE may not have an analytically tractable solution and this leads\nnaturally to a time-discretization. We adapt the multilevel Markov chain Monte\nCarlo method of [11], which works with a hierarchy of time discretizations and\nshow empirically and theoretically that this is preferable to using one single\ntime discretization. The improvement is in terms of the computational cost\nneeded to obtain a pre-specified numerical error. Our approach is illustrated\non models that are found in neuroscience."}, "http://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation due to adverse events", "link": "http://arxiv.org/abs/2310.06653", "description": "In clinical trials, patients sometimes discontinue study treatments\nprematurely due to reasons such as adverse events. Treatment discontinuation\noccurs after the randomisation as an intercurrent event, making causal\ninference more challenging. The Intention-To-Treat (ITT) analysis provides\nvalid causal estimates of the effect of treatment assignment; still, it does\nnot take into account whether or not patients had to discontinue the treatment\nprematurely. We propose to deal with the problem of treatment discontinuation\nusing principal stratification, recognised in the ICH E9(R1) addendum as a\nstrategy for handling intercurrent events. Under this approach, we can\ndecompose the overall ITT effect into principal causal effects for groups of\npatients defined by their potential discontinuation behaviour in continuous\ntime. In this framework, we must consider that discontinuation happening in\ncontinuous time generates an infinite number of principal strata and that\ndiscontinuation time is not defined for patients who would never discontinue.\nAn additional complication is that discontinuation time and time-to-event\noutcomes are subject to administrative censoring. We employ a flexible\nmodel-based Bayesian approach to deal with such complications. 
We apply the\nBayesian principal stratification framework to analyse synthetic data based on\na recent RCT in Oncology, aiming to assess the causal effects of a new\ninvestigational drug combined with standard of care vs. standard of care alone\non progression-free survival. We simulate data under different assumptions that\nreflect real situations where patients' behaviour depends on critical baseline\ncovariates. Finally, we highlight how such an approach makes it straightforward\nto characterise patients' discontinuation behaviour with respect to the\navailable covariates with the help of a simulation study."}, "http://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "http://arxiv.org/abs/2310.06673", "description": "An assurance calculation is a Bayesian alternative to a power calculation.\nOne may be performed to aid the planning of a clinical trial, specifically\nsetting the sample size or to support decisions about whether or not to perform\na study. Immuno-oncology (IO) is a rapidly evolving area in the development of\nanticancer drugs. A common phenomenon that arises from IO trials is one of\ndelayed treatment effects, that is, there is a delay in the separation of the\nsurvival curves. To calculate assurance for a trial in which a delayed\ntreatment effect is likely to be present, uncertainty about key parameters\nneeds to be considered. If uncertainty is not considered, then the number of\npatients recruited may not be enough to ensure we have adequate statistical\npower to detect a clinically relevant treatment effect. We present a new\nelicitation technique for when a delayed treatment effect is likely to be\npresent and show how to compute assurance using these elicited prior\ndistributions. We provide an example to illustrate how this could be used in\npractice. Open-source software is provided for implementing our methods. Our\nmethodology makes the benefits of assurance methods available for the planning\nof IO trials (and others where a delayed treatment effect is likely to occur)."}, "http://arxiv.org/abs/2310.06696": {"title": "Variable selection with FDR control for noisy data -- an application to screening metabolites that are associated with breast and colorectal cancer", "link": "http://arxiv.org/abs/2310.06696", "description": "The rapidly expanding field of metabolomics presents an invaluable resource\nfor understanding the associations between metabolites and various diseases.\nHowever, the high dimensionality, presence of missing values, and measurement\nerrors associated with metabolomics data can present challenges in developing\nreliable and reproducible methodologies for disease association studies.\nTherefore, there is a compelling need to develop robust statistical methods\nthat can navigate these complexities to achieve reliable and reproducible\ndisease association studies. In this paper, we focus on developing such a\nmethodology with an emphasis on controlling the False Discovery Rate during the\nscreening of mutual metabolomic signals for multiple disease outcomes. We\nillustrate the versatility and performance of this procedure in a variety of\nscenarios, dealing with missing data and measurement errors. As a specific\napplication of this novel methodology, we target two of the most prevalent\ncancers among US women: breast cancer and colorectal cancer. 
By applying our\nmethod to the Women's Health Initiative data, we successfully identify\nmetabolites that are associated with either or both of these cancers,\ndemonstrating the practical utility and potential of our method in identifying\nconsistent risk factors and understanding shared mechanisms between diseases."}, "http://arxiv.org/abs/2310.06708": {"title": "Adjustment with Three Continuous Variables", "link": "http://arxiv.org/abs/2310.06708", "description": "Spurious association between X and Y may be due to a confounding variable W.\nStatisticians may adjust for W using a variety of techniques. This paper\npresents the results of simulations conducted to assess the performance of\nthose techniques under various, elementary, data-generating processes. The\nresults indicate that no technique is best overall and that specific techniques\nshould be selected based on the particulars of the data-generating process.\nHere we show how causal graphs can guide the selection or design of techniques\nfor statistical adjustment. R programs are provided for researchers interested\nin generalization."}, "http://arxiv.org/abs/2310.06720": {"title": "Asymptotic theory for Bayesian inference and prediction: from the ordinary to a conditional Peaks-Over-Threshold method", "link": "http://arxiv.org/abs/2310.06720", "description": "The Peaks Over Threshold (POT) method is the most popular statistical method\nfor the analysis of univariate extremes. Even though there is a rich applied\nliterature on Bayesian inference for the POT method, there is no asymptotic\ntheory for such proposals. Even more importantly, the ambitious and challenging\nproblem of predicting future extreme events according to a proper probabilistic\nforecasting approach has received no attention to date. In this paper we\ndevelop the asymptotic theory (consistency, contraction rates, asymptotic\nnormality and asymptotic coverage of credible intervals) for the Bayesian\ninference based on the POT method. We extend such an asymptotic theory to cover\nthe Bayesian inference on the tail properties of the conditional distribution\nof a response random variable conditionally on a vector of random covariates.\nWith the aim of making accurate predictions of more severe extreme events than those\nobserved in the past, we specify the posterior predictive distribution of a\nfuture unobservable excess variable in the unconditional and conditional\napproaches and we prove that it is Wasserstein consistent and derive its contraction\nrates. Simulations show the good performance of the proposed Bayesian\ninferential methods. The analysis of the change in the frequency of financial\ncrises over time shows the utility of our methodology."}, "http://arxiv.org/abs/2310.06730": {"title": "Sparse topic modeling via spectral decomposition and thresholding", "link": "http://arxiv.org/abs/2310.06730", "description": "The probabilistic Latent Semantic Indexing model assumes that the expectation\nof the corpus matrix is low-rank and can be written as the product of a\ntopic-word matrix and a word-document matrix. In this paper, we study the\nestimation of the topic-word matrix under the additional assumption that the\nordered entries of its columns rapidly decay to zero. This sparsity assumption\nis motivated by the empirical observation that the word frequencies in a text\noften adhere to Zipf's law. 
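The Peaks-Over-Threshold entry above develops Bayesian asymptotic theory for the POT method. For context only, a minimal frequentist sketch of the underlying POT step is shown below: fit a generalized Pareto distribution to threshold excesses and extrapolate a tail probability. The threshold choice and toy data are assumptions, and none of the paper's Bayesian or predictive machinery is reproduced.

```python
# Context-only sketch of the classical POT step: keep exceedances over a high
# threshold, fit a generalized Pareto distribution to the excesses with scipy,
# and combine it with the empirical exceedance rate to estimate a tail probability.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_t(df=4, size=20_000)      # heavy-tailed toy data

u = np.quantile(x, 0.95)                   # threshold choice: an assumption
excesses = x[x > u] - u
shape, loc, scale = stats.genpareto.fit(excesses, floc=0.0)

# P(X > z) approximated as P(X > u) * GPD survival function evaluated at z - u
z = np.quantile(x, 0.999)
p_tail = (x > u).mean() * stats.genpareto.sf(z - u, shape, loc=0.0, scale=scale)
print(f"shape={shape:.2f}, scale={scale:.2f}, estimated P(X > z) ~ {p_tail:.4f}")
```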
We introduce a new spectral procedure for\nestimating the topic-word matrix that thresholds words based on their corpus\nfrequencies, and show that its $\\ell_1$-error rate under our sparsity\nassumption depends on the vocabulary size $p$ only via a logarithmic term. Our\nerror bound is valid for all parameter regimes and in particular for the\nsetting where $p$ is extremely large; this high-dimensional setting is commonly\nencountered but has not been adequately addressed in prior literature.\nFurthermore, our procedure also accommodates datasets that violate the\nseparability assumption, which is necessary for most prior approaches in topic\nmodeling. Experiments with synthetic data confirm that our procedure is\ncomputationally fast and allows for consistent estimation of the topic-word\nmatrix in a wide variety of parameter regimes. Our procedure also performs well\nrelative to well-established methods when applied to a large corpus of research\npaper abstracts, as well as the analysis of single-cell and microbiome data\nwhere the same statistical model is relevant but the parameter regimes are\nvastly different."}, "http://arxiv.org/abs/2310.06746": {"title": "Causal Rule Learning: Enhancing the Understanding of Heterogeneous Treatment Effect via Weighted Causal Rules", "link": "http://arxiv.org/abs/2310.06746", "description": "Interpretability is a key concern in estimating heterogeneous treatment\neffects using machine learning methods, especially for healthcare applications\nwhere high-stake decisions are often made. Inspired by the Predictive,\nDescriptive, Relevant framework of interpretability, we propose causal rule\nlearning which finds a refined set of causal rules characterizing potential\nsubgroups to estimate and enhance our understanding of heterogeneous treatment\neffects. Causal rule learning involves three phases: rule discovery, rule\nselection, and rule analysis. In the rule discovery phase, we utilize a causal\nforest to generate a pool of causal rules with corresponding subgroup average\ntreatment effects. The selection phase then employs a D-learning method to\nselect a subset of these rules to deconstruct individual-level treatment\neffects as a linear combination of the subgroup-level effects. This helps to\nanswer an ignored question by previous literature: what if an individual\nsimultaneously belongs to multiple groups with different average treatment\neffects? The rule analysis phase outlines a detailed procedure to further\nanalyze each rule in the subset from multiple perspectives, revealing the most\npromising rules for further validation. The rules themselves, their\ncorresponding subgroup treatment effects, and their weights in the linear\ncombination give us more insights into heterogeneous treatment effects.\nSimulation and real-world data analysis demonstrate the superior performance of\ncausal rule learning on the interpretable estimation of heterogeneous treatment\neffect when the ground truth is complex and the sample size is sufficient."}, "http://arxiv.org/abs/2310.06808": {"title": "Odds are the sign is right", "link": "http://arxiv.org/abs/2310.06808", "description": "This article introduces a new condition based on odds ratios for sensitivity\nanalysis. The analysis involves the average effect of a treatment or exposure\non a response or outcome with estimates adjusted for and conditional on a\nsingle, unmeasured, dichotomous covariate. 
Results of statistical simulations\nare displayed to show that the odds ratio condition is as reliable as other\ncommonly used conditions for sensitivity analysis. Other conditions utilize\nquantities reflective of a mediating covariate. The odds ratio condition can be\napplied when the covariate is a confounding variable. As an example application\nwe use the odds ratio condition to analyze and interpret a positive association\nobserved between Zika virus infection and birth defects."}, "http://arxiv.org/abs/2204.06030": {"title": "Variable importance measures for heterogeneous causal effects", "link": "http://arxiv.org/abs/2204.06030", "description": "The recognition that personalised treatment decisions lead to better clinical\noutcomes has sparked recent research activity in the following two domains.\nPolicy learning focuses on finding optimal treatment rules (OTRs), which\nexpress whether an individual would be better off with or without treatment,\ngiven their measured characteristics. OTRs optimize a pre-set population\ncriterion, but do not provide insight into the extent to which treatment\nbenefits or harms individual subjects. Estimates of conditional average\ntreatment effects (CATEs) do offer such insights, but valid inference is\ncurrently difficult to obtain when data-adaptive methods are used. Moreover,\nclinicians are (rightly) hesitant to blindly adopt OTR or CATE estimates, not\nleast since both may represent complicated functions of patient characteristics\nthat provide little insight into the key drivers of heterogeneity. To address\nthese limitations, we introduce novel nonparametric treatment effect variable\nimportance measures (TE-VIMs). TE-VIMs extend recent regression-VIMs, viewed as\nnonparametric analogues to ANOVA statistics. By not being tied to a particular\nmodel, they are amenable to data-adaptive (machine learning) estimation of the\nCATE, itself an active area of research. Estimators for the proposed statistics\nare derived from their efficient influence curves and these are illustrated\nthrough a simulation study and an applied example."}, "http://arxiv.org/abs/2204.07907": {"title": "Just Identified Indirect Inference Estimator: Accurate Inference through Bias Correction", "link": "http://arxiv.org/abs/2204.07907", "description": "An important challenge in statistical analysis lies in controlling the\nestimation bias when handling the ever-increasing data size and model\ncomplexity of modern data settings. In this paper, we propose a reliable\nestimation and inference approach for parametric models based on the Just\nIdentified iNdirect Inference estimator (JINI). The key advantage of our\napproach is that it allows to construct a consistent estimator in a simple\nmanner, while providing strong bias correction guarantees that lead to accurate\ninference. Our approach is particularly useful for complex parametric models,\nas it allows to bypass the analytical and computational difficulties (e.g., due\nto intractable estimating equation) typically encountered in standard\nprocedures. The properties of JINI (including consistency, asymptotic\nnormality, and its bias correction property) are also studied when the\nparameter dimension is allowed to diverge, which provide the theoretical\nfoundation to explain the advantageous performance of JINI in increasing\ndimensional covariates settings. 
Our simulations and an alcohol consumption\ndata analysis highlight the practical usefulness and excellent performance of\nJINI when data present features (e.g., misclassification, rounding) as well as\nin robust estimation."}, "http://arxiv.org/abs/2209.05598": {"title": "Learning domain-specific causal discovery from time series", "link": "http://arxiv.org/abs/2209.05598", "description": "Causal discovery (CD) from time-varying data is important in neuroscience,\nmedicine, and machine learning. Techniques for CD encompass randomized\nexperiments, which are generally unbiased but expensive, and algorithms such as\nGranger causality, conditional-independence-based, structural-equation-based,\nand score-based methods that are only accurate under strong assumptions made by\nhuman designers. However, as demonstrated in other areas of machine learning,\nhuman expertise is often not entirely accurate and tends to be outperformed in\ndomains with abundant data. In this study, we examine whether we can enhance\ndomain-specific causal discovery for time series using a data-driven approach.\nOur findings indicate that this procedure significantly outperforms\nhuman-designed, domain-agnostic causal discovery methods, such as Mutual\nInformation, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor,\nthe NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when\nfeasible, the causality field should consider a supervised approach in which\ndomain-specific CD procedures are learned from extensive datasets with known\ncausal relationships, rather than being designed by human specialists. Our\nfindings promise a new approach toward improving CD in neural and medical data\nand for the broader machine learning community."}, "http://arxiv.org/abs/2209.05795": {"title": "Joint modelling of the body and tail of bivariate data", "link": "http://arxiv.org/abs/2209.05795", "description": "In situations where both extreme and non-extreme data are of interest,\nmodelling the whole data set accurately is important. In a univariate\nframework, modelling the bulk and tail of a distribution has been extensively\nstudied before. However, when more than one variable is of concern, models that\naim specifically at capturing both regions correctly are scarce in the\nliterature. A dependence model that blends two copulas with different\ncharacteristics over the whole range of the data support is proposed. One\ncopula is tailored to the bulk and the other to the tail, with a dynamic\nweighting function employed to transition smoothly between them. Tail\ndependence properties are investigated numerically and simulation is used to\nconfirm that the blended model is sufficiently flexible to capture a wide\nvariety of structures. The model is applied to study the dependence between\ntemperature and ozone concentration at two sites in the UK and compared with a\nsingle copula fit. The proposed model provides a better, more flexible, fit to\nthe data, and is also capable of capturing complex dependence structures."}, "http://arxiv.org/abs/2212.14650": {"title": "Two-step estimators of high dimensional correlation matrices", "link": "http://arxiv.org/abs/2212.14650", "description": "We investigate block diagonal and hierarchical nested stochastic multivariate\nGaussian models by studying their sample cross-correlation matrix on high\ndimensions. 
By performing numerical simulations, we compare a filtered sample\ncross-correlation with the population cross-correlation matrices by using\nseveral rotationally invariant estimators (RIE) and hierarchical clustering\nestimators (HCE) under several loss functions. We show that at large but finite\nsample size, sample cross-correlation filtered by RIE estimators are often\noutperformed by HCE estimators for several of the loss functions. We also show\nthat for block models and for hierarchically nested block models the best\ndetermination of the filtered sample cross-correlation is achieved by\nintroducing two-step estimators combining state-of-the-art non-linear shrinkage\nmodels with hierarchical clustering estimators."}, "http://arxiv.org/abs/2302.02457": {"title": "Scalable inference in functional linear regression with streaming data", "link": "http://arxiv.org/abs/2302.02457", "description": "Traditional static functional data analysis is facing new challenges due to\nstreaming data, where data constantly flow in. A major challenge is that\nstoring such an ever-increasing amount of data in memory is nearly impossible.\nIn addition, existing inferential tools in online learning are mainly developed\nfor finite-dimensional problems, while inference methods for functional data\nare focused on the batch learning setting. In this paper, we tackle these\nissues by developing functional stochastic gradient descent algorithms and\nproposing an online bootstrap resampling procedure to systematically study the\ninference problem for functional linear regression. In particular, the proposed\nestimation and inference procedures use only one pass over the data; thus they\nare easy to implement and suitable to the situation where data arrive in a\nstreaming manner. Furthermore, we establish the convergence rate as well as the\nasymptotic distribution of the proposed estimator. Meanwhile, the proposed\nperturbed estimator from the bootstrap procedure is shown to enjoy the same\ntheoretical properties, which provide the theoretical justification for our\nonline inference tool. As far as we know, this is the first inference result on\nthe functional linear regression model with streaming data. Simulation studies\nare conducted to investigate the finite-sample performance of the proposed\nprocedure. An application is illustrated with the Beijing multi-site\nair-quality data."}, "http://arxiv.org/abs/2303.09598": {"title": "Variational Bayesian analysis of survival data using a log-logistic accelerated failure time model", "link": "http://arxiv.org/abs/2303.09598", "description": "The log-logistic regression model is one of the most commonly used\naccelerated failure time (AFT) models in survival analysis, for which\nstatistical inference methods are mainly established under the frequentist\nframework. Recently, Bayesian inference for log-logistic AFT models using\nMarkov chain Monte Carlo (MCMC) techniques has also been widely developed. In\nthis work, we develop an alternative approach to MCMC methods and infer the\nparameters of the log-logistic AFT model via a mean-field variational Bayes\n(VB) algorithm. A piecewise approximation technique is embedded in deriving the\nVB algorithm to achieve conjugacy. The proposed VB algorithm is evaluated and\ncompared with typical frequentist inferences and MCMC inference using simulated\ndata under various scenarios. A publicly available dataset is employed for\nillustration. 
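To make the one-pass estimation idea in the streaming functional linear regression abstract above concrete, here is a minimal functional stochastic gradient descent sketch. The grid, basis, step-size schedule, and data generator are illustrative assumptions, not the paper's algorithmic choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# One-pass sketch of functional SGD for y_i = \int X_i(t) beta(t) dt + noise.
grid = np.linspace(0.0, 1.0, 101)
basis = np.array([np.cos((k + 1) * np.pi * grid) for k in range(4)])
beta_true = basis[0] + 0.5 * basis[1]            # slope function lies in the basis span

def stream(n):
    for _ in range(n):
        X = rng.normal(size=4) @ basis           # one functional predictor curve
        y = np.trapz(X * beta_true, grid) + 0.1 * rng.normal()
        yield X, y                               # each observation is seen exactly once

beta_hat = np.zeros_like(grid)
for i, (X, y) in enumerate(stream(20_000), start=1):
    resid = np.trapz(X * beta_hat, grid) - y     # prediction error for the new curve
    beta_hat -= (2.0 / (20.0 + i) ** 0.75) * resid * X   # single-pass SGD update

print("L2 error of the slope estimate:",
      round(float(np.sqrt(np.trapz((beta_hat - beta_true) ** 2, grid))), 3))
```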
We demonstrate that the proposed VB algorithm can achieve good\nestimation accuracy and has a lower computational cost compared with MCMC\nmethods."}, "http://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "http://arxiv.org/abs/2304.03853", "description": "StepMix is an open-source Python package for the pseudo-likelihood estimation\n(one-, two- and three-step approaches) of generalized finite mixture models\n(latent profile and latent class analysis) with external variables (covariates\nand distal outcomes). In many applications in social sciences, the main\nobjective is not only to cluster individuals into latent classes, but also to\nuse these classes to develop more complex statistical models. These models\ngenerally divide into a measurement model that relates the latent classes to\nobserved indicators, and a structural model that relates covariates and outcome\nvariables to the latent classes. The measurement and structural models can be\nestimated jointly using the so-called one-step approach or sequentially using\nstepwise methods, which present significant advantages for practitioners\nregarding the interpretability of the estimated latent classes. In addition to\nthe one-step approach, StepMix implements the most important stepwise\nestimation methods from the literature, including the bias-adjusted three-step\nmethods with Bolck-Croon-Hagenaars and maximum likelihood corrections and the\nmore recent two-step approach. These pseudo-likelihood estimators are presented\nin this paper under a unified framework as specific expectation-maximization\nsubroutines. To facilitate and promote their adoption among the data science\ncommunity, StepMix follows the object-oriented design of the scikit-learn\nlibrary and provides an additional R wrapper."}, "http://arxiv.org/abs/2310.06926": {"title": "Bayesian inference and cure rate modeling for event history data", "link": "http://arxiv.org/abs/2310.06926", "description": "Estimating model parameters of a general family of cure models is always a\nchallenging task mainly due to flatness and multimodality of the likelihood\nfunction. In this work, we propose a fully Bayesian approach in order to\novercome these issues. Posterior inference is carried out by constructing a\nMetropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines\nGibbs sampling for the latent cure indicators and Metropolis-Hastings steps\nwith Langevin diffusion dynamics for parameter updates. The main MCMC algorithm\nis embedded within a parallel tempering scheme by considering heated versions\nof the target posterior distribution. It is demonstrated via simulations that\nthe proposed algorithm freely explores the multimodal posterior distribution\nand produces robust point estimates, while it outperforms maximum likelihood\nestimation via the Expectation-Maximization algorithm. A by-product of our\nBayesian implementation is to control the False Discovery Rate when classifying\nitems as cured or not. 
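As a generic illustration of the stepwise workflow the StepMix abstract above refers to, the sketch below runs a naive three-step analysis with scikit-learn components (measurement model, class assignment, structural model). This is not the StepMix API, and it omits the BCH and ML bias corrections that motivate the package; the data generator is an illustrative assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Simulated data: a binary latent class drives four indicators and one covariate.
n = 1500
z = rng.integers(0, 2, size=n)                                 # latent class
indicators = rng.normal(loc=z[:, None] * 2.0, size=(n, 4))     # measurement part
covariate = rng.normal(loc=0.8 * z, size=(n, 1))               # structural part

# Step 1: fit the measurement model on the indicators only.
mix = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(indicators)

# Step 2: assign each unit to its modal latent class.
assigned = mix.predict(indicators)

# Step 3: structural model relating the covariate to the assigned classes.
struct = LogisticRegression().fit(covariate, assigned)
print("structural slope (naive three-step):", struct.coef_.ravel())
```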
Finally, the proposed method is illustrated in a real\ndataset which refers to recidivism for offenders released from prison; the\nevent of interest is whether the offender was re-incarcerated after probation\nor not."}, "http://arxiv.org/abs/2310.06969": {"title": "Positivity-free Policy Learning with Observational Data", "link": "http://arxiv.org/abs/2310.06969", "description": "Policy learning utilizing observational data is pivotal across various\ndomains, with the objective of learning the optimal treatment assignment policy\nwhile adhering to specific constraints such as fairness, budget, and\nsimplicity. This study introduces a novel positivity-free (stochastic) policy\nlearning framework designed to address the challenges posed by the\nimpracticality of the positivity assumption in real-world scenarios. This\nframework leverages incremental propensity score policies to adjust propensity\nscore values instead of assigning fixed values to treatments. We characterize\nthese incremental propensity score policies and establish identification\nconditions, employing semiparametric efficiency theory to propose efficient\nestimators capable of achieving rapid convergence rates, even when integrated\nwith advanced machine learning algorithms. This paper provides a thorough\nexploration of the theoretical guarantees associated with policy learning and\nvalidates the proposed framework's finite-sample performance through\ncomprehensive numerical experiments, ensuring the identification of causal\neffects from observational data is both robust and reliable."}, "http://arxiv.org/abs/2310.07002": {"title": "Bayesian cross-validation by parallel Markov Chain Monte Carlo", "link": "http://arxiv.org/abs/2310.07002", "description": "Brute force cross-validation (CV) is a method for predictive assessment and\nmodel selection that is general and applicable to a wide range of Bayesian\nmodels. However, in many cases brute force CV is too computationally burdensome\nto form part of interactive modeling workflows, especially when inference\nrelies on Markov chain Monte Carlo (MCMC). In this paper we present a method\nfor conducting fast Bayesian CV by massively parallel MCMC. On suitable\naccelerator hardware, for many applications our approach is about as fast (in\nwall clock time) as a single full-data model fit.\n\nParallel CV is more flexible than existing fast CV approximation methods\nbecause it can easily exploit a wide range of scoring rules and data\npartitioning schemes. This is particularly useful for CV methods designed for\nnon-exchangeable data. Our approach also delivers accurate estimates of Monte\nCarlo and CV uncertainty. 
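The incremental propensity score policies mentioned in the positivity-free policy learning abstract above can be illustrated with a short sketch: the treatment odds are multiplied by a factor delta rather than set to 0 or 1, so no unit needs a propensity bounded away from 0 and 1. The data generator, the logistic nuisance model, and the simple IPW plug-in below are illustrative assumptions, not the paper's proposed estimators.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Simulated observational data with a confounded binary treatment.
n = 5000
X = rng.normal(size=(n, 2))
pi_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, pi_true)
Y = 1.0 + 2.0 * A + X[:, 0] + rng.normal(size=n)

pi_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]

def incremental_policy_mean(delta):
    # IPW-style plug-in for E[Y] when the treatment odds are multiplied by delta:
    # the shifted propensity is delta*pi / (delta*pi + 1 - pi).
    weight = (delta * A + (1 - A)) / (delta * pi_hat + 1 - pi_hat)
    return np.mean(weight * Y)

for delta in (0.5, 1.0, 2.0, 5.0):
    print(f"delta = {delta:>3}: estimated mean outcome = {incremental_policy_mean(delta):.3f}")
```

Setting delta = 1 leaves the observed treatment process unchanged, which is a convenient sanity check for the weights.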
In addition to parallelizing computations, parallel\nCV speeds up inference by reusing information from earlier MCMC adaptation and\ninference obtained during initial model fitting and checking of the full-data\nmodel.\n\nWe propose MCMC diagnostics for parallel CV applications, including a summary\nof MCMC mixing based on the popular potential scale reduction factor\n($\\hat{R}$) and MCMC effective sample size ($\\widehat{ESS}$) measures.\nFurthermore, we describe a method for determining whether an $\\hat{R}$\ndiagnostic indicates approximate stationarity of the chains, that may be of\nmore general interest for applications beyond parallel CV.\n\nFor parallel CV to work on memory-constrained computing accelerators, we show\nthat parallel CV and associated diagnostics can be implemented using online\n(streaming) algorithms ideal for parallel computing environments with limited\nmemory. Constant memory algorithms allow parallel CV to scale up to very large\nblocking designs."}, "http://arxiv.org/abs/2310.07016": {"title": "Discovering the Unknowns: A First Step", "link": "http://arxiv.org/abs/2310.07016", "description": "This article aims at discovering the unknown variables in the system through\ndata analysis. The main idea is to use the time of data collection as a\nsurrogate variable and try to identify the unknown variables by modeling\ngradual and sudden changes in the data. We use Gaussian process modeling and a\nsparse representation of the sudden changes to efficiently estimate the large\nnumber of parameters in the proposed statistical model. The method is tested on\na realistic dataset generated using a one-dimensional implementation of a\nMagnetized Liner Inertial Fusion (MagLIF) simulation model and encouraging\nresults are obtained."}, "http://arxiv.org/abs/2310.07107": {"title": "Root n consistent extremile regression and its supervised and semi-supervised learning", "link": "http://arxiv.org/abs/2310.07107", "description": "Extremile (Daouia, Gijbels and Stupfler,2019) is a novel and coherent measure\nof risk, determined by weighted expectations rather than tail probabilities. It\nfinds application in risk management, and, in contrast to quantiles, it\nfulfills the axioms of consistency, taking into account the severity of tail\nlosses. However, existing studies (Daouia, Gijbels and Stupfler,2019,2022) on\nextremile involve unknown distribution functions, making it challenging to\nobtain a root n-consistent estimator for unknown parameters in linear extremile\nregression. This article introduces a new definition of linear extremile\nregression and its estimation method, where the estimator is root n-consistent.\nAdditionally, while the analysis of unlabeled data for extremes presents a\nsignificant challenge and is currently a topic of great interest in machine\nlearning for various classification problems, we have developed a\nsemi-supervised framework for the proposed extremile regression using unlabeled\ndata. This framework can also enhance estimation accuracy under model\nmisspecification. Both simulations and real data analyses have been conducted\nto illustrate the finite sample performance of the proposed methods."}, "http://arxiv.org/abs/2310.07124": {"title": "Systematic simulation of age-period-cohort analysis: Demonstrating bias of Bayesian regularization", "link": "http://arxiv.org/abs/2310.07124", "description": "Age-period-cohort (APC) analysis is one of the fundamental time-series\nanalyses used in the social sciences. 
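The constant-memory requirement discussed in the parallel cross-validation abstract above is typically met with streaming accumulators. Below is a minimal Welford-style sketch that tracks a running mean and variance of per-fold log scores without storing the draws; the simulated inputs are placeholders.

```python
import numpy as np

class StreamingMeanVar:
    """Constant-memory (online) accumulator of mean and variance."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)       # Welford update

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

rng = np.random.default_rng(4)
acc = StreamingMeanVar()
for draw in rng.normal(loc=-1.3, scale=0.2, size=100_000):   # e.g. log predictive densities
    acc.update(draw)
print(round(acc.mean, 4), round(acc.variance, 5))
```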
This paper evaluates APC analysis via\nsystematic simulation in terms of how well the artificial parameters are\nrecovered. We consider three models of Bayesian regularization using normal\nprior distributions: the random effects model with reference to multilevel\nanalysis, the ridge regression model equivalent to the intrinsic estimator, and\nthe random walk model referred to as the Bayesian cohort model. The proposed\nsimulation generates artificial data through combinations of the linear\ncomponents, focusing on the fact that the identification problem affects the\nlinear components of the three effects. Among the 13 cases of artificial data,\nthe random walk model recovered the artificial parameters well in 10 cases,\nwhile the random effects model and the ridge regression model did so in 4\ncases. The cases in which the models failed to recover the artificial\nparameters show the estimated linear component of the cohort effects as close\nto zero. In conclusion, the models of Bayesian regularization in APC analysis\nhave a bias: the index weights have a large influence on the cohort effects and\nthese constraints drive the linear component of the cohort effects close to\nzero. However, the random walk model mitigates the underestimation of the linear\ncomponent of the cohort effects."}, "http://arxiv.org/abs/2310.07330": {"title": "Functional Generalized Canonical Correlation Analysis for studying multiple longitudinal variables", "link": "http://arxiv.org/abs/2310.07330", "description": "In this paper, we introduce Functional Generalized Canonical Correlation\nAnalysis (FGCCA), a new framework for exploring associations between multiple\nrandom processes observed jointly. The framework is based on the multiblock\nRegularized Generalized Canonical Correlation Analysis (RGCCA) framework. It is\nrobust to sparsely and irregularly observed data, making it applicable in many\nsettings. We establish the monotonic property of the solving procedure and\nintroduce a Bayesian approach for estimating canonical components. We propose\nan extension of the framework that allows the integration of a univariate or\nmultivariate response into the analysis, paving the way for predictive\napplications. We evaluate the method's efficiency in simulation studies and\npresent a use case on a longitudinal dataset."}, "http://arxiv.org/abs/2310.07364": {"title": "Statistical inference of high-dimensional vector autoregressive time series with non-i", "link": "http://arxiv.org/abs/2310.07364", "description": "The assumption of independent or i.i.d. innovations is essential in the\nliterature for analyzing a vector time series. However, this assumption is\neither too restrictive for a real-life time series to satisfy or is hard to\nverify through a hypothesis test. This paper performs statistical inference on\na sparse high-dimensional vector autoregressive time series, allowing its white\nnoise innovations to be dependent, even non-stationary. To achieve this goal,\nit adopts a post-selection estimator to fit the vector autoregressive model and\nderives the asymptotic distribution of the post-selection estimator. The\ninnovations in the autoregressive time series are not assumed to be\nindependent, thus making the covariance matrices of the autoregressive\ncoefficient estimators complex and difficult to estimate. Our work develops a\nbootstrap algorithm to facilitate practitioners in performing statistical\ninference without having to engage in sophisticated calculations. 
Simulations\nand real-life data experiments reveal the validity of the proposed methods and\ntheoretical results.\n\nReal-life data is rarely considered to exactly satisfy an autoregressive\nmodel with independent or i.i.d. innovations, so our work should better reflect\nthe reality compared to the literature that assumes i.i.d. innovations."}, "http://arxiv.org/abs/2310.07399": {"title": "Randomized Runge-Kutta-Nystr\\\"om", "link": "http://arxiv.org/abs/2310.07399", "description": "We present 5/2- and 7/2-order $L^2$-accurate randomized Runge-Kutta-Nystr\\\"om\nmethods to approximate the Hamiltonian flow underlying various non-reversible\nMarkov chain Monte Carlo chains including unadjusted Hamiltonian Monte Carlo\nand unadjusted kinetic Langevin chains. Quantitative 5/2-order $L^2$-accuracy\nupper bounds are provided under gradient and Hessian Lipschitz assumptions on\nthe potential energy function. The superior complexity of the corresponding\nMarkov chains is numerically demonstrated for a selection of `well-behaved',\nhigh-dimensional target distributions."}, "http://arxiv.org/abs/2310.07456": {"title": "Hierarchical Bayesian Claim Count modeling with Overdispersed Outcome and Mismeasured Covariates in Actuarial Practice", "link": "http://arxiv.org/abs/2310.07456", "description": "The problem of overdispersed claim counts and mismeasured covariates is\ncommon in insurance. On the one hand, the presence of overdispersion in the\ncount data violates the homogeneity assumption, and on the other hand,\nmeasurement errors in covariates highlight the model risk issue in actuarial\npractice. The consequence can be inaccurate premium pricing which would\nnegatively affect business competitiveness. Our goal is to address these two\nmodelling problems simultaneously by capturing the unobservable correlations\nbetween observations that arise from overdispersed outcome and mismeasured\ncovariate in actuarial process. To this end, we establish novel connections\nbetween the count-based generalized linear mixed model (GLMM) and a popular\nerror-correction tool for non-linear modelling - Simulation Extrapolation\n(SIMEX). We consider a modelling framework based on the hierarchical Bayesian\nparadigm. To our knowledge, the approach of combining a hierarchical Bayes with\nSIMEX has not previously been discussed in the literature. We demonstrate the\napplicability of our approach on the workplace absenteeism data. Our results\nindicate that the hierarchical Bayesian GLMM incorporated with the SIMEX\noutperforms naive GLMM / SIMEX in terms of goodness of fit."}, "http://arxiv.org/abs/2310.07567": {"title": "Comparing the effectiveness of k-different treatments through the area under the ROC curve", "link": "http://arxiv.org/abs/2310.07567", "description": "The area under the receiver-operating characteristic curve (AUC) has become a\npopular index not only for measuring the overall prediction capacity of a\nmarker but also the association strength between continuous and binary\nvariables. In the current study, it has been used for comparing the association\nsize of four different interventions involving impulsive decision making,\nstudied through an animal model, in which each animal provides several negative\n(pre-treatment) and positive (post-treatment) measures. The problem of the full\ncomparison of the average AUCs arises therefore in a natural way. 
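The SIMEX ingredient referenced in the claim count abstract above can be sketched in a few lines: inflate the measurement error at several levels lambda, record how a naive estimate degrades, and extrapolate back to lambda = -1 (no measurement error). The linear model, the known error variance, and the quadratic extrapolant are illustrative assumptions, not the hierarchical Bayesian GLMM/SIMEX combination the paper develops.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy error-in-variables regression with a known error standard deviation.
n, beta_true, sigma_u = 2000, 1.5, 0.8
x_true = rng.normal(size=n)
w = x_true + sigma_u * rng.normal(size=n)          # mismeasured covariate
y = beta_true * x_true + rng.normal(size=n)

def naive_slope(w_lam):
    return np.cov(w_lam, y, bias=True)[0, 1] / np.var(w_lam)

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slopes = []
for lam in lambdas:
    # Simulation step: add extra error with variance lam * sigma_u^2, average over replicates.
    sims = [naive_slope(w + np.sqrt(lam) * sigma_u * rng.normal(size=n)) for _ in range(200)]
    slopes.append(np.mean(sims))

coef = np.polyfit(lambdas, slopes, deg=2)          # extrapolation step
print("naive slope:", round(slopes[0], 3),
      "SIMEX-corrected slope:", round(float(np.polyval(coef, -1.0)), 3))
```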
We construct\nan analysis of variance (ANOVA) type test for testing the equality of the\nimpact of these treatments measured through the respective AUCs, and\nconsidering the random-effect represented by the animal. The use (and\ndevelopment) of a post-hoc Tukey's HSD type test is also considered. We explore\nthe finite-sample behavior of our proposal via Monte Carlo simulations, and\nanalyze the data generated from the original problem. An R package implementing\nthe procedures is provided as supplementary material."}, "http://arxiv.org/abs/2310.07605": {"title": "Split Knockoffs for Multiple Comparisons: Controlling the Directional False Discovery Rate", "link": "http://arxiv.org/abs/2310.07605", "description": "Multiple comparisons in hypothesis testing often encounter structural\nconstraints in various applications. For instance, in structural Magnetic\nResonance Imaging for Alzheimer's Disease, the focus extends beyond examining\natrophic brain regions to include comparisons of anatomically adjacent regions.\nThese constraints can be modeled as linear transformations of parameters, where\nthe sign patterns play a crucial role in estimating directional effects. This\nclass of problems, encompassing total variations, wavelet transforms, fused\nLASSO, trend filtering, and more, presents an open challenge in effectively\ncontrolling the directional false discovery rate. In this paper, we propose an\nextended Split Knockoff method specifically designed to address the control of\ndirectional false discovery rate under linear transformations. Our proposed\napproach relaxes the stringent linear manifold constraint to its neighborhood,\nemploying a variable splitting technique commonly used in optimization. This\nmethodology yields an orthogonal design that benefits both power and\ndirectional false discovery rate control. By incorporating a sample splitting\nscheme, we achieve effective control of the directional false discovery rate,\nwith a notable reduction to zero as the relaxed neighborhood expands. To\ndemonstrate the efficacy of our method, we conduct simulation experiments and\napply it to two real-world scenarios: Alzheimer's Disease analysis and human\nage comparisons."}, "http://arxiv.org/abs/2310.07680": {"title": "Hamiltonian Dynamics of Bayesian Inference Formalised by Arc Hamiltonian Systems", "link": "http://arxiv.org/abs/2310.07680", "description": "This paper makes two theoretical contributions. First, we establish a novel\nclass of Hamiltonian systems, called arc Hamiltonian systems, for saddle\nHamiltonian functions over infinite-dimensional metric spaces. Arc Hamiltonian\nsystems generate a flow that satisfies the law of conservation of energy\neverywhere in a metric space. They are governed by an extension of Hamilton's\nequation formulated based on (i) the framework of arc fields and (ii) an\ninfinite-dimensional gradient, termed the arc gradient, of a Hamiltonian\nfunction. We derive conditions for the existence of a flow generated by an arc\nHamiltonian system, showing that they reduce to local Lipschitz continuity of\nthe arc gradient under sufficient regularity. Second, we present two\nHamiltonian functions, called the cumulant generating functional and the\ncentred cumulant generating functional, over a metric space of log-likelihoods\nand measures. The former characterises the posterior of Bayesian inference as a\npart of the arc gradient that induces a flow of log-likelihoods and\nnon-negative measures. 
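A crude version of the comparison described in the AUC abstract above: compute a per-animal AUC from its negative (pre-treatment) and positive (post-treatment) measures, then compare treatments with a plain one-way ANOVA on those per-animal AUCs. The proper test additionally models the animal random effect, so this sketch, including its simulated data, is only an illustrative assumption.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)

def animal_auc(pre, post):
    scores = np.concatenate([pre, post])
    labels = np.r_[np.zeros(len(pre)), np.ones(len(post))]
    return roc_auc_score(labels, scores)            # Mann-Whitney-style AUC

effects = {"T1": 0.2, "T2": 0.8, "T3": 0.8, "T4": 1.4}   # assumed treatment shifts
aucs_by_treatment = {}
for name, shift in effects.items():
    aucs = []
    for _ in range(12):                              # 12 animals per treatment
        pre = rng.normal(size=8)
        post = rng.normal(loc=shift, size=8)
        aucs.append(animal_auc(pre, post))
    aucs_by_treatment[name] = aucs

stat, pval = f_oneway(*aucs_by_treatment.values())
print(f"ANOVA on per-animal AUCs: F = {stat:.2f}, p = {pval:.4f}")
```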
The latter characterises the difference of the posterior\nand the prior as a part of the arc gradient that induces a flow of\nlog-likelihoods and probability measures. Our results reveal an implication of\nthe belief updating mechanism from the prior to the posterior as an\ninfinitesimal change of a measure in the infinite-dimensional Hamiltonian\nflows."}, "http://arxiv.org/abs/2009.12217": {"title": "Latent Causal Socioeconomic Health Index", "link": "http://arxiv.org/abs/2009.12217", "description": "This research develops a model-based LAtent Causal Socioeconomic Health\n(LACSH) index at the national level. Motivated by the need for a holistic\nnational well-being index, we build upon the latent health factor index (LHFI)\napproach that has been used to assess the unobservable ecological/ecosystem\nhealth. LHFI integratively models the relationship between metrics, latent\nhealth, and covariates that drive the notion of health. In this paper, the LHFI\nstructure is integrated with spatial modeling and statistical causal modeling.\nOur efforts are focused on developing the integrated framework to facilitate\nthe understanding of how an observational continuous variable might have\ncausally affected a latent trait that exhibits spatial correlation. A novel\nvisualization technique to evaluate covariate balance is also introduced for\nthe case of a continuous policy (treatment) variable. Our resulting LACSH\nframework and visualization tool are illustrated through two global case\nstudies on national socioeconomic health (latent trait), each with various\nmetrics and covariates pertaining to different aspects of societal health, and\nthe treatment variable being mandatory maternity leave days and government\nexpenditure on healthcare, respectively. We validate our model by two\nsimulation studies. All approaches are structured in a Bayesian hierarchical\nframework and results are obtained by Markov chain Monte Carlo techniques."}, "http://arxiv.org/abs/2201.02958": {"title": "Smooth Nested Simulation: Bridging Cubic and Square Root Convergence Rates in High Dimensions", "link": "http://arxiv.org/abs/2201.02958", "description": "Nested simulation concerns estimating functionals of a conditional\nexpectation via simulation. In this paper, we propose a new method based on\nkernel ridge regression to exploit the smoothness of the conditional\nexpectation as a function of the multidimensional conditioning variable.\nAsymptotic analysis shows that the proposed method can effectively alleviate\nthe curse of dimensionality on the convergence rate as the simulation budget\nincreases, provided that the conditional expectation is sufficiently smooth.\nThe smoothness bridges the gap between the cubic root convergence rate (that\nis, the optimal rate for the standard nested simulation) and the square root\nconvergence rate (that is, the canonical rate for the standard Monte Carlo\nsimulation). We demonstrate the performance of the proposed method via\nnumerical examples from portfolio risk management and input uncertainty\nquantification."}, "http://arxiv.org/abs/2204.12635": {"title": "Multivariate and regression models for directional data based on projected P\\'olya trees", "link": "http://arxiv.org/abs/2204.12635", "description": "Projected distributions have proved to be useful in the study of circular and\ndirectional data. Although any multivariate distribution can be used to produce\na projected model, these distributions are typically parametric. 
In this\narticle we consider a multivariate P\\'olya tree on $R^k$ and project it to the\nunit hypersphere $S^k$ to define a new Bayesian nonparametric model for\ndirectional data. We study the properties of the proposed model and in\nparticular, concentrate on the implied conditional distributions of some\ndirections given the others to define a directional-directional regression\nmodel. We also define a multivariate linear regression model with P\\'olya tree\nerror and project it to define a linear-directional regression model. We obtain\nthe posterior characterisation of all models and show their performance with\nsimulated and real datasets."}, "http://arxiv.org/abs/2207.13250": {"title": "Spatio-Temporal Wildfire Prediction using Multi-Modal Data", "link": "http://arxiv.org/abs/2207.13250", "description": "Due to severe societal and environmental impacts, wildfire prediction using\nmulti-modal sensing data has become a highly sought-after data-analytical tool\nby various stakeholders (such as state governments and power utility companies)\nto achieve a more informed understanding of wildfire activities and plan\npreventive measures. A desirable algorithm should precisely predict fire risk\nand magnitude for a location in real time. In this paper, we develop a flexible\nspatio-temporal wildfire prediction framework using multi-modal time series\ndata. We first predict the wildfire risk (the chance of a wildfire event) in\nreal-time, considering the historical events using discrete mutually exciting\npoint process models. Then we further develop a wildfire magnitude prediction\nset method based on the flexible distribution-free time-series conformal\nprediction (CP) approach. Theoretically, we prove a risk model parameter\nrecovery guarantee, as well as coverage and set size guarantees for the CP\nsets. Through extensive real-data experiments with wildfire data in California,\nwe demonstrate the effectiveness of our methods, as well as their flexibility\nand scalability in large regions."}, "http://arxiv.org/abs/2210.13550": {"title": "Regularized Nonlinear Regression with Dependent Errors and its Application to a Biomechanical Model", "link": "http://arxiv.org/abs/2210.13550", "description": "A biomechanical model often requires parameter estimation and selection in a\nknown but complicated nonlinear function. Motivated by observing that data from\na head-neck position tracking system, one of biomechanical models, show\nmultiplicative time dependent errors, we develop a modified penalized weighted\nleast squares estimator. The proposed method can be also applied to a model\nwith non-zero mean time dependent additive errors. Asymptotic properties of the\nproposed estimator are investigated under mild conditions on a weight matrix\nand the error process. A simulation study demonstrates that the proposed\nestimation works well in both parameter estimation and selection with time\ndependent error. The analysis and comparison with an existing method for\nhead-neck position tracking data show better performance of the proposed method\nin terms of the variance accounted for (VAF)."}, "http://arxiv.org/abs/2210.14965": {"title": "Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions", "link": "http://arxiv.org/abs/2210.14965", "description": "This paper adopts a tool from computational topology, the Euler\ncharacteristic curve (ECC) of a sample, to perform one- and two-sample goodness\nof fit tests. We call our procedure TopoTests. 
The presented tests work for\nsamples of arbitrary dimension, having comparable power to the state-of-the-art\ntests in the one-dimensional case. It is demonstrated that the type I error of\nTopoTests can be controlled and their type II error vanishes exponentially with\nincreasing sample size. Extensive numerical simulations of TopoTests are\nconducted to demonstrate their power for samples of various sizes."}, "http://arxiv.org/abs/2211.03860": {"title": "Automatic Change-Point Detection in Time Series via Deep Learning", "link": "http://arxiv.org/abs/2211.03860", "description": "Detecting change-points in data is challenging because of the range of\npossible types of change and types of behaviour of data when there is no\nchange. Statistically efficient methods for detecting a change will depend on\nboth of these features, and it can be difficult for a practitioner to develop\nan appropriate detection method for their application of interest. We show how\nto automatically generate new offline detection methods based on training a\nneural network. Our approach is motivated by many existing tests for the\npresence of a change-point being representable by a simple neural network, and\nthus a neural network trained with sufficient data should have performance at\nleast as good as these methods. We present theory that quantifies the error\nrate for such an approach, and how it depends on the amount of training data.\nEmpirical results show that, even with limited training data, its performance\nis competitive with the standard CUSUM-based classifier for detecting a change\nin mean when the noise is independent and Gaussian, and can substantially\noutperform it in the presence of auto-correlated or heavy-tailed noise. Our\nmethod also shows strong results in detecting and localising changes in\nactivity based on accelerometer data."}, "http://arxiv.org/abs/2211.09099": {"title": "Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2211.09099", "description": "The Brazil Bolsa Familia (BF) program is a conditional cash transfer program\naimed to reduce short-term poverty by direct cash transfers and to fight\nlong-term poverty by increasing human capital among poor Brazilian people.\nEligibility for Bolsa Familia benefits depends on a cutoff rule, which\nclassifies the BF study as a regression discontinuity (RD) design. Extracting\ncausal information from RD studies is challenging. Following Li et al (2015)\nand Branson and Mealli (2019), we formally describe the BF RD design as a local\nrandomized experiment within the potential outcome approach. Under this\nframework, causal effects can be identified and estimated on a subpopulation\nwhere a local overlap assumption, a local SUTVA and a local ignorability\nassumption hold. We first discuss the potential advantages of this framework\nover local regression methods based on continuity assumptions, which concern\nthe definition of the causal estimands, the design and the analysis of the\nstudy, and the interpretation and generalizability of the results. A critical\nissue of this local randomization approach is how to choose subpopulations for\nwhich we can draw valid causal inference. We propose a Bayesian model-based\nfinite mixture approach to clustering to classify observations into\nsubpopulations where the RD assumptions hold and do not hold. 
This approach has\nimportant advantages: a) it allows one to account for the uncertainty in the\nsubpopulation membership, which is typically neglected; b) it does not impose\nany constraint on the shape of the subpopulation; c) it is scalable to\nhigh-dimensional settings; d) it allows the targeting of causal estimands other\nthan the average treatment effect (ATE); and e) it is robust to a certain\ndegree of manipulation/selection of the running variable. We apply our proposed\napproach to assess causal effects of the Bolsa Familia program on leprosy\nincidence in 2009."}, "http://arxiv.org/abs/2301.08276": {"title": "Cross-validatory model selection for Bayesian autoregressions with exogenous regressors", "link": "http://arxiv.org/abs/2301.08276", "description": "Bayesian cross-validation (CV) is a popular method for predictive model\nassessment that is simple to implement and broadly applicable. A wide range of\nCV schemes is available for time series applications, including generic\nleave-one-out (LOO) and K-fold methods, as well as specialized approaches\nintended to deal with serial dependence such as leave-future-out (LFO),\nh-block, and hv-block.\n\nExisting large-sample results show that both specialized and generic methods\nare applicable to models of serially-dependent data. However, large sample\nconsistency results overlook the impact of sampling variability on accuracy in\nfinite samples. Moreover, the accuracy of a CV scheme depends on many aspects\nof the procedure. We show that poor design choices can lead to elevated rates\nof adverse selection.\n\nIn this paper, we consider the problem of identifying the regression\ncomponent of an important class of models of data with serial dependence,\nautoregressions of order p with q exogenous regressors (ARX(p,q)), under the\nlogarithmic scoring rule. We show that when serial dependence is present,\nscores computed using the joint (multivariate) density have lower variance and\nbetter model selection accuracy than the popular pointwise estimator. In\naddition, we present a detailed case study of the special case of ARX models\nwith fixed autoregressive structure and variance. For this class, we derive the\nfinite-sample distribution of the CV estimators and the model selection\nstatistic. We conclude with recommendations for practitioners."}, "http://arxiv.org/abs/2301.12026": {"title": "G-formula for causal inference via multiple imputation", "link": "http://arxiv.org/abs/2301.12026", "description": "G-formula is a popular approach for estimating treatment or exposure effects\nfrom longitudinal data that are subject to time-varying confounding. G-formula\nestimation is typically performed by Monte-Carlo simulation, with\nnon-parametric bootstrapping used for inference. We show that G-formula can be\nimplemented by exploiting existing methods for multiple imputation (MI) for\nsynthetic data. This involves using an existing modified version of Rubin's\nvariance estimator. In practice, missing data is ubiquitous in longitudinal\ndatasets. We show that such missing data can be readily accommodated as part of\nthe MI procedure when using G-formula, and describe how MI software can be used\nto implement the approach. 
We explore its performance using a simulation study\nand an application from cystic fibrosis."}, "http://arxiv.org/abs/2306.01292": {"title": "Alternative Measures of Direct and Indirect Effects", "link": "http://arxiv.org/abs/2306.01292", "description": "There are a number of measures of direct and indirect effects in the\nliterature. They are suitable in some cases and unsuitable in others. We\ndescribe a case where the existing measures are unsuitable and propose new\nsuitable ones. We also show that the new measures can partially handle\nunmeasured treatment-outcome confounding, and bound long-term effects by\ncombining experimental and observational data."}, "http://arxiv.org/abs/2309.11942": {"title": "On the Probability of Immunity", "link": "http://arxiv.org/abs/2309.11942", "description": "This work is devoted to the study of the probability of immunity, i.e. the\neffect occurs whether exposed or not. We derive necessary and sufficient\nconditions for non-immunity and $\\epsilon$-bounded immunity, i.e. the\nprobability of immunity is zero and $\\epsilon$-bounded, respectively. The\nformer allows us to estimate the probability of benefit (i.e., the effect\noccurs if and only if exposed) from a randomized controlled trial, and the\nlatter allows us to produce bounds of the probability of benefit that are\ntighter than the existing ones. We also introduce the concept of indirect\nimmunity (i.e., through a mediator) and repeat our previous analysis for it.\nFinally, we propose a method for sensitivity analysis of the probability of\nimmunity under unmeasured confounding."}, "http://arxiv.org/abs/2309.13441": {"title": "Anytime valid and asymptotically optimal statistical inference driven by predictive recursion", "link": "http://arxiv.org/abs/2309.13441", "description": "Distinguishing two classes of candidate models is a fundamental and\npractically important problem in statistical inference. Error rate control is\ncrucial to the logic but, in complex nonparametric settings, such guarantees\ncan be difficult to achieve, especially when the stopping rule that determines\nthe data collection process is not available. In this paper we develop a novel\ne-process construction that leverages the so-called predictive recursion (PR)\nalgorithm designed to rapidly and recursively fit nonparametric mixture models.\nThe resulting PRe-process affords anytime valid inference uniformly over\nstopping rules and is shown to be efficient in the sense that it achieves the\nmaximal growth rate under the alternative relative to the mixture model being\nfit by PR. In the special case of testing for a log-concave density, the\nPRe-process test is computationally simpler and faster, more stable, and no\nless efficient compared to a recently proposed anytime valid test."}, "http://arxiv.org/abs/2309.16598": {"title": "Cross-Prediction-Powered Inference", "link": "http://arxiv.org/abs/2309.16598", "description": "While reliable data-driven decision-making hinges on high-quality labeled\ndata, the acquisition of quality labels often involves laborious human\nannotations or slow and expensive scientific measurements. Machine learning is\nbecoming an appealing alternative as sophisticated predictive techniques are\nbeing used to quickly and cheaply produce large amounts of predicted labels;\ne.g., predicted protein structures are used to supplement experimentally\nderived structures, predictions of socioeconomic indicators from satellite\nimagery are used to supplement accurate survey data, and so on. 
Since\npredictions are imperfect and potentially biased, this practice brings into\nquestion the validity of downstream inferences. We introduce cross-prediction:\na method for valid inference powered by machine learning. With a small labeled\ndataset and a large unlabeled dataset, cross-prediction imputes the missing\nlabels via machine learning and applies a form of debiasing to remedy the\nprediction inaccuracies. The resulting inferences achieve the desired error\nprobability and are more powerful than those that only leverage the labeled\ndata. Closely related is the recent proposal of prediction-powered inference,\nwhich assumes that a good pre-trained model is already available. We show that\ncross-prediction is consistently more powerful than an adaptation of\nprediction-powered inference in which a fraction of the labeled data is split\noff and used to train the model. Finally, we observe that cross-prediction\ngives more stable conclusions than its competitors; its confidence intervals\ntypically have significantly lower variability."}, "http://arxiv.org/abs/2310.07801": {"title": "Trajectory-aware Principal Manifold Framework for Data Augmentation and Image Generation", "link": "http://arxiv.org/abs/2310.07801", "description": "Data augmentation for deep learning benefits model training, image\ntransformation, medical imaging analysis and many other fields. Many existing\nmethods generate new samples from a parametric distribution, like the Gaussian,\nwith little attention paid to generating samples along the data manifold in either the\ninput or feature space. In this paper, we verify that there are theoretical and\npractical advantages to using the principal manifold hidden in the feature\nspace over the Gaussian distribution. We then propose a novel trajectory-aware\nprincipal manifold framework to restore the manifold backbone and generate\nsamples along a specific trajectory. On top of the autoencoder architecture, we\nfurther introduce an intrinsic dimension regularization term to make the\nmanifold more compact and enable few-shot image generation. Experimental\nresults show that the novel framework is able to extract a more compact manifold\nrepresentation, improve classification accuracy and generate smooth\ntransformation among few samples."}, "http://arxiv.org/abs/2310.07817": {"title": "Nonlinear global Fr\\'echet regression for random objects via weak conditional expectation", "link": "http://arxiv.org/abs/2310.07817", "description": "Random objects are complex non-Euclidean data taking values in a general metric\nspace, possibly devoid of any underlying vector space structure. Such data are\ngetting increasingly abundant with the rapid advancement in technology.\nExamples include probability distributions, positive semi-definite matrices,\nand data on Riemannian manifolds. However, except for regression for\nobject-valued response with Euclidean predictors and\ndistribution-on-distribution regression, there has been limited development of\na general framework for object-valued response with object-valued predictors in\nthe literature. To fill this gap, we introduce the notion of a weak conditional\nFr\\'echet mean based on Carleman operators and then propose a global nonlinear\nFr\\'echet regression model through the reproducing kernel Hilbert space (RKHS)\nembedding. Furthermore, we establish the relationships between the conditional\nFr\\'echet mean and the weak conditional Fr\\'echet mean for both Euclidean and\nobject-valued data. 
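The impute-then-debias step in the cross-prediction abstract above can be illustrated with a simplified prediction-powered-style estimate of a mean. The full method additionally cross-fits the predictive model, which this sketch omits; the linear model and data generator are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Small labeled sample, large unlabeled sample.
n_lab, n_unlab = 200, 20_000
X_lab = rng.normal(size=(n_lab, 3))
Y_lab = X_lab @ np.array([1.0, -0.5, 0.2]) + 0.3 + rng.normal(size=n_lab)
X_unlab = rng.normal(size=(n_unlab, 3))

model = LinearRegression().fit(X_lab, Y_lab)       # stand-in for any ML predictor

classical = Y_lab.mean()                                      # labeled data only
imputed = model.predict(X_unlab).mean()                       # predictions only
debiased = imputed + (Y_lab - model.predict(X_lab)).mean()    # add the rectifier term

print(f"classical: {classical:.3f}  imputed-only: {imputed:.3f}  debiased: {debiased:.3f}")
```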
We also show that the state-of-the-art global Fr\\'echet\nregression developed by Petersen and Mueller, 2019 emerges as a special case of\nour method by choosing a linear kernel. We require that the metric space for\nthe predictor admits a reproducing kernel, while the intrinsic geometry of the\nmetric space for the response is utilized to study the asymptotic properties of\nthe proposed estimates. Numerical studies, including extensive simulations and\na real application, are conducted to investigate the performance of our\nestimator in a finite sample."}, "http://arxiv.org/abs/2310.07850": {"title": "Conformal prediction with local weights: randomization enables local guarantees", "link": "http://arxiv.org/abs/2310.07850", "description": "In this work, we consider the problem of building distribution-free\nprediction intervals with finite-sample conditional coverage guarantees.\nConformal prediction (CP) is an increasingly popular framework for building\nprediction intervals with distribution-free guarantees, but these guarantees\nonly ensure marginal coverage: the probability of coverage is averaged over a\nrandom draw of both the training and test data, meaning that there might be\nsubstantial undercoverage within certain subpopulations. Instead, ideally, we\nwould want to have local coverage guarantees that hold for each possible value\nof the test point's features. While the impossibility of achieving pointwise\nlocal coverage is well established in the literature, many variants of\nconformal prediction algorithm show favorable local coverage properties\nempirically. Relaxing the definition of local coverage can allow for a\ntheoretical understanding of this empirical phenomenon. We aim to bridge this\ngap between theoretical validation and empirical performance by proving\nachievable and interpretable guarantees for a relaxed notion of local coverage.\nBuilding on the localized CP method of Guan (2023) and the weighted CP\nframework of Tibshirani et al. (2019), we propose a new method,\nrandomly-localized conformal prediction (RLCP), which returns prediction\nintervals that are not only marginally valid but also achieve a relaxed local\ncoverage guarantee and guarantees under covariate shift. Through a series of\nsimulations and real data experiments, we validate these coverage guarantees of\nRLCP while comparing it with the other local conformal prediction methods."}, "http://arxiv.org/abs/2310.07852": {"title": "On the Computational Complexity of Private High-dimensional Model Selection via the Exponential Mechanism", "link": "http://arxiv.org/abs/2310.07852", "description": "We consider the problem of model selection in a high-dimensional sparse\nlinear regression model under the differential privacy framework. In\nparticular, we consider the problem of differentially private best subset\nselection and study its utility guarantee. We adopt the well-known exponential\nmechanism for selecting the best model, and under a certain margin condition,\nwe establish its strong model recovery property. However, the exponential\nsearch space of the exponential mechanism poses a serious computational\nbottleneck. To overcome this challenge, we propose a Metropolis-Hastings\nalgorithm for the sampling step and establish its polynomial mixing time to its\nstationary distribution in the problem parameters $n,p$, and $s$. Furthermore,\nwe also establish approximate differential privacy for the final estimates of\nthe Metropolis-Hastings random walk using its mixing property. 
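For reference, the sketch below implements plain split conformal prediction, the marginal-coverage baseline that the randomly-localized method above refines. The regression model, data generator, and nominal level are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)

n, alpha = 3000, 0.1
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=n)

half = n // 2
model = RandomForestRegressor(random_state=0).fit(X[:half], Y[:half])   # fit on first split

resid = np.abs(Y[half:] - model.predict(X[half:]))         # calibration residuals
k = int(np.ceil((len(resid) + 1) * (1 - alpha)))            # finite-sample quantile index
q = np.sort(resid)[k - 1]

x_new = np.array([[0.5]])
pred = model.predict(x_new)[0]
print(f"90% prediction interval at x=0.5: [{pred - q:.2f}, {pred + q:.2f}]")
```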
Finally, we also\nperform some illustrative simulations that echo the theoretical findings of our\nmain results."}, "http://arxiv.org/abs/2310.07935": {"title": "Estimating the Likelihood of Arrest from Police Records in Presence of Unreported Crimes", "link": "http://arxiv.org/abs/2310.07935", "description": "Many important policy decisions concerning policing hinge on our\nunderstanding of how likely various criminal offenses are to result in arrests.\nSince many crimes are never reported to law enforcement, estimates based on\npolice records alone must be adjusted to account for the likelihood that each\ncrime would have been reported to the police. In this paper, we present a\nmethodological framework for estimating the likelihood of arrest from police\ndata that incorporates estimates of crime reporting rates computed from a\nvictimization survey. We propose a parametric regression-based two-step\nestimator that (i) estimates the likelihood of crime reporting using logistic\nregression with survey weights; and then (ii) applies a second regression step\nto model the likelihood of arrest. Our empirical analysis focuses on racial\ndisparities in arrests for violent crimes (sex offenses, robbery, aggravated\nand simple assaults) from 2006--2015 police records from the National Incident\nBased Reporting System (NIBRS), with estimates of crime reporting obtained\nusing 2003--2020 data from the National Crime Victimization Survey (NCVS). We\nfind that, after adjusting for unreported crimes, the likelihood of arrest\ncomputed from police records decreases significantly. We also find that, while\nincidents with white offenders on average result in arrests more often than\nthose with black offenders, the disparities tend to be small after accounting\nfor crime characteristics and unreported crimes."}, "http://arxiv.org/abs/2310.07953": {"title": "Enhancing Sample Quality through Minimum Energy Importance Weights", "link": "http://arxiv.org/abs/2310.07953", "description": "Importance sampling is a powerful tool for correcting the distributional\nmismatch in many statistical and machine learning problems, but in practice its\nperformance is limited by the usage of simple proposals whose importance\nweights can be computed analytically. To address this limitation, Liu and Lee\n(2017) proposed a Black-Box Importance Sampling (BBIS) algorithm that computes\nthe importance weights for arbitrary simulated samples by minimizing the\nkernelized Stein discrepancy. However, this requires knowing the score function\nof the target distribution, which is not easy to compute for many Bayesian\nproblems. Hence, in this paper we propose another novel BBIS algorithm using\nminimum energy design, BBIS-MED, that requires only the unnormalized density\nfunction, which can be utilized as a post-processing step to improve the\nquality of Markov Chain Monte Carlo samples. We demonstrate the effectiveness\nand wide applicability of our proposed BBIS-MED algorithm on extensive\nsimulations and a real-world Bayesian model calibration problem where the score\nfunction cannot be derived analytically."}, "http://arxiv.org/abs/2310.07958": {"title": "Towards Causal Deep Learning for Vulnerability Detection", "link": "http://arxiv.org/abs/2310.07958", "description": "Deep learning vulnerability detection has shown promising results in recent\nyears. 
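A simplified sketch in the spirit of the two-step estimator described in the arrest-likelihood abstract above: a survey-weighted logistic model for crime reporting, a second logistic model for arrest among reported incidents, and a simple product combination. All data, features, and the combination rule are illustrative assumptions, not the paper's models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)

# Step (i): reporting model from survey-style data with survey weights.
n_survey = 4000
X_survey = rng.normal(size=(n_survey, 3))
reported = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + X_survey[:, 0]))))
survey_weights = rng.uniform(0.5, 2.0, size=n_survey)
step1 = LogisticRegression().fit(X_survey, reported, sample_weight=survey_weights)

# Step (ii): arrest model from police-style records of reported incidents.
n_police = 6000
X_police = rng.normal(size=(n_police, 3))
arrested = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * X_police[:, 1]))))
step2 = LogisticRegression().fit(X_police, arrested)

x0 = np.zeros((1, 3))                                  # a reference incident profile
p_report = step1.predict_proba(x0)[0, 1]
p_arrest_given_report = step2.predict_proba(x0)[0, 1]
print("unadjusted P(arrest | reported):", round(p_arrest_given_report, 3))
print("adjusted   P(arrest | crime)   :", round(p_arrest_given_report * p_report, 3))
```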
However, an important challenge that still blocks it from being very\nuseful in practice is that the model is not robust under perturbation and it\ncannot generalize well to out-of-distribution (OOD) data, e.g., when applying\na trained model to unseen projects in the real world. We hypothesize that this is\nbecause the model learned non-robust features, e.g., variable names, that have\nspurious correlations with labels. When the perturbed and OOD datasets no\nlonger have the same spurious features, the model prediction fails. To address\nthe challenge, in this paper, we introduced causality into deep learning\nvulnerability detection. Our approach CausalVul consists of two phases. First,\nwe designed novel perturbations to discover spurious features that the model\nmay use to make predictions. Second, we applied the causal learning algorithms,\nspecifically, do-calculus, on top of existing deep learning models to\nsystematically remove the use of spurious features and thus promote causal-based\nprediction. Our results show that CausalVul consistently improved the\nmodel accuracy, robustness and OOD performance for all the state-of-the-art\nmodels and datasets we experimented with. To the best of our knowledge, this is the\nfirst work that introduces do-calculus-based causal learning to software\nengineering models and shows it is indeed useful for improving the model\naccuracy, robustness and generalization. Our replication package is located at\nhttps://figshare.com/s/0ffda320dcb96c249ef2."}, "http://arxiv.org/abs/2310.07973": {"title": "Statistical Performance Guarantee for Selecting Those Predicted to Benefit Most from Treatment", "link": "http://arxiv.org/abs/2310.07973", "description": "Across a wide array of disciplines, many researchers use machine learning\n(ML) algorithms to identify a subgroup of individuals, called exceptional\nresponders, who are likely to be helped by a treatment the most. A common\napproach consists of two steps: one first estimates the conditional average\ntreatment effect or its proxy using an ML algorithm, and then determines the\ncutoff of the resulting treatment prioritization score to select those\npredicted to benefit most from the treatment. Unfortunately, these estimated\ntreatment prioritization scores are often biased and noisy. Furthermore,\nutilizing the same data to both choose a cutoff value and estimate the average\ntreatment effect among the selected individuals suffers from a multiple testing\nproblem. To address these challenges, we develop a uniform confidence band for\nexperimentally evaluating the sorted average treatment effect (GATES) among the\nindividuals whose treatment prioritization score is at least as high as any\ngiven quantile value, regardless of how the quantile is chosen. This provides a\nstatistical guarantee that the GATES for the selected subgroup exceeds a\ncertain threshold. The validity of the proposed methodology depends solely on\nrandomization of treatment and random sampling of units without requiring\nmodeling assumptions or resampling methods. This widens its applicability\nto include a wide range of other causal quantities. A simulation study shows\nthat the empirical coverage of the proposed uniform confidence bands is close\nto the nominal coverage when the sample is as small as 100. 
We analyze a\nclinical trial of late-stage prostate cancer and find a relatively large\nproportion of exceptional responders with a statistical performance guarantee."}, "http://arxiv.org/abs/2310.08020": {"title": "Assessing Copula Models for Mixed Continuous-Ordinal Variables", "link": "http://arxiv.org/abs/2310.08020", "description": "Vine pair-copula constructions exist for a mix of continuous and ordinal\nvariables. In some steps, this can involve estimating a bivariate copula for a\npair of mixed continuous-ordinal variables. To assess the adequacy of copula\nfits for such a pair, diagnostic and visualization methods based on normal\nscore plots and conditional Q-Q plots are proposed. The former utilizes a\nlatent continuous variable for the ordinal variable. Using the Kullback-Leibler\ndivergence, existing probability models for a mixed continuous-ordinal variable\npair are assessed for the adequacy of fit with simple parametric copula\nfamilies. The effectiveness of the proposed visualization and diagnostic\nmethods is illustrated on simulated and real datasets."}, "http://arxiv.org/abs/2310.08193": {"title": "Are sanctions for losers? A network study of trade sanctions", "link": "http://arxiv.org/abs/2310.08193", "description": "Studies built on dependency and world-system theory using network approaches\nhave shown that international trade is structured into clusters of 'core' and\n'peripheral' countries performing distinct functions. However, few have used\nthese methods to investigate how sanctions affect the position of the countries\ninvolved in the capitalist world-economy. Yet, this topic has acquired pressing\nrelevance due to the emergence of economic warfare as a key geopolitical weapon\nsince the 1950s, and even more so in light of the preeminent role that\nsanctions have played in the US and its allies' response to the\nRussian-Ukrainian war. Applying several clustering techniques designed for\ncomplex and temporal networks, this paper shows a shift in the pattern of\ncommerce away from sanctioning countries and towards neutral or friendly ones.\nAdditionally, there are suggestions that these shifts may lead to the creation\nof an alternative 'core' that interacts with the world-economy's periphery\nbypassing traditional 'core' countries such as EU member States and the US."}, "http://arxiv.org/abs/2310.08268": {"title": "Change point detection in dynamic heterogeneous networks via subspace tracking", "link": "http://arxiv.org/abs/2310.08268", "description": "Dynamic networks consist of a sequence of time-varying networks, and it is of\ngreat importance to detect the network change points. Most existing methods\nfocus on detecting abrupt change points, necessitating the assumption that the\nunderlying network probability matrix remains constant between adjacent change\npoints. This paper introduces a new model that allows the network probability\nmatrix to undergo continuous shifting, while the latent network structure,\nrepresented via the embedding subspace, only changes at certain time points.\nTwo novel statistics are proposed to jointly detect these network subspace\nchange points, followed by a carefully refined detection procedure.\nTheoretically, we show that the proposed method is asymptotically consistent in\nterms of change point detection, and also establish the impossibility region\nfor detecting these network subspace change points. 
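The latent-continuous (normal score) device mentioned in the mixed continuous-ordinal copula abstract above can be sketched as follows: rank-based normal scores for the continuous margin, and for the ordinal margin a score drawn from the latent Gaussian interval implied by its category. The bivariate Gaussian generator and the uniform draw within intervals are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(10)

n = 1000
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
x_cont = z[:, 0]
x_ord = np.digitize(z[:, 1], bins=[-0.8, 0.0, 0.8])        # 4 ordered categories

score_cont = norm.ppf(rankdata(x_cont) / (n + 1))           # normal scores, continuous margin

cdf = np.cumsum(np.bincount(x_ord, minlength=4)) / n        # empirical category CDF
lower, upper = np.r_[0.0, cdf[:-1]][x_ord], cdf[x_ord]
u = np.clip(rng.uniform(lower, upper), 1e-9, 1 - 1e-9)      # latent uniform per unit
score_ord = norm.ppf(u)                                     # latent normal score, ordinal margin

print("correlation of normal scores:",
      round(float(np.corrcoef(score_cont, score_ord)[0, 1]), 3))
```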
The advantage of the\nproposed method is also supported by extensive numerical experiments on both\nsynthetic networks and a UK politician social network."}, "http://arxiv.org/abs/2310.08397": {"title": "Assessing Marine Mammal Abundance: A Novel Data Fusion", "link": "http://arxiv.org/abs/2310.08397", "description": "Marine mammals are increasingly vulnerable to human disturbance and climate\nchange. Their diving behavior leads to limited visual access during data\ncollection, making studying the abundance and distribution of marine mammals\nchallenging. In theory, using data from more than one observation modality\nshould lead to better informed predictions of abundance and distribution. With\nfocus on North Atlantic right whales, we consider the fusion of two data\nsources to inform about their abundance and distribution. The first source is\naerial distance sampling which provides the spatial locations of whales\ndetected in the region. The second source is passive acoustic monitoring (PAM),\nreturning calls received at hydrophones placed on the ocean floor. Due to\nlimited time on the surface and detection limitations arising from sampling\neffort, aerial distance sampling only provides a partial realization of\nlocations. With PAM, we never observe numbers or locations of individuals. To\naddress these challenges, we develop a novel thinned point pattern data fusion.\nOur approach leads to improved inference regarding abundance and distribution\nof North Atlantic right whales throughout Cape Cod Bay, Massachusetts in the\nUS. We demonstrate performance gains of our approach compared to that from a\nsingle source through both simulation and real data."}, "http://arxiv.org/abs/2310.08410": {"title": "Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review and Meta-Analysis", "link": "http://arxiv.org/abs/2310.08410", "description": "Large language models such as ChatGPT are increasingly explored in medical\ndomains. However, the absence of standard guidelines for performance evaluation\nhas led to methodological inconsistencies. This study aims to summarize the\navailable evidence on evaluating ChatGPT's performance in medicine and provide\ndirection for future research. We searched ten medical literature databases on\nJune 15, 2023, using the keyword \"ChatGPT\". A total of 3520 articles were\nidentified, of which 60 were reviewed and summarized in this paper and 17 were\nincluded in the meta-analysis. The analysis showed that ChatGPT displayed an\noverall integrated accuracy of 56% (95% CI: 51%-60%, I2 = 87%) in addressing\nmedical queries. However, the studies varied in question resource,\nquestion-asking process, and evaluation metrics. Moreover, many studies failed\nto report methodological details, including the version of ChatGPT and whether\neach question was used independently or repeatedly. Our findings revealed that\nalthough ChatGPT demonstrated considerable potential for application in\nhealthcare, the heterogeneity of the studies and insufficient reporting may\naffect the reliability of these results. 
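For readers unfamiliar with how a pooled accuracy and an I^2 statistic of this kind are obtained, here is a minimal random-effects (DerSimonian-Laird) sketch on hypothetical study-level counts; the numbers are invented and neither the reviewed studies' data nor their exact meta-analytic model is used.

```python
import numpy as np

# hypothetical study-level counts: (correct answers, questions asked)
studies = [(45, 80), (60, 100), (30, 70), (55, 90), (20, 50)]

p = np.array([k / n for k, n in studies])                     # per-study accuracy
var = np.array([pi * (1 - pi) / n for pi, (_, n) in zip(p, studies)])

w = 1 / var                                                   # inverse-variance weights
p_fixed = np.sum(w * p) / np.sum(w)
Q = np.sum(w * (p - p_fixed) ** 2)                            # Cochran's Q
df = len(studies) - 1
I2 = max(0.0, (Q - df) / Q) * 100                             # heterogeneity
tau2 = max(0.0, (Q - df) / (np.sum(w) - np.sum(w**2) / np.sum(w)))  # DerSimonian-Laird

w_re = 1 / (var + tau2)                                       # random-effects weights
p_re = np.sum(w_re * p) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"pooled accuracy {p_re:.2f} "
      f"(95% CI {p_re - 1.96*se_re:.2f}-{p_re + 1.96*se_re:.2f}), I^2 = {I2:.0f}%")
```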
Further well-designed studies with\ncomprehensive and transparent reporting are needed to evaluate ChatGPT's\nperformance in medicine."}, "http://arxiv.org/abs/2310.08414": {"title": "Confidence bounds for the true discovery proportion based on the exact distribution of the number of rejections", "link": "http://arxiv.org/abs/2310.08414", "description": "In multiple hypotheses testing it has become widely popular to make inference\non the true discovery proportion (TDP) of a set $\\mathcal{M}$ of null\nhypotheses. This approach is useful for several application fields, such as\nneuroimaging and genomics. Several procedures to compute simultaneous lower\nconfidence bounds for the TDP have been suggested in prior literature.\nSimultaneity allows for post-hoc selection of $\\mathcal{M}$. If sets of\ninterest are specified a priori, it is possible to gain power by removing the\nsimultaneity requirement. We present an approach to compute lower confidence\nbounds for the TDP if the set of null hypotheses is defined a priori. The\nproposed method determines the bounds using the exact distribution of the\nnumber of rejections based on a step-up multiple testing procedure under\nindependence assumptions. We assess robustness properties of our procedure and\napply it to real data from the field of functional magnetic resonance imaging."}, "http://arxiv.org/abs/2310.08426": {"title": "Extensions of Heterogeneity in Integration and Prediction (HIP) with R Shiny Application", "link": "http://arxiv.org/abs/2310.08426", "description": "Multiple data views measured on the same set of participants is becoming more\ncommon and has the potential to deepen our understanding of many complex\ndiseases by analyzing these different views simultaneously. Equally important,\nmany of these complex diseases show evidence of subgroup heterogeneity (e.g.,\nby sex or race). HIP (Heterogeneity in Integration and Prediction) is among the\nfirst methods proposed to integrate multiple data views while also accounting\nfor subgroup heterogeneity to identify common and subgroup-specific markers of\na particular disease. However, HIP is applicable to continuous outcomes and\nrequires programming expertise by the user. Here we propose extensions to HIP\nthat accommodate multi-class, Poisson, and Zero-Inflated Poisson outcomes while\nretaining the benefits of HIP. Additionally, we introduce an R Shiny\napplication, accessible on shinyapps.io at\nhttps://multi-viewlearn.shinyapps.io/HIP_ShinyApp/, that provides an interface\nwith the Python implementation of HIP to allow more researchers to use the\nmethod anywhere and on any device. We applied HIP to identify genes and\nproteins common and specific to males and females that are associated with\nexacerbation frequency. Although some of the identified genes and proteins show\nevidence of a relationship with chronic obstructive pulmonary disease (COPD) in\nexisting literature, others may be candidates for future research investigating\ntheir relationship with COPD. We demonstrate the use of the Shiny application\nwith a publicly available data. An R-package for HIP would be made available at\nhttps://github.com/lasandrall/HIP."}, "http://arxiv.org/abs/2310.08479": {"title": "Personalised dynamic super learning: an application in predicting hemodiafiltration's convection volumes", "link": "http://arxiv.org/abs/2310.08479", "description": "Obtaining continuously updated predictions is a major challenge for\npersonalised medicine. 
Leveraging combinations of parametric regressions and\nmachine learning approaches, the personalised online super learner (POSL) can\nachieve such dynamic and personalised predictions. We adapt POSL to predict a\nrepeated continuous outcome dynamically and propose a new way to validate such\npersonalised or dynamic prediction models. We illustrate its performance by\npredicting the convection volume of patients undergoing hemodiafiltration. POSL\noutperformed its candidate learners with respect to median absolute error,\ncalibration-in-the-large, discrimination, and net benefit. We finally discuss\nthe choices and challenges underlying the use of POSL."}, "http://arxiv.org/abs/1903.00037": {"title": "Distance-Based Independence Screening for Canonical Analysis", "link": "http://arxiv.org/abs/1903.00037", "description": "This paper introduces a novel method called Distance-Based Independence\nScreening for Canonical Analysis (DISCA) that performs simultaneous dimension\nreduction for a pair of random variables by optimizing the distance covariance\n(dCov). dCov is a statistic first proposed by Sz\\'ekely et al. [2009] for\nindependence testing. Compared with sufficient dimension reduction (SDR) and\ncanonical correlation analysis (CCA)-based approaches, DISCA is a model-free\napproach that does not impose dimensional or distributional restrictions on\nvariables and is more sensitive to nonlinear relationships. Theoretically, we\nestablish a non-asymptotic error bound to provide a guarantee of our method's\nperformance. Numerically, DISCA performs comparably to or better than other\nstate-of-the-art algorithms and is computationally faster. All code for our\nDISCA method can be found on GitHub at https://github.com/Yijin911/DISCA.git,\nincluding an R package named DISCA."}, "http://arxiv.org/abs/2105.13952": {"title": "Generalized Permutation Framework for Testing Model Variable Significance", "link": "http://arxiv.org/abs/2105.13952", "description": "A common problem in machine learning is determining if a variable\nsignificantly contributes to a model's prediction performance. This problem is\naggravated for datasets, such as gene expression datasets, that suffer the\nworst case of dimensionality: a low number of observations along with a high\nnumber of possible explanatory variables. In such scenarios, traditional\nmethods for testing variable statistical significance or constructing variable\nconfidence intervals do not apply. To address these problems, we developed a\nnovel permutation framework for testing the significance of variables in\nsupervised models. Our permutation framework has three main advantages. First,\nit is non-parametric and does not rely on distributional assumptions or\nasymptotic results. Second, it not only ranks model variables in terms of\nrelative importance, but also tests for statistical significance of each\nvariable. Third, it can test for the significance of the interaction between\nmodel variables. 
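A generic version of such a permutation scheme can be sketched in a few lines: permute one variable at a time, refit a simple model, and compare the refitted performance with the baseline. This is only an illustrative stand-in (ordinary least squares as the model, R^2 as the score, a naive permutation p-value) and not the authors' framework or its interaction test.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)        # only the first variable matters

def r2(X, y):
    # in-sample R^2 of an ordinary least squares fit with intercept
    D = np.c_[np.ones(len(X)), X]
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return 1 - (y - D @ beta).var() / y.var()

baseline = r2(X, y)
B = 500
for j in range(p):
    null_scores = []
    for _ in range(B):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between variable j and y
        null_scores.append(r2(Xp, y))
    null_scores = np.array(null_scores)
    pval = (np.sum(null_scores >= baseline) + 1) / (B + 1)
    print(f"variable {j}: importance {baseline - null_scores.mean():.3f}, p-value {pval:.3f}")
```

A small p-value indicates that permuting the variable reliably degrades the fit, which is the basic intuition behind permutation-based variable significance.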
We applied this permutation framework to multi-class\nclassification of the Iris flower dataset and of brain regions in RNA\nexpression data, and using this framework showed variable-level statistical\nsignificance and interactions."}, "http://arxiv.org/abs/2210.02002": {"title": "Factor Augmented Sparse Throughput Deep ReLU Neural Networks for High Dimensional Regression", "link": "http://arxiv.org/abs/2210.02002", "description": "This paper introduces a Factor Augmented Sparse Throughput (FAST) model that\nutilizes both latent factors and sparse idiosyncratic components for\nnonparametric regression. The FAST model bridges factor models on one end and\nsparse nonparametric models on the other end. It encompasses structured\nnonparametric models such as factor augmented additive models and sparse\nlow-dimensional nonparametric interaction models and covers the cases where the\ncovariates do not admit factor structures. Via diversified projections as\nestimation of latent factor space, we employ truncated deep ReLU networks to\nnonparametric factor regression without regularization and to a more general\nFAST model using nonconvex regularization, resulting in factor augmented\nregression using neural network (FAR-NN) and FAST-NN estimators respectively.\nWe show that FAR-NN and FAST-NN estimators adapt to the unknown low-dimensional\nstructure using hierarchical composition models in nonasymptotic minimax rates.\nWe also study statistical learning for the factor augmented sparse additive\nmodel using a more specific neural network architecture. Our results are\napplicable to the weak dependent cases without factor structures. In proving\nthe main technical result for FAST-NN, we establish a new deep ReLU network\napproximation result that contributes to the foundation of neural network\ntheory. Our theory and methods are further supported by simulation studies and\nan application to macroeconomic data."}, "http://arxiv.org/abs/2210.04482": {"title": "Leave-group-out cross-validation for latent Gaussian models", "link": "http://arxiv.org/abs/2210.04482", "description": "Evaluating the predictive performance of a statistical model is commonly done\nusing cross-validation. Although the leave-one-out method is frequently\nemployed, its application is justified primarily for independent and\nidentically distributed observations. However, this method tends to mimic\ninterpolation rather than prediction when dealing with dependent observations.\nThis paper proposes a modified cross-validation for dependent observations.\nThis is achieved by excluding an automatically determined set of observations\nfrom the training set to mimic a more reasonable prediction scenario. Also,\nwithin the framework of latent Gaussian models, we illustrate a method to\nadjust the joint posterior for this modified cross-validation to avoid model\nrefitting. This new approach is accessible in the R-INLA package\n(www.r-inla.org)."}, "http://arxiv.org/abs/2212.01179": {"title": "Feasibility of using survey data and semi-variogram kriging to obtain bespoke indices of neighbourhood characteristics: a simulation and a case study", "link": "http://arxiv.org/abs/2212.01179", "description": "Data on neighbourhood characteristics are not typically collected in\nepidemiological studies. They are however useful in the study of small-area\nhealth inequalities. Neighbourhood characteristics are collected in some\nsurveys and could be linked to the data of other studies. 
We propose to use\nkriging based on semi-variogram models to predict values at non-observed\nlocations with the aim of constructing bespoke indices of neighbourhood\ncharacteristics to be linked to data from epidemiological studies. We perform a\nsimulation study to assess the feasibility of the method as well as a case\nstudy using data from the RECORD study. Apart from having enough observed data\nat small distances to the non-observed locations, a good fitting\nsemi-variogram, a larger range and the absence of nugget effects for the\nsemi-variogram models are factors leading to a higher reliability."}, "http://arxiv.org/abs/2303.17823": {"title": "An interpretable neural network-based non-proportional odds model for ordinal regression", "link": "http://arxiv.org/abs/2303.17823", "description": "This study proposes an interpretable neural network-based non-proportional\nodds model (N$^3$POM) for ordinal regression. N$^3$POM is different from\nconventional approaches to ordinal regression with non-proportional models in\nseveral ways: (1) N$^3$POM is designed to directly handle continuous responses,\nwhereas standard methods typically treat de facto ordered continuous variables\nas discrete, (2) instead of estimating response-dependent finite coefficients\nof linear models from discrete responses as is done in conventional approaches,\nwe train a non-linear neural network to serve as a coefficient function. Thanks\nto the neural network, N$^3$POM offers flexibility while preserving the\ninterpretability of conventional ordinal regression. We establish a sufficient\ncondition under which the predicted conditional cumulative probability locally\nsatisfies the monotonicity constraint over a user-specified region in the\ncovariate space. Additionally, we provide a monotonicity-preserving stochastic\n(MPS) algorithm for effectively training the neural network. We apply N$^3$POM\nto several real-world datasets."}, "http://arxiv.org/abs/2306.16335": {"title": "Emulating the dynamics of complex systems using autoregressive models on manifolds (mNARX)", "link": "http://arxiv.org/abs/2306.16335", "description": "We propose a novel surrogate modelling approach to efficiently and accurately\napproximate the response of complex dynamical systems driven by time-varying\nexogenous excitations over extended time periods. Our approach, namely manifold\nnonlinear autoregressive modelling with exogenous input (mNARX), involves\nconstructing a problem-specific exogenous input manifold that is optimal for\nconstructing autoregressive surrogates. The manifold, which forms the core of\nmNARX, is constructed incrementally by incorporating the physics of the system,\nas well as prior expert- and domain- knowledge. Because mNARX decomposes the\nfull problem into a series of smaller sub-problems, each with a lower\ncomplexity than the original, it scales well with the complexity of the\nproblem, both in terms of training and evaluation costs of the final surrogate.\nFurthermore, mNARX synergizes well with traditional dimensionality reduction\ntechniques, making it highly suitable for modelling dynamical systems with\nhigh-dimensional exogenous inputs, a class of problems that is typically\nchallenging to solve. Since domain knowledge is particularly abundant in\nphysical systems, such as those found in civil and mechanical engineering,\nmNARX is well suited for these applications. 
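Setting aside the manifold construction, the autoregressive-with-exogenous-input (NARX) backbone that mNARX builds on can be illustrated with a plain linear surrogate: regress the current state on lagged states and lagged inputs, then run the surrogate in free-running mode. Everything below (the toy system, the ridge penalty, the lag order) is an assumption chosen for illustration, not the mNARX methodology itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# a toy dynamical system driven by an exogenous input u
T = 500
u = np.sin(np.linspace(0, 20 * np.pi, T)) + 0.1 * rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + 0.5 * u[t - 1] + 0.05 * rng.normal()

# NARX-style design matrix: lagged responses and lagged exogenous inputs
lags = 3
rows = [np.r_[x[t - lags:t], u[t - lags:t]] for t in range(lags, T)]
Z, target = np.array(rows), x[lags:]

# ridge-regularized least squares as the (linear) autoregressive surrogate
lam = 1e-3
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ target)

# free-running forecast: feed the surrogate's own predictions back in
pred = list(x[:lags])
for t in range(lags, T):
    z = np.r_[pred[-lags:], u[t - lags:t]]
    pred.append(z @ beta)
print("free-running forecast RMSE:", np.sqrt(np.mean((np.array(pred) - x) ** 2)))
```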
We demonstrate that mNARX\noutperforms traditional autoregressive surrogates in predicting the response of\na classical coupled spring-mass system excited by a one-dimensional random\nexcitation. Additionally, we show that mNARX is well suited for emulating very\nhigh-dimensional time- and state-dependent systems, even when affected by\nactive controllers, by surrogating the dynamics of a realistic\naero-servo-elastic onshore wind turbine simulator. In general, our results\ndemonstrate that mNARX offers promising prospects for modelling complex\ndynamical systems, in terms of accuracy and efficiency."}, "http://arxiv.org/abs/2307.02236": {"title": "D-optimal Subsampling Design for Massive Data Linear Regression", "link": "http://arxiv.org/abs/2307.02236", "description": "Data reduction is a fundamental challenge of modern technology, where\nclassical statistical methods are not applicable because of computational\nlimitations. We consider linear regression for an extraordinarily large number\nof observations, but only a few covariates. Subsampling aims at the selection\nof a given percentage of the existing original data. Under distributional\nassumptions on the covariates, we derive D-optimal subsampling designs and\nstudy their theoretical properties. We make use of fundamental concepts of\noptimal design theory and an equivalence theorem from constrained convex\noptimization. The thus obtained subsampling designs provide simple rules for\nwhether to accept or reject a data point, allowing for an easy algorithmic\nimplementation. In addition, we propose a simplified subsampling method that\ndiffers from the D-optimal design but requires lower computing time. We present\na simulation study, comparing both subsampling schemes with the IBOSS method."}, "http://arxiv.org/abs/2310.08726": {"title": "Design-Based RCT Estimators and Central Limit Theorems for Baseline Subgroup and Related Analyses", "link": "http://arxiv.org/abs/2310.08726", "description": "There is a growing literature on design-based methods to estimate average\ntreatment effects (ATEs) for randomized controlled trials (RCTs) for full\nsample analyses. This article extends these methods to estimate ATEs for\ndiscrete subgroups defined by pre-treatment variables, with an application to\nan RCT testing subgroup effects for a school voucher experiment in New York\nCity. We consider ratio estimators for subgroup effects using regression\nmethods, allowing for model covariates to improve precision, and prove a finite\npopulation central limit theorem. We discuss extensions to blocked and\nclustered RCT designs, and to other common estimators with random\ntreatment-control sample sizes (or weights): post-stratification estimators,\nweighted estimators that adjust for data nonresponse, and estimators for\nBernoulli trials. We also develop simple variance estimators that share\nfeatures with robust estimators. Simulations show that the design-based\nsubgroup estimators yield confidence interval coverage near nominal levels,\neven for small subgroups."}, "http://arxiv.org/abs/2310.08798": {"title": "Alteration Detection of Tensor Dependence Structure via Sparsity-Exploited Reranking Algorithm", "link": "http://arxiv.org/abs/2310.08798", "description": "Tensor-valued data arise frequently from a wide variety of scientific\napplications, and many among them can be translated into an alteration\ndetection problem of tensor dependence structures. 
In this article, we\nformulate the problem under the popularly adopted tensor-normal distributions\nand aim at two-sample correlation/partial correlation comparisons of\ntensor-valued observations. Through decorrelation and centralization, a\nseparable covariance structure is employed to pool sample information from\ndifferent tensor modes to enhance the power of the test. Additionally, we\npropose a novel Sparsity-Exploited Reranking Algorithm (SERA) to further\nimprove the multiple testing efficiency. The algorithm is approached through\nreranking of the p-values derived from the primary test statistics, by\nincorporating a carefully constructed auxiliary tensor sequence. Besides the\ntensor framework, SERA is also generally applicable to a wide range of\ntwo-sample large-scale inference problems with sparsity structures, and is of\nindependent interest. The asymptotic properties of the proposed test are\nderived and the algorithm is shown to control the false discovery at the\npre-specified level. We demonstrate the efficacy of the proposed method through\nintensive simulations and two scientific applications."}, "http://arxiv.org/abs/2310.08812": {"title": "A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM model", "link": "http://arxiv.org/abs/2310.08812", "description": "Time series forecasting represents a significant and challenging task across\nvarious fields. Recently, methods based on mode decomposition have dominated\nthe forecasting of complex time series because of the advantages of capturing\nlocal characteristics and extracting intrinsic modes from data. Unfortunately,\nmost models fail to capture the implied volatilities that contain significant\ninformation. To enhance the forecasting of current, rapidly evolving, and\nvolatile time series, we propose a novel decomposition-ensemble paradigm, the\nVMD-LSTM-GARCH model. The Variational Mode Decomposition algorithm is employed\nto decompose the time series into K sub-modes. Subsequently, the GARCH model\nextracts the volatility information from these sub-modes, which serve as the\ninput for the LSTM. The numerical and volatility information of each sub-mode\nis utilized to train a Long Short-Term Memory network. This network predicts\nthe sub-mode, and then we aggregate the predictions from all sub-modes to\nproduce the output. By integrating econometric and artificial intelligence\nmethods, and taking into account both the numerical and volatility information\nof the time series, our proposed model demonstrates superior performance in\ntime series forecasting, as evidenced by the significant decrease in MSE, RMSE,\nand MAPE in our comparative experimental results."}, "http://arxiv.org/abs/2310.08867": {"title": "A Survey of Methods for Handling Disk Data Imbalance", "link": "http://arxiv.org/abs/2310.08867", "description": "Class imbalance exists in many classification problems, and since the data is\ndesigned for accuracy, imbalance in data classes can lead to classification\nchallenges with a few classes having higher misclassification costs. The\nBackblaze dataset, a widely used dataset related to hard discs, has a small\namount of failure data and a large amount of health data, which exhibits a\nserious class imbalance. This paper provides a comprehensive overview of\nresearch in the field of imbalanced data classification. The discussion is\norganized into three main aspects: data-level methods, algorithmic-level\nmethods, and hybrid methods. 
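As a concrete example of the data-level family just mentioned, the sketch below applies plain random oversampling of the minority class to a synthetic disk-failure-style dataset; the data and the 3% failure rate are invented, and this is only one of the simplest members of that family.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy imbalanced labels: 1 = failed disk (rare), 0 = healthy disk
X = rng.normal(size=(1000, 4))
y = (rng.uniform(size=1000) < 0.03).astype(int)

def random_oversample(X, y, rng):
    """Data-level rebalancing: resample the minority class with replacement
    until both classes contribute the same number of rows."""
    minority = int(y.sum() < (len(y) - y.sum()))
    idx_min, idx_maj = np.where(y == minority)[0], np.where(y != minority)[0]
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

Xb, yb = random_oversample(X, y, rng)
print("class counts before:", np.bincount(y), "after:", np.bincount(yb))
```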
For each type of method, we summarize and analyze\nthe existing problems, algorithmic ideas, strengths, and weaknesses.\nAdditionally, the challenges of unbalanced data classification are discussed,\nalong with strategies to address them. It is convenient for researchers to\nchoose the appropriate method according to their needs."}, "http://arxiv.org/abs/2310.08939": {"title": "Fast Screening Rules for Optimal Design via Quadratic Lasso Reformulation", "link": "http://arxiv.org/abs/2310.08939", "description": "The problems of Lasso regression and optimal design of experiments share a\ncritical property: their optimal solutions are typically \\emph{sparse}, i.e.,\nonly a small fraction of the optimal variables are non-zero. Therefore, the\nidentification of the support of an optimal solution reduces the dimensionality\nof the problem and can yield a substantial simplification of the calculations.\nIt has recently been shown that linear regression with a \\emph{squared}\n$\\ell_1$-norm sparsity-inducing penalty is equivalent to an optimal\nexperimental design problem. In this work, we use this equivalence to derive\nsafe screening rules that can be used to discard inessential samples. Compared\nto previously existing rules, the new tests are much faster to compute,\nespecially for problems involving a parameter space of high dimension, and can\nbe used dynamically within any iterative solver, with negligible computational\noverhead. Moreover, we show how an existing homotopy algorithm to compute the\nregularization path of the lasso method can be reparametrized with respect to\nthe squared $\\ell_1$-penalty. This allows the computation of a Bayes\n$c$-optimal design in a finite number of steps and can be several orders of\nmagnitude faster than standard first-order algorithms. The efficiency of the\nnew screening rules and of the homotopy algorithm are demonstrated on different\nexamples based on real data."}, "http://arxiv.org/abs/2310.09100": {"title": "Time-Uniform Self-Normalized Concentration for Vector-Valued Processes", "link": "http://arxiv.org/abs/2310.09100", "description": "Self-normalized processes arise naturally in many statistical tasks. While\nself-normalized concentration has been extensively studied for scalar-valued\nprocesses, there is less work on multidimensional processes outside of the\nsub-Gaussian setting. In this work, we construct a general, self-normalized\ninequality for $\\mathbb{R}^d$-valued processes that satisfy a simple yet broad\n\"sub-$\\psi$\" tail condition, which generalizes assumptions based on cumulant\ngenerating functions. From this general inequality, we derive an upper law of\nthe iterated logarithm for sub-$\\psi$ vector-valued processes, which is tight\nup to small constants. We demonstrate applications in prototypical statistical\ntasks, such as parameter estimation in online linear regression and\nauto-regressive modeling, and bounded mean estimation via a new (multivariate)\nempirical Bernstein concentration inequality."}, "http://arxiv.org/abs/2310.09185": {"title": "Mediation Analysis using Semi-parametric Shape-Restricted Regression with Applications", "link": "http://arxiv.org/abs/2310.09185", "description": "Often linear regression is used to perform mediation analysis. However, in\nmany instances, the underlying relationships may not be linear, as in the case\nof placental-fetal hormones and fetal development. 
Although the exact\nfunctional form of the relationship may be unknown, one may hypothesize the\ngeneral shape of the relationship. For these reasons, we develop a novel\nshape-restricted inference-based methodology for conducting mediation analysis.\nThis work is motivated by an application in fetal endocrinology where\nresearchers are interested in understanding the effects of pesticide\napplication on birth weight, with human chorionic gonadotropin (hCG) as the\nmediator. We assume a practically plausible set of nonlinear effects of hCG on\nthe birth weight and a linear relationship between pesticide exposure and hCG,\nwith both exposure-outcome and exposure-mediator models being linear in the\nconfounding factors. Using the proposed methodology on population-level\nprenatal screening program data, with hCG as the mediator, we discovered that,\nwhile the natural direct effects suggest a positive association between\npesticide application and birth weight, the natural indirect effects were\nnegative."}, "http://arxiv.org/abs/2310.09214": {"title": "An Introduction to the Calibration of Computer Models", "link": "http://arxiv.org/abs/2310.09214", "description": "In the context of computer models, calibration is the process of estimating\nunknown simulator parameters from observational data. Calibration is variously\nreferred to as model fitting, parameter estimation/inference, an inverse\nproblem, and model tuning. The need for calibration occurs in most areas of\nscience and engineering, and has been used to estimate hard-to-measure\nparameters in climate, cardiology, drug therapy response, hydrology, and many\nother disciplines. Although the statistical method used for calibration can\nvary substantially, the underlying approach is essentially the same and can be\nconsidered abstractly. In this survey, we review the decisions that need to be\ntaken when calibrating a model, and discuss a range of computational methods\nthat can be used to compute Bayesian posterior distributions."}, "http://arxiv.org/abs/2310.09239": {"title": "Estimating weighted quantile treatment effects with missing outcome data by double sampling", "link": "http://arxiv.org/abs/2310.09239", "description": "Causal weighted quantile treatment effects (WQTE) are a useful complement to\nstandard causal contrasts that focus on the mean when interest lies at the\ntails of the counterfactual distribution. To-date, however, methods for\nestimation and inference regarding causal WQTEs have assumed complete data on\nall relevant factors. Missing or incomplete data, however, is a widespread\nchallenge in practical settings, particularly when the data are not collected\nfor research purposes, such as electronic health records and disease registries.\nFurthermore, such settings may be particularly susceptible to the outcome\ndata being missing-not-at-random (MNAR). In this paper, we consider the use of\ndouble-sampling, through which the otherwise missing data is ascertained on a\nsub-sample of study units, as a strategy to mitigate bias due to MNAR data in\nthe estimation of causal WQTEs. With the additional data in-hand, we present\nidentifying conditions that do not require assumptions regarding missingness in\nthe original data. We then propose a novel inverse-probability weighted\nestimator and derive its asymptotic properties, both pointwise at specific\nquantiles and uniform across a range of quantiles in (0,1), when the propensity\nscore and double-sampling probabilities are estimated. 
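To fix ideas about inverse-probability-weighted quantile contrasts, here is a minimal sketch with fully observed outcomes and a known propensity score; it computes a weighted-quantile difference between treated and control arms and deliberately omits the double-sampling and MNAR machinery that is the subject of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))                  # true propensity score
a = rng.binomial(1, ps)
y = x + a * 1.0 + rng.normal(size=n)       # constant treatment effect of 1 at all quantiles

def weighted_quantile(v, w, q):
    # quantile of v under normalized weights w
    order = np.argsort(v)
    cw = np.cumsum(w[order])
    return v[order][np.searchsorted(cw, q * cw[-1])]

# inverse-probability weights (true propensity used here; in practice it is estimated)
w1, w0 = a / ps, (1 - a) / (1 - ps)
for q in (0.25, 0.5, 0.75):
    qte = weighted_quantile(y, w1, q) - weighted_quantile(y, w0, q)
    print(f"tau = {q}: IPW quantile treatment effect {qte:.2f}")
```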
For practical inference,\nwe develop a bootstrap method that can be used for both pointwise and uniform\ninference. A simulation study is conducted to examine the finite sample\nperformance of the proposed estimators."}, "http://arxiv.org/abs/2310.09257": {"title": "A SIMPLE Approach to Provably Reconstruct Ising Model with Global Optimality", "link": "http://arxiv.org/abs/2310.09257", "description": "Reconstruction of the interaction network between random events is a critical\nproblem arising in fields ranging from statistical physics and politics to sociology, biology,\nand psychology, and beyond. The Ising model lays the foundation for this\nreconstruction process, but finding the underlying Ising model from the least\namount of observed samples in a computationally efficient manner has been\nhistorically challenging for half a century. By using the idea of sparsity\nlearning, we present an approach named SIMPLE that has a dominant sample\ncomplexity from the theoretical limit. Furthermore, a tuning-free algorithm is\ndeveloped to give a statistically consistent solution of SIMPLE in polynomial\ntime with high probability. On extensive benchmarked cases, the SIMPLE approach\nprovably reconstructs underlying Ising models with global optimality. The\napplication to U.S. senators' voting in the last six congresses reveals that\nboth the Republicans and Democrats noticeably assemble in each congress;\ninterestingly, the assembling of Democrats is particularly pronounced in the\nlatest congress."}, "http://arxiv.org/abs/2208.09817": {"title": "High-Dimensional Composite Quantile Regression: Optimal Statistical Guarantees and Fast Algorithms", "link": "http://arxiv.org/abs/2208.09817", "description": "The composite quantile regression (CQR) was introduced by Zou and Yuan [Ann.\nStatist. 36 (2008) 1108--1126] as a robust regression method for linear models\nwith heavy-tailed errors while achieving high efficiency. Its penalized\ncounterpart for high-dimensional sparse models was recently studied in Gu and\nZou [IEEE Trans. Inf. Theory 66 (2020) 7132--7154], along with a specialized\noptimization algorithm based on the alternating direction method of multipliers\n(ADMM). Compared to the various first-order algorithms for penalized least\nsquares, ADMM-based algorithms are not well-adapted to large-scale problems. To\novercome this computational hardness, in this paper we apply a\nconvolution-smoothing technique to CQR, complemented with iteratively reweighted\n$\\ell_1$-regularization. The smoothed composite loss function is convex, twice\ncontinuously differentiable, and locally strongly convex with high probability.\nWe propose a gradient-based algorithm for penalized smoothed CQR via a variant\nof the majorize-minimization principle, which gains substantial computational\nefficiency over ADMM. Theoretically, we show that the iteratively reweighted\n$\\ell_1$-penalized smoothed CQR estimator achieves a near-minimax optimal\nconvergence rate under heavy-tailed errors without any moment constraint, and\nfurther achieves a near-oracle convergence rate under a weaker minimum signal\nstrength condition than needed in Gu and Zou (2020). 
Numerical studies\ndemonstrate that the proposed method exhibits significant computational\nadvantages without compromising statistical performance compared to two\nstate-of-the-art methods that achieve robustness and high efficiency\nsimultaneously."}, "http://arxiv.org/abs/2210.14292": {"title": "Statistical Inference for H\\\"usler-Reiss Graphical Models Through Matrix Completions", "link": "http://arxiv.org/abs/2210.14292", "description": "The severity of multivariate extreme events is driven by the dependence\nbetween the largest marginal observations. The H\\\"usler-Reiss distribution is a\nversatile model for this extremal dependence, and it is usually parameterized\nby a variogram matrix. In order to represent conditional independence relations\nand obtain sparse parameterizations, we introduce the novel H\\\"usler-Reiss\nprecision matrix. Similarly to the Gaussian case, this matrix appears naturally\nin density representations of the H\\\"usler-Reiss Pareto distribution and\nencodes the extremal graphical structure through its zero pattern. For a given,\narbitrary graph we prove the existence and uniqueness of the completion of a\npartially specified H\\\"usler-Reiss variogram matrix so that its precision\nmatrix has zeros on non-edges in the graph. Using suitable estimators for the\nparameters on the edges, our theory provides the first consistent estimator of\ngraph structured H\\\"usler-Reiss distributions. If the graph is unknown, our\nmethod can be combined with recent structure learning algorithms to jointly\ninfer the graph and the corresponding parameter matrix. Based on our\nmethodology, we propose new tools for statistical inference of sparse\nH\\\"usler-Reiss models and illustrate them on large flight delay data in the\nU.S., as well as Danube river flow data."}, "http://arxiv.org/abs/2302.02288": {"title": "Efficient Adaptive Joint Significance Tests and Sobel-Type Confidence Intervals for Mediation Effects", "link": "http://arxiv.org/abs/2302.02288", "description": "Mediation analysis is an important statistical tool in many research fields.\nIts aim is to investigate the mechanism along the causal pathway between an\nexposure and an outcome. The joint significance test is widely utilized as a\nprominent statistical approach for examining mediation effects in practical\napplications. Nevertheless, the limitation of this mediation testing method\nstems from its conservative Type I error, which reduces its statistical power\nand imposes certain constraints on its popularity and utility. The proposed\nsolution to address this gap is the adaptive joint significance test for one\nmediator, a novel data-adaptive test for mediation effect that exhibits\nsignificant advancements compared to traditional joint significance test. The\nproposed method is designed to be user-friendly, eliminating the need for\ncomplicated procedures. We have derived explicit expressions for size and\npower, ensuring the theoretical validity of our approach. Furthermore, we\nextend the proposed adaptive joint significance tests for small-scale mediation\nhypotheses with family-wise error rate (FWER) control. Additionally, a novel\nadaptive Sobel-type approach is proposed for the estimation of confidence\nintervals for the mediation effects, demonstrating significant advancements\nover conventional Sobel's confidence intervals in terms of achieving desirable\ncoverage probabilities. 
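For context on what a Sobel-type interval is, the sketch below computes the conventional (non-adaptive) Sobel confidence interval for the indirect effect a*b from two least-squares fits on simulated data; the adaptive joint significance test and adaptive Sobel-type interval proposed in the abstract are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)                     # exposure
m = 0.5 * x + rng.normal(size=n)           # mediator
y = 0.4 * m + 0.2 * x + rng.normal(size=n) # outcome

def ols(X, y):
    # least squares with intercept; returns coefficients and their standard errors
    X1 = np.c_[np.ones(len(X)), X]
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
    resid = y - X1 @ beta
    cov = np.linalg.inv(X1.T @ X1) * resid.var(ddof=X1.shape[1])
    return beta, np.sqrt(np.diag(cov))

# mediator model m ~ x gives a; outcome model y ~ m + x gives b
(_, a), (_, se_a) = ols(x[:, None], m)
(_, b, _), (_, se_b, _) = ols(np.c_[m, x], y)

indirect = a * b
se_sobel = np.sqrt(a**2 * se_b**2 + b**2 * se_a**2)   # first-order Sobel standard error
ci = (indirect - 1.96 * se_sobel, indirect + 1.96 * se_sobel)
print(f"indirect effect {indirect:.3f}, conventional Sobel 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```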
Our mediation testing and confidence intervals\nprocedure is evaluated through comprehensive simulations, and compared with\nnumerous existing approaches. Finally, we illustrate the usefulness of our\nmethod by analysing three real-world datasets with continuous, binary and\ntime-to-event outcomes, respectively."}, "http://arxiv.org/abs/2305.15742": {"title": "Counterfactual Generative Models for Time-Varying Treatments", "link": "http://arxiv.org/abs/2305.15742", "description": "Estimating the counterfactual outcome of treatment is essential for\ndecision-making in public health and clinical science, among others. Often,\ntreatments are administered in a sequential, time-varying manner, leading to an\nexponentially increased number of possible counterfactual outcomes.\nFurthermore, in modern applications, the outcomes are high-dimensional and\nconventional average treatment effect estimation fails to capture disparities\nin individuals. To tackle these challenges, we propose a novel conditional\ngenerative framework capable of producing counterfactual samples under\ntime-varying treatment, without the need for explicit density estimation. Our\nmethod carefully addresses the distribution mismatch between the observed and\ncounterfactual distributions via a loss function based on inverse probability\nweighting. We present a thorough evaluation of our method using both synthetic\nand real-world data. Our results demonstrate that our method is capable of\ngenerating high-quality counterfactual samples and outperforms the\nstate-of-the-art baselines."}, "http://arxiv.org/abs/2309.09115": {"title": "Fully Synthetic Data for Complex Surveys", "link": "http://arxiv.org/abs/2309.09115", "description": "When seeking to release public use files for confidential data, statistical\nagencies can generate fully synthetic data. We propose an approach for making\nfully synthetic data from surveys collected with complex sampling designs.\nSpecifically, we generate pseudo-populations by applying the weighted finite\npopulation Bayesian bootstrap to account for survey weights, take simple random\nsamples from those pseudo-populations, estimate synthesis models using these\nsimple random samples, and release simulated data drawn from the models as the\npublic use files. We use the framework of multiple imputation to enable\nvariance estimation using two data generation strategies. In the first, we\ngenerate multiple data sets from each simple random sample, whereas in the\nsecond, we generate a single synthetic data set from each simple random sample.\nWe present multiple imputation combining rules for each setting. We illustrate\neach approach and the repeated sampling properties of the combining rules using\nsimulation studies."}, "http://arxiv.org/abs/2309.09323": {"title": "Answering Layer 3 queries with DiscoSCMs", "link": "http://arxiv.org/abs/2309.09323", "description": "Addressing causal queries across the Pearl Causal Hierarchy (PCH) (i.e.,\nassociational, interventional and counterfactual), which is formalized as\n\\Layer{} Valuations, is a central task in contemporary causal inference\nresearch. Counterfactual questions, in particular, pose a significant challenge\nas they often necessitate a complete knowledge of structural equations. This\npaper identifies \\textbf{the degeneracy problem} caused by the consistency\nrule. 
To tackle this, the \textit{Distribution-consistency Structural Causal\nModels} (DiscoSCMs) is introduced, which extends both the structural causal\nmodels (SCM) and the potential outcome framework. The correlation pattern of\npotential outcomes in personalized incentive scenarios, described by $P(y_x,\ny'_{x'})$, is used as a case study for elucidation. Although counterfactuals\nare no longer degenerate, they remain indeterminable. As a result, the\ncondition of independent potential noise is incorporated into DiscoSCM. It is\nfound that by adeptly using homogeneity, counterfactuals can be identified.\nFurthermore, more refined results are achieved in the unit problem scenario. In\nsimpler terms, when modeling counterfactuals, one should contemplate: \"Consider\na person with average ability who takes a test and, due to good luck, achieves\nan exceptionally high score. If this person were to retake the test under\nidentical external conditions, what score would he obtain? An exceptionally high\nscore or an average score?\" If your choice is predicting an average score, then\nyou are essentially choosing DiscoSCM over the traditional frameworks based on\nthe consistency rule."}, "http://arxiv.org/abs/2310.01748": {"title": "A generative approach to frame-level multi-competitor races", "link": "http://arxiv.org/abs/2310.01748", "description": "Multi-competitor races often feature complicated within-race strategies that\nare difficult to capture when training models on race outcome level data.\nFurther, models which do not account for such strategic effects may suffer from\nconfounded inferences and predictions. In this work we develop a general\ngenerative model for multi-competitor races which allows analysts to explicitly\nmodel certain strategic effects such as changing lanes or drafting and separate\nthese impacts from competitor ability. The generative model allows one to\nsimulate full races from any real or created starting position which opens new\navenues for attributing value to within-race actions and to perform\ncounterfactual analyses. This methodology is sufficiently general to apply to\nany track-based multi-competitor race where both tracking data is available\nand competitor movement is well described by simultaneous forward and lateral\nmovements. We apply this methodology to one-mile horse races using data\nprovided by the New York Racing Association (NYRA) and the New York\nThoroughbred Horsemen's Association (NYTHA) for the Big Data Derby 2022 Kaggle\nCompetition. This data features granular tracking data for all horses at the\nframe-level (occurring at approximately 4 Hz). We demonstrate how this model can\nyield new inferences, such as the estimation of horse-specific speed profiles\nwhich vary over phases of the race, and examples of posterior predictive\ncounterfactual simulations to answer questions of interest such as starting\nlane impacts on race outcomes."}, "http://arxiv.org/abs/2310.09345": {"title": "A Unified Bayesian Framework for Modeling Measurement Error in Multinomial Data", "link": "http://arxiv.org/abs/2310.09345", "description": "Measurement error in multinomial data is a well-known and well-studied\ninferential problem that is encountered in many fields, including engineering,\nbiomedical and omics research, ecology, finance, and social sciences.\nSurprisingly, methods developed to accommodate measurement error in multinomial\ndata are typically equipped to handle false negatives or false positives, but\nnot both. 
We provide a unified framework for accommodating both forms of\nmeasurement error using a Bayesian hierarchical approach. We demonstrate the\nproposed method's performance on simulated data and apply it to acoustic bat\nmonitoring data."}, "http://arxiv.org/abs/2310.09384": {"title": "Modeling Missing at Random Neuropsychological Test Scores Using a Mixture of Binomial Product Experts", "link": "http://arxiv.org/abs/2310.09384", "description": "Multivariate bounded discrete data arises in many fields. In the setting of\nlongitudinal dementia studies, such data is collected when individuals complete\nneuropsychological tests. We outline a modeling and inference procedure that\ncan model the joint distribution conditional on baseline covariates, leveraging\nprevious work on mixtures of experts and latent class models. Furthermore, we\nillustrate how the work can be extended when the outcome data is missing at\nrandom using a nested EM algorithm. The proposed model can incorporate\ncovariate information, perform imputation and clustering, and infer latent\ntrajectories. We apply our model on simulated data and an Alzheimer's disease\ndata set."}, "http://arxiv.org/abs/2310.09428": {"title": "Sparse higher order partial least squares for simultaneous variable selection, dimension reduction, and tensor denoising", "link": "http://arxiv.org/abs/2310.09428", "description": "Partial Least Squares (PLS) regression emerged as an alternative to ordinary\nleast squares for addressing multicollinearity in a wide range of scientific\napplications. As multidimensional tensor data is becoming more widespread,\ntensor adaptations of PLS have been developed. Our investigations reveal that\nthe previously established asymptotic result of the PLS estimator for a tensor\nresponse breaks down as the tensor dimensions and the number of features\nincrease relative to the sample size. To address this, we propose Sparse Higher\nOrder Partial Least Squares (SHOPS) regression and an accompanying algorithm.\nSHOPS simultaneously accommodates variable selection, dimension reduction, and\ntensor association denoising. We establish the asymptotic accuracy of the SHOPS\nalgorithm under a high-dimensional regime and verify these results through\ncomprehensive simulation experiments, and applications to two contemporary\nhigh-dimensional biological data analysis."}, "http://arxiv.org/abs/2310.09493": {"title": "Summary Statistics Knockoffs Inference with Family-wise Error Rate Control", "link": "http://arxiv.org/abs/2310.09493", "description": "Testing multiple hypotheses of conditional independence with provable error\nrate control is a fundamental problem with various applications. To infer\nconditional independence with family-wise error rate (FWER) control when only\nsummary statistics of marginal dependence are accessible, we adopt\nGhostKnockoff to directly generate knockoff copies of summary statistics and\npropose a new filter to select features conditionally dependent to the response\nwith provable FWER control. In addition, we develop a computationally efficient\nalgorithm to greatly reduce the computational cost of knockoff copies\ngeneration without sacrificing power and FWER control. 
Experiments on simulated\ndata and a real dataset of Alzheimer's disease genetics demonstrate the\nadvantage of the proposed method over existing alternatives in both statistical\npower and computational efficiency."}, "http://arxiv.org/abs/2310.09646": {"title": "Jackknife empirical likelihood confidence intervals for the categorical Gini correlation", "link": "http://arxiv.org/abs/2310.09646", "description": "The categorical Gini correlation, $\\rho_g$, was proposed by Dang et al. to\nmeasure the dependence between a categorical variable, $Y$, and a numerical\nvariable, $X$. It has been shown that $\\rho_g$ has more appealing properties\nthan existing dependence measures. In this paper, we develop the\njackknife empirical likelihood (JEL) method for $\\rho_g$. Confidence intervals\nfor the Gini correlation are constructed without estimating the asymptotic\nvariance. Adjusted and weighted JEL are explored to improve the performance of\nthe standard JEL. Simulation studies show that our methods are competitive with\nexisting methods in terms of coverage accuracy and shortness of confidence\nintervals. The proposed methods are illustrated in an application on two real\ndatasets."}, "http://arxiv.org/abs/2310.09673": {"title": "Robust Quickest Change Detection in Non-Stationary Processes", "link": "http://arxiv.org/abs/2310.09673", "description": "Optimal algorithms are developed for robust detection of changes in\nnon-stationary processes. These are processes in which the distribution of the\ndata after change varies with time. The decision-maker does not have access to\nprecise information on the post-change distribution. It is shown that if the\npost-change non-stationary family has a distribution that is least favorable in\na well-defined sense, then the algorithms designed using the least favorable\ndistributions are robust and optimal. Non-stationary processes are encountered\nin public health monitoring and space and military applications. The robust\nalgorithms are applied to real and simulated data to show their effectiveness."}, "http://arxiv.org/abs/2310.09701": {"title": "A powerful empirical Bayes approach for high dimensional replicability analysis", "link": "http://arxiv.org/abs/2310.09701", "description": "Identifying replicable signals across different studies provides stronger\nscientific evidence and more powerful inference. Existing literature on high\ndimensional replicability analysis either imposes strong modeling assumptions\nor has low power. We develop a powerful and robust empirical Bayes approach for\nhigh dimensional replicability analysis. Our method effectively borrows\ninformation from different features and studies while accounting for\nheterogeneity. We show that the proposed method has better power than competing\nmethods while controlling the false discovery rate, both empirically and\ntheoretically. Analyzing datasets from genome-wide association studies\nreveals new biological insights that otherwise cannot be obtained by using\nexisting methods."}, "http://arxiv.org/abs/2310.09702": {"title": "Inference with Mondrian Random Forests", "link": "http://arxiv.org/abs/2310.09702", "description": "Random forests are popular methods for classification and regression, and\nmany different variants have been proposed in recent years. One interesting\nexample is the Mondrian random forest, in which the underlying trees are\nconstructed according to a Mondrian process. 
In this paper we give a central\nlimit theorem for the estimates made by a Mondrian random forest in the\nregression setting. When combined with a bias characterization and a consistent\nvariance estimator, this allows one to perform asymptotically valid statistical\ninference, such as constructing confidence intervals, on the unknown regression\nfunction. We also provide a debiasing procedure for Mondrian random forests\nwhich allows them to achieve minimax-optimal estimation rates with\n$\\beta$-H\\\"older regression functions, for all $\\beta$ and in arbitrary\ndimension, assuming appropriate parameter tuning."}, "http://arxiv.org/abs/2310.09818": {"title": "MCMC for Bayesian nonparametric mixture modeling under differential privacy", "link": "http://arxiv.org/abs/2310.09818", "description": "Estimating the probability density of a population while preserving the\nprivacy of individuals in that population is an important and challenging\nproblem that has received considerable attention in recent years. While the\nprevious literature focused on frequentist approaches, in this paper, we\npropose a Bayesian nonparametric mixture model under differential privacy (DP)\nand present two Markov chain Monte Carlo (MCMC) algorithms for posterior\ninference. One is a marginal approach, resembling Neal's algorithm 5 with a\npseudo-marginal Metropolis-Hastings move, and the other is a conditional\napproach. Although our focus is primarily on local DP, we show that our MCMC\nalgorithms can be easily extended to deal with global differential privacy\nmechanisms. Moreover, for certain classes of mechanisms and mixture kernels, we\nshow how standard algorithms can be employed, resulting in substantial\nefficiency gains. Our approach is general and applicable to any mixture model\nand privacy mechanism. In several simulations and a real case study, we discuss\nthe performance of our algorithms and evaluate different privacy mechanisms\nproposed in the frequentist literature."}, "http://arxiv.org/abs/2310.09955": {"title": "On the Statistical Foundations of H-likelihood for Unobserved Random Variables", "link": "http://arxiv.org/abs/2310.09955", "description": "The maximum likelihood estimation is widely used for statistical inferences.\nIn this study, we reformulate the h-likelihood proposed by Lee and Nelder in\n1996, whose maximization yields maximum likelihood estimators for fixed\nparameters and asymptotically best unbiased predictors for random parameters.\nWe establish the statistical foundations for h-likelihood theories, which\nextend classical likelihood theories to embrace broad classes of statistical\nmodels with random parameters. The maximum h-likelihood estimators\nasymptotically achieve the generalized Cramer-Rao lower bound. Furthermore, we\nexplore asymptotic theory when the consistency of either fixed parameter\nestimation or random parameter prediction is violated. 
The introduction of this\nnew h-likelihood framework enables likelihood theories to cover inferences for\na much broader class of models, while also providing computationally efficient\nfitting algorithms to give asymptotically optimal estimators for fixed\nparameters and predictors for random parameters."}, "http://arxiv.org/abs/2310.09960": {"title": "Point Mass in the Confidence Distribution: Is it a Drawback or an Advantage?", "link": "http://arxiv.org/abs/2310.09960", "description": "Stein's (1959) problem highlights the phenomenon called the probability\ndilution in high dimensional cases, which is known as a fundamental deficiency\nin probabilistic inference. The satellite conjunction problem also suffers from\nprobability dilution that poor-quality data can lead to a dilution of collision\nprobability. Though various methods have been proposed, such as generalized\nfiducial distribution and the reference posterior, they could not maintain the\ncoverage probability of confidence intervals (CIs) in both problems. On the\nother hand, the confidence distribution (CD) has a point mass at zero, which\nhas been interpreted paradoxical. However, we show that this point mass is an\nadvantage rather than a drawback, because it gives a way to maintain the\ncoverage probability of CIs. More recently, `false confidence theorem' was\npresented as another deficiency in probabilistic inferences, called the false\nconfidence. It was further claimed that the use of consonant belief can\nmitigate this deficiency. However, we show that the false confidence theorem\ncannot be applied to the CD in both Stein's and satellite conjunction problems.\nIt is crucial that a confidence feature, not a consonant one, is the key to\novercome the deficiencies in probabilistic inferences. Our findings reveal that\nthe CD outperforms the other existing methods, including the consonant belief,\nin the context of Stein's and satellite conjunction problems. Additionally, we\ndemonstrate the ambiguity of coverage probability in an observed CI from the\nfrequentist CI procedure, and show that the CD provides valuable information\nregarding this ambiguity."}, "http://arxiv.org/abs/2310.09961": {"title": "Theoretical Evaluation of Asymmetric Shapley Values for Root-Cause Analysis", "link": "http://arxiv.org/abs/2310.09961", "description": "In this work, we examine Asymmetric Shapley Values (ASV), a variant of the\npopular SHAP additive local explanation method. ASV proposes a way to improve\nmodel explanations incorporating known causal relations between variables, and\nis also considered as a way to test for unfair discrimination in model\npredictions. Unexplored in previous literature, relaxing symmetry in Shapley\nvalues can have counter-intuitive consequences for model explanation. To better\nunderstand the method, we first show how local contributions correspond to\nglobal contributions of variance reduction. Using variance, we demonstrate\nmultiple cases where ASV yields counter-intuitive attributions, arguably\nproducing incorrect results for root-cause analysis. Second, we identify\ngeneralized additive models (GAM) as a restricted class for which ASV exhibits\ndesirable properties. We support our arguments by proving multiple theoretical\nresults about the method. 
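The symmetric-versus-asymmetric distinction discussed above can be made concrete with a brute-force toy example: classical Shapley values average marginal contributions over all feature orderings, while an asymmetric variant restricts attention to orderings consistent with an assumed causal order. The two-feature model with an interaction term below is hypothetical, and restricting to a single ordering is only a crude stand-in for ASV's reweighting of orderings.

```python
import numpy as np
from itertools import permutations

# toy model; a coalition's value plugs in background values for absent features
f = lambda z: z[0] * z[1] + z[0]
x = np.array([1.0, 1.0])        # instance to explain
bg = np.array([0.0, 0.0])       # background (reference) values

def coalition_value(S):
    z = bg.copy()
    z[list(S)] = x[list(S)]
    return f(z)

def shapley(orderings):
    # average marginal contribution of each feature over the given orderings
    phi = np.zeros(len(x))
    for order in orderings:
        S = []
        for j in order:
            before = coalition_value(S)
            S.append(j)
            phi[j] += coalition_value(S) - before
    return phi / len(orderings)

sym = shapley(list(permutations(range(2))))   # classical (symmetric) Shapley: [1.5, 0.5]
asym = shapley([(0, 1)])                      # only orderings with feature 0 first: [1.0, 1.0]
print("symmetric:", sym, "asymmetric:", asym)
```

With the interaction term the two attributions differ; for a purely additive model they coincide, which echoes the abstract's point about generalized additive models being a well-behaved class for ASV.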
Finally, we demonstrate the use of asymmetric\nattributions on multiple real-world datasets, comparing the results with and\nwithout restricted model families using gradient boosting and deep learning\nmodels."}, "http://arxiv.org/abs/2310.10003": {"title": "Conformal Contextual Robust Optimization", "link": "http://arxiv.org/abs/2310.10003", "description": "Data-driven approaches to predict-then-optimize decision-making problems seek\nto mitigate the risk of uncertainty region misspecification in safety-critical\nsettings. Current approaches, however, suffer from considering overly\nconservative uncertainty regions, often resulting in suboptimal decisionmaking.\nTo this end, we propose Conformal-Predict-Then-Optimize (CPO), a framework for\nleveraging highly informative, nonconvex conformal prediction regions over\nhigh-dimensional spaces based on conditional generative models, which have the\ndesired distribution-free coverage guarantees. Despite guaranteeing robustness,\nsuch black-box optimization procedures alone inspire little confidence owing to\nthe lack of explanation of why a particular decision was found to be optimal.\nWe, therefore, augment CPO to additionally provide semantically meaningful\nvisual summaries of the uncertainty regions to give qualitative intuition for\nthe optimal decision. We highlight the CPO framework by demonstrating results\non a suite of simulation-based inference benchmark tasks and a vehicle routing\ntask based on probabilistic weather prediction."}, "http://arxiv.org/abs/2310.10048": {"title": "Evaluation of transplant benefits with the U", "link": "http://arxiv.org/abs/2310.10048", "description": "Kidney transplantation is the most effective renal replacement therapy for\nend stage renal disease patients. With the severe shortage of kidney supplies\nand for the clinical effectiveness of transplantation, patient's life\nexpectancy post transplantation is used to prioritize patients for\ntransplantation; however, severe comorbidity conditions and old age are the\nmost dominant factors that negatively impact post-transplantation life\nexpectancy, effectively precluding sick or old patients from receiving\ntransplants. It would be crucial to design objective measures to quantify the\ntransplantation benefit by comparing the mean residual life with and without a\ntransplant, after adjusting for comorbidity and demographic conditions. To\naddress this urgent need, we propose a new class of semiparametric\ncovariate-dependent mean residual life models. Our method estimates covariate\neffects semiparametrically efficiently and the mean residual life function\nnonparametrically, enabling us to predict the residual life increment potential\nfor any given patient. Our method potentially leads to a more fair system that\nprioritizes patients who would have the largest residual life gains. Our\nanalysis of the kidney transplant data from the U.S. Scientific Registry of\nTransplant Recipients also suggests that a single index of covariates summarize\nwell the impacts of multiple covariates, which may facilitate interpretations\nof each covariate's effect. 
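The quantity being modeled above, the mean residual life m(t) = E[T - t | T > t], has a simple empirical counterpart when lifetimes are fully observed; the sketch below evaluates it on simulated exponential lifetimes (for which the true value is constant), ignoring censoring and covariates entirely, and is not the paper's semiparametric estimator.

```python
import numpy as np

rng = np.random.default_rng(7)
lifetimes = rng.exponential(scale=10.0, size=5000)   # toy uncensored survival times

def mean_residual_life(t, T):
    """Empirical m(t) = E[T - t | T > t]; for Exp(scale=10) the true value is 10 at every t."""
    at_risk = T[T > t]
    return np.nan if at_risk.size == 0 else (at_risk - t).mean()

for t in (0.0, 5.0, 10.0, 20.0):
    print(f"m({t:>4}) = {mean_residual_life(t, lifetimes):.2f}")
```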
Our subgroup analysis further disclosed\ninequalities in survival gains across groups defined by race, gender and\ninsurance type (reflecting socioeconomic status)."}, "http://arxiv.org/abs/2310.10052": {"title": "Group-Orthogonal Subsampling for Hierarchical Data Based on Linear Mixed Models", "link": "http://arxiv.org/abs/2310.10052", "description": "Hierarchical data analysis is crucial in various fields for making\ndiscoveries. The linear mixed model is often used for training hierarchical\ndata, but its parameter estimation is computationally expensive, especially\nwith big data. Subsampling techniques have been developed to address this\nchallenge. However, most existing subsampling methods assume homogeneous data\nand do not consider the possible heterogeneity in hierarchical data. To address\nthis limitation, we develop a new approach called group-orthogonal subsampling\n(GOSS) for selecting informative subsets of hierarchical data that may exhibit\nheterogeneity. GOSS selects subdata with balanced data size among groups and\ncombinatorial orthogonality within each group, resulting in subdata that are\n$D$- and $A$-optimal for building linear mixed models. Estimators of parameters\ntrained on GOSS subdata are consistent and asymptotically normal. GOSS is shown\nto be numerically appealing via simulations and a real data application.\nTheoretical proofs, R codes, and supplementary numerical results are accessible\nonline as Supplementary Materials."}, "http://arxiv.org/abs/2310.10239": {"title": "Structural transfer learning of non-Gaussian DAG", "link": "http://arxiv.org/abs/2310.10239", "description": "Directed acyclic graph (DAG) has been widely employed to represent\ndirectional relationships among a set of collected nodes. Yet, the available\ndata in one single study is often limited for accurate DAG reconstruction,\nwhereas heterogeneous data may be collected from multiple relevant studies. It\nremains an open question how to pool the heterogeneous data together for better\nDAG structure reconstruction in the target study. In this paper, we first\nintroduce a novel set of structural similarity measures for DAG and then\npresent a transfer DAG learning framework by effectively leveraging information\nfrom auxiliary DAGs of different levels of similarities. Our theoretical\nanalysis shows substantial improvement in terms of DAG reconstruction in the\ntarget study, even when no auxiliary DAG is overall similar to the target DAG,\nwhich is in sharp contrast to most existing transfer learning methods. The\nadvantage of the proposed transfer DAG learning is also supported by extensive\nnumerical experiments on both synthetic data and multi-site brain functional\nconnectivity network data."}, "http://arxiv.org/abs/2310.10271": {"title": "A geometric power analysis for general log-linear models", "link": "http://arxiv.org/abs/2310.10271", "description": "General log-linear models are widely used to express the association in\nmultivariate frequency data on contingency tables. The paper focuses on the\npower analysis for testing the goodness-of-fit hypothesis for these models.\nConventionally, for the power-related sample size calculations a deviation from\nthe null hypothesis, aka effect size, is specified by means of the chi-square\ngoodness-of-fit index. It is argued that the odds ratio is a more natural\nmeasure of effect size, with the advantage of having a data-relevant\ninterpretation. 
Therefore, a class of log-affine models that are specified by\nodds ratios whose values deviate from those of the null by a small amount can\nbe chosen as an alternative. Being expressed as sets of constraints on odds\nratios, both hypotheses are represented by smooth surfaces in the probability\nsimplex, and thus, the power analysis can be given a geometric interpretation\nas well. A concept of geometric power is introduced and a Monte-Carlo algorithm\nfor its estimation is proposed. The framework is applied to the power analysis\nof goodness-of-fit in the context of multinomial sampling. An iterative scaling\nprocedure for generating distributions from a log-affine model is described and\nits convergence is proved. To illustrate, the geometric power analysis is\ncarried out for data from a clinical study."}, "http://arxiv.org/abs/2310.10324": {"title": "Assessing univariate and bivariate risks of late-frost and drought using vine copulas: A historical study for Bavaria", "link": "http://arxiv.org/abs/2310.10324", "description": "In light of climate change's impacts on forests, including extreme drought\nand late-frost, leading to vitality decline and regional forest die-back, we\nassess univariate drought and late-frost risks and perform a joint risk\nanalysis in Bavaria, Germany, from 1952 to 2020. Utilizing a vast dataset with\n26 bioclimatic and topographic variables, we employ vine copula models due to\nthe data's non-Gaussian and asymmetric dependencies. We use D-vine regression\nfor univariate and Y-vine regression for bivariate analysis, and propose\ncorresponding univariate and bivariate conditional probability risk measures.\nWe identify \"at-risk\" regions, emphasizing the need for forest adaptation due\nto climate change."}, "http://arxiv.org/abs/2310.10329": {"title": "Towards Data-Conditional Simulation for ABC Inference in Stochastic Differential Equations", "link": "http://arxiv.org/abs/2310.10329", "description": "We develop a Bayesian inference method for discretely-observed stochastic\ndifferential equations (SDEs). Inference is challenging for most SDEs, due to\nthe analytical intractability of the likelihood function. Nevertheless, forward\nsimulation via numerical methods is straightforward, motivating the use of\napproximate Bayesian computation (ABC). We propose a conditional simulation\nscheme for SDEs that is based on lookahead strategies for sequential Monte\nCarlo (SMC) and particle smoothing using backward simulation. This leads to the\nsimulation of trajectories that are consistent with the observed trajectory,\nthereby increasing the ABC acceptance rate. We additionally employ an invariant\nneural network, previously developed for Markov processes, to learn the summary\nstatistics function required in ABC. The neural network is incrementally\nretrained by exploiting an ABC-SMC sampler, which provides new training data at\neach round. Since the SDE simulation scheme differs from standard forward\nsimulation, we propose a suitable importance sampling correction, which has the\nadded advantage of guiding the parameters towards regions of high posterior\ndensity, especially in the first ABC-SMC round. 
Our approach achieves accurate\ninference and is about three times faster than standard (forward-only) ABC-SMC.\nWe illustrate our method in four simulation studies, including three examples\nfrom the Chan-Karolyi-Longstaff-Sanders SDE family."}, "http://arxiv.org/abs/2310.10331": {"title": "Specifications tests for count time series models with covariates", "link": "http://arxiv.org/abs/2310.10331", "description": "We propose a goodness-of-fit test for a class of count time series models\nwith covariates which includes the Poisson autoregressive model with covariates\n(PARX) as a special case. The test criteria are derived from a specific\ncharacterization of the conditional probability generating function and the\ntest statistic is formulated as a weighted $L_2$ norm of the corresponding\nsample counterpart. The asymptotic properties of the proposed test statistic\nare provided under the null hypothesis as well as under specific alternatives.\nA bootstrap version of the test is explored in a Monte Carlo study and\nillustrated on a real data set on road safety."}, "http://arxiv.org/abs/2310.10373": {"title": "False Discovery Proportion control for aggregated Knockoffs", "link": "http://arxiv.org/abs/2310.10373", "description": "Controlled variable selection is an important analytical step in various\nscientific fields, such as brain imaging or genomics. In these high-dimensional\ndata settings, considering too many variables leads to poor models and high\ncosts, hence the need for statistical guarantees on false positives. Knockoffs\nare a popular statistical tool for conditional variable selection in high\ndimension. However, they control for the expected proportion of false\ndiscoveries (FDR) and not their actual proportion (FDP). We present a new\nmethod, KOPI, that controls the proportion of false discoveries for\nKnockoff-based inference. The proposed method also relies on a new type of\naggregation to address the undesirable randomness associated with classical\nKnockoff inference. We demonstrate FDP control and substantial power gains over\nexisting Knockoff-based methods in various simulation settings and achieve good\nsensitivity/specificity tradeoffs on brain imaging and genomic data."}, "http://arxiv.org/abs/2310.10393": {"title": "Statistical and Causal Robustness for Causal Null Hypothesis Tests", "link": "http://arxiv.org/abs/2310.10393", "description": "Prior work applying semiparametric theory to causal inference has primarily\nfocused on deriving estimators that exhibit statistical robustness under a\nprespecified causal model that permits identification of a desired causal\nparameter. However, a fundamental challenge is correct specification of such a\nmodel, which usually involves making untestable assumptions. Evidence factors\nis an approach to combining hypothesis tests of a common causal null hypothesis\nunder two or more candidate causal models. Under certain conditions, this\nyields a test that is valid if at least one of the underlying models is\ncorrect, which is a form of causal robustness. We propose a method of combining\nsemiparametric theory with evidence factors. We develop a causal null\nhypothesis test based on joint asymptotic normality of K asymptotically linear\nsemiparametric estimators, where each estimator is based on a distinct\nidentifying functional derived from each of K candidate causal models. 
We show\nthat this test provides both statistical and causal robustness in the sense\nthat it is valid if at least one of the K proposed causal models is correct,\nwhile also allowing for slower than parametric rates of convergence in\nestimating nuisance functions. We demonstrate the efficacy of our method via\nsimulations and an application to the Framingham Heart Study."}, "http://arxiv.org/abs/2310.10407": {"title": "Ensemble methods for testing a global null", "link": "http://arxiv.org/abs/2310.10407", "description": "Testing a global null is a canonical problem in statistics and has a wide\nrange of applications. In view of the fact that no uniformly most powerful test\nexists, prior and/or domain knowledge are commonly used to focus on a certain\nclass of alternatives to improve the testing power. However, it is generally\nchallenging to develop tests that are particularly powerful against a certain\nclass of alternatives. In this paper, motivated by the success of ensemble\nlearning methods for prediction or classification, we propose an ensemble\nframework for testing that mimics the spirit of random forests to deal with the\nchallenges. Our ensemble testing framework aggregates a collection of weak base\ntests to form a final ensemble test that maintains strong and robust power for\nglobal nulls. We apply the framework to four problems about global testing in\ndifferent classes of alternatives arising from Whole Genome Sequencing (WGS)\nassociation studies. Specific ensemble tests are proposed for each of these\nproblems, and their theoretical optimality is established in terms of Bahadur\nefficiency. Extensive simulations and an analysis of a real WGS dataset are\nconducted to demonstrate the type I error control and/or power gain of the\nproposed ensemble tests."}, "http://arxiv.org/abs/2310.10422": {"title": "A Neural Network-Based Approach to Normality Testing for Dependent Data", "link": "http://arxiv.org/abs/2310.10422", "description": "There is a wide availability of methods for testing normality under the\nassumption of independent and identically distributed data. When data are\ndependent in space and/or time, however, assessing and testing the marginal\nbehavior is considerably more challenging, as the marginal behavior is impacted\nby the degree of dependence. We propose a new approach to assess normality for\ndependent data by non-linearly incorporating existing statistics from normality\ntests as well as sample moments such as skewness and kurtosis through a neural\nnetwork. We calibrate (deep) neural networks by simulated normal and non-normal\ndata with a wide range of dependence structures and we determine the\nprobability of rejecting the null hypothesis. We compare several approaches for\nnormality tests and demonstrate the superiority of our method in terms of\nstatistical power through an extensive simulation study. A real world\napplication to global temperature data further demonstrates how the degree of\nspatio-temporal aggregation affects the marginal normality in the data."}, "http://arxiv.org/abs/2310.10494": {"title": "Multivariate Scalar on Multidimensional Distribution Regression", "link": "http://arxiv.org/abs/2310.10494", "description": "We develop a new method for multivariate scalar on multidimensional\ndistribution regression. Traditional approaches typically analyze isolated\nunivariate scalar outcomes or consider unidimensional distributional\nrepresentations as predictors. 
However, these approaches are sub-optimal\nbecause: i) they fail to utilize the dependence between the distributional\npredictors; and ii) they neglect the correlation structure of the response. To overcome\nthese limitations, we propose a multivariate distributional analysis framework\nthat harnesses the power of multivariate density functions and multitask\nlearning. We develop a computationally efficient semiparametric estimation\nmethod for modelling the effect of the latent joint density on the multivariate\nresponse of interest. Additionally, we introduce a new conformal algorithm for\nquantifying the uncertainty of regression models with multivariate responses\nand distributional predictors, providing valuable insights into the conditional\ndistribution of the response. We have validated the effectiveness of our\nproposed method through comprehensive numerical simulations, clearly\ndemonstrating its superior performance compared to traditional methods. The\napplication of the proposed method is demonstrated on tri-axial accelerometer\ndata from the National Health and Nutrition Examination Survey (NHANES)\n2011-2014 for modelling the association between cognitive scores across various\ndomains and distributional representations of physical activity among the older\nadult population. Our results highlight the advantages of the proposed\napproach, emphasizing the significance of incorporating complete spatial\ninformation derived from the accelerometer device."}, "http://arxiv.org/abs/2310.10588": {"title": "Max-convolution processes with random shape indicator kernels", "link": "http://arxiv.org/abs/2310.10588", "description": "In this paper, we introduce a new class of models for spatial data obtained\nfrom max-convolution processes based on indicator kernels with random shape. We\nshow that this class of models has appealing dependence properties, including\ntail dependence at short distances and independence at long distances. We\nfurther consider max-convolutions between such processes and processes with\ntail independence, in order to separately control the bulk and tail dependence\nbehaviors, and to increase flexibility of the model at longer distances, in\nparticular, to capture intermediate tail dependence. We show how parameters can\nbe estimated using a weighted pairwise likelihood approach, and we conduct an\nextensive simulation study to show that the proposed inference approach is\nfeasible in high dimensions and it yields accurate parameter estimates in most\ncases. We apply the proposed methodology to analyse daily temperature maxima\nmeasured at 100 monitoring stations in the state of Oklahoma, US. Our results\nindicate that our proposed model provides a good fit to the data, and that it\ncaptures both the bulk and the tail dependence structures accurately."}, "http://arxiv.org/abs/1805.07301": {"title": "Enhanced Pricing and Management of Bundled Insurance Risks with Dependence-aware Prediction using Pair Copula Construction", "link": "http://arxiv.org/abs/1805.07301", "description": "We propose a dependence-aware predictive modeling framework for multivariate\nrisks stemming from an insurance contract with bundling features - an important\ntype of policy increasingly offered by major insurance companies. The bundling\nfeature naturally leads to longitudinal measurements of multiple insurance\nrisks, and correct pricing and management of such risks is of fundamental\ninterest to the financial stability of the macroeconomy. 
We build a novel\npredictive model that fully captures the dependence among the multivariate\nrepeated risk measurements. Specifically, the longitudinal measurement of each\nindividual risk is first modeled using pair copula construction with a D-vine\nstructure, and the multiple D-vines are then integrated by a flexible copula.\nThe proposed model provides a unified modeling framework for multivariate\nlongitudinal data that can accommodate different scales of measurements,\nincluding continuous, discrete, and mixed observations, and thus can be\npotentially useful for various economic studies. A computationally efficient\nsequential method is proposed for model estimation and inference, and its\nperformance is investigated both theoretically and via simulation studies. In\nthe application, we examine multivariate bundled risks in multi-peril property\ninsurance using proprietary data from a commercial property insurance provider.\nThe proposed model is found to provide improved decision making for several key\ninsurance operations. For underwriting, we show that the experience rate priced\nby the proposed model leads to a 9% lift in the insurer's net revenue. For\nreinsurance, we show that the insurer underestimates the risk of the retained\ninsurance portfolio by 10% when ignoring the dependence among bundled insurance\nrisks."}, "http://arxiv.org/abs/2005.04721": {"title": "Decision Making in Drug Development via Inference on Power", "link": "http://arxiv.org/abs/2005.04721", "description": "A typical power calculation is performed by replacing unknown\npopulation-level quantities in the power function with what is observed in\nexternal studies. Many authors and practitioners view this as an assumed value\nof power and offer the Bayesian quantity probability of success or assurance as\nan alternative. The claim is by averaging over a prior or posterior\ndistribution, probability of success transcends power by capturing the\nuncertainty around the unknown true treatment effect and any other\npopulation-level parameters. We use p-value functions to frame both the\nprobability of success calculation and the typical power calculation as merely\nproducing two different point estimates of power. We demonstrate that Go/No-Go\ndecisions based on either point estimate of power do not adequately quantify\nand control the risk involved, and instead we argue for Go/No-Go decisions that\nutilize inference on power for better risk management and decision making."}, "http://arxiv.org/abs/2103.00674": {"title": "BEAUTY Powered BEAST", "link": "http://arxiv.org/abs/2103.00674", "description": "We study distribution-free goodness-of-fit tests with the proposed Binary\nExpansion Approximation of UniformiTY (BEAUTY) approach. This method\ngeneralizes the renowned Euler's formula, and approximates the characteristic\nfunction of any copula through a linear combination of expectations of binary\ninteractions from marginal binary expansions. This novel theory enables a\nunification of many important tests of independence via approximations from\nspecific quadratic forms of symmetry statistics, where the deterministic weight\nmatrix characterizes the power properties of each test. To achieve a robust\npower, we examine test statistics with data-adaptive weights, referred to as\nthe Binary Expansion Adaptive Symmetry Test (BEAST). 
Using properties of the\nbinary expansion filtration, we demonstrate that the Neyman-Pearson test of\nuniformity can be approximated by an oracle weighted sum of symmetry\nstatistics. The BEAST with this oracle provides a useful benchmark of feasible\npower. To approach this oracle power, we devise the BEAST through a regularized\nresampling approximation of the oracle test. The BEAST improves the empirical\npower of many existing tests against a wide spectrum of common alternatives and\ndelivers a clear interpretation of dependency forms when significant."}, "http://arxiv.org/abs/2103.16159": {"title": "Controlling the False Discovery Rate in Transformational Sparsity: Split Knockoffs", "link": "http://arxiv.org/abs/2103.16159", "description": "Controlling the False Discovery Rate (FDR) in a variable selection procedure\nis critical for reproducible discoveries, and it has been extensively studied\nin sparse linear models. However, it remains largely open in scenarios where\nthe sparsity constraint is not directly imposed on the parameters but on a\nlinear transformation of the parameters to be estimated. Examples of such\nscenarios include total variations, wavelet transforms, fused LASSO, and trend\nfiltering. In this paper, we propose a data-adaptive FDR control method, called\nthe Split Knockoff method, for this transformational sparsity setting. The\nproposed method exploits both variable and data splitting. The linear\ntransformation constraint is relaxed to its Euclidean proximity in a lifted\nparameter space, which yields an orthogonal design that enables the orthogonal\nSplit Knockoff construction. To overcome the challenge that exchangeability\nfails due to the heterogeneous noise brought by the transformation, new inverse\nsupermartingale structures are developed via data splitting for provable FDR\ncontrol without sacrificing power. Simulation experiments demonstrate that the\nproposed methodology achieves the desired FDR and power. We also provide an\napplication to Alzheimer's Disease study, where atrophy brain regions and their\nabnormal connections can be discovered based on a structural Magnetic Resonance\nImaging dataset (ADNI)."}, "http://arxiv.org/abs/2201.05967": {"title": "Uniform Inference for Kernel Density Estimators with Dyadic Data", "link": "http://arxiv.org/abs/2201.05967", "description": "Dyadic data is often encountered when quantities of interest are associated\nwith the edges of a network. As such it plays an important role in statistics,\neconometrics and many other data science disciplines. We consider the problem\nof uniformly estimating a dyadic Lebesgue density function, focusing on\nnonparametric kernel-based estimators taking the form of dyadic empirical\nprocesses. Our main contributions include the minimax-optimal uniform\nconvergence rate of the dyadic kernel density estimator, along with strong\napproximation results for the associated standardized and Studentized\n$t$-processes. A consistent variance estimator enables the construction of\nvalid and feasible uniform confidence bands for the unknown density function.\nWe showcase the broad applicability of our results by developing novel\ncounterfactual density estimation and inference methodology for dyadic data,\nwhich can be used for causal inference and program evaluation. A crucial\nfeature of dyadic distributions is that they may be \"degenerate\" at certain\npoints in the support of the data, a property making our analysis somewhat\ndelicate. 
Nonetheless our methods for uniform inference remain robust to the\npotential presence of such points. For implementation purposes, we discuss\ninference procedures based on positive semi-definite covariance estimators,\nmean squared error optimal bandwidth selectors and robust bias correction\ntechniques. We illustrate the empirical finite-sample performance of our\nmethods both in simulations and with real-world trade data, for which we make\ncomparisons between observed and counterfactual trade distributions in\ndifferent years. Our technical results concerning strong approximations and\nmaximal inequalities are of potential independent interest."}, "http://arxiv.org/abs/2206.01076": {"title": "Likelihood-based Inference for Random Networks with Changepoints", "link": "http://arxiv.org/abs/2206.01076", "description": "Generative, temporal network models play an important role in analyzing the\ndependence structure and evolution patterns of complex networks. Due to the\ncomplicated nature of real network data, it is often naive to assume that the\nunderlying data-generative mechanism itself is invariant with time. Such\nobservation leads to the study of changepoints or sudden shifts in the\ndistributional structure of the evolving network. In this paper, we propose a\nlikelihood-based methodology to detect changepoints in undirected, affine\npreferential attachment networks, and establish a hypothesis testing framework\nto detect a single changepoint, together with a consistent estimator for the\nchangepoint. Such results require establishing consistency and asymptotic\nnormality of the MLE under the changepoint regime, which suffers from long\nrange dependence. The methodology is then extended to the multiple changepoint\nsetting via both a sliding window method and a more computationally efficient\nscore statistic. We also compare the proposed methodology with previously\ndeveloped non-parametric estimators of the changepoint via simulation, and the\nmethods developed herein are applied to modeling the popularity of a topic in a\nTwitter network over time."}, "http://arxiv.org/abs/2301.01616": {"title": "Locally Private Causal Inference for Randomized Experiments", "link": "http://arxiv.org/abs/2301.01616", "description": "Local differential privacy is a differential privacy paradigm in which\nindividuals first apply a privacy mechanism to their data (often by adding\nnoise) before transmitting the result to a curator. The noise for privacy\nresults in additional bias and variance in their analyses. Thus it is of great\nimportance for analysts to incorporate the privacy noise into valid inference.\nIn this article, we develop methodologies to infer causal effects from locally\nprivatized data under randomized experiments. First, we present frequentist\nestimators under various privacy scenarios with their variance estimators and\nplug-in confidence intervals. We show a na\\\"ive debiased estimator results in\ninferior mean-squared error (MSE) compared to minimax lower bounds. In\ncontrast, we show that using a customized privacy mechanism, we can match the\nlower bound, giving minimax optimal inference. We also develop a Bayesian\nnonparametric methodology along with a blocked Gibbs sampling algorithm, which\ncan be applied to any of our proposed privacy mechanisms, and which performs\nespecially well in terms of MSE for tight privacy budgets. 
Finally, we present\nsimulation studies to evaluate the performance of our proposed frequentist and\nBayesian methodologies for various privacy budgets, resulting in useful\nsuggestions for performing causal inference for privatized data."}, "http://arxiv.org/abs/2303.03215": {"title": "Quantile-Quantile Methodology -- Detailed Results", "link": "http://arxiv.org/abs/2303.03215", "description": "The linear quantile-quantile relationship provides an easy-to-implement yet\neffective tool for transformation to and testing for normality. Its good\nperformance is verified in this report."}, "http://arxiv.org/abs/2305.06645": {"title": "Causal Inference for Continuous Multiple Time Point Interventions", "link": "http://arxiv.org/abs/2305.06645", "description": "There are limited options to estimate the treatment effects of variables\nwhich are continuous and measured at multiple time points, particularly if the\ntrue dose-response curve should be estimated as closely as possible. However,\nthese situations may be of relevance: in pharmacology, one may be interested in\nhow outcomes of people living with -- and treated for -- HIV, such as viral\nfailure, would vary for time-varying interventions such as different drug\nconcentration trajectories. A challenge for doing causal inference with\ncontinuous interventions is that the positivity assumption is typically\nviolated. To address positivity violations, we develop projection functions,\nwhich reweigh and redefine the estimand of interest based on functions of the\nconditional support for the respective interventions. With these functions, we\nobtain the desired dose-response curve in areas of enough support, and\notherwise a meaningful estimand that does not require the positivity\nassumption. We develop $g$-computation type plug-in estimators for this case.\nThose are contrasted with g-computation estimators which are applied to\ncontinuous interventions without specifically addressing positivity violations,\nwhich we propose to be presented with diagnostics. The ideas are illustrated\nwith longitudinal data from HIV positive children treated with an\nefavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children\n$<13$ years in Zambia/Uganda. Simulations show in which situations a standard\n$g$-computation approach is appropriate, and in which it leads to bias and how\nthe proposed weighted estimation approach then recovers the alternative\nestimand of interest."}, "http://arxiv.org/abs/2305.14275": {"title": "Amortized Variational Inference with Coverage Guarantees", "link": "http://arxiv.org/abs/2305.14275", "description": "Amortized variational inference produces a posterior approximation that can\nbe rapidly computed given any new observation. Unfortunately, there are few\nguarantees about the quality of these approximate posteriors. We propose\nConformalized Amortized Neural Variational Inference (CANVI), a procedure that\nis scalable, easily implemented, and provides guaranteed marginal coverage.\nGiven a collection of candidate amortized posterior approximators, CANVI\nconstructs conformalized predictors based on each candidate, compares the\npredictors using a metric known as predictive efficiency, and returns the most\nefficient predictor. 
CANVI ensures that the resulting predictor constructs\nregions that contain the truth with a user-specified level of probability.\nCANVI is agnostic to design decisions in formulating the candidate\napproximators and only requires access to samples from the forward model,\npermitting its use in likelihood-free settings. We prove lower bounds on the\npredictive efficiency of the regions produced by CANVI and explore how the\nquality of a posterior approximation relates to the predictive efficiency of\nprediction regions based on that approximation. Finally, we demonstrate the\naccurate calibration and high predictive efficiency of CANVI on a suite of\nsimulation-based inference benchmark tasks and an important scientific task:\nanalyzing galaxy emission spectra."}, "http://arxiv.org/abs/2305.17187": {"title": "Clip-OGD: An Experimental Design for Adaptive Neyman Allocation in Sequential Experiments", "link": "http://arxiv.org/abs/2305.17187", "description": "From clinical development of cancer therapies to investigations into partisan\nbias, adaptive sequential designs have become increasingly popular method for\ncausal inference, as they offer the possibility of improved precision over\ntheir non-adaptive counterparts. However, even in simple settings (e.g. two\ntreatments) the extent to which adaptive designs can improve precision is not\nsufficiently well understood. In this work, we study the problem of Adaptive\nNeyman Allocation in a design-based potential outcomes framework, where the\nexperimenter seeks to construct an adaptive design which is nearly as efficient\nas the optimal (but infeasible) non-adaptive Neyman design, which has access to\nall potential outcomes. Motivated by connections to online optimization, we\npropose Neyman Ratio and Neyman Regret as two (equivalent) performance measures\nof adaptive designs for this problem. We present Clip-OGD, an adaptive design\nwhich achieves $\\widetilde{O}(\\sqrt{T})$ expected Neyman regret and thereby\nrecovers the optimal Neyman variance in large samples. Finally, we construct a\nconservative variance estimator which facilitates the development of\nasymptotically valid confidence intervals. To complement our theoretical\nresults, we conduct simulations using data from a microeconomic experiment."}, "http://arxiv.org/abs/2306.15622": {"title": "Biclustering random matrix partitions with an application to classification of forensic body fluids", "link": "http://arxiv.org/abs/2306.15622", "description": "Classification of unlabeled data is usually achieved by supervised learning\nfrom labeled samples. Although there exist many sophisticated supervised\nmachine learning methods that can predict the missing labels with a high level\nof accuracy, they often lack the required transparency in situations where it\nis important to provide interpretable results and meaningful measures of\nconfidence. Body fluid classification of forensic casework data is the case in\npoint. We develop a new Biclustering Dirichlet Process for Class-assignment\nwith Random Matrices (BDP-CaRMa), with a three-level hierarchy of clustering,\nand a model-based approach to classification that adapts to block structure in\nthe data matrix. As the class labels of some observations are missing, the\nnumber of rows in the data matrix for each class is unknown. BDP-CaRMa handles\nthis and extends existing biclustering methods by simultaneously biclustering\nmultiple matrices each having a randomly variable number of rows. 
We\ndemonstrate our method by applying it to the motivating problem, which is the\nclassification of body fluids based on mRNA profiles taken from crime scenes.\nThe analyses of casework-like data show that our method is interpretable and\nproduces well-calibrated posterior probabilities. Our model can be more\ngenerally applied to other types of data with a similar structure to the\nforensic data."}, "http://arxiv.org/abs/2307.05644": {"title": "Lambert W random variables and their applications in loss modelling", "link": "http://arxiv.org/abs/2307.05644", "description": "Several distributions and families of distributions are proposed to model\nskewed data; think, e.g., of the skew-normal and related distributions. Lambert W\nrandom variables offer an alternative approach where, instead of constructing a\nnew distribution, a certain transform is proposed (Goerg, 2011). Such an\napproach allows the construction of a Lambert W skewed version from any\ndistribution. We choose the Lambert W normal distribution as a natural starting\npoint and also include the Lambert W exponential distribution due to the simplicity\nand shape of the exponential distribution, which, after skewing, may produce a\nreasonably heavy tail for loss models. In the theoretical part, we focus on the\nmathematical properties of obtained distributions, including the range of\nskewness. In the practical part, the suitability of corresponding Lambert W\ntransformed distributions is evaluated on real insurance data. The results are\ncompared with those obtained using common loss distributions."}, "http://arxiv.org/abs/2307.06840": {"title": "Ensemble learning for blending gridded satellite and gauge-measured precipitation data", "link": "http://arxiv.org/abs/2307.06840", "description": "Regression algorithms are regularly used for improving the accuracy of\nsatellite precipitation products. In this context, satellite precipitation and\ntopography data are the predictor variables, and gauge-measured precipitation\ndata are the dependent variables. Alongside this, it is increasingly recognised\nin many fields that combinations of algorithms through ensemble learning can\nlead to substantial predictive performance improvements. Still, a sufficient\nnumber of ensemble learners for improving the accuracy of satellite\nprecipitation products and their large-scale comparison are currently missing\nfrom the literature. In this study, we work towards filling in this specific\ngap by proposing 11 new ensemble learners in the field and by extensively\ncomparing them. We apply the ensemble learners to monthly data from the\nPERSIANN (Precipitation Estimation from Remotely Sensed Information using\nArtificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals\nfor GPM) gridded datasets that span over a 15-year period and over the entire\ncontiguous United States (CONUS). We also use gauge-measured precipitation\ndata from the Global Historical Climatology Network monthly database, version 2\n(GHCNm). The ensemble learners combine the predictions of six machine learning\nregression algorithms (base learners), namely the multivariate adaptive\nregression splines (MARS), multivariate adaptive polynomial splines\n(poly-MARS), random forests (RF), gradient boosting machines (GBM), extreme\ngradient boosting (XGBoost) and Bayesian regularized neural networks (BRNN),\nand each of them is based on a different combiner. 
The combiners include the\nequal-weight combiner, the median combiner, two best learners and seven\nvariants of a sophisticated stacking method. The latter stacks a regression\nalgorithm on top of the base learners to combine their independent\npredictions..."}, "http://arxiv.org/abs/2309.12819": {"title": "Doubly Robust Proximal Causal Learning for Continuous Treatments", "link": "http://arxiv.org/abs/2309.12819", "description": "Proximal causal learning is a promising framework for identifying the causal\neffect under the existence of unmeasured confounders. Within this framework,\nthe doubly robust (DR) estimator was derived and has shown its effectiveness in\nestimation, especially when the model assumption is violated. However, the\ncurrent form of the DR estimator is restricted to binary treatments, while the\ntreatment can be continuous in many real-world applications. The primary\nobstacle to continuous treatments resides in the delta function present in the\noriginal DR estimator, making it infeasible in causal effect estimation and\nintroducing a heavy computational burden in nuisance function estimation. To\naddress these challenges, we propose a kernel-based DR estimator that can well\nhandle continuous treatments. Equipped with its smoothness, we show that its\noracle form is a consistent approximation of the influence function. Further,\nwe propose a new approach to efficiently solve the nuisance functions. We then\nprovide a comprehensive convergence analysis in terms of the mean square error.\nWe demonstrate the utility of our estimator on synthetic datasets and\nreal-world applications."}, "http://arxiv.org/abs/2309.17283": {"title": "The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation", "link": "http://arxiv.org/abs/2309.17283", "description": "Assessing causal effects in the presence of unobserved confounding is a\nchallenging problem. Existing studies leveraged proxy variables or multiple\ntreatments to adjust for the confounding bias. In particular, the latter\napproach attributes the impact on a single outcome to multiple treatments,\nallowing estimating latent variables for confounding control. Nevertheless,\nthese methods primarily focus on a single outcome, whereas in many real-world\nscenarios, there is greater interest in studying the effects on multiple\noutcomes. Besides, these outcomes are often coupled with multiple treatments.\nExamples include the intensive care unit (ICU), where health providers evaluate\nthe effectiveness of therapies on multiple health indicators. To accommodate\nthese scenarios, we consider a new setting dubbed as multiple treatments and\nmultiple outcomes. We then show that parallel studies of multiple outcomes\ninvolved in this setting can assist each other in causal identification, in the\nsense that we can exploit other treatments and outcomes as proxies for each\ntreatment effect under study. We proceed with a causal discovery method that\ncan effectively identify such proxies for causal estimation. The utility of our\nmethod is demonstrated in synthetic data and sepsis disease."}, "http://arxiv.org/abs/2310.10740": {"title": "Unbiased Estimation of Structured Prediction Error", "link": "http://arxiv.org/abs/2310.10740", "description": "Many modern datasets, such as those in ecology and geology, are composed of\nsamples with spatial structure and dependence. 
With such data violating the\nusual independent and identically distributed (IID) assumption in machine\nlearning and classical statistics, it is unclear a priori how one should\nmeasure the performance and generalization of models. Several authors have\nempirically investigated cross-validation (CV) methods in this setting,\nreaching mixed conclusions. We provide a class of unbiased estimation methods\nfor general quadratic errors, correlated Gaussian response, and arbitrary\nprediction function $g$, for a noise-elevated version of the error. Our\napproach generalizes the coupled bootstrap (CB) from the normal means problem\nto general normal data, allowing correlation both within and between the\ntraining and test sets. CB relies on creating bootstrap samples that are\nintelligently decoupled, in the sense of being statistically independent.\nSpecifically, the key to CB lies in generating two independent \"views\" of our\ndata and using them as stand-ins for the usual independent training and test\nsamples. Beginning with Mallows' $C_p$, we generalize the estimator to develop\nour generalized $C_p$ estimators (GC). We show that under only a moment condition\non $g$, this noise-elevated error estimate converges smoothly to the noiseless\nerror estimate. We show that when Stein's unbiased risk estimator (SURE)\napplies, GC converges to SURE as in the normal means problem. Further, we use\nthese same tools to analyze CV and provide some theoretical analysis to help\nunderstand when CV will provide good estimates of error. Simulations align with\nour theoretical results, demonstrating the effectiveness of GC and illustrating\nthe behavior of CV methods. Lastly, we apply our estimator to a model selection\ntask on geothermal data in Nevada."}, "http://arxiv.org/abs/2310.10761": {"title": "Simulation Based Composite Likelihood", "link": "http://arxiv.org/abs/2310.10761", "description": "Inference for high-dimensional hidden Markov models is challenging due to the\nexponential-in-dimension computational cost of the forward algorithm. To\naddress this issue, we introduce an innovative composite likelihood approach\ncalled \"Simulation Based Composite Likelihood\" (SimBa-CL). With SimBa-CL, we\napproximate the likelihood by the product of its marginals, which we estimate\nusing Monte Carlo sampling. In a similar vein to approximate Bayesian\ncomputation (ABC), SimBa-CL requires multiple simulations from the model, but,\nin contrast to ABC, it provides a likelihood approximation that guides the\noptimization of the parameters. Leveraging automatic differentiation libraries,\nit is simple to calculate gradients and Hessians to not only speed up\noptimization, but also to build approximate confidence sets. We conclude with\nan extensive experimental section, where we empirically validate our\ntheoretical results, conduct a comparative analysis with SMC, and apply\nSimBa-CL to real-world Aphtovirus data."}, "http://arxiv.org/abs/2310.10798": {"title": "Poisson Count Time Series", "link": "http://arxiv.org/abs/2310.10798", "description": "This paper reviews and compares popular methods, some old and some very\nrecent, that produce time series having Poisson marginal distributions. The\npaper begins by narrating ways in which time series with Poisson marginal\ndistributions can be produced. 
Modeling nonstationary series with covariates\nmotivates consideration of methods where the Poisson parameter depends on time.\nHere, estimation methods are developed for some of the more flexible methods.\nThe results are used in the analysis of 1) a count sequence of tropical\ncyclones occurring in the North Atlantic Basin since 1970, and 2) the number of\nno-hitter games pitched in major league baseball since 1893. Tests for whether\nthe Poisson marginal distribution is appropriate are included."}, "http://arxiv.org/abs/2310.10915": {"title": "Identifiability of the Multinomial Processing Tree-IRT model for the Philadelphia Naming Test", "link": "http://arxiv.org/abs/2310.10915", "description": "For persons with aphasia, naming tests are used to evaluate the severity of\nthe disease and to observe progress toward recovery. The Philadelphia Naming\nTest (PNT) is a leading naming test composed of 175 items. The items are common\nnouns which are one to four syllables in length and of low, medium, and high\nfrequency. Since the target word is known to the administrator, the response\nfrom the patient can be classified as correct or an error. If the patient\ncommits an error, the PNT provides procedures for classifying the type of error\nin the response. Item response theory can be applied to PNT data to provide\nestimates of item difficulty and subject naming ability. Walker et al. (2018)\ndeveloped an IRT multinomial processing tree (IRT-MPT) model to attempt to\nunderstand the pathways through which the different errors are made by patients\nwhen responding to an item. The MPT model expands on existing models by\nconsidering items to be heterogeneous and estimating multiple latent parameters\nfor patients to more precisely determine at which step of word production a\npatient's ability has been affected. These latent parameters represent the\ntheoretical cognitive steps taken in responding to an item. Given the\ncomplexity of the model proposed in Walker et al. (2018), here we investigate\nthe identifiability of the parameters included in the IRT-MPT model."}, "http://arxiv.org/abs/2310.10976": {"title": "Exact nonlinear state estimation", "link": "http://arxiv.org/abs/2310.10976", "description": "The majority of data assimilation (DA) methods in the geosciences are based\non Gaussian assumptions. While these assumptions facilitate efficient\nalgorithms, they cause analysis biases and subsequent forecast degradations.\nNon-parametric, particle-based DA algorithms have superior accuracy, but their\napplication to high-dimensional models still poses operational challenges.\nDrawing inspiration from recent advances in the field of generative artificial\nintelligence (AI), this article introduces a new nonlinear estimation theory\nwhich attempts to bridge the existing gap in DA methodology. Specifically, a\nConjugate Transform Filter (CTF) is derived and shown to generalize the\ncelebrated Kalman filter to arbitrarily non-Gaussian distributions. The new\nfilter has several desirable properties, such as its ability to preserve\nstatistical relationships in the prior state and convergence to highly accurate\nobservations. An ensemble approximation of the new theory (ECTF) is also\npresented and validated using idealized statistical experiments that feature\nbounded quantities with non-Gaussian distributions, a prevalent challenge in\nEarth system models. 
Results from these experiments indicate that the greatest\nbenefits from ECTF occur when observation errors are small relative to the\nforecast uncertainty and when state variables exhibit strong nonlinear\ndependencies. Ultimately, the new filtering theory offers exciting avenues for\nimproving conventional DA algorithms through their principled integration with\nAI techniques."}, "http://arxiv.org/abs/2310.11122": {"title": "Sensitivity-Aware Amortized Bayesian Inference", "link": "http://arxiv.org/abs/2310.11122", "description": "Bayesian inference is a powerful framework for making probabilistic\ninferences and decisions under uncertainty. Fundamental choices in modern\nBayesian workflows concern the specification of the likelihood function and\nprior distributions, the posterior approximator, and the data. Each choice can\nsignificantly influence model-based inference and subsequent decisions, thereby\nnecessitating sensitivity analysis. In this work, we propose a multifaceted\napproach to integrate sensitivity analyses into amortized Bayesian inference\n(ABI, i.e., simulation-based inference with neural networks). First, we utilize\nweight sharing to encode the structural similarities between alternative\nlikelihood and prior specifications in the training process with minimal\ncomputational overhead. Second, we leverage the rapid inference of neural\nnetworks to assess sensitivity to various data perturbations or pre-processing\nprocedures. In contrast to most other Bayesian approaches, both steps\ncircumvent the costly bottleneck of refitting the model(s) for each choice of\nlikelihood, prior, or dataset. Finally, we propose to use neural network\nensembles to evaluate variation in results induced by unreliable approximation\non unseen data. We demonstrate the effectiveness of our method in applied\nmodeling problems, ranging from the estimation of disease outbreak dynamics and\nglobal warming thresholds to the comparison of human decision-making models.\nOur experiments showcase how our approach enables practitioners to effectively\nunveil hidden relationships between modeling choices and inferential\nconclusions."}, "http://arxiv.org/abs/2310.11357": {"title": "A Pseudo-likelihood Approach to Under-5 Mortality Estimation", "link": "http://arxiv.org/abs/2310.11357", "description": "Accurate and precise estimates of under-5 mortality rates (U5MR) are an\nimportant health summary for countries. Full survival curves are additionally\nof interest to better understand the pattern of mortality in children under 5.\nModern demographic methods for estimating a full mortality schedule for\nchildren have been developed for countries with good vital registration and\nreliable census data, but perform poorly in many low- and middle-income\ncountries. In these countries, the need to utilize nationally representative\nsurveys to estimate U5MR requires additional statistical care to mitigate\npotential biases in survey data, acknowledge the survey design, and handle\naspects of survival data (i.e., censoring and truncation). In this paper, we\ndevelop parametric and non-parametric pseudo-likelihood approaches to\nestimating under-5 mortality across time from complex survey data. We argue\nthat the parametric approach is particularly useful in scenarios where data are\nsparse and estimation may require stronger assumptions. The nonparametric\napproach provides an aid to model validation. 
We compare a variety of\nparametric models to three existing methods for obtaining a full survival curve\nfor children under the age of 5, and argue that a parametric pseudo-likelihood\napproach is advantageous in low- and middle-income countries. We apply our\nproposed approaches to survey data from Burkina Faso, Malawi, Senegal, and\nNamibia. All code for fitting the models described in this paper is available\nin the R package pssst."}, "http://arxiv.org/abs/2006.00767": {"title": "Generative Multiple-purpose Sampler for Weighted M-estimation", "link": "http://arxiv.org/abs/2006.00767", "description": "To overcome the computational bottleneck of various data perturbation\nprocedures such as the bootstrap and cross validations, we propose the\nGenerative Multiple-purpose Sampler (GMS), which constructs a generator\nfunction to produce solutions of weighted M-estimators from a set of given\nweights and tuning parameters. The GMS is implemented by a single optimization\nwithout having to repeatedly evaluate the minimizers of weighted losses, and is\nthus capable of significantly reducing the computational time. We demonstrate\nthat the GMS framework enables the implementation of various statistical\nprocedures that would be unfeasible in a conventional framework, such as the\niterated bootstrap, bootstrapped cross-validation for penalized likelihood,\nbootstrapped empirical Bayes with nonparametric maximum likelihood, etc. To\nconstruct a computationally efficient generator function, we also propose a\nnovel form of neural network called the \\emph{weight multiplicative multilayer\nperceptron} to achieve fast convergence. Our numerical results demonstrate that\nthe new neural network structure enjoys a few orders of magnitude speed\nadvantage in comparison to the conventional one. An R package called GMS is\nprovided, which runs under Pytorch to implement the proposed methods and allows\nthe user to provide a customized loss function to tailor to their own models of\ninterest."}, "http://arxiv.org/abs/2012.03593": {"title": "Algebraic geometry of discrete interventional models", "link": "http://arxiv.org/abs/2012.03593", "description": "We investigate the algebra and geometry of general interventions in discrete\nDAG models. To this end, we introduce a theory for modeling soft interventions\nin the more general family of staged tree models and develop the formalism to\nstudy these models as parametrized subvarieties of a product of probability\nsimplices. We then consider the problem of finding their defining equations,\nand we derive a combinatorial criterion for identifying interventional staged\ntree models for which the defining ideal is toric. We apply these results to\nthe class of discrete interventional DAG models and establish a criteria to\ndetermine when these models are toric varieties."}, "http://arxiv.org/abs/2105.12720": {"title": "Marginal structural models with Latent Class Growth Modeling of Treatment Trajectories", "link": "http://arxiv.org/abs/2105.12720", "description": "In a real-life setting, little is known regarding the effectiveness of\nstatins for primary prevention among older adults, and analysis of\nobservational data can add crucial information on the benefits of actual\npatterns of use. Latent class growth models (LCGM) are increasingly proposed as\na solution to summarize the observed longitudinal treatment in a few distinct\ngroups. 
When combined with standard approaches like Cox proportional hazards\nmodels, LCGM can fail to control time-dependent confounding bias because of\ntime-varying covariates that have a double role of confounders and mediators.\nWe propose to use LCGM to classify individuals into a few latent classes based\non their medication adherence pattern, then choose a working marginal\nstructural model (MSM) that relates the outcome to these groups. The parameter\nof interest is nonparametrically defined as the projection of the true MSM onto\nthe chosen working model. The combination of LCGM with MSM is a convenient way\nto describe treatment adherence and can effectively control time-dependent\nconfounding. Simulation studies were used to illustrate our approach and\ncompare it with unadjusted, baseline covariates-adjusted, time-varying\ncovariates adjusted and inverse probability of trajectory groups weighting\nadjusted models. We found that our proposed approach yielded estimators with\nlittle or no bias."}, "http://arxiv.org/abs/2208.07610": {"title": "E-Statistics, Group Invariance and Anytime Valid Testing", "link": "http://arxiv.org/abs/2208.07610", "description": "We study worst-case-growth-rate-optimal (GROW) e-statistics for hypothesis\ntesting between two group models. It is known that under a mild condition on\nthe action of the underlying group G on the data, there exists a maximally\ninvariant statistic. We show that among all e-statistics, invariant or not, the\nlikelihood ratio of the maximally invariant statistic is GROW, both in the\nabsolute and in the relative sense, and that an anytime-valid test can be based\non it. The GROW e-statistic is equal to a Bayes factor with a right Haar prior\non G. Our treatment avoids nonuniqueness issues that sometimes arise for such\npriors in Bayesian contexts. A crucial assumption on the group G is its\namenability, a well-known group-theoretical condition, which holds, for\ninstance, in scale-location families. Our results also apply to\nfinite-dimensional linear regression."}, "http://arxiv.org/abs/2302.03246": {"title": "CDANs: Temporal Causal Discovery from Autocorrelated and Non-Stationary Time Series Data", "link": "http://arxiv.org/abs/2302.03246", "description": "Time series data are found in many areas of healthcare such as medical time\nseries, electronic health records (EHR), measurements of vitals, and wearable\ndevices. Causal discovery, which involves estimating causal relationships from\nobservational data, holds the potential to play a significant role in\nextracting actionable insights about human health. In this study, we present a\nnovel constraint-based causal discovery approach for autocorrelated and\nnon-stationary time series data (CDANs). Our proposed method addresses several\nlimitations of existing causal discovery methods for autocorrelated and\nnon-stationary time series data, such as high dimensionality, the inability to\nidentify lagged causal relationships, and overlooking changing modules. Our\napproach identifies lagged and instantaneous/contemporaneous causal\nrelationships along with changing modules that vary over time. The method\noptimizes the conditioning sets in a constraint-based search by considering\nlagged parents instead of conditioning on the entire past that addresses high\ndimensionality. The changing modules are detected by considering both\ncontemporaneous and lagged parents. 
The approach first detects the lagged\nadjacencies, then identifies the changing modules and contemporaneous\nadjacencies, and finally determines the causal direction. We extensively\nevaluated our proposed method on synthetic and real-world clinical datasets,\nand compared its performance with several baseline approaches. The experimental\nresults demonstrate the effectiveness of the proposed method in detecting\ncausal relationships and changing modules for autocorrelated and non-stationary\ntime series data."}, "http://arxiv.org/abs/2305.07089": {"title": "Hierarchically Coherent Multivariate Mixture Networks", "link": "http://arxiv.org/abs/2305.07089", "description": "Large collections of time series data are often organized into hierarchies\nwith different levels of aggregation; examples include product and geographical\ngroupings. Probabilistic coherent forecasting is tasked to produce forecasts\nconsistent across levels of aggregation. In this study, we propose to augment\nneural forecasting architectures with a coherent multivariate mixture output.\nWe optimize the networks with a composite likelihood objective, allowing us to\ncapture time series' relationships while maintaining high computational\nefficiency. Our approach demonstrates 13.2% average accuracy improvements on\nmost datasets compared to state-of-the-art baselines. We conduct ablation\nstudies of the framework components and provide theoretical foundations for\nthem. To assist related work, the code is available at\nhttps://github.com/Nixtla/neuralforecast."}, "http://arxiv.org/abs/2307.16720": {"title": "The epigraph and the hypograph indexes as useful tools for clustering multivariate functional data", "link": "http://arxiv.org/abs/2307.16720", "description": "The proliferation of data generation has spurred advancements in functional\ndata analysis. With the ability to analyze multiple variables simultaneously,\nthe demand for working with multivariate functional data has increased. This\nstudy proposes a novel formulation of the epigraph and hypograph indexes, as\nwell as their generalized expressions, specifically tailored for the\nmultivariate functional context. These definitions take into account the\ninterrelations between components. Furthermore, the proposed indexes are\nemployed to cluster multivariate functional data. In the clustering process,\nthe indexes are applied to both the data and their first and second\nderivatives. This generates a reduced-dimension dataset from the original\nmultivariate functional data, enabling the application of well-established\nmultivariate clustering techniques which have been extensively studied in the\nliterature. This methodology has been tested through simulated and real\ndatasets, performing comparative analyses against state-of-the-art methods to assess\nits performance."}, "http://arxiv.org/abs/2309.07810": {"title": "Spectrum-Aware Adjustment: A New Debiasing Framework with Applications to Principal Component Regression", "link": "http://arxiv.org/abs/2309.07810", "description": "We introduce a new debiasing framework for high-dimensional linear regression\nthat bypasses the restrictions on covariate distributions imposed by modern\ndebiasing technology. We study the prevalent setting where the numbers of\nfeatures and samples are both large and comparable. 
In this context,\nstate-of-the-art debiasing technology uses a degrees-of-freedom correction to\nremove the shrinkage bias of regularized estimators and conduct inference.\nHowever, this method requires that the observed samples are i.i.d., the\ncovariates follow a mean zero Gaussian distribution, and reliable covariance\nmatrix estimates for observed features are available. This approach struggles\nwhen (i) covariates are non-Gaussian with heavy tails or asymmetric\ndistributions, (ii) rows of the design exhibit heterogeneity or dependencies,\nand (iii) reliable feature covariance estimates are lacking.\n\nTo address these, we develop a new strategy where the debiasing correction is\na rescaled gradient descent step (suitably initialized) with step size\ndetermined by the spectrum of the sample covariance matrix. Unlike prior work,\nwe assume that eigenvectors of this matrix are uniform draws from the\northogonal group. We show this assumption remains valid in diverse situations\nwhere traditional debiasing fails, including designs with complex row-column\ndependencies, heavy tails, asymmetric properties, and latent low-rank\nstructures. We establish asymptotic normality of our proposed estimator\n(centered and scaled) under various convergence notions. Moreover, we develop a\nconsistent estimator for its asymptotic variance. Lastly, we introduce a\ndebiased Principal Components Regression (PCR) technique using our\nSpectrum-Aware approach. In varied simulations and real data experiments, we\nobserve that our method outperforms degrees-of-freedom debiasing by a margin."}, "http://arxiv.org/abs/2310.11471": {"title": "Modeling lower-truncated and right-censored insurance claims with an extension of the MBBEFD class", "link": "http://arxiv.org/abs/2310.11471", "description": "In general insurance, claims are often lower-truncated and right-censored\nbecause insurance contracts may involve deductibles and maximal covers. Most\nclassical statistical models are not (directly) suited to model lower-truncated\nand right-censored claims. A surprisingly flexible family of distributions that\ncan cope with lower-truncated and right-censored claims is the class of MBBEFD\ndistributions that originally has been introduced by Bernegger (1997) for\nreinsurance pricing, but which has not gained much attention outside the\nreinsurance literature. We derive properties of the class of MBBEFD\ndistributions, and we extend it to a bigger family of distribution functions\nsuitable for modeling lower-truncated and right-censored claims. Interestingly,\nin general insurance, we mainly rely on unimodal skewed densities, whereas the\nreinsurance literature typically proposes monotonically decreasing densities\nwithin the MBBEFD class."}, "http://arxiv.org/abs/2310.11603": {"title": "Group sequential two-stage preference designs", "link": "http://arxiv.org/abs/2310.11603", "description": "The two-stage preference design (TSPD) enables the inference for treatment\nefficacy while allowing for incorporation of patient preference to treatment.\nIt can provide unbiased estimates for selection and preference effects, where a\nselection effect occurs when patients who prefer one treatment respond\ndifferently than those who prefer another, and a preference effect is the\ndifference in response caused by an interaction between the patient's\npreference and the actual treatment they receive. 
One potential barrier to\nadopting TSPD in practice, however, is the relatively large sample size\nrequired to estimate selection and preference effects with sufficient power. To\naddress this concern, we propose a group sequential two-stage preference design\n(GS-TSPD), which combines TSPD with sequential monitoring for early stopping.\nIn the GS-TSPD, pre-planned sequential monitoring allows investigators to\nconduct repeated hypothesis tests on accumulated data prior to full enrollment\nto assess study eligibility for early trial termination without inflating type\nI error rates. Thus, the procedure allows investigators to terminate the study\nwhen there is sufficient evidence of treatment, selection, or preference\neffects during an interim analysis, thereby reducing the design resource in\nexpectation. To formalize such a procedure, we verify the independent\nincrements assumption for testing the selection and preference effects and\napply group sequential stopping boundaries from the approximate sequential\ndensity functions. Simulations are then conducted to investigate the operating\ncharacteristics of our proposed GS-TSPD compared to the traditional TSPD. We\ndemonstrate the applicability of the design using a study of Hepatitis C\ntreatment modality."}, "http://arxiv.org/abs/2310.11620": {"title": "Enhancing modified treatment policy effect estimation with weighted energy distance", "link": "http://arxiv.org/abs/2310.11620", "description": "The effects of continuous treatments are often characterized through the\naverage dose response function, which is challenging to estimate from\nobservational data due to confounding and positivity violations. Modified\ntreatment policies (MTPs) are an alternative approach that aim to assess the\neffect of a modification to observed treatment values and work under relaxed\nassumptions. Estimators for MTPs generally focus on estimating the conditional\ndensity of treatment given covariates and using it to construct weights.\nHowever, weighting using conditional density models has well-documented\nchallenges. Further, MTPs with larger treatment modifications have stronger\nconfounding and no tools exist to help choose an appropriate modification\nmagnitude. This paper investigates the role of weights for MTPs showing that to\ncontrol confounding, weights should balance the weighted data to an unobserved\nhypothetical target population, that can be characterized with observed data.\nLeveraging this insight, we present a versatile set of tools to enhance\nestimation for MTPs. We introduce a distance that measures imbalance of\ncovariate distributions under the MTP and use it to develop new weighting\nmethods and tools to aid in the estimation of MTPs. We illustrate our methods\nthrough an example studying the effect of mechanical power of ventilation on\nin-hospital mortality."}, "http://arxiv.org/abs/2310.11630": {"title": "Adaptive Bootstrap Tests for Composite Null Hypotheses in the Mediation Pathway Analysis", "link": "http://arxiv.org/abs/2310.11630", "description": "Mediation analysis aims to assess if, and how, a certain exposure influences\nan outcome of interest through intermediate variables. This problem has\nrecently gained a surge of attention due to the tremendous need for such\nanalyses in scientific fields. Testing for the mediation effect is greatly\nchallenged by the fact that the underlying null hypothesis (i.e. the absence of\nmediation effects) is composite. 
Most existing mediation tests are overly\nconservative and thus underpowered. To overcome this significant methodological\nhurdle, we develop an adaptive bootstrap testing framework that can accommodate\ndifferent types of composite null hypotheses in the mediation pathway analysis.\nApplied to the product of coefficients (PoC) test and the joint significance\n(JS) test, our adaptive testing procedures provide type I error control under\nthe composite null, resulting in much improved statistical power compared to\nexisting tests. Both theoretical properties and numerical examples of the\nproposed methodology are discussed."}, "http://arxiv.org/abs/2310.11683": {"title": "Are we bootstrapping the right thing? A new approach to quantify uncertainty of Average Treatment Effect Estimate", "link": "http://arxiv.org/abs/2310.11683", "description": "Existing approaches that use the bootstrap method to derive standard errors\nand confidence intervals of average treatment effect estimates have one potential\nissue: they are actually bootstrapping the wrong thing, resulting\nin invalid statistical inference. In this paper, we discuss this important\nissue and propose a new non-parametric bootstrap method that can more precisely\nquantify the uncertainty associated with average treatment effect estimates. We\ndemonstrate the validity of this approach through a simulation study and a\nreal-world example, and highlight the importance of deriving standard errors and\nconfidence intervals of average treatment effect estimates that both remove\nextra undesired noise and are easy to interpret when applied in real-world\nscenarios."}, "http://arxiv.org/abs/2310.11724": {"title": "Simultaneous Nonparametric Inference of M-regression under Complex Temporal Dynamics", "link": "http://arxiv.org/abs/2310.11724", "description": "The paper considers simultaneous nonparametric inference for a wide class of\nM-regression models with time-varying coefficients. The covariates and errors\nof the regression model are treated as a general class of piece-wise locally\nstationary time series and are allowed to be cross-dependent. We introduce an\nintegration technique to study the M-estimators, whose limiting properties are\ndisclosed using Bahadur representation and Gaussian approximation theory.\nFacilitated by a self-convolved bootstrap proposed in this paper, we introduce\na unified framework to conduct general classes of Exact Function Tests,\nLack-of-fit Tests, and Qualitative Tests for the time-varying coefficient\nM-regression under complex temporal dynamics. As an application, our method is\napplied to studying the anthropogenic warming trend and time-varying structures\nof the ENSO effect using global climate data from 1882 to 2005."}, "http://arxiv.org/abs/2310.11741": {"title": "Graph Sphere: From Nodes to Supernodes in Graphical Models", "link": "http://arxiv.org/abs/2310.11741", "description": "High-dimensional data analysis typically focuses on low-dimensional\nstructure, often to aid interpretation and computational efficiency. Graphical\nmodels provide a powerful methodology for learning the conditional independence\nstructure in multivariate data by representing variables as nodes and\ndependencies as edges. Inference is often focused on individual edges in the\nlatent graph. Nonetheless, there is increasing interest in determining more\ncomplex structures, such as communities of nodes, for multiple reasons,\nincluding more effective information retrieval and better interpretability. 
In\nthis work, we propose a multilayer graphical model where we first cluster nodes\nand then, at the second layer, investigate the relationships among groups of\nnodes. Specifically, nodes are partitioned into \"supernodes\" with a\ndata-coherent size-biased tessellation prior which combines ideas from Bayesian\nnonparametrics and Voronoi tessellations. This construction also accounts\nfor dependence among nodes within supernodes. At the second layer, the dependence\nstructure among supernodes is modelled through a Gaussian graphical model,\nwhere the focus of inference is on \"superedges\". We provide theoretical\njustification for our modelling choices. We design tailored Markov chain Monte\nCarlo schemes, which also enable parallel computations. We demonstrate the\neffectiveness of our approach for large-scale structure learning in simulations\nand a transcriptomics application."}, "http://arxiv.org/abs/2310.11779": {"title": "A Multivariate Skew-Normal-Tukey-h Distribution", "link": "http://arxiv.org/abs/2310.11779", "description": "We introduce a new family of multivariate distributions by taking the\ncomponent-wise Tukey-h transformation of a random vector following a\nskew-normal distribution. The proposed distribution is named the\nskew-normal-Tukey-h distribution and is an extension of the skew-normal\ndistribution for handling heavy-tailed data. We compare this proposed\ndistribution to the skew-t distribution, which is another extension of the\nskew-normal distribution for modeling tail-thickness, and demonstrate that when\nthere are substantial differences in marginal kurtosis, the proposed\ndistribution is more appropriate. Moreover, we derive many appealing stochastic\nproperties of the proposed distribution and provide a methodology for the\nestimation of the parameters in which the computational requirement increases\nlinearly with the dimension. Using simulations, as well as a wine and a wind\nspeed data application, we illustrate how to draw inferences based on the\nmultivariate skew-normal-Tukey-h distribution."}, "http://arxiv.org/abs/2310.11799": {"title": "Testing for patterns and structures in covariance and correlation matrices", "link": "http://arxiv.org/abs/2310.11799", "description": "Covariance matrices of random vectors contain information that is crucial for\nmodelling. Certain structures and patterns of the covariances (or correlations)\nmay be used to justify parametric models, e.g., autoregressive models. Until\nnow, there have been only a few approaches for testing such covariance structures\nsystematically and in a unified way. In the present paper, we propose such a\nunified testing procedure, and we exemplify the approach with a large\nvariety of covariance structure models. These include common structures such as\ndiagonal matrices, Toeplitz matrices, and compound symmetry, as well as the more\ninvolved autoregressive matrices. We propose hypothesis tests for these\nstructures, and we use bootstrap techniques for better small-sample\napproximation. The structures of the proposed tests invite adaptations to\nother covariance patterns by choosing the hypothesis matrix appropriately. We\nprove their correctness for large sample sizes. 
The proposed methods require\nonly weak assumptions.\n\nWith the help of a simulation study, we assess the small sample properties of\nthe tests.\n\nWe also analyze a real data set to illustrate the application of the\nprocedure."}, "http://arxiv.org/abs/2310.11822": {"title": "Post-clustering Inference under Dependency", "link": "http://arxiv.org/abs/2310.11822", "description": "Recent work by Gao et al. has laid the foundations for post-clustering\ninference. For the first time, the authors established a theoretical framework\nthat allows testing for differences between the means of estimated clusters.\nAdditionally, they studied the estimation of unknown parameters while\ncontrolling the selective type I error. However, their theory was developed for\nindependent observations identically distributed as $p$-dimensional Gaussian\nvariables with a spherical covariance matrix. Here, we aim to extend this\nframework to a scenario that is more convenient for practical applications, where\narbitrary dependence structures between observations and features are allowed.\nWe show that a $p$-value for post-clustering inference under general dependency\ncan be defined, and we assess the theoretical conditions allowing the\ncompatible estimation of a covariance matrix. The theory is developed for\nhierarchical agglomerative clustering algorithms with several types of\nlinkages, and for the $k$-means algorithm. We illustrate our method with\nsynthetic data and real data of protein structures."}, "http://arxiv.org/abs/2310.12000": {"title": "Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models", "link": "http://arxiv.org/abs/2310.12000", "description": "Latent Gaussian process (GP) models are flexible probabilistic non-parametric\nfunction models. Vecchia approximations are accurate approximations for GPs to\novercome computational bottlenecks for large data, and the Laplace\napproximation is a fast method with asymptotic convergence guarantees to\napproximate marginal likelihoods and posterior predictive distributions for\nnon-Gaussian likelihoods. Unfortunately, the computational complexity of\ncombined Vecchia-Laplace approximations grows faster than linearly in the\nsample size when used in combination with direct solver methods such as the\nCholesky decomposition. Computations with Vecchia-Laplace approximations thus\nbecome prohibitively slow precisely when the approximations are usually the\nmost accurate, i.e., on large data sets. In this article, we present several\niterative methods for inference with Vecchia-Laplace approximations which make\ncomputations considerably faster compared to Cholesky-based calculations. We\nanalyze our proposed methods theoretically and in experiments with simulated\nand real-world data. In particular, we obtain a speed-up of an order of\nmagnitude compared to Cholesky-based inference and a threefold increase in\nprediction accuracy in terms of the continuous ranked probability score\ncompared to a state-of-the-art method on a large satellite data set. All\nmethods are implemented in a free C++ software library with high-level Python\nand R packages."}, "http://arxiv.org/abs/2310.12010": {"title": "A Note on Improving Variational Estimation for Multidimensional Item Response Theory", "link": "http://arxiv.org/abs/2310.12010", "description": "Survey instruments and assessments are frequently used in many domains of\nsocial science. 
When the constructs that these assessments try to measure\nbecome multifaceted, multidimensional item response theory (MIRT) provides a\nunified framework and convenient statistical tool for item analysis,\ncalibration, and scoring. However, the computational challenge of estimating\nMIRT models prohibits their wide use because many of the extant methods can\nhardly provide results in a realistic time frame when the number of dimensions,\nsample size, and test length are large. Instead, variational estimation\nmethods, such as the Gaussian Variational Expectation Maximization (GVEM)\nalgorithm, have been recently proposed to solve the estimation challenge by\nproviding a fast and accurate solution. However, results have shown that\nvariational estimation methods may produce some bias in discrimination\nparameters during confirmatory model estimation, and this note proposes an\nimportance weighted version of GVEM (i.e., IW-GVEM) to correct for such bias\nunder MIRT models. We also use the adaptive moment estimation method to update\nthe learning rate for gradient descent automatically. Our simulations show that\nIW-GVEM can effectively correct bias with a modest increase in computation time,\ncompared with GVEM. The proposed method may also shed light on improving the\nvariational estimation for other psychometric models."}, "http://arxiv.org/abs/2310.12115": {"title": "MMD-based Variable Importance for Distributional Random Forest", "link": "http://arxiv.org/abs/2310.12115", "description": "Distributional Random Forest (DRF) is a flexible forest-based method to\nestimate the full conditional distribution of a multivariate output of interest\ngiven input variables. In this article, we introduce a variable importance\nalgorithm for DRFs, based on the well-established drop and relearn principle\nand MMD distance. While traditional importance measures only detect variables\nwith an influence on the output mean, our algorithm detects variables impacting\nthe output distribution more generally. We show that the introduced importance\nmeasure is consistent, exhibits high empirical performance on both real and\nsimulated data, and outperforms competitors. In particular, our algorithm is\nhighly efficient at selecting variables through recursive feature elimination, and\ncan therefore provide small sets of variables to build accurate estimates of\nconditional output distributions."}, "http://arxiv.org/abs/2310.12140": {"title": "Online Estimation with Rolling Validation: Adaptive Nonparametric Estimation with Stream Data", "link": "http://arxiv.org/abs/2310.12140", "description": "Online nonparametric estimators are gaining popularity due to their efficient\ncomputation and competitive generalization abilities. Important examples\ninclude variants of stochastic gradient descent. These algorithms often take\none sample point at a time and instantly update the parameter estimate of\ninterest. In this work we consider model selection and hyperparameter tuning\nfor such online algorithms. We propose a weighted rolling-validation procedure,\nan online variant of leave-one-out cross-validation, that costs minimal extra\ncomputation for many typical stochastic gradient descent estimators. Similar to\nbatch cross-validation, it can boost base estimators to achieve a better,\nadaptive convergence rate. Our theoretical analysis is straightforward, relying\nmainly on some general statistical stability assumptions. 
The simulation study\nunderscores the significance of diverging weights in rolling validation in\npractice and demonstrates its sensitivity even when there is only a slim\ndifference between candidate estimators."}, "http://arxiv.org/abs/2010.02968": {"title": "Modelling of functional profiles and explainable shape shifts detection: An approach combining the notion of the Fr\\'echet mean with the shape invariant model", "link": "http://arxiv.org/abs/2010.02968", "description": "A modelling framework suitable for detecting shape shifts in functional\nprofiles, combining the notion of the Fr\\'echet mean and the concept of deformation\nmodels, is developed and proposed. The generalized mean sense offered by the\nFr\\'echet mean notion is employed to capture the typical pattern of the\nprofiles under study, while the concept of deformation models, and in\nparticular of the shape invariant model, allows for interpretable\nparameterizations of a profile's deviations from the typical shape. EWMA-type\ncontrol charts compatible with the functional nature of data and the employed\ndeformation model are built and proposed, exploiting certain shape\ncharacteristics of the profiles under study with respect to the generalized\nmean sense, allowing for the identification of potential shifts concerning the\nshape and/or the deformation process. Potential shifts in the shape deformation\nprocess are further distinguished into significant shifts with respect to the\namplitude and/or the phase of the profile under study. The proposed modelling\nand shift detection framework is applied to a real-world case study, where\ndaily concentration profiles concerning air pollutants from an area in the city\nof Athens are modelled, while profiles indicating hazardous concentration\nlevels are successfully identified in most cases."}, "http://arxiv.org/abs/2207.07218": {"title": "On the Selection of Tuning Parameters for Patch-Stitching Embedding Methods", "link": "http://arxiv.org/abs/2207.07218", "description": "While classical scaling, just like principal component analysis, is\nparameter-free, other methods for embedding multivariate data require the\nselection of one or several tuning parameters. This tuning can be difficult due\nto the unsupervised nature of the situation. We propose a simple, almost\nobvious, approach to supervise the choice of tuning parameter(s): minimize a\nnotion of stress. We apply this approach to the selection of the patch size in\na prototypical patch-stitching embedding method, both in the multidimensional\nscaling (aka network localization) setting and in the dimensionality reduction\n(aka manifold learning) setting. In our study, we uncover a new bias--variance\ntradeoff phenomenon."}, "http://arxiv.org/abs/2303.17856": {"title": "Bootstrapping multiple systems estimates to account for model selection", "link": "http://arxiv.org/abs/2303.17856", "description": "Multiple systems estimation using a Poisson loglinear model is a standard\napproach to quantifying hidden populations where data sources are based on\nlists of known cases. Information criteria are often used for selecting between\nthe large number of possible models. Confidence intervals are often reported\nconditional on the model selected, providing an over-optimistic impression of\nestimation accuracy. A bootstrap approach is a natural way to account for the\nmodel selection. 
However, because the model selection step has to be carried\nout for every bootstrap replication, there may be a high or even prohibitive\ncomputational burden. We explore the merit of modifying the model selection\nprocedure in the bootstrap to look only among a subset of models, chosen on the\nbasis of their information criterion score on the original data. This provides\nlarge computational gains with little apparent effect on inference. We also\nincorporate rigorous and economical ways of approaching issues of the existence\nof estimators when applying the method to sparse data tables."}, "http://arxiv.org/abs/2308.07319": {"title": "Partial identification for discrete data with nonignorable missing outcomes", "link": "http://arxiv.org/abs/2308.07319", "description": "Nonignorable missing outcomes are common in real world datasets and often\nrequire strong parametric assumptions to achieve identification. These\nassumptions can be implausible or untestable, and so we may forgo them in\nfavour of partially identified models that narrow the set of a priori possible\nvalues to an identification region. Here we propose a new nonparametric Bayes\nmethod that allows for the incorporation of multiple clinically relevant\nrestrictions of the parameter space simultaneously. We focus on two common\nrestrictions, instrumental variables and the direction of missing data bias,\nand investigate how these restrictions narrow the identification region for\nparameters of interest. Additionally, we propose a rejection sampling algorithm\nthat allows us to quantify the evidence for these assumptions in the data. We\ncompare our method to a standard Heckman selection model in both simulation\nstudies and in an applied problem examining the effectiveness of cash-transfers\nfor people experiencing homelessness."}, "http://arxiv.org/abs/2310.12285": {"title": "Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2310.12285", "description": "High-dimensional longitudinal data is increasingly used in a wide range of\nscientific studies. However, there are few statistical methods for\nhigh-dimensional linear mixed models (LMMs), as most Bayesian variable\nselection or penalization methods are designed for independent observations.\nAdditionally, the few available software packages for high-dimensional LMMs\nsuffer from scalability issues. This work presents an efficient and accurate\nBayesian framework for high-dimensional LMMs. We use empirical Bayes estimators\nof hyperparameters for increased flexibility and an\nExpectation-Conditional-Minimization (ECM) algorithm for computationally\nefficient maximum a posteriori probability (MAP) estimation of parameters. The\nnovelty of the approach lies in its partitioning and parameter expansion as\nwell as its fast and scalable computation. We illustrate Linear Mixed Modeling\nwith PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies\nevaluating fixed and random effects estimation along with computation time. 
A\nreal-world example is provided using data from a study of lupus in children,\nwhere we identify genes and clinical factors associated with a new lupus\nbiomarker and predict the biomarker over time."}, "http://arxiv.org/abs/2310.12348": {"title": "Goodness--of--Fit Tests Based on the Min--Characteristic Function", "link": "http://arxiv.org/abs/2310.12348", "description": "We propose tests of fit for classes of distributions that include the\nWeibull, the Pareto, and the Fr\\'echet distributions. The new tests employ the\nnovel tool of the min--characteristic function and are based on an L2--type\nweighted distance between this function and its empirical counterpart applied\nto suitably standardized data. If data standardization is performed using the\nMLE of the distributional parameters, then the method reduces to testing for the\nstandard member of the family, with parameter values known and set equal to\none. We investigate asymptotic properties of the tests, while a Monte Carlo\nstudy is presented that includes the new procedure as well as competitors for\nthe purpose of specification testing with three extreme value distributions.\nThe new tests are also applied to a few real--data sets."}, "http://arxiv.org/abs/2310.12358": {"title": "causalBETA: An R Package for Bayesian Semiparametric Causal Inference with Event-Time Outcomes", "link": "http://arxiv.org/abs/2310.12358", "description": "Observational studies are often conducted to estimate causal effects of\ntreatments or exposures on event-time outcomes. Since treatments are not\nrandomized in observational studies, techniques from causal inference are\nrequired to adjust for confounding. Bayesian approaches to causal estimation are\ndesirable because they provide 1) useful regularization of causal effect\nestimates via prior smoothing, 2) flexible models that are robust\nto misspecification, and 3) full inference (i.e. both point and uncertainty\nestimates) for causal estimands. However, Bayesian causal inference is\ndifficult to implement manually and there is a lack of user-friendly software,\npresenting a significant barrier to widespread use. We address this gap by\ndeveloping causalBETA (Bayesian Event Time Analysis) - an open-source R package\nfor estimating causal effects on event-time outcomes using Bayesian\nsemiparametric models. The package provides a familiar front-end to users, with\nsyntax identical to existing survival analysis R packages such as survival. At\nthe same time, it back-ends to Stan - a popular platform for Bayesian modeling\nand high performance statistical computing - for efficient posterior\ncomputation. To improve user experience, the package is built using customized\nS3 class objects and methods to facilitate visualizations and summaries of\nresults using familiar generic functions like plot() and summary(). In this\npaper, we provide the methodological details of the package, a demonstration\nusing publicly-available data, and computational guidance."}, "http://arxiv.org/abs/2310.12391": {"title": "Real-time Semiparametric Regression via Sequential Monte Carlo", "link": "http://arxiv.org/abs/2310.12391", "description": "We develop and describe online algorithms for performing real-time\nsemiparametric regression analyses. Earlier work on this topic is in Luts,\nBroderick & Wand (J. Comput. Graph. Statist., 2014) where online mean field\nvariational Bayes was employed. 
In this article we instead develop sequential\nMonte Carlo approaches to circumvent well-known inaccuracies inherent in\nvariational approaches. Even though sequential Monte Carlo is not as fast as\nonline mean field variational Bayes, it can be a viable alternative for\napplications where the data rate is not overly high. For Gaussian response\nsemiparametric regression models our new algorithms share the online mean field\nvariational Bayes property of only requiring updating and storage of sufficient\nstatistics quantities of streaming data. In the non-Gaussian case accurate\nreal-time semiparametric regression requires the full data to be kept in\nstorage. The new algorithms allow for new options concerning accuracy/speed\ntrade-offs for real-time semiparametric regression."}, "http://arxiv.org/abs/2310.12402": {"title": "Data visualization and dimension reduction for metric-valued response regression", "link": "http://arxiv.org/abs/2310.12402", "description": "As novel data collection becomes increasingly common, traditional dimension\nreduction and data visualization techniques are becoming inadequate to analyze\nthese complex data. A surrogate-assisted sufficient dimension reduction (SDR)\nmethod for regression with a general metric-valued response on Euclidean\npredictors is proposed. The response objects are mapped to a real-valued\ndistance matrix using an appropriate metric and then projected onto a large\nsample of random unit vectors to obtain scalar-valued surrogate responses. An\nensemble estimate of the subspaces for the regression of the surrogate\nresponses versus the predictor is used to estimate the original central space.\nUnder this framework, classical SDR methods such as ordinary least squares and\nsliced inverse regression are extended. The surrogate-assisted method applies\nto responses on compact metric spaces including but not limited to Euclidean,\ndistributional, and functional. An extensive simulation experiment demonstrates\nthe superior performance of the proposed surrogate-assisted method on synthetic\ndata compared to existing competing methods where applicable. The analysis of\nthe distributions and functional trajectories of county-level COVID-19\ntransmission rates in the U.S. as a function of demographic characteristics is\nalso provided. The theoretical justifications are included as well."}, "http://arxiv.org/abs/2310.12424": {"title": "Optimal heteroskedasticity testing in nonparametric regression", "link": "http://arxiv.org/abs/2310.12424", "description": "Heteroskedasticity testing in nonparametric regression is a classic\nstatistical problem with important practical applications, yet fundamental\nlimits are unknown. Adopting a minimax perspective, this article considers the\ntesting problem in the context of an $\\alpha$-H\\\"{o}lder mean and a\n$\\beta$-H\\\"{o}lder variance function. For $\\alpha > 0$ and $\\beta \\in (0,\n\\frac{1}{2})$, the sharp minimax separation rate $n^{-4\\alpha} +\nn^{-\\frac{4\\beta}{4\\beta+1}} + n^{-2\\beta}$ is established. To achieve the\nminimax separation rate, a kernel-based statistic using first-order squared\ndifferences is developed. 
Notably, the statistic estimates a proxy rather than\na natural quadratic functional (the squared distance between the variance\nfunction and its best $L^2$ approximation by a constant) suggested in previous\nwork.\n\nThe setting where no smoothness is assumed on the variance function is also\nstudied; the variance profile across the design points can be arbitrary.\nDespite the lack of structure, consistent testing turns out to still be\npossible by using the Gaussian character of the noise, and the minimax rate is\nshown to be $n^{-4\\alpha} + n^{-1/2}$. Exploiting noise information happens to\nbe a fundamental necessity as consistent testing is impossible if nothing more\nthan zero mean and unit variance is known about the noise distribution.\nFurthermore, in the setting where $V$ is $\\beta$-H\\\"{o}lder but\nheteroskedasticity is measured only with respect to the design points, the\nminimax separation rate is shown to be $n^{-4\\alpha} + n^{-\\left(\\frac{1}{2}\n\\vee \\frac{4\\beta}{4\\beta+1}\\right)}$ when the noise is Gaussian and\n$n^{-4\\alpha} + n^{-\\frac{4\\beta}{4\\beta+1}} + n^{-2\\beta}$ when the noise\ndistribution is unknown."}, "http://arxiv.org/abs/2310.12427": {"title": "Fast Power Curve Approximation for Posterior Analyses", "link": "http://arxiv.org/abs/2310.12427", "description": "Bayesian hypothesis testing leverages posterior probabilities, Bayes factors,\nor credible intervals to assess characteristics that summarize data. We propose\na framework for power curve approximation with such hypothesis tests that\nassumes data are generated using statistical models with fixed parameters for\nthe purposes of sample size determination. We present a fast approach to\nexplore the sampling distribution of posterior probabilities when the\nconditions for the Bernstein-von Mises theorem are satisfied. We extend that\napproach to facilitate targeted sampling from the approximate sampling\ndistribution of posterior probabilities for each sample size explored. These\nsampling distributions are used to construct power curves for various types of\nposterior analyses. Our resulting method for power curve approximation is\norders of magnitude faster than conventional power curve estimation for\nBayesian hypothesis tests. We also prove the consistency of the corresponding\npower estimates and sample size recommendations under certain conditions."}, "http://arxiv.org/abs/2310.12428": {"title": "Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach", "link": "http://arxiv.org/abs/2310.12428", "description": "We initiate a novel approach to explain the out of sample performance of\nrandom forest (RF) models by exploiting the fact that any RF can be formulated\nas an adaptive weighted K nearest-neighbors model. Specifically, we use the\nproximity between points in the feature space learned by the RF to re-write\nrandom forest predictions exactly as a weighted average of the target labels of\ntraining data points. This linearity facilitates a local notion of\nexplainability of RF predictions that generates attributions for any model\nprediction across observations in the training set, and thereby complements\nestablished methods like SHAP, which instead generates attributions for a model\nprediction across dimensions of the feature space. 
We demonstrate this approach\nin the context of a bond pricing model trained on US corporate bond trades, and\ncompare our approach to various existing approaches to model explainability."}, "http://arxiv.org/abs/2310.12460": {"title": "Linear Source Apportionment using Generalized Least Squares", "link": "http://arxiv.org/abs/2310.12460", "description": "Motivated by applications to water quality monitoring using fluorescence\nspectroscopy, we develop the source apportionment model for high dimensional\nprofiles of dissolved organic matter (DOM). We describe simple methods to\nestimate the parameters of a linear source apportionment model, and show how\nthe estimates are related to those of ordinary and generalized least squares.\nUsing this least squares framework, we analyze the variability of the\nestimates, and we propose predictors for missing elements of a DOM profile. We\ndemonstrate the practical utility of our results on fluorescence spectroscopy\ndata collected from the Neuse River in North Carolina."}, "http://arxiv.org/abs/2310.12711": {"title": "Modelling multivariate extremes through angular-radial decomposition of the density function", "link": "http://arxiv.org/abs/2310.12711", "description": "We present a new framework for modelling multivariate extremes, based on an\nangular-radial representation of the probability density function. Under this\nrepresentation, the problem of modelling multivariate extremes is transformed\nto that of modelling an angular density and the tail of the radial variable,\nconditional on angle. Motivated by univariate theory, we assume that the tail\nof the conditional radial distribution converges to a generalised Pareto (GP)\ndistribution. To simplify inference, we also assume that the angular density is\ncontinuous and finite and the GP parameter functions are continuous with angle.\nWe refer to the resulting model as the semi-parametric angular-radial (SPAR)\nmodel for multivariate extremes. We consider the effect of the choice of polar\ncoordinate system and introduce generalised concepts of angular-radial\ncoordinate systems and generalised scalar angles in two dimensions. We show\nthat under certain conditions, the choice of polar coordinate system does not\naffect the validity of the SPAR assumptions. However, some choices of\ncoordinate system lead to simpler representations. In contrast, we show that\nthe choice of margin does affect whether the model assumptions are satisfied.\nIn particular, the use of Laplace margins results in a form of the density\nfunction for which the SPAR assumptions are satisfied for many common families\nof copula, with various dependence classes. We show that the SPAR model\nprovides a more versatile framework for characterising multivariate extremes\nthan provided by existing approaches, and that several commonly-used approaches\nare special cases of the SPAR model. Moreover, the SPAR framework provides a\nmeans of characterising all `extreme regions' of a joint distribution using a\nsingle inference. Applications in which this is useful are discussed."}, "http://arxiv.org/abs/2310.12757": {"title": "Conservative Inference for Counterfactuals", "link": "http://arxiv.org/abs/2310.12757", "description": "In causal inference, the joint law of a set of counterfactual random\nvariables is generally not identified. We show that a conservative version of\nthe joint law - corresponding to the smallest treatment effect - is identified.\nFinding this law uses recent results from optimal transport theory. 
Under this\nconservative law we can bound causal effects and we may construct inferences\nfor each individual's counterfactual dose-response curve. Intuitively, this is\nthe flattest counterfactual curve for each subject that is consistent with the\ndistribution of the observables. If the outcome is univariate then, under mild\nconditions, this curve is simply the quantile function of the counterfactual\ndistribution that passes through the observed point. This curve corresponds to\na nonparametric rank preserving structural model."}, "http://arxiv.org/abs/2310.12882": {"title": "Sequential Gibbs Posteriors with Applications to Principal Component Analysis", "link": "http://arxiv.org/abs/2310.12882", "description": "Gibbs posteriors are proportional to a prior distribution multiplied by an\nexponentiated loss function, with a key tuning parameter weighting information\nin the loss relative to the prior and providing a control of posterior\nuncertainty. Gibbs posteriors provide a principled framework for\nlikelihood-free Bayesian inference, but in many situations, relying on a single\ntuning parameter inevitably leads to poor uncertainty quantification. In\nparticular, regardless of the value of the parameter, credible regions have coverage far\nfrom the nominal frequentist level even in large samples. We propose a\nsequential extension to Gibbs posteriors to address this problem. We prove that the\nproposed sequential posterior exhibits concentration and a Bernstein-von Mises\ntheorem, which holds under easy-to-verify conditions in Euclidean space and on\nmanifolds. As a byproduct, we obtain the first Bernstein-von Mises theorem for\ntraditional likelihood-based Bayesian posteriors on manifolds. All methods are\nillustrated with an application to principal component analysis."}, "http://arxiv.org/abs/2207.06949": {"title": "Seeking the Truth Beyond the Data", "link": "http://arxiv.org/abs/2207.06949", "description": "Clustering is an unsupervised machine learning methodology in which unlabeled\nelements/objects are grouped together with the aim of constructing\nwell-established clusters whose elements are classified according to their\nsimilarity. The goal of this process is to provide a useful aid that helps the\nresearcher identify patterns in the data. When dealing\nwith large databases, such patterns may not be easily detectable without the\naid of a clustering algorithm. This article provides an in-depth\ndescription of the most widely used clustering methodologies, accompanied by\nuseful guidance on suitable parameter selection and\ninitialization. At the same time, this article is not only a review\nhighlighting the major elements of the examined clustering techniques but also\na comparison of these algorithms' clustering efficiency on three\ndatasets, revealing their weaknesses and capabilities in terms of accuracy\nand complexity when confronted with discrete and continuous\nobservations. The results help us draw valuable conclusions about\nthe suitability of the examined clustering techniques with respect to\ndataset size."}, "http://arxiv.org/abs/2208.07831": {"title": "Structured prior distributions for the covariance matrix in latent factor models", "link": "http://arxiv.org/abs/2208.07831", "description": "Factor models are widely used for dimension reduction in the analysis of\nmultivariate data. This is achieved through decomposition of a p x p covariance\nmatrix into the sum of two components. 
Through a latent factor representation,\nthey can be interpreted as a diagonal matrix of idiosyncratic variances and a\nshared variation matrix, that is, the product of a p x k factor loadings matrix\nand its transpose. If k << p, this defines a sparse factorisation of the\ncovariance matrix. Historically, little attention has been paid to\nincorporating prior information in Bayesian analyses using factor models where,\nat best, the prior for the factor loadings is order invariant. In this work, a\nclass of structured priors is developed that can encode ideas of dependence\nstructure about the shared variation matrix. The construction allows\ndata-informed shrinkage towards sensible parametric structures while also\nfacilitating inference over the number of factors. Using an unconstrained\nreparameterisation of stationary vector autoregressions, the methodology is\nextended to stationary dynamic factor models. For computational inference,\nparameter-expanded Markov chain Monte Carlo samplers are proposed, including an\nefficient adaptive Gibbs sampler. Two substantive applications showcase the\nscope of the methodology and its inferential benefits."}, "http://arxiv.org/abs/2211.01746": {"title": "Log-density gradient covariance and automatic metric tensors for Riemann manifold Monte Carlo methods", "link": "http://arxiv.org/abs/2211.01746", "description": "A metric tensor for Riemann manifold Monte Carlo particularly suited for\nnon-linear Bayesian hierarchical models is proposed. The metric tensor is built\nfrom symmetric positive semidefinite log-density gradient covariance (LGC)\nmatrices, which are also proposed and further explored here. The LGCs\ngeneralize the Fisher information matrix by measuring the joint information\ncontent and dependence structure of both a random variable and the parameters\nof said variable. Consequently, positive definite Fisher/LGC-based metric\ntensors may be constructed not only from the observation likelihoods as is\ncurrent practice, but also from arbitrarily complicated non-linear prior/latent\nvariable structures, provided the LGC may be derived for each conditional\ndistribution used to construct said structures. The proposed methodology is\nhighly automatic and allows for exploitation of any sparsity associated with\nthe model in question. When implemented in conjunction with a Riemann manifold\nvariant of the recently proposed numerical generalized randomized Hamiltonian\nMonte Carlo processes, the proposed methodology is highly competitive, in\nparticular for the more challenging target distributions associated with\nBayesian hierarchical models."}, "http://arxiv.org/abs/2211.02383": {"title": "Simulation-Based Calibration Checking for Bayesian Computation: The Choice of Test Quantities Shapes Sensitivity", "link": "http://arxiv.org/abs/2211.02383", "description": "Simulation-based calibration checking (SBC) is a practical method to validate\ncomputationally-derived posterior distributions or their approximations. In\nthis paper, we introduce a new variant of SBC to alleviate several known\nproblems. Our variant allows the user to in principle detect any possible issue\nwith the posterior, while previously reported implementations could never\ndetect large classes of problems including when the posterior is equal to the\nprior. This is made possible by including additional data-dependent test\nquantities when running SBC. We argue and demonstrate that the joint likelihood\nof the data is an especially useful test quantity. 
Some other types of test\nquantities and their theoretical and practical benefits are also investigated.\nWe provide theoretical analysis of SBC, thereby providing a more complete\nunderstanding of the underlying statistical mechanisms. We also bring attention\nto a relatively common mistake in the literature and clarify the difference\nbetween SBC and checks based on the data-averaged posterior. We support our\nrecommendations with numerical case studies on a multivariate normal example\nand a case study in implementing an ordered simplex data type for use with\nHamiltonian Monte Carlo. The SBC variant introduced in this paper is\nimplemented in the $\\mathtt{SBC}$ R package."}, "http://arxiv.org/abs/2310.13081": {"title": "Metastable Hidden Markov Processes: a theory for modeling financial markets", "link": "http://arxiv.org/abs/2310.13081", "description": "The modeling of financial time series by hidden Markov models has been\nperformed successfully in the literature. In this paper, we propose a theory\nthat justifies such a modeling under the assumption that there exists a market\nformed by agents whose states evolve on time as an interacting Markov system\nthat has a metastable behavior described by the hidden Markov chain. This\ntheory is a rare application of metastability outside the modeling of physical\nsystems, and may inspire the development of new interacting Markov systems with\nfinancial constraints. In the context of financial economics and causal factor\ninvestment, the theory implies a new paradigm in which fluctuations in\ninvestment performance are primarily driven by the state of the market, rather\nthan being directly caused by other variables. Even though the usual approach\nto causal factor investment based on causal inference is not completely\ninconsistent with the proposed theory, the latter has the advantage of\naccounting for the non-stationary evolution of the time series through the\nchange between hidden market states. By accounting for this possibility, one\ncan more effectively assess risks and implement mitigation strategies."}, "http://arxiv.org/abs/2310.13162": {"title": "Network Meta-Analysis of Time-to-Event Endpoints with Individual Participant Data using Restricted Mean Survival Time Regression", "link": "http://arxiv.org/abs/2310.13162", "description": "Restricted mean survival time (RMST) models have gained popularity when\nanalyzing time-to-event outcomes because RMST models offer more straightforward\ninterpretations of treatment effects with fewer assumptions than hazard ratios\ncommonly estimated from Cox models. However, few network meta-analysis (NMA)\nmethods have been developed using RMST. In this paper, we propose advanced RMST\nNMA models when individual participant data are available. Our models allow us\nto study treatment effect moderation and provide comprehensive understanding\nabout comparative effectiveness of treatments and subgroup effects. 
An\nextensive simulation study and a real data example about treatments for\npatients with atrial fibrillation are presented."}, "http://arxiv.org/abs/2310.13178": {"title": "Exact Inference for Common Odds Ratio in Meta-Analysis with Zero-Total-Event Studies", "link": "http://arxiv.org/abs/2310.13178", "description": "Stemming from the high-profile publication of Nissen and Wolski (2007) and\nsubsequent discussions with divergent views on how to handle observed\nzero-total-event studies, defined as studies that observe zero events in\nboth treatment and control arms, the research topic concerning the common odds\nratio model with zero-total-event studies remains an unresolved problem\nin meta-analysis. In this article, we address this problem by proposing a novel\nrepro samples method to handle zero-total-event studies and make inference for\nthe common odds ratio parameter. The development explicitly accounts for the\nsampling scheme and does not rely on large-sample approximations. It is\ntheoretically justified with guaranteed finite-sample performance. The\nempirical performance of the proposed method is demonstrated through simulation\nstudies. The simulations show that the proposed confidence set achieves the desired\nempirical coverage rate and also that zero-total-event studies contain\ninformation and impact the inference for the common odds ratio. The proposed\nmethod is applied to combine information in the Nissen and Wolski study."}, "http://arxiv.org/abs/2310.13232": {"title": "Interaction Screening and Pseudolikelihood Approaches for Tensor Learning in Ising Models", "link": "http://arxiv.org/abs/2310.13232", "description": "In this paper, we study two well-known methods of Ising structure learning,\nnamely the pseudolikelihood approach and the interaction screening approach, in\nthe context of tensor recovery in $k$-spin Ising models. We show that both\nthese approaches, with proper regularization, retrieve the underlying\nhypernetwork structure using a sample size logarithmic in the number of network\nnodes, and exponential in the maximum interaction strength and maximum\nnode-degree. We also track down the exact dependence of the rate of tensor\nrecovery on the interaction order $k$, which is allowed to grow with the number\nof samples and nodes, for both approaches. Finally, we provide a\ncomparative discussion of the performance of the two approaches based on\nsimulation studies, which also demonstrate the exponential dependence of the\ntensor recovery rate on the maximum coupling strength."}, "http://arxiv.org/abs/2310.13387": {"title": "Assumption violations in causal discovery and the robustness of score matching", "link": "http://arxiv.org/abs/2310.13387", "description": "When domain knowledge is limited and experimentation is restricted by\nethical, financial, or time constraints, practitioners turn to observational\ncausal discovery methods to recover the causal structure, exploiting the\nstatistical properties of their data. Because causal discovery without further\nassumptions is an ill-posed problem, each algorithm comes with its own set of\nusually untestable assumptions, some of which are hard to meet in real\ndatasets. Motivated by these considerations, this paper extensively benchmarks\nthe empirical performance of recent causal discovery methods on observational\ni.i.d. data generated under different background conditions, allowing for\nviolations of the critical assumptions required by each selected approach. 
Our\nexperimental findings show that score matching-based methods demonstrate\nsurprising performance in the false positive and false negative rate of the\ninferred graph in these challenging scenarios, and we provide theoretical\ninsights into their performance. This work is also the first effort to\nbenchmark the stability of causal discovery algorithms with respect to the\nvalues of their hyperparameters. Finally, we hope this paper will set a new\nstandard for the evaluation of causal discovery methods and can serve as an\naccessible entry point for practitioners interested in the field, highlighting\nthe empirical implications of different algorithm choices."}, "http://arxiv.org/abs/2310.13444": {"title": "Testing for the extent of instability in nearly unstable processes", "link": "http://arxiv.org/abs/2310.13444", "description": "This paper deals with unit root issues in time series analysis. It has been\nknown for a long time that unit root tests may be flawed when a series although\nstationary has a root close to unity. That motivated recent papers dedicated to\nautoregressive processes where the bridge between stability and instability is\nexpressed by means of time-varying coefficients. In this vein the process we\nconsider has a companion matrix $A_{n}$ with spectral radius $\\rho(A_{n}) < 1$\nsatisfying $\\rho(A_{n}) \\rightarrow 1$, a situation that we describe as `nearly\nunstable'. The question we investigate is the following: given an observed path\nsupposed to come from a nearly-unstable process, is it possible to test for the\n`extent of instability', \\textit{i.e.} to test how close we are to the unit\nroot? In this regard, we develop a strategy to evaluate $\\alpha$ and to test\nfor $\\mathcal{H}_0 : \"\\alpha = \\alpha_0\"$ against $\\mathcal{H}_1 : \"\\alpha >\n\\alpha_0\"$ when $\\rho(A_{n})$ lies in an inner $O(n^{-\\alpha})$-neighborhood of\nthe unity, for some $0 < \\alpha < 1$. Empirical evidence is given (on\nsimulations and real time series) about the advantages of the flexibility\ninduced by such a procedure compared to the usual unit root tests and their\nbinary responses. As a by-product, we also build a symmetric procedure for the\nusually left out situation where the dominant root lies around $-1$."}, "http://arxiv.org/abs/2310.13446": {"title": "Simple binning algorithm and SimDec visualization for comprehensive sensitivity analysis of complex computational models", "link": "http://arxiv.org/abs/2310.13446", "description": "Models of complex technological systems inherently contain interactions and\ndependencies among their input variables that affect their joint influence on\nthe output. Such models are often computationally expensive and few sensitivity\nanalysis methods can effectively process such complexities. Moreover, the\nsensitivity analysis field as a whole pays limited attention to the nature of\ninteraction effects, whose understanding can prove to be critical for the\ndesign of safe and reliable systems. In this paper, we introduce and\nextensively test a simple binning approach for computing sensitivity indices\nand demonstrate how complementing it with the smart visualization method,\nsimulation decomposition (SimDec), can permit important insights into the\nbehavior of complex engineering models. The simple binning approach computes\nfirst-, second-order effects, and a combined sensitivity index, and is\nconsiderably more computationally efficient than Sobol' indices. 
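The binning idea described above can be illustrated with a small sketch that estimates first-order indices S_i = Var(E[Y | X_i]) / Var(Y) by averaging the output within quantile bins of each input. The bin count and the Ishigami-style test function are assumptions; this is not the authors' exact algorithm nor the SimDec visualization.

```python
import numpy as np

def first_order_binning(X, y, n_bins=20):
    """Binning estimate of S_i = Var(E[Y | X_i]) / Var(Y) for each column of X."""
    y = np.asarray(y, float)
    grand_mean, var_y = y.mean(), y.var()
    indices = []
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)
        occupied = [b for b in range(n_bins) if np.any(bins == b)]
        means = np.array([y[bins == b].mean() for b in occupied])
        weights = np.array([np.mean(bins == b) for b in occupied])
        indices.append(np.sum(weights * (means - grand_mean) ** 2) / var_y)
    return np.array(indices)

# Ishigami-style toy model: X1 and X2 matter on their own, X3 only via interaction.
rng = np.random.default_rng(1)
X = rng.uniform(-np.pi, np.pi, size=(50_000, 3))
y = np.sin(X[:, 0]) + 7 * np.sin(X[:, 1]) ** 2 + 0.1 * X[:, 2] ** 4 * np.sin(X[:, 0])
print(first_order_binning(X, y).round(3))   # roughly (0.31, 0.44, 0.00)
```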
The totality\nof the sensitivity analysis framework provides an efficient and intuitive way\nto analyze the behavior of complex systems containing interactions and\ndependencies."}, "http://arxiv.org/abs/2310.13487": {"title": "Two-stage weighted least squares estimator of multivariate discrete-valued observation-driven models", "link": "http://arxiv.org/abs/2310.13487", "description": "In this work a general semi-parametric multivariate model where the first two\nconditional moments are assumed to be multivariate time series is introduced.\nThe focus of the estimation is the conditional mean parameter vector for\ndiscrete-valued distributions. Quasi-Maximum Likelihood Estimators (QMLEs)\nbased on the linear exponential family are typically employed for such\nestimation problems when the true multivariate conditional probability\ndistribution is unknown or too complex. Although QMLEs provide consistent\nestimates they may be inefficient. In this paper novel two-stage Multivariate\nWeighted Least Square Estimators (MWLSEs) are introduced which enjoy the same\nconsistency property as the QMLEs but can provide improved efficiency with\nsuitable choice of the covariance matrix of the observations. The proposed\nmethod allows for a more accurate estimation of model parameters in particular\nfor count and categorical data when maximum likelihood estimation is\nunfeasible. Moreover, consistency and asymptotic normality of MWLSEs are\nderived. The estimation performance of QMLEs and MWLSEs is compared through\nsimulation experiments and a real data application, showing superior accuracy\nof the proposed methodology."}, "http://arxiv.org/abs/2310.13511": {"title": "Dynamic Realized Minimum Variance Portfolio Models", "link": "http://arxiv.org/abs/2310.13511", "description": "This paper introduces a dynamic minimum variance portfolio (MVP) model using\nnonlinear volatility dynamic models, based on high-frequency financial data.\nSpecifically, we impose an autoregressive dynamic structure on MVP processes,\nwhich helps capture the MVP dynamics directly. To evaluate the dynamic MVP\nmodel, we estimate the inverse volatility matrix using the constrained\n$\\ell_1$-minimization for inverse matrix estimation (CLIME) and calculate daily\nrealized non-normalized MVP weights. Based on the realized non-normalized MVP\nweight estimator, we propose the dynamic MVP model, which we call the dynamic\nrealized minimum variance portfolio (DR-MVP) model. To estimate a large number\nof parameters, we employ the least absolute shrinkage and selection operator\n(LASSO) and predict the future MVP and establish its asymptotic properties.\nUsing high-frequency trading data, we apply the proposed method to MVP\nprediction."}, "http://arxiv.org/abs/2310.13580": {"title": "Bayesian Hierarchical Modeling for Bivariate Multiscale Spatial Data with Application to Blood Test Monitoring", "link": "http://arxiv.org/abs/2310.13580", "description": "In public health applications, spatial data collected are often recorded at\ndifferent spatial scales and over different correlated variables. Spatial\nchange of support is a key inferential problem in these applications and have\nbecome standard in univariate settings; however, it is less standard in\nmultivariate settings. There are several existing multivariate spatial models\nthat can be easily combined with multiscale spatial approach to analyze\nmultivariate multiscale spatial data. 
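As a companion to the dynamic realized minimum variance portfolio abstract above, here is a minimal sketch of the closed-form (non-)normalized MVP weights w proportional to Sigma^{-1} 1. A plain sample-covariance inverse stands in for the CLIME estimator used in that paper, and the returns below are simulated, not real high-frequency data.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 10, 500
A = rng.normal(size=(p, p))
true_cov = A @ A.T / p + np.eye(p)                 # a positive-definite "true" covariance
returns = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

sigma_hat = np.cov(returns, rowvar=False)          # stand-in for a high-frequency estimator
ones = np.ones(p)
w_raw = np.linalg.solve(sigma_hat, ones)           # non-normalized MVP weights, Sigma^{-1} 1
w = w_raw / (ones @ w_raw)                         # normalized weights, summing to one
print(w.round(3), float(w @ sigma_hat @ w))        # weights and in-sample portfolio variance
```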
In this paper, we propose three new\nmodels from such combinations for bivariate multiscale spatial data in a\nBayesian context. In particular, we extend spatial random effects models,\nmultivariate conditional autoregressive models, and ordered hierarchical models\nthrough a multiscale spatial approach. We run simulation studies for the three\nmodels and compare them in terms of prediction performance and computational\nefficiency. We motivate our models through an analysis of 2015 Texas annual\naverage percentage receiving two blood tests from the Dartmouth Atlas Project."}, "http://arxiv.org/abs/2102.13209": {"title": "Wielding Occam's razor: Fast and frugal retail forecasting", "link": "http://arxiv.org/abs/2102.13209", "description": "The algorithms available for retail forecasting have increased in complexity.\nNewer methods, such as machine learning, are inherently complex. The more\ntraditional families of forecasting models, such as exponential smoothing and\nautoregressive integrated moving averages, have expanded to contain multiple\npossible forms and forecasting profiles. We question complexity in forecasting\nand the need to consider such large families of models. Our argument is that\nparsimoniously identifying suitable subsets of models will not decrease\nforecasting accuracy nor will it reduce the ability to estimate forecast\nuncertainty. We propose a framework that balances forecasting performance\nversus computational cost, resulting in the consideration of only a reduced set\nof models. We empirically demonstrate that a reduced set performs well.\nFinally, we translate computational benefits to monetary cost savings and\nenvironmental impact and discuss the implications of our results in the context\nof large retailers."}, "http://arxiv.org/abs/2211.04666": {"title": "Fast and Locally Adaptive Bayesian Quantile Smoothing using Calibrated Variational Approximations", "link": "http://arxiv.org/abs/2211.04666", "description": "Quantiles are useful characteristics of random variables that can provide\nsubstantial information on distributions compared with commonly used summary\nstatistics such as means. In this paper, we propose a Bayesian quantile trend\nfiltering method to estimate non-stationary trend of quantiles. We introduce\ngeneral shrinkage priors to induce locally adaptive Bayesian inference on\ntrends and mixture representation of the asymmetric Laplace likelihood. To\nquickly compute the posterior distribution, we develop calibrated mean-field\nvariational approximations to guarantee that the frequentist coverage of\ncredible intervals obtained from the approximated posterior is a specified\nnominal level. Simulation and empirical studies show that the proposed\nalgorithm is computationally much more efficient than the Gibbs sampler and\ntends to provide stable inference results, especially for high/low quantiles."}, "http://arxiv.org/abs/2305.17631": {"title": "A Bayesian Approach for Clustering Constant-wise Change-point Data", "link": "http://arxiv.org/abs/2305.17631", "description": "Change-point models deal with ordered data sequences. Their primary goal is\nto infer the locations where an aspect of the data sequence changes. In this\npaper, we propose and implement a nonparametric Bayesian model for clustering\nobservations based on their constant-wise change-point profiles via Gibbs\nsampler. 
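The Bayesian quantile smoothing abstract above rests on the asymmetric Laplace working likelihood; here is a minimal sketch of its frequentist counterpart, the check (pinball) loss, whose minimizer is a quantile. The quantile level and simulated data are illustrative assumptions.

```python
import numpy as np

def pinball(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

rng = np.random.default_rng(3)
y = rng.standard_t(df=3, size=5000)
tau = 0.9
grid = np.linspace(-1.0, 4.0, 2001)
avg_loss = np.array([pinball(y - m, tau).mean() for m in grid])
print(grid[int(np.argmin(avg_loss))], np.quantile(y, tau))  # both ~ the 0.9 quantile
# In the Bayesian version, exp(-pinball(y - mu, tau)/sigma) plays the role of the
# likelihood, and shrinkage priors on differences of mu over time (or a graph)
# yield locally adaptive quantile trends.
```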
Our model incorporates a Dirichlet Process on the constant-wise\nchange-point structures to cluster observations while simultaneously performing\nchange-point estimation. Additionally, our approach controls the number of\nclusters in the model, not requiring the specification of the number of\nclusters a priori. Our method's performance is evaluated on simulated data\nunder various scenarios and on a real dataset from single-cell genomic\nsequencing."}, "http://arxiv.org/abs/2306.06342": {"title": "Distribution-free inference with hierarchical data", "link": "http://arxiv.org/abs/2306.06342", "description": "This paper studies distribution-free inference in settings where the data set\nhas a hierarchical structure -- for example, groups of observations, or\nrepeated measurements. In such settings, standard notions of exchangeability\nmay not hold. To address this challenge, a hierarchical form of exchangeability\nis derived, facilitating extensions of distribution-free methods, including\nconformal prediction and jackknife+. While the standard theoretical guarantee\nobtained by the conformal prediction framework is a marginal predictive\ncoverage guarantee, in the special case of independent repeated measurements,\nit is possible to achieve a stronger form of coverage -- the \"second-moment\ncoverage\" property -- to provide better control of conditional miscoverage\nrates, and distribution-free prediction sets that achieve this property are\nconstructed. Simulations illustrate that this guarantee indeed leads to\nuniformly small conditional miscoverage rates. Empirically, this stronger\nguarantee comes at the cost of a larger width of the prediction set in\nscenarios where the fitted model is poorly calibrated, but this cost is very\nmild in cases where the fitted model is accurate."}, "http://arxiv.org/abs/2307.15205": {"title": "Robust graph-based methods for overcoming the curse of dimensionality", "link": "http://arxiv.org/abs/2307.15205", "description": "Graph-based two-sample tests and graph-based change-point detection that\nutilize a similarity graph provide a powerful tool for analyzing\nhigh-dimensional and non-Euclidean data as these methods do not impose\ndistributional assumptions on data and have good performance across various\nscenarios. Current graph-based tests that deliver efficacy across a broad\nspectrum of alternatives typically rely on the $K$-nearest neighbor graph or\nthe $K$-minimum spanning tree. However, these graphs can be vulnerable with\nhigh-dimensional data due to the curse of dimensionality. To mitigate this\nissue, we propose to use a robust graph that is considerably less influenced by\nthe curse of dimensionality. We also establish a theoretical foundation for\ngraph-based methods utilizing this proposed robust graph and demonstrate its\nconsistency under fixed alternatives for both low-dimensional and\nhigh-dimensional data."}, "http://arxiv.org/abs/2310.13764": {"title": "Random Flows of Covariance Operators and their Statistical Inference", "link": "http://arxiv.org/abs/2310.13764", "description": "We develop a statistical framework for conducting inference on collections of\ntime-varying covariance operators (covariance flows) over a general, possibly\ninfinite dimensional, Hilbert space. We model the intrinsically non-linear\nstructure of covariances by means of the Bures-Wasserstein metric geometry. 
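A minimal sketch of the Bures-Wasserstein distance in the finite-dimensional (matrix) case referenced above, d(A, B)^2 = tr A + tr B - 2 tr (A^{1/2} B A^{1/2})^{1/2}; the example covariance matrices are arbitrary assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(A, B):
    """d(A, B)^2 = tr A + tr B - 2 tr (A^{1/2} B A^{1/2})^{1/2} for PSD matrices."""
    root_A = sqrtm(A)
    cross = sqrtm(root_A @ B @ root_A)
    d2 = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return float(np.sqrt(max(np.real(d2), 0.0)))   # clip tiny negative round-off

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(200, 5)), 2.0 * rng.normal(size=(200, 5))
A, B = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
print(bures_wasserstein(A, B), bures_wasserstein(A, A))   # second value is ~0
```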
We\nmake use of the Riemannian-like structure induced by this metric to define a\nnotion of mean and covariance of a random flow, and develop an associated\nKarhunen-Lo\\`eve expansion. We then treat the problem of estimation and\nconstruction of functional principal components from a finite collection of\ncovariance flows. Our theoretical results are motivated by modern problems in\nfunctional data analysis, where one observes operator-valued random processes\n-- for instance when analysing dynamic functional connectivity and fMRI data,\nor when analysing multiple functional time series in the frequency domain.\nNevertheless, our framework is also novel in finite dimensions (the matrix\ncase), and we demonstrate what simplifications can be afforded then. We\nillustrate our methodology by means of simulations and a data analysis."}, "http://arxiv.org/abs/2310.13796": {"title": "Faithful graphical representations of local independence", "link": "http://arxiv.org/abs/2310.13796", "description": "Graphical models use graphs to represent conditional independence structure\nin the distribution of a random vector. In stochastic processes, graphs may\nrepresent so-called local independence or conditional Granger causality. Under\nsome regularity conditions, a local independence graph implies a set of\nindependences using a graphical criterion known as $\\delta$-separation, or\nusing its generalization, $\\mu$-separation. This is a stochastic process\nanalogue of $d$-separation in DAGs. However, there may be more independences\nthan implied by this graph, and this is a violation of so-called faithfulness.\nWe characterize faithfulness in local independence graphs and give a method to\nconstruct a faithful graph from any local independence model such that the\noutput equals the true graph when Markov and faithfulness assumptions hold. We\ndiscuss various assumptions that are weaker than faithfulness, and we explore\ndifferent structure learning algorithms and their properties under varying\nassumptions."}, "http://arxiv.org/abs/2310.13826": {"title": "A p-value for Process Tracing and other N=1 Studies", "link": "http://arxiv.org/abs/2310.13826", "description": "The paper introduces a \\(p\\)-value that summarizes the evidence against a\nrival causal theory that explains an observed outcome in a single case. We show\nhow to represent the probability distribution characterizing a theorized rival\nhypothesis (the null) in the absence of randomization of treatment and when\nrelying on qualitative data, for instance when conducting process tracing. As\nin Fisher's \\autocite*{fisher1935design} original design, our \\(p\\)-value\nindicates how frequently one would find the same observations or even more\nfavorable observations under a theory that is compatible with our observations\nbut antagonistic to the working hypothesis. We also present an extension that\nallows researchers to assess the sensitivity of their results to confirmation\nbias. Finally, we illustrate the application of our hypothesis test using the\nstudy by Snow \\autocite*{Snow1855} about the cause of Cholera in Soho, a\nclassic in Process Tracing, Epidemiology, and Microbiology. 
Our framework suits\nany type of case study and evidence, such as data from interviews, archives,\nor participant observation."}, "http://arxiv.org/abs/2310.13858": {"title": "Likelihood-based surrogate dimension reduction", "link": "http://arxiv.org/abs/2310.13858", "description": "We consider the problem of surrogate sufficient dimension reduction, that is,\nestimating the central subspace of a regression model, when the covariates are\ncontaminated by measurement error. When no measurement error is present, a\nlikelihood-based dimension reduction method that relies on maximizing the\nlikelihood of a Gaussian inverse regression model on the Grassmann manifold is\nwell-known to have superior performance to traditional inverse moment methods.\nWe propose two likelihood-based estimators for the central subspace in\nmeasurement error settings, which make different adjustments to the observed\nsurrogates. Both estimators are computed based on maximizing objective\nfunctions on the Grassmann manifold and are shown to consistently recover the\ntrue central subspace. When the central subspace is assumed to depend on only a\nfew covariates, we further propose to augment the likelihood function with a\npenalty term that induces sparsity on the Grassmann manifold to obtain sparse\nestimators. The resulting objective function has a closed-form Riemannian gradient,\nwhich facilitates efficient computation of the penalized estimator. We leverage\nthe state-of-the-art trust region algorithm on the Grassmann manifold to\ncompute the proposed estimators efficiently. Simulation studies and a data\napplication demonstrate the proposed likelihood-based estimators perform better\nthan inverse moment-based estimators in terms of both estimation and variable\nselection accuracy."}, "http://arxiv.org/abs/2310.13874": {"title": "A Linear Errors-in-Variables Model with Unknown Heteroscedastic Measurement Errors", "link": "http://arxiv.org/abs/2310.13874", "description": "In the classic measurement error framework, covariates are contaminated by\nindependent additive noise. This paper considers parameter estimation in such a\nlinear errors-in-variables model where the unknown measurement error\ndistribution is heteroscedastic across observations. We propose a new\ngeneralized method of moments (GMM) estimator that combines a moment correction\napproach and a phase function-based approach. The former requires distributions\nto have four finite moments, while the latter relies on covariates having\nasymmetric distributions. The new estimator is shown to be consistent and\nasymptotically normal under appropriate regularity conditions. The asymptotic\ncovariance of the estimator is derived, and the estimated standard error is\ncomputed using a fast bootstrap procedure. The GMM estimator is demonstrated to\nhave strong finite sample performance in numerical studies, especially when the\nmeasurement errors follow non-Gaussian distributions."}, "http://arxiv.org/abs/2310.13911": {"title": "Multilevel Matrix Factor Model", "link": "http://arxiv.org/abs/2310.13911", "description": "Large-scale matrix data has been widely collected and continuously studied\nin various fields recently. Considering the multi-level factor structure and\nutilizing the matrix structure, we propose a multilevel matrix factor model\nwith both global and local factors. The global factors can affect all matrix\ntime series, whereas the local factors are only allowed to affect each\nspecific matrix time series. 
The estimation procedures can consistently\nestimate the factor loadings and determine the number of factors. We establish\nthe asymptotic properties of the estimators. A simulation study is presented to\nillustrate the performance of the proposed estimation method. We utilize the\nmodel to analyze eight indicators across 200 stocks from ten distinct\nindustries, demonstrating the empirical utility of our proposed approach."}, "http://arxiv.org/abs/2310.13966": {"title": "Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression", "link": "http://arxiv.org/abs/2310.13966", "description": "In recent years, transfer learning has garnered significant attention in the\nmachine learning community. Its ability to leverage knowledge from related\nstudies to improve generalization performance in a target study has made it\nhighly appealing. This paper focuses on investigating the transfer learning\nproblem within the context of nonparametric regression over a reproducing\nkernel Hilbert space. The aim is to bridge the gap between practical\neffectiveness and theoretical guarantees. We specifically consider two\nscenarios: one where the transferable sources are known and another where they\nare unknown. For the known transferable source case, we propose a two-step\nkernel-based estimator by solely using kernel ridge regression. For the unknown\ncase, we develop a novel method based on an efficient aggregation algorithm,\nwhich can automatically detect and alleviate the effects of negative sources.\nThis paper provides the statistical properties of the desired estimators and\nestablishes the minimax optimal rate. Through extensive numerical experiments\non synthetic data and real examples, we validate our theoretical findings and\ndemonstrate the effectiveness of our proposed method."}, "http://arxiv.org/abs/2310.13973": {"title": "Estimation and convergence rates in the distributional single index model", "link": "http://arxiv.org/abs/2310.13973", "description": "The distributional single index model is a semiparametric regression model in\nwhich the conditional distribution functions $P(Y \\leq y | X = x) =\nF_0(\\theta_0(x), y)$ of a real-valued outcome variable $Y$ depend on\n$d$-dimensional covariates $X$ through a univariate, parametric index function\n$\\theta_0(x)$, and increase stochastically as $\\theta_0(x)$ increases. We\npropose least squares approaches for the joint estimation of $\\theta_0$ and\n$F_0$ in the important case where $\\theta_0(x) = \\alpha_0^{\\top}x$ and obtain\nconvergence rates of $n^{-1/3}$, thereby improving an existing result that\ngives a rate of $n^{-1/6}$. A simulation study indicates that the convergence\nrate for the estimation of $\\alpha_0$ might be faster. Furthermore, we\nillustrate our methods in a real data application that demonstrates the\nadvantages of shape restrictions in single index models."}, "http://arxiv.org/abs/2310.14246": {"title": "Shortcuts for causal discovery of nonlinear models by score matching", "link": "http://arxiv.org/abs/2310.14246", "description": "The use of simulated data in the field of causal discovery is ubiquitous due\nto the scarcity of annotated real data. Recently, Reisach et al., 2021\nhighlighted the emergence of patterns in simulated linear data, which displays\nincreasing marginal variance in the causal direction. 
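The marginal-variance pattern mentioned above can be reproduced in a few lines: in a simulated linear additive noise chain with edge weights above one, variances grow along the causal order, so sorting by variance alone "recovers" it. The weights and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x1 = rng.normal(size=n)                      # root of the chain
x2 = 1.5 * x1 + rng.normal(size=n)           # x1 -> x2
x3 = 1.5 * x2 + rng.normal(size=n)           # x2 -> x3
X = np.column_stack([x1, x2, x3])

variances = X.var(axis=0)
print(variances.round(2))     # roughly (1.0, 3.25, 8.3): increasing along the chain
print(np.argsort(variances))  # the causal order is recovered by variance sorting alone
```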
As an ablation in their\nexperiments, Montagna et al., 2023 found that similar patterns may emerge in\nnonlinear models for the variance of the score vector $\\nabla \\log\np_{\\mathbf{X}}$, and introduced the ScoreSort algorithm. In this work, we\nformally define and characterize this score-sortability pattern of nonlinear\nadditive noise models. We find that it defines a class of identifiable\n(bivariate) causal models overlapping with nonlinear additive noise models. We\ntheoretically demonstrate the advantages of ScoreSort in terms of statistical\nefficiency compared to prior state-of-the-art score matching-based methods and\nempirically show the score-sortability of the most common synthetic benchmarks\nin the literature. Our findings highlight (1) the lack of diversity in the data as\nan important limitation in the evaluation of nonlinear causal discovery\napproaches, (2) the importance of thoroughly testing different settings within\na problem class, and (3) the importance of analyzing statistical properties in\ncausal discovery, where research is often limited to defining identifiability\nconditions of the model."}, "http://arxiv.org/abs/2310.14293": {"title": "Testing exchangeability by pairwise betting", "link": "http://arxiv.org/abs/2310.14293", "description": "In this paper, we address the problem of testing exchangeability of a\nsequence of random variables, $X_1, X_2,\\cdots$. This problem has been studied\nunder the recently popular framework of testing by betting. But the mapping of\ntesting problems to games is not one-to-one: many games can be designed for the\nsame test. Past work established that it is futile to play a single game betting\non every observation: test martingales in the data filtration are powerless.\nTwo avenues have been explored to circumvent this impossibility: betting in a\nreduced filtration (wealth is a test martingale in a coarsened filtration), or\nplaying many games in parallel (wealth is an e-process in the data filtration).\nThe former has proved to be difficult to theoretically analyze, while the\nlatter only works for binary or discrete observation spaces. Here, we introduce\na different approach that circumvents both drawbacks. We design a new (yet\nsimple) game in which we observe the data sequence in pairs. Despite the fact\nthat betting on individual observations is futile, we show that betting on\npairs of observations is not. To elaborate, we prove that our game leads to a\nnontrivial test martingale, which is interesting because it has been obtained\nby shrinking the filtration very slightly. We show that our test controls\ntype-1 error despite continuous monitoring, and achieves power one for both\nbinary and continuous observations, under a broad class of alternatives. Due to\nthe shrunk filtration, optional stopping is only allowed at even stopping\ntimes, not at odd ones: a relatively minor price. 
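A minimal sketch in the spirit of the pairwise game described above (not the paper's exact betting strategy): bet a fixed fraction, within each disjoint pair, on the second observation exceeding the first. Under exchangeability with continuous data this is a fair bet, so wealth is a nonnegative martingale; under a strong trend the wealth grows. The bet size, sample size, and drift are illustrative assumptions.

```python
import numpy as np

def pairwise_wealth(x, lam=0.3):
    """Wealth after betting a fixed fraction lam, per disjoint pair, on x[2i+1] > x[2i]."""
    x = np.asarray(x, float)
    pairs = x[: len(x) // 2 * 2].reshape(-1, 2)
    wins = (pairs[:, 1] > pairs[:, 0]).astype(float)   # P(win) = 1/2 under exchangeability
    return np.cumprod(1.0 + lam * (2.0 * wins - 1.0))

rng = np.random.default_rng(6)
iid = rng.normal(size=1000)                            # exchangeable: wealth stays small
drift = rng.normal(size=1000) + 0.5 * np.arange(1000)  # strong upward drift: wealth explodes
print(pairwise_wealth(iid)[-1], pairwise_wealth(drift)[-1])
# Rejecting whenever wealth exceeds 1/alpha gives a level-alpha sequential test
# (Ville's inequality); here stopping is naturally restricted to even times.
```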
We provide a wide array of\nsimulations that align with our theoretical findings."}, "http://arxiv.org/abs/2310.14399": {"title": "The role of randomization inference in unraveling individual treatment effects in clinical trials: Application to HIV vaccine trials", "link": "http://arxiv.org/abs/2310.14399", "description": "Randomization inference is a powerful tool in early phase vaccine trials to\nestimate the causal effect of a regimen against a placebo or another regimen.\nTraditionally, randomization-based inference often focuses on testing either\nFisher's sharp null hypothesis of no treatment effect for any unit or Neyman's\nweak null hypothesis of no sample average treatment effect. Many recent efforts\nhave explored conducting exact randomization-based inference for other\nsummaries of the treatment effect profile, for instance, quantiles of the\ntreatment effect distribution function. In this article, we systematically\nreview methods that conduct exact, randomization-based inference for quantiles\nof individual treatment effects (ITEs) and extend some results by incorporating\nauxiliary information often available in a vaccine trial. These methods are\nsuitable for four scenarios: (i) a randomized controlled trial (RCT) where the\npotential outcomes under one regimen are constant; (ii) an RCT with no\nrestriction on any potential outcomes; (iii) an RCT with some user-specified\nbounds on potential outcomes; and (iv) a matched study comparing two\nnon-randomized, possibly confounded treatment arms. We then conduct two\nextensive simulation studies, one comparing the performance of each method in\nmany practical clinical settings and the other evaluating the usefulness of the\nmethods in ranking and advancing experimental therapies. We apply these methods\nto an early-phase clinical trial, HIV Vaccine Trials Network Study 086 (HVTN\n086), to showcase the usefulness of the methods."}, "http://arxiv.org/abs/2310.14419": {"title": "An RKHS Approach for Variable Selection in High-dimensional Functional Linear Models", "link": "http://arxiv.org/abs/2310.14419", "description": "High-dimensional functional data has become increasingly prevalent in modern\napplications such as high-frequency financial data and neuroimaging data\nanalysis. We investigate a class of high-dimensional linear regression models,\nwhere each predictor is a random element in an infinite dimensional function\nspace, and the number of functional predictors p can potentially be much\ngreater than the sample size n. Assuming that each of the unknown coefficient\nfunctions belongs to some reproducing kernel Hilbert space (RKHS), we\nregularize the fitting of the model by imposing a group elastic-net type of\npenalty on the RKHS norms of the coefficient functions. We show that our loss\nfunction is Gateaux sub-differentiable, and our functional elastic-net\nestimator exists uniquely in the product RKHS. Under suitable sparsity\nassumptions and a functional version of the irrepresentable condition, we\nestablish the variable selection consistency property of our approach. The\nproposed method is illustrated through simulation studies and a real-data\napplication from the Human Connectome Project."}, "http://arxiv.org/abs/2310.14448": {"title": "Semiparametrically Efficient Score for the Survival Odds Ratio", "link": "http://arxiv.org/abs/2310.14448", "description": "We consider a general proportional odds model for survival data under binary\ntreatment, where the functional form of the covariates is left unspecified. 
We\nderive the efficient score for the conditional survival odds ratio given the\ncovariates using modern semiparametric theory. The efficient score may be\nuseful in the development of doubly robust estimators, although computational\nchallenges remain."}, "http://arxiv.org/abs/2310.14763": {"title": "Externally Valid Policy Evaluation Combining Trial and Observational Data", "link": "http://arxiv.org/abs/2310.14763", "description": "Randomized trials are widely considered as the gold standard for evaluating\nthe effects of decision policies. Trial data is, however, drawn from a\npopulation which may differ from the intended target population and this raises\na problem of external validity (aka. generalizability). In this paper we seek\nto use trial data to draw valid inferences about the outcome of a policy on the\ntarget population. Additional covariate data from the target population is used\nto model the sampling of individuals in the trial study. We develop a method\nthat yields certifiably valid trial-based policy evaluations under any\nspecified range of model miscalibrations. The method is nonparametric and the\nvalidity is assured even with finite samples. The certified policy evaluations\nare illustrated using both simulated and real data."}, "http://arxiv.org/abs/2310.14922": {"title": "The Complex Network Patterns of Human Migration at Different Geographical Scales: Network Science meets Regression Analysis", "link": "http://arxiv.org/abs/2310.14922", "description": "Migration's influence in shaping population dynamics in times of impending\nclimate and population crises exposes its crucial role in upholding societal\ncohesion. As migration impacts virtually all aspects of life, it continues to\nrequire attention across scientific disciplines. This study delves into two\ndistinctive substrates of Migration Studies: the \"why\" substrate, which deals\nwith identifying the factors driving migration relying primarily on regression\nmodeling, encompassing economic, demographic, geographic, cultural, political,\nand other variables; and the \"how\" substrate, which focuses on identifying\nmigration flows and patterns, drawing from Network Science tools and\nvisualization techniques to depict complex migration networks. Despite the\ngrowing percentage of Network Science studies in migration, the explanations of\nthe identified network traits remain very scarce, highlighting the detachment\nbetween the two research substrates. Our study includes real-world network\nanalyses of human migration across different geographical levels: city,\ncountry, and global. We examine inter-district migration in Vienna at the city\nlevel, review internal migration networks in Austria and Croatia at the country\nlevel, and analyze migration exchange between Croatia and the world at the\nglobal level. By comparing network structures, we demonstrate how distinct\nnetwork traits impact regression modeling. This work not only uncovers\nmigration network patterns in previously unexplored areas but also presents a\ncomprehensive overview of recent research, highlighting gaps in each field and\ntheir interconnectedness. 
Our contribution offers suggestions for integrating\nboth fields to enhance methodological rigor and support future research."}, "http://arxiv.org/abs/2310.15016": {"title": "Impact of Record-Linkage Errors in Covid-19 Vaccine-Safety Analyses using German Health-Care Data: A Simulation Study", "link": "http://arxiv.org/abs/2310.15016", "description": "With unprecedented speed, 192,248,678 doses of Covid-19 vaccines were\nadministered in Germany by July 11, 2023 to combat the pandemic. Limitations of\nclinical trials imply that the safety profile of these vaccines is not fully\nknown before marketing. However, routine health-care data can help address\nthese issues. Despite the high proportion of insured people, the analysis of\nvaccination-related data is challenging in Germany. Generally, the Covid-19\nvaccination status and other health-care data are stored in separate databases,\nwithout persistent and database-independent person identifiers. Error-prone\nrecord-linkage techniques must be used to merge these databases. Our aim was to\nquantify the impact of record-linkage errors on the power and bias of different\nanalysis methods designed to assess Covid-19 vaccine safety when using German\nhealth-care data with a Monte-Carlo simulation study. We used a discrete-time\nsimulation and empirical data to generate realistic data with varying amounts\nof record-linkage errors. Afterwards, we analysed this data using a Cox model\nand the self-controlled case series (SCCS) method. Realistic proportions of\nrandom linkage errors only had little effect on the power of either method. The\nSCCS method produced unbiased results even with a high percentage of linkage\nerrors, while the Cox model underestimated the true effect."}, "http://arxiv.org/abs/2310.15069": {"title": "Second-order group knockoffs with applications to GWAS", "link": "http://arxiv.org/abs/2310.15069", "description": "Conditional testing via the knockoff framework allows one to identify --\namong large number of possible explanatory variables -- those that carry unique\ninformation about an outcome of interest, and also provides a false discovery\nrate guarantee on the selection. This approach is particularly well suited to\nthe analysis of genome wide association studies (GWAS), which have the goal of\nidentifying genetic variants which influence traits of medical relevance.\n\nWhile conditional testing can be both more powerful and precise than\ntraditional GWAS analysis methods, its vanilla implementation encounters a\ndifficulty common to all multivariate analysis methods: it is challenging to\ndistinguish among multiple, highly correlated regressors. This impasse can be\novercome by shifting the object of inference from single variables to groups of\ncorrelated variables. To achieve this, it is necessary to construct \"group\nknockoffs.\" While successful examples are already documented in the literature,\nthis paper substantially expands the set of algorithms and software for group\nknockoffs. We focus in particular on second-order knockoffs, for which we\ndescribe correlation matrix approximations that are appropriate for GWAS data\nand that result in considerable computational savings. 
We illustrate the\neffectiveness of the proposed methods with simulations and with the analysis of\nalbuminuria data from the UK Biobank.\n\nThe described algorithms are implemented in an open-source Julia package\nKnockoffs.jl, for which both R and Python wrappers are available."}, "http://arxiv.org/abs/2310.15070": {"title": "Improving estimation efficiency of case-cohort study with interval-censored failure time data", "link": "http://arxiv.org/abs/2310.15070", "description": "The case-cohort design is a commonly used cost-effective sampling strategy\nfor large cohort studies, where some covariates are expensive to measure or\nobtain. In this paper, we consider regression analysis under a case-cohort\nstudy with interval-censored failure time data, where the failure time is only\nknown to fall within an interval instead of being exactly observed. A common\napproach to analyze data from a case-cohort study is the inverse probability\nweighting approach, where only subjects in the case-cohort sample are used in\nestimation, and the subjects are weighted based on the probability of inclusion\ninto the case-cohort sample. This approach, though consistent, is generally\ninefficient as it does not incorporate information outside the case-cohort\nsample. To improve efficiency, we first develop a sieve maximum weighted\nlikelihood estimator under the Cox model based on the case-cohort sample, and\nthen propose a procedure to update this estimator by using information in the\nfull cohort. We show that the update estimator is consistent, asymptotically\nnormal, and more efficient than the original estimator. The proposed method can\nflexibly incorporate auxiliary variables to further improve estimation\nefficiency. We employ a weighted bootstrap procedure for variance estimation.\nSimulation results indicate that the proposed method works well in practical\nsituations. A real study on diabetes is provided for illustration."}, "http://arxiv.org/abs/2310.15108": {"title": "Evaluating machine learning models in non-standard settings: An overview and new findings", "link": "http://arxiv.org/abs/2310.15108", "description": "Estimating the generalization error (GE) of machine learning models is\nfundamental, with resampling methods being the most common approach. However,\nin non-standard settings, particularly those where observations are not\nindependently and identically distributed, resampling using simple random data\ndivisions may lead to biased GE estimates. This paper strives to present\nwell-grounded guidelines for GE estimation in various such non-standard\nsettings: clustered data, spatial data, unequal sampling probabilities, concept\ndrift, and hierarchically structured outcomes. Our overview combines\nwell-established methodologies with other existing methods that, to our\nknowledge, have not been frequently considered in these particular settings. A\nunifying principle among these techniques is that the test data used in each\niteration of the resampling procedure should reflect the new observations to\nwhich the model will be applied, while the training data should be\nrepresentative of the entire data set used to obtain the final model. Beyond\nproviding an overview, we address literature gaps by conducting simulation\nstudies. These studies assess the necessity of using GE-estimation methods\ntailored to the respective setting. 
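One guideline from the overview above, sketched for clustered data: resample whole clusters so that test folds mimic genuinely new clusters. The simulated cluster structure and the random-forest model are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(7)
n_clusters, per_cluster = 40, 25
groups = np.repeat(np.arange(n_clusters), per_cluster)
cluster_covariate = rng.normal(size=n_clusters)[groups]     # constant within each cluster
cluster_effect = rng.normal(0, 2, size=n_clusters)[groups]  # cluster-level signal, unrelated to it
X = np.column_stack([rng.normal(size=len(groups)), cluster_covariate])
y = X[:, 0] + cluster_effect + rng.normal(size=len(groups))

model = RandomForestRegressor(n_estimators=100, random_state=0)
naive = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=5))
# Naive CV lets the forest memorize cluster effects via the cluster-level covariate,
# so its score is optimistic; grouped CV reflects performance on new clusters.
print(round(float(naive.mean()), 3), round(float(grouped.mean()), 3))
```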
Our findings corroborate the concern that\nstandard resampling methods often yield biased GE estimates in non-standard\nsettings, underscoring the importance of tailored GE estimation."}, "http://arxiv.org/abs/2310.15124": {"title": "Mixed-Variable Global Sensitivity Analysis For Knowledge Discovery And Efficient Combinatorial Materials Design", "link": "http://arxiv.org/abs/2310.15124", "description": "Global Sensitivity Analysis (GSA) is the study of the influence of any given\ninputs on the outputs of a model. In the context of engineering design, GSA has\nbeen widely used to understand both individual and collective contributions of\ndesign variables on the design objectives. So far, global sensitivity studies\nhave often been limited to design spaces with only quantitative (numerical)\ndesign variables. However, many engineering systems also contain, if not only,\nqualitative (categorical) design variables in addition to quantitative design\nvariables. In this paper, we integrate Latent Variable Gaussian Process (LVGP)\nwith Sobol' analysis to develop the first metamodel-based mixed-variable GSA\nmethod. Through numerical case studies, we validate and demonstrate the\neffectiveness of our proposed method for mixed-variable problems. Furthermore,\nwhile the proposed GSA method is general enough to benefit various engineering\ndesign applications, we integrate it with multi-objective Bayesian optimization\n(BO) to create a sensitivity-aware design framework in accelerating the Pareto\nfront design exploration for metal-organic framework (MOF) materials with\nmany-level combinatorial design spaces. Although MOFs are constructed only from\nqualitative variables that are notoriously difficult to design, our method can\nutilize sensitivity analysis to navigate the optimization in the many-level\nlarge combinatorial design space, greatly expediting the exploration of novel\nMOF candidates."}, "http://arxiv.org/abs/2003.04433": {"title": "Least Squares Estimation of a Quasiconvex Regression Function", "link": "http://arxiv.org/abs/2003.04433", "description": "We develop a new approach for the estimation of a multivariate function based\non the economic axioms of quasiconvexity (and monotonicity). On the\ncomputational side, we prove the existence of the quasiconvex constrained least\nsquares estimator (LSE) and provide a characterization of the function space to\ncompute the LSE via a mixed integer quadratic programme. On the theoretical\nside, we provide finite sample risk bounds for the LSE via a sharp oracle\ninequality. Our results allow for errors to depend on the covariates and to\nhave only two finite moments. We illustrate the superior performance of the LSE\nagainst some competing estimators via simulation. Finally, we use the LSE to\nestimate the production function for the Japanese plywood industry and the cost\nfunction for hospitals across the US."}, "http://arxiv.org/abs/2008.10296": {"title": "Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison", "link": "http://arxiv.org/abs/2008.10296", "description": "Leave-one-out cross-validation (LOO-CV) is a popular method for comparing\nBayesian models based on their estimated predictive performance on new, unseen,\ndata. As leave-one-out cross-validation is based on finite observed data, there\nis uncertainty about the expected predictive performance on new data. 
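A minimal sketch of the quantity under study when comparing two models with LOO-CV: the pointwise elpd differences, their sum, and the commonly used normal-approximation standard error. The pointwise values below are simulated placeholders for the output of, e.g., PSIS-LOO.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 300
elpd_a = rng.normal(-1.00, 0.8, size=n)             # pointwise LOO log predictive densities, model A
elpd_b = elpd_a + rng.normal(-0.05, 0.3, size=n)    # model B, slightly worse on average

d = elpd_a - elpd_b                                  # pointwise differences
elpd_diff = d.sum()
se_diff = np.sqrt(n * d.var(ddof=1))                 # normal-approximation standard error
print(round(float(elpd_diff), 2), round(float(se_diff), 2))
print(round(float(norm.cdf(elpd_diff / se_diff)), 3))  # approx. P(model A predicts better)
# The paper's caution: this SE can be unreliable when the models make similar
# predictions, are misspecified, or when n is small.
```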
By\nmodeling this uncertainty when comparing two models, we can compute the\nprobability that one model has a better predictive performance than the other.\nModeling this uncertainty well is not trivial, and for example, it is known\nthat the commonly used standard error estimate is often too small. We study the\nproperties of the Bayesian LOO-CV estimator and the related uncertainty\nestimates when comparing two models. We provide new results of the properties\nboth theoretically in the linear regression case and empirically for multiple\ndifferent models and discuss the challenges of modeling the uncertainty. We\nshow that problematic cases include: comparing models with similar predictions,\nmisspecified models, and small data. In these cases, there is a weak connection\nin the skewness of the individual leave-one-out terms and the distribution of\nthe error of the Bayesian LOO-CV estimator. We show that it is possible that\nthe problematic skewness of the error distribution, which occurs when the\nmodels make similar predictions, does not fade away when the data size grows to\ninfinity in certain situations. Based on the results, we also provide practical\nrecommendations for the users of Bayesian LOO-CV for model comparison."}, "http://arxiv.org/abs/2105.04981": {"title": "Uncovering patterns for adverse pregnancy outcomes with a Bayesian spatial model: Evidence from Philadelphia", "link": "http://arxiv.org/abs/2105.04981", "description": "We introduce a Bayesian conditional autoregressive model for analyzing\npatient-specific and neighborhood risks of stillbirth and preterm birth within\na city. Our fully Bayesian approach automatically learns the amount of spatial\nheterogeneity and spatial dependence between neighborhoods. Our model provides\nmeaningful inferences and uncertainty quantification for both covariate effects\nand neighborhood risk probabilities through their posterior distributions. We\napply our methodology to data from the city of Philadelphia. Using electronic\nhealth records (45,919 deliveries at hospitals within the University of\nPennsylvania Health System) and United States Census Bureau data from 363\ncensus tracts in Philadelphia, we find that both patient-level characteristics\n(e.g. self-identified race/ethnicity) and neighborhood-level characteristics\n(e.g. violent crime) are highly associated with patients' odds of stillbirth or\npreterm birth. Our neighborhood risk analysis further reveals that census\ntracts in West Philadelphia and North Philadelphia are at highest risk of these\noutcomes. Specifically, neighborhoods with higher rates of women in poverty or\non public assistance have greater neighborhood risk for these outcomes, while\nneighborhoods with higher rates of college-educated women or women in the labor\nforce have lower risk. Our findings could be useful for targeted individual and\nneighborhood interventions."}, "http://arxiv.org/abs/2107.07317": {"title": "Nonparametric Statistical Inference via Metric Distribution Function in Metric Spaces", "link": "http://arxiv.org/abs/2107.07317", "description": "Distribution function is essential in statistical inference, and connected\nwith samples to form a directed closed loop by the correspondence theorem in\nmeasure theory and the Glivenko-Cantelli and Donsker properties. This\nconnection creates a paradigm for statistical inference. However, existing\ndistribution functions are defined in Euclidean spaces and no longer convenient\nto use in rapidly evolving data objects of complex nature. 
It is imperative to\ndevelop the concept of distribution function in a more general space to meet\nemerging needs. Note that the linearity allows us to use hypercubes to define\nthe distribution function in a Euclidean space, but without the linearity in a\nmetric space, we must work with the metric to investigate the probability\nmeasure. We introduce a class of metric distribution functions through the\nmetric between random objects and a fixed location in metric spaces. We\novercome this challenging step by proving the correspondence theorem and the\nGlivenko-Cantelli theorem for metric distribution functions in metric spaces\nthat lie the foundation for conducting rational statistical inference for\nmetric space-valued data. Then, we develop homogeneity test and mutual\nindependence test for non-Euclidean random objects, and present comprehensive\nempirical evidence to support the performance of our proposed methods."}, "http://arxiv.org/abs/2109.03694": {"title": "Parameterizing and Simulating from Causal Models", "link": "http://arxiv.org/abs/2109.03694", "description": "Many statistical problems in causal inference involve a probability\ndistribution other than the one from which data are actually observed; as an\nadditional complication, the object of interest is often a marginal quantity of\nthis other probability distribution. This creates many practical complications\nfor statistical inference, even where the problem is non-parametrically\nidentified. In particular, it is difficult to perform likelihood-based\ninference, or even to simulate from the model in a general way.\n\nWe introduce the `frugal parameterization', which places the causal effect of\ninterest at its centre, and then builds the rest of the model around it. We do\nthis in a way that provides a recipe for constructing a regular, non-redundant\nparameterization using causal quantities of interest. In the case of discrete\nvariables we can use odds ratios to complete the parameterization, while in the\ncontinuous case copulas are the natural choice; other possibilities are also\ndiscussed.\n\nOur methods allow us to construct and simulate from models with\nparametrically specified causal distributions, and fit them using\nlikelihood-based methods, including fully Bayesian approaches. Our proposal\nincludes parameterizations for the average causal effect and effect of\ntreatment on the treated, as well as other causal quantities of interest."}, "http://arxiv.org/abs/2202.09534": {"title": "Locally Adaptive Spatial Quantile Smoothing: Application to Monitoring Crime Density in Tokyo", "link": "http://arxiv.org/abs/2202.09534", "description": "Spatial trend estimation under potential heterogeneity is an important\nproblem to extract spatial characteristics and hazards such as criminal\nactivity. By focusing on quantiles, which provide substantial information on\ndistributions compared with commonly used summary statistics such as means, it\nis often useful to estimate not only the average trend but also the high (low)\nrisk trend additionally. In this paper, we propose a Bayesian quantile trend\nfiltering method to estimate the non-stationary trend of quantiles on graphs\nand apply it to crime data in Tokyo between 2013 and 2017. By modeling multiple\nobservation cases, we can estimate the potential heterogeneity of spatial crime\ntrends over multiple years in the application. To induce locally adaptive\nBayesian inference on trends, we introduce general shrinkage priors for graph\ndifferences. 
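A minimal sketch of an empirical version of the metric distribution function introduced above: for an anchor point u and radius r, estimate F(u, r) = P(d(X, u) <= r) by the fraction of sample points within distance r of u. The Euclidean metric and data here are placeholder assumptions; any metric could be plugged in.

```python
import numpy as np

def empirical_mdf(sample, anchor, radii, metric):
    """Fraction of sample points within each radius of the anchor point."""
    dists = np.array([metric(x, anchor) for x in sample])
    return np.array([np.mean(dists <= r) for r in radii])

rng = np.random.default_rng(9)
sample = rng.normal(size=(1000, 3))                  # stand-in for metric-space-valued data
anchor = np.zeros(3)
radii = np.linspace(0.0, 4.0, 9)
euclid = lambda a, b: float(np.linalg.norm(a - b))   # any metric d(., .) could replace this
print(empirical_mdf(sample, anchor, radii, euclid).round(3))  # nondecreasing, 0 -> about 1
```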
Introducing so-called shadow priors with multivariate distribution\nfor local scale parameters and mixture representation of the asymmetric Laplace\ndistribution, we provide a simple Gibbs sampling algorithm to generate\nposterior samples. The numerical performance of the proposed method is\ndemonstrated through simulation studies."}, "http://arxiv.org/abs/2203.16710": {"title": "Detecting Treatment Interference under the K-Nearest-Neighbors Interference Model", "link": "http://arxiv.org/abs/2203.16710", "description": "We propose a model of treatment interference where the response of a unit\ndepends only on its treatment status and the statuses of units within its\nK-neighborhood. Current methods for detecting interference include carefully\ndesigned randomized experiments and conditional randomization tests on a set of\nfocal units. We give guidance on how to choose focal units under this model of\ninterference. We then conduct a simulation study to evaluate the efficacy of\nexisting methods for detecting network interference. We show that this choice\nof focal units leads to powerful tests of treatment interference which\noutperform current experimental methods."}, "http://arxiv.org/abs/2206.00646": {"title": "Importance sampling for stochastic reaction-diffusion equations in the moderate deviation regime", "link": "http://arxiv.org/abs/2206.00646", "description": "We develop a provably efficient importance sampling scheme that estimates\nexit probabilities of solutions to small-noise stochastic reaction-diffusion\nequations from scaled neighborhoods of a stable equilibrium. The moderate\ndeviation scaling allows for a local approximation of the nonlinear dynamics by\ntheir linearized version. In addition, we identify a finite-dimensional\nsubspace where exits take place with high probability. Using stochastic control\nand variational methods we show that our scheme performs well both in the zero\nnoise limit and pre-asymptotically. Simulation studies for stochastically\nperturbed bistable dynamics illustrate the theoretical results."}, "http://arxiv.org/abs/2206.12084": {"title": "Functional Mixed Membership Models", "link": "http://arxiv.org/abs/2206.12084", "description": "Mixed membership models, or partial membership models, are a flexible\nunsupervised learning method that allows each observation to belong to multiple\nclusters. In this paper, we propose a Bayesian mixed membership model for\nfunctional data. By using the multivariate Karhunen-Lo\\`eve theorem, we are\nable to derive a scalable representation of Gaussian processes that maintains\ndata-driven learning of the covariance structure. Within this framework, we\nestablish conditional posterior consistency given a known feature allocation\nmatrix. Compared to previous work on mixed membership models, our proposal\nallows for increased modeling flexibility, with the benefit of a directly\ninterpretable mean and covariance structure. Our work is motivated by studies\nin functional brain imaging through electroencephalography (EEG) of children\nwith autism spectrum disorder (ASD). In this context, our work formalizes the\nclinical notion of \"spectrum\" in terms of feature membership proportions."}, "http://arxiv.org/abs/2208.07614": {"title": "Reweighting the RCT for generalization: finite sample error and variable selection", "link": "http://arxiv.org/abs/2208.07614", "description": "Randomized Controlled Trials (RCTs) may suffer from limited scope. 
In\nparticular, samples may be unrepresentative: some RCTs over- or under- sample\nindividuals with certain characteristics compared to the target population, for\nwhich one wants conclusions on treatment effectiveness. Re-weighting trial\nindividuals to match the target population can improve the treatment effect\nestimation. In this work, we establish the exact expressions of the bias and\nvariance of such reweighting procedures -- also called Inverse Propensity of\nSampling Weighting (IPSW) -- in presence of categorical covariates for any\nsample size. Such results allow us to compare the theoretical performance of\ndifferent versions of IPSW estimates. Besides, our results show how the\nperformance (bias, variance, and quadratic risk) of IPSW estimates depends on\nthe two sample sizes (RCT and target population). A by-product of our work is\nthe proof of consistency of IPSW estimates. Results also reveal that IPSW\nperformances are improved when the trial probability to be treated is estimated\n(rather than using its oracle counterpart). In addition, we study choice of\nvariables: how including covariates that are not necessary for identifiability\nof the causal effect may impact the asymptotic variance. Including covariates\nthat are shifted between the two samples but not treatment effect modifiers\nincreases the variance while non-shifted but treatment effect modifiers do not.\nWe illustrate all the takeaways in a didactic example, and on a semi-synthetic\nsimulation inspired from critical care medicine."}, "http://arxiv.org/abs/2209.15448": {"title": "Blessing from Human-AI Interaction: Super Reinforcement Learning in Confounded Environments", "link": "http://arxiv.org/abs/2209.15448", "description": "As AI becomes more prevalent throughout society, effective methods of\nintegrating humans and AI systems that leverage their respective strengths and\nmitigate risk have become an important priority. In this paper, we introduce\nthe paradigm of super reinforcement learning that takes advantage of Human-AI\ninteraction for data driven sequential decision making. This approach utilizes\nthe observed action, either from AI or humans, as input for achieving a\nstronger oracle in policy learning for the decision maker (humans or AI). In\nthe decision process with unmeasured confounding, the actions taken by past\nagents can offer valuable insights into undisclosed information. By including\nthis information for the policy search in a novel and legitimate manner, the\nproposed super reinforcement learning will yield a super-policy that is\nguaranteed to outperform both the standard optimal policy and the behavior one\n(e.g., past agents' actions). We call this stronger oracle a blessing from\nhuman-AI interaction. Furthermore, to address the issue of unmeasured\nconfounding in finding super-policies using the batch data, a number of\nnonparametric and causal identifications are established. Building upon on\nthese novel identification results, we develop several super-policy learning\nalgorithms and systematically study their theoretical properties such as\nfinite-sample regret guarantee. 
Finally, we illustrate the effectiveness of our\nproposal through extensive simulations and real-world applications."}, "http://arxiv.org/abs/2212.06906": {"title": "Flexible Regularized Estimation in High-Dimensional Mixed Membership Models", "link": "http://arxiv.org/abs/2212.06906", "description": "Mixed membership models are an extension of finite mixture models, where each\nobservation can partially belong to more than one mixture component. A\nprobabilistic framework for mixed membership models of high-dimensional\ncontinuous data is proposed with a focus on scalability and interpretability.\nThe novel probabilistic representation of mixed membership is based on convex\ncombinations of dependent multivariate Gaussian random vectors. In this\nsetting, scalability is ensured through approximations of a tensor covariance\nstructure through multivariate eigen-approximations with adaptive\nregularization imposed through shrinkage priors. Conditional weak posterior\nconsistency is established on an unconstrained model, allowing for a simple\nposterior sampling scheme while keeping many of the desired theoretical\nproperties of our model. The model is motivated by two biomedical case studies:\na case study on functional brain imaging of children with autism spectrum\ndisorder (ASD) and a case study on gene expression data from breast cancer\ntissue. These applications highlight how the typical assumption made in cluster\nanalysis, that each observation comes from one homogeneous subgroup, may often\nbe restrictive in several applications, leading to unnatural interpretations of\ndata features."}, "http://arxiv.org/abs/2301.09020": {"title": "On the Role of Volterra Integral Equations in Self-Consistent, Product-Limit, Inverse Probability of Censoring Weighted, and Redistribution-to-the-Right Estimators for the Survival Function", "link": "http://arxiv.org/abs/2301.09020", "description": "This paper reconsiders several results of historical and current importance\nto nonparametric estimation of the survival distribution for failure in the\npresence of right-censored observation times, demonstrating in particular how\nVolterra integral equations of the first kind help inter-connect the resulting\nestimators. The paper begins by considering Efron's self-consistency equation,\nintroduced in a seminal 1967 Berkeley symposium paper. Novel insights provided\nin the current work include the observations that (i) the self-consistency\nequation leads directly to an anticipating Volterra integral equation of the\nfirst kind whose solution is given by a product-limit estimator for the\ncensoring survival function; (ii) a definition used in this argument\nimmediately establishes the familiar product-limit estimator for the failure\nsurvival function; (iii) the usual Volterra integral equation for the\nproduct-limit estimator of the failure survival function leads to an immediate\nand simple proof that it can be represented as an inverse probability of\ncensoring weighted estimator (i.e., under appropriate conditions). Finally, we\nshow that the resulting inverse probability of censoring weighted estimators,\nattributed to a highly influential 1992 paper of Robins and Rotnitzky, were\nimplicitly introduced in Efron's 1967 paper in its development of the\nredistribution-to-the-right algorithm. 
All results developed herein allow for\nties between failure and/or censored observations."}, "http://arxiv.org/abs/2302.01576": {"title": "ResMem: Learn what you can and memorize the rest", "link": "http://arxiv.org/abs/2302.01576", "description": "The impressive generalization performance of modern neural networks is\nattributed in part to their ability to implicitly memorize complex training\npatterns. Inspired by this, we explore a novel mechanism to improve model\ngeneralization via explicit memorization. Specifically, we propose the\nresidual-memorization (ResMem) algorithm, a new method that augments an\nexisting prediction model (e.g. a neural network) by fitting the model's\nresiduals with a $k$-nearest neighbor based regressor. The final prediction is\nthen the sum of the original model and the fitted residual regressor. By\nconstruction, ResMem can explicitly memorize the training labels. Empirically,\nwe show that ResMem consistently improves the test set generalization of the\noriginal prediction model across various standard vision and natural language\nprocessing benchmarks. Theoretically, we formulate a stylized linear regression\nproblem and rigorously show that ResMem results in a more favorable test risk\nover the base predictor."}, "http://arxiv.org/abs/2303.05032": {"title": "Sensitivity analysis for principal ignorability violation in estimating complier and noncomplier average causal effects", "link": "http://arxiv.org/abs/2303.05032", "description": "An important strategy for identifying principal causal effects, which are\noften used in settings with noncompliance, is to invoke the principal\nignorability (PI) assumption. As PI is untestable, it is important to gauge how\nsensitive effect estimates are to its violation. We focus on this task for the\ncommon one-sided noncompliance setting where there are two principal strata,\ncompliers and noncompliers. Under PI, compliers and noncompliers share the same\noutcome-mean-given-covariates function under the control condition. For\nsensitivity analysis, we allow this function to differ between compliers and\nnoncompliers in several ways, indexed by an odds ratio, a generalized odds\nratio, a mean ratio, or a standardized mean difference sensitivity parameter.\nWe tailor sensitivity analysis techniques (with any sensitivity parameter\nchoice) to several types of PI-based main analysis methods, including outcome\nregression, influence function (IF) based and weighting methods. We illustrate\nthe proposed sensitivity analyses using several outcome types from the JOBS II\nstudy. This application estimates nuisance functions parametrically -- for\nsimplicity and accessibility. In addition, we establish rate conditions on\nnonparametric nuisance estimation for IF-based estimators to be asymptotically\nnormal -- with a view to inform nonparametric inference."}, "http://arxiv.org/abs/2304.13307": {"title": "A Statistical Interpretation of the Maximum Subarray Problem", "link": "http://arxiv.org/abs/2304.13307", "description": "Maximum subarray is a classical problem in computer science that given an\narray of numbers aims to find a contiguous subarray with the largest sum. We\nfocus on its use for a noisy statistical problem of localizing an interval with\na mean different from background. While a naive application of maximum subarray\nfails at this task, both a penalized and a constrained version can succeed. 
We\nshow that the penalized version can be derived for common exponential family\ndistributions, in a manner similar to the change-point detection literature,\nand we interpret the resulting optimal penalty value. The failure of the naive\nformulation is then explained by an analysis of the estimated interval\nboundaries. Experiments further quantify the effect of deviating from the\noptimal penalty. We also relate the penalized and constrained formulations and\nshow that the solutions to the former lie on the convex hull of the solutions\nto the latter."}, "http://arxiv.org/abs/2305.10637": {"title": "Conformalized matrix completion", "link": "http://arxiv.org/abs/2305.10637", "description": "Matrix completion aims to estimate missing entries in a data matrix, using\nthe assumption of a low-complexity structure (e.g., low rank) so that\nimputation is possible. While many effective estimation algorithms exist in the\nliterature, uncertainty quantification for this problem has proved to be\nchallenging, and existing methods are extremely sensitive to model\nmisspecification. In this work, we propose a distribution-free method for\npredictive inference in the matrix completion problem. Our method adapts the\nframework of conformal prediction, which provides confidence intervals with\nguaranteed distribution-free validity in the setting of regression, to the\nproblem of matrix completion. Our resulting method, conformalized matrix\ncompletion (cmc), offers provable predictive coverage regardless of the\naccuracy of the low-rank model. Empirical results on simulated and real data\ndemonstrate that cmc is robust to model misspecification while matching the\nperformance of existing model-based methods when the model is correct."}, "http://arxiv.org/abs/2305.15027": {"title": "A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods", "link": "http://arxiv.org/abs/2305.15027", "description": "We establish the first mathematically rigorous link between Bayesian,\nvariational Bayesian, and ensemble methods. A key step towards this is to\nreformulate the non-convex optimisation problem typically encountered in deep\nlearning as a convex optimisation in the space of probability measures. On a\ntechnical level, our contribution amounts to studying generalised variational\ninference through the lens of Wasserstein gradient flows. The result is a\nunified theory of various seemingly disconnected approaches that are commonly\nused for uncertainty quantification in deep learning -- including deep\nensembles and (variational) Bayesian methods. This offers a fresh perspective\non the reasons behind the success of deep ensembles over procedures based on\nparameterised variational inference, and allows the derivation of new\nensembling schemes with convergence guarantees. We showcase this by proposing a\nfamily of interacting deep ensembles with direct parallels to the interactions\nof particle systems in thermodynamics, and use our theory to prove the\nconvergence of these algorithms to a well-defined global minimiser on the space\nof probability measures."}, "http://arxiv.org/abs/2308.05858": {"title": "Inconsistency and Acausality of Model Selection in Bayesian Inverse Problems", "link": "http://arxiv.org/abs/2308.05858", "description": "Bayesian inference paradigms are regarded as powerful tools for the solution of\ninverse problems. However, when applied to inverse problems in the physical\nsciences, Bayesian formulations suffer from a number of inconsistencies that\nare often overlooked. 
A well-known, but mostly neglected, difficulty is\nconnected to the notion of conditional probability densities. Borel, and later\nKolmogorov (1933/1956), found that the traditional definition of conditional\ndensities is incomplete: In different parameterizations it leads to different\nresults. We will show an example where two apparently correct procedures\napplied to the same problem lead to two widely different results. Another type\nof inconsistency involves violation of causality. This problem is found in\nmodel selection strategies in Bayesian inversion, such as Hierarchical Bayes\nand Trans-Dimensional Inversion where so-called hyperparameters are included as\nvariables to control either the number (or type) of unknowns, or the prior\nuncertainties on data or model parameters. For Hierarchical Bayes we\ndemonstrate that the calculated 'prior' distributions of data or model\nparameters are not prior, but posterior, information. In fact, the calculated\n'standard deviations' of the data are a measure of the inability of the forward\nfunction to model the data, rather than uncertainties of the data. For\ntrans-dimensional inverse problems we show that the so-called evidence is, in\nfact, not a measure of the success of fitting the data for the given choice (or\nnumber) of parameters, as often claimed. We also find that the notion of\nNatural Parsimony is ill-defined, because of its dependence on the parameter\nprior. Based on this study, we find that careful rethinking of Bayesian\ninversion practices is required, with special emphasis on ways of avoiding the\nBorel-Kolmogorov inconsistency, and on the way we interpret model selection\nresults."}, "http://arxiv.org/abs/2310.15266": {"title": "Causal progress with imperfect placebo treatments and outcomes", "link": "http://arxiv.org/abs/2310.15266", "description": "In the quest to make defensible causal claims from observational data, it is\nsometimes possible to leverage information from \"placebo treatments\" and\n\"placebo outcomes\" (or \"negative outcome controls\"). Existing approaches\nemploying such information focus largely on point identification and assume (i)\n\"perfect placebos\", meaning placebo treatments have precisely zero effect on\nthe outcome and the real treatment has precisely zero effect on a placebo\noutcome; and (ii) \"equiconfounding\", meaning that the treatment-outcome\nrelationship where one is a placebo suffers the same amount of confounding as\ndoes the real treatment-outcome relationship, on some scale. We instead\nconsider an omitted variable bias framework, in which users can postulate\nnon-zero effects of placebo treatment on real outcomes or of real treatments on\nplacebo outcomes, and the relative strengths of confounding suffered by a\nplacebo treatment/outcome compared to the true treatment-outcome relationship.\nOnce postulated, these assumptions identify or bound the linear estimates of\ntreatment effects. While applicable in many settings, one ubiquitous use-case\nfor this approach is to employ pre-treatment outcomes as (perfect) placebo\noutcomes. In this setting, the parallel trends assumption of\ndifference-in-difference is in fact a strict equiconfounding assumption on a\nparticular scale, which can be relaxed in our framework. 
Finally, we\ndemonstrate the use of our framework with two applications, employing an R\npackage that implements these approaches."}, "http://arxiv.org/abs/2310.15333": {"title": "Estimating Trustworthy and Safe Optimal Treatment Regimes", "link": "http://arxiv.org/abs/2310.15333", "description": "Recent statistical and reinforcement learning methods have significantly\nadvanced patient care strategies. However, these approaches face substantial\nchallenges in high-stakes contexts, including missing data, inherent\nstochasticity, and the critical requirements for interpretability and patient\nsafety. Our work operationalizes a safe and interpretable framework to identify\noptimal treatment regimes. This approach involves matching patients with\nsimilar medical and pharmacological characteristics, allowing us to construct\nan optimal policy via interpolation. We perform a comprehensive simulation\nstudy to demonstrate the framework's ability to identify optimal policies even\nin complex settings. Ultimately, we operationalize our approach to study\nregimes for treating seizures in critically ill patients. Our findings strongly\nsupport personalized treatment strategies based on a patient's medical history\nand pharmacological features. Notably, we identify that reducing medication\ndoses for patients with mild and brief seizure episodes while adopting\naggressive treatment for patients in intensive care unit experiencing intense\nseizures leads to more favorable outcomes."}, "http://arxiv.org/abs/2310.15459": {"title": "Strategies to mitigate bias from time recording errors in pharmacokinetic studies", "link": "http://arxiv.org/abs/2310.15459", "description": "Opportunistic pharmacokinetic (PK) studies have sparse and imbalanced\nclinical measurement data, and the impact of sample time errors is an important\nconcern when seeking accurate estimates of treatment response. We evaluated an\napproximate Bayesian model for individualized pharmacokinetics in the presence\nof time recording errors (TREs), considering both a short and long infusion\ndosing pattern. We found that the long infusion schedule generally had lower\nbias in estimates of the pharmacodynamic (PD) endpoint relative to the short\ninfusion schedule. We investigated three different design strategies for their\nability to mitigate the impact of TREs: (i) shifting blood draws taken during\nan active infusion to the post-infusion period, (ii) identifying the best next\nsample time by minimizing bias in the presence of TREs, and (iii) collecting\nadditional information on a subset of patients based on estimate uncertainty or\nquadrature-estimated variance in the presence of TREs. Generally, the proposed\nstrategies led to a decrease in bias of the PD estimate for the short infusion\nschedule, but had a negligible impact for the long infusion schedule. Dosing\nregimens with periods of high non-linearity may benefit from design\nmodifications, while more stable concentration-time profiles are generally more\nrobust to TREs with no design modifications."}, "http://arxiv.org/abs/2310.15497": {"title": "Generalized Box-Cox method to estimate sample mean and standard deviation for Meta-analysis", "link": "http://arxiv.org/abs/2310.15497", "description": "Meta-analysis is the aggregation of data from multiple studies to find\npatterns across a broad range relating to a particular subject. It is becoming\nincreasingly useful to apply meta-analysis to summarize these studies being\ndone across various fields. 
In meta-analysis, it is common to use the mean and\nstandard deviation from each study to compare for analysis. While many studies\nreport the mean and standard deviation for their summary statistics, some report\nother values including the minimum, maximum, median, and first and third\nquantiles. Often, the quantiles and median are reported when the data are skewed\nand do not follow a normal distribution. In order to correctly summarize the\ndata and draw conclusions from multiple studies, it is necessary to estimate\nthe mean and standard deviation from each study, considering variation and\nskewness within each study. In past literature, methods have been proposed to\nestimate the mean and standard deviation, but they do not consider negative values.\nData that include negative values are common and would increase the accuracy\nand impact of the meta-analysis. We propose a method that implements a\ngeneralized Box-Cox transformation to estimate the mean and standard deviation,\naccounting for such negative values while maintaining similar accuracy."}, "http://arxiv.org/abs/2310.15877": {"title": "Regression analysis of multiplicative hazards model with time-dependent coefficient for sparse longitudinal covariates", "link": "http://arxiv.org/abs/2310.15877", "description": "We study the multiplicative hazards model with intermittently observed\nlongitudinal covariates and time-varying coefficients. For such models, the\nexisting {\\it ad hoc} approach, such as the last value carried forward, is\nbiased. We propose a kernel weighting approach to get an unbiased estimation of\nthe non-parametric coefficient function and establish asymptotic normality for\nany fixed time point. Furthermore, we construct the simultaneous confidence\nband to examine the overall magnitude of the variation. Simulation studies\nsupport our theoretical predictions and show favorable performance of the\nproposed method. A data set from cerebral infarction is used to illustrate our\nmethodology."}, "http://arxiv.org/abs/2310.15956": {"title": "Likelihood-Based Inference for Semi-Parametric Transformation Cure Models with Interval Censored Data", "link": "http://arxiv.org/abs/2310.15956", "description": "A simple yet effective way of modeling survival data with a cure fraction is by\nconsidering the Box-Cox transformation cure model (BCTM) that unifies mixture and\npromotion time cure models. In this article, we numerically study the\nstatistical properties of the BCTM when applied to interval censored data.\nTime-to-events associated with susceptible subjects are modeled through a\nproportional hazards structure that allows for non-homogeneity across subjects,\nwhere the baseline hazard function is estimated by a distribution-free piecewise\nlinear function with varied degrees of non-parametricity. Due to missing cured\nstatuses for right censored subjects, maximum likelihood estimates of model\nparameters are obtained by developing an expectation-maximization (EM)\nalgorithm. 
Under the EM framework, the conditional expectation of the complete\ndata log-likelihood function is maximized by considering all parameters\n(including the Box-Cox transformation parameter $\\alpha$) simultaneously, in\ncontrast to the conventional profile-likelihood technique of estimating $\\alpha$.\nThe robustness and accuracy of the model and estimation method are established\nthrough a detailed simulation study under various parameter settings, and an\nanalysis of real-life data obtained from a smoking cessation study."}, "http://arxiv.org/abs/1901.04916": {"title": "Pairwise accelerated failure time regression models for infectious disease transmission in close-contact groups with external sources of infection", "link": "http://arxiv.org/abs/1901.04916", "description": "Many important questions in infectious disease epidemiology involve the\neffects of covariates (e.g., age or vaccination status) on infectiousness and\nsusceptibility, which can be measured in studies of transmission in households\nor other close-contact groups. Because the transmission of disease produces\ndependent outcomes, these questions are difficult or impossible to address\nusing standard regression models from biostatistics. Pairwise survival analysis\nhandles dependent outcomes by calculating likelihoods in terms of contact\ninterval distributions in ordered pairs of individuals. The contact interval in\nthe ordered pair ij is the time from the onset of infectiousness in i to\ninfectious contact from i to j, where an infectious contact is sufficient to\ninfect j if they are susceptible. Here, we introduce a pairwise accelerated\nfailure time regression model for infectious disease transmission that allows\nthe rate parameter of the contact interval distribution to depend on\ninfectiousness covariates for i, susceptibility covariates for j, and pairwise\ncovariates. This model can simultaneously handle internal infections (caused by\ntransmission between individuals under observation) and external infections\n(caused by environmental or community sources of infection). In a simulation\nstudy, we show that these models produce valid point and interval estimates of\nparameters governing the contact interval distributions. We also explore the\nrole of epidemiologic study design and the consequences of model\nmisspecification. We use this regression model to analyze household data from\nLos Angeles County during the 2009 influenza A (H1N1) pandemic, where we find\nthat the ability to account for external sources of infection is critical to\nestimating the effect of antiviral prophylaxis."}, "http://arxiv.org/abs/2003.06416": {"title": "VCBART: Bayesian trees for varying coefficients", "link": "http://arxiv.org/abs/2003.06416", "description": "The linear varying coefficient model posits a linear relationship between an\noutcome and covariates in which the covariate effects are modeled as functions\nof additional effect modifiers. Despite a long history of study and use in\nstatistics and econometrics, state-of-the-art varying coefficient modeling\nmethods cannot accommodate multivariate effect modifiers without imposing\nrestrictive functional form assumptions or involving computationally intensive\nhyperparameter tuning. In response, we introduce VCBART, which flexibly\nestimates the covariate effect in a varying coefficient model using Bayesian\nAdditive Regression Trees. 
With simple default settings, VCBART outperforms\nexisting varying coefficient methods in terms of covariate effect estimation,\nuncertainty quantification, and outcome prediction. We illustrate the utility\nof VCBART with two case studies: one examining how the association between\nlater-life cognition and measures of socioeconomic position varies with respect\nto age and socio-demographics and another estimating how temporal trends in\nurban crime vary at the neighborhood level. An R package implementing VCBART is\navailable at https://github.com/skdeshpande91/VCBART"}, "http://arxiv.org/abs/2204.05870": {"title": "How much of the past matters? Using dynamic survival models for the monitoring of potassium in heart failure patients using electronic health records", "link": "http://arxiv.org/abs/2204.05870", "description": "Statistical methods to study the association between a longitudinal biomarker\nand the risk of death are very relevant for the long-term care of subjects\naffected by chronic illnesses, such as potassium in heart failure patients.\nParticularly in the presence of comorbidities or pharmacological treatments,\nsudden crises can cause potassium to undergo very abrupt yet transient changes.\nIn the context of the monitoring of potassium, there is a need for a dynamic\nmodel that can be used in clinical practice to assess the risk of death related\nto an observed patient's potassium trajectory. We considered different dynamic\nsurvival approaches, starting from the simple approach considering the most\nrecent measurement, to the joint model. We then propose a novel method based on\nwavelet filtering and landmarking to retrieve the prognostic role of past\nshort-term potassium shifts. We argue that while taking into account past\ninformation is important, not all past information is equally informative.\nState-of-the-art dynamic survival models are prone to give more importance to\nthe mean long-term value of potassium. However, our findings suggest that it is\nessential to also take into account recent potassium instability to capture all\nthe relevant prognostic information. The data used come from over 2000\nsubjects, with a total of over 80 000 repeated potassium measurements collected\nthrough Administrative Health Records and Outpatient and Inpatient Clinic\nE-charts. A novel dynamic survival approach is proposed in this work for the\nmonitoring of potassium in heart failure. The proposed wavelet landmark method\nshows promising results revealing the prognostic role of past short-term\nchanges, according to their different duration, and achieving higher\nperformance in predicting the survival probability of individuals."}, "http://arxiv.org/abs/2212.09494": {"title": "Optimal Treatment Regimes for Proximal Causal Learning", "link": "http://arxiv.org/abs/2212.09494", "description": "A common concern when a policymaker draws causal inferences from and makes\ndecisions based on observational data is that the measured covariates are\ninsufficiently rich to account for all sources of confounding, i.e., the\nstandard unconfoundedness assumption fails to hold. The recently proposed\nproximal causal inference framework shows that proxy variables that abound in\nreal-life scenarios can be leveraged to identify causal effects and therefore\nfacilitate decision-making. Building upon this line of work, we propose a novel\noptimal individualized treatment regime based on so-called outcome and\ntreatment confounding bridges. 
We then show that the value function of this new\noptimal treatment regime is superior to that of existing ones in the\nliterature. Theoretical guarantees, including identification, superiority,\nexcess value bound, and consistency of the estimated regime, are established.\nFurthermore, we demonstrate the proposed optimal regime via numerical\nexperiments and a real data application."}, "http://arxiv.org/abs/2302.07294": {"title": "Derandomized Novelty Detection with FDR Control via Conformal E-values", "link": "http://arxiv.org/abs/2302.07294", "description": "Conformal inference provides a general distribution-free method to rigorously\ncalibrate the output of any machine learning algorithm for novelty detection.\nWhile this approach has many strengths, it has the limitation of being\nrandomized, in the sense that it may lead to different results when analyzing\ntwice the same data, and this can hinder the interpretation of any findings. We\npropose to make conformal inferences more stable by leveraging suitable\nconformal e-values instead of p-values to quantify statistical significance.\nThis solution allows the evidence gathered from multiple analyses of the same\ndata to be aggregated effectively while provably controlling the false\ndiscovery rate. Further, we show that the proposed method can reduce randomness\nwithout much loss of power compared to standard conformal inference, partly\nthanks to an innovative way of weighting conformal e-values based on additional\nside information carefully extracted from the same data. Simulations with\nsynthetic and real data confirm this solution can be effective at eliminating\nrandom noise in the inferences obtained with state-of-the-art alternative\ntechniques, sometimes also leading to higher power."}, "http://arxiv.org/abs/2304.02127": {"title": "A Bayesian Collocation Integral Method for Parameter Estimation in Ordinary Differential Equations", "link": "http://arxiv.org/abs/2304.02127", "description": "Inferring the parameters of ordinary differential equations (ODEs) from noisy\nobservations is an important problem in many scientific fields. Currently, most\nparameter estimation methods that bypass numerical integration tend to rely on\nbasis functions or Gaussian processes to approximate the ODE solution and its\nderivatives. Due to the sensitivity of the ODE solution to its derivatives,\nthese methods can be hindered by estimation error, especially when only sparse\ntime-course observations are available. We present a Bayesian collocation\nframework that operates on the integrated form of the ODEs and also avoids the\nexpensive use of numerical solvers. Our methodology has the capability to\nhandle general nonlinear ODE systems. We demonstrate the accuracy of the\nproposed method through simulation studies, where the estimated parameters and\nrecovered system trajectories are compared with other recent methods. A real\ndata example is also provided."}, "http://arxiv.org/abs/2307.00127": {"title": "Large-scale Bayesian Structure Learning for Gaussian Graphical Models using Marginal Pseudo-likelihood", "link": "http://arxiv.org/abs/2307.00127", "description": "Bayesian methods for learning Gaussian graphical models offer a robust\nframework that addresses model uncertainty and incorporates prior knowledge.\nDespite their theoretical strengths, the applicability of Bayesian methods is\noften constrained by computational needs, especially in modern contexts\ninvolving thousands of variables. 
To overcome this issue, we introduce two\nnovel Markov chain Monte Carlo (MCMC) search algorithms that have a\nsignificantly lower computational cost than leading Bayesian approaches. Our\nproposed MCMC-based search algorithms use the marginal pseudo-likelihood\napproach to bypass the complexities of computing intractable normalizing\nconstants and iterative precision matrix sampling. These algorithms can deliver\nreliable results in mere minutes on standard computers, even for large-scale\nproblems with one thousand variables. Furthermore, our proposed method is\ncapable of addressing model uncertainty by efficiently exploring the full\nposterior graph space. Our simulation study indicates that the proposed\nalgorithms, particularly for large-scale sparse graphs, outperform the leading\nBayesian approaches in terms of computational efficiency and precision. The\nimplementation supporting the new approach is available through the R package\nBDgraph."}, "http://arxiv.org/abs/2307.09302": {"title": "Conformal prediction under ambiguous ground truth", "link": "http://arxiv.org/abs/2307.09302", "description": "Conformal Prediction (CP) allows to perform rigorous uncertainty\nquantification by constructing a prediction set $C(X)$ satisfying $\\mathbb{P}(Y\n\\in C(X))\\geq 1-\\alpha$ for a user-chosen $\\alpha \\in [0,1]$ by relying on\ncalibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\\mathbb{P}=\\mathbb{P}^{X}\n\\otimes \\mathbb{P}^{Y|X}$. It is typically implicitly assumed that\n$\\mathbb{P}^{Y|X}$ is the \"true\" posterior label distribution. However, in many\nreal-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating\nexpert opinions using a voting procedure, resulting in a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus\nw.r.t. $\\mathbb{P}_{vote}=\\mathbb{P}^X \\otimes \\mathbb{P}_{vote}^{Y|X}$ rather\nthan the true distribution $\\mathbb{P}$. In cases with unambiguous ground truth\nlabels, the distinction between $\\mathbb{P}_{vote}$ and $\\mathbb{P}$ is\nirrelevant. However, when experts do not agree because of ambiguous labels,\napproximating $\\mathbb{P}^{Y|X}$ with a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose\nto leverage expert opinions to approximate $\\mathbb{P}^{Y|X}$ using a\nnon-degenerate distribution $\\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP\nprocedures which provide guarantees w.r.t. $\\mathbb{P}_{agg}=\\mathbb{P}^X\n\\otimes \\mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels\nfrom $\\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a\ncase study of skin condition classification with significant disagreement among\nexpert annotators, we show that applying CP w.r.t. $\\mathbb{P}_{vote}$\nunder-covers expert annotations: calibrated for $72\\%$ coverage, it falls short\nby on average $10\\%$; our Monte Carlo CP closes this gap both empirically and\ntheoretically."}, "http://arxiv.org/abs/2310.16203": {"title": "Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework", "link": "http://arxiv.org/abs/2310.16203", "description": "Mediation analysis is an important analytic tool commonly used in a broad\nrange of scientific applications. In this article, we study the problem of\nmediation analysis when there are multivariate and conditionally dependent\nmediators, and when the variables are observed over multiple time points. 
The\nproblem is challenging, because the effect of a mediator involves not only the\npath from the treatment to this mediator itself at the current time point, but\nalso all possible paths pointed to this mediator from its upstream mediators,\nas well as the carryover effects from all previous time points. We propose a\nnovel multivariate dynamic mediation analysis approach. Drawing inspiration\nfrom the Markov decision process model that is frequently employed in\nreinforcement learning, we introduce a Markov mediation process paired with a\nsystem of time-varying linear structural equation models to formulate the\nproblem. We then formally define the individual mediation effect, built upon\nthe idea of simultaneous interventions and intervention calculus. We next\nderive the closed-form expression and propose an iterative estimation procedure\nunder the Markov mediation process model. We study both the asymptotic property\nand the empirical performance of the proposed estimator, and further illustrate\nour method with a mobile health application."}, "http://arxiv.org/abs/2310.16207": {"title": "Propensity score weighting plus an adjusted proportional hazards model does not equal doubly robust away from the null", "link": "http://arxiv.org/abs/2310.16207", "description": "Recently it has become common for applied works to combine commonly used\nsurvival analysis modeling methods, such as the multivariable Cox model, and\npropensity score weighting with the intention of forming a doubly robust\nestimator that is unbiased in large samples when either the Cox model or the\npropensity score model is correctly specified. This combination does not, in\ngeneral, produce a doubly robust estimator, even after regression\nstandardization, when there is truly a causal effect. We demonstrate via\nsimulation this lack of double robustness for the semiparametric Cox model, the\nWeibull proportional hazards model, and a simple proportional hazards flexible\nparametric model, with both the latter models fit via maximum likelihood. We\nprovide a novel proof that the combination of propensity score weighting and a\nproportional hazards survival model, fit either via full or partial likelihood,\nis consistent under the null of no causal effect of the exposure on the outcome\nunder particular censoring mechanisms if either the propensity score or the\noutcome model is correctly specified and contains all confounders. Given our\nresults suggesting that double robustness only exists under the null, we\noutline two simple alternative estimators that are doubly robust for the\nsurvival difference at a given time point (in the above sense), provided the\ncensoring mechanism can be correctly modeled, and one doubly robust method of\nestimation for the full survival curve. We provide R code to use these\nestimators for estimation and inference in the supplementary materials."}, "http://arxiv.org/abs/2310.16213": {"title": "Bayes factor functions", "link": "http://arxiv.org/abs/2310.16213", "description": "We describe Bayes factors functions based on z, t, $\\chi^2$, and F statistics\nand the prior distributions used to define alternative hypotheses. The\nnon-local alternative prior distributions are centered on standardized effects,\nwhich index the Bayes factor function. The prior densities include a dispersion\nparameter that models the variation of effect sizes across replicated\nexperiments. We examine the convergence rates of Bayes factor functions under\ntrue null and true alternative hypotheses. 
Several examples illustrate the\napplication of the Bayes factor functions to replicated experimental designs\nand compare the conclusions from these analyses to other default Bayes factor\nmethods."}, "http://arxiv.org/abs/2310.16256": {"title": "A Causal Disentangled Multi-Granularity Graph Classification Method", "link": "http://arxiv.org/abs/2310.16256", "description": "Graph data widely exists in real life, with large amounts of data and complex\nstructures. It is necessary to map graph data to low-dimensional embeddings.\nGraph classification, a critical graph task, mainly relies on identifying the\nimportant substructures within the graph. At present, some graph classification\nmethods do not combine the multi-granularity characteristics of graph data.\nThis lack of granularity distinction in modeling leads to a conflation of key\ninformation and false correlations within the model. As a result, achieving the desired\ngoal of a credible and interpretable model becomes challenging. This paper\nproposes a causal disentangled multi-granularity graph representation learning\nmethod (CDM-GNN) to solve this challenge. The CDM-GNN model disentangles the\nimportant substructures and bias parts within the graph from a\nmulti-granularity perspective. The disentanglement of the CDM-GNN model reveals\nimportant and bias parts, forming the foundation for its classification task,\nspecifically, model interpretations. The CDM-GNN model exhibits strong\nclassification performance and generates explanatory outcomes aligning with\nhuman cognitive patterns. In order to verify the effectiveness of the model,\nthis paper conducts comparisons on three real-world datasets: MUTAG, PTC, and IMDB-M. Six\nstate-of-the-art models, namely GCN, GAT, Top-k, ASAPool, SUGAR, and SAT, are\nemployed for comparison purposes. Additionally, a qualitative analysis of the\ninterpretation results is conducted."}, "http://arxiv.org/abs/2310.16260": {"title": "Private Estimation and Inference in High-Dimensional Regression with FDR Control", "link": "http://arxiv.org/abs/2310.16260", "description": "This paper presents novel methodologies for conducting practical\ndifferentially private (DP) estimation and inference in high-dimensional linear\nregression. We start by proposing a differentially private Bayesian Information\nCriterion (BIC) for selecting the unknown sparsity parameter in DP-Lasso,\neliminating the need for prior knowledge of model sparsity, a requisite in the\nexisting literature. Then we propose a differentially private debiased LASSO\nalgorithm that enables privacy-preserving inference on regression parameters.\nOur proposed method enables accurate and private inference on the regression\nparameters by leveraging the inherent sparsity of high-dimensional linear\nregression models. Additionally, we address the issue of multiple testing in\nhigh-dimensional linear regression by introducing a differentially private\nmultiple testing procedure that controls the false discovery rate (FDR). This\nallows for accurate and privacy-preserving identification of significant\npredictors in the regression model. 
Through extensive simulations and real data\nanalysis, we demonstrate the efficacy of our proposed methods in conducting\ninference for high-dimensional linear models while safeguarding privacy and\ncontrolling the FDR."}, "http://arxiv.org/abs/2310.16284": {"title": "Bayesian Image Mediation Analysis", "link": "http://arxiv.org/abs/2310.16284", "description": "Mediation analysis aims to separate the indirect effect through mediators\nfrom the direct effect of the exposure on the outcome. It is challenging to\nperform mediation analysis with neuroimaging data which involves high\ndimensionality, complex spatial correlations, sparse activation patterns and\nrelatively low signal-to-noise ratio. To address these issues, we develop a new\nspatially varying coefficient structural equation model for Bayesian Image\nMediation Analysis (BIMA). We define spatially varying mediation effects within\nthe potential outcome framework, employing the soft-thresholded Gaussian\nprocess prior for functional parameters. We establish the posterior consistency\nfor spatially varying mediation effects along with selection consistency on\nimportant regions that contribute to the mediation effects. We develop an\nefficient posterior computation algorithm scalable to analysis of large-scale\nimaging data. Through extensive simulations, we show that BIMA can improve the\nestimation accuracy and computational efficiency for high-dimensional mediation\nanalysis over the existing methods. We apply BIMA to analyze the behavioral and\nfMRI data in the Adolescent Brain Cognitive Development (ABCD) study with a\nfocus on inferring the mediation effects of the parental education level on the\nchildren's general cognitive ability that are mediated through the working\nmemory brain activities."}, "http://arxiv.org/abs/2310.16294": {"title": "Producer-Side Experiments Based on Counterfactual Interleaving Designs for Online Recommender Systems", "link": "http://arxiv.org/abs/2310.16294", "description": "Recommender systems have become an integral part of online platforms,\nproviding personalized suggestions for purchasing items, consuming contents,\nand connecting with individuals. An online recommender system consists of two\nsides of components: the producer side comprises product sellers, content\ncreators, or service providers, etc., and the consumer side includes buyers,\nviewers, or guests, etc. To optimize an online recommender system, A/B tests\nserve as the golden standard for comparing different ranking models and\nevaluating their impact on both the consumers and producers. While\nconsumer-side experiments are relatively straightforward to design and commonly\nused to gauge the impact of ranking changes on the behavior of consumers\n(buyers, viewers, etc.), designing producer-side experiments presents a\nconsiderable challenge because producer items in the treatment and control\ngroups need to be ranked by different models and then merged into a single\nranking for the recommender to show to each consumer. 
In this paper, we review\nissues with the existing methods, propose new design principles for\nproducer-side experiments, and develop a rigorous solution based on\ncounterfactual interleaving designs for accurately measuring the effects of\nranking changes on the producers (sellers, creators, etc.)."}, "http://arxiv.org/abs/2310.16466": {"title": "Learning Continuous Network Emerging Dynamics from Scarce Observations via Data-Adaptive Stochastic Processes", "link": "http://arxiv.org/abs/2310.16466", "description": "Learning network dynamics from the empirical structure and spatio-temporal\nobservation data is crucial to revealing the interaction mechanisms of complex\nnetworks in a wide range of domains. However, most existing methods only aim at\nlearning network dynamic behaviors generated by a specific ordinary\ndifferential equation instance, resulting in ineffectiveness for new ones, and\ngenerally require dense observations. The observed data, especially from\nnetwork emerging dynamics, are usually difficult to obtain, which complicates\nmodel learning. Therefore, how to learn accurate network dynamics\nwith sparse, irregularly-sampled, partial, and noisy observations remains a\nfundamental challenge. We introduce Neural ODE Processes for Network Dynamics\n(NDP4ND), a new class of stochastic processes governed by stochastic\ndata-adaptive network dynamics, to overcome the challenge and learn continuous\nnetwork dynamics from scarce observations. Intensive experiments conducted on\nvarious network dynamics in ecological population evolution, phototaxis\nmovement, brain activity, epidemic spreading, and real-world empirical systems\ndemonstrate that the proposed method has excellent data adaptability and\ncomputational efficiency, and can adapt to unseen network emerging dynamics,\nproducing accurate interpolation and extrapolation while reducing the ratio of\nrequired observation data to only about 6\\% and improving the learning speed\nfor new dynamics by three orders of magnitude."}, "http://arxiv.org/abs/2310.16489": {"title": "Latent event history models for quasi-reaction systems", "link": "http://arxiv.org/abs/2310.16489", "description": "Various processes can be modelled as quasi-reaction systems of stochastic\ndifferential equations, such as cell differentiation and disease spreading.\nSince the underlying data of particle interactions, such as reactions between\nproteins or contacts between people, are typically unobserved, statistical\ninference of the parameters driving these systems is developed from\nconcentration data measuring each unit in the system over time. While observing\nthe continuous time process at a time scale as fine as possible should in\ntheory help with parameter estimation, the existing Local Linear Approximation\n(LLA) methods fail in this case, due to numerical instability caused by small\nchanges of the system at successive time points. On the other hand, one may be\nable to reconstruct the underlying unobserved interactions from the observed\ncount data. Motivated by this, we first formalise the latent event history\nmodel underlying the observed count process. We then propose a computationally\nefficient Expectation-Maximisation algorithm for parameter estimation, with an\nextended Kalman filtering procedure for the prediction of the latent states. A\nsimulation study shows the performance of the proposed method and highlights\nthe settings where it is particularly advantageous compared to the existing LLA\napproaches. 
Finally, we present an illustration of the methodology on the\nspreading of the COVID-19 pandemic in Italy."}, "http://arxiv.org/abs/2310.16502": {"title": "Assessing the overall and partial causal well-specification of nonlinear additive noise models", "link": "http://arxiv.org/abs/2310.16502", "description": "We propose a method to detect model misspecifications in nonlinear causal\nadditive and potentially heteroscedastic noise models. We aim to identify\npredictor variables for which we can infer the causal effect even in cases of\nsuch misspecification. We develop a general framework based on knowledge of the\nmultivariate observational data distribution and we then propose an algorithm\nfor finite sample data, discuss its asymptotic properties, and illustrate its\nperformance on simulated and real data."}, "http://arxiv.org/abs/2310.16600": {"title": "Balancing central and marginal rejection when combining independent significance tests", "link": "http://arxiv.org/abs/2310.16600", "description": "A common approach to evaluating the significance of a collection of\n$p$-values combines them with a pooling function, in particular when the\noriginal data are not available. These pooled $p$-values convert a sample of\n$p$-values into a single number which behaves like a univariate $p$-value. To\nclarify discussion of these functions, a telescoping series of alternative\nhypotheses are introduced that communicate the strength and prevalence of\nnon-null evidence in the $p$-values before general pooling formulae are\ndiscussed. A pattern noticed in the UMP pooled $p$-value for a particular\nalternative motivates the definition and discussion of central and marginal\nrejection levels at $\\alpha$. It is proven that central rejection is always\ngreater than or equal to marginal rejection, motivating a quotient to measure\nthe balance between the two for pooled $p$-values. A combining function based\non the $\\chi^2_{\\kappa}$ quantile transformation is proposed to control this\nquotient and shown to be robust to mis-specified parameters relative to the\nUMP. Different powers for different parameter settings motivate a map of\nplausible alternatives based on where this pooled $p$-value is minimized."}, "http://arxiv.org/abs/2310.16626": {"title": "Scalable Causal Structure Learning via Amortized Conditional Independence Testing", "link": "http://arxiv.org/abs/2310.16626", "description": "Controlling false positives (Type I errors) through statistical hypothesis\ntesting is a foundation of modern scientific data analysis. Existing causal\nstructure discovery algorithms either do not provide Type I error control or\ncannot scale to the size of modern scientific datasets. We consider a variant\nof the causal discovery problem with two sets of nodes, where the only edges of\ninterest form a bipartite causal subgraph between the sets. We develop Scalable\nCausal Structure Learning (SCSL), a method for causal structure discovery on\nbipartite subgraphs that provides Type I error control. SCSL recasts the\ndiscovery problem as a simultaneous hypothesis testing problem and uses\ndiscrete optimization over the set of possible confounders to obtain an upper\nbound on the test statistic for each edge. Semi-synthetic simulations\ndemonstrate that SCSL scales to handle graphs with hundreds of nodes while\nmaintaining error control and good power. 
We demonstrate the practical\napplicability of the method by applying it to a cancer dataset to reveal\nconnections between somatic gene mutations and metastases to different tissues."}, "http://arxiv.org/abs/2310.16650": {"title": "Data-integration with pseudoweights and survey-calibration: application to developing US-representative lung cancer risk models for use in screening", "link": "http://arxiv.org/abs/2310.16650", "description": "Accurate cancer risk estimation is crucial to clinical decision-making, such\nas identifying high-risk people for screening. However, most existing cancer\nrisk models incorporate data from epidemiologic studies, which usually cannot\nrepresent the target population. While population-based health surveys are\nideal for making inference to the target population, they typically do not\ncollect time-to-cancer incidence data. Instead, time-to-cancer specific\nmortality is often readily available on surveys via linkage to vital\nstatistics. We develop calibrated pseudoweighting methods that integrate\nindividual-level data from a cohort and a survey, and summary statistics of\ncancer incidence from national cancer registries. By leveraging\nindividual-level cancer mortality data in the survey, the proposed methods\nimpute time-to-cancer incidence for survey sample individuals and use survey\ncalibration with auxiliary variables of influence functions generated from Cox\nregression to improve robustness and efficiency of the inverse-propensity\npseudoweighting method in estimating pure risks. We develop a lung cancer\nincidence pure risk model from the Prostate, Lung, Colorectal, and Ovarian\n(PLCO) Cancer Screening Trial using our proposed methods by integrating data\nfrom the National Health Interview Survey (NHIS) and cancer registries."}, "http://arxiv.org/abs/2310.16653": {"title": "Adaptive importance sampling for heavy-tailed distributions via $\\alpha$-divergence minimization", "link": "http://arxiv.org/abs/2310.16653", "description": "Adaptive importance sampling (AIS) algorithms are widely used to approximate\nexpectations with respect to complicated target probability distributions. When\nthe target has heavy tails, existing AIS algorithms can provide inconsistent\nestimators or exhibit slow convergence, as they often neglect the target's tail\nbehaviour. To avoid this pitfall, we propose an AIS algorithm that approximates\nthe target by Student-t proposal distributions. We adapt location and scale\nparameters by matching the escort moments - which are defined even for\nheavy-tailed distributions - of the target and the proposal. These updates\nminimize the $\\alpha$-divergence between the target and the proposal, thereby\nconnecting with variational inference. We then show that the\n$\\alpha$-divergence can be approximated by a generalized notion of effective\nsample size and leverage this new perspective to adapt the tail parameter with\nBayesian optimization. We demonstrate the efficacy of our approach through\napplications to synthetic targets and a Bayesian Student-t regression task on a\nreal example with clinical trial data."}, "http://arxiv.org/abs/2310.16690": {"title": "Dynamic treatment effect phenotyping through functional survival analysis", "link": "http://arxiv.org/abs/2310.16690", "description": "In recent years, research interest in personalised treatments has been\ngrowing. However, treatment effect heterogeneity and possibly time-varying\ntreatment effects are still often overlooked in clinical studies. 
Statistical\ntools are needed for the identification of treatment response patterns, taking\ninto account that treatment response is not constant over time. We aim to\nprovide an innovative method to obtain dynamic treatment effect phenotypes on a\ntime-to-event outcome, conditioned on a set of relevant effect modifiers. The\nproposed method does not require the assumption of proportional hazards for the\ntreatment effect, which is rarely realistic. We propose a spline-based survival\nneural network, inspired by the Royston-Parmar survival model, to estimate\ntime-varying conditional treatment effects. We then exploit the functional\nnature of the resulting estimates to apply a functional clustering of the\ntreatment effect curves in order to identify different patterns of treatment\neffects. The application that motivated this work is the discontinuation of\ntreatment with Mineralocorticoid receptor Antagonists (MRAs) in patients with\nheart failure, where there is no clear evidence as to which patients it is the\nsafest choice to discontinue treatment and, conversely, when it leads to a\nhigher risk of adverse events. The data come from an electronic health record\ndatabase. A simulation study was performed to assess the performance of the\nspline-based neural network and the stability of the treatment response\nphenotyping procedure. We provide a novel method to inform individualized\nmedical decisions by characterising subject-specific treatment responses over\ntime."}, "http://arxiv.org/abs/2310.16698": {"title": "Causal Discovery with Generalized Linear Models through Peeling Algorithms", "link": "http://arxiv.org/abs/2310.16698", "description": "This article presents a novel method for causal discovery with generalized\nstructural equation models suited for analyzing diverse types of outcomes,\nincluding discrete, continuous, and mixed data. Causal discovery often faces\nchallenges due to unmeasured confounders that hinder the identification of\ncausal relationships. The proposed approach addresses this issue by developing\ntwo peeling algorithms (bottom-up and top-down) to ascertain causal\nrelationships and valid instruments. This approach first reconstructs a\nsuper-graph to represent ancestral relationships between variables, using a\npeeling algorithm based on nodewise GLM regressions that exploit relationships\nbetween primary and instrumental variables. Then, it estimates parent-child\neffects from the ancestral relationships using another peeling algorithm while\ndeconfounding a child's model with information borrowed from its parents'\nmodels. The article offers a theoretical analysis of the proposed approach,\nwhich establishes conditions for model identifiability and provides statistical\nguarantees for accurately discovering parent-child relationships via the\npeeling algorithms. Furthermore, the article presents numerical experiments\nshowcasing the effectiveness of our approach in comparison to state-of-the-art\nstructure learning methods without confounders. 
Lastly, it demonstrates an\napplication to Alzheimer's disease (AD), highlighting the utility of the method\nin constructing gene-to-gene and gene-to-disease regulatory networks involving\nSingle Nucleotide Polymorphisms (SNPs) for healthy and AD subjects."}, "http://arxiv.org/abs/2310.16813": {"title": "Improving the Aggregation and Evaluation of NBA Mock Drafts", "link": "http://arxiv.org/abs/2310.16813", "description": "Many enthusiasts and experts publish forecasts of the order in which players are\ndrafted into professional sports leagues, known as mock drafts. Using a novel\ndataset of mock drafts for the National Basketball Association (NBA), we\nanalyze authors' mock draft accuracy over time and ask how we can reasonably\nuse information from multiple authors. To measure how accurate mock drafts are,\nwe assume that both mock drafts and the actual draft are ranked lists, and we\npropose that rank-biased distance (RBD) of Webber et al. (2010) is the\nappropriate error metric for mock draft accuracy. This is because RBD allows\nmock drafts to have a different length than the actual draft, accounts for\nplayers not appearing in both lists, and weights errors early in the draft more\nthan errors later on. We validate that mock drafts, as expected, improve in\naccuracy over the course of a season, and that accuracy of the mock drafts\nproduced right before their drafts is fairly stable across seasons. To be able\nto combine information from multiple mock drafts into a single consensus mock\ndraft, we also propose a ranked-list combination method based on the ideas of\nranked-choice voting. We show that our method provides improved forecasts over\nthe standard Borda count combination method used for most similar analyses in\nsports, and that either combination method provides a more accurate forecast\nover time than any single author."}, "http://arxiv.org/abs/2310.16824": {"title": "Parametric model for post-processing visibility ensemble forecasts", "link": "http://arxiv.org/abs/2310.16824", "description": "Despite the continuous development of the different operational ensemble\nprediction systems over the past decades, ensemble forecasts still might suffer\nfrom a lack of calibration and/or display systematic bias, and thus require some\npost-processing to improve their forecast skill. Here we focus on visibility,\na quantity that plays a crucial role in, e.g., aviation, road safety, and ship\nnavigation, and propose a parametric model where the predictive distribution is\na mixture of a gamma and a truncated normal distribution, both right censored\nat the maximal reported visibility value. The new model is evaluated in two\ncase studies based on visibility ensemble forecasts of the European Centre for\nMedium-Range Weather Forecasts covering two distinct domains in Central and\nWestern Europe and two different time periods. The results of the case studies\nindicate that climatology is substantially superior to the raw ensemble;\nnevertheless, the forecast skill can be further improved by post-processing, at\nleast for short lead times. 
Moreover, the proposed mixture model consistently\noutperforms the Bayesian model averaging approach used as reference\npost-processing technique."}, "http://arxiv.org/abs/2109.09339": {"title": "Improving the accuracy of estimating indexes in contingency tables using Bayesian estimators", "link": "http://arxiv.org/abs/2109.09339", "description": "In contingency table analysis, one is interested in testing whether a model\nof interest (e.g., the independent or symmetry model) holds using\ngoodness-of-fit tests. When the null hypothesis where the model is true is\nrejected, the interest turns to the degree to which the probability structure\nof the contingency table deviates from the model. Many indexes have been\nstudied to measure the degree of the departure, such as the Yule coefficient\nand Cram\\'er coefficient for the independence model, and Tomizawa's symmetry\nindex for the symmetry model. The inference of these indexes is performed using\nsample proportions, which are estimates of cell probabilities, but it is\nwell-known that the bias and mean square error (MSE) values become large\nwithout a sufficient number of samples. To address the problem, this study\nproposes a new estimator for indexes using Bayesian estimators of cell\nprobabilities. Assuming the Dirichlet distribution for the prior of cell\nprobabilities, we asymptotically evaluate the value of MSE when plugging the\nposterior means of cell probabilities into the index, and propose an estimator\nof the index using the Dirichlet hyperparameter that minimizes the value.\nNumerical experiments show that when the number of samples per cell is small,\nthe proposed method has smaller values of bias and MSE than other methods of\ncorrecting estimation accuracy. We also show that the values of bias and MSE\nare smaller than those obtained by using the uniform and Jeffreys priors."}, "http://arxiv.org/abs/2110.01031": {"title": "A general framework for formulating structured variable selection", "link": "http://arxiv.org/abs/2110.01031", "description": "In variable selection, a selection rule that prescribes the permissible sets\nof selected variables (called a \"selection dictionary\") is desirable due to the\ninherent structural constraints among the candidate variables. Such selection\nrules can be complex in real-world data analyses, and failing to incorporate\nsuch restrictions could not only compromise the interpretability of the model\nbut also lead to decreased prediction accuracy. However, no general framework\nhas been proposed to formalize selection rules and their applications, which\nposes a significant challenge for practitioners seeking to integrate these\nrules into their analyses. In this work, we establish a framework for\nstructured variable selection that can incorporate universal structural\nconstraints. We develop a mathematical language for constructing arbitrary\nselection rules, where the selection dictionary is formally defined. We\ndemonstrate that all selection rules can be expressed as combinations of\noperations on constructs, facilitating the identification of the corresponding\nselection dictionary. Once this selection dictionary is derived, practitioners\ncan apply their own user-defined criteria to select the optimal model.\nAdditionally, our framework enhances existing penalized regression methods for\nvariable selection by providing guidance on how to appropriately group\nvariables to achieve the desired selection rule. 
Furthermore, our innovative\nframework opens the door to establishing new l0 norm-based penalized regression\ntechniques that can be tailored to respect arbitrary selection rules, thereby\nexpanding the possibilities for more robust and tailored model development."}, "http://arxiv.org/abs/2203.14223": {"title": "Identifying Peer Influence in Therapeutic Communities", "link": "http://arxiv.org/abs/2203.14223", "description": "We investigate if there is a peer influence or role model effect on\nsuccessful graduation from Therapeutic Communities (TCs). We analyze anonymized\nindividual-level observational data from 3 TCs that kept records of written\nexchanges of affirmations and corrections among residents, and their precise\nentry and exit dates. The affirmations allow us to form peer networks, and the\nentry and exit dates allow us to define a causal effect of interest. We\nconceptualize the causal role model effect as measuring the difference in the\nexpected outcome of a resident (ego) who can observe one of their social\ncontacts (e.g., peers who gave affirmations), to be successful in graduating\nbefore the ego's exit vs not successfully graduating before the ego's exit.\nSince peer influence is usually confounded with unobserved homophily in\nobservational data, we model the network with a latent variable model to\nestimate homophily and include it in the outcome equation. We provide a\ntheoretical guarantee that the bias of our peer influence estimator decreases\nwith sample size. Our results indicate there is an effect of peers' graduation\non the graduation of residents. The magnitude of peer influence differs based\non gender, race, and the definition of the role model effect. A counterfactual\nexercise quantifies the potential benefits of intervention of assigning a buddy\nto \"at-risk\" individuals directly on the treated resident and indirectly on\ntheir peers through network propagation."}, "http://arxiv.org/abs/2207.03182": {"title": "Chilled Sampling for Uncertainty Quantification: A Motivation From A Meteorological Inverse Problem", "link": "http://arxiv.org/abs/2207.03182", "description": "Atmospheric motion vectors (AMVs) extracted from satellite imagery are the\nonly wind observations with good global coverage. They are important features\nfor feeding numerical weather prediction (NWP) models. Several Bayesian models\nhave been proposed to estimate AMVs. Although critical for correct assimilation\ninto NWP models, very few methods provide a thorough characterization of the\nestimation errors. The difficulty of estimating errors stems from the\nspecificity of the posterior distribution, which is both very high dimensional,\nand highly ill-conditioned due to a singular likelihood. Motivated by this\ndifficult inverse problem, this work studies the evaluation of the (expected)\nestimation errors using gradient-based Markov Chain Monte Carlo (MCMC)\nalgorithms. The main contribution is to propose a general strategy, called here\nchilling, which amounts to sampling a local approximation of the posterior\ndistribution in the neighborhood of a point estimate. From a theoretical point\nof view, we show that under regularity assumptions, the family of chilled\nposterior distributions converges in distribution as temperature decreases to\nan optimal Gaussian approximation at a point estimate given by the Maximum A\nPosteriori, also known as the Laplace approximation. 
Chilled sampling therefore\nprovides access to this approximation generally out of reach in such\nhigh-dimensional nonlinear contexts. From an empirical perspective, we evaluate\nthe proposed approach based on some quantitative Bayesian criteria. Our\nnumerical simulations are performed on synthetic and real meteorological data.\nThey reveal that the proposed chilling yields not only a significant gain in\nthe accuracy of the point estimates and of their associated expected\nerrors, but also a substantial acceleration in the convergence speed of the\nMCMC algorithms."}, "http://arxiv.org/abs/2207.13612": {"title": "Robust Output Analysis with Monte-Carlo Methodology", "link": "http://arxiv.org/abs/2207.13612", "description": "In predictive modeling with simulation or machine learning, it is critical to\naccurately assess the quality of estimated values through output analysis. In\nrecent decades output analysis has become enriched with methods that quantify\nthe impact of input data uncertainty in the model outputs to increase\nrobustness. However, most developments are applicable assuming that the input\ndata adheres to a parametric family of distributions. We propose a unified\noutput analysis framework for simulation and machine learning outputs through\nthe lens of Monte Carlo sampling. This framework provides nonparametric\nquantification of the variance and bias induced in the outputs with\nhigher-order accuracy. Our new bias-corrected estimation from the model outputs\nleverages the extension of fast iterative bootstrap sampling and higher-order\ninfluence functions. For the scalability of the proposed estimation methods, we\ndevise budget-optimal rules and leverage control variates for variance\nreduction. Our theoretical and numerical results demonstrate a clear advantage\nin building more robust confidence intervals from the model outputs with higher\ncoverage probability."}, "http://arxiv.org/abs/2208.06685": {"title": "Adaptive novelty detection with false discovery rate guarantee", "link": "http://arxiv.org/abs/2208.06685", "description": "This paper studies the semi-supervised novelty detection problem where a set\nof \"typical\" measurements is available to the researcher. Motivated by recent\nadvances in multiple testing and conformal inference, we propose AdaDetect, a\nflexible method that is able to wrap around any probabilistic classification\nalgorithm and control the false discovery rate (FDR) on detected novelties in\nfinite samples without any distributional assumption other than\nexchangeability. In contrast to classical FDR-controlling procedures that are\noften committed to a pre-specified p-value function, AdaDetect learns the\ntransformation in a data-adaptive manner to focus the power on the directions\nthat distinguish between inliers and outliers. Inspired by the multiple testing\nliterature, we further propose variants of AdaDetect that are adaptive to the\nproportion of nulls while maintaining the finite-sample FDR control. The\nmethods are illustrated on synthetic datasets and real-world datasets,\nincluding an application in astrophysics."}, "http://arxiv.org/abs/2211.02582": {"title": "Inference for Network Count Time Series with the R Package PNAR", "link": "http://arxiv.org/abs/2211.02582", "description": "We introduce a new R package useful for inference about network count time\nseries. Such data are frequently encountered in statistics and they are usually\ntreated as multivariate time series. 
Their statistical analysis is based on\nlinear or log linear models. Nonlinear models, which have been applied\nsuccessfully in several research areas, have been neglected from such\napplications mainly because of their computational complexity. We provide R\nusers the flexibility to fit and study nonlinear network count time series\nmodels which include either a drift in the intercept or a regime switching\nmechanism. We develop several computational tools including estimation of\nvarious count Network Autoregressive models and fast computational algorithms\nfor testing linearity in standard cases and when non-identifiable parameters\nhamper the analysis. Finally, we introduce a copula Poisson algorithm for\nsimulating multivariate network count time series. We illustrate the\nmethodology by modeling weekly number of influenza cases in Germany."}, "http://arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "http://arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics\nand often exhibits community-like structures, where each component (node) along\neach different mode has a community membership associated with it. In this\npaper we propose the tensor mixed-membership blockmodel, a generalization of\nthe tensor blockmodel positing that memberships need not be discrete, but\ninstead are convex combinations of latent communities. We establish the\nidentifiability of our model and propose a computationally efficient estimation\nprocedure based on the higher-order orthogonal iteration algorithm (HOOI) for\ntensor SVD composed with a simplex corner-finding algorithm. We then\ndemonstrate the consistency of our estimation procedure by providing a per-node\nerror bound, which showcases the effect of higher-order structures on\nestimation accuracy. To prove our consistency result, we develop the\n$\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent,\npossibly heteroskedastic, subgaussian noise that may be of independent\ninterest. Our analysis uses a novel leave-one-out construction for the\niterates, and our bounds depend only on spectral properties of the underlying\nlow-rank tensor under nearly optimal signal-to-noise ratio conditions such that\ntensor SVD is computationally feasible. Whereas other leave-one-out analyses\ntypically focus on sequences constructed by analyzing the output of a given\nalgorithm with a small part of the noise removed, our leave-one-out analysis\nconstructions use both the previous iterates and the additional tensor\nstructure to eliminate a potential additional source of error. Finally, we\napply our methodology to real and simulated data, including applications to two\nflight datasets and a trade network dataset, demonstrating some effects not\nidentifiable from the model with discrete community memberships."}, "http://arxiv.org/abs/2304.10372": {"title": "Statistical inference for Gaussian Whittle-Mat\\'ern fields on metric graphs", "link": "http://arxiv.org/abs/2304.10372", "description": "Whittle-Mat\\'ern fields are a recently introduced class of Gaussian processes\non metric graphs, which are specified as solutions to a fractional-order\nstochastic differential equation. 
Unlike earlier covariance-based approaches\nfor specifying Gaussian fields on metric graphs, the Whittle-Mat\\'ern fields\nare well-defined for any compact metric graph and can provide Gaussian\nprocesses with differentiable sample paths. We derive the main statistical\nproperties of the model class, particularly the consistency and asymptotic\nnormality of maximum likelihood estimators of model parameters and the\nnecessary and sufficient conditions for asymptotic optimality properties of\nlinear prediction based on the model with misspecified parameters.\n\nThe covariance function of the Whittle-Mat\\'ern fields is generally\nunavailable in closed form, and they have therefore been challenging to use for\nstatistical inference. However, we show that for specific values of the\nfractional exponent, when the fields have Markov properties, likelihood-based\ninference and spatial prediction can be performed exactly and computationally\nefficiently. This facilitates using the Whittle-Mat\\'ern fields in statistical\napplications involving big datasets without the need for any approximations.\nThe methods are illustrated via an application to modeling of traffic data,\nwhere allowing for differentiable processes dramatically improves the results."}, "http://arxiv.org/abs/2305.09282": {"title": "Errors-in-variables Fr\\'echet Regression with Low-rank Covariate Approximation", "link": "http://arxiv.org/abs/2305.09282", "description": "Fr\\'echet regression has emerged as a promising approach for regression\nanalysis involving non-Euclidean response variables. However, its practical\napplicability has been hindered by its reliance on ideal scenarios with\nabundant and noiseless covariate data. In this paper, we present a novel\nestimation method that tackles these limitations by leveraging the low-rank\nstructure inherent in the covariate matrix. Our proposed framework combines the\nconcepts of global Fr\\'echet regression and principal component regression,\naiming to improve the efficiency and accuracy of the regression estimator. By\nincorporating the low-rank structure, our method enables more effective\nmodeling and estimation, particularly in high-dimensional and\nerrors-in-variables regression settings. We provide a theoretical analysis of\nthe proposed estimator's large-sample properties, including a comprehensive\nrate analysis of bias, variance, and additional variations due to measurement\nerrors. Furthermore, our numerical experiments provide empirical evidence that\nsupports the theoretical findings, demonstrating the superior performance of\nour approach. Overall, this work introduces a promising framework for\nregression analysis of non-Euclidean variables, effectively addressing the\nchallenges associated with limited and noisy covariate data, with potential\napplications in diverse fields."}, "http://arxiv.org/abs/2305.19417": {"title": "Model averaging approaches to data subset selection", "link": "http://arxiv.org/abs/2305.19417", "description": "Model averaging is a useful and robust method for dealing with model\nuncertainty in statistical analysis. Often, it is useful to consider data\nsubset selection at the same time, in which model selection criteria are used\nto compare models across different subsets of the data. Two different criteria\nhave been proposed in the literature for how the data subsets should be\nweighted. 
We compare the two criteria closely in a unified treatment based on\nthe Kullback-Leibler divergence, and conclude that one of them is subtly flawed\nand will tend to yield larger uncertainties due to loss of information.\nAnalytical and numerical examples are provided."}, "http://arxiv.org/abs/2309.06053": {"title": "Confounder selection via iterative graph expansion", "link": "http://arxiv.org/abs/2309.06053", "description": "Confounder selection, namely choosing a set of covariates to control for\nconfounding between a treatment and an outcome, is arguably the most important\nstep in the design of observational studies. Previous methods, such as Pearl's\ncelebrated back-door criterion, typically require pre-specifying a causal\ngraph, which can often be difficult in practice. We propose an interactive\nprocedure for confounder selection that does not require pre-specifying the\ngraph or the set of observed variables. This procedure iteratively expands the\ncausal graph by finding what we call \"primary adjustment sets\" for a pair of\npossibly confounded variables. This can be viewed as inverting a sequence of\nlatent projections of the underlying causal graph. Structural information in\nthe form of primary adjustment sets is elicited from the user, bit by bit,\nuntil either a set of covariates are found to control for confounding or it can\nbe determined that no such set exists. Other information, such as the causal\nrelations between confounders, is not required by the procedure. We show that\nif the user correctly specifies the primary adjustment sets in every step, our\nprocedure is both sound and complete."}, "http://arxiv.org/abs/2310.16989": {"title": "Randomization Inference When N Equals One", "link": "http://arxiv.org/abs/2310.16989", "description": "N-of-1 experiments, where a unit serves as its own control and treatment in\ndifferent time windows, have been used in certain medical contexts for decades.\nHowever, due to effects that accumulate over long time windows and\ninterventions that have complex evolution, a lack of robust inference tools has\nlimited the widespread applicability of such N-of-1 designs. This work combines\ntechniques from experiment design in causal inference and system identification\nfrom control theory to provide such an inference framework. We derive a model\nof the dynamic interference effect that arises in linear time-invariant\ndynamical systems. We show that a family of causal estimands analogous to those\nstudied in potential outcomes are estimable via a standard estimator derived\nfrom the method of moments. We derive formulae for higher moments of this\nestimator and describe conditions under which N-of-1 designs may provide faster\nways to estimate the effects of interventions in dynamical systems. We also\nprovide conditions under which our estimator is asymptotically normal and\nderive valid confidence intervals for this setting."}, "http://arxiv.org/abs/2310.17009": {"title": "Simulation based stacking", "link": "http://arxiv.org/abs/2310.17009", "description": "Simulation-based inference has been popular for amortized Bayesian\ncomputation. It is typical to have more than one posterior approximation, from\ndifferent inference algorithms, different architectures, or simply the\nrandomness of initialization and stochastic gradients. With a provable\nasymptotic guarantee, we present a general stacking framework to make use of\nall available posterior approximations. 
Our stacking method is able to combine\ndensities, simulation draws, confidence intervals, and moments, and address the\noverall precision, calibration, coverage, and bias at the same time. We\nillustrate our method on several benchmark simulations and a challenging\ncosmological inference task."}, "http://arxiv.org/abs/2310.17153": {"title": "Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration", "link": "http://arxiv.org/abs/2310.17153", "description": "Semi-implicit variational inference (SIVI) has been introduced to expand the\nanalytical variational families by defining expressive semi-implicit\ndistributions in a hierarchical manner. However, the single-layer architecture\ncommonly used in current SIVI methods can be insufficient when the target\nposterior has complicated structures. In this paper, we propose hierarchical\nsemi-implicit variational inference, called HSIVI, which generalizes SIVI to\nallow more expressive multi-layer construction of semi-implicit distributions.\nBy introducing auxiliary distributions that interpolate between a simple base\ndistribution and the target distribution, the conditional layers can be trained\nby progressively matching these auxiliary distributions one layer after\nanother. Moreover, given pre-trained score networks, HSIVI can be used to\naccelerate the sampling process of diffusion models with the score matching\nobjective. We show that HSIVI significantly enhances the expressiveness of SIVI\non several Bayesian inference problems with complicated target distributions.\nWhen used for diffusion model acceleration, we show that HSIVI can produce high\nquality samples comparable to or better than the existing fast diffusion model\nbased samplers with a small number of function evaluations on various datasets."}, "http://arxiv.org/abs/2310.17165": {"title": "Price Experimentation and Interference in Online Platforms", "link": "http://arxiv.org/abs/2310.17165", "description": "In this paper, we examine the biases arising in A/B tests where a firm\nmodifies a continuous parameter, such as price, to estimate the global\ntreatment effect associated to a given performance metric. Such biases emerge\nfrom canonical designs and estimators due to interference among market\nparticipants. We employ structural modeling and differential calculus to derive\nintuitive structural characterizations of this bias. We then specialize our\ngeneral model to a standard revenue management pricing problem. This setting\nhighlights a key potential pitfall in the use of pricing experiments to guide\nprofit maximization: notably, the canonical estimator for the change in profits\ncan have the {\\em wrong sign}. In other words, following the guidance of the\ncanonical estimator may lead the firm to move prices in the wrong direction,\nand thereby decrease profits relative to the status quo. We apply these results\nto a two-sided market model and show how this ``change of sign\" regime depends\non model parameters, and discuss structural and practical implications for\nplatform operators."}, "http://arxiv.org/abs/2310.17248": {"title": "The observed Fisher information attached to the EM algorithm, illustrated on Shepp and Vardi estimation procedure for positron emission tomography", "link": "http://arxiv.org/abs/2310.17248", "description": "The Shepp & Vardi (1982) implementation of the EM algorithm for PET scan\ntumor estimation provides a point estimate of the tumor. 
The current study\npresents a closed-form formula of the observed Fisher information for Shepp &\nVardi PET scan tumor estimation. Keywords: PET scan, EM algorithm, Fisher\ninformation matrix, standard errors."}, "http://arxiv.org/abs/2310.17308": {"title": "Wild Bootstrap for Counting Process-Based Statistics", "link": "http://arxiv.org/abs/2310.17308", "description": "The wild bootstrap is a popular resampling method in the context of\ntime-to-event data analyses. Previous works established the large sample\nproperties of it for applications to different estimators and test statistics.\nIt can be used to justify the accuracy of inference procedures such as\nhypothesis tests or time-simultaneous confidence bands. This paper consists of\ntwo parts: in Part~I, a general framework is developed in which the large\nsample properties are established in a unified way by using martingale\nstructures. The framework includes most of the well-known non- and\nsemiparametric statistical methods in time-to-event analysis and parametric\napproaches. In Part II, the Fine-Gray proportional sub-hazards model\nexemplifies the theory for inference on cumulative incidence functions given\nthe covariates. The model falls within the framework if the data are\ncensoring-complete. A simulation study demonstrates the reliability of the\nmethod and an application to a data set about hospital-acquired infections\nillustrates the statistical procedure."}, "http://arxiv.org/abs/2310.17334": {"title": "Bayesian Optimization for Personalized Dose-Finding Trials with Combination Therapies", "link": "http://arxiv.org/abs/2310.17334", "description": "Identification of optimal dose combinations in early phase dose-finding\ntrials is challenging, due to the trade-off between precisely estimating the\nmany parameters required to flexibly model the dose-response surface, and the\nsmall sample sizes in early phase trials. Existing methods often restrict the\nsearch to pre-defined dose combinations, which may fail to identify regions of\noptimality in the dose combination space. These difficulties are even more\npertinent in the context of personalized dose-finding, where patient\ncharacteristics are used to identify tailored optimal dose combinations. To\novercome these challenges, we propose the use of Bayesian optimization for\nfinding optimal dose combinations in standard (\"one size fits all\") and\npersonalized multi-agent dose-finding trials. Bayesian optimization is a method\nfor estimating the global optima of expensive-to-evaluate objective functions.\nThe objective function is approximated by a surrogate model, commonly a\nGaussian process, paired with a sequential design strategy to select the next\npoint via an acquisition function. This work is motivated by an\nindustry-sponsored problem, where focus is on optimizing a dual-agent therapy\nin a setting featuring minimal toxicity. To compare the performance of the\nstandard and personalized methods under this setting, simulation studies are\nperformed for a variety of scenarios. Our study concludes that taking a\npersonalized approach is highly beneficial in the presence of heterogeneity."}, "http://arxiv.org/abs/2310.17434": {"title": "The `Why' behind including `Y' in your imputation model", "link": "http://arxiv.org/abs/2310.17434", "description": "Missing data is a common challenge when analyzing epidemiological data, and\nimputation is often used to address this issue. 
Here, we investigate the\nscenario where a covariate used in an analysis has missingness and will be\nimputed. There are recommendations to include the outcome from the analysis\nmodel in the imputation model for missing covariates, but it is not necessarily\nclear whether this recommendation always holds and why this is sometimes true. We\nexamine deterministic imputation (i.e., single imputation where the imputed\nvalues are treated as fixed) and stochastic imputation (i.e., single imputation\nwith a random value or multiple imputation) methods and their implications for\nestimating the relationship between the imputed covariate and the outcome. We\nmathematically demonstrate that including the outcome variable in imputation\nmodels is not just a recommendation but a requirement to achieve unbiased\nresults when using stochastic imputation methods. Moreover, we dispel common\nmisconceptions about deterministic imputation models and demonstrate why the\noutcome should not be included in these models. This paper aims to bridge the\ngap between imputation in theory and in practice, providing mathematical\nderivations to explain common statistical recommendations. We offer a better\nunderstanding of the considerations involved in imputing missing covariates and\nemphasize when it is necessary to include the outcome variable in the\nimputation model."}, "http://arxiv.org/abs/2310.17440": {"title": "Gibbs optimal design of experiments", "link": "http://arxiv.org/abs/2310.17440", "description": "Bayesian optimal design of experiments is a well-established approach to\nplanning experiments. Briefly, a probability distribution, known as a\nstatistical model, for the responses is assumed which is dependent on a vector\nof unknown parameters. A utility function is then specified which gives the\ngain in information for estimating the true value of the parameters using the\nBayesian posterior distribution. A Bayesian optimal design is given by\nmaximising the expectation of the utility with respect to the joint\ndistribution given by the statistical model and prior distribution for the true\nparameter values. The approach takes account of the experimental aim via\nspecification of the utility and of all assumed sources of uncertainty via the\nexpected utility. However, it is predicated on the specification of the\nstatistical model. Recently, a new type of statistical inference, known as\nGibbs (or General Bayesian) inference, has been advanced. This is\nBayesian-like, in that uncertainty on unknown quantities is represented by a\nposterior distribution, but does not necessarily rely on specification of a\nstatistical model. Thus the resulting inference should be less sensitive to\nmisspecification of the statistical model. The purpose of this paper is to\npropose Gibbs optimal design: a framework for optimal design of experiments for\nGibbs inference. The concept behind the framework is introduced along with a\ncomputational approach to find Gibbs optimal designs in practice. The framework\nis demonstrated on exemplars including linear models, and experiments with\ncount and time-to-event responses."}, "http://arxiv.org/abs/2310.17546": {"title": "A changepoint approach to modelling non-stationary soil moisture dynamics", "link": "http://arxiv.org/abs/2310.17546", "description": "Soil moisture dynamics provide an indicator of soil health that scientists\nmodel via soil drydown curves. 
The typical modeling process requires the soil\nmoisture time series to be manually separated into drydown segments and then\nexponential decay models are fitted to them independently. Sensor development\nover recent years means that experiments that were previously conducted over a\nfew field campaigns can now be scaled to months or even years, often at a\nhigher sampling rate. Manual identification of drydown segments is no longer\npractical. To better meet the challenge of increasing data size, this paper\nproposes a novel changepoint-based approach to automatically identify\nstructural changes in the soil drying process, and estimate the parameters\ncharacterizing the drying processes simultaneously. A simulation study is\ncarried out to assess the performance of the method. The results demonstrate\nits ability to identify structural changes and retrieve key parameters of\ninterest to soil scientists. The method is applied to hourly soil moisture time\nseries from the NEON data portal to investigate the temporal dynamics of soil\nmoisture drydown. We recover known relationships previously identified\nmanually, alongside delivering new insights into the temporal variability\nacross soil types and locations."}, "http://arxiv.org/abs/2310.17629": {"title": "Approximate Leave-one-out Cross Validation for Regression with $\\ell_1$ Regularizers (extended version)", "link": "http://arxiv.org/abs/2310.17629", "description": "The out-of-sample error (OO) is the main quantity of interest in risk\nestimation and model selection. Leave-one-out cross validation (LO) offers a\n(nearly) distribution-free yet computationally demanding approach to estimate\nOO. Recent theoretical work showed that approximate leave-one-out cross\nvalidation (ALO) is a computationally efficient and statistically reliable\nestimate of LO (and OO) for generalized linear models with differentiable\nregularizers. For problems involving non-differentiable regularizers, despite\nsignificant empirical evidence, the theoretical understanding of ALO's error\nremains unknown. In this paper, we present a novel theory for a wide class of\nproblems in the generalized linear model family with non-differentiable\nregularizers. We bound the error |ALO - LO| in terms of intuitive metrics such\nas the size of leave-i-out perturbations in active sets, sample size n, number\nof features p and regularization parameters. As a consequence, for the\n$\\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes\nto infinity while n/p and SNR are fixed and bounded."}, "http://arxiv.org/abs/2108.04201": {"title": "Guaranteed Functional Tensor Singular Value Decomposition", "link": "http://arxiv.org/abs/2108.04201", "description": "This paper introduces the functional tensor singular value decomposition\n(FTSVD), a novel dimension reduction framework for tensors with one functional\nmode and several tabular modes. The problem is motivated by high-order\nlongitudinal data analysis. Our model assumes the observed data to be a random\nrealization of an approximate CP low-rank functional tensor measured on a\ndiscrete time grid. Incorporating tensor algebra and the theory of Reproducing\nKernel Hilbert Space (RKHS), we propose a novel RKHS-based constrained power\niteration with spectral initialization. Our method can successfully estimate\nboth singular vectors and functions of the low-rank structure in the observed\ndata. With mild assumptions, we establish the non-asymptotic contractive error\nbounds for the proposed algorithm. 
The superiority of the proposed framework is\ndemonstrated via extensive experiments on both simulated and real data."}, "http://arxiv.org/abs/2202.02146": {"title": "Elastic Gradient Descent, an Iterative Optimization Method Approximating the Solution Paths of the Elastic Net", "link": "http://arxiv.org/abs/2202.02146", "description": "The elastic net combines lasso and ridge regression to fuse the sparsity\nproperty of lasso with the grouping property of ridge regression. The\nconnections between ridge regression and gradient descent and between lasso and\nforward stagewise regression have previously been shown. Similar to how the\nelastic net generalizes lasso and ridge regression, we introduce elastic\ngradient descent, a generalization of gradient descent and forward stagewise\nregression. We theoretically analyze elastic gradient descent and compare it to\nthe elastic net and forward stagewise regression. Parts of the analysis are\nbased on elastic gradient flow, a piecewise analytical construction, obtained\nfor elastic gradient descent with infinitesimal step size. We also compare\nelastic gradient descent to the elastic net on real and simulated data and show\nthat it provides similar solution paths, but is several orders of magnitude\nfaster. Compared to forward stagewise regression, elastic gradient descent\nselects a model that, although still sparse, provides considerably lower\nprediction and estimation errors."}, "http://arxiv.org/abs/2202.03897": {"title": "Inference from Sampling with Response Probabilities Estimated via Calibration", "link": "http://arxiv.org/abs/2202.03897", "description": "A solution to control for nonresponse bias consists of multiplying the design\nweights of respondents by the inverse of estimated response probabilities to\ncompensate for the nonrespondents. Maximum likelihood and calibration are two\napproaches that can be applied to obtain estimated response probabilities. We\nconsider a common framework in which these approaches can be compared. We\ndevelop an asymptotic study of the behavior of the resulting estimator when\ncalibration is applied. A logistic regression model for the response\nprobabilities is postulated. Missing at random and unclustered data are\nsupposed. Three main contributions of this work are: 1) we show that the\nestimators with the response probabilities estimated via calibration are\nasymptotically equivalent to unbiased estimators and that a gain in efficiency\nis obtained when estimating the response probabilities via calibration as\ncompared to the estimator with the true response probabilities, 2) we show that\nthe estimators with the response probabilities estimated via calibration are\ndoubly robust to model misspecification and explain why double robustness is\nnot guaranteed when maximum likelihood is applied, and 3) we discuss and\nillustrate problems related to response probabilities estimation, namely\nexistence of a solution to the estimating equations, problems of convergence,\nand extreme weights. We explain and illustrate why the first aforementioned\nproblem is more likely with calibration than with maximum likelihood\nestimation. 
We present the results of a simulation study in order to illustrate\nthese elements."}, "http://arxiv.org/abs/2208.14951": {"title": "Statistical inference for multivariate extremes via a geometric approach", "link": "http://arxiv.org/abs/2208.14951", "description": "A geometric representation for multivariate extremes, based on the shapes of\nscaled sample clouds in light-tailed margins and their so-called limit sets,\nhas recently been shown to connect several existing extremal dependence\nconcepts. However, these results are purely probabilistic, and the geometric\napproach itself has not been fully exploited for statistical inference. We\noutline a method for parametric estimation of the limit set shape, which\nincludes a useful non/semi-parametric estimate as a pre-processing step. More\nfundamentally, our approach provides a new class of asymptotically-motivated\nstatistical models for the tails of multivariate distributions, and such models\ncan accommodate any combination of simultaneous or non-simultaneous extremes\nthrough appropriate parametric forms for the limit set shape. Extrapolation\nfurther into the tail of the distribution is possible via simulation from the\nfitted model. A simulation study confirms that our methodology is very\ncompetitive with existing approaches, and can successfully allow estimation of\nsmall probabilities in regions where other methods struggle. We apply the\nmethodology to two environmental datasets, with diagnostics demonstrating a\ngood fit."}, "http://arxiv.org/abs/2209.08889": {"title": "Inference of nonlinear causal effects with GWAS summary data", "link": "http://arxiv.org/abs/2209.08889", "description": "Large-scale genome-wide association studies (GWAS) have offered an exciting\nopportunity to discover putative causal genes or risk factors associated with\ndiseases by using SNPs as instrumental variables (IVs). However, conventional\napproaches assume linear causal relations partly for simplicity and partly for\nthe availability of GWAS summary data. In this work, we propose a novel model\n{for transcriptome-wide association studies (TWAS)} to incorporate nonlinear\nrelationships across IVs, an exposure/gene, and an outcome, which is robust\nagainst violations of the valid IV assumptions, permits the use of GWAS summary\ndata, and covers two-stage least squares as a special case. We decouple the\nestimation of a marginal causal effect and a nonlinear transformation, where\nthe former is estimated via sliced inverse regression and a sparse instrumental\nvariable regression, and the latter is estimated by a ratio-adjusted inverse\nregression. On this ground, we propose an inferential procedure. An application\nof the proposed method to the ADNI gene expression data and the IGAP GWAS\nsummary data identifies 18 causal genes associated with Alzheimer's disease,\nincluding APOE and TOMM40, in addition to 7 other genes missed by two-stage\nleast squares considering only linear relationships. Our findings suggest that\nnonlinear modeling is required to unleash the power of IV regression for\nidentifying potentially nonlinear gene-trait associations. 
Accompanying this\npaper is our Python library \\texttt{nl-causal}\n(\\url{https://nonlinear-causal.readthedocs.io/}) that implements the proposed\nmethod."}, "http://arxiv.org/abs/2301.03038": {"title": "Skewed Bernstein-von Mises theorem and skew-modal approximations", "link": "http://arxiv.org/abs/2301.03038", "description": "Gaussian approximations are routinely employed in Bayesian statistics to ease\ninference when the target posterior is intractable. Although these\napproximations are asymptotically justified by Bernstein-von Mises type\nresults, in practice the expected Gaussian behavior may poorly represent the\nshape of the posterior, thus affecting approximation accuracy. Motivated by\nthese considerations, we derive an improved class of closed-form approximations\nof posterior distributions which arise from a new treatment of a third-order\nversion of the Laplace method yielding approximations in a tractable family of\nskew-symmetric distributions. Under general assumptions which account for\nmisspecified models and non-i.i.d. settings, this family of approximations is\nshown to have a total variation distance from the target posterior whose rate\nof convergence improves by at least one order of magnitude the one established\nby the classical Bernstein-von Mises theorem. Specializing this result to the\ncase of regular parametric models shows that the same improvement in\napproximation accuracy can be also derived for polynomially bounded posterior\nfunctionals. Unlike other higher-order approximations, our results prove that\nit is possible to derive closed-form and valid densities which are expected to\nprovide, in practice, a more accurate, yet similarly-tractable, alternative to\nGaussian approximations of the target posterior, while inheriting its limiting\nfrequentist properties. We strengthen such arguments by developing a practical\nskew-modal approximation for both joint and marginal posteriors that achieves\nthe same theoretical guarantees of its theoretical counterpart by replacing the\nunknown model parameters with the corresponding MAP estimate. Empirical studies\nconfirm that our theoretical results closely match the remarkable performance\nobserved in practice, even in finite, possibly small, sample regimes."}, "http://arxiv.org/abs/2303.05878": {"title": "Identification and Estimation of Causal Effects with Confounders Missing Not at Random", "link": "http://arxiv.org/abs/2303.05878", "description": "Making causal inferences from observational studies can be challenging when\nconfounders are missing not at random. In such cases, identifying causal\neffects is often not guaranteed. Motivated by a real example, we consider a\ntreatment-independent missingness assumption under which we establish the\nidentification of causal effects when confounders are missing not at random. We\npropose a weighted estimating equation (WEE) approach for estimating model\nparameters and introduce three estimators for the average causal effect, based\non regression, propensity score weighting, and doubly robust estimation. We\nevaluate the performance of these estimators through simulations, and provide a\nreal data analysis to illustrate our proposed method."}, "http://arxiv.org/abs/2305.12283": {"title": "Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods", "link": "http://arxiv.org/abs/2305.12283", "description": "In this paper, we consider the uncertainty quantification problem for\nregression models. 
Specifically, we consider an individual calibration\nobjective for characterizing the quantiles of the prediction model. While such\nan objective is well-motivated from downstream tasks such as newsvendor cost,\nthe existing methods have been largely heuristic and lack statistical\nguarantees in terms of individual calibration. We show via simple examples that\nthe existing methods focusing on population-level calibration guarantees such\nas average calibration or sharpness can lead to harmful and unexpected results.\nWe propose simple nonparametric calibration methods that are agnostic of the\nunderlying prediction model and enjoy both computational efficiency and\nstatistical consistency. Our approach enables a better understanding of the\npossibility of individual calibration, and we establish matching upper and\nlower bounds for the calibration error of our proposed methods. Technically,\nour analysis combines the nonparametric analysis with a covering number\nargument for parametric analysis, which advances the existing theoretical\nanalyses in the literature of nonparametric density estimation and quantile\nbandit problems. Importantly, the nonparametric perspective sheds new\ntheoretical insights into regression calibration in terms of the curse of\ndimensionality and reconciles the existing results on the impossibility of\nindividual calibration. To our knowledge, we make the first effort to reach\nboth individual calibration and finite-sample guarantee with minimal\nassumptions in terms of conformal prediction. Numerical experiments show the\nadvantage of such a simple approach under various metrics, and also under\ncovariate shift. We hope our work provides a simple benchmark and a starting\npoint of theoretical ground for future research on regression calibration."}, "http://arxiv.org/abs/2305.14943": {"title": "Learning Rate Free Bayesian Inference in Constrained Domains", "link": "http://arxiv.org/abs/2305.14943", "description": "We introduce a suite of new particle-based algorithms for sampling on\nconstrained domains which are entirely learning rate free. Our approach\nleverages coin betting ideas from convex optimisation, and the viewpoint of\nconstrained sampling as a mirrored optimisation problem on the space of\nprobability measures. Based on this viewpoint, we also introduce a unifying\nframework for several existing constrained sampling algorithms, including\nmirrored Langevin dynamics and mirrored Stein variational gradient descent. We\ndemonstrate the performance of our algorithms on a range of numerical examples,\nincluding sampling from targets on the simplex, sampling with fairness\nconstraints, and constrained sampling problems in post-selection inference. Our\nresults indicate that our algorithms achieve competitive performance with\nexisting constrained sampling methods, without the need to tune any\nhyperparameters."}, "http://arxiv.org/abs/2308.07983": {"title": "Monte Carlo guided Diffusion for Bayesian linear inverse problems", "link": "http://arxiv.org/abs/2308.07983", "description": "Ill-posed linear inverse problems arise frequently in various applications,\nfrom computational photography to medical imaging. A recent line of research\nexploits Bayesian inference with informative priors to handle the ill-posedness\nof such problems. Amongst such priors, score-based generative models (SGM) have\nrecently been successfully applied to several different inverse problems. 
In\nthis study, we exploit the particular structure of the prior defined by the SGM\nto define a sequence of intermediate linear inverse problems. As the noise\nlevel decreases, the posteriors of these inverse problems get closer to the\ntarget posterior of the original inverse problem. To sample from this sequence\nof posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The\nproposed algorithm, MCGDiff, is shown to be theoretically grounded and we\nprovide numerical simulations showing that it outperforms competing baselines\nwhen dealing with ill-posed inverse problems in a Bayesian setting."}, "http://arxiv.org/abs/2309.16843": {"title": "A Mean Field Approach to Empirical Bayes Estimation in High-dimensional Linear Regression", "link": "http://arxiv.org/abs/2309.16843", "description": "We study empirical Bayes estimation in high-dimensional linear regression. To\nfacilitate computationally efficient estimation of the underlying prior, we\nadopt a variational empirical Bayes approach, introduced originally in\nCarbonetto and Stephens (2012) and Kim et al. (2022). We establish asymptotic\nconsistency of the nonparametric maximum likelihood estimator (NPMLE) and its\n(computable) naive mean field variational surrogate under mild assumptions on\nthe design and the prior. Assuming, in addition, that the naive mean field\napproximation has a dominant optimizer, we develop a computationally efficient\napproximation to the oracle posterior distribution, and establish its accuracy\nunder the 1-Wasserstein metric. This enables computationally feasible Bayesian\ninference; e.g., construction of posterior credible intervals with an average\ncoverage guarantee, Bayes optimal estimation for the regression coefficients,\nestimation of the proportion of non-nulls, etc. Our analysis covers both\ndeterministic and random designs, and accommodates correlations among the\nfeatures. To the best of our knowledge, this provides the first rigorous\nnonparametric empirical Bayes method in a high-dimensional regression setting\nwithout sparsity."}, "http://arxiv.org/abs/2310.17679": {"title": "Fast Scalable and Accurate Discovery of DAGs Using the Best Order Score Search and Grow-Shrink Trees", "link": "http://arxiv.org/abs/2310.17679", "description": "Learning graphical conditional independence structures is an important\nmachine learning problem and a cornerstone of causal discovery. However, the\naccuracy and execution time of learning algorithms generally struggle to scale\nto problems with hundreds of highly connected variables -- for instance,\nrecovering brain networks from fMRI data. We introduce the best order score\nsearch (BOSS) and grow-shrink trees (GSTs) for learning directed acyclic graphs\n(DAGs) in this paradigm. BOSS greedily searches over permutations of variables,\nusing GSTs to construct and score DAGs from permutations. GSTs efficiently\ncache scores to eliminate redundant calculations. BOSS achieves\nstate-of-the-art performance in accuracy and execution time, comparing\nfavorably to a variety of combinatorial and gradient-based learning algorithms\nunder a broad range of conditions. To demonstrate its practicality, we apply\nBOSS to two sets of resting-state fMRI data: simulated data with\npseudo-empirical noise distributions derived from randomized empirical fMRI\ncortical signals and clinical data from 3T fMRI scans processed into cortical\nparcels. 
BOSS is available for use within the TETRAD project which includes\nPython and R wrappers."}, "http://arxiv.org/abs/2310.17712": {"title": "Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec", "link": "http://arxiv.org/abs/2310.17712", "description": "Embedding the nodes of a large network into an Euclidean space is a common\nobjective in modern machine learning, with a variety of tools available. These\nembeddings can then be used as features for tasks such as community\ndetection/node clustering or link prediction, where they achieve state of the\nart performance. With the exception of spectral clustering methods, there is\nlittle theoretical understanding for other commonly used approaches to learning\nembeddings. In this work we examine the theoretical properties of the\nembeddings learned by node2vec. Our main result shows that the use of k-means\nclustering on the embedding vectors produced by node2vec gives weakly\nconsistent community recovery for the nodes in (degree corrected) stochastic\nblock models. We also discuss the use of these embeddings for node and link\nprediction tasks. We demonstrate this result empirically, and examine how this\nrelates to other embedding tools for network data."}, "http://arxiv.org/abs/2310.17760": {"title": "Novel Models for Multiple Dependent Heteroskedastic Time Series", "link": "http://arxiv.org/abs/2310.17760", "description": "Functional magnetic resonance imaging or functional MRI (fMRI) is a very\npopular tool used for differing brain regions by measuring brain activity. It\nis affected by physiological noise, such as head and brain movement in the\nscanner from breathing, heart beats, or the subject fidgeting. The purpose of\nthis paper is to propose a novel approach to handling fMRI data for infants\nwith high volatility caused by sudden head movements. Another purpose is to\nevaluate the volatility modelling performance of multiple dependent fMRI time\nseries data. The models examined in this paper are AR and GARCH and the\nmodelling performance is evaluated by several statistical performance measures.\nThe conclusions of this paper are that multiple dependent fMRI series data can\nbe fitted with AR + GARCH model if the multiple fMRI data have many sudden head\nmovements. The GARCH model can capture the shared volatility clustering caused\nby head movements across brain regions. However, the multiple fMRI data without\nmany head movements have fitted AR + GARCH model with different performance.\nThe conclusions are supported by statistical tests and measures. This paper\nhighlights the difference between the proposed approach from traditional\napproaches when estimating model parameters and modelling conditional variances\non multiple dependent time series. In the future, the proposed approach can be\napplied to other research fields, such as financial economics, and signal\nprocessing. Code is available at \\url{https://github.com/13204942/STAT40710}."}, "http://arxiv.org/abs/2310.17766": {"title": "Minibatch Markov chain Monte Carlo Algorithms for Fitting Gaussian Processes", "link": "http://arxiv.org/abs/2310.17766", "description": "Gaussian processes (GPs) are a highly flexible, nonparametric statistical\nmodel that are commonly used to fit nonlinear relationships or account for\ncorrelation between observations. However, the computational load of fitting a\nGaussian process is $\\mathcal{O}(n^3)$ making them infeasible for use on large\ndatasets. 
To make GPs more feasible for large datasets, this research focuses\non the use of minibatching to estimate GP parameters. Specifically, we outline\nboth approximate and exact minibatch Markov chain Monte Carlo algorithms that\nsubstantially reduce the computation of fitting a GP by only considering small\nsubsets of the data at a time. We demonstrate and compare this methodology\nusing various simulations and real datasets."}, "http://arxiv.org/abs/2310.17806": {"title": "Transporting treatment effects from difference-in-differences studies", "link": "http://arxiv.org/abs/2310.17806", "description": "Difference-in-differences (DID) is a popular approach to identify the causal\neffects of treatments and policies in the presence of unmeasured confounding.\nDID identifies the sample average treatment effect in the treated (SATT).\nHowever, a goal of such research is often to inform decision-making in target\npopulations outside the treated sample. Transportability methods have been\ndeveloped to extend inferences from study samples to external target\npopulations; these methods have primarily been developed and applied in\nsettings where identification is based on conditional independence between the\ntreatment and potential outcomes, such as in a randomized trial. This paper\ndevelops identification and estimators for effects in a target population,\nbased on DID conducted in a study sample that differs from the target\npopulation. We present a range of assumptions under which one may identify\ncausal effects in the target population and employ causal diagrams to\nillustrate these assumptions. In most realistic settings, results depend\ncritically on the assumption that any unmeasured confounders are not effect\nmeasure modifiers on the scale of the effect of interest. We develop several\nestimators of transported effects, including a doubly robust estimator based on\nthe efficient influence function. Simulation results support theoretical\nproperties of the proposed estimators. We discuss the potential application of\nour approach to a study of the effects of a US federal smoke-free housing\npolicy, where the original study was conducted in New York City alone and the\ngoal is to extend inferences to other US cities."}, "http://arxiv.org/abs/2310.17816": {"title": "Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs", "link": "http://arxiv.org/abs/2310.17816", "description": "This work addresses the problem of automated covariate selection under\nlimited prior knowledge. Given an exposure-outcome pair {X,Y} and a variable\nset Z of unknown causal structure, the Local Discovery by Partitioning (LDP)\nalgorithm partitions Z into subsets defined by their relation to {X,Y}. We\nenumerate eight exhaustive and mutually exclusive partitions of any arbitrary Z\nand leverage this taxonomy to differentiate confounders from other variable\ntypes. LDP is motivated by valid adjustment set identification, but avoids the\npretreatment assumption commonly made by automated covariate selection methods.\nWe provide theoretical guarantees that LDP returns a valid adjustment set for\nany Z that meets sufficient graphical conditions. Under stronger conditions, we\nprove that partition labels are asymptotically correct. The total number of independence\ntests is worst-case quadratic in |Z|, with sub-quadratic runtimes observed\nempirically. We numerically validate our theoretical guarantees on synthetic\nand semi-synthetic graphs. 
Adjustment sets from LDP yield less biased and more\nprecise average treatment effect estimates than baselines, with LDP\noutperforming on confounder recall, test count, and runtime for valid\nadjustment set discovery."}, "http://arxiv.org/abs/2310.17820": {"title": "Sparse Bayesian Multidimensional Item Response Theory", "link": "http://arxiv.org/abs/2310.17820", "description": "Multivariate Item Response Theory (MIRT) is widely sought after by applied\nresearchers looking for interpretable (sparse) explanations underlying response\npatterns in questionnaire data. There is, however, an unmet demand for such\nsparsity discovery tools in practice. Our paper develops a Bayesian platform\nfor binary and ordinal item MIRT which requires minimal tuning and scales well\non relatively large datasets due to its parallelizable features. Bayesian\nmethodology for MIRT models has traditionally relied on MCMC simulation, which\nnot only can be slow in practice, but also often renders exact sparsity recovery\nimpossible without additional thresholding. In this work, we develop a scalable\nBayesian EM algorithm to estimate sparse factor loadings from binary and\nordinal item responses. We address the seemingly insurmountable problem of\nunknown latent factor dimensionality with tools from Bayesian nonparametrics\nwhich enable estimating the number of factors. Rotations to sparsity through\nparameter expansion further enhance convergence and interpretability without\nidentifiability constraints. In our simulation study, we show that our method\nreliably recovers both the factor dimensionality as well as the latent\nstructure on high-dimensional synthetic data even for small samples. We\ndemonstrate the practical usefulness of our approach on two datasets: an\neducational item response dataset and a quality-of-life measurement dataset.\nBoth demonstrations show that our tool yields interpretable estimates,\nfacilitating interesting discoveries that might otherwise go unnoticed under a\npure confirmatory factor analysis setting. We provide easy-to-use software\nwhich is a useful new addition to the MIRT toolkit and which will hopefully\nserve as the go-to method for practitioners."}, "http://arxiv.org/abs/2310.17845": {"title": "A Unified and Optimal Multiple Testing Framework based on rho-values", "link": "http://arxiv.org/abs/2310.17845", "description": "Multiple testing is an important research direction that has gained major\nattention in recent years. Currently, most multiple testing procedures are\ndesigned with p-values or Local false discovery rate (Lfdr) statistics.\nHowever, p-values obtained by applying the probability integral transform to some\nwell-known test statistics often do not incorporate information from the\nalternatives, resulting in suboptimal procedures. On the other hand, Lfdr based\nprocedures can be asymptotically optimal but their guarantee on false discovery\nrate (FDR) control relies on consistent estimation of Lfdr, which is often\ndifficult in practice especially when the incorporation of side information is\ndesirable. In this article, we propose a novel and flexibly constructed class\nof statistics, called rho-values, which combines the merits of both p-values\nand Lfdr while enjoying advantages over methods based on these two types of\nstatistics. Specifically, it unifies these two frameworks and operates in two\nsteps, ranking and thresholding. 
The ranking produced by rho-values mimics that\nproduced by Lfdr statistics, and the strategy for choosing the threshold is\nsimilar to that of p-value based procedures. Therefore, the proposed framework\nguarantees FDR control under weak assumptions; it maintains the integrity of\nthe structural information encoded by the summary statistics and the auxiliary\ncovariates and hence can be asymptotically optimal. We demonstrate the efficacy\nof the new framework through extensive simulations and two data applications."}, "http://arxiv.org/abs/2310.17999": {"title": "Automated threshold selection and associated inference uncertainty for univariate extremes", "link": "http://arxiv.org/abs/2310.17999", "description": "Threshold selection is a fundamental problem in any threshold-based extreme\nvalue analysis. While models are asymptotically motivated, selecting an\nappropriate threshold for finite samples can be difficult through standard\nmethods. Inference can also be highly sensitive to the choice of threshold. Too\nlow a threshold choice leads to bias in the fit of the extreme value model,\nwhile too high a choice leads to unnecessary additional uncertainty in the\nestimation of model parameters. In this paper, we develop a novel methodology\nfor automated threshold selection that directly tackles this bias-variance\ntrade-off. We also develop a method to account for the uncertainty in this\nthreshold choice and propagate this uncertainty through to high quantile\ninference. Through a simulation study, we demonstrate the effectiveness of our\nmethod for threshold selection and subsequent extreme quantile estimation. We\napply our method to the well-known, troublesome example of the River Nidd\ndataset."}, "http://arxiv.org/abs/2310.18027": {"title": "Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors", "link": "http://arxiv.org/abs/2310.18027", "description": "Effective and rapid decision-making from randomized controlled trials (RCTs)\nrequires unbiased and precise treatment effect inferences. Two strategies to\naddress this requirement are to adjust for covariates that are highly\ncorrelated with the outcome, and to leverage historical control information via\nBayes' theorem. We propose a new Bayesian prognostic covariate adjustment\nmethodology, referred to as Bayesian PROCOVA, that combines these two\nstrategies. Covariate adjustment is based on generative artificial intelligence\n(AI) algorithms that construct a digital twin generator (DTG) for RCT\nparticipants. The DTG is trained on historical control data and yields a\ndigital twin (DT) probability distribution for each participant's control\noutcome. The expectation of the DT distribution defines the single covariate\nfor adjustment. Historical control information is leveraged via an additive\nmixture prior with two components: an informative prior probability\ndistribution specified based on historical control data, and a non-informative\nprior distribution. The weight parameter in the mixture has a prior\ndistribution as well, so that the entire additive mixture prior distribution is\ncompletely pre-specifiable and does not involve any information from the RCT.\nWe establish an efficient Gibbs algorithm for sampling from the posterior\ndistribution, and derive closed-form expressions for the posterior mean and\nvariance of the treatment effect in Bayesian PROCOVA, conditional on the weight\nparameter. 
We evaluate the bias control and variance reduction of\nBayesian PROCOVA compared to frequentist prognostic covariate adjustment\n(PROCOVA) via simulation studies that encompass different types of\ndiscrepancies between the historical control and RCT data. Ultimately, Bayesian\nPROCOVA can yield informative treatment effect inferences with fewer control\nparticipants, accelerating effective decision-making."}, "http://arxiv.org/abs/2310.18047": {"title": "Robust Bayesian Inference on Riemannian Submanifold", "link": "http://arxiv.org/abs/2310.18047", "description": "Non-Euclidean spaces routinely arise in modern statistical applications such\nas in medical imaging, robotics, and computer vision, to name a few. While\ntraditional Bayesian approaches are applicable to such settings by considering\nan ambient Euclidean space as the parameter space, we demonstrate the benefits\nof integrating manifold structure into the Bayesian framework, both\ntheoretically and computationally. Moreover, existing Bayesian approaches which\nare designed specifically for manifold-valued parameters are primarily\nmodel-based, which are typically subject to inaccurate uncertainty\nquantification under model misspecification. In this article, we propose a\nrobust model-free Bayesian inference for parameters defined on a Riemannian\nsubmanifold, which is shown to provide valid uncertainty quantification from a\nfrequentist perspective. Computationally, we propose a Markov chain Monte Carlo\nto sample from the posterior on the Riemannian submanifold, where the mixing\ntime, in the large sample regime, is shown to depend only on the intrinsic\ndimension of the parameter space instead of the potentially much larger ambient\ndimension. Our numerical results demonstrate the effectiveness of our approach\non a variety of problems, such as reduced-rank multiple quantile regression,\nprincipal component analysis, and Fr\\'{e}chet mean estimation."}, "http://arxiv.org/abs/2310.18108": {"title": "Transductive conformal inference with adaptive scores", "link": "http://arxiv.org/abs/2310.18108", "description": "Conformal inference is a fundamental and versatile tool that provides\ndistribution-free guarantees for many machine learning tasks. We consider the\ntransductive setting, where decisions are made on a test sample of $m$ new\npoints, giving rise to $m$ conformal $p$-values. {While classical results only\nconcern their marginal distribution, we show that their joint distribution\nfollows a P\\'olya urn model, and establish a concentration inequality for their\nempirical distribution function.} The results hold for arbitrary exchangeable\nscores, including {\\it adaptive} ones that can use the covariates of the\ntest+calibration samples at training stage for increased accuracy. We\ndemonstrate the usefulness of these theoretical results through uniform,\nin-probability guarantees for two machine learning tasks of current interest:\ninterval prediction for transductive transfer learning and novelty detection\nbased on two-class classification."}, "http://arxiv.org/abs/2310.18212": {"title": "Robustness of Algorithms for Causal Structure Learning to Hyperparameter Choice", "link": "http://arxiv.org/abs/2310.18212", "description": "Hyperparameters play a critical role in machine learning. Hyperparameter\ntuning can make the difference between state-of-the-art and poor prediction\nperformance for any algorithm, but it is particularly challenging for structure\nlearning due to its unsupervised nature. 
As a result, hyperparameter tuning is\noften neglected in favour of using the default values provided by a particular\nimplementation of an algorithm. While there have been numerous studies on\nperformance evaluation of causal discovery algorithms, how hyperparameters\naffect individual algorithms, as well as the choice of the best algorithm for a\nspecific problem, has not been studied in depth before. This work addresses\nthis gap by investigating the influence of hyperparameters on causal structure\nlearning tasks. Specifically, we perform an empirical evaluation of\nhyperparameter selection for some seminal learning algorithms on datasets of\nvarying levels of complexity. We find that, while the choice of algorithm\nremains crucial to obtaining state-of-the-art performance, hyperparameter\nselection in ensemble settings strongly influences the choice of algorithm, in\nthat a poor choice of hyperparameters can lead to analysts using algorithms\nwhich do not give state-of-the-art performance for their data."}, "http://arxiv.org/abs/2310.18261": {"title": "Label Shift Estimators for Non-Ignorable Missing Data", "link": "http://arxiv.org/abs/2310.18261", "description": "We consider the problem of estimating the mean of a random variable Y subject\nto non-ignorable missingness, i.e., where the missingness mechanism depends on\nY . We connect the auxiliary proxy variable framework for non-ignorable\nmissingness (West and Little, 2013) to the label shift setting (Saerens et al.,\n2002). Exploiting this connection, we construct an estimator for non-ignorable\nmissing data that uses high-dimensional covariates (or proxies) without the\nneed for a generative model. In synthetic and semi-synthetic experiments, we\nstudy the behavior of the proposed estimator, comparing it to commonly used\nignorable estimators in both well-specified and misspecified settings.\nAdditionally, we develop a score to assess how consistent the data are with the\nlabel shift assumption. We use our approach to estimate disease prevalence\nusing a large health survey, comparing ignorable and non-ignorable approaches.\nWe show that failing to account for non-ignorable missingness can have profound\nconsequences on conclusions drawn from non-representative samples."}, "http://arxiv.org/abs/2102.12698": {"title": "Improving the Hosmer-Lemeshow Goodness-of-Fit Test in Large Models with Replicated Trials", "link": "http://arxiv.org/abs/2102.12698", "description": "The Hosmer-Lemeshow (HL) test is a commonly used global goodness-of-fit (GOF)\ntest that assesses the quality of the overall fit of a logistic regression\nmodel. In this paper, we give results from simulations showing that the type 1\nerror rate (and hence power) of the HL test decreases as model complexity\ngrows, provided that the sample size remains fixed and binary replicates are\npresent in the data. We demonstrate that the generalized version of the HL test\nby Surjanovic et al. (2020) can offer some protection against this power loss.\nWe conclude with a brief discussion explaining the behaviour of the HL test,\nalong with some guidance on how to choose between the two tests."}, "http://arxiv.org/abs/2110.04852": {"title": "Mixture representations and Bayesian nonparametric inference for likelihood ratio ordered distributions", "link": "http://arxiv.org/abs/2110.04852", "description": "In this article, we introduce mixture representations for likelihood ratio\nordered distributions. 
Essentially, the ratio of two probability densities, or\nmass functions, is monotone if and only if one can be expressed as a mixture of\none-sided truncations of the other. To illustrate the practical value of the\nmixture representations, we address the problem of density estimation for\nlikelihood ratio ordered distributions. In particular, we propose a\nnonparametric Bayesian solution which takes advantage of the mixture\nrepresentations. The prior distribution is constructed from Dirichlet process\nmixtures and has large support on the space of pairs of densities satisfying\nthe monotone ratio constraint. Posterior consistency holds under reasonable\nconditions on the prior specification and the true unknown densities. To our\nknowledge, this is the first posterior consistency result in the literature on\norder constrained inference. With a simple modification to the prior\ndistribution, we can test the equality of two distributions against the\nalternative of likelihood ratio ordering. We develop a Markov chain Monte Carlo\nalgorithm for posterior inference and demonstrate the method in a biomedical\napplication."}, "http://arxiv.org/abs/2207.08911": {"title": "Deeply-Learned Generalized Linear Models with Missing Data", "link": "http://arxiv.org/abs/2207.08911", "description": "Deep Learning (DL) methods have dramatically increased in popularity in\nrecent years, with significant growth in their application to supervised\nlearning problems in the biomedical sciences. However, the greater prevalence\nand complexity of missing data in modern biomedical datasets present\nsignificant challenges for DL methods. Here, we provide a formal treatment of\nmissing data in the context of deeply learned generalized linear models, a\nsupervised DL architecture for regression and classification problems. We\npropose a new architecture, \\textit{dlglm}, that is one of the first to be able\nto flexibly account for both ignorable and non-ignorable patterns of\nmissingness in input features and response at training time. We demonstrate\nthrough statistical simulation that our method outperforms existing approaches\nfor supervised learning tasks in the presence of missing not at random (MNAR)\nmissingness. We conclude with a case study of a Bank Marketing dataset from the\nUCI Machine Learning Repository, in which we predict whether clients subscribed\nto a product based on phone survey data. Supplementary materials for this\narticle are available online."}, "http://arxiv.org/abs/2208.04627": {"title": "Causal Effect Identification in Uncertain Causal Networks", "link": "http://arxiv.org/abs/2208.04627", "description": "Causal identification is at the core of the causal inference literature,\nwhere complete algorithms have been proposed to identify causal queries of\ninterest. The validity of these algorithms hinges on the restrictive assumption\nof having access to a correctly specified causal structure. In this work, we\nstudy the setting where a probabilistic model of the causal structure is\navailable. Specifically, the edges in a causal graph exist with uncertainties\nwhich may, for example, represent degree of belief from domain experts.\nAlternatively, the uncertainty about an edge may reflect the confidence of a\nparticular statistical test. The question that naturally arises in this setting\nis: Given such a probabilistic graph and a specific causal effect of interest,\nwhat is the subgraph which has the highest plausibility and for which the\ncausal effect is identifiable? 
We show that answering this question reduces to\nsolving an NP-complete combinatorial optimization problem which we call the\nedge ID problem. We propose efficient algorithms to approximate this problem\nand evaluate them against both real-world networks and randomly generated\ngraphs."}, "http://arxiv.org/abs/2211.00268": {"title": "Stacking designs: designing multi-fidelity computer experiments with target predictive accuracy", "link": "http://arxiv.org/abs/2211.00268", "description": "In an era where scientific experiments can be very costly, multi-fidelity\nemulators provide a useful tool for cost-efficient predictive scientific\ncomputing. For scientific applications, the experimenter is often limited by a\ntight computational budget, and thus wishes to (i) maximize predictive power of\nthe multi-fidelity emulator via a careful design of experiments, and (ii)\nensure this model achieves a desired error tolerance with some notion of\nconfidence. Existing design methods, however, do not jointly tackle objectives\n(i) and (ii). We propose a novel stacking design approach that addresses both\ngoals. A multi-level reproducing kernel Hilbert space (RKHS) interpolator is\nfirst introduced to build the emulator, under which our stacking design\nprovides a sequential approach for designing multi-fidelity runs such that a\ndesired prediction error of $\\epsilon > 0$ is met under regularity assumptions.\nWe then prove a novel cost complexity theorem that, under this multi-level\ninterpolator, establishes a bound on the computation cost (for training data\nsimulation) needed to achieve a prediction bound of $\\epsilon$. This result\nprovides novel insights on conditions under which the proposed multi-fidelity\napproach improves upon a conventional RKHS interpolator which relies on a\nsingle fidelity level. Finally, we demonstrate the effectiveness of stacking\ndesigns in a suite of simulation experiments and an application to finite\nelement analysis."}, "http://arxiv.org/abs/2211.05357": {"title": "Bayesian score calibration for approximate models", "link": "http://arxiv.org/abs/2211.05357", "description": "Scientists continue to develop increasingly complex mechanistic models to\nreflect their knowledge more realistically. Statistical inference using these\nmodels can be challenging since the corresponding likelihood function is often\nintractable and model simulation may be computationally burdensome.\nFortunately, in many of these situations, it is possible to adopt a surrogate\nmodel or approximate likelihood function. It may be convenient to conduct\nBayesian inference directly with the surrogate, but this can result in bias and\npoor uncertainty quantification. In this paper we propose a new method for\nadjusting approximate posterior samples to reduce bias and produce more\naccurate uncertainty quantification. We do this by optimizing a transform of\nthe approximate posterior that maximizes a scoring rule. Our approach requires\nonly a (fixed) small number of complex model simulations and is numerically\nstable. We demonstrate good performance of the new method on several examples\nof increasing complexity."}, "http://arxiv.org/abs/2302.00993": {"title": "Unpaired Multi-Domain Causal Representation Learning", "link": "http://arxiv.org/abs/2302.00993", "description": "The goal of causal representation learning is to find a representation of\ndata that consists of causally related latent variables. 
We consider a setup\nwhere one has access to data from multiple domains that potentially share a\ncausal representation. Crucially, observations in different domains are assumed\nto be unpaired, that is, we only observe the marginal distribution in each\ndomain but not their joint distribution. In this paper, we give sufficient\nconditions for identifiability of the joint distribution and the shared causal\ngraph in a linear setup. Identifiability holds if we can uniquely recover the\njoint distribution and the shared causal representation from the marginal\ndistributions in each domain. We transform our identifiability results into a\npractical method to recover the shared latent causal graph."}, "http://arxiv.org/abs/2303.17277": {"title": "Cross-temporal probabilistic forecast reconciliation: Methodological and practical issues", "link": "http://arxiv.org/abs/2303.17277", "description": "Forecast reconciliation is a post-forecasting process that involves\ntransforming a set of incoherent forecasts into coherent forecasts which\nsatisfy a given set of linear constraints for a multivariate time series. In\nthis paper we extend the current state-of-the-art cross-sectional probabilistic\nforecast reconciliation approach to encompass a cross-temporal framework, where\ntemporal constraints are also applied. Our proposed methodology employs both\nparametric Gaussian and non-parametric bootstrap approaches to draw samples\nfrom an incoherent cross-temporal distribution. To improve the estimation of\nthe forecast error covariance matrix, we propose using multi-step residuals,\nespecially in the time dimension where the usual one-step residuals fail. To\naddress high-dimensionality issues, we present four alternatives for the\ncovariance matrix, where we exploit the two-fold nature (cross-sectional and\ntemporal) of the cross-temporal structure, and introduce the idea of\noverlapping residuals. We assess the effectiveness of the proposed\ncross-temporal reconciliation approaches through a simulation study that\ninvestigates their theoretical and empirical properties and two forecasting\nexperiments, using the Australian GDP and the Australian Tourism Demand\ndatasets. For both applications, the optimal cross-temporal reconciliation\napproaches significantly outperform the incoherent base forecasts in terms of\nthe Continuous Ranked Probability Score and the Energy Score. Overall, the\nresults highlight the potential of the proposed methods to improve the accuracy\nof probabilistic forecasts and to address the challenge of integrating\ndisparate scenarios while coherently taking into account short-term\noperational, medium-term tactical, and long-term strategic planning."}, "http://arxiv.org/abs/2309.07867": {"title": "Beta Diffusion", "link": "http://arxiv.org/abs/2309.07867", "description": "We introduce beta diffusion, a novel generative modeling method that\nintegrates demasking and denoising to generate data within bounded ranges.\nUsing scaled and shifted beta distributions, beta diffusion utilizes\nmultiplicative transitions over time to create both forward and reverse\ndiffusion processes, maintaining beta distributions in both the forward\nmarginals and the reverse conditionals, given the data at any point in time.\nUnlike traditional diffusion-based generative models relying on additive\nGaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is\nmultiplicative and optimized with KL-divergence upper bounds (KLUBs) derived\nfrom the convexity of the KL divergence. 
We demonstrate that the proposed KLUBs\nare more effective for optimizing beta diffusion compared to negative ELBOs,\nwhich can also be derived as the KLUBs of the same KL divergence with its two\narguments swapped. The loss function of beta diffusion, expressed in terms of\nBregman divergence, further supports the efficacy of KLUBs for optimization.\nExperimental results on both synthetic data and natural images demonstrate the\nunique capabilities of beta diffusion in generative modeling of range-bounded\ndata and validate the effectiveness of KLUBs in optimizing diffusion models,\nthereby making them valuable additions to the family of diffusion-based\ngenerative models and the optimization techniques used to train them."}, "http://arxiv.org/abs/2310.18422": {"title": "Inference via Wild Bootstrap and Multiple Imputation under Fine-Gray Models with Incomplete Data", "link": "http://arxiv.org/abs/2310.18422", "description": "Fine-Gray models specify the subdistribution hazards for one out of multiple\ncompeting risks to be proportional. The estimators of parameters and cumulative\nincidence functions under Fine-Gray models have a simpler structure when data\nare censoring-complete than when they are more generally incomplete. This paper\nconsiders the case of incomplete data but it exploits the above-mentioned\nsimpler estimator structure for which there exists a wild bootstrap approach\nfor inferential purposes. The present idea is to link the methodology under\ncensoring-completeness with the more general right-censoring regime with the\nhelp of multiple imputation. In a simulation study, this approach is compared\nto the estimation procedure proposed in the original paper by Fine and Gray\nwhen it is combined with a bootstrap approach. An application to a data set\nabout hospital-acquired infections illustrates the method."}, "http://arxiv.org/abs/2310.18474": {"title": "Robust Bayesian Graphical Regression Models for Assessing Tumor Heterogeneity in Proteomic Networks", "link": "http://arxiv.org/abs/2310.18474", "description": "Graphical models are powerful tools to investigate complex dependency\nstructures in high-throughput datasets. However, most existing graphical models\nmake one of the two canonical assumptions: (i) a homogeneous graph with a\ncommon network for all subjects; or (ii) an assumption of normality especially\nin the context of Gaussian graphical models. Both assumptions are restrictive\nand can fail to hold in certain applications such as proteomic networks in\ncancer. To this end, we propose an approach termed robust Bayesian graphical\nregression (rBGR) to estimate heterogeneous graphs for non-normally distributed\ndata. rBGR is a flexible framework that accommodates non-normality through\nrandom marginal transformations and constructs covariate-dependent graphs to\naccommodate heterogeneity through graphical regression techniques. We formulate\na new characterization of edge dependencies in such models called conditional\nsign independence with covariates along with an efficient posterior sampling\nalgorithm. In simulation studies, we demonstrate that rBGR outperforms existing\ngraphical regression models for data generated under various levels of\nnon-normality in both edge and covariate selection. We use rBGR to assess\nproteomic networks across two cancers: lung and ovarian, to systematically\ninvestigate the effects of immunogenic heterogeneity within tumors. 
Our\nanalyses reveal several important protein-protein interactions that are\ndifferentially impacted by the immune cell abundance; some corroborate existing\nbiological knowledge whereas others are novel findings."}, "http://arxiv.org/abs/2310.18500": {"title": "Designing Randomized Experiments to Predict Unit-Specific Treatment Effects", "link": "http://arxiv.org/abs/2310.18500", "description": "Typically, a randomized experiment is designed to test a hypothesis about the\naverage treatment effect and sometimes hypotheses about treatment effect\nvariation. The results of such a study may then be used to inform policy and\npractice for units not in the study. In this paper, we argue that given this\nuse, randomized experiments should instead be designed to predict unit-specific\ntreatment effects in a well-defined population. We then consider how different\nsampling processes and models affect the bias, variance, and mean squared\nprediction error of these predictions. The results indicate, for example, that\nproblems of generalizability (differences between samples and populations) can\ngreatly affect bias both in predictive models and in measures of error in these\nmodels. We also examine when the average treatment effect estimate outperforms\nunit-specific treatment effect predictive models and implications of this for\nplanning studies."}, "http://arxiv.org/abs/2310.18527": {"title": "Multiple Imputation Method for High-Dimensional Neuroimaging Data", "link": "http://arxiv.org/abs/2310.18527", "description": "Missingness is a common issue for neuroimaging data, and neglecting it in\ndownstream statistical analysis can introduce bias and lead to misguided\ninferential conclusions. It is therefore crucial to conduct appropriate\nstatistical methods to address this issue. While multiple imputation is a\npopular technique for handling missing data, its application to neuroimaging\ndata is hindered by high dimensionality and complex dependence structures of\nmultivariate neuroimaging variables. To tackle this challenge, we propose a\nnovel approach, named High Dimensional Multiple Imputation (HIMA), based on\nBayesian models. HIMA develops a new computational strategy for sampling large\ncovariance matrices based on a robustly estimated posterior mode, which\ndrastically enhances computational efficiency and numerical stability. To\nassess the effectiveness of HIMA, we conducted extensive simulation studies and\nreal-data analysis using neuroimaging data from a Schizophrenia study. HIMA\nshowcases a computational efficiency improvement of over 2000 times when\ncompared to traditional approaches, while also producing imputed datasets with\nimproved precision and stability."}, "http://arxiv.org/abs/2310.18533": {"title": "Evaluating the effects of high-throughput structural neuroimaging predictors on whole-brain functional connectome outcomes via network-based vector-on-matrix regression", "link": "http://arxiv.org/abs/2310.18533", "description": "The joint analysis of multimodal neuroimaging data is critical in the field\nof brain research because it reveals complex interactive relationships between\nneurobiological structures and functions. In this study, we focus on\ninvestigating the effects of structural imaging (SI) features, including white\nmatter micro-structure integrity (WMMI) and cortical thickness, on the whole\nbrain functional connectome (FC) network. 
To achieve this goal, we propose a\nnetwork-based vector-on-matrix regression model to characterize the FC-SI\nassociation patterns. We have developed a novel multi-level dense bipartite and\nclique subgraph extraction method to identify which subsets of spatially\nspecific SI features intensively influence organized FC sub-networks. The\nproposed method can simultaneously identify highly correlated\nstructural-connectomic association patterns and suppress false positive\nfindings while handling millions of potential interactions. We apply our method\nto a multimodal neuroimaging dataset of 4,242 participants from the UK Biobank\nto evaluate the effects of whole-brain WMMI and cortical thickness on the\nresting-state FC. The results reveal that the WMMI on corticospinal tracts and\ninferior cerebellar peduncle significantly affect functional connections of\nsensorimotor, salience, and executive sub-networks with an average correlation\nof 0.81 (p<0.001)."}, "http://arxiv.org/abs/2310.18536": {"title": "Efficient Fully Bayesian Approach to Brain Activity Mapping with Complex-Valued fMRI Data", "link": "http://arxiv.org/abs/2310.18536", "description": "Functional magnetic resonance imaging (fMRI) enables indirect detection of\nbrain activity changes via the blood-oxygen-level-dependent (BOLD) signal.\nConventional analysis methods mainly rely on the real-valued magnitude of these\nsignals. In contrast, research suggests that analyzing both real and imaginary\ncomponents of the complex-valued fMRI (cv-fMRI) signal provides a more holistic\napproach that can increase power to detect neuronal activation. We propose a\nfully Bayesian model for brain activity mapping with cv-fMRI data. Our model\naccommodates temporal and spatial dynamics. Additionally, we propose a\ncomputationally efficient sampling algorithm, which enhances processing speed\nthrough image partitioning. Our approach is shown to be computationally\nefficient via image partitioning and parallel computation while being\ncompetitive with state-of-the-art methods. We support these claims with both\nsimulated numerical studies and an application to real cv-fMRI data obtained\nfrom a finger-tapping experiment."}, "http://arxiv.org/abs/2310.18556": {"title": "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment", "link": "http://arxiv.org/abs/2310.18556", "description": "Design-based causal inference is one of the most widely used frameworks for\ntesting causal null hypotheses or inferring about causal parameters from\nexperimental or observational data. The most significant merit of design-based\ncausal inference is that its statistical validity only comes from the study\ndesign (e.g., randomization design) and does not require assuming any\noutcome-generating distributions or models. Although immune to model\nmisspecification, design-based causal inference can still suffer from other\ndata challenges, among which missingness in outcomes is a significant one.\nHowever, compared with model-based causal inference, outcome missingness in\ndesign-based causal inference is much less studied, largely due to the\nchallenge that design-based causal inference does not assume any outcome\ndistributions/models and, therefore, cannot directly adopt any existing\nmodel-based approaches for missing data. To fill this gap, we systematically\nstudy the missing outcomes problem in design-based causal inference. 
First, we\nuse the potential outcomes framework to clarify the minimal assumption\n(concerning the outcome missingness mechanism) needed for conducting\nfinite-population-exact randomization tests for the null effect (i.e., Fisher's\nsharp null) and that needed for constructing finite-population-exact confidence\nsets with missing outcomes. Second, we propose a general framework called\n``imputation and re-imputation\" for conducting finite-population-exact\nrandomization tests in design-based causal studies with missing outcomes. Our\nframework can incorporate any existing outcome imputation algorithms and\nmeanwhile guarantee finite-population-exact type-I error rate control. Third,\nwe extend our framework to conduct covariate adjustment in an exact\nrandomization test with missing outcomes and to construct\nfinite-population-exact confidence sets with missing outcomes."}, "http://arxiv.org/abs/2310.18611": {"title": "Sequential Kalman filter for fast online changepoint detection in longitudinal health records", "link": "http://arxiv.org/abs/2310.18611", "description": "This article introduces the sequential Kalman filter, a computationally\nscalable approach for online changepoint detection with temporally correlated\ndata. The temporal correlation was not considered in the Bayesian online\nchangepoint detection approach due to the large computational cost. Motivated\nby detecting COVID-19 infections for dialysis patients from massive\nlongitudinal health records with a large number of covariates, we develop a\nscalable approach to detect multiple changepoints from correlated data by\nsequentially stitching Kalman filters of subsequences to compute the joint\ndistribution of the observations, which has linear computational complexity\nwith respect to the number of observations between the last detected\nchangepoint and the current observation at each time point, without\napproximating the likelihood function. Compared to other online changepoint\ndetection methods, simulated experiments show that our approach is more precise\nin detecting single or multiple changes in mean, variance, or correlation for\ntemporally correlated data. Furthermore, we propose a new way to integrate\nclassification and changepoint detection approaches that improve the detection\ndelay and accuracy for detecting COVID-19 infection compared to other\nalternatives."}, "http://arxiv.org/abs/2310.18733": {"title": "Threshold detection under a semiparametric regression model", "link": "http://arxiv.org/abs/2310.18733", "description": "Linear regression models have been extensively considered in the literature.\nHowever, in some practical applications they may not be appropriate all over\nthe range of the covariate. In this paper, a more flexible model is introduced\nby considering a regression model $Y=r(X)+\\varepsilon$ where the regression\nfunction $r(\\cdot)$ is assumed to be linear for large values in the domain of\nthe predictor variable $X$. More precisely, we assume that\n$r(x)=\\alpha_0+\\beta_0 x$ for $x> u_0$, where the value $u_0$ is identified as\nthe smallest value satisfying such a property. A penalized procedure is\nintroduced to estimate the threshold $u_0$. The considered proposal focusses on\na semiparametric approach since no parametric model is assumed for the\nregression function for values smaller than $u_0$. Consistency properties of\nboth the threshold estimator and the estimators of $(\\alpha_0,\\beta_0)$ are\nderived, under mild assumptions. 
Through a numerical study, the small sample\nproperties of the proposed procedure and the importance of introducing a\npenalization are investigated. The analysis of a real data set allows us to\ndemonstrate the usefulness of the penalized estimators."}, "http://arxiv.org/abs/2310.18766": {"title": "Discussion of ''A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks''", "link": "http://arxiv.org/abs/2310.18766", "description": "This review discusses the paper ''A Tale of Two Datasets: Representativeness\nand Generalisability of Inference for Samples of Networks'' by Krivitsky,\nColetti, and Hens, published in the Journal of the American Statistical\nAssociation in 2023."}, "http://arxiv.org/abs/2310.18858": {"title": "Estimating a function of the scale parameter in a gamma distribution with bounded variance", "link": "http://arxiv.org/abs/2310.18858", "description": "Given a gamma population with known shape parameter $\\alpha$, we develop a\ngeneral theory for estimating a function $g(\\cdot)$ of the scale parameter\n$\\beta$ with bounded variance. We begin by defining a sequential sampling\nprocedure with $g(\\cdot)$ satisfying some desired condition in proposing the\nstopping rule, and show the procedure enjoys appealing asymptotic properties.\nAfter these general conditions, we substitute $g(\\cdot)$ with specific\nfunctions including the gamma mean, the gamma variance, the gamma rate\nparameter, and a gamma survival probability as four possible illustrations. For\neach illustration, Monte Carlo simulations are carried out to justify the\nremarkable performance of our proposed sequential procedure. This is further\nsubstantiated with a real data study on weights of newly born babies."}, "http://arxiv.org/abs/2310.18875": {"title": "Feature calibration for computer models", "link": "http://arxiv.org/abs/2310.18875", "description": "Computer model calibration involves using partial and imperfect observations\nof the real world to learn which values of a model's input parameters lead to\noutputs that are consistent with real-world observations. When calibrating\nmodels with high-dimensional output (e.g. a spatial field), it is common to\nrepresent the output as a linear combination of a small set of basis vectors.\nOften, when trying to calibrate to such output, what is important to the\ncredibility of the model is that key emergent physical phenomena are\nrepresented, even if not faithfully or in the right place. In these cases,\ncomparison of model output and data in a linear subspace is inappropriate and\nwill usually lead to poor model calibration. To overcome this, we present\nkernel-based history matching (KHM), generalising the meaning of the technique\nsufficiently to be able to project model outputs and observations into a\nhigher-dimensional feature space, where patterns can be compared without their\nlocation necessarily being fixed. We develop the technical methodology, present\nan expert-driven kernel selection algorithm, and then apply the techniques to\nthe calibration of boundary layer clouds for the French climate model IPSL-CM."}, "http://arxiv.org/abs/2310.18905": {"title": "Incorporating nonparametric methods for estimating causal excursion effects in mobile health with zero-inflated count outcomes", "link": "http://arxiv.org/abs/2310.18905", "description": "In the domain of mobile health, tailoring interventions for real-time\ndelivery is of paramount importance. 
Micro-randomized trials have emerged as\nthe \"gold-standard\" methodology for developing such interventions. Analyzing\ndata from these trials provides insights into the efficacy of interventions and\nthe potential moderation by specific covariates. The \"causal excursion effect\",\na novel class of causal estimand, addresses these inquiries, backed by current\nsemiparametric inference techniques. Yet, existing methods mainly focus on\ncontinuous or binary data, leaving count data largely unexplored. The current\nwork is motivated by the Drink Less micro-randomized trial from the UK, which\nfocuses on a zero-inflated proximal outcome, the number of screen views in the\nsubsequent hour following the intervention decision point. In the current\npaper, we revisit the concept of causal excursion effects, specifically for\nzero-inflated count outcomes, and introduce novel estimation approaches that\nincorporate nonparametric techniques. Bidirectional asymptotics are derived for\nthe proposed estimators. Through extensive simulation studies, we evaluate the\nperformance of the proposed estimators. As an illustration, we also apply the\nproposed methods to the Drink Less trial data."}, "http://arxiv.org/abs/2310.18963": {"title": "Expectile-based conditional tail moments with covariates", "link": "http://arxiv.org/abs/2310.18963", "description": "The expectile, as the minimizer of an asymmetric quadratic loss function, is a\ncoherent risk measure and is helpful for using more information about the\ndistribution of the considered risk. In this paper, we propose a new risk\nmeasure by replacing quantiles by expectiles, called the expectile-based\nconditional tail moment, and focus on the estimation of this new risk measure\nwhen the conditional survival function of the risk, given that the risk exceeds the\nexpectile and given a value of the covariates, is heavy-tailed. Under some\nregularity conditions, asymptotic properties of this new estimator are considered.\nThe extrapolated estimation of the conditional tail moments is also\ninvestigated. These results are illustrated both on simulated data and on a\nreal insurance data set."}, "http://arxiv.org/abs/2310.19043": {"title": "Differentially Private Permutation Tests: Applications to Kernel Methods", "link": "http://arxiv.org/abs/2310.19043", "description": "Recent years have witnessed growing concerns about the privacy of sensitive\ndata. In response to these concerns, differential privacy has emerged as a\nrigorous framework for privacy protection, gaining widespread recognition in\nboth academic and industrial circles. While substantial progress has been made\nin private data analysis, existing methods often suffer from impracticality or\na significant loss of statistical efficiency. This paper aims to alleviate\nthese concerns in the context of hypothesis testing by introducing\ndifferentially private permutation tests. The proposed framework extends\nclassical non-private permutation tests to private settings, maintaining both\nfinite-sample validity and differential privacy in a rigorous manner. The power\nof the proposed test depends on the choice of a test statistic, and we\nestablish general conditions for consistency and non-asymptotic uniform power.\nTo demonstrate the utility and practicality of our framework, we focus on\nreproducing kernel-based test statistics and introduce differentially private\nkernel tests for two-sample and independence testing: dpMMD and dpHSIC. 
The\nproposed kernel tests are straightforward to implement, applicable to various\ntypes of data, and attain minimax optimal power across different privacy\nregimes. Our empirical evaluations further highlight their competitive power\nunder various synthetic and real-world scenarios, emphasizing their practical\nvalue. The code is publicly available to facilitate the implementation of our\nframework."}, "http://arxiv.org/abs/2310.19051": {"title": "A Survey of Methods for Estimating Hurst Exponent of Time Sequence", "link": "http://arxiv.org/abs/2310.19051", "description": "The Hurst exponent is a significant indicator for characterizing the\nself-similarity and long-term memory properties of time sequences. It has wide\napplications in physics, technologies, engineering, mathematics, statistics,\neconomics, psychology and so on. Currently, available methods for estimating\nthe Hurst exponent of time sequences can be divided into different categories:\ntime-domain methods and spectrum-domain methods based on the representation of\ntime sequence, linear regression methods and Bayesian methods based on\nparameter estimation methods. Although various methods are discussed in\nliterature, there are still some deficiencies: the descriptions of the\nestimation algorithms are just mathematics-oriented and the pseudo-codes are\nmissing; the effectiveness and accuracy of the estimation algorithms are not\nclear; the classification of estimation methods is not considered and there is\na lack of guidance for selecting the estimation methods. In this work, the\nemphasis is put on thirteen dominant methods for estimating the Hurst exponent.\nFor the purpose of decreasing the difficulty of implementing the estimation\nmethods with computer programs, the mathematical principles are discussed\nbriefly and the pseudo-codes of algorithms are presented with necessary\ndetails. It is expected that the survey could help the researchers to select,\nimplement and apply the estimation algorithms of interest in practical\nsituations in an easy way."}, "http://arxiv.org/abs/2310.19091": {"title": "Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector", "link": "http://arxiv.org/abs/2310.19091", "description": "Machine Learning (ML) systems are becoming instrumental in the public sector,\nwith applications spanning areas like criminal justice, social welfare,\nfinancial fraud detection, and public health. While these systems offer great\npotential benefits to institutional decision-making processes, such as improved\nefficiency and reliability, they still face the challenge of aligning intricate\nand nuanced policy objectives with the precise formalization requirements\nnecessitated by ML models. In this paper, we aim to bridge the gap between ML\nand public sector decision-making by presenting a comprehensive overview of key\ntechnical challenges where disjunctions between policy goals and ML models\ncommonly arise. We concentrate on pivotal points of the ML pipeline that\nconnect the model to its operational environment, delving into the significance\nof representative training data and highlighting the importance of a model\nsetup that facilitates effective decision-making. 
Additionally, we link these\nchallenges with emerging methodological advancements, encompassing causal ML,\ndomain adaptation, uncertainty quantification, and multi-objective\noptimization, illustrating the path forward for harmonizing ML and public\nsector objectives."}, "http://arxiv.org/abs/2310.19114": {"title": "Sparse Fr\\'echet Sufficient Dimension Reduction with Graphical Structure Among Predictors", "link": "http://arxiv.org/abs/2310.19114", "description": "Fr\\'echet regression has received considerable attention to model\nmetric-space valued responses that are complex and non-Euclidean data, such as\nprobability distributions and vectors on the unit sphere. However, existing\nFr\\'echet regression literature focuses on the classical setting where the\npredictor dimension is fixed, and the sample size goes to infinity. This paper\nproposes sparse Fr\\'echet sufficient dimension reduction with graphical\nstructure among high-dimensional Euclidean predictors. In particular, we\npropose a convex optimization problem that leverages the graphical information\namong predictors and avoids inverting the high-dimensional covariance matrix.\nWe also provide the Alternating Direction Method of Multipliers (ADMM)\nalgorithm to solve the optimization problem. Theoretically, the proposed method\nachieves subspace estimation and variable selection consistency under suitable\nconditions. Extensive simulations and a real data analysis are carried out to\nillustrate the finite-sample performance of the proposed method."}, "http://arxiv.org/abs/2310.19246": {"title": "A spectral regularisation framework for latent variable models designed for single channel applications", "link": "http://arxiv.org/abs/2310.19246", "description": "Latent variable models (LVMs) are commonly used to capture the underlying\ndependencies, patterns, and hidden structure in observed data. Source\nduplication is a by-product of the data hankelisation pre-processing step\ncommon to single channel LVM applications, which hinders practical LVM\nutilisation. In this article, a Python package titled\nspectrally-regularised-LVMs is presented. The proposed package addresses the\nsource duplication issue via the addition of a novel spectral regularisation\nterm. This package provides a framework for spectral regularisation in single\nchannel LVM applications, thereby making it easier to investigate and utilise\nLVMs with spectral regularisation. This is achieved via the use of symbolic or\nexplicit representations of potential LVM objective functions which are\nincorporated into a framework that uses spectral regularisation during the LVM\nparameter estimation process. The objective of this package is to provide a\nconsistent linear LVM optimisation framework which incorporates spectral\nregularisation and caters to single channel time-series applications."}, "http://arxiv.org/abs/2310.19253": {"title": "Flow-based Distributionally Robust Optimization", "link": "http://arxiv.org/abs/2310.19253", "description": "We present a computationally efficient framework, called \\texttt{FlowDRO},\nfor solving flow-based distributionally robust optimization (DRO) problems with\nWasserstein uncertainty sets, when requiring the worst-case distribution (also\ncalled the Least Favorable Distribution, LFD) to be continuous so that the\nalgorithm can be scalable to problems with larger sample sizes and achieve\nbetter generalization capability for the induced robust algorithms. 
To tackle\nthe computationally challenging infinitely dimensional optimization problem, we\nleverage flow-based models, continuous-time invertible transport maps between\nthe data distribution and the target distribution, and develop a Wasserstein\nproximal gradient flow type of algorithm. In practice, we parameterize the\ntransport maps by a sequence of neural networks progressively trained in blocks\nby gradient descent. Our computational framework is general, can handle\nhigh-dimensional data with large sample sizes, and can be useful for various\napplications. We demonstrate its usage in adversarial learning,\ndistributionally robust hypothesis testing, and a new mechanism for data-driven\ndistribution perturbation differential privacy, where the proposed method gives\nstrong empirical performance on real high-dimensional data."}, "http://arxiv.org/abs/2310.19343": {"title": "Quantile Super Learning for independent and online settings with application to solar power forecasting", "link": "http://arxiv.org/abs/2310.19343", "description": "Estimating quantiles of an outcome conditional on covariates is of\nfundamental interest in statistics with broad application in probabilistic\nprediction and forecasting. We propose an ensemble method for conditional\nquantile estimation, Quantile Super Learning, that combines predictions from\nmultiple candidate algorithms based on their empirical performance measured\nwith respect to a cross-validated empirical risk of the quantile loss function.\nWe present theoretical guarantees for both iid and online data scenarios. The\nperformance of our approach for quantile estimation and in forming prediction\nintervals is tested in simulation studies. Two case studies related to solar\nenergy are used to illustrate Quantile Super Learning: in an iid setting, we\npredict the physical properties of perovskite materials for photovoltaic cells,\nand in an online setting we forecast ground solar irradiance based on output\nfrom dynamic weather ensemble models."}, "http://arxiv.org/abs/2310.19433": {"title": "Ordinal classification for interval-valued data and interval-valued functional data", "link": "http://arxiv.org/abs/2310.19433", "description": "The aim of ordinal classification is to predict the ordered labels of the\noutput from a set of observed inputs. Interval-valued data refers to data in\nthe form of intervals. For the first time, interval-valued data and\ninterval-valued functional data are considered as inputs in an ordinal\nclassification problem. Six ordinal classifiers for interval data and\ninterval-valued functional data are proposed. Three of them are parametric, one\nof them is based on ordinal binary decompositions and the other two are based\non ordered logistic regression. The other three methods are based on the use of\ndistances between interval data and kernels on interval data. One of the\nmethods uses the weighted $k$-nearest-neighbor technique for ordinal\nclassification. Another method considers kernel principal component analysis\nplus an ordinal classifier. And the sixth method, which is the method that\nperforms best, uses a kernel-induced ordinal random forest. They are compared\nwith na\\\"ive approaches in an extensive experimental study with synthetic and\noriginal real data sets, about human global development, and weather data. The\nresults show that considering ordering and interval-valued information improves\nthe accuracy. 
The source code and data sets are available at\nhttps://github.com/aleixalcacer/OCFIVD."}, "http://arxiv.org/abs/2310.19435": {"title": "A novel characterization of structures in smooth regression curves: from a viewpoint of persistent homology", "link": "http://arxiv.org/abs/2310.19435", "description": "We characterize structures such as monotonicity, convexity, and modality in\nsmooth regression curves using persistent homology. Persistent homology is a\nkey tool in topological data analysis that detects higher-dimensional\ntopological features such as connected components and holes (cycles or loops)\nin the data. In other words, persistent homology is a multiscale version of\nhomology that characterizes sets based on the connected components and holes.\nWe use super-level sets of functions to extract geometric features via\npersistent homology. In particular, we explore structures in regression curves\nvia the persistent homology of super-level sets of a function, where the\nfunction of interest is the first derivative of the regression function.\n\nIn the course of this study, we extend an existing procedure for estimating\nthe persistent homology of the first derivative of a regression function and\nestablish its consistency. Moreover, as an application of the proposed\nmethodology, we demonstrate that the persistent homology of the derivative of a\nfunction can reveal hidden structures in the function that are not visible from\nthe persistent homology of the function itself. In addition, we illustrate\nthat the proposed procedure can be used to compare the shapes of two or more\nregression curves, which is not possible merely from the persistent homology of\nthe function itself."}, "http://arxiv.org/abs/2310.19519": {"title": "A General Neural Causal Model for Interactive Recommendation", "link": "http://arxiv.org/abs/2310.19519", "description": "Survivor bias in observational data leads the optimization of recommender\nsystems towards local optima. Currently, most solutions re-mine existing\nhuman-system collaboration patterns to maximize longer-term satisfaction by\nreinforcement learning. However, from the causal perspective, mitigating\nsurvivor effects requires answering a counterfactual problem, which is\ngenerally unidentifiable and inestimable. In this work, we propose a neural\ncausal model to achieve counterfactual inference. Specifically, we first build\na learnable structural causal model based on its available graphical\nrepresentations, which qualitatively characterize the preference transitions.\nMitigation of the survivor bias is achieved through counterfactual consistency.\nTo identify the consistency, we use the Gumbel-max function as structural\nconstraints. To estimate the consistency, we apply reinforcement optimizations,\nand use Gumbel-Softmax as a trade-off to get a differentiable function. Both\ntheoretical and empirical studies demonstrate the effectiveness of our\nsolution."}, "http://arxiv.org/abs/2310.19621": {"title": "A Bayesian Methodology for Estimation for Sparse Canonical Correlation", "link": "http://arxiv.org/abs/2310.19621", "description": "It can be challenging to perform an integrative statistical analysis of\nmulti-view high-dimensional data acquired from different experiments on each\nsubject who participated in a joint study. Canonical Correlation Analysis (CCA)\nis a statistical procedure for identifying relationships between such data\nsets. 
In that context, Structured Sparse CCA (ScSCCA) is a rapidly emerging\nmethodological area that aims for robust modeling of the interrelations between\nthe different data modalities by assuming the corresponding CCA directional\nvectors to be sparse. Although it is a rapidly growing area of statistical\nmethodology development, there is a need for developing related methodologies\nin the Bayesian paradigm. In this manuscript, we propose a novel ScSCCA\napproach where we employ a Bayesian infinite factor model and aim to achieve\nrobust estimation by encouraging sparsity in two different levels of the\nmodeling framework. Firstly, we utilize a multiplicative Half-Cauchy process\nprior to encourage sparsity at the level of the latent variable loading\nmatrices. Additionally, we promote further sparsity in the covariance matrix by\nusing graphical horseshoe prior or diagonal structure. We conduct multiple\nsimulations to compare the performance of the proposed method with that of\nother frequently used CCA procedures, and we apply the developed procedures to\nanalyze multi-omics data arising from a breast cancer study."}, "http://arxiv.org/abs/2310.19683": {"title": "An Online Bootstrap for Time Series", "link": "http://arxiv.org/abs/2310.19683", "description": "Resampling methods such as the bootstrap have proven invaluable in the field\nof machine learning. However, the applicability of traditional bootstrap\nmethods is limited when dealing with large streams of dependent data, such as\ntime series or spatially correlated observations. In this paper, we propose a\nnovel bootstrap method that is designed to account for data dependencies and\ncan be executed online, making it particularly suitable for real-time\napplications. This method is based on an autoregressive sequence of\nincreasingly dependent resampling weights. We prove the theoretical validity of\nthe proposed bootstrap scheme under general conditions. We demonstrate the\neffectiveness of our approach through extensive simulations and show that it\nprovides reliable uncertainty quantification even in the presence of complex\ndata dependencies. Our work bridges the gap between classical resampling\ntechniques and the demands of modern data analysis, providing a valuable tool\nfor researchers and practitioners in dynamic, data-rich environments."}, "http://arxiv.org/abs/2310.19787": {"title": "$e^{\\text{RPCA}}$: Robust Principal Component Analysis for Exponential Family Distributions", "link": "http://arxiv.org/abs/2310.19787", "description": "Robust Principal Component Analysis (RPCA) is a widely used method for\nrecovering low-rank structure from data matrices corrupted by significant and\nsparse outliers. These corruptions may arise from occlusions, malicious\ntampering, or other causes for anomalies, and the joint identification of such\ncorruptions with low-rank background is critical for process monitoring and\ndiagnosis. However, existing RPCA methods and their extensions largely do not\naccount for the underlying probabilistic distribution for the data matrices,\nwhich in many applications are known and can be highly non-Gaussian. We thus\npropose a new method called Robust Principal Component Analysis for Exponential\nFamily distributions ($e^{\\text{RPCA}}$), which can perform the desired\ndecomposition into low-rank and sparse matrices when such a distribution falls\nwithin the exponential family. 
We present a novel alternating direction method\nof multiplier optimization algorithm for efficient $e^{\\text{RPCA}}$\ndecomposition. The effectiveness of $e^{\\text{RPCA}}$ is then demonstrated in\ntwo applications: the first for steel sheet defect detection, and the second\nfor crime activity monitoring in the Atlanta metropolitan area."}, "http://arxiv.org/abs/2110.00152": {"title": "ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families", "link": "http://arxiv.org/abs/2110.00152", "description": "The empirical Bayes normal means (EBNM) model is important to many areas of\nstatistics, including (but not limited to) multiple testing, wavelet denoising,\nmultiple linear regression, and matrix factorization. There are several\nexisting software packages that can fit EBNM models under different prior\nassumptions and using different algorithms; however, the differences across\ninterfaces complicate direct comparisons. Further, a number of important prior\nassumptions do not yet have implementations. Motivated by these issues, we\ndeveloped the R package ebnm, which provides a unified interface for\nefficiently fitting EBNM models using a variety of prior assumptions, including\nnonparametric approaches. In some cases, we incorporated existing\nimplementations into ebnm; in others, we implemented new fitting procedures\nwith a focus on speed and numerical stability. To demonstrate the capabilities\nof the unified interface, we compare results using different prior assumptions\nin two extended examples: the shrinkage estimation of baseball statistics; and\nthe matrix factorization of genetics data (via the new R package flashier). In\nsummary, ebnm is a convenient and comprehensive package for performing EBNM\nanalyses under a wide range of prior assumptions."}, "http://arxiv.org/abs/2110.02440": {"title": "Inverse Probability Weighting-based Mediation Analysis for Microbiome Data", "link": "http://arxiv.org/abs/2110.02440", "description": "Mediation analysis is an important tool to study causal associations in\nbiomedical and other scientific areas and has recently gained attention in\nmicrobiome studies. Using a microbiome study of acute myeloid leukemia (AML)\npatients, we investigate whether the effect of induction chemotherapy intensity\nlevels on the infection status is mediated by the microbial taxa abundance. The\nunique characteristics of the microbial mediators -- high-dimensionality,\nzero-inflation, and dependence -- call for new methodological developments in\nmediation analysis. The presence of an exposure-induced mediator-outcome\nconfounder, antibiotic use, further requires a delicate treatment in the\nanalysis. To address these unique challenges in our motivating AML microbiome\nstudy, we propose a novel nonparametric identification formula for the\ninterventional indirect effect (IIE), a measure recently developed for studying\nmediation effects. We develop the corresponding estimation algorithm using the\ninverse probability weighting method. We also test the presence of mediation\neffects via constructing the standard normal bootstrap confidence intervals.\nSimulation studies show that the proposed method has good finite-sample\nperformance in terms of the IIE estimation, and type-I error rate and power of\nthe corresponding test. 
In the AML microbiome study, our findings suggest that\nthe effect of induction chemotherapy intensity levels on infection is mainly\nmediated by patients' gut microbiome."}, "http://arxiv.org/abs/2203.03532": {"title": "E-detectors: a nonparametric framework for sequential change detection", "link": "http://arxiv.org/abs/2203.03532", "description": "Sequential change detection is a classical problem with a variety of\napplications. However, the majority of prior work has been parametric, for\nexample, focusing on exponential families. We develop a fundamentally new and\ngeneral framework for sequential change detection when the pre- and post-change\ndistributions are nonparametrically specified (and thus composite). Our\nprocedures come with clean, nonasymptotic bounds on the average run length\n(frequency of false alarms). In certain nonparametric cases (like sub-Gaussian\nor sub-exponential), we also provide near-optimal bounds on the detection delay\nfollowing a changepoint. The primary technical tool that we introduce is called\nan \\emph{e-detector}, which is composed of sums of e-processes -- a fundamental\ngeneralization of nonnegative supermartingales -- that are started at\nconsecutive times. We first introduce simple Shiryaev-Roberts and CUSUM-style\ne-detectors, and then show how to design their mixtures in order to achieve\nboth statistical and computational efficiency. Our e-detector framework can be\ninstantiated to recover classical likelihood-based procedures for parametric\nproblems, as well as yielding the first change detection method for many\nnonparametric problems. As a running example, we tackle the problem of\ndetecting changes in the mean of a bounded random variable without i.i.d.\nassumptions, with an application to tracking the performance of a basketball\nteam over multiple seasons."}, "http://arxiv.org/abs/2208.02942": {"title": "sparsegl: An R Package for Estimating Sparse Group Lasso", "link": "http://arxiv.org/abs/2208.02942", "description": "The sparse group lasso is a high-dimensional regression technique that is\nuseful for problems whose predictors have a naturally grouped structure and\nwhere sparsity is encouraged at both the group and individual predictor level.\nIn this paper we discuss a new R package for computing such regularized models.\nThe intention is to provide highly optimized solution routines enabling\nanalysis of very large datasets, especially in the context of sparse design\nmatrices."}, "http://arxiv.org/abs/2208.06236": {"title": "Differentially Private Kolmogorov-Smirnov-Type Tests", "link": "http://arxiv.org/abs/2208.06236", "description": "Hypothesis testing is a central problem in statistical analysis, and there is\ncurrently a lack of differentially private tests which are both statistically\nvalid and powerful. In this paper, we develop several new differentially\nprivate (DP) nonparametric hypothesis tests. Our tests are based on\nKolmogorov-Smirnov, Kuiper, Cram\\'er-von Mises, and Wasserstein test\nstatistics, which can all be expressed as a pseudo-metric on empirical\ncumulative distribution functions (ecdfs), and can be used to test hypotheses\non goodness-of-fit, two samples, and paired data. We show that these test\nstatistics have low sensitivity, requiring minimal noise to satisfy DP. In\nparticular, we show that the sensitivity of these test statistics can be\nexpressed in terms of the base sensitivity, which is the pseudo-metric distance\nbetween the ecdfs of adjacent databases and is easily calculated. 
The sampling\ndistributions of our test statistics are distribution-free under the null\nhypothesis, enabling easy computation of $p$-values by Monte Carlo methods. We\nshow that in several settings, especially with small privacy budgets or\nheavy-tailed data, our new DP tests outperform alternative nonparametric DP\ntests."}, "http://arxiv.org/abs/2208.10027": {"title": "Learning Invariant Representations under General Interventions on the Response", "link": "http://arxiv.org/abs/2208.10027", "description": "It has become increasingly common to collect observations of feature\nand response pairs from different environments. As a consequence, one has to\napply learned predictors to data with a different distribution due to\ndistribution shifts. One principled approach is to adopt structural causal\nmodels to describe training and test models, following the invariance principle,\nwhich says that the conditional distribution of the response given its\npredictors remains the same across environments. However, this principle might\nbe violated in practical settings when the response is intervened upon. A natural\nquestion is whether it is still possible to identify other forms of invariance\nto facilitate prediction in unseen environments. To shed light on this\nchallenging scenario, we focus on linear structural causal models (SCMs) and\nintroduce the invariant matching property (IMP), an explicit relation to capture\ninterventions through an additional feature, leading to an alternative form of\ninvariance that enables a unified treatment of general interventions on the\nresponse as well as the predictors. We analyze the asymptotic generalization\nerrors of our method under both the discrete and continuous environment\nsettings, where the continuous case is handled by relating it to the\nsemiparametric varying coefficient models. We present algorithms that show\ncompetitive performance compared to existing methods over various experimental\nsettings, including a COVID dataset."}, "http://arxiv.org/abs/2208.11756": {"title": "Testing Many Constraints in Possibly Irregular Models Using Incomplete U-Statistics", "link": "http://arxiv.org/abs/2208.11756", "description": "We consider the problem of testing a null hypothesis defined by equality and\ninequality constraints on a statistical parameter. Testing such hypotheses can\nbe challenging because the number of relevant constraints may be on the same\norder or even larger than the number of observed samples. Moreover, standard\ndistributional approximations may be invalid due to irregularities in the null\nhypothesis. We propose a general testing methodology that aims to circumvent\nthese difficulties. The constraints are estimated by incomplete U-statistics,\nand we derive critical values by Gaussian multiplier bootstrap. We show that\nthe bootstrap approximation of incomplete U-statistics is valid for kernels\nthat we call mixed degenerate when the number of combinations used to compute\nthe incomplete U-statistic is of the same order as the sample size. It follows\nthat our test controls type I error even in irregular settings. Furthermore,\nthe bootstrap approximation covers high-dimensional settings, making our testing\nstrategy applicable for problems with many constraints. The methodology is\napplicable, in particular, when the constraints to be tested are polynomials in\nU-estimable parameters. 
As an application, we consider goodness-of-fit tests of\nlatent tree models for multivariate data."}, "http://arxiv.org/abs/2210.05538": {"title": "Estimating optimal treatment regimes in survival contexts using an instrumental variable", "link": "http://arxiv.org/abs/2210.05538", "description": "In survival contexts, substantial literature exists on estimating optimal\ntreatment regimes, where treatments are assigned based on personal\ncharacteristics for the purpose of maximizing the survival probability. These\nmethods assume that a set of covariates is sufficient to deconfound the\ntreatment-outcome relationship. Nevertheless, the assumption can be limited in\nobservational studies or randomized trials in which non-adherence occurs. Thus,\nwe propose a novel approach for estimating the optimal treatment regime when\ncertain confounders are not observable and a binary instrumental variable is\navailable. Specifically, via a binary instrumental variable, we propose two\nsemiparametric estimators for the optimal treatment regime by maximizing\nKaplan-Meier-like estimators within a pre-defined class of regimes, one of\nwhich possesses the desirable property of double robustness. Because the\nKaplan-Meier-like estimators are jagged, we incorporate kernel smoothing\nmethods to enhance their performance. Under appropriate regularity conditions,\nthe asymptotic properties are rigorously established. Furthermore, the finite\nsample performance is assessed through simulation studies. Finally, we\nexemplify our method using data from the National Cancer Institute's (NCI)\nprostate, lung, colorectal, and ovarian cancer screening trial."}, "http://arxiv.org/abs/2302.00878": {"title": "The Contextual Lasso: Sparse Linear Models via Deep Neural Networks", "link": "http://arxiv.org/abs/2302.00878", "description": "Sparse linear models are one of several core tools for interpretable machine\nlearning, a field of emerging importance as predictive models permeate\ndecision-making in many domains. Unfortunately, sparse linear models are far\nless flexible as functions of their input features than black-box models like\ndeep neural networks. With this capability gap in mind, we study a not-uncommon\nsituation where the input features dichotomize into two groups: explanatory\nfeatures, which are candidates for inclusion as variables in an interpretable\nmodel, and contextual features, which select from the candidate variables and\ndetermine their effects. This dichotomy leads us to the contextual lasso, a new\nstatistical estimator that fits a sparse linear model to the explanatory\nfeatures such that the sparsity pattern and coefficients vary as a function of\nthe contextual features. The fitting process learns this function\nnonparametrically via a deep neural network. To attain sparse coefficients, we\ntrain the network with a novel lasso regularizer in the form of a projection\nlayer that maps the network's output onto the space of $\\ell_1$-constrained\nlinear models. 
An extensive suite of experiments on real and synthetic data\nsuggests that the learned models, which remain highly transparent, can be\nsparser than the regular lasso without sacrificing the predictive power of a\nstandard deep neural network."}, "http://arxiv.org/abs/2302.02560": {"title": "Causal Estimation of Exposure Shifts with Neural Networks: Evaluating the Health Benefits of Stricter Air Quality Standards in the US", "link": "http://arxiv.org/abs/2302.02560", "description": "In policy research, one of the most critical analytic tasks is to estimate\nthe causal effect of a policy-relevant shift to the distribution of a\ncontinuous exposure/treatment on an outcome of interest. We call this problem\nshift-response function (SRF) estimation. Existing neural network methods\ninvolving robust causal-effect estimators lack theoretical guarantees and\npractical implementations for SRF estimation. Motivated by a key\npolicy-relevant question in public health, we develop a neural network method\nand its theoretical underpinnings to estimate SRFs with robustness and\nefficiency guarantees. We then apply our method to data consisting of 68\nmillion individuals and 27 million deaths across the U.S. to estimate the\ncausal effect from revising the US National Ambient Air Quality Standards\n(NAAQS) for PM 2.5 from 12 $\\mu g/m^3$ to 9 $\\mu g/m^3$. This change has been\nrecently proposed by the US Environmental Protection Agency (EPA). Our goal is\nto estimate, for the first time, the reduction in deaths that would result from\nthis anticipated revision using causal methods for SRFs. Our proposed method,\ncalled {T}argeted {R}egularization for {E}xposure {S}hifts with Neural\n{Net}works (TRESNET), contributes to the neural network literature for causal\ninference in two ways: first, it proposes a targeted regularization loss with\ntheoretical properties that ensure double robustness and achieves asymptotic\nefficiency specific for SRF estimation; second, it enables loss functions from\nthe exponential family of distributions to accommodate non-continuous outcome\ndistributions (such as hospitalization or mortality counts). We complement our\napplication with benchmark experiments that demonstrate TRESNET's broad\napplicability and competitiveness."}, "http://arxiv.org/abs/2303.03502": {"title": "A Semi-Parametric Model Simultaneously Handling Unmeasured Confounding, Informative Cluster Size, and Truncation by Death with a Data Application in Medicare Claims", "link": "http://arxiv.org/abs/2303.03502", "description": "Nearly 300,000 older adults experience a hip fracture every year, the\nmajority of which occur following a fall. Unfortunately, recovery after\nfall-related trauma such as hip fracture is poor, where older adults diagnosed\nwith Alzheimer's Disease and Related Dementia (ADRD) spend a particularly long\ntime in hospitals or rehabilitation facilities during the post-operative\nrecuperation period. Because older adults value functional recovery and\nspending time at home versus facilities as key outcomes after hospitalization,\nidentifying factors that influence days spent at home after hospitalization is\nimperative. While several individual-level factors have been identified, the\ncharacteristics of the treating hospital have recently been identified as\ncontributors. 
However, few methodologically rigorous approaches are available to\nhelp overcome potential sources of bias such as hospital-level unmeasured\nconfounders, informative hospital size, and loss to follow-up due to death.\nThis article develops a useful tool equipped with unsupervised learning to\nsimultaneously handle statistical complexities that are often encountered in\nhealth services research, especially when using large administrative claims\ndatabases. The proposed estimator has a closed form, thus only requiring a light\ncomputational load in a large-scale study. We further develop its asymptotic\nproperties, which can be used to conduct statistical inference in practice.\nExtensive simulation studies demonstrate the superiority of the proposed estimator\ncompared to existing estimators."}, "http://arxiv.org/abs/2305.04113": {"title": "Inferring Covariance Structure from Multiple Data Sources via Subspace Factor Analysis", "link": "http://arxiv.org/abs/2305.04113", "description": "Factor analysis provides a canonical framework for imposing lower-dimensional\nstructure such as sparse covariance in high-dimensional data. High-dimensional\ndata on the same set of variables are often collected under different\nconditions, for instance in reproducing studies across research groups. In such\ncases, it is natural to seek to learn the shared versus condition-specific\nstructure. Existing hierarchical extensions of factor analysis have been\nproposed, but face practical issues including identifiability problems. To\naddress these shortcomings, we propose a class of SUbspace Factor Analysis\n(SUFA) models, which characterize variation across groups at the level of a\nlower-dimensional subspace. We prove that the proposed class of SUFA models\nleads to identifiability of the shared versus group-specific components of the\ncovariance, and study their posterior contraction properties. Taking a Bayesian\napproach, these contributions are developed alongside efficient posterior\ncomputation algorithms. Our sampler fully integrates out latent variables, is\neasily parallelizable and has complexity that does not depend on sample size.\nWe illustrate the methods through application to integration of multiple gene\nexpression datasets relevant to immunology."}, "http://arxiv.org/abs/2305.17570": {"title": "Auditing Fairness by Betting", "link": "http://arxiv.org/abs/2305.17570", "description": "We provide practical, efficient, and nonparametric methods for auditing the\nfairness of deployed classification and regression models. Whereas previous\nwork relies on a fixed sample size, our methods are sequential and allow for\nthe continuous monitoring of incoming data, making them highly amenable to\ntracking the fairness of real-world systems. We also allow the data to be\ncollected by a probabilistic policy as opposed to sampled uniformly from the\npopulation. This enables auditing to be conducted on data gathered for another\npurpose. Moreover, this policy may change over time and different policies may\nbe used on different subpopulations. Finally, our methods can handle\ndistribution shift resulting from either changes to the model or changes in the\nunderlying population. Our approach is based on recent progress in\nanytime-valid inference and game-theoretic statistics, the \"testing by betting\"\nframework in particular. These connections ensure that our methods are\ninterpretable, fast, and easy to implement. 
We demonstrate the efficacy of our\napproach on three benchmark fairness datasets."}, "http://arxiv.org/abs/2306.02948": {"title": "Learning under random distributional shifts", "link": "http://arxiv.org/abs/2306.02948", "description": "Many existing approaches for generating predictions in settings with\ndistribution shift model distribution shifts as adversarial or low-rank in\nsuitable representations. In various real-world settings, however, we might\nexpect shifts to arise through the superposition of many small and random\nchanges in the population and environment. Thus, we consider a class of random\ndistribution shift models that capture arbitrary changes in the underlying\ncovariate space, and dense, random shocks to the relationship between the\ncovariates and the outcomes. In this setting, we characterize the benefits and\ndrawbacks of several alternative prediction strategies: the standard approach\nthat directly predicts the long-term outcome of interest, the proxy approach\nthat directly predicts a shorter-term proxy outcome, and a hybrid approach that\nutilizes both the long-term policy outcome and (shorter-term) proxy outcome(s).\nWe show that the hybrid approach is robust to the strength of the distribution\nshift and the proxy relationship. We apply this method to datasets in two\nhigh-impact domains: asylum-seeker assignment and early childhood education. In\nboth settings, we find that the proposed approach results in substantially\nlower mean-squared error than current approaches."}, "http://arxiv.org/abs/2306.05751": {"title": "Advancing Counterfactual Inference through Quantile Regression", "link": "http://arxiv.org/abs/2306.05751", "description": "The capacity to address counterfactual \"what if\" inquiries is crucial for\nunderstanding and making use of causal influences. Traditional counterfactual\ninference usually assumes the availability of a structural causal model. Yet,\nin practice, such a causal model is often unknown and may not be identifiable.\nThis paper aims to perform reliable counterfactual inference based on the\n(learned) qualitative causal structure and observational data, without\nnecessitating a given causal model or even the direct estimation of conditional\ndistributions. We re-cast counterfactual reasoning as an extended quantile\nregression problem, implemented with deep neural networks to capture general\ncausal relationships and data distributions. The proposed approach offers\nsuperior statistical efficiency compared to existing ones, and further, it\nenhances the potential for generalizing the estimated counterfactual outcomes\nto previously unseen data, providing an upper bound on the generalization\nerror. Empirical results conducted on multiple datasets offer compelling\nsupport for our theoretical assertions."}, "http://arxiv.org/abs/2306.06155": {"title": "Intensity Profile Projection: A Framework for Continuous-Time Representation Learning for Dynamic Networks", "link": "http://arxiv.org/abs/2306.06155", "description": "We present a new representation learning framework, Intensity Profile\nProjection, for continuous-time dynamic network data. Given triples $(i,j,t)$,\neach representing a time-stamped ($t$) interaction between two entities\n($i,j$), our procedure returns a continuous-time trajectory for each node,\nrepresenting its behaviour over time. The framework consists of three stages:\nestimating pairwise intensity functions, e.g. 
via kernel smoothing; learning a\nprojection which minimises a notion of intensity reconstruction error; and\nconstructing evolving node representations via the learned projection. The\ntrajectories satisfy two properties, known as structural and temporal\ncoherence, which we see as fundamental for reliable inference. Moreover, we\ndevelop estimation theory providing tight control on the error of any estimated\ntrajectory, indicating that the representations could even be used in quite\nnoise-sensitive follow-on analyses. The theory also elucidates the role of\nsmoothing as a bias-variance trade-off, and shows how we can reduce the level\nof smoothing as the signal-to-noise ratio increases on account of the algorithm\n`borrowing strength' across the network."}, "http://arxiv.org/abs/2306.08777": {"title": "MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting", "link": "http://arxiv.org/abs/2306.08777", "description": "We propose novel statistics which maximise the power of a two-sample test\nbased on the Maximum Mean Discrepancy (MMD), by adapting over the set of\nkernels used in defining it. For finite sets, this reduces to combining\n(normalised) MMD values under each of these kernels via a weighted soft\nmaximum. Exponential concentration bounds are proved for our proposed\nstatistics under the null and alternative. We further show how these kernels\ncan be chosen in a data-dependent but permutation-independent way, in a\nwell-calibrated test, avoiding data splitting. This technique applies more\nbroadly to general permutation-based MMD testing, and includes the use of deep\nkernels with features learnt using unsupervised models such as auto-encoders.\nWe highlight the applicability of our MMD-FUSE test on both synthetic\nlow-dimensional and real-world high-dimensional data, and compare its\nperformance in terms of power against current state-of-the-art kernel tests."}, "http://arxiv.org/abs/2306.09335": {"title": "Class-Conditional Conformal Prediction with Many Classes", "link": "http://arxiv.org/abs/2306.09335", "description": "Standard conformal prediction methods provide a marginal coverage guarantee,\nwhich means that for a random test point, the conformal prediction set contains\nthe true label with a user-specified probability. In many classification\nproblems, we would like to obtain a stronger guarantee: that for test points of\na specific class, the prediction set contains the true label with the same\nuser-chosen probability. For the latter goal, existing conformal prediction\nmethods do not work well when there is a limited amount of labeled data per\nclass, as is often the case in real applications where the number of classes is\nlarge. We propose a method called clustered conformal prediction that clusters\ntogether classes having \"similar\" conformal scores and performs conformal\nprediction at the cluster level. Based on empirical evaluation across four\nimage data sets with many (up to 1000) classes, we find that clustered\nconformal prediction typically outperforms existing methods in terms of class-conditional\ncoverage and set size metrics."}, "http://arxiv.org/abs/2306.11839": {"title": "Should I Stop or Should I Go: Early Stopping with Heterogeneous Populations", "link": "http://arxiv.org/abs/2306.11839", "description": "Randomized experiments often need to be stopped prematurely due to the\ntreatment having an unintended harmful effect. 
Existing methods that determine\nwhen to stop an experiment early are typically applied to the data in aggregate\nand do not account for treatment effect heterogeneity. In this paper, we study\nthe early stopping of experiments for harm on heterogeneous populations. We\nfirst establish that current methods often fail to stop experiments when the\ntreatment harms a minority group of participants. We then use causal machine\nlearning to develop CLASH, the first broadly applicable method for\nheterogeneous early stopping. We demonstrate CLASH's performance on simulated\nand real data and show that it yields effective early stopping for both\nclinical trials and A/B tests."}, "http://arxiv.org/abs/2307.02520": {"title": "Conditional independence testing under misspecified inductive biases", "link": "http://arxiv.org/abs/2307.02520", "description": "Conditional independence (CI) testing is a fundamental and challenging task\nin modern statistics and machine learning. Many modern methods for CI testing\nrely on powerful supervised learning methods to learn regression functions or\nBayes predictors as an intermediate step; we refer to this class of tests as\nregression-based tests. Although these methods are guaranteed to control Type-I\nerror when the supervised learning methods accurately estimate the regression\nfunctions or Bayes predictors of interest, their behavior is less understood\nwhen they fail due to misspecified inductive biases; in other words, when the\nemployed models are not flexible enough or when the training algorithm does not\ninduce the desired predictors. In this work, we study the performance of\nregression-based CI tests under misspecified inductive biases. Namely, we\npropose new approximations or upper bounds for the testing errors of three\nregression-based tests that depend on misspecification errors. Moreover, we\nintroduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI\ntest robust against misspecified inductive biases. Finally, we conduct\nexperiments with artificial and real data, showcasing the usefulness of our\ntheory and methods."}, "http://arxiv.org/abs/2308.03801": {"title": "On problematic practice of using normalization in Self-modeling/Multivariate Curve Resolution (S/MCR)", "link": "http://arxiv.org/abs/2308.03801", "description": "This paper briefly deals with the varying degrees of misuse of normalization in\nself-modeling/multivariate curve resolution (S/MCR) practice. The importance of\nthe correct use of ODE solvers is elucidated, and apt kinetic illustrations are\nprovided. The new terms external and internal normalization are defined and\ninterpreted. The problem of the reducibility of a matrix is touched upon. Improper\ngeneralizations and developments of normalization-based methods are cited as\nexamples. The position of the extreme values of the signal contribution\nfunction is clarified. An executable notebook was created with the Matlab Live Editor\nfor algorithmic explanations and depictions."}, "http://arxiv.org/abs/2308.05373": {"title": "Conditional Independence Testing for Discrete Distributions: Beyond $\\chi^2$- and $G$-tests", "link": "http://arxiv.org/abs/2308.05373", "description": "This paper is concerned with the problem of conditional independence testing\nfor discrete data. In recent years, researchers have shed new light on this\nfundamental problem, emphasizing finite-sample optimality. The non-asymptotic\nviewpoint adopted in these works has led to novel conditional independence\ntests that enjoy certain optimality under various regimes. 
Despite their\nattractive theoretical properties, the considered tests are not necessarily\npractical, relying on a Poissonization trick and unspecified constants in their\ncritical values. In this work, we attempt to bridge the gap between theory and\npractice by reproving optimality without Poissonization and calibrating tests\nusing Monte Carlo permutations. Along the way, we also prove that classical\nasymptotic $\\chi^2$- and $G$-tests are notably sub-optimal in a\nhigh-dimensional regime, which justifies the demand for new tools. Our\ntheoretical results are complemented by experiments on both simulated and\nreal-world datasets. Accompanying this paper is an R package UCI that\nimplements the proposed tests."}, "http://arxiv.org/abs/2309.03875": {"title": "Network Sampling Methods for Estimating Social Networks, Population Percentages, and Totals of People Experiencing Unsheltered Homelessness", "link": "http://arxiv.org/abs/2309.03875", "description": "In this article, we propose using network-based sampling strategies to\nestimate the number of unsheltered people experiencing homelessness within a\ngiven administrative service unit, known as a Continuum of Care. We demonstrate\nthe effectiveness of network sampling methods to solve this problem. Here, we\nfocus on Respondent Driven Sampling (RDS), which has been shown to provide\nunbiased or low-biased estimates of totals and proportions for hard-to-reach\npopulations in contexts where a sampling frame (e.g., housing addresses) is not\navailable. To make the RDS estimator work for estimating the total number of\npeople living unsheltered, we introduce a new method that leverages\nadministrative data from the HUD-mandated Homeless Management Information\nSystem (HMIS). The HMIS provides high-quality counts and demographics for\npeople experiencing homelessness who sleep in emergency shelters. We then\ndemonstrate this method using network data collected in Nashville, TN, combined\nwith simulation methods to illustrate the efficacy of this approach and\nintroduce a method for performing a power analysis to find the optimal sample\nsize in this setting. We conclude with the RDS unsheltered PIT count conducted\nby King County Regional Homelessness Authority in 2022 (data publicly available\non the HUD website) and perform a comparative analysis between the 2022 RDS\nestimate of unsheltered people experiencing homelessness and an ARIMA forecast\nof the visual unsheltered PIT count. Finally, we discuss how this method works\nfor estimating the unsheltered population of people experiencing homelessness\nand future areas of research."}, "http://arxiv.org/abs/2310.19973": {"title": "Unified Enhancement of Privacy Bounds for Mixture Mechanisms via $f$-Differential Privacy", "link": "http://arxiv.org/abs/2310.19973", "description": "Differentially private (DP) machine learning algorithms incur many sources of\nrandomness, such as random initialization, random batch subsampling, and\nshuffling. However, such randomness is difficult to take into account when\nproving differential privacy bounds because it induces mixture distributions\nfor the algorithm's output that are difficult to analyze. This paper focuses on\nimproving privacy bounds for shuffling models and one-iteration differentially\nprivate gradient descent (DP-GD) with random initializations using $f$-DP. 
We\nderive a closed-form expression of the trade-off function for shuffling models\nthat outperforms the most up-to-date results based on $(\\epsilon,\\delta)$-DP.\nMoreover, we investigate the effects of random initialization on the privacy of\none-iteration DP-GD. Our numerical computations of the trade-off function\nindicate that random initialization can enhance the privacy of DP-GD. Our\nanalysis of $f$-DP guarantees for these mixture mechanisms relies on an\ninequality for trade-off functions introduced in this paper. This inequality\nimplies the joint convexity of $F$-divergences. Finally, we study an $f$-DP\nanalog of the advanced joint convexity of the hockey-stick divergence related\nto $(\\epsilon,\\delta)$-DP and apply it to analyze the privacy of mixture\nmechanisms."}, "http://arxiv.org/abs/2310.19985": {"title": "Modeling random directions of changes in simplex-valued data", "link": "http://arxiv.org/abs/2310.19985", "description": "We propose models and algorithms for learning about random directions in\nsimplex-valued data. The models are applied to the study of income level\nproportions and their changes over time in a geostatistical area. There are\nseveral notable challenges in the analysis of simplex-valued data: the\nmeasurements must respect the simplex constraint and the changes exhibit\nspatiotemporal smoothness and may be heterogeneous. To that end, we propose\nBayesian models that draw from and expand upon building blocks in circular and\nspatial statistics by exploiting a suitable transformation for the\nsimplex-valued data. Our models also account for spatial correlation across\nlocations in the simplex and the heterogeneous patterns via mixture modeling.\nWe describe some properties of the models and model fitting via MCMC\ntechniques. Our models and methods are applied to an analysis of movements and\ntrends of income categories using the Home Mortgage Disclosure Act data."}, "http://arxiv.org/abs/2310.19988": {"title": "Counterfactual fairness for small subgroups", "link": "http://arxiv.org/abs/2310.19988", "description": "While methods for measuring and correcting differential performance in risk\nprediction models have proliferated in recent years, most existing techniques\ncan only be used to assess fairness across relatively large subgroups. The\npurpose of algorithmic fairness efforts is often to redress discrimination\nagainst groups that are both marginalized and small, so this sample size\nlimitation often prevents existing techniques from accomplishing their main\naim. We take a three-pronged approach to address the problem of quantifying\nfairness with small subgroups. First, we propose new estimands built on the\n\"counterfactual fairness\" framework that leverage information across groups.\nSecond, we estimate these quantities using a larger volume of data than\nexisting techniques. Finally, we propose a novel data borrowing approach to\nincorporate \"external data\" that lacks outcomes and predictions but contains\ncovariate and group membership information. This less stringent requirement on\nthe external data allows for more possibilities for external data sources. 
We\ndemonstrate the practical application of our estimators to a risk prediction model\nused by a major Midwestern health system during the COVID-19 pandemic."}, "http://arxiv.org/abs/2310.20058": {"title": "New Asymptotic Limit Theory and Inference for Monotone Regression", "link": "http://arxiv.org/abs/2310.20058", "description": "Nonparametric regression problems with qualitative constraints such as\nmonotonicity or convexity are ubiquitous in applications. For example, in\npredicting the yield of a factory in terms of the number of labor hours, the\nmonotonicity of the conditional mean function is a natural constraint. One can\nestimate a monotone conditional mean function using nonparametric least squares\nestimation, which involves no tuning parameters. Several interesting properties\nof the isotonic LSE are known, including its rate of convergence, adaptivity\nproperties, and pointwise asymptotic distribution. However, we believe that the\nfull richness of the asymptotic limit theory has not been explored in the\nliterature; we do so in this paper. Moreover, the inference problem is not\nfully settled. In this paper, we present some new results for monotone\nregression, including an extension of existing results to triangular arrays, and\nprovide asymptotically valid confidence intervals that are uniformly valid over\na large class of distributions."}, "http://arxiv.org/abs/2310.20075": {"title": "Meek Separators and Their Applications in Targeted Causal Discovery", "link": "http://arxiv.org/abs/2310.20075", "description": "Learning causal structures from interventional data is a fundamental problem\nwith broad applications across various fields. While many previous works have\nfocused on recovering the entire causal graph, in practice, there are scenarios\nwhere learning only part of the causal graph suffices. This is called\n$targeted$ causal discovery. In our work, we focus on two such well-motivated\nproblems: subset search and causal matching. We aim to minimize the number of\ninterventions in both cases.\n\nTowards this, we introduce the $Meek~separator$, which is a subset of\nvertices that, when intervened, decomposes the remaining unoriented edges into\nsmaller connected components. We then present an efficient algorithm to find\nMeek separators of small size. Such a procedure is helpful in\ndesigning various divide-and-conquer-based approaches. In particular, we\npropose two randomized algorithms that achieve logarithmic approximation for\nsubset search and causal matching, respectively. Our results provide the first\nknown average-case provable guarantees for both problems. We believe that this\nopens up possibilities to design near-optimal methods for many other targeted\ncausal structure learning problems arising from various applications."}, "http://arxiv.org/abs/2310.20087": {"title": "PAM-HC: A Bayesian Nonparametric Construction of Hybrid Control for Randomized Clinical Trials Using External Data", "link": "http://arxiv.org/abs/2310.20087", "description": "It is highly desirable to borrow information from external data to augment a\ncontrol arm in a randomized clinical trial, especially in settings where the\nsample size for the control arm is limited. However, a main challenge in\nborrowing information from external data is to accommodate potentially\nheterogeneous subpopulations across the external and trial data. 
We apply a\nBayesian nonparametric model called the Plaid Atoms Model (PAM) to identify\noverlapping and unique subpopulations across datasets, with which we restrict\nthe information borrowing to the common subpopulations. This forms a hybrid\ncontrol (HC) that leads to more precise estimation of treatment effects.\nSimulation studies demonstrate the robustness of the new method, and an\napplication to an Atopic Dermatitis dataset shows improved treatment effect\nestimation."}, "http://arxiv.org/abs/2310.20088": {"title": "Optimal transport representations and functional principal components for distribution-valued processes", "link": "http://arxiv.org/abs/2310.20088", "description": "We develop statistical models for samples of distribution-valued stochastic\nprocesses through time-varying optimal transport process representations under\nthe Wasserstein metric when the values of the process are univariate\ndistributions. While functional data analysis provides a toolbox for the\nanalysis of samples of real- or vector-valued processes, there is at present no\ncoherent statistical methodology available for samples of distribution-valued\nprocesses, which are increasingly encountered in data analysis. To address the\nneed for such methodology, we introduce a transport model for samples of\ndistribution-valued stochastic processes that implements an intrinsic approach\nwhereby distributions are represented by optimal transports. Substituting\ntransports for distributions addresses the challenge of centering\ndistribution-valued processes and leads to a useful and interpretable\nrepresentation of each realized process by an overall transport and a\nreal-valued trajectory, utilizing a scalar multiplication operation for\ntransports. This representation facilitates a connection to Gaussian processes\nthat proves useful, especially for the case where the distribution-valued\nprocesses are only observed on a sparse grid of time points. We study the\nconvergence of the key components of the proposed representation to their\npopulation targets and demonstrate the practical utility of the proposed\napproach through simulations and application examples."}, "http://arxiv.org/abs/2310.20182": {"title": "Explicit Form of the Asymptotic Variance Estimator for IPW-type Estimators of Certain Estimands", "link": "http://arxiv.org/abs/2310.20182", "description": "Confidence intervals (CIs) for the IPW estimators of the ATT and ATO might not\nalways yield conservative CIs when using the 'robust sandwich variance'\nestimator. In this manuscript, we identify scenarios where this variance\nestimator can be employed to derive conservative CIs. Specifically, for the\nATT, a conservative CI can be derived when there's a homogeneous treatment\neffect or the interaction effect surpasses the effect from the covariates\nalone. For the ATO, conservative CIs can be derived under certain conditions,\nsuch as when there are homogeneous treatment effects, when there exist\nsignificant treatment-confounder interactions, or when there's a large number\nof members in the control groups."}, "http://arxiv.org/abs/2310.20294": {"title": "Robust nonparametric regression based on deep ReLU neural networks", "link": "http://arxiv.org/abs/2310.20294", "description": "In this paper, we consider robust nonparametric regression using deep neural\nnetworks with ReLU activation function. 
While several existing theoretically\njustified methods are geared towards robustness against identical heavy-tailed\nnoise distributions, the rise of adversarial attacks has emphasized the\nimportance of safeguarding estimation procedures against systematic\ncontamination. We approach this statistical issue by shifting our focus towards\nestimating conditional distributions. To address it robustly, we introduce a\nnovel estimation procedure based on $\\ell$-estimation. Under a mild model\nassumption, we establish general non-asymptotic risk bounds for the resulting\nestimators, showcasing their robustness against contamination, outliers, and\nmodel misspecification. We then delve into the application of our approach\nusing deep ReLU neural networks. When the model is well-specified and the\nregression function belongs to an $\\alpha$-H\\\"older class, employing\n$\\ell$-type estimation on suitable networks enables the resulting estimators to\nachieve the minimax optimal rate of convergence. Additionally, we demonstrate\nthat deep $\\ell$-type estimators can circumvent the curse of dimensionality by\nassuming the regression function closely resembles the composition of several\nH\\\"older functions. To attain this, new deep fully-connected ReLU neural\nnetworks have been designed to approximate this composition class. This\napproximation result can be of independent interest."}, "http://arxiv.org/abs/2310.20376": {"title": "Mixture modeling via vectors of normalized independent finite point processes", "link": "http://arxiv.org/abs/2310.20376", "description": "Statistical modeling in presence of hierarchical data is a crucial task in\nBayesian statistics. The Hierarchical Dirichlet Process (HDP) represents the\nutmost tool to handle data organized in groups through mixture modeling.\nAlthough the HDP is mathematically tractable, its computational cost is\ntypically demanding, and its analytical complexity represents a barrier for\npractitioners. The present paper conceives a mixture model based on a novel\nfamily of Bayesian priors designed for multilevel data and obtained by\nnormalizing a finite point process. A full distribution theory for this new\nfamily and the induced clustering is developed, including tractable expressions\nfor marginal, posterior and predictive distributions. Efficient marginal and\nconditional Gibbs samplers are designed for providing posterior inference. The\nproposed mixture model overcomes the HDP in terms of analytical feasibility,\nclustering discovery, and computational time. The motivating application comes\nfrom the analysis of shot put data, which contains performance measurements of\nathletes across different seasons. In this setting, the proposed model is\nexploited to induce clustering of the observations across seasons and athletes.\nBy linking clusters across seasons, similarities and differences in athlete's\nperformances are identified."}, "http://arxiv.org/abs/2310.20409": {"title": "Detection of nonlinearity, discontinuity and interactions in generalized regression models", "link": "http://arxiv.org/abs/2310.20409", "description": "In generalized regression models the effect of continuous covariates is\ncommonly assumed to be linear. This assumption, however, may be too restrictive\nin applications and may lead to biased effect estimates and decreased\npredictive ability. 
While a multitude of alternatives for the flexible modeling\nof continuous covariates have been proposed, methods that provide guidance for\nchoosing a suitable functional form are still limited. To address this issue,\nwe propose a detection algorithm that evaluates several approaches for modeling\ncontinuous covariates and guides practitioners to choose the most appropriate\nalternative. The algorithm utilizes a unified framework for tree-structured\nmodeling, which makes the results easily interpretable. We assessed the\nperformance of the algorithm by conducting a simulation study. To illustrate\nthe proposed algorithm, we analyzed data from patients suffering from chronic\nkidney disease."}, "http://arxiv.org/abs/2310.20450": {"title": "Safe Testing for Large-Scale Experimentation Platforms", "link": "http://arxiv.org/abs/2310.20450", "description": "In the past two decades, AB testing has proliferated to optimise products in\ndigital domains. Traditional AB tests use fixed-horizon testing, determining\nthe sample size of the experiment and continuing until the experiment has\nconcluded. However, due to the feedback provided by modern data infrastructure,\nexperimenters may make incorrect decisions based on preliminary results of the\ntest. For this reason, anytime-valid inference (AVI) is seeing increased\nadoption as the modern experimenter's method for rapid decision-making in the\nworld of data streaming.\n\nThis work focuses on Safe Testing, a novel framework for experimentation that\nenables continuous analysis without elevating the risk of incorrect\nconclusions. There exist safe testing equivalents of many common statistical\ntests, including the z-test, the t-test, and the proportion test. We compare\nthe efficacy of safe tests against classical tests and another method for AVI,\nthe mixture sequential probability ratio test (mSPRT). Comparisons are\nconducted first on simulated data and then on real-world data from a large\ntechnology company, Vinted, a European online marketplace for second-hand\nclothing. Our findings indicate that safe tests require fewer samples to detect\nsignificant effects, supporting their potential for broader adoption."}, "http://arxiv.org/abs/2310.20460": {"title": "Aggregating Dependent Signals with Heavy-Tailed Combination Tests", "link": "http://arxiv.org/abs/2310.20460", "description": "Combining dependent p-values to evaluate the global null hypothesis presents\na longstanding challenge in statistical inference, particularly when\naggregating results from diverse methods to boost signal detection. P-value\ncombination tests using heavy-tailed distribution-based transformations, such\nas the Cauchy combination test and the harmonic mean p-value, have recently\ngarnered significant interest for their potential to efficiently handle\narbitrary p-value dependencies. Despite their growing popularity in practical\napplications, there is a gap in comprehensive theoretical and empirical\nevaluations of these methods. This paper conducts an extensive investigation,\nrevealing that, theoretically, while these combination tests are asymptotically\nvalid for pairwise quasi-asymptotically independent test statistics, such as\nbivariate normal variables, they are also asymptotically equivalent to the\nBonferroni test under the same conditions. However, extensive simulations\nunveil their practical utility, especially in scenarios where stringent type-I\nerror control is not necessary and signals are dense. 
Both the heaviness of the\ndistribution and its support substantially impact the tests' non-asymptotic\nvalidity and power, and we recommend using a truncated Cauchy distribution in\npractice. Moreover, we show that under the violation of quasi-asymptotic\nindependence among test statistics, these tests remain valid and, in fact, can\nbe considerably less conservative than the Bonferroni test. We also present two\ncase studies in genetics and genomics, showcasing the potential of the\ncombination tests to significantly enhance statistical power while effectively\ncontrolling type-I errors."}, "http://arxiv.org/abs/2310.20483": {"title": "Measuring multidimensional heterogeneity in emergent social phenomena", "link": "http://arxiv.org/abs/2310.20483", "description": "Measuring inequalities in a multidimensional framework is a challenging\nproblem which is common to most fields of science and engineering. Nevertheless,\ndespite the enormous amount of research illustrating the fields of\napplication of inequality indices, and of the Gini index in particular, very\nfew consider the case of a multidimensional variable. In this paper, we\nconsider in some detail a new inequality index, based on the Fourier\ntransform, that can be fruitfully applied to measure the degree of\ninhomogeneity of multivariate probability distributions. This index exhibits a\nnumber of interesting properties that make it very promising in quantifying the\ndegree of inequality in data sets of complex and multifaceted social phenomena."}, "http://arxiv.org/abs/2310.20537": {"title": "Directed Cyclic Graph for Causal Discovery from Multivariate Functional Data", "link": "http://arxiv.org/abs/2310.20537", "description": "Discovering causal relationships using multivariate functional data has\nreceived a significant amount of attention very recently. In this article, we\nintroduce a functional linear structural equation model for causal structure\nlearning when the underlying graph involving the multivariate functions may\nhave cycles. To enhance interpretability, our model involves a low-dimensional\ncausal embedded space such that all the relevant causal information in the\nmultivariate functional data is preserved in this lower-dimensional subspace.\nWe prove that the proposed model is causally identifiable under standard\nassumptions that are often made in the causal discovery literature. To carry\nout inference of our model, we develop a fully Bayesian framework with suitable\nprior specifications and uncertainty quantification through posterior\nsummaries. We illustrate the superior performance of our method over existing\nmethods in terms of causal graph estimation through extensive simulation\nstudies. We also demonstrate the proposed method using a brain EEG dataset."}, "http://arxiv.org/abs/2310.20697": {"title": "Text-Transport: Toward Learning Causal Effects of Natural Language", "link": "http://arxiv.org/abs/2310.20697", "description": "As language technologies gain prominence in real-world settings, it is\nimportant to understand how changes to language affect reader perceptions. This\ncan be formalized as the causal effect of varying a linguistic attribute (e.g.,\nsentiment) on a reader's response to the text. In this paper, we introduce\nText-Transport, a method for estimation of causal effects from natural language\nunder any text distribution. 
Current approaches for valid causal effect\nestimation require strong assumptions about the data, meaning the data from\nwhich one can estimate valid causal effects often is not representative of the\nactual target domain of interest. To address this issue, we leverage the notion\nof distribution shift to describe an estimator that transports causal effects\nbetween domains, bypassing the need for strong assumptions in the target\ndomain. We derive statistical guarantees on the uncertainty of this estimator,\nand we report empirical results and analyses that support the validity of\nText-Transport across data settings. Finally, we use Text-Transport to study a\nrealistic setting--hate speech on social media--in which causal effects do\nshift significantly between text domains, demonstrating the necessity of\ntransport when conducting causal inference on natural language."}, "http://arxiv.org/abs/2203.15009": {"title": "DAMNETS: A Deep Autoregressive Model for Generating Markovian Network Time Series", "link": "http://arxiv.org/abs/2203.15009", "description": "Generative models for network time series (also known as dynamic graphs) have\ntremendous potential in fields such as epidemiology, biology and economics,\nwhere complex graph-based dynamics are core objects of study. Designing\nflexible and scalable generative models is a very challenging task due to the\nhigh dimensionality of the data, as well as the need to represent temporal\ndependencies and marginal network structure. Here we introduce DAMNETS, a\nscalable deep generative model for network time series. DAMNETS outperforms\ncompeting methods on all of our measures of sample quality, over both real and\nsynthetic data sets."}, "http://arxiv.org/abs/2212.01792": {"title": "Classification by sparse additive models", "link": "http://arxiv.org/abs/2212.01792", "description": "We consider (nonparametric) sparse additive models (SpAM) for classification.\nThe design of a SpAM classifier is based on minimizing the logistic loss with a\nsparse group Lasso and more general sparse group Slope-type penalties on the\ncoefficients of univariate components' expansions in orthonormal series (e.g.,\nFourier or wavelets). The resulting classifiers are inherently adaptive to the\nunknown sparsity and smoothness. We show that under certain sparse group\nrestricted eigenvalue condition the sparse group Lasso classifier is\nnearly-minimax (up to log-factors) within the entire range of analytic, Sobolev\nand Besov classes while the sparse group Slope classifier achieves the exact\nminimax order (without the extra log-factors) for sparse and moderately dense\nsetups. The performance of the proposed classifier is illustrated on the\nreal-data example."}, "http://arxiv.org/abs/2302.11656": {"title": "Confounder-Dependent Bayesian Mixture Model: Characterizing Heterogeneity of Causal Effects in Air Pollution Epidemiology", "link": "http://arxiv.org/abs/2302.11656", "description": "Several epidemiological studies have provided evidence that long-term\nexposure to fine particulate matter (PM2.5) increases mortality risk.\nFurthermore, some population characteristics (e.g., age, race, and\nsocioeconomic status) might play a crucial role in understanding vulnerability\nto air pollution. To inform policy, it is necessary to identify groups of the\npopulation that are more or less vulnerable to air pollution. In causal\ninference literature, the Group Average Treatment Effect (GATE) is a\ndistinctive facet of the conditional average treatment effect. 
This widely\nemployed metric serves to characterize the heterogeneity of a treatment effect\nbased on some population characteristics. In this work, we introduce a novel\nConfounder-Dependent Bayesian Mixture Model (CDBMM) to characterize causal\neffect heterogeneity. More specifically, our method leverages the flexibility\nof the dependent Dirichlet process to model the distribution of the potential\noutcomes conditionally on the covariates and the treatment levels, thus\nenabling us to: (i) identify heterogeneous and mutually exclusive population\ngroups defined by similar GATEs in a data-driven way, and (ii) estimate and\ncharacterize the causal effects within each of the identified groups. Through\nsimulations, we demonstrate the effectiveness of our method in uncovering key\ninsights about treatment effect heterogeneity. We apply our method to claims\ndata from Medicare enrollees in Texas. We found six mutually exclusive groups\nwhere the causal effects of PM2.5 on mortality are heterogeneous."}, "http://arxiv.org/abs/2305.03149": {"title": "A Spectral Method for Identifiable Grade of Membership Analysis with Binary Responses", "link": "http://arxiv.org/abs/2305.03149", "description": "Grade of Membership (GoM) models are popular individual-level mixture models\nfor multivariate categorical data. GoM allows each subject to have mixed\nmemberships in multiple extreme latent profiles. Therefore GoM models have a\nricher modeling capacity than latent class models that restrict each subject to\nbelong to a single profile. The flexibility of GoM comes at the cost of more\nchallenging identifiability and estimation problems. In this work, we propose a\nsingular value decomposition (SVD) based spectral approach to GoM analysis with\nmultivariate binary responses. Our approach hinges on the observation that the\nexpectation of the data matrix has a low-rank decomposition under a GoM model.\nFor identifiability, we develop sufficient and almost necessary conditions for\na notion of expectation identifiability. For estimation, we extract only a few\nleading singular vectors of the observed data matrix, and exploit the simplex\ngeometry of these vectors to estimate the mixed membership scores and other\nparameters. We also establish the consistency of our estimator in the\ndouble-asymptotic regime where both the number of subjects and the number of\nitems grow to infinity. Our spectral method has a huge computational advantage\nover Bayesian or likelihood-based methods and is scalable to large-scale and\nhigh-dimensional data. Extensive simulation studies demonstrate the superior\nefficiency and accuracy of our method. We also illustrate our method by\napplying it to a personality test dataset."}, "http://arxiv.org/abs/2305.05276": {"title": "Causal Discovery from Subsampled Time Series with Proxy Variables", "link": "http://arxiv.org/abs/2305.05276", "description": "Inferring causal structures from time series data is the central interest of\nmany scientific inquiries. A major barrier to such inference is the problem of\nsubsampling, i.e., the frequency of measurement is much lower than that of\ncausal influence. To overcome this problem, numerous methods have been\nproposed, yet these were either limited to the linear case or failed to achieve\nidentifiability. In this paper, we propose a constraint-based algorithm that\ncan identify the entire causal structure from subsampled time series, without\nany parametric constraint. 
Our observation is that the challenge of subsampling\narises mainly from hidden variables at the unobserved time steps. Meanwhile,\nevery hidden variable has an observed proxy, which is essentially itself at\nsome observable time in the future, benefiting from the temporal structure.\nBased on these, we can leverage the proxies to remove the bias induced by the\nhidden variables and hence achieve identifiability. Following this intuition,\nwe propose a proxy-based causal discovery algorithm. Our algorithm is\nnonparametric and can achieve full causal identification. Theoretical\nadvantages are reflected in synthetic and real-world experiments."}, "http://arxiv.org/abs/2305.08942": {"title": "Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification", "link": "http://arxiv.org/abs/2305.08942", "description": "Data-driven modeling is useful for reconstructing nonlinear dynamical systems\nwhen the underlying process is unknown or too expensive to compute. Having\nreliable uncertainty assessment of the forecast enables tools to be deployed to\npredict new scenarios unobserved before. In this work, we first extend parallel\npartial Gaussian processes for predicting the vector-valued transition function\nthat links the observations between the current and next time points, and\nquantify the uncertainty of predictions by posterior sampling. Second, we show\nthe equivalence between the dynamic mode decomposition and the maximum\nlikelihood estimator of the linear mapping matrix in the linear state space\nmodel. The connection provides a {probabilistic generative} model of dynamic\nmode decomposition and thus, uncertainty of predictions can be obtained.\nFurthermore, we draw close connections between different data-driven models for\napproximating nonlinear dynamics, through a unified view of generative models.\nWe study two numerical examples, where the inputs of the dynamics are assumed\nto be known in the first example and the inputs are unknown in the second\nexample. The examples indicate that uncertainty of forecast can be properly\nquantified, whereas model or input misspecification can degrade the accuracy of\nuncertainty quantification."}, "http://arxiv.org/abs/2305.16795": {"title": "On Consistent Bayesian Inference from Synthetic Data", "link": "http://arxiv.org/abs/2305.16795", "description": "Generating synthetic data, with or without differential privacy, has\nattracted significant attention as a potential solution to the dilemma between\nmaking data easily available, and the privacy of data subjects. Several works\nhave shown that consistency of downstream analyses from synthetic data,\nincluding accurate uncertainty estimation, requires accounting for the\nsynthetic data generation. There are very few methods of doing so, most of them\nfor frequentist analysis. In this paper, we study how to perform consistent\nBayesian inference from synthetic data. We prove that mixing posterior samples\nobtained separately from multiple large synthetic data sets converges to the\nposterior of the downstream analysis under standard regularity conditions when\nthe analyst's model is compatible with the data provider's model. 
We also\npresent several examples showing how the theory works in practice, and showing\nhow Bayesian inference can fail when the compatibility assumption is not met,\nor the synthetic data set is not significantly larger than the original."}, "http://arxiv.org/abs/2306.04746": {"title": "Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models", "link": "http://arxiv.org/abs/2306.04746", "description": "In computational social science (CSS), researchers analyze documents to\nexplain social and political phenomena. In most scenarios, CSS researchers\nfirst obtain labels for documents and then explain labels using interpretable\nregression analyses in the second step. One increasingly common way to annotate\ndocuments cheaply at scale is through large language models (LLMs). However,\nlike other scalable ways of producing annotations, such surrogate labels are\noften imperfect and biased. We present a new algorithm for using imperfect\nannotation surrogates for downstream statistical analyses while guaranteeing\nstatistical properties -- like asymptotic unbiasedness and proper uncertainty\nquantification -- which are fundamental to CSS research. We show that direct\nuse of surrogate labels in downstream statistical analyses leads to substantial\nbias and invalid confidence intervals, even with high surrogate accuracy of\n80--90\\%. To address this, we build on debiased machine learning to propose the\ndesign-based supervised learning (DSL) estimator. DSL employs a doubly-robust\nprocedure to combine surrogate labels with a smaller number of high-quality,\ngold-standard labels. Our approach guarantees valid inference for downstream\nstatistical analyses, even when surrogates are arbitrarily biased and without\nrequiring stringent assumptions, by controlling the probability of sampling\ndocuments for gold-standard labeling. Both our theoretical analysis and\nexperimental results show that DSL provides valid statistical inference while\nachieving root mean squared errors comparable to existing alternatives that\nfocus only on prediction without inferential guarantees."}, "http://arxiv.org/abs/2308.13713": {"title": "Causally Sound Priors for Binary Experiments", "link": "http://arxiv.org/abs/2308.13713", "description": "We introduce the BREASE framework for the Bayesian analysis of randomized\ncontrolled trials with a binary treatment and a binary outcome. Approaching the\nproblem from a causal inference perspective, we propose parameterizing the\nlikelihood in terms of the baseline risk, efficacy, and adverse side effects of\nthe treatment, along with a flexible, yet intuitive and tractable jointly\nindependent beta prior distribution on these parameters, which we show to be a\ngeneralization of the Dirichlet prior for the joint distribution of potential\noutcomes. 
Our approach has a number of desirable characteristics when compared\nto current mainstream alternatives: (i) it naturally induces prior dependence\nbetween expected outcomes in the treatment and control groups; (ii) as the\nbaseline risk, efficacy and risk of adverse side effects are quantities\ncommonly present in the clinicians' vocabulary, the hyperparameters of the\nprior are directly interpretable, thus facilitating the elicitation of prior\nknowledge and sensitivity analysis; and (iii) we provide analytical formulae\nfor the marginal likelihood, Bayes factor, and other posterior quantities, as\nwell as exact posterior sampling via simulation, in cases where traditional\nMCMC fails. Empirical examples demonstrate the utility of our methods for\nestimation, hypothesis testing, and sensitivity analysis of treatment effects."}, "http://arxiv.org/abs/2309.01608": {"title": "Supervised dimensionality reduction for multiple imputation by chained equations", "link": "http://arxiv.org/abs/2309.01608", "description": "Multivariate imputation by chained equations (MICE) is one of the most\npopular approaches to address missing values in a data set. This approach\nrequires specifying a univariate imputation model for every variable under\nimputation. The specification of which predictors should be included in these\nunivariate imputation models can be a daunting task. Principal component\nanalysis (PCA) can simplify this process by replacing all of the potential\nimputation model predictors with a few components summarizing their variance.\nIn this article, we extend the use of PCA with MICE to include a supervised\naspect whereby information from the variables under imputation is incorporated\ninto the principal component estimation. We conducted an extensive simulation\nstudy to assess the statistical properties of MICE with different versions of\nsupervised dimensionality reduction and we compared them with the use of\nclassical unsupervised PCA as a simpler dimensionality reduction technique."}, "http://arxiv.org/abs/2311.00118": {"title": "Extracting the Multiscale Causal Backbone of Brain Dynamics", "link": "http://arxiv.org/abs/2311.00118", "description": "The bulk of the research effort on brain connectivity revolves around\nstatistical associations among brain regions, which do not directly relate to\nthe causal mechanisms governing brain dynamics. Here we propose the multiscale\ncausal backbone (MCB) of brain dynamics shared by a set of individuals across\nmultiple temporal scales, and devise a principled methodology to extract it.\n\nOur approach leverages recent advances in multiscale causal structure\nlearning and optimizes the trade-off between the model fitting and its\ncomplexity. Empirical assessment on synthetic data shows the superiority of our\nmethodology over a baseline based on canonical functional connectivity\nnetworks. When applied to resting-state fMRI data, we find sparse MCBs for both\nthe left and right brain hemispheres. Thanks to its multiscale nature, our\napproach shows that at low-frequency bands, causal dynamics are driven by brain\nregions associated with high-level cognitive functions; at higher frequencies\ninstead, nodes related to sensory processing play a crucial role. 
Finally, our\nanalysis of individual multiscale causal structures confirms the existence of a\ncausal fingerprint of brain connectivity, thus supporting from a causal\nperspective the existing extensive research in brain connectivity\nfingerprinting."}, "http://arxiv.org/abs/2311.00122": {"title": "Statistical Network Analysis: Past, Present, and Future", "link": "http://arxiv.org/abs/2311.00122", "description": "This article provides a brief overview of statistical network analysis, a\nrapidly evolving field of statistics, which encompasses statistical models,\nalgorithms, and inferential methods for analyzing data in the form of networks.\nParticular emphasis is given to connecting the historical developments in\nnetwork science to today's statistical network analysis, and outlining\nimportant new areas for future research.\n\nThis invited article is intended as a book chapter for the volume \"Frontiers\nof Statistics and Data Science\" edited by Subhashis Ghoshal and Anindya Roy for\nthe International Indian Statistical Association Series on Statistics and Data\nScience, published by Springer. This review article covers the material from\nthe short course titled \"Statistical Network Analysis: Past, Present, and\nFuture\" taught by the author at the Annual Conference of the International\nIndian Statistical Association, June 6-10, 2023, at Golden, Colorado."}, "http://arxiv.org/abs/2311.00210": {"title": "Broken Adaptive Ridge Method for Variable Selection in Generalized Partly Linear Models with Application to the Coronary Artery Disease Data", "link": "http://arxiv.org/abs/2311.00210", "description": "Motivated by the CATHGEN data, we develop a new statistical learning method\nfor simultaneous variable selection and parameter estimation under the context\nof generalized partly linear models for data with high-dimensional covariates.\nThe method is referred to as the broken adaptive ridge (BAR) estimator, which\nis an approximation of the $L_0$-penalized regression by iteratively performing\nreweighted squared $L_2$-penalized regression. The generalized partly linear\nmodel extends the generalized linear model by including a non-parametric\ncomponent to construct a flexible model for modeling various types of covariate\neffects. We employ the Bernstein polynomials as the sieve space to approximate\nthe non-parametric functions so that our method can be implemented easily using\nthe existing R packages. Extensive simulation studies suggest that the proposed\nmethod performs better than other commonly used penalty-based variable\nselection methods. We apply the method to the CATHGEN data with a binary\nresponse from a coronary artery disease study, which motivated our research,\nand obtained new findings in both high-dimensional genetic and low-dimensional\nnon-genetic covariates."}, "http://arxiv.org/abs/2311.00294": {"title": "Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence", "link": "http://arxiv.org/abs/2311.00294", "description": "To address the difficult problem of multi-step ahead prediction of\nnon-parametric autoregressions, we consider a forward bootstrap approach.\nEmploying a local constant estimator, we can analyze a general type of\nnon-parametric time series model, and show that the proposed point predictions\nare consistent with the true optimal predictor. We construct a quantile\nprediction interval that is asymptotically valid. 
Moreover, using a debiasing\ntechnique, we can asymptotically approximate the distribution of multi-step\nahead non-parametric estimation by bootstrap. As a result, we can build\nbootstrap prediction intervals that are pertinent, i.e., can capture the model\nestimation variability, thus improving upon the standard quantile prediction\nintervals. Simulation studies are given to illustrate the performance of our\npoint predictions and pertinent prediction intervals for finite samples."}, "http://arxiv.org/abs/2311.00528": {"title": "On the Comparative Analysis of Average Treatment Effects Estimation via Data Combination", "link": "http://arxiv.org/abs/2311.00528", "description": "There is growing interest in exploring causal effects in target populations\nby combining multiple datasets. Nevertheless, most approaches are tailored to\nspecific settings and lack comprehensive comparative analyses across different\nsettings. In this article, within the typical scenario of a source dataset and\na target dataset, we establish a unified framework for comparing various\nsettings in causal inference via data combination. We first design six distinct\nsettings, each with different available datasets and identifiability\nassumptions. The six settings cover a wide range of scenarios in the existing\nliterature. We then conduct a comprehensive efficiency comparative analysis\nacross these settings by calculating and comparing the semiparametric\nefficiency bounds for the average treatment effect (ATE) in the target\npopulation. Our findings reveal the key factors contributing to efficiency\ngains or losses across these settings. In addition, we extend our analysis to\nother estimands, including ATE in the source population and the average\ntreatment effect on treated (ATT) in both the source and target populations.\nFurthermore, we empirically validate our findings by constructing locally\nefficient estimators and conducting extensive simulation studies. We\ndemonstrate the proposed approaches using a real application to a MIMIC-III\ndataset as the target population and an eICU dataset as the source population."}, "http://arxiv.org/abs/2311.00541": {"title": "An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek", "link": "http://arxiv.org/abs/2311.00541", "description": "Word meanings change over time, and word senses evolve, emerge or die out in\nthe process. For ancient languages, where the corpora are often small, sparse\nand noisy, modelling such changes accurately proves challenging, and\nquantifying uncertainty in sense-change estimates consequently becomes\nimportant. GASC and DiSC are existing generative models that have been used to\nanalyse sense change for target words from an ancient Greek text corpus, using\nunsupervised learning without the help of any pre-training. These models\nrepresent the senses of a given target word such as \"kosmos\" (meaning\ndecoration, order or world) as distributions over context words, and sense\nprevalence as a distribution over senses. The models are fitted using MCMC\nmethods to measure temporal changes in these representations. In this paper, we\nintroduce EDiSC, an embedded version of DiSC, which combines word embeddings\nwith DiSC to provide superior model performance. We show empirically that EDiSC\noffers improved predictive accuracy, ground-truth recovery and uncertainty\nquantification, as well as better sampling efficiency and scalability\nproperties with MCMC methods. 
We also discuss the challenges of fitting these\nmodels."}, "http://arxiv.org/abs/2311.00553": {"title": "Polynomial Chaos Surrogate Construction for Random Fields with Parametric Uncertainty", "link": "http://arxiv.org/abs/2311.00553", "description": "Engineering and applied science rely on computational experiments to\nrigorously study physical systems. The mathematical models used to probe these\nsystems are highly complex, and sampling-intensive studies often require\nprohibitively many simulations for acceptable accuracy. Surrogate models\nprovide a means of circumventing the high computational expense of sampling\nsuch complex models. In particular, polynomial chaos expansions (PCEs) have\nbeen successfully used for uncertainty quantification studies of deterministic\nmodels where the dominant source of uncertainty is parametric. We discuss an\nextension to conventional PCE surrogate modeling to enable surrogate\nconstruction for stochastic computational models that have intrinsic noise in\naddition to parametric uncertainty. We develop a PCE surrogate on a joint space\nof intrinsic and parametric uncertainty, enabled by Rosenblatt transformations,\nand then extend the construction to random field data via the Karhunen-Loeve\nexpansion. We then take advantage of closed-form solutions for computing PCE\nSobol indices to perform a global sensitivity analysis of the model which\nquantifies the intrinsic noise contribution to the overall model output\nvariance. Additionally, the resulting joint PCE is generative in the sense that\nit allows generating random realizations at any input parameter setting that\nare statistically approximately equivalent to realizations from the underlying\nstochastic model. The method is demonstrated on a chemical catalysis example\nmodel."}, "http://arxiv.org/abs/2311.00568": {"title": "Scalable kernel balancing weights in a nationwide observational study of hospital profit status and heart attack outcomes", "link": "http://arxiv.org/abs/2311.00568", "description": "Weighting is a general and often-used method for statistical adjustment.\nWeighting has two objectives: first, to balance covariate distributions, and\nsecond, to ensure that the weights have minimal dispersion and thus produce a\nmore stable estimator. A recent, increasingly common approach directly\noptimizes the weights toward these two objectives. However, this approach has\nnot yet been feasible in large-scale datasets when investigators wish to\nflexibly balance general basis functions in an extended feature space. For\nexample, many balancing approaches cannot scale to national-level health\nservices research studies. To address this practical problem, we describe a\nscalable and flexible approach to weighting that integrates a basis expansion\nin a reproducing kernel Hilbert space with state-of-the-art convex optimization\ntechniques. Specifically, we use the rank-restricted Nystr\\\"{o}m method to\nefficiently compute a kernel basis for balancing in {nearly} linear time and\nspace, and then use the specialized first-order alternating direction method of\nmultipliers to rapidly find the optimal weights. In an extensive simulation\nstudy, we provide new insights into the performance of weighting estimators in\nlarge datasets, showing that the proposed approach substantially outperforms\nothers in terms of accuracy and speed. 
Finally, we use this weighting approach\nto conduct a national study of the relationship between hospital profit status\nand heart attack outcomes in a comprehensive dataset of 1.27 million patients.\nWe find that for-profit hospitals use interventional cardiology to treat heart\nattacks at similar rates as other hospitals, but have higher mortality and\nreadmission rates."}, "http://arxiv.org/abs/2311.00596": {"title": "Evaluating Binary Outcome Classifiers Estimated from Survey Data", "link": "http://arxiv.org/abs/2311.00596", "description": "Surveys are commonly used to facilitate research in epidemiology, health, and\nthe social and behavioral sciences. Often, these surveys are not simple random\nsamples, and respondents are given weights reflecting their probability of\nselection into the survey. It is well known that analysts can use these survey\nweights to produce unbiased estimates of population quantities like totals. In\nthis article, we show that survey weights also can be beneficial for evaluating\nthe quality of predictive models when splitting data into training and test\nsets. In particular, we characterize model assessment statistics, such as\nsensitivity and specificity, as finite population quantities, and compute\nsurvey-weighted estimates of these quantities with sample test data comprising\na random subset of the original data. Using simulations with data from the\nNational Survey on Drug Use and Health and the National Comorbidity Survey, we\nshow that unweighted metrics estimated with sample test data can misrepresent\npopulation performance, but weighted metrics appropriately adjust for the\ncomplex sampling design. We also show that this conclusion holds for models\ntrained using upsampling for mitigating class imbalance. The results suggest\nthat weighted metrics should be used when evaluating performance on sample test\ndata."}, "http://arxiv.org/abs/2206.10866": {"title": "Nearest Neighbor Classification based on Imbalanced Data: A Statistical Approach", "link": "http://arxiv.org/abs/2206.10866", "description": "When the competing classes in a classification problem are not of comparable\nsize, many popular classifiers exhibit a bias towards larger classes, and the\nnearest neighbor classifier is no exception. To take care of this problem, we\ndevelop a statistical method for nearest neighbor classification based on such\nimbalanced data sets. First, we construct a classifier for the binary\nclassification problem and then extend it for classification problems involving\nmore than two classes. Unlike the existing oversampling or undersampling\nmethods, our proposed classifiers do not need to generate any pseudo\nobservations or remove any existing observations, hence the results are exactly\nreproducible. We establish the Bayes risk consistency of these classifiers\nunder appropriate regularity conditions. Their superior performance over the\nexisting methods is amply demonstrated by analyzing several benchmark data\nsets."}, "http://arxiv.org/abs/2209.08892": {"title": "High-dimensional data segmentation in regression settings permitting temporal dependence and non-Gaussianity", "link": "http://arxiv.org/abs/2209.08892", "description": "We propose a data segmentation methodology for the high-dimensional linear\nregression problem where regression parameters are allowed to undergo multiple\nchanges. 
The proposed methodology, MOSEG, proceeds in two stages: first, the\ndata are scanned for multiple change points using a moving window-based\nprocedure, which is followed by a location refinement stage. MOSEG enjoys\ncomputational efficiency thanks to the adoption of a coarse grid in the first\nstage, and achieves theoretical consistency in estimating both the total number\nand the locations of the change points, under general conditions permitting\nserial dependence and non-Gaussianity. We also propose MOSEG.MS, a multiscale\nextension of MOSEG which, while comparable to MOSEG in terms of computational\ncomplexity, achieves theoretical consistency for a broader parameter space\nwhere large parameter shifts over short intervals and small changes over long\nstretches of stationarity are simultaneously allowed. We demonstrate good\nperformance of the proposed methods in comparative simulation studies and in an\napplication to predicting the equity premium."}, "http://arxiv.org/abs/2210.02341": {"title": "A Distributed Block-Split Gibbs Sampler with Hypergraph Structure for High-Dimensional Inverse Problems", "link": "http://arxiv.org/abs/2210.02341", "description": "Sampling-based algorithms are classical approaches to perform Bayesian\ninference in inverse problems. They provide estimators with the associated\ncredibility intervals to quantify the uncertainty on the estimators. Although\nthese methods hardly scale to high dimensional problems, they have recently\nbeen paired with optimization techniques, such as proximal and splitting\napproaches, to address this issue. Such approaches pave the way to distributed\nsamplers, splitting computations to make inference more scalable and faster. We\nintroduce a distributed Split Gibbs sampler (SGS) to efficiently solve such\nproblems involving distributions with multiple smooth and non-smooth functions\ncomposed with linear operators. The proposed approach leverages a recent\napproximate augmentation technique reminiscent of primal-dual optimization\nmethods. It is further combined with a block-coordinate approach to split the\nprimal and dual variables into blocks, leading to a distributed\nblock-coordinate SGS. The resulting algorithm exploits the hypergraph structure\nof the involved linear operators to efficiently distribute the variables over\nmultiple workers under controlled communication costs. It accommodates several\ndistributed architectures, such as the Single Program Multiple Data and\nclient-server architectures. Experiments on a large image deblurring problem\nshow the performance of the proposed approach to produce high quality estimates\nwith credibility intervals in a small amount of time. Supplementary material to\nreproduce the experiments is available online."}, "http://arxiv.org/abs/2210.14086": {"title": "A Global Wavelet Based Bootstrapped Test of Covariance Stationarity", "link": "http://arxiv.org/abs/2210.14086", "description": "We propose a covariance stationarity test for an otherwise dependent and\npossibly globally non-stationary time series. We work in a generalized version\nof the new setting in Jin, Wang and Wang (2015), who exploit Walsh (1923)\nfunctions in order to compare sub-sample covariances with the full sample\ncounterpart. They impose strict stationarity under the null, only consider\nlinear processes under either hypothesis in order to achieve a parametric\nestimator for an inverted high dimensional asymptotic covariance matrix, and do\nnot consider any other orthonormal basis. 
Conversely, we work with a general\northonormal basis under mild conditions that include Haar wavelet and Walsh\nfunctions; and we allow for linear or nonlinear processes with possibly non-iid\ninnovations. This is important in macroeconomics and finance where nonlinear\nfeedback and random volatility occur in many settings. We completely sidestep\nasymptotic covariance matrix estimation and inversion by bootstrapping a\nmax-correlation difference statistic, where the maximum is taken over the\ncorrelation lag $h$ and basis generated sub-sample counter $k$ (the number of\nsystematic samples). We achieve a higher feasible rate of increase for the\nmaximum lag and counter $\\mathcal{H}_{T}$ and $\\mathcal{K}_{T}$. Of particular\nnote, our test is capable of detecting breaks in variance, and distant, or very\nmild, deviations from stationarity."}, "http://arxiv.org/abs/2211.03031": {"title": "A framework for leveraging machine learning tools to estimate personalized survival curves", "link": "http://arxiv.org/abs/2211.03031", "description": "The conditional survival function of a time-to-event outcome subject to\ncensoring and truncation is a common target of estimation in survival analysis.\nThis parameter may be of scientific interest and also often appears as a\nnuisance in nonparametric and semiparametric problems. In addition to classical\nparametric and semiparametric methods (e.g., based on the Cox proportional\nhazards model), flexible machine learning approaches have been developed to\nestimate the conditional survival function. However, many of these methods are\neither implicitly or explicitly targeted toward risk stratification rather than\noverall survival function estimation. Others apply only to discrete-time\nsettings or require inverse probability of censoring weights, which can be as\ndifficult to estimate as the outcome survival function itself. Here, we employ\na decomposition of the conditional survival function in terms of observable\nregression models in which censoring and truncation play no role. This allows\napplication of an array of flexible regression and classification methods\nrather than only approaches that explicitly handle the complexities inherent to\nsurvival data. We outline estimation procedures based on this decomposition,\nempirically assess their performance, and demonstrate their use on data from an\nHIV vaccine trial."}, "http://arxiv.org/abs/2301.12389": {"title": "On Learning Necessary and Sufficient Causal Graphs", "link": "http://arxiv.org/abs/2301.12389", "description": "The causal revolution has stimulated interest in understanding complex\nrelationships in various fields. Most of the existing methods aim to discover\ncausal relationships among all variables within a complex large-scale graph.\nHowever, in practice, only a small subset of variables in the graph are\nrelevant to the outcomes of interest. Consequently, causal estimation with the\nfull causal graph -- particularly given limited data -- could lead to numerous\nfalsely discovered, spurious variables that exhibit high correlation with, but\nexert no causal impact on, the target outcome. In this paper, we propose\nlearning a class of necessary and sufficient causal graphs (NSCG) that\nexclusively comprises causally relevant variables for an outcome of interest,\nwhich we term causal features. 
The key idea is to employ probabilities of\ncausation to systematically evaluate the importance of features in the causal\ngraph, allowing us to identify a subgraph relevant to the outcome of interest.\nTo learn NSCG from data, we develop a necessary and sufficient causal\nstructural learning (NSCSL) algorithm, by establishing theoretical properties\nand relationships between probabilities of causation and natural causal effects\nof features. Across empirical studies of simulated and real data, we\ndemonstrate that NSCSL outperforms existing algorithms and can reveal crucial\nyeast genes for target heritable traits of interest."}, "http://arxiv.org/abs/2303.18211": {"title": "A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models", "link": "http://arxiv.org/abs/2303.18211", "description": "Additive Noise Models (ANMs) are a common model class for causal discovery\nfrom observational data and are often used to generate synthetic data for\ncausal discovery benchmarking. Specifying an ANM requires choosing all\nparameters, including those not fixed by explicit assumptions. Reisach et al.\n(2021) show that sorting variables by increasing variance often yields an\nordering close to a causal order and introduce var-sortability to quantify this\nalignment. Since increasing variances may be unrealistic and are\nscale-dependent, ANM data are often standardized in benchmarks.\n\nWe show that synthetic ANM data are characterized by another pattern that is\nscale-invariant: the explainable fraction of a variable's variance, as captured\nby the coefficient of determination $R^2$, tends to increase along the causal\norder. The result is high $R^2$-sortability, meaning that sorting the variables\nby increasing $R^2$ yields an ordering close to a causal order. We propose an\nefficient baseline algorithm termed $R^2$-SortnRegress that exploits high\n$R^2$-sortability and that can match and exceed the performance of established\ncausal discovery algorithms. We show analytically that sufficiently high edge\nweights lead to a relative decrease of the noise contributions along causal\nchains, resulting in increasingly deterministic relationships and high $R^2$.\nWe characterize $R^2$-sortability for different simulation parameters and find\nhigh values in common settings. Our findings reveal high $R^2$-sortability as\nan assumption about the data generating process relevant to causal discovery\nand implicit in many ANM sampling schemes. It should be made explicit, as its\nprevalence in real-world data is unknown. For causal discovery benchmarking, we\nimplement $R^2$-sortability, the $R^2$-SortnRegress algorithm, and ANM\nsimulation procedures in our library CausalDisco at\nhttps://causaldisco.github.io/CausalDisco/."}, "http://arxiv.org/abs/2304.11491": {"title": "Bayesian Boundary Trend Filtering", "link": "http://arxiv.org/abs/2304.11491", "description": "Estimating boundary curves has many applications such as economics, climate\nscience, and medicine. Bayesian trend filtering has been developed as one of\nlocally adaptive smoothing methods to estimate the non-stationary trend of\ndata. This paper develops a Bayesian trend filtering for estimating boundary\ntrend. To this end, the truncated multivariate normal working likelihood and\nglobal-local shrinkage priors based on scale mixtures of normal distribution\nare introduced. In particular, well-known horseshoe prior for difference leads\nto locally adaptive shrinkage estimation for boundary trend. 
However, the full\nconditional distributions of the Gibbs sampler involve high-dimensional\ntruncated multivariate normal distribution. To overcome the difficulty of\nsampling, an approximation of truncated multivariate normal distribution is\nemployed. Using the approximation, the proposed models lead to an efficient\nGibbs sampling algorithm via P\\'olya-Gamma data augmentation. The proposed\nmethod is also extended by considering nearly isotonic constraint. The\nperformance of the proposed method is illustrated through some numerical\nexperiments and real data examples."}, "http://arxiv.org/abs/2304.13237": {"title": "An Efficient Doubly-Robust Test for the Kernel Treatment Effect", "link": "http://arxiv.org/abs/2304.13237", "description": "The average treatment effect, which is the difference in expectation of the\ncounterfactuals, is probably the most popular target effect in causal inference\nwith binary treatments. However, treatments may have effects beyond the mean,\nfor instance decreasing or increasing the variance. We propose a new\nkernel-based test for distributional effects of the treatment. It is, to the\nbest of our knowledge, the first kernel-based, doubly-robust test with provably\nvalid type-I error. Furthermore, our proposed algorithm is computationally\nefficient, avoiding the use of permutations."}, "http://arxiv.org/abs/2305.07981": {"title": "Inferring Stochastic Group Interactions within Structured Populations via Coupled Autoregression", "link": "http://arxiv.org/abs/2305.07981", "description": "The internal behaviour of a population is an important feature to take\naccount of when modelling their dynamics. In line with kin selection theory,\nmany social species tend to cluster into distinct groups in order to enhance\ntheir overall population fitness. Temporal interactions between populations are\noften modelled using classical mathematical models, but these sometimes fail to\ndelve deeper into the, often uncertain, relationships within populations. Here,\nwe introduce a stochastic framework that aims to capture the interactions of\nanimal groups and an auxiliary population over time. We demonstrate the model's\ncapabilities, from a Bayesian perspective, through simulation studies and by\nfitting it to predator-prey count time series data. We then derive an\napproximation to the group correlation structure within such a population,\nwhile also taking account of the effect of the auxiliary population. We finally\ndiscuss how this approximation can lead to ecologically realistic\ninterpretations in a predator-prey context. This approximation also serves as\nverification to whether the population in question satisfies our various\nassumptions. Our modelling approach will be useful for empiricists for\nmonitoring groups within a conservation framework and also theoreticians\nwanting to quantify interactions, to study cooperation and other phenomena\nwithin social populations."}, "http://arxiv.org/abs/2306.09520": {"title": "Ensembled Prediction Intervals for Causal Outcomes Under Hidden Confounding", "link": "http://arxiv.org/abs/2306.09520", "description": "Causal inference of exact individual treatment outcomes in the presence of\nhidden confounders is rarely possible. Recent work has extended prediction\nintervals with finite-sample guarantees to partially identifiable causal\noutcomes, by means of a sensitivity model for hidden confounding. In deep\nlearning, predictors can exploit their inductive biases for better\ngeneralization out of sample. 
We argue that the structure inherent to a deep\nensemble should inform a tighter partial identification of the causal outcomes\nthat they predict. We therefore introduce an approach termed Caus-Modens, for\ncharacterizing causal outcome intervals by modulated ensembles. We present a\nsimple approach to partial identification using existing causal sensitivity\nmodels and show empirically that Caus-Modens gives tighter outcome intervals,\nas measured by the necessary interval size to achieve sufficient coverage. The\nlast of our three diverse benchmarks is a novel usage of GPT-4 for\nobservational experiments with unknown but probeable ground truth."}, "http://arxiv.org/abs/2306.16838": {"title": "Solving Kernel Ridge Regression with Gradient-Based Optimization Methods", "link": "http://arxiv.org/abs/2306.16838", "description": "Kernel ridge regression, KRR, is a generalization of linear ridge regression\nthat is non-linear in the data, but linear in the parameters. Here, we\nintroduce an equivalent formulation of the objective function of KRR, opening\nup both for using penalties other than the ridge penalty and for studying\nkernel ridge regression from the perspective of gradient descent. Using a\ncontinuous-time perspective, we derive a closed-form solution for solving\nkernel regression with gradient descent, something we refer to as kernel\ngradient flow, KGF, and theoretically bound the differences between KRR and\nKGF, where, for the latter, regularization is obtained through early stopping.\nWe also generalize KRR by replacing the ridge penalty with the $\\ell_1$ and\n$\\ell_\\infty$ penalties, respectively, and use the fact that analogous to the\nsimilarities between KGF and KRR, $\\ell_1$ regularization and forward stagewise\nregression (also known as coordinate descent), and $\\ell_\\infty$ regularization\nand sign gradient descent, follow similar solution paths. We can thus alleviate\nthe need for computationally heavy algorithms based on proximal gradient\ndescent. We show theoretically and empirically how the $\\ell_1$ and\n$\\ell_\\infty$ penalties, and the corresponding gradient-based optimization\nalgorithms, produce sparse and robust kernel regression solutions,\nrespectively."}, "http://arxiv.org/abs/2311.00820": {"title": "Bayesian inference for generalized linear models via quasi-posteriors", "link": "http://arxiv.org/abs/2311.00820", "description": "Generalized linear models (GLMs) are routinely used for modeling\nrelationships between a response variable and a set of covariates. The simple\nform of a GLM comes with easy interpretability, but also leads to concerns\nabout model misspecification impacting inferential conclusions. A popular\nsemi-parametric solution adopted in the frequentist literature is\nquasi-likelihood, which improves robustness by only requiring correct\nspecification of the first two moments. We develop a robust approach to\nBayesian inference in GLMs through quasi-posterior distributions. We show that\nquasi-posteriors provide a coherent generalized Bayes inference method, while\nalso approximating so-called coarsened posteriors. In so doing, we obtain new\ninsights into the choice of coarsening parameter. Asymptotically, the\nquasi-posterior converges in total variation to a normal distribution and has\nimportant connections with the loss-likelihood bootstrap posterior. 
We\ndemonstrate that it is also well-calibrated in terms of frequentist coverage.\nMoreover, the loss-scale parameter has a clear interpretation as a dispersion,\nand this leads to a consolidated method of moments estimator."}, "http://arxiv.org/abs/2311.00878": {"title": "Backward Joint Model for Dynamic Prediction using Multivariate Longitudinal and Competing Risk Data", "link": "http://arxiv.org/abs/2311.00878", "description": "Joint modeling is a useful approach to dynamic prediction of clinical\noutcomes using longitudinally measured predictors. When the outcomes are\ncompeting risk events, fitting the conventional shared random effects joint\nmodel often involves intensive computation, especially when multiple\nlongitudinal biomarkers are to be used as predictors, as is often desired in\nprediction problems. Motivated by a longitudinal cohort study of chronic kidney\ndisease, this paper proposes a new joint model for the dynamic prediction of\nend-stage renal disease with the competing risk of death. The model factorizes\nthe likelihood into the distribution of the competing risks data and the\ndistribution of longitudinal data given the competing risks data. The\nestimation with the EM algorithm is efficient, stable and fast, with a\none-dimensional integral in the E-step and convex optimization for most\nparameters in the M-step, regardless of the number of longitudinal predictors.\nThe model also comes with a consistent albeit less efficient estimation method\nthat can be quickly implemented with standard software, ideal for model\nbuilding and diagnostics. This model enables the prediction of future\nlongitudinal data trajectories conditional on being at risk at a future time, a\npractically significant problem that has not been studied in the statistical\nliterature. We study the properties of the proposed method using simulations\nand a real dataset and compare its performance with the shared random effects\njoint model."}, "http://arxiv.org/abs/2311.00885": {"title": "Controlling the number of significant effects in multiple testing", "link": "http://arxiv.org/abs/2311.00885", "description": "In multiple testing several criteria to control for type I errors exist. The\nfalse discovery rate, which evaluates the expected proportion of false\ndiscoveries among the rejected null hypotheses, has become the standard\napproach in this setting. However, false discovery rate control may be too\nconservative when the effects are weak. In this paper we alternatively propose\nto control the number of significant effects, where 'significant' refers to a\npre-specified threshold $\\gamma$. This means that a $(1-\\alpha)$-lower\nconfidence bound $L$ for the number of non-true null hypotheses with p-values\nbelow $\\gamma$ is provided. When one rejects the nulls corresponding to the $L$\nsmallest p-values, the probability that the number of false positives exceeds\nthe number of false negatives among the significant effects is bounded by\n$\\alpha$. Relative merits of the proposed criterion are discussed. Procedures\nto control for the number of significant effects in practice are introduced and\ninvestigated both theoretically and through simulations. 
Illustrative real data\napplications are given."}, "http://arxiv.org/abs/2311.00923": {"title": "A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations", "link": "http://arxiv.org/abs/2311.00923", "description": "The fusion of causal models with deep learning introducing increasingly\nintricate data sets, such as the causal associations within images or between\ntextual components, has surfaced as a focal research area. Nonetheless, the\nbroadening of original causal concepts and theories to such complex,\nnon-statistical data has been met with serious challenges. In response, our\nstudy proposes redefinitions of causal data into three distinct categories from\nthe standpoint of causal structure and representation: definite data,\nsemi-definite data, and indefinite data. Definite data chiefly pertains to\nstatistical data used in conventional causal scenarios, while semi-definite\ndata refers to a spectrum of data formats germane to deep learning, including\ntime-series, images, text, and others. Indefinite data is an emergent research\nsphere inferred from the progression of data forms by us. To comprehensively\npresent these three data paradigms, we elaborate on their formal definitions,\ndifferences manifested in datasets, resolution pathways, and development of\nresearch. We summarize key tasks and achievements pertaining to definite and\nsemi-definite data from myriad research undertakings, present a roadmap for\nindefinite data, beginning with its current research conundrums. Lastly, we\nclassify and scrutinize the key datasets presently utilized within these three\nparadigms."}, "http://arxiv.org/abs/2311.00927": {"title": "Scalable Counterfactual Distribution Estimation in Multivariate Causal Models", "link": "http://arxiv.org/abs/2311.00927", "description": "We consider the problem of estimating the counterfactual joint distribution\nof multiple quantities of interests (e.g., outcomes) in a multivariate causal\nmodel extended from the classical difference-in-difference design. Existing\nmethods for this task either ignore the correlation structures among dimensions\nof the multivariate outcome by considering univariate causal models on each\ndimension separately and hence produce incorrect counterfactual distributions,\nor poorly scale even for moderate-size datasets when directly dealing with such\nmultivariate causal model. We propose a method that alleviates both issues\nsimultaneously by leveraging a robust latent one-dimensional subspace of the\noriginal high-dimension space and exploiting the efficient estimation from the\nunivariate causal model on such space. Since the construction of the\none-dimensional subspace uses information from all the dimensions, our method\ncan capture the correlation structures and produce good estimates of the\ncounterfactual distribution. We demonstrate the advantages of our approach over\nexisting methods on both synthetic and real-world data."}, "http://arxiv.org/abs/2311.01021": {"title": "ABC-based Forecasting in State Space Models", "link": "http://arxiv.org/abs/2311.01021", "description": "Approximate Bayesian Computation (ABC) has gained popularity as a method for\nconducting inference and forecasting in complex models, most notably those\nwhich are intractable in some sense. In this paper we use ABC to produce\nprobabilistic forecasts in state space models (SSMs). 
Whilst ABC-based\nforecasting in correctly-specified SSMs has been studied, the misspecified case\nhas not been investigated, and it is that case which we emphasize. We invoke\nrecent principles of 'focused' Bayesian prediction, whereby Bayesian updates\nare driven by a scoring rule that rewards predictive accuracy; the aim being to\nproduce predictives that perform well in that rule, despite misspecification.\nTwo methods are investigated for producing the focused predictions. In a\nsimulation setting, 'coherent' predictions are in evidence for both methods:\nthe predictive constructed via the use of a particular scoring rule predicts\nbest according to that rule. Importantly, both focused methods typically\nproduce more accurate forecasts than an exact, but misspecified, predictive. An\nempirical application to a truly intractable SSM completes the paper."}, "http://arxiv.org/abs/2311.01147": {"title": "Variational Inference for Sparse Poisson Regression", "link": "http://arxiv.org/abs/2311.01147", "description": "We have utilized the non-conjugate VB method for the problem of the sparse\nPoisson regression model. To provide an approximated conjugacy in the model,\nthe likelihood is approximated by a quadratic function, which provides the\nconjugacy of the approximation component with the Gaussian prior on the\nregression coefficient. Three sparsity-enforcing priors are used for this\nproblem. The proposed models are compared with each other and two frequentist\nsparse Poisson methods (LASSO and SCAD) to evaluate the prediction performance,\nas well as the sparsity performance of the proposed methods. Through a\nsimulated data example, the accuracy of the VB methods is evaluated against\nthe corresponding benchmark MCMC methods. It can be observed that the proposed\nVB methods have provided a good approximation to the posterior distribution of\nthe parameters, while the VB methods are much faster than the MCMC ones. Using\nseveral benchmark count response data sets, the prediction performance of the\nproposed methods is evaluated in real-world applications."}, "http://arxiv.org/abs/2311.01287": {"title": "Semiparametric Latent ANOVA Model for Event-Related Potentials", "link": "http://arxiv.org/abs/2311.01287", "description": "Event-related potentials (ERPs) extracted from electroencephalography (EEG)\ndata in response to stimuli are widely used in psychological and neuroscience\nexperiments. A major goal is to link ERP characteristic components to\nsubject-level covariates. Existing methods typically follow two-step\napproaches, first identifying ERP components using peak detection methods and\nthen relating them to the covariates. This approach, however, can lead to loss\nof efficiency due to inaccurate estimates in the initial step, especially\nconsidering the low signal-to-noise ratio of EEG data. To address this\nchallenge, we propose a semiparametric latent ANOVA model (SLAM) that unifies\ninference on ERP components and their association with covariates. SLAM models\nERP waveforms via a structured Gaussian process prior that encodes ERP latency\nin its derivative and links the subject-level latencies to covariates using a\nlatent ANOVA. This unified Bayesian framework provides estimation at both\nthe population and subject levels, improving the efficiency of the inference by\nleveraging information across subjects. 
We automate posterior inference and\nhyperparameter tuning using a Monte Carlo expectation-maximization algorithm.\nWe demonstrate the advantages of SLAM over competing methods via simulations.\nOur method allows us to examine how factors or covariates affect the magnitude\nand/or latency of ERP components, which in turn reflect cognitive,\npsychological or neural processes. We exemplify this via an application to data\nfrom an ERP experiment on speech recognition, where we assess the effect of age\non two components of interest. Our results verify the scientific findings that\nolder people take a longer reaction time to respond to external stimuli because\nof the delay in perception and brain processes."}, "http://arxiv.org/abs/2311.01297": {"title": "Bias correction in multiple-systems estimation", "link": "http://arxiv.org/abs/2311.01297", "description": "If part of a population is hidden but two or more sources are available that\neach cover parts of this population, dual- or multiple-system(s) estimation can\nbe applied to estimate this population. For this it is common to use the\nlog-linear model, estimated with maximum likelihood. These maximum likelihood\nestimates are based on a non-linear model and therefore suffer from\nfinite-sample bias, which can be substantial in case of small samples or a\nsmall population size. This problem was recognised by Chapman, who derived an\nestimator with good small sample properties in case of two available sources.\nHowever, he did not derive an estimator for more than two sources. We propose\nan estimator that is an extension of Chapman's estimator to three or more\nsources and compare this estimator with other bias-reduced estimators in a\nsimulation study. The proposed estimator performs well, and much better than\nthe other estimators. A real data example on homelessness in the Netherlands\nshows that our proposed model can make a substantial difference."}, "http://arxiv.org/abs/2311.01301": {"title": "TRIALSCOPE A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models", "link": "http://arxiv.org/abs/2311.01301", "description": "The rapid digitization of real-world data offers an unprecedented opportunity\nfor optimizing healthcare delivery and accelerating biomedical discovery. In\npractice, however, such data is most abundantly available in unstructured\nforms, such as clinical notes in electronic medical records (EMRs), and it is\ngenerally plagued by confounders. In this paper, we present TRIALSCOPE, a\nunifying framework for distilling real-world evidence from population-level\nobservational data. TRIALSCOPE leverages biomedical language models to\nstructure clinical text at scale, employs advanced probabilistic modeling for\ndenoising and imputation, and incorporates state-of-the-art causal inference\ntechniques to combat common confounders. Using clinical trial specification as\ngeneric representation, TRIALSCOPE provides a turn-key solution to generate and\nreason with clinical hypotheses using observational data. In extensive\nexperiments and analyses on a large-scale real-world dataset with over one\nmillion cancer patients from a large US healthcare network, we show that\nTRIALSCOPE can produce high-quality structuring of real-world data and\ngenerates comparable results to marquee cancer trials. 
In addition to\nfacilitating in-silico clinical trial design and optimization, TRIALSCOPE may\nbe used to empower synthetic controls, pragmatic trials, post-market\nsurveillance, as well as to support fine-grained patient-like-me reasoning in\nprecision diagnosis and treatment."}, "http://arxiv.org/abs/2311.01303": {"title": "Local differential privacy in survival analysis using private failure indicators", "link": "http://arxiv.org/abs/2311.01303", "description": "We consider the estimation of the cumulative hazard function, and\nequivalently the distribution function, with censored data under a setup that\npreserves the privacy of the survival database. This is done through an\n$\\alpha$-locally differentially private mechanism for the failure indicators\nand by proposing a non-parametric kernel estimator for the cumulative hazard\nfunction that remains consistent under the privatization. Under mild\nconditions, we also prove lower bounds for the minimax rates of convergence\nand show that the estimator is minimax optimal under a well-chosen bandwidth."}, "http://arxiv.org/abs/2311.01341": {"title": "Composite Dyadic Models for Spatio-Temporal Data", "link": "http://arxiv.org/abs/2311.01341", "description": "Mechanistic statistical models are commonly used to study the flow of\nbiological processes. For example, in landscape genetics, the aim is to infer\nmechanisms that govern gene flow in populations. Existing statistical\napproaches in landscape genetics do not account for temporal dependence in the\ndata and may be computationally prohibitive. We infer mechanisms with a\nBayesian hierarchical dyadic model that scales well with large data sets and\nthat accounts for spatial and temporal dependence. We construct a\nfully-connected network comprising spatio-temporal data for the dyadic model\nand use normalized composite likelihoods to account for the dependence\nstructure in space and time. Our motivation for developing a dyadic model was\nto account for physical mechanisms commonly found in physical-statistical\nmodels. However, a numerical solver is not required in our approach because we\nmodel first-order changes directly. We apply our methods to ancient human DNA\ndata to infer the mechanisms that affected human movement in Bronze Age Europe."}, "http://arxiv.org/abs/2311.01412": {"title": "Castor: Causal Temporal Regime Structure Learning", "link": "http://arxiv.org/abs/2311.01412", "description": "The task of uncovering causal relationships among multivariate time series\ndata stands as an essential and challenging objective that cuts across a broad\narray of disciplines ranging from climate science to healthcare. Such data\nentail linear or non-linear relationships, and usually follow multiple a\npriori unknown regimes. Existing causal discovery methods can infer summary\ncausal graphs from heterogeneous data with known regimes, but they fall short\nin comprehensively learning both regimes and the corresponding causal graph. In\nthis paper, we introduce CASTOR, a novel framework designed to learn causal\nrelationships in heterogeneous time series data composed of various regimes,\neach governed by a distinct causal graph. Through the maximization of a score\nfunction via the EM algorithm, CASTOR infers the number of regimes and learns\nlinear or non-linear causal relationships in each regime. We demonstrate the\nrobust convergence properties of CASTOR, specifically highlighting its\nproficiency in accurately identifying unique regimes. 
Empirical evidence,\ngarnered from exhaustive synthetic experiments and two real-world benchmarks,\nconfirms CASTOR's superior performance in causal discovery compared to baseline\nmethods. By learning a full temporal causal graph for each regime, CASTOR\nestablishes itself as a distinctly interpretable method for causal discovery in\nheterogeneous time series."}, "http://arxiv.org/abs/2311.01453": {"title": "PPI++: Efficient Prediction-Powered Inference", "link": "http://arxiv.org/abs/2311.01453", "description": "We present PPI++: a computationally lightweight methodology for estimation\nand inference based on a small labeled dataset and a typically much larger\ndataset of machine-learning predictions. The methods automatically adapt to the\nquality of available predictions, yielding easy-to-compute confidence sets --\nfor parameters of any dimensionality -- that always improve on classical\nintervals using only the labeled data. PPI++ builds on prediction-powered\ninference (PPI), which targets the same problem setting, improving its\ncomputational and statistical efficiency. Real and synthetic experiments\ndemonstrate the benefits of the proposed adaptations."}, "http://arxiv.org/abs/2008.00707": {"title": "Heterogeneous Treatment and Spillover Effects under Clustered Network Interference", "link": "http://arxiv.org/abs/2008.00707", "description": "The bulk of causal inference studies rule out the presence of interference\nbetween units. However, in many real-world scenarios, units are interconnected\nby social, physical, or virtual ties, and the effect of the treatment can spill\nfrom one unit to other connected individuals in the network. In this paper, we\ndevelop a machine learning method that uses tree-based algorithms and a\nHorvitz-Thompson estimator to assess the heterogeneity of treatment and\nspillover effects with respect to individual, neighborhood, and network\ncharacteristics in the context of clustered networks and neighborhood\ninterference within clusters. The proposed Network Causal Tree (NCT) algorithm\nhas several advantages. First, it allows the investigation of treatment\neffect heterogeneity, avoiding potential bias due to the presence of\ninterference. Second, understanding the heterogeneity of both treatment and\nspillover effects can guide policy-makers in scaling up interventions,\ndesigning targeting strategies, and increasing cost-effectiveness. We\ninvestigate the performance of our NCT method using a Monte Carlo simulation\nstudy, and we illustrate its application to assess the heterogeneous effects of\ninformation sessions on the uptake of a new weather insurance policy in rural\nChina."}, "http://arxiv.org/abs/2107.01773": {"title": "Extending Latent Basis Growth Model to Explore Joint Development in the Framework of Individual Measurement Occasions", "link": "http://arxiv.org/abs/2107.01773", "description": "Longitudinal processes often exhibit nonlinear change patterns. Latent basis\ngrowth models (LBGMs) provide a versatile solution without requiring specific\nfunctional forms. Building on the LBGM specification for unequally-spaced waves\nand individual occasions proposed by Liu and Perera (2023), we extend LBGMs to\nmultivariate longitudinal outcomes. This provides a unified approach to\nnonlinear, interconnected trajectories. Simulation studies demonstrate that the\nproposed model can provide unbiased and accurate estimates with target coverage\nprobabilities for the parameters of interest. 
Real-world analyses of reading\nand mathematics scores demonstrate its effectiveness in analyzing joint\ndevelopmental processes that vary in temporal patterns. Computational code is\nincluded."}, "http://arxiv.org/abs/2112.03152": {"title": "Bounding Wasserstein distance with couplings", "link": "http://arxiv.org/abs/2112.03152", "description": "Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates\nof intractable posterior expectations as the number of iterations tends to\ninfinity. However, in large data applications, MCMC can be computationally\nexpensive per iteration. This has catalyzed interest in approximating MCMC in a\nmanner that improves computational speed per iteration but does not produce\nasymptotically consistent estimates. In this article, we propose estimators\nbased on couplings of Markov chains to assess the quality of such\nasymptotically biased sampling methods. The estimators give empirical upper\nbounds of the Wasserstein distance between the limiting distribution of the\nasymptotically biased sampling method and the original target distribution of\ninterest. We establish theoretical guarantees for our upper bounds and show\nthat our estimators can remain effective in high dimensions. We apply our\nquality measures to stochastic gradient MCMC, variational Bayes, and Laplace\napproximations for tall data and to approximate MCMC for Bayesian logistic\nregression in 4500 dimensions and Bayesian linear regression in 50000\ndimensions."}, "http://arxiv.org/abs/2204.02954": {"title": "Strongly convergent homogeneous approximations to inhomogeneous Markov jump processes and applications", "link": "http://arxiv.org/abs/2204.02954", "description": "The study of time-inhomogeneous Markov jump processes is a traditional topic\nwithin probability theory that has recently attracted substantial attention in\nvarious applications. However, their flexibility also incurs a substantial\nmathematical burden, which is usually circumvented by using well-known generic\ndistributional approximations or simulations. This article provides a novel\napproximation method that tailors the dynamics of a time-homogeneous Markov\njump process to meet those of its time-inhomogeneous counterpart on an\nincreasingly fine Poisson grid. Strong convergence of the processes in terms of\nthe Skorokhod $J_1$ metric is established, and convergence rates are provided.\nUnder traditional regularity assumptions, distributional convergence is\nestablished for unconditional proxies, to the same limit. Special attention is\ndevoted to the case where the target process has one absorbing state and the\nremaining ones are transient, for which the absorption times also converge. Some\napplications are outlined, such as univariate hazard-rate density estimation,\nruin probabilities, and multivariate phase-type density evaluation."}, "http://arxiv.org/abs/2301.07210": {"title": "Causal Falsification of Digital Twins", "link": "http://arxiv.org/abs/2301.07210", "description": "Digital twins are virtual systems designed to predict how a real-world\nprocess will evolve in response to interventions. This modelling paradigm holds\nsubstantial promise in many applications, but rigorous procedures for assessing\ntheir accuracy are essential for safety-critical settings. We consider how to\nassess the accuracy of a digital twin using real-world data. 
We formulate this\nas a causal inference problem, which leads to a precise definition of what it\nmeans for a twin to be \"correct\" that is appropriate for many applications.\nUnfortunately, fundamental results from causal inference mean observational\ndata cannot be used to certify that a twin is correct in this sense unless\npotentially tenuous assumptions are made, such as that the data are\nunconfounded. To avoid these assumptions, we propose instead to find situations\nin which the twin is not correct, and present a general-purpose statistical\nprocedure for doing so. Our approach yields reliable and actionable information\nabout the twin under only the assumption of an i.i.d. dataset of observational\ntrajectories, and remains sound even if the data are confounded. We apply our\nmethodology to a large-scale, real-world case study involving sepsis modelling\nwithin the Pulse Physiology Engine, which we assess using the MIMIC-III dataset\nof ICU patients."}, "http://arxiv.org/abs/2301.11472": {"title": "Fast Bayesian Inference for Spatial Mean-Parameterized Conway--Maxwell--Poisson Models", "link": "http://arxiv.org/abs/2301.11472", "description": "Count data with complex features arise in many disciplines, including\necology, agriculture, criminology, medicine, and public health. Zero inflation,\nspatial dependence, and non-equidispersion are common features in count data.\nThere are two classes of models that allow for these features -- the\nmode-parameterized Conway--Maxwell--Poisson (COMP) distribution and the\ngeneralized Poisson model. However, both require the use of either constraints\non the parameter space or a parameterization that leads to challenges in\ninterpretability. We propose a spatial mean-parameterized COMP model that\nretains the flexibility of these models while resolving the above issues. We\nuse a Bayesian spatial filtering approach in order to efficiently handle\nhigh-dimensional spatial data and we use reversible-jump MCMC to automatically\nchoose the basis vectors for spatial filtering. The COMP distribution poses two\nadditional computational challenges -- an intractable normalizing function in\nthe likelihood and no closed-form expression for the mean. We propose a fast\ncomputational approach that addresses these challenges by, respectively,\nintroducing an efficient auxiliary variable algorithm and pre-computing key\napproximations for fast likelihood evaluation. We illustrate the application of\nour methodology to simulated and real datasets, including Texas HPV-cancer data\nand US vaccine refusal data."}, "http://arxiv.org/abs/2305.08529": {"title": "Kernel-based Joint Independence Tests for Multivariate Stationary and Non-stationary Time Series", "link": "http://arxiv.org/abs/2305.08529", "description": "Multivariate time series data that capture the temporal evolution of\ninterconnected systems are ubiquitous in diverse areas. Understanding the\ncomplex relationships and potential dependencies among co-observed variables is\ncrucial for the accurate statistical modelling and analysis of such systems.\nHere, we introduce kernel-based statistical tests of joint independence in\nmultivariate time series by extending the $d$-variable Hilbert-Schmidt\nindependence criterion (dHSIC) to encompass both stationary and non-stationary\nprocesses, thus allowing broader real-world applications. 
By leveraging\nresampling techniques tailored for both single- and multiple-realisation time\nseries, we show how the method robustly uncovers significant higher-order\ndependencies in synthetic examples, including frequency mixing data and logic\ngates, as well as real-world climate, neuroscience, and socioeconomic data. Our\nmethod adds to the mathematical toolbox for the analysis of multivariate time\nseries and can aid in uncovering high-order interactions in data."}, "http://arxiv.org/abs/2306.07769": {"title": "Amortized Simulation-Based Frequentist Inference for Tractable and Intractable Likelihoods", "link": "http://arxiv.org/abs/2306.07769", "description": "High-fidelity simulators that connect theoretical models with observations\nare indispensable tools in many sciences. When coupled with machine learning, a\nsimulator makes it possible to infer the parameters of a theoretical model\ndirectly from real and simulated observations without explicit use of the\nlikelihood function. This is of particular interest when the latter is\nintractable. In this work, we introduce a simple extension of the recently\nproposed likelihood-free frequentist inference (LF2I) approach that has some\ncomputational advantages. Like LF2I, this extension yields provably valid\nconfidence sets in parameter inference problems in which a high-fidelity\nsimulator is available. The utility of our algorithm is illustrated by applying\nit to three pedagogically interesting examples: the first is from cosmology,\nthe second from high-energy physics and astronomy, both with tractable\nlikelihoods, while the third, with an intractable likelihood, is from\nepidemiology."}, "http://arxiv.org/abs/2307.05732": {"title": "Semiparametric Shape-restricted Estimators for Nonparametric Regression", "link": "http://arxiv.org/abs/2307.05732", "description": "Estimating the conditional mean function that relates predictive covariates\nto a response variable of interest is a fundamental task in economics and\nstatistics. In this manuscript, we propose some general nonparametric\nregression approaches that are widely applicable based on a simple yet\nsignificant decomposition of nonparametric functions into a semiparametric\nmodel with shape-restricted components. For instance, we observe that every\nLipschitz function can be expressed as a sum of a monotone function and a\nlinear function. We implement well-established shape-restricted estimation\nprocedures, such as isotonic regression, to handle the ``nonparametric\"\ncomponents of the true regression function and combine them with a simple\nsample-splitting procedure to estimate the parametric components. The resulting\nestimators inherit several favorable properties from the shape-restricted\nregression estimators. Notably, it is practically tuning parameter free,\nconverges at the minimax optimal rate, and exhibits an adaptive rate when the\ntrue regression function is ``simple\". 
We also confirm these theoretical\nproperties and compare the practical performance with existing methods via a\nseries of numerical studies."}, "http://arxiv.org/abs/2311.01470": {"title": "Preliminary Estimators of Population Mean using Ranked Set Sampling in the Presence of Measurement Error and Non-Response Error", "link": "http://arxiv.org/abs/2311.01470", "description": "In order to estimate the population mean in the presence of both non-response\nand measurement errors that are uncorrelated, the paper presents some novel\nestimators employing ranked set sampling by utilizing auxiliary information. Up\nto the first order of approximation, the equations for the bias and mean\nsquared error of the suggested estimators are derived, and it is found that\nthe proposed estimators outperform the other existing estimators analysed in\nthis study. Investigations using simulation studies and numerical examples show\nhow well the suggested estimators perform in the presence of measurement and\nnon-response errors. The relative efficiency of the suggested estimators\ncompared to the existing estimators has been expressed as a percentage, and the\nimpact of measurement errors has been expressed as a percentage computation of\nmeasurement errors."}, "http://arxiv.org/abs/2311.01484": {"title": "Comparison of methods for analyzing environmental mixtures effects on survival outcomes and application to a population-based cohort study", "link": "http://arxiv.org/abs/2311.01484", "description": "The estimation of the effect of environmental exposures and overall mixtures\non a survival time outcome is common in environmental epidemiological studies.\nWhile advanced statistical methods are increasingly being used for mixture\nanalyses, their applicability and performance for survival outcomes have yet to\nbe explored. We identified readily available methods for analyzing an\nenvironmental mixture's effect on a survival outcome and assessed their\nperformance via simulations replicating various real-life scenarios. Using\nprespecified criteria, we selected Bayesian Additive Regression Trees (BART),\nCox Elastic Net, Cox Proportional Hazards (PH) with and without penalized\nsplines, Gaussian Process Regression (GPR) and Multivariate Adaptive Regression\nSplines (MARS) to compare the bias and efficiency produced when estimating\nindividual exposure, overall mixture, and interaction effects on a survival\noutcome. We illustrate the selected methods in a real-world data application.\nWe estimated the effects of arsenic, cadmium, molybdenum, selenium, tungsten,\nand zinc on incidence of cardiovascular disease in American Indians using data\nfrom the Strong Heart Study (SHS). In the simulation study, there was a\nconsistent bias-variance trade-off. The more flexible models (BART, GPR and\nMARS) were found to be most advantageous in the presence of nonproportional\nhazards, where the Cox models often did not capture the true effects due to\ntheir higher bias and lower variance. In the SHS, estimates of the effect of\nselenium and the overall mixture indicated negative effects, but the magnitudes\nof the estimated effects varied across methods. 
In practice, we recommend\nevaluating whether findings are consistent across methods."}, "http://arxiv.org/abs/2311.01485": {"title": "Subgroup identification using individual participant data from multiple trials on low back pain", "link": "http://arxiv.org/abs/2311.01485", "description": "Model-based recursive partitioning (MOB) and its extension, metaMOB, are\npotent tools for identifying subgroups with differential treatment effects. In\nthe metaMOB approach, random effects are used to model heterogeneity of the\ntreatment effects when pooling data from various trials. In situations where\ninterventions offer only small overall benefits and require extensive, costly\ntrials with a large participant enrollment, leveraging individual-participant\ndata (IPD) from multiple trials can help identify individuals who are most\nlikely to benefit from the intervention. We explore the application of MOB and\nmetaMOB in the context of non-specific low back pain treatment, using\nsynthesized data based on a subset of the individual participant data\nmeta-analysis by Patel et al. Our study underscores the need to explore\nheterogeneity in intercepts and treatment effects to identify subgroups with\ndifferential treatment effects in IPD meta-analyses."}, "http://arxiv.org/abs/2311.01538": {"title": "A reluctant additive model framework for interpretable nonlinear individualized treatment rules", "link": "http://arxiv.org/abs/2311.01538", "description": "Individualized treatment rules (ITRs) for treatment recommendation are an\nimportant topic for precision medicine as not all beneficial treatments work\nwell for all individuals. Interpretability is a desirable property of ITRs, as\nit helps practitioners make sense of treatment decisions, yet there is a need\nfor ITRs to be flexible to effectively model complex biomedical data for\ntreatment decision making. Many ITR approaches either focus on linear ITRs,\nwhich may perform poorly when true optimal ITRs are nonlinear, or black-box\nnonlinear ITRs, which may be hard to interpret and can be overly complex. This\ndilemma indicates a tension between interpretability and accuracy of treatment\ndecisions. Here we propose an additive model-based nonlinear ITR learning\nmethod that balances interpretability and flexibility of the ITR. Our approach\naims to strike this balance by allowing both linear and nonlinear terms of the\ncovariates in the final ITR. Our approach is parsimonious in that the nonlinear\nterm is included in the final ITR only when it substantially improves the ITR\nperformance. To prevent overfitting, we combine cross-fitting and a specialized\ninformation criterion for model selection. Through extensive simulations, we\nshow that our methods are data-adaptive to the degree of nonlinearity and can\nfavorably balance ITR interpretability and flexibility. We further demonstrate\nthe robust performance of our methods with an application to a cancer drug\nsensitivity study."}, "http://arxiv.org/abs/2311.01596": {"title": "Local Bayesian Dirichlet mixing of imperfect models", "link": "http://arxiv.org/abs/2311.01596", "description": "To improve the predictability of complex computational models in\nexperimentally unknown domains, we propose a Bayesian statistical machine\nlearning framework utilizing the Dirichlet distribution that combines results\nof several imperfect models. This framework can be viewed as an extension of\nBayesian stacking. 
To illustrate the method, we study the ability of Bayesian\nmodel averaging and mixing techniques to mine nuclear masses. We show that the\nglobal and local mixtures of models reach excellent performance on both\nprediction accuracy and uncertainty quantification and are preferable to\nclassical Bayesian model averaging. Additionally, our statistical analysis\nindicates that improving model predictions through mixing rather than mixing of\ncorrected models leads to more robust extrapolations."}, "http://arxiv.org/abs/2311.01625": {"title": "Topological inference on brain networks across subtypes of post-stroke aphasia", "link": "http://arxiv.org/abs/2311.01625", "description": "Persistent homology (PH) characterizes the shape of brain networks through\nthe persistence features. Group comparison of persistence features from brain\nnetworks can be challenging as they are inherently heterogeneous. A recent\nscale-space representation of persistence diagram (PD) through heat diffusion\nreparameterizes using the finite number of Fourier coefficients with respect to\nthe Laplace-Beltrami (LB) eigenfunction expansion of the domain, which provides\na powerful vectorized algebraic representation for group comparisons of PDs. In\nthis study, we advance a transposition-based permutation test for comparing\nmultiple groups of PDs through the heat-diffusion estimates of the PDs. We\nevaluate the empirical performance of the spectral transposition test in\ncapturing within- and between-group similarity and dissimilarity with respect\nto statistical variation of topological noise and hole location. We also\nillustrate how the method extends naturally into a clustering scheme by\nsubtyping individuals with post-stroke aphasia through the PDs of their\nresting-state functional brain networks."}, "http://arxiv.org/abs/2311.01638": {"title": "Inference on summaries of a model-agnostic longitudinal variable importance trajectory", "link": "http://arxiv.org/abs/2311.01638", "description": "In prediction settings where data are collected over time, it is often of\ninterest to understand both the importance of variables for predicting the\nresponse at each time point and the importance summarized over the time series.\nBuilding on recent advances in estimation and inference for variable importance\nmeasures, we define summaries of variable importance trajectories. These\nmeasures can be estimated and the same approaches for inference can be applied\nregardless of the choice of the algorithm(s) used to estimate the prediction\nfunction. We propose a nonparametric efficient estimation and inference\nprocedure as well as a null hypothesis testing procedure that are valid even\nwhen complex machine learning tools are used for prediction. Through\nsimulations, we demonstrate that our proposed procedures have good operating\ncharacteristics, and we illustrate their use by investigating the longitudinal\nimportance of risk factors for suicide attempt."}, "http://arxiv.org/abs/2311.01681": {"title": "The R", "link": "http://arxiv.org/abs/2311.01681", "description": "We propose a prognostic stratum matching framework that addresses the\ndeficiencies of Randomized trial data subgroup analysis and transforms\nObservAtional Data to be used as if they were randomized, thus paving the road\nfor precision medicine. Our approach counters the effects of unobserved\nconfounding in observational data by correcting the estimated probabilities of\nthe outcome under a treatment through a novel two-step process. 
These\nprobabilities are then used to train Optimal Policy Trees (OPTs), which are\ndecision trees that optimally assign treatments to subgroups of patients based\non their characteristics. This facilitates the creation of clinically intuitive\ntreatment recommendations. We applied our framework to observational data of\npatients with gastrointestinal stromal tumors (GIST) and validated the OPTs in\nan external cohort using the sensitivity and specificity metrics. We show that\nthese recommendations outperformed those of experts in GIST. We further applied\nthe same framework to randomized clinical trial (RCT) data of patients with\nextremity sarcomas. Remarkably, despite the initial trial results suggesting\nthat all patients should receive treatment, our framework, after addressing\nimbalances in patient distribution due to the trial's small sample size,\nidentified through the OPTs a subset of patients with unique characteristics\nwho may not require treatment. Again, we successfully validated our\nrecommendations in an external cohort."}, "http://arxiv.org/abs/2311.01709": {"title": "Causal inference with Machine Learning-Based Covariate Representation", "link": "http://arxiv.org/abs/2311.01709", "description": "Utilizing covariate information has been a powerful approach to improving the\nefficiency and accuracy of causal inference, which supports massive amounts of\nrandomized experiments run on data-driven enterprises. However, state-of-the-art\napproaches can become practically unreliable when the dimension of the covariate\nincreases to just 50, whereas experiments on large platforms can observe even\nhigher-dimensional covariates. We propose a machine-learning-assisted covariate\nrepresentation approach that can effectively make use of historical experiment\nor observational data that are run on the same platform to understand which\nlower dimensions can effectively represent the higher-dimensional covariate. We\nthen propose design and estimation methods with the covariate representation.\nWe prove statistical reliability and performance guarantees for the proposed\nmethods. The empirical performance is demonstrated using numerical experiments."}, "http://arxiv.org/abs/2311.01762": {"title": "Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel", "link": "http://arxiv.org/abs/2311.01762", "description": "Kernel ridge regression, KRR, is a generalization of linear ridge regression\nthat is non-linear in the data, but linear in the parameters. The solution can\nbe obtained either as a closed-form solution, which includes a matrix\ninversion, or iteratively through gradient descent. Using the iterative\napproach opens up the possibility of changing the kernel during training, something that is\ninvestigated in this paper. We theoretically address the effects this has on\nmodel complexity and generalization. Based on our findings, we propose an\nupdate scheme for the bandwidth of translation-invariant kernels, where we\nlet the bandwidth decrease to zero during training, thus circumventing the need\nfor hyper-parameter selection. 
We demonstrate on real and synthetic data how\ndecreasing the bandwidth during training outperforms using a constant\nbandwidth, selected by cross-validation and marginal likelihood maximization.\nWe also show theoretically and empirically that using a decreasing bandwidth,\nwe are able to achieve both zero training error in combination with good\ngeneralization, and a double descent behavior, phenomena that do not occur for\nKRR with constant bandwidth but are known to appear for neural networks."}, "http://arxiv.org/abs/2311.01833": {"title": "Similarity network aggregation for the analysis of glacier ecosystems", "link": "http://arxiv.org/abs/2311.01833", "description": "The synthesis of information deriving from complex networks is a topic\nreceiving increasing relevance in ecology and environmental sciences. In\nparticular, the aggregation of multilayer networks, i.e. network structures\nformed by multiple interacting networks (the layers), constitutes a\nfast-growing field. In several environmental applications, the layers of a\nmultilayer network are modelled as a collection of similarity matrices\ndescribing how similar pairs of biological entities are, based on different\ntypes of features (e.g. biological traits). The present paper first discusses\ntwo main techniques for combining the multi-layered information into a single\nnetwork (the so-called monoplex), i.e. Similarity Network Fusion (SNF) and\nSimilarity Matrix Average (SMA). Then, the effectiveness of the two methods is\ntested on a real-world dataset of the relative abundance of microbial species\nin the ecosystems of nine glaciers (four glaciers in the Alps and five in the\nAndes). A preliminary clustering analysis on the monoplexes obtained with\ndifferent methods shows the emergence of a tightly connected community formed\nby species that are typical of cryoconite holes worldwide. Moreover, the\nweights assigned to different layers by the SMA algorithm suggest that two\nlarge South American glaciers (Exploradores and Perito Moreno) are structurally\ndifferent from the smaller glaciers in both Europe and South America. Overall,\nthese results highlight the importance of integration methods in the discovery\nof the underlying organizational structure of biological entities in multilayer\necological networks."}, "http://arxiv.org/abs/2311.01872": {"title": "The use of restricted mean survival time to estimate treatment effect under model misspecification, a simulation study", "link": "http://arxiv.org/abs/2311.01872", "description": "The use of the non-parametric Restricted Mean Survival Time endpoint (RMST)\nhas grown in popularity as trialists look to analyse time-to-event outcomes\nwithout the restrictions of the proportional hazards assumption. In this paper,\nwe evaluate the power and type I error rate of the parametric and\nnon-parametric RMST estimators when treatment effect is explained by multiple\ncovariates, including an interaction term. Utilising the RMST estimator in this\nway allows the combined treatment effect to be summarised as a one-dimensional\nestimator, which is evaluated using a one-sided hypothesis Z-test. The\nestimators are either fully specified or misspecified, both in terms of\nunaccounted covariates or misspecified knot points (where trials exhibit\ncrossing survival curves). A placebo-controlled trial of Gamma interferon is\nused as a motivating example to simulate associated survival times. 
When\ncorrectly specified, the parametric RMST estimator has the greatest power,\nregardless of the time of analysis. The misspecified RMST estimator generally\nperforms similarly when covariates mirror those of the fitted case study\ndataset. However, as the magnitude of the unaccounted covariate increases, the\nassociated power of the estimator decreases. In all cases, the non-parametric\nRMST estimator has the lowest power, and power remains very reliant on the time\nof analysis (with a later analysis time correlated with greater power)."}, "http://arxiv.org/abs/2311.01902": {"title": "High Precision Causal Model Evaluation with Conditional Randomization", "link": "http://arxiv.org/abs/2311.01902", "description": "The gold standard for causal model evaluation involves comparing model\npredictions with true effects estimated from randomized controlled trials\n(RCT). However, RCTs are not always feasible or ethical to perform. In\ncontrast, conditionally randomized experiments based on inverse probability\nweighting (IPW) offer a more realistic approach but may suffer from high\nestimation variance. To tackle this challenge and enhance causal model\nevaluation in real-world conditional randomization settings, we introduce a\nnovel low-variance estimator for causal error, dubbed the pairs estimator.\nBy applying the same IPW estimator to both the model and true experimental\neffects, our estimator effectively cancels out the variance due to IPW and\nachieves a smaller asymptotic variance. Empirical studies demonstrate the\nimproved performance of our estimator, highlighting its potential for achieving near-RCT\nperformance. Our method offers a simple yet powerful solution to evaluate\ncausal inference models in conditional randomization settings without\ncomplicated modification of the IPW estimator itself, paving the way for more\nrobust and reliable model assessments."}, "http://arxiv.org/abs/2311.01913": {"title": "Extended Relative Power Contribution that Allows to Evaluate the Effect of Correlated Noise", "link": "http://arxiv.org/abs/2311.01913", "description": "We propose an extension of Akaike's relative power contribution that can\nbe applied to data with correlations between noises. This method decomposes the\npower spectrum into a contribution of the terms caused by correlation between\ntwo noises, in addition to the contributions of the independent noises.\nNumerical examples confirm that some of the correlated noise has the effect of\nreducing the power spectrum."}, "http://arxiv.org/abs/2311.02019": {"title": "Reproducible Parameter Inference Using Bagged Posteriors", "link": "http://arxiv.org/abs/2311.02019", "description": "Under model misspecification, it is known that Bayesian posteriors often do\nnot properly quantify uncertainty about true or pseudo-true parameters. Even\nmore fundamentally, misspecification leads to a lack of reproducibility in the\nsense that the same model will yield contradictory posteriors on independent\ndata sets from the true distribution. To define a criterion for reproducible\nuncertainty quantification under misspecification, we consider the probability\nthat two confidence sets constructed from independent data sets have nonempty\noverlap, and we establish a lower bound on this overlap probability that holds\nfor any valid confidence sets. 
We prove that credible sets from the standard\nposterior can strongly violate this bound, particularly in high-dimensional\nsettings (i.e., with dimension increasing with sample size), indicating that it\nis not internally coherent under misspecification. To improve reproducibility\nin an easy-to-use and widely applicable way, we propose to apply bagging to the\nBayesian posterior (\"BayesBag\"); that is, to use the average of posterior\ndistributions conditioned on bootstrapped datasets. We motivate BayesBag from\nfirst principles based on Jeffrey conditionalization and show that the bagged\nposterior typically satisfies the overlap lower bound. Further, we prove a\nBernstein--von Mises theorem for the bagged posterior, establishing its\nasymptotic normal distribution. We demonstrate the benefits of BayesBag via\nsimulation experiments and an application to crime rate prediction."}, "http://arxiv.org/abs/2311.02043": {"title": "Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective", "link": "http://arxiv.org/abs/2311.02043", "description": "Quantile regression is a powerful tool for inferring how covariates affect\nspecific percentiles of the response distribution. Existing methods either\nestimate conditional quantiles separately for each quantile of interest or\nestimate the entire conditional distribution using semi- or non-parametric\nmodels. The former often produce inadequate models for real data and do not\nshare information across quantiles, while the latter are characterized by\ncomplex and constrained models that can be difficult to interpret and\ncomputationally inefficient. Further, neither approach is well-suited for\nquantile-specific subset selection. Instead, we pose the fundamental problems\nof linear quantile estimation, uncertainty quantification, and subset selection\nfrom a Bayesian decision analysis perspective. For any Bayesian regression\nmodel, we derive optimal and interpretable linear estimates and uncertainty\nquantification for each model-based conditional quantile. Our approach\nintroduces a quantile-focused squared error loss, which enables efficient,\nclosed-form computing and maintains a close relationship with Wasserstein-based\ndensity estimation. In an extensive simulation study, our methods demonstrate\nsubstantial gains in quantile estimation accuracy, variable selection, and\ninference over frequentist and Bayesian competitors. We apply these tools to\nidentify the quantile-specific impacts of social and environmental stressors on\neducational outcomes for a large cohort of children in North Carolina."}, "http://arxiv.org/abs/2010.08627": {"title": "Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function", "link": "http://arxiv.org/abs/2010.08627", "description": "Canonical correlation analysis (CCA) is a popular statistical technique for\nexploring relationships between datasets. In recent years, the estimation of\nsparse canonical vectors has emerged as an important but challenging variant of\nthe CCA problem, with widespread applications. Unfortunately, existing\nrate-optimal estimators for sparse canonical vectors have high computational\ncost. We propose a quasi-Bayesian estimation procedure that not only achieves\nthe minimax estimation rate, but also is easy to compute by Markov Chain Monte\nCarlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled\nRayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et\nal. 
(2018), we adopt a Bayesian framework that combines this\nquasi-log-likelihood with a spike-and-slab prior to regularize the inference\nand promote sparsity. We investigate the empirical behavior of the proposed\nmethod on both continuous and truncated data, and we demonstrate that it\noutperforms several state-of-the-art methods. As an application, we use the\nproposed methodology to maximally correlate clinical variables and proteomic\ndata for a better understanding of the Covid-19 disease."}, "http://arxiv.org/abs/2104.08300": {"title": "Semiparametric Sensitivity Analysis: Unmeasured Confounding In Observational Studies", "link": "http://arxiv.org/abs/2104.08300", "description": "Establishing cause-effect relationships from observational data often relies\non untestable assumptions. It is crucial to know whether, and to what extent,\nthe conclusions drawn from non-experimental studies are robust to potential\nunmeasured confounding. In this paper, we focus on the average causal effect\n(ACE) as our target of inference. We generalize the sensitivity analysis\napproach developed by Robins et al. (2000), Franks et al. (2020) and Zhou and\nYao (2023). We use semiparametric theory to derive the non-parametric efficient\ninfluence function of the ACE, for fixed sensitivity parameters. We use this\ninfluence function to construct a one-step bias-corrected estimator of the ACE.\nOur estimator depends on semiparametric models for the distribution of the\nobserved data; importantly, these models do not impose any restrictions on the\nvalues of sensitivity analysis parameters. We establish sufficient conditions\nensuring that our estimator has root-n asymptotics. We use our methodology to\nevaluate the causal effect of smoking during pregnancy on birth weight. We also\nevaluate the performance of the estimation procedure in a simulation study."}, "http://arxiv.org/abs/2207.00100": {"title": "A Bayesian 'sandwich' for variance estimation", "link": "http://arxiv.org/abs/2207.00100", "description": "Large-sample Bayesian analogs exist for many frequentist methods, but are\nless well-known for the widely-used 'sandwich' or 'robust' variance estimates.\nWe review existing approaches to Bayesian analogs of sandwich variance\nestimates and propose a new analog, as the Bayes rule under a form of balanced\nloss function, that combines elements of standard parametric inference with\nfidelity of the data to the model. Our development is general, for essentially\nany regression setting with independent outcomes. Being the large-sample\nequivalent of its frequentist counterpart, we show by simulation that Bayesian\nrobust standard error estimates can faithfully quantify the variability of\nparameter estimates even under model misspecification -- thus retaining the\nmajor attraction of the original frequentist version. We demonstrate our\nBayesian analog of standard error estimates when studying the association\nbetween age and systolic blood pressure in NHANES."}, "http://arxiv.org/abs/2210.17514": {"title": "Cost-aware Generalized $\\alpha$-investing for Multiple Hypothesis Testing", "link": "http://arxiv.org/abs/2210.17514", "description": "We consider the problem of sequential multiple hypothesis testing with\nnontrivial data collection costs. This problem appears, for example, when\nconducting biological experiments to identify differentially expressed genes of\na disease process. 
This work builds on the generalized $\\alpha$-investing\nframework, which enables control of the false discovery rate in a sequential\ntesting setting. We provide a theoretical analysis of the long-term asymptotic\nbehavior of $\\alpha$-wealth which motivates a consideration of sample size in\nthe $\\alpha$-investing decision rule. Posing the testing process as a game with\nnature, we construct a decision rule that optimizes the expected\n$\\alpha$-wealth reward (ERO) and provides an optimal sample size for each test.\nEmpirical results show that a cost-aware ERO decision rule correctly rejects\nmore false null hypotheses than other methods for $n=1$, where $n$ is the sample\nsize. When the sample size is not fixed, cost-aware ERO uses a prior on the null\nhypothesis to adaptively allocate the sample budget to each test. We extend\ncost-aware ERO investing to finite-horizon testing which enables the decision\nrule to allocate samples in a non-myopic manner. Finally, empirical tests on\nreal data sets from biological experiments show that cost-aware ERO balances\nthe allocation of samples to an individual test against the allocation of\nsamples across multiple tests."}, "http://arxiv.org/abs/2301.01480": {"title": "A new over-dispersed count model", "link": "http://arxiv.org/abs/2301.01480", "description": "A new two-parameter discrete distribution, namely the PoiG distribution, is\nderived by the convolution of a Poisson variate and an independently\ndistributed geometric random variable. This distribution generalizes both the\nPoisson and geometric distributions and can be used for modelling\nover-dispersed as well as equi-dispersed count data. A number of important\nstatistical properties of the proposed count model are studied, such as the probability\ngenerating function, the moment generating function, the moments, the survival\nfunction and the hazard rate function. Monotonic properties, such\nas log concavity and stochastic ordering, are also investigated in\ndetail. Method of moments and maximum likelihood estimators of the\nparameters of the proposed model are presented. It is envisaged that the\nproposed distribution may prove useful to practitioners for\nmodelling over-dispersed count data compared to its closest competitors."}, "http://arxiv.org/abs/2305.10050": {"title": "The Impact of Missing Data on Causal Discovery: A Multicentric Clinical Study", "link": "http://arxiv.org/abs/2305.10050", "description": "Causal inference for testing clinical hypotheses from observational data\npresents many difficulties because the underlying data-generating model and the\nassociated causal graph are not usually available. Furthermore, observational\ndata may contain missing values, which impact the recovery of the causal graph\nby causal discovery algorithms: a crucial issue often ignored in clinical\nstudies. In this work, we use data from a multi-centric study on endometrial\ncancer to analyze the impact of different missingness mechanisms on the\nrecovered causal graph. This is achieved by extending state-of-the-art causal\ndiscovery algorithms to exploit expert knowledge without sacrificing\ntheoretical soundness. We validate the recovered graph with expert physicians,\nshowing that our approach finds clinically-relevant solutions. 
Finally, we\ndiscuss the goodness of fit of our graph and its consistency from a clinical\ndecision-making perspective using graphical separation to validate causal\npathways."}, "http://arxiv.org/abs/2309.03952": {"title": "The Causal Roadmap and simulation studies to inform the Statistical Analysis Plan for real-data applications", "link": "http://arxiv.org/abs/2309.03952", "description": "The Causal Roadmap outlines a systematic approach to our research endeavors:\ndefine the quantity of interest, evaluate needed assumptions, conduct statistical\nestimation, and carefully interpret the results. At the estimation step, it is\nessential that the estimation algorithm be chosen thoughtfully for its\ntheoretical properties and expected performance. Simulations can help\nresearchers gain a better understanding of an estimator's statistical\nperformance under conditions unique to the real-data application. This in turn\ncan inform the rigorous pre-specification of a Statistical Analysis Plan (SAP),\nnot only stating the estimand (e.g., G-computation formula), the estimator\n(e.g., targeted minimum loss-based estimation [TMLE]), and adjustment\nvariables, but also the implementation of the estimator -- including nuisance\nparameter estimation and approach for variance estimation. Doing so helps\nensure valid inference (e.g., 95% confidence intervals with appropriate\ncoverage). Failing to pre-specify estimation can lead to data dredging and\ninflated Type-I error rates."}, "http://arxiv.org/abs/2311.02273": {"title": "A Sequential Learning Procedure with Applications to Online Sales Examination", "link": "http://arxiv.org/abs/2311.02273", "description": "In this paper, we consider the problem of estimating parameters in a linear\nregression model. We propose a sequential learning procedure to determine the\nsample size for achieving a given small estimation risk, under the widely used\nGauss-Markov setup with independent normal errors. The procedure is proven to\nenjoy the second-order efficiency and risk-efficiency properties, which are\nvalidated through Monte Carlo simulation studies. Using e-commerce data, we\nimplement the procedure to examine the influential factors of online sales."}, "http://arxiv.org/abs/2311.02306": {"title": "Heteroskedastic Tensor Clustering", "link": "http://arxiv.org/abs/2311.02306", "description": "Tensor clustering, which seeks to extract underlying cluster structures from\nnoisy tensor observations, has gained increasing attention. One extensively\nstudied model for tensor clustering is the tensor block model, which postulates\nthe existence of clustering structures along each mode and has found broad\napplications in areas like multi-tissue gene expression analysis and multilayer\nnetwork analysis. However, currently available computationally feasible methods\nfor tensor clustering either are limited to handling i.i.d. sub-Gaussian noise\nor suffer from suboptimal statistical performance, which restrains their\nutility in applications that have to deal with heteroskedastic data and/or low\nsignal-to-noise-ratio (SNR).\n\nTo overcome these challenges, we propose a two-stage method, named\n$\\mathsf{High\\text{-}order~HeteroClustering}$ ($\\mathsf{HHC}$), which starts by\nperforming tensor subspace estimation via a novel spectral algorithm called\n$\\mathsf{Thresholded~Deflated\\text{-}HeteroPCA}$, followed by approximate\n$k$-means to obtain cluster nodes. 
Encouragingly, our algorithm provably\nachieves exact clustering as long as the SNR exceeds the computational limit\n(ignoring logarithmic factors); here, the SNR refers to the ratio of the\npairwise disparity between nodes to the noise level, and the computational\nlimit indicates the lowest SNR that enables exact clustering with polynomial\nruntime. Comprehensive simulation and real-data experiments suggest that our\nalgorithm outperforms existing algorithms across various settings, delivering\nmore reliable clustering performance."}, "http://arxiv.org/abs/2311.02308": {"title": "Kernel-based sensitivity indices for any model behavior and screening", "link": "http://arxiv.org/abs/2311.02308", "description": "Complex models are often used to understand interactions and drivers of\nhuman-induced and/or natural phenomena. It is worth identifying the input\nvariables that drive the model output(s) in a given domain and/or govern\nspecific model behaviors such as contextual indicators based on\nsocio-environmental models. Using the theory of multivariate weighted\ndistributions to characterize specific model behaviors, we propose new measures\nof association between inputs and such behaviors. Our measures rely on\nsensitivity functionals (SFs) and kernel methods, including variance-based\nsensitivity analysis. The proposed $\\ell_1$-based kernel indices account for\ninteractions among inputs and higher-order moments of SFs, and their upper bounds\nare essentially equivalent to the Morris-type screening measures, including\ndependent elementary effects. Empirical kernel-based indices are derived,\nincluding their statistical properties for the computational issues, and\nnumerical results are provided."}, "http://arxiv.org/abs/2311.02312": {"title": "Efficient Change Point Detection and Estimation in High-Dimensional Correlation Matrices", "link": "http://arxiv.org/abs/2311.02312", "description": "This paper considers the problems of detecting a change point and estimating\nthe location in the correlation matrices of a sequence of high-dimensional\nvectors, where the dimension is large enough to be comparable to the sample\nsize or even much larger. A new break test is proposed based on signflip\nparallel analysis to detect the existence of change points. Furthermore, a\ntwo-step approach combining a signflip permutation dimension reduction step and\na CUSUM statistic is proposed to estimate the change point's location and\nrecover the support of changes. The consistency of the estimator is\nestablished. Simulation examples and real data applications illustrate the\nsuperior empirical performance of the proposed methods. In particular, the\nproposed methods outperform existing ones for non-Gaussian data and for a change\npoint in the extreme tail of a sequence, and they become more accurate as the\ndimension p increases. Supplementary materials for this article are available\nonline."}, "http://arxiv.org/abs/2311.02450": {"title": "Factor-guided estimation of large covariance matrix function with conditional functional sparsity", "link": "http://arxiv.org/abs/2311.02450", "description": "This paper addresses the fundamental task of estimating covariance matrix\nfunctions for high-dimensional functional data/functional time series. We\nconsider two functional factor structures encompassing either functional\nfactors with scalar loadings or scalar factors with functional loadings, and\npostulate functional sparsity on the covariance of idiosyncratic errors after\ntaking out the common unobserved factors. 
To facilitate estimation, we rely on\nthe spiked matrix model and its functional generalization, and derive some\nnovel asymptotic identifiability results, based on which we develop DIGIT and\nFPOET estimators under two functional factor models, respectively. Both\nestimators involve performing associated eigenanalysis to estimate the\ncovariance of common components, followed by adaptive functional thresholding\napplied to the residual covariance. We also develop functional information\ncriteria for the purpose of model selection. The convergence rates of estimated\nfactors, loadings, and conditional sparse covariance matrix functions under\nvarious functional matrix norms, are respectively established for DIGIT and\nFPOET estimators. Numerical studies including extensive simulations and two\nreal data applications on mortality rates and functional portfolio allocation\nare conducted to examine the finite-sample performance of the proposed\nmethodology."}, "http://arxiv.org/abs/2311.02532": {"title": "Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making", "link": "http://arxiv.org/abs/2311.02532", "description": "A/B testing is critical for modern technological companies to evaluate the\neffectiveness of newly developed products against standard baselines. This\npaper studies optimal designs that aim to maximize the amount of information\nobtained from online experiments to estimate treatment effects accurately. We\npropose three optimal allocation strategies in a dynamic setting where\ntreatments are sequentially assigned over time. These strategies are designed\nto minimize the variance of the treatment effect estimator when data follow a\nnon-Markov decision process or a (time-varying) Markov decision process. We\nfurther develop estimation procedures based on existing off-policy evaluation\n(OPE) methods and conduct extensive experiments in various environments to\ndemonstrate the effectiveness of the proposed methodologies. In theory, we\nprove the optimality of the proposed treatment allocation design and establish\nupper bounds for the mean squared errors of the resulting treatment effect\nestimators."}, "http://arxiv.org/abs/2311.02543": {"title": "Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling", "link": "http://arxiv.org/abs/2311.02543", "description": "This paper discusses estimation and limited information goodness-of-fit test\nstatistics in factor models for binary data using pairwise likelihood\nestimation and sampling weights. The paper extends the applicability of\npairwise likelihood estimation for factor models with binary data to\naccommodate complex sampling designs. Additionally, it introduces two key\nlimited information test statistics: the Pearson chi-squared test and the Wald\ntest. To enhance computational efficiency, the paper introduces modifications\nto both test statistics. The performance of the estimation and the proposed\ntest statistics under simple random sampling and unequal probability sampling\nis evaluated using simulated data."}, "http://arxiv.org/abs/2311.02574": {"title": "Semi-supervised Estimation of Event Rate with Doubly-censored Survival Data", "link": "http://arxiv.org/abs/2311.02574", "description": "Electronic Health Record (EHR) has emerged as a valuable source of data for\ntranslational research. 
To leverage EHR data for risk prediction and\nsubsequently clinical decision support, clinical endpoints are often time to\nonset of a clinical condition of interest. Precise information on clinical\nevent times is often not directly available and requires labor-intensive manual\nchart review to ascertain. In addition, events may occur outside of the\nhospital system, resulting in both left and right censoring often termed double\ncensoring. On the other hand, proxies such as time to the first diagnostic code\nare readily available yet with varying degrees of accuracy. Using error-prone\nevent times derived from these proxies can lead to biased risk estimates while\nonly relying on manually annotated event times, which are typically only\navailable for a small subset of patients, can lead to high variability. This\nsignifies the need for semi-supervised estimation methods that can efficiently\ncombine information from both the small subset of labeled observations and a\nlarge size of surrogate proxies. While semi-supervised estimation methods have\nbeen recently developed for binary and right-censored data, no methods\ncurrently exist in the presence of double censoring. This paper fills the gap\nby developing a robust and efficient Semi-supervised Estimation of Event rate\nwith Doubly-censored Survival data (SEEDS) by leveraging a small set of gold\nstandard labels and a large set of surrogate features. Under regularity\nconditions, we demonstrate that the proposed SEEDS estimator is consistent and\nasymptotically normal. Simulation results illustrate that SEEDS performs well\nin finite samples and can be substantially more efficient compared to the\nsupervised counterpart. We apply the SEEDS to estimate the age-specific\nsurvival rate of type 2 diabetes using EHR data from Mass General Brigham."}, "http://arxiv.org/abs/2311.02610": {"title": "An adaptive standardisation model for Day-Ahead electricity price forecasting", "link": "http://arxiv.org/abs/2311.02610", "description": "The study of Day-Ahead prices in the electricity market is one of the most\npopular problems in time series forecasting. Previous research has focused on\nemploying increasingly complex learning algorithms to capture the sophisticated\ndynamics of the market. However, there is a threshold where increased\ncomplexity fails to yield substantial improvements. In this work, we propose an\nalternative approach by introducing an adaptive standardisation to mitigate the\neffects of dataset shifts that commonly occur in the market. By doing so,\nlearning algorithms can prioritize uncovering the true relationship between the\ntarget variable and the explanatory variables. We investigate four distinct\nmarkets, including two novel datasets, previously unexplored in the literature.\nThese datasets provide a more realistic representation of the current market\ncontext, that conventional datasets do not show. The results demonstrate a\nsignificant improvement across all four markets, using learning algorithms that\nare less complex yet widely accepted in the literature. 
This significant\nadvancement opens up new lines of research in this field, highlighting\nthe potential of adaptive transformations in enhancing the performance of\nforecasting models."}, "http://arxiv.org/abs/2311.02634": {"title": "Pointwise Data Depth for Univariate and Multivariate Functional Outlier Detection", "link": "http://arxiv.org/abs/2311.02634", "description": "Data depth is an efficient tool for robustly summarizing the distribution of\nfunctional data and detecting potential magnitude and shape outliers. Commonly\nused functional data depth notions, such as the modified band depth and\nextremal depth, are estimated from pointwise depth for each observed functional\nobservation. However, these techniques require calculating a single depth\nvalue for each functional observation, which may not be sufficient to\ncharacterize the distribution of the functional data and detect potential\noutliers. This paper presents an innovative approach to make the best use of\npointwise depth. We propose using the pointwise depth distribution for\nmagnitude outlier visualization and the correlation between pairwise depth for\nshape outlier detection. Furthermore, a bootstrap-based testing procedure has\nbeen introduced for the correlation to test whether there is any shape outlier.\nThe proposed univariate methods are then extended to bivariate functional data.\nThe performance of the proposed methods is examined and compared to\nconventional outlier detection techniques by intensive simulation studies. In\naddition, the developed methods are applied to simulated solar energy datasets\nfrom a photovoltaic system. Results reveal that the proposed method offers\nsuperior detection performance over conventional techniques. These findings\nwill benefit engineers and practitioners in monitoring photovoltaic systems by\ndetecting unnoticed anomalies and outliers."}, "http://arxiv.org/abs/2311.02658": {"title": "Nonparametric Estimation and Comparison of Distance Distributions from Censored Data", "link": "http://arxiv.org/abs/2311.02658", "description": "Transportation distance information is a powerful resource, but location\nrecords are often censored due to privacy concerns or regulatory mandates. We\nconsider the problem of transportation event distance distribution\nreconstruction, which aims to handle this obstacle and has applications to\npublic health informatics, logistics, and more. We propose numerical methods to\napproximate, sample from, and compare distributions of distances between\ncensored location pairs. We validate empirically and demonstrate applicability\nto practical geospatial data analysis tasks. Our code is available on GitHub."}, "http://arxiv.org/abs/2311.02766": {"title": "Riemannian Laplace Approximation with the Fisher Metric", "link": "http://arxiv.org/abs/2311.02766", "description": "Laplace's method approximates a target density with a Gaussian\ndistribution at its mode. It is computationally efficient and asymptotically\nexact for Bayesian inference due to the Bernstein-von Mises theorem, but for\ncomplex targets and finite-data posteriors it is often too crude an\napproximation. 
A recent generalization of the Laplace Approximation transforms\nthe Gaussian approximation according to a chosen Riemannian geometry providing\na richer approximation family, while still retaining computational efficiency.\nHowever, as shown here, its properties heavily depend on the chosen metric,\nindeed the metric adopted in previous work results in approximations that are\noverly narrow as well as being biased even at the limit of infinite data. We\ncorrect this shortcoming by developing the approximation family further,\nderiving two alternative variants that are exact at the limit of infinite data,\nextending the theoretical analysis of the method, and demonstrating practical\nimprovements in a range of experiments."}, "http://arxiv.org/abs/2311.02808": {"title": "Nonparametric Estimation of Conditional Copula using Smoothed Checkerboard Bernstein Sieves", "link": "http://arxiv.org/abs/2311.02808", "description": "Conditional copulas are useful tools for modeling the dependence between\nmultiple response variables that may vary with a given set of predictor\nvariables. Conditional dependence measures such as conditional Kendall's tau\nand Spearman's rho that can be expressed as functionals of the conditional\ncopula are often used to evaluate the strength of dependence conditioning on\nthe covariates. In general, semiparametric estimation methods of conditional\ncopulas rely on an assumed parametric copula family where the copula parameter\nis assumed to be a function of the covariates. The functional relationship can\nbe estimated nonparametrically using different techniques but it is required to\nchoose an appropriate copula model from various candidate families. In this\npaper, by employing the empirical checkerboard Bernstein copula (ECBC)\nestimator we propose a fully nonparametric approach for estimating conditional\ncopulas, which doesn't require any selection of parametric copula models.\nClosed-form estimates of the conditional dependence measures are derived\ndirectly from the proposed ECBC-based conditional copula estimator. We provide\nthe large-sample consistency of the proposed estimator as well as the estimates\nof conditional dependence measures. The finite-sample performance of the\nproposed estimator and comparison with semiparametric methods are investigated\nthrough simulation studies. An application to real case studies is also\nprovided."}, "http://arxiv.org/abs/2311.02822": {"title": "Robust estimation of heteroscedastic regression models: a brief overview and new proposals", "link": "http://arxiv.org/abs/2311.02822", "description": "We collect robust proposals given in the field of regression models with\nheteroscedastic errors. Our motivation stems from the fact that the\npractitioner frequently faces the confluence of two phenomena in the context of\ndata analysis: non--linearity and heteroscedasticity. The impact of\nheteroscedasticity on the precision of the estimators is well--known, however\nthe conjunction of these two phenomena makes handling outliers more difficult.\n\nAn iterative procedure to estimate the parameters of a heteroscedastic\nnon--linear model is considered. 
The studied estimators combine weighted\n$MM-$regression estimators, to control the impact of high leverage points, and\na robust method to estimate the parameters of the variance function."}, "http://arxiv.org/abs/2311.03247": {"title": "Multivariate selfsimilarity: Multiscale eigen-structures for selfsimilarity parameter estimation", "link": "http://arxiv.org/abs/2311.03247", "description": "Scale-free dynamics, formalized by selfsimilarity, provides a versatile\nparadigm massively and ubiquitously used to model temporal dynamics in\nreal-world data. However, its practical use has mostly remained univariate so\nfar. By contrast, modern applications often demand multivariate data analysis.\nAccordingly, models for multivariate selfsimilarity were recently proposed.\nNevertheless, they have remained rarely used in practice because of a lack of\navailable robust estimation procedures for the vector of selfsimilarity\nparameters. Building upon recent mathematical developments, the present work\nputs forth an efficient estimation procedure based on the theoretical study of\nthe multiscale eigenstructure of the wavelet spectrum of multivariate\nselfsimilar processes. The estimation performance is studied theoretically in\nthe asymptotic limits of large scale and sample sizes, and computationally for\nfinite-size samples. As a practical outcome, a fully operational and documented\nmultivariate signal processing estimation toolbox is made freely available and\nis ready for practical use on real-world data. Its potential benefits are\nillustrated in epileptic seizure prediction from multi-channel EEG data."}, "http://arxiv.org/abs/2311.03289": {"title": "Batch effect correction with sample remeasurement in highly confounded case-control studies", "link": "http://arxiv.org/abs/2311.03289", "description": "Batch effects are pervasive in biomedical studies. One approach to address\nthe batch effects is repeatedly measuring a subset of samples in each batch.\nThese remeasured samples are used to estimate and correct the batch effects.\nHowever, rigorous statistical methods for batch effect correction with\nremeasured samples are severely under-developed. In this study, we developed a\nframework for batch effect correction using remeasured samples in highly\nconfounded case-control studies. We provided theoretical analyses of the\nproposed procedure, evaluated its power characteristics, and provided a power\ncalculation tool to aid in the study design. We found that the number of\nsamples that need to be remeasured depends strongly on the between-batch\ncorrelation. When the correlation is high, remeasuring a small subset of\nsamples is possible to rescue most of the power."}, "http://arxiv.org/abs/2311.03343": {"title": "Distribution-uniform anytime-valid inference", "link": "http://arxiv.org/abs/2311.03343", "description": "Are asymptotic confidence sequences and anytime $p$-values uniformly valid\nfor a nontrivial class of distributions $\\mathcal{P}$? We give a positive\nanswer to this question by deriving distribution-uniform anytime-valid\ninference procedures. Historically, anytime-valid methods -- including\nconfidence sequences, anytime $p$-values, and sequential hypothesis tests that\nenable inference at stopping times -- have been justified nonasymptotically.\nNevertheless, asymptotic procedures such as those based on the central limit\ntheorem occupy an important part of statistical toolbox due to their\nsimplicity, universality, and weak assumptions. 
While recent work has derived\nasymptotic analogues of anytime-valid methods with the aforementioned benefits,\nthese were not shown to be $\\mathcal{P}$-uniform, meaning that their\nasymptotics are not uniformly valid in a class of distributions $\\mathcal{P}$.\nIndeed, the anytime-valid inference literature currently has no central limit\ntheory to draw from that is both uniform in $\\mathcal{P}$ and in the sample\nsize $n$. This paper fills that gap by deriving a novel $\\mathcal{P}$-uniform\nstrong Gaussian approximation theorem, enabling $\\mathcal{P}$-uniform\nanytime-valid inference for the first time. Along the way, our Gaussian\napproximation also yields a $\\mathcal{P}$-uniform law of the iterated\nlogarithm."}, "http://arxiv.org/abs/2009.10780": {"title": "Independent finite approximations for Bayesian nonparametric inference", "link": "http://arxiv.org/abs/2009.10780", "description": "Completely random measures (CRMs) and their normalizations (NCRMs) offer\nflexible models in Bayesian nonparametrics. But their infinite dimensionality\npresents challenges for inference. Two popular finite approximations are\ntruncated finite approximations (TFAs) and independent finite approximations\n(IFAs). While the former have been well-studied, IFAs lack similarly general\nbounds on approximation error, and there has been no systematic comparison\nbetween the two options. In the present work, we propose a general recipe to\nconstruct practical finite-dimensional approximations for homogeneous CRMs and\nNCRMs, in the presence or absence of power laws. We call our construction the\nautomated independent finite approximation (AIFA). Relative to TFAs, we show\nthat AIFAs facilitate more straightforward derivations and use of parallel\ncomputing in approximate inference. We upper bound the approximation error of\nAIFAs for a wide class of common CRMs and NCRMs -- and thereby develop\nguidelines for choosing the approximation level. Our lower bounds in key cases\nsuggest that our upper bounds are tight. We prove that, for worst-case choices\nof observation likelihoods, TFAs are more efficient than AIFAs. Conversely, we\nfind that in real-data experiments with standard likelihoods, AIFAs and TFAs\nperform similarly. Moreover, we demonstrate that AIFAs can be used for\nhyperparameter estimation even when other potential IFA options struggle or do\nnot apply."}, "http://arxiv.org/abs/2111.07517": {"title": "Correlation Improves Group Testing: Capturing the Dilution Effect", "link": "http://arxiv.org/abs/2111.07517", "description": "Population-wide screening to identify and isolate infectious individuals is a\npowerful tool for controlling COVID-19 and other infectious diseases. Group\ntesting can enable such screening despite limited testing resources. Samples'\nviral loads are often positively correlated, either because prevalence and\nsample collection are both correlated with geography, or through intentional\nenhancement, e.g., by pooling samples from people in similar risk groups. Such\ncorrelation is known to improve test efficiency in mathematical models with\nfixed sensitivity. In reality, however, dilution degrades a pooled test's\nsensitivity by an amount that varies with the number of positives in the pool.\nIn the presence of this dilution effect, we study the impact of correlation on\nthe most widely-used group testing procedure, the Dorfman procedure. We show\nthat correlation's effects are significantly altered by the dilution effect. 
We\nprove that under a general correlation structure, pooling correlated samples\ntogether (called correlated pooling) achieves higher sensitivity but can\ndegrade test efficiency compared to independently pooling the samples (called\nnaive pooling) using the same pool size. We identify an alternative measure of\ntest resource usage, the number of positives found per test consumed, which we\nargue is better aligned with infection control, and show that correlated\npooling outperforms naive pooling on this measure. We build a realistic\nagent-based simulation to contextualize our theoretical results within an\nepidemic control framework. We argue that the dilution effect makes it even\nmore important for policy-makers evaluating group testing protocols for\nlarge-scale screening to incorporate naturally arising correlation and to\nintentionally maximize correlation."}, "http://arxiv.org/abs/2202.05349": {"title": "Robust Parameter Estimation for the Lee-Carter Family: A Probabilistic Principal Component Approach", "link": "http://arxiv.org/abs/2202.05349", "description": "The well-known Lee-Carter model uses a bilinear form\n$\\log(m_{x,t})=a_x+b_xk_t$ to represent the log mortality rate and has been\nwidely researched and developed over the past thirty years. However, there has\nbeen little attention paid to the robustness of the parameters against\noutliers, especially when estimating $b_x$. In response, we propose a robust\nestimation method for a wide family of Lee-Carter-type models, treating the\nproblem as a Probabilistic Principal Component Analysis (PPCA) with\nmultivariate $t$-distributions. An efficient Expectation-Maximization (EM)\nalgorithm is also derived for implementation.\n\nThe benefits of the method are threefold: 1) it produces more robust\nestimates of both $b_x$ and $k_t$, 2) it can be naturally extended to a large\nfamily of Lee-Carter type models, including those for modelling multiple\npopulations, and 3) it can be integrated with other existing time series models\nfor $k_t$. Using numerical studies based on United States mortality data from\nthe Human Mortality Database, we show that the proposed model performs more robustly\nthan conventional methods in the presence of outliers."}, "http://arxiv.org/abs/2204.11979": {"title": "Semi-Parametric Sensitivity Analysis for Trials with Irregular and Informative Assessment Times", "link": "http://arxiv.org/abs/2204.11979", "description": "Many trials are designed to collect outcomes at or around pre-specified times\nafter randomization. In practice, there can be substantial variability in the\ntimes when participants are actually assessed. Such irregular assessment times\npose a challenge to learning the effect of treatment since not all participants\nhave outcome assessments at the times of interest. Furthermore, observed\noutcome values may not be representative of all participants' outcomes at a\ngiven time. This problem, known as informative assessment times, can arise if\nparticipants tend to have assessments when their outcomes are better (or worse)\nthan at other times, or if participants with better outcomes tend to have more\n(or fewer) assessments. Methods have been developed that account for some types\nof informative assessment; however, since these methods rely on untestable\nassumptions, sensitivity analyses are needed. We develop a sensitivity analysis\nmethodology by extending existing weighting methods. 
Our method accounts for\nthe possibility that participants with worse outcomes at a given time are more\n(or less) likely than other participants to have an assessment at that time,\neven after controlling for variables observed earlier in the study. We apply\nour method to a randomized trial of low-income individuals with uncontrolled\nasthma. We illustrate implementation of our influence-function based estimation\nprocedure in detail, and we derive the large-sample distribution of our\nestimator and evaluate its finite-sample performance."}, "http://arxiv.org/abs/2205.13935": {"title": "Detecting hidden confounding in observational data using multiple environments", "link": "http://arxiv.org/abs/2205.13935", "description": "A common assumption in causal inference from observational data is that there\nis no hidden confounding. Yet it is, in general, impossible to verify this\nassumption from a single dataset. Under the assumption of independent causal\nmechanisms underlying the data-generating process, we demonstrate a way to\ndetect unobserved confounders when having multiple observational datasets\ncoming from different environments. We present a theory for testable\nconditional independencies that are only absent when there is hidden\nconfounding and examine cases where we violate its assumptions: degenerate &\ndependent mechanisms, and faithfulness violations. Additionally, we propose a\nprocedure to test these independencies and study its empirical finite-sample\nbehavior using simulation studies and semi-synthetic data based on a real-world\ndataset. In most cases, the proposed procedure correctly predicts the presence\nof hidden confounding, particularly when the confounding bias is large."}, "http://arxiv.org/abs/2206.09444": {"title": "Bayesian non-conjugate regression via variational message passing", "link": "http://arxiv.org/abs/2206.09444", "description": "Variational inference is a popular method for approximating the posterior\ndistribution of hierarchical Bayesian models. It is well-recognized in the\nliterature that the choice of the approximation family and the regularity\nproperties of the posterior strongly influence the efficiency and accuracy of\nvariational methods. While model-specific conjugate approximations offer\nsimplicity, they often converge slowly and may yield poor approximations.\nNon-conjugate approximations instead are more flexible but typically require\nthe calculation of expensive multidimensional integrals. This study focuses on\nBayesian regression models that use possibly non-differentiable loss functions\nto measure prediction misfit. The data behavior is modeled using a linear\npredictor, potentially transformed using a bijective link function. Examples\ninclude generalized linear models, mixed additive models, support vector\nmachines, and quantile regression. To address the limitations of non-conjugate\nsettings, the study proposes an efficient non-conjugate variational message\npassing method for approximate posterior inference, which only requires the\ncalculation of univariate numerical integrals when analytical solutions are not\navailable. The approach does not require differentiability, conjugacy, or\nmodel-specific data-augmentation strategies, thereby naturally extending to\nmodels with non-conjugate likelihood functions. Additionally, a stochastic\nimplementation is provided to handle large-scale data problems. The proposed\nmethod's performances are evaluated through extensive simulations and real data\nexamples. 
Overall, the results highlight the effectiveness of the proposed\nvariational message passing method, demonstrating its computational efficiency\nand approximation accuracy as an alternative to existing methods in Bayesian\ninference for regression models."}, "http://arxiv.org/abs/2206.10143": {"title": "A Contrastive Approach to Online Change Point Detection", "link": "http://arxiv.org/abs/2206.10143", "description": "We suggest a novel procedure for online change point detection. Our approach\nexpands on the idea of maximizing a discrepancy measure between points from\npre-change and post-change distributions. This leads to a flexible procedure\nsuitable for both parametric and nonparametric scenarios. We prove\nnon-asymptotic bounds on the average running length of the procedure and its\nexpected detection delay. The efficiency of the algorithm is illustrated with\nnumerical experiments on synthetic and real-world data sets."}, "http://arxiv.org/abs/2206.15367": {"title": "Targeted learning in observational studies with multi-valued treatments: An evaluation of antipsychotic drug treatment safety", "link": "http://arxiv.org/abs/2206.15367", "description": "We investigate estimation of causal effects of multiple competing\n(multi-valued) treatments in the absence of randomization. Our work is\nmotivated by an intention-to-treat study of the relative cardiometabolic risk\nof assignment to one of six commonly prescribed antipsychotic drugs in a cohort\nof nearly 39,000 adults with serious mental illness. Doubly-robust\nestimators, such as targeted minimum loss-based estimation (TMLE), require\ncorrect specification of either the treatment model or outcome model to ensure\nconsistent estimation; however, common TMLE implementations estimate treatment\nprobabilities using multiple binomial regressions rather than multinomial\nregression. We implement a TMLE estimator that uses multinomial treatment\nassignment and ensemble machine learning to estimate average treatment effects.\nOur multinomial implementation improves coverage, but does not necessarily\nreduce bias, relative to the binomial implementation in simulation experiments\nwith varying treatment propensity overlap and event rates. Evaluating the\ncausal effects of the antipsychotics on 3-year diabetes risk or death, we find\na safety benefit of moving from a second-generation drug considered among the\nsafest of the second-generation drugs to an infrequently prescribed\nfirst-generation drug thought to pose a generally low cardiometabolic risk."}, "http://arxiv.org/abs/2209.04364": {"title": "Evaluating tests for cluster-randomized trials with few clusters under generalized linear mixed models with covariate adjustment: a simulation study", "link": "http://arxiv.org/abs/2209.04364", "description": "Generalized linear mixed models (GLMM) are commonly used to analyze clustered\ndata, but when the number of clusters is small to moderate, standard\nstatistical tests may produce elevated type I error rates. Small-sample\ncorrections have been proposed for continuous or binary outcomes without\ncovariate adjustment. However, appropriate tests to use for count outcomes or\nunder covariate-adjusted models remain unknown. An important setting in which\nthis issue arises is in cluster-randomized trials (CRTs). 
Because many CRTs\nhave just a few clusters (e.g., clinics or health systems), covariate\nadjustment is particularly critical to address potential chance imbalance\nand/or low power (e.g., adjustment following stratified randomization or for\nthe baseline value of the outcome). We conducted simulations to evaluate\nGLMM-based tests of the treatment effect that account for the small (10) or\nmoderate (20) number of clusters under a parallel-group CRT setting across\nscenarios of covariate adjustment (including adjustment for one or more\nperson-level or cluster-level covariates) for both binary and count outcomes.\nWe find that when the intraclass correlation is non-negligible ($\\geq 0.01$)\nand the number of covariates is small ($\\leq 2$), likelihood ratio tests with a\nbetween-within denominator degree of freedom have type I error rates close to\nthe nominal level. When the number of covariates is moderate ($\\geq 5$), across\nour simulation scenarios, the relative performance of the tests varied\nconsiderably and no method performed uniformly well. Therefore, we recommend\nadjusting for no more than a few covariates and using likelihood ratio tests\nwith a between-within denominator degree of freedom."}, "http://arxiv.org/abs/2211.14578": {"title": "Estimation and inference for transfer learning with high-dimensional quantile regression", "link": "http://arxiv.org/abs/2211.14578", "description": "Transfer learning has become an essential technique to exploit information\nfrom the source domain to boost performance of the target task. Despite the\nprevalence in high-dimensional data, heterogeneity and heavy tails are\ninsufficiently accounted for by current transfer learning approaches and thus\nmay undermine the resulting performance. We propose a transfer learning\nprocedure in the framework of high-dimensional quantile regression models to\naccommodate heterogeneity and heavy tails in the source and target domains. We\nestablish error bounds of transfer learning estimator based on delicately\nselected transferable source domains, showing that lower error bounds can be\nachieved for critical selection criterion and larger sample size of source\ntasks. We further propose valid confidence interval and hypothesis test\nprocedures for individual component of high-dimensional quantile regression\ncoefficients by advocating a double transfer learning estimator, which is\none-step debiased estimator for the transfer learning estimator wherein the\ntechnique of transfer learning is designed again. By adopting data-splitting\ntechnique, we advocate a transferability detection approach that guarantees to\ncircumvent negative transfer and identify transferable sources with high\nprobability. Simulation results demonstrate that the proposed method exhibits\nsome favorable and compelling performances and the practical utility is further\nillustrated by analyzing a real example."}, "http://arxiv.org/abs/2306.03302": {"title": "Statistical Inference Under Constrained Selection Bias", "link": "http://arxiv.org/abs/2306.03302", "description": "Large-scale datasets are increasingly being used to inform decision making.\nWhile this effort aims to ground policy in real-world evidence, challenges have\narisen as selection bias and other forms of distribution shifts often plague\nobservational data. 
Previous attempts to provide robust inference have given\nguarantees depending on a user-specified amount of possible distribution shift\n(e.g., the maximum KL divergence between the observed and target\ndistributions). However, decision makers will often have additional knowledge\nabout the target distribution which constrains the kind of possible shifts. To\nleverage such information, we propose a framework that enables statistical\ninference in the presence of selection bias which obeys user-specified\nconstraints in the form of functions whose expectation is known under the\ntarget distribution. The output is high-probability bounds on the value of an\nestimand for the target distribution. Hence, our method leverages domain\nknowledge in order to partially identify a wide class of estimands. We analyze\nthe computational and statistical properties of methods to estimate these\nbounds and show that our method can produce informative bounds on a variety of\nsimulated and semisynthetic tasks, as well as in a real-world use case."}, "http://arxiv.org/abs/2307.15348": {"title": "Stratified principal component analysis", "link": "http://arxiv.org/abs/2307.15348", "description": "This paper investigates a general family of covariance models with repeated\neigenvalues extending probabilistic principal component analysis (PPCA). A\ngeometric interpretation shows that these models are parameterised by flag\nmanifolds and stratify the space of covariance matrices according to the\nsequence of eigenvalue multiplicities. The subsequent analysis sheds light on\nPPCA and answers an important question on the practical identifiability of\nindividual eigenvectors. It notably shows that one rarely has enough samples to\nfit a covariance model with distinct eigenvalues and that block-averaging the\nadjacent sample eigenvalues with small gaps achieves a better\ncomplexity/goodness-of-fit tradeoff."}, "http://arxiv.org/abs/2308.02005": {"title": "Bias Correction for Randomization-Based Estimation in Inexactly Matched Observational Studies", "link": "http://arxiv.org/abs/2308.02005", "description": "Matching has been widely used to mimic a randomized experiment with\nobservational data. Ideally, treated subjects are exactly matched with controls\nfor the covariates, and randomization-based estimation can then be conducted as\nin a randomized experiment (assuming no unobserved covariates). However, when\nthere exists continuous covariates or many covariates, matching typically\nshould be inexact. Previous studies have routinely ignored inexact matching in\nthe downstream randomization-based estimation as long as some covariate balance\ncriteria are satisfied, which can cause severe estimation bias. Built on the\ncovariate-adaptive randomization inference framework, in this research note, we\npropose two new classes of bias-corrected randomization-based estimators to\nreduce estimation bias due to inexact matching: the bias-corrected maximum\n$p$-value estimator for the constant treatment effect and the bias-corrected\ndifference-in-means estimator for the average treatment effect. Our simulation\nresults show that the proposed bias-corrected estimators can effectively reduce\nestimation bias due to inexact matching."}, "http://arxiv.org/abs/2310.01153": {"title": "Online Permutation Tests: $e$-values and Likelihood Ratios for Testing Group Invariance", "link": "http://arxiv.org/abs/2310.01153", "description": "We develop a flexible online version of the permutation test. 
This allows us\nto test exchangeability as the data is arriving, where we can choose to stop or\ncontinue without invalidating the size of the test. Our methods generalize\nbeyond exchangeability to other forms of invariance under a compact group. Our\napproach relies on constructing an $e$-process that is the running product of\nmultiple conditional $e$-values. To construct $e$-values, we first develop an\nessentially complete class of admissible $e$-values in which one can flexibly\n`plug in' almost any desired test statistic. To make the $e$-values\nconditional, we explore the intersection between the concepts of conditional\ninvariance and sequential invariance, and find that the appropriate conditional\ndistribution can be captured by a compact subgroup. To find powerful $e$-values\nfor given alternatives, we develop the theory of likelihood ratios for testing\ngroup invariance, yielding new optimality results for group invariance tests.\nThese statistics turn out to exist in three different flavors, depending on the\nspace on which we specify our alternative. We apply these statistics to test\nagainst a Gaussian location shift, which yields connections to the $t$-test\nwhen testing sphericity, connections to the softmax function and its\ntemperature when testing exchangeability, and yields an improved version of a\nknown $e$-value for testing sign-symmetry. Moreover, we introduce an impatience\nparameter that allows users to obtain more power now in exchange for less power\nin the long run."}, "http://arxiv.org/abs/2311.03381": {"title": "Separating and Learning Latent Confounders to Enhancing User Preferences Modeling", "link": "http://arxiv.org/abs/2311.03381", "description": "Recommender models aim to capture user preferences from historical feedback\nand then predict user-specific feedback on candidate items. However, the\npresence of various unmeasured confounders causes deviations between the user\npreferences in the historical feedback and the true preferences, resulting in\nmodels not meeting their expected performance. Existing debiasing models either\n(1) are specific to solving one particular bias or (2) directly obtain auxiliary\ninformation from user historical feedback, and thus cannot identify whether the\nlearned preferences are true user preferences or mixed with unmeasured\nconfounders. Moreover, we find that the former recommender system is not only a\nsuccessor to unmeasured confounders but also acts as an unmeasured confounder\naffecting user preference modeling, which has always been neglected in previous\nstudies. To this end, we incorporate the effect of the former recommender\nsystem and treat it as a proxy for all unmeasured confounders. We propose a\nnovel framework, \\textbf{S}eparating and \\textbf{L}earning Latent Confounders\n\\textbf{F}or \\textbf{R}ecommendation (\\textbf{SLFR}), which obtains the\nrepresentation of unmeasured confounders to identify the counterfactual\nfeedback by disentangling user preferences and unmeasured confounders, then\nguides the target model to capture the true preferences of users. Extensive\nexperiments on five real-world datasets validate the advantages of our method."}, "http://arxiv.org/abs/2311.03382": {"title": "Causal Structure Representation Learning of Confounders in Latent Space for Recommendation", "link": "http://arxiv.org/abs/2311.03382", "description": "Inferring user preferences from the historical feedback of users is a\nvaluable problem in recommender systems. 
Conventional approaches often rely on\nthe assumption that user preferences in the feedback data are equivalent to the\nreal user preferences without additional noise, which simplifies the problem\nmodeling. However, there are various confounders during user-item interactions,\nsuch as weather and even the recommendation system itself. Therefore,\nneglecting the influence of confounders will result in inaccurate user\npreferences and suboptimal performance of the model. Furthermore, the\nunobservability of confounders poses a challenge in further addressing the\nproblem. To address these issues, we refine the problem and propose a more\nrational solution. Specifically, we consider the influence of confounders,\ndisentangle them from user preferences in the latent space, and employ causal\ngraphs to model their interdependencies without specific labels. By cleverly\ncombining local and global causal graphs, we capture the user-specificity of\nconfounders on user preferences. We theoretically demonstrate the\nidentifiability of the obtained causal graph. Finally, we propose our model\nbased on Variational Autoencoders, named Causal Structure representation\nlearning of Confounders in latent space (CSC). We conducted extensive\nexperiments on one synthetic dataset and five real-world datasets,\ndemonstrating the superiority of our model. Furthermore, we demonstrate that\nthe learned causal representations of confounders are controllable, potentially\noffering users fine-grained control over the objectives of their recommendation\nlists with the learned causal graphs."}, "http://arxiv.org/abs/2311.03554": {"title": "Conditional Randomization Tests for Behavioral and Neural Time Series", "link": "http://arxiv.org/abs/2311.03554", "description": "Randomization tests allow simple and unambiguous tests of null hypotheses, by\ncomparing observed data to a null ensemble in which experimentally-controlled\nvariables are randomly resampled. In behavioral and neuroscience experiments,\nhowever, the stimuli presented often depend on the subject's previous actions,\nso simple randomization tests are not possible. We describe how conditional\nrandomization can be used to perform exact hypothesis tests in this situation,\nand illustrate it with two examples. We contrast conditional randomization with\na related approach of tangent randomization, in which stimuli are resampled\nbased only on events occurring in the past, which is not valid for all choices\nof test statistic. We discuss how to design experiments that allow conditional\nrandomization tests to be used."}, "http://arxiv.org/abs/2311.03630": {"title": "Counterfactual Data Augmentation with Contrastive Learning", "link": "http://arxiv.org/abs/2311.03630", "description": "Statistical disparity between distinct treatment groups is one of the most\nsignificant challenges for estimating Conditional Average Treatment Effects\n(CATE). To address this, we introduce a model-agnostic data augmentation method\nthat imputes the counterfactual outcomes for a selected subset of individuals.\nSpecifically, we utilize contrastive learning to learn a representation space\nand a similarity measure such that in the learned representation space close\nindividuals identified by the learned similarity measure have similar potential\noutcomes. 
This property ensures reliable imputation of counterfactual outcomes\nfor the individuals with close neighbors from the alternative treatment group.\nBy augmenting the original dataset with these reliable imputations, we can\neffectively reduce the discrepancy between different treatment groups, while\ninducing minimal imputation error. The augmented dataset is subsequently\nemployed to train CATE estimation models. Theoretical analysis and experimental\nstudies on synthetic and semi-synthetic benchmarks demonstrate that our method\nachieves significant improvements in both performance and robustness to\noverfitting across state-of-the-art models."}, "http://arxiv.org/abs/2311.03644": {"title": "BOB: Bayesian Optimized Bootstrap with Applications to Gaussian Mixture Models", "link": "http://arxiv.org/abs/2311.03644", "description": "Sampling from the joint posterior distribution of Gaussian mixture models\n(GMMs) via standard Markov chain Monte Carlo (MCMC) imposes several\ncomputational challenges, which have prevented a broader full Bayesian\nimplementation of these models. A growing body of literature has introduced the\nWeighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as\nalternatives to MCMC sampling. The core idea of these methods is to repeatedly\ncompute maximum a posteriori (MAP) estimates on many randomly weighted\nposterior densities. These MAP estimates then can be treated as approximate\nposterior draws. Nonetheless, a central question remains unanswered: How to\nselect the distribution of the random weights under arbitrary sample sizes.\nThus, we introduce the Bayesian Optimized Bootstrap (BOB), a computational\nmethod to automatically select the weights distribution by minimizing, through\nBayesian Optimization, a black-box and noisy version of the reverse KL\ndivergence between the Bayesian posterior and an approximate posterior obtained\nvia random weighting. Our proposed method allows for uncertainty\nquantification, approximate posterior sampling, and embraces recent\ndevelopments in parallel computing. We show that BOB outperforms competing\napproaches in recovering the Bayesian posterior, while retaining key\ntheoretical properties from existing methods. BOB's performance is demonstrated\nthrough extensive simulations, along with real-world data analyses."}, "http://arxiv.org/abs/2311.03660": {"title": "Sampling via F\\\"ollmer Flow", "link": "http://arxiv.org/abs/2311.03660", "description": "We introduce a novel unit-time ordinary differential equation (ODE) flow\ncalled the preconditioned F\\\"{o}llmer flow, which efficiently transforms a\nGaussian measure into a desired target measure at time 1. To discretize the\nflow, we apply Euler's method, where the velocity field is calculated either\nanalytically or through Monte Carlo approximation using Gaussian samples. Under\nreasonable conditions, we derive a non-asymptotic error bound in the\nWasserstein distance between the sampling distribution and the target\ndistribution. Through numerical experiments on mixture distributions in 1D, 2D,\nand high-dimensional spaces, we demonstrate that the samples generated by our\nproposed flow exhibit higher quality compared to those obtained by several\nexisting methods. Furthermore, we propose leveraging the F\\\"{o}llmer flow as a\nwarmstart strategy for existing Markov Chain Monte Carlo (MCMC) methods, aiming\nto mitigate mode collapses and enhance their performance. 
Finally, thanks to\nthe deterministic nature of the F\\\"{o}llmer flow, we can leverage deep neural\nnetworks to fit the trajectory of sample evaluations. This allows us to obtain\na generator for one-step sampling."}, "http://arxiv.org/abs/2311.03763": {"title": "Thresholding the higher criticism test statistics for optimality in a heterogeneous setting", "link": "http://arxiv.org/abs/2311.03763", "description": "Donoho and Kipnis (2022) showed that the higher criticism (HC) test\nstatistic has a non-Gaussian phase transition but remarked that it is probably\nnot optimal in the detection of sparse differences between two large frequency\ntables when the counts are low. The setting can be considered to be\nheterogeneous, with cells containing larger total counts more able to detect\nsmaller differences. We provide a general study here of sparse detection\narising from such heterogeneous settings, and show that optimality of the HC\ntest statistic requires thresholding, for example in the case of frequency\ntable comparison, to restrict to p-values of cells with total counts exceeding\na threshold. The use of thresholding also leads to optimality of the HC test\nstatistic when it is applied to the sparse Poisson means model of Arias-Castro\nand Wang (2015). The phase transitions we consider here are non-Gaussian, and\ninvolve an interplay between the rate functions of the response and sample size\ndistributions. We also show, both theoretically and in a numerical study,\nthat applying thresholding to the Bonferroni test statistic results in better\nsparse mixture detection in heterogeneous settings."}, "http://arxiv.org/abs/2311.03769": {"title": "Nonparametric Screening for Additive Quantile Regression in Ultra-high Dimension", "link": "http://arxiv.org/abs/2311.03769", "description": "In practical applications, one often does not know the \"true\" structure of\nthe underlying conditional quantile function, especially in the ultra-high\ndimensional setting. To deal with ultra-high dimensionality, quantile-adaptive\nmarginal nonparametric screening methods have been recently developed. However,\nthese approaches may miss important covariates that are marginally independent\nof the response, or may select unimportant covariates due to their high\ncorrelations with important covariates. To mitigate such shortcomings, we\ndevelop a conditional nonparametric quantile screening procedure (complemented\nby subsequent selection) for nonparametric additive quantile regression models.\nUnder some mild conditions, we show that the proposed screening method can\nidentify all relevant covariates in a small number of steps with probability\napproaching one. The subsequent narrowed best subset (via a modified Bayesian\ninformation criterion) also contains all the relevant covariates with\noverwhelming probability. The advantages of our proposed procedure are\ndemonstrated through simulation studies and a real data example."}, "http://arxiv.org/abs/2311.03829": {"title": "Multilevel mixtures of latent trait analyzers for clustering multi-layer bipartite networks", "link": "http://arxiv.org/abs/2311.03829", "description": "Within network data analysis, bipartite networks represent a particular type\nof network where relationships occur between two disjoint sets of nodes,\nformally called sending and receiving nodes. 
In this context, sending nodes may\nbe organized into layers on the basis of some defined characteristics,\nresulting in a special case of a multilayer bipartite network, where each layer\nincludes a specific set of sending nodes. To perform a clustering of sending\nnodes in a multi-layer bipartite network, we extend the Mixture of Latent Trait\nAnalyzers (MLTA), also taking into account the influence of concomitant\nvariables on clustering formation and the multi-layer structure of the data. To\nthis aim, a multilevel approach offers a useful methodological tool to properly\naccount for the hierarchical structure of the data and for the unobserved\nsources of heterogeneity at multiple levels. A simulation study is conducted to\ntest the performance of the proposal in terms of parameter and clustering\nrecovery. Furthermore, the model is applied to the European Social Survey data\n(ESS) to i) perform a clustering of individuals (sending nodes) based on their\ndigital skills (receiving nodes); ii) understand how socio-economic and\ndemographic characteristics influence the individual digitalization level; iii)\naccount for the multilevel structure of the data; iv) obtain a clustering of\ncountries in terms of the base-line attitude to digital technologies of their\nresidents."}, "http://arxiv.org/abs/2311.03989": {"title": "Learned Causal Method Prediction", "link": "http://arxiv.org/abs/2311.03989", "description": "For a given causal question, it is important to efficiently decide which\ncausal inference method to use for a given dataset. This is challenging because\ncausal methods typically rely on complex and difficult-to-verify assumptions,\nand cross-validation is not applicable since ground truth causal quantities are\nunobserved. In this work, we propose CAusal Method Predictor (CAMP), a framework\nfor predicting the best method for a given dataset. To this end, we generate\ndatasets from a diverse set of synthetic causal models, score the candidate\nmethods, and train a model to directly predict the highest-scoring method for\nthat dataset. Next, by formulating a self-supervised pre-training objective\ncentered on dataset assumptions relevant for causal inference, we significantly\nreduce the need for costly labeled data and enhance training efficiency. Our\nstrategy learns to map implicit dataset properties to the best method in a\ndata-driven manner. In our experiments, we focus on method prediction for\ncausal discovery. CAMP outperforms selecting any individual candidate method\nand demonstrates promising generalization to unseen semi-synthetic and\nreal-world benchmarks."}, "http://arxiv.org/abs/2311.04017": {"title": "Multivariate quantile-based permutation tests with application to functional data", "link": "http://arxiv.org/abs/2311.04017", "description": "Permutation tests enable testing statistical hypotheses in situations when\nthe distribution of the test statistic is complicated or not available. In some\nsituations, the test statistic under investigation is multivariate, with the\nmultiple testing problem being an important example. The corresponding\nmultivariate permutation tests are then typically based on a\nsuitable one-dimensional transformation of the vector of partial permutation\np-values via so-called combining functions. This paper proposes a new approach\nthat utilizes the optimal measure transportation concept. The final single\np-value is computed from the empirical center-outward distribution function of\nthe permuted multivariate test statistics. 
This method avoids computation of\nthe partial p-values and is easy to implement. In addition, it allows one\nto compute and interpret the contributions of the components of the multivariate\ntest statistic to the non-conformity score and to the rejection of the null\nhypothesis. Apart from this method, measure transportation is also applied\nto the vector of partial p-values as an alternative to the classical combining\nfunctions. Both techniques are compared with the standard approaches using\nvarious practical examples in a Monte Carlo study. An application to a\nfunctional data set is provided as well."}, "http://arxiv.org/abs/2311.04037": {"title": "Causal Discovery Under Local Privacy", "link": "http://arxiv.org/abs/2311.04037", "description": "Differential privacy is a widely adopted framework designed to safeguard the\nsensitive information of data providers within a data set. It is based on the\napplication of controlled noise at the interface between the server that stores\nand processes the data, and the data consumers. Local differential privacy is a\nvariant that allows data providers to apply the privatization mechanism\nthemselves on their data individually. Therefore it also provides protection in\ncontexts in which the server, or even the data collector, cannot be trusted.\nThe introduction of noise, however, inevitably affects the utility of the data,\nparticularly by distorting the correlations between individual data components.\nThis distortion can prove detrimental to tasks such as causal discovery. In\nthis paper, we consider various well-known locally differentially private\nmechanisms and compare the trade-off between the privacy they provide, and the\naccuracy of the causal structure produced by algorithms for causal learning\nwhen applied to data obfuscated by these mechanisms. Our analysis yields\nvaluable insights for selecting appropriate local differentially private\nprotocols for causal discovery tasks. We foresee that our findings will aid\nresearchers and practitioners in conducting locally private causal discovery."}, "http://arxiv.org/abs/2311.04103": {"title": "Joint modelling of recurrent and terminal events with discretely-distributed non-parametric frailty: application on re-hospitalizations and death in heart failure patients", "link": "http://arxiv.org/abs/2311.04103", "description": "In the context of clinical and biomedical studies, joint frailty models have\nbeen developed to study the joint temporal evolution of recurrent and terminal\nevents, capturing both the heterogeneous susceptibility to experiencing a new\nepisode and the dependence between the two processes. While\ndiscretely-distributed frailty is usually more exploitable by clinicians and\nhealthcare providers, existing literature on joint frailty models predominantly\nassumes continuous distributions for the random effects. In this article, we\npresent a novel joint frailty model that assumes bivariate\ndiscretely-distributed non-parametric frailties, with an unknown finite number\nof mass points. This approach facilitates the identification of latent\nstructures among subjects, grouping them into sub-populations defined by a\nshared frailty value. We propose an estimation routine via an\nExpectation-Maximization algorithm, which not only estimates the number of\nsubgroups but also serves as an unsupervised classification tool. This work is\nmotivated by a study of patients with Heart Failure (HF) receiving ACE\ninhibitor treatment in the Lombardia region of Italy. 
Recurrent events of\ninterest are hospitalizations due to HF, and the terminal event is death from any\ncause."}, "http://arxiv.org/abs/2311.04159": {"title": "Statistical Inference on Simulation Output: Batching as an Inferential Device", "link": "http://arxiv.org/abs/2311.04159", "description": "We present {batching} as an omnibus device for statistical inference on\nsimulation output. We consider the classical context of a simulationist\nperforming statistical inference on an estimator $\\theta_n$ (of an unknown\nfixed quantity $\\theta$) using only the output data $(Y_1,Y_2,\\ldots,Y_n)$\ngathered from a simulation. By \\emph{statistical inference}, we mean\napproximating the sampling distribution of the error $\\theta_n-\\theta$ toward:\n(A) estimating an ``assessment'' functional $\\psi$, e.g., bias, variance, or\nquantile; or (B) constructing a $(1-\\alpha)$-confidence region on $\\theta$. We\nargue that batching is a remarkably simple and effective inference device that\nis especially suited for handling dependent output data such as what one\nfrequently encounters in simulation contexts. We demonstrate that if the number\nof batches and the extent of their overlap are chosen correctly, batching\nretains bootstrap's attractive theoretical properties of {strong consistency}\nand {higher-order accuracy}. For constructing confidence regions, we\ncharacterize two limiting distributions associated with a Studentized\nstatistic. Our extensive numerical experience confirms theoretical insight,\nespecially about the effects of batch size and batch overlap."}, "http://arxiv.org/abs/2002.03355": {"title": "Scalable Function-on-Scalar Quantile Regression for Densely Sampled Functional Data", "link": "http://arxiv.org/abs/2002.03355", "description": "Functional quantile regression (FQR) is a useful alternative to mean\nregression for functional data as it provides a comprehensive understanding of\nhow scalar predictors influence the conditional distribution of functional\nresponses. In this article, we study the FQR model for densely sampled,\nhigh-dimensional functional data without relying on parametric error or\nindependent stochastic process assumptions, with the focus on statistical\ninference under this challenging regime along with scalable implementation.\nThis is achieved by a simple but powerful distributed strategy, in which we\nfirst perform separate quantile regression to compute $M$-estimators at each\nsampling location, and then carry out estimation and inference for the entire\ncoefficient functions by properly exploiting the uncertainty quantification and\ndependence structure of $M$-estimators. We derive a uniform Bahadur\nrepresentation and a strong Gaussian approximation result for the\n$M$-estimators on the discrete sampling grid, leading to dimension reduction\nand serving as the basis for inference. An interpolation-based estimator with\nminimax optimality is proposed, and large sample properties for point and\nsimultaneous interval estimators are established. The obtained minimax optimal\nrate under the FQR model shows an interesting phase transition phenomenon that\nhas been previously observed in functional mean regression. 
The proposed\nmethods are illustrated via simulations and an application to a mass\nspectrometry proteomics dataset."}, "http://arxiv.org/abs/2201.06110": {"title": "FNETS: Factor-adjusted network estimation and forecasting for high-dimensional time series", "link": "http://arxiv.org/abs/2201.06110", "description": "We propose FNETS, a methodology for network estimation and forecasting of\nhigh-dimensional time series exhibiting strong serial- and cross-sectional\ncorrelations. We operate under a factor-adjusted vector autoregressive (VAR)\nmodel which, after accounting for pervasive co-movements of the variables by\n{\\it common} factors, models the remaining {\\it idiosyncratic} dynamic\ndependence between the variables as a sparse VAR process. Network estimation of\nFNETS consists of three steps: (i) factor-adjustment via dynamic principal\ncomponent analysis, (ii) estimation of the latent VAR process via\n$\\ell_1$-regularised Yule-Walker estimator, and (iii) estimation of partial\ncorrelation and long-run partial correlation matrices. In doing so, we learn\nthree networks underpinning the VAR process, namely a directed network\nrepresenting the Granger causal linkages between the variables, an undirected\none embedding their contemporaneous relationships and finally, an undirected\nnetwork that summarises both lead-lag and contemporaneous linkages. In\naddition, FNETS provides a suite of methods for forecasting the factor-driven\nand the idiosyncratic VAR processes. Under general conditions permitting tails\nheavier than the Gaussian one, we derive uniform consistency rates for the\nestimators in both network estimation and forecasting, which hold as the\ndimension of the panel and the sample size diverge. Simulation studies and real\ndata application confirm the good performance of FNETS."}, "http://arxiv.org/abs/2202.01650": {"title": "Exposure Effects on Count Outcomes with Observational Data, with Application to Incarcerated Women", "link": "http://arxiv.org/abs/2202.01650", "description": "Causal inference methods can be applied to estimate the effect of a point\nexposure or treatment on an outcome of interest using data from observational\nstudies. For example, in the Women's Interagency HIV Study, it is of interest\nto understand the effects of incarceration on the number of sexual partners and\nthe number of cigarettes smoked after incarceration. In settings like this\nwhere the outcome is a count, the estimand is often the causal mean ratio,\ni.e., the ratio of the counterfactual mean count under exposure to the\ncounterfactual mean count under no exposure. This paper considers estimators of\nthe causal mean ratio based on inverse probability of treatment weights, the\nparametric g-formula, and doubly robust estimation, each of which can account\nfor overdispersion, zero-inflation, and heaping in the measured outcome.\nMethods are compared in simulations and are applied to data from the Women's\nInteragency HIV Study."}, "http://arxiv.org/abs/2208.14960": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case", "link": "http://arxiv.org/abs/2208.14960", "description": "Gaussian processes are arguably the most important class of spatiotemporal\nmodels within machine learning. They encode prior information about the modeled\nfunction and can be used for exact or approximate Bayesian learning. 
In many\napplications, particularly in physical sciences and engineering, but also in\nareas such as geostatistics and neuroscience, invariance to symmetries is one\nof the most fundamental forms of prior information one can consider. The\ninvariance of a Gaussian process' covariance to such symmetries gives rise to\nthe most natural generalization of the concept of stationarity to such spaces.\nIn this work, we develop constructive and practical techniques for building\nstationary Gaussian processes on a very large class of non-Euclidean spaces\narising in the context of symmetries. Our techniques make it possible to (i)\ncalculate covariance kernels and (ii) sample from prior and posterior Gaussian\nprocesses defined on such spaces, both in a practical manner. This work is\nsplit into two parts, each involving different technical considerations: part I\nstudies compact spaces, while part II studies non-compact spaces possessing\ncertain structure. Our contributions make the non-Euclidean Gaussian process\nmodels we study compatible with well-understood computational techniques\navailable in standard Gaussian process software packages, thereby making them\naccessible to practitioners."}, "http://arxiv.org/abs/2210.14455": {"title": "Asymmetric predictability in causal discovery: an information theoretic approach", "link": "http://arxiv.org/abs/2210.14455", "description": "Causal investigations in observational studies pose a great challenge in\nresearch where randomized trials or intervention-based studies are not\nfeasible. We develop an information geometric causal discovery and inference\nframework of \"predictive asymmetry\". For $(X, Y)$, predictive asymmetry enables\nassessment of whether $X$ is more likely to cause $Y$ or vice-versa. The\nasymmetry between cause and effect becomes particularly simple if $X$ and $Y$\nare deterministically related. We propose a new metric called the Directed\nMutual Information ($DMI$) and establish its key statistical properties. $DMI$\nis not only able to detect complex non-linear association patterns in bivariate\ndata, but also is able to detect and infer causal relations. Our proposed\nmethodology relies on scalable non-parametric density estimation using Fourier\ntransform. The resulting estimation method is manyfold faster than the\nclassical bandwidth-based density estimation. We investigate key asymptotic\nproperties of the $DMI$ methodology and a data-splitting technique is utilized\nto facilitate causal inference using the $DMI$. Through simulation studies and\nan application, we illustrate the performance of $DMI$."}, "http://arxiv.org/abs/2211.01227": {"title": "Conformalized survival analysis with adaptive cutoffs", "link": "http://arxiv.org/abs/2211.01227", "description": "This paper introduces an assumption-lean method that constructs valid and\nefficient lower predictive bounds (LPBs) for survival times with censored data.\nWe build on recent work by Cand\\`es et al. (2021), whose approach first subsets\nthe data to discard any data points with early censoring times, and then uses a\nreweighting technique (namely, weighted conformal inference (Tibshirani et al.,\n2019)) to correct for the distribution shift introduced by this subsetting\nprocedure.\n\nFor our new method, instead of constraining to a fixed threshold for the\ncensoring time when subsetting the data, we allow for a covariate-dependent and\ndata-adaptive subsetting step, which is better able to capture the\nheterogeneity of the censoring mechanism. 
As a result, our method can lead to\nLPBs that are less conservative and give more accurate information. We show\nthat in the Type I right-censoring setting, if either the censoring\nmechanism or the conditional quantile of survival time is well estimated, our\nproposed procedure achieves nearly exact marginal coverage, and in the latter\ncase we additionally have approximate conditional coverage. We evaluate the\nvalidity and efficiency of our proposed algorithm in numerical experiments,\nillustrating its advantage when compared with other competing methods. Finally,\nour method is applied to a real dataset to generate LPBs for users' active\ntimes on a mobile app."}, "http://arxiv.org/abs/2211.07451": {"title": "Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain", "link": "http://arxiv.org/abs/2211.07451", "description": "Forecasts of regional electricity net-demand, consumption minus embedded\ngeneration, are an essential input for reliable and economic power system\noperation, and energy trading. While such forecasts are typically performed\nregion by region, operations such as managing power flows require spatially\ncoherent joint forecasts, which account for cross-regional dependencies. Here,\nwe forecast the joint distribution of net-demand across the 14 regions\nconstituting Great Britain's electricity network. Joint modelling is\ncomplicated by the fact that the net-demand variability within each region, and\nthe dependencies between regions, vary with temporal, socio-economic and\nweather-related factors. We account for these characteristics by proposing\na multivariate Gaussian model based on a modified Cholesky parametrisation,\nwhich allows us to model each unconstrained parameter via an additive model.\nGiven that the number of model parameters and covariates is large, we adopt a\nsemi-automated approach to model selection, based on gradient boosting. In\naddition to comparing the forecasting performance of several versions of the\nproposed model with that of two non-Gaussian copula-based models, we visually\nexplore the model output to interpret how the covariates affect net-demand\nvariability and dependencies.\n\nThe code for reproducing the results in this paper is available at\nhttps://doi.org/10.5281/zenodo.7315105, while methods for building and fitting\nmultivariate Gaussian additive models are provided by the SCM R package,\navailable at https://github.com/VinGioia90/SCM."}, "http://arxiv.org/abs/2211.16182": {"title": "Residual Permutation Test for High-Dimensional Regression Coefficient Testing", "link": "http://arxiv.org/abs/2211.16182", "description": "We consider the problem of testing whether a single coefficient is equal to\nzero in fixed-design linear models under a moderately high-dimensional regime,\nwhere the dimension of covariates $p$ is allowed to be of the same order of\nmagnitude as the sample size $n$. In this regime, to achieve finite-population\nvalidity, existing methods usually require strong distributional assumptions on\nthe noise vector (such as Gaussian or rotationally invariant), which limits\ntheir applications in practice. In this paper, we propose a new method, called\nthe residual permutation test (RPT), which is constructed by projecting the\nregression residuals onto the space orthogonal to the union of the column\nspaces of the original and permuted design matrices. 
RPT can be proved to\nachieve finite-population size validity under fixed design with just\nexchangeable noise, whenever $p < n / 2$. Moreover, RPT is shown to be\nasymptotically powerful for heavy-tailed noise with a bounded $(1+t)$-th order\nmoment when the true coefficient is at least of order $n^{-t/(1+t)}$ for $t \\in\n[0,1]$. We further prove that this signal size requirement is essentially\nrate-optimal in the minimax sense. Numerical studies confirm that RPT performs\nwell in a wide range of simulation settings with normal and heavy-tailed noise\ndistributions."}, "http://arxiv.org/abs/2306.14351": {"title": "Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions", "link": "http://arxiv.org/abs/2306.14351", "description": "The aim of this paper is to make clear and precise the relationship between\nthe Rubin causal model (RCM) and structural causal model (SCM) frameworks for\ncausal inference. Adopting a neutral logical perspective, and drawing on\nprevious work, we show what is required for an RCM to be representable by an\nSCM. A key result then shows that every RCM -- including those that violate\nalgebraic principles implied by the SCM framework -- emerges as an abstraction\nof some representable RCM. Finally, we illustrate the power of this\nconciliatory perspective by pinpointing an important role for SCM principles in\nclassic applications of RCMs; conversely, we offer a characterization of the\nalgebraic constraints implied by a graph, helping to substantiate further\ncomparisons between the two frameworks."}, "http://arxiv.org/abs/2307.07342": {"title": "Bounded-memory adjusted scores estimation in generalized linear models with large data sets", "link": "http://arxiv.org/abs/2307.07342", "description": "The widespread use of maximum Jeffreys'-prior penalized likelihood in\nbinomial-response generalized linear models, and in logistic regression, in\nparticular, is supported by the results of Kosmidis and Firth (2021,\nBiometrika), who show that the resulting estimates are also always\nfinite-valued, even in cases where the maximum likelihood estimates are not,\nwhich is a practical issue regardless of the size of the data set. In logistic\nregression, the implied adjusted score equations are formally bias-reducing in\nasymptotic frameworks with a fixed number of parameters and appear to deliver a\nsubstantial reduction in the persistent bias of the maximum likelihood\nestimator in high-dimensional settings where the number of parameters grows\nasymptotically linearly and slower than the number of observations. In this\nwork, we develop and present two new variants of iteratively reweighted least\nsquares for estimating generalized linear models with adjusted score equations\nfor mean bias reduction and maximization of the likelihood penalized by a\npositive power of the Jeffreys-prior penalty, which eliminate the requirement\nof storing $O(n)$ quantities in memory, and can operate with data sets that\nexceed computer memory or even hard drive capacity. We achieve that through\nincremental QR decompositions, which enable IWLS iterations to have access only\nto data chunks of predetermined size. 
We assess the procedures through a\nreal-data application with millions of observations."}, "http://arxiv.org/abs/2309.02422": {"title": "Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test", "link": "http://arxiv.org/abs/2309.02422", "description": "Maximum mean discrepancy (MMD) refers to a general class of nonparametric\ntwo-sample tests that are based on maximizing the mean difference over samples\nfrom one distribution $P$ versus another $Q$, over all choices of data\ntransformations $f$ living in some function space $\\mathcal{F}$. Inspired by\nrecent work that connects what are known as functions of $\\textit{Radon bounded\nvariation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study\nthe MMD defined by taking $\\mathcal{F}$ to be the unit ball in the RBV space of\na given smoothness order $k \\geq 0$. This test, which we refer to as the\n$\\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a\ngeneralization of the well-known and classical Kolmogorov-Smirnov (KS) test to\nmultiple dimensions and higher orders of smoothness. It is also intimately\nconnected to neural networks: we prove that the witness in the RKS test -- the\nfunction $f$ achieving the maximum mean difference -- is always a ridge spline\nof degree $k$, i.e., a single neuron in a neural network. This allows us to\nleverage the power of modern deep learning toolkits to (approximately) optimize\nthe criterion that underlies the RKS test. We prove that the RKS test has\nasymptotically full power at distinguishing any distinct pair $P \\not= Q$ of\ndistributions, derive its asymptotic null distribution, and carry out extensive\nexperiments to elucidate the strengths and weaknesses of the RKS test versus\nthe more traditional kernel MMD test."}, "http://arxiv.org/abs/2311.04318": {"title": "Estimation for multistate models subject to reporting delays and incomplete event adjudication", "link": "http://arxiv.org/abs/2311.04318", "description": "Complete observation of event histories is often impossible due to sampling\neffects such as right-censoring and left-truncation, but also due to reporting\ndelays and incomplete event adjudication. This is, for example, the case during\ninterim stages of clinical trials and for health insurance claims. In this\npaper, we develop a parametric method that takes the aforementioned effects\ninto account, treating the latter two as partially exogenous. The method, which\ntakes the form of a two-step M-estimation procedure, is applicable to\nmultistate models in general, including competing risks and recurrent event\nmodels. The effect of reporting delays is derived via thinning, extending\nexisting results for Poisson models. To address incomplete event adjudication,\nwe propose an imputed likelihood approach which, compared to existing methods,\nhas the advantage of allowing for dependencies between the event history and\nadjudication processes as well as allowing for unreported events and multiple\nevent types. We establish consistency and asymptotic normality under standard\nidentifiability, integrability, and smoothness conditions, and we demonstrate\nthe validity of the percentile bootstrap. 
Finally, a simulation study shows\nfavorable finite sample performance of our method compared to other\nalternatives, while an application to disability insurance data illustrates its\npractical potential."}, "http://arxiv.org/abs/2311.04359": {"title": "Flexibly Estimating and Interpreting Heterogeneous Treatment Effects of Laparoscopic Surgery for Cholecystitis Patients", "link": "http://arxiv.org/abs/2311.04359", "description": "Laparoscopic surgery has been shown through a number of randomized trials to\nbe an effective form of treatment for cholecystitis. Given this evidence, one\nnatural question for clinical practice is: does the effectiveness of\nlaparoscopic surgery vary among patients? It might be the case that, while the\noverall effect is positive, some patients treated with laparoscopic surgery may\nrespond positively to the intervention while others do not or may be harmed. In\nour study, we focus on conditional average treatment effects to understand\nwhether treatment effects vary systematically with patient characteristics.\nRecent methodological work has developed a meta-learner framework for flexible\nestimation of conditional causal effects. In this framework, nonparametric\nestimation methods can be used to avoid bias from model misspecification while\npreserving statistical efficiency. In addition, researchers can flexibly and\neffectively explore whether treatment effects vary with a large number of\npossible effect modifiers. However, these methods have certain limitations. For\nexample, conducting inference can be challenging if black-box models are used.\nFurther, interpreting and visualizing the effect estimates can be difficult\nwhen there are multi-valued effect modifiers. In this paper, we develop new\nmethods that allow for interpretable results and inference from the\nmeta-learner framework for heterogeneous treatment effects estimation. We also\ndemonstrate methods that allow for an exploratory analysis to identify possible\neffect modifiers. We apply our methods to a large database for the use of\nlaparoscopic surgery in treating cholecystitis. We also conduct a series of\nsimulation studies to understand the relative performance of the methods we\ndevelop. Our study provides key guidelines for the interpretation of\nconditional causal effects from the meta-learner framework."}, "http://arxiv.org/abs/2311.04540": {"title": "On the estimation of the number of components in multivariate functional principal component analysis", "link": "http://arxiv.org/abs/2311.04540", "description": "Happ and Greven (2018) developed a methodology for principal components\nanalysis of multivariate functional data for data observed on different\ndimensional domains. Their approach relies on an estimation of univariate\nfunctional principal components for each univariate functional feature. In this\npaper, we present extensive simulations to investigate choosing the number of\nprincipal components to retain. 
We show empirically that the conventional\napproach of using a percentage of variance explained threshold for each\nunivariate functional feature may be unreliable when aiming to explain an\noverall percentage of variance in the multivariate functional data, and thus we\nadvise practitioners to be careful when using it."}, "http://arxiv.org/abs/2311.04585": {"title": "Goodness-of-Fit Tests for Linear Non-Gaussian Structural Equation Models", "link": "http://arxiv.org/abs/2311.04585", "description": "The field of causal discovery develops model selection methods to infer\ncause-effect relations among a set of random variables. For this purpose,\ndifferent modelling assumptions have been proposed to render cause-effect\nrelations identifiable. One prominent assumption is that the joint distribution\nof the observed variables follows a linear non-Gaussian structural equation\nmodel. In this paper, we develop novel goodness-of-fit tests that assess the\nvalidity of this assumption in the basic setting without latent confounders as\nwell as in extension to linear models that incorporate latent confounders. Our\napproach involves testing algebraic relations among second and higher moments\nthat hold as a consequence of the linearity of the structural equations.\nSpecifically, we show that the linearity implies rank constraints on matrices\nand tensors derived from moments. For a practical implementation of our tests,\nwe consider a multiplier bootstrap method that uses incomplete U-statistics to\nestimate subdeterminants, as well as asymptotic approximations to the null\ndistribution of singular values. The methods are illustrated, in particular,\nfor the T\\\"ubingen collection of benchmark data sets on cause-effect pairs."}, "http://arxiv.org/abs/2311.04657": {"title": "Long-Term Causal Inference with Imperfect Surrogates using Many Weak Experiments, Proxies, and Cross-Fold Moments", "link": "http://arxiv.org/abs/2311.04657", "description": "Inferring causal effects on long-term outcomes using short-term surrogates is\ncrucial to rapid innovation. However, even when treatments are randomized and\nsurrogates fully mediate their effect on outcomes, it's possible that we get\nthe direction of causal effects wrong due to confounding between surrogates and\noutcomes -- a situation famously known as the surrogate paradox. The\navailability of many historical experiments offers the opportunity to instrument\nfor the surrogate and bypass this confounding. However, even as the number of\nexperiments grows, two-stage least squares has non-vanishing bias if each\nexperiment has a bounded size, and this bias is exacerbated when most\nexperiments barely move metrics, as occurs in practice. We show how to\neliminate this bias using cross-fold procedures, JIVE being one example, and\nconstruct valid confidence intervals for the long-term effect in new\nexperiments where the long-term outcome has not yet been observed. Our methodology\nfurther allows us to proxy for effects not perfectly mediated by the surrogates,\nallowing us to handle both confounding and effect leakage as violations of\nstandard statistical surrogacy conditions."}, "http://arxiv.org/abs/2311.04696": {"title": "Generative causality: using Shannon's information theory to infer underlying asymmetry in causal relations", "link": "http://arxiv.org/abs/2311.04696", "description": "Causal investigations in observational studies pose a great challenge in\nscientific research where randomized trials or intervention-based studies are\nnot feasible. 
Leveraging Shannon's seminal work on information theory, we\nconsider a framework of asymmetry where any causal link between putative cause\nand effect must be explained through a mechanism governing the cause as well as\na generative process yielding an effect of the cause. Under weak assumptions,\nthis framework enables the assessment of whether X is a stronger predictor of Y\nor vice-versa. Under stronger identifiability assumptions our framework is able\nto distinguish between cause and effect using observational data. We establish\nkey statistical properties of this framework. Our proposed methodology relies\non scalable non-parametric density estimation using fast Fourier\ntransformation. The resulting estimation method is manyfold faster than the\nclassical bandwidth-based density estimation while maintaining comparable mean\nintegrated squared error rates. We investigate key asymptotic properties of our\nmethodology and introduce a data-splitting technique to facilitate inference.\nThe key attraction of our framework is its inference toolkit, which allows\nresearchers to quantify uncertainty in causal discovery findings. We illustrate\nthe performance of our methodology through simulation studies as well as\nmultiple real data examples."}, "http://arxiv.org/abs/2311.04812": {"title": "Is it possible to obtain reliable estimates for the prevalence of anemia and childhood stunting among children under 5 in the poorest districts in Peru?", "link": "http://arxiv.org/abs/2311.04812", "description": "In this article we describe and apply the Fay-Herriot model with spatially\ncorrelated random area effects (Pratesi, M., & Salvati, N. (2008)), in order to\npredict the prevalence of anemia and childhood stunting in Peruvian districts,\nbased on the data from the Demographic and Family Health Survey of the year\n2019, which collects data about anemia and childhood stunting for children\nunder the age of 12 years, and the National Census carried out in 2017. Our\nmain objective is to produce reliable predictions for the districts, where\nsample sizes are too small to provide good direct estimates, and for the\ndistricts, which were not included in the sample. The basic Fay-Herriot model\n(Fay & Herriot, 1979) tackles this problem by incorporating auxiliary\ninformation, which is generally available from administrative or census\nrecords. The Fay-Herriot model with spatially correlated random area effects,\nin addition to auxiliary information, incorporates geographic information about\nthe areas, such as latitude and longitude. This permits modeling spatial\nautocorrelations, which are not unusual in socioeconomic and health surveys. To\nevaluate the mean square error of the above-mentioned predictors, we use the\nparametric bootstrap procedure, developed in Molina et al. (2009)."}, "http://arxiv.org/abs/2311.04855": {"title": "Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values", "link": "http://arxiv.org/abs/2311.04855", "description": "Non-negative matrix factorization (NMF) is a dimensionality reduction\ntechnique that has shown promise for analyzing noisy data, especially\nastronomical data. For these datasets, the observed data may contain negative\nvalues due to noise even when the true underlying physical signal is strictly\npositive. Prior NMF work has not treated negative data in a statistically\nconsistent manner, which becomes problematic for low signal-to-noise data with\nmany negative values. 
In this paper we present two algorithms, Shift-NMF and\nNearly-NMF, that can handle both the noisiness of the input data and any\nintroduced negativity. Both of these algorithms use the negative data space\nwithout clipping, and correctly recover non-negative signals without any\nintroduced positive offset that occurs when clipping negative data. We\ndemonstrate this numerically on both simple and more realistic examples, and\nprove that both algorithms have monotonically decreasing update rules."}, "http://arxiv.org/abs/2311.04871": {"title": "Integration of Summary Information from External Studies for Semiparametric Models", "link": "http://arxiv.org/abs/2311.04871", "description": "With the development of biomedical science, researchers have increasing\naccess to an abundance of studies focusing on similar research questions. There\nis a growing interest in the integration of summary information from those\nstudies to enhance the efficiency of estimation in their own internal studies.\nIn this work, we present a comprehensive framework for the integration of summary\ninformation from external studies when the data are modeled by semiparametric\nmodels. Our novel framework offers straightforward estimators that update\nconventional estimations with auxiliary information. It addresses computational\nchallenges by capitalizing on the intricate mathematical structure inherent to\nthe problem. We demonstrate the conditions under which the proposed estimators are\ntheoretically more efficient than the initial estimate based solely on internal\ndata. Several special cases, such as the proportional hazards model in survival\nanalysis, are provided with numerical examples."}, "http://arxiv.org/abs/2107.10885": {"title": "Laplace and Saddlepoint Approximations in High Dimensions", "link": "http://arxiv.org/abs/2107.10885", "description": "We examine the behaviour of the Laplace and saddlepoint approximations in the\nhigh-dimensional setting, where the dimension of the model is allowed to\nincrease with the number of observations. Approximations to the joint density,\nthe marginal posterior density and the conditional density are considered. Our\nresults show that under the mildest assumptions on the model, the error of the\njoint density approximation is $O(p^4/n)$ if $p = o(n^{1/4})$ for the Laplace\napproximation and saddlepoint approximation, and $O(p^3/n)$ if $p = o(n^{1/3})$\nunder additional assumptions on the second derivative of the log-likelihood.\nStronger results are obtained for the approximation to the marginal posterior\ndensity."}, "http://arxiv.org/abs/2111.00280": {"title": "Testing semiparametric model-equivalence hypotheses based on the characteristic function", "link": "http://arxiv.org/abs/2111.00280", "description": "We propose three test criteria, each of which is appropriate for testing,\nrespectively, the equivalence hypotheses of symmetry, of homogeneity, and of\nindependence, with multivariate data. All quantities have the common feature of\ninvolving weighted-type distances between characteristic functions and are\nconvenient from the computational point of view if the weight function is\nproperly chosen. 
The asymptotic behavior of the tests under the null hypothesis\nis investigated, and numerical studies are conducted in order to examine the\nperformance of the criteria in finite samples."}, "http://arxiv.org/abs/2112.11079": {"title": "Data fission: splitting a single data point", "link": "http://arxiv.org/abs/2112.11079", "description": "Suppose we observe a random vector $X$ from some distribution $P$ in a known\nfamily with unknown parameters. We ask the following question: when is it\npossible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part\nis sufficient to reconstruct $X$ by itself, but both together can recover $X$\nfully, and the joint distribution of $(f(X),g(X))$ is tractable? As one\nexample, if $X=(X_1,\\dots,X_n)$ and $P$ is a product distribution, then for any\n$m<n$, we can split the sample to define $f(X)=(X_1,\\dots,X_m)$ and\n$g(X)=(X_{m+1},\\dots,X_n)$. Rasines and Young (2022) offers an alternative\nroute of accomplishing this task through randomization of $X$ with additive\nGaussian noise which enables post-selection inference in finite samples for\nGaussian distributed data and asymptotically for non-Gaussian additive models.\nIn this paper, we offer a more general methodology for achieving such a split\nin finite samples by borrowing ideas from Bayesian inference to yield a\n(frequentist) solution that can be viewed as a continuous analog of data\nsplitting. We call our method data fission, as an alternative to data\nsplitting, data carving and p-value masking. We exemplify the method on a few\nprototypical applications, such as post-selection inference for trend filtering\nand other regression problems."}, "http://arxiv.org/abs/2207.04598": {"title": "Differential item functioning via robust scaling", "link": "http://arxiv.org/abs/2207.04598", "description": "This paper proposes a method for assessing differential item functioning\n(DIF) in item response theory (IRT) models. The method does not require\npre-specification of anchor items, which is its main virtue. It is developed in\ntwo main steps, first by showing how DIF can be re-formulated as a problem of\noutlier detection in IRT-based scaling, then tackling the latter using methods\nfrom robust statistics. The proposal is a redescending M-estimator of IRT\nscaling parameters that is tuned to flag items with DIF at the desired\nasymptotic Type I Error rate. Theoretical results describe the efficiency of\nthe estimator in the absence of DIF and its robustness in the presence of DIF.\nSimulation studies show that the proposed method compares favorably to\ncurrently available approaches for DIF detection, and a real data example\nillustrates its application in a research context where pre-specification of\nanchor items is infeasible. The focus of the paper is the two-parameter\nlogistic model in two independent groups, with extensions to other settings\nconsidered in the conclusion."}, "http://arxiv.org/abs/2209.12345": {"title": "Berry-Esseen bounds for design-based causal inference with possibly diverging treatment levels and varying group sizes", "link": "http://arxiv.org/abs/2209.12345", "description": "Neyman (1923/1990) introduced the randomization model, which contains the\nnotation of potential outcomes to define causal effects and a framework for\nlarge-sample inference based on the design of the experiment. However, the\nexisting theory for this framework is far from complete especially when the\nnumber of treatment levels diverges and the treatment group sizes vary. 
We\nprovide a unified discussion of statistical inference under the randomization\nmodel with general treatment group sizes. We formulate the estimator in terms\nof a linear permutational statistic and use results based on Stein's method to\nderive various Berry--Esseen bounds on the linear and quadratic functions of\nthe estimator. These new Berry--Esseen bounds serve as basis for design-based\ncausal inference with possibly diverging treatment levels and a diverging\nnumber of causal parameters of interest. We also fill an important gap by\nproposing novel variance estimators for experiments with possibly many\ntreatment levels without replications. Equipped with the newly developed\nresults, design-based causal inference in general settings becomes more\nconvenient with stronger theoretical guarantees."}, "http://arxiv.org/abs/2210.02014": {"title": "Doubly Robust Proximal Synthetic Controls", "link": "http://arxiv.org/abs/2210.02014", "description": "To infer the treatment effect for a single treated unit using panel data,\nsynthetic control methods construct a linear combination of control units'\noutcomes that mimics the treated unit's pre-treatment outcome trajectory. This\nlinear combination is subsequently used to impute the counterfactual outcomes\nof the treated unit had it not been treated in the post-treatment period, and\nused to estimate the treatment effect. Existing synthetic control methods rely\non correctly modeling certain aspects of the counterfactual outcome generating\nmechanism and may require near-perfect matching of the pre-treatment\ntrajectory. Inspired by proximal causal inference, we obtain two novel\nnonparametric identifying formulas for the average treatment effect for the\ntreated unit: one is based on weighting, and the other combines models for the\ncounterfactual outcome and the weighting function. We introduce the concept of\ncovariate shift to synthetic controls to obtain these identification results\nconditional on the treatment assignment. We also develop two treatment effect\nestimators based on these two formulas and the generalized method of moments.\nOne new estimator is doubly robust: it is consistent and asymptotically normal\nif at least one of the outcome and weighting models is correctly specified. We\ndemonstrate the performance of the methods via simulations and apply them to\nevaluate the effectiveness of a Pneumococcal conjugate vaccine on the risk of\nall-cause pneumonia in Brazil."}, "http://arxiv.org/abs/2303.01385": {"title": "Hyperlink communities in higher-order networks", "link": "http://arxiv.org/abs/2303.01385", "description": "Many networks can be characterised by the presence of communities, which are\ngroups of units that are closely linked and can be relevant in understanding\nthe system's overall function. Recently, hypergraphs have emerged as a\nfundamental tool for modelling systems where interactions are not limited to\npairs but may involve an arbitrary number of nodes. Using a dual approach to\ncommunity detection, in this study we extend the concept of link communities to\nhypergraphs, allowing us to extract informative clusters of highly related\nhyperedges. We analyze the dendrograms obtained by applying hierarchical\nclustering to distance matrices among hyperedges on a variety of real-world\ndata, showing that hyperlink communities naturally highlight the hierarchical\nand multiscale structure of higher-order networks. 
Moreover, by using hyperlink\ncommunities, we are able to extract overlapping memberships from nodes,\novercoming limitations of traditional hard clustering methods. Finally, we\nintroduce higher-order network cartography as a practical tool for categorizing\nnodes into different structural roles based on their interaction patterns and\ncommunity participation. This approach helps identify different types of\nindividuals in a variety of real-world social systems. Our work contributes to\na better understanding of the structural organization of real-world\nhigher-order systems."}, "http://arxiv.org/abs/2305.05931": {"title": "Generalised shot noise representations of stochastic systems driven by non-Gaussian L\\'evy processes", "link": "http://arxiv.org/abs/2305.05931", "description": "We consider the problem of obtaining effective representations for the\nsolutions of linear, vector-valued stochastic differential equations (SDEs)\ndriven by non-Gaussian pure-jump L\\'evy processes, and we show how such\nrepresentations lead to efficient simulation methods. The processes considered\nconstitute a broad class of models that find application across the physical\nand biological sciences, mathematics, finance and engineering. Motivated by\nimportant relevant problems in statistical inference, we derive new,\ngeneralised shot-noise simulation methods whenever a normal variance-mean (NVM)\nmixture representation exists for the driving L\\'evy process, including the\ngeneralised hyperbolic, normal-Gamma, and normal tempered stable cases. Simple,\nexplicit conditions are identified for the convergence of the residual of a\ntruncated shot-noise representation to a Brownian motion in the case of the\npure L\\'evy process, and to a Brownian-driven SDE in the case of the\nL\\'evy-driven SDE. These results provide Gaussian approximations to the small\njumps of the process under the NVM representation. The resulting\nrepresentations are of particular importance in state inference and parameter\nestimation for L\\'evy-driven SDE models, since the resulting conditionally\nGaussian structures can be readily incorporated into latent variable inference\nmethods such as Markov chain Monte Carlo (MCMC), Expectation-Maximisation (EM),\nand sequential Monte Carlo."}, "http://arxiv.org/abs/2305.17517": {"title": "Stochastic Nonparametric Estimation of the Density-Flow Curve", "link": "http://arxiv.org/abs/2305.17517", "description": "The fundamental diagram serves as the foundation of traffic flow modeling for\nalmost a century. With the increasing availability of road sensor data,\ndeterministic parametric models have proved inadequate in describing the\nvariability of real-world data, especially in congested area of the\ndensity-flow diagram. In this paper we estimate the stochastic density-flow\nrelation introducing a nonparametric method called convex quantile regression.\nThe proposed method does not depend on any prior functional form assumptions,\nbut thanks to the concavity constraints, the estimated function satisfies the\ntheoretical properties of the density-flow curve. The second contribution is to\ndevelop the new convex quantile regression with bags (CQRb) approach to\nfacilitate practical implementation of CQR to the real-world data. We\nillustrate the CQRb estimation process using the road sensor data from Finland\nin years 2016-2018. 
Our third contribution is to demonstrate the excellent\nout-of-sample predictive power of the proposed CQRb method in comparison to the\nstandard parametric deterministic approach."}, "http://arxiv.org/abs/2305.19180": {"title": "Transfer Learning With Efficient Estimators to Optimally Leverage Historical Data in Analysis of Randomized Trials", "link": "http://arxiv.org/abs/2305.19180", "description": "Although randomized controlled trials (RCTs) are a cornerstone of comparative\neffectiveness, they typically have much smaller sample sizes than observational\nstudies because of financial and ethical considerations. Therefore, there is\ninterest in using plentiful historical data (either observational data or prior\ntrials) to reduce trial sizes. Previous estimators developed for this purpose\nrely on unrealistic assumptions, without which the added data can bias the\ntreatment effect estimate. Recent work proposed an alternative method\n(prognostic covariate adjustment) that imposes no additional assumptions and\nincreases efficiency in trial analyses. The idea is to use historical data to\nlearn a prognostic model: a regression of the outcome onto the covariates. The\npredictions from this model, generated from the RCT subjects' baseline\nvariables, are then used as a covariate in a linear regression analysis of the\ntrial data. In this work, we extend prognostic adjustment to trial analyses\nwith nonparametric efficient estimators, which are more powerful than linear\nregression. We provide theory that explains why prognostic adjustment improves\nsmall-sample point estimation and inference without any possibility of bias.\nSimulations corroborate the theory: efficient estimators with prognostic\nadjustment provide greater power (i.e., smaller standard\nerrors) than those without when the trial is small. Population shifts between historical and trial\ndata attenuate benefits but do not introduce bias. We showcase our estimator\nusing clinical trial data provided by Novo Nordisk A/S that evaluates insulin\ntherapy for individuals with type II diabetes."}, "http://arxiv.org/abs/2308.13928": {"title": "A flexible Bayesian tool for CoDa mixed models: logistic-normal distribution with Dirichlet covariance", "link": "http://arxiv.org/abs/2308.13928", "description": "Compositional Data Analysis (CoDa) has gained popularity in recent years.\nThis type of data consists of values from disjoint categories that sum up to a\nconstant. Both Dirichlet regression and logistic-normal regression have become\npopular as CoDa analysis methods. However, fitting this kind of multivariate\nmodel presents challenges, especially when structured random effects are\nincluded in the model, such as temporal or spatial effects.\n\nTo overcome these challenges, we propose the logistic-normal Dirichlet Model\n(LNDM). We seamlessly incorporate this approach into the R-INLA package,\nfacilitating model fitting and model prediction within the framework of Latent\nGaussian Models (LGMs). 
Moreover, we explore metrics like Deviance Information\nCriteria (DIC), Watanabe Akaike information criterion (WAIC), and\ncross-validation measure conditional predictive ordinate (CPO) for model\nselection in R-INLA for CoDa.\n\nIllustrating LNDM through a simple simulated example and with an ecological\ncase study on Arabidopsis thaliana in the Iberian Peninsula, we underscore its\npotential as an effective tool for managing CoDa and large CoDa databases."}, "http://arxiv.org/abs/2310.02968": {"title": "Sampling depth trade-off in function estimation under a two-level design", "link": "http://arxiv.org/abs/2310.02968", "description": "Many modern statistical applications involve a two-level sampling scheme that\nfirst samples subjects from a population and then samples observations on each\nsubject. These schemes often are designed to learn both the population-level\nfunctional structures shared by the subjects and the functional characteristics\nspecific to individual subjects. Common wisdom suggests that learning\npopulation-level structures benefits from sampling more subjects whereas\nlearning subject-specific structures benefits from deeper sampling within each\nsubject. Oftentimes these two objectives compete for limited sampling\nresources, which raises the question of how to optimally sample at the two\nlevels. We quantify such sampling-depth trade-offs by establishing the $L_2$\nminimax risk rates for learning the population-level and subject-specific\nstructures under a hierarchical Gaussian process model framework where we\nconsider a Bayesian and a frequentist perspective on the unknown\npopulation-level structure. These rates provide general lessons for designing\ntwo-level sampling schemes. Interestingly, subject-specific learning\noccasionally benefits more by sampling more subjects than by deeper\nwithin-subject sampling. We also construct estimators that adapt to unknown\nsmoothness and achieve the corresponding minimax rates. We conduct two\nsimulation experiments validating our theory and illustrating the sampling\ntrade-off in practice, and apply these estimators to two real datasets."}, "http://arxiv.org/abs/2311.05025": {"title": "Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients", "link": "http://arxiv.org/abs/2311.05025", "description": "We present an unbiased method for Bayesian posterior means based on kinetic\nLangevin dynamics that combines advanced splitting methods with enhanced\ngradient approximations. Our approach avoids Metropolis correction by coupling\nMarkov chains at different discretization levels in a multilevel Monte Carlo\napproach. Theoretical analysis demonstrates that our proposed estimator is\nunbiased, attains finite variance, and satisfies a central limit theorem. It\ncan achieve accuracy $\\epsilon>0$ for estimating expectations of Lipschitz\nfunctions in $d$ dimensions with $\\mathcal{O}(d^{1/4}\\epsilon^{-2})$ expected\ngradient evaluations, without assuming warm start. We exhibit similar bounds\nusing both approximate and stochastic gradients, and our method's computational\ncost is shown to scale logarithmically with the size of the dataset. The\nproposed method is tested using a multinomial regression problem on the MNIST\ndataset and a Poisson regression model for soccer scores. Experiments indicate\nthat the number of gradient evaluations per effective sample is independent of\ndimension, even when using inexact gradients. For product distributions, we\ngive dimension-independent variance bounds. 
Our results demonstrate that the\nunbiased algorithm we present can be much more efficient than the\n``gold-standard\" randomized Hamiltonian Monte Carlo."}, "http://arxiv.org/abs/2311.05056": {"title": "High-dimensional Newey-Powell Test Via Approximate Message Passing", "link": "http://arxiv.org/abs/2311.05056", "description": "Homoscedastic regression error is a common assumption in many\nhigh-dimensional regression models and theories. Although heteroscedastic error\ncommonly exists in real-world datasets, testing heteroscedasticity remains\nlargely underexplored under high-dimensional settings. We consider the\nheteroscedasticity test proposed in Newey and Powell (1987), whose asymptotic\ntheory has been well-established for the low-dimensional setting. We show that\nthe Newey-Powell test can be developed for high-dimensional data. For\nasymptotic theory, we consider the setting where the number of dimensions grows\nwith the sample size at a linear rate. The asymptotic analysis for the test\nstatistic utilizes the Approximate Message Passing (AMP) algorithm, from which\nwe obtain the limiting distribution of the test. The numerical performance of\nthe test is investigated through an extensive simulation study. As real-data\napplications, we present the analysis based on \"international economic growth\"\ndata (Belloni et al. 2011), which is found to be homoscedastic, and\n\"supermarket\" data (Lan et al., 2016), which is found to be heteroscedastic."}, "http://arxiv.org/abs/2311.05200": {"title": "An efficient Bayesian approach to joint functional principal component analysis for complex sampling designs", "link": "http://arxiv.org/abs/2311.05200", "description": "The analysis of multivariate functional curves has the potential to yield\nimportant scientific discoveries in domains such as healthcare, medicine,\neconomics and social sciences. However it is common for real-world settings to\npresent data that are both sparse and irregularly sampled, and this introduces\nimportant challenges for the current functional data methodology. Here we\npropose a Bayesian hierarchical framework for multivariate functional principal\ncomponent analysis which accommodates the intricacies of such sampling designs\nby flexibly pooling information across subjects and correlated curves. Our\nmodel represents common latent dynamics via shared functional principal\ncomponent scores, thereby effectively borrowing strength across curves while\ncircumventing the computationally challenging task of estimating covariance\nmatrices. These scores also provide a parsimonious representation of the major\nmodes of joint variation of the curves, and constitute interpretable scalar\nsummaries that can be employed in follow-up analyses. We perform inference\nusing a variational message passing algorithm which combines efficiency,\nmodularity and approximate posterior density estimation, enabling the joint\nanalysis of large datasets with parameter uncertainty quantification. We\nconduct detailed simulations to assess the effectiveness of our approach in\nsharing information under complex sampling designs. We also exploit it to\nestimate the molecular disease courses of individual patients with SARS-CoV-2\ninfection and characterise patient heterogeneity in recovery outcomes; this\nstudy reveals key coordinated dynamics across the immune, inflammatory and\nmetabolic systems, which are associated with survival and long-COVID symptoms\nup to one year post disease onset. 
Our approach is implemented in the R package\nbayesFPCA."}, "http://arxiv.org/abs/2311.05248": {"title": "A General Space of Belief Updates for Model Misspecification in Bayesian Networks", "link": "http://arxiv.org/abs/2311.05248", "description": "In an ideal setting for Bayesian agents, a perfect description of the rules\nof the environment (i.e., the objective observation model) is available,\nallowing them to reason through the Bayesian posterior to update their beliefs\nin an optimal way. But such an ideal setting hardly ever exists in the natural\nworld, so agents have to make do with reasoning about how they should update\ntheir beliefs simultaneously. This introduces a number of related challenges\nfor several research areas: (1) For Bayesian statistics, this deviation of\nthe subjective model from the true data-generating mechanism is termed model\nmisspecification in the literature. (2) For neuroscience, it introduces the\nnecessity to model the agents' belief updates (how they use evidence to\nupdate their beliefs) and how their beliefs change over time. The current paper\naddresses these two challenges by (a) providing a general class of\nposteriors/belief updates called cut-posteriors of Bayesian networks that have\na much greater expressivity, and (b) parameterizing the space of possible\nposteriors to make meta-learning (i.e., choosing the belief update from this\nspace in a principled manner) possible. For (a), it is noteworthy that any\ncut-posterior has local computation only, making computation tractable for\nhuman or artificial agents. For (b), a Markov Chain Monte Carlo algorithm to\nperform such meta-learning will be sketched here, though it is only an\nillustration and by no means the only meta-learning procedure possible\nfor the space of cut-posteriors. Operationally, this work gives a\ngeneral algorithm to take in an arbitrary Bayesian network and output all\npossible cut-posteriors in the space."}, "http://arxiv.org/abs/2311.05272": {"title": "deform: An R Package for Nonstationary Spatial Gaussian Process Models by Deformations and Dimension Expansion", "link": "http://arxiv.org/abs/2311.05272", "description": "Gaussian processes (GPs) are a popular and powerful tool for spatial modelling\nof data, especially data that quantify environmental processes. However, in\nstationary form, whether covariance is isotropic or anisotropic, GPs may lack\nthe flexibility to capture dependence across a continuous spatial process,\nespecially across large domains. The deform package aims to provide users\nwith user-friendly R functions for the fitting and visualization of\nnonstationary spatial Gaussian processes. Users can choose to capture\nnonstationarity with either the spatial deformation approach of Sampson and\nGuttorp (1992) or the dimension expansion approach of Bornn, Shaddick, and\nZidek (2012). Thin plate regression splines are used in both approaches to\ntransform the original locations into a new set of locations under which\ncovariance is isotropic. Fitted models in deform can be used to predict these new\nlocations and to simulate nonstationary Gaussian processes for an arbitrary set\nof locations."}, "http://arxiv.org/abs/2311.05330": {"title": "Applying a new category association estimator to sentiment analysis on the Web", "link": "http://arxiv.org/abs/2311.05330", "description": "This paper introduces a novel Bayesian method for measuring the degree of\nassociation between categorical variables. 
The method is grounded in the formal\ndefinition of variable independence and was implemented using MCMC techniques.\nUnlike existing methods, this approach does not assume prior knowledge of the\ntotal number of occurrences for any category, making it particularly\nwell-suited for applications like sentiment analysis. We applied the method to\na dataset comprising 4,613 tweets written in Portuguese, each annotated for 30\npossibly overlapping emotional categories. Through this analysis, we identified\npairs of emotions that exhibit associations and mutually exclusive pairs.\nFurthermore, the method identifies hierarchical relations between categories, a\nfeature observed in our data, and was used to cluster emotions into basic level\ngroups."}, "http://arxiv.org/abs/2311.05339": {"title": "An iterative algorithm for high-dimensional linear models with both sparse and non-sparse structures", "link": "http://arxiv.org/abs/2311.05339", "description": "Numerous practical medical problems often involve data that possess a\ncombination of both sparse and non-sparse structures. Traditional penalized\nregularizations techniques, primarily designed for promoting sparsity, are\ninadequate to capture the optimal solutions in such scenarios. To address these\nchallenges, this paper introduces a novel algorithm named Non-sparse Iteration\n(NSI). The NSI algorithm allows for the existence of both sparse and non-sparse\nstructures and estimates them simultaneously and accurately. We provide\ntheoretical guarantees that the proposed algorithm converges to the oracle\nsolution and achieves the optimal rate for the upper bound of the $l_2$-norm\nerror. Through simulations and practical applications, NSI consistently\nexhibits superior statistical performance in terms of estimation accuracy,\nprediction efficacy, and variable selection compared to several existing\nmethods. The proposed method is also applied to breast cancer data, revealing\nrepeated selection of specific genes for in-depth analysis."}, "http://arxiv.org/abs/2311.05421": {"title": "Diffusion Based Causal Representation Learning", "link": "http://arxiv.org/abs/2311.05421", "description": "Causal reasoning can be considered a cornerstone of intelligent systems.\nHaving access to an underlying causal graph comes with the promise of\ncause-effect estimation and the identification of efficient and safe\ninterventions. However, learning causal representations remains a major\nchallenge, due to the complexity of many real-world systems. Previous works on\ncausal representation learning have mostly focused on Variational Auto-Encoders\n(VAE). These methods only provide representations from a point estimate, and\nthey are unsuitable to handle high dimensions. To overcome these problems, we\nproposed a new Diffusion-based Causal Representation Learning (DCRL) algorithm.\nThis algorithm uses diffusion-based representations for causal discovery. DCRL\noffers access to infinite dimensional latent codes, which encode different\nlevels of information in the latent code. In a first proof of principle, we\ninvestigate the use of DCRL for causal representation learning. 
We further\ndemonstrate experimentally that this approach performs comparably well in\nidentifying the causal structure and causal variables."}, "http://arxiv.org/abs/2311.05532": {"title": "Uncertainty-Aware Bayes' Rule and Its Applications", "link": "http://arxiv.org/abs/2311.05532", "description": "Bayes' rule has enabled innumerable powerful algorithms of statistical signal\nprocessing and statistical machine learning. However, when there exist model\nmisspecifications in prior distributions and/or data distributions, the direct\napplication of Bayes' rule is questionable. Philosophically, the key is to\nbalance the relative importance of prior and data distributions when\ncalculating posterior distributions: if prior (resp. data) distributions are\noverly conservative, we should upweight the prior belief (resp. data evidence);\nif prior (resp. data) distributions are overly opportunistic, we should\ndownweight the prior belief (resp. data evidence). This paper derives a\ngeneralized Bayes' rule, called uncertainty-aware Bayes' rule, to technically\nrealize the above philosophy, i.e., to combat the model uncertainties in prior\ndistributions and/or data distributions. Simulated and real-world experiments\nshowcase the superiority of the presented uncertainty-aware Bayes' rule over\nthe conventional Bayes' rule: In particular, the uncertainty-aware Kalman\nfilter, the uncertainty-aware particle filter, and the uncertainty-aware\ninteractive multiple model filter are suggested and validated."}, "http://arxiv.org/abs/2110.00115": {"title": "Comparing Sequential Forecasters", "link": "http://arxiv.org/abs/2110.00115", "description": "Consider two forecasters, each making a single prediction for a sequence of\nevents over time. We ask a relatively basic question: how might we compare\nthese forecasters, either online or post-hoc, while avoiding unverifiable\nassumptions on how the forecasts and outcomes were generated? In this paper, we\npresent a rigorous answer to this question by designing novel sequential\ninference procedures for estimating the time-varying difference in forecast\nscores. To do this, we employ confidence sequences (CS), which are sequences of\nconfidence intervals that can be continuously monitored and are valid at\narbitrary data-dependent stopping times (\"anytime-valid\"). The widths of our\nCSs are adaptive to the underlying variance of the score differences.\nUnderlying their construction is a game-theoretic statistical framework, in\nwhich we further identify e-processes and p-processes for sequentially testing\na weak null hypothesis -- whether one forecaster outperforms another on average\n(rather than always). Our methods do not make distributional assumptions on the\nforecasts or outcomes; our main theorems apply to any bounded scores, and we\nlater provide alternative methods for unbounded scores. We empirically validate\nour approaches by comparing real-world baseball and weather forecasters."}, "http://arxiv.org/abs/2207.14753": {"title": "Estimating Causal Effects with Hidden Confounding using Instrumental Variables and Environments", "link": "http://arxiv.org/abs/2207.14753", "description": "Recent works have proposed regression models which are invariant across data\ncollection environments. These estimators often have a causal interpretation\nunder conditions on the environments and type of invariance imposed. 
One recent\nexample, the Causal Dantzig (CD), is consistent under hidden confounding and\nrepresents an alternative to classical instrumental variable estimators such as\nTwo Stage Least Squares (TSLS). In this work we derive the CD as a generalized\nmethod of moments (GMM) estimator. The GMM representation leads to several\npractical results, including 1) creation of the Generalized Causal Dantzig\n(GCD) estimator which can be applied to problems with continuous environments\nwhere the CD cannot be fit 2) a Hybrid (GCD-TSLS combination) estimator which\nhas properties superior to GCD or TSLS alone 3) straightforward asymptotic\nresults for all methods using GMM theory. We compare the CD, GCD, TSLS, and\nHybrid estimators in simulations and an application to a Flow Cytometry data\nset. The newly proposed GCD and Hybrid estimators have superior performance to\nexisting methods in many settings."}, "http://arxiv.org/abs/2208.02948": {"title": "A Feature Selection Method that Controls the False Discovery Rate", "link": "http://arxiv.org/abs/2208.02948", "description": "The problem of selecting a handful of truly relevant variables in supervised\nmachine learning algorithms is a challenging problem in terms of untestable\nassumptions that must hold and unavailability of theoretical assurances that\nselection errors are under control. We propose a distribution-free feature\nselection method, referred to as Data Splitting Selection (DSS) which controls\nFalse Discovery Rate (FDR) of feature selection while obtaining a high power.\nAnother version of DSS is proposed with a higher power which \"almost\" controls\nFDR. No assumptions are made on the distribution of the response or on the\njoint distribution of the features. Extensive simulation is performed to\ncompare the performance of the proposed methods with the existing ones."}, "http://arxiv.org/abs/2209.03474": {"title": "An extension of the Unified Skew-Normal family of distributions and application to Bayesian binary regression", "link": "http://arxiv.org/abs/2209.03474", "description": "We consider the general problem of Bayesian binary regression and we\nintroduce a new class of distributions, the Perturbed Unified Skew Normal\n(pSUN, henceforth), which generalizes the Unified Skew-Normal (SUN) class. We\nshow that the new class is conjugate to any binary regression model, provided\nthat the link function may be expressed as a scale mixture of Gaussian\ndensities. We discuss in detail the popular logit case, and we show that, when\na logistic regression model is combined with a Gaussian prior, posterior\nsummaries such as cumulants and normalizing constants can be easily obtained\nthrough the use of an importance sampling approach, opening the way to\nstraightforward variable selection procedures. For more general priors, the\nproposed methodology is based on a simple Gibbs sampler algorithm. We also\nclaim that, in the p > n case, the proposed methodology shows better\nperformances - both in terms of mixing and accuracy - compared to the existing\nmethods. We illustrate the performance through several simulation studies and\ntwo data analyses."}, "http://arxiv.org/abs/2211.15070": {"title": "Online Kernel CUSUM for Change-Point Detection", "link": "http://arxiv.org/abs/2211.15070", "description": "We present a computationally efficient online kernel Cumulative Sum (CUSUM)\nmethod for change-point detection that utilizes the maximum over a set of\nkernel statistics to account for the unknown change-point location. 
Our\napproach exhibits increased sensitivity to small changes compared to existing\nkernel-based change-point detection methods, including Scan-B statistic,\ncorresponding to a non-parametric Shewhart chart-type procedure. We provide\naccurate analytic approximations for two key performance metrics: the Average\nRun Length (ARL) and Expected Detection Delay (EDD), which enable us to\nestablish an optimal window length to be on the order of the logarithm of ARL\nto ensure minimal power loss relative to an oracle procedure with infinite\nmemory. Moreover, we introduce a recursive calculation procedure for detection\nstatistics to ensure constant computational and memory complexity, which is\nessential for online implementation. Through extensive experiments on both\nsimulated and real data, we demonstrate the competitive performance of our\nmethod and validate our theoretical results."}, "http://arxiv.org/abs/2301.09633": {"title": "Prediction-Powered Inference", "link": "http://arxiv.org/abs/2301.09633", "description": "Prediction-powered inference is a framework for performing valid statistical\ninference when an experimental dataset is supplemented with predictions from a\nmachine-learning system. The framework yields simple algorithms for computing\nprovably valid confidence intervals for quantities such as means, quantiles,\nand linear and logistic regression coefficients, without making any assumptions\non the machine-learning algorithm that supplies the predictions. Furthermore,\nmore accurate predictions translate to smaller confidence intervals.\nPrediction-powered inference could enable researchers to draw valid and more\ndata-efficient conclusions using machine learning. The benefits of\nprediction-powered inference are demonstrated with datasets from proteomics,\nastronomy, genomics, remote sensing, census analysis, and ecology."}, "http://arxiv.org/abs/2303.03521": {"title": "Bayesian Adaptive Selection of Variables for Function-on-Scalar Regression Models", "link": "http://arxiv.org/abs/2303.03521", "description": "Considering the field of functional data analysis, we developed a new\nBayesian method for variable selection in function-on-scalar regression (FOSR).\nOur approach uses latent variables, allowing an adaptive selection since it can\ndetermine the number of variables and which ones should be selected for a\nfunction-on-scalar regression model. Simulation studies show the proposed\nmethod's main properties, such as its accuracy in estimating the coefficients\nand high capacity to select variables correctly. Furthermore, we conducted\ncomparative studies with the main competing methods, such as the BGLSS method\nas well as the group LASSO, the group MCP and the group SCAD. We also used a\nCOVID-19 dataset and some socioeconomic data from Brazil for real data\napplication. In short, the proposed Bayesian variable selection model is\nextremely competitive, showing significant predictive and selective quality."}, "http://arxiv.org/abs/2305.10564": {"title": "Counterfactually Comparing Abstaining Classifiers", "link": "http://arxiv.org/abs/2305.10564", "description": "Abstaining classifiers have the option to abstain from making predictions on\ninputs that they are unsure about. These classifiers are becoming increasingly\npopular in high-stakes decision-making problems, as they can withhold uncertain\npredictions to improve their reliability and safety. 
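The "Prediction-Powered Inference" abstract above notes that the framework yields simple algorithms for quantities such as means. As a concrete illustration, here is a minimal sketch of the point estimate for a population mean, which corrects the average model prediction on a large unlabelled sample with the average prediction error measured on the labelled sample; the toy predictor and variable names are assumptions, and the paper's confidence-interval construction is not reproduced.

```python
import numpy as np

def prediction_powered_mean(y_labeled, preds_labeled, preds_unlabeled):
    # Mean prediction on the unlabeled data, corrected by the average prediction
    # error ("rectifier") measured on the labeled data.
    rectifier = np.mean(y_labeled - preds_labeled)
    return float(np.mean(preds_unlabeled) + rectifier)

# Toy usage with a hypothetical predictor f fitted elsewhere.
rng = np.random.default_rng(1)
x_lab, x_unlab = rng.normal(size=200), rng.normal(size=10_000)
f = lambda x: 2.0 * x + 0.3                    # stand-in for a machine-learning model
y_lab = 2.0 * x_lab + rng.normal(size=200)
print(prediction_powered_mean(y_lab, f(x_lab), f(x_unlab)))  # close to E[Y] = 0
```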
When evaluating black-box\nabstaining classifier(s), however, we lack a principled approach that accounts\nfor what the classifier would have predicted on its abstentions. These missing\npredictions matter when they can eventually be utilized, either directly or as\na backup option in a failure mode. In this paper, we introduce a novel approach\nand perspective to the problem of evaluating and comparing abstaining\nclassifiers by treating abstentions as missing data. Our evaluation approach is\ncentered around defining the counterfactual score of an abstaining classifier,\ndefined as the expected performance of the classifier had it not been allowed\nto abstain. We specify the conditions under which the counterfactual score is\nidentifiable: if the abstentions are stochastic, and if the evaluation data is\nindependent of the training data (ensuring that the predictions are missing at\nrandom), then the score is identifiable. Note that, if abstentions are\ndeterministic, then the score is unidentifiable because the classifier can\nperform arbitrarily poorly on its abstentions. Leveraging tools from\nobservational causal inference, we then develop nonparametric and doubly robust\nmethods to efficiently estimate this quantity under identification. Our\napproach is examined in both simulated and real data experiments."}, "http://arxiv.org/abs/2307.01539": {"title": "Implementing measurement error models with mechanistic mathematical models in a likelihood-based framework for estimation, identifiability analysis, and prediction in the life sciences", "link": "http://arxiv.org/abs/2307.01539", "description": "Throughout the life sciences we routinely seek to interpret measurements and\nobservations using parameterised mechanistic mathematical models. A fundamental\nand often overlooked choice in this approach involves relating the solution of\na mathematical model with noisy and incomplete measurement data. This is often\nachieved by assuming that the data are noisy measurements of the solution of a\ndeterministic mathematical model, and that measurement errors are additive and\nnormally distributed. While this assumption of additive Gaussian noise is\nextremely common and simple to implement and interpret, it is often unjustified\nand can lead to poor parameter estimates and non-physical predictions. One way\nto overcome this challenge is to implement a different measurement error model.\nIn this review, we demonstrate how to implement a range of measurement error\nmodels in a likelihood-based framework for estimation, identifiability\nanalysis, and prediction, called Profile-Wise Analysis. This frequentist\napproach to uncertainty quantification for mechanistic models leverages the\nprofile likelihood for targeting parameters and understanding their influence\non predictions. Case studies, motivated by simple caricature models routinely\nused in systems biology and mathematical biology literature, illustrate how the\nsame ideas apply to different types of mathematical models. Open-source Julia\ncode to reproduce results is available on GitHub."}, "http://arxiv.org/abs/2307.04548": {"title": "Beyond the Two-Trials Rule", "link": "http://arxiv.org/abs/2307.04548", "description": "The two-trials rule for drug approval requires \"at least two adequate and\nwell-controlled studies, each convincing on its own, to establish\neffectiveness\". 
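The abstract on counterfactually comparing abstaining classifiers (above) treats abstentions as missing data and targets the score the classifier would have obtained had it not been allowed to abstain. As a simplified illustration, the sketch below uses plain inverse-probability weighting under an assumed abstention-probability model pi_hat(x); the paper's doubly robust estimators are not reproduced.

```python
import numpy as np

def ipw_counterfactual_score(scores, abstained, abstain_prob):
    # scores[i]:       score of the classifier's prediction on example i (entries where
    #                  the classifier abstained are never used)
    # abstained[i]:    True if the classifier abstained on example i
    # abstain_prob[i]: estimated abstention probability given covariates, pi_hat(x_i)
    keep = ~abstained
    weights = 1.0 / (1.0 - abstain_prob[keep])
    return float(np.sum(scores[keep] * weights) / len(abstained))

# Toy usage with stochastic abstentions driven by a known abstention model.
rng = np.random.default_rng(5)
pi = rng.uniform(0.1, 0.4, size=1000)                    # stand-in abstention model pi_hat(x)
abstained = rng.uniform(size=1000) < pi                  # stochastic abstentions
scores = rng.binomial(1, 0.8, size=1000).astype(float)   # e.g. 0/1 accuracy scores
print(ipw_counterfactual_score(scores, abstained, pi))   # roughly 0.8
```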
This is usually implemented by requiring two significant\npivotal trials and is the standard regulatory requirement to provide evidence\nfor a new drug's efficacy. However, there is need to develop suitable\nalternatives to this rule for a number of reasons, including the possible\navailability of data from more than two trials. I consider the case of up to 3\nstudies and stress the importance to control the partial Type-I error rate,\nwhere only some studies have a true null effect, while maintaining the overall\nType-I error rate of the two-trials rule, where all studies have a null effect.\nSome less-known $p$-value combination methods are useful to achieve this:\nPearson's method, Edgington's method and the recently proposed harmonic mean\n$\\chi^2$-test. I study their properties and discuss how they can be extended to\na sequential assessment of success while still ensuring overall Type-I error\ncontrol. I compare the different methods in terms of partial Type-I error rate,\nproject power and the expected number of studies required. Edgington's method\nis eventually recommended as it is easy to implement and communicate, has only\nmoderate partial Type-I error rate inflation but substantially increased\nproject power."}, "http://arxiv.org/abs/2307.06250": {"title": "Identifiability Guarantees for Causal Disentanglement from Soft Interventions", "link": "http://arxiv.org/abs/2307.06250", "description": "Causal disentanglement aims to uncover a representation of data using latent\nvariables that are interrelated through a causal model. Such a representation\nis identifiable if the latent model that explains the data is unique. In this\npaper, we focus on the scenario where unpaired observational and interventional\ndata are available, with each intervention changing the mechanism of a latent\nvariable. When the causal variables are fully observed, statistically\nconsistent algorithms have been developed to identify the causal model under\nfaithfulness assumptions. We here show that identifiability can still be\nachieved with unobserved causal variables, given a generalized notion of\nfaithfulness. Our results guarantee that we can recover the latent causal model\nup to an equivalence class and predict the effect of unseen combinations of\ninterventions, in the limit of infinite data. We implement our causal\ndisentanglement framework by developing an autoencoding variational Bayes\nalgorithm and apply it to the problem of predicting combinatorial perturbation\neffects in genomics."}, "http://arxiv.org/abs/2311.05794": {"title": "An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits", "link": "http://arxiv.org/abs/2311.05794", "description": "Typically, multi-armed bandit (MAB) experiments are analyzed at the end of\nthe study and thus require the analyst to specify a fixed sample size in\nadvance. However, in many online learning applications, it is advantageous to\ncontinuously produce inference on the average treatment effect (ATE) between\narms as new data arrive and determine a data-driven stopping time for the\nexperiment. Existing work on continuous inference for adaptive experiments\nassumes that the treatment assignment probabilities are bounded away from zero\nand one, thus excluding nearly all standard bandit algorithms. 
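The "Beyond the Two-Trials Rule" abstract above recommends Edgington's method, which combines studies through the sum of their p-values. A minimal sketch of that combination follows, comparing the observed sum with the Irwin-Hall distribution of a sum of independent uniforms; the sequential extension and the partial Type-I error analysis discussed in the abstract are not covered.

```python
from math import comb, factorial

def irwin_hall_cdf(s, k):
    # P(U_1 + ... + U_k <= s) for k independent uniform(0, 1) variables.
    s = max(0.0, min(float(s), float(k)))
    return sum((-1) ** j * comb(k, j) * (s - j) ** k for j in range(int(s) + 1)) / factorial(k)

def edgington_combined_p(p_values):
    # Edgington's method: the combined p-value is the null probability that the sum of
    # the study p-values is as small as the one observed.
    return irwin_hall_cdf(sum(p_values), len(p_values))

print(edgington_combined_p([0.02, 0.04, 0.03]))  # three mutually supportive studies
```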
In this work, we\ndevelop the Mixture Adaptive Design (MAD), a new experimental design for\nmulti-armed bandits that enables continuous inference on the ATE with\nguarantees on statistical validity and power for nearly any bandit algorithm.\nOn a high level, the MAD \"mixes\" a bandit algorithm of the user's choice with a\nBernoulli design through a tuning parameter $\\delta_t$, where $\\delta_t$ is a\ndeterministic sequence that controls the priority placed on the Bernoulli\ndesign as the sample size grows. We show that for $\\delta_t =\no\\left(1/t^{1/4}\\right)$, the MAD produces a confidence sequence that is\nasymptotically valid and guaranteed to shrink around the true ATE. We\nempirically show that the MAD improves the coverage and power of ATE inference\nin MAB experiments without significant losses in finite-sample reward."}, "http://arxiv.org/abs/2311.05806": {"title": "Likelihood ratio tests in random graph models with increasing dimensions", "link": "http://arxiv.org/abs/2311.05806", "description": "We explore the Wilks phenomena in two random graph models: the $\\beta$-model\nand the Bradley-Terry model. For two increasing dimensional null hypotheses,\nincluding a specified null $H_0: \\beta_i=\\beta_i^0$ for $i=1,\\ldots, r$ and a\nhomogenous null $H_0: \\beta_1=\\cdots=\\beta_r$, we reveal high dimensional\nWilks' phenomena that the normalized log-likelihood ratio statistic,\n$[2\\{\\ell(\\widehat{\\mathbf{\\beta}}) - \\ell(\\widehat{\\mathbf{\\beta}}^0)\\}\n-r]/(2r)^{1/2}$, converges in distribution to the standard normal distribution\nas $r$ goes to infinity. Here, $\\ell( \\mathbf{\\beta})$ is the log-likelihood\nfunction on the model parameter $\\mathbf{\\beta}=(\\beta_1, \\ldots,\n\\beta_n)^\\top$, $\\widehat{\\mathbf{\\beta}}$ is its maximum likelihood estimator\n(MLE) under the full parameter space, and $\\widehat{\\mathbf{\\beta}}^0$ is the\nrestricted MLE under the null parameter space. For the homogenous null with a\nfixed $r$, we establish Wilks-type theorems that\n$2\\{\\ell(\\widehat{\\mathbf{\\beta}}) - \\ell(\\widehat{\\mathbf{\\beta}}^0)\\}$\nconverges in distribution to a chi-square distribution with $r-1$ degrees of\nfreedom, as the total number of parameters, $n$, goes to infinity. When testing\nthe fixed dimensional specified null, we find that its asymptotic null\ndistribution is a chi-square distribution in the $\\beta$-model. However,\nunexpectedly, this is not true in the Bradley-Terry model. By developing\nseveral novel technical methods for asymptotic expansion, we explore Wilks type\nresults in a principled manner; these principled methods should be applicable\nto a class of random graph models beyond the $\\beta$-model and the\nBradley-Terry model. Simulation studies and real network data applications\nfurther demonstrate the theoretical results."}, "http://arxiv.org/abs/2311.05819": {"title": "A flexible framework for synthesizing human activity patterns with application to sequential categorical data", "link": "http://arxiv.org/abs/2311.05819", "description": "The ability to synthesize realistic data in a parametrizable way is valuable\nfor a number of reasons, including privacy, missing data imputation, and\nevaluating the performance of statistical and computational methods. When the\nunderlying data generating process is complex, data synthesis requires\napproaches that balance realism and simplicity. 
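The Mixture Adaptive Design (MAD) abstract above mixes an arbitrary bandit algorithm with a Bernoulli design through a deterministic sequence delta_t. The sketch below shows only that assignment step for a two-armed experiment; the bandit propensity and the particular delta_t are placeholders (the rate condition quoted in the abstract governs the actual choice), and the anytime-valid inference on the ATE is not shown.

```python
import numpy as np

def mad_assignment_prob(bandit_prob_treat, t, delta):
    # With probability delta(t) follow a fair Bernoulli design, otherwise follow the
    # bandit algorithm's current treatment propensity.
    d = delta(t)
    return d * 0.5 + (1.0 - d) * bandit_prob_treat

rng = np.random.default_rng(2)
delta = lambda t: t ** -0.3   # placeholder deterministic sequence; see the abstract's rate condition
p_bandit = 0.9                # stand-in for, e.g., a Thompson-sampling propensity
for t in range(1, 6):
    p = mad_assignment_prob(p_bandit, t, delta)
    arm = rng.binomial(1, p)  # 1 = treatment, 0 = control
    print(t, round(p, 3), arm)
```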
In this paper, we address the\nproblem of synthesizing sequential categorical data of the type that is\nincreasingly available from mobile applications and sensors that record\nparticipant status continuously over the course of multiple days and weeks. We\npropose the paired Markov Chain (paired-MC) method, a flexible framework that\nproduces sequences that closely mimic real data while providing a\nstraightforward mechanism for modifying characteristics of the synthesized\nsequences. We demonstrate the paired-MC method on two datasets, one reflecting\ndaily human activity patterns collected via a smartphone application, and one\nencoding the intensities of physical activity measured by wearable\naccelerometers. In both settings, sequences synthesized by paired-MC better\ncapture key characteristics of the real data than alternative approaches."}, "http://arxiv.org/abs/2311.05847": {"title": "Threshold distribution of equal states for quantitative amplitude fluctuations", "link": "http://arxiv.org/abs/2311.05847", "description": "Objective. The distribution of equal states (DES) quantifies amplitude\nfluctuations in biomedical signals. However, under certain conditions, such as\na high resolution of data collection or special signal processing techniques,\nequal states may be very rare, whereupon the DES fails to measure the amplitude\nfluctuations. Approach. To address this problem, we develop a novel threshold\nDES (tDES) that measures the distribution of differential states within a\nthreshold. To evaluate the proposed tDES, we first analyze five sets of\nsynthetic signals generated in different frequency bands. We then analyze sleep\nelectroencephalography (EEG) datasets taken from the public PhysioNet. Main\nresults. Synthetic signals and detrend-filtered sleep EEGs have no neighboring\nequal values; however, tDES can effectively measure the amplitude fluctuations\nwithin these data. The tDES of EEG data increases significantly as the sleep\nstage increases, even with datasets covering very short periods, indicating\ndecreased amplitude fluctuations in sleep EEGs. Generally speaking, the\npresence of more low-frequency components in a physiological series reflects\nsmaller amplitude fluctuations and larger DES. Significance.The tDES provides a\nreliable computing method for quantifying amplitude fluctuations, exhibiting\nthe characteristics of conceptual simplicity and computational robustness. Our\nfindings broaden the application of quantitative amplitude fluctuations and\ncontribute to the classification of sleep stages based on EEG data."}, "http://arxiv.org/abs/2311.05914": {"title": "Efficient Case-Cohort Design using Balanced Sampling", "link": "http://arxiv.org/abs/2311.05914", "description": "A case-cohort design is a two-phase sampling design frequently used to\nanalyze censored survival data in a cost-effective way, where a subcohort is\nusually selected using simple random sampling or stratified simple random\nsampling. In this paper, we propose an efficient sampling procedure based on\nbalanced sampling when selecting a subcohort in a case-cohort design. A sample\nselected via a balanced sampling procedure automatically calibrates auxiliary\nvariables. When fitting a Cox model, calibrating sampling weights has been\nshown to lead to more efficient estimators of the regression coefficients\n(Breslow et al., 2009a, b). The reduced variabilities over its counterpart with\na simple random sampling are shown via extensive simulation experiments. 
The\nproposed design and estimation procedure are also illustrated with the\nwell-known National Wilms Tumor Study dataset."}, "http://arxiv.org/abs/2311.06076": {"title": "Bayesian Tensor Factorisations for Time Series of Counts", "link": "http://arxiv.org/abs/2311.06076", "description": "We propose a flexible nonparametric Bayesian modelling framework for\nmultivariate time series of count data based on tensor factorisations. Our\nmodels can be viewed as infinite state space Markov chains of known maximal\norder with non-linear serial dependence through the introduction of appropriate\nlatent variables. Alternatively, our models can be viewed as Bayesian\nhierarchical models with conditionally independent Poisson distributed\nobservations. Inference about the important lags and their complex interactions\nis achieved via MCMC. When the observed counts are large, we deal with the\nresulting computational complexity of Bayesian inference via a two-step\ninferential strategy based on an initial analysis of a training set of the\ndata. Our methodology is illustrated using simulation experiments and analysis\nof real-world data."}, "http://arxiv.org/abs/2311.06086": {"title": "A three-step approach to production frontier estimation and the Matsuoka's distribution", "link": "http://arxiv.org/abs/2311.06086", "description": "In this work, we introduce a three-step semiparametric methodology for the\nestimation of production frontiers. We consider a model inspired by the\nwell-known Cobb-Douglas production function, wherein input factors operate\nmultiplicatively within the model. Efficiency in the proposed model is assumed\nto follow a continuous univariate uniparametric distribution in $(0,1)$,\nreferred to as Matsuoka's distribution, which is introduced and explored.\nFollowing model linearization, the first step of the procedure is to\nsemiparametrically estimate the regression function through a local linear\nsmoother. The second step focuses on the estimation of the efficiency parameter\nin which the properties of the Matsuoka's distribution are employed. Finally,\nwe estimate the production frontier through a plug-in methodology. We present a\nrigorous asymptotic theory related to the proposed three-step estimation,\nincluding consistency, and asymptotic normality, and derive rates for the\nconvergences presented. Incidentally, we also introduce and study the\nMatsuoka's distribution, deriving its main properties, including quantiles,\nmoments, $\\alpha$-expectiles, entropies, and stress-strength reliability, among\nothers. The Matsuoka's distribution exhibits a versatile array of shapes\ncapable of effectively encapsulating the typical behavior of efficiency within\nproduction frontier models. To complement the large sample results obtained, a\nMonte Carlo simulation study is conducted to assess the finite sample\nperformance of the proposed three-step methodology. An empirical application\nusing a dataset of Danish milk producers is also presented."}, "http://arxiv.org/abs/2311.06220": {"title": "Bayesian nonparametric generative modeling of large multivariate non-Gaussian spatial fields", "link": "http://arxiv.org/abs/2311.06220", "description": "Multivariate spatial fields are of interest in many applications, including\nclimate model emulation. Not only can the marginal spatial fields be subject to\nnonstationarity, but the dependence structure among the marginal fields and\nbetween the fields might also differ substantially. 
Extending a recently\nproposed Bayesian approach to describe the distribution of a nonstationary\nunivariate spatial field using a triangular transport map, we cast the\ninference problem for a multivariate spatial field for a small number of\nreplicates into a series of independent Gaussian process (GP) regression tasks\nwith Gaussian errors. Due to the potential nonlinearity in the conditional\nmeans, the joint distribution modeled can be non-Gaussian. The resulting\nnonparametric Bayesian methodology scales well to high-dimensional spatial\nfields. It is especially useful when only a few training samples are available,\nbecause it employs regularization priors and quantifies uncertainty. Inference\nis conducted in an empirical Bayes setting by a highly scalable stochastic\ngradient approach. The implementation benefits from mini-batching and could be\naccelerated with parallel computing. We illustrate the extended transport-map\nmodel by studying hydrological variables from non-Gaussian climate-model\noutput."}, "http://arxiv.org/abs/2207.12705": {"title": "Efficient shape-constrained inference for the autocovariance sequence from a reversible Markov chain", "link": "http://arxiv.org/abs/2207.12705", "description": "In this paper, we study the problem of estimating the autocovariance sequence\nresulting from a reversible Markov chain. A motivating application for studying\nthis problem is the estimation of the asymptotic variance in central limit\ntheorems for Markov chains. We propose a novel shape-constrained estimator of\nthe autocovariance sequence, which is based on the key observation that the\nrepresentability of the autocovariance sequence as a moment sequence imposes\ncertain shape constraints. We examine the theoretical properties of the\nproposed estimator and provide strong consistency guarantees for our estimator.\nIn particular, for geometrically ergodic reversible Markov chains, we show that\nour estimator is strongly consistent for the true autocovariance sequence with\nrespect to an $\\ell_2$ distance, and that our estimator leads to strongly\nconsistent estimates of the asymptotic variance. Finally, we perform empirical\nstudies to illustrate the theoretical properties of the proposed estimator as\nwell as to demonstrate the effectiveness of our estimator in comparison with\nother current state-of-the-art methods for Markov chain Monte Carlo variance\nestimation, including batch means, spectral variance estimators, and the\ninitial convex sequence estimator."}, "http://arxiv.org/abs/2209.04321": {"title": "Estimating Racial Disparities in Emergency General Surgery", "link": "http://arxiv.org/abs/2209.04321", "description": "Research documents that Black patients experience worse general surgery\noutcomes than white patients in the United States. In this paper, we focus on\nan important but less-examined category: the surgical treatment of emergency\ngeneral surgery (EGS) conditions, which refers to medical emergencies where the\ninjury is \"endogenous,\" such as a burst appendix. Our goal is to assess racial\ndisparities for common outcomes after EGS treatment using an administrative\ndatabase of hospital claims in New York, Florida, and Pennsylvania, and to\nunderstand the extent to which differences are attributable to patient-level\nrisk factors versus hospital-level factors. To do so, we use a class of linear\nweighting estimators that re-weight white patients to have a similar\ndistribution of baseline characteristics as Black patients. 
This framework\nnests many common approaches, including matching and linear regression, but\noffers important advantages over these methods in terms of controlling\nimbalance between groups, minimizing extrapolation, and reducing computation\ntime. Applying this approach to the claims data, we find that disparities\nestimates that adjust for the admitting hospital are substantially smaller than\nestimates that adjust for patient baseline characteristics only, suggesting\nthat hospital-specific factors are important drivers of racial disparities in\nEGS outcomes."}, "http://arxiv.org/abs/2210.12759": {"title": "Robust angle-based transfer learning in high dimensions", "link": "http://arxiv.org/abs/2210.12759", "description": "Transfer learning aims to improve the performance of a target model by\nleveraging data from related source populations, which is known to be\nespecially helpful in cases with insufficient target data. In this paper, we\nstudy the problem of how to train a high-dimensional ridge regression model\nusing limited target data and existing regression models trained in\nheterogeneous source populations. We consider a practical setting where only\nthe parameter estimates of the fitted source models are accessible, instead of\nthe individual-level source data. Under the setting with only one source model,\nwe propose a novel flexible angle-based transfer learning (angleTL) method,\nwhich leverages the concordance between the source and the target model\nparameters. We show that angleTL unifies several benchmark methods by\nconstruction, including the target-only model trained using target data alone,\nthe source model fitted on source data, and distance-based transfer learning\nmethod that incorporates the source parameter estimates and the target data\nunder a distance-based similarity constraint. We also provide algorithms to\neffectively incorporate multiple source models accounting for the fact that\nsome source models may be more helpful than others. Our high-dimensional\nasymptotic analysis provides interpretations and insights regarding when a\nsource model can be helpful to the target model, and demonstrates the\nsuperiority of angleTL over other benchmark methods. We perform extensive\nsimulation studies to validate our theoretical conclusions and show the\nfeasibility of applying angleTL to transfer existing genetic risk prediction\nmodels across multiple biobanks."}, "http://arxiv.org/abs/2211.04370": {"title": "NESTER: An Adaptive Neurosymbolic Method for Causal Effect Estimation", "link": "http://arxiv.org/abs/2211.04370", "description": "Causal effect estimation from observational data is a central problem in\ncausal inference. Methods based on potential outcomes framework solve this\nproblem by exploiting inductive biases and heuristics from causal inference.\nEach of these methods addresses a specific aspect of causal effect estimation,\nsuch as controlling propensity score, enforcing randomization, etc., by\ndesigning neural network (NN) architectures and regularizers. In this paper, we\npropose an adaptive method called Neurosymbolic Causal Effect Estimator\n(NESTER), a generalized method for causal effect estimation. NESTER integrates\nthe ideas used in existing methods based on multi-head NNs for causal effect\nestimation into one framework. We design a Domain Specific Language (DSL)\ntailored for causal effect estimation based on causal inductive biases used in\nliterature. 
We conduct a theoretical analysis to investigate NESTER's efficacy\nin estimating causal effects. Our comprehensive empirical results show that\nNESTER performs better than state-of-the-art methods on benchmark datasets."}, "http://arxiv.org/abs/2211.13374": {"title": "A Multivariate Non-Gaussian Bayesian Filter Using Power Moments", "link": "http://arxiv.org/abs/2211.13374", "description": "In this paper, we extend our results on the univariate non-Gaussian Bayesian\nfilter using power moments to the multivariate systems, which can be either\nlinear or nonlinear. Doing this introduces several challenging problems, for\nexample a positive parametrization of the density surrogate, which is not only\na problem of filter design, but also one of the multiple dimensional Hamburger\nmoment problem. We propose a parametrization of the density surrogate with the\nproofs to its existence, Positivstellensatz and uniqueness. Based on it, we\nanalyze the errors of moments of the density estimates by the proposed density\nsurrogate. A discussion on continuous and discrete treatments to the\nnon-Gaussian Bayesian filtering problem is proposed to motivate the research on\ncontinuous parametrization of the system state. Simulation results on\nestimating different types of multivariate density functions are given to\nvalidate our proposed filter. To the best of our knowledge, the proposed filter\nis the first one implementing the multivariate Bayesian filter with the system\nstate parameterized as a continuous function, which only requires the true\nstates being Lebesgue integrable."}, "http://arxiv.org/abs/2303.02637": {"title": "A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks", "link": "http://arxiv.org/abs/2303.02637", "description": "A classic inferential statistical problem is the goodness-of-fit (GOF) test.\nSuch a test can be challenging when the hypothesized parametric model has an\nintractable likelihood and its distributional form is not available. Bayesian\nmethods for GOF can be appealing due to their ability to incorporate expert\nknowledge through prior distributions.\n\nHowever, standard Bayesian methods for this test often require strong\ndistributional assumptions on the data and their relevant parameters. To\naddress this issue, we propose a semi-Bayesian nonparametric (semi-BNP)\nprocedure in the context of the maximum mean discrepancy (MMD) measure that can\nbe applied to the GOF test. Our method introduces a novel Bayesian estimator\nfor the MMD, enabling the development of a measure-based hypothesis test for\nintractable models. Through extensive experiments, we demonstrate that our\nproposed test outperforms frequentist MMD-based methods by achieving a lower\nfalse rejection and acceptance rate of the null hypothesis. Furthermore, we\nshowcase the versatility of our approach by embedding the proposed estimator\nwithin a generative adversarial network (GAN) framework. 
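The semi-Bayesian nonparametric MMD abstract above introduces a Bayesian estimator of the maximum mean discrepancy. For orientation, here is a minimal sketch of the classical unbiased (U-statistic) estimate of MMD^2 with a Gaussian kernel, i.e. the standard frequentist quantity such estimators target; it is not the paper's semi-BNP procedure, and the bandwidth is an arbitrary assumption.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased U-statistic estimate of MMD^2 between samples x (n, d) and y (m, d).
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # drop diagonal terms
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(term_x + term_y - 2.0 * kxy.mean())

rng = np.random.default_rng(3)
x = rng.normal(size=(300, 2))
y = rng.normal(loc=0.5, size=(300, 2))
print(mmd2_unbiased(x, y))  # noticeably positive; near zero if the samples match
```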
It facilitates a\nrobust BNP learning approach as another significant application of our method.\nWith our BNP procedure, this new GAN approach can enhance sample diversity and\nimprove inferential accuracy compared to traditional techniques."}, "http://arxiv.org/abs/2304.07113": {"title": "Causal inference with a functional outcome", "link": "http://arxiv.org/abs/2304.07113", "description": "This paper presents methods to study the causal effect of a binary treatment\non a functional outcome with observational data. We define a Functional Average\nTreatment Effect and develop an outcome regression estimator. We show how to\nobtain valid inference on the FATE using simultaneous confidence bands, which\ncover the FATE with a given probability over the entire domain. Simulation\nexperiments illustrate how the simultaneous confidence bands take the multiple\ncomparison problem into account. Finally, we use the methods to infer the\neffect of early adult location on subsequent income development for one Swedish\nbirth cohort."}, "http://arxiv.org/abs/2310.00233": {"title": "CausalImages: An R Package for Causal Inference with Earth Observation, Bio-medical, and Social Science Images", "link": "http://arxiv.org/abs/2310.00233", "description": "The causalimages R package enables causal inference with image and image\nsequence data, providing new tools for integrating novel data sources like\nsatellite and bio-medical imagery into the study of cause and effect. One set\nof functions enables image-based causal inference analyses. For example, one\nkey function decomposes treatment effect heterogeneity by images using an\ninterpretable Bayesian framework. This allows for determining which types of\nimages or image sequences are most responsive to interventions. A second\nmodeling function allows researchers to control for confounding using images.\nThe package also allows investigators to produce embeddings that serve as\nvector summaries of the image or video content. Finally, infrastructural\nfunctions are also provided, such as tools for writing large-scale image and\nimage sequence data as sequentialized byte strings for more rapid image\nanalysis. causalimages therefore opens new capabilities for causal inference in\nR, letting researchers use informative imagery in substantive analyses in a\nfast and accessible manner."}, "http://arxiv.org/abs/2311.06409": {"title": "Flexible joint models for multivariate longitudinal and time-to-event data using multivariate functional principal components", "link": "http://arxiv.org/abs/2311.06409", "description": "The joint modeling of multiple longitudinal biomarkers together with a\ntime-to-event outcome is a challenging modeling task of continued scientific\ninterest. In particular, the computational complexity of high dimensional\n(generalized) mixed effects models often restricts the flexibility of shared\nparameter joint models, even when the subject-specific marker trajectories\nfollow highly nonlinear courses. We propose a parsimonious multivariate\nfunctional principal components representation of the shared random effects.\nThis allows better scalability, as the dimension of the random effects does not\ndirectly increase with the number of markers, only with the chosen number of\nprincipal component basis functions used in the approximation of the random\neffects. The functional principal component representation additionally allows\nto estimate highly flexible subject-specific random trajectories without\nparametric assumptions. 
The modeled trajectories can thus be distinctly\ndifferent for each biomarker. We build on the framework of flexible Bayesian\nadditive joint models implemented in the R-package 'bamlss', which also\nsupports estimation of nonlinear covariate effects via Bayesian P-splines. The\nflexible yet parsimonious functional principal components basis used in the\nestimation of the joint model is first estimated in a preliminary step. We\nvalidate our approach in a simulation study and illustrate its advantages by\nanalyzing a study on primary biliary cholangitis."}, "http://arxiv.org/abs/2311.06412": {"title": "Online multiple testing with e-values", "link": "http://arxiv.org/abs/2311.06412", "description": "A scientist tests a continuous stream of hypotheses over time in the course\nof her investigation -- she does not test a predetermined, fixed number of\nhypotheses. The scientist wishes to make as many discoveries as possible while\nensuring the number of false discoveries is controlled -- a well recognized way\nfor accomplishing this is to control the false discovery rate (FDR). Prior\nmethods for FDR control in the online setting have focused on formulating\nalgorithms when specific dependency structures are assumed to exist between the\ntest statistics of each hypothesis. However, in practice, these dependencies\noften cannot be known beforehand or tested after the fact. Our algorithm,\ne-LOND, provides FDR control under arbitrary, possibly unknown, dependence. We\nshow that our method is more powerful than existing approaches to this problem\nthrough simulations. We also formulate extensions of this algorithm to utilize\nrandomization for increased power, and for constructing confidence intervals in\nonline selective inference."}, "http://arxiv.org/abs/2311.06415": {"title": "Long-Term Dagum-PVF Frailty Regression Model: Application in Health Studies", "link": "http://arxiv.org/abs/2311.06415", "description": "Survival models incorporating cure fractions, commonly known as cure fraction\nmodels or long-term survival models, are widely employed in epidemiological\nstudies to account for both immune and susceptible patients in relation to the\nfailure event of interest under investigation. In such studies, there is also a\nneed to estimate the unobservable heterogeneity caused by prognostic factors\nthat cannot be observed. Moreover, the hazard function may exhibit a\nnon-monotonic form, specifically, an unimodal hazard function. In this article,\nwe propose a long-term survival model based on the defective version of the\nDagum distribution, with a power variance function (PVF) frailty term\nintroduced in the hazard function to control for unobservable heterogeneity in\npatient populations, which is useful for accommodating survival data in the\npresence of a cure fraction and with a non-monotone hazard function. The\ndistribution is conveniently reparameterized in terms of the cure fraction, and\nthen associated with the covariates via a logit link function, enabling direct\ninterpretation of the covariate effects on the cure fraction, which is not\nusual in the defective approach. It is also proven a result that generates\ndefective models induced by PVF frailty distribution. We discuss maximum\nlikelihood estimation for model parameters and evaluate its performance through\nMonte Carlo simulation studies. 
Finally, the practicality and benefits of our\nmodel are demonstrated through two health-related datasets, focusing on severe\ncases of COVID-19 in pregnant and postpartum women and on patients with\nmalignant skin neoplasms."}, "http://arxiv.org/abs/2311.06458": {"title": "Conditional Adjustment in a Markov Equivalence Class", "link": "http://arxiv.org/abs/2311.06458", "description": "We consider the problem of identifying a conditional causal effect through\ncovariate adjustment. We focus on the setting where the causal graph is known\nup to one of two types of graphs: a maximally oriented partially directed\nacyclic graph (MPDAG) or a partial ancestral graph (PAG). Both MPDAGs and PAGs\nrepresent equivalence classes of possible underlying causal models. After\ndefining adjustment sets in this setting, we provide a necessary and sufficient\ngraphical criterion -- the conditional adjustment criterion -- for finding\nthese sets under conditioning on variables unaffected by treatment. We further\nprovide explicit sets from the graph that satisfy the conditional adjustment\ncriterion, and therefore, can be used as adjustment sets for conditional causal\neffect identification."}, "http://arxiv.org/abs/2311.06537": {"title": "Is Machine Learning Unsafe and Irresponsible in Social Sciences? Paradoxes and Reconsidering from Recidivism Prediction Tasks", "link": "http://arxiv.org/abs/2311.06537", "description": "The paper addresses some fundamental and hotly debated issues for high-stakes\nevent predictions underpinning the computational approach to social sciences.\nWe question several prevalent views against machine learning and outline a new\nparadigm that highlights the promises and promotes the infusion of\ncomputational methods and conventional social science approaches."}, "http://arxiv.org/abs/2311.06590": {"title": "Optimal resource allocation: Convex quantile regression approach", "link": "http://arxiv.org/abs/2311.06590", "description": "Optimal allocation of resources across sub-units in the context of\ncentralized decision-making systems such as bank branches or supermarket chains\nis a classical application of operations research and management science. In\nthis paper, we develop quantile allocation models to examine how much the\noutput and productivity could potentially increase if the resources were\nefficiently allocated between units. We increase robustness to random noise and\nheteroscedasticity by utilizing the local estimation of multiple production\nfunctions using convex quantile regression. The quantile allocation models then\nrely on the estimated shadow prices instead of detailed data of units and allow\nthe entry and exit of units. Our empirical results on Finland's business sector\nreveal a large potential for productivity gains through better allocation,\nkeeping the current technology and resources fixed."}, "http://arxiv.org/abs/2311.06681": {"title": "SpICE: An interpretable method for spatial data", "link": "http://arxiv.org/abs/2311.06681", "description": "Statistical learning methods are widely utilized in tackling complex problems\ndue to their flexibility, good predictive performance, and ability to\ncapture complex relationships among variables. Additionally, recently developed\nautomatic workflows have provided a standardized approach to implementing\nstatistical learning methods across various applications. However, these tools\nalso highlight one of the main drawbacks of statistical learning: the lack of\ninterpretability of its results. 
In the past few years, a substantial amount of research has focused on\nmethods for interpreting black-box models. Interpretable statistical learning\nmethods are important for gaining a deeper understanding of the model. In\nproblems where spatial information is relevant, combining interpretable\nmethods with spatial data can lead to a better understanding of the problem\nand a clearer interpretation of the results.\n\nThis paper focuses on the individual conditional expectation plot (ICE-plot), a\nmodel-agnostic method for interpreting statistical learning models, and\ncombines it with spatial information. An ICE-plot extension is proposed in which\nspatial information is used as a restriction to define Spatial ICE curves\n(SpICE). Spatial ICE curves are estimated using real data in the context of an\neconomic problem concerning property valuation in Montevideo, Uruguay.\nUnderstanding the key factors that influence property valuation is essential\nfor decision-making, and spatial data plays a relevant role in this regard."}, "http://arxiv.org/abs/2311.06719": {"title": "Efficient Multiple-Robust Estimation for Nonresponse Data Under Informative Sampling", "link": "http://arxiv.org/abs/2311.06719", "description": "Nonresponse after probability sampling is a universal challenge in survey\nsampling, often necessitating adjustments to mitigate sampling and selection\nbias simultaneously. This study explores the removal of bias and effective\nutilization of available information, not just in nonresponse but also in the\nscenario of data integration, where summary statistics from other data sources\nare accessible. We reformulate these settings within a two-step monotone\nmissing data framework, where the first step of missingness arises from\nsampling and the second originates from nonresponse. Subsequently, we derive\nthe semiparametric efficiency bound for the target parameter. We also propose\nadaptive estimators utilizing methods of moments and empirical likelihood\napproaches to attain the lower bound. The proposed estimator exhibits both\nefficiency and double robustness. However, attaining efficiency with an\nadaptive estimator requires the correct specification of certain working\nmodels. To reinforce robustness against the misspecification of working models,\nwe extend the property of double robustness to multiple robustness by proposing\na two-step empirical likelihood method that effectively leverages empirical\nweights. A numerical study is undertaken to investigate the finite-sample\nperformance of the proposed methods. We further applied our methods to a\ndataset from the National Health and Nutrition Examination Survey data by\nefficiently incorporating summary statistics from the National Health Interview\nSurvey data."}, "http://arxiv.org/abs/2311.06840": {"title": "Distribution Re-weighting and Voting Paradoxes", "link": "http://arxiv.org/abs/2311.06840", "description": "We explore a specific type of distribution shift called domain expertise, in\nwhich training is limited to a subset of all possible labels. This setting is\ncommon among specialized human experts, or specific focused studies. We show\nhow the standard approach to distribution shift, which involves re-weighting\ndata, can result in paradoxical disagreements among differing domain expertise.\nWe also demonstrate how standard adjustments for causal inference lead to the\nsame paradox. 
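The SpICE abstract above builds on individual conditional expectation (ICE) curves. For reference, a minimal sketch of ordinary, non-spatial ICE curves for a generic fitted model is given below; `predict` stands for any prediction function (an assumption), and the spatial restriction that defines SpICE curves is not reproduced.

```python
import numpy as np

def ice_curves(predict, X, feature_idx, grid):
    # One ICE curve per observation: vary one feature over `grid`, hold the other
    # features at their observed values, and record the model's prediction.
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        curves[:, j] = predict(X_mod)
    return curves

# Toy usage with a hypothetical fitted model.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
predict = lambda X: X[:, 0] ** 2 + 0.5 * X[:, 1]  # stand-in for a fitted model
grid = np.linspace(-2, 2, 25)
curves = ice_curves(predict, X, feature_idx=0, grid=grid)
print(curves.shape)  # (100, 25): one curve per observation
```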
We prove that the characteristics of these paradoxes exactly\nmimic another set of paradoxes which arise among sets of voter preferences."}, "http://arxiv.org/abs/2311.06928": {"title": "Attention for Causal Relationship Discovery from Biological Neural Dynamics", "link": "http://arxiv.org/abs/2311.06928", "description": "This paper explores the potential of transformer models for learning\nGranger causality in networks with complex nonlinear dynamics at every node, as\nin neurobiological and biophysical networks. Our study primarily focuses on a\nproof-of-concept investigation based on simulated neural dynamics, for which\nthe ground-truth causality is known through the underlying connectivity matrix.\nFor transformer models trained to forecast neuronal population dynamics, we\nshow that the cross attention module effectively captures the causal\nrelationship among neurons, with an accuracy equal to or superior to that of the\nmost popular Granger causality analysis method. While we acknowledge that\nreal-world neurobiology data will bring further challenges, including dynamic\nconnectivity and unobserved variability, this research offers an encouraging\npreliminary glimpse into the utility of the transformer model for causal\nrepresentation learning in neuroscience."}, "http://arxiv.org/abs/2311.06945": {"title": "An Efficient Approach for Identifying Important Biomarkers for Biomedical Diagnosis", "link": "http://arxiv.org/abs/2311.06945", "description": "In this paper, we explore the challenges associated with biomarker\nidentification for diagnosis purposes in biomedical experiments, and propose a\nnovel approach to handle this challenging scenario via a generalization\nof the Dantzig selector. To improve the efficiency of the regularization\nmethod, we introduce a transformation of an inherently nonlinear program,\narising from its nonlinear link function, into a linear programming framework. We\nillustrate the use of our method on an experiment with a binary response,\nshowing superior performance in biomarker identification compared\nto conventional analyses. Our proposed method does not merely serve as a\nvariable/biomarker selection tool; its ranking of variable importance provides\nvaluable reference information for practitioners to reach informed decisions\nregarding the prioritization of factors for further investigations."}, "http://arxiv.org/abs/2311.07034": {"title": "Regularized Halfspace Depth for Functional Data", "link": "http://arxiv.org/abs/2311.07034", "description": "Data depth is a powerful nonparametric tool originally proposed to rank\nmultivariate data from center outward. In this context, one of the most\narchetypical depth notions is Tukey's halfspace depth. In the last few decades\nnotions of depth have also been proposed for functional data. However, Tukey's\ndepth cannot be extended to handle functional data because of its degeneracy.\nHere, we propose a new halfspace depth for functional data which avoids\ndegeneracy by regularization. The halfspace projection directions are\nconstrained to have a small reproducing kernel Hilbert space norm. Desirable\ntheoretical properties of the proposed depth, such as isometry invariance,\nmaximality at center, monotonicity relative to a deepest point, upper\nsemi-continuity, and consistency are established. Moreover, the regularized\nhalfspace depth can rank functional data with varying emphasis in shape or\nmagnitude, depending on the regularization. 
A new outlier detection approach is\nalso proposed, which is capable of detecting both shape and magnitude outliers.\nIt is applicable to trajectories in L2, a very general space of functions that\ninclude non-smooth trajectories. Based on extensive numerical studies, our\nmethods are shown to perform well in terms of detecting outliers of different\ntypes. Three real data examples showcase the proposed depth notion."}, "http://arxiv.org/abs/2311.07156": {"title": "Deep mixture of linear mixed models for complex longitudinal data", "link": "http://arxiv.org/abs/2311.07156", "description": "Mixtures of linear mixed models are widely used for modelling longitudinal\ndata for which observation times differ between subjects. In typical\napplications, temporal trends are described using a basis expansion, with basis\ncoefficients treated as random effects varying by subject. Additional random\neffects can describe variation between mixture components, or other known\nsources of variation in complex experimental designs. A key advantage of these\nmodels is that they provide a natural mechanism for clustering, which can be\nhelpful for interpretation in many applications. Current versions of mixtures\nof linear mixed models are not specifically designed for the case where there\nare many observations per subject and a complex temporal trend, which requires\na large number of basis functions to capture. In this case, the\nsubject-specific basis coefficients are a high-dimensional random effects\nvector, for which the covariance matrix is hard to specify and estimate,\nespecially if it varies between mixture components. To address this issue, we\nconsider the use of recently-developed deep mixture of factor analyzers models\nas the prior for the random effects. The resulting deep mixture of linear mixed\nmodels is well-suited to high-dimensional settings, and we describe an\nefficient variational inference approach to posterior computation. The efficacy\nof the method is demonstrated on both real and simulated data."}, "http://arxiv.org/abs/2311.07371": {"title": "Scalable Estimation for Structured Additive Distributional Regression Through Variational Inference", "link": "http://arxiv.org/abs/2311.07371", "description": "Structured additive distributional regression models offer a versatile\nframework for estimating complete conditional distributions by relating all\nparameters of a parametric distribution to covariates. Although these models\nefficiently leverage information in vast and intricate data sets, they often\nresult in highly-parameterized models with many unknowns. Standard estimation\nmethods, like Bayesian approaches based on Markov chain Monte Carlo methods,\nface challenges in estimating these models due to their complexity and\ncostliness. To overcome these issues, we suggest a fast and scalable\nalternative based on variational inference. Our approach combines a\nparsimonious parametric approximation for the posteriors of regression\ncoefficients, with the exact conditional posterior for hyperparameters. For\noptimization, we use a stochastic gradient ascent method combined with an\nefficient strategy to reduce the variance of estimators. We provide theoretical\nproperties and investigate global and local annealing to enhance robustness,\nparticularly against data outliers. Our implementation is very general,\nallowing us to include various functional effects like penalized splines or\ncomplex tensor product interactions. 
In a simulation study, we demonstrate the\nefficacy of our approach in terms of accuracy and computation time. Lastly, we\npresent two real examples illustrating the modeling of infectious COVID-19\noutbreaks and outlier detection in brain activity."}, "http://arxiv.org/abs/2311.07474": {"title": "A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals", "link": "http://arxiv.org/abs/2311.07474", "description": "Most prognostic methods require a decent amount of data for model training.\nIn reality, however, the amount of historical data owned by a single\norganization might be small or not large enough to train a reliable prognostic\nmodel. To address this challenge, this article proposes a federated prognostic\nmodel that allows multiple users to jointly construct a failure time prediction\nmodel using their multi-stream, high-dimensional, and incomplete data while\nkeeping each user's data local and confidential. The prognostic model first\nemploys multivariate functional principal component analysis to fuse the\nmulti-stream degradation signals. Then, the fused features coupled with the\ntimes-to-failure are utilized to build a (log)-location-scale regression model\nfor failure prediction. To estimate parameters using distributed datasets and\nkeep the data privacy of all participants, we propose a new federated algorithm\nfor feature extraction. Numerical studies indicate that the performance of the\nproposed model is the same as that of classic non-federated prognostic models\nand is better than that of the models constructed by each user itself."}, "http://arxiv.org/abs/2311.07511": {"title": "Machine learning for uncertainty estimation in fusing precipitation observations from satellites and ground-based gauges", "link": "http://arxiv.org/abs/2311.07511", "description": "To form precipitation datasets that are accurate and, at the same time, have\nhigh spatial densities, data from satellites and gauges are often merged in the\nliterature. However, uncertainty estimates for the data acquired in this manner\nare scarcely provided, although the importance of uncertainty quantification in\npredictive modelling is widely recognized. Furthermore, the benefits that\nmachine learning can bring to the task of providing such estimates have not\nbeen broadly realized and properly explored through benchmark experiments. The\npresent study aims at filling in this specific gap by conducting the first\nbenchmark tests on the topic. On a large dataset that comprises 15-year-long\nmonthly data spanning across the contiguous United States, we extensively\ncompared six learners that are, by their construction, appropriate for\npredictive uncertainty quantification. These are the quantile regression (QR),\nquantile regression forests (QRF), generalized random forests (GRF), gradient\nboosting machines (GBM), light gradient boosting machines (LightGBM) and\nquantile regression neural networks (QRNN). The comparison referred to the\ncompetence of the learners in issuing predictive quantiles at nine levels that\nfacilitate a good approximation of the entire predictive probability\ndistribution, and was primarily based on the quantile and continuous ranked\nprobability skill scores. Three types of predictor variables (i.e., satellite\nprecipitation variables, distances between a point of interest and satellite\ngrid points, and elevation at a point of interest) were used in the comparison\nand were additionally compared with each other. 
This additional comparison was\nbased on the explainable machine learning concept of feature importance. The\nresults suggest that the order from the best to the worst of the learners for\nthe task investigated is the following: LightGBM, QRF, GRF, GBM, QRNN and QR..."}, "http://arxiv.org/abs/2311.07524": {"title": "The Link Between Health Insurance Coverage and Citizenship Among Immigrants: Bayesian Unit-Level Regression Modeling of Categorical Survey Data Observed with Measurement Error", "link": "http://arxiv.org/abs/2311.07524", "description": "Social scientists are interested in studying the impact that citizenship\nstatus has on health insurance coverage among immigrants in the United States.\nThis can be done using data from the Survey of Income and Program Participation\n(SIPP); however, two primary challenges emerge. First, statistical models must\naccount for the survey design in some fashion to reduce the risk of bias due to\ninformative sampling. Second, it has been observed that survey respondents\nmisreport citizenship status at nontrivial rates. This too can induce bias\nwithin a statistical model. Thus, we propose the use of a weighted\npseudo-likelihood mixture of categorical distributions, where the mixture\ncomponent is determined by the latent true response variable, in order to model\nthe misreported data. We illustrate through an empirical simulation study that\nthis approach can mitigate the two sources of bias attributable to the sample\ndesign and misreporting. Importantly, our misreporting model can be further\nused as a component in a deeper hierarchical model. With this in mind, we\nconduct an analysis of the relationship between health insurance coverage and\ncitizenship status using data from the SIPP."}, "http://arxiv.org/abs/2110.10195": {"title": "Operator-induced structural variable selection for identifying materials genes", "link": "http://arxiv.org/abs/2110.10195", "description": "In the emerging field of materials informatics, a fundamental task is to\nidentify physicochemically meaningful descriptors, or materials genes, which\nare engineered from primary features and a set of elementary algebraic\noperators through compositions. Standard practice directly analyzes the\nhigh-dimensional candidate predictor space in a linear model; statistical\nanalyses are then substantially hampered by the daunting challenge posed by the\nastronomically large number of correlated predictors with limited sample size.\nWe formulate this problem as variable selection with operator-induced structure\n(OIS) and propose a new method to achieve unconventional dimension reduction by\nutilizing the geometry embedded in OIS. Although the model remains linear, we\niterate nonparametric variable selection for effective dimension reduction.\nThis enables variable selection based on ab initio primary features, leading to\na method that is orders of magnitude faster than existing methods, with\nimproved accuracy. To select the nonparametric module, we discuss a desired\nperformance criterion that is uniquely induced by variable selection with OIS;\nin particular, we propose to employ a Bayesian Additive Regression Trees\n(BART)-based variable selection method. Numerical studies show superiority of\nthe proposed method, which continues to exhibit robust performance when the\ninput dimension is out of reach of existing methods. 
Our analysis of\nsingle-atom catalysis identifies physical descriptors that explain the binding\nenergy of metal-support pairs with high explanatory power, leading to\ninterpretable insights to guide the prevention of a notorious problem called\nsintering and aid catalysis design."}, "http://arxiv.org/abs/2204.12699": {"title": "Randomness of Shapes and Statistical Inference on Shapes via the Smooth Euler Characteristic Transform", "link": "http://arxiv.org/abs/2204.12699", "description": "In this article, we establish the mathematical foundations for modeling the\nrandomness of shapes and conducting statistical inference on shapes using the\nsmooth Euler characteristic transform. Based on these foundations, we propose\ntwo parametric algorithms for testing hypotheses on random shapes. Simulation\nstudies are presented to validate our mathematical derivations and to compare\nour algorithms with state-of-the-art methods to demonstrate the utility of our\nproposed framework. As real applications, we analyze a data set of mandibular\nmolars from four genera of primates and show that our algorithms have the power\nto detect significant shape differences that recapitulate known morphological\nvariation across suborders. Altogether, our discussions bridge the following\nfields: algebraic and computational topology, probability theory and stochastic\nprocesses, Sobolev spaces and functional analysis, statistical inference, and\ngeometric morphometrics."}, "http://arxiv.org/abs/2207.05019": {"title": "Covariate-adaptive randomization inference in matched designs", "link": "http://arxiv.org/abs/2207.05019", "description": "It is common to conduct causal inference in matched observational studies by\nproceeding as though treatment assignments within matched sets are assigned\nuniformly at random and using this distribution as the basis for inference.\nThis approach ignores observed discrepancies in matched sets that may be\nconsequential for the distribution of treatment, which are succinctly captured\nby within-set differences in the propensity score. We address this problem via\ncovariate-adaptive randomization inference, which modifies the permutation\nprobabilities to vary with estimated propensity score discrepancies and avoids\nrequirements to exclude matched pairs or model an outcome variable. We show\nthat the test achieves type I error control arbitrarily close to the nominal\nlevel when large samples are available for propensity score estimation. We\ncharacterize the large-sample behavior of the new randomization test for a\ndifference-in-means estimator of a constant additive effect. We also show that\nexisting methods of sensitivity analysis generalize effectively to\ncovariate-adaptive randomization inference. Finally, we evaluate the empirical\nvalue of covariate-adaptive randomization procedures via comparisons to\ntraditional uniform inference in matched designs with and without propensity\nscore calipers and regression adjustment using simulations and analyses of\ngenetic damage among welders and right-heart catheterization in surgical\npatients."}, "http://arxiv.org/abs/2209.07091": {"title": "A new Kernel Regression approach for Robustified $L_2$ Boosting", "link": "http://arxiv.org/abs/2209.07091", "description": "We investigate $L_2$ boosting in the context of kernel regression. 
Kernel\nsmoothers, in general, lack appealing traits like symmetry and positive\ndefiniteness, which are critical not only for understanding theoretical aspects\nbut also for achieving good practical performance. We consider a\nprojection-based smoother (Huang and Chen, 2008) that is symmetric, positive\ndefinite, and shrinking. Theoretical results based on the orthonormal\ndecomposition of the smoother reveal additional insights into the boosting\nalgorithm. In our asymptotic framework, we may replace the full-rank smoother\nwith a low-rank approximation. We demonstrate that the smoother's low-rank\n($d(n)$) is bounded above by $O(h^{-1})$, where $h$ is the bandwidth. Our\nnumerical findings show that, in terms of prediction accuracy, low-rank\nsmoothers may outperform full-rank smoothers. Furthermore, we show that the\nboosting estimator with low-rank smoother achieves the optimal convergence\nrate. Finally, to improve the performance of the boosting algorithm in the\npresence of outliers, we propose a novel robustified boosting algorithm which\ncan be used with any smoother discussed in the study. We investigate the\nnumerical performance of the proposed approaches using simulations and a\nreal-world case."}, "http://arxiv.org/abs/2210.01757": {"title": "Transportability of model-based estimands in evidence synthesis", "link": "http://arxiv.org/abs/2210.01757", "description": "In evidence synthesis, effect modifiers are typically described as variables\nthat induce treatment effect heterogeneity at the individual level, through\ntreatment-covariate interactions in an outcome model parametrized at such\nlevel. As such, effect modification is defined with respect to a conditional\nmeasure, but marginal effect estimates are required for population-level\ndecisions in health technology assessment. For non-collapsible measures, purely\nprognostic variables that are not determinants of treatment response at the\nindividual level may modify marginal effects, even where there is\nindividual-level treatment effect homogeneity. With heterogeneity, marginal\neffects for measures that are not directly collapsible cannot be expressed in\nterms of marginal covariate moments, and generally depend on the joint\ndistribution of conditional effect measure modifiers and purely prognostic\nvariables. There are implications for recommended practices in evidence\nsynthesis. Unadjusted anchored indirect comparisons can be biased in the\nabsence of individual-level treatment effect heterogeneity, or when marginal\ncovariate moments are balanced across studies. Covariate adjustment may be\nnecessary to account for cross-study imbalances in joint covariate\ndistributions involving purely prognostic variables. In the absence of\nindividual patient data for the target, covariate adjustment approaches are\ninherently limited in their ability to remove bias for measures that are not\ndirectly collapsible. 
Directly collapsible measures would facilitate the\ntransportability of marginal effects between studies by: (1) reducing\ndependence on model-based covariate adjustment where there is individual-level\ntreatment effect homogeneity and marginal covariate moments are balanced; and\n(2) facilitating the selection of baseline covariates for adjustment where\nthere is individual-level treatment effect heterogeneity."}, "http://arxiv.org/abs/2212.01900": {"title": "Bayesian survival analysis with INLA", "link": "http://arxiv.org/abs/2212.01900", "description": "This tutorial shows how various Bayesian survival models can be fitted using\nthe integrated nested Laplace approximation in a clear, legible, and\ncomprehensible manner using the INLA and INLAjoint R-packages. Such models\ninclude accelerated failure time, proportional hazards, mixture cure, competing\nrisks, multi-state, frailty, and joint models of longitudinal and survival\ndata, originally presented in the article \"Bayesian survival analysis with\nBUGS\" (Alvares et al., 2021). In addition, we illustrate the implementation of\na new joint model for a longitudinal semicontinuous marker, recurrent events,\nand a terminal event. Our proposal aims to provide the reader with syntax\nexamples for implementing survival models using a fast and accurate approximate\nBayesian inferential approach."}, "http://arxiv.org/abs/2301.03661": {"title": "Generative Quantile Regression with Variability Penalty", "link": "http://arxiv.org/abs/2301.03661", "description": "Quantile regression and conditional density estimation can reveal structure\nthat is missed by mean regression, such as multimodality and skewness. In this\npaper, we introduce a deep learning generative model for joint quantile\nestimation called Penalized Generative Quantile Regression (PGQR). Our approach\nsimultaneously generates samples from many random quantile levels, allowing us\nto infer the conditional distribution of a response variable given a set of\ncovariates. Our method employs a novel variability penalty to avoid the problem\nof vanishing variability, or memorization, in deep generative models. Further,\nwe introduce a new family of partial monotonic neural networks (PMNN) to\ncircumvent the problem of crossing quantile curves. A major benefit of PGQR is\nthat it can be fit using a single optimization, thus bypassing the need to\nrepeatedly train the model at multiple quantile levels or use computationally\nexpensive cross-validation to tune the penalty parameter. We illustrate the\nefficacy of PGQR through extensive simulation studies and analysis of real\ndatasets. Code to implement our method is available at\nhttps://github.com/shijiew97/PGQR."}, "http://arxiv.org/abs/2302.09526": {"title": "Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators", "link": "http://arxiv.org/abs/2302.09526", "description": "We present a methodology for using unlabeled data to design semi supervised\nlearning (SSL) methods that improve the prediction performance of supervised\nlearning for regression tasks. 
The main idea is to design different mechanisms\nfor integrating the unlabeled data, and include in each of them a mixing\nparameter $\\alpha$, controlling the weight given to the unlabeled data.\nFocusing on Generalized Linear Models (GLM) and linear interpolators classes of\nmodels, we analyze the characteristics of different mixing mechanisms, and\nprove that in all cases, it is invariably beneficial to integrate the unlabeled\ndata with some nonzero mixing ratio $\\alpha>0$, in terms of predictive\nperformance. Moreover, we provide a rigorous framework to estimate the best\nmixing ratio $\\alpha^*$ where mixed SSL delivers the best predictive\nperformance, while using the labeled and unlabeled data on hand.\n\nThe effectiveness of our methodology in delivering substantial improvement\ncompared to the standard supervised models, in a variety of settings, is\ndemonstrated empirically through extensive simulation, in a manner that\nsupports the theoretical analysis. We also demonstrate the applicability of our\nmethodology (with some intuitive modifications) to improve more complex models,\nsuch as deep neural networks, in real-world regression tasks."}, "http://arxiv.org/abs/2305.13421": {"title": "Sequential Estimation using Hierarchically Stratified Domains with Latin Hypercube Sampling", "link": "http://arxiv.org/abs/2305.13421", "description": "Quantifying the effect of uncertainties in systems where only point\nevaluations in the stochastic domain but no regularity conditions are available\nis limited to sampling-based techniques. This work presents an adaptive\nsequential stratification estimation method that uses Latin Hypercube Sampling\nwithin each stratum. The adaptation is achieved through a sequential\nhierarchical refinement of the stratification, guided by previous estimators\nusing local (i.e., stratum-dependent) variability indicators based on\ngeneralized polynomial chaos expansions and Sobol decompositions. For a given\ntotal number of samples $N$, the corresponding hierarchically constructed\nsequence of Stratified Sampling estimators combined with Latin Hypercube\nsampling is adequately averaged to provide a final estimator with reduced\nvariance. Numerical experiments illustrate the procedure's efficiency,\nindicating that it can offer a variance decay proportional to $N^{-2}$ in some\ncases."}, "http://arxiv.org/abs/2306.08794": {"title": "Quantile autoregressive conditional heteroscedasticity", "link": "http://arxiv.org/abs/2306.08794", "description": "This paper proposes a novel conditional heteroscedastic time series model by\napplying the framework of quantile regression processes to the ARCH(\\infty)\nform of the GARCH model. This model can provide varying structures for\nconditional quantiles of the time series across different quantile levels,\nwhile including the commonly used GARCH model as a special case. The strict\nstationarity of the model is discussed. For robustness against heavy-tailed\ndistributions, a self-weighted quantile regression (QR) estimator is proposed.\nWhile QR performs satisfactorily at intermediate quantile levels, its accuracy\ndeteriorates at high quantile levels due to data scarcity. As a remedy, a\nself-weighted composite quantile regression (CQR) estimator is further\nintroduced and, based on an approximate GARCH model with a flexible\nTukey-lambda distribution for the innovations, we can extrapolate the high\nquantile levels by borrowing information from intermediate ones. 
Asymptotic\nproperties for the proposed estimators are established. Simulation experiments\nare carried out to assess the finite sample performance of the proposed\nmethods, and an empirical example is presented to illustrate the usefulness of\nthe new model."}, "http://arxiv.org/abs/2309.00948": {"title": "Marginalised Normal Regression: Unbiased curve fitting in the presence of x-errors", "link": "http://arxiv.org/abs/2309.00948", "description": "The history of the seemingly simple problem of straight line fitting in the\npresence of both $x$ and $y$ errors has been fraught with misadventure, with\nstatistically ad hoc and poorly tested methods abounding in the literature. The\nproblem stems from the emergence of latent variables describing the \"true\"\nvalues of the independent variables, the priors on which have a significant\nimpact on the regression result. By analytic calculation of maximum a\nposteriori values and biases, and comprehensive numerical mock tests, we assess\nthe quality of possible priors. In the presence of intrinsic scatter, the only\nprior that we find to give reliably unbiased results in general is a mixture of\none or more Gaussians with means and variances determined as part of the\ninference. We find that a single Gaussian is typically sufficient and dub this\nmodel Marginalised Normal Regression (MNR). We illustrate the necessity for MNR\nby comparing it to alternative methods on an important linear relation in\ncosmology, and extend it to nonlinear regression and an arbitrary covariance\nmatrix linking $x$ and $y$. We publicly release a Python/Jax implementation of\nMNR and its Gaussian mixture model extension that is coupled to Hamiltonian\nMonte Carlo for efficient sampling, which we call ROXY (Regression and\nOptimisation with X and Y errors)."}, "http://arxiv.org/abs/2311.07733": {"title": "Credible Intervals for Probability of Failure with Gaussian Processes", "link": "http://arxiv.org/abs/2311.07733", "description": "Efficiently approximating the probability of system failure has gained\nincreasing importance as expensive simulations begin to play a larger role in\nreliability quantification tasks in areas such as structural design, power grid\ndesign, and safety certification among others. This work derives credible\nintervals on the probability of failure for a simulation which we assume is a\nrealization of a Gaussian process. We connect these intervals to binary\nclassification error and comment on their applicability to a broad class of\niterative schemes proposed throughout the literature. A novel iterative\nsampling scheme is proposed which can suggest multiple samples per batch for\nsimulations with parallel implementations. 
We empirically test our scalable,\nopen-source implementation on a variety of simulations, including a tsunami model\nwhere failure is quantified in terms of maximum wave height."}, "http://arxiv.org/abs/2311.07736": {"title": "Use of Equivalent Relative Utility (ERU) to Evaluate Artificial Intelligence-Enabled Rule-Out Devices", "link": "http://arxiv.org/abs/2311.07736", "description": "We investigated the use of equivalent relative utility (ERU) to evaluate the\neffectiveness of artificial intelligence (AI)-enabled rule-out devices that use\nAI to identify and autonomously remove non-cancer patient images from\nradiologist review in screening mammography. We reviewed two performance metrics\nthat can be used to compare the diagnostic performance between the\nradiologist-with-rule-out-device and radiologist-without-device workflows:\npositive/negative predictive values (PPV/NPV) and equivalent relative utility\n(ERU). To demonstrate the use of the two evaluation metrics, we applied both\nmethods to a recent US-based study that reported an improved performance of the\nradiologist-with-device workflow compared to the one without the device by\nretrospectively applying their AI algorithm to a large mammography dataset. We\nfurther applied the ERU method to a European study utilizing their reported\nrecall rates and cancer detection rates at different thresholds of their AI\nalgorithm to compare the potential utility among different thresholds. For the\nstudy using US data, neither the PPV/NPV nor the ERU method can conclude a\nsignificant improvement in diagnostic performance for any of the algorithm\nthresholds reported. For the study using European data, ERU values at lower AI\nthresholds are found to be higher than those at a higher threshold because more\nfalse-negative cases would be ruled out at a higher threshold, reducing the\noverall diagnostic performance. Both PPV/NPV and ERU methods can be used to\ncompare the diagnostic performance between the radiologist-with-device workflow\nand that without. One limitation of the ERU method is the need to measure the\nbaseline, standard-of-care relative utility (RU) value for mammography\nscreening in the US. Once the baseline value is known, the ERU method can be\napplied to large US datasets without knowing the true prevalence of the\ndataset."}, "http://arxiv.org/abs/2311.07752": {"title": "Doubly Robust Estimation under Possibly Misspecified Marginal Structural Cox Model", "link": "http://arxiv.org/abs/2311.07752", "description": "In this paper we address the challenges posed by non-proportional hazards and\ninformative censoring, offering a path toward more meaningful causal inference\nconclusions. We start from the marginal structural Cox model, which has been\nwidely used for analyzing observational studies with survival outcomes, and\ntypically relies on the inverse probability weighting method. The latter hinges\nupon a propensity score model for the treatment assignment, and a censoring\nmodel which incorporates both the treatment and the covariates. In such\nsettings, model misspecification can occur quite effortlessly, and the Cox\nregression model's non-collapsibility has historically posed challenges when\nstriving to guard against model misspecification through augmentation. We\nintroduce an augmented inverse probability weighted estimator which, enriched\nwith doubly robust properties, paves the way for integrating machine learning\nand a plethora of nonparametric methods, effectively overcoming the challenges\nof non-collapsibility. 
The estimator extends naturally to estimating a\ntime-average treatment effect when the proportional hazards assumption fails.\nWe closely examine its theoretical and practical performance, showing that it\nsatisfies both the assumption-lean and the well-specification criteria\ndiscussed in the recent literature. Finally, its application to a dataset\nreveals insights into the impact of mid-life alcohol consumption on mortality\nin later life."}, "http://arxiv.org/abs/2311.07762": {"title": "Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data", "link": "http://arxiv.org/abs/2311.07762", "description": "A mixture of multivariate Poisson-log normal factor analyzers is introduced\nby imposing constraints on the covariance matrix, which resulted in flexible\nmodels for clustering purposes. In particular, a class of eight parsimonious\nmixture models based on the mixtures of factor analyzers model are introduced.\nVariational Gaussian approximation is used for parameter estimation, and\ninformation criteria are used for model selection. The proposed models are\nexplored in the context of clustering discrete data arising from RNA sequencing\nstudies. Using real and simulated data, the models are shown to give favourable\nclustering performance. The GitHub R package for this work is available at\nhttps://github.com/anjalisilva/mixMPLNFA and is released under the open-source\nMIT license."}, "http://arxiv.org/abs/2311.07793": {"title": "The brain uses renewal points to model random sequences of stimuli", "link": "http://arxiv.org/abs/2311.07793", "description": "It has been classically conjectured that the brain assigns probabilistic\nmodels to sequences of stimuli. An important issue associated with this\nconjecture is the identification of the classes of models used by the brain to\nperform this task. We address this issue by using a new clustering procedure\nfor sets of electroencephalographic (EEG) data recorded from participants\nexposed to a sequence of auditory stimuli generated by a stochastic chain. This\nclustering procedure indicates that the brain uses renewal points in the\nstochastic sequence of auditory stimuli in order to build a model."}, "http://arxiv.org/abs/2311.07906": {"title": "Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects", "link": "http://arxiv.org/abs/2311.07906", "description": "Testing judicial impartiality is a problem of fundamental importance in\nempirical legal studies, for which standard regression methods have been\npopularly used to estimate the extralegal factor effects. However, those\nmethods cannot handle control variables with ultrahigh dimensionality, such as\nfound in judgment documents recorded in text format. To solve this problem, we\ndevelop a novel mixture conditional regression (MCR) approach, assuming that\nthe whole sample can be classified into a number of latent classes. Within each\nlatent class, a standard linear regression model can be used to model the\nrelationship between the response and a key feature vector, which is assumed to\nbe of a fixed dimension. Meanwhile, ultrahigh dimensional control variables are\nthen used to determine the latent class membership, where a Na\\\"ive Bayes type\nmodel is used to describe the relationship. Hence, the dimension of control\nvariables is allowed to be arbitrarily high. A novel expectation-maximization\nalgorithm is developed for model estimation. 
Therefore, we are able to estimate\nthe key parameters of interest as efficiently as if the true class membership\nwere known in advance. Simulation studies are presented to demonstrate the\nproposed MCR method. A real dataset of Chinese burglary offenses is analyzed\nfor illustration purposes."}, "http://arxiv.org/abs/2311.07951": {"title": "A Fast and Simple Algorithm for computing the MLE of Amplitude Density Function Parameters", "link": "http://arxiv.org/abs/2311.07951", "description": "Over the last decades, the family of $\\alpha$-stable distributions has proven\nto be useful for modelling in telecommunication systems. Particularly, in the\ncase of radar applications, finding a fast and accurate estimation for the\namplitude density function parameters appears to be very important. In this\nwork, the maximum likelihood estimator (MLE) is proposed for parameters of the\namplitude distribution. To do this, the amplitude data are \\emph{projected} on\nthe horizontal and vertical axes using two simple transformations. It is proved\nthat the \\emph{projected} data follow a zero-location symmetric $\\alpha$-stable\ndistribution for which the MLE can be computed quite fast. The average of\ncomputed MLEs based on the two \\emph{projections} is considered as an estimator for\nthe parameters of the amplitude distribution. Performance of the proposed\n\\emph{projection} method is demonstrated through a simulation study and analysis\nof two sets of real radar data."}, "http://arxiv.org/abs/2311.07972": {"title": "Residual Importance Weighted Transfer Learning For High-dimensional Linear Regression", "link": "http://arxiv.org/abs/2311.07972", "description": "Transfer learning is an emerging paradigm for leveraging multiple sources to\nimprove the statistical inference on a single target. In this paper, we propose\na novel approach named residual importance weighted transfer learning (RIW-TL)\nfor high-dimensional linear models built on penalized likelihood. Compared to\nexisting methods such as Trans-Lasso that selects sources in an all-in-all-out\nmanner, RIW-TL includes samples via importance weighting and thus may permit\nmore effective sample use. To determine the weights, remarkably RIW-TL only\nrequires the knowledge of one-dimensional densities dependent on residuals,\nthus overcoming the curse of dimensionality of having to estimate\nhigh-dimensional densities in naive importance weighting. We show that the\noracle RIW-TL provides a faster rate than its competitors and develop a\ncross-fitting procedure to estimate this oracle. We discuss variants of RIW-TL\nby adopting different choices for residual weighting. The theoretical\nproperties of RIW-TL and its variants are established and compared with those\nof LASSO and Trans-Lasso. Extensive simulation and a real data analysis confirm\nits advantages."}, "http://arxiv.org/abs/2311.08004": {"title": "Nonlinear blind source separation exploiting spatial nonstationarity", "link": "http://arxiv.org/abs/2311.08004", "description": "In spatial blind source separation the observed multivariate random fields\nare assumed to be mixtures of latent spatially dependent random fields. The\nobjective is to recover latent random fields by estimating the unmixing\ntransformation. Currently, the algorithms for spatial blind source separation\ncan only estimate linear unmixing transformations. Nonlinear blind source\nseparation methods for spatial data are scarce. 
In this paper we extend an\nidentifiable variational autoencoder that can estimate nonlinear unmixing\ntransformations to spatially dependent data and demonstrate its performance for\nboth stationary and nonstationary spatial data using simulations. In addition,\nwe introduce scaled mean absolute Shapley additive explanations for\ninterpreting the latent components through nonlinear mixing transformation. The\nspatial identifiable variational autoencoder is applied to a geochemical\ndataset to find the latent random fields, which are then interpreted by using\nthe scaled mean absolute Shapley additive explanations."}, "http://arxiv.org/abs/2311.08050": {"title": "INLA+ -- Approximate Bayesian inference for non-sparse models using HPC", "link": "http://arxiv.org/abs/2311.08050", "description": "The integrated nested Laplace approximations (INLA) method has become a\nwidely utilized tool for researchers and practitioners seeking to perform\napproximate Bayesian inference across various fields of application. To address\nthe growing demand for incorporating more complex models and enhancing the\nmethod's capabilities, this paper introduces a novel framework that leverages\ndense matrices for performing approximate Bayesian inference based on INLA\nacross multiple computing nodes using HPC. When dealing with non-sparse\nprecision or covariance matrices, this new approach scales better compared to\nthe current INLA method, capitalizing on the computational power offered by\nmultiprocessors in shared and distributed memory architectures available in\ncontemporary computing resources and specialized dense matrix algebra. To\nvalidate the efficacy of this approach, we conduct a simulation study then\napply it to analyze cancer mortality data in Spain, employing a three-way\nspatio-temporal interaction model."}, "http://arxiv.org/abs/2311.08139": {"title": "Feedforward neural networks as statistical models: Improving interpretability through uncertainty quantification", "link": "http://arxiv.org/abs/2311.08139", "description": "Feedforward neural networks (FNNs) are typically viewed as pure prediction\nalgorithms, and their strong predictive performance has led to their use in\nmany machine-learning applications. However, their flexibility comes with an\ninterpretability trade-off; thus, FNNs have been historically less popular\namong statisticians. Nevertheless, classical statistical theory, such as\nsignificance testing and uncertainty quantification, is still relevant.\nSupplementing FNNs with methods of statistical inference, and covariate-effect\nvisualisations, can shift the focus away from black-box prediction and make\nFNNs more akin to traditional statistical models. This can allow for more\ninferential analysis, and, hence, make FNNs more accessible within the\nstatistical-modelling context."}, "http://arxiv.org/abs/2311.08168": {"title": "Time-Uniform Confidence Spheres for Means of Random Vectors", "link": "http://arxiv.org/abs/2311.08168", "description": "We derive and study time-uniform confidence spheres - termed confidence\nsphere sequences (CSSs) - which contain the mean of random vectors with high\nprobability simultaneously across all sample sizes. 
Inspired by the original\nwork of Catoni and Giulini, we unify and extend their analysis to cover both\nthe sequential setting and to handle a variety of distributional assumptions.\nMore concretely, our results include an empirical-Bernstein CSS for bounded\nrandom vectors (resulting in a novel empirical-Bernstein confidence interval),\na CSS for sub-$\\psi$ random vectors, and a CSS for heavy-tailed random vectors\nbased on a sequentially valid Catoni-Giulini estimator. Finally, we provide a\nversion of our empirical-Bernstein CSS that is robust to contamination by Huber\nnoise."}, "http://arxiv.org/abs/2311.08181": {"title": "Frame to frame interpolation for high-dimensional data visualisation using the woylier package", "link": "http://arxiv.org/abs/2311.08181", "description": "The woylier package implements tour interpolation paths between frames using\nGivens rotations. This provides an alternative to the geodesic interpolation\nbetween planes currently available in the tourr package. Tours are used to\nvisualise high-dimensional data and models, to detect clustering, anomalies and\nnon-linear relationships. Frame-to-frame interpolation can be useful for\nprojection pursuit guided tours when the index is not rotationally invariant.\nIt also provides a way to specifically reach a given target frame. We\ndemonstrate the method for exploring non-linear relationships between currency\ncross-rates."}, "http://arxiv.org/abs/2311.08254": {"title": "Identifiable and interpretable nonparametric factor analysis", "link": "http://arxiv.org/abs/2311.08254", "description": "Factor models have been widely used to summarize the variability of\nhigh-dimensional data through a set of factors with much lower dimensionality.\nGaussian linear factor models have been particularly popular due to their\ninterpretability and ease of computation. However, in practice, data often\nviolate the multivariate Gaussian assumption. To characterize higher-order\ndependence and nonlinearity, models that include factors as predictors in\nflexible multivariate regression are popular, with GP-LVMs using Gaussian\nprocess (GP) priors for the regression function and VAEs using deep neural\nnetworks. Unfortunately, such approaches lack identifiability and\ninterpretability and tend to produce brittle and non-reproducible results. To\naddress these problems by simplifying the nonparametric factor model while\nmaintaining flexibility, we propose the NIFTY framework, which parsimoniously\ntransforms uniform latent variables using one-dimensional nonlinear mappings\nand then applies a linear generative model. The induced multivariate\ndistribution falls into a flexible class while maintaining simple computation\nand interpretation. We prove that this model is identifiable and empirically\nstudy NIFTY using simulated data, observing good performance in density\nestimation and data visualization. We then apply NIFTY to bird song data in an\nenvironmental monitoring application."}, "http://arxiv.org/abs/2311.08315": {"title": "Total Empiricism: Learning from Data", "link": "http://arxiv.org/abs/2311.08315", "description": "Statistical analysis is an important tool to distinguish systematic from\nchance findings. Current statistical analyses rely on distributional\nassumptions reflecting the structure of some underlying model, which if not met\nlead to problems in the analysis and interpretation of the results. 
Instead of\ntrying to fix the model or \"correct\" the data, we here describe a totally\nempirical statistical approach that does not rely on ad hoc distributional\nassumptions in order to overcome many problems in contemporary statistics.\nStarting from elementary combinatorics, we motivate an information-guided\nformalism to quantify knowledge extracted from the given data. Subsequently, we\nderive model-agnostic methods to identify patterns that are solely evidenced by\nthe data based on our prior knowledge. The data-centric character of empiricism\nallows for its universal applicability, particularly as sample size grows\nlarger. In this comprehensive framework, we re-interpret and extend model\ndistributions, scores and statistical tests used in different schools of\nstatistics."}, "http://arxiv.org/abs/2311.08335": {"title": "Distinguishing immunological and behavioral effects of vaccination", "link": "http://arxiv.org/abs/2311.08335", "description": "The interpretation of vaccine efficacy estimands is subtle, even in\nrandomized trials designed to quantify immunological effects of vaccination. In\nthis article, we introduce terminology to distinguish between different vaccine\nefficacy estimands and clarify their interpretations. This allows us to\nexplicitly consider immunological and behavioural effects of vaccination, and\nestablish that policy-relevant estimands can differ substantially from those\ncommonly reported in vaccine trials. We further show that a conventional\nvaccine trial allows identification and estimation of different vaccine\nestimands under plausible conditions, if one additional post-treatment variable\nis measured. Specifically, we utilize a ``belief variable'' that indicates the\ntreatment an individual believed they had received. The belief variable is\nsimilar to ``blinding assessment'' variables that are occasionally collected in\nplacebo-controlled trials in other fields. We illustrate the relations between\nthe different estimands, and their practical relevance, in numerical examples\nbased on an influenza vaccine trial."}, "http://arxiv.org/abs/2311.08340": {"title": "Causal Message Passing: A Method for Experiments with Unknown and General Network Interference", "link": "http://arxiv.org/abs/2311.08340", "description": "Randomized experiments are a powerful methodology for data-driven evaluation\nof decisions or interventions. Yet, their validity may be undermined by network\ninterference. This occurs when the treatment of one unit impacts not only its\noutcome but also that of connected units, biasing traditional treatment effect\nestimations. Our study introduces a new framework to accommodate complex and\nunknown network interference, moving beyond specialized models in the existing\nliterature. Our framework, which we term causal message-passing, is grounded in\na high-dimensional approximate message passing methodology and is specifically\ntailored to experimental design settings with prevalent network interference.\nUtilizing causal message-passing, we present a practical algorithm for\nestimating the total treatment effect and demonstrate its efficacy in four\nnumerical scenarios, each with its unique interference structure."}, "http://arxiv.org/abs/2104.14987": {"title": "Emulating complex dynamical simulators with random Fourier features", "link": "http://arxiv.org/abs/2104.14987", "description": "A Gaussian process (GP)-based methodology is proposed to emulate complex\ndynamical computer models (or simulators). 
The method relies on emulating the\nnumerical flow map of the system over an initial (short) time step, where the\nflow map is a function that describes the evolution of the system from an\ninitial condition to a subsequent value at the next time step. This yields a\nprobabilistic distribution over the entire flow map function, with each draw\noffering an approximation to the flow map. The model output time series is\nthen predicted (under the Markov assumption) by drawing a sample from the\nemulated flow map (i.e., its posterior distribution) and using it to iterate\nfrom the initial condition ahead in time. Repeating this procedure with\nmultiple such draws creates a distribution over the time series. The mean and\nvariance of this distribution at a specific time point serve as the model\noutput prediction and the associated uncertainty, respectively. However,\ndrawing a GP posterior sample that represents the underlying function across\nits entire domain is computationally infeasible, given the infinite-dimensional\nnature of this object. To overcome this limitation, one can generate such a\nsample in an approximate manner using random Fourier features (RFF). RFF is an\nefficient technique for approximating the kernel and generating GP samples,\noffering both computational efficiency and theoretical guarantees. The proposed\nmethod is applied to emulate several dynamic nonlinear simulators including the\nwell-known Lorenz and van der Pol models. The results suggest that our approach\nhas a high predictive performance and the associated uncertainty can capture\nthe dynamics of the system accurately."}, "http://arxiv.org/abs/2111.12945": {"title": "Low-rank variational Bayes correction to the Laplace method", "link": "http://arxiv.org/abs/2111.12945", "description": "Approximate inference methods like the Laplace method, Laplace approximations\nand variational methods, amongst others, are popular methods when exact\ninference is not feasible due to the complexity of the model or the abundance\nof data. In this paper we propose a hybrid approximate method called Low-Rank\nVariational Bayes correction (VBC), that uses the Laplace method and\nsubsequently a Variational Bayes correction in a lower dimension, to the joint\nposterior mean. The cost is essentially that of the Laplace method which\nensures scalability of the method, in both model complexity and data size.\nModels with fixed and unknown hyperparameters are considered, for simulated and\nreal examples, for small and large datasets."}, "http://arxiv.org/abs/2202.13961": {"title": "Spatio-Causal Patterns of Sample Growth", "link": "http://arxiv.org/abs/2202.13961", "description": "Different statistical samples (e.g., from different locations) offer\npopulations and learning systems observations with distinct statistical\nproperties. Samples under (1) 'Unconfounded' growth preserve systems' ability\nto determine the independent effects of their individual variables on any\noutcome-of-interest (and lead, therefore, to fair and interpretable black-box\npredictions). Samples under (2) 'Externally-Valid' growth preserve their\nability to make predictions that generalize across out-of-sample variation. The\nfirst promotes predictions that generalize over populations, the second over\ntheir shared uncontrolled factors. We illustrate these theoretic patterns in\nthe full American census from 1840 to 1940, and samples ranging from the\nstreet-level all the way to the national. 
This reveals sample requirements for\ngeneralizability over space and time, and new connections among the Shapley\nvalue, counterfactual statistics, and hyperbolic geometry."}, "http://arxiv.org/abs/2211.04958": {"title": "Black-Box Model Confidence Sets Using Cross-Validation with High-Dimensional Gaussian Comparison", "link": "http://arxiv.org/abs/2211.04958", "description": "We derive high-dimensional Gaussian comparison results for the standard\n$V$-fold cross-validated risk estimates. Our results combine a recent\nstability-based argument for the low-dimensional central limit theorem of\ncross-validation with the high-dimensional Gaussian comparison framework for\nsums of independent random variables. These results give new insights into the\njoint sampling distribution of cross-validated risks in the context of model\ncomparison and tuning parameter selection, where the number of candidate models\nand tuning parameters can be larger than the fitting sample size. As a\nconsequence, our results provide theoretical support for a recent\nmethodological development that constructs model confidence sets using\ncross-validation."}, "http://arxiv.org/abs/2311.08427": {"title": "Towards a Transportable Causal Network Model Based on Observational Healthcare Data", "link": "http://arxiv.org/abs/2311.08427", "description": "Over the last decades, many prognostic models based on artificial\nintelligence techniques have been used to provide detailed predictions in\nhealthcare. Unfortunately, the real-world observational data used to train and\nvalidate these models are almost always affected by biases that can strongly\nimpact the outcomes validity: two examples are values missing not-at-random and\nselection bias. Addressing them is a key element in achieving transportability\nand in studying the causal relationships that are critical in clinical decision\nmaking, going beyond simpler statistical approaches based on probabilistic\nassociation.\n\nIn this context, we propose a novel approach that combines selection\ndiagrams, missingness graphs, causal discovery and prior knowledge into a\nsingle graphical model to estimate the cardiovascular risk of adolescent and\nyoung females who survived breast cancer. We learn this model from data\ncomprising two different cohorts of patients. The resulting causal network\nmodel is validated by expert clinicians in terms of risk assessment, accuracy\nand explainability, and provides a prognostic model that outperforms competing\nmachine learning methods."}, "http://arxiv.org/abs/2311.08484": {"title": "Covariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe)", "link": "http://arxiv.org/abs/2311.08484", "description": "We propose a new method for the simultaneous selection and estimation of\nmultivariate sparse additive models with correlated errors. Our method called\nCovariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe)\nsimultaneously selects among null, linear, and smooth non-linear effects for\neach predictor while incorporating joint estimation of the sparse residual\nstructure among responses, with the motivation that accounting for\ninter-response correlation structure can lead to improved accuracy in variable\nselection and estimation efficiency. 
CoMPAdRe is constructed in a\ncomputationally efficient way that allows the selection and estimation of\nlinear and non-linear covariates to be conducted in parallel across responses.\nCompared to single-response approaches that marginally select linear and\nnon-linear covariate effects, we demonstrate in simulation studies that the\njoint multivariate modeling leads to gains in both estimation efficiency and\nselection accuracy, of greater magnitude in settings where signal is moderate\nrelative to the level of noise. We apply our approach to protein-mRNA\nexpression levels from multiple breast cancer pathways obtained from The Cancer\nProteome Atlas and characterize both mRNA-protein associations and\nprotein-protein subnetworks for each pathway. We find non-linear mRNA-protein\nassociations for the Core Reactive, EMT, PIK-AKT, and RTK pathways."}, "http://arxiv.org/abs/2311.08527": {"title": "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments", "link": "http://arxiv.org/abs/2311.08527", "description": "We study inference on the long-term causal effect of a continual exposure to\na novel intervention, which we term a long-term treatment, based on an\nexperiment involving only short-term observations. Key examples include the\nlong-term health effects of regularly-taken medicine or of environmental\nhazards and the long-term effects on users of changes to an online platform.\nThis stands in contrast to short-term treatments or \"shocks,\" whose long-term\neffect can reasonably be mediated by short-term observations, enabling the use\nof surrogate methods. Long-term treatments by definition have direct effects on\nlong-term outcomes via continual exposure so surrogacy cannot reasonably hold.\n\nOur approach instead learns long-term temporal dynamics directly from\nshort-term experimental data, assuming that the initial dynamics observed\npersist but avoiding the need for both surrogacy assumptions and auxiliary data\nwith long-term observations. We connect the problem with offline reinforcement\nlearning, leveraging doubly-robust estimators to estimate long-term causal\neffects for long-term treatments and construct confidence intervals. Finally,\nwe demonstrate the method in simulated experiments."}, "http://arxiv.org/abs/2311.08561": {"title": "Measuring association with recursive rank binning", "link": "http://arxiv.org/abs/2311.08561", "description": "Pairwise measures of dependence are a common tool to map data in the early\nstages of analysis with several modern examples based on maximized partitions\nof the pairwise sample space. Following a short survey of modern measures of\ndependence, we introduce a new measure which recursively splits the ranks of a\npair of variables to partition the sample space and computes the $\\chi^2$\nstatistic on the resulting bins. Splitting logic is detailed for splits\nmaximizing a score function and randomly selected splits. Simulations indicate\nthat random splitting produces a statistic conservatively approximated by the\n$\\chi^2$ distribution without a loss of power to detect numerous different data\npatterns compared to maximized binning. Though it seems to add no power to\ndetect dependence, maximized recursive binning is shown to produce a natural\nvisualization of the data and the measure. 
Applying maximized recursive rank\nbinning to S&P 500 constituent data suggests the automatic detection of tail\ndependence."}, "http://arxiv.org/abs/2311.08604": {"title": "Incremental Cost-Effectiveness Statistical Inference: Calculations and Communications", "link": "http://arxiv.org/abs/2311.08604", "description": "We illustrate use of nonparametric statistical methods to compare alternative\ntreatments for a particular disease or condition on both their relative\neffectiveness and their relative cost. These Incremental Cost Effectiveness\n(ICE) methods are based upon Bootstrapping, i.e. Resampling with Replacement\nfrom observational or clinical-trial data on individual patients. We first show\nhow a reasonable numerical value for the \"Shadow Price of Health\" can be chosen\nusing functions within the ICEinfer R-package when effectiveness is not\nmeasured in \"QALY\"s. We also argue that simple histograms are ideal for\ncommunicating key findings to regulators, while our more detailed graphics may\nwell be more informative and compelling for other health-care stakeholders."}, "http://arxiv.org/abs/2311.08658": {"title": "Structured Estimation of Heterogeneous Time Series", "link": "http://arxiv.org/abs/2311.08658", "description": "How best to model structurally heterogeneous processes is a foundational\nquestion in the social, health and behavioral sciences. Recently, Fisher et\nal., (2022) introduced the multi-VAR approach for simultaneously estimating\nmultiple-subject multivariate time series characterized by common and\nindividualizing features using penalized estimation. This approach differs from\nmany popular modeling approaches for multiple-subject time series in that\nqualitative and quantitative differences in a large number of individual\ndynamics are well-accommodated. The current work extends the multi-VAR\nframework to include new adaptive weighting schemes that greatly improve\nestimation performance. In a small set of simulation studies we compare\nadaptive multi-VAR with these new penalty weights to common alternative\nestimators in terms of path recovery and bias. Furthermore, we provide toy\nexamples and code demonstrating the utility of multi-VAR under different\nheterogeneity regimes using the multivar package for R (Fisher, 2022)."}, "http://arxiv.org/abs/2311.08690": {"title": "Enabling CMF Estimation in Data-Constrained Scenarios: A Semantic-Encoding Knowledge Mining Model", "link": "http://arxiv.org/abs/2311.08690", "description": "Precise estimation of Crash Modification Factors (CMFs) is central to\nevaluating the effectiveness of various road safety treatments and prioritizing\ninfrastructure investment accordingly. While customized study for each\ncountermeasure scenario is desired, the conventional CMF estimation approaches\nrely heavily on the availability of crash data at given sites. This not only\nmakes the estimation costly, but the results are also less transferable, since\nthe intrinsic similarities between different safety countermeasure scenarios\nare not fully explored. Aiming to fill this gap, this study introduces a novel\nknowledge-mining framework for CMF prediction. This framework delves into the\nconnections of existing countermeasures and reduces the reliance of CMF\nestimation on crash data availability and manual data collection. 
Specifically,\nit draws inspiration from human comprehension processes and introduces advanced\nNatural Language Processing (NLP) techniques to extract intricate variations\nand patterns from existing CMF knowledge. It effectively encodes unstructured\ncountermeasure scenarios into machine-readable representations and models the\ncomplex relationships between scenarios and CMF values. This new data-driven\nframework provides a cost-effective and adaptable solution that complements the\ncase-specific approaches for CMF estimation, which is particularly beneficial\nwhen availability of crash data or time imposes constraints. Experimental\nvalidation using real-world CMF Clearinghouse data demonstrates the\neffectiveness of this new approach, which shows significant accuracy\nimprovements compared to baseline methods. This approach provides insights into\nnew possibilities of harnessing accumulated transportation knowledge in various\napplications."}, "http://arxiv.org/abs/2311.08691": {"title": "On Doubly Robust Estimation with Nonignorable Missing Data Using Instrumental Variables", "link": "http://arxiv.org/abs/2311.08691", "description": "Suppose we are interested in the mean of an outcome that is subject to\nnonignorable nonresponse. This paper develops new semiparametric estimation\nmethods with instrumental variables which affect nonresponse, but not the\noutcome. The proposed estimators remain consistent and asymptotically normal\neven under partial model misspecifications for two variation independent\nnuisance functions. We evaluate the performance of the proposed estimators via\na simulation study, and apply them in adjusting for missing data induced by HIV\ntesting refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana,\nusing interviewer experience as an instrumental variable."}, "http://arxiv.org/abs/2311.08743": {"title": "Kernel-based independence tests for causal structure learning on functional data", "link": "http://arxiv.org/abs/2311.08743", "description": "Measurements of systems taken along a continuous functional dimension, such\nas time or space, are ubiquitous in many fields, from the physical and\nbiological sciences to economics and engineering.Such measurements can be\nviewed as realisations of an underlying smooth process sampled over the\ncontinuum. However, traditional methods for independence testing and causal\nlearning are not directly applicable to such data, as they do not take into\naccount the dependence along the functional dimension. By using specifically\ndesigned kernels, we introduce statistical tests for bivariate, joint, and\nconditional independence for functional variables. Our method not only extends\nthe applicability to functional data of the HSIC and its d-variate version\n(d-HSIC), but also allows us to introduce a test for conditional independence\nby defining a novel statistic for the CPT based on the HSCIC, with optimised\nregularisation strength estimated through an evaluation rejection rate. 
Our\nempirical results on the size and power of these tests on synthetic functional\ndata show good performance, and we then exemplify their application to several\nconstraint- and regression-based causal structure learning problems, including\nboth synthetic examples and real socio-economic data."}, "http://arxiv.org/abs/2311.08752": {"title": "ProSpar-GP: scalable Gaussian process modeling with massive non-stationary datasets", "link": "http://arxiv.org/abs/2311.08752", "description": "Gaussian processes (GPs) are a popular class of Bayesian nonparametric\nmodels, but their training can be computationally burdensome for massive training\ndatasets. While there has been notable work on scaling up these models for big\ndata, existing methods typically rely on a stationary GP assumption for\napproximation, and can thus perform poorly when the underlying response surface\nis non-stationary, i.e., it has some regions of rapid change and other regions\nwith little change. Such non-stationarity is, however, ubiquitous in real-world\nproblems, including our motivating application for surrogate modeling of\ncomputer experiments. We thus propose a new Product of Sparse GP (ProSpar-GP)\nmethod for scalable GP modeling with massive non-stationary data. The\nProSpar-GP makes use of a carefully-constructed product-of-experts formulation\nof sparse GP experts, where different experts are placed within local regions\nof non-stationarity. These GP experts are fit via a novel variational inference\napproach, which capitalizes on mini-batching and GPU acceleration for efficient\noptimization of inducing points and length-scale parameters for each expert. We\nfurther show that the ProSpar-GP is Kolmogorov-consistent, in that its\ngenerative distribution defines a valid stochastic process over the prediction\nspace; such a property provides essential stability for variational inference,\nparticularly in the presence of non-stationarity. We then demonstrate the\nimproved performance of the ProSpar-GP over the state-of-the-art, in a suite of\nnumerical experiments and an application for surrogate modeling of a satellite\ndrag simulator."}, "http://arxiv.org/abs/2311.08812": {"title": "Optimal subsampling algorithm for the marginal model with large longitudinal data", "link": "http://arxiv.org/abs/2311.08812", "description": "Big data is ubiquitous in practice, and it has also led to a heavy computational\nburden. To reduce the calculation cost and ensure the effectiveness of\nparameter estimators, an optimal subset sampling method is proposed to estimate\nthe parameters in marginal models with massive longitudinal data. The optimal\nsubsampling probabilities are derived, and the corresponding asymptotic\nproperties are established to ensure the consistency and asymptotic normality\nof the estimator. Extensive simulation studies are carried out to evaluate the\nperformance of the proposed method for continuous, binary and count data and\nwith four different working correlation matrices. A depression dataset is used to\nillustrate the proposed method."}, "http://arxiv.org/abs/2311.08845": {"title": "Statistical learning by sparse deep neural networks", "link": "http://arxiv.org/abs/2311.08845", "description": "We consider a deep neural network estimator based on empirical risk\nminimization with l_1-regularization. 
We derive a general bound for its excess\nrisk in regression and classification (including multiclass), and prove that it\nis adaptively nearly-minimax (up to log-factors) simultaneously across the\nentire range of various function classes."}, "http://arxiv.org/abs/2311.08908": {"title": "Robust Brain MRI Image Classification with SIBOW-SVM", "link": "http://arxiv.org/abs/2311.08908", "description": "The majority of primary Central Nervous System (CNS) tumors in the brain are\namong the most aggressive diseases affecting humans. Early detection of brain\ntumor types, whether benign or malignant, glial or non-glial, is critical for\ncancer prevention and treatment, ultimately improving human life expectancy.\nMagnetic Resonance Imaging (MRI) stands as the most effective technique to\ndetect brain tumors by generating comprehensive brain images through scans.\nHowever, human examination can be error-prone and inefficient due to the\ncomplexity, size, and location variability of brain tumors. Recently, automated\nclassification techniques using machine learning (ML) methods, such as\nConvolutional Neural Network (CNN), have demonstrated significantly higher\naccuracy than manual screening, while maintaining low computational costs.\nNonetheless, deep learning-based image classification methods, including CNN,\nface challenges in estimating class probabilities without proper model\ncalibration. In this paper, we propose a novel brain tumor image classification\nmethod, called SIBOW-SVM, which integrates the Bag-of-Features (BoF) model with\nSIFT feature extraction and weighted Support Vector Machines (wSVMs). This new\napproach effectively captures hidden image features, enabling the\ndifferentiation of various tumor types and accurate label predictions.\nAdditionally, the SIBOW-SVM is able to estimate the probabilities of images\nbelonging to each class, thereby providing high-confidence classification\ndecisions. We have also developed scalable and parallelizable algorithms to\nfacilitate the practical implementation of SIBOW-SVM for massive images. As a\nbenchmark, we apply the SIBOW-SVM to a public data set of brain tumor MRI\nimages containing four classes: glioma, meningioma, pituitary, and normal. Our\nresults show that the new method outperforms state-of-the-art methods,\nincluding CNN."}, "http://arxiv.org/abs/2311.09015": {"title": "Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach", "link": "http://arxiv.org/abs/2311.09015", "description": "We consider the task of identifying and estimating a parameter of interest in\nsettings where data is missing not at random (MNAR). In general, such\nparameters are not identified without strong assumptions on the missing data\nmodel. In this paper, we take an alternative approach and introduce a method\ninspired by data fusion, where information in an MNAR dataset is augmented by\ninformation in an auxiliary dataset subject to missingness at random (MAR). We\nshow that even if the parameter of interest cannot be identified given either\ndataset alone, it can be identified given pooled data, under two complementary\nsets of assumptions. 
We derive an inverse probability weighted (IPW) estimator\nfor identified parameters, and evaluate the performance of our estimation\nstrategies via simulation studies."}, "http://arxiv.org/abs/2311.09081": {"title": "Posterior accuracy and calibration under misspecification in Bayesian generalized linear models", "link": "http://arxiv.org/abs/2311.09081", "description": "Generalized linear models (GLMs) are popular for data-analysis in almost all\nquantitative sciences, but the choice of likelihood family and link function is\noften difficult. This motivates the search for likelihoods and links that\nminimize the impact of potential misspecification. We perform a large-scale\nsimulation study on double-bounded and lower-bounded response data where we\nsystematically vary both true and assumed likelihoods and links. In contrast to\nprevious studies, we also study posterior calibration and uncertainty metrics\nin addition to point-estimate accuracy. Our results indicate that certain\nlikelihoods and links can be remarkably robust to misspecification, performing\nalmost on par with their respective true counterparts. Additionally, normal\nlikelihood models with identity link (i.e., linear regression) often achieve\ncalibration comparable to the more structurally faithful alternatives, at least\nin the studied scenarios. On the basis of our findings, we provide practical\nsuggestions for robust likelihood and link choices in GLMs."}, "http://arxiv.org/abs/2311.09107": {"title": "Illness-death model with renewal", "link": "http://arxiv.org/abs/2311.09107", "description": "The illness-death model for chronic conditions is combined with a renewal\nequation for the number of newborns taking into account possibly different\nfertility rates in the healthy and diseased parts of the population. The\nresulting boundary value problem consists of a system of partial differential\nequations with an integral boundary condition. As an application, the boundary\nvalue problem is applied to an example about type 2 diabetes."}, "http://arxiv.org/abs/2311.09137": {"title": "Causal prediction models for medication safety monitoring: The diagnosis of vancomycin-induced acute kidney injury", "link": "http://arxiv.org/abs/2311.09137", "description": "The current best practice approach for the retrospective diagnosis of adverse\ndrug events (ADEs) in hospitalized patients relies on a full patient chart\nreview and a formal causality assessment by multiple medical experts. This\nevaluation serves to qualitatively estimate the probability of causation (PC);\nthe probability that a drug was a necessary cause of an adverse event. This\npractice is manual, resource intensive and prone to human biases, and may thus\nbenefit from data-driven decision support. Here, we pioneer a causal modeling\napproach using observational data to estimate a lower bound of the PC\n(PC$_{low}$). This method includes two key causal inference components: (1) the\ntarget trial emulation framework and (2) estimation of individualized treatment\neffects using machine learning. We apply our method to the clinically relevant\nuse-case of vancomycin-induced acute kidney injury in intensive care patients,\nand compare our causal model-based PC$_{low}$ estimates to qualitative\nestimates of the PC provided by a medical expert. 
Important limitations and\npotential improvements are discussed, and we conclude that future improved\ncausal models could provide essential data-driven support for medication safety\nmonitoring in hospitalized patients."}, "http://arxiv.org/abs/1911.03071": {"title": "Balancing Covariates in Randomized Experiments with the Gram-Schmidt Walk Design", "link": "http://arxiv.org/abs/1911.03071", "description": "The design of experiments involves a compromise between covariate balance and\nrobustness. This paper provides a formalization of this trade-off and describes\nan experimental design that allows experimenters to navigate it. The design is\nspecified by a robustness parameter that bounds the worst-case mean squared\nerror of an estimator of the average treatment effect. Subject to the\nexperimenter's desired level of robustness, the design aims to simultaneously\nbalance all linear functions of potentially many covariates. Less robustness\nallows for more balance. We show that the mean squared error of the estimator\nis bounded in finite samples by the minimum of the loss function of an implicit\nridge regression of the potential outcomes on the covariates. Asymptotically,\nthe design perfectly balances all linear functions of a growing number of\ncovariates with a diminishing reduction in robustness, effectively allowing\nexperimenters to escape the compromise between balance and robustness in large\nsamples. Finally, we describe conditions that ensure asymptotic normality and\nprovide a conservative variance estimator, which facilitate the construction of\nasymptotically valid confidence intervals."}, "http://arxiv.org/abs/2102.07356": {"title": "Asymptotic properties of generalized closed-form maximum likelihood estimators", "link": "http://arxiv.org/abs/2102.07356", "description": "The maximum likelihood estimator (MLE) is pivotal in statistical inference,\nyet its application is often hindered by the absence of closed-form solutions\nfor many models. This poses challenges in real-time computation scenarios,\nparticularly within embedded systems technology, where numerical methods are\nimpractical. This study introduces a generalized form of the MLE that yields\nclosed-form estimators under certain conditions. We derive the asymptotic\nproperties of the proposed estimator and demonstrate that our approach retains\nkey properties such as invariance under one-to-one transformations, strong\nconsistency, and an asymptotic normal distribution. The effectiveness of the\ngeneralized MLE is exemplified through its application to the Gamma, Nakagami,\nand Beta distributions, showcasing improvements over the traditional MLE.\nAdditionally, we extend this methodology to a bivariate gamma distribution,\nsuccessfully deriving closed-form estimators. This advancement presents\nsignificant implications for real-time statistical analysis across various\napplications."}, "http://arxiv.org/abs/2207.13493": {"title": "The Cellwise Minimum Covariance Determinant Estimator", "link": "http://arxiv.org/abs/2207.13493", "description": "The usual Minimum Covariance Determinant (MCD) estimator of a covariance\nmatrix is robust against casewise outliers. These are cases (that is, rows of\nthe data matrix) that behave differently from the majority of cases, raising\nsuspicion that they might belong to a different population. On the other hand,\ncellwise outliers are individual cells in the data matrix. 
When a row contains\none or more outlying cells, the other cells in the same row still contain\nuseful information that we wish to preserve. We propose a cellwise robust\nversion of the MCD method, called cellMCD. Its main building blocks are\nobserved likelihood and a penalty term on the number of flagged cellwise\noutliers. It possesses good breakdown properties. We construct a fast algorithm\nfor cellMCD based on concentration steps (C-steps) that always lower the\nobjective. The method performs well in simulations with cellwise outliers, and\nhas high finite-sample efficiency on clean data. It is illustrated on real data\nwith visualizations of the results."}, "http://arxiv.org/abs/2208.07086": {"title": "Flexible Bayesian Multiple Comparison Adjustment Using Dirichlet Process and Beta-Binomial Model Priors", "link": "http://arxiv.org/abs/2208.07086", "description": "Researchers frequently wish to assess the equality or inequality of groups,\nbut this comes with the challenge of adequately adjusting for multiple\ncomparisons. Statistically, all possible configurations of equality and\ninequality constraints can be uniquely represented as partitions of the groups,\nwhere any number of groups are equal if they are in the same partition. In a\nBayesian framework, one can adjust for multiple comparisons by constructing a\nsuitable prior distribution over all possible partitions. Inspired by work on\nvariable selection in regression, we propose a class of flexible beta-binomial\npriors for Bayesian multiple comparison adjustment. We compare this prior setup\nto the Dirichlet process prior suggested by Gopalan and Berry (1998) and\nmultiple comparison adjustment methods that do not specify a prior over\npartitions directly. Our approach to multiple comparison adjustment not only\nallows researchers to assess all pairwise (in)equalities, but in fact all\npossible (in)equalities among all groups. As a consequence, the space of\npossible partitions grows quickly - for ten groups, there are already 115,975\npossible partitions - and we set up a stochastic search algorithm to\nefficiently explore the space. Our method is implemented in the Julia package\nEqualitySampler, and we illustrate it on examples related to the comparison of\nmeans, variances, and proportions."}, "http://arxiv.org/abs/2208.07959": {"title": "Variable Selection in Latent Regression IRT Models via Knockoffs: An Application to International Large-scale Assessment in Education", "link": "http://arxiv.org/abs/2208.07959", "description": "International large-scale assessments (ILSAs) play an important role in\neducational research and policy making. They collect valuable data on education\nquality and performance development across many education systems, giving\ncountries the opportunity to share techniques, organizational structures, and\npolicies that have proven efficient and successful. To gain insights from ILSA\ndata, we identify non-cognitive variables associated with students' academic\nperformance. This problem has three analytical challenges: 1) academic\nperformance is measured by cognitive items under a matrix sampling design; 2)\nthere are many missing values in the non-cognitive variables; and 3) multiple\ncomparisons due to a large number of non-cognitive variables. We consider an\napplication to the Programme for International Student Assessment (PISA),\naiming to identify non-cognitive variables associated with students'\nperformance in science. 
We formulate it as a variable selection problem under a\ngeneral latent variable model framework and further propose a knockoff method\nthat conducts variable selection with a controlled error rate for false\nselections."}, "http://arxiv.org/abs/2210.06927": {"title": "Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models", "link": "http://arxiv.org/abs/2210.06927", "description": "Bayesian modeling provides a principled approach to quantifying uncertainty\nin model parameters and model structure and has seen a surge of applications in\nrecent years. Within the context of a Bayesian workflow, we are concerned with\nmodel selection for the purpose of finding models that best explain the data,\nthat is, help us understand the underlying data generating process. Since we\nrarely have access to the true process, all we are left with during real-world\nanalyses is incomplete causal knowledge from sources outside of the current\ndata and model predictions of said data. This leads to the important question\nof when the use of prediction as a proxy for explanation for the purpose of\nmodel selection is valid. We approach this question by means of large-scale\nsimulations of Bayesian generalized linear models where we investigate various\ncausal and statistical misspecifications. Our results indicate that the use of\nprediction as proxy for explanation is valid and safe only when the models\nunder consideration are sufficiently consistent with the underlying causal\nstructure of the true data generating process."}, "http://arxiv.org/abs/2212.04550": {"title": "Modern Statistical Models and Methods for Estimating Fatigue-Life and Fatigue-Strength Distributions from Experimental Data", "link": "http://arxiv.org/abs/2212.04550", "description": "Engineers and scientists have been collecting and analyzing fatigue data\nsince the 1800s to ensure the reliability of life-critical structures.\nApplications include (but are not limited to) bridges, building structures,\naircraft and spacecraft components, ships, ground-based vehicles, and medical\ndevices. Engineers need to estimate S-N relationships (Stress or Strain versus\nNumber of cycles to failure), typically with a focus on estimating small\nquantiles of the fatigue-life distribution. Estimates from this kind of model\nare used as input to models (e.g., cumulative damage models) that predict\nfailure-time distributions under varying stress patterns. Also, design\nengineers need to estimate lower-tail quantiles of the closely related\nfatigue-strength distribution. The history of applying incorrect statistical\nmethods is nearly as long and such practices continue to the present. Examples\ninclude treating the applied stress (or strain) as the response and the number\nof cycles to failure as the explanatory variable in regression analyses\n(because of the need to estimate strength distributions) and ignoring or\notherwise mishandling censored observations (known as runouts in the fatigue\nliterature). The first part of the paper reviews the traditional modeling\napproach where a fatigue-life model is specified. We then show how this\nspecification induces a corresponding fatigue-strength model. The second part\nof the paper presents a novel alternative modeling approach where a\nfatigue-strength model is specified and a corresponding fatigue-life model is\ninduced. 
We explain and illustrate the important advantages of this new\nmodeling approach."}, "http://arxiv.org/abs/2303.01186": {"title": "Discrete-time Competing-Risks Regression with or without Penalization", "link": "http://arxiv.org/abs/2303.01186", "description": "Many studies employ the analysis of time-to-event data that incorporates\ncompeting risks and right censoring. Most methods and software packages are\ngeared towards analyzing data that comes from a continuous failure time\ndistribution. However, failure-time data may sometimes be discrete either\nbecause time is inherently discrete or due to imprecise measurement. This paper\nintroduces a novel estimation procedure for discrete-time survival analysis\nwith competing events. The proposed approach offers two key advantages over\nexisting procedures: first, it expedites the estimation process for a large\nnumber of unique failure time points; second, it allows for straightforward\nintegration and application of widely used regularized regression and screening\nmethods. We illustrate the benefits of our proposed approach by conducting a\ncomprehensive simulation study. Additionally, we showcase the utility of our\nprocedure by estimating a survival model for the length of stay of patients\nhospitalized in the intensive care unit, considering three competing events:\ndischarge to home, transfer to another medical facility, and in-hospital death."}, "http://arxiv.org/abs/2306.11281": {"title": "Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models", "link": "http://arxiv.org/abs/2306.11281", "description": "Answering counterfactual queries has many important applications such as\nknowledge discovery and explainability, but is challenging when causal\nvariables are unobserved and we only see a projection onto an observation\nspace, for instance, image pixels. One approach is to recover the latent\nStructural Causal Model (SCM), but this typically needs unrealistic\nassumptions, such as linearity of the causal mechanisms. Another approach is to\nuse na\\\"ive ML approximations, such as generative models, to generate\ncounterfactual samples; however, these lack guarantees of accuracy. In this\nwork, we strive to strike a balance between practicality and theoretical\nguarantees by focusing on a specific type of causal query called domain\ncounterfactuals, which hypothesizes what a sample would have looked like if it\nhad been generated in a different domain (or environment). Concretely, by only\nassuming invertibility, sparse domain interventions and access to observational\ndata from different domains, we aim to improve domain counterfactual estimation\nboth theoretically and practically with less restrictive assumptions. We define\ndomain counterfactually equivalent models and prove necessary and sufficient\nproperties for equivalent models that provide a tight characterization of the\ndomain counterfactual equivalence classes. Building upon this result, we prove\nthat every equivalence class contains a model where all intervened variables\nare at the end when topologically sorted by the causal DAG. This surprising\nresult suggests that a model design that only allows intervention in the last\n$k$ latent variables may improve model estimation for counterfactuals. 
We then\ntest this model design on extensive simulated and image-based experiments which\nshow the sparse canonical model indeed improves counterfactual estimation over\nbaseline non-sparse models."}, "http://arxiv.org/abs/2309.10378": {"title": "Group Spike and Slab Variational Bayes", "link": "http://arxiv.org/abs/2309.10378", "description": "We introduce Group Spike-and-slab Variational Bayes (GSVB), a scalable method\nfor group sparse regression. A fast co-ordinate ascent variational inference\n(CAVI) algorithm is developed for several common model families including\nGaussian, Binomial and Poisson. Theoretical guarantees for our proposed\napproach are provided by deriving contraction rates for the variational\nposterior in grouped linear regression. Through extensive numerical studies, we\ndemonstrate that GSVB provides state-of-the-art performance, offering a\ncomputationally inexpensive substitute to MCMC, whilst performing comparably or\nbetter than existing MAP methods. Additionally, we analyze three real world\ndatasets wherein we highlight the practical utility of our method,\ndemonstrating that GSVB provides parsimonious models with excellent predictive\nperformance, variable selection and uncertainty quantification."}, "http://arxiv.org/abs/2309.12632": {"title": "Are Deep Learning Classification Results Obtained on CT Scans Fair and Interpretable?", "link": "http://arxiv.org/abs/2309.12632", "description": "Following the great success of various deep learning methods in image and\nobject classification, the biomedical image processing community is also\noverwhelmed with their applications to various automatic diagnosis cases.\nUnfortunately, most of the deep learning-based classification attempts in the\nliterature solely focus on the aim of extreme accuracy scores, without\nconsidering interpretability, or patient-wise separation of training and test\ndata. For example, most lung nodule classification papers using deep learning\nrandomly shuffle data and split it into training, validation, and test sets,\ncausing certain images from the CT scan of a person to be in the training set,\nwhile other images of the exact same person to be in the validation or testing\nimage sets. This can result in reporting misleading accuracy rates and the\nlearning of irrelevant features, ultimately reducing the real-life usability of\nthese models. When the deep neural networks trained on the traditional, unfair\ndata shuffling method are challenged with new patient images, it is observed\nthat the trained models perform poorly. In contrast, deep neural networks\ntrained with strict patient-level separation maintain their accuracy rates even\nwhen new patient images are tested. Heat-map visualizations of the activations\nof the deep neural networks trained with strict patient-level separation\nindicate a higher degree of focus on the relevant nodules. We argue that the\nresearch question posed in the title has a positive answer only if the deep\nneural networks are trained with images of patients that are strictly isolated\nfrom the validation and testing patient sets."}, "http://arxiv.org/abs/2311.09388": {"title": "Synthesis estimators for positivity violations with a continuous covariate", "link": "http://arxiv.org/abs/2311.09388", "description": "Research intended to estimate the effect of an action, like in randomized\ntrials, often does not have random samples of the intended target population.\nInstead, estimates can be transported to the desired target population. 
Methods\nfor transporting between populations are often premised on a positivity\nassumption, such that all relevant covariate patterns in one population are\nalso present in the other. However, eligibility criteria, particularly in the\ncase of trials, can result in violations of positivity. To address\nnonpositivity, a synthesis of statistical and mechanistic models was previously\nproposed in the context of violations by a single binary covariate. Here, we\nextend the synthesis approach for positivity violations with a continuous\ncovariate. For estimation, two novel augmented inverse probability weighting\nestimators are proposed, with one based on estimating the parameters of a\nmarginal structural model and the other based on estimating the conditional\naverage causal effect. Both estimators are compared to other common approaches\nto address nonpositivity via a simulation study. Finally, the competing\napproaches are illustrated with an example in the context of two-drug versus\none-drug antiretroviral therapy on CD4 T cell counts among women with HIV."}, "http://arxiv.org/abs/2311.09419": {"title": "Change-point Inference for High-dimensional Heteroscedastic Data", "link": "http://arxiv.org/abs/2311.09419", "description": "We propose a bootstrap-based test to detect a mean shift in a sequence of\nhigh-dimensional observations with unknown time-varying heteroscedasticity. The\nproposed test builds on the U-statistic based approach in Wang et al. (2022),\ntargets a dense alternative, and adopts a wild bootstrap procedure to generate\ncritical values. The bootstrap-based test is free of tuning parameters and is\ncapable of accommodating unconditional time varying heteroscedasticity in the\nhigh-dimensional observations, as demonstrated in our theory and simulations.\nTheoretically, we justify the bootstrap consistency by using the recently\nproposed unconditional approach in Bucher and Kojadinovic (2019). Extensions to\ntesting for multiple change-points and estimation using wild binary\nsegmentation are also presented. Numerical simulations demonstrate the\nrobustness of the proposed testing and estimation procedures with respect to\ndifferent kinds of time-varying heteroscedasticity."}, "http://arxiv.org/abs/2311.09423": {"title": "Orthogonal prediction of counterfactual outcomes", "link": "http://arxiv.org/abs/2311.09423", "description": "Orthogonal meta-learners, such as DR-learner, R-learner and IF-learner, are\nincreasingly used to estimate conditional average treatment effects. They\nimprove convergence rates relative to na\\\"{\\i}ve meta-learners (e.g., T-, S-\nand X-learner) through de-biasing procedures that involve applying standard\nlearners to specifically transformed outcome data. This leads them to disregard\nthe possibly constrained outcome space, which can be particularly problematic\nfor dichotomous outcomes: these typically get transformed to values that are no\nlonger constrained to the unit interval, making it difficult for standard\nlearners to guarantee predictions within the unit interval. To address this, we\nconstruct orthogonal meta-learners for the prediction of counterfactual\noutcomes which respect the outcome space. As such, the obtained i-learner or\nimputation-learner is more generally expected to outperform existing learners,\neven when the outcome is unconstrained, as we confirm empirically in simulation\nstudies and an analysis of critical care data. 
Our development also sheds\nbroader light on the construction of orthogonal learners for other estimands."}, "http://arxiv.org/abs/2311.09446": {"title": "On simulation-based inference for implicitly defined models", "link": "http://arxiv.org/abs/2311.09446", "description": "In many applications, a stochastic system is studied using a model implicitly\ndefined via a simulator. We develop a simulation-based parameter inference\nmethod for implicitly defined models. Our method differs from traditional\nlikelihood-based inference in that it uses a metamodel for the distribution of\na log-likelihood estimator. The metamodel is built on a local asymptotic\nnormality (LAN) property satisfied by the simulation-based log-likelihood\nestimator under certain conditions. A method for hypothesis testing is developed\nunder the metamodel. Our method can enable accurate parameter estimation and\nuncertainty quantification where other Monte Carlo methods for parameter\ninference become highly inefficient due to large Monte Carlo variance. We\ndemonstrate our method using numerical examples including a mechanistic model\nfor the population dynamics of infectious disease."}, "http://arxiv.org/abs/2311.09838": {"title": "Bayesian Inference of Reproduction Number from Epidemiological and Genetic Data Using Particle MCMC", "link": "http://arxiv.org/abs/2311.09838", "description": "Inference of the reproduction number through time is of vital importance\nduring an epidemic outbreak. Typically, epidemiologists tackle this using\nobserved prevalence or incidence data. However, prevalence and incidence data\nalone is often noisy or partial. Models can also have identifiability issues\nwith determining whether a large proportion of a small epidemic or a small proportion\nof a large epidemic has been observed. Sequencing data, however, is becoming more\nabundant, so approaches which can incorporate genetic data are an active area\nof research. We propose using particle MCMC methods to infer the time-varying\nreproduction number from a combination of prevalence data reported at a set of\ndiscrete times and a dated phylogeny reconstructed from sequences. We validate\nour approach on simulated epidemics with a variety of scenarios. We then apply\nthe method to a real data set of HIV-1 in North Carolina, USA, between 1957 and\n2019."}, "http://arxiv.org/abs/2311.09875": {"title": "Unbiased and Multilevel Methods for a Class of Diffusions Partially Observed via Marked Point Processes", "link": "http://arxiv.org/abs/2311.09875", "description": "In this article we consider the filtering problem associated with partially\nobserved diffusions, with observations following a marked point process. In the\nmodel, the data form a point process with observation times whose\nintensity is driven by a diffusion, with the associated marks also depending upon\nthe diffusion process. We assume that one must resort to time-discretizing the\ndiffusion process and develop particle and multilevel particle filters to\nrecursively approximate the filter. In particular, we prove that our multilevel\nparticle filter can achieve a mean square error (MSE) of\n$\\mathcal{O}(\\epsilon^2)$ ($\\epsilon>0$ and arbitrary) with a cost of\n$\\mathcal{O}(\\epsilon^{-2.5})$ versus using a particle filter which has a cost\nof $\\mathcal{O}(\\epsilon^{-3})$ to achieve the same MSE. 
We then show how this\nmethodology can be extended to give unbiased (that is with no\ntime-discretization error) estimators of the filter, which are proved to have\nfinite variance and with high-probability have finite cost. Finally, we extend\nour methodology to the problem of online static-parameter estimation."}, "http://arxiv.org/abs/2311.09935": {"title": "Semi-parametric Benchmark Dose Analysis with Monotone Additive Models", "link": "http://arxiv.org/abs/2311.09935", "description": "Benchmark dose analysis aims to estimate the level of exposure to a toxin\nthat results in a clinically-significant adverse outcome and quantifies\nuncertainty using the lower limit of a confidence interval for this level. We\ndevelop a novel framework for benchmark dose analysis based on monotone\nadditive dose-response models. We first introduce a flexible approach for\nfitting monotone additive models via penalized B-splines and\nLaplace-approximate marginal likelihood. A reflective Newton method is then\ndeveloped that employs de Boor's algorithm for computing splines and their\nderivatives for efficient estimation of the benchmark dose. Finally, we develop\nand assess three approaches for calculating benchmark dose lower limits: a\nnaive one based on asymptotic normality of the estimator, one based on an\napproximate pivot, and one using a Bayesian parametric bootstrap. The latter\napproaches improve upon the naive method in terms of accuracy and are\nguaranteed to return a positive lower limit; the approach based on an\napproximate pivot is typically an order of magnitude faster than the bootstrap,\nalthough they are both practically feasible to compute. We apply the new\nmethods to make inferences about the level of prenatal alcohol exposure\nassociated with clinically significant cognitive defects in children using data\nfrom an NIH-funded longitudinal study. Software to reproduce the results in\nthis paper is available at https://github.com/awstringer1/bmd-paper-code."}, "http://arxiv.org/abs/2311.09961": {"title": "Scan statistics for the detection of anomalies in M-dependent random fields with applications to image data", "link": "http://arxiv.org/abs/2311.09961", "description": "Anomaly detection in random fields is an important problem in many\napplications including the detection of cancerous cells in medicine, obstacles\nin autonomous driving and cracks in the construction material of buildings.\nSuch anomalies are often visible as areas with different expected values\ncompared to the background noise. Scan statistics based on local means have the\npotential to detect such local anomalies by enhancing relevant features. We\nderive limit theorems for a general class of such statistics over M-dependent\nrandom fields of arbitrary but fixed dimension. By allowing for a variety of\ncombinations and contrasts of sample means over differently-shaped local\nwindows, this yields a flexible class of scan statistics that can be tailored\nto the particular application of interest. The latter is demonstrated for crack\ndetection in 2D-images of different types of concrete. 
Together with a\nsimulation study this indicates the potential of the proposed methodology for\nthe detection of anomalies in a variety of situations."}, "http://arxiv.org/abs/2311.09989": {"title": "Xputer: Bridging Data Gaps with NMF, XGBoost, and a Streamlined GUI Experience", "link": "http://arxiv.org/abs/2311.09989", "description": "The rapid proliferation of data across diverse fields has accentuated the\nimportance of accurate imputation for missing values. This task is crucial for\nensuring data integrity and deriving meaningful insights. In response to this\nchallenge, we present Xputer, a novel imputation tool that adeptly integrates\nNon-negative Matrix Factorization (NMF) with the predictive strengths of\nXGBoost. One of Xputer's standout features is its versatility: it supports zero\nimputation, enables hyperparameter optimization through Optuna, and allows\nusers to define the number of iterations. For enhanced user experience and\naccessibility, we have equipped Xputer with an intuitive Graphical User\nInterface (GUI) ensuring ease of handling, even for those less familiar with\ncomputational tools. In performance benchmarks, Xputer not only rivals the\ncomputational speed of established tools such as IterativeImputer but also\noften outperforms them in terms of imputation accuracy. Furthermore, Xputer\nautonomously handles a diverse spectrum of data types, including categorical,\ncontinuous, and Boolean, eliminating the need for prior preprocessing. Given\nits blend of performance, flexibility, and user-friendly design, Xputer emerges\nas a state-of-the-art solution in the realm of data imputation."}, "http://arxiv.org/abs/2311.10076": {"title": "A decorrelation method for general regression adjustment in randomized experiments", "link": "http://arxiv.org/abs/2311.10076", "description": "We study regression adjustment with general function class approximations for\nestimating the average treatment effect in the design-based setting. Standard\nregression adjustment involves bias due to sample re-use, and this bias leads\nto behavior that is sub-optimal in the sample size, and/or imposes restrictive\nassumptions. Our main contribution is to introduce a novel decorrelation-based\napproach that circumvents these issues. We prove guarantees, both asymptotic\nand non-asymptotic, relative to the oracle functions that are targeted by a\ngiven regression adjustment procedure. We illustrate our method by applying it\nto various high-dimensional and non-parametric problems, exhibiting improved\nsample complexity and weakened assumptions relative to known approaches."}, "http://arxiv.org/abs/2108.09431": {"title": "Equivariant Variance Estimation for Multiple Change-point Model", "link": "http://arxiv.org/abs/2108.09431", "description": "The variance of noise plays an important role in many change-point detection\nprocedures and the associated inferences. Most commonly used variance\nestimators require strong assumptions on the true mean structure or normality\nof the error distribution, which may not hold in applications. More\nimportantly, the qualities of these estimators have not been discussed\nsystematically in the literature. In this paper, we introduce a framework of\nequivariant variance estimation for multiple change-point models. 
In\nparticular, we characterize the set of all equivariant unbiased quadratic\nvariance estimators for a family of change-point model classes, and develop a\nminimax theory for such estimators."}, "http://arxiv.org/abs/2210.07987": {"title": "Bayesian Learning via Q-Exponential Process", "link": "http://arxiv.org/abs/2210.07987", "description": "Regularization is one of the most fundamental topics in optimization,\nstatistics and machine learning. To get sparsity in estimating a parameter\n$u\\in\\mathbb{R}^d$, an $\\ell_q$ penalty term, $\\Vert u\\Vert_q$, is usually\nadded to the objective function. What is the probabilistic distribution\ncorresponding to such an $\\ell_q$ penalty? What is the correct stochastic process\ncorresponding to $\\Vert u\\Vert_q$ when we model functions $u\\in L^q$? This is\nimportant for statistically modeling large dimensional objects, e.g. images,\nwith a penalty to preserve certain properties, e.g. edges in the image. In this\nwork, we generalize the $q$-exponential distribution (with density proportional\nto $\\exp{(- \\frac{1}{2}|u|^q)}$) to a stochastic process named $Q$-exponential\n(Q-EP) process that corresponds to the $L_q$ regularization of functions. The\nkey step is to specify consistent multivariate $q$-exponential distributions by\nchoosing from a large family of elliptic contour distributions. The work is\nclosely related to the Besov process, which is usually defined by a series\nexpansion. Q-EP can be regarded as a definition of the Besov process with an explicit\nprobabilistic formulation and direct control on the correlation length. From\nthe Bayesian perspective, Q-EP provides a flexible prior on functions with\nsharper penalty ($q<2$) than the commonly used Gaussian process (GP). We\ncompare GP, Besov and Q-EP in modeling functional data, reconstructing images,\nand solving inverse problems and demonstrate the advantage of our proposed\nmethodology."}, "http://arxiv.org/abs/2302.00354": {"title": "The Spatial Kernel Predictor based on Huge Observation Sets", "link": "http://arxiv.org/abs/2302.00354", "description": "Spatial prediction in an arbitrary location, based on a spatial set of\nobservations, is usually performed by Kriging, being the best linear unbiased\npredictor (BLUP) in a least-square sense. In order to predict a continuous\nsurface over a spatial domain a grid representation is most often used. Kriging\npredictions and prediction variances are computed in the nodes of a grid\ncovering the spatial domain, and the continuous surface is assessed from this\ngrid representation. A precise representation usually requires the number of\ngrid nodes to be considerably larger than the number of observations. For a\nGaussian random field model the Kriging predictor coincides with the\nconditional expectation of the spatial variable given the observation set. An\nalternative expression for this conditional expectation provides a spatial\npredictor in functional form which does not rely on a spatial grid\ndiscretization. This functional predictor, called the Kernel predictor, is\nidentical to the asymptotic grid infill limit of the Kriging-based grid\nrepresentation, and the computational demand is primarily dependent on the\nnumber of observations - not the dimension of the spatial reference domain nor\nany grid discretization. We explore the potential of this Kernel predictor with\nassociated prediction variances. 
The predictor is valid for Gaussian random\nfields with any eligible spatial correlation function, and large computational\nsavings can be obtained by using a finite-range spatial correlation function.\nFor studies with a huge set of observations, localized predictors must be used,\nand the computational advantage relative to Kriging predictors can be very\nlarge. Moreover, model parameter inference based on a huge observation set can\nbe efficiently made. The methodology is demonstrated in a couple of examples."}, "http://arxiv.org/abs/2302.01861": {"title": "Covariance Matrix Estimation for High-Throughput Biomedical Data with Interconnected Communities", "link": "http://arxiv.org/abs/2302.01861", "description": "Estimating a covariance matrix is central to high-dimensional data analysis.\nEmpirical analyses of high-dimensional biomedical data, including genomics,\nproteomics, microbiome, and neuroimaging, among others, consistently reveal\nstrong modularity in the dependence patterns. In these analyses,\nintercorrelated high-dimensional biomedical features often form communities or\nmodules that can be interconnected with others. While the interconnected\ncommunity structure has been extensively studied in biomedical research (e.g.,\ngene co-expression networks), its potential to assist in the estimation of\ncovariance matrices remains largely unexplored. To address this gap, we propose\na procedure that leverages the commonly observed interconnected community\nstructure in high-dimensional biomedical data to estimate large covariance and\nprecision matrices. We derive the uniformly minimum variance unbiased\nestimators for covariance and precision matrices in closed forms and provide\ntheoretical results on their asymptotic properties. Our proposed method\nenhances the accuracy of covariance- and precision-matrix estimation and\ndemonstrates superior performance compared to the competing methods in both\nsimulations and real data analyses."}, "http://arxiv.org/abs/2303.16299": {"title": "Comparison of Methods that Combine Multiple Randomized Trials to Estimate Heterogeneous Treatment Effects", "link": "http://arxiv.org/abs/2303.16299", "description": "Individualized treatment decisions can improve health outcomes, but using\ndata to make these decisions in a reliable, precise, and generalizable way is\nchallenging with a single dataset. Leveraging multiple randomized controlled\ntrials allows for the combination of datasets with unconfounded treatment\nassignment to better estimate heterogeneous treatment effects. This paper\ndiscusses several non-parametric approaches for estimating heterogeneous\ntreatment effects using data from multiple trials. We extend single-study\nmethods to a scenario with multiple trials and explore their performance\nthrough a simulation study, with data generation scenarios that have differing\nlevels of cross-trial heterogeneity. The simulations demonstrate that methods\nthat directly allow for heterogeneity of the treatment effect across trials\nperform better than methods that do not, and that the choice of single-study\nmethod matters based on the functional form of the treatment effect. 
Finally,\nwe discuss which methods perform well in each setting and then apply them to\nfour randomized controlled trials to examine effect heterogeneity of treatments\nfor major depressive disorder."}, "http://arxiv.org/abs/2306.17043": {"title": "How trace plots help interpret meta-analysis results", "link": "http://arxiv.org/abs/2306.17043", "description": "The trace plot is seldom used in meta-analysis, yet it is a very informative\nplot. In this article we define and illustrate what the trace plot is, and\ndiscuss why it is important. The Bayesian version of the plot combines the\nposterior density of tau, the between-study standard deviation, and the\nshrunken estimates of the study effects as a function of tau. With a small or\nmoderate number of studies, tau is not estimated with much precision, and\nparameter estimates and shrunken study effect estimates can vary widely\ndepending on the correct value of tau. The trace plot allows visualization of\nthe sensitivity to tau along with a plot that shows which values of tau are\nplausible and which are implausible. A comparable frequentist or empirical\nBayes version provides similar results. The concepts are illustrated using\nexamples in meta-analysis and meta-regression; implementation in R is\nfacilitated in a Bayesian or frequentist framework using the bayesmeta and\nmetafor packages, respectively."}, "http://arxiv.org/abs/2309.10978": {"title": "Scarcity-Mediated Spillover: An Overlooked Source of Bias in Pragmatic Clinical Trials", "link": "http://arxiv.org/abs/2309.10978", "description": "Pragmatic clinical trials evaluate the effectiveness of health interventions\nin real-world settings. Spillover arises in a pragmatic trial if the study\nintervention affects how scarce resources are allocated between patients in the\nintervention and comparison groups. This can harm patients assigned to the\ncontrol group and lead to overestimation of the treatment effect. There is\ncurrently little recognition of this source of bias - which I term\n\"scarcity-mediated spillover\" - in the medical literature. In this article, I\nexamine what causes spillover and how it may have led trial investigators to\noverestimate the effect of patient navigation, AI-based physiological alarms,\nand elective induction of labor. I also suggest ways to detect\nscarcity-mediated spillover, design trials that avoid it, and modify clinical\ntrial guidelines to address this overlooked source of bias."}, "http://arxiv.org/abs/1910.12486": {"title": "Two-stage data segmentation permitting multiscale change points, heavy tails and dependence", "link": "http://arxiv.org/abs/1910.12486", "description": "The segmentation of a time series into piecewise stationary segments, a.k.a.\nmultiple change point analysis, is an important problem both in time series\nanalysis and signal processing. In the presence of multiscale change points\nwith both large jumps over short intervals and small changes over long\nstationary intervals, multiscale methods achieve good adaptivity in their\nlocalisation but at the same time, require the removal of false positives and\nduplicate estimators via a model selection step. In this paper, we propose a\nlocalised application of the Schwarz information criterion which, as a generic\nmethodology, is applicable with any multiscale candidate generating procedure\nfulfilling mild assumptions. 
We establish the theoretical consistency of the\nproposed localised pruning method in estimating the number and locations of\nmultiple change points under general assumptions permitting heavy tails and\ndependence. Further, we show that combined with a MOSUM-based candidate\ngenerating procedure, it attains minimax optimality in terms of detection lower\nbound and localisation for i.i.d. sub-Gaussian errors. A careful comparison\nwith the existing methods by means of (a) theoretical properties such as\ngenerality, optimality and algorithmic complexity, (b) performance on simulated\ndatasets and run time, as well as (c) performance on real data applications,\nconfirm the overall competitiveness of the proposed methodology."}, "http://arxiv.org/abs/2101.04651": {"title": "Moving sum data segmentation for stochastics processes based on invariance", "link": "http://arxiv.org/abs/2101.04651", "description": "The segmentation of data into stationary stretches also known as multiple\nchange point problem is important for many applications in time series analysis\nas well as signal processing. Based on strong invariance principles, we analyse\ndata segmentation methodology using moving sum (MOSUM) statistics for a class\nof regime-switching multivariate processes where each switch results in a\nchange in the drift. In particular, this framework includes the data\nsegmentation of multivariate partial sum, integrated diffusion and renewal\nprocesses even if the distance between change points is sublinear. We study the\nasymptotic behaviour of the corresponding change point estimators, show\nconsistency and derive the corresponding localisation rates which are minimax\noptimal in a variety of situations including an unbounded number of changes in\nWiener processes with drift. Furthermore, we derive the limit distribution of\nthe change point estimators for local changes - a result that can in principle\nbe used to derive confidence intervals for the change points."}, "http://arxiv.org/abs/2207.07396": {"title": "Data Segmentation for Time Series Based on a General Moving Sum Approach", "link": "http://arxiv.org/abs/2207.07396", "description": "In this paper we propose new methodology for the data segmentation, also\nknown as multiple change point problem, in a general framework including\nclassic mean change scenarios, changes in linear regression but also changes in\nthe time series structure such as in the parameters of Poisson-autoregressive\ntime series. In particular, we derive a general theory based on estimating\nequations proving consistency for the number of change points as well as rates\nof convergence for the estimators of the locations of the change points. More\nprecisely, two different types of MOSUM (moving sum) statistics are considered:\nA MOSUM-Wald statistic based on differences of local estimators and a\nMOSUM-score statistic based on a global estimator. The latter is usually\ncomputationally less involved in particular in non-linear problems where no\nclosed form of the estimator is known such that numerical methods are required.\nFinally, we evaluate the methodology by means of simulated data as well as\nusing some geophysical well-log data."}, "http://arxiv.org/abs/2311.10263": {"title": "Stable Differentiable Causal Discovery", "link": "http://arxiv.org/abs/2311.10263", "description": "Inferring causal relationships as directed acyclic graphs (DAGs) is an\nimportant but challenging problem. 
Differentiable Causal Discovery (DCD) is a\npromising approach to this problem, framing the search as a continuous\noptimization. But existing DCD methods are numerically unstable, with poor\nperformance beyond tens of variables. In this paper, we propose Stable\nDifferentiable Causal Discovery (SDCD), a new method that improves previous DCD\nmethods in two ways: (1) It employs an alternative constraint for acyclicity;\nthis constraint is more stable, both theoretically and empirically, and fast to\ncompute. (2) It uses a training procedure tailored for sparse causal graphs,\nwhich are common in real-world scenarios. We first derive SDCD and prove its\nstability and correctness. We then evaluate it with both observational and\ninterventional data and on both small-scale and large-scale settings. We find\nthat SDCD outperforms existing methods in both convergence speed and accuracy\nand can scale to thousands of variables."}, "http://arxiv.org/abs/2311.10279": {"title": "Differentially private analysis of networks with covariates via a generalized $\\beta$-model", "link": "http://arxiv.org/abs/2311.10279", "description": "How to achieve the tradeoff between privacy and utility is one of the fundamental\nproblems in private data analysis. In this paper, we give a rigorous\ndifferential privacy analysis of networks in the presence of covariates via a\ngeneralized $\\beta$-model, which has an $n$-dimensional degree parameter\n$\\beta$ and a $p$-dimensional homophily parameter $\\gamma$. Under $(k_n,\n\\epsilon_n)$-edge differential privacy, we use the popular Laplace mechanism to\nrelease the network statistics. The method of moments is used to estimate the\nunknown model parameters. We establish the conditions guaranteeing consistency\nof the differentially private estimators $\\widehat{\\beta}$ and\n$\\widehat{\\gamma}$ as the number of nodes $n$ goes to infinity, which reveal an\ninteresting tradeoff between a privacy parameter and model parameters. The\nconsistency is shown by applying a two-stage Newton's method to obtain the\nupper bound of the error between $(\\widehat{\\beta},\\widehat{\\gamma})$ and its\ntrue value $(\\beta, \\gamma)$ in terms of the $\\ell_\\infty$ distance, which has\na convergence rate of rough order $1/n^{1/2}$ for $\\widehat{\\beta}$ and $1/n$\nfor $\\widehat{\\gamma}$, respectively. Further, we derive the asymptotic\nnormalities of $\\widehat{\\beta}$ and $\\widehat{\\gamma}$, whose asymptotic\nvariances are the same as those of the non-private estimators under some\nconditions. Our paper sheds light on how to explore asymptotic theory under\ndifferential privacy in a principled manner; these principled methods should be\napplicable to a class of network models with covariates beyond the generalized\n$\\beta$-model. Numerical studies and a real data analysis demonstrate our\ntheoretical findings."}, "http://arxiv.org/abs/2311.10282": {"title": "Joint clustering with alignment for temporal data in a one-point-per-trajectory setting", "link": "http://arxiv.org/abs/2311.10282", "description": "Temporal data, obtained in the setting where it is only possible to observe\none time point per trajectory, is widely used in different research fields, yet\nremains insufficiently addressed from the statistical point of view. Such data\noften contain observations of a large number of entities, in which case it is\nof interest to identify a small number of representative behavior types. 
In\nthis paper, we propose a new method performing clustering simultaneously with\nalignment of temporal objects inferred from these data, providing insight into\nthe relationships between the entities. A series of simulations confirm the\nability of the proposed approach to leverage multiple properties of the complex\ndata we target such as accessible uncertainties, correlations and a small\nnumber of time points. We illustrate it on real data encoding cellular response\nto a radiation treatment with high energy, supported with the results of an\nenrichment analysis."}, "http://arxiv.org/abs/2311.10489": {"title": "Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline Approach", "link": "http://arxiv.org/abs/2311.10489", "description": "Overlapping asymmetric datasets are common in data science and pose questions\nof how they can be incorporated together into a predictive analysis. In\nhealthcare datasets there is often a small amount of information that is\navailable for a larger number of patients such as an electronic health record,\nhowever a small number of patients may have had extensive further testing.\nCommon solutions such as missing imputation can often be unwise if the smaller\ncohort is significantly different in scale to the larger sample, therefore the\naim of this research is to develop a new method which can model the smaller\ncohort against a particular response, whilst considering the larger cohort\nalso. Motivated by non-parametric models, and specifically flexible smoothing\ntechniques via generalized additive models, we model a twice penalized P-Spline\napproximation method to firstly prevent over/under-fitting of the smaller\ncohort and secondly to consider the larger cohort. This second penalty is\ncreated through discrepancies in the marginal value of covariates that exist in\nboth the smaller and larger cohorts. Through data simulations, parameter\ntunings and model adaptations to consider a continuous and binary response, we\nfind our twice penalized approach offers an enhanced fit over a linear B-Spline\nand once penalized P-Spline approximation. Applying to a real-life dataset\nrelating to a person's risk of developing Non-Alcoholic Steatohepatitis, we see\nan improved model fit performance of over 65%. Areas for future work within\nthis space include adapting our method to not require dimensionality reduction\nand also consider parametric modelling methods. However, to our knowledge this\nis the first work to propose additional marginal penalties in a flexible\nregression of which we can report a vastly improved model fit that is able to\nconsider asymmetric datasets, without the need for missing data imputation."}, "http://arxiv.org/abs/2311.10638": {"title": "Concept-free Causal Disentanglement with Variational Graph Auto-Encoder", "link": "http://arxiv.org/abs/2311.10638", "description": "In disentangled representation learning, the goal is to achieve a compact\nrepresentation that consists of all interpretable generative factors in the\nobservational data. Learning disentangled representations for graphs becomes\nincreasingly important as graph data rapidly grows. Existing approaches often\nrely on Variational Auto-Encoder (VAE) or its causal structure learning-based\nrefinement, which suffer from sub-optimality in VAEs due to the independence\nfactor assumption and unavailability of concept labels, respectively. 
In this\npaper, we propose an unsupervised solution, dubbed concept-free causal\ndisentanglement, built on a theoretically provable tight upper bound\napproximating the optimal factor. This results in an SCM-like causal structure\nmodeling that directly learns concept structures from data. Based on this idea,\nwe propose Concept-free Causal VGAE (CCVGAE) by incorporating a novel causal\ndisentanglement layer into Variational Graph Auto-Encoder. Furthermore, we\nprove concept consistency under our concept-free causal disentanglement\nframework, hence employing it to enhance the meta-learning framework, called\nconcept-free causal Meta-Graph (CC-Meta-Graph). We conduct extensive\nexperiments to demonstrate the superiority of the proposed models: CCVGAE and\nCC-Meta-Graph, reaching up to $29\\%$ and $11\\%$ absolute improvements over\nbaselines in terms of AUC, respectively."}, "http://arxiv.org/abs/2009.07055": {"title": "Causal Inference of General Treatment Effects using Neural Networks with A Diverging Number of Confounders", "link": "http://arxiv.org/abs/2009.07055", "description": "Semiparametric efficient estimation of various multi-valued causal effects,\nincluding quantile treatment effects, is important in economic, biomedical, and\nother social sciences. Under the unconfoundedness condition, adjustment for\nconfounders requires estimating the nuisance functions relating outcome or\ntreatment to confounders nonparametrically. This paper considers a generalized\noptimization framework for efficient estimation of general treatment effects\nusing artificial neural networks (ANNs) to approximate the unknown nuisance\nfunction of growing-dimensional confounders. We establish a new approximation\nerror bound for the ANNs to the nuisance function belonging to a mixed\nsmoothness class without a known sparsity structure. We show that the ANNs can\nalleviate the \"curse of dimensionality\" under this circumstance. We establish\nthe root-$n$ consistency and asymptotic normality of the proposed general\ntreatment effects estimators, and apply a weighted bootstrap procedure for\nconducting inference. The proposed methods are illustrated via simulation\nstudies and a real data application."}, "http://arxiv.org/abs/2012.08371": {"title": "Limiting laws and consistent estimation criteria for fixed and diverging number of spiked eigenvalues", "link": "http://arxiv.org/abs/2012.08371", "description": "In this paper, we study limiting laws and consistent estimation criteria for\nthe extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly,\nfor fixed $p$, we propose a generalized estimation criterion that can\nconsistently estimate, $k$, the number of spiked eigenvalues. Compared with the\nexisting literature, we show that consistency can be achieved under weaker\nconditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we\nderive limiting distributions of the spiked sample eigenvalues using random\nmatrix theory techniques. Notably, our results do not require the spiked\neigenvalues to be uniformly bounded from above or tending to infinity, as have\nbeen assumed in the existing literature. Based on the above derived results, we\nformulate a generalized estimation criterion and show that it can consistently\nestimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We\nfurther show that the results in our work continue to hold under a general\npopulation distribution without assuming normality. 
The efficacy of the\nproposed estimation criteria is illustrated through comparative simulation\nstudies."}, "http://arxiv.org/abs/2104.06296": {"title": "Count Network Autoregression", "link": "http://arxiv.org/abs/2104.06296", "description": "We consider network autoregressive models for count data with a non-random\nneighborhood structure. The main methodological contribution is the development\nof conditions that guarantee stability and valid statistical inference for such\nmodels. We consider both cases of fixed and increasing network dimension and we\nshow that quasi-likelihood inference provides consistent and asymptotically\nnormally distributed estimators. The work is complemented by simulation results\nand a data example."}, "http://arxiv.org/abs/2110.09115": {"title": "Optimal designs for experiments for scalar-on-function linear models", "link": "http://arxiv.org/abs/2110.09115", "description": "The aim of this work is to extend the usual optimal experimental design\nparadigm to experiments where the settings of one or more factors are\nfunctions. Such factors are known as profile factors, or as dynamic factors.\nFor these new experiments, a design consists of combinations of functions for\neach run of the experiment. After briefly introducing the class of profile\nfactors, basis functions are described with primary focus given on the B-spline\nbasis system, due to its computational efficiency and useful properties. Basis\nfunction expansions are applied to a functional linear model consisting of\nprofile factors, reducing the problem to an optimisation of basis coefficients.\nThe methodology developed comprises special cases, including combinations of\nprofile and non-functional factors, interactions, and polynomial effects. The\nmethod is finally applied to an experimental design problem in a\nBiopharmaceutical study that is performed using the Ambr250 modular bioreactor."}, "http://arxiv.org/abs/2207.13071": {"title": "Missing Values Handling for Machine Learning Portfolios", "link": "http://arxiv.org/abs/2207.13071", "description": "We characterize the structure and origins of missingness for 159\ncross-sectional return predictors and study missing value handling for\nportfolios constructed using machine learning. Simply imputing with\ncross-sectional means performs well compared to rigorous\nexpectation-maximization methods. This stems from three facts about predictor\ndata: (1) missingness occurs in large blocks organized by time, (2)\ncross-sectional correlations are small, and (3) missingness tends to occur in\nblocks organized by the underlying data source. As a result, observed data\nprovide little information about missing data. Sophisticated imputations\nintroduce estimation noise that can lead to underperformance if machine\nlearning is not carefully applied."}, "http://arxiv.org/abs/2209.00105": {"title": "Personalized Biopsy Schedules Using an Interval-censored Cause-specific Joint Model", "link": "http://arxiv.org/abs/2209.00105", "description": "Active surveillance (AS), where biopsies are conducted to detect cancer\nprogression, has been acknowledged as an efficient way to reduce the\novertreatment of prostate cancer. Most AS cohorts use fixed biopsy schedules\nfor all patients. However, the ideal test frequency remains unknown, and the\nroutine use of such invasive tests burdens the patients. An emerging idea is to\ngenerate personalized biopsy schedules based on each patient's\nprogression-specific risk. 
To achieve that, we propose the interval-censored\ncause-specific joint model (ICJM), which models the impact of longitudinal\nbiomarkers on cancer progression while considering the competing event of early\ntreatment initiation. The underlying likelihood function incorporates the\ninterval-censoring of cancer progression, the competing risk of treatment, and\nthe uncertainty about whether cancer progression occurred since the last biopsy\nin patients that are right-censored or experience the competing event. The\nmodel can produce patient-specific risk profiles until a horizon time. If the\nrisk exceeds a certain threshold, a biopsy is conducted. The optimal threshold\ncan be chosen by balancing two indicators of the biopsy schedules: the expected\nnumber of biopsies and expected delay in detection of cancer progression. A\nsimulation study showed that our personalized schedules could considerably\nreduce the number of biopsies per patient by 34%-54% compared to the fixed\nschedules, though at the cost of a slightly longer detection delay."}, "http://arxiv.org/abs/2304.14110": {"title": "A Bayesian Spatio-Temporal Extension to Poisson Auto-Regression: Modeling the Disease Infection Rate of COVID-19 in England", "link": "http://arxiv.org/abs/2304.14110", "description": "The COVID-19 pandemic provided many modeling challenges to investigate the\nevolution of an epidemic process over areal units. A suitable encompassing\nmodel must describe the spatio-temporal variations of the disease infection\nrate of multiple areal processes while adjusting for local and global inputs.\nWe develop an extension to Poisson Auto-Regression that incorporates\nspatio-temporal dependence to characterize the local dynamics while borrowing\ninformation among adjacent areas. The specification includes up to two sets of\nspace-time random effects to capture the spatio-temporal dependence and a\nlinear predictor depending on an arbitrary set of covariates. The proposed\nmodel, adopted in a fully Bayesian framework and implemented through a novel\nsparse-matrix representation in Stan, provides a framework for evaluating local\npolicy changes over the whole spatial and temporal domain of the study. It has\nbeen validated through a substantial simulation study and applied to the weekly\nCOVID-19 cases observed in the English local authority districts between May\n2020 and March 2021. The model detects substantial spatial and temporal\nheterogeneity and allows a full evaluation of the impact of two alternative\nsets of covariates: the level of local restrictions in place and the value of\nthe Google Mobility Indices. The paper also formalizes various novel\nmodel-based investigation methods for assessing additional aspects of disease\nepidemiology."}, "http://arxiv.org/abs/2305.14131": {"title": "Temporally Causal Discovery Tests for Discrete Time Series and Neural Spike Trains", "link": "http://arxiv.org/abs/2305.14131", "description": "We consider the problem of detecting causal relationships between discrete\ntime series, in the presence of potential confounders. A hypothesis test is\nintroduced for identifying the temporally causal influence of $(x_n)$ on\n$(y_n)$, causally conditioned on a possibly confounding third time series\n$(z_n)$. Under natural Markovian modeling assumptions, it is shown that the\nnull hypothesis, corresponding to the absence of temporally causal influence,\nis equivalent to the underlying `causal conditional directed information rate'\nbeing equal to zero. 
The plug-in estimator for this functional is identified\nwith the log-likelihood ratio test statistic for the desired test. This\nstatistic is shown to be asymptotically normal under the alternative hypothesis\nand asymptotically $\\chi^2$ distributed under the null, facilitating the\ncomputation of $p$-values when used on empirical data. The effectiveness of the\nresulting hypothesis test is illustrated on simulated data, validating the\nunderlying theory. The test is also employed in the analysis of spike train\ndata recorded from neurons in the V4 and FEF brain regions of behaving animals\nduring a visual attention task. There, the test results are seen to identify\ninteresting and biologically relevant information."}, "http://arxiv.org/abs/2307.01748": {"title": "Monotone Cubic B-Splines with a Neural-Network Generator", "link": "http://arxiv.org/abs/2307.01748", "description": "We present a method for fitting monotone curves using cubic B-splines, which\nis equivalent to putting a monotonicity constraint on the coefficients. We\nexplore different ways of enforcing this constraint and analyze their\ntheoretical and empirical properties. We propose two algorithms for solving the\nspline fitting problem: one that uses standard optimization techniques and one\nthat trains a Multi-Layer Perceptron (MLP) generator to approximate the\nsolutions under various settings and perturbations. The generator approach can\nspeed up the fitting process when we need to solve the problem repeatedly, such\nas when constructing confidence bands using bootstrap. We evaluate our method\nagainst several existing methods, some of which do not use the monotonicity\nconstraint, on some monotone curves with varying noise levels. We demonstrate\nthat our method outperforms the other methods, especially in high-noise\nscenarios. We also apply our method to analyze the polarization-hole phenomenon\nduring star formation in astrophysics. The source code is accessible at\n\\texttt{\\url{https://github.com/szcf-weiya/MonotoneSplines.jl}}."}, "http://arxiv.org/abs/2311.10738": {"title": "Approximation of supply curves", "link": "http://arxiv.org/abs/2311.10738", "description": "In this note, we illustrate the computation of the approximation of the\nsupply curves using a one-step basis. We derive the expression for the L2\napproximation and propose a procedure for the selection of nodes of the\napproximation. We illustrate the use of this approach with three large sets of\nbid curves from European electricity markets."}, "http://arxiv.org/abs/2311.10848": {"title": "Addressing Population Heterogeneity for HIV Incidence Estimation Based on Recency Test", "link": "http://arxiv.org/abs/2311.10848", "description": "Cross-sectional HIV incidence estimation leverages recency test results to\ndetermine the HIV incidence of a population of interest, where the recency test\nuses biomarker profiles to infer whether an HIV-positive individual was\n\"recently\" infected. This approach possesses an obvious advantage over the\nconventional cohort follow-up method since it avoids longitudinal follow-up and\nrepeated HIV testing. In this manuscript, we consider the extension of\ncross-sectional incidence estimation to estimate the incidence of a different\ntarget population addressing potential population heterogeneity. 
We propose a\ngeneral framework that incorporates two settings: one with the target\npopulation that is a subset of the population with cross-sectional recency\ntesting data, e.g., leveraging recency testing data from screening in\nactive-arm trial design, and the other with an external target population. We\nalso propose a method to incorporate HIV subtype, a special covariate that\nmodifies the properties of recency test, into our framework. Through extensive\nsimulation studies and a data application, we demonstrate the excellent\nperformance of the proposed methods. We conclude with a discussion of\nsensitivity analysis and future work to improve our framework."}, "http://arxiv.org/abs/2311.10877": {"title": "Covariate adjustment in randomized experiments with missing outcomes and covariates", "link": "http://arxiv.org/abs/2311.10877", "description": "Covariate adjustment can improve precision in estimating treatment effects\nfrom randomized experiments. With fully observed data, regression adjustment\nand propensity score weighting are two asymptotically equivalent methods for\ncovariate adjustment in randomized experiments. We show that this equivalence\nbreaks down in the presence of missing outcomes, with regression adjustment no\nlonger ensuring efficiency gain when the true outcome model is not linear in\ncovariates. Propensity score weighting, in contrast, still guarantees\nefficiency over unadjusted analysis, and including more covariates in\nadjustment never harms asymptotic efficiency. Moreover, we establish the value\nof using partially observed covariates to secure additional efficiency. Based\non these findings, we recommend a simple double-weighted estimator for\ncovariate adjustment with incomplete outcomes and covariates: (i) impute all\nmissing covariates by zero, and use the union of the completed covariates and\ncorresponding missingness indicators to estimate the probability of treatment\nand the probability of having observed outcome for all units; (ii) estimate the\naverage treatment effect by the coefficient of the treatment from the\nleast-squares regression of the observed outcome on the treatment, where we\nweight each unit by the inverse of the product of these two estimated\nprobabilities."}, "http://arxiv.org/abs/2311.10900": {"title": "A powerful rank-based correction to multiple testing under positive dependency", "link": "http://arxiv.org/abs/2311.10900", "description": "We develop a novel multiple hypothesis testing correction with family-wise\nerror rate (FWER) control that efficiently exploits positive dependencies\nbetween potentially correlated statistical hypothesis tests. Our proposed\nalgorithm $\\texttt{max-rank}$ is conceptually straight-forward, relying on the\nuse of a $\\max$-operator in the rank domain of computed test statistics. We\ncompare our approach to the frequently employed Bonferroni correction,\ntheoretically and empirically demonstrating its superiority over Bonferroni in\nthe case of existing positive dependency, and its equivalence otherwise. Our\nadvantage over Bonferroni increases as the number of tests rises, and we\nmaintain high statistical power whilst ensuring FWER control. 
We specifically\nframe our algorithm in the context of parallel permutation testing, a scenario\nthat arises in our primary application of conformal prediction, a recently\npopularized approach for quantifying uncertainty in complex predictive\nsettings."}, "http://arxiv.org/abs/2311.11050": {"title": "Functional Neural Network Control Chart", "link": "http://arxiv.org/abs/2311.11050", "description": "In many Industry 4.0 data analytics applications, quality characteristic data\nacquired from manufacturing processes are better modeled as functions, often\nreferred to as profiles. In practice, there are situations where a scalar\nquality characteristic, referred to also as the response, is influenced by one\nor more variables in the form of functional data, referred to as functional\ncovariates. To adjust the monitoring of the scalar response by the effect of\nthis additional information, a new profile monitoring strategy is proposed on\nthe residuals obtained from the functional neural network, which is able to\nlearn a possibly nonlinear relationship between the scalar response and the\nfunctional covariates. An extensive Monte Carlo simulation study is performed\nto assess the performance of the proposed method with respect to other control\ncharts that appeared in the literature before. Finally, a case study in the\nrailway industry is presented with the aim of monitoring the heating,\nventilation and air conditioning systems installed onboard passenger trains."}, "http://arxiv.org/abs/2311.11054": {"title": "Modern extreme value statistics for Utopian extremes", "link": "http://arxiv.org/abs/2311.11054", "description": "Capturing the extremal behaviour of data often requires bespoke marginal and\ndependence models which are grounded in rigorous asymptotic theory, and hence\nprovide reliable extrapolation into the upper tails of the data-generating\ndistribution. We present a modern toolbox of four methodological frameworks,\nmotivated by modern extreme value theory, that can be used to accurately\nestimate extreme exceedance probabilities or the corresponding level in either\na univariate or multivariate setting. Our frameworks were used to facilitate\nthe winning contribution of Team Yalla to the data competition organised for\nthe 13th International Conference on Extreme Value Analysis (EVA2023). This\ncompetition comprised seven teams competing across four separate\nsub-challenges, with each requiring the modelling of data simulated from known,\nyet highly complex, statistical distributions, and extrapolation far beyond the\nrange of the available samples in order to predict probabilities of extreme\nevents. Data were constructed to be representative of real environmental data,\nsampled from the fantasy country of \"Utopia\"."}, "http://arxiv.org/abs/2311.11153": {"title": "Biarchetype analysis: simultaneous learning of observations and features based on extremes", "link": "http://arxiv.org/abs/2311.11153", "description": "A new exploratory technique called biarchetype analysis is defined. We extend\narchetype analysis to find the archetypes of both observations and features\nsimultaneously. The idea of this new unsupervised machine learning tool is to\nrepresent observations and features by instances of pure types (biarchetypes)\nthat can be easily interpreted as they are mixtures of observations and\nfeatures. Furthermore, the observations and features are expressed as mixtures\nof the biarchetypes, which also helps understand the structure of the data. 
We\npropose an algorithm to solve biarchetype analysis. We show that biarchetype\nanalysis offers advantages over biclustering, especially in terms of\ninterpretability. This is because biarchetypes are extreme instances as opposed\nto the centroids returned by biclustering, which favors human understanding.\nBiarchetype analysis is applied to several machine learning problems to\nillustrate its usefulness."}, "http://arxiv.org/abs/2311.11216": {"title": "Valid Randomization Tests in Inexactly Matched Observational Studies via Iterative Convex Programming", "link": "http://arxiv.org/abs/2311.11216", "description": "In causal inference, matching is one of the most widely used methods to mimic\na randomized experiment using observational (non-experimental) data. Ideally,\ntreated units are exactly matched with control units for the covariates so that\nthe treatments are as-if randomly assigned within each matched set, and valid\nrandomization tests for treatment effects can then be conducted as in a\nrandomized experiment. However, inexact matching typically exists, especially\nwhen there are continuous or many observed covariates or when unobserved\ncovariates exist. Previous matched observational studies routinely conducted\ndownstream randomization tests as if matching was exact, as long as the matched\ndatasets satisfied some prespecified balance criteria or passed some balance\ntests. Some recent studies showed that this routine practice could render a\nhighly inflated type-I error rate of randomization tests, especially when the\nsample size is large. To handle this problem, we propose an iterative convex\nprogramming framework for randomization tests with inexactly matched datasets.\nUnder some commonly used regularity conditions, we show that our approach can\nproduce valid randomization tests (i.e., robustly controlling the type-I error\nrate) for any inexactly matched datasets, even when unobserved covariates\nexist. Our framework allows the incorporation of flexible machine learning\nmodels to better extract information from covariate imbalance while robustly\ncontrolling the type-I error rate."}, "http://arxiv.org/abs/2311.11236": {"title": "Generalized Linear Models via the Lasso: To Scale or Not to Scale?", "link": "http://arxiv.org/abs/2311.11236", "description": "The Lasso regression is a popular regularization method for feature selection\nin statistics. Prior to computing the Lasso estimator in both linear and\ngeneralized linear models, it is common to conduct a preliminary rescaling of\nthe feature matrix to ensure that all the features are standardized. Without\nthis standardization, it is argued, the Lasso estimate will unfortunately\ndepend on the units used to measure the features. We propose a new type of\niterative rescaling of the features in the context of generalized linear\nmodels. Whilst existing Lasso algorithms perform a single scaling as a\npreprocessing step, the proposed rescaling is applied iteratively throughout\nthe Lasso computation until convergence. 
We provide numerical examples, with\nboth real and simulated data, illustrating that the proposed iterative\nrescaling can significantly improve the statistical performance of the Lasso\nestimator without incurring any significant additional computational cost."}, "http://arxiv.org/abs/2311.11256": {"title": "Bayesian Modeling of Incompatible Spatial Data: A Case Study Involving Post-Adrian Storm Forest Damage Assessment", "link": "http://arxiv.org/abs/2311.11256", "description": "Incompatible spatial data modeling is a pervasive challenge in remote sensing\ndata analysis that involves field data. Typical approaches to addressing this\nchallenge aggregate information to a coarser common scale, i.e., compatible\nresolutions. Such pre-processing aggregation to a common resolution simplifies\nanalysis, but potentially causes information loss and hence compromised\ninference and predictive performance. To incorporate finer information to\nenhance prediction performance, we develop a new Bayesian method aimed at\nimproving predictive accuracy and uncertainty quantification. The main\ncontribution of this work is an efficient algorithm that enables full Bayesian\ninference using finer resolution data while optimizing computational and\nstorage costs. The algorithm is developed and applied to a forest damage\nassessment for the 2018 Adrian storm in Carinthia, Austria, which uses field\ndata and high-resolution LiDAR measurements. Simulation studies demonstrate\nthat this approach substantially improves prediction accuracy and stability,\nproviding more reliable inference to support forest management decisions."}, "http://arxiv.org/abs/2311.11290": {"title": "Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias", "link": "http://arxiv.org/abs/2311.11290", "description": "Firth (1993, Biometrika) shows that the maximum Jeffreys' prior penalized\nlikelihood estimator in logistic regression has asymptotic bias decreasing with\nthe square of the number of observations when the number of parameters is\nfixed, which is an order faster than the typical rate from maximum likelihood.\nThe widespread use of that estimator in applied work is supported by the\nresults in Kosmidis and Firth (2021, Biometrika), who show that it takes finite\nvalues, even in cases where the maximum likelihood estimate does not exist.\nKosmidis and Firth (2021, Biometrika) also provide empirical evidence that the\nestimator has good bias properties in high-dimensional settings where the\nnumber of parameters grows asymptotically linearly but slower than the number\nof observations. We design and carry out a large-scale computer experiment\ncovering a wide range of such high-dimensional settings and produce strong\nempirical evidence for a simple rescaling of the maximum Jeffreys' prior\npenalized likelihood estimator that delivers high accuracy in signal recovery\nin the presence of an intercept parameter. The rescaled estimator is effective\neven in cases where estimates from maximum likelihood and other recently\nproposed corrective methods based on approximate message passing do not exist."}, "http://arxiv.org/abs/2311.11445": {"title": "Maximum likelihood inference for a class of discrete-time Markov-switching time series models with multiple delays", "link": "http://arxiv.org/abs/2311.11445", "description": "Autoregressive Markov switching (ARMS) time series models are used to\nrepresent real-world signals whose dynamics may change over time. 
They have\nfound application in many areas of the natural and social sciences, as well as\nin engineering. In general, inference in this kind of system involves two\nproblems: (a) detecting the number of distinct dynamical models that the signal\nmay adopt and (b) estimating any unknown parameters in these models. In this\npaper, we introduce a class of ARMS time series models that includes many\nsystems resulting from the discretisation of stochastic delay differential\nequations (DDEs). Remarkably, this class includes cases in which the\ndiscretisation time grid is not necessarily aligned with the delays of the DDE,\nresulting in discrete-time ARMS models with real (non-integer) delays. We\ndescribe methods for the maximum likelihood detection of the number of\ndynamical modes and the estimation of unknown parameters (including the\npossibly non-integer delays) and illustrate their application with an ARMS\nmodel of the El Ni\\~no--southern oscillation (ENSO) phenomenon."}, "http://arxiv.org/abs/2311.11487": {"title": "Modeling Insurance Claims using Bayesian Nonparametric Regression", "link": "http://arxiv.org/abs/2311.11487", "description": "The prediction of future insurance claims based on observed risk factors, or\ncovariates, helps the actuary set insurance premiums. Typically, actuaries use\nparametric regression models to predict claims based on the covariate\ninformation. Such models assume the same functional form tying the response to\nthe covariates for each data point. These models are not flexible enough and\ncan fail to accurately capture, at the individual level, the relationship\nbetween the covariates and the claims frequency and severity, which are often\nmultimodal, highly skewed, and heavy-tailed. In this article, we explore the\nuse of Bayesian nonparametric (BNP) regression models to predict claims\nfrequency and severity based on covariates. In particular, we model claims\nfrequency as a mixture of Poisson regression, and the logarithm of claims\nseverity as a mixture of normal regression. We use the Dirichlet process (DP)\nand Pitman-Yor process (PY) as a prior for the mixing distribution over the\nregression parameters. Unlike parametric regression, such models allow each\ndata point to have its individual parameters, making them highly flexible,\nresulting in improved prediction accuracy. We describe model fitting using MCMC\nand illustrate their applicability using French motor insurance claims data."}, "http://arxiv.org/abs/2311.11522": {"title": "A Bayesian two-step multiple imputation approach based on mixed models for the missing in EMA data", "link": "http://arxiv.org/abs/2311.11522", "description": "Ecological Momentary Assessments (EMA) capture real-time thoughts and\nbehaviors in natural settings, producing rich longitudinal data for statistical\nand physiological analyses. However, the robustness of these analyses can be\ncompromised by the large amount of missing data in EMA data sets. To address this,\nmultiple imputation, a method that replaces missing values with several\nplausible alternatives, has become increasingly popular. In this paper, we\nintroduce a two-step Bayesian multiple imputation framework which leverages the\nconfiguration of mixed models. 
We adopt the Random Intercept Linear Mixed\nmodel, the Mixed-effect Location Scale model which accounts for subject\nvariance influenced by covariates and random effects, and the Shared Parameter\nLocation Scale Mixed Effect model which links the missing data to the response\nvariable through a random intercept logistic model, to complete the posterior\ndistribution within the framework. In the simulation study and an application\nto data from a study on caregivers of dementia patients, we further adapt this\ntwo-step Bayesian multiple imputation strategy to handle simultaneous missing\nvariables in EMA data sets and compare the effectiveness of multiple\nimputations across different mixed models. The analyses highlight the\nadvantages of multiple imputations over single imputations. Furthermore, we\npropose two pivotal considerations in selecting the optimal mixed model for the\ntwo-step imputation: the influence of covariates as well as random effects on\nthe within-variance, and the nature of missing data in relation to the response\nvariable."}, "http://arxiv.org/abs/2311.11543": {"title": "A Comparison of Parameter Estimation Methods for Shared Frailty Models", "link": "http://arxiv.org/abs/2311.11543", "description": "This paper compares six different parameter estimation methods for shared\nfrailty models via a series of simulation studies. A shared frailty model is a\nsurvival model that incorporates a random effect term, where the frailties are\ncommon or shared among individuals within specific groups. Several parameter\nestimation methods are available for fitting shared frailty models, such as\npenalized partial likelihood (PPL), expectation-maximization (EM), pseudo full\nlikelihood (PFL), hierarchical likelihood (HL), maximum marginal likelihood\n(MML), and maximization penalized likelihood (MPL) algorithms. These estimation\nmethods are implemented in various R packages, providing researchers with\nvarious options for analyzing clustered survival data using shared frailty\nmodels. However, there is a limited amount of research comparing the\nperformance of these parameter estimation methods for fitting shared frailty\nmodels. Consequently, it can be challenging for users to determine the most\nappropriate method for analyzing clustered survival data. To address this gap,\nthis paper aims to conduct a series of simulation studies to compare the\nperformance of different parameter estimation methods implemented in R\npackages. We will evaluate several key aspects, including parameter estimation,\nbias and variance of the parameter estimates, rate of convergence, and\ncomputational time required by each package. Through this systematic\nevaluation, our goal is to provide a comprehensive understanding of the\nadvantages and limitations associated with each estimation method."}, "http://arxiv.org/abs/2311.11563": {"title": "Time-varying effect in the competing risks based on restricted mean time lost", "link": "http://arxiv.org/abs/2311.11563", "description": "Patients with breast cancer tend to die from other diseases, so for studies\nthat focus on breast cancer, a competing risks model is more appropriate.\nConsidering that the often-used subdistribution hazard ratio is limited by its model\nassumptions and clinical interpretation, we aimed to quantify the effects of\nprognostic factors by an absolute indicator, the difference in restricted mean\ntime lost (RMTL), which is more intuitive. 
Additionally, prognostic factors may\nhave dynamic effects (time-varying effects) in long-term follow-up. However,\nexisting competing risks regression models only provide a static view of\ncovariate effects, leading to a distorted assessment of the prognostic factor.\nTo address this issue, we proposed a dynamic effect RMTL regression that can\nexplore the between-group cumulative difference in mean life lost over a period\nof time and obtain the real-time effect by the speed of accumulation, as well\nas personalized predictions on a time scale. Through Monte Carlo simulation, we\nvalidated the dynamic effects estimated by the proposed regression having low\nbias and a coverage rate of around 95%. Applying this model to an elderly\nearly-stage breast cancer cohort, we found that most factors had different\npatterns of dynamic effects, revealing meaningful physiological mechanisms\nunderlying diseases. Moreover, from the perspective of prediction, the mean\nC-index in external validation reached 0.78. Dynamic effect RMTL regression can\nanalyze both dynamic cumulative effects and real-time effects of covariates,\nproviding a more comprehensive prognosis and better prediction when competing\nrisks exist."}, "http://arxiv.org/abs/2311.11575": {"title": "Testing multivariate normality by testing independence", "link": "http://arxiv.org/abs/2311.11575", "description": "We propose a simple multivariate normality test based on Kac-Bernstein's\ncharacterization, which can be conducted by utilising existing statistical\nindependence tests for sums and differences of data samples. We also perform\nits empirical investigation, which reveals that for high-dimensional data, the\nproposed approach may be more efficient than the alternative ones. The\naccompanying code repository is provided at \\url{https://shorturl.at/rtuy5}."}, "http://arxiv.org/abs/2311.11657": {"title": "Minimax Two-Stage Gradient Boosting for Parameter Estimation", "link": "http://arxiv.org/abs/2311.11657", "description": "Parameter estimation is an important sub-field in statistics and system\nidentification. Various methods for parameter estimation have been proposed in\nthe literature, among which the Two-Stage (TS) approach is particularly\npromising, due to its ease of implementation and reliable estimates. Among the\ndifferent statistical frameworks used to derive TS estimators, the min-max\nframework is attractive due to its mild dependence on prior knowledge about the\nparameters to be estimated. However, the existing implementation of the minimax\nTS approach has currently limited applicability, due to its heavy computational\nload. In this paper, we overcome this difficulty by using a gradient boosting\nmachine (GBM) in the second stage of TS approach. We call the resulting\nalgorithm the Two-Stage Gradient Boosting Machine (TSGBM) estimator. Finally,\nwe test our proposed TSGBM estimator on several numerical examples including\nmodels of dynamical systems."}, "http://arxiv.org/abs/2311.11765": {"title": "Making accurate and interpretable treatment decisions for binary outcomes", "link": "http://arxiv.org/abs/2311.11765", "description": "Optimal treatment rules can improve health outcomes on average by assigning a\ntreatment associated with the most desirable outcome to each individual. Due to\nan unknown data generation mechanism, it is appealing to use flexible models to\nestimate these rules. However, such models often lead to complex and\nuninterpretable rules. 
In this article, we introduce an approach aimed at\nestimating optimal treatment rules that have higher accuracy, higher value, and\nlower loss from the same simple model family. We use a flexible model to\nestimate the optimal treatment rules and a simple model to derive interpretable\ntreatment rules. We provide an extensible definition of interpretability and\npresent a method that - given a class of simple models - can be used to select\na preferred model. We conduct a simulation study to evaluate the performance of\nour approach compared to treatment rules obtained by fitting the same simple\nmodel directly to observed data. The results show that our approach has lower\naverage loss, higher average outcome, and greater power in identifying\nindividuals who can benefit from the treatment. We apply our approach to derive\ntreatment rules of adjuvant chemotherapy in colon cancer patients using cancer\nregistry data. The results show that our approach has the potential to improve\ntreatment decisions."}, "http://arxiv.org/abs/2311.11852": {"title": "Statistical Prediction of Peaks Over a Threshold", "link": "http://arxiv.org/abs/2311.11852", "description": "In many applied fields it is desired to make predictions with the aim of\nassessing the plausibility of more severe events than those already recorded to\nsafeguard against calamities that have not yet occurred. This problem can be\nanalysed using extreme value theory. We consider the popular peaks over a\nthreshold method and show that the generalised Pareto approximation of the true\npredictive densities of both a future unobservable excess or peak random\nvariable can be very accurate. We propose both a frequentist and a Bayesian\napproach for the estimation of such predictive densities. We show the\nasymptotic accuracy of the corresponding estimators and, more importantly,\nprove that the resulting predictive inference is asymptotically reliable. We\nshow the utility of the proposed predictive tools analysing extreme\ntemperatures in Milan in Italy."}, "http://arxiv.org/abs/2311.11922": {"title": "Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix", "link": "http://arxiv.org/abs/2311.11922", "description": "Surrogate index approaches have recently become a popular method of\nestimating longer-term impact from shorter-term outcomes. In this paper, we\nleverage 1098 test arms from 200 A/B tests at Netflix to empirically\ninvestigate to what degree would decisions made using a surrogate index\nutilizing 14 days of data would align with those made using direct measurement\nof day 63 treatment effects. Focusing specifically on linear \"auto-surrogate\"\nmodels that utilize the shorter-term observations of the long-term outcome of\ninterest, we find that the statistical inferences that we would draw from using\nthe surrogate index are ~95% consistent with those from directly measuring the\nlong-term treatment effect. Moreover, when we restrict ourselves to the set of\ntests that would be \"launched\" (i.e. 
positive and statistically significant)\nbased on the 63-day directly measured treatment effects, we find that relying\ninstead on the surrogate index achieves 79% and 65% recall."}, "http://arxiv.org/abs/2311.12016": {"title": "Bayesian Semiparametric Estimation of Heterogeneous Effects in Matched Case-Control Studies with an Application to Alzheimer's Disease and Heat", "link": "http://arxiv.org/abs/2311.12016", "description": "Epidemiological approaches for examining human health responses to\nenvironmental exposures in observational studies often control for confounding\nby implementing clever matching schemes and using statistical methods based on\nconditional likelihood. Nonparametric regression models have surged in\npopularity in recent years as a tool for estimating individual-level\nheterogeneous effects, which provide a more detailed picture of the\nexposure-response relationship but can also be aggregated to obtain improved\nmarginal estimates at the population level. In this work we incorporate\nBayesian additive regression trees (BART) into the conditional logistic\nregression model to identify heterogeneous effects of environmental exposures\nin a case-crossover design. Conditional logistic BART (CL-BART) utilizes\nreversible jump Markov chain Monte Carlo to bypass the conditional conjugacy\nrequirement of the original BART algorithm. Our work is motivated by the\ngrowing interest in identifying subpopulations more vulnerable to environmental\nexposures. We apply CL-BART to a study of the impact of heatwaves on people\nwith Alzheimer's Disease in California and effect modification by other chronic\nconditions. Through this application, we also describe strategies to examine\nheterogeneous odds ratios through variable importance, partial dependence, and\nlower-dimensional summaries. CL-BART is available in the clbart R package."}, "http://arxiv.org/abs/2311.12020": {"title": "A Heterogeneous Spatial Model for Soil Carbon Mapping of the Contiguous United States Using VNIR Spectra", "link": "http://arxiv.org/abs/2311.12020", "description": "The Rapid Carbon Assessment, conducted by the U.S. Department of Agriculture,\nwas implemented in order to obtain a representative sample of soil organic\ncarbon across the contiguous United States. In conjunction with a statistical\nmodel, the dataset allows for mapping of soil carbon prediction across the\nU.S., however there are two primary challenges to such an effort. First, there\nexists a large degree of heterogeneity in the data, whereby both the first and\nsecond moments of the data generating process seem to vary both spatially and\nfor different land-use categories. Second, the majority of the sampled\nlocations do not actually have lab measured values for soil organic carbon.\nRather, visible and near-infrared (VNIR) spectra were measured at most\nlocations, which act as a proxy to help predict carbon content. Thus, we\ndevelop a heterogeneous model to analyze this data that allows both the mean\nand the variance to vary as a function of space as well as land-use category,\nwhile incorporating VNIR spectra as covariates. After a cross-validation study\nthat establishes the effectiveness of the model, we construct a complete map of\nsoil organic carbon for the contiguous U.S. 
along with uncertainty\nquantification."}, "http://arxiv.org/abs/1909.10678": {"title": "A Bayesian Approach to Directed Acyclic Graphs with a Candidate Graph", "link": "http://arxiv.org/abs/1909.10678", "description": "Directed acyclic graphs represent the dependence structure among variables.\nWhen learning these graphs from data, different amounts of information may be\navailable for different edges. Although many methods have been developed to\nlearn the topology of these graphs, most of them do not provide a measure of\nuncertainty in the inference. We propose a Bayesian method, baycn (BAYesian\nCausal Network), to estimate the posterior probability of three states for each\nedge: present with one direction ($X \\rightarrow Y$), present with the opposite\ndirection ($X \\leftarrow Y$), and absent. Unlike existing Bayesian methods, our\nmethod requires that the prior probabilities of these states be specified, and\ntherefore provides a benchmark for interpreting the posterior probabilities. We\ndevelop a fast Metropolis-Hastings Markov chain Monte Carlo algorithm for the\ninference. Our algorithm takes as input the edges of a candidate graph, which\nmay be the output of another graph inference method and may contain false\nedges. In simulation studies our method achieves high accuracy with small\nvariation across different scenarios and is comparable or better than existing\nBayesian methods. We apply baycn to genomic data to distinguish the direct and\nindirect targets of genetic variants."}, "http://arxiv.org/abs/2007.05748": {"title": "Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-As-Model", "link": "http://arxiv.org/abs/2007.05748", "description": "Note: Published now as a chapter in \"Handbook of the History and Philosophy\nof Mathematical Practice\" (Springer Nature, editor B. Sriraman,\nhttps://doi.org/10.1007/978-3-030-19071-2_105-1).\n\nThe application of mathematical probability theory in statistics is quite\ncontroversial. Controversies regard both the interpretation of probability, and\napproaches to statistical inference. After having given an overview of the main\napproaches, I will propose a re-interpretation of frequentist probability. Most\nstatisticians are aware that probability models interpreted in a frequentist\nmanner are not really true in objective reality, but only idealisations. I\nargue that this is often ignored when actually applying frequentist methods and\ninterpreting the results, and that keeping up the awareness for the essential\ndifference between reality and models can lead to a more appropriate use and\ninterpretation of frequentist models and methods, called\n\"frequentism-as-model\". This is elaborated showing connections to existing\nwork, appreciating the special role of independently and identically\ndistributed observations and subject matter knowledge, giving an account of how\nand under what conditions models that are not true can be useful, giving\ndetailed interpretations of tests and confidence intervals, confronting their\nimplicit compatibility logic with the inverse probability logic of Bayesian\ninference, re-interpreting the role of model assumptions, appreciating\nrobustness, and the role of \"interpretative equivalence\" of models. 
Epistemic\nprobability shares the issue that its models are only idealisations, and an\nanalogous \"epistemic-probability-as-model\" can also be developed."}, "http://arxiv.org/abs/2008.09434": {"title": "Correcting a Nonparametric Two-sample Graph Hypothesis Test for Graphs with Different Numbers of Vertices with Applications to Connectomics", "link": "http://arxiv.org/abs/2008.09434", "description": "Random graphs are statistical models that have many applications, ranging\nfrom neuroscience to social network analysis. Of particular interest in some\napplications is the problem of testing two random graphs for equality of\ngenerating distributions. Tang et al. (2017) propose a test for this setting.\nThis test consists of embedding the graph into a low-dimensional space via the\nadjacency spectral embedding (ASE) and subsequently using a kernel two-sample\ntest based on the maximum mean discrepancy. However, if the two graphs being\ncompared have an unequal number of vertices, the test of Tang et al. (2017) may\nnot be valid. We demonstrate the intuition behind this invalidity and propose a\ncorrection that makes any subsequent kernel- or distance-based test valid. Our\nmethod relies on sampling based on the asymptotic distribution for the ASE. We\ncall these altered embeddings the corrected adjacency spectral embeddings\n(CASE). We also show that CASE remedies the exchangeability problem of the\noriginal test and demonstrate the validity and consistency of the test that\nuses CASE via a simulation study. Lastly, we apply our proposed test to the\nproblem of determining equivalence of generating distributions in human\nconnectomes extracted from diffusion magnetic resonance imaging (dMRI) at\ndifferent scales."}, "http://arxiv.org/abs/2008.10055": {"title": "Multiple Network Embedding for Anomaly Detection in Time Series of Graphs", "link": "http://arxiv.org/abs/2008.10055", "description": "This paper considers the graph signal processing problem of anomaly detection\nin time series of graphs. We examine two related, complementary inference\ntasks: the detection of anomalous graphs within a time series, and the\ndetection of temporally anomalous vertices. We approach these tasks via the\nadaptation of statistically principled methods for joint graph inference,\nspecifically \\emph{multiple adjacency spectral embedding} (MASE). We\ndemonstrate that our method is effective for our inference tasks. Moreover, we\nassess the performance of our method in terms of the underlying nature of\ndetectable anomalies. We further provide the theoretical justification for our\nmethod and insight into its use. Applied to the Enron communication graph and a\nlarge-scale commercial search engine time series of graphs, our approaches\ndemonstrate their applicability and identify the anomalous vertices beyond just\nlarge degree change."}, "http://arxiv.org/abs/2011.06127": {"title": "Generalized Kernel Two-Sample Tests", "link": "http://arxiv.org/abs/2011.06127", "description": "Kernel two-sample tests have been widely used for multivariate data to test\nequality of distributions. However, existing tests based on mapping\ndistributions into a reproducing kernel Hilbert space mainly target specific\nalternatives and do not work well for some scenarios when the dimension of the\ndata is moderate to high due to the curse of dimensionality. 
We propose a new\ntest statistic that makes use of a common pattern under moderate and high\ndimensions and achieves substantial power improvements over existing kernel\ntwo-sample tests for a wide range of alternatives. We also propose alternative\ntesting procedures that maintain high power with low computational cost,\noffering easy off-the-shelf tools for large datasets. The new approaches are\ncompared to other state-of-the-art tests under various settings and show good\nperformance. We showcase the new approaches through two applications: The\ncomparison of musks and non-musks using the shape of molecules, and the\ncomparison of taxi trips starting from John F. Kennedy airport in consecutive\nmonths. All proposed methods are implemented in an R package kerTests."}, "http://arxiv.org/abs/2108.00306": {"title": "A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions", "link": "http://arxiv.org/abs/2108.00306", "description": "With advances in scientific computing and mathematical modeling, complex\nscientific phenomena such as galaxy formations and rocket propulsion can now be\nreliably simulated. Such simulations can however be very time-intensive,\nrequiring millions of CPU hours to perform. One solution is multi-fidelity\nemulation, which uses data of different fidelities to train an efficient\npredictive model which emulates the expensive simulator. For complex scientific\nproblems and with careful elicitation from scientists, such multi-fidelity data\nmay often be linked by a directed acyclic graph (DAG) representing its\nscientific model dependencies. We thus propose a new Graphical Multi-fidelity\nGaussian Process (GMGP) model, which embeds this DAG structure (capturing\nscientific dependencies) within a Gaussian process framework. We show that the\nGMGP has desirable modeling traits via two Markov properties, and admits a\nscalable algorithm for recursive computation of the posterior mean and variance\nat each depth level of the DAG. We also present a novel experimental\ndesign methodology over the DAG given an experimental budget, and propose a\nnonlinear extension of the GMGP via deep Gaussian processes. The advantages of\nthe GMGP are then demonstrated via a suite of numerical experiments and an\napplication to emulation of heavy-ion collisions, which can be used to study\nthe conditions of matter in the Universe shortly after the Big Bang. The\nproposed model has broader uses in data fusion applications with graphical\nstructure, which we further discuss."}, "http://arxiv.org/abs/2202.06188": {"title": "Testing the number of common factors by bootstrapped sample covariance matrix in high-dimensional factor models", "link": "http://arxiv.org/abs/2202.06188", "description": "This paper studies the impact of the bootstrap procedure on the eigenvalue\ndistributions of the sample covariance matrix under a high-dimensional factor\nstructure. We provide asymptotic distributions for the top eigenvalues of the\nbootstrapped sample covariance matrix under mild conditions. After bootstrap,\nthe spiked eigenvalues which are driven by common factors will converge weakly\nto Gaussian limits after proper scaling and centralization. However, the\nlargest non-spiked eigenvalue is mainly determined by the order statistics of\nthe bootstrap resampling weights, and follows an extreme value distribution. Based\non the disparate behavior of the spiked and non-spiked eigenvalues, we propose\ninnovative methods to test the number of common factors. 
As indicated by extensive\nnumerical and empirical studies, the proposed methods perform reliably and\nconvincingly in the presence of both weak factors and cross-sectionally\ncorrelated errors. Our technical details contribute to random matrix theory on\nthe spiked covariance model with convexly decaying density and unbounded support,\nor with general elliptical distributions."}, "http://arxiv.org/abs/2203.04689": {"title": "Tensor Completion for Causal Inference with Multivariate Longitudinal Data: A Reevaluation of COVID-19 Mandates", "link": "http://arxiv.org/abs/2203.04689", "description": "We propose a new method that uses tensor completion to estimate causal\neffects with multivariate longitudinal data, data in which multiple outcomes\nare observed for each unit and time period. Our motivation is to estimate the\nnumber of COVID-19 fatalities prevented by government mandates such as travel\nrestrictions, mask-wearing directives, and vaccination requirements. In\naddition to COVID-19 fatalities, we observe related outcomes such as the number\nof fatalities from other diseases and injuries. The proposed method arranges\nthe data as a tensor with three dimensions (unit, time, and outcome) and uses\ntensor completion to impute the missing counterfactual outcomes. We first prove\nthat under general conditions, combining multiple outcomes using the proposed\nmethod improves the accuracy of counterfactual imputations. We then compare the\nproposed method to other approaches commonly used to evaluate COVID-19\nmandates. Our main finding is that other approaches overestimate the effect of\nmask-wearing directives and that mask-wearing directives were not an\neffective alternative to travel restrictions. We conclude that while the\nproposed method can be applied whenever multivariate longitudinal data are\navailable, we believe it is particularly timely as governments increasingly\nrely on longitudinal data to choose among policies such as mandates during\npublic health emergencies."}, "http://arxiv.org/abs/2203.10775": {"title": "Modified Method of Moments for Generalized Laplace Distribution", "link": "http://arxiv.org/abs/2203.10775", "description": "In this note, we consider the performance of the classic method of moments\nfor parameter estimation of symmetric variance-gamma (generalized Laplace)\ndistributions. We do this through both theoretical analysis (multivariate delta\nmethod) and a comprehensive simulation study with comparison to maximum\nlikelihood estimation, finding performance is often unsatisfactory. In\naddition, we modify the method of moments by taking absolute moments to improve\nefficiency; in particular, our simulation studies demonstrate that our modified\nestimators have significantly improved performance for parameter values\ntypically encountered in financial modelling, and are also competitive with\nmaximum likelihood estimation."}, "http://arxiv.org/abs/2208.14989": {"title": "Learning Multiscale Non-stationary Causal Structures", "link": "http://arxiv.org/abs/2208.14989", "description": "This paper addresses a gap in the current state of the art by providing a\nsolution for modeling causal relationships that evolve over time and occur at\ndifferent time scales. Specifically, we introduce the multiscale non-stationary\ndirected acyclic graph (MN-DAG), a framework for modeling multivariate time\nseries data. Our contribution is twofold. 
Firstly, we expose a probabilistic\ngenerative model by leveraging results from spectral and causality theories.\nOur model allows sampling an MN-DAG according to user-specified priors on the\ntime-dependence and multiscale properties of the causal graph. Secondly, we\ndevise a Bayesian method named Multiscale Non-stationary Causal Structure\nLearner (MN-CASTLE) that uses stochastic variational inference to estimate\nMN-DAGs. The method also exploits information from the local partial\ncorrelation between time series over different time resolutions. The data\ngenerated from an MN-DAG reproduces well-known features of time series in\ndifferent domains, such as volatility clustering and serial correlation.\nAdditionally, we show the superior performance of MN-CASTLE on synthetic data\nwith different multiscale and non-stationary properties compared to baseline\nmodels. Finally, we apply MN-CASTLE to identify the drivers of the natural gas\nprices in the US market. Causal relationships have strengthened during the\nCOVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline\nmethods fail to capture. MN-CASTLE identifies the causal impact of critical\neconomic drivers on natural gas prices, such as seasonal factors, economic\nuncertainty, oil prices, and gas storage deviations."}, "http://arxiv.org/abs/2301.06333": {"title": "Functional concurrent regression with compositional covariates and its application to the time-varying effect of causes of death on human longevity", "link": "http://arxiv.org/abs/2301.06333", "description": "Multivariate functional data that are cross-sectionally compositional data\nare attracting increasing interest in the statistical modeling literature, a\nmajor example being trajectories over time of compositions derived from\ncause-specific mortality rates. In this work, we develop a novel functional\nconcurrent regression model in which independent variables are functional\ncompositions. This allows us to investigate the relationship over time between\nlife expectancy at birth and compositions derived from cause-specific mortality\nrates of four distinct age classes, namely 0--4, 5--39, 40--64 and 65+ in 25\ncountries. A penalized approach is developed to estimate the regression\ncoefficients and select the relevant variables. Then an efficient computational\nstrategy based on an augmented Lagrangian algorithm is derived to solve the\nresulting optimization problem. The good performances of the model in\npredicting the response function and estimating the unknown functional\ncoefficients are shown in a simulation study. The results on real data confirm\nthe important role of neoplasms and cardiovascular diseases in determining life\nexpectancy emerged in other studies and reveal several other contributions not\nyet observed."}, "http://arxiv.org/abs/2305.00961": {"title": "DIF Analysis with Unknown Groups and Anchor Items", "link": "http://arxiv.org/abs/2305.00961", "description": "Ensuring fairness in instruments like survey questionnaires or educational\ntests is crucial. One way to address this is by a Differential Item Functioning\n(DIF) analysis, which examines if different subgroups respond differently to a\nparticular item, controlling for their overall latent construct level. DIF\nanalysis is typically conducted to assess measurement invariance at the item\nlevel. 
Traditional DIF analysis methods require knowing the comparison groups\n(reference and focal groups) and anchor items (a subset of DIF-free items).\nSuch prior knowledge may not always be available, and psychometric methods have\nbeen proposed for DIF analysis when one piece of information is unknown. More\nspecifically, when the comparison groups are unknown while anchor items are\nknown, latent DIF analysis methods have been proposed that estimate the unknown\ngroups by latent classes. When anchor items are unknown while comparison groups\nare known, methods have also been proposed, typically under a sparsity\nassumption -- the number of DIF items is not too large. However, DIF analysis\nwhen both pieces of information are unknown has not received much attention.\nThis paper proposes a general statistical framework under this setting. In the\nproposed framework, we model the unknown groups by latent classes and introduce\nitem-specific DIF parameters to capture the DIF effects. Assuming the number of\nDIF items is relatively small, an $L_1$-regularised estimator is proposed to\nsimultaneously identify the latent classes and the DIF items. A computationally\nefficient Expectation-Maximisation (EM) algorithm is developed to solve the\nnon-smooth optimisation problem for the regularised estimator. The performance\nof the proposed method is evaluated by simulation studies and an application to\nitem response data from a real-world educational test."}, "http://arxiv.org/abs/2306.14302": {"title": "Improved LM Test for Robust Model Specification Searches in Covariance Structure Analysis", "link": "http://arxiv.org/abs/2306.14302", "description": "Model specification searches and modifications are commonly employed in\ncovariance structure analysis (CSA) or structural equation modeling (SEM) to\nimprove the goodness-of-fit. However, these practices can be susceptible to\ncapitalizing on chance, as a model that fits one sample may not generalize to\nanother sample from the same population. This paper introduces the improved\nLagrange Multipliers (LM) test, which provides a reliable method for conducting\na thorough model specification search and effectively identifying missing\nparameters. By leveraging the stepwise bootstrap method in the standard LM and\nWald tests, our data-driven approach enhances the accuracy of parameter\nidentification. The results from Monte Carlo simulations and two empirical\napplications in political science demonstrate the effectiveness of the improved\nLM test, particularly when dealing with small sample sizes and models with\nlarge degrees of freedom. This approach contributes to better statistical fit\nand addresses the issue of capitalization on chance in model specification."}, "http://arxiv.org/abs/2308.13069": {"title": "The diachronic Bayesian", "link": "http://arxiv.org/abs/2308.13069", "description": "It is well known that a Bayesian probability forecast for the future\nobservations should form a probability measure in order to satisfy a natural\ncondition of coherence. The topic of this paper is the evolution of the\nBayesian probability measure over time. We model the process of updating the\nBayesian's beliefs in terms of prediction markets. 
The resulting picture is\nadapted to forecasting several steps ahead and making almost optimal decisions."}, "http://arxiv.org/abs/2311.12214": {"title": "Random Fourier Signature Features", "link": "http://arxiv.org/abs/2311.12214", "description": "Tensor algebras give rise to one of the most powerful measures of similarity\nfor sequences of arbitrary length called the signature kernel accompanied with\nattractive theoretical guarantees from stochastic analysis. Previous algorithms\nto compute the signature kernel scale quadratically in terms of the length and\nthe number of the sequences. To mitigate this severe computational bottleneck,\nwe develop a random Fourier feature-based acceleration of the signature kernel\nacting on the inherently non-Euclidean domain of sequences. We show uniform\napproximation guarantees for the proposed unbiased estimator of the signature\nkernel, while keeping its computation linear in the sequence length and number.\nIn addition, combined with recent advances on tensor projections, we derive two\neven more scalable time series features with favourable concentration\nproperties and computational complexity both in time and memory. Our empirical\nresults show that the reduction in computational cost comes at a negligible\nprice in terms of accuracy on moderate-sized datasets, and it enables one to\nscale to large datasets up to a million time series."}, "http://arxiv.org/abs/2311.12252": {"title": "Sensitivity analysis with multiple treatments and multiple outcomes with applications to air pollution mixtures", "link": "http://arxiv.org/abs/2311.12252", "description": "Understanding the health impacts of air pollution is vital in public health\nresearch. Numerous studies have estimated negative health effects of a variety\nof pollutants, but accurately gauging these impacts remains challenging due to\nthe potential for unmeasured confounding bias that is ubiquitous in\nobservational studies. In this study, we develop a framework for sensitivity\nanalysis in settings with both multiple treatments and multiple outcomes\nsimultaneously. This setting is of particular interest because one can identify\nthe strength of association between the unmeasured confounders and both the\ntreatment and outcome, under a factor confounding assumption. This provides\ninformative bounds on the causal effect leading to partial identification\nregions for the effects of multivariate treatments that account for the maximum\npossible bias from unmeasured confounding. We also show that when negative\ncontrols are available, we are able to refine the partial identification\nregions substantially, and in certain cases, even identify the causal effect in\nthe presence of unmeasured confounding. We derive partial identification\nregions for general estimands in this setting, and develop a novel\ncomputational approach to finding these regions."}, "http://arxiv.org/abs/2311.12293": {"title": "Sample size calculation based on the difference in restricted mean time lost for clinical trials with competing risks", "link": "http://arxiv.org/abs/2311.12293", "description": "Computation of sample size is important when designing clinical trials. The\npresence of competing risks makes the design of clinical trials with\ntime-to-event endpoints cumbersome. A model based on the subdistribution hazard\nratio (SHR) is commonly used for trials under competing risks. However, this\napproach has some limitations related to model assumptions and clinical\ninterpretation. 
Considering such limitations, the difference in restricted mean\ntime lost (RMTLd) is recommended as an alternative indicator. In this paper, we\npropose a sample size calculation method based on the RMTLd for the Weibull\ndistribution (RMTLdWeibull) for clinical trials, which considers experimental\nconditions such as equal allocation, uniform accrual, uniform loss to\nfollow-up, and administrative censoring. Simulation results show that sample\nsize calculation based on the RMTLdWeibull can generally achieve a predefined\npower level and maintain relative robustness. Moreover, the performance of the\nsample size calculation based on the RMTLdWeibull is similar or superior to\nthat based on the SHR. Even if the event time does not follow the Weibull\ndistribution, the sample size calculation based on the RMTLdWeibull still\nperforms well. The results also verify the performance of the sample size\ncalculation method based on the RMTLdWeibull. From the perspective of the\nresults of this study, clinical interpretation, application conditions and\nstatistical performance, we recommend that when designing clinical trials in\nthe presence of competing risks, the RMTLd indicator be applied for sample size\ncalculation and subsequent effect size measurement."}, "http://arxiv.org/abs/2311.12347": {"title": "Bayesian Cluster Geographically Weighted Regression for Spatial Heterogeneous Data", "link": "http://arxiv.org/abs/2311.12347", "description": "Spatial statistical models are commonly used in geographical scenarios to\nensure spatial variation is captured effectively. However, spatial models and\ncluster algorithms can be complicated and expensive. This paper pursues three\nmain objectives. First, it introduces covariate effect clustering by\nintegrating a Bayesian Geographically Weighted Regression (BGWR) with a\nGaussian mixture model and the Dirichlet process mixture model. Second, this\npaper examines situations in which a particular covariate holds significant\nimportance in one region but not in another in the Bayesian framework. Lastly,\nit addresses computational challenges present in existing BGWR, leading to\nnotable enhancements in Markov chain Monte Carlo estimation suitable for large\nspatial datasets. The efficacy of the proposed method is demonstrated using\nsimulated data and is further validated in a case study examining children's\ndevelopment domains in Queensland, Australia, using data provided by Children's\nHealth Queensland and Australia's Early Development Census."}, "http://arxiv.org/abs/2311.12380": {"title": "A multivariate adaptation of direct kernel estimator of density ratio", "link": "http://arxiv.org/abs/2311.12380", "description": "\\'Cwik and Mielniczuk (1989) introduced a univariate kernel density ratio\nestimator, which directly estimates the ratio without estimating the two\ndensities of interest. This study presents its straightforward multivariate\nadaptation."}, "http://arxiv.org/abs/2311.12392": {"title": "Individualized Dynamic Latent Factor Model with Application to Mobile Health Data", "link": "http://arxiv.org/abs/2311.12392", "description": "Mobile health has emerged as a major success in tracking individual health\nstatus, due to the popularity and power of smartphones and wearable devices.\nThis has also brought great challenges in handling heterogeneous,\nmulti-resolution data which arise ubiquitously in mobile health due to\nirregular multivariate measurements collected from individuals. 
In this paper,\nwe propose an individualized dynamic latent factor model for irregular\nmulti-resolution time series data to interpolate unsampled measurements of time\nseries with low resolution. One major advantage of the proposed method is the\ncapability to integrate multiple irregular time series and multiple subjects by\nmapping the multi-resolution data to the latent space. In addition, the\nproposed individualized dynamic latent factor model is applicable to capturing\nheterogeneous longitudinal information through individualized dynamic latent\nfactors. In theory, we provide the integrated interpolation error bound of the\nproposed estimator and derive the convergence rate with B-spline approximation\nmethods. Both the simulation studies and the application to smartwatch data\ndemonstrate the superior performance of the proposed method compared to\nexisting methods."}, "http://arxiv.org/abs/2311.12452": {"title": "Multi-indication evidence synthesis in oncology health technology assessment", "link": "http://arxiv.org/abs/2311.12452", "description": "Background: Cancer drugs receive licensing extensions to include additional\nindications as trial evidence on treatment effectiveness accumulates. We\ninvestigate how sharing information across indications can strengthen the\ninferences supporting Health Technology Assessment (HTA). Methods: We applied\nmeta-analytic methods to randomised trial data on bevacizumab to share\ninformation across cancer indications on the treatment effect on overall\nsurvival (OS) or progression-free survival (PFS), and on the surrogate\nrelationship between effects on PFS and OS. Common or random parameters were\nused to facilitate sharing and the further flexibility of mixture models was\nexplored. Results: OS treatment effects lacked precision when pooling data\navailable at present-day within each indication, particularly for indications\nwith few trials. There was no suggestion of heterogeneity across indications.\nSharing information across indications provided more precise inferences on\ntreatment effects, and on surrogacy parameters, with the strength of sharing\ndepending on the model. When a surrogate relationship was used to predict OS\neffects, uncertainty was only reduced with sharing imposed on PFS effects in\naddition to surrogacy parameters. Corresponding analyses using the earlier,\nsparser evidence available for particular HTAs showed that sharing on both\nsurrogacy and PFS effects did not notably reduce uncertainty in OS predictions.\nLimited heterogeneity across indications meant that the added flexibility of\nmixture models was unnecessary. Conclusions: Meta-analysis methods can be\nusefully applied to share information on treatment effectiveness across\nindications to increase the precision of target indication estimates in HTA.\nSharing on surrogate relationships requires caution, as meaningful precision\ngains require larger bodies of evidence and clear support for surrogacy from\nother indications."}, "http://arxiv.org/abs/2311.12597": {"title": "Optimal Functional Bilinear Regression with Two-way Functional Covariates via Reproducing Kernel Hilbert Space", "link": "http://arxiv.org/abs/2311.12597", "description": "Traditional functional linear regression usually takes a one-dimensional\nfunctional predictor as input and estimates the continuous coefficient\nfunction. Modern applications often generate two-dimensional covariates, which\nbecome matrices when observed at grid points. 
To avoid the inefficiency of the\nclassical method involving estimation of a two-dimensional coefficient\nfunction, we propose a functional bilinear regression model, and introduce an\ninnovative three-term penalty to impose roughness penalty in the estimation.\nThe proposed estimator exhibits minimax optimal property for prediction under\nthe framework of reproducing kernel Hilbert space. An iterative generalized\ncross-validation approach is developed to choose tuning parameters, which\nsignificantly improves the computational efficiency over the traditional\ncross-validation approach. The statistical and computational advantages of the\nproposed method over existing methods are further demonstrated via simulated\nexperiments, the Canadian weather data, and a biochemical long-range infrared\nlight detection and ranging data."}, "http://arxiv.org/abs/2311.12685": {"title": "A Graphical Comparison of Screening Designs using Support Recovery Probabilities", "link": "http://arxiv.org/abs/2311.12685", "description": "A screening experiment attempts to identify a subset of important effects\nusing a relatively small number of experimental runs. Given the limited run\nsize and a large number of possible effects, penalized regression is a popular\ntool used to analyze screening designs. In particular, an automated\nimplementation of the Gauss-Dantzig selector has been widely recommended to\ncompare screening design construction methods. Here, we illustrate potential\nreproducibility issues that arise when comparing screening designs via\nsimulation, and recommend a graphical method, based on screening probabilities,\nwhich compares designs by evaluating them along the penalized regression\nsolution path. This method can be implemented using simulation, or, in the case\nof lasso, by using exact local lasso sign recovery probabilities. Our approach\ncircumvents the need to specify tuning parameters associated with\nregularization methods, leading to more reliable design comparisons. This\narticle contains supplementary materials including code to implement the\nproposed methods."}, "http://arxiv.org/abs/2311.12717": {"title": "Phylogenetic least squares estimation without genetic distances", "link": "http://arxiv.org/abs/2311.12717", "description": "Least squares estimation of phylogenies is an established family of methods\nwith good statistical properties. State-of-the-art least squares phylogenetic\nestimation proceeds by first estimating a distance matrix, which is then used\nto determine the phylogeny by minimizing a squared-error loss function. Here,\nwe develop a method for least squares phylogenetic inference that does not rely\non a pre-estimated distance matrix. Our approach allows us to circumvent the\ntypical need to first estimate a distance matrix by forming a new loss function\ninspired by the phylogenetic likelihood score function; in this manner,\ninference is not based on a summary statistic of the sequence data, but\ndirectly on the sequence data itself. We use a Jukes-Cantor substitution model\nto show that our method leads to improvements over ordinary least squares\nphylogenetic inference, and is even observed to rival maximum likelihood\nestimation in terms of topology estimation efficiency. Using a Kimura\n2-parameter model, we show that our method also allows for estimation of the\nglobal transition/transversion ratio simultaneously with the phylogeny and its\nbranch lengths. This is impossible to accomplish with any other distance-based\nmethod as far as we know. 
Our developments pave the way for more optimal\nphylogenetic inference under the least squares framework, particularly in\nsettings under which likelihood-based inference is infeasible, including when\none desires to build a phylogeny based on information provided by only a subset\nof all possible nucleotide substitutions such as synonymous or non-synonymous\nsubstitutions."}, "http://arxiv.org/abs/2311.12726": {"title": "Nonparametric variable importance for time-to-event outcomes with application to prediction of HIV infection", "link": "http://arxiv.org/abs/2311.12726", "description": "In survival analysis, complex machine learning algorithms have been\nincreasingly used for predictive modeling. Given a collection of features\navailable for inclusion in a predictive model, it may be of interest to\nquantify the relative importance of a subset of features for the prediction\ntask at hand. In particular, in HIV vaccine trials, participant baseline\ncharacteristics are used to predict the probability of infection over the\nintended follow-up period, and investigators may wish to understand how much\ncertain types of predictors, such as behavioral factors, contribute toward\noverall predictiveness. Time-to-event outcomes such as time to infection are\noften subject to right censoring, and existing methods for assessing variable\nimportance are typically not intended to be used in this setting. We describe a\nbroad class of algorithm-agnostic variable importance measures for prediction\nin the context of survival data. We propose a nonparametric efficient\nestimation procedure that incorporates flexible learning of nuisance\nparameters, yields asymptotically valid inference, and enjoys\ndouble-robustness. We assess the performance of our proposed procedure via\nnumerical simulations and analyze data from the HVTN 702 study to inform\nenrollment strategies for future HIV vaccine trials."}, "http://arxiv.org/abs/2202.12989": {"title": "Flexible variable selection in the presence of missing data", "link": "http://arxiv.org/abs/2202.12989", "description": "In many applications, it is of interest to identify a parsimonious set of\nfeatures, or panel, from multiple candidates that achieves a desired level of\nperformance in predicting a response. This task is often complicated in\npractice by missing data arising from the sampling design or other random\nmechanisms. Most recent work on variable selection in missing data contexts\nrelies in some part on a finite-dimensional statistical model, e.g., a\ngeneralized or penalized linear model. In cases where this model is\nmisspecified, the selected variables may not all be truly scientifically\nrelevant and can result in panels with suboptimal classification performance.\nTo address this limitation, we propose a nonparametric variable selection\nalgorithm combined with multiple imputation to develop flexible panels in the\npresence of missing-at-random data. We outline strategies based on the proposed\nalgorithm that achieve control of commonly used error rates. Through\nsimulations, we show that our proposal has good operating characteristics and\nresults in panels with higher classification and variable selection performance\ncompared to several existing penalized regression approaches in cases where a\ngeneralized linear model is misspecified. 
Finally, we use the proposed method\nto develop biomarker panels for separating pancreatic cysts with differing\nmalignancy potential in a setting where complicated missingness in the\nbiomarkers arose due to limited specimen volumes."}, "http://arxiv.org/abs/2206.12773": {"title": "Scalable and optimal Bayesian inference for sparse covariance matrices via screened beta-mixture prior", "link": "http://arxiv.org/abs/2206.12773", "description": "In this paper, we propose a scalable Bayesian method for sparse covariance\nmatrix estimation by incorporating a continuous shrinkage prior with a\nscreening procedure. In the first step of the procedure, the off-diagonal\nelements with small correlations are screened based on their sample\ncorrelations. In the second step, the posterior of the covariance with the\nscreened elements fixed at $0$ is computed with the beta-mixture prior. The\nscreened elements of the covariance significantly increase the efficiency of\nthe posterior computation. The simulation studies and real data applications\nshow that the proposed method can be used for the high-dimensional problem with\nthe `large p, small n'. In some examples in this paper, the proposed method can\nbe computed in a reasonable amount of time, while no other existing Bayesian\nmethods can be. The proposed method also has sound theoretical properties. The\nscreening procedure has the sure screening property and the selection\nconsistency, and the posterior has the optimal minimax or nearly minimax\nconvergence rate under the Frobenius norm."}, "http://arxiv.org/abs/2209.05474": {"title": "Consistent Selection of the Number of Groups in Panel Models via Sample-Splitting", "link": "http://arxiv.org/abs/2209.05474", "description": "Group number selection is a key question for group panel data modelling. In\nthis work, we develop a cross validation method to tackle this problem.\nSpecifically, we split the panel data into a training dataset and a testing\ndataset along the time span. We first use the training dataset to estimate the\nparameters and group memberships. Then we apply the fitted model to the testing\ndataset, and the group number is estimated by minimizing certain loss\nfunction values on the testing dataset. We design the loss functions for panel\ndata models either with or without fixed effects. The proposed method has two\nadvantages. First, the method is totally data-driven, and thus no further tuning\nparameters are involved. Second, the method can be flexibly applied to a wide\nrange of panel data models. Theoretically, we establish the estimation\nconsistency by taking advantage of the optimization property of the estimation\nalgorithm. Experiments on a variety of synthetic and empirical datasets are\ncarried out to further illustrate the advantages of the proposed method."}, "http://arxiv.org/abs/2301.07276": {"title": "Data thinning for convolution-closed distributions", "link": "http://arxiv.org/abs/2301.07276", "description": "We propose data thinning, an approach for splitting an observation into two\nor more independent parts that sum to the original observation, and that follow\nthe same distribution as the original observation, up to a (known) scaling of a\nparameter. This very general proposal is applicable to any convolution-closed\ndistribution, a class that includes the Gaussian, Poisson, negative binomial,\ngamma, and binomial distributions, among others. Data thinning has a number of\napplications to model selection, evaluation, and inference. 
For instance,\ncross-validation via data thinning provides an attractive alternative to the\nusual approach of cross-validation via sample splitting, especially in settings\nin which the latter is not applicable. In simulations and in an application to\nsingle-cell RNA-sequencing data, we show that data thinning can be used to\nvalidate the results of unsupervised learning approaches, such as k-means\nclustering and principal components analysis, for which traditional sample\nsplitting is unattractive or unavailable."}, "http://arxiv.org/abs/2303.12401": {"title": "Real-time forecasting within soccer matches through a Bayesian lens", "link": "http://arxiv.org/abs/2303.12401", "description": "This paper employs a Bayesian methodology to predict the results of soccer\nmatches in real-time. Using sequential data of various events throughout the\nmatch, we utilize a multinomial probit regression in a novel framework to\nestimate the time-varying impact of covariates and to forecast the outcome.\nEnglish Premier League data from eight seasons are used to evaluate the\nefficacy of our method. Different evaluation metrics establish that the\nproposed model outperforms potential competitors inspired by existing\nstatistical or machine learning algorithms. Additionally, we apply robustness\nchecks to demonstrate the model's accuracy across various scenarios."}, "http://arxiv.org/abs/2305.10524": {"title": "Dynamic Matrix Recovery", "link": "http://arxiv.org/abs/2305.10524", "description": "Matrix recovery from sparse observations is an extensively studied topic\nemerging in various applications, such as recommendation system and signal\nprocessing, which includes the matrix completion and compressed sensing models\nas special cases. In this work we propose a general framework for dynamic\nmatrix recovery of low-rank matrices that evolve smoothly over time. We start\nfrom the setting that the observations are independent across time, then extend\nto the setting that both the design matrix and noise possess certain temporal\ncorrelation via modified concentration inequalities. By pooling neighboring\nobservations, we obtain sharp estimation error bounds of both settings, showing\nthe influence of the underlying smoothness, the dependence and effective\nsamples. We propose a dynamic fast iterative shrinkage thresholding algorithm\nthat is computationally efficient, and characterize the interplay between\nalgorithmic and statistical convergence. Simulated and real data examples are\nprovided to support such findings."}, "http://arxiv.org/abs/2306.07047": {"title": "Foundations of Causal Discovery on Groups of Variables", "link": "http://arxiv.org/abs/2306.07047", "description": "Discovering causal relationships from observational data is a challenging\ntask that relies on assumptions connecting statistical quantities to graphical\nor algebraic causal models. In this work, we focus on widely employed\nassumptions for causal discovery when objects of interest are (multivariate)\ngroups of random variables rather than individual (univariate) random\nvariables, as is the case in a variety of problems in scientific domains such\nas climate science or neuroscience. If the group-level causal models are\nderived from partitioning a micro-level model into groups, we explore the\nrelationship between micro and group-level causal discovery assumptions. We\ninvestigate the conditions under which assumptions like Causal Faithfulness\nhold or fail to hold. 
Our analysis encompasses graphical causal models that\ncontain cycles and bidirected edges. We also discuss grouped time series causal\ngraphs and variants thereof as special cases of our general theoretical\nframework. Thereby, we aim to provide researchers with a solid theoretical\nfoundation for the development and application of causal discovery methods for\nvariable groups."}, "http://arxiv.org/abs/2309.01721": {"title": "Direct and Indirect Treatment Effects in the Presence of Semi-Competing Risks", "link": "http://arxiv.org/abs/2309.01721", "description": "Semi-competing risks refer to the phenomenon that the terminal event (such as\ndeath) can truncate the non-terminal event (such as disease progression) but\nnot vice versa. The treatment effect on the terminal event can be delivered\neither directly following the treatment or indirectly through the non-terminal\nevent. We consider two strategies to decompose the total effect into a direct\neffect and an indirect effect under the framework of mediation analysis, by\nadjusting the prevalence and hazard of non-terminal events, respectively. They\nrequire slightly different assumptions on cross-world quantities to achieve\nidentifiability. We establish asymptotic properties for the estimated\ncounterfactual cumulative incidences and decomposed treatment effects. Through\nsimulation studies and real-data applications we illustrate the subtle\ndifference between these two decompositions."}, "http://arxiv.org/abs/2311.12825": {"title": "A PSO Based Method to Generate Actionable Counterfactuals for High Dimensional Data", "link": "http://arxiv.org/abs/2311.12825", "description": "Counterfactual explanations (CFE) are methods that explain a machine learning\nmodel by giving an alternate class prediction of a data point with some minimal\nchanges in its features. It helps users identify their data attributes\nthat caused an undesirable prediction, such as a loan or credit card rejection. We\ndescribe an efficient and actionable counterfactual (CF) generation method\nbased on particle swarm optimization (PSO). We propose a simple objective\nfunction for the optimization of the instance-centric CF generation problem.\nThe PSO brings in a lot of flexibility in terms of carrying out multi-objective\noptimization in large dimensions, capability for multiple CF generation, and\nsetting box constraints or immutability of data attributes. An algorithm is\nproposed that incorporates these features and enables greater control over\nthe proximity and sparsity properties of the generated CFs. The proposed\nalgorithm is evaluated with a set of action-ability metrics in real-world\ndatasets, and the results were superior to those of the\nstate-of-the-art methods."}, "http://arxiv.org/abs/2311.12978": {"title": "Physics-Informed Priors with Application to Boundary Layer Velocity", "link": "http://arxiv.org/abs/2311.12978", "description": "One of the most popular recent areas of machine learning predicates the use\nof neural networks augmented by information about the underlying process in the\nform of Partial Differential Equations (PDEs). These physics-informed neural\nnetworks are obtained by penalizing the inference with a PDE, and have been\ncast as a minimization problem currently lacking a formal approach to quantify\nthe uncertainty. In this work, we propose a novel model-based framework which\nregards the PDE as prior information for a deep Bayesian neural network. 
The\nprior is calibrated without data to resemble the PDE solution in the prior\nmean, while our degree in confidence on the PDE with respect to the data is\nexpressed in terms of the prior variance. The information embedded in the PDE\nis then propagated to the posterior yielding physics-informed forecasts with\nuncertainty quantification. We apply our approach to a simulated viscous fluid\nand to experimentally-obtained turbulent boundary layer velocity in a wind\ntunnel using an appropriately simplified Navier-Stokes equation. Our approach\nrequires very few observations to produce physically-consistent forecasts as\nopposed to non-physical forecasts stemming from non-informed priors, thereby\nallowing forecasting complex systems where some amount of data as well as some\ncontextual knowledge is available."}, "http://arxiv.org/abs/2311.13017": {"title": "W-kernel and essential subspace for frequencist's evaluation of Bayesian estimators", "link": "http://arxiv.org/abs/2311.13017", "description": "The posterior covariance matrix W defined by the log-likelihood of each\nobservation plays important roles both in the sensitivity analysis and\nfrequencist's evaluation of the Bayesian estimators. This study focused on the\nmatrix W and its principal space; we term the latter as an essential subspace.\nFirst, it is shown that they appear in various statistical settings, such as\nthe evaluation of the posterior sensitivity, assessment of the frequencist's\nuncertainty from posterior samples, and stochastic expansion of the loss; a key\ntool to treat frequencist's properties is the recently proposed Bayesian\ninfinitesimal jackknife approximation (Giordano and Broderick (2023)). In the\nfollowing part, we show that the matrix W can be interpreted as a reproducing\nkernel; it is named as W-kernel. Using the W-kernel, the essential subspace is\nexpressed as a principal space given by the kernel PCA. A relation to the\nFisher kernel and neural tangent kernel is established, which elucidates the\nconnection to the classical asymptotic theory; it also leads to a sort of\nBayesian-frequencist's duality. Finally, two applications, selection of a\nrepresentative set of observations and dimensional reduction in the approximate\nbootstrap, are discussed. In the former, incomplete Cholesky decomposition is\nintroduced as an efficient method to compute the essential subspace. In the\nlatter, different implementations of the approximate bootstrap for posterior\nmeans are compared."}, "http://arxiv.org/abs/2311.13048": {"title": "Weighted composite likelihood for linear mixed models in complex samples", "link": "http://arxiv.org/abs/2311.13048", "description": "Fitting mixed models to complex survey data is a challenging problem. Most\nmethods in the literature, including the most widely used one, require a close\nrelationship between the model structure and the survey design. In this paper\nwe present methods for fitting arbitrary mixed models to data from arbitrary\nsurvey designs. We support this with an implementation that allows for\nmultilevel linear models and multistage designs without any assumptions about\nnesting of model and design, and that also allows for correlation structures\nsuch as those resulting from genetic relatedness. 
The estimation and inference\napproach uses weighted pairwise (composite) likelihood."}, "http://arxiv.org/abs/2311.13131": {"title": "Pair circulas modelling for multivariate circular time series", "link": "http://arxiv.org/abs/2311.13131", "description": "Modelling multivariate circular time series is considered. The\ncross-sectional and serial dependence is described by circulas, which are\nanalogs of copulas for circular distributions. In order to obtain a simple\nexpression of the dependence structure, we decompose a multivariate circula\ndensity to a product of several pair circula densities. Moreover, to reduce the\nnumber of pair circula densities, we consider strictly stationary multi-order\nMarkov processes. The real data analysis, in which the proposed model is fitted\nto multivariate time series wind direction data is also given."}, "http://arxiv.org/abs/2311.13196": {"title": "Optimal Time of Arrival Estimation for MIMO Backscatter Channels", "link": "http://arxiv.org/abs/2311.13196", "description": "In this paper, we propose a novel time of arrival (TOA) estimator for\nmultiple-input-multiple-output (MIMO) backscatter channels in closed form. The\nproposed estimator refines the estimation precision from the topological\nstructure of the MIMO backscatter channels, and can considerably enhance the\nestimation accuracy. Particularly, we show that for the general $M \\times N$\nbistatic topology, the mean square error (MSE) is $\\frac{M+N-1}{MN}\\sigma^2_0$,\nand for the general $M \\times M$ monostatic topology, it is\n$\\frac{2M-1}{M^2}\\sigma^2_0$ for the diagonal subchannels, and\n$\\frac{M-1}{M^2}\\sigma^2_0$ for the off-diagonal subchannels, where\n$\\sigma^2_0$ is the MSE of the conventional least square estimator. In\naddition, we derive the Cramer-Rao lower bound (CRLB) for MIMO backscatter TOA\nestimation which indicates that the proposed estimator is optimal. Simulation\nresults verify that the proposed TOA estimator can considerably improve both\nestimation and positioning accuracy, especially when the MIMO scale is large."}, "http://arxiv.org/abs/2311.13202": {"title": "Robust Multi-Model Subset Selection", "link": "http://arxiv.org/abs/2311.13202", "description": "Modern datasets in biology and chemistry are often characterized by the\npresence of a large number of variables and outlying samples due to measurement\nerrors or rare biological and chemical profiles. To handle the characteristics\nof such datasets we introduce a method to learn a robust ensemble comprised of\na small number of sparse, diverse and robust models, the first of its kind in\nthe literature. The degree to which the models are sparse, diverse and\nresistant to data contamination is driven directly by the data based on a\ncross-validation criterion. We establish the finite-sample breakdown of the\nensembles and the models that comprise them, and we develop a tailored\ncomputing algorithm to learn the ensembles by leveraging recent developments in\nl0 optimization. Our extensive numerical experiments on synthetic and\nartificially contaminated real datasets from genomics and cheminformatics\ndemonstrate the competitive advantage of our method over state-of-the-art\nsparse and robust methods. 
We also demonstrate the applicability of our\nproposal on a cardiac allograft vasculopathy dataset."}, "http://arxiv.org/abs/2311.13247": {"title": "A projected nonlinear state-space model for forecasting time series signals", "link": "http://arxiv.org/abs/2311.13247", "description": "Learning and forecasting stochastic time series is essential in various\nscientific fields. However, despite the proposals of nonlinear filters and\ndeep-learning methods, it remains challenging to capture nonlinear dynamics\nfrom a few noisy samples and predict future trajectories with uncertainty\nestimates while maintaining computational efficiency. Here, we propose a fast\nalgorithm to learn and forecast nonlinear dynamics from noisy time series data.\nA key feature of the proposed model is kernel functions applied to projected\nlines, enabling fast and efficient capture of nonlinearities in the latent\ndynamics. Through empirical case studies and benchmarking, the model\ndemonstrates its effectiveness in learning and forecasting complex nonlinear\ndynamics, offering a valuable tool for researchers and practitioners in time\nseries analysis."}, "http://arxiv.org/abs/2311.13291": {"title": "Robust Functional Regression with Discretely Sampled Predictors", "link": "http://arxiv.org/abs/2311.13291", "description": "The functional linear model is an important extension of the classical\nregression model allowing for scalar responses to be modeled as functions of\nstochastic processes. Yet, despite the usefulness and popularity of the\nfunctional linear model in recent years, most treatments, theoretical and\npractical alike, suffer either from (i) lack of resistance towards the many\ntypes of anomalies one may encounter with functional data or (ii) biases\nresulting from the use of discretely sampled functional data instead of\ncompletely observed data. To address these deficiencies, this paper introduces\nand studies the first class of robust functional regression estimators for\npartially observed functional data. The proposed broad class of estimators is\nbased on thin-plate splines with a novel computationally efficient quadratic\npenalty, is easily implementable and enjoys good theoretical properties under\nweak assumptions. We show that, in the incomplete data setting, both the sample\nsize and discretization error of the processes determine the asymptotic rate of\nconvergence of functional regression estimators and the latter cannot be\nignored. These theoretical properties remain valid even with multi-dimensional\nrandom fields acting as predictors and random smoothing parameters. The\neffectiveness of the proposed class of estimators in practice is demonstrated\nby means of a simulation study and a real-data example."}, "http://arxiv.org/abs/2311.13347": {"title": "Loss-based Objective and Penalizing Priors for Model Selection Problems", "link": "http://arxiv.org/abs/2311.13347", "description": "Many Bayesian model selection problems, such as variable selection or cluster\nanalysis, start by setting prior model probabilities on a structured model\nspace. Based on a chosen loss function between models, model selection is often\nperformed with a Bayes estimator that minimizes the posterior expected loss.\nThe prior model probabilities and the choice of loss both highly affect the\nmodel selection results, especially for data with small sample sizes, and their\nproper calibration and careful reflection of no prior model preference are\ncrucial in objective Bayesian analysis. 
We propose risk equilibrium priors as\nan objective choice for prior model probabilities that only depend on the model\nspace and the choice of loss. Under the risk equilibrium priors, the Bayes\naction becomes indifferent before observing data, and the family of the risk\nequilibrium priors includes existing popular objective priors in Bayesian\nvariable selection problems. We generalize the result to the elicitation of\nobjective priors for Bayesian cluster analysis with Binder's loss. We also\npropose risk penalization priors, where the Bayes action chooses the simplest\nmodel before seeing data. The concept of risk equilibrium and penalization\npriors allows us to interpret prior properties in light of the effect of loss\nfunctions, and also provides new insight into the sensitivity of Bayes\nestimators under the same prior but different loss. We illustrate the proposed\nconcepts with variable selection simulation studies and cluster analysis on a\ngalaxy dataset."}, "http://arxiv.org/abs/2311.13410": {"title": "Towards Sensitivity Analysis: A Workflow", "link": "http://arxiv.org/abs/2311.13410", "description": "Establishing causal claims is one of the primary endeavors in sociological\nresearch. Statistical causal inference is a promising way to achieve this\nthrough the potential outcome framework or structural causal models, which are\nbased on a set of identification assumptions. However, identification\nassumptions are often not fully discussed in practice, which harms the validity\nof causal claims. In this article, we focus on the unconfoundedness assumption,\nwhich assumes no unmeasured confounders in the models and is often violated in\npractice. This article reviews a set of papers in two leading sociological\njournals to check the practice of causal inference and relevant identification\nassumptions, indicating a lack of discussion of sensitivity analysis methods\nfor unconfoundedness in practice. A blueprint of how to conduct\nsensitivity analysis on unconfoundedness is then built, including six steps\nof proper choices in the practice of sensitivity analysis to evaluate the impact\nof unmeasured confounders."}, "http://arxiv.org/abs/2311.13481": {"title": "Synergizing Roughness Penalization and Basis Selection in Bayesian Spline Regression", "link": "http://arxiv.org/abs/2311.13481", "description": "Bayesian P-splines and basis determination through Bayesian model selection\nare both commonly employed strategies for nonparametric regression using spline\nbasis expansions within the Bayesian framework. Although both methods are\nwidely employed, they each have particular limitations that may introduce\npotential estimation bias depending on the nature of the target function. To\novercome the limitations associated with each method while capitalizing on\ntheir respective strengths, we propose a new prior distribution that integrates\nthe essentials of both approaches. The proposed prior distribution assesses the\ncomplexity of the spline model based on a penalty term formed by a convex\ncombination of the penalties from both methods. The proposed method exhibits\nadaptability to the unknown level of smoothness while achieving the\nminimax-optimal posterior contraction rate up to a logarithmic factor. We\nprovide an efficient Markov chain Monte Carlo algorithm for implementing the\nproposed approach. Our extensive simulation study reveals that the proposed\nmethod outperforms other competitors in terms of performance metrics or model\ncomplexity. 
An application to a real dataset substantiates the validity of our\nproposed approach."}, "http://arxiv.org/abs/2311.13556": {"title": "Universally Optimal Multivariate Crossover Designs", "link": "http://arxiv.org/abs/2311.13556", "description": "In this article, universally optimal multivariate crossover designs are\nstudied. The multiple response crossover design is motivated by a $3 \\times 3$\ncrossover setup, where the effect of $3$ doses of an oral drug are studied on\ngene expressions related to mucosal inflammation. Subjects are assigned to\nthree treatment sequences and response measurements on 5 different gene\nexpressions are taken from each subject in each of the $3$ time periods. To\nmodel multiple or $g$ responses, where $g>1$, in a crossover setup, a\nmultivariate fixed effect model with both direct and carryover treatment\neffects is considered. It is assumed that there are non zero within response\ncorrelations, while between response correlations are taken to be zero. The\ninformation matrix corresponding to the direct effects is obtained and some\nresults are studied. The information matrix in the multivariate case is shown\nto differ from the univariate case, particularly in the completely symmetric\nproperty. For the $g>1$ case, with $t$ treatments and $p$ periods, for $p=t\n\\geq 3$, the design represented by a Type $I$ orthogonal array of strength $2$\nis proved to be universally optimal over the class of binary designs, for the\ndirect treatment effects."}, "http://arxiv.org/abs/2311.13595": {"title": "Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein", "link": "http://arxiv.org/abs/2311.13595", "description": "Feature alignment methods are used in many scientific disciplines for data\npooling, annotation, and comparison. As an instance of a permutation learning\nproblem, feature alignment presents significant statistical and computational\nchallenges. In this work, we propose the covariance alignment model to study\nand compare various alignment methods and establish a minimax lower bound for\ncovariance alignment that has a non-standard dimension scaling because of the\npresence of a nuisance parameter. This lower bound is in fact minimax optimal\nand is achieved by a natural quasi MLE. However, this estimator involves a\nsearch over all permutations which is computationally infeasible even when the\nproblem has moderate size. To overcome this limitation, we show that the\ncelebrated Gromov-Wasserstein algorithm from optimal transport which is more\namenable to fast implementation even on large-scale problems is also minimax\noptimal. These results give the first statistical justification for the\ndeployment of the Gromov-Wasserstein algorithm in practice."}, "http://arxiv.org/abs/2107.06093": {"title": "A generalized hypothesis test for community structure in networks", "link": "http://arxiv.org/abs/2107.06093", "description": "Researchers theorize that many real-world networks exhibit community\nstructure where within-community edges are more likely than between-community\nedges. While numerous methods exist to cluster nodes into different\ncommunities, less work has addressed this question: given some network, does it\nexhibit statistically meaningful community structure? We answer this question\nin a principled manner by framing it as a statistical hypothesis test in terms\nof a general and model-agnostic community structure parameter. 
Leveraging this\nparameter, we propose a simple and interpretable test statistic used to\nformulate two separate hypothesis testing frameworks. The first is an\nasymptotic test against a baseline value of the parameter, while the second\ntests against a baseline model using bootstrap-based thresholds. We prove\ntheoretical properties of these tests and demonstrate how the proposed method\nyields rich insights into real-world data sets."}, "http://arxiv.org/abs/2205.08030": {"title": "Interpretable sensitivity analysis for the Baron-Kenny approach to mediation with unmeasured confounding", "link": "http://arxiv.org/abs/2205.08030", "description": "Mediation analysis assesses the extent to which the exposure affects the\noutcome indirectly through a mediator and the extent to which it operates\ndirectly through other pathways. As the most popular method in empirical\nmediation analysis, the Baron-Kenny approach estimates the indirect and direct\neffects of the exposure on the outcome based on linear structural equation\nmodels. However, when the exposure and the mediator are not randomized, the\nestimates may be biased due to unmeasured confounding among the exposure,\nmediator, and outcome. Building on Cinelli and Hazlett (2020), we derive\ngeneral omitted-variable bias formulas in linear regressions with vector\nresponses and regressors. We then use the formulas to develop a sensitivity\nanalysis method for the Baron-Kenny approach to mediation in the presence of\nunmeasured confounding. To ensure interpretability, we express the sensitivity\nparameters to correspond to the natural factorization of the joint distribution\nof the directed acyclic graph for mediation analysis. They measure the partial\ncorrelation between the unmeasured confounder and the exposure, mediator, and\noutcome, respectively. With the sensitivity parameters, we propose a novel\nmeasure called the \"robustness value for mediation\" or simply the \"robustness\nvalue\", to assess the robustness of results based on the Baron-Kenny approach\nwith respect to unmeasured confounding. Intuitively, the robustness value\nmeasures the minimum value of the maximum proportion of variability explained\nby the unmeasured confounding, for the exposure, mediator, and outcome, to\noverturn the results of the point estimate or confidence interval for the\ndirect and indirect effects. Importantly, we prove that all our sensitivity\nbounds are attainable and thus sharp."}, "http://arxiv.org/abs/2208.10106": {"title": "Statistics did not prove that the Huanan Seafood Wholesale Market was the early epicenter of the COVID-19 pandemic", "link": "http://arxiv.org/abs/2208.10106", "description": "In a recent prominent study, Worobey et al.\\ (2022, Science, 377, pp.\\ 951--9)\npurported to demonstrate statistically that the Huanan Seafood Wholesale Market\nwas the epicenter of the early COVID-19 epidemic. We show that this statistical\nconclusion is invalid on two grounds: (1) The assumption that a centroid of\nearly case locations or another simply constructed point is the origin of an\nepidemic is unproved. (2) A Monte Carlo test used to conclude that no other\nlocation than the seafood market can be the origin is flawed. 
Hence, the\nquestion of the origin of the pandemic has not been answered by their\nstatistical analysis."}, "http://arxiv.org/abs/2209.10433": {"title": "Arithmetic Average Density Fusion -- Part II: Unified Derivation for Unlabeled and Labeled RFS Fusion", "link": "http://arxiv.org/abs/2209.10433", "description": "As a fundamental information fusion approach, the arithmetic average (AA)\nfusion has recently been investigated for various random finite set (RFS)\nfilter fusion in the context of multi-sensor multi-target tracking. It is not a\nstraightforward extension of the ordinary density-AA fusion to the RFS\ndistribution but has to preserve the form of the fusing multi-target density.\nIn this work, we first propose a statistical concept, probability hypothesis\ndensity (PHD) consistency, and explain how it can be achieved by the PHD-AA\nfusion and lead to more accurate and robust detection and localization of the\npresent targets. This provides both a theoretically sound and technically\nmeaningful reason for performing inter-filter PHD AA-fusion/consensus, while\npreserving the form of the fusing RFS filter. Then, we derive and analyze the\nproper AA fusion formulations for most existing unlabeled/labeled RFS filters\nbased on the (labeled) PHD-AA/consistency. These derivations are theoretically\nunified and exact, require no approximation, and greatly enable heterogeneous unlabeled\nand labeled RFS density fusion, which is demonstrated separately in two\nsubsequent companion papers."}, "http://arxiv.org/abs/2209.14846": {"title": "Factor Modeling of a High-Dimensional Matrix-Variate and Statistical Learning for Matrix-Valued Sequences", "link": "http://arxiv.org/abs/2209.14846", "description": "We propose a new matrix factor model, named RaDFaM, the latent structure of\nwhich is strictly derived based on a hierarchical rank decomposition of a\nmatrix. Hierarchy is in the sense that all basis vectors of the column space of\neach multiplier matrix are assumed to follow the structure of a vector factor model.\nCompared to the most commonly used matrix factor model that takes the latent\nstructure of a bilinear form, RaDFaM involves additional row-wise and\ncolumn-wise matrix latent factors. This yields modest dimension reduction and\nstronger signal intensity from the perspective of tensor subspace learning, though it\nposes challenges for a new estimation procedure and the concomitant inferential theory\nfor a collection of matrix-valued observations. We develop a class of\nestimation procedures that makes use of the separable covariance structure under\nRaDFaM and approximate least squares, and derive its superiority in terms\nof the peak signal-to-noise ratio. 
We also establish the asymptotic theory when\nthe matrix-valued observations are uncorrelated or weakly correlated.\nNumerically, in terms of image/matrix reconstruction, supervised learning, and\nso forth, we demonstrate the excellent performance of RaDFaM through two\nmatrix-valued sequence datasets of independent 2D images and multinational\nmacroeconomic indices time series, respectively."}, "http://arxiv.org/abs/2210.14745": {"title": "Identifying Counterfactual Queries with the R Package cfid", "link": "http://arxiv.org/abs/2210.14745", "description": "In the framework of structural causal models, counterfactual queries describe\nevents that concern multiple alternative states of the system under study.\nCounterfactual queries often take the form of \"what if\" type questions such as\n\"would an applicant have been hired if they had over 10 years of experience,\nwhen in reality they only had 5 years of experience?\" Such questions and\ncounterfactual inference in general are crucial, for example when addressing\nthe problem of fairness in decision-making. Because counterfactual events\ncontain contradictory states of the world, it is impossible to conduct a\nrandomized experiment to address them without making several restrictive\nassumptions. However, it is sometimes possible to identify such queries from\nobservational and experimental data by representing the system under study as a\ncausal model, and the available data as symbolic probability distributions.\nShpitser and Pearl (2007) constructed two algorithms, called ID* and IDC*, for\nidentifying counterfactual queries and conditional counterfactual queries,\nrespectively. These two algorithms are analogous to the ID and IDC algorithms\nby Shpitser and Pearl (2006) for identification of interventional\ndistributions, which were implemented in R by Tikka and Karvanen (2017) in the\ncausaleffect package. We present the R package cfid that implements the ID* and\nIDC* algorithms. Identification of counterfactual queries and the features of\ncfid are demonstrated via examples."}, "http://arxiv.org/abs/2210.17000": {"title": "Ensemble transport smoothing", "link": "http://arxiv.org/abs/2210.17000", "description": "Smoothers are algorithms for Bayesian time series re-analysis. Most\noperational smoothers rely either on affine Kalman-type transformations or on\nsequential importance sampling. These strategies occupy opposite ends of a\nspectrum that trades computational efficiency and scalability for statistical\ngenerality and consistency: non-Gaussianity renders affine Kalman updates\ninconsistent with the true Bayesian solution, while the ensemble size required\nfor successful importance sampling can be prohibitive. This paper revisits the\nsmoothing problem from the perspective of measure transport, which offers the\nprospect of consistent prior-to-posterior transformations for Bayesian\ninference. We leverage this capacity by proposing a general ensemble framework\nfor transport-based smoothing. Within this framework, we derive a comprehensive\nset of smoothing recursions based on nonlinear transport maps and detail how\nthey exploit the structure of state-space models in fully non-Gaussian\nsettings. We also describe how many standard Kalman-type smoothing algorithms\nemerge as special cases of our framework. 
A companion paper (Ramgraber et al.,\n2023) explores the implementation of nonlinear ensemble transport smoothers in\ngreater depth."}, "http://arxiv.org/abs/2210.17435": {"title": "Ensemble transport smoothing", "link": "http://arxiv.org/abs/2210.17435", "description": "Smoothing is a specialized form of Bayesian inference for state-space models\nthat characterizes the posterior distribution of a collection of states given\nan associated sequence of observations. Ramgraber et al. (2023) proposes a\ngeneral framework for transport-based ensemble smoothing, which includes linear\nKalman-type smoothers as special cases. Here, we build on this foundation to\nrealize and demonstrate nonlinear backward ensemble transport smoothers. We\ndiscuss parameterization and regularization of the associated transport maps,\nand then examine the performance of these smoothers for nonlinear and chaotic\ndynamical systems that exhibit non-Gaussian behavior. In these settings, our\nnonlinear transport smoothers yield lower estimation error than conventional\nlinear smoothers and state-of-the-art iterative ensemble Kalman smoothers, for\ncomparable numbers of model evaluations."}, "http://arxiv.org/abs/2305.03634": {"title": "On the use of ordered factors as explanatory variables", "link": "http://arxiv.org/abs/2305.03634", "description": "Consider a regression or some regression-type model for a certain response\nvariable where the linear predictor includes an ordered factor among the\nexplanatory variables. The inclusion of a factor of this type can take place in\na few different ways, discussed in the pertinent literature. The present\ncontribution proposes a different way of tackling this problem, by constructing\na numeric variable in an alternative way with respect to the current\nmethodology. The proposed technique appears to retain the data-fitting\ncapability of the existing methodology, but with a simpler interpretation of\nthe model components."}, "http://arxiv.org/abs/2306.13257": {"title": "Semiparametric Estimation of the Shape of the Limiting Multivariate Point Cloud", "link": "http://arxiv.org/abs/2306.13257", "description": "We propose a model to flexibly estimate joint tail properties by exploiting\nthe convergence of an appropriately scaled point cloud onto a compact limit\nset. Characteristics of the shape of the limit set correspond to key tail\ndependence properties. We directly model the shape of the limit set using\nBezier splines, which allow flexible and parsimonious specification of shapes\nin two dimensions. We then fit the Bezier splines to data in pseudo-polar\ncoordinates using Markov chain Monte Carlo, utilizing a limiting approximation\nto the conditional likelihood of the radii given angles. 
By imposing\nappropriate constraints on the parameters of the Bezier splines, we guarantee\nthat each posterior sample is a valid limit set boundary, allowing direct\nposterior analysis of any quantity derived from the shape of the curve.\nFurthermore, we obtain interpretable inference on the asymptotic dependence\nclass by using mixture priors with point masses on the corner of the unit box.\nFinally, we apply our model to bivariate datasets of extremes of variables\nrelated to fire risk and air pollution."}, "http://arxiv.org/abs/2307.08370": {"title": "Parameter estimation for contact tracing in graph-based models", "link": "http://arxiv.org/abs/2307.08370", "description": "We adopt a maximum-likelihood framework to estimate parameters of a\nstochastic susceptible-infected-recovered (SIR) model with contact tracing on a\nrooted random tree. Given the number of detectees per index case, our estimator\nallows us to determine the degree distribution of the random tree as well as the\ntracing probability. Since we do not discover all infectees via contact\ntracing, this estimation is non-trivial. To keep things simple and stable, we\ndevelop an approximation suited for realistic situations (contact tracing\nprobability small, or the probability of detecting index cases small).\nIn this approximation, the only epidemiological parameter entering the\nestimator is $R_0$.\n\nThe estimator is tested in a simulation study and is furthermore applied to\nCOVID-19 contact tracing data from India. The simulation study underlines the\nefficiency of the method. For the empirical COVID-19 data, we compare different\ndegree distributions and perform a sensitivity analysis. We find that\nparticularly a power-law and a negative binomial degree distribution fit the\ndata well and that the tracing probability is rather large. The sensitivity\nanalysis shows no strong dependency of the estimates on the reproduction\nnumber. Finally, we discuss the relevance of our findings."}, "http://arxiv.org/abs/2311.13701": {"title": "Reexamining Statistical Significance and P-Values in Nursing Research: Historical Context and Guidance for Interpretation, Alternatives, and Reporting", "link": "http://arxiv.org/abs/2311.13701", "description": "Nurses should rely on the best evidence, but tend to struggle with\nstatistics, impeding research integration into clinical practice. Statistical\nsignificance, a key concept in classical statistics, and its primary metric,\nthe p-value, are frequently misused. This topic has been debated in many\ndisciplines but rarely in nursing. The aim is to present key arguments in the\ndebate surrounding the misuse of p-values, discuss their relevance to nursing,\nand offer recommendations to address them. The literature indicates that the\nconcept of probability in classical statistics is not easily understood,\nleading to misinterpretations of statistical significance. Much of the critique\nconcerning p-values arises from such misunderstandings and imprecise\nterminology. Thus, some scholars have argued for the complete abandonment of\np-values. Instead of discarding p-values, this article provides a comprehensive\naccount of their historical context and the information they convey. This will\nclarify why they are widely used yet often misunderstood. The article also\noffers recommendations for accurate interpretation of statistical significance\nby incorporating other key metrics. 
To mitigate publication bias resulting from\np-value misuse, pre-registering the analysis plan is recommended. The article\nalso explores alternative approaches, particularly Bayes factors, as they may\nresolve several of these issues. P-values serve a purpose in nursing research\nas an initial safeguard against the influence of randomness. Much criticism\ndirected towards p-values arises from misunderstandings and inaccurate\nterminology. Several considerations and measures are recommended, some of which go\nbeyond the conventional, to obtain accurate p-values and to better understand\nstatistical significance. Nurse educators and researchers should consider\nthese in their educational and research reporting practices."}, "http://arxiv.org/abs/2311.13767": {"title": "Hierarchical False Discovery Rate Control for High-dimensional Survival Analysis with Interactions", "link": "http://arxiv.org/abs/2311.13767", "description": "With the development of data collection techniques, analysis with a survival\nresponse and high-dimensional covariates has become routine. Here we consider\nan interaction model, which includes a set of low-dimensional covariates, a set\nof high-dimensional covariates, and their interactions. This model has been\nmotivated by gene-environment (G-E) interaction analysis, where the E variables\nhave a low dimension, and the G variables have a high dimension. For such a\nmodel, there has been extensive research on estimation and variable selection.\nComparatively, inference studies with a valid false discovery rate (FDR)\ncontrol have been very limited. The existing high-dimensional inference tools\ncannot be directly applied to interaction models, as interactions and main\neffects are not ``equal''. In this article, for high-dimensional survival\nanalysis with interactions, we model survival using the Accelerated Failure\nTime (AFT) model and adopt a ``weighted least squares + debiased Lasso''\napproach for estimation and selection. A hierarchical FDR control approach is\ndeveloped for inference that respects the ``main effects, interactions''\nhierarchy. The asymptotic distribution properties of the debiased Lasso\nestimators are rigorously established. Simulation demonstrates the\nsatisfactory performance of the proposed approach, and the analysis of a breast\ncancer dataset further establishes its practical utility."}, "http://arxiv.org/abs/2311.13768": {"title": "Valid confidence intervals for regression with best subset selection", "link": "http://arxiv.org/abs/2311.13768", "description": "Classical confidence intervals after best subset selection are widely\nimplemented in statistical software and are routinely used to guide\npractitioners in scientific fields to conclude significance. However, there are\nincreasing concerns in the recent literature about the validity of these\nconfidence intervals in that the intended frequentist coverage is not attained.\nIn the context of the Akaike information criterion (AIC), recent studies\nobserve an under-coverage phenomenon in terms of overfitting, where the\nestimate of error variance under the selected submodel is smaller than that for\nthe true model. Under-coverage is particularly troubling in selective inference\nas it points to inflated Type I errors that would invalidate significant\nfindings. 
In this article, we delineate a complementary, yet provably more\ndecisive factor behind the incorrect coverage of classical confidence intervals\nunder AIC, in terms of altered conditional sampling distributions of pivotal\nquantities. Resting on selective techniques developed in other settings, our\nfinite-sample characterization of the selection event under AIC uncovers its\ngeometry as a union of finitely many intervals on the real line, based on which\nwe derive new confidence intervals with guaranteed coverage for any sample\nsize. This geometry derived for AIC selection enables exact (and typically less\nthan exact) conditioning, circumventing the need for the excessive conditioning\ncommon in other post-selection methods. The proposed methods are easy to\nimplement and can be broadly applied to other commonly used best subset\nselection criteria. In an application to a classical US consumption dataset,\nthe proposed confidence intervals arrive at different conclusions compared to\nthe conventional ones, even when the selected model is the full model, leading\nto interpretable findings that better align with empirical observations."}, "http://arxiv.org/abs/2311.13815": {"title": "Resampling Methods with Imputed Data", "link": "http://arxiv.org/abs/2311.13815", "description": "Resampling techniques have become increasingly popular for estimation of\nuncertainty in data collected via surveys. Survey data are also frequently\nsubject to missing data, which are often imputed. This note addresses the issue\nof using resampling methods such as a jackknife or bootstrap in conjunction\nwith imputations that have been sampled stochastically (e.g., in the vein of\nmultiple imputation). It is illustrated that the imputations must be redrawn\nwithin each replicate group of a jackknife or bootstrap. Further, the number of\nmultiply imputed datasets per replicate group must dramatically exceed the\nnumber of replicate groups for a jackknife. However, this is not the case in a\nbootstrap approach. A brief simulation study is provided to support the theory\nintroduced in this note."}, "http://arxiv.org/abs/2311.13825": {"title": "Online Prediction of Extreme Conditional Quantiles via B-Spline Interpolation", "link": "http://arxiv.org/abs/2311.13825", "description": "Extreme quantiles are critical for understanding the behavior of data in the\ntail region of a distribution. It is challenging to estimate extreme quantiles,\nparticularly when dealing with limited data in the tail. In such cases, extreme\nvalue theory offers a solution by approximating the tail distribution using the\nGeneralized Pareto Distribution (GPD). This allows for extrapolation beyond\nthe range of observed data, making it a valuable tool for various applications.\nHowever, when it comes to conditional cases, where estimation relies on\ncovariates, existing methods may require computationally expensive GPD fitting\nfor different observations. This computational burden becomes even more\nproblematic as the volume of observations increases, sometimes approaching\ninfinity. To address this issue, we propose an interpolation-based algorithm\nnamed EMI. EMI facilitates the online prediction of extreme conditional\nquantiles with finite offline observations. Combining quantile regression and\nGPD-based extrapolation, EMI is formulated as a bilevel programming problem,\nefficiently solvable using classic optimization methods. 
Once estimates for\noffline observations are obtained, EMI employs B-spline interpolation for\ncovariate-dependent variables, enabling estimation for online observations with\nfinite GPD fitting. Simulations and real data analysis demonstrate the\neffectiveness of EMI across various scenarios."}, "http://arxiv.org/abs/2311.13897": {"title": "Super-resolution capacity of variance-based stochastic fluorescence microscopy", "link": "http://arxiv.org/abs/2311.13897", "description": "Improving the resolution of fluorescence microscopy beyond the diffraction\nlimit can be achieved by acquiring and processing multiple images of the sample\nunder different illumination conditions. One of the simplest techniques, Random\nIllumination Microscopy (RIM), forms the super-resolved image from the variance\nof images obtained with random speckled illuminations. However, the validity of\nthis process has not been fully theorized. In this work, we characterize\nmathematically the sample information contained in the variance of\ndiffraction-limited speckled images as a function of the statistical properties\nof the illuminations. We show that an unambiguous two-fold resolution gain is\nobtained when the speckle correlation length coincides with the width of the\nobservation point spread function. Last, we analyze the difference between the\nvariance-based techniques using random speckled illuminations (as in RIM) and\nthose obtained using random fluorophore activation (as in Super-resolution\nOptical Fluctuation Imaging, SOFI)."}, "http://arxiv.org/abs/2311.13911": {"title": "Identifying Important Pairwise Logratios in Compositional Data with Sparse Principal Component Analysis", "link": "http://arxiv.org/abs/2311.13911", "description": "Compositional data are characterized by the fact that their elemental\ninformation is contained in simple pairwise logratios of the parts that\nconstitute the composition. While pairwise logratios are typically easy to\ninterpret, the number of possible pairs to consider quickly becomes (too) large\neven for medium-sized compositions, which might hinder interpretability in\nfurther multivariate analyses. Sparse methods can therefore be useful to\nidentify few, important pairwise logratios (respectively parts contained in\nthem) from the total candidate set. To this end, we propose a procedure based\non the construction of all possible pairwise logratios and employ sparse\nprincipal component analysis to identify important pairwise logratios. The\nperformance of the procedure is demonstrated both with simulated and real-world\ndata. In our empirical analyses, we propose three visual tools showing (i) the\nbalance between sparsity and explained variability, (ii) stability of the\npairwise logratios, and (iii) importance of the original compositional parts to\naid practitioners with their model interpretation."}, "http://arxiv.org/abs/2311.13923": {"title": "Optimal $F$-score Clustering for Bipartite Record Linkage", "link": "http://arxiv.org/abs/2311.13923", "description": "Probabilistic record linkage is often used to match records from two files,\nin particular when the variables common to both files comprise imperfectly\nmeasured identifiers like names and demographic variables. We consider\nbipartite record linkage settings in which each entity appears at most once\nwithin a file, i.e., there are no duplicates within the files, but some\nentities appear in both files. 
In this setting, the analyst desires a point\nestimate of the linkage structure that matches each record to at most one\nrecord from the other file. We propose an approach for obtaining this point\nestimate by maximizing the expected $F$-score for the linkage structure. We\ntarget the approach for record linkage methods that produce either (an\napproximate) posterior distribution of the unknown linkage structure or\nprobabilities of matches for record pairs. Using simulations and applications\nwith genuine data, we illustrate that the $F$-score estimators can lead to\nsensible estimates of the linkage structure."}, "http://arxiv.org/abs/2311.13935": {"title": "An analysis of the fragmentation of observing time at the Muztagh-ata site", "link": "http://arxiv.org/abs/2311.13935", "description": "Cloud cover plays a pivotal role in assessing observational conditions for\nastronomical site-testing. Besides the fraction of observing time, its\nfragmentation also wields a significant influence on the quality of nighttime\nsky clarity. In this article, we introduce the function Gamma, designed to\ncomprehensively capture both the fraction of available observing time and its\ncontinuity. Leveraging in situ measurement data gathered at the Muztagh-ata\nsite between 2017 and 2021, we showcase the effectiveness of our approach. The\nstatistical result illustrates that the Muztagh-ata site affords approximately\n122 absolutely clear nights and 205 very good nights annually, corresponding\nto Gamma greater than or equal to 0.9 and Gamma greater than or equal to 0.36,\nrespectively."}, "http://arxiv.org/abs/2311.14042": {"title": "Optimized Covariance Design for AB Test on Social Network under Interference", "link": "http://arxiv.org/abs/2311.14042", "description": "Online A/B tests have become increasingly popular and important for social\nplatforms. However, accurately estimating the global average treatment effect\n(GATE) has proven to be challenging due to network interference, which violates\nthe Stable Unit Treatment Value Assumption (SUTVA) and poses a great challenge\nto experimental design. Existing network experimental design research was\nmostly based on the unbiased Horvitz-Thompson (HT) estimator with substantial\ndata trimming to ensure unbiasedness at the price of high resultant estimation\nvariance. In this paper, we strive to balance the bias and variance in\ndesigning randomized network experiments. Under a potential outcome model with\n1-hop interference, we derive the bias and variance of the standard HT\nestimator and reveal their relation to the network topological structure and\nthe covariance of the treatment assignment vector. We then propose to formulate\nthe experimental design problem to optimize the covariance matrix of the\ntreatment assignment vector to achieve the bias and variance balance by\nminimizing a well-crafted upper bound of the mean squared error (MSE) of the\nestimator, which allows us to decouple the unknown interference effect\ncomponent and the experimental design component. An efficient projected\ngradient descent algorithm is presented to implement the desired randomization\nscheme. 
Finally, we carry out extensive simulation studies to demonstrate the\nadvantages of our proposed method over other existing methods in many settings,\nwith different levels of model misspecification."}, "http://arxiv.org/abs/2311.14054": {"title": "Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis", "link": "http://arxiv.org/abs/2311.14054", "description": "Between 2011 and 2014, NHANES collected objectively measured physical activity\ndata using wrist-worn accelerometers for tens of thousands of individuals for\nup to seven days. Here we analyze the minute-level indicators of being active,\nwhich can be viewed as binary (because there is an active indicator at every\nminute), multilevel (because there are multiple days of data for each study\nparticipant), and functional (because within-day data can be viewed as a function\nof time) data. To extract within- and between-participant directions of\nvariation in the data, we introduce Generalized Multilevel Functional Principal\nComponent Analysis (GM-FPCA), an approach based on the dimension reduction of\nthe linear predictor. Scores associated with specific patterns of activity are\nshown to be strongly associated with time to death. In particular, we confirm\nthat increased activity is associated with time to death, a result that has\nbeen reported on other data sets. In addition, our method shows the previously\nunreported finding that maintaining a consistent day-to-day routine is strongly\nassociated with a reduced risk of mortality (p-value $< 0.001$) even after\nadjusting for traditional risk factors. Extensive simulation studies indicate\nthat GM-FPCA provides accurate estimation of model parameters, is\ncomputationally stable, and is scalable in the number of study participants,\nvisits, and observations within visits. R code for implementing the method is\nprovided."}, "http://arxiv.org/abs/2311.14122": {"title": "Decompositions of the mean continuous ranked probability score", "link": "http://arxiv.org/abs/2311.14122", "description": "The continuous ranked probability score (crps) is the most commonly used\nscoring rule in the evaluation of probabilistic forecasts for real-valued\noutcomes. To assess and rank forecasting methods, researchers compute the mean\ncrps over given sets of forecast situations, based on the respective predictive\ndistributions and outcomes. We propose a new, isotonicity-based decomposition\nof the mean crps into interpretable components that quantify miscalibration\n(MSC), discrimination ability (DSC), and uncertainty (UNC), respectively. In a\ndetailed theoretical analysis, we compare the new approach to empirical\ndecompositions proposed earlier, generalize to population versions, analyse\ntheir properties and relationships, and relate to a hierarchy of notions of\ncalibration. The isotonicity-based decomposition guarantees the nonnegativity\nof the components and quantifies calibration in a sense that is stronger than\nfor other types of decompositions, subject to the nondegeneracy of empirical\ndecompositions. 
We illustrate the usage of the isotonicity-based decomposition\nin case studies from weather prediction and machine learning."}, "http://arxiv.org/abs/2311.14212": {"title": "Annotation Sensitivity: Training Data Collection Methods Affect Model Performance", "link": "http://arxiv.org/abs/2311.14212", "description": "When training data are collected from human annotators, the design of the\nannotation instrument, the instructions given to annotators, the\ncharacteristics of the annotators, and their interactions can impact training\ndata. This study demonstrates that design choices made when creating an\nannotation instrument also impact the models trained on the resulting\nannotations.\n\nWe introduce the term annotation sensitivity to refer to the impact of\nannotation data collection methods on the annotations themselves and on\ndownstream model performance and predictions.\n\nWe collect annotations of hate speech and offensive language in five\nexperimental conditions of an annotation instrument, randomly assigning\nannotators to conditions. We then fine-tune BERT models on each of the five\nresulting datasets and evaluate model performance on a holdout portion of each\ncondition. We find considerable differences between the conditions for 1) the\nshare of hate speech/offensive language annotations, 2) model performance, 3)\nmodel predictions, and 4) model learning curves.\n\nOur results emphasize the crucial role played by the annotation instrument\nwhich has received little attention in the machine learning literature. We call\nfor additional research into how and why the instrument impacts the annotations\nto inform the development of best practices in instrument design."}, "http://arxiv.org/abs/2311.14220": {"title": "Assumption-lean and Data-adaptive Post-Prediction Inference", "link": "http://arxiv.org/abs/2311.14220", "description": "A primary challenge facing modern scientific research is the limited\navailability of gold-standard data which can be both costly and labor-intensive\nto obtain. With the rapid development of machine learning (ML), scientists have\nrelied on ML algorithms to predict these gold-standard outcomes with easily\nobtained covariates. However, these predicted outcomes are often used directly\nin subsequent statistical analyses, ignoring imprecision and heterogeneity\nintroduced by the prediction procedure. This will likely result in false\npositive findings and invalid scientific conclusions. In this work, we\nintroduce an assumption-lean and data-adaptive Post-Prediction Inference\n(POP-Inf) procedure that allows valid and powerful inference based on\nML-predicted outcomes. Its \"assumption-lean\" property guarantees reliable\nstatistical inference without assumptions on the ML-prediction, for a wide\nrange of statistical quantities. Its \"data-adaptive'\" feature guarantees an\nefficiency gain over existing post-prediction inference methods, regardless of\nthe accuracy of ML-prediction. We demonstrate the superiority and applicability\nof our method through simulations and large-scale genomic data."}, "http://arxiv.org/abs/2311.14356": {"title": "Lagged coherence: explicit and testable definition", "link": "http://arxiv.org/abs/2311.14356", "description": "Measures of association between cortical regions based on activity signals\nprovide useful information for studying brain functional connectivity.\nDifficulties occur with signals of electric neuronal activity, where an\nobserved signal is a mixture, i.e. 
an instantaneous weighted average of the\ntrue, unobserved signals from all regions, due to volume conduction and low\nspatial resolution. This is why measures of lagged association are of interest,\nsince at least theoretically, \"lagged association\" is of physiological origin.\nIn contrast, the actual physiological instantaneous zero-lag association is\nmasked and confounded by the mixing artifact. A minimum requirement for a\nmeasure of lagged association is that it must not tend to zero with an increase\nof strength of true instantaneous physiological association. Such biased\nmeasures cannot tell apart if a change in its value is due to a change in\nlagged or a change in instantaneous association. An explicit testable\ndefinition for frequency domain lagged connectivity between two multivariate\ntime series is proposed. It is endowed with two important properties: it is\ninvariant to non-singular linear transformations of each vector time series\nseparately, and it is invariant to instantaneous association. As a sanity\ncheck, in the case of two univariate time series, the new definition leads back\nto the bivariate lagged coherence of 2007 (eqs 25 and 26 in\nhttps://doi.org/10.48550/arXiv.0706.1776)."}, "http://arxiv.org/abs/2311.14367": {"title": "Cultural data integration via random graphical modelling", "link": "http://arxiv.org/abs/2311.14367", "description": "Cultural values vary significantly around the world. Despite a large\nheterogeneity, similarities across national cultures are to be expected. This\npaper studies cross-country culture heterogeneity via the joint inference of\ncopula graphical models. To this end, a random graph generative model is\nintroduced, with a latent space that embeds cultural relatedness across\ncountries. Taking world-wide country-specific survey data as the primary source\nof information, the modelling framework allows to integrate external data, both\nat the level of cultural traits and of their interdependence. In this way, we\nare able to identify several dimensions of culture."}, "http://arxiv.org/abs/2311.14412": {"title": "A Comparison of PDF Projection with Normalizing Flows and SurVAE", "link": "http://arxiv.org/abs/2311.14412", "description": "Normalizing flows (NF) recently gained attention as a way to construct\ngenerative networks with exact likelihood calculation out of composable layers.\nHowever, NF is restricted to dimension-preserving transformations. Surjection\nVAE (SurVAE) has been proposed to extend NF to dimension-altering\ntransformations. Such networks are desirable because they are expressive and\ncan be precisely trained. We show that the approaches are a re-invention of PDF\nprojection, which appeared over twenty years earlier and is much further\ndeveloped."}, "http://arxiv.org/abs/2311.14424": {"title": "Exact confidence intervals for the mixing distribution from binomial mixture distribution samples", "link": "http://arxiv.org/abs/2311.14424", "description": "We present methodology for constructing pointwise confidence intervals for\nthe cumulative distribution function and the quantiles of mixing distributions\non the unit interval from binomial mixture distribution samples. No assumptions\nare made on the shape of the mixing distribution. The confidence intervals are\nconstructed by inverting exact tests of composite null hypotheses regarding the\nmixing distribution. 
Our method may be applied to any deconvolution approach\nthat produces test statistics whose distribution is stochastically monotone for\nstochastic increase of the mixing distribution. We propose a hierarchical Bayes\napproach, which uses finite Polya Trees for modelling the mixing distribution\nand provides stable and accurate deconvolution estimates without the need for\nadditional tuning parameters. Our main technical result establishes the\nstochastic monotonicity property of the test statistics produced by the\nhierarchical Bayes approach. Leveraging the stochastic\nmonotonicity property, we explicitly derive the smallest asymptotic confidence\nintervals that may be constructed using our methodology, raising the question of\nwhether it is possible to construct smaller confidence intervals for the mixing\ndistribution without making parametric assumptions on its shape."}, "http://arxiv.org/abs/2311.14487": {"title": "Reconciliation of expert priors for quantities and events and application within the probabilistic Delphi method", "link": "http://arxiv.org/abs/2311.14487", "description": "We consider the problem of aggregating the judgements of a group of experts\nto form a single prior distribution representing the judgements of the group.\nWe develop a Bayesian hierarchical model to reconcile the judgements of the\ngroup of experts based on elicited quantiles for continuous quantities and\nprobabilities for one-off events. Previous Bayesian reconciliation methods have\nnot been used widely, if at all, in contrast to pooling methods and\nconsensus-based approaches. To address this, we embed Bayesian reconciliation\nwithin the probabilistic Delphi method. The result is to furnish the outcome of\nthe probabilistic Delphi method with a direct probabilistic interpretation,\nwith the resulting prior representing the judgements of the decision maker. We\ncan use the rationales from the Delphi process to group the experts for the\nhierarchical modelling. We illustrate the approach with applications to studies\nevaluating erosion in embankment dams and pump failures in a water pumping\nstation, and assess the properties of the approach using the TU Delft database\nof expert judgement studies. We see that, even using an off-the-shelf\nimplementation of the approach, it outperforms individual experts, equal\nweighting of experts, and the classical method based on the log score."}, "http://arxiv.org/abs/2311.14502": {"title": "Informed Random Partition Models with Temporal Dependence", "link": "http://arxiv.org/abs/2311.14502", "description": "Model-based clustering is a powerful tool that is often used to discover\nhidden structure in data by grouping observational units that exhibit similar\nresponse values. Recently, clustering methods have been developed that permit\nincorporating an ``initial'' partition informed by expert opinion. Then, using\nsome similarity criteria, partitions different from the initial one are\ndown-weighted, i.e., they are assigned reduced probabilities. These methods represent\nan exciting new direction of method development in clustering techniques. We\nadd to this literature a method that very flexibly permits assigning varying\nlevels of uncertainty to any subset of the partition. This is particularly\nuseful in practice as there is rarely clear prior information with regard to\nthe entire partition. 
Our approach is not based on partition penalties but\nconsiders individual allocation probabilities for each unit (e.g., locally\nweighted prior information). We illustrate the gains in prior specification\nflexibility via simulation studies and an application to a dataset concerning\nspatio-temporal evolution of ${\\rm PM}_{10}$ measurements in Germany."}, "http://arxiv.org/abs/2311.14655": {"title": "A Sparse Factor Model for Clustering High-Dimensional Longitudinal Data", "link": "http://arxiv.org/abs/2311.14655", "description": "Recent advances in engineering technologies have enabled the collection of a\nlarge number of longitudinal features. This wealth of information presents\nunique opportunities for researchers to investigate the complex nature of\ndiseases and uncover underlying disease mechanisms. However, analyzing such\nkind of data can be difficult due to its high dimensionality, heterogeneity and\ncomputational challenges. In this paper, we propose a Bayesian nonparametric\nmixture model for clustering high-dimensional mixed-type (e.g., continuous,\ndiscrete and categorical) longitudinal features. We employ a sparse factor\nmodel on the joint distribution of random effects and the key idea is to induce\nclustering at the latent factor level instead of the original data to escape\nthe curse of dimensionality. The number of clusters is estimated through a\nDirichlet process prior. An efficient Gibbs sampler is developed to estimate\nthe posterior distribution of the model parameters. Analysis of real and\nsimulated data is presented and discussed. Our study demonstrates that the\nproposed model serves as a useful analytical tool for clustering\nhigh-dimensional longitudinal data."}, "http://arxiv.org/abs/2104.10618": {"title": "Multiple conditional randomization tests for lagged and spillover treatment effects", "link": "http://arxiv.org/abs/2104.10618", "description": "We consider the problem of constructing multiple conditional randomization\ntests. They may test different causal hypotheses but always aim to be nearly\nindependent, allowing the randomization p-values to be interpreted individually\nand combined using standard methods. We start with a simple, sequential\nconstruction of such tests, and then discuss its application to three problems:\nevidence factors for observational studies, lagged treatment effect in\nstepped-wedge trials, and spillover effect in randomized trials with\ninterference. We compare the proposed approach with some existing methods using\nsimulated and real datasets. Finally, we establish a general sufficient\ncondition for constructing multiple nearly independent conditional\nrandomization tests."}, "http://arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "http://arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for\ndetecting localized regions in data sequences which each must contain a\nchange-point in the median, at a prescribed global significance level. RNSP\nworks by fitting the postulated constant model over many regions of the data\nusing a new sign-multiresolution sup-norm-type loss, and greedily identifying\nthe shortest intervals on which the constancy is significantly violated. 
By\nworking with the signs of the data around fitted model candidates, RNSP fulfils\nits coverage promises under minimal assumptions, requiring only sign-symmetry\nand serial independence of the signs of the true residuals. In particular, it\npermits their heterogeneity and arbitrarily heavy tails. The intervals of\nsignificance returned by RNSP have a finite-sample character, are unconditional\nin nature and do not rely on any assumptions on the true signal. Code\nimplementing RNSP is available at https://github.com/pfryz/nsp."}, "http://arxiv.org/abs/2111.12720": {"title": "Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator", "link": "http://arxiv.org/abs/2111.12720", "description": "We resurrect the infamous harmonic mean estimator for computing the marginal\nlikelihood (Bayesian evidence) and solve its problematic large variance. The\nmarginal likelihood is a key component of Bayesian model selection to evaluate\nmodel posterior probabilities; however, its computation is challenging. The\noriginal harmonic mean estimator, first proposed by Newton and Raftery in 1994,\ninvolves computing the harmonic mean of the likelihood given samples from the\nposterior. It was immediately realised that the original estimator can fail\ncatastrophically since its variance can become very large (possibly not\nfinite). A number of variants of the harmonic mean estimator have been proposed\nto address this issue although none have proven fully satisfactory. We present\nthe \\emph{learnt harmonic mean estimator}, a variant of the original estimator\nthat solves its large variance problem. This is achieved by interpreting the\nharmonic mean estimator as importance sampling and introducing a new target\ndistribution. The new target distribution is learned to approximate the optimal\nbut inaccessible target, while minimising the variance of the resulting\nestimator. Since the estimator requires samples of the posterior only, it is\nagnostic to the sampling strategy used. We validate the estimator on a variety\nof numerical experiments, including a number of pathological examples where the\noriginal harmonic mean estimator fails catastrophically. We also consider a\ncosmological application, where our approach leads to $\\sim$ 3 to 6 times more\nsamples than current state-of-the-art techniques in 1/3 of the time. In all\ncases our learnt harmonic mean estimator is shown to be highly accurate. The\nestimator is computationally scalable and can be applied to problems of\ndimension $O(10^3)$ and beyond. Code implementing the learnt harmonic mean\nestimator is made publicly available"}, "http://arxiv.org/abs/2205.07378": {"title": "Proximal MCMC for Bayesian Inference of Constrained and Regularized Estimation", "link": "http://arxiv.org/abs/2205.07378", "description": "This paper advocates proximal Markov Chain Monte Carlo (ProxMCMC) as a\nflexible and general Bayesian inference framework for constrained or\nregularized estimation. Originally introduced in the Bayesian imaging\nliterature, ProxMCMC employs the Moreau-Yosida envelope for a smooth\napproximation of the total-variation regularization term, fixes variance and\nregularization strength parameters as constants, and uses the Langevin\nalgorithm for the posterior sampling. We extend ProxMCMC to be fully Bayesian\nby providing data-adaptive estimation of all parameters including the\nregularization strength parameter. 
More powerful sampling algorithms such as\nHamiltonian Monte Carlo are employed to scale ProxMCMC to high-dimensional\nproblems. Analogous to the proximal algorithms in optimization, ProxMCMC offers\na versatile and modularized procedure for conducting statistical inference on\nconstrained and regularized problems. The power of ProxMCMC is illustrated on\nvarious statistical estimation and machine learning tasks, the inference of\nwhich is traditionally considered difficult from both frequentist and Bayesian\nperspectives."}, "http://arxiv.org/abs/2211.10032": {"title": "Modular Regression: Improving Linear Models by Incorporating Auxiliary Data", "link": "http://arxiv.org/abs/2211.10032", "description": "This paper develops a new framework, called modular regression, to utilize\nauxiliary information -- such as variables other than the original features or\nadditional data sets -- in the training process of linear models. At a high\nlevel, our method follows the routine: (i) decomposing the regression task into\nseveral sub-tasks, (ii) fitting the sub-task models, and (iii) using the\nsub-task models to provide an improved estimate for the original regression\nproblem. This routine applies to widely-used low-dimensional (generalized)\nlinear models and high-dimensional regularized linear regression. It also\nnaturally extends to missing-data settings where only partial observations are\navailable. By incorporating auxiliary information, our approach improves the\nestimation efficiency and prediction accuracy over linear regression or the\nLasso under a conditional independence assumption for predicting the outcome.\nFor high-dimensional settings, we develop an extension of our procedure that is\nrobust to violations of the conditional independence assumption, in the sense\nthat it improves efficiency if this assumption holds and coincides with the\nLasso otherwise. We demonstrate the efficacy of our methods with simulated and\nreal data sets."}, "http://arxiv.org/abs/2211.13478": {"title": "A New Spatio-Temporal Model Exploiting Hamiltonian Equations", "link": "http://arxiv.org/abs/2211.13478", "description": "The solutions of Hamiltonian equations are known to describe the underlying\nphase space of the mechanical system. Hamiltonian Monte Carlo is the sole use\nof the properties of solutions to the Hamiltonian equations in Bayesian\nstatistics. In this article, we propose a novel spatio-temporal model using a\nstrategic modification of the Hamiltonian equations, incorporating appropriate\nstochasticity via Gaussian processes. The resultant spatio-temporal process,\ncontinuously varying with time, turns out to be nonparametric, nonstationary,\nnonseparable and non-Gaussian. Additionally, as the spatio-temporal lag goes to\ninfinity, the lagged correlations converge to zero. We investigate the\ntheoretical properties of the new spatio-temporal process, including its\ncontinuity and smoothness properties. In the Bayesian paradigm, we derive\nmethods for complete Bayesian inference using MCMC techniques. The performance\nof our method has been compared with that of a non-stationary Gaussian process\n(GP) using two simulation studies, where our method shows a significant\nimprovement over the non-stationary GP. 
Further, application of our new model\nto two real data sets revealed encouraging performance."}, "http://arxiv.org/abs/2212.08968": {"title": "Covariate Adjustment in Bayesian Adaptive Randomized Controlled Trials", "link": "http://arxiv.org/abs/2212.08968", "description": "In conventional randomized controlled trials, adjustment for baseline values\nof covariates known to be at least moderately associated with the outcome\nincreases the power of the trial. Recent work has shown particular benefit for\nmore flexible frequentist designs, such as information adaptive and adaptive\nmulti-arm designs. However, covariate adjustment has not been characterized\nwithin the more flexible Bayesian adaptive designs, despite their growing\npopularity. We focus on a subclass of these which allow for early stopping at\nan interim analysis given evidence of treatment superiority. We consider both\ncollapsible and non-collapsible estimands, and show how to obtain posterior\nsamples of marginal estimands from adjusted analyses. We describe several\nestimands for three common outcome types. We perform a simulation study to\nassess the impact of covariate adjustment using a variety of adjustment models\nin several different scenarios. This is followed by a real world application of\nthe compared approaches to a COVID-19 trial with a binary endpoint. For all\nscenarios, it is shown that covariate adjustment increases power and the\nprobability of stopping the trials early, and decreases the expected sample\nsizes as compared to unadjusted analyses."}, "http://arxiv.org/abs/2301.11873": {"title": "A Deep Learning Method for Comparing Bayesian Hierarchical Models", "link": "http://arxiv.org/abs/2301.11873", "description": "Bayesian model comparison (BMC) offers a principled approach for assessing\nthe relative merits of competing computational models and propagating\nuncertainty into model selection decisions. However, BMC is often intractable\nfor the popular class of hierarchical models due to their high-dimensional\nnested parameter structure. To address this intractability, we propose a deep\nlearning method for performing BMC on any set of hierarchical models which can\nbe instantiated as probabilistic programs. Since our method enables amortized\ninference, it allows efficient re-estimation of posterior model probabilities\nand fast performance validation prior to any real-data application. In a series\nof extensive validation studies, we benchmark the performance of our method\nagainst the state-of-the-art bridge sampling method and demonstrate excellent\namortized inference across all BMC settings. We then showcase our method by\ncomparing four hierarchical evidence accumulation models that have previously\nbeen deemed intractable for BMC due to partly implicit likelihoods.\nAdditionally, we demonstrate how transfer learning can be leveraged to enhance\ntraining efficiency. We provide reproducible code for all analyses and an\nopen-source implementation of our method."}, "http://arxiv.org/abs/2304.04124": {"title": "Nonparametric Confidence Intervals for Generalized Lorenz Curve using Modified Empirical Likelihood", "link": "http://arxiv.org/abs/2304.04124", "description": "The Lorenz curve portrays the inequality of income distribution. 
In this\narticle, we develop three modified empirical likelihood (EL) approaches,\nincluding adjusted empirical likelihood, transformed empirical likelihood, and\ntransformed adjusted empirical likelihood, to construct confidence intervals for\nthe generalized Lorenz ordinate. We show that the limiting distributions\nof the modified EL ratio statistics for the generalized Lorenz ordinate are\nscaled Chi-Squared distributions with one degree of freedom. The coverage\nprobabilities and mean lengths of the confidence intervals from the proposed\nmethods are compared with those of the traditional EL method through simulations under\nvarious scenarios. Finally, the proposed methods are illustrated using a real\ndata application to construct confidence intervals."}, "http://arxiv.org/abs/2306.04702": {"title": "Efficient sparsity adaptive changepoint estimation", "link": "http://arxiv.org/abs/2306.04702", "description": "We propose a new, computationally efficient, sparsity adaptive changepoint\nestimator for detecting changes in unknown subsets of a high-dimensional data\nsequence. Assuming the data sequence is Gaussian, we prove that the new method\nsuccessfully estimates the number and locations of changepoints with a given\nerror rate and under minimal conditions, for all sparsities of the changing\nsubset. Moreover, our method has computational complexity linear up to\nlogarithmic factors in both the length and number of time series, making it\napplicable to large data sets. Through extensive numerical studies we show that\nthe new methodology is highly competitive in terms of both estimation accuracy\nand computational cost. The practical usefulness of the method is illustrated\nby analysing sensor data from a hydro power plant. An efficient R\nimplementation is available."}, "http://arxiv.org/abs/2306.07119": {"title": "Improving Forecasts for Heterogeneous Time Series by \"Averaging\", with Application to Food Demand Forecast", "link": "http://arxiv.org/abs/2306.07119", "description": "A common forecasting setting in real world applications considers a set of\npossibly heterogeneous time series of the same domain. Due to different\nproperties of each time series such as length, obtaining forecasts for each\nindividual time series in a straightforward way is challenging. This paper\nproposes a general framework utilizing a similarity measure based on Dynamic Time\nWarping to find similar time series to build neighborhoods in a k-Nearest\nNeighbor fashion, and improve forecasts of possibly simple models by averaging.\nSeveral ways of performing the averaging are suggested, and theoretical\narguments underline the usefulness of averaging for forecasting. Additionally,\ndiagnostic tools are proposed allowing a deep understanding of the procedure."}, "http://arxiv.org/abs/2307.07898": {"title": "A Graph-Prediction-Based Approach for Debiasing Underreported Data", "link": "http://arxiv.org/abs/2307.07898", "description": "We present a novel Graph-based debiasing Algorithm for Underreported Data\n(GRAUD) aiming at efficient joint estimation of event counts and discovery\nprobabilities across spatial or graphical structures. This innovative method\nprovides a solution to problems seen in fields such as policing data and\nCOVID-19 data analysis. Our approach avoids the need for strong priors\ntypically associated with Bayesian frameworks. 
By leveraging the graph\nstructures on the unknown variables $n$ and $p$, our method debiases the\nunderreported data and estimates the discovery probability at the same time. We\nvalidate the effectiveness of our method through simulation experiments and\nillustrate its practicality in one real-world application: police 911\ncalls-to-service data."}, "http://arxiv.org/abs/2307.12832": {"title": "More Power by using Fewer Permutations", "link": "http://arxiv.org/abs/2307.12832", "description": "It is conventionally believed that a permutation test should ideally use all\npermutations. If this is computationally unaffordable, it is believed one\nshould use the largest affordable Monte Carlo sample or (algebraic) subgroup of\npermutations. We challenge this belief by showing we can sometimes obtain\ndramatically more power by using a tiny subgroup. As the subgroup is tiny, this\nsimultaneously comes at a much lower computational cost. We exploit this to\nimprove the popular permutation-based Westfall & Young MaxT multiple testing\nmethod. We study the relative efficiency in a Gaussian location model, and find\nthe largest gain in high dimensions."}, "http://arxiv.org/abs/2309.14156": {"title": "Designing and evaluating an online reinforcement learning agent for physical exercise recommendations in N-of-1 trials", "link": "http://arxiv.org/abs/2309.14156", "description": "Personalized adaptive interventions offer the opportunity to increase patient\nbenefits; however, there are challenges in their planning and implementation.\nOnce implemented, it is an important question whether personalized adaptive\ninterventions are indeed clinically more effective compared to a fixed gold\nstandard intervention. In this paper, we present an innovative N-of-1 trial\nstudy design testing whether implementing a personalized intervention by an\nonline reinforcement learning agent is feasible and effective. Throughout, we\nuse a new study on physical exercise recommendations to reduce pain in\nendometriosis for illustration. We describe the design of a contextual bandit\nrecommendation agent and evaluate the agent in simulation studies. The results\nshow that, first, implementing a personalized intervention by an online\nreinforcement learning agent is feasible. Second, such adaptive interventions\nhave the potential to improve patients' benefits even if only a few observations\nare available. As one challenge, they add complexity to the design and\nimplementation process. In order to quantify the expected benefit, data from\nprevious interventional studies are required. We expect our approach to be\ntransferable to other interventions and clinical applications."}, "http://arxiv.org/abs/2311.14766": {"title": "Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing", "link": "http://arxiv.org/abs/2311.14766", "description": "Reinforcement Learning from Human Feedback (RLHF) has played a crucial role\nin the success of large models such as ChatGPT. RLHF is a reinforcement\nlearning framework that incorporates human feedback to improve learning\neffectiveness and performance. However, obtaining preference feedback manually\nis quite expensive in commercial applications. Some statistical commercial\nindicators are often more valuable yet are typically ignored in RLHF. There exists a\ngap between commercial targets and model training. 
In our research, we\nattempt to fill this gap with statistical business feedback instead of human\nfeedback, using AB testing, a well-established statistical method.\nReinforcement Learning from Statistical Feedback (RLSF) based on AB testing is\nproposed. Statistical inference methods are used to obtain preferences for\ntraining the reward network, which fine-tunes the pre-trained model in a\nreinforcement learning framework, achieving greater business value.\nFurthermore, we extend AB testing with double selections at a single time-point\nto ANT testing with multiple selections at different feedback time points.\nMoreover, we design numerical experiments to validate the effectiveness of our\nalgorithm framework."}, "http://arxiv.org/abs/2311.14846": {"title": "Fast Estimation of the Renshaw-Haberman Model and Its Variants", "link": "http://arxiv.org/abs/2311.14846", "description": "In mortality modelling, cohort effects are often taken into consideration as\nthey add insights about variations in mortality across different generations.\nStatistically speaking, models such as the Renshaw-Haberman model may provide a\nbetter fit to historical data compared to their counterparts that incorporate\nno cohort effects. However, when such models are estimated using an iterative\nmaximum likelihood method in which parameters are updated one at a time,\nconvergence is typically slow and may not even be reached within a reasonably\nestablished maximum number of iterations. Among others, the slow convergence\nproblem hinders the study of parameter uncertainty through bootstrapping\nmethods.\n\nIn this paper, we propose an intuitive estimation method that minimizes the\nsum of squared errors between actual and fitted log central death rates. The\ncomplications arising from the incorporation of cohort effects are overcome by\nformulating part of the optimization as a principal component analysis with\nmissing values. We also show how the proposed method can be generalized to\nvariants of the Renshaw-Haberman model with further computational improvement,\neither with a simplified model structure or an additional constraint. Using\nmortality data from the Human Mortality Database (HMD), we demonstrate that our\nproposed method produces satisfactory estimation results and is significantly\nmore efficient compared to the traditional likelihood-based approach."}, "http://arxiv.org/abs/2311.14867": {"title": "Disaggregating Time-Series with Many Indicators: An Overview of the DisaggregateTS Package", "link": "http://arxiv.org/abs/2311.14867", "description": "Low-frequency time-series (e.g., quarterly data) are often treated as\nbenchmarks for interpolating to higher frequencies, since they generally\nexhibit greater precision and accuracy in contrast to their high-frequency\ncounterparts (e.g., monthly data) reported by governmental bodies. An array of\nregression-based methods have been proposed in the literature which aim to\nestimate a target high-frequency series using higher-frequency indicators.\nHowever, in the era of big data and with the prevalence of large volumes of\nadministrative data sources, there is a need to extend traditional methods to\nwork in high-dimensional settings, i.e., where the number of indicators is\nsimilar to or larger than the number of low-frequency samples. The package\nDisaggregateTS includes both classical regression-based disaggregation methods\nalongside recent extensions to high-dimensional settings, cf. Mosley et al.\n(2022). 
This paper provides guidance on how to implement these methods via the\npackage in R, and demonstrates their use in an application to disaggregating\nCO2 emissions."}, "http://arxiv.org/abs/2311.14889": {"title": "Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data", "link": "http://arxiv.org/abs/2311.14889", "description": "In this paper we review recent advances in statistical methods for the\nevaluation of the heterogeneity of treatment effects (HTE), including subgroup\nidentification and estimation of individualized treatment regimens, from\nrandomized clinical trials and observational studies. We identify several types\nof approaches using the features introduced in Lipkovich, Dmitrienko and\nD'Agostino (2017) that distinguish the recommended principled methods from\nbasic methods for HTE evaluation that typically rely on rules of thumb and\ngeneral guidelines (the methods are often referred to as common practices). We\ndiscuss the advantages and disadvantages of various principled methods as well\nas common measures for evaluating their performance. We use simulated data and\na case study based on a historical clinical trial to illustrate several new\napproaches to HTE evaluation."}, "http://arxiv.org/abs/2311.14894": {"title": "Kernel-based measures of association between inputs and outputs based on ANOVA", "link": "http://arxiv.org/abs/2311.14894", "description": "The ANOVA decomposition of a function with random input variables provides ANOVA\nfunctionals (AFs), which contain information about the contributions of the\ninput variables to the output variable(s). By embedding AFs into an appropriate\nreproducing kernel Hilbert space regarding their distributions, we propose an\nefficient statistical test of independence between the input variables and\noutput variable(s). The resulting test statistic leads to new dependent\nmeasures of association between inputs and outputs that allow for i) dealing\nwith any distribution of AFs, including the Cauchy distribution, and ii) accounting\nfor the necessary or desirable moments of AFs and the interactions among the\ninput variables. In uncertainty quantification for mathematical models, a\nnumber of existing measures are special cases of this framework. We then\nprovide unified and general global sensitivity indices and their consistent\nestimators, including asymptotic distributions. For Gaussian-distributed AFs,\nwe obtain Sobol' indices and dependent generalized sensitivity indices using\nquadratic kernels."}, "http://arxiv.org/abs/2311.15012": {"title": "False Discovery Rate Controlling Procedures with BLOSUM62 substitution matrix and their application to HIV Data", "link": "http://arxiv.org/abs/2311.15012", "description": "Identifying significant sites in sequence data and analogous data is of\nfundamental importance in many biological fields. Fisher's exact test is a\npopular technique; however, this approach is not appropriate for sparse count data\nbecause it leads to overly conservative decisions. Since count data in HIV data are\ntypically very sparse, it is crucial to incorporate additional information into\nstatistical models to improve testing power. In order to develop new approaches\nto incorporate biological information in the false discovery controlling\nprocedure, we propose two models: one based on an empirical Bayes model under\nindependence of amino acids, and the other using pairwise associations of amino\nacids based on a Markov random field with the BLOSUM62 substitution matrix. 
We\napply the proposed methods to HIV data and identify significant sites\nincorporating the BLOSUM62 matrix, whereas the traditional method based on Fisher's\ntest does not discover any site. These newly developed methods have the\npotential to handle many biological problems in vaccine and drug\ntrials and phenotype studies."}, "http://arxiv.org/abs/2311.15031": {"title": "Robust and Efficient Semi-supervised Learning for Ising Model", "link": "http://arxiv.org/abs/2311.15031", "description": "In biomedical studies, it is often desirable to characterize the interactive\nmode of multiple disease outcomes beyond their marginal risk. The Ising model is\none of the most popular choices for this purpose. Nevertheless, the\nlearning efficiency of Ising models can be impeded by the scarcity of accurate\ndisease labels, which is a prominent problem in contemporary studies driven by\nelectronic health records (EHR). Semi-supervised learning (SSL) leverages the\nlarge unlabeled sample with auxiliary EHR features to assist the learning with\nlabeled data only and is a potential solution to this issue. In this paper, we\ndevelop a novel SSL method for efficient inference of the Ising model. Our method\nfirst models the outcomes against the auxiliary features, then uses this model to\nproject the score function of the supervised estimator onto the EHR features,\nand incorporates the unlabeled sample to augment the supervised estimator for\nvariance reduction without introducing bias. For the key step of conditional\nmodeling, we propose strategies that can effectively leverage the auxiliary EHR\ninformation while maintaining moderate model complexity. In addition, we\nintroduce approaches, including intrinsic efficient updates and ensembling, to\novercome the potential misspecification of the conditional model that may cause\nefficiency loss. Our method is justified by asymptotic theory and shown to\noutperform existing SSL methods through simulation studies. We also illustrate\nits utility in a real example about several key phenotypes related to frequent\nICU admission using the MIMIC-III data set."}, "http://arxiv.org/abs/2311.15257": {"title": "Bayesian Imputation of Revolving Doors", "link": "http://arxiv.org/abs/2311.15257", "description": "Political scientists and sociologists study how individuals switch back and\nforth between public and private organizations, for example between regulator\nand lobbyist positions, a phenomenon called \"revolving doors\". However, they\nface an important issue of data missingness, as not all data relevant to this\nquestion is freely available. For example, the nomination of an individual to a\ngiven public-sector position of power might be publicly disclosed, but not\ntheir subsequent positions in the private sector. In this article, we adopt a\nBayesian data augmentation strategy for discrete time series and propose\nmeasures of public-private mobility across the French state at large,\nmobilizing administrative and digital data. We relax the homogeneity hypotheses of\ntraditional hidden Markov models and implement a version of a Markov switching\nmodel, which allows for varying parameters across individuals and time and for\nauto-correlated behaviors. 
We describe how the revolving doors phenomenon\nvaries across the French state and how it has evolved between 1990 and 2022."}, "http://arxiv.org/abs/2311.15322": {"title": "False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values", "link": "http://arxiv.org/abs/2311.15322", "description": "The effective utilization of structural information in data while ensuring\nstatistical validity poses a significant challenge in false discovery rate\n(FDR) analyses. Conformal inference provides rigorous theory for grounding\ncomplex machine learning methods without relying on strong assumptions or\nhighly idealized models. However, existing conformal methods have limitations\nin handling structured multiple testing. This is because their validity\nrequires the deployment of symmetric rules, which assume the exchangeability of\ndata points and permutation-invariance of fitting algorithms. To overcome these\nlimitations, we introduce the pseudo local index of significance (PLIS)\nprocedure, which is capable of accommodating asymmetric rules and requires only\npairwise exchangeability between the null conformity scores. We demonstrate\nthat PLIS offers finite-sample guarantees in FDR control and the ability to\nassign higher weights to relevant data points. Numerical results confirm the\neffectiveness and robustness of PLIS and show improvements in power compared to\nexisting model-free methods in various scenarios."}, "http://arxiv.org/abs/2311.15359": {"title": "Goodness-of-fit tests for the one-sided L\\'evy distribution based on quantile conditional moments", "link": "http://arxiv.org/abs/2311.15359", "description": "In this paper we introduce a novel statistical framework based on the first\ntwo quantile conditional moments that facilitates effective goodness-of-fit\ntesting for one-sided L\\'evy distributions. The scale-ratio framework\nintroduced in this paper extends our previous results in which we have shown\nhow to extract unique distribution features using conditional variance ratio\nfor the generic class of {\\alpha}-stable distributions. We show that the\nconditional moment-based goodness-of-fit statistics are a good alternative to\nother methods introduced in the literature tailored to the one-sided L\\'evy\ndistributions. The usefulness of our approach is verified using an empirical\ntest power study. For completeness, we also derive the asymptotic distributions\nof the test statistics and show how to apply our framework to real data."}, "http://arxiv.org/abs/2311.15384": {"title": "Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means", "link": "http://arxiv.org/abs/2311.15384", "description": "Clustering stands as one of the most prominent challenges within the realm of\nunsupervised machine learning. Among the array of centroid-based clustering\nalgorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes\ncenter stage as one of the extensively employed techniques in the literature.\nNonetheless, both $k$-means and its variants grapple with noteworthy\nlimitations. These encompass a heavy reliance on initial cluster centroids,\nsusceptibility to converging into local minima of the objective function, and\nsensitivity to outliers and noise in the data. When confronted with data\ncontaining noisy or outlier-laden observations, the Median-of-Means (MoM)\nestimator emerges as a stabilizing force for any centroid-based clustering\nframework. 
On a different note, a prevalent constraint among existing\nclustering methodologies resides in the prerequisite knowledge of the number of\nclusters prior to analysis. Utilizing model-based methodologies, such as\nBayesian nonparametric models, offers the advantage of infinite mixture models,\nthereby circumventing the need for such requirements. Motivated by these facts,\nin this article, we present an efficient and automatic clustering technique by\nintegrating the principles of model-based and centroid-based methodologies that\nmitigates the effect of noise on the quality of clustering while ensuring that\nthe number of clusters need not be specified in advance. Statistical guarantees\non the upper bound of clustering error, and rigorous assessment through\nsimulated and real datasets suggest the advantages of our proposed method over\nexisting state-of-the-art clustering algorithms."}, "http://arxiv.org/abs/2311.15410": {"title": "A Comprehensive Analysis of HIV Treatment Efficacy in the ACTG 175 Trial Through Multiple-Endpoint Approaches", "link": "http://arxiv.org/abs/2311.15410", "description": "In the realm of medical research, the intricate interplay of epidemiological\nrisk, genomic activity, adverse events, and clinical response necessitates a\nnuanced consideration of multiple variables. Clinical trials, designed to\nmeticulously assess the efficacy and safety of interventions, routinely\nincorporate a diverse array of endpoints. While a primary endpoint is\ncustomary, supplemented by key secondary endpoints, the statistical\nsignificance is typically evaluated independently for each. To address the\ninherent challenges in studying multiple endpoints, diverse strategies,\nincluding composite endpoints and global testing, have been proposed. This work\nstands apart by focusing on the evaluation of a clinical trial, deviating from\nthe conventional approach to underscore the efficacy of a multiple-endpoint\nprocedure. A double-blind study was conducted to gauge the treatment efficacy\nin adults infected with human immunodeficiency virus type 1 (HIV-1), featuring\nCD4 cell counts ranging from 200 to 500 per cubic millimeter. A total of 2467\nHIV-1-infected patients (43 percent without prior antiretroviral treatment)\nwere randomly assigned to one of four daily regimens: 600 mg of zidovudine; 600\nmg of zidovudine plus 400 mg of didanosine; 600 mg of zidovudine plus 2.25 mg\nof zalcitabine; or 400 mg of didanosine. The primary endpoint comprised a >50\npercent decline in CD4 cell count, development of acquired immunodeficiency\nsyndrome (AIDS), or death. This study sought to determine the efficacy and\nsafety of zidovudine (AZT) versus didanosine (ddI), AZT plus ddI, and AZT plus\nzalcitabine (ddC) in preventing disease progression in HIV-infected patients\nwith CD4 counts of 200-500 cells/mm3. By jointly considering all endpoints, the\nmultiple-endpoints approach yields results of greater significance than a\nsingle-endpoint approach."}, "http://arxiv.org/abs/2311.15434": {"title": "Structural Discovery with Partial Ordering Information for Time-Dependent Data with Convergence Guarantees", "link": "http://arxiv.org/abs/2311.15434", "description": "Structural discovery amongst a set of variables is of interest in both static\nand dynamic settings. 
In the presence of lead-lag dependencies in the data, the\ndynamics of the system can be represented through a structural equation model\n(SEM) that simultaneously captures the contemporaneous and temporal\nrelationships amongst the variables, with the former encoded through a directed\nacyclic graph (DAG) for model identification. In many real applications, a\npartial ordering amongst the nodes of the DAG is available, which makes it\neither beneficial or imperative to incorporate it as a constraint in the\nproblem formulation. This paper develops an algorithm that can seamlessly\nincorporate a priori partial ordering information for solving a linear SEM\n(also known as Structural Vector Autoregression) under a high-dimensional\nsetting. The proposed algorithm is provably convergent to a stationary point,\nand exhibits competitive performance on both synthetic and real data sets."}, "http://arxiv.org/abs/2311.15485": {"title": "Calibrated Generalized Bayesian Inference", "link": "http://arxiv.org/abs/2311.15485", "description": "We provide a simple and general solution to the fundamental open problem of\ninaccurate uncertainty quantification of Bayesian inference in misspecified or\napproximate models, and of generalized Bayesian posteriors more generally.\nWhile existing solutions are based on explicit Gaussian posterior\napproximations, or computationally onerous post-processing procedures, we\ndemonstrate that correct uncertainty quantification can be achieved by\nsubstituting the usual posterior with an alternative posterior that conveys the\nsame information. This solution applies to both likelihood-based and loss-based\nposteriors, and we formally demonstrate the reliable uncertainty quantification\nof this approach. The new approach is demonstrated through a range of examples,\nincluding generalized linear models, and doubly intractable models."}, "http://arxiv.org/abs/2311.15498": {"title": "Adjusted inference for multiple testing procedure in group sequential designs", "link": "http://arxiv.org/abs/2311.15498", "description": "Adjustment of statistical significance levels for repeated analysis in group\nsequential trials has been understood for some time. Similarly, methods for\nadjustment accounting for testing multiple hypotheses are common. There is\nlimited research on simultaneously adjusting for both multiple hypothesis\ntesting and multiple analyses of one or more hypotheses. We address this gap by\nproposing adjusted-sequential p-values that reject an elementary hypothesis\nwhen its adjusted-sequential p-values are less than or equal to the family-wise\nType I error rate (FWER) in a group sequential design. We also propose\nsequential p-values for intersection hypotheses as a tool to compute adjusted\nsequential p-values for elementary hypotheses. We demonstrate the application\nusing weighted Bonferroni tests and weighted parametric tests, comparing\nadjusted sequential p-values to a desired FWER for inference on each elementary\nhypothesis tested."}, "http://arxiv.org/abs/2311.15598": {"title": "Optimal Clustering of Discrete Mixtures: Binomial, Poisson, Block Models, and Multi-layer Networks", "link": "http://arxiv.org/abs/2311.15598", "description": "In this paper, we first study the fundamental limit of clustering networks\nwhen a multi-layer network is present. 
Under the mixture multi-layer stochastic\nblock model (MMSBM), we show that the minimax optimal network clustering error\nrate takes an exponential form and is characterized by the Renyi\ndivergence between the edge probability distributions of the component\nnetworks. We propose a novel two-stage network clustering method including a\ntensor-based initialization algorithm involving both node and sample splitting\nand a refinement procedure using a likelihood-based Lloyd algorithm. Network\nclustering must be accompanied by node community detection. Our proposed\nalgorithm achieves the minimax optimal network clustering error rate and allows\nextreme network sparsity under the MMSBM. Numerical simulations and real data\nexperiments both validate that our method outperforms existing methods.\nOftentimes, the edges of networks carry count-type weights. We then extend our\nmethodology and analysis framework to study the minimax optimal clustering\nerror rate for mixtures of discrete distributions, including Binomial, Poisson,\nand multi-layer Poisson networks. The minimax optimal clustering error rates in\nthese discrete mixtures all take the same exponential form characterized by the\nRenyi divergences. These optimal clustering error rates in discrete mixtures\ncan also be achieved by our proposed two-stage clustering algorithm."}, "http://arxiv.org/abs/2311.15610": {"title": "Bayesian Approach to Linear Bayesian Networks", "link": "http://arxiv.org/abs/2311.15610", "description": "This study proposes the first Bayesian approach for learning high-dimensional\nlinear Bayesian networks. The proposed approach iteratively estimates each\nelement of the topological ordering, working backward, and its parent set using the\ninverse of a partial covariance matrix. The proposed method successfully\nrecovers the underlying structure when Bayesian regularization for the inverse\ncovariance matrix with unequal shrinkage is applied. Specifically, we show\nthat sample sizes of $n = \\Omega( d_M^2 \\log p)$ and $n = \\Omega(d_M^2\np^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian\nnetworks with sub-Gaussian and 4m-th bounded-moment error distributions,\nrespectively, where $p$ is the number of nodes and $d_M$ is the maximum degree\nof the moralized graph. The theoretical findings are supported by extensive\nsimulation studies and real data analysis. Furthermore, the proposed\nmethod is demonstrated to outperform state-of-the-art frequentist approaches,\nsuch as the BHLSM, LISTEN, and TD algorithms, on synthetic data."}, "http://arxiv.org/abs/2311.15715": {"title": "Spatio-temporal insights for wind energy harvesting in South Africa", "link": "http://arxiv.org/abs/2311.15715", "description": "Understanding complex spatial dependency structures is a crucial\nconsideration when attempting to build a modeling framework for wind speeds.\nIdeally, wind speed modeling should be very efficient since the wind speed can\nvary significantly from day to day or even hour to hour. But complex models\nusually require high computational resources. This paper illustrates how to\nconstruct and implement a hierarchical Bayesian model for wind speeds using the\nWeibull density function based on a continuously-indexed spatial field. 
For\nefficient (near real-time) inference, the proposed model is implemented in the R\npackage R-INLA, based on the integrated nested Laplace approximation (INLA).\nSpecific attention is given to the theoretical and practical considerations of\nincluding a spatial component within a Bayesian hierarchical model. The\nproposed model is then applied and evaluated using a large volume of real data\nsourced from the coastal regions of South Africa between 2011 and 2021. By\nprojecting the mean and standard deviation of the Matern field, the results\nshow that the spatial modeling component is effectively capturing variation in\nwind speeds which cannot be explained by the other model components. The mean\nof the spatial field varies between $\\pm 0.3$ across the domain. These insights\nare valuable for the planning and implementation of green energy resources such as\nwind farms in South Africa. Furthermore, shortcomings in the spatial sampling\ndomain are evident in the analysis, which is important for future sampling\nstrategies. The proposed model, and the conglomerated dataset, can serve as a\nfoundational framework for future investigations into wind energy in South\nAfrica."}, "http://arxiv.org/abs/2311.15860": {"title": "Frequentist Prediction Sets for Species Abundance using Indirect Information", "link": "http://arxiv.org/abs/2311.15860", "description": "Citizen science databases that consist of volunteer-led sampling efforts of\nspecies communities are relied on as essential sources of data in ecology.\nSummarizing such data across counties with frequentist-valid prediction sets\nfor each county provides an interpretable comparison across counties of varying\nsize or composition. As citizen science data often feature unequal sampling\nefforts across a spatial domain, prediction sets constructed with indirect\nmethods that share information across counties may be used to improve\nprecision. In this article, we present a nonparametric framework to obtain\nprecise prediction sets for a multinomial random sample based on indirect\ninformation that maintain frequentist coverage guarantees for each county. We\ndetail a simple algorithm to obtain prediction sets for each county using\nindirect information where the computation time does not depend on the sample\nsize and scales nicely with the number of species considered. The indirect\ninformation may be estimated by a proposed empirical Bayes procedure based on\ninformation from auxiliary data. Our approach makes inference for under-sampled\ncounties more precise, while maintaining area-specific frequentist validity for\neach county. Our method is used to provide a useful description of avian\nspecies abundance in North Carolina, USA based on citizen science data from the\neBird database."}, "http://arxiv.org/abs/2311.15982": {"title": "Stab-GKnock: Controlled variable selection for partially linear models using generalized knockoffs", "link": "http://arxiv.org/abs/2311.15982", "description": "The recently proposed fixed-X knockoff is a powerful variable selection\nprocedure that controls the false discovery rate (FDR) in any finite-sample\nsetting, yet its theoretical insights are difficult to show beyond Gaussian\nlinear models. In this paper, we make the first attempt to extend the fixed-X\nknockoff to partially linear models by using generalized knockoff features, and\npropose a new stability generalized knockoff (Stab-GKnock) procedure by\nincorporating the selection probability as a feature importance score. 
We provide FDR\ncontrol and power guarantees under some regularity conditions. In addition, we\npropose a two-stage method under high dimensionality by introducing a new joint\nfeature screening procedure, with a guaranteed sure screening property. Extensive\nsimulation studies are conducted to evaluate the finite-sample performance of\nthe proposed method. A real data example is also provided for illustration."}, "http://arxiv.org/abs/2311.15988": {"title": "A novel CFA+EFA model to detect aberrant respondents", "link": "http://arxiv.org/abs/2311.15988", "description": "Aberrant respondents are common yet extremely detrimental to the quality\nof social surveys or questionnaires. Recently, factor mixture models have been\nemployed to identify individuals providing deceptive or careless responses. We\npropose a comprehensive factor mixture model that combines confirmatory and\nexploratory factor models to represent both the non-aberrant and aberrant\ncomponents of the responses. The flexibility of the proposed solution allows\nfor the identification of two of the most common aberrant response styles,\nnamely faking and careless responding. We validated our approach by means of\ntwo simulations and two case studies. The results indicate the effectiveness of\nthe proposed model in handling aberrant responses in social and behavioral\nsurveys."}, "http://arxiv.org/abs/2311.16025": {"title": "Change Point Detection for Random Objects using Distance Profiles", "link": "http://arxiv.org/abs/2311.16025", "description": "We introduce a new powerful scan statistic and an associated test for\ndetecting the presence and pinpointing the location of a change point within\nthe distribution of a data sequence where the data elements take values in a\ngeneral separable metric space $(\\Omega, d)$. These change points mark abrupt\nshifts in the distribution of the data sequence. Our method hinges on distance\nprofiles, where the distance profile of an element $\\omega \\in \\Omega$ is the\ndistribution of distances from $\\omega$ as dictated by the data. Our approach\nis fully non-parametric and universally applicable to diverse data types,\nincluding distributional and network data, as long as distances between the\ndata objects are available. From a practical point of view, it is nearly\ntuning parameter-free, except for the specification of cut-off intervals near\nthe endpoints where change points are assumed not to occur. Our theoretical\nresults include a precise characterization of the asymptotic distribution of\nthe test statistic under the null hypothesis of no change points and rigorous\nguarantees on the consistency of the test in the presence of change points\nunder contiguous alternatives, as well as for the consistency of the estimated\nchange point location. Through comprehensive simulation studies encompassing\nmultivariate data, bivariate distributional data and sequences of graph\nLaplacians, we demonstrate the effectiveness of our approach in both change\npoint detection power and estimating the location of the change point. We apply\nour method to real datasets, including U.S. 
electricity generation compositions\nand Bluetooth proximity networks, underscoring its practical relevance."}, "http://arxiv.org/abs/2203.02849": {"title": "Variable Selection with the Knockoffs: Composite Null Hypotheses", "link": "http://arxiv.org/abs/2203.02849", "description": "The fixed-X knockoff filter is a flexible framework for variable selection\nwith false discovery rate (FDR) control in linear models with arbitrary design\nmatrices (of full column rank) and it allows for finite-sample selective\ninference via the Lasso estimates. In this paper, we extend the theory of the\nknockoff procedure to tests with composite null hypotheses, which are usually\nmore relevant to real-world problems. The main technical challenge lies in\nhandling composite nulls in tandem with dependent features from arbitrary\ndesigns. We develop two methods for composite inference with the knockoffs,\nnamely, shifted ordinary least-squares (S-OLS) and feature-response product\nperturbation (FRPP), building on new structural properties of test statistics\nunder composite nulls. We also propose two heuristic variants of S-OLS method\nthat outperform the celebrated Benjamini-Hochberg (BH) procedure for composite\nnulls, which serves as a heuristic baseline under dependent test statistics.\nFinally, we analyze the loss in FDR when the original knockoff procedure is\nnaively applied on composite tests."}, "http://arxiv.org/abs/2205.02617": {"title": "COMBSS: Best Subset Selection via Continuous Optimization", "link": "http://arxiv.org/abs/2205.02617", "description": "The problem of best subset selection in linear regression is considered with\nthe aim to find a fixed size subset of features that best fits the response.\nThis is particularly challenging when the total available number of features is\nvery large compared to the number of data samples. Existing optimal methods for\nsolving this problem tend to be slow while fast methods tend to have low\naccuracy. Ideally, new methods perform best subset selection faster than\nexisting optimal methods but with comparable accuracy, or, being more accurate\nthan methods of comparable computational speed. Here, we propose a novel\ncontinuous optimization method that identifies a subset solution path, a small\nset of models of varying size, that consists of candidates for the single best\nsubset of features, that is optimal in a specific sense in linear regression.\nOur method turns out to be fast, making the best subset selection possible when\nthe number of features is well in excess of thousands. Because of the\noutstanding overall performance, framing the best subset selection challenge as\na continuous optimization problem opens new research directions for feature\nextraction for a large variety of regression models."}, "http://arxiv.org/abs/2210.09339": {"title": "Probability Weighted Clustered Coefficients Regression Models in Complex Survey Sampling", "link": "http://arxiv.org/abs/2210.09339", "description": "Regression analysis is commonly conducted in survey sampling. However,\nexisting methods fail when the relationships vary across different areas or\ndomains. In this paper, we propose a unified framework to study the group-wise\ncovariate effect under complex survey sampling based on pairwise penalties, and\nthe associated objective function is solved by the alternating direction method\nof multipliers. Theoretical properties of the proposed method are investigated\nunder some generality conditions. 
Numerical experiments demonstrate the\nsuperiority of the proposed method in terms of identifying groups and\nestimation efficiency for both linear regression models and logistic regression\nmodels."}, "http://arxiv.org/abs/2211.00460": {"title": "Augmentation Invariant Manifold Learning", "link": "http://arxiv.org/abs/2211.00460", "description": "Data augmentation is a widely used technique and an essential ingredient in\nthe recent advance in self-supervised representation learning. By preserving\nthe similarity between augmented data, the resulting data representation can\nimprove various downstream analyses and achieve state-of-the-art performance in\nmany applications. Despite the empirical effectiveness, most existing methods\nlack theoretical understanding under a general nonlinear setting. To fill this\ngap, we develop a statistical framework on a low-dimension product manifold to\nmodel the data augmentation transformation. Under this framework, we introduce\na new representation learning method called augmentation invariant manifold\nlearning and design a computationally efficient algorithm by reformulating it\nas a stochastic optimization problem. Compared with existing self-supervised\nmethods, the new method simultaneously exploits the manifold's geometric\nstructure and invariant property of augmented data and has an explicit\ntheoretical guarantee. Our theoretical investigation characterizes the role of\ndata augmentation in the proposed method and reveals why and how the data\nrepresentation learned from augmented data can improve the $k$-nearest neighbor\nclassifier in the downstream analysis, showing that a more complex data\naugmentation leads to more improvement in downstream analysis. Finally,\nnumerical experiments on simulated and real datasets are presented to\ndemonstrate the merit of the proposed method."}, "http://arxiv.org/abs/2212.05831": {"title": "Conditional-mean Multiplicative Operator Models for Count Time Series", "link": "http://arxiv.org/abs/2212.05831", "description": "Multiplicative error models (MEMs) are commonly used for real-valued time\nseries, but they cannot be applied to discrete-valued count time series as the\ninvolved multiplication would not preserve the integer nature of the data.\nThus, the concept of a multiplicative operator for counts is proposed (as well\nas several specific instances thereof), which are then used to develop a kind\nof MEMs for count time series (CMEMs). If equipped with a linear conditional\nmean, the resulting CMEMs are closely related to the class of so-called\ninteger-valued generalized autoregressive conditional heteroscedasticity\n(INGARCH) models and might be used as a semi-parametric extension thereof.\nImportant stochastic properties of different types of INGARCH-CMEM as well as\nrelevant estimation approaches are derived, namely types of quasi-maximum\nlikelihood and weighted least squares estimation. The performance and\napplication are demonstrated with simulations as well as with two real-world\ndata examples."}, "http://arxiv.org/abs/2301.01381": {"title": "Testing High-dimensional Multinomials with Applications to Text Analysis", "link": "http://arxiv.org/abs/2301.01381", "description": "Motivated by applications in text mining and discrete distribution inference,\nwe investigate the testing for equality of probability mass functions of $K$\ngroups of high-dimensional multinomial distributions. 
A test statistic, which\nis shown to have an asymptotic standard normal distribution under the null, is\nproposed. The optimal detection boundary is established, and the proposed test\nis shown to achieve this optimal detection boundary across the entire parameter\nspace of interest. The proposed method is demonstrated in simulation studies\nand applied to analyze two real-world datasets to examine variation among\nconsumer reviews of Amazon movies and diversity of statistical paper abstracts."}, "http://arxiv.org/abs/2303.03092": {"title": "Environment Invariant Linear Least Squares", "link": "http://arxiv.org/abs/2303.03092", "description": "This paper considers a multi-environment linear regression model in which\ndata from multiple experimental settings are collected. The joint distribution\nof the response variable and covariates may vary across different environments,\nyet the conditional expectations of $y$ given the unknown set of important\nvariables are invariant. Such a statistical model is related to the problem of\nendogeneity, causal inference, and transfer learning. The motivation behind it\nis illustrated by how the goals of prediction and attribution are inherent in\nestimating the true parameter and the important variable set. We construct a\nnovel environment invariant linear least squares (EILLS) objective function, a\nmulti-environment version of linear least-squares regression that leverages the\nabove conditional expectation invariance structure and heterogeneity among\ndifferent environments to determine the true parameter. Our proposed method is\napplicable without any additional structural knowledge and can identify the\ntrue parameter under a near-minimal identification condition. We establish\nnon-asymptotic $\\ell_2$ error bounds on the estimation error for the EILLS\nestimator in the presence of spurious variables. Moreover, we further show that\nthe $\\ell_0$ penalized EILLS estimator can achieve variable selection\nconsistency in high-dimensional regimes. These non-asymptotic results\ndemonstrate the sample efficiency of the EILLS estimator and its capability to\ncircumvent the curse of endogeneity in an algorithmic manner without any prior\nstructural knowledge. To the best of our knowledge, this paper is the first to\nrealize statistically efficient invariance learning in the general linear\nmodel."}, "http://arxiv.org/abs/2305.07721": {"title": "Designing Optimal Behavioral Experiments Using Machine Learning", "link": "http://arxiv.org/abs/2305.07721", "description": "Computational models are powerful tools for understanding human cognition and\nbehavior. They let us express our theories clearly and precisely, and offer\npredictions that can be subtle and often counter-intuitive. However, this same\nrichness and ability to surprise means our scientific intuitions and\ntraditional tools are ill-suited to designing experiments to test and compare\nthese models. To avoid these pitfalls and realize the full potential of\ncomputational modeling, we require tools to design experiments that provide\nclear answers about what models explain human behavior and the auxiliary\nassumptions those models must make. Bayesian optimal experimental design (BOED)\nformalizes the search for optimal experimental designs by identifying\nexperiments that are expected to yield informative data. 
In this work, we\nprovide a tutorial on leveraging recent advances in BOED and machine learning\nto find optimal experiments for any kind of model that we can simulate data\nfrom, and show how by-products of this procedure allow for quick and\nstraightforward evaluation of models and their parameters against real\nexperimental data. As a case study, we consider theories of how people balance\nexploration and exploitation in multi-armed bandit decision-making tasks. We\nvalidate the presented approach using simulations and a real-world experiment.\nAs compared to experimental designs commonly used in the literature, we show\nthat our optimal designs more efficiently determine which of a set of models\nbest account for individual human behavior, and more efficiently characterize\nbehavior given a preferred model. At the same time, formalizing a scientific\nquestion such that it can be adequately addressed with BOED can be challenging\nand we discuss several potential caveats and pitfalls that practitioners should\nbe aware of. We provide code and tutorial notebooks to replicate all analyses."}, "http://arxiv.org/abs/2307.05251": {"title": "Minimizing robust density power-based divergences for general parametric density models", "link": "http://arxiv.org/abs/2307.05251", "description": "Density power divergence (DPD) is designed to robustly estimate the\nunderlying distribution of observations, in the presence of outliers. However,\nDPD involves an integral of the power of the parametric density models to be\nestimated; the explicit form of the integral term can be derived only for\nspecific densities, such as normal and exponential densities. While we may\nperform a numerical integration for each iteration of the optimization\nalgorithms, the computational complexity has hindered the practical application\nof DPD-based estimation to more general parametric densities. To address the\nissue, this study introduces a stochastic approach to minimize DPD for general\nparametric density models. The proposed approach also can be employed to\nminimize other density power-based $\\gamma$-divergences, by leveraging\nunnormalized models."}, "http://arxiv.org/abs/2307.15176": {"title": "RCT Rejection Sampling for Causal Estimation Evaluation", "link": "http://arxiv.org/abs/2307.15176", "description": "Confounding is a significant obstacle to unbiased estimation of causal\neffects from observational data. For settings with high-dimensional covariates\n-- such as text data, genomics, or the behavioral social sciences --\nresearchers have proposed methods to adjust for confounding by adapting machine\nlearning methods to the goal of causal estimation. However, empirical\nevaluation of these adjustment methods has been challenging and limited. In\nthis work, we build on a promising empirical evaluation strategy that\nsimplifies evaluation design and uses real data: subsampling randomized\ncontrolled trials (RCTs) to create confounded observational datasets while\nusing the average causal effects from the RCTs as ground-truth. We contribute a\nnew sampling algorithm, which we call RCT rejection sampling, and provide\ntheoretical guarantees that causal identification holds in the observational\ndata to allow for valid comparisons to the ground-truth RCT. Using synthetic\ndata, we show our algorithm indeed results in low bias when oracle estimators\nare evaluated on the confounded samples, which is not always the case for a\npreviously proposed algorithm. 
In addition to this identification result, we\nhighlight several finite data considerations for evaluation designers who plan\nto use RCT rejection sampling on their own datasets. As a proof of concept, we\nimplement an example evaluation pipeline and walk through these finite data\nconsiderations with a novel, real-world RCT -- which we release publicly --\nconsisting of approximately 70k observations and text data as high-dimensional\ncovariates. Together, these contributions build towards a broader agenda of\nimproved empirical evaluation for causal estimation."}, "http://arxiv.org/abs/2309.09111": {"title": "Reducing sequential change detection to sequential estimation", "link": "http://arxiv.org/abs/2309.09111", "description": "We consider the problem of sequential change detection, where the goal is to\ndesign a scheme for detecting any changes in a parameter or functional $\\theta$\nof the data stream distribution that has small detection delay, but guarantees\ncontrol on the frequency of false alarms in the absence of changes. In this\npaper, we describe a simple reduction from sequential change detection to\nsequential estimation using confidence sequences: we begin a new\n$(1-\\alpha)$-confidence sequence at each time step, and proclaim a change when\nthe intersection of all active confidence sequences becomes empty. We prove\nthat the average run length is at least $1/\\alpha$, resulting in a change\ndetection scheme with minimal structural assumptions (thus allowing for\npossibly dependent observations, and nonparametric distribution classes), but\nstrong guarantees. Our approach bears an interesting parallel with the\nreduction from change detection to sequential testing of Lorden (1971) and the\ne-detector of Shin et al. (2022)."}, "http://arxiv.org/abs/2311.16181": {"title": "mvlearnR and Shiny App for multiview learning", "link": "http://arxiv.org/abs/2311.16181", "description": "The package mvlearnR and its accompanying Shiny App are intended for integrating\ndata from multiple sources or views or modalities (e.g. genomics, proteomics,\nclinical and demographic data). Most existing software packages for multiview\nlearning are decentralized and offer limited capabilities, making it difficult\nfor users to perform comprehensive integrative analysis. The new package wraps\nstatistical and machine learning methods and graphical tools, providing a\nconvenient and easy data integration workflow. For users with limited\nprogramming experience, we provide a Shiny Application to facilitate data\nintegration anywhere and on any device. The methods have the potential to offer\ndeeper insights into complex disease mechanisms.\n\nAvailability and Implementation: mvlearnR is available from the following\nGitHub repository: https://github.com/lasandrall/mvlearnR. The web application\nis hosted on shinyapps.io and available at:\nhttps://multi-viewlearn.shinyapps.io/MultiView_Modeling/"}, "http://arxiv.org/abs/2311.16286": {"title": "A statistical approach to latent dynamic modeling with differential equations", "link": "http://arxiv.org/abs/2311.16286", "description": "Ordinary differential equations (ODEs) can provide mechanistic models of\ntemporally local changes of processes, where parameters are often informed by\nexternal knowledge. While ODEs are popular in systems modeling, they are less\nestablished for statistical modeling of longitudinal cohort data, e.g., in a\nclinical setting. 
Yet, modeling of local changes could also be attractive for\nassessing the trajectory of an individual in a cohort in the immediate future\ngiven its current status, where ODE parameters could be informed by further\ncharacteristics of the individual. However, several hurdles so far limit such\nuse of ODEs, as compared to regression-based function fitting approaches. The\npotentially higher level of noise in cohort data might be detrimental to ODEs,\nas the shape of the ODE solution heavily depends on the initial value. In\naddition, larger numbers of variables multiply such problems and might be\ndifficult to handle for ODEs. To address this, we propose to use each\nobservation in the course of time as the initial value to obtain multiple local\nODE solutions and build a combined estimator of the underlying dynamics. Neural\nnetworks are used for obtaining a low-dimensional latent space for dynamic\nmodeling from a potentially large number of variables, and for obtaining\npatient-specific ODE parameters from baseline variables. Simultaneous\nidentification of dynamic models and of a latent space is enabled by recently\ndeveloped differentiable programming techniques. We illustrate the proposed\napproach in an application with spinal muscular atrophy patients and a\ncorresponding simulation study. In particular, modeling of local changes in\nhealth status at any point in time is contrasted to the interpretation of\nfunctions obtained from a global regression. This more generally highlights how\ndifferent application settings might demand different modeling strategies."}, "http://arxiv.org/abs/2311.16375": {"title": "Testing for a difference in means of a single feature after clustering", "link": "http://arxiv.org/abs/2311.16375", "description": "For many applications, it is critical to interpret and validate groups of\nobservations obtained via clustering. A common validation approach involves\ntesting differences in feature means between observations in two estimated\nclusters. In this setting, classical hypothesis tests lead to an inflated Type\nI error rate. To overcome this problem, we propose a new test for the\ndifference in means in a single feature between a pair of clusters obtained\nusing hierarchical or $k$-means clustering. The test based on the proposed\n$p$-value controls the selective Type I error rate in finite samples and can be\nefficiently computed. We further illustrate the validity and power of our\nproposal in simulation and demonstrate its use on single-cell RNA-sequencing\ndata."}, "http://arxiv.org/abs/2311.16451": {"title": "Variational Inference for the Latent Shrinkage Position Model", "link": "http://arxiv.org/abs/2311.16451", "description": "The latent position model (LPM) is a popular method used in network data\nanalysis where nodes are assumed to be positioned in a $p$-dimensional latent\nspace. The latent shrinkage position model (LSPM) is an extension of the LPM\nwhich automatically determines the number of effective dimensions of the latent\nspace via a Bayesian nonparametric shrinkage prior. However, the LSPM reliance\non Markov chain Monte Carlo for inference, while rigorous, is computationally\nexpensive, making it challenging to scale to networks with large numbers of\nnodes. We introduce a variational inference approach for the LSPM, aiming to\nreduce computational demands while retaining the model's ability to\nintrinsically determine the number of effective latent dimensions. 
The\nperformance of the variational LSPM is illustrated through simulation studies\nand its application to real-world network data. To promote wider adoption and\nease of implementation, we also provide open-source code."}, "http://arxiv.org/abs/2311.16529": {"title": "Efficient and Globally Robust Causal Excursion Effect Estimation", "link": "http://arxiv.org/abs/2311.16529", "description": "Causal excursion effect (CEE) characterizes the effect of an intervention\nunder policies that deviate from the experimental policy. It is widely used to\nstudy the effect of time-varying interventions that have the potential to be\nfrequently adaptive, such as those delivered through smartphones. We study the\nsemiparametric efficient estimation of CEE and we derive a semiparametric\nefficiency bound for CEE with identity or log link functions under working\nassumptions. We propose a class of two-stage estimators that achieve the\nefficiency bound and are robust to misspecified nuisance models. In deriving\nthe asymptotic property of the estimators, we establish a general theory for\nglobally robust Z-estimators with either cross-fitted or non-cross-fitted\nnuisance parameters. We demonstrate a substantial efficiency gain of the proposed\nestimator compared to existing ones through simulations and a real data\napplication using the Drink Less micro-randomized trial."}, "http://arxiv.org/abs/2311.16598": {"title": "Rectangular Hull Confidence Regions for Multivariate Parameters", "link": "http://arxiv.org/abs/2311.16598", "description": "We introduce three notions of multivariate median bias, namely, rectilinear,\nTukey, and orthant median bias. Each of these median biases is zero under a\nsuitable notion of multivariate symmetry. We study the coverage probabilities\nof the rectangular hull of $B$ independent multivariate estimators, with special\nattention to the number of estimators $B$ needed to ensure a miscoverage of at\nmost $\\alpha$. It is proved that for estimators with zero orthant median bias,\nwe need $B\\geq c\\log_2(d/\\alpha)$ for some constant $c > 0$. Finally, we show\nthat there exists an asymptotically valid (non-trivial) confidence region for a\nmultivariate parameter $\\theta_0$ if and only if there exists a (non-trivial)\nestimator with an asymptotic orthant median bias of zero."}, "http://arxiv.org/abs/2311.16614": {"title": "A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections", "link": "http://arxiv.org/abs/2311.16614", "description": "Unimodality, pivotal in statistical analysis, offers insights into dataset\nstructures and drives sophisticated analytical procedures. While unimodality's\nconfirmation is straightforward for one-dimensional data using methods like\nSilverman's approach and Hartigans' dip statistic, its generalization to higher\ndimensions remains challenging. By extrapolating one-dimensional unimodality\nprinciples to multi-dimensional spaces through linear random projections and\nleveraging point-to-point distancing, our method, rooted in\n$\\alpha$-unimodality assumptions, presents a novel multivariate unimodality\ntest named mud-pod. 
Both theoretical and empirical studies confirm the efficacy\nof our method in unimodality assessment of multidimensional datasets as well as\nin estimating the number of clusters."}, "http://arxiv.org/abs/2311.16793": {"title": "Mediation pathway selection with unmeasured mediator-outcome confounding", "link": "http://arxiv.org/abs/2311.16793", "description": "Causal mediation analysis aims to investigate how an intermediary factor,\ncalled a mediator, regulates the causal effect of a treatment on an outcome.\nWith the increasing availability of measurements on a large number of potential\nmediators, methods for selecting important mediators have been proposed.\nHowever, these methods often assume the absence of unmeasured mediator-outcome\nconfounding. We allow for such confounding in a linear structural equation\nmodel for the outcome and further propose an approach to tackle the mediator\nselection issue. To achieve this, we firstly identify causal parameters by\nconstructing a pseudo proxy variable for unmeasured confounding. Leveraging\nthis proxy variable, we propose a partially penalized method to identify\nmediators affecting the outcome. The resultant estimates are consistent, and\nthe estimates of nonzero parameters are asymptotically normal. Motivated by\nthese results, we introduce a two-step procedure to consistently select active\nmediation pathways, eliminating the need to test composite null hypotheses for\neach mediator that are commonly required by traditional methods. Simulation\nstudies demonstrate the superior performance of our approach compared to\nexisting methods. Finally, we apply our approach to genomic data, identifying\ngene expressions that potentially mediate the impact of a genetic variant on\nmouse obesity."}, "http://arxiv.org/abs/2311.16941": {"title": "Debiasing Multimodal Models via Causal Information Minimization", "link": "http://arxiv.org/abs/2311.16941", "description": "Most existing debiasing methods for multimodal models, including causal\nintervention and inference methods, utilize approximate heuristics to represent\nthe biases, such as shallow features from early stages of training or unimodal\nfeatures for multimodal tasks like VQA, etc., which may not be accurate. In\nthis paper, we study bias arising from confounders in a causal graph for\nmultimodal data and examine a novel approach that leverages causally-motivated\ninformation minimization to learn the confounder representations. Robust\npredictive features contain diverse information that helps a model generalize\nto out-of-distribution data. Hence, minimizing the information content of\nfeatures obtained from a pretrained biased model helps learn the simplest\npredictive features that capture the underlying data distribution. We treat\nthese features as confounder representations and use them via methods motivated\nby causal theory to remove bias from models. We find that the learned\nconfounder representations indeed capture dataset biases, and the proposed\ndebiasing methods improve out-of-distribution (OOD) performance on multiple\nmultimodal datasets without sacrificing in-distribution performance.\nAdditionally, we introduce a novel metric to quantify the sufficiency of\nspurious features in models' predictions that further demonstrates the\neffectiveness of our proposed methods. 
Our code is available at:\nhttps://github.com/Vaidehi99/CausalInfoMin"}, "http://arxiv.org/abs/2311.16984": {"title": "FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings", "link": "http://arxiv.org/abs/2311.16984", "description": "External control arms (ECA) can inform the early clinical development of\nexperimental drugs and provide efficacy evidence for regulatory approval in\nnon-randomized settings. However, the main challenge of implementing ECA lies\nin accessing real-world data or historical clinical trials. Indeed, data\nsharing is often not feasible due to privacy considerations related to data\nleaving the original collection centers, along with pharmaceutical companies'\ncompetitive motives. In this paper, we leverage a privacy-enhancing technology\ncalled federated learning (FL) to remove some of the barriers to data sharing.\nWe introduce a federated learning inverse probability of treatment weighted\n(IPTW) method for time-to-event outcomes called FedECA which eases the\nimplementation of ECA by limiting patients' data exposure. We show with\nextensive experiments that FedECA outperforms its closest competitor,\nmatching-adjusted indirect comparison (MAIC), in terms of statistical power and\nability to balance the treatment and control groups. To encourage the use of\nsuch methods, we publicly release our code which relies on Substra, an\nopen-source FL software with proven experience in privacy-sensitive contexts."}, "http://arxiv.org/abs/2311.16988": {"title": "A Wasserstein-type Distance for Gaussian Mixtures on Vector Bundles with Applications to Shape Analysis", "link": "http://arxiv.org/abs/2311.16988", "description": "This paper uses sample data to study the problem of comparing populations on\nfinite-dimensional parallelizable Riemannian manifolds and more general trivial\nvector bundles. Utilizing triviality, our framework represents populations as\nmixtures of Gaussians on vector bundles and estimates the population parameters\nusing a mode-based clustering algorithm. We derive a Wasserstein-type metric\nbetween Gaussian mixtures, adapted to the manifold geometry, in order to\ncompare estimated distributions. Our contributions include an identifiability\nresult for Gaussian mixtures on manifold domains and a convenient\ncharacterization of optimal couplings of Gaussian mixtures under the derived\nmetric. We demonstrate these tools on some example domains, including the\npre-shape space of planar closed curves, with applications to the shape space\nof triangles and populations of nanoparticles. In the nanoparticle application,\nwe consider a sequence of populations of particle shapes arising from a\nmanufacturing process, and utilize the Wasserstein-type distance to perform\nchange-point detection."}, "http://arxiv.org/abs/2008.11477": {"title": "Bellman filtering and smoothing for state-space models", "link": "http://arxiv.org/abs/2008.11477", "description": "This paper presents a new filter for state-space models based on Bellman's\ndynamic-programming principle, allowing for nonlinearity, non-Gaussianity and\ndegeneracy in the observation and/or state-transition equations. The resulting\nBellman filter is a direct generalisation of the (iterated and extended) Kalman\nfilter, enabling scalability to higher dimensions while remaining\ncomputationally inexpensive. It can also be extended to enable smoothing. 
Under\nsuitable conditions, the Bellman-filtered states are stable over time and\ncontractive towards a region around the true state at every time step. Static\n(hyper)parameters are estimated by maximising a filter-implied pseudo\nlog-likelihood decomposition. In univariate simulation studies, the Bellman\nfilter performs on par with state-of-the-art simulation-based techniques at a\nfraction of the computational cost. In two empirical applications, involving up\nto 150 spatial dimensions or highly degenerate/nonlinear state dynamics, the\nBellman filter outperforms competing methods in both accuracy and speed."}, "http://arxiv.org/abs/2205.05955": {"title": "Bayesian inference for stochastic oscillatory systems using the phase-corrected Linear Noise Approximation", "link": "http://arxiv.org/abs/2205.05955", "description": "Likelihood-based inference in stochastic non-linear dynamical systems, such\nas those found in chemical reaction networks and biological clock systems, is\ninherently complex and has largely been limited to small and unrealistically\nsimple systems. Recent advances in analytically tractable approximations to the\nunderlying conditional probability distributions enable long-term dynamics to\nbe accurately modelled, and make the large number of model evaluations required\nfor exact Bayesian inference much more feasible. We propose a new methodology\nfor inference in stochastic non-linear dynamical systems exhibiting oscillatory\nbehaviour and show the parameters in these models can be realistically\nestimated from simulated data. Preliminary analyses based on the Fisher\nInformation Matrix of the model can guide the implementation of Bayesian\ninference. We show that this parameter sensitivity analysis can predict which\nparameters are practically identifiable. Several Markov chain Monte Carlo\nalgorithms are compared, with our results suggesting a parallel tempering\nalgorithm consistently gives the best approach for these systems, which are\nshown to frequently exhibit multi-modal posterior distributions."}, "http://arxiv.org/abs/2206.10479": {"title": "Policy Learning with Asymmetric Counterfactual Utilities", "link": "http://arxiv.org/abs/2206.10479", "description": "Data-driven decision making plays an important role even in high stakes\nsettings like medicine and public policy. Learning optimal policies from\nobserved data requires a careful formulation of the utility function whose\nexpected value is maximized across a population. Although researchers typically\nuse utilities that depend on observed outcomes alone, in many settings the\ndecision maker's utility function is more properly characterized by the joint\nset of potential outcomes under all actions. For example, the Hippocratic\nprinciple to \"do no harm\" implies that the cost of causing death to a patient\nwho would otherwise survive without treatment is greater than the cost of\nforgoing life-saving treatment. We consider optimal policy learning with\nasymmetric counterfactual utility functions of this form that consider the\njoint set of potential outcomes. We show that asymmetric counterfactual\nutilities lead to an unidentifiable expected utility function, and so we first\npartially identify it. Drawing on statistical decision theory, we then derive\nminimax decision rules by minimizing the maximum expected utility loss relative\nto different alternative policies. 
We show that one can learn minimax loss\ndecision rules from observed data by solving intermediate classification\nproblems, and establish that the finite sample excess expected utility loss of\nthis procedure is bounded by the regret of these intermediate classifiers. We\napply this conceptual framework and methodology to the decision about whether\nor not to use right heart catheterization for patients with possible pulmonary\nhypertension."}, "http://arxiv.org/abs/2209.04716": {"title": "Extrapolation before imputation reduces bias when imputing censored covariates", "link": "http://arxiv.org/abs/2209.04716", "description": "Modeling symptom progression to identify informative subjects for a new\nHuntington's disease clinical trial is problematic since time to diagnosis, a\nkey covariate, can be heavily censored. Imputation is an appealing strategy\nwhere censored covariates are replaced with their conditional means, but\nexisting methods saw over 200% bias under heavy censoring. Calculating these\nconditional means well requires estimating and then integrating over the\nsurvival function of the censored covariate from the censored value to\ninfinity. To estimate the survival function flexibly, existing methods use the\nsemiparametric Cox model with Breslow's estimator, leaving the integrand for\nthe conditional means (the estimated survival function) undefined beyond the\nobserved data. The integral is then estimated up to the largest observed\ncovariate value, and this approximation can cut off the tail of the survival\nfunction and lead to severe bias, particularly under heavy censoring. We\npropose a hybrid approach that splices together the semiparametric survival\nestimator with a parametric extension, making it possible to approximate the\nintegral up to infinity. In simulation studies, our proposed approach of\nextrapolation then imputation substantially reduces the bias seen with existing\nimputation methods, even when the parametric extension was misspecified. We\nfurther demonstrate how imputing with corrected conditional means helps to\nprioritize patients for future clinical trials."}, "http://arxiv.org/abs/2210.03559": {"title": "Estimation of the Order of Non-Parametric Hidden Markov Models using the Singular Values of an Integral Operator", "link": "http://arxiv.org/abs/2210.03559", "description": "We are interested in assessing the order of a finite-state Hidden Markov\nModel (HMM) with the only two assumptions that the transition matrix of the\nlatent Markov chain has full rank and that the density functions of the\nemission distributions are linearly independent. We introduce a new procedure\nfor estimating this order by investigating the rank of some well-chosen\nintegral operator which relies on the distribution of a pair of consecutive\nobservations. This method circumvents the usual limits of the spectral method\nwhen it is used for estimating the order of an HMM: it avoids the choice of the\nbasis functions; it does not require any knowledge of an upper-bound on the\norder of the HMM (for the spectral method, such an upper-bound is defined by\nthe number of basis functions); it permits to easily handle different types of\ndata (including continuous data, circular data or multivariate continuous data)\nwith a suitable choice of kernel. 
The method relies on the fact that the order\nof the HMM can be identified from the distribution of a pair of consecutive\nobservations and that this order is equal to the rank of some integral operator\n(\\emph{i.e.} the number of its singular values that are non-zero). Since only\nthe empirical counter-part of the singular values of the operator can be\nobtained, we propose a data-driven thresholding procedure. An upper-bound on\nthe probability of overestimating the order of the HMM is established.\nMoreover, sufficient conditions on the bandwidth used for kernel density\nestimation and on the threshold are stated to obtain the consistency of the\nestimator of the order of the HMM. The procedure is easily implemented since\nthe values of all the tuning parameters are determined by the sample size."}, "http://arxiv.org/abs/2306.15088": {"title": "Locally tail-scale invariant scoring rules for evaluation of extreme value forecasts", "link": "http://arxiv.org/abs/2306.15088", "description": "Statistical analysis of extremes can be used to predict the probability of\nfuture extreme events, such as large rainfalls or devastating windstorms. The\nquality of these forecasts can be measured through scoring rules. Locally scale\ninvariant scoring rules give equal importance to the forecasts at different\nlocations regardless of differences in the prediction uncertainty. This is a\nuseful feature when computing average scores but can be an unnecessarily strict\nrequirement when mostly concerned with extremes. We propose the concept of\nlocal weight-scale invariance, describing scoring rules fulfilling local scale\ninvariance in a certain region of interest, and as a special case local\ntail-scale invariance, for large events. Moreover, a new version of the\nweighted Continuous Ranked Probability score (wCRPS) called the scaled wCRPS\n(swCRPS) that possesses this property is developed and studied. The score is a\nsuitable alternative for scoring extreme value models over areas with varying\nscale of extreme events, and we derive explicit formulas of the score for the\nGeneralised Extreme Value distribution. The scoring rules are compared through\nsimulation, and their usage is illustrated in modelling of extreme water\nlevels, annual maximum rainfalls, and in an application to non-extreme forecast\nfor the prediction of air pollution."}, "http://arxiv.org/abs/2307.09850": {"title": "Communication-Efficient Distribution-Free Inference Over Networks", "link": "http://arxiv.org/abs/2307.09850", "description": "Consider a star network where each local node possesses a set of test\nstatistics that exhibit a symmetric distribution around zero when their\ncorresponding null hypothesis is true. This paper investigates statistical\ninference problems in networks concerning the aggregation of this general type\nof statistics and global error rate control under communication constraints in\nvarious scenarios. The study proposes communication-efficient algorithms that\nare built on established non-parametric methods, such as the Wilcoxon and sign\ntests, as well as modern inference methods such as the Benjamini-Hochberg (BH)\nand Barber-Candes (BC) procedures, coupled with sampling and quantization\noperations. 
The proposed methods are evaluated through extensive simulation\nstudies."}, "http://arxiv.org/abs/2308.01747": {"title": "Fusion regression methods with repeated functional data", "link": "http://arxiv.org/abs/2308.01747", "description": "Linear regression and classification methods with repeated functional data\nare considered. For each statistical unit in the sample, a real-valued\nparameter is observed over time under different conditions. Two regression\nmethods based on fusion penalties are presented. The first one is a\ngeneralization of the variable fusion methodology based on the 1-nearest\nneighbor. The second one, called group fusion lasso, assumes some grouping\nstructure of conditions and allows for homogeneity among the regression\ncoefficient functions within groups. A finite sample numerical simulation and\nan application on EEG data are presented."}, "http://arxiv.org/abs/2310.00599": {"title": "Approximate filtering via discrete dual processes", "link": "http://arxiv.org/abs/2310.00599", "description": "We consider the task of filtering a dynamic parameter evolving as a diffusion\nprocess, given data collected at discrete times from a likelihood which is\nconjugate to the marginal law of the diffusion, when a generic dual process on\na discrete state space is available. Recently, it was shown that duality with\nrespect to a death-like process implies that the filtering distributions are\nfinite mixtures, making exact filtering and smoothing feasible through\nrecursive algorithms with polynomial complexity in the number of observations.\nHere we provide general results for the case of duality between the diffusion\nand a regular jump continuous-time Markov chain on a discrete state space,\nwhich typically leads to filtering distribution given by countable mixtures\nindexed by the dual process state space. We investigate the performance of\nseveral approximation strategies on two hidden Markov models driven by\nCox-Ingersoll-Ross and Wright-Fisher diffusions, which admit duals of\nbirth-and-death type, and compare them with the available exact strategies\nbased on death-type duals and with bootstrap particle filtering on the\ndiffusion state space as a general benchmark."}, "http://arxiv.org/abs/2311.17100": {"title": "Automatic cross-validation in structured models: Is it time to leave out leave-one-out?", "link": "http://arxiv.org/abs/2311.17100", "description": "Standard techniques such as leave-one-out cross-validation (LOOCV) might not\nbe suitable for evaluating the predictive performance of models incorporating\nstructured random effects. In such cases, the correlation between the training\nand test sets could have a notable impact on the model's prediction error. To\novercome this issue, an automatic group construction procedure for\nleave-group-out cross validation (LGOCV) has recently emerged as a valuable\ntool for enhancing predictive performance measurement in structured models. The\npurpose of this paper is (i) to compare LOOCV and LGOCV within structured\nmodels, emphasizing model selection and predictive performance, and (ii) to\nprovide real data applications in spatial statistics using complex structured\nmodels fitted with INLA, showcasing the utility of the automatic LGOCV method.\nFirst, we briefly review the key aspects of the recently proposed LGOCV method\nfor automatic group construction in latent Gaussian models. 
We also demonstrate\nthe effectiveness of this method for selecting the model with the highest\npredictive performance by simulating extrapolation tasks in both temporal and\nspatial data analyses. Finally, we provide insights into the effectiveness of\nthe LGOCV method in modelling complex structured data, encompassing\nspatio-temporal multivariate count data, spatial compositional data, and\nspatio-temporal geospatial data."}, "http://arxiv.org/abs/2311.17102": {"title": "Splinets -- Orthogonal Splines and FDA for the Classification Problem", "link": "http://arxiv.org/abs/2311.17102", "description": "This study introduces an efficient workflow for functional data analysis in\nclassification problems, utilizing advanced orthogonal spline bases. The\nmethodology is based on the flexible Splinets package, featuring a novel spline\nrepresentation designed for enhanced data efficiency. Several innovative\nfeatures contribute to this efficiency: 1) Utilization of Orthonormal Spline\nBases, 2) Consideration of Spline Support Sets, and 3) Data-Driven Knot Selection.\nIllustrating this approach, we applied the workflow to the Fashion MNIST\ndataset. We demonstrate the classification process and highlight significant\nefficiency gains. Particularly noteworthy are the improvements that can be\nachieved through the 2D generalization of our methodology, especially in\nscenarios where data sparsity and dimension reduction are critical factors. A\nkey advantage of our workflow is the projection operation into the space of\nsplines with arbitrarily chosen knots, allowing for versatile functional data\nanalysis associated with classification problems. Moreover, the study explores\nSplinets package features suited for functional data analysis. The algebra and\ncalculus of splines use Taylor expansions at the knots within the support sets.\nVarious orthonormalization techniques for B-splines are implemented, including\nthe highly recommended dyadic method, which leads to the creation of splinets.\nImportantly, the locality of B-splines concerning support sets is preserved in\nthe corresponding splinet. Using this locality, along with implemented\nalgorithms, provides a powerful computational tool for functional data\nanalysis."}, "http://arxiv.org/abs/2311.17111": {"title": "Design of variable acceptance sampling plan for exponential distribution under uncertainty", "link": "http://arxiv.org/abs/2311.17111", "description": "In an acceptance monitoring system, acceptance sampling techniques are used\nto increase production, enhance control, and deliver higher-quality products at\na lesser cost. It might not always be possible to define the acceptance\nsampling plan parameters as exact values, especially when data has\nuncertainty. In this work, acceptance sampling plans for a large number of\nidentical units with exponential lifetimes are obtained by treating acceptable\nquality life, rejectable quality life, consumer's risk, and producer's risk as\nfuzzy parameters. To obtain plan parameters of sequential sampling plans and\nrepetitive group sampling plans, a fuzzy hypothesis test is considered. To\nvalidate the sampling plans obtained in this work, some examples are presented.\nOur results are compared with existing results in the literature. 
Finally, to\ndemonstrate the application of the resulting sampling plans, a real-life case\nstudy is presented."}, "http://arxiv.org/abs/2311.17246": {"title": "Detecting influential observations in single-index models with metric-valued response objects", "link": "http://arxiv.org/abs/2311.17246", "description": "Regression with random data objects is becoming increasingly common in modern\ndata analysis. Unfortunately, like the traditional regression setting with\nEuclidean data, random response regression is not immune to the trouble caused\nby unusual observations. A metric Cook's distance extending the classical\nCook's distances of Cook (1977) to general metric-valued response objects is\nproposed. The performance of the metric Cook's distance in both Euclidean and\nnon-Euclidean response regression with Euclidean predictors is demonstrated in\nan extensive experimental study. A real data analysis of county-level COVID-19\ntransmission in the United States also illustrates the usefulness of this\nmethod in practice."}, "http://arxiv.org/abs/2311.17271": {"title": "Spatial-Temporal Extreme Modeling for Point-to-Area Random Effects (PARE)", "link": "http://arxiv.org/abs/2311.17271", "description": "One measurement modality for rainfall is a fixed location rain gauge.\nHowever, extreme rainfall, flooding, and other climate extremes often occur at\nlarger spatial scales and affect more than one location in a community. For\nexample, in 2017 Hurricane Harvey impacted all of Houston and the surrounding\nregion causing widespread flooding. Flood risk modeling requires understanding\nof rainfall for hydrologic regions, which may contain one or more rain gauges.\nFurther, policy changes to address the risks and damages of natural hazards\nsuch as severe flooding are usually made at the community/neighborhood level or\nhigher geo-spatial scale. Therefore, spatial-temporal methods which convert\nresults from one spatial scale to another are especially useful in applications\nfor evolving environmental extremes. We develop a point-to-area random effects\n(PARE) modeling strategy for understanding spatial-temporal extreme values at\nthe areal level, when the core information are time series at point locations\ndistributed over the region."}, "http://arxiv.org/abs/2311.17303": {"title": "Enhancing the Performance of Neural Networks Through Causal Discovery and Integration of Domain Knowledge", "link": "http://arxiv.org/abs/2311.17303", "description": "In this paper, we develop a generic methodology to encode hierarchical\ncausality structure among observed variables into a neural network in order to\nimprove its predictive performance. The proposed methodology, called\ncausality-informed neural network (CINN), leverages three coherent steps to\nsystematically map the structural causal knowledge into the layer-to-layer\ndesign of neural network while strictly preserving the orientation of every\ncausal relationship. In the first step, CINN discovers causal relationships\nfrom observational data via directed acyclic graph (DAG) learning, where causal\ndiscovery is recast as a continuous optimization problem to avoid the\ncombinatorial nature. In the second step, the discovered hierarchical causality\nstructure among observed variables is systematically encoded into neural\nnetwork through a dedicated architecture and customized loss function. 
By\ncategorizing variables in the causal DAG as root, intermediate, and leaf nodes,\nthe hierarchical causal DAG is translated into CINN with a one-to-one\ncorrespondence between nodes in the causal DAG and units in the CINN while\nmaintaining the relative order among these nodes. Regarding the loss function,\nboth intermediate and leaf nodes in the DAG graph are treated as target outputs\nduring CINN training so as to drive co-learning of causal relationships among\ndifferent types of nodes. As multiple loss components emerge in CINN, we\nleverage the projection of conflicting gradients to mitigate gradient\ninterference among the multiple learning tasks. Computational experiments\nacross a broad spectrum of UCI data sets demonstrate substantial advantages of\nCINN in predictive performance over other state-of-the-art methods. In\naddition, an ablation study underscores the value of integrating structural and\nquantitative causal knowledge in enhancing the neural network's predictive\nperformance incrementally."}, "http://arxiv.org/abs/2311.17445": {"title": "Interaction tests with covariate-adaptive randomization", "link": "http://arxiv.org/abs/2311.17445", "description": "Treatment-covariate interaction tests are commonly applied by researchers to\nexamine whether the treatment effect varies across patient subgroups defined by\nbaseline characteristics. The objective of this study is to explore\ntreatment-covariate interaction tests involving covariate-adaptive\nrandomization. Without assuming a parametric data generation model, we\ninvestigate usual interaction tests and observe that they tend to be\nconservative: specifically, their limiting rejection probabilities under the\nnull hypothesis do not exceed the nominal level and are typically strictly\nlower than it. To address this problem, we propose modifications to the usual\ntests to obtain corresponding exact tests. Moreover, we introduce a novel class\nof stratified-adjusted interaction tests that are simple, broadly applicable,\nand more powerful than the usual and modified tests. Our findings are relevant\nto two types of interaction tests: one involving stratification covariates and\nthe other involving additional covariates that are not used for randomization."}, "http://arxiv.org/abs/2311.17467": {"title": "Design of platform trials with a change in the control treatment arm", "link": "http://arxiv.org/abs/2311.17467", "description": "Platform trials are a more efficient way of testing multiple treatments\ncompared to running separate trials. In this paper we consider platform trials\nwhere, if a treatment is found to be superior to the control, it will become\nthe new standard of care (and the control in the platform). The remaining\ntreatments are then tested against this new control. In such a setting, one can\neither keep the information on both the new standard of care and the other\nactive treatments before the control is changed or one could discard this\ninformation when testing for benefit of the remaining treatments. We will show\nanalytically and numerically that retaining the information collected before\nthe change in control can be detrimental to the power of the study.\nSpecifically, we consider the overall power, the probability that the active\ntreatment with the greatest treatment effect is found during the trial. We also\nconsider the conditional power of the active treatments, the probability a\ngiven treatment can be found superior against the current control. 
We prove\nwhen, in a multi-arm multi-stage trial where no arms are added, retaining the\ninformation is detrimental to both overall and conditional power of the\nremaining treatments. This loss of power is studied for a motivating example.\nWe then discuss the effect on platform trials in which arms are added later. On\nthe basis of these observations we discuss different aspects to consider when\ndeciding whether to run a continuous platform trial or whether one may be\nbetter running a new trial."}, "http://arxiv.org/abs/2311.17476": {"title": "Inference of Sample Complier Average Causal Effects in Completely Randomized Experiments", "link": "http://arxiv.org/abs/2311.17476", "description": "In randomized experiments with non-compliance, scholars have argued that the\ncomplier average causal effect (CACE) ought to be the main causal estimand. The\nliterature on inference of the CACE has\nfocused on inference about the population CACE. However, in general individuals\nin the experiments are volunteers. This means that there is a risk that\nindividuals partaking in a given experiment differ in important ways from a\npopulation of interest. It is thus of interest to focus on the sample at hand\nand have easy-to-use and correct procedures for inference about the sample\nCACE. We consider a more general setting than in the previous literature and\nconstruct a confidence interval based on the Wald estimator in the form of a\nfinite closed interval that is familiar to practitioners. Furthermore, with\naccess to pre-treatment covariates, we propose a new regression adjustment\nestimator and associated methods for constructing confidence intervals. Finite\nsample performance of the methods is examined through a Monte Carlo simulation\nand the methods are used in an application to a job training experiment."}, "http://arxiv.org/abs/2311.17547": {"title": "Risk-based decision making: estimands for sequential prediction under interventions", "link": "http://arxiv.org/abs/2311.17547", "description": "Prediction models are used, among other things, to inform medical decisions on\ninterventions. Typically, individuals with high risks of adverse outcomes are\nadvised to undergo an intervention while those at low risk are advised to\nrefrain from it. Standard prediction models do not always provide risks that\nare relevant to inform such decisions: e.g., an individual may be estimated to\nbe at low risk because similar individuals in the past received an intervention\nwhich lowered their risk. Therefore, prediction models supporting decisions\nshould target risks belonging to defined intervention strategies. Previous\nworks on prediction under interventions assumed that the prediction model was\nused only at one time point to make an intervention decision. In clinical\npractice, intervention decisions are rarely made only once: they might be\nrepeated, deferred and re-evaluated. This requires estimated risks under\ninterventions that can be reconsidered at several potential decision moments.\nIn the current work, we highlight key considerations for formulating estimands\nin sequential prediction under interventions that can inform such intervention\ndecisions. We illustrate these considerations by giving examples of estimands\nfor a case study about choosing between vaginal delivery and cesarean section\nfor women giving birth. 
Our formalization of prediction tasks in a sequential,\ncausal, and estimand context provides guidance for future studies to ensure\nthat the right question is answered and appropriate causal estimation\napproaches are chosen to develop sequential prediction models that can inform\nintervention decisions."}, "http://arxiv.org/abs/2311.17564": {"title": "Combining Stochastic Tendency and Distribution Overlap Towards Improved Nonparametric Effect Measures and Inference", "link": "http://arxiv.org/abs/2311.17564", "description": "A fundamental functional in nonparametric statistics is the Mann-Whitney\nfunctional ${\\theta} = P(X < Y)$, which constitutes the basis for the most\npopular nonparametric procedures. The functional ${\\theta}$ measures a location\nor stochastic tendency effect between two distributions. A limitation of\n${\\theta}$ is its inability to capture scale differences. If differences of\nthis nature are to be detected, specific tests for scale or omnibus tests need\nto be employed. However, the latter often suffer from low power, and they do\nnot yield interpretable effect measures. In this manuscript, we extend\n${\\theta}$ by additionally incorporating the recently introduced distribution\noverlap index (nonparametric dispersion measure) $I_2$ that can be expressed in\nterms of the quantile process. We derive the joint asymptotic distribution of\nthe respective estimators of ${\\theta}$ and $I_2$ and construct confidence\nregions. Extending the Wilcoxon-Mann-Whitney test, we introduce a new test\nbased on the joint use of these functionals. It results in much larger\nconsistency regions while maintaining competitive power to the rank sum test\nfor situations in which ${\\theta}$ alone would suffice. Compared with classical\nomnibus tests, the simulated power is much improved. Additionally, the newly\nproposed inference method yields effect measures whose interpretation is\nsurprisingly straightforward."}, "http://arxiv.org/abs/2311.17594": {"title": "Scale Invariant Correspondence Analysis", "link": "http://arxiv.org/abs/2311.17594", "description": "Correspondence analysis is a dimension reduction method for visualization of\nnonnegative data sets, in particular contingency tables; but it depends on the\nmarginals of the data set. Two transformations of the data have been proposed\nto render correspondence analysis row and column scale invariant: these two\nkinds of transformations change the initial form of the data set into a\nbistochastic form. The power transformation applied by Greenacre (2010) has one\npositive parameter, while the transformation applied by Mosteller (1968) and\nGoodman (1996) has (I+J) positive parameters, where the raw data is row and\ncolumn scaled by the Sinkhorn (RAS or ipf) algorithm to render it bistochastic.\nGoodman (1996) named correspondence analysis of a bistochastic matrix\nmarginal-free correspondence analysis. We discuss these two transformations,\nand further generalize the Mosteller-Goodman approach."}, "http://arxiv.org/abs/2311.17605": {"title": "Improving the Balance of Unobserved Covariates From Information Theory in Multi-Arm Randomization with Unequal Allocation Ratio", "link": "http://arxiv.org/abs/2311.17605", "description": "Multi-arm randomization has seen increasingly widespread application recently, and\nit is also crucial to ensure that the distributions of important observed\ncovariates as well as the potential unobserved covariates are similar and\ncomparable among all the treatments. 
However, the theoretical properties of\nunobserved covariates imbalance in multi-arm randomization with unequal\nallocation ratio remain unknown. In this paper, we give a general framework\nanalysing the moments and distributions of unobserved covariates imbalance and\napply them to different procedures, including complete randomization (CR),\nstratified permuted block (STR-PB) and covariate-adaptive randomization (CAR).\nThe general procedures of multi-arm STR-PB and CAR with unequal allocation\nratio are also proposed. In addition, we introduce the concept of entropy to\nmeasure the correlation between discrete covariates and verify that we could\nutilize the correlation to select observed covariates to help better balance\nthe unobserved covariates."}, "http://arxiv.org/abs/2311.17685": {"title": "Enhancing efficiency and robustness in high-dimensional linear regression with additional unlabeled data", "link": "http://arxiv.org/abs/2311.17685", "description": "In semi-supervised learning, the prevailing understanding suggests that\nobserving additional unlabeled samples improves estimation accuracy for linear\nparameters only in the case of model misspecification. This paper challenges\nthis notion, demonstrating its inaccuracy in high dimensions. Initially\nfocusing on a dense scenario, we introduce robust semi-supervised estimators\nfor the regression coefficient without relying on sparse structures in the\npopulation slope. Even when the true underlying model is linear, we show that\nleveraging information from large-scale unlabeled data improves both estimation\naccuracy and inference robustness. Moreover, we propose semi-supervised methods\nwith further enhanced efficiency in scenarios with a sparse linear slope.\nDiverging from the standard semi-supervised literature, we also allow for\ncovariate shift. The performance of the proposed methods is illustrated through\nextensive numerical studies, including simulations and a real-data application\nto the AIDS Clinical Trials Group Protocol 175 (ACTG175)."}, "http://arxiv.org/abs/2311.17797": {"title": "Learning to Simulate: Generative Metamodeling via Quantile Regression", "link": "http://arxiv.org/abs/2311.17797", "description": "Stochastic simulation models, while effective in capturing the dynamics of\ncomplex systems, are often too slow to run for real-time decision-making.\nMetamodeling techniques are widely used to learn the relationship between a\nsummary statistic of the outputs (e.g., the mean or quantile) and the inputs of\nthe simulator, so that it can be used in real time. However, this methodology\nrequires the knowledge of an appropriate summary statistic in advance, making\nit inflexible for many practical situations. In this paper, we propose a new\nmetamodeling concept, called generative metamodeling, which aims to construct a\n\"fast simulator of the simulator\". This technique can generate random outputs\nsubstantially faster than the original simulation model, while retaining an\napproximately equal conditional distribution given the same inputs. 
Once\nconstructed, a generative metamodel can instantaneously generate a large number\nof random outputs as soon as the inputs are specified, thereby facilitating the\nimmediate computation of any summary statistic for real-time decision-making.\nFurthermore, we propose a new algorithm -- quantile-regression-based generative\nmetamodeling (QRGMM) -- and study its convergence and rate of convergence.\nExtensive numerical experiments are conducted to investigate the empirical\nperformance of QRGMM, compare it with other state-of-the-art generative\nalgorithms, and demonstrate its usefulness in practical real-time\ndecision-making."}, "http://arxiv.org/abs/2311.17808": {"title": "A Heteroscedastic Bayesian Generalized Logistic Regression Model with Application to Scaling Problems", "link": "http://arxiv.org/abs/2311.17808", "description": "Scaling models have been used to explore relationships between urban\nindicators and population and, more recently, have been extended to incorporate rural-urban\nindicator densities and population densities. In the scaling framework, power\nlaws and standard linear regression methods are used to estimate model\nparameters with assumed normality and fixed variance. These assumptions,\ninherited in the scaling field, have recently been shown to be\ninadequate and, if left unaddressed, to lead to model bias. Generalized\nlinear models (GLMs) can accommodate a wider range of distributions, but the\nchosen distribution must be appropriate for the data to prevent model\nbias. We present a widely applicable Bayesian generalized logistic regression\n(BGLR) framework to flexibly model a continuous real response addressing skew\nand heteroscedasticity using Markov Chain Monte Carlo (MCMC) methods. The\nGeneralized Logistic Distribution (GLD) robustly models skewed continuous data\ndue to the additional shape parameter. We compare the BGLR model to standard\nand Bayesian normal methods in fitting power laws to COVID-19 data. The BGLR\nprovides additional useful information beyond previous scaling methods,\nuniquely models variance, including a scedasticity parameter, and reveals\nparameter bias in widely used methods."}, "http://arxiv.org/abs/2311.17867": {"title": "A Class of Directed Acyclic Graphs with Mixed Data Types in Mediation Analysis", "link": "http://arxiv.org/abs/2311.17867", "description": "We propose a unified class of generalized structural equation models (GSEMs)\nwith data of mixed types in mediation analysis, including continuous,\ncategorical, and count variables. Such models substantially extend the\nclassical linear structural equation model to accommodate many data types\narising from the application of mediation analysis. Invoking the hierarchical\nmodeling approach, we specify GSEMs by a copula joint distribution of outcome\nvariable, mediator and exposure variable, in which marginal distributions are\nbuilt upon generalized linear models (GLMs) with confounding factors. We\ndiscuss the identifiability conditions for the causal mediation effects in the\ncounterfactual paradigm as well as the issue of mediation leakage, and develop\nan asymptotically efficient profile maximum likelihood estimation and inference\nfor two key mediation estimands, natural direct effect and natural indirect\neffect, in different scenarios of mixed data types. 
The proposed new\nmethodology is illustrated by a motivating epidemiological study that aims to\ninvestigate whether the tempo of reaching infancy BMI peak (delayed or on time),\nan important early life growth milestone, may mediate the association between\nprenatal exposure to phthalates and pubertal health outcomes."}, "http://arxiv.org/abs/2311.17885": {"title": "Are ensembles getting better all the time?", "link": "http://arxiv.org/abs/2311.17885", "description": "Ensemble methods combine the predictions of several base models. We study\nwhether or not including more models in an ensemble always improves its average\nperformance. Such a question depends on the kind of ensemble considered, as\nwell as the predictive metric chosen. We focus on situations where all members\nof the ensemble are a priori expected to perform equally well, which is the case for\nseveral popular methods like random forests or deep ensembles. In this setting,\nwe essentially show that ensembles are getting better all the time if, and only\nif, the considered loss function is convex. More precisely, in that case, the\naverage loss of the ensemble is a decreasing function of the number of models.\nWhen the loss function is nonconvex, we show a series of results that can be\nsummarised by the insight that ensembles of good models keep getting better,\nand ensembles of bad models keep getting worse. To this end, we prove a new\nresult on the monotonicity of tail probabilities that may be of independent\ninterest. We illustrate our results on a simple machine learning problem\n(diagnosing melanomas using neural nets)."}, "http://arxiv.org/abs/2111.07966": {"title": "Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects", "link": "http://arxiv.org/abs/2111.07966", "description": "There are a number of available methods for selecting whom to prioritize for\ntreatment, including ones based on treatment effect estimation, risk scoring,\nand hand-crafted rules. We propose rank-weighted average treatment effect\n(RATE) metrics as a simple and general family of metrics for comparing and\ntesting the quality of treatment prioritization rules. RATE metrics are\nagnostic as to how the prioritization rules were derived, and only assess how\nwell they identify individuals that benefit the most from treatment. We define\na family of RATE estimators and prove a central limit theorem that enables\nasymptotically exact inference in a wide variety of randomized and\nobservational study settings. RATE metrics subsume a number of existing\nmetrics, including the Qini coefficient, and our analysis directly yields\ninference methods for these metrics. We showcase RATE in the context of a\nnumber of applications, including optimal targeting of aspirin to stroke\npatients."}, "http://arxiv.org/abs/2306.02126": {"title": "A Process of Dependent Quantile Pyramids", "link": "http://arxiv.org/abs/2306.02126", "description": "Despite the practicality of quantile regression (QR), simultaneous estimation\nof multiple QR curves continues to be challenging. We address this problem by\nproposing a Bayesian nonparametric framework that generalizes the quantile\npyramid by replacing each scalar variate in the quantile pyramid with a\nstochastic process on a covariate space. We propose a novel approach to show\nthe existence of a quantile pyramid for all quantiles. The process of dependent\nquantile pyramids allows for non-linear QR and automatically ensures\nnon-crossing of QR curves on the covariate space. 
Simulation studies document\nthe performance and robustness of our approach. An application to cyclone\nintensity data is presented."}, "http://arxiv.org/abs/2306.02244": {"title": "Optimal neighbourhood selection in structural equation models", "link": "http://arxiv.org/abs/2306.02244", "description": "We study the optimal sample complexity of neighbourhood selection in linear\nstructural equation models, and compare this to best subset selection (BSS) for\nlinear models under general design. We show by example that -- even when the\nstructure is \\emph{unknown} -- the existence of underlying structure can reduce\nthe sample complexity of neighbourhood selection. This result is complicated by\nthe possibility of path cancellation, which we study in detail, and show that\nimprovements are still possible in the presence of path cancellation. Finally,\nwe support these theoretical observations with experiments. The proof\nintroduces a modified BSS estimator, called klBSS, and compares its performance\nto BSS. The analysis of klBSS may also be of independent interest since it\napplies to arbitrary structured models, not necessarily those induced by a\nstructural equation model. Our results have implications for structure learning\nin graphical models, which often relies on neighbourhood selection as a\nsubroutine."}, "http://arxiv.org/abs/2307.13868": {"title": "Learning sources of variability from high-dimensional observational studies", "link": "http://arxiv.org/abs/2307.13868", "description": "Causal inference studies whether the presence of a variable influences an\nobserved outcome. As measured by quantities such as the \"average treatment\neffect,\" this paradigm is employed across numerous biological fields, from\nvaccine and drug development to policy interventions. Unfortunately, the\nmajority of these methods are often limited to univariate outcomes. Our work\ngeneralizes causal estimands to outcomes with any number of dimensions or any\nmeasurable space, and formulates traditional causal estimands for nominal\nvariables as causal discrepancy tests. We propose a simple technique for\nadjusting universally consistent conditional independence tests and prove that\nthese tests are universally consistent causal discrepancy tests. Numerical\nexperiments illustrate that our method, Causal CDcorr, leads to improvements in\nboth finite sample validity and power when compared to existing strategies. Our\nmethods are all open source and available at github.com/ebridge2/cdcorr."}, "http://arxiv.org/abs/2307.16353": {"title": "Single Proxy Synthetic Control", "link": "http://arxiv.org/abs/2307.16353", "description": "Synthetic control methods are widely used to estimate the treatment effect on\na single treated unit in time-series settings. A common approach for estimating\nsynthetic controls is to regress the treated unit's pre-treatment outcome on\nthose of untreated units via ordinary least squares. However, this approach can\nperform poorly if the pre-treatment fit is not near perfect, whether the\nweights are normalized or not. In this paper, we introduce a single proxy\nsynthetic control approach, which views the outcomes of untreated units as\nproxies of the treatment-free potential outcome of the treated unit, a\nperspective we leverage to construct a valid synthetic control. 
Under this\nframework, we establish alternative identification and estimation methodologies\nfor synthetic controls and for the treatment effect on the treated unit.\nNotably, unlike a proximal synthetic control approach which requires two types\nof proxies for identification, ours relies on a single type of proxy, thus\nfacilitating its practical relevance. Additionally, we adapt a conformal\ninference approach to perform inference about the treatment effect, obviating\nthe need for a large number of post-treatment data. Lastly, our framework can\naccommodate time-varying covariates and nonlinear models. We demonstrate the\nproposed approach in a simulation study and a real-world application."}, "http://arxiv.org/abs/2206.12346": {"title": "A fast and stable approximate maximum-likelihood method for template fits", "link": "http://arxiv.org/abs/2206.12346", "description": "Barlow and Beeston presented an exact likelihood for the problem of fitting a\ncomposite model consisting of binned templates obtained from Monte-Carlo\nsimulation which are fitted to equally binned data. Solving the exact\nlikelihood is technically challenging, and therefore Conway proposed an\napproximate likelihood to address these challenges. In this paper, a new\napproximate likelihood is derived from the exact Barlow-Beeston one. The new\napproximate likelihood and Conway's likelihood are generalized to problems of\nfitting weighted data with weighted templates. The performance of estimates\nobtained with all three likelihoods is studied on two toy examples: a simple\none and a challenging one. The performance of the approximate likelihoods is\ncomparable to the exact Barlow-Beeston likelihood, while the performance in\nfits with weighted templates is better. The approximate likelihoods evaluate\nfaster than the Barlow-Beeston one when the number of bins is large."}, "http://arxiv.org/abs/2307.16370": {"title": "Inference for Low-rank Completion without Sample Splitting with Application to Treatment Effect Estimation", "link": "http://arxiv.org/abs/2307.16370", "description": "This paper studies the inferential theory for estimating low-rank matrices.\nIt also provides an inference method for the average treatment effect as an\napplication. We show that the least square estimation of eigenvectors following\nthe nuclear norm penalization attains asymptotic normality. The key\ncontribution of our method is that it does not require sample splitting. In\naddition, this paper allows dependent observation patterns and heterogeneous\nobservation probabilities. Empirically, we apply the proposed procedure to\nestimating the impact of the presidential vote on allocating the U.S. federal\nbudget to the states."}, "http://arxiv.org/abs/2311.14679": {"title": "\"Medium-n studies\" in computing education conferences", "link": "http://arxiv.org/abs/2311.14679", "description": "Good (Frequentist) statistical practice requires that statistical tests be\nperformed in order to determine if the phenomenon being observed could\nplausibly occur by chance if the null hypothesis is true. Good practice also\nrequires that a test is not performed if the study is underpowered: if the\nnumber of observations is not sufficiently large to be able to reliably detect\nthe effect one hypothesizes, even if the effect exists. Running underpowered\nstudies runs the risk of false negative results. 
This creates tension in the\nguidelines and expectations for computer science education conferences: while\nthings are clear for studies with a large number of observations, researchers\nshould in fact not compute p-values and perform statistical tests if the number\nof observations is too small. The issue is particularly live in CSed venues,\nsince class sizes where those issues are salient are common. We outline the\nconsiderations for when to compute and when not to compute p-values in\ndifferent settings encountered by computer science education researchers. We\nsurvey the author and reviewer guidelines in different computer science\neducation conferences (ICER, SIGCSE TS, ITiCSE, EAAI, CompEd, Koli Calling). We\npresent summary data and make several preliminary observations about reviewer\nguidelines: guidelines vary from conference to conference; guidelines allow for\nqualitative studies, and, in some cases, experience reports, but guidelines do\nnot generally explicitly indicate that a paper should have at least one of (1)\nan appropriately-powered statistical analysis or (2) rich qualitative\ndescriptions. We present preliminary ideas for addressing the tension in the\nguidelines between small-n and large-n studies"}, "http://arxiv.org/abs/2311.17962": {"title": "In search of the perfect fit: interpretation, flexible modelling, and the existing generalisations of the normal distribution", "link": "http://arxiv.org/abs/2311.17962", "description": "Many generalised distributions exist for modelling data with vastly diverse\ncharacteristics. However, very few of these generalisations of the normal\ndistribution have shape parameters with clear roles that determine, for\ninstance, skewness and tail shape. In this chapter, we review existing skewing\nmechanisms and their properties in detail. Using the knowledge acquired, we add\na skewness parameter to the body-tail generalised normal distribution\n\\cite{BTGN}, that yields the \\ac{FIN} with parameters for location, scale,\nbody-shape, skewness, and tail weight. Basic statistical properties of the\n\\ac{FIN} are provided, such as the \\ac{PDF}, cumulative distribution function,\nmoments, and likelihood equations. Additionally, the \\ac{FIN} \\ac{PDF} is\nextended to a multivariate setting using a student t-copula, yielding the\n\\ac{MFIN}. The \\ac{MFIN} is applied to stock returns data, where it outperforms\nthe t-copula multivariate generalised hyperbolic, Azzalini skew-t, hyperbolic,\nand normal inverse Gaussian distributions."}, "http://arxiv.org/abs/2311.18039": {"title": "Restricted Regression in Networks", "link": "http://arxiv.org/abs/2311.18039", "description": "Network regression with additive node-level random effects can be problematic\nwhen the primary interest is estimating unconditional regression coefficients\nand some covariates are exactly or nearly in the vector space of node-level\neffects. We introduce the Restricted Network Regression model, that removes the\ncollinearity between fixed and random effects in network regression by\northogonalizing the random effects against the covariates. 
We discuss the\nchange in the interpretation of the regression coefficients in Restricted\nNetwork Regression and analytically characterize the effect of Restricted\nNetwork Regression on the regression coefficients for continuous response data.\nWe show through simulation with continuous and binary response data that\nRestricted Network Regression mitigates, but does not eliminate, network\nconfounding, by providing improved estimation of the regression coefficients.\nWe apply the Restricted Network Regression model in an analysis of 2015\nEurovision Song Contest voting data and show how the choice of regression model\naffects inference."}, "http://arxiv.org/abs/2311.18048": {"title": "An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis", "link": "http://arxiv.org/abs/2311.18048", "description": "We investigate the relationship between system identification and\nintervention design in dynamical systems. While previous research demonstrated\nhow identifiable representation learning methods, such as Independent Component\nAnalysis (ICA), can reveal cause-effect relationships, it relied on a passive\nperspective without considering how to collect data. Our work shows that in\nGaussian Linear Time-Invariant (LTI) systems, the system parameters can be\nidentified by introducing diverse intervention signals in a multi-environment\nsetting. By harnessing appropriate diversity assumptions motivated by the ICA\nliterature, our findings connect experiment design and representational\nidentifiability in dynamical systems. We corroborate our findings on synthetic\nand (simulated) physical data. Additionally, we show that Hidden Markov Models,\nin general, and (Gaussian) LTI systems, in particular, fulfil a generalization\nof the Causal de Finetti theorem with continuous parameters."}, "http://arxiv.org/abs/2311.18053": {"title": "On Non- and Weakly-Informative Priors for the Conway-Maxwell-Poisson (COM-Poisson) Distribution", "link": "http://arxiv.org/abs/2311.18053", "description": "Previous Bayesian evaluations of the Conway-Maxwell-Poisson (COM-Poisson)\ndistribution have little discussion of non- and weakly-informative priors for\nthe model. While only considering priors with such limited information\nrestricts potential analyses, these priors serve as an important first step in the\nmodeling process and are useful when performing sensitivity analyses. We\ndevelop and derive several weakly- and non-informative priors using both the\nestablished conjugate prior and Jeffreys' prior. Our evaluation of each prior\ninvolves an empirical study under varying dispersion types and sample sizes. In\ngeneral, we find the weakly informative priors tend to perform better than the\nnon-informative priors. We also consider several data examples for illustration\nand provide code for implementation of each resulting posterior."}, "http://arxiv.org/abs/2311.18093": {"title": "Shared Control Individuals in Health Policy Evaluations with Application to Medical Cannabis Laws", "link": "http://arxiv.org/abs/2311.18093", "description": "Health policy researchers often have questions about the effects of a policy\nimplemented at some cluster-level unit, e.g., states, counties, hospitals, etc.\non individual-level outcomes collected over multiple time periods. Stacked\ndifference-in-differences is an increasingly popular way to estimate these\neffects. 
This approach involves estimating treatment effects for each\npolicy-implementing unit, then, if scientifically appropriate, aggregating them\nto an average effect estimate. However, when individual-level data are\navailable and non-implementing units are used as comparators for multiple\npolicy-implementing units, data from untreated individuals may be used across\nmultiple analyses, thereby inducing correlation between effect estimates.\nExisting methods do not quantify or account for this sharing of controls. Here,\nwe describe a stacked difference-in-differences study investigating the effects\nof state medical cannabis laws on treatment for chronic pain management that\nmotivated this work, discuss a framework for estimating and managing this\ncorrelation due to shared control individuals, and show how accounting for it\naffects the substantive results."}, "http://arxiv.org/abs/2311.18146": {"title": "Co-Active Subspace Methods for the Joint Analysis of Adjacent Computer Models", "link": "http://arxiv.org/abs/2311.18146", "description": "Active subspace (AS) methods are a valuable tool for understanding the\nrelationship between the inputs and outputs of a Physics simulation. In this\npaper, an elegant generalization of the traditional ASM is developed to assess\nthe co-activity of two computer models. This generalization, which we refer to\nas a Co-Active Subspace (C-AS) Method, allows for the joint analysis of two or\nmore computer models allowing for thorough exploration of the alignment (or\nnon-alignment) of the respective gradient spaces. We define co-active\ndirections, co-sensitivity indices, and a scalar ``concordance\" metric (and\ncomplementary ``discordance\" pseudo-metric) and we demonstrate that these are\npowerful tools for understanding the behavior of a class of computer models,\nespecially when used to supplement traditional AS analysis. Details for\nefficient estimation of the C-AS and an accompanying R package\n(github.com/knrumsey/concordance) are provided. Practical application is\ndemonstrated through analyzing a set of simulated rate stick experiments for\nPBX 9501, a high explosive, offering insights into complex model dynamics."}, "http://arxiv.org/abs/2311.18274": {"title": "Semiparametric Efficient Inference in Adaptive Experiments", "link": "http://arxiv.org/abs/2311.18274", "description": "We consider the problem of efficient inference of the Average Treatment\nEffect in a sequential experiment where the policy governing the assignment of\nsubjects to treatment or control can change over time. We first provide a\ncentral limit theorem for the Adaptive Augmented Inverse-Probability Weighted\nestimator, which is semiparametric efficient, under weaker assumptions than\nthose previously made in the literature. This central limit theorem enables\nefficient inference at fixed sample sizes. We then consider a sequential\ninference setting, deriving both asymptotic and nonasymptotic confidence\nsequences that are considerably tighter than previous methods. These\nanytime-valid methods enable inference under data-dependent stopping times\n(sample sizes). Additionally, we use propensity score truncation techniques\nfrom the recent off-policy estimation literature to reduce the finite sample\nvariance of our estimator without affecting the asymptotic variance. 
Empirical\nresults demonstrate that our methods yield narrower confidence sequences than\nthose previously developed in the literature while maintaining time-uniform\nerror control."}, "http://arxiv.org/abs/2311.18294": {"title": "Multivariate Unified Skew-t Distributions And Their Properties", "link": "http://arxiv.org/abs/2311.18294", "description": "The unified skew-t (SUT) is a flexible parametric multivariate distribution\nthat accounts for skewness and heavy tails in the data. A few of its properties\ncan be found scattered in the literature or in a parameterization that does not\nfollow the original one for unified skew-normal (SUN) distributions, yet a\nsystematic study is lacking. In this work, explicit properties of the\nmultivariate SUT distribution are presented, such as its stochastic\nrepresentations, moments, SUN-scale mixture representation, linear\ntransformation, additivity, marginal distribution, canonical form, quadratic\nform, conditional distribution, change of latent dimensions, Mardia measures of\nmultivariate skewness and kurtosis, and non-identifiability issue. These\nresults are given in a parametrization that reduces to the original SUN\ndistribution as a sub-model, hence facilitating the use of the SUT for\napplications. Several models based on the SUT distribution are provided for\nillustration."}, "http://arxiv.org/abs/2311.18446": {"title": "Length-of-stay times in hospital for COVID-19 patients using the smoothed Beran's estimator with bootstrap bandwidth selection", "link": "http://arxiv.org/abs/2311.18446", "description": "The survival function of length-of-stay in hospital ward and ICU for COVID-19\npatients is studied in this paper. Flexible statistical methods are used to\nestimate this survival function given relevant covariates such as age, sex,\nobesity and chronic obstructive pulmonary disease (COPD). A doubly-smoothed\nBeran's estimator has been considered to this aim. The bootstrap method has\nbeen used to produce new smoothing parameter selectors and to construct\nconfidence regions for the conditional survival function. Some simulation\nstudies show the good performance of the proposed methods."}, "http://arxiv.org/abs/2311.18460": {"title": "Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework", "link": "http://arxiv.org/abs/2311.18460", "description": "Fairness for machine learning predictions is widely required in practice for\nlegal, ethical, and societal reasons. Existing work typically focuses on\nsettings without unobserved confounding, even though unobserved confounding can\nlead to severe violations of causal fairness and, thus, unfair predictions. In\nthis work, we analyze the sensitivity of causal fairness to unobserved\nconfounding. Our contributions are three-fold. First, we derive bounds for\ncausal fairness metrics under different sources of unobserved confounding. This\nenables practitioners to examine the sensitivity of their machine learning\nmodels to unobserved confounding in fairness-critical applications. Second, we\npropose a novel neural framework for learning fair predictions, which allows us\nto offer worst-case guarantees of the extent to which causal fairness can be\nviolated due to unobserved confounding. Third, we demonstrate the effectiveness\nof our framework in a series of experiments, including a real-world case study\nabout predicting prison sentences. To the best of our knowledge, ours is the\nfirst work to study causal fairness under unobserved confounding. 
To this end,\nour work is of direct practical value as a refutation strategy to ensure the\nfairness of predictions in high-stakes applications."}, "http://arxiv.org/abs/2311.18477": {"title": "Intraday foreign exchange rate volatility forecasting: univariate and multilevel functional GARCH models", "link": "http://arxiv.org/abs/2311.18477", "description": "This paper seeks to predict conditional intraday volatility in foreign\nexchange (FX) markets using functional Generalized AutoRegressive Conditional\nHeteroscedasticity (GARCH) models. We contribute to the existing functional\nGARCH-type models by accounting for the stylised features of long-range and\ncross-dependence through estimating the models with long-range dependent and\nmulti-level functional principal component basis functions. Remarkably, we find\nthat taking account of cross-dependency dynamics between the major currencies\nsignificantly improves intraday conditional volatility forecasting.\nAdditionally, incorporating intraday bid-ask spread using a functional GARCH-X\nmodel adds explainability of long-range dependence and further enhances\npredictability. Intraday risk management applications are presented to\nhighlight the practical economic benefits of our proposed approaches."}, "http://arxiv.org/abs/2311.18501": {"title": "Perturbation-based Analysis of Compositional Data", "link": "http://arxiv.org/abs/2311.18501", "description": "Existing statistical methods for compositional data analysis are inadequate\nfor many modern applications for two reasons. First, modern compositional\ndatasets, for example in microbiome research, display traits such as\nhigh-dimensionality and sparsity that are poorly modelled with traditional\napproaches. Second, assessing -- in an unbiased way -- how summary statistics\nof a composition (e.g., racial diversity) affect a response variable is not\nstraightforward. In this work, we propose a framework based on hypothetical\ndata perturbations that addresses both issues. Unlike existing methods for\ncompositional data, we do not transform the data and instead use perturbations\nto define interpretable statistical functionals on the compositions themselves,\nwhich we call average perturbation effects. These average perturbation effects,\nwhich can be employed in many applications, naturally account for confounding\nthat biases frequently used marginal dependence analyses. We show how average\nperturbation effects can be estimated efficiently by deriving a\nperturbation-dependent reparametrization and applying semiparametric estimation\ntechniques. We analyze the proposed estimators empirically on simulated data\nand demonstrate advantages over existing techniques on US census and microbiome\ndata. For all proposed estimators, we provide confidence intervals with uniform\nasymptotic coverage guarantees."}, "http://arxiv.org/abs/2311.18532": {"title": "Local causal effects with continuous exposures: A matching estimator for the average causal derivative effect", "link": "http://arxiv.org/abs/2311.18532", "description": "The estimation of causal effects is a fundamental goal in the field of causal\ninference. However, it is challenging for various reasons. One reason is that\nthe exposure (or treatment) is naturally continuous in many real-world\nscenarios. When dealing with continuous exposure, dichotomizing the exposure\nvariable based on a pre-defined threshold may result in a biased understanding\nof causal relationships. 
In this paper, we propose a novel causal inference\nframework that can measure the causal effect of continuous exposure. We define\nthe expectation of a derivative of potential outcomes at a specific exposure\nlevel as the average causal derivative effect. Additionally, we propose a\nmatching method for this estimator and propose a permutation approach to test\nthe hypothesis of no local causal effect. We also investigate the asymptotic\nproperties of the proposed estimator and examine its performance through\nsimulation studies. Finally, we apply this causal framework in a real data\nexample of Chronic Obstructive Pulmonary Disease (COPD) patients."}, "http://arxiv.org/abs/2311.18584": {"title": "First-order multivariate integer-valued autoregressive model with multivariate mixture distributions", "link": "http://arxiv.org/abs/2311.18584", "description": "The univariate integer-valued time series has been extensively studied, but\nliterature on multivariate integer-valued time series models is quite limited\nand the complex correlation structure among the multivariate integer-valued\ntime series is barely discussed. In this study, we proposed a first-order\nmultivariate integer-valued autoregressive model to characterize the\ncorrelation among multivariate integer-valued time series with higher\nflexibility. Under general conditions, we established the stationarity and\nergodicity of the proposed model. With the proposed method, we discussed the\nmodels with multivariate Poisson-lognormal distribution and multivariate\ngeometric-logitnormal distribution and the corresponding properties. The\nestimation method based on the EM algorithm was developed for the model parameters\nand extensive simulation studies were performed to evaluate the effectiveness\nof the proposed estimation method. Finally, a real crime dataset was analyzed to\ndemonstrate the advantage of the proposed model in comparison with the other\nmodels."}, "http://arxiv.org/abs/2311.18699": {"title": "Gaussian processes Correlated Bayesian Additive Regression Trees", "link": "http://arxiv.org/abs/2311.18699", "description": "In recent years, Bayesian Additive Regression Trees (BART) has garnered\nincreased attention, leading to the development of various extensions for\ndiverse applications. However, there has been limited exploration of its\nutility in analyzing correlated data. This paper introduces a novel extension\nof BART, named Correlated BART (CBART). Unlike the original BART with\nindependent errors, CBART is specifically designed to handle correlated\n(dependent) errors. Additionally, we propose the integration of CBART with\nGaussian processes (GP) to create a new model termed GP-CBART. This innovative\nmodel combines the strengths of Gaussian processes and CBART, making it\nparticularly well-suited for analyzing time series or spatial data. In the\nGP-CBART framework, CBART captures the nonlinearity in the mean regression\n(covariates) function, while the Gaussian process adeptly models the\ncorrelation structure within the response. Moreover, given the high\nflexibility of both CBART and GP models, their combination may lead to\nidentification issues. We provide methods to address these challenges. 
To\ndemonstrate the effectiveness of CBART and GP-CBART, we present corresponding\nsimulated and real-world examples."}, "http://arxiv.org/abs/2311.18725": {"title": "AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities", "link": "http://arxiv.org/abs/2311.18725", "description": "In the pharmaceutical industry, the use of artificial intelligence (AI) has\nseen consistent growth over the past decade. This rise is attributed to major\nadvancements in statistical machine learning methodologies, computational\ncapabilities and the increased availability of large datasets. AI techniques\nare applied throughout different stages of drug development, ranging from drug\ndiscovery to post-marketing benefit-risk assessment. Kolluri et al. provided a\nreview of several case studies that span these stages, featuring key\napplications such as protein structure prediction, success probability\nestimation, subgroup identification, and AI-assisted clinical trial monitoring.\nFrom a regulatory standpoint, there was a notable uptick in submissions\nincorporating AI components in 2021. The most prevalent therapeutic areas\nleveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%),\nand neurology (11%). The paradigm of personalized or precision medicine has\ngained significant traction in recent research, partly due to advancements in\nAI techniques \\cite{hamburg2010path}. This shift has had a transformative\nimpact on the pharmaceutical industry. Departing from the traditional\n\"one-size-fits-all\" model, personalized medicine incorporates various\nindividual factors, such as environmental conditions, lifestyle choices, and\nhealth histories, to formulate customized treatment plans. By utilizing\nsophisticated machine learning algorithms, clinicians and researchers are\nbetter equipped to make informed decisions in areas such as disease prevention,\ndiagnosis, and treatment selection, thereby optimizing health outcomes for each\nindividual."}, "http://arxiv.org/abs/2311.18807": {"title": "Pre-registration for Predictive Modeling", "link": "http://arxiv.org/abs/2311.18807", "description": "Amid rising concerns of reproducibility and generalizability in predictive\nmodeling, we explore the possibility and potential benefits of introducing\npre-registration to the field. Despite notable advancements in predictive\nmodeling, spanning core machine learning tasks to various scientific\napplications, challenges such as overlooked contextual factors, data-dependent\ndecision-making, and unintentional re-use of test data have raised questions\nabout the integrity of results. To address these issues, we propose adapting\npre-registration practices from explanatory modeling to predictive modeling. We\ndiscuss current best practices in predictive modeling and their limitations,\nintroduce a lightweight pre-registration template, and present a qualitative\nstudy with machine learning researchers to gain insight into the effectiveness\nof pre-registration in preventing biased estimates and promoting more reliable\nresearch outcomes. 
We conclude by exploring the scope of problems that\npre-registration can address in predictive modeling and acknowledging its\nlimitations within this context."}, "http://arxiv.org/abs/1910.12431": {"title": "Multilevel Dimension-Independent Likelihood-Informed MCMC for Large-Scale Inverse Problems", "link": "http://arxiv.org/abs/1910.12431", "description": "We present a non-trivial integration of dimension-independent\nlikelihood-informed (DILI) MCMC (Cui, Law, Marzouk, 2016) and the multilevel\nMCMC (Dodwell et al., 2015) to explore the hierarchy of posterior\ndistributions. This integration offers several advantages: First, DILI-MCMC\nemploys an intrinsic likelihood-informed subspace (LIS) (Cui et al., 2014) --\nwhich involves a number of forward and adjoint model simulations -- to design\naccelerated operator-weighted proposals. By exploiting the multilevel structure\nof the discretised parameters and discretised forward models, we design a\nRayleigh-Ritz procedure to significantly reduce the computational effort in\nbuilding the LIS and operating with DILI proposals. Second, the resulting\nDILI-MCMC can drastically improve the sampling efficiency of MCMC at each\nlevel, and hence reduce the integration error of the multilevel algorithm for\nfixed CPU time. Numerical results confirm the improved computational efficiency\nof the multilevel DILI approach."}, "http://arxiv.org/abs/2107.01306": {"title": "The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models", "link": "http://arxiv.org/abs/2107.01306", "description": "Here, we investigate whether (and how) experimental design could aid in the\nestimation of the precision matrix in a Gaussian chain graph model, especially\nthe interplay between the design, the effect of the experiment and prior\nknowledge about the effect. Estimation of the precision matrix is a fundamental\ntask to infer biological graphical structures like microbial networks. We\ncompare the marginal posterior precision of the precision matrix under four\npriors: flat, conjugate Normal-Wishart, Normal-MGIG and a general independent.\nUnder the flat and conjugate priors, the Laplace-approximated posterior\nprecision is not a function of the design matrix rendering useless any efforts\nto find an optimal experimental design to infer the precision matrix. In\ncontrast, the Normal-MGIG and general independent priors do allow for the\nsearch of optimal experimental designs, yet there is a sharp upper bound on the\ninformation that can be extracted from a given experiment. We confirm our\ntheoretical findings via a simulation study comparing i) the KL divergence\nbetween prior and posterior and ii) the Stein's loss difference of MAPs between\nrandom and no experiment. Our findings provide practical advice for domain\nscientists conducting experiments to better infer the precision matrix as a\nrepresentation of a biological network."}, "http://arxiv.org/abs/2206.02508": {"title": "Tucker tensor factor models: matricization and mode-wise PCA estimation", "link": "http://arxiv.org/abs/2206.02508", "description": "High-dimensional, higher-order tensor data are gaining prominence in a\nvariety of fields, including but not limited to computer vision and network\nanalysis. 
Tensor factor models, induced from noisy versions of tensor\ndecomposition or factorization, are natural and potent instruments to study a\ncollection of tensor-variate objects that may be dependent or independent.\nHowever, statistical inferential theory for estimating the various low-rank\nstructures, which customarily play the role of signals in tensor factor models,\nis still at an early stage of development. In this paper, starting from\ntensor matricization, we aim to \"decode\" estimation of a higher-order tensor\nfactor model in the sense that we recast it into mode-wise traditional\nhigh-dimensional vector/fiber factor models so as to deploy conventional\nprincipal component analysis (PCA) estimation. Demonstrated on the Tucker\ntensor factor model (TuTFaM), which is induced from the most popular Tucker\ndecomposition, we show that estimation of the signal components essentially\namounts to mode-wise PCA techniques, and that the involvement of projection and\niteration enhances the signal-to-noise ratio to various extents. We\nestablish the inferential theory of the proposed estimators, conduct rich\nsimulation experiments under TuTFaM, and illustrate how the proposed\nestimators can be used for tensor reconstruction and for clustering of video and\neconomic datasets, respectively."}, "http://arxiv.org/abs/2211.13959": {"title": "Testing Homological Equivalence Using Betti Numbers", "link": "http://arxiv.org/abs/2211.13959", "description": "In this article, we propose a one-sample test to check whether the support of\nthe unknown distribution generating the data is homologically equivalent to the\nsupport of some specified distribution; using the corresponding\ntwo-sample test, one can instead check whether the supports of two unknown distributions\nare homologically equivalent. In the course of this study, test\nstatistics based on the Betti numbers are formulated, and the consistency of\nthe tests is established under the critical and the supercritical regimes.\nMoreover, some simulation studies are conducted and the results are compared with\nexisting methodologies such as Robinson's permutation test and the test based\non mean persistent landscape functions. Furthermore, the practicability of the\ntests is also demonstrated on two well-known real data sets."}, "http://arxiv.org/abs/2302.08893": {"title": "Active learning for data streams: a survey", "link": "http://arxiv.org/abs/2302.08893", "description": "Online active learning is a paradigm in machine learning that aims to select\nthe most informative data points to label from a data stream. The problem of\nminimizing the cost associated with collecting labeled observations has gained\na lot of attention in recent years, particularly in real-world applications\nwhere data is only available in an unlabeled form. Annotating each observation\ncan be time-consuming and costly, making it difficult to obtain large amounts\nof labeled data. To overcome this issue, many active learning strategies have\nbeen proposed in the last decades, aiming to select the most informative\nobservations for labeling in order to improve the performance of machine\nlearning models. These approaches can be broadly divided into two categories:\nstatic pool-based and stream-based active learning. 
Pool-based active learning\ninvolves selecting a subset of observations from a closed pool of unlabeled\ndata, and it has been the focus of many surveys and literature reviews.\nHowever, the growing availability of data streams has led to an increase in the\nnumber of approaches that focus on online active learning, which involves\ncontinuously selecting and labeling observations as they arrive in a stream.\nThis work aims to provide an overview of the most recently proposed approaches\nfor selecting the most informative observations from data streams in real time.\nWe review the various techniques that have been proposed and discuss their\nstrengths and limitations, as well as the challenges and opportunities that\nexist in this area of research."}, "http://arxiv.org/abs/2304.14750": {"title": "Bayesian Testing of Scientific Expectations Under Exponential Random Graph Models", "link": "http://arxiv.org/abs/2304.14750", "description": "The exponential random graph (ERGM) model is a commonly used statistical\nframework for studying the determinants of tie formations from social network\ndata. To test scientific theories under the ERGM framework, statistical\ninferential techniques are generally used based on traditional significance\ntesting using p-values. This methodology has certain limitations, however, such\nas its inconsistent behavior when the null hypothesis is true, its inability to\nquantify evidence in favor of a null hypothesis, and its inability to test\nmultiple hypotheses with competing equality and/or order constraints on the\nparameters of interest in a direct manner. To tackle these shortcomings, this\npaper presents Bayes factors and posterior probabilities for testing scientific\nexpectations under a Bayesian framework. The methodology is implemented in the\nR package 'BFpack'. The applicability of the methodology is illustrated using\nempirical collaboration networks and policy networks."}, "http://arxiv.org/abs/2312.00130": {"title": "Sparse Projected Averaged Regression for High-Dimensional Data", "link": "http://arxiv.org/abs/2312.00130", "description": "We examine the linear regression problem in a challenging high-dimensional\nsetting with correlated predictors to explain and predict relevant quantities,\nexplicitly allowing the regression coefficient to vary from sparse to\ndense. Most classical high-dimensional regression estimators require some\ndegree of sparsity. We discuss the more recent concepts of variable screening\nand random projection as computationally fast dimension reduction tools, and\npropose a new random projection matrix tailored to the linear regression\nproblem with a theoretical bound on the gain in expected prediction error over\nconventional random projections.\n\nAround this new random projection, we build the Sparse Projected Averaged\nRegression (SPAR) method combining probabilistic variable screening steps with\nthe random projection steps to obtain an ensemble of small linear models. In\ncontrast to existing methods, we introduce a thresholding parameter to obtain\nsome degree of sparsity.\n\nIn extensive simulations and two real data applications, we walk through the\nelements of this method and compare its prediction and variable selection\nperformance to various competitors. 
For prediction, our method performs at\nleast as well as the best competitors in most settings with a high number of\ntruly active variables, while variable selection remains a hard task for all\nmethods in high dimensions."}, "http://arxiv.org/abs/2312.00185": {"title": "On the variance of the Least Mean Square squared-error sample curve", "link": "http://arxiv.org/abs/2312.00185", "description": "Most studies of adaptive algorithm behavior consider performance measures\nbased on mean values such as the mean-square error. The derived models are\nuseful for understanding the algorithm behavior under different environments\nand can be used for design. Nevertheless, from a practical point of view, the\nadaptive filter user has only one realization of the algorithm to obtain the\ndesired result. This letter derives a model for the variance of the\nsquared-error sample curve of the least-mean-square (LMS) adaptive algorithm,\nso that the achievable cancellation level can be predicted based on the\nproperties of the steady-state squared error. The derived results provide the\nuser with useful design guidelines."}, "http://arxiv.org/abs/2312.00219": {"title": "The Functional Average Treatment Effect", "link": "http://arxiv.org/abs/2312.00219", "description": "This paper establishes the functional average as an important estimand for\ncausal inference. The significance of the estimand lies in its robustness\nagainst traditional issues of confounding. We prove that this robustness holds\neven when the probability distribution of the outcome, conditional on treatment\nor some other vector of adjusting variables, differs almost arbitrarily from\nits counterfactual analogue. This paper also examines possible estimators of\nthe functional average, including the sample mid-range, and proposes a new type\nof bootstrap for robust statistical inference: the Hoeffding bootstrap. After\nthis, the paper explores a new class of variables, the $\\mathcal{U}$ class of\nvariables, which simplifies the estimation of functional averages. This class of\nvariables is also used to establish mean exchangeability in some cases and to\nprovide the results of elementary statistical procedures, such as linear\nregression and the analysis of variance, with causal interpretations.\nSimulation evidence is provided. The methods of this paper are also applied to\na National Health and Nutrition Survey data set to investigate the causal\neffect of exercise on the blood pressure of adult smokers."}, "http://arxiv.org/abs/2312.00225": {"title": "Eliminating confounder-induced bias in the statistics of intervention", "link": "http://arxiv.org/abs/2312.00225", "description": "Experimental and observational studies often lead to spurious association\nbetween the outcome and independent variables describing the intervention,\nbecause of confounding by third-party factors. Even in randomized clinical\ntrials, confounding might be unavoidable due to small sample sizes.\nPractically, this poses a problem, because it is either expensive to re-design\nand conduct a new study or even impossible to alleviate the contribution of\nsome confounders due to e.g. ethical concerns. Here, we propose a method to\nconsistently derive hypothetical studies that retain as many of the\ndependencies in the original study as mathematically possible, while removing\nany association of observed confounders to the independent variables. Using\nhistoric studies, we illustrate how the confounding-free scenario re-estimates\nthe effect size of the intervention. 
The new effect size estimate represents a\nconcise prediction in the hypothetical scenario which paves the way from the\noriginal data towards the design of future studies."}, "http://arxiv.org/abs/2312.00294": {"title": "aeons: approximating the end of nested sampling", "link": "http://arxiv.org/abs/2312.00294", "description": "This paper presents analytic results on the anatomy of nested sampling, from\nwhich a technique that works for any nested sampling implementation is developed\nto estimate the run-time of the algorithm. We test these methods on both toy\nmodels and true cosmological nested sampling runs. The method gives an\norder-of-magnitude prediction of the endpoint at all times, forecasting the\ntrue endpoint within standard error around the halfway point."}, "http://arxiv.org/abs/2312.00305": {"title": "Multiple Testing of Linear Forms for Noisy Matrix Completion", "link": "http://arxiv.org/abs/2312.00305", "description": "Many important tasks of large-scale recommender systems can be naturally cast\nas testing multiple linear forms for noisy matrix completion. These problems,\nhowever, present unique challenges because of the subtle bias-and-variance\ntradeoff and an intricate dependence among the estimated entries induced by\nthe low-rank structure. In this paper, we develop a general approach to\novercome these difficulties by introducing new statistics for individual tests\nwith sharp asymptotics both marginally and jointly, and utilizing them to\ncontrol the false discovery rate (FDR) via a data splitting and symmetric\naggregation scheme. We show that valid FDR control can be achieved with\nguaranteed power under nearly optimal sample size requirements using the\nproposed methodology. Extensive numerical simulations and real data examples\nare also presented to further illustrate its practical merits."}, "http://arxiv.org/abs/2312.00346": {"title": "Supervised Factor Modeling for High-Dimensional Linear Time Series", "link": "http://arxiv.org/abs/2312.00346", "description": "Motivated by Tucker tensor decomposition, this paper imposes low-rank\nstructures on the column and row spaces of coefficient matrices in a\nmultivariate infinite-order vector autoregression (VAR), which leads to a\nsupervised factor model with two factor modelings being conducted on responses\nand predictors simultaneously. Interestingly, the stationarity condition\nimplies an intrinsic weak group sparsity mechanism of infinite-order VAR, and\nhence a rank-constrained group Lasso estimation is considered for\nhigh-dimensional linear time series. Its non-asymptotic properties are\ndiscussed carefully by balancing the estimation, approximation and\ntruncation errors. Moreover, an alternating gradient descent algorithm with\nthresholding is designed to search for high-dimensional estimates, and its\ntheoretical justifications, including statistical and convergence analysis, are\nalso provided. Theoretical and computational properties of the proposed\nmethodology are verified by simulation experiments, and the advantages over\nexisting methods are demonstrated by two real examples."}, "http://arxiv.org/abs/2312.00373": {"title": "Streaming Bayesian Modeling for predicting Fat-Tailed Customer Lifetime Value", "link": "http://arxiv.org/abs/2312.00373", "description": "We develop an online learning MCMC approach applicable to hierarchical\nBayesian models and GLMs. We also develop a fat-tailed LTV model that\ngeneralizes over several kinds of fat and thin tails. 
We demonstrate both\ndevelopments on commercial LTV data from a large mobile app."}, "http://arxiv.org/abs/2312.00439": {"title": "Modeling the Ratio of Correlated Biomarkers Using Copula Regression", "link": "http://arxiv.org/abs/2312.00439", "description": "Modeling the ratio of two dependent components as a function of covariates is\na frequently pursued objective in observational research. Despite the high\nrelevance of this topic in medical studies, where biomarker ratios are often\nused as surrogate endpoints for specific diseases, existing models are based on\noversimplified assumptions, assuming e.g.\\@ independence or strictly positive\nassociations between the components. In this paper, we close this gap in the\nliterature and propose a regression model where the marginal distributions of\nthe two components are linked by Frank copula. A key feature of our model is\nthat it allows for both positive and negative correlations between the\ncomponents, with one of the model parameters being directly interpretable in\nterms of Kendall's rank correlation coefficient. We study our method\ntheoretically, evaluate finite sample properties in a simulation study and\ndemonstrate its efficacy in an application to diagnosis of Alzheimer's disease\nvia ratios of amyloid-beta and total tau protein biomarkers."}, "http://arxiv.org/abs/2312.00494": {"title": "Applying the estimands framework to non-inferiority trials: guidance on choice of hypothetical estimands for non-adherence and comparison of estimation methods", "link": "http://arxiv.org/abs/2312.00494", "description": "A common concern in non-inferiority (NI) trials is that non adherence due,\nfor example, to poor study conduct can make treatment arms artificially\nsimilar. Because intention to treat analyses can be anti-conservative in this\nsituation, per protocol analyses are sometimes recommended. However, such\nadvice does not consider the estimands framework, nor the risk of bias from per\nprotocol analyses. We therefore sought to update the above guidance using the\nestimands framework, and compare estimators to improve on the performance of\nper protocol analyses. We argue the main threat to validity of NI trials is the\noccurrence of trial specific intercurrent events (IEs), that is, IEs which\noccur in a trial setting, but would not occur in practice. To guard against\nerroneous conclusions of non inferiority, we suggest an estimand using a\nhypothetical strategy for trial specific IEs should be employed, with handling\nof other non trial specific IEs chosen based on clinical considerations. We\nprovide an overview of estimators that could be used to estimate a hypothetical\nestimand, including inverse probability weighting (IPW), and two instrumental\nvariable approaches (one using an informative Bayesian prior on the effect of\nstandard treatment, and one using a treatment by covariate interaction as an\ninstrument). 
We compare them, using simulation in the setting of all or nothing\ncompliance in two active treatment arms, and conclude both IPW and the\ninstrumental variable method using a Bayesian prior are potentially useful\napproaches, with the choice between them depending on which assumptions are\nmost plausible for a given trial."}, "http://arxiv.org/abs/2312.00501": {"title": "Cautionary Tales on Synthetic Controls in Survival Analyses", "link": "http://arxiv.org/abs/2312.00501", "description": "Synthetic control (SC) methods have gained rapid popularity in economics\nrecently, where they have been applied in the context of inferring the effects\nof treatments on standard continuous outcomes assuming linear input-output\nrelations. In medical applications, conversely, survival outcomes are often of\nprimary interest, a setup in which both commonly assumed data-generating\nprocesses (DGPs) and target parameters are different. In this paper, we\ntherefore investigate whether and when SCs could serve as an alternative to\nmatching methods in survival analyses. We find that, because SCs rely on a\nlinearity assumption, they will generally be biased for the true expected\nsurvival time in commonly assumed survival DGPs -- even when taking into\naccount the possibility of linearity on another scale as in accelerated failure\ntime models. Additionally, we find that, because SC units follow distributions\nwith lower variance than real control units, summaries of their distributions,\nsuch as survival curves, will be biased for the parameters of interest in many\nsurvival analyses. Nonetheless, we also highlight that using SCs can still\nimprove upon matching whenever the biases described above are outweighed by\nextrapolation biases exhibited by imperfect matches, and investigate the use of\nregularization to trade off the shortcomings of both approaches."}, "http://arxiv.org/abs/2312.00509": {"title": "Bayesian causal discovery from unknown general interventions", "link": "http://arxiv.org/abs/2312.00509", "description": "We consider the problem of learning causal Directed Acyclic Graphs (DAGs)\nusing combinations of observational and interventional experimental data.\nCurrent methods tailored to this setting assume that interventions either\ndestroy parent-child relations of the intervened (target) nodes or only alter\nsuch relations without modifying the parent sets, even when the intervention\ntargets are unknown. We relax this assumption by proposing a Bayesian method\nfor causal discovery from general interventions, which allow for modifications\nof the parent sets of the unknown targets. Even in this framework, DAGs and\ngeneral interventions may be identifiable only up to some equivalence classes.\nWe provide graphical characterizations of such interventional Markov\nequivalence and devise compatible priors for Bayesian inference that guarantee\nscore equivalence of indistinguishable structures. We then develop a Markov\nChain Monte Carlo (MCMC) scheme to approximate the posterior distribution over\nDAGs, intervention targets and induced parent sets. 
Finally, we evaluate the\nproposed methodology on both simulated and real protein expression data."}, "http://arxiv.org/abs/2312.00530": {"title": "New tools for network time series with an application to COVID-19 hospitalisations", "link": "http://arxiv.org/abs/2312.00530", "description": "Network time series are becoming increasingly important across many areas in\nscience and medicine and are often characterised by a known or inferred\nunderlying network structure, which can be exploited to make sense of dynamic\nphenomena that are often high-dimensional. For example, the Generalised Network\nAutoregressive (GNAR) models exploit such structure parsimoniously. We use the\nGNAR framework to introduce two association measures: the network and partial\nnetwork autocorrelation functions, and introduce Corbit (correlation-orbit)\nplots for visualisation. As with regular autocorrelation plots, Corbit plots\npermit interpretation of underlying correlation structures and, crucially, aid\nmodel selection more rapidly than using other tools such as AIC or BIC. We\nadditionally interpret GNAR processes as generalised graphical models, which\nconstrain the processes' autoregressive structure and exhibit interesting\ntheoretical connections to graphical models via utilization of higher-order\ninteractions. We demonstrate how incorporation of prior information is related\nto performing variable selection and shrinkage in the GNAR context. We\nillustrate the usefulness of the GNAR formulation, network autocorrelations and\nCorbit plots by modelling a COVID-19 network time series of the number of\nadmissions to mechanical ventilation beds at 140 NHS Trusts in England & Wales.\nWe introduce the Wagner plot that can analyse correlations over different time\nperiods or with respect to external covariates. In addition, we introduce plots\nthat quantify the relevance and influence of individual nodes. Our modelling\nprovides insight on the underlying dynamics of the COVID-19 series, highlights\ntwo groups of geographically co-located `influential' NHS Trusts and\ndemonstrates superior prediction abilities when compared to existing\ntechniques."}, "http://arxiv.org/abs/2312.00616": {"title": "Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry", "link": "http://arxiv.org/abs/2312.00616", "description": "In a longitudinal clinical registry, different measurement instruments might\nhave been used for assessing individuals at different time points. To combine\nthem, we investigate deep learning techniques for obtaining a joint latent\nrepresentation, to which the items of different measurement instruments are\nmapped. This corresponds to domain adaptation, an established concept in\ncomputer science for image data. Using the proposed approach as an example, we\nevaluate the potential of domain adaptation in a longitudinal cohort setting\nwith a rather small number of time points, motivated by an application with\ndifferent motor function measurement instruments in a registry of spinal\nmuscular atrophy (SMA) patients. There, we model trajectories in the latent\nrepresentation by ordinary differential equations (ODEs), where person-specific\nODE parameters are inferred from baseline characteristics. The goodness of fit\nand complexity of the ODE solutions then allows to judge the measurement\ninstrument mappings. We subsequently explore how alignment can be improved by\nincorporating corresponding penalty terms into model fitting. 
To systematically\ninvestigate the effect of differences between measurement instruments, we\nconsider several scenarios based on modified SMA data, including scenarios\nwhere a mapping should be feasible in principle and scenarios where no perfect\nmapping is available. While misalignment increases in more complex scenarios,\nsome structure is still recovered, even if the availability of measurement\ninstruments depends on patient state. A reasonable mapping is also feasible in\nthe more complex real SMA dataset. These results indicate that domain\nadaptation might be more generally useful in statistical modeling for\nlongitudinal registry data."}, "http://arxiv.org/abs/2312.00622": {"title": "Practical Path-based Bayesian Optimization", "link": "http://arxiv.org/abs/2312.00622", "description": "There has been a surge in interest in data-driven experimental design with\napplications to chemical engineering and drug manufacturing. Bayesian\noptimization (BO) has proven to be adaptable to such cases, since we can model\nthe reactions of interest as expensive black-box functions. Sometimes, the cost\nof these black-box functions can be separated into two parts: (a) the cost of\nthe experiment itself, and (b) the cost of changing the input parameters. In\nthis short paper, we extend the SnAKe algorithm to deal with both types of\ncosts simultaneously. We further propose extensions to the case of a maximum\nallowable input change, as well as to the multi-objective setting."}, "http://arxiv.org/abs/2312.00710": {"title": "SpaCE: The Spatial Confounding Environment", "link": "http://arxiv.org/abs/2312.00710", "description": "Spatial confounding poses a significant challenge in scientific studies\ninvolving spatial data, where unobserved spatial variables can influence both\ntreatment and outcome, possibly leading to spurious associations. To address\nthis problem, we introduce SpaCE: The Spatial Confounding Environment, the\nfirst toolkit to provide realistic benchmark datasets and tools for\nsystematically evaluating causal inference methods designed to alleviate\nspatial confounding. Each dataset includes training data, true counterfactuals,\na spatial graph with coordinates, and smoothness and confounding scores\ncharacterizing the effect of a missing spatial confounder. It also includes\nrealistic semi-synthetic outcomes and counterfactuals, generated using\nstate-of-the-art machine learning ensembles, following best practices for\ncausal inference benchmarks. The datasets cover real treatment and covariates\nfrom diverse domains, including climate, health and social sciences. SpaCE\nfacilitates an automated end-to-end pipeline, simplifying data loading,\nexperimental setup, and the evaluation of machine learning and causal inference\nmodels. The SpaCE project provides several dozen datasets of diverse sizes\nand spatial complexity. It is publicly available as a Python package,\nencouraging community feedback and contributions."}, "http://arxiv.org/abs/2312.00728": {"title": "Soft computing for the posterior of a new matrix t graphical network", "link": "http://arxiv.org/abs/2312.00728", "description": "Modelling noisy data in a network context remains an unavoidable obstacle;\nfortunately, random matrix theory may comprehensively describe network\nenvironments effectively. This necessitates the probabilistic\ncharacterisation of these networks (and accompanying noisy data) using matrix\nvariate models. Denoising network data using a Bayes approach is not common in\nthe surveyed literature. 
This paper adopts the Bayesian viewpoint and introduces a\nnew matrix variate t-model in a prior sense by relying on the matrix variate\ngamma distribution for the noise process, following the Gaussian graphical\nnetwork for the cases when the normality assumption is violated. From a\nstatistical learning viewpoint, such a theoretical consideration indubitably\nbenefits the real-world comprehension of structures causing noisy data with\nnetwork-based attributes as part of machine learning in data science. A full\nstructural learning procedure is provided for calculating and approximating the\nresulting posterior of interest to assess the considered model's network\ncentrality measures. Experiments with synthetic and real-world stock price data\nare performed not only to validate the proposed algorithm's capabilities but\nalso to show that this model has wider flexibility than originally implied in\nBillio et al. (2021)."}, "http://arxiv.org/abs/2312.00770": {"title": "Random Forest for Dynamic Risk Prediction or Recurrent Events: A Pseudo-Observation Approach", "link": "http://arxiv.org/abs/2312.00770", "description": "Recurrent events are common in clinical, healthcare, social and behavioral\nstudies. A recent analysis framework for potentially censored recurrent event\ndata is to construct a censored longitudinal data set consisting of times to\nthe first recurrent event in multiple prespecified follow-up windows of length\n$\\tau$. With the staggering number of potential predictors being generated from\ngenetic, -omic, and electronic health records sources, machine learning\napproaches such as the random forest are growing in popularity, as they can\nincorporate information from highly correlated predictors with non-standard\nrelationships. In this paper, we bridge this gap by developing a random forest\napproach for dynamically predicting probabilities of remaining event-free\nduring a subsequent $\\tau$-duration follow-up period from a reconstructed\ncensored longitudinal data set. We demonstrate the increased ability of our\nrandom forest algorithm for predicting the probability of remaining event-free\nover a $\\tau$-duration follow-up period when compared to the recurrent event\nmodeling framework of Xia et al. (2020) in settings where association between\npredictors and recurrent event outcomes is complex in nature. The proposed\nrandom forest algorithm is demonstrated using recurrent exacerbation data from\nthe Azithromycin for the Prevention of Exacerbations of Chronic Obstructive\nPulmonary Disease (Albert et al., 2011)."}, "http://arxiv.org/abs/2109.00160": {"title": "Novel Bayesian method for simultaneous detection of activation signatures and background connectivity for task fMRI data", "link": "http://arxiv.org/abs/2109.00160", "description": "In this paper, we introduce a new Bayesian approach for analyzing task fMRI\ndata that simultaneously detects activation signatures and background\nconnectivity. Our modeling involves a new hybrid tensor spatial-temporal basis\nstrategy that enables scalable computing yet captures nearby and distant\nintervoxel correlation and long-memory temporal correlation. The spatial basis\ninvolves a composite hybrid transform with two levels: the first accounts for\nwithin-ROI correlation, and second between-ROI distant correlation. 
We\ndemonstrate in simulations how our basis space regression modeling strategy\nincreases sensitivity for identifying activation signatures, partly driven by\nthe induced background connectivity that itself can be summarized to reveal\nbiological insights. This strategy leads to computationally scalable fully\nBayesian inference at the voxel or ROI level that adjusts for multiple testing.\nWe apply this model to Human Connectome Project data to reveal insights into\nbrain activation patterns and background connectivity related to working memory\ntasks."}, "http://arxiv.org/abs/2202.06117": {"title": "Metric Statistics: Exploration and Inference for Random Objects With Distance Profiles", "link": "http://arxiv.org/abs/2202.06117", "description": "This article provides an overview of the statistical modeling of complex data\nas increasingly encountered in modern data analysis. It is argued that such\ndata can often be described as elements of a metric space that satisfies\ncertain structural conditions and features a probability measure. We refer to\nthe random elements of such spaces as random objects and to the emerging field\nthat deals with their statistical analysis as metric statistics. Metric\nstatistics provides methodology, theory and visualization tools for the\nstatistical description, quantification of variation, centrality and quantiles,\nregression and inference for populations of random objects for which samples\nare available. In addition to a brief review of current concepts, we focus on\ndistance profiles as a major tool for object data in conjunction with the\npairwise Wasserstein transports of the underlying one-dimensional distance\ndistributions. These pairwise transports lead to the definition of intuitive\nand interpretable notions of transport ranks and transport quantiles as well as\ntwo-sample inference. An associated profile metric complements the original\nmetric of the object space and may reveal important features of the object data\nin data analysis. We demonstrate these tools for the analysis of complex data\nthrough various examples and visualizations."}, "http://arxiv.org/abs/2205.11956": {"title": "Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control", "link": "http://arxiv.org/abs/2205.11956", "description": "Most machine learning methods require tuning of hyper-parameters. For kernel\nridge regression with the Gaussian kernel, the hyper-parameter is the\nbandwidth. The bandwidth specifies the length scale of the kernel and has to be\ncarefully selected to obtain a model with good generalization. The default\nmethods for bandwidth selection, cross-validation and marginal likelihood\nmaximization, often yield good results, albeit at high computational costs.\nInspired by Jacobian regularization, we formulate an approximate expression for\nhow the derivatives of the functions inferred by kernel ridge regression with\nthe Gaussian kernel depend on the kernel bandwidth. We use this expression to\npropose a closed-form, computationally feather-light, bandwidth selection\nheuristic, based on controlling the Jacobian. In addition, the Jacobian\nexpression illuminates how the bandwidth selection is a trade-off between the\nsmoothness of the inferred function and the conditioning of the training data\nkernel matrix. 
We show on real and synthetic data that, compared to\ncross-validation and marginal likelihood maximization, our method is on par in\nterms of model performance, but up to six orders of magnitude faster."}, "http://arxiv.org/abs/2209.00102": {"title": "Bayesian Mixed Multidimensional Scaling for Auditory Processing", "link": "http://arxiv.org/abs/2209.00102", "description": "The human brain distinguishes speech sound categories by representing\nacoustic signals in a latent multidimensional auditory-perceptual space. This\nspace can be statistically constructed using multidimensional scaling, a\ntechnique that can compute lower-dimensional latent features representing the\nspeech signals in such a way that their pairwise distances in the latent space\nclosely resemble the corresponding distances in the observation space. The\ninter-individual and inter-population (e.g., native versus non-native\nlisteners) heterogeneity in such representations is, however, not well\nunderstood. These questions have often been examined using joint analyses that\nignore individual heterogeneity or using separate analyses that cannot\ncharacterize human similarities. Neither extreme, therefore, allows for\nprincipled comparisons between populations and individuals. The focus of the\ncurrent literature has also often been on inference on the latent distances between\nthe categories and not on the latent features themselves that make up these\ndistances, which are crucial for our applications. Motivated by these problems, we\ndevelop a novel Bayesian mixed multidimensional scaling method, taking into\naccount the heterogeneity across populations and subjects. We design a Markov\nchain Monte Carlo algorithm for posterior computation. We then recover the\nlatent features using a post-processing scheme applied to the posterior\nsamples. We evaluate the method's empirical performance through synthetic\nexperiments. Applied to a motivating auditory neuroscience study, the method\nprovides novel insights into how biologically interpretable lower-dimensional\nlatent features reconstruct the observed distances between the stimuli and vary\nbetween individuals and their native language experiences."}, "http://arxiv.org/abs/2209.01297": {"title": "Assessing treatment effect heterogeneity in the presence of missing effect modifier data in cluster-randomized trials", "link": "http://arxiv.org/abs/2209.01297", "description": "Understanding whether and how treatment effects vary across subgroups is\ncrucial to inform clinical practice and recommendations. Accordingly, the\nassessment of heterogeneous treatment effects (HTE) based on pre-specified\npotential effect modifiers has become a common goal in modern randomized\ntrials. However, when one or more potential effect modifiers are missing,\ncomplete-case analysis may lead to bias and under-coverage. While statistical\nmethods for handling missing data have been proposed and compared for\nindividually randomized trials with missing effect modifier data, few\nguidelines exist for the cluster-randomized setting, where intracluster\ncorrelations in the effect modifiers, outcomes, or even missingness mechanisms\nmay introduce further threats to accurate assessment of HTE. In this article,\nthe performance of several missing data methods is compared through a\nsimulation study of cluster-randomized trials with continuous outcome and\nmissing binary effect modifier data, and further illustrated using real data\nfrom the Work, Family, and Health Study. 
Our results suggest that multilevel\nmultiple imputation (MMI) and Bayesian MMI have better performance than other\navailable methods, and that Bayesian MMI has lower bias and closer to nominal\ncoverage than standard MMI when there are model specification or compatibility\nissues."}, "http://arxiv.org/abs/2307.06996": {"title": "Spey: smooth inference for reinterpretation studies", "link": "http://arxiv.org/abs/2307.06996", "description": "Statistical models serve as the cornerstone for hypothesis testing in\nempirical studies. This paper introduces a new cross-platform Python-based\npackage designed to utilise different likelihood prescriptions via a flexible\nplug-in system. This framework empowers users to propose, examine, and publish\nnew likelihood prescriptions without developing software infrastructure,\nultimately unifying and generalising different ways of constructing likelihoods\nand employing them for hypothesis testing within a unified platform. We propose\na new simplified likelihood prescription, surpassing previous approximation\naccuracies by incorporating asymmetric uncertainties. Moreover, our package\nfacilitates the integration of various likelihood combination routines, thereby\nbroadening the scope of independent studies through a meta-analysis. By\nremaining agnostic to the source of the likelihood prescription and the signal\nhypothesis generator, our platform allows for the seamless implementation of\npackages with different likelihood prescriptions, fostering compatibility and\ninteroperability."}, "http://arxiv.org/abs/2309.11472": {"title": "Optimizing Dynamic Predictions from Joint Models using Super Learning", "link": "http://arxiv.org/abs/2309.11472", "description": "Joint models for longitudinal and time-to-event data are often employed to\ncalculate dynamic individualized predictions used in numerous applications of\nprecision medicine. Two components of joint models that influence the accuracy\nof these predictions are the shape of the longitudinal trajectories and the\nfunctional form linking the longitudinal outcome history to the hazard of the\nevent. Finding a single well-specified model that produces accurate predictions\nfor all subjects and follow-up times can be challenging, especially when\nconsidering multiple longitudinal outcomes. In this work, we use the concept of\nsuper learning and avoid selecting a single model. In particular, we specify a\nweighted combination of the dynamic predictions calculated from a library of\njoint models with different specifications. The weights are selected to\noptimize a predictive accuracy metric using V-fold cross-validation. We use as\npredictive accuracy measures the expected quadratic prediction error and the\nexpected predictive cross-entropy. In a simulation study, we found that the\nsuper learning approach produces results very similar to the Oracle model,\nwhich was the model with the best performance in the test datasets. All\nproposed methodology is implemented in the freely available R package JMbayes2."}, "http://arxiv.org/abs/2312.00963": {"title": "Spatiotemporal Transformer for Imputing Sparse Data: A Deep Learning Approach", "link": "http://arxiv.org/abs/2312.00963", "description": "Effective management of environmental resources and agricultural\nsustainability heavily depends on accurate soil moisture data. 
However,\ndatasets like the SMAP/Sentinel-1 soil moisture product often contain missing\nvalues across their spatiotemporal grid, which poses a significant challenge.\nThis paper introduces a novel Spatiotemporal Transformer model (ST-Transformer)\nspecifically designed to address the issue of missing values in sparse\nspatiotemporal datasets, particularly focusing on soil moisture data. The\nST-Transformer employs multiple spatiotemporal attention layers to capture the\ncomplex spatiotemporal correlations in the data and can integrate additional\nspatiotemporal covariates during the imputation process, thereby enhancing its\naccuracy. The model is trained using a self-supervised approach, enabling it to\nautonomously predict missing values from observed data points. Our model's\nefficacy is demonstrated through its application to the SMAP 1km soil moisture\ndata over a 36 x 36 km grid in Texas. It showcases superior accuracy compared\nto well-known imputation methods. Additionally, our simulation studies on other\ndatasets highlight the model's broader applicability in various spatiotemporal\nimputation tasks."}, "http://arxiv.org/abs/2312.01146": {"title": "Bayesian models are better than frequentist models in identifying differences in small datasets comprising phonetic data", "link": "http://arxiv.org/abs/2312.01146", "description": "While many studies have previously conducted direct comparisons between\nresults obtained from frequentist and Bayesian models, our research introduces\na novel perspective by examining these models in the context of a small dataset\ncomprising phonetic data. Specifically, we employed mixed-effects models and\nBayesian regression models to explore differences between monolingual and\nbilingual populations in the acoustic values of produced vowels. Our findings\nrevealed that Bayesian hypothesis testing exhibited superior accuracy in\nidentifying evidence for differences compared to the posthoc test, which tended\nto underestimate the existence of such differences. These results align with a\nsubstantial body of previous research highlighting the advantages of Bayesian\nover frequentist models, thereby emphasizing the need for methodological\nreform. In conclusion, our study supports the assertion that Bayesian models\nare more suitable for investigating differences in small datasets of phonetic\nand/or linguistic data, suggesting that researchers in these fields may find\ngreater reliability in utilizing such models for their analyses."}, "http://arxiv.org/abs/2312.01168": {"title": "MacroPARAFAC for handling rowwise and cellwise outliers in incomplete multi-way data", "link": "http://arxiv.org/abs/2312.01168", "description": "Multi-way data extend two-way matrices into higher-dimensional tensors, often\nexplored through dimensional reduction techniques. In this paper, we study the\nParallel Factor Analysis (PARAFAC) model for handling multi-way data,\nrepresenting it more compactly through a concise set of loading matrices and\nscores. We assume that the data may be incomplete and could contain both\nrowwise and cellwise outliers, signifying cases that deviate from the majority\nand outlying cells dispersed throughout the data array. To address these\nchallenges, we present a novel algorithm designed to robustly estimate both\nloadings and scores. Additionally, we introduce an enhanced outlier map to\ndistinguish various patterns of outlying behavior. 
Through simulations and the\nanalysis of fluorescence Excitation-Emission Matrix (EEM) data, we demonstrate\nthe robustness of our approach. Our results underscore the effectiveness of\ndiagnostic tools in identifying and interpreting unusual patterns within the\ndata."}, "http://arxiv.org/abs/2312.01210": {"title": "When accurate prediction models yield harmful self-fulfilling prophecies", "link": "http://arxiv.org/abs/2312.01210", "description": "Prediction models are popular in medical research and practice. By predicting\nan outcome of interest for specific patients, these models may help inform\ndifficult treatment decisions, and are often hailed as the poster children for\npersonalized, data-driven healthcare.\n\nWe show, however, that using prediction models for decision making can lead to\nharmful decisions, even when the predictions exhibit good discrimination after\ndeployment. These models are harmful self-fulfilling prophecies: their\ndeployment harms a group of patients but the worse outcome of these patients\ndoes not invalidate the predictive power of the model. Our main result is a\nformal characterization of a set of such prediction models. Next we show that\nmodels that are well calibrated before and after deployment are useless for\ndecision making as they made no change in the data distribution. These results\npoint to the need to revise standard practices for validation, deployment and\nevaluation of prediction models that are used in medical decisions."}, "http://arxiv.org/abs/2312.01238": {"title": "A deep learning pipeline for cross-sectional and longitudinal multiview data integration", "link": "http://arxiv.org/abs/2312.01238", "description": "Biomedical research now commonly integrates diverse data types or views from\nthe same individuals to better understand the pathobiology of complex diseases,\nbut the challenge lies in meaningfully integrating these diverse views.\nExisting methods often require the same type of data from all views\n(cross-sectional data only or longitudinal data only) or do not consider any\nclass outcome in the integration method, presenting limitations. To overcome\nthese limitations, we have developed a pipeline that harnesses the power of\nstatistical and deep learning methods to integrate cross-sectional and\nlongitudinal data from multiple sources. Additionally, it identifies key\nvariables contributing to the association between views and the separation\namong classes, providing deeper biological insights. This pipeline includes\nvariable selection/ranking using linear and nonlinear methods, feature\nextraction using functional principal component analysis and Euler\ncharacteristics, and joint integration and classification using dense\nfeed-forward networks and recurrent neural networks. We applied this pipeline\nto cross-sectional and longitudinal multi-omics data (metagenomics,\ntranscriptomics, and metabolomics) from an inflammatory bowel disease (IBD)\nstudy, and we identified microbial pathways, metabolites, and genes that\ndiscriminate by IBD status, providing information on the etiology of IBD. We\nconducted simulations to compare the two feature extraction methods. 
The\nproposed pipeline is available from the following GitHub repository:\nhttps://github.com/lasandrall/DeepIDA-GRU."}, "http://arxiv.org/abs/2312.01265": {"title": "Concentration of Randomized Functions of Uniformly Bounded Variation", "link": "http://arxiv.org/abs/2312.01265", "description": "A sharp, distribution free, non-asymptotic result is proved for the\nconcentration of a random function around the mean function, when the\nrandomization is generated by a finite sequence of independent data and the\nrandom functions satisfy uniform bounded variation assumptions. The specific\nmotivation for the work comes from the need for inference on the distributional\nimpacts of social policy intervention. However, the family of randomized\nfunctions that we study is broad enough to cover wide-ranging applications. For\nexample, we provide a Kolmogorov-Smirnov like test for randomized functions\nthat are almost surely Lipschitz continuous, and novel tools for inference with\nheterogeneous treatment effects. A Dvoretzky-Kiefer-Wolfowitz like inequality\nis also provided for the sum of almost surely monotone random functions,\nextending the famous non-asymptotic work of Massart for empirical cumulative\ndistribution functions generated by i.i.d. data, to settings without\nmicro-clusters proposed by Canay, Santos, and Shaikh. We illustrate the\nrelevance of our theoretical results for applied work via empirical\napplications. Notably, the proof of our main concentration result relies on a\nnovel stochastic rendition of the fundamental result of Debreu, generally\ndubbed the \"gap lemma,\" that transforms discontinuous utility representations\nof preorders into continuous utility representations, and on an envelope\ntheorem of an infinite dimensional optimisation problem that we carefully\nconstruct."}, "http://arxiv.org/abs/2312.01266": {"title": "A unified framework for covariate adjustment under stratified randomization", "link": "http://arxiv.org/abs/2312.01266", "description": "Randomization, as a key technique in clinical trials, can eliminate sources\nof bias and produce comparable treatment groups. In randomized experiments, the\ntreatment effect is a parameter of general interest. Researchers have explored\nthe validity of using linear models to estimate the treatment effect and\nperform covariate adjustment and thus improve the estimation efficiency.\nHowever, the relationship between covariates and outcomes is not necessarily\nlinear, and is often intricate. Advances in statistical theory and related\ncomputer technology allow us to use nonparametric and machine learning methods\nto better estimate the relationship between covariates and outcomes and thus\nobtain further efficiency gains. However, theoretical studies on how to draw\nvalid inferences when using nonparametric and machine learning methods under\nstratified randomization are yet to be conducted. In this paper, we discuss a\nunified framework for covariate adjustment and corresponding statistical\ninference under stratified randomization and present a detailed proof of the\nvalidity of using local linear kernel-weighted least squares regression for\ncovariate adjustment in treatment effect estimators as a special case. In the\ncase of high-dimensional data, we additionally propose an algorithm for\nstatistical inference using machine learning methods under stratified\nrandomization, which makes use of sample splitting to alleviate the\nrequirements on the asymptotic properties of machine learning methods. 
Finally,\nwe compare the performances of treatment effect estimators using different\nmachine learning methods by considering various data generation scenarios, to\nguide practical research."}, "http://arxiv.org/abs/2312.01379": {"title": "Relation between PLS and OLS regression in terms of the eigenvalue distribution of the regressor covariance matrix", "link": "http://arxiv.org/abs/2312.01379", "description": "Partial least squares (PLS) is a dimensionality reduction technique\nintroduced in the field of chemometrics and successfully employed in many other\nareas. The PLS components are obtained by maximizing the covariance between\nlinear combinations of the regressors and of the target variables. In this\nwork, we focus on its application to scalar regression problems. PLS regression\nconsists in finding the least squares predictor that is a linear combination of\na subset of the PLS components. Alternatively, PLS regression can be formulated\nas a least squares problem restricted to a Krylov subspace. This equivalent\nformulation is employed to analyze the distance between\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$, the PLS\nestimator of the vector of coefficients of the linear regression model based on\n$L$ PLS components, and $\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$, the one\nobtained by ordinary least squares (OLS), as a function of $L$. Specifically,\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$ is the\nvector of coefficients in the aforementioned Krylov subspace that is closest to\n$\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$ in terms of the Mahalanobis distance\nwith respect to the covariance matrix of the OLS estimate. We provide a bound\non this distance that depends only on the distribution of the eigenvalues of\nthe regressor covariance matrix. Numerical examples on synthetic and real-world\ndata are used to illustrate how the distance between\n${\\hat{\\boldsymbol\\beta}\\;}_{\\mathrm{PLS}}^{\\scriptscriptstyle {(L)}}$ and\n$\\hat{\\boldsymbol \\beta}_{\\mathrm{OLS}}$ depends on the number of clusters in\nwhich the eigenvalues of the regressor covariance matrix are grouped."}, "http://arxiv.org/abs/2312.01411": {"title": "Bayesian inference on Cox regression models using catalytic prior distributions", "link": "http://arxiv.org/abs/2312.01411", "description": "The Cox proportional hazards model (Cox model) is a popular model for\nsurvival data analysis. When the sample size is small relative to the dimension\nof the model, the standard maximum partial likelihood inference is often\nproblematic. In this work, we propose the Cox catalytic prior distributions for\nBayesian inference on Cox models, which is an extension of a general class of\nprior distributions originally designed for stabilizing complex parametric\nmodels. The Cox catalytic prior is formulated as a weighted likelihood of the\nregression coefficients based on synthetic data and a surrogate baseline hazard\nconstant. This surrogate hazard can be either provided by the user or estimated\nfrom the data, and the synthetic data are generated from the predictive\ndistribution of a fitted simpler model. For point estimation, we derive an\napproximation of the marginal posterior mode, which can be computed\nconveniently as a regularized log partial likelihood estimator. We prove that\nour prior distribution is proper and the resulting estimator is consistent\nunder mild conditions. 
In simulation studies, our proposed method outperforms\nstandard maximum partial likelihood inference and is on par with existing\nshrinkage methods. We further illustrate the application of our method to a\nreal dataset."}, "http://arxiv.org/abs/2312.01457": {"title": "Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits", "link": "http://arxiv.org/abs/2312.01457", "description": "Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing\nnew policies using existing data without costly experimentation. However,\ncurrent OPE methods, such as Inverse Probability Weighting (IPW) and Doubly\nRobust (DR) estimators, suffer from high variance, particularly in cases of low\noverlap between target and behavior policies or large action and context\nspaces. In this paper, we introduce a new OPE estimator for contextual bandits,\nthe Marginal Ratio (MR) estimator, which focuses on the shift in the marginal\ndistribution of outcomes $Y$ instead of the policies themselves. Through\nrigorous theoretical analysis, we demonstrate the benefits of the MR estimator\ncompared to conventional methods like IPW and DR in terms of variance\nreduction. Additionally, we establish a connection between the MR estimator and\nthe state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator,\nproving that MR achieves lower variance among a generalized family of MIPS\nestimators. We further illustrate the utility of the MR estimator in causal\ninference settings, where it exhibits enhanced performance in estimating\nAverage Treatment Effects (ATE). Our experiments on synthetic and real-world\ndatasets corroborate our theoretical findings and highlight the practical\nadvantages of the MR estimator in OPE for contextual bandits."}, "http://arxiv.org/abs/2312.01496": {"title": "Large-Scale Correlation Screening under Dependence for Brain Functional Connectivity Network Inference", "link": "http://arxiv.org/abs/2312.01496", "description": "Data produced by resting-state functional Magnetic Resonance Imaging are\nwidely used to infer brain functional connectivity networks. Such networks\ncorrelate neural signals to connect brain regions, which consist in groups of\ndependent voxels. Previous work has focused on aggregating data across voxels\nwithin predefined regions. However, the presence of within-region correlations\nhas noticeable impacts on inter-regional correlation detection, and thus edge\nidentification. To alleviate them, we propose to leverage techniques from the\nlarge-scale correlation screening literature, and derive simple and practical\ncharacterizations of the mean number of correlation discoveries that flexibly\nincorporate intra-regional dependence structures. A connectivity network\ninference framework is then presented. First, inter-regional correlation\ndistributions are estimated. Then, correlation thresholds that can be tailored\nto one's application are constructed for each edge. Finally, the proposed\nframework is implemented on synthetic and real-world datasets. This novel\napproach for handling arbitrary intra-regional correlation is shown to limit\nfalse positives while improving true positive rates."}, "http://arxiv.org/abs/2312.01692": {"title": "Risk-Controlling Model Selection via Guided Bayesian Optimization", "link": "http://arxiv.org/abs/2312.01692", "description": "Adjustable hyperparameters of machine learning models typically impact\nvarious key trade-offs such as accuracy, fairness, robustness, or inference\ncost. 
Our goal in this paper is to find a configuration that adheres to\nuser-specified limits on certain risks while being useful with respect to other\nconflicting metrics. We solve this by combining Bayesian Optimization (BO) with\nrigorous risk-controlling procedures, where our core idea is to steer BO\ntowards an efficient testing strategy. Our BO method identifies a set of Pareto\noptimal configurations residing in a designated region of interest. The\nresulting candidates are statistically verified, and the best-performing\nconfiguration is selected with guaranteed risk levels. We demonstrate the\neffectiveness of our approach on a range of tasks with multiple desiderata,\nincluding low error rates, equitable predictions, handling spurious\ncorrelations, managing rate and distortion in generative models, and reducing\ncomputational costs."}, "http://arxiv.org/abs/2312.01723": {"title": "Group Sequential Design Under Non-proportional Hazards", "link": "http://arxiv.org/abs/2312.01723", "description": "Non-proportional hazards (NPH) are often observed in clinical trials with\ntime-to-event endpoints. A common example is a long-term clinical trial with a\ndelayed treatment effect in immunotherapy for cancer. When designing clinical\ntrials with time-to-event endpoints, it is crucial to consider NPH scenarios to\ngain a complete understanding of design operating characteristics. In this\npaper, we focus on group sequential design for three NPH methods: the average\nhazard ratio, the weighted logrank test, and the MaxCombo combination test. For\neach of these approaches, we provide analytic forms of design characteristics\nthat facilitate sample size calculation and bound derivation for group\nsequential designs. Examples are provided to illustrate the proposed methods.\nTo facilitate statisticians in designing and comparing group sequential designs\nunder NPH, we have implemented the group sequential design methodology in the\ngsDesign2 R package at https://cran.r-project.org/web/packages/gsDesign2/."}, "http://arxiv.org/abs/2312.01735": {"title": "Weighted Q-learning for optimal dynamic treatment regimes with MNAR covariates", "link": "http://arxiv.org/abs/2312.01735", "description": "Dynamic treatment regimes (DTRs) formalize medical decision-making as a\nsequence of rules for different stages, mapping patient-level information to\nrecommended treatments. In practice, estimating an optimal DTR using\nobservational data from electronic medical record (EMR) databases can be\ncomplicated by covariates that are missing not at random (MNAR) due to\ninformative monitoring of patients. Since complete case analysis can result in\nconsistent estimation of outcome model parameters under the assumption of\noutcome-independent missingness (Yang, Wang, and Ding, 2019), Q-learning is a\nnatural approach to accommodating MNAR covariates. However, the backward\ninduction algorithm used in Q-learning can introduce complications, as MNAR\ncovariates at later stages can result in MNAR pseudo-outcomes at earlier\nstages, leading to suboptimal DTRs, even if outcome variables are fully\nobserved. To address this unique missing data problem in DTR settings, we\npropose two weighted Q-learning approaches where inverse probability weights\nfor missingness of the pseudo-outcomes are obtained through estimating\nequations with valid nonresponse instrumental variables or sensitivity\nanalysis. 
Asymptotic properties of the weighted Q-learning estimators are\nderived and the finite-sample performance of the proposed methods is evaluated\nand compared with alternative methods through extensive simulation studies.\nUsing EMR data from the Medical Information Mart for Intensive Care database,\nwe apply the proposed methods to investigate the optimal fluid strategy for\nsepsis patients in intensive care units."}, "http://arxiv.org/abs/2312.01815": {"title": "Hypothesis Testing in Gaussian Graphical Models: Novel Goodness-of-Fit Tests and Conditional Randomization Tests", "link": "http://arxiv.org/abs/2312.01815", "description": "We introduce novel hypothesis testing methods for Gaussian graphical models,\nwhose foundation is an innovative algorithm that generates exchangeable copies\nfrom these models. We utilize the exchangeable copies to formulate a\ngoodness-of-fit test, which is valid in both low and high-dimensional settings\nand flexible in choosing the test statistic. This test exhibits superior power\nperformance, especially in scenarios where the true precision matrix violates\nthe null hypothesis with many small entries. Furthermore, we adapt the sampling\nalgorithm for constructing a new conditional randomization test for the\nconditional independence between a response $Y$ and a vector of covariates $X$\ngiven some other variables $Z$. Thanks to the model-X framework, this test does\nnot require any modeling assumption about $Y$ and can utilize test statistics\nfrom advanced models. It also relaxes the assumptions of conditional\nrandomization tests by allowing the number of unknown parameters of the\ndistribution of $X$ to be much larger than the sample size. For both of our\ntesting procedures, we propose several test statistics and conduct\ncomprehensive simulation studies to demonstrate their superior performance in\ncontrolling the Type-I error and achieving high power. The usefulness of our\nmethods is further demonstrated through three real-world applications."}, "http://arxiv.org/abs/2312.01870": {"title": "Extreme-value modelling of migratory bird arrival dates: Insights from citizen science data", "link": "http://arxiv.org/abs/2312.01870", "description": "Citizen science mobilises many observers and gathers huge datasets but often\nwithout strict sampling protocols, which results in observation biases due to\nheterogeneity in sampling effort that can lead to biased statistical\ninferences. We develop a spatiotemporal Bayesian hierarchical model for\nbias-corrected estimation of arrival dates of the first migratory bird\nindividuals at a breeding site. Higher sampling effort could be correlated with\nearlier observed dates. We implement data fusion of two citizen-science\ndatasets with sensibly different protocols (BBS, eBird) and map posterior\ndistributions of the latent process, which contains four spatial components\nwith Gaussian process priors: species niche; sampling effort; position and\nscale parameters of annual first date of arrival. The data layer includes four\nresponse variables: counts of observed eBird locations (Poisson);\npresence-absence at observed eBird locations (Binomial); BBS occurrence counts\n(Poisson); first arrival dates (Generalized Extreme-Value). We devise a Markov\nChain Monte Carlo scheme and check by simulation that the latent process\ncomponents are identifiable. We apply our model to several migratory bird\nspecies in the northeastern US for 2001--2021. 
The sampling effort is shown to\nsignificantly modulate the observed first arrival date. We exploit this\nrelationship to effectively debias predictions of the true first arrival dates."}, "http://arxiv.org/abs/2312.01925": {"title": "Coefficient Shape Alignment in Multivariate Functional Regression", "link": "http://arxiv.org/abs/2312.01925", "description": "In multivariate functional data analysis, different functional covariates can\nbe homogeneous in some sense. The hidden homogeneity structure is informative\nabout the connectivity or association of different covariates. The covariates\nwith pronounced homogeneity can be analyzed jointly in the same group and this\ngives rise to a way of parsimoniously modeling multivariate functional data. In\nthis paper, we develop a multivariate functional regression technique by a new\nregularization approach termed \"coefficient shape alignment\" to tackle the\npotential homogeneity of different functional covariates. The modeling\nprocedure includes two main steps: first the unknown grouping structure is\ndetected with a new regularization approach to aggregate covariates into\ndisjoint groups; and then a grouped multivariate functional regression model is\nestablished based on the detected grouping structure. In this new grouped\nmodel, the coefficient functions of covariates in the same homogeneous group\nshare the same shape invariant to scaling. The new regularization approach\nbuilds on penalizing the discrepancy of coefficient shape. The consistency\nproperty of the detected grouping structure is thoroughly investigated, and the\nconditions that guarantee uncovering the underlying true grouping structure are\ndeveloped. The asymptotic properties of the model estimates are also developed.\nExtensive simulation studies are conducted to investigate the finite-sample\nproperties of the developed methods. The practical utility of the proposed\nmethods is illustrated in an analysis on sugar quality evaluation. This work\nprovides a novel means for analyzing the underlying homogeneity of functional\ncovariates and developing parsimonious model structures for multivariate\nfunctional data."}, "http://arxiv.org/abs/2312.01944": {"title": "New Methods for Network Count Time Series", "link": "http://arxiv.org/abs/2312.01944", "description": "The original generalized network autoregressive models are poor for modelling\ncount data as they are based on the additive and constant noise assumptions,\nwhich is usually inappropriate for count data. We introduce two new models\n(GNARI and NGNAR) for count network time series by adapting and extending\nexisting count-valued time series models. We present results on the statistical\nand asymptotic properties of our new models and their estimates obtained by\nconditional least squares and maximum likelihood. We conduct two simulation\nstudies that verify successful parameter estimation for both models and conduct\na further study that shows, for negative network parameters, that our NGNAR\nmodel outperforms existing models and our other GNARI model in terms of\npredictive performance. 
We model a network time series constructed from\nCOVID-positive counts for counties in New York State during 2020-22 and show\nthat our new models perform considerably better than existing methods for this\nproblem."}, "http://arxiv.org/abs/2312.01969": {"title": "FDR Control for Online Anomaly Detection", "link": "http://arxiv.org/abs/2312.01969", "description": "The goal of anomaly detection is to identify observations generated by a\nprocess that is different from a reference one. An accurate anomaly detector\nmust ensure low false positive and false negative rates. However, in the online\ncontext, such a constraint remains highly challenging due to the usual lack of\ncontrol of the False Discovery Rate (FDR). In particular, the online framework\nmakes it impossible to use classical multiple testing approaches such as the\nBenjamini-Hochberg (BH) procedure. Our strategy overcomes this difficulty by\nexploiting a local control of the ``modified FDR'' (mFDR). An important\ningredient in this control is the cardinality of the calibration set used for\ncomputing empirical $p$-values, which turns out to be an influential parameter.\nThis results in a new strategy for tuning this parameter, which yields the desired\nFDR control over the whole time series. The statistical performance of this\nstrategy is analyzed by theoretical guarantees and its practical behavior is\nassessed by simulation experiments, which support our conclusions."}, "http://arxiv.org/abs/2312.02110": {"title": "Fourier Methods for Sufficient Dimension Reduction in Time Series", "link": "http://arxiv.org/abs/2312.02110", "description": "Dimensionality reduction has always been one of the most significant and\nchallenging problems in the analysis of high-dimensional data. In the context\nof time series analysis, our focus is on the estimation and inference of\nconditional mean and variance functions. By using central mean and variance\ndimension reduction subspaces that preserve sufficient information about the\nresponse, one can effectively estimate the unknown mean and variance functions\nof the time series. While the literature presents several approaches to\nestimate the time series central mean and variance subspaces (TS-CMS and\nTS-CVS), these methods tend to be computationally intensive and infeasible for\npractical applications. By employing the Fourier transform, we derive explicit\nestimators for TS-CMS and TS-CVS. These proposed estimators are demonstrated to\nbe consistent, asymptotically normal, and efficient. Simulation studies have\nbeen conducted to evaluate the performance of the proposed method. The results\nshow that our method is significantly more accurate and computationally\nefficient than existing methods. Furthermore, the method has been applied to\nthe Canadian Lynx dataset."}, "http://arxiv.org/abs/2202.03852": {"title": "Nonlinear Network Autoregression", "link": "http://arxiv.org/abs/2202.03852", "description": "We study general nonlinear models for time series networks of integer and\ncontinuous valued data. The vector of high dimensional responses, measured on\nthe nodes of a known network, is regressed non-linearly on its lagged value and\non lagged values of the neighboring nodes by employing a smooth link function.\nWe study stability conditions for such a multivariate process and develop quasi\nmaximum likelihood inference when the network dimension is increasing. In\naddition, we study linearity score tests by treating separately the cases of\nidentifiable and non-identifiable parameters. 
In the case of identifiability,\nthe test statistic converges to a chi-square distribution. When the parameters\nare not-identifiable, we develop a supremum-type test whose p-values are\napproximated adequately by employing a feasible bound and bootstrap\nmethodology. Simulations and data examples support further our findings."}, "http://arxiv.org/abs/2202.10887": {"title": "Policy Evaluation for Temporal and/or Spatial Dependent Experiments", "link": "http://arxiv.org/abs/2202.10887", "description": "The aim of this paper is to establish a causal link between the policies\nimplemented by technology companies and the outcomes they yield within\nintricate temporal and/or spatial dependent experiments. We propose a novel\ntemporal/spatio-temporal Varying Coefficient Decision Process (VCDP) model,\ncapable of effectively capturing the evolving treatment effects in situations\ncharacterized by temporal and/or spatial dependence. Our methodology\nencompasses the decomposition of the Average Treatment Effect (ATE) into the\nDirect Effect (DE) and the Indirect Effect (IE). We subsequently devise\ncomprehensive procedures for estimating and making inferences about both DE and\nIE. Additionally, we provide a rigorous analysis of the statistical properties\nof these procedures, such as asymptotic power. To substantiate the\neffectiveness of our approach, we carry out extensive simulations and real data\nanalyses."}, "http://arxiv.org/abs/2206.13091": {"title": "Informed censoring: the parametric combination of data and expert information", "link": "http://arxiv.org/abs/2206.13091", "description": "The statistical censoring setup is extended to the situation when random\nmeasures can be assigned to the realization of datapoints, leading to a new way\nof incorporating expert information into the usual parametric estimation\nprocedures. The asymptotic theory is provided for the resulting estimators, and\nsome special cases of practical relevance are studied in more detail. Although\nthe proposed framework mathematically generalizes censoring and coarsening at\nrandom, and borrows techniques from M-estimation theory, it provides a novel\nand transparent methodology which enjoys significant practical applicability in\nsituations where expert information is present. The potential of the approach\nis illustrated by a concrete actuarial application of tail parameter estimation\nfor a heavy-tailed MTPL dataset with limited available expert information."}, "http://arxiv.org/abs/2208.11665": {"title": "Statistical exploration of the Manifold Hypothesis", "link": "http://arxiv.org/abs/2208.11665", "description": "The Manifold Hypothesis is a widely accepted tenet of Machine Learning which\nasserts that nominally high-dimensional data are in fact concentrated near a\nlow-dimensional manifold, embedded in high-dimensional space. This phenomenon\nis observed empirically in many real world situations, has led to development\nof a wide range of statistical methods in the last few decades, and has been\nsuggested as a key factor in the success of modern AI technologies. We show\nthat rich and sometimes intricate manifold structure in data can emerge from a\ngeneric and remarkably simple statistical model -- the Latent Metric Model --\nvia elementary concepts such as latent variables, correlation and stationarity.\nThis establishes a general statistical explanation for why the Manifold\nHypothesis seems to hold in so many situations. 
Informed by the Latent Metric\nModel, we derive procedures to discover and interpret the geometry of\nhigh-dimensional data, and explore hypotheses about the data generating\nmechanism. These procedures operate under minimal assumptions and make use of\nwell-known, scalable graph-analytic algorithms."}, "http://arxiv.org/abs/2210.10852": {"title": "BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models", "link": "http://arxiv.org/abs/2210.10852", "description": "Two linearly uncorrelated binary variables must also be independent because\nnon-linear dependence cannot manifest with only two possible states. This\ninherent linearity is the atom of dependency constituting any complex form of\nrelationship. Inspired by this observation, we develop a framework called\nbinary expansion linear effect (BELIEF) for understanding arbitrary\nrelationships with a binary outcome. Models from the BELIEF framework are\neasily interpretable because they describe the association of binary variables\nin the language of linear models, yielding convenient theoretical insight and\nstriking Gaussian parallels. With BELIEF, one may study generalized linear\nmodels (GLM) through transparent linear models, providing insight into how the\nchoice of link affects modeling. For example, setting a GLM interaction\ncoefficient to zero does not necessarily lead to the kind of no-interaction\nmodel assumption as understood under their linear model counterparts.\nFurthermore, for a binary response, maximum likelihood estimation for GLMs\nparadoxically fails under complete separation, when the data are most\ndiscriminative, whereas BELIEF estimation automatically reveals the perfect\npredictor in the data that is responsible for complete separation. We explore\nthese phenomena and provide related theoretical results. We also provide\na preliminary empirical demonstration of some theoretical results."}, "http://arxiv.org/abs/2302.03750": {"title": "Linking convolutional kernel size to generalization bias in face analysis CNNs", "link": "http://arxiv.org/abs/2302.03750", "description": "Training dataset biases are by far the most scrutinized factors when\nexplaining algorithmic biases of neural networks. In contrast, hyperparameters\nrelated to the neural network architecture have largely been ignored even\nthough different network parameterizations are known to induce different\nimplicit biases over learned features. For example, convolutional kernel size\nis known to affect the frequency content of features learned in CNNs. In this\nwork, we present a causal framework for linking an architectural hyperparameter\nto out-of-distribution algorithmic bias. Our framework is experimental, in that\nwe train several versions of a network with an intervention to a specific\nhyperparameter, and measure the resulting causal effect of this choice on\nperformance bias when a particular out-of-distribution image perturbation is\napplied. In our experiments, we focused on measuring the causal relationship\nbetween convolutional kernel size and face analysis classification bias across\ndifferent subpopulations (race/gender), with respect to high-frequency image\ndetails. 
We show that modifying kernel size, even in one layer of a CNN,\nchanges the frequency content of learned features significantly across data\nsubgroups leading to biased generalization performance even in the presence of\na balanced dataset."}, "http://arxiv.org/abs/2303.10808": {"title": "Dimension-agnostic Change Point Detection", "link": "http://arxiv.org/abs/2303.10808", "description": "Change point testing for high-dimensional data has attracted a lot of\nattention in statistics and machine learning owing to the emergence of\nhigh-dimensional data with structural breaks from many fields. In practice,\nwhen the dimension is less than the sample size but is not small, it is often\nunclear whether a method that is tailored to high-dimensional data or simply a\nclassical method that is developed and justified for low-dimensional data is\npreferred. In addition, the methods designed for low-dimensional data may not\nwork well in the high-dimensional environment and vice versa. In this paper, we\npropose a dimension-agnostic testing procedure targeting a single change point\nin the mean of a multivariate time series. Specifically, we can show that the\nlimiting null distribution for our test statistic is the same regardless of the\ndimensionality and the magnitude of cross-sectional dependence. The power\nanalysis is also conducted to understand the large sample behavior of the\nproposed test. Through Monte Carlo simulations and a real data illustration, we\ndemonstrate that the finite sample results strongly corroborate the theory and\nsuggest that the proposed test can be used as a benchmark for change-point\ndetection of time series of low, medium, and high dimensions."}, "http://arxiv.org/abs/2306.13870": {"title": "Post-Selection Inference for the Cox Model with Interval-Censored Data", "link": "http://arxiv.org/abs/2306.13870", "description": "We develop a post-selection inference method for the Cox proportional hazards\nmodel with interval-censored data, which provides asymptotically valid p-values\nand confidence intervals conditional on the model selected by lasso. The method\nis based on a pivotal quantity that is shown to converge to a uniform\ndistribution under local alternatives. The proof can be adapted to many other\nregression models, which is illustrated by the extension to generalized linear\nmodels and the Cox model with right-censored data. Our method involves\nestimation of the efficient information matrix, for which several approaches\nare proposed with proof of their consistency. Thorough simulation studies show\nthat our method has satisfactory performance in samples of modest sizes. The\nutility of the method is illustrated via an application to an Alzheimer's\ndisease study."}, "http://arxiv.org/abs/2307.15330": {"title": "Group integrative dynamic factor models with application to multiple subject brain connectivity", "link": "http://arxiv.org/abs/2307.15330", "description": "This work introduces a novel framework for dynamic factor model-based data\nintegration of multiple subjects time series data, called GRoup Integrative\nDYnamic factor (GRIDY) models. 
The framework identifies and characterizes\ninter-subject differences between two pre-labeled groups by considering a\ncombination of group spatial information and individual temporal dependence.\nFurthermore, it enables the identification of intra-subject differences over\ntime by employing different model configurations for each subject.\nMethodologically, the framework combines a novel principal angle-based rank\nselection algorithm and a non-iterative integrative analysis framework.\nInspired by simultaneous component analysis, this approach also reconstructs\nidentifiable latent factor series with flexible covariance structures. The\nperformance of the GRIDY models is evaluated through simulations conducted\nunder various scenarios. An application is also presented to compare\nresting-state functional MRI data collected from multiple subjects in the\nAutism Spectrum Disorder and control groups."}, "http://arxiv.org/abs/2308.11138": {"title": "NLP-based detection of systematic anomalies among the narratives of consumer complaints", "link": "http://arxiv.org/abs/2308.11138", "description": "We develop an NLP-based procedure for detecting systematic nonmeritorious\nconsumer complaints, simply called systematic anomalies, among complaint\nnarratives. While classification algorithms are used to detect pronounced\nanomalies, in the case of smaller and frequent systematic anomalies, the\nalgorithms may falter due to a variety of reasons, including technical ones as\nwell as natural limitations of human analysts. Therefore, as the next step\nafter classification, we convert the complaint narratives into quantitative\ndata, which are then analyzed using an algorithm for detecting systematic\nanomalies. We illustrate the entire procedure using complaint narratives from\nthe Consumer Complaint Database of the Consumer Financial Protection Bureau."}, "http://arxiv.org/abs/2310.00864": {"title": "Multi-Label Residual Weighted Learning for Individualized Combination Treatment Rule", "link": "http://arxiv.org/abs/2310.00864", "description": "Individualized treatment rules (ITRs) have been widely applied in many fields\nsuch as precision medicine and personalized marketing. Beyond the extensive\nstudies on ITR for binary or multiple treatments, there is considerable\ninterest in applying combination treatments. This paper introduces a novel ITR\nestimation method for combination treatments incorporating interaction effects\namong treatments. Specifically, we propose the generalized $\\psi$-loss as a\nnon-convex surrogate in the residual weighted learning framework, offering\ndesirable statistical and computational properties. Statistically, the\nminimizer of the proposed surrogate loss is Fisher-consistent with the optimal\ndecision rules, incorporating interaction effects at any intensity level - a\nsignificant improvement over existing methods. Computationally, the proposed\nmethod applies the difference-of-convex algorithm for efficient computation.\nThrough simulation studies and real-world data applications, we demonstrate the\nsuperior performance of the proposed method in recommending combination\ntreatments."}, "http://arxiv.org/abs/2312.02167": {"title": "Uncertainty Quantification in Machine Learning Based Segmentation: A Post-Hoc Approach for Left Ventricle Volume Estimation in MRI", "link": "http://arxiv.org/abs/2312.02167", "description": "Recent studies have confirmed that cardiovascular diseases remain responsible for\nthe highest death toll amongst non-communicable diseases. 
Accurate left ventricular\n(LV) volume estimation is critical for valid diagnosis and management of\nvarious cardiovascular conditions, but poses a significant challenge due to\ninherent uncertainties associated with segmentation algorithms in magnetic\nresonance imaging (MRI). Recent machine learning advancements, particularly\nU-Net-like convolutional networks, have facilitated automated segmentation of\nmedical images, but such models struggle under certain pathologies and/or different\nscanner vendors and imaging protocols. This study proposes a novel methodology\nfor post-hoc uncertainty estimation in LV volume prediction using It\^{o}\nstochastic differential equations (SDEs) to model path-wise behavior for the\nprediction error. The model describes the area of the left ventricle along the\nheart's long axis. The method is agnostic to the underlying segmentation\nalgorithm, facilitating its use with various existing and future segmentation\ntechnologies. The proposed approach provides a mechanism for quantifying\nuncertainty, enabling medical professionals to intervene for unreliable\npredictions. This is of utmost importance in critical applications such as\nmedical diagnosis, where prediction accuracy and reliability can directly\nimpact patient outcomes. The method is also robust to dataset changes, enabling\napplication in medical centers with limited access to labeled data. Our\nfindings highlight the proposed uncertainty estimation methodology's potential\nto enhance automated segmentation robustness and generalizability, paving the\nway for more reliable and accurate LV volume estimation in clinical settings as\nwell as opening new avenues for uncertainty quantification in biomedical image\nsegmentation, providing promising directions for future research."}, "http://arxiv.org/abs/2312.02177": {"title": "Entropy generating function for past lifetime and its properties", "link": "http://arxiv.org/abs/2312.02177", "description": "The past entropy is considered an uncertainty measure for the past\nlifetime distribution. The generating function approach to entropy has become popular\nin recent times, as it generates several well-known entropy measures. In this\npaper, we introduce the past entropy-generating function. We study certain\nproperties of this measure. It is shown that the past entropy-generating\nfunction uniquely determines the distribution. Further, we present\ncharacterizations for some lifetime models using the relationship between\nreliability concepts and the past entropy-generating function."}, "http://arxiv.org/abs/2312.02404": {"title": "Addressing Unmeasured Confounders in Cox Proportional Hazards Models Using Nonparametric Bayesian Approaches", "link": "http://arxiv.org/abs/2312.02404", "description": "In observational studies, unmeasured confounders present a crucial challenge\nin accurately estimating desired causal effects. To calculate the hazard ratio\n(HR) in Cox proportional hazards models, which are relevant for time-to-event\noutcomes, methods such as Two-Stage Residual Inclusion and Limited Information\nMaximum Likelihood are typically employed. However, these methods raise\nconcerns, including the potential for biased HR estimates and issues with\nparameter identification. This manuscript introduces a novel nonparametric\nBayesian method designed to estimate an unbiased HR, addressing concerns\nrelated to parameter identification. 
Our proposed method consists of two\nphases: 1) detecting clusters based on the likelihood of the exposure variable,\nand 2) estimating the hazard ratio within each cluster. Although it is\nimplicitly assumed that unmeasured confounders affect outcomes through cluster\neffects, our algorithm is well-suited for such data structures."}, "http://arxiv.org/abs/2312.02482": {"title": "Treatment heterogeneity with right-censored outcomes using grf", "link": "http://arxiv.org/abs/2312.02482", "description": "This article walks through how to estimate conditional average treatment\neffects (CATEs) with right-censored time-to-event outcomes using the function\ncausal_survival_forest (Cui et al., 2023) in the R package grf (Athey et al.,\n2019; Tibshirani et al., 2023)."}, "http://arxiv.org/abs/2312.02513": {"title": "Asymptotic Theory of the Best-Choice Rerandomization using the Mahalanobis Distance", "link": "http://arxiv.org/abs/2312.02513", "description": "Rerandomization, a design that utilizes pretreatment covariates and improves\ntheir balance between different treatment groups, has received attention\nrecently in both theory and practice. There are at least two types of\nrerandomization that are used in practice: the first rerandomizes the treatment\nassignment until covariate imbalance is below a prespecified threshold; the\nsecond randomizes the treatment assignment multiple times and chooses the one\nwith the best covariate balance. In this paper, we will consider the second type\nof rerandomization, namely the best-choice rerandomization, whose theory and\ninference are still lacking in the literature. In particular, we will focus on\nthe best-choice rerandomization that uses the Mahalanobis distance to measure\ncovariate imbalance, which is one of the most commonly used imbalance measures\nfor multivariate covariates and is invariant to affine transformations of\ncovariates. We will study the large-sample repeated sampling properties of\nthe best-choice rerandomization, allowing both the number of covariates and the\nnumber of tried complete randomizations to increase with the sample size. We\nshow that the asymptotic distribution of the difference-in-means estimator is\nmore concentrated around the true average treatment effect under\nrerandomization than under the complete randomization, and propose large-sample\naccurate confidence intervals for rerandomization that are shorter than those\nfor the completely randomized experiment. We further demonstrate that, with\na moderate number of covariates and with the number of tried randomizations\nincreasing polynomially with the sample size, the best-choice rerandomization\ncan achieve the ideally optimal precision that one can expect even with\nperfectly balanced covariates. The developed theory and methods for\nrerandomization are also illustrated using real field experiments."}, "http://arxiv.org/abs/2312.02518": {"title": "The general linear hypothesis testing problem for multivariate functional data with applications", "link": "http://arxiv.org/abs/2312.02518", "description": "As technology continues to advance at a rapid pace, the prevalence of\nmultivariate functional data (MFD) has expanded across diverse disciplines,\nspanning biology, climatology, finance, and numerous other fields of study.\nAlthough MFD are encountered in various fields, the development of methods for\nhypotheses on mean functions, especially the general linear hypothesis testing\n(GLHT) problem for such data, has been limited. 
In this study, we propose and\nstudy a new global test for the GLHT problem for MFD, which includes the\none-way FMANOVA, post hoc, and contrast analysis as special cases. The\nasymptotic null distribution of the test statistic is shown to be a\nchi-squared-type mixture dependent on the eigenvalues of the heteroscedastic\ncovariance functions. The distribution of the chi-squared-type mixture can be\nwell approximated by a three-cumulant matched chi-squared approximation with\nits approximation parameters estimated from the data. By incorporating an\nadjustment coefficient, the proposed test performs effectively irrespective of\nthe correlation structure in the functional data, even when dealing with a\nrelatively small sample size. Additionally, the proposed test is shown to be\nroot-n consistent, that is, it has a nontrivial power against a local\nalternative. Simulation studies and a real data example demonstrate\nthe finite-sample performance and broad applicability of the proposed test."}, "http://arxiv.org/abs/2312.02591": {"title": "General Spatio-Temporal Factor Models for High-Dimensional Random Fields on a Lattice", "link": "http://arxiv.org/abs/2312.02591", "description": "Motivated by the need for analysing large spatio-temporal panel data, we\nintroduce a novel dimensionality reduction methodology for $n$-dimensional\nrandom fields observed across a number $S$ of spatial locations and $T$ time\nperiods. We call it the General Spatio-Temporal Factor Model (GSTFM). First, we\nprovide the probabilistic and mathematical underpinning needed for the\nrepresentation of a random field as the sum of two components: the common\ncomponent (driven by a small number $q$ of latent factors) and the\nidiosyncratic component (mildly cross-correlated). We show that the two\ncomponents are identified as $n\\to\\infty$. Second, we propose an estimator of\nthe common component and derive its statistical guarantees (consistency and\nrate of convergence) as $\\min(n, S, T)\\to\\infty$. Third, we propose an\ninformation criterion to determine the number of factors. Estimation makes use\nof Fourier analysis in the frequency domain and thus we fully exploit the\ninformation on the spatio-temporal covariance structure of the whole panel.\nSynthetic data examples illustrate the applicability of GSTFM and its\nadvantages over the extant generalized dynamic factor model that ignores the\nspatial correlations."}, "http://arxiv.org/abs/2312.02717": {"title": "A Graphical Approach to Treatment Effect Estimation with Observational Network Data", "link": "http://arxiv.org/abs/2312.02717", "description": "We propose an easy-to-use adjustment estimator for the effect of a treatment\nbased on observational data from a single (social) network of units. The\napproach allows for interactions among units within the network, called\ninterference, and for observed confounding. We define a simplified causal graph\nthat does not differentiate between units, called the generic graph. Using valid\nadjustment sets determined in the generic graph, we can identify the treatment\neffect and build a corresponding estimator. We establish the estimator's\nconsistency and its convergence to a Gaussian limiting distribution at the\nparametric rate under certain regularity conditions that restrict the growth of\ndependencies among units. 
We empirically verify the theoretical properties of\nour estimator through a simulation study and apply it to estimate the effect of\na strict facial-mask policy on the spread of COVID-19 in Switzerland."}, "http://arxiv.org/abs/2312.02807": {"title": "Online Change Detection in SAR Time-Series with Kronecker Product Structured Scaled Gaussian Models", "link": "http://arxiv.org/abs/2312.02807", "description": "We develop the information geometry of scaled Gaussian distributions for\nwhich the covariance matrix exhibits a Kronecker product structure. This model\nand its geometry are then used to propose an online change detection (CD)\nalgorithm for multivariate image time series (MITS). The proposed approach\nrelies mainly on the online estimation of the structured covariance matrix\nunder the null hypothesis, which is performed through a recursive (natural)\nRiemannian gradient descent. This approach is of practical interest\ncompared to the corresponding offline version, as its computational cost\nremains constant for each new image added to the time series. Simulations show\nthat the proposed recursive estimators reach the Intrinsic Cram\\'er-Rao bound.\nThe interest of the proposed online CD approach is demonstrated on both\nsimulated and real data."}, "http://arxiv.org/abs/2312.02850": {"title": "A Kernel-Based Neural Network Test for High-dimensional Sequencing Data Analysis", "link": "http://arxiv.org/abs/2312.02850", "description": "The recent development of artificial intelligence (AI) technology, especially\nthe advance of deep neural network (DNN) technology, has revolutionized many\nfields. While DNN plays a central role in modern AI technology, it has rarely\nbeen used in sequencing data analysis due to challenges brought by\nhigh-dimensional sequencing data (e.g., overfitting). Moreover, due to the\ncomplexity of neural networks and their unknown limiting distributions,\nbuilding association tests on neural networks for genetic association analysis\nremains a great challenge. To address these challenges and fill the important\ngap of using AI in high-dimensional sequencing data analysis, we introduce a\nnew kernel-based neural network (KNN) test for complex association analysis of\nsequencing data. The test is built on our previously developed KNN framework,\nwhich uses random effects to model the overall effects of high-dimensional\ngenetic data and adopts kernel-based neural network structures to model complex\ngenotype-phenotype relationships. Based on KNN, a Wald-type test is then\nintroduced to evaluate the joint association of high-dimensional genetic data\nwith a disease phenotype of interest, considering non-linear and non-additive\neffects (e.g., interaction effects). Through simulations, we demonstrate that\nour proposed method attains higher power compared to the sequence kernel\nassociation test (SKAT), especially in the presence of non-linear and\ninteraction effects. Finally, we apply the methods to the whole genome\nsequencing (WGS) dataset from the Alzheimer's Disease Neuroimaging Initiative\n(ADNI) study, investigating new genes associated with the hippocampal volume\nchange over time."}, "http://arxiv.org/abs/2312.02858": {"title": "Towards Causal Representations of Climate Model Data", "link": "http://arxiv.org/abs/2312.02858", "description": "Climate models, such as Earth system models (ESMs), are crucial for\nsimulating future climate change based on projected Shared Socioeconomic\nPathways (SSP) greenhouse gas emissions scenarios. 
While ESMs are sophisticated\nand invaluable, machine learning-based emulators trained on existing simulation\ndata can project additional climate scenarios much faster and are\ncomputationally efficient. However, they often lack generalizability and\ninterpretability. This work delves into the potential of causal representation\nlearning, specifically the \\emph{Causal Discovery with Single-parent Decoding}\n(CDSD) method, which could render climate model emulation efficient\n\\textit{and} interpretable. We evaluate CDSD on multiple climate datasets,\nfocusing on emissions, temperature, and precipitation. Our findings shed light\non the challenges, limitations, and promise of using CDSD as a stepping stone\ntowards more interpretable and robust climate model emulation."}, "http://arxiv.org/abs/2312.02860": {"title": "Spectral Deconfounding for High-Dimensional Sparse Additive Models", "link": "http://arxiv.org/abs/2312.02860", "description": "Many high-dimensional data sets suffer from hidden confounding. When hidden\nconfounders affect both the predictors and the response in a high-dimensional\nregression problem, standard methods lead to biased estimates. This paper\nsubstantially extends previous work on spectral deconfounding for\nhigh-dimensional linear models to the nonlinear setting and with this,\nestablishes a proof of concept that spectral deconfounding is valid for general\nnonlinear models. Concretely, we propose an algorithm to estimate\nhigh-dimensional additive models in the presence of hidden dense confounding:\narguably, this is a simple yet practically useful nonlinear scope. We prove\nconsistency and convergence rates for our method and evaluate it on synthetic\ndata and a genetic data set."}, "http://arxiv.org/abs/2312.02867": {"title": "Semi-Supervised Health Index Monitoring with Feature Generation and Fusion", "link": "http://arxiv.org/abs/2312.02867", "description": "The Health Index (HI) is crucial for evaluating system health, aiding tasks\nlike anomaly detection and predicting remaining useful life for systems\ndemanding high safety and reliability. Tight monitoring is crucial for\nachieving high precision at a lower cost, with applications such as spray\ncoating. Obtaining HI labels in real-world applications is often\ncost-prohibitive, requiring continuous, precise health measurements. Therefore,\nit is more convenient to leverage run-to-failure datasets that may provide\npotential indications of machine wear condition, making it necessary to apply\nsemi-supervised tools for HI construction. In this study, we adapt the Deep\nSemi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use\nthe DeepSAD embedding as condition indicators to address interpretability\nchallenges and sensitivity to system-specific factors. Then, we introduce a\ndiversity loss to enrich condition indicators. We employ an alternating\nprojection algorithm with isotonic constraints to transform the DeepSAD\nembedding into a normalized HI with an increasing trend. Validation on the PHME\n2010 milling dataset, a recognized benchmark with ground-truth HIs, demonstrates\nmeaningful HI estimation. Our methodology is then applied to monitor wear\nstates of thermal spray coatings using high-frequency voltage. 
Our\ncontributions create opportunities for more accessible and reliable HI\nestimation, particularly in cases where obtaining ground truth HI labels is\nunfeasible."}, "http://arxiv.org/abs/2312.02870": {"title": "Replica analysis of overfitting in regression models for time to event data: the impact of censoring", "link": "http://arxiv.org/abs/2312.02870", "description": "We use statistical mechanics techniques, viz. the replica method, to model\nthe effect of censoring on overfitting in Cox's proportional hazards model, the\ndominant regression method for time-to-event data. In the overfitting regime,\nMaximum Likelihood parameter estimators are known to be biased already for\nsmall values of the ratio of the number of covariates over the number of\nsamples. The inclusion of censoring was avoided in previous overfitting\nanalyses for mathematical convenience, but is vital to make any theory\napplicable to real-world medical data, where censoring is ubiquitous. Upon\nconstructing efficient algorithms for solving the new (and more complex) RS\nequations and comparing the solutions with numerical simulation data, we find\nexcellent agreement, even for large censoring rates. We then address the\npractical problem of using the theory to correct the biased ML estimators\n{without} knowledge of the data-generating distribution. This is achieved via a\nnovel numerical algorithm that self-consistently approximates all relevant\nparameters of the data generating distribution while simultaneously solving the\nRS equations. We investigate numerically the statistics of the corrected\nestimators, and show that the proposed new algorithm indeed succeeds in\nremoving the bias of the ML estimators, for both the association parameters and\nfor the cumulative hazard."}, "http://arxiv.org/abs/2312.02905": {"title": "E-values, Multiple Testing and Beyond", "link": "http://arxiv.org/abs/2312.02905", "description": "We discover a connection between the Benjamini-Hochberg (BH) procedure and\nthe recently proposed e-BH procedure [Wang and Ramdas, 2022] with a suitably\ndefined set of e-values. This insight extends to a generalized version of the\nBH procedure and the model-free multiple testing procedure in Barber and\nCand\\`es [2015] (BC) with a general form of rejection rules. The connection\nprovides an effective way of developing new multiple testing procedures by\naggregating or assembling e-values resulting from the BH and BC procedures and\ntheir use in different subsets of the data. In particular, we propose new\nmultiple testing methodologies in three applications, including a hybrid\napproach that integrates the BH and BC procedures, a multiple testing procedure\naimed at ensuring a new notion of fairness by controlling both the group-wise\nand overall false discovery rates (FDR), and a structure adaptive multiple\ntesting procedure that can incorporate external covariate information to boost\ndetection power. One notable feature of the proposed methods is that we use a\ndata-dependent approach for assigning weights to e-values, significantly\nenhancing the efficiency of the resulting e-BH procedure. The construction of\nthe weights is non-trivial and is motivated by the leave-one-out analysis for\nthe BH and BC procedures. In theory, we prove that the proposed e-BH procedures\nwith data-dependent weights in the three applications ensure finite sample FDR\ncontrol. 
Furthermore, we demonstrate the efficiency of the proposed methods\nthrough numerical studies in the three applications."}, "http://arxiv.org/abs/2112.06000": {"title": "Multiply robust estimators in longitudinal studies with missing data under control-based imputation", "link": "http://arxiv.org/abs/2112.06000", "description": "Longitudinal studies are often subject to missing data. The ICH E9(R1)\naddendum addresses the importance of defining a treatment effect estimand with\nthe consideration of intercurrent events. Jump-to-reference (J2R) is one\nclassically envisioned control-based scenario for the treatment effect\nevaluation using the hypothetical strategy, where the participants in the\ntreatment group after intercurrent events are assumed to have the same disease\nprogress as those with identical covariates in the control group. We establish\nnew estimators to assess the average treatment effect based on a proposed\npotential outcomes framework under J2R. Various identification formulas are\nconstructed under the assumptions addressed by J2R, motivating estimators that\nrely on different parts of the observed data distribution. Moreover, we obtain\na novel estimator inspired by the efficient influence function, with multiple\nrobustness in the sense that it achieves $n^{1/2}$-consistency if any pairs of\nmultiple nuisance functions are correctly specified, or if the nuisance\nfunctions converge at a rate not slower than $n^{-1/4}$ when using flexible\nmodeling approaches. The finite-sample performance of the proposed estimators\nis validated in simulation studies and an antidepressant clinical trial."}, "http://arxiv.org/abs/2207.13480": {"title": "On Selecting and Conditioning in Multiple Testing and Selective Inference", "link": "http://arxiv.org/abs/2207.13480", "description": "We investigate a class of methods for selective inference that condition on a\nselection event. Such methods follow a two-stage process. First, a data-driven\n(sub)collection of hypotheses is chosen from some large universe of hypotheses.\nSubsequently, inference takes place within this data-driven collection,\nconditioned on the information that was used for the selection. Examples of\nsuch methods include basic data splitting, as well as modern data carving\nmethods and post-selection inference methods for lasso coefficients based on\nthe polyhedral lemma. In this paper, we adopt a holistic view on such methods,\nconsidering the selection, conditioning, and final error control steps together\nas a single method. From this perspective, we demonstrate that multiple testing\nmethods defined directly on the full universe of hypotheses are always at least\nas powerful as selective inference methods based on selection and conditioning.\nThis result holds true even when the universe is potentially infinite and only\nimplicitly defined, such as in the case of data splitting. 
We provide a\ncomprehensive theoretical framework, along with insights, and delve into\nseveral case studies to illustrate instances where a shift to a non-selective\nor unconditional perspective can yield a power gain."}, "http://arxiv.org/abs/2306.11302": {"title": "A Two-Stage Bayesian Small Area Estimation Approach for Proportions", "link": "http://arxiv.org/abs/2306.11302", "description": "With the rise in popularity of digital atlases to communicate spatial\nvariation, there is an increasing need for robust small-area estimates.\nHowever, current small-area estimation methods suffer from various modeling\nproblems when data are very sparse or when estimates are required for areas\nwith very small populations. These issues are particularly heightened when\nmodeling proportions. Additionally, recent work has shown significant benefits\nin modeling at both the individual and area levels. We propose a two-stage\nBayesian hierarchical small area estimation approach for proportions that can:\naccount for survey design; reduce direct estimate instability; and generate\nprevalence estimates for small areas with no survey data. Using a simulation\nstudy we show that, compared with existing Bayesian small area estimation\nmethods, our approach can provide optimal predictive performance (Bayesian mean\nrelative root mean squared error, mean absolute relative bias and coverage) of\nproportions under a variety of data conditions, including very sparse and\nunstable data. To assess the model in practice, we compare modeled estimates of\ncurrent smoking prevalence for 1,630 small areas in Australia using the\n2017-2018 National Health Survey data combined with 2016 census data."}, "http://arxiv.org/abs/2308.04368": {"title": "Multiple Testing of Local Extrema for Detection of Structural Breaks in Piecewise Linear Models", "link": "http://arxiv.org/abs/2308.04368", "description": "In this paper, we propose a new generic method for detecting the number and\nlocations of structural breaks or change points in piecewise linear models\nunder stationary Gaussian noise. Our method transforms the change point\ndetection problem into identifying local extrema (local maxima and local\nminima) through kernel smoothing and differentiation of the data sequence. By\ncomputing p-values for all local extrema based on peak height distributions of\nsmooth Gaussian processes, we utilize the Benjamini-Hochberg procedure to\nidentify significant local extrema as the detected change points. Our method\ncan distinguish between two types of change points: continuous breaks (Type I)\nand jumps (Type II). We study three scenarios of piecewise linear signals,\nnamely pure Type I, pure Type II and a mixture of Type I and Type II change\npoints. The results demonstrate that our proposed method ensures asymptotic\ncontrol of the False Discovery Rate (FDR) and power consistency, as sequence\nlength, slope changes, and jump size increase. Furthermore, compared to\ntraditional change point detection methods based on recursive segmentation, our\napproach only requires a single test for all candidate local extrema, thereby\nachieving the smallest computational complexity, proportional to the data\nsequence length. Additionally, numerical studies illustrate that our method\nmaintains FDR control and power consistency, even in non-asymptotic cases when\nthe size of slope changes or jumps is not large. 
We have implemented our method\nin the R package \"dSTEM\" (available from\nhttps://cran.r-project.org/web/packages/dSTEM)."}, "http://arxiv.org/abs/2308.05484": {"title": "Filtering Dynamical Systems Using Observations of Statistics", "link": "http://arxiv.org/abs/2308.05484", "description": "We consider the problem of filtering dynamical systems, possibly stochastic,\nusing observations of statistics. Thus the computational task is to estimate a\ntime-evolving density $\\rho(v, t)$ given noisy observations of the true density\n$\\rho^\\dagger$; this contrasts with the standard filtering problem based on\nobservations of the state $v$. The task is naturally formulated as an\ninfinite-dimensional filtering problem in the space of densities $\\rho$.\nHowever, for the purposes of tractability, we seek algorithms in state space;\nspecifically we introduce a mean field state space model and, using interacting\nparticle system approximations to this model, we propose an ensemble method. We\nrefer to the resulting methodology as the ensemble Fokker-Planck filter\n(EnFPF).\n\nUnder certain restrictive assumptions we show that the EnFPF approximates the\nKalman-Bucy filter for the Fokker-Planck equation, which is the exact solution\nof the infinite-dimensional filtering problem; our numerical experiments show\nthat the methodology is useful beyond this restrictive setting. Specifically\nthe experiments show that the EnFPF is able to correct ensemble statistics, to\naccelerate convergence to the invariant density for autonomous systems, and to\naccelerate convergence to time-dependent invariant densities for non-autonomous\nsystems. We discuss possible applications of the EnFPF to climate ensembles and\nto turbulence modelling."}, "http://arxiv.org/abs/2311.05649": {"title": "Bayesian Image-on-Image Regression via Deep Kernel Learning based Gaussian Processes", "link": "http://arxiv.org/abs/2311.05649", "description": "In neuroimaging studies, it becomes increasingly important to study\nassociations between different imaging modalities using image-on-image\nregression (IIR), which faces challenges in interpretation, statistical\ninference, and prediction. Our motivating problem is how to predict task-evoked\nfMRI activity using resting-state fMRI data in the Human Connectome Project\n(HCP). The main difficulty lies in effectively combining different types of\nimaging predictors with varying resolutions and spatial domains in IIR. To\naddress these issues, we develop Bayesian Image-on-image Regression via Deep\nKernel Learning Gaussian Processes (BIRD-GP) and develop efficient posterior\ncomputation methods through Stein variational gradient descent. We demonstrate\nthe advantages of BIRD-GP over state-of-the-art IIR methods using simulations.\nFor HCP data analysis using BIRD-GP, we combine the voxel-wise fALFF maps and\nregion-wise connectivity matrices to predict fMRI contrast maps for language\nand social recognition tasks. We show that fALFF is less predictive than the\nconnectivity matrix for both tasks, but combining both yields improved results.\nAngular Gyrus Right emerges as the most predictable region for the language\ntask (75.9% predictable voxels), while Superior Parietal Gyrus Right tops for\nthe social recognition task (48.9% predictable voxels). 
Additionally, we\nidentify features from the resting-state fMRI data that are important for task\nfMRI prediction."}, "http://arxiv.org/abs/2312.03139": {"title": "A Bayesian Skew-heavy-tailed modelling for loss reserving", "link": "http://arxiv.org/abs/2312.03139", "description": "This paper focuses on modelling loss reserving to pay outstanding claims. As\nthe amount liable on any given claim is not known until settlement, we propose\na flexible model via heavy-tailed and skewed distributions to deal with\noutstanding liabilities. The inference relies on Markov chain Monte Carlo via\nGibbs sampler with adaptive Metropolis algorithm steps allowing for fast\ncomputations and providing efficient algorithms. An illustrative example\nemulates a typical dataset based on a runoff triangle and investigates the\nproperties of the proposed models. Also, a case study is considered and shows\nthat the proposed model outperforms the usual loss reserving models well\nestablished in the literature in the presence of skewness and heavy tails."}, "http://arxiv.org/abs/2312.03192": {"title": "Modeling Structure and Country-specific Heterogeneity in Misclassification Matrices of Verbal Autopsy-based Cause of Death Classifiers", "link": "http://arxiv.org/abs/2312.03192", "description": "Verbal autopsy (VA) algorithms are routinely used to determine\nindividual-level causes of death (COD) in many low-and-middle-income countries,\nwhich are then aggregated to derive population-level cause-specific mortality\nfractions (CSMF), essential to informing public health policies. However, VA\nalgorithms frequently misclassify COD and introduce bias in CSMF estimates. A\nrecent method, VA-calibration, can correct for this bias using a VA\nmisclassification matrix estimated from paired data on COD from both VA and\nminimally invasive tissue sampling (MITS) from the Child Health and Mortality\nPrevention Surveillance (CHAMPS) Network. Due to the limited sample size,\nCHAMPS data are pooled across all countries, implicitly assuming that the\nmisclassification rates are homogeneous.\n\nIn this research, we show that the VA misclassification matrices are\nsubstantially heterogeneous across countries, thereby biasing the\nVA-calibration. We develop a coherent framework for modeling country-specific\nmisclassification matrices in data-scarce settings. We first introduce a novel\nbase model based on two latent mechanisms: intrinsic accuracy and systematic\npreference to parsimoniously characterize misclassifications. We prove that\nthey are identifiable from the data and manifest as a form of invariance in\ncertain misclassification odds, a pattern evident in the CHAMPS data. Then we\nexpand from this base model, adding higher complexity and country-specific\nheterogeneity via interpretable effect sizes. Shrinkage priors balance the\nbias-variance tradeoff by adaptively favoring simpler models. We publish\nuncertainty-quantified estimates of VA misclassification rates for 6 countries.\nThis effort broadens VA-calibration's future applicability and strengthens\nongoing efforts of using VA for mortality surveillance."}, "http://arxiv.org/abs/2312.03254": {"title": "Efficiency of Terrestrial Laser Scanning in Survey Works: Assessment, Modelling, and Monitoring", "link": "http://arxiv.org/abs/2312.03254", "description": "Nowadays, static, mobile, terrestrial, and airborne laser scanning\ntechnologies have become familiar data sources for engineering work, especially\nin the area of land surveying. 
The diversity of Light Detection and Ranging\n(LiDAR) data applications, together with their accuracy, high point density, and\nfast 3D data processing, allows laser scanning to occupy an\nadvanced position among other spatial data acquisition technologies. Moreover,\nunmanned aerial vehicles drive progress in airborne scanning by easing\nthe complexities of flight. However, before employing the laser\nscanning technique, it is essential to assess the accuracy of the scanner\nbeing used under different circumstances. The key to success is\nthe correct selection of suitable scanning tools for the project. In this\npaper, terrestrial LiDAR data are tested and used for several laser scanning\nprojects with diverse goals and typologies, e.g., road deformation monitoring,\nbuilding facade modelling, road modelling, and stockpile modelling and volume\nmeasurement. The accuracy of direct measurement on the LiDAR point cloud is\nestimated at 4 mm, which may open the door widely for LiDAR data to play an\nessential role in survey work applications."}, "http://arxiv.org/abs/2312.03257": {"title": "Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes", "link": "http://arxiv.org/abs/2312.03257", "description": "Untargeted metabolomics based on liquid chromatography-mass spectrometry\ntechnology is quickly gaining widespread application given its ability to\ndepict the global metabolic pattern in biological samples. However, the data are\nnoisy and plagued by the lack of clear identities for the data features measured from\nsamples. Multiple potential matchings exist between data features and known\nmetabolites, while the truth can only be one-to-one matches. Some existing\nmethods attempt to reduce the matching uncertainty, but are far from being able\nto remove the uncertainty for most features. The existence of the uncertainty\ncauses major difficulty in downstream functional analysis. To address these\nissues, we develop a novel approach for Bayesian Analysis of Untargeted\nMetabolomics data (BAUM) to integrate previously separate tasks into a single\nframework, including matching uncertainty inference, metabolite selection, and\nfunctional analysis. By incorporating the knowledge graph between variables and\nusing relatively simple assumptions, BAUM can analyze datasets with small\nsample sizes. By allowing different confidence levels of feature-metabolite\nmatching, the method is applicable to datasets in which feature identities are\npartially known. Simulation studies demonstrate that, compared with other\nexisting methods, BAUM achieves better accuracy in selecting important\nmetabolites that tend to be functionally consistent and assigning confidence\nscores to feature-metabolite matches. We analyze a COVID-19 metabolomics\ndataset and a mouse brain metabolomics dataset using BAUM. Even with a very\nsmall sample size of 16 mice per group, BAUM is robust and stable. It finds\npathways that conform to existing knowledge, as well as novel pathways that are\nbiologically plausible."}, "http://arxiv.org/abs/2312.03268": {"title": "Design-based inference for generalized network experiments with stochastic interventions", "link": "http://arxiv.org/abs/2312.03268", "description": "A growing number of scholars and data scientists are conducting randomized\nexperiments to analyze causal relationships in network settings where units\ninfluence one another. 
A dominant methodology for analyzing these network\nexperiments has been design-based, leveraging randomization of treatment\nassignment as the basis for inference. In this paper, we generalize this\ndesign-based approach so that it can be applied to more complex experiments\nwith a variety of causal estimands with different target populations. An\nimportant special case of such generalized network experiments is a bipartite\nnetwork experiment, in which the treatment assignment is randomized among one\nset of units and the outcome is measured for a separate set of units. We\npropose a broad class of causal estimands based on stochastic intervention for\ngeneralized network experiments. Using a design-based approach, we show how to\nestimate the proposed causal quantities without bias, and develop conservative\nvariance estimators. We apply our methodology to a randomized experiment in\neducation where a group of selected students in middle schools are eligible for\nthe anti-conflict promotion program, and the program participation is\nrandomized within this group. In particular, our analysis estimates the causal\neffects of treating each student or his/her close friends, for different target\npopulations in the network. We find that while the treatment improves the\noverall awareness against conflict among students, it does not significantly\nreduce the total number of conflicts."}, "http://arxiv.org/abs/2312.03274": {"title": "Empirical Bayes Covariance Decomposition, and a solution to the Multiple Tuning Problem in Sparse PCA", "link": "http://arxiv.org/abs/2312.03274", "description": "Sparse Principal Components Analysis (PCA) has been proposed as a way to\nimprove both interpretability and reliability of PCA. However, use of sparse\nPCA in practice is hindered by the difficulty of tuning the multiple\nhyperparameters that control the sparsity of different PCs (the \"multiple\ntuning problem\", MTP). Here we present a solution to the MTP using Empirical\nBayes methods. We first introduce a general formulation for penalized PCA of a\ndata matrix $\\mathbf{X}$, which includes some existing sparse PCA methods as\nspecial cases. We show that this formulation also leads to a penalized\ndecomposition of the covariance (or Gram) matrix, $\\mathbf{X}^T\\mathbf{X}$. We\nintroduce empirical Bayes versions of these penalized problems, in which the\npenalties are determined by prior distributions that are estimated from the\ndata by maximum likelihood rather than cross-validation. The resulting\n\"Empirical Bayes Covariance Decomposition\" provides a principled and efficient\nsolution to the MTP in sparse PCA, and one that can be immediately extended to\nincorporate other structural assumptions (e.g. non-negative PCA). We illustrate\nthe effectiveness of this approach on both simulated and real data examples."}, "http://arxiv.org/abs/2312.03538": {"title": "Bayesian variable selection in sample selection models using spike-and-slab priors", "link": "http://arxiv.org/abs/2312.03538", "description": "Sample selection models represent a common methodology for correcting bias\ninduced by data missing not at random. It is well known that these models are\nnot empirically identifiable without exclusion restrictions. In other words,\nsome variables predictive of missingness do not affect the outcome model of\ninterest. The drive to establish this requirement often leads to the inclusion\nof irrelevant variables in the model. 
A recent proposal uses adaptive LASSO to\ncircumvent this problem, but its performance depends on the so-called\ncovariance assumption, which can be violated in small to moderate samples.\nAdditionally, there are no tools yet for post-selection inference for this\nmodel. To address these challenges, we propose two families of spike-and-slab\npriors to conduct Bayesian variable selection in sample selection models. These\nprior structures allow for constructing a Gibbs sampler with tractable\nconditionals, which is scalable to the dimensions of practical interest. We\nillustrate the performance of the proposed methodology through a simulation\nstudy and present a comparison against adaptive LASSO and stepwise selection.\nWe also provide two applications using publicly available real data. An\nimplementation and code to reproduce the results in this paper can be found at\nhttps://github.com/adam-iqbal/selection-spike-slab"}, "http://arxiv.org/abs/2312.03561": {"title": "Blueprinting the Future: Automatic Item Categorization using Hierarchical Zero-Shot and Few-Shot Classifiers", "link": "http://arxiv.org/abs/2312.03561", "description": "In testing industry, precise item categorization is pivotal to align exam\nquestions with the designated content domains outlined in the assessment\nblueprint. Traditional methods either entail manual classification, which is\nlaborious and error-prone, or utilize machine learning requiring extensive\ntraining data, often leading to model underfit or overfit issues. This study\nunveils a novel approach employing the zero-shot and few-shot Generative\nPretrained Transformer (GPT) classifier for hierarchical item categorization,\nminimizing the necessity for training data, and instead, leveraging human-like\nlanguage descriptions to define categories. Through a structured python\ndictionary, the hierarchical nature of examination blueprints is navigated\nseamlessly, allowing for a tiered classification of items across multiple\nlevels. An initial simulation with artificial data demonstrates the efficacy of\nthis method, achieving an average accuracy of 92.91% measured by the F1 score.\nThis method was further applied to real exam items from the 2022 In-Training\nExamination (ITE) conducted by the American Board of Family Medicine (ABFM),\nreclassifying 200 items according to a newly formulated blueprint swiftly in 15\nminutes, a task that traditionally could span several days among editors and\nphysicians. This innovative approach not only drastically cuts down\nclassification time but also ensures a consistent, principle-driven\ncategorization, minimizing human biases and discrepancies. The ability to\nrefine classifications by adjusting definitions adds to its robustness and\nsustainability."}, "http://arxiv.org/abs/2312.03643": {"title": "Propagating moments in probabilistic graphical models for decision support systems", "link": "http://arxiv.org/abs/2312.03643", "description": "Probabilistic graphical models are widely used to model complex systems with\nuncertainty. Traditionally, Gaussian directed graphical models are applied for\nanalysis of large networks with continuous variables since they can provide\nconditional and marginal distributions in closed form simplifying the\ninferential task. 
The Gaussianity and linearity assumptions are often adequate,\nyet can lead to poor performance when dealing with some practical applications.\nIn this paper, we model each variable in graph G as a polynomial regression of\nits parents to capture complex relationships between individual variables and\nwith utility function of polynomial form. Since the marginal posterior\ndistributions of individual variables can become analytically intractable, we\ndevelop a message-passing algorithm to propagate information throughout the\nnetwork solely using moments which enables the expected utility scores to be\ncalculated exactly. We illustrate how the proposed methodology works in a\ndecision problem in energy systems planning."}, "http://arxiv.org/abs/2112.05623": {"title": "Smooth test for equality of copulas", "link": "http://arxiv.org/abs/2112.05623", "description": "A smooth test to simultaneously compare $K$ copulas, where $K \\geq 2$ is\nproposed. The $K$ observed populations can be paired, and the test statistic is\nconstructed based on the differences between moment sequences, called copula\ncoefficients. These coefficients characterize the copulas, even when the copula\ndensities may not exist. The procedure employs a two-step data-driven\nprocedure. In the initial step, the most significantly different coefficients\nare selected for all pairs of populations. The subsequent step utilizes these\ncoefficients to identify populations that exhibit significant differences. To\ndemonstrate the effectiveness of the method, we provide illustrations through\nnumerical studies and application to two real datasets."}, "http://arxiv.org/abs/2112.12909": {"title": "Optimal Variable Clustering for High-Dimensional Matrix Valued Data", "link": "http://arxiv.org/abs/2112.12909", "description": "Matrix valued data has become increasingly prevalent in many applications.\nMost of the existing clustering methods for this type of data are tailored to\nthe mean model and do not account for the dependence structure of the features,\nwhich can be very informative, especially in high-dimensional settings or when\nmean information is not available. To extract the information from the\ndependence structure for clustering, we propose a new latent variable model for\nthe features arranged in matrix form, with some unknown membership matrices\nrepresenting the clusters for the rows and columns. Under this model, we\nfurther propose a class of hierarchical clustering algorithms using the\ndifference of a weighted covariance matrix as the dissimilarity measure.\nTheoretically, we show that under mild conditions, our algorithm attains\nclustering consistency in the high-dimensional setting. While this consistency\nresult holds for our algorithm with a broad class of weighted covariance\nmatrices, the conditions for this result depend on the choice of the weight. To\ninvestigate how the weight affects the theoretical performance of our\nalgorithm, we establish the minimax lower bound for clustering under our latent\nvariable model in terms of some cluster separation metric. Given these results,\nwe identify the optimal weight in the sense that using this weight guarantees\nour algorithm to be minimax rate-optimal. The practical implementation of our\nalgorithm with the optimal weight is also discussed. Simulation studies show\nthat our algorithm performs better than existing methods in terms of the\nadjusted Rand index (ARI). 
The method is applied to a genomic dataset and\nyields meaningful interpretations."}, "http://arxiv.org/abs/2302.09392": {"title": "Extended Excess Hazard Models for Spatially Dependent Survival Data", "link": "http://arxiv.org/abs/2302.09392", "description": "Relative survival represents the preferred framework for the analysis of\npopulation cancer survival data. The aim is to model the survival probability\nassociated to cancer in the absence of information about the cause of death.\nRecent data linkage developments have allowed for incorporating the place of\nresidence into the population cancer data bases; however, modeling this spatial\ninformation has received little attention in the relative survival setting. We\npropose a flexible parametric class of spatial excess hazard models (along with\ninference tools), named \"Relative Survival Spatial General Hazard\" (RS-SGH),\nthat allows for the inclusion of fixed and spatial effects in both time-level\nand hazard-level components. We illustrate the performance of the proposed\nmodel using an extensive simulation study, and provide guidelines about the\ninterplay of sample size, censoring, and model misspecification. We present a\ncase study using real data from colon cancer patients in England. This case\nstudy illustrates how a spatial model can be used to identify geographical\nareas with low cancer survival, as well as how to summarize such a model\nthrough marginal survival quantities and spatial effects."}, "http://arxiv.org/abs/2312.03857": {"title": "Population Monte Carlo with Normalizing Flow", "link": "http://arxiv.org/abs/2312.03857", "description": "Adaptive importance sampling (AIS) methods provide a useful alternative to\nMarkov Chain Monte Carlo (MCMC) algorithms for performing inference of\nintractable distributions. Population Monte Carlo (PMC) algorithms constitute a\nfamily of AIS approaches which adapt the proposal distributions iteratively to\nimprove the approximation of the target distribution. Recent work in this area\nprimarily focuses on ameliorating the proposal adaptation procedure for\nhigh-dimensional applications. However, most of the AIS algorithms use simple\nproposal distributions for sampling, which might be inadequate in exploring\ntarget distributions with intricate geometries. In this work, we construct\nexpressive proposal distributions in the AIS framework using normalizing flow,\nan appealing approach for modeling complex distributions. We use an iterative\nparameter update rule to enhance the approximation of the target distribution.\nNumerical experiments show that in high-dimensional settings, the proposed\nalgorithm offers significantly improved performance compared to the existing\ntechniques."}, "http://arxiv.org/abs/2312.03911": {"title": "Improving Gradient-guided Nested Sampling for Posterior Inference", "link": "http://arxiv.org/abs/2312.03911", "description": "We present a performant, general-purpose gradient-guided nested sampling\nalgorithm, ${\\tt GGNS}$, combining the state of the art in differentiable\nprogramming, Hamiltonian slice sampling, clustering, mode separation, dynamic\nnested sampling, and parallelization. This unique combination allows ${\\tt\nGGNS}$ to scale well with dimensionality and perform competitively on a variety\nof synthetic and real-world problems. We also show the potential of combining\nnested sampling with generative flow networks to obtain large amounts of\nhigh-quality samples from the posterior distribution. 
This combination leads to\nfaster mode discovery and more accurate estimates of the partition function."}, "http://arxiv.org/abs/2312.03967": {"title": "Test-negative designs with various reasons for testing: statistical bias and solution", "link": "http://arxiv.org/abs/2312.03967", "description": "Test-negative designs are widely used for post-market evaluation of vaccine\neffectiveness. Different from classical test-negative designs where only\nhealthcare-seekers with symptoms are included, recent test-negative designs\nhave involved individuals with various reasons for testing, especially in an\noutbreak setting. While including these data can increase sample size and hence\nimprove precision, concerns have been raised about whether they will introduce\nbias into the current framework of test-negative designs, thereby demanding a\nformal statistical examination of this modified design. In this article, using\nstatistical derivations, causal graphs, and numerical simulations, we show that\nthe standard odds ratio estimator may be biased if various reasons for testing\nare not accounted for. To eliminate this bias, we identify three categories of\nreasons for testing, including symptoms, disease-unrelated reasons, and case\ncontact tracing, and characterize associated statistical properties and\nestimands. Based on our characterization, we propose stratified estimators that\ncan incorporate multiple reasons for testing to achieve consistent estimation\nand improve precision by maximizing the use of data. The performance of our\nproposed method is demonstrated through simulation studies."}, "http://arxiv.org/abs/2312.04026": {"title": "Independent-Set Design of Experiments for Estimating Treatment and Spillover Effects under Network Interference", "link": "http://arxiv.org/abs/2312.04026", "description": "Interference is ubiquitous when conducting causal experiments over networks.\nExcept for certain network structures, causal inference on the network in the\npresence of interference is difficult due to the entanglement between the\ntreatment assignments and the interference levels. In this article, we conduct\ncausal inference under interference on an observed, sparse but connected\nnetwork, and we propose a novel design of experiments based on an independent\nset. Compared to conventional designs, the independent-set design focuses on an\nindependent subset of data and controls their interference exposures through\nthe assignments to the rest (auxiliary set). We provide a lower bound on the\nsize of the independent set from a greedy algorithm , and justify the\ntheoretical performance of estimators under the proposed design. Our approach\nis capable of estimating both spillover effects and treatment effects. We\njustify its superiority over conventional methods and illustrate the empirical\nperformance through simulations."}, "http://arxiv.org/abs/2312.04064": {"title": "DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design", "link": "http://arxiv.org/abs/2312.04064", "description": "The discovery of therapeutics to treat genetically-driven pathologies relies\non identifying genes involved in the underlying disease mechanisms. Existing\napproaches search over the billions of potential interventions to maximize the\nexpected influence on the target phenotype. However, to reduce the risk of\nfailure in future stages of trials, practical experiment design aims to find a\nset of interventions that maximally change a target phenotype via diverse\nmechanisms. 
We propose DiscoBAX, a sample-efficient method for maximizing the\nrate of significant discoveries per experiment while simultaneously probing for\na wide range of diverse mechanisms during a genomic experiment campaign. We\nprovide theoretical guarantees of approximate optimality under standard\nassumptions, and conduct a comprehensive experimental evaluation covering both\nsynthetic as well as real-world experimental design tasks. DiscoBAX outperforms\nexisting state-of-the-art methods for experimental design, selecting effective\nand diverse perturbations in biological systems."}, "http://arxiv.org/abs/2312.04077": {"title": "When is Plasmode simulation superior to parametric simulation when estimating the MSE of the least squares estimator in linear regression?", "link": "http://arxiv.org/abs/2312.04077", "description": "Simulation is a crucial tool for the evaluation and comparison of statistical\nmethods. How to design fair and neutral simulation studies is therefore of\ngreat interest for both researchers developing new methods and practitioners\nconfronted with the choice of the most suitable method. The term simulation\nusually refers to parametric simulation, that is, computer experiments using\nartificial data made up of pseudo-random numbers. Plasmode simulation, that is,\ncomputer experiments using the combination of resampling feature data from a\nreal-life dataset and generating the target variable with a user-selected\noutcome-generating model (OGM), is an alternative that is often claimed to\nproduce more realistic data. We compare parametric and Plasmode simulation for\nthe example of estimating the mean squared error of the least squares estimator\nin linear regression. If the true underlying data-generating process (DGP) and\nthe OGM were known, parametric simulation would be the best choice in terms of\nestimating the MSE well. However, in reality, both are usually unknown, so\nresearchers have to make assumptions: in Plasmode simulation studies for the\nOGM, in parametric simulation for both DGP and OGM. Most likely, these\nassumptions do not reflect the truth. Here, we aim to find out how assumptions\ndeviating from the true DGP and the true OGM affect the performance of\nparametric simulation and Plasmode simulations in the context of MSE estimation\nfor the least squares estimator and in which situations which simulation type\nis preferable. Our results suggest that the preferable simulation method\ndepends on many factors, including the number of features, and how the\nassumptions of a parametric simulation differ from the true DGP. Also, the\nresampling strategy used for Plasmode influences the results. In particular,\nsubsampling with a small sampling proportion can be recommended."}, "http://arxiv.org/abs/2312.04078": {"title": "A Review and Taxonomy of Methods for Quantifying Dataset Similarity", "link": "http://arxiv.org/abs/2312.04078", "description": "In statistics and machine learning, measuring the similarity between two or\nmore datasets is important for several purposes. The performance of a\npredictive model on novel datasets, referred to as generalizability, critically\ndepends on how similar the dataset used for fitting the model is to the novel\ndatasets. Exploiting or transferring insights between similar datasets is a key\naspect of meta-learning and transfer-learning. 
In two-sample testing, it is\nchecked, whether the underlying (multivariate) distributions of two datasets\ncoincide or not.\n\nExtremely many approaches for quantifying dataset similarity have been\nproposed in the literature. A structured overview is a crucial first step for\ncomparisons of approaches. We examine more than 100 methods and provide a\ntaxonomy, classifying them into ten classes, including (i) comparisons of\ncumulative distribution functions, density functions, or characteristic\nfunctions, (ii) methods based on multivariate ranks, (iii) discrepancy measures\nfor distributions, (iv) graph-based methods, (v) methods based on inter-point\ndistances, (vi) kernel-based methods, (vii) methods based on binary\nclassification, (viii) distance and similarity measures for datasets, (ix)\ncomparisons based on summary statistics, and (x) different testing approaches.\nHere, we present an extensive review of these methods. We introduce the main\nunderlying ideas, formal definitions, and important properties."}, "http://arxiv.org/abs/2312.04150": {"title": "A simple sensitivity analysis method for unmeasured confounders via linear programming with estimating equation constraints", "link": "http://arxiv.org/abs/2312.04150", "description": "In estimating the average treatment effect in observational studies, the\ninfluence of confounders should be appropriately addressed. To this end, the\npropensity score is widely used. If the propensity scores are known for all the\nsubjects, bias due to confounders can be adjusted by using the inverse\nprobability weighting (IPW) by the propensity score. Since the propensity score\nis unknown in general, it is usually estimated by the parametric logistic\nregression model with unknown parameters estimated by solving the score\nequation under the strongly ignorable treatment assignment (SITA) assumption.\nViolation of the SITA assumption and/or misspecification of the propensity\nscore model can cause serious bias in estimating the average treatment effect.\nTo relax the SITA assumption, the IPW estimator based on the outcome-dependent\npropensity score has been successfully introduced. However, it still depends on\nthe correctly specified parametric model and its identification. In this paper,\nwe propose a simple sensitivity analysis method for unmeasured confounders. In\nthe standard practice, the estimating equation is used to estimate the unknown\nparameters in the parametric propensity score model. Our idea is to make\ninference on the average causal effect by removing restrictive parametric model\nassumptions while still utilizing the estimating equation. Using estimating\nequations as constraints, which the true propensity scores asymptotically\nsatisfy, we construct the worst-case bounds for the average treatment effect\nwith linear programming. Different from the existing sensitivity analysis\nmethods, we construct the worst-case bounds with minimal assumptions. We\nillustrate our proposal by simulation studies and a real-world example."}, "http://arxiv.org/abs/2312.04444": {"title": "Parameter Inference for Hypo-Elliptic Diffusions under a Weak Design Condition", "link": "http://arxiv.org/abs/2312.04444", "description": "We address the problem of parameter estimation for degenerate diffusion\nprocesses defined via the solution of Stochastic Differential Equations (SDEs)\nwith diffusion matrix that is not full-rank. 
For this class of hypo-elliptic\ndiffusions recent works have proposed contrast estimators that are\nasymptotically normal, provided that the step-size in-between observations\n$\\Delta=\\Delta_n$ and their total number $n$ satisfy $n \\to \\infty$, $n\n\\Delta_n \\to \\infty$, $\\Delta_n \\to 0$, and additionally $\\Delta_n = o\n(n^{-1/2})$. This latter restriction places a requirement for a so-called\n`rapidly increasing experimental design'. In this paper, we overcome this\nlimitation and develop a general contrast estimator satisfying asymptotic\nnormality under the weaker design condition $\\Delta_n = o(n^{-1/p})$ for\ngeneral $p \\ge 2$. Such a result has been obtained for elliptic SDEs in the\nliterature, but its derivation in a hypo-elliptic setting is highly\nnon-trivial. We provide numerical results to illustrate the advantages of the\ndeveloped theory."}, "http://arxiv.org/abs/2312.04481": {"title": "Wasserstein complexity penalization priors: a new class of penalizing complexity priors", "link": "http://arxiv.org/abs/2312.04481", "description": "Penalizing complexity (PC) priors is a principled framework for designing\npriors that reduce model complexity. PC priors penalize the Kullback-Leibler\nDivergence (KLD) between the distributions induced by a ``simple'' model and\nthat of a more complex model. However, in many common cases, it is impossible\nto construct a prior in this way because the KLD is infinite. Various\napproximations are used to mitigate this problem, but the resulting priors then\nfail to follow the designed principles. We propose a new class of priors, the\nWasserstein complexity penalization (WCP) priors, by replacing KLD with the\nWasserstein distance in the PC prior framework. These priors avoid the infinite\nmodel distance issues and can be derived by following the principles exactly,\nmaking them more interpretable. Furthermore, principles and recipes to\nconstruct joint WCP priors for multiple parameters analytically and numerically\nare proposed and we show that they can be easily obtained, either numerically\nor analytically, for a general class of models. The methods are illustrated\nthrough several examples for which PC priors have previously been applied."}, "http://arxiv.org/abs/2201.08502": {"title": "Curved factor analysis with the Ellipsoid-Gaussian distribution", "link": "http://arxiv.org/abs/2201.08502", "description": "There is a need for new models for characterizing dependence in multivariate\ndata. The multivariate Gaussian distribution is routinely used, but cannot\ncharacterize nonlinear relationships in the data. Most non-linear extensions\ntend to be highly complex; for example, involving estimation of a non-linear\nregression model in latent variables. In this article, we propose a relatively\nsimple class of Ellipsoid-Gaussian multivariate distributions, which are\nderived by using a Gaussian linear factor model involving latent variables\nhaving a von Mises-Fisher distribution on a unit hyper-sphere. We show that the\nEllipsoid-Gaussian distribution can flexibly model curved relationships among\nvariables with lower-dimensional structures. Taking a Bayesian approach, we\npropose a hybrid of gradient-based geodesic Monte Carlo and adaptive Metropolis\nfor posterior sampling. We derive basic properties and illustrate the utility\nof the Ellipsoid-Gaussian distribution on a variety of simulated and real data\napplications. 
An accompanying R package is also available."}, "http://arxiv.org/abs/2212.12822": {"title": "Simultaneous false discovery proportion bounds via knockoffs and closed testing", "link": "http://arxiv.org/abs/2212.12822", "description": "We propose new methods to obtain simultaneous false discovery proportion\nbounds for knockoff-based approaches. We first investigate an approach based on\nJanson and Su's $k$-familywise error rate control method and interpolation. We\nthen generalize it by considering a collection of $k$ values, and show that the\nbound of Katsevich and Ramdas is a special case of this method and can be\nuniformly improved. Next, we further generalize the method by using closed\ntesting with a multi-weighted-sum local test statistic. This allows us to\nobtain a further uniform improvement and other generalizations over previous\nmethods. We also develop an efficient shortcut for its implementation. We\ncompare the performance of our proposed methods in simulations and apply them\nto a data set from the UK Biobank."}, "http://arxiv.org/abs/2302.06054": {"title": "Single Proxy Control", "link": "http://arxiv.org/abs/2302.06054", "description": "Negative control variables are sometimes used in non-experimental studies to\ndetect the presence of confounding by hidden factors. A negative control\noutcome (NCO) is an outcome that is influenced by unobserved confounders of the\nexposure effects on the outcome in view, but is not causally impacted by the\nexposure. Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration\nApproach (COCA) as a formal NCO counterfactual method to detect and correct for\nresidual confounding bias. For identification, COCA treats the NCO as an\nerror-prone proxy of the treatment-free counterfactual outcome of interest, and\ninvolves regressing the NCO on the treatment-free counterfactual, together with\na rank-preserving structural model which assumes a constant individual-level\ncausal effect. In this work, we establish nonparametric COCA identification for\nthe average causal effect for the treated, without requiring rank-preservation,\ntherefore accommodating unrestricted effect heterogeneity across units. This\nnonparametric identification result has important practical implications, as it\nprovides single proxy confounding control, in contrast to recently proposed\nproximal causal inference, which relies for identification on a pair of\nconfounding proxies. For COCA estimation we propose three separate strategies:\n(i) an extended propensity score approach, (ii) an outcome bridge function\napproach, and (iii) a doubly-robust approach. Finally, we illustrate the\nproposed methods in an application evaluating the causal impact of a Zika virus\noutbreak on birth rate in Brazil."}, "http://arxiv.org/abs/2302.11363": {"title": "lqmix: an R package for longitudinal data analysis via linear quantile mixtures", "link": "http://arxiv.org/abs/2302.11363", "description": "The analysis of longitudinal data gives the chance to observe how unit\nbehaviors change over time, but it also poses series of issues. These have been\nthe focus of a huge literature in the context of linear and generalized linear\nregression moving also, in the last ten years or so, to the context of linear\nquantile regression for continuous responses. 
In this paper, we present lqmix,\na novel R package that helps estimate a class of linear quantile regression\nmodels for longitudinal data, in the presence of time-constant and/or\ntime-varying, unit-specific, random coefficients, with unspecified\ndistribution. Model parameters are estimated in a maximum likelihood framework,\nvia an extended EM algorithm, and parameters' standard errors are estimated via\na block-bootstrap procedure. The analysis of a benchmark dataset is used to\ngive details on the package functions."}, "http://arxiv.org/abs/2303.04408": {"title": "Principal Component Analysis of Two-dimensional Functional Data with Serial Correlation", "link": "http://arxiv.org/abs/2303.04408", "description": "In this paper, we propose a novel model to analyze serially correlated\ntwo-dimensional functional data observed sparsely and irregularly on a domain\nwhich may not be a rectangle. Our approach employs a mixed effects model that\nspecifies the principal component functions as bivariate splines on\ntriangulations and the principal component scores as random effects which\nfollow an auto-regressive model. We apply the thin-plate penalty for\nregularizing the bivariate function estimation and develop an effective EM\nalgorithm along with Kalman filter and smoother for calculating the penalized\nlikelihood estimates of the parameters. Our approach was applied to simulated\ndatasets and to Texas monthly average temperature data from January 1915\nto December 2014."}, "http://arxiv.org/abs/2309.15973": {"title": "Application of data acquisition methods in the field of scientific research of public policy", "link": "http://arxiv.org/abs/2309.15973", "description": "Public policy represents a special subdiscipline within political science. It\nis given increasing importance in the context of scientific research and\nscientific approaches. Public policy as a discipline of political science has\nits own special subject and method of research. A particularly important aspect\nof the scientific approach to public policy is the application of research\nmethods as one of the stages and phases of designing scientific research. In\nthis sense, the goal of this research is to present the application of\nscientific research methods in the field of public policy. Those methods are\nbased on scientific achievements developed within the framework of the modern\nmethodology of the social sciences. Scientific research methods represent an\nimportant functional part of the research project as a model of the scientific\nresearch system, predominantly of an empirical character, which is applicable\nto all types of research. This is precisely what imposes the need to develop a\nproject as a prerequisite for applying scientific methods and conducting\nscientific research, and therefore for a more complete understanding of public\npolicy. 
The\nconclusions that will be reached point to the fact that scientific research of\npublic policy can not be carried out without the creation of a scientific\nresearch project as a complex scientific and operational document and the\napplication of appropriate methods and techniques developed within the\nframework of scientific achievements of modern social science methodology."}, "http://arxiv.org/abs/2312.04601": {"title": "Estimating Fr\\'echet bounds for validating programmatic weak supervision", "link": "http://arxiv.org/abs/2312.04601", "description": "We develop methods for estimating Fr\\'echet bounds on (possibly\nhigh-dimensional) distribution classes in which some variables are\ncontinuous-valued. We establish the statistical correctness of the computed\nbounds under uncertainty in the marginal constraints and demonstrate the\nusefulness of our algorithms by evaluating the performance of machine learning\n(ML) models trained with programmatic weak supervision (PWS). PWS is a\nframework for principled learning from weak supervision inputs (e.g.,\ncrowdsourced labels, knowledge bases, pre-trained models on related tasks,\netc), and it has achieved remarkable success in many areas of science and\nengineering. Unfortunately, it is generally difficult to validate the\nperformance of ML models trained with PWS due to the absence of labeled data.\nOur algorithms address this issue by estimating sharp lower and upper bounds\nfor performance metrics such as accuracy/recall/precision/F1 score."}, "http://arxiv.org/abs/2312.04661": {"title": "Robust Elastic Net Estimators for High Dimensional Generalized Linear Models", "link": "http://arxiv.org/abs/2312.04661", "description": "Robust estimators for Generalized Linear Models (GLMs) are not easy to\ndevelop because of the nature of the distributions involved. Recently, there\nhas been an increasing interest in this topic, especially in the presence of a\npossibly large number of explanatory variables. Transformed M-estimators (MT)\nare a natural way to extend the methodology of M-estimators to the class of\nGLMs and to obtain robust methods. We introduce a penalized version of\nMT-estimators in order to deal with high-dimensional data. We prove, under\nappropriate assumptions, consistency and asymptotic normality of this new class\nof estimators. The theory is developed for redescending $\\rho$-functions and\nElastic Net penalization. An iterative re-weighted least squares algorithm is\ngiven, together with a procedure to initialize it. The latter is of particular\nimportance, since the estimating equations might have multiple roots. We\nillustrate the performance of this new method for the Poisson family under\nseveral type of contaminations in a Monte Carlo experiment and in an example\nbased on a real dataset."}, "http://arxiv.org/abs/2312.04717": {"title": "A kinetic Monte Carlo Approach for Boolean Logic Functionality in Gold Nanoparticle Networks", "link": "http://arxiv.org/abs/2312.04717", "description": "Nanoparticles interconnected by insulating organic molecules exhibit\nnonlinear switching behavior at low temperatures. By assembling these switches\ninto a network and manipulating charge transport dynamics through surrounding\nelectrodes, the network can be reconfigurably functionalized to act as any\nBoolean logic gate. This work introduces a kinetic Monte Carlo-based simulation\ntool, applying established principles of single electronics to model charge\ntransport dynamics in nanoparticle networks. 
We functionalize nanoparticle\nnetworks as Boolean logic gates and assess their quality using a fitness\nfunction. Based on the definition of fitness, we derive new metrics to quantify\nessential nonlinear properties of the network, including negative differential\nresistance and nonlinear separability. These nonlinear properties are crucial\nnot only for functionalizing the network as Boolean logic gates but also when\nour networks are functionalized for brain-inspired computing applications in\nthe future. We address fundamental questions about the dependence of fitness\nand nonlinear properties on system size, number of surrounding electrodes, and\nelectrode positioning. We assert the overall benefit of having more electrodes,\nwith proximity to the network's output being pivotal for functionality and\nnonlinearity. Additionally, we demonstrate an optimal system size and argue for\nbreaking symmetry in electrode positioning to favor nonlinear properties."}, "http://arxiv.org/abs/2312.04747": {"title": "MetaDetect: Metamorphic Testing Based Anomaly Detection for Multi-UAV Wireless Networks", "link": "http://arxiv.org/abs/2312.04747", "description": "The reliability of wireless Ad Hoc Network (WANET) communication is much\nlower than that of wired networks. A WANET is impacted by node overload,\nrouting protocols, weather, obstacle blockage, and many other factors, and\nthese anomalies cannot be avoided. Accurately predicting in advance that the\nnetwork will stop entirely is essential, so that network re-routing or a change\nto different bands can be carried out. The present study has two primary goals.\nFirstly, to design anomaly event detection patterns based on the Metamorphic\nTesting (MT) methodology. Secondly, to compare the performance of evaluation\nmetrics such as Transfer Rate, Occupancy Rate, and the Number of packets\nreceived. Compared to other studies, the most significant advantages are\nmathematical interpretability and the absence of any dependence on physical\nenvironmental information; the approach relies only on networking physical\nlayer and MAC layer data. The analysis of the results demonstrates that the\nproposed MT detection method is helpful for automatically identifying\nincident/accident events on WANET. The physical layer Transfer Rate metric\nachieved the best performance."}, "http://arxiv.org/abs/2312.04924": {"title": "Sparse Anomaly Detection Across Referentials: A Rank-Based Higher Criticism Approach", "link": "http://arxiv.org/abs/2312.04924", "description": "Detecting anomalies in large sets of observations is crucial in various\napplications, such as epidemiological studies, gene expression studies, and\nsystems monitoring. We consider settings where the units of interest result in\nmultiple independent observations from potentially distinct referentials. Scan\nstatistics and related methods are commonly used in such settings, but rely on\nstringent modeling assumptions for proper calibration. We instead propose a\nrank-based variant of the higher criticism statistic that only requires\nindependent observations originating from ordered spaces. We show under what\nconditions the resulting methodology is able to detect the presence of\nanomalies. These conditions are stated in a general, non-parametric manner, and\ndepend solely on the probabilities of anomalous observations exceeding nominal\nobservations. The analysis requires a refined understanding of the distribution\nof the ranks under the presence of anomalies, and in particular of the\nrank-induced dependencies. 
The methodology is robust against heavy-tailed\ndistributions through the use of ranks. Within the exponential family and a\nfamily of convolutional models, we analytically quantify the asymptotic\nperformance of our methodology and the performance of the oracle, and show the\ndifference is small for many common models. Simulations confirm these results.\nWe show the applicability of the methodology through an analysis of quality\ncontrol data of a pharmaceutical manufacturing process."}, "http://arxiv.org/abs/2312.04950": {"title": "Sequential inductive prediction intervals", "link": "http://arxiv.org/abs/2312.04950", "description": "In this paper we explore the concept of sequential inductive prediction\nintervals using theory from sequential testing. We furthermore introduce a\n3-parameter PAC definition of prediction intervals that allows us via\nsimulation to achieve almost sharp bounds with high probability."}, "http://arxiv.org/abs/2312.04972": {"title": "Comparison of Probabilistic Structural Reliability Methods for Ultimate Limit State Assessment of Wind Turbines", "link": "http://arxiv.org/abs/2312.04972", "description": "The probabilistic design of offshore wind turbines aims to ensure structural\nsafety in a cost-effective way. This involves conducting structural reliability\nassessments for different design options and considering different structural\nresponses. There are several structural reliability methods, and this paper\nwill apply and compare different approaches in some simplified case studies. In\nparticular, the well known environmental contour method will be compared to a\nmore novel approach based on sequential sampling and Gaussian processes\nregression for an ultimate limit state case study. For one of the case studies,\nresults will also be compared to results from a brute force simulation\napproach. Interestingly, the comparison is very different from the two case\nstudies. In one of the cases the environmental contours method agrees well with\nthe sequential sampling method but in the other, results vary considerably.\nProbably, this can be explained by the violation of some of the assumptions\nassociated with the environmental contour approach, i.e. that the short-term\nvariability of the response is large compared to the long-term variability of\nthe environmental conditions. Results from this simple comparison study\nsuggests that the sequential sampling method can be a robust and\ncomputationally effective approach for structural reliability assessment."}, "http://arxiv.org/abs/2312.05077": {"title": "Computation of least squares trimmed regression--an alternative to least trimmed squares regression", "link": "http://arxiv.org/abs/2312.05077", "description": "The least squares of depth trimmed (LST) residuals regression, proposed in\nZuo and Zuo (2023) \\cite{ZZ23}, serves as a robust alternative to the classic\nleast squares (LS) regression as well as a strong competitor to the famous\nleast trimmed squares (LTS) regression of Rousseeuw (1984) \\cite{R84}.\nTheoretical properties of the LST were thoroughly studied in \\cite{ZZ23}.\n\nThe article aims to promote the implementation and computation of the LST\nresiduals regression for a broad group of statisticians in statistical practice\nand demonstrates that (i) the LST is as robust as the benchmark of robust\nregression, the LTS regression, and much more efficient than the latter. 
(ii)\nIt can be as efficient as (or even more efficient than) the LS in the scenario\nwith errors uncorrelated with mean zero and homoscedastic with finite variance.\n(iii) It can be computed as fast as (or even faster than) the LTS based on a\nnewly proposed algorithm."}, "http://arxiv.org/abs/2312.05127": {"title": "Weighted least squares regression with the best robustness and high computability", "link": "http://arxiv.org/abs/2312.05127", "description": "A novel regression method is introduced and studied. The procedure weights\nsquared residuals based on their magnitude. Unlike the classic least squares\nwhich treats every squared residual equally important, the new procedure\nexponentially down-weights squared-residuals that lie far away from the cloud\nof all residuals and assigns a constant weight (one) to squared-residuals that\nlie close to the center of the squared-residual cloud.\n\nThe new procedure can keep a good balance between robustness and efficiency,\nit possesses the highest breakdown point robustness for any regression\nequivariant procedure, much more robust than the classic least squares, yet\nmuch more efficient than the benchmark of robust method, the least trimmed\nsquares (LTS) of Rousseeuw (1984).\n\nWith a smooth weight function, the new procedure could be computed very fast\nby the first-order (first-derivative) method and the second-order\n(second-derivative) method.\n\nAssertions and other theoretical findings are verified in simulated and real\ndata examples."}, "http://arxiv.org/abs/2208.07898": {"title": "Collaborative causal inference on distributed data", "link": "http://arxiv.org/abs/2208.07898", "description": "In recent years, the development of technologies for causal inference with\nprivacy preservation of distributed data has gained considerable attention.\nMany existing methods for distributed data focus on resolving the lack of\nsubjects (samples) and can only reduce random errors in estimating treatment\neffects. In this study, we propose a data collaboration quasi-experiment\n(DC-QE) that resolves the lack of both subjects and covariates, reducing random\nerrors and biases in the estimation. Our method involves constructing\ndimensionality-reduced intermediate representations from private data from\nlocal parties, sharing intermediate representations instead of private data for\nprivacy preservation, estimating propensity scores from the shared intermediate\nrepresentations, and finally, estimating the treatment effects from propensity\nscores. Through numerical experiments on both artificial and real-world data,\nwe confirm that our method leads to better estimation results than individual\nanalyses. While dimensionality reduction loses some information in the private\ndata and causes performance degradation, we observe that sharing intermediate\nrepresentations with many parties to resolve the lack of subjects and\ncovariates sufficiently improves performance to overcome the degradation caused\nby dimensionality reduction. Although external validity is not necessarily\nguaranteed, our results suggest that DC-QE is a promising method. 
With the\nwidespread use of our method, intermediate representations can be published as\nopen data to help researchers find causalities and accumulate a knowledge base."}, "http://arxiv.org/abs/2304.07034": {"title": "Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata", "link": "http://arxiv.org/abs/2304.07034", "description": "The optimum sample allocation in stratified sampling is one of the basic\nissues of survey methodology. It is a procedure of dividing the overall sample\nsize into strata sample sizes in such a way that for given sampling designs in\nstrata the variance of the stratified $\\pi$ estimator of the population total\n(or mean) for a given study variable assumes its minimum. In this work, we\nconsider the optimum allocation of a sample, under lower and upper bounds\nimposed jointly on sample sizes in strata. We are concerned with the variance\nfunction of some generic form that, in particular, covers the case of the\nsimple random sampling without replacement in strata. The goal of this paper is\ntwofold. First, we establish (using the Karush-Kuhn-Tucker conditions) a\ngeneric form of the optimal solution, the so-called optimality conditions.\nSecond, based on the established optimality conditions, we derive an efficient\nrecursive algorithm, named RNABOX, which solves the allocation problem under\nstudy. The RNABOX can be viewed as a generalization of the classical recursive\nNeyman allocation algorithm, a popular tool for optimum allocation when only\nupper bounds are imposed on sample strata-sizes. We implement RNABOX in R as a\npart of our package stratallo which is available from the Comprehensive R\nArchive Network (CRAN) repository."}, "http://arxiv.org/abs/2305.12366": {"title": "A Quantile Shift Approach To Main Effects And Interactions In A 2-By-2 Design", "link": "http://arxiv.org/abs/2305.12366", "description": "When comparing two independent groups, shift functions are basically\ntechniques that compare multiple quantiles rather than a single measure of\nlocation, the goal being to get a more detailed understanding of how the\ndistributions differ. Various versions have been proposed and studied. This\npaper deals with extensions of these methods to main effects and interactions\nin a between-by-between, 2-by-2 design. Two approaches are studied, one that\ncompares the deciles of the distributions, and one that has a certain\nconnection to the Wilcoxon-Mann-Whitney method. For both methods, we propose an\nimplementation using the Harrell-Davis quantile estimator, used in conjunction\nwith a percentile bootstrap approach. We report results of simulations of false\nand true positive rates."}, "http://arxiv.org/abs/2306.05415": {"title": "Causal normalizing flows: from theory to practice", "link": "http://arxiv.org/abs/2306.05415", "description": "In this work, we deepen on the use of normalizing flows for causal reasoning.\nSpecifically, we first leverage recent results on non-linear ICA to show that\ncausal models are identifiable from observational data given a causal ordering,\nand thus can be recovered using autoregressive normalizing flows (NFs). Second,\nwe analyze different design and learning choices for causal normalizing flows\nto capture the underlying causal data-generating process. Third, we describe\nhow to implement the do-operator in causal NFs, and thus, how to answer\ninterventional and counterfactual questions. 
Finally, in our experiments, we\nvalidate our design and training choices through a comprehensive ablation\nstudy; compare causal NFs to other approaches for approximating causal models;\nand empirically demonstrate that causal NFs can be used to address real-world\nproblems, where the presence of mixed discrete-continuous data and partial\nknowledge on the causal graph is the norm. The code for this work can be found\nat https://github.com/psanch21/causal-flows."}, "http://arxiv.org/abs/2307.13917": {"title": "BayesDAG: Gradient-Based Posterior Inference for Causal Discovery", "link": "http://arxiv.org/abs/2307.13917", "description": "Bayesian causal discovery aims to infer the posterior distribution over\ncausal models from observed data, quantifying epistemic uncertainty and\nbenefiting downstream tasks. However, computational challenges arise due to\njoint inference over combinatorial space of Directed Acyclic Graphs (DAGs) and\nnonlinear functions. Despite recent progress towards efficient posterior\ninference over DAGs, existing methods are either limited to variational\ninference on node permutation matrices for linear causal models, leading to\ncompromised inference accuracy, or continuous relaxation of adjacency matrices\nconstrained by a DAG regularizer, which cannot ensure resulting graphs are\nDAGs. In this work, we introduce a scalable Bayesian causal discovery framework\nbased on a combination of stochastic gradient Markov Chain Monte Carlo\n(SG-MCMC) and Variational Inference (VI) that overcomes these limitations. Our\napproach directly samples DAGs from the posterior without requiring any DAG\nregularization, simultaneously draws function parameter samples and is\napplicable to both linear and nonlinear causal models. To enable our approach,\nwe derive a novel equivalence to the permutation-based DAG learning, which\nopens up possibilities of using any relaxed gradient estimator defined over\npermutations. To our knowledge, this is the first framework applying\ngradient-based MCMC sampling for causal discovery. Empirical evaluation on\nsynthetic and real-world datasets demonstrate our approach's effectiveness\ncompared to state-of-the-art baselines."}, "http://arxiv.org/abs/2312.05319": {"title": "Hyperbolic Network Latent Space Model with Learnable Curvature", "link": "http://arxiv.org/abs/2312.05319", "description": "Network data is ubiquitous in various scientific disciplines, including\nsociology, economics, and neuroscience. Latent space models are often employed\nin network data analysis, but the geometric effect of latent space curvature\nremains a significant, unresolved issue. In this work, we propose a hyperbolic\nnetwork latent space model with a learnable curvature parameter. We\ntheoretically justify that learning the optimal curvature is essential to\nminimizing the embedding error across all hyperbolic embedding methods beyond\nnetwork latent space models. A maximum-likelihood estimation strategy,\nemploying manifold gradient optimization, is developed, and we establish the\nconsistency and convergence rates for the maximum-likelihood estimators, both\nof which are technically challenging due to the non-linearity and non-convexity\nof the hyperbolic distance metric. 
We further demonstrate the geometric effect\nof latent space curvature and the superior performance of the proposed model\nthrough extensive simulation studies and an application using a Facebook\nfriendship network."}, "http://arxiv.org/abs/2312.05345": {"title": "A General Estimation Framework for Multi-State Markov Processes with Flexible Specification of the Transition Intensities", "link": "http://arxiv.org/abs/2312.05345", "description": "When interest lies in the progression of a disease rather than on a single\noutcome, non-homogeneous multi-state Markov models constitute a natural and\npowerful modelling approach. Constant monitoring of a phenomenon of interest is\noften unfeasible, hence leading to an intermittent observation scheme. This\nsetting is challenging and existing models and their implementations do not yet\nallow for flexible enough specifications that can fully exploit the information\ncontained in the data. To widen significantly the scope of multi-state Markov\nmodels, we propose a closed-form expression for the local curvature information\nof a key quantity, the transition probability matrix. Such development allows\none to model any type of multi-state Markov process, where the transition\nintensities are flexibly specified as functions of additive predictors.\nParameter estimation is carried out through a carefully structured, stable\npenalised likelihood approach. The methodology is exemplified via two case\nstudies that aim at modelling the onset of cardiac allograft vasculopathy, and\ncognitive decline. To support applicability and reproducibility, all developed\ntools are implemented in the R package flexmsm."}, "http://arxiv.org/abs/2312.05365": {"title": "Product Centered Dirichlet Processes for Dependent Clustering", "link": "http://arxiv.org/abs/2312.05365", "description": "While there is an immense literature on Bayesian methods for clustering, the\nmultiview case has received little attention. This problem focuses on obtaining\ndistinct but statistically dependent clusterings in a common set of entities\nfor different data types. For example, clustering patients into subgroups with\nsubgroup membership varying according to the domain of the patient variables. A\nchallenge is how to model the across-view dependence between the partitions of\npatients into subgroups. The complexities of the partition space make standard\nmethods to model dependence, such as correlation, infeasible. In this article,\nwe propose CLustering with Independence Centering (CLIC), a clustering prior\nthat uses a single parameter to explicitly model dependence between clusterings\nacross views. CLIC is induced by the product centered Dirichlet process (PCDP),\na novel hierarchical prior that bridges between independent and equivalent\npartitions. We show appealing theoretic properties, provide a finite\napproximation and prove its accuracy, present a marginal Gibbs sampler for\nposterior computation, and derive closed form expressions for the marginal and\njoint partition distributions for the CLIC model. On synthetic data and in an\napplication to epidemiology, CLIC accurately characterizes view-specific\npartitions while providing inference on the dependence level."}, "http://arxiv.org/abs/2312.05372": {"title": "Rational Kriging", "link": "http://arxiv.org/abs/2312.05372", "description": "This article proposes a new kriging that has a rational form. 
It is shown\nthat the generalized least squares estimate of the mean from rational kriging\nis much more well behaved than that from ordinary kriging. Parameter estimation\nand uncertainty quantification for rational kriging are proposed using a\nGaussian process framework. Its potential applications in emulation and\ncalibration of computer models are also discussed."}, "http://arxiv.org/abs/2312.05400": {"title": "Generalized difference-in-differences", "link": "http://arxiv.org/abs/2312.05400", "description": "We propose a new method for estimating causal effects in longitudinal/panel\ndata settings that we call generalized difference-in-differences. Our approach\nunifies two alternative approaches in these settings: ignorability estimators\n(e.g., synthetic controls) and difference-in-differences (DiD) estimators. We\npropose a new identifying assumption -- a stable bias assumption -- which\ngeneralizes the conditional parallel trends assumption in DiD, leading to the\nproposed generalized DiD framework. This change gives generalized DiD\nestimators the flexibility of ignorability estimators while maintaining the\nrobustness to unobserved confounding of DiD. We also show how ignorability and\nDiD estimators are special cases of generalized DiD. We then propose\ninfluence-function based estimators of the observed data functional, allowing\nthe use of double/debiased machine learning for estimation. We also show how\ngeneralized DiD easily extends to include clustered treatment assignment and\nstaggered adoption settings, and we discuss how the framework can facilitate\nestimation of other treatment effects beyond the average treatment effect on\nthe treated. Finally, we provide simulations which show that generalized DiD\noutperforms ignorability and DiD estimators when their identifying assumptions\nare not met, while being competitive with these special cases when their\nidentifying assumptions are met."}, "http://arxiv.org/abs/2312.05404": {"title": "Disentangled Latent Representation Learning for Tackling the Confounding M-Bias Problem in Causal Inference", "link": "http://arxiv.org/abs/2312.05404", "description": "In causal inference, it is a fundamental task to estimate the causal effect\nfrom observational data. However, latent confounders pose major challenges in\ncausal inference in observational data, for example, confounding bias and\nM-bias. Recent data-driven causal effect estimators tackle the confounding bias\nproblem via balanced representation learning, but assume no M-bias in the\nsystem, thus they fail to handle the M-bias. In this paper, we identify a\nchallenging and unsolved problem caused by a variable that leads to confounding\nbias and M-bias simultaneously. To address this problem with co-occurring\nM-bias and confounding bias, we propose a novel Disentangled Latent\nRepresentation learning framework for learning latent representations from\nproxy variables for unbiased Causal effect Estimation (DLRCE) from\nobservational data. Specifically, DLRCE learns three sets of latent\nrepresentations from the measured proxy variables to adjust for the confounding\nbias and M-bias. 
Extensive experiments on synthetic data and three real-world\ndatasets demonstrate that DLRCE significantly outperforms the state-of-the-art\nestimators in the presence of both confounding bias and M-bias."}, "http://arxiv.org/abs/2312.05411": {"title": "Deep Bayes Factors", "link": "http://arxiv.org/abs/2312.05411", "description": "There is no other model or hypothesis verification tool in Bayesian statistics\nthat is as widely used as the Bayes factor. We focus on generative models that\nare likelihood-free and, therefore, render the computation of Bayes factors\n(marginal likelihood ratios) far from obvious. We propose a deep learning\nestimator of the Bayes factor based on simulated data from two competing models\nusing the likelihood ratio trick. This estimator is devoid of summary\nstatistics and obviates some of the difficulties with ABC model choice. We\nestablish sufficient conditions for consistency of our Deep Bayes Factor\nestimator as well as its consistency as a model selection tool. We investigate\nthe performance of our estimator on various examples using a wide range of\nquality metrics related to estimation and model decision accuracy. After\ntraining, our deep learning approach enables rapid evaluations of the Bayes\nfactor estimator at any fictional data arriving from either hypothesized model,\nnot just the observed data $Y_0$. This allows us to inspect entire Bayes factor\ndistributions under the two models and to quantify the relative location of the\nBayes factor evaluated at $Y_0$ in light of these distributions. Such tail area\nevaluations are not possible for Bayes factor estimators tailored to $Y_0$. We\nfind the performance of our Deep Bayes Factors competitive with existing MCMC\ntechniques that require knowledge of the likelihood function. We also\nconsider variants for posterior or intrinsic Bayes factors estimation. We\ndemonstrate the usefulness of our approach on a relatively high-dimensional\nreal data example about determining cognitive biases."}, "http://arxiv.org/abs/2312.05523": {"title": "Functional Data Analysis: An Introduction and Recent Developments", "link": "http://arxiv.org/abs/2312.05523", "description": "Functional data analysis (FDA) is a statistical framework that allows for the\nanalysis of curves, images, or functions on higher dimensional domains. The\ngoals of FDA, such as descriptive analyses, classification, and regression, are\ngenerally the same as for statistical analyses of scalar-valued or multivariate\ndata, but FDA brings additional challenges due to the high- and infinite\ndimensionality of observations and parameters, respectively. This paper\nprovides an introduction to FDA, including a description of the most common\nstatistical analysis techniques, their respective software implementations, and\nsome recent developments in the field. The paper covers fundamental concepts\nsuch as descriptives and outliers, smoothing, amplitude and phase variation,\nand functional principal component analysis. It also discusses functional\nregression, statistical inference with functional data, functional\nclassification and clustering, and machine learning approaches for functional\ndata analysis. The methods discussed in this paper are widely applicable in\nfields such as medicine, biophysics, neuroscience, and chemistry, and are\nincreasingly relevant due to the widespread use of technologies that allow for\nthe collection of functional data. 
Sparse functional data methods are also\nrelevant for longitudinal data analysis. All presented methods are demonstrated\nusing available software in R by analyzing a data set on human motion and motor\ncontrol. To facilitate the understanding of the methods, their implementation,\nand hands-on application, the code for these practical examples is made\navailable on Github: https://github.com/davidruegamer/FDA_tutorial ."}, "http://arxiv.org/abs/2312.05590": {"title": "Gradient Tracking for High Dimensional Federated Optimization", "link": "http://arxiv.org/abs/2312.05590", "description": "In this paper, we study the (decentralized) distributed optimization problem\nwith high-dimensional sparse structure. Building upon the FedDA algorithm, we\npropose a (Decentralized) FedDA-GT algorithm, which incorporates the\n\\textbf{gradient tracking} technique. It is able to eliminate the heterogeneity\namong different clients' objective functions while ensuring a dimension-free\nconvergence rate. Compared to the vanilla FedDA approach, (D)FedDA-GT can\nsignificantly reduce the communication complexity, from ${O}(s^2\\log\nd/\\varepsilon^{3/2})$ to a more efficient ${O}(s^2\\log d/\\varepsilon)$. In\ncases where strong convexity is applicable, we introduce a multistep mechanism\nresulting in the Multistep ReFedDA-GT algorithm, a slightly modified version of\nFedDA-GT. This approach achieves an impressive communication complexity of\n${O}\\left(s\\log d \\log \\frac{1}{\\varepsilon}\\right)$ through repeated calls to\nthe ReFedDA-GT algorithm. Finally, we conduct numerical experiments,\nillustrating that our proposed algorithms enjoy the dual advantage of being\ndimension-free and heterogeneity-free."}, "http://arxiv.org/abs/2312.05682": {"title": "On Valid Multivariate Generalizations of the Confluent Hypergeometric Covariance Function", "link": "http://arxiv.org/abs/2312.05682", "description": "Modeling of multivariate random fields through Gaussian processes calls for\nthe construction of valid cross-covariance functions describing the dependence\nbetween any two component processes at different spatial locations. The\nrequired validity conditions often present challenges that lead to complicated\nrestrictions on the parameter space. The purpose of this paper is to present\nsimplified techniques for establishing multivariate validity for the\nrecently-introduced Confluent Hypergeometric (CH) class of covariance\nfunctions. Specifically, we use multivariate mixtures to present both\nsimplified and comprehensive conditions for validity, based on results on\nconditionally negative semidefinite matrices and the Schur product theorem. In\naddition, we establish the spectral density of the CH covariance and use this\nto construct valid multivariate models as well as propose new\ncross-covariances. We show that our proposed approach leads to valid\nmultivariate cross-covariance models that inherit the desired marginal\nproperties of the CH model and outperform the multivariate Mat\\'ern model in\nout-of-sample prediction under slowly-decaying correlation of the underlying\nmultivariate random field. 
We also establish properties of multivariate CH\nmodels, including equivalence of Gaussian measures, and demonstrate their use\nin modeling a multivariate oceanography data set consisting of temperature,\nsalinity and oxygen, as measured by autonomous floats in the Southern Ocean."}, "http://arxiv.org/abs/2312.05718": {"title": "Feasible contact tracing", "link": "http://arxiv.org/abs/2312.05718", "description": "Contact tracing is one of the most important tools for preventing the spread\nof infectious diseases, but as the experience of COVID-19 showed, it is also\nnext-to-impossible to implement when the disease is spreading rapidly. We show\nhow to substantially improve the efficiency of contact tracing by combining\nstandard microeconomic tools that measure heterogeneity in how infectious a\nsick person is with ideas from machine learning about sequential optimization.\nOur contributions are twofold. First, we incorporate heterogeneity in\nindividual infectiousness in a multi-armed bandit to establish optimal\nalgorithms. At the heart of this strategy is a focus on learning. In the\ntypical conceptualization of contact tracing, contacts of an infected person\nare tested to find more infections. Under a learning-first framework, however,\ncontacts of infected persons are tested to ascertain whether the infected\nperson is likely to be a \"high infector\" and to find additional infections only\nif it is likely to be highly fruitful. Second, we demonstrate using three\nadministrative contact tracing datasets from India and Pakistan during COVID-19\nthat this strategy improves efficiency. Using our algorithm, we find 80% of\ninfections with just 40% of contacts while current approaches test twice as\nmany contacts to identify the same number of infections. We further show that a\nsimple strategy that can be easily implemented in the field performs at nearly\noptimal levels, allowing for, what we call, feasible contact tracing. These\nresults are immediately transferable to contact tracing in any epidemic."}, "http://arxiv.org/abs/2312.05756": {"title": "A quantitative fusion strategy of stock picking and timing based on Particle Swarm Optimized-Back Propagation Neural Network and Multivariate Gaussian-Hidden Markov Model", "link": "http://arxiv.org/abs/2312.05756", "description": "In recent years, machine learning (ML) has brought effective approaches and\nnovel techniques to economic decision-making, investment forecasting, and risk\nmanagement, coping with the variable and intricate nature of economic and\nfinancial environments. For investment in the stock market, this research\nintroduces a pioneering quantitative fusion model combining stock timing and\npicking strategies by leveraging the Multivariate Gaussian-Hidden Markov Model\n(MGHMM) and Back Propagation Neural Network optimized by Particle Swarm\n(PSO-BPNN). After the information coefficients (IC) between fifty-two\nwinsorized, neutralized and standardized factors and the return of the CSI 300\nindex are calculated, a given number of top-ranked factors are chosen as\ncandidate factors and, after dimension reduction by Principal Component\nAnalysis (PCA), fed into the PSO-BPNN, which outputs a certain number of\nconstituent stocks. Subsequently, we conduct prediction and trading on the\nbasis of the screened stocks and the stock market state output by the MGHMM,\nwhich is trained on Box-Cox-transformed CSI 300 index data, showing excellent\nperformance over the past four years. 
Ultimately, some conventional forecasting and trading methods are compared\nwith our strategy in the Chinese stock market. The fusion strategy incorporating\nstock picking and timing presented in this article provides an innovative\ntechnique for financial analysis."}, "http://arxiv.org/abs/2312.05757": {"title": "Towards Human-like Perception: Learning Structural Causal Model in Heterogeneous Graph", "link": "http://arxiv.org/abs/2312.05757", "description": "Heterogeneous graph neural networks have become popular in various domains.\nHowever, their generalizability and interpretability are limited due to the\ndiscrepancy between their inherent inference flows and human reasoning logic or\nunderlying causal relationships for the learning problem. This study introduces\na novel solution, HG-SCM (Heterogeneous Graph as Structural Causal Model). It\ncan mimic the human perception and decision process through two key steps:\nconstructing intelligible variables based on semantics derived from the graph\nschema and automatically learning task-level causal relationships among these\nvariables by incorporating advanced causal discovery techniques. We compared\nHG-SCM to seven state-of-the-art baseline models on three real-world datasets,\nunder three distinct and ubiquitous out-of-distribution settings. HG-SCM\nachieved the highest average performance rank with minimal standard deviation,\nsubstantiating its effectiveness and superiority in terms of both predictive\npower and generalizability. Additionally, the visualization and analysis of the\nauto-learned causal diagrams for the three tasks aligned well with domain\nknowledge and human cognition, demonstrating prominent interpretability.\nHG-SCM's human-like nature and its enhanced generalizability and\ninterpretability make it a promising solution for special scenarios where\ntransparency and trustworthiness are paramount."}, "http://arxiv.org/abs/2312.05802": {"title": "Enhancing Scalability in Bayesian Nonparametric Factor Analysis of Spatiotemporal Data", "link": "http://arxiv.org/abs/2312.05802", "description": "This manuscript puts forward novel practicable spatiotemporal Bayesian factor\nanalysis frameworks computationally feasible for moderate to large data. Our\nmodels exhibit significantly enhanced computational scalability and storage\nefficiency, deliver high overall modeling performances, and possess powerful\ninferential capabilities for adequately predicting outcomes at future time\npoints or new spatial locations and satisfactorily clustering spatial locations\ninto regions with similar temporal trajectories, a frequently encountered\ncrucial task. We integrate on top of a baseline separable factor model with\ntemporally dependent latent factors and spatially dependent factor loadings\nunder a probit stick breaking process (PSBP) prior a new slice sampling\nalgorithm that permits unknown varying numbers of spatial mixture components\nacross all factors and guarantees them to be non-increasing through the MCMC\niterations, thus considerably enhancing model flexibility, efficiency, and\nscalability. We further introduce a novel spatial latent nearest-neighbor\nGaussian process (NNGP) prior and new sequential updating algorithms for the\nspatially varying latent variables in the PSBP prior, thereby attaining high\nspatial scalability. 
The markedly accelerated posterior sampling and spatial\nprediction as well as the great modeling and inferential performances of our\nmodels are substantiated by our simulation experiments."}, "http://arxiv.org/abs/2312.05974": {"title": "Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise", "link": "http://arxiv.org/abs/2312.05974", "description": "This paper considers learning the hidden causal network of a linear networked\ndynamical system (NDS) from the time series data at some of its nodes --\npartial observability. The dynamics of the NDS are driven by colored noise that\ngenerates spurious associations across pairs of nodes, rendering the problem\nmuch harder. To address the challenge of noise correlation and partial\nobservability, we assign to each pair of nodes a feature vector computed from\nthe time series data of observed nodes. The feature embedding is engineered to\nyield structural consistency: there exists an affine hyperplane that\nconsistently partitions the set of features, separating the feature vectors\ncorresponding to connected pairs of nodes from those corresponding to\ndisconnected pairs. The causal inference problem is thus addressed via\nclustering the designed features. We demonstrate with simple baseline\nsupervised methods the competitive performance of the proposed causal inference\nmechanism under broad connectivity regimes and noise correlation levels,\nincluding a real world network. Further, we devise novel technical guarantees\nof structural consistency for linear NDS under the considered regime."}, "http://arxiv.org/abs/2312.06018": {"title": "A Multivariate Polya Tree Model for Meta-Analysis with Event Time Distributions", "link": "http://arxiv.org/abs/2312.06018", "description": "We develop a non-parametric Bayesian prior for a family of random probability\nmeasures by extending the Polya tree ($PT$) prior to a joint prior for a set of\nprobability measures $G_1,\\dots,G_n$, suitable for meta-analysis with event\ntime outcomes. In the application to meta-analysis $G_i$ is the event time\ndistribution specific to study $i$. The proposed model defines a regression on\nstudy-specific covariates by introducing increased correlation for any pair of\nstudies with similar characteristics. The desired multivariate $PT$ model is\nconstructed by introducing a hierarchical prior on the conditional splitting\nprobabilities in the $PT$ construction for each of the $G_i$. The hierarchical\nprior replaces the independent beta priors for the splitting probability in the\n$PT$ construction with a Gaussian process prior for corresponding (logit)\nsplitting probabilities across all studies. The Gaussian process is indexed by\nstudy-specific covariates, introducing the desired dependence with increased\ncorrelation for similar studies. The main feature of the proposed construction\nis (conditionally) conjugate posterior updating with commonly reported\ninference summaries for event time data. 
The construction is motivated by a\nmeta-analysis over cancer immunotherapy studies."}, "http://arxiv.org/abs/2312.06028": {"title": "Rejoinder to Discussion of \"A Tale of Two Datasets: Representativeness and Generalisability of Inference for Samples of Networks''", "link": "http://arxiv.org/abs/2312.06028", "description": "This rejoinder responds to the discussions by Caimo, Niezink, and\nSchweinberger and Fritz of ''A Tale of Two Datasets: Representativeness and\nGeneralisability of Inference for Samples of Networks'' by Krivitsky, Coletti,\nand Hens, all published in the Journal of the American Statistical Association\nin 2023."}, "http://arxiv.org/abs/2312.06098": {"title": "Mixture Matrix-valued Autoregressive Model", "link": "http://arxiv.org/abs/2312.06098", "description": "Time series of matrix-valued data are increasingly available in various areas\nincluding economics, finance, social science, etc. These data may shed light on\nthe inter-dynamical relationships between two sets of attributes, for instance\ncountries and economic indices. The matrix autoregressive (MAR) model provides\na parsimonious approach for analyzing such data. However, the MAR model, being\na linear model with parametric constraints, cannot capture the nonlinear\npatterns in the data, such as regime shifts in the dynamics. We propose a\nmixture matrix autoregressive (MMAR) model for analyzing potential regime\nshifts in the dynamics between two attributes, for instance, due to recession\nvs. blooming, or quiet period vs. pandemic. We propose an EM algorithm for\nmaximum likelihood estimation. We derive some theoretical properties of the\nproposed method including consistency and asymptotic distribution, and\nillustrate its performance via simulations and real applications."}, "http://arxiv.org/abs/2312.06155": {"title": "Illustrating the structures of bias from immortal time using directed acyclic graphs", "link": "http://arxiv.org/abs/2312.06155", "description": "Immortal time is a period of follow-up during which death or the study\noutcome cannot occur by design. Bias from immortal time has been increasingly\nrecognized in epidemiologic studies. However, it remains unclear how immortal\ntime arises and what the structures of bias from immortal time are. Here, we\nuse an example \"Do Nobel Prize winners live longer than less recognized\nscientists?\" to illustrate that immortal time arises from using postbaseline\ninformation to define exposure or eligibility. We use time-varying directed\nacyclic graphs (DAGs) to present the structures of bias from immortal time as\nthe key sources of bias, that is confounding and selection bias. We explain\nthat excluding immortal time from the follow-up does not fully address the\nbias, and that the presence of competing risks can worsen the bias. We also\ndiscuss how the structures of bias from immortal time are shared by different\nstudy designs in pharmacoepidemiology and provide solutions, where possible, to\naddress the bias."}, "http://arxiv.org/abs/2312.06159": {"title": "Could dropping a few cells change the takeaways from differential expression?", "link": "http://arxiv.org/abs/2312.06159", "description": "Differential expression (DE) plays a fundamental role in illuminating the\nmolecular mechanisms driving a difference between groups (e.g., due to\ntreatment or disease). 
While any analysis is run on particular cells/samples,\nthe intent is to generalize to future occurrences of the treatment or disease.\nImplicitly, this step is justified by assuming that present and future samples\nare independent and identically distributed from the same population. Though\nthis assumption is always false, we hope that any deviation from the assumption\nis small enough that A) conclusions of the analysis still hold and B) standard\ntools like standard error, significance, and power still reflect\ngeneralizability. Conversely, we might worry about these deviations, and\nreliance on standard tools, if conclusions could be substantively changed by\ndropping a very small fraction of data. While checking every small fraction is\ncomputationally intractable, recent work develops an approximation to identify\nwhen such an influential subset exists. Building on this work, we develop a\nmetric for dropping-data robustness of DE; namely, we cast the analysis in a\nform suitable to the approximation, extend the approximation to models with\ndata-dependent hyperparameters, and extend the notion of a data point from a\nsingle cell to a pseudobulk observation. We then overcome the inherent\nnon-differentiability of gene set enrichment analysis to develop an additional\napproximation for the robustness of top gene sets. We assess robustness of DE\nfor published single-cell RNA-seq data and discover that 1000s of genes can\nhave their results flipped by dropping <1% of the data, including 100s that are\nsensitive to dropping a single cell (0.07%). Surprisingly, this non-robustness\nextends to high-level takeaways; half of the top 10 gene sets can be changed by\ndropping 1-2% of cells, and 2/10 can be changed by dropping a single cell."}, "http://arxiv.org/abs/2312.06204": {"title": "Multilayer Network Regression with Eigenvector Centrality and Community Structure", "link": "http://arxiv.org/abs/2312.06204", "description": "Centrality measures and community structures play a pivotal role in the\nanalysis of complex networks. To effectively model the impact of the network on\nour variable of interest, it is crucial to integrate information from the\nmultilayer network, including the interlayer correlations of network data. In\nthis study, we introduce a two-stage regression model that leverages the\neigenvector centrality and network community structure of fourth-order\ntensor-like multilayer networks. Initially, we utilize the eigenvector\ncentrality of multilayer networks, a method that has found extensive\napplication in prior research. Subsequently, we amalgamate the network\ncommunity structure to construct the community component centrality and\nindividual component centrality of nodes, which are then incorporated into the\nregression model. Furthermore, we establish the asymptotic properties of the\nleast squares estimates of the regression model coefficients. Our proposed\nmethod is employed to analyze data from the European airport network and The\nWorld Input-Output Database (WIOD), demonstrating its practical applicability\nand effectiveness."}, "http://arxiv.org/abs/2312.06265": {"title": "Type I Error Rates are Not Usually Inflated", "link": "http://arxiv.org/abs/2312.06265", "description": "The inflation of Type I error rates is thought to be one of the causes of the\nreplication crisis. 
Questionable research practices such as p-hacking are\nthought to inflate Type I error rates above their nominal level, leading to\nunexpectedly high levels of false positives in the literature and,\nconsequently, unexpectedly low replication rates. In this article, I offer an\nalternative view. I argue that questionable and other research practices do not\nusually inflate relevant Type I error rates. I begin with an introduction to\nType I error rates that distinguishes them from theoretical errors. I then\nillustrate my argument with respect to model misspecification, multiple\ntesting, selective inference, forking paths, exploratory analyses, p-hacking,\noptional stopping, double dipping, and HARKing. In each case, I demonstrate\nthat relevant Type I error rates are not usually inflated above their nominal\nlevel, and in the rare cases that they are, the inflation is easily identified\nand resolved. I conclude that the replication crisis may be explained, at least\nin part, by researchers' misinterpretation of statistical errors and their\nunderestimation of theoretical errors."}, "http://arxiv.org/abs/2312.06289": {"title": "A graphical framework for interpretable correlation matrix models", "link": "http://arxiv.org/abs/2312.06289", "description": "In this work, we present a new approach for constructing models for\ncorrelation matrices with a user-defined graphical structure. The graphical\nstructure makes correlation matrices interpretable and avoids the quadratic\nincrease of parameters as a function of the dimension. We suggest an automatic\napproach to define a prior using a natural sequence of simpler models within\nthe Penalized Complexity framework for the unknown parameters in these models.\n\nWe illustrate this approach with three applications: a multivariate linear\nregression of four biomarkers, a multivariate disease mapping, and a\nmultivariate longitudinal joint modelling. Each application underscores our\nmethod's intuitive appeal, signifying a substantial advancement toward a more\ncohesive and enlightening model that facilitates a meaningful interpretation of\ncorrelation matrices."}, "http://arxiv.org/abs/2312.06334": {"title": "Scoring multilevel regression and postratification based population and subpopulation estimates", "link": "http://arxiv.org/abs/2312.06334", "description": "Multilevel regression and poststratification (MRP) has been used extensively\nto adjust convenience and low-response surveys to make population and\nsubpopulation estimates. For this method, model validation is particularly\nimportant, but recent work has suggested that simple aggregation of individual\nprediction errors does not give a good measure of the error of the population\nestimate. In this manuscript we provide a clear explanation for why this\noccurs, propose two scoring metrics designed specifically for this problem, and\ndemonstrate their use in three different ways. We demonstrate that these\nscoring metrics correctly order models when compared to true goodness of\nestimate, although they do underestimate the magnitude of the score."}, "http://arxiv.org/abs/2312.06415": {"title": "Practicable Power Curve Approximation for Bioequivalence with Unequal Variances", "link": "http://arxiv.org/abs/2312.06415", "description": "Two-group (bio)equivalence tests assess whether two drug formulations provide\nsimilar therapeutic effects. 
These studies are often conducted using two\none-sided t-tests, where the test statistics jointly follow a bivariate\nt-distribution with singular covariance matrix. Unless the two groups of data\nare assumed to have equal variances, the degrees of freedom for this bivariate\nt-distribution are noninteger and unknown a priori. This makes it difficult to\nanalytically find sample sizes that yield desired power for the study using an\nautomated process. Popular free software for bioequivalence study design does\nnot accommodate the comparison of two groups with unequal variances, and\ncertain paid software solutions that make this accommodation produce unstable\nresults. We propose a novel simulation-based method that uses Sobol' sequences\nand root-finding algorithms to quickly and accurately approximate the power\ncurve for two-group bioequivalence tests with unequal variances. We also\nillustrate that caution should be exercised when assuming automated methods for\npower estimation are robust to arbitrary bioequivalence designs. Our methods\nfor sample size determination mitigate this lack of robustness and are widely\napplicable to equivalence and noninferiority tests facilitated via parallel and\ncrossover designs. All methods proposed in this work can be implemented using\nthe dent package in R."}, "http://arxiv.org/abs/2312.06437": {"title": "Posterior Ramifications of Prior Dependence Structures", "link": "http://arxiv.org/abs/2312.06437", "description": "In fully Bayesian analyses, prior distributions are specified before\nobserving data. Prior elicitation methods transfigure prior information into\nquantifiable prior distributions. Recently, methods that leverage copulas have\nbeen proposed to accommodate more flexible dependence structures when eliciting\nmultivariate priors. The resulting priors have been framed as suitable\ncandidates for Bayesian analysis. We prove that under broad conditions, the\nposterior cannot retain many of these flexible prior dependence structures as\ndata are observed. However, these flexible copula-based priors are useful for\ndesign purposes. Because correctly specifying the dependence structure a priori\ncan be difficult, we consider how the choice of prior copula impacts the\nposterior distribution in terms of convergence of the posterior mode. We also\nmake recommendations regarding prior dependence specification for posterior\nanalyses that streamline the prior elicitation process."}, "http://arxiv.org/abs/2312.06465": {"title": "A New Projection Pursuit Index for Big Data", "link": "http://arxiv.org/abs/2312.06465", "description": "Visualization of extremely large datasets in static or dynamic form is a huge\nchallenge because most traditional methods cannot deal with big data problems.\nA new visualization method for big data is proposed based on Projection\nPursuit, Guided Tour and Data Nuggets methods, that will help display\ninteresting hidden structures such as clusters, outliers, and other nonlinear\nstructures in big data. The Guided Tour is a dynamic graphical tool for\nhigh-dimensional data combining Projection Pursuit and Grand Tour methods. It\ndisplays a dynamic sequence of low-dimensional projections obtained by using\nProjection Pursuit (PP) index functions to navigate the data space. Different\nPP indices have been developed to detect interesting structures of multivariate\ndata but there are computational problems for big data using the original\nguided tour with these indices. 
A new PP index is developed to be computable\nfor big data, with the help of a data compression method called Data Nuggets\nthat reduces large datasets while maintaining the original data structure.\nSimulation studies are conducted and a real large dataset is used to illustrate\nthe proposed methodology. Static and dynamic graphical tools for big data can\nbe developed based on the proposed PP index to detect nonlinear structures."}, "http://arxiv.org/abs/2312.06478": {"title": "Prediction De-Correlated Inference", "link": "http://arxiv.org/abs/2312.06478", "description": "Leveraging machine-learning methods to predict outcomes on some unlabeled\ndatasets and then using these pseudo-outcomes in subsequent statistical\ninference is common in modern data analysis. Inference in this setting is often\ncalled post-prediction inference. We propose a novel, assumption-lean framework\nfor inference under post-prediction setting, called \\emph{Prediction\nDe-Correlated inference} (PDC). Our approach can automatically adapt to any\nblack-box machine-learning model and consistently outperforms supervised\nmethods. The PDC framework also offers easy extensibility for accommodating\nmultiple predictive models. Both numerical results and real-world data analysis\nsupport our theoretical results."}, "http://arxiv.org/abs/2312.06547": {"title": "KF-PLS: Optimizing Kernel Partial Least-Squares (K-PLS) with Kernel Flows", "link": "http://arxiv.org/abs/2312.06547", "description": "Partial Least-Squares (PLS) Regression is a widely used tool in chemometrics\nfor performing multivariate regression. PLS is a bi-linear method that has a\nlimited capacity of modelling non-linear relations between the predictor\nvariables and the response. Kernel PLS (K-PLS) has been introduced for\nmodelling non-linear predictor-response relations. In K-PLS, the input data is\nmapped via a kernel function to a Reproducing Kernel Hilbert space (RKH), where\nthe dependencies between the response and the input matrix are assumed to be\nlinear. K-PLS is performed in the RKH space between the kernel matrix and the\ndependent variable. Most available studies use fixed kernel parameters. Only a\nfew studies have been conducted on optimizing the kernel parameters for K-PLS.\nIn this article, we propose a methodology for the kernel function optimization\nbased on Kernel Flows (KF), a technique developed for Gaussian process\nregression (GPR). The results are illustrated with four case studies. The case\nstudies represent both numerical examples and real data used in classification\nand regression tasks. K-PLS optimized with KF, called KF-PLS in this study, is\nshown to yield good results in all illustrated scenarios. The paper presents\ncross-validation studies and hyperparameter analysis of the KF methodology when\napplied to K-PLS."}, "http://arxiv.org/abs/2312.06605": {"title": "Statistical Inference on Latent Space Models for Network Data", "link": "http://arxiv.org/abs/2312.06605", "description": "Latent space models are powerful statistical tools for modeling and\nunderstanding network data. While the importance of accounting for uncertainty\nin network analysis has been well recognized, the current literature\npredominantly focuses on point estimation and prediction, leaving the\nstatistical inference of latent space models an open question. This work aims\nto fill this gap by providing a general framework to analyze the theoretical\nproperties of the maximum likelihood estimators. 
In particular, we establish\nthe uniform consistency and asymptotic distribution results for the latent\nspace models under different edge types and link functions. Furthermore, the\nproposed framework enables us to generalize our results to the dependent-edge\nand sparse scenarios. Our theories are supported by simulation studies and have\nthe potential to be applied in downstream inferences, such as link prediction\nand network testing problems."}, "http://arxiv.org/abs/2312.06616": {"title": "The built environment and induced transport CO2 emissions: A double machine learning approach to account for residential self-selection", "link": "http://arxiv.org/abs/2312.06616", "description": "Understanding why travel behavior differs between residents of urban centers\nand suburbs is key to sustainable urban planning. Especially in light of rapid\nurban growth, identifying housing locations that minimize travel demand and\ninduced CO2 emissions is crucial to mitigate climate change. While the built\nenvironment plays an important role, the precise impact on travel behavior is\nobfuscated by residential self-selection. To address this issue, we propose a\ndouble machine learning approach to obtain unbiased, spatially-explicit\nestimates of the effect of the built environment on travel-related CO2\nemissions for each neighborhood by controlling for residential self-selection.\nWe examine how socio-demographics and travel-related attitudes moderate the\neffect and how it decomposes across the 5Ds of the built environment. Based on\na case study for Berlin and the travel diaries of 32,000 residents, we find\nthat the built environment causes household travel-related CO2 emissions to\ndiffer by a factor of almost two between central and suburban neighborhoods in\nBerlin. To highlight the practical importance for urban climate mitigation, we\nevaluate current plans for 64,000 new residential units in terms of total\ninduced transport CO2 emissions. Our findings underscore the significance of\nspatially differentiated compact development to decarbonize the transport\nsector."}, "http://arxiv.org/abs/2111.10715": {"title": "Confidences in Hypotheses", "link": "http://arxiv.org/abs/2111.10715", "description": "This article introduces a broadly-applicable new method of statistical\nanalysis called hypotheses assessment. It is a frequentist procedure designed\nto answer the question: Given the sample evidence and assuming one of two\nhypotheses is true, what is the relative plausibility of each hypothesis? Our\naim is to determine frequentist confidences in the hypotheses that are relevant\nto the data at hand and are as powerful as the particular application allows.\nHypotheses assessments complement hypothesis tests because providing\nconfidences in the hypotheses in addition to test results can better inform\napplied researchers about the strength of evidence provided by the data. For\nsimple hypotheses, the method produces minimum and maximum confidences in each\nhypothesis. The composite case is more complex, and we introduce two\nconventions to aid with understanding the strength of evidence. 
Assessments are\nqualitatively different from hypothesis testing and confidence interval\noutcomes, and thus fill a gap in the statistician's toolkit."}, "http://arxiv.org/abs/2204.03343": {"title": "Binary Spatial Random Field Reconstruction from Non-Gaussian Inhomogeneous Time-series Observations", "link": "http://arxiv.org/abs/2204.03343", "description": "We develop a new model for spatial random field reconstruction of a\nbinary-valued spatial phenomenon. In our model, sensors are deployed in a\nwireless sensor network across a large geographical region. Each sensor\nmeasures a non-Gaussian inhomogeneous temporal process which depends on the\nspatial phenomenon. Two types of sensors are employed: one collects point\nobservations at specific time points, while the other collects integral\nobservations over time intervals. Subsequently, the sensors transmit these\ntime-series observations to a Fusion Center (FC), and the FC infers the spatial\nphenomenon from these observations. We show that the resulting posterior\npredictive distribution is intractable and develop a tractable two-step\nprocedure to perform inference. Firstly, we develop algorithms to perform\napproximate Likelihood Ratio Tests on the time-series observations, compressing\nthem to a single bit for both point sensors and integral sensors. Secondly,\nonce the compressed observations are transmitted to the FC, we utilize a\nSpatial Best Linear Unbiased Estimator (S-BLUE) to reconstruct the binary\nspatial random field at any desired spatial location. The performance of the\nproposed approach is studied using simulation. We further illustrate the\neffectiveness of our method using a weather dataset from the National\nEnvironment Agency (NEA) of Singapore with fields including temperature and\nrelative humidity."}, "http://arxiv.org/abs/2210.08964": {"title": "PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting", "link": "http://arxiv.org/abs/2210.08964", "description": "This paper presents a new perspective on time series forecasting. In existing\ntime series forecasting methods, the models take a sequence of numerical values\nas input and yield numerical values as output. The existing SOTA models are\nlargely based on the Transformer architecture, modified with multiple encoding\nmechanisms to incorporate the context and semantics around the historical data.\nInspired by the successes of pre-trained language foundation models, we pose a\nquestion about whether these models can also be adapted to solve time-series\nforecasting. Thus, we propose a new forecasting paradigm: prompt-based time\nseries forecasting (PromptCast). In this novel task, the numerical input and\noutput are transformed into prompts and the forecasting task is framed in a\nsentence-to-sentence manner, making it possible to directly apply language\nmodels for forecasting purposes. To support and facilitate the research of this\ntask, we also present a large-scale dataset (PISA) that includes three\nreal-world forecasting scenarios. We evaluate different SOTA numerical-based\nforecasting methods and language generation models. The benchmark results with\nvarious forecasting settings demonstrate the proposed PromptCast with language\ngeneration models is a promising research direction. 
Additionally, in\ncomparison to conventional numerical-based forecasting, PromptCast shows a much\nbetter generalization ability under the zero-shot setting."}, "http://arxiv.org/abs/2211.13715": {"title": "Trust Your $\\nabla$: Gradient-based Intervention Targeting for Causal Discovery", "link": "http://arxiv.org/abs/2211.13715", "description": "Inferring causal structure from data is a challenging task of fundamental\nimportance in science. Observational data are often insufficient to identify a\nsystem's causal structure uniquely. While conducting interventions (i.e.,\nexperiments) can improve the identifiability, such samples are usually\nchallenging and expensive to obtain. Hence, experimental design approaches for\ncausal discovery aim to minimize the number of interventions by estimating the\nmost informative intervention target. In this work, we propose a novel\nGradient-based Intervention Targeting method, abbreviated GIT, that 'trusts'\nthe gradient estimator of a gradient-based causal discovery framework to\nprovide signals for the intervention acquisition function. We provide extensive\nexperiments in simulated and real-world datasets and demonstrate that GIT\nperforms on par with competitive baselines, surpassing them in the low-data\nregime."}, "http://arxiv.org/abs/2303.02951": {"title": "The (Surprising) Sample Optimality of Greedy Procedures for Large-Scale Ranking and Selection", "link": "http://arxiv.org/abs/2303.02951", "description": "Ranking and selection (R&S) aims to select the best alternative with the\nlargest mean performance from a finite set of alternatives. Recently,\nconsiderable attention has turned towards the large-scale R&S problem which\ninvolves a large number of alternatives. Ideal large-scale R&S procedures\nshould be sample optimal, i.e., the total sample size required to deliver an\nasymptotically non-zero probability of correct selection (PCS) grows at the\nminimal order (linear order) in the number of alternatives, $k$. Surprisingly,\nwe discover that the na\\\"ive greedy procedure, which keeps sampling the\nalternative with the largest running average, performs strikingly well and\nappears sample optimal. To understand this discovery, we develop a new\nboundary-crossing perspective and prove that the greedy procedure is sample\noptimal for the scenarios where the best mean maintains at least a positive\nconstant away from all other means as $k$ increases. We further show that the\nderived PCS lower bound is asymptotically tight for the slippage configuration\nof means with a common variance. For other scenarios, we consider the\nprobability of good selection and find that the result depends on the growth\nbehavior of the number of good alternatives: if it remains bounded as $k$\nincreases, the sample optimality still holds; otherwise, the result may change.\nMoreover, we propose the explore-first greedy procedures by adding an\nexploration phase to the greedy procedure. The procedures are proven to be\nsample optimal and consistent under the same assumptions. Last, we numerically\ninvestigate the performance of our greedy procedures in solving large-scale R&S\nproblems."}, "http://arxiv.org/abs/2303.07490": {"title": "Comparing the Robustness of Simple Network Scale-Up Method (NSUM) Estimators", "link": "http://arxiv.org/abs/2303.07490", "description": "The network scale-up method (NSUM) is a cost-effective approach to estimating\nthe size or prevalence of a group of people that is hard to reach through a\nstandard survey. 
The basic NSUM involves two steps: estimating respondents'\ndegrees by one of various methods (in this paper we focus on the probe group\nmethod which uses the number of people a respondent knows in various groups of\nknown size), and estimating the prevalence of the hard-to-reach population of\ninterest using respondents' estimated degrees and the number of people they\nreport knowing in the hard-to-reach group. Each of these two steps involves\ntaking either an average of ratios or a ratio of averages. Using the ratio of\naverages for each step has so far been the most common approach. However, we\npresent theoretical arguments that using the average of ratios at the second,\nprevalence-estimation step often has lower mean squared error when the random\nmixing assumption is violated, which seems likely in practice; this estimator\nwhich uses the ratio of averages for degree estimates and the average of ratios\nfor prevalence was proposed early in NSUM development but has largely been\nunexplored and unused. Simulation results using an example network data set\nalso support these findings. Based on this theoretical and empirical evidence,\nwe suggest that future surveys that use a simple estimator may want to use this\nmixed estimator, and estimation methods based on this estimator may produce new\nimprovements."}, "http://arxiv.org/abs/2307.11084": {"title": "GeoCoDA: Recognizing and Validating Structural Processes in Geochemical Data", "link": "http://arxiv.org/abs/2307.11084", "description": "Geochemical data are compositional in nature and are subject to the problems\ntypically associated with data that are restricted to the real non-negative\nnumber space with constant-sum constraint, that is, the simplex. Geochemistry\ncan be considered a proxy for mineralogy, comprised of atomically ordered\nstructures that define the placement and abundance of elements in the mineral\nlattice structure. Based on the innovative contributions of John Aitchison, who\nintroduced the logratio transformation into compositional data analysis, this\ncontribution provides a systematic workflow for assessing geochemical data in a\nsimple and efficient way, such that significant geochemical (mineralogical)\nprocesses can be recognized and validated. This workflow, called GeoCoDA and\npresented here in the form of a tutorial, enables the recognition of processes\nfrom which models can be constructed based on the associations of elements that\nreflect mineralogy. Both the original compositional values and their\ntransformation to logratios are considered. These models can reflect\nrock-forming processes, metamorphism, alteration and ore mineralization.\nMoreover, machine learning methods, both unsupervised and supervised, applied\nto an optimized set of subcompositions of the data, provide a systematic,\naccurate, efficient and defensible approach to geochemical data analysis. 
The\nworkflow is illustrated on lithogeochemical data from exploration of the Star\nkimberlite, consisting of a series of eruptions with five recognized phases."}, "http://arxiv.org/abs/2307.15681": {"title": "A Continuous-Time Dynamic Factor Model for Intensive Longitudinal Data Arising from Mobile Health Studies", "link": "http://arxiv.org/abs/2307.15681", "description": "Intensive longitudinal data (ILD) collected in mobile health (mHealth)\nstudies contain rich information on multiple outcomes measured frequently over\ntime that have the potential to capture short-term and long-term dynamics.\nMotivated by an mHealth study of smoking cessation in which participants\nself-report the intensity of many emotions multiple times per day, we describe\na dynamic factor model that summarizes the ILD as a low-dimensional,\ninterpretable latent process. This model consists of two submodels: (i) a\nmeasurement submodel--a factor model--that summarizes the multivariate\nlongitudinal outcome as lower-dimensional latent variables and (ii) a\nstructural submodel--an Ornstein-Uhlenbeck (OU) stochastic process--that\ncaptures the temporal dynamics of the multivariate latent process in continuous\ntime. We derive a closed-form likelihood for the marginal distribution of the\noutcome and the computationally-simpler sparse precision matrix for the OU\nprocess. We propose a block coordinate descent algorithm for estimation.\nFinally, we apply our method to the mHealth data to summarize the dynamics of\n18 different emotions as two latent processes. These latent processes are\ninterpreted by behavioral scientists as the psychological constructs of\npositive and negative affect and are key in understanding vulnerability to\nlapsing back to tobacco use among smokers attempting to quit."}, "http://arxiv.org/abs/2308.09562": {"title": "Outlier detection for mixed-type data: A novel approach", "link": "http://arxiv.org/abs/2308.09562", "description": "Outlier detection can serve as an extremely important tool for researchers\nfrom a wide range of fields. From the sectors of banking and marketing to the\nsocial sciences and healthcare sectors, outlier detection techniques are very\nuseful for identifying subjects that exhibit different and sometimes peculiar\nbehaviours. When the data set available to the researcher consists of both\ndiscrete and continuous variables, outlier detection presents unprecedented\nchallenges. In this paper we propose a novel method that detects outlying\nobservations in settings of mixed-type data, while reducing the required user\ninteraction and providing general guidelines for selecting suitable\nhyperparameter values. The methodology developed is being assessed through a\nseries of simulations on data sets with varying characteristics and achieves\nvery good performance levels. Our method demonstrates a high capacity for\ndetecting the majority of outliers while minimising the number of falsely\ndetected non-outlying observations. 
The ideas and techniques outlined in the\npaper can be used either as a pre-processing step or in tandem with other data\nmining and machine learning algorithms for developing novel approaches to\nchallenging research problems."}, "http://arxiv.org/abs/2308.15986": {"title": "Sensitivity Analysis for Causal Effects in Observational Studies with Multivalued Treatments", "link": "http://arxiv.org/abs/2308.15986", "description": "One of the fundamental challenges in drawing causal inferences from\nobservational studies is that the assumption of no unmeasured confounding is\nnot testable from observed data. Therefore, assessing sensitivity to this\nassumption's violation is important to obtain valid causal conclusions in\nobservational studies. Although several sensitivity analysis frameworks are\navailable in the causal inference literature, very few of them are applicable\nto observational studies with multivalued treatments. To address this issue, we\npropose a sensitivity analysis framework for performing sensitivity analysis in\nmultivalued treatment settings. Within this framework, a general class of\nadditive causal estimands has been proposed. We demonstrate that the estimation\nof the causal estimands under the proposed sensitivity model can be performed\nvery efficiently. Simulation results show that the proposed framework performs\nwell in terms of bias of the point estimates and coverage of the confidence\nintervals when there is sufficient overlap in the covariate distributions. We\nillustrate the application of our proposed method by conducting an\nobservational study that estimates the causal effect of fish consumption on\nblood mercury levels."}, "http://arxiv.org/abs/2309.01536": {"title": "perms: Likelihood-free estimation of marginal likelihoods for binary response data in Python and R", "link": "http://arxiv.org/abs/2309.01536", "description": "In Bayesian statistics, the marginal likelihood (ML) is the key ingredient\nneeded for model comparison and model averaging. Unfortunately, estimating MLs\naccurately is notoriously difficult, especially for models where posterior\nsimulation is not possible. Recently, Christensen (2023) introduced the concept\nof permutation counting, which can accurately estimate MLs of models for\nexchangeable binary responses. Such data arise in a multitude of statistical\nproblems, including binary classification, bioassay and sensitivity testing.\nPermutation counting is entirely likelihood-free and works for any model from\nwhich a random sample can be generated, including nonparametric models. Here we\npresent perms, a package implementing permutation counting. As a result of\nextensive optimisation efforts, perms is computationally efficient and able to\nhandle large data problems. It is available as both an R package and a Python\nlibrary. A broad gallery of examples illustrating its usage is provided, which\nincludes both standard parametric binary classification and novel applications\nof nonparametric models, such as changepoint analysis. 
We also cover the\ndetails of the implementation of perms and illustrate its computational speed\nvia a simple simulation study."}, "http://arxiv.org/abs/2312.06669": {"title": "An Association Test Based on Kernel-Based Neural Networks for Complex Genetic Association Analysis", "link": "http://arxiv.org/abs/2312.06669", "description": "The advent of artificial intelligence, especially the progress of deep neural\nnetworks, is expected to revolutionize genetic research and offer unprecedented\npotential to decode the complex relationships between genetic variants and\ndisease phenotypes, which could mark a significant step toward improving our\nunderstanding of the disease etiology. While deep neural networks hold great\npromise for genetic association analysis, limited research has been focused on\ndeveloping neural-network-based tests to dissect complex genotype-phenotype\nassociations. This complexity arises from the opaque nature of neural networks\nand the absence of defined limiting distributions. We have previously developed\na kernel-based neural network model (KNN) that synergizes the strengths of\nlinear mixed models with conventional neural networks. KNN adopts a\ncomputationally efficient minimum norm quadratic unbiased estimator (MINQUE)\nalgorithm and uses KNN structure to capture the complex relationship between\nlarge-scale sequencing data and a disease phenotype of interest. In the KNN\nframework, we introduce a MINQUE-based test to assess the joint association of\ngenetic variants with the phenotype, which considers non-linear and\nnon-additive effects and follows a mixture of chi-square distributions. We also\nconstruct two additional tests to evaluate and interpret linear and\nnon-linear/non-additive genetic effects, including interaction effects. Our\nsimulations show that our method consistently controls the type I error rate\nunder various conditions and achieves greater power than a commonly used\nsequence kernel association test (SKAT), especially when involving non-linear\nand interaction effects. When applied to real data from the UK Biobank, our\napproach identified genes associated with hippocampal volume, which can be\nfurther replicated and evaluated for their role in the pathogenesis of\nAlzheimer's disease."}, "http://arxiv.org/abs/2312.06820": {"title": "Extracting Self-Consistent Causal Insights from Users Feedback with LLMs and In-context Learning", "link": "http://arxiv.org/abs/2312.06820", "description": "Microsoft Windows Feedback Hub is designed to receive customer feedback on a\nwide variety of subjects including critical topics such as power and battery.\nFeedback is one of the most effective ways to have a grasp of users' experience\nwith Windows and its ecosystem. However, the sheer volume of feedback received\nby Feedback Hub makes it immensely challenging to diagnose the actual cause of\nreported issues. To better understand and triage issues, we leverage Double\nMachine Learning (DML) to associate users' feedback with telemetry signals. One\nof the main challenges we face in the DML pipeline is the necessity of domain\nknowledge for model design (e.g., causal graph), which sometimes is either not\navailable or hard to obtain. In this work, we take advantage of reasoning\ncapabilities in Large Language Models (LLMs) to generate a prior model\nwhich to some extent compensates for the lack of domain knowledge and could be\nused as a heuristic for measuring feedback informativeness. 
Our LLM-based\napproach is able to extract previously known issues, uncover new bugs, and\nidentify sequences of events that lead to a bug, while minimizing out-of-domain\noutputs."}, "http://arxiv.org/abs/2312.06883": {"title": "Adaptive Experiments Toward Learning Treatment Effect Heterogeneity", "link": "http://arxiv.org/abs/2312.06883", "description": "Understanding treatment effect heterogeneity has become an increasingly\npopular task in various fields, as it helps design personalized advertisements\nin e-commerce or targeted treatment in biomedical studies. However, most of the\nexisting work in this research area focused on either analyzing observational\ndata based on strong causal assumptions or conducting post hoc analyses of\nrandomized controlled trial data, and there has been limited effort dedicated\nto the design of randomized experiments specifically for uncovering treatment\neffect heterogeneity. In the manuscript, we develop a framework for designing\nand analyzing response adaptive experiments toward better learning treatment\neffect heterogeneity. Concretely, we provide response adaptive experimental\ndesign frameworks that sequentially revise the data collection mechanism\naccording to the accrued evidence during the experiment. Such design strategies\nallow for the identification of subgroups with the largest treatment effects\nwith enhanced statistical efficiency. The proposed frameworks not only unify\nadaptive enrichment designs and response-adaptive randomization designs but\nalso complement A/B test designs in e-commerce and randomized trial designs in\nclinical settings. We demonstrate the merit of our design with theoretical\njustifications and in simulation studies with synthetic e-commerce and clinical\ntrial data."}, "http://arxiv.org/abs/2312.07175": {"title": "Instrumental Variable Estimation for Causal Inference in Longitudinal Data with Time-Dependent Latent Confounders", "link": "http://arxiv.org/abs/2312.07175", "description": "Causal inference from longitudinal observational data is a challenging\nproblem due to the difficulty in correctly identifying the time-dependent\nconfounders, especially in the presence of latent time-dependent confounders.\nInstrumental variable (IV) is a powerful tool for addressing the latent\nconfounders issue, but the traditional IV technique cannot deal with latent\ntime-dependent confounders in longitudinal studies. In this work, we propose a\nnovel Time-dependent Instrumental Factor Model (TIFM) for time-varying causal\neffect estimation from data with latent time-dependent confounders. At each\ntime-step, the proposed TIFM method employs the Recurrent Neural Network (RNN)\narchitecture to infer latent IV, and then uses the inferred latent IV factor\nfor addressing the confounding bias caused by the latent time-dependent\nconfounders. We provide a theoretical analysis for the proposed TIFM method\nregarding causal effect estimation in longitudinal data. Extensive evaluation\nwith synthetic datasets demonstrates the effectiveness of TIFM in addressing\ncausal effect estimation over time. 
We further apply TIFM to a climate dataset\nto showcase the potential of the proposed method in tackling real-world\nproblems."}, "http://arxiv.org/abs/2312.07177": {"title": "Fast Meta-Analytic Approximations for Relational Event Models: Applications to Data Streams and Multilevel Data", "link": "http://arxiv.org/abs/2312.07177", "description": "Large relational-event history data stemming from large networks are becoming\nincreasingly available due to recent technological developments (e.g., digital\ncommunication, online databases, etc.). This opens many new doors to learning\nabout complex interaction behavior between actors in temporal social networks.\nThe relational event model has become the gold standard for relational event\nhistory analysis. Currently, however, the main bottleneck to fitting relational\nevent models is of a computational nature in the form of memory storage\nlimitations and computational complexity. Relational event models are therefore\nmainly used for relatively small data sets while larger, more interesting\ndatasets, including multilevel data structures and relational event data\nstreams, cannot be analyzed on standard desktop computers. This paper addresses\nthis problem by developing approximation algorithms based on meta-analysis\nmethods that can fit relational event models significantly faster while\navoiding the computational issues. In particular, meta-analytic approximations\nare proposed for analyzing streams of relational event data and multilevel\nrelational event data and potentially of combinations thereof. The accuracy and\nthe statistical properties of the methods are assessed using numerical\nsimulations. Furthermore, real-world data are used to illustrate the potential\nof the methodology to study social interaction behavior in an organizational\nnetwork and interaction behavior among political actors. The algorithms are\nimplemented in a publicly available R package 'remx'."}, "http://arxiv.org/abs/2312.07262": {"title": "Robust Bayesian graphical modeling using $\\gamma$-divergence", "link": "http://arxiv.org/abs/2312.07262", "description": "The Gaussian graphical model is a powerful tool to analyze conditional\nindependence between two variables for multivariate Gaussian-distributed\nobservations. When the dimension of data is moderate or high, penalized\nlikelihood methods such as the graphical lasso are useful to detect significant\nconditional independence structures. However, the estimates are affected by\noutliers due to the Gaussian assumption. This paper proposes a novel robust\nposterior distribution for inference of Gaussian graphical models using the\n$\\gamma$-divergence which is one of the robust divergences. In particular, we\nfocus on the Bayesian graphical lasso by assuming the Laplace-type prior for\nelements of the inverse covariance matrix. The proposed posterior distribution\nmatches its maximum a posteriori estimate with the minimum $\\gamma$-divergence\nestimate provided by the frequentist penalized method. We show that the\nproposed method satisfies the posterior robustness which is a kind of measure\nof robustness in the Bayesian analysis. The property means that the information\nof outliers is automatically ignored in the posterior distribution as long as\nthe outliers are extremely large, which also provides theoretical robustness of\npoint estimate for the existing frequentist method. 
A sufficient condition for\nthe posterior propriety of the proposed posterior distribution is also shown.\nFurthermore, an efficient posterior computation algorithm via the weighted\nBayesian bootstrap method is proposed. The performance of the proposed method\nis illustrated through simulation studies and real data analysis."}, "http://arxiv.org/abs/2312.07320": {"title": "Convergence rates of non-stationary and deep Gaussian process regression", "link": "http://arxiv.org/abs/2312.07320", "description": "The focus of this work is the convergence of non-stationary and deep Gaussian\nprocess regression. More precisely, we follow a Bayesian approach to regression\nor interpolation, where the prior placed on the unknown function $f$ is a\nnon-stationary or deep Gaussian process, and we derive convergence rates of the\nposterior mean to the true function $f$ in terms of the number of observed\ntraining points. In some cases, we also show convergence of the posterior\nvariance to zero. The only assumption imposed on the function $f$ is that it is\nan element of a certain reproducing kernel Hilbert space, which we in\nparticular cases show to be norm-equivalent to a Sobolev space. Our analysis\nincludes the case of estimated hyper-parameters in the covariance kernels\nemployed, both in an empirical Bayes' setting and the particular hierarchical\nsetting constructed through deep Gaussian processes. We consider the settings\nof noise-free or noisy observations on deterministic or random training points.\nWe establish general assumptions sufficient for the convergence of deep\nGaussian process regression, along with explicit examples demonstrating the\nfulfilment of these assumptions. Specifically, our examples require that the\nH\\\"older or Sobolev norms of the penultimate layer are bounded almost surely."}, "http://arxiv.org/abs/2312.07479": {"title": "Convex Parameter Estimation of Perturbed Multivariate Generalized Gaussian Distributions", "link": "http://arxiv.org/abs/2312.07479", "description": "The multivariate generalized Gaussian distribution (MGGD), also known as the\nmultivariate exponential power (MEP) distribution, is widely used in signal and\nimage processing. However, estimating MGGD parameters, which is required in\npractical applications, still faces specific theoretical challenges. In\nparticular, establishing convergence properties for the standard fixed-point\napproach when both the distribution mean and the scatter (or the precision)\nmatrix are unknown is still an open problem. In robust estimation, imposing\nclassical constraints on the precision matrix, such as sparsity, has been\nlimited by the non-convexity of the resulting cost function. This paper tackles\nthese issues from an optimization viewpoint by proposing a convex formulation\nwith well-established convergence properties. We embed our analysis in a noisy\nscenario where robustness is induced by modelling multiplicative perturbations.\nThe resulting framework is flexible as it combines a variety of regularizations\nfor the precision matrix, the mean and model perturbations. This paper presents\nproof of the desired theoretical properties, specifies the conditions\npreserving these properties for different regularization choices and designs a\ngeneral proximal primal-dual optimization strategy. The experiments show a more\naccurate precision and covariance matrix estimation with similar performance\nfor the mean vector parameter compared to Tyler's M-estimator. 
In a\nhigh-dimensional setting, the proposed method outperforms the classical GLASSO,\none of its robust extensions, and the regularized Tyler's estimator."}, "http://arxiv.org/abs/2205.04324": {"title": "On a wider class of prior distributions for graphical models", "link": "http://arxiv.org/abs/2205.04324", "description": "Gaussian graphical models are useful tools for conditional independence\nstructure inference of multivariate random variables. Unfortunately, Bayesian\ninference of latent graph structures is challenging due to exponential growth\nof $\\mathcal{G}_n$, the set of all graphs in $n$ vertices. One approach that\nhas been proposed to tackle this problem is to limit search to subsets of\n$\\mathcal{G}_n$. In this paper, we study subsets that are vector subspaces with\nthe cycle space $\\mathcal{C}_n$ as main example. We propose a novel prior on\n$\\mathcal{C}_n$ based on linear combinations of cycle basis elements and\npresent its theoretical properties. Using this prior, we implement a Markov\nchain Monte Carlo algorithm, and show that (i) posterior edge inclusion\nestimates computed with our technique are comparable to estimates from the\nstandard technique despite searching a smaller graph space, and (ii) the vector\nspace perspective enables straightforward implementation of MCMC algorithms."}, "http://arxiv.org/abs/2301.10468": {"title": "Model selection-based estimation for generalized additive models using mixtures of g-priors: Towards systematization", "link": "http://arxiv.org/abs/2301.10468", "description": "We consider the estimation of generalized additive models using basis\nexpansions coupled with Bayesian model selection. Although Bayesian model\nselection is an intuitively appealing tool for regression splines, its use has\ntraditionally been limited to Gaussian additive regression because of the\navailability of a tractable form of the marginal model likelihood. We extend\nthe method to encompass the exponential family of distributions using the\nLaplace approximation to the likelihood. Although the approach exhibits success\nwith any Gaussian-type prior distribution, there remains a lack of consensus\nregarding the best prior distribution for nonparametric regression through\nmodel selection. We observe that the classical unit information prior\ndistribution for variable selection may not be well-suited for nonparametric\nregression using basis expansions. Instead, our investigation reveals that\nmixtures of g-priors are more suitable. We consider various mixtures of\ng-priors to evaluate the performance in estimating generalized additive models.\nFurthermore, we conduct a comparative analysis of several priors for knots to\nidentify the most practically effective strategy. Our extensive simulation\nstudies demonstrate the superiority of model selection-based approaches over\nother Bayesian methods."}, "http://arxiv.org/abs/2302.02482": {"title": "Continuously Indexed Graphical Models", "link": "http://arxiv.org/abs/2302.02482", "description": "Let $X = \\{X_{u}\\}_{u \\in U}$ be a real-valued Gaussian process indexed by a\nset $U$. It can be thought of as an undirected graphical model with every\nrandom variable $X_{u}$ serving as a vertex. We characterize this graph in\nterms of the covariance of $X$ through its reproducing kernel property. 
Unlike\nother characterizations in the literature, our characterization does not\nrestrict the index set $U$ to be finite or countable, and hence can be used to\nmodel the intrinsic dependence structure of stochastic processes in continuous\ntime/space. Consequently, this characterization is not in terms of the zero\nentries of an inverse covariance. This poses novel challenges for the problem\nof recovery of the dependence structure from a sample of independent\nrealizations of $X$, also known as structure estimation. We propose a\nmethodology that circumvents these issues, by targeting the recovery of the\nunderlying graph up to a finite resolution, which can be arbitrarily fine and\nis limited only by the available sample size. The recovery is shown to be\nconsistent so long as the graph is sufficiently regular in an appropriate\nsense. We derive corresponding convergence rates and finite sample guarantees.\nOur methodology is illustrated by means of a simulation study and two data\nanalyses."}, "http://arxiv.org/abs/2305.04634": {"title": "Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods", "link": "http://arxiv.org/abs/2305.04634", "description": "In spatial statistics, fast and accurate parameter estimation, coupled with a\nreliable means of uncertainty quantification, can be challenging when fitting a\nspatial process to real-world data because the likelihood function might be\nslow to evaluate or wholly intractable. In this work, we propose using\nconvolutional neural networks to learn the likelihood function of a spatial\nprocess. Through a specifically designed classification task, our neural\nnetwork implicitly learns the likelihood function, even in situations where the\nexact likelihood is not explicitly available. Once trained on the\nclassification task, our neural network is calibrated using Platt scaling which\nimproves the accuracy of the neural likelihood surfaces. To demonstrate our\napproach, we compare neural likelihood surfaces and the resulting maximum\nlikelihood estimates and approximate confidence regions with the equivalent for\nexact or approximate likelihood for two different spatial processes: a Gaussian\nprocess and a Brown-Resnick process which have computationally intensive and\nintractable likelihoods, respectively. We conclude that our method provides\nfast and accurate parameter estimation with a reliable method of uncertainty\nquantification in situations where standard methods are either undesirably slow\nor inaccurate. The method is applicable to any spatial process on a grid from\nwhich fast simulations are available."}, "http://arxiv.org/abs/2308.09112": {"title": "REACT to NHST: Sensible conclusions to meaningful hypotheses", "link": "http://arxiv.org/abs/2308.09112", "description": "While Null Hypothesis Significance Testing (NHST) remains a widely used\nstatistical tool, it suffers from several shortcomings, such as conflating\nstatistical and practical significance, sensitivity to sample size, and the\ninability to distinguish between accepting the null hypothesis and failing to\nreject it. Recent efforts have focused on developing alternatives to NHST to\naddress these issues. Despite these efforts, conventional NHST remains dominant\nin scientific research due to its simplicity and perceived ease of\ninterpretation. Our work presents a novel alternative to NHST that is just as\naccessible and intuitive: REACT. 
It not only tackles the shortcomings of NHST\nbut also offers additional advantages over existing alternatives. For instance,\nREACT is easily applicable to multiparametric hypotheses and does not require\nstringent significance-level corrections when conducting multiple tests. We\nillustrate the practical utility of REACT through real-world data examples,\nusing criteria aligned with common research practices to distinguish between\nthe absence of evidence and evidence of absence."}, "http://arxiv.org/abs/2309.05482": {"title": "A conformal test of linear models via permutation-augmented regressions", "link": "http://arxiv.org/abs/2309.05482", "description": "Permutation tests are widely recognized as robust alternatives to tests based\non normal theory. Random permutation tests have been frequently employed to\nassess the significance of variables in linear models. Despite their widespread\nuse, existing random permutation tests lack finite-sample and assumption-free\nguarantees for controlling type I error in partial correlation tests. To\naddress this ongoing challenge, we have developed a conformal test through\npermutation-augmented regressions, which we refer to as PALMRT. PALMRT not only\nachieves power competitive with conventional methods but also provides reliable\ncontrol of type I errors at no more than $2\\alpha$, given any targeted level\n$\\alpha$, for arbitrary fixed designs and error distributions. We have\nconfirmed this through extensive simulations.\n\nCompared to the cyclic permutation test (CPT) and residual permutation test\n(RPT), which also offer theoretical guarantees, PALMRT does not compromise as\nmuch on power or set stringent requirements on the sample size, making it\nsuitable for diverse biomedical applications. We further illustrate the\ndifferences in a long-Covid study where PALMRT validated key findings\npreviously identified using the t-test after multiple corrections, while both\nCPT and RPT suffered from a drastic loss of power and failed to identify any\ndiscoveries. We endorse PALMRT as a robust and practical hypothesis test in\nscientific research for its superior error control, power preservation, and\nsimplicity. An R package for PALMRT is available at\n\\url{https://github.com/LeyingGuan/PairedRegression}."}, "http://arxiv.org/abs/2309.13001": {"title": "Joint $p$-Values for Higher-Powered Bayesian Model Checking with Frequentist Guarantees", "link": "http://arxiv.org/abs/2309.13001", "description": "We introduce a joint posterior $p$-value, an extension of the posterior\npredictive $p$-value for multiple test statistics, designed to address\nlimitations of existing Bayesian $p$-values in the setting of continuous model\nexpansion. In particular, we show that the posterior predictive $p$-value, as\nwell as its sampled variant, become more conservative as the parameter\ndimension grows, and we demonstrate the ability of the joint $p$-value to\novercome this problem in cases where we can select test statistics that are\nnegatively associated under the posterior. We validate these conclusions with a\npair of simulation examples in which the joint $p$-value achieves substantial\ngains to power with only a modest increase in computational cost."}, "http://arxiv.org/abs/2312.07610": {"title": "Interpretational errors in statistical causal inference", "link": "http://arxiv.org/abs/2312.07610", "description": "We formalize an interpretational error that is common in statistical causal\ninference, termed identity slippage. 
This formalism is used to describe\nhistorically-recognized fallacies, and analyse a fast-growing literature in\nstatistics and applied fields. We conducted a systematic review of natural\nlanguage claims in the literature on stochastic mediation parameters, and\ndocumented extensive evidence of identity slippage in applications. This\nframework for error detection is applicable whenever policy decisions depend on\nthe accurate interpretation of statistical results, which is nearly always the\ncase. Therefore, broad awareness of identity slippage will aid statisticians in\nthe successful translation of data into public good."}, "http://arxiv.org/abs/2312.07616": {"title": "Evaluating the Alignment of a Data Analysis between Analyst and Audience", "link": "http://arxiv.org/abs/2312.07616", "description": "A challenge that data analysts face is building a data analysis that is\nuseful for a given consumer. Previously, we defined a set of principles for\ndescribing data analyses that can be used to create a data analysis and to\ncharacterize the variation between analyses. Here, we introduce a concept that\nwe call the alignment of a data analysis between the data analyst and a\nconsumer. We define a successfully aligned data analysis as the matching of\nprinciples between the analyst and the consumer for whom the analysis is\ndeveloped. In this paper, we propose a statistical model for evaluating the\nalignment of a data analysis and describe some of its properties. We argue that\nthis framework provides a language for characterizing alignment and can be used\nas a guide for practicing data scientists and students in data science courses\nfor how to build better data analyses."}, "http://arxiv.org/abs/2312.07619": {"title": "Estimating Causal Impacts of Scaling a Voluntary Policy Intervention", "link": "http://arxiv.org/abs/2312.07619", "description": "Evaluations often inform future program implementation decisions. However,\nthe implementation context may differ, sometimes substantially, from the\nevaluation study context. This difference leads to uncertainty regarding the\nrelevance of evaluation findings to future decisions. Voluntary interventions\npose another challenge to generalizability, as we do not know precisely who\nwill volunteer for the intervention in the future. We present a novel approach\nfor estimating target population average treatment effects among the treated by\ngeneralizing results from an observational study to projected volunteers within\nthe target population (the treated group). Our estimation approach can\naccommodate flexible outcome regression estimators such as Bayesian Additive\nRegression Trees (BART) and Bayesian Causal Forests (BCF). Our generalizability\napproach incorporates uncertainty regarding target population treatment status\ninto the posterior credible intervals to better reflect the uncertainty of\nscaling a voluntary intervention. In a simulation based on real data, we\ndemonstrate that these flexible estimators (BCF and BART) improve performance\nover estimators that rely on parametric regressions. 
We use our approach to\nestimate impacts of scaling up Comprehensive Primary Care Plus, a health care\npayment model intended to improve quality and efficiency of primary care, and\nwe demonstrate the promise of scaling to a targeted subgroup of practices."}, "http://arxiv.org/abs/2312.07697": {"title": "A Class of Computational Methods to Reduce Selection Bias when Designing Phase 3 Clinical Trials", "link": "http://arxiv.org/abs/2312.07697", "description": "When designing confirmatory Phase 3 studies, one usually evaluates one or\nmore efficacious and safe treatment option(s) based on data from previous\nstudies. However, several retrospective research articles reported the\nphenomenon of ``diminished treatment effect in Phase 3'' based on many case\nstudies. Even under basic assumptions, it was shown that the commonly used\nestimator could substantially overestimate the efficacy of selected group(s).\nAs alternatives, we propose a class of computational methods to reduce\nestimation bias and mean squared error (MSE) with a broader scope of multiple\ntreatment groups and flexibility to accommodate summary results by group as\ninput. Based on simulation studies and a real data example, we provide\npractical implementation guidance for this class of methods under different\nscenarios. For more complicated problems, our framework can serve as a starting\npoint with additional layers built in. Proposed methods can also be widely\napplied to other selection problems."}, "http://arxiv.org/abs/2312.07704": {"title": "Distribution of the elemental regression weights with t-distributed co-variate measurement errors", "link": "http://arxiv.org/abs/2312.07704", "description": "In this article, a heuristic approach is used to determine the best\napproximate distribution of $\\dfrac{Y_1}{Y_1 + Y_2}$, given that $Y_1,Y_2$ are\nindependent, and each of $Y_1$ and $Y_2$ is distributed as the\n$\\mathcal{F}$-distribution with common denominator degrees of freedom. The\nproposed approximate distribution is subject to graphical comparisons and\ndistributional tests. The proposed distribution is used to derive the\ndistribution of the elemental regression weight $\\omega_E$, where $E$ is the\nelemental regression set."}, "http://arxiv.org/abs/2312.07727": {"title": "Two-sample inference for sparse functional data", "link": "http://arxiv.org/abs/2312.07727", "description": "We propose a novel test procedure for comparing mean functions across two\ngroups within the reproducing kernel Hilbert space (RKHS) framework. Our\nproposed method is adept at handling sparsely and irregularly sampled\nfunctional data when observation times are random for each subject.\nConventional approaches, which are built upon functional principal components\nanalysis, usually assume a homogeneous covariance structure across groups.\nNonetheless, justifying this assumption in real-world scenarios can be\nchallenging. To eliminate the need for a homogeneous covariance structure, we\nfirst develop the functional Bahadur representation for the mean estimator\nunder the RKHS framework; this representation naturally leads to the desirable\npointwise limiting distributions. Moreover, we establish weak convergence for\nthe mean estimator, allowing us to construct a test statistic for the mean\ndifference. Our method is easily implementable and outperforms some\nconventional tests in controlling type I errors across various settings. 
We\ndemonstrate the finite sample performance of our approach through extensive\nsimulations and two real-world applications."}, "http://arxiv.org/abs/2312.07741": {"title": "Robust Functional Principal Component Analysis for Non-Euclidean Random Objects", "link": "http://arxiv.org/abs/2312.07741", "description": "Functional data analysis offers a diverse toolkit of statistical methods\ntailored for analyzing samples of real-valued random functions. Recently,\nsamples of time-varying random objects, such as time-varying networks, have\nbeen increasingly encountered in modern data analysis. These data structures\nrepresent elements within general metric spaces that lack local or global\nlinear structures, rendering traditional functional data analysis methods\ninapplicable. Moreover, the existing methodology for time-varying random\nobjects does not work well in the presence of outlying objects. In this paper,\nwe propose a robust method for analysing time-varying random objects. Our\nmethod employs pointwise Fr\\'{e}chet medians and then constructs pointwise\ndistance trajectories between the individual time courses and the sample\nFr\\'{e}chet medians. This representation effectively transforms time-varying\nobjects into functional data. A novel robust approach to functional principal\ncomponent analysis based on a Winsorized U-statistic estimator of the\ncovariance structure is introduced. The proposed robust analysis of these\ndistance trajectories is able to identify key features of time-varying objects\nand is useful for downstream analysis. To illustrate the efficacy of our\napproach, numerical studies focusing on dynamic networks are conducted. The\nresults indicate that the proposed method exhibits good all-round performance\nand surpasses the existing approach in terms of robustness, showcasing its\nsuperior performance in handling time-varying objects data."}, "http://arxiv.org/abs/2312.07775": {"title": "On the construction of stationary processes and random fields", "link": "http://arxiv.org/abs/2312.07775", "description": "We propose a new method to construct a stationary process and random field\nwith a given convex, decreasing covariance function and any one-dimensional\nmarginal distribution. The result is a new class of stationary processes and\nrandom fields. The construction method provides a simple, unified approach for\na wide range of covariance functions and any one-dimensional marginal\ndistributions, and it allows a new way to model dependence structures in a\nstationary process/random field as its dependence structure is induced by the\ncorrelation structure of a few disjoint sets in the support set of the marginal\ndistribution."}, "http://arxiv.org/abs/2312.07792": {"title": "Differentially private projection-depth-based medians", "link": "http://arxiv.org/abs/2312.07792", "description": "We develop $(\\epsilon,\\delta)$-differentially private projection-depth-based\nmedians using the propose-test-release (PTR) and exponential mechanisms. Under\ngeneral conditions on the input parameters and the population measure, (e.g. we\ndo not assume any moment bounds), we quantify the probability the test in PTR\nfails, as well as the cost of privacy via finite sample deviation bounds. We\ndemonstrate our main result on the canonical projection-depth-based median. 
In\nthe Gaussian setting, we show that the resulting deviation bound matches the\nknown lower bound for private Gaussian mean estimation, up to a polynomial\nfunction of the condition number of the covariance matrix. In the Cauchy\nsetting, we show that the ``outlier error amplification'' effect resulting from\nthe heavy tails outweighs the cost of privacy. This result is then verified via\nnumerical simulations. Additionally, we present results on general PTR\nmechanisms and a uniform concentration result on the projected spacings of\norder statistics."}, "http://arxiv.org/abs/2312.07829": {"title": "How to Select Covariates for Imputation-Based Regression Calibration Method -- A Causal Perspective", "link": "http://arxiv.org/abs/2312.07829", "description": "In this paper, we identify the criteria for the selection of the minimal and\nmost efficient covariate adjustment sets for the regression calibration method\ndeveloped by Carroll, Rupert and Stefanski (CRS, 1992), used to correct bias\ndue to continuous exposure measurement error. We utilize directed acyclic\ngraphs to illustrate how subject matter knowledge can aid in the selection of\nsuch adjustment sets. Valid measurement error correction requires the\ncollection of data on any (1) common causes of true exposure and outcome and\n(2) common causes of measurement error and outcome, in both the main study and\nvalidation study. For the CRS regression calibration method to be valid,\nresearchers need to minimally adjust for covariate set (1) in both the\nmeasurement error model (MEM) and the outcome model and adjust for covariate\nset (2) at least in the MEM. In practice, we recommend including the minimal\ncovariate adjustment set in both the MEM and the outcome model. In contrast\nwith the regression calibration method developed by Rosner, Spiegelman and\nWillet, it is valid and more efficient to adjust for correlates of the true\nexposure or of measurement error that are not risk factors in the MEM only\nunder CRS method. We applied the proposed covariate selection approach to the\nHealth Professional Follow-up Study, examining the effect of fiber intake on\ncardiovascular incidence. In this study, we demonstrated potential issues with\na data-driven approach to building the MEM that is agnostic to the structural\nassumptions. We extend the originally proposed estimators to settings where\neffect modification by a covariate is allowed. Finally, we caution against the\nuse of the regression calibration method to calibrate the true nutrition intake\nusing biomarkers."}, "http://arxiv.org/abs/2312.07873": {"title": "Causal Integration of Multiple Cancer Cohorts with High-Dimensional Confounders: Bayesian Propensity Score Estimation", "link": "http://arxiv.org/abs/2312.07873", "description": "Comparative meta-analyses of patient groups by integrating multiple\nobservational studies rely on estimated propensity scores (PSs) to mitigate\nconfounder imbalances. However, PS estimation grapples with the theoretical and\npractical challenges posed by high-dimensional confounders. Motivated by an\nintegrative analysis of breast cancer patients across seven medical centers,\nthis paper tackles the challenges associated with integrating multiple\nobservational datasets and offering nationally interpretable results. The\nproposed inferential technique, called Bayesian Motif Submatrices for\nConfounders (B-MSMC), addresses the curse of dimensionality by a hybrid of\nBayesian and frequentist approaches. 
B-MSMC uses nonparametric Bayesian\n``Chinese restaurant'' processes to eliminate redundancy in the high-dimensional\nconfounders and discover latent motifs or lower-dimensional structure. With\nthese motifs as potential predictors, standard regression techniques can be\nutilized to accurately infer the PSs and facilitate causal group comparisons.\nSimulations and meta-analysis of the motivating cancer investigation\ndemonstrate the efficacy of our proposal in high-dimensional causal inference\nby integrating multiple observational studies; using different weighting\nmethods, we apply the B-MSMC approach to efficiently address confounding when\nintegrating observational health studies with high-dimensional confounders."}, "http://arxiv.org/abs/2312.07882": {"title": "A non-parametric approach for estimating consumer valuation distributions using second price auctions", "link": "http://arxiv.org/abs/2312.07882", "description": "We focus on online second price auctions, where bids are made sequentially,\nand the winning bidder pays the maximum of the second-highest bid and a seller\nspecified reserve price. For many such auctions, the seller does not see all\nthe bids or the total number of bidders accessing the auction, and only\nobserves the current selling prices throughout the course of the auction. We\ndevelop a novel non-parametric approach to estimate the underlying consumer\nvaluation distribution based on this data. Previous non-parametric approaches\nin the literature only use the final selling price and assume knowledge of the\ntotal number of bidders. The resulting estimate, in particular, can be used by\nthe seller to compute the optimal profit-maximizing price for the product. Our\napproach is free of tuning parameters, and we demonstrate its computational and\nstatistical efficiency in a variety of simulation settings, and also on an Xbox\n7-day auction dataset on eBay."}, "http://arxiv.org/abs/2312.08040": {"title": "Markov's Equality and Post-hoc (Anytime) Valid Inference", "link": "http://arxiv.org/abs/2312.08040", "description": "We present Markov's equality: a tight version of Markov's inequality that\ndoes not impose further assumptions on the random variable. We show that\nthis equality, as well as Markov's inequality and its randomized improvement,\nare directly implied by a set of deterministic inequalities. We apply Markov's\nequality to show that standard tests based on $e$-values and $e$-processes are\npost-hoc (anytime) valid: the tests remain valid, even if the level $\\alpha$ is\nselected after observing the data. In fact, we show that this property\ncharacterizes $e$-values and $e$-processes."}, "http://arxiv.org/abs/2312.08150": {"title": "Active learning with biased non-response to label requests", "link": "http://arxiv.org/abs/2312.08150", "description": "Active learning can improve the efficiency of training prediction models by\nidentifying the most informative new labels to acquire. However, non-response\nto label requests can impact active learning's effectiveness in real-world\ncontexts. We conceptualise this degradation by considering the type of\nnon-response present in the data, demonstrating that biased non-response is\nparticularly detrimental to model performance. We argue that this sort of\nnon-response is particularly likely in contexts where the labelling process, by\nnature, relies on user interactions. 
To mitigate the impact of biased\nnon-response, we propose a cost-based correction to the sampling strategy--the\nUpper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly,\nbe applied to any active learning algorithm. Through experiments, we\ndemonstrate that our method successfully reduces the harm from labelling\nnon-response in many settings. However, we also characterise settings where the\nnon-response bias in the annotations remains detrimental under UCB-EU for\nparticular sampling methods and data generating processes. Finally, we evaluate\nour method on a real-world dataset from e-commerce platform Taobao. We show\nthat UCB-EU yields substantial performance improvements to conversion models\nthat are trained on clicked impressions. Most generally, this research serves\nto both better conceptualise the interplay between types of non-response and\nmodel improvements via active learning, and to provide a practical, easy to\nimplement correction that helps mitigate model degradation."}, "http://arxiv.org/abs/2312.08169": {"title": "Efficiency of Multivariate Tests in Trials in Progressive Supranuclear Palsy", "link": "http://arxiv.org/abs/2312.08169", "description": "Measuring disease progression in clinical trials for testing novel treatments\nfor multifaceted diseases such as Progressive Supranuclear Palsy (PSP) remains\nchallenging. In this study, we assess a range of statistical approaches to\ncompare outcomes measured by the items of the Progressive Supranuclear Palsy\nRating Scale (PSPRS). We consider several statistical approaches, including sum\nscores, such as an FDA-recommended version of the PSPRS, multivariate tests, and\nanalysis approaches based on multiple comparisons of the individual items. We\npropose two novel approaches which measure disease status based on Item\nResponse Theory models. We assess the performance of these tests in an\nextensive simulation study and illustrate their use with a re-analysis of the\nABBV-8E12 clinical trial. Furthermore, we discuss the impact of the\nFDA-recommended scoring of item scores on the power of the statistical tests.\nWe find that classical approaches such as the PSPRS sum score demonstrate moderate\nto high power when treatment effects are consistent across the individual\nitems. The tests based on Item Response Theory models yield the highest power\nwhen the simulated data are generated from an IRT model. The multiple testing\nbased approaches have a higher power in settings where the treatment effect is\nlimited to certain domains or items. The FDA-recommended item rescoring tends\nto decrease the simulated power. The study shows that there is no\none-size-fits-all testing procedure for evaluating treatment effects using\nPSPRS items; the optimal method varies based on the specific effect size\npatterns. The efficiency of the PSPRS sum score, while generally robust and\nstraightforward to apply, varies depending on the effect size patterns\nencountered, and more powerful alternatives are available in specific settings.\nThese findings can have important implications for the design of future\nclinical trials in PSP."}, "http://arxiv.org/abs/2011.06663": {"title": "Patient Recruitment Using Electronic Health Records Under Selection Bias: a Two-phase Sampling Framework", "link": "http://arxiv.org/abs/2011.06663", "description": "Electronic health records (EHRs) are increasingly recognized as a\ncost-effective resource for patient recruitment in clinical research. 
However,\nhow to optimally select a cohort from millions of individuals to answer a\nscientific question of interest remains unclear. Consider a study to estimate\nthe mean or mean difference of an expensive outcome. Inexpensive auxiliary\ncovariates predictive of the outcome may often be available in patients' health\nrecords, presenting an opportunity to recruit patients selectively which may\nimprove efficiency in downstream analyses. In this paper, we propose a\ntwo-phase sampling design that leverages available information on auxiliary\ncovariates in EHR data. A key challenge in using EHR data for multi-phase\nsampling is the potential selection bias, because EHR data are not necessarily\nrepresentative of the target population. Extending existing literature on\ntwo-phase sampling design, we derive an optimal two-phase sampling method that\nimproves efficiency over random sampling while accounting for the potential\nselection bias in EHR data. We demonstrate the efficiency gain from our\nsampling design via simulation studies and an application to evaluating the\nprevalence of hypertension among US adults leveraging data from the Michigan\nGenomics Initiative, a longitudinal biorepository in Michigan Medicine."}, "http://arxiv.org/abs/2201.05102": {"title": "Space-time extremes of severe US thunderstorm environments", "link": "http://arxiv.org/abs/2201.05102", "description": "Severe thunderstorms cause substantial economic and human losses in the\nUnited States. Simultaneous high values of convective available potential\nenergy (CAPE) and storm relative helicity (SRH) are favorable to severe\nweather, and both they and the composite variable\n$\\mathrm{PROD}=\\sqrt{\\mathrm{CAPE}} \\times \\mathrm{SRH}$ can be used as\nindicators of severe thunderstorm activity. Their extremal spatial dependence\nexhibits temporal non-stationarity due to seasonality and large-scale\natmospheric signals such as El Ni\\~no-Southern Oscillation (ENSO). In order to\ninvestigate this, we introduce a space-time model based on a max-stable,\nBrown--Resnick, field whose range depends on ENSO and on time through a tensor\nproduct spline. We also propose a max-stability test based on empirical\nlikelihood and the bootstrap. The marginal and dependence parameters must be\nestimated separately owing to the complexity of the model, and we develop a\nbootstrap-based model selection criterion that accounts for the marginal\nuncertainty when choosing the dependence model. In the case study, the\nout-sample performance of our model is good. We find that extremes of PROD,\nCAPE and SRH are generally more localized in summer and, in some regions, less\nlocalized during El Ni\\~no and La Ni\\~na events, and give meteorological\ninterpretations of these phenomena."}, "http://arxiv.org/abs/2202.07277": {"title": "Exploiting deterministic algorithms to perform global sensitivity analysis for continuous-time Markov chain compartmental models with application to epidemiology", "link": "http://arxiv.org/abs/2202.07277", "description": "In this paper, we propose a generic approach to perform global sensitivity\nanalysis (GSA) for compartmental models based on continuous-time Markov chains\n(CTMC). This approach enables a complete GSA for epidemic models, in which not\nonly the effects of uncertain parameters such as epidemic parameters\n(transmission rate, mean sojourn duration in compartments) are quantified, but\nalso those of intrinsic randomness and interactions between the two. 
The main\nstep in our approach is to build a deterministic representation of the\nunderlying continuous-time Markov chain by controlling the latent variables\nmodeling intrinsic randomness. Then, model output can be written as a\ndeterministic function of both uncertain parameters and controlled latent\nvariables, so that it becomes possible to compute standard variance-based\nsensitivity indices, e.g. the so-called Sobol' indices. However, different\nsimulation algorithms lead to different representations. We exhibit in this\nwork three different representations for CTMC stochastic compartmental models\nand discuss the results obtained by implementing and comparing GSAs based on\neach of these representations on a SARS-CoV-2 epidemic model."}, "http://arxiv.org/abs/2210.08589": {"title": "Anytime-Valid Linear Models and Regression Adjusted Causal Inference in Randomized Experiments", "link": "http://arxiv.org/abs/2210.08589", "description": "Linear regression adjustment is commonly used to analyse randomised\ncontrolled experiments due to its efficiency and robustness against model\nmisspecification. Current testing and interval estimation procedures leverage\nthe asymptotic distribution of such estimators to provide Type-I error and\ncoverage guarantees that hold only at a single sample size. Here, we develop\nthe theory for the anytime-valid analogues of such procedures, enabling linear\nregression adjustment in the sequential analysis of randomised experiments. We\nfirst provide sequential $F$-tests and confidence sequences for the parametric\nlinear model, which provide time-uniform Type-I error and coverage guarantees\nthat hold for all sample sizes. We then relax all linear model parametric\nassumptions in randomised designs and provide nonparametric model-free\nsequential tests and confidence sequences for treatment effects. This formally\nallows experiments to be continuously monitored for significance, stopped\nearly, and safeguards against statistical malpractices in data collection. A\nparticular feature of our results is their simplicity. Our test statistics and\nconfidence sequences all emit closed-form expressions, which are functions of\nstatistics directly available from a standard linear regression table. We\nillustrate our methodology with the sequential analysis of software A/B\nexperiments at Netflix, performing regression adjustment with pre-treatment\noutcomes."}, "http://arxiv.org/abs/2212.02505": {"title": "Shared Differential Clustering across Single-cell RNA Sequencing Datasets with the Hierarchical Dirichlet Process", "link": "http://arxiv.org/abs/2212.02505", "description": "Single-cell RNA sequencing (scRNA-seq) is powerful technology that allows\nresearchers to understand gene expression patterns at the single-cell level.\nHowever, analysing scRNA-seq data is challenging due to issues and biases in\ndata collection. In this work, we construct an integrated Bayesian model that\nsimultaneously addresses normalization, imputation and batch effects and also\nnonparametrically clusters cells into groups across multiple datasets. A Gibbs\nsampler based on a finite-dimensional approximation of the HDP is developed for\nposterior inference."}, "http://arxiv.org/abs/2305.14118": {"title": "Notes on Causation, Comparison, and Regression", "link": "http://arxiv.org/abs/2305.14118", "description": "Comparison and contrast are the basic means to unveil causation and learn\nwhich treatments work. 
To build good comparison groups, randomized\nexperimentation is key, yet often infeasible. In such non-experimental\nsettings, we illustrate and discuss diagnostics to assess how well the common\nlinear regression approach to causal inference approximates desirable features\nof randomized experiments, such as covariate balance, study representativeness,\ninterpolated estimation, and unweighted analyses. We also discuss alternative\nregression modeling, weighting, and matching approaches and argue they should\nbe given strong consideration in empirical work."}, "http://arxiv.org/abs/2312.08391": {"title": "Performance of capture-recapture population size estimators under covariate information", "link": "http://arxiv.org/abs/2312.08391", "description": "Capture-recapture methods for estimating the total size of elusive\npopulations are widely-used, however, due to the choice of estimator impacting\nupon the results and conclusions made, the question of performance of each\nestimator is raised. Motivated by an application of the estimators which allow\ncovariate information to meta-analytic data focused on the prevalence rate of\ncompleted suicide after bariatric surgery, where studies with no completed\nsuicides did not occur, this paper explores the performance of the estimators\nthrough use of a simulation study. The simulation study addresses the\nperformance of the Horvitz-Thompson, generalised Chao and generalised Zelterman\nestimators, in addition to performance of the analytical approach to variance\ncomputation. Given that the estimators vary in their dependence on\ndistributional assumptions, additional simulations are utilised to address the\nquestion of the impact outliers have on performance and inference."}, "http://arxiv.org/abs/2312.08530": {"title": "Using Model-Assisted Calibration Methods to Improve Efficiency of Regression Analyses with Two-Phase Samples under Complex Survey Designs", "link": "http://arxiv.org/abs/2312.08530", "description": "Two-phase sampling designs are frequently employed in epidemiological studies\nand large-scale health surveys. In such designs, certain variables are\nexclusively collected within a second-phase random subsample of the initial\nfirst-phase sample, often due to factors such as high costs, response burden,\nor constraints on data collection or measurement assessment. Consequently,\nsecond-phase sample estimators can be inefficient due to the diminished sample\nsize. Model-assisted calibration methods have been widely used to improve the\nefficiency of second-phase estimators. However, none of the existing methods\nhave considered the complexities arising from the intricate sample designs\npresent in both first- and second-phase samples in regression analyses. This\npaper proposes to calibrate the sample weights for the second-phase subsample\nto the weighted first-phase sample based on influence functions of regression\ncoefficients for a prediction of the covariate of interest, which can be\ncomputed for the entire first-phase sample. We establish the consistency of the\nproposed calibration estimation and provide variance estimation. Empirical\nevidence underscores the robustness of calibration on influence functions\ncompared to the imputation method, which can be sensitive to misspecified\nprediction models for the variable only collected in the second phase. 
Examples\nusing data from the National Health and Nutrition Examination Survey are\nprovided."}, "http://arxiv.org/abs/2312.08570": {"title": "(Re-)reading Sklar (1959) -- A personal view on Sklar's theorem", "link": "http://arxiv.org/abs/2312.08570", "description": "Some personal thoughts on Sklar's theorem after reading the original paper\n(Sklar, 1959) in French."}, "http://arxiv.org/abs/2312.08587": {"title": "Bayesian Tensor Modeling for Image-based Classification of Alzheimer's Disease", "link": "http://arxiv.org/abs/2312.08587", "description": "Tensor-based representations are being increasingly used to represent complex\ndata types such as imaging data, due to their appealing properties such as\ndimension reduction and the preservation of spatial information. Recently,\nthere has been a growing literature on using Bayesian scalar-on-tensor regression\ntechniques that use tensor-based representations for high-dimensional and\nspatially distributed covariates to predict continuous outcomes. However,\nsurprisingly, there is limited development on corresponding Bayesian\nclassification methods relying on tensor-valued covariates. Standard approaches\nthat vectorize the image are not desirable due to the loss of spatial\nstructure, and alternate methods that use extracted features from the image in\nthe predictive model may suffer from information loss. We propose a novel data\naugmentation-based Bayesian classification approach relying on tensor-valued\ncovariates, with a focus on imaging predictors. We propose two data\naugmentation schemes, one resulting in a support vector machine (SVM)\nclassifier, and another yielding a logistic regression classifier. While both\ntypes of classifiers have been proposed independently in the literature, our\ncontribution is to extend such existing methodology to accommodate\nhigh-dimensional tensor-valued predictors that involve low-rank decompositions\nof the coefficient matrix while preserving the spatial information in the\nimage. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for\nimplementing these methods. Simulation studies show significant improvements in\nclassification accuracy and parameter estimation compared to routinely used\nclassification methods. We further illustrate our method in a neuroimaging\napplication using cortical thickness MRI data from the Alzheimer's Disease\nNeuroimaging Initiative, with results displaying better classification accuracy\nthroughout several classification tasks."}, "http://arxiv.org/abs/2312.08670": {"title": "Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation", "link": "http://arxiv.org/abs/2312.08670", "description": "In the field of intracity freight transportation, changes in order volume are\nsignificantly influenced by temporal and spatial factors. When building subsidy\nand pricing strategies, predicting the causal effects of these strategies on\norder volume is crucial. In the process of calculating causal effects,\nconfounding variables can have an impact. Traditional methods to control\nconfounding variables handle data from a holistic perspective, which cannot\nensure the precision of causal effects in specific temporal and spatial\ndimensions. However, temporal and spatial dimensions are extremely critical in\nthe logistics field, and this limitation may directly affect the precision of\nsubsidy and pricing strategies. To address these issues, this study proposes a\ntechnique based on flexible temporal-spatial grid partitioning. 
Furthermore,\nbased on the flexible grid partitioning technique, we further propose a\ncontinuous entropy balancing method in the temporal-spatial domain, which is named\nTS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continuous Treatments).\nThe method proposed in this paper has been tested on two simulation datasets\nand two real datasets, achieving excellent performance on all of them. In\nfact, after applying the TS-EBCT method to the intracity freight transportation\nfield, the prediction accuracy of the causal effect has been significantly\nimproved. It brings good business benefits to the company's subsidy and pricing\nstrategies."}, "http://arxiv.org/abs/2312.08838": {"title": "Bayesian Fused Lasso Modeling for Binary Data", "link": "http://arxiv.org/abs/2312.08838", "description": "L1-norm regularized logistic regression models are widely used for analyzing\ndata with binary response. In those analyses, fusing regression coefficients is\nuseful for detecting groups of variables. This paper proposes a binomial\nlogistic regression model with Bayesian fused lasso. Assuming a Laplace prior\non regression coefficients and differences between adjacent regression\ncoefficients enables us to perform variable selection and variable fusion\nsimultaneously in the Bayesian framework. We also propose assuming a horseshoe\nprior on the differences to improve the flexibility of variable fusion. The\nGibbs sampler is derived to estimate the parameters by a hierarchical\nexpression of priors and a data-augmentation method. Using simulation studies\nand real data analysis, we compare the proposed methods with the existing\nmethod."}, "http://arxiv.org/abs/2206.00560": {"title": "Learning common structures in a collection of networks", "link": "http://arxiv.org/abs/2206.00560", "description": "Let a collection of networks represent interactions within several (social or\necological) systems. We pursue two objectives: identifying similarities in the\ntopological structures that are held in common between the networks and\nclustering the collection into sub-collections of structurally homogeneous\nnetworks. We tackle these two questions with a probabilistic model-based\napproach. We propose an extension of the Stochastic Block Model (SBM) adapted\nto the joint modeling of a collection of networks. The networks in the\ncollection are assumed to be independent realizations of SBMs. The common\nconnectivity structure is imposed through the equality of some parameters. The\nmodel parameters are estimated with a variational Expectation-Maximization (EM)\nalgorithm. We derive an ad-hoc penalized likelihood criterion to select the\nnumber of blocks and to assess the adequacy of the consensus found between the\nstructures of the different networks. This same criterion can also be used to\ncluster networks on the basis of their connectivity structure. It thus provides\na partition of the collection into subsets of structurally homogeneous\nnetworks. The relevance of our proposition is assessed on two collections of\necological networks. First, an application to three stream food webs reveals\nthe homogeneity of their structures and the correspondence between groups of\nspecies in different ecosystems playing equivalent ecological roles. Moreover,\nthe joint analysis allows a finer analysis of the structure of smaller\nnetworks. 
Second, we cluster 67 food webs according to their connectivity\nstructures and demonstrate that five mesoscale structures are sufficient to\ndescribe this collection."}, "http://arxiv.org/abs/2207.08933": {"title": "Change point detection in high dimensional data with U-statistics", "link": "http://arxiv.org/abs/2207.08933", "description": "We consider the problem of detecting distributional changes in a sequence of\nhigh dimensional data. Our approach combines two separate statistics stemming\nfrom $L_p$ norms whose behavior is similar under $H_0$ but potentially\ndifferent under $H_A$, leading to a testing procedure that is flexible\nagainst a variety of alternatives. We establish the asymptotic distribution of\nour proposed test statistics separately in cases of weakly dependent and\nstrongly dependent coordinates as $\\min\\{N,d\\}\\to\\infty$, where $N$ denotes\nsample size and $d$ is the dimension, and establish consistency of testing and\nestimation procedures in high dimensions under one-change alternative settings.\nComputational studies in single and multiple change point scenarios demonstrate\nour method can outperform other nonparametric approaches in the literature for\ncertain alternatives in high dimensions. We illustrate our approach through an\napplication to Twitter data concerning the mentions of U.S. Governors."}, "http://arxiv.org/abs/2303.17642": {"title": "Change Point Detection on A Separable Model for Dynamic Networks", "link": "http://arxiv.org/abs/2303.17642", "description": "This paper studies the change point detection problem in time series of\nnetworks, with the Separable Temporal Exponential-family Random Graph Model\n(STERGM). We consider a sequence of networks generated from a piecewise\nconstant distribution that is altered at unknown change points in time.\nDetection of the change points can identify the discrepancies in the underlying\ndata generating processes and facilitate downstream dynamic network analysis.\nMoreover, the STERGM that focuses on network statistics is a flexible model to\nfit dynamic networks with both dyadic and temporal dependence. We propose a new\nestimator derived from the Alternating Direction Method of Multipliers (ADMM)\nand the Group Fused Lasso to simultaneously detect multiple time points where\nthe parameters of STERGM have changed. We also provide a Bayesian information\ncriterion for model selection to assist the detection. Our experiments show\ngood performance of the proposed method on both simulated and real data.\nLastly, we develop an R package CPDstergm to implement our method."}, "http://arxiv.org/abs/2310.01198": {"title": "Likelihood Based Inference for ARMA Models", "link": "http://arxiv.org/abs/2310.01198", "description": "Autoregressive moving average (ARMA) models are frequently used to analyze\ntime series data. Despite the popularity of these models, algorithms for\nfitting ARMA models have weaknesses that are not well known. We provide a\nsummary of parameter estimation via maximum likelihood and discuss common\npitfalls that may lead to sub-optimal parameter estimates. We propose a random\nrestart algorithm for parameter estimation that frequently yields higher\nlikelihoods than traditional maximum likelihood estimation procedures. We then\ninvestigate the parameter uncertainty of maximum likelihood estimates, and\npropose the use of profile confidence intervals as a superior alternative to\nintervals derived from the Fisher information matrix. 
Through a series of\nsimulation studies, we demonstrate the efficacy of our proposed algorithm and\nthe improved nominal coverage of profile confidence intervals compared to the\nnormal approximation based on Fisher's Information."}, "http://arxiv.org/abs/2312.09303": {"title": "A Physics Based Surrogate Model in Bayesian Uncertainty Quantification involving Elliptic PDEs", "link": "http://arxiv.org/abs/2312.09303", "description": "The paper addresses Bayesian inference in inverse problems with uncertainty\nquantification involving a computationally expensive forward map associated\nwith solving a partial differential equation. To mitigate the computational\ncost, the paper proposes a new surrogate model informed by the physics of the\nproblem, specifically when the forward map involves solving a linear elliptic\npartial differential equation. The study establishes the consistency of the\nposterior distribution for this surrogate model and demonstrates its\neffectiveness through numerical examples with synthetic data. The results\nindicate a substantial improvement in computational speed, reducing the\nprocessing time from several months with the exact forward map to a few\nminutes, while maintaining negligible loss of accuracy in the posterior\ndistribution."}, "http://arxiv.org/abs/2312.09356": {"title": "Sparsity meets correlation in Gaussian sequence model", "link": "http://arxiv.org/abs/2312.09356", "description": "We study estimation of an $s$-sparse signal in the $p$-dimensional Gaussian\nsequence model with equicorrelated observations and derive the minimax rate. A\nnew phenomenon emerges from correlation, namely that the rate scales with respect to\n$p-2s$ and exhibits a phase transition at $p-2s \\asymp \\sqrt{p}$. Correlation\nis shown to be a blessing provided it is sufficiently strong, and the critical\ncorrelation level exhibits a delicate dependence on the sparsity level. Due to\ncorrelation, the minimax rate is driven by two subproblems: estimation of a\nlinear functional (the average of the signal) and estimation of the signal's\n$(p-1)$-dimensional projection onto the orthogonal subspace. The\nhigh-dimensional projection is estimated via sparse regression and the linear\nfunctional is cast as a robust location estimation problem. Existing robust\nestimators turn out to be suboptimal, and we show a kernel mode estimator with\na widening bandwidth exploits the Gaussian character of the data to achieve the\noptimal estimation rate."}, "http://arxiv.org/abs/2312.09422": {"title": "Joint Alignment of Multivariate Quasi-Periodic Functional Data Using Deep Learning", "link": "http://arxiv.org/abs/2312.09422", "description": "The joint alignment of multivariate functional data plays an important role\nin various fields such as signal processing, neuroscience and medicine,\nincluding the statistical analysis of data from wearable devices. Traditional\nmethods often ignore the phase variability and instead focus on the variability\nin the observed amplitude. We present a novel method for joint alignment of\nmultivariate quasi-periodic functions using deep neural networks, decomposing,\nbut retaining all the information in the data by preserving both phase and\namplitude variability. Our proposed neural network uses a special activation of\nthe output that builds on the unit simplex transformation, and we utilize a\nloss function based on the Fisher-Rao metric to train our model. 
Furthermore,\nour method is unsupervised and can provide an optimal common template function\nas well as subject-specific templates. We demonstrate our method on two\nsimulated datasets and one real example, comprising data from 12-lead 10s\nelectrocardiogram recordings."}, "http://arxiv.org/abs/2312.09604": {"title": "Inferring Causality from Time Series data based on Structural Causal Model and its application to Neural Connectomics", "link": "http://arxiv.org/abs/2312.09604", "description": "Inferring causation from time series data is of scientific interest in\ndifferent disciplines, particularly in neural connectomics. While different\napproaches exist in the literature with parametric modeling assumptions, we\nfocus on a non-parametric model for time series satisfying a Markovian\nstructural causal model with stationary distribution and without concurrent\neffects. We show that the model structure can be used to its advantage to\nobtain an elegant algorithm for causal inference from time series based on\nconditional dependence tests, coined the Causal Inference in Time Series (CITS)\nalgorithm. We describe Pearson's partial correlation and Hilbert-Schmidt\ncriterion as candidates for such conditional dependence tests that can be used\nin CITS for the Gaussian and non-Gaussian settings, respectively. We prove the\nmathematical guarantee of the CITS algorithm in recovering the true causal\ngraph, under standard mixing conditions on the underlying time series. We also\nconduct a comparative evaluation of the performance of CITS with other existing\nmethodologies in simulated datasets. We then describe the utility of the\nmethodology in neural connectomics -- in inferring causal functional\nconnectivity from time series of neural activity, and demonstrate its\napplication to a real neurobiological dataset of electro-physiological\nrecordings from the mouse visual cortex recorded by Neuropixel probes."}, "http://arxiv.org/abs/2312.09607": {"title": "Variational excess risk bound for general state space models", "link": "http://arxiv.org/abs/2312.09607", "description": "In this paper, we consider variational autoencoders (VAE) for general state\nspace models. We consider a backward factorization of the variational\ndistributions to analyze the excess risk associated with VAE. Such backward\nfactorizations were recently proposed to perform online variational learning\nand to obtain upper bounds on the variational estimation error. When\nindependent trajectories of sequences are observed and under strong mixing\nassumptions on the state space model and on the variational distribution, we\nprovide an oracle inequality explicit in the number of samples and in the\nlength of the observation sequences. We then derive consequences of this\ntheoretical result. In particular, when the data distribution is given by a\nstate space model, we provide an upper bound for the Kullback-Leibler\ndivergence between the data distribution and its estimator and between the\nvariational posterior and the estimated state space posterior\ndistributions. Under classical assumptions, we prove that our results can be\napplied to Gaussian backward kernels built with dense and recurrent neural\nnetworks."}, "http://arxiv.org/abs/2312.09633": {"title": "Natural gradient Variational Bayes without matrix inversion", "link": "http://arxiv.org/abs/2312.09633", "description": "This paper presents an approach for efficiently approximating the inverse of\nFisher information, a key component in variational Bayes inference. 
A notable\naspect of our approach is the avoidance of analytically computing the Fisher\ninformation matrix and its explicit inversion. Instead, we introduce an\niterative procedure for generating a sequence of matrices that converge to the\ninverse of Fisher information. The natural gradient variational Bayes algorithm\nwithout matrix inversion is provably convergent and achieves a convergence rate\nof order $O(\\log s/s)$, with $s$ the number of iterations. We also obtain a central\nlimit theorem for the iterates. Our algorithm exhibits versatility, making it\napplicable across a diverse array of variational Bayes domains, including\nGaussian approximation and normalizing flow Variational Bayes. We offer a range\nof numerical examples to demonstrate the efficiency and reliability of the\nproposed variational Bayes method."}, "http://arxiv.org/abs/2312.09698": {"title": "Smoothing for age-period-cohort models: a comparison between splines and random process", "link": "http://arxiv.org/abs/2312.09698", "description": "Age-Period-Cohort (APC) models are widely used in the context of modelling\nhealth and demographic data to produce smooth estimates of each time trend.\nWhen smoothing in the context of APC models, there are two main schools:\nfrequentist, using penalised smoothing splines, and Bayesian, using random\nprocesses, with little crossover between them. In this article, we clearly lay\nout the theoretical link between the two schools, provide examples using\nsimulated and real data to highlight similarities and differences, and help a\ngeneral APC user understand potentially inaccessible theory from functional\nanalysis. As intuition suggests, both approaches lead to comparable and almost\nidentical in-sample predictions, but random processes within a Bayesian\napproach might be beneficial for out-of-sample prediction as the sources of\nuncertainty are captured in a more complete way."}, "http://arxiv.org/abs/2312.09758": {"title": "Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach", "link": "http://arxiv.org/abs/2312.09758", "description": "Invariant representation learning (IRL) encourages the prediction from\ninvariant causal features to labels de-confounded from the environments,\nadvancing the technical roadmap of out-of-distribution (OOD) generalization.\nDespite spotlights around, recent theoretical results verified that some causal\nfeatures recovered by IRLs merely pretend domain-invariantly in the training\nenvironments but fail in unseen domains. The \\emph{fake invariance} severely\nendangers OOD generalization since the trustful objective cannot be diagnosed\nand existing causal surgeries are invalid to rectify. In this paper, we review\nan IRL family (InvRat) under the Partially and Fully Informative Invariant\nFeature Structural Causal Models (PIIF SCM /FIIF SCM) respectively, to certify\ntheir weaknesses in representing fake invariant features, then, unify their\ncausal diagrams to propose ReStructured SCM (RS-SCM). RS-SCM can ideally\nrebuild the spurious and the fake invariant features simultaneously. Given\nthis, we further develop an approach based on conditional mutual information\nwith respect to RS-SCM, then rigorously rectify the spurious and fake invariant\neffects. It can be easily implemented by a small feature selection subnet\nintroduced in the IRL family, which is alternatively optimized to achieve our\ngoal. 
Experiments verified the superiority of our approach to fight against the\nfake invariant issue across a variety of OOD generalization benchmarks."}, "http://arxiv.org/abs/2312.09777": {"title": "Weyl formula and thermodynamics of geometric flow", "link": "http://arxiv.org/abs/2312.09777", "description": "We study the Weyl formula for the asymptotic number of eigenvalues of the\nLaplace-Beltrami operator with Dirichlet boundary condition on a Riemannian\nmanifold in the context of geometric flows. Assuming the eigenvalues to be the\nenergies of some associated statistical system, we show that geometric flows\nare directly related with the direction of increasing entropy chosen. For a\nclosed Riemannian manifold we obtain a volume preserving flow of geometry being\nequivalent to the increment of Gibbs entropy function derived from the spectrum\nof Laplace-Beltrami operator. Resemblance with Arnowitt, Deser, and Misner\n(ADM) formalism of gravity is also noted by considering open Riemannian\nmanifolds, directly equating the geometric flow parameter and the direction of\nincreasing entropy as time direction."}, "http://arxiv.org/abs/2312.09825": {"title": "Extreme value methods for estimating rare events in Utopia", "link": "http://arxiv.org/abs/2312.09825", "description": "To capture the extremal behaviour of complex environmental phenomena in\npractice, flexible techniques for modelling tail behaviour are required. In\nthis paper, we introduce a variety of such methods, which were used by the\nLancopula Utopiversity team to tackle the data challenge of the 2023 Extreme\nValue Analysis Conference. This data challenge was split into four sections,\nlabelled C1-C4. Challenges C1 and C2 comprise univariate problems, where the\ngoal is to estimate extreme quantiles for a non-stationary time series\nexhibiting several complex features. We propose a flexible modelling technique,\nbased on generalised additive models, with diagnostics indicating generally\ngood performance for the observed data. Challenges C3 and C4 concern\nmultivariate problems where the focus is on estimating joint extremal\nprobabilities. For challenge C3, we propose an extension of available models in\nthe multivariate literature and use this framework to estimate extreme\nprobabilities in the presence of non-stationary dependence. Finally, for\nchallenge C4, which concerns a 50 dimensional random vector, we employ a\nclustering technique to achieve dimension reduction and use a conditional\nmodelling approach to estimate extremal probabilities across independent groups\nof variables."}, "http://arxiv.org/abs/2312.09862": {"title": "Wasserstein-based Minimax Estimation of Dependence in Multivariate Regularly Varying Extremes", "link": "http://arxiv.org/abs/2312.09862", "description": "We study minimax risk bounds for estimators of the spectral measure in\nmultivariate linear factor models, where observations are linear combinations\nof regularly varying latent factors. Non-asymptotic convergence rates are\nderived for the multivariate Peak-over-Threshold estimator in terms of the\n$p$-th order Wasserstein distance, and information-theoretic lower bounds for\nthe minimax risks are established. The convergence rate of the estimator is\nshown to be minimax optimal under a class of Pareto-type models analogous to\nthe standard class used in the setting of one-dimensional observations known as\nthe Hall-Welsh class. 
When the estimator is minimax inefficient, a novel\ntwo-step estimator is introduced and demonstrated to attain the minimax lower\nbound. Our analysis bridges the gaps in understanding trade-offs between\nestimation bias and variance in multivariate extreme value theory."}, "http://arxiv.org/abs/2312.09884": {"title": "Investigating the heterogeneity of \"study twins\"", "link": "http://arxiv.org/abs/2312.09884", "description": "Meta-analyses are commonly performed based on random-effects models, while in\ncertain cases one might also argue in favour of a common-effect model. One such\ncase may be given by the example of two \"study twins\" that are performed\naccording to a common (or at least very similar) protocol. Here we investigate\nthe particular case of meta-analysis of a pair of studies, e.g. summarizing the\nresults of two confirmatory clinical trials in phase III of a clinical\ndevelopment programme. Thereby, we focus on the question of to what extent\nhomogeneity or heterogeneity may be discernible, and include an empirical\ninvestigation of published (\"twin\") pairs of studies. A pair of estimates from\ntwo studies only provides very little evidence on homogeneity or heterogeneity\nof effects, and ad-hoc decision criteria may often be misleading."}, "http://arxiv.org/abs/2312.09900": {"title": "Integral Fractional Ornstein-Uhlenbeck Process Model for Animal Movement", "link": "http://arxiv.org/abs/2312.09900", "description": "Modeling the trajectories of animals is challenging due to the complexity of\ntheir behaviors, the influence of unpredictable environmental factors,\nindividual variability, and the lack of detailed data on their movements.\nAdditionally, factors such as migration, hunting, reproduction, and social\ninteractions add further layers of complexity when attempting to accurately\nforecast their movements. In the literature, various models exist that aim to\nstudy animal telemetry by modeling the velocity of the telemetry, the\ntelemetry itself, or both processes jointly through a Markovian process. In this\nwork, we propose to model the velocity of each coordinate axis for animal\ntelemetry data as a fractional Ornstein-Uhlenbeck (fOU) process. Then, the\nintegral fOU process models position data in animal telemetry. Compared to\ntraditional methods, the proposed model is flexible in modeling long-range\nmemory. The Hurst parameter $H \\in (0,1)$ is a crucial parameter in the integral\nfOU process, as it determines the degree of dependence or long-range memory.\nThe integral fOU process is a nonstationary process. In addition, a higher Hurst\nparameter ($H > 0.5$) indicates a stronger memory, leading to trajectories with\ntransient trends, while a lower Hurst parameter ($H < 0.5$) implies a weaker\nmemory, resulting in trajectories with recurring trends. When $H = 0.5$, the\nprocess reduces to a standard integral Ornstein-Uhlenbeck process. We develop a\nfast simulation algorithm for telemetry trajectories using an approach via\nfinite-dimensional distributions. 
We also develop a maximum likelihood method\nfor parameter estimation and its performance is examined by simulation studies.\nFinally, we present a telemetry application of Fin Whales that disperse over\nthe Gulf of Mexico."}, "http://arxiv.org/abs/2312.10002": {"title": "On the Invertibility of Euler Integral Transforms with Hyperplanes and Quadric Hypersurfaces", "link": "http://arxiv.org/abs/2312.10002", "description": "The Euler characteristic transform (ECT) is an integral transform used widely\nin topological data analysis. Previous efforts by Curry et al. and Ghrist et\nal. have independently shown that the ECT is injective on all compact definable\nsets. In this work, we study the invertibility of the ECT on definable sets\nthat aren't necessarily compact, resulting in a complete classification of\nconstructible functions that the Euler characteristic transform is not\ninjective on. We then introduce the quadric Euler characteristic transform\n(QECT) as a natural generalization of the ECT by detecting definable shapes\nwith quadric hypersurfaces rather than hyperplanes. We also discuss some\ncriteria for the invertibility of QECT."}, "http://arxiv.org/abs/2108.01327": {"title": "Distributed Inference for Tail Risk", "link": "http://arxiv.org/abs/2108.01327", "description": "For measuring tail risk with scarce extreme events, extreme value analysis is\noften invoked as the statistical tool to extrapolate to the tail of a\ndistribution. The presence of large datasets benefits tail risk analysis by\nproviding more observations for conducting extreme value analysis. However,\nlarge datasets can be stored distributedly preventing the possibility of\ndirectly analyzing them. In this paper, we introduce a comprehensive set of\ntools for examining the asymptotic behavior of tail empirical and quantile\nprocesses in the setting where data is distributed across multiple sources, for\ninstance, when data are stored on multiple machines. Utilizing these tools, one\ncan establish the oracle property for most distributed estimators in extreme\nvalue statistics in a straightforward way. The main theoretical challenge\narises when the number of machines diverges to infinity. The number of machines\nresembles the role of dimensionality in high dimensional statistics. We provide\nvarious examples to demonstrate the practicality and value of our proposed\ntoolkit."}, "http://arxiv.org/abs/2206.04133": {"title": "Bayesian multivariate logistic regression for superiority and inferiority decision-making under observable treatment heterogeneity", "link": "http://arxiv.org/abs/2206.04133", "description": "The effects of treatments may differ between persons with different\ncharacteristics. Addressing such treatment heterogeneity is crucial to\ninvestigate whether patients with specific characteristics are likely to\nbenefit from a new treatment. The current paper presents a novel Bayesian\nmethod for superiority decision-making in the context of randomized controlled\ntrials with multivariate binary responses and heterogeneous treatment effects.\nThe framework is based on three elements: a) Bayesian multivariate logistic\nregression analysis with a P\\'olya-Gamma expansion; b) a transformation\nprocedure to transfer obtained regression coefficients to a more intuitive\nmultivariate probability scale (i.e., success probabilities and the differences\nbetween them); and c) a compatible decision procedure for treatment comparison\nwith prespecified decision error rates. 
Procedures for a priori sample size\nestimation under a non-informative prior distribution are included. A numerical\nevaluation demonstrated that decisions based on a priori sample size estimation\nresulted in anticipated error rates among the trial population as well as\nsubpopulations. Further, average and conditional treatment effect parameters\ncould be estimated unbiasedly when the sample was large enough. Illustration\nwith the International Stroke Trial dataset revealed a trend towards\nheterogeneous effects among stroke patients: Something that would have remained\nundetected when analyses were limited to average treatment effects."}, "http://arxiv.org/abs/2210.06448": {"title": "Debiased inference for a covariate-adjusted regression function", "link": "http://arxiv.org/abs/2210.06448", "description": "In this article, we study nonparametric inference for a covariate-adjusted\nregression function. This parameter captures the average association between a\ncontinuous exposure and an outcome after adjusting for other covariates. In\nparticular, under certain causal conditions, this parameter corresponds to the\naverage outcome had all units been assigned to a specific exposure level, known\nas the causal dose-response curve. We propose a debiased local linear estimator\nof the covariate-adjusted regression function, and demonstrate that our\nestimator converges pointwise to a mean-zero normal limit distribution. We use\nthis result to construct asymptotically valid confidence intervals for function\nvalues and differences thereof. In addition, we use approximation results for\nthe distribution of the supremum of an empirical process to construct\nasymptotically valid uniform confidence bands. Our methods do not require\nundersmoothing, permit the use of data-adaptive estimators of nuisance\nfunctions, and our estimator attains the optimal rate of convergence for a\ntwice differentiable function. We illustrate the practical performance of our\nestimator using numerical studies and an analysis of the effect of air\npollution exposure on cardiovascular mortality."}, "http://arxiv.org/abs/2308.12506": {"title": "General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays", "link": "http://arxiv.org/abs/2308.12506", "description": "We present a general central limit theorem with simple, easy-to-check\ncovariance-based sufficient conditions for triangular arrays of random vectors\nwhen all variables could be interdependent. The result is constructed from\nStein's method, but the conditions are distinct from related work. We show that\nthese covariance conditions nest standard assumptions studied in the literature\nsuch as $M$-dependence, mixing random fields, non-mixing autoregressive\nprocesses, and dependency graphs, which themselves need not imply each other.\nThis permits researchers to work with high-level but intuitive conditions based\non overall correlation instead of more complicated and restrictive conditions\nsuch as strong mixing in random fields that may not have any obvious\nmicro-foundation. 
As examples of the implications, we show how the theorem\nimplies asymptotic normality in estimating: treatment effects with spillovers\nin more settings than previously admitted, covariance matrices, processes with\nglobal dependencies such as epidemic spread and information diffusion, and\nspatial process with Mat\\'{e}rn dependencies."}, "http://arxiv.org/abs/2312.10176": {"title": "Spectral estimation for spatial point processes and random fields", "link": "http://arxiv.org/abs/2312.10176", "description": "Spatial data can come in a variety of different forms, but two of the most\ncommon generating models for such observations are random fields and point\nprocesses. Whilst it is known that spectral analysis can unify these two\ndifferent data forms, specific methodology for the related estimation is yet to\nbe developed. In this paper, we solve this problem by extending multitaper\nestimation, to estimate the spectral density matrix function for multivariate\nspatial data, where processes can be any combination of either point processes\nor random fields. We discuss finite sample and asymptotic theory for the\nproposed estimators, as well as specific details on the implementation,\nincluding how to perform estimation on non-rectangular domains and the correct\nimplementation of multitapering for processes sampled in different ways, e.g.\ncontinuously vs on a regular grid."}, "http://arxiv.org/abs/2312.10234": {"title": "Targeted Machine Learning for Average Causal Effect Estimation Using the Front-Door Functional", "link": "http://arxiv.org/abs/2312.10234", "description": "Evaluating the average causal effect (ACE) of a treatment on an outcome often\ninvolves overcoming the challenges posed by confounding factors in\nobservational studies. A traditional approach uses the back-door criterion,\nseeking adjustment sets to block confounding paths between treatment and\noutcome. However, this method struggles with unmeasured confounders. As an\nalternative, the front-door criterion offers a solution, even in the presence\nof unmeasured confounders between treatment and outcome. This method relies on\nidentifying mediators that are not directly affected by these confounders and\nthat completely mediate the treatment's effect. Here, we introduce novel\nestimation strategies for the front-door criterion based on the targeted\nminimum loss-based estimation theory. Our estimators work across diverse\nscenarios, handling binary, continuous, and multivariate mediators. They\nleverage data-adaptive machine learning algorithms, minimizing assumptions and\nensuring key statistical properties like asymptotic linearity,\ndouble-robustness, efficiency, and valid estimates within the target parameter\nspace. We establish conditions under which the nuisance functional estimations\nensure the root n-consistency of ACE estimators. Our numerical experiments show\nthe favorable finite sample performance of the proposed estimators. We\ndemonstrate the applicability of these estimators to analyze the effect of\nearly stage academic performance on future yearly income using data from the\nFinnish Social Science Data Archive."}, "http://arxiv.org/abs/2312.10388": {"title": "The Causal Impact of Credit Lines on Spending Distributions", "link": "http://arxiv.org/abs/2312.10388", "description": "Consumer credit services offered by e-commerce platforms provide customers\nwith convenient loan access during shopping and have the potential to stimulate\nsales. 
To understand the causal impact of credit lines on spending, previous\nstudies have employed causal estimators, based on direct regression (DR),\ninverse propensity weighting (IPW), and double machine learning (DML) to\nestimate the treatment effect. However, these estimators do not consider the\nnotion that an individual's spending can be understood and represented as a\ndistribution, which captures the range and pattern of amounts spent across\ndifferent orders. By disregarding the outcome as a distribution, valuable\ninsights embedded within the outcome distribution might be overlooked. This\npaper develops a distribution-valued estimator framework that extends existing\nreal-valued DR-, IPW-, and DML-based estimators to distribution-valued\nestimators within Rubin's causal framework. We establish their consistency and\napply them to a real dataset from a large e-commerce platform. Our findings\nreveal that credit lines positively influence spending across all quantiles;\nhowever, as credit lines increase, consumers allocate more to luxuries (higher\nquantiles) than necessities (lower quantiles)."}, "http://arxiv.org/abs/2312.10435": {"title": "Uncertainty Quantification in Heterogeneous Treatment Effect Estimation with Gaussian-Process-Based Partially Linear Model", "link": "http://arxiv.org/abs/2312.10435", "description": "Estimating heterogeneous treatment effects across individuals has attracted\ngrowing attention as a statistical tool for performing critical\ndecision-making. We propose a Bayesian inference framework that quantifies the\nuncertainty in treatment effect estimation to support decision-making in a\nrelatively small sample size setting. Our proposed model places Gaussian\nprocess priors on the nonparametric components of a semiparametric model called\na partially linear model. This model formulation has three advantages. First,\nwe can analytically compute the posterior distribution of a treatment effect\nwithout relying on the computationally demanding posterior approximation.\nSecond, we can guarantee that the posterior distribution concentrates around\nthe true one as the sample size goes to infinity. Third, we can incorporate\nprior knowledge about a treatment effect into the prior distribution, improving\nthe estimation efficiency. Our experimental results show that even in the small\nsample size setting, our method can accurately estimate the heterogeneous\ntreatment effects and effectively quantify its estimation uncertainty."}, "http://arxiv.org/abs/2312.10499": {"title": "Censored extreme value estimation", "link": "http://arxiv.org/abs/2312.10499", "description": "A novel and comprehensive methodology designed to tackle the challenges posed\nby extreme values in the context of random censorship is introduced. The main\nfocus is the analysis of integrals based on the product-limit estimator of\nnormalized top-order statistics, denoted extreme Kaplan--Meier integrals. These\nintegrals allow for transparent derivation of various important asymptotic\ndistributional properties, offering an alternative approach to conventional\nplug-in estimation methods. Notably, this methodology demonstrates robustness\nand wide applicability within the scope of max-domains of attraction. 
An\nadditional noteworthy by-product is the extension of residual estimation of\nextremes to encompass all max-domains of attraction, which is of independent\ninterest."}, "http://arxiv.org/abs/2312.10541": {"title": "Random Measures, ANOVA Models and Quantifying Uncertainty in Randomized Controlled Trials", "link": "http://arxiv.org/abs/2312.10541", "description": "This short paper introduces a novel approach to global sensitivity analysis,\ngrounded in the variance-covariance structure of random variables derived from\nrandom measures. The proposed methodology facilitates the application of\ninformation-theoretic rules for uncertainty quantification, offering several\nadvantages. Specifically, the approach provides valuable insights into the\ndecomposition of variance within discrete subspaces, similar to the standard\nANOVA analysis. To illustrate this point, the method is applied to datasets\nobtained from the analysis of randomized controlled trials on evaluating the\nefficacy of the COVID-19 vaccine and assessing clinical endpoints in a lung\ncancer study."}, "http://arxiv.org/abs/2312.10548": {"title": "Analysis of composition on the original scale of measurement", "link": "http://arxiv.org/abs/2312.10548", "description": "In current applied research the most-used route to an analysis of composition\nis through log-ratios -- that is, contrasts among log-transformed measurements.\nHere we argue instead for a more direct approach, using a statistical model for\nthe arithmetic mean on the original scale of measurement. Central to the\napproach is a general variance-covariance function, derived by assuming\nmultiplicative measurement error. Quasi-likelihood analysis of logit models for\ncomposition is then a general alternative to the use of multivariate linear\nmodels for log-ratio transformed measurements, and it has important advantages.\nThese include robustness to secondary aspects of model specification, stability\nwhen there are zero-valued or near-zero measurements in the data, and more\ndirect interpretation. The usual efficiency property of quasi-likelihood\nestimation applies even when the error covariance matrix is unspecified. We\nalso indicate how the derived variance-covariance function can be used, instead\nof the variance-covariance matrix of log-ratios, with more general multivariate\nmethods for the analysis of composition. A specific feature is that the notion\nof `null correlation' -- for compositional measurements on their original scale\n-- emerges naturally."}, "http://arxiv.org/abs/2312.10563": {"title": "Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration", "link": "http://arxiv.org/abs/2312.10563", "description": "Mediation analysis is a powerful tool for studying causal pathways between\nexposure, mediator, and outcome variables of interest. While classical\nmediation analysis using observational data often requires strong and sometimes\nunrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR)\navoids unmeasured confounding bias by employing genetic variants as\ninstrumental variables. We develop a novel MR framework for mediation analysis\nwith genome-wide associate study (GWAS) summary data, and provide solid\nstatistical guarantees. Our framework efficiently integrates information stored\nin three independent GWAS summary data and mitigates the commonly encountered\nwinner's curse and measurement error bias (a.k.a. instrument selection and weak\ninstrument bias) in MR. 
As a result, our framework provides valid statistical\ninference for both direct and mediation effects with enhanced statistical\nefficiency. As part of this endeavor, we also demonstrate that the concept of\nwinner's curse bias in mediation analysis with MR and summary data is more\ncomplex than previously documented in the classical two-sample MR literature,\nrequiring special treatments to address such a bias issue. Through our\ntheoretical investigations, we show that the proposed method delivers\nconsistent and asymptotically normally distributed causal effect estimates. We\nillustrate the finite-sample performance of our approach through simulation\nexperiments and a case study."}, "http://arxiv.org/abs/2312.10569": {"title": "Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data", "link": "http://arxiv.org/abs/2312.10569", "description": "Many modern causal questions ask how treatments affect complex outcomes that\nare measured using wearable devices and sensors. Current analysis approaches\nrequire summarizing these data into scalar statistics (e.g., the mean), but\nthese summaries can be misleading. For example, disparate distributions can\nhave the same means, variances, and other statistics. Researchers can overcome\nthe loss of information by instead representing the data as distributions. We\ndevelop an interpretable method for distributional data analysis that ensures\ntrustworthy and robust decision-making: Analyzing Distributional Data via\nMatching After Learning to Stretch (ADD MALTS). We (i) provide analytical\nguarantees of the correctness of our estimation strategy, (ii) demonstrate via\nsimulation that ADD MALTS outperforms other distributional data analysis\nmethods at estimating treatment effects, and (iii) illustrate ADD MALTS'\nability to verify whether there is enough cohesion between treatment and\ncontrol units within subpopulations to trustworthily estimate treatment\neffects. We demonstrate ADD MALTS' utility by studying the effectiveness of\ncontinuous glucose monitors in mitigating diabetes risks."}, "http://arxiv.org/abs/2312.10570": {"title": "Adversarially Balanced Representation for Continuous Treatment Effect Estimation", "link": "http://arxiv.org/abs/2312.10570", "description": "Individual treatment effect (ITE) estimation requires adjusting for the\ncovariate shift between populations with different treatments, and deep\nrepresentation learning has shown great promise in learning a balanced\nrepresentation of covariates. However the existing methods mostly consider the\nscenario of binary treatments. In this paper, we consider the more practical\nand challenging scenario in which the treatment is a continuous variable (e.g.\ndosage of a medication), and we address the two main challenges of this setup.\nWe propose the adversarial counterfactual regression network (ACFR) that\nadversarially minimizes the representation imbalance in terms of KL divergence,\nand also maintains the impact of the treatment value on the outcome prediction\nby leveraging an attention mechanism. Theoretically we demonstrate that ACFR\nobjective function is grounded in an upper bound on counterfactual outcome\nprediction error. 
Our experimental evaluation on semi-synthetic datasets\ndemonstrates the empirical superiority of ACFR over a range of state-of-the-art\nmethods."}, "http://arxiv.org/abs/2312.10573": {"title": "Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem", "link": "http://arxiv.org/abs/2312.10573", "description": "Random Forest is a machine learning method that offers many advantages,\nincluding the ability to easily measure variable importance. Class balancing\ntechnique is a well-known solution to deal with class imbalance problem.\nHowever, it has not been actively studied on RF variable importance. In this\npaper, we study the effect of class balancing on RF variable importance. Our\nsimulation results show that over-sampling is effective in correctly measuring\nvariable importance in class imbalanced situations with small sample size,\nwhile under-sampling fails to differentiate important and non-informative\nvariables. We then propose a variable selection algorithm that utilizes RF\nvariable importance and its confidence interval. Through an experimental study\nusing many real and artificial datasets, we demonstrate that our proposed\nalgorithm efficiently selects an optimal feature set, leading to improved\nprediction performance in class imbalance problem."}, "http://arxiv.org/abs/2312.10596": {"title": "A maximin optimal approach for model-free sampling designs in two-phase studies", "link": "http://arxiv.org/abs/2312.10596", "description": "Data collection costs can vary widely across variables in data science tasks.\nTwo-phase designs are often employed to save data collection costs. In\ntwo-phase studies, inexpensive variables are collected for all subjects in the\nfirst phase, and expensive variables are measured for a subset of subjects in\nthe second phase based on a predetermined sampling rule. The estimation\nefficiency under two-phase designs relies heavily on the sampling rule.\nExisting literature primarily focuses on designing sampling rules for\nestimating a scalar parameter in some parametric models or some specific\nestimating problems. However, real-world scenarios are usually model-unknown\nand involve two-phase designs for model-free estimation of a scalar or\nmulti-dimensional parameter. This paper proposes a maximin criterion to design\nan optimal sampling rule based on semiparametric efficiency bounds. The\nproposed method is model-free and applicable to general estimating problems.\nThe resulting sampling rule can minimize the semiparametric efficiency bound\nwhen the parameter is scalar and improve the bound for every component when the\nparameter is multi-dimensional. Simulation studies demonstrate that the\nproposed designs reduce the variance of the resulting estimator in various\nsettings. The implementation of the proposed design is illustrated in a real\ndata analysis."}, "http://arxiv.org/abs/2312.10607": {"title": "Bayesian Model Selection via Mean-Field Variational Approximation", "link": "http://arxiv.org/abs/2312.10607", "description": "This article considers Bayesian model selection via mean-field (MF)\nvariational approximation. Towards this goal, we study the non-asymptotic\nproperties of MF inference under the Bayesian framework that allows latent\nvariables and model mis-specification. 
Concretely, we show a Bernstein\nvon-Mises (BvM) theorem for the variational distribution from MF under possible\nmodel mis-specification, which implies the distributional convergence of MF\nvariational approximation to a normal distribution centering at the maximal\nlikelihood estimator (within the specified model). Motivated by the BvM\ntheorem, we propose a model selection criterion using the evidence lower bound\n(ELBO), and demonstrate that the model selected by ELBO tends to asymptotically\nagree with the one selected by the commonly used Bayesian information criterion\n(BIC) as sample size tends to infinity. Comparing to BIC, ELBO tends to incur\nsmaller approximation error to the log-marginal likelihood (a.k.a. model\nevidence) due to a better dimension dependence and full incorporation of the\nprior information. Moreover, we show the geometric convergence of the\ncoordinate ascent variational inference (CAVI) algorithm under the parametric\nmodel framework, which provides a practical guidance on how many iterations one\ntypically needs to run when approximating the ELBO. These findings demonstrate\nthat variational inference is capable of providing a computationally efficient\nalternative to conventional approaches in tasks beyond obtaining point\nestimates, which is also empirically demonstrated by our extensive numerical\nexperiments."}, "http://arxiv.org/abs/2312.10618": {"title": "Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines", "link": "http://arxiv.org/abs/2312.10618", "description": "Classification and probability estimation have broad applications in modern\nmachine learning and data science applications, including biology, medicine,\nengineering, and computer science. The recent development of a class of\nweighted Support Vector Machines (wSVMs) has shown great values in robustly\npredicting the class probability and classification for various problems with\nhigh accuracy. The current framework is based on the $\\ell^2$-norm regularized\nbinary wSVMs optimization problem, which only works with dense features and has\npoor performance at sparse features with redundant noise in most real\napplications. The sparse learning process requires a prescreen of the important\nvariables for each binary wSVMs for accurately estimating pairwise conditional\nprobability. In this paper, we proposed novel wSVMs frameworks that incorporate\nautomatic variable selection with accurate probability estimation for sparse\nlearning problems. We developed efficient algorithms for effective variable\nselection for solving either the $\\ell^1$-norm or elastic net regularized\nbinary wSVMs optimization problems. The binary class probability is then\nestimated either by the $\\ell^2$-norm regularized wSVMs framework with selected\nvariables or by elastic net regularized wSVMs directly. The two-step approach\nof $\\ell^1$-norm followed by $\\ell^2$-norm wSVMs show a great advantage in both\nautomatic variable selection and reliable probability estimators with the most\nefficient time. The elastic net regularized wSVMs offer the best performance in\nterms of variable selection and probability estimation with the additional\nadvantage of variable grouping in the compensation of more computation time for\nhigh dimensional problems. 
The proposed wSVMs-based sparse learning methods\nhave wide applications and can be further extended to $K$-class problems\nthrough ensemble learning."}, "http://arxiv.org/abs/2312.10675": {"title": "Visualization and Assessment of Copula Symmetry", "link": "http://arxiv.org/abs/2312.10675", "description": "Visualization and assessment of copula structures are crucial for accurately\nunderstanding and modeling the dependencies in multivariate data analysis. In\nthis paper, we introduce an innovative method that employs functional boxplots\nand rank-based testing procedures to evaluate copula symmetry. This approach is\nspecifically designed to assess key characteristics such as reflection\nsymmetry, radial symmetry, and joint symmetry. We first construct test\nfunctions for each specific property and then investigate the asymptotic\nproperties of their empirical estimators. We demonstrate that the functional\nboxplot of these sample test functions serves as an informative visualization\ntool of a given copula structure, effectively measuring the departure from zero\nof the test function. Furthermore, we introduce a nonparametric testing\nprocedure to assess the significance of deviations from symmetry, ensuring the\naccuracy and reliability of our visualization method. Through extensive\nsimulation studies involving various copula models, we demonstrate the\neffectiveness of our testing approach. Finally, we apply our visualization and\ntesting techniques to two real-world datasets: a nutritional habits survey with\nfive variables and wind speed data from three locations in Saudi Arabia."}, "http://arxiv.org/abs/2312.10690": {"title": "M-Estimation in Censored Regression Model using Instrumental Variables under Endogeneity", "link": "http://arxiv.org/abs/2312.10690", "description": "We propose and study M-estimation to estimate the parameters in the censored\nregression model in the presence of endogeneity, i.e., the Tobit model. In the\ncourse of this study, we follow two-stage procedures: the first stage consists\nof applying control function procedures to address the issue of endogeneity\nusing instrumental variables, and the second stage applies the M-estimation\ntechnique to estimate the unknown parameters involved in the model. The large\nsample properties of the proposed estimators are derived and analyzed. The\nfinite sample properties of the estimators are studied through Monte Carlo\nsimulation and a real data application related to women's labor force\nparticipation."}, "http://arxiv.org/abs/2312.10695": {"title": "Nonparametric Strategy Test", "link": "http://arxiv.org/abs/2312.10695", "description": "We present a nonparametric statistical test for determining whether an agent\nis following a given mixed strategy in a repeated strategic-form game given\nsamples of the agent's play. This involves two components: determining whether\nthe agent's frequencies of pure strategies are sufficiently close to the target\nfrequencies, and determining whether the pure strategies selected are\nindependent between different game iterations. Our integrated test involves\napplying a chi-squared goodness of fit test for the first component and a\ngeneralized Wald-Wolfowitz runs test for the second component. The results from\nboth tests are combined using Bonferroni correction to produce a complete test\nfor a given significance level $\\alpha.$ We applied the test to publicly\navailable data of human rock-paper-scissors play. The data consists of 50\niterations of play for 500 human players. 
We test with a null hypothesis that\nthe players are following a uniform random strategy independently at each game\niteration. Using a significance level of $\\alpha = 0.05$, we conclude that 305\n(61%) of the subjects are following the target strategy."}, "http://arxiv.org/abs/2312.10706": {"title": "Margin-closed regime-switching multivariate time series models", "link": "http://arxiv.org/abs/2312.10706", "description": "A regime-switching multivariate time series model which is closed under\nmargins is built. The model imposes a restriction on all lower-dimensional\nsub-processes to follow a regime-switching process sharing the same latent\nregime sequence and having the same Markov order as the original process. The\nmargin-closed regime-switching model is constructed by considering the\nmultivariate margin-closed Gaussian VAR($k$) dependence as a copula within each\nregime, and builds dependence between observations in different regimes by\nrequiring the first observation in the new regime to depend on the last\nobservation in the previous regime. The property of closure under margins\nallows inference on the latent regimes based on lower-dimensional selected\nsub-processes and estimation of univariate parameters from univariate\nsub-processes, and enables the use of multi-stage estimation procedure for the\nmodel. The parsimonious dependence structure of the model also avoids a large\nnumber of parameters under the regime-switching setting. The proposed model is\napplied to a macroeconomic data set to infer the latent business cycle and\ncompared with the relevant benchmark."}, "http://arxiv.org/abs/2312.10796": {"title": "Two sample test for covariance matrices in ultra-high dimension", "link": "http://arxiv.org/abs/2312.10796", "description": "In this paper, we propose a new test for testing the equality of two\npopulation covariance matrices in the ultra-high dimensional setting that the\ndimension is much larger than the sizes of both of the two samples. Our\nproposed methodology relies on a data splitting procedure and a comparison of a\nset of well selected eigenvalues of the sample covariance matrices on the split\ndata sets. Compared to the existing methods, our methodology is adaptive in the\nsense that (i). it does not require specific assumption (e.g., comparable or\nbalancing, etc.) on the sizes of two samples; (ii). it does not need\nquantitative or structural assumptions of the population covariance matrices;\n(iii). it does not need the parametric distributions or the detailed knowledge\nof the moments of the two populations. Theoretically, we establish the\nasymptotic distributions of the statistics used in our method and conduct the\npower analysis. We justify that our method is powerful under very weak\nalternatives. We conduct extensive numerical simulations and show that our\nmethod significantly outperforms the existing ones both in terms of size and\npower. Analysis of two real data sets is also carried out to demonstrate the\nusefulness and superior performance of our proposed methodology. 
An\n$\\texttt{R}$ package $\\texttt{UHDtst}$ is developed for easy implementation of\nour proposed methodology."}, "http://arxiv.org/abs/2312.10814": {"title": "Scalable Design with Posterior-Based Operating Characteristics", "link": "http://arxiv.org/abs/2312.10814", "description": "To design trustworthy Bayesian studies, criteria for posterior-based\noperating characteristics - such as power and the type I error rate - are often\ndefined in clinical, industrial, and corporate settings. These posterior-based\noperating characteristics are typically assessed by exploring sampling\ndistributions of posterior probabilities via simulation. In this paper, we\npropose a scalable method to determine optimal sample sizes and decision\ncriteria that leverages large-sample theory to explore sampling distributions\nof posterior probabilities in a targeted manner. This targeted exploration\napproach prompts consistent sample size recommendations with fewer simulation\nrepetitions than standard methods. We repurpose the posterior probabilities\ncomputed in that approach to efficiently investigate various sample sizes and\ndecision criteria using contour plots."}, "http://arxiv.org/abs/2312.10894": {"title": "Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference", "link": "http://arxiv.org/abs/2312.10894", "description": "In this paper, we study the effectiveness of using a constant stepsize in\nstatistical inference via linear stochastic approximation (LSA) algorithms with\nMarkovian data. After establishing a Central Limit Theorem (CLT), we outline an\ninference procedure that uses averaged LSA iterates to construct confidence\nintervals (CIs). Our procedure leverages the fast mixing property of\nconstant-stepsize LSA for better covariance estimation and employs\nRichardson-Romberg (RR) extrapolation to reduce the bias induced by constant\nstepsize and Markovian data. We develop theoretical results for guiding\nstepsize selection in RR extrapolation, and identify several important settings\nwhere the bias provably vanishes even without extrapolation. We conduct\nextensive numerical experiments and compare against classical inference\napproaches. Our results show that using a constant stepsize enjoys easy\nhyperparameter tuning, fast convergence, and consistently better CI coverage,\nespecially when data is limited."}, "http://arxiv.org/abs/2312.10920": {"title": "Domain adaption and physical constrains transfer learning for shale gas production", "link": "http://arxiv.org/abs/2312.10920", "description": "Effective prediction of shale gas production is crucial for strategic\nreservoir development. However, in new shale gas blocks, two main challenges\nare encountered: (1) the occurrence of negative transfer due to insufficient\ndata, and (2) the limited interpretability of deep learning (DL) models. To\ntackle these problems, we propose a novel transfer learning methodology that\nutilizes domain adaptation and physical constraints. This methodology\neffectively employs historical data from the source domain to reduce negative\ntransfer from the data distribution perspective, while also using physical\nconstraints to build a robust and reliable prediction model that integrates\nvarious types of data. The methodology starts by dividing the production data\nfrom the source domain into multiple subdomains, thereby enhancing data\ndiversity. It then uses Maximum Mean Discrepancy (MMD) and global average\ndistance measures to decide on the feasibility of transfer. 
Through domain\nadaptation, we integrate all transferable knowledge, resulting in a more\ncomprehensive target model. Lastly, by incorporating drilling, completion, and\ngeological data as physical constraints, we develop a hybrid model. This model,\na combination of a multi-layer perceptron (MLP) and a Transformer\n(Transformer-MLP), is designed to maximize interpretability. Experimental\nvalidation in China's southwestern region confirms the method's effectiveness."}, "http://arxiv.org/abs/2312.10926": {"title": "A Random Effects Model-based Method of Moments Estimation of Causal Effect in Mendelian Randomization Studies", "link": "http://arxiv.org/abs/2312.10926", "description": "Recent advances in genotyping technology have delivered a wealth of genetic\ndata, which is rapidly advancing our understanding of the underlying genetic\narchitecture of complex diseases. Mendelian Randomization (MR) leverages such\ngenetic data to estimate the causal effect of an exposure factor on an outcome\nfrom observational studies. In this paper, we utilize genetic correlations to\nsummarize information on a large set of genetic variants associated with the\nexposure factor. Our proposed approach is a generalization of the MR-inverse\nvariance weighting (IVW) approach where we can accommodate many weak and\npleiotropic effects. Our approach quantifies the variation explained by all\nvalid instrumental variables (IVs) instead of estimating the individual effects\nand thus could accommodate weak IVs. This is particularly useful for performing\nMR estimation in small studies, or minority populations where the selection of\nvalid IVs is unreliable and thus has a large influence on the MR estimation.\nThrough simulation and real data analysis, we demonstrate that our approach\nprovides a robust alternative to the existing MR methods. We illustrate the\nrobustness of our proposed approach under the violation of MR assumptions and\ncompare the performance with several existing approaches."}, "http://arxiv.org/abs/2312.10958": {"title": "Large-sample properties of multiple imputation estimators for parameters of logistic regression with covariates missing at random separately or simultaneously", "link": "http://arxiv.org/abs/2312.10958", "description": "We consider logistic regression including two sets of discrete or categorical\ncovariates that are missing at random (MAR) separately or simultaneously. We\nexamine the asymptotic properties of two multiple imputation (MI) estimators,\ngiven in the study of Lee at al. (2023), for the parameters of the logistic\nregression model with both sets of discrete or categorical covariates that are\nMAR separately or simultaneously. The proposed estimated asymptotic variances\nof the two MI estimators address a limitation observed with Rubin's type\nestimated variances, which lead to underestimate the variances of the two MI\nestimators (Rubin, 1987). Simulation results demonstrate that our two proposed\nMI methods outperform the complete-case, semiparametric inverse probability\nweighting, random forest MI using chained equations, and stochastic\napproximation of expectation-maximization methods. 
To illustrate the\nmethodology's practical application, we provide a real data example from a\nsurvey conducted in the Feng Chia night market in Taichung City, Taiwan."}, "http://arxiv.org/abs/2312.11001": {"title": "A Versatile Causal Discovery Framework to Allow Causally-Related Hidden Variables", "link": "http://arxiv.org/abs/2312.11001", "description": "Most existing causal discovery methods rely on the assumption of no latent\nconfounders, limiting their applicability in solving real-life problems. In\nthis paper, we introduce a novel, versatile framework for causal discovery that\naccommodates the presence of causally-related hidden variables almost\neverywhere in the causal network (for instance, they can be effects of observed\nvariables), based on rank information of covariance matrix over observed\nvariables. We start by investigating the efficacy of rank in comparison to\nconditional independence and, theoretically, establish necessary and sufficient\nconditions for the identifiability of certain latent structural patterns.\nFurthermore, we develop a Rank-based Latent Causal Discovery algorithm, RLCD,\nthat can efficiently locate hidden variables, determine their cardinalities,\nand discover the entire causal structure over both measured and hidden ones. We\nalso show that, under certain graphical conditions, RLCD correctly identifies\nthe Markov Equivalence Class of the whole latent causal graph asymptotically.\nExperimental results on both synthetic and real-world personality data sets\ndemonstrate the efficacy of the proposed approach in finite-sample cases."}, "http://arxiv.org/abs/2312.11054": {"title": "Detection of Model-based Planted Pseudo-cliques in Random Dot Product Graphs by the Adjacency Spectral Embedding and the Graph Encoder Embedding", "link": "http://arxiv.org/abs/2312.11054", "description": "In this paper, we explore the capability of both the Adjacency Spectral\nEmbedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded\npseudo-clique structure in the random dot product graph setting. In both theory\nand experiments, we demonstrate that this pairing of model and methods can\nyield worse results than the best existing spectral clique detection methods,\ndemonstrating at once the methods' potential inability to capture even modestly\nsized pseudo-cliques and the methods' robustness to the model contamination\ngiving rise to the pseudo-clique structure. To further enrich our analysis, we\nalso consider the Variational Graph Auto-Encoder (VGAE) model in our simulation\nand real data experiments."}, "http://arxiv.org/abs/2312.11108": {"title": "Multiple change point detection in functional data with applications to biomechanical fatigue data", "link": "http://arxiv.org/abs/2312.11108", "description": "Injuries to the lower extremity joints are often debilitating, particularly\nfor professional athletes. Understanding the onset of stressful conditions on\nthese joints is therefore important in order to ensure prevention of injuries\nas well as individualised training for enhanced athletic performance. We study\nthe biomechanical joint angles from the hip, knee and ankle for runners who are\nexperiencing fatigue. The data is cyclic in nature and densely collected by\nbody worn sensors, which makes it ideal to work with in the functional data\nanalysis (FDA) framework.\n\nWe develop a new method for multiple change point detection for functional\ndata, which improves the state of the art with respect to at least two novel\naspects. 
First, the curves are compared with respect to their maximum absolute\ndeviation, which leads to a better interpretation of local changes in the\nfunctional data compared to classical $L^2$-approaches. Secondly, as slight\naberrations are to be often expected in a human movement data, our method will\nnot detect arbitrarily small changes but hunts for relevant changes, where\nmaximum absolute deviation between the curves exceeds a specified threshold,\nsay $\\Delta >0$. We recover multiple changes in a long functional time series\nof biomechanical knee angle data, which are larger than the desired threshold\n$\\Delta$, allowing us to identify changes purely due to fatigue. In this work,\nwe analyse data from both controlled indoor as well as from an uncontrolled\noutdoor (marathon) setting."}, "http://arxiv.org/abs/2312.11136": {"title": "Identification of complier and noncomplier average causal effects in the presence of latent missing-at-random (LMAR) outcomes: a unifying view and choices of assumptions", "link": "http://arxiv.org/abs/2312.11136", "description": "The study of treatment effects is often complicated by noncompliance and\nmissing data. In the one-sided noncompliance setting where of interest are the\ncomplier and noncomplier average causal effects (CACE and NACE), we address\noutcome missingness of the \\textit{latent missing at random} type (LMAR, also\nknown as \\textit{latent ignorability}). That is, conditional on covariates and\ntreatment assigned, the missingness may depend on compliance type. Within the\ninstrumental variable (IV) approach to noncompliance, methods have been\nproposed for handling LMAR outcome that additionally invoke an exclusion\nrestriction type assumption on missingness, but no solution has been proposed\nfor when a non-IV approach is used. This paper focuses on effect identification\nin the presence of LMAR outcome, with a view to flexibly accommodate different\nprincipal identification approaches. We show that under treatment assignment\nignorability and LMAR only, effect nonidentifiability boils down to a set of\ntwo connected mixture equations involving unidentified stratum-specific\nresponse probabilities and outcome means. This clarifies that (except for a\nspecial case) effect identification generally requires two additional\nassumptions: a \\textit{specific missingness mechanism} assumption and a\n\\textit{principal identification} assumption. This provides a template for\nidentifying effects based on separate choices of these assumptions. We consider\na range of specific missingness assumptions, including those that have appeared\nin the literature and some new ones. Incidentally, we find an issue in the\nexisting assumptions, and propose a modification of the assumptions to avoid\nthe issue. Results under different assumptions are illustrated using data from\nthe Baltimore Experience Corps Trial."}, "http://arxiv.org/abs/2312.11137": {"title": "Random multiplication versus random sum: auto-regressive-like models with integer-valued random inputs", "link": "http://arxiv.org/abs/2312.11137", "description": "A common approach to analyze count time series is to fit models based on\nrandom sum operators. As an alternative, this paper introduces time series\nmodels based on a random multiplication operator, which is simply the\nmultiplication of a variable operand by an integer-valued random coefficient,\nwhose mean is the constant operand. 
Such operation is endowed into\nauto-regressive-like models with integer-valued random inputs, addressed as\nRMINAR. Two special variants are studied, namely the N0-valued random\ncoefficient auto-regressive model and the N0-valued random coefficient\nmultiplicative error model. Furthermore, Z-valued extensions are considered.\nThe dynamic structure of the proposed models is studied in detail. In\nparticular, their corresponding solutions are everywhere strictly stationary\nand ergodic, a fact that is not common neither in the literature on\ninteger-valued time series models nor real-valued random coefficient\nauto-regressive models. Therefore, the parameters of the RMINAR model are\nestimated using a four-stage weighted least squares estimator, with consistency\nand asymptotic normality established everywhere in the parameter space.\nFinally, the new RMINAR models are illustrated with some simulated and\nempirical examples."}, "http://arxiv.org/abs/2312.11178": {"title": "Deinterleaving RADAR emitters with optimal transport distances", "link": "http://arxiv.org/abs/2312.11178", "description": "Detection and identification of emitters provide vital information for\ndefensive strategies in electronic intelligence. Based on a received signal\ncontaining pulses from an unknown number of emitters, this paper introduces an\nunsupervised methodology for deinterleaving RADAR signals based on a\ncombination of clustering algorithms and optimal transport distances. The first\nstep involves separating the pulses with a clustering algorithm under the\nconstraint that the pulses of two different emitters cannot belong to the same\ncluster. Then, as the emitters exhibit complex behavior and can be represented\nby several clusters, we propose a hierarchical clustering algorithm based on an\noptimal transport distance to merge these clusters. A variant is also\ndeveloped, capable of handling more complex signals. Finally, the proposed\nmethodology is evaluated on simulated data provided through a realistic\nsimulator. Results show that the proposed methods are capable of deinterleaving\ncomplex RADAR signals."}, "http://arxiv.org/abs/2312.11319": {"title": "Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation", "link": "http://arxiv.org/abs/2312.11319", "description": "Accurately detecting multiple change-points is critical for various\napplications, but determining the optimal number of change-points remains a\nchallenge. Existing approaches based on information criteria attempt to balance\ngoodness-of-fit and model complexity, but their performance varies depending on\nthe model. Recently, data-driven selection criteria based on cross-validation\nhas been proposed, but these methods can be prone to slight overfitting in\nfinite samples. In this paper, we introduce a method that controls the\nprobability of overestimation and provides uncertainty quantification for\nlearning multiple change-points via cross-validation. We frame this problem as\na sequence of model comparison problems and leverage high-dimensional\ninferential procedures. We demonstrate the effectiveness of our approach\nthrough experiments on finite-sample data, showing superior uncertainty\nquantification for overestimation compared to existing methods. 
Our approach\nhas broad applicability and can be used in diverse change-point models."}, "http://arxiv.org/abs/2312.11323": {"title": "UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters", "link": "http://arxiv.org/abs/2312.11323", "description": "Estimating the number of clusters k while clustering the data is a\nchallenging task. An incorrect cluster assumption indicates that the number of\nclusters k gets wrongly estimated. Consequently, the model fitting becomes less\nimportant. In this work, we focus on the concept of unimodality and propose a\nflexible cluster definition called locally unimodal cluster. A locally unimodal\ncluster extends for as long as unimodality is locally preserved across pairs of\nsubclusters of the data. Then, we propose the UniForCE method for locally\nunimodal clustering. The method starts with an initial overclustering of the\ndata and relies on the unimodality graph that connects subclusters forming\nunimodal pairs. Such pairs are identified using an appropriate statistical\ntest. UniForCE identifies maximal locally unimodal clusters by computing a\nspanning forest in the unimodality graph. Experimental results on both real and\nsynthetic datasets illustrate that the proposed methodology is particularly\nflexible and robust in discovering regular and highly complex cluster shapes.\nMost importantly, it automatically provides an adequate estimation of the\nnumber of clusters."}, "http://arxiv.org/abs/2312.11393": {"title": "Assessing Estimation Uncertainty under Model Misspecification", "link": "http://arxiv.org/abs/2312.11393", "description": "Model misspecification is ubiquitous in data analysis because the\ndata-generating process is often complex and mathematically intractable.\nTherefore, assessing estimation uncertainty and conducting statistical\ninference under a possibly misspecified working model is unavoidable. In such a\ncase, classical methods such as bootstrap and asymptotic theory-based inference\nfrequently fail since they rely heavily on the model assumptions. In this\narticle, we provide a new bootstrap procedure, termed local residual bootstrap,\nto assess estimation uncertainty under model misspecification for generalized\nlinear models. By resampling the residuals from the neighboring observations,\nwe can approximate the sampling distribution of the statistic of interest\naccurately. Instead of relying on the score equations, the proposed method\ndirectly recreates the response variables so that we can easily conduct\nstandard error estimation, confidence interval construction, hypothesis\ntesting, and model evaluation and selection. It performs similarly to classical\nbootstrap when the model is correctly specified and provides a more accurate\nassessment of uncertainty under model misspecification, offering data analysts\nan easy way to guard against the impact of misspecified models. We establish\ndesirable theoretical properties, such as the bootstrap validity, for the\nproposed method using the surrogate residuals. Numerical results and real data\nanalysis further demonstrate the superiority of the proposed method."}, "http://arxiv.org/abs/2312.11437": {"title": "Clustering Consistency of General Nonparametric Classification Methods in Cognitive Diagnosis", "link": "http://arxiv.org/abs/2312.11437", "description": "Cognitive diagnosis models have been popularly used in fields such as\neducation, psychology, and social sciences. 
While parametric likelihood\nestimation is a prevailing method for fitting cognitive diagnosis models,\nnonparametric methodologies are attracting increasing attention due to their\nease of implementation and robustness, particularly when sample sizes are\nrelatively small. However, existing clustering consistency results of the\nnonparametric estimation methods often rely on certain restrictive conditions,\nwhich may not be easily satisfied in practice. In this article, the clustering\nconsistency of the general nonparametric classification method is reestablished\nunder weaker and more practical conditions."}, "http://arxiv.org/abs/2301.03747": {"title": "Semiparametric Regression for Spatial Data via Deep Learning", "link": "http://arxiv.org/abs/2301.03747", "description": "In this work, we propose a deep learning-based method to perform\nsemiparametric regression analysis for spatially dependent data. To be\nspecific, we use a sparsely connected deep neural network with rectified linear\nunit (ReLU) activation function to estimate the unknown regression function\nthat describes the relationship between response and covariates in the presence\nof spatial dependence. Under some mild conditions, the estimator is proven to\nbe consistent, and the rate of convergence is determined by three factors: (1)\nthe architecture of neural network class, (2) the smoothness and (intrinsic)\ndimension of true mean function, and (3) the magnitude of spatial dependence.\nOur method can handle well large data set owing to the stochastic gradient\ndescent optimization algorithm. Simulation studies on synthetic data are\nconducted to assess the finite sample performance, the results of which\nindicate that the proposed method is capable of picking up the intricate\nrelationship between response and covariates. Finally, a real data analysis is\nprovided to demonstrate the validity and effectiveness of the proposed method."}, "http://arxiv.org/abs/2301.10059": {"title": "Oncology clinical trial design planning based on a multistate model that jointly models progression-free and overall survival endpoints", "link": "http://arxiv.org/abs/2301.10059", "description": "When planning an oncology clinical trial, the usual approach is to assume\nproportional hazards and even an exponential distribution for time-to-event\nendpoints. Often, besides the gold-standard endpoint overall survival (OS),\nprogression-free survival (PFS) is considered as a second confirmatory\nendpoint. We use a survival multistate model to jointly model these two\nendpoints and find that neither exponential distribution nor proportional\nhazards will typically hold for both endpoints simultaneously. The multistate\nmodel provides a stochastic process approach to model the dependency of such\nendpoints neither requiring latent failure times nor explicit dependency\nmodelling such as copulae. We use the multistate model framework to simulate\nclinical trials with endpoints OS and PFS and show how design planning\nquestions can be answered using this approach. In particular, non-proportional\nhazards for at least one of the endpoints are naturally modelled as well as\ntheir dependency to improve planning. We consider an oncology trial on\nnon-small-cell lung cancer as a motivating example from which we derive\nrelevant trial design questions. We then illustrate how clinical trial design\ncan be based on simulations from a multistate model. Key applications are\nco-primary endpoints and group-sequential designs. 
Simulations for these\napplications show that the standard simplifying approach may very well lead to\nunderpowered or overpowered clinical trials. Our approach is quite general and\ncan be extended to more complex trial designs, further endpoints, and other\ntherapeutic areas. An R package is available on CRAN."}, "http://arxiv.org/abs/2302.09694": {"title": "Disentangled Representation for Causal Mediation Analysis", "link": "http://arxiv.org/abs/2302.09694", "description": "Estimating direct and indirect causal effects from observational data is\ncrucial to understanding the causal mechanisms and predicting the behaviour\nunder different interventions. Causal mediation analysis is a method that is\noften used to reveal direct and indirect effects. Deep learning shows promise\nin mediation analysis, but the current methods only assume latent confounders\nthat affect treatment, mediator and outcome simultaneously, and fail to\nidentify different types of latent confounders (e.g., confounders that only\naffect the mediator or outcome). Furthermore, current methods are based on the\nsequential ignorability assumption, which is not feasible for dealing with\nmultiple types of latent confounders. This work aims to circumvent the\nsequential ignorability assumption and applies the piecemeal deconfounding\nassumption as an alternative. We propose the Disentangled Mediation Analysis\nVariational AutoEncoder (DMAVAE), which disentangles the representations of\nlatent confounders into three types to accurately estimate the natural direct\neffect, natural indirect effect and total effect. Experimental results show\nthat the proposed method outperforms existing methods and has strong\ngeneralisation ability. We further apply the method to a real-world dataset to\nshow its potential application."}, "http://arxiv.org/abs/2302.13511": {"title": "Extrapolated cross-validation for randomized ensembles", "link": "http://arxiv.org/abs/2302.13511", "description": "Ensemble methods such as bagging and random forests are ubiquitous in various\nfields, from finance to genomics. Despite their prevalence, the question of the\nefficient tuning of ensemble parameters has received relatively little\nattention. This paper introduces a cross-validation method, ECV (Extrapolated\nCross-Validation), for tuning the ensemble and subsample sizes in randomized\nensembles. Our method builds on two primary ingredients: initial estimators for\nsmall ensemble sizes using out-of-bag errors and a novel risk extrapolation\ntechnique that leverages the structure of prediction risk decomposition. By\nestablishing uniform consistency of our risk extrapolation technique over\nensemble and subsample sizes, we show that ECV yields $\\delta$-optimal (with\nrespect to the oracle-tuned risk) ensembles for squared prediction risk. Our\ntheory accommodates general ensemble predictors, only requires mild moment\nassumptions, and allows for high-dimensional regimes where the feature\ndimension grows with the sample size. As a practical case study, we employ ECV\nto predict surface protein abundances from gene expressions in single-cell\nmultiomics using random forests. In comparison to sample-split cross-validation\nand $K$-fold cross-validation, ECV achieves higher accuracy avoiding sample\nsplitting. At the same time, its computational cost is considerably lower owing\nto the use of the risk extrapolation technique. 
Additional numerical results\nvalidate the finite-sample accuracy of ECV for several common ensemble\npredictors under a computational constraint on the maximum ensemble size."}, "http://arxiv.org/abs/2304.01944": {"title": "A Statistical Approach to Ecological Modeling by a New Similarity Index", "link": "http://arxiv.org/abs/2304.01944", "description": "Similarity index is an important scientific tool frequently used to determine\nwhether different pairs of entities are similar with respect to some prefixed\ncharacteristics. Some standard measures of similarity index include Jaccard\nindex, S{\\o}rensen-Dice index, and Simpson's index. Recently, a better index\n($\\hat{\\alpha}$) for the co-occurrence and/or similarity has been developed,\nand this measure really outperforms and gives theoretically supported\nreasonable predictions. However, the measure $\\hat{\\alpha}$ is not data\ndependent. In this article we propose a new measure of similarity which depends\nstrongly on the data before introducing randomness in prevalence. Then, we\npropose a new method of randomization which changes the whole pattern of\nresults. Before randomization our measure is similar to the Jaccard index,\nwhile after randomization it is close to $\\hat{\\alpha}$. We consider the\npopular ecological dataset from the Tuscan Archipelago, Italy; and compare the\nperformance of the proposed index to other measures. Since our proposed index\nis data dependent, it has some interesting properties which we illustrate in\nthis article through numerical studies."}, "http://arxiv.org/abs/2305.04587": {"title": "Replication of \"null results\" -- Absence of evidence or evidence of absence?", "link": "http://arxiv.org/abs/2305.04587", "description": "In several large-scale replication projects, statistically non-significant\nresults in both the original and the replication study have been interpreted as\na \"replication success\". Here we discuss the logical problems with this\napproach: Non-significance in both studies does not ensure that the studies\nprovide evidence for the absence of an effect and \"replication success\" can\nvirtually always be achieved if the sample sizes are small enough. In addition,\nthe relevant error rates are not controlled. We show how methods, such as\nequivalence testing and Bayes factors, can be used to adequately quantify the\nevidence for the absence of an effect and how they can be applied in the\nreplication setting. Using data from the Reproducibility Project: Cancer\nBiology, the Experimental Philosophy Replicability Project, and the\nReproducibility Project: Psychology we illustrate that many original and\nreplication studies with \"null results\" are in fact inconclusive. We conclude\nthat it is important to also replicate studies with statistically\nnon-significant results, but that they should be designed, analyzed, and\ninterpreted appropriately."}, "http://arxiv.org/abs/2306.02235": {"title": "Learning Linear Causal Representations from Interventions under General Nonlinear Mixing", "link": "http://arxiv.org/abs/2306.02235", "description": "We study the problem of learning causal representations from unknown, latent\ninterventions in a general setting, where the latent distribution is Gaussian\nbut the mixing function is completely general. We prove strong identifiability\nresults given unknown single-node interventions, i.e., without having access to\nthe intervention targets. 
This generalizes prior works which have focused on\nweaker classes, such as linear maps or paired counterfactual data. This is also\nthe first instance of causal identifiability from non-paired interventions for\ndeep neural network embeddings. Our proof relies on carefully uncovering the\nhigh-dimensional geometric structure present in the data distribution after a\nnon-linear density transformation, which we capture by analyzing quadratic\nforms of precision matrices of the latent distributions. Finally, we propose a\ncontrastive algorithm to identify the latent variables in practice and evaluate\nits performance on various tasks."}, "http://arxiv.org/abs/2309.09367": {"title": "ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors", "link": "http://arxiv.org/abs/2309.09367", "description": "In this paper, we address the problem of designing an experiment with both\ndiscrete and continuous factors under fairly general parametric statistical\nmodels. We propose a new algorithm, named ForLion, to search for optimal\ndesigns under the D-criterion. The algorithm performs an exhaustive search in a\ndesign space with mixed factors while keeping high efficiency and reducing the\nnumber of distinct experimental settings. Its optimality is guaranteed by the\ngeneral equivalence theorem. We demonstrate its superiority over\nstate-of-the-art design algorithms using real-life experiments under\nmultinomial logistic models (MLM) and generalized linear models (GLM). Our\nsimulation studies show that the ForLion algorithm could reduce the number of\nexperimental settings by 25% or improve the relative efficiency of the designs\nby 17.5% on average. Our algorithm can help the experimenters reduce the time\ncost, the usage of experimental devices, and thus the total cost of their\nexperiments while preserving high efficiencies of the designs."}, "http://arxiv.org/abs/2312.11573": {"title": "Estimation of individual causal effects in network setup for multiple treatments", "link": "http://arxiv.org/abs/2312.11573", "description": "We study the problem of estimation of Individual Treatment Effects (ITE) in\nthe context of multiple treatments and networked observational data. Leveraging\nthe network information, we aim to utilize hidden confounders that may not be\ndirectly accessible in the observed data, thereby enhancing the practical\napplicability of the strong ignorability assumption. To achieve this, we first\nemploy Graph Convolutional Networks (GCN) to learn a shared representation of\nthe confounders. Then, our approach utilizes separate neural networks to infer\npotential outcomes for each treatment. We design a loss function as a weighted\ncombination of two components: representation loss and Mean Squared Error (MSE)\nloss on the factual outcomes. To measure the representation loss, we extend\nexisting metrics such as Wasserstein and Maximum Mean Discrepancy (MMD) from\nthe binary treatment setting to the multiple treatments scenario. To validate\nthe effectiveness of our proposed methodology, we conduct a series of\nexperiments on the benchmark datasets such as BlogCatalog and Flickr. 
The\nexperimental results consistently demonstrate the superior performance of our\nmodels when compared to baseline methods."}, "http://arxiv.org/abs/2312.11582": {"title": "Shapley-PC: Constraint-based Causal Structure Learning with Shapley Values", "link": "http://arxiv.org/abs/2312.11582", "description": "Causal Structure Learning (CSL), amounting to extracting causal relations\namong the variables in a dataset, is widely perceived as an important step\ntowards robust and transparent models. Constraint-based CSL leverages\nconditional independence tests to perform causal discovery. We propose\nShapley-PC, a novel method to improve constraint-based CSL algorithms by using\nShapley values over the possible conditioning sets to decide which variables\nare responsible for the observed conditional (in)dependences. We prove\nsoundness and asymptotic consistency and demonstrate that it can outperform\nstate-of-the-art constraint-based, search-based and functional causal\nmodel-based methods, according to standard metrics in CSL."}, "http://arxiv.org/abs/2312.11926": {"title": "Big Learning Expectation Maximization", "link": "http://arxiv.org/abs/2312.11926", "description": "Mixture models serve as one fundamental tool with versatile applications.\nHowever, their training techniques, like the popular Expectation Maximization\n(EM) algorithm, are notoriously sensitive to parameter initialization and often\nsuffer from bad local optima that could be arbitrarily worse than the optimal.\nTo address the long-lasting bad-local-optima challenge, we draw inspiration\nfrom the recent ground-breaking foundation models and propose to leverage their\nunderlying big learning principle to upgrade the EM. Specifically, we present\nthe Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs\njoint, marginal, and orthogonally transformed marginal matchings between data\nand model distributions. Through simulated experiments, we empirically show\nthat the BigLearn-EM is capable of delivering the optimal with high\nprobability; comparisons on benchmark clustering datasets further demonstrate\nits effectiveness and advantages over existing techniques. The code is\navailable at\nhttps://github.com/YulaiCong/Big-Learning-Expectation-Maximization."}, "http://arxiv.org/abs/2312.11927": {"title": "Empowering Dual-Level Graph Self-Supervised Pretraining with Motif Discovery", "link": "http://arxiv.org/abs/2312.11927", "description": "While self-supervised graph pretraining techniques have shown promising\nresults in various domains, their application still experiences challenges of\nlimited topology learning, human knowledge dependency, and incompetent\nmulti-level interactions. To address these issues, we propose a novel solution,\nDual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which\nintroduces a unique dual-level pretraining structure that orchestrates\nnode-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM\nautonomously uncovers significant graph motifs through an edge pooling module,\naligning learned motif similarities with graph kernel-based similarities. A\ncross-matching task enables sophisticated node-motif interactions and novel\nrepresentation learning. Extensive experiments on 15 datasets validate DGPM's\neffectiveness and generalizability, outperforming state-of-the-art methods in\nunsupervised representation learning and transfer learning settings. 
The\nautonomously discovered motifs demonstrate the potential of DGPM to enhance\nrobustness and interpretability."}, "http://arxiv.org/abs/2312.11934": {"title": "Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants", "link": "http://arxiv.org/abs/2312.11934", "description": "Causal discovery with latent variables is a crucial but challenging task.\nDespite the emergence of numerous methods aimed at addressing this challenge,\nthey cannot fully identify the structure in which two observed variables are\ninfluenced by one latent variable while a directed edge may also exist between\nthem. Interestingly, we notice that this structure can be identified through\nthe utilization of higher-order cumulants. By leveraging the higher-order\ncumulants of non-Gaussian data, we provide an analytical solution for\nestimating the causal coefficients or their ratios. With the estimated (ratios\nof) causal coefficients, we propose a novel approach to identify the existence\nof a causal edge between two observed variables subject to latent variable\ninfluence. When such a causal edge exists, we introduce an asymmetry\ncriterion to determine the causal direction. The experimental results\ndemonstrate the effectiveness of our proposed method."}, "http://arxiv.org/abs/2312.11991": {"title": "Outcomes truncated by death in RCTs: a simulation study on the survivor average causal effect", "link": "http://arxiv.org/abs/2312.11991", "description": "Continuous outcome measurements truncated by death present a challenge for\nthe estimation of unbiased treatment effects in randomized controlled trials\n(RCTs). One way to deal with such situations is to estimate the survivor\naverage causal effect (SACE), but this requires making non-testable\nassumptions. Motivated by an ongoing RCT in very preterm infants with\nintraventricular hemorrhage, we performed a simulation study to compare a SACE\nestimator with complete case analysis (CCA, benchmark for a biased analysis)\nand an analysis after multiple imputation of missing outcomes. We set up 9\nscenarios combining positive, negative and no treatment effect on the outcome\n(cognitive development) and on survival at 2 years of age. Treatment effect\nestimates from all methods were compared in terms of bias, mean squared error\nand coverage with regard to two estimands: the treatment effect on the outcome\nused in the simulation and the SACE, which was derived by simulation of both\npotential outcomes per patient. Despite targeting different estimands\n(principal stratum estimand, hypothetical estimand), the SACE estimator and\nmultiple imputation gave similar estimates of the treatment effect and\nefficiently reduced the bias compared to CCA. Also, both methods were\nrelatively robust to omission of one covariate in the analysis, and thus\nviolation of relevant assumptions. Although the SACE is not without\ncontroversy, we find it useful if mortality is inherent to the study\npopulation. Some degree of violation of the required assumptions is almost\ncertain, but may be acceptable in practice."}, "http://arxiv.org/abs/2312.12008": {"title": "How to develop, externally validate, and update multinomial prediction models", "link": "http://arxiv.org/abs/2312.12008", "description": "Multinomial prediction models (MPMs) have a range of potential applications\nacross healthcare where the primary outcome of interest has multiple nominal or\nordinal categories. 
However, the application of MPMs is scarce, which may be\ndue to the added methodological complexities that they bring. This article\nprovides a guide on how to develop, externally validate, and update MPMs. Using\na previously developed and validated MPM for treatment outcomes in rheumatoid\narthritis as an example, we outline guidance and recommendations for producing\na clinical prediction model, using multinomial logistic regression. This\narticle is intended to supplement existing general guidance on prediction model\nresearch. This guide is split into three parts: 1) Outcome definition and\nvariable selection, 2) Model development, and 3) Model evaluation (including\nperformance assessment, internal and external validation, and model\nrecalibration). We outline how to evaluate and interpret the predictive\nperformance of MPMs. R code is provided. We recommend the application of MPMs\nin clinical settings where the prediction of a nominal polytomous outcome is of\ninterest. Future methodological research could focus on MPM-specific\nconsiderations for variable selection and sample size criteria for external\nvalidation."}, "http://arxiv.org/abs/2312.12106": {"title": "Conditional autoregressive models fused with random forests to improve small-area spatial prediction", "link": "http://arxiv.org/abs/2312.12106", "description": "In areal unit data with missing or suppressed values, it is desirable to create\nmodels that are able to predict observations that are not available.\nTraditional statistical methods achieve this through Bayesian hierarchical\nmodels that can capture the unexplained residual spatial autocorrelation\nthrough conditional autoregressive (CAR) priors, such that they can make\npredictions at geographically related spatial locations. In contrast, typical\nmachine learning approaches such as random forests ignore this residual\nautocorrelation, and instead base predictions on complex non-linear\nfeature-target relationships. In this paper, we propose CAR-Forest, a novel\nspatial prediction algorithm that combines the best features of both approaches\nby fusing them together. By iteratively refitting a random forest combined with\na Bayesian CAR model in one algorithm, CAR-Forest can incorporate flexible\nfeature-target relationships while still accounting for the residual spatial\nautocorrelation. Our results, based on a Scottish housing price data set, show\nthat CAR-Forest outperforms Bayesian CAR models, random forests, and the\nstate-of-the-art hybrid approach, geographically weighted random forest,\nproviding a state-of-the-art framework for small-area spatial prediction."}, "http://arxiv.org/abs/2312.12149": {"title": "Bayesian and minimax estimators of loss", "link": "http://arxiv.org/abs/2312.12149", "description": "We study the problem of loss estimation, which involves, for an observable $X\n\\sim f_{\\theta}$, the choice of a first-stage estimator $\\hat{\\gamma}$ of\n$\\gamma(\\theta)$, incurred loss $L=L(\\theta, \\hat{\\gamma})$, and the choice of\na second-stage estimator $\\hat{L}$ of $L$. We consider both: (i) a sequential\nversion where the first-stage estimate and loss are fixed and optimization is\nperformed at the second-stage level, and (ii) a simultaneous version with a\nRukhin-type loss function designed for the evaluation of $(\\hat{\\gamma},\n\\hat{L})$ as an estimator of $(\\gamma, L)$.\n\nWe explore various Bayesian solutions and provide minimax estimators for both\nsituations (i) and (ii). 
The analysis is carried out for several probability\nmodels, including multivariate normal models $N_d(\\theta, \\sigma^2 I_d)$ with\nboth known and unknown $\\sigma^2$, Gamma, univariate and multivariate Poisson,\nand negative binomial models, and relates to different choices of the\nfirst-stage and second-stage losses. The minimax findings are achieved by\nidentifying a least favourable sequence of priors and depend critically on\nparticular Bayesian solution properties, namely situations where the\nsecond-stage estimator $\\hat{L}(x)$ is constant as a function of $x$."}, "http://arxiv.org/abs/2312.12206": {"title": "Identification of Causal Structure in the Presence of Missing Data with Additive Noise Model", "link": "http://arxiv.org/abs/2312.12206", "description": "Missing data are an unavoidable complication frequently encountered in many\ncausal discovery tasks. When a missing process depends on the missing values\nthemselves (known as self-masking missingness), the recovery of the joint\ndistribution becomes unattainable, and detecting the presence of such\nself-masking missingness remains a perplexing challenge. Consequently, due to\nthe inability to reconstruct the original distribution and to discern the\nunderlying missingness mechanism, simply applying existing causal discovery\nmethods would lead to wrong conclusions. In this work, we find that the\nrecently advanced additive noise model has the potential for learning causal\nstructure in the presence of self-masking missingness. With this observation, we\naim to investigate the identification problem of learning causal structure from\nmissing data under an additive noise model with different missingness\nmechanisms, where the `no self-masking missingness' assumption can be\neliminated appropriately. Specifically, we first elegantly extend the scope of\nidentifiability of the causal skeleton to the case with weak self-masking\nmissingness (i.e., no other variable could be the cause of self-masking\nindicators except itself). We further provide the sufficient and necessary\nidentification conditions of the causal direction under the additive noise model\nand show that the causal structure can be identified up to an IN-equivalent\npattern. We finally propose a practical algorithm based on the above\ntheoretical results for learning the causal skeleton and causal direction.\nExtensive experiments on synthetic and real data demonstrate the efficiency and\neffectiveness of the proposed algorithms."}, "http://arxiv.org/abs/2312.12287": {"title": "A Criterion for Multivariate Regionalization of Spatial Data", "link": "http://arxiv.org/abs/2312.12287", "description": "The modifiable areal unit problem in geography or the change-of-support (COS)\nproblem in statistics demonstrates that the interpretation of spatial (or\nspatio-temporal) data analysis is affected by the choice of resolutions or\ngeographical units used in the study. The ecological fallacy is one famous\nexample of this phenomenon. Here we investigate the ecological fallacy\nassociated with the COS problem for multivariate spatial data with the goal of\nproviding a data-driven discretization criterion for the domain of interest\nthat minimizes aggregation errors. The discretization is based on a novel\nmultiscale metric, called the Multivariate Criterion for Aggregation Error\n(MVCAGE). Such multi-scale representations of an underlying multivariate\nprocess are often formulated in terms of basis expansions. 
We show that a\nparticularly useful basis expansion in this context is the multivariate\nKarhunen-Lo\`eve expansion (MKLE). We use the MKLE to build the MVCAGE loss\nfunction and use it within the framework of spatial clustering algorithms to\nperform optimal spatial aggregation. We demonstrate the effectiveness of our\napproach through simulation and through regionalization of county-level income\nand hospital quality data over the United States and prediction of ocean color\nin the coastal Gulf of Alaska."}, "http://arxiv.org/abs/2312.12357": {"title": "Modeling non-linear Effects with Neural Networks in Relational Event Models", "link": "http://arxiv.org/abs/2312.12357", "description": "Dynamic networks offer insight into how relational systems evolve. However,\nmodeling these networks efficiently remains a challenge, primarily due to\ncomputational constraints, especially as the number of observed events grows.\nThis paper addresses this issue by introducing the Deep Relational Event\nAdditive Model (DREAM) as a solution to the computational challenges presented\nby modeling non-linear effects in Relational Event Models (REMs). DREAM relies\non Neural Additive Models to model non-linear effects, allowing each effect to\nbe captured by an independent neural network. By strategically trading\ncomputational complexity for improved memory management and leveraging the\ncomputational capabilities of Graphics Processing Units (GPUs), DREAM efficiently\ncaptures complex non-linear relationships within data. This approach\ndemonstrates the capability of DREAM in modeling dynamic networks and scaling\nto larger networks. Comparisons with traditional REM approaches showcase DREAM's\nsuperior computational efficiency. The model's potential is further demonstrated\nby an examination of the patent citation network, which contains nearly 8\nmillion nodes and 100 million events."}, "http://arxiv.org/abs/2312.12361": {"title": "Improved multifidelity Monte Carlo estimators based on normalizing flows and dimensionality reduction techniques", "link": "http://arxiv.org/abs/2312.12361", "description": "We study the problem of multifidelity uncertainty propagation for\ncomputationally expensive models. In particular, we consider the general\nsetting where the high-fidelity and low-fidelity models have a dissimilar\nparameterization both in terms of the number of random inputs and their probability\ndistributions, which can be either known in closed form or provided through\nsamples. We derive novel multifidelity Monte Carlo estimators which rely on a\nshared subspace between the high-fidelity and low-fidelity models where the\nparameters follow the same probability distribution, i.e., a standard Gaussian.\nWe build the shared space employing normalizing flows to map different\nprobability distributions into a common one, together with linear and nonlinear\ndimensionality reduction techniques, active subspaces and autoencoders,\nrespectively, which capture the subspaces where the models vary the most. We\nthen compose the existing low-fidelity model with these transformations and\nconstruct modified models with an increased correlation with the high-fidelity\nmodel, which therefore yield multifidelity Monte Carlo estimators with reduced\nvariance. 
A series of numerical experiments illustrate the properties and\nadvantages of our approaches."}, "http://arxiv.org/abs/2312.12396": {"title": "A change-point random partition model for large spatio-temporal datasets", "link": "http://arxiv.org/abs/2312.12396", "description": "Spatio-temporal areal data can be seen as a collection of time series which\nare spatially correlated, according to a specific neighboring structure.\nMotivated by a dataset on mobile phone usage in the Metropolitan area of Milan,\nItaly, we propose a semi-parametric hierarchical Bayesian model allowing for\ntime-varying as well as spatial model-based clustering. To accommodate for\nchanging patterns over work hours and weekdays/weekends, we incorporate a\ntemporal change-point component that allows the specification of different\nhierarchical structures across time points. The model features a random\npartition prior that incorporates the desired spatial features and encourages\nco-clustering based on areal proximity. We explore properties of the model by\nway of extensive simulation studies from which we collect valuable information.\nFinally, we discuss the application to the motivating data, where the main goal\nis to spatially cluster population patterns of mobile phone usage."}, "http://arxiv.org/abs/2009.04710": {"title": "Robust Clustering with Normal Mixture Models: A Pseudo $\\beta$-Likelihood Approach", "link": "http://arxiv.org/abs/2009.04710", "description": "As in other estimation scenarios, likelihood based estimation in the normal\nmixture set-up is highly non-robust against model misspecification and presence\nof outliers (apart from being an ill-posed optimization problem). A robust\nalternative to the ordinary likelihood approach for this estimation problem is\nproposed which performs simultaneous estimation and data clustering and leads\nto subsequent anomaly detection. To invoke robustness, the methodology based on\nthe minimization of the density power divergence (or alternatively, the\nmaximization of the $\\beta$-likelihood) is utilized under suitable constraints.\nAn iteratively reweighted least squares approach has been followed in order to\ncompute the proposed estimators for the component means (or equivalently\ncluster centers) and component dispersion matrices which leads to simultaneous\ndata clustering. Some exploratory techniques are also suggested for anomaly\ndetection, a problem of great importance in the domain of statistics and\nmachine learning. The proposed method is validated with simulation studies\nunder different set-ups; it performs competitively or better compared to the\npopular existing methods like K-medoids, TCLUST, trimmed K-means and MCLUST,\nespecially when the mixture components (i.e., the clusters) share regions with\nsignificant overlap or outlying clusters exist with small but non-negligible\nweights (particularly in higher dimensions). Two real datasets are also used to\nillustrate the performance of the newly proposed method in comparison with\nothers along with an application in image processing. The proposed method\ndetects the clusters with lower misclassification rates and successfully points\nout the outlying (anomalous) observations from these datasets."}, "http://arxiv.org/abs/2202.02416": {"title": "Generalized Causal Tree for Uplift Modeling", "link": "http://arxiv.org/abs/2202.02416", "description": "Uplift modeling is crucial in various applications ranging from marketing and\npolicy-making to personalized recommendations. 
The main objective is to learn\noptimal treatment allocations for a heterogeneous population. A primary line of\nexisting work modifies the loss function of the decision tree algorithm to\nidentify cohorts with heterogeneous treatment effects. Another line of work\nestimates the individual treatment effects separately for the treatment group\nand the control group using off-the-shelf supervised learning algorithms. The\nformer approach that directly models the heterogeneous treatment effect is\nknown to outperform the latter in practice. However, the existing tree-based\nmethods are mostly limited to a single treatment and a single control use case,\nexcept for a handful of extensions to multiple discrete treatments. In this\npaper, we propose a generalization of tree-based approaches to tackle multiple\ndiscrete and continuous-valued treatments. We focus on a generalization of the\nwell-known causal tree algorithm due to its desirable statistical properties,\nbut our generalization technique can be applied to other tree-based approaches\nas well. The efficacy of our proposed method is demonstrated using experiments\nand real data examples."}, "http://arxiv.org/abs/2204.10426": {"title": "Marginal Structural Illness-Death Models for Semi-Competing Risks Data", "link": "http://arxiv.org/abs/2204.10426", "description": "The three state illness death model has been established as a general\napproach for regression analysis of semi competing risks data. For\nobservational data the marginal structural models (MSM) are a useful tool,\nunder the potential outcomes framework to define and estimate parameters with\ncausal interpretations. In this paper we introduce a class of marginal\nstructural illness death models for the analysis of observational semi\ncompeting risks data. We consider two specific such models, the Markov illness\ndeath MSM and the frailty based Markov illness death MSM. For interpretation\npurposes, risk contrasts under the MSMs are defined. Inference under the\nillness death MSM can be carried out using estimating equations with inverse\nprobability weighting, while inference under the frailty based illness death\nMSM requires a weighted EM algorithm. We study the inference procedures under\nboth MSMs using extensive simulations, and apply them to the analysis of mid\nlife alcohol exposure on late life cognitive impairment as well as mortality\nusing the Honolulu Asia Aging Study data set. The R codes developed in this\nwork have been implemented in the R package semicmprskcoxmsm that is publicly\navailable on CRAN."}, "http://arxiv.org/abs/2301.06098": {"title": "A novel method and comparison of methods for constructing Markov bridges", "link": "http://arxiv.org/abs/2301.06098", "description": "In this study, we address the central issue of statistical inference for\nMarkov jump processes using discrete time observations. The primary problem at\nhand is to accurately estimate the infinitesimal generator of a Markov jump\nprocess, a critical task in various applications. To tackle this problem, we\nbegin by reviewing established methods for generating sample paths from a\nMarkov jump process conditioned to endpoints, known as Markov bridges.\nAdditionally, we introduce a novel algorithm grounded in the concept of\ntime-reversal, which serves as our main contribution. 
Our proposed method is\nthen employed to estimate the infinitesimal generator of a Markov jump process.\nTo achieve this, we use a combination of Markov Chain Monte Carlo techniques\nand the Monte Carlo Expectation-Maximization algorithm. The results obtained\nfrom our approach demonstrate its effectiveness in providing accurate parameter\nestimates. To assess the efficacy of our proposed method, we conduct a\ncomprehensive comparative analysis with existing techniques (Bisection,\nUniformization, Direct, Rejection, and Modified Rejection), taking into\nconsideration both speed and accuracy. Notably, our method stands out as the\nfastest among the alternatives while maintaining high levels of precision."}, "http://arxiv.org/abs/2305.06262": {"title": "Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease", "link": "http://arxiv.org/abs/2305.06262", "description": "We propose a Bayesian model selection approach that allows medical\npractitioners to select among predictor variables while taking their respective\ncosts into account. Medical procedures almost always incur costs in time and/or\nmoney. These costs might exceed their usefulness for modeling the outcome of\ninterest. We develop Bayesian model selection that uses flexible model priors\nto penalize costly predictors a priori and select a subset of predictors useful\nrelative to their costs. Our approach (i) gives the practitioner control over\nthe magnitude of cost penalization, (ii) enables the prior to scale well with\nsample size, and (iii) enables the creation of our proposed inclusion path\nvisualization, which can be used to make decisions about individual candidate\npredictors using both probabilistic and visual tools. We demonstrate the\neffectiveness of our inclusion path approach and the importance of being able\nto adjust the magnitude of the prior's cost penalization through a dataset\npertaining to heart disease diagnosis in patients at the Cleveland Clinic\nFoundation, where several candidate predictors with various costs were recorded\nfor patients, and through simulated data."}, "http://arxiv.org/abs/2305.19139": {"title": "Estimating excess mortality in high-income countries during the COVID-19 pandemic", "link": "http://arxiv.org/abs/2305.19139", "description": "Quantifying the number of deaths caused by the COVID-19 crisis has been an\nongoing challenge for scientists, and no golden standard to do so has yet been\nestablished. We propose a principled approach to calculate age-adjusted yearly\nexcess mortality, and apply it to obtain estimates and uncertainty bounds for\n30 countries with publicly available data. The results uncover remarkable\nvariation in pandemic outcomes across different countries. 
We further compare\nour findings with existing estimates published in other major scientific\noutlets, highlighting the importance of proper age adjustment to obtain\nunbiased figures."}, "http://arxiv.org/abs/2312.12477": {"title": "Survey on Trustworthy Graph Neural Networks: From A Causal Perspective", "link": "http://arxiv.org/abs/2312.12477", "description": "Graph Neural Networks (GNNs) have emerged as powerful representation learning\ntools for capturing complex dependencies within diverse graph-structured data.\nDespite their success in a wide range of graph mining tasks, GNNs have raised\nserious concerns regarding their trustworthiness, including susceptibility to\ndistribution shift, biases towards certain populations, and lack of\nexplainability. Recently, integrating causal learning techniques into GNNs has\nsparked numerous ground-breaking studies since most of the trustworthiness\nissues can be alleviated by capturing the underlying data causality rather than\nsuperficial correlations. In this survey, we provide a comprehensive review of\nrecent research efforts on causality-inspired GNNs. Specifically, we first\npresent the key trustworthy risks of existing GNN models through the lens of\ncausality. Moreover, we introduce a taxonomy of Causality-Inspired GNNs\n(CIGNNs) based on the type of causal learning capability they are equipped\nwith, i.e., causal reasoning and causal representation learning. Besides, we\nsystematically discuss typical methods within each category and demonstrate how\nthey mitigate trustworthiness risks. Finally, we summarize useful resources and\ndiscuss several future directions, hoping to shed light on new research\nopportunities in this emerging field. The representative papers, along with\nopen-source data and codes, are available in\nhttps://github.com/usail-hkust/Causality-Inspired-GNNs."}, "http://arxiv.org/abs/2312.12638": {"title": "Using Exact Tests from Algebraic Statistics in Sparse Multi-way Analyses: An Application to Analyzing Differential Item Functioning", "link": "http://arxiv.org/abs/2312.12638", "description": "Asymptotic goodness-of-fit methods in contingency table analysis can struggle\nwith sparse data, especially in multi-way tables where it can be infeasible to\nmeet sample size requirements for a robust application of distributional\nassumptions. However, algebraic statistics provides exact alternatives to these\nclassical asymptotic methods that remain viable even with sparse data. We apply\nthese methods to a context in psychometrics and education research that leads\nnaturally to multi-way contingency tables: the analysis of differential item\nfunctioning (DIF). We explain concretely how to apply the exact methods of\nalgebraic statistics to DIF analysis using the R package algstat, and we\ncompare their performance to that of classical asymptotic methods."}, "http://arxiv.org/abs/2312.12641": {"title": "Matching via Distance Profiles", "link": "http://arxiv.org/abs/2312.12641", "description": "In this paper, we introduce and study matching methods based on distance\nprofiles. For the matching of point clouds, the proposed method is easily\nimplementable by solving a linear program, circumventing the computational\nobstacles of quadratic matching. Also, we propose and analyze a flexible way to\nexecute location-to-location matching using distance profiles. Moreover, we\nprovide a statistical estimation error analysis in the context of\nlocation-to-location matching using empirical process theory. 
Furthermore, we\napply our method to a certain model and show its noise stability by\ncharacterizing conditions on the noise level for the matching to be successful.\nLastly, we demonstrate the performance of the proposed method and compare it\nwith some existing methods using synthetic and real data."}, "http://arxiv.org/abs/2312.12645": {"title": "Revisiting the effect of greediness on the efficacy of exchange algorithms for generating exact optimal experimental designs", "link": "http://arxiv.org/abs/2312.12645", "description": "Coordinate exchange (CEXCH) is a popular algorithm for generating exact\noptimal experimental designs. The authors of CEXCH advocated for a highly\ngreedy implementation - one that exchanges and optimizes single-element\ncoordinates of the design matrix. We revisit the effect of greediness on CEXCH's\nefficacy for generating highly efficient designs. We implement the\nsingle-element CEXCH (most greedy), a design-row (medium greedy) optimization\nexchange, and particle swarm optimization (PSO; least greedy) on 21 exact\nresponse surface design scenarios, under the $D$- and $I$-criteria, which have\nwell-known optimal designs that have been reproduced by several researchers. We\nfound essentially no difference in performance between the most greedy CEXCH and the\nmedium greedy CEXCH. PSO did exhibit better efficacy than CEXCH for generating $D$-optimal\ndesigns, and for most $I$-optimal designs, but not to a strong\ndegree under our parametrization. This work suggests that further investigation\nof the greediness dimension and its effect on CEXCH efficacy on a wider suite\nof models and criteria is warranted."}, "http://arxiv.org/abs/2312.12678": {"title": "Causal Discovery for fMRI data: Challenges, Solutions, and a Case Study", "link": "http://arxiv.org/abs/2312.12678", "description": "Designing studies that apply causal discovery requires navigating many\nresearcher degrees of freedom. This complexity is exacerbated when the study\ninvolves fMRI data. In this paper we (i) describe nine challenges that occur\nwhen applying causal discovery to fMRI data, (ii) discuss the space of\ndecisions that need to be made, (iii) review how a recent case study made those\ndecisions, and (iv) identify existing gaps that could potentially be solved by\nthe development of new methods. Overall, causal discovery is a promising\napproach for analyzing fMRI data, and multiple successful applications have\nindicated that it is superior to traditional fMRI functional connectivity\nmethods, but current causal discovery methods for fMRI leave room for\nimprovement."}, "http://arxiv.org/abs/2312.12708": {"title": "Gradient flows for empirical Bayes in high-dimensional linear models", "link": "http://arxiv.org/abs/2312.12708", "description": "Empirical Bayes provides a powerful approach to learning and adapting to\nlatent structure in data. Theory and algorithms for empirical Bayes have a rich\nliterature for sequence models, but are less understood in settings where\nlatent variables and data interact through more complex designs. In this work,\nwe study empirical Bayes estimation of an i.i.d. prior in Bayesian linear\nmodels, via the nonparametric maximum likelihood estimator (NPMLE). We\nintroduce and study a system of gradient flow equations for optimizing the\nmarginal log-likelihood, jointly over the prior and posterior measures in its\nGibbs variational representation using a smoothed reparametrization of the\nregression coefficients. 
A diffusion-based implementation yields a Langevin\ndynamics MCEM algorithm, where the prior law evolves continuously over time to\noptimize a sequence-model log-likelihood defined by the coordinates of the\ncurrent Langevin iterate. We show consistency of the NPMLE as $n, p \\rightarrow\n\\infty$ under mild conditions, including settings of random sub-Gaussian\ndesigns when $n \\asymp p$. In high noise, we prove a uniform log-Sobolev\ninequality for the mixing of Langevin dynamics, for possibly misspecified\npriors and non-log-concave posteriors. We then establish polynomial-time\nconvergence of the joint gradient flow to a near-NPMLE if the marginal negative\nlog-likelihood is convex in a sub-level set of the initialization."}, "http://arxiv.org/abs/2312.12710": {"title": "Semiparametric Copula Estimation for Spatially Correlated Multivariate Mixed Outcomes: Analyzing Visual Sightings of Fin Whales from Line Transect Survey", "link": "http://arxiv.org/abs/2312.12710", "description": "Multivariate data having both continuous and discrete variables is known as\nmixed outcomes and has widely appeared in a variety of fields such as ecology,\nepidemiology, and climatology. In order to understand the probability structure\nof multivariate data, the estimation of the dependence structure among mixed\noutcomes is very important. However, when location information is equipped with\nmultivariate data, the spatial correlation should be adequately taken into\naccount; otherwise, the estimation of the dependence structure would be\nseverely biased. To solve this issue, we propose a semiparametric Bayesian\ninference for the dependence structure among mixed outcomes while eliminating\nspatial correlation. To this end, we consider a hierarchical spatial model\nbased on the rank likelihood and a latent multivariate Gaussian process. We\ndevelop an efficient algorithm for computing the posterior using the Markov\nChain Monte Carlo. We also provide a scalable implementation of the model using\nthe nearest-neighbor Gaussian process under large spatial datasets. We conduct\na simulation study to validate our proposed procedure and demonstrate that the\nprocedure successfully accounts for spatial correlation and correctly infers\nthe dependence structure among outcomes. Furthermore, the procedure is applied\nto a real example collected during an international synoptic krill survey in\nthe Scotia Sea of the Antarctic Peninsula, which includes sighting data of fin\nwhales (Balaenoptera physalus), and the relevant oceanographic data."}, "http://arxiv.org/abs/2312.12786": {"title": "Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets", "link": "http://arxiv.org/abs/2312.12786", "description": "Development of comprehensive prediction models are often of great interest in\nmany disciplines of science, but datasets with information on all desired\nfeatures typically have small sample sizes. In this article, we describe a\ntransfer learning approach for building high-dimensional generalized linear\nmodels using data from a main study that has detailed information on all\npredictors, and from one or more external studies that have ascertained a more\nlimited set of predictors. We propose using the external dataset(s) to build\nreduced model(s) and then transfer the information on underlying parameters for\nthe analysis of the main study through a set of calibration equations, while\naccounting for the study-specific effects of certain design variables. 
We then\nuse a generalized method of moments (GMM) with penalization for parameter\nestimation and develop highly scalable algorithms for fitting models taking\nadvantage of the popular glmnet package. We further show that the use of the\nadaptive-Lasso penalty leads to the oracle property of underlying parameter\nestimates and thus leads to convenient post-selection inference procedures. We\nconduct extensive simulation studies to investigate both predictive performance\nand post-selection inference properties of the proposed method. Finally, we\nillustrate a timely application of the proposed method for the development of\nrisk prediction models for five common diseases using the UK Biobank study,\ncombining baseline information from all study participants (500K) and recently\nreleased high-throughput proteomic data (# protein = 1500) on a subset (50K) of\nthe participants."}, "http://arxiv.org/abs/2312.12823": {"title": "Detecting Multiple Change-Points in Distributional Sequences Derived from Structural Health Monitoring Data: An Application to Bridge Damage Detection", "link": "http://arxiv.org/abs/2312.12823", "description": "Detecting damage in important structures using monitored data is a\nfundamental task of structural health monitoring, which is very important for\nthe structures' safety and life-cycle management. Based on the statistical\npattern recognition paradigm, damage detection can be achieved by detecting\nchanges in the distribution of properly extracted damage-sensitive features (DSFs).\nThis can be naturally formulated as a distributional change-point detection\nproblem. A good change-point detector for damage detection should be scalable\nto large DSF datasets, applicable to different types of changes and able to\ncontrol the false-positive indication rate. To address these challenges, we\npropose a new distributional change-point detection method for damage\ndetection. We embed the elements of a DSF distributional sequence into the\nWasserstein space and develop a MOSUM-type multiple change-point detector based\non Fr\'echet statistics. Theoretical properties are also established. Extensive\nsimulation studies demonstrate the superiority of our proposal against other\ncompetitors in addressing the aforementioned practical requirements. We apply\nour method to the cable-tension measurements monitored from a long-span\ncable-stayed bridge for cable damage detection. We conduct a comprehensive\nchange-point analysis for the extracted DSF data, and find some interesting\npatterns from the detected changes, which provides important insights into the\ndamage of the cable system."}, "http://arxiv.org/abs/2312.12844": {"title": "Causal Discovery under Identifiable Heteroscedastic Noise Model", "link": "http://arxiv.org/abs/2312.12844", "description": "Capturing the underlying structural causal relations represented by Directed\nAcyclic Graphs (DAGs) has been a fundamental task in various AI disciplines.\nCausal DAG learning via the continuous optimization framework has recently\nachieved promising performance in terms of both accuracy and efficiency.\nHowever, most methods make strong assumptions of homoscedastic noise, i.e.,\nexogenous noises have equal variances across variables, observations, or even\nboth. The noises in real data usually violate both assumptions due to the\nbiases introduced by different data collection processes. 
To address the issue\nof heteroscedastic noise, we introduce relaxed and implementable sufficient\nconditions, proving the identifiability of a general class of SEM subject to\nthese conditions. Based on the identifiable general SEM, we propose a novel\nformulation for DAG learning that accounts for the variation in noise variance\nacross variables and observations. We then propose an effective two-phase\niterative DAG learning algorithm to address the increasing optimization\ndifficulties and to learn a causal DAG from data with heteroscedastic variable\nnoise under varying variance. We show significant empirical gains of the\nproposed approaches over state-of-the-art methods on both synthetic data and\nreal data."}, "http://arxiv.org/abs/2312.12952": {"title": "High-dimensional sparse classification using exponential weighting with empirical hinge loss", "link": "http://arxiv.org/abs/2312.12952", "description": "In this study, we address the problem of high-dimensional binary\nclassification. Our proposed solution involves employing an aggregation\ntechnique founded on exponential weights and empirical hinge loss. Through the\nemployment of a suitable sparsity-inducing prior distribution, we demonstrate\nthat our method yields favorable theoretical results on predictions and\nmisclassification error. The efficiency of our procedure is achieved through\nthe utilization of Langevin Monte Carlo, a gradient-based sampling approach. To\nillustrate the effectiveness of our approach, we conduct comparisons with the\nlogistic Lasso on both simulated and a real dataset. Our method frequently\ndemonstrates superior performance compared to the logistic Lasso."}, "http://arxiv.org/abs/2312.12966": {"title": "Rank-based Bayesian clustering via covariate-informed Mallows mixtures", "link": "http://arxiv.org/abs/2312.12966", "description": "Data in the form of rankings, ratings, pair comparisons or clicks are\nfrequently collected in diverse fields, from marketing to politics, to\nunderstand assessors' individual preferences. Combining such preference data\nwith features associated with the assessors can lead to a better understanding\nof the assessors' behaviors and choices. The Mallows model is a popular model\nfor rankings, as it flexibly adapts to different types of preference data, and\nthe previously proposed Bayesian Mallows Model (BMM) offers a computationally\nefficient framework for Bayesian inference, also allowing capturing the users'\nheterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite\nmixture model that performs clustering while also accounting for\nassessor-related features, called the Bayesian Mallows model with covariates\n(BMMx). BMMx is based on a similarity function that a priori favours the\naggregation of assessors into a cluster when their covariates are similar,\nusing the Product Partition models (PPMx) proposal. We present two approaches\nto measure the covariate similarity: one based on a novel deterministic\nfunction measuring the covariates' goodness-of-fit to the cluster, and one\nbased on an augmented model as in PPMx. 
We investigate the performance of BMMx\nin both simulation experiments and real-data examples, showing the method's\npotential for advancing the understanding of assessor preferences and behaviors\nin different applications."}, "http://arxiv.org/abs/2312.13018": {"title": "Sample Design and Cross-sectional Weights of the Brazilian PCSVDF-Mulher Study (Waves 2016 and 2017): Integrating a Refreshment Sample with an Ongoing Longitudinal Sample to Calculate IPV Prevalence", "link": "http://arxiv.org/abs/2312.13018", "description": "Addressing unit non-response between waves of longitudinal studies\n(attrition) by means of sampling design in weighting has moved from an approach\nfocused on participant retention or modern missing data analysis procedures to\nan approach based on the availability of supplemental samples, either\ncollecting refreshment or replacement samples on an ongoing larger sample. We\nimplement a strategy for calculating individual cross-sectional weights and\napply them to the 2016 and 2017 waves of the PCSVDF-Mulher (Pesquisa de\nCondi\c{c}\~oes Socioecon\^omicas e Viol\^encia Dom\'estica e Familiar contra a\nMulher - Survey of Socioeconomic Conditions and Domestic and Family Violence\nagainst Women), a large ($\approx 10,000$) household interdisciplinary\nlongitudinal data set in Brazil to study intimate partner violence (IPV), its\ncauses and consequences. We developed a set of weights that combines a\nrefreshment sample collected in 2017 with the ongoing longitudinal sample\nstarted in 2016. Armed with this set of individual weights, we calculated IPV\nprevalence for nine capital cities in Brazil for the years 2016 and 2017. As\nfar as we know, this is the first attempt to calculate cross-sectional weights\nwith the aid of supplemental samples applied to a population-representative\nsample study focused on IPV. Our analysis produced a set of weights whose\ncomparison to unweighted designs reveals trends clearly neglected in the\nliterature on IPV measurement. Indeed, one of our key findings pointed to\nthe fact that, even in well-designed longitudinal household surveys, the\nindiscriminate use of unweighted designs to calculate IPV prevalence might\nartificially and inadvertently inflate their values, which in turn might introduce\ndistortions and have considerable political, social, budgetary, and scientific\nimplications."}, "http://arxiv.org/abs/2312.13044": {"title": "Particle Gibbs for Likelihood-Free Inference of State Space Models with Application to Stochastic Volatility", "link": "http://arxiv.org/abs/2312.13044", "description": "State space models (SSMs) are widely used to describe dynamic systems.\nHowever, when the likelihood of the observations is intractable, parameter\ninference for SSMs cannot be easily carried out using standard Markov chain\nMonte Carlo or sequential Monte Carlo methods. In this paper, we propose a\nparticle Gibbs sampler as a general strategy to handle SSMs with intractable\nlikelihoods in the approximate Bayesian computation (ABC) setting. The proposed\nsampler incorporates a conditional auxiliary particle filter, which can help\nmitigate the weight degeneracy often encountered in ABC. To illustrate the\nmethodology, we focus on a classic stochastic volatility model (SVM) used in\nfinance and econometrics for analyzing and interpreting volatility. Simulation\nstudies demonstrate the accuracy of our sampler for SVM parameter inference,\ncompared to existing particle Gibbs samplers based on the conditional bootstrap\nfilter. 
As a real data application, we apply the proposed sampler for fitting\nan SVM to S&P 500 Index time-series data during the 2008 financial crisis."}, "http://arxiv.org/abs/2312.13097": {"title": "Power calculation for cross-sectional stepped wedge cluster randomized trials with a time-to-event endpoint", "link": "http://arxiv.org/abs/2312.13097", "description": "A popular design choice in public health and implementation science research,\nstepped wedge cluster randomized trials (SW-CRTs) are a form of randomized\ntrial whereby clusters are progressively transitioned from control to\nintervention, and the timing of transition is randomized for each cluster. An\nimportant task at the design stage is to ensure that the planned trial has\nsufficient power to observe a clinically meaningful effect size. While methods\nfor determining study power have been well-developed for SW-CRTs with\ncontinuous and binary outcomes, limited methods for power calculation are\navailable for SW-CRTs with censored time-to-event outcomes. In this article, we\npropose a stratified marginal Cox model to account for secular trend in\ncross-sectional SW-CRTs, and derive an explicit expression of the robust\nsandwich variance to facilitate power calculations without the need for\ncomputationally intensive simulations. Power formulas based on both the Wald\nand robust score tests are developed and compared via simulation, generally\ndemonstrating superiority of robust score procedures in different finite-sample\nscenarios. Finally, we illustrate our methods using a SW-CRT testing the effect\nof a new electronic reminder system on time to catheter removal in hospital\nsettings. We also offer an R Shiny application to facilitate sample size and\npower calculations using our proposed methods."}, "http://arxiv.org/abs/2312.13148": {"title": "Partially factorized variational inference for high-dimensional mixed models", "link": "http://arxiv.org/abs/2312.13148", "description": "While generalized linear mixed models (GLMMs) are a fundamental tool in\napplied statistics, many specifications -- such as those involving categorical\nfactors with many levels or interaction terms -- can be computationally\nchallenging to estimate due to the need to compute or approximate\nhigh-dimensional integrals. Variational inference (VI) methods are a popular\nway to perform such computations, especially in the Bayesian context. However,\nnaive VI methods can provide unreliable uncertainty quantification. We show\nthat this is indeed the case in the GLMM context, proving that standard VI\n(i.e. mean-field) dramatically underestimates posterior uncertainty in\nhigh-dimensions. We then show how appropriately relaxing the mean-field\nassumption leads to VI methods whose uncertainty quantification does not\ndeteriorate in high-dimensions, and whose total computational cost scales\nlinearly with the number of parameters and observations. Our theoretical and\nnumerical results focus on GLMMs with Gaussian or binomial likelihoods, and\nrely on connections to random graph theory to obtain sharp high-dimensional\nasymptotic analysis. We also provide generic results, which are of independent\ninterest, relating the accuracy of variational inference to the convergence\nrate of the corresponding coordinate ascent variational inference (CAVI)\nalgorithm for Gaussian targets. Our proposed partially-factorized VI (PF-VI)\nmethodology for GLMMs is implemented in the R package vglmer, see\nhttps://github.com/mgoplerud/vglmer . 
Numerical results with simulated and real\ndata examples illustrate the favourable computation cost versus accuracy\ntrade-off of PF-VI."}, "http://arxiv.org/abs/2312.13168": {"title": "Learning Bayesian networks: a copula approach for mixed-type data", "link": "http://arxiv.org/abs/2312.13168", "description": "Estimating dependence relationships between variables is a crucial issue in\nmany applied domains, such as medicine, social sciences and psychology. When\nseveral variables are entertained, these can be organized into a network which\nencodes their set of conditional dependence relations. Typically however, the\nunderlying network structure is completely unknown or can be partially drawn\nonly; accordingly it should be learned from the available data, a process known\nas structure learning. In addition, data arising from social and psychological\nstudies are often of different types, as they can include categorical, discrete\nand continuous measurements. In this paper we develop a novel Bayesian\nmethodology for structure learning of directed networks which applies to mixed\ndata, i.e. possibly containing continuous, discrete, ordinal and binary\nvariables simultaneously. Whenever available, our method can easily incorporate\nknown dependence structures among variables represented by paths or edge\ndirections that can be postulated in advance based on the specific problem\nunder consideration. We evaluate the proposed method through extensive\nsimulation studies, with appreciable performances in comparison with current\nstate-of-the-art alternative methods. Finally, we apply our methodology to\nwell-being data from a social survey promoted by the United Nations, and mental\nhealth data collected from a cohort of medical students."}, "http://arxiv.org/abs/2202.02249": {"title": "Functional Mixtures-of-Experts", "link": "http://arxiv.org/abs/2202.02249", "description": "We consider the statistical analysis of heterogeneous data for prediction in\nsituations where the observations include functions, typically time series. We\nextend the modeling with Mixtures-of-Experts (ME), as a framework of choice in\nmodeling heterogeneity in data for prediction with vectorial observations, to\nthis functional data analysis context. We first present a new family of ME\nmodels, named functional ME (FME) in which the predictors are potentially noisy\nobservations, from entire functions. Furthermore, the data generating process\nof the predictor and the real response, is governed by a hidden discrete\nvariable representing an unknown partition. Second, by imposing sparsity on\nderivatives of the underlying functional parameters via Lasso-like\nregularizations, we provide sparse and interpretable functional representations\nof the FME models called iFME. We develop dedicated expectation--maximization\nalgorithms for Lasso-like (EM-Lasso) regularized maximum-likelihood parameter\nestimation strategies to fit the models. 
The proposed models and algorithms are\nstudied in simulated scenarios and in applications to two real data sets, and\nthe obtained results demonstrate their performance in accurately capturing\ncomplex nonlinear relationships and in clustering the heterogeneous regression\ndata."}, "http://arxiv.org/abs/2203.14511": {"title": "Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments", "link": "http://arxiv.org/abs/2203.14511", "description": "Researchers are increasingly turning to machine learning (ML) algorithms to\ninvestigate causal heterogeneity in randomized experiments. Despite their\npromise, ML algorithms may fail to accurately ascertain heterogeneous treatment\neffects under practical settings with many covariates and small sample size. In\naddition, the quantification of estimation uncertainty remains a challenge. We\ndevelop a general approach to statistical inference for heterogeneous treatment\neffects discovered by a generic ML algorithm. We apply Neyman's repeated\nsampling framework to a common setting, in which researchers use an ML\nalgorithm to estimate the conditional average treatment effect and then divide\nthe sample into several groups based on the magnitude of the estimated effects.\nWe show how to estimate the average treatment effect within each of these\ngroups, and construct a valid confidence interval. In addition, we develop\nnonparametric tests of treatment effect homogeneity across groups, and\nrank-consistency of within-group average treatment effects. The validity of our\nmethodology does not rely on the properties of ML algorithms because it is\nsolely based on the randomization of treatment assignment and random sampling\nof units. Finally, we generalize our methodology to the cross-fitting procedure\nby accounting for the additional uncertainty induced by the random splitting of\ndata."}, "http://arxiv.org/abs/2206.10323": {"title": "What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?", "link": "http://arxiv.org/abs/2206.10323", "description": "Estimation of heterogeneous treatment effects (HTE) is of prime importance in\nmany disciplines, ranging from personalized medicine to economics among many\nothers. Random forests have been shown to be a flexible and powerful approach\nto HTE estimation in both randomized trials and observational studies. In\nparticular \"causal forests\", introduced by Athey, Tibshirani and Wager (2019),\nalong with the R implementation in package grf, were rapidly adopted. A related\napproach, called \"model-based forests\", which is geared towards randomized\ntrials and simultaneously captures effects of both prognostic and predictive\nvariables, was introduced by Seibold, Zeileis and Hothorn (2018) along with a\nmodular implementation in the R package model4you.\n\nHere, we present a unifying view that goes beyond the theoretical motivations\nand investigates which computational elements make causal forests so successful\nand how these can be blended with the strengths of model-based forests. To do\nso, we show that both methods can be understood in terms of the same parameters\nand model assumptions for an additive model under L2 loss. 
This theoretical\ninsight allows us to implement several flavors of \"model-based causal forests\"\nand dissect their different elements in silico.\n\nThe original causal forests and model-based forests are compared with the new\nblended versions in a benchmark study exploring both randomized trials and\nobservational settings. In the randomized setting, both approaches performed\nsimilarly. If confounding was present in the data generating process, we found local\ncentering of the treatment indicator with the corresponding propensities to be\nthe main driver for good performance. Local centering of the outcome was less\nimportant, and might be replaced or enhanced by simultaneous split selection\nwith respect to both prognostic and predictive effects."}, "http://arxiv.org/abs/2303.13616": {"title": "Estimating Maximal Symmetries of Regression Functions via Subgroup Lattices", "link": "http://arxiv.org/abs/2303.13616", "description": "We present a method for estimating the maximal symmetry of a continuous\nregression function. Knowledge of such a symmetry can be used to significantly\nimprove modelling by removing the modes of variation resulting from the\nsymmetries. Symmetry estimation is carried out using hypothesis testing for\ninvariance strategically over the subgroup lattice of a search group G acting\non the feature space. We show that the estimation of the unique maximal\ninvariant subgroup of G generalises useful tools from linear dimension\nreduction to a non-linear context. We show that the estimation is consistent\nwhen the subgroup lattice chosen is finite, even when some of the subgroups\nthemselves are infinite. We demonstrate the performance of this estimator in\nsynthetic settings and apply the methods to two data sets: satellite\nmeasurements of the earth's magnetic field intensity; and the distribution of\nsunspots."}, "http://arxiv.org/abs/2305.12043": {"title": "SF-SFD: Stochastic Optimization of Fourier Coefficients to Generate Space-Filling Designs", "link": "http://arxiv.org/abs/2305.12043", "description": "Due to the curse of dimensionality, it is often prohibitively expensive to\ngenerate deterministic space-filling designs. On the other hand, when using\nna{\\\"i}ve uniform random sampling to generate designs cheaply, design points\ntend to concentrate in a small region of the design space. Although it is\npreferable in many settings to utilize quasi-random techniques such as Sobol\nsequences and Latin hypercube designs over uniform random sampling, these\nmethods have their own caveats, especially in high-dimensional\nspaces. In this paper, we propose a technique that addresses the fundamental\nissue of measure concentration by updating high-dimensional distribution\nfunctions to produce better space-filling designs. Then, we show that our\ntechnique can outperform Latin hypercube sampling and Sobol sequences by the\ndiscrepancy metric while generating moderately-sized space-filling samples for\nhigh-dimensional problems."}, "http://arxiv.org/abs/2306.03625": {"title": "Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning", "link": "http://arxiv.org/abs/2306.03625", "description": "We propose a simple and general framework for nonparametric estimation of\nheterogeneous treatment effects under fairness constraints. Under standard\nregularity conditions, we show that the resulting estimators possess the double\nrobustness property. 
We use this framework to characterize the trade-off\nbetween fairness and the maximum welfare achievable by the optimal policy. We\nevaluate the methods in a simulation study and illustrate them in a real-world\ncase study."}, "http://arxiv.org/abs/2312.13331": {"title": "A Bayesian Spatial Berkson error approach to estimate small area opioid mortality rates accounting for population-at-risk uncertainty", "link": "http://arxiv.org/abs/2312.13331", "description": "Monitoring small-area geographical population trends in opioid mortality has\nlarge-scale implications for informing preventative resource allocation. A\ncommon approach to obtain small area estimates of opioid mortality is to use a\nstandard disease mapping approach in which population-at-risk estimates are\ntreated as fixed and known. Assuming fixed populations ignores the uncertainty\nsurrounding small area population estimates, which may bias risk estimates and\nunder-estimate their associated uncertainties. We present a Bayesian Spatial\nBerkson Error (BSBE) model to incorporate population-at-risk uncertainty within\na disease mapping model. We compare the BSBE approach to the naive approach (treating\ndenominators as fixed) using simulation studies to illustrate potential bias\nresulting from this assumption. We show the application of the BSBE model to\nobtain 2020 opioid mortality risk estimates for 159 counties in GA accounting\nfor population-at-risk uncertainty. Utilizing our proposed approach will help\nto inform interventions in opioid-related public health responses, policies,\nand resource allocation. Additionally, we provide a general framework to\nimprove the estimation and mapping of health indicators."}, "http://arxiv.org/abs/2312.13416": {"title": "A new criterion for interpreting acoustic emission damage signals in condition monitoring based on the distribution of cluster onsets", "link": "http://arxiv.org/abs/2312.13416", "description": "Structural Health Monitoring (SHM) relies on non-destructive techniques such\nas Acoustic Emission (AE) which provide a large amount of data over the life of\nthe systems. The analysis of these data is often based on clustering in order\nto get insights about damage evolution. In order to evaluate clustering\nresults, current approaches include Clustering Validity Indices (CVI) which\nfavor compact and separable clusters. However, these shape-based criteria are\nnot specific to AE data and SHM. This paper proposes a new approach based on\nthe sequentiality of cluster onsets. For monitoring purposes, onsets indicate\nwhen potential damage occurs for the first time and allow the detection of the\ninitiation of defects. The proposed CVI relies on the Kullback-Leibler\ndivergence and enables the incorporation of prior knowledge on damage onsets when available.\nThree experiments on real-world data sets demonstrate the relevance of the\nproposed approach. The first benchmark concerns the detection of the loosening\nof bolted plates under vibration. The proposed onset-based CVI outperforms the\nstandard approach in terms of both cluster quality and accuracy in detecting\nchanges in loosening. The second application involves micro-drilling of hard\nmaterials using Electrical Discharge Machining. In this industrial application,\nit is demonstrated that the proposed CVI can be used to evaluate the electrode\nprogression until the reference depth, which is essential to ensure structural\nintegrity. Lastly, the third application concerns damage monitoring in a\ncomposite/metal hybrid joint structure. 
As an important result, the timeline of\nclusters generated by the proposed CVI is used to draw a scenario that accounts\nfor the occurrence of slippage leading to a critical failure."}, "http://arxiv.org/abs/2312.13430": {"title": "Debiasing Sample Loadings and Scores in Exponential Family PCA for Sparse Count Data", "link": "http://arxiv.org/abs/2312.13430", "description": "Multivariate count data with many zeros frequently occur in a variety of\napplication areas such as text mining with a document-term matrix and cluster\nanalysis with microbiome abundance data. Exponential family PCA (Collins et\nal., 2001) is a widely used dimension reduction tool to understand and capture\nthe underlying low-rank structure of count data. It produces principal\ncomponent scores by fitting Poisson regression models with estimated loadings\nas covariates. This tends to result in extreme scores for sparse count data\nsignificantly deviating from true scores. We consider two major sources of bias\nin this estimation procedure and propose ways to reduce their effects. First,\nthe discrepancy between true loadings and their estimates under a limited\nsample size largely degrades the quality of score estimates. By treating\nestimated loadings as covariates with bias and measurement errors, we debias\nscore estimates, using the iterative bootstrap method for loadings and\nconsidering classical measurement error models. Second, the existence of MLE\nbias is often ignored in score estimation, but this bias could be removed\nthrough well-known MLE bias reduction methods. We demonstrate the effectiveness\nof the proposed bias correction procedure through experiments on both simulated\ndata and real data."}, "http://arxiv.org/abs/2312.13450": {"title": "Precise FWER Control for Gaussian Related Fields: Riding the SuRF to continuous land -- Part 1", "link": "http://arxiv.org/abs/2312.13450", "description": "The Gaussian Kinematic Formula (GKF) is a powerful and computationally\nefficient tool to perform statistical inference on random fields and became a\nwell-established tool in the analysis of neuroimaging data. Using realistic\nerror models, recent articles show that GKF based methods for \\emph{voxelwise\ninference} lead to conservative control of the familywise error rate (FWER) and\nfor cluster-size inference lead to inflated false positive rates. In this\nseries of articles we identify and resolve the main causes of these\nshortcomings in the traditional usage of the GKF for voxelwise inference. This\nfirst part removes the \\textit{good lattice assumption} and allows the data to\nbe non-stationary, yet still assumes the data to be Gaussian. 
The latter\nassumption is resolved in part 2, where we also demonstrate that our GKF based\nmethodology is non-conservative under realistic error models."}, "http://arxiv.org/abs/2312.13454": {"title": "MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records", "link": "http://arxiv.org/abs/2312.13454", "description": "Objective: To improve survival analysis using EHR data, we aim to develop a\nsupervised topic model called MixEHR-SurG to simultaneously integrate\nheterogeneous EHR data and model survival hazard.\n\nMaterials and Methods: Our technical contributions are threefold: (1)\nintegrating EHR topic inference with Cox proportional hazards likelihood; (2)\ninferring patient-specific topic hyperparameters using the PheCode concepts\nsuch that each topic can be identified with exactly one PheCode-associated\nphenotype; (3) multi-modal survival topic inference. This leads to a highly\ninterpretable survival and guided topic model that can infer PheCode-specific\nphenotype topics associated with patient mortality. We evaluated MixEHR-G using\na simulated dataset and two real-world EHR datasets: the Quebec Congenital\nHeart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient\nclaim records covering 1,767 unique ICD codes; and the MIMIC-III dataset consisting of 1,458\nsubjects with multi-modal EHR records.\n\nResults: Compared to the baselines, MixEHR-G achieved a superior dynamic\nAUROC for mortality prediction, with a mean AUROC score of 0.89 in the\nsimulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively,\nMixEHR-G associates severe cardiac conditions with high mortality risk among\nthe CHD patients after the first heart failure hospitalization and critical\nbrain injuries with increased mortality among the MIMIC-III patients after\ntheir ICU discharge.\n\nConclusion: The integration of the Cox proportional hazards model and EHR\ntopic inference in MixEHR-SurG led to not only competitive mortality prediction\nbut also meaningful phenotype topics for systematic survival analysis. The\nsoftware is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG."}, "http://arxiv.org/abs/2312.13460": {"title": "Hierarchical selection of genetic and gene by environment interaction effects in high-dimensional mixed models", "link": "http://arxiv.org/abs/2312.13460", "description": "Interactions between genes and environmental factors may play a key role in\nthe etiology of many common disorders. Several regularized generalized linear\nmodels (GLMs) have been proposed for hierarchical selection of gene by\nenvironment interaction (GEI) effects, where a GEI effect is selected only if\nthe corresponding genetic main effect is also selected in the model. However,\nnone of these methods allows the inclusion of random effects to account for population\nstructure, subject relatedness and shared environmental exposure. In this\npaper, we develop a unified approach based on regularized penalized\nquasi-likelihood (PQL) estimation to perform hierarchical selection of GEI\neffects in sparse regularized mixed models. We compare the selection and\nprediction accuracy of our proposed model with existing methods through\nsimulations in the presence of population structure and shared environmental\nexposure. 
We show that for all simulation scenarios, compared to other\npenalized methods, our proposed method enforced sparsity by controlling the\nnumber of false positives in the model while having the best predictive\nperformance. Finally, we apply our method to a real data application using the\nOrofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, and\nfound that our method retrieves previously reported significant loci."}, "http://arxiv.org/abs/2312.13482": {"title": "Spatially Adaptive Variable Screening in Presurgical fMRI Data Analysis", "link": "http://arxiv.org/abs/2312.13482", "description": "Accurate delineation of tumor-adjacent functional brain regions is essential\nfor planning function-preserving neurosurgery. Functional magnetic resonance\nimaging (fMRI) is increasingly used for presurgical counseling and planning.\nWhen analyzing presurgical fMRI data, false negatives are more dangerous to the\npatients than false positives because patients are more likely to experience\nsignificant harm from failing to identify functional regions and subsequently\nresecting critical tissues. In this paper, we propose a novel spatially\nadaptive variable screening procedure to enable effective control of false\nnegatives while leveraging the spatial structure of fMRI data. Compared to\nexisting statistical methods in fMRI data analysis, the new procedure directly\ncontrols false negatives at a desirable level and is completely data-driven.\nThe new method is also substantially different from existing false-negative\ncontrol procedures which do not take spatial information into account.\nNumerical examples show that the new method outperforms several\nstate-of-the-art methods in retaining signal voxels, especially the subtle ones\nat the boundaries of functional regions, while providing cleaner separation of\nfunctional regions from background noise. Such results could be valuable to\npreserve critical tissues in neurosurgery."}, "http://arxiv.org/abs/2312.13517": {"title": "An utopic adventure in the modelling of conditional univariate and multivariate extremes", "link": "http://arxiv.org/abs/2312.13517", "description": "The EVA 2023 data competition consisted of four challenges, ranging from\ninterval estimation for very high quantiles of univariate extremes conditional\non covariates, point estimation of unconditional return levels under a custom\nloss function, to estimation of the probabilities of tail events for low and\nhigh-dimensional multivariate data. We tackle these tasks by revisiting the\ncurrent and existing literature on conditional univariate and multivariate\nextremes. We propose new cross-validation methods for covariate-dependent\nmodels, validation metrics for exchangeable multivariate models, formulae for\nthe joint probability of exceedance for multivariate generalized Pareto vectors\nand a composition sampling algorithm for generating multivariate tail events\nfor the latter. We highlight overarching themes ranging from model validation\nat extremely high quantile levels to building custom estimation strategies that\nleverage model assumptions."}, "http://arxiv.org/abs/2312.13643": {"title": "Debiasing Welch's Method for Spectral Density Estimation", "link": "http://arxiv.org/abs/2312.13643", "description": "Welch's method provides an estimator of the power spectral density that is\nstatistically consistent. This is achieved by averaging over periodograms\ncalculated from overlapping segments of a time series. 
For a finite-length time\nseries, while the variance of the estimator decreases as the number of segments\nincreases, the magnitude of the estimator's bias increases: a bias-variance\ntrade-off ensues when setting the segment number. We address this issue by\nproviding a novel method for debiasing Welch's method which maintains the\ncomputational complexity and asymptotic consistency, and leads to improved\nfinite-sample performance. Theoretical results are given for fourth-order\nstationary processes with finite fourth-order moments and absolutely continuous\nfourth-order cumulant spectrum. The significant bias reduction is demonstrated\nwith numerical simulation and an application to real-world data, where several\nempirical metrics indicate our debiased estimator compares favourably to\nWelch's. Our estimator also permits irregular spacing over frequency and we\ndemonstrate how this may be employed for signal compression and further\nvariance reduction. Code accompanying this work is available in the R and\npython languages."}, "http://arxiv.org/abs/2312.13725": {"title": "Extreme Value Statistics for Analysing Simulated Environmental Extremes", "link": "http://arxiv.org/abs/2312.13725", "description": "We present the methods employed by team `Uniofbathtopia' as part of the Data\nChallenge organised for the 13th International Conference on Extreme Value\nAnalysis (EVA2023), including our winning entry for the third sub-challenge.\nOur approaches unite ideas from extreme value theory, which provides a\nstatistical framework for the estimation of probabilities/return levels\nassociated with rare events, with techniques from unsupervised statistical\nlearning, such as clustering and support identification. The methods are\ndemonstrated on the data provided for the Data Challenge -- environmental data\nsampled from the fantasy country of `Utopia' -- but the underlying assumptions\nand frameworks should apply in more general settings and applications."}, "http://arxiv.org/abs/2312.13875": {"title": "Best Arm Identification in Batched Multi-armed Bandit Problems", "link": "http://arxiv.org/abs/2312.13875", "description": "Recently, the multi-armed bandit problem has arisen in many real-life scenarios where\narms must be sampled in batches, due to the limited time the agent can wait for\nfeedback. Such applications include biological experimentation and online\nmarketing. The problem is further complicated when the number of arms is large\nand the number of batches is small. We consider pure exploration in a batched\nmulti-armed bandit problem. We introduce a general linear programming framework\nthat can incorporate objectives of different theoretical settings in best arm\nidentification. The linear program leads to a two-stage algorithm that can\nachieve good theoretical properties. We demonstrate by numerical studies that\nthe algorithm also has good performance compared to certain UCB-type or\nThompson sampling methods."}, "http://arxiv.org/abs/2312.13992": {"title": "Bayesian nonparametric boundary detection for income areal data", "link": "http://arxiv.org/abs/2312.13992", "description": "Recent discussions on the future of metropolitan cities underscore the\npivotal role of (social) equity, driven by demographic and economic trends.\nMore equal policies can foster and contribute to a city's economic success and\nsocial stability. 
In this work, we focus on identifying metropolitan areas with\ndistinct economic and social levels in the greater Los Angeles area, one of the\nmost diverse yet unequal areas in the United States. Utilizing American\nCommunity Survey data, we propose a Bayesian model for boundary detection based\non income distributions. The model identifies areas with significant income\ndisparities, offering actionable insights for policymakers to address social\nand economic inequalities. Our approach formalized as a Bayesian structural\nlearning framework, models areal densities through finite mixture models.\nEfficient posterior computation is facilitated by a transdimensional Markov\nChain Monte Carlo sampler. The methodology is validated via extensive\nsimulations and applied to the income distributions in the greater Los Angeles\narea. We identify several boundaries in the income distributions which can be\nexplained in light of other social dynamics such as crime rates and healthcare,\nshowing the usefulness of such an analysis to policymakers."}, "http://arxiv.org/abs/2312.14013": {"title": "Two-Stage Pseudo Maximum Likelihood Estimation of Semiparametric Copula-based Regression Models for Semi-Competing Risks Data", "link": "http://arxiv.org/abs/2312.14013", "description": "We propose a two-stage estimation procedure for a copula-based model with\nsemi-competing risks data, where the non-terminal event is subject to dependent\ncensoring by the terminal event, and both events are subject to independent\ncensoring. Under a copula-based model, the marginal survival functions of\nindividual event times are specified by semiparametric transformation models,\nand the dependence between the bivariate event times is specified by a\nparametric copula function. For the estimation procedure, in the first stage,\nthe parameters associated with the marginal of the terminal event are estimated\nonly using the corresponding observed outcomes, and in the second stage, the\nmarginal parameters for the non-terminal event time and the copula parameter\nare estimated via maximizing a pseudo-likelihood function based on the joint\ndistribution of the bivariate event times. We derived the asymptotic properties\nof the proposed estimator and provided an analytic variance estimator for\ninference. Through simulation studies, we showed that our approach leads to\nconsistent estimates with less computational cost and more robustness compared\nto the one-stage procedure developed in Chen (2012), where all parameters were\nestimated simultaneously. In addition, our approach demonstrates more desirable\nfinite-sample performances over another existing two-stage estimation method\nproposed in Zhu et al. (2021)."}, "http://arxiv.org/abs/2312.14086": {"title": "A Bayesian approach to functional regression: theory and computation", "link": "http://arxiv.org/abs/2312.14086", "description": "We propose a novel Bayesian methodology for inference in functional linear\nand logistic regression models based on the theory of reproducing kernel\nHilbert spaces (RKHS's). These models build upon the RKHS associated with the\ncovariance function of the underlying stochastic process, and can be viewed as\na finite-dimensional approximation to the classical functional regression\nparadigm. The corresponding functional model is determined by a function living\non a dense subspace of the RKHS of interest, which has a tractable parametric\nform based on linear combinations of the kernel. 
By imposing a suitable prior\ndistribution on this functional space, we can naturally perform data-driven\ninference via standard Bayes methodology, estimating the posterior distribution\nthrough Markov chain Monte Carlo (MCMC) methods. In this context, our\ncontribution is two-fold. First, we derive a theoretical result that guarantees\nposterior consistency in these models, based on an application of a classic\ntheorem of Doob to our RKHS setting. Second, we show that several prediction\nstrategies stemming from our Bayesian formulation are competitive against other\nusual alternatives in both simulations and real data sets, including a\nBayesian-motivated variable selection procedure."}, "http://arxiv.org/abs/2312.14130": {"title": "Adaptation using spatially distributed Gaussian Processes", "link": "http://arxiv.org/abs/2312.14130", "description": "We consider the accuracy of an approximate posterior distribution in\nnonparametric regression problems by combining posterior distributions computed\non subsets of the data defined by the locations of the independent variables.\nWe show that this approximate posterior retains the rate of recovery of the\nfull data posterior distribution, where the rate of recovery adapts to the\nsmoothness of the true regression function. As particular examples we consider\nGaussian process priors based on integrated Brownian motion and the Mat\\'ern\nkernel augmented with a prior on the length scale. Besides theoretical\nguarantees we present a numerical study of the methods both on synthetic and\nreal world data. We also propose a new aggregation technique, which numerically\noutperforms previous approaches."}, "http://arxiv.org/abs/2202.00824": {"title": "KSD Aggregated Goodness-of-fit Test", "link": "http://arxiv.org/abs/2202.00824", "description": "We investigate properties of goodness-of-fit tests based on the Kernel Stein\nDiscrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg,\nwhich aggregates multiple tests with different kernels. KSDAgg avoids splitting\nthe data to perform kernel selection (which leads to a loss in test power), and\nrather maximises the test power over a collection of kernels. We provide\nnon-asymptotic guarantees on the power of KSDAgg: we show it achieves the\nsmallest uniform separation rate of the collection, up to a logarithmic term.\nFor compactly supported densities with bounded model score function, we derive\nthe rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the\nminimax optimal rate over unrestricted Sobolev balls, up to an iterated\nlogarithmic term. KSDAgg can be computed exactly in practice as it relies\neither on a parametric bootstrap or on a wild bootstrap to estimate the\nquantiles and the level corrections. In particular, for the crucial choice of\nbandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such\nas median or standard deviation) or to data splitting. We find on both\nsynthetic and real-world data that KSDAgg outperforms other state-of-the-art\nquadratic-time adaptive KSD-based goodness-of-fit testing procedures."}, "http://arxiv.org/abs/2209.06101": {"title": "Evaluating individualized treatment effect predictions: a model-based perspective on discrimination and calibration assessment", "link": "http://arxiv.org/abs/2209.06101", "description": "In recent years, there has been a growing interest in the prediction of\nindividualized treatment effects. 
While there is a rapidly growing literature\non the development of such models, there is little literature on the evaluation\nof their performance. In this paper, we aim to facilitate the validation of\nprediction models for individualized treatment effects. The estimands of\ninterest are defined based on the potential outcomes framework, which\nfacilitates a comparison of existing and novel measures. In particular, we\nexamine existing measures of discrimination for benefit (variations\nof the c-for-benefit), and propose model-based extensions to the treatment\neffect setting for discrimination and calibration metrics that have a strong\nbasis in outcome risk prediction. The main focus is on randomized trial data\nwith binary endpoints and on models that provide individualized treatment\neffect predictions and potential outcome predictions. We use simulated data to\nprovide insight into the characteristics of the discrimination and\ncalibration statistics under consideration, and further illustrate all methods\nin a trial of acute ischemic stroke treatment. The results show that the\nproposed model-based statistics had the best characteristics in terms of bias\nand accuracy. While resampling methods adjusted for the optimism of performance\nestimates in the development data, they had a high variance across replications\nthat limited their accuracy. Therefore, individualized treatment effect models\nare best validated in independent data. To aid implementation, a software\nimplementation of the proposed methods was made available in R."}, "http://arxiv.org/abs/2211.11400": {"title": "The Online Closure Principle", "link": "http://arxiv.org/abs/2211.11400", "description": "The closure principle is fundamental in multiple testing and has been used to\nderive many efficient procedures with familywise error rate control. However,\nit is often unsuitable for modern research, which involves flexible multiple\ntesting settings where not all hypotheses are known at the beginning of the\nevaluation. In this paper, we focus on online multiple testing where a possibly\ninfinite sequence of hypotheses is tested over time. At each step, a decision must be\nmade on the current hypothesis without having any information about the\nhypotheses that have not been tested yet. Our main contribution is a general\nand stringent mathematical definition of online multiple testing and a new\nonline closure principle which ensures that the resulting closed procedure can\nbe applied in the online setting. We prove that any familywise error rate\ncontrolling online procedure can be derived by this online closure principle\nand provide admissibility results. In addition, we demonstrate how short-cuts\nof these online closed procedures can be obtained under a suitable consonance\nproperty."}, "http://arxiv.org/abs/2305.12616": {"title": "Conformal Prediction With Conditional Guarantees", "link": "http://arxiv.org/abs/2305.12616", "description": "We consider the problem of constructing distribution-free prediction sets\nwith finite-sample conditional guarantees. Prior work has shown that it is\nimpossible to provide exact conditional coverage universally in finite samples.\nThus, most popular methods only provide marginal coverage over the covariates.\nThis paper bridges this gap by defining a spectrum of problems that interpolate\nbetween marginal and conditional validity. We motivate these problems by\nreformulating conditional coverage as coverage over a class of covariate\nshifts. 
When the target class of shifts is finite dimensional, we show how to\nsimultaneously obtain exact finite sample coverage over all possible shifts.\nFor example, given a collection of protected subgroups, our algorithm outputs\nintervals with exact coverage over each group. For more flexible, infinite\ndimensional classes where exact coverage is impossible, we provide a simple\nprocedure for quantifying the gap between the coverage of our algorithm and the\ntarget level. Moreover, by tuning a single hyperparameter, we allow the\npractitioner to control the size of this gap across shifts of interest. Our\nmethods can be easily incorporated into existing split conformal inference\npipelines, and thus can be used to quantify the uncertainty of modern black-box\nalgorithms without distributional assumptions."}, "http://arxiv.org/abs/2305.16842": {"title": "Accounting statement analysis at industry level", "link": "http://arxiv.org/abs/2305.16842", "description": "Compositional data are contemporarily defined as positive vectors, the ratios\namong whose elements are of interest to the researcher. Financial statement\nanalysis by means of accounting ratios fulfils this definition to the letter.\nCompositional data analysis solves the major problems in statistical analysis\nof standard financial ratios at industry level, such as skewness,\nnon-normality, non-linearity and dependence of the results on the choice of\nwhich accounting figure goes to the numerator and to the denominator of the\nratio. In spite of this, compositional applications to financial statement\nanalysis are still rare. In this article, we present some transformations\nwithin compositional data analysis that are particularly useful for financial\nstatement analysis. We show how to compute industry or sub-industry means of\nstandard financial ratios from a compositional perspective. We show how to\nvisualise firms in an industry with a compositional biplot, to classify them\nwith compositional cluster analysis and to relate financial and non-financial\nindicators with compositional regression models. We show an application to the\naccounting statements of Spanish wineries using DuPont analysis, and a\nstep-by-step tutorial to the compositional freeware CoDaPack."}, "http://arxiv.org/abs/2306.12865": {"title": "Estimating dynamic treatment regimes for ordinal outcomes with household interference: Application in household smoking cessation", "link": "http://arxiv.org/abs/2306.12865", "description": "The focus of precision medicine is on decision support, often in the form of\ndynamic treatment regimes (DTRs), which are sequences of decision rules. At\neach decision point, the decision rules determine the next treatment according\nto the patient's baseline characteristics, the information on treatments and\nresponses accrued by that point, and the patient's current health status,\nincluding symptom severity and other measures. However, DTR estimation with\nordinal outcomes is rarely studied, and rarer still in the context of\ninterference - where one patient's treatment may affect another's outcome. In\nthis paper, we introduce the weighted proportional odds model (WPOM): a\nregression-based, approximate doubly-robust approach to single-stage DTR\nestimation for ordinal outcomes. This method also accounts for the possibility\nof interference between individuals sharing a household through the use of\ncovariate balancing weights derived from joint propensity scores. 
Examining\ndifferent types of balancing weights, we verify the approximate double\nrobustness of WPOM with our adjusted weights via simulation studies. We further\nextend WPOM to multi-stage DTR estimation with household interference, namely\ndWPOM (dynamic WPOM). Lastly, we demonstrate our proposed methodology in the\nanalysis of longitudinal survey data from the Population Assessment of Tobacco\nand Health study, which motivates this work. Furthermore, considering\ninterference, we provide optimal treatment strategies for households to achieve\nsmoking cessation of the pair in the household."}, "http://arxiv.org/abs/2307.16502": {"title": "Percolated stochastic block model via EM algorithm and belief propagation with non-backtracking spectra", "link": "http://arxiv.org/abs/2307.16502", "description": "Whereas Laplacian and modularity based spectral clustering is well suited to dense\ngraphs, recent results show that for sparse ones, the non-backtracking spectrum\nis the best candidate to find assortative clusters of nodes. Here belief\npropagation in the sparse stochastic block model is derived with arbitrary\ngiven model parameters, which results in a non-linear system of equations; with\nlinear approximation, the spectrum of the non-backtracking matrix is able to\nspecify the number $k$ of clusters. Then the model parameters themselves can be\nestimated by the EM algorithm. Bond percolation in the assortative model is\nconsidered in the following two senses: the within- and between-cluster edge\nprobabilities decrease with the number of nodes, and edges coming into existence\nin this way are retained with probability $\\beta$. As a consequence, the\noptimal $k$ is the number of structural real eigenvalues (greater than\n$\\sqrt{c}$, where $c$ is the average degree) of the non-backtracking matrix of\nthe graph. Assuming these eigenvalues $\\mu_1 >\\dots > \\mu_k$ are distinct, the\nmultiple phase transitions obtained for $\\beta$ are $\\beta_i\n=\\frac{c}{\\mu_i^2}$; further, at $\\beta_i$ the number of detectable clusters is\n$i$, for $i=1,\\dots ,k$. Inflation-deflation techniques are also discussed to\nclassify the nodes themselves, which can form the basis of sparse spectral\nclustering."}, "http://arxiv.org/abs/2310.02679": {"title": "Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization", "link": "http://arxiv.org/abs/2310.02679", "description": "We tackle the problem of sampling from intractable high-dimensional density\nfunctions, a fundamental task that often appears in machine learning and\nstatistics. We extend recent sampling-based approaches that leverage controlled\nstochastic processes to model approximate samples from these target densities.\nThe main drawback of these approaches is that the training objective requires\nfull trajectories to compute, resulting in sluggish credit assignment due to\nthe use of entire trajectories and a learning signal present only at the\nterminal time. In this work, we present Diffusion Generative Flow Samplers\n(DGFS), a sampling-based framework where the learning process can be tractably\nbroken down into short partial trajectory segments, via parameterizing an\nadditional \"flow function\". Our method takes inspiration from the theory\ndeveloped for generative flow networks (GFlowNets), allowing us to make use of\nintermediate learning signals. 
Through various challenging experiments, we\ndemonstrate that DGFS achieves more accurate estimates of the normalization\nconstant than closely-related prior methods."}, "http://arxiv.org/abs/2312.14333": {"title": "Behaviour Modelling of Social Animals via Causal Structure Discovery and Graph Neural Networks", "link": "http://arxiv.org/abs/2312.14333", "description": "Better understanding the natural world is a crucial task with a wide range of\napplications. In environments with close proximity between humans and animals,\nsuch as zoos, it is essential to better understand the causes behind animal\nbehaviour and what interventions are responsible for changes in their\nbehaviours. This can help to predict unusual behaviours, mitigate detrimental\neffects and increase the well-being of animals. There has been work on\nmodelling the dynamics behind swarms of birds and insects but the complex\nsocial behaviours of mammalian groups remain less explored. In this work, we\npropose a method to build behavioural models using causal structure discovery\nand graph neural networks for time series. We apply this method to a mob of\nmeerkats in a zoo environment and study its ability to predict future actions\nand model the behaviour distribution at an individual level and at a group\nlevel. We show that our method can match and outperform standard deep learning\narchitectures and generate more realistic data, while using fewer parameters\nand providing increased interpretability."}, "http://arxiv.org/abs/2312.14416": {"title": "Joint Semi-Symmetric Tensor PCA for Integrating Multi-modal Populations of Networks", "link": "http://arxiv.org/abs/2312.14416", "description": "Multi-modal populations of networks arise in many scenarios, including\nlarge-scale multi-modal neuroimaging studies that capture both functional and\nstructural neuroimaging data for thousands of subjects. A major research\nquestion in such studies is how functional and structural brain connectivity\nare related and how they vary across the population. We develop a novel\nPCA-type framework for integrating multi-modal undirected networks measured on\nmany subjects. Specifically, we arrange these networks as semi-symmetric\ntensors, where each tensor slice is a symmetric matrix representing a network\nfrom an individual subject. We then propose a novel Joint, Integrative\nSemi-Symmetric Tensor PCA (JisstPCA) model, associated with an efficient\niterative algorithm, for jointly finding low-rank representations of two or\nmore networks across the same population of subjects. We establish one-step\nstatistical convergence of our separate low-rank network factors as well as the\nshared population factors to the true factors, with finite sample statistical\nerror bounds. Through simulation studies and a real data example for\nintegrating multi-subject functional and structural brain connectivity, we\nillustrate the advantages of our method for finding joint low-rank structures\nin multi-modal populations of networks."}, "http://arxiv.org/abs/2312.14420": {"title": "On eigenvalues of sample covariance matrices based on high dimensional compositional data", "link": "http://arxiv.org/abs/2312.14420", "description": "This paper studies the asymptotic spectral properties of the sample\ncovariance matrix for high dimensional compositional data, including the\nlimiting spectral distribution, the limit of extreme eigenvalues, and the\ncentral limit theorem for linear spectral statistics. 
All asymptotic results\nare derived under the high-dimensional regime where the data dimension\nincreases to infinity proportionally with the sample size. The findings reveal\nthat the limiting spectral distribution is the well-known Marchenko-Pastur law.\nThe largest (or smallest non-zero) eigenvalue converges almost surely to the\nright (or left) endpoint of the limiting spectral distribution, respectively.\nMoreover, the linear spectral statistics demonstrate a Gaussian limit.\nSimulation experiments demonstrate the accuracy of the theoretical results."}, "http://arxiv.org/abs/2312.14534": {"title": "Global Rank Sum Test: An Efficient Rank-Based Nonparametric Test for Large Scale Online Experiment", "link": "http://arxiv.org/abs/2312.14534", "description": "Online experiments are widely used for improving online services. In online\nexperiments, the Student t-test is the most widely used hypothesis\ntesting technique. In practice, however, the normality assumption on which the\nt-test depends may fail, resulting in untrustworthy results. In this\npaper, we first discuss the question of when the t-test fails, and thus\nintroduce the rank-sum test. Next, in order to address the difficulties of\nimplementing the rank-sum test in large online experiment platforms, we propose a\nglobal-rank-sum test method as an improvement over the traditional one. Finally,\nwe demonstrate that the global-rank-sum test is not only more accurate and has\nhigher statistical power than the t-test, but is also more time efficient than the\ntraditional rank-sum test, which makes it practical for large online\nexperiment platforms to adopt."}, "http://arxiv.org/abs/2312.14549": {"title": "A machine learning approach based on survival analysis for IBNR frequencies in non-life reserving", "link": "http://arxiv.org/abs/2312.14549", "description": "We introduce new approaches for forecasting IBNR (Incurred But Not Reported)\nfrequencies by leveraging individual claims data, which includes accident date,\nreporting delay, and possibly additional features for every reported claim. A\nkey element of our proposal involves computing development factors, which may\nbe influenced by both the accident date and other features. These development\nfactors serve as the basis for predictions. While we assume close to continuous\nobservations of accident date and reporting delay, the development factors can\nbe expressed at any level of granularity, such as months, quarters, or years, and\npredictions across different granularity levels exhibit coherence. The\ncalculation of development factors relies on the estimation of a hazard\nfunction in reverse development time, and we present three distinct methods for\nestimating this function: the Cox proportional hazard model, a feed-forward\nneural network, and xgboost (eXtreme gradient boosting). In all three cases,\nestimation is based on the same partial likelihood that accommodates left\ntruncation and ties in the data. While the first case is a semi-parametric\nmodel that assumes in part a log-linear structure, the two machine learning\napproaches only assume that the baseline and the other factors are\nmultiplicatively separable. Through an extensive simulation study and\nreal-world data application, our approach demonstrates promising results. 
This\npaper comes with an accompanying R-package, $\\texttt{ReSurv}$, which can be\naccessed at \\url{https://github.com/edhofman/ReSurv}"}, "http://arxiv.org/abs/2312.14583": {"title": "Inference on the state process of periodically inhomogeneous hidden Markov models for animal behavior", "link": "http://arxiv.org/abs/2312.14583", "description": "Over the last decade, hidden Markov models (HMMs) have become increasingly\npopular in statistical ecology, where they constitute natural tools for\nstudying animal behavior based on complex sensor data. Corresponding analyses\nsometimes explicitly focus on - and in any case need to take into account -\nperiodic variation, for example by quantifying the activity distribution over\nthe daily cycle or seasonal variation such as migratory behavior. For HMMs\nincluding periodic components, we establish important mathematical properties\nthat allow for comprehensive statistical inference related to periodic\nvariation, thereby also providing guidance for model building and model\nchecking. Specifically, we derive the periodically varying unconditional state\ndistribution as well as the time-varying and overall state dwell-time\ndistributions - all of which are of key interest when the inferential focus\nlies on the dynamics of the state process. We use the associated novel\ninference and model-checking tools to investigate changes in the diel activity\npatterns of fruit flies in response to changing light conditions."}, "http://arxiv.org/abs/2312.14601": {"title": "Generalized Moment Estimators based on Stein Identities", "link": "http://arxiv.org/abs/2312.14601", "description": "For parameter estimation of continuous and discrete distributions, we propose\na generalization of the method of moments (MM), where Stein identities are\nutilized for improved estimation performance. The construction of these\nStein-type MM-estimators makes use of a weight function as implied by an\nappropriate form of the Stein identity. Our general approach as well as\npotential benefits thereof are first illustrated by the simple example of the\nexponential distribution. Afterward, we investigate the more sophisticated\ntwo-parameter inverse Gaussian distribution and the two-parameter\nnegative-binomial distribution in great detail, together with illustrative\nreal-world data examples. Given an appropriate choice of the respective weight\nfunctions, their Stein-MM estimators, which are defined by simple closed-form\nformulas and allow for closed-form asymptotic computations, exhibit a better\nperformance regarding bias and mean squared error than competing estimators."}, "http://arxiv.org/abs/2312.14689": {"title": "Mistaken identities lead to missed opportunities: Testing for mean differences in partially matched data", "link": "http://arxiv.org/abs/2312.14689", "description": "It is increasingly common to collect pre-post data with pseudonyms or\nself-constructed identifiers. On survey responses from sensitive populations,\nidentifiers may be made optional to encourage higher response rates. The\nability to match responses between pre- and post-intervention phases for every\nparticipant may be impossible in such applications, leaving practitioners with\na choice between the paired t-test on the matched samples and the two-sample\nt-test on all samples for evaluating mean differences. We demonstrate the\ninadequacies with both approaches, as the former test requires discarding\nunmatched data, while the latter test ignores correlation and assumes\nindependence. 
In cases with a subset of matched samples, an opportunity to\nachieve limited inference about the correlation exists. We propose a novel\ntechnique for such `partially matched' data, which we refer to as the\nQuantile-based t-test for correlated samples, to assess mean differences using\na conservative estimate of the correlation between responses based on the\nmatched subset. Critically, our approach does not discard unmatched samples,\nnor does it assume independence. Our results demonstrate that the proposed\nmethod yields nominal Type I error probability while affording more power than\nexisting approaches. Practitioners can readily adopt our approach with basic\nstatistical programming software."}, "http://arxiv.org/abs/2312.14719": {"title": "Nonhomogeneous hidden semi-Markov models for toroidal data", "link": "http://arxiv.org/abs/2312.14719", "description": "A nonhomogeneous hidden semi-Markov model is proposed to segment toroidal\ntime series according to a finite number of latent regimes and, simultaneously,\nestimate the influence of time-varying covariates on the process' survival\nunder each regime. The model is a mixture of toroidal densities, whose\nparameters depend on the evolution of a semi-Markov chain, which is in turn\nmodulated by time-varying covariates through a proportional hazards assumption.\nParameter estimates are obtained using an EM algorithm that relies on an\nefficient augmentation of the latent process. The proposal is illustrated on a\ntime series of wind and wave directions recorded during winter."}, "http://arxiv.org/abs/2312.14810": {"title": "Accelerating Bayesian Optimal Experimental Design with Derivative-Informed Neural Operators", "link": "http://arxiv.org/abs/2312.14810", "description": "We consider optimal experimental design (OED) for nonlinear Bayesian inverse\nproblems governed by large-scale partial differential equations (PDEs). For the\noptimality criteria of Bayesian OED, we consider both expected information gain\nand summary statistics including the trace and determinant of the information\nmatrix that involves the evaluation of the parameter-to-observable (PtO) map\nand its derivatives. However, it is prohibitive to compute and optimize these\ncriteria when the PDEs are very expensive to solve, the parameters to estimate\nare high-dimensional, and the optimization problem is combinatorial,\nhigh-dimensional, and non-convex. To address these challenges, we develop an\naccurate, scalable, and efficient computational framework to accelerate the\nsolution of Bayesian OED. In particular, the framework is developed based on\nderivative-informed neural operator (DINO) surrogates with proper dimension\nreduction techniques and a modified swapping greedy algorithm. We demonstrate\nthe high accuracy of the DINO surrogates in the computation of the PtO map and\nthe optimality criteria compared to high-fidelity finite element\napproximations. We also show that the proposed method is scalable with\nincreasing parameter dimensions. 
Moreover, we demonstrate that it achieves high\nefficiency with over 1000X speedup compared to a high-fidelity Bayesian OED\nsolution for a three-dimensional PDE example with tens of thousands of\nparameters, including both online evaluation and offline construction costs of\nthe surrogates."}, "http://arxiv.org/abs/2108.13010": {"title": "Piecewise monotone estimation in one-parameter exponential family", "link": "http://arxiv.org/abs/2108.13010", "description": "The problem of estimating a piecewise monotone sequence of normal means is\ncalled the nearly isotonic regression. For this problem, an efficient algorithm\nhas been devised by modifying the pool adjacent violators algorithm (PAVA). In\nthis study, we investigate estimation of a piecewise monotone parameter\nsequence for general one-parameter exponential families such as binomial,\nPoisson and chi-square. We develop an efficient algorithm based on the modified\nPAVA, which utilizes the duality between the natural and expectation\nparameters. We also provide a method for selecting the regularization parameter\nby using an information criterion. Simulation results demonstrate that the\nproposed method detects change-points in piecewise monotone parameter sequences\nin a data-driven manner. Applications to spectrum estimation, causal inference\nand discretization error quantification of ODE solvers are also presented."}, "http://arxiv.org/abs/2212.03131": {"title": "Explainability as statistical inference", "link": "http://arxiv.org/abs/2212.03131", "description": "A wide variety of model explanation approaches have been proposed in recent\nyears, all guided by very different rationales and heuristics. In this paper,\nwe take a new route and cast interpretability as a statistical inference\nproblem. We propose a general deep probabilistic model designed to produce\ninterpretable predictions. The model parameters can be learned via maximum\nlikelihood, and the method can be adapted to any predictor network architecture\nand any type of prediction problem. Our method is a case of amortized\ninterpretability models, where a neural network is used as a selector to allow\nfor fast interpretation at inference time. Several popular interpretability\nmethods are shown to be particular cases of regularised maximum likelihood for\nour general model. We propose new datasets with ground truth selection which\nallow for the evaluation of the features importance map. Using these datasets,\nwe show experimentally that using multiple imputation provides more reasonable\ninterpretations."}, "http://arxiv.org/abs/2302.08151": {"title": "Towards a universal representation of statistical dependence", "link": "http://arxiv.org/abs/2302.08151", "description": "Dependence is undoubtedly a central concept in statistics. Though, it proves\ndifficult to locate in the literature a formal definition which goes beyond the\nself-evident 'dependence = non-independence'. This absence has allowed the term\n'dependence' and its declination to be used vaguely and indiscriminately for\nqualifying a variety of disparate notions, leading to numerous incongruities.\nFor example, the classical Pearson's, Spearman's or Kendall's correlations are\nwidely regarded as 'dependence measures' of major interest, in spite of\nreturning 0 in some cases of deterministic relationships between the variables\nat play, evidently not measuring dependence at all. 
Arguing that research on\nsuch a fundamental topic would benefit from a slightly more rigid framework,\nthis paper suggests a general definition of the dependence between two random\nvariables defined on the same probability space. Natural enough for aligning\nwith intuition, that definition is still sufficiently precise for allowing\nunequivocal identification of a 'universal' representation of the dependence\nstructure of any bivariate distribution. Links between this representation and\nfamiliar concepts are highlighted, and ultimately, the idea of a dependence\nmeasure based on that universal representation is explored and shown to satisfy\nRenyi's postulates."}, "http://arxiv.org/abs/2305.00349": {"title": "Causal effects of intervening variables in settings with unmeasured confounding", "link": "http://arxiv.org/abs/2305.00349", "description": "We present new results on average causal effects in settings with unmeasured\nexposure-outcome confounding. Our results are motivated by a class of\nestimands, frequently of interest in medicine and public health, that are\ncurrently not targeted by standard approaches for average causal effects. We\nrecognize these estimands as queries about the average causal effect of an\nintervening variable. We anchor our introduction of these estimands in an\ninvestigation of the role of chronic pain and opioid prescription patterns in\nthe opioid epidemic, and illustrate how conventional approaches will lead to\nunreplicable estimates with ambiguous policy implications. We argue that our\nalternative effects are replicable and have clear policy implications, and\nfurthermore are non-parametrically identified by the classical frontdoor\nformula. As an independent contribution, we derive a new semiparametric\nefficient estimator of the frontdoor formula with a uniform sample boundedness\nguarantee. This property is unique among previously-described estimators in its\nclass, and we demonstrate superior performance in finite-sample settings.\nTheoretical results are applied with data from the National Health and\nNutrition Examination Survey."}, "http://arxiv.org/abs/2305.05330": {"title": "Point and probabilistic forecast reconciliation for general linearly constrained multiple time series", "link": "http://arxiv.org/abs/2305.05330", "description": "Forecast reconciliation is the post-forecasting process aimed at revising a set\nof incoherent base forecasts into coherent forecasts in line with given data\nstructures. Most point and probabilistic regression-based forecast\nreconciliation results are grounded on the so-called \"structural representation\" and\non the related unconstrained generalized least squares reconciliation formula.\nHowever, the structural representation naturally applies to genuine\nhierarchical/grouped time series, where the top- and bottom-level variables are\nuniquely identified. When a general linearly constrained multiple time series\nis considered, the forecast reconciliation is naturally expressed according to\na projection approach. While it is well known that the classic structural\nreconciliation formula is equivalent to its projection approach counterpart, so\nfar it is not completely understood if and how a structural-like reconciliation\nformula may be derived for a general linearly constrained multiple time series.\nSuch an expression would permit reconciliation definitions, theorems\nand results to be extended in a straightforward manner. 
In this paper, we show that for\ngeneral linearly constrained multiple time series the reconciliation formula can be\nexpressed according to a \"structural-like\" approach that keeps free and\nconstrained variables distinct, instead of bottom and upper (aggregated)\nvariables; we then establish the probabilistic forecast reconciliation framework, and\napply these findings to obtain fully reconciled point and probabilistic\nforecasts for the aggregates of the Australian GDP from income and expenditure\nsides, and for the European Area GDP disaggregated by income, expenditure and\noutput sides and by 19 countries."}, "http://arxiv.org/abs/2312.15032": {"title": "Combining support for hypotheses over heterogeneous studies with Bayesian Evidence Synthesis: A simulation study", "link": "http://arxiv.org/abs/2312.15032", "description": "Scientific claims gain credibility through replicability, especially if\nreplication under different circumstances and varying designs yields equivalent\nresults. Aggregating results over multiple studies is, however, not\nstraightforward, and when the heterogeneity between studies increases,\nconventional methods such as (Bayesian) meta-analysis and Bayesian sequential\nupdating become infeasible. *Bayesian Evidence Synthesis*, built upon the\nfoundations of the Bayes factor, allows the aggregation of support for conceptually\nsimilar hypotheses over studies, regardless of methodological differences. We\nassess the performance of Bayesian Evidence Synthesis over multiple effect and\nsample sizes, with a broad set of (inequality-constrained) hypotheses using\nMonte Carlo simulations, focusing explicitly on the complexity of the\nhypotheses under consideration. The simulations show that this method can\nevaluate complex (informative) hypotheses regardless of methodological\ndifferences between studies, and performs adequately if the set of studies\nconsidered has sufficient statistical power. Additionally, we pinpoint\nchallenging conditions that can lead to unsatisfactory results, and provide\nsuggestions on handling these situations. Ultimately, we show that Bayesian\nEvidence Synthesis is a promising tool that can be used when traditional\nresearch synthesis methods are not applicable due to insurmountable\nbetween-study heterogeneity."}, "http://arxiv.org/abs/2312.15055": {"title": "Deep Learning for Efficient GWAS Feature Selection", "link": "http://arxiv.org/abs/2312.15055", "description": "Genome-Wide Association Studies (GWAS) face unique challenges in the era of\nbig genomics data, particularly when dealing with ultra-high-dimensional\ndatasets where the number of genetic features significantly exceeds the\navailable samples. This paper introduces an extension to the feature selection\nmethodology proposed by Mirzaei et al. (2020), specifically tailored to tackle\nthe intricacies associated with ultra-high-dimensional GWAS data. Our extended\napproach enhances the original method by introducing a Frobenius norm penalty\ninto the student network, augmenting its capacity to adapt to scenarios\ncharacterized by a multitude of features and limited samples. Operating\nseamlessly in both supervised and unsupervised settings, our method employs two\nkey neural networks. The first leverages an autoencoder or supervised\nautoencoder for dimension reduction, extracting salient features from the\nultra-high-dimensional genomic data. The second network, a regularized\nfeed-forward model with a single hidden layer, is designed for precise feature\nselection. 
The introduction of the Frobenius norm penalty in the student\nnetwork significantly boosts the method's resilience to the challenges posed by\nultra-high-dimensional GWAS datasets. Experimental results showcase the\nefficacy of our approach in feature selection for GWAS data. The method not\nonly handles the inherent complexities of ultra-high-dimensional settings but\nalso demonstrates superior adaptability to the nuanced structures present in\ngenomics data. The flexibility and versatility of our proposed methodology are\nunderscored by its successful performance across a spectrum of experiments."}, "http://arxiv.org/abs/2312.15079": {"title": "Invariance-based Inference in High-Dimensional Regression with Finite-Sample Guarantees", "link": "http://arxiv.org/abs/2312.15079", "description": "In this paper, we develop invariance-based procedures for testing and\ninference in high-dimensional regression models. These procedures, also known\nas randomization tests, provide several important advantages. First, for the\nglobal null hypothesis of significance, our test is valid in finite samples. It\nis also simple to implement and comes with finite-sample guarantees on\nstatistical power. Remarkably, despite its simplicity, this testing idea has\nescaped the attention of earlier analytical work, which mainly concentrated on\ncomplex high-dimensional asymptotic methods. Under an additional assumption of\nGaussian design, we show that this test also achieves the minimax optimal rate\nagainst certain nonsparse alternatives, a type of result that is rare in the\nliterature. Second, for partial null hypotheses, we propose residual-based\ntests and derive theoretical conditions for their validity. These tests can be\nmade powerful by constructing the test statistic in a way that, first, selects\nthe important covariates (e.g., through Lasso) and then orthogonalizes the\nnuisance parameters. We illustrate our results through extensive simulations\nand applied examples. One consistent finding is that the strong finite-sample\nguarantees associated with our procedures result in added robustness when it\ncomes to handling multicollinearity and heavy-tailed covariates."}, "http://arxiv.org/abs/2312.15179": {"title": "Evaluating District-based Election Surveys with Synthetic Dirichlet Likelihood", "link": "http://arxiv.org/abs/2312.15179", "description": "In district-based multi-party elections, electors cast votes in their\nrespective districts. In each district, the party with maximum votes wins the\ncorresponding seat in the governing body. Election Surveys try to predict the\nelection outcome (vote shares and seat shares of parties) by querying a random\nsample of electors. However, the survey results are often inconsistent with the\nactual results, which could be due to multiple reasons. The aim of this work is\nto estimate a posterior distribution over the possible outcomes of the\nelection, given one or more survey results. This is achieved using a prior\ndistribution over vote shares, election models to simulate the complete\nelection from the vote share, and survey models to simulate survey results from\na complete election. The desired posterior distribution over the space of\npossible outcomes is constructed using Synthetic Dirichlet Likelihoods, whose\nparameters are estimated from Monte Carlo sampling of elections using the\nelection models. 
We further show that the same approach can also be used to\nevaluate the surveys - whether they were biased or not, based on the true\noutcome once it is known. Our work offers the first-ever probabilistic model to\nanalyze district-based election surveys. We illustrate our approach with\nextensive experiments on real and simulated data of district-based political\nelections in India."}, "http://arxiv.org/abs/2312.15205": {"title": "X-Vine Models for Multivariate Extremes", "link": "http://arxiv.org/abs/2312.15205", "description": "Regular vine sequences permit the organisation of variables in a random\nvector along a sequence of trees. Regular vine models have become greatly\npopular in dependence modelling as a way to combine arbitrary bivariate copulas\ninto higher-dimensional ones, offering flexibility, parsimony, and\ntractability. In this project, we use regular vine structures to decompose and\nconstruct the exponent measure density of a multivariate extreme value\ndistribution, or, equivalently, the tail copula density. Although these\ndensities pose theoretical challenges due to their infinite mass, their\nhomogeneity property offers simplifications. The theory sheds new light on\nexisting parametric families and facilitates the construction of new ones,\ncalled X-vines. Computations proceed via recursive formulas in terms of\nbivariate model components. We develop simulation algorithms for X-vine\nmultivariate Pareto distributions as well as methods for parameter estimation\nand model selection on the basis of threshold exceedances. The methods are\nillustrated by Monte Carlo experiments and a case study on US flight delay\ndata."}, "http://arxiv.org/abs/2312.15217": {"title": "Constructing a T-test for Value Function Comparison of Individualized Treatment Regimes in the Presence of Multiple Imputation for Missing Data", "link": "http://arxiv.org/abs/2312.15217", "description": "Optimal individualized treatment decision-making has improved health outcomes\nin recent years. The value function is commonly used to evaluate the goodness\nof an individualized treatment decision rule. Despite recent advances,\ncomparing value functions between different treatment decision rules or\nconstructing confidence intervals around value functions remains difficult. We\npropose a t-test based method applied to a test set that generates valid\np-values to compare value functions between a given pair of treatment decision\nrules when some of the data are missing. We demonstrate the ease of use of this\nmethod and evaluate its performance via simulation studies and apply it to the\nChina Health and Nutrition Survey data."}, "http://arxiv.org/abs/2312.15222": {"title": "Towards reaching a consensus in Bayesian trial designs: the case of 2-arm trials", "link": "http://arxiv.org/abs/2312.15222", "description": "Practical employment of Bayesian trial designs has been rare, in part due to\nthe regulators' requirement to calibrate such designs with an upper bound for\nType 1 error rate. This has led to an internally inconsistent hybrid\nmethodology, where important advantages from applying the Bayesian principles\nare lost. To present an alternative approach, we consider the prototype case of\na 2-arm superiority trial with binary outcomes. The design is adaptive,\napplying block randomization for treatment assignment and using error tolerance\ncriteria based on sequentially updated posterior probabilities, to conclude\nefficacy of the experimental treatment or futility of the trial. 
The method\nalso contains the option of applying a decision rule for terminating the trial\nearly if the predicted costs from continuing would exceed the corresponding\ngains. We propose that the control of Type 1 error rate be replaced by a\ncontrol of false discovery rate (FDR), a concept that lends itself to both\nfrequentist and Bayesian interpretations. Importantly, if the regulators and\nthe investigators can agree on a consensus distribution to represent their\nshared understanding on the effect sizes, the selected level of risk tolerance\nagainst false conclusions during the data analysis will also be a valid bound\nfor the FDR. The approach can lower the ultimately unnecessary barriers from\nthe practical application of Bayesian trial designs. This can lead to more\nflexible experimental protocols and more efficient use of trial data while\nstill effectively guarding against falsely claimed discoveries."}, "http://arxiv.org/abs/2312.15373": {"title": "A Multi-day Needs-based Modeling Approach for Activity and Travel Demand Analysis", "link": "http://arxiv.org/abs/2312.15373", "description": "This paper proposes a multi-day needs-based model for activity and travel\ndemand analysis. The model captures the multi-day dynamics in activity\ngeneration, which enables the modeling of activities with increased flexibility\nin time and space (e.g., e-commerce and remote working). As an enhancement to\nactivity-based models, the proposed model captures the underlying\ndecision-making process of activity generation by accounting for psychological\nneeds as the drivers of activities. The level of need satisfaction is modeled\nas a psychological inventory, whose utility is optimized via decisions on\nactivity participation, location, and duration. The utility includes both the\nbenefit in the inventory gained and the cost in time, monetary expense as well\nas maintenance of safety stock. The model includes two sub-models, a\nDeterministic Model that optimizes the utility of the inventory, and an\nEmpirical Model that accounts for heterogeneity and stochasticity. Numerical\nexperiments are conducted to demonstrate model scalability. A maximum\nlikelihood estimator is proposed, the properties of the log-likelihood function\nare examined and the recovery of true parameters is tested. This research\ncontributes to the literature on transportation demand models in the following\nthree aspects. First, it is arguably better grounded in psychological theory\nthan traditional models and allows the generation of activity patterns to be\npolicy-sensitive (while avoiding the need for ad hoc utility definitions).\nSecond, it contributes to the development of needs-based models with a\nnon-myopic approach to model multi-day activity patterns. Third, it proposes a\ntractable model formulation via problem reformulation and computational\nenhancements, which allows for maximum likelihood parameter estimation."}, "http://arxiv.org/abs/2312.15376": {"title": "Geodesic Optimal Transport Regression", "link": "http://arxiv.org/abs/2312.15376", "description": "Classical regression models do not cover non-Euclidean data that reside in a\ngeneral metric space, while the current literature on non-Euclidean regression\nby and large has focused on scenarios where either predictors or responses are\nrandom objects, i.e., non-Euclidean, but not both. 
In this paper we propose\ngeodesic optimal transport regression models for the case where both predictors\nand responses lie in a common geodesic metric space and predictors may include\nnot only one but also several random objects. This provides an extension of\nclassical multiple regression to the case where both predictors and responses\nreside in non-Euclidean metric spaces, a scenario that has not been considered\nbefore. It is based on the concept of optimal geodesic transports, which we\ndefine as an extension of the notion of optimal transports in distribution\nspaces to more general geodesic metric spaces, where we characterize optimal\ntransports as transports along geodesics. The proposed regression models cover\nthe relation between non-Euclidean responses and vectors of non-Euclidean\npredictors in many spaces of practical statistical interest. These include\none-dimensional distributions viewed as elements of the 2-Wasserstein space and\nmultidimensional distributions with the Fisher-Rao metric that are represented\nas data on the Hilbert sphere. Also included are data on finite-dimensional\nRiemannian manifolds, with an emphasis on spheres, covering directional and\ncompositional data, as well as data that consist of symmetric positive definite\nmatrices. We illustrate the utility of geodesic optimal transport regression\nwith data on summer temperature distributions and human mortality."}, "http://arxiv.org/abs/2312.15396": {"title": "PKBOIN-12: A Bayesian optimal interval Phase I/II design incorporating pharmacokinetics outcomes to find the optimal biological dose", "link": "http://arxiv.org/abs/2312.15396", "description": "Immunotherapies and targeted therapies have gained popularity due to their\npromising therapeutic effects across multiple treatment areas. The focus of\nearly phase dose-finding clinical trials has shifted from finding the maximum\ntolerated dose (MTD) to identifying the optimal biological dose (OBD), which\naims to balance the toxicity and efficacy outcomes, thereby optimizing the\nrisk-benefit trade-off. These trials often collect multiple pharmacokinetics\n(PK) outcomes to assess drug exposure, which has shown correlations with\ntoxicity and efficacy outcomes but has not been utilized in the current\ndose-finding designs for OBD selection. Moreover, PK outcomes are usually\navailable within days after initial treatment, much faster than toxicity and\nefficacy outcomes. To bridge this gap, we introduce the innovative\nmodel-assisted PKBOIN-12 design, which enhances the BOIN12 design by\nintegrating PK information into both the dose-finding algorithm and the final\nOBD determination process. We further extend PKBOIN-12 to the TITE-PKBOIN-12\ndesign to address the challenges of late-onset toxicity and efficacy outcomes.\nSimulation results demonstrate that the PKBOIN-12 design more effectively\nidentifies the OBD and allocates a greater number of patients to it than\nBOIN12. Additionally, PKBOIN-12 decreases the probability of selecting\ninefficacious doses as the OBD by excluding those with low drug exposure.\nComprehensive simulation studies and sensitivity analysis confirm the\nrobustness of both PKBOIN-12 and TITE-PKBOIN-12 designs in various scenarios."}, "http://arxiv.org/abs/2312.15469": {"title": "Efficient Estimation of the Central Mean Subspace via Smoothed Gradient Outer Products", "link": "http://arxiv.org/abs/2312.15469", "description": "We consider the problem of sufficient dimension reduction (SDR) for\nmulti-index models. 
The estimators of the central mean subspace in prior works\neither have slow (non-parametric) convergence rates, or rely on stringent\ndistributional conditions (e.g., the covariate distribution $P_{\\mathbf{X}}$\nbeing elliptical symmetric). In this paper, we show that a fast parametric\nconvergence rate of form $C_d \\cdot n^{-1/2}$ is achievable via estimating the\n\\emph{expected smoothed gradient outer product}, for a general class of\ndistribution $P_{\\mathbf{X}}$ admitting Gaussian or heavier distributions. When\nthe link function is a polynomial with a degree of at most $r$ and\n$P_{\\mathbf{X}}$ is the standard Gaussian, we show that the prefactor depends\non the ambient dimension $d$ as $C_d \\propto d^r$."}, "http://arxiv.org/abs/2312.15496": {"title": "A Simple Bias Reduction for Chatterjee's Correlation", "link": "http://arxiv.org/abs/2312.15496", "description": "Chatterjee's rank correlation coefficient $\\xi_n$ is an empirical index for\ndetecting functional dependencies between two variables $X$ and $Y$. It is an\nestimator for a theoretical quantity $\\xi$ that is zero for independence and\none if $Y$ is a measurable function of $X$. Based on an equivalent\ncharacterization of sorted numbers, we derive an upper bound for $\\xi_n$ and\nsuggest a simple normalization aimed at reducing its bias for small sample size\n$n$. In Monte Carlo simulations of various cases, the normalization reduced the\nbias in all cases. The mean squared error was reduced, too, for values of $\\xi$\ngreater than about 0.4. Moreover, we observed that confidence intervals for\n$\\xi$ based on bootstrapping $\\xi_n$ in the usual n-out-of-n way have a\ncoverage probability close to zero. This is remedied by an m-out-of-n bootstrap\nwithout replacement in combination with our normalization method."}, "http://arxiv.org/abs/2312.15611": {"title": "Inference of Dependency Knowledge Graph for Electronic Health Records", "link": "http://arxiv.org/abs/2312.15611", "description": "The effective analysis of high-dimensional Electronic Health Record (EHR)\ndata, with substantial potential for healthcare research, presents notable\nmethodological challenges. Employing predictive modeling guided by a knowledge\ngraph (KG), which enables efficient feature selection, can enhance both\nstatistical efficiency and interpretability. While various methods have emerged\nfor constructing KGs, existing techniques often lack statistical certainty\nconcerning the presence of links between entities, especially in scenarios\nwhere the utilization of patient-level EHR data is limited due to privacy\nconcerns. In this paper, we propose the first inferential framework for\nderiving a sparse KG with statistical guarantee based on the dynamic log-linear\ntopic model proposed by \\cite{arora2016latent}. Within this model, the KG\nembeddings are estimated by performing singular value decomposition on the\nempirical pointwise mutual information matrix, offering a scalable solution. We\nthen establish entrywise asymptotic normality for the KG low-rank estimator,\nenabling the recovery of sparse graph edges with controlled type I error. Our\nwork uniquely addresses the under-explored domain of statistical inference\nabout non-linear statistics under the low-rank temporal dependent models, a\ncritical gap in existing research. 
We validate our approach through extensive\nsimulation studies and then apply the method to real-world EHR data in\nconstructing clinical KGs and generating clinical feature embeddings."}, "http://arxiv.org/abs/2312.15781": {"title": "A Computational Note on the Graphical Ridge in High-dimension", "link": "http://arxiv.org/abs/2312.15781", "description": "This article explores the estimation of precision matrices in\nhigh-dimensional Gaussian graphical models. We address the challenge of\nimproving the accuracy of maximum likelihood-based precision estimation through\npenalization. Specifically, we consider an elastic net penalty, which\nincorporates both L1 and Frobenius norm penalties while accounting for the\ntarget matrix during estimation. To enhance precision matrix estimation, we\npropose a novel two-step estimator that combines the strengths of ridge and\ngraphical lasso estimators. Through this approach, we aim to improve overall\nestimation performance. Our empirical analysis demonstrates the superior\nefficiency of our proposed method compared to alternative approaches. We\nvalidate the effectiveness of our proposal through numerical experiments and\napplication on three real datasets. These examples illustrate the practical\napplicability and usefulness of our proposed estimator."}, "http://arxiv.org/abs/2312.15919": {"title": "Review on Causality Detection Based on Empirical Dynamic Modeling", "link": "http://arxiv.org/abs/2312.15919", "description": "In contemporary scientific research, understanding the distinction between\ncorrelation and causation is crucial. While correlation is a widely used\nanalytical standard, it does not inherently imply causation. This paper\naddresses the potential for misinterpretation in relying solely on correlation,\nespecially in the context of nonlinear dynamics. Despite the rapid development\nof various correlation research methodologies, including machine learning, the\nexploration into mining causal correlations between variables remains ongoing.\nEmpirical Dynamic Modeling (EDM) emerges as a data-driven framework for\nmodeling dynamic systems, distinguishing itself by eschewing traditional\nformulaic methods in data analysis. Instead, it reconstructs dynamic system\nbehavior directly from time series data. The fundamental premise of EDM is that\ndynamic systems can be conceptualized as processes where a set of states,\ngoverned by specific rules, evolve over time in a high-dimensional space. By\nreconstructing these evolving states, dynamic systems can be effectively\nmodeled. Using EDM, this paper explores the detection of causal relationships\nbetween variables within dynamic systems through their time series data. It\nposits that if variable X causes variable Y, then the information about X is\ninherent in Y and can be extracted from Y's data. 
This study begins by\nexamining the dialectical relationship between correlation and causation,\nemphasizing that correlation does not equate to causation, and the absence of\ncorrelation does not necessarily indicate a lack of causation."}, "http://arxiv.org/abs/2312.16037": {"title": "Critical nonlinear aspects of hopping transport for reconfigurable logic in disordered dopant networks", "link": "http://arxiv.org/abs/2312.16037", "description": "Nonlinear behavior in the hopping transport of interacting charges enables\nreconfigurable logic in disordered dopant network devices, where voltages\napplied at control electrodes tune the relation between voltages applied at\ninput electrodes and the current measured at an output electrode. From kinetic\nMonte Carlo simulations we analyze the critical nonlinear aspects of\nvariable-range hopping transport for realizing Boolean logic gates in these\ndevices on three levels. First, we quantify the occurrence of individual gates\nfor random choices of control voltages. We find that linearly inseparable gates\nsuch as the XOR gate are less likely to occur than linearly separable gates\nsuch as the AND gate, despite the fact that the number of different regions in\nthe multidimensional control voltage space for which AND or XOR gates occur is\ncomparable. Second, we use principal component analysis to characterize the\ndistribution of the output current vectors for the (00,10,01,11) logic input\ncombinations in terms of eigenvectors and eigenvalues of the output covariance\nmatrix. This allows a simple and direct comparison of the behavior of different\nsimulated devices and a comparison to experimental devices. Third, we quantify\nthe nonlinearity in the distribution of the output current vectors necessary\nfor realizing Boolean functionality by introducing three nonlinearity\nindicators. The analysis provides a physical interpretation of the effects of\nchanging the hopping distance and temperature and is used in a comparison with\ndata generated by a deep neural network trained on a physical device."}, "http://arxiv.org/abs/2312.16139": {"title": "Anomaly component analysis", "link": "http://arxiv.org/abs/2312.16139", "description": "At the crossroads of machine learning and data analysis, anomaly detection aims\nat identifying observations that exhibit abnormal behaviour. Be it measurement\nerrors, disease development, severe weather, production quality default(s)\n(items) or failed equipment, financial frauds or crisis events, their on-time\nidentification and isolation constitute an important task in almost any area of\nindustry and science. While a substantial body of literature is devoted to\ndetection of anomalies, little attention is paid to their explanation. This is\nthe case mostly due to the intrinsically non-supervised nature of the task and\nnon-robustness of the exploratory methods like principal component analysis\n(PCA).\n\nWe introduce a new statistical tool dedicated to the exploratory analysis of\nabnormal observations using data depth as a score. Anomaly component analysis\n(shortly ACA) is a method that searches for a low-dimensional data representation\nthat best visualises and explains anomalies. This low-dimensional\nrepresentation not only allows groups of anomalies to be distinguished better than\nwith state-of-the-art methods, but also provides a -- linear in\nvariables and thus easily interpretable -- explanation for anomalies. 
In a\ncomparative simulation and real-data study, ACA also proves advantageous for\nanomaly analysis with respect to methods present in the literature."}, "http://arxiv.org/abs/2312.16160": {"title": "SymmPI: Predictive Inference for Data with Group Symmetries", "link": "http://arxiv.org/abs/2312.16160", "description": "Quantifying the uncertainty of predictions is a core problem in modern\nstatistics. Methods for predictive inference have been developed under a\nvariety of assumptions, often -- for instance, in standard conformal prediction\n-- relying on the invariance of the distribution of the data under special\ngroups of transformations such as permutation groups. Moreover, many existing\nmethods for predictive inference aim to predict unobserved outcomes in\nsequences of feature-outcome observations. Meanwhile, there is interest in\npredictive inference under more general observation models (e.g., for partially\nobserved features) and for data satisfying more general distributional\nsymmetries (e.g., rotationally invariant or coordinate-independent observations\nin physics). Here we propose SymmPI, a methodology for predictive inference\nwhen data distributions have general group symmetries in arbitrary observation\nmodels. Our methods leverage the novel notion of distributional equivariant\ntransformations, which process the data while preserving their distributional\ninvariances. We show that SymmPI has valid coverage under distributional\ninvariance and characterize its performance under distribution shift,\nrecovering recent results as special cases. We apply SymmPI to predict\nunobserved values associated to vertices in a network, where the distribution\nis unchanged under relabelings that keep the network structure unchanged. In\nseveral simulations in a two-layer hierarchical model, and in an empirical data\nanalysis example, SymmPI performs favorably compared to existing methods."}, "http://arxiv.org/abs/2312.16162": {"title": "Properties of Test Statistics for Nonparametric Cointegrating Regression Functions Based on Subsamples", "link": "http://arxiv.org/abs/2312.16162", "description": "Nonparametric cointegrating regression models have been extensively used in\nfinancial markets, stock prices, heavy traffic, climate data sets, and energy\nmarkets. Models with parametric regression functions can be more appealing in\npractice compared to non-parametric forms, but do result in potential\nfunctional misspecification. Thus, there exists a vast literature on developing\na model specification test for parametric forms of regression functions. In\nthis paper, we develop two test statistics which are applicable for the\nendogenous regressors driven by long memory and semi-long memory input shocks\nin the regression model. The limit distributions of the test statistics under\nthese two scenarios are complicated and cannot be effectively used in practice.\nTo overcome this difficulty, we use the subsampling method and compute the test\nstatistics on smaller blocks of the data to construct their empirical\ndistributions. Throughout, Monte Carlo simulation studies are used to\nillustrate the properties of test statistics. 
We also provide an empirical\nexample of relating gross domestic product to total output of carbon dioxide in\ntwo European countries."}, "http://arxiv.org/abs/1909.06307": {"title": "Multiscale Jump Testing and Estimation Under Complex Temporal Dynamics", "link": "http://arxiv.org/abs/1909.06307", "description": "We consider the problem of detecting jumps in an otherwise smoothly evolving\ntrend whilst the covariance and higher-order structures of the system can\nexperience both smooth and abrupt changes over time. The number of jump points\nis allowed to diverge to infinity with the jump sizes possibly shrinking to\nzero. The method is based on a multiscale application of an optimal jump-pass\nfilter to the time series, where the scales are dense between admissible lower\nand upper bounds. For a wide class of non-stationary time series models and\ntrend functions, the proposed method is shown to be able to detect all jump\npoints within a nearly optimal range with a prescribed probability\nasymptotically under mild conditions. For a time series of length $n$, the\ncomputational complexity of the proposed method is $O(n)$ for each scale and\n$O(n\\log^{1+\\epsilon} n)$ overall, where $\\epsilon$ is an arbitrarily small\npositive constant. Numerical studies show that the proposed jump testing and\nestimation method performs robustly and accurately under complex temporal\ndynamics."}, "http://arxiv.org/abs/1911.06583": {"title": "GET: Global envelopes in R", "link": "http://arxiv.org/abs/1911.06583", "description": "This work describes the R package GET that implements global envelopes for a\ngeneral set of $d$-dimensional vectors $T$ in various applications. A\n$100(1-\\alpha)$% global envelope is a band bounded by two vectors such that the\nprobability that $T$ falls outside this envelope in any of the $d$ points is\nequal to $\\alpha$. The term 'global' means that this probability is controlled\nsimultaneously for all the $d$ elements of the vectors. The global envelopes\ncan be employed for central regions of functional or multivariate data, for\ngraphical Monte Carlo and permutation tests where the test statistic is\nmultivariate or functional, and for global confidence and prediction bands.\nIntrinsic graphical interpretation property is introduced for global envelopes.\nThe global envelopes included in the GET package have this property, which\nparticularly helps to interpret test results, by providing a graphical\ninterpretation that shows the reasons of rejection of the tested hypothesis.\nExamples of different uses of global envelopes and their implementation in the\nGET package are presented, including global envelopes for single and several\none- or two-dimensional functions, Monte Carlo goodness-of-fit tests for simple\nand composite hypotheses, comparison of distributions, functional analysis of\nvariance, functional linear model, and confidence bands in polynomial\nregression."}, "http://arxiv.org/abs/2007.02192": {"title": "Tail-adaptive Bayesian shrinkage", "link": "http://arxiv.org/abs/2007.02192", "description": "Modern genomic studies are increasingly focused on discovering more and more\ninteresting genes associated with a health response. Traditional shrinkage\npriors are primarily designed to detect a handful of signals from tens of\nthousands of predictors in the so-called ultra-sparsity domain. However, they\nmay fail to identify signals when the degree of sparsity is moderate. 
Robust\nsparse estimation under diverse sparsity regimes relies on a tail-adaptive\nshrinkage property. In this property, the tail-heaviness of the prior adjusts\nadaptively, becoming larger or smaller as the sparsity level increases or\ndecreases, respectively, to accommodate more or fewer signals. In this study,\nwe propose a global-local-tail (GLT) Gaussian mixture distribution that ensures\nthis property. We examine the role of the tail-index of the prior in relation\nto the underlying sparsity level and demonstrate that the GLT posterior\ncontracts at the minimax optimal rate for sparse normal mean models. We apply\nboth the GLT prior and the Horseshoe prior to real data problems and simulation\nexamples. Our findings indicate that the varying tail rule based on the GLT\nprior offers advantages over a fixed tail rule based on the Horseshoe prior in\ndiverse sparsity regimes."}, "http://arxiv.org/abs/2011.04833": {"title": "Handling time-dependent exposures and confounders when estimating attributable fractions -- bridging the gap between multistate and counterfactual modeling", "link": "http://arxiv.org/abs/2011.04833", "description": "The population-attributable fraction (PAF) expresses the proportion of events\nthat can be ascribed to a certain exposure in a certain population. It can be\nstrongly time-dependent because either exposure incidence or excess risk may\nchange over time. Competing events may moreover hinder the outcome of interest\nfrom being observed. Occurrence of either of these events may, in turn, prevent\nthe exposure of interest. Estimation approaches therefore need to carefully\naccount for the timing of events in such highly dynamic settings. The use of\nmultistate models has been widely encouraged to eliminate preventable yet\ncommon types of time-dependent bias. Even so, it has been pointed out that\nproposed multistate modeling approaches for PAF estimation fail to fully\neliminate such bias. In addition, assessing whether patients die from rather\nthan with a certain exposure not only requires adequate modeling of the timing\nof events but also of their confounding factors. While proposed multistate\nmodeling approaches for confounding adjustment may adequately accommodate\nbaseline imbalances, unlike g-methods, these proposals are not generally\nequipped to handle time-dependent confounding. However, the connection between\nmultistate modeling and g-methods (e.g. inverse probability of censoring\nweighting) for PAF estimation is not readily apparent. In this paper, we\nprovide a weighting-based characterization of both approaches to illustrate\nthis connection, to pinpoint current shortcomings of multistate modeling, and\nto enhance intuition into simple modifications to overcome these. R code is\nmade available to foster the uptake of g-methods for PAF estimation."}, "http://arxiv.org/abs/2211.00338": {"title": "Typical Yet Unlikely and Normally Abnormal: The Intuition Behind High-Dimensional Statistics", "link": "http://arxiv.org/abs/2211.00338", "description": "Normality, in the colloquial sense, has historically been considered an\naspirational trait, synonymous with ideality. The arithmetic average and, by\nextension, statistics including linear regression coefficients, have often been\nused to characterize normality, and are often used as a way to summarize\nsamples and identify outliers. 
We provide intuition behind the behavior of such\nstatistics in high dimensions, and demonstrate that even for datasets with a\nrelatively low number of dimensions, data start to exhibit a number of\npeculiarities which become severe as the number of dimensions increases. Whilst\nour main goal is to familiarize researchers with these peculiarities, we also\nshow that normality can be better characterized with `typicality', an\ninformation theoretic concept relating to entropy. An application of typicality\nto both synthetic and real-world data concerning political values reveals that\nin multi-dimensional space, to be `normal' is actually to be atypical. We\nbriefly explore the ramifications for outlier detection, demonstrating how\ntypicality, in contrast with the popular Mahalanobis distance, represents a\nviable method for outlier detection."}, "http://arxiv.org/abs/2212.12940": {"title": "Exact Selective Inference with Randomization", "link": "http://arxiv.org/abs/2212.12940", "description": "We introduce a pivot for exact selective inference with randomization. Not\nonly does our pivot lead to exact inference in Gaussian regression models, but\nit is also available in closed form. We reduce the problem of exact selective\ninference to a bivariate truncated Gaussian distribution. By doing so, we give\nup some power that is achieved with approximate maximum likelihood estimation\nin Panigrahi and Taylor (2022). Yet our pivot always produces narrower\nconfidence intervals than a closely related data splitting procedure. We\ninvestigate the trade-off between power and exact selective inference on\nsimulated datasets and an HIV drug resistance dataset."}, "http://arxiv.org/abs/2309.05025": {"title": "Simulating data from marginal structural models for a survival time outcome", "link": "http://arxiv.org/abs/2309.05025", "description": "Marginal structural models (MSMs) are often used to estimate causal effects\nof treatments on survival time outcomes from observational data when\ntime-dependent confounding may be present. They can be fitted using, e.g.,\ninverse probability of treatment weighting (IPTW). It is important to evaluate\nthe performance of statistical methods in different scenarios, and simulation\nstudies are a key tool for such evaluations. In such simulation studies, it is\ncommon to generate data in such a way that the model of interest is correctly\nspecified, but this is not always straightforward when the model of interest is\nfor potential outcomes, as is an MSM. Methods have been proposed for simulating\nfrom MSMs for a survival outcome, but these methods impose restrictions on the\ndata-generating mechanism. Here we propose a method that overcomes these\nrestrictions. The MSM can be a marginal structural logistic model for a\ndiscrete survival time or a Cox or additive hazards MSM for a continuous\nsurvival time. The hazard of the potential survival time can be conditional on\nbaseline covariates, and the treatment variable can be discrete or continuous.\nWe illustrate the use of the proposed simulation algorithm by carrying out a\nbrief simulation study. 
This study compares the coverage of confidence\nintervals calculated in two different ways for causal effect estimates obtained\nby fitting an MSM via IPTW."}, "http://arxiv.org/abs/2312.16177": {"title": "Learning to Infer Unobserved Behaviors: Estimating User's Preference for a Site over Other Sites", "link": "http://arxiv.org/abs/2312.16177", "description": "A site's recommendation system relies on knowledge of its users' preferences\nto offer relevant recommendations to them. These preferences are for attributes\nthat comprise items and content shown on the site, and are estimated from the\ndata of users' interactions with the site. Another form of users' preferences\nis material too, namely, users' preferences for the site over other sites,\nsince that shows users' base level propensities to engage with the site.\nEstimating users' preferences for the site, however, faces major obstacles\nbecause (a) the focal site usually has no data of its users' interactions with\nother sites; these interactions are users' unobserved behaviors for the focal\nsite; and (b) the Machine Learning literature in recommendation does not offer\na model of this situation. Even if (b) is resolved, the problem in (a) persists\nsince without access to data of its users' interactions with other sites, there\nis no ground truth for evaluation. Moreover, it is most useful when (c) users'\npreferences for the site can be estimated at the individual level, since the\nsite can then personalize recommendations to individual users. We offer a\nmethod to estimate individual user's preference for a focal site, under this\npremise. In particular, we compute the focal site's share of a user's online\nengagements without any data from other sites. We show an evaluation framework\nfor the model using only the focal site's data, allowing the site to test the\nmodel. We rely upon a Hierarchical Bayes Method and perform estimation in two\ndifferent ways - Markov Chain Monte Carlo and Stochastic Gradient with Langevin\nDynamics. Our results find good support for the approach to computing\npersonalized share of engagement and for its evaluation."}, "http://arxiv.org/abs/2312.16188": {"title": "The curious case of the test set AUROC", "link": "http://arxiv.org/abs/2312.16188", "description": "Whilst the size and complexity of ML models have rapidly and significantly\nincreased over the past decade, the methods for assessing their performance\nhave not kept pace. In particular, among the many potential performance\nmetrics, the ML community stubbornly continues to use (a) the area under the\nreceiver operating characteristic curve (AUROC) for a validation and test\ncohort (distinct from training data) or (b) the sensitivity and specificity for\nthe test data at an optimal threshold determined from the validation ROC.\nHowever, we argue that considering scores derived from the test ROC curve alone\ngives only a narrow insight into how a model performs and its ability to\ngeneralise."}, "http://arxiv.org/abs/2312.16241": {"title": "Analysis of Pleiotropy for Testosterone and Lipid Profiles in Males and Females", "link": "http://arxiv.org/abs/2312.16241", "description": "In modern scientific studies, it is often imperative to determine whether a\nset of phenotypes is affected by a single factor. 
If such an influence is\nidentified, it becomes essential to discern whether this effect is contingent\nupon categories such as sex or age group, and importantly, to understand\nwhether this dependence is rooted in purely non-environmental reasons. The\nexploration of such dependencies often involves studying pleiotropy, a\nphenomenon wherein a single genetic locus impacts multiple traits. This\nheightened interest in uncovering dependencies by pleiotropy is fueled by the\ngrowing accessibility of summary statistics from genome-wide association\nstudies (GWAS) and the establishment of thoroughly phenotyped sample\ncollections. This advancement enables a systematic and comprehensive\nexploration of the genetic connections among various traits and diseases.\nAdditive genetic correlation illuminates the genetic connection between two\ntraits, providing valuable insights into the shared biological pathways and\nunderlying causal relationships between them. In this paper, we present a novel\nmethod to analyze such dependencies by studying additive genetic correlations\nbetween pairs of traits under consideration. Subsequently, we employ matrix\ncomparison techniques to discern and elucidate sex-specific or\nage-group-specific associations, contributing to a deeper understanding of the\nnuanced dependencies within the studied traits. Our proposed method is\ncomputationally handy and requires only GWAS summary statistics. We validate\nour method by applying it to the UK Biobank data and present the results."}, "http://arxiv.org/abs/2312.16260": {"title": "Multinomial Link Models", "link": "http://arxiv.org/abs/2312.16260", "description": "We propose a unified multinomial link model for analyzing categorical\nresponses. It not only covers the existing multinomial logistic models and\ntheir extensions as a special class, but also allows the observations with NA\nor Unknown responses to be incorporated as a special category in the data\nanalysis. We provide explicit formulae for computing the likelihood gradient\nand Fisher information matrix, as well as detailed algorithms for finding the\nmaximum likelihood estimates of the model parameters. Our algorithms solve the\ninfeasibility issue of existing statistical software in estimating parameters\nof cumulative link models. The applications to real datasets show that the\nproposed multinomial link models can fit the data significantly better, and the\ncorresponding data analysis may correct the misleading conclusions due to\nmissing data."}, "http://arxiv.org/abs/2312.16357": {"title": "Statistical monitoring of European cross-border physical electricity flows using novel temporal edge network processes", "link": "http://arxiv.org/abs/2312.16357", "description": "Conventional modelling of networks evolving in time focuses on capturing\nvariations in the network structure. However, the network might be static from\nthe origin or experience only deterministic, regulated changes in its\nstructure, providing either a physical infrastructure or a specified connection\narrangement for some other processes. Thus, to detect change in its\nexploitation, we need to focus on the processes happening on the network. In\nthis work, we present the concept of monitoring random Temporal Edge Network\n(TEN) processes that take place on the edges of a graph having a fixed\nstructure. Our framework is based on the Generalized Network Autoregressive\nstatistical models with time-dependent exogenous variables (GNARX models) and\nCumulative Sum (CUSUM) control charts. 
To demonstrate its effective detection\nof various types of change, we conduct a simulation study and monitor the\nreal-world data of cross-border physical electricity flows in Europe."}, "http://arxiv.org/abs/2312.16439": {"title": "Inferring the Effect of a Confounded Treatment by Calibrating Resistant Population's Variance", "link": "http://arxiv.org/abs/2312.16439", "description": "In a general set-up that allows unmeasured confounding, we show that the\nconditional average treatment effect on the treated can be identified as one of\ntwo possible values. Unlike existing causal inference methods, we do not\nrequire an exogenous source of variability in the treatment, e.g., an\ninstrument or another outcome unaffected by the treatment. Instead, we require\n(a) a nondeterministic treatment assignment, (b) that conditional variances of\nthe two potential outcomes are equal in the treatment group, and (c) a\nresistant population that was not exposed to the treatment or, if exposed, is\nunaffected by the treatment. Assumption (a) is commonly assumed in theoretical\nwork, while (b) holds under fairly general outcome models. For (c), which is a\nnew assumption, we show that a resistant population is often available in\npractice. We develop a large sample inference methodology and demonstrate our\nproposed method in a study of the effect of surface mining in central\nAppalachia on birth weight that finds a harmful effect."}, "http://arxiv.org/abs/2312.16512": {"title": "Degrees-of-freedom penalized piecewise regression", "link": "http://arxiv.org/abs/2312.16512", "description": "Many popular piecewise regression models rely on minimizing a cost function\non the model fit with a linear penalty on the number of segments. However, this\npenalty does not take into account varying complexities of the model functions\non the segments potentially leading to overfitting when models with varying\ncomplexities, such as polynomials of different degrees, are used. In this work,\nwe enhance on this approach by instead using a penalty on the sum of the\ndegrees of freedom over all segments, called degrees-of-freedom penalized\npiecewise regression (DofPPR). We show that the solutions of the resulting\nminimization problem are unique for almost all input data in a least squares\nsetting. We develop a fast algorithm which does not only compute a minimizer\nbut also determines an optimal hyperparameter -- in the sense of rolling cross\nvalidation with the one standard error rule -- exactly. This eliminates manual\nhyperparameter selection. Our method supports optional user parameters for\nincorporating domain knowledge. We provide an open-source Python/Rust code for\nthe piecewise polynomial least squares case which can be extended to further\nmodels. We demonstrate the practical utility through a simulation study and by\napplications to real data. A constrained variant of the proposed method gives\nstate-of-the-art results in the Turing benchmark for unsupervised changepoint\ndetection."}, "http://arxiv.org/abs/2312.16544": {"title": "Hierarchical variable clustering based on the predictive strength between random vectors", "link": "http://arxiv.org/abs/2312.16544", "description": "A rank-invariant clustering of variables is introduced that is based on the\npredictive strength between groups of variables, i.e., two groups are assigned\na high similarity if the variables in the first group contain high predictive\ninformation about the behaviour of the variables in the other group and/or vice\nversa. 
The method presented here is model-free, dependence-based and does not\nrequire any distributional assumptions. Various general invariance and\ncontinuity properties are investigated, with special attention to those that\nare beneficial for the agglomerative hierarchical clustering procedure. A fully\nnon-parametric estimator is considered whose excellent performance is\ndemonstrated in several simulation studies and by means of real-data examples."}, "http://arxiv.org/abs/2312.16656": {"title": "Clustering Sets of Functional Data by Similarity in Law", "link": "http://arxiv.org/abs/2312.16656", "description": "We introduce a new clustering method for the classification of functional\ndata sets by their probabilistic law, that is, a procedure that aims to assign\ndata sets to the same cluster if and only if the data were generated with the\nsame underlying distribution. This method has the nice virtue of being\nnon-supervised and non-parametric, allowing for exploratory investigation with\nfew assumptions about the data. Rigorous finite bounds on the classification\nerror are given along with an objective heuristic that consistently selects the\nbest partition in a data-driven manner. Simulated data have been clustered with\nthis procedure to show the performance of the method with different parametric\nmodel classes of functional data."}, "http://arxiv.org/abs/2312.16734": {"title": "Selective Inference for Sparse Graphs via Neighborhood Selection", "link": "http://arxiv.org/abs/2312.16734", "description": "Neighborhood selection is a widely used method for estimating the\nsupport set of sparse precision matrices, which helps determine the conditional\ndependence structure in undirected graphical models. However, reporting only\npoint estimates for the estimated graph can result in poor replicability\nwithout accompanying uncertainty estimates. In fields such as psychology, where\nthe lack of replicability is a major concern, there is a growing need for\nmethods that can address this issue. In this paper, we focus on the Gaussian\ngraphical model. We introduce a selective inference method to attach\nuncertainty estimates to the selected (nonzero) entries of the precision matrix\nand decide which of the estimated edges must be included in the graph. Our\nmethod provides an exact adjustment for the selection of edges, which when\nmultiplied with the Wishart density of the random matrix, results in valid\nselective inferences. Through the use of externally added randomization\nvariables, our adjustment is easy to compute, requiring us to calculate the\nprobability of a selection event, which is equivalent to a few sign constraints\nand which decouples across the nodewise regressions. Through simulations and an\napplication to a mobile health trial designed to study mental health, we\ndemonstrate that our selective inference method results in higher power and\nimproved estimation accuracy."}, "http://arxiv.org/abs/2312.16739": {"title": "A Bayesian functional PCA model with multilevel partition priors for group studies in neuroscience", "link": "http://arxiv.org/abs/2312.16739", "description": "The statistical analysis of group studies in neuroscience is particularly\nchallenging due to the complex spatio-temporal nature of the data, its multiple\nlevels and the inter-individual variability in brain responses. 
In this\nrespect, traditional ANOVA-based studies and linear mixed effects models\ntypically provide only limited exploration of the dynamics of the group brain\nactivity and the variability of the individual responses, potentially leading to\noverly simplistic conclusions and/or missing more intricate patterns. In this\nstudy we propose a novel method based on functional Principal Components\nAnalysis and Bayesian model-based clustering to simultaneously assess group\neffects and individual deviations over the most important temporal features in\nthe data. This method provides a thorough exploration of group differences and\nindividual deviations in neuroscientific group studies without compromising on\nthe spatio-temporal nature of the data. By means of a simulation study we\ndemonstrate that the proposed model returns correct classification in different\nclustering scenarios under low and high noise levels in the data. Finally we\nconsider a case study using Electroencephalogram data recorded during an object\nrecognition task, where our approach provides new insights into the underlying\nbrain mechanisms generating the data and their variability."}, "http://arxiv.org/abs/2312.16769": {"title": "Estimation and Inference for High-dimensional Multi-response Growth Curve Model", "link": "http://arxiv.org/abs/2312.16769", "description": "A growth curve model (GCM) aims to characterize how an outcome variable\nevolves, develops and grows as a function of time, along with other predictors.\nIt provides a particularly useful framework to model growth trends in\nlongitudinal data. However, the estimation and inference of GCM with a large\nnumber of response variables faces numerous challenges, and remains\nunderdeveloped. In this article, we study the high-dimensional\nmultivariate-response linear GCM, and develop the corresponding estimation and\ninference procedures. Our proposal is far from a straightforward extension, and\ninvolves several innovative components. Specifically, we introduce a Kronecker\nproduct structure, which allows us to effectively decompose a very large\ncovariance matrix, and to pool the correlated samples to improve the estimation\naccuracy. We devise a highly non-trivial multi-step estimation approach to\nestimate the individual covariance components separately and effectively. We\nalso develop rigorous statistical inference procedures to test both the global\neffects and the individual effects, and establish the size and power\nproperties, as well as the proper false discovery control. We demonstrate the\neffectiveness of the new method through both intensive simulations, and the\nanalysis of longitudinal neuroimaging data for Alzheimer's disease."}, "http://arxiv.org/abs/2312.16887": {"title": "Automatic Scoring of Cognition Drawings: Assessing the Quality of Machine-Based Scores Against a Gold Standard", "link": "http://arxiv.org/abs/2312.16887", "description": "Figure drawing is often used as part of dementia screening protocols. The\nSurvey of Health Aging and Retirement in Europe (SHARE) has adopted three\ndrawing tests from Addenbrooke's Cognitive Examination III as part of its\nquestionnaire module on cognition. While the drawings are usually scored by\ntrained clinicians, SHARE uses the face-to-face interviewers who conduct the\ninterviews to score the drawings during fieldwork. This may pose a risk to data\nquality, as interviewers may be less consistent in their scoring and more\nlikely to make errors due to their lack of clinical training. 
This paper\ntherefore reports a first proof of concept and evaluates the feasibility of\nautomating scoring using deep learning. We train several different\nconvolutional neural network (CNN) models using about 2,000 drawings from the\n8th wave of the SHARE panel in Germany and the corresponding interviewer\nscores, as well as self-developed 'gold standard' scores. The results suggest\nthat this approach is indeed feasible. Compared to training on interviewer\nscores, models trained on the gold standard data improve prediction accuracy by\nabout 10 percentage points. The best performing model, ConvNeXt Base, achieves\nan accuracy of about 85%, which is 5 percentage points higher than the accuracy\nof the interviewers. While this is a promising result, the models still\nstruggle to score partially correct drawings, which are also problematic for\ninterviewers. This suggests that more and better training data is needed to\nachieve production-level prediction accuracy. We therefore discuss possible\nnext steps to improve the quality and quantity of training examples."}, "http://arxiv.org/abs/2312.16953": {"title": "Super Ensemble Learning Using the Highly-Adaptive-Lasso", "link": "http://arxiv.org/abs/2312.16953", "description": "We consider estimation of a functional parameter of a realistically modeled\ndata distribution based on independent and identically distributed\nobservations. Suppose that the true function is defined as the minimizer of the\nexpectation of a specified loss function over its parameter space. Estimators\nof the true function are provided, viewed as a data-adaptive coordinate\ntransformation for the true function. For any $J$-dimensional real valued\ncadlag function with finite sectional variation norm, we define a candidate\nensemble estimator as the mapping from the data into the composition of the\ncadlag function and the $J$ estimated functions. Using $V$-fold\ncross-validation, we define the cross-validated empirical risk of each cadlag\nfunction specific ensemble estimator. We then define the Meta Highly Adaptive\nLasso Minimum Loss Estimator (M-HAL-MLE) as the cadlag function that minimizes\nthis cross-validated empirical risk over all cadlag functions with a uniform\nbound on the sectional variation norm. For each of the $V$ training samples,\nthis yields a composition of the M-HAL-MLE ensemble and the $J$ estimated\nfunctions trained on the training sample. We can estimate the true function\nwith the average of these $V$ estimated functions, which we call the M-HAL\nsuper-learner. The M-HAL super-learner converges to the oracle estimator at a\nrate $n^{-2/3}$ (up to a $\\log n$-factor) w.r.t. excess risk, where the oracle\nestimator minimizes the excess risk among all considered ensembles. The excess\nrisk of the oracle estimator and the true function is generally second order. Under\nweak conditions on the $J$ candidate estimators, target features of the\nundersmoothed M-HAL super-learner are asymptotically linear estimators of the\ncorresponding target features of the true function, with influence curve either the\nefficient influence curve, or potentially, a super-efficient influence curve."}, "http://arxiv.org/abs/2312.17015": {"title": "Regularized Exponentially Tilted Empirical Likelihood for Bayesian Inference", "link": "http://arxiv.org/abs/2312.17015", "description": "Bayesian inference with empirical likelihood faces a challenge as the\nposterior domain is a proper subset of the original parameter space due to the\nconvex hull constraint. 
We propose a regularized exponentially tilted empirical\nlikelihood to address this issue. Our method removes the convex hull constraint\nusing a novel regularization technique, incorporating a continuous exponential\nfamily distribution to satisfy a Kullback--Leibler divergence criterion. The\nregularization arises as a limiting procedure where pseudo-data are added to\nthe formulation of exponentially tilted empirical likelihood in a structured\nfashion. We show that this regularized exponentially tilted empirical\nlikelihood retains certain desirable asymptotic properties of (exponentially\ntilted) empirical likelihood and has improved finite sample performance.\nSimulation and data analysis demonstrate that the proposed method provides a\nsuitable pseudo-likelihood for Bayesian inference. The implementation of our\nmethod is available as the R package retel. Supplementary materials for this\narticle are available online."}, "http://arxiv.org/abs/2312.17047": {"title": "Inconsistency of cross-validation for structure learning in Gaussian graphical models", "link": "http://arxiv.org/abs/2312.17047", "description": "Despite numerous years of research into the merits and trade-offs of various\nmodel selection criteria, obtaining robust results that elucidate the behavior\nof cross-validation remains a challenging endeavor. In this paper, we highlight\nthe inherent limitations of cross-validation when employed to discern the\nstructure of a Gaussian graphical model. We provide finite-sample bounds on the\nprobability that the Lasso estimator for the neighborhood of a node within a\nGaussian graphical model, optimized using a prediction oracle, misidentifies\nthe neighborhood. Our results pertain to both undirected and directed acyclic\ngraphs, encompassing general, sparse covariance structures. To support our\ntheoretical findings, we conduct an empirical investigation of this\ninconsistency by contrasting our outcomes with other commonly used information\ncriteria through an extensive simulation study. Given that many algorithms\ndesigned to learn the structure of graphical models require hyperparameter\nselection, the precise calibration of this hyperparameter is paramount for\naccurately estimating the inherent structure. Consequently, our observations\nshed light on this widely recognized practical challenge."}, "http://arxiv.org/abs/2312.17065": {"title": "CluBear: A Subsampling Package for Interactive Statistical Analysis with Massive Data on A Single Machine", "link": "http://arxiv.org/abs/2312.17065", "description": "This article introduces CluBear, a Python-based open-source package for\ninteractive massive data analysis. The key feature of CluBear is that it\nenables users to conduct convenient and interactive statistical analysis of\nmassive data with only a traditional single-computer system. Thus, CluBear\nprovides a cost-effective solution when mining large-scale datasets. 
In\naddition, the CluBear package integrates many commonly used statistical and\ngraphical tools, which are useful for most commonly encountered data analysis\ntasks."}, "http://arxiv.org/abs/2312.17111": {"title": "Online Tensor Inference", "link": "http://arxiv.org/abs/2312.17111", "description": "Recent technological advances have led to contemporary applications that\ndemand real-time processing and analysis of sequentially arriving tensor data.\nTraditional offline learning, involving the storage and utilization of all data\nin each computational iteration, becomes impractical for high-dimensional\ntensor data due to its voluminous size. Furthermore, existing low-rank tensor\nmethods lack the capability for statistical inference in an online fashion,\nwhich is essential for real-time predictions and informed decision-making. This\npaper addresses these challenges by introducing a novel online inference\nframework for low-rank tensor learning. Our approach employs Stochastic\nGradient Descent (SGD) to enable efficient real-time data processing without\nextensive memory requirements, thereby significantly reducing computational\ndemands. We establish a non-asymptotic convergence result for the online\nlow-rank SGD estimator, which nearly matches the minimax optimal rate of estimation\nerror in offline models that store all historical data. Building upon this\nfoundation, we propose a simple yet powerful online debiasing approach for\nsequential statistical inference in low-rank tensor learning. The entire online\nprocedure, covering both estimation and inference, eliminates the need for data\nsplitting or storing historical data, making it suitable for on-the-fly\nhypothesis testing. Given the sequential nature of our data collection,\ntraditional analyses relying on offline methods and sample splitting are\ninadequate. In our analysis, we control the sum of constructed\nsuper-martingales to ensure estimates along the entire solution path remain\nwithin the benign region. Additionally, a novel spectral representation tool is\nemployed to address statistical dependencies among iterative estimates,\nestablishing the desired asymptotic normality."}, "http://arxiv.org/abs/2312.17230": {"title": "Variable Neighborhood Searching Rerandomization", "link": "http://arxiv.org/abs/2312.17230", "description": "Rerandomization discards undesired treatment assignments to ensure covariate\nbalance in randomized experiments. However, rerandomization based on\nacceptance-rejection sampling is computationally inefficient, especially when\nnumerous independent assignments are required to perform randomization-based\nstatistical inference. Existing acceleration methods are suboptimal and are not\napplicable in structured experiments, including stratified experiments and\nexperiments with clusters. Based on metaheuristics in combinatorial\noptimization, we propose a novel variable neighborhood searching\nrerandomization (VNSRR) method to draw balanced assignments in various\nexperiments efficiently. We derive the unbiasedness and a lower bound for the\nvariance reduction of the treatment effect estimator under VNSRR. 
Simulation\nstudies and a real data example indicate that our method maintains the\nappealing statistical properties of rerandomization and can sample thousands of\ntreatment assignments within seconds, even in cases where existing methods\nrequire an hour to complete the task."}, "http://arxiv.org/abs/2003.13119": {"title": "Statistical Quantile Learning for Large, Nonlinear, and Additive Latent Variable Models", "link": "http://arxiv.org/abs/2003.13119", "description": "The studies of large-scale, high-dimensional data in fields such as genomics\nand neuroscience have injected new insights into science. Yet, despite\nadvances, they are confronting several challenges often simultaneously:\nnon-linearity, slow computation, inconsistency and uncertain convergence, and\nsmall sample sizes compared to high feature dimensions. Here, we propose a\nrelatively simple, scalable, and consistent nonlinear dimension reduction\nmethod that can potentially address these issues in unsupervised settings. We\ncall this method Statistical Quantile Learning (SQL) because, methodologically,\nit leverages a quantile approximation of the latent variables and standard\nnonparametric techniques (sieve or penalized methods). By doing so, we show\nthat estimating the model reduces to a convex assignment matching problem.\nTheoretically, we provide the asymptotic properties of SQL and its rates of\nconvergence. Operationally, SQL overcomes both the parametric restriction in\nnonlinear factor models in statistics and the difficulty of specifying\nhyperparameters and vanishing gradients in deep learning. Simulation studies\nsupport the theory and reveal that SQL outperforms state-of-the-art statistical\nand machine learning methods. Compared to its linear competitors, SQL explains\nmore variance, yields better separation and explanation, and delivers more\naccurate outcome prediction when latent factors are used as predictors;\ncompared to its nonlinear competitors, SQL shows a considerable advantage in\ninterpretability, ease of use and computations in high-dimensional\nsettings. Finally, we apply SQL to high-dimensional gene expression data\n(consisting of 20263 genes from 801 subjects), where the proposed method\nidentified latent factors predictive of five cancer types. The SQL package is\navailable at https://github.com/jbodelet/SQL."}, "http://arxiv.org/abs/2102.06197": {"title": "Estimating a Directed Tree for Extremes", "link": "http://arxiv.org/abs/2102.06197", "description": "We propose a new method to estimate a root-directed spanning tree from\nextreme data. A prominent example is a river network, to be discovered from\nextreme flow measured at a set of stations. Our new algorithm utilizes\nqualitative aspects of a max-linear Bayesian network, which has been designed\nfor modelling causality in extremes. The algorithm estimates bivariate scores\nand returns a root-directed spanning tree. It performs extremely well on\nbenchmark data and new data. We prove that the new estimator is consistent\nunder a max-linear Bayesian network model with noise. We also assess its\nstrengths and limitations in a small simulation study."}, "http://arxiv.org/abs/2206.10108": {"title": "A Bayesian Nonparametric Approach for Identifying Differentially Abundant Taxa in Multigroup Microbiome Data with Covariates", "link": "http://arxiv.org/abs/2206.10108", "description": "Scientific studies in the last two decades have established the central role\nof the microbiome in disease and health. 
Differential abundance analysis seeks\nto identify microbial taxa associated with sample groups defined by a factor\nsuch as disease subtype, geographical region, or environmental condition. The\nresults, in turn, help clinical practitioners and researchers diagnose disease\nand develop treatments more effectively. However, microbiome data analysis is\nuniquely challenging due to high-dimensionality, sparsity, compositionality, and\ncollinearity. There is a critical need for unified statistical approaches for\ndifferential analysis in the presence of covariates. We develop a zero-inflated\nBayesian nonparametric (ZIBNP) methodology that meets these multipronged\nchallenges. The proposed technique flexibly adapts to the unique data\ncharacteristics, casts the high proportion of zeros in a censoring framework,\nand mitigates high-dimensionality and collinearity by utilizing the\ndimension-reducing property of the semiparametric Chinese restaurant process.\nAdditionally, the ZIBNP approach relates the microbiome sampling depths to\ninferential precision while accommodating the compositional nature of\nmicrobiome data. Through simulation studies and analyses of the CAnine\nMicrobiome during Parasitism (CAMP) and Global Gut microbiome datasets, we\ndemonstrate the accuracy of ZIBNP compared to established methods for\ndifferential abundance analysis in the presence of covariates."}, "http://arxiv.org/abs/2208.10910": {"title": "A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression", "link": "http://arxiv.org/abs/2208.10910", "description": "We introduce a new empirical Bayes approach for large-scale multiple linear\nregression. Our approach combines two key ideas: (i) the use of flexible\n\"adaptive shrinkage\" priors, which approximate the nonparametric family of\nscale mixture of normal distributions by a finite mixture of normal\ndistributions; and (ii) the use of variational approximations to efficiently\nestimate prior hyperparameters and compute approximate posteriors. Combining\nthese two ideas results in fast and flexible methods, with computational speed\ncomparable to fast penalized regression methods such as the Lasso, and with\nsuperior prediction accuracy across a wide range of scenarios. Furthermore, we\nshow that the posterior mean from our method can be interpreted as solving a\npenalized regression problem, with the precise form of the penalty function\nbeing learned from the data by directly solving an optimization problem (rather\nthan being tuned by cross-validation). Our methods are implemented in an R\npackage, mr.ash.alpha, available from\nhttps://github.com/stephenslab/mr.ash.alpha"}, "http://arxiv.org/abs/2209.13117": {"title": "Consistent Covariance estimation for stratum imbalances under minimization method for covariate-adaptive randomization", "link": "http://arxiv.org/abs/2209.13117", "description": "Pocock and Simon's minimization method is a popular approach for\ncovariate-adaptive randomization in clinical trials. Valid statistical\ninference with data collected under the minimization method requires the\nknowledge of the limiting covariance matrix of within-stratum imbalances, whose\nexistence has only recently been established. In this work, we propose a\nbootstrap-based estimator for this limit and establish its consistency, in\nparticular, by Le Cam's third lemma. 
As an application, we consider in\nsimulation studies adjustments to existing robust tests for treatment effects\nwith survival data by the proposed estimator. It shows that the adjusted tests\nachieve a size close to the nominal level, and unlike other designs, the robust\ntests without adjustment may have an asymptotic size inflation issue under the\nminimization method."}, "http://arxiv.org/abs/2209.15224": {"title": "Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models", "link": "http://arxiv.org/abs/2209.15224", "description": "Unsupervised learning has been widely used in many real-world applications.\nOne of the simplest and most important unsupervised learning models is the\nGaussian mixture model (GMM). In this work, we study the multi-task learning\nproblem on GMMs, which aims to leverage potentially similar GMM parameter\nstructures among tasks to obtain improved learning performance compared to\nsingle-task learning. We propose a multi-task GMM learning procedure based on\nthe EM algorithm that not only can effectively utilize unknown similarity\nbetween related tasks but is also robust against a fraction of outlier tasks\nfrom arbitrary distributions. The proposed procedure is shown to achieve\nminimax optimal rate of convergence for both parameter estimation error and the\nexcess mis-clustering error, in a wide range of regimes. Moreover, we\ngeneralize our approach to tackle the problem of transfer learning for GMMs,\nwhere similar theoretical results are derived. Finally, we demonstrate the\neffectiveness of our methods through simulations and real data examples. To the\nbest of our knowledge, this is the first work studying multi-task and transfer\nlearning on GMMs with theoretical guarantees."}, "http://arxiv.org/abs/2212.06228": {"title": "LRD spectral analysis of multifractional functional time series on manifolds", "link": "http://arxiv.org/abs/2212.06228", "description": "This paper addresses the estimation of the second-order structure of a\nmanifold cross-time random field (RF) displaying spatially varying Long Range\nDependence (LRD), adopting the functional time series framework introduced in\nRuiz-Medina (2022). Conditions for the asymptotic unbiasedness of the\nintegrated periodogram operator in the Hilbert-Schmidt operator norm are\nderived beyond structural assumptions. Weak-consistent estimation of the\nlong-memory operator is achieved under a semiparametric functional spectral\nframework in the Gaussian context. The case where the projected manifold\nprocess can display Short Range Dependence (SRD) and LRD at different manifold\nscales is also analyzed. The performance of both estimation procedures is\nillustrated in the simulation study, in the context of multifractionally\nintegrated spherical functional autoregressive-moving average (SPHARMA(p,q))\nprocesses."}, "http://arxiv.org/abs/2301.01854": {"title": "Solving The Ordinary Least Squares in Closed Form, Without Inversion or Normalization", "link": "http://arxiv.org/abs/2301.01854", "description": "By connecting the LU factorization and the Gram-Schmidt orthogonalization\nwithout any normalization, closed-forms for the coefficients of the ordinary\nleast squares estimates are presented. 
Instead of using matrix inversion\nexplicitly, each of the coefficients is expressed and computed directly as a\nlinear combination of non-normalized Gram-Schmidt vectors and the original data\nmatrix and also in terms of the upper triangular factor from LU factorization.\nThe coefficients may be computed iteratively using the backward or forward algorithms\ngiven."}, "http://arxiv.org/abs/2312.17420": {"title": "Exact Consistency Tests for Gaussian Mixture Filters using Normalized Deviation Squared Statistics", "link": "http://arxiv.org/abs/2312.17420", "description": "We consider the problem of evaluating dynamic consistency in discrete time\nprobabilistic filters that approximate stochastic system state densities with\nGaussian mixtures. Dynamic consistency means that the estimated probability\ndistributions correctly describe the actual uncertainties. As such, the problem\nof consistency testing naturally arises in applications with regards to\nestimator tuning and validation. However, due to the general complexity of the\ndensity functions involved, straightforward approaches for consistency testing\nof mixture-based estimators have remained challenging to define and implement.\nThis paper derives a new exact result for Gaussian mixture consistency testing\nwithin the framework of normalized deviation squared (NDS) statistics. It is\nshown that NDS test statistics for generic multivariate Gaussian mixture models\nexactly follow mixtures of generalized chi-square distributions, for which\nefficient computational tools are available. The accuracy and utility of the\nresulting consistency tests are numerically demonstrated on static and dynamic\nmixture estimation examples."}, "http://arxiv.org/abs/2312.17480": {"title": "Detection of evolutionary shifts in variance under an Ornstein-Uhlenbeck model", "link": "http://arxiv.org/abs/2312.17480", "description": "1. Abrupt environmental changes can lead to evolutionary shifts in not only\nmean (optimal value), but also variance of descendants in trait evolution.\nThere are some methods to detect shifts in optimal value but few studies\nconsider shifts in variance. 2. We use a multi-optima and multi-variance OU\nprocess model to describe the trait evolution process with shifts in both\noptimal value and variance and provide analysis of how the covariance between\nspecies changes when shifts in variance occur along the path. 3. We propose a\nnew method to detect the shifts in both variance and optimal values based on\nminimizing the loss function with an L1 penalty. We implement our method in a new\nR package, ShiVa (Detection of evolutionary shifts in variance). 4. We conduct\nsimulations to compare our method with the two methods considering only shifts\nin optimal values (l1ou; PhylogeneticEM). Our method shows strength in\npredictive ability and includes far fewer false positive shifts in optimal\nvalue compared to other methods when shifts in variance actually exist. When\nthere are only shifts in optimal value, our method performs similarly to other\nmethods. 
We applied our method to the cordylid data, where ShiVa outperformed l1ou\nand phyloEM, exhibiting the highest log-likelihood and lowest BIC."}, "http://arxiv.org/abs/2312.17566": {"title": "Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing", "link": "http://arxiv.org/abs/2312.17566", "description": "Bayesian model-averaged hypothesis testing is an important technique in\nregression because it addresses the problem that the evidence that one variable\ndirectly affects an outcome often depends on which other variables are included\nin the model. This problem is caused by confounding and mediation, and is\npervasive in big data settings with thousands of variables. However,\nmodel-averaging is under-utilized in fields, like epidemiology, where classical\nstatistical approaches dominate. Here we show that simultaneous Bayesian and\nfrequentist model-averaged hypothesis testing is possible in large samples, for\na family of priors. We show that Bayesian model-averaged regression is a closed\ntesting procedure, and use the theory of regular variation to derive\ninterchangeable posterior odds and $p$-values that jointly control the Bayesian\nfalse discovery rate (FDR), the frequentist type I error rate, and the\nfrequentist familywise error rate (FWER). These results arise from an\nasymptotic chi-squared distribution for the model-averaged deviance, under the\nnull hypothesis. We call the approach 'Doublethink'. In a related manuscript\n(Arning, Fryer and Wilson, 2024), we apply it to discovering direct risk\nfactors for COVID-19 hospitalization in UK Biobank, and we discuss its broader\nimplications for bridging the differences between Bayesian and frequentist\nhypothesis testing."}, "http://arxiv.org/abs/2312.17716": {"title": "Dependent Random Partitions by Shrinking Toward an Anchor", "link": "http://arxiv.org/abs/2312.17716", "description": "Although exchangeable processes from Bayesian nonparametrics have been used\nas a generating mechanism for random partition models, we deviate from this\nparadigm to explicitly incorporate clustering information in the formulation of\nour random partition model. Our shrinkage partition distribution takes any\npartition distribution and shrinks its probability mass toward an anchor\npartition. We show how this provides a framework to model\nhierarchically-dependent and temporally-dependent random partitions. The\nshrinkage parameters control the degree of dependence, accommodating at its\nextremes both independence and complete equality. Since a priori knowledge of\nitems may vary, our formulation allows the degree of shrinkage toward the\nanchor to be item-specific. Our random partition model has a tractable\nnormalizing constant which allows for standard Markov chain Monte Carlo\nalgorithms for posterior sampling. We prove intuitive theoretical properties\nfor our distribution and compare it to related partition distributions. We show\nthat our model provides better out-of-sample fit in a real data application."}, "http://arxiv.org/abs/2210.00697": {"title": "A flexible model for correlated count data, with application to multi-condition differential expression analyses of single-cell RNA sequencing data", "link": "http://arxiv.org/abs/2210.00697", "description": "Detecting differences in gene expression is an important part of single-cell\nRNA sequencing experiments, and many statistical methods have been developed\nfor this aim. 
Most differential expression analyses focus on comparing\nexpression between two groups (e.g., treatment vs. control). But there is\nincreasing interest in multi-condition differential expression analyses in\nwhich expression is measured in many conditions, and the aim is to accurately\ndetect and estimate expression differences in all conditions. We show that\ndirectly modeling single-cell RNA-seq counts in all conditions simultaneously,\nwhile also inferring how expression differences are shared across conditions,\nleads to greatly improved performance for detecting and estimating expression\ndifferences compared to existing methods. We illustrate the potential of this\nnew approach by analyzing data from a single-cell experiment studying the\neffects of cytokine stimulation on gene expression. We call our new method\n\"Poisson multivariate adaptive shrinkage\", and it is implemented in an R\npackage available online at https://github.com/stephenslab/poisson.mash.alpha."}, "http://arxiv.org/abs/2211.13383": {"title": "A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments", "link": "http://arxiv.org/abs/2211.13383", "description": "In this paper, we aim to propose a consistent non-Gaussian Bayesian filter of\nwhich the system state is a continuous function. The distributions of the true\nsystem states, and those of the system and observation noises, are only assumed\nLebesgue integrable with no prior constraints on what function classes they\nfall within. This type of filter has significant merits in both theory and\npractice, which is able to ameliorate the curse of dimensionality for the\nparticle filter, a popular non-Gaussian Bayesian filter of which the system\nstate is parameterized by discrete particles and the corresponding weights. We\nfirst propose a new type of statistics, called the generalized logarithmic\nmoments. Together with the power moments, they are used to form a density\nsurrogate, parameterized as an analytic function, to approximate the true\nsystem state. The map from the parameters of the proposed density surrogate to\nboth the power moments and the generalized logarithmic moments is proved to be\na diffeomorphism, establishing the fact that there exists a unique density\nsurrogate which satisfies both moment conditions. This diffeomorphism also\nallows us to use gradient methods to treat the convex optimization problem in\ndetermining the parameters. Last but not least, simulation results reveal the\nadvantage of using both sets of moments for estimating mixtures of complicated\ntypes of functions. A robot localization simulation is also given, as an\nengineering application to validate the proposed filtering scheme."}, "http://arxiv.org/abs/2304.03476": {"title": "Generalizing the intention-to-treat effect of an active control against placebo from historical placebo-controlled trials to an active-controlled trial: A case study of the efficacy of daily oral TDF/FTC in the HPTN 084 study", "link": "http://arxiv.org/abs/2304.03476", "description": "In many clinical settings, an active-controlled trial design (e.g., a\nnon-inferiority or superiority design) is often used to compare an experimental\nmedicine to an active control (e.g., an FDA-approved, standard therapy). 
One\nprominent example is a recent phase 3 efficacy trial, HIV Prevention Trials\nNetwork Study 084 (HPTN 084), comparing long-acting cabotegravir, a new HIV\npre-exposure prophylaxis (PrEP) agent, to the FDA-approved daily oral tenofovir\ndisoproxil fumarate plus emtricitabine (TDF/FTC) in a population of\nheterosexual women in 7 African countries. One key complication of interpreting\nstudy results in an active-controlled trial like HPTN 084 is that the placebo\narm is not present and the efficacy of the active control (and hence the\nexperimental drug) compared to the placebo can only be inferred by leveraging\nother data sources. In this article, we study statistical inference for the\nintention-to-treat (ITT) effect of the active control using relevant historical\nplacebo-controlled trials data under the potential outcomes (PO) framework. We\nhighlight the role of adherence and unmeasured confounding, discuss in detail\nidentification assumptions and two modes of inference (point versus partial\nidentification), propose estimators under identification assumptions permitting\npoint identification, and lay out sensitivity analyses needed to relax\nidentification assumptions. We applied our framework to estimating the\nintention-to-treat effect of daily oral TDF/FTC versus placebo in HPTN 084\nusing data from an earlier Phase 3, placebo-controlled trial of daily oral\nTDF/FTC (Partners PrEP)."}, "http://arxiv.org/abs/2305.08284": {"title": "Model-based standardization using multiple imputation", "link": "http://arxiv.org/abs/2305.08284", "description": "When studying the association between treatment and a clinical outcome, a\nparametric multivariable model of the conditional outcome expectation is often\nused to adjust for covariates. The treatment coefficient of the outcome model\ntargets a conditional treatment effect. Model-based standardization is\ntypically applied to average the model predictions over the target covariate\ndistribution, and generate a covariate-adjusted estimate of the marginal\ntreatment effect. The standard approach to model-based standardization involves\nmaximum-likelihood estimation and use of the non-parametric bootstrap. We\nintroduce a novel, general-purpose, model-based standardization method based on\nmultiple imputation that is easily applicable when the outcome model is a\ngeneralized linear model. We term our proposed approach multiple imputation\nmarginalization (MIM). MIM consists of two main stages: the generation of\nsynthetic datasets and their analysis. MIM accommodates a Bayesian statistical\nframework, which naturally allows for the principled propagation of\nuncertainty, integrates the analysis into a probabilistic framework, and allows\nfor the incorporation of prior evidence. We conduct a simulation study to\nbenchmark the finite-sample performance of MIM in conjunction with a parametric\noutcome model. The simulations provide proof-of-principle in scenarios with\nbinary outcomes, continuous-valued covariates, a logistic outcome model and the\nmarginal log odds ratio as the target effect measure. 
When parametric modeling\nassumptions hold, MIM yields unbiased estimation in the target covariate\ndistribution, valid coverage rates, and similar precision and efficiency to\nthe standard approach to model-based standardization."}, "http://arxiv.org/abs/2401.00097": {"title": "Recursive identification with regularization and on-line hyperparameters estimation", "link": "http://arxiv.org/abs/2401.00097", "description": "This paper presents a regularized recursive identification algorithm with\nsimultaneous on-line estimation of both the model parameters and the algorithm's\nhyperparameters. A new kernel is proposed to facilitate the algorithm\ndevelopment. The performance of this novel scheme is compared with that of the\nrecursive least-squares algorithm in simulation."}, "http://arxiv.org/abs/2401.00104": {"title": "Causal State Distillation for Explainable Reinforcement Learning", "link": "http://arxiv.org/abs/2401.00104", "description": "Reinforcement learning (RL) is a powerful technique for training intelligent\nagents, but understanding why these agents make specific decisions can be quite\nchallenging. This lack of transparency in RL models has been a long-standing\nproblem, making it difficult for users to grasp the reasons behind an agent's\nbehaviour. Various approaches have been explored to address this problem, with\none promising avenue being reward decomposition (RD). RD is appealing as it\nsidesteps some of the concerns associated with other methods that attempt to\nrationalize an agent's behaviour in a post-hoc manner. RD works by exposing\nvarious facets of the rewards that contribute to the agent's objectives during\ntraining. However, RD alone has limitations as it primarily offers insights\nbased on sub-rewards and does not delve into the intricate cause-and-effect\nrelationships that occur within an RL agent's neural model. In this paper, we\npresent an extension of RD that goes beyond sub-rewards to provide more\ninformative explanations. Our approach is centred on a causal learning\nframework that leverages information-theoretic measures for explanation\nobjectives that encourage three crucial properties of causal factors:\n\\emph{causal sufficiency}, \\emph{sparseness}, and \\emph{orthogonality}. These\nproperties help us distill the cause-and-effect relationships between the\nagent's states and actions or rewards, allowing for a deeper understanding of\nits decision-making processes. Our framework is designed to generate local\nexplanations and can be applied to a wide range of RL tasks with multiple\nreward channels. Through a series of experiments, we demonstrate that our\napproach offers more meaningful and insightful explanations for the agent's\naction selections."}, "http://arxiv.org/abs/2401.00139": {"title": "Is Knowledge All Large Language Models Needed for Causal Reasoning?", "link": "http://arxiv.org/abs/2401.00139", "description": "This paper explores the causal reasoning of large language models (LLMs) to\nenhance their interpretability and reliability in advancing artificial\nintelligence. Despite the proficiency of LLMs in a range of tasks, their\npotential for understanding causality requires further exploration. We propose\na novel causal attribution model that utilizes \"do-operators\" for constructing\ncounterfactual scenarios, allowing us to systematically quantify the influence\nof input numerical data and LLMs' pre-existing knowledge on their causal\nreasoning processes. 
Our newly developed experimental setup assesses LLMs'\nreliance on contextual information and inherent knowledge across various\ndomains. Our evaluation reveals that LLMs' causal reasoning ability depends on\nthe context and domain-specific knowledge provided, and supports the argument\nthat \"knowledge is, indeed, what LLMs principally require for sound causal\nreasoning\". On the contrary, in the absence of knowledge, LLMs still maintain a\ndegree of causal reasoning using the available numerical data, albeit with\nlimitations in the calculations."}, "http://arxiv.org/abs/2401.00196": {"title": "Bayesian principal stratification with longitudinal data and truncation by death", "link": "http://arxiv.org/abs/2401.00196", "description": "In many causal studies, outcomes are censored by death, in the sense that\nthey are neither observed nor defined for units who die. In such studies, the\nfocus is usually on the stratum of always survivors up to a single fixed time\ns. Building on a recent strand of the literature, we propose an extended\nframework for the analysis of longitudinal studies, where units can die at\ndifferent time points, and the main endpoints are observed and well defined\nonly up to the death time. We develop a Bayesian longitudinal principal\nstratification framework, where units are cross classified according to the\nlongitudinal death status. Under this framework, the focus is on causal effects\nfor the principal strata of units that would be alive up to a time point s\nirrespective of their treatment assignment, where these strata may vary as a\nfunction of s. We can gain valuable insights into the effects of treatment by\ninspecting the distribution of baseline characteristics within each\nlongitudinal principal stratum, and by investigating the time trend of both\nprincipal stratum membership and survivor-average causal effects. We illustrate\nour approach for the analysis of a longitudinal observational study aimed to\nassess, under the assumption of strong ignorability of treatment assignment,\nthe causal effects of a policy promoting start-ups on firms' survival and hiring\npolicy, where a firm's hiring status is censored by death."}, "http://arxiv.org/abs/2401.00245": {"title": "Alternative Approaches for Computing Highest-Density Regions", "link": "http://arxiv.org/abs/2401.00245", "description": "Many statistical problems require estimating a density function, say $f$,\nfrom data samples. In this work, for example, we are interested in\nhighest-density regions (HDRs), i.e., minimum volume sets that contain a given\nprobability. HDRs are typically computed using a density quantile approach,\nwhich, in the case of unknown densities, involves their estimation. This task\nturns out to be far from trivial, especially over increased dimensions and when\ndata are sparse and exhibit complex structures (e.g., multimodalities or\nparticular dependencies). We address this challenge by exploring alternative\napproaches to build HDRs that overcome direct (multivariate) density\nestimation. First, we generalize the density quantile method, currently\nimplementable on the basis of a consistent estimator of the density, to\n$neighbourhood$ measures, i.e., measures that preserve the order induced in the\nsample by $f$. Second, we discuss a number of suitable probabilistic- and\ndistance-based measures such as the $k$-nearest neighbourhood Euclidean\ndistance. 
Third, motivated by the ubiquitous role of $copula$ modeling in\nmodern statistics, we explore its use in the context of probabilistic-based\nmeasures. An extensive comparison among the introduced measures is provided,\nand their implications for computing HDRs in real-world problems are discussed."}, "http://arxiv.org/abs/2401.00255": {"title": "Adaptive Rank-based Tests for High Dimensional Mean Problems", "link": "http://arxiv.org/abs/2401.00255", "description": "The Wilcoxon signed-rank test and the Wilcoxon-Mann-Whitney test are commonly\nemployed in one sample and two sample mean tests for one-dimensional hypothesis\nproblems. For high-dimensional mean test problems, we calculate the asymptotic\ndistribution of the maximum of rank statistics for each variable and suggest a\nmax-type test. This max-type test is then merged with a sum-type test, based on\ntheir asymptotic independence offered by stationary and strong mixing\nassumptions. Our numerical studies reveal that this combined test demonstrates\nrobustness and superiority over other methods, especially for heavy-tailed\ndistributions."}, "http://arxiv.org/abs/2401.00257": {"title": "Assessing replication success via skeptical mixture priors", "link": "http://arxiv.org/abs/2401.00257", "description": "There is a growing interest in the analysis of replication studies of\noriginal findings across many disciplines. When testing a hypothesis for an\neffect size, two Bayesian approaches stand out for their principled use of the\nBayes factor (BF), namely the replication BF and the skeptical BF. In\nparticular, the latter BF is based on the skeptical prior, which represents the\nopinion of an investigator who is unconvinced by the original findings and\nwants to challenge them. We embrace the skeptical perspective, and elaborate a\nnovel mixture prior which incorporates skepticism while at the same time\ncontrolling for prior-data conflict within the original data. Consistency\nproperties of the resulting skeptical mixture BF are provided together with an\nextensive analysis of the main features of our proposal. Finally, we apply our\nmethodology to data from the Social Sciences Replication Project. In particular\nwe show that, for some case studies where prior-data conflict is an issue, our\nmethod uses a more realistic prior and leads to evidence-classification for\nreplication success which differs from the standard skeptical approach."}, "http://arxiv.org/abs/2401.00324": {"title": "Stratified distance space improves the efficiency of sequential samplers for approximate Bayesian computation", "link": "http://arxiv.org/abs/2401.00324", "description": "Approximate Bayesian computation (ABC) methods are standard tools for\ninferring parameters of complex models when the likelihood function is\nanalytically intractable. A popular approach to improving the poor acceptance\nrate of the basic rejection sampling ABC algorithm is to use sequential Monte\nCarlo (ABC SMC) to produce a sequence of proposal distributions adapting\ntowards the posterior, instead of generating values from the prior distribution\nof the model parameters. Proposal distribution for the subsequent iteration is\ntypically obtained from a weighted set of samples, often called particles, of\nthe current iteration of this sequence. 
Current methods for constructing these\nproposal distributions treat all the particles equivalently, regardless of the\ncorresponding value generated by the sampler, which may lead to inefficiency\nwhen propagating the information across iterations of the algorithm. To improve\nsampler efficiency, we introduce a modified approach called stratified distance\nABC SMC. Our algorithm stratifies particles based on the distance between the\ncorresponding synthetic and observed data, and then constructs distinct\nproposal distributions for all the strata. Taking into account the distribution\nof distances across the particle space leads to a substantially improved\nacceptance rate of the rejection sampling. We further show that efficiency can\nbe gained by introducing a novel stopping rule for the sequential process based\non the stratified posterior samples and demonstrate these advances by several\nexamples."}, "http://arxiv.org/abs/2401.00354": {"title": "Estimation of the Emax model", "link": "http://arxiv.org/abs/2401.00354", "description": "This study focuses on the estimation of the Emax dose-response model, a\nwidely utilized framework in clinical trials, agriculture, and environmental\nexperiments. Existing challenges in obtaining maximum likelihood estimates\n(MLE) for model parameters are often ascribed to computational issues but, in\nreality, stem from the absence of MLE. Our contribution provides a new\nunderstanding and control of all the experimental situations that practitioners\nmight face, guiding them in the estimation process. We derive the exact MLE for\na three-point experimental design and we identify the two scenarios where the\nMLE fails. To address these challenges, we propose utilizing Firth's modified\nscore, providing its analytical expression as a function of the experimental\ndesign. Through a simulation study, we demonstrate that, in one of the\nproblematic cases, the Firth modification yields a finite estimate. For the\nremaining case, we introduce a design-correction strategy akin to a hypothesis\ntest."}, "http://arxiv.org/abs/2401.00395": {"title": "Energetic Variational Gaussian Process Regression for Computer Experiments", "link": "http://arxiv.org/abs/2401.00395", "description": "The Gaussian process (GP) regression model is a widely employed surrogate\nmodeling technique for computer experiments, offering precise predictions and\nstatistical inference for the computer simulators that generate experimental\ndata. Estimation and inference for GP can be performed in both frequentist and\nBayesian frameworks. In this chapter, we construct the GP model through\nvariational inference, particularly employing the recently introduced energetic\nvariational inference method by Wang et al. (2021). Adhering to the GP model\nassumptions, we derive posterior distributions for its parameters. The\nenergetic variational inference approach bridges Bayesian sampling and\noptimization and enables approximation of the posterior distributions and\nidentification of the posterior mode. By incorporating a normal prior on the\nmean component of the GP model, we also apply shrinkage estimation to the\nparameters, facilitating mean function variable selection. 
To showcase the\neffectiveness of our proposed GP model, we present results from three benchmark\nexamples."}, "http://arxiv.org/abs/2401.00461": {"title": "A Penalized Functional Linear Cox Regression Model for Spatially-defined Environmental Exposure with an Estimated Buffer Distance", "link": "http://arxiv.org/abs/2401.00461", "description": "In environmental health research, it is of interest to understand the effect\nof the neighborhood environment on health. Researchers have shown a protective\nassociation between green space around a person's residential address and\ndepression outcomes. In measuring exposure to green space, distance buffers are\noften used. However, buffer distances differ across studies. Typically, the\nbuffer distance is determined by researchers a priori. It is unclear how to\nidentify an appropriate buffer distance for exposure assessment. To address\ngeographic uncertainty problem for exposure assessment, we present a domain\nselection algorithm based on the penalized functional linear Cox regression\nmodel. The theoretical properties of our proposed method are studied and\nsimulation studies are conducted to evaluate finite sample performances of our\nmethod. The proposed method is illustrated in a study of associations of green\nspace exposure with depression and/or antidepressant use in the Nurses' Health\nStudy."}, "http://arxiv.org/abs/2401.00517": {"title": "Detecting Imprinting and Maternal Effects Using Monte Carlo Expectation Maximization Algorithm", "link": "http://arxiv.org/abs/2401.00517", "description": "Numerous statistical methods have been developed to explore genomic\nimprinting and maternal effects, which are causes of parent-of-origin patterns\nin complex human diseases. However, most of them either only model one of these\ntwo confounded epigenetic effects, or make strong yet unrealistic assumptions\nabout the population to avoid over-parameterization. A recent partial\nlikelihood method (LIME) can identify both epigenetic effects based on\ncase-control family data without those assumptions. Theoretical and empirical\nstudies have shown its validity and robustness. However, because LIME obtains\nparameter estimation by maximizing partial likelihood, it is interesting to\ncompare its efficiency with full likelihood maximizer. To overcome the\ndifficulty in over-parameterization when using full likelihood, in this study\nwe propose a Monte Carlo Expectation Maximization (MCEM) method to detect\nimprinting and maternal effects jointly. Those unknown mating type\nprobabilities, the nuisance parameters, can be considered as latent variables\nin EM algorithm. Monte Carlo samples are used to numerically approximate the\nexpectation function that cannot be solved algebraically. Our simulation\nresults show that though this MCEM algorithm takes longer computational time,\nand can give higher bias in some simulations compared to LIME, it can generally\ndetect both epigenetic effects with higher power and smaller standard error\nwhich demonstrates that it can be a good complement of LIME method."}, "http://arxiv.org/abs/2401.00520": {"title": "Monte Carlo Expectation-Maximization algorithm to detect imprinting and maternal effects for discordant sib-pair data", "link": "http://arxiv.org/abs/2401.00520", "description": "Numerous statistical methods have been developed to explore genomic\nimprinting and maternal effects, which are causes of parent-of-origin patterns\nin complex human diseases. 
Most of the methods, however, either only model one\nof these two confounded epigenetic effects, or make strong yet unrealistic\nassumptions about the population to avoid over-parameterization. A recent\npartial likelihood method (LIMEDSP) can identify both epigenetic effects based\non discordant sib-pair family data without those assumptions. Theoretical and\nempirical studies have shown its validity and robustness. As the LIMEDSP method\nobtains parameter estimates by maximizing a partial likelihood, it is\ninteresting to compare its efficiency with the full likelihood maximizer. To\novercome the difficulty in over-parameterization when using full likelihood,\nthis study proposes a discordant sib-pair design-based Monte Carlo Expectation\nMaximization (MCEMDSP) method to detect imprinting and maternal effects\njointly. Those unknown mating type probabilities, the nuisance parameters, are\nconsidered as latent variables in the EM algorithm. Monte Carlo samples are used to\nnumerically approximate the expectation function that cannot be solved\nalgebraically. Our simulation results show that though this MCEMDSP algorithm\ntakes longer computation time, it can generally detect both epigenetic effects\nwith higher power, which demonstrates that it can be a good complement of the\nLIMEDSP method."}, "http://arxiv.org/abs/2401.00540": {"title": "Study Duration Prediction for Clinical Trials with Time-to-Event Endpoints Using Mixture Distributions Accounting for Heterogeneous Population", "link": "http://arxiv.org/abs/2401.00540", "description": "In the era of precision medicine, more and more clinical trials are now\ndriven or guided by biomarkers, which are patient characteristics objectively\nmeasured and evaluated as indicators of normal biological processes, pathogenic\nprocesses, or pharmacologic responses to therapeutic interventions. With the\noverarching objective to optimize and personalize disease management,\nbiomarker-guided clinical trials increase the efficiency by appropriately\nutilizing prognostic or predictive biomarkers in the design. However, the\nefficiency gain is often not quantitatively compared to the traditional\nall-comers design, in which a faster enrollment rate is expected (e.g., due to\nno restriction to biomarker-positive patients), potentially leading to a shorter\nduration. To accurately predict biomarker-guided trial duration, we propose a\ngeneral framework using mixture distributions accounting for heterogeneous\npopulation. Extensive simulations are performed to evaluate the impact of\nheterogeneous population and the dynamics of biomarker characteristics and\ndisease on the study duration. Several influential parameters, including median\nsurvival time, enrollment rate, biomarker prevalence and effect size, are\nidentified. Re-assessments of two publicly available trials are conducted to\nempirically validate the prediction accuracy and to demonstrate the practical\nutility. The R package \\emph{detest} is developed to implement the proposed\nmethod and is publicly available on CRAN."}, "http://arxiv.org/abs/2401.00566": {"title": "Change point analysis -- the empirical Hankel transform approach", "link": "http://arxiv.org/abs/2401.00566", "description": "In this study, we introduce a first-of-its-kind class of tests for\ndetecting change points in the distribution of a sequence of independent\nmatrix-valued random variables. The tests are constructed using the weighted\nsquare integral difference of the empirical orthogonal Hankel transforms. 
The\ntest statistics have a convenient closed-form expression, making them easy to\nimplement in practice. We present their limiting properties and demonstrate\ntheir quality through an extensive simulation study. We utilize these tests for\nchange point detection in cryptocurrency markets to showcase their practical\nuse. The detection of change points in this context can have various\napplications in constructing and analyzing novel trading systems."}, "http://arxiv.org/abs/2401.00568": {"title": "Extrapolation of Relative Treatment Effects using Change-point Survival Models", "link": "http://arxiv.org/abs/2401.00568", "description": "Introduction: Modelling of relative treatment effects is an important aspect\nto consider when extrapolating the long-term survival outcomes of treatments.\nFlexible parametric models offer the ability to accurately model the observed\ndata; however, the extrapolated relative treatment effects and subsequent\nsurvival function may lack face validity. Methods: We investigate the ability\nof change-point survival models to estimate changes in the relative treatment\neffects, specifically treatment delay, loss of treatment effects and converging\nhazards. These models are implemented using standard Bayesian statistical\nsoftware and propagate the uncertainty associated with all model parameters,\nincluding the change-point location. A simulation study was conducted to assess\nthe predictive performance of these models compared with other parametric\nsurvival models. Change-point survival models were applied to three datasets,\ntwo of which were used in previous health technology assessments. Results:\nChange-point survival models typically provided improved extrapolated survival\npredictions, particularly when the changes in relative treatment effects are\nlarge. When applied to the real-world examples, they provided a good fit to the\nobserved data and in some situations produced more clinically plausible\nextrapolations than those generated by flexible spline models. Change-point\nmodels also provided support for a previously implemented modelling approach\nwhich was justified by visual inspection only and not by goodness of fit to the\nobserved data. Conclusions: We believe change-point survival models offer the\nability to flexibly model observed data while also modelling and investigating\nclinically plausible scenarios with respect to the relative treatment effects."}, "http://arxiv.org/abs/2401.00624": {"title": "Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures", "link": "http://arxiv.org/abs/2401.00624", "description": "We propose a novel data-driven semi-confirmatory factor analysis (SCFA) model\nthat addresses the absence of model specification and handles the estimation\nand inference tasks with high-dimensional data. Confirmatory factor analysis\n(CFA) is a prevalent and pivotal technique for statistically validating the\ncovariance structure of latent common factors derived from multiple observed\nvariables. In contrast to other factor analysis methods, CFA offers a flexible\ncovariance modeling approach for common factors, enhancing the interpretability\nof relationships between the common factors, as well as between common factors\nand observations. 
However, the application of classic CFA models faces dual\nbarriers: the lack of a prerequisite specification of \"non-zero loadings\" or\nfactor membership (i.e., categorizing the observations into distinct common\nfactors), and the formidable computational burden in high-dimensional scenarios\nwhere the number of observed variables surpasses the sample size. To bridge\nthese two gaps, we propose the SCFA model by integrating the underlying\nhigh-dimensional covariance structure of observed variables into the CFA model.\nAdditionally, we offer computationally efficient solutions (i.e., closed-form\nuniformly minimum variance unbiased estimators) and ensure accurate statistical\ninference through closed-form exact variance estimators for all model\nparameters and factor scores. Through an extensive simulation analysis\nbenchmarking against standard computational packages, SCFA exhibits superior\nperformance in estimating model parameters and recovering factor scores, while\nsubstantially reducing the computational load, across both low- and\nhigh-dimensional scenarios. It exhibits moderate robustness to model\nmisspecification. We illustrate the practical application of the SCFA model by\nconducting factor analysis on a high-dimensional gene expression dataset."}, "http://arxiv.org/abs/2401.00634": {"title": "A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology", "link": "http://arxiv.org/abs/2401.00634", "description": "Accounting for exposure measurement errors has been recognized as a crucial\nproblem in environmental epidemiology for over two decades. Bayesian\nhierarchical models offer a coherent probabilistic framework for evaluating\nassociations between environmental exposures and health effects, which take\ninto account exposure measurement errors introduced by uncertainty in the\nestimated exposure as well as spatial misalignment between the exposure and\nhealth outcome data. While two-stage Bayesian analyses are often regarded as a\ngood alternative to fully Bayesian analyses when joint estimation is not\nfeasible, there has been minimal research on how to properly propagate\nuncertainty from the first-stage exposure model to the second-stage health\nmodel, especially in the case of a large number of participant locations along\nwith spatially correlated exposures. We propose a scalable two-stage Bayesian\napproach, called a sparse multivariate normal (sparse MVN) prior approach,\nbased on the Vecchia approximation for assessing associations between exposure\nand health outcomes in environmental epidemiology. We compare its performance\nwith existing approaches through simulation. Our sparse MVN prior approach\nshows comparable performance with the fully Bayesian approach, which is a gold\nstandard but is impossible to implement in some cases. We investigate the\nassociation between source-specific exposures and pollutant (nitrogen dioxide\n(NO$_2$))-specific exposures and birth outcomes for 2012 in Harris County,\nTexas, using several approaches, including the newly developed method."}, "http://arxiv.org/abs/2401.00649": {"title": "Linear Model and Extensions", "link": "http://arxiv.org/abs/2401.00649", "description": "I developed the lecture notes based on my ``Linear Model'' course at the\nUniversity of California Berkeley over the past seven years. This book provides\nan intermediate-level introduction to the linear model. It balances rigorous\nproofs and heuristic arguments. 
This book provides R code to replicate all\nsimulation studies and case studies."}, "http://arxiv.org/abs/2401.00667": {"title": "Channelling Multimodality Through a Unimodalizing Transport: Warp-U Sampler and Stochastic Bridge Sampling", "link": "http://arxiv.org/abs/2401.00667", "description": "Monte Carlo integration is fundamental in scientific and statistical\ncomputation, but requires reliable samples from the target distribution, which\nposes a substantial challenge in the case of multi-modal distributions.\nExisting methods often involve time-consuming tuning, and typically lack\ntailored estimators for efficient use of the samples. This paper adapts the\nWarp-U transformation [Wang et al., 2022] to form a multi-modal sampling strategy\ncalled Warp-U sampling. It constructs a stochastic map to transport a\nmulti-modal density into a uni-modal one, and subsequently inverts the\ntransport but with new stochasticity injected. For efficient use of the samples\nfor normalising constant estimation, we propose (i) an unbiased estimation\nscheme based on coupled chains, where the Warp-U sampling is used to reduce the\ncoupling time; and (ii) a stochastic Warp-U bridge sampling estimator, which\nimproves upon its deterministic counterpart given in Wang et al. [2022]. Our overall\napproach requires less tuning and is easier to apply than common alternatives.\nTheoretically, we establish the ergodicity of our sampling algorithm and that\nour stochastic Warp-U bridge sampling estimator has greater (asymptotic)\nprecision per CPU second compared to the Warp-U bridge estimator of Wang et al.\n[2022] under practical conditions. The advantages and current limitations of\nour approach are demonstrated through simulation studies and an application to\nexoplanet detection."}, "http://arxiv.org/abs/2401.00800": {"title": "Factor Importance Ranking and Selection using Total Indices", "link": "http://arxiv.org/abs/2401.00800", "description": "Factor importance measures the impact of each feature on output prediction\naccuracy. Many existing works focus on the model-based importance, but an\nimportant feature in one learning algorithm may hold little significance in\nanother model. Hence, a factor importance measure ought to characterize the\nfeature's predictive potential without relying on a specific prediction\nalgorithm. Such algorithm-agnostic importance is termed intrinsic importance\nin Williamson et al. (2023), but their estimator again requires model fitting.\nTo bypass the modeling step, we present the equivalence between predictiveness\npotential and total Sobol' indices from global sensitivity analysis, and\nintroduce a novel consistent estimator that can be directly estimated from\nnoisy data. Integrating with forward selection and backward elimination gives\nrise to FIRST, Factor Importance Ranking and Selection using Total (Sobol')\nindices. Extensive simulations are provided to demonstrate the effectiveness of\nFIRST on regression and binary classification problems, and a clear advantage\nover the state-of-the-art methods."}, "http://arxiv.org/abs/2401.00840": {"title": "Bayesian Effect Selection in Additive Models with an Application to Time-to-Event Data", "link": "http://arxiv.org/abs/2401.00840", "description": "Accurately selecting and estimating smooth functional effects in additive\nmodels with potentially many functions is a challenging task. 
We introduce a\nnovel Demmler-Reinsch basis expansion to model the functional effects that\nallows us to orthogonally decompose an effect into its linear and nonlinear\nparts. We show that our representation allows us to consistently estimate both\nparts, as opposed to commonly employed mixed model representations. Equipping\nthe reparameterized regression coefficients with normal beta prime spike and\nslab priors allows us to determine whether a continuous covariate has a linear,\na nonlinear or no effect at all. We provide new theoretical results for the\nprior and a compelling explanation for its superior Markov chain Monte Carlo\nmixing performance compared to the spike-and-slab group lasso. We establish an\nefficient posterior estimation scheme and illustrate our approach via effect\nselection on the hazard rate of a time-to-event response in the geoadditive Cox\nregression model in simulations and data on survival with leukemia."}, "http://arxiv.org/abs/2205.00605": {"title": "Cluster-based Regression using Variational Inference and Applications in Financial Forecasting", "link": "http://arxiv.org/abs/2205.00605", "description": "This paper describes an approach to simultaneously identify clusters and\nestimate cluster-specific regression parameters from the given data. Such an\napproach can be useful in learning the relationship between input and output\nwhen the regression parameters for estimating output are different in different\nregions of the input space. Variational Inference (VI), a machine learning\napproach to obtain posterior probability densities using optimization\ntechniques, is used to identify clusters of explanatory variables and\nregression parameters for each cluster. From these results, one can obtain both\nthe expected value and the full distribution of predicted output. Other\nadvantages of the proposed approach include the elegant theoretical solution\nand clear interpretability of results. The proposed approach is well-suited for\nfinancial forecasting where markets have different regimes (or clusters) with\ndifferent patterns and correlations of market changes in each regime. In\nfinancial applications, knowledge about such clusters can provide useful\ninsights about portfolio performance and identify the relative importance of\nvariables in different market regimes. An illustrative example of predicting\none-day S&P change is considered to illustrate the approach and compare the\nperformance of the proposed approach with standard regression without clusters.\nDue to the broad applicability of the problem, its elegant theoretical\nsolution, and the computational efficiency of the proposed algorithm, the\napproach may be useful in a number of areas extending beyond the financial\ndomain."}, "http://arxiv.org/abs/2209.02008": {"title": "Parallel sampling of decomposable graphs using Markov chain on junction trees", "link": "http://arxiv.org/abs/2209.02008", "description": "Bayesian inference for undirected graphical models is mostly restricted to\nthe class of decomposable graphs, as they enjoy a rich set of properties making\nthem amenable to high-dimensional problems. While parameter inference is\nstraightforward in this setup, inferring the underlying graph is a challenge\ndriven by the computational difficulty in exploring the space of decomposable\ngraphs. This work makes two contributions to address this problem. First, we\nprovide necessary and sufficient conditions for when multi-edge perturbations\nmaintain decomposability of the graph. 
Using these, we characterize a simple\nclass of partitions that efficiently classify all edge perturbations by whether\nthey maintain decomposability. Second, we propose a novel parallel\nnon-reversible Markov chain Monte Carlo sampler for distributions over junction\ntree representations of the graph. At every step, the parallel sampler\nsimultaneously executes all edge perturbations within a partition. Through simulations,\nwe demonstrate the efficiency of our new edge perturbation conditions and class\nof partitions. We find that our parallel sampler yields improved mixing\nproperties in comparison to the single-move variant, and outperforms current\nstate-of-the-art methods in terms of accuracy and computational efficiency.\nThe implementation of our work is available in the Python package parallelDG."}, "http://arxiv.org/abs/2302.00293": {"title": "A Survey of Methods, Challenges and Perspectives in Causality", "link": "http://arxiv.org/abs/2302.00293", "description": "Deep Learning models have shown success in a large variety of tasks by\nextracting correlation patterns from high-dimensional data but still struggle\nwhen generalizing out of their initial distribution. As causal engines aim to\nlearn mechanisms independent from a data distribution, combining Deep Learning\nwith Causality can have a great impact on the two fields. In this paper, we\nfurther motivate this assumption. We perform an extensive overview of the\ntheories and methods for Causality from different perspectives, with an\nemphasis on Deep Learning and the challenges met by the two domains. We show\nearly attempts to bring the fields together and the possible perspectives for\nthe future. We finish by providing a large variety of applications for\ntechniques from Causality."}, "http://arxiv.org/abs/2304.14954": {"title": "A Class of Dependent Random Distributions Based on Atom Skipping", "link": "http://arxiv.org/abs/2304.14954", "description": "We propose the Plaid Atoms Model (PAM), a novel Bayesian nonparametric model\nfor grouped data. Founded on an idea of `atom skipping', PAM is part of a\nwell-established category of models that generate dependent random\ndistributions and clusters across multiple groups. Atom skipping refers to\nstochastically assigning 0 weights to atoms in an infinite mixture. Deploying\natom skipping across groups, PAM produces a dependent clustering pattern with\noverlapping and non-overlapping clusters across groups. As a result,\ninterpretable posterior inference is possible, such as reporting the posterior\nprobability of a cluster being exclusive to a single group or shared among a\nsubset of groups. We discuss the theoretical properties of the proposed and\nrelated models. Minor extensions of the proposed model for multivariate or\ncount data are presented. Simulation studies and applications using real-world\ndatasets illustrate the performance of the new models with comparison to\nexisting models."}, "http://arxiv.org/abs/2305.09126": {"title": "Transfer Learning for Causal Effect Estimation", "link": "http://arxiv.org/abs/2305.09126", "description": "We present a Transfer Causal Learning (TCL) framework when target and source\ndomains share the same covariate/feature spaces, aiming to improve causal\neffect estimation accuracy in limited data. Limited data is very common in\nmedical applications, where some rare medical conditions, such as sepsis, are\nof interest. 
Our proposed method, named \\texttt{$\\ell_1$-TCL}, incorporates\n$\\ell_1$ regularized TL for nuisance models (e.g., propensity score model); the\nTL estimator of the nuisance parameters is plugged into downstream average\ncausal/treatment effect estimators (e.g., inverse probability weighted\nestimator). We establish non-asymptotic recovery guarantees for the\n\\texttt{$\\ell_1$-TCL} with a generalized linear model (GLM) under the sparsity\nassumption in the high-dimensional setting, and demonstrate the empirical\nbenefits of \\texttt{$\\ell_1$-TCL} through extensive numerical simulation for\nGLM and recent neural network nuisance models. Our method is subsequently\nextended to real data and generates meaningful insights consistent with medical\nliterature, a case where all baseline methods fail."}, "http://arxiv.org/abs/2305.12789": {"title": "The Decaying Missing-at-Random Framework: Doubly Robust Causal Inference with Partially Labeled Data", "link": "http://arxiv.org/abs/2305.12789", "description": "In real-world scenarios, data collection limitations often result in\npartially labeled datasets, leading to difficulties in drawing reliable causal\ninferences. Traditional approaches in the semi-supervised (SS) and missing data\nliterature may not adequately handle these complexities, leading to biased\nestimates. To address these challenges, our paper introduces a novel decaying\nmissing-at-random (decaying MAR) framework. This framework tackles missing\noutcomes in high-dimensional settings and accounts for selection bias arising\nfrom the dependence of labeling probability on covariates. Notably, we relax\nthe need for a positivity condition, commonly required in the missing data\nliterature, and allow uniform decay of labeling propensity scores with sample\nsize, accommodating faster growth of unlabeled data. Our decaying MAR framework\nenables easy rate double-robust (DR) estimation of average treatment effects,\nsucceeding where other methods fail, even with correctly specified nuisance\nmodels. Additionally, it facilitates asymptotic normality under model\nmisspecification. To achieve this, we propose adaptive new targeted\nbias-reducing nuisance estimators and asymmetric cross-fitting, along with a\nnovel semi-parametric approach that fully leverages large volumes of unlabeled\ndata. Our approach requires weak sparsity conditions. Numerical results confirm\nour estimators' efficacy and versatility, addressing selection bias and model\nmisspecification."}, "http://arxiv.org/abs/2401.00872": {"title": "On discriminating between Libby-Novick generalized beta and Kumaraswamy distributions: theory and methods", "link": "http://arxiv.org/abs/2401.00872", "description": "In fitting continuous bounded data, the generalized beta (and several\nvariants of this distribution) and the two-parameter Kumaraswamy (KW)\ndistributions are the two most prominent univariate continuous distributions\nthat come to mind. There are some common features between these two rival\nprobability models, and selecting one of them in a practical situation can be of\ngreat interest. Consequently, in this paper, we discuss various methods of\nselection between the generalized beta proposed by Libby and Novick (1982)\n(LNGB) and the KW distributions, such as criteria based on the probability of\ncorrect selection, which is an improvement over the likelihood ratio statistic\napproach, and criteria based on pseudo-distance measures. 
We obtain an\napproximation for the probability of correct selection under the hypotheses\n$H_{\\mathrm{LNGB}}$ and $H_{\\mathrm{KW}}$, and select the model that maximizes it. However, our proposal\nis more appealing in the sense that we provide the comparison study for the\nLNGB distribution that subsumes both types of classical beta and exponentiated\ngenerators (see, for details, Cordeiro et al. 2014; Libby and Novick 1982)\nwhich can be a natural competitor of a two-parameter KW distribution in an\nappropriate scenario."}, "http://arxiv.org/abs/2401.00945": {"title": "A review of Monte Carlo-based versions of the EM algorithm", "link": "http://arxiv.org/abs/2401.00945", "description": "The EM algorithm is a powerful tool for maximum likelihood estimation with\nmissing data. In practice, the calculations required for the EM algorithm are\noften intractable. We review numerous methods to circumvent this\nintractability, all of which are based on Monte Carlo simulation. We focus our\nattention on the Monte Carlo EM (MCEM) algorithm and its various\nimplementations. We also discuss some related methods like stochastic\napproximation and Monte Carlo maximum likelihood. Generating the Monte Carlo\nsamples necessary for these methods is, in general, a hard problem. As such, we\nreview several simulation strategies which can be used to address this\nchallenge.\n\nGiven the wide range of methods available for approximating the EM, it can be\nchallenging to select which one to use. We review numerous comparisons between\nthese methods from a wide range of sources, and offer guidance on synthesizing\nthe findings. Finally, we give some directions for future research to fill\nimportant gaps in the existing literature on the MCEM algorithm and related\nmethods."}, "http://arxiv.org/abs/2401.00987": {"title": "Inverting estimating equations for causal inference on quantiles", "link": "http://arxiv.org/abs/2401.00987", "description": "The causal inference literature frequently focuses on estimating the mean of\nthe potential outcome, whereas the quantiles of the potential outcome may carry\nimportant additional information. We propose a universal approach, based on the\ninverse estimating equations, to generalize a wide class of causal inference\nsolutions from estimating the mean of the potential outcome to its quantiles.\nWe assume that an identifying moment function is available to identify the mean\nof the threshold-transformed potential outcome, based on which a convenient\nconstruction of the estimating equation of quantiles of potential outcome is\nproposed. In addition, we also give a general construction of the efficient\ninfluence functions of the mean and quantiles of potential outcomes, and\nidentify their connection. We motivate estimators for the quantile estimands\nwith the efficient influence function, and develop their asymptotic properties\nwhen either parametric models or data-adaptive machine learners are used to\nestimate the nuisance functions. A broad implication of our results is that one\ncan rework the existing result for mean causal estimands to facilitate causal\ninference on quantiles, rather than starting from scratch. 
Our results are\nillustrated by several examples."}, "http://arxiv.org/abs/2401.01234": {"title": "Mixture cure semiparametric additive hazard models under partly interval censoring -- a penalized likelihood approach", "link": "http://arxiv.org/abs/2401.01234", "description": "Survival analysis can sometimes involve individuals who will not experience\nthe event of interest, forming what is known as the cured group. Identifying\nsuch individuals is not always possible beforehand, as they provide only\nright-censored data. Ignoring the presence of the cured group can introduce\nbias in the final model. This paper presents a method for estimating a\nsemiparametric additive hazards model that accounts for the cured fraction.\nUnlike regression coefficients in a hazard ratio model, those in an additive\nhazard model measure hazard differences. The proposed method uses a primal-dual\ninterior point algorithm to obtain constrained maximum penalized likelihood\nestimates of the model parameters, including the regression coefficients and\nthe baseline hazard, subject to certain non-negativity constraints."}, "http://arxiv.org/abs/2401.01264": {"title": "Multiple Randomization Designs: Estimation and Inference with Interference", "link": "http://arxiv.org/abs/2401.01264", "description": "Classical designs of randomized experiments, going back to Fisher and Neyman\nin the 1930s still dominate practice even in online experimentation. However,\nsuch designs are of limited value for answering standard questions in settings,\ncommon in marketplaces, where multiple populations of agents interact\nstrategically, leading to complex patterns of spillover effects. In this paper,\nwe discuss new experimental designs and corresponding estimands to account for\nand capture these complex spillovers. We derive the finite-sample properties of\ntractable estimators for main effects, direct effects, and spillovers, and\npresent associated central limit theorems."}, "http://arxiv.org/abs/2401.01294": {"title": "Efficient Sparse Least Absolute Deviation Regression with Differential Privacy", "link": "http://arxiv.org/abs/2401.01294", "description": "In recent years, privacy-preserving machine learning algorithms have\nattracted increasing attention because of their important applications in many\nscientific fields. However, in the literature, most privacy-preserving\nalgorithms demand learning objectives to be strongly convex and Lipschitz\nsmooth, which thus cannot cover a wide class of robust loss functions (e.g.,\nquantile/least absolute loss). In this work, we aim to develop a fast\nprivacy-preserving learning solution for a sparse robust regression problem.\nOur learning loss consists of a robust least absolute loss and an $\\ell_1$\nsparse penalty term. To fast solve the non-smooth loss under a given privacy\nbudget, we develop a Fast Robust And Privacy-Preserving Estimation (FRAPPE)\nalgorithm for least absolute deviation regression. Our algorithm achieves a\nfast estimation by reformulating the sparse LAD problem as a penalized least\nsquare estimation problem and adopts a three-stage noise injection to guarantee\nthe $(\\epsilon,\\delta)$-differential privacy. We show that our algorithm can\nachieve better privacy and statistical accuracy trade-off compared with the\nstate-of-the-art privacy-preserving regression algorithms. 
In the end, we\nconduct experiments to verify the efficiency of our proposed FRAPPE algorithm."}, "http://arxiv.org/abs/2112.07145": {"title": "Linear Discriminant Analysis with High-dimensional Mixed Variables", "link": "http://arxiv.org/abs/2112.07145", "description": "Datasets containing both categorical and continuous variables are frequently\nencountered in many areas, and with the rapid development of modern measurement\ntechnologies, the dimensions of these variables can be very high. Despite the\nrecent progress made in modelling high-dimensional data for continuous\nvariables, there is a scarcity of methods that can deal with a mixed set of\nvariables. To fill this gap, this paper develops a novel approach for\nclassifying high-dimensional observations with mixed variables. Our framework\nbuilds on a location model, in which the distributions of the continuous\nvariables conditional on categorical ones are assumed Gaussian. We overcome the\nchallenge of having to split data into exponentially many cells, or\ncombinations of the categorical variables, by kernel smoothing, and provide new\nperspectives for its bandwidth choice to ensure an analogue of Bochner's Lemma,\nwhich is different to the usual bias-variance tradeoff. We show that the two\nsets of parameters in our model can be separately estimated and provide\npenalized likelihood for their estimation. Results on the estimation accuracy\nand the misclassification rates are established, and the competitive\nperformance of the proposed classifier is illustrated by extensive simulation\nand real data studies."}, "http://arxiv.org/abs/2206.02164": {"title": "Estimating and Mitigating the Congestion Effect of Curbside Pick-ups and Drop-offs: A Causal Inference Approach", "link": "http://arxiv.org/abs/2206.02164", "description": "Curb space is one of the busiest areas in urban road networks. Especially in\nrecent years, the rapid increase of ride-hailing trips and commercial\ndeliveries has induced massive pick-ups/drop-offs (PUDOs), which occupy the\nlimited curb space that was designed and built decades ago. These PUDOs could\njam curbside utilization and disturb the mainline traffic flow, evidently\nleading to significant negative societal externalities. However, there is a\nlack of an analytical framework that rigorously quantifies and mitigates the\ncongestion effect of PUDOs in the system view, particularly with little data\nsupport and involvement of confounding effects. To bridge this research gap,\nthis paper develops a rigorous causal inference approach to estimate the\ncongestion effect of PUDOs on general regional networks. A causal graph is set\nto represent the spatio-temporal relationship between PUDOs and traffic speed,\nand a double and separated machine learning (DSML) method is proposed to\nquantify how PUDOs affect traffic congestion. Additionally, a re-routing\nformulation is developed and solved to encourage passenger walking and traffic\nflow re-routing to achieve system optimization. Numerical experiments are\nconducted using real-world data in the Manhattan area. On average, 100\nadditional units of PUDOs in a region could reduce the traffic speed by 3.70\nand 4.54 mph on weekdays and weekends, respectively. Re-routing trips with\nPUDOs on curb space could respectively reduce the system-wide total travel time\nby 2.44% and 2.12% in Midtown and Central Park on weekdays. 
Sensitivity\nanalysis is also conducted to demonstrate the effectiveness and robustness of\nthe proposed framework."}, "http://arxiv.org/abs/2208.00124": {"title": "Physical Parameter Calibration", "link": "http://arxiv.org/abs/2208.00124", "description": "Computer simulation models are widely used to study complex physical systems.\nA related fundamental topic is the inverse problem, also called calibration,\nwhich aims at learning about the values of parameters in the model based on\nobservations. In most real applications, the parameters have specific physical\nmeanings, and we call them physical parameters. To recognize the true\nunderlying physical system, we need to effectively estimate such parameters.\nHowever, existing calibration methods cannot do this well due to the model\nidentifiability problem. This paper proposes a semi-parametric model, called\nthe discrepancy decomposition model, to describe the discrepancy between the\nphysical system and the computer model. The proposed model possesses a clear\ninterpretation, and more importantly, it is identifiable under mild conditions.\nUnder this model, we present estimators of the physical parameters and the\ndiscrepancy, and then establish their asymptotic properties. Numerical examples\nshow that the proposed method can better estimate the physical parameters than\nexisting methods."}, "http://arxiv.org/abs/2302.01974": {"title": "Conic Sparsity: Estimation of Regression Parameters in Closed Convex Polyhedral Cones", "link": "http://arxiv.org/abs/2302.01974", "description": "Statistical problems often involve linear equality and inequality constraints\non model parameters. Direct estimation of parameters restricted to general\npolyhedral cones, particularly when one is interested in estimating low\ndimensional features, may be challenging. We use a dual form parameterization\nto characterize parameter vectors restricted to lower dimensional faces of\npolyhedral cones and use the characterization to define a notion of 'sparsity'\non such cones. We show that the proposed notion agrees with the usual notion of\nsparsity in the unrestricted case and prove the validity of the proposed\ndefinition as a measure of sparsity. The identifiable parameterization of the\nlower dimensional faces allows a generalization of popular spike-and-slab\npriors to a closed convex polyhedral cone. The prior measure utilizes the\ngeometry of the cone by defining a Markov random field over the adjacency graph\nof the extreme rays of the cone. We describe an efficient way of computing the\nposterior of the parameters in the restricted case. We illustrate the\nusefulness of the proposed methodology for imposing linear equality and\ninequality constraints by using wearables data from the National Health and\nNutrition Examination Survey (NHANES) actigraph study where the daily average\nactivity profiles of participants exhibit patterns that seem to obey such\nconstraints."}, "http://arxiv.org/abs/2401.01426": {"title": "Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference", "link": "http://arxiv.org/abs/2401.01426", "description": "Pearl's causal hierarchy establishes a clear separation between\nobservational, interventional, and counterfactual questions. Researchers\nproposed sound and complete algorithms to compute identifiable causal queries\nat a given level of the hierarchy using the causal structure and data from the\nlower levels of the hierarchy. 
However, most of these algorithms assume that we\ncan accurately estimate the probability distribution of the data, which is an\nimpractical assumption for high-dimensional variables such as images. On the\nother hand, modern generative deep learning architectures can be trained to\nlearn how to accurately sample from such high-dimensional distributions.\nEspecially with the recent rise of foundation models for images, it is\ndesirable to leverage pre-trained models to answer causal queries with such\nhigh-dimensional data. To address this, we propose a sequential training\nalgorithm that, given the causal structure and a pre-trained conditional\ngenerative model, can train a deep causal generative model, which utilizes the\npre-trained model and can provably sample from identifiable interventional and\ncounterfactual distributions. Our algorithm, called Modular-DCM, uses\nadversarial training to learn the network weights, and to the best of our\nknowledge, is the first algorithm that can make use of pre-trained models and\nprovably sample from any identifiable causal query in the presence of latent\nconfounders with high-dimensional data. We demonstrate the utility of our\nalgorithm using semi-synthetic and real-world datasets containing images as\nvariables in the causal structure."}, "http://arxiv.org/abs/2401.01500": {"title": "Log-concave Density Estimation with Independent Components", "link": "http://arxiv.org/abs/2401.01500", "description": "We propose a method for estimating a log-concave density on $\\mathbb R^d$\nfrom samples, under the assumption that there exists an orthogonal\ntransformation that makes the components of the random vector independent.\nWhile log-concave density estimation is hard both computationally and\nstatistically, the independent components assumption alleviates both issues,\nwhile still maintaining a large non-parametric class. We prove that under mild\nconditions, at most $\\tilde{\\mathcal{O}}(\\epsilon^{-4})$ samples (suppressing\nconstants and log factors) suffice for our proposed estimator to be within\n$\\epsilon$ of the original density in squared Hellinger distance. On the\ncomputational front, while the usual log-concave maximum likelihood estimate\ncan be obtained via a finite-dimensional convex program, it is slow to compute\n-- especially in higher dimensions. We demonstrate through numerical\nexperiments that our estimator can be computed efficiently, making it more\npractical to use."}, "http://arxiv.org/abs/2401.01665": {"title": "A Wild Bootstrap Procedure for the Identification of Optimal Groups in Singular Spectrum Analysis", "link": "http://arxiv.org/abs/2401.01665", "description": "A key step in separating signal from noise using Singular Spectrum Analysis\n(SSA) is grouping, which is often done subjectively. In this article, a method\nthat enables the identification of statistically significant groups for the\ngrouping step in SSA is presented. The proposed procedure provides a more\nobjective and reliable approach for separating noise from the main signal in\nSSA. We utilize the w-correlation and test whether it is close or equal to zero. A\nwild bootstrap approach is used to determine the distribution of the\nw-correlation. To identify an ideal number of groupings that leads to almost\nperfect separation of the noise and signal, a given number of groups is\ntested, which necessitates accounting for multiplicity. 
The effectiveness of our\nmethod in identifying the best group is demonstrated through a simulation\nstudy; furthermore, we have applied the approach to real-world data in the\ncontext of neuroimaging. This research provides a valuable contribution to the\nfield of SSA and offers important insights into the statistical properties of\nthe w-correlation distribution. The results obtained from the simulation\nstudies and analysis of real-world data demonstrate the effectiveness of the\nproposed approach in identifying the best groupings for SSA."}, "http://arxiv.org/abs/2401.01713": {"title": "Multiple testing of interval composite null hypotheses using randomized p-values", "link": "http://arxiv.org/abs/2401.01713", "description": "One class of statistical hypothesis testing procedures is the indisputable\nequivalence tests, whose main objective is to establish practical equivalence\nrather than the usual statistically significant difference. These hypothesis\ntests are common in bioequivalence studies, where one would wish to show that,\nfor example, an existing drug and a new one under development have the same\ntherapeutic effect. In this article, we consider a two-stage randomized (RAND2)\np-value utilizing the uniformly most powerful (UMP) p-value in the first stage\nwhen multiple two-one-sided hypotheses are of interest. We investigate the\nbehavior of the distribution functions of the two p-values when there are\nchanges in the boundaries of the null or alternative hypothesis or when the\nchosen parameters are too close to these boundaries. We also consider the\nbehavior of the power functions with respect to an increase in sample size. Specifically, we\ninvestigate the level of conservativity to the sample sizes to see if we\ncontrol the type I error rate when using either of the two p-values for any\nsample size. In multiple tests, we evaluate the performance of the two p-values\nin estimating the proportion of true null hypotheses. We conduct a family-wise\nerror rate control using an adaptive Bonferroni procedure with a plug-in\nestimator to account for the multiplicity that arises from the multiple\nhypotheses under consideration. We verify the various claims in this research\nusing a simulation study and real-world data analysis."}, "http://arxiv.org/abs/2401.01806": {"title": "A complex meta-regression model to identify effective features of interventions from multi-arm, multi-follow-up trials", "link": "http://arxiv.org/abs/2401.01806", "description": "Network meta-analysis (NMA) combines evidence from multiple trials to compare\nthe effectiveness of a set of interventions. In public health research,\ninterventions are often complex, made up of multiple components or features.\nThis makes it difficult to define a common set of interventions on which to\nperform the analysis. One approach to this problem is component network\nmeta-analysis (CNMA), which uses a meta-regression framework to define each\nintervention as a subset of components whose individual effects combine\nadditively. In this paper, we are motivated by a systematic review of complex\ninterventions to prevent obesity in children. Due to considerable heterogeneity\nacross the trials, these interventions cannot be expressed as a subset of\ncomponents but instead are coded against a framework of characteristic\nfeatures. To analyse these data, we develop a bespoke CNMA-inspired model that\nallows us to identify the most important features of interventions. 
We define a\nmeta-regression model with covariates on three levels: intervention, study, and\nfollow-up time, as well as flexible interaction terms. By specifying different\nregression structures for trials with and without a control arm, we relax the\nassumption from previous CNMA models that a control arm is the absence of\nintervention components. Furthermore, we derive a correlation structure that\naccounts for trials with multiple intervention arms and multiple follow-up\ntimes. Although our model was developed for the specifics of the obesity data\nset, it has wider applicability to any set of complex interventions that can be\ncoded according to a set of shared features."}, "http://arxiv.org/abs/2401.01833": {"title": "Credible Distributions of Overall Ranking of Entities", "link": "http://arxiv.org/abs/2401.01833", "description": "Inference on overall ranking of a set of entities, such as chess players,\nsubpopulations or hospitals, is an important problem. Estimation of ranks based\non point estimates of means does not account for the uncertainty in those\nestimates. Treating estimated ranks without regard for uncertainty is\nproblematic. We propose a Bayesian solution. It is competitive with recent\nfrequentist methods, more effective and informative, and as easy to\nimplement as it is to compute the posterior means and variances of the entity\nmeans. Using credible sets, we created novel credible distributions for the\nrank vector of the entities. We evaluate the Bayesian procedure in terms of\naccuracy and stability in two applications and a simulation study. Frequentist\napproaches cannot take account of covariates, but the Bayesian method handles\nthem easily."}, "http://arxiv.org/abs/2401.01872": {"title": "Multiple Imputation of Hierarchical Nonlinear Time Series Data with an Application to School Enrollment Data", "link": "http://arxiv.org/abs/2401.01872", "description": "International comparisons of hierarchical time series data sets based on\nsurvey data, such as annual country-level estimates of school enrollment rates,\ncan suffer from large amounts of missing data due to differing coverage of\nsurveys across countries and across times. A popular approach to handling\nmissing data in these settings is through multiple imputation, which can be\nespecially effective when there is an auxiliary variable that is strongly\npredictive of and has a smaller amount of missing data than the variable of\ninterest. However, standard methods for multiple imputation of hierarchical\ntime series data can perform poorly when the auxiliary variable and the\nvariable of interest have a nonlinear relationship. Performance of standard\nmultiple imputation methods can also suffer if the substantive analysis model\nof interest is uncongenial to the imputation model, which can be a common\noccurrence for social science data if the imputation phase is conducted\nindependently of the analysis phase. We propose a Bayesian method for multiple\nimputation of hierarchical nonlinear time series data that uses a sequential\ndecomposition of the joint distribution and incorporates smoothing splines to\naccount for nonlinear relationships between variables. We compare the proposed\nmethod with existing multiple imputation methods through a simulation study and\nan application to secondary school enrollment data. 
We find that the proposed\nmethod can lead to substantial performance increases for estimation of\nparameters in uncongenial analysis models and for prediction of individual\nmissing values."}, "http://arxiv.org/abs/2208.10962": {"title": "Prediction of good reaction coordinates and future evolution of MD trajectories using Regularized Sparse Autoencoders: A novel deep learning approach", "link": "http://arxiv.org/abs/2208.10962", "description": "Identifying reaction coordinates (RCs) is an active area of research, given\nthe crucial role RCs play in determining the progress of a chemical reaction.\nThe choice of the reaction coordinate is often based on heuristic knowledge.\nHowever, an essential criterion for the choice is that the coordinate should\ncapture both the reactant and product states unequivocally. Also, the\ncoordinate should be the slowest one so that all the other degrees of freedom\ncan easily equilibrate along the reaction coordinate. We used a regularised sparse\nautoencoder, an energy-based model, to discover a crucial set of reaction\ncoordinates. Along with discovering reaction coordinates, our model also\npredicts the evolution of a molecular dynamics (MD) trajectory. We showcased\nthat including sparsity enforcing regularisation helps in choosing a small but\nimportant set of reaction coordinates. We used two model systems to demonstrate\nour approach: the alanine dipeptide system and the proflavine-DNA system, which\nexhibited intercalation of proflavine into the DNA minor groove in an aqueous\nenvironment. We model the MD trajectory as a multivariate time series, and our\nlatent variable model performs the task of multi-step time series prediction.\nThis idea is inspired by the popular sparse coding approach - to represent each\ninput sample as a linear combination of a few elements taken from a set of\nrepresentative patterns."}, "http://arxiv.org/abs/2308.01704": {"title": "Similarity-based Random Partition Distribution for Clustering Functional Data", "link": "http://arxiv.org/abs/2308.01704", "description": "Random partition distribution is a crucial tool for model-based clustering.\nThis study advances the field of random partition in the context of functional\nspatial data, focusing on the challenges posed by hourly population data across\nvarious regions and dates. We propose an extended generalized Dirichlet\nprocess, named the similarity-based generalized Dirichlet process (SGDP), to\naddress the limitations of simple random partition distributions (e.g., those\ninduced by the Dirichlet process), such as an overabundance of clusters. This\nmodel avoids producing excess clusters and incorporates pairwise\nsimilarity information to ensure a more accurate and meaningful grouping. The\ntheoretical properties of SGDP are studied. Then, SGDP is applied to a\nreal-world dataset of hourly population flows in 500$\\rm{m}^2$ meshes in the\ncentral part of Tokyo. In this empirical context, SGDP excelled at detecting\nmeaningful patterns in the data while accounting for spatial nuances. The\nresults underscore the adaptability and utility of the method, showcasing its\nprowess in revealing intricate spatiotemporal dynamics. 
This study's findings\ncontribute significantly to urban planning, transportation, and policy-making\nby providing a helpful tool for understanding population dynamics and their\nimplications."}, "http://arxiv.org/abs/2401.01949": {"title": "Adjacency Matrix Decomposition Clustering for Human Activity Data", "link": "http://arxiv.org/abs/2401.01949", "description": "Mobile apps and wearable devices accurately and continuously measure human\nactivity; patterns within this data can provide a wealth of information\napplicable to fields such as transportation and health. Despite the potential\nutility of this data, there has been limited development of analysis methods\nfor sequences of daily activities. In this paper, we propose a novel clustering\nmethod and cluster evaluation metric for human activity data that leverages an\nadjacency matrix representation to cluster the data without the calculation of\na distance matrix. Our technique is substantially faster than conventional\nmethods based on computing pairwise distances via sequence alignment algorithms\nand also enhances interpretability of results. We compare our method to\ndistance-based hierarchical clustering through simulation studies and an\napplication to data collected by Daynamica, an app that turns raw sensor data\ninto a daily summary of a user's activities. Among days that contain a large\nportion of time spent at home, our method distinguishes days that also contain\nfull-time work or multiple hours of travel, while hierarchical clustering\ngroups these days together. We also illustrate how the computational advantage\nof our method enables the analysis of longer sequences by clustering full weeks\nof activity data."}, "http://arxiv.org/abs/2401.01966": {"title": "Ethical considerations for data involving human gender and sex variables", "link": "http://arxiv.org/abs/2401.01966", "description": "The inclusion of human sex and gender data in statistical analysis invokes\nmultiple considerations for data collection, combination, analysis, and\ninterpretation. These considerations are not unique to variables representing\nsex and gender. However, considering the relevance of the ethical practice\nstandards for statistics and data science to sex and gender variables is\ntimely, with results that can be applied to other sociocultural variables.\nHistorically, human gender and sex have been categorized with a binary system.\nThis tradition persists mainly because it is easy, and not because it produces\nthe best scientific information. Binary classification simplifies combinations\nof older and newer data sets. However, this classification system eliminates\nthe ability for respondents to articulate their gender identity, conflates\ngender and sex, and also obscures potentially important differences by\ncollapsing across valid and authentic categories. This approach perpetuates\nhistorical inaccuracy, simplicity, and bias, while also limiting the\ninformation that emerges from analyses of human data. The approach also\nviolates multiple elements in the American Statistical Association (ASA)\nEthical Guidelines for Statistical Practice. Information that would be captured\nwith a nonbinary classification could be relevant to decisions about analysis\nmethods and to decisions based on otherwise expert statistical work.\nStatistical practitioners are increasingly concerned with inconsistent,\nuninformative, and even unethical data collection and analysis practices. 
This\npaper presents a historical introduction to the collection and analysis of\nhuman gender and sex data, offers a critique of a few common survey questioning\nmethods based on alignment with the ASA Ethical Guidelines, and considers the\nscope of ethical considerations for human gender and sex data from design\nthrough analysis and interpretation."}, "http://arxiv.org/abs/2401.01977": {"title": "Conformal causal inference for cluster randomized trials: model-robust inference without asymptotic approximations", "link": "http://arxiv.org/abs/2401.01977", "description": "In the analysis of cluster randomized trials, two typical features are that\nindividuals within a cluster are correlated and that the total number of\nclusters can sometimes be limited. While model-robust treatment effect\nestimators have been recently developed, their asymptotic theory requires the\nnumber of clusters to approach infinity, and one often has to empirically\nassess the applicability of those methods in finite samples. To address this\nchallenge, we propose a conformal causal inference framework that achieves the\ntarget coverage probability of treatment effects in finite samples without the\nneed for asymptotic approximations. Meanwhile, we prove that this framework is\ncompatible with arbitrary working models, including machine learning algorithms\nleveraging baseline covariates, possesses robustness against arbitrary\nmisspecification of working models, and accommodates a variety of\nwithin-cluster correlations. Under this framework, we offer efficient\nalgorithms to make inferences on treatment effects at both the cluster and\nindividual levels, applicable to user-specified covariate subgroups and two\ntypes of test data. Finally, we demonstrate our methods via simulations and a\nreal data application based on a cluster randomized trial for treating chronic\npain."}, "http://arxiv.org/abs/2401.02048": {"title": "Random Effect Restricted Mean Survival Time Model", "link": "http://arxiv.org/abs/2401.02048", "description": "The restricted mean survival time (RMST) model has been garnering attention\nas a way to provide a clinically intuitive measure: the mean survival time.\nRMST models, which use methods based on pseudo time-to-event values and inverse\nprobability censoring weighting, can adjust covariates. However, no approach\nhas yet been introduced that considers random effects for clusters. In this\npaper, we propose a new random-effect RMST. We present two methods of analysis\nthat consider variable effects by i) using a generalized mixed model with\npseudo-values and ii) integrating the estimated results from the inverse\nprobability censoring weighting estimating equations for each cluster. We\nevaluate our proposed methods through computer simulations. In addition, we\nanalyze the effect of a mother's age at birth on under-five deaths in India\nusing states as clusters."}, "http://arxiv.org/abs/2401.02154": {"title": "Disentangle Estimation of Causal Effects from Cross-Silo Data", "link": "http://arxiv.org/abs/2401.02154", "description": "Estimating causal effects among different events is of great importance to\ncritical fields such as drug development. Nevertheless, the data features\nassociated with events may be distributed across various silos and remain\nprivate within respective parties, impeding direct information exchange between\nthem. This, in turn, can result in biased estimations of local causal effects,\nwhich rely on the characteristics of only a subset of the covariates. 
To tackle\nthis challenge, we introduce an innovative disentangle architecture designed to\nfacilitate the seamless cross-silo transmission of model parameters, enriched\nwith causal mechanisms, through a combination of shared and private branches.\nBesides, we introduce global constraints into the equation to effectively\nmitigate bias within the various missing domains, thereby elevating the\naccuracy of our causal effect estimation. Extensive experiments conducted on\nnew semi-synthetic datasets show that our method outperforms state-of-the-art\nbaselines."}, "http://arxiv.org/abs/2401.02387": {"title": "Assessing Time Series Correlation Significance: A Parametric Approach with Application to Physiological Signals", "link": "http://arxiv.org/abs/2401.02387", "description": "Correlation coefficients play a pivotal role in quantifying linear\nrelationships between random variables. Yet, their application to time series\ndata is very challenging due to temporal dependencies. This paper introduces a\nnovel approach to estimate the statistical significance of correlation\ncoefficients in time series data, addressing the limitations of traditional\nmethods based on the concept of effective degrees of freedom (or effective\nsample size, ESS). These effective degrees of freedom represent the independent\nsample size that would yield comparable test statistics under the assumption of\nno temporal correlation. We propose to assume a parametric Gaussian form for\nthe autocorrelation function. We show that this assumption, motivated by a\nLaplace approximation, enables a simple estimator of the ESS that depends only\non the temporal derivatives of the time series. Through numerical experiments,\nwe show that the proposed approach yields accurate statistics while\nsignificantly reducing computational overhead. In addition, we evaluate the\nadequacy of our approach on real physiological signals, for assessing the\nconnectivity measures in electrophysiology and detecting correlated arm\nmovements in motion capture data. Our methodology provides a simple tool for\nresearchers working with time series data, enabling robust hypothesis testing\nin the presence of temporal dependencies."}, "http://arxiv.org/abs/1712.02195": {"title": "Fast approximations in the homogeneous Ising model for use in scene analysis", "link": "http://arxiv.org/abs/1712.02195", "description": "The Ising model is important in statistical modeling and inference in many\napplications, however its normalizing constant, mean number of active vertices\nand mean spin interaction -- quantities needed in inference -- are\ncomputationally intractable. We provide accurate approximations that make it\npossible to numerically calculate these quantities in the homogeneous case.\nSimulation studies indicate good performance of our approximation formulae that\nare scalable and unfazed by the size (number of nodes, degree of graph) of the\nMarkov Random Field. 
The practical import of our approximation formulae is\nillustrated in performing Bayesian inference in a functional Magnetic Resonance\nImaging activation detection experiment, and also in likelihood ratio testing\nfor anisotropy in the spatial patterns of yearly increases in pistachio tree\nyields."}, "http://arxiv.org/abs/2203.12179": {"title": "Targeted Function Balancing", "link": "http://arxiv.org/abs/2203.12179", "description": "This paper introduces Targeted Function Balancing (TFB), a covariate\nbalancing weights framework for estimating the average treatment effect of a\nbinary intervention. TFB first regresses an outcome on covariates, and then\nselects weights that balance functions (of the covariates) that are\nprobabilistically near the resulting regression function. This yields balance\nin the regression function's predicted values and the covariates, with the\nregression function's estimated variance determining how much balance in the\ncovariates is sufficient. Notably, TFB demonstrates that intentionally leaving\nimbalance in some covariates can increase efficiency without introducing bias,\nchallenging traditions that warn against imbalance in any variable.\nAdditionally, TFB is entirely defined by a regression function and its\nestimated variance, turning the problem of how best to balance the covariates\ninto how best to model the outcome. Kernel regularized least squares and the\nLASSO are considered as regression estimators. With the former, TFB contributes\nto the literature of kernel-based weights. As for the LASSO, TFB uses the\nregression function's estimated variance to prioritize balance in certain\ndimensions of the covariates, a feature that can be greatly exploited by\nchoosing a sparse regression estimator. This paper also introduces a balance\ndiagnostic, Targeted Function Imbalance, that may have useful applications."}, "http://arxiv.org/abs/2303.01572": {"title": "Transportability without positivity: a synthesis of statistical and simulation modeling", "link": "http://arxiv.org/abs/2303.01572", "description": "When estimating an effect of an action with a randomized or observational\nstudy, that study is often not a random sample of the desired target\npopulation. Instead, estimates from that study can be transported to the target\npopulation. However, transportability methods generally rely on a positivity\nassumption, such that all relevant covariate patterns in the target population\nare also observed in the study sample. Strict eligibility criteria,\nparticularly in the context of randomized trials, may lead to violations of\nthis assumption. Two common approaches to address positivity violations are\nrestricting the target population and restricting the relevant covariate set.\nAs neither of these restrictions are ideal, we instead propose a synthesis of\nstatistical and simulation models to address positivity violations. We propose\ncorresponding g-computation and inverse probability weighting estimators. The\nrestriction and synthesis approaches to addressing positivity violations are\ncontrasted with a simulation experiment and an illustrative example in the\ncontext of sexually transmitted infection testing uptake. In both cases, the\nproposed synthesis approach accurately addressed the original research question\nwhen paired with a thoughtfully selected simulation model. 
Neither of the\nrestriction approaches were able to accurately address the motivating question.\nAs public health decisions must often be made with imperfect target population\ninformation, model synthesis is a viable approach given a combination of\nempirical data and external information based on the best available knowledge."}, "http://arxiv.org/abs/2401.02529": {"title": "Simulation-based transition density approximation for the inference of SDE models", "link": "http://arxiv.org/abs/2401.02529", "description": "Stochastic Differential Equations (SDEs) serve as a powerful modeling tool in\nvarious scientific domains, including systems science, engineering, and\necological science. While the specific form of SDEs is typically known for a\ngiven problem, certain model parameters remain unknown. Efficiently inferring\nthese unknown parameters based on observations of the state in discrete time\nseries represents a vital practical subject. The challenge arises in nonlinear\nSDEs, where maximum likelihood estimation of parameters is generally unfeasible\ndue to the absence of closed-form expressions for transition and stationary\nprobability density functions of the states. In response to this limitation, we\npropose a novel two-step parameter inference mechanism. This approach involves\na global-search phase followed by a local-refining procedure. The global-search\nphase is dedicated to identifying the domain of high-value likelihood\nfunctions, while the local-refining procedure is specifically designed to\nenhance the surrogate likelihood within this localized domain. Additionally, we\npresent two simulation-based approximations for the transition density, aiming\nto efficiently or accurately approximate the likelihood function. Numerical\nexamples illustrate the efficacy of our proposed methodology in achieving\nposterior parameter estimation."}, "http://arxiv.org/abs/2401.02557": {"title": "Multivariate Functional Clustering with Variable Selection and Application to Sensor Data from Engineering Systems", "link": "http://arxiv.org/abs/2401.02557", "description": "Multi-sensor data that track system operating behaviors are widely available\nnowadays from various engineering systems. Measurements from each sensor over\ntime form a curve and can be viewed as functional data. Clustering of these\nmultivariate functional curves is important for studying the operating patterns\nof systems. One complication in such applications is the possible presence of\nsensors whose data do not contain relevant information. Hence it is desirable\nfor the clustering method to equip with an automatic sensor selection\nprocedure. Motivated by a real engineering application, we propose a functional\ndata clustering method that simultaneously removes noninformative sensors and\ngroups functional curves into clusters using informative sensors. Functional\nprincipal component analysis is used to transform multivariate functional data\ninto a coefficient matrix for data reduction. We then model the transformed\ndata by a Gaussian mixture distribution to perform model-based clustering with\nvariable selection. Three types of penalties, the individual, variable, and\ngroup penalties, are considered to achieve automatic variable selection.\nExtensive simulations are conducted to assess the clustering and variable\nselection performance of the proposed methods. 
The application of the proposed\nmethods to an engineering system with multiple sensors shows the promise of the\nmethods and reveals interesting patterns in the sensor data."}, "http://arxiv.org/abs/2401.02694": {"title": "Nonconvex High-Dimensional Time-Varying Coefficient Estimation for Noisy High-Frequency Observations", "link": "http://arxiv.org/abs/2401.02694", "description": "In this paper, we propose a novel high-dimensional time-varying coefficient\nestimator for noisy high-frequency observations. In high-frequency finance, we\noften observe that noises dominate a signal of an underlying true process.\nThus, we cannot apply usual regression procedures to analyze noisy\nhigh-frequency observations. To handle this issue, we first employ a smoothing\nmethod for the observed variables. However, the smoothed variables still\ncontain non-negligible noises. To manage these non-negligible noises and the\nhigh dimensionality, we propose a nonconvex penalized regression method for\neach local coefficient. This method produces consistent but biased local\ncoefficient estimators. To estimate the integrated coefficients, we propose a\ndebiasing scheme and obtain a debiased integrated coefficient estimator using\ndebiased local coefficient estimators. Then, to further account for the\nsparsity structure of the coefficients, we apply a thresholding scheme to the\ndebiased integrated coefficient estimator. We call this scheme the Thresholded\ndEbiased Nonconvex LASSO (TEN-LASSO) estimator. Furthermore, this paper\nestablishes the concentration properties of the TEN-LASSO estimator and\ndiscusses a nonconvex optimization algorithm."}, "http://arxiv.org/abs/2401.02735": {"title": "Shared active subspace for multivariate vector-valued functions", "link": "http://arxiv.org/abs/2401.02735", "description": "This paper proposes several approaches as baselines to compute a shared\nactive subspace for multivariate vector-valued functions. The goal is to\nminimize the deviation between the function evaluations on the original space\nand those on the reconstructed one. This is done either by manipulating the\ngradients or the symmetric positive (semi-)definite (SPD) matrices computed\nfrom the gradients of each component function so as to get a single structure\ncommon to all component functions. These approaches can be applied to any data\nirrespective of the underlying distribution unlike the existing vector-valued\napproach that is constrained to a normal distribution. We test the\neffectiveness of these methods on five optimization problems. The experiments\nshow that, in general, the SPD-level methods are superior to the gradient-level\nones, and are close to the vector-valued approach in the case of a normal\ndistribution. Interestingly, in most cases it suffices to take the sum of the\nSPD matrices to identify the best shared active subspace."}, "http://arxiv.org/abs/2401.02828": {"title": "Optimal prediction of positive-valued spatial processes: asymmetric power-divergence loss", "link": "http://arxiv.org/abs/2401.02828", "description": "This article studies the use of asymmetric loss functions for the optimal\nprediction of positive-valued spatial processes. We focus on the family of\npower-divergence loss functions due to its many convenient properties, such as\nits continuity, convexity, relationship to well known divergence measures, and\nthe ability to control the asymmetry and behaviour of the loss function via a\npower parameter. 
The properties of power-divergence loss functions, optimal\npower-divergence (OPD) spatial predictors, and related measures of uncertainty\nquantification are examined. In addition, we examine the notion of asymmetry in\nloss functions defined for positive-valued spatial processes and define an\nasymmetry measure that is applied to the power-divergence loss function and\nother common loss functions. The paper concludes with a spatial statistical\nanalysis of zinc measurements in the soil of a floodplain of the Meuse River,\nNetherlands, using OPD spatial prediction."}, "http://arxiv.org/abs/2401.02917": {"title": "Bayesian changepoint detection via logistic regression and the topological analysis of image series", "link": "http://arxiv.org/abs/2401.02917", "description": "We present a Bayesian method for multivariate changepoint detection that\nallows for simultaneous inference on the location of a changepoint and the\ncoefficients of a logistic regression model for distinguishing pre-changepoint\ndata from post-changepoint data. In contrast to many methods for multivariate\nchangepoint detection, the proposed method is applicable to data of mixed type\nand avoids strict assumptions regarding the distribution of the data and the\nnature of the change. The regression coefficients provide an interpretable\ndescription of a potentially complex change. For posterior inference, the model\nadmits a simple Gibbs sampling algorithm based on P\\'olya-gamma data\naugmentation. We establish conditions under which the proposed method is\nguaranteed to recover the true underlying changepoint. As a testing ground for\nour method, we consider the problem of detecting topological changes in time\nseries of images. We demonstrate that the proposed method, combined with a\nnovel topological feature embedding, performs well on both simulated and real\nimage data."}, "http://arxiv.org/abs/2401.02930": {"title": "Dagma-DCE: Interpretable, Non-Parametric Differentiable Causal Discovery", "link": "http://arxiv.org/abs/2401.02930", "description": "We introduce Dagma-DCE, an interpretable and model-agnostic scheme for\ndifferentiable causal discovery. Current non- or over-parametric methods in\ndifferentiable causal discovery use opaque proxies of ``independence'' to\njustify the inclusion or exclusion of a causal relationship. We show\ntheoretically and empirically that these proxies may be arbitrarily different\nthan the actual causal strength. Juxtaposed to existing differentiable causal\ndiscovery algorithms, \\textsc{Dagma-DCE} uses an interpretable measure of\ncausal strength to define weighted adjacency matrices. In a number of simulated\ndatasets, we show our method achieves state-of-the-art level performance. We\nadditionally show that \\textsc{Dagma-DCE} allows for principled thresholding\nand sparsity penalties by domain-experts. The code for our method is available\nopen-source at https://github.com/DanWaxman/DAGMA-DCE, and can easily be\nadapted to arbitrary differentiable models."}, "http://arxiv.org/abs/2401.02939": {"title": "Penalized Distributed Lag Interaction Model: Air Pollution, Birth Weight and Neighborhood Vulnerability", "link": "http://arxiv.org/abs/2401.02939", "description": "Maternal exposure to air pollution during pregnancy has a substantial public\nhealth impact. Epidemiological evidence supports an association between\nmaternal exposure to air pollution and low birth weight. 
A popular method to\nestimate this association while identifying windows of susceptibility is a\ndistributed lag model (DLM), which regresses an outcome onto exposure history\nobserved at multiple time points. However, the standard DLM framework does not\nallow for modification of the association between repeated measures of exposure\nand the outcome. We propose a distributed lag interaction model that allows\nmodification of the exposure-time-response associations across individuals by\nincluding an interaction between a continuous modifying variable and the\nexposure history. Our model framework is an extension of a standard DLM that\nuses a cross-basis, or bi-dimensional function space, to simultaneously\ndescribe both the modification of the exposure-response relationship and the\ntemporal structure of the exposure data. Through simulations, we showed that\nour model with penalization out-performs a standard DLM when the true\nexposure-time-response associations vary by a continuous variable. Using a\nColorado, USA birth cohort, we estimated the association between birth weight\nand ambient fine particulate matter air pollution modified by an area-level\nmetric of health and social adversities from Colorado EnviroScreen."}, "http://arxiv.org/abs/2401.02953": {"title": "Linked factor analysis", "link": "http://arxiv.org/abs/2401.02953", "description": "Factor models are widely used in the analysis of high-dimensional data in\nseveral fields of research. Estimating a factor model, in particular its\ncovariance matrix, from partially observed data vectors is very challenging. In\nthis work, we show that when the data are structurally incomplete, the factor\nmodel likelihood function can be decomposed into the product of the likelihood\nfunctions of multiple partial factor models relative to different subsets of\ndata. If these multiple partial factor models are linked together by common\nparameters, then we can obtain complete maximum likelihood estimates of the\nfactor model parameters and thereby the full covariance matrix. We call this\nframework Linked Factor Analysis (LINFA). LINFA can be used for covariance\nmatrix completion, dimension reduction, data completion, and graphical\ndependence structure recovery. We propose an efficient Expectation-Maximization\nalgorithm for maximum likelihood estimation, accelerated by a novel group\nvertex tessellation (GVT) algorithm which identifies a minimal partition of the\nvertex set to implement an efficient optimization in the maximization steps. We\nillustrate our approach in an extensive simulation study and in the analysis of\ncalcium imaging data collected from mouse visual cortex."}, "http://arxiv.org/abs/2203.12808": {"title": "Robustness Against Weak or Invalid Instruments: Exploring Nonlinear Treatment Models with Machine Learning", "link": "http://arxiv.org/abs/2203.12808", "description": "We discuss causal inference for observational studies with possibly invalid\ninstrumental variables. We propose a novel methodology called two-stage\ncurvature identification (TSCI) by exploring the nonlinear treatment model with\nmachine learning. 
The first-stage machine learning improves the\ninstrumental variable's strength and adjusts for different forms of violation of\nthe instrumental variable assumptions. The success of TSCI requires the\ninstrumental variable's effect on treatment to differ from its violation form.\nA novel bias correction step is implemented to remove bias resulting from the\npotentially high complexity of machine learning. Our proposed \\texttt{TSCI}\nestimator is shown to be asymptotically unbiased and Gaussian even if the\nmachine learning algorithm does not consistently estimate the treatment model.\nFurthermore, we design a data-dependent method to choose the best among several\ncandidate violation forms. We apply TSCI to study the effect of education on\nearnings."}, "http://arxiv.org/abs/2208.14011": {"title": "Robust and Efficient Estimation in Ordinal Response Models using the Density Power Divergence", "link": "http://arxiv.org/abs/2208.14011", "description": "In real life, we frequently come across data sets that involve some\nindependent explanatory variable(s) generating a set of ordinal responses.\nThese ordinal responses may correspond to an underlying continuous latent\nvariable, which is linearly related to the covariate(s), and takes a particular\n(ordinal) label depending on whether this latent variable takes a value in some\nsuitable interval specified by a pair of (unknown) cut-offs. The most efficient\nway of estimating the unknown parameters (i.e., the regression coefficients and\nthe cut-offs) is the method of maximum likelihood (ML). However, contamination\nin the data set either in the form of misspecification of ordinal responses, or\nthe unboundedness of the covariate(s), might destabilize the likelihood\nfunction to such an extent that the ML-based methodology might lead to\ncompletely unreliable inferences. In this paper, we explore a minimum distance\nestimation procedure based on the popular density power divergence (DPD) to\nyield robust parameter estimates for the ordinal response model. This paper\nhighlights how the resulting estimator, namely the minimum DPD estimator\n(MDPDE), can be used as a practical robust alternative to the classical\nprocedures based on the ML. We rigorously develop several theoretical\nproperties of this estimator, and provide extensive simulations to substantiate\nthe theory developed."}, "http://arxiv.org/abs/2303.10018": {"title": "Efficient nonparametric estimation of Toeplitz covariance matrices", "link": "http://arxiv.org/abs/2303.10018", "description": "A new nonparametric estimator for Toeplitz covariance matrices is proposed.\nThis estimator is based on a data transformation that translates the problem of\nToeplitz covariance matrix estimation to the problem of mean estimation in an\napproximate Gaussian regression. The resulting Toeplitz covariance matrix\nestimator is positive definite by construction, fully data-driven and\ncomputationally very fast. Moreover, this estimator is shown to be minimax\noptimal under the spectral norm for a large class of Toeplitz matrices. These\nresults are readily extended to estimation of inverses of Toeplitz covariance\nmatrices. Also, an alternative version of the Whittle likelihood for the\nspectral density based on the Discrete Cosine Transform (DCT) is proposed. 
The\nmethod is implemented in the R package vstdct that accompanies the paper."}, "http://arxiv.org/abs/2308.11260": {"title": "A simplified spatial+ approach to mitigate spatial confounding in multivariate spatial areal models", "link": "http://arxiv.org/abs/2308.11260", "description": "Spatial areal models encounter the well-known and challenging problem of\nspatial confounding. This issue makes it arduous to distinguish between the\nimpacts of observed covariates and spatial random effects. Despite previous\nresearch and various proposed methods to tackle this problem, finding a\ndefinitive solution remains elusive. In this paper, we propose a simplified\nversion of the spatial+ approach that involves dividing the covariate into two\ncomponents. One component captures large-scale spatial dependence, while the\nother accounts for short-scale dependence. This approach eliminates the need to\nseparately fit spatial models for the covariates. We apply this method to\nanalyse two forms of crimes against women, namely rapes and dowry deaths, in\nUttar Pradesh, India, exploring their relationship with socio-demographic\ncovariates. To evaluate the performance of the new approach, we conduct\nextensive simulation studies under different spatial confounding scenarios. The\nresults demonstrate that the proposed method provides reliable estimates of\nfixed effects and posterior correlations between different responses."}, "http://arxiv.org/abs/2308.15596": {"title": "Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes", "link": "http://arxiv.org/abs/2308.15596", "description": "The assessment of regression models with discrete outcomes is challenging and\nhas many fundamental issues. With discrete outcomes, standard regression model\nassessment tools such as Pearson and deviance residuals do not follow the\nconventional reference distribution (normal) under the true model, calling into\nquestion the legitimacy of model assessment based on these tools. To fill this\ngap, we construct a new type of residuals for general discrete outcomes,\nincluding ordinal and count outcomes. The proposed residuals are based on two\nlayers of probability integral transformation. When at least one continuous\ncovariate is available, the proposed residuals closely follow a uniform\ndistribution (or a normal distribution after transformation) under the\ncorrectly specified model. One can construct visualizations such as QQ plots to\ncheck the overall fit of a model straightforwardly, and the shape of QQ plots\ncan further help identify possible causes of misspecification such as\noverdispersion. We provide theoretical justification for the proposed residuals\nby establishing their asymptotic properties. Moreover, in order to assess the\nmean structure and identify potential covariates, we develop an ordered curve\nas a supplementary tool, which is based on the comparison between the partial\nsum of outcomes and of fitted means. Through simulation, we demonstrate\nempirically that the proposed tools outperform commonly used residuals for\nvarious model assessment tasks. We also illustrate the workflow of model\nassessment using the proposed tools in data analysis."}, "http://arxiv.org/abs/2401.03072": {"title": "Optimal Nonparametric Inference on Network Effects with Dependent Edges", "link": "http://arxiv.org/abs/2401.03072", "description": "Testing network effects in weighted directed networks is a foundational\nproblem in econometrics, sociology, and psychology. 
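For the discrete-outcome residuals entry above (arXiv:2308.15596), a closely related and widely used construction is the randomized PIT (randomized quantile) residual; the sketch below applies it to a Poisson fit purely to illustrate PIT-based residuals for discrete data, and is not the paper's double-PIT residual.

    import numpy as np
    from scipy.stats import norm, poisson

    def randomized_pit_residuals(y, mu, rng=None):
        # Randomized PIT residual for Poisson outcomes: u = F(y-1) + V * f(y), V ~ Uniform(0,1).
        # Under a correctly specified model, u is approximately Uniform(0,1),
        # so norm.ppf(u) is approximately standard normal and suits QQ plots.
        rng = np.random.default_rng() if rng is None else rng
        y, mu = np.asarray(y), np.asarray(mu)
        lower = poisson.cdf(y - 1, mu)   # F(y-1); equals 0 when y == 0
        mass = poisson.pmf(y, mu)        # f(y)
        u = np.clip(lower + rng.uniform(size=y.shape) * mass, 1e-12, 1 - 1e-12)
        return norm.ppf(u)

    # Toy check: residuals computed under the true model should look standard normal.
    rng = np.random.default_rng(0)
    mu = rng.uniform(0.5, 5.0, size=1000)
    y = rng.poisson(mu)
    r = randomized_pit_residuals(y, mu, rng)
    print(round(r.mean(), 2), round(r.std(), 2))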
Yet, the prevalent edge\ndependency poses a significant methodological challenge. Most existing methods\nare model-based and come with stringent assumptions, limiting their\napplicability. In response, we introduce a novel, fully nonparametric framework\nthat requires only minimal regularity assumptions. While inspired by recent\ndevelopments in $U$-statistic literature (arXiv:1712.00771, arXiv:2004.06615),\nour approach notably broadens their scope. Specifically, we identify and\ncarefully address the challenge of indeterminate degeneracy in the test\nstatistics, a problem that the aforementioned tools do not handle. We\nestablish a Berry-Esseen-type bound for the accuracy of type-I error rate\ncontrol. Using original analysis, we also prove the minimax optimality of our\ntest's power. Simulations underscore the superiority of our method in\ncomputation speed, accuracy, and numerical robustness compared to competing\nmethods. We also apply our method to the U.S. faculty hiring network data and\ndiscover intriguing findings."}, "http://arxiv.org/abs/2401.03081": {"title": "Shrinkage Estimation and Prediction for Joint Type-II Censored Data from Two Burr-XII Populations", "link": "http://arxiv.org/abs/2401.03081", "description": "The main objective of this paper is to apply linear and pretest shrinkage\nestimation techniques to estimating the parameters of two 2-parameter Burr-XII\ndistributions. Furthermore, predictions for future observations are made using\nboth classical and Bayesian methods within a joint type-II censoring scheme.\nThe efficiency of shrinkage estimates is compared to maximum likelihood and\nBayesian estimates obtained through the expectation-maximization algorithm and\nimportance sampling method, as developed by Akbari Bargoshadi et al. (2023) in\n\"Statistical inference under joint type-II censoring data from two Burr-XII\npopulations\" published in Communications in Statistics-Simulation and\nComputation. For Bayesian estimations, both informative and non-informative\nprior distributions are considered. Additionally, various loss functions\nincluding squared error, linear-exponential, and generalized entropy are taken\ninto account. Approximate confidence, credible, and highest probability density\nintervals are calculated. To evaluate the performance of the estimation\nmethods, a Monte Carlo simulation study is conducted. Additionally, two real\ndatasets are utilized to illustrate the proposed methods."}, "http://arxiv.org/abs/2401.03084": {"title": "Finite sample performance of optimal treatment rule estimators with right-censored outcomes", "link": "http://arxiv.org/abs/2401.03084", "description": "Patient care may be improved by recommending treatments based on patient\ncharacteristics when there is treatment effect heterogeneity. Recently, there\nhas been a great deal of attention focused on the estimation of optimal\ntreatment rules that maximize expected outcomes. However, there has been\ncomparatively less attention given to settings where the outcome is\nright-censored, especially with regard to the practical use of estimators. In\nthis study, simulations were undertaken to assess the finite-sample performance\nof estimators for optimal treatment rules and estimators for the expected\noutcome under treatment rules. 
The simulations were motivated by the common\nsetting in biomedical and public health research where the data is\nobservational, survival times may be right-censored, and there is interest in\nestimating baseline treatment decisions to maximize survival probability. A\nvariety of outcome regression and direct search estimation methods were\ncompared for optimal treatment rule estimation across a range of simulation\nscenarios. Methods that flexibly model the outcome performed comparatively\nwell, including in settings where the treatment rule was non-linear. R code to\nreproduce this study's results is available on GitHub."}, "http://arxiv.org/abs/2401.03106": {"title": "Contrastive linear regression", "link": "http://arxiv.org/abs/2401.03106", "description": "Contrastive dimension reduction methods have been developed for case-control\nstudy data to identify variation that is enriched in the foreground (case) data\nX relative to the background (control) data Y. Here, we develop contrastive\nregression for the setting when there is a response variable r associated with\neach foreground observation. This situation occurs frequently when, for\nexample, the unaffected controls do not have a disease grade or intervention\ndosage but the affected cases have a disease grade or intervention dosage, as\nin autism severity, solid tumor stages, polyp sizes, or warfarin dosages. Our\ncontrastive regression model captures shared low-dimensional variation between\nthe predictors in the cases and control groups, and then explains the\ncase-specific response variables through the variance that remains in the\npredictors after shared variation is removed. We show that, in one\nsingle-nucleus RNA sequencing dataset on autism severity in postmortem brain\nsamples from donors with and without autism and in another single-cell RNA\nsequencing dataset on cellular differentiation in chronic rhinosinusitis with\nand without nasal polyps, our contrastive linear regression performs feature\nranking and identifies biologically-informative predictors associated with\nresponse that cannot be identified using other approaches."}, "http://arxiv.org/abs/2401.03123": {"title": "A least distance estimator for a multivariate regression model using deep neural networks", "link": "http://arxiv.org/abs/2401.03123", "description": "We propose a deep neural network (DNN) based least distance (LD) estimator\n(DNN-LD) for a multivariate regression problem, addressing the limitations of\nthe conventional methods. Due to the flexibility of a DNN structure, both\nlinear and nonlinear conditional mean functions can be easily modeled, and a\nmultivariate regression model can be realized by simply adding extra nodes at\nthe output layer. The proposed method is more efficient in capturing the\ndependency structure among responses than the least squares loss, and robust to\noutliers. In addition, we consider $L_1$-type penalization for variable\nselection, crucial in analyzing high-dimensional data. Namely, we propose what\nwe call the (A)GDNN-LD estimator that enjoys variable selection and model\nestimation simultaneously, by applying the (adaptive) group Lasso penalty to\nweight parameters in the DNN structure. For the computation, we propose a\nquadratic smoothing approximation method to facilitate optimizing the\nnon-smooth objective function based on the least distance loss. 
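To make the least-distance objective and its quadratic smoothing in the arXiv:2401.03123 entry concrete, a minimal sketch follows; the specific smoothing sqrt(||r||^2 + eps^2) is an assumed form used only for illustration and is not necessarily the authors' approximation.

    import numpy as np

    def least_distance_loss(Y, Y_hat):
        # Least distance (LD) loss: sum over observations of the Euclidean norm
        # of the multivariate residual, sum_i ||y_i - yhat_i||.
        R = np.asarray(Y) - np.asarray(Y_hat)
        return np.sum(np.linalg.norm(R, axis=1))

    def smoothed_least_distance_loss(Y, Y_hat, eps=1e-3):
        # Assumed quadratic smoothing: replace ||r_i|| with sqrt(||r_i||^2 + eps^2),
        # which is differentiable everywhere, including at r_i = 0.
        R = np.asarray(Y) - np.asarray(Y_hat)
        return np.sum(np.sqrt(np.sum(R ** 2, axis=1) + eps ** 2))

    # Toy multivariate residuals
    rng = np.random.default_rng(1)
    Y = rng.normal(size=(100, 3))
    Y_hat = Y + 0.1 * rng.normal(size=(100, 3))
    print(least_distance_loss(Y, Y_hat), smoothed_least_distance_loss(Y, Y_hat))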
The simulation\nstudies and a real data analysis demonstrate the promising performance of the\nproposed method."}, "http://arxiv.org/abs/2401.03206": {"title": "A Robbins--Monro Sequence That Can Exploit Prior Information For Faster Convergence", "link": "http://arxiv.org/abs/2401.03206", "description": "We propose a new method to improve the convergence speed of the Robbins-Monro\nalgorithm by introducing prior information about the target point into the\nRobbins-Monro iteration. We achieve the incorporation of prior information\nwithout the need of a -- potentially wrong -- regression model, which would\nalso entail additional constraints. We show that this prior-information\nRobbins-Monro sequence is convergent for a wide range of prior distributions,\neven wrong ones, such as Gaussian, weighted sum of Gaussians, e.g., in a kernel\ndensity estimate, as well as bounded arbitrary distribution functions greater\nthan zero. We furthermore analyse the sequence numerically to understand its\nperformance and the influence of parameters. The results demonstrate that the\nprior-information Robbins-Monro sequence converges faster than the standard\none, especially during the first steps, which are particularly important for\napplications where the number of function measurements is limited, and when the\nnoise of observing the underlying function is large. We finally propose a rule\nto select the parameters of the sequence."}, "http://arxiv.org/abs/2401.03268": {"title": "Adaptive Randomization Methods for Sequential Multiple Assignment Randomized Trials (SMARTs) via Thompson Sampling", "link": "http://arxiv.org/abs/2401.03268", "description": "Response-adaptive randomization (RAR) has been studied extensively in\nconventional, single-stage clinical trials, where it has been shown to yield\nethical and statistical benefits, especially in trials with many treatment\narms. However, RAR and its potential benefits are understudied in sequential\nmultiple assignment randomized trials (SMARTs), which are the gold-standard\ntrial design for evaluation of multi-stage treatment regimes. We propose a\nsuite of RAR algorithms for SMARTs based on Thompson Sampling (TS), a widely\nused RAR method in single-stage trials in which treatment randomization\nprobabilities are aligned with the estimated probability that the treatment is\noptimal. We focus on two common objectives in SMARTs: (i) comparison of the\nregimes embedded in the trial, and (ii) estimation of an optimal embedded\nregime. We develop valid post-study inferential procedures for treatment\nregimes under the proposed algorithms. This is nontrivial, as (even in\nsingle-stage settings) RAR can lead to nonnormal limiting distributions of\nestimators. Our algorithms are the first for RAR in multi-stage trials that\naccount for nonregularity in the estimand. Empirical studies based on\nreal-world SMARTs show that TS can improve in-trial subject outcomes without\nsacrificing efficiency for post-trial comparisons."}, "http://arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "http://arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with\npotential confounding by time trends. 
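For the Thompson-sampling randomization described in the arXiv:2401.03268 entry above, the single-stage, two-arm, binary-outcome version of the idea can be sketched as follows; the Beta(1,1) priors and interim counts are illustrative assumptions, and the multi-stage, regime-level machinery of the paper is not shown.

    import numpy as np

    def thompson_allocation_prob(successes, failures, n_draws=10000, rng=None):
        # Estimate P(arm k has the highest response probability) under Beta(1,1) priors;
        # Thompson sampling uses these probabilities as randomization probabilities.
        rng = np.random.default_rng() if rng is None else rng
        a = 1 + np.asarray(successes, dtype=float)[:, None]
        b = 1 + np.asarray(failures, dtype=float)[:, None]
        draws = rng.beta(a, b, size=(len(successes), n_draws))
        best = np.argmax(draws, axis=0)
        return np.bincount(best, minlength=len(successes)) / n_draws

    # Interim data: arm 0 has 12/30 responses, arm 1 has 18/30 responses.
    probs = thompson_allocation_prob(successes=[12, 18], failures=[18, 12])
    print(probs)  # the next patient is randomized to the arms with these probabilities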
Traditional frequentist methods can fail\nto provide adequate coverage of the intervention's true effect using confidence\nintervals, whereas Bayesian approaches show potential for better coverage of\nintervention effects. However, Bayesian methods have seen limited development\nin SWCRTs. We propose two novel Bayesian hierarchical penalized spline models\nfor SWCRTs. The first model is for SWCRTs involving many clusters and time\nperiods, focusing on immediate intervention effects. To evaluate its efficacy,\nwe compared this model to traditional frequentist methods. We further developed\nthe model to estimate time-varying intervention effects. We conducted a\ncomparative analysis of this Bayesian spline model against an existing Bayesian\nmonotone effect curve model. The proposed models are applied in the Primary\nPalliative Care for Emergency Medicine stepped wedge trial to evaluate the\neffectiveness of primary palliative care intervention. Extensive simulations\nand a real-world application demonstrate the strengths of the proposed Bayesian\nmodels. The Bayesian immediate effect model consistently achieves near the\nfrequentist nominal coverage probability for true intervention effect,\nproviding more reliable interval estimations than traditional frequentist\nmodels, while maintaining high estimation accuracy. The proposed Bayesian\ntime-varying effect model exhibits advancements over the existing Bayesian\nmonotone effect curve model in terms of improved accuracy and reliability. To\nthe best of our knowledge, this is the first development of Bayesian\nhierarchical spline modeling for SWCRTs. The proposed models offer an accurate\nand robust analysis of intervention effects. Their application could lead to\neffective adjustments in intervention strategies."}, "http://arxiv.org/abs/2401.03326": {"title": "Optimal Adaptive SMART Designs with Binary Outcomes", "link": "http://arxiv.org/abs/2401.03326", "description": "In a sequential multiple-assignment randomized trial (SMART), a sequence of\ntreatments is given to a patient over multiple stages. In each stage,\nrandomization may be done to allocate patients to different treatment groups.\nEven though SMART designs are getting popular among clinical researchers, the\nmethodologies for adaptive randomization at different stages of a SMART are few\nand not sophisticated enough to handle the complexity of optimal allocation of\ntreatments at every stage of a trial. Lack of optimal allocation methodologies\ncan raise serious concerns about SMART designs from an ethical point of view.\nIn this work, we develop an optimal adaptive allocation procedure to minimize\nthe expected number of treatment failures for a SMART with a binary primary\noutcome. Issues related to optimal adaptive allocations are explored\ntheoretically with supporting simulations. 
The applicability of the proposed\nmethodology is demonstrated using a recently conducted SMART study named\nM-Bridge for developing universal and resource-efficient dynamic treatment\nregimes (DTRs) for incoming first-year college students as a bridge to\ndesirable treatments to address alcohol-related risks."}, "http://arxiv.org/abs/2401.03338": {"title": "Modelling pathwise uncertainty of Stochastic Differential Equations samplers via Probabilistic Numerics", "link": "http://arxiv.org/abs/2401.03338", "description": "Probabilistic ordinary differential equation (ODE) solvers have been\nintroduced over the past decade as uncertainty-aware numerical integrators.\nThey typically proceed by assuming a functional prior to the ODE solution,\nwhich is then queried on a grid to form a posterior distribution over the ODE\nsolution. As the queries span the integration interval, the approximate\nposterior solution then converges to the true deterministic one. Gaussian ODE\nfilters, in particular, have enjoyed a lot of attention due to their\ncomputational efficiency, the simplicity of their implementation, as well as\ntheir provable fast convergence rates. In this article, we extend the\nmethodology to stochastic differential equations (SDEs) and propose a\nprobabilistic simulator for SDEs. Our approach involves transforming the SDE\ninto a sequence of random ODEs using piecewise differentiable approximations of\nthe Brownian motion. We then apply probabilistic ODE solvers to the individual\nODEs, resulting in a pathwise probabilistic solution to the SDE\\@. We establish\nworst-case strong $1.5$ local and $1.0$ global convergence orders for a\nspecific instance of our method. We further show how we can marginalise the\nBrownian approximations, by incorporating its coefficients as part of the prior\nODE model, allowing for computing exact transition densities under our model.\nFinally, we numerically validate the theoretical findings, showcasing\nreasonable weak convergence properties in the marginalised version."}, "http://arxiv.org/abs/2401.03554": {"title": "False Discovery Rate and Localizing Power", "link": "http://arxiv.org/abs/2401.03554", "description": "False discovery rate (FDR) is commonly used for correction for multiple\ntesting in neuroimaging studies. However, when using two-tailed tests, making\ndirectional inferences about the results can lead to vastly inflated error\nrate, even approaching 100\\% in some cases. This happens because FDR only\nprovides weak control over the error rate, meaning that the proportion of error\nis guaranteed only globally over all tests, not within subsets, such as among\nthose in only one or another direction. Here we consider and evaluate different\nstrategies for FDR control with two-tailed tests, using both synthetic and real\nimaging data. Approaches that separate the tests by direction of the hypothesis\ntest, or by the direction of the resulting test statistic, more properly\ncontrol the directional error rate and preserve FDR benefits, albeit with a\ndoubled risk of errors under complete absence of signal. Strategies that\ncombine tests in both directions, or that use simple two-tailed p-values, can\nlead to invalid directional conclusions, even if these tests remain globally\nvalid. To enable valid thresholding for directional inference, we suggest that\nimaging software should allow the possibility that the user sets asymmetrical\nthresholds for the two sides of the statistical map. 
While FDR continues to be\na valid, powerful procedure for multiple testing correction, care is needed\nwhen making directional inferences for two-tailed tests, or more broadly, when\nmaking any localized inference."}, "http://arxiv.org/abs/2401.03633": {"title": "A Spatial-statistical model to analyse historical rutting data", "link": "http://arxiv.org/abs/2401.03633", "description": "Pavement rutting poses a significant challenge in flexible pavements,\nnecessitating costly asphalt resurfacing. To address this issue\ncomprehensively, we propose an advanced Bayesian hierarchical framework of\nlatent Gaussian models with spatial components. Our model provides a thorough\ndiagnostic analysis, pinpointing areas exhibiting unexpectedly high rutting\nrates. Incorporating spatial and random components, and important explanatory\nvariables like annual average daily traffic (traffic intensity), pavement type,\nrut depth and lane width, our proposed models account for and estimate the\ninfluence of these variables on rutting. This approach not only quantifies\nuncertainties and discerns locations at the highest risk of requiring\nmaintenance, but also uncovers spatial dependencies in rutting\n(millimetre/year). We apply our models to a data set spanning eleven years\n(2010-2020). Our findings emphasize the systematic unexpected spatial rutting\neffect, where more of the rutting variability is explained by including spatial\ncomponents. Pavement type, in conjunction with traffic intensity, is also found\nto be the primary driver of rutting. Furthermore, the spatial dependencies\nuncovered reveal road sections experiencing more than 1 millimeter of rutting\nbeyond annual expectations. This leads to a halving of the expected pavement\nlifespan in these areas. Our study offers valuable insights, presenting maps\nindicating expected rutting, and identifying locations with accelerated rutting\nrates, resulting in a reduction in pavement life expectancy of at least 10\nyears."}, "http://arxiv.org/abs/2401.03649": {"title": "Bayes Factor of Zero Inflated Models under Jeffereys Prior", "link": "http://arxiv.org/abs/2401.03649", "description": "Microbiome omics data including 16S rRNA reveal intriguing dynamic\nassociations between the human microbiome and various disease states. Drastic\nchanges in microbiota can be associated with factors like diet, hormonal\ncycles, diseases, and medical interventions. Along with the identification of\nspecific bacteria taxa associated with diseases, recent advancements give\nevidence that metabolism, genetics, and environmental factors can model these\nmicrobial effects. However, the current analytic methods for integrating\nmicrobiome data are not fully developed to address the main challenges of\nlongitudinal metagenomics data, such as high-dimensionality, intra-sample\ndependence, and zero-inflation of observed counts. Hence, we propose the Bayes\nfactor approach for model selection based on negative binomial, Poisson,\nzero-inflated negative binomial, and zero-inflated Poisson models with\nnon-informative Jeffreys prior. We find that both in simulation studies and\nreal data analysis, our Bayes factor remarkably outperforms the traditional Akaike\ninformation criterion and Vuong's test. 
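One of the strategies discussed in the arXiv:2401.03554 entry above, separating the tests by the sign of the test statistic and applying FDR control within each side, might be sketched as follows; the Benjamini-Hochberg step and the synthetic z-statistics are generic illustrations rather than the authors' exact procedure.

    import numpy as np
    from scipy.stats import norm

    def bh_reject(p, q=0.05):
        # Benjamini-Hochberg step-up procedure; returns a boolean mask of rejections.
        p = np.asarray(p)
        m = p.size
        order = np.argsort(p)
        below = p[order] <= q * np.arange(1, m + 1) / m
        k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject

    def directional_fdr(z, q=0.05):
        # Split the tests by the sign of the statistic and run BH within each side
        # on one-sided p-values, so directional claims are controlled per direction.
        z = np.asarray(z)
        pos, neg = z > 0, z < 0
        reject = np.zeros(z.size, dtype=bool)
        reject[pos] = bh_reject(norm.sf(z[pos]), q)   # right-tail p-values
        reject[neg] = bh_reject(norm.cdf(z[neg]), q)  # left-tail p-values
        return reject

    rng = np.random.default_rng(2)
    z = np.concatenate([rng.normal(3, 1, 50), rng.normal(-3, 1, 50), rng.normal(0, 1, 900)])
    print(directional_fdr(z).sum(), "directional discoveries")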
A new R package BFZINBZIP has been\nintroduced to do simulation study and real data analysis to facilitate Bayesian\nmodel selection based on the Bayes factor."}, "http://arxiv.org/abs/2401.03693": {"title": "TAD-SIE: Sample Size Estimation for Clinical Randomized Controlled Trials using a Trend-Adaptive Design with a Synthetic-Intervention-Based Estimator", "link": "http://arxiv.org/abs/2401.03693", "description": "Phase-3 clinical trials provide the highest level of evidence on drug safety\nand effectiveness needed for market approval by implementing large randomized\ncontrolled trials (RCTs). However, 30-40% of these trials fail mainly because\nsuch studies have inadequate sample sizes, stemming from the inability to\nobtain accurate initial estimates of average treatment effect parameters. To\nremove this obstacle from the drug development cycle, we present a new\nalgorithm called Trend-Adaptive Design with a Synthetic-Intervention-Based\nEstimator (TAD-SIE) that appropriately powers a parallel-group trial, a\nstandard RCT design, by leveraging a state-of-the-art hypothesis testing\nstrategy and a novel trend-adaptive design (TAD). Specifically, TAD-SIE uses\nSECRETS (Subject-Efficient Clinical Randomized Controlled Trials using\nSynthetic Intervention) for hypothesis testing, which simulates a cross-over\ntrial in order to boost power; doing so, makes it easier for a trial to reach\ntarget power within trial constraints (e.g., sample size limits). To estimate\nsample sizes, TAD-SIE implements a new TAD tailored to SECRETS given that\nSECRETS violates assumptions under standard TADs. In addition, our TAD\novercomes the ineffectiveness of standard TADs by allowing sample sizes to be\nincreased across iterations without any condition while controlling\nsignificance level with futility stopping. On a real-world Phase-3 clinical RCT\n(i.e., a two-arm parallel-group superiority trial with an equal number of\nsubjects per arm), TAD-SIE reaches typical target operating points of 80% or\n90% power and 5% significance level in contrast to baseline algorithms that\nonly get at best 59% power and 4% significance level."}, "http://arxiv.org/abs/2401.03820": {"title": "Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices", "link": "http://arxiv.org/abs/2401.03820", "description": "Estimating a covariance matrix and its associated principal components is a\nfundamental problem in contemporary statistics. While optimal estimation\nprocedures have been developed with well-understood properties, the increasing\ndemand for privacy preservation introduces new complexities to this classical\nproblem. In this paper, we study optimal differentially private Principal\nComponent Analysis (PCA) and covariance estimation within the spiked covariance\nmodel.\n\nWe precisely characterize the sensitivity of eigenvalues and eigenvectors\nunder this model and establish the minimax rates of convergence for estimating\nboth the principal components and covariance matrix. These rates hold up to\nlogarithmic factors and encompass general Schatten norms, including spectral\nnorm, Frobenius norm, and nuclear norm as special cases.\n\nWe introduce computationally efficient differentially private estimators and\nprove their minimax optimality, up to logarithmic factors. Additionally,\nmatching minimax lower bounds are established. 
Notably, in comparison with\nexisting literature, our results accommodate a diverging rank, necessitate no\neigengap condition between distinct principal components, and remain valid even\nif the sample size is much smaller than the dimension."}, "http://arxiv.org/abs/2401.03834": {"title": "Simultaneous false discovery bounds for invariant causal prediction", "link": "http://arxiv.org/abs/2401.03834", "description": "Invariant causal prediction (ICP, Peters et al. (2016)) provides a novel way\nto identify causal predictors of a response by utilizing heterogeneous data\nfrom different environments. One advantage of ICP is that it guarantees to make\nno false causal discoveries with high probability. Such a guarantee, however,\ncan be too conservative in some applications, resulting in few or no\ndiscoveries. To address this, we propose simultaneous false discovery bounds\nfor ICP, which provides users with extra flexibility in exploring causal\npredictors and can extract more informative results. These additional\ninferences come for free, in the sense that they do not require additional\nassumptions, and the same information obtained by the original ICP is retained.\nWe demonstrate the practical usage of our method through simulations and a real\ndataset."}, "http://arxiv.org/abs/2401.03881": {"title": "Density regression via Dirichlet process mixtures of normal structured additive regression models", "link": "http://arxiv.org/abs/2401.03881", "description": "Within Bayesian nonparametrics, dependent Dirichlet process mixture models\nprovide a highly flexible approach for conducting inference about the\nconditional density function. However, several formulations of this class make\neither rather restrictive modelling assumptions or involve intricate algorithms\nfor posterior inference, thus preventing their widespread use. In response to\nthese challenges, we present a flexible, versatile, and computationally\ntractable model for density regression based on a single-weights dependent\nDirichlet process mixture of normal distributions model for univariate\ncontinuous responses. We assume an additive structure for the mean of each\nmixture component and incorporate the effects of continuous covariates through\nsmooth nonlinear functions. The key components of our modelling approach are\npenalised B-splines and their bivariate tensor product extension. Our proposed\nmethod also seamlessly accommodates parametric effects of categorical\ncovariates, linear effects of continuous covariates, interactions between\ncategorical and/or continuous covariates, varying coefficient terms, and random\neffects, which is why we refer our model as a Dirichlet process mixture of\nnormal structured additive regression models. A noteworthy feature of our\nmethod is its efficiency in posterior simulation through Gibbs sampling, as\nclosed-form full conditional distributions for all model parameters are\navailable. Results from a simulation study demonstrate that our approach\nsuccessfully recovers true conditional densities and other regression\nfunctionals in various challenging scenarios. Applications to a toxicology,\ndisease diagnosis, and agricultural study are provided and further underpin the\nbroad applicability of our modelling framework. 
An R package, \\texttt{DDPstar},\nimplementing the proposed method is publicly available at\n\\url{https://bitbucket.org/mxrodriguez/ddpstar}."}, "http://arxiv.org/abs/2401.03891": {"title": "Radius selection using kernel density estimation for the computation of nonlinear measures", "link": "http://arxiv.org/abs/2401.03891", "description": "When nonlinear measures are estimated from sampled temporal signals of\nfinite length, a radius parameter must be carefully selected to avoid a poor\nestimation. These measures are generally derived from the correlation integral\nwhich quantifies the probability of finding neighbors, i.e., pairs of points\nspaced by less than the radius parameter. While each nonlinear measure comes\nwith several specific empirical rules to select a radius value, we provide a\nsystematic selection method. We show that the optimal radius for nonlinear\nmeasures can be approximated by the optimal bandwidth of a Kernel Density\nEstimator (KDE) related to the correlation sum. The KDE framework provides\nnon-parametric tools to approximate a density function from finite samples\n(e.g. histograms) and optimal methods to select a smoothing parameter, the\nbandwidth (e.g. bin width in histograms). We use results from KDE to derive a\nclosed-form expression for the optimal radius. The latter is used to compute\nthe correlation dimension and to construct recurrence plots yielding an\nestimate of Kolmogorov-Sinai entropy. We assess our method through numerical\nexperiments on signals generated by nonlinear systems and experimental\nelectroencephalographic time series."}, "http://arxiv.org/abs/2401.04036": {"title": "A regularized MANOVA test for semicontinuous high-dimensional data", "link": "http://arxiv.org/abs/2401.04036", "description": "We propose a MANOVA test for semicontinuous data that is applicable also when\nthe dimensionality exceeds the sample size. The test statistic is obtained as a\nlikelihood ratio, where numerator and denominator are computed at the maxima of\npenalized likelihood functions under each hypothesis. Closed form solutions for\nthe regularized estimators allow us to avoid computational overheads. We derive\nthe null distribution using a permutation scheme. The power and level of the\nresulting test are evaluated in a simulation study. We illustrate the new\nmethodology with two original data analyses, one regarding microRNA expression\nin human blastocyst cultures, and another regarding alien plant species\ninvasion in the island of Socotra (Yemen)."}, "http://arxiv.org/abs/2401.04086": {"title": "A Priori Determination of the Pretest Probability", "link": "http://arxiv.org/abs/2401.04086", "description": "In this manuscript, we present various proposed methods to estimate the\nprevalence of disease, a critical prerequisite for the adequate interpretation\nof screening tests. To address the limitations of these approaches, which\nrevolve primarily around their a posteriori nature, we introduce a novel method\nto estimate the pretest probability of disease, a priori, utilizing the Logit\nfunction from the logistic regression model. This approach is a modification of\nMcGee's heuristic, originally designed for estimating the posttest probability\nof disease. 
In a patient presenting with $n_\\theta$ signs or symptoms, the\nminimal bound of the pretest probability, $\\phi$, can be approximated by:\n\n$\\phi \\approx\n\\frac{1}{5}{ln\\left[\\displaystyle\\prod_{\\theta=1}^{i}\\kappa_\\theta\\right]}$\n\nwhere $ln$ is the natural logarithm, and $\\kappa_\\theta$ is the likelihood\nratio associated with the sign or symptom in question."}, "http://arxiv.org/abs/2107.00118": {"title": "Do we need to estimate the variance in robust mean estimation?", "link": "http://arxiv.org/abs/2107.00118", "description": "In this paper, we propose self-tuned robust estimators for estimating the\nmean of heavy-tailed distributions, which refer to distributions with only\nfinite variances. Our approach introduces a new loss function that considers\nboth the mean parameter and a robustification parameter. By jointly optimizing\nthe empirical loss function with respect to both parameters, the\nrobustification parameter estimator can automatically adapt to the unknown data\nvariance, and thus the self-tuned mean estimator can achieve optimal\nfinite-sample performance. Our method outperforms previous approaches in terms\nof both computational and asymptotic efficiency. Specifically, it does not\nrequire cross-validation or Lepski's method to tune the robustification\nparameter, and the variance of our estimator achieves the Cram\\'er-Rao lower\nbound. Project source code is available at\n\\url{https://github.com/statsle/automean}."}, "http://arxiv.org/abs/2201.06606": {"title": "Drift vs Shift: Decoupling Trends and Changepoint Analysis", "link": "http://arxiv.org/abs/2201.06606", "description": "We introduce a new approach for decoupling trends (drift) and changepoints\n(shifts) in time series. Our locally adaptive model-based approach for robustly\ndecoupling combines Bayesian trend filtering and machine learning based\nregularization. An over-parameterized Bayesian dynamic linear model (DLM) is\nfirst applied to characterize drift. Then a weighted penalized likelihood\nestimator is paired with the estimated DLM posterior distribution to identify\nshifts. We show how Bayesian DLMs specified with so-called shrinkage priors can\nprovide smooth estimates of underlying trends in the presence of complex noise\ncomponents. However, their inability to shrink exactly to zero inhibits direct\nchangepoint detection. In contrast, penalized likelihood methods are highly\neffective in locating changepoints. However, they require data with simple\npatterns in both signal and noise. The proposed decoupling approach combines\nthe strengths of both, i.e. the flexibility of Bayesian DLMs with the hard\nthresholding property of penalized likelihood estimators, to provide\nchangepoint analysis in complex, modern settings. The proposed framework is\noutlier robust and can identify a variety of changes, including in mean and\nslope. It is also easily extended for analysis of parameter shifts in\ntime-varying parameter models like dynamic regressions. 
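As a worked numeric instance of the approximation quoted in the arXiv:2401.04086 entry above, with three purely hypothetical likelihood ratios for the presenting findings:

    import math

    # Hypothetical likelihood ratios for three presenting signs/symptoms.
    kappas = [4.0, 2.5, 3.0]

    # phi ~= (1/5) * ln(product of likelihood ratios), the quoted minimal bound
    # on the pretest probability.
    phi = math.log(math.prod(kappas)) / 5
    print(round(phi, 2))  # 0.68, i.e. a pretest probability bound of roughly 68%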
We illustrate the\nflexibility and contrast the performance and robustness of our approach with\nseveral alternative methods across a wide range of simulations and application\nexamples."}, "http://arxiv.org/abs/2303.00471": {"title": "E-values for k-Sample Tests With Exponential Families", "link": "http://arxiv.org/abs/2303.00471", "description": "We develop and compare e-variables for testing whether $k$ samples of data\nare drawn from the same distribution, the alternative being that they come from\ndifferent elements of an exponential family. We consider the GRO (growth-rate\noptimal) e-variables for (1) a `small' null inside the same exponential family,\nand (2) a `large' nonparametric null, as well as (3) an e-variable arrived at\nby conditioning on the sum of the sufficient statistics. (2) and (3) are\nefficiently computable, and extend ideas from Turner et al. [2021] and Wald\n[1947] respectively from Bernoulli to general exponential families. We provide\ntheoretical and simulation-based comparisons of these e-variables in terms of\ntheir logarithmic growth rate, and find that for small effects all four\ne-variables behave surprisingly similarly; for the Gaussian location and\nPoisson families, e-variables (1) and (3) coincide; for Bernoulli, (1) and (2)\ncoincide; but in general, whether (2) or (3) grows faster under the alternative\nis family-dependent. We furthermore discuss algorithms for numerically\napproximating (1)."}, "http://arxiv.org/abs/2303.13850": {"title": "Towards Learning and Explaining Indirect Causal Effects in Neural Networks", "link": "http://arxiv.org/abs/2303.13850", "description": "Recently, there has been a growing interest in learning and explaining causal\neffects within Neural Network (NN) models. By virtue of NN architectures,\nprevious approaches consider only direct and total causal effects assuming\nindependence among input variables. We view an NN as a structural causal model\n(SCM) and extend our focus to include indirect causal effects by introducing\nfeedforward connections among input neurons. We propose an ante-hoc method that\ncaptures and maintains direct, indirect, and total causal effects during NN\nmodel training. We also propose an algorithm for quantifying learned causal\neffects in an NN model and efficient approximation strategies for quantifying\ncausal effects in high-dimensional data. Extensive experiments conducted on\nsynthetic and real-world datasets demonstrate that the causal effects learned\nby our ante-hoc method better approximate the ground truth effects compared to\nexisting methods."}, "http://arxiv.org/abs/2304.08242": {"title": "The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges", "link": "http://arxiv.org/abs/2304.08242", "description": "Numerical interactions leading to users sharing textual content published by\nothers are naturally represented by a network where the individuals are\nassociated with the nodes and the exchanged texts with the edges. To understand\nthose heterogeneous and complex data structures, clustering nodes into\nhomogeneous groups as well as rendering a comprehensible visualisation of the\ndata is mandatory. To address both issues, we introduce Deep-LPTM, a\nmodel-based clustering strategy relying on a variational graph auto-encoder\napproach as well as a probabilistic model to characterise the topics of\ndiscussion. Deep-LPTM allows to build a joint representation of the nodes and\nof the edges in two embeddings spaces. 
The parameters are inferred using a\nvariational inference algorithm. We also introduce IC2L, a model selection\ncriterion specifically designed to choose models with relevant clustering and\nvisualisation properties. An extensive benchmark study on synthetic data is\nprovided. In particular, we find that Deep-LPTM better recovers the partitions\nof the nodes than the state-of-the-art ETSBM and STBM. Finally, the emails\nof the Enron company are analysed and visualisations of the results are\npresented, with meaningful highlights of the graph structure."}, "http://arxiv.org/abs/2305.05126": {"title": "Comparing Foundation Models using Data Kernels", "link": "http://arxiv.org/abs/2305.05126", "description": "Recent advances in self-supervised learning and neural network scaling have\nenabled the creation of large models, known as foundation models, which can be\neasily adapted to a wide range of downstream tasks. The current paradigm for\ncomparing foundation models involves evaluating them with aggregate metrics on\nvarious benchmark datasets. This method of model comparison is heavily\ndependent on the chosen evaluation metric, which makes it unsuitable for\nsituations where the ideal metric is either not obvious or unavailable. In this\nwork, we present a methodology for directly comparing the embedding space\ngeometry of foundation models, which facilitates model comparison without the\nneed for an explicit evaluation metric. Our methodology is grounded in random\ngraph theory and enables valid hypothesis testing of embedding similarity on a\nper-datum basis. Further, we demonstrate how our methodology can be extended to\nfacilitate population level model comparison. In particular, we show how our\nframework can induce a manifold of models equipped with a distance function\nthat correlates strongly with several downstream metrics. We remark on the\nutility of this population level model comparison as a first step towards a\ntaxonomic science of foundation models."}, "http://arxiv.org/abs/2305.18044": {"title": "Bayesian estimation of clustered dependence structures in functional neuroconnectivity", "link": "http://arxiv.org/abs/2305.18044", "description": "Motivated by the need to model the dependence between regions of interest in\nfunctional neuroconnectivity for efficient inference, we propose a new\nsampling-based Bayesian clustering approach for covariance structures of\nhigh-dimensional Gaussian outcomes. The key technique is based on a Dirichlet\nprocess that clusters covariance sub-matrices into independent groups of\noutcomes, thereby naturally inducing sparsity in the whole brain connectivity\nmatrix. A new split-merge algorithm is employed to achieve convergence of the\nMarkov chain that is shown empirically to recover both uniform and Dirichlet\npartitions with high accuracy. We investigate the empirical performance of the\nproposed method through extensive simulations. Finally, the proposed approach\nis used to group regions of interest into functionally independent groups in\nthe Autism Brain Imaging Data Exchange participants with autism spectrum\ndisorder and co-occurring attention-deficit/hyperactivity disorder."}, "http://arxiv.org/abs/2306.12528": {"title": "Structured Learning in Time-dependent Cox Models", "link": "http://arxiv.org/abs/2306.12528", "description": "Cox models with time-dependent coefficients and covariates are widely used in\nsurvival analysis. 
In high-dimensional settings, sparse regularization\ntechniques are employed for variable selection, but existing methods for\ntime-dependent Cox models lack flexibility in enforcing specific sparsity\npatterns (i.e., covariate structures). We propose a flexible framework for\nvariable selection in time-dependent Cox models, accommodating complex\nselection rules. Our method can adapt to arbitrary grouping structures,\nincluding interaction selection, temporal, spatial, tree, and directed acyclic\ngraph structures. It achieves accurate estimation with low false alarm rates.\nWe develop the sox package, implementing a network flow algorithm for\nefficiently solving models with complex covariate structures. sox offers a\nuser-friendly interface for specifying grouping structures and delivers fast\ncomputation. Through examples, including a case study on identifying predictors\nof time to all-cause death in atrial fibrillation patients, we demonstrate the\npractical application of our method with specific selection rules."}, "http://arxiv.org/abs/2401.04156": {"title": "LASPATED: a Library for the Analysis of SPAtio-TEmporal Discrete data", "link": "http://arxiv.org/abs/2401.04156", "description": "We describe methods, tools, and a software library called LASPATED, available\non GitHub (at https://github.com/vguigues/) to fit models using spatio-temporal\ndata and space-time discretization. A video tutorial for this library is\navailable on YouTube. We consider two types of methods to estimate a\nnon-homogeneous Poisson process in space and time. The methods approximate the\narrival intensity function of the Poisson process by discretizing space and\ntime, and estimating arrival intensity as a function of subregion and time\ninterval. With such methods, it is typical that the dimension of the estimator\nis large relative to the amount of data, and therefore the performance of the\nestimator can be improved by using additional data. The first method uses\nadditional data to add a regularization term to the likelihood function for\ncalibrating the intensity of the Poisson process. The second method uses\nadditional data to estimate arrival intensity as a function of covariates. We\ndescribe a Python package to perform various types of space and time\ndiscretization. We also describe two packages for the calibration of the\nmodels, one in Matlab and one in C++. We demonstrate the advantages of our\nmethods compared to basic maximum likelihood estimation with simulated and real\ndata. The experiments with real data calibrate models of the arrival process of\nemergencies to be handled by the Rio de Janeiro emergency medical service."}, "http://arxiv.org/abs/2401.04190": {"title": "Is it possible to know cosmological fine-tuning?", "link": "http://arxiv.org/abs/2401.04190", "description": "Fine-tuning studies whether some physical parameters, or relevant ratios\nbetween them, are located within so-called life-permitting intervals of small\nprobability outside of which carbon-based life would not be possible. Recent\ndevelopments have found estimates of these probabilities that circumvent\nprevious concerns of measurability and selection bias. However, the question\nremains if fine-tuning can indeed be known. Using a mathematization of the\nepistemological concepts of learning and knowledge acquisition, we argue that\nmost examples that have been touted as fine-tuned cannot be formally assessed\nas such. 
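For the space-time discretization described in the arXiv:2401.04156 entry above, the basic unregularized maximum-likelihood estimate of a piecewise-constant arrival intensity is just the event count divided by the exposure in each (subregion, time-bin) cell; the sketch below uses hypothetical data and ignores the library's regularization and covariate options.

    import numpy as np

    def piecewise_constant_intensity(region_idx, time_idx, n_regions, n_bins, exposure):
        # MLE of a piecewise-constant Poisson intensity: events observed in each
        # (subregion, time-bin) cell divided by the exposure of that cell.
        counts = np.zeros((n_regions, n_bins))
        np.add.at(counts, (np.asarray(region_idx), np.asarray(time_idx)), 1.0)
        return counts / np.asarray(exposure, dtype=float)

    # Hypothetical data: 3 subregions, 4 daily time bins, 30 observed days per cell.
    rng = np.random.default_rng(3)
    region = rng.integers(0, 3, size=200)
    time_bin = rng.integers(0, 4, size=200)
    lam = piecewise_constant_intensity(region, time_bin, n_regions=3, n_bins=4, exposure=30.0)
    print(lam)  # estimated arrivals per bin per day, by subregion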
Nevertheless, fine-tuning can be known when the physical parameter is\nseen as a random variable and it is supported on the nonnegative real line,\nprovided the size of the life-permitting interval is small in relation to the\nobserved value of the parameter."}, "http://arxiv.org/abs/2401.04240": {"title": "A New Cure Rate Model with Discrete and Multiple Exposures", "link": "http://arxiv.org/abs/2401.04240", "description": "Cure rate models are mostly used to study data arising from cancer clinical\ntrials. Their use in the context of infectious diseases has not been explored\nwell. In 2008, Tournoud and Ecochard first proposed a mechanistic formulation\nof a cure rate model in the context of infectious diseases with multiple\nexposures to infection. However, they assumed a simple Poisson distribution to\ncapture the unobserved pathogens at each exposure time. In this paper, we\npropose a new cure rate model to study infectious diseases with discrete\nmultiple exposures to infection. Our formulation captures both over-dispersion\nand under-dispersion with respect to the count of pathogens at each time of\nexposure. We also propose a new estimation method based on the expectation\nmaximization algorithm to calculate the maximum likelihood estimates of the\nmodel parameters. We carry out a detailed Monte Carlo simulation study to\ndemonstrate the performance of the proposed model and estimation algorithm. The\nflexibility of our proposed model also allows us to carry out a model\ndiscrimination. For this purpose, we use both the likelihood ratio test and\ninformation-based criteria. Finally, we illustrate our proposed model using a\nrecently collected dataset on COVID-19."}, "http://arxiv.org/abs/2401.04263": {"title": "Two-Step Targeted Minimum-Loss Based Estimation for Non-Negative Two-Part Outcomes", "link": "http://arxiv.org/abs/2401.04263", "description": "Non-negative two-part outcomes are defined as outcomes with a density\nfunction that has a point mass at zero but is otherwise positive. Examples, such\nas healthcare expenditure and hospital length of stay, are common in healthcare\nutilization research. Despite the practical relevance of non-negative two-part\noutcomes, very few methods exist to leverage knowledge of their semicontinuity\nto achieve improved performance in estimating causal effects. In this paper, we\ndevelop a nonparametric two-step targeted minimum-loss based estimator (denoted\nas hTMLE) for non-negative two-part outcomes. We present methods for a general\nclass of interventions referred to as modified treatment policies, which can\naccommodate continuous, categorical, and binary exposures. The two-step TMLE\nuses a targeted estimate of the intensity component of the outcome to produce a\ntargeted estimate of the binary component of the outcome that may improve\nfinite sample efficiency. We demonstrate the efficiency gains achieved by the\ntwo-step TMLE with simulated examples and then apply it to a cohort of Medicaid\nbeneficiaries to estimate the effect of chronic pain and physical disability on\ndays' supply of opioids."}, "http://arxiv.org/abs/2401.04297": {"title": "Staged trees for discrete longitudinal data", "link": "http://arxiv.org/abs/2401.04297", "description": "In this paper we investigate the use of staged tree models for discrete\nlongitudinal data. Staged trees are a type of probabilistic graphical model for\nfinite sample space processes. 
They are a natural fit for longitudinal data\nbecause a temporal ordering is often implicitly assumed and standard methods\ncan be used for model selection and probability estimation. However, model\nselection methods perform poorly when the sample size is small relative to the\nsize of the graph and model interpretation is tricky with larger graphs. This\nis exacerbated by longitudinal data which is characterised by repeated\nobservations. To address these issues we propose two approaches: the\nlongitudinal staged tree with Markov assumptions which makes some initial\nconditional independence assumptions represented by a directed acyclic graph\nand marginal longitudinal staged trees which model certain margins of the data."}, "http://arxiv.org/abs/2401.04352": {"title": "Non-Deterministic Extension of Plasma Wind Tunnel Data Calibrated Model Predictions to Flight Conditions", "link": "http://arxiv.org/abs/2401.04352", "description": "This work proposes a novel approach for non-deterministic extension of\nexperimental data that considers structural model inadequacy for conditions\nother than the calibration scenario while simultaneously resolving any\nsignificant prior-data discrepancy with information extracted from flight\nmeasurements. This functionality is achieved through methodical utilization of\nmodel error emulators and Bayesian model averaging studies with available\nresponse data. The outlined approach does not require prior flight data\navailability and introduces straightforward mechanisms for their assimilation\nin future predictions. Application of the methodology is demonstrated herein by\nextending material performance data captured at the HyMETS facility to the MSL\nscenario, where the described process yields results that exhibit significantly\nimproved capacity for predictive uncertainty quantification studies. This work\nalso investigates limitations associated with straightforward uncertainty\npropagation procedures onto calibrated model predictions for the flight\nscenario and manages computational requirements with sensitivity analysis and\nsurrogate modeling techniques."}, "http://arxiv.org/abs/2401.04450": {"title": "Recanting twins: addressing intermediate confounding in mediation analysis", "link": "http://arxiv.org/abs/2401.04450", "description": "The presence of intermediate confounders, also called recanting witnesses, is\na fundamental challenge to the investigation of causal mechanisms in mediation\nanalysis, preventing the identification of natural path-specific effects.\nProposed alternative parameters (such as randomizational interventional\neffects) are problematic because they can be non-null even when there is no\nmediation for any individual in the population; i.e., they are not an average\nof underlying individual-level mechanisms. In this paper we develop a novel\nmethod for mediation analysis in settings with intermediate confounding, with\nguarantees that the causal parameters are summaries of the individual-level\nmechanisms of interest. The method is based on recently proposed ideas that\nview causality as the transfer of information, and thus replace recanting\nwitnesses by draws from their conditional distribution, what we call \"recanting\ntwins\". We show that, in the absence of intermediate confounding, recanting\ntwin effects recover natural path-specific effects. 
We present the assumptions\nrequired for identification of recanting twin effects under a standard\nstructural causal model, as well as the assumptions under which the recanting\ntwin identification formulas can be interpreted in the context of the recently\nproposed separable effects models. To estimate recanting twin effects, we\ndevelop efficient semi-parametric estimators that allow the use of data-driven\nmethods in the estimation of the nuisance parameters. We present numerical\nstudies of the methods using synthetic data, as well as an application to\nevaluate the role of new-onset anxiety and depressive disorder in explaining\nthe relationship between gabapentin/pregabalin prescription and incident opioid\nuse disorder among Medicaid beneficiaries with chronic pain."}, "http://arxiv.org/abs/2401.04490": {"title": "Testing similarity of parametric competing risks models for identifying potentially similar pathways in healthcare", "link": "http://arxiv.org/abs/2401.04490", "description": "The identification of similar patient pathways is a crucial task in\nhealthcare analytics. A flexible tool to address this issue is provided by parametric\ncompeting risks models, where transition intensities may be specified by a\nvariety of parametric distributions, thus in particular allowing them to be\ntime-dependent. We assess the similarity between two such models by examining\nthe transitions between different health states. This research introduces a\nmethod to measure the maximum differences in transition intensities over time,\nleading to the development of a test procedure for assessing similarity. We\npropose a parametric bootstrap approach for this purpose and provide a proof to\nconfirm the validity of this procedure. The performance of our proposed method\nis evaluated through a simulation study, considering a range of sample sizes,\ndiffering amounts of censoring, and various thresholds for similarity. Finally,\nwe demonstrate the practical application of our approach with a case study from\nroutine urological clinical practice, which inspired this research."}, "http://arxiv.org/abs/2401.04498": {"title": "Efficient designs for multivariate crossover trials", "link": "http://arxiv.org/abs/2401.04498", "description": "This article aims to study efficient/trace optimal designs for crossover\ntrials, with multiple responses recorded from each subject in each time period.\nA multivariate fixed effect model is proposed with direct and carryover effects\ncorresponding to the multiple responses and the error dispersion matrix\nallowing for correlations to exist between and within responses. Two\ncorrelation structures, namely the proportional and the generalized Markov\ncovariances, are studied. The corresponding information matrices for direct\neffects under the two covariances are used to determine efficient designs.\nEfficiency of orthogonal array designs of Type $I$ and strength $2$ is\ninvestigated for the two covariance forms. To motivate the multivariate\ncrossover designs, a gene expression data set in a $3 \\times 3$ framework is\nutilized."}, "http://arxiv.org/abs/2401.04603": {"title": "Skewed Pivot-Blend Modeling with Applications to Semicontinuous Outcomes", "link": "http://arxiv.org/abs/2401.04603", "description": "Skewness is a common occurrence in statistical applications. In recent years,\nvarious distribution families have been proposed to model skewed data by\nintroducing unequal scales based on the median or mode. 
However, we argue that\nthe point at which unbalanced scales occur may be at any quantile and cannot be\nreparametrized as an ordinary shift parameter in the presence of skewness. In\nthis paper, we introduce a novel skewed pivot-blend technique to create a\nskewed density family based on any continuous density, even those that are\nasymmetric and nonunimodal. Our framework enables the simultaneous estimation\nof scales, the pivotal point, and other location parameters, along with various\nextensions. We also introduce a skewed two-part model tailored for\nsemicontinuous outcomes, which identifies relevant variables across the entire\npopulation and mitigates the additional skewness induced by commonly used\ntransformations. Our theoretical analysis reveals the influence of skewness\nwithout assuming asymptotic conditions. Experiments on synthetic and real-life\ndata demonstrate the excellent performance of the proposed method."}, "http://arxiv.org/abs/2401.04686": {"title": "Weighted likelihood methods for robust fitting of wrapped models for $p$-torus data", "link": "http://arxiv.org/abs/2401.04686", "description": "We consider robust fitting of wrapped models to multivariate circular data\nthat are points on the surface of a $p$-torus based on the weighted likelihood\nmethodology. Robust model fitting is achieved by a set of weighted likelihood\nestimating equations, based on the computation of data dependent weights aimed\nto down-weight anomalous values, such as unexpected directions that do not\nshare the main pattern of the bulk of the data. Weighted likelihood estimating\nequations with weights evaluated on the torus or obtained after unwrapping the\ndata onto the Euclidean space are proposed and compared. Asymptotic properties\nand robustness features of the estimators under study have been established,\nwhereas their finite sample behavior has been investigated by Monte Carlo\nnumerical experiments and real data examples."}, "http://arxiv.org/abs/2401.04689": {"title": "Efficient estimation for ergodic diffusion processes sampled at high frequency", "link": "http://arxiv.org/abs/2401.04689", "description": "A general theory of efficient estimation for ergodic diffusion processes\nsampled at high frequency with an infinite time horizon is presented. High\nfrequency sampling is common in many applications, with finance as a prominent\nexample. The theory is formulated in terms of approximate martingale estimating\nfunctions and covers a large class of estimators including most of the\npreviously proposed estimators for diffusion processes. Easily checked\nconditions ensuring that an estimating function is an approximate martingale\nare derived, and general conditions ensuring consistency and asymptotic\nnormality of estimators are given. Most importantly, simple conditions are\ngiven that ensure rate optimality and efficiency. Rate optimal estimators of\nparameters in the diffusion coefficient converge faster than estimators of\ndrift coefficient parameters because they take advantage of the information in\nthe quadratic variation. The conditions facilitate the choice among the\nmultitude of estimators that have been proposed for diffusion models. Optimal\nmartingale estimating functions in the sense of Godambe and Heyde and their\nhigh frequency approximations are, under weak conditions, shown to satisfy the\nconditions for rate optimality and efficiency. 
This provides a natural feasible\nmethod of constructing explicit rate optimal and efficient estimating functions\nby solving a linear equation."}, "http://arxiv.org/abs/2401.04693": {"title": "Co-Clustering Multi-View Data Using the Latent Block Model", "link": "http://arxiv.org/abs/2401.04693", "description": "The Latent Block Model (LBM) is a prominent model-based co-clustering method,\nreturning parametric representations of each block cluster and allowing the use\nof well-grounded model selection methods. The LBM, while adapted in the literature\nto handle different feature types, cannot be applied to datasets consisting of\nmultiple disjoint sets of features, termed views, for a common set of\nobservations. In this work, we introduce the multi-view LBM, extending the LBM\nmethod to multi-view data, where each view marginally follows an LBM. In the\ncase of two views, the dependence between them is captured by a cluster\nmembership matrix, and we aim to learn the structure of this matrix. We develop\na likelihood-based approach in which parameter estimation uses a stochastic EM\nalgorithm integrating a Gibbs sampler, and an ICL criterion is derived to\ndetermine the number of row and column clusters in each view. To motivate the\napplication of multi-view methods, we extend recent work developing hypothesis\ntests for the null hypothesis that clusters of observations in each view are\nindependent of each other. The testing procedure is integrated into the model\nestimation strategy. Furthermore, we introduce a penalty scheme to generate\nsparse row clusterings. We verify the performance of the developed algorithm\nusing synthetic datasets, and provide guidance for optimal parameter selection.\nFinally, the multi-view co-clustering method is applied to a complex genomics\ndataset, and is shown to provide new insights for high-dimensional multi-view\nproblems."}, "http://arxiv.org/abs/2401.04723": {"title": "Spatio-temporal data fusion for the analysis of in situ and remote sensing data using the INLA-SPDE approach", "link": "http://arxiv.org/abs/2401.04723", "description": "We propose a Bayesian hierarchical model to address the challenge of spatial\nmisalignment in spatio-temporal data obtained from in situ and satellite\nsources. The model is fit using the INLA-SPDE approach, which provides\nefficient computation. Our methodology combines the different data sources in a\n\"fusion\" model via the construction of projection matrices in both spatial and\ntemporal domains. Through simulation studies, we demonstrate that the fusion\nmodel has superior performance in prediction accuracy across space and time\ncompared to standalone \"in situ\" and \"satellite\" models based on only in situ\nor satellite data, respectively. The fusion model also generally outperforms\nthe standalone models in terms of parameter inference. Such a modeling approach\nis motivated by environmental problems, and our specific focus is on the\nanalysis and prediction of harmful algal bloom (HAB) events, where the\nconvention is to conduct separate analyses based on either in situ samples or\nsatellite images. 
A real data analysis shows that the proposed model is a\nnecessary step towards a unified characterization of bloom dynamics and\nidentifying the key drivers of HAB events."}, "http://arxiv.org/abs/1807.00347": {"title": "Robust Inference Under Heteroskedasticity via the Hadamard Estimator", "link": "http://arxiv.org/abs/1807.00347", "description": "Drawing statistical inferences from large datasets in a model-robust way is\nan important problem in statistics and data science. In this paper, we propose\nmethods that are robust to large and unequal noise in different observational\nunits (i.e., heteroskedasticity) for statistical inference in linear\nregression. We leverage the Hadamard estimator, which is unbiased for the\nvariances of ordinary least-squares regression. This is in contrast to the\npopular White's sandwich estimator, which can be substantially biased in high\ndimensions. We propose to estimate the signal strength, noise level,\nsignal-to-noise ratio, and mean squared error via the Hadamard estimator. We\ndevelop a new degrees of freedom adjustment that gives more accurate confidence\nintervals than variants of White's sandwich estimator. Moreover, we provide\nconditions ensuring the estimator is well-defined, by studying a new random\nmatrix ensemble in which the entries of a random orthogonal projection matrix\nare squared. We also show approximate normality, using the second-order\nPoincare inequality. Our work provides improved statistical theory and methods\nfor linear regression in high dimensions."}, "http://arxiv.org/abs/2110.12235": {"title": "Adjusting for indirectly measured confounding using large-scale propensity scores", "link": "http://arxiv.org/abs/2110.12235", "description": "Confounding remains one of the major challenges to causal inference with\nobservational data. This problem is paramount in medicine, where we would like\nto answer causal questions from large observational datasets like electronic\nhealth records (EHRs) and administrative claims. Modern medical data typically\ncontain tens of thousands of covariates. Such a large set carries hope that\nmany of the confounders are directly measured, and further hope that others are\nindirectly measured through their correlation with measured covariates. How can\nwe exploit these large sets of covariates for causal inference? To help answer\nthis question, this paper examines the performance of the large-scale\npropensity score (LSPS) approach on causal analysis of medical data. We\ndemonstrate that LSPS may adjust for indirectly measured confounders by\nincluding tens of thousands of covariates that may be correlated with them. We\npresent conditions under which LSPS removes bias due to indirectly measured\nconfounders, and we show that LSPS may avoid bias when inadvertently adjusting\nfor variables (like colliders) that otherwise can induce bias. We demonstrate\nthe performance of LSPS with both simulated medical data and real medical data."}, "http://arxiv.org/abs/2208.07014": {"title": "Proximal Survival Analysis to Handle Dependent Right Censoring", "link": "http://arxiv.org/abs/2208.07014", "description": "Many epidemiological and clinical studies aim at analyzing a time-to-event\nendpoint. A common complication is right censoring. In some cases, it arises\nbecause subjects are still surviving after the study terminates or move out of\nthe study area, in which case right censoring is typically treated as\nindependent or non-informative. 
Such an assumption can be further relaxed to\nconditionally independent censoring by leveraging possibly time-varying covariate\ninformation, if available, assuming censoring and failure time are independent\namong covariate strata. In yet other instances, events may be censored by other\ncompeting events like death and are associated with censoring possibly through\nprognoses. Realistically, measured covariates can rarely capture all such\nassociations with certainty. For such dependent censoring, often covariate\nmeasurements are at best proxies of underlying prognoses. In this paper, we\nestablish a nonparametric identification framework by formally admitting that\nconditionally independent censoring may fail in practice and accounting for\ncovariate measurements as imperfect proxies of underlying association. The\nframework suggests adaptive estimators, and we give generic assumptions under\nwhich they are consistent, asymptotically normal, and doubly robust. We\nillustrate our framework with concrete settings, where we examine the\nfinite-sample performance of our proposed estimators via a Monte-Carlo\nsimulation and apply them to the SEER-Medicare dataset."}, "http://arxiv.org/abs/2306.03384": {"title": "A Calibrated Data-Driven Approach for Small Area Estimation using Big Data", "link": "http://arxiv.org/abs/2306.03384", "description": "Where the response variable in a big data set is consistent with the variable\nof interest for small area estimation, the big data by itself can provide the\nestimates for small areas. These estimates are often subject to the coverage\nand measurement error bias inherited from the big data. However, if a\nprobability survey of the same variable of interest is available, the survey\ndata can be used as a training data set to develop an algorithm to impute for\nthe data missed by the big data and adjust for measurement errors. In this\npaper, we outline a methodology for such imputations based on a kNN algorithm\ncalibrated to an asymptotically design-unbiased estimate of the national total\nand illustrate the use of a training data set to estimate the imputation bias\nand the fixed-asymptotic bootstrap to estimate the variance of the small area\nhybrid estimator. We illustrate the methodology of this paper using a public\nuse data set and use it to compare the accuracy and precision of our hybrid\nestimator with the Fay-Herriot (FH) estimator. Finally, we also examine\nnumerically the accuracy and precision of the FH estimator when the auxiliary\nvariables used in the linking models are subject to under-coverage errors."}, "http://arxiv.org/abs/2306.06270": {"title": "Markov bases: a 25 year update", "link": "http://arxiv.org/abs/2306.06270", "description": "In this paper, we evaluate the challenges and best practices associated with\nthe Markov bases approach to sampling from conditional distributions. We\nprovide insights and clarifications after 25 years of the publication of the\nfundamental theorem for Markov bases by Diaconis and Sturmfels. 
In addition to\na literature review, we prove three new results on the complexity of Markov\nbases in hierarchical models, relaxations of the fibers in log-linear models,\nand limitations of partial sets of moves in providing an irreducible Markov\nchain."}, "http://arxiv.org/abs/2309.02281": {"title": "s-ID: Causal Effect Identification in a Sub-Population", "link": "http://arxiv.org/abs/2309.02281", "description": "Causal inference in a sub-population involves identifying the causal effect\nof an intervention on a specific subgroup, which is distinguished from the\nwhole population through the influence of systematic biases in the sampling\nprocess. However, ignoring the subtleties introduced by sub-populations can\neither lead to erroneous inference or limit the applicability of existing\nmethods. We introduce and advocate for a causal inference problem in\nsub-populations (henceforth called s-ID), in which we merely have access to\nobservational data of the targeted sub-population (as opposed to the entire\npopulation). Existing inference problems in sub-populations operate on the\npremise that the given data distributions originate from the entire population,\nand thus cannot tackle the s-ID problem. To address this gap, we provide necessary\nand sufficient conditions that must hold in the causal graph for a causal\neffect in a sub-population to be identifiable from the observational\ndistribution of that sub-population. Given these conditions, we present a sound\nand complete algorithm for the s-ID problem."}, "http://arxiv.org/abs/2401.04753": {"title": "Dynamic Models Augmented by Hierarchical Data: An Application Of Estimating HIV Epidemics At Sub-National And Sub-Population Level", "link": "http://arxiv.org/abs/2401.04753", "description": "Dynamic models have been successfully used in producing estimates of HIV\nepidemics at the national level due to their epidemiological nature and their\nability to estimate prevalence, incidence, and mortality rates simultaneously.\nRecently, HIV interventions and policies have required more information at\nsub-national levels to support local planning, decision making and resource\nallocation. Unfortunately, many areas lack sufficient data for deriving stable\nand reliable results, and this is a critical technical barrier to more\nstratified estimates. One solution is to borrow information from other areas\nwithin the same country. However, directly assuming hierarchical structures\nwithin the HIV dynamic models is complicated and computationally\ntime-consuming. In this paper, we propose a simple and innovative way to\nincorporate hierarchical information into the dynamical systems by using\nauxiliary data. The proposed method efficiently uses information from multiple\nareas within each country without increasing the computational burden. As a\nresult, the new model improves predictive ability and uncertainty assessment."}, "http://arxiv.org/abs/2401.04771": {"title": "Network Layout Algorithm with Covariate Smoothing", "link": "http://arxiv.org/abs/2401.04771", "description": "Network science explores intricate connections among objects, employed in\ndiverse domains like social interactions, fraud detection, and disease spread.\nVisualization of networks facilitates conceptualizing research questions and\nforming scientific hypotheses. Networks, as mathematical high-dimensional\nobjects, require dimensionality reduction for (planar) visualization.\nVisualizing empirical networks presents additional challenges. 
They often\ncontain false positive (spurious) and false negative (missing) edges.\nTraditional visualization methods don't account for errors in observation,\npotentially biasing interpretations. Moreover, contemporary network data\nincludes rich nodal attributes. However, traditional methods neglect these\nattributes when computing node locations. Our visualization approach aims to\nleverage nodal attribute richness to compensate for network data limitations.\nWe employ a statistical model estimating the probability of edge connections\nbetween nodes based on their covariates. We enhance the Fruchterman-Reingold\nalgorithm to incorporate estimated dyad connection probabilities, allowing\npractitioners to balance reliance on observed versus estimated edges. We\nexplore optimal smoothing levels, offering a natural way to include relevant\nnodal information in layouts. Results demonstrate the effectiveness of our\nmethod in achieving robust network visualization, providing insights for\nimproved analysis."}, "http://arxiv.org/abs/2401.04775": {"title": "Approximate Inference for Longitudinal Mechanistic HIV Contact Networks", "link": "http://arxiv.org/abs/2401.04775", "description": "Network models are increasingly used to study infectious disease spread.\nExponential Random Graph models have a history in this area, with scalable\ninference methods now available. An alternative approach uses mechanistic\nnetwork models. Mechanistic network models directly capture individual\nbehaviors, making them suitable for studying sexually transmitted diseases.\nCombining mechanistic models with Approximate Bayesian Computation allows\nflexible modeling using domain-specific interaction rules among agents,\navoiding network model oversimplifications. These models are ideal for\nlongitudinal settings as they explicitly incorporate network evolution over\ntime. We implemented a discrete-time version of a previously published\ncontinuous-time model of evolving contact networks for men who have sex with\nmen (MSM) and proposed an ABC-based approximate inference scheme for it. As\nexpected, we found that a two-wave longitudinal study design improves the\naccuracy of inference compared to a cross-sectional design. However, the gains\nin precision in collecting data twice, up to 18%, depend on the spacing of the\ntwo waves and are sensitive to the choice of summary statistics. In addition to\nmethodological developments, our results inform the design of future\nlongitudinal network studies in sexually transmitted diseases, specifically in\nterms of what data to collect from participants and when to do so."}, "http://arxiv.org/abs/2401.04778": {"title": "Generative neural networks for characteristic functions", "link": "http://arxiv.org/abs/2401.04778", "description": "In this work, we provide a simulation algorithm to simulate from a\n(multivariate) characteristic function, which is only accessible in a black-box\nformat. We construct a generative neural network, whose loss function exploits\na specific representation of the Maximum-Mean-Discrepancy metric to directly\nincorporate the targeted characteristic function. The construction is universal\nin the sense that it is independent of the dimension and that it does not\nrequire any assumptions on the given characteristic function. Furthermore,\nfinite sample guarantees on the approximation quality in terms of the\nMaximum-Mean Discrepancy metric are derived. 
The method is illustrated in a\nshort simulation study."}, "http://arxiv.org/abs/2401.04797": {"title": "Principal Component Analysis for Equation Discovery", "link": "http://arxiv.org/abs/2401.04797", "description": "Principal Component Analysis (PCA) is one of the most commonly used\nstatistical methods for data exploration, and for dimensionality reduction\nwherein the first few principal components account for an appreciable\nproportion of the variability in the data. Less commonly, attention is paid to\nthe last principal components because they do not account for an appreciable\nproportion of variability. However, this defining characteristic of the last\nprincipal components also qualifies them as combinations of variables that are\nconstant across the cases. Such constant-combinations are important because\nthey may reflect underlying laws of nature. In situations involving a large\nnumber of noisy covariates, the underlying law may not correspond to the last\nprincipal component, but rather to one of the last. Consequently, a criterion\nis required to identify the relevant eigenvector. In this paper, two examples\nare employed to demonstrate the proposed methodology; one from Physics,\ninvolving a small number of covariates, and another from Meteorology wherein\nthe number of covariates is in the thousands. It is shown that with an\nappropriate selection criterion, PCA can be employed to ``discover\" Kepler's\nthird law (in the former), and the hypsometric equation (in the latter)."}, "http://arxiv.org/abs/2401.04832": {"title": "Group lasso priors for Bayesian accelerated failure time models with left-truncated and interval-censored data", "link": "http://arxiv.org/abs/2401.04832", "description": "An important task in health research is to characterize time-to-event\noutcomes such as disease onset or mortality in terms of a potentially\nhigh-dimensional set of risk factors. For example, prospective cohort studies\nof Alzheimer's disease typically enroll older adults for observation over\nseveral decades to assess the long-term impact of genetic and other factors on\ncognitive decline and mortality. The accelerated failure time model is\nparticularly well-suited to such studies, structuring covariate effects as\n`horizontal' changes to the survival quantiles that conceptually reflect shifts\nin the outcome distribution due to lifelong exposures. However, this modeling\ntask is complicated by the enrollment of adults at differing ages, and\nintermittent followup visits leading to interval censored outcome information.\nMoreover, genetic and clinical risk factors are not only high-dimensional, but\ncharacterized by underlying grouping structure, such as by function or gene\nlocation. Such grouped high-dimensional covariates require shrinkage methods\nthat directly acknowledge this structure to facilitate variable selection and\nestimation. In this paper, we address these considerations directly by\nproposing a Bayesian accelerated failure time model with a group-structured\nlasso penalty, designed for left-truncated and interval-censored time-to-event\ndata. We develop a custom Markov chain Monte Carlo sampler for efficient\nestimation, and investigate the impact of various methods of penalty tuning and\nthresholding for variable selection. 
We present a simulation study examining\nthe performance of this method relative to models with an ordinary lasso\npenalty, and apply the proposed method to identify groups of predictive genetic\nand clinical risk factors for Alzheimer's disease in the Religious Orders Study\nand Memory and Aging Project (ROSMAP) prospective cohort studies of AD and\ndementia."}, "http://arxiv.org/abs/2401.04841": {"title": "Analysis of Compositional Data with Positive Correlations among Components using a Nested Dirichlet Distribution with Application to a Morris Water Maze Experiment", "link": "http://arxiv.org/abs/2401.04841", "description": "In a typical Morris water maze experiment, a mouse is placed in a circular\nwater tank and allowed to swim freely until it finds a platform, triggering a\nroute of escape from the tank. For reference purposes, the tank is divided into\nfour quadrants: the target quadrant where the trigger to escape resides, the\nopposite quadrant to the target, and two adjacent quadrants. Several response\nvariables can be measured: the amount of time that a mouse spends in different\nquadrants of the water tank, the number of times the mouse crosses from one\nquadrant to another, or how quickly a mouse triggers an escape from the tank.\nWhen considering time within each quadrant, it is hypothesized that normal mice\nwill spend smaller amounts of time in quadrants that do not contain the escape\nroute, while mice with an acquired or induced mental deficiency will spend\nequal time in all quadrants of the tank. Clearly, the proportions of time in the\nquadrants must sum to one and are therefore statistically dependent; however,\nmost analyses of data from this experiment treat time in quadrants as\nstatistically independent. A recent paper introduced a hypothesis testing\nmethod that involves fitting such data to a Dirichlet distribution. While an\nimprovement over studies that ignore the compositional structure of the data,\nwe show that this methodology is flawed. We introduce a two-sample test to detect\ndifferences in the proportions of components for two independent groups where both\ngroups are from either a Dirichlet or nested Dirichlet distribution. This new\ntest is used to reanalyze the data from a previous study and come to a\ndifferent conclusion."}, "http://arxiv.org/abs/2401.04863": {"title": "Estimands and cumulative incidence function regression in clinical trials: some new results on interpretability and robustness", "link": "http://arxiv.org/abs/2401.04863", "description": "Regression analyses based on transformations of cumulative incidence\nfunctions are often adopted when modeling and testing for treatment effects in\nclinical trial settings involving competing and semi-competing risks. Common\nframeworks include the Fine-Gray model and models based on direct binomial\nregression. Using large sample theory we derive the limiting values of\ntreatment effect estimators based on such models when the data are generated\naccording to multiplicative intensity-based models, and show that the estimand\nis sensitive to several process features. The rejection rates of hypothesis\ntests based on cumulative incidence function regression models are also\nexamined for null hypotheses of different types, based on which a robustness\nproperty is established. In such settings supportive secondary analyses of\ntreatment effects are essential to ensure a full understanding of the nature of\ntreatment effects. 
An application to a palliative study of individuals with\nbreast cancer metastatic to bone is provided for illustration."}, "http://arxiv.org/abs/2401.05088": {"title": "Hybrid of node and link communities for graphon estimation", "link": "http://arxiv.org/abs/2401.05088", "description": "Networks serve as a tool used to examine the large-scale connectivity\npatterns in complex systems. Modelling their generative mechanism\nnonparametrically is often based on step-functions, such as the stochastic\nblock models. These models are capable of addressing two prominent topics in\nnetwork science: link prediction and community detection. However, such methods\noften have a resolution limit, making it difficult to separate small-scale\nstructures from noise. To arrive at a smoother representation of the network's\ngenerative mechanism, we explicitly trade variance for bias by smoothing blocks\nof edges based on stochastic equivalence. As such, we propose a different\nestimation method using a new model, which we call the stochastic shape model.\nTypically, analysis methods are based on modelling node or link communities. In\ncontrast, we take a hybrid approach, bridging the two notions of community.\nConsequently, we obtain a more parsimonious representation, enabling a more\ninterpretable and multiscale summary of the network structure. By considering\nmultiple resolutions, we trade bias and variance to ensure that our estimator\nis rate-optimal. We also examine the performance of our model through\nsimulations and applications to real network data."}, "http://arxiv.org/abs/2401.05124": {"title": "Nonparametric worst-case bounds for publication bias on the summary receiver operating characteristic curve", "link": "http://arxiv.org/abs/2401.05124", "description": "The summary receiver operating characteristic (SROC) curve has been\nrecommended as one important meta-analytical summary to represent the accuracy\nof a diagnostic test in the presence of heterogeneous cutoff values. However,\nselective publication of diagnostic studies for meta-analysis can induce\npublication bias (PB) on the estimate of the SROC curve. Several sensitivity\nanalysis methods have been developed to quantify PB on the SROC curve, and all\nthese methods utilize parametric selection functions to model the selective\npublication mechanism. The main contribution of this article is to propose a\nnew sensitivity analysis approach that derives the worst-case bounds for the\nSROC curve by adopting nonparametric selection functions under minimal\nassumptions. The estimation procedures of the worst-case bounds use the Monte\nCarlo method to obtain the SROC curves along with the corresponding area under\nthe curves in the worst case where the maximum possible PB under a range of\nmarginal selection probabilities is considered. We apply the proposed method to\na real-world meta-analysis to show that the worst-case bounds of the SROC\ncurves can provide useful insights for discussing the robustness of\nmeta-analytical findings on diagnostic test accuracy."}, "http://arxiv.org/abs/2401.05256": {"title": "Tests of Missing Completely At Random based on sample covariance matrices", "link": "http://arxiv.org/abs/2401.05256", "description": "We study the problem of testing whether the missing values of a potentially\nhigh-dimensional dataset are Missing Completely at Random (MCAR). 
We relax the\nproblem of testing MCAR to the problem of testing the compatibility of a\nsequence of covariance matrices, motivated by the fact that this procedure is\nfeasible when the dimension grows with the sample size. Tests of compatibility\ncan be used to test the feasibility of positive semi-definite matrix completion\nproblems with noisy observations, and thus our results may be of independent\ninterest. Our first contributions are to define a natural measure of the\nincompatibility of a sequence of correlation matrices, which can be\ncharacterised as the optimal value of a Semi-definite Programming (SDP)\nproblem, and to establish a key duality result allowing its practical\ncomputation and interpretation. By studying the concentration properties of the\nnatural plug-in estimator of this measure, we introduce novel hypothesis tests\nthat we prove have power against all distributions with incompatible covariance\nmatrices. The choice of critical values for our tests relies on a new\nconcentration inequality for the Pearson sample correlation matrix, which may\nbe of interest more widely. By considering key examples of missingness\nstructures, we demonstrate that our procedures are minimax rate optimal in\ncertain cases. We further validate our methodology with numerical simulations\nthat provide evidence of validity and power, even when data are heavy tailed."}, "http://arxiv.org/abs/2401.05281": {"title": "Asymptotic expected sensitivity function and its applications to nonparametric correlation estimators", "link": "http://arxiv.org/abs/2401.05281", "description": "We introduce a new type of influence function, the asymptotic expected\nsensitivity function, which is often equivalent to but mathematically more\ntractable than the traditional one based on the Gateaux derivative. To\nillustrate, we study the robustness of some important rank correlations,\nincluding Spearman's and Kendall's correlations, and the recently developed\nChatterjee's correlation."}, "http://arxiv.org/abs/2401.05315": {"title": "Multi-resolution filters via linear projection for large spatio-temporal datasets", "link": "http://arxiv.org/abs/2401.05315", "description": "Advances in compact sensing devices mounted on satellites have facilitated\nthe collection of large spatio-temporal datasets with coordinates. Since such\ndatasets are often incomplete and noisy, it is useful to create the prediction\nsurface of a spatial field. To this end, we consider an online filtering\ninference by using the Kalman filter based on linear Gaussian state-space\nmodels. However, the Kalman filter is impractically time-consuming when the\nnumber of locations in spatio-temporal datasets is large. To address this\nproblem, we propose a multi-resolution filter via linear projection (MRF-lp), a\nfast computation method for online filtering inference. In the MRF-lp, by\ncarrying out a multi-resolution approximation via linear projection (MRA-lp),\nthe forecast covariance matrix can be approximated while capturing both the\nlarge- and small-scale spatial variations. As a result of this approximation,\nour proposed MRF-lp preserves a block-sparse structure of some matrices\nappearing in the MRF-lp through time, which leads to the scalability of this\nalgorithm. Additionally, we discuss extensions of the MRF-lp to a nonlinear and\nnon-Gaussian case. 
Simulation studies and real data analysis for total\nprecipitable water vapor demonstrate that our proposed approach performs well\ncompared with the related methods."}, "http://arxiv.org/abs/2401.05330": {"title": "Hierarchical Causal Models", "link": "http://arxiv.org/abs/2401.05330", "description": "Scientists often want to learn about cause and effect from hierarchical data,\ncollected from subunits nested inside units. Consider students in schools,\ncells in patients, or cities in states. In such settings, unit-level variables\n(e.g. each school's budget) may affect subunit-level variables (e.g. the test\nscores of each student in each school) and vice versa. To address causal\nquestions with hierarchical data, we propose hierarchical causal models, which\nextend structural causal models and causal graphical models by adding inner\nplates. We develop a general graphical identification technique for\nhierarchical causal models that extends do-calculus. We find many situations in\nwhich hierarchical data can enable causal identification even when it would be\nimpossible with non-hierarchical data, that is, if we had only unit-level\nsummaries of subunit-level variables (e.g. the school's average test score,\nrather than each student's score). We develop estimation techniques for\nhierarchical causal models, using methods including hierarchical Bayesian\nmodels. We illustrate our results in simulation and via a reanalysis of the\nclassic \"eight schools\" study."}, "http://arxiv.org/abs/2010.02848": {"title": "Unified Robust Estimation", "link": "http://arxiv.org/abs/2010.02848", "description": "Robust estimation is primarily concerned with providing reliable parameter\nestimates in the presence of outliers. Numerous robust loss functions have been\nproposed in regression and classification, along with various computing\nalgorithms. In modern penalised generalised linear models (GLM), however, there\nis limited research on robust estimation that can provide weights to determine\nthe outlier status of the observations. This article proposes a unified\nframework based on a large family of loss functions, a composite of concave and\nconvex functions (CC-family). Properties of the CC-family are investigated, and\nCC-estimation is innovatively conducted via the iteratively reweighted convex\noptimisation (IRCO), which is a generalisation of the iteratively reweighted\nleast squares in robust linear regression. For robust GLM, the IRCO becomes the\niteratively reweighted GLM. The unified framework contains penalised estimation\nand robust support vector machine and is demonstrated with a variety of data\napplications."}, "http://arxiv.org/abs/2010.09335": {"title": "Statistical Models for Repeated Categorical Ratings: The R Package rater", "link": "http://arxiv.org/abs/2010.09335", "description": "A common problem in many disciplines is the need to assign a set of items\ninto categories or classes with known labels. This is often done by one or more\nexpert raters, or sometimes by an automated process. If these assignments or\n`ratings' are difficult to make accurately, a common tactic is to repeat them\nby different raters, or even by the same rater multiple times on different\noccasions. We present an R package `rater`, available on CRAN, that implements\nBayesian versions of several statistical models for analysis of repeated\ncategorical rating data. Inference is possible for the true underlying (latent)\nclass of each item, as well as the accuracy of each rater. 
The models are\nextensions of, and include, the Dawid-Skene model, and we implemented them\nusing the Stan probabilistic programming language. We illustrate the use of\n`rater` through a few examples. We also discuss in detail the techniques of\nmarginalisation and conditioning, which are necessary for these models but also\napply more generally to other models implemented in Stan."}, "http://arxiv.org/abs/2204.10969": {"title": "Combining Doubly Robust Methods and Machine Learning for Estimating Average Treatment Effects for Observational Real-world Data", "link": "http://arxiv.org/abs/2204.10969", "description": "Observational cohort studies are increasingly being used for comparative\neffectiveness research to assess the safety of therapeutics. Recently, various\ndoubly robust methods have been proposed for average treatment effect\nestimation by combining the treatment model and the outcome model via different\nvehicles, such as matching, weighting, and regression. The key advantage of\ndoubly robust estimators is that they require either the treatment model or the\noutcome model to be correctly specified to obtain a consistent estimator of\naverage treatment effects, and therefore lead to a more accurate and often more\nprecise inference. However, little work has been done to understand how doubly\nrobust estimators differ due to their unique strategies of using the treatment\nand outcome models and how machine learning techniques can be combined to boost\ntheir performance. Here we examine multiple popular doubly robust methods and\ncompare their performance using different treatment and outcome modeling via\nextensive simulations and a real-world application. We found that incorporating\nmachine learning with doubly robust estimators such as the targeted maximum\nlikelihood estimator gives the best overall performance. Practical guidance on\nhow to apply doubly robust estimators is provided."}, "http://arxiv.org/abs/2205.00259": {"title": "cubble: An R Package for Organizing and Wrangling Multivariate Spatio-temporal Data", "link": "http://arxiv.org/abs/2205.00259", "description": "Multivariate spatio-temporal data refers to multiple measurements taken\nacross space and time. For many analyses, spatial and time components can be\nseparately studied: for example, to explore the temporal trend of one variable\nfor a single spatial location, or to model the spatial distribution of one\nvariable at a given time. However, for some studies, it is important to analyse\ndifferent aspects of the spatio-temporal data simultaneously, like, for instance,\ntemporal trends of multiple variables across locations. In order to facilitate\nthe study of different portions or combinations of spatio-temporal data, we\nintroduce a new data structure, cubble, with a suite of functions enabling easy\nslicing and dicing of the different spatio-temporal components. The\nproposed cubble structure ensures that all the components of the data are easy\nto access and manipulate while providing flexibility for data analysis. In\naddition, cubble facilitates visual and numerical explorations of the data\nwhile easing data wrangling and modelling. The cubble structure and the\nfunctions provided in the cubble R package equip users with the capability to\nhandle hierarchical spatial and temporal structures. 
The cubble structure and\nthe tools implemented in the package are illustrated with different examples of\nAustralian climate data."}, "http://arxiv.org/abs/2301.08836": {"title": "Scalable Gaussian Process Inference with Stan", "link": "http://arxiv.org/abs/2301.08836", "description": "Gaussian processes (GPs) are sophisticated distributions to model functional\ndata. Whilst theoretically appealing, they are computationally cumbersome\nexcept for small datasets. We implement two methods for scaling GP inference in\nStan: First, a general sparse approximation using a directed acyclic dependency\ngraph; second, a fast, exact method for regularly spaced data modeled by GPs\nwith stationary kernels using the fast Fourier transform. Based on benchmark\nexperiments, we offer guidance for practitioners to decide between different\nmethods and parameterizations. We consider two real-world examples to\nillustrate the package. The implementation follows Stan's design and exposes\nperformant inference through a familiar interface. Full posterior inference for\nten thousand data points is feasible on a laptop in less than 20 seconds.\nDetails on how to get started using the popular interfaces cmdstanpy for Python\nand cmdstanr for R are provided."}, "http://arxiv.org/abs/2302.09103": {"title": "Multiple change-point detection for Poisson processes", "link": "http://arxiv.org/abs/2302.09103", "description": "The aim of change-point detection is to discover the changes in behavior that\nlie behind time sequence data. In this article, we study the case where the\ndata comes from an inhomogeneous Poisson process or a marked Poisson process.\nWe present a methodology for detecting multiple offline change-points based on\na minimum contrast estimator. In particular, we explain how to handle the\ncontinuous nature of the process with the available discrete observations. In\naddition, we select the appropriate number of regimes via a cross-validation\nprocedure which is particularly convenient here due to the nature of the Poisson process.\nThrough experiments on simulated and real data sets, we demonstrate the\nusefulness of the proposed method. The proposed method has been implemented in\nthe R package \\texttt{CptPointProcess}."}, "http://arxiv.org/abs/2303.00531": {"title": "Parameter estimation for a hidden linear birth and death process with immigration", "link": "http://arxiv.org/abs/2303.00531", "description": "In this paper, we use a linear birth and death process with immigration to\nmodel infectious disease propagation when contamination stems from both\nperson-to-person contact and contact with the environment. Our aim is to\nestimate the parameters of the process. The main originality and difficulty\ncome from the observation scheme. Counts of the infected population are hidden.\nThe only data available are periodic cumulated new retired counts. Although\nvery common in epidemiology, this observation scheme is mathematically\nchallenging even for such a standard stochastic process. We first derive an\nanalytic expression of the unknown parameters as functions of well-chosen\ndiscrete time transition probabilities. Second, we extend and adapt the\nstandard Baum-Welch algorithm in order to estimate the said discrete time\ntransition probabilities in our hidden data framework. 
The performance of our\nestimators is illustrated on both synthetic data and real data of typhoid fever\nin Mayotte."}, "http://arxiv.org/abs/2306.04836": {"title": "$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control", "link": "http://arxiv.org/abs/2306.04836", "description": "In this paper, we propose a novel $K$-nearest neighbor resampling procedure\nfor estimating the performance of a policy from historical data containing\nrealized episodes of a decision process generated under a different policy. We\nprovide statistical consistency results under weak conditions. In particular,\nwe avoid the common assumption of identically and independently distributed\ntransitions and rewards. Instead, our analysis allows for the sampling of\nentire episodes, as is common practice in most applications. To establish the\nconsistency in this setting, we generalize Stone's Theorem, a well-known result\nin nonparametric statistics on local averaging, to include episodic data and\nthe counterfactual estimation underlying off-policy evaluation (OPE). By\nfocusing on feedback policies that depend deterministically on the current\nstate in environments with continuous state-action spaces and system-inherent\nstochasticity effected by chosen actions, and relying on trajectory simulation\nsimilar to Monte Carlo methods, the proposed method is particularly well suited\nfor stochastic control environments. Compared to other OPE methods, our\nalgorithm does not require optimization, can be efficiently implemented via\ntree-based nearest neighbor search and parallelization, and does not explicitly\nassume a parametric model for the environment's dynamics. Numerical experiments\ndemonstrate the effectiveness of the algorithm compared to existing baselines\nin a variety of stochastic control settings, including a linear quadratic\nregulator, trade execution in limit order books, and online stochastic bin\npacking."}, "http://arxiv.org/abs/2401.05343": {"title": "Spectral Topological Data Analysis of Brain Signals", "link": "http://arxiv.org/abs/2401.05343", "description": "Topological data analysis (TDA) has become a powerful approach over the last\ntwenty years, mainly due to its ability to capture the shape and the geometry\ninherent in the data. Persistent homology, which is a particular tool in TDA,\nhas been demonstrated to be successful in analyzing functional brain\nconnectivity. One limitation of standard approaches is that they use\narbitrarily chosen threshold values for analyzing connectivity matrices. To\novercome this weakness, TDA provides a filtration of the weighted brain network\nacross a range of threshold values. However, current analyses of the\ntopological structure of functional brain connectivity primarily rely on overly\nsimplistic connectivity measures, such as the Pearson correlation. These\nmeasures do not provide information about the specific oscillators that drive\ndependence within the brain network. Here, we develop a frequency-specific\napproach that utilizes coherence, a measure of dependence in the spectral\ndomain, to evaluate the functional connectivity of the brain. Our approach, the\nspectral TDA (STDA), has the ability to capture more nuanced and detailed\ninformation about the underlying brain networks. The proposed STDA method leads\nto a novel topological summary, the spectral landscape, which is a\n2D-generalization of the persistence landscape. 
Using the novel spectral\nlandscape, we analyze the EEG brain connectivity of patients with attention\ndeficit hyperactivity disorder (ADHD) and shed light on the frequency-specific\ndifferences in the topology of brain connectivity between the controls and ADHD\npatients."}, "http://arxiv.org/abs/2401.05414": {"title": "On the Three Demons in Causality in Finance: Time Resolution, Nonstationarity, and Latent Factors", "link": "http://arxiv.org/abs/2401.05414", "description": "Financial data is generally time series in essence and thus suffers from\nthree fundamental issues: the mismatch in time resolution, the time-varying\nproperty of the distribution - nonstationarity, and causal factors that are\nimportant but unknown/unobserved. In this paper, we follow a causal perspective\nto systematically look into these three demons in finance. Specifically, we\nreexamine these issues in the context of causality, which gives rise to a novel\nand inspiring understanding of how the issues can be addressed. Following this\nperspective, we provide systematic solutions to these problems, which hopefully\nwould serve as a foundation for future research in the area."}, "http://arxiv.org/abs/2401.05466": {"title": "Multidimensional Scaling for Interval Data: INTERSCAL", "link": "http://arxiv.org/abs/2401.05466", "description": "Standard multidimensional scaling takes as input a dissimilarity matrix of\ngeneral term $\\delta _{ij}$ which is a numerical value. In this paper we input\n$\\delta _{ij}=[\\underline{\\delta _{ij}},\\overline{\\delta _{ij}}]$ where\n$\\underline{\\delta _{ij}}$ and $\\overline{\\delta _{ij}}$ are the lower bound\nand the upper bound of the ``dissimilarity'' between the stimulus/object $S_i$\nand the stimulus/object $S_j$ respectively. As output, instead of representing\neach stimulus/object on a factorial plane by a point, as in other\nmultidimensional scaling methods, the proposed method visualizes each\nstimulus/object as a rectangle in order to represent dissimilarity variation. We\ngeneralize the classical scaling method, seeking a method that produces\nresults similar to those obtained by Tops Principal Components Analysis. Two\nexamples are presented to illustrate the effectiveness of the proposed method."}, "http://arxiv.org/abs/2401.05471": {"title": "Shrinkage linear regression for symbolic interval-valued variables", "link": "http://arxiv.org/abs/2401.05471", "description": "This paper proposes a new approach to fit a linear regression for symbolic\ninterval-valued variables, which improves both the Center Method suggested by\nBillard and Diday in \\cite{BillardDiday2000} and the Center and Range Method\nsuggested by Lima-Neto and De Carvalho in \\cite{Lima2008,\nLima2010}. Just as in the Center Method and the Center and Range Method, the\nproposed methods fit the linear regression model to the midpoints and to the\nhalf-lengths of the intervals (the ranges) taken by the predictor variables in\nthe training data set, but these fits are carried out with the Ridge Regression,\nLasso, and Elastic Net methods proposed by Tibshirani, Hastie, and Zou in \\cite{Tib1996,\nHastieZou2005}. The prediction of the lower and upper bounds of the interval\nresponse (dependent) variable is carried out from its midpoints and ranges,\nwhich are estimated from the shrinkage linear regression models fitted to the\nmidpoints and the ranges of the interval-valued predictors. 
The methods\npresented in this document are applied to three real data sets (a cardiological\ninterval data set, a prostate interval data set and a US murder interval data\nset) in order to compare their performance and ease of interpretation with\nrespect to the Center Method and the Center and Range Method. For this evaluation, the\nroot-mean-squared error and the correlation coefficient are used. In addition, the\nreader may apply all the methods presented herein and verify the results using\nthe {\\tt RSDA} package, written in the {\\tt R} language, which can be downloaded and\ninstalled directly from {\\tt CRAN} \\cite{Rod2014}."}, "http://arxiv.org/abs/2401.05472": {"title": "INTERSTATIS: The STATIS method for interval valued data", "link": "http://arxiv.org/abs/2401.05472", "description": "The STATIS method, proposed by L'Hermier des Plantes and Escoufier, is used\nto analyze multiple data tables in which it is very common that each table\ncontains information concerning the same set of individuals. The differences and\nsimilarities between these tables are analyzed by means of a structure called the\n\\emph{compromise}. In this paper we present a new algorithm for applying the\nSTATIS method when the input consists of interval data. This proposal is based\non Moore's interval arithmetic and the Centers Method for Principal Component\nAnalysis with interval data, proposed by Cazes et al. \\cite{cazes1997}. In\naddition to presenting the INTERSTATIS method in an algorithmic way, an\nexecution example is shown, alongside the interpretation of its results."}, "http://arxiv.org/abs/2401.05473": {"title": "Pyramidal Clustering Algorithms in ISO-3D Project", "link": "http://arxiv.org/abs/2401.05473", "description": "The pyramidal clustering method generalizes hierarchies by allowing non-disjoint\nclasses at a given level instead of a partition. Moreover, the clusters of the\npyramid are intervals of a total order on the set being clustered. [Diday\n1984], [Bertrand, Diday 1990] and [Mfoumoune 1998] proposed algorithms to build\na pyramid starting with an arbitrary order of the individuals. In this paper we\npresent two new algorithms named {\\tt CAPS} and {\\tt CAPSO}. {\\tt CAPSO} builds\na pyramid starting with an order given on the set of the individuals (or\nsymbolic objects) while {\\tt CAPS} finds this order. Moreover, these two\nalgorithms can cluster data more complex than the tabular model can handle by\nconsidering variation in the values taken by the variables; in this way, our\nmethod produces a symbolic pyramid. Each cluster thus formed is\ndefined not only by the set of its elements (i.e. its extent) but also by a\nsymbolic object, which describes its properties (i.e. its intent). These two\nalgorithms were implemented in C++ and Java for the ISO-3D project."}, "http://arxiv.org/abs/2401.05556": {"title": "Assessing High-Order Links in Cardiovascular and Respiratory Networks via Static and Dynamic Information Measures", "link": "http://arxiv.org/abs/2401.05556", "description": "The network representation is becoming increasingly popular for the\ndescription of cardiovascular interactions based on the analysis of multiple\nsimultaneously collected variables. However, the traditional methods to assess\nnetwork links based on pairwise interaction measures cannot reveal high-order\neffects involving more than two nodes, and are not appropriate for inferring the\nunderlying network topology. 
To address these limitations, here we introduce a\nframework which combines the assessment of high-order interactions with\nstatistical inference for the characterization of the functional links\nsustaining physiological networks. The framework develops information-theoretic\nmeasures quantifying how two nodes interact in a redundant or synergistic way\nwith the rest of the network, and employs these measures for reconstructing the\nfunctional structure of the network. The measures are implemented for both\nstatic and dynamic networks mapped respectively by random variables and random\nprocesses using plug-in and model-based entropy estimators. The validation on\ntheoretical and numerically simulated networks documents the ability of the\nframework to represent high-order interactions as networks and to detect\nstatistical structures associated with cascade, common drive and common target\neffects. The application to cardiovascular networks mapped by the beat-to-beat\nvariability of heart rate, respiration, arterial pressure, cardiac output and\nvascular resistance allowed noninvasive characterization of several mechanisms\nof cardiovascular control operating in resting state and during orthostatic\nstress. Our approach provides a new comprehensive assessment of physiological\ninteractions and complements existing strategies for the classification of\npathophysiological states."}, "http://arxiv.org/abs/2401.05728": {"title": "A General Method for Resampling Autocorrelated Spatial Data", "link": "http://arxiv.org/abs/2401.05728", "description": "Comparing spatial data sets is a ubiquitous task in data analysis; however,\nthe presence of spatial autocorrelation means that standard estimates of\nvariance will be wrong and tend to over-estimate the statistical significance\nof correlations and other observations. While there are a number of existing\napproaches to this problem, none are ideal, requiring either detailed analytical\ncalculations, which are hard to generalise, or detailed knowledge of the data\ngenerating process, which may not be available. In this work we propose a\nresampling approach based on Tobler's Law. By resampling the data with fixed\nspatial autocorrelation, measured by Moran's I, we generate a more realistic\nnull model. Testing on real and synthetic data, we find that, as long as the\nspatial autocorrelation is not too strong, this approach works just as well as\nif we knew the data generating process."}, "http://arxiv.org/abs/2401.05817": {"title": "Testing for similarity of multivariate mixed outcomes using generalised joint regression models with application to efficacy-toxicity responses", "link": "http://arxiv.org/abs/2401.05817", "description": "A common problem in clinical trials is to test whether the effect of an\nexplanatory variable on a response of interest is similar between two groups,\ne.g. patient or treatment groups. In this regard, similarity is defined as\nequivalence up to a pre-specified threshold that denotes an acceptable\ndeviation between the two groups. This issue is typically tackled by assessing\nif the explanatory variable's effect on the response is similar. This\nassessment is based on, for example, confidence intervals of differences or a\nsuitable distance between two parametric regression models. Typically, these\napproaches build on the assumption of a univariate continuous or binary outcome\nvariable. However, multivariate outcomes, especially beyond the case of\nbivariate binary response, remain underexplored. 
This paper introduces an\napproach based on a generalised joint regression framework exploiting the\nGaussian copula. Compared to existing methods, our approach accommodates\nvarious outcome variable scales, such as continuous, binary, categorical, and\nordinal, including mixed outcomes in multi-dimensional spaces. We demonstrate\nthe validity of this approach through a simulation study and an\nefficacy-toxicity case study, hence highlighting its practical relevance."}, "http://arxiv.org/abs/2401.05839": {"title": "Modelling physical activity profiles in COPD patients: a fully functional approach to variable domain functional regression models", "link": "http://arxiv.org/abs/2401.05839", "description": "Physical activity plays a significant role in the well-being of individuals\nwith Chronic Obstructive Pulmonary Disease (COPD). Specifically, it has been\ndirectly associated with changes in hospitalization rates for these patients.\nHowever, previous investigations have primarily been conducted in a\ncross-sectional or longitudinal manner and have not considered a continuous\nperspective. Using the telEPOC program, we employ telemonitoring data to analyze\nthe impact of physical activity, adopting a functional data approach.\nTraditional functional data methods, including functional regression models,\ntypically assume a common data domain. However, the data in the telEPOC\nprogram exhibit variable domains, presenting a challenge since the majority of\nfunctional data methods are based on the assumption that data are observed over\nthe same domain. To address this challenge, we introduce a novel fully functional\nmethodology tailored to variable domain functional data, eliminating the need\nfor data alignment, which can be computationally taxing. Although models\ndesigned for variable domain data are relatively scarce and may have inherent\nlimitations in their estimation methods, our approach circumvents these issues.\nWe substantiate the effectiveness of our methodology through a simulation\nstudy, comparing our results with those obtained using established\nmethodologies. Finally, we apply our methodology to analyze the impact of\nphysical activity in COPD patients using the telEPOC program's data. Software\nfor our method is available in the form of R code on request at\n\\url{https://github.com/Pavel-Hernadez-Amaro/V.D.F.R.M-new-estimation-approach.git}."}, "http://arxiv.org/abs/2401.05905": {"title": "Feasible pairwise pseudo-likelihood inference on spatial regressions in irregular lattice grids: the KD-T PL algorithm", "link": "http://arxiv.org/abs/2401.05905", "description": "Spatial regression models are central to the field of spatial statistics.\nNevertheless, their estimation in the case of large and irregularly gridded\nspatial datasets presents considerable computational challenges. To tackle these\ncomputational problems, Arbia \\citep{arbia_2014_pairwise} introduced a\npseudo-likelihood approach (called pairwise likelihood, say PL) which required\nthe identification of pairs of observations that are internally correlated, but\nmutually conditionally uncorrelated. However, while the PL estimators enjoy\noptimal theoretical properties, their practical implementation when dealing\nwith data observed on irregular grids suffers from dramatic computational\nissues (connected with the identification of the pairs of observations) that,\nin most empirical cases, negatively counterbalance their advantages. 
In this\npaper we introduce an algorithm specifically designed to streamline the\ncomputation of the PL in large and irregularly gridded spatial datasets,\ndramatically simplifying the estimation phase. In particular, we focus on the\nestimation of Spatial Error models (SEM). Our proposed approach efficiently\npairs spatial observations by exploiting the KD-tree data structure and uses it to\nderive closed-form expressions for fast parameter approximation. To\nshowcase the efficiency of our method, we provide an illustrative example using\nsimulated data, demonstrating that the computational advantages compared to\nfull likelihood inference do not come at the expense of accuracy."}, "http://arxiv.org/abs/2401.06082": {"title": "Borrowing from historical control data in a Bayesian time-to-event model with flexible baseline hazard function", "link": "http://arxiv.org/abs/2401.06082", "description": "There is currently a focus on statistical methods which can use historical\ntrial information to help accelerate the discovery, development and delivery of\nmedicine. Bayesian methods can be constructed so that the borrowing is\n\"dynamic\" in the sense that the similarity of the data helps to determine how\nmuch information is used. In the time-to-event setting with one historical data\nset, a popular model for a range of baseline hazards is the piecewise\nexponential model where the time points are fixed and a borrowing structure is\nimposed on the model. Although convenient for implementation, this approach\naffects the borrowing capability of the model. We propose a Bayesian model\nwhich allows the time points to vary and a dependency to be placed between the\nbaseline hazards. This serves to smooth the posterior baseline hazard, improving\nboth model estimation and borrowing characteristics. We explore a variety of\nprior structures for the borrowing within our proposed model and assess their\nperformance against established approaches. We demonstrate that this leads to\nimproved type I error in the presence of prior data conflict and increased\npower. We have developed accompanying software which is freely available and\nenables easy implementation of the approach."}, "http://arxiv.org/abs/2401.06091": {"title": "A Closer Look at AUROC and AUPRC under Class Imbalance", "link": "http://arxiv.org/abs/2401.06091", "description": "In machine learning (ML), a widespread adage is that the area under the\nprecision-recall curve (AUPRC) is a superior metric for model comparison to the\narea under the receiver operating characteristic (AUROC) for binary\nclassification tasks with class imbalance. This paper challenges this notion\nthrough novel mathematical analysis, illustrating that AUROC and AUPRC can be\nconcisely related in probabilistic terms. We demonstrate that AUPRC, contrary\nto popular belief, is not superior in cases of class imbalance and might even\nbe a harmful metric, given its inclination to unduly favor model improvements\nin subpopulations with more frequent positive labels. This bias can\ninadvertently heighten algorithmic disparities. Prompted by these insights, a\nthorough review of existing ML literature was conducted, utilizing large\nlanguage models to analyze over 1.5 million papers from arXiv. Our\ninvestigation focused on the prevalence and substantiation of the purported\nAUPRC superiority. The results expose a significant deficit in empirical\nbacking and a trend of misattributions that have fuelled the widespread\nacceptance of AUPRC's supposed advantages. 
Our findings represent a dual\ncontribution: a significant technical advancement in understanding metric\nbehaviors and a stark warning about unchecked assumptions in the ML community.\nAll experiments are accessible at\nhttps://github.com/mmcdermott/AUC_is_all_you_need."}, "http://arxiv.org/abs/2011.04168": {"title": "Likelihood Inference for Possibly Non-Stationary Processes via Adaptive Overdifferencing", "link": "http://arxiv.org/abs/2011.04168", "description": "We make an observation that facilitates exact likelihood-based inference for\nthe parameters of the popular ARFIMA model without requiring stationarity by\nallowing the upper bound $\\bar{d}$ for the memory parameter $d$ to exceed\n$0.5$. We observe that estimating the parameters of a single non-stationary\nARFIMA model is equivalent to estimating the parameters of a sequence of\nstationary ARFIMA models, which allows for the use of existing methods for\nevaluating the likelihood for an invertible and stationary ARFIMA model. This\nenables improved inference because many standard methods perform poorly when\nestimates are close to the boundary of the parameter space. It also allows us\nto leverage the wealth of likelihood approximations that have been introduced\nfor estimating the parameters of a stationary process. We explore how\nestimation of the memory parameter $d$ depends on the upper bound $\\bar{d}$ and\nintroduce adaptive procedures for choosing $\\bar{d}$. Via simulations, we\nexamine the performance of our adaptive procedures for estimating the memory\nparameter when the true value is as large as $2.5$. Our adaptive procedures\nestimate the memory parameter well, can be used to obtain confidence intervals\nfor the memory parameter that achieve nominal coverage rates, and perform\nfavorably relative to existing alternatives."}, "http://arxiv.org/abs/2203.10118": {"title": "Bayesian Structural Learning with Parametric Marginals for Count Data: An Application to Microbiota Systems", "link": "http://arxiv.org/abs/2203.10118", "description": "High dimensional and heterogeneous count data are collected in various\napplied fields. In this paper, we look closely at high-resolution sequencing\ndata on the microbiome, which have enabled researchers to study the genomes of\nentire microbial communities. Revealing the underlying interactions between\nthese communities is of vital importance to learn how microbes influence human\nhealth. To perform structural learning from multivariate count data such as\nthese, we develop a novel Gaussian copula graphical model with two key\nelements. Firstly, we employ parametric regression to characterize the marginal\ndistributions. This step is crucial for accommodating the impact of external\ncovariates. Neglecting this adjustment could potentially introduce distortions\nin the inference of the underlying network of dependences. Secondly, we advance\na Bayesian structure learning framework, based on a computationally efficient\nsearch algorithm that is suited to high dimensionality. The approach returns\nsimultaneous inference of the marginal effects and of the dependence structure,\nincluding graph uncertainty estimates. A simulation study and a real data\nanalysis of microbiome data highlight the applicability of the proposed\napproach at inferring networks from multivariate count data in general, and its\nrelevance to microbiome analyses in particular. 
The proposed method is\nimplemented in the R package BDgraph."}, "http://arxiv.org/abs/2207.02986": {"title": "fabisearch: A Package for Change Point Detection in and Visualization of the Network Structure of Multivariate High-Dimensional Time Series in R", "link": "http://arxiv.org/abs/2207.02986", "description": "Change point detection is a commonly used technique in time series analysis,\ncapturing the dynamic nature in which many real-world processes function. With\nthe ever increasing troves of multivariate high-dimensional time series data,\nespecially in neuroimaging and finance, there is a clear need for scalable and\ndata-driven change point detection methods. Currently, change point detection\nmethods for multivariate high-dimensional data are scarce, with even less\navailable in high-level, easily accessible software packages. To this end, we\nintroduce the R package fabisearch, available on the Comprehensive R Archive\nNetwork (CRAN), which implements the factorized binary search (FaBiSearch)\nmethodology. FaBiSearch is a novel statistical method for detecting change\npoints in the network structure of multivariate high-dimensional time series\nwhich employs non-negative matrix factorization (NMF), an unsupervised\ndimension reduction and clustering technique. Given the high computational cost\nof NMF, we implement the method in C++ code and use parallelization to reduce\ncomputation time. Further, we also utilize a new binary search algorithm to\nefficiently identify multiple change points and provide a new method for\nnetwork estimation for data between change points. We show the functionality of\nthe package and the practicality of the method by applying it to a neuroimaging\nand a finance data set. Lastly, we provide an interactive, 3-dimensional,\nbrain-specific network visualization capability in a flexible, stand-alone\nfunction. This function can be conveniently used with any node coordinate\natlas, and nodes can be color coded according to community membership (if\napplicable). The output is an elegantly displayed network laid over a cortical\nsurface, which can be rotated in the 3-dimensional space."}, "http://arxiv.org/abs/2302.02718": {"title": "A Log-Linear Non-Parametric Online Changepoint Detection Algorithm based on Functional Pruning", "link": "http://arxiv.org/abs/2302.02718", "description": "Online changepoint detection aims to detect anomalies and changes in\nreal-time in high-frequency data streams, sometimes with limited available\ncomputational resources. This is an important task that is rooted in many\nreal-world applications, including and not limited to cybersecurity, medicine\nand astrophysics. While fast and efficient online algorithms have been recently\nintroduced, these rely on parametric assumptions which are often violated in\npractical applications. Motivated by data streams from the telecommunications\nsector, we build a flexible nonparametric approach to detect a change in the\ndistribution of a sequence. Our procedure, NP-FOCuS, builds a sequential\nlikelihood ratio test for a change in a set of points of the empirical\ncumulative density function of our data. This is achieved by keeping track of\nthe number of observations above or below those points. Thanks to functional\npruning ideas, NP-FOCuS has a computational cost that is log-linear in the\nnumber of observations and is suitable for high-frequency data streams. 
In\nterms of detection power, NP-FOCuS is seen to outperform current nonparametric\nonline changepoint techniques in a variety of settings. We demonstrate the\nutility of the procedure on both simulated and real data."}, "http://arxiv.org/abs/2304.10005": {"title": "Prediction under interventions: evaluation of counterfactual performance using longitudinal observational data", "link": "http://arxiv.org/abs/2304.10005", "description": "Predictions under interventions are estimates of what a person's risk of an\noutcome would be if they were to follow a particular treatment strategy, given\ntheir individual characteristics. Such predictions can give important input to\nmedical decision making. However, evaluating predictive performance of\ninterventional predictions is challenging. Standard ways of evaluating\npredictive performance do not apply when using observational data, because\nprediction under interventions involves obtaining predictions of the outcome\nunder conditions that are different to those that are observed for a subset of\nindividuals in the validation dataset. This work describes methods for\nevaluating counterfactual performance of predictions under interventions for\ntime-to-event outcomes. This means we aim to assess how well predictions would\nmatch the validation data if all individuals had followed the treatment\nstrategy under which predictions are made. We focus on counterfactual\nperformance evaluation using longitudinal observational data, and under\ntreatment strategies that involve sustaining a particular treatment regime over\ntime. We introduce an estimation approach using artificial censoring and\ninverse probability weighting which involves creating a validation dataset that\nmimics the treatment strategy under which predictions are made. We extend\nmeasures of calibration, discrimination (c-index and cumulative/dynamic AUCt)\nand overall prediction error (Brier score) to allow assessment of\ncounterfactual performance. The methods are evaluated using a simulation study,\nincluding scenarios in which the methods should detect poor performance.\nApplying our methods in the context of liver transplantation shows that our\nprocedure allows quantification of the performance of predictions supporting\ncrucial decisions on organ allocation."}, "http://arxiv.org/abs/2309.01334": {"title": "Average treatment effect on the treated, under lack of positivity", "link": "http://arxiv.org/abs/2309.01334", "description": "The use of propensity score (PS) methods has become ubiquitous in causal\ninference. At the heart of these methods is the positivity assumption.\nViolation of the positivity assumption leads to the presence of extreme PS\nweights when estimating average causal effects of interest, such as the average\ntreatment effect (ATE) or the average treatment effect on the treated (ATT),\nwhich renders the related statistical inference invalid. To circumvent this issue,\ntrimming or truncating the extreme estimated PSs has been widely used.\nHowever, these methods require that we specify a priori a threshold and\nsometimes an additional smoothing parameter. While there are a number of\nmethods dealing with the lack of positivity when estimating the ATE, surprisingly\nlittle effort has been devoted to the same issue for the ATT. In this paper, we first\nreview widely used methods, such as trimming and truncation, for the ATT. We\nemphasize the underlying intuition behind these methods to better understand\ntheir applications and highlight their main limitations. 
Then, we argue that\nthe current methods simply target estimands that are scaled ATT (and thus move\nthe goalpost to a different target of interest), where we specify the scale and\nthe target populations. We further propose a PS weight-based alternative for\nthe average causal effect on the treated, called overlap weighted average\ntreatment effect on the treated (OWATT). The appeal of our proposed method lies\nin its ability to obtain similar or even better results than trimming and\ntruncation while relaxing the constraint to choose a priori a threshold (or\neven specify a smoothing parameter). The performance of the proposed method is\nillustrated via a series of Monte Carlo simulations and a data analysis on\nracial disparities in health care expenditures."}, "http://arxiv.org/abs/2401.06261": {"title": "Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables", "link": "http://arxiv.org/abs/2401.06261", "description": "Multivariate Mendelian randomization (MVMR) is a statistical technique that\nuses sets of genetic instruments to estimate the direct causal effects of\nmultiple exposures on an outcome of interest. At genomic loci with pleiotropic\ngene regulatory effects, that is, loci where the same genetic variants are\nassociated to multiple nearby genes, MVMR can potentially be used to predict\ncandidate causal genes. However, consensus in the field dictates that the\ngenetic instruments in MVMR must be independent, which is usually not possible\nwhen considering a group of candidate genes from the same locus.\n\nWe used causal inference theory to show that MVMR with correlated instruments\nsatisfies the instrumental set condition. This is a classical result by Brito\nand Pearl (2002) for structural equation models that guarantees the\nidentifiability of causal effects in situations where multiple exposures\ncollectively, but not individually, separate a set of instrumental variables\nfrom an outcome variable. Extensive simulations confirmed the validity and\nusefulness of these theoretical results even at modest sample sizes.\nImportantly, the causal effect estimates remain unbiased and their variance\nsmall when instruments are highly correlated.\n\nWe applied MVMR with correlated instrumental variable sets at risk loci from\ngenome-wide association studies (GWAS) for coronary artery disease using eQTL\ndata from the STARNET study. Our method predicts causal genes at twelve loci,\neach associated with multiple colocated genes in multiple tissues. However, the\nextensive degree of regulatory pleiotropy across tissues and the limited number\nof causal variants in each locus still require that MVMR is run on a\ntissue-by-tissue basis, and testing all gene-tissue pairs at a given locus in a\nsingle model to predict causal gene-tissue combinations remains infeasible."}, "http://arxiv.org/abs/2401.06347": {"title": "Diagnostics for Regression Models with Semicontinuous Outcomes", "link": "http://arxiv.org/abs/2401.06347", "description": "Semicontinuous outcomes commonly arise in a wide variety of fields, such as\ninsurance claims, healthcare expenditures, rainfall amounts, and alcohol\nconsumption. Regression models, including Tobit, Tweedie, and two-part models,\nare widely employed to understand the relationship between semicontinuous\noutcomes and covariates. 
Given the potential detrimental consequences of model\nmisspecification, after fitting a regression model, it is of prime importance\nto check the adequacy of the model. However, due to the point mass at zero,\nstandard diagnostic tools for regression models (e.g., deviance and Pearson\nresiduals) are not informative for semicontinuous data. To bridge this gap, we\npropose a new type of residuals for semicontinuous outcomes that are applicable\nto general regression models. Under the correctly specified model, the proposed\nresiduals converge to being uniformly distributed, and when the model is\nmisspecified, they significantly depart from this pattern. In addition to\nin-sample validation, the proposed methodology can also be employed to evaluate\npredictive distributions. We demonstrate the effectiveness of the proposed tool\nusing health expenditure data from the US Medical Expenditure Panel Survey."}, "http://arxiv.org/abs/2401.06348": {"title": "A Fully Bayesian Approach for Comprehensive Mapping of Magnitude and Phase Brain Activation in Complex-Valued fMRI Data", "link": "http://arxiv.org/abs/2401.06348", "description": "Functional magnetic resonance imaging (fMRI) plays a crucial role in\nneuroimaging, enabling the exploration of brain activity through complex-valued\nsignals. These signals, composed of magnitude and phase, offer a rich source of\ninformation for understanding brain functions. Traditional fMRI analyses have\nlargely focused on magnitude information, often overlooking the potential\ninsights offered by phase data. In this paper, we propose a novel fully\nBayesian model designed for analyzing single-subject complex-valued fMRI\n(cv-fMRI) data. Our model, which we refer to as the CV-M&P model, is\ndistinctive in its comprehensive utilization of both magnitude and phase\ninformation in fMRI signals, allowing for independent prediction of different\ntypes of activation maps. We incorporate Gaussian Markov random fields (GMRFs)\nto capture spatial correlations within the data, and employ image partitioning\nand parallel computation to enhance computational efficiency. Our model is\nrigorously tested through simulation studies, and then applied to a real\ndataset from a unilateral finger-tapping experiment. The results demonstrate\nthe model's effectiveness in accurately identifying brain regions activated in\nresponse to specific tasks, distinguishing between magnitude and phase\nactivation."}, "http://arxiv.org/abs/2401.06350": {"title": "Optimal estimation of the null distribution in large-scale inference", "link": "http://arxiv.org/abs/2401.06350", "description": "The advent of large-scale inference has spurred reexamination of conventional\nstatistical thinking. In a Gaussian model for $n$ many $z$-scores with at most\n$k < \\frac{n}{2}$ nonnulls, Efron suggests estimating the location and scale\nparameters of the null distribution. Placing no assumptions on the nonnull\neffects, the statistical task can be viewed as a robust estimation problem.\nHowever, the best known robust estimators fail to be consistent in the regime\n$k \\asymp n$ which is especially relevant in large-scale inference. The failure\nof estimators which are minimax rate-optimal with respect to other formulations\nof robustness (e.g. Huber's contamination model) might suggest the\nimpossibility of consistent estimation in this regime and, consequently, a\nmajor weakness of Efron's suggestion. A sound evaluation of Efron's model thus\nrequires a complete understanding of consistency. 
We sharply characterize the\nregime of $k$ for which consistent estimation is possible and further establish\nthe minimax estimation rates. It is shown consistent estimation of the location\nparameter is possible if and only if $\\frac{n}{2} - k = \\omega(\\sqrt{n})$, and\nconsistent estimation of the scale parameter is possible in the entire regime\n$k < \\frac{n}{2}$. Faster rates than those in Huber's contamination model are\nachievable by exploiting the Gaussian character of the data. The minimax upper\nbound is obtained by considering estimators based on the empirical\ncharacteristic function. The minimax lower bound involves constructing two\nmarginal distributions whose characteristic functions match on a wide interval\ncontaining zero. The construction notably differs from those in the literature\nby sharply capturing a scaling of $n-2k$ in the minimax estimation rate of the\nlocation."}, "http://arxiv.org/abs/2401.06383": {"title": "Decomposition with Monotone B-splines: Fitting and Testing", "link": "http://arxiv.org/abs/2401.06383", "description": "A univariate continuous function can always be decomposed as the sum of a\nnon-increasing function and a non-decreasing one. Based on this property, we\npropose a non-parametric regression method that combines two spline-fitted\nmonotone curves. We demonstrate by extensive simulations that, compared to\nstandard spline-fitting methods, the proposed approach is particularly\nadvantageous in high-noise scenarios. Several theoretical guarantees are\nestablished for the proposed approach. Additionally, we present statistics to\ntest the monotonicity of a function based on monotone decomposition, which can\nbetter control Type I error and achieve comparable (if not always higher) power\ncompared to existing methods. Finally, we apply the proposed fitting and\ntesting approaches to analyze the single-cell pseudotime trajectory datasets,\nidentifying significant biological insights for non-monotonically expressed\ngenes through Gene Ontology enrichment analysis. The source code implementing\nthe methodology and producing all results is accessible at\nhttps://github.com/szcf-weiya/MonotoneDecomposition.jl."}, "http://arxiv.org/abs/2401.06403": {"title": "Fourier analysis of spatial point processes", "link": "http://arxiv.org/abs/2401.06403", "description": "In this article, we develop comprehensive frequency domain methods for\nestimating and inferring the second-order structure of spatial point processes.\nThe main element here is on utilizing the discrete Fourier transform (DFT) of\nthe point pattern and its tapered counterpart. Under second-order stationarity,\nwe show that both the DFTs and the tapered DFTs are asymptotically jointly\nindependent Gaussian even when the DFTs share the same limiting frequencies.\nBased on these results, we establish an $\\alpha$-mixing central limit theorem\nfor a statistic formulated as a quadratic form of the tapered DFT. As\napplications, we derive the asymptotic distribution of the kernel spectral\ndensity estimator and establish a frequency domain inferential method for\nparametric stationary point processes. For the latter, the resulting model\nparameter estimator is computationally tractable and yields meaningful\ninterpretations even in the case of model misspecification. We investigate the\nfinite sample performance of our estimator through simulations, considering\nscenarios of both correctly specified and misspecified models. 
Furthermore, we\nextend our proposed DFT-based frequency domain methods to a class of\nnon-stationary spatial point processes."}, "http://arxiv.org/abs/2401.06446": {"title": "Increasing dimension asymptotics for two-way crossed mixed effect models", "link": "http://arxiv.org/abs/2401.06446", "description": "This paper presents asymptotic results for the maximum likelihood and\nrestricted maximum likelihood (REML) estimators within a two-way crossed mixed\neffect model as the sizes of the rows, columns, and cells tend to infinity.\nUnder very mild conditions which do not require the assumption of normality,\nthe estimators are proven to be asymptotically normal, possessing a structured\ncovariance matrix. The growth rate for the number of rows, columns, and cells\nis unrestricted, whether considered pairwise or collectively."}, "http://arxiv.org/abs/2401.06447": {"title": "A comprehensive framework for multi-fidelity surrogate modeling with noisy data: a gray-box perspective", "link": "http://arxiv.org/abs/2401.06447", "description": "Computer simulations (a.k.a. white-box models) are more indispensable than\never to model intricate engineering systems. However, computational models\nalone often fail to fully capture the complexities of reality. When physical\nexperiments are accessible though, it is of interest to enhance the incomplete\ninformation offered by computational models. Gray-box modeling is concerned\nwith the problem of merging information from data-driven (a.k.a. black-box)\nmodels and white-box (i.e., physics-based) models. In this paper, we propose to\nperform this task by using multi-fidelity surrogate models (MFSMs). A MFSM\nintegrates information from models with varying computational fidelity into a\nnew surrogate model. The multi-fidelity surrogate modeling framework we propose\nhandles noise-contaminated data and is able to estimate the underlying\nnoise-free high-fidelity function. Our methodology emphasizes on delivering\nprecise estimates of the uncertainty in its predictions in the form of\nconfidence and prediction intervals, by quantitatively incorporating the\ndifferent types of uncertainty that affect the problem, arising from\nmeasurement noise and from lack of knowledge due to the limited experimental\ndesign budget on both the high- and low-fidelity models. Applied to gray-box\nmodeling, our MFSM framework treats noisy experimental data as the\nhigh-fidelity and the white-box computational models as their low-fidelity\ncounterparts. The effectiveness of our methodology is showcased through\nsynthetic examples and a wind turbine application."}, "http://arxiv.org/abs/2401.06465": {"title": "Sanity Checks Revisited: An Exploration to Repair the Model Parameter Randomisation Test", "link": "http://arxiv.org/abs/2401.06465", "description": "The Model Parameter Randomisation Test (MPRT) is widely acknowledged in the\neXplainable Artificial Intelligence (XAI) community for its well-motivated\nevaluative principle: that the explanation function should be sensitive to\nchanges in the parameters of the model function. However, recent works have\nidentified several methodological caveats for the empirical interpretation of\nMPRT. 
To address these caveats, we introduce two adaptations to the original\nMPRT -- Smooth MPRT and Efficient MPRT, where the former minimises the impact\nthat noise has on the evaluation results through sampling and the latter\ncircumvents the need for biased similarity measurements by re-interpreting the\ntest through the explanation's rise in complexity, after full parameter\nrandomisation. Our experimental results demonstrate that these proposed\nvariants lead to improved metric reliability, thus enabling a more trustworthy\napplication of XAI methods."}, "http://arxiv.org/abs/2401.06557": {"title": "Treatment-Aware Hyperbolic Representation Learning for Causal Effect Estimation with Social Networks", "link": "http://arxiv.org/abs/2401.06557", "description": "Estimating the individual treatment effect (ITE) from observational data is a\ncrucial research topic that holds significant value across multiple domains.\nHow to identify hidden confounders poses a key challenge in ITE estimation.\nRecent studies have incorporated the structural information of social networks\nto tackle this challenge, achieving notable advancements. However, these\nmethods utilize graph neural networks to learn the representation of hidden\nconfounders in Euclidean space, disregarding two critical issues: (1) the\nsocial networks often exhibit a scalefree structure, while Euclidean embeddings\nsuffer from high distortion when used to embed such graphs, and (2) each\nego-centric network within a social network manifests a treatment-related\ncharacteristic, implying significant patterns of hidden confounders. To address\nthese issues, we propose a novel method called Treatment-Aware Hyperbolic\nRepresentation Learning (TAHyper). Firstly, TAHyper employs the hyperbolic\nspace to encode the social networks, thereby effectively reducing the\ndistortion of confounder representation caused by Euclidean embeddings.\nSecondly, we design a treatment-aware relationship identification module that\nenhances the representation of hidden confounders by identifying whether an\nindividual and her neighbors receive the same treatment. Extensive experiments\non two benchmark datasets are conducted to demonstrate the superiority of our\nmethod."}, "http://arxiv.org/abs/2401.06564": {"title": "Valid causal inference with unobserved confounding in high-dimensional settings", "link": "http://arxiv.org/abs/2401.06564", "description": "Various methods have recently been proposed to estimate causal effects with\nconfidence intervals that are uniformly valid over a set of data generating\nprocesses when high-dimensional nuisance models are estimated by\npost-model-selection or machine learning estimators. These methods typically\nrequire that all the confounders are observed to ensure identification of the\neffects. We contribute by showing how valid semiparametric inference can be\nobtained in the presence of unobserved confounders and high-dimensional\nnuisance models. We propose uncertainty intervals which allow for unobserved\nconfounding, and show that the resulting inference is valid when the amount of\nunobserved confounding is small relative to the sample size; the latter is\nformalized in terms of convergence rates. Simulation experiments illustrate the\nfinite sample properties of the proposed intervals and investigate an\nalternative procedure that improves the empirical coverage of the intervals\nwhen the amount of unobserved confounding is large. 
Finally, a case study on\nthe effect of smoking during pregnancy on birth weight is used to illustrate\nthe use of the methods introduced to perform a sensitivity analysis to\nunobserved confounding."}, "http://arxiv.org/abs/2401.06575": {"title": "A Weibull Mixture Cure Frailty Model for High-dimensional Covariates", "link": "http://arxiv.org/abs/2401.06575", "description": "A novel mixture cure frailty model is introduced for handling censored\nsurvival data. Mixture cure models are preferable when the existence of a cured\nfraction among patients can be assumed. However, such models are heavily\nunderexplored: frailty structures within cure models remain largely\nundeveloped, and furthermore, most existing methods do not work for\nhigh-dimensional datasets, when the number of predictors is significantly\nlarger than the number of observations. In this study, we introduce a novel\nextension of the Weibull mixture cure model that incorporates a frailty\ncomponent, employed to model an underlying latent population heterogeneity with\nrespect to the outcome risk. Additionally, high-dimensional covariates are\nintegrated into both the cure rate and survival part of the model, providing a\ncomprehensive approach to employ the model in the context of high-dimensional\nomics data. We also perform variable selection via an adaptive elastic net\npenalization, and propose a novel approach to inference using the\nexpectation-maximization (EM) algorithm. Extensive simulation studies are\nconducted across various scenarios to demonstrate the performance of the model,\nand results indicate that our proposed method outperforms competitor models. We\napply the novel approach to analyze RNAseq gene expression data from bulk\nbreast cancer patients included in The Cancer Genome Atlas (TCGA) database. A\nset of prognostic biomarkers is then derived from selected genes, and\nsubsequently validated via both functional enrichment analysis and comparison\nto the existing biological literature. Finally, a prognostic risk score index\nbased on the identified biomarkers is proposed and validated by exploring the\npatients' survival."}, "http://arxiv.org/abs/2401.06687": {"title": "Proximal Causal Inference With Text Data", "link": "http://arxiv.org/abs/2401.06687", "description": "Recent text-based causal methods attempt to mitigate confounding bias by\nincluding unstructured text data as proxies of confounding variables that are\npartially or imperfectly measured. These approaches assume analysts have\nsupervised labels of the confounders given text for a subset of instances, a\nconstraint that is not always feasible due to data privacy or cost. Here, we\naddress settings in which an important confounding variable is completely\nunobserved. We propose a new causal inference method that splits pre-treatment\ntext data, infers two proxies from two zero-shot models on the separate splits,\nand applies these proxies in the proximal g-formula. We prove that our\ntext-based proxy method satisfies identification conditions required by the\nproximal g-formula while other seemingly reasonable proposals do not. We\nevaluate our method in synthetic and semi-synthetic settings and find that it\nproduces estimates with low bias. 
This combination of proximal causal inference\nand zero-shot classifiers is novel (to our knowledge) and expands the set of\ntext-specific causal methods available to practitioners."}, "http://arxiv.org/abs/1905.11232": {"title": "Efficient posterior sampling for high-dimensional imbalanced logistic regression", "link": "http://arxiv.org/abs/1905.11232", "description": "High-dimensional data are routinely collected in many areas. We are\nparticularly interested in Bayesian classification models in which one or more\nvariables are imbalanced. Current Markov chain Monte Carlo algorithms for\nposterior computation are inefficient as $n$ and/or $p$ increase due to\nworsening time per step and mixing rates. One strategy is to use a\ngradient-based sampler to improve mixing while using data sub-samples to reduce\nper-step computational complexity. However, usual sub-sampling breaks down when\napplied to imbalanced data. Instead, we generalize piece-wise deterministic\nMarkov chain Monte Carlo algorithms to include importance-weighted and\nmini-batch sub-sampling. These approaches maintain the correct stationary\ndistribution with arbitrarily small sub-samples, and substantially outperform\ncurrent competitors. We provide theoretical support and illustrate gains in\nsimulated and real data applications."}, "http://arxiv.org/abs/2305.19481": {"title": "Bayesian Image Analysis in Fourier Space", "link": "http://arxiv.org/abs/2305.19481", "description": "Bayesian image analysis has played a large role over the last 40+ years in\nsolving problems in image noise-reduction, de-blurring, feature enhancement,\nand object detection. However, these problems can be complex and lead to\ncomputational difficulties, due to the modeled interdependence between spatial\nlocations. The Bayesian image analysis in Fourier space (BIFS) approach\nproposed here reformulates the conventional Bayesian image analysis paradigm\nfor continuous valued images as a large set of independent (but heterogeneous)\nprocesses over Fourier space. The original high-dimensional estimation problem\nin image space is thereby broken down into (trivially parallelizable)\nindependent one-dimensional problems in Fourier space. The BIFS approach leads\nto easy model specification with fast and direct computation, a wide range of\npossible prior characteristics, easy modeling of isotropy into the prior, and\nmodels that are effectively invariant to changes in image resolution."}, "http://arxiv.org/abs/2307.09404": {"title": "Continuous-time multivariate analysis", "link": "http://arxiv.org/abs/2307.09404", "description": "The starting point for much of multivariate analysis (MVA) is an $n\\times p$\ndata matrix whose $n$ rows represent observations and whose $p$ columns\nrepresent variables. Some multivariate data sets, however, may be best\nconceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves\nor functions defined on a common time interval. We introduce a framework for\nextending techniques of multivariate analysis to such settings. The proposed\nframework rests on the assumption that the curves can be represented as linear\ncombinations of basis functions such as B-splines. This is formally identical\nto the Ramsay-Silverman representation of functional data; but whereas\nfunctional data analysis extends MVA to the case of observations that are\ncurves rather than vectors -- heuristically, $n\\times p$ data with $p$ infinite\n-- we are instead concerned with what happens when $n$ is infinite. 
We describe\nhow to translate the classical MVA methods of covariance and correlation\nestimation, principal component analysis, Fisher's linear discriminant\nanalysis, and $k$-means clustering to the continuous-time setting. We\nillustrate the methods with a novel perspective on a well-known Canadian\nweather data set, and with applications to neurobiological and environmetric\ndata. The methods are implemented in the publicly available R package\n\\texttt{ctmva}."}, "http://arxiv.org/abs/2401.06904": {"title": "Non-collapsibility and Built-in Selection Bias of Hazard Ratio in Randomized Controlled Trials", "link": "http://arxiv.org/abs/2401.06904", "description": "Background: The hazard ratio of the Cox proportional hazards model is widely\nused in randomized controlled trials to assess treatment effects. However, two\nproperties of the hazard ratio including the non-collapsibility and built-in\nselection bias need to be further investigated. Methods: We conduct simulations\nto differentiate the non-collapsibility effect and built-in selection bias from\nthe difference between the marginal and the conditional hazard ratio.\nMeanwhile, we explore the performance of the Cox model with inverse probability\nof treatment weighting for covariate adjustment when estimating the marginal\nhazard ratio. The built-in selection bias is further assessed in the\nperiod-specific hazard ratio. Results: The conditional hazard ratio is a biased\nestimate of the marginal effect due to the non-collapsibility property. In\ncontrast, the hazard ratio estimated from the inverse probability of treatment\nweighting Cox model provides an unbiased estimate of the true marginal hazard\nratio. The built-in selection bias only manifests in the period-specific hazard\nratios even when the proportional hazards assumption is satisfied. The Cox\nmodel with inverse probability of treatment weighting can be used to account\nfor confounding bias and provide an unbiased effect under the randomized\ncontrolled trials setting when the parameter of interest is the marginal\neffect. Conclusions: We propose that the period-specific hazard ratios should\nalways be avoided due to the profound effects of built-in selection bias."}, "http://arxiv.org/abs/2401.06909": {"title": "Sensitivity Analysis for Matched Observational Studies with Continuous Exposures and Binary Outcomes", "link": "http://arxiv.org/abs/2401.06909", "description": "Matching is one of the most widely used study designs for adjusting for\nmeasured confounders in observational studies. However, unmeasured confounding\nmay exist and cannot be removed by matching. Therefore, a sensitivity analysis\nis typically needed to assess a causal conclusion's sensitivity to unmeasured\nconfounding. Sensitivity analysis frameworks for binary exposures have been\nwell-established for various matching designs and are commonly used in various\nstudies. However, unlike the binary exposure case, there still lacks valid and\ngeneral sensitivity analysis methods for continuous exposures, except in some\nspecial cases such as pair matching. To fill this gap in the binary outcome\ncase, we develop a sensitivity analysis framework for general matching designs\nwith continuous exposures and binary outcomes. First, we use probabilistic\nlattice theory to show our sensitivity analysis approach is\nfinite-population-exact under Fisher's sharp null. 
Second, we prove a novel\ndesign sensitivity formula as a powerful tool for asymptotically evaluating the\nperformance of our sensitivity analysis approach. Third, to allow effect\nheterogeneity with binary outcomes, we introduce a framework for conducting\nasymptotically exact inference and sensitivity analysis on generalized\nattributable effects with binary outcomes via mixed-integer programming.\nFourth, for the continuous outcomes case, we show that conducting an\nasymptotically exact sensitivity analysis in matched observational studies when\nboth the exposures and outcomes are continuous is generally NP-hard, except in\nsome special cases such as pair matching. As a real data application, we apply\nour new methods to study the effect of early-life lead exposure on juvenile\ndelinquency. We also develop a publicly available R package for implementation\nof the methods in this work."}, "http://arxiv.org/abs/2401.06919": {"title": "Pseudo-Empirical Likelihood Methods for Causal Inference", "link": "http://arxiv.org/abs/2401.06919", "description": "Causal inference problems have remained an important research topic over the\npast several decades due to their general applicability in assessing a\ntreatment effect in many different real-world settings. In this paper, we\npropose two inferential procedures on the average treatment effect (ATE)\nthrough a two-sample pseudo-empirical likelihood (PEL) approach. The first\nprocedure uses the estimated propensity scores for the formulation of the PEL\nfunction, and the resulting maximum PEL estimator of the ATE is equivalent to\nthe inverse probability weighted estimator discussed in the literature. Our\nfocus in this scenario is on the PEL ratio statistic and establishing its\ntheoretical properties. The second procedure incorporates outcome regression\nmodels for PEL inference through model-calibration constraints, and the\nresulting maximum PEL estimator of the ATE is doubly robust. Our main\ntheoretical result in this case is the establishment of the asymptotic\ndistribution of the PEL ratio statistic. We also propose a bootstrap method for\nconstructing PEL ratio confidence intervals for the ATE to bypass the scaling\nconstant which is involved in the asymptotic distribution of the PEL ratio\nstatistic but is very difficult to calculate. Finite sample performances of our\nproposed methods with comparisons to existing ones are investigated through\nsimulation studies."}, "http://arxiv.org/abs/2401.06925": {"title": "Modeling Latent Selection with Structural Causal Models", "link": "http://arxiv.org/abs/2401.06925", "description": "Selection bias is ubiquitous in real-world data, and can lead to misleading\nresults if not dealt with properly. We introduce a conditioning operation on\nStructural Causal Models (SCMs) to model latent selection from a causal\nperspective. We show that the conditioning operation transforms an SCM with the\npresence of an explicit latent selection mechanism into an SCM without such\nselection mechanism, which partially encodes the causal semantics of the\nselected subpopulation according to the original SCM. Furthermore, we show that\nthis conditioning operation preserves the simplicity, acyclicity, and linearity\nof SCMs, and commutes with marginalization. Thanks to these properties,\ncombined with marginalization and intervention, the conditioning operation\noffers a valuable tool for conducting causal reasoning tasks within causal\nmodels where latent details have been abstracted away. 
We demonstrate by\nexample how classical results of causal inference can be generalized to include\nselection bias and how the conditioning operation helps with modeling of\nreal-world problems."}, "http://arxiv.org/abs/2401.06990": {"title": "Graphical Principal Component Analysis of Multivariate Functional Time Series", "link": "http://arxiv.org/abs/2401.06990", "description": "In this paper, we consider multivariate functional time series with a two-way\ndependence structure: a serial dependence across time points and a graphical\ninteraction among the multiple functions within each time point. We develop the\nnotion of dynamic weak separability, a more general condition than those\nassumed in literature, and use it to characterize the two-way structure in\nmultivariate functional time series. Based on the proposed weak separability,\nwe develop a unified framework for functional graphical models and dynamic\nprincipal component analysis, and further extend it to optimally reconstruct\nsignals from contaminated functional data using graphical-level information. We\ninvestigate asymptotic properties of the resulting estimators and illustrate\nthe effectiveness of our proposed approach through extensive simulations. We\napply our method to hourly air pollution data that were collected from a\nmonitoring network in China."}, "http://arxiv.org/abs/2401.07000": {"title": "Counterfactual Slope and Its Applications to Social Stratification", "link": "http://arxiv.org/abs/2401.07000", "description": "This paper addresses two prominent theses in social stratification research,\nthe great equalizer thesis and Mare's (1980) school transition thesis. Both\ntheses are premised on a descriptive regularity: the association between\nsocioeconomic background and an outcome variable changes when conditioning on\nan intermediate treatment. The interpretation of this descriptive regularity is\ncomplicated by social actors' differential selection into treatment based on\ntheir potential outcomes under treatment. In particular, if the descriptive\nregularity is driven by selection, then the theses do not have a substantive\ninterpretation. We propose a set of novel counterfactual slope estimands, which\ncapture the two theses under the hypothetical scenario where differential\nselection into treatment is eliminated. Thus, we use the counterfactual slopes\nto construct selection-free tests for the two theses. Compared with the\nexisting literature, we are the first to provide explicit, nonparametric, and\ncausal estimands, which enable us to conduct principled selection-free tests.\nWe develop efficient and robust estimators by deriving the efficient influence\nfunctions of the estimands. We apply our framework to a nationally\nrepresentative dataset in the United States and re-evaluate the two theses.\nFindings from our selection-free tests show that the descriptive regularity of\nthe two theses is misleading for substantive interpretations."}, "http://arxiv.org/abs/2401.07018": {"title": "Graphical models for cardinal paired comparisons data", "link": "http://arxiv.org/abs/2401.07018", "description": "Graphical models for cardinal paired comparison data with and without\ncovariates are rigorously analyzed. Novel, graph--based, necessary and\nsufficient conditions which guarantee strong consistency, asymptotic normality\nand the exponential convergence of the estimated ranks are emphasized. A\ncomplete theory for models with covariates is laid out. 
In particular,\nconditions under which covariates can be safely omitted from the model are\nprovided. The methodology is employed in the analysis of both finite and\ninfinite sets of ranked items, specifically in the case of large sparse\ncomparison graphs. The proposed methods are explored by simulation and applied\nto the ranking of teams in the National Basketball Association (NBA)."}, "http://arxiv.org/abs/2401.07221": {"title": "Type I multivariate P\\'olya-Aeppli distributions with applications", "link": "http://arxiv.org/abs/2401.07221", "description": "An extensive body of literature exists that specifically addresses the\nunivariate case of zero-inflated count models. In contrast, research pertaining\nto multivariate models is notably less developed. We propose two new\nparsimonious multivariate models which can be used to model correlated\nmultivariate overdispersed count data. Furthermore, for different parameter\nsettings and sample sizes, various simulations are performed. In conclusion, we\ndemonstrate the performance of the newly proposed multivariate candidates on\ntwo benchmark datasets, which surpasses that of several alternative approaches."}, "http://arxiv.org/abs/2401.07231": {"title": "Use of Prior Knowledge to Discover Causal Additive Models with Unobserved Variables and its Application to Time Series Data", "link": "http://arxiv.org/abs/2401.07231", "description": "This paper proposes two methods for causal additive models with unobserved\nvariables (CAM-UV). CAM-UV assumes that the causal functions take the form of\ngeneralized additive models and that latent confounders are present. First, we\npropose a method that leverages prior knowledge for efficient causal discovery.\nThen, we propose an extension of this method for inferring causality in time\nseries data. The original CAM-UV algorithm differs from other existing causal\nfunction models in that it does not seek the causal order between observed\nvariables, but rather aims to identify the causes for each observed variable.\nTherefore, the first proposed method in this paper utilizes prior knowledge,\nsuch as understanding that certain variables cannot be causes of specific\nothers. Moreover, by incorporating the prior knowledge that causes precede\ntheir effects in time, we extend the first algorithm to the second method for\ncausal discovery in time series data. We validate the first proposed method by\nusing simulated data to demonstrate that the accuracy of causal discovery\nincreases as more prior knowledge is accumulated. Additionally, we test the\nsecond proposed method by comparing it with existing time series causal\ndiscovery methods, using both simulated data and real-world data."}, "http://arxiv.org/abs/2401.07259": {"title": "Inference for multivariate extremes via a semi-parametric angular-radial model", "link": "http://arxiv.org/abs/2401.07259", "description": "The modelling of multivariate extreme events is important in a wide variety\nof applications, including flood risk analysis, metocean engineering and\nfinancial modelling. A wide variety of statistical techniques have been\nproposed in the literature; however, many such methods are limited in the forms\nof dependence they can capture, or make strong parametric assumptions about\ndata structures. In this article, we introduce a novel inference framework for\nmultivariate extremes based on a semi-parametric angular-radial model. 
This\nmodel overcomes the limitations of many existing approaches and provides a\nunified paradigm for assessing joint tail behaviour. Alongside inferential\ntools, we also introduce techniques for assessing uncertainty and goodness of\nfit. Our proposed technique is tested on simulated data sets alongside observed\nmetocean time series, with results indicating generally good performance."}, "http://arxiv.org/abs/2401.07267": {"title": "Inference for high-dimensional linear expectile regression with de-biased method", "link": "http://arxiv.org/abs/2401.07267", "description": "In this paper, we address the inference problem in high-dimensional linear\nexpectile regression. We transform the expectile loss into a\nweighted-least-squares form and apply a de-biased strategy to establish\nWald-type tests for multiple constraints within a regularized framework.\nSimultaneously, we construct an estimator for the pseudo-inverse of the\ngeneralized Hessian matrix in high dimension with general amenable regularizers\nincluding Lasso and SCAD, and demonstrate its consistency through a new proof\ntechnique. We conduct simulation studies and real data applications to\ndemonstrate the efficacy of our proposed test statistic in both homoscedastic\nand heteroscedastic scenarios."}, "http://arxiv.org/abs/2401.07294": {"title": "Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies", "link": "http://arxiv.org/abs/2401.07294", "description": "Metamodels, or the regression analysis of Monte Carlo simulation (MCS)\nresults, provide a powerful tool to summarize MCS findings. However, an as-yet\nunexplored approach is the use of multilevel metamodels (MLMM) that better\naccount for the dependent data structure of MCS results that arises from\nfitting multiple models to the same simulated data set. In this study, we\narticulate the theoretical rationale for the MLMM and illustrate how it can\ndramatically improve efficiency over the traditional regression approach,\nbetter account for complex MCS designs, and provide new insights into the\ngeneralizability of MCS findings."}, "http://arxiv.org/abs/2401.07344": {"title": "Robust Genomic Prediction and Heritability Estimation using Density Power Divergence", "link": "http://arxiv.org/abs/2401.07344", "description": "This manuscript delves into the intersection of genomics and phenotypic\nprediction, focusing on the statistical innovation required to navigate the\ncomplexities introduced by noisy covariates and confounders. The primary\nemphasis is on the development of advanced robust statistical models tailored\nfor genomic prediction from single nucleotide polymorphism (SNP) data collected\nfrom genome-wide association studies (GWAS) in plant and animal breeding and\nmulti-field trials. The manuscript explores the limitations of traditional\nmarker-assisted recurrent selection, highlighting the significance of\nincorporating all estimated effects of marker loci into the statistical\nframework and aiming to reduce the high dimensionality of GWAS data while\npreserving critical information. This paper introduces a new robust statistical\nframework for genomic prediction, employing one-stage and two-stage linear\nmixed model analyses along with utilizing the popular robust minimum density\npower divergence estimator (MDPDE) to estimate genetic effects on phenotypic\ntraits. 
The study illustrates the superior performance of the proposed\nMDPDE-based genomic prediction and associated heritability estimation\nprocedures over existing competitors through extensive empirical experiments on\nartificial datasets and application to a real-life maize breeding dataset. The\nresults showcase the robustness and accuracy of the proposed MDPDE-based\napproaches, especially in the presence of data contamination, emphasizing their\npotential applications in improving breeding programs and advancing genomic\nprediction of phenotypic traits."}, "http://arxiv.org/abs/2401.07365": {"title": "Sequential permutation testing by betting", "link": "http://arxiv.org/abs/2401.07365", "description": "We develop an anytime-valid permutation test, where the dataset is fixed and\nthe permutations are sampled sequentially one by one, with the objective of\nsaving computational resources by sampling fewer permutations and stopping\nearly. The core technical advance is the development of new test martingales\n(nonnegative martingales with initial value one) for testing exchangeability\nagainst a very particular alternative. These test martingales are constructed\nusing new and simple betting strategies that smartly bet on the relative ranks\nof permuted test statistics. The betting strategy is guided by the derivation\nof a simple log-optimal betting strategy, and displays excellent power in\npractice. In contrast to a well-known method by Besag and Clifford, our method\nyields a valid e-value or a p-value at any stopping time, and with particular\nstopping rules, it yields computational gains under both the null and the\nalternative without compromising power."}, "http://arxiv.org/abs/2401.07400": {"title": "Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data", "link": "http://arxiv.org/abs/2401.07400", "description": "Investigating the relationship, particularly the lead-lag effect, between\ntime series is a common question across various disciplines, especially when\nuncovering biological processes. However, analyzing time series presents several\nchallenges. Firstly, due to technical reasons, the time points at which\nobservations are made are not at uniform intervals. Secondly, some lead-lag\neffects are transient, necessitating time-lag estimation based on a limited\nnumber of time points. Thirdly, external factors also impact these time series,\nrequiring a similarity metric to assess the lead-lag relationship. To counter\nthese issues, we introduce a model grounded in the Gaussian process, affording\nthe flexibility to estimate lead-lag effects for irregular time series. In\naddition, our method outputs dissimilarity scores, thereby broadening its\napplications to include tasks such as ranking or clustering multiple pair-wise\ntime series when considering their strength of lead-lag effects with external\nfactors. 
Crucially, we offer a series of theoretical proofs to substantiate the\nvalidity of our proposed kernels and the identifiability of kernel parameters.\nOur model demonstrates advances in various simulations and real-world\napplications, particularly in the study of dynamic chromatin interactions,\ncompared to other leading methods."}, "http://arxiv.org/abs/2401.07401": {"title": "Design-Based Estimation and Central Limit Theorems for Local Average Treatment Effects for RCTs", "link": "http://arxiv.org/abs/2401.07401", "description": "There is a growing literature on design-based methods to estimate average\ntreatment effects for randomized controlled trials (RCTs) using the\nunderpinnings of experiments. In this article, we build on these methods to\nconsider design-based regression estimators for the local average treatment\neffect (LATE) estimand for RCTs with treatment noncompliance. We prove new\nfinite-population central limit theorems for a range of designs, including\nblocked and clustered RCTs, allowing for baseline covariates to improve\nprecision. We discuss consistent variance estimators based on model residuals\nand conduct simulations that show the estimators yield confidence interval\ncoverage near nominal levels. We demonstrate the methods using data from a\nprivate school voucher RCT in New York City USA."}, "http://arxiv.org/abs/2401.07421": {"title": "A Bayesian Approach to Modeling Variance of Intensive Longitudinal Biomarker Data as a Predictor of Health Outcomes", "link": "http://arxiv.org/abs/2401.07421", "description": "Intensive longitudinal biomarker data are increasingly common in scientific\nstudies that seek temporally granular understanding of the role of behavioral\nand physiological factors in relation to outcomes of interest. Intensive\nlongitudinal biomarker data, such as those obtained from wearable devices, are\noften obtained at a high frequency typically resulting in several hundred to\nthousand observations per individual measured over minutes, hours, or days.\nOften in longitudinal studies, the primary focus is on relating the means of\nbiomarker trajectories to an outcome, and the variances are treated as nuisance\nparameters, although they may also be informative for the outcomes. In this\npaper, we propose a Bayesian hierarchical model to jointly model a\ncross-sectional outcome and the intensive longitudinal biomarkers. To model the\nvariability of biomarkers and deal with the high intensity of data, we develop\nsubject-level cubic B-splines and allow the sharing of information across\nindividuals for both the residual variability and the random effects\nvariability. Then different levels of variability are extracted and\nincorporated into an outcome submodel for inferential and predictive purposes.\nWe demonstrate the utility of the proposed model via an application involving\nbio-monitoring of hertz-level heart rate information from a study on social\nstress."}, "http://arxiv.org/abs/2401.07445": {"title": "GACE: Learning Graph-Based Cross-Page Ads Embedding For Click-Through Rate Prediction", "link": "http://arxiv.org/abs/2401.07445", "description": "Predicting click-through rate (CTR) is the core task of many ads online\nrecommendation systems, which helps improve user experience and increase\nplatform revenue. In this type of recommendation system, we often encounter two\nmain problems: the joint usage of multi-page historical advertising data and\nthe cold start of new ads. 
In this paper, we propose GACE, a graph-based\ncross-page ads embedding generation method. It can warm up and generate the\nrepresentation embedding of cold-start and existing ads across various pages.\nSpecifically, we carefully build linkages and a weighted undirected graph model\nconsidering semantic and page-type attributes to guide the direction of feature\nfusion and generation. We designed a variational auto-encoding task as a\npre-training module and generated embedding representations for new and old ads\nbased on this task. The results evaluated on the public dataset AliEC from\nRecBole and the real-world industry dataset from Alipay show that our GACE\nmethod is significantly superior to the SOTA method. In the online A/B test,\nthe click-through rate on three real-world pages from Alipay has increased by\n3.6%, 2.13%, and 3.02%, respectively. Especially in the cold-start task, the\nCTR increased by 9.96%, 7.51%, and 8.97%, respectively."}, "http://arxiv.org/abs/2401.07522": {"title": "Balancing the edge effect and dimension of spectral spatial statistics under irregular sampling with applications to isotropy testing", "link": "http://arxiv.org/abs/2401.07522", "description": "We investigate distributional properties of a class of spectral spatial\nstatistics under irregular sampling of a random field that is defined on\n$\\mathbb{R}^d$, and use this to obtain a test for isotropy. Within this\ncontext, edge effects are well-known to create a bias in classical estimators\ncommonly encountered in the analysis of spatial data. This bias increases with\ndimension $d$ and, for $d>1$, can become non-negligible in the limiting\ndistribution of such statistics to the extent that a nondegenerate distribution\ndoes not exist. We provide a general theory for a class of (integrated)\nspectral statistics that enables us to 1) significantly reduce this bias and 2)\nensure that asymptotically Gaussian limits can be derived for $d \\le 3$\nfor appropriately tapered versions of such statistics. We use this to address\nsome crucial gaps in the literature, and demonstrate that tapering with a\nsufficiently smooth function is necessary to achieve such results. Our findings\nspecifically shed new light on a recent result in Subba Rao (2018a). Our\ntheory is then used to propose a novel test for isotropy. In contrast to most\nof the literature, which validates this assumption on a finite number of\nspatial locations (or a finite number of Fourier frequencies), we develop a\ntest for isotropy on the full spatial domain by means of its characterization\nin the frequency domain. More precisely, we derive an explicit expression for\nthe minimum $L^2$-distance between the spectral density of the random field and\nits best approximation by a spectral density of an isotropic process. We prove\nasymptotic normality of an estimator of this quantity in the mixed increasing\ndomain framework and use this result to derive an asymptotic level\n$\\alpha$-test."}, "http://arxiv.org/abs/2401.07562": {"title": "Probabilistic Richardson Extrapolation", "link": "http://arxiv.org/abs/2401.07562", "description": "For over a century, extrapolation methods have provided a powerful tool to\nimprove the convergence order of a numerical method. However, these tools are\nnot well-suited to modern computer codes, where multiple continua are\ndiscretised and convergence orders are not easily analysed. 
To address this\nchallenge, we present a probabilistic perspective on Richardson extrapolation, a\npoint of view that unifies classical extrapolation methods with modern\nmulti-fidelity modelling, and handles uncertain convergence orders by allowing\nthese to be statistically estimated. The approach is developed using Gaussian\nprocesses, leading to Gauss-Richardson Extrapolation (GRE). Conditions are\nestablished under which extrapolation using the conditional mean achieves a\npolynomial (or even an exponential) speed-up compared to the original numerical\nmethod. Further, the probabilistic formulation unlocks the possibility of\nexperimental design, casting the selection of fidelities as a continuous\noptimisation problem which can then be (approximately) solved. A case-study\ninvolving a computational cardiac model demonstrates that practical gains in\naccuracy can be achieved using the GRE method."}, "http://arxiv.org/abs/2401.07625": {"title": "Statistics in Survey Sampling", "link": "http://arxiv.org/abs/2401.07625", "description": "Survey sampling theory and methods are introduced. Sampling designs and\nestimation methods are carefully discussed as a textbook for survey sampling.\nTopics include Horvitz-Thompson estimation, simple random sampling, stratified\nsampling, cluster sampling, ratio estimation, regression estimation, variance\nestimation, two-phase sampling, and nonresponse adjustment methods."}, "http://arxiv.org/abs/2401.07724": {"title": "A non-parametric estimator for Archimedean copulas under flexible censoring scenarios and an application to claims reserving", "link": "http://arxiv.org/abs/2401.07724", "description": "With insurers benefiting from ever-larger amounts of data of increasing\ncomplexity, we explore a data-driven method to model dependence within\nmultilevel claims in this paper. More specifically, we start from a\nnon-parametric estimator for Archimedean copula generators introduced by Genest\nand Rivest (1993), and we extend it to diverse flexible censoring scenarios\nusing techniques derived from survival analysis. We implement a graphical\nselection procedure for copulas that we validate using goodness-of-fit methods\napplied to complete, single-censored, and double-censored bivariate data. We\nillustrate the performance of our model with multiple simulation studies. We\nthen apply our methodology to a recent Canadian automobile insurance dataset\nwhere we seek to model the dependence between the activation delays of\ncorrelated coverages. We show that our model performs quite well in selecting\nthe best-fitted copula for the data at hand, especially when the dataset is\nlarge, and that the results can then be used as part of a larger claims\nreserving methodology."}, "http://arxiv.org/abs/2401.07767": {"title": "Estimation of the genetic Gaussian network using GWAS summary data", "link": "http://arxiv.org/abs/2401.07767", "description": "The genetic Gaussian network of multiple phenotypes constructed through the\ngenetic correlation matrix is informative for understanding their biological\ndependencies. However, its interpretation may be challenging because the\nestimated genetic correlations are biased due to estimation errors and\nhorizontal pleiotropy inherent in GWAS summary statistics. Here we introduce a\nnovel approach called Estimation of Genetic Graph (EGG), which eliminates the\nestimation error bias and horizontal pleiotropy bias with the same techniques\nused in multivariable Mendelian randomization. 
The genetic network estimated by\nEGG can be interpreted as representing shared common biological contributions\nbetween phenotypes, conditional on others, and even as indicating the causal\ncontributions. We use both simulations and real data to demonstrate the\nsuperior efficacy of our novel method in comparison with the traditional\nnetwork estimators. The R package EGG is available at\nhttps://github.com/harryyiheyang/EGG."}, "http://arxiv.org/abs/2401.07820": {"title": "Posterior shrinkage towards linear subspaces", "link": "http://arxiv.org/abs/2401.07820", "description": "It is common to hold prior beliefs that are not characterized by points in\nthe parameter space but instead are relational in nature and can be described\nby a linear subspace. While some previous work has been done to account for\nsuch prior beliefs, the focus has primarily been on point estimators within a\nregression framework. We argue, however, that prior beliefs about parameters\nought to be encoded into the prior distribution rather than in the formation of\na point estimator. In this way, the prior beliefs help shape \\textit{all}\ninference. Through exponential tilting, we propose a fully generalizable method\nof taking existing prior information from, e.g., a pilot study, and combining\nit with additional prior beliefs represented by parameters lying on a linear\nsubspace. We provide computationally efficient algorithms for posterior\ninference that, once inference is made using a non-tilted prior, do not\ndepend on the sample size. We illustrate our proposed approach on an\nantihypertensive clinical trial dataset where we shrink towards a power law\ndose-response relationship, and on monthly influenza and pneumonia data where\nwe shrink moving average lag parameters towards smoothness. Software to\nimplement the proposed approach is provided in the R package \\verb+SUBSET+\navailable on GitHub."}, "http://arxiv.org/abs/2401.08159": {"title": "Reluctant Interaction Modeling in Generalized Linear Models", "link": "http://arxiv.org/abs/2401.08159", "description": "While including pairwise interactions in a regression model can better\napproximate the response surface, fitting such an interaction model is a well-known\ndifficult problem. In particular, analyzing contemporary high-dimensional\ndatasets often leads to extremely large-scale interaction modeling problems,\nwhere the challenge is to identify important interactions among millions\nor even billions of candidate interactions. While several methods have recently\nbeen proposed to tackle this challenge, they are mostly designed by (1)\nimposing the hierarchy assumption among the important interactions and (or) (2)\nfocusing on the case of linear models with interactions and (sub)Gaussian\nerrors. In practice, however, neither of these two building blocks has to hold.\nIn this paper, we propose an interaction modeling framework in generalized\nlinear models (GLMs) which is free of any assumptions on hierarchy. We develop\na non-trivial extension of the reluctance interaction selection principle to\nthe GLMs setting, where a main effect is preferred over an interaction if all\nelse is equal. Our proposed method is easy to implement, and is highly scalable\nto large-scale datasets. Theoretically, we demonstrate that it possesses\nscreening consistency under the high-dimensional setting. 
Numerical studies on\nsimulated datasets and a real dataset show that the proposed method does not\nsacrifice statistical performance while achieving significant computational\ngains."}, "http://arxiv.org/abs/2401.08172": {"title": "On GEE for Mean-Variance-Correlation Models: Variance Estimation and Model Selection", "link": "http://arxiv.org/abs/2401.08172", "description": "Generalized estimating equations (GEE) are of great importance in analyzing\nclustered data without full specification of multivariate distributions. A\nrecent approach jointly models the mean, variance, and correlation coefficients\nof clustered data through three sets of regressions (Luo and Pan, 2022). We\nobserve that these estimating equations, however, are a special case of those\nof Yan and Fine (2004) which further allow the variance to depend on the mean\nthrough a variance function. The proposed variance estimators may be incorrect\nfor the variance and correlation parameters because of a subtle dependence\ninduced by the nested structure of the estimating equations. We characterize\nmodel settings where their variance estimation is invalid and show the variance\nestimators in Yan and Fine (2004) correctly account for such dependence. In\naddition, we introduce a novel model selection criterion that enables the\nsimultaneous selection of the mean-scale-correlation model. The sandwich\nvariance estimator and the proposed model selection criterion are tested by\nseveral simulation studies and real data analysis, which validate their\neffectiveness in variance estimation and model selection. Our work also extends\nthe R package geepack with the flexibility to apply different working\ncovariance matrices for the variance and correlation structures."}, "http://arxiv.org/abs/2401.08173": {"title": "Simultaneous Change Point Detection and Identification for High Dimensional Linear Models", "link": "http://arxiv.org/abs/2401.08173", "description": "In this article, we consider change point inference for high dimensional\nlinear models. For change point detection, given any subgroup of variables, we\npropose a new method for testing the homogeneity of corresponding regression\ncoefficients across the observations. Under some regularity conditions, the\nproposed new testing procedure controls the type I error asymptotically and is\npowerful against sparse alternatives and enjoys certain optimality. For change\npoint identification, an argmax based change point estimator is proposed which\nis shown to be consistent for the true change point location. Moreover,\ncombining with the binary segmentation technique, we further extend our new\nmethod for detecting and identifying multiple change points. Extensive\nnumerical studies justify the validity of our new method and an application to\nthe Alzheimer's disease data analysis further demonstrates its competitive\nperformance."}, "http://arxiv.org/abs/2401.08175": {"title": "Bayesian Kriging Approaches for Spatial Functional Data", "link": "http://arxiv.org/abs/2401.08175", "description": "Functional kriging approaches have been developed to predict the curves at\nunobserved spatial locations. However, most existing approaches are based on\nvariogram fittings rather than constructing hierarchical statistical models.\nTherefore, it is challenging to analyze the relationships between functional\nvariables, and uncertainty quantification of the model is not trivial. In this\nmanuscript, we propose a Bayesian framework for spatial function-on-function\nregression. 
However, inference for the proposed model has computational and\ninferential challenges because the model needs to account for within and\nbetween-curve dependencies. Furthermore, high-dimensional and spatially\ncorrelated parameters can lead to the slow mixing of Markov chain Monte Carlo\nalgorithms. To address these issues, we first utilize a basis transformation\napproach to simplify the covariance and apply projection methods for dimension\nreduction. We also develop a simultaneous band score for the proposed model to\ndetect the significant region in the regression function. We apply the methods\nto simulated and real datasets, including data on particulate matter in Japan\nand mobility data in South Korea. The proposed method is computationally\nefficient and provides accurate estimations and predictions."}, "http://arxiv.org/abs/2401.08224": {"title": "Differentially Private Estimation of CATE in Adaptive Experiment", "link": "http://arxiv.org/abs/2401.08224", "description": "Adaptive experiments are widely adopted to estimate the conditional average\ntreatment effect (CATE) in clinical trials and many other scenarios. While the\nprimary goal of the experiment is to maximize estimation accuracy, due to the\nimperative of social welfare, it is also crucial to provide treatment with\nsuperior outcomes to patients, which is measured by regret in the contextual bandit\nframework. These two objectives often lead to contrasting optimal allocation\nmechanisms. Furthermore, privacy concerns arise in clinical scenarios containing\nsensitive data such as patients' health records. Therefore, it is essential for the\ntreatment allocation mechanism to incorporate robust privacy protection\nmeasures. In this paper, we investigate the tradeoff between loss of social\nwelfare and statistical power in the contextual bandit experiment. We propose a\nmatched upper and lower bound for the multi-objective optimization problem, and\nthen adopt the concept of Pareto optimality to mathematically characterize the\noptimality condition. Furthermore, we propose differentially private algorithms\nwhich still match the lower bound, showing that privacy is \"almost free\".\nAdditionally, we derive the asymptotic normality of the estimator, which is\nessential in statistical inference and hypothesis testing."}, "http://arxiv.org/abs/2401.08303": {"title": "A Bayesian multivariate model with temporal dependence on random partition of areal data", "link": "http://arxiv.org/abs/2401.08303", "description": "More than half of the world's population is exposed to the risk of\nmosquito-borne diseases, which leads to millions of cases and hundreds of\nthousands of deaths every year. Analyzing this type of data is often complex\nand poses several interesting challenges, mainly due to the vast geographic\narea, the peculiar temporal behavior, and the potential correlation between\ninfections. Motivation stems from the analysis of tropical disease data,\nnamely, the number of cases of two arboviruses, dengue and chikungunya,\ntransmitted by the same mosquito, for all the 145 microregions in Southeast\nBrazil from 2018 to 2022. As a contribution to the literature on multivariate\ndisease data, we develop a flexible Bayesian multivariate spatio-temporal model\nwhere temporal dependence is defined for areal clusters. The model features a\nprior distribution for the random partition of areal data that incorporates\nneighboring information, thus encouraging maps with few contiguous clusters and\ndiscouraging clusters with disconnected areas. 
The model also incorporates an\nautoregressive structure and terms related to seasonal patterns into temporal\ncomponents that are disease and cluster-specific. It also considers a\nmultivariate directed acyclic graph autoregressive structure to accommodate\nspatial and inter-disease dependence, facilitating the interpretation of\nspatial correlation. We explore properties of the model by way of simulation\nstudies and show results that prove our proposal compares well to competing\nalternatives. Finally, we apply the model to the motivating dataset with a\ntwofold goal: clustering areas where the temporal trends of certain diseases are\nsimilar, and exploring the potential existence of temporal and/or spatial\ncorrelation between two diseases transmitted by the same mosquito."}, "http://arxiv.org/abs/2006.13850": {"title": "Global Sensitivity and Domain-Selective Testing for Functional-Valued Responses: An Application to Climate Economy Models", "link": "http://arxiv.org/abs/2006.13850", "description": "Understanding the dynamics and evolution of climate change and associated\nuncertainties is key for designing robust policy actions. Computer models are\nkey tools in this scientific effort, which have now reached a high level of\nsophistication and complexity. Model auditing is needed in order to better\nunderstand their results, and to deal with the fact that such models are\nincreasingly opaque with respect to their inner workings. Current techniques\nsuch as Global Sensitivity Analysis (GSA) are limited to dealing with either\nmultivariate outputs, stochastic ones, or finite-change inputs. This limits\ntheir applicability to time-varying variables such as future pathways of\ngreenhouse gases. To provide additional semantics in the analysis of a model\nensemble, we provide an extension of GSA methodologies tackling the case of\nstochastic functional outputs with finite change inputs. To deal with finite\nchange inputs and functional outputs, we propose an extension of currently\navailable GSA methodologies while we deal with the stochastic part by\nintroducing a novel, domain-selective inferential technique for sensitivity\nindices. Our method is explored via a simulation study that shows its\nrobustness and efficacy in detecting sensitivity patterns. 
We apply it to real-world\ndata, where it can provide practitioners and\npolicymakers with additional information about the time dynamics of sensitivity\npatterns, as well as information about robustness."}, "http://arxiv.org/abs/2107.00527": {"title": "Distribution-Free Prediction Bands for Multivariate Functional Time Series: an Application to the Italian Gas Market", "link": "http://arxiv.org/abs/2107.00527", "description": "Uncertainty quantification in forecasting represents a topic of great\nimportance in energy trading, as understanding the status of the energy market\nwould enable traders to directly evaluate the impact of their own offers/bids.\nTo this end, we propose a scalable procedure that outputs closed-form\nsimultaneous prediction bands for multivariate functional response variables in\na time series setting, which is able to guarantee performance bounds in terms\nof unconditional coverage and asymptotic exactness, both under some conditions.\nAfter evaluating its performance on synthetic data, the method is used to build\nmultivariate prediction bands for daily demand and offer curves in the Italian\ngas market."}, "http://arxiv.org/abs/2110.13017": {"title": "Nested $\\hat R$: Assessing the convergence of Markov chain Monte Carlo when running many short chains", "link": "http://arxiv.org/abs/2110.13017", "description": "Recent developments in Markov chain Monte Carlo (MCMC) algorithms allow us to\nrun thousands of chains in parallel almost as quickly as a single chain, using\nhardware accelerators such as GPUs. While each chain still needs to forget its\ninitial point during a warmup phase, the subsequent sampling phase can be\nshorter than in classical settings, where we run only a few chains. To\ndetermine if the resulting short chains are reliable, we need to assess how\nclose the Markov chains are to their stationary distribution after warmup. The\npotential scale reduction factor $\\widehat R$ is a popular convergence\ndiagnostic but unfortunately can require a long sampling phase to work well. We\npresent a nested design to overcome this challenge and a generalization called\nnested $\\widehat R$. This new diagnostic works under conditions similar to\n$\\widehat R$ and completes the workflow for GPU-friendly samplers. In addition,\nthe proposed nesting provides theoretical insights into the utility of\n$\\widehat R$, in both classical and short-chains regimes."}, "http://arxiv.org/abs/2111.10718": {"title": "The R2D2 Prior for Generalized Linear Mixed Models", "link": "http://arxiv.org/abs/2111.10718", "description": "In Bayesian analysis, the selection of a prior distribution is typically done\nby considering each parameter in the model. While this can be convenient, in\nmany scenarios it may be desirable to place a prior on a summary measure of the\nmodel instead. In this work, we propose a prior on the model fit, as measured\nby a Bayesian coefficient of determination ($R^2$), which then induces a prior\non the individual parameters. We achieve this by placing a beta prior on $R^2$\nand then deriving the induced prior on the global variance parameter for\ngeneralized linear mixed models. We derive closed-form expressions in many\nscenarios and present several approximation strategies when an analytic form is\nnot possible and/or to allow for easier computation. In these situations, we\nsuggest approximating the prior by using a generalized beta prime distribution\nand provide a simple default prior construction scheme. 
This approach is quite\nflexible and can be easily implemented in standard Bayesian software. Lastly,\nwe demonstrate the performance of the method on simulated and real-world data,\nwhere the method particularly shines in high-dimensional settings, as well as\nmodeling random effects."}, "http://arxiv.org/abs/2206.05337": {"title": "Integrating complex selection rules into the latent overlapping group Lasso for constructing coherent prediction models", "link": "http://arxiv.org/abs/2206.05337", "description": "The construction of coherent prediction models holds great importance in\nmedical research as such models enable health researchers to gain deeper\ninsights into disease epidemiology and clinicians to identify patients at\nhigher risk of adverse outcomes. One commonly employed approach to developing\nprediction models is variable selection through penalized regression\ntechniques. Integrating natural variable structures into this process not only\nenhances model interpretability but can also increase the likelihood of\nrecovering the true underlying model and boost prediction accuracy. However, a\nchallenge lies in determining how to effectively integrate potentially complex\nselection dependencies into the penalized regression. In this work, we\ndemonstrate how to represent selection dependencies mathematically, provide\nalgorithms for deriving the complete set of potential models, and offer a\nstructured approach for integrating complex rules into variable selection\nthrough the latent overlapping group Lasso. To illustrate our methodology, we\napplied these techniques to construct a coherent prediction model for major\nbleeding in hypertensive patients recently hospitalized for atrial fibrillation\nand subsequently prescribed oral anticoagulants. In this application, we\naccount for a proxy of anticoagulant adherence and its interaction with dosage\nand the type of oral anticoagulants in addition to drug-drug interactions."}, "http://arxiv.org/abs/2206.08756": {"title": "Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay", "link": "http://arxiv.org/abs/2206.08756", "description": "We study the tensor-on-tensor regression, where the goal is to connect tensor\nresponses to tensor covariates with a low Tucker rank parameter tensor/matrix\nwithout the prior knowledge of its intrinsic rank. We propose the Riemannian\ngradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with\nthe challenge of unknown rank by studying the effect of rank\nover-parameterization. We provide the first convergence guarantee for the\ngeneral tensor-on-tensor regression by showing that RGD and RGN respectively\nconverge linearly and quadratically to a statistically optimal estimate in both\nrank correctly-parameterized and over-parameterized settings. Our theory\nreveals an intriguing phenomenon: Riemannian optimization methods naturally\nadapt to over-parameterization without modifications to their implementation.\nWe also prove the statistical-computational gap in scalar-on-tensor regression\nby a direct low-degree polynomial argument. 
Our theory demonstrates a \"blessing\nof statistical-computational gap\" phenomenon: in a wide range of scenarios in\ntensor-on-tensor regression for tensors of order three or higher, the\ncomputationally required sample size matches what is needed by moderate rank\nover-parameterization when considering computationally feasible estimators,\nwhile there are no such benefits in the matrix settings. This shows moderate\nrank over-parameterization is essentially \"cost-free\" in terms of sample size\nin tensor-on-tensor regression of order three or higher. Finally, we conduct\nsimulation studies to show the advantages of our proposed methods and to\ncorroborate our theoretical findings."}, "http://arxiv.org/abs/2209.01396": {"title": "Small Study Regression Discontinuity Designs: Density Inclusive Study Size Metric and Performance", "link": "http://arxiv.org/abs/2209.01396", "description": "Regression discontinuity (RD) designs are popular quasi-experimental studies\nin which treatment assignment depends on whether the value of a running\nvariable exceeds a cutoff. RD designs are increasingly popular in educational\napplications due to the prevalence of cutoff-based interventions. In such\napplications sample sizes can be relatively small or there may be sparsity\naround the cutoff. We propose a metric, density inclusive study size (DISS),\nthat characterizes the size of an RD study better than overall sample size by\nincorporating the density of the running variable. We show the usefulness of\nthis metric in a Monte Carlo simulation study that compares the operating\ncharacteristics of popular nonparametric RD estimation methods in small\nstudies. We also apply the DISS metric and RD estimation methods to school\naccountability data from the state of Indiana."}, "http://arxiv.org/abs/2212.06108": {"title": "Tandem clustering with invariant coordinate selection", "link": "http://arxiv.org/abs/2212.06108", "description": "For multivariate data, tandem clustering is a well-known technique aiming to\nimprove cluster identification through initial dimension reduction.\nNevertheless, the usual approach using principal component analysis (PCA) has\nbeen criticized for focusing solely on inertia so that the first components do\nnot necessarily retain the structure of interest for clustering. To address\nthis limitation, a new tandem clustering approach based on invariant coordinate\nselection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS\nis designed to find structure in the data while providing affine invariant\ncomponents. Certain theoretical results have been previously derived and\nguarantee that under some elliptical mixture models, the group structure can be\nhighlighted on a subset of the first and/or last components. However, ICS has\ngarnered minimal attention within the context of clustering. Two challenges\nassociated with ICS include choosing the pair of scatter matrices and selecting\nthe components to retain. For effective clustering purposes, it is demonstrated\nthat the best scatter pairs consist of one scatter matrix capturing the\nwithin-cluster structure and another capturing the global structure. For the\nformer, local shape or pairwise scatters are of great interest, as is the\nminimum covariance determinant (MCD) estimator based on a carefully chosen\nsubset size that is smaller than usual. The performance of ICS as a dimension\nreduction method is evaluated in terms of preserving the cluster structure in\nthe data. 
In an extensive simulation study and empirical applications with\nbenchmark data sets, various combinations of scatter matrices as well as\ncomponent selection criteria are compared in situations with and without\noutliers. Overall, the new approach of tandem clustering with ICS shows\npromising results and clearly outperforms the PCA-based approach."}, "http://arxiv.org/abs/2212.07687": {"title": "Networks of reinforced stochastic processes: probability of asymptotic polarization and related general results", "link": "http://arxiv.org/abs/2212.07687", "description": "In a network of reinforced stochastic processes, for certain values of the\nparameters, all the agents' inclinations synchronize and converge almost surely\ntoward a certain random variable. The present work aims at clarifying when the\nagents can asymptotically polarize, i.e. when the common limit inclination can\ntake the extreme values, 0 or 1, with probability zero, strictly positive, or\nequal to one. Moreover, we present a suitable technique to estimate this\nprobability that, along with the theoretical results, has been framed in the\nmore general setting of a class of martingales taking values in [0, 1] and\nfollowing a specific dynamics."}, "http://arxiv.org/abs/2303.17478": {"title": "A Bayesian Dirichlet Auto-Regressive Moving Average Model for Forecasting Lead Times", "link": "http://arxiv.org/abs/2303.17478", "description": "Lead time data is compositional data found frequently in the hospitality\nindustry. Hospitality businesses earn fees each day, however these fees cannot\nbe recognized until later. For business purposes, it is important to understand\nand forecast the distribution of future fees for the allocation of resources,\nfor business planning, and for staffing. Motivated by 5 years of daily fees\ndata, we propose a new class of Bayesian time series models, a Bayesian\nDirichlet Auto-Regressive Moving Average (B-DARMA) model for compositional time\nseries, modeling the proportion of future fees that will be recognized in 11\nconsecutive 30 day windows and 1 last consecutive 35 day window. Each day's\ncompositional datum is modeled as Dirichlet distributed given the mean and a\nscale parameter. The mean is modeled with a Vector Autoregressive Moving\nAverage process after transforming with an additive log ratio link function and\ndepends on previous compositional data, previous compositional parameters and\ndaily covariates. The B-DARMA model offers solutions to data analyses of large\ncompositional vectors and short or long time series, offers efficiency gains\nthrough choice of priors, provides interpretable parameters for inference, and\nmakes reasonable forecasts."}, "http://arxiv.org/abs/2306.17361": {"title": "iSCAN: Identifying Causal Mechanism Shifts among Nonlinear Additive Noise Models", "link": "http://arxiv.org/abs/2306.17361", "description": "Structural causal models (SCMs) are widely used in various disciplines to\nrepresent causal relationships among variables in complex systems.\nUnfortunately, the underlying causal structure is often unknown, and estimating\nit from data remains a challenging task. In many situations, however, the end\ngoal is to localize the changes (shifts) in the causal mechanisms between\nrelated datasets instead of learning the full causal structure of the\nindividual datasets. Some applications include root cause analysis, analyzing\ngene regulatory network structure changes between healthy and cancerous\nindividuals, or explaining distribution shifts. 
This paper focuses on\nidentifying the causal mechanism shifts in two or more related datasets over\nthe same set of variables -- without estimating the entire DAG structure of\neach SCM. Prior work under this setting assumed linear models with Gaussian\nnoises; instead, in this work we assume that each SCM belongs to the more\ngeneral class of nonlinear additive noise models (ANMs). A key technical\ncontribution of this work is to show that the Jacobian of the score function\nfor the mixture distribution allows for the identification of shifts under\ngeneral non-parametric functional mechanisms. Once the shifted variables are\nidentified, we leverage recent work to estimate the structural differences, if\nany, for the shifted variables. Experiments on synthetic and real-world data\nare provided to showcase the applicability of this approach. Code implementing\nthe proposed method is open-source and publicly available at\nhttps://github.com/kevinsbello/iSCAN."}, "http://arxiv.org/abs/2307.10841": {"title": "A criterion and incremental design construction for simultaneous kriging predictions", "link": "http://arxiv.org/abs/2307.10841", "description": "In this paper, we further investigate the problem of selecting a set of\ndesign points for universal kriging, which is a widely used technique for\nspatial data analysis. Our goal is to select the design points in order to make\nsimultaneous predictions of the random variable of interest at a finite number\nof unsampled locations with maximum precision. Specifically, we consider as\nresponse a correlated random field given by a linear model with an unknown\nparameter vector and a spatial error correlation structure. We propose a new\ndesign criterion that aims at simultaneously minimizing the variation of the\nprediction errors at various points. We also present various efficient\ntechniques for incrementally building designs for that criterion scaling well\nfor high dimensions. Thus the method is particularly suitable for big data\napplications in areas of spatial data analysis such as mining, hydrogeology,\nnatural resource monitoring, and environmental sciences or equivalently for any\ncomputer simulation experiments. We have demonstrated the effectiveness of the\nproposed designs through two illustrative examples: one by simulation and\nanother based on real data from Upper Austria."}, "http://arxiv.org/abs/2307.15404": {"title": "Information-based Preprocessing of PLC Data for Automatic Behavior Modeling", "link": "http://arxiv.org/abs/2307.15404", "description": "Cyber-physical systems (CPS) offer immense optimization potential for\nmanufacturing processes through the availability of multivariate time series\ndata of actors and sensors. Based on automated analysis software, the\ndeployment of adaptive and responsive measures is possible for time series\ndata. Due to the complex and dynamic nature of modern manufacturing, analysis\nand modeling often cannot be entirely automated. Even machine- or deep learning\napproaches often depend on a priori expert knowledge and labelling. In this\npaper, an information-based data preprocessing approach is proposed. By\napplying statistical methods including variance and correlation analysis, an\napproximation of the sampling rate in event-based systems and the utilization\nof spectral analysis, knowledge about the underlying manufacturing processes\ncan be gained prior to modeling. 
The paper presents how statistical analysis\nenables the pruning of a dataset's least important features and how the\nsampling rate approximation approach sets the basis for further data analysis\nand modeling. The data's underlying periodicity, originating from the cyclic\nnature of an automated manufacturing process, will be detected by utilizing the\nfast Fourier transform. This information-based preprocessing method will then\nbe validated for process time series data of cyber-physical systems'\nprogrammable logic controllers (PLC)."}, "http://arxiv.org/abs/2308.05205": {"title": "Dynamic survival analysis: modelling the hazard function via ordinary differential equations", "link": "http://arxiv.org/abs/2308.05205", "description": "The hazard function represents one of the main quantities of interest in the\nanalysis of survival data. We propose a general approach for parametrically\nmodelling the dynamics of the hazard function using systems of autonomous\nordinary differential equations (ODEs). This modelling approach can be used to\nprovide qualitative and quantitative analyses of the evolution of the hazard\nfunction over time. Our proposal capitalises on the extensive literature of\nODEs which, in particular, allow for establishing basic rules or laws on the\ndynamics of the hazard function via the use of autonomous ODEs. We show how to\nimplement the proposed modelling framework in cases where there is an analytic\nsolution to the system of ODEs or where an ODE solver is required to obtain a\nnumerical solution. We focus on the use of a Bayesian modelling approach, but\nthe proposed methodology can also be coupled with maximum likelihood\nestimation. A simulation study is presented to illustrate the performance of\nthese models and the interplay of sample size and censoring. Two case studies\nusing real data are presented to illustrate the use of the proposed approach\nand to highlight the interpretability of the corresponding models. We conclude\nwith a discussion on potential extensions of our work and strategies to include\ncovariates into our framework."}, "http://arxiv.org/abs/2309.14658": {"title": "Improvements on Scalable Stochastic Bayesian Inference Methods for Multivariate Hawkes Process", "link": "http://arxiv.org/abs/2309.14658", "description": "Multivariate Hawkes Processes (MHPs) are a class of point processes that can\naccount for complex temporal dynamics among event sequences. In this work, we\nstudy the accuracy and computational efficiency of three classes of algorithms\nwhich, while widely used in the context of Bayesian inference, have rarely been\napplied in the context of MHPs: stochastic gradient expectation-maximization,\nstochastic gradient variational inference and stochastic gradient Langevin\nMonte Carlo. An important contribution of this paper is a novel approximation\nto the likelihood function that allows us to retain the computational\nadvantages associated with conjugate settings while reducing approximation\nerrors associated with the boundary effects. 
The comparisons are based on\nvarious simulated scenarios as well as an application to the study of risk\ndynamics in the Standard & Poor's 500 intraday index prices among its 11\nsectors."}, "http://arxiv.org/abs/2401.08606": {"title": "Forking paths in financial economics", "link": "http://arxiv.org/abs/2401.08606", "description": "We argue that spanning large numbers of degrees of freedom in empirical\nanalysis allows better characterizations of effects and thus improves the\ntrustworthiness of conclusions. Our ideas are illustrated in three studies:\nequity premium prediction, asset pricing anomalies and risk premia estimation.\nIn the first, we find that each additional degree of freedom in the protocol\nexpands the average range of $t$-statistics by at least 30%. In the second, we\nshow that resorting to forking paths instead of bootstrapping in multiple\ntesting raises the bar of significance for anomalies: at the 5% confidence\nlevel, the threshold for bootstrapped statistics is 4.5, whereas with paths, it\nis at least 8.2, a bar much higher than those currently used in the literature.\nIn our third application, we reveal the importance of particular steps in the\nestimation of premia. In addition, we use paths to corroborate prior findings\nin the three topics. We document heterogeneity in our ability to replicate\nprior studies: some conclusions seem robust, others do not align with the paths\nwe were able to generate."}, "http://arxiv.org/abs/2401.08626": {"title": "Validation and Comparison of Non-Stationary Cognitive Models: A Diffusion Model Application", "link": "http://arxiv.org/abs/2401.08626", "description": "Cognitive processes undergo various fluctuations and transient states across\ndifferent temporal scales. Superstatistics are emerging as a flexible framework\nfor incorporating such non-stationary dynamics into existing cognitive model\nclasses. In this work, we provide the first experimental validation of\nsuperstatistics and formal comparison of four non-stationary diffusion decision\nmodels in a specifically designed perceptual decision-making task. Task\ndifficulty and speed-accuracy trade-off were systematically manipulated to\ninduce expected changes in model parameters. To validate our models, we assess\nwhether the inferred parameter trajectories align with the patterns and\nsequences of the experimental manipulations. To address computational\nchallenges, we present novel deep learning techniques for amortized Bayesian\nestimation and comparison of models with time-varying parameters. Our findings\nindicate that transition models incorporating both gradual and abrupt parameter\nshifts provide the best fit to the empirical data. Moreover, we find that the\ninferred parameter trajectories closely mirror the sequence of experimental\nmanipulations. Posterior re-simulations further underscore the ability of the\nmodels to faithfully reproduce critical data patterns. Accordingly, our results\nsuggest that the inferred non-stationary dynamics may reflect actual changes in\nthe targeted psychological constructs. We argue that our initial experimental\nvalidation paves the way for the widespread application of superstatistics in\ncognitive modeling and beyond."}, "http://arxiv.org/abs/2401.08702": {"title": "Do We Really Even Need Data?", "link": "http://arxiv.org/abs/2401.08702", "description": "As artificial intelligence and machine learning tools become more accessible,\nand scientists face new obstacles to data collection (e.g. 
rising costs,\ndeclining survey response rates), researchers increasingly use predictions from\npre-trained algorithms as outcome variables. Though appealing for financial and\nlogistical reasons, using standard tools for inference can misrepresent the\nassociation between independent variables and the outcome of interest when the\ntrue, unobserved outcome is replaced by a predicted value. In this paper, we\ncharacterize the statistical challenges inherent to this so-called\n``post-prediction inference'' problem and elucidate three potential sources of\nerror: (i) the relationship between predicted outcomes and their true,\nunobserved counterparts, (ii) robustness of the machine learning model to\nresampling or uncertainty about the training data, and (iii) appropriately\npropagating not just bias but also uncertainty from predictions into the\nultimate inference procedure. We also contrast the framework for\npost-prediction inference with classical work spanning several related fields,\nincluding survey sampling, missing data, and semi-supervised learning. This\ncontrast elucidates the role of design in both classical and modern inference\nproblems."}, "http://arxiv.org/abs/2401.08875": {"title": "DCRMTA: Unbiased Causal Representation for Multi-touch Attribution", "link": "http://arxiv.org/abs/2401.08875", "description": "Multi-touch attribution (MTA) currently plays a pivotal role in achieving a\nfair estimation of the contributions of each advertising touchpoint towards\nconversion behavior, deeply influencing budget allocation and advertising\nrecommendation. Traditional multi-touch attribution methods initially build a\nconversion prediction model, anticipating learning the inherent relationship\nbetween touchpoint sequences and user purchasing behavior through historical\ndata. Based on this, counterfactual touchpoint sequences are constructed from\nthe original sequence subset, and conversions are estimated using the\nprediction model, thus calculating advertising contributions. A covert\nassumption of these methods is the unbiased nature of conversion prediction\nmodels. However, due to confounding factors arising from user\npreferences and internet recommendation mechanisms, such as the homogenization of\nad recommendations resulting from past shopping records, bias can easily occur\nin conversion prediction models trained on observational data. This paper\nredefines the causal effect of user features on conversions and proposes a\nnovel end-to-end approach, Deep Causal Representation for MTA (DCRMTA). Our\nmodel, while eliminating confounding variables, extracts features with causal\nrelations to conversions from users. Furthermore, extensive experiments on\nboth synthetic and real-world Criteo data demonstrate DCRMTA's superior\nperformance in conversion prediction across varying data distributions, while\nalso effectively attributing value across different advertising channels."}, "http://arxiv.org/abs/2401.08941": {"title": "A Powerful and Precise Feature-level Filter using Group Knockoffs", "link": "http://arxiv.org/abs/2401.08941", "description": "Selecting important features that have substantial effects on the response\nwith provable type-I error rate control is a fundamental concern in statistics,\nwith wide-ranging practical applications. 
Existing knockoff filters, although\nshown to provide theoretical guarantee on false discovery rate (FDR) control,\noften struggle to strike a balance between high power and precision in\npinpointing important features when there exist large groups of strongly\ncorrelated features. To address this challenge, we develop a new filter using\ngroup knockoffs to achieve both powerful and precise selection of important\nfeatures. Via experiments of simulated data and analysis of a real Alzheimer's\ndisease genetic dataset, it is found that the proposed filter can not only\ncontrol the proportion of false discoveries but also identify important\nfeatures with comparable power and greater precision than the existing group\nknockoffs filter."}, "http://arxiv.org/abs/2401.09379": {"title": "Merging uncertainty sets via majority vote", "link": "http://arxiv.org/abs/2401.09379", "description": "Given $K$ uncertainty sets that are arbitrarily dependent -- for example,\nconfidence intervals for an unknown parameter obtained with $K$ different\nestimators, or prediction sets obtained via conformal prediction based on $K$\ndifferent algorithms on shared data -- we address the question of how to\nefficiently combine them in a black-box manner to produce a single uncertainty\nset. We present a simple and broadly applicable majority vote procedure that\nproduces a merged set with nearly the same error guarantee as the input sets.\nWe then extend this core idea in a few ways: we show that weighted averaging\ncan be a powerful way to incorporate prior information, and a simple\nrandomization trick produces strictly smaller merged sets without altering the\ncoverage guarantee. Along the way, we prove an intriguing result that R\\\"uger's\ncombination rules (eg: twice the median of dependent p-values is a p-value) can\nbe strictly improved with randomization. When deployed in online settings, we\nshow how the exponential weighted majority algorithm can be employed in order\nto learn a good weighting over time. We then combine this method with adaptive\nconformal inference to deliver a simple conformal online model aggregation\n(COMA) method for nonexchangeable data."}, "http://arxiv.org/abs/2401.09381": {"title": "Modelling clusters in network time series with an application to presidential elections in the USA", "link": "http://arxiv.org/abs/2401.09381", "description": "Network time series are becoming increasingly relevant in the study of\ndynamic processes characterised by a known or inferred underlying network\nstructure. Generalised Network Autoregressive (GNAR) models provide a\nparsimonious framework for exploiting the underlying network, even in the\nhigh-dimensional setting. We extend the GNAR framework by introducing the\n$\\textit{community}$-$\\alpha$ GNAR model that exploits prior knowledge and/or\nexogenous variables for identifying and modelling dynamic interactions across\ncommunities in the underlying network. We further analyse the dynamics of\n$\\textit{Red, Blue}$ and $\\textit{Swing}$ states throughout presidential\nelections in the USA. Our analysis shows that dynamics differ among the\nstate-wise clusters."}, "http://arxiv.org/abs/2401.09401": {"title": "PERMUTOOLS: A MATLAB Package for Multivariate Permutation Testing", "link": "http://arxiv.org/abs/2401.09401", "description": "Statistical hypothesis testing and effect size measurement are routine parts\nof quantitative research. 
Advancements in computer processing power have\ngreatly improved the capability of statistical inference through the\navailability of resampling methods. However, many of the statistical practices\nused today are based on traditional, parametric methods that rely on\nassumptions about the underlying population. These assumptions may not always\nbe valid, leading to inaccurate results and misleading interpretations.\nPermutation testing, on the other hand, generates the sampling distribution\nempirically by permuting the observed data, providing distribution-free\nhypothesis testing. Furthermore, this approach lends itself to a powerful\nmethod for multiple comparison correction - known as max correction - which is\nless prone to type II errors than conventional correction methods. Parametric\nmethods have also traditionally been utilized for estimating the confidence\ninterval of various test statistics and effect size measures. However, these\ntoo can be estimated empirically using permutation or bootstrapping techniques.\nWhilst resampling methods are generally considered preferable, many popular\nprogramming languages and statistical software packages lack efficient\nimplementations. Here, we introduce PERMUTOOLS, a MATLAB package for\nmultivariate permutation testing and effect size measurement."}, "http://arxiv.org/abs/2008.03073": {"title": "From the power law to extreme value mixture distributions", "link": "http://arxiv.org/abs/2008.03073", "description": "The power law is useful in describing count phenomena such as network degrees\nand word frequencies. With a single parameter, it captures the main feature\nthat the frequencies are linear on the log-log scale. Nevertheless, there have\nbeen criticisms of the power law, for example that a threshold needs to be\npre-selected without its uncertainty quantified, that the power law is simply\ninadequate, and that subsequent hypothesis tests are required to determine\nwhether the data could have come from the power law. We propose a modelling\nframework that combines two different generalisations of the power law, namely\nthe generalised Pareto distribution and the Zipf-polylog distribution, to\nresolve these issues. The proposed mixture distributions are shown to fit the\ndata well and quantify the threshold uncertainty in a natural way. A model\nselection step embedded in the Bayesian inference algorithm further answers the\nquestion whether the power law is adequate."}, "http://arxiv.org/abs/2211.11884": {"title": "Parameter Estimation in Nonlinear Multivariate Stochastic Differential Equations Based on Splitting Schemes", "link": "http://arxiv.org/abs/2211.11884", "description": "Surprisingly, general estimators for nonlinear continuous time models based\non stochastic differential equations are yet lacking. Most applications still\nuse the Euler-Maruyama discretization, despite many proofs of its bias. More\nsophisticated methods, such as Kessler's Gaussian approximation, Ozaki's Local\nLinearization, A\\\"it-Sahalia's Hermite expansions, or MCMC methods, lack a\nstraightforward implementation, do not scale well with increasing model\ndimension or can be numerically unstable. We propose two efficient and\neasy-to-implement likelihood-based estimators based on the Lie-Trotter (LT) and\nthe Strang (S) splitting schemes. We prove that S has $L^p$ convergence rate of\norder 1, a property already known for LT. 
We show that the estimators are\nconsistent and asymptotically efficient under the less restrictive one-sided\nLipschitz assumption. A numerical study on the 3-dimensional stochastic Lorenz\nsystem complements our theoretical findings. The simulation shows that the S\nestimator performs best in terms of precision and computational speed\ncompared to the state-of-the-art."}, "http://arxiv.org/abs/2212.00703": {"title": "Data Integration Via Analysis of Subspaces (DIVAS)", "link": "http://arxiv.org/abs/2212.00703", "description": "Modern data collection in many data paradigms, including bioinformatics,\noften incorporates multiple traits derived from different data types (i.e.\nplatforms). We call this data multi-block, multi-view, or multi-omics data. The\nemergent field of data integration develops and applies new methods for\nstudying multi-block data and identifying how different data types relate and\ndiffer. One major frontier in contemporary data integration research is\nmethodology that can identify partially-shared structure between\nsub-collections of data types. This work presents a new approach: Data\nIntegration Via Analysis of Subspaces (DIVAS). DIVAS combines new insights in\nangular subspace perturbation theory with recent developments in matrix signal\nprocessing and convex-concave optimization into one algorithm for exploring\npartially-shared structure. Based on principal angles between subspaces, DIVAS\nprovides built-in inference on the results of the analysis, and is effective\neven in high-dimension-low-sample-size (HDLSS) situations."}, "http://arxiv.org/abs/2301.05636": {"title": "Improving Power by Conditioning on Less in Post-selection Inference for Changepoints", "link": "http://arxiv.org/abs/2301.05636", "description": "Post-selection inference has recently been proposed as a way of quantifying\nuncertainty about detected changepoints. The idea is to run a changepoint\ndetection algorithm, and then re-use the same data to perform a test for a\nchange near each of the detected changes. By defining the p-value for the test\nappropriately, so that it is conditional on the information used to choose the\ntest, this approach will produce valid p-values. We show how to improve the\npower of these procedures by conditioning on less information. This gives rise\nto an ideal selective p-value that is intractable but can be approximated by\nMonte Carlo. We show that for any Monte Carlo sample size, this procedure\nproduces valid p-values, and empirically that a noticeable increase in power is\npossible with only very modest Monte Carlo sample sizes. Our procedure is easy\nto implement given existing post-selection inference methods, as we just need\nto generate perturbations of the data set and re-apply the post-selection\nmethod to each of these. On genomic data consisting of human GC content, our\nprocedure increases the number of significant changepoints that are detected\nfrom e.g. 17 to 27, when compared to existing methods."}, "http://arxiv.org/abs/2309.03969": {"title": "Estimating the prevalence of indirect effects and other spillovers", "link": "http://arxiv.org/abs/2309.03969", "description": "In settings where interference between units is possible, we define the\nprevalence of indirect effects to be the number of units who are affected by\nthe treatment of others. 
This quantity does not fully identify an indirect\neffect, but may be used to show whether such effects are widely prevalent.\nGiven a randomized experiment with binary-valued outcomes, methods are\npresented for conservative point estimation and one-sided interval estimation.\nNo assumptions beyond randomization of treatment are required, allowing for\nusage in settings where models or assumptions on interference might be\nquestionable. To show asymptotic coverage of our intervals in settings not\ncovered by existing results, we provide a central limit theorem that combines\nlocal dependence and sampling without replacement. Consistency and minimax\nproperties of the point estimator are shown as well. The approach is\ndemonstrated on an experiment in which students were treated for a highly\ntransmissible parasitic infection, for which we find that a significant\nfraction of students were affected by the treatment of schools other than their\nown."}, "http://arxiv.org/abs/2401.09559": {"title": "Asymptotic Online FWER Control for Dependent Test Statistics", "link": "http://arxiv.org/abs/2401.09559", "description": "In online multiple testing, an a priori unknown number of hypotheses are\ntested sequentially, i.e. at each time point a test decision for the current\nhypothesis has to be made using only the data available so far. Although many\npowerful test procedures have been developed for online error control in recent\nyears, most of them are designed solely for independent or at most locally\ndependent test statistics. In this work, we provide a new framework for\nderiving online multiple test procedures which ensure asymptotic (with\nrespect to the sample size) control of the familywise error rate (FWER),\nregardless of the dependence structure between test statistics. In this\ncontext, we give a few concrete examples of such test procedures and discuss\ntheir properties. Furthermore, we conduct a simulation study in which the type\nI error control of these test procedures is also confirmed for a finite sample\nsize and a gain in power is indicated."}, "http://arxiv.org/abs/2401.09641": {"title": "Functional Linear Non-Gaussian Acyclic Model for Causal Discovery", "link": "http://arxiv.org/abs/2401.09641", "description": "In causal discovery, non-Gaussianity has been used to characterize the\ncomplete configuration of a Linear Non-Gaussian Acyclic Model (LiNGAM),\nencompassing both the causal ordering of variables and their respective\nconnection strengths. However, LiNGAM can only deal with the finite-dimensional\ncase. To expand this concept, we extend the notion of variables to encompass\nvectors and even functions, leading to the Functional Linear Non-Gaussian\nAcyclic Model (Func-LiNGAM). Our motivation stems from the desire to identify\ncausal relationships in brain-effective connectivity tasks involving, for\nexample, fMRI and EEG datasets. We demonstrate why the original LiNGAM fails to\nhandle these inherently infinite-dimensional datasets and explain the\navailability of functional data analysis from both empirical and theoretical\nperspectives. We establish theoretical guarantees of the identifiability of\nthe causal relationship among non-Gaussian random vectors and even random\nfunctions in infinite-dimensional Hilbert spaces. To address the issue of\nsparsity in discrete time points within intrinsic infinite-dimensional\nfunctional data, we propose optimizing the coordinates of the vectors using\nfunctional principal component analysis. 
Experimental results on synthetic data\nverify the ability of the proposed framework to identify causal relationships\namong multivariate functions using the observed samples. For real data, we\nfocus on analyzing the brain connectivity patterns derived from fMRI data."}, "http://arxiv.org/abs/2401.09696": {"title": "Rejection Sampling with Vertical Weighted Strips", "link": "http://arxiv.org/abs/2401.09696", "description": "A number of distributions that arise in statistical applications can be\nexpressed in the form of a weighted density: the product of a base density and\na nonnegative weight function. Generating variates from such a distribution may\nbe nontrivial and can involve an intractable normalizing constant. Rejection\nsampling may be used to generate exact draws, but requires formulation of a\nsuitable proposal distribution. To be practically useful, the proposal must\nboth be convenient to sample from and not reject candidate draws too\nfrequently. A well-known approach to design a proposal involves decomposing the\ntarget density into a finite mixture, whose components may correspond to a\npartition of the support. This work considers such a construction that focuses\non majorization of the weight function. This approach may be applicable when\nassumptions for adaptive rejection sampling and related algorithms are not met.\nAn upper bound for the rejection probability based on this construction can be\nexpressed to evaluate the efficiency of the proposal before sampling. A method\nto partition the support is considered where regions are bifurcated based on\ntheir contribution to the bound. Examples based on the von Mises Fisher\ndistribution and Gaussian Process regression are provided to illustrate the\nmethod."}, "http://arxiv.org/abs/2401.09715": {"title": "Fast Variational Inference of Latent Space Models for Dynamic Networks Using Bayesian P-Splines", "link": "http://arxiv.org/abs/2401.09715", "description": "Latent space models (LSMs) are often used to analyze dynamic (time-varying)\nnetworks that evolve in continuous time. Existing approaches to Bayesian\ninference for these models rely on Markov chain Monte Carlo algorithms, which\ncannot handle modern large-scale networks. To overcome this limitation, we\nintroduce a new prior for continuous-time LSMs based on Bayesian P-splines that\nallows the posterior to adapt to the dimension of the latent space and the\ntemporal variation in each latent position. We propose a stochastic variational\ninference algorithm to estimate the model parameters. We use stochastic\noptimization to subsample both dyads and observed time points to design a fast\nalgorithm that is linear in the number of edges in the dynamic network.\nFurthermore, we establish non-asymptotic error bounds for point estimates\nderived from the variational posterior. To our knowledge, this is the first\nsuch result for Bayesian estimators of continuous-time LSMs. Lastly, we use the\nmethod to analyze a large data set of international conflicts consisting of\n4,456,095 relations from 2018 to 2022."}, "http://arxiv.org/abs/2401.09719": {"title": "Kernel-based multi-marker tests of association based on the accelerated failure time model", "link": "http://arxiv.org/abs/2401.09719", "description": "Kernel-based multi-marker tests for survival outcomes use primarily the Cox\nmodel to adjust for covariates. The proportional hazards assumption made by the\nCox model could be unrealistic, especially in the long-term follow-up. 
We\ndevelop a suite of novel multi-marker survival tests for genetic association\nbased on the accelerated failure time model, which is a popular alternative to\nthe Cox model due to its direct physical interpretation. The tests are based on\nthe asymptotic distributions of their test statistics and are thus\ncomputationally efficient. The association tests can account for the\nheterogeneity of genetic effects across sub-populations/individuals to increase\nthe power. All the new tests can deal with competing risks and left truncation.\nMoreover, we develop small-sample corrections to the tests to improve their\naccuracy under small samples. Extensive numerical experiments show that the new\ntests perform very well in various scenarios. An application to a genetic\ndataset of Alzheimer's disease illustrates the tests' practical utility."}, "http://arxiv.org/abs/2401.09816": {"title": "Jackknife empirical likelihood ratio test for testing the equality of semivariance", "link": "http://arxiv.org/abs/2401.09816", "description": "Semivariance is a measure of the dispersion of all observations that fall\nabove the mean or target value of a random variable and it plays an important\nrole in life-length, actuarial and income studies. In this paper, we develop a\nnew non-parametric test for equality of upper semi-variance. We use the\nU-statistic theory to derive the test statistic and then study the asymptotic\nproperties of the test statistic. We also develop a jackknife empirical\nlikelihood (JEL) ratio test for equality of upper Semivariance. Extensive Monte\nCarlo simulation studies are carried out to validate the performance of the\nproposed JEL-based test. We illustrate the test procedure using real data."}, "http://arxiv.org/abs/2401.09994": {"title": "Bayesian modeling of spatial ordinal data from health surveys", "link": "http://arxiv.org/abs/2401.09994", "description": "Health surveys allow exploring health indicators that are of great value from\na public health point of view and that cannot normally be studied from regular\nhealth registries. These indicators are usually coded as ordinal variables and\nmay depend on covariates associated with individuals. In this paper, we propose\na Bayesian individual-level model for small-area estimation of survey-based\nhealth indicators. A categorical likelihood is used at the first level of the\nmodel hierarchy to describe the ordinal data, and spatial dependence among\nsmall areas is taken into account by using a conditional autoregressive (CAR)\ndistribution. Post-stratification of the results of the proposed\nindividual-level model allows extrapolating the results to any administrative\nareal division, even for small areas. We apply this methodology to the analysis\nof the Health Survey of the Region of Valencia (Spain) of 2016 to describe the\ngeographical distribution of a self-perceived health indicator of interest in\nthis region."}, "http://arxiv.org/abs/2401.10010": {"title": "A global kernel estimator for partially linear varying coefficient additive hazards models", "link": "http://arxiv.org/abs/2401.10010", "description": "In biomedical studies, we are often interested in the association between\ndifferent types of covariates and the times to disease events. Because the\nrelationship between the covariates and event times is often complex, standard\nsurvival models that assume a linear covariate effect are inadequate. 
A\nflexible class of models for capturing complex interaction effects among types\nof covariates is the class of varying coefficient models, where the effects of one type of\ncovariates can be modified by another type of covariates. In this paper, we\nstudy kernel-based estimation methods for varying coefficient additive hazards\nmodels. Unlike many existing kernel-based methods that use a local neighborhood\nof subjects for the estimation of the varying coefficient function, we propose\na novel global approach that is generally more efficient. We establish\ntheoretical properties of the proposed estimators and demonstrate their\nsuperior performance compared with existing local methods through large-scale\nsimulation studies. To illustrate the proposed method, we provide an\napplication to a motivating cancer genomic study."}, "http://arxiv.org/abs/2401.10057": {"title": "A method for characterizing disease emergence curves from paired pathogen detection and serology data", "link": "http://arxiv.org/abs/2401.10057", "description": "Wildlife disease surveillance programs and research studies track infection\nand identify risk factors for wild populations, humans, and agriculture. Often,\nseveral types of samples are collected from individuals to provide more\ncomplete information about an animal's infection history. Methods that jointly\nanalyze multiple data streams to study disease emergence and drivers of\ninfection via epidemiological process models remain underdeveloped.\nJoint-analysis methods can more thoroughly analyze all available data, more\nprecisely quantifying epidemic processes, outbreak status, and risks. We\ncontribute a paired data modeling approach that analyzes multiple samples from\nindividuals. We use \"characterization maps\" to link paired data to\nepidemiological processes through a hierarchical statistical observation model.\nOur approach can provide both Bayesian and frequentist estimates of\nepidemiological parameters and state. We motivate our approach through the need\nto use paired pathogen and antibody detection tests to estimate parameters and\ninfection trajectories for the widely applicable susceptible, infectious,\nrecovered (SIR) model. We contribute general formulas to link characterization\nmaps to arbitrary process models and datasets and an extended SIR model that\nbetter accommodates paired data. We find via simulation that paired data can\nmore efficiently estimate SIR parameters than unpaired data, requiring samples\nfrom 5-10 times fewer individuals. We then study SARS-CoV-2 in wild\nWhite-tailed deer (Odocoileus virginianus) from three counties in the United\nStates. Estimates for average infectious times corroborate captive animal\nstudies. Our methods use general statistical theory to let applications extend\nbeyond the SIR model we consider, and to more complicated examples of paired\ndata."}, "http://arxiv.org/abs/2401.10124": {"title": "Lower Ricci Curvature for Efficient Community Detection", "link": "http://arxiv.org/abs/2401.10124", "description": "This study introduces the Lower Ricci Curvature (LRC), a novel, scalable, and\nscale-free discrete curvature designed to enhance community detection in\nnetworks. Addressing the computational challenges posed by existing\ncurvature-based methods, LRC offers a streamlined approach with linear\ncomputational complexity, making it well-suited for large-scale network\nanalysis. We further develop an LRC-based preprocessing method that effectively\naugments popular community detection algorithms. 
Through comprehensive\nsimulations and applications on real-world datasets, including the NCAA\nfootball league network, the DBLP collaboration network, the Amazon product\nco-purchasing network, and the YouTube social network, we demonstrate the\nefficacy of our method in significantly improving the performance of various\ncommunity detection algorithms."}, "http://arxiv.org/abs/2401.10180": {"title": "Generalized Decomposition Priors on R2", "link": "http://arxiv.org/abs/2401.10180", "description": "The adoption of continuous shrinkage priors in high-dimensional linear models\nhas gained momentum, driven by their theoretical and practical advantages. One\nof these shrinkage priors is the R2D2 prior, which comes with intuitive\nhyperparameters and well understood theoretical properties. The core idea is to\nspecify a prior on the percentage of explained variance $R^2$ and to conduct a\nDirichlet decomposition to distribute the explained variance among all the\nregression terms of the model. Due to the properties of the Dirichlet\ndistribution, the competition among variance components tends to gravitate\ntowards negative dependence structures, fully determined by the individual\ncomponents' means. Yet, in reality, specific coefficients or groups may compete\ndifferently for the total variability than the Dirichlet would allow for. In\nthis work we address this limitation by proposing a generalization of the R2D2\nprior, which we term the Generalized Decomposition R2 (GDR2) prior.\n\nOur new prior provides great flexibility in expressing dependency structures\nas well as enhanced shrinkage properties. Specifically, we explore the\ncapabilities of variance decomposition via logistic normal distributions.\nThrough extensive simulations and real-world case studies, we demonstrate that\nGDR2 priors yield strongly improved out-of-sample predictive performance and\nparameter recovery compared to R2D2 priors with similar hyper-parameter\nchoices."}, "http://arxiv.org/abs/2401.10193": {"title": "tinyVAST: R package with an expressive interface to specify lagged and simultaneous effects in multivariate spatio-temporal models", "link": "http://arxiv.org/abs/2401.10193", "description": "Multivariate spatio-temporal models are widely applicable, but specifying\ntheir structure is complicated and may inhibit wider use. We introduce the R\npackage tinyVAST from two viewpoints: the software user and the statistician.\nFrom the user viewpoint, tinyVAST adapts a widely used formula interface to\nspecify generalized additive models, and combines this with arguments to\nspecify spatial and spatio-temporal interactions among variables. These\ninteractions are specified using arrow notation (from structural equation\nmodels), or an extended arrow-and-lag notation that allows simultaneous,\nlagged, and recursive dependencies among variables over time. The user also\nspecifies a spatial domain for areal (gridded), continuous (point-count), or\nstream-network data. From the statistician viewpoint, tinyVAST constructs\nsparse precision matrices representing multivariate spatio-temporal variation,\nand parameters are estimated by specifying a generalized linear mixed model\n(GLMM). This expressive interface encompasses vector autoregressive, empirical\northogonal functions, spatial factor analysis, and ARIMA models. To\ndemonstrate, we fit to data from two survey platforms sampling corals, sponges,\nrockfishes, and flatfishes in the Gulf of Alaska and Aleutian Islands. 
We then\ncompare eight alternative model structures using different assumptions about\nhabitat drivers and survey detectability. Model selection suggests that\ntowed-camera and bottom trawl gears have spatial variation in detectability but\nsample the same underlying density of flatfishes and rockfishes, and that\nrockfishes are positively associated with sponges while flatfishes are\nnegatively associated with corals. We conclude that tinyVAST can be used to\ntest complicated dependencies representing alternative structural assumptions\nfor research and real-world policy evaluation."}, "http://arxiv.org/abs/2401.10196": {"title": "Functional Conditional Gaussian Graphical Models", "link": "http://arxiv.org/abs/2401.10196", "description": "Functional data has become a commonly encountered data type. In this paper,\nwe contribute to the literature on functional graphical modelling by extending\nthe notion of conditional Gaussian Graphical models and proposing a\ndouble-penalized estimator by which to recover the edge-set of the\ncorresponding graph. Penalty parameters play a crucial role in determining the\nprecision matrices for the response variables and the regression matrices. The\nperformance and model selection process in the proposed framework are\ninvestigated using information criteria. Moreover, we propose a novel version\nof the Kullback-Leibler cross-validation designed for conditional joint\nGaussian Graphical Models. The evaluation of model performance is done in terms\nof Kullback-Leibler divergence and graph recovery power."}, "http://arxiv.org/abs/2206.02250": {"title": "Frequency Domain Statistical Inference for High-Dimensional Time Series", "link": "http://arxiv.org/abs/2206.02250", "description": "Analyzing time series in the frequency domain enables the development of\npowerful tools for investigating the second-order characteristics of\nmultivariate processes. Parameters like the spectral density matrix and its\ninverse, the coherence or the partial coherence, encode comprehensively the\ncomplex linear relations between the component processes of the multivariate\nsystem. In this paper, we develop inference procedures for such parameters in a\nhigh-dimensional, time series setup. Towards this goal, we first focus on the\nderivation of consistent estimators of the coherence and, more importantly, of\nthe partial coherence which possess manageable limiting distributions that are\nsuitable for testing purposes. Statistical tests of the hypothesis that the\nmaximum over frequencies of the coherence, respectively, of the partial\ncoherence, do not exceed a prespecified threshold value are developed. Our\napproach allows for testing hypotheses for individual coherences and/or partial\ncoherences as well as for multiple testing of large sets of such parameters. In\nthe latter case, a consistent procedure to control the false discovery rate is\ndeveloped. The finite sample performance of the inference procedures introduced\nis investigated by means of simulations and applications to the construction of\ngraphical interaction models for brain connectivity based on EEG data are\npresented."}, "http://arxiv.org/abs/2206.09754": {"title": "Guided structure learning of DAGs for count data", "link": "http://arxiv.org/abs/2206.09754", "description": "In this paper, we tackle structure learning of Directed Acyclic Graphs\n(DAGs), with the idea of exploiting available prior knowledge of the domain at\nhand to guide the search of the best structure. 
In particular, we assume that the\ntopological ordering of the variables is known in addition to the given data. We\nstudy a new algorithm for learning the structure of DAGs, proving its\ntheoretical consistency in the limit of infinite observations. Furthermore, we\nexperimentally compare the proposed algorithm to a number of popular\ncompetitors, in order to study its behavior in finite samples."}, "http://arxiv.org/abs/2211.08784": {"title": "The robusTest package: two-sample tests revisited", "link": "http://arxiv.org/abs/2211.08784", "description": "The R package robusTest offers corrected versions of several common tests in\nbivariate statistics. We point out the limitations of these tests in their\nclassical versions, some of which are well known, such as robustness or\ncalibration problems, and provide simple alternatives that can be easily used\ninstead. The classical tests and their robust alternatives are compared\nthrough a small simulation study. The latter emphasizes the superiority of the\nrobust versions of the tests of interest. Finally, an illustration of\ncorrelation tests on a real data set is also provided."}, "http://arxiv.org/abs/2304.05527": {"title": "Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box", "link": "http://arxiv.org/abs/2304.05527", "description": "Automatic differentiation variational inference (ADVI) offers fast and\neasy-to-use posterior approximation in multiple modern probabilistic\nprogramming languages. However, its stochastic optimizer lacks clear\nconvergence criteria and requires tuning parameters. Moreover, ADVI inherits\nthe poor posterior uncertainty estimates of mean-field variational Bayes\n(MFVB). We introduce \"deterministic ADVI\" (DADVI) to address these issues.\nDADVI replaces the intractable MFVB objective with a fixed Monte Carlo\napproximation, a technique known in the stochastic optimization literature as\nthe \"sample average approximation\" (SAA). By optimizing an approximate but\ndeterministic objective, DADVI can use off-the-shelf second-order optimization,\nand, unlike standard mean-field ADVI, is amenable to more accurate posterior\ncovariances via linear response (LR). In contrast to existing worst-case\ntheory, we show that, on certain classes of common statistical problems, DADVI\nand the SAA can perform well with relatively few samples even in very high\ndimensions, though we also show that such favorable results cannot extend to\nvariational approximations that are too expressive relative to mean-field ADVI.\nWe show on a variety of real-world problems that DADVI reliably finds good\nsolutions with default settings (unlike ADVI) and, together with LR\ncovariances, is typically faster and more accurate than standard ADVI."}, "http://arxiv.org/abs/2401.10233": {"title": "Likelihood-ratio inference on differences in quantiles", "link": "http://arxiv.org/abs/2401.10233", "description": "Quantiles can represent key operational and business metrics, but the\ncomputational challenges associated with inference have hampered their adoption\nin online experimentation. One-sample confidence intervals are trivial to\nconstruct; however, two-sample inference has traditionally required\nbootstrapping or a density estimator. This paper presents a new two-sample\ndifference-in-quantile hypothesis test and confidence interval based on a\nlikelihood-ratio test statistic. 
A conservative version of the test does not\ninvolve a density estimator; a second version of the test, which uses a density\nestimator, yields confidence intervals very close to the nominal coverage\nlevel. It can be computed using only four order statistics from each sample."}, "http://arxiv.org/abs/2401.10235": {"title": "Semi-parametric local variable selection under misspecification", "link": "http://arxiv.org/abs/2401.10235", "description": "Local variable selection aims to discover localized effects by assessing the\nimpact of covariates on outcomes within specific regions defined by other\ncovariates. We outline some challenges of local variable selection in the\npresence of non-linear relationships and model misspecification. Specifically,\nwe highlight a potential drawback of common semi-parametric methods: even\nslight model misspecification can result in a high rate of false positives. To\naddress these shortcomings, we propose a methodology based on orthogonal cut\nsplines that achieves consistent local variable selection in high-dimensional\nscenarios. Our approach offers simplicity, handles both continuous and discrete\ncovariates, and provides theory for high-dimensional covariates and model\nmisspecification. We discuss settings with either independent or dependent\ndata. Our proposal allows including adjustment covariates that do not undergo\nselection, enhancing flexibility in modeling complex scenarios. We illustrate\nits application in simulation studies with both independent and functional\ndata, as well as with two real datasets. One dataset evaluates salary gaps\nassociated with discrimination factors at different ages, while the other\nexamines the effects of covariates on brain activation over time. The approach\nis implemented in the R package mombf."}, "http://arxiv.org/abs/2401.10269": {"title": "Robust Multi-Sensor Multi-Target Tracking Using Possibility Labeled Multi-Bernoulli Filter", "link": "http://arxiv.org/abs/2401.10269", "description": "With the increasing complexity of multiple target tracking scenes, a single\nsensor may not be able to effectively monitor a large number of targets.\nTherefore, it is imperative to extend the single-sensor technique to\nMulti-Sensor Multi-Target Tracking (MSMTT) for enhanced functionality. Typical\nMSMTT methods presume complete randomness of all uncertain components, and\ntherefore effective solutions such as the random finite set filter and\ncovariance intersection method have been derived to conduct the MSMTT task.\nHowever, the presence of epistemic uncertainty, arising from incomplete\ninformation, is often disregarded within the context of MSMTT. This paper\ndevelops an innovative possibility Labeled Multi-Bernoulli (LMB) Filter based\non the labeled Uncertain Finite Set (UFS) theory. The LMB filter inherits the\nhigh robustness of the possibility generalized labeled multi-Bernoulli filter\nwith simplified computational complexity. The fusion of LMB UFSs is derived and\nadapted to develop a robust MSMTT scheme. 
Simulation results corroborate the\nsuperior performance exhibited by the proposed approach in comparison to\ntypical probabilistic methods."}, "http://arxiv.org/abs/2401.10275": {"title": "Symbolic principal plane with Duality Centers Method", "link": "http://arxiv.org/abs/2401.10275", "description": "In \\cite{ref11} and \\cite{ref3}, the authors proposed the Centers and the\nVertices Methods to extend the well-known principal components analysis method\nto a particular kind of symbolic objects characterized by multi--valued\nvariables of interval type. Nevertheless, the authors use the classical correlation\ncircle to represent the variables. The correlations between the\nvariables and the principal components are not symbolic, because they are computed as\nthe standard correlations between the midpoints of the variables and the midpoints\nof the principal components. It is well known that in standard principal\ncomponent analysis we may compute the correlations between the variables and the\nprincipal components using the duality relations, starting from the coordinates\nof the individuals in the principal plane; conversely, we can compute the coordinates\nof the individuals in the principal plane using duality relations, starting from\nthe correlations between the variables and the principal components. In this\npaper we propose a new method to compute the symbolic correlation circle using\nduality relations in the case of interval-valued variables. Moreover, the reader\nmay use all the methods presented herein and verify the results using the {\\tt\nRSDA} package, written in the {\\tt R} language, which can be downloaded and installed\ndirectly from {\\tt CRAN} \\cite{Rod2014}."}, "http://arxiv.org/abs/2401.10276": {"title": "Correspondence Analysis for Symbolic Multi--Valued Variables", "link": "http://arxiv.org/abs/2401.10276", "description": "This paper proposes a new method and two new algorithms for\nCorrespondence Analysis with Symbolic Multi--Valued Variables (SymCA).\nIn our method, there are two multi--valued variables $X$ and $Y$; that is to\nsay, the value taken by each variable for a given individual is a finite\nset of possible modalities, which allows Correspondence Analysis to be applied to\nmultiple-selection questionnaires. Then, starting from all the possible classical\ncontingency tables, an interval contingency table can be built, which is\nthe point of departure of the proposed method."}, "http://arxiv.org/abs/2401.10495": {"title": "Causal Layering via Conditional Entropy", "link": "http://arxiv.org/abs/2401.10495", "description": "Causal discovery aims to recover information about an unobserved causal graph\nfrom the observable data it generates. Layerings are orderings of the variables\nwhich place causes before effects. In this paper, we provide ways to recover\nlayerings of a graph by accessing the data via a conditional entropy oracle,\nwhen distributions are discrete. Our algorithms work by repeatedly removing\nsources or sinks from the graph. Under appropriate assumptions and\nconditioning, we can separate the sources or sinks from the remainder of the\nnodes by comparing their conditional entropy to the unconditional entropy of\ntheir noise. Our algorithms are provably correct and run in worst-case\nquadratic time. The main assumptions are faithfulness and injective noise, and\neither known noise entropies or weakly monotonically increasing noise entropies\nalong directed paths. 
In addition, we require one of either a very mild\nextension of faithfulness, or strictly monotonically increasing noise\nentropies, or expanding noise injectivity to include an additional single\nargument in the structural functions."}, "http://arxiv.org/abs/2401.10592": {"title": "Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights", "link": "http://arxiv.org/abs/2401.10592", "description": "Randomized controlled clinical trials provide the gold standard for evidence\ngeneration in relation to the effectiveness of a new treatment in medical\nresearch. Relevant information from previous studies may be desirable to\nincorporate in the design of a new trial, with the Bayesian paradigm providing\na coherent framework to formally incorporate prior knowledge. Many established\nmethods involve the use of a discounting factor, sometimes related to a measure\nof `similarity' between historical sources and the new trial. However, it is\noften the case that the sample size is highly nonlinear in those discounting\nfactors. This hinders communication with subject-matter experts to elicit\nsensible values for borrowing strength at the trial design stage.\n\nFocusing on a sample size formula that can incorporate historical data from\nmultiple sources, we propose a linearization technique such that the sample\nsize changes evenly over values of the discounting factors (hereafter referred\nto as `weights'). Our approach leads to interpretable weights that directly\nrepresent the dissimilarity between historical and new trial data on the\nprobability scale, and could therefore facilitate easier elicitation of expert\nopinion on their values.\n\nInclusion of historical data in the design of clinical trials is not common\npractice. Part of the reason might be difficulty in interpretability of\ndiscrepancy parameters. We hope our work will help to bridge this gap and\nencourage uptake of these innovative methods.\n\nKeywords: Bayesian sample size determination; Commensurate priors; Historical\nborrowing; Prior aggregation; Uniform shrinkage."}, "http://arxiv.org/abs/2401.10796": {"title": "Reliability analysis for data-driven noisy models using active learning", "link": "http://arxiv.org/abs/2401.10796", "description": "Reliability analysis aims at estimating the failure probability of an\nengineering system. It often requires multiple runs of a limit-state function,\nwhich usually relies on computationally intensive simulations. Traditionally,\nthese simulations have been considered deterministic, i.e., running them\nmultiple times for a given set of input parameters always produces the same\noutput. However, this assumption does not always hold, as many studies in the\nliterature report non-deterministic computational simulations (also known as\nnoisy models). In such cases, running the simulations multiple times with the\nsame input will result in different outputs. Similarly, data-driven models that\nrely on real-world data may also be affected by noise. This characteristic\nposes a challenge when performing reliability analysis, as many classical\nmethods, such as FORM and SORM, are tailored to deterministic models. To bridge\nthis gap, this paper provides a novel methodology to perform reliability\nanalysis on models contaminated by noise. In such cases, noise introduces\nlatent uncertainty into the reliability estimator, leading to an incorrect\nestimation of the real underlying reliability index, even when using Monte\nCarlo simulation. 
To overcome this challenge, we propose the use of denoising\nregression-based surrogate models within an active learning reliability\nanalysis framework. Specifically, we combine Gaussian process regression with a\nnoise-aware learning function to efficiently estimate the probability of\nfailure of the underlying noise-free model. We showcase the effectiveness of\nthis methodology on standard benchmark functions and a finite element model of\na realistic structural frame."}, "http://arxiv.org/abs/2401.10824": {"title": "The trivariate wrapped Cauchy copula -- a multi-purpose model for angular data", "link": "http://arxiv.org/abs/2401.10824", "description": "In this paper, we will present a new flexible distribution for\nthree-dimensional angular data, or data on the three-dimensional torus. Our\ntrivariate wrapped Cauchy copula has the following benefits: (i) simple form of\ndensity, (ii) adjustable degree of dependence between every pair of variables,\n(iii) interpretable and well-estimable parameters, (iv) well-known conditional\ndistributions, (v) a simple data generating mechanism, (vi) unimodality.\nMoreover, our construction allows for linear marginals, implying that our\ncopula can also model cylindrical data. Parameter estimation via maximum\nlikelihood is explained, a comparison with the competitors in the existing\nliterature is given, and two real datasets are considered, one concerning\nprotein dihedral angles and another about data obtained by a buoy in the\nAdriatic Sea."}, "http://arxiv.org/abs/2401.10867": {"title": "Learning Optimal Dynamic Treatment Regimes from Longitudinal Data", "link": "http://arxiv.org/abs/2401.10867", "description": "Studies often report estimates of the average treatment effect. While the ATE\nsummarizes the effect of a treatment on average, it does not provide any\ninformation about the effect of treatment within any individual. A treatment\nstrategy that uses an individual's information to tailor treatment to maximize\nbenefit is known as an optimal dynamic treatment rule. Treatment, however, is\ntypically not limited to a single point in time; consequently, learning an\noptimal rule for a time-varying treatment may involve not just learning the\nextent to which the comparative treatments' benefits vary across the\ncharacteristics of individuals, but also learning the extent to which the\ncomparative treatments' benefits vary as relevant circumstances evolve within\nan individual. The goal of this paper is to provide a tutorial for estimating\nODTR from longitudinal observational and clinical trial data for applied\nresearchers. We describe an approach that uses a doubly-robust unbiased\ntransformation of the conditional average treatment effect. We then learn a\ntime-varying ODTR for when to increase buprenorphine-naloxone dose to minimize\nreturn-to-regular-opioid-use among patients with opioid use disorder. Our\nanalysis highlights the utility of ODTRs in the context of sequential decision\nmaking: the learned ODTR outperforms a clinically defined strategy."}, "http://arxiv.org/abs/2401.10869": {"title": "Variable selection for partially linear additive models", "link": "http://arxiv.org/abs/2401.10869", "description": "Among semiparametric regression models, partially linear additive models\nprovide a useful tool to include additive nonparametric components as well as a\nparametric component, when explaining the relationship between the response and\na set of explanatory variables. 
This paper concerns such models under sparsity\nassumptions for the covariates included in the linear component. Sparse\ncovariates are frequent in regression problems where the task of variable\nselection is usually of interest. As in other settings, outliers either in the\nresiduals or in the covariates involved in the linear component have a harmful\neffect. To simultaneously achieve model selection for the parametric component\nof the model and resistance to outliers, we combine preliminary robust\nestimators of the additive component, robust linear $MM-$regression estimators\nwith a penalty such as SCAD on the coefficients in the parametric part. Under\nmild assumptions, consistency results and rates of convergence for the proposed\nestimators are derived. A Monte Carlo study is carried out to compare, under\ndifferent models and contamination schemes, the performance of the robust\nproposal with its classical counterpart. The obtained results show the\nadvantage of using the robust approach. Through the analysis of a real data\nset, we also illustrate the benefits of the proposed procedure."}, "http://arxiv.org/abs/2103.06347": {"title": "Factorized Binary Search: change point detection in the network structure of multivariate high-dimensional time series", "link": "http://arxiv.org/abs/2103.06347", "description": "Functional magnetic resonance imaging (fMRI) time series data presents a\nunique opportunity to understand the behavior of temporal brain connectivity,\nand models that uncover the complex dynamic workings of this organ are of keen\ninterest in neuroscience. We are motivated to develop accurate change point\ndetection and network estimation techniques for high-dimensional whole-brain\nfMRI data. To this end, we introduce factorized binary search (FaBiSearch), a\nnovel change point detection method in the network structure of multivariate\nhigh-dimensional time series in order to understand the large-scale\ncharacterizations and dynamics of the brain. FaBiSearch employs non-negative\nmatrix factorization, an unsupervised dimension reduction technique, and a new\nbinary search algorithm to identify multiple change points. In addition, we\npropose a new method for network estimation for data between change points. We\nseek to understand the dynamic mechanism of the brain, particularly for two\nfMRI data sets. The first is a resting-state fMRI experiment, where subjects\nare scanned over three visits. The second is a task-based fMRI experiment,\nwhere subjects read Chapter 9 of Harry Potter and the Sorcerer's Stone. For the\nresting-state data set, we examine the test-retest behavior of dynamic\nfunctional connectivity, while for the task-based data set, we explore network\ndynamics during the reading and whether change points across subjects coincide\nwith key plot twists in the story. Further, we identify hub nodes in the brain\nnetwork and examine their dynamic behavior. Finally, we make all the methods\ndiscussed available in the R package fabisearch on CRAN."}, "http://arxiv.org/abs/2207.07020": {"title": "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO", "link": "http://arxiv.org/abs/2207.07020", "description": "The Gaussian chain graph model simultaneously parametrizes (i) the direct\neffects of $p$ predictors on $q$ outcomes and (ii) the residual partial\ncovariances between pairs of outcomes. We introduce a new method for fitting\nsparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. 
We\ndevelop an Expectation Conditional Maximization algorithm to obtain sparse\nestimates of the $p \\times q$ matrix of direct effects and the $q \\times q$\nresidual precision matrix. Our algorithm iteratively solves a sequence of\npenalized maximum likelihood problems with self-adaptive penalties that\ngradually filter out negligible regression coefficients and partial\ncovariances. Because it adaptively penalizes individual model parameters, our\nmethod is seen to outperform fixed-penalty competitors on simulated data. We\nestablish the posterior contraction rate for our model, buttressing our\nmethod's excellent empirical performance with strong theoretical guarantees.\nUsing our method, we estimated the direct effects of diet and residence type on\nthe composition of the gut microbiome of elderly adults."}, "http://arxiv.org/abs/2301.00040": {"title": "Optimization-based Sensitivity Analysis for Unmeasured Confounding using Partial Correlations", "link": "http://arxiv.org/abs/2301.00040", "description": "Causal inference necessarily relies upon untestable assumptions; hence, it is\ncrucial to assess the robustness of obtained results to violations of\nidentification assumptions. However, such sensitivity analysis is only\noccasionally undertaken in practice, as many existing methods only apply to\nrelatively simple models and their results are often difficult to interpret. We\ntake a more flexible approach to sensitivity analysis and view it as a\nconstrained stochastic optimization problem. This work focuses on sensitivity\nanalysis for a linear causal effect when an unmeasured confounder and a\npotential instrument are present. We show how the bias of the OLS and TSLS\nestimands can be expressed in terms of partial correlations. Leveraging the\nalgebraic rules that relate different partial correlations, practitioners can\nspecify intuitive sensitivity models which bound the bias. We further show that\nthe heuristic \"plug-in\" sensitivity interval may not have any confidence\nguarantees; instead, we propose a bootstrap approach to construct sensitivity\nintervals which perform well in numerical simulations. We illustrate the\nproposed methods with a real study on the causal effect of education on\nearnings and provide user-friendly visualization tools."}, "http://arxiv.org/abs/2307.00048": {"title": "Learned harmonic mean estimation of the marginal likelihood with normalizing flows", "link": "http://arxiv.org/abs/2307.00048", "description": "Computing the marginal likelihood (also called the Bayesian model evidence)\nis an important task in Bayesian model selection, providing a principled\nquantitative way to compare models. The learned harmonic mean estimator solves\nthe exploding variance problem of the original harmonic mean estimation of the\nmarginal likelihood. The learned harmonic mean estimator learns an importance\nsampling target distribution that approximates the optimal distribution. While\nthe approximation need not be highly accurate, it is critical that the\nprobability mass of the learned distribution is contained within the posterior\nin order to avoid the exploding variance problem. In previous work a bespoke\noptimization problem is introduced when training models in order to ensure this\nproperty is satisfied. In the current article we introduce the use of\nnormalizing flows to represent the importance sampling target distribution. A\nflow-based model is trained on samples from the posterior by maximum likelihood\nestimation. 
Then, the probability density of the flow is concentrated by\nlowering the variance of the base distribution, i.e. by lowering its\n\"temperature\", ensuring its probability mass is contained within the posterior.\nThis approach avoids the need for a bespoke optimisation problem and careful\nfine tuning of parameters, resulting in a more robust method. Moreover, the use\nof normalizing flows has the potential to scale to high dimensional settings.\nWe present preliminary experiments demonstrating the effectiveness of the use\nof flows for the learned harmonic mean estimator. The harmonic code\nimplementing the learned harmonic mean, which is publicly available, has been\nupdated to now support normalizing flows."}, "http://arxiv.org/abs/2401.11001": {"title": "Reply to Comment by Schilling on their paper \"Optimal and fast confidence intervals for hypergeometric successes\"", "link": "http://arxiv.org/abs/2401.11001", "description": "A response to a letter to the editor by Schilling regarding Bartroff, Lorden,\nand Wang (\"Optimal and fast confidence intervals for hypergeometric successes\"\n2022, arXiv:2109.05624)"}, "http://arxiv.org/abs/2401.11070": {"title": "Efficient Data Reduction Strategies for Big Data and High-Dimensional LASSO Regressions", "link": "http://arxiv.org/abs/2401.11070", "description": "The IBOSS approach proposed by Wang et al. (2019) selects the most\ninformative subset of n points. It assumes that the ordinary least squares\nmethod is used and requires that the number of variables, p, is not large.\nHowever, in many practical problems, p is very large and penalty-based model\nfitting methods such as LASSO are used. We study big data problems in which\nboth n and p are large. In the first part, we focus on reduction in data\npoints. We develop theoretical results showing that the IBOSS type of approach\nis applicable to penalty-based regressions such as LASSO. In the second\npart, we consider the situations where p is extremely large. We propose a\ntwo-step approach that involves first reducing the number of variables and then\nreducing the number of data points. Two separate algorithms are developed,\nwhose performances are studied through extensive simulation studies. Compared\nto existing methods, including the well-known split-and-conquer approach, the\nproposed methods enjoy advantages in terms of estimation accuracy, prediction\naccuracy, and computation time."}, "http://arxiv.org/abs/2401.11075": {"title": "Estimating the Hawkes process from a discretely observed sample path", "link": "http://arxiv.org/abs/2401.11075", "description": "The Hawkes process is a widely used model in many areas, such as\nfinance, seismology, neuroscience, epidemiology, and social\nsciences. Estimation of the Hawkes process from continuous\nobservations of a sample path is relatively straightforward using\neither maximum likelihood or other methods. However, estimating\nthe parameters of a Hawkes process from observations of a sample\npath at discrete time points only is challenging due to the\nintractability of the likelihood with such data. In this work, we\nintroduce a method to estimate the Hawkes process from a discretely\nobserved sample path. The method takes advantage of a state-space\nrepresentation of the incomplete data problem and uses sequential\nMonte Carlo (also known as particle filtering) to approximate the likelihood\nfunction. 
As an estimator of the likelihood function, the SMC\napproximation is unbiased, and therefore it can be used together\nwith the Metropolis-Hastings algorithm to construct Markov chains that\napproximate the posterior distribution of the model parameters. The performance of the\nmethodology is assessed using simulation experiments and compared\nwith other recently published methods. The proposed estimator is\nfound to have a smaller mean square error than the two benchmark\nestimators. The proposed method has the additional advantage that\nconfidence intervals for the parameters are easily available. We\napply the proposed estimator to the analysis of weekly count data on\nmeasles cases in Tokyo, Japan, and compare the results to those obtained by\none of the benchmark methods."}, "http://arxiv.org/abs/2401.11119": {"title": "Constraint-based measures of shift and relative shift for discrete frequency distributions", "link": "http://arxiv.org/abs/2401.11119", "description": "Comparisons of frequency distributions often invoke the concept of shift to\ndescribe directional changes in properties such as the mean. In the present\nstudy, we sought to define shift as a property in and of itself. Specifically,\nwe define distributional shift (DS) as the concentration of frequencies away\nfrom the discrete class having the greatest value (e.g., the right-most bin of\na histogram). We derive a measure of DS using the normalized sum of\nexponentiated cumulative frequencies. We then define relative distributional\nshift (RDS) as the difference in DS between two distributions, revealing the\nmagnitude and direction by which one distribution is concentrated to lesser or\ngreater discrete classes relative to another. We find that RDS is highly\nrelated to popular measures that, while based on the comparison of frequency\ndistributions, do not explicitly consider shift. While RDS provides a useful\ncomplement to other comparative measures, DS allows shift to be quantified as a\nproperty of individual distributions, similar in concept to a statistical\nmoment."}, "http://arxiv.org/abs/2401.11128": {"title": "Regularized Estimation of Sparse Spectral Precision Matrices", "link": "http://arxiv.org/abs/2401.11128", "description": "The spectral precision matrix, the inverse of a spectral density matrix, is an\nobject of central interest in frequency-domain analysis of multivariate time\nseries. Estimation of the spectral precision matrix is a key step in calculating\npartial coherency and in graphical model selection for stationary time series. When\nthe dimension of a multivariate time series is moderate to large, traditional\nestimators of spectral density matrices such as averaged periodograms tend to\nbe severely ill-conditioned, and one needs to resort to suitable regularization\nstrategies involving optimization over complex variables.\n\nIn this work, we propose complex graphical Lasso (CGLASSO), an\n$\\ell_1$-penalized estimator of the spectral precision matrix based on local\nWhittle likelihood maximization. We develop fast $\\textit{pathwise coordinate\ndescent}$ algorithms for implementing CGLASSO on large dimensional time series\ndata sets. At its core, our algorithmic development relies on a ring\nisomorphism between complex and real matrices that helps map a number of\noptimization problems over complex variables to similar optimization problems\nover real variables. 
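A minimal NumPy sketch of the complex-to-real ring isomorphism mentioned in the CGLASSO entry above: the block map A + iB -> [[A, -B], [B, A]] preserves sums and products, so optimization over complex matrices can be rewritten over real ones. This is the standard construction written as our own illustration, not code from the paper.

```python
import numpy as np

def complex_to_real(Z):
    """Map a complex matrix Z = A + iB to the real block matrix
    [[A, -B], [B, A]]; sums and products of complex matrices then
    correspond to sums and products of their real images."""
    A, B = Z.real, Z.imag
    return np.block([[A, -B], [B, A]])

# quick check that matrix multiplication is preserved under the map
rng = np.random.default_rng(0)
Z1 = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
Z2 = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
assert np.allclose(complex_to_real(Z1) @ complex_to_real(Z2),
                   complex_to_real(Z1 @ Z2))
```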
This finding may be of independent interest and more\nbroadly applicable for high-dimensional statistical analysis with\ncomplex-valued data. We also present a complete non-asymptotic theory of our\nproposed estimator, which shows that consistent estimation is possible in the\nhigh-dimensional regime as long as the underlying spectral precision matrix is\nsuitably sparse. We compare the performance of CGLASSO with competing\nalternatives on simulated data sets, and use it to construct a partial coherence\nnetwork among brain regions from a real fMRI data set."}, "http://arxiv.org/abs/2401.11263": {"title": "Estimating heterogeneous treatment effect from survival outcomes via (orthogonal) censoring unbiased learning", "link": "http://arxiv.org/abs/2401.11263", "description": "Methods for estimating heterogeneous treatment effects (HTE) from\nobservational data have largely focused on continuous or binary outcomes, with\nless attention paid to survival outcomes and almost none to settings with\ncompeting risks. In this work, we develop censoring unbiased transformations\n(CUTs) for survival outcomes both with and without competing risks. After\nconverting time-to-event outcomes using these CUTs, direct application of HTE\nlearners for continuous outcomes yields consistent estimates of heterogeneous\ncumulative incidence effects, total effects, and separable direct effects. Our\nCUTs enable application of a much larger set of state-of-the-art HTE learners\nfor censored outcomes than had previously been available, especially in\ncompeting risks settings. We provide generic model-free learner-specific oracle\ninequalities bounding the finite-sample excess risk. The oracle efficiency\nresults depend on the oracle selector and estimated nuisance functions from all\nsteps involved in the transformation. We demonstrate the empirical performance\nof the proposed methods in simulation studies."}, "http://arxiv.org/abs/2401.11265": {"title": "Assessing the Competitiveness of Matrix-Free Block Likelihood Estimation in Spatial Models", "link": "http://arxiv.org/abs/2401.11265", "description": "In geostatistics, block likelihood offers a balance between statistical\naccuracy and computational efficiency when estimating covariance functions.\nThis balance is reached by dividing the sample into blocks and computing a\nweighted sum of (sub) log-likelihoods corresponding to pairs of blocks.\nPractitioners often choose block sizes ranging from hundreds to a few thousand\nobservations, inherently involving matrix-based implementations. An\nalternative, residing at the opposite end of this methodological spectrum,\ntreats each observation as a block, resulting in the matrix-free pairwise\nlikelihood method. We propose an additional alternative within this broad\nmethodological landscape, systematically constructing blocks of size two and\nmerging pairs of blocks through conditioning. Importantly, our method\nstrategically avoids large-sized blocks, facilitating explicit calculations\nthat ultimately do not rely on matrix computations. 
Studies with both simulated\nand real data validate the effectiveness of our approach, on the one hand\ndemonstrating its superiority over pairwise likelihood, and on the other\nchallenging the intuitive notion that employing matrix-based versions\nuniversally leads to better statistical performance."}, "http://arxiv.org/abs/2401.11272": {"title": "Asymptotics for non-degenerate multivariate $U$-statistics with estimated nuisance parameters under the null and local alternative hypotheses", "link": "http://arxiv.org/abs/2401.11272", "description": "The large-sample behavior of non-degenerate multivariate $U$-statistics of\narbitrary degree is investigated under the assumption that their kernel depends\non parameters that can be estimated consistently. Mild regularity conditions\nare given which guarantee that, once properly normalized, such statistics are\nasymptotically multivariate Gaussian both under the null hypothesis and\nsequences of local alternatives. The work of Randles (1982, Ann. Statist.) is\nextended in three ways: the data and the kernel values can be multivariate\nrather than univariate, the limiting behavior under local alternatives is\nstudied for the first time, and the effect of knowing some of the nuisance\nparameters is quantified. These results can be applied to a broad range of\ngoodness-of-fit testing contexts, as shown in one specific example."}, "http://arxiv.org/abs/2401.11278": {"title": "Handling incomplete outcomes and covariates in cluster-randomized trials: doubly-robust estimation, efficiency considerations, and sensitivity analysis", "link": "http://arxiv.org/abs/2401.11278", "description": "In cluster-randomized trials (CRTs), missing data can occur in various ways,\nincluding missing values in outcomes and baseline covariates at the individual\nor cluster level, or completely missing information for non-participants. Among\nthe various types of missing data in CRTs, missing outcomes have attracted the\nmost attention. However, no existing method comprehensively addresses all the\naforementioned types of missing data simultaneously due to their complexity.\nThis gap in methodology may lead to confusion and potential pitfalls in the\nanalysis of CRTs. In this article, we propose a doubly-robust estimator for a\nvariety of estimands that simultaneously handles missing outcomes under a\nmissing-at-random assumption, missing covariates with the missing-indicator\nmethod (with no constraint on missing covariate distributions), and missing\ncluster-population sizes via a uniform sampling framework. Furthermore, we\nprovide three approaches to improve precision by choosing the optimal weights\nfor intracluster correlation, leveraging machine learning, and modeling the\npropensity score for treatment assignment. To evaluate the impact of violated\nmissing data assumptions, we additionally propose a sensitivity analysis that\nmeasures when missing data alter the conclusion of treatment effect estimation.\nSimulation studies and data applications both show that our proposed method is\nvalid and superior to the existing methods."}, "http://arxiv.org/abs/2401.11327": {"title": "Measuring hierarchically-organized interactions in dynamic networks through spectral entropy rates: theory, estimation, and illustrative application to physiological networks", "link": "http://arxiv.org/abs/2401.11327", "description": "Recent advances in signal processing and information theory are boosting the\ndevelopment of new approaches for the data-driven modelling of complex network\nsystems. 
In the fields of Network Physiology and Network Neuroscience, where the\nsignals of interest are often rich in oscillatory content, the spectral\nrepresentation of network systems is essential to ascribe the analyzed\ninteractions to specific oscillations with physiological meaning. In this\ncontext, the present work formalizes a coherent framework which integrates\nseveral information dynamics approaches to quantify node-specific, pairwise and\nhigher-order interactions in network systems. The framework establishes a\nhierarchical organization of interactions of different order using measures of\nentropy rate, mutual information rate and O-information rate, to quantify\nrespectively the dynamics of individual nodes, the links between pairs of\nnodes, and the redundant/synergistic hyperlinks between groups of nodes. All\nmeasures are formulated in the time domain, and then expanded to the spectral\ndomain to obtain frequency-specific information. The practical computation of\nall measures is facilitated by a toolbox that implements their parametric\nand non-parametric estimation, and includes approaches to assess their\nstatistical significance. The framework is illustrated first using theoretical\nexamples where the properties of the measures are displayed in benchmark\nsimulated network systems, and then applied to representative examples of\nmultivariate time series in the context of Network Neuroscience and Network\nPhysiology."}, "http://arxiv.org/abs/2401.11328": {"title": "A Hierarchical Decision-Based Maintenance for a Complex Modular System Driven by the MoMA Algorithm", "link": "http://arxiv.org/abs/2401.11328", "description": "This paper presents a maintenance policy for a modular system formed by K\nindependent modules (n-subsystems) subjected to environmental conditions\n(shocks). For the modeling of this complex system, the use of the\nMatrix-Analytical Method (MAM) is proposed under a layered approach according\nto its hierarchical structure. Thus, the operational state of the system (top\nlayer) depends on the states of the modules (middle layer), which in turn\ndepend on the states of their components (bottom layer). This allows a detailed\ndescription of the system operation to plan maintenance actions appropriately\nand optimally. We propose a hierarchical decision-based maintenance strategy\nwith periodic inspections as follows: at the time of the inspection, the\ncondition of the system is first evaluated. If intervention is necessary, the\nmodules are then checked to make individual decisions based on their states,\nand so on. Replacement or repair will be carried out as appropriate. An\noptimization problem is formulated as a function of the length of the\ninspection period and the intervention cost incurred over the useful life of\nthe system. Our method shows its advantages by providing compact and\nimplementable expressions. The model is illustrated on a submarine Electrical\nControl Unit (ECU)."}, "http://arxiv.org/abs/2401.11346": {"title": "Estimating Default Probability and Correlation using Stan", "link": "http://arxiv.org/abs/2401.11346", "description": "This work has the objective of estimating default probabilities and\ncorrelations of credit portfolios given default rate information through a\nBayesian framework using Stan. We use Vasicek's single factor credit model to\nestablish the theoretical framework for the behavior of the default rates, and\nuse the NUTS Markov chain Monte Carlo algorithm to estimate the parameters. 
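As a hedged illustration of the classical benchmark side of the Stan entry above: assuming the standard large-portfolio (Vasicek) limiting density of the default rate, the sketch below obtains maximum likelihood estimates of the default probability and asset correlation with SciPy. It is not the authors' Stan/NUTS implementation; the function names, starting values, and toy data are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def vasicek_logpdf(x, pd, rho):
    """Log-density of the large-portfolio (Vasicek) default-rate distribution."""
    q = norm.ppf(x)
    return (0.5 * np.log((1 - rho) / rho) + 0.5 * q**2
            - (np.sqrt(1 - rho) * q - norm.ppf(pd))**2 / (2 * rho))

def fit_vasicek_mle(default_rates):
    """Maximum likelihood estimates of (pd, rho) from observed default rates."""
    neg_loglik = lambda th: -np.sum(vasicek_logpdf(default_rates, th[0], th[1]))
    res = minimize(neg_loglik, x0=[0.02, 0.1],
                   bounds=[(1e-4, 0.5), (1e-4, 0.99)])
    return res.x

# toy usage with made-up annual default rates
rates = np.array([0.010, 0.015, 0.008, 0.030, 0.012])
pd_hat, rho_hat = fit_vasicek_mle(rates)
```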
We compare the\nBayesian estimates with classical estimates such as moments estimators and\nmaximum likelihood estimates. We apply the methodology both to simulated data\nand to corporate default rates, and perform inferences through Bayesian methods\nin order to illustrate the advantages of such a framework. We perform default\nforecasting, highlight the importance of adequately estimating default\ncorrelations, and show the advantage of using Stan to perform sampling\nwith respect to prior choice."}, "http://arxiv.org/abs/2401.11352": {"title": "Geometric Insights and Empirical Observations on Covariate Adjustment and Stratified Randomization in Randomized Clinical Trials", "link": "http://arxiv.org/abs/2401.11352", "description": "The statistical efficiency of randomized clinical trials can be improved by\nincorporating information from baseline covariates (i.e., pre-treatment patient\ncharacteristics). This can be done in the design stage using a\ncovariate-adaptive randomization scheme such as stratified (permuted block)\nrandomization, or in the analysis stage through covariate adjustment. This\narticle provides a geometric perspective on covariate adjustment and stratified\nrandomization in a unified framework where all regular, asymptotically linear\nestimators are identified as augmented estimators. From this perspective,\ncovariate adjustment can be viewed as an effort to approximate the optimal\naugmentation function, and stratified randomization aims to improve a given\napproximation by projecting it into an affine subspace containing the optimal\naugmentation function. The efficiency benefit of stratified randomization is\nasymptotically equivalent to making full use of stratum information in\ncovariate adjustment, which can be achieved using a simple calibration\nprocedure. Simulation results indicate that stratified randomization is clearly\nbeneficial to unadjusted estimators and much less so to adjusted ones, and that\ncalibration is an effective way to recover the efficiency benefit of stratified\nrandomization without actually performing stratified randomization. These\ninsights and observations are illustrated using real clinical trial data."}, "http://arxiv.org/abs/2401.11354": {"title": "Squared Wasserstein-2 Distance for Efficient Reconstruction of Stochastic Differential Equations", "link": "http://arxiv.org/abs/2401.11354", "description": "We provide an analysis of the squared Wasserstein-2 ($W_2$) distance between\ntwo probability distributions associated with two stochastic differential\nequations (SDEs). Based on this analysis, we propose the use of squared $W_2$\ndistance-based loss functions in the \\textit{reconstruction} of SDEs from noisy\ndata. To demonstrate the practicality of our Wasserstein distance-based loss\nfunctions, we performed numerical experiments that show the efficiency\nof our method in reconstructing SDEs that arise across a number of\napplications."}, "http://arxiv.org/abs/2401.11359": {"title": "The Exact Risks of Reference Panel-based Regularized Estimators", "link": "http://arxiv.org/abs/2401.11359", "description": "Reference panel-based estimators have become widely used in genetic\nprediction of complex traits due to their ability to address data privacy\nconcerns and reduce computational and communication costs. These estimators\nestimate the covariance matrix of predictors using an external reference panel,\ninstead of relying solely on the original training data. 
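A small aside on the squared $W_2$ entry above: in one dimension with equal sample sizes, the squared Wasserstein-2 distance between two empirical distributions reduces to the mean squared difference of sorted samples. The sketch below shows only this generic building block, written as our own code; it is not the paper's SDE-reconstruction loss.

```python
import numpy as np

def squared_w2_1d(x, y):
    """Squared Wasserstein-2 distance between two 1-D samples of equal size,
    computed via the quantile (sorted-sample) coupling."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if len(x) != len(y):
        raise ValueError("this simple version assumes equal sample sizes")
    return float(np.mean((x - y) ** 2))

# toy usage: distance between the marginals of two simulated samples
rng = np.random.default_rng(1)
d = squared_w2_1d(rng.normal(0.0, 1.0, 500), rng.normal(0.5, 1.0, 500))
```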
In this paper, we\ninvestigate the performance of reference panel-based $L_1$ and $L_2$\nregularized estimators within a unified framework based on approximate message\npassing (AMP). We uncover several key factors that influence the accuracy of\nreference panel-based estimators, including the sample sizes of the training\ndata and reference panels, the signal-to-noise ratio, the underlying sparsity\nof the signal, and the covariance matrix among predictors. Our findings reveal\nthat, even when the sample size of the reference panel matches that of the\ntraining data, reference panel-based estimators tend to exhibit lower accuracy\ncompared to traditional regularized estimators. Furthermore, we observe that\nthis performance gap widens as the amount of training data increases,\nhighlighting the importance of constructing large-scale reference panels to\nmitigate this issue. To support our theoretical analysis, we develop a novel\nnon-separable matrix AMP framework capable of handling the complexities\nintroduced by a general covariance matrix and the additional randomness\nassociated with a reference panel. We validate our theoretical results through\nextensive simulation studies and real data analyses using the UK Biobank\ndatabase."}, "http://arxiv.org/abs/2401.11368": {"title": "When exposure affects subgroup membership: Framing relevant causal questions in perinatal epidemiology and beyond", "link": "http://arxiv.org/abs/2401.11368", "description": "Perinatal epidemiology often aims to evaluate exposures on infant outcomes.\nWhen the exposure affects the composition of people who give birth to live\ninfants (e.g., by affecting fertility, behavior, or birth outcomes), this \"live\nbirth process\" mediates the exposure effect on infant outcomes. Causal\nestimands previously proposed for this setting include the total exposure\neffect on composite birth and infant outcomes, controlled direct effects (e.g.,\nenforcing birth), and principal stratum direct effects. Using perinatal HIV\ntransmission in the SEARCH Study as a motivating example, we present two\nalternative causal estimands: 1) conditional total effects; and 2) conditional\nstochastic direct effects, formulated under a hypothetical intervention to draw\nmediator values from some distribution (possibly conditional on covariates).\nThe proposed conditional total effect includes impacts of an intervention that\noperate by changing the types of people who have a live birth and the timing of\nbirths. The proposed conditional stochastic direct effects isolate the effect\nof an exposure on infant outcomes excluding any impacts through this live birth\nprocess. In SEARCH, this approach quantifies the impact of a universal testing\nand treatment intervention on infant HIV-free survival absent any effect of the\nintervention on the live birth process, within a clearly defined target\npopulation of women of reproductive age with HIV at study baseline. Our\napproach has implications for the evaluation of intervention effects in\nperinatal epidemiology broadly, and whenever causal effects within a subgroup\nare of interest and exposure affects membership in the subgroup."}, "http://arxiv.org/abs/2401.11380": {"title": "MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning", "link": "http://arxiv.org/abs/2401.11380", "description": "Model-based offline reinforcement learning methods (RL) have achieved\nstate-of-the-art performance in many decision-making problems thanks to their\nsample efficiency and generalizability. 
Despite these advancements, existing\nmodel-based offline RL approaches either focus on theoretical studies without\ndeveloping practical algorithms or rely on a restricted parametric policy\nspace, thus not fully leveraging the advantages of an unrestricted policy space\ninherent to model-based methods. To address this limitation, we develop MoMA, a\nmodel-based mirror ascent algorithm with general function approximations under\npartial coverage of offline data. MoMA distinguishes itself from the existing\nliterature by employing an unrestricted policy class. In each iteration, MoMA\nconservatively estimates the value function by a minimization procedure within\na confidence set of transition models in the policy evaluation step, then\nupdates the policy with general function approximations instead of\ncommonly-used parametric policy classes in the policy improvement step. Under\nsome mild assumptions, we establish theoretical guarantees of MoMA by proving\nan upper bound on the suboptimality of the returned policy. We also provide a\npractically implementable, approximate version of the algorithm. The\neffectiveness of MoMA is demonstrated via numerical studies."}, "http://arxiv.org/abs/2401.11507": {"title": "Redundant multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses", "link": "http://arxiv.org/abs/2401.11507", "description": "During multiple testing, researchers often adjust their alpha level to\ncontrol the familywise error rate for a statistical inference about a joint\nunion alternative hypothesis (e.g., \"H1 or H2\"). However, in some cases, they\ndo not make this inference and instead make separate inferences about each of\nthe individual hypotheses that comprise the joint hypothesis (e.g., H1 and H2).\nFor example, a researcher might use a Bonferroni correction to adjust their\nalpha level from the conventional level of 0.050 to 0.025 when testing H1 and\nH2, find a significant result for H1 (p < 0.025) and not for H2 (p > 0.025),\nand so claim support for H1 and not for H2. However, these separate individual\ninferences do not require an alpha adjustment. Only a statistical inference\nabout the union alternative hypothesis \"H1 or H2\" requires an alpha adjustment\nbecause it is based on \"at least one\" significant result among the two tests,\nand so it depends on the familywise error rate. When a researcher corrects\ntheir alpha level during multiple testing but does not make an inference about\nthe union alternative hypothesis, their correction is redundant. In the present\narticle, I discuss this redundant correction problem, including its associated\nloss of statistical power and its potential causes vis-\\`a-vis error rate\nconfusions and the alpha adjustment ritual. I also provide three illustrations\nof redundant corrections from recent psychology studies. I conclude that\nredundant corrections represent a symptom of statisticism, and I call for a\nmore nuanced and context-specific approach to multiple testing corrections."}, "http://arxiv.org/abs/2401.11515": {"title": "Geometry-driven Bayesian Inference for Ultrametric Covariance Matrices", "link": "http://arxiv.org/abs/2401.11515", "description": "Ultrametric matrices arise as covariance matrices in latent tree models for\nmultivariate data with hierarchically correlated components. 
As a parameter\nspace in a model, the set of ultrametric matrices is neither convex nor a\nsmooth manifold, and focus in the literature has hitherto mainly been restricted to\nestimation through projections and relaxation-based techniques. Leveraging the\nlink between an ultrametric matrix and a rooted tree, we equip the set of\nultrametric matrices with a convenient geometry based on the well-known\ngeometry of phylogenetic trees, whose attractive properties (e.g. unique\ngeodesics and Fr\\'{e}chet means) the set of ultrametric matrices inherits. This\nresults in a novel representation of an ultrametric matrix by coordinates of\nthe tree space, which we then use to define a class of Markovian and consistent\nprior distributions on the set of ultrametric matrices in a Bayesian model, and\ndevelop an efficient algorithm to sample from the posterior distribution that\ngenerates updates by making intrinsic local moves along geodesics within the\nset of ultrametric matrices. In simulation studies, our proposed algorithm\nrestores the underlying matrices with posterior samples that recover the true tree\ntopology with high frequency and generate element-wise credible intervals with\na high nominal coverage rate. We apply the proposed\nalgorithm to the pre-clinical cancer data to investigate mechanism\nsimilarity by constructing the underlying treatment tree, and find that\ntreatments with high mechanism similarity also target correlated pathways in\nthe biological literature."}, "http://arxiv.org/abs/2401.11537": {"title": "Addressing researcher degrees of freedom through minP adjustment", "link": "http://arxiv.org/abs/2401.11537", "description": "When different researchers study the same research question using the same\ndataset they may obtain different and potentially even conflicting results.\nThis is because there is often substantial flexibility in researchers'\nanalytical choices, an issue also referred to as ''researcher degrees of\nfreedom''. Combined with selective reporting of the smallest p-value or largest\neffect, researcher degrees of freedom may lead to an increased rate of false\npositive and overoptimistic results. In this paper, we address this issue by\nformalizing the multiplicity of analysis strategies as a multiple testing\nproblem. As the test statistics of different analysis strategies are usually\nhighly dependent, a naive approach such as the Bonferroni correction is\ninappropriate because it leads to an unacceptable loss of power. Instead, we\npropose using the ''minP'' adjustment method, which takes potential test\ndependencies into account and approximates the underlying null distribution of\nthe minimal p-value through a permutation-based procedure. This procedure is\nknown to achieve more power than simpler approaches while ensuring a weak\ncontrol of the family-wise error rate. We illustrate our approach for\naddressing researcher degrees of freedom by applying it to a study on the\nimpact of perioperative paO2 on post-operative complications after\nneurosurgery. A total of 48 analysis strategies are considered and adjusted\nusing the minP procedure. 
This approach allows one to selectively report the result\nof the analysis strategy yielding the most convincing evidence, while\ncontrolling the type I error -- and thus the risk of publishing false positive\nresults that may not be replicable."}, "http://arxiv.org/abs/2401.11540": {"title": "A new flexible class of kernel-based tests of independence", "link": "http://arxiv.org/abs/2401.11540", "description": "Spherical and hyperspherical data are commonly encountered in diverse applied\nresearch domains, underscoring the vital task of assessing independence within\nsuch data structures. In this context, we investigate the properties of test\nstatistics relying on distance correlation measures originally introduced for\nthe energy distance, and generalize the concept to strongly negative definite\nkernel-based distances. An important benefit of employing this method lies in\nits versatility across diverse forms of directional data, enabling the\nexamination of independence among vectors of varying types. The applicability\nof the tests is demonstrated on several real datasets."}, "http://arxiv.org/abs/2401.11646": {"title": "Nonparametric Estimation via Variance-Reduced Sketching", "link": "http://arxiv.org/abs/2401.11646", "description": "Nonparametric models are of great interest in various scientific and\nengineering disciplines. Classical kernel methods, while numerically robust and\nstatistically sound in low-dimensional settings, become inadequate in\nhigher-dimensional settings due to the curse of dimensionality. In this paper,\nwe introduce a new framework called Variance-Reduced Sketching (VRS),\nspecifically designed to estimate density functions and nonparametric\nregression functions in higher dimensions with a reduced curse of\ndimensionality. Our framework conceptualizes multivariable functions as\ninfinite-size matrices, and facilitates a new sketching technique motivated by\nthe numerical linear algebra literature to reduce the variance in estimation\nproblems. We demonstrate the robust numerical performance of VRS through a\nseries of simulated experiments and real-world data applications. Notably, VRS\nshows remarkable improvement over existing neural network estimators and\nclassical kernel methods in numerous density estimation and nonparametric\nregression models. Additionally, we offer theoretical justifications for VRS to\nsupport its ability to deliver nonparametric estimation with a reduced curse of\ndimensionality."}, "http://arxiv.org/abs/2401.11789": {"title": "Stein EWMA Control Charts for Count Processes", "link": "http://arxiv.org/abs/2401.11789", "description": "The monitoring of serially independent or autocorrelated count processes\nhaving a Poisson or (negative) binomial marginal distribution under\nin-control conditions is considered. Utilizing the corresponding Stein identities,\nexponentially weighted moving-average (EWMA) control charts are constructed,\nwhich can be flexibly adapted to uncover zero inflation, over- or\nunderdispersion. The proposed Stein EWMA charts' performance is investigated by\nsimulations, and their usefulness is demonstrated by a real-world data example\nfrom health surveillance."}, "http://arxiv.org/abs/2401.11804": {"title": "Regression Copulas for Multivariate Responses", "link": "http://arxiv.org/abs/2401.11804", "description": "We propose a novel distributional regression model for a multivariate\nresponse vector based on a copula process over the covariate space. 
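A minimal sketch of the permutation-based minP idea from the researcher-degrees-of-freedom entry above, in the spirit of Westfall-Young adjustment: each analysis strategy below is a hypothetical placeholder callable returning a p-value, and this is our own illustration rather than the authors' implementation.

```python
import numpy as np

def minp_adjust(y, X, strategies, n_perm=1000, seed=None):
    """Permutation-based minP adjustment across several analysis strategies.

    `strategies` is a list of callables (y, X) -> p-value; permuting the
    outcome approximates the joint null distribution of the minimal p-value
    while respecting the dependence between strategies.
    """
    rng = np.random.default_rng(seed)
    p_obs = np.array([s(y, X) for s in strategies])
    min_null = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)                     # mimic the global null
        min_null[b] = min(s(y_perm, X) for s in strategies)
    # adjusted p-value: how often the permutation minimum is at least as small
    return np.array([(min_null <= p).mean() for p in p_obs])
```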
It uses the\nimplicit copula of a Gaussian multivariate regression, which we call a\n``regression copula''. To allow for large covariate vectors their coefficients\nare regularized using a novel multivariate extension of the horseshoe prior.\nBayesian inference and distributional predictions are evaluated using efficient\nvariational inference methods, allowing application to large datasets. An\nadvantage of the approach is that the marginal distributions of the response\nvector can be estimated separately and accurately, resulting in predictive\ndistributions that are marginally-calibrated. Two substantive applications of\nthe methodology highlight its efficacy in multivariate modeling. The first is\nthe econometric modeling and prediction of half-hourly regional Australian\nelectricity prices. Here, our approach produces more accurate distributional\nforecasts than leading benchmark methods. The second is the evaluation of\nmultivariate posteriors in likelihood-free inference (LFI) of a model for tree\nspecies abundance data, extending a previous univariate regression copula LFI\nmethod. In both applications, we demonstrate that our new approach exhibits a\ndesirable marginal calibration property."}, "http://arxiv.org/abs/2401.11827": {"title": "Flexible Models for Simple Longitudinal Data", "link": "http://arxiv.org/abs/2401.11827", "description": "We propose a new method for estimating subject-specific mean functions from\nlongitudinal data. We aim to do this in a flexible manner (without restrictive\nassumptions about the shape of the subject-specific mean functions), while\nexploiting similarities in the mean functions between different subjects.\nFunctional principal components analysis fulfils both requirements, and methods\nfor functional principal components analysis have been developed for\nlongitudinal data. However, we find that these existing methods sometimes give\nfitted mean functions which are more complex than needed to provide a good fit\nto the data. We develop a new penalised likelihood approach to flexibly model\nlongitudinal data, with a penalty term to control the balance between fit to\nthe data and smoothness of the subject-specific mean curves. We run simulation\nstudies to demonstrate that the new method substantially improves the quality\nof inference relative to existing methods across a range of examples, and apply\nthe method to data on changes in body composition in adolescent girls."}, "http://arxiv.org/abs/2401.11842": {"title": "Subgroup analysis methods for time-to-event outcomes in heterogeneous randomized controlled trials", "link": "http://arxiv.org/abs/2401.11842", "description": "Non-significant randomized control trials can hide subgroups of good\nresponders to experimental drugs, thus hindering subsequent development.\nIdentifying such heterogeneous treatment effects is key for precision medicine\nand many post-hoc analysis methods have been developed for that purpose. While\nseveral benchmarks have been carried out to identify the strengths and\nweaknesses of these methods, notably for binary and continuous endpoints,\nsimilar systematic empirical evaluation of subgroup analysis for time-to-event\nendpoints are lacking. This work aims to fill this gap by evaluating several\nsubgroup analysis algorithms in the context of time-to-event outcomes, by means\nof three different research questions: Is there heterogeneity? What are the\nbiomarkers responsible for such heterogeneity? Who are the good responders to\ntreatment? 
In this context, we propose a new synthetic and semi-synthetic data\ngeneration process that allows one to explore a wide range of heterogeneity\nscenarios with precise control over the level of heterogeneity. We provide an\nopen source Python package, available on GitHub, containing our generation\nprocess and our comprehensive benchmark framework. We hope this package will be\nuseful to the research community for future investigations of heterogeneity of\ntreatment effects and for benchmarking subgroup analysis methods."}, "http://arxiv.org/abs/2401.11885": {"title": "Bootstrap prediction regions for daily curves of electricity demand and price using functional data", "link": "http://arxiv.org/abs/2401.11885", "description": "The aim of this paper is to compute one-day-ahead prediction regions for\ndaily curves of electricity demand and price. Three model-based procedures to\nconstruct general prediction regions are proposed, all of them using bootstrap\nalgorithms. The first proposed method considers any $L_p$ norm for functional\ndata to measure the distance between curves, the second one is designed to take\ndifferent variabilities along the curve into account, and the third one takes\nadvantage of the notion of depth for functional data. The regression model\nwith functional response on which our proposed prediction regions are based is\nrather general: it allows the inclusion of both endogenous and exogenous functional\nvariables, as well as exogenous scalar variables; in addition, the effect of\nsuch variables on the response is modeled in a parametric, nonparametric or\nsemi-parametric way. A comparative study is carried out to analyse the\nperformance of these prediction regions for the electricity market of mainland\nSpain in the year 2012. This work extends and complements the methods and results\nin Aneiros et al. (2016) (focused on curve prediction) and Vilar et al. (2018)\n(focused on prediction intervals), which use the same database as here."}, "http://arxiv.org/abs/2401.11948": {"title": "The Ensemble Kalman Filter for Dynamic Inverse Problems", "link": "http://arxiv.org/abs/2401.11948", "description": "In inverse problems, the goal is to estimate unknown model parameters from\nnoisy observational data. Traditionally, inverse problems are solved under the\nassumption of a fixed forward operator describing the observation model. In\nthis article, we consider the extension of this approach to situations where we\nhave a dynamic forward model, motivated by applications in scientific\ncomputation and engineering. We specifically consider this extension for a\nderivative-free optimizer, the ensemble Kalman inversion (EKI). We introduce\nand justify a new methodology called dynamic-EKI, which is a particle-based\nmethod with a changing forward operator. We analyze our new method, presenting\nresults related to the control of our particle system through its covariance\nstructure. This analysis includes moment bounds and an ensemble collapse, which\nare essential for demonstrating a convergence result. 
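For readers unfamiliar with ensemble Kalman inversion, the dynamic-EKI entry above builds on the standard EKI update; the sketch below shows one textbook-style EKI step with a fixed forward map and perturbed observations. It is our own code and naming, not the paper's dynamic-EKI with a changing forward operator.

```python
import numpy as np

def eki_step(U, G, y, Gamma, rng):
    """One basic ensemble Kalman inversion update (perturbed observations).

    U: (J, d) parameter ensemble, G: forward map u -> prediction of y,
    y: observed data of length m, Gamma: (m, m) observation noise covariance.
    """
    GU = np.array([G(u) for u in U])                  # forward-model evaluations
    u_mean, g_mean = U.mean(axis=0), GU.mean(axis=0)
    Cug = (U - u_mean).T @ (GU - g_mean) / len(U)     # parameter-output covariance
    Cgg = (GU - g_mean).T @ (GU - g_mean) / len(U)    # output covariance
    K = Cug @ np.linalg.solve(Cgg + Gamma, np.eye(len(y)))
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), Gamma, size=len(U))
    return U + (y_pert - GU) @ K.T                    # Kalman-style ensemble update
```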
We establish convergence\nin expectation and validate our theoretical findings through experiments with\ndynamic-EKI applied to a 2D Darcy flow partial differential equation."}, "http://arxiv.org/abs/2401.12031": {"title": "Multi-objective optimisation using expected quantile improvement for decision making in disease outbreaks", "link": "http://arxiv.org/abs/2401.12031", "description": "Optimization under uncertainty is important in many applications,\nparticularly to inform policy and decision making in areas such as public\nhealth. A key source of uncertainty arises from the incorporation of\nenvironmental variables as inputs into computational models or simulators. Such\nvariables represent uncontrollable features of the optimization problem and\nreliable decision making must account for the uncertainty they propagate to the\nsimulator outputs. Often, multiple, competing objectives are defined from these\noutputs such that the final optimal decision is a compromise between different\ngoals.\n\nHere, we present emulation-based optimization methodology for such problems\nthat extends expected quantile improvement (EQI) to address multi-objective\noptimization. Focusing on the practically important case of two objectives, we\nuse a sequential design strategy to identify the Pareto front of optimal\nsolutions. Uncertainty from the environmental variables is integrated out using\nMonte Carlo samples from the simulator. Interrogation of the expected output\nfrom the simulator is facilitated by use of (Gaussian process) emulators. The\nmethodology is demonstrated on an optimization problem from public health\ninvolving the dispersion of anthrax spores across a spatial terrain.\nEnvironmental variables include meteorological features that impact the\ndispersion, and the methodology identifies the Pareto front even when there is\nconsiderable input uncertainty."}, "http://arxiv.org/abs/2401.12126": {"title": "Biological species delimitation based on genetic and spatial dissimilarity: a comparative study", "link": "http://arxiv.org/abs/2401.12126", "description": "The delimitation of biological species, i.e., deciding which individuals\nbelong to the same species and whether and how many different species are\nrepresented in a data set, is key to the conservation of biodiversity. Much\nexisting work uses only genetic data for species delimitation, often employing\nsome kind of cluster analysis. This can be misleading, because geographically\ndistant groups of individuals can be genetically quite different even if they\nbelong to the same species. This paper investigates the problem of testing\nwhether two potentially separated groups of individuals can belong to a single\nspecies or not based on genetic and spatial data. Various approaches are\ncompared (some of which already exist in the literature) based on simulated\nmetapopulations generated with SLiM and GSpace - two software packages that can\nsimulate spatially-explicit genetic data at an individual level. Approaches\ninvolve partial Mantel testing, maximum likelihood mixed-effects models with a\npopulation effect, and jackknife-based homogeneity tests. A key challenge is\nthat most tests perform on genetic and geographical distance data, violating\nstandard independence assumptions. 
Simulations showed that partial Mantel tests\nand mixed-effects models have larger power than jackknife-based methods, but\ntend to display type-I error rates slightly above the significance level.\nMoreover, a multiple regression model neglecting the dependence in the\ndissimilarities did not show an inflated type-I error rate. An application to\nbrassy ringlets concludes the paper."}, "http://arxiv.org/abs/1412.0367": {"title": "Bayesian nonparametric modeling for mean residual life regression", "link": "http://arxiv.org/abs/1412.0367", "description": "The mean residual life function is a key functional for a survival\ndistribution. It has a practically useful interpretation as the expected\nremaining lifetime given survival up to a particular time point, and it also\ncharacterizes the survival distribution. However, it has received limited\nattention in terms of inference methods under a probabilistic modeling\nframework. In this paper, we seek to provide general inference methodology for\nmean residual life regression. Survival data often include a set of predictor\nvariables for the survival response distribution, and in many cases it is\nnatural to include the covariates as random variables in the modeling. We\nthus propose a Dirichlet process mixture modeling approach for the joint\nstochastic mechanism of the covariates and survival responses. This approach\nimplies a flexible model structure for the mean residual life of the\nconditional response distribution, allowing general shapes for mean residual\nlife as a function of covariates given a specific time point, as well as a\nfunction of time given particular values of the covariate vector. To expand the\nscope of the modeling framework, we extend the mixture model to incorporate\ndependence across experimental groups, such as treatment and control groups.\nThis extension is built from a dependent Dirichlet process prior for the\ngroup-specific mixing distributions, with common locations and weights that\nvary across groups through latent bivariate beta distributed random variables.\nWe develop properties of the proposed regression models, and discuss methods\nfor prior specification and posterior inference. The different components of\nthe methodology are illustrated with simulated data sets. Moreover, the\nmodeling approach is applied to a data set comprising right censored survival\ntimes of patients with small cell lung cancer."}, "http://arxiv.org/abs/2110.00314": {"title": "Confounder importance learning for treatment effect inference", "link": "http://arxiv.org/abs/2110.00314", "description": "We address modelling and computational issues for multiple treatment effect\ninference under many potential confounders. A primary issue relates to\npreventing harmful effects from omitting relevant covariates (under-selection),\nwhile not running into over-selection issues that introduce substantial\nvariance and a bias related to the non-random over-inclusion of covariates. We\npropose a novel empirical Bayes framework for Bayesian model averaging that\nlearns from data the extent to which the inclusion of key covariates should be\nencouraged, specifically those highly associated with the treatments. A key\nchallenge is computational. We develop fast algorithms, including an\nExpectation-Propagation variational approximation and simple stochastic\ngradient optimization algorithms, to learn the hyper-parameters from data. 
Our\nframework relies on widely-used ingredients and largely existing software, and it is\nimplemented within the R package mombf featured on CRAN. This work is motivated\nby and is illustrated in two applications. The first is the association between\nsalary variation and discriminatory factors. The second, which has been debated\nin previous works, is the association between abortion policies and crime. Our\napproach provides insights that differ from previous analyses, especially in\nsituations with weaker treatment effects."}, "http://arxiv.org/abs/2201.12865": {"title": "Extremal Random Forests", "link": "http://arxiv.org/abs/2201.12865", "description": "Classical methods for quantile regression fail in cases where the quantile of\ninterest is extreme and only a few or no training data points exceed it.\nAsymptotic results from extreme value theory can be used to extrapolate beyond\nthe range of the data, and several approaches exist that use linear regression,\nkernel methods or generalized additive models. Most of these methods break down\nif the predictor space has more than a few dimensions or if the regression\nfunction of extreme quantiles is complex. We propose a method for extreme\nquantile regression that combines the flexibility of random forests with the\ntheory of extrapolation. Our extremal random forest (ERF) estimates the\nparameters of a generalized Pareto distribution, conditional on the predictor\nvector, by maximizing a local likelihood with weights extracted from a quantile\nrandom forest. We penalize the shape parameter in this likelihood to regularize\nits variability in the predictor space. Under general domain of attraction\nconditions, we show consistency of the estimated parameters in both the\nunpenalized and penalized case. Simulation studies show that our ERF\noutperforms both classical quantile regression methods and existing regression\napproaches from extreme value theory. We apply our methodology to extreme\nquantile prediction for U.S. wage data."}, "http://arxiv.org/abs/2203.00144": {"title": "The Concordance Index decomposition: A measure for a deeper understanding of survival prediction models", "link": "http://arxiv.org/abs/2203.00144", "description": "The Concordance Index (C-index) is a commonly used metric in Survival\nAnalysis for evaluating the performance of a prediction model. In this paper,\nwe propose a decomposition of the C-index into a weighted harmonic mean of two\nquantities: one for ranking observed events versus other observed events, and\nthe other for ranking observed events versus censored cases. This decomposition\nenables a finer-grained analysis of the relative strengths and weaknesses\nof different survival prediction methods. The usefulness of this\ndecomposition is demonstrated through benchmark comparisons against classical\nmodels and state-of-the-art methods, together with the new variational\ngenerative neural-network-based method (SurVED) proposed in this paper. The\nperformance of the models is assessed using four publicly available datasets\nwith varying levels of censoring. Using the C-index decomposition and synthetic\ncensoring, the analysis shows that deep learning models utilize the observed\nevents more effectively than other models. This allows them to keep a stable\nC-index in different censoring levels. 
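To make the C-index decomposition entry above concrete, this sketch computes concordance separately over event-versus-event and event-versus-censored comparable pairs. The weighted harmonic mean recombination from the paper is not reproduced, ties and edge cases are ignored, and the function is our own illustrative code.

```python
import numpy as np

def c_index_components(time, event, risk):
    """Concordance computed separately for event-vs-event and
    event-vs-censored comparable pairs (illustrative, ignores ties)."""
    time, event, risk = map(np.asarray, (time, event, risk))
    conc = {"event_vs_event": 0, "event_vs_censored": 0}
    comp = {"event_vs_event": 0, "event_vs_censored": 0}
    for i in np.where(event == 1)[0]:          # pairs are anchored at an event
        for j in np.where(time > time[i])[0]:  # j must outlive subject i
            key = "event_vs_event" if event[j] == 1 else "event_vs_censored"
            comp[key] += 1
            conc[key] += int(risk[i] > risk[j])  # higher risk should fail earlier
    return {k: conc[k] / comp[k] for k in comp if comp[k] > 0}
```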
In contrast to such deep learning\nmethods, classical machine learning models deteriorate when the censoring level\ndecreases due to their inability to improve on ranking the events versus other\nevents."}, "http://arxiv.org/abs/2211.01799": {"title": "Statistical Inference for Scale Mixture Models via Mellin Transform Approach", "link": "http://arxiv.org/abs/2211.01799", "description": "This paper deals with statistical inference for the scale mixture models. We\nstudy an estimation approach based on the Mellin -- Stieltjes transform that\ncan be applied to both discrete and absolute continuous mixing distributions.\nThe accuracy of the corresponding estimate is analysed in terms of its expected\npointwise error. As an important technical result, we prove the analogue of the\nBerry -- Esseen inequality for the Mellin transforms. The proposed statistical\napproach is illustrated by numerical examples."}, "http://arxiv.org/abs/2211.16155": {"title": "High-Dimensional Block Diagonal Covariance Structure Detection Using Singular Vectors", "link": "http://arxiv.org/abs/2211.16155", "description": "The assumption of independent subvectors arises in many aspects of\nmultivariate analysis. In most real-world applications, however, we lack prior\nknowledge about the number of subvectors and the specific variables within each\nsubvector. Yet, testing all these combinations is not feasible. For example,\nfor a data matrix containing 15 variables, there are already 1 382 958 545\npossible combinations. Given that zero correlation is a necessary condition for\nindependence, independent subvectors exhibit a block diagonal covariance\nmatrix. This paper focuses on the detection of such block diagonal covariance\nstructures in high-dimensional data and therefore also identifies uncorrelated\nsubvectors. Our nonparametric approach exploits the fact that the structure of\nthe covariance matrix is mirrored by the structure of its eigenvectors.\nHowever, the true block diagonal structure is masked by noise in the sample\ncase. To address this problem, we propose to use sparse approximations of the\nsample eigenvectors to reveal the sparse structure of the population\neigenvectors. Notably, the right singular vectors of a data matrix with an\noverall mean of zero are identical to the sample eigenvectors of its covariance\nmatrix. Using sparse approximations of these singular vectors instead of the\neigenvectors makes the estimation of the covariance matrix obsolete. We\ndemonstrate the performance of our method through simulations and provide real\ndata examples. Supplementary materials for this article are available online."}, "http://arxiv.org/abs/2303.05659": {"title": "A marginal structural model for normal tissue complication probability", "link": "http://arxiv.org/abs/2303.05659", "description": "The goal of radiation therapy for cancer is to deliver prescribed radiation\ndose to the tumor while minimizing dose to the surrounding healthy tissues. To\nevaluate treatment plans, the dose distribution to healthy organs is commonly\nsummarized as dose-volume histograms (DVHs). Normal tissue complication\nprobability (NTCP) modelling has centered around making patient-level risk\npredictions with features extracted from the DVHs, but few have considered\nadapting a causal framework to evaluate the safety of alternative treatment\nplans. 
We propose causal estimands for NTCP based on deterministic and\nstochastic interventions, as well as propose estimators based on marginal\nstructural models that impose bivariable monotonicity between dose, volume, and\ntoxicity risk. The properties of these estimators are studied through\nsimulations, and their use is illustrated in the context of radiotherapy\ntreatment of anal canal cancer patients."}, "http://arxiv.org/abs/2305.04141": {"title": "Geostatistical capture-recapture models", "link": "http://arxiv.org/abs/2305.04141", "description": "Methods for population estimation and inference have evolved over the past\ndecade to allow for the incorporation of spatial information when using\ncapture-recapture study designs. Traditional approaches to specifying spatial\ncapture-recapture (SCR) models often rely on an individual-based detection\nfunction that decays as a detection location is farther from an individual's\nactivity center. Traditional SCR models are intuitive because they incorporate\nmechanisms of animal space use based on their assumptions about activity\ncenters. We modify the SCR model to accommodate a wide range of space use\npatterns, including for those individuals that may exhibit traditional\nelliptical utilization distributions. Our approach uses underlying Gaussian\nprocesses to characterize the space use of individuals. This allows us to\naccount for multimodal and other complex space use patterns that may arise due\nto movement. We refer to this class of models as geostatistical\ncapture-recapture (GCR) models. We adapt a recursive computing strategy to fit\nGCR models to data in stages, some of which can be parallelized. This technique\nfacilitates implementation and leverages modern multicore and distributed\ncomputing environments. We demonstrate the application of GCR models by\nanalyzing both simulated data and a data set involving capture histories of\nsnowshoe hares in central Colorado, USA."}, "http://arxiv.org/abs/2308.03355": {"title": "Nonparametric Bayes multiresolution testing for high-dimensional rare events", "link": "http://arxiv.org/abs/2308.03355", "description": "In a variety of application areas, there is interest in assessing evidence of\ndifferences in the intensity of event realizations between groups. For example,\nin cancer genomic studies collecting data on rare variants, the focus is on\nassessing whether and how the variant profile changes with the disease subtype.\nMotivated by this application, we develop multiresolution nonparametric Bayes\ntests for differential mutation rates across groups. The multiresolution\napproach yields fast and accurate detection of spatial clusters of rare\nvariants, and our nonparametric Bayes framework provides great flexibility for\nmodeling the intensities of rare variants. Some theoretical properties are also\nassessed, including weak consistency of our Dirichlet Process-Poisson-Gamma\nmixture over multiple resolutions. 
Simulation studies illustrate excellent\nsmall sample properties relative to competitors, and we apply the method to\ndetect rare variants related to common variable immunodeficiency from whole\nexome sequencing data on 215 patients and over 60,027 control subjects."}, "http://arxiv.org/abs/2309.15316": {"title": "Leveraging Neural Networks to Profile Health Care Providers with Application to Medicare Claims", "link": "http://arxiv.org/abs/2309.15316", "description": "Encompassing numerous nationwide, statewide, and institutional initiatives in\nthe United States, provider profiling has evolved into a major health care\nundertaking with ubiquitous applications, profound implications, and\nhigh-stakes consequences. In line with such a significant profile, the\nliterature has accumulated a number of developments dedicated to enhancing the\nstatistical paradigm of provider profiling. Tackling wide-ranging profiling\nissues, these methods typically adjust for risk factors using linear\npredictors. While this approach is simple, it can be too restrictive to\ncharacterize complex and dynamic factor-outcome associations in certain\ncontexts. One such example arises from evaluating dialysis facilities treating\nMedicare beneficiaries with end-stage renal disease. It is of primary interest\nto consider how the coronavirus disease (COVID-19) affected 30-day unplanned\nreadmissions in 2020. The impact of COVID-19 on the risk of readmission varied\ndramatically across pandemic phases. To efficiently capture the variation while\nprofiling facilities, we develop a generalized partially linear model (GPLM)\nthat incorporates a neural network. Considering provider-level clustering, we\nimplement the GPLM as a stratified sampling-based stochastic optimization\nalgorithm that features accelerated convergence. Furthermore, an exact test is\ndesigned to identify under- and over-performing facilities, with an\naccompanying funnel plot to visualize profiles. The advantages of the proposed\nmethods are demonstrated through simulation experiments and profiling dialysis\nfacilities using 2020 Medicare claims from the United States Renal Data System."}, "http://arxiv.org/abs/2401.12369": {"title": "SubgroupTE: Advancing Treatment Effect Estimation with Subgroup Identification", "link": "http://arxiv.org/abs/2401.12369", "description": "Precise estimation of treatment effects is crucial for evaluating\nintervention effectiveness. While deep learning models have exhibited promising\nperformance in learning counterfactual representations for treatment effect\nestimation (TEE), a major limitation in most of these models is that they treat\nthe entire population as a homogeneous group, overlooking the diversity of\ntreatment effects across potential subgroups that have varying treatment\neffects. This limitation restricts the ability to precisely estimate treatment\neffects and provide subgroup-specific treatment recommendations. In this paper,\nwe propose a novel treatment effect estimation model, named SubgroupTE, which\nincorporates subgroup identification in TEE. SubgroupTE identifies\nheterogeneous subgroups with different treatment responses and more precisely\nestimates treatment effects by considering subgroup-specific causal effects. 
In\naddition, SubgroupTE iteratively optimizes subgrouping and treatment effect\nestimation networks to enhance both estimation and subgroup identification.\nComprehensive experiments on the synthetic and semi-synthetic datasets exhibit\nthe outstanding performance of SubgroupTE compared with the state-of-the-art\nmodels on treatment effect estimation. Additionally, a real-world study\ndemonstrates the capabilities of SubgroupTE in enhancing personalized treatment\nrecommendations for patients with opioid use disorder (OUD) by advancing\ntreatment effect estimation with subgroup identification."}, "http://arxiv.org/abs/2401.12420": {"title": "Rank-based estimators of global treatment effects for cluster randomized trials with multiple endpoints", "link": "http://arxiv.org/abs/2401.12420", "description": "Cluster randomization trials commonly employ multiple endpoints. When a\nsingle summary of treatment effects across endpoints is of primary interest,\nglobal hypothesis testing/effect estimation methods represent a common analysis\nstrategy. However, specification of the joint distribution required by these\nmethods is non-trivial, particularly when endpoint properties differ. We\ndevelop rank-based interval estimators for a global treatment effect referred\nto as the \"global win probability,\" or the probability that a treatment\nindividual responds better than a control individual on average. Using\nendpoint-specific ranks among the combined sample and within each arm, each\nindividual-level observation is converted to a \"win fraction\" which quantifies\nthe proportion of wins experienced over every observation in the comparison\narm. An individual's multiple observations are then replaced by a single\n\"global win fraction,\" constructed by averaging win fractions across endpoints.\nA linear mixed model is applied directly to the global win fractions to recover\npoint, variance, and interval estimates of the global win probability adjusted\nfor clustering. Simulation demonstrates our approach performs well concerning\ncoverage and type I error, and methods are easily implemented using standard\nsoftware. A case study using publicly available data is provided with\ncorresponding R and SAS code."}, "http://arxiv.org/abs/2401.12640": {"title": "Multilevel network meta-regression for general likelihoods: synthesis of individual and aggregate data with applications to survival analysis", "link": "http://arxiv.org/abs/2401.12640", "description": "Network meta-analysis combines aggregate data (AgD) from multiple randomised\ncontrolled trials, assuming that any effect modifiers are balanced across\npopulations. Individual patient data (IPD) meta-regression is the ``gold\nstandard'' method to relax this assumption, however IPD are frequently only\navailable in a subset of studies. Multilevel network meta-regression (ML-NMR)\nextends IPD meta-regression to incorporate AgD studies whilst avoiding\naggregation bias, but currently requires the aggregate-level likelihood to have\na known closed form. Notably, this prevents application to time-to-event\noutcomes.\n\nWe extend ML-NMR to individual-level likelihoods of any form, by integrating\nthe individual-level likelihood function over the AgD covariate distributions\nto obtain the respective marginal likelihood contributions. 
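Editor's note: the "win fraction" construction in the 2401.12420 abstract above can be made concrete with a direct pairwise computation. This sketch counts ties as half-wins (an assumption, not necessarily the authors' convention), uses the hypothetical helper `win_fractions`, and only indicates the cluster-adjusted mixed-model step.

```python
# Minimal sketch: endpoint-wise win fractions averaged into a global win
# fraction per individual; the mixed-model adjustment is only indicated.
import numpy as np

def win_fractions(treat, control):
    """Each row = one individual, each column = one endpoint (higher = better)."""
    t = np.asarray(treat, float)[:, None, :]      # (n_t, 1, K)
    c = np.asarray(control, float)[None, :, :]    # (1, n_c, K)
    wins_t = (t > c).mean(axis=1) + 0.5 * (t == c).mean(axis=1)   # (n_t, K)
    wins_c = (c > t).mean(axis=0) + 0.5 * (c == t).mean(axis=0)   # (n_c, K)
    return wins_t.mean(axis=1), wins_c.mean(axis=1)               # global win fractions

rng = np.random.default_rng(1)
treat = rng.normal(0.3, 1.0, size=(40, 2))     # 40 treated individuals, 2 endpoints
control = rng.normal(0.0, 1.0, size=(40, 2))
gwf_t, gwf_c = win_fractions(treat, control)
print("estimated global win probability:", gwf_t.mean())
# In the workflow described above, the global win fractions would then be fed to
# a linear mixed model (e.g. statsmodels MixedLM) with a cluster random effect.
```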
We illustrate with\ntwo examples of time-to-event outcomes, showing the performance of ML-NMR in a\nsimulated comparison with little loss of precision from a full IPD analysis,\nand demonstrating flexible modelling of baseline hazards using cubic M-splines\nwith synthetic data on newly diagnosed multiple myeloma.\n\nML-NMR is a general method for synthesising individual and aggregate level\ndata in networks of all sizes. Extension to general likelihoods, including for\nsurvival outcomes, greatly increases the applicability of the method. R and\nStan code is provided, and the methods are implemented in the multinma R\npackage."}, "http://arxiv.org/abs/2401.12697": {"title": "A Computationally Efficient Approach to False Discovery Rate Control and Power Maximisation via Randomisation and Mirror Statistic", "link": "http://arxiv.org/abs/2401.12697", "description": "Simultaneously performing variable selection and inference in\nhigh-dimensional regression models is an open challenge in statistics and\nmachine learning. The increasing availability of vast amounts of variables\nrequires the adoption of specific statistical procedures to accurately select\nthe most important predictors in a high-dimensional space, while controlling\nthe False Discovery Rate (FDR) arising from the underlying multiple hypothesis\ntesting. In this paper we propose the joint adoption of the Mirror Statistic\napproach to FDR control, coupled with outcome randomisation to maximise the\nstatistical power of the variable selection procedure. Through extensive\nsimulations we show how our proposed strategy allows to combine the benefits of\nthe two techniques. The Mirror Statistic is a flexible method to control FDR,\nwhich only requires mild model assumptions, but requires two sets of\nindependent regression coefficient estimates, usually obtained after splitting\nthe original dataset. Outcome randomisation is an alternative to Data\nSplitting, that allows to generate two independent outcomes, which can then be\nused to estimate the coefficients that go into the construction of the Mirror\nStatistic. The combination of these two approaches provides increased testing\npower in a number of scenarios, such as highly correlated covariates and high\npercentages of active variables. Moreover, it is scalable to very\nhigh-dimensional problems, since the algorithm has a low memory footprint and\nonly requires a single run on the full dataset, as opposed to iterative\nalternatives such as Multiple Data Splitting."}, "http://arxiv.org/abs/2401.12753": {"title": "Optimal Confidence Bands for Shape-restricted Regression in Multidimensions", "link": "http://arxiv.org/abs/2401.12753", "description": "In this paper, we propose and study construction of confidence bands for\nshape-constrained regression functions when the predictor is multivariate. In\nparticular, we consider the continuous multidimensional white noise model given\nby $d Y(\\mathbf{t}) = n^{1/2} f(\\mathbf{t}) \\,d\\mathbf{t} + d W(\\mathbf{t})$,\nwhere $Y$ is the observed stochastic process on $[0,1]^d$ ($d\\ge 1$), $W$ is\nthe standard Brownian sheet on $[0,1]^d$, and $f$ is the unknown function of\ninterest assumed to belong to a (shape-constrained) function class, e.g.,\ncoordinate-wise monotone functions or convex functions. The constructed\nconfidence bands are based on local kernel averaging with bandwidth chosen\nautomatically via a multivariate multiscale statistic. 
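Editor's note: the Mirror Statistic procedure referenced in the 2401.12697 abstract above can be sketched in its plain data-splitting form; the paper's outcome-randomisation variant is not implemented here, and the lasso fits on each half (with the penalty `alpha`) are an illustrative choice.

```python
# Minimal sketch of FDR control via the Mirror Statistic with plain data
# splitting (not the outcome-randomisation scheme proposed in the paper).
import numpy as np
from sklearn.linear_model import Lasso

def mirror_select(X, y, q=0.1, alpha=0.05, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2 :]
    b1 = Lasso(alpha=alpha).fit(X[half1], y[half1]).coef_   # independent estimate 1
    b2 = Lasso(alpha=alpha).fit(X[half2], y[half2]).coef_   # independent estimate 2
    m = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))         # mirror statistics
    # smallest threshold whose estimated false discovery proportion is below q
    for t in np.sort(np.abs(m[m != 0])):
        fdp = (m <= -t).sum() / max((m >= t).sum(), 1)
        if fdp <= q:
            return np.where(m >= t)[0]
    return np.array([], dtype=int)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 100))
beta = np.zeros(100); beta[:10] = 1.0
y = X @ beta + rng.normal(size=400)
print("selected:", mirror_select(X, y))   # should mostly recover indices 0..9
```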
The confidence bands\nhave guaranteed coverage for every $n$ and for every member of the underlying\nfunction class. Under monotonicity/convexity constraints on $f$, the proposed\nconfidence bands automatically adapt (in terms of width) to the global and\nlocal (H\\\"{o}lder) smoothness and intrinsic dimensionality of the unknown $f$;\nthe bands are also shown to be optimal in a certain sense. These bands have\n(almost) parametric ($n^{-1/2}$) widths when the underlying function has\n``low-complexity'' (e.g., piecewise constant/affine)."}, "http://arxiv.org/abs/2401.12776": {"title": "Sub-model aggregation for scalable eigenvector spatial filtering: Application to spatially varying coefficient modeling", "link": "http://arxiv.org/abs/2401.12776", "description": "This study proposes a method for aggregating/synthesizing global and local\nsub-models for fast and flexible spatial regression modeling. Eigenvector\nspatial filtering (ESF) was used to model spatially varying coefficients and\nspatial dependence in the residuals by sub-model, while the generalized\nproduct-of-experts method was used to aggregate these sub-models. The major\nadvantages of the proposed method are as follows: (i) it is highly scalable for\nlarge samples in terms of accuracy and computational efficiency; (ii) it is\neasily implemented by estimating sub-models independently first and\naggregating/averaging them thereafter; and (iii) likelihood-based inference is\navailable because the marginal likelihood is available in closed-form. The\naccuracy and computational efficiency of the proposed method are confirmed\nusing Monte Carlo simulation experiments. This method was then applied to\nresidential land price analysis in Japan. The results demonstrate the\nusefulness of this method for improving the interpretability of spatially\nvarying coefficients. The proposed method is implemented in an R package\nspmoran (version 0.3.0 or later)."}, "http://arxiv.org/abs/2401.12827": {"title": "Distributed Empirical Likelihood Inference With or Without Byzantine Failures", "link": "http://arxiv.org/abs/2401.12827", "description": "Empirical likelihood is a very important nonparametric approach which is of\nwide application. However, it is hard and even infeasible to calculate the\nempirical log-likelihood ratio statistic with massive data. The main challenge\nis the calculation of the Lagrange multiplier. This motivates us to develop a\ndistributed empirical likelihood method by calculating the Lagrange multiplier\nin a multi-round distributed manner. It is shown that the distributed empirical\nlog-likelihood ratio statistic is asymptotically standard chi-squared under\nsome mild conditions. The proposed algorithm is communication-efficient and\nachieves the desired accuracy in a few rounds. Further, the distributed\nempirical likelihood method is extended to the case of Byzantine failures. A\nmachine selection algorithm is developed to identify the worker machines\nwithout Byzantine failures such that the distributed empirical likelihood\nmethod can be applied. 
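Editor's note: the 2401.12827 abstract above identifies the Lagrange multiplier as the computational bottleneck of empirical likelihood. The sketch below shows a single-machine Newton solver for the empirical likelihood of a mean, just to make that object concrete; the multi-round distributed scheme itself is not shown, and `el_log_ratio` is a hypothetical helper.

```python
# Single-machine sketch: Newton's method for the empirical likelihood Lagrange
# multiplier under H0: E[X] = mu (the distributed variant is not shown).
import numpy as np

def el_log_ratio(x, mu, iters=50, tol=1e-10):
    """Return -2 log empirical likelihood ratio; asymptotically chi2(dim) under H0."""
    z = np.atleast_2d(x) - np.atleast_1d(mu)               # centred observations
    lam = np.zeros(z.shape[1])
    for _ in range(iters):
        w = 1.0 + z @ lam                                   # EL weights are 1 / (n * w)
        grad = (z / w[:, None]).sum(axis=0)                 # estimating equation in lambda
        if np.linalg.norm(grad) < tol:
            break
        hess = -(z[:, :, None] * z[:, None, :] / (w ** 2)[:, None, None]).sum(axis=0)
        step = np.linalg.solve(hess, -grad)                 # Newton direction
        for _ in range(30):                                 # damp to keep all weights positive
            if np.all(1.0 + z @ (lam + step) > 0):
                break
            step /= 2.0
        lam = lam + step
    return 2.0 * np.sum(np.log(1.0 + z @ lam))

rng = np.random.default_rng(3)
x = rng.normal(size=(500, 2))
print(el_log_ratio(x, mu=[0.0, 0.0]))   # behaves like a chi2(2) draw under H0
```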
The proposed methods are evaluated by numerical\nsimulations and illustrated with an analysis of airline on-time performance\nstudy and a surface climate analysis of Yangtze River Economic Belt."}, "http://arxiv.org/abs/2401.12836": {"title": "Empirical Likelihood Inference over Decentralized Networks", "link": "http://arxiv.org/abs/2401.12836", "description": "As a nonparametric statistical inference approach, empirical likelihood has\nbeen found very useful in numerous occasions. However, it encounters serious\ncomputational challenges when applied directly to the modern massive dataset.\nThis article studies empirical likelihood inference over decentralized\ndistributed networks, where the data are locally collected and stored by\ndifferent nodes. To fully utilize the data, this article fuses Lagrange\nmultipliers calculated in different nodes by employing a penalization\ntechnique. The proposed distributed empirical log-likelihood ratio statistic\nwith Lagrange multipliers solved by the penalized function is asymptotically\nstandard chi-squared under regular conditions even for a divergent machine\nnumber. Nevertheless, the optimization problem with the fused penalty is still\nhard to solve in the decentralized distributed network. To address the problem,\ntwo alternating direction method of multipliers (ADMM) based algorithms are\nproposed, which both have simple node-based implementation schemes.\nTheoretically, this article establishes convergence properties for proposed\nalgorithms, and further proves the linear convergence of the second algorithm\nin some specific network structures. The proposed methods are evaluated by\nnumerical simulations and illustrated with analyses of census income and Ford\ngobike datasets."}, "http://arxiv.org/abs/2401.12865": {"title": "Gridsemble: Selective Ensembling for False Discovery Rates", "link": "http://arxiv.org/abs/2401.12865", "description": "In this paper, we introduce Gridsemble, a data-driven selective ensembling\nalgorithm for estimating local false discovery rates (fdr) in large-scale\nmultiple hypothesis testing. Existing methods for estimating fdr often yield\ndifferent conclusions, yet the unobservable nature of fdr values prevents the\nuse of traditional model selection. There is limited guidance on choosing a\nmethod for a given dataset, making this an arbitrary decision in practice.\nGridsemble circumvents this challenge by ensembling a subset of methods with\nweights based on their estimated performances, which are computed on synthetic\ndatasets generated to mimic the observed data while including ground truth. We\ndemonstrate through simulation studies and an experimental application that\nthis method outperforms three popular R software packages with their default\nparameter values$\\unicode{x2014}$common choices given the current landscape.\nWhile our applications are in the context of high throughput transcriptomics,\nwe emphasize that Gridsemble is applicable to any use of large-scale multiple\nhypothesis testing, an approach that is utilized in many fields. We believe\nthat Gridsemble will be a useful tool for computing reliable estimates of fdr\nand for improving replicability in the presence of multiple hypotheses by\neliminating the need for an arbitrary choice of method. 
Gridsemble is\nimplemented in an open-source R software package available on GitHub at\njennalandy/gridsemblefdr."}, "http://arxiv.org/abs/2401.12905": {"title": "Estimating the construct validity of Principal Components Analysis", "link": "http://arxiv.org/abs/2401.12905", "description": "In many scientific disciplines, the features of interest cannot be observed\ndirectly, so must instead be inferred from observed behaviour. Latent variable\nanalyses are increasingly employed to systematise these inferences, and\nPrincipal Components Analysis (PCA) is perhaps the simplest and most popular of\nthese methods. Here, we examine how the assumptions that we are prepared to\nentertain, about the latent variable system, mediate the likelihood that\nPCA-derived components will capture the true sources of variance underlying\ndata. As expected, we find that this likelihood is excellent in the best case,\nand robust to empirically reasonable levels of measurement noise, but best-case\nperformance is also: (a) not robust to violations of the method's more\nprominent assumptions, of linearity and orthogonality; and also (b) requires\nthat other subtler assumptions be made, such as that the latent variables\nshould have varying importance, and that weights relating latent variables to\nobserved data have zero mean. Neither variance explained, nor replication in\nindependent samples, could reliably predict which (if any) PCA-derived\ncomponents will capture true sources of variance in data. We conclude by\ndescribing a procedure to fit these inferences more directly to empirical data,\nand use it to find that components derived via PCA from two different empirical\nneuropsychological datasets, are less likely to have meaningful referents in\nthe brain than we hoped."}, "http://arxiv.org/abs/2401.12911": {"title": "Pretraining and the Lasso", "link": "http://arxiv.org/abs/2401.12911", "description": "Pretraining is a popular and powerful paradigm in machine learning. As an\nexample, suppose one has a modest-sized dataset of images of cats and dogs, and\nplans to fit a deep neural network to classify them from the pixel features.\nWith pretraining, we start with a neural network trained on a large corpus of\nimages, consisting of not just cats and dogs but hundreds of other image types.\nThen we fix all of the network weights except for the top layer (which makes\nthe final classification) and train (or \"fine tune\") those weights on our\ndataset. This often results in dramatically better performance than the network\ntrained solely on our smaller dataset.\n\nIn this paper, we ask the question \"Can pretraining help the lasso?\". We\ndevelop a framework for the lasso in which an overall model is fit to a large\nset of data, and then fine-tuned to a specific task on a smaller dataset. This\nlatter dataset can be a subset of the original dataset, but does not need to\nbe. We find that this framework has a wide variety of applications, including\nstratified models, multinomial targets, multi-response models, conditional\naverage treatment estimation and even gradient boosting.\n\nIn the stratified model setting, the pretrained lasso pipeline estimates the\ncoefficients common to all groups at the first stage, and then group specific\ncoefficients at the second \"fine-tuning\" stage. We show that under appropriate\nassumptions, the support recovery rate of the common coefficients is superior\nto that of the usual lasso trained only on individual groups. 
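Editor's note: the "pretrain then fine-tune" idea for the lasso in the 2401.12911 abstract above can be sketched as a two-stage fit, with the pooled first-stage prediction used as an offset for a group-specific second stage. This is a simplified reading of the framework, not the paper's exact estimator; the offset construction and penalty values are assumptions.

```python
# Simplified two-stage sketch: pooled lasso first, then a lasso on one group's
# offset-adjusted responses to pick up group-specific corrections.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
p = 50
common = np.zeros(p); common[:5] = 1.0          # effects shared by all groups
extra = np.zeros(p); extra[5:8] = 0.8           # effects specific to group A

X_all = rng.normal(size=(1000, p))              # large pooled dataset
y_all = X_all @ common + rng.normal(size=1000)
X_a = rng.normal(size=(120, p))                 # small group-A dataset
y_a = X_a @ (common + extra) + rng.normal(size=120)

# Stage 1: "pretrain" on the pooled data (common coefficients).
pre = Lasso(alpha=0.05).fit(X_all, y_all)

# Stage 2: "fine-tune" on group A, treating the stage-1 fit as a fixed offset.
offset = X_a @ pre.coef_ + pre.intercept_
fine = Lasso(alpha=0.05).fit(X_a, y_a - offset)

coef_group_a = pre.coef_ + fine.coef_           # combined estimate for group A
print("common support:", np.flatnonzero(pre.coef_)[:10])
print("group-specific support:", np.flatnonzero(fine.coef_))
```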
This separate\nidentification of common and individual coefficients can also be useful for\nscientific understanding."}, "http://arxiv.org/abs/2401.12924": {"title": "Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection", "link": "http://arxiv.org/abs/2401.12924", "description": "This article delves into the analysis of performance and utilization of\nSupport Vector Machines (SVMs) for the critical task of forest fire detection\nusing image datasets. With the increasing threat of forest fires to ecosystems\nand human settlements, the need for rapid and accurate detection systems is of\nutmost importance. SVMs, renowned for their strong classification capabilities,\nexhibit proficiency in recognizing patterns associated with fire within images.\nBy training on labeled data, SVMs acquire the ability to identify distinctive\nattributes associated with fire, such as flames, smoke, or alterations in the\nvisual characteristics of the forest area. The document thoroughly examines the\nuse of SVMs, covering crucial elements like data preprocessing, feature\nextraction, and model training. It rigorously evaluates parameters such as\naccuracy, efficiency, and practical applicability. The knowledge gained from\nthis study aids in the development of efficient forest fire detection systems,\nenabling prompt responses and improving disaster management. Moreover, the\ncorrelation between SVM accuracy and the difficulties presented by\nhigh-dimensional datasets is carefully investigated, demonstrated through a\nrevealing case study. The relationship between accuracy scores and the\ndifferent resolutions used for resizing the training datasets has also been\ndiscussed in this article. These comprehensive studies result in a definitive\noverview of the difficulties faced and the potential sectors requiring further\nimprovement and focus."}, "http://arxiv.org/abs/2401.12937": {"title": "Are the Signs of Factor Loadings Arbitrary in Confirmatory Factor Analysis? Problems and Solutions", "link": "http://arxiv.org/abs/2401.12937", "description": "The replication crisis in social and behavioral sciences has raised concerns\nabout the reliability and validity of empirical studies. While research in the\nliterature has explored contributing factors to this crisis, the issues related\nto analytical tools have received less attention. This study focuses on a\nwidely used analytical tool - confirmatory factor analysis (CFA) - and\ninvestigates one issue that is typically overlooked in practice: accurately\nestimating factor-loading signs. Incorrect loading signs can distort the\nrelationship between observed variables and latent factors, leading to\nunreliable or invalid results in subsequent analyses. Our study aims to\ninvestigate and address the estimation problem of factor-loading signs in CFA\nmodels. Based on an empirical demonstration and Monte Carlo simulation studies,\nwe found current methods have drawbacks in estimating loading signs. To address\nthis problem, three solutions are proposed and proven to work effectively. The\napplications of these solutions are discussed and elaborated."}, "http://arxiv.org/abs/2401.12967": {"title": "Measure transport with kernel mean embeddings", "link": "http://arxiv.org/abs/2401.12967", "description": "Kalman filters constitute a scalable and robust methodology for approximate\nBayesian inference, matching first and second order moments of the target\nposterior. 
To improve the accuracy in nonlinear and non-Gaussian settings, we\nextend this principle to include more or different characteristics, based on\nkernel mean embeddings (KMEs) of probability measures into their corresponding\nHilbert spaces. Focusing on the continuous-time setting, we develop a family of\ninteracting particle systems (termed $\\textit{KME-dynamics}$) that bridge\nbetween the prior and the posterior, and that include the Kalman-Bucy filter as\na special case. A variant of KME-dynamics has recently been derived from an\noptimal transport perspective by Maurais and Marzouk, and we expose further\nconnections to (kernelised) diffusion maps, leading to a variational\nformulation of regression type. Finally, we conduct numerical experiments on\ntoy examples and the Lorenz-63 model, the latter of which show particular\npromise for a hybrid modification (called Kalman-adjusted KME-dynamics)."}, "http://arxiv.org/abs/2110.05475": {"title": "Bayesian hidden Markov models for latent variable labeling assignments in conflict research: application to the role ceasefires play in conflict dynamics", "link": "http://arxiv.org/abs/2110.05475", "description": "A crucial challenge for solving problems in conflict research is in\nleveraging the semi-supervised nature of the data that arise. Observed response\ndata such as counts of battle deaths over time indicate latent processes of\ninterest such as intensity and duration of conflicts, but defining and labeling\ninstances of these unobserved processes requires nuance and imprecision. The\navailability of such labels, however, would make it possible to study the\neffect of intervention-related predictors - such as ceasefires - directly on\nconflict dynamics (e.g., latent intensity) rather than through an intermediate\nproxy like observed counts of battle deaths. Motivated by this problem and the\nnew availability of the ETH-PRIO Civil Conflict Ceasefires data set, we propose\na Bayesian autoregressive (AR) hidden Markov model (HMM) framework as a\nsufficiently flexible machine learning approach for semi-supervised regime\nlabeling with uncertainty quantification. We motivate our approach by\nillustrating the way it can be used to study the role that ceasefires play in\nshaping conflict dynamics. This ceasefires data set is the first systematic and\nglobally comprehensive data on ceasefires, and our work is the first to analyze\nthis new data and to explore the effect of ceasefires on conflict dynamics in a\ncomprehensive and cross-country manner."}, "http://arxiv.org/abs/2112.14946": {"title": "A causal inference framework for spatial confounding", "link": "http://arxiv.org/abs/2112.14946", "description": "Recently, addressing spatial confounding has become a major topic in spatial\nstatistics. However, the literature has provided conflicting definitions, and\nmany proposed definitions do not address the issue of confounding as it is\nunderstood in causal inference. We define spatial confounding as the existence\nof an unmeasured causal confounder with a spatial structure. We present a\ncausal inference framework for nonparametric identification of the causal\neffect of a continuous exposure on an outcome in the presence of spatial\nconfounding. 
We propose double machine learning (DML), a procedure in which\nflexible models are used to regress both the exposure and outcome variables on\nconfounders to arrive at a causal estimator with favorable robustness\nproperties and convergence rates, and we prove that this approach is consistent\nand asymptotically normal under spatial dependence. As far as we are aware,\nthis is the first approach to spatial confounding that does not rely on\nrestrictive parametric assumptions (such as linearity, effect homogeneity, or\nGaussianity) for both identification and estimation. We demonstrate the\nadvantages of the DML approach analytically and in simulations. We apply our\nmethods and reasoning to a study of the effect of fine particulate matter\nexposure during pregnancy on birthweight in California."}, "http://arxiv.org/abs/2210.08228": {"title": "Nonparametric Estimation of Mediation Effects with A General Treatment", "link": "http://arxiv.org/abs/2210.08228", "description": "To investigate causal mechanisms, causal mediation analysis decomposes the\ntotal treatment effect into the natural direct and indirect effects. This paper\nexamines the estimation of the direct and indirect effects in a general\ntreatment effect model, where the treatment can be binary, multi-valued,\ncontinuous, or a mixture. We propose generalized weighting estimators with\nweights estimated by solving an expanding set of equations. Under some\nsufficient conditions, we show that the proposed estimators are consistent and\nasymptotically normal. Specifically, when the treatment is discrete, the\nproposed estimators attain the semiparametric efficiency bounds. Meanwhile,\nwhen the treatment is continuous, the convergence rates of the proposed\nestimators are slower than $N^{-1/2}$; however, they are still more efficient\nthan that constructed from the true weighting function. A simulation study\nreveals that our estimators exhibit a satisfactory finite-sample performance,\nwhile an application shows their practical value"}, "http://arxiv.org/abs/2212.11398": {"title": "Grace periods in comparative effectiveness studies of sustained treatments", "link": "http://arxiv.org/abs/2212.11398", "description": "Researchers are often interested in estimating the effect of sustained use of\na treatment on a health outcome. However, adherence to strict treatment\nprotocols can be challenging for individuals in practice and, when\nnon-adherence is expected, estimates of the effect of sustained use may not be\nuseful for decision making. As an alternative, more relaxed treatment protocols\nwhich allow for periods of time off treatment (i.e. grace periods) have been\nconsidered in pragmatic randomized trials and observational studies. In this\narticle, we consider the interpretation, identification, and estimation of\ntreatment strategies which include grace periods. We contrast natural grace\nperiod strategies which allow individuals the flexibility to take treatment as\nthey would naturally do, with stochastic grace period strategies in which the\ninvestigator specifies the distribution of treatment utilization. 
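Editor's note: the double machine learning step described in the 2112.14946 abstract above follows the standard partially linear recipe, which is easy to sketch with cross-fitting. Random forests are an illustrative choice of learner, `dml_plm` is a hypothetical helper, and the spatial-dependence refinements discussed in the abstract are not included.

```python
# Cross-fitted DML sketch for a partially linear model: regress exposure and
# outcome on confounders, then regress residuals on residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plm(y, a, z, n_splits=2, seed=0):
    y_res = np.zeros_like(y, dtype=float)
    a_res = np.zeros_like(a, dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(z):
        m_y = RandomForestRegressor(random_state=seed).fit(z[train], y[train])
        m_a = RandomForestRegressor(random_state=seed).fit(z[train], a[train])
        y_res[test] = y[test] - m_y.predict(z[test])    # outcome residuals
        a_res[test] = a[test] - m_a.predict(z[test])    # exposure residuals
    theta = (a_res @ y_res) / (a_res @ a_res)            # residual-on-residual slope
    psi = a_res * (y_res - theta * a_res)                # influence-function pieces
    se = np.sqrt((psi ** 2).mean()) / ((a_res ** 2).mean() * np.sqrt(len(y)))
    return theta, se

rng = np.random.default_rng(5)
z = rng.normal(size=(800, 3))                                 # confounders
a = np.sin(z[:, 0]) + 0.5 * z[:, 1] + rng.normal(size=800)    # exposure
y = 1.5 * a + np.cos(z[:, 0]) + z[:, 2] ** 2 + rng.normal(size=800)
theta, se = dml_plm(y, a, z)
print(f"estimated effect {theta:.2f} (SE {se:.2f}), true effect 1.5")
```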
We estimate\nthe effect of initiation of a thiazide diuretic or an angiotensin-converting\nenzyme inhibitor in hypertensive individuals under various strategies which\ninclude grace periods."}, "http://arxiv.org/abs/2301.02739": {"title": "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values", "link": "http://arxiv.org/abs/2301.02739", "description": "Many testing problems are readily amenable to randomised tests such as those\nemploying data splitting. However despite their usefulness in principle,\nrandomised tests have obvious drawbacks. Firstly, two analyses of the same\ndataset may lead to different results. Secondly, the test typically loses power\nbecause it does not fully utilise the entire sample. As a remedy to these\ndrawbacks, we study how to combine the test statistics or p-values resulting\nfrom multiple random realisations such as through random data splits. We\ndevelop rank-transformed subsampling as a general method for delivering large\nsample inference about the combined statistic or p-value under mild\nassumptions. We apply our methodology to a wide range of problems, including\ntesting unimodality in high-dimensional data, testing goodness-of-fit of\nparametric quantile regression models, testing no direct effect in a\nsequentially randomised trial and calibrating cross-fit double machine learning\nconfidence intervals. In contrast to existing p-value aggregation schemes that\ncan be highly conservative, our method enjoys type-I error control that\nasymptotically approaches the nominal level. Moreover, compared to using the\nordinary subsampling, we show that our rank transform can remove the\nfirst-order bias in approximating the null under alternatives and greatly\nimprove power."}, "http://arxiv.org/abs/2303.07706": {"title": "On the Utility of Equal Batch Sizes for Inference in Stochastic Gradient Descent", "link": "http://arxiv.org/abs/2303.07706", "description": "Stochastic gradient descent (SGD) is an estimation tool for large data\nemployed in machine learning and statistics. Due to the Markovian nature of the\nSGD process, inference is a challenging problem. An underlying asymptotic\nnormality of the averaged SGD (ASGD) estimator allows for the construction of a\nbatch-means estimator of the asymptotic covariance matrix. Instead of the usual\nincreasing batch-size strategy employed in ASGD, we propose a memory efficient\nequal batch-size strategy and show that under mild conditions, the estimator is\nconsistent. A key feature of the proposed batching technique is that it allows\nfor bias-correction of the variance, at no cost to memory. Since joint\ninference for high dimensional problems may be undesirable, we present\nmarginal-friendly simultaneous confidence intervals, and show through an\nexample how covariance estimators of ASGD can be employed in improved\npredictions."}, "http://arxiv.org/abs/2307.04225": {"title": "Copula-like inference for discrete bivariate distributions with rectangular supports", "link": "http://arxiv.org/abs/2307.04225", "description": "After reviewing a large body of literature on the modeling of bivariate\ndiscrete distributions with finite support, \\cite{Gee20} made a compelling case\nfor the use of $I$-projections in the sense of \\cite{Csi75} as a sound way to\nattempt to decompose a bivariate probability mass function (p.m.f.) into its\ntwo univariate margins and a bivariate p.m.f.\\ with uniform margins playing the\nrole of a discrete copula. 
From a practical perspective, the necessary\n$I$-projections on Fr\\'echet classes can be carried out using the iterative\nproportional fitting procedure (IPFP), also known as Sinkhorn's algorithm or\nmatrix scaling in the literature. After providing conditions under which a\nbivariate p.m.f.\\ can be decomposed in the aforementioned sense, we\ninvestigate, for starting bivariate p.m.f.s with rectangular supports,\nnonparametric and parametric estimation procedures as well as goodness-of-fit\ntests for the underlying discrete copula. Related asymptotic results are\nprovided and build upon a differentiability result for $I$-projections on\nFr\\'echet classes which can be of independent interest. Theoretical results are\ncomplemented by finite-sample experiments and a data example."}, "http://arxiv.org/abs/2309.08783": {"title": "Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression", "link": "http://arxiv.org/abs/2309.08783", "description": "Sparse linear regression methods for high-dimensional data commonly assume\nthat residuals have constant variance, which can be violated in practice. For\nexample, Aphasia Quotient (AQ) is a critical measure of language impairment and\ninforms treatment decisions, but it is challenging to measure in stroke\npatients. It is of interest to use high-resolution T2 neuroimages of brain\ndamage to predict AQ. However, sparse regression models show marked evidence of\nheteroscedastic error even after transformations are applied. This violation of\nthe homoscedasticity assumption can lead to bias in estimated coefficients,\nprediction intervals (PI) with improper length, and increased type I errors.\nBayesian heteroscedastic linear regression models relax the homoscedastic error\nassumption but can enforce restrictive prior assumptions on parameters, and\nmany are computationally infeasible in the high-dimensional setting. This paper\nproposes estimating high-dimensional heteroscedastic linear regression models\nusing a heteroscedastic partitioned empirical Bayes Expectation Conditional\nMaximization (H-PROBE) algorithm. H-PROBE is a computationally efficient\nmaximum a posteriori estimation approach that requires minimal prior\nassumptions and can incorporate covariates hypothesized to impact\nheterogeneity. We apply this method by using high-dimensional neuroimages to\npredict and provide PIs for AQ that accurately quantify predictive uncertainty.\nOur analysis demonstrates that H-PROBE can provide narrower PI widths than\nstandard methods without sacrificing coverage. Narrower PIs are clinically\nimportant for determining the risk of moderate to severe aphasia. Additionally,\nthrough extensive simulation studies, we exhibit that H-PROBE results in\nsuperior prediction, variable selection, and predictive inference compared to\nalternative methods."}, "http://arxiv.org/abs/2401.13009": {"title": "Comparative Study of Causal Discovery Methods for Cyclic Models with Hidden Confounders", "link": "http://arxiv.org/abs/2401.13009", "description": "Nowadays, the need for causal discovery is ubiquitous. A better understanding\nof not just the stochastic dependencies between parts of a system, but also the\nactual cause-effect relations, is essential for all parts of science. Thus, the\nneed for reliable methods to detect causal directions is growing constantly. 
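Editor's note: the IPFP / Sinkhorn step mentioned in the 2307.04225 abstract above amounts to alternately rescaling rows and columns of a strictly positive bivariate p.m.f. until both margins are uniform. A generic sketch (the stopping rule and example table are illustrative):

```python
# IPFP / Sinkhorn sketch: scale a positive bivariate p.m.f. to uniform margins,
# yielding the "discrete copula" component discussed above.
import numpy as np

def ipfp_uniform_margins(p, iters=200, tol=1e-12):
    q = np.array(p, dtype=float)
    r, c = q.shape
    for _ in range(iters):
        q *= (1.0 / r) / q.sum(axis=1, keepdims=True)   # fix row margin at 1/r
        q *= (1.0 / c) / q.sum(axis=0, keepdims=True)   # fix column margin at 1/c
        if np.allclose(q.sum(axis=1), 1.0 / r, atol=tol):
            break
    return q

p = np.array([[0.20, 0.10, 0.05],
              [0.05, 0.30, 0.10],
              [0.05, 0.05, 0.10]])        # example p.m.f. with rectangular support
u = ipfp_uniform_margins(p)
print(u.sum(axis=0), u.sum(axis=1))       # both margins close to [1/3, 1/3, 1/3]
```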
In\nthe last 50 years, many causal discovery algorithms have emerged, but most of\nthem are applicable only under the assumption that the systems have no feedback\nloops and that they are causally sufficient, i.e. that there are no unmeasured\nsubsystems that can affect multiple measured variables. This is unfortunate\nsince those restrictions can often not be presumed in practice. Feedback is an\nintegral feature of many processes, and real-world systems are rarely\ncompletely isolated and fully measured. Fortunately, in recent years, several\ntechniques, that can cope with cyclic, causally insufficient systems, have been\ndeveloped. And with multiple methods available, a practical application of\nthose algorithms now requires knowledge of the respective strengths and\nweaknesses. Here, we focus on the problem of causal discovery for sparse linear\nmodels which are allowed to have cycles and hidden confounders. We have\nprepared a comprehensive and thorough comparative study of four causal\ndiscovery techniques: two versions of the LLC method [10] and two variants of\nthe ASP-based algorithm [11]. The evaluation investigates the performance of\nthose techniques for various experiments with multiple interventional setups\nand different dataset sizes."}, "http://arxiv.org/abs/2401.13010": {"title": "Bartholomew's trend test -- approximated by a multiple contrast test", "link": "http://arxiv.org/abs/2401.13010", "description": "Bartholomew's trend test belongs to the broad class of isotonic regression\nmodels, specifically with a single qualitative factor, e.g. dose levels. Using\nthe approximation of the ANOVA F-test by the maximum contrast test against\ngrand mean and pool-adjacent-violator estimates under order restriction, an\neasier to use approximation is proposed."}, "http://arxiv.org/abs/2401.13045": {"title": "Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?", "link": "http://arxiv.org/abs/2401.13045", "description": "Over the past decade, the intricacies of sports-related concussions among\nfemale athletes have become readily apparent. Traditional clinical methods for\ndiagnosing concussions suffer limitations when applied to female athletes,\noften failing to capture subtle changes in brain structure and function.\nAdvanced neuroinformatics techniques and machine learning models have become\ninvaluable assets in this endeavor. While these technologies have been\nextensively employed in understanding concussion in male athletes, there\nremains a significant gap in our comprehension of their effectiveness for\nfemale athletes. With its remarkable data analysis capacity, machine learning\noffers a promising avenue to bridge this deficit. By harnessing the power of\nmachine learning, researchers can link observed phenotypic neuroimaging data to\nsex-specific biological mechanisms, unraveling the mysteries of concussions in\nfemale athletes. Furthermore, embedding methods within machine learning enable\nexamining brain architecture and its alterations beyond the conventional\nanatomical reference frame. In turn, allows researchers to gain deeper insights\ninto the dynamics of concussions, treatment responses, and recovery processes.\nTo guarantee that female athletes receive the optimal care they deserve,\nresearchers must employ advanced neuroimaging techniques and sophisticated\nmachine-learning models. 
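Editor's note: the pool-adjacent-violators estimates mentioned in the 2401.13010 abstract above take only a few lines to compute. This is a generic weighted PAVA sketch for a non-decreasing fit, not the paper's specific maximum-contrast approximation; `pava` and the dose-level example are illustrative.

```python
# Weighted pool-adjacent-violators (PAVA) sketch for an order-restricted fit.
import numpy as np

def pava(y, w=None):
    """Isotonic (non-decreasing) fit of y with weights w."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    merged = []
    for value, weight in zip(y, w):
        merged.append([value, weight, 1])                 # [block mean, weight, size]
        # pool adjacent blocks while they violate monotonicity
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            v2, w2, n2 = merged.pop()
            v1, w1, n1 = merged.pop()
            merged.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, v) for v, _, n in merged])

dose_means = [0.8, 0.5, 1.1, 1.0, 1.6]      # sample means per dose level
print(pava(dose_means))                     # [0.65 0.65 1.05 1.05 1.6 ]
```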
These tools enable an in-depth investigation of the\nunderlying mechanisms responsible for concussion symptoms stemming from\nneuronal dysfunction in female athletes. This paper endeavors to address the\ncrucial issue of sex differences in multimodal neuroimaging experimental design\nand machine learning approaches within female athlete populations, ultimately\nensuring that they receive the tailored care they require when facing the\nchallenges of concussions."}, "http://arxiv.org/abs/2401.13090": {"title": "Variational Estimation for Multidimensional Generalized Partial Credit Model", "link": "http://arxiv.org/abs/2401.13090", "description": "Multidimensional item response theory (MIRT) models have generated increasing\ninterest in the psychometrics literature. Efficient approaches for estimating\nMIRT models with dichotomous responses have been developed, but constructing an\nequally efficient and robust algorithm for polytomous models has received\nlimited attention. To address this gap, this paper presents a novel Gaussian\nvariational estimation algorithm for the multidimensional generalized partial\ncredit model (MGPCM). The proposed algorithm demonstrates both fast and\naccurate performance, as illustrated through a series of simulation studies and\ntwo real data analyses."}, "http://arxiv.org/abs/2401.13094": {"title": "On cross-validated estimation of skew normal model", "link": "http://arxiv.org/abs/2401.13094", "description": "Skew normal model suffers from inferential drawbacks, namely singular Fisher\ninformation in the vicinity of symmetry and diverging of maximum likelihood\nestimation. To address the above drawbacks, Azzalini and Arellano-Valle (2013)\nintroduced maximum penalised likelihood estimation (MPLE) by subtracting a\npenalty function from the log-likelihood function with a pre-specified penalty\ncoefficient. Here, we propose a cross-validated MPLE to improve its performance\nwhen the underlying model is close to symmetry. We develop a theory for MPLE,\nwhere an asymptotic rate for the cross-validated penalty coefficient is\nderived. We further show that the proposed cross-validated MPLE is\nasymptotically efficient under certain conditions. In simulation studies and a\nreal data application, we demonstrate that the proposed estimator can\noutperform the conventional MPLE when the model is close to symmetry."}, "http://arxiv.org/abs/2401.13208": {"title": "Assessing Influential Observations in Pain Prediction using fMRI Data", "link": "http://arxiv.org/abs/2401.13208", "description": "Influential diagnosis is an integral part of data analysis, of which most\nexisting methodological frameworks presume a deterministic submodel and are\ndesigned for low-dimensional data (i.e., the number of predictors p smaller\nthan the sample size n). However, the stochastic selection of a submodel from\nhigh-dimensional data where p exceeds n has become ubiquitous. Thus, methods\nfor identifying observations that could exert undue influence on the choice of\na submodel can play an important role in this setting. To date, discussion of\nthis topic has been limited, falling short in two domains: (i) constrained\nability to detect multiple influential points, and (ii) applicability only in\nrestrictive settings. After describing the problem, we characterize and\nformalize the concept of influential observations on variable selection. 
Then,\nwe propose a generalized diagnostic measure, extended from an available metric\naccommodating different model selectors and multiple influential observations,\nthe asymptotic distribution of which is subsequently established for large p, thus\nproviding guidelines to ascertain influential observations. A high-dimensional\nclustering procedure is further incorporated into our proposed scheme to detect\nmultiple influential points. Simulations are conducted to assess the performance\nof various diagnostic approaches. The proposed procedure further demonstrates\nits value in improving predictive power when analyzing thermal-stimulated pain\nbased on fMRI data."}, "http://arxiv.org/abs/2401.13379": {"title": "An Ising Similarity Regression Model for Modeling Multivariate Binary Data", "link": "http://arxiv.org/abs/2401.13379", "description": "Understanding the dependence structure between response variables is an\nimportant component in the analysis of correlated multivariate data. This\narticle focuses on modeling dependence structures in multivariate binary data,\nmotivated by a study aiming to understand how patterns in different U.S.\nsenators' votes are determined by similarities (or lack thereof) in their\nattributes, e.g., political parties and social network profiles. To address\nsuch a research question, we propose a new Ising similarity regression model\nwhich regresses pairwise interaction coefficients in the Ising model against a\nset of similarity measures available/constructed from covariates. Model\nselection approaches are further developed through regularizing the\npseudo-likelihood function with an adaptive lasso penalty to enable the\nselection of relevant similarity measures. We establish estimation and\nselection consistency of the proposed estimator under a general setting where\nthe number of similarity measures and responses tend to infinity. A simulation\nstudy demonstrates the strong finite sample performance of the proposed\nestimator in terms of parameter estimation and similarity selection. Applying\nthe Ising similarity regression model to a dataset of roll call voting records\nof 100 U.S. senators, we are able to quantify how similarities in senators'\nparties, businessman occupations and social network profiles drive their voting\nassociations."}, "http://arxiv.org/abs/2201.05828": {"title": "Adaptive procedures for directional false discovery rate control", "link": "http://arxiv.org/abs/2201.05828", "description": "In multiple hypothesis testing, it is well known that adaptive procedures can\nenhance power via incorporating information about the number of true nulls\npresent. Under independence, we establish that two adaptive false discovery\nrate (FDR) methods, upon augmenting sign declarations, also offer directional\nfalse discovery rate (FDR$_\\text{dir}$) control in the strong sense. Such\nFDR$_\\text{dir}$ controlling properties are appealing because adaptive\nprocedures have the greatest potential to reap substantial gain in power when\nthe underlying parameter configurations contain little to no true nulls, which\nare precisely settings where the FDR$_\\text{dir}$ is an arguably more\nmeaningful error rate to be controlled than the FDR."}, "http://arxiv.org/abs/2212.02306": {"title": "Robust multiple method comparison and transformation", "link": "http://arxiv.org/abs/2212.02306", "description": "A generalization of Passing-Bablok regression is proposed for comparing\nmultiple measurement methods simultaneously. 
Possible applications include\nassay migration studies or interlaboratory trials. When comparing only two\nmethods, the method reduces to the usual Passing-Bablok estimator. It is close\nin spirit to reduced major axis regression, which is, however, not robust. To\nobtain a robust estimator, the major axis is replaced by the (hyper-)spherical\nmedian axis. The method is shown to reduce to the usual Passing-Bablok\nestimator if only two methods are compared. This technique has been applied to\ncompare SARS-CoV-2 serological tests, bilirubin in neonates, and an in vitro\ndiagnostic test using different instruments, sample preparations, and reagent\nlots. In addition, plots similar to the well-known Bland-Altman plots have been\ndeveloped to represent the variance structure."}, "http://arxiv.org/abs/2305.15936": {"title": "Learning DAGs from Data with Few Root Causes", "link": "http://arxiv.org/abs/2305.15936", "description": "We present a novel perspective and algorithm for learning directed acyclic\ngraphs (DAGs) from data generated by a linear structural equation model (SEM).\nFirst, we show that a linear SEM can be viewed as a linear transform that, in\nprior work, computes the data from a dense input vector of random valued root\ncauses (as we will call them) associated with the nodes. Instead, we consider\nthe case of (approximately) few root causes and also introduce noise in the\nmeasurement of the data. Intuitively, this means that the DAG data is produced\nby few data-generating events whose effect percolates through the DAG. We prove\nidentifiability in this new setting and show that the true DAG is the global\nminimizer of the $L^0$-norm of the vector of root causes. For data with few\nroot causes, with and without noise, we show superior performance compared to\nprior DAG learning methods."}, "http://arxiv.org/abs/2307.02096": {"title": "Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo", "link": "http://arxiv.org/abs/2307.02096", "description": "Hamiltonian Monte Carlo (HMC) is a powerful tool for Bayesian statistical\ninference due to its potential to rapidly explore high dimensional state space,\navoiding the random walk behavior typical of many Markov Chain Monte Carlo\nsamplers. The proper choice of the integrator of the Hamiltonian dynamics is\nkey to the efficiency of HMC. It is becoming increasingly clear that\nmulti-stage splitting integrators are a good alternative to the Verlet method,\ntraditionally used in HMC. Here we propose a principled way of finding optimal,\nproblem-specific integration schemes (in terms of the best conservation of\nenergy for harmonic forces/Gaussian targets) within the families of 2- and\n3-stage splitting integrators. The method, which we call Adaptive Integration\nApproach for statistics, or s-AIA, uses a multivariate Gaussian model and\nsimulation data obtained at the HMC burn-in stage to identify a system-specific\ndimensional stability interval and assigns the most appropriate 2-/3-stage\nintegrator for any user-chosen simulation step size within that interval. s-AIA\nhas been implemented in the in-house software package HaiCS without introducing\ncomputational overheads in the simulations. The efficiency of the s-AIA\nintegrators and their impact on the HMC accuracy, sampling performance and\nconvergence are discussed in comparison with known fixed-parameter multi-stage\nsplitting integrators (including Verlet). 
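Editor's note: to make the "2-stage splitting integrator" family in the 2307.02096 abstract above concrete, here is one kick-drift-kick-drift-kick step; b = 0.25 reproduces two velocity Verlet half-steps. The standard-normal target, unit mass matrix, and step size are illustrative assumptions, and this is not the s-AIA tuning procedure itself.

```python
# One 2-stage splitting step for Hamiltonian dynamics (kick-drift-kick-drift-kick).
def two_stage_step(q, p, grad_logpi, h, b=0.25):
    p = p + b * h * grad_logpi(q)                    # kick
    q = q + 0.5 * h * p                              # drift (unit mass matrix assumed)
    p = p + (1.0 - 2.0 * b) * h * grad_logpi(q)      # middle kick
    q = q + 0.5 * h * p                              # drift
    p = p + b * h * grad_logpi(q)                    # final kick
    return q, p

# standard-normal target: log pi(q) = -q^2 / 2, so grad log pi(q) = -q
grad_logpi = lambda q: -q
q, p = 1.0, 0.5
H0 = 0.5 * p ** 2 + 0.5 * q ** 2                     # initial Hamiltonian
for _ in range(20):                                  # 20 integration steps
    q, p = two_stage_step(q, p, grad_logpi, h=0.3)
print("energy error:", 0.5 * p ** 2 + 0.5 * q ** 2 - H0)   # should stay small
```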
Numerical experiments on well-known\nstatistical models show that the adaptive schemes reach the best possible\nperformance within the family of 2-, 3-stage splitting schemes."}, "http://arxiv.org/abs/2401.13777": {"title": "Revisiting the memoryless property -- testing for the Pareto type I distribution", "link": "http://arxiv.org/abs/2401.13777", "description": "We propose new goodness-of-fit tests for the Pareto type I distribution.\nThese tests are based on a multiplicative version of the memoryless property\nwhich characterises this distribution. We present the results of a Monte Carlo\npower study demonstrating that the proposed tests are powerful compared to\nexisting tests. As a result of independent interest, we demonstrate that tests\nspecifically developed for the Pareto type I distribution substantially\noutperform tests for exponentiality applied to log-transformed data (since\nPareto type I distributed values can be transformed to exponentiality via a\nsimple log-transformation). Specifically, the newly proposed tests based on the\nmultiplicative memoryless property of the Pareto distribution substantially\noutperform a test based on the memoryless property of the exponential\ndistribution. The practical use of the tests is illustrated by testing the\nhypothesis that two sets of observed golfers' earnings (those of the PGA and\nLIV tours) are realised from Pareto distributions."}, "http://arxiv.org/abs/2401.13787": {"title": "Bayesian Analysis of the Beta Regression Model Subject to Linear Inequality Restrictions with Application", "link": "http://arxiv.org/abs/2401.13787", "description": "Recent studies in machine learning are based on models in which parameters\nor state variables are bounded or restricted. These restrictions come from prior\ninformation to ensure the validity of scientific theories or structural\nconsistency based on physical phenomena. The valuable information contained in\nthe restrictions must be considered during the estimation process to improve\nestimation accuracy. Many researchers have focused on linear regression models\nsubject to linear inequality restrictions, but generalized linear models have\nreceived little attention. In this paper, the parameters of beta Bayesian\nregression models subjected to linear inequality restrictions are estimated.\nThe proposed Bayesian restricted estimator, as demonstrated by simulation\nstudies, outperforms ordinary estimators. Even in the presence of\nmulticollinearity, it outperforms the ridge estimator in terms of the standard\ndeviation and the mean squared error. The results confirm that the proposed\nBayesian restricted estimator induces sparsity in parameter estimation without\nusing a regularization penalty. Finally, a real data set is analyzed by the\nnewly proposed Bayesian estimation method."}, "http://arxiv.org/abs/2401.13820": {"title": "A Bayesian hierarchical mixture cure modelling framework to utilize multiple survival datasets for long-term survivorship estimates: A case study from previously untreated metastatic melanoma", "link": "http://arxiv.org/abs/2401.13820", "description": "Time to an event of interest over a lifetime is a central measure of the\nclinical benefit of an intervention used in a health technology assessment\n(HTA). Within the same trial multiple end-points may also be considered. For\nexample, overall and progression-free survival time for different drugs in\noncology studies. 
A common challenge is when an intervention is only effective\nfor some proportion of the population who are not clinically identifiable.\nTherefore, latent group membership as well as separate survival models for\ngroups identified need to be estimated. However, follow-up in trials may be\nrelatively short, leading to substantial censoring. We present a general\nBayesian hierarchical framework that can handle this complexity by exploiting\nthe similarity of cure fractions between end-points; accounting for the\ncorrelation between them and improving the extrapolation beyond the observed\ndata. Assuming exchangeability between cure fractions facilitates the borrowing\nof information between end-points. We show the benefits of using our approach\nwith a motivating example, the CheckMate 067 phase 3 trial consisting of\npatients with metastatic melanoma treated with first line therapy."}, "http://arxiv.org/abs/2401.13890": {"title": "Discrete Hawkes process with flexible residual distribution and filtered historical simulation", "link": "http://arxiv.org/abs/2401.13890", "description": "We introduce a new model which can be considered as an extended version of the\nHawkes process in a discrete sense. This model enables the integration of\nvarious residual distributions while preserving the fundamental properties of\nthe original Hawkes process. The rich nature of this model enables a filtered\nhistorical simulation which incorporates the properties of the original time series\nmore accurately. The process naturally extends to multi-variate models with\neasy implementations of estimation and simulation. We investigate the effect of\nthe flexible residual distribution on the estimation of high-frequency financial data\ncompared with the Hawkes process."}, "http://arxiv.org/abs/2401.13929": {"title": "Reinforcement Learning with Hidden Markov Models for Discovering Decision-Making Dynamics", "link": "http://arxiv.org/abs/2401.13929", "description": "Major depressive disorder (MDD) presents challenges in diagnosis and\ntreatment due to its complex and heterogeneous nature. Emerging evidence\nindicates that reward processing abnormalities may serve as a behavioral marker\nfor MDD. To measure reward processing, patients perform computer-based\nbehavioral tasks that involve making choices or responding to stimuli that\nare associated with different outcomes. Reinforcement learning (RL) models are\nfitted to extract parameters that measure various aspects of reward processing\nto characterize how patients make decisions in behavioral tasks. Recent\nfindings suggest the inadequacy of characterizing reward learning solely based\non a single RL model; instead, there may be a switching of decision-making\nprocesses between multiple strategies. An important scientific question is how\nthe dynamics of learning strategies in decision-making affect the reward\nlearning ability of individuals with MDD. Motivated by the probabilistic reward\ntask (PRT) within the EMBARC study, we propose a novel RL-HMM framework for\nanalyzing reward-based decision-making. Our model accommodates learning\nstrategy switching between two distinct approaches under a hidden Markov model\n(HMM): subjects making decisions based on the RL model or opting for random\nchoices. We account for continuous RL state space and allow time-varying\ntransition probabilities in the HMM. We introduce a computationally efficient\nEM algorithm for parameter estimation and employ a nonparametric bootstrap for\ninference. 
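Editor's note: a discrete-time self-exciting count process in the spirit of the 2401.13890 abstract above is easy to simulate. The geometric kernel, the Poisson residual, and the parameter values below are illustrative assumptions rather than the paper's exact specification.

```python
# Simulation sketch: conditional intensity = baseline + geometrically decaying
# feedback from past counts, with a Poisson residual as one possible choice.
import numpy as np

def simulate_discrete_hawkes(T, mu=0.5, alpha=0.3, beta=0.7, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(T, dtype=int)
    lam = np.zeros(T)
    excitation = 0.0                      # running value of sum_{s<t} alpha * beta**(t-s) * N_s
    for t in range(T):
        lam[t] = mu + excitation
        counts[t] = rng.poisson(lam[t])
        excitation = beta * (excitation + alpha * counts[t])
    return counts, lam

counts, lam = simulate_discrete_hawkes(1000)
# stationarity requires the branching ratio alpha * beta / (1 - beta) < 1
print("mean count:", counts.mean(),
      "vs stationary mean:", 0.5 / (1 - 0.3 * 0.7 / (1 - 0.7)))
```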
We apply our approach to the EMBARC study to show that MDD patients\nare less engaged in RL compared to the healthy controls, and engagement is\nassociated with brain activities in the negative affect circuitry during an\nemotional conflict task."}, "http://arxiv.org/abs/2401.13943": {"title": "Is the age pension in Australia sustainable and fair? Evidence from forecasting the old-age dependency ratio using the Hamilton-Perry model", "link": "http://arxiv.org/abs/2401.13943", "description": "The age pension aims to assist eligible elderly Australians who meet specific age\nand residency criteria in maintaining basic living standards. In designing\nefficient pension systems, government policymakers seek to satisfy the\nexpectations of the overall aging population in Australia. However, the\npopulation's unique demographic characteristics at the state and territory\nlevel are often overlooked due to the lack of available data. We use the\nHamilton-Perry model, which requires minimal input, to model and forecast the\nevolution of age-specific populations at the state level. We also integrate the\nobtained sub-national demographic information to determine sustainable pension\nages up to 2051. We further investigate pension welfare distribution in all states\nand territories to identify disadvantaged residents under the current pension\nsystem. Using the sub-national mortality data for Australia from 1971 to 2021\nobtained from AHMD (2023), we implement the Hamilton-Perry model with the help\nof functional time series forecasting techniques. With forecasts of\nage-specific population sizes for each state and territory, we compute the\nold-age dependency ratio to determine the nationwide sustainable pension age."}, "http://arxiv.org/abs/2401.13975": {"title": "Sparse signal recovery and source localization via covariance learning", "link": "http://arxiv.org/abs/2401.13975", "description": "In the Multiple Measurements Vector (MMV) model, measurement vectors are\nconnected to unknown, jointly sparse signal vectors through a linear regression\nmodel employing a single known measurement matrix (or dictionary). Typically,\nthe number of atoms (columns of the dictionary) is greater than the number of\nmeasurements and the sparse signal recovery problem is generally ill-posed. In\nthis paper, we treat the signals and measurement noise as independent Gaussian\nrandom vectors with unknown signal covariance matrix and noise variance,\nrespectively, and derive a fixed point (FP) equation for solving the likelihood\nequation for signal powers, thereby enabling the recovery of the sparse signal\nsupport (sources with non-zero variances). Two practical algorithms, a block\ncoordinate descent (BCD) and a cyclic coordinate descent (CCD) algorithm, that\nleverage the FP characterization of the likelihood equation are then\nproposed. Additionally, a greedy pursuit method, analogous to the popular\nsimultaneous orthogonal matching pursuit (OMP), is introduced. 
Our numerical\nexamples demonstrate the effectiveness of the proposed covariance learning (CL)\nalgorithms both in classic sparse signal recovery and in\ndirection-of-arrival (DOA) estimation problems where they perform favourably\ncompared to the state-of-the-art algorithms under a broad variety of settings."}, "http://arxiv.org/abs/2401.14052": {"title": "Testing Alpha in High Dimensional Linear Factor Pricing Models with Dependent Observations", "link": "http://arxiv.org/abs/2401.14052", "description": "In this study, we introduce three distinct testing methods for testing alpha\nin high dimensional linear factor pricing models that deal with dependent data.\nThe first method is a sum-type test procedure, which exhibits high performance\nwhen dealing with dense alternatives. The second method is a max-type test\nprocedure, which is particularly effective for sparse alternatives. For a\nbroader range of alternatives, we suggest a Cauchy combination test procedure.\nThis is predicated on the asymptotic independence of the sum-type and max-type\ntest statistics. Both simulation studies and a practical data application\ndemonstrate the effectiveness of our proposed methods when handling dependent\nobservations."}, "http://arxiv.org/abs/2401.14094": {"title": "ODC and ROC curves, comparison curves, and stochastic dominance", "link": "http://arxiv.org/abs/2401.14094", "description": "We discuss two novel approaches to the classical two-sample problem. Our\nstarting point are properly standardized and combined ordinal dominance and receiver\noperating characteristic curves, denoted by ODC and ROC respectively, which are very\npopular in several areas of statistics and data analysis. The proposed new\ncurves are termed the comparison curves. Their estimates, being weighted rank\nprocesses on (0,1), form the basis of inference. These weighted processes are\nintuitive, well-suited for visual inspection of data at hand, and are also\nuseful for constructing some formal inferential procedures. They can be applied\nto several variants of the two-sample problem. Their use can help to improve some\nexisting procedures both in terms of power and the ability to identify the\nsources of departures from the postulated model. To simplify interpretation of\nfinite sample results we restrict attention to values of the processes on a\nfinite grid of points. This results in the so-called bar plots (B-plots) which\nreadably summarize the information contained in the data. What is more, we show\nthat B-plots along with adjusted simultaneous acceptance regions provide\nprincipled information about where the model departs from the data. This leads\nto a framework which facilitates identification of regions with locally\nsignificant differences.\n\nWe show an application of the considered techniques to a standard\nstochastic dominance testing problem. Some min-type statistics are introduced\nand investigated. A simulation study compares two tests pertinent to the\ncomparison curves to well-established tests in the literature and demonstrates\nthe strong and competitive performance of the former in many typical\nsituations. Some real data applications illustrate the simplicity and practical\nusefulness of the proposed approaches. 
A range of other applications of the\nconsidered weighted processes is briefly discussed too."}, "http://arxiv.org/abs/2401.14122": {"title": "On a Novel Skewed Generalized t Distribution: Properties, Estimations and its Applications", "link": "http://arxiv.org/abs/2401.14122", "description": "With the progress of information technology, large amounts of asymmetric,\nleptokurtic and heavy-tailed data are arising in various fields, such as\nfinance, engineering, genetics and medicine. It is very challenging to model\nthose kinds of data, especially for extremely skewed data, accompanied by very\nhigh kurtosis or heavy tails. In this paper, we propose a novel class of skewed\ngeneralized t distributions (SkeGTD) as a scale mixture of skewed generalized\nnormal distributions. The proposed SkeGTD has excellent adaptiveness to various data, because\nof its capability of allowing for a large range of skewness and kurtosis and\nits separate location, scale, skewness and shape\nparameters. We investigate some important properties of this family of\ndistributions. The maximum likelihood estimation, L-moments estimation and\ntwo-step estimation for the SkeGTD are explored. To illustrate the usefulness\nof the proposed methodology, we present simulation studies and analyze two real\ndatasets."}, "http://arxiv.org/abs/2401.14294": {"title": "Heteroscedasticity-aware stratified sampling to improve uplift modeling", "link": "http://arxiv.org/abs/2401.14294", "description": "In many business applications, including online marketing and customer churn\nprevention, randomized controlled trials (RCTs) are conducted to investigate\nthe effect of a specific treatment (coupon offers, advertisement\nmailings,...). Such RCTs allow for the estimation of average treatment effects\nas well as the training of (uplift) models for the heterogeneity of treatment\neffects between individuals. The problem with these RCTs is that they are\ncostly, and this cost increases with the number of individuals included in the\nRCT. For this reason, there is research on how to conduct experiments involving a\nsmall number of individuals while still obtaining precise treatment effect\nestimates. We contribute to this literature a heteroskedasticity-aware\nstratified sampling (HS) scheme, which leverages the fact that different\nindividuals have different noise levels in their outcome and precise treatment\neffect estimation requires more observations from the \"high-noise\" individuals\nthan from the \"low-noise\" individuals. By theory as well as by empirical\nexperiments, we demonstrate that our HS-sampling yields significantly more\nprecise estimates of the ATE, improves uplift models and makes their evaluation\nmore reliable compared to RCT data sampled completely randomly. Due to the\nrelative ease of application and the significant benefits, we expect\nHS-sampling to be valuable in many real-world applications."}, "http://arxiv.org/abs/2401.14338": {"title": "Case-crossover designs and overdispersion with application in air pollution epidemiology", "link": "http://arxiv.org/abs/2401.14338", "description": "Over the last three decades, case-crossover designs have found many\napplications in health sciences, especially in air pollution epidemiology. They\nare typically used, in combination with partial likelihood techniques, to\ndefine a conditional logistic model for the responses, usually health outcomes,\nconditional on the exposures. 
Despite the fact that conditional logistic models\nhave been shown equivalent, in typical air pollution epidemiology setups, to\nspecific instances of the well-known Poisson time series model, it is often\nclaimed that they cannot allow for overdispersion. This paper clarifies the\nrelationship between case-crossover designs, the models that ensue from their\nuse, and overdispersion. In particular, we propose to relax the assumption of\nindependence between individuals traditionally made in case-crossover analyses,\nin order to explicitly introduce overdispersion in the conditional logistic\nmodel. As we show, the resulting overdispersed conditional logistic model\ncoincides with the overdispersed, conditional Poisson model, in the sense that\ntheir likelihoods are simple re-expressions of one another. We further provide\nthe technical details of a Bayesian implementation of the proposed\ncase-crossover model, which we use to demonstrate, by means of a large\nsimulation study, that standard case-crossover models can lead to dramatically\nunderestimated coverage probabilities, while the proposed models do not. We\nalso perform an illustrative analysis of the association between air pollution\nand morbidity in Toronto, Canada, which shows that the proposed models are more\nrobust than standard ones to outliers such as those associated with public\nholidays."}, "http://arxiv.org/abs/2401.14345": {"title": "Uncovering Heterogeneity of Solar Flare Mechanism With Mixture Models", "link": "http://arxiv.org/abs/2401.14345", "description": "The physics of solar flares occurring on the Sun is highly complex and far\nfrom fully understood. However, observations show that solar eruptions are\nassociated with the intense kilogauss fields of active regions, where free\nenergies are stored with field-aligned electric currents. With the advent of\nhigh-quality data sources such as the Geostationary Operational Environmental\nSatellites (GOES) and Solar Dynamics Observatory (SDO)/Helioseismic and\nMagnetic Imager (HMI), recent works on solar flare forecasting have been\nfocusing on data-driven methods. In particular, black box machine learning and\ndeep learning models are increasingly adopted in which underlying data\nstructures are not modeled explicitly. If the active regions indeed follow the\nsame laws of physics, there should be similar patterns shared among them,\nreflected by the observations. Yet, these black box models currently used in\nthe literature do not explicitly characterize the heterogeneous nature of the\nsolar flare data, within and between active regions. In this paper, we propose\ntwo finite mixture models designed to capture the heterogeneous patterns of\nactive regions and their associated solar flare events. With extensive\nnumerical studies, we demonstrate the usefulness of our proposed method for\nboth resolving the sample imbalance issue and modeling the heterogeneity for\nrare energetic solar flare events."}, "http://arxiv.org/abs/2401.14355": {"title": "Multiply Robust Estimation of Causal Effect Curves for Difference-in-Differences Designs", "link": "http://arxiv.org/abs/2401.14355", "description": "Researchers commonly use difference-in-differences (DiD) designs to evaluate\npublic policy interventions. While established methodologies exist for\nestimating effects in the context of binary interventions, policies often\nresult in varied exposures across regions implementing the policy. 
Yet,\nexisting approaches for incorporating continuous exposures face substantial\nlimitations in addressing confounding variables associated with intervention\nstatus, exposure levels, and outcome trends. These limitations significantly\nconstrain policymakers' ability to fully comprehend policy impacts and design\nfuture interventions. In this study, we propose innovative estimators for\ncausal effect curves within the DiD framework, accounting for multiple sources\nof confounding. Our approach accommodates misspecification of a subset of\ntreatment, exposure, and outcome models while avoiding any parametric\nassumptions on the effect curve. We present the statistical properties of the\nproposed methods and illustrate their application through simulations and a\nstudy investigating the diverse effects of a nutritional excise tax."}, "http://arxiv.org/abs/2401.14359": {"title": "Minimum Covariance Determinant: Spectral Embedding and Subset Size Determination", "link": "http://arxiv.org/abs/2401.14359", "description": "This paper introduces several ideas to the minimum covariance determinant\nproblem for outlier detection and robust estimation of means and covariances.\nWe leverage the principal component transform to achieve dimension reduction,\npaving the way for improved analyses. Our best subset selection algorithm\nstrategically combines statistical depth and concentration steps. To ascertain\nthe appropriate subset size and number of principal components, we introduce a\nnovel bootstrap procedure that estimates the instability of the best subset\nalgorithm. The parameter combination exhibiting minimal instability proves\nideal for the purposes of outlier detection and robust estimation. Rigorous\nbenchmarking against prominent MCD variants showcases our approach's superior\ncapability in outlier detection and computational speed in high dimensions.\nApplication to a fruit spectra data set and a cancer genomics data set\nillustrates our claims."}, "http://arxiv.org/abs/2401.14393": {"title": "Clustering-based spatial interpolation of parametric post-processing models", "link": "http://arxiv.org/abs/2401.14393", "description": "Since the start of the operational use of ensemble prediction systems,\nensemble-based probabilistic forecasting has become the most advanced approach\nin weather prediction. However, despite the persistent development of the last\nthree decades, ensemble forecasts still often suffer from the lack of\ncalibration and might exhibit systematic bias, which calls for some form of\nstatistical post-processing. Nowadays, one can choose from a large variety of\npost-processing approaches, where parametric methods provide full predictive\ndistributions of the investigated weather quantity. Parameter estimation in\nthese models is based on training data consisting of past forecast-observation\npairs, thus post-processed forecasts are usually available only at those\nlocations where training data are accessible. We propose a general\nclustering-based interpolation technique of extending calibrated predictive\ndistributions from observation stations to any location in the ensemble domain\nwhere there are ensemble forecasts at hand. 
Focusing on the ensemble model\noutput statistics (EMOS) post-processing technique, in a case study based on\nwind speed ensemble forecasts of the European Centre for Medium-Range Weather\nForecasts, we demonstrate the predictive performance of various versions of the\nsuggested method and show its superiority over the regionally estimated and\ninterpolated EMOS models and the raw ensemble forecasts as well."}, "http://arxiv.org/abs/2208.09344": {"title": "A note on incorrect inferences in non-binary qualitative probabilistic networks", "link": "http://arxiv.org/abs/2208.09344", "description": "Qualitative probabilistic networks (QPNs) combine the conditional\nindependence assumptions of Bayesian networks with the qualitative properties\nof positive and negative dependence. They formalise various intuitive\nproperties of positive dependence to allow inferences over a large network of\nvariables. However, we will demonstrate in this paper that, due to an incorrect\nsymmetry property, many inferences obtained in non-binary QPNs are not\nmathematically true. We will provide examples of such incorrect inferences and\nbriefly discuss possible resolutions."}, "http://arxiv.org/abs/2210.14080": {"title": "Learning Individual Treatment Effects under Heterogeneous Interference in Networks", "link": "http://arxiv.org/abs/2210.14080", "description": "Estimates of individual treatment effects from networked observational data\nare attracting increasing attention these days. One major challenge in network\nscenarios is the violation of the stable unit treatment value assumption\n(SUTVA), which assumes that the treatment assignment of a unit does not\ninfluence others' outcomes. In network data, due to interference, the outcome\nof a unit is influenced not only by its treatment (i.e., direct effects) but\nalso by others' treatments (i.e., spillover effects). Furthermore, the\ninfluences from other units are always heterogeneous (e.g., friends with\nsimilar interests affect a person differently than friends with different\ninterests). In this paper, we focus on the problem of estimating individual\ntreatment effects (both direct and spillover effects) under heterogeneous\ninterference. To address this issue, we propose a novel Dual Weighting\nRegression (DWR) algorithm by simultaneously learning attention weights that\ncapture the heterogeneous interference and sample weights to eliminate the\ncomplex confounding bias in networks. We formulate the entire learning process\nas a bi-level optimization problem. In theory, we present generalization error\nbounds for individual treatment effect estimation. Extensive experiments on\nfour benchmark datasets demonstrate that the proposed DWR algorithm outperforms\nstate-of-the-art methods for estimating individual treatment effects under\nheterogeneous interference."}, "http://arxiv.org/abs/2302.10836": {"title": "nlive: an R Package to facilitate the application of the sigmoidal and random changepoint mixed models", "link": "http://arxiv.org/abs/2302.10836", "description": "Background: The use of mixed effect models with a specific functional form\nsuch as the Sigmoidal Mixed Model and the Piecewise Mixed Model (or Changepoint\nMixed Model) with abrupt or smooth random change allows the interpretation of\nthe defined parameters to understand longitudinal trajectories. 
Currently,\nthere are no interface R packages that can easily fit the Sigmoidal Mixed Model\nallowing the inclusion of covariates or incorporating recent developments to\nfit the Piecewise Mixed Model with random change. Results: To facilitate the\nmodeling of the Sigmoidal Mixed Model and the Piecewise Mixed Model with abrupt or\nsmooth random change, we have created an R package called nlive. All needed\npieces such as functions, covariance matrices, and initial value generation were\nprogrammed. The package was implemented with recent developments such as the\npolynomial smooth transition of the piecewise mixed model with improved\nproperties over Bacon-Watts, and the stochastic approximation\nexpectation-maximization (SAEM) for efficient estimation. It was designed to\nhelp interpretation of the output by providing features such as annotated\noutput, warnings, and graphs. Functionality, including time and convergence,\nwas tested using simulations. We provided a data example to illustrate the\npackage use and output features and interpretation. The package implemented in\nthe R software is available from the Comprehensive R Archive Network (CRAN) at\nhttps://CRAN.R-project.org/package=nlive. Conclusions: The nlive package for R\nfits the Sigmoidal Mixed Model and the Piecewise Mixed Model with abrupt or smooth\nrandom change. The package allows fitting these models with only five mandatory\narguments that are intuitive enough for less sophisticated users."}, "http://arxiv.org/abs/2304.12500": {"title": "Environmental Justice Implications of Power Plant Emissions Control Policies: Heterogeneous Causal Effect Estimation under Bipartite Network Interference", "link": "http://arxiv.org/abs/2304.12500", "description": "Emissions generators, such as coal-fired power plants, are key contributors\nto air pollution and thus environmental policies to reduce their emissions have\nbeen proposed. Furthermore, marginalized groups are exposed to\ndisproportionately high levels of this pollution and have heightened\nsusceptibility to its adverse health impacts. As a result, robust evaluations\nof the heterogeneous impacts of air pollution regulations are key to justifying\nand designing maximally protective interventions. However, such evaluations are\ncomplicated in that much of air pollution regulatory policy intervenes on large\nemissions generators while resulting impacts are measured in potentially\ndistant populations. Such a scenario can be described as that of bipartite\nnetwork interference (BNI). To our knowledge, no literature to date has\nconsidered estimation of heterogeneous causal effects with BNI. In this paper,\nwe contribute to the literature in a three-fold manner. First, we propose\nBNI-specific estimators for subgroup-specific causal effects and design an\nempirical Monte Carlo simulation approach for BNI to evaluate their\nperformance. Second, we demonstrate how these estimators can be combined with\nsubgroup discovery approaches to identify subgroups benefiting most from air\npollution policies without a priori specification. Finally, we apply the\nproposed methods to estimate the effects of coal-fired power plant emissions\ncontrol interventions on ischemic heart disease (IHD) among 27,312,190 US\nMedicare beneficiaries. 
Though we find no statistically significant effect of\nthe interventions in the full population, we do find significant IHD\nhospitalization decreases in communities with high poverty and smoking rates."}, "http://arxiv.org/abs/2306.00686": {"title": "A novel approach for estimating functions in the multivariate setting based on an adaptive knot selection for B-splines with an application to a chemical system used in geoscience", "link": "http://arxiv.org/abs/2306.00686", "description": "In this paper, we will outline a novel data-driven method for estimating\nfunctions in a multivariate nonparametric regression model based on an adaptive\nknot selection for B-splines. The underlying idea of our approach for selecting\nknots is to apply the generalized lasso, since the knots of the B-spline basis\ncan be seen as changes in the derivatives of the function to be estimated. This\nmethod was then extended to functions depending on several variables by\nprocessing each dimension independently, thus reducing the problem to a\nunivariate setting. The regularization parameters were chosen by means of a\ncriterion based on EBIC. The nonparametric estimator was obtained using a\nmultivariate B-spline regression with the corresponding selected knots. Our\nprocedure was validated through numerical experiments by varying the number of\nobservations and the level of noise to investigate its robustness. The\ninfluence of observation sampling was also assessed and our method was applied\nto a chemical system commonly used in geoscience. For each different framework\nconsidered in this paper, our approach performed better than state-of-the-art\nmethods. Our completely data-driven method is implemented in the glober R\npackage which is available on the Comprehensive R Archive Network (CRAN)."}, "http://arxiv.org/abs/2401.14426": {"title": "M$^3$TN: Multi-gate Mixture-of-Experts based Multi-valued Treatment Network for Uplift Modeling", "link": "http://arxiv.org/abs/2401.14426", "description": "Uplift modeling is a technique used to predict the effect of a treatment\n(e.g., discounts) on an individual's response. Although several methods have\nbeen proposed for multi-valued treatment, they are extended from binary\ntreatment methods. There are still some limitations. Firstly, existing methods\ncalculate uplift based on predicted responses, which may not guarantee a\nconsistent uplift distribution between treatment and control groups. Moreover,\nthis may cause cumulative errors for multi-valued treatment. Secondly, the\nmodel parameters become numerous with many prediction heads, leading to reduced\nefficiency. To address these issues, we propose a novel \\underline{M}ulti-gate\n\\underline{M}ixture-of-Experts based \\underline{M}ulti-valued\n\\underline{T}reatment \\underline{N}etwork (M$^3$TN). M$^3$TN consists of two\ncomponents: 1) a feature representation module with Multi-gate\nMixture-of-Experts to improve the efficiency; 2) a reparameterization module by\nmodeling uplift explicitly to improve the effectiveness. We also conduct\nextensive experiments to demonstrate the effectiveness and efficiency of our\nM$^3$TN."}, "http://arxiv.org/abs/2401.14512": {"title": "Who Are We Missing? 
A Principled Approach to Characterizing the Underrepresented Population", "link": "http://arxiv.org/abs/2401.14512", "description": "Randomized controlled trials (RCTs) serve as the cornerstone for\nunderstanding causal effects, yet extending inferences to target populations\npresents challenges due to effect heterogeneity and underrepresentation. Our\npaper addresses the critical issue of identifying and characterizing\nunderrepresented subgroups in RCTs, proposing a novel framework for refining\ntarget populations to improve generalizability. We introduce an\noptimization-based approach, Rashomon Set of Optimal Trees (ROOT), to\ncharacterize underrepresented groups. ROOT optimizes the target subpopulation\ndistribution by minimizing the variance of the target average treatment effect\nestimate, ensuring more precise treatment effect estimations. Notably, ROOT\ngenerates interpretable characteristics of the underrepresented population,\naiding researchers in effective communication. Our approach demonstrates\nimproved precision and interpretability compared to alternatives, as\nillustrated with synthetic data experiments. We apply our methodology to extend\ninferences from the Starting Treatment with Agonist Replacement Therapies\n(START) trial -- investigating the effectiveness of medication for opioid use\ndisorder -- to the real-world population represented by the Treatment Episode\nDataset: Admissions (TEDS-A). By refining target populations using ROOT, our\nframework offers a systematic approach to enhance decision-making accuracy and\ninform future trials in diverse populations."}, "http://arxiv.org/abs/2401.14515": {"title": "Martingale Posterior Distributions for Log-concave Density Functions", "link": "http://arxiv.org/abs/2401.14515", "description": "The family of log-concave density functions contains various kinds of common\nprobability distributions. Due to the shape restriction, it is possible to find\nthe nonparametric estimate of the density, for example, the nonparametric\nmaximum likelihood estimate (NPMLE). However, the associated uncertainty\nquantification of the NPMLE is less well developed. The current techniques for\nuncertainty quantification are Bayesian, using a Dirichlet process prior\ncombined with the use of Markov chain Monte Carlo (MCMC) to sample from the\nposterior. In this paper, we start with the NPMLE and use a version of the\nmartingale posterior distribution to establish uncertainty about the NPMLE. The\nalgorithm can be implemented in parallel and hence is fast. We prove the\nconvergence of the algorithm by constructing suitable submartingales. We also\nillustrate results with different models and settings and some real data, and\ncompare our method with that within the literature."}, "http://arxiv.org/abs/2401.14535": {"title": "CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process", "link": "http://arxiv.org/abs/2401.14535", "description": "Identifying the underlying time-delayed latent causal processes in sequential\ndata is vital for grasping temporal dynamics and making downstream reasoning.\nWhile some recent methods can robustly identify these latent causal variables,\nthey rely on strict assumptions about the invertible generation process from\nlatent variables to observed data. However, these assumptions are often hard to\nsatisfy in real-world applications containing information loss. 
For instance,\nthe visual perception process translates a 3D space into 2D images, or the\nphenomenon of persistence of vision incorporates historical data into current\nperceptions. To address this challenge, we establish an identifiability theory\nthat allows for the recovery of independent latent components even when they\ncome from a nonlinear and non-invertible mix. Using this theory as a\nfoundation, we propose a principled approach, CaRiNG, to learn the CAusal\nRepresentatIon of Non-invertible Generative temporal data with identifiability\nguarantees. Specifically, we utilize temporal context to recover lost latent\ninformation and apply the conditions in our theory to guide the training\nprocess. Through experiments conducted on synthetic datasets, we validate that\nour CaRiNG method reliably identifies the causal process, even when the\ngeneration process is non-invertible. Moreover, we demonstrate that our\napproach considerably improves temporal understanding and reasoning in\npractical applications."}, "http://arxiv.org/abs/2401.14549": {"title": "Privacy-preserving Quantile Treatment Effect Estimation for Randomized Controlled Trials", "link": "http://arxiv.org/abs/2401.14549", "description": "In accordance with the principle of \"data minimization\", many internet\ncompanies are opting to record less data. However, this is often at odds with\nA/B testing efficacy. For experiments with units with multiple observations,\none popular data minimizing technique is to aggregate data for each unit.\nHowever, exact quantile estimation requires the full observation-level data. In\nthis paper, we develop a method for approximate Quantile Treatment Effect (QTE)\nanalysis using histogram aggregation. In addition, we can also achieve formal\nprivacy guarantees using differential privacy."}, "http://arxiv.org/abs/2401.14558": {"title": "Simulation Model Calibration with Dynamic Stratification and Adaptive Sampling", "link": "http://arxiv.org/abs/2401.14558", "description": "Calibrating simulation models that take large quantities of multi-dimensional\ndata as input is a hard simulation optimization problem. Existing adaptive\nsampling strategies offer a methodological solution. However, they may not\nsufficiently reduce the computational cost for estimation and solution\nalgorithm's progress within a limited budget due to extreme noise levels and\nheteroskedasticity of system responses. We propose integrating stratification\nwith adaptive sampling for the purpose of efficiency in optimization.\nStratification can exploit local dependence in the simulation inputs and\noutputs. Yet, the state-of-the-art does not provide a full capability to\nadaptively stratify the data as different solution alternatives are evaluated.\nWe devise two procedures for data-driven calibration problems that involve a\nlarge dataset with multiple covariates to calibrate models within a fixed\noverall simulation budget. The first approach dynamically stratifies the input\ndata using binary trees, while the second approach uses closed-form solutions\nbased on linearity assumptions between the objective function and concomitant\nvariables. We find that dynamical adjustment of stratification structure\naccelerates optimization and reduces run-to-run variability in generated\nsolutions. 
Our case study for calibrating a wind power simulation model, widely\nused in the wind industry, using the proposed stratified adaptive sampling,\nshows better-calibrated parameters under a limited budget."}, "http://arxiv.org/abs/2401.14562": {"title": "Properties of the Mallows Model Depending on the Number of Alternatives: A Warning for an Experimentalist", "link": "http://arxiv.org/abs/2401.14562", "description": "The Mallows model is a popular distribution for ranked data. We empirically\nand theoretically analyze how the properties of rankings sampled from the\nMallows model change when increasing the number of alternatives. We find that\nreal-world data behaves differently than the Mallows model, yet is in line with\nits recent variant proposed by Boehmer et al. [2021]. As part of our study, we\nissue several warnings about using the model."}, "http://arxiv.org/abs/2401.14593": {"title": "Robust Estimation of Pareto's Scale Parameter from Grouped Data", "link": "http://arxiv.org/abs/2401.14593", "description": "Numerous robust estimators exist as alternatives to the maximum likelihood\nestimator (MLE) when a completely observed ground-up loss severity sample\ndataset is available. However, the options for robust alternatives to MLE\nbecome significantly limited when dealing with grouped loss severity data, with\nonly a handful of methods like least squares, minimum Hellinger distance, and\noptimal bounded influence function available. This paper introduces a novel\nrobust estimation technique, the Method of Truncated Moments (MTuM),\nspecifically designed to estimate the tail index of a Pareto distribution from\ngrouped data. Inferential justification of MTuM is established by employing the\ncentral limit theorem and validating it through a comprehensive simulation\nstudy."}, "http://arxiv.org/abs/2401.14655": {"title": "Distributionally Robust Optimization and Robust Statistics", "link": "http://arxiv.org/abs/2401.14655", "description": "We review distributionally robust optimization (DRO), a principled approach\nfor constructing statistical estimators that hedge against the impact of\ndeviations in the expected loss between the training and deployment\nenvironments. Many well-known estimators in statistics and machine learning\n(e.g. AdaBoost, LASSO, ridge regression, dropout training, etc.) are\ndistributionally robust in a precise sense. We hope that by discussing the DRO\ninterpretation of well-known estimators, statisticians who may not be too\nfamiliar with DRO may find a way to access the DRO literature through the\nbridge between classical results and their DRO equivalent formulation. On the\nother hand, the topic of robustness in statistics has a rich tradition\nassociated with removing the impact of contamination. Thus, another objective\nof this paper is to clarify the difference between DRO and classical\nstatistical robustness. As we will see, these are two fundamentally different\nphilosophies leading to completely different types of estimators. In DRO, the\nstatistician hedges against an environment shift that occurs after the decision\nis made; thus DRO estimators tend to be pessimistic in an adversarial setting,\nleading to a min-max type formulation. 
In classical robust statistics, the\nstatistician seeks to correct contamination that occurred before a decision is\nmade; thus robust statistical estimators tend to be optimistic leading to a\nmin-min type formulation."}, "http://arxiv.org/abs/2401.14684": {"title": "Inference for Cumulative Incidences and Treatment Effects in Randomized Controlled Trials with Time-to-Event Outcomes under ICH E9 (E1)", "link": "http://arxiv.org/abs/2401.14684", "description": "In randomized controlled trials (RCT) with time-to-event outcomes,\nintercurrent events occur as semi-competing/competing events, and they could\naffect the hazard of outcomes or render outcomes ill-defined. Although five\nstrategies have been proposed in ICH E9 (R1) addendum to address intercurrent\nevents in RCT, they did not readily extend to the context of time-to-event data\nfor studying causal effects with rigorously stated implications. In this study,\nwe show how to define, estimate, and infer the time-dependent cumulative\nincidence of outcome events in such contexts for obtaining causal\ninterpretations. Specifically, we derive the mathematical forms of the\nscientific objective (i.e., causal estimands) under the five strategies and\nclarify the required data structure to identify these causal estimands.\nFurthermore, we summarize estimation and inference methods for these causal\nestimands by adopting methodologies in survival analysis, including analytic\nformulas for asymptotic analysis and hypothesis testing. We illustrate our\nmethods with the LEADER Trial on investigating the effect of liraglutide on\ncardiovascular outcomes. Studies of multiple endpoints and combining strategies\nto address multiple intercurrent events can help practitioners understand\ntreatment effects more comprehensively."}, "http://arxiv.org/abs/2401.14722": {"title": "A Nonparametric Bayes Approach to Online Activity Prediction", "link": "http://arxiv.org/abs/2401.14722", "description": "Accurately predicting the onset of specific activities within defined\ntimeframes holds significant importance in several applied contexts. In\nparticular, accurate prediction of the number of future users that will be\nexposed to an intervention is an important piece of information for\nexperimenters running online experiments (A/B tests). In this work, we propose\na novel approach to predict the number of users that will be active in a given\ntime period, as well as the temporal trajectory needed to attain a desired user\nparticipation threshold. We model user activity using a Bayesian nonparametric\napproach which allows us to capture the underlying heterogeneity in user\nengagement. We derive closed-form expressions for the number of new users\nexpected in a given period, and a simple Monte Carlo algorithm targeting the\nposterior distribution of the number of days needed to attain a desired number\nof users; the latter is important for experimental planning. We illustrate the\nperformance of our approach via several experiments on synthetic and real world\ndata, in which we show that our novel method outperforms existing competitors."}, "http://arxiv.org/abs/2401.14827": {"title": "Clustering Longitudinal Ordinal Data via Finite Mixture of Matrix-Variate Distributions", "link": "http://arxiv.org/abs/2401.14827", "description": "In social sciences, studies are often based on questionnaires asking\nparticipants to express ordered responses several times over a study period. 
We\npresent a model-based clustering algorithm for such longitudinal ordinal data.\nAssuming that an ordinal variable is the discretization of an underlying latent\ncontinuous variable, the model relies on a mixture of matrix-variate normal\ndistributions, accounting simultaneously for within- and between-time\ndependence structures. The model is thus able to concurrently model the\nheterogeneity, the association among the responses and the temporal dependence\nstructure. An EM algorithm is developed and presented for parameter\nestimation. An evaluation of the model through synthetic data shows its\nestimation abilities and its advantages when compared to competitors. A\nreal-world application concerning changes in eating behaviours during the\nCovid-19 pandemic period in France will be presented."}, "http://arxiv.org/abs/2401.14836": {"title": "Automatic and location-adaptive estimation in functional single-index regression", "link": "http://arxiv.org/abs/2401.14836", "description": "This paper develops a new automatic and location-adaptive procedure for\nestimating regression in a Functional Single-Index Model (FSIM). This procedure\nis based on $k$-Nearest Neighbours ($k$NN) ideas. The asymptotic study includes\nresults for an automatically selected, data-driven number of neighbours, making the\nprocedure directly usable in practice. The local feature of the $k$NN approach\nensures higher predictive power compared with usual kernel estimates, as\nillustrated in some finite sample analysis. As a by-product, we state as\npreliminary tools some new uniform asymptotic results for kernel estimates in\nthe FSIM model."}, "http://arxiv.org/abs/2401.14841": {"title": "Sparse semiparametric regression when predictors are mixture of functional and high-dimensional variables", "link": "http://arxiv.org/abs/2401.14841", "description": "This paper aims to deal with dimensionality reduction in a regression setting\nwhen the predictors are a mixture of a functional variable and a high-dimensional\nvector. A flexible model, combining both sparse linear ideas together with\nsemiparametrics, is proposed. A wide scope of asymptotic results is provided:\nthis covers rates of convergence of the estimators as well as the asymptotic\nbehaviour of the variable selection procedure. Practical issues are analysed\nthrough finite sample simulated experiments, while an application to Tecator's\ndata illustrates the usefulness of our methodology."}, "http://arxiv.org/abs/2401.14848": {"title": "A $k$NN procedure in semiparametric functional data analysis", "link": "http://arxiv.org/abs/2401.14848", "description": "A fast and flexible $k$NN procedure is developed for dealing with a\nsemiparametric functional regression model involving both partial-linear and\nsingle-index components. Rates of uniform consistency are presented. Simulated\nexperiments highlight the advantages of the $k$NN procedure. A real data\nanalysis is also shown."}, "http://arxiv.org/abs/2401.14864": {"title": "Fast and efficient algorithms for sparse semiparametric bi-functional regression", "link": "http://arxiv.org/abs/2401.14864", "description": "A new sparse semiparametric model is proposed, which incorporates the\ninfluence of two functional random variables in a scalar response in a flexible\nand interpretable manner. One of the functional covariates is included through\na single-index structure, while the other is included linearly through the\nhigh-dimensional vector formed by its discretised observations. 
For this model,\ntwo new algorithms are presented for selecting relevant variables in the linear\npart and estimating the model. Both procedures utilise the functional origin of\nlinear covariates. Finite sample experiments demonstrated the scope of\napplication of both algorithms: the first method is a fast algorithm that\nprovides a solution (without loss in predictive ability) for the significant\ncomputational time required by standard variable selection methods for\nestimating this model, and the second algorithm completes the set of relevant\nlinear covariates provided by the first, thus improving its predictive\nefficiency. Some asymptotic results theoretically support both procedures. A\nreal data application demonstrated the applicability of the presented\nmethodology from a predictive perspective in terms of the interpretability of\noutputs and low computational cost."}, "http://arxiv.org/abs/2401.14867": {"title": "Variable selection in functional regression models: a review", "link": "http://arxiv.org/abs/2401.14867", "description": "Despite various similar features, Functional Data Analysis and\nHigh-Dimensional Data Analysis are two major fields in Statistics that grew up\nrecently almost independently of each other. The aim of this paper is to\npropose a survey on methodological advances for variable selection in\nfunctional regression, which is typically a question in which both functional\nand multivariate ideas cross. More than a simple survey, this paper aims\nto promote further links between the two areas."}, "http://arxiv.org/abs/2401.14902": {"title": "Model-assisted survey sampling with Bayesian optimization", "link": "http://arxiv.org/abs/2401.14902", "description": "Survey sampling plays an important role in the efficient allocation and\nmanagement of resources. The essence of survey sampling lies in acquiring a\nsample of data points from a population and subsequently using this sample to\nestimate the population parameters of the targeted response variable, such as\nenvironment-related metrics or other pertinent factors. Practical limitations\nimposed on survey sampling necessitate prudent consideration of the number of\nsamples attainable from the study areas, given the constraints of a fixed\nbudget. To this end, researchers are compelled to employ sampling designs that\noptimize sample allocations to the best of their ability. Generally,\nprobability sampling serves as the preferred method, ensuring an unbiased\nestimation of population parameters. Evaluating the efficiency of estimators\ninvolves assessing their variances and benchmarking them against alternative\nbaseline approaches, such as simple random sampling. In this study, we propose\na novel model-assisted unbiased probability sampling method that leverages\nBayesian optimization for the determination of sampling designs. As a result,\nthis approach can yield estimators with more efficient variance outcomes\ncompared to conventional estimators such as the Horvitz-Thompson estimator.\nFurthermore, we test the proposed method in a simulation study using an\nempirical dataset covering plot-level tree volume from central Finland. 
The\nresults demonstrate a statistically significant improvement in performance for the\nproposed method when compared to the baseline."}, "http://arxiv.org/abs/2401.14910": {"title": "Modeling Extreme Events: Univariate and Multivariate Data-Driven Approaches", "link": "http://arxiv.org/abs/2401.14910", "description": "Modern inference in extreme value theory faces numerous complications, such\nas missing data, hidden covariates or design problems. Some of those\ncomplications were exemplified in the EVA 2023 data challenge. The challenge\ncomprises multiple individual problems which cover a variety of univariate and\nmultivariate settings. This note presents the contribution of team genEVA in\nsaid competition, with particular focus on a detailed presentation of\nmethodology and inference."}, "http://arxiv.org/abs/2401.15014": {"title": "A Robust Bayesian Method for Building Polygenic Risk Scores using Projected Summary Statistics and Bridge Prior", "link": "http://arxiv.org/abs/2401.15014", "description": "Polygenic Risk Scores (PRS) developed from genome-wide association studies\n(GWAS) are of increasing interest for various clinical and research\napplications. Bayesian methods have been particularly popular for building PRS\non a genome-wide scale because of their natural ability to regularize models and\nborrow information in high dimensions. In this article, we present new\ntheoretical results, methods, and extensive numerical studies to advance\nBayesian methods for PRS applications. We conduct theoretical studies to\nidentify causes of convergence issues of some Bayesian methods when required\ninput GWAS summary-statistics and linkage disequilibrium (LD) (genetic\ncorrelation) data are derived from distinct samples. We propose a remedy to the\nproblem by the projection of the summary-statistics data into the column space\nof the genetic correlation matrix. We further implement a PRS development\nalgorithm under the Bayesian Bridge prior which can allow a more flexible\nspecification of the effect-size distribution than those allowed under popular\nalternative methods. Finally, we conduct careful benchmarking studies of\nalternative Bayesian methods using both simulation studies and real datasets,\nwhere we carefully investigate both the effect of prior specification and\nestimation strategies for LD parameters. These studies show that the proposed\nalgorithm, equipped with the projection approach, the flexible prior\nspecification, and an efficient numerical algorithm, leads to the development of\nthe most robust PRS across a wide variety of scenarios."}, "http://arxiv.org/abs/2401.15063": {"title": "Graph fission and cross-validation", "link": "http://arxiv.org/abs/2401.15063", "description": "We introduce a technique called graph fission which takes in a graph which\npotentially contains only one observation per node (whose distribution lies in\na known class) and produces two (or more) independent graphs with the same\nnode/edge set in a way that splits the original graph's information amongst\nthem in any desired proportion. Our proposal builds on data fission/thinning, a\nmethod that uses external randomization to create independent copies of an\nunstructured dataset. We extend this idea to the graph setting where there may be\nlatent structure between observations. 
We demonstrate the utility of this\nframework via two applications: inference after structural trend estimation on\ngraphs and a model selection procedure we term ``graph cross-validation''."}, "http://arxiv.org/abs/2401.15076": {"title": "Comparative Analysis of Practical Identifiability Methods for an SEIR Model", "link": "http://arxiv.org/abs/2401.15076", "description": "Identifiability of a mathematical model plays a crucial role in\nparameterization of the model. In this study, we establish the structural\nidentifiability of a Susceptible-Exposed-Infected-Recovered (SEIR) model given\ndifferent combinations of input data and investigate practical identifiability\nwith respect to different observable data, data frequency, and noise\ndistributions. The practical identifiability is explored by both Monte Carlo\nsimulations and a Correlation Matrix approach. Our results show that practical\nidentifiability benefits from higher data frequency and data from the peak of\nan outbreak. The incidence data gives the best practical identifiability\nresults compared to prevalence and cumulative data. In addition, we compare and\ndistinguish the practical identifiability by Monte Carlo simulations and a\nCorrelation Matrix approach, providing insights for when to use which method\nfor other applications."}, "http://arxiv.org/abs/2105.02487": {"title": "High-dimensional Functional Graphical Model Structure Learning via Neighborhood Selection Approach", "link": "http://arxiv.org/abs/2105.02487", "description": "Undirected graphical models are widely used to model the conditional\nindependence structure of vector-valued data. However, in many modern\napplications, for example those involving EEG and fMRI data, observations are\nmore appropriately modeled as multivariate random functions rather than\nvectors. Functional graphical models have been proposed to model the\nconditional independence structure of such functional data. We propose a\nneighborhood selection approach to estimate the structure of Gaussian\nfunctional graphical models, where we first estimate the neighborhood of each\nnode via a function-on-function regression and subsequently recover the entire\ngraph structure by combining the estimated neighborhoods. Our approach only\nrequires assumptions on the conditional distributions of random functions, and\nwe estimate the conditional independence structure directly. We thus circumvent\nthe need for a well-defined precision operator that may not exist when the\nfunctions are infinite dimensional. Additionally, the neighborhood selection\napproach is computationally efficient and can be easily parallelized. The\nstatistical consistency of the proposed method in the high-dimensional setting\nis supported by both theory and experimental results. In addition, we study the\neffect of the choice of the function basis used for dimensionality reduction in\nan intermediate step. 
We give a heuristic criterion for choosing a function\nbasis and motivate two practically useful choices, which we justify by both\ntheory and experiments."}, "http://arxiv.org/abs/2203.11469": {"title": "A new class of composite GBII regression models with varying threshold for modelling heavy-tailed data", "link": "http://arxiv.org/abs/2203.11469", "description": "The four-parameter generalized beta distribution of the second kind (GBII)\nhas been proposed for modelling insurance losses with heavy-tailed features.\nThe aim of this paper is to present a parametric composite GBII regression\nmodel by splicing two GBII distributions using the mode matching method. It is\ndesigned for simultaneous modeling of small and large claims and capturing the\npolicyholder heterogeneity by introducing the covariates into the location\nparameter. In such cases, the threshold that splits two GBII distributions\nvaries across individual policyholders based on their risk features. The\nproposed regression model also contains a wide range of insurance loss\ndistributions as the head and the tail respectively and provides\nclosed-form expressions for parameter estimation and model prediction. A\nsimulation study is conducted to show the accuracy of the proposed estimation\nmethod and the flexibility of the regressions. Some illustrations of the\napplicability of the new class of distributions and regressions are provided\nwith a Danish fire losses data set and a Chinese medical insurance claims data\nset, comparing with the results of competing models from the literature."}, "http://arxiv.org/abs/2206.14674": {"title": "Signature Methods in Machine Learning", "link": "http://arxiv.org/abs/2206.14674", "description": "Signature-based techniques give mathematical insight into the interactions\nbetween complex streams of evolving data. These insights can be quite naturally\ntranslated into numerical approaches to understanding streamed data, and\nperhaps because of their mathematical precision, have proved useful in\nanalysing streamed data in situations where the data is irregular, and not\nstationary, and the dimension of the data and the sample sizes are both\nmoderate. Understanding streamed multi-modal data is exponential: a word in $n$\nletters from an alphabet of size $d$ can be any one of $d^n$ messages.\nSignatures remove the exponential amount of noise that arises from sampling\nirregularity, but an exponential amount of information still remains. This\nsurvey aims to stay in the domain where that exponential scaling can be managed\ndirectly. Scalability issues are an important challenge in many problems but\nwould require another survey article and further ideas. This survey describes a\nrange of contexts where the data sets are small enough to remove the\npossibility of massive machine learning, and where small sets of\ncontext-free and principled features can be used effectively. The mathematical\nnature of the tools can make their use intimidating to non-mathematicians. The\nexamples presented in this article are intended to bridge this communication\ngap and provide tractable working examples drawn from the machine learning\ncontext. Notebooks are available online for several of these examples. This\nsurvey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin\nwhich had broadly similar aims at an earlier point in the development of this\nmachinery. 
This article illustrates how the theoretical insights offered by\nsignatures are simply realised in the analysis of application data in a way\nthat is largely agnostic to the data type."}, "http://arxiv.org/abs/2208.08925": {"title": "Efficiency of nonparametric e-tests", "link": "http://arxiv.org/abs/2208.08925", "description": "The notion of an e-value has been recently proposed as a possible alternative\nto critical regions and p-values in statistical hypothesis testing. In this\npaper we consider testing the nonparametric hypothesis of symmetry, introduce\nanalogues for e-values of three popular nonparametric tests, define an analogue\nfor e-values of Pitman's asymptotic relative efficiency, and apply it to the\nthree nonparametric tests. We discuss limitations of our simple definition of\nasymptotic relative efficiency and list directions of further research."}, "http://arxiv.org/abs/2211.16468": {"title": "Linear-Time Algorithms for Front-Door Adjustment in Causal Graphs", "link": "http://arxiv.org/abs/2211.16468", "description": "Causal effect estimation from observational data is a fundamental task in\nempirical sciences. It becomes particularly challenging when unobserved\nconfounders are involved in a system. This paper focuses on front-door\nadjustment -- a classic technique which, using observed mediators allows to\nidentify causal effects even in the presence of unobserved confounding. While\nthe statistical properties of the front-door estimation are quite well\nunderstood, its algorithmic aspects remained unexplored for a long time. In\n2022, Jeong, Tian, and Bareinboim presented the first polynomial-time algorithm\nfor finding sets satisfying the front-door criterion in a given directed\nacyclic graph (DAG), with an $O(n^3(n+m))$ run time, where $n$ denotes the\nnumber of variables and $m$ the number of edges of the causal graph. In our\nwork, we give the first linear-time, i.e., $O(n+m)$, algorithm for this task,\nwhich thus reaches the asymptotically optimal time complexity. This result\nimplies an $O(n(n+m))$ delay enumeration algorithm of all front-door adjustment\nsets, again improving previous work by a factor of $n^3$. Moreover, we provide\nthe first linear-time algorithm for finding a minimal front-door adjustment\nset. We offer implementations of our algorithms in multiple programming\nlanguages to facilitate practical usage and empirically validate their\nfeasibility, even for large graphs."}, "http://arxiv.org/abs/2305.13221": {"title": "Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data", "link": "http://arxiv.org/abs/2305.13221", "description": "Additive spatial statistical models with weakly stationary process\nassumptions have become standard in spatial statistics. However, one\ndisadvantage of such models is the computation time, which rapidly increases\nwith the number of data points. The goal of this article is to apply an\nexisting subsampling strategy to standard spatial additive models and to derive\nthe spatial statistical properties. We call this strategy the \"spatial data\nsubset model\" (SDSM) approach, which can be applied to big datasets in a\ncomputationally feasible way. Our approach has the advantage that one does not\nrequire any additional restrictive model assumptions. That is, computational\ngains increase as model assumptions are removed when using our model framework.\nThis provides one solution to the computational bottlenecks that occur when\napplying methods such as Kriging to \"big data\". 
We provide several properties\nof this new spatial data subset model approach in terms of moments, sill,\nnugget, and range under several sampling designs. An advantage of our approach\nis that it subsamples without throwing away data, and can be implemented using\ndatasets of any size that can be stored. We present the results of the spatial\ndata subset model approach on simulated datasets, and on a large dataset\nconsisting of 150,000 observations of daytime land surface temperatures measured\nby the MODIS instrument onboard the Terra satellite."}, "http://arxiv.org/abs/2305.13818": {"title": "A Rank-Based Sequential Test of Independence", "link": "http://arxiv.org/abs/2305.13818", "description": "We consider the problem of independence testing for two univariate random\nvariables in a sequential setting. By leveraging recent developments on safe,\nanytime-valid inference, we propose a test with time-uniform type I error\ncontrol and derive explicit bounds on the finite sample performance of the\ntest. We demonstrate the empirical performance of the procedure in comparison\nto existing sequential and non-sequential independence tests. Furthermore,\nsince the proposed test is distribution free under the null hypothesis, we\nempirically simulate the gap due to Ville's inequality, the supermartingale\nanalogue of Markov's inequality, that is commonly applied to control type I\nerror in anytime-valid inference, and apply this to construct a truncated\nsequential test."}, "http://arxiv.org/abs/2307.08594": {"title": "Tight Distribution-Free Confidence Intervals for Local Quantile Regression", "link": "http://arxiv.org/abs/2307.08594", "description": "It is well known that it is impossible to construct useful confidence\nintervals (CIs) about the mean or median of a response $Y$ conditional on\nfeatures $X = x$ without making strong assumptions about the joint distribution\nof $X$ and $Y$. This paper introduces a new framework for reasoning about\nproblems of this kind by casting the conditional problem at different levels of\nresolution, ranging from coarse to fine localization. In each of these\nproblems, we consider local quantiles defined as the marginal quantiles of $Y$\nwhen $(X,Y)$ is resampled in such a way that samples $X$ near $x$ are\nup-weighted while the conditional distribution $Y \\mid X$ does not change. We\nthen introduce the Weighted Quantile method, which asymptotically produces the\nuniformly most accurate confidence intervals for these local quantiles no\nmatter the (unknown) underlying distribution. Another method, namely, the\nQuantile Rejection method, achieves finite sample validity under no assumption\nwhatsoever. We conduct extensive numerical studies demonstrating that both of\nthese methods are valid. In particular, we show that the Weighted Quantile\nprocedure achieves nominal coverage as soon as the effective sample size is in\nthe range of 10 to 20."}, "http://arxiv.org/abs/2307.08685": {"title": "Evaluating Climate Models with Sliced Elastic Distance", "link": "http://arxiv.org/abs/2307.08685", "description": "The validation of global climate models plays a crucial role in ensuring the\naccuracy of climatological predictions. However, existing statistical methods\nfor evaluating differences between climate fields often overlook time\nmisalignment and therefore fail to distinguish between sources of variability.\nTo more comprehensively measure differences between climate fields, we\nintroduce a new vector-valued metric, the sliced elastic distance. 
This new\nmetric simultaneously accounts for spatial and temporal variability while\ndecomposing the total distance into shape differences (amplitude), timing\nvariability (phase), and bias (translation). We compare the sliced elastic\ndistance against a classical metric and a newly developed Wasserstein-based\napproach through a simulation study. Our results demonstrate that the sliced\nelastic distance outperforms previous methods by capturing a broader range of\nfeatures. We then apply our metric to evaluate the historical model outputs of\nthe Coupled Model Intercomparison Project (CMIP) members, focusing on monthly\naverage surface temperatures and monthly total precipitation. By comparing\nthese model outputs with quasi-observational ERA5 Reanalysis data products, we\nrank the CMIP models and assess their performance. Additionally, we investigate\nthe progression from CMIP phase 5 to phase 6 and find modest improvements in\nthe phase 6 models regarding their ability to produce realistic climate\ndynamics."}, "http://arxiv.org/abs/2401.15139": {"title": "FDR-Controlled Portfolio Optimization for Sparse Financial Index Tracking", "link": "http://arxiv.org/abs/2401.15139", "description": "In high-dimensional data analysis, such as financial index tracking or\nbiomedical applications, it is crucial to select the few relevant variables\nwhile maintaining control over the false discovery rate (FDR). In these\napplications, strong dependencies often exist among the variables (e.g., stock\nreturns), which can undermine the FDR control property of existing methods like\nthe model-X knockoff method or the T-Rex selector. To address this issue, we\nhave expanded the T-Rex framework to accommodate overlapping groups of highly\ncorrelated variables. This is achieved by integrating a nearest neighbors\npenalization mechanism into the framework, which provably controls the FDR at\nthe user-defined target level. A real-world example of sparse index tracking\ndemonstrates the proposed method's ability to accurately track the S&P 500\nindex over the past 20 years based on a small number of stocks. An open-source\nimplementation is provided within the R package TRexSelector on CRAN."}, "http://arxiv.org/abs/2401.15225": {"title": "A bivariate two-state Markov modulated Poisson process for failure modelling", "link": "http://arxiv.org/abs/2401.15225", "description": "Motivated by a real failure dataset in a two-dimensional context, this paper\npresents an extension of the Markov modulated Poisson process (MMPP) to two\ndimensions. The one-dimensional MMPP has been proposed for the modeling of\ndependent and non-exponential inter-failure times (in contexts as queuing, risk\nor reliability, among others). The novel two-dimensional MMPP allows for\ndependence between the two sequences of inter-failure times, while at the same\ntime preserves the MMPP properties, marginally. The generalization is based on\nthe Marshall-Olkin exponential distribution. Inference is undertaken for the\nnew model through a method combining a matching moments approach with an\nApproximate Bayesian Computation (ABC) algorithm. The performance of the method\nis shown on simulated and real datasets representing times and distances\ncovered between consecutive failures in a public transport company. 
For the\nreal dataset, some quantities of importance associated with the reliability of\nthe system are estimated, such as the probabilities and expected number of failures\nat different times and distances covered by trains until the occurrence of a\nfailure."}, "http://arxiv.org/abs/2401.15259": {"title": "Estimating lengths-of-stay of hospitalised COVID-19 patients using a non-parametric model: a case study in Galicia (Spain)", "link": "http://arxiv.org/abs/2401.15259", "description": "Estimating the lengths-of-stay (LoS) of hospitalised COVID-19 patients is key\nfor predicting the hospital beds' demand and planning mitigation strategies, as\noverwhelming the healthcare systems has critical consequences for disease\nmortality. However, accurately mapping the time-to-event of hospital outcomes,\nsuch as the LoS in the intensive care unit (ICU), requires understanding\npatient trajectories while adjusting for covariates and observation bias, such\nas incomplete data. Standard methods, such as the Kaplan-Meier estimator,\nrequire prior assumptions that are untenable given current knowledge. Using\nreal-time surveillance data from the first weeks of the COVID-19 epidemic in\nGalicia (Spain), we aimed to model the time-to-event and event probabilities of\nhospitalised patients, without parametric priors and adjusting for individual\ncovariates. We applied a non-parametric mixture cure model and compared its\nperformance in estimating hospital ward (HW)/ICU LoS to the performances of\ncommonly used methods to estimate survival. We showed that the proposed model\noutperformed standard approaches, providing more accurate ICU and HW LoS\nestimates. Finally, we applied our model estimates to simulate COVID-19\nhospital demand using a Monte Carlo algorithm. We provided evidence that\nadjusting for sex, generally overlooked in prediction models, together with age\nis key for accurately forecasting HW and ICU occupancy, as well as discharge or\ndeath outcomes."}, "http://arxiv.org/abs/2401.15262": {"title": "Asymptotic Behavior of Adversarial Training Estimator under $\\ell_\\infty$-Perturbation", "link": "http://arxiv.org/abs/2401.15262", "description": "Adversarial training has been proposed to hedge against adversarial attacks\nin machine learning and statistical models. This paper focuses on adversarial\ntraining under $\\ell_\\infty$-perturbation, which has recently attracted much\nresearch attention. The asymptotic behavior of the adversarial training\nestimator is investigated in the generalized linear model. The results imply\nthat the limiting distribution of the adversarial training estimator under\n$\\ell_\\infty$-perturbation could put a positive probability mass at $0$ when\nthe true parameter is $0$, providing a theoretical guarantee of the associated\nsparsity-recovery ability. Alternatively, a two-step procedure is proposed --\nadaptive adversarial training, which could further improve the performance of\nadversarial training under $\\ell_\\infty$-perturbation. Specifically, the\nproposed procedure could achieve asymptotic unbiasedness and variable-selection\nconsistency. 
Numerical experiments are conducted to show the sparsity-recovery\nability of adversarial training under $\\ell_\\infty$-perturbation and to compare\nthe empirical performance between classic adversarial training and adaptive\nadversarial training."}, "http://arxiv.org/abs/2401.15281": {"title": "Improved confidence intervals for nonlinear mixed-effects and nonparametric regression models", "link": "http://arxiv.org/abs/2401.15281", "description": "Statistical inference for high dimensional parameters (HDPs) can be based on\ntheir intrinsic correlation; that is, parameters that are close spatially or\ntemporally tend to have more similar values. This is why nonlinear\nmixed-effects models (NMMs) are commonly (and appropriately) used for models\nwith HDPs. Conversely, in many practical applications of NMM, the random\neffects (REs) are actually correlated HDPs that should remain constant during\nrepeated sampling for frequentist inference. In both scenarios, the inference\nshould be conditional on REs, instead of marginal inference by integrating out\nREs. In this paper, we first summarize recent theory of conditional inference\nfor NMM, and then propose a bias-corrected RE predictor and confidence interval\n(CI). We also extend this methodology to accommodate the case where some REs\nare not associated with data. Simulation studies indicate that this new\napproach leads to substantial improvement in the conditional coverage rate of\nRE CIs, including CIs for smooth functions in generalized additive models, as\ncompared to the existing method based on marginal inference."}, "http://arxiv.org/abs/2401.15309": {"title": "Zero-inflated Smoothing Spline (ZISS) Models for Individual-level Single-cell Temporal Data", "link": "http://arxiv.org/abs/2401.15309", "description": "Recent advancements in single-cell RNA-sequencing (scRNA-seq) have enhanced\nour understanding of cell heterogeneity at a high resolution. With the ability\nto sequence over 10,000 cells per hour, researchers can collect large scRNA-seq\ndatasets for different participants, offering an opportunity to study the\ntemporal progression of individual-level single-cell data. However, the\npresence of excessive zeros, a common issue in scRNA-seq, significantly impacts\nregression/association analysis, potentially leading to biased estimates in\ndownstream analysis. Addressing these challenges, we introduce the Zero\nInflated Smoothing Spline (ZISS) method, specifically designed to model\nsingle-cell temporal data. The ZISS method encompasses two components for\nmodeling gene expression patterns over time and handling excessive zeros. Our\napproach employs the smoothing spline ANOVA model, providing robust estimates\nof mean functions and zero probabilities for irregularly observed single-cell\ntemporal data compared to existing methods in our simulation studies and real\ndata analysis."}, "http://arxiv.org/abs/2401.15382": {"title": "Inference on an heteroscedastic Gompertz tumor growth model", "link": "http://arxiv.org/abs/2401.15382", "description": "We consider a non homogeneous Gompertz diffusion process whose parameters are\nmodified by generally time-dependent exogenous factors included in the\ninfinitesimal moments. The proposed model is able to describe tumor dynamics\nunder the effect of anti-proliferative and/or cell death-induced therapies. We\nassume that such therapies can modify also the infinitesimal variance of the\ndiffusion process. 
An estimation procedure, based on a control group and two\ntreated groups, is proposed to infer the model by estimating the constant\nparameters and the time-dependent terms. Moreover, several concatenated\nhypothesis tests are considered in order to confirm or reject the need to\ninclude time-dependent functions in the infinitesimal moments. Simulations are\nprovided to evaluate the efficiency of the suggested procedures and to validate\nthe testing hypothesis. Finally, an application to real data is considered."}, "http://arxiv.org/abs/2401.15461": {"title": "Anytime-Valid Tests of Group Invariance through Conformal Prediction", "link": "http://arxiv.org/abs/2401.15461", "description": "The assumption that data are invariant under the action of a compact group is\nimplicit in many statistical modeling assumptions such as normality, or the\nassumption of independence and identical distributions. Hence, testing for the\npresence of such invariances offers a principled way to falsify various\nstatistical models. In this article, we develop sequential, anytime-valid tests\nof distributional symmetry under the action of general compact groups. The\ntests that are developed allow for the continuous monitoring of data as it is\ncollected while keeping type-I error guarantees, and include tests for\nexchangeability and rotational symmetry as special examples. The main tool to\nthis end is the machinery developed for conformal prediction. The resulting\ntest statistic, called a conformal martingale, can be interpreted as a\nlikelihood ratio. We use this interpretation to show that the test statistics\nare optimal -- in a specific log-optimality sense -- against certain\nalternatives. Furthermore, we draw a connection between conformal prediction,\nanytime-valid tests of distributional invariance, and current developments on\nanytime-valid testing. In particular, we extend existing anytime-valid tests of\nindependence, which leverage exchangeability, to work under general group\ninvariances. Additionally, we discuss testing for invariance under subgroups of\nthe permutation group and orthogonal group, the latter of which corresponds to\ntesting the assumptions behind linear regression models."}, "http://arxiv.org/abs/2401.15514": {"title": "Validity of Complete Case Analysis Depends on the Target Population", "link": "http://arxiv.org/abs/2401.15514", "description": "Missing data is a pernicious problem in epidemiologic research. Research on\nthe validity of complete case analysis for missing data has typically focused\non estimating the average treatment effect (ATE) in the whole population.\nHowever, other target populations like the treated (ATT) or external targets\ncan be of substantive interest. In such cases, whether missing covariate data\noccurs within or outside the target population may impact the validity of\ncomplete case analysis. We sought to assess bias in complete case analysis when\ncovariate data is missing outside the target (e.g., missing covariate data\namong the untreated when estimating the ATT). We simulated a study of the\neffect of a binary treatment X on a binary outcome Y in the presence of 3\nconfounders C1-C3 that modified the risk difference (RD). We induced\nmissingness in C1 only among the untreated under 4 scenarios: completely\nrandomly (similar to MCAR); randomly based on C2 and C3 (similar to MAR);\nrandomly based on C1 (similar to MNAR); or randomly based on Y (similar to\nMAR). 
We estimated the ATE and ATT using weighting and averaged results across\nthe replicates. We conducted a parallel simulation transporting trial results\nto a target population in the presence of missing covariate data in the trial.\nIn the complete case analysis, estimated ATE was unbiased only when C1 was MCAR\namong the untreated. The estimated ATT, on the other hand, was unbiased in all\nscenarios except when Y caused missingness. The parallel simulation of\ngeneralizing and transporting trial results saw similar bias patterns. If\nmissing covariate data is only present outside the target population, complete\ncase analysis is unbiased except when missingness is associated with the\noutcome."}, "http://arxiv.org/abs/2401.15519": {"title": "Large Deviation Analysis of Score-based Hypothesis Testing", "link": "http://arxiv.org/abs/2401.15519", "description": "Score-based statistical models play an important role in modern machine\nlearning, statistics, and signal processing. For hypothesis testing, a\nscore-based hypothesis test is proposed in \\cite{wu2022score}. We analyze the\nperformance of this score-based hypothesis testing procedure and derive upper\nbounds on the probabilities of its Type I and II errors. We prove that the\nexponents of our error bounds are asymptotically (in the number of samples)\ntight for the case of simple null and alternative hypotheses. We calculate\nthese error exponents explicitly in specific cases and provide numerical\nstudies for various other scenarios of interest."}, "http://arxiv.org/abs/2401.15567": {"title": "Matrix Supermartingales and Randomized Matrix Concentration Inequalities", "link": "http://arxiv.org/abs/2401.15567", "description": "We present new concentration inequalities for either martingale dependent or\nexchangeable random symmetric matrices under a variety of tail conditions,\nencompassing standard Chernoff bounds to self-normalized heavy-tailed settings.\nThese inequalities are often randomized in a way that renders them strictly\ntighter than existing deterministic results in the literature, are typically\nexpressed in the Loewner order, and are sometimes valid at arbitrary\ndata-dependent stopping times.\n\nAlong the way, we explore the theory of matrix supermartingales and maximal\ninequalities, potentially of independent interest."}, "http://arxiv.org/abs/2401.15623": {"title": "GT-PCA: Effective and Interpretable Dimensionality Reduction with General Transform-Invariant Principal Component Analysis", "link": "http://arxiv.org/abs/2401.15623", "description": "Data analysis often requires methods that are invariant with respect to\nspecific transformations, such as rotations in case of images or shifts in case\nof images and time series. While principal component analysis (PCA) is a\nwidely-used dimension reduction technique, it lacks robustness with respect to\nthese transformations. Modern alternatives, such as autoencoders, can be\ninvariant with respect to specific transformations but are generally not\ninterpretable. We introduce General Transform-Invariant Principal Component\nAnalysis (GT-PCA) as an effective and interpretable alternative to PCA and\nautoencoders. 
We propose a neural network that efficiently estimates the\ncomponents and show that GT-PCA significantly outperforms alternative methods\nin experiments based on synthetic and real data."}, "http://arxiv.org/abs/2401.15680": {"title": "How to achieve model-robust inference in stepped wedge trials with model-based methods?", "link": "http://arxiv.org/abs/2401.15680", "description": "A stepped wedge design is a unidirectional crossover design where clusters\nare randomized to distinct treatment sequences defined by calendar time. While\nmodel-based analysis of stepped wedge designs -- via linear mixed models or\ngeneralized estimating equations -- is standard practice to evaluate treatment\neffects accounting for clustering and adjusting for baseline covariates, formal\nresults on their model-robustness properties remain unavailable. In this\narticle, we study when a potentially misspecified multilevel model can offer\nconsistent estimators for treatment effect estimands that are functions of\ncalendar time and/or exposure time. We describe a super-population potential\noutcomes framework to define treatment effect estimands of interest in stepped\nwedge designs, and adapt linear mixed models and generalized estimating\nequations to achieve estimand-aligned inference. We prove a central result\nthat, as long as the treatment effect structure is correctly specified in each\nworking model, our treatment effect estimator is robust to arbitrary\nmisspecification of all remaining model components. The theoretical results are\nillustrated via simulation experiments and re-analysis of a cardiovascular\nstepped wedge cluster randomized trial."}, "http://arxiv.org/abs/2401.15694": {"title": "Constrained Markov decision processes for response-adaptive procedures in clinical trials with binary outcomes", "link": "http://arxiv.org/abs/2401.15694", "description": "A constrained Markov decision process (CMDP) approach is developed for\nresponse-adaptive procedures in clinical trials with binary outcomes. The\nresulting CMDP class of Bayesian response-adaptive procedures can be used to\ntarget a certain objective, e.g., patient benefit or power while using\nconstraints to keep other operating characteristics under control. In the CMDP\napproach, the constraints can be formulated under different priors, which can\ninduce a certain behaviour of the policy under a given statistical hypothesis,\nor given that the parameters lie in a specific part of the parameter space. A\nsolution method is developed to find the optimal policy, as well as a more\nefficient method, based on backward recursion, which often yields a\nnear-optimal solution with an available optimality gap. Three applications are\nconsidered, involving type I error and power constraints, constraints on the\nmean squared error, and a constraint on prior robustness. While the CMDP\napproach slightly outperforms the constrained randomized dynamic programming\n(CRDP) procedure known from the literature when focussing on type I and II error\nand mean squared error, showing the general quality of CRDP, CMDP significantly\noutperforms CRDP when the focus is on type I and II error only."}, "http://arxiv.org/abs/2401.15703": {"title": "A Bayesian multivariate extreme value mixture model", "link": "http://arxiv.org/abs/2401.15703", "description": "Impact assessment of natural hazards requires the consideration of both\nextreme and non-extreme events. 
Extensive research has been conducted on the\njoint modeling of bulk and tail in univariate settings; however, the\ncorresponding body of research in the context of multivariate analysis is\ncomparatively scant. This study extends the univariate joint modeling of bulk\nand tail to the multivariate framework. Specifically, it pertains to cases\nwhere multivariate observations exceed a high threshold in at least one\ncomponent. We propose a multivariate mixture model that assumes a parametric\nmodel to capture the bulk of the distribution, which is in the max-domain of\nattraction (MDA) of a multivariate extreme value distribution (mGEVD). The tail\nis described by the multivariate generalized Pareto distribution, which is\nasymptotically justified to model multivariate threshold exceedances. We show\nthat if all components exceed the threshold, our mixture model is in the MDA of\nan mGEVD. Bayesian inference based on multivariate random-walk\nMetropolis-Hastings and the automated factor slice sampler allows us to\nincorporate uncertainty from the threshold selection easily. Due to\ncomputational limitations, simulations and data applications are provided for\ndimension $d=2$, but a discussion is provided with views toward scalability\nbased on pairwise likelihood."}, "http://arxiv.org/abs/2401.15730": {"title": "Statistical analysis and first-passage-time applications of a lognormal diffusion process with multi-sigmoidal logistic mean", "link": "http://arxiv.org/abs/2401.15730", "description": "We consider a lognormal diffusion process having a multisigmoidal logistic\nmean, useful to model the evolution of a population which reaches the maximum\nlevel of the growth after many stages. Referring to the problem of statistical\ninference, two procedures to find the maximum likelihood estimates of the\nunknown parameters are described. One is based on the resolution of the system\nof the critical points of the likelihood function, and the other on the\nmaximization of the likelihood function with the simulated annealing algorithm.\nA simulation study to validate the described strategies for finding the\nestimates is also presented, with a real application to epidemiological data.\nSpecial attention is also devoted to the first-passage-time problem of the\nconsidered diffusion process through a fixed boundary."}, "http://arxiv.org/abs/2401.15778": {"title": "On the partial autocorrelation function for locally stationary time series: characterization, estimation and inference", "link": "http://arxiv.org/abs/2401.15778", "description": "For stationary time series, it is common to use the plots of the partial\nautocorrelation function (PACF) or PACF-based tests to explore the temporal\ndependence structure of such processes. To the best of our knowledge, such analogs for\nnon-stationary time series have not been fully established yet. In this paper,\nwe fill this gap for locally stationary time series with short-range\ndependence. First, we characterize the PACF locally in the time domain and\nshow that the $j$th PACF, denoted as $\\rho_{j}(t),$ decays with $j$ whose rate\nis adaptive to the temporal dependence of the time series $\\{x_{i,n}\\}$.\nSecond, at time $i,$ we justify that the PACF $\\rho_j(i/n)$ can be efficiently\napproximated by the best linear prediction coefficients via the Yule-Walker\nequations. This allows us to study the PACF via ordinary least squares (OLS)\nlocally. Third, we show that the PACF is smooth in time for locally stationary\ntime series. 
We use the sieve method with OLS to estimate $\\rho_j(\\cdot)$ and\nconstruct some statistics to test the PACFs and infer the structures of the\ntime series. These tests generalize and modify those used for stationary time\nseries. Finally, a multiplier bootstrap algorithm is proposed for practical\nimplementation and an $\\mathtt R$ package $\\mathtt {Sie2nts}$ is provided to\nimplement our algorithm. Numerical simulations and real data analysis also\nconfirm the usefulness of our results."}, "http://arxiv.org/abs/2401.15793": {"title": "Doubly regularized generalized linear models for spatial observations with high-dimensional covariates", "link": "http://arxiv.org/abs/2401.15793", "description": "A discrete spatial lattice can be cast as a network structure over which\nspatially-correlated outcomes are observed. A second network structure may also\ncapture similarities among measured features, when such information is\navailable. Incorporating the network structures when analyzing such\ndoubly-structured data can improve predictive power, and lead to better\nidentification of important features in the data-generating process. Motivated\nby applications in spatial disease mapping, we develop a new doubly regularized\nregression framework to incorporate these network structures for analyzing\nhigh-dimensional datasets. Our estimators can easily be implemented with\nstandard convex optimization algorithms. In addition, we describe a procedure\nto obtain asymptotically valid confidence intervals and hypothesis tests for\nour model parameters. We show empirically that our framework provides improved\npredictive accuracy and inferential power compared to existing high-dimensional\nspatial methods. These advantages hold given fully accurate network\ninformation, and also with networks which are partially misspecified or\nuninformative. The application of the proposed method to modeling COVID-19\nmortality data suggests that it can improve prediction of deaths beyond\nstandard spatial models, and that it selects relevant covariates more often."}, "http://arxiv.org/abs/2401.15796": {"title": "High-Dimensional False Discovery Rate Control for Dependent Variables", "link": "http://arxiv.org/abs/2401.15796", "description": "Algorithms that ensure reproducible findings from large-scale,\nhigh-dimensional data are pivotal in numerous signal processing applications.\nIn recent years, multivariate false discovery rate (FDR) controlling methods\nhave emerged, providing guarantees even in high-dimensional settings where the\nnumber of variables surpasses the number of samples. However, these methods\noften fail to reliably control the FDR in the presence of highly dependent\nvariable groups, a common characteristic in fields such as genomics and\nfinance. To tackle this critical issue, we introduce a novel framework that\naccounts for general dependency structures. Our proposed dependency-aware T-Rex\nselector integrates hierarchical graphical models within the T-Rex framework to\neffectively harness the dependency structure among variables. Leveraging\nmartingale theory, we prove that our variable penalization mechanism ensures\nFDR control. We further generalize the FDR-controlling framework by stating and\nproving a clear condition necessary for designing both graphical and\nnon-graphical models that capture dependencies. 
Additionally, we formulate a\nfully integrated optimal calibration algorithm that concurrently determines the\nparameters of the graphical model and the T-Rex framework, such that the FDR is\ncontrolled while maximizing the number of selected variables. Numerical\nexperiments and a breast cancer survival analysis use-case demonstrate that the\nproposed method is the only one among the state-of-the-art benchmark methods\nthat controls the FDR and reliably detects genes that have been previously\nidentified to be related to breast cancer. An open-source implementation is\navailable within the R package TRexSelector on CRAN."}, "http://arxiv.org/abs/2401.15806": {"title": "Continuous-time structural failure time model for intermittent treatment", "link": "http://arxiv.org/abs/2401.15806", "description": "The intermittent intake of treatment is commonly seen in patients with\nchronic disease. For example, patients with atrial fibrillation may need to\ndiscontinue the oral anticoagulants when they experience a certain surgery and\nre-initiate the treatment after the surgery. As another example, patients may\nskip a few days before they refill a treatment as planned. This treatment\ndispensation information (i.e., the time at which a patient initiates and\nrefills a treatment) is recorded in the electronic healthcare records or claims\ndatabase, and each patient has a different treatment dispensation. Current\nmethods to estimate the effects of such treatments censor the patients who\nre-initiate the treatment, which results in information loss or biased\nestimation. In this work, we present methods to estimate the effect of\ntreatments on failure time outcomes by taking all the treatment dispensation\ninformation. The developed methods are based on the continuous-time structural\nfailure time model, where the dependent censoring is tackled by inverse\nprobability of censoring weighting. The estimators are doubly robust and\nlocally efficient."}, "http://arxiv.org/abs/2401.15811": {"title": "Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms", "link": "http://arxiv.org/abs/2401.15811", "description": "Two-sided platforms are central to modern commerce and content sharing and\noften utilize A/B testing for developing new features. While user-side\nexperiments are common, seller-side experiments become crucial for specific\ninterventions and metrics. This paper investigates the effects of interference\ncaused by feedback loops on seller-side experiments in two-sided platforms,\nwith a particular focus on the counterfactual interleaving design, proposed in\n\\citet{ha2020counterfactual,nandy2021b}. These feedback loops, often generated\nby pacing algorithms, cause outcomes from earlier sessions to influence\nsubsequent ones. This paper contributes by creating a mathematical framework to\nanalyze this interference, theoretically estimating its impact, and conducting\nempirical evaluations of the counterfactual interleaving design in real-world\nscenarios. Our research shows that feedback loops can result in misleading\nconclusions about the treatment effects."}, "http://arxiv.org/abs/2401.15903": {"title": "Toward the Identifiability of Comparative Deep Generative Models", "link": "http://arxiv.org/abs/2401.15903", "description": "Deep Generative Models (DGMs) are versatile tools for learning data\nrepresentations while adequately incorporating domain knowledge such as the\nspecification of conditional probability distributions. 
Recently proposed DGMs\ntackle the important task of comparing data sets from different sources. One\nsuch example is the setting of contrastive analysis that focuses on describing\npatterns that are enriched in a target data set compared to a background data\nset. The practical deployment of those models often assumes that DGMs naturally\ninfer interpretable and modular latent representations, which is known to be an\nissue in practice. Consequently, existing methods often rely on ad-hoc\nregularization schemes, although without any theoretical grounding. Here, we\npropose a theory of identifiability for comparative DGMs by extending recent\nadvances in the field of non-linear independent component analysis. We show\nthat, while these models lack identifiability across a general class of mixing\nfunctions, they surprisingly become identifiable when the mixing function is\npiece-wise affine (e.g., parameterized by a ReLU neural network). We also\ninvestigate the impact of model misspecification, and empirically show that\npreviously proposed regularization techniques for fitting comparative DGMs help\nwith identifiability when the number of latent variables is not known in\nadvance. Finally, we introduce a novel methodology for fitting comparative DGMs\nthat improves the treatment of multiple data sources via multi-objective\noptimization and that helps adjust the hyperparameter for the regularization in\nan interpretable manner, using constrained optimization. We empirically\nvalidate our theory and new methodology using simulated data as well as a\nrecent data set of genetic perturbations in cells profiled via single-cell RNA\nsequencing."}, "http://arxiv.org/abs/2401.16099": {"title": "A Ridgelet Approach to Poisson Denoising", "link": "http://arxiv.org/abs/2401.16099", "description": "This paper introduces a novel ridgelet transform-based method for Poisson\nimage denoising. Our work focuses on harnessing the Poisson noise's unique\nnon-additive and signal-dependent properties, distinguishing it from Gaussian\nnoise. The core of our approach is a new thresholding scheme informed by\ntheoretical insights into the ridgelet coefficients of Poisson-distributed\nimages and adaptive thresholding guided by Stein's method.\n\nWe verify our theoretical model through numerical experiments and demonstrate\nthe potential of ridgelet thresholding across assorted scenarios. Our findings\nrepresent a significant step in enhancing the understanding of Poisson noise\nand offer an effective denoising method for images corrupted with it."}, "http://arxiv.org/abs/2401.16286": {"title": "Robust Functional Data Analysis for Stochastic Evolution Equations in Infinite Dimensions", "link": "http://arxiv.org/abs/2401.16286", "description": "This article addresses the robust measurement of covariations in the context\nof solutions to stochastic evolution equations in Hilbert spaces using\nfunctional data analysis. For such equations, standard techniques for\nfunctional data based on cross-sectional covariances are often inadequate for\nidentifying statistically relevant random drivers and detecting outliers since\nthey overlook the interplay between cross-sectional and temporal structures.\nTherefore, we develop an estimation theory for the continuous quadratic\ncovariation of the latent random driver of the equation instead of a static\ncovariance of the observable solution process. 
We derive identifiability\nresults under weak conditions, establish rates of convergence and a central\nlimit theorem based on infill asymptotics, and provide long-time asymptotics\nfor estimation of a static covariation of the latent driver. Applied to term\nstructure data, our approach uncovers a fundamental alignment with scaling\nlimits of covariations of specific short-term trading strategies, and an\nempirical study detects several jumps and indicates high-dimensional and\ntime-varying covariations."}, "http://arxiv.org/abs/2401.16396": {"title": "Ovarian Cancer Diagnostics using Wavelet Packet Scaling Descriptors", "link": "http://arxiv.org/abs/2401.16396", "description": "Detecting early-stage ovarian cancer accurately and efficiently is crucial\nfor timely treatment. Various methods for early diagnosis have been explored,\nincluding a focus on features derived from protein mass spectra, but these tend\nto overlook the complex interplay across protein expression levels. We propose\nan innovative method to automate the search for diagnostic features in these\nspectra by analyzing their inherent scaling characteristics. We compare two\ntechniques for estimating the self-similarity in a signal using the scaling\nbehavior of its wavelet packet decomposition. The methods are applied to the\nmass spectra using a rolling window approach, yielding a collection of\nself-similarity indexes that capture protein interactions, potentially\nindicative of ovarian cancer. Then, the most discriminatory scaling descriptors\nfrom this collection are selected for use in classification algorithms. To\nassess their effectiveness for early diagnosis of ovarian cancer, the\ntechniques are applied to two datasets from the American National Cancer\nInstitute. Comparative evaluation against an existing wavelet-based method\nshows that one wavelet packet-based technique led to improved diagnostic\nperformance for one of the analyzed datasets (95.67% vs. 96.78% test accuracy,\nrespectively). This highlights the potential of wavelet packet-based methods to\ncapture novel diagnostic information related to ovarian cancer. This innovative\napproach offers promise for better early detection and improved patient\noutcomes in ovarian cancer."}, "http://arxiv.org/abs/2010.16271": {"title": "View selection in multi-view stacking: Choosing the meta-learner", "link": "http://arxiv.org/abs/2010.16271", "description": "Multi-view stacking is a framework for combining information from different\nviews (i.e. different feature sets) describing the same set of objects. In this\nframework, a base-learner algorithm is trained on each view separately, and\ntheir predictions are then combined by a meta-learner algorithm. In a previous\nstudy, stacked penalized logistic regression, a special case of multi-view\nstacking, has been shown to be useful in identifying which views are most\nimportant for prediction. In this article we expand this research by\nconsidering seven different algorithms to use as the meta-learner, and\nevaluating their view selection and classification performance in simulations\nand two applications on real gene-expression data sets. Our results suggest\nthat if both view selection and classification accuracy are important to the\nresearch at hand, then the nonnegative lasso, nonnegative adaptive lasso and\nnonnegative elastic net are suitable meta-learners. Exactly which among these\nthree is to be preferred depends on the research context. 
The remaining four\nmeta-learners, namely nonnegative ridge regression, nonnegative forward\nselection, stability selection and the interpolating predictor, show little\nadvantage over the other three."}, "http://arxiv.org/abs/2201.00409": {"title": "Global convergence of optimized adaptive importance samplers", "link": "http://arxiv.org/abs/2201.00409", "description": "We analyze the optimized adaptive importance sampler (OAIS) for performing\nMonte Carlo integration with general proposals. We leverage a classical result\nwhich shows that the bias and the mean-squared error (MSE) of the importance\nsampling scales with the $\\chi^2$-divergence between the target and the\nproposal and develop a scheme which performs global optimization of\n$\\chi^2$-divergence. While it is known that this quantity is convex for\nexponential family proposals, the case of the general proposals has been an\nopen problem. We close this gap by utilizing the nonasymptotic bounds for\nstochastic gradient Langevin dynamics (SGLD) for the global optimization of\n$\\chi^2$-divergence and derive nonasymptotic bounds for the MSE by leveraging\nrecent results from the non-convex optimization literature. The resulting AIS\nschemes have explicit theoretical guarantees that are uniform-in-time."}, "http://arxiv.org/abs/2205.13469": {"title": "Proximal Estimation and Inference", "link": "http://arxiv.org/abs/2205.13469", "description": "We build a unifying convex analysis framework characterizing the statistical\nproperties of a large class of penalized estimators, both under a regular and\nan irregular design. Our framework interprets penalized estimators as proximal\nestimators, defined by a proximal operator applied to a corresponding initial\nestimator. We characterize the asymptotic properties of proximal estimators,\nshowing that their asymptotic distribution follows a closed-form formula\ndepending only on (i) the asymptotic distribution of the initial estimator,\n(ii) the estimator's limit penalty subgradient and (iii) the inner product\ndefining the associated proximal operator. In parallel, we characterize the\nOracle features of proximal estimators from the properties of their penalty's\nsubgradients. We exploit our approach to systematically cover linear regression\nsettings with a regular or irregular design. For these settings, we build new\n$\\sqrt{n}-$consistent, asymptotically normal Ridgeless-type proximal\nestimators, which feature the Oracle property and are shown to perform\nsatisfactorily in practically relevant Monte Carlo settings."}, "http://arxiv.org/abs/2206.05581": {"title": "Federated Offline Reinforcement Learning", "link": "http://arxiv.org/abs/2206.05581", "description": "Evidence-based or data-driven dynamic treatment regimes are essential for\npersonalized medicine, which can benefit from offline reinforcement learning\n(RL). Although massive healthcare data are available across medical\ninstitutions, they are prohibited from being shared due to privacy constraints.\nBesides, heterogeneity exists in different sites. As a result, federated\noffline RL algorithms are necessary and promising to deal with the problems. In\nthis paper, we propose a multi-site Markov decision process model that allows\nfor both homogeneous and heterogeneous effects across sites. The proposed model\nmakes the analysis of the site-level features possible. 
We design the first\nfederated policy optimization algorithm for offline RL with sample complexity guarantees.\nThe proposed algorithm is communication-efficient, requiring only a single\nround of communication interaction by exchanging summary statistics. We give a\ntheoretical guarantee for the proposed algorithm, where the suboptimality for\nthe learned policies is comparable to the rate as if the data were not distributed.\nExtensive simulations demonstrate the effectiveness of the proposed algorithm.\nThe method is applied to a sepsis dataset in multiple sites to illustrate its\nuse in clinical settings."}, "http://arxiv.org/abs/2210.10171": {"title": "Doubly-robust and heteroscedasticity-aware sample trimming for causal inference", "link": "http://arxiv.org/abs/2210.10171", "description": "A popular method for variance reduction in observational causal inference is\npropensity-based trimming, the practice of removing units with extreme\npropensities from the sample. This practice has theoretical grounding when the\ndata are homoscedastic and the propensity model is parametric (Yang and Ding,\n2018; Crump et al. 2009), but in modern settings where heteroscedastic data are\nanalyzed with non-parametric models, existing theory fails to support current\npractice. In this work, we address this challenge by developing new methods and\ntheory for sample trimming. Our contributions are three-fold: first, we\ndescribe novel procedures for selecting which units to trim. Our procedures\ndiffer from previous work in that we trim not only units with small\npropensities, but also units with extreme conditional variances. Second, we\ngive new theoretical guarantees for inference after trimming. In particular, we\nshow how to perform inference on the trimmed subpopulation without requiring\nthat our regressions converge at parametric rates. Instead, we make only\nfourth-root rate assumptions like those in the double machine learning\nliterature. This result applies to conventional propensity-based trimming as\nwell and thus may be of independent interest. Finally, we propose a\nbootstrap-based method for constructing simultaneously valid confidence\nintervals for multiple trimmed sub-populations, which are valuable for\nnavigating the trade-off between sample size and variance reduction inherent in\ntrimming. We validate our methods in simulation, on the 2007-2008 National\nHealth and Nutrition Examination Survey, and on a semi-synthetic Medicare\ndataset and find promising results in all settings."}, "http://arxiv.org/abs/2211.16384": {"title": "Parameter Estimation with Increased Precision for Elliptic and Hypo-elliptic Diffusions", "link": "http://arxiv.org/abs/2211.16384", "description": "This work aims at making a comprehensive contribution in the general area of\nparametric inference for discretely observed diffusion processes. Established\napproaches for likelihood-based estimation invoke a time-discretisation scheme\nfor the approximation of the intractable transition dynamics of the Stochastic\nDifferential Equation (SDE) model over finite time periods. The scheme is\napplied for a step-size that is either user-selected or determined by the data.\nRecent research has highlighted the critical effect of the choice of numerical\nscheme on the behaviour of derived parameter estimates in the setting of\nhypo-elliptic SDEs. 
In brief, in our work, first, we develop two weak second\norder sampling schemes (to cover both hypo-elliptic and elliptic SDEs) and\nproduce a small time expansion for the density of the schemes to form a proxy\nfor the true intractable SDE transition density. Then, we establish a\ncollection of analytic results for likelihood-based parameter estimates\nobtained via the formed proxies, thus providing a theoretical framework that\nshowcases advantages from the use of the developed methodology for SDE\ncalibration. We present numerical results from carrying out classical or\nBayesian inference, for both elliptic and hypo-elliptic SDEs."}, "http://arxiv.org/abs/2212.09009": {"title": "Locally Simultaneous Inference", "link": "http://arxiv.org/abs/2212.09009", "description": "Selective inference is the problem of giving valid answers to statistical\nquestions chosen in a data-driven manner. A standard solution to selective\ninference is simultaneous inference, which delivers valid answers to the set of\nall questions that could possibly have been asked. However, simultaneous\ninference can be unnecessarily conservative if this set includes many questions\nthat were unlikely to be asked in the first place. We introduce a less\nconservative solution to selective inference that we call locally simultaneous\ninference, which only answers those questions that could plausibly have been\nasked in light of the observed data, all the while preserving rigorous type I\nerror guarantees. For example, if the objective is to construct a confidence\ninterval for the \"winning\" treatment effect in a clinical trial with multiple\ntreatments, and it is obvious in hindsight that only one treatment had a chance\nto win, then our approach will return an interval that is nearly the same as\nthe uncorrected, standard interval. Locally simultaneous inference is\nimplemented by refining any method for simultaneous inference of interest.\nUnder mild conditions satisfied by common confidence intervals, locally\nsimultaneous inference strictly dominates its underlying simultaneous inference\nmethod, meaning it can never yield less statistical power but only more.\nCompared to conditional selective inference, which demands stronger guarantees,\nlocally simultaneous inference is more easily applicable in nonparametric\nsettings and is more numerically stable."}, "http://arxiv.org/abs/2302.00230": {"title": "Revisiting the Effects of Maternal Education on Adolescents' Academic Performance: Doubly Robust Estimation in a Network-Based Observational Study", "link": "http://arxiv.org/abs/2302.00230", "description": "In many contexts, particularly when study subjects are adolescents, peer\neffects can invalidate typical statistical requirements in the data. For\ninstance, it is plausible that a student's academic performance is influenced\nboth by their own mother's educational level as well as that of their peers.\nSince the underlying social network is measured, the Add Health study provides\na unique opportunity to examine the impact of maternal college education on\nadolescent school performance, both direct and indirect. However, causal\ninference on populations embedded in social networks poses technical\nchallenges, since the typical no interference assumption no longer holds. While\ninverse probability-of-treatment weighted (IPW) estimators have been developed\nfor this setting, they are often highly unstable. 
Motivated by the question of\nmaternal education, we propose doubly robust (DR) estimators combining models\nfor treatment and outcome that are consistent and asymptotically normal if\neither model is correctly specified. We present empirical results that\nillustrate the DR property and the efficiency gain of DR over IPW estimators\neven when the treatment model is misspecified. Contrary to previous studies,\nour robust analysis does not provide evidence of an indirect effect of maternal\neducation on academic performance within adolescents' social circles in Add\nHealth."}, "http://arxiv.org/abs/2304.04374": {"title": "Partial Identification of Causal Effects Using Proxy Variables", "link": "http://arxiv.org/abs/2304.04374", "description": "Proximal causal inference is a recently proposed framework for evaluating\ncausal effects in the presence of unmeasured confounding. For point\nidentification of causal effects, it leverages a pair of so-called treatment\nand outcome confounding proxy variables, to identify a bridge function that\nmatches the dependence of potential outcomes or treatment variables on the\nhidden factors to corresponding functions of observed proxies. Unique\nidentification of a causal effect via a bridge function crucially requires that\nproxies are sufficiently relevant for hidden factors, a requirement that has\npreviously been formalized as a completeness condition. However, completeness\nis well-known not to be empirically testable, and although a bridge function\nmay be well-defined, lack of completeness, sometimes manifested by availability\nof a single type of proxy, may severely limit prospects for identification of a\nbridge function and thus a causal effect; therefore, potentially restricting\nthe application of the proximal causal framework. In this paper, we propose\npartial identification methods that do not require completeness and obviate the\nneed for identification of a bridge function. That is, we establish that\nproxies of unobserved confounders can be leveraged to obtain bounds on the\ncausal effect of the treatment on the outcome even if available information\ndoes not suffice to identify either a bridge function or a corresponding causal\neffect of interest. Our bounds are non-smooth functionals of the observed data\ndistribution. As a consequence, in the context of inference, we initially\nprovide a smooth approximation of our bounds. Subsequently, we leverage\nbootstrap confidence intervals on the approximated bounds. We further establish\nanalogous partial identification results in related settings where\nidentification hinges upon hidden mediators for which proxies are available."}, "http://arxiv.org/abs/2306.16858": {"title": "Methods for non-proportional hazards in clinical trials: A systematic review", "link": "http://arxiv.org/abs/2306.16858", "description": "For the analysis of time-to-event data, frequently used methods such as the\nlog-rank test or the Cox proportional hazards model are based on the\nproportional hazards assumption, which is often debatable. Although a wide\nrange of parametric and non-parametric methods for non-proportional hazards\n(NPH) has been proposed, there is no consensus on the best approaches. To close\nthis gap, we conducted a systematic literature search to identify statistical\nmethods and software appropriate under NPH. Our literature search identified\n907 abstracts, out of which we included 211 articles, mostly methodological\nones. Review articles and applications were less frequently identified. 
The\narticles discuss effect measures, effect estimation and regression approaches,\nhypothesis tests, and sample size calculation approaches, which are often\ntailored to specific NPH situations. Using a unified notation, we provide an\noverview of methods available. Furthermore, we derive some guidance from the\nidentified articles. We summarize the contents from the literature review in a\nconcise way in the main text and provide more detailed explanations in the\nsupplement."}, "http://arxiv.org/abs/2308.13827": {"title": "An exhaustive ADDIS principle for online FWER control", "link": "http://arxiv.org/abs/2308.13827", "description": "In this paper we consider online multiple testing with familywise error rate\n(FWER) control, where the probability of committing at least one type I error\nshall remain under control while testing a possibly infinite sequence of\nhypotheses over time. Currently, Adaptive-Discard (ADDIS) procedures seem to be\nthe most promising online procedures with FWER control in terms of power. Now,\nour main contribution is a uniform improvement of the ADDIS principle and thus\nof all ADDIS procedures. This means that the methods we propose reject at least as\nmany hypotheses as ADDIS procedures and in some cases even more, while\nmaintaining FWER control. In addition, we show that there is no other FWER\ncontrolling procedure that enlarges the event of rejecting any hypothesis.\nFinally, we apply the new principle to derive uniform improvements of the\nADDIS-Spending and ADDIS-Graph."}, "http://arxiv.org/abs/2401.16447": {"title": "The Hubbert diffusion process: Estimation via simulated annealing and variable neighborhood search procedures", "link": "http://arxiv.org/abs/2401.16447", "description": "Accurately charting the progress of oil production is a problem of great\ncurrent interest. Oil production is widely known to be cyclical: in any given\nsystem, after it reaches its peak, a decline will begin. With this in mind,\nMarion King Hubbert developed his peak theory in 1956 based on the bell-shaped\ncurve that bears his name. In the present work, we consider a stochastic model\nbased on the theory of diffusion processes and associated with the Hubbert\ncurve. The problem of the maximum likelihood estimation of the parameters for\nthis process is also considered. Since a complex system of equations appears,\nwith a solution that cannot be guaranteed by classical numerical procedures, we\nsuggest the use of metaheuristic optimization algorithms such as simulated\nannealing and variable neighborhood search. Some strategies are suggested for\nbounding the space of solutions, and a description is provided for the\napplication of the algorithms selected. In the case of the variable\nneighborhood search algorithm, a hybrid method is proposed in which it is\ncombined with simulated annealing. In order to validate the theory developed\nhere, we also carry out some studies based on simulated data and consider two\nreal crude oil production scenarios from Norway and Kazakhstan."}, "http://arxiv.org/abs/2401.16563": {"title": "Topological Detection of Phenomenological Bifurcations with Unreliable Kernel Densities", "link": "http://arxiv.org/abs/2401.16563", "description": "Phenomenological (P-type) bifurcations are qualitative changes in stochastic\ndynamical systems whereby the stationary probability density function (PDF)\nchanges its topology. 
The current state of the art for detecting these\nbifurcations requires reliable kernel density estimates computed from an\nensemble of system realizations. However, in several real world signals such as\nBig Data, only a single system realization is available -- making it impossible\nto estimate a reliable kernel density. This study presents an approach for\ndetecting P-type bifurcations using unreliable density estimates. The approach\ncreates an ensemble of objects from Topological Data Analysis (TDA) called\npersistence diagrams from the system's sole realization and statistically\nanalyzes the resulting set. We compare several methods for replicating the\noriginal persistence diagram including Gibbs point process modelling, Pairwise\nInteraction Point Modelling, and subsampling. We show that for the purpose of\npredicting a bifurcation, the simple method of subsampling exceeds the other\ntwo methods of point process modelling in performance."}, "http://arxiv.org/abs/2401.16567": {"title": "Parallel Affine Transformation Tuning of Markov Chain Monte Carlo", "link": "http://arxiv.org/abs/2401.16567", "description": "The performance of Markov chain Monte Carlo samplers strongly depends on the\nproperties of the target distribution such as its covariance structure, the\nlocation of its probability mass and its tail behavior. We explore the use of\nbijective affine transformations of the sample space to improve the properties\nof the target distribution and thereby the performance of samplers running in\nthe transformed space. In particular, we propose a flexible and user-friendly\nscheme for adaptively learning the affine transformation during sampling.\nMoreover, the combination of our scheme with Gibbsian polar slice sampling is\nshown to produce samples of high quality at comparatively low computational\ncost in several settings based on real-world data."}, "http://arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "http://arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in\nprecision medicine. Specific interests lie in identifying the differential\neffect of different treatments based on some external covariates. We propose a\nnovel non-parametric treatment effect estimation method in a multi-treatment\nsetting. Our non-parametric modeling of the response curves relies on radial\nbasis function (RBF)-nets with shared hidden neurons. Our model thus\nfacilitates modeling commonality among the treatment outcomes. The estimation\nand inference schemes are developed under a Bayesian framework and implemented\nvia an efficient Markov chain Monte Carlo algorithm, appropriately\naccommodating uncertainty in all aspects of the analysis. The numerical\nperformance of the method is demonstrated through simulation experiments.\nApplying our proposed method to MIMIC data, we obtain several interesting\nfindings related to the impact of different treatment strategies on the length\nof ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "http://arxiv.org/abs/2401.16596": {"title": "PrIsing: Privacy-Preserving Peer Effect Estimation via Ising Model", "link": "http://arxiv.org/abs/2401.16596", "description": "The Ising model, originally developed as a spin-glass model for ferromagnetic\nelements, has gained popularity as a network-based model for capturing\ndependencies in agents' outputs. 
Its increasing adoption in healthcare and the\nsocial sciences has raised privacy concerns regarding the confidentiality of\nagents' responses. In this paper, we present a novel\n$(\\varepsilon,\\delta)$-differentially private algorithm specifically designed\nto protect the privacy of individual agents' outcomes. Our algorithm allows for\nprecise estimation of the natural parameter using a single network through an\nobjective perturbation technique. Furthermore, we establish regret bounds for\nthis algorithm and assess its performance on synthetic datasets and two\nreal-world networks: one involving HIV status in a social network and the other\nconcerning the political leaning of online blogs."}, "http://arxiv.org/abs/2401.16598": {"title": "Probabilistic Context Neighborhood Model for Lattices", "link": "http://arxiv.org/abs/2401.16598", "description": "We present the Probabilistic Context Neighborhood model designed for\ntwo-dimensional lattices as a variation of a Markov Random Field assuming\ndiscrete values. In this model, the neighborhood structure has a fixed geometry\nbut a variable order, depending on the neighbors' values. Our model extends the\nProbabilistic Context Tree model, originally applicable to one-dimensional\nspace. It retains advantageous properties, such as representing the dependence\nneighborhood structure as a graph in a tree format, facilitating an\nunderstanding of model complexity. Furthermore, we adapt the algorithm used to\nestimate the Probabilistic Context Tree to estimate the parameters of the\nproposed model. We illustrate the accuracy of our estimation methodology\nthrough simulation studies. Additionally, we apply the Probabilistic Context\nNeighborhood model to spatial real-world data, showcasing its practical\nutility."}, "http://arxiv.org/abs/2401.16651": {"title": "A constructive approach to selective risk control", "link": "http://arxiv.org/abs/2401.16651", "description": "Many modern applications require the use of data to both select the\nstatistical tasks and make valid inference after selection. In this article, we\nprovide a unifying approach to control for a class of selective risks. Our\nmethod is motivated by a reformulation of the celebrated Benjamini-Hochberg\n(BH) procedure for multiple hypothesis testing as the iterative limit of the\nBenjamini-Yekutieli (BY) procedure for constructing post-selection confidence\nintervals. Although several earlier authors have made noteworthy observations\nrelated to this, our discussion highlights that (1) the BH procedure is\nprecisely the fixed-point iteration of the BY procedure; (2) the fact that the\nBH procedure controls the false discovery rate is almost an immediate corollary\nof the fact that the BY procedure controls the false coverage-statement rate.\nBuilding on this observation, we propose a constructive approach to control\nextra-selection risk (selection made after decision) by iterating decision\nstrategies that control the post-selection risk (decision made after\nselection), and show that many previous methods and results are special cases\nof this general framework. We further extend this approach to problems with\nmultiple selective risks and demonstrate how new methods can be developed. 
Our\ndevelopment leads to two surprising results about the BH procedure: (1) in the\ncontext of one-sided location testing, the BH procedure not only controls the\nfalse discovery rate at the null but also at other locations for free; (2) in\nthe context of permutation tests, the BH procedure with exact permutation\np-values can be well approximated by a procedure which only requires a total\nnumber of permutations that is almost linear in the total number of hypotheses."}, "http://arxiv.org/abs/2401.16660": {"title": "A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information", "link": "http://arxiv.org/abs/2401.16660", "description": "The effective sample size (ESS) measures the informational value of a\nprobability distribution in terms of an equivalent number of study\nparticipants. The ESS plays a crucial role in estimating the Expected Value of\nSample Information (EVSI) through the Gaussian approximation approach. Despite\nthe significance of ESS, existing ESS estimation methods within the Gaussian\napproximation framework are either computationally expensive or potentially\ninaccurate. To address these limitations, we propose a novel approach that\nestimates the ESS using the summary statistics of generated datasets and\nnonparametric regression methods. The simulation results suggest that the\nproposed method provides accurate ESS estimates at a low computational cost,\nmaking it an efficient and practical way to quantify the information contained\nin the probability distribution of a parameter. Overall, determining the ESS\ncan help analysts understand the uncertainty levels in complex prior\ndistributions in the probability analyses of decision models and perform\nefficient EVSI calculations."}, "http://arxiv.org/abs/2401.16749": {"title": "Bayesian scalar-on-network regression with applications to brain functional connectivity", "link": "http://arxiv.org/abs/2401.16749", "description": "This study presents a Bayesian regression framework to model the relationship\nbetween scalar outcomes and brain functional connectivity represented as\nsymmetric positive definite (SPD) matrices. Unlike many proposals that simply\nvectorize the connectivity predictors thereby ignoring their matrix structures,\nour method respects the Riemannian geometry of SPD matrices by modelling them\nin a tangent space. We perform dimension reduction in the tangent space,\nrelating the resulting low-dimensional representations with the responses. The\ndimension reduction matrix is learnt in a supervised manner with a\nsparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting.\nOur method yields a parsimonious regression model that allows uncertainty\nquantification of the estimates and identification of key brain regions that\npredict the outcomes. We demonstrate the performance of our approach in\nsimulation settings and through a case study to predict Picture Vocabulary\nscores using data from the Human Connectome Project."}, "http://arxiv.org/abs/2401.16828": {"title": "Simulating signed mixtures", "link": "http://arxiv.org/abs/2401.16828", "description": "Simulating mixtures of distributions with signed weights proves a challenge\nas standard simulation algorithms are inefficient in handling the negative\nweights. In particular, the natural representation of mixture variates as\nassociated with latent component indicators is no longer available. 
We propose\nhere an exact accept-reject algorithm in the general case of finite signed\nmixtures that relies on optimaly pairing positive and negative components and\ndesigning a stratified sampling scheme on pairs. We analyze the performances of\nour approach, relative to the inverse cdf approach, since the cdf of the\ndistribution remains available for standard signed mixtures."}, "http://arxiv.org/abs/2401.16943": {"title": "Dynamical System Identification, Model Selection and Model Uncertainty Quantification by Bayesian Inference", "link": "http://arxiv.org/abs/2401.16943", "description": "This study presents a Bayesian maximum \\textit{a~posteriori} (MAP) framework\nfor dynamical system identification from time-series data. This is shown to be\nequivalent to a generalized zeroth-order Tikhonov regularization, providing a\nrational justification for the choice of the residual and regularization terms,\nrespectively, from the negative logarithms of the likelihood and prior\ndistributions. In addition to the estimation of model coefficients, the\nBayesian interpretation gives access to the full apparatus for Bayesian\ninference, including the ranking of models, the quantification of model\nuncertainties and the estimation of unknown (nuisance) hyperparameters. Two\nBayesian algorithms, joint maximum \\textit{a~posteriori} (JMAP) and variational\nBayesian approximation (VBA), are compared to the popular SINDy algorithm for\nthresholded least-squares regression, by application to several dynamical\nsystems with added noise. For multivariate Gaussian likelihood and prior\ndistributions, the Bayesian formulation gives Gaussian posterior and evidence\ndistributions, in which the numerator terms can be expressed in terms of the\nMahalanobis distance or ``Gaussian norm'' $||\\vy-\\hat{\\vy}||^2_{M^{-1}} =\n(\\vy-\\hat{\\vy})^\\top {M^{-1}} (\\vy-\\hat{\\vy})$, where $\\vy$ is a vector\nvariable, $\\hat{\\vy}$ is its estimator and $M$ is the covariance matrix. The\nposterior Gaussian norm is shown to provide a robust metric for quantitative\nmodel selection."}, "http://arxiv.org/abs/2401.16954": {"title": "Nonparametric latency estimation for mixture cure models", "link": "http://arxiv.org/abs/2401.16954", "description": "A nonparametric latency estimator for mixture cure models is studied in this\npaper. An i.i.d. representation is obtained, the asymptotic mean squared error\nof the latency estimator is found, and its asymptotic normality is proven. A\nbootstrap bandwidth selection method is introduced and its efficiency is\nevaluated in a simulation study. The proposed methods are applied to a dataset\nof colorectal cancer patients in the University Hospital of A Coru\\~na (CHUAC)."}, "http://arxiv.org/abs/2401.16990": {"title": "Recovering target causal effects from post-exposure selection induced by missing outcome data", "link": "http://arxiv.org/abs/2401.16990", "description": "Confounding bias and selection bias are two significant challenges to the\nvalidity of conclusions drawn from applied causal inference. The latter can\narise through informative missingness, wherein relevant information about units\nin the target population is missing, censored, or coarsened due to factors\nrelated to the exposure, the outcome, or their consequences. We extend existing\ngraphical criteria to address selection bias induced by missing outcome data by\nleveraging post-exposure variables. 
We introduce the Sequential Adjustment\nCriteria (SAC), which support recovering causal effects through sequential\nregressions. A refined estimator is further developed by applying Targeted\nMinimum-Loss Estimation (TMLE). Under certain regularity conditions, this\nestimator is multiply-robust, ensuring consistency even in scenarios where the\nInverse Probability Weighting (IPW) and the sequential regressions approaches\nfall short. A simulation exercise featuring various toy scenarios compares the\nrelative bias and robustness of the two proposed solutions against other\nestimators. As a motivating application case, we study the effects of\npharmacological treatment for Attention-Deficit/Hyperactivity Disorder (ADHD)\nupon the scores obtained by diagnosed Norwegian schoolchildren in national\ntests using observational data ($n=9\\,352$). Our findings support the\naccumulated clinical evidence affirming a positive but small effect of\nstimulant medication on school performance. A small positive selection bias was\nidentified, indicating that the treatment effect may be even more modest for\nthose exempted or abstained from the tests."}, "http://arxiv.org/abs/2401.17008": {"title": "A Unified Three-State Model Framework for Analysis of Treatment Crossover in Survival Trials", "link": "http://arxiv.org/abs/2401.17008", "description": "We present a unified three-state model (TSM) framework for evaluating\ntreatment effects in clinical trials in the presence of treatment crossover.\nResearchers have proposed diverse methodologies to estimate the treatment\neffect that would have hypothetically been observed if treatment crossover had\nnot occurred. However, there is little work on understanding the connections\nbetween these different approaches from a statistical point of view. Our\nproposed TSM framework unifies existing methods, effectively identifying\npotential biases, model assumptions, and inherent limitations for each method.\nThis can guide researchers in understanding when these methods are appropriate\nand choosing a suitable approach for their data. The TSM framework also\nfacilitates the creation of new methods to adjust for confounding effects from\ntreatment crossover. To illustrate this capability, we introduce a new\nimputation method that falls under its scope. Using a piecewise constant prior\nfor the hazard, our proposed method directly estimates the hazard function with\nincreased flexibility. Through simulation experiments, we demonstrate the\nperformance of different approaches for estimating the treatment effects."}, "http://arxiv.org/abs/2401.17041": {"title": "Gower's similarity coefficients with automatic weight selection", "link": "http://arxiv.org/abs/2401.17041", "description": "Nearest-neighbor methods have become popular in statistics and play a key\nrole in statistical learning. Important decisions in nearest-neighbor methods\nconcern the variables to use (when many potential candidates exist) and how to\nmeasure the dissimilarity between units. The first decision depends on the\nscope of the application while the second depends mainly on the type of variables.\nUnfortunately, relatively few options permit handling mixed-type variables, a\nsituation frequently encountered in practical applications. The most popular\ndissimilarity for mixed-type variables is derived as the complement to one of\nGower's similarity coefficient. 
It is appealing because it ranges between 0\nand 1, is an average of the scaled dissimilarities calculated variable by\nvariable, handles missing values, and allows for a user-defined weighting scheme\nwhen averaging dissimilarities. The discussion on the weighting schemes is\nsometimes misleading since it often ignores that the unweighted \"standard\"\nsetting hides an unbalanced contribution of the single variables to the overall\ndissimilarity. We address this drawback following the recent idea of\nintroducing a weighting scheme that minimizes the differences in the\ncorrelation between each contributing dissimilarity and the resulting weighted\nGower's dissimilarity. In particular, this note proposes different approaches\nfor measuring the correlation depending on the type of variables. The\nperformances of the proposed approaches are evaluated in simulation studies\nrelated to classification and imputation of missing values."}, "http://arxiv.org/abs/2401.17110": {"title": "Nonparametric covariate hypothesis tests for the cure rate in mixture cure models", "link": "http://arxiv.org/abs/2401.17110", "description": "In lifetime data, like cancer studies, there may be long-term survivors, which\nlead to heavy censoring at the end of the follow-up period. Since a standard\nsurvival model is not appropriate to handle these data, a cure model is needed.\nIn the literature, covariate hypothesis tests for cure models are limited to\nparametric and semiparametric methods. We fill this important gap by proposing a\nnonparametric covariate hypothesis test for the probability of cure in mixture\ncure models. A bootstrap method is proposed to approximate the null\ndistribution of the test statistic. The procedure can be applied to any type of\ncovariate, and could be extended to the multivariate setting. Its efficiency is\nevaluated in a Monte Carlo simulation study. Finally, the method is applied to\na colorectal cancer dataset."}, "http://arxiv.org/abs/2401.17143": {"title": "Test for high-dimensional mean vectors via the weighted $L_2$-norm", "link": "http://arxiv.org/abs/2401.17143", "description": "In this paper, we propose a novel approach to test the equality of\nhigh-dimensional mean vectors of several populations via the weighted\n$L_2$-norm. We establish the asymptotic normality of the test statistics under\nthe null hypothesis. We also explain theoretically why our test statistics can\nbe highly useful in weakly dense cases when the nonzero signal in mean vectors\nis present. Furthermore, we compare the proposed test with existing tests using\nsimulation results, demonstrating that the weighted $L_2$-norm-based test\nstatistic exhibits favorable properties in terms of both size and power."}, "http://arxiv.org/abs/2401.17152": {"title": "Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models", "link": "http://arxiv.org/abs/2401.17152", "description": "A completely nonparametric method for the estimation of mixture cure models\nis proposed. A nonparametric estimator of the incidence is extensively studied\nand a nonparametric estimator of the latency is presented. These estimators,\nwhich are based on the Beran estimator of the conditional survival function,\nare proved to be the local maximum likelihood estimators. An i.i.d.\nrepresentation is obtained for the nonparametric incidence estimator. As a\nconsequence, an asymptotically optimal bandwidth is found. 
Moreover, a\nbootstrap bandwidth selection method for the nonparametric incidence estimator\nis proposed. The introduced nonparametric estimators are compared with existing\nsemiparametric approaches in a simulation study, in which the performance of\nthe bootstrap bandwidth selector is also assessed. Finally, the method is\napplied to a database of colorectal cancer from the University Hospital of A\nCoru\\~na (CHUAC)."}, "http://arxiv.org/abs/2401.17249": {"title": "Joint model with latent disease age: overcoming the need for reference time", "link": "http://arxiv.org/abs/2401.17249", "description": "Introduction: Heterogeneity of the progression of neurodegenerative diseases\nis one of the main challenges faced in developing effective therapies. With the\nincreasing number of large clinical databases, disease progression models have\nled to a better understanding of this heterogeneity. Nevertheless, these\ndiseases may have no clear onset and biological underlying processes may start\nbefore the first symptoms. Such an ill-defined disease reference time is an\nissue for current joint models, which have proven their effectiveness by\ncombining longitudinal and survival data. Objective In this work, we propose a\njoint non-linear mixed effect model with a latent disease age, to overcome this\nneed for a precise reference time.\n\nMethod: To do so, we utilized an existing longitudinal model with a latent\ndisease age as a longitudinal sub-model and associated it with a survival\nsub-model that estimates a Weibull distribution from the latent disease age. We\nthen validated our model on different simulated scenarios. Finally, we\nbenchmarked our model with a state-of-the-art joint model and reference\nsurvival and longitudinal models on simulated and real data in the context of\nAmyotrophic Lateral Sclerosis (ALS).\n\nResults: On real data, our model got significantly better results than the\nstate-of-the-art joint model for absolute bias (4.21(4.41) versus\n4.24(4.14)(p-value=1.4e-17)), and mean cumulative AUC for right censored events\n(0.67(0.07) versus 0.61(0.09)(p-value=1.7e-03)).\n\nConclusion: We showed that our approach is better suited than the\nstate-of-the-art in the context where the reference time is not reliable. This\nwork opens up the perspective to design predictive and personalized therapeutic\nstrategies."}, "http://arxiv.org/abs/2204.07105": {"title": "Nonresponse Bias Analysis in Longitudinal Studies: A Comparative Review with an Application to the Early Childhood Longitudinal Study", "link": "http://arxiv.org/abs/2204.07105", "description": "Longitudinal studies are subject to nonresponse when individuals fail to\nprovide data for entire waves or particular questions of the survey. We compare\napproaches to nonresponse bias analysis (NRBA) in longitudinal studies and\nillustrate them on the Early Childhood Longitudinal Study, Kindergarten Class\nof 2010-11 (ECLS-K:2011). Wave nonresponse with attrition often yields a\nmonotone missingness pattern, and the missingness mechanism can be missing at\nrandom (MAR) or missing not at random (MNAR). We discuss weighting, multiple\nimputation (MI), incomplete data modeling, and Bayesian approaches to NRBA for\nmonotone patterns. Weighting adjustments are effective when the constructed\nweights are correlated to the survey outcome of interest. MI allows for\nvariables with missing values to be included in the imputation model, yielding\npotentially less biased and more efficient estimates. 
Multilevel models with\nmaximum likelihood estimation and marginal models estimated using generalized\nestimating equations can also handle incomplete longitudinal data. Bayesian\nmethods introduce prior information and potentially stabilize model estimation.\nWe add offsets in the MAR results to provide sensitivity analyses to assess\nMNAR deviations. We conduct NRBA for descriptive summaries and analytic model\nestimates and find that in the ECLS-K:2011 application, NRBA yields minor\nchanges to the substantive conclusions. The strength of evidence about our NRBA\ndepends on the strength of the relationship between the characteristics in the\nnonresponse adjustment and the key survey outcomes, so the key to a successful\nNRBA is to include strong predictors."}, "http://arxiv.org/abs/2206.01900": {"title": "Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios", "link": "http://arxiv.org/abs/2206.01900", "description": "Evaluation of intervention in a multi-agent system, e.g., when humans should\nintervene in autonomous driving systems and when a player should pass to\nteammates for a good shot, is challenging in various engineering and scientific\nfields. Estimating the individual treatment effect (ITE) using counterfactual\nlong-term prediction is practical to evaluate such interventions. However, most\nof the conventional frameworks did not consider the time-varying complex\nstructure of multi-agent relationships and covariate counterfactual prediction.\nThis may lead to erroneous assessments of ITE and difficulty in interpretation.\nHere we propose an interpretable, counterfactual recurrent network in\nmulti-agent systems to estimate the effect of the intervention. Our model\nleverages graph variational recurrent neural networks and theory-based\ncomputation with domain knowledge for the ITE estimation framework based on\nlong-term prediction of multi-agent covariates and outcomes, which can confirm\nthe circumstances under which the intervention is effective. On simulated\nmodels of an automated vehicle and biological agents with time-varying\nconfounders, we show that our methods achieved lower estimation errors in\ncounterfactual covariates and the most effective treatment timing than the\nbaselines. Furthermore, using real basketball data, our methods performed\nrealistic counterfactual predictions and evaluated the counterfactual passes in\nshot scenarios."}, "http://arxiv.org/abs/2210.06934": {"title": "On the potential benefits of entropic regularization for smoothing Wasserstein estimators", "link": "http://arxiv.org/abs/2210.06934", "description": "This paper is focused on the study of entropic regularization in optimal\ntransport as a smoothing method for Wasserstein estimators, through the prism\nof the classical tradeoff between approximation and estimation errors in\nstatistics. Wasserstein estimators are defined as solutions of variational\nproblems whose objective function involves the use of an optimal transport cost\nbetween probability measures. Such estimators can be regularized by replacing\nthe optimal transport cost by its regularized version using an entropy penalty\non the transport plan. The use of such a regularization has a potentially\nsignificant smoothing effect on the resulting estimators. In this work, we\ninvestigate its potential benefits on the approximation and estimation\nproperties of regularized Wasserstein estimators. 
Our main contribution is to\ndiscuss how entropic regularization may reach, at a lower computational cost,\nstatistical performances that are comparable to those of un-regularized\nWasserstein estimators in statistical learning problems involving\ndistributional data analysis. To this end, we present new theoretical results\non the convergence of regularized Wasserstein estimators. We also study their\nnumerical performances using simulated and real data in the supervised learning\nproblem of proportions estimation in mixture models using optimal transport."}, "http://arxiv.org/abs/2304.09157": {"title": "Neural networks for geospatial data", "link": "http://arxiv.org/abs/2304.09157", "description": "Analysis of geospatial data has traditionally been model-based, with a mean\nmodel, customarily specified as a linear regression on the covariates, and a\ncovariance model, encoding the spatial dependence. We relax the strong\nassumption of linearity and propose embedding neural networks directly within\nthe traditional geostatistical models to accommodate non-linear mean functions\nwhile retaining all other advantages including use of Gaussian Processes to\nexplicitly model the spatial covariance, enabling inference on the covariate\neffect through the mean and on the spatial dependence through the covariance,\nand offering predictions at new locations via kriging. We propose NN-GLS, a new\nneural network estimation algorithm for the non-linear mean in GP models that\nexplicitly accounts for the spatial covariance through generalized least\nsquares (GLS), the same loss used in the linear case. We show that NN-GLS\nadmits a representation as a special type of graph neural network (GNN). This\nconnection facilitates use of standard neural network computational techniques\nfor irregular geospatial data, enabling novel and scalable mini-batching,\nbackpropagation, and kriging schemes. Theoretically, we show that NN-GLS will\nbe consistent for irregularly observed spatially correlated data processes. To\nour knowledge this is the first asymptotic consistency result for any neural\nnetwork algorithm for spatial data. We demonstrate the methodology through\nsimulated and real datasets."}, "http://arxiv.org/abs/2304.14604": {"title": "Deep Neural-network Prior for Orbit Recovery from Method of Moments", "link": "http://arxiv.org/abs/2304.14604", "description": "Orbit recovery problems are a class of problems that often arise in practice\nand various forms. In these problems, we aim to estimate an unknown function\nafter being distorted by a group action and observed via a known operator.\nTypically, the observations are contaminated with a non-trivial level of noise.\nTwo particular orbit recovery problems of interest in this paper are\nmultireference alignment and single-particle cryo-EM modelling. In order to\nsuppress the noise, we suggest using the method of moments approach for both\nproblems while introducing deep neural network priors. In particular, our\nneural networks should output the signals and the distribution of group\nelements, with moments being the input. In the multireference alignment case,\nwe demonstrate the advantage of using the NN to accelerate the convergence for\nthe reconstruction of signals from the moments. 
Finally, we use our method to\nreconstruct simulated and biological volumes in the cryo-EM setting."}, "http://arxiv.org/abs/2306.13538": {"title": "A brief introduction on latent variable based ordinal regression models with an application to survey data", "link": "http://arxiv.org/abs/2306.13538", "description": "The analysis of survey data is a frequently arising issue in clinical trials,\nparticularly when capturing quantities which are difficult to measure. Typical\nexamples are questionnaires about patient's well-being, pain, or consent to an\nintervention. In these, data is captured on a discrete scale containing only a\nlimited number of possible answers, from which the respondent has to pick the\nanswer which fits best his/her personal opinion. This data is generally located\non an ordinal scale as answers can usually be arranged in an ascending order,\ne.g., \"bad\", \"neutral\", \"good\" for well-being. Since responses are usually\nstored numerically for data processing purposes, analysis of survey data using\nordinary linear regression models are commonly applied. However, assumptions of\nthese models are often not met as linear regression requires a constant\nvariability of the response variable and can yield predictions out of the range\nof response categories. By using linear models, one only gains insights about\nthe mean response which may affect representativeness. In contrast, ordinal\nregression models can provide probability estimates for all response categories\nand yield information about the full response scale beyond the mean. In this\nwork, we provide a concise overview of the fundamentals of latent variable\nbased ordinal models, applications to a real data set, and outline the use of\nstate-of-the-art-software for this purpose. Moreover, we discuss strengths,\nlimitations and typical pitfalls. This is a companion work to a current\nvignette-based structured interview study in paediatric anaesthesia."}, "https://arxiv.org/abs/2401.16447": {"title": "The Hubbert diffusion process: Estimation via simulated annealing and variable neighborhood search procedures", "link": "https://arxiv.org/abs/2401.16447", "description": "Accurately charting the progress of oil production is a problem of great current interest. Oil production is widely known to be cyclical: in any given system, after it reaches its peak, a decline will begin. With this in mind, Marion King Hubbert developed his peak theory in 1956 based on the bell-shaped curve that bears his name. In the present work, we consider a stochasticmodel based on the theory of diffusion processes and associated with the Hubbert curve. The problem of the maximum likelihood estimation of the parameters for this process is also considered. Since a complex system of equations appears, with a solution that cannot be guaranteed by classical numerical procedures, we suggest the use of metaheuristic optimization algorithms such as simulated annealing and variable neighborhood search. Some strategies are suggested for bounding the space of solutions, and a description is provided for the application of the algorithms selected. In the case of the variable neighborhood search algorithm, a hybrid method is proposed in which it is combined with simulated annealing. 
In order to validate the theory developed here, we also carry out some studies based on simulated data and consider 2 real crude oil production scenarios from Norway and Kazakhstan."}, "https://arxiv.org/abs/2401.16567": {"title": "Parallel Affine Transformation Tuning of Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2401.16567", "description": "The performance of Markov chain Monte Carlo samplers strongly depends on the properties of the target distribution such as its covariance structure, the location of its probability mass and its tail behavior. We explore the use of bijective affine transformations of the sample space to improve the properties of the target distribution and thereby the performance of samplers running in the transformed space. In particular, we propose a flexible and user-friendly scheme for adaptively learning the affine transformation during sampling. Moreover, the combination of our scheme with Gibbsian polar slice sampling is shown to produce samples of high quality at comparatively low computational cost in several settings based on real-world data."}, "https://arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "https://arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in precision medicine. Specific interests lie in identifying the differential effect of different treatments based on some external covariates. We propose a novel non-parametric treatment effect estimation method in a multi-treatment setting. Our non-parametric modeling of the response curves relies on radial basis function (RBF)-nets with shared hidden neurons. Our model thus facilitates modeling commonality among the treatment outcomes. The estimation and inference schemes are developed under a Bayesian framework and implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of the analysis. The numerical performance of the method is demonstrated through simulation experiments. Applying our proposed method to MIMIC data, we obtain several interesting findings related to the impact of different treatment strategies on the length of ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "https://arxiv.org/abs/2401.16596": {"title": "PrIsing: Privacy-Preserving Peer Effect Estimation via Ising Model", "link": "https://arxiv.org/abs/2401.16596", "description": "The Ising model, originally developed as a spin-glass model for ferromagnetic elements, has gained popularity as a network-based model for capturing dependencies in agents' outputs. Its increasing adoption in healthcare and the social sciences has raised privacy concerns regarding the confidentiality of agents' responses. In this paper, we present a novel $(\\varepsilon,\\delta)$-differentially private algorithm specifically designed to protect the privacy of individual agents' outcomes. Our algorithm allows for precise estimation of the natural parameter using a single network through an objective perturbation technique. 
Furthermore, we establish regret bounds for this algorithm and assess its performance on synthetic datasets and two real-world networks: one involving HIV status in a social network and the other concerning the political leaning of online blogs."}, "https://arxiv.org/abs/2401.16598": {"title": "Probabilistic Context Neighborhood Model for Lattices", "link": "https://arxiv.org/abs/2401.16598", "description": "We present the Probabilistic Context Neighborhood model designed for two-dimensional lattices as a variation of a Markov Random Field assuming discrete values. In this model, the neighborhood structure has a fixed geometry but a variable order, depending on the neighbors' values. Our model extends the Probabilistic Context Tree model, originally applicable to one-dimensional space. It retains advantageous properties, such as representing the dependence neighborhood structure as a graph in a tree format, facilitating an understanding of model complexity. Furthermore, we adapt the algorithm used to estimate the Probabilistic Context Tree to estimate the parameters of the proposed model. We illustrate the accuracy of our estimation methodology through simulation studies. Additionally, we apply the Probabilistic Context Neighborhood model to spatial real-world data, showcasing its practical utility."}, "https://arxiv.org/abs/2401.16651": {"title": "A constructive approach to selective risk control", "link": "https://arxiv.org/abs/2401.16651", "description": "Many modern applications require the use of data to both select the statistical tasks and make valid inference after selection. In this article, we provide a unifying approach to control for a class of selective risks. Our method is motivated by a reformulation of the celebrated Benjamini-Hochberg (BH) procedure for multiple hypothesis testing as the iterative limit of the Benjamini-Yekutieli (BY) procedure for constructing post-selection confidence intervals. Although several earlier authors have made noteworthy observations related to this, our discussion highlights that (1) the BH procedure is precisely the fixed-point iteration of the BY procedure; (2) the fact that the BH procedure controls the false discovery rate is almost an immediate corollary of the fact that the BY procedure controls the false coverage-statement rate. Building on this observation, we propose a constructive approach to control extra-selection risk (selection made after decision) by iterating decision strategies that control the post-selection risk (decision made after selection), and show that many previous methods and results are special cases of this general framework. We further extend this approach to problems with multiple selective risks and demonstrate how new methods can be developed. 
Our development leads to two surprising results about the BH procedure: (1) in the context of one-sided location testing, the BH procedure not only controls the false discovery rate at the null but also at other locations for free; (2) in the context of permutation tests, the BH procedure with exact permutation p-values can be well approximated by a procedure which only requires a total number of permutations that is almost linear in the total number of hypotheses."}, "https://arxiv.org/abs/2401.16660": {"title": "A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information", "link": "https://arxiv.org/abs/2401.16660", "description": "The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probability analyses of decision models and perform efficient EVSI calculations."}, "https://arxiv.org/abs/2401.16749": {"title": "Bayesian scalar-on-network regression with applications to brain functional connectivity", "link": "https://arxiv.org/abs/2401.16749", "description": "This study presents a Bayesian regression framework to model the relationship between scalar outcomes and brain functional connectivity represented as symmetric positive definite (SPD) matrices. Unlike many proposals that simply vectorize the connectivity predictors thereby ignoring their matrix structures, our method respects the Riemannian geometry of SPD matrices by modelling them in a tangent space. We perform dimension reduction in the tangent space, relating the resulting low-dimensional representations with the responses. The dimension reduction matrix is learnt in a supervised manner with a sparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting. Our method yields a parsimonious regression model that allows uncertainty quantification of the estimates and identification of key brain regions that predict the outcomes. We demonstrate the performance of our approach in simulation settings and through a case study to predict Picture Vocabulary scores using data from the Human Connectome Project."}, "https://arxiv.org/abs/2401.16943": {"title": "Dynamical System Identification, Model Selection and Model Uncertainty Quantification by Bayesian Inference", "link": "https://arxiv.org/abs/2401.16943", "description": "This study presents a Bayesian maximum \\textit{a~posteriori} (MAP) framework for dynamical system identification from time-series data. 
This is shown to be equivalent to a generalized zeroth-order Tikhonov regularization, providing a rational justification for the choice of the residual and regularization terms, respectively, from the negative logarithms of the likelihood and prior distributions. In addition to the estimation of model coefficients, the Bayesian interpretation gives access to the full apparatus for Bayesian inference, including the ranking of models, the quantification of model uncertainties and the estimation of unknown (nuisance) hyperparameters. Two Bayesian algorithms, joint maximum \\textit{a~posteriori} (JMAP) and variational Bayesian approximation (VBA), are compared to the popular SINDy algorithm for thresholded least-squares regression, by application to several dynamical systems with added noise. For multivariate Gaussian likelihood and prior distributions, the Bayesian formulation gives Gaussian posterior and evidence distributions, in which the numerator terms can be expressed in terms of the Mahalanobis distance or ``Gaussian norm'' $||\\vy-\\hat{\\vy}||^2_{M^{-1}} = (\\vy-\\hat{\\vy})^\\top {M^{-1}} (\\vy-\\hat{\\vy})$, where $\\vy$ is a vector variable, $\\hat{\\vy}$ is its estimator and $M$ is the covariance matrix. The posterior Gaussian norm is shown to provide a robust metric for quantitative model selection."}, "https://arxiv.org/abs/2401.16954": {"title": "Nonparametric latency estimation for mixture cure models", "link": "https://arxiv.org/abs/2401.16954", "description": "A nonparametric latency estimator for mixture cure models is studied in this paper. An i.i.d. representation is obtained, the asymptotic mean squared error of the latency estimator is found, and its asymptotic normality is proven. A bootstrap bandwidth selection method is introduced and its efficiency is evaluated in a simulation study. The proposed methods are applied to a dataset of colorectal cancer patients in the University Hospital of A Coru\\~na (CHUAC)."}, "https://arxiv.org/abs/2401.16990": {"title": "Recovering target causal effects from post-exposure selection induced by missing outcome data", "link": "https://arxiv.org/abs/2401.16990", "description": "Confounding bias and selection bias are two significant challenges to the validity of conclusions drawn from applied causal inference. The latter can arise through informative missingness, wherein relevant information about units in the target population is missing, censored, or coarsened due to factors related to the exposure, the outcome, or their consequences. We extend existing graphical criteria to address selection bias induced by missing outcome data by leveraging post-exposure variables. We introduce the Sequential Adjustment Criteria (SAC), which support recovering causal effects through sequential regressions. A refined estimator is further developed by applying Targeted Minimum-Loss Estimation (TMLE). Under certain regularity conditions, this estimator is multiply-robust, ensuring consistency even in scenarios where the Inverse Probability Weighting (IPW) and the sequential regressions approaches fall short. A simulation exercise featuring various toy scenarios compares the relative bias and robustness of the two proposed solutions against other estimators. As a motivating application case, we study the effects of pharmacological treatment for Attention-Deficit/Hyperactivity Disorder (ADHD) upon the scores obtained by diagnosed Norwegian schoolchildren in national tests using observational data ($n=9\\,352$). 
Our findings support the accumulated clinical evidence affirming a positive but small effect of stimulant medication on school performance. A small positive selection bias was identified, indicating that the treatment effect may be even more modest for those exempted or abstained from the tests."}, "https://arxiv.org/abs/2401.17008": {"title": "A Unified Three-State Model Framework for Analysis of Treatment Crossover in Survival Trials", "link": "https://arxiv.org/abs/2401.17008", "description": "We present a unified three-state model (TSM) framework for evaluating treatment effects in clinical trials in the presence of treatment crossover. Researchers have proposed diverse methodologies to estimate the treatment effect that would have hypothetically been observed if treatment crossover had not occurred. However, there is little work on understanding the connections between these different approaches from a statistical point of view. Our proposed TSM framework unifies existing methods, effectively identifying potential biases, model assumptions, and inherent limitations for each method. This can guide researchers in understanding when these methods are appropriate and choosing a suitable approach for their data. The TSM framework also facilitates the creation of new methods to adjust for confounding effects from treatment crossover. To illustrate this capability, we introduce a new imputation method that falls under its scope. Using a piecewise constant prior for the hazard, our proposed method directly estimates the hazard function with increased flexibility. Through simulation experiments, we demonstrate the performance of different approaches for estimating the treatment effects."}, "https://arxiv.org/abs/2401.17041": {"title": "Gower's similarity coefficients with automatic weight selection", "link": "https://arxiv.org/abs/2401.17041", "description": "Nearest-neighbor methods have become popular in statistics and play a key role in statistical learning. Important decisions in nearest-neighbor methods concern the variables to use (when many potential candidates exist) and how to measure the dissimilarity between units. The first decision depends on the scope of the application while second depends mainly on the type of variables. Unfortunately, relatively few options permit to handle mixed-type variables, a situation frequently encountered in practical applications. The most popular dissimilarity for mixed-type variables is derived as the complement to one of the Gower's similarity coefficient. It is appealing because ranges between 0 and 1, being an average of the scaled dissimilarities calculated variable by variable, handles missing values and allows for a user-defined weighting scheme when averaging dissimilarities. The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted \"standard\" setting hides an unbalanced contribution of the single variables to the overall dissimilarity. We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity. In particular, this note proposes different approaches for measuring the correlation depending on the type of variables. 
The performances of the proposed approaches are evaluated in simulation studies related to classification and imputation of missing values."}, "https://arxiv.org/abs/2401.17110": {"title": "Nonparametric covariate hypothesis tests for the cure rate in mixture cure models", "link": "https://arxiv.org/abs/2401.17110", "description": "In lifetime data, like cancer studies, there may be long-term survivors, which lead to heavy censoring at the end of the follow-up period. Since a standard survival model is not appropriate to handle these data, a cure model is needed. In the literature, covariate hypothesis tests for cure models are limited to parametric and semiparametric methods. We fill this important gap by proposing a nonparametric covariate hypothesis test for the probability of cure in mixture cure models. A bootstrap method is proposed to approximate the null distribution of the test statistic. The procedure can be applied to any type of covariate, and could be extended to the multivariate setting. Its efficiency is evaluated in a Monte Carlo simulation study. Finally, the method is applied to a colorectal cancer dataset."}, "https://arxiv.org/abs/2401.17152": {"title": "Nonparametric incidence estimation and bootstrap bandwidth selection in mixture cure models", "link": "https://arxiv.org/abs/2401.17152", "description": "A completely nonparametric method for the estimation of mixture cure models is proposed. A nonparametric estimator of the incidence is extensively studied and a nonparametric estimator of the latency is presented. These estimators, which are based on the Beran estimator of the conditional survival function, are proved to be the local maximum likelihood estimators. An i.i.d. representation is obtained for the nonparametric incidence estimator. As a consequence, an asymptotically optimal bandwidth is found. Moreover, a bootstrap bandwidth selection method for the nonparametric incidence estimator is proposed. The introduced nonparametric estimators are compared with existing semiparametric approaches in a simulation study, in which the performance of the bootstrap bandwidth selector is also assessed. Finally, the method is applied to a database of colorectal cancer from the University Hospital of A Coru\~na (CHUAC)."}, "https://arxiv.org/abs/2401.17249": {"title": "Joint model with latent disease age: overcoming the need for reference time", "link": "https://arxiv.org/abs/2401.17249", "description": "Introduction: Heterogeneity of the progression of neurodegenerative diseases is one of the main challenges faced in developing effective therapies. With the increasing number of large clinical databases, disease progression models have led to a better understanding of this heterogeneity. Nevertheless, these diseases may have no clear onset and underlying biological processes may start before the first symptoms. Such an ill-defined disease reference time is an issue for current joint models, which have proven their effectiveness by combining longitudinal and survival data. Objective: In this work, we propose a joint non-linear mixed effect model with a latent disease age, to overcome this need for a precise reference time.\n Method: To do so, we utilized an existing longitudinal model with a latent disease age as a longitudinal sub-model and associated it with a survival sub-model that estimates a Weibull distribution from the latent disease age. We then validated our model on different simulated scenarios. 
Finally, we benchmarked our model with a state-of-the-art joint model and reference survival and longitudinal models on simulated and real data in the context of Amyotrophic Lateral Sclerosis (ALS).\n Results: On real data, our model achieved significantly better results than the state-of-the-art joint model for absolute bias (4.21(4.41) versus 4.24(4.14)(p-value=1.4e-17)), and mean cumulative AUC for right censored events (0.67(0.07) versus 0.61(0.09)(p-value=1.7e-03)).\n Conclusion: We showed that our approach is better suited than the state-of-the-art in the context where the reference time is not reliable. This work opens up the prospect of designing predictive and personalized therapeutic strategies."}, "https://arxiv.org/abs/2401.16563": {"title": "Topological Detection of Phenomenological Bifurcations with Unreliable Kernel Densities", "link": "https://arxiv.org/abs/2401.16563", "description": "Phenomenological (P-type) bifurcations are qualitative changes in stochastic dynamical systems whereby the stationary probability density function (PDF) changes its topology. The current state of the art for detecting these bifurcations requires reliable kernel density estimates computed from an ensemble of system realizations. However, in several real world signals such as Big Data, only a single system realization is available -- making it impossible to estimate a reliable kernel density. This study presents an approach for detecting P-type bifurcations using unreliable density estimates. The approach creates an ensemble of objects from Topological Data Analysis (TDA) called persistence diagrams from the system's sole realization and statistically analyzes the resulting set. We compare several methods for replicating the original persistence diagram including Gibbs point process modelling, Pairwise Interaction Point Modelling, and subsampling. We show that for the purpose of predicting a bifurcation, the simple method of subsampling outperforms the other two point process modelling methods."}, "https://arxiv.org/abs/2401.16828": {"title": "Simulating signed mixtures", "link": "https://arxiv.org/abs/2401.16828", "description": "Simulating mixtures of distributions with signed weights proves a challenge as standard simulation algorithms are inefficient in handling the negative weights. In particular, the natural representation of mixture variates as associated with latent component indicators is no longer available. We propose here an exact accept-reject algorithm in the general case of finite signed mixtures that relies on optimally pairing positive and negative components and designing a stratified sampling scheme on pairs. We analyze the performance of our approach relative to the inverse cdf approach, since the cdf of the distribution remains available for standard signed mixtures."}, "https://arxiv.org/abs/2401.17143": {"title": "Test for high-dimensional mean vectors via the weighted $L_2$-norm", "link": "https://arxiv.org/abs/2401.17143", "description": "In this paper, we propose a novel approach to test the equality of high-dimensional mean vectors of several populations via the weighted $L_2$-norm. We establish the asymptotic normality of the test statistics under the null hypothesis. We also explain theoretically why our test statistics can be highly useful in weakly dense cases when the nonzero signal in mean vectors is present. 
Furthermore, we compare the proposed test with existing tests using simulation results, demonstrating that the weighted $L_2$-norm-based test statistic exhibits favorable properties in terms of both size and power."}, "https://arxiv.org/abs/2204.07105": {"title": "Nonresponse Bias Analysis in Longitudinal Studies: A Comparative Review with an Application to the Early Childhood Longitudinal Study", "link": "https://arxiv.org/abs/2204.07105", "description": "Longitudinal studies are subject to nonresponse when individuals fail to provide data for entire waves or particular questions of the survey. We compare approaches to nonresponse bias analysis (NRBA) in longitudinal studies and illustrate them on the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K:2011). Wave nonresponse with attrition often yields a monotone missingness pattern, and the missingness mechanism can be missing at random (MAR) or missing not at random (MNAR). We discuss weighting, multiple imputation (MI), incomplete data modeling, and Bayesian approaches to NRBA for monotone patterns. Weighting adjustments are effective when the constructed weights are correlated with the survey outcome of interest. MI allows for variables with missing values to be included in the imputation model, yielding potentially less biased and more efficient estimates. Multilevel models with maximum likelihood estimation and marginal models estimated using generalized estimating equations can also handle incomplete longitudinal data. Bayesian methods introduce prior information and potentially stabilize model estimation. We add offsets in the MAR results to provide sensitivity analyses to assess MNAR deviations. We conduct NRBA for descriptive summaries and analytic model estimates and find that in the ECLS-K:2011 application, NRBA yields minor changes to the substantive conclusions. The strength of evidence about our NRBA depends on the strength of the relationship between the characteristics in the nonresponse adjustment and the key survey outcomes, so the key to a successful NRBA is to include strong predictors."}, "https://arxiv.org/abs/2304.14604": {"title": "Deep Neural-network Prior for Orbit Recovery from Method of Moments", "link": "https://arxiv.org/abs/2304.14604", "description": "Orbit recovery problems are a class of problems that often arise in practice in various forms. In these problems, we aim to estimate an unknown function after it has been distorted by a group action and observed via a known operator. Typically, the observations are contaminated with a non-trivial level of noise. Two particular orbit recovery problems of interest in this paper are multireference alignment and single-particle cryo-EM modelling. In order to suppress the noise, we suggest using the method of moments approach for both problems while introducing deep neural network priors. In particular, our neural networks should output the signals and the distribution of group elements, with moments being the input. In the multireference alignment case, we demonstrate the advantage of using the NN to accelerate convergence in the reconstruction of signals from the moments. 
Finally, we use our method to reconstruct simulated and biological volumes in the cryo-EM setting."}, "https://arxiv.org/abs/2306.07119": {"title": "Improving Forecasts for Heterogeneous Time Series by \"Averaging\", with Application to Food Demand Forecast", "link": "https://arxiv.org/abs/2306.07119", "description": "A common forecasting setting in real world applications considers a set of possibly heterogeneous time series of the same domain. Due to the different properties of each time series, such as length, obtaining forecasts for each individual time series in a straightforward way is challenging. This paper proposes a general framework utilizing a Dynamic Time Warping similarity measure to find similar time series, build neighborhoods in a k-Nearest Neighbor fashion, and improve forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostic tools are proposed, allowing a deeper understanding of the procedure."}, "https://arxiv.org/abs/2306.13538": {"title": "A brief introduction on latent variable based ordinal regression models with an application to survey data", "link": "https://arxiv.org/abs/2306.13538", "description": "The analysis of survey data is a frequently arising issue in clinical trials, particularly when capturing quantities which are difficult to measure. Typical examples are questionnaires about patients' well-being, pain, or consent to an intervention. In these, data is captured on a discrete scale containing only a limited number of possible answers, from which the respondent has to pick the answer which best fits his/her personal opinion. This data is generally located on an ordinal scale as answers can usually be arranged in an ascending order, e.g., \"bad\", \"neutral\", \"good\" for well-being. Since responses are usually stored numerically for data processing purposes, analysis of survey data using ordinary linear regression models is commonly applied. However, the assumptions of these models are often not met, as linear regression requires constant variability of the response variable and can yield predictions outside the range of response categories. By using linear models, one only gains insights about the mean response, which may affect representativeness. In contrast, ordinal regression models can provide probability estimates for all response categories and yield information about the full response scale beyond the mean. In this work, we provide a concise overview of the fundamentals of latent variable based ordinal models, applications to a real data set, and outline the use of state-of-the-art software for this purpose. Moreover, we discuss strengths, limitations and typical pitfalls. This is a companion work to a current vignette-based structured interview study in paediatric anaesthesia."}, "https://arxiv.org/abs/2311.09388": {"title": "Synthesis estimators for positivity violations with a continuous covariate", "link": "https://arxiv.org/abs/2311.09388", "description": "Studies intended to estimate the effect of a treatment, like randomized trials, often consist of a biased sample of the desired target population. To correct for this bias, estimates can be transported to the desired target population. Methods for transporting between populations are often premised on a positivity assumption, such that all relevant covariate patterns in one population are also present in the other. 
However, eligibility criteria, particularly in the case of trials, can result in violations of positivity. To address nonpositivity, a synthesis of statistical and mathematical models can be considered. This approach integrates multiple data sources (e.g. trials, observational, pharmacokinetic studies) to estimate treatment effects, leveraging mathematical models to handle positivity violations. This approach was previously demonstrated for positivity violations by a single binary covariate. Here, we extend the synthesis approach for positivity violations with a continuous covariate. For estimation, two novel augmented inverse probability weighting estimators are proposed. Both estimators are contrasted with other common approaches for addressing nonpositivity. Empirical performance is compared via Monte Carlo simulation. Finally, the competing approaches are illustrated with an example in the context of two-drug versus one-drug antiretroviral therapy on CD4 T cell counts among women with HIV."}, "https://arxiv.org/abs/2401.15796": {"title": "High-Dimensional False Discovery Rate Control for Dependent Variables", "link": "https://arxiv.org/abs/2401.15796", "description": "Algorithms that ensure reproducible findings from large-scale, high-dimensional data are pivotal in numerous signal processing applications. In recent years, multivariate false discovery rate (FDR) controlling methods have emerged, providing guarantees even in high-dimensional settings where the number of variables surpasses the number of samples. However, these methods often fail to reliably control the FDR in the presence of highly dependent variable groups, a common characteristic in fields such as genomics and finance. To tackle this critical issue, we introduce a novel framework that accounts for general dependency structures. Our proposed dependency-aware T-Rex selector integrates hierarchical graphical models within the T-Rex framework to effectively harness the dependency structure among variables. Leveraging martingale theory, we prove that our variable penalization mechanism ensures FDR control. We further generalize the FDR-controlling framework by stating and proving a clear condition necessary for designing both graphical and non-graphical models that capture dependencies. Additionally, we formulate a fully integrated optimal calibration algorithm that concurrently determines the parameters of the graphical model and the T-Rex framework, such that the FDR is controlled while maximizing the number of selected variables. Numerical experiments and a breast cancer survival analysis use-case demonstrate that the proposed method is the only one among the state-of-the-art benchmark methods that controls the FDR and reliably detects genes that have been previously identified to be related to breast cancer. An open-source implementation is available within the R package TRexSelector on CRAN."}, "https://arxiv.org/abs/2206.01900": {"title": "Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios", "link": "https://arxiv.org/abs/2206.01900", "description": "Evaluation of intervention in a multi-agent system, e.g., when humans should intervene in autonomous driving systems and when a player should pass to teammates for a good shot, is challenging in various engineering and scientific fields. Estimating the individual treatment effect (ITE) using counterfactual long-term prediction is practical to evaluate such interventions. 
However, most of the conventional frameworks did not consider the time-varying complex structure of multi-agent relationships and covariate counterfactual prediction. This may lead to erroneous assessments of ITE and difficulty in interpretation. Here we propose an interpretable, counterfactual recurrent network in multi-agent systems to estimate the effect of the intervention. Our model leverages graph variational recurrent neural networks and theory-based computation with domain knowledge for the ITE estimation framework based on long-term prediction of multi-agent covariates and outcomes, which can confirm the circumstances under which the intervention is effective. On simulated models of an automated vehicle and biological agents with time-varying confounders, we show that our methods achieved lower estimation errors in counterfactual covariates and the most effective treatment timing than the baselines. Furthermore, using real basketball data, our methods performed realistic counterfactual predictions and evaluated the counterfactual passes in shot scenarios."}, "https://arxiv.org/abs/2210.06934": {"title": "On the potential benefits of entropic regularization for smoothing Wasserstein estimators", "link": "https://arxiv.org/abs/2210.06934", "description": "This paper is focused on the study of entropic regularization in optimal transport as a smoothing method for Wasserstein estimators, through the prism of the classical tradeoff between approximation and estimation errors in statistics. Wasserstein estimators are defined as solutions of variational problems whose objective function involves the use of an optimal transport cost between probability measures. Such estimators can be regularized by replacing the optimal transport cost by its regularized version using an entropy penalty on the transport plan. The use of such a regularization has a potentially significant smoothing effect on the resulting estimators. In this work, we investigate its potential benefits on the approximation and estimation properties of regularized Wasserstein estimators. Our main contribution is to discuss how entropic regularization may reach, at a lower computational cost, statistical performances that are comparable to those of un-regularized Wasserstein estimators in statistical learning problems involving distributional data analysis. To this end, we present new theoretical results on the convergence of regularized Wasserstein estimators. We also study their numerical performances using simulated and real data in the supervised learning problem of proportions estimation in mixture models using optimal transport."}, "https://arxiv.org/abs/2304.09157": {"title": "Neural networks for geospatial data", "link": "https://arxiv.org/abs/2304.09157", "description": "Analysis of geospatial data has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a covariance model, encoding the spatial dependence. We relax the strong assumption of linearity and propose embedding neural networks directly within the traditional geostatistical models to accommodate non-linear mean functions while retaining all other advantages including use of Gaussian Processes to explicitly model the spatial covariance, enabling inference on the covariate effect through the mean and on the spatial dependence through the covariance, and offering predictions at new locations via kriging. 
We propose NN-GLS, a new neural network estimation algorithm for the non-linear mean in GP models that explicitly accounts for the spatial covariance through generalized least squares (GLS), the same loss used in the linear case. We show that NN-GLS admits a representation as a special type of graph neural network (GNN). This connection facilitates use of standard neural network computational techniques for irregular geospatial data, enabling novel and scalable mini-batching, backpropagation, and kriging schemes. Theoretically, we show that NN-GLS will be consistent for irregularly observed spatially correlated data processes. To our knowledge this is the first asymptotic consistency result for any neural network algorithm for spatial data. We demonstrate the methodology through simulated and real datasets."}, "https://arxiv.org/abs/2311.17246": {"title": "Detecting influential observations in single-index models with metric-valued response objects", "link": "https://arxiv.org/abs/2311.17246", "description": "Regression with random data objects is becoming increasingly common in modern data analysis. Unfortunately, like the traditional regression setting with Euclidean data, random response regression is not immune to the trouble caused by unusual observations. A metric Cook's distance extending the classical Cook's distances of Cook (1977) to general metric-valued response objects is proposed. The performance of the metric Cook's distance in both Euclidean and non-Euclidean response regression with Euclidean predictors is demonstrated in an extensive experimental study. A real data analysis of county-level COVID-19 transmission in the United States also illustrates the usefulness of this method in practice."}, "https://arxiv.org/abs/2401.15139": {"title": "FDR-Controlled Portfolio Optimization for Sparse Financial Index Tracking", "link": "https://arxiv.org/abs/2401.15139", "description": "In high-dimensional data analysis, such as financial index tracking or biomedical applications, it is crucial to select the few relevant variables while maintaining control over the false discovery rate (FDR). In these applications, strong dependencies often exist among the variables (e.g., stock returns), which can undermine the FDR control property of existing methods like the model-X knockoff method or the T-Rex selector. To address this issue, we have expanded the T-Rex framework to accommodate overlapping groups of highly correlated variables. This is achieved by integrating a nearest neighbors penalization mechanism into the framework, which provably controls the FDR at the user-defined target level. A real-world example of sparse index tracking demonstrates the proposed method's ability to accurately track the S&P 500 index over the past 20 years based on a small number of stocks. An open-source implementation is provided within the R package TRexSelector on CRAN."}, "https://arxiv.org/abs/2401.17346": {"title": "npcure: An R Package for Nonparametric Inference in Mixture Cure Models", "link": "https://arxiv.org/abs/2401.17346", "description": "Mixture cure models have been widely used to analyze survival data with a cure fraction. They assume that a subgroup of the individuals under study will never experience the event (cured subjects). So, the goal is twofold: to study both the cure probability and the failure time of the uncured individuals through a proper survival function (latency). 
The R package npcure implements a completely nonparametric approach for estimating these functions in mixture cure models, considering right-censored survival times. Nonparametric estimators for the cure probability and the latency as functions of a covariate are provided. Bootstrap bandwidth selectors for the estimators are included. The package also implements a nonparametric covariate significance test for the cure probability, which can be applied with a continuous, discrete, or qualitative covariate."}, "https://arxiv.org/abs/2401.17347": {"title": "Cure models to estimate time until hospitalization due to COVID-19", "link": "https://arxiv.org/abs/2401.17347", "description": "A short introduction to survival analysis and censored data is included in this paper. A thorough literature review in the field of cure models has been done. An overview on the most important and recent approaches on parametric, semiparametric and nonparametric mixture cure models is also included. The main nonparametric and semiparametric approaches were applied to a real time dataset of COVID-19 patients from the first weeks of the epidemic in Galicia (NW Spain). The aim is to model the elapsed time from diagnosis to hospital admission. The main conclusions, as well as the limitations of both the cure models and the dataset, are presented, illustrating the usefulness of cure models in this kind of studies, where the influence of age and sex on the time to hospital admission is shown."}, "https://arxiv.org/abs/2401.17385": {"title": "Causal Analysis of Air Pollution Mixtures: Estimands, Positivity, and Extrapolation", "link": "https://arxiv.org/abs/2401.17385", "description": "Causal inference for air pollution mixtures is an increasingly important issue with appreciable challenges. When the exposure is a multivariate mixture, there are many exposure contrasts that may be of nominal interest for causal effect estimation, but the complex joint mixture distribution often renders observed data extremely limited in their ability to inform estimates of many commonly-defined causal effects. We use potential outcomes to 1) define causal effects of air pollution mixtures, 2) formalize the key assumption of mixture positivity required for estimation and 3) offer diagnostic metrics for positivity violations in the mixture setting that allow researchers to assess the extent to which data can actually support estimation of mixture effects of interest. For settings where there is limited empirical support, we redefine causal estimands that apportion causal effects according to whether they can be directly informed by observed data versus rely entirely on model extrapolation, isolating key sources of information on the causal effect of an air pollution mixture. The ideas are deployed to assess the ability of a national United States data set on the chemical components of ambient particulate matter air pollution to support estimation of a variety of causal mixture effects."}, "https://arxiv.org/abs/2401.17389": {"title": "A review of statistical models used to characterize species-habitat associations with animal movement data", "link": "https://arxiv.org/abs/2401.17389", "description": "Understanding species-habitat associations is fundamental to ecological sciences and for species conservation. Consequently, various statistical approaches have been designed to infer species-habitat associations. Due to their conceptual and mathematical differences, these methods can yield contrasting results. 
In this paper, we describe and compare commonly used statistical models that relate animal movement data to environmental data. Specifically, we examined selection functions which include resource selection function (RSF) and step-selection function (SSF), as well as hidden Markov models (HMMs) and related methods such as state-space models. We demonstrate differences in assumptions of each method while highlighting advantages and limitations. Additionally, we provide guidance on selecting the most appropriate statistical method based on research objectives and intended inference. To demonstrate the varying ecological insights derived from each statistical model, we apply them to the movement track of a single ringed seal in a case study. For example, the RSF indicated selection of areas with high prey diversity, whereas the SSFs indicated no discernable relationship with prey diversity. Furthermore, the HMM reveals variable associations with prey diversity across different behaviors. Notably, the three models identified different important areas. This case study highlights the critical significance of selecting the appropriate model to identify species-habitat relationships and specific areas of importance. Our comprehensive review provides the foundational information required for making informed decisions when choosing the most suitable statistical methods to address specific questions, such as identifying expansive corridors or protected zones, understanding movement patterns, or studying behaviours."}, "https://arxiv.org/abs/2401.17393": {"title": "Estimating the EVSI with Gaussian Approximations and Spline-Based Series Methods", "link": "https://arxiv.org/abs/2401.17393", "description": "Background. The Expected Value of Sample Information (EVSI) measures the expected benefits that could be obtained by collecting additional data. Estimating EVSI using the traditional nested Monte Carlo method is computationally expensive but the recently developed Gaussian approximation (GA) approach can efficiently estimate EVSI across different sample sizes. However, the conventional GA may result in biased EVSI estimates if the decision models are highly nonlinear. This bias may lead to suboptimal study designs when GA is used to optimize the value of different studies. Therefore, we extend the conventional GA approach to improve its performance for nonlinear decision models. Methods. Our method provides accurate EVSI estimates by approximating the conditional benefit based on two steps. First, a Taylor series approximation is applied to estimate the conditional benefit as a function of the conditional moments of the parameters of interest using a spline, which is fitted to the samples of the parameters and the corresponding benefits. Next, the conditional moments of parameters are approximated by the conventional GA and Fisher information. The proposed approach is applied to several data collection exercises involving non-Gaussian parameters and nonlinear decision models. Its performance is compared with the nested Monte Carlo method, the conventional GA approach, and the nonparametric regression-based method for EVSI calculation. Results. The proposed approach provides accurate EVSI estimates across different sample sizes when the parameters of interest are non-Gaussian and the decision models are nonlinear. The computational cost of the proposed method is similar to other novel methods. Conclusions. 
The proposed approach can estimate EVSI across sample sizes accurately and efficiently, which may support researchers in determining an economically optimal study design using EVSI."}, "https://arxiv.org/abs/2401.17430": {"title": "Modeling of spatial extremes in environmental data science: Time to move away from max-stable processes", "link": "https://arxiv.org/abs/2401.17430", "description": "Environmental data science for spatial extremes has traditionally relied heavily on max-stable processes. Even though the popularity of these models has perhaps peaked with statisticians, they are still perceived and considered as the `state-of-the-art' in many applied fields. However, while the asymptotic theory supporting the use of max-stable processes is mathematically rigorous and comprehensive, we think that it has also been overused, if not misused, in environmental applications, to the detriment of more purposeful and meticulously validated models. In this paper, we review the main limitations of max-stable process models, and strongly argue against their systematic use in environmental studies. Alternative solutions based on more flexible frameworks using the exceedances of variables above appropriately chosen high thresholds are discussed, and an outlook on future research is given, highlighting recommendations moving forward and the opportunities offered by hybridizing machine learning with extreme-value statistics."}, "https://arxiv.org/abs/2401.17452": {"title": "Group-Weighted Conformal Prediction", "link": "https://arxiv.org/abs/2401.17452", "description": "Conformal prediction (CP) is a method for constructing a prediction interval around the output of a fitted model, whose validity does not rely on the model being correct--the CP interval offers a coverage guarantee that is distribution-free, but relies on the training data being drawn from the same distribution as the test data. A recent variant, weighted conformal prediction (WCP), reweights the method to allow for covariate shift between the training and test distributions. However, WCP requires knowledge of the nature of the covariate shift, specifically the likelihood ratio between the test and training covariate distributions. In practice, since this likelihood ratio is estimated rather than known exactly, the coverage guarantee may degrade due to the estimation error. In this paper, we consider a special scenario where observations belong to a finite number of groups, and these groups determine the covariate shift between the training and test distributions; for instance, this may arise if the training set is collected via stratified sampling. Our results demonstrate that in this special case, the predictive coverage guarantees of WCP can be drastically improved beyond existing estimation error bounds."}, "https://arxiv.org/abs/2401.17473": {"title": "Adaptive Matrix Change Point Detection: Leveraging Structured Mean Shifts", "link": "https://arxiv.org/abs/2401.17473", "description": "In high-dimensional time series, the component processes are often assembled into a matrix to display their interrelationship. We focus on detecting mean shifts with unknown change point locations in these matrix time series. Series that are activated by a change may cluster along certain rows (columns), which forms mode-specific change point alignment. Leveraging mode-specific change point alignments may substantially enhance the power for change point detection. 
Yet, there may be no mode-specific alignments in the change point structure. We propose a powerful test to detect mode-specific change points that remains robust to non-mode-specific changes. We show the validity of using the multiplier bootstrap to compute the p-value of the proposed methods, and derive non-asymptotic bounds on the size and power of the tests. We also propose a parallel bootstrap, a computationally efficient approach for computing the p-value of the proposed adaptive test. In particular, we show the consistency of the proposed test under mild regularity conditions. To obtain the theoretical results, we derive new, sharp bounds on Gaussian approximation and multiplier bootstrap approximation, which are of independent interest for high dimensional problems with diverging sparsity."}, "https://arxiv.org/abs/2401.17518": {"title": "Model Uncertainty and Selection of Risk Models for Left-Truncated and Right-Censored Loss Data", "link": "https://arxiv.org/abs/2401.17518", "description": "Insurance loss data are usually in the form of left-truncation and right-censoring due to deductibles and policy limits, respectively. This paper investigates the model uncertainty and selection procedure when various parametric models are constructed to accommodate such left-truncated and right-censored data. The joint asymptotic properties of the estimators have been established using the Delta method along with Maximum Likelihood Estimation when the model is specified. We conduct the simulation studies using Fisk, Lognormal, Lomax, Paralogistic, and Weibull distributions with various proportions of loss data below deductibles and above policy limits. A variety of graphic tools, hypothesis tests, and penalized likelihood criteria are employed to validate the models, and their performance on model selection is evaluated through the probability of each parent distribution being correctly selected. The effectiveness of each tool on model selection is also illustrated using well-studied data that represent Wisconsin property losses in the United States from 2007 to 2010."}, "https://arxiv.org/abs/2401.17646": {"title": "From Sparse to Dense Functional Data: Phase Transitions from a Simultaneous Inference Perspective", "link": "https://arxiv.org/abs/2401.17646", "description": "We aim to develop simultaneous inference tools for the mean function of functional data from sparse to dense. First, we derive a unified Gaussian approximation to construct simultaneous confidence bands of mean functions based on the B-spline estimator. Then, we investigate the conditions of phase transitions by decomposing the asymptotic variance of the approximated Gaussian process. As an extension, we also consider the orthogonal series estimator and show the corresponding conditions of phase transitions. Extensive simulation results strongly corroborate the theoretical results, and also illustrate the variation of the asymptotic distribution via the asymptotic variance decomposition we obtain. The developed method is further applied to body fat data and traffic data."}, "https://arxiv.org/abs/2401.17735": {"title": "The impact of coarsening an exposure on partial identifiability in instrumental variable settings", "link": "https://arxiv.org/abs/2401.17735", "description": "In instrumental variable (IV) settings, such as in imperfect randomized trials and observational studies with Mendelian randomization, one may encounter a continuous exposure, the causal effect of which is not of true interest. 
Instead, scientific interest may lie in a coarsened version of this exposure. Although there is a lengthy literature on the impact of coarsening an exposure, with several works focusing specifically on IV settings, all methods proposed in this literature require parametric assumptions. Instead, just as in the standard IV setting, one can consider partial identification via bounds making no parametric assumptions. This was first pointed out in Alexander Balke's PhD dissertation. We extend and clarify his work and derive novel bounds in several settings, including for a three-level IV, which will most likely be the case in Mendelian randomization. We demonstrate our findings in two real data examples, a randomized trial for peanut allergy in infants and a Mendelian randomization setting investigating the effect of homocysteine on cardiovascular disease."}, "https://arxiv.org/abs/2401.17737": {"title": "Hierarchical Bias-Driven Stratification for Interpretable Causal Effect Estimation", "link": "https://arxiv.org/abs/2401.17737", "description": "Interpretability and transparency are essential for incorporating causal effect models from observational data into policy decision-making. They can provide trust in the model in the absence of ground truth labels to evaluate the accuracy of such models. To date, attempts at transparent causal effect estimation consist of applying post hoc explanation methods to black-box models, which are not interpretable. Here, we present BICauseTree: an interpretable balancing method that identifies clusters where natural experiments occur locally. Our approach builds on decision trees with a customized objective function to improve balancing and reduce treatment allocation bias. Consequently, it can additionally detect subgroups presenting positivity violations, exclude them, and provide a covariate-based definition of the target population we can infer from and generalize to. We evaluate the method's performance using synthetic and realistic datasets, explore its bias-interpretability tradeoff, and show that it is comparable with existing approaches."}, "https://arxiv.org/abs/2401.17754": {"title": "Nonparametric estimation of circular trend surfaces with application to wave directions", "link": "https://arxiv.org/abs/2401.17754", "description": "In oceanography, modeling wave fields requires the use of statistical tools capable of handling the circular nature of the data measurements. An important issue in ocean wave analysis is the study of wave height and direction, with direction values recorded as angles or, equivalently, as points on a unit circle. Hence, reconstruction of a wave direction field on the sea surface can be approached by the use of a linear-circular regression model, viewing wave directions as a realization of a circular spatial process whose trend should be estimated. In this paper, we consider a spatial regression model with a circular response and several real-valued predictors. Nonparametric estimators of the circular trend surface are proposed, accounting for the (unknown) spatial correlation. Some asymptotic results about these estimators as well as some guidelines for their practical implementation are also given. The performance of the proposed estimators is investigated in a simulation study. 
An application to wave directions in the Adriatic Sea is provided for illustration."}, "https://arxiv.org/abs/2401.17770": {"title": "Nonparametric geostatistical risk mapping", "link": "https://arxiv.org/abs/2401.17770", "description": "In this work, a fully nonparametric geostatistical approach to estimate threshold exceeding probabilities is proposed. To estimate the large-scale variability (spatial trend) of the process, the nonparametric local linear regression estimator, with the bandwidth selected by a method that takes the spatial dependence into account, is used. A bias-corrected nonparametric estimator of the variogram, obtained from the nonparametric residuals, is proposed to estimate the small-scale variability. Finally, a bootstrap algorithm is designed to estimate the unconditional probabilities of exceeding a threshold value at any location. The behavior of this approach is evaluated through simulation and with an application to a real data set."}, "https://arxiv.org/abs/2401.17966": {"title": "XGBoostPP: Tree-based Estimation of Point Process Intensity Functions", "link": "https://arxiv.org/abs/2401.17966", "description": "We propose a novel tree-based ensemble method, named XGBoostPP, to nonparametrically estimate the intensity of a point process as a function of covariates. It extends the use of gradient-boosted regression trees (Chen & Guestrin, 2016) to the point process literature via two carefully designed loss functions. The first loss is based on the Poisson likelihood, working for general point processes. The second loss is based on the weighted Poisson likelihood, where spatially dependent weights are introduced to further improve the estimation efficiency for clustered processes. An efficient greedy search algorithm is developed for model estimation, and the effectiveness of the proposed method is demonstrated through extensive simulation studies and two real data analyses. In particular, we report that XGBoostPP achieves superior performance to existing approaches when the dimension of the covariate space is high, revealing the advantages of tree-based ensemble methods in estimating complex intensity functions."}, "https://arxiv.org/abs/2401.17987": {"title": "Bagging cross-validated bandwidths with application to Big Data", "link": "https://arxiv.org/abs/2401.17987", "description": "Hall and Robinson (2009) proposed and analyzed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise inherent in ordinary cross-validation, and hence leads to a more efficient bandwidth selector. The asymptotic theory of Hall and Robinson (2009) assumes that $N$, the number of bagged subsamples, is $\infty$. We expand upon their theoretical results by allowing $N$ to be finite, as it is in practice. Our results indicate an important difference in the rate of convergence of the bagged cross-validation bandwidth for the cases $N=\infty$ and $N<\infty$. Simulations quantify the improvement in statistical efficiency and computational speed that can result from using bagged cross-validation as opposed to a binned implementation of ordinary cross-validation. The performance of the bagged bandwidth is also illustrated on a real, very large data set. 
Finally, a byproduct of our study is the correction of errors appearing in the Hall and Robinson (2009) expression for the asymptotic mean squared error of the bagging selector."}, "https://arxiv.org/abs/2401.17993": {"title": "Robust Inference for Generalized Linear Mixed Models: An Approach Based on Score Sign Flipping", "link": "https://arxiv.org/abs/2401.17993", "description": "Despite the versatility of generalized linear mixed models in handling complex experimental designs, they often suffer from misspecification and convergence problems. This makes inference on the values of coefficients problematic. To address these challenges, we propose a robust extension of the score-based statistical test using sign-flipping transformations. Our approach efficiently handles within-variance structure and heteroscedasticity, ensuring accurate regression coefficient testing. The approach is illustrated by analyzing the reduction of health issues over time for newly adopted children. The model is characterized by a binomial response with unbalanced frequencies and several categorical and continuous predictors. The proposed approach efficiently deals with critical problems related to longitudinal nonlinear models, surpassing common statistical approaches such as generalized estimating equations and generalized linear mixed models."}, "https://arxiv.org/abs/2401.18014": {"title": "Bayesian regularization for flexible baseline hazard functions in Cox survival models", "link": "https://arxiv.org/abs/2401.18014", "description": "Fully Bayesian methods for Cox models specify a model for the baseline hazard function. Parametric approaches generally provide monotone estimations. Semi-parametric choices allow for more flexible patterns but they can suffer from overfitting and instability. Regularization methods through prior distributions with correlated structures usually give reasonable answers to these types of situations. We discuss Bayesian regularization for Cox survival models defined via flexible baseline hazards specified by a mixture of piecewise constant functions and by a cubic B-spline function. For those \"semiparametric\" proposals, different prior scenarios ranging from prior independence to particular correlated structures are discussed in a real study with micro-virulence data and in an extensive simulation scenario that includes different data sample and time axis partition sizes in order to capture risk variations. The posterior distribution of the parameters was approximated using Markov chain Monte Carlo methods. Model selection was performed in accordance with the Deviance Information Criteria and the Log Pseudo-Marginal Likelihood. The results obtained reveal that, in general, Cox models present great robustness in covariate effects and survival estimates independent of the baseline hazard specification. In relation to the \"semi-parametric\" baseline hazard specification, the B-splines hazard function is less dependent on the regularization process than the piecewise specification because it demands a smaller time axis partition to estimate a similar behaviour of the risk."}, "https://arxiv.org/abs/2401.18023": {"title": "A cost-sensitive constrained Lasso", "link": "https://arxiv.org/abs/2401.18023", "description": "The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. 
Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the prediction accuracy for certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered."}, "https://arxiv.org/abs/2401.16667": {"title": "Sharp variance estimator and causal bootstrap in stratified randomized experiments", "link": "https://arxiv.org/abs/2401.16667", "description": "The finite-population asymptotic theory provides a normal approximation for the sampling distribution of the average treatment effect estimator in stratified randomized experiments. The asymptotic variance is often estimated by a Neyman-type conservative variance estimator. However, the variance estimator can be overly conservative, and the asymptotic theory may fail in small samples. To solve these issues, we propose a sharp variance estimator for the difference-in-means estimator weighted by the proportion of stratum sizes in stratified randomized experiments. Furthermore, we propose two causal bootstrap procedures to more accurately approximate the sampling distribution of the weighted difference-in-means estimator. The first causal bootstrap procedure is based on rank-preserving imputation and we show that it has second-order refinement over normal approximation. The second causal bootstrap procedure is based on sharp null imputation and is applicable in paired experiments. Our analysis is randomization-based or design-based by conditioning on the potential outcomes, with treatment assignment being the sole source of randomness. Numerical studies and real data analyses demonstrate the advantages of our proposed methods in finite samples."}, "https://arxiv.org/abs/2401.17334": {"title": "Efficient estimation of parameters in marginals in semiparametric multivariate models", "link": "https://arxiv.org/abs/2401.17334", "description": "We consider a general multivariate model where univariate marginal distributions are known up to a parameter vector and we are interested in estimating that parameter vector without specifying the joint distribution, except for the marginals. If we assume independence between the marginals and maximize the resulting quasi-likelihood, we obtain a consistent but inefficient QMLE estimator. If we assume a parametric copula (other than independence) we obtain a full MLE, which is efficient but only under a correct copula specification and may be biased if the copula is misspecified. Instead we propose a sieve MLE estimator (SMLE) which improves over QMLE but does not have the drawbacks of full MLE. We model the unknown part of the joint distribution using the Bernstein-Kantorovich polynomial copula and assess the resulting improvement over QMLE and over misspecified FMLE in terms of relative efficiency and robustness. 
We derive the asymptotic distribution of the new estimator and show that it reaches the relevant semiparametric efficiency bound. Simulations suggest that the sieve MLE can be almost as efficient as FMLE relative to QMLE provided there is enough dependence between the marginals. We demonstrate practical value of the new estimator with several applications. First, we apply SMLE in an insurance context where we build a flexible semi-parametric claim loss model for a scenario where one of the variables is censored. As in simulations, the use of SMLE leads to tighter parameter estimates. Next, we consider financial risk management examples and show how the use of SMLE leads to superior Value-at-Risk predictions. The paper comes with an online archive which contains all codes and datasets."}, "https://arxiv.org/abs/2401.17422": {"title": "River flow modelling using nonparametric functional data analysis", "link": "https://arxiv.org/abs/2401.17422", "description": "Time series and extreme value analyses are two statistical approaches usually applied to study hydrological data. Classical techniques, such as ARIMA models (in the case of mean flow predictions), and parametric generalised extreme value (GEV) fits and nonparametric extreme value methods (in the case of extreme value theory) have been usually employed in this context. In this paper, nonparametric functional data methods are used to perform mean monthly flow predictions and extreme value analysis, which are important for flood risk management. These are powerful tools that take advantage of both, the functional nature of the data under consideration and the flexibility of nonparametric methods, providing more reliable results. Therefore, they can be useful to prevent damage caused by floods and to reduce the likelihood and/or the impact of floods in a specific location. The nonparametric functional approaches are applied to flow samples of two rivers in the U.S. In this way, monthly mean flow is predicted and flow quantiles in the extreme value framework are estimated using the proposed methods. Results show that the nonparametric functional techniques work satisfactorily, generally outperforming the behaviour of classical parametric and nonparametric estimators in both settings."}, "https://arxiv.org/abs/2401.17504": {"title": "CaMU: Disentangling Causal Effects in Deep Model Unlearning", "link": "https://arxiv.org/abs/2401.17504", "description": "Machine unlearning requires removing the information of forgetting data while keeping the necessary information of remaining data. Despite recent advancements in this area, existing methodologies mainly focus on the effect of removing forgetting data without considering the negative impact this can have on the information of the remaining data, resulting in significant performance degradation after data removal. Although some methods try to repair the performance of remaining data after removal, the forgotten information can also return after repair. Such an issue is due to the intricate intertwining of the forgetting and remaining data. Without adequately differentiating the influence of these two kinds of data on the model, existing algorithms take the risk of either inadequate removal of the forgetting data or unnecessary loss of valuable information from the remaining data. To address this shortcoming, the present study undertakes a causal analysis of the unlearning and introduces a novel framework termed Causal Machine Unlearning (CaMU). 
This framework adds an intervention on the information of the remaining data to disentangle the causal effects between the forgetting data and the remaining data. Then CaMU eliminates the causal impact associated with forgetting data while concurrently preserving the causal relevance of the remaining data. Comprehensive empirical results on various datasets and models suggest that CaMU enhances performance on the remaining data and effectively minimizes the influences of forgetting data. Notably, this work is the first to interpret deep model unlearning tasks from a new perspective of causality and provide a solution based on causal analysis, which opens up new possibilities for future research in deep model unlearning."}, "https://arxiv.org/abs/2401.17585": {"title": "Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks", "link": "https://arxiv.org/abs/2401.17585", "description": "Current approaches to knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers six common reasoning schemes in the real world. We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit. We find that all model editing methods show notably low performance on this dataset, especially in certain reasoning schemes. Our analysis of the chain-of-thought generation of edited models further uncovers key reasons behind the inadequacy of existing knowledge editing methods from a reasoning standpoint, involving aspects of fact-wise editing, fact recall ability, and coherence in generation. We will make our benchmark publicly available."}, "https://arxiv.org/abs/2210.00822": {"title": "Combinatorial and algebraic perspectives on the marginal independence structure of Bayesian networks", "link": "https://arxiv.org/abs/2210.00822", "description": "We consider the problem of estimating the marginal independence structure of a Bayesian network from observational data, learning an undirected graph we call the unconditional dependence graph. We show that unconditional dependence graphs of Bayesian networks correspond to the graphs having equal independence and intersection numbers. Using this observation, a Gr\\\"obner basis for a toric ideal associated to unconditional dependence graphs of Bayesian networks is given and then extended by additional binomial relations to connect the space of all such graphs. An MCMC method, called GrUES (Gr\\\"obner-based Unconditional Equivalence Search), is implemented based on the resulting moves and applied to synthetic Gaussian data. 
GrUES recovers the true marginal independence structure via a penalized maximum likelihood or MAP estimate at a higher rate than simple independence tests while also yielding an estimate of the posterior, for which the $20\\%$ HPD credible sets include the true structure at a high rate for data-generating graphs with density at least $0.5$."}, "https://arxiv.org/abs/2306.04866": {"title": "Computational methods for fast Bayesian model assessment via calibrated posterior p-values", "link": "https://arxiv.org/abs/2306.04866", "description": "Posterior predictive p-values (ppps) have become popular tools for Bayesian model assessment, being general-purpose and easy to use. However, interpretation can be difficult because their distribution is not uniform under the hypothesis that the model did generate the data. Calibrated ppps (cppps) can be obtained via a bootstrap-like procedure, yet remain unavailable in practice due to high computational cost. This paper introduces methods to enable efficient approximation of cppps and their uncertainty for fast model assessment. We first investigate the computational trade-off between the number of calibration replicates and the number of MCMC samples per replicate. Provided that the MCMC chain from the real data has converged, using short MCMC chains per calibration replicate can save significant computation time compared to naive implementations, without significant loss in accuracy. We propose different variance estimators for the cppp approximation, which can be used to confirm quickly the lack of evidence against model misspecification. As variance estimation uses effective sample sizes of many short MCMC chains, we show these can be approximated well from the real-data MCMC chain. The procedure for cppp is implemented in NIMBLE, a flexible framework for hierarchical modeling that supports many models and discrepancy measures."}, "https://arxiv.org/abs/2306.11267": {"title": "Model-assisted analysis of covariance estimators for stepped wedge cluster randomized experiments", "link": "https://arxiv.org/abs/2306.11267", "description": "Stepped wedge cluster randomized experiments represent a class of unidirectional crossover designs increasingly adopted for comparative effectiveness and implementation science research. Although stepped wedge cluster randomized experiments have become popular, definitions of estimands and robust methods to target clearly-defined estimands remain insufficient. To address this gap, we describe a class of estimands that explicitly acknowledge the multilevel data structure in stepped wedge cluster randomized experiments, and highlight three typical members of the estimand class that are interpretable and are of practical interest. We then introduce four possible formulations of analysis of covariance (ANCOVA) working models to achieve estimand-aligned analyses. By exploiting baseline covariates, each ANCOVA model can potentially improve the estimation efficiency over the unadjusted estimators. In addition, each ANCOVA estimator is model-assisted in the sense that its point estimator is consistent with the target estimand even when the working model is misspecified. Under the stepped wedge randomization scheme, we establish the finite population Central Limit Theorem for each estimator, which motivates design-based variance estimators. Through simulations, we study the finite-sample operating characteristics of the ANCOVA estimators under different data-generating processes. 
We illustrate their applications via the analysis of the Washington State Expedited Partner Therapy study."}, "https://arxiv.org/abs/2401.10869": {"title": "Robust variable selection for partially linear additive models", "link": "https://arxiv.org/abs/2401.10869", "description": "Among semiparametric regression models, partially linear additive models provide a useful tool to include additive nonparametric components as well as a parametric component, when explaining the relationship between the response and a set of explanatory variables. This paper concerns such models under sparsity assumptions for the covariates included in the linear component. Sparse covariates are frequent in regression problems where the task of variable selection is usually of interest. As in other settings, outliers either in the residuals or in the covariates involved in the linear component have a harmful effect. To simultaneously achieve model selection for the parametric component of the model and resistance to outliers, we combine preliminary robust estimators of the additive component with robust linear $MM$-regression estimators using a penalty such as SCAD on the coefficients in the parametric part. Under mild assumptions, consistency results and rates of convergence for the proposed estimators are derived. A Monte Carlo study is carried out to compare, under different models and contamination schemes, the performance of the robust proposal with its classical counterpart. The obtained results show the advantage of using the robust approach. Through the analysis of a real data set, we also illustrate the benefits of the proposed procedure."}, "https://arxiv.org/abs/2209.04193": {"title": "Correcting inferences for volunteer-collected data with geospatial sampling bias", "link": "https://arxiv.org/abs/2209.04193", "description": "Citizen science projects in which volunteers collect data are increasingly popular due to their ability to engage the public with scientific questions. The scientific value of these data is, however, hampered by several biases. In this paper, we deal with geospatial sampling bias by enriching the volunteer-collected data with geographical covariates, and then using regression-based models to correct for bias. We show that night sky brightness estimates change substantially after correction, and that the corrected inferences better represent an external satellite-derived measure of skyglow. We conclude that geospatial bias correction can greatly increase the scientific value of citizen science projects."}, "https://arxiv.org/abs/2302.06935": {"title": "Reference prior for Bayesian estimation of seismic fragility curves", "link": "https://arxiv.org/abs/2302.06935", "description": "One of the central quantities of probabilistic seismic risk assessment studies is the fragility curve, which represents the probability of failure of a mechanical structure conditional on a scalar measure derived from the seismic ground motion. Estimating such curves is a difficult task because, for many structures of interest, few data are available and the data are only binary; i.e., they indicate the state of the structure, failure or non-failure. This framework concerns complex equipment such as electrical devices encountered in industrial installations. In order to address this challenging framework, a wide range of methods in the literature rely on a parametric log-normal model. Bayesian approaches allow for efficient learning of the model parameters. 
However, the choice of the prior distribution has a non-negligible influence on the posterior distribution and, therefore, on any resulting estimate. We propose a thorough study of this parametric Bayesian estimation problem when the data are limited and binary. Using the reference prior theory as a support, we suggest an objective approach for the prior choice. This approach leads to the Jeffreys prior which is explicitly derived for this problem for the first time. The posterior distribution is proven to be proper (i.e., it integrates to unity) with the Jeffreys prior and improper with some classical priors from the literature. The posterior distribution with the Jeffreys prior is also shown to vanish at the boundaries of the parameters domain, so sampling the posterior distribution of the parameters does not produce anomalously small or large values. Therefore, this does not produce degenerate fragility curves such as unit-step functions and the Jeffreys prior leads to robust credibility intervals. The numerical results obtained on two different case studies, including an industrial case, illustrate the theoretical predictions."}, "https://arxiv.org/abs/2307.02096": {"title": "Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo", "link": "https://arxiv.org/abs/2307.02096", "description": "Hamiltonian Monte Carlo (HMC) is a powerful tool for Bayesian statistical inference due to its potential to rapidly explore high dimensional state space, avoiding the random walk behavior typical of many Markov Chain Monte Carlo samplers. The proper choice of the integrator of the Hamiltonian dynamics is key to the efficiency of HMC. It is becoming increasingly clear that multi-stage splitting integrators are a good alternative to the Verlet method, traditionally used in HMC. Here we propose a principled way of finding optimal, problem-specific integration schemes (in terms of the best conservation of energy for harmonic forces/Gaussian targets) within the families of 2- and 3-stage splitting integrators. The method, which we call Adaptive Integration Approach for statistics, or s-AIA, uses a multivariate Gaussian model and simulation data obtained at the HMC burn-in stage to identify a system-specific dimensional stability interval and assigns the most appropriate 2-/3-stage integrator for any user-chosen simulation step size within that interval. s-AIA has been implemented in the in-house software package HaiCS without introducing computational overheads in the simulations. The efficiency of the s-AIA integrators and their impact on the HMC accuracy, sampling performance and convergence are discussed in comparison with known fixed-parameter multi-stage splitting integrators (including Verlet). Numerical experiments on well-known statistical models show that the adaptive schemes reach the best possible performance within the family of 2-, 3-stage splitting schemes."}, "https://arxiv.org/abs/2307.15176": {"title": "RCT Rejection Sampling for Causal Estimation Evaluation", "link": "https://arxiv.org/abs/2307.15176", "description": "Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. 
In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation."}, "https://arxiv.org/abs/2311.11922": {"title": "Evaluating the Surrogate Index as a Decision-Making Tool Using 200 A/B Tests at Netflix", "link": "https://arxiv.org/abs/2311.11922", "description": "Surrogate index approaches have recently become a popular method of estimating longer-term impact from shorter-term outcomes. In this paper, we leverage 1098 test arms from 200 A/B tests at Netflix to empirically investigate to what degree decisions made using a surrogate index utilizing 14 days of data would align with those made using direct measurement of day 63 treatment effects. Focusing specifically on linear \"auto-surrogate\" models that utilize the shorter-term observations of the long-term outcome of interest, we find that the statistical inferences that we would draw from using the surrogate index are ~95% consistent with those from directly measuring the long-term treatment effect. Moreover, when we restrict ourselves to the set of tests that would be \"launched\" (i.e. positive and statistically significant) based on the 63-day directly measured treatment effects, we find that relying instead on the surrogate index achieves 79% and 65% recall."}, "https://arxiv.org/abs/2401.15778": {"title": "On the partial autocorrelation function for locally stationary time series: characterization, estimation and inference", "link": "https://arxiv.org/abs/2401.15778", "description": "For stationary time series, it is common to use plots of the partial autocorrelation function (PACF) or PACF-based tests to explore the temporal dependence structure of such processes. To the best of our knowledge, such analogs for non-stationary time series have not been fully established yet. In this paper, we fill this gap for locally stationary time series with short-range dependence. First, we characterize the PACF locally in the time domain and show that the $j$th PACF, denoted as $\\rho_{j}(t),$ decays with $j$ at a rate that is adaptive to the temporal dependence of the time series $\\{x_{i,n}\\}$. Second, at time $i,$ we justify that the PACF $\\rho_j(i/n)$ can be efficiently approximated by the best linear prediction coefficients via the Yule-Walker equations. This allows us to study the PACF via ordinary least squares (OLS) locally. 
Third, we show that the PACF is smooth in time for locally stationary time series. We use the sieve method with OLS to estimate $\\rho_j(\\cdot)$ and construct some statistics to test the PACFs and infer the structures of the time series. These tests generalize and modify those used for stationary time series. Finally, a multiplier bootstrap algorithm is proposed for practical implementation and an $\\mathtt R$ package $\\mathtt {Sie2nts}$ is provided to implement our algorithm. Numerical simulations and real data analysis also confirm usefulness of our results."}, "https://arxiv.org/abs/2402.00154": {"title": "Penalized G-estimation for effect modifier selection in the structural nested mean models for repeated outcomes", "link": "https://arxiv.org/abs/2402.00154", "description": "Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are \\textit{a priori} unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\\'e de Montr\\'eal."}, "https://arxiv.org/abs/2402.00164": {"title": "De-Biased Two-Sample U-Statistics With Application To Conditional Distribution Testing", "link": "https://arxiv.org/abs/2402.00164", "description": "In some high-dimensional and semiparametric inference problems involving two populations, the parameter of interest can be characterized by two-sample U-statistics involving some nuisance parameters. In this work we first extend the framework of one-step estimation with cross-fitting to two-sample U-statistics, showing that using an orthogonalized influence function can effectively remove the first order bias, resulting in asymptotically normal estimates of the parameter of interest. As an example, we apply this method and theory to the problem of testing two-sample conditional distributions, also known as strong ignorability. 
When combined with a conformal-based rank-sum test, we discover that the nuisance parameters can be divided into two categories, where in one category the nuisance estimation accuracy does not affect the testing validity, whereas in the other the nuisance estimation accuracy must satisfy the usual requirement for the test to be valid. We believe these findings provide further insights into and enhance the conformal inference toolbox."}, "https://arxiv.org/abs/2402.00183": {"title": "A review of regularised estimation methods and cross-validation in spatiotemporal statistics", "link": "https://arxiv.org/abs/2402.00183", "description": "This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \\geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix."}, "https://arxiv.org/abs/2402.00202": {"title": "Anytime-Valid Generalized Universal Inference on Risk Minimizers", "link": "https://arxiv.org/abs/2402.00202", "description": "A common goal in statistics and machine learning is estimation of unknowns. Point estimates alone are of little value without an accompanying measure of uncertainty, but traditional uncertainty quantification methods, such as confidence sets and p-values, often require strong distributional or structural assumptions that may not be justified in modern problems. The present paper considers a very common case in machine learning, where the quantity of interest is the minimizer of a given risk (expected loss) function. For such cases, we propose a generalized universal procedure for inference on risk minimizers that features a finite-sample, frequentist validity property under mild distributional assumptions. One version of the proposed procedure is shown to be anytime-valid in the sense that it maintains validity properties regardless of the stopping rule used for the data collection process. We show how this anytime-validity property offers protection against certain factors contributing to the replication crisis in science."}, "https://arxiv.org/abs/2402.00239": {"title": "Publication bias adjustment in network meta-analysis: an inverse probability weighting approach using clinical trial registries", "link": "https://arxiv.org/abs/2402.00239", "description": "Network meta-analysis (NMA) is a useful tool to compare multiple interventions simultaneously in a single meta-analysis, and it can be very helpful for medical decision making when the study aims to find the best therapy among several active candidates. However, the validity of its results is threatened by the publication bias issue. 
Existing methods to handle the publication bias issue in the standard pairwise meta-analysis are hard to extend to this area with the complicated data structure and the underlying assumptions for pooling the data. In this paper, we aimed to provide a flexible inverse probability weighting (IPW) framework along with several t-type selection functions to deal with the publication bias problem in the NMA context. To solve these proposed selection functions, we recommend making use of the additional information from the unpublished studies from multiple clinical trial registries. A comprehensive numerical study and a real example showed that our methodology can help obtain more accurate estimates and higher coverage probabilities, and improve other properties of an NMA (e.g., ranking the interventions)."}, "https://arxiv.org/abs/2402.00307": {"title": "Debiased Multivariable Mendelian Randomization", "link": "https://arxiv.org/abs/2402.00307", "description": "Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. However, MVMR faces greater challenges with weak instruments -- genetic variants that are weakly associated with some exposures conditional on the other exposures. This article focuses on MVMR using summary data from genome-wide association studies (GWAS). We provide a new asymptotic regime to analyze MVMR estimators with many weak instruments, allowing for linear combinations of exposures to have different degrees of instrument strength, and formally show that the popular multivariable inverse-variance weighted (MV-IVW) estimator's asymptotic behavior is highly sensitive to instruments' strength. We then propose a multivariable debiased IVW (MV-dIVW) estimator, which effectively reduces the asymptotic bias from weak instruments in MV-IVW, and introduce an adjusted version, MV-adIVW, for improved finite-sample robustness. We establish the theoretical properties of our proposed estimators and extend them to handle balanced horizontal pleiotropy. We conclude by demonstrating the performance of our proposed methods in simulated and real datasets. We implement this method in the R package mr.divw."}, "https://arxiv.org/abs/2402.00335": {"title": "Regression-Based Proximal Causal Inference", "link": "https://arxiv.org/abs/2402.00335", "description": "In observational studies, identification of causal effects is threatened by the potential for unmeasured confounding. Negative controls have become widely used to evaluate the presence of potential unmeasured confounding thus enhancing credibility of reported causal effect estimates. Going beyond simply testing for residual confounding, proximal causal inference (PCI) was recently developed to debias causal effect estimates subject to confounding by hidden factors, by leveraging a pair of negative control variables, also known as treatment and outcome confounding proxies. While formal statistical inference has been developed for PCI, these methods can be challenging to implement in practice as they involve solving complex integral equations that are typically ill-posed. 
In this paper, we develop a regression-based PCI approach, employing a two-stage regression via familiar generalized linear models to implement the PCI framework, which completely obviates the need to solve difficult integral equations. In the first stage, one fits a generalized linear model (GLM) for the outcome confounding proxy in terms of the treatment confounding proxy and the primary treatment. In the second stage, one fits a GLM for the primary outcome in terms of the primary treatment, using the predicted value of the first-stage regression model as a regressor which as we establish accounts for any residual confounding for which the proxies are relevant. The proposed approach has merit in that (i) it is applicable to continuous, count, and binary outcomes cases, making it relevant to a wide range of real-world applications, and (ii) it is easy to implement using off-the-shelf software for GLMs. We establish the statistical properties of regression-based PCI and illustrate their performance in both synthetic and real-world empirical applications."}, "https://arxiv.org/abs/2402.00512": {"title": "A goodness-of-fit test for regression models with spatially correlated errors", "link": "https://arxiv.org/abs/2402.00512", "description": "The problem of assessing a parametric regression model in the presence of spatial correlation is addressed in this work. For that purpose, a goodness-of-fit test based on a $L_2$-distance comparing a parametric and a nonparametric regression estimators is proposed. Asymptotic properties of the test statistic, both under the null hypothesis and under local alternatives, are derived. Additionally, a bootstrap procedure is designed to calibrate the test in practice. Finite sample performance of the test is analyzed through a simulation study, and its applicability is illustrated using a real data example."}, "https://arxiv.org/abs/2402.00597": {"title": "An efficient multivariate volatility model for many assets", "link": "https://arxiv.org/abs/2402.00597", "description": "This paper develops a flexible and computationally efficient multivariate volatility model, which allows for dynamic conditional correlations and volatility spillover effects among financial assets. The new model has desirable properties such as identifiability and computational tractability for many assets. A sufficient condition of the strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new model with and without low-rank constraints on the coefficient matrices respectively, and the asymptotic properties for both estimators are established. Moreover, a Bayesian information criterion with selection consistency is developed for order selection, and the testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for small and moderate dimensions. The usefulness of the new model and its inference tools is illustrated by two empirical examples for 5 stock markets and 17 industry portfolios, respectively."}, "https://arxiv.org/abs/2402.00610": {"title": "Multivariate ordinal regression for multiple repeated measurements", "link": "https://arxiv.org/abs/2402.00610", "description": "In this paper we propose a multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects. 
This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, where we distinguish between the correlations at a single point in time and the persistence over time. The error distribution is assumed to be normal or Student t distributed. The estimation is performed using composite likelihood methods. We perform several simulation exercises to investigate the quality of the estimates in different settings as well as in comparison with a Bayesian approach. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. We also introduce R package mvordflex and illustrate how this implementation can be used to estimate the proposed model in a user-friendly, convenient way. Finally, we illustrate the framework on a data set containing firm failure and credit ratings information from the rating agencies S&P and Moody's for US listed companies."}, "https://arxiv.org/abs/2402.00651": {"title": "Skew-elliptical copula based mixed models for non-Gaussian longitudinal data with application to HIV-AIDS study", "link": "https://arxiv.org/abs/2402.00651", "description": "This work has been motivated by a longitudinal data set on HIV CD4 T+ cell counts from Livingstone district, Zambia. The corresponding histogram plots indicate lack of symmetry in the marginal distributions and the pairwise scatter plots show non-elliptical dependence patterns. The standard linear mixed model for longitudinal data fails to capture these features. Thus it seems appropriate to consider a more general framework for modeling such data. In this article, we consider generalized linear mixed models (GLMM) for the marginals (e.g. Gamma mixed model), and temporal dependency of the repeated measurements is modeled by the copula corresponding to some skew-elliptical distributions (like skew-normal/skew-t). Our proposed class of copula based mixed models simultaneously takes into account asymmetry, between-subject variability and non-standard temporal dependence, and hence can be considered extensions to the standard linear mixed model based on multivariate normality. We estimate the model parameters using the IFM (inference function of margins) method, and also describe how to obtain standard errors of the parameter estimates. We investigate the finite sample performance of our procedure with extensive simulation studies involving skewed and symmetric marginal distributions and several choices of the copula. We finally apply our models to the HIV data set and report the findings."}, "https://arxiv.org/abs/2402.00668": {"title": "Factor copula models for non-Gaussian longitudinal data", "link": "https://arxiv.org/abs/2402.00668", "description": "This article presents factor copula approaches to model temporal dependency of non- Gaussian (continuous/discrete) longitudinal data. Factor copula models are canonical vine copulas which explain the underlying dependence structure of a multivariate data through latent variables, and therefore can be easily interpreted and implemented to unbalanced longitudinal data. We develop regression models for continuous, binary and ordinal longitudinal data including covariates, by using factor copula constructions with subject-specific latent variables. 
Considering homogeneous within-subject dependence, our proposed models allow for feasible parametric inference in moderate to high dimensional situations, using the two-stage (IFM) estimation method. We assess the finite sample performance of the proposed models with extensive simulation studies. In the empirical analysis, the proposed models are applied to analyse different longitudinal responses of two real-world data sets. Moreover, we compare the performances of these models with some widely used random effect models using standard model selection techniques and find substantial improvements. Our studies suggest that factor copula models can be good alternatives to random effect models and can provide better insights into the temporal dependency of longitudinal data of arbitrary nature."}, "https://arxiv.org/abs/2402.00778": {"title": "Robust Sufficient Dimension Reduction via $\\alpha$-Distance Covariance", "link": "https://arxiv.org/abs/2402.00778", "description": "We introduce a novel sufficient dimension-reduction (SDR) method which is robust against outliers using $\\alpha$-distance covariance (dCov) in dimension-reduction problems. Under very mild conditions on the predictors, the central subspace is effectively estimated, and the method retains a model-free advantage, requiring no estimation of the link function, based on the projection onto the Stiefel manifold. We establish the convergence property of the proposed estimation under some regularity conditions. We compare the performance of our method with existing SDR methods by simulation and real data analysis and show that our algorithm improves the computational efficiency and effectiveness."}, "https://arxiv.org/abs/2402.00814": {"title": "Optimal monotone conditional error functions", "link": "https://arxiv.org/abs/2402.00814", "description": "This note presents a method that provides optimal monotone conditional error functions for a large class of adaptive two stage designs. The presented method builds on a previously developed general theory for optimal adaptive two stage designs where sample sizes are reassessed for a specific conditional power and the goal is to minimize the expected sample size. The previous theory can easily lead to a non-monotone conditional error function, which is highly undesirable for logical reasons and can harm type I error rate control for composite null hypotheses. The method presented here extends the existing theory by introducing intermediate monotonising steps that can easily be implemented."}, "https://arxiv.org/abs/2402.00072": {"title": "Explainable AI for survival analysis: a median-SHAP approach", "link": "https://arxiv.org/abs/2402.00072", "description": "With the adoption of machine learning into routine clinical practice comes the need for Explainable AI methods tailored to medical applications. Shapley values have sparked wide interest for locally explaining models. Here, we demonstrate their interpretation strongly depends on both the summary statistic and the estimator for it, which in turn define what we identify as an 'anchor point'. 
We show that the convention of using a mean anchor point may generate misleading interpretations for survival analysis and introduce median-SHAP, a method for explaining black-box models predicting individual survival times."}, "https://arxiv.org/abs/2402.00077": {"title": "Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions", "link": "https://arxiv.org/abs/2402.00077", "description": "Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging such multi-institutional sequencing data presents significant challenges. Variations in gene panels result in loss of information when the analysis is conducted on common gene sets. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features to preserve information beyond common genes and maximize the utilization of all available data while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data."}, "https://arxiv.org/abs/2402.00168": {"title": "Continuous Treatment Effects with Surrogate Outcomes", "link": "https://arxiv.org/abs/2402.00168", "description": "In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully-observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. 
Extensive simulations show our methods enjoy appealing empirical performance."}, "https://arxiv.org/abs/2402.00358": {"title": "nhppp: Simulating Nonhomogeneous Poisson Point Processes in R", "link": "https://arxiv.org/abs/2402.00358", "description": "We introduce the `nhppp' package for simulating events from one dimensional non-homogeneous Poisson point processes (NHPPPs) in R. Its functions are based on three algorithms that provably sample from a target NHPPP: the time-transformation of a homogeneous Poisson process (of intensity one) via the inverse of the integrated intensity function; the generation of a Poisson number of order statistics from a fixed density function; and the thinning of a majorizing NHPPP via an acceptance-rejection scheme. We present a study of numerical accuracy and time performance of the algorithms and advice on which algorithm to prefer in each situation. Functions available in the package are illustrated with simple reproducible examples."}, "https://arxiv.org/abs/2402.00396": {"title": "Efficient Exploration for LLMs", "link": "https://arxiv.org/abs/2402.00396", "description": "We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles."}, "https://arxiv.org/abs/2402.00715": {"title": "Intent Assurance using LLMs guided by Intent Drift", "link": "https://arxiv.org/abs/2402.00715", "description": "Intent-Based Networking (IBN) presents a paradigm shift for network management, by promising to align intents and business objectives with network operations--in an automated manner. However, its practical realization is challenging: 1) processing intents, i.e., translate, decompose and identify the logic to fulfill the intent, and 2) intent conformance, that is, considering dynamic networks, the logic should be adequately adapted to assure intents. To address the latter, intent assurance is tasked with continuous verification and validation, including taking the necessary actions to align the operational and target states. In this paper, we define an assurance framework that allows us to detect and act when intent drift occurs. To do so, we leverage AI-driven policies, generated by Large Language Models (LLMs) which can quickly learn the necessary in-context requirements, and assist with the fulfillment and assurance of intents."}, "https://arxiv.org/abs/1911.12430": {"title": "Propensity score matching for estimating a marginal hazard ratio", "link": "https://arxiv.org/abs/1911.12430", "description": "Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. 
We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching."}, "https://arxiv.org/abs/2008.08176": {"title": "New Goodness-of-Fit Tests for Time Series Models", "link": "https://arxiv.org/abs/2008.08176", "description": "This article proposes omnibus portmanteau tests for contrasting adequacy of time series models. The test statistics are based on combining the autocorrelation function of the conditional residuals, the autocorrelation function of the conditional squared residuals, and the cross-correlation function between these residuals and their squares. The maximum likelihood estimator is used to derive the asymptotic distribution of the proposed test statistics under a general class of time series models, including ARMA, GARCH, and other nonlinear structures. An extensive Monte Carlo simulation study shows that the proposed tests successfully control the type I error probability and tend to have more power than other competitor tests in many scenarios. Two applications to a set of weekly stock returns for 92 companies from the S&P 500 demonstrate the practical use of the proposed tests."}, "https://arxiv.org/abs/2010.06103": {"title": "Quasi-maximum Likelihood Inference for Linear Double Autoregressive Models", "link": "https://arxiv.org/abs/2010.06103", "description": "This paper investigates the quasi-maximum likelihood inference including estimation, model selection and diagnostic checking for linear double autoregressive (DAR) models, where all asymptotic properties are established under only fractional moment of the observed process. We propose a Gaussian quasi-maximum likelihood estimator (G-QMLE) and an exponential quasi-maximum likelihood estimator (E-QMLE) for the linear DAR model, and establish the consistency and asymptotic normality for both estimators. Based on the G-QMLE and E-QMLE, two Bayesian information criteria are proposed for model selection, and two mixed portmanteau tests are constructed to check the adequacy of fitted models. Moreover, we compare the proposed G-QMLE and E-QMLE with the existing doubly weighted quantile regression estimator in terms of the asymptotic efficiency and numerical performance. Simulation studies illustrate the finite-sample performance of the proposed inference tools, and a real example on the Bitcoin return series shows the usefulness of the proposed inference tools."}, "https://arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "https://arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for detecting localized regions in data sequences which each must contain a change-point in the median, at a prescribed global significance level. RNSP works by fitting the postulated constant model over many regions of the data using a new sign-multiresolution sup-norm-type loss, and greedily identifying the shortest intervals on which the constancy is significantly violated. By working with the signs of the data around fitted model candidates, RNSP fulfils its coverage promises under minimal assumptions, requiring only sign-symmetry and serial independence of the signs of the true residuals. In particular, it permits their heterogeneity and arbitrarily heavy tails. 
The intervals of significance returned by RNSP have a finite-sample character, are unconditional in nature and do not rely on any assumptions on the true signal. Code implementing RNSP is available at https://github.com/pfryz/nsp."}, "https://arxiv.org/abs/2301.10387": {"title": "Mesh-clustered Gaussian process emulator for partial differential equation boundary value problems", "link": "https://arxiv.org/abs/2301.10387", "description": "Partial differential equations (PDEs) have become an essential tool for modeling complex physical systems. Such equations are typically solved numerically via mesh-based methods, such as finite element methods, with solutions over the spatial domain. However, obtaining these solutions is often prohibitively costly, limiting the feasibility of exploring parameters in PDEs. In this paper, we propose an efficient emulator that simultaneously predicts the solutions over the spatial domain, with theoretical justification of its uncertainty quantification. The novelty of the proposed method lies in the incorporation of the mesh node coordinates into the statistical model. In particular, the proposed method segments the mesh nodes into multiple clusters via a Dirichlet process prior and fits Gaussian process models with the same hyperparameters in each of them. Most importantly, by revealing the underlying clustering structures, the proposed method can provide valuable insights into qualitative features of the resulting dynamics that can be used to guide further investigations. Real examples are demonstrated to show that our proposed method has smaller prediction errors than its main competitors, with competitive computation time, and identifies interesting clusters of mesh nodes that possess physical significance, such as satisfying boundary conditions. An R package for the proposed methodology is provided in an open repository."}, "https://arxiv.org/abs/2304.07312": {"title": "Stochastic Actor Oriented Model with Random Effects", "link": "https://arxiv.org/abs/2304.07312", "description": "The stochastic actor oriented model (SAOM) is a method for modelling social interactions and social behaviour over time. It can be used to model drivers of dynamic interactions using both exogenous covariates and endogenous network configurations, but also the co-evolution of behaviour and social interactions. In its standard implementations, it assumes that all individuals have the same interaction evaluation function. This lack of heterogeneity is one of its limitations. The aim of this paper is to extend the inference framework for the SAOM to include random effects, so that the heterogeneity of individuals can be modeled more accurately.\n We decompose the linear evaluation function that models the probability of forming or removing a tie from the network into a homogeneous fixed part and a random, individual-specific part. We extend the Robbins-Monro algorithm to the estimation of the variance of the random parameters. Our method is applicable to general random effect formulations. We illustrate the method with a random out-degree model and show the parameter estimation of the random components, significance tests and model evaluation. We apply the method to Kapferer's Tailor shop study. 
It is shown that a random out-degree constitutes a serious alternative to including transitivity and higher-order dependency effects."}, "https://arxiv.org/abs/2305.03780": {"title": "Boldness-Recalibration for Binary Event Predictions", "link": "https://arxiv.org/abs/2305.03780", "description": "Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening hockey predictions range from .26-.78 to .10-.91)."}, "https://arxiv.org/abs/2308.10231": {"title": "Static and Dynamic BART for Rank-Order Data", "link": "https://arxiv.org/abs/2308.10231", "description": "Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams."}, "https://arxiv.org/abs/2309.08494": {"title": "Modeling Data Analytic Iteration With Probabilistic Outcome Sets", "link": "https://arxiv.org/abs/2309.08494", "description": "In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. 
In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria--expected information gain and anomaly information gain--to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis."}, "https://arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "https://arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with potential confounding by time trends. Traditional frequentist methods can fail to provide adequate coverage of the intervention's true effect using confidence intervals, whereas Bayesian approaches show potential for better coverage of intervention effects. However, Bayesian methods have seen limited development in SWCRTs. We propose two novel Bayesian hierarchical penalized spline models for SWCRTs. The first model is for SWCRTs involving many clusters and time periods, focusing on immediate intervention effects. To evaluate its efficacy, we compared this model to traditional frequentist methods. We further developed the model to estimate time-varying intervention effects. We conducted a comparative analysis of this Bayesian spline model against an existing Bayesian monotone effect curve model. The proposed models are applied in the Primary Palliative Care for Emergency Medicine stepped wedge trial to evaluate the effectiveness of primary palliative care intervention. Extensive simulations and a real-world application demonstrate the strengths of the proposed Bayesian models. The Bayesian immediate effect model consistently achieves near the frequentist nominal coverage probability for true intervention effect, providing more reliable interval estimations than traditional frequentist models, while maintaining high estimation accuracy. The proposed Bayesian time-varying effect model exhibits advancements over the existing Bayesian monotone effect curve model in terms of improved accuracy and reliability. To the best of our knowledge, this is the first development of Bayesian hierarchical spline modeling for SWCRTs. The proposed models offer an accurate and robust analysis of intervention effects. Their application could lead to effective adjustments in intervention strategies."}, "https://arxiv.org/abs/2401.08224": {"title": "Privacy Preserving Adaptive Experiment Design", "link": "https://arxiv.org/abs/2401.08224", "description": "Adaptive experiment is widely adopted to estimate conditional average treatment effect (CATE) in clinical trials and many other scenarios. 
While the primary goal of the experiment is to maximize estimation accuracy, due to the imperative of social welfare, it's also crucial to provide treatments with superior outcomes to patients, which is measured by regret in the contextual bandit framework. These two objectives often lead to contrasting optimal allocation mechanisms. Furthermore, privacy concerns arise in clinical scenarios containing sensitive data like patients' health records. Therefore, it's essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the tradeoff between loss of social welfare and statistical power in contextual bandit experiments. We propose matched upper and lower bounds for the multi-objective optimization problem, and then adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms which still match the lower bound, showing that privacy is \"almost free\". Additionally, we derive the asymptotic normality of the estimator, which is essential in statistical inference and hypothesis testing."}, "https://arxiv.org/abs/2211.01345": {"title": "Generative machine learning methods for multivariate ensemble post-processing", "link": "https://arxiv.org/abs/2211.01345", "description": "Ensemble weather forecasts based on multiple runs of numerical weather prediction models typically show systematic errors and require post-processing to obtain reliable forecasts. Accurately modeling multivariate dependencies is crucial in many practical applications, and various approaches to multivariate post-processing have been proposed where ensemble predictions are first post-processed separately in each margin and multivariate dependencies are then restored via copulas. These two-step methods share common key limitations, in particular the difficulty of including additional predictors in modeling the dependencies. We propose a novel multivariate post-processing method based on generative machine learning to address these challenges. In this new class of nonparametric data-driven distributional regression models, samples from the multivariate forecast distribution are directly obtained as output of a generative neural network. The generative model is trained by optimizing a proper scoring rule which measures the discrepancy between the generated and observed data, conditional on exogenous input variables. Our method does not require parametric assumptions on univariate distributions or multivariate dependencies and allows for incorporating arbitrary predictors. In two case studies on multivariate temperature and wind speed forecasting at weather stations over Germany, our generative model shows significant improvements over state-of-the-art methods and particularly improves the representation of spatial dependencies."}, "https://arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "https://arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this paper we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. 
We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent, heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. Finally, we apply our methodology to real and simulated data, demonstrating some effects not identifiable from the model with discrete community memberships."}, "https://arxiv.org/abs/2212.11880": {"title": "Parameter Inference based on Gaussian Processes Informed by Nonlinear Partial Differential Equations", "link": "https://arxiv.org/abs/2212.11880", "description": "Partial differential equations (PDEs) are widely used for the description of physical and engineering phenomena. Some key parameters involved in PDEs, which represent certain physical properties with important scientific interpretations, are difficult or even impossible to measure directly. Estimating these parameters from noisy and sparse experimental data of related physical quantities is an important task. Many methods for PDE parameter inference involve a large number of evaluations for numerical solutions to PDE through algorithms such as the finite element method, which can be time-consuming, especially for nonlinear PDEs. In this paper, we propose a novel method for the inference of unknown parameters in PDEs, called the PDE-Informed Gaussian Process (PIGP) based parameter inference method. Through modeling the PDE solution as a Gaussian process (GP), we derive the manifold constraints induced by the (linear) PDE structure such that, under the constraints, the GP satisfies the PDE. For nonlinear PDEs, we propose an augmentation method that transforms the nonlinear PDE into an equivalent PDE system linear in all derivatives, which our PIGP-based method can handle. The proposed method can be applied to a broad spectrum of nonlinear PDEs. The PIGP-based method can be applied to multi-dimensional PDE systems and PDE systems with unobserved components. Like conventional Bayesian approaches, the method can provide uncertainty quantification for both the unknown parameters and the PDE solution. The PIGP-based method also completely bypasses the numerical solver for PDEs. The proposed method is demonstrated through several application examples from different areas."}, "https://rss.arxiv.org/abs/2402.00154": {"title": "Penalized G-estimation for effect modifier selection in the structural nested mean models for repeated outcomes", "link": "https://rss.arxiv.org/abs/2402.00154", "description": "Effect modification occurs when the impact of the treatment on an outcome varies based on the levels of other covariates known as effect modifiers. Modeling of these effect differences is important for etiological goals and for purposes of optimizing treatment. 
Structural nested mean models (SNMMs) are useful causal models for estimating the potentially heterogeneous effect of a time-varying exposure on the mean of an outcome in the presence of time-varying confounding. In longitudinal health studies, information on many demographic, behavioural, biological, and clinical covariates may be available, among which some might cause heterogeneous treatment effects. A data-driven approach for selecting the effect modifiers of an exposure may be necessary if these effect modifiers are \\textit{a priori} unknown and need to be identified. Although variable selection techniques are available in the context of estimating conditional average treatment effects using marginal structural models, or in the context of estimating optimal dynamic treatment regimens, all of these methods consider an outcome measured at a single point in time. In the context of an SNMM for repeated outcomes, we propose a doubly robust penalized G-estimator for the causal effect of a time-varying exposure with a simultaneous selection of effect modifiers and prove the oracle property of our estimator. We conduct a simulation study to evaluate the performance of the proposed estimator in finite samples and for verification of its double-robustness property. Our work is motivated by a study of hemodiafiltration for treating patients with end-stage renal disease at the Centre Hospitalier de l'Universit\\'e de Montr\\'eal."}, "https://rss.arxiv.org/abs/2402.00164": {"title": "De-Biased Two-Sample U-Statistics With Application To Conditional Distribution Testing", "link": "https://rss.arxiv.org/abs/2402.00164", "description": "In some high-dimensional and semiparametric inference problems involving two populations, the parameter of interest can be characterized by two-sample U-statistics involving some nuisance parameters. In this work we first extend the framework of one-step estimation with cross-fitting to two-sample U-statistics, showing that using an orthogonalized influence function can effectively remove the first order bias, resulting in asymptotically normal estimates of the parameter of interest. As an example, we apply this method and theory to the problem of testing two-sample conditional distributions, also known as strong ignorability. When combined with a conformal-based rank-sum test, we discover that the nuisance parameters can be divided into two categories, where in one category the nuisance estimation accuracy does not affect the testing validity, whereas in the other the nuisance estimation accuracy must satisfy the usual requirement for the test to be valid. We believe these findings provide further insights into and enhance the conformal inference toolbox."}, "https://rss.arxiv.org/abs/2402.00183": {"title": "A review of regularised estimation methods and cross-validation in spatiotemporal statistics", "link": "https://rss.arxiv.org/abs/2402.00183", "description": "This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \\geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. 
Finally, guided by the elements of a mixed-effects model, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix."}, "https://rss.arxiv.org/abs/2402.00202": {"title": "Anytime-Valid Generalized Universal Inference on Risk Minimizers", "link": "https://rss.arxiv.org/abs/2402.00202", "description": "A common goal in statistics and machine learning is estimation of unknowns. Point estimates alone are of little value without an accompanying measure of uncertainty, but traditional uncertainty quantification methods, such as confidence sets and p-values, often require strong distributional or structural assumptions that may not be justified in modern problems. The present paper considers a very common case in machine learning, where the quantity of interest is the minimizer of a given risk (expected loss) function. For such cases, we propose a generalized universal procedure for inference on risk minimizers that features a finite-sample, frequentist validity property under mild distributional assumptions. One version of the proposed procedure is shown to be anytime-valid in the sense that it maintains validity properties regardless of the stopping rule used for the data collection process. We show how this anytime-validity property offers protection against certain factors contributing to the replication crisis in science."}, "https://rss.arxiv.org/abs/2402.00239": {"title": "Publication bias adjustment in network meta-analysis: an inverse probability weighting approach using clinical trial registries", "link": "https://rss.arxiv.org/abs/2402.00239", "description": "Network meta-analysis (NMA) is a useful tool to compare multiple interventions simultaneously in a single meta-analysis, and it can be very helpful for medical decision making when the study aims to find the best therapy among several active candidates. However, the validity of its results is threatened by the publication bias issue. Existing methods to handle the publication bias issue in the standard pairwise meta-analysis are hard to extend to this setting because of the complicated data structure and the underlying assumptions for pooling the data. In this paper, we aimed to provide a flexible inverse probability weighting (IPW) framework along with several t-type selection functions to deal with the publication bias problem in the NMA context. To solve these proposed selection functions, we recommend making use of the additional information from the unpublished studies from multiple clinical trial registries. A comprehensive numerical study and a real example showed that our methodology can help obtain more accurate estimates and higher coverage probabilities, and improve other properties of an NMA (e.g., ranking the interventions)."}, "https://rss.arxiv.org/abs/2402.00307": {"title": "Debiased Multivariable Mendelian Randomization", "link": "https://rss.arxiv.org/abs/2402.00307", "description": "Multivariable Mendelian randomization (MVMR) uses genetic variants as instrumental variables to infer the direct effect of multiple exposures on an outcome. Compared to univariable Mendelian randomization, MVMR is less prone to horizontal pleiotropy and enables estimation of the direct effect of each exposure on the outcome. 
However, MVMR faces greater challenges with weak instruments -- genetic variants that are weakly associated with some exposures conditional on the other exposures. This article focuses on MVMR using summary data from genome-wide association studies (GWAS). We provide a new asymptotic regime to analyze MVMR estimators with many weak instruments, allowing for linear combinations of exposures to have different degrees of instrument strength, and formally show that the popular multivariable inverse-variance weighted (MV-IVW) estimator's asymptotic behavior is highly sensitive to instruments' strength. We then propose a multivariable debiased IVW (MV-dIVW) estimator, which effectively reduces the asymptotic bias from weak instruments in MV-IVW, and introduce an adjusted version, MV-adIVW, for improved finite-sample robustness. We establish the theoretical properties of our proposed estimators and extend them to handle balanced horizontal pleiotropy. We conclude by demonstrating the performance of our proposed methods in simulated and real datasets. We implement this method in the R package mr.divw."}, "https://rss.arxiv.org/abs/2402.00335": {"title": "Regression-Based Proximal Causal Inference", "link": "https://rss.arxiv.org/abs/2402.00335", "description": "In observational studies, identification of causal effects is threatened by the potential for unmeasured confounding. Negative controls have become widely used to evaluate the presence of potential unmeasured confounding thus enhancing credibility of reported causal effect estimates. Going beyond simply testing for residual confounding, proximal causal inference (PCI) was recently developed to debias causal effect estimates subject to confounding by hidden factors, by leveraging a pair of negative control variables, also known as treatment and outcome confounding proxies. While formal statistical inference has been developed for PCI, these methods can be challenging to implement in practice as they involve solving complex integral equations that are typically ill-posed. In this paper, we develop a regression-based PCI approach, employing a two-stage regression via familiar generalized linear models to implement the PCI framework, which completely obviates the need to solve difficult integral equations. In the first stage, one fits a generalized linear model (GLM) for the outcome confounding proxy in terms of the treatment confounding proxy and the primary treatment. In the second stage, one fits a GLM for the primary outcome in terms of the primary treatment, using the predicted value of the first-stage regression model as a regressor which as we establish accounts for any residual confounding for which the proxies are relevant. The proposed approach has merit in that (i) it is applicable to continuous, count, and binary outcomes cases, making it relevant to a wide range of real-world applications, and (ii) it is easy to implement using off-the-shelf software for GLMs. We establish the statistical properties of regression-based PCI and illustrate their performance in both synthetic and real-world empirical applications."}, "https://rss.arxiv.org/abs/2402.00512": {"title": "A goodness-of-fit test for regression models with spatially correlated errors", "link": "https://rss.arxiv.org/abs/2402.00512", "description": "The problem of assessing a parametric regression model in the presence of spatial correlation is addressed in this work. 
For that purpose, a goodness-of-fit test based on an $L_2$-distance comparing parametric and nonparametric regression estimators is proposed. Asymptotic properties of the test statistic, both under the null hypothesis and under local alternatives, are derived. Additionally, a bootstrap procedure is designed to calibrate the test in practice. Finite sample performance of the test is analyzed through a simulation study, and its applicability is illustrated using a real data example."}, "https://rss.arxiv.org/abs/2402.00597": {"title": "An efficient multivariate volatility model for many assets", "link": "https://rss.arxiv.org/abs/2402.00597", "description": "This paper develops a flexible and computationally efficient multivariate volatility model, which allows for dynamic conditional correlations and volatility spillover effects among financial assets. The new model has desirable properties such as identifiability and computational tractability for many assets. A sufficient condition for strict stationarity is derived for the new process. Two quasi-maximum likelihood estimation methods are proposed for the new model with and without low-rank constraints on the coefficient matrices, respectively, and the asymptotic properties for both estimators are established. Moreover, a Bayesian information criterion with selection consistency is developed for order selection, and the testing for volatility spillover effects is carefully discussed. The finite sample performance of the proposed methods is evaluated in simulation studies for small and moderate dimensions. The usefulness of the new model and its inference tools is illustrated by two empirical examples for 5 stock markets and 17 industry portfolios, respectively."}, "https://rss.arxiv.org/abs/2402.00610": {"title": "Multivariate ordinal regression for multiple repeated measurements", "link": "https://rss.arxiv.org/abs/2402.00610", "description": "In this paper we propose a multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects. This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, where we distinguish between the correlations at a single point in time and the persistence over time. The errors are assumed to be normal or Student t distributed. The estimation is performed using composite likelihood methods. We perform several simulation exercises to investigate the quality of the estimates in different settings as well as in comparison with a Bayesian approach. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. We also introduce the R package mvordflex and illustrate how this implementation can be used to estimate the proposed model in a user-friendly, convenient way. Finally, we illustrate the framework on a data set containing firm failure and credit ratings information from the rating agencies S&P and Moody's for US listed companies."}, "https://rss.arxiv.org/abs/2402.00651": {"title": "Skew-elliptical copula based mixed models for non-Gaussian longitudinal data with application to HIV-AIDS study", "link": "https://rss.arxiv.org/abs/2402.00651", "description": "This work has been motivated by a longitudinal data set on HIV CD4 T+ cell counts from Livingstone district, Zambia. 
The corresponding histogram plots indicate lack of symmetry in the marginal distributions and the pairwise scatter plots show non-elliptical dependence patterns. The standard linear mixed model for longitudinal data fails to capture these features. Thus it seems appropriate to consider a more general framework for modeling such data. In this article, we consider generalized linear mixed models (GLMM) for the marginals (e.g. Gamma mixed model), and temporal dependency of the repeated measurements is modeled by the copula corresponding to some skew-elliptical distributions (like skew-normal/skew-t). Our proposed class of copula based mixed models simultaneously takes into account asymmetry, between-subject variability and non-standard temporal dependence, and hence can be considered extensions of the standard linear mixed model based on multivariate normality. We estimate the model parameters using the IFM (inference functions for margins) method, and also describe how to obtain standard errors of the parameter estimates. We investigate the finite sample performance of our procedure with extensive simulation studies involving skewed and symmetric marginal distributions and several choices of the copula. We finally apply our models to the HIV data set and report the findings."}, "https://rss.arxiv.org/abs/2402.00668": {"title": "Factor copula models for non-Gaussian longitudinal data", "link": "https://rss.arxiv.org/abs/2402.00668", "description": "This article presents factor copula approaches to model temporal dependency of non-Gaussian (continuous/discrete) longitudinal data. Factor copula models are canonical vine copulas which explain the underlying dependence structure of multivariate data through latent variables, and can therefore be easily interpreted and applied to unbalanced longitudinal data. We develop regression models for continuous, binary and ordinal longitudinal data including covariates, by using factor copula constructions with subject-specific latent variables. Considering homogeneous within-subject dependence, our proposed models allow for feasible parametric inference in moderate to high dimensional situations, using a two-stage (IFM) estimation method. We assess the finite sample performance of the proposed models with extensive simulation studies. In the empirical analysis, the proposed models are applied to analyse different longitudinal responses of two real-world data sets. Moreover, we compare the performances of these models with some widely used random effect models using standard model selection techniques and find substantial improvements. Our studies suggest that factor copula models can be good alternatives to random effect models and can provide better insights into the temporal dependency of longitudinal data of arbitrary nature."}, "https://rss.arxiv.org/abs/2402.00778": {"title": "Robust Sufficient Dimension Reduction via $\\alpha$-Distance Covariance", "link": "https://rss.arxiv.org/abs/2402.00778", "description": "We introduce a novel sufficient dimension-reduction (SDR) method which is robust against outliers, using $\\alpha$-distance covariance (dCov) for dimension-reduction problems. Under very mild conditions on the predictors, the central subspace is effectively estimated, retaining a model-free advantage without estimating the link function, based on a projection onto the Stiefel manifold. We establish the convergence property of the proposed estimator under some regularity conditions. 
We compare the performance of our method with existing SDR methods by simulation and real data analysis and show that our algorithm improves computational efficiency and effectiveness."}, "https://rss.arxiv.org/abs/2402.00814": {"title": "Optimal monotone conditional error functions", "link": "https://rss.arxiv.org/abs/2402.00814", "description": "This note presents a method that provides optimal monotone conditional error functions for a large class of adaptive two-stage designs. The presented method builds on a previously developed general theory for optimal adaptive two-stage designs where sample sizes are reassessed for a specific conditional power and the goal is to minimize the expected sample size. The previous theory can easily lead to a non-monotone conditional error function, which is highly undesirable for logical reasons and can harm type I error rate control for composite null hypotheses. The method presented here extends the existing theory by introducing intermediate monotonising steps that can easily be implemented."}, "https://rss.arxiv.org/abs/2402.00072": {"title": "Explainable AI for survival analysis: a median-SHAP approach", "link": "https://rss.arxiv.org/abs/2402.00072", "description": "With the adoption of machine learning into routine clinical practice comes the need for Explainable AI methods tailored to medical applications. Shapley values have sparked wide interest for locally explaining models. Here, we demonstrate that their interpretation strongly depends on both the summary statistic and the estimator for it, which in turn define what we identify as an 'anchor point'. We show that the convention of using a mean anchor point may generate misleading interpretations for survival analysis and introduce median-SHAP, a method for explaining black-box models predicting individual survival times."}, "https://rss.arxiv.org/abs/2402.00077": {"title": "Unlocking the Power of Multi-institutional Data: Integrating and Harmonizing Genomic Data Across Institutions", "link": "https://rss.arxiv.org/abs/2402.00077", "description": "Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for learning real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging such multi-institutional sequencing data presents significant challenges. Variations in gene panels result in loss of information when the analysis is conducted on common gene sets. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features that preserve information beyond common genes and maximize the utilization of all available data, while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized and noise-reduced lower-dimensional latent variables, the true mutation pattern unique to each individual is captured. 
We assess the model's performance and parameter estimation through extensive simulation studies. The extracted latent features from the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data."}, "https://rss.arxiv.org/abs/2402.00168": {"title": "Continuous Treatment Effects with Surrogate Outcomes", "link": "https://rss.arxiv.org/abs/2402.00168", "description": "In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully-observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish asymptotic normality of the proposed estimator and show possible improvements in the variance compared with methods that solely use labeled data. Extensive simulations show that our methods enjoy appealing empirical performance."}, "https://rss.arxiv.org/abs/2402.00358": {"title": "nhppp: Simulating Nonhomogeneous Poisson Point Processes in R", "link": "https://rss.arxiv.org/abs/2402.00358", "description": "We introduce the `nhppp' package for simulating events from one-dimensional non-homogeneous Poisson point processes (NHPPPs) in R. Its functions are based on three algorithms that provably sample from a target NHPPP: the time-transformation of a homogeneous Poisson process (of intensity one) via the inverse of the integrated intensity function; the generation of a Poisson number of order statistics from a fixed density function; and the thinning of a majorizing NHPPP via an acceptance-rejection scheme. We present a study of numerical accuracy and time performance of the algorithms and advise on which algorithm to prefer in each situation. Functions available in the package are illustrated with simple reproducible examples."}, "https://rss.arxiv.org/abs/2402.00396": {"title": "Efficient Exploration for LLMs", "link": "https://rss.arxiv.org/abs/2402.00396", "description": "We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles."}, "https://rss.arxiv.org/abs/2402.00715": {"title": "Intent Assurance using LLMs guided by Intent Drift", "link": "https://rss.arxiv.org/abs/2402.00715", "description": "Intent-Based Networking (IBN) presents a paradigm shift for network management by promising to align intents and business objectives with network operations--in an automated manner. 
However, its practical realization is challenging: 1) processing intents, i.e., translating, decomposing and identifying the logic to fulfill the intent, and 2) intent conformance, that is, considering dynamic networks, the logic should be adequately adapted to assure intents. To address the latter, intent assurance is tasked with continuous verification and validation, including taking the necessary actions to align the operational and target states. In this paper, we define an assurance framework that allows us to detect and act when intent drift occurs. To do so, we leverage AI-driven policies generated by Large Language Models (LLMs), which can quickly learn the necessary in-context requirements and assist with the fulfillment and assurance of intents."}, "https://rss.arxiv.org/abs/1911.12430": {"title": "Propensity score matching for estimating a marginal hazard ratio", "link": "https://rss.arxiv.org/abs/1911.12430", "description": "Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching."}, "https://rss.arxiv.org/abs/2008.08176": {"title": "New Goodness-of-Fit Tests for Time Series Models", "link": "https://rss.arxiv.org/abs/2008.08176", "description": "This article proposes omnibus portmanteau tests for contrasting the adequacy of time series models. The test statistics are based on combining the autocorrelation function of the conditional residuals, the autocorrelation function of the conditional squared residuals, and the cross-correlation function between these residuals and their squares. The maximum likelihood estimator is used to derive the asymptotic distribution of the proposed test statistics under a general class of time series models, including ARMA, GARCH, and other nonlinear structures. An extensive Monte Carlo simulation study shows that the proposed tests successfully control the type I error probability and tend to have more power than other competitor tests in many scenarios. Two applications to a set of weekly stock returns for 92 companies from the S&P 500 demonstrate the practical use of the proposed tests."}, "https://rss.arxiv.org/abs/2010.06103": {"title": "Quasi-maximum Likelihood Inference for Linear Double Autoregressive Models", "link": "https://rss.arxiv.org/abs/2010.06103", "description": "This paper investigates quasi-maximum likelihood inference, including estimation, model selection and diagnostic checking, for linear double autoregressive (DAR) models, where all asymptotic properties are established under only a fractional moment of the observed process. We propose a Gaussian quasi-maximum likelihood estimator (G-QMLE) and an exponential quasi-maximum likelihood estimator (E-QMLE) for the linear DAR model, and establish the consistency and asymptotic normality for both estimators. Based on the G-QMLE and E-QMLE, two Bayesian information criteria are proposed for model selection, and two mixed portmanteau tests are constructed to check the adequacy of fitted models. 
Moreover, we compare the proposed G-QMLE and E-QMLE with the existing doubly weighted quantile regression estimator in terms of asymptotic efficiency and numerical performance. Simulation studies illustrate the finite-sample performance of the proposed inference tools, and a real example on the Bitcoin return series shows their usefulness in practice."}, "https://rss.arxiv.org/abs/2109.02487": {"title": "Robust Narrowest Significance Pursuit: Inference for multiple change-points in the median", "link": "https://rss.arxiv.org/abs/2109.02487", "description": "We propose Robust Narrowest Significance Pursuit (RNSP), a methodology for detecting localized regions in data sequences, each of which must contain a change-point in the median, at a prescribed global significance level. RNSP works by fitting the postulated constant model over many regions of the data using a new sign-multiresolution sup-norm-type loss, and greedily identifying the shortest intervals on which the constancy is significantly violated. By working with the signs of the data around fitted model candidates, RNSP fulfils its coverage promises under minimal assumptions, requiring only sign-symmetry and serial independence of the signs of the true residuals. In particular, it permits their heterogeneity and arbitrarily heavy tails. The intervals of significance returned by RNSP have a finite-sample character, are unconditional in nature and do not rely on any assumptions on the true signal. Code implementing RNSP is available at https://github.com/pfryz/nsp."}, "https://rss.arxiv.org/abs/2301.10387": {"title": "Mesh-clustered Gaussian process emulator for partial differential equation boundary value problems", "link": "https://rss.arxiv.org/abs/2301.10387", "description": "Partial differential equations (PDEs) have become an essential tool for modeling complex physical systems. Such equations are typically solved numerically via mesh-based methods, such as finite element methods, with solutions over the spatial domain. However, obtaining these solutions is often prohibitively costly, limiting the feasibility of exploring parameters in PDEs. In this paper, we propose an efficient emulator that simultaneously predicts the solutions over the spatial domain, with theoretical justification of its uncertainty quantification. The novelty of the proposed method lies in the incorporation of the mesh node coordinates into the statistical model. In particular, the proposed method segments the mesh nodes into multiple clusters via a Dirichlet process prior and fits Gaussian process models with the same hyperparameters in each of them. Most importantly, by revealing the underlying clustering structures, the proposed method can provide valuable insights into qualitative features of the resulting dynamics that can be used to guide further investigations. Real examples demonstrate that our proposed method has smaller prediction errors than its main competitors, with competitive computation time, and identifies interesting clusters of mesh nodes that possess physical significance, such as satisfying boundary conditions. An R package for the proposed methodology is provided in an open repository."}, "https://rss.arxiv.org/abs/2304.07312": {"title": "Stochastic Actor Oriented Model with Random Effects", "link": "https://rss.arxiv.org/abs/2304.07312", "description": "The stochastic actor oriented model (SAOM) is a method for modelling social interactions and social behaviour over time. 
It can be used to model drivers of dynamic interactions using both exogenous covariates and endogenous network configurations, as well as the co-evolution of behaviour and social interactions. In its standard implementations, it assumes that all individuals have the same interaction evaluation function. This lack of heterogeneity is one of its limitations. The aim of this paper is to extend the inference framework for the SAOM to include random effects, so that the heterogeneity of individuals can be modeled more accurately.\n We decompose the linear evaluation function that models the probability of forming or removing a tie from the network into a homogeneous fixed part and a random, individual-specific part. We extend the Robbins-Monro algorithm to the estimation of the variance of the random parameters. Our method is applicable to general random effect formulations. We illustrate the method with a random out-degree model and show parameter estimation of the random components, significance tests and model evaluation. We apply the method to Kapferer's Tailor shop study. It is shown that a random out-degree constitutes a serious alternative to including transitivity and higher-order dependency effects."}, "https://rss.arxiv.org/abs/2305.03780": {"title": "Boldness-Recalibration for Binary Event Predictions", "link": "https://rss.arxiv.org/abs/2305.03780", "description": "Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening hockey predictions range from .26-.78 to .10-.91)."}, "https://rss.arxiv.org/abs/2308.10231": {"title": "Static and Dynamic BART for Rank-Order Data", "link": "https://rss.arxiv.org/abs/2308.10231", "description": "Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. 
These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams."}, "https://rss.arxiv.org/abs/2309.08494": {"title": "Modeling Data Analytic Iteration With Probabilistic Outcome Sets", "link": "https://rss.arxiv.org/abs/2309.08494", "description": "In 1977 John Tukey described how in exploratory data analysis, data analysts use tools, such as data visualizations, to separate their expectations from what they observe. In contrast to statistical theory, an underappreciated aspect of data analysis is that a data analyst must make decisions by comparing the observed data or output from a statistical tool to what the analyst previously expected from the data. However, there is little formal guidance for how to make these data analytic decisions as statistical theory generally omits a discussion of who is using these statistical methods. In this paper, we propose a model for the iterative process of data analysis based on the analyst's expectations, using what we refer to as expected and anomaly probabilistic outcome sets, and the concept of statistical information gain. Here, we extend the basic idea of comparing an analyst's expectations to what is observed in a data visualization to more general analytic situations. Our model posits that the analyst's goal is to increase the amount of information the analyst has relative to what the analyst already knows, through successive analytic iterations. We introduce two criteria--expected information gain and anomaly information gain--to provide guidance about analytic decision-making and ultimately to improve the practice of data analysis. Finally, we show how our framework can be used to characterize common situations in practical data analysis."}, "https://rss.arxiv.org/abs/2401.03287": {"title": "Advancing Stepped Wedge Cluster Randomized Trials Analysis: Bayesian Hierarchical Penalized Spline Models for Immediate and Time-Varying Intervention Effects", "link": "https://rss.arxiv.org/abs/2401.03287", "description": "Stepped wedge cluster randomized trials (SWCRTs) often face challenges with potential confounding by time trends. Traditional frequentist methods can fail to provide adequate coverage of the intervention's true effect using confidence intervals, whereas Bayesian approaches show potential for better coverage of intervention effects. However, Bayesian methods have seen limited development in SWCRTs. We propose two novel Bayesian hierarchical penalized spline models for SWCRTs. The first model is for SWCRTs involving many clusters and time periods, focusing on immediate intervention effects. To evaluate its efficacy, we compared this model to traditional frequentist methods. We further developed the model to estimate time-varying intervention effects. We conducted a comparative analysis of this Bayesian spline model against an existing Bayesian monotone effect curve model. The proposed models are applied in the Primary Palliative Care for Emergency Medicine stepped wedge trial to evaluate the effectiveness of primary palliative care intervention. Extensive simulations and a real-world application demonstrate the strengths of the proposed Bayesian models. 
The Bayesian immediate effect model consistently achieves coverage probabilities near the frequentist nominal level for the true intervention effect, providing more reliable interval estimations than traditional frequentist models, while maintaining high estimation accuracy. The proposed Bayesian time-varying effect model exhibits advancements over the existing Bayesian monotone effect curve model in terms of improved accuracy and reliability. To the best of our knowledge, this is the first development of Bayesian hierarchical spline modeling for SWCRTs. The proposed models offer an accurate and robust analysis of intervention effects. Their application could lead to effective adjustments in intervention strategies."}, "https://rss.arxiv.org/abs/2401.08224": {"title": "Privacy Preserving Adaptive Experiment Design", "link": "https://rss.arxiv.org/abs/2401.08224", "description": "Adaptive experiments are widely adopted to estimate the conditional average treatment effect (CATE) in clinical trials and many other scenarios. While the primary goal of the experiment is to maximize estimation accuracy, due to the imperative of social welfare it is also crucial to provide patients with treatments that yield superior outcomes, which is measured by regret in the contextual bandit framework. These two objectives often lead to contrasting optimal allocation mechanisms. Furthermore, privacy concerns arise in clinical scenarios containing sensitive data like patients' health records. Therefore, it is essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the tradeoff between loss of social welfare and statistical power in contextual bandit experiments. We propose matched upper and lower bounds for the multi-objective optimization problem, and then adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms which still match the lower bound, showing that privacy is \"almost free\". Additionally, we derive the asymptotic normality of the estimator, which is essential in statistical inference and hypothesis testing."}, "https://rss.arxiv.org/abs/2401.16571": {"title": "Individualized Multi-Treatment Response Curves Estimation using RBF-net with Shared Neurons", "link": "https://rss.arxiv.org/abs/2401.16571", "description": "Heterogeneous treatment effect estimation is an important problem in precision medicine. Specific interests lie in identifying the differential effect of different treatments based on some external covariates. We propose a novel non-parametric treatment effect estimation method in a multi-treatment setting. Our non-parametric modeling of the response curves relies on radial basis function (RBF)-nets with shared hidden neurons. Our model thus facilitates modeling commonality among the treatment outcomes. The estimation and inference schemes are developed under a Bayesian framework and implemented via an efficient Markov chain Monte Carlo algorithm, appropriately accommodating uncertainty in all aspects of the analysis. The numerical performance of the method is demonstrated through simulation experiments. 
Applying our proposed method to MIMIC data, we obtain several interesting findings related to the impact of different treatment strategies on the length of ICU stay and 12-hour SOFA score for sepsis patients who are home-discharged."}, "https://rss.arxiv.org/abs/2211.01345": {"title": "Generative machine learning methods for multivariate ensemble post-processing", "link": "https://rss.arxiv.org/abs/2211.01345", "description": "Ensemble weather forecasts based on multiple runs of numerical weather prediction models typically show systematic errors and require post-processing to obtain reliable forecasts. Accurately modeling multivariate dependencies is crucial in many practical applications, and various approaches to multivariate post-processing have been proposed where ensemble predictions are first post-processed separately in each margin and multivariate dependencies are then restored via copulas. These two-step methods share common key limitations, in particular the difficulty to include additional predictors in modeling the dependencies. We propose a novel multivariate post-processing method based on generative machine learning to address these challenges. In this new class of nonparametric data-driven distributional regression models, samples from the multivariate forecast distribution are directly obtained as output of a generative neural network. The generative model is trained by optimizing a proper scoring rule which measures the discrepancy between the generated and observed data, conditional on exogenous input variables. Our method does not require parametric assumptions on univariate distributions or multivariate dependencies and allows for incorporating arbitrary predictors. In two case studies on multivariate temperature and wind speed forecasting at weather stations over Germany, our generative model shows significant improvements over state-of-the-art methods and particularly improves the representation of spatial dependencies."}, "https://rss.arxiv.org/abs/2212.08642": {"title": "Estimating Higher-Order Mixed Memberships via the $\\ell_{2,\\infty}$ Tensor Perturbation Bound", "link": "https://rss.arxiv.org/abs/2212.08642", "description": "Higher-order multiway data is ubiquitous in machine learning and statistics and often exhibits community-like structures, where each component (node) along each different mode has a community membership associated with it. In this paper we propose the tensor mixed-membership blockmodel, a generalization of the tensor blockmodel positing that memberships need not be discrete, but instead are convex combinations of latent communities. We establish the identifiability of our model and propose a computationally efficient estimation procedure based on the higher-order orthogonal iteration algorithm (HOOI) for tensor SVD composed with a simplex corner-finding algorithm. We then demonstrate the consistency of our estimation procedure by providing a per-node error bound, which showcases the effect of higher-order structures on estimation accuracy. To prove our consistency result, we develop the $\\ell_{2,\\infty}$ tensor perturbation bound for HOOI under independent, heteroskedastic, subgaussian noise that may be of independent interest. Our analysis uses a novel leave-one-out construction for the iterates, and our bounds depend only on spectral properties of the underlying low-rank tensor under nearly optimal signal-to-noise ratio conditions such that tensor SVD is computationally feasible. 
Finally, we apply our methodology to real and simulated data, demonstrating some effects not identifiable from the model with discrete community memberships."}, "https://rss.arxiv.org/abs/2212.11880": {"title": "Parameter Inference based on Gaussian Processes Informed by Nonlinear Partial Differential Equations", "link": "https://rss.arxiv.org/abs/2212.11880", "description": "Partial differential equations (PDEs) are widely used for the description of physical and engineering phenomena. Some key parameters involved in PDEs, which represent certain physical properties with important scientific interpretations, are difficult or even impossible to measure directly. Estimating these parameters from noisy and sparse experimental data of related physical quantities is an important task. Many methods for PDE parameter inference involve a large number of evaluations for numerical solutions to PDE through algorithms such as the finite element method, which can be time-consuming, especially for nonlinear PDEs. In this paper, we propose a novel method for the inference of unknown parameters in PDEs, called the PDE-Informed Gaussian Process (PIGP) based parameter inference method. Through modeling the PDE solution as a Gaussian process (GP), we derive the manifold constraints induced by the (linear) PDE structure such that, under the constraints, the GP satisfies the PDE. For nonlinear PDEs, we propose an augmentation method that transforms the nonlinear PDE into an equivalent PDE system linear in all derivatives, which our PIGP-based method can handle. The proposed method can be applied to a broad spectrum of nonlinear PDEs. The PIGP-based method can be applied to multi-dimensional PDE systems and PDE systems with unobserved components. Like conventional Bayesian approaches, the method can provide uncertainty quantification for both the unknown parameters and the PDE solution. The PIGP-based method also completely bypasses the numerical solver for PDEs. The proposed method is demonstrated through several application examples from different areas."}, "https://arxiv.org/abs/2402.01005": {"title": "The prices of renewable commodities: A robust stationarity analysis", "link": "https://arxiv.org/abs/2402.01005", "description": "This paper addresses the problem of testing for persistence in the effects of the shocks affecting the prices of renewable commodities, which have potential implications on stabilization policies and economic forecasting, among other areas. A robust methodology is employed that enables the determination of the potential presence and number of instant/gradual structural changes in the series, stationarity testing conditional on the number of changes detected, and the detection of change points. This procedure is applied to the annual real prices of eighteen renewable commodities over the period of 1900-2018. Results indicate that most of the series display non-linear features, including quadratic patterns and regime transitions that often coincide with well-known political and economic episodes. The conclusions of stationarity testing suggest that roughly half of the series are integrated. Stationarity fails to be rejected for grains, whereas most livestock and textile commodities do reject stationarity. Evidence is mixed in all soft commodities and tropical crops, where stationarity can be rejected in approximately half of the cases. 
The implication would be that for these commodities, stabilization schemes would not be recommended."}, "https://arxiv.org/abs/2402.01069": {"title": "Data-driven model selection within the matrix completion method for causal panel data models", "link": "https://arxiv.org/abs/2402.01069", "description": "Matrix completion estimators are employed in causal panel data models to regulate the rank of the underlying factor model using nuclear norm minimization. This convex optimization problem enables concurrent regularization of a potentially high-dimensional set of covariates to shrink the model size. For valid finite sample inference, we adopt a permutation-based approach and prove its validity for any treatment assignment mechanism. Simulations illustrate the consistency of the proposed estimator in parameter estimation and variable selection. An application to public health policies in Germany demonstrates the data-driven model selection feature on empirical data and finds no effect of travel restrictions on the containment of severe Covid-19 infections."}, "https://arxiv.org/abs/2402.01112": {"title": "Gerontologic Biostatistics 2", "link": "https://arxiv.org/abs/2402.01112", "description": "Background: Introduced in 2010, the sub-discipline of gerontologic biostatistics (GBS) was conceptualized to address the specific challenges in analyzing data from research studies involving older adults. However, the evolving technological landscape has catalyzed data science and statistical advancements since the original GBS publication, greatly expanding the scope of gerontologic research. There is a need to describe how these advancements enhance the analysis of multi-modal data and complex phenotypes that are hallmarks of gerontologic research. Methods: This paper introduces GBS 2.0, an updated and expanded set of analytical methods reflective of the practice of gerontologic biostatistics in contemporary and future research. Results: GBS 2.0 topics and relevant software resources include cutting-edge methods in experimental design; analytical techniques that include adaptations of machine learning, quantifying deep phenotypic measurements, high-dimensional -omics analysis; the integration of information from multiple studies, and strategies to foster reproducibility, replicability, and open science. Discussion: The methodological topics presented here seek to update and expand GBS. By facilitating the synthesis of biostatistics and data science in gerontology, we aim to foster the next generation of gerontologic researchers."}, "https://arxiv.org/abs/2402.01121": {"title": "Non-linear Mendelian randomization with Two-stage prediction estimation and Control function estimation", "link": "https://arxiv.org/abs/2402.01121", "description": "Most of the existing Mendelian randomization (MR) methods are limited by the assumption of linear causality between exposure and outcome, and the development of new non-linear MR methods is highly desirable. We introduce two-stage prediction estimation and control function estimation from econometrics to MR and extend them to non-linear causality. We give conditions for parameter identification and theoretically prove the consistency and asymptotic normality of the estimates. We compare the two methods theoretically under both linear and non-linear causality. We also extend the control function estimation to a more flexible semi-parametric framework without detailed parametric specifications of causality. 
Extensive simulations numerically corroborate our theoretical results. Application to UK Biobank data reveals non-linear causal relationships between sleep duration and systolic/diastolic blood pressure."}, "https://arxiv.org/abs/2402.01381": {"title": "Spatial-Sign based Maxsum Test for High Dimensional Location Parameters", "link": "https://arxiv.org/abs/2402.01381", "description": "In this study, we explore a robust testing procedure for the high-dimensional location parameters testing problem. Initially, we introduce a spatial-sign based max-type test statistic, which exhibits excellent performance for sparse alternatives. Subsequently, we demonstrate the asymptotic independence between this max-type test statistic and the spatial-sign based sum-type test statistic (Feng and Sun, 2016). Building on this, we propose a spatial-sign based max-sum type testing procedure, which shows remarkable performance under varying signal sparsity. Our simulation studies underscore the superior performance of the procedures we propose."}, "https://arxiv.org/abs/2402.01398": {"title": "penalizedclr: an R package for penalized conditional logistic regression for integration of multiple omics layers", "link": "https://arxiv.org/abs/2402.01398", "description": "The matched case-control design, up until recently mostly pertinent to epidemiological studies, is becoming customary in biomedical applications as well. For instance, in omics studies, it is quite common to compare cancer and healthy tissue from the same patient. Furthermore, researchers today routinely collect data from various and variable sources that they wish to relate to the case-control status. This highlights the need to develop and implement statistical methods that can take these tendencies into account. We present an R package, penalizedclr, which provides an implementation of the penalized conditional logistic regression model for analyzing matched case-control studies. It allows for different penalties for different blocks of covariates, and it is therefore particularly useful in the presence of multi-source omics data. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection for variable selection in the considered regression model. The proposed method fills a gap in the available software for fitting high-dimensional conditional logistic regression models accounting for the matched design and block structure of predictors/features. The output consists of a set of selected variables that are significantly associated with case-control status. These features can then be investigated in terms of functional interpretation or validation in further, more targeted studies."}, "https://arxiv.org/abs/2402.01491": {"title": "Moving Aggregate Modified Autoregressive Copula-Based Time Series Models (MAGMAR-Copulas) Without Markov Restriction", "link": "https://arxiv.org/abs/2402.01491", "description": "Copula-based time series models implicitly assume a finite Markov order. In reality, a time series may not satisfy the Markov property. We modify the copula-based time series models by introducing a moving aggregate (MAG) part into the model updating equation. The functional form of the MAG-part is given as the inverse of a conditional copula. The resulting MAG-modified Autoregressive Copula-Based Time Series model (MAGMAR-Copula) is discussed in detail and distributional properties are derived in a D-vine framework. The model nests the classical ARMA model as well as the copula-based time series model. 
The modeling performance is compared with that of the model from \\cite{mcneil2022time} for modeling US inflation. Our model is competitive in terms of information criteria. It is a generalization of both ARMA and copula-based time series models and is similar in spirit to other moving average time series models such as ARMA and GARCH."}, "https://arxiv.org/abs/2402.01635": {"title": "kNN Algorithm for Conditional Mean and Variance Estimation with Automated Uncertainty Quantification and Variable Selection", "link": "https://arxiv.org/abs/2402.01635", "description": "In this paper, we introduce a kNN-based regression method that synergizes the scalability and adaptability of traditional non-parametric kNN models with a novel variable selection technique. This method focuses on accurately estimating the conditional mean and variance of random response variables, thereby effectively characterizing conditional distributions across diverse scenarios. Our approach incorporates a robust uncertainty quantification mechanism, leveraging our prior estimation work on conditional mean and variance. The employment of kNN ensures scalable computational efficiency in predicting intervals and statistical accuracy in line with optimal non-parametric rates. Additionally, we introduce a new kNN semi-parametric algorithm for estimating ROC curves, accounting for covariates. For selecting the smoothing parameter k, we propose an algorithm with theoretical guarantees. Incorporation of variable selection enhances the performance of the method significantly over conventional kNN techniques in various modeling tasks. We validate the approach through simulations in low, moderate, and high-dimensional covariate spaces. The algorithm's effectiveness is particularly notable in biomedical applications as demonstrated in two case studies. Concluding with a theoretical analysis, we highlight the consistency and convergence rate of our method over traditional kNN models, particularly when the underlying regression model takes values in a low-dimensional space."}, "https://arxiv.org/abs/2402.00907": {"title": "AlphaRank: An Artificial Intelligence Approach for Ranking and Selection Problems", "link": "https://arxiv.org/abs/2402.00907", "description": "We introduce AlphaRank, an artificial intelligence approach to address fixed-budget ranking and selection (R&S) problems. We formulate the sequential sampling decision as a Markov decision process and propose a Monte Carlo simulation-based rollout policy that utilizes classic R&S procedures as base policies for efficiently learning the value function of stochastic dynamic programming. We accelerate online sample-allocation by using deep reinforcement learning to pre-train a neural network model offline based on a given prior. We also propose a parallelizable computing framework for large-scale problems, effectively combining \"divide and conquer\" and \"recursion\" for enhanced scalability and efficiency. 
Numerical experiments demonstrate that the performance of AlphaRank is significantly improved over the base policies, which could be attributed to AlphaRank's superior ability to trade off among mean, variance, and induced correlation, a trade-off overlooked by many existing policies."}, "https://arxiv.org/abs/2402.01003": {"title": "Practical challenges in mediation analysis: A guide for applied researchers", "link": "https://arxiv.org/abs/2402.01003", "description": "Mediation analysis is a statistical approach that can provide insights regarding the intermediary processes by which an intervention or exposure affects a given outcome. Mediation analysis rose to prominence, particularly in social science research, with the publication of the seminal paper by Baron and Kenny, and is now commonly applied in many research disciplines, including health services research. Despite the growth in popularity, applied researchers may still encounter challenges in terms of conducting mediation analyses in practice. In this paper, we provide an overview of conceptual and methodological challenges that researchers face when conducting mediation analyses. Specifically, we discuss the following key challenges: (1) Conceptually differentiating mediators from other third variables, (2) Extending beyond the single mediator context, (3) Identifying appropriate datasets in which measurement and temporal ordering support the hypothesized mediation model, (4) Selecting mediation effects that reflect the scientific question of interest, (5) Assessing the validity of underlying assumptions of no omitted confounders, (6) Addressing measurement error regarding the mediator, and (7) Clearly reporting results from mediation analyses. We discuss each challenge and highlight ways in which the applied researcher can approach these challenges."}, "https://arxiv.org/abs/2402.01139": {"title": "Online conformal prediction with decaying step sizes", "link": "https://arxiv.org/abs/2402.01139", "description": "We introduce a method for online conformal prediction with decaying step sizes. Like previous methods, ours possesses a retrospective guarantee of coverage for arbitrary sequences. However, unlike previous methods, we can simultaneously estimate a population quantile when it exists. Our theory and experiments indicate substantially improved practical properties: in particular, when the distribution is stable, the coverage is close to the desired level for every time point, not just on average over the observed sequence."}, "https://arxiv.org/abs/2402.01207": {"title": "Efficient Causal Graph Discovery Using Large Language Models", "link": "https://arxiv.org/abs/2402.01207", "description": "We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries, which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data, when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. 
The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains."}, "https://arxiv.org/abs/2402.01320": {"title": "On the mean-field limit for Stein variational gradient descent: stability and multilevel approximation", "link": "https://arxiv.org/abs/2402.01320", "description": "In this paper we propose and analyze a novel multilevel version of Stein variational gradient descent (SVGD). SVGD is a recent particle based variational inference method. For Bayesian inverse problems with computationally expensive likelihood evaluations, the method can become prohibitive as it requires to evolve a discrete dynamical system over many time steps, each of which requires likelihood evaluations at all particle locations. To address this, we introduce a multilevel variant that involves running several interacting particle dynamics in parallel corresponding to different approximation levels of the likelihood. By carefully tuning the number of particles at each level, we prove that a significant reduction in computational complexity can be achieved. As an application we provide a numerical experiment for a PDE driven inverse problem, which confirms the speed up suggested by our theoretical results."}, "https://arxiv.org/abs/2402.01454": {"title": "Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach", "link": "https://arxiv.org/abs/2402.01454", "description": "In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is widely accepted as significant for creating consistent meaningful causal models, despite the recognized challenges in systematic acquisition of the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference, in which SCD methods and knowledge based causal inference (KBCI) with a large language model (LLM) are synthesized through \"statistical causal prompting (SCP)\" for LLMs and prior knowledge augmentation for SCD. Experiments have revealed that GPT-4 can cause the output of the LLM-KBCI and the SCD result with prior knowledge from LLM-KBCI to approach the ground truth, and that the SCD result can be further improved, if GPT-4 undergoes SCP. Furthermore, it has been clarified that an LLM can improve SCD with its background knowledge, even if the LLM does not contain information on the dataset. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains."}, "https://arxiv.org/abs/2402.01607": {"title": "Natural Counterfactuals With Necessary Backtracking", "link": "https://arxiv.org/abs/2402.01607", "description": "Counterfactual reasoning is pivotal in human cognition and especially important for providing explanations and making decisions. While Judea Pearl's influential approach is theoretically elegant, its generation of a counterfactual scenario often requires interventions that are too detached from the real scenarios to be feasible. In response, we propose a framework of natural counterfactuals and a method for generating counterfactuals that are natural with respect to the actual world's data distribution. 
Our methodology refines counterfactual reasoning, allowing changes in causally preceding variables to minimize deviations from realistic scenarios. To generate natural counterfactuals, we introduce an innovative optimization framework that permits but controls the extent of backtracking with a naturalness criterion. Empirical experiments indicate the effectiveness of our method."}, "https://arxiv.org/abs/2106.15436": {"title": "Topo-Geometric Analysis of Variability in Point Clouds using Persistence Landscapes", "link": "https://arxiv.org/abs/2106.15436", "description": "Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst the tools is persistent homology, which summarizes birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enables the use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistent homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies."}, "https://arxiv.org/abs/2203.06225": {"title": "Semiparametric Mixed-effects Model for Longitudinal Data with Non-normal Errors", "link": "https://arxiv.org/abs/2203.06225", "description": "Difficulties may arise when analyzing longitudinal data using mixed-effects models if there are nonparametric functions present in the linear predictor component. This study extends the use of semiparametric mixed-effects modeling in cases where the response variable does not always follow a normal distribution and the nonparametric component is structured as an additive model. A novel approach is proposed to identify significant linear and non-linear components using a double-penalized generalized estimating equation with two penalty terms. Furthermore, the proposed iterative approach is intended to enhance the efficiency of estimating regression coefficients by incorporating the calculation of the working covariance matrix. The oracle properties of the resulting estimators are established under certain regularity conditions, where the dimensions of both the parametric and nonparametric components increase as the sample size grows. 
We perform numerical studies to demonstrate the efficacy of our proposal."}, "https://arxiv.org/abs/2208.07573": {"title": "Higher-order accurate two-sample network inference and network hashing", "link": "https://arxiv.org/abs/2208.07573", "description": "Two-sample hypothesis testing for network comparison presents many significant challenges, including: leveraging repeated network observations and known node registration, but without requiring them to operate; relaxing strong structural assumptions; achieving finite-sample higher-order accuracy; handling different network sizes and sparsity levels; fast computation and memory parsimony; controlling false discovery rate (FDR) in multiple testing; and theoretical understandings, particularly regarding finite-sample accuracy and minimax optimality. In this paper, we develop a comprehensive toolbox, featuring a novel main method and its variants, all accompanied by strong theoretical guarantees, to address these challenges. Our method outperforms existing tools in speed and accuracy, and it is proved power-optimal. Our algorithms are user-friendly and versatile in handling various data structures (single or repeated network observations; known or unknown node registration). We also develop an innovative framework for offline hashing and fast querying as a very useful tool for large network databases. We showcase the effectiveness of our method through comprehensive simulations and applications to two real-world datasets, which revealed intriguing new structures."}, "https://arxiv.org/abs/2208.14983": {"title": "Improved information criteria for Bayesian model averaging in lattice field theory", "link": "https://arxiv.org/abs/2208.14983", "description": "Bayesian model averaging is a practical method for dealing with uncertainty due to model specification. Use of this technique requires the estimation of model probability weights. In this work, we revisit the derivation of estimators for these model weights. Use of the Kullback-Leibler divergence as a starting point leads naturally to a number of alternative information criteria suitable for Bayesian model weight estimation. We explore three such criteria, known to the statistics literature before, in detail: a Bayesian analogue of the Akaike information criterion which we call the BAIC, the Bayesian predictive information criterion (BPIC), and the posterior predictive information criterion (PPIC). We compare the use of these information criteria in numerical analysis problems common in lattice field theory calculations. We find that the PPIC has the most appealing theoretical properties and can give the best performance in terms of model-averaging uncertainty, particularly in the presence of noisy data, while the BAIC is a simple and reliable alternative."}, "https://arxiv.org/abs/2210.12968": {"title": "Overlap, matching, or entropy weights: what are we weighting for?", "link": "https://arxiv.org/abs/2210.12968", "description": "There has been a recent surge in statistical methods for handling the lack of adequate positivity when using inverse probability weights (IPW). However, these nascent developments have raised a number of questions. Thus, we demonstrate the ability of equipoise estimators (overlap, matching, and entropy weights) to handle the lack of positivity. Compared to IPW, the equipoise estimators have been shown to be flexible and easy to interpret. 
However, promoting their wide use requires that researchers know clearly why and when to apply them and what to expect.\n In this paper, we provide the rationale to use these estimators to achieve robust results. We specifically look into the impact that imbalances in treatment allocation can have on positivity and, ultimately, on the estimates of the treatment effect. We zero in on the typical pitfalls of the IPW estimator and its relationship with the estimators of the average treatment effect on the treated (ATT) and on the controls (ATC). Furthermore, we also compare IPW trimming to the equipoise estimators. We focus particularly on two key points: What fundamentally distinguishes their estimands? When should we expect similar results? Our findings are illustrated through Monte Carlo simulation studies and a data example on healthcare expenditure."}, "https://arxiv.org/abs/2212.13574": {"title": "Weak Signal Inclusion Under Dependence and Applications in Genome-wide Association Study", "link": "https://arxiv.org/abs/2212.13574", "description": "Motivated by inquiries into weak signals in underpowered genome-wide association studies (GWASs), we consider the problem of retaining true signals that are not strong enough to be individually separable from a large amount of noise. We address the challenge from the perspective of false negative control and present false negative control (FNC) screening, a data-driven method to efficiently regulate the false negative proportion at a user-specified level. FNC screening is developed in a realistic setting with arbitrary covariance dependence between variables. We calibrate the overall dependence through a parameter whose scale is compatible with the existing phase diagram in high-dimensional sparse inference. Utilizing the new calibration, we asymptotically explicate the joint effect of covariance dependence, signal sparsity, and signal intensity on the proposed method. We interpret the results using a new phase diagram, which shows that FNC screening can efficiently select a set of candidate variables to retain a high proportion of signals even when the signals are not individually separable from noise. Finite sample performance of FNC screening is compared to that of several existing methods in simulation studies. The proposed method outperforms the others in adapting to a user-specified false negative control level. We implement FNC screening to empower a two-stage GWAS procedure, which demonstrates substantial power gain when working with limited sample sizes in real applications."}, "https://arxiv.org/abs/2303.03502": {"title": "Analyzing Risk Factors for Post-Acute Recovery in Older Adults with Alzheimer's Disease and Related Dementia: A New Semi-Parametric Model for Large-Scale Medicare Claims", "link": "https://arxiv.org/abs/2303.03502", "description": "Nearly 300,000 older adults experience a hip fracture every year, the majority of which occur following a fall. Unfortunately, recovery after fall-related trauma such as hip fracture is poor, where older adults diagnosed with Alzheimer's Disease and Related Dementia (ADRD) spend a particularly long time in hospitals or rehabilitation facilities during the post-operative recuperation period. Because older adults value functional recovery and spending time at home versus facilities as key outcomes after hospitalization, identifying factors that influence days spent at home after hospitalization is imperative. 
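The equipoise estimators discussed in the overlap/matching/entropy weights entry above (arXiv:2210.12968) differ from IPW only in how the propensity score enters the weights. The sketch below lists the standard textbook weighting formulas for these estimands; it is illustrative only and is not code from the paper.

```python
import numpy as np

def balancing_weights(ps, treated, kind="overlap"):
    """Return observation weights for a given vector of propensity scores.

    Standard formulas from the weighting literature (illustrative):
    IPW targets the ATE; overlap, matching and entropy weights target
    "equipoise" populations where treatment assignment is most uncertain.
    """
    ps = np.asarray(ps, dtype=float)
    t = np.asarray(treated, dtype=float)
    if kind == "ipw":         # inverse probability weights (ATE)
        return t / ps + (1 - t) / (1 - ps)
    if kind == "overlap":     # treated weighted by 1-e, controls by e
        return t * (1 - ps) + (1 - t) * ps
    if kind == "matching":    # matching weights: min(e, 1-e) divided by e or 1-e
        return np.minimum(ps, 1 - ps) * (t / ps + (1 - t) / (1 - ps))
    if kind == "entropy":     # entropy tilting: h(e) = -(e*ln e + (1-e)*ln(1-e))
        h = -(ps * np.log(ps) + (1 - ps) * np.log(1 - ps))
        return h * (t / ps + (1 - t) / (1 - ps))
    raise ValueError(kind)

ps = np.array([0.05, 0.40, 0.60, 0.95])   # note the extreme scores at 0.05 and 0.95
trt = np.array([1, 0, 1, 0])
for kind in ("ipw", "overlap", "matching", "entropy"):
    print(kind, balancing_weights(ps, trt, kind).round(2))
```

Running the toy example shows the practical point made in the entry: IPW explodes at extreme propensity scores, while the equipoise weights remain bounded.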
While several individual-level factors have been identified, the characteristics of the treating hospital have recently been identified as contributors. However, few methodologically rigorous approaches are available to help overcome potential sources of bias such as hospital-level unmeasured confounders, informative hospital size, and loss to follow-up due to death. This article develops a useful tool equipped with unsupervised learning to simultaneously handle statistical complexities that are often encountered in health services research, especially when using large administrative claims databases. The proposed estimator has a closed form, thus requiring only a light computational load in a large-scale study. We further derive its asymptotic properties, which can be used to make statistical inference in practice. Extensive simulation studies demonstrate the superiority of the proposed estimator over existing estimators."}, "https://arxiv.org/abs/2303.08218": {"title": "Spatial causal inference in the presence of unmeasured confounding and interference", "link": "https://arxiv.org/abs/2303.08218", "description": "This manuscript bridges the divide between causal inference and spatial statistics, presenting novel insights for causal inference in spatial data analysis, and establishing how tools from spatial statistics can be used to draw causal inferences. We introduce spatial causal graphs to highlight that spatial confounding and interference can be entangled, in that investigating the presence of one can lead to wrongful conclusions in the presence of the other. Moreover, we show that spatial dependence in the exposure variable can render standard analyses invalid, which can lead to erroneous conclusions. To remedy these issues, we propose a Bayesian parametric approach based on tools commonly used in spatial statistics. This approach simultaneously accounts for interference and mitigates bias resulting from local and neighborhood unmeasured spatial confounding. From a Bayesian perspective, we show that incorporating an exposure model is necessary, and we theoretically prove that all model parameters are identifiable, even in the presence of unmeasured confounding. To illustrate the approach's effectiveness, we provide results from a simulation study and a case study involving the impact of sulfur dioxide emissions from power plants on cardiovascular mortality."}, "https://arxiv.org/abs/2306.12671": {"title": "Feature screening for clustering analysis", "link": "https://arxiv.org/abs/2306.12671", "description": "In this paper, we consider feature screening for ultrahigh dimensional clustering analyses. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important cluster-relevant features have heterogeneous components in their mixture distributions and unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate the homogeneity. Under general parametric settings, we establish the tail probability bounds of the EM-test statistic for the homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independence screening and even the consistency in selection properties. The limiting distribution of the EM-test statistic is also obtained for general parametric distributions. 
The proposed method is computationally efficient, can accurately screen for important cluster-relevant features and help to significantly improve clustering, as demonstrated in our extensive simulation and real data analyses."}, "https://arxiv.org/abs/2312.03257": {"title": "Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes", "link": "https://arxiv.org/abs/2312.03257", "description": "Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application given its ability to depict the global metabolic pattern in biological samples. However, the data is noisy and plagued by the lack of clear identity of data features measured from samples. Multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches. Some existing methods attempt to reduce the matching uncertainty, but are far from being able to remove the uncertainty for most features. The existence of the uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection, and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible."}, "https://arxiv.org/abs/2306.06534": {"title": "K-Tensors: Clustering Positive Semi-Definite Matrices", "link": "https://arxiv.org/abs/2306.06534", "description": "This paper introduces $K$-Tensors, a novel self-consistent clustering algorithm designed to cluster positive semi-definite (PSD) matrices by their eigenstructures. Clustering PSD matrices is crucial across various fields, including computer and biomedical sciences. Traditional clustering methods, which often involve matrix vectorization, tend to overlook the inherent PSD characteristics, thereby discarding valuable shape and eigenstructural information. To preserve this essential shape and eigenstructural information, our approach incorporates a unique distance metric that respects the PSD nature of the data. We demonstrate that $K$-Tensors is not only self-consistent but also reliably converges to a local optimum. 
Through numerical studies, we further validate the algorithm's effectiveness and explore its properties in detail."}, "https://arxiv.org/abs/2307.10276": {"title": "A test for counting sequences of integer-valued autoregressive models", "link": "https://arxiv.org/abs/2307.10276", "description": "The integer autoregressive (INAR) model is one of the most commonly used models in nonnegative integer-valued time series analysis and is a counterpart to the traditional autoregressive model for continuous-valued time series. To guarantee the integer-valued nature, the binomial thinning operator or more generally the generalized Steutel and van Harn operator is used to define the INAR model. However, the distributions of the counting sequences used in the operators have so far been determined by the preference of the analyst without statistical verification. In this paper, we propose a test based on the mean and variance relationships for distributions of counting sequences and a disturbance process to check if the operator is reasonable. We show that our proposed test has asymptotically correct size and is consistent. Numerical simulation is carried out to evaluate the finite sample performance of our test. As a real data application, we apply our test to the monthly number of anorexia cases in animals submitted to animal health laboratories in New Zealand, and we conclude that the binomial thinning operator is not appropriate."}, "https://arxiv.org/abs/2402.01827": {"title": "Extracting Scalar Measures from Curves", "link": "https://arxiv.org/abs/2402.01827", "description": "The ability to order outcomes is necessary to make comparisons, which is complicated when there is no natural ordering on the space of outcomes, as in the case of functional outcomes. This paper examines methods for extracting a scalar summary from functional or longitudinal outcomes based on an average rate of change which can be used to compare curves. Common approaches used in practice rely on a change score or an analysis of covariance (ANCOVA) to make comparisons. However, these standard approaches only use a fraction of the available data and are inefficient. We derive measures of performance of an averaged rate of change of a functional outcome and compare this measure to standard measures. Simulations and data from a depression clinical trial are used to illustrate results."}, "https://arxiv.org/abs/2402.01861": {"title": "Depth for samples of sets with applications to testing equality in distribution of two samples of random sets", "link": "https://arxiv.org/abs/2402.01861", "description": "This paper introduces several depths for random sets with possibly non-convex realisations, proposes ways to estimate the depths based on the samples and compares them with existing ones. The depths are further applied for the comparison between two samples of random sets using a visual method of DD-plots and statistical testing. The advantage of this approach is identifying sets within the sample that are responsible for rejecting the null hypothesis of equality in distribution and providing clues on differences between distributions. 
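For readers unfamiliar with the binomial thinning operator referenced in the INAR testing entry above (arXiv:2307.10276), the following sketch simulates an INAR(1) process with Poisson innovations and checks the implied mean-variance relationship empirically. It is a generic simulator under assumed parameter values, not the proposed test.

```python
import numpy as np

def simulate_inar1(n, alpha=0.5, lam=2.0, seed=0):
    """Simulate an INAR(1) process X_t = alpha o X_{t-1} + eps_t.

    'o' is the binomial thinning operator, alpha o X = Binomial(X, alpha),
    and eps_t ~ Poisson(lam). Illustrative parameter values only.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.poisson(lam / (1 - alpha))          # start near the stationary mean
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # binomial thinning of the previous count
        x[t] = survivors + rng.poisson(lam)        # add Poisson innovations
    return x

x = simulate_inar1(5000)
# With binomial thinning and Poisson innovations the stationary marginal is
# Poisson(lam / (1 - alpha)), so the sample mean and variance should both be near 4.
print(x.mean(), x.var(ddof=1))
```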
The method is justified using a simulation study and applied to real data consisting of histological images of mastopathy and mammary cancer tissue."}, "https://arxiv.org/abs/2402.01866": {"title": "Parametric Bootstrap on Networks with Non-Exchangeable Nodes", "link": "https://arxiv.org/abs/2402.01866", "description": "This paper studies the parametric bootstrap method for networks to quantify the uncertainty of statistics of interest. While existing network resampling methods primarily focus on count statistics under node-exchangeable (graphon) models, we consider more general network statistics (including local statistics) under the Chung-Lu model without node-exchangeability. We show that the natural network parametric bootstrap that first estimates the network generating model and then draws bootstrap samples from the estimated model generally suffers from bootstrap bias. As a general recipe for addressing this problem, we show that a two-level bootstrap procedure provably reduces the bias. This essentially extends the classical idea of iterative bootstrap to the network case with growing number of parameters. Moreover, for many network statistics, the second-level bootstrap also provides a way to construct confidence intervals with higher accuracy. As a byproduct of our effort to construct confidence intervals, we also prove the asymptotic normality of subgraph counts under the Chung-Lu model."}, "https://arxiv.org/abs/2402.01914": {"title": "Predicting Batting Averages in Specific Matchups Using Generalized Linked Matrix Factorization", "link": "https://arxiv.org/abs/2402.01914", "description": "Predicting batting averages for specific batters against specific pitchers is a challenging problem in baseball. Previous methods for estimating batting averages in these matchups have used regression models that can incorporate the pitcher's and batter's individual batting averages. However, these methods are limited in their flexibility to include many additional parameters because of the challenges of high-dimensional data in regression. Dimension reduction methods can be used to incorporate many predictors into the model by finding a lower rank set of patterns among them, providing added flexibility. This paper illustrates that dimension reduction methods can be useful for predicting batting averages. To incorporate binomial data (batting averages) as well as additional data about each batter and pitcher, this paper proposes a novel dimension reduction method that uses alternating generalized linear models to estimate shared patterns across three data sources related to batting averages. While the novel method slightly outperforms existing methods for imputing batting averages based on simulations and a cross-validation study, the biggest advantage is that it can easily incorporate other sources of data. As data-collection technology continues to improve, more variables will be available, and this method will be more accurate with more informative data in the future."}, "https://arxiv.org/abs/2402.01951": {"title": "Sparse spanning portfolios and under-diversification with second-order stochastic dominance", "link": "https://arxiv.org/abs/2402.01951", "description": "We develop and implement methods for determining whether relaxing sparsity constraints on portfolios improves the investment opportunity set for risk-averse investors. We formulate a new estimation procedure for sparse second-order stochastic spanning based on a greedy algorithm and Linear Programming. 
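To illustrate the kind of procedure studied in the network parametric bootstrap entry above (arXiv:2402.01866), here is a sketch of the naive one-level parametric bootstrap under a Chung-Lu model fitted by plugging in observed degrees. The paper's two-level bias-reducing bootstrap is not reproduced; the statistic (triangle count) and all settings are illustrative.

```python
import numpy as np

def chung_lu_sample(degrees, rng):
    """Draw one undirected network from a Chung-Lu model fitted to the degrees.

    Edge probability p_ij = min(1, d_i * d_j / sum(d)); no self-loops.
    """
    d = np.asarray(degrees, dtype=float)
    p = np.minimum(1.0, np.outer(d, d) / d.sum())
    np.fill_diagonal(p, 0.0)
    upper = np.triu(rng.random(p.shape) < p, 1).astype(int)
    return upper + upper.T

def triangle_count(a):
    # trace(A^3) counts each triangle six times in a simple undirected graph
    return int(np.trace(np.linalg.matrix_power(a, 3)) // 6)

def naive_parametric_bootstrap(a_obs, stat, b=200, seed=0):
    """One-level ("naive") parametric bootstrap of a network statistic.

    Fit the Chung-Lu model to the observed adjacency matrix, resample networks
    from the fit, and recompute the statistic. The entry above argues this
    naive scheme is biased and proposes a two-level correction (not shown).
    """
    rng = np.random.default_rng(seed)
    degrees = a_obs.sum(axis=1)
    return np.array([stat(chung_lu_sample(degrees, rng)) for _ in range(b)])

rng = np.random.default_rng(42)
a_obs = chung_lu_sample(np.full(60, 8.0), rng)        # toy "observed" network
boot = naive_parametric_bootstrap(a_obs, triangle_count, b=100)
print(triangle_count(a_obs), boot.mean(), boot.std(ddof=1))
```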
We show the optimal recovery of the sparse solution asymptotically whether spanning holds or not. From large equity datasets, we estimate the expected utility loss due to possible under-diversification, and find that there is no benefit from expanding a sparse opportunity set beyond 45 assets. The optimal sparse portfolio invests in 10 industry sectors and cuts tail risk when compared to a sparse mean-variance portfolio. On a rolling-window basis, the number of assets shrinks to 25 assets in crisis periods, while standard factor models cannot explain the performance of the sparse portfolios."}, "https://arxiv.org/abs/2402.01966": {"title": "The general solution to an autoregressive law of motion", "link": "https://arxiv.org/abs/2402.01966", "description": "In this article we provide a complete description of the set of all solutions to an autoregressive law of motion in a finite-dimensional complex vector space. Every solution is shown to be the sum of three parts, each corresponding to a directed flow of time. One part flows forward from the arbitrarily distant past; one flows backward from the arbitrarily distant future; and one flows outward from time zero. The three parts are obtained by applying three complementary spectral projections to the solution, these corresponding to a separation of the eigenvalues of the autoregressive operator according to whether they are inside, outside or on the unit circle. We provide a finite-dimensional parametrization of the set of all solutions."}, "https://arxiv.org/abs/2402.02016": {"title": "Applying different methods to model dry and wet spells at daily scale in a large range of rainfall regimes across Europe", "link": "https://arxiv.org/abs/2402.02016", "description": "The modelling of the occurrence of rainfall dry and wet spells (ds and ws, respectively) can be jointly conveyed using the inter-arrival times (it). While the modelling of it has the advantage of requiring a single fitting for the description of all rainfall time characteristics (including wet and dry chains, an extension of the concept of spells), the assumption on the independence and identical distribution of the renewal times it implicitly imposes a memoryless property on the derived ws, which may not be true in some cases. In this study, two different methods for the modelling of rainfall time characteristics at station scale have been applied: i) a direct method (DM) that fits the discrete Lerch distribution to it records, and then derives ws and ds (as well as the corresponding chains) from the it distribution; and ii) an indirect method (IM) that fits the Lerch distribution to the ws and ds records separately, relaxing the assumptions of the renewal process. The results of this application over six stations in Europe, characterized by a wide range of rainfall regimes, highlight how the geometric distribution does not always reasonably reproduce the ws frequencies, even when it are modelled by the Lerch distribution well. Improved performances are obtained with the IM, thanks to the relaxation of the assumption on the independence and identical distribution of the renewal times. 
A further improvement in the fits is obtained when the datasets are separated into two periods, suggesting that the inferences may benefit from accounting for local seasonality."}, "https://arxiv.org/abs/2402.02064": {"title": "Prediction modelling with many correlated and zero-inflated predictors: assessing a nonnegative garrote approach", "link": "https://arxiv.org/abs/2402.02064", "description": "Building prediction models from mass-spectrometry data is challenging due to the abundance of correlated features with varying degrees of zero-inflation, leading to a common interest in reducing the features to a concise predictor set with good predictive performance. In this study, we formally established and examined regularized regression approaches, designed to address zero-inflated and correlated predictors. In particular, we describe a novel two-stage regularized regression approach (ridge-garrote) explicitly modelling zero-inflated predictors using two component variables, comprising a ridge estimator in the first stage and subsequently applying a nonnegative garrote estimator in the second stage. We contrasted ridge-garrote with one-stage methods (ridge, lasso) and other two-stage regularized regression approaches (lasso-ridge, ridge-lasso) for zero-inflated predictors. We assessed the predictive performance and predictor selection properties of these methods in a comparative simulation study and a real-data case study to predict kidney function using peptidomic features derived from mass-spectrometry. In the simulation study, the predictive performance of all assessed approaches was comparable, yet the ridge-garrote approach consistently selected more parsimonious models compared to its competitors in most scenarios. While lasso-ridge achieved higher predictive accuracy than its competitors, it exhibited high variability in the number of selected predictors. Ridge-lasso exhibited slightly higher predictive accuracy than ridge-garrote but at the expense of selecting more noise predictors. Overall, ridge emerged as a favourable option when variable selection is not a primary concern, while ridge-garrote demonstrated notable practical utility in selecting a parsimonious set of predictors, with only minimal compromise in predictive accuracy."}, "https://arxiv.org/abs/2402.02068": {"title": "Modeling local predictive ability using power-transformed Gaussian processes", "link": "https://arxiv.org/abs/2402.02068", "description": "A Gaussian process is proposed as a model for the posterior distribution of the local predictive ability of a model or expert, conditional on a vector of covariates, from historical predictions in the form of log predictive scores. Assuming Gaussian expert predictions and a Gaussian data generating process, a linear transformation of the predictive score follows a noncentral chi-squared distribution with one degree of freedom. Motivated by this, we develop a non-central chi-squared Gaussian process regression to flexibly model local predictive ability, with the posterior distribution of the latent GP function and kernel hyperparameters sampled by Hamiltonian Monte Carlo. We show that a cube-root transformation of the log scores is approximately Gaussian with homoscedastic variance, which makes it possible to estimate the model much faster by marginalizing the latent GP function analytically. 
Linear pools based on learned local predictive ability are applied to predict daily bike usage in Washington DC."}, "https://arxiv.org/abs/2402.02128": {"title": "Adaptive Accelerated Failure Time modeling with a Semiparametric Skewed Error Distribution", "link": "https://arxiv.org/abs/2402.02128", "description": "The accelerated failure time (AFT) model is widely used to analyze relationships between variables in the presence of censored observations. However, this model relies on some assumptions such as the error distribution, which can lead to biased or inefficient estimates if these assumptions are violated. In order to overcome this challenge, we propose a novel approach that incorporates a semiparametric skew-normal scale mixture distribution for the error term in the AFT model. By allowing for more flexibility and robustness, this approach reduces the risk of misspecification and improves the accuracy of parameter estimation. We investigate the identifiability and consistency of the proposed model and develop a practical estimation algorithm. To evaluate the performance of our approach, we conduct extensive simulation studies and real data analyses. The results demonstrate the effectiveness of our method in providing robust and accurate estimates in various scenarios."}, "https://arxiv.org/abs/2402.02187": {"title": "Graphical models for multivariate extremes", "link": "https://arxiv.org/abs/2402.02187", "description": "Graphical models in extremes have emerged as a diverse and quickly expanding research area in extremal dependence modeling. They allow for parsimonious statistical methodology and are particularly suited for enforcing sparsity in high-dimensional problems. In this work, we provide the fundamental concepts of extremal graphical models and discuss recent advances in the field. Different existing perspectives on graphical extremes are presented in a unified way through graphical models for exponent measures. We discuss the important cases of nonparametric extremal graphical models on simple graph structures, and the parametric class of H\\\"usler--Reiss models on arbitrary undirected graphs. In both cases, we describe model properties, methods for statistical inference on known graph structures, and structure learning algorithms when the graph is unknown. We illustrate different methods in an application to flight delay data at US airports."}, "https://arxiv.org/abs/2402.02196": {"title": "Sample-Efficient Clustering and Conquer Procedures for Parallel Large-Scale Ranking and Selection", "link": "https://arxiv.org/abs/2402.02196", "description": "We propose novel \"clustering and conquer\" procedures for the parallel large-scale ranking and selection (R&S) problem, which leverage correlation information for clustering to break the bottleneck of sample efficiency. In parallel computing environments, correlation-based clustering can achieve an $\\mathcal{O}(p)$ sample complexity reduction rate, which is the optimal reduction rate theoretically attainable. Our proposed framework is versatile, allowing for seamless integration of various prevalent R&S methods under both fixed-budget and fixed-precision paradigms. It can achieve improvements without the necessity of highly accurate correlation estimation and precise clustering. In large-scale AI applications such as neural architecture search, a screening-free version of our procedure surprisingly surpasses fully-sequential benchmarks in terms of sample efficiency. 
This suggests that leveraging valuable structural information, such as correlation, is a viable path to bypassing the traditional need for screening via pairwise comparison--a step previously deemed essential for high sample efficiency but problematic for parallelization. Additionally, we propose a parallel few-shot clustering algorithm tailored for large-scale problems."}, "https://arxiv.org/abs/2402.02272": {"title": "One-inflated zero-truncated count regression models", "link": "https://arxiv.org/abs/2402.02272", "description": "We find that in zero-truncated count data (y=1,2,...), individuals often gain information at first observation (y=1), leading to a common but unaddressed phenomenon of \"one-inflation\". The current standard, the zero-truncated negative binomial (ZTNB) model, is misspecified under one-inflation, causing bias and inconsistency. To address this, we introduce the one-inflated zero-truncated negative binomial (OIZTNB) regression model. The importance of our model is highlighted through simulation studies, and through the discovery of one-inflation in four datasets that have traditionally championed ZTNB. We recommended OIZTNB over ZTNB for most data, and provide estimation, marginal effects, and testing in the accompanying R package oneinfl."}, "https://arxiv.org/abs/2402.02306": {"title": "A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding", "link": "https://arxiv.org/abs/2402.02306", "description": "In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator. This estimator facilitates both longitudinal predictive and causal inference. It incorporates Bayesian additive regression trees in the modeling of the time-evolving generative components, aiming to mitigate bias due to model misspecification. Specifically, we introduce a more general class of g-formulas for discrete survival data. These formulas can incorporate the longitudinal balancing scores, which serve as an effective method for dimension reduction and are vital when dealing with an expanding array of time-varying confounders. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment regimes, whether static or dynamic. For each type of treatment regime, we provide posterior sampling algorithms, which are grounded in the Bayesian additive regression trees framework. We have conducted simulation studies to illustrate the empirical performance of our proposed Bayesian g-formula estimators, and to compare them with existing parametric estimators. We further demonstrate the practical utility of our methods in real-world scenarios using data from the Yale New Haven Health System's electronic health records."}, "https://arxiv.org/abs/2402.02329": {"title": "Leveraging Local Distributions in Mendelian Randomization: Uncertain Opinions are Invalid", "link": "https://arxiv.org/abs/2402.02329", "description": "Mendelian randomization (MR) considers using genetic variants as instrumental variables (IVs) to infer causal effects in observational studies. However, the validity of causal inference in MR can be compromised when the IVs are potentially invalid. 
In this work, we propose a new method, MR-Local, to infer the causal effect in the presence of possibly invalid IVs. By leveraging the distribution of ratio estimates around the true causal effect, MR-Local selects the cluster of ratio estimates with the least uncertainty and performs causal inference within it. We establish the asymptotic normality of our estimator in the two-sample summary-data setting under either the plurality rule or the balanced pleiotropy assumption. Extensive simulations and analyses of real datasets demonstrate the reliability of our approach."}, "https://arxiv.org/abs/2402.02482": {"title": "Global bank network connectedness revisited: What is common, idiosyncratic and when?", "link": "https://arxiv.org/abs/2402.02482", "description": "We revisit the problem of estimating high-dimensional global bank network connectedness. Instead of directly regularizing the high-dimensional vector of realized volatilities as in Demirer et al. (2018), we estimate a dynamic factor model with sparse VAR idiosyncratic components. This allows us to disentangle: (I) the part of system-wide connectedness (SWC) due to the common component shocks (what we call the \"banking market\"), and (II) the part due to the idiosyncratic shocks (the single banks). We employ both the original dataset as in Demirer et al. (2018) (daily data, 2003-2013), as well as a more recent vintage (2014-2023). For both, we compute SWC due to (I), (II), (I+II) and provide bootstrap confidence bands. In accordance with the literature, we find SWC to spike during global crises. However, our method minimizes the risk of SWC underestimation in high-dimensional datasets where episodes of systemic risk can be both pervasive and idiosyncratic. In fact, we are able to disentangle how in normal times $\\approx$60-80% of SWC is due to idiosyncratic variation and only $\\approx$20-40% to market variation. However, in crisis periods such as the 2008 financial crisis and the Covid-19 outbreak in 2019, the situation is completely reversed: SWC is comparatively more driven by a market dynamic and less by an idiosyncratic one."}, "https://arxiv.org/abs/2402.02535": {"title": "Data-driven Policy Learning for a Continuous Treatment", "link": "https://arxiv.org/abs/2402.02535", "description": "This paper studies policy learning under the condition of unconfoundedness with a continuous treatment variable. Our research begins by employing kernel-based inverse propensity-weighted (IPW) methods to estimate policy welfare. We aim to approximate the optimal policy within a global policy class characterized by infinite Vapnik-Chervonenkis (VC) dimension. This is achieved through the utilization of a sequence of sieve policy classes, each with finite VC dimension. Preliminary analysis reveals that welfare regret comprises three components: global welfare deficiency, variance, and bias. This leads to the necessity of simultaneously selecting the optimal bandwidth for estimation and the optimal policy class for welfare approximation. To tackle this challenge, we introduce a semi-data-driven strategy that employs penalization techniques. This approach yields oracle inequalities that adeptly balance the three components of welfare regret without prior knowledge of the welfare deficiency. By utilizing precise maximal and concentration inequalities, we derive sharper regret bounds than those currently available in the literature. 
In instances where the propensity score is unknown, we adopt the doubly robust (DR) moment condition tailored to the continuous treatment setting. In alignment with the binary-treatment case, the DR welfare regret closely parallels the IPW welfare regret, given the fast convergence of nuisance estimators."}, "https://arxiv.org/abs/2402.02664": {"title": "Statistical Inference for Generalized Integer Autoregressive Processes", "link": "https://arxiv.org/abs/2402.02664", "description": "A popular and flexible time series model for counts is the generalized integer autoregressive process of order $p$, GINAR($p$). These Markov processes are defined using thinning operators evaluated on past values of the process along with a discretely-valued innovation process. This class includes the commonly used INAR($p$) process, defined with binomial thinning and Poisson innovations. GINAR processes can be used in a variety of settings, including modeling time series with low counts, and allow for more general mean-variance relationships, capturing both over- or under-dispersion. While there are many thinning operators and innovation processes given in the literature, less focus has been spent on comparing statistical inference and forecasting procedures over different choices of GINAR process. We provide an extensive study of exact and approximate inference and forecasting methods that can be applied to a wide class of GINAR($p$) processes with general thinning and innovation parameters. We discuss the challenges of exact estimation when $p$ is larger. We summarize and extend asymptotic results for estimators of process parameters, and present simulations to compare small sample performance, highlighting how different methods compare. We illustrate this methodology by fitting GINAR processes to a disease surveillance series."}, "https://arxiv.org/abs/2402.02666": {"title": "Using child-woman ratios to infer demographic rates in historical populations with limited data", "link": "https://arxiv.org/abs/2402.02666", "description": "Data on historical populations often extends no further than numbers of people by broad age-sex group, with nothing on numbers of births or deaths. Demographers studying these populations have experimented with methods that use the data on numbers of people to infer birth and death rates. These methods have, however, received little attention since they were first developed in the 1960s. We revisit the problem of inferring demographic rates from population structure, spelling out the assumptions needed, and specialising the methods to the case where only child-woman ratios are available. We apply the methods to the case of Maori populations in nineteenth-century Aotearoa New Zealand. We find that, in this particular case, the methods reveal as much about the nature of the data as they do about historical demographic conditions."}, "https://arxiv.org/abs/2402.02672": {"title": "Estimation of conditional average treatment effects on distributed data: A privacy-preserving approach", "link": "https://arxiv.org/abs/2402.02672", "description": "Estimation of conditional average treatment effects (CATEs) is an important topic in various fields such as medical and social sciences. CATEs can be estimated with high accuracy if distributed data across multiple parties can be centralized. However, it is difficult to aggregate such data if they contain privacy information. 
To address this issue, we proposed data collaboration double machine learning (DC-DML), a method that can estimate CATE models with privacy preservation of distributed data, and evaluated the method through numerical experiments. Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric or non-parametric CATE models enable estimation and testing that is more robust to model mis-specification than parametric models. However, to our knowledge, no communication-efficient method has been proposed for estimating and testing semi-parametric or non-parametric CATE models on distributed data. Second, our method enables collaborative estimation between different parties as well as multiple time points because the dimensionality-reduced intermediate representations can be accumulated. Third, our method performed as well or better than other methods in evaluation experiments using synthetic, semi-synthetic and real-world datasets."}, "https://arxiv.org/abs/2402.02684": {"title": "Efficient estimation of subgroup treatment effects using multi-source data", "link": "https://arxiv.org/abs/2402.02684", "description": "Investigators often use multi-source data (e.g., multi-center trials, meta-analyses of randomized trials, pooled analyses of observational cohorts) to learn about the effects of interventions in subgroups of some well-defined target population. Such a target population can correspond to one of the data sources of the multi-source data or an external population in which the treatment and outcome information may not be available. We develop and evaluate methods for using multi-source data to estimate subgroup potential outcome means and treatment effects in a target population. We consider identifiability conditions and propose doubly robust estimators that, under mild conditions, are non-parametrically efficient and allow for nuisance functions to be estimated using flexible data-adaptive methods (e.g., machine learning techniques). We also show how to construct confidence intervals and simultaneous confidence bands for the estimated subgroup treatment effects. We examine the properties of the proposed estimators in simulation studies and compare performance against alternative estimators. We also conclude that our methods work well when the sample size of the target population is much larger than the sample size of the multi-source data. We illustrate the proposed methods in a meta-analysis of randomized trials for schizophrenia."}, "https://arxiv.org/abs/2402.02702": {"title": "Causal inference under transportability assumptions for conditional relative effect measures", "link": "https://arxiv.org/abs/2402.02702", "description": "When extending inferences from a randomized trial to a new target population, an assumption of transportability of difference effect measures (e.g., conditional average treatment effects) -- or even stronger assumptions of transportability in expectation or distribution of potential outcomes -- is invoked to identify the marginal causal mean difference in the target population. However, many clinical investigators believe that relative effect measures conditional on covariates, such as conditional risk ratios and mean ratios, are more likely to be ``transportable'' across populations compared with difference effect measures. 
Here, we examine the identification and estimation of the marginal counterfactual mean difference and ratio under a transportability assumption for conditional relative effect measures. We obtain identification results for two scenarios that often arise in practice when individuals in the target population (1) only have access to the control treatment, or (2) have access to the control and other treatments but not necessarily the experimental treatment evaluated in the trial. We then propose multiply robust and nonparametric efficient estimators that allow for the use of data-adaptive methods (e.g., machine learning techniques) to model the nuisance parameters. We examine the performance of the methods in simulation studies and illustrate their use with data from two trials of paliperidone for patients with schizophrenia. We conclude that the proposed methods are attractive when background knowledge suggests that the transportability assumption for conditional relative effect measures is more plausible than alternative assumptions."}, "https://arxiv.org/abs/2402.02867": {"title": "A new robust approach for the polytomous logistic regression model based on R\\'enyi's pseudodistances", "link": "https://arxiv.org/abs/2402.02867", "description": "This paper presents a robust alternative to the Maximum Likelihood Estimator (MLE) for the Polytomous Logistic Regression Model (PLRM), known as the family of minimum R\\`enyi Pseudodistance (RP) estimators. The proposed minimum RP estimators are parametrized by a tuning parameter $\\alpha\\geq0$, and include the MLE as a special case when $\\alpha=0$. These estimators, along with a family of RP-based Wald-type tests, are shown to exhibit superior performance in the presence of misclassification errors. The paper includes an extensive simulation study and a real data example to illustrate the robustness of these proposed statistics."}, "https://arxiv.org/abs/2402.03004": {"title": "Construction and evaluation of optimal diagnostic tests with application to hepatocellular carcinoma diagnosis", "link": "https://arxiv.org/abs/2402.03004", "description": "Accurate diagnostic tests are crucial to ensure effective treatment, screening, and surveillance of diseases. However, the limited accuracy of individual biomarkers often hinders comprehensive screening. The heterogeneity of many diseases, particularly cancer, calls for the use of several biomarkers together into a composite diagnostic test. In this paper, we present a novel multivariate model that optimally combines multiple biomarkers using the likelihood ratio function. The model's parameters directly translate into computationally simple diagnostic accuracy measures. Additionally, our method allows for reliable predictions even in scenarios where specific biomarker measurements are unavailable and can guide the selection of biomarker combinations under resource constraints. We conduct simulation studies to compare the performance to popular classification and discriminant analysis methods. We utilize the approach to construct an optimal diagnostic test for hepatocellular carcinoma, a cancer type known for the absence of a single ideal marker. 
An accompanying R implementation is made available for reproducing all results."}, "https://arxiv.org/abs/2402.03035": {"title": "An optimality property of the Bayes-Kelly algorithm", "link": "https://arxiv.org/abs/2402.03035", "description": "This note states a simple property of optimality of the Bayes-Kelly algorithm for conformal testing and poses a related open problem."}, "https://arxiv.org/abs/2402.03192": {"title": "Multiple testing using uniform filtering of ordered p-values", "link": "https://arxiv.org/abs/2402.03192", "description": "We investigate the multiplicity model with $m$ values of some test statistic independently drawn from a mixture of no effect (null) and positive effect (alternative), where we seek to identify the alternative test results with a controlled error rate. We are interested in the case where the alternatives are rare. A number of multiple testing procedures filter the set of ordered p-values in order to eliminate the nulls. Such an approach can only work if the p-values originating from the alternatives form one or several identifiable clusters. The Benjamini and Hochberg (BH) method, for example, assumes that this cluster occurs in a small interval $(0,\\Delta)$ and filters out all or most of the ordered p-values $p_{(r)}$ above a linear threshold $s \\times r$. In repeated applications this filter controls the false discovery rate via the slope $s$. We propose a new adaptive filter that deletes the p-values from regions of uniform distribution. In cases where a single cluster remains, the p-values in an interval are declared alternatives, with the mid-point and the length of the interval chosen by controlling the data-dependent FDR at a desired level."}, "https://arxiv.org/abs/2402.03231": {"title": "Improved prediction of future user activity in online A/B testing", "link": "https://arxiv.org/abs/2402.03231", "description": "In online randomized experiments or A/B tests, accurate predictions of participant inclusion rates are of paramount importance. These predictions not only guide experimenters in optimizing the experiment's duration but also enhance the precision of treatment effect estimates. In this paper we present a novel, straightforward, and scalable Bayesian nonparametric approach for predicting the rate at which individuals will be exposed to interventions within the realm of online A/B testing. Our approach stands out by offering dual prediction capabilities: it forecasts both the quantity of new customers expected in future time windows and, unlike available alternative methods, the number of times they will be observed. We derive closed-form expressions for the posterior distributions of the quantities needed to form predictions about future user activity, thereby bypassing the need for numerical algorithms such as Markov chain Monte Carlo. After a comprehensive exposition of our model, we test its performance in experiments on real and simulated data, where we show its superior performance with respect to existing alternatives in the literature."}, "https://arxiv.org/abs/2402.01785": {"title": "DoubleMLDeep: Estimation of Causal Effects with Multimodal Data", "link": "https://arxiv.org/abs/2402.01785", "description": "This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. 
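The linear-threshold filter that the multiple-testing entry above (arXiv:2402.03192) uses as its reference point is the Benjamini-Hochberg step-up procedure. A minimal NumPy version is sketched below with an illustrative toy mixture; the paper's adaptive uniform-filtering method itself is not shown.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the r smallest p-values, where r is
    the largest index with p_(r) <= q * r / m (the linear threshold in r)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        r = np.max(np.nonzero(below)[0]) + 1   # largest r whose ordered p-value clears the line
        reject[order[:r]] = True
    return reject

# toy mixture: a small cluster of "alternative" p-values near zero among uniforms
rng = np.random.default_rng(3)
p = np.concatenate([rng.uniform(size=950), rng.beta(0.05, 1.0, size=50)])
print(benjamini_hochberg(p, q=0.05).sum(), "rejections")
```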
An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data."}, "https://arxiv.org/abs/2402.01897": {"title": "Comparative analysis of two new wind speed T-X models using Weibull and log-logistic distributions for wind energy potential estimation in Tabriz, Iran", "link": "https://arxiv.org/abs/2402.01897", "description": "To assess the potential of wind energy in a specific area, statistical distribution functions are commonly used to characterize wind speed distributions. The selection of an appropriate wind speed model is crucial in minimizing wind power estimation errors. In this paper, we propose a novel method that utilizes the T-X family of continuous distributions to generate two new wind speed distribution functions, which have not been previously explored in the wind energy literature. These two statistical distributions, namely the Weibull-three parameters-log-logistic (WE3-LL3) and log-logistic-three parameters-Weibull (LL3-WE3) are compared with four other probability density functions (PDFs) to analyze wind speed data collected in Tabriz, Iran. The parameters of the considered distributions are estimated using maximum likelihood estimators with the Nelder-Mead numerical method. The suitability of the proposed distributions for the actual wind speed data is evaluated based on criteria such as root mean square errors, coefficient of determination, Kolmogorov-Smirnov test, and chi-square test. The analysis results indicate that the LL3-WE3 distribution demonstrates generally superior performance in capturing seasonal and annual wind speed data, except for summer, while the WE3-LL3 distribution exhibits the best fit for summer. It is also observed that both the LL3-WE3 and WE3-LL3 distributions effectively describe wind speed data in terms of the wind power density error criterion. Overall, the LL3-WE3 and WE3-LL3 models offer a highly accurate fit compared to other PDFs for estimating wind energy potential."}, "https://arxiv.org/abs/2402.01972": {"title": "Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts", "link": "https://arxiv.org/abs/2402.01972", "description": "We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that (i) their practical applicability can be hindered by loss function non-convexity; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. 
To avoid these drawbacks, EP-learner constructs an efficient plug-in estimator of the population risk function for the causal contrast, thereby inheriting the stability and robustness properties of plug-in estimation strategies like T-learning. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we illustrate that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our R package hte3."}, "https://arxiv.org/abs/2402.02111": {"title": "Accelerating Look-ahead in Bayesian Optimization: Multilevel Monte Carlo is All you Need", "link": "https://arxiv.org/abs/2402.02111", "description": "We leverage multilevel Monte Carlo (MLMC) to improve the performance of multi-step look-ahead Bayesian optimization (BO) methods that involve nested expectations and maximizations. The complexity rate of naive Monte Carlo degrades for nested operations, whereas MLMC is capable of achieving the canonical Monte Carlo convergence rate for this type of problem, independently of dimension and without any smoothness assumptions. Our theoretical study focuses on the approximation improvements for one- and two-step look-ahead acquisition functions, but, as we discuss, the approach is generalizable in various ways, including beyond the context of BO. Findings are verified numerically and the benefits of MLMC for BO are illustrated on several benchmark examples. Code is available here https://github.com/Shangda-Yang/MLMCBO."}, "https://arxiv.org/abs/2402.02190": {"title": "Continuous Tensor Relaxation for Finding Diverse Solutions in Combinatorial Optimization Problems", "link": "https://arxiv.org/abs/2402.02190", "description": "Finding the best solution is the most common objective in combinatorial optimization (CO) problems. However, a single solution may not be suitable in practical scenarios, as the objective functions and constraints are only approximations of original real-world situations. To tackle this, finding (i) \"heterogeneous solutions\", diverse solutions with distinct characteristics, and (ii) \"penalty-diversified solutions\", variations in constraint severity, are natural directions. This strategy provides the flexibility to select a suitable solution during post-processing. However, discovering these diverse solutions is more challenging than identifying a single solution. To overcome this challenge, this study introduces Continual Tensor Relaxation Annealing (CTRA) for unsupervised-learning-based CO solvers. CTRA addresses various problems simultaneously by extending the continual relaxation approach, which transforms discrete decision variables into continual tensors. This method finds heterogeneous and penalty-diversified solutions through mutual interactions, where the choice of one solution affects the other choices. Numerical experiments show that CTRA enables UL-based solvers to find heterogeneous and penalty-diversified solutions much faster than existing UL-based solvers. 
Moreover, these experiments reveal that CTRA enhances the exploration ability."}, "https://arxiv.org/abs/2402.02303": {"title": "Bootstrapping Fisher Market Equilibrium and First-Price Pacing Equilibrium", "link": "https://arxiv.org/abs/2402.02303", "description": "The linear Fisher market (LFM) is a basic equilibrium model from economics, which also has applications in fair and efficient resource allocation. First-price pacing equilibrium (FPPE) is a model capturing budget-management mechanisms in first-price auctions. In certain practical settings such as advertising auctions, there is an interest in performing statistical inference over these models. A popular methodology for general statistical inference is the bootstrap procedure. Yet, for LFM and FPPE there is no existing theory for the valid application of bootstrap procedures. In this paper, we introduce and devise several statistically valid bootstrap inference procedures for LFM and FPPE. The most challenging part is to bootstrap general FPPE, which reduces to bootstrapping constrained M-estimators, a largely unexplored problem. We devise a bootstrap procedure for FPPE under mild degeneracy conditions by using the powerful tool of epi-convergence theory. Experiments with synthetic and semi-real data verify our theory."}, "https://arxiv.org/abs/2402.02459": {"title": "On Minimum Trace Factor Analysis -- An Old Song Sung to a New Tune", "link": "https://arxiv.org/abs/2402.02459", "description": "Dimensionality reduction methods, such as principal component analysis (PCA) and factor analysis, are central to many problems in data science. There are, however, serious and well-understood challenges to finding robust low dimensional approximations for data with significant heteroskedastic noise. This paper introduces a relaxed version of Minimum Trace Factor Analysis (MTFA), a convex optimization method with roots dating back to the work of Ledermann in 1940. This relaxation is particularly effective at not overfitting to heteroskedastic perturbations and addresses the commonly cited Heywood cases in factor analysis and the recently identified \"curse of ill-conditioning\" for existing spectral methods. We provide theoretical guarantees on the accuracy of the resulting low rank subspace and the convergence rate of the proposed algorithm to compute that matrix. We develop a number of interesting connections to existing methods, including HeteroPCA, Lasso, and Soft-Impute, to fill an important gap in the already large literature on low rank matrix estimation. Numerical experiments benchmark our results against several recent proposals for dealing with heteroskedastic noise."}, "https://arxiv.org/abs/2402.02489": {"title": "Bivariate change point detection in movement direction and speed", "link": "https://arxiv.org/abs/2402.02489", "description": "Biological movement patterns can sometimes be quasi linear with abrupt changes in direction and speed, as in plastids in root cells investigated here. For the analysis of such changes we propose a new stochastic model for movement along linear structures. Maximum likelihood estimators are provided, and due to serial dependencies of increments, the classical MOSUM statistic is replaced by a moving kernel estimator. Convergence of the resulting difference process and strong consistency of the variance estimator are shown. 
We estimate the change points and propose a graphical technique to distinguish between change points in movement direction and speed."}, "https://arxiv.org/abs/2402.02773": {"title": "Series ridge regression for spatial data on $\\mathbb{R}^d$", "link": "https://arxiv.org/abs/2402.02773", "description": "This paper develops a general asymptotic theory of series ridge estimators for spatial data observed at irregularly spaced locations in a sampling region $R_n \\subset \\mathbb{R}^d$. We adopt a stochastic sampling design that can generate irregularly spaced sampling sites in a flexible manner including both pure increasing and mixed increasing domain frameworks. Specifically, we consider a spatial trend regression model and a nonparametric regression model with spatially dependent covariates. For these models, we investigate the $L^2$-penalized series estimation of the trend and regression functions and establish (i) uniform and $L^2$ convergence rates and (ii) multivariate central limit theorems for general series estimators, (iii) optimal uniform and $L^2$ convergence rates for spline and wavelet series estimators, and (iv) show that our dependence structure conditions on the underlying spatial processes cover a wide class of random fields including L\\'evy-driven continuous autoregressive and moving average random fields."}, "https://arxiv.org/abs/2402.02859": {"title": "Importance sampling for online variational learning", "link": "https://arxiv.org/abs/2402.02859", "description": "This article addresses online variational estimation in state-space models. We focus on learning the smoothing distribution, i.e. the joint distribution of the latent states given the observations, using a variational approach together with Monte Carlo importance sampling. We propose an efficient algorithm for computing the gradient of the evidence lower bound (ELBO) in the context of streaming data, where observations arrive sequentially. Our contributions include a computationally efficient online ELBO estimator, demonstrated performance in offline and true online settings, and adaptability for computing general expectations under joint smoothing distributions."}, "https://arxiv.org/abs/2402.02898": {"title": "Bayesian Federated Inference for regression models with heterogeneous multi-center populations", "link": "https://arxiv.org/abs/2402.02898", "description": "To estimate accurately the parameters of a regression model, the sample size must be large enough relative to the number of possible predictors for the model. In practice, sufficient data is often lacking, which can lead to overfitting of the model and, as a consequence, unreliable predictions of the outcome of new patients. Pooling data from different data sets collected in different (medical) centers would alleviate this problem, but is often not feasible due to privacy regulation or logistic problems. An alternative route would be to analyze the local data in the centers separately and combine the statistical inference results with the Bayesian Federated Inference (BFI) methodology. The aim of this approach is to compute from the inference results in separate centers what would have been found if the statistical analysis was performed on the combined data. We explain the methodology under homogeneity and heterogeneity across the populations in the separate centers, and give real life examples for better understanding. Excellent performance of the proposed methodology is shown. 
An R-package to do all the calculations has been developed and is illustrated in this paper. The mathematical details are given in the Appendix."}, "https://arxiv.org/abs/2402.03197": {"title": "Heavy-tailed $p$-value combinations from the perspective of extreme value theory", "link": "https://arxiv.org/abs/2402.03197", "description": "Handling multiplicity without losing much power has been a persistent challenge in various fields that often face the necessity of managing numerous statistical tests simultaneously. Recently, $p$-value combination methods based on heavy-tailed distributions, such as a Cauchy distribution, have received much attention for their ability to handle multiplicity without the prescribed knowledge of the dependence structure. This paper delves into these types of $p$-value combinations through the lens of extreme value theory. Distributions with regularly varying tails, a subclass of heavy-tailed distributions, are found to be useful in constructing such $p$-value combinations. Three $p$-value combination statistics (sum, max cumulative sum, and max) are introduced, whose left tail probabilities are shown to be approximately uniform when the global null is true. The primary objective of this paper is to bridge the gap between current developments in $p$-value combination methods and the literature on extreme value theory, while also offering guidance on selecting the calibrator and its associated parameters."}, "https://arxiv.org/abs/2402.03203": {"title": "Right-censored models by the expectile method", "link": "https://arxiv.org/abs/2402.03203", "description": "Based on the expectile loss function and the adaptive LASSO penalty, the paper proposes and studies estimation methods for the accelerated failure time (AFT) model. In this approach, we need to estimate the survival function of the censoring variable by the Kaplan-Meier estimator. The AFT model parameters are first estimated by the expectile method and afterwards, when the number of explanatory variables can be large, by the adaptive LASSO expectile method, which directly carries out the automatic selection of variables. We also obtain the convergence rate and asymptotic normality for the two estimators, while showing the sparsity property for the censored adaptive LASSO expectile estimator. A numerical study using Monte Carlo simulations confirms the theoretical results and demonstrates the competitive performance of the two proposed estimators. The usefulness of these estimators is illustrated by applying them to three survival data sets."}, "https://arxiv.org/abs/2402.03229": {"title": "Disentangling high order effects in the transfer entropy", "link": "https://arxiv.org/abs/2402.03229", "description": "Transfer Entropy (TE), the main approach to determining the directed information flow within a network system, can be biased (downward or upward), in both the pairwise and the conditioned calculation, due to high order dependencies among the two dynamic processes under consideration and the remaining processes in the system which are used in conditioning. Here we propose a novel approach which, instead of conditioning the TE on all the network processes other than driver and target like in its fully conditioned version, or not conditioning at all like in the pairwise approach, searches both for the multiplet of variables leading to the maximum information flow and for those minimizing it, providing a decomposition of the TE into unique, redundant, and synergistic atoms. 
Our approach allows us to quantify the relative importance of high order effects, with respect to pure two-body effects, in the information transfer between two processes, and to highlight those processes which accompany the driver in building those high order effects. We report an application of the proposed approach in climatology, analyzing data from El Ni\~{n}o and the Southern Oscillation."}, "https://arxiv.org/abs/2012.14708": {"title": "Adaptive Estimation for Non-stationary Factor Models And A Test for Static Factor Loadings", "link": "https://arxiv.org/abs/2012.14708", "description": "This paper considers the estimation and testing of a class of locally stationary time series factor models with evolutionary temporal dynamics. In particular, the entries and the dimension of the factor loading matrix are allowed to vary with time while the factors and the idiosyncratic noise components are locally stationary. We propose an adaptive sieve estimator for the span of the varying loading matrix and the locally stationary factor processes. A uniformly consistent estimator of the effective number of factors is investigated via eigenanalysis of a non-negative definite time-varying matrix. A possibly high-dimensional bootstrap-assisted test for the hypothesis of static factor loadings is proposed by comparing the kernels of the covariance matrices of the whole time series with their local counterparts. We examine our estimator and test via simulation studies and real data analysis. Finally, all our results hold under the following popular but distinct assumptions: (a) white noise idiosyncratic errors with either fixed or diverging dimension, and (b) correlated idiosyncratic errors with diverging dimension."}, "https://arxiv.org/abs/2106.16124": {"title": "Robust inference for geographic regression discontinuity designs: assessing the impact of police precincts", "link": "https://arxiv.org/abs/2106.16124", "description": "We study variation in policing outcomes attributable to differential policing practices in New York City (NYC) using geographic regression discontinuity designs (GeoRDDs). By focusing on small geographic windows near police precinct boundaries, we can estimate local average treatment effects of police precincts on arrest rates. We propose estimands and develop estimators for the GeoRDD when the data come from a spatial point process. Additionally, standard GeoRDDs rely on continuity assumptions of the potential outcome surface or a local randomization assumption within a window around the boundary. These assumptions, however, can easily be violated in realistic applications. We develop a novel and robust approach to testing whether there are differences in policing outcomes that are caused by differences in police precincts across NYC. Importantly, this approach is applicable to standard regression discontinuity designs with both numeric and point process data. This approach is robust to violations of the traditional assumptions, and is valid under weaker assumptions. We use a unique form of resampling to provide a valid estimate of our test statistic's null distribution even under violations of standard assumptions. 
This procedure gives substantially different results in the analysis of NYC arrest rates than those that rely on standard assumptions."}, "https://arxiv.org/abs/2110.12510": {"title": "Post-Regularization Confidence Bands for Ordinary Differential Equations", "link": "https://arxiv.org/abs/2110.12510", "description": "Ordinary differential equation (ODE) is an important tool to study the dynamics of a system of biological and physical processes. A central question in ODE modeling is to infer the significance of individual regulatory effect of one signal variable on another. However, building confidence band for ODE with unknown regulatory relations is challenging, and it remains largely an open question. In this article, we construct post-regularization confidence band for individual regulatory function in ODE with unknown functionals and noisy data observations. Our proposal is the first of its kind, and is built on two novel ingredients. The first is a new localized kernel learning approach that combines reproducing kernel learning with local Taylor approximation, and the second is a new de-biasing method that tackles infinite-dimensional functionals and additional measurement errors. We show that the constructed confidence band has the desired asymptotic coverage probability, and the recovered regulatory network approaches the truth with probability tending to one. We establish the theoretical properties when the number of variables in the system can be either smaller or larger than the number of sampling time points, and we study the regime-switching phenomenon. We demonstrate the efficacy of the proposed method through both simulations and illustrations with two data applications."}, "https://arxiv.org/abs/2111.02299": {"title": "Cross-validated risk scores adaptive enrichment (CADEN) design", "link": "https://arxiv.org/abs/2111.02299", "description": "We propose a Cross-validated ADaptive ENrichment design (CADEN) in which a trial population is enriched with a subpopulation of patients who are predicted to benefit from the treatment more than an average patient (the sensitive group). This subpopulation is found using a risk score constructed from the baseline (potentially high-dimensional) information about patients. The design incorporates an early stopping rule for futility. Simulation studies are used to assess the properties of CADEN against the original (non-enrichment) cross-validated risk scores (CVRS) design that constructs a risk score at the end of the trial. We show that when there exists a sensitive group of patients, CADEN achieves a higher power and a reduction in the expected sample size, in comparison to the CVRS design. We illustrate the application of the design in two real clinical trials. We conclude that the new design offers improved statistical efficiency in comparison to the existing non-enrichment method, as well as increased benefit to patients. The method has been implemented in an R package caden."}, "https://arxiv.org/abs/2208.05553": {"title": "Exploiting Neighborhood Interference with Low Order Interactions under Unit Randomized Design", "link": "https://arxiv.org/abs/2208.05553", "description": "Network interference, where the outcome of an individual is affected by the treatment assignment of those in their social network, is pervasive in real-world settings. However, it poses a challenge to estimating causal effects. 
We consider the task of estimating the total treatment effect (TTE), or the difference between the average outcomes of the population when everyone is treated versus when no one is, under network interference. Under a Bernoulli randomized design, we provide an unbiased estimator for the TTE when network interference effects are constrained to low order interactions among neighbors of an individual. We make no assumptions on the graph other than bounded degree, allowing for well-connected networks that may not be easily clustered. We derive a bound on the variance of our estimator and show in simulated experiments that it performs well compared with standard estimators for the TTE. We also derive a minimax lower bound on the mean squared error of our estimator which suggests that the difficulty of estimation can be characterized by the degree of interactions in the potential outcomes model. We also prove that our estimator is asymptotically normal under boundedness conditions on the network degree and potential outcomes model. Central to our contribution is a new framework for balancing model flexibility and statistical complexity as captured by this low order interactions structure."}, "https://arxiv.org/abs/2307.02236": {"title": "D-optimal Subsampling Design for Massive Data Linear Regression", "link": "https://arxiv.org/abs/2307.02236", "description": "Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider linear regression for an extraordinarily large number of observations, but only a few covariates. Subsampling aims at the selection of a given percentage of the existing original data. Under distributional assumptions on the covariates, we derive D-optimal subsampling designs and study their theoretical properties. We make use of fundamental concepts of optimal design theory and an equivalence theorem from constrained convex optimization. The thus obtained subsampling designs provide simple rules for whether to accept or reject a data point, allowing for an easy algorithmic implementation. In addition, we propose a simplified subsampling method with lower computational complexity that differs from the D-optimal design. We present a simulation study, comparing both subsampling schemes with the IBOSS method in the case of a fixed size of the subsample."}, "https://arxiv.org/abs/2307.09700": {"title": "The Connection Between R-Learning and Inverse-Variance Weighting for Estimation of Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2307.09700", "description": "Many methods for estimating conditional average treatment effects (CATEs) can be expressed as weighted pseudo-outcome regressions (PORs). Previous comparisons of POR techniques have paid careful attention to the choice of pseudo-outcome transformation. However, we argue that the dominant driver of performance is actually the choice of weights. For example, we point out that R-Learning implicitly performs a POR with inverse-variance weights (IVWs). In the CATE setting, IVWs mitigate the instability associated with inverse-propensity weights, and lead to convenient simplifications of bias terms. 
We demonstrate the superior performance of IVWs in simulations, and derive convergence rates for IVWs that are, to our knowledge, the fastest yet shown without assuming knowledge of the covariate distribution."}, "https://arxiv.org/abs/2311.02467": {"title": "Individualized Policy Evaluation and Learning under Clustered Network Interference", "link": "https://arxiv.org/abs/2311.02467", "description": "While there now exists a large literature on policy evaluation and learning, much of prior work assumes that the treatment assignment of one unit does not affect the outcome of another unit. Unfortunately, ignoring interference may lead to biased policy evaluation and ineffective learned policies. For example, treating influential individuals who have many friends can generate positive spillover effects, thereby improving the overall performance of an individualized treatment rule (ITR). We consider the problem of evaluating and learning an optimal ITR under clustered network interference (also known as partial interference), where clusters of units are sampled from a population and units may influence one another within each cluster. Unlike previous methods that impose strong restrictions on spillover effects, the proposed methodology only assumes a semiparametric structural model where each unit's outcome is an additive function of individual treatments within the cluster. Under this model, we propose an estimator that can be used to evaluate the empirical performance of an ITR. We show that this estimator is substantially more efficient than the standard inverse probability weighting estimator, which does not impose any assumption about spillover effects. We derive the finite-sample regret bound for a learned ITR, showing that the use of our efficient evaluation estimator leads to the improved performance of learned policies. Finally, we conduct simulation and empirical studies to illustrate the advantages of the proposed methodology."}, "https://arxiv.org/abs/2311.14220": {"title": "Assumption-lean and Data-adaptive Post-Prediction Inference", "link": "https://arxiv.org/abs/2311.14220", "description": "A primary challenge facing modern scientific research is the limited availability of gold-standard data, which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce an assumption-lean and data-adaptive Post-Prediction Inference (POP-Inf) procedure that allows valid and powerful inference based on ML-predicted outcomes. Its \"assumption-lean\" property guarantees reliable statistical inference without assumptions on the ML-prediction, for a wide range of statistical quantities. Its \"data-adaptive\" feature guarantees an efficiency gain over existing post-prediction inference methods, regardless of the accuracy of ML-prediction. 
We demonstrate the superiority and applicability of our method through simulations and large-scale genomic data."}, "https://arxiv.org/abs/2401.07365": {"title": "Sequential permutation testing by betting", "link": "https://arxiv.org/abs/2401.07365", "description": "We develop an anytime-valid permutation test, where the dataset is fixed and the permutations are sampled sequentially one by one, with the objective of saving computational resources by sampling fewer permutations and stopping early. The core technical advance is the development of new test martingales (nonnegative martingales with initial value one) for testing exchangeability against a very particular alternative. These test martingales are constructed using new and simple betting strategies that smartly bet on the relative ranks of permuted test statistics. The betting strategies are guided by the derivation of a simple log-optimal betting strategy, and display excellent power in practice. In contrast to a well-known method by Besag and Clifford, our method yields a valid e-value or a p-value at any stopping time, and with particular stopping rules, it yields computational gains under both the null and the alternative without compromising power."}, "https://arxiv.org/abs/2401.08702": {"title": "Do We Really Even Need Data?", "link": "https://arxiv.org/abs/2401.08702", "description": "As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called ``inference with predicted data'' problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure."}, "https://arxiv.org/abs/2105.09254": {"title": "Multiply Robust Causal Mediation Analysis with Continuous Treatments", "link": "https://arxiv.org/abs/2105.09254", "description": "In many applications, researchers are interested in the direct and indirect causal effects of a treatment or exposure on an outcome of interest. Mediation analysis offers a rigorous framework for identifying and estimating these causal effects. For binary treatments, efficient estimators for the direct and indirect effects are presented in Tchetgen Tchetgen and Shpitser (2012) based on the influence function of the parameter of interest. These estimators possess desirable properties, such as multiple-robustness and asymptotic normality, while allowing for slower than root-n rates of convergence for the nuisance parameters. However, in settings involving continuous treatments, these influence function-based estimators are not readily applicable without making strong parametric assumptions. 
In this work, utilizing a kernel-smoothing approach, we propose an estimator suitable for settings with continuous treatments inspired by the influence function-based estimator of Tchetgen Tchetgen and Shpitser (2012). Our proposed approach employs cross-fitting, relaxing the smoothness requirements on the nuisance functions, and allowing them to be estimated at slower rates than the target parameter. Additionally, similar to influence function-based estimators, our proposed estimator is multiply robust and asymptotically normal, making it applicable for inference in settings where a parametric model cannot be assumed."}, "https://arxiv.org/abs/2207.12382": {"title": "On Confidence Sequences for Bounded Random Processes via Universal Gambling Strategies", "link": "https://arxiv.org/abs/2207.12382", "description": "This paper considers the problem of constructing a confidence sequence, which is a sequence of confidence intervals that hold uniformly over time, for estimating the mean of bounded real-valued random processes. This paper revisits the gambling-based approach established in the recent literature from a natural \\emph{two-horse race} perspective, and demonstrates new properties of the resulting algorithm induced by Cover (1991)'s universal portfolio. The main result of this paper is a new algorithm based on a mixture of lower bounds, which closely approximates the performance of Cover's universal portfolio with constant per-round time complexity. A higher-order generalization of a lower bound on a logarithmic function in (Fan et al., 2015), which is developed as a key technique for the proposed algorithm, may be of independent interest."}, "https://arxiv.org/abs/2305.11857": {"title": "Computing high-dimensional optimal transport by flow neural networks", "link": "https://arxiv.org/abs/2305.11857", "description": "Flow-based models are widely used in generative tasks, including normalizing flow, where a neural network transports from a data distribution $P$ to a normal distribution. This work develops a flow-based model that transports from $P$ to an arbitrary $Q$ where both distributions are only accessible via finite samples. We propose to learn the dynamic optimal transport between $P$ and $Q$ by training a flow neural network. The model is trained to optimally find an invertible transport map between $P$ and $Q$ by minimizing the transport cost. The trained optimal transport flow subsequently allows for performing many downstream tasks, including infinitesimal density ratio estimation (DRE) and distribution interpolation in the latent space for generative models. The effectiveness of the proposed model on high-dimensional data is demonstrated by strong empirical performance on high-dimensional DRE, OT baselines, and image-to-image translation."}, "https://arxiv.org/abs/2305.18435": {"title": "Statistically Efficient Bayesian Sequential Experiment Design via Reinforcement Learning with Cross-Entropy Estimators", "link": "https://arxiv.org/abs/2305.18435", "description": "Reinforcement learning can learn amortised design policies for designing sequences of experiments. However, current amortised methods rely on estimators of expected information gain (EIG) that require an exponential number of samples on the magnitude of the EIG to achieve an unbiased estimation. We propose the use of an alternative estimator based on the cross-entropy of the joint model distribution and a flexible proposal distribution. 
This proposal distribution approximates the true posterior of the model parameters given the experimental history and the design policy. Our method overcomes the exponential-sample complexity of previous approaches and provides more accurate estimates of high EIG values. More importantly, it allows learning of superior design policies, and is compatible with continuous and discrete design spaces, non-differentiable likelihoods and even implicit probabilistic models."}, "https://arxiv.org/abs/2309.16965": {"title": "Controlling Continuous Relaxation for Combinatorial Optimization", "link": "https://arxiv.org/abs/2309.16965", "description": "Motivated by developments in machine learning technologies, unsupervised learning (UL)-based solvers for combinatorial optimization (CO) problems have recently been proposed. These solvers train a neural network that outputs a solution by optimizing the CO objective directly. UL-based solvers have several advantages over traditional methods. However, various studies have shown that these solvers underperform compared to greedy algorithms for complex CO problems. In addition, these solvers employ a continuous relaxation strategy; thus, post-learning rounding from the continuous space back to the original discrete space is required, undermining the robustness of the results. To address these problems, we propose the continuous relaxation annealing (CRA) strategy. The CRA introduces a penalty term to control the continuity and discreteness of the relaxed variables and eliminate local optima. In addition, the CRA implements an annealing process for the penalty term that initially prioritizes continuous solutions and progressively transitions towards discrete solutions until the relaxed variables become nearly discrete, eliminating the artificial rounding. Experimental results demonstrate that the CRA significantly enhances the UL-based solvers, outperforming both existing UL-based solvers and greedy algorithms for complex CO problems."}, "https://arxiv.org/abs/2310.12424": {"title": "Optimal heteroskedasticity testing in nonparametric regression", "link": "https://arxiv.org/abs/2310.12424", "description": "Heteroskedasticity testing in nonparametric regression is a classic statistical problem with important practical applications, yet fundamental limits are unknown. Adopting a minimax perspective, this article considers the testing problem in the context of an $\alpha$-H\\"{o}lder mean and a $\beta$-H\\"{o}lder variance function. For $\alpha > 0$ and $\beta \in (0, 1/2)$, the sharp minimax separation rate $n^{-4\alpha} + n^{-4\beta/(4\beta+1)} + n^{-2\beta}$ is established. To achieve the minimax separation rate, a kernel-based statistic using first-order squared differences is developed. Notably, the statistic estimates a proxy rather than a natural quadratic functional (the squared distance between the variance function and its best $L^2$ approximation by a constant) suggested in previous work.\n The setting where no smoothness is assumed on the variance function is also studied; the variance profile across the design points can be arbitrary. Despite the lack of structure, consistent testing turns out to still be possible by using the Gaussian character of the noise, and the minimax rate is shown to be $n^{-4\alpha} + n^{-1/2}$. Exploiting noise information happens to be a fundamental necessity as consistent testing is impossible if nothing more than zero mean and unit variance is known about the noise distribution. 
Furthermore, in the setting where the variance function is $\beta$-H\\"{o}lder but heteroskedasticity is measured only with respect to the design points, the minimax separation rate is shown to be $n^{-4\alpha} + n^{-\left((1/2) \vee (4\beta/(4\beta+1))\right)}$ when the noise is Gaussian and $n^{-4\alpha} + n^{-4\beta/(4\beta+1)} + n^{-2\beta}$ when the noise distribution is unknown."}, "https://arxiv.org/abs/2311.12267": {"title": "Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity", "link": "https://arxiv.org/abs/2311.12267", "description": "We study causal representation learning, the task of recovering high-level latent variables and their causal relationships in the form of a causal graph from low-level observed data (such as text and images), assuming access to observations generated from multiple environments. Prior results on the identifiability of causal representations typically assume access to single-node interventions, which is rather unrealistic in practice, since the latent variables are unknown in the first place. In this work, we provide the first identifiability results based on data that stem from general environments. We show that for linear causal models, while the causal graph can be fully recovered, the latent variables are only identified up to the surrounded-node ambiguity (SNA) \citep{varici2023score}. We provide a counterpart of our guarantee, showing that SNA is basically unavoidable in our setting. We also propose an algorithm, \texttt{LiNGCReL}, which provably recovers the ground-truth model up to SNA, and we demonstrate its effectiveness via numerical experiments. Finally, we consider general non-parametric causal models and show that the same identification barrier holds when assuming access to groups of soft single-node interventions."}, "https://arxiv.org/abs/2401.08875": {"title": "DCRMTA: Unbiased Causal Representation for Multi-touch Attribution", "link": "https://arxiv.org/abs/2401.08875", "description": "Multi-touch attribution (MTA) currently plays a pivotal role in achieving a fair estimation of the contributions of each advertising touchpoint towards conversion behavior, deeply influencing budget allocation and advertising recommendation. Previous works attempted to eliminate the bias caused by user preferences to achieve the unbiased assumption of the conversion model. The multi-model collaboration method is not efficient, and the complete elimination of user influence also eliminates the causal effect of user features on conversion, resulting in limited performance of the conversion model. This paper redefines the causal effect of user features on conversions and proposes a novel end-to-end approach, Deep Causal Representation for MTA (DCRMTA). Our model focuses on extracting causal features between conversions and users while eliminating confounding variables. Furthermore, extensive experiments demonstrate DCRMTA's superior performance in conversion prediction across varying data distributions, while also effectively attributing value across different advertising channels."}, "https://arxiv.org/abs/2401.15519": {"title": "Large Deviation Analysis of Score-based Hypothesis Testing", "link": "https://arxiv.org/abs/2401.15519", "description": "Score-based statistical models play an important role in modern machine learning, statistics, and signal processing. For hypothesis testing, a score-based hypothesis test is proposed in \cite{wu2022score}. 
We analyze the performance of this score-based hypothesis testing procedure and derive upper bounds on the probabilities of its Type I and II errors. We prove that the exponents of our error bounds are asymptotically (in the number of samples) tight for the case of simple null and alternative hypotheses. We calculate these error exponents explicitly in specific cases and provide numerical studies for various other scenarios of interest."}, "https://arxiv.org/abs/2402.03416": {"title": "A hyperbolastic type-I diffusion process: Parameter estimation by means of the firefly algorithm", "link": "https://arxiv.org/abs/2402.03416", "description": "A stochastic diffusion process, whose mean function is a hyperbolastic curve of type I, is presented. The main characteristics of the process are studied and the problem of maximum likelihood estimation for the parameters of the process is considered. To this end, the firefly metaheuristic optimization algorithm is applied after bounding the parametric space by a stagewise procedure. Some examples based on simulated sample paths and real data illustrate this development."}, "https://arxiv.org/abs/2402.03459": {"title": "Hybrid Smoothing for Anomaly Detection in Time Series", "link": "https://arxiv.org/abs/2402.03459", "description": "Many industrial and engineering processes monitored as time series have smooth trends that indicate normal behavior and occasionally anomalous patterns that can indicate a problem. This kind of behavior can be modeled by a smooth trend such as a spline or Gaussian process and a disruption based on a sparser representation. Our approach is to expand the process signal into two sets of basis functions: one set uses $L_2$ penalties on the coefficients and the other set uses $L_1$ penalties to control sparsity. From a frequentist perspective, this results in a hybrid smoother that combines cubic smoothing splines and the LASSO, and as a Bayesian hierarchical model (BHM), this is equivalent to priors giving a Gaussian process and a Laplace distribution for anomaly coefficients. For the hybrid smoother, we propose two new ways of determining the penalty parameters that use effective degrees of freedom, and contrast this with the BHM, which uses loosely informative inverse gamma priors. Several reformulations are used to make sampling the BHM posterior more efficient, including some novel features in orthogonalizing and regularizing the model basis functions. This methodology is motivated by a substantive application, monitoring the water treatment process for the Denver Metropolitan area. We also test the methods with a Monte Carlo study designed around the kind of anomalies expected in this application. Both the hybrid smoother and the full BHM give comparable results with small false positive and false negative rates. Besides being successful in the water treatment application, this work can be easily extended to other Gaussian process models and other features that represent process disruptions."}, "https://arxiv.org/abs/2402.03491": {"title": "Periodically Correlated Time Series and the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2402.03491", "description": "This research introduces a novel approach to resampling periodically correlated (PC) time series using bandpass filters for frequency separation, called the Variable Bandpass Periodic Block Bootstrap (VBPBB), and then examines the significant advantages of this new method. 
While bootstrapping allows estimation of a statistic's sampling distribution by resampling the original data with replacement, and block bootstrapping is a model-free resampling strategy for correlated time series data, both fail to preserve correlations in PC time series. Existing extensions of the block bootstrap help preserve the correlation structures of PC processes but suffer from flaws and inefficiencies. Analyses of time series data containing cyclic, seasonal, or PC principal components often seen in annual, daily, or other cyclostationary processes benefit from separating these components. The VBPBB uses bandpass filters to separate a PC component from interference such as noise at other uncorrelated frequencies. A simulation study is presented, demonstrating near universal improvements obtained from the VBPBB when compared with prior block bootstrapping methods for periodically correlated time series."}, "https://arxiv.org/abs/2402.03538": {"title": "Forecasting Adversarial Actions Using Judgment Decomposition-Recomposition", "link": "https://arxiv.org/abs/2402.03538", "description": "In domains such as homeland security, cybersecurity and competitive marketing, it is frequently the case that analysts need to forecast adversarial actions that impact the problem of interest. Standard structured expert judgement elicitation techniques may fall short as they do not explicitly take into account intentionality. We present a decomposition technique based on adversarial risk analysis followed by a behavioral recomposition using discrete choice models that facilitate such elicitation process and illustrate its performance through behavioral experiments."}, "https://arxiv.org/abs/2402.03565": {"title": "Breakpoint based online anomaly detection", "link": "https://arxiv.org/abs/2402.03565", "description": "The goal of anomaly detection is to identify observations that are generated by a distribution that differs from the reference distribution that qualifies normal behavior. When examining a time series, the reference distribution may evolve over time. The anomaly detector must therefore be able to adapt to such changes. In the online context, it is particularly difficult to adapt to abrupt and unpredictable changes. Our solution to this problem is based on the detection of breakpoints in order to adapt in real time to the new reference behavior of the series and to increase the accuracy of the anomaly detection. This solution also provides a control of the False Discovery Rate by extending methods developed for stationary series."}, "https://arxiv.org/abs/2402.03683": {"title": "Gambling-Based Confidence Sequences for Bounded Random Vectors", "link": "https://arxiv.org/abs/2402.03683", "description": "A confidence sequence (CS) is a sequence of confidence sets that contains a target parameter of an underlying stochastic process at any time step with high probability. This paper proposes a new approach to constructing CSs for means of bounded multivariate stochastic processes using a general gambling framework, extending the recently established coin toss framework for bounded random processes. The proposed gambling framework provides a general recipe for constructing CSs for categorical and probability-vector-valued observations, as well as for general bounded multidimensional observations through a simple reduction. 
This paper specifically explores the use of the mixture portfolio, akin to Cover's universal portfolio, in the proposed framework and investigates the properties of the resulting CSs. Simulations demonstrate the tightness of these confidence sequences compared to existing methods. When applied to the sampling-without-replacement setting for finite categorical data, it is shown that the resulting CS based on a universal gambling strategy is provably tighter than that of the posterior-prior ratio martingale proposed by Waudby-Smith and Ramdas."}, "https://arxiv.org/abs/2402.03954": {"title": "Mixed Matrix Completion in Complex Survey Sampling under Heterogeneous Missingness", "link": "https://arxiv.org/abs/2402.03954", "description": "Modern surveys with large sample sizes and growing mixed-type questionnaires require robust and scalable analysis methods. In this work, we consider recovering a mixed dataframe matrix, obtained by complex survey sampling, with entries following different canonical exponential distributions and subject to heterogeneous missingness. To tackle this challenging task, we propose a two-stage procedure: in the first stage, we model the entry-wise missing mechanism by logistic regression, and in the second stage, we complete the target parameter matrix by maximizing a weighted log-likelihood with a low-rank constraint. We propose a fast and scalable estimation algorithm that achieves sublinear convergence, and the upper bound for the estimation error of the proposed method is rigorously derived. Experimental results support our theoretical claims, and the proposed estimator shows its merits compared to other existing methods. The proposed method is applied to analyze the National Health and Nutrition Examination Survey data."}, "https://arxiv.org/abs/2402.04165": {"title": "Monthly GDP nowcasting with Machine Learning and Unstructured Data", "link": "https://arxiv.org/abs/2402.04165", "description": "In the dynamic landscape of continuous change, Machine Learning (ML) \"nowcasting\" models offer a distinct advantage for informed decision-making in both public and private sectors. This study introduces ML-based GDP growth projection models for monthly rates in Peru, integrating structured macroeconomic indicators with high-frequency unstructured sentiment variables. Analyzing data from January 2007 to May 2023, encompassing 91 leading economic indicators, the study evaluates six ML algorithms to identify optimal predictors. Findings highlight the superior predictive capability of ML models using unstructured data, particularly Gradient Boosting Machine, LASSO, and Elastic Net, exhibiting a 20% to 25% reduction in prediction errors compared to traditional AR and Dynamic Factor Models (DFM). This enhanced performance is attributed to the ML models' better handling of data in high-uncertainty periods, such as economic crises."}, "https://arxiv.org/abs/2402.03447": {"title": "Challenges in Variable Importance Ranking Under Correlation", "link": "https://arxiv.org/abs/2402.03447", "description": "Variable importance plays a pivotal role in interpretable machine learning as it helps measure the impact of factors on the output of the prediction model. Model-agnostic methods based on the generation of \"null\" features via permutation (or related approaches) can be applied. Such analysis is often utilized in pharmaceutical applications due to its ability to interpret black-box models, including tree-based ensembles. 
A major challenge and significant confounder in variable importance estimation, however, is the presence of between-feature correlation. Recently, several adjustments to marginal permutation utilizing feature knockoffs were proposed to address this issue, such as the variable importance measure known as conditional predictive impact (CPI). Assessment and evaluation of such approaches is the focus of our work. We first present a comprehensive simulation study investigating the impact of feature correlation on the assessment of variable importance. We then theoretically prove the limitation that highly correlated features pose for the CPI through the knockoff construction. While we would expect there to be no correlation between knockoff variables and their corresponding predictor variables, we prove that the correlation increases linearly beyond a certain correlation threshold between the predictor variables. Our findings emphasize the absence of a free lunch when dealing with high feature correlation, as well as the necessity of understanding the utility and limitations of methods for variable importance estimation."}, "https://arxiv.org/abs/2402.03527": {"title": "Consistent Validation for Predictive Methods in Spatial Settings", "link": "https://arxiv.org/abs/2402.03527", "description": "Spatial prediction tasks are key to weather forecasting, studying air pollution, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle the mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We instead propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data."}, "https://arxiv.org/abs/2402.03941": {"title": "Discovery of the Hidden World with Large Language Models", "link": "https://arxiv.org/abs/2402.03941", "description": "Science originates with discovering new causal knowledge from a combination of known facts and observations. Traditional causal discovery approaches mainly rely on high-quality measured variables, usually given by human experts, to find causal relations. However, the causal variables are usually unavailable in a wide range of real-world applications. The rise of large language models (LLMs) that are trained to learn rich knowledge from the massive observations of the world provides a new opportunity to assist with discovering high-level hidden variables from the raw observational data. Therefore, we introduce COAT: Causal representatiOn AssistanT. COAT incorporates LLMs as a factor proposer that extracts the potential causal factors from unstructured data. 
Moreover, LLMs can also be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to further parse the raw unstructured data into structured data. The annotated data will be fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data, as well as useful feedback to further improve the extraction of causal factors by LLMs. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies of review rating analysis and neuropathic diagnosis."}, "https://arxiv.org/abs/2402.04082": {"title": "An Optimal House Price Prediction Algorithm: XGBoost", "link": "https://arxiv.org/abs/2402.04082", "description": "An accurate prediction of house prices is a fundamental requirement for various sectors including real estate and mortgage lending. It is widely recognized that a property value is not solely determined by its physical attributes but is significantly influenced by its surrounding neighbourhood. Meeting the diverse housing needs of individuals while balancing budget constraints is a primary concern for real estate developers. To this end, we addressed the house price prediction problem as a regression task and thus employed various machine learning techniques capable of expressing the significance of independent variables. We made use of the housing dataset of Ames City in Iowa, USA to compare support vector regressor, random forest regressor, XGBoost, multilayer perceptron and multiple linear regression algorithms for house price prediction. Afterwards, we identified the key factors that influence housing costs. Our results show that XGBoost is the best performing model for house price prediction."}, "https://arxiv.org/abs/1801.07683": {"title": "Discovering the Signal Subgraph: An Iterative Screening Approach on Graphs", "link": "https://arxiv.org/abs/1801.07683", "description": "Supervised learning on graphs is a challenging task due to the high dimensionality and inherent structural dependencies in the data, where each edge depends on a pair of vertices. Existing conventional methods designed for Euclidean data do not account for this graph dependency structure. To address this issue, this paper proposes an iterative vertex screening method to identify the signal subgraph that is most informative for the given graph attributes. The method screens the rows and columns of the adjacency matrix concurrently and stops when the resulting distance correlation is maximized. We establish the theoretical foundation of our method by proving that it estimates the true signal subgraph with high probability. Additionally, we establish the convergence rate of classification error under the Erdos-Renyi random graph model and prove that the subsequent classification can be asymptotically optimal, outperforming the entire graph under high-dimensional conditions. Our method is evaluated on various simulated datasets and real-world human and murine graphs derived from functional and structural magnetic resonance images. 
The results demonstrate its excellent performance in estimating the ground-truth signal subgraph and achieving superior classification accuracy."}, "https://arxiv.org/abs/2006.09587": {"title": "Adaptive, Rate-Optimal Hypothesis Testing in Nonparametric IV Models", "link": "https://arxiv.org/abs/2006.09587", "description": "We propose a new adaptive hypothesis test for inequality (e.g., monotonicity, convexity) and equality (e.g., parametric, semiparametric) restrictions on a structural function in a nonparametric instrumental variables (NPIV) model. Our test statistic is based on a modified leave-one-out sample analog of a quadratic distance between the restricted and unrestricted sieve NPIV estimators. We provide computationally simple, data-driven choices of sieve tuning parameters and Bonferroni adjusted chi-squared critical values. Our test adapts to the unknown smoothness of alternative functions in the presence of unknown degree of endogeneity and unknown strength of the instruments. It attains the adaptive minimax rate of testing in $L^2$.\n That is, the sum of its type I error uniformly over the composite null and its type II error uniformly over nonparametric alternative models cannot be improved by any other hypothesis test for NPIV models of unknown regularities. Confidence sets in $L^2$ are obtained by inverting the adaptive test. Simulations confirm that our adaptive test controls size and its finite-sample power greatly exceeds existing non-adaptive tests for monotonicity and parametric restrictions in NPIV models. Empirical applications to test for shape restrictions of differentiated products demand and of Engel curves are presented."}, "https://arxiv.org/abs/2007.01680": {"title": "Developing a predictive signature for two trial endpoints using the cross-validated risk scores method", "link": "https://arxiv.org/abs/2007.01680", "description": "The existing cross-validated risk scores (CVRS) design has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using high-dimensional data (such as genetic data). The design is based on computing a risk score for each patient and dividing them into clusters using a non-parametric clustering procedure. In some settings it is desirable to consider the trade-off between two outcomes, such as efficacy and toxicity, or cost and effectiveness. With this motivation, we extend the CVRS design (CVRS2) to consider two outcomes. The design employs bivariate risk scores that are divided into clusters. We assess the properties of the CVRS2 using simulated data and illustrate its application on a randomised psychiatry trial. We show that CVRS2 is able to reliably identify the sensitive group (the group for which the new treatment provides benefit on both outcomes) in the simulated data. We apply the CVRS2 design to a psychology clinical trial that had offender status and substance use status as two outcomes and collected a large number of baseline covariates. 
The CVRS2 design yields a significant treatment effect for both outcomes, while the CVRS approach identified a significant effect for the offender status only after pre-filtering the covariates."}, "https://arxiv.org/abs/2208.05121": {"title": "Locally Adaptive Bayesian Isotonic Regression using Half Shrinkage Priors", "link": "https://arxiv.org/abs/2208.05121", "description": "Isotonic regression or monotone function estimation is a problem of estimating function values under monotonicity constraints, which appears naturally in many scientific fields. This paper proposes a new Bayesian method with global-local shrinkage priors for estimating monotone function values. Specifically, we introduce half shrinkage priors for positive valued random variables and assign them for the first-order differences of function values. We also develop fast and simple Gibbs sampling algorithms for full posterior analysis. By incorporating advanced shrinkage priors, the proposed method is adaptive to local abrupt changes or jumps in target functions. We show this adaptive property theoretically by proving that the posterior mean estimators are robust to large differences and that asymptotic risk for unchanged points can be improved. Finally, we demonstrate the proposed methods through simulations and applications to a real data set."}, "https://arxiv.org/abs/2303.03497": {"title": "Integrative data analysis where partial covariates have complex non-linear effects by using summary information from an external data", "link": "https://arxiv.org/abs/2303.03497", "description": "A full parametric and linear specification may be insufficient to capture complicated patterns in studies exploring complex features, such as those investigating age-related changes in brain functional abilities. Alternatively, a partially linear model (PLM) consisting of both parametric and non-parametric elements may have a better fit. This model has been widely applied in economics, environmental science, and biomedical studies. In this paper, we introduce a novel statistical inference framework that equips PLM with high estimation efficiency by effectively synthesizing summary information from external data into the main analysis. Such an integrative scheme is versatile in assimilating various types of reduced models from the external study. The proposed method is shown to be theoretically valid and numerically convenient, and it ensures a high-efficiency gain compared to classic methods in PLM. Our method is further validated using two data applications by evaluating the risk factors of brain imaging measures and blood pressure."}, "https://arxiv.org/abs/2304.08184": {"title": "Adjustment with Many Regressors Under Covariate-Adaptive Randomizations", "link": "https://arxiv.org/abs/2304.08184", "description": "Our paper discovers a new trade-off of using regression adjustments (RAs) in causal inference under covariate-adaptive randomizations (CARs). On one hand, RAs can improve the efficiency of causal estimators by incorporating information from covariates that are not used in the randomization. On the other hand, RAs can degrade estimation efficiency due to their estimation errors, which are not asymptotically negligible when the number of regressors is of the same order as the sample size. Ignoring the estimation errors of RAs may result in serious over-rejection of causal inference under the null hypothesis. 
To address the issue, we construct a new ATE estimator by optimally linearly combining the adjusted and unadjusted estimators. We then develop a unified inference theory for this estimator under CARs. It has two features: (1) the Wald test based on it achieves the exact asymptotic size under the null hypothesis, regardless of whether the number of covariates is fixed or diverges no faster than the sample size; and (2) it guarantees weak efficiency improvement over both the adjusted and unadjusted estimators."}, "https://arxiv.org/abs/2305.01435": {"title": "Transfer Estimates for Causal Effects across Heterogeneous Sites", "link": "https://arxiv.org/abs/2305.01435", "description": "We consider the problem of extrapolating treatment effects across heterogeneous populations (\"sites\"/\"contexts\"). We consider an idealized scenario in which the researcher observes cross-sectional data for a large number of units across several \"experimental\" sites in which an intervention has already been implemented to a new \"target\" site for which a baseline survey of unit-specific, pre-treatment outcomes and relevant attributes is available. We propose a transfer estimator that exploits cross-sectional variation between individuals and sites to predict treatment outcomes using baseline outcome data for the target location. We consider the problem of determining the optimal finite-dimensional feature space in which to solve that prediction problem. Our approach is design-based in the sense that the performance of the predictor is evaluated given the specific, finite selection of experimental and target sites. Our approach is nonparametric, and our formal results concern the construction of an optimal basis of predictors as well as convergence rates for the estimated conditional average treatment effect relative to the constrained-optimal population predictor for the target site. We illustrate our approach using a combined data set of five multi-site randomized controlled trials (RCTs) to evaluate the effect of conditional cash transfers on school attendance."}, "https://arxiv.org/abs/2305.10054": {"title": "Functional Adaptive Double-Sparsity Estimator for Functional Linear Regression Model with Multiple Functional Covariates", "link": "https://arxiv.org/abs/2305.10054", "description": "Sensor devices have been increasingly used in engineering and health studies recently, and the captured multi-dimensional activity and vital sign signals can be studied in association with health outcomes to inform public health. The common approach is the scalar-on-function regression model, in which health outcomes are the scalar responses while high-dimensional sensor signals are the functional covariates, but how to effectively interpret results becomes difficult. In this study, we propose a new Functional Adaptive Double-Sparsity (FadDoS) estimator based on functional regularization of sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and exhibit excellent performances compared to existing approaches. 
Application to a Kinect sensor study that utilized an advanced motion sensing device tracking human multiple joint movements and conducted among community-dwelling elderly demonstrates how the FadDoS estimator can effectively characterize the detailed association between joint movements and physical health assessments. The proposed method is not only effective in Kinect sensor analysis but also applicable to broader fields, where multi-dimensional sensor signals are collected simultaneously, to expand the use of sensor devices in health studies and facilitate sensor data analysis."}, "https://arxiv.org/abs/2305.15671": {"title": "Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal Data", "link": "https://arxiv.org/abs/2305.15671", "description": "We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on time series of matrices with observations distributed on a fixed 2-D spatial grid, i.e., the spatio-temporal data, and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially-smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducible Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC."}, "https://arxiv.org/abs/2307.15181": {"title": "On the Efficiency of Finely Stratified Experiments", "link": "https://arxiv.org/abs/2307.15181", "description": "This paper studies the efficient estimation of a large class of treatment effect parameters that arise in the analysis of experiments. Here, efficiency is understood to be with respect to a broad class of treatment assignment schemes for which the marginal probability that any unit is assigned to treatment equals a pre-specified value, e.g., one half. Importantly, we do not require that treatment status is assigned in an i.i.d. fashion, thereby accommodating complicated treatment assignment schemes that are used in practice, such as stratified block randomization and matched pairs. The class of parameters considered are those that can be expressed as the solution to a set of moment conditions involving a known function of the observed data, including possibly the pre-specified value for the marginal probability of treatment assignment. We show that this class of parameters includes, among other things, average treatment effects, quantile treatment effects, local average treatment effects as well as the counterparts to these quantities in experiments in which the unit is itself a cluster. In this setting, we establish two results. 
First, we derive a lower bound on the asymptotic variance of estimators of the parameter of interest in the form of a convolution theorem. Second, we show that the naive method of moments estimator achieves this bound on the asymptotic variance quite generally if treatment is assigned using a \"finely stratified\" design. By a \"finely stratified\" design, we mean experiments in which units are divided into groups of a fixed size and a proportion within each group is assigned to treatment uniformly at random so that it respects the restriction on the marginal probability of treatment assignment. In this sense, \"finely stratified\" experiments lead to efficient estimators of treatment effect parameters \"by design\" rather than through ex post covariate adjustment."}, "https://arxiv.org/abs/2310.13785": {"title": "Bayesian Estimation of Panel Models under Potentially Sparse Heterogeneity", "link": "https://arxiv.org/abs/2310.13785", "description": "We incorporate a version of a spike and slab prior, comprising a pointmass at zero (\"spike\") and a Normal distribution around zero (\"slab\") into a dynamic panel data framework to model coefficient heterogeneity. In addition to homogeneity and full heterogeneity, our specification can also capture sparse heterogeneity, that is, there is a core group of units that share common parameters and a set of deviators with idiosyncratic parameters. We fit a model with unobserved components to income data from the Panel Study of Income Dynamics. We find evidence for sparse heterogeneity for balanced panels composed of individuals with long employment histories."}, "https://arxiv.org/abs/2401.12420": {"title": "Rank-based estimators of global treatment effects for cluster randomized trials with multiple endpoints", "link": "https://arxiv.org/abs/2401.12420", "description": "Cluster randomization trials commonly employ multiple endpoints. When a single summary of treatment effects across endpoints is of primary interest, global hypothesis testing/effect estimation methods represent a common analysis strategy. However, specification of the joint distribution required by these methods is non-trivial, particularly when endpoint properties differ. We develop rank-based interval estimators for a global treatment effect referred to as the \"global win probability,\" or the probability that a treatment individual responds better than a control individual on average. Using endpoint-specific ranks among the combined sample and within each arm, each individual-level observation is converted to a \"win fraction\" which quantifies the proportion of wins experienced over every observation in the comparison arm. An individual's multiple observations are then replaced by a single \"global win fraction,\" constructed by averaging win fractions across endpoints. A linear mixed model is applied directly to the global win fractions to recover point, variance, and interval estimates of the global win probability adjusted for clustering. Simulation demonstrates our approach performs well concerning coverage and type I error, and methods are easily implemented using standard software. 
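For the global win probability construction described in the cluster-randomized-trials entry above, here is a minimal sketch of the win-fraction bookkeeping. It uses direct pairwise comparisons rather than the rank-based shortcut and ignores clustering and the mixed-model step; the sample sizes and the higher-is-better convention are assumptions.

```python
# Sketch: per-endpoint "win fractions" and a single global win fraction per individual.
# Toy data only; clustering adjustment via a mixed model is omitted.
import numpy as np

rng = np.random.default_rng(1)
n_trt, n_ctl, n_endpoints = 40, 40, 3
trt = rng.normal(loc=0.3, size=(n_trt, n_endpoints))    # treated individuals x endpoints
ctl = rng.normal(loc=0.0, size=(n_ctl, n_endpoints))    # control individuals x endpoints

def win_fractions(own, other):
    """For each individual in `own`, the proportion of `other` it beats, per endpoint."""
    wins = (own[:, None, :] > other[None, :, :]).astype(float)
    ties = (own[:, None, :] == other[None, :, :]).astype(float)
    return (wins + 0.5 * ties).mean(axis=1)              # average over the comparison arm

wf_trt = win_fractions(trt, ctl)                         # shape (n_trt, n_endpoints)
global_wf = wf_trt.mean(axis=1)                          # average win fraction across endpoints
print("crude global win probability estimate:", round(global_wf.mean(), 3))
```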
A case study using publicly available data is provided with corresponding R and SAS code."}, "https://arxiv.org/abs/2401.14684": {"title": "Inference for Cumulative Incidences and Treatment Effects in Randomized Controlled Trials with Time-to-Event Outcomes under ICH E9 (E1)", "link": "https://arxiv.org/abs/2401.14684", "description": "In randomized controlled trials (RCT) with time-to-event outcomes, intercurrent events occur as semi-competing/competing events, and they could affect the hazard of outcomes or render outcomes ill-defined. Although five strategies have been proposed in ICH E9 (R1) addendum to address intercurrent events in RCT, they did not readily extend to the context of time-to-event data for studying causal effects with rigorously stated implications. In this study, we show how to define, estimate, and infer the time-dependent cumulative incidence of outcome events in such contexts for obtaining causal interpretations. Specifically, we derive the mathematical forms of the scientific objective (i.e., causal estimands) under the five strategies and clarify the required data structure to identify these causal estimands. Furthermore, we summarize estimation and inference methods for these causal estimands by adopting methodologies in survival analysis, including analytic formulas for asymptotic analysis and hypothesis testing. We illustrate our methods with the LEADER Trial on investigating the effect of liraglutide on cardiovascular outcomes. Studies of multiple endpoints and combining strategies to address multiple intercurrent events can help practitioners understand treatment effects more comprehensively."}, "https://arxiv.org/abs/1908.06486": {"title": "Independence Testing for Temporal Data", "link": "https://arxiv.org/abs/1908.06486", "description": "Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time-series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in invalid test. To address these challenges, this paper introduces the temporal dependence statistic with block permutation to test independence between temporal data. Under proper assumptions, the proposed procedure is asymptotically valid and universally consistent for testing independence between stationary time-series, and capable of estimating the optimal dependence lag that maximizes the dependence. Notably, it is compatible with a rich family of distance and kernel based dependence measures, eliminates the need for multiple testing, and demonstrates superior power in multivariate, low sample size, and nonlinear settings. An analysis of neural connectivity with fMRI data reveals various temporal dependence among signals within the visual network and default mode network."}, "https://arxiv.org/abs/2001.01095": {"title": "High-Dimensional Independence Testing via Maximum and Average Distance Correlations", "link": "https://arxiv.org/abs/2001.01095", "description": "This paper introduces and investigates the utilization of maximum and average distance correlations for multivariate independence testing. 
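As background for the maximum and average distance correlations discussed above, the following is a minimal sketch of the plain (biased) sample distance correlation they build on; it is not the authors' implementation, and the toy data are illustrative only.

```python
# Sketch: biased sample distance correlation between two multivariate samples.
import numpy as np
from scipy.spatial.distance import cdist

def _double_center(d):
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """x and y are (n, p) and (n, q) arrays with matching sample size n."""
    a = _double_center(cdist(x, x))                      # double-centered distance matrices
    b = _double_center(cdist(y, y))
    dcov2 = (a * b).mean()
    dvar_x, dvar_y = (a * a).mean(), (b * b).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = np.column_stack([x[:, 0] ** 2, rng.normal(size=200)])            # nonlinear dependence on x[:, 0]
print(round(distance_correlation(x, y), 3))                          # should be noticeably larger
print(round(distance_correlation(x, rng.normal(size=(200, 2))), 3))  # independence baseline (biased above 0)
```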
We characterize their consistency properties in high-dimensional settings with respect to the number of marginally dependent dimensions, assess the advantages of each test statistic, examine their respective null distributions, and present a fast chi-square-based testing procedure. The resulting tests are non-parametric and applicable to both Euclidean distance and the Gaussian kernel as the underlying metric. To better understand the practical use cases of the proposed tests, we evaluate the empirical performance of the maximum distance correlation, average distance correlation, and the original distance correlation across various multivariate dependence scenarios, as well as conduct a real data experiment to test the presence of various cancer types and peptide levels in human plasma."}, "https://arxiv.org/abs/2204.10275": {"title": "Do t-Statistic Hurdles Need to be Raised?", "link": "https://arxiv.org/abs/2204.10275", "description": "Many scholars have called for raising statistical hurdles to guard against false discoveries in academic publications. I show these calls may be difficult to justify empirically. Published data exhibit bias: results that fail to meet existing hurdles are often unobserved. These unobserved results must be extrapolated, which can lead to weak identification of revised hurdles. In contrast, statistics that can target only published findings (e.g. empirical Bayes shrinkage and the FDR) can be strongly identified, as data on published findings is plentiful. I demonstrate these results theoretically and in an empirical analysis of the cross-sectional return predictability literature."}, "https://arxiv.org/abs/2310.07852": {"title": "On the Computational Complexity of Private High-dimensional Model Selection", "link": "https://arxiv.org/abs/2310.07852", "description": "We consider the problem of model selection in a high-dimensional sparse linear regression model under privacy constraints. We propose a differentially private best subset selection method with strong utility properties by adopting the well-known exponential mechanism for selecting the best model. We propose an efficient Metropolis-Hastings algorithm and establish that it enjoys polynomial mixing time to its stationary distribution. Furthermore, we also establish approximate differential privacy for the final estimates of the Metropolis-Hastings random walk using its mixing property. Finally, we perform some illustrative experiments that show the strong utility of our algorithm."}, "https://arxiv.org/abs/2311.01303": {"title": "Local differential privacy in survival analysis using private failure indicators", "link": "https://arxiv.org/abs/2311.01303", "description": "We consider the estimation of the cumulative hazard function, and equivalently the distribution function, with censored data under a setup that preserves the privacy of the survival database. This is done through a $\\alpha$-locally differentially private mechanism for the failure indicators and by proposing a non-parametric kernel estimator for the cumulative hazard function that remains consistent under the privatization. 
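The locally private survival entry above privatizes binary failure indicators under an alpha-local differential privacy constraint. A standard way to do this is randomized response; the sketch below shows that mechanism together with the usual unbiased correction, purely as an illustration of the idea (the paper's exact mechanism and estimator may differ).

```python
# Sketch: randomized response giving alpha-local differential privacy for binary
# failure indicators, plus the standard unbiased debiasing transform.
import numpy as np

def privatize(delta, alpha, rng):
    """Keep each bit with probability e^alpha / (1 + e^alpha), otherwise flip it."""
    p_keep = np.exp(alpha) / (1.0 + np.exp(alpha))
    keep = rng.random(delta.shape) < p_keep
    return np.where(keep, delta, 1 - delta)

def debias(z, alpha):
    """Unbiased transform: E[debias(Z) | delta] = delta."""
    p = np.exp(alpha) / (1.0 + np.exp(alpha))
    return (z - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
delta = rng.binomial(1, 0.3, size=10_000)        # true failure indicators
z = privatize(delta, alpha=1.0, rng=rng)         # what the analyst actually observes
print(round(delta.mean(), 3), round(debias(z, 1.0).mean(), 3))   # both close to 0.3
```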
Under mild conditions, we also prove lower bounds for the minimax rates of convergence and show that the estimator is minimax optimal under a well-chosen bandwidth."}, "https://arxiv.org/abs/2311.08527": {"title": "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments", "link": "https://arxiv.org/abs/2311.08527", "description": "We study inference on the long-term causal effect of a continual exposure to a novel intervention, which we term a long-term treatment, based on an experiment involving only short-term observations. Key examples include the long-term health effects of regularly-taken medicine or of environmental hazards and the long-term effects on users of changes to an online platform. This stands in contrast to short-term treatments or ``shocks,\" whose long-term effect can reasonably be mediated by short-term observations, enabling the use of surrogate methods. Long-term treatments by definition have direct effects on long-term outcomes via continual exposure, so surrogacy conditions cannot reasonably hold. We connect the problem with offline reinforcement learning, leveraging doubly-robust estimators to estimate long-term causal effects for long-term treatments and construct confidence intervals."}, "https://arxiv.org/abs/2402.04321": {"title": "Homogeneity problem for basis expansion of functional data with applications to resistive memories", "link": "https://arxiv.org/abs/2402.04321", "description": "The homogeneity problem of testing whether more than two samples come from the same population is considered for the case of functional data. The methodological results are motivated by the study of homogeneity of electronic devices fabricated from different materials and active layer thicknesses. When the stochastic processes associated with each sample are Gaussian, this problem is known as the Functional ANOVA (FANOVA) problem and reduces to testing the equality of the group mean functions. The problem is that the current/voltage curves associated with Resistive Random Access Memories (RRAM) are not generated by a Gaussian process, so a different approach is necessary for testing homogeneity. To solve this problem, two approaches based on basis expansion of the sample curves are proposed, one parametric and one nonparametric. The first consists of applying multivariate homogeneity tests to a vector of basis coefficients of the sample curves. The second is based on dimension reduction, using functional principal component analysis (FPCA) of the sample curves and testing multivariate homogeneity on a vector of principal component scores. Different numerical approximation techniques are employed to adapt the experimental data for the statistical study. An extensive simulation study is developed to analyze the performance of both approaches in the parametric and non-parametric cases. Finally, the proposed methodologies are applied to three samples of experimental reset curves measured in three different RRAM technologies."}, "https://arxiv.org/abs/2402.04327": {"title": "On the estimation and interpretation of effect size metrics", "link": "https://arxiv.org/abs/2402.04327", "description": "Effect size estimates are thought to capture the collective, two-way response to an intervention or exposure in a three-way problem among the intervention/exposure, various confounders and the outcome.
For meaningful causal inference from the estimated effect size, the joint distribution of observed confounders must be identical across all intervention/exposure groups. However, real-world observational studies and even randomized clinical trials often lack such structural symmetry. To address this issue, various methods have been proposed and widely utilized. Recently, elementary combinatorics and information theory have motivated a consistent way to completely eliminate observed confounding in any given study. In this work, we leverage these new techniques to evaluate conventional methods based on their ability to (a) consistently differentiate between collective and individual responses to intervention/exposure and (b) establish the desired structural parity for sensible effect size estimation. Our findings reveal that a straightforward application of logistic regression homogenizes the three-way stratified analysis, but fails to restore structural symmetry leaving in particular the two-way effect size estimate unadjusted. Conversely, the Mantel-Haenszel estimator struggles to separate three-way effects from the two-way effect of intervention/exposure, leading to inconsistencies in interpreting pooled estimates as two-way risk metrics."}, "https://arxiv.org/abs/2402.04341": {"title": "CausalMetaR: An R package for performing causally interpretable meta-analyses", "link": "https://arxiv.org/abs/2402.04341", "description": "Researchers would often like to leverage data from a collection of sources (e.g., primary studies in a meta-analysis) to estimate causal effects in a target population of interest. However, traditional meta-analytic methods do not produce causally interpretable estimates for a well-defined target population. In this paper, we present the CausalMetaR R package, which implements efficient and robust methods to estimate causal effects in a given internal or external target population using multi-source data. The package includes estimators of average and subgroup treatment effects for the entire target population. To produce efficient and robust estimates of causal effects, the package implements doubly robust and non-parametric efficient estimators and supports using flexible data-adaptive (e.g., machine learning techniques) methods and cross-fitting techniques to estimate the nuisance models (e.g., the treatment model, the outcome model). We describe the key features of the package and demonstrate how to use the package through an example."}, "https://arxiv.org/abs/2402.04345": {"title": "A Framework of Zero-Inflated Bayesian Negative Binomial Regression Models For Spatiotemporal Data", "link": "https://arxiv.org/abs/2402.04345", "description": "Spatiotemporal data analysis with massive zeros is widely used in many areas such as epidemiology and public health. We use a Bayesian framework to fit zero-inflated negative binomial models and employ a set of latent variables from P\\'olya-Gamma distributions to derive an efficient Gibbs sampler. The proposed model accommodates varying spatial and temporal random effects through Gaussian process priors, which have both the simplicity and flexibility in modeling nonlinear relationships through a covariance function. To conquer the computation bottleneck that GPs may suffer when the sample size is large, we adopt the nearest-neighbor GP approach that approximates the covariance matrix using local experts. 
For the simulation study, we adopt multiple settings with varying sizes of spatial locations to evaluate the performance of the proposed model such as spatial and temporal random effects estimation and compare the result to other methods. We also apply the proposed model to the COVID-19 death counts in the state of Florida, USA from 3/25/2020 through 7/29/2020 to examine relationships between social vulnerability and COVID-19 deaths."}, "https://arxiv.org/abs/2402.04425": {"title": "Linear-Phase-Type probability modelling of functional PCA with applications to resistive memories", "link": "https://arxiv.org/abs/2402.04425", "description": "Functional principal component analysis based on Karhunen Loeve expansion allows to describe the stochastic evolution of the main characteristics associated to multiple systems and devices. Identifying the probability distribution of the principal component scores is fundamental to characterize the whole process. The aim of this work is to consider a family of statistical distributions that could be accurately adjusted to a previous transformation. Then, a new class of distributions, the linear-phase-type, is introduced to model the principal components. This class is studied in detail in order to prove, through the KL expansion, that certain linear transformations of the process at each time point are phase-type distributed. This way, the one-dimensional distributions of the process are in the same linear-phase-type class. Finally, an application to model the reset process associated with resistive memories is developed and explained."}, "https://arxiv.org/abs/2402.04433": {"title": "Fast Online Changepoint Detection", "link": "https://arxiv.org/abs/2402.04433", "description": "We study online changepoint detection in the context of a linear regression model. We propose a class of heavily weighted statistics based on the CUSUM process of the regression residuals, which are specifically designed to ensure timely detection of breaks occurring early on during the monitoring horizon. We subsequently propose a class of composite statistics, constructed using different weighing schemes; the decision rule to mark a changepoint is based on the largest statistic across the various weights, thus effectively working like a veto-based voting mechanism, which ensures fast detection irrespective of the location of the changepoint. Our theory is derived under a very general form of weak dependence, thus being able to apply our tests to virtually all time series encountered in economics, medicine, and other applied sciences. Monte Carlo simulations show that our methodologies are able to control the procedure-wise Type I Error, and have short detection delays in the presence of breaks."}, "https://arxiv.org/abs/2402.04461": {"title": "On Data Analysis Pipelines and Modular Bayesian Modeling", "link": "https://arxiv.org/abs/2402.04461", "description": "Data analysis pipelines, where quantities estimated in upstream modules are used as inputs to downstream ones, are common in many application areas. The most common approach to implementing data analysis pipelines involves obtaining point estimates from the upstream module(s) and then treating these as known quantities when working with the downstream module(s). This approach is straightforward, but is likely to underestimate the overall uncertainty associated with any final estimates. 
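To make the two-step pipeline concern in the modular-Bayesian entry above concrete, the toy sketch below contrasts a plug-in standard error (upstream estimate treated as known) with one that also propagates the upstream sampling error via the delta method; the numbers and the ratio-of-means target are invented for illustration.

```python
# Sketch of the "two-step" pipeline issue: the downstream module treats an upstream
# estimate as a known constant, so its reported uncertainty ignores upstream error.
import numpy as np

rng = np.random.default_rng(0)
upstream = rng.normal(loc=2.0, scale=1.0, size=50)      # small upstream sample -> noisy mu1
downstream = rng.normal(loc=6.0, scale=1.0, size=500)   # large downstream sample -> precise mu2

mu1_hat, se1 = upstream.mean(), upstream.std(ddof=1) / np.sqrt(len(upstream))
mu2_hat, se2 = downstream.mean(), downstream.std(ddof=1) / np.sqrt(len(downstream))

theta_hat = mu2_hat / mu1_hat                           # downstream target: theta = mu2 / mu1

# Two-step ("plug-in") standard error: mu1_hat treated as known.
se_plugin = se2 / mu1_hat

# Standard error that also propagates upstream uncertainty (delta method for a ratio).
se_full = np.sqrt(se2**2 / mu1_hat**2 + (mu2_hat**2 * se1**2) / mu1_hat**4)

print(f"theta_hat = {theta_hat:.3f}, plug-in SE = {se_plugin:.3f}, full SE = {se_full:.3f}")
# The plug-in SE is markedly smaller because it ignores the upstream module's sampling error.
```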
An alternative approach involves estimating parameters from the modules jointly using a Bayesian hierarchical model, which has the advantage of propagating upstream uncertainty into the downstream estimates. However in the case where one of the modules suffers from misspecification, such a joint model can result in the misspecification from one module corrupting the estimates from the remaining modules. Furthermore, hierarchical models require the development of ad-hoc implementations that can be time consuming to create and require large amounts of computational effort. So-called cut inference modifies the posterior distribution in such a way that prevents information flow between certain modules and provides a third alternative for statistical inference in data analysis pipelines. This paper presents a unified framework that encompasses all three modeling approaches (two-step, cut, and joint) in the context of data analysis pipelines with two modules and uses two examples to illustrate the trade offs associated with these three approaches. Our work shows that cut inference provides both robustness and ease of implementation for data analysis pipelines at a lower cost in terms of statistical inference than two-step procedures."}, "https://arxiv.org/abs/2402.04487": {"title": "Item-Level Heterogeneous Treatment Effects of Selective Serotonin Reuptake Inhibitors (SSRIs) on Depression: Implications for Inference, Generalizability, and Identification", "link": "https://arxiv.org/abs/2402.04487", "description": "In analysis of randomized controlled trials (RCTs) with patient-reported outcome measures (PROMs), Item Response Theory (IRT) models that allow for heterogeneity in the treatment effect at the item level merit consideration. These models for ``item-level heterogeneous treatment effects'' (IL-HTE) can provide more accurate statistical inference, allow researchers to better generalize their results, and resolve critical identification problems in the estimation of interaction effects. In this study, we extend the IL-HTE model to polytomous data and apply the model to determine how the effect of selective serotonin reuptake inhibitors (SSRIs) on depression varies across the items on a depression rating scale. We first conduct a Monte Carlo simulation study to assess the performance of the polytomous IL-HTE model under a range of conditions. We then apply the IL-HTE model to item-level data from 28 RCTs measuring the effect of SSRIs on depression using the 17-item Hamilton Depression Rating Scale (HDRS-17) and estimate potential heterogeneity by subscale (HDRS-6). Our results show that the IL-HTE model provides more accurate statistical inference, allows for generalizability of results to out-of-sample items, and resolves identification problems in the estimation of interaction effects. Our empirical application shows that while the average effect of SSRIs on depression is beneficial (i.e., negative) and statistically significant, there is substantial IL-HTE, with estimates of the standard deviation of item-level effects nearly as large as the average effect. We show that this substantial IL-HTE is driven primarily by systematically larger effects on the HDRS-6 subscale items. 
The IL-HTE model has the potential to provide new insights for the inference, generalizability, and identification of treatment effects in clinical trials using patient reported outcome measures."}, "https://arxiv.org/abs/2402.04593": {"title": "Spatial autoregressive model with measurement error in covariates", "link": "https://arxiv.org/abs/2402.04593", "description": "The Spatial AutoRegressive model (SAR) is commonly used in studies involving spatial and network data to estimate the spatial or network peer influence and the effects of covariates on the response, taking into account the spatial or network dependence. While the model can be efficiently estimated with a Quasi maximum likelihood approach (QMLE), the detrimental effect of covariate measurement error on the QMLE and how to remedy it is currently unknown. If covariates are measured with error, then the QMLE may not have the $\\sqrt{n}$ convergence and may even be inconsistent even when a node is influenced by only a limited number of other nodes or spatial units. We develop a measurement error-corrected ML estimator (ME-QMLE) for the parameters of the SAR model when covariates are measured with error. The ME-QMLE possesses statistical consistency and asymptotic normality properties. We consider two types of applications. The first is when the true covariate cannot be measured directly, and a proxy is observed instead. The second one involves including latent homophily factors estimated with error from the network for estimating peer influence. Our numerical results verify the bias correction property of the estimator and the accuracy of the standard error estimates in finite samples. We illustrate the method on a real dataset related to county-level death rates from the COVID-19 pandemic."}, "https://arxiv.org/abs/2402.04674": {"title": "Hyperparameter Tuning for Causal Inference with Double Machine Learning: A Simulation Study", "link": "https://arxiv.org/abs/2402.04674", "description": "Proper hyperparameter tuning is essential for achieving optimal performance of modern machine learning (ML) methods in predictive tasks. While there is an extensive literature on tuning ML learners for prediction, there is only little guidance available on tuning ML learners for causal machine learning and how to select among different ML learners. In this paper, we empirically assess the relationship between the predictive performance of ML methods and the resulting causal estimation based on the Double Machine Learning (DML) approach by Chernozhukov et al. (2018). DML relies on estimating so-called nuisance parameters by treating them as supervised learning problems and using them as plug-in estimates to solve for the (causal) parameter. We conduct an extensive simulation study using data from the 2019 Atlantic Causal Inference Conference Data Challenge. We provide empirical insights on the role of hyperparameter tuning and other practical decisions for causal estimation with DML. First, we assess the importance of data splitting schemes for tuning ML learners within Double Machine Learning. Second, we investigate how the choice of ML methods and hyperparameters, including recent AutoML frameworks, impacts the estimation performance for a causal parameter of interest. 
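For the DML discussion above, a generic partialling-out sketch with cross-fitted nuisance predictions (in the spirit of Chernozhukov et al. (2018), not the simulation study's actual code) might look as follows; the data-generating process, learners, and tuning values are placeholders.

```python
# Sketch: DML partialling-out estimate of a scalar treatment effect with cross-fitting.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta_true = 2000, 10, 0.5
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)            # treatment depends on X
y = theta_true * d + np.sin(X[:, 0]) + X[:, 2] + rng.normal(size=n)

# Nuisance functions E[Y|X] and E[D|X] treated as supervised learning problems,
# predicted out-of-fold (cross-fitting) to avoid overfitting bias.
m_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5)
g_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=1), X, d, cv=5)

y_res, d_res = y - m_hat, d - g_hat                              # partial out the covariates
theta_hat = np.sum(d_res * y_res) / np.sum(d_res * d_res)        # OLS on the residuals
se = np.sqrt(np.mean((y_res - theta_hat * d_res) ** 2 * d_res ** 2)) / (np.mean(d_res ** 2) * np.sqrt(n))
print(f"theta_hat = {theta_hat:.3f} (true {theta_true}), se ~ {se:.3f}")
```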
Third, we assess to what extent the choice of a particular causal model, as characterized by its incorporated parametric assumptions, can be based on predictive performance metrics."}, "https://arxiv.org/abs/2402.04727": {"title": "Data-driven Bayesian estimation of Monod kinetics", "link": "https://arxiv.org/abs/2402.04727", "description": "In this paper, we consider the well-known problem of non-linear identification of the rates of the reactions involved in cells with Monod functions. In bioprocesses, generating data is very expensive and time-consuming, so it is important to incorporate prior knowledge on the Monod kinetic parameters. Bayesian estimation is an elegant technique that handles parameter estimation with prior knowledge modeled as probability density functions. However, we might not have accurate knowledge of the kinetic parameters, such as interval bounds, especially for newly developed cell lines. Hence, we consider the case where there is no accurate prior information on the kinetic parameters except qualitative knowledge such as their non-negativity. A log-Gaussian prior distribution is considered for the parameters, and the means and variances of these distributions are tuned using the Expectation-Maximization algorithm. The algorithm requires Metropolis-Hastings within Gibbs sampling, which can be computationally expensive. We develop a novel variant of the Metropolis-Hastings-within-Gibbs sampling scheme to accelerate and improve the hyperparameter tuning. We show that it can give better modeling performance on a relatively large-scale simulation example compared to available methods in the literature."}, "https://arxiv.org/abs/2402.04808": {"title": "Basis expansion approaches for functional analysis of variance with repeated measures", "link": "https://arxiv.org/abs/2402.04808", "description": "The methodological contribution in this paper is motivated by biomechanical studies where data characterizing human movement are waveform curves representing joint measures such as flexion angles, velocity, acceleration, and so on. In many cases the aim consists of detecting differences in gait patterns when several independent samples of subjects walk or run under different conditions (repeated measures). Classic kinematic studies often analyse discrete summaries of the sample curves, discarding important information and providing biased results. As the sample data are obviously curves, a Functional Data Analysis approach is proposed to solve the problem of testing the equality of the mean curves of a functional variable observed on several independent groups under different treatments or time periods. A novel approach to Functional Analysis of Variance (FANOVA) for repeated measures that takes into account the complete curves is introduced. By assuming a basis expansion for each sample curve, the two-way FANOVA problem is reduced to Multivariate ANOVA (MANOVA) for the multivariate response of basis coefficients. Then, two different approaches for MANOVA with repeated measures are considered. In addition, an extensive simulation study is developed to check their performance. Finally, two applications with gait data are developed."}, "https://arxiv.org/abs/2402.04828": {"title": "What drives the European carbon market? 
Macroeconomic factors and forecasts", "link": "https://arxiv.org/abs/2402.04828", "description": "Putting a price on carbon -- with taxes or developing carbon markets -- is a widely used policy measure to achieve the target of net-zero emissions by 2050. This paper tackles the issue of producing point, direction-of-change, and density forecasts for the monthly real price of carbon within the EU Emissions Trading Scheme (EU ETS). We aim to uncover supply- and demand-side forces that can contribute to improving the prediction accuracy of models at short- and medium-term horizons. We show that a simple Bayesian Vector Autoregressive (BVAR) model, augmented with either one or two factors capturing a set of predictors affecting the price of carbon, provides substantial accuracy gains over a wide set of benchmark forecasts, including survey expectations and forecasts made available by data providers. We extend the study to verified emissions and demonstrate that, in this case, adding stochastic volatility can further improve the forecasting performance of a single-factor BVAR model. We rely on emissions and price forecasts to build market monitoring tools that track demand and price pressure in the EU ETS market. Our results are relevant for policymakers and market practitioners interested in quantifying the desired and unintended macroeconomic effects of monitoring the carbon market dynamics."}, "https://arxiv.org/abs/2402.04952": {"title": "Metrics on Markov Equivalence Classes for Evaluating Causal Discovery Algorithms", "link": "https://arxiv.org/abs/2402.04952", "description": "Many state-of-the-art causal discovery methods aim to generate an output graph that encodes the graphical separation and connection statements of the causal graph that underlies the data-generating process. In this work, we argue that an evaluation of a causal discovery method against synthetic data should include an analysis of how well this explicit goal is achieved by measuring how closely the separations/connections of the method's output align with those of the ground truth. We show that established evaluation measures do not accurately capture the difference in separations/connections of two causal graphs, and we introduce three new measures of distance called s/c-distance, Markov distance and Faithfulness distance that address this shortcoming. We complement our theoretical analysis with toy examples, empirical experiments and pseudocode."}, "https://arxiv.org/abs/2402.05030": {"title": "Inference for Two-Stage Extremum Estimators", "link": "https://arxiv.org/abs/2402.05030", "description": "We present a simulation-based approach to approximate the asymptotic variance and asymptotic distribution function of two-stage estimators. We focus on extremum estimators in the second stage and consider a large class of estimators in the first stage. This class includes extremum estimators, high-dimensional estimators, and other types of estimators (e.g., Bayesian estimators). We accommodate scenarios where the asymptotic distributions of both the first- and second-stage estimators are non-normal. We also allow for the second-stage estimator to exhibit a significant bias due to the first-stage sampling error. We introduce a debiased plug-in estimator and establish its limiting distribution. Our method is readily implementable with complex models. Unlike resampling methods, we eliminate the need for multiple computations of the plug-in estimator. 
Monte Carlo simulations confirm the effectiveness of our approach in finite samples. We present an empirical application with peer effects on adolescent fast-food consumption habits, where we employ the proposed method to address the issue of biased instrumental variable estimates resulting from the presence of many weak instruments."}, "https://arxiv.org/abs/2402.05065": {"title": "logitFD: an R package for functional principal component logit regression", "link": "https://arxiv.org/abs/2402.05065", "description": "The functional logit regression model was proposed by Escabias et al. (2004) with the objective of modeling a scalar binary response variable from a functional predictor. The model estimation proposed in that case was performed in a subspace of L2(T) of squared integrable functions of finite dimension, generated by a finite set of basis functions. For that estimation it was assumed that the curves of the functional predictor and the functional parameter of the model belong to the same finite subspace. The estimation so obtained was affected by high multicollinearity problems and the solution given to these problems was based on different functional principal component analysis. The logitFD package introduced here provides a toolbox for the fit of these models by implementing the different proposed solutions and by generalizing the model proposed in 2004 to the case of several functional and non-functional predictors. The performance of the functions is illustrated by using data sets of functional data included in the fda.usc package from R-CRAN."}, "https://arxiv.org/abs/2402.04355": {"title": "PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation", "link": "https://arxiv.org/abs/2402.04355", "description": "We propose a comprehensive sample-based method for assessing the quality of generative models. The proposed approach enables the estimation of the probability that two sets of samples are drawn from the same distribution, providing a statistically rigorous method for assessing the performance of a single generative model or the comparison of multiple competing models trained on the same dataset. This comparison can be conducted by dividing the space into non-overlapping regions and comparing the number of data samples in each region. The method only requires samples from the generative model and the test data. It is capable of functioning directly on high-dimensional data, obviating the need for dimensionality reduction. Significantly, the proposed method does not depend on assumptions regarding the density of the true distribution, and it does not rely on training or fitting any auxiliary models. Instead, it focuses on approximating the integral of the density (probability mass) across various sub-regions within the data space."}, "https://arxiv.org/abs/2402.04489": {"title": "De-amplifying Bias from Differential Privacy in Language Model Fine-tuning", "link": "https://arxiv.org/abs/2402.04489", "description": "Fairness and privacy are two important values machine learning (ML) practitioners often seek to operationalize in models. Fairness aims to reduce model bias for social/demographic sub-groups. Privacy via differential privacy (DP) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. The trade-offs between privacy and fairness goals of trustworthy ML pose a challenge to those wishing to address both. 
We show that DP amplifies gender, racial, and religious bias when fine-tuning large language models (LLMs), producing models more biased than ones fine-tuned without DP. We find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. Through the case of binary gender bias, we demonstrate that Counterfactual Data Augmentation (CDA), a known method for addressing bias, also mitigates bias amplification by DP. As a consequence, DP and CDA together can be used to fine-tune models while maintaining both fairness and privacy."}, "https://arxiv.org/abs/2402.04579": {"title": "Collective Counterfactual Explanations via Optimal Transport", "link": "https://arxiv.org/abs/2402.04579", "description": "Counterfactual explanations provide individuals with cost-optimal actions that can alter their labels to desired classes. However, if substantial instances seek state modification, such individual-centric methods can lead to new competitions and unanticipated costs. Furthermore, these recommendations, disregarding the underlying data distribution, may suggest actions that users perceive as outliers. To address these issues, our work proposes a collective approach for formulating counterfactual explanations, with an emphasis on utilizing the current density of the individuals to inform the recommended actions. Our problem naturally casts as an optimal transport problem. Leveraging the extensive literature on optimal transport, we illustrate how this collective method improves upon the desiderata of classical counterfactual explanations. We support our proposal with numerical simulations, illustrating the effectiveness of the proposed approach and its relation to classic methods."}, "https://arxiv.org/abs/2402.04602": {"title": "Online Quantile Regression", "link": "https://arxiv.org/abs/2402.04602", "description": "This paper tackles the challenge of integrating sequentially arriving data within the quantile regression framework, where the number of covariates is allowed to grow with the number of observations, the horizon is unknown, and memory is limited. We employ stochastic sub-gradient descent to minimize the empirical check loss and study its statistical properties and regret performance. In our analysis, we unveil the delicate interplay between updating iterates based on individual observations versus batches of observations, revealing distinct regularity properties in each scenario. Our method ensures long-term optimal estimation irrespective of the chosen update strategy. Importantly, our contributions go beyond prior works by achieving exponential-type concentration inequalities and attaining optimal regret and error rates that exhibit only short-term sensitivity to initial errors. A key insight from our study is the delicate statistical analyses and the revelation that appropriate stepsize schemes significantly mitigate the impact of initial errors on subsequent errors and regrets. This underscores the robustness of stochastic sub-gradient descent in handling initial uncertainties, emphasizing its efficacy in scenarios where the sequential arrival of data introduces uncertainties regarding both the horizon and the total number of observations. Additionally, when the initial error rate is well controlled, there is a trade-off between short-term error rate and long-term optimality. Due to the lack of delicate statistical analysis for square loss, we also briefly discuss its properties and proper schemes. 
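The online quantile regression entry above minimizes the empirical check loss by stochastic sub-gradient descent on streaming observations. A bare-bones sketch of that update, with an assumed 1/sqrt(t) step size and tail averaging added for stability, is given below; it ignores the growing-dimension and memory-constrained aspects discussed in the abstract.

```python
# Sketch: online quantile regression via stochastic sub-gradient descent on the
# check (pinball) loss, one observation at a time. Step sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
tau, p, n = 0.9, 5, 100_000
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
theta = np.zeros(p + 1)                         # [intercept, slopes]
theta_avg = np.zeros(p + 1)

for t in range(1, n + 1):
    x = np.concatenate(([1.0], rng.normal(size=p)))
    y = x[1:] @ beta_true + rng.normal()        # conditional tau-quantile: x'beta_true + z_tau
    u = y - x @ theta                           # current residual
    subgrad = -(tau - float(u < 0)) * x         # sub-gradient of the check loss at (x, y)
    theta -= (0.5 / np.sqrt(t)) * subgrad       # decaying step size
    if t > n // 2:                              # tail-average the iterates for stability
        theta_avg += theta / (n - n // 2)

print("intercept ~", round(theta_avg[0], 2), "(target ~ 1.28, the N(0,1) 0.9-quantile)")
print("slopes    ~", np.round(theta_avg[1:], 2), "vs", beta_true)
```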
Extensive simulations support our theoretical findings."}, "https://arxiv.org/abs/2402.04859": {"title": "Conditionality principle under unconstrained randomness", "link": "https://arxiv.org/abs/2402.04859", "description": "A very simple example demonstrates that Fisher's application of the conditionality principle to regression (\"fixed $x$ regression\"), endorsed by Sprott and many other followers, makes prediction impossible in the context of statistical learning theory. On the other hand, relaxing the requirement of conditionality makes it possible via, e.g., conformal prediction."}, "https://arxiv.org/abs/2110.00314": {"title": "Confounder importance learning for treatment effect inference", "link": "https://arxiv.org/abs/2110.00314", "description": "We address modelling and computational issues for multiple treatment effect inference under many potential confounders. A primary issue relates to preventing harmful effects from omitting relevant covariates (under-selection), while not running into over-selection issues that introduce substantial variance and a bias related to the non-random over-inclusion of covariates. We propose a novel empirical Bayes framework for Bayesian model averaging that learns from data the extent to which the inclusion of key covariates should be encouraged, specifically those highly associated to the treatments. A key challenge is computational. We develop fast algorithms, including an Expectation-Propagation variational approximation and simple stochastic gradient optimization algorithms, to learn the hyper-parameters from data. Our framework uses widely-used ingredients and largely existing software, and it is implemented within the R package mombf featured on CRAN. This work is motivated by and is illustrated in two applications. The first is the association between salary variation and discriminatory factors. The second, that has been debated in previous works, is the association between abortion policies and crime. Our approach provides insights that differ from previous analyses especially in situations with weaker treatment effects."}, "https://arxiv.org/abs/2202.12062": {"title": "Semiparametric Estimation of Dynamic Binary Choice Panel Data Models", "link": "https://arxiv.org/abs/2202.12062", "description": "We propose a new approach to the semiparametric analysis of panel data binary choice models with fixed effects and dynamics (lagged dependent variables). The model we consider has the same random utility framework as in Honore and Kyriazidou (2000). We demonstrate that, with additional serial dependence conditions on the process of deterministic utility and tail restrictions on the error distribution, the (point) identification of the model can proceed in two steps, and only requires matching the value of an index function of explanatory variables over time, as opposed to that of each explanatory variable. Our identification approach motivates an easily implementable, two-step maximum score (2SMS) procedure -- producing estimators whose rates of convergence, in contrast to Honore and Kyriazidou's (2000) methods, are independent of the model dimension. We then derive the asymptotic properties of the 2SMS procedure and propose bootstrap-based distributional approximations for inference. 
Monte Carlo evidence indicates that our procedure performs adequately in finite samples."}, "https://arxiv.org/abs/2206.08216": {"title": "Minimum Density Power Divergence Estimation for the Generalized Exponential Distribution", "link": "https://arxiv.org/abs/2206.08216", "description": "Statistical modeling of rainfall data is an active research area in agro-meteorology. The most common models fitted to such datasets are exponential, gamma, log-normal, and Weibull distributions. As an alternative to some of these models, the generalized exponential (GE) distribution was proposed by Gupta and Kundu (2001, Exponentiated Exponential Family: An Alternative to Gamma and Weibull Distributions, Biometrical Journal). Rainfall (specifically for short periods) datasets often include outliers, and thus, a proper robust parameter estimation procedure is necessary. Here, we use the popular minimum density power divergence estimation (MDPDE) procedure developed by Basu et al. (1998, Robust and Efficient Estimation by Minimising a Density Power Divergence, Biometrika) for estimating the GE parameters. We derive the analytical expressions for the estimating equations and asymptotic distributions. We analytically compare MDPDE with maximum likelihood estimation in terms of robustness, through an influence function analysis. Besides, we study the asymptotic relative efficiency of MDPDE analytically for different parameter settings. We apply the proposed technique to some simulated datasets and two rainfall datasets from Texas, United States. The results indicate superior performance of MDPDE compared to the other existing estimation techniques in most of the scenarios."}, "https://arxiv.org/abs/2210.08589": {"title": "Anytime-Valid Linear Models and Regression Adjusted Causal Inference in Randomized Experiments", "link": "https://arxiv.org/abs/2210.08589", "description": "Linear regression adjustment is commonly used to analyse randomised controlled experiments due to its efficiency and robustness against model misspecification. Current testing and interval estimation procedures leverage the asymptotic distribution of such estimators to provide Type-I error and coverage guarantees that hold only at a single sample size. Here, we develop the theory for the anytime-valid analogues of such procedures, enabling linear regression adjustment in the sequential analysis of randomised experiments. We first provide sequential $F$-tests and confidence sequences for the parametric linear model, which provide time-uniform Type-I error and coverage guarantees that hold for all sample sizes. We then relax all linear model parametric assumptions in randomised designs and provide nonparametric model-free sequential tests and confidence sequences for treatment effects. This formally allows experiments to be continuously monitored for significance, stopped early, and safeguards against statistical malpractices in data collection. A particular feature of our results is their simplicity. Our test statistics and confidence sequences all emit closed-form expressions, which are functions of statistics directly available from a standard linear regression table. 
We illustrate our methodology with the sequential analysis of software A/B experiments at Netflix, performing regression adjustment with pre-treatment outcomes."}, "https://arxiv.org/abs/2302.07663": {"title": "Testing for causal effect for binary data when propensity scores are estimated through Bayesian Networks", "link": "https://arxiv.org/abs/2302.07663", "description": "This paper proposes a new statistical approach for assessing treatment effect using Bayesian Networks (BNs). The goal is to draw causal inferences from observational data with a binary outcome and discrete covariates. The BNs are here used to estimate the propensity score, which enables flexible modeling and ensures maximum likelihood properties, including asymptotic efficiency. When the propensity score is estimated by BNs, two point estimators are considered - H\\'ajek and Horvitz-Thompson - based on inverse probability weighting, and their main distributional properties are derived for constructing confidence intervals and testing hypotheses about the absence of the treatment effect. Empirical evidence is presented to show the goodness of the proposed methodology on a simulation study mimicking the characteristics of a real dataset of prostate cancer patients from Milan San Raffaele Hospital."}, "https://arxiv.org/abs/2306.01211": {"title": "Priming bias versus post-treatment bias in experimental designs", "link": "https://arxiv.org/abs/2306.01211", "description": "Conditioning on variables affected by treatment can induce post-treatment bias when estimating causal effects. Although this suggests that researchers should measure potential moderators before administering the treatment in an experiment, doing so may also bias causal effect estimation if the covariate measurement primes respondents to react differently to the treatment. This paper formally analyzes this trade-off between post-treatment and priming biases in three experimental designs that vary when moderators are measured: pre-treatment, post-treatment, or a randomized choice between the two. We derive nonparametric bounds for interactions between the treatment and the moderator in each design and show how to use substantive assumptions to narrow these bounds. These bounds allow researchers to assess the sensitivity of their empirical findings to either source of bias. We then apply these methods to a survey experiment on electoral messaging."}, "https://arxiv.org/abs/2307.01037": {"title": "Vector Quantile Regression on Manifolds", "link": "https://arxiv.org/abs/2307.01037", "description": "Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on an Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate and geological phenomena), and tori (dihedral angles in proteins). By leveraging optimal transport theory and c-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets and likelihoods. 
We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through synthetic and real data experiments."}, "https://arxiv.org/abs/2303.10019": {"title": "Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices", "link": "https://arxiv.org/abs/2303.10019", "description": "This paper presents a new method for combining (or aggregating or ensembling) multivariate probabilistic forecasts, considering dependencies between quantiles and marginals through a smoothing procedure that allows for online learning. We discuss two smoothing methods: dimensionality reduction using Basis matrices and penalized smoothing. The new online learning algorithm generalizes the standard CRPS learning framework into multivariate dimensions. It is based on Bernstein Online Aggregation (BOA) and yields optimal asymptotic learning properties. The procedure uses horizontal aggregation, i.e., aggregation across quantiles. We provide an in-depth discussion on possible extensions of the algorithm and several nested cases related to the existing literature on online forecast combination. We apply the proposed methodology to forecasting day-ahead electricity prices, which are 24-dimensional distributional forecasts. The proposed method yields significant improvements over uniform combination in terms of continuous ranked probability score (CRPS). We discuss the temporal evolution of the weights and hyperparameters and present the results of reduced versions of the preferred model. A fast C++ implementation of the proposed algorithm is provided in the open-source R-Package profoc on CRAN."}, "https://arxiv.org/abs/2402.05194": {"title": "Multi-class classification of biomechanical data: A functional LDA approach based on multi-class penalized functional PLS", "link": "https://arxiv.org/abs/2402.05194", "description": "A functional linear discriminant analysis approach to classify a set of kinematic data (human movement curves of individuals performing different physical activities) is performed. Kinematic data, usually collected in linear acceleration or angular rotation format, can be identified with functions in a continuous domain (time, percentage of gait cycle, etc.). Since kinematic curves are measured in the same sample of individuals performing different activities, they are a clear example of functional data with repeated measures. On the other hand, the sample curves are observed with noise. Then, a roughness penalty might be necessary in order to provide a smooth estimation of the discriminant functions, which would make them more interpretable. Moreover, because of the infinite dimension of functional data, a reduction dimension technique should be considered. To solve these problems, we propose a multi-class approach for penalized functional partial least squares (FPLS) regression. Then linear discriminant analysis (LDA) will be performed on the estimated FPLS components. This methodology is motivated by two case studies. The first study considers the linear acceleration recorded every two seconds in 30 subjects, related to three different activities (walking, climbing stairs and down stairs). The second study works with the triaxial angular rotation, for each joint, in 51 children when they completed a cycle walking under three conditions (walking, carrying a backpack and pulling a trolley). 
A simulation study is also developed for comparing the performance of the proposed functional LDA with respect to the corresponding multivariate and non-penalized approaches."}, "https://arxiv.org/abs/2402.05209": {"title": "Stochastic modeling of Random Access Memories reset transitions", "link": "https://arxiv.org/abs/2402.05209", "description": "Resistive Random Access Memories (RRAMs) are being studied by the industry and academia because it is widely accepted that they are promising candidates for the next generation of high density nonvolatile memories. Taking into account the stochastic nature of mechanisms behind resistive switching, a new technique based on the use of functional data analysis has been developed to accurately model resistive memory device characteristics. Functional principal component analysis (FPCA) based on Karhunen-Loeve expansion is applied to obtain an orthogonal decomposition of the reset process in terms of uncorrelated scalar random variables. Then, the device current has been accurately described making use of just one variable presenting a modeling approach that can be very attractive from the circuit simulation viewpoint. The new method allows a comprehensive description of the stochastic variability of these devices by introducing a probability distribution that allows the simulation of the main parameter that is employed for the model implementation. A rigorous description of the mathematical theory behind the technique is given and its application for a broad set of experimental measurements is explained."}, "https://arxiv.org/abs/2402.05231": {"title": "Estimating Fold Changes from Partially Observed Outcomes with Applications in Microbial Metagenomics", "link": "https://arxiv.org/abs/2402.05231", "description": "We consider the problem of estimating fold-changes in the expected value of a multivariate outcome that is observed subject to unknown sample-specific and category-specific perturbations. We are motivated by high-throughput sequencing studies of the abundance of microbial taxa, in which microbes are systematically over- and under-detected relative to their true abundances. Our log-linear model admits a partially identifiable estimand, and we establish full identifiability by imposing interpretable parameter constraints. To reduce bias and guarantee the existence of parameter estimates in the presence of sparse observations, we apply an asymptotically negligible and constraint-invariant penalty to our estimating function. We develop a fast coordinate descent algorithm for estimation, and an augmented Lagrangian algorithm for estimation under null hypotheses. We construct a model-robust score test, and demonstrate valid inference even for small sample sizes and violated distributional assumptions. The flexibility of the approach and comparisons to related methods are illustrated via a meta-analysis of microbial associations with colorectal cancer."}, "https://arxiv.org/abs/2402.05329": {"title": "Selective linear segmentation for detecting relevant parameter changes", "link": "https://arxiv.org/abs/2402.05329", "description": "Change-point processes are one flexible approach to model long time series. We propose a method to uncover which model parameters truly vary when a change-point is detected. Given a set of breakpoints, we use a penalized likelihood approach to select the best set of parameters that changes over time and we prove that the penalty function leads to a consistent selection of the true model. 
Estimation is carried out via the deterministic annealing expectation-maximization algorithm. Our method accounts for model selection uncertainty and associates a probability to all the possible time-varying parameter specifications. Monte Carlo simulations highlight that the method works well for many time series models including heteroskedastic processes. For a sample of 14 Hedge funds (HF) strategies, using an asset based style pricing model, we shed light on the promising ability of our method to detect the time-varying dynamics of risk exposures as well as to forecast HF returns."}, "https://arxiv.org/abs/2402.05342": {"title": "Nonlinear Regression Analysis", "link": "https://arxiv.org/abs/2402.05342", "description": "Nonlinear regression analysis is a popular and important tool for scientists and engineers. In this article, we introduce theories and methods of nonlinear regression and its statistical inferences using the frequentist and Bayesian statistical modeling and computation. Least squares with the Gauss-Newton method is the most widely used approach to parameters estimation. Under the assumption of normally distributed errors, maximum likelihood estimation is equivalent to least squares estimation. The Wald confidence regions for parameters in a nonlinear regression model are affected by the curvatures in the mean function. Furthermore, we introduce the Newton-Raphson method and the generalized least squares method to deal with variance heterogeneity. Examples of simulation data analysis are provided to illustrate important properties of confidence regions and the statistical inferences using the nonlinear least squares estimation and Bayesian inference."}, "https://arxiv.org/abs/2402.05384": {"title": "Efficient Nonparametric Inference of Causal Mediation Effects with Nonignorable Missing Confounders", "link": "https://arxiv.org/abs/2402.05384", "description": "We consider causal mediation analysis with confounders subject to nonignorable missingness in a nonparametric framework. Our approach relies on shadow variables that are associated with the missing confounders but independent of the missingness mechanism. The mediation effect of interest is shown to be a weighted average of an iterated conditional expectation, which motivates our Sieve-based Iterative Outward (SIO) estimator. We derive the rate of convergence and asymptotic normality of the SIO estimator, which do not suffer from the ill-posed inverse problem. Essentially, we show that the asymptotic normality is not affected by the slow convergence rate of nonparametric estimators of nuisance functions. Moreover, we demonstrate that our estimator is locally efficient and attains the semiparametric efficiency bound under certain conditions. We accurately depict the efficiency loss attributable to missingness and identify scenarios in which efficiency loss is absent. We also propose a stable and easy-to-implement approach to estimate asymptotic variance and construct confidence intervals for the mediation effects. 
Finally, we evaluate the finite-sample performance of our proposed approach through simulation studies, and apply it to the CFPS data to show its practical applicability."}, "https://arxiv.org/abs/2402.05395": {"title": "Efficient Estimation for Functional Accelerated Failure Time Model", "link": "https://arxiv.org/abs/2402.05395", "description": "We propose a functional accelerated failure time model to characterize effects of both functional and scalar covariates on the time to event of interest, and provide regularity conditions to guarantee model identifiability. For efficient estimation of model parameters, we develop a sieve maximum likelihood approach where parametric and nonparametric coefficients are bundled with an unknown baseline hazard function in the likelihood function. Not only do the bundled parameters cause immense numerical difficulties, but they also result in new challenges in theoretical development. By developing a general theoretical framework, we overcome the challenges arising from the bundled parameters and derive the convergence rate of the proposed estimator. Furthermore, we prove that the finite-dimensional estimator is $\\sqrt{n}$-consistent, asymptotically normal and achieves the semiparametric information bound. The proposed inference procedures are evaluated by extensive simulation studies and illustrated with an application to the sequential organ failure assessment data from the Improving Care of Acute Lung Injury Patients study."}, "https://arxiv.org/abs/2402.05432": {"title": "Difference-in-Differences Estimators with Continuous Treatments and no Stayers", "link": "https://arxiv.org/abs/2402.05432", "description": "Many treatments or policy interventions are continuous in nature. Examples include prices, taxes or temperatures. Empirical researchers have usually relied on two-way fixed effect regressions to estimate treatment effects in such cases. However, such estimators are not robust to heterogeneous treatment effects in general; they also rely on the linearity of treatment effects. We propose estimators for continuous treatments that do not impose those restrictions, and that can be used when there are no stayers: the treatment of all units changes from one period to the next. We start by extending the nonparametric results of de Chaisemartin et al. (2023) to cases without stayers. We also present a parametric estimator, and use it to revisit Desch\\^enes and Greenstone (2012)."}, "https://arxiv.org/abs/2402.05479": {"title": "A comparison of the effects of different methodologies on the statistics learning profiles of prospective primary education teachers from a gender perspective", "link": "https://arxiv.org/abs/2402.05479", "description": "Over the last decades, it has been shown that teaching and learning statistics is complex, regardless of the teaching methodology. This research presents the different learning profiles identified in a group of future Primary Education (PE) teachers during the study of the Statistics block depending on the methodology used and gender, where the sample consists of 132 students in the third year of the PE undergraduate degree in the University of the Basque Country (Universidad del Pa\\'is Vasco/Euskal Herriko Unibertsitatea, UPV/EHU). To determine the profiles, a cluster analysis technique has been used, where the main variables to determine them are, on the one hand, their statistical competence development and, on the other hand, the evolution of their attitude towards statistics. 
In order to better understand the nature of the profiles obtained, the type of teaching methodology used to work on the Statistics block has been taken into account. This comparison is based on the fact that the sample is divided into two groups: one has worked with a Project Based Learning (PBL) methodology, while the other has worked with a methodology in which theoretical explanations and typically decontextualized exercises predominate. Among the results obtained, three differentiated profiles are observed, highlighting the proportion of students with an advantageous profile in the group where PBL is included. With regard to gender, the results show that women's attitudes towards statistics evolved more positively than men's after the sessions devoted to statistics in the PBL group."}, "https://arxiv.org/abs/2402.05633": {"title": "Full Law Identification under Missing Data in the Categorical Colluder Model", "link": "https://arxiv.org/abs/2402.05633", "description": "Missing data may be disastrous for the identifiability of causal and statistical estimands. In graphical missing data models, colluders are dependence structures that have a special importance for identification considerations. It has been shown that the presence of a colluder makes the full law, i.e., the joint distribution of variables and response indicators, non-parametrically non-identifiable. However, with additional mild assumptions regarding the variables involved with the colluder structure, identifiability is regained. We present a necessary and sufficient condition for the identification of the full law in the presence of a colluder structure with arbitrary categorical variables."}, "https://arxiv.org/abs/2402.05767": {"title": "Covariance matrix completion via auxiliary information", "link": "https://arxiv.org/abs/2402.05767", "description": "Covariance matrix estimation is an important task in the analysis of multivariate data in disparate scientific fields, including neuroscience, genomics, and astronomy. However, modern scientific data are often incomplete due to factors beyond the control of researchers, and data missingness may prohibit the use of traditional covariance estimation methods. Some existing methods address this problem by completing the data matrix, or by filling the missing entries of an incomplete sample covariance matrix by assuming a low-rank structure. We propose a novel approach that exploits auxiliary variables to complete covariance matrix estimates. An example of auxiliary variable is the distance between neurons, which is usually inversely related to the strength of neuronal correlation. Our method extracts auxiliary information via regression, and involves a single tuning parameter that can be selected empirically. We compare our method with other matrix completion approaches via simulations, and apply it to the analysis of large-scale neuroscience data."}, "https://arxiv.org/abs/2402.05789": {"title": "High Dimensional Factor Analysis with Weak Factors", "link": "https://arxiv.org/abs/2402.05789", "description": "This paper studies the principal components (PC) estimator for high dimensional approximate factor models with weak factors in that the factor loading ($\\boldsymbol{\\Lambda}^0$) scales sublinearly in the number $N$ of cross-section units, i.e., $\\boldsymbol{\\Lambda}^{0\\top} \\boldsymbol{\\Lambda}^0 / N^\\alpha$ is positive definite in the limit for some $\\alpha \\in (0,1)$. 
While the consistency and asymptotic normality of these estimates are by now well known when the factors are strong, i.e., $\\alpha=1$, the statistical properties for weak factors remain less explored. Here, we show that the PC estimator maintains consistency and asymptotic normality for any $\\alpha\\in(0,1)$, provided suitable conditions regarding the dependence structure in the noise are met. This complements an earlier result by Onatski (2012) that the PC estimator is inconsistent when $\\alpha=0$, and the more recent work by Bai and Ng (2023) who established the asymptotic normality of the PC estimator when $\\alpha \\in (1/2,1)$. Our proof strategy integrates the traditional eigendecomposition-based approach for factor models with leave-one-out analysis similar in spirit to those used in matrix completion and other settings. This combination allows us to deal with factors weaker than the former and at the same time relax the incoherence and independence assumptions often associated with the latter."}, "https://arxiv.org/abs/2402.05844": {"title": "The CATT SATT on the MATT: semiparametric inference for sample treatment effects on the treated", "link": "https://arxiv.org/abs/2402.05844", "description": "We study variants of the average treatment effect on the treated with population parameters replaced by their sample counterparts. For each estimand, we derive the limiting distribution with respect to a semiparametric efficient estimator of the population effect and provide guidance on variance estimation. Included in our analysis is the well-known sample average treatment effect on the treated, for which we obtain some unexpected results. Unlike for the ordinary sample average treatment effect, we find that the asymptotic variance for the sample average treatment effect on the treated is point-identified and consistently estimable, but it potentially exceeds that of the population estimand. To address this shortcoming, we propose a modification that yields a new estimand, the mixed average treatment effect on the treated, which is always estimated more precisely than both the population and sample effects. We also introduce a second new estimand that arises from an alternative interpretation of the treatment effect on the treated with which all individuals are weighted by the propensity score."}, "https://arxiv.org/abs/2402.05377": {"title": "Association between Sitting Time and Urinary Incontinence in the US Population: data from the National Health and Nutrition Examination Survey (NHANES) 2007 to 2018", "link": "https://arxiv.org/abs/2402.05377", "description": "Background Urinary incontinence (UI) is a common health problem that affects the life and health quality of millions of people in the US. We aimed to investigate the association between sitting time and UI. Methods A cross-sectional survey of adult participants of the National Health and Nutrition Examination Survey 2007-2018 was performed. Weighted multivariable logistic and regression models were conducted to assess the association between sitting time and UI. Results A total of 22916 participants were enrolled. Prolonged sitting time was associated with urgent UI (UUI, Odds ratio [OR] = 1.184, 95% Confidence interval [CI] = 1.076 to 1.302, P = 0.001). Compared with patients with sitting time shorter than 7 hours, moderate activity increased the risk of prolonged sitting time over 7 hours in the fully-adjusted model (OR = 2.537, 95% CI = 1.419 to 4.536, P = 0.002). 
Sitting time over 7 hours was related to male mixed UI (MUI, OR = 1.581, 95% CI = 1.129 to 2.213, P = 0.010), and female stress UI (SUI, OR = 0.884, 95% CI = 0.795 to 0.983, P = 0.026) in the fully-adjusted model. Conclusions Prolonged sedentary sitting time (> 7 hours) indicated a high risk of UUI in all populations, female SUI and male MUI. Compared with sitting time shorter than 7 hours, the moderate activity could not reverse the risk of prolonged sitting, which warranted further studies for confirmation."}, "https://arxiv.org/abs/2402.05438": {"title": "Penalized spline estimation of principal components for sparse functional data: rates of convergence", "link": "https://arxiv.org/abs/2402.05438", "description": "This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functions motivated by the matrix Bregman divergence, and the penalty term is the integrated squared derivative. The theory reveals that the asymptotic behavior of penalized spline estimators depends on the interesting interplay between several factors, i.e., the smoothness of the unknown functions, the spline degree, the spline knot number, the penalty order, and the penalty parameter. The theory also classifies the asymptotic behavior into seven scenarios and characterizes whether and how the minimax optimal rates of convergence are achievable in each scenario."}, "https://arxiv.org/abs/2201.01793": {"title": "Spectral Clustering with Variance Information for Group Structure Estimation in Panel Data", "link": "https://arxiv.org/abs/2201.01793", "description": "Consider a panel data setting where repeated observations on individuals are available. Often it is reasonable to assume that there exist groups of individuals that share similar effects of observed characteristics, but the grouping is typically unknown in advance. We first conduct a local analysis which reveals that the variances of the individual coefficient estimates contain useful information for the estimation of group structure. We then propose a method to estimate unobserved groupings for general panel data models that explicitly account for the variance information. Our proposed method remains computationally feasible with a large number of individuals and/or repeated measurements on each individual. The developed ideas can also be applied even when individual-level data are not available and only parameter estimates together with some quantification of estimation uncertainty are given to the researcher. A thorough simulation study demonstrates superior performance of our method than existing methods and we apply the method to two empirical applications."}, "https://arxiv.org/abs/2205.08586": {"title": "Treatment Choice with Nonlinear Regret", "link": "https://arxiv.org/abs/2205.08586", "description": "The literature focuses on the mean of welfare regret, which can lead to undesirable treatment choice due to sensitivity to sampling uncertainty. We propose to minimize the mean of a nonlinear transformation of regret and show that singleton rules are not essentially complete for nonlinear regret. Focusing on mean square regret, we derive closed-form fractions for finite-sample Bayes and minimax optimal rules. 
Our approach is grounded in decision theory and extends to limit experiments. The treatment fractions can be viewed as the strength of evidence favoring treatment. We apply our framework to a normal regression model and sample size calculation."}, "https://arxiv.org/abs/2205.08644": {"title": "Benefits and costs of matching prior to a Difference in Differences analysis when parallel trends does not hold", "link": "https://arxiv.org/abs/2205.08644", "description": "The Difference in Difference (DiD) estimator is a popular estimator built on the \"parallel trends\" assumption, which is an assertion that the treatment group, absent treatment, would change \"similarly\" to the control group over time. To bolster such a claim, one might generate a comparison group, via matching, that is similar to the treated group with respect to pre-treatment outcomes and/or pre-treatment covariates. Unfortunately, as has been previously pointed out, this intuitively appealing approach also has a cost in terms of bias. To assess the trade-offs of matching in our application, we first characterize the bias of matching prior to a DiD analysis under a linear structural model that allows for time-invariant observed and unobserved confounders with time-varying effects on the outcome. Given our framework, we verify that matching on baseline covariates generally reduces bias. We further show how additionally matching on pre-treatment outcomes has both cost and benefit. First, matching on pre-treatment outcomes partially balances unobserved confounders, which mitigates some bias. This reduction is proportional to the outcome's reliability, a measure of how coupled the outcomes are with the latent covariates. Offsetting these gains, matching also injects bias into the final estimate by undermining the second difference in the DiD via a regression-to-the-mean effect. Consequently, we provide heuristic guidelines for determining to what degree the bias reduction of matching is likely to outweigh the bias cost. We illustrate our guidelines by reanalyzing a principal turnover study that used matching prior to a DiD analysis and find that matching on both the pre-treatment outcomes and observed covariates makes the estimated treatment effect more credible."}, "https://arxiv.org/abs/2211.15128": {"title": "A New Formula for Faster Computation of the K-Fold Cross-Validation and Good Regularisation Parameter Values in Ridge Regression", "link": "https://arxiv.org/abs/2211.15128", "description": "In the present paper, we prove a new theorem, resulting in an update formula for linear regression model residuals calculating the exact k-fold cross-validation residuals for any choice of cross-validation strategy without model refitting. The required matrix inversions are limited by the cross-validation segment sizes and can be executed with high efficiency in parallel. The well-known formula for leave-one-out cross-validation follows as a special case of the theorem. In situations where the cross-validation segments consist of small groups of repeated measurements, we suggest a heuristic strategy for fast serial approximations of the cross-validated residuals and associated Predicted Residual Sum of Squares (PRESS) statistic. We also suggest strategies for efficient estimation of the minimum PRESS value and full PRESS function over a selected interval of regularisation values. 
The computational effectiveness of the parameter selection for Ridge- and Tikhonov regression modelling resulting from our theoretical findings and heuristic arguments is demonstrated in several applications with real and highly multivariate datasets."}, "https://arxiv.org/abs/2305.08172": {"title": "Fast Signal Region Detection with Application to Whole Genome Association Studies", "link": "https://arxiv.org/abs/2305.08172", "description": "Research on the localization of the genetic basis associated with diseases or traits has been widely conducted in the last few decades. Scan methods have been developed for region-based analysis in whole-genome association studies, helping us better understand how genetics influences human diseases or traits, especially when the aggregated effects of multiple causal variants are present. In this paper, we propose a fast and effective algorithm coupled with a high-dimensional test for simultaneously detecting multiple signal regions, which is distinct from existing methods using scan or knockoff statistics. The idea is to conduct binary splitting with re-search and arrangement based on a sequence of dynamic critical values to increase detection accuracy and reduce computation. Theoretical and empirical studies demonstrate that our approach enjoys favorable theoretical guarantees with fewer restrictions and exhibits superior numerical performance with faster computation. Utilizing the UK Biobank data to identify the genetic regions related to breast cancer, we confirm previous findings and, meanwhile, identify a number of new regions which suggest strong association with risk of breast cancer and deserve further investigation."}, "https://arxiv.org/abs/2305.15951": {"title": "Distributed model building and recursive integration for big spatial data modeling", "link": "https://arxiv.org/abs/2305.15951", "description": "Motivated by the need for computationally tractable spatial methods in neuroimaging studies, we develop a distributed and integrated framework for estimation and inference of Gaussian process model parameters with ultra-high-dimensional likelihoods. We propose a shift in viewpoint from whole to local data perspectives that is rooted in distributed model building and integrated estimation and inference. The framework's backbone is a computationally and statistically efficient integration procedure that simultaneously incorporates dependence within and between spatial resolutions in a recursively partitioned spatial domain. Statistical and computational properties of our distributed approach are investigated theoretically and in simulations. The proposed approach is used to extract new insights on autism spectrum disorder from the Autism Brain Imaging Data Exchange."}, "https://arxiv.org/abs/2307.02388": {"title": "Multi-Task Learning with Summary Statistics", "link": "https://arxiv.org/abs/2307.02388", "description": "Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, the application of multi-task learning to real-world settings is hindered by data-sharing constraints, especially in healthcare settings. To address this challenge, we propose a flexible multi-task learning framework utilizing summary statistics from various sources. 
Additionally, we present an adaptive parameter selection approach based on a variant of Lepski's method, allowing for data-driven tuning parameter selection when only summary statistics are available. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the sample complexity and overlap. We demonstrate our theoretical findings and the performance of the method through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields."}, "https://arxiv.org/abs/2307.03748": {"title": "Incentive-Theoretic Bayesian Inference for Collaborative Science", "link": "https://arxiv.org/abs/2307.03748", "description": "Contemporary scientific research is a distributed, collaborative endeavor, carried out by teams of researchers, regulatory institutions, funding agencies, commercial partners, and scientific bodies, all interacting with each other and facing different incentives. To maintain scientific rigor, statistical methods should acknowledge this state of affairs. To this end, we study hypothesis testing when there is an agent (e.g., a researcher or a pharmaceutical company) with a private prior about an unknown parameter and a principal (e.g., a policymaker or regulator) who wishes to make decisions based on the parameter value. The agent chooses whether to run a statistical trial based on their private prior and then the result of the trial is used by the principal to reach a decision. We show how the principal can conduct statistical inference that leverages the information that is revealed by an agent's strategic behavior -- their choice to run a trial or not. In particular, we show how the principal can design a policy to elucidate partial information about the agent's private prior beliefs and use this to control the posterior probability of the null. One implication is a simple guideline for the choice of significance threshold in clinical trials: the type-I error level should be set to be strictly less than the cost of the trial divided by the firm's profit if the trial is successful."}, "https://arxiv.org/abs/2307.05251": {"title": "Minimizing robust density power-based divergences for general parametric density models", "link": "https://arxiv.org/abs/2307.05251", "description": "Density power divergence (DPD) is designed to robustly estimate the underlying distribution of observations, in the presence of outliers. However, DPD involves an integral of the power of the parametric density models to be estimated; the explicit form of the integral term can be derived only for specific densities, such as normal and exponential densities. While we may perform a numerical integration for each iteration of the optimization algorithms, the computational complexity has hindered the practical application of DPD-based estimation to more general parametric densities. To address the issue, this study introduces a stochastic approach to minimize DPD for general parametric density models. The proposed approach can also be employed to minimize other density power-based $\\gamma$-divergences, by leveraging unnormalized models. 
We provide \\verb|R| package for implementation of the proposed approach in \\url{https://github.com/oknakfm/sgdpd}."}, "https://arxiv.org/abs/2307.06137": {"title": "Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case", "link": "https://arxiv.org/abs/2307.06137", "description": "Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes."}, "https://arxiv.org/abs/2312.01210": {"title": "When accurate prediction models yield harmful self-fulfilling prophecies", "link": "https://arxiv.org/abs/2312.01210", "description": "Objective: Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. Many prediction models are deployed for decision support based on their prediction accuracy in validation studies. We investigate whether this is a safe and valid approach.\n Materials and Methods: We show that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model.\n Results: Our main result is a formal characterization of a set of such prediction models. 
Next we show that models that are well calibrated before and after deployment are useless for decision making as they made no change in the data distribution.\n Discussion: Our results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.\n Conclusion: Outcome prediction models can yield harmful self-fulfilling prophecies when used for decision making; a new perspective on prediction model development, deployment and monitoring is needed."}, "https://arxiv.org/abs/2401.11119": {"title": "Constraint-based measures of shift and relative shift for discrete frequency distributions", "link": "https://arxiv.org/abs/2401.11119", "description": "The comparison of frequency distributions is a common statistical task with broad applications and a long history of methodological development. However, existing measures do not quantify the magnitude and direction by which one distribution is shifted relative to another. In the present study, we define distributional shift (DS) as the concentration of frequencies away from the greatest discrete class, e.g., a histogram's right-most bin. We derive a measure of DS based on the sum of cumulative frequencies, intuitively quantifying shift as a statistical moment. We then define relative distributional shift (RDS) as the difference in DS between distributions. Using simulated random sampling, we demonstrate that RDS is highly related to measures that are popularly used to compare frequency distributions. Focusing on a specific use case, i.e., simulated healthcare Evaluation and Management coding profiles, we show how RDS can be used to examine many pairs of empirical and expected distributions via shift-significance plots. In comparison to other measures, RDS has the unique advantage of being a signed (directional) measure based on a simple difference in an intuitive property."}, "https://arxiv.org/abs/2109.12006": {"title": "An overview of variable selection procedures using regularization paths in high-dimensional Gaussian linear regression", "link": "https://arxiv.org/abs/2109.12006", "description": "Current high-throughput technologies provide a large number of variables to describe a phenomenon. Only a few variables are generally sufficient to answer the question. Identifying them in a high-dimensional Gaussian linear regression model is one of the most-used statistical methods. In this article, we describe step-by-step the variable selection procedures built upon regularization paths. Regularization paths are obtained by combining a regularization function and an algorithm. Then, they are combined either with a model selection procedure using penalty functions or with a sampling strategy to obtain the final selected variables. We perform a comparison study by considering three simulation settings with various dependency structures on variables. In all the settings, we evaluate (i) the ability to discriminate between the active variables and the non-active variables along the regularization path (pROC-AUC), (ii) the prediction performance of the selected variable subset (MSE) and (iii) the relevance of the selected variables (recall, specificity, FDR). From the results, we provide recommendations on strategies to be favored depending on the characteristics of the problem at hand. 
We obtain that the regularization function Elastic-net provides most of the time better results than the $\\ell_1$ one and the lars algorithm has to be privileged over the GD one. ESCV provides the best prediction performances. Bolasso and the knockoffs method are judicious choices to limit the selection of non-active variables while ensuring selection of enough active variables. Conversely, the data-driven penalties considered in this review are not to be favored. As for Tigress and LinSelect, they are conservative methods."}, "https://arxiv.org/abs/2309.09323": {"title": "Answering Causal Queries at Layer 3 with DiscoSCMs-Embracing Heterogeneity", "link": "https://arxiv.org/abs/2309.09323", "description": "In the realm of causal inference, Potential Outcomes (PO) and Structural Causal Models (SCM) are recognized as the principal frameworks. However, when it comes to Layer 3 valuations -- counterfactual queries deeply entwined with individual-level semantics -- both frameworks encounter limitations due to the degenerative issues brought forth by the consistency rule. This paper advocates for the Distribution-consistency Structural Causal Models (DiscoSCM) framework as a pioneering approach to counterfactual inference, skillfully integrating the strengths of both PO and SCM. The DiscoSCM framework distinctively incorporates a unit selection variable $U$ and embraces the concept of uncontrollable exogenous noise realization. Through personalized incentive scenarios, we demonstrate the inadequacies of PO and SCM frameworks in representing the probability of a user being a complier (a Layer 3 event) without degeneration, an issue adeptly resolved by adopting the assumption of independent counterfactual noises within DiscoSCM. This innovative assumption broadens the foundational counterfactual theory, facilitating the extension of numerous theoretical results regarding the probability of causation to an individual granularity level and leading to a comprehensive set of theories on heterogeneous counterfactual bounds. Ultimately, our paper posits that if one acknowledges and wishes to leverage the ubiquitous heterogeneity, understanding causality as invariance across heterogeneous units, then DiscoSCM stands as a significant advancement in the methodology of counterfactual inference."}, "https://arxiv.org/abs/2401.03756": {"title": "Adaptive Experimental Design for Policy Learning", "link": "https://arxiv.org/abs/2401.03756", "description": "Evidence-based targeting has been a topic of growing interest among the practitioners of policy and business. Formulating decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study an optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner assigns treatment arms adaptively over sequentially arriving experimental units upon observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule to the population. 
Setting the worst-case expected regret as the performance criterion of adaptive sampling and recommended policies, we derive its asymptotic lower bounds, and propose a strategy, Adaptive Sampling-Policy Learning strategy (PLAS), whose leading factor of the regret upper bound aligns with the lower bound as the size of experimental units increases."}, "https://arxiv.org/abs/2402.06058": {"title": "Mathematical programming tools for randomization purposes in small two-arm clinical trials: A case study with real data", "link": "https://arxiv.org/abs/2402.06058", "description": "Modern randomization methods in clinical trials are invariably adaptive, meaning that the assignment of the next subject to a treatment group uses the accumulated information in the trial. Some of the recent adaptive randomization methods use mathematical programming to construct attractive clinical trials that balance the group features, such as their sizes and covariate distributions of their subjects. We review some of these methods and compare their performance with common covariate-adaptive randomization methods for small clinical trials. We introduce an energy distance measure that compares the discrepancy between the two groups using the joint distribution of the subjects' covariates. This metric is more appealing than evaluating the discrepancy between the groups using their marginal covariate distributions. Using numerical experiments, we demonstrate the advantages of the mathematical programming methods under the new measure. In the supplementary material, we provide R codes to reproduce our study results and facilitate comparisons of different randomization procedures."}, "https://arxiv.org/abs/2402.06122": {"title": "Peeking with PEAK: Sequential, Nonparametric Composite Hypothesis Tests for Means of Multiple Data Streams", "link": "https://arxiv.org/abs/2402.06122", "description": "We propose a novel nonparametric sequential test for composite hypotheses for means of multiple data streams. Our proposed method, \\emph{peeking with expectation-based averaged capital} (PEAK), builds upon the testing-as-betting framework and provides a non-asymptotic $\\alpha$-level test across any stopping time. PEAK is computationally tractable and efficiently rejects hypotheses that are incorrect across all potential distributions that satisfy our nonparametric assumption, enabling joint composite hypothesis testing on multiple streams of data. We numerically validate our theoretical findings under the best arm identification and threshold identification in the bandit setting, illustrating the computational efficiency of our method against state-of-the-art testing methods."}, "https://arxiv.org/abs/2402.06133": {"title": "Leveraging Quadratic Polynomials in Python for Advanced Data Analysis", "link": "https://arxiv.org/abs/2402.06133", "description": "The article provides a comprehensive overview of using quadratic polynomials in Python for modeling and analyzing data. It starts by explaining the basic concept of a quadratic polynomial, its general form, and its significance in capturing the curvature in data indicative of natural phenomena. The paper highlights key features of quadratic polynomials, their applications in regression analysis, and the process of fitting these polynomials to data using Python's `numpy` and `matplotlib` libraries. It also discusses the calculation of the coefficient of determination (R-squared) to quantify the fit of the polynomial model. 
Practical examples, including Python scripts, are provided to demonstrate how to apply these concepts in data analysis. The document serves as a bridge between theoretical knowledge and applied analytics, aiding in understanding and communicating data patterns."}, "https://arxiv.org/abs/2402.06228": {"title": "Towards participatory multi-modeling for policy support across domains and scales: a systematic procedure for integral multi-model design", "link": "https://arxiv.org/abs/2402.06228", "description": "Policymaking for complex challenges such as pandemics necessitates the consideration of intricate implications across multiple domains and scales. Computational models can support policymaking, but a single model is often insufficient for such multidomain and scale challenges. Multi-models comprising several interacting computational models at different scales or relying on different modeling paradigms offer a potential solution. Such multi-models can be assembled from existing computational models (i.e., integrated modeling) or be designed conceptually as a whole before their computational implementation (i.e., integral modeling). Integral modeling is particularly valuable for novel policy problems, such as those faced in the early stages of a pandemic, where relevant models may be unavailable or lack standard documentation. Designing such multi-models through an integral approach is, however, a complex task requiring the collaboration of modelers and experts from various domains. In this collaborative effort, modelers must precisely define the domain knowledge needed from experts and establish a systematic procedure for translating such knowledge into a multi-model. Yet, these requirements and systematic procedures are currently lacking for multi-models that are both multiscale and multi-paradigm. We address this challenge by introducing a procedure for developing multi-models with an integral approach based on clearly defined domain knowledge requirements derived from literature. We illustrate this procedure using the case of school closure policies in the Netherlands during the COVID-19 pandemic, revealing their potential implications in the short and long term and across the healthcare and educational domains. The requirements and procedure provided in this article advance the application of integral multi-modeling for policy support in multiscale and multidomain contexts."}, "https://arxiv.org/abs/2402.06382": {"title": "Robust Rao-type tests for step-stress accelerated life-tests under interval-monitoring and Weibull lifetime distributions", "link": "https://arxiv.org/abs/2402.06382", "description": "Many products in engineering are highly reliable with large mean lifetimes to failure. Performing lifetests under normal operations conditions would thus require long experimentation times and high experimentation costs. Alternatively, accelerated lifetests shorten the experimentation time by running the tests at higher than normal stress conditions, thus inducing more failures. Additionally, a log-linear regression model can be used to relate the lifetime distribution of the product to the level of stress it experiences. After estimating the parameters of this relationship, results can be extrapolated to normal operating conditions. On the other hand, censored data is common in reliability analysis. Interval-censored data arise when continuous inspection is difficult or infeasible due to technical or budgetary constraints. 
In this paper, we develop robust restricted estimators based on the density power divergence for step-stress accelerated life-tests under Weibull distributions with interval-censored data. We present theoretical asymptotic properties of the estimators and develop robust Rao-type test statistics based on the proposed robust estimators for testing composite null hypothesis on the model parameters."}, "https://arxiv.org/abs/2402.06410": {"title": "Manifold-valued models for analysis of EEG time series data", "link": "https://arxiv.org/abs/2402.06410", "description": "We propose a model for time series taking values on a Riemannian manifold and fit it to time series of covariance matrices derived from EEG data for patients suffering from epilepsy. The aim of the study is two-fold: to develop a model with interpretable parameters for different possible modes of EEG dynamics, and to explore the extent to which modelling results are affected by the choice of manifold and its associated geometry. The model specifies a distribution for the tangent direction vector at any time point, combining an autoregressive term, a mean reverting term and a form of Gaussian noise. Parameter inference is carried out by maximum likelihood estimation, and we compare modelling results obtained using the standard Euclidean geometry on covariance matrices and the affine invariant geometry. Results distinguish between epileptic seizures and interictal periods between seizures in patients: between seizures the dynamics have a strong mean reverting component and the autoregressive component is missing, while for the majority of seizures there is a significant autoregressive component and the mean reverting effect is weak. The fitted models are also used to compare seizures within and between patients. The affine invariant geometry is advantageous and it provides a better fit to the data."}, "https://arxiv.org/abs/2402.06428": {"title": "Smooth Transformation Models for Survival Analysis: A Tutorial Using R", "link": "https://arxiv.org/abs/2402.06428", "description": "Over the last five decades, we have seen strong methodological advances in survival analysis, mainly in two separate strands: One strand is based on a parametric approach that assumes some response distribution. More prominent, however, is the strand of flexible methods which rely mainly on non-/semi-parametric estimation. As the methodological landscape continues to evolve, the task of navigating through the multitude of methods and identifying corresponding available software resources is becoming increasingly difficult. This task becomes particularly challenging in more complex scenarios, such as when dealing with interval-censored or clustered survival data, non-proportionality, or dependent censoring.\n In this tutorial, we explore the potential of using smooth transformation models for survival analysis in the R system for statistical computing. These models provide a unified maximum likelihood framework that covers a range of survival models, including well-established ones such as the Weibull model and a fully parameterised version of the famous Cox proportional hazards model, as well as extensions to more complex scenarios. 
We explore smooth transformation models for non-proportional/crossing hazards, dependent censoring, clustered observations and extensions towards personalised medicine within this framework.\n By fitting these models to survival data from a two-arm randomised controlled trial on rectal cancer therapy, we demonstrate how survival analysis tasks can be seamlessly navigated within the smooth transformation model framework in R. This is achieved by the implementation provided by the \"tram\" package and few related packages."}, "https://arxiv.org/abs/2402.06480": {"title": "Optimal Forecast Reconciliation with Uncertainty Quantification", "link": "https://arxiv.org/abs/2402.06480", "description": "We propose to estimate the weight matrix used for forecast reconciliation as parameters in a general linear model in order to quantify its uncertainty. This implies that forecast reconciliation can be formulated as an orthogonal projection from the space of base-forecast errors into a coherent linear subspace. We use variance decomposition together with the Wishart distribution to derive the central estimator for the forecast-error covariance matrix. In addition, we prove that distance-reducing properties apply to the reconciled forecasts at all levels of the hierarchy as well as to the forecast-error covariance. A covariance matrix for the reconciliation weight matrix is derived, which leads to improved estimates of the forecast-error covariance matrix. We show how shrinkage can be introduced in the formulated model by imposing specific priors on the weight matrix and the forecast-error covariance matrix. The method is illustrated in a simulation study that shows consistent improvements in the log-score. Finally, standard errors for the weight matrix and the variance-separation formula are illustrated using a case study of forecasting electricity load in Sweden."}, "https://arxiv.org/abs/2402.05940": {"title": "Causal Relationship Network of Risk Factors Impacting Workday Loss in Underground Coal Mines", "link": "https://arxiv.org/abs/2402.05940", "description": "This study aims to establish the causal relationship network between various factors leading to workday loss in underground coal mines using a novel causal artificial intelligence (AI) method. The analysis utilizes data obtained from the National Institute for Occupational Safety and Health (NIOSH). A total of 101,010 injury records from 3,982 unique underground coal mines spanning the years from 1990 to 2020 were extracted from the NIOSH database. Causal relationships were analyzed and visualized using a novel causal AI method called Grouped Greedy Equivalence Search (GGES). The impact of each variable on workday loss was assessed through intervention do-calculus adjustment (IDA) scores. Model training and validation were performed using the 10-fold cross-validation technique. Performance metrics, including adjacency precision (AP), adjacency recall (AR), arrowhead precision (AHP), and arrowhead recall (AHR), were utilized to evaluate the models. Findings revealed that after 2006, key direct causes of workday loss among mining employees included total mining experience, mean office employees, mean underground employees, county, and total mining experience (years). Total mining experience emerged as the most influential factor, whereas mean employees per mine exhibited the least influence. The analyses emphasized the significant role of total mining experience in determining workday loss. 
The models achieved optimal performance, with AP, AR, AHP, and AHR values measuring 0.694, 0.653, 0.386, and 0.345, respectively. This study demonstrates the feasibility of utilizing the new GGES method to clarify the causal factors behind workday loss by analyzing employment demographics and injury records and to establish their causal relationship network."}, "https://arxiv.org/abs/2402.06048": {"title": "Coherence-based Input Design for Sparse System Identification", "link": "https://arxiv.org/abs/2402.06048", "description": "The maximum absolute correlation between regressors, which is called mutual coherence, plays an essential role in sparse estimation. A regressor matrix whose columns are highly correlated may result from optimal input design, since there is no constraint on the mutual coherence, so when this regressor is used to estimate sparse parameter vectors of a system, it may yield a large estimation error. This paper aims to tackle this issue for fixed denominator models, which include Laguerre, Kautz, and generalized orthonormal basis function expansion models, for example. The paper proposes an optimal input design method where the achieved Fisher information matrix is fitted to the desired Fisher matrix, together with a coordinate transformation designed to make the regressors in the transformed coordinates have low mutual coherence. The method can be used together with any sparse estimation method and in a numerical study we show its potential for alleviating the problem of model order selection when used in conjunction with, for example, classical methods such as AIC and BIC."}, "https://arxiv.org/abs/2402.06571": {"title": "Weighted cumulative residual Entropy Generating Function and its properties", "link": "https://arxiv.org/abs/2402.06571", "description": "The study of the generating function approach to entropy has become popular as it generates several well-known entropy measures discussed in the literature. In this work, we define the weighted cumulative residual entropy generating function (WCREGF) and study its properties. We then introduce the dynamic weighted cumulative residual entropy generating function (DWCREGF). It is shown that the DWCREGF determines the distribution uniquely. We study some characterization results using the relationship between the DWCREGF and the hazard rate and/or the mean residual life function. Using a characterization based on DWCREGF, we develop a new goodness-of-fit test for the Rayleigh distribution. A Monte Carlo simulation study is conducted to evaluate the proposed test. Finally, the test is illustrated using two real data sets."}, "https://arxiv.org/abs/2402.06574": {"title": "Prediction of air pollutants PM10 by ARBX(1) processes", "link": "https://arxiv.org/abs/2402.06574", "description": "This work adopts a Banach-valued time series framework for component-wise estimation and prediction, from temporally correlated functional data, in the presence of exogenous variables. The strong consistency of the proposed functional estimator and associated plug-in predictor is formulated. The simulation study undertaken illustrates their large-sample size properties. 
Air pollutant PM10 curve forecasting, in the Haute-Normandie region (France), is addressed by implementing the functional time series approach presented."}, "https://arxiv.org/abs/1409.2709": {"title": "Convergence of hybrid slice sampling via spectral gap", "link": "https://arxiv.org/abs/1409.2709", "description": "It is known that the simple slice sampler has robust convergence properties; however, the class of problems where it can be implemented is limited. In contrast, we consider hybrid slice samplers which are easily implementable and where another Markov chain approximately samples the uniform distribution on each slice. Under appropriate assumptions on the Markov chain on the slice we show a lower bound and an upper bound of the spectral gap of the hybrid slice sampler in terms of the spectral gap of the simple slice sampler. An immediate consequence of this is that spectral gap and geometric ergodicity of the hybrid slice sampler can be concluded from spectral gap and geometric ergodicity of its simple version, which is very well understood. These results indicate that robustness properties of the simple slice sampler are inherited by (appropriately designed) easily implementable hybrid versions. We apply the developed theory and analyse a number of specific algorithms such as the stepping-out shrinkage slice sampling, hit-and-run slice sampling on a class of multivariate targets and an easily implementable combination of both procedures on multidimensional bimodal densities."}, "https://arxiv.org/abs/2101.07934": {"title": "Meta-analysis of Censored Adverse Events", "link": "https://arxiv.org/abs/2101.07934", "description": "Meta-analysis is a powerful tool for assessing drug safety by combining treatment-related toxicological findings across multiple studies, as clinical trials are typically underpowered for detecting adverse drug effects. However, incomplete reporting of adverse events (AEs) in published clinical studies is a frequent issue, especially if the observed number of AEs is below a pre-specified study-dependent threshold. Ignoring the censored AE information, often found in lower frequency, can significantly bias the estimated incidence rate of AEs. Despite its importance, this common meta-analysis problem has received little statistical or analytic attention in the literature. To address this challenge, we propose a Bayesian approach to accommodating the censored and possibly rare AEs for meta-analysis of safety data. Through simulation studies, we demonstrate that the proposed method can improve accuracy in point and interval estimation of incidence probabilities, particularly in the presence of censored data. Overall, the proposed method provides a practical solution that can facilitate better-informed decisions regarding drug safety."}, "https://arxiv.org/abs/2208.11665": {"title": "Statistical exploration of the Manifold Hypothesis", "link": "https://arxiv.org/abs/2208.11665", "description": "The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space. This phenomenon is observed empirically in many real-world situations, has led to the development of a wide range of statistical methods in the last few decades, and has been suggested as a key factor in the success of modern AI technologies. 
We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model, we derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism. These procedures operate under minimal assumptions and make use of well-known, scalable graph-analytic algorithms."}, "https://arxiv.org/abs/2305.13462": {"title": "Robust heavy-tailed versions of generalized linear models with applications in actuarial science", "link": "https://arxiv.org/abs/2305.13462", "description": "Generalized linear models (GLMs) form one of the most popular classes of models in statistics. The gamma variant is used, for instance, in actuarial science for the modelling of claim amounts in insurance. A flaw of GLMs is that they are not robust against outliers (i.e., against erroneous or extreme data points). A difference in trends in the bulk of the data and the outliers thus yields skewed inference and predictions. To address this problem, robust methods have been introduced. The most commonly applied robust method is frequentist and consists in an estimator which is derived from a modification of the derivative of the log-likelihood. We propose an alternative approach which is modelling-based and thus fundamentally different. It allows for an understanding and interpretation of the modelling, and it can be applied for both frequentist and Bayesian statistical analyses. The approach possesses appealing theoretical and empirical properties."}, "https://arxiv.org/abs/2306.08598": {"title": "Kernel Debiased Plug-in Estimation: Simultaneous, Automated Debiasing without Influence Functions for Many Target Parameters", "link": "https://arxiv.org/abs/2306.08598", "description": "In the problem of estimating target parameters in nonparametric models with nuisance parameters, substituting the unknown nuisances with nonparametric estimators can introduce \"plug-in bias.\" Traditional methods addressing this sub-optimal bias-variance trade-off rely on the influence function (IF) of the target parameter. When estimating multiple target parameters, these methods require debiasing the nuisance parameter multiple times using the corresponding IFs, posing analytical and computational challenges. In this work, we leverage the targeted maximum likelihood estimation framework to propose a novel method named kernel debiased plug-in estimation (KDPE). KDPE refines an initial estimate through regularized likelihood maximization steps, employing a nonparametric model based on reproducing kernel Hilbert spaces. We show that KDPE (i) simultaneously debiases all pathwise differentiable target parameters that satisfy our regularity conditions, (ii) does not require the IF for implementation, and (iii) remains computationally tractable. 
We numerically illustrate the use of KDPE and validate our theoretical results."}, "https://arxiv.org/abs/2306.16593": {"title": "Autoregressive with Slack Time Series Model for Forecasting a Partially-Observed Dynamical Time Series", "link": "https://arxiv.org/abs/2306.16593", "description": "This study delves into the domain of dynamical systems, specifically the forecasting of dynamical time series defined through an evolution function. Traditional approaches in this area predict the future behavior of dynamical systems by inferring the evolution function. However, these methods may confront obstacles due to the presence of missing variables, which are usually attributed to challenges in measurement and a partial understanding of the system of interest. To overcome this obstacle, we introduce the autoregressive with slack time series (ARS) model, which simultaneously estimates the evolution function and imputes missing variables as a slack time series. Assuming time-invariance and linearity in the (underlying) entire dynamical time series, our experiments demonstrate the ARS model's capability to forecast future time series. From a theoretical perspective, we prove that a 2-dimensional time-invariant and linear system can be reconstructed by utilizing observations from a single, partially observed dimension of the system."}, "https://arxiv.org/abs/2310.11683": {"title": "Treatment bootstrapping: A new approach to quantify uncertainty of average treatment effect estimates", "link": "https://arxiv.org/abs/2310.11683", "description": "This paper proposes a new non-parametric bootstrap method to quantify the uncertainty of the average treatment effect estimate for the treated from matching estimators. More specifically, it seeks to quantify the uncertainty associated with the average treatment effect estimate for the treated by bootstrapping the treatment group only and finding the counterpart control group by matching on the estimated propensity score. We demonstrate the validity of this approach and compare it with existing bootstrap approaches through Monte Carlo simulation and a real-world data set. The results indicate that the proposed approach constructs confidence intervals and standard errors with precision and coverage rates at least comparable to those of existing bootstrap approaches, while the variance estimates can fluctuate depending on the proportion of treatment group units in the sample data and the specific matching method used."}, "https://arxiv.org/abs/2401.15811": {"title": "Seller-Side Experiments under Interference Induced by Feedback Loops in Two-Sided Platforms", "link": "https://arxiv.org/abs/2401.15811", "description": "Two-sided platforms are central to modern commerce and content sharing and often utilize A/B testing for developing new features. While user-side experiments are common, seller-side experiments become crucial for specific interventions and metrics. This paper investigates the effects of interference caused by feedback loops on seller-side experiments in two-sided platforms, with a particular focus on the counterfactual interleaving design, proposed in \\citet{ha2020counterfactual,nandy2021b}. These feedback loops, often generated by pacing algorithms, cause outcomes from earlier sessions to influence subsequent ones. This paper contributes by creating a mathematical framework to analyze this interference, theoretically estimating its impact, and conducting empirical evaluations of the counterfactual interleaving design in real-world scenarios. 
Our research shows that feedback loops can result in misleading conclusions about the treatment effects."}, "https://arxiv.org/abs/2212.06693": {"title": "Transfer Learning with Large-Scale Quantile Regression", "link": "https://arxiv.org/abs/2212.06693", "description": "Quantile regression is increasingly encountered in modern big data applications due to its robustness and flexibility. We consider the scenario of learning the conditional quantiles of a specific target population when the available data may go beyond the target and be supplemented from other sources that possibly share similarities with the target. A crucial question is how to properly distinguish and utilize useful information from other sources to improve the quantile estimation and inference at the target. We develop transfer learning methods for high-dimensional quantile regression by detecting informative sources whose models are similar to the target and utilizing them to improve the target model. We show that under reasonable conditions, the detection of the informative sources based on sample splitting is consistent. Compared to the naive estimator with only the target data, the transfer learning estimator achieves a much lower error rate as a function of the sample sizes, the signal-to-noise ratios, and the similarity measures among the target and the source models. Extensive simulation studies demonstrate the superiority of our proposed approach. We apply our methods to tackle the problem of detecting hard-landing risk for flight safety and show the benefits and insights gained from transfer learning of three different types of airplanes: Boeing 737, Airbus A320, and Airbus A380."}, "https://arxiv.org/abs/2304.09872": {"title": "Depth Functions for Partial Orders with a Descriptive Analysis of Machine Learning Algorithms", "link": "https://arxiv.org/abs/2304.09872", "description": "We propose a framework for descriptively analyzing sets of partial orders based on the concept of depth functions. Despite intensive studies of depth functions in linear and metric spaces, there is very little discussion on depth functions for non-standard data types such as partial orders. We introduce an adaptation of the well-known simplicial depth to the set of all partial orders, the union-free generic (ufg) depth. Moreover, we utilize our ufg depth for a comparison of machine learning algorithms based on multidimensional performance measures. Concretely, we analyze the distribution of different classifier performances over a sample of standard benchmark data sets. Our results promisingly demonstrate that our approach differs substantially from existing benchmarking approaches and, therefore, adds a new perspective to the vivid debate on the comparison of classifiers."}, "https://arxiv.org/abs/2305.15984": {"title": "Dynamic Inter-treatment Information Sharing for Individualized Treatment Effects Estimation", "link": "https://arxiv.org/abs/2305.15984", "description": "Estimation of individualized treatment effects (ITE) from observational studies is a fundamental problem in causal inference and holds significant importance across domains, including healthcare. However, limited observational datasets pose challenges in reliable ITE estimation as data have to be split among treatment groups to train an ITE learner. While information sharing among treatment groups can partially alleviate the problem, there is currently no general framework for end-to-end information sharing in ITE estimation. 
To tackle this problem, we propose a deep learning framework based on `\\textit{soft weight sharing}' to train ITE learners, enabling \\textit{dynamic end-to-end} information sharing among treatment groups. The proposed framework complements existing ITE learners, and introduces a new class of ITE learners, referred to as \\textit{HyperITE}. We extend state-of-the-art ITE learners with \\textit{HyperITE} versions and evaluate them on IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework improves ITE estimation error, with increasing effectiveness for smaller datasets."}, "https://arxiv.org/abs/2402.06915": {"title": "Detection and inference of changes in high-dimensional linear regression with non-sparse structures", "link": "https://arxiv.org/abs/2402.06915", "description": "For the data segmentation problem in high-dimensional linear regression settings, a commonly made assumption is that the regression parameters are segment-wise sparse, which enables many existing methods to estimate the parameters locally via $\\ell_1$-regularised maximum likelihood-type estimation and contrast them for change point detection. Contrary to the common belief, we show that the sparsity of neither regression parameters nor their differences, a.k.a.\\ differential parameters, is necessary for achieving the consistency in multiple change point detection. In fact, both statistically and computationally, better efficiency is attained by a simple strategy that scans for large discrepancies in local covariance between the regressors and the response. We go a step further and propose a suite of tools for directly inferring about the differential parameters post-segmentation, which are applicable even when the regression parameters themselves are non-sparse. Theoretical investigations are conducted under general conditions permitting non-Gaussianity, temporal dependence and ultra-high dimensionality. Numerical experiments demonstrate the competitiveness of the proposed methodologies."}, "https://arxiv.org/abs/2402.07022": {"title": "A product-limit estimator of the conditional survival function when cure status is partially known", "link": "https://arxiv.org/abs/2402.07022", "description": "We introduce a nonparametric estimator of the conditional survival function in the mixture cure model for right censored data when cure status is partially known. The estimator is developed for the setting of a single continuous covariate but it can be extended to multiple covariates. It extends the estimator of Beran (1981), which ignores cure status information. We obtain an almost sure representation, from which the strong consistency and asymptotic normality of the estimator are derived. Asymptotic expressions of the bias and variance demonstrate a reduction in the variance with respect to Beran's estimator. A simulation study shows that, if the bandwidth parameter is suitably chosen, our estimator performs better than others for an ample range of covariate values. A bootstrap bandwidth selector is proposed. 
Finally, the proposed estimator is applied to a real dataset studying survival of sarcoma patients."}, "https://arxiv.org/abs/2402.07048": {"title": "Logistic-beta processes for modeling dependent random probabilities with beta marginals", "link": "https://arxiv.org/abs/2402.07048", "description": "The beta distribution serves as a canonical tool for modeling probabilities and is extensively used in statistics and machine learning, especially in the field of Bayesian nonparametrics. Despite its widespread use, there is limited work on flexible and computationally convenient stochastic process extensions for modeling dependent random probabilities. We propose a novel stochastic process called the logistic-beta process, whose logistic transformation yields a stochastic process with common beta marginals. Similar to the Gaussian process, the logistic-beta process can model dependence on both discrete and continuous domains, such as space or time, and has a highly flexible dependence structure through correlation kernels. Moreover, its normal variance-mean mixture representation leads to highly effective posterior inference algorithms. The flexibility and computational benefits of logistic-beta processes are demonstrated through nonparametric binary regression simulation studies. Furthermore, we apply the logistic-beta process in modeling dependent Dirichlet processes, and illustrate its application and benefits through Bayesian density regression problems in a toxicology study."}, "https://arxiv.org/abs/2402.07247": {"title": "The Pairwise Matching Design is Optimal under Extreme Noise and Assignments", "link": "https://arxiv.org/abs/2402.07247", "description": "We consider the general performance of the difference-in-means estimator in an equally-allocated two-arm randomized experiment under common experimental endpoints such as continuous (regression), incidence, proportion, count and uncensored survival. We consider two sources of randomness: the subject-specific assignments and the contribution of unobserved subject-specific measurements. We then examine mean squared error (MSE) performance under a new, more realistic \"simultaneous tail criterion\". We prove that the pairwise matching design of Greevy et al. (2004) performs best asymptotically under this criterion when compared to other blocking designs. We also prove that the optimal design must be less random than complete randomization and more random than any deterministic, optimized allocation. Theoretical results are supported by simulations in all five response types."}, "https://arxiv.org/abs/2402.07297": {"title": "On Robust Measures of Spatial Correlation", "link": "https://arxiv.org/abs/2402.07297", "description": "As a rule, statistical measures are often vulnerable to the presence of outliers, and spatial correlation coefficients, which are critical in the assessment of spatial data, remain susceptible to this inherent flaw. In contexts where data originates from a variety of domains (such as socio-economic, environmental or epidemiological disciplines) it is quite common to encounter not just anomalous data points, but also non-normal distributions. These irregularities can significantly distort the broader analytical landscape, often masking significant spatial attributes. 
This paper embarks on a mission to enhance the resilience of traditional spatial correlation metrics, specifically the Moran Coefficient (MC), Geary's Contiguity ratio (GC), and the Approximate Profile Likelihood Estimator (APLE), and to propose a series of alternative measures. Drawing inspiration from established analytical paradigms, our research harnesses the power of influence function studies to examine the robustness of the proposed novel measures in the presence of different outlier scenarios."}, "https://arxiv.org/abs/2402.07349": {"title": "Control Variates for MCMC", "link": "https://arxiv.org/abs/2402.07349", "description": "This chapter describes several control variate methods for improving estimates of expectations from MCMC."}, "https://arxiv.org/abs/2402.07373": {"title": "A Functional Coefficients Network Autoregressive Model", "link": "https://arxiv.org/abs/2402.07373", "description": "The paper introduces a flexible model for the analysis of multivariate nonlinear time series data. The proposed Functional Coefficients Network Autoregressive (FCNAR) model considers the response of each node in the network to depend in a nonlinear fashion on its own past values (autoregressive component), as well as on past values of each neighbor (network component). Key issues of model stability/stationarity, together with model parameter identifiability, estimation and inference, are addressed for error processes that can be heavier-tailed than Gaussian, for both a fixed and a growing number of network nodes. The performance of the estimators for the FCNAR model is assessed on synthetic data and the applicability of the model is illustrated on multiple indicators of air pollution data."}, "https://arxiv.org/abs/2402.07439": {"title": "Joint estimation of the predictive ability of experts using a multi-output Gaussian process", "link": "https://arxiv.org/abs/2402.07439", "description": "A multi-output Gaussian process (GP) is introduced as a model for the joint posterior distribution of the local predictive ability of a set of models and/or experts, conditional on a vector of covariates, from historical predictions in the form of log predictive scores. Following a power transformation of the log scores, a GP with Gaussian noise can be used, which allows faster computation by first using Hamiltonian Monte Carlo to sample the hyper-parameters of the GP from a model where the latent GP surface has been marginalized out, and then using these draws to generate draws of joint predictive ability conditional on a new vector of covariates. Linear pools based on learned joint local predictive ability are applied to predict daily bike usage in Washington DC."}, "https://arxiv.org/abs/2402.07521": {"title": "A step towards the integration of machine learning and small area estimation", "link": "https://arxiv.org/abs/2402.07521", "description": "The use of machine-learning techniques has grown in numerous research areas. Currently, it is also widely used in statistics, including official statistics, for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) but also for data analysis. However, the usage of these methods in survey sampling, including small area estimation, is still very limited. Therefore, we propose a predictor supported by these algorithms which can be used to predict any population or subpopulation characteristics based on cross-sectional and longitudinal data. 
Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between the variables, which means that they have very good properties in the case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up which, in our opinion, is of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with optimal methods under the model. What is more, we propose a method for estimating the accuracy of machine learning predictors, making it possible to compare their accuracy with that of classic methods, where accuracy is measured as in survey sampling practice. The solution of this problem is indicated in the literature as one of the key issues in the integration of these approaches. The simulation studies are based on a real, longitudinal dataset, freely available from the Polish Local Data Bank, where the prediction problem of subpopulation characteristics in the last period, with \"borrowing strength\" from other subpopulations and time periods, is considered."}, "https://arxiv.org/abs/2402.07569": {"title": "EM Estimation of the B-spline Copula with Penalized Log-Likelihood Function", "link": "https://arxiv.org/abs/2402.07569", "description": "The B-spline copula function is defined by a linear combination of elements of the normalized B-spline basis. We develop a modified EM algorithm to maximize the penalized log-likelihood function, wherein we use the smoothly clipped absolute deviation (SCAD) penalty function for the penalization term. We conduct simulation studies to demonstrate the stability of the proposed numerical procedure, show that penalization yields estimates with smaller mean-square errors when the true parameter matrix is sparse, and provide methods for determining tuning parameters and for model selection. We analyze as an example a data set consisting of birth and death rates from 237 countries, available at the website ''Our World in Data,'' and we estimate the marginal density and distribution functions of those rates together with all parameters of our B-spline copula model."}, "https://arxiv.org/abs/2402.07629": {"title": "Logistic Multidimensional Data Analysis for Ordinal Response Variables using a Cumulative Link function", "link": "https://arxiv.org/abs/2402.07629", "description": "We present a multidimensional data analysis framework for the analysis of ordinal response variables. Underlying the ordinal variables, we assume a continuous latent variable, leading to cumulative logit models. The framework includes unsupervised methods, when no predictor variables are available, and supervised methods, when predictor variables are available. We distinguish between dominance variables and proximity variables, where dominance variables are analyzed using inner product models, whereas the proximity variables are analyzed using distance models. An expectation-majorization-minimization algorithm is derived for estimation of the parameters of the models. 
We illustrate our methodology with data from the International Social Survey Programme."}, "https://arxiv.org/abs/2402.07634": {"title": "Loglinear modeling with mixed numerical and categorical predictor variables through an Extended Stereotype Model", "link": "https://arxiv.org/abs/2402.07634", "description": "Loglinear analysis is most useful when we have two or more categorical response variables. Loglinear analysis, however, requires categorical predictor variables, such that the data can be represented in a contingency table. Researchers often have a mix of categorical and numerical predictors. We present a new statistical methodology for the analysis of multiple categorical response variables with a mix of numeric and categorical predictor variables. Therefore, the stereotype model, a reduced rank regression model for multinomial outcome variables, is extended with a design matrix for the profile scores and one for the dependencies among the responses. An MM algorithm is presented for estimation of the model parameters. Three examples are presented. The first shows that our method is equivalent to loglinear analysis when we only have categorical variables. With the second example, we show the differences between marginal logit models and our extended stereotype model, which is a conditional model. The third example is more extensive, and shows how to analyze a data set, how to select a model, and how to interpret the final model."}, "https://arxiv.org/abs/2402.07743": {"title": "Beyond Sparsity: Local Projections Inference with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2402.07743", "description": "Impulse response analysis studies how the economy responds to shocks, such as changes in interest rates, and helps policymakers manage these effects. While Vector Autoregression Models (VARs) with structural assumptions have traditionally dominated the estimation of impulse responses, local projections, the projection of future responses on current shock, have recently gained attention for their robustness and interpretability. Including many lags as controls is proposed as a means of robustness, and including a richer set of controls helps in its interpretation as a causal parameter. In both cases, an extensive number of controls leads to the consideration of high-dimensional techniques. While methods like LASSO exist, they mostly rely on sparsity assumptions - most of the parameters are exactly zero, which has limitations in dense data generation processes. This paper proposes a novel approach that incorporates high-dimensional covariates in local projections without relying on sparsity constraints. Adopting the Orthogonal Greedy Algorithm with a high-dimensional AIC (OGA+HDAIC) model selection method, this approach offers advantages including robustness in both sparse and dense scenarios, improved interpretability by prioritizing cross-sectional explanatory power, and more reliable causal inference in local projections."}, "https://arxiv.org/abs/2402.07811": {"title": "PageRank and the Bradley-Terry model", "link": "https://arxiv.org/abs/2402.07811", "description": "PageRank and the Bradley-Terry model are competing approaches to ranking entities such as teams in sports tournaments or journals in citation networks. The Bradley-Terry model is a classical statistical method for ranking based on paired comparisons. The PageRank algorithm ranks nodes according to their importance in a network. 
Whereas Bradley-Terry scores are computed via maximum likelihood estimation, PageRanks are derived from the stationary distribution of a Markov chain. More recent work has shown maximum likelihood estimates for the Bradley-Terry model may be approximated from such a limiting distribution, an interesting connection that has been discovered and rediscovered over the decades. Here we show - through relatively simple mathematics - a connection between paired comparisons and PageRank that exploits the quasi-symmetry property of the Bradley-Terry model. This motivates a novel interpretation of Bradley-Terry scores as 'scaled' PageRanks, and vice versa, with direct implications for citation-based journal ranking metrics."}, "https://arxiv.org/abs/2402.07837": {"title": "Quantile Least Squares: A Flexible Approach for Robust Estimation and Validation of Location-Scale Families", "link": "https://arxiv.org/abs/2402.07837", "description": "In this paper, the problem of robust estimation and validation of location-scale families is revisited. The proposed methods exploit the joint asymptotic normality of sample quantiles (of i.i.d random variables) to construct the ordinary and generalized least squares estimators of location and scale parameters. These quantile least squares (QLS) estimators are easy to compute because they have explicit expressions, their robustness is achieved by excluding extreme quantiles from the least-squares estimation, and efficiency is boosted by using as many non-extreme quantiles as practically relevant. The influence functions of the QLS estimators are specified and plotted for several location-scale families. They closely resemble the shapes of some well-known influence functions yet those shapes emerge automatically (i.e., do not need to be specified). The joint asymptotic normality of the proposed estimators is established, and their finite-sample properties are explored using simulations. Also, computational costs of these estimators, as well as those of MLE, are evaluated for sample sizes n = 10^6, 10^7, 10^8, 10^9. For model validation, two goodness-of-fit tests are constructed and their performance is studied using simulations and real data. In particular, for the daily stock returns of Google over the last four years, both tests strongly support the logistic distribution assumption and reject other bell-shaped competitors."}, "https://arxiv.org/abs/2402.06920": {"title": "The power of forgetting in statistical hypothesis testing", "link": "https://arxiv.org/abs/2402.06920", "description": "This paper places conformal testing in a general framework of statistical hypothesis testing. A standard approach to testing a composite null hypothesis $H$ is to test each of its elements and to reject $H$ when each of its elements is rejected. It turns out that we can fully cover conformal testing using this approach only if we allow forgetting some of the data. However, we will see that the standard approach covers conformal testing in a weak asymptotic sense and under restrictive assumptions. 
I will also list several possible directions of further research, including developing a general scheme of online testing."}, "https://arxiv.org/abs/2402.07066": {"title": "Differentially Private Range Queries with Correlated Input Perturbation", "link": "https://arxiv.org/abs/2402.07066", "description": "This work proposes a class of locally differentially private mechanisms for linear queries, in particular range queries, that leverages correlated input perturbation to simultaneously achieve unbiasedness, consistency, statistical transparency, and control over utility requirements in terms of accuracy targets expressed either in certain query margins or as implied by the hierarchical database structure. The proposed Cascade Sampling algorithm instantiates the mechanism exactly and efficiently. Our bounds show that we obtain near-optimal utility while being empirically competitive against output perturbation methods."}, "https://arxiv.org/abs/2402.07131": {"title": "Resampling methods for Private Statistical Inference", "link": "https://arxiv.org/abs/2402.07131", "description": "We consider the task of constructing confidence intervals with differential privacy. We propose two private variants of the non-parametric bootstrap, which privately compute the median of the results of multiple ``little'' bootstraps run on partitions of the data and give asymptotic bounds on the coverage error of the resulting confidence intervals. For a fixed differential privacy parameter $\\epsilon$, our methods enjoy the same error rates as those of the non-private bootstrap to within logarithmic factors in the sample size $n$. We empirically validate the performance of our methods for mean estimation, median estimation, and logistic regression with both real and synthetic data. Our methods achieve similar coverage accuracy to existing methods (and non-private baselines) while providing notably shorter ($\\gtrsim 10$ times) confidence intervals than previous approaches."}, "https://arxiv.org/abs/2402.07160": {"title": "PASOA- PArticle baSed Bayesian Optimal Adaptive design", "link": "https://arxiv.org/abs/2402.07160", "description": "We propose a new procedure named PASOA, for Bayesian experimental design, that performs sequential design optimization by simultaneously providing accurate estimates of successive posterior distributions for parameter inference. The sequential design process is carried out via a contrastive estimation principle, using stochastic optimization and Sequential Monte Carlo (SMC) samplers to maximise the Expected Information Gain (EIG). As larger information gains are obtained for larger distances between successive posterior distributions, this EIG objective may worsen classical SMC performance. To handle this issue, tempering is proposed to achieve both a large information gain and accurate SMC sampling, which we show is crucial for performance. This novel combination of stochastic optimization and tempered SMC makes it possible to jointly handle design optimization and parameter inference. We provide a proof that the obtained optimal design estimators benefit from some consistency property. 
Numerical experiments confirm the potential of the approach, which outperforms other recent procedures."}, "https://arxiv.org/abs/2402.07170": {"title": "Research on the multi-stage impact of digital economy on rural revitalization in Hainan Province based on GPM model", "link": "https://arxiv.org/abs/2402.07170", "description": "The rapid development of the digital economy has had a profound impact on the implementation of the rural revitalization strategy. Based on this, this study takes Hainan Province as the research object to deeply explore the impact of digital economic development on rural revitalization. The study collected panel data from 2003 to 2022 to construct an evaluation index system for the digital economy and rural revitalization and used panel regression analysis and other methods to explore the promotion effect of the digital economy on rural revitalization. Research results show that the digital economy has a significant positive impact on rural revitalization, and this impact increases as the level of fiscal expenditure increases. The issuance of digital RMB has further exerted a regulatory effect and promoted the development of the digital economy and the process of rural revitalization. At the same time, the establishment of the Hainan Free Trade Port has also played a positive role in promoting the development of the digital economy and rural revitalization. In predicting the optimal strategy for rural revitalization based on the development levels of the primary, secondary, and tertiary industries (Rate1, Rate2, and Rate3), it was found that Rate1 can encourage Hainan Province to implement digital economic innovation and encourage Rate3 to implement promotion behaviors, and that increasing Rate2 supports sustainable development; when Rate3 promotes Rate2's digital economic innovation behavior, it can standardize Rate2's production behavior to the greatest extent, accelerate the application of the digital economy to the rural revitalization industry, and promote the technological advancement of enterprises."}, "https://arxiv.org/abs/2402.07307": {"title": "Self-Consistent Conformal Prediction", "link": "https://arxiv.org/abs/2402.07307", "description": "In decision-making guided by machine learning, decision-makers often take identical actions in contexts with identical predicted outcomes. Conformal prediction helps decision-makers quantify outcome uncertainty for actions, allowing for better risk management. Inspired by this perspective, we introduce self-consistent conformal prediction, which yields both Venn-Abers calibrated predictions and conformal prediction intervals that are valid conditional on actions prompted by model predictions. Our procedure can be applied post-hoc to any black-box predictor to provide rigorous, action-specific decision-making guarantees. Numerical experiments show our approach strikes a balance between interval efficiency and conditional validity."}, "https://arxiv.org/abs/2402.07322": {"title": "Interference Among First-Price Pacing Equilibria: A Bias and Variance Analysis", "link": "https://arxiv.org/abs/2402.07322", "description": "Online A/B testing is widely used in the internet industry to inform decisions on new feature roll-outs. For online marketplaces (such as advertising markets), standard approaches to A/B testing may lead to biased results when buyers operate under a budget constraint, as budget consumption in one arm of the experiment impacts performance of the other arm. 
To counteract this interference, one can use a budget-split design where the budget constraint operates on a per-arm basis and each arm receives an equal fraction of the budget, leading to ``budget-controlled A/B testing.'' Despite clear advantages of budget-controlled A/B testing, performance degrades when budgets are split too thinly, limiting the overall throughput of such systems. In this paper, we propose a parallel budget-controlled A/B testing design where we use market segmentation to identify submarkets in the larger market, and we run parallel experiments on each submarket.\n Our contributions are as follows: First, we introduce and demonstrate the effectiveness of the parallel budget-controlled A/B test design with submarkets in a large online marketplace environment. Second, we formally define market interference in first-price auction markets using the first price pacing equilibrium (FPPE) framework. Third, we propose a debiased surrogate that eliminates the first-order bias of FPPE, drawing upon the principles of sensitivity analysis in mathematical programs. Fourth, we derive a plug-in estimator for the surrogate and establish its asymptotic normality. Fifth, we provide an estimation procedure for submarket parallel budget-controlled A/B tests. Finally, we present numerical examples on semi-synthetic data, confirming that the debiasing technique achieves the desired coverage properties."}, "https://arxiv.org/abs/2402.07406": {"title": "Asymptotic Equivalency of Two Different Approaches of L-statistics", "link": "https://arxiv.org/abs/2402.07406", "description": "There are several ways to establish the asymptotic normality of $L$-statistics, depending upon the selection of the weights generating function and the cumulative distribution function of the underlying model. Here, in this paper, it is shown that two of the asymptotic approaches are equivalent for a particular choice of the weights generating function."}, "https://arxiv.org/abs/2402.07419": {"title": "Conditional Generative Models are Sufficient to Sample from Any Causal Effect Estimand", "link": "https://arxiv.org/abs/2402.07419", "description": "Causal inference from observational data has recently found many applications in machine learning. While sound and complete algorithms exist to compute causal effects, many of these algorithms require explicit access to conditional likelihoods over the observational distribution, which is difficult to estimate in the high-dimensional regime, such as with images. To alleviate this issue, researchers have approached the problem by simulating causal relations with neural models and obtained impressive results. However, none of these existing approaches can be applied to generic scenarios such as causal graphs on image data with latent confounders, or obtain conditional interventional samples. In this paper, we show that any identifiable causal effect given an arbitrary causal graph can be computed through push-forward computations of conditional generative models. Based on this result, we devise a diffusion-based approach to sample from any (conditional) interventional distribution on image data. To showcase our algorithm's performance, we conduct experiments on a Colored MNIST dataset having both the treatment ($X$) and the target variables ($Y$) as images and obtain interventional samples from $P(y|do(x))$. 
As an application of our algorithm, we evaluate two large conditional generative models that are pre-trained on the CelebA dataset by analyzing the strength of spurious correlations and the level of disentanglement they achieve."}, "https://arxiv.org/abs/2402.07717": {"title": "Efficient reductions between some statistical models", "link": "https://arxiv.org/abs/2402.07717", "description": "We study the problem of approximately transforming a sample from a source statistical model to a sample from a target statistical model without knowing the parameters of the source model, and construct several computationally efficient such reductions between statistical experiments. In particular, we provide computationally efficient procedures that approximately reduce uniform, Erlang, and Laplace location models to general target families. We illustrate our methodology by establishing nonasymptotic reductions between some canonical high-dimensional problems, spanning mixtures of experts, phase retrieval, and signal denoising. Notably, the reductions are structure preserving and can accommodate missing data. We also point to a possible application in transforming one differentially private mechanism to another."}, "https://arxiv.org/abs/2402.07806": {"title": "A comparison of six linear and non-linear mixed models for longitudinal data: application to late-life cognitive trajectories", "link": "https://arxiv.org/abs/2402.07806", "description": "Longitudinal characterization of cognitive change in late-life has received increasing attention to better understand age-related cognitive aging and cognitive changes reflecting pathology-related and mortality-related processes. Several mixed-effects models have been proposed to accommodate the non-linearity of cognitive decline and assess the putative influence of covariates on it. In this work, we examine the standard linear mixed model (LMM) with a linear function of time and five alternative models capturing non-linearity of change over time, including the LMM with a quadratic term, LMM with splines, the functional mixed model, the piecewise linear mixed model and the sigmoidal mixed model. We first theoretically describe the models. Next, using data from deceased participants from two prospective cohorts with annual cognitive testing, we compared the interpretation of the models by investigating the association of education on cognitive change before death. Finally, we performed a simulation study to empirically evaluate the models and provide practical recommendations. In particular, models were challenged by increasing follow-up spacing, increasing missing data, and decreasing sample size. With the exception of the LMM with a quadratic term, the fit of all models was generally adequate to capture non-linearity of cognitive change and models were relatively robust. Although spline-based models do not have interpretable nonlinearity parameters, their convergence was easier to achieve and they allow for graphical interpretation. In contrast the piecewise and the sigmoidal models, with interpretable non-linear parameters may require more data to achieve convergence."}, "https://arxiv.org/abs/2008.04522": {"title": "Bayesian Analysis on Limiting the Student-$t$ Linear Regression Model", "link": "https://arxiv.org/abs/2008.04522", "description": "For the outlier problem in linear regression models, the Student-$t$ linear regression model is one of the common methods for robust modeling and is widely adopted in the literature. 
However, most of them apply it without careful theoretical consideration. This study provides practically useful and quite simple conditions to ensure that the Student-$t$ linear regression model is robust against an outlier in the $y$-direction using regular variation theory."}, "https://arxiv.org/abs/2103.01412": {"title": "Some Finite Sample Properties of the Sign Test", "link": "https://arxiv.org/abs/2103.01412", "description": "This paper contains two finite-sample results concerning the sign test. First, we show that the sign test is unbiased with independent, non-identically distributed data for both one-sided and two-sided hypotheses. The proof for the two-sided case is based on a novel argument that relates the derivatives of the power function to a regular bipartite graph. Unbiasedness then follows from the existence of perfect matchings on such graphs. Second, we provide a simple theoretical counterexample to show that the sign test over-rejects when the data exhibits correlation. Our results can be useful for understanding the properties of approximate randomization tests in settings with few clusters."}, "https://arxiv.org/abs/2105.12852": {"title": "Identifying Brexit voting patterns in the British House of Commons: an analysis based on Bayesian mixture models with flexible concomitant covariate effects", "link": "https://arxiv.org/abs/2105.12852", "description": "Brexit and its implications have been an ongoing topic of interest since the Brexit referendum in 2016. In 2019 the House of Commons held a number of \"indicative\" and \"meaningful\" votes as part of the Brexit approval process. The voting behaviour of members of parliament in these votes is investigated to gain insight into the Brexit approval process. In particular, a mixture model with concomitant covariates is developed to identify groups of members of parliament who share similar voting behaviour while also considering characteristics of the members of parliament. The novelty of the method lies in the flexible structure used to model the effect of concomitant covariates on the component weights of the mixture, with the (potentially nonlinear) terms represented as a smooth function of the covariates. Results show this approach makes it possible to quantify the effect of the age of members of parliament, as well as of preferences and competitiveness in the constituencies they represent, on their position towards Brexit. This helps group the aforementioned politicians into homogeneous clusters, whose composition departs noticeably from that of the parties."}, "https://arxiv.org/abs/2112.03220": {"title": "Cross-validation for change-point regression: pitfalls and solutions", "link": "https://arxiv.org/abs/2112.03220", "description": "Cross-validation is the standard approach for tuning parameter selection in many non-parametric regression problems. However, its use is less common in change-point regression, perhaps as its prediction error-based criterion may appear to permit small spurious changes and hence be less well-suited to estimation of the number and location of change-points. We show that in fact the problems of cross-validation with squared error loss are more severe and can lead to systematic under- or over-estimation of the number of change-points, and highly suboptimal estimation of the mean function in simple settings where changes are easily detectable. 
We propose two simple approaches to remedy these issues, the first involving the use of absolute error rather than squared error loss, and the second involving modifying the holdout sets used. For the latter, we provide conditions that permit consistent estimation of the number of change-points for a general change-point estimation procedure. We show these conditions are satisfied for least squares estimation using new results on its performance when supplied with the incorrect number of change-points. Numerical experiments show that our new approaches are competitive with common change-point methods using classical tuning parameter choices when error distributions are well-specified, but can substantially outperform these in misspecified models. An implementation of our methodology is available in the R package crossvalidationCP on CRAN."}, "https://arxiv.org/abs/2203.07342": {"title": "A Bayesian Nonparametric Approach to Species Sampling Problems with Ordering", "link": "https://arxiv.org/abs/2203.07342", "description": "Species-sampling problems (SSPs) refer to a vast class of statistical problems calling for the estimation of (discrete) functionals of the unknown species composition of an unobservable population. A common feature of SSPs is their invariance with respect to species labeling, which is at the core of the Bayesian nonparametric (BNP) approach to SSPs under the popular Pitman-Yor process (PYP) prior. In this paper, we develop a BNP approach to SSPs that are not \"invariant\" to species labeling, in the sense that an ordering or ranking is assigned to species' labels. Inspired by the population genetics literature on age-ordered alleles' compositions, we study the following SSP with ordering: given an observable sample from an unknown population of individuals belonging to species (alleles), with species' labels being ordered according to weights (ages), estimate the frequencies of the first $r$ order species' labels in an enlarged sample obtained by including additional unobservable samples. By relying on an ordered PYP prior, we obtain an explicit posterior distribution of the first $r$ order frequencies, with estimates that are easy to implement and computationally efficient. We apply our approach to the analysis of genetic variation, showing its effectiveness in estimating the frequency of the oldest allele, and then we discuss other potential applications."}, "https://arxiv.org/abs/2210.04100": {"title": "Doubly robust estimation and sensitivity analysis for marginal structural quantile models", "link": "https://arxiv.org/abs/2210.04100", "description": "The marginal structural quantile model (MSQM) provides a unique lens to understand the causal effect of a time-varying treatment on the full distribution of potential outcomes. Under the semiparametric framework, we derive the efficient influence function for the MSQM, from which a new doubly robust estimator is proposed for point estimation and inference. We show that the doubly robust estimator is consistent if either of the models associated with treatment assignment or the potential outcome distributions is correctly specified, and is semiparametric efficient if both models are correct. To implement the doubly robust MSQM estimator, we propose to solve a smoothed estimating equation to facilitate efficient computation of the point and variance estimates.
In addition, we develop a confounding function approach to investigate the sensitivity of several MSQM estimators when the sequential ignorability assumption is violated. Extensive simulations are conducted to examine the finite-sample performance characteristics of the proposed methods. We apply the proposed methods to the Yale New Haven Health System Electronic Health Record data to study the effect of antihypertensive medications in patients with severe hypertension and assess the robustness of findings to unmeasured baseline and time-varying confounding."}, "https://arxiv.org/abs/2304.00231": {"title": "Using Overlap Weights to Address Extreme Propensity Scores in Estimating Restricted Mean Counterfactual Survival Times", "link": "https://arxiv.org/abs/2304.00231", "description": "While the inverse probability of treatment weighting (IPTW) is a commonly used approach for treatment comparisons in observational data, the resulting estimates may be subject to bias and excessively large variance when there is a lack of overlap in the propensity score distributions. By smoothly down-weighting the units with extreme propensity scores, overlap weighting (OW) can help mitigate the bias and variance issues associated with IPTW. Although theoretical and simulation results have supported the use of OW with continuous and binary outcomes, its performance with right-censored survival outcomes remains to be further investigated, especially when the target estimand is defined based on the restricted mean survival time (RMST), a clinically meaningful summary measure free of the proportional hazards assumption. In this article, we combine propensity score weighting and inverse probability of censoring weighting to estimate the restricted mean counterfactual survival times, and propose computationally efficient variance estimators. We conduct simulations to compare the performance of IPTW, trimming, and OW in terms of bias, variance, and 95% confidence interval coverage, under various degrees of covariate overlap. Regardless of overlap, we demonstrate the advantage of OW over IPTW and trimming methods in bias, variance, and coverage when the estimand is defined based on RMST."}, "https://arxiv.org/abs/2304.04221": {"title": "Maximum Agreement Linear Prediction via the Concordance Correlation Coefficient", "link": "https://arxiv.org/abs/2304.04221", "description": "This paper examines distributional properties and predictive performance of the estimated maximum agreement linear predictor (MALP) introduced in the Bottai, Kim, Lieberman, Luta, and Pena (2022) paper in The American Statistician, which is the linear predictor maximizing Lin's concordance correlation coefficient (CCC) between the predictor and the predictand. It is compared and contrasted, theoretically and through computer experiments, with the estimated least-squares linear predictor (LSLP). Finite-sample and asymptotic properties are obtained, and confidence intervals are also presented. The predictors are illustrated using two real data sets: an eye data set and a bodyfat data set.
The results indicate that the estimated MALP is a viable alternative to the estimated LSLP if one desires a predictor whose predicted values possess higher agreement with the predictand values, as measured by the CCC."}, "https://arxiv.org/abs/2305.16464": {"title": "Flexible Variable Selection for Clustering and Classification", "link": "https://arxiv.org/abs/2305.16464", "description": "The importance of variable selection for clustering has been recognized for some time, and mixture models are well-established as a statistical approach to clustering. Yet, the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that utilizes the Manly transformation mixture model to select variables based on their ability to separate clusters, and is effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared to existing variable selection methods -- including an existing method that can account for cluster skewness -- using simulated and real datasets."}, "https://arxiv.org/abs/2306.15616": {"title": "Network-Adjusted Covariates for Community Detection", "link": "https://arxiv.org/abs/2306.15616", "description": "Community detection is a crucial task in network analysis that can be significantly improved by incorporating subject-level information, i.e., covariates. However, current methods often struggle with selecting tuning parameters and analyzing low-degree nodes. In this paper, we introduce a novel method that addresses these challenges by constructing network-adjusted covariates, which leverage the network connections and covariates with a unique weight for each node based on the node's degree. Spectral clustering on network-adjusted covariates yields an exact recovery of community labels under certain conditions, which is tuning-free and computationally efficient. We present novel theoretical results about the strong consistency of our method under degree-corrected stochastic blockmodels with covariates, even in the presence of mis-specification and sparse communities with bounded degrees. Additionally, we establish a general lower bound for the community detection problem when both network and covariates are present, and it shows our method is optimal up to a constant factor. Our method outperforms existing approaches in simulations and a LastFM app user network, and provides interpretable community structures in a statistics publication citation network where $30\\%$ of nodes are isolated."}, "https://arxiv.org/abs/2308.01062": {"title": "A new multivariate and non-parametric association measure based on paired orthants", "link": "https://arxiv.org/abs/2308.01062", "description": "Multivariate correlation analysis plays a key role in various fields such as statistics and big data analytics. In this paper, a new non-parametric association measure between more than two variables, based on the concept of paired orthants, is presented. In order to assess the proposed methodology, different N-tuple sets (from two to six variables) have been evaluated.
The presented rank correlation analysis not only evaluates the inter-relatedness of multiple variables, but also determines the specific tendency of these variables."}, "https://arxiv.org/abs/2310.12711": {"title": "Modelling multivariate extremes through angular-radial decomposition of the density function", "link": "https://arxiv.org/abs/2310.12711", "description": "We present a new framework for modelling multivariate extremes, based on an angular-radial representation of the probability density function. Under this representation, the problem of modelling multivariate extremes is transformed to that of modelling an angular density and the tail of the radial variable, conditional on angle. Motivated by univariate theory, we assume that the tail of the conditional radial distribution converges to a generalised Pareto (GP) distribution. To simplify inference, we also assume that the angular density is continuous and finite and the GP parameter functions are continuous with angle. We refer to the resulting model as the semi-parametric angular-radial (SPAR) model for multivariate extremes. We consider the effect of the choice of polar coordinate system and introduce generalised concepts of angular-radial coordinate systems and generalised scalar angles in two dimensions. We show that under certain conditions, the choice of polar coordinate system does not affect the validity of the SPAR assumptions. However, some choices of coordinate system lead to simpler representations. In contrast, we show that the choice of margin does affect whether the model assumptions are satisfied. In particular, the use of Laplace margins results in a form of the density function for which the SPAR assumptions are satisfied for many common families of copula, with various dependence classes. We show that the SPAR model provides a more versatile framework for characterising multivariate extremes than that provided by existing approaches, and that several commonly used approaches are special cases of the SPAR model."}, "https://arxiv.org/abs/2310.17334": {"title": "Bayesian Optimization for Personalized Dose-Finding Trials with Combination Therapies", "link": "https://arxiv.org/abs/2310.17334", "description": "Identification of optimal dose combinations in early phase dose-finding trials is challenging, due to the trade-off between precisely estimating the many parameters required to flexibly model the possibly non-monotonic dose-response surface, and the small sample sizes in early phase trials. This difficulty is even more pertinent in the context of personalized dose-finding, where patient characteristics are used to identify tailored optimal dose combinations. To overcome these challenges, we propose the use of Bayesian optimization for finding optimal dose combinations in standard (\"one size fits all\") and personalized multi-agent dose-finding trials. Bayesian optimization is a method for estimating the global optima of expensive-to-evaluate objective functions. The objective function is approximated by a surrogate model, commonly a Gaussian process, paired with a sequential design strategy to select the next point via an acquisition function. This work is motivated by an industry-sponsored problem, where the focus is on optimizing a dual-agent therapy in a setting featuring minimal toxicity. To compare the performance of the standard and personalized methods under this setting, simulation studies are performed for a variety of scenarios.
Our study concludes that taking a personalized approach is highly beneficial in the presence of heterogeneity."}, "https://arxiv.org/abs/2401.09379": {"title": "Merging uncertainty sets via majority vote", "link": "https://arxiv.org/abs/2401.09379", "description": "Given $K$ uncertainty sets that are arbitrarily dependent -- for example, confidence intervals for an unknown parameter obtained with $K$ different estimators, or prediction sets obtained via conformal prediction based on $K$ different algorithms on shared data -- we address the question of how to efficiently combine them in a black-box manner to produce a single uncertainty set. We present a simple and broadly applicable majority vote procedure that produces a merged set with nearly the same error guarantee as the input sets. We then extend this core idea in a few ways: we show that weighted averaging can be a powerful way to incorporate prior information, and a simple randomization trick produces strictly smaller merged sets without altering the coverage guarantee. Further improvements can be obtained by inducing exchangeability within the sets. When deployed in online settings, we show how the exponentially weighted majority algorithm can be employed to learn a good weighting over time. We then combine this method with adaptive conformal inference to deliver a simple conformal online model aggregation (COMA) method for nonexchangeable data."}, "https://arxiv.org/abs/2105.02569": {"title": "Machine Collaboration", "link": "https://arxiv.org/abs/2105.02569", "description": "We propose a new ensemble framework for supervised learning, called machine collaboration (MaC), using a collection of base machines for prediction tasks. Unlike bagging/stacking (a parallel & independent framework) and boosting (a sequential & top-down framework), MaC is a type of circular & interactive learning framework. The circular & interactive feature helps the base machines to transfer information circularly and update their structures and parameters accordingly. The theoretical result on the risk bound of the estimator from MaC reveals that the circular & interactive feature can help MaC reduce risk via a parsimonious ensemble. We conduct extensive experiments on MaC using both simulated data and 119 benchmark real datasets. The results demonstrate that in most cases, MaC performs significantly better than several other state-of-the-art methods, including classification and regression trees, neural networks, stacking, and boosting."}, "https://arxiv.org/abs/2211.03846": {"title": "Federated Causal Discovery From Interventions", "link": "https://arxiv.org/abs/2211.03846", "description": "Causal discovery serves a pivotal role in mitigating model uncertainty through recovering the underlying causal mechanisms among variables. In many practical domains, such as healthcare, access to the data gathered by individual entities is limited, primarily due to privacy and regulatory constraints. However, the majority of existing causal discovery methods require the data to be available in a centralized location. In response, researchers have introduced federated causal discovery. While previous federated methods consider distributed observational data, the integration of interventional data remains largely unexplored. We propose FedCDI, a federated framework for inferring causal structures from distributed data containing interventional samples.
In line with the federated learning framework, FedCDI improves privacy by exchanging belief updates rather than raw samples. Additionally, it introduces a novel intervention-aware method for aggregating individual updates. We analyze scenarios with shared or disjoint intervened covariates, and mitigate the adverse effects of interventional data heterogeneity. The performance and scalability of FedCDI are rigorously tested across a variety of synthetic and real-world graphs."}, "https://arxiv.org/abs/2309.00943": {"title": "iCOS: Option-Implied COS Method", "link": "https://arxiv.org/abs/2309.00943", "description": "This paper proposes the option-implied Fourier-cosine method, iCOS, for non-parametric estimation of risk-neutral densities, option prices, and option sensitivities. The iCOS method leverages the Fourier-based COS technique, proposed by Fang and Oosterlee (2008), by utilizing the option-implied cosine series coefficients. Notably, this procedure does not rely on any model assumptions about the underlying asset price dynamics, it is fully non-parametric, and it does not involve any numerical optimization. These features make it rather general and computationally appealing. Furthermore, we derive the asymptotic properties of the proposed non-parametric estimators and study their finite-sample behavior in Monte Carlo simulations. Our empirical analysis using S&P 500 index options and Amazon equity options illustrates the effectiveness of the iCOS method in extracting valuable information from option prices under different market conditions. Additionally, we apply our methodology to dissect and quantify observation and discretization errors in the VIX index."}, "https://arxiv.org/abs/2310.12863": {"title": "A remark on moment-dependent phase transitions in high-dimensional Gaussian approximations", "link": "https://arxiv.org/abs/2310.12863", "description": "In this article, we study the critical growth rates of dimension below which Gaussian critical values can be used for hypothesis testing but beyond which they cannot. We are particularly interested in how these growth rates depend on the number of moments that the observations possess."}, "https://arxiv.org/abs/2312.05974": {"title": "Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise", "link": "https://arxiv.org/abs/2312.05974", "description": "This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes -- partial observability. The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real-world network.
Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime."}, "https://arxiv.org/abs/2402.08051": {"title": "On Bayesian Filtering for Markov Regime Switching Models", "link": "https://arxiv.org/abs/2402.08051", "description": "This paper presents a framework for empirical analysis of dynamic macroeconomic models using Bayesian filtering, with a specific focus on the state-space formulation of Dynamic Stochastic General Equilibrium (DSGE) models with multiple regimes. We outline the theoretical foundations of model estimation, provide the details of two families of powerful multiple-regime filters, IMM and GPB, and construct corresponding multiple-regime smoothers. A simulation exercise, based on a prototypical New Keynesian DSGE model, is used to demonstrate the computational robustness of the proposed filters and smoothers and evaluate their accuracy and speed for a selection of filters from each family. We show that the canonical IMM filter is faster and is no less, and often more, accurate than its competitors within IMM and GPB families, the latter including the commonly used Kim and Nelson (1999) filter. Using it with the matching smoother improves the precision in recovering unobserved variables by about 25 percent. Furthermore, applying it to the U.S. 1947-2023 macroeconomic time series, we successfully identify significant past policy shifts including those related to the post-Covid-19 period. Our results demonstrate the practical applicability and potential of the proposed routines in macroeconomic analysis."}, "https://arxiv.org/abs/2402.08069": {"title": "Interrater agreement statistics under the two-rater dichotomous-response case with correlated decisions", "link": "https://arxiv.org/abs/2402.08069", "description": "Measurement of the interrater agreement (IRA) is critical in various disciplines. To correct for potential confounding chance agreement in IRA, Cohen's kappa and many other methods have been proposed. However, owing to the varied strategies and assumptions across these methods, there is a lack of practical guidelines on how these methods should be preferred even for the common two-rater dichotomous rating. To fill the gaps in the literature, we systematically review nine IRA methods and propose a generalized framework that can simulate the correlated decision processes behind the two raters to compare those reviewed methods under comprehensive practical scenarios. Based on the new framework, an estimand of \"true\" chance-corrected IRA is defined by accounting for the \"probabilistic certainty\" and serves as the comparison benchmark. We carry out extensive simulations to evaluate the performance of the reviewed IRA measures, and an agglomerative hierarchical clustering analysis is conducted to assess the inter-relationships among the included methods and the benchmark metric. Recommendations for selecting appropriate IRA statistics in different practical conditions are provided and the needs for further advancements in IRA estimation methodologies are emphasized."}, "https://arxiv.org/abs/2402.08108": {"title": "Finding Moving-Band Statistical Arbitrages via Convex-Concave Optimization", "link": "https://arxiv.org/abs/2402.08108", "description": "We propose a new method for finding statistical arbitrages that can contain more assets than just the traditional pair. We formulate the problem as seeking a portfolio with the highest volatility, subject to its price remaining in a band and a leverage limit. 
This optimization problem is not convex, but can be approximately solved using the convex-concave procedure, a specific sequential convex programming method. We show how the method generalizes to finding moving-band statistical arbitrages, where the price band midpoint varies over time."}, "https://arxiv.org/abs/2402.08151": {"title": "Gradient-flow adaptive importance sampling for Bayesian leave one out cross-validation for sigmoidal classification models", "link": "https://arxiv.org/abs/2402.08151", "description": "We introduce a set of gradient-flow-guided adaptive importance sampling (IS) transformations to stabilize Monte Carlo approximations of point-wise leave one out cross-validated (LOO) predictions for Bayesian classification models. One can leverage this methodology for assessing model generalizability by, for instance, computing a LOO analogue to the AIC or computing LOO ROC/PRC curves and derived metrics like the AUROC and AUPRC. By the calculus of variations and gradient flow, we derive two simple nonlinear single-step transformations that utilize gradient information to shift a model's pre-trained full-data posterior closer to the target LOO posterior predictive distributions. In doing so, the transformations stabilize importance weights. Because the transformations involve the gradient of the likelihood function, the resulting Monte Carlo integral depends on Jacobian determinants with respect to the model Hessian. We derive closed-form exact formulae for these Jacobian determinants in the cases of logistic regression and shallow ReLU-activated artificial neural networks, and provide a simple approximation that sidesteps the need to compute full Hessian matrices and their spectra. We test the methodology on an $n\\ll p$ dataset that is known to produce unstable LOO IS weights."}, "https://arxiv.org/abs/2402.08222": {"title": "Integration of multiview microbiome data for deciphering microbiome-metabolome-disease pathways", "link": "https://arxiv.org/abs/2402.08222", "description": "The intricate interplay between host organisms and their gut microbiota has catalyzed research into the microbiome's role in disease, shedding light on novel aspects of disease pathogenesis. However, the mechanisms through which the microbiome exerts its influence on disease remain largely unclear. In this study, we first introduce a structural equation model to delineate the pathways connecting the microbiome, metabolome, and disease processes, utilizing a target multiview microbiome dataset. To mitigate the challenges posed by hidden confounders, we further propose an integrative approach that incorporates data from an external microbiome cohort. This method also supports the identification of disease-specific and microbiome-associated metabolites that are missing in the target cohort. We provide theoretical underpinnings for the estimations derived from our integrative approach, demonstrating estimation consistency and asymptotic normality. The effectiveness of our methodologies is validated through comprehensive simulation studies and an empirical application to inflammatory bowel disease, highlighting their potential to unravel the complex relationships between the microbiome, metabolome, and disease."}, "https://arxiv.org/abs/2402.08239": {"title": "Interaction Decomposition of prediction function", "link": "https://arxiv.org/abs/2402.08239", "description": "This paper discusses the foundations of methods for accurately capturing interaction effects.
Among the existing methods that capture the interaction effects as terms, partial dependence (PD) and accumulated local effects (ALE) are known as global model-agnostic methods in the interpretable machine learning (IML) field. Of the two, ALE can theoretically provide a functional decomposition of the prediction function, and this study focuses on functional decomposition. Specifically, we mathematically formalize what we consider to be the requirements that must always be met by a decomposition (interaction decomposition, hereafter ID) that decomposes the prediction function into main and interaction effect terms. We also present a theorem about how to produce a decomposition that meets these requirements. Furthermore, we confirm that while ALE is an ID, PD is not, and we present examples of decompositions that meet the requirements of ID using methods other than the existing ones (i.e., new methods)."}, "https://arxiv.org/abs/2402.08283": {"title": "Classification Using Global and Local Mahalanobis Distances", "link": "https://arxiv.org/abs/2402.08283", "description": "We propose a novel semi-parametric classifier based on Mahalanobis distances of an observation from the competing classes. Our tool is a generalized additive model with the logistic link function that uses these distances as features to estimate the posterior probabilities of the different classes. While popular parametric classifiers like linear and quadratic discriminant analyses are mainly motivated by the normality of the underlying distributions, the proposed classifier is more flexible and free from such parametric assumptions. Since the densities of elliptic distributions are functions of Mahalanobis distances, this classifier works well when the competing classes are (nearly) elliptic. In such cases, it often outperforms popular nonparametric classifiers, especially when the sample size is small compared to the dimension of the data. To cope with non-elliptic and possibly multimodal distributions, we propose a local version of the Mahalanobis distance. Subsequently, we propose another classifier based on a generalized additive model that uses the local Mahalanobis distances as features. This nonparametric classifier usually performs like the Mahalanobis distance based semiparametric classifier when the underlying distributions are elliptic, but outperforms it for several non-elliptic and multimodal distributions. We also investigate the behaviour of these two classifiers in high-dimension, low-sample-size situations. A thorough numerical study involving several simulated and real datasets demonstrates the usefulness of the proposed classifiers in comparison to many state-of-the-art methods."}, "https://arxiv.org/abs/2402.08328": {"title": "Dual Likelihood for Causal Inference under Structure Uncertainty", "link": "https://arxiv.org/abs/2402.08328", "description": "Knowledge of the underlying causal relations is essential for inferring the effect of interventions in complex systems. In a widely studied approach, structural causal models postulate noisy functional relations among interacting variables, where the underlying causal structure is then naturally represented by a directed graph whose edges indicate direct causal dependencies. In the typical application, this underlying causal structure must be learned from data, and thus, the remaining structure uncertainty needs to be incorporated into causal inference in order to draw reliable conclusions.
In recent work, test inversions provide an ansatz to account for this data-driven model choice and, therefore, combine structure learning with causal inference. In this article, we propose the use of dual likelihood to greatly simplify the treatment of the involved testing problem. Indeed, dual likelihood leads to a closed-form solution for constructing confidence regions for total causal effects that rigorously capture both sources of uncertainty: causal structure and numerical size of nonzero effects. The proposed confidence regions can be computed with a bottom-up procedure starting from sink nodes. To render the causal structure identifiable, we develop our ideas in the context of linear causal relations with equal error variances."}, "https://arxiv.org/abs/2402.08335": {"title": "Joint Modeling of Multivariate Longitudinal and Survival Outcomes with the R package INLAjoint", "link": "https://arxiv.org/abs/2402.08335", "description": "This paper introduces the R package INLAjoint, designed as a toolbox for fitting a diverse range of regression models addressing both longitudinal and survival outcomes. INLAjoint relies on the computational efficiency of the integrated nested Laplace approximations methodology, an efficient alternative to Markov chain Monte Carlo for Bayesian inference, ensuring both speed and accuracy in parameter estimation and uncertainty quantification. The package facilitates the construction of complex joint models by treating individual regression models as building blocks, which can be assembled to address specific research questions. Joint models are relevant in biomedical studies where the collection of longitudinal markers alongside censored survival times is common. They have gained significant interest in recent literature, demonstrating the ability to rectify biases present in separate modeling approaches such as informative censoring by a survival event or confusion bias due to population heterogeneity. We provide a comprehensive overview of the joint modeling framework embedded in INLAjoint with illustrative examples. Through these examples, we demonstrate the practical utility of INLAjoint in handling complex data scenarios encountered in biomedical research."}, "https://arxiv.org/abs/2402.08336": {"title": "A two-step approach for analyzing time to event data under non-proportional hazards", "link": "https://arxiv.org/abs/2402.08336", "description": "The log-rank test and the Cox proportional hazards model are commonly used to compare time-to-event data in clinical trials, as they are most powerful under proportional hazards. But there is a loss of power if this assumption is violated, which is the case for some new oncology drugs like immunotherapies. We consider a two-stage test procedure, in which the weighting of the log-rank test statistic depends on a pre-test of the proportional hazards assumption. I.e., depending on the pre-test either the log-rank or an alternative test is used to compare the survival probabilities. We show that if naively implemented this can lead to a substantial inflation of the type-I error rate. To address this, we embed the two-stage test in a permutation test framework to keep the nominal level alpha. 
We compare the operating characteristics of the two-stage test with the log-rank test and other tests by clinical trial simulations."}, "https://arxiv.org/abs/2402.08376": {"title": "The generalized Hausman test for detecting non-normality in the latent variable distribution of the two-parameter IRT model", "link": "https://arxiv.org/abs/2402.08376", "description": "This paper introduces the generalized Hausman test as a novel method for detecting non-normality of the latent variable distribution of unidimensional Item Response Theory (IRT) models for binary data. The test utilizes the pairwise maximum likelihood estimator obtained for the parameters of the classical two-parameter IRT model, which assumes normality of the latent variable, and the quasi-maximum likelihood estimator obtained under a semi-nonparametric framework, allowing for a more flexible distribution of the latent variable. The performance of the generalized Hausman test is evaluated through a simulation study and it is compared with the likelihood-ratio and the M2 test statistics. Additionally, various information criteria are computed. The simulation results show that the generalized Hausman test outperforms the other tests under most conditions. However, the results obtained from the information criteria are somewhat contradictory under certain conditions, suggesting a need for further investigation and interpretation."}, "https://arxiv.org/abs/2402.08500": {"title": "Covariate selection for the estimation of marginal hazard ratios in high-dimensional data", "link": "https://arxiv.org/abs/2402.08500", "description": "Hazard ratios are frequently reported in time-to-event and epidemiological studies to assess treatment effects. In observational studies, the combination of propensity score weights with the Cox proportional hazards model facilitates the estimation of the marginal hazard ratio (MHR). The methods for estimating MHR are analogous to those employed for estimating common causal parameters, such as the average treatment effect. However, MHR estimation in the context of high-dimensional data remains unexplored. This paper seeks to address this gap through a simulation study that considers variable selection methods from causal inference combined with a recently proposed multiply robust approach for MHR estimation. Additionally, a case study utilizing stroke register data is conducted to demonstrate the application of these methods. The results from the simulation study indicate that the double selection covariate selection method is preferable to several other strategies when estimating MHR. Nevertheless, the estimation can be further improved by employing the multiply robust approach to the set of propensity score models obtained during the double selection process."}, "https://arxiv.org/abs/2402.08575": {"title": "Heterogeneity, Uncertainty and Learning: Semiparametric Identification and Estimation", "link": "https://arxiv.org/abs/2402.08575", "description": "We provide semiparametric identification results for a broad class of learning models in which continuous outcomes depend on three types of unobservables: i) known heterogeneity, ii) initially unknown heterogeneity that may be revealed over time, and iii) transitory uncertainty. We consider a common environment where the researcher only has access to a short panel on choices and realized outcomes.
We establish identification of the outcome equation parameters and the distribution of the three types of unobservables, under the standard assumption that unknown heterogeneity and uncertainty are normally distributed. We also show that, absent known heterogeneity, the model is identified without making any distributional assumption. We then derive the asymptotic properties of a sieve MLE for the model parameters, and devise a tractable profile likelihood based estimation procedure. Monte Carlo simulation results indicate that our estimator exhibits good finite-sample properties."}, "https://arxiv.org/abs/2402.08110": {"title": "Estimating Lagged (Cross-)Covariance Operators of $L^p$-$m$-approximable Processes in Cartesian Product Hilbert Spaces", "link": "https://arxiv.org/abs/2402.08110", "description": "When estimating the parameters in functional ARMA, GARCH, and invertible linear processes, covariance and lagged cross-covariance operators of processes in Cartesian product spaces appear. Such operators have been consistently estimated in recent years, either less generally or under a strong condition. This article extends the existing literature by deriving explicit upper bounds for estimation errors for lagged covariance and lagged cross-covariance operators of processes in general Cartesian product Hilbert spaces, based on the mild weak dependence condition $L^p$-$m$-approximability. The upper bounds are stated for each lag, Cartesian power(s) and sample size, where the two processes in the context of lagged cross-covariance operators can take values in different spaces. General consequences of our results are also mentioned."}, "https://arxiv.org/abs/2402.08229": {"title": "Causal Discovery under Off-Target Interventions", "link": "https://arxiv.org/abs/2402.08229", "description": "Causal graph discovery is a significant problem with applications across various disciplines. However, with observational data alone, the underlying causal graph can only be recovered up to its Markov equivalence class, and further assumptions or interventions are necessary to narrow down the true graph. This work addresses the causal discovery problem under the setting of stochastic interventions with the natural goal of minimizing the number of interventions performed. We propose the following stochastic intervention model which subsumes existing adaptive noiseless interventions in the literature while capturing scenarios such as fat-hand interventions and CRISPR gene knockouts: any intervention attempt results in an actual intervention on a random subset of vertices, drawn from a distribution dependent on the attempted action. Under this model, we study the two fundamental causal discovery problems of verification and search, provide approximation algorithms with polylogarithmic competitive ratios, and present some preliminary experimental results."}, "https://arxiv.org/abs/2402.08285": {"title": "Theoretical properties of angular halfspace depth", "link": "https://arxiv.org/abs/2402.08285", "description": "The angular halfspace depth (ahD) is a natural modification of the celebrated halfspace (or Tukey) depth to the setup of directional data. It allows us to define elements of nonparametric inference, such as the median, the inter-quantile regions, or the rank statistics, for datasets supported in the unit sphere. Despite being introduced in 1987, ahD has never received ample recognition in the literature, mainly due to the lack of efficient algorithms for its computation.
With recent progress on the computational front, however, ahD exhibits the potential for developing viable nonparametric statistical techniques for directional datasets. In this paper, we thoroughly treat the theoretical properties of ahD. We show that, similarly to the classical halfspace depth for multivariate data, ahD also satisfies many desirable properties of a statistical depth function. Further, we derive uniform continuity/consistency results for the associated set of directional medians, and the central regions of ahD, the latter representing a depth-based analogue of the quantiles for directional data."}, "https://arxiv.org/abs/2402.08602": {"title": "Globally-Optimal Greedy Experiment Selection for Active Sequential Estimation", "link": "https://arxiv.org/abs/2402.08602", "description": "Motivated by modern applications such as computerized adaptive testing, sequential rank aggregation, and heterogeneous data source selection, we study the problem of active sequential estimation, which involves adaptively selecting experiments for sequentially collected data. The goal is to design experiment selection rules for more accurate model estimation. Greedy information-based experiment selection methods, optimizing the information gain one step ahead, have been employed in practice thanks to their computational convenience, flexibility to context or task changes, and broad applicability. However, statistical analysis is restricted to one-dimensional cases due to the problem's combinatorial nature and the seemingly limited capacity of greedy algorithms, leaving the multidimensional problem open.\n In this study, we close the gap for multidimensional problems. In particular, we propose adopting a class of greedy experiment selection methods and provide statistical analysis for the maximum likelihood estimator following these selection rules. This class encompasses existing methods and introduces new methods with improved numerical efficiency. We prove that these methods produce consistent and asymptotically normal estimators. Additionally, within a decision theory framework, we establish that the proposed methods achieve asymptotic optimality when the risk measure aligns with the selection rule. We also conduct extensive numerical studies on both simulated and real data to illustrate the efficacy of the proposed methods.\n From a technical perspective, we devise new analytical tools to address theoretical challenges. These analytical tools are of independent theoretical interest and may be reused in related problems involving stochastic approximation and sequential designs."}, "https://arxiv.org/abs/2402.08616": {"title": "Adjustment Identification Distance: A gadjid for Causal Structure Learning", "link": "https://arxiv.org/abs/2402.08616", "description": "Evaluating graphs learned by causal discovery algorithms is difficult: The number of edges that differ between two graphs does not reflect how the graphs differ with respect to the identifying formulas they suggest for causal effects. We introduce a framework for developing causal distances between graphs which includes the structural intervention distance for directed acyclic graphs as a special case. We use this framework to develop improved adjustment-based distances as well as extensions to completed partially directed acyclic graphs and causal orders. We develop polynomial-time reachability algorithms to compute the distances efficiently.
In our package gadjid (open source at https://github.com/CausalDisco/gadjid), we provide implementations of our distances; they are orders of magnitude faster than the structural intervention distance and thereby provide a success metric for causal discovery that scales to graph sizes that were previously prohibitive."}, "https://arxiv.org/abs/2402.08672": {"title": "Model Assessment and Selection under Temporal Distribution Shift", "link": "https://arxiv.org/abs/2402.08672", "description": "We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidate models by estimating the difference of their generalization errors. We further integrate pairwise comparisons into a single-elimination tournament, achieving near-optimal model selection from a collection of candidates. Theoretical analyses and numerical experiments demonstrate the adaptivity of our proposed methods to the non-stationarity in data."}, "https://arxiv.org/abs/1711.00564": {"title": "Sophisticated and small versus simple and sizeable: When does it pay off to introduce drifting coefficients in Bayesian VARs?", "link": "https://arxiv.org/abs/1711.00564", "description": "We assess the relationship between model size and complexity in the time-varying parameter VAR framework via thorough predictive exercises for the Euro Area, the United Kingdom and the United States. It turns out that sophisticated dynamics through drifting coefficients are important in small data sets, while simpler models tend to perform better in sizeable data sets. To combine the best of both worlds, novel shrinkage priors help to mitigate the curse of dimensionality, resulting in competitive forecasts for all scenarios considered. Furthermore, we discuss dynamic model selection to improve upon the best performing individual model for each point in time."}, "https://arxiv.org/abs/2204.02346": {"title": "Finitely Heterogeneous Treatment Effect in Event-study", "link": "https://arxiv.org/abs/2204.02346", "description": "The key assumption of the differences-in-differences approach in the event-study design is that untreated potential outcome differences are mean independent of treatment timing: the parallel trend assumption. In this paper, we relax the parallel trend assumption by assuming a latent type variable and developing a type-specific parallel trend assumption. With a finite support assumption on the latent type variable, we show that an extremum classifier consistently estimates the type assignment. Based on the classification result, we propose a type-specific diff-in-diff estimator for type-specific CATT. By estimating the CATT with regard to the latent type, we study heterogeneity in treatment effect, in addition to heterogeneity in baseline outcomes."}, "https://arxiv.org/abs/2209.14846": {"title": "Modeling and Learning on High-Dimensional Matrix-Variate Sequences", "link": "https://arxiv.org/abs/2209.14846", "description": "We propose a new matrix factor model, named RaDFaM, which is strictly derived based on the general rank decomposition and assumes a structure of a high-dimensional vector factor model for each basis vector. 
RaDFaM contributes a novel class of low-rank latent structure that makes a tradeoff between signal intensity and dimension reduction from the perspective of tensor subspace. Based on the intrinsic separable covariance structure of RaDFaM, for a collection of matrix-valued observations, we derive a new class of PCA variants for estimating loading matrices, and sequentially the latent factor matrices. The peak signal-to-noise ratio of RaDFaM is proved to be superior within the class of PCA-type estimators. We also establish the asymptotic theory, including consistency, convergence rates, and asymptotic distributions, for components in the signal part. Numerically, we demonstrate the performance of RaDFaM in applications such as matrix reconstruction, supervised learning, and clustering, on uncorrelated and correlated data, respectively."}, "https://arxiv.org/abs/2212.12822": {"title": "Simultaneous false discovery proportion bounds via knockoffs and closed testing", "link": "https://arxiv.org/abs/2212.12822", "description": "We propose new methods to obtain simultaneous false discovery proportion bounds for knockoff-based approaches. We first investigate an approach based on Janson and Su's $k$-familywise error rate control method and interpolation. We then generalize it by considering a collection of $k$ values, and show that the bound of Katsevich and Ramdas is a special case of this method and can be uniformly improved. Next, we further generalize the method by using closed testing with a multi-weighted-sum local test statistic. This allows us to obtain a further uniform improvement and other generalizations over previous methods. We also develop an efficient shortcut for its implementation. We compare the performance of our proposed methods in simulations and apply them to a data set from the UK Biobank."}, "https://arxiv.org/abs/2305.19749": {"title": "Forecasting high-dimensional functional time series: Application to sub-national age-specific mortality", "link": "https://arxiv.org/abs/2305.19749", "description": "We study the modeling and forecasting of high-dimensional functional time series (HDFTS), which can be cross-sectionally correlated and temporally dependent. We introduce a decomposition of the HDFTS into two distinct components: a deterministic component and a residual component that varies over time. The decomposition is derived through the estimation of two-way functional analysis of variance. A functional time series forecasting method, based on functional principal component analysis, is implemented to produce forecasts for the residual component. By combining the forecasts of the residual component with the deterministic component, we obtain forecast curves for multiple populations. We apply the model to age- and sex-specific mortality rates in the United States, France, and Japan, in which there are 51 states, 95 departments, and 47 prefectures, respectively. The proposed method is capable of delivering more accurate point and interval forecasts in forecasting multi-population mortality than several benchmark methods considered."}, "https://arxiv.org/abs/2306.01521": {"title": "Uncertainty Quantification in Bayesian Reduced-Rank Sparse Regressions", "link": "https://arxiv.org/abs/2306.01521", "description": "Reduced-rank regression recognises the possibility of a rank-deficient matrix of coefficients.
We propose a novel Bayesian model for estimating the rank of the coefficient matrix, which obviates the need for post-processing steps and allows for uncertainty quantification. Our method employs a mixture prior on the regression coefficient matrix along with a global-local shrinkage prior on its low-rank decomposition. Then, we rely on the Signal Adaptive Variable Selector to perform sparsification and define two novel tools: the Posterior Inclusion Probability uncertainty index and the Relevance Index. The validity of the method is assessed in a simulation study, and then its advantages and usefulness are shown in real-data applications on the chemical composition of tobacco and on the photometry of galaxies."}, "https://arxiv.org/abs/2306.04177": {"title": "Semiparametric Efficiency Gains From Parametric Restrictions on Propensity Scores", "link": "https://arxiv.org/abs/2306.04177", "description": "We explore how much knowing a parametric restriction on propensity scores improves semiparametric efficiency bounds in the potential outcome framework. For stratified propensity scores, considered as a parametric model, we derive explicit formulas for the efficiency gain from knowing how the covariate space is split. Based on these, we find that the efficiency gain decreases as the partition of the stratification becomes finer. For general parametric models, where it is hard to obtain explicit representations of efficiency bounds, we propose a novel framework that enables us to see whether knowing a parametric model is valuable in terms of efficiency even when it is very high-dimensional. In addition to the intuitive fact that knowing the parametric model does not help much if it is sufficiently flexible, we reveal that the efficiency gain can be nearly zero even though the parametric assumption significantly restricts the space of possible propensity scores."}, "https://arxiv.org/abs/2311.04696": {"title": "Quantification and cross-fitting inference of asymmetric relations under generative exposure mapping models", "link": "https://arxiv.org/abs/2311.04696", "description": "In many practical studies, learning directionality between a pair of variables is of great interest while notoriously hard when their underlying relation is nonlinear. This paper presents a method that examines asymmetry in exposure-outcome pairs when a priori assumptions about their relative ordering are unavailable. Our approach utilizes a framework of generative exposure mapping (GEM) to study asymmetric relations in continuous exposure-outcome pairs, through which we can capture distributional asymmetries with no prefixed variable ordering. We propose a coefficient of asymmetry to quantify relational asymmetry using Shannon's entropy analytics as well as statistical estimation and inference for such an estimand of directionality. Large-sample theoretical guarantees are established for cross-fitting inference techniques. 
The proposed methodology is extended to allow both measured confounders and contamination in outcome measurements, which is evaluated through extensive simulation studies and real data applications."}, "https://arxiv.org/abs/2304.08242": {"title": "The Deep Latent Position Topic Model for Clustering and Representation of Networks with Textual Edges", "link": "https://arxiv.org/abs/2304.08242", "description": "Interactions in which users share textual content published by others are naturally represented by a network where the individuals are associated with the nodes and the exchanged texts with the edges. To understand those heterogeneous and complex data structures, clustering nodes into homogeneous groups as well as rendering a comprehensible visualisation of the data is essential. To address both issues, we introduce Deep-LPTM, a model-based clustering strategy relying on a variational graph auto-encoder approach as well as a probabilistic model to characterise the topics of discussion. Deep-LPTM allows us to build a joint representation of the nodes and of the edges in two embedding spaces. The parameters are inferred using a variational inference algorithm. We also introduce IC2L, a model selection criterion specifically designed to choose models with relevant clustering and visualisation properties. An extensive benchmark study on synthetic data is provided. In particular, we find that Deep-LPTM better recovers the partitions of the nodes than the state-of-the-art ETSBM and STBM. Finally, the emails of the Enron company are analysed and visualisations of the results are presented, with meaningful highlights of the graph structure."}, "https://arxiv.org/abs/2310.17816": {"title": "Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs", "link": "https://arxiv.org/abs/2310.17816", "description": "Causal discovery is crucial for causal inference in observational studies: it can enable the identification of valid adjustment sets (VAS) for unbiased effect estimation. However, global causal discovery is notoriously hard in the nonparametric setting, with exponential time and sample complexity in the worst case. To address this, we propose local discovery by partitioning (LDP), a novel nonparametric local discovery algorithm that is tailored for downstream inference tasks while avoiding the pretreatment assumption. LDP is a constraint-based procedure that partitions variables into subsets defined solely by their causal relation to an exposure-outcome pair. Further, LDP returns a VAS for the exposure-outcome pair under causal insufficiency and mild sufficient conditions. The total number of independence tests is worst-case quadratic in the variable count. Asymptotic theoretical guarantees are numerically validated on synthetic graphs. Adjustment sets from LDP yield less biased and more precise average treatment effect estimates than baseline discovery algorithms, with LDP outperforming on confounder recall, runtime, and test count for VAS discovery. Further, LDP ran at least 1300x faster than baselines on a benchmark."}, "https://arxiv.org/abs/2311.06840": {"title": "Omitted Labels in Causality: A Study of Paradoxes", "link": "https://arxiv.org/abs/2311.06840", "description": "We explore what we call ``omitted label contexts,'' in which training data is limited to a subset of the possible labels. This setting is common among specialized human experts or specific focused studies.
We lean on well-studied paradoxes (Simpson's and Condorcet's) to illustrate the more general difficulties of causal inference in omitted label contexts. Contrary to the fundamental principles on which much of causal inference is built, we show that ``correct'' adjustments sometimes require non-exchangeable treatment and control groups. These pitfalls lead us to study networks of conclusions drawn from different contexts and the structures they form, proving an interesting connection between these networks and social choice theory."}, "https://arxiv.org/abs/2402.08707": {"title": "Data Analytics for Intermodal Freight Transportation Applications", "link": "https://arxiv.org/abs/2402.08707", "description": "arXiv:2402.08707v1 Announce Type: new\nAbstract: With the growth of intermodal freight transportation, it is important that transportation planners and decision makers are knowledgeable about freight flow data to make informed decisions. This is particularly true with Intelligent Transportation Systems (ITS) offering new capabilities to intermodal freight transportation. Specifically, ITS enables access to multiple different data sources, but they have different formats, resolution, and time scales. Thus, knowledge of data science is essential to be successful in future ITS-enabled intermodal freight transportation systems. This chapter discusses the commonly used descriptive and predictive data analytic techniques in intermodal freight transportation applications. These techniques cover the entire spectrum of univariate, bivariate, and multivariate analyses. In addition to illustrating how to apply these techniques through relatively simple examples, this chapter will also show how to apply them using the statistical software R. Additional exercises are provided for those who wish to apply the described techniques to more complex problems."}, "https://arxiv.org/abs/2402.08828": {"title": "Fusing Individualized Treatment Rules Using Secondary Outcomes", "link": "https://arxiv.org/abs/2402.08828", "description": "arXiv:2402.08828v1 Announce Type: new\nAbstract: An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their individual feature variables. In many practices, the ideal ITR for the primary outcome is also expected to cause minimal harm to other secondary outcomes. Therefore, our objective is to learn an ITR that not only maximizes the value function for the primary outcome, but also approximates the optimal rule for the secondary outcomes as closely as possible. To achieve this goal, we introduce a fusion penalty to encourage the ITRs based on different outcomes to yield similar recommendations. Two algorithms are proposed to estimate the ITR using surrogate loss functions. We prove that the agreement rate between the estimated ITR of the primary outcome and the optimal ITRs of the secondary outcomes converges to the true agreement rate faster than if the secondary outcomes are not taken into consideration. Furthermore, we derive the non-asymptotic properties of the value function and misclassification rate for the proposed method.
Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method."}, "https://arxiv.org/abs/2402.08873": {"title": "Balancing Method for Non-monotone Missing Data", "link": "https://arxiv.org/abs/2402.08873", "description": "arXiv:2402.08873v1 Announce Type: new\nAbstract: Covariate balancing methods have been widely applied to single or monotone missing patterns and have certain advantages over likelihood-based methods and inverse probability weighting approaches based on standard logistic regression. In this paper, we consider non-monotone missing data under the complete-case missing variable condition (CCMV), which is a case of missing not at random (MNAR). Using relationships between each missing pattern and the complete-case subsample, a weighted estimator can be used for estimation, where the weight is a sum of ratios of conditional probability of observing a particular missing pattern versus that of observing the complete-case pattern, given the variables observed in the corresponding missing pattern. Plug-in estimators of the propensity ratios, however, can be unbounded and lead to unstable estimation. Using further relations between propensity ratios and balancing of moments across missing patterns, we employ tailored loss functions each encouraging empirical balance across patterns to estimate propensity ratios flexibly using functional basis expansion. We propose two penalizations to separately control propensity ratio model complexity and covariate imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions. Asymptotic normality and efficiency are also developed. Numerical simulation results show that the proposed method achieves smaller mean squared errors than other methods."}, "https://arxiv.org/abs/2402.08877": {"title": "Computational Considerations for the Linear Model of Coregionalization", "link": "https://arxiv.org/abs/2402.08877", "description": "arXiv:2402.08877v1 Announce Type: new\nAbstract: In the last two decades, the linear model of coregionalization (LMC) has been widely used to model multivariate spatial processes. From a computational standpoint, the LMC is a substantially easier model to work with than other multidimensional alternatives. Up to now, this fact has been largely overlooked in the literature. Starting from an analogy with matrix normal models, we propose a reformulation of the LMC likelihood that highlights the linear, rather than cubic, computational complexity as a function of the dimension of the response vector. Further, we describe in detail how those simplifications can be included in a Gaussian hierarchical model. In addition, we demonstrate in two examples how the disentangled version of the likelihood we derive can be exploited to improve Markov chain Monte Carlo (MCMC) based computations when conducting Bayesian inference. The first is an interwoven approach that combines samples from centered and whitened parametrizations of the latent LMC distributed random fields. The second is a sparsity-inducing method that introduces structural zeros in the coregionalization matrix in an attempt to reduce the number of parameters in a principled way. It also provides a new way to investigate the strength of the correlation among the components of the outcome vector. 
Both approaches come at virtually no additional cost and are shown to significantly improve MCMC performance and predictive performance, respectively. We apply our methodology to a dataset comprising air pollutant measurements in the state of California."}, "https://arxiv.org/abs/2402.08879": {"title": "Inference for an Algorithmic Fairness-Accuracy Frontier", "link": "https://arxiv.org/abs/2402.08879", "description": "arXiv:2402.08879v1 Announce Type: new\nAbstract: Decision-making processes increasingly rely on the use of algorithms. Yet, algorithms' predictive ability frequently exhibits systematic variation across subgroups of the population. While both fairness and accuracy are desirable properties of an algorithm, they often come at the cost of one another. What should a fairness-minded policymaker do then, when confronted with finite data? In this paper, we provide a consistent estimator for a theoretical fairness-accuracy frontier put forward by Liang, Lu and Mu (2023) and propose inference methods to test hypotheses that have received much attention in the fairness literature, such as (i) whether fully excluding a covariate from use in training the algorithm is optimal and (ii) whether there are less discriminatory alternatives to an existing algorithm. We also provide an estimator for the distance between a given algorithm and the fairest point on the frontier, and characterize its asymptotic distribution. We leverage the fact that the fairness-accuracy frontier is part of the boundary of a convex set that can be fully represented by its support function. We show that the estimated support function converges to a tight Gaussian process as the sample size increases, and then express policy-relevant hypotheses as restrictions on the support function to construct valid test statistics."}, "https://arxiv.org/abs/2402.08941": {"title": "Local-Polynomial Estimation for Multivariate Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2402.08941", "description": "arXiv:2402.08941v1 Announce Type: new\nAbstract: We introduce a multivariate local-linear estimator for multivariate regression discontinuity designs in which treatment is assigned by crossing a boundary in the space of running variables. The dominant approach uses the Euclidean distance from a boundary point as the scalar running variable; hence, multivariate designs are handled as univariate designs. However, the distance running variable is incompatible with the assumption for asymptotic validity. We handle multivariate designs as multivariate. In this study, we develop a novel asymptotic normality result for multivariate local-polynomial estimators. Our estimator is asymptotically valid and can capture heterogeneous treatment effects over the boundary. We demonstrate the effectiveness of our estimator through numerical simulations. Our empirical illustration of a Colombian scholarship study reveals a richer heterogeneity (including its absence) of the treatment effect that is hidden in the original estimates."}, "https://arxiv.org/abs/2402.09033": {"title": "Cross-Temporal Forecast Reconciliation at Digital Platforms with Machine Learning", "link": "https://arxiv.org/abs/2402.09033", "description": "arXiv:2402.09033v1 Announce Type: new\nAbstract: Platform businesses operate on a digital core and their decision making requires high-dimensional accurate forecast streams at different levels of cross-sectional (e.g., geographical regions) and temporal aggregation (e.g., minutes to days).
It also necessitates coherent forecasts across all levels of the hierarchy to ensure aligned decision making across different planning units such as pricing, product, controlling and strategy. Given that platform data streams feature complex characteristics and interdependencies, we introduce a non-linear hierarchical forecast reconciliation method that produces cross-temporal reconciled forecasts in a direct and automated way through the use of popular machine learning methods. The method is sufficiently fast to allow forecast-based high-frequency decision making that platforms require. We empirically test our framework on a unique, large-scale streaming dataset from a leading on-demand delivery platform in Europe."}, "https://arxiv.org/abs/2402.09086": {"title": "Impact of Non-Informative Censoring on Propensity Score Based Estimation of Marginal Hazard Ratios", "link": "https://arxiv.org/abs/2402.09086", "description": "arXiv:2402.09086v1 Announce Type: new\nAbstract: In medical and epidemiological studies, one of the most common settings is studying the effect of a treatment on a time-to-event outcome, where the time-to-event might be censored before end of study. A common parameter of interest in such a setting is the marginal hazard ratio (MHR). When a study is based on observational data, propensity score (PS) based methods are often used, in an attempt to make the treatment groups comparable despite having a non-randomized treatment. Previous studies have shown censoring to be a factor that induces bias when using PS based estimators. In this paper we study the magnitude of the bias under different rates of non-informative censoring when estimating MHR using PS weighting or PS matching. A bias correction involving the probability of event is suggested and compared to conventional PS based methods."}, "https://arxiv.org/abs/2402.09332": {"title": "Nonparametric identification and efficient estimation of causal effects with instrumental variables", "link": "https://arxiv.org/abs/2402.09332", "description": "arXiv:2402.09332v1 Announce Type: new\nAbstract: Instrumental variables are widely used in econometrics and epidemiology for identifying and estimating causal effects when an exposure of interest is confounded by unmeasured factors. Despite this popularity, the assumptions invoked to justify the use of instruments differ substantially across the literature. Similarly, statistical approaches for estimating the resulting causal quantities vary considerably, and often rely on strong parametric assumptions. In this work, we compile and organize structural conditions that nonparametrically identify conditional average treatment effects, average treatment effects among the treated, and local average treatment effects, with a focus on identification formulae invoking the conditional Wald estimand. Moreover, we build upon existing work and propose nonparametric efficient estimators of functionals corresponding to marginal and conditional causal contrasts resulting from the various identification paradigms. 
We illustrate the proposed methods on an observational study examining the effects of operative care on adverse events for cholecystitis patients, and a randomized trial assessing the effects of market participation on political views."}, "https://arxiv.org/abs/2402.08845": {"title": "Feature Attribution with Necessity and Sufficiency via Dual-stage Perturbation Test for Causal Explanation", "link": "https://arxiv.org/abs/2402.08845", "description": "arXiv:2402.08845v1 Announce Type: cross\nAbstract: We investigate the problem of explainability in machine learning. To address this problem, Feature Attribution Methods (FAMs) measure the contribution of each feature through a perturbation test, where the difference in prediction is compared under different perturbations. However, such perturbation tests may not accurately distinguish the contributions of different features, when their change in prediction is the same after perturbation. In order to enhance the ability of FAMs to distinguish different features' contributions in this challenging setting, we propose to utilize the probability (PNS) that perturbing a feature is a necessary and sufficient cause for the prediction to change as a measure of feature importance. Our approach, Feature Attribution with Necessity and Sufficiency (FANS), computes the PNS via a perturbation test involving two stages (factual and interventional). In practice, to generate counterfactual samples, we use a resampling-based approach on the observed samples to approximate the required conditional distribution. Finally, we combine FANS and gradient-based optimization to extract the subset with the largest PNS. We demonstrate that FANS outperforms existing feature attribution methods on six benchmarks."}, "https://arxiv.org/abs/2402.09032": {"title": "Efficient spatial designs for targeted regions or level set detection", "link": "https://arxiv.org/abs/2402.09032", "description": "arXiv:2402.09032v1 Announce Type: cross\nAbstract: Acquiring information on spatial phenomena can be costly and time-consuming. In this context, to obtain reliable global knowledge, the choice of measurement location is a crucial issue. Space-filling designs are often used to control variability uniformly across the whole space. However, in a monitoring context, it is more relevant to focus on crucial regions, especially when dealing with sensitive areas such as the environment, climate or public health. It is therefore important to choose a relevant optimality criterion to build models adapted to the purpose of the experiment. In this article, we propose two new optimality criteria: the first aims to focus on areas where the response exceeds a given threshold, while the second is suitable for estimating sets of levels. We introduce several algorithms for constructing optimal designs. We also focus on cost-effective algorithms that produce non-optimal but efficient designs. For both sequential and non-sequential contexts, we compare our designs with existing ones through extensive simulation studies."}, "https://arxiv.org/abs/2402.09328": {"title": "Connecting Algorithmic Fairness to Quality Dimensions in Machine Learning in Official Statistics and Survey Production", "link": "https://arxiv.org/abs/2402.09328", "description": "arXiv:2402.09328v1 Announce Type: cross\nAbstract: National Statistical Organizations (NSOs) increasingly draw on Machine Learning (ML) to improve the timeliness and cost-effectiveness of their products.
When introducing ML solutions, NSOs must ensure that high standards with respect to robustness, reproducibility, and accuracy are upheld as codified, e.g., in the Quality Framework for Statistical Algorithms (QF4SA; Yung et al. 2022). At the same time, a growing body of research focuses on fairness as a pre-condition of a safe deployment of ML to prevent disparate social impacts in practice. However, fairness has not yet been explicitly discussed as a quality aspect in the context of the application of ML at NSOs. We employ Yung et al. (2022)'s QF4SA quality framework and present a mapping of its quality dimensions to algorithmic fairness. We thereby extend the QF4SA framework in several ways: we argue for fairness as its own quality dimension, we investigate the interaction of fairness with other dimensions, and we explicitly address data, both on its own and its interaction with applied methodology. In parallel with empirical illustrations, we show how our mapping can contribute to methodology in the domains of official statistics, algorithmic fairness, and trustworthy machine learning."}, "https://arxiv.org/abs/2402.09356": {"title": "On the Impact of Spatial Covariance Matrix Ordering on Tile Low-Rank Estimation of Mat\\'ern Parameters", "link": "https://arxiv.org/abs/2402.09356", "description": "arXiv:2402.09356v1 Announce Type: cross\nAbstract: Spatial statistical modeling and prediction involve generating and manipulating an n*n symmetric positive definite covariance matrix, where n denotes the number of spatial locations. However, when n is large, processing this covariance matrix using traditional methods becomes prohibitive. Thus, coupling parallel processing with approximation can be an elegant solution to this challenge by relying on parallel solvers that deal with the matrix as a set of small tiles instead of the full structure. Each processing unit can process a single tile, allowing better performance. The approximation can also be performed at the tile level for better compression and faster execution. The Tile Low-Rank (TLR) approximation, a tile-based approximation algorithm, has recently been used in spatial statistics applications. However, the quality of TLR algorithms mainly relies on ordering the matrix elements. This order can impact the compression quality and, therefore, the efficiency of the underlying linear solvers, which highly depends on the individual ranks of each tile. Thus, herein, we aim to investigate the accuracy and performance of some existing ordering algorithms that are used to order the geospatial locations before generating the spatial covariance matrix. Furthermore, we highlight the pros and cons of each ordering algorithm in the context of spatial statistics applications and give hints to practitioners on how to choose the ordering algorithm carefully. We assess the quality of the compression and the accuracy of the statistical parameter estimates of the Mat\\'ern covariance function using TLR approximation under various ordering algorithms and settings of correlations."}, "https://arxiv.org/abs/2110.14465": {"title": "Unbiased Statistical Estimation and Valid Confidence Intervals Under Differential Privacy", "link": "https://arxiv.org/abs/2110.14465", "description": "arXiv:2110.14465v2 Announce Type: replace\nAbstract: We present a method for producing unbiased parameter estimates and valid confidence intervals under the constraints of differential privacy, a formal framework for limiting individual information leakage from sensitive data. 
Prior work in this area is limited in that it is tailored to calculating confidence intervals for specific statistical procedures, such as mean estimation or simple linear regression. While other recent work can produce confidence intervals for more general sets of procedures, they either yield only approximately unbiased estimates, are designed for one-dimensional outputs, or assume significant user knowledge about the data-generating distribution. Our method induces distributions of mean and covariance estimates via the bag of little bootstraps (BLB) and uses them to privately estimate the parameters' sampling distribution via a generalized version of the CoinPress estimation algorithm. If the user can bound the parameters of the BLB-induced distributions and provide heavier-tailed families, the algorithm produces unbiased parameter estimates and valid confidence intervals which hold with arbitrarily high probability. These results hold in high dimensions and for any estimation procedure which behaves nicely under the bootstrap."}, "https://arxiv.org/abs/2212.00984": {"title": "Fully Data-driven Normalized and Exponentiated Kernel Density Estimator with Hyv\\\"arinen Score", "link": "https://arxiv.org/abs/2212.00984", "description": "arXiv:2212.00984v2 Announce Type: replace\nAbstract: We introduce a new approach to kernel density estimation using an exponentiated form of kernel density estimators. The density estimator has two hyperparameters flexibly controlling the smoothness of the resulting density. We tune them in a data-driven manner by minimizing an objective function based on the Hyv\\\"arinen score to avoid the optimization involving the intractable normalizing constant due to the exponentiation. We show the asymptotic properties of the proposed estimator and emphasize the importance of including the two hyperparameters for flexible density estimation. Our simulation studies and application to income data show that the proposed density estimator is appealing when the underlying density is multi-modal or observations contain outliers."}, "https://arxiv.org/abs/2301.11399": {"title": "Distributional outcome regression via quantile functions and its application to modelling continuously monitored heart rate and physical activity", "link": "https://arxiv.org/abs/2301.11399", "description": "arXiv:2301.11399v3 Announce Type: replace\nAbstract: Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations resulting in novel regression settings. Motivated by these modelling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients.
The method is motivated by and applied to the Actiheart component of the Baltimore Longitudinal Study of Aging, which collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults, to gain a deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for the daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights into the epidemiology of daily life heart rate reserve."}, "https://arxiv.org/abs/2302.11746": {"title": "Binary Regression and Classification with Covariates in Metric Spaces", "link": "https://arxiv.org/abs/2302.11746", "description": "arXiv:2302.11746v3 Announce Type: replace\nAbstract: Inspired by logistic regression, we introduce a regression model for data tuples consisting of a binary response and a set of covariates residing in a metric space without vector structures. Based on the proposed model we also develop a binary classifier for metric-space valued data. We propose a maximum likelihood estimator for the metric-space valued regression coefficient in the model, and provide upper bounds on the estimation error under various metric entropy conditions that quantify complexity of the underlying metric space. Matching lower bounds are derived for the important metric spaces commonly seen in statistics, establishing optimality of the proposed estimator in such spaces. Similarly, an upper bound on the excess risk of the developed classifier is provided for general metric spaces. A finer upper bound and a matching lower bound, and thus optimality of the proposed classifier, are established for Riemannian manifolds. To the best of our knowledge, the proposed regression model and the above minimax bounds are the first of their kind for analyzing a binary response with covariates residing in general metric spaces. We also investigate the numerical performance of the proposed estimator and classifier via simulation studies, and illustrate their practical merits via an application to task-related fMRI data."}, "https://arxiv.org/abs/2305.02685": {"title": "Testing for no effect in regression problems: a permutation approach", "link": "https://arxiv.org/abs/2305.02685", "description": "arXiv:2305.02685v2 Announce Type: replace\nAbstract: Often the question arises whether $Y$ can be predicted based on $X$ using a certain model. Especially for highly flexible models such as neural networks one may ask whether a seemingly good prediction is actually better than fitting pure noise or whether it has to be attributed to the flexibility of the model. This paper proposes a rigorous permutation test to assess whether the prediction is better than the prediction of pure noise. The test avoids any sample splitting and is based instead on generating new pairings of $(X_i, Y_j)$. It introduces a new formulation of the null hypothesis and rigorous justification for the test, which distinguishes it from previous literature. The theoretical findings are applied both to simulated data and to sensor data of tennis serves in an experimental context. The simulation study underscores how the available information affects the test.
It shows that the less informative the predictors, the lower the probability of rejecting the null hypothesis of fitting pure noise and emphasizes that detecting weaker dependence between variables requires a sufficient sample size."}, "https://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "https://arxiv.org/abs/2310.05539", "description": "arXiv:2310.05539v2 Announce Type: replace\nAbstract: In response to the unique challenge created by high-dimensional mediators in mediation analysis, this paper presents a novel procedure for testing the nullity of the mediation effect in the presence of high-dimensional mediators. The procedure incorporates two distinct features. Firstly, the test remains valid under all cases of the composite null hypothesis, including the challenging scenario where both exposure-mediator and mediator-outcome coefficients are zero. Secondly, it does not impose structural assumptions on the exposure-mediator coefficients, thereby allowing for an arbitrarily strong exposure-mediator relationship. To the best of our knowledge, the proposed test is the first of its kind to provably possess these two features in high-dimensional mediation analysis. The validity and consistency of the proposed test are established, and its numerical performance is showcased through simulation studies. The application of the proposed test is demonstrated by examining the mediation effect of DNA methylation between smoking status and lung cancer development."}, "https://arxiv.org/abs/2312.16307": {"title": "Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration", "link": "https://arxiv.org/abs/2312.16307", "description": "arXiv:2312.16307v2 Announce Type: replace\nAbstract: We consider the setting of synthetic control methods (SCMs), a canonical approach used to estimate the treatment effect on the treated in a panel data setting. We shed light on a frequently overlooked but ubiquitous assumption made in SCMs of \"overlap\": a treated unit can be written as some combination -- typically, convex or linear combination -- of the units that remain under control. We show that if units select their own interventions, and there is sufficiently large heterogeneity between units that prefer different interventions, overlap will not hold. We address this issue by proposing a framework which incentivizes units with different preferences to take interventions they would not normally consider. Specifically, leveraging tools from information design and online learning, we propose a SCM that incentivizes exploration in panel data settings by providing incentive-compatible intervention recommendations to units. We establish this estimator obtains valid counterfactual estimates without the need for an a priori overlap assumption. We extend our results to the setting of synthetic interventions, where the goal is to produce counterfactual outcomes under all interventions, not just control. Finally, we provide two hypothesis tests for determining whether unit overlap holds for a given panel dataset."}, "https://arxiv.org/abs/2211.11700": {"title": "High-Dimensional Undirected Graphical Models for Arbitrary Mixed Data", "link": "https://arxiv.org/abs/2211.11700", "description": "arXiv:2211.11700v2 Announce Type: replace-cross\nAbstract: Graphical models are an important tool in exploring relationships between variables in complex, multivariate data. 
Methods for learning such graphical models are well developed in the case where all variables are either continuous or discrete, including in high dimensions. However, in many applications data span variables of different types (e.g. continuous, count, binary, ordinal, etc.), whose principled joint analysis is nontrivial. Latent Gaussian copula models, in which all variables are modeled as transformations of underlying jointly Gaussian variables, represent a useful approach. Recent advances have shown how the binary-continuous case can be tackled, but the general mixed variable type regime remains challenging. In this work, we make the simple yet useful observation that classical ideas concerning polychoric and polyserial correlations can be leveraged in a latent Gaussian copula framework. Building on this observation we propose flexible and scalable methodology for data with variables of entirely general mixed type. We study the key properties of the approaches theoretically and empirically, via extensive simulations as well as an illustrative application to data from the UK Biobank concerning COVID-19 risk factors."}, "https://arxiv.org/abs/2303.15135": {"title": "Properties of the reconciled distributions for Gaussian and count forecasts", "link": "https://arxiv.org/abs/2303.15135", "description": "arXiv:2303.15135v2 Announce Type: replace-cross\nAbstract: Reconciliation enforces coherence between hierarchical forecasts, in order to satisfy a set of linear constraints. While most works focus on the reconciliation of the point forecasts, we consider probabilistic reconciliation and we analyze the properties of the distributions reconciled via conditioning. We provide a formal analysis of the variance of the reconciled distribution, treating separately the case of Gaussian forecasts and count forecasts. We also study the reconciled upper mean in the case of 1-level hierarchies; also in this case we analyze separately the case of Gaussian forecasts and count forecasts. We then show experiments on the reconciliation of intermittent time series related to the count of extreme market events. The experiments confirm our theoretical results and show that reconciliation largely improves the performance of probabilistic forecasting."}, "https://arxiv.org/abs/2305.15742": {"title": "Counterfactual Generative Models for Time-Varying Treatments", "link": "https://arxiv.org/abs/2305.15742", "description": "arXiv:2305.15742v3 Announce Type: replace-cross\nAbstract: Estimating the counterfactual outcome of treatment is essential for decision-making in public health and clinical science, among others. Often, treatments are administered in a sequential, time-varying manner, leading to an exponentially increased number of possible counterfactual outcomes. Furthermore, in modern applications, the outcomes are high-dimensional and conventional average treatment effect estimation fails to capture disparities in individuals. To tackle these challenges, we propose a novel conditional generative framework capable of producing counterfactual samples under time-varying treatment, without the need for explicit density estimation. Our method carefully addresses the distribution mismatch between the observed and counterfactual distributions via a loss function based on inverse probability re-weighting, and supports integration with state-of-the-art conditional generative models such as the guided diffusion and conditional variational autoencoder.
We present a thorough evaluation of our method using both synthetic and real-world data. Our results demonstrate that our method is capable of generating high-quality counterfactual samples and outperforms the state-of-the-art baselines."}, "https://arxiv.org/abs/2306.10816": {"title": "$\\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery", "link": "https://arxiv.org/abs/2306.10816", "description": "arXiv:2306.10816v2 Announce Type: replace-cross\nAbstract: Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and associated ground truth information to build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests in order to flexibly estimate and represent conditional distributions that may be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms."}, "https://arxiv.org/abs/2310.12115": {"title": "MMD-based Variable Importance for Distributional Random Forest", "link": "https://arxiv.org/abs/2310.12115", "description": "arXiv:2310.12115v2 Announce Type: replace-cross\nAbstract: Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions."}, "https://arxiv.org/abs/2401.03633": {"title": "A Spatial-statistical model to analyse historical rutting data", "link": "https://arxiv.org/abs/2401.03633", "description": "arXiv:2401.03633v2 Announce Type: replace-cross\nAbstract: Pavement rutting poses a significant challenge in flexible pavements, necessitating costly asphalt resurfacing. To address this issue comprehensively, we propose an advanced Bayesian hierarchical framework of latent Gaussian models with spatial components. 
Our model provides a thorough diagnostic analysis, pinpointing areas exhibiting unexpectedly high rutting rates. Incorporating spatial and random components, and important explanatory variables like annual average daily traffic (traffic intensity), pavement type, rut depth and lane width, our proposed models account for and estimate the influence of these variables on rutting. This approach not only quantifies uncertainties and discerns locations at the highest risk of requiring maintenance, but also uncovers spatial dependencies in rutting (millimetre/year). We apply our models to a data set spanning eleven years (2010-2020). Our findings emphasize the systematic unexpected spatial rutting effect, where more of the rutting variability is explained by including spatial components. Pavement type, in conjunction with traffic intensity, is also found to be the primary driver of rutting. Furthermore, the spatial dependencies uncovered reveal road sections experiencing more than 1 millimeter of rutting beyond annual expectations. This leads to a halving of the expected pavement lifespan in these areas. Our study offers valuable insights, presenting maps indicating expected rutting, and identifying locations with accelerated rutting rates, resulting in a reduction in pavement life expectancy of at least 10 years."}, "https://arxiv.org/abs/2402.09472": {"title": "Identifying Intended Effects with Causal Models", "link": "https://arxiv.org/abs/2402.09472", "description": "arXiv:2402.09472v1 Announce Type: new \nAbstract: The aim of this paper is to extend the framework of causal inference, in particular as it has been developed by Judea Pearl, in order to model actions and identify their intended effects, in the direction opened by Elisabeth Anscombe. We show how intentions can be inferred from a causal model and its implied correlations observable in data. The paper defines confounding effects as the reasons why teleological inference may fail and introduces interference as a way to control for them. The ''fundamental problem'' of teleological inference is presented, explaining why causal analysis needs an extension in order to take intentions into account."}, "https://arxiv.org/abs/2402.09583": {"title": "Horseshoe Priors for Sparse Dirichlet-Multinomial Models", "link": "https://arxiv.org/abs/2402.09583", "description": "arXiv:2402.09583v1 Announce Type: new \nAbstract: Bayesian inference for Dirichlet-Multinomial (DM) models has a long and important history. The concentration parameter $\\alpha$ is pivotal in smoothing category probabilities within the multinomial distribution and is crucial for the inference afterward. Due to the lack of a tractable form of its marginal likelihood, $\\alpha$ is often chosen ad-hoc, or estimated using approximation algorithms. A constant $\\alpha$ often leads to inadequate smoothing of probabilities, particularly for sparse compositional count datasets. In this paper, we introduce a novel class of prior distributions facilitating conjugate updating of the concentration parameter, allowing for full Bayesian inference for DM models. Our methodology is based on fast residue computation and admits closed-form posterior moments in specific scenarios. Additionally, our prior provides continuous shrinkage with its heavy tail and substantial mass around zero, ensuring adaptability to the sparsity or quasi-sparsity of the data. We demonstrate the usefulness of our approach on both simulated examples and on a real-world human microbiome dataset.
Finally, we conclude with directions for future research."}, "https://arxiv.org/abs/2402.09698": {"title": "Combining Evidence Across Filtrations", "link": "https://arxiv.org/abs/2402.09698", "description": "arXiv:2402.09698v1 Announce Type: new \nAbstract: In anytime-valid sequential inference, it is known that any admissible inference procedure must be based on test martingales and their composite generalization, called e-processes, which are nonnegative processes whose expectation at any arbitrary stopping time is upper-bounded by one. An e-process quantifies the accumulated evidence against a composite null hypothesis over a sequence of outcomes. This paper studies methods for combining e-processes that are computed using different information sets, i.e., filtrations, for a null hypothesis. Even though e-processes constructed on the same filtration can be combined effortlessly (e.g., by averaging), e-processes constructed on different filtrations cannot be combined as easily because their validity in a coarser filtration does not translate to validity in a finer filtration. We discuss three concrete examples of such e-processes in the literature: exchangeability tests, independence tests, and tests for evaluating and comparing forecasts with lags. Our main result establishes that these e-processes can be lifted into any finer filtration using adjusters, which are functions that allow betting on the running maximum of the accumulated wealth (thereby insuring against the loss of evidence). We also develop randomized adjusters that can improve the power of the resulting sequential inference procedure."}, "https://arxiv.org/abs/2402.09744": {"title": "Quantile Granger Causality in the Presence of Instability", "link": "https://arxiv.org/abs/2402.09744", "description": "arXiv:2402.09744v1 Announce Type: new \nAbstract: We propose a new framework for assessing Granger causality in quantiles in unstable environments, for a fixed quantile or over a continuum of quantile levels. Our proposed test statistics are consistent against fixed alternatives, they have nontrivial power against local alternatives, and they are pivotal in certain important special cases. In addition, we show the validity of a bootstrap procedure when asymptotic distributions depend on nuisance parameters. Monte Carlo simulations reveal that the proposed test statistics have correct empirical size and high power, even in absence of structural breaks. Finally, two empirical applications in energy economics and macroeconomics highlight the applicability of our method as the new tests provide stronger evidence of Granger causality."}, "https://arxiv.org/abs/2402.09758": {"title": "Extrapolation-Aware Nonparametric Statistical Inference", "link": "https://arxiv.org/abs/2402.09758", "description": "arXiv:2402.09758v1 Announce Type: new \nAbstract: We define extrapolation as any type of statistical inference on a conditional function (e.g., a conditional expectation or conditional quantile) evaluated outside of the support of the conditioning variable. This type of extrapolation occurs in many data analysis applications and can invalidate the resulting conclusions if not taken into account. While extrapolating is straightforward in parametric models, it becomes challenging in nonparametric models. 
In this work, we extend the nonparametric statistical model to explicitly allow for extrapolation and introduce a class of extrapolation assumptions that can be combined with existing inference techniques to draw extrapolation-aware conclusions. The proposed class of extrapolation assumptions stipulates that the conditional function attains its minimal and maximal directional derivative, in each direction, within the observed support. We illustrate how the framework applies to several statistical applications including prediction and uncertainty quantification. We furthermore propose a consistent estimation procedure that can be used to adjust existing nonparametric estimates to account for extrapolation by providing lower and upper extrapolation bounds. The procedure is empirically evaluated on both simulated and real-world data."}, "https://arxiv.org/abs/2402.09788": {"title": "An extension of sine-skewed circular distributions", "link": "https://arxiv.org/abs/2402.09788", "description": "arXiv:2402.09788v1 Announce Type: new \nAbstract: Sine-skewed circular distributions are identifiable and have easily-computable trigonometric moments and a simple random number generation algorithm, whereas they are known to have relatively low levels of asymmetry. This study proposes a new family of circular distributions that can be skewed more significantly than existing models. It is shown that a subfamily of the proposed distributions is identifiable with respect to parameters and all distributions in the subfamily have explicit trigonometric moments and a simple random number generation algorithm. The maximum likelihood estimation for model parameters is considered and its finite-sample performance is investigated by numerical simulations. Some real data applications are illustrated for practical purposes."}, "https://arxiv.org/abs/2402.09789": {"title": "Identification with Posterior-Separable Information Costs", "link": "https://arxiv.org/abs/2402.09789", "description": "arXiv:2402.09789v1 Announce Type: new \nAbstract: I provide a model of rational inattention with heterogeneity and prove it is observationally equivalent to a state-dependent stochastic choice model subject to attention costs. I demonstrate that additive separability of unobservable heterogeneity, together with an independence assumption, suffices for the empirical model to admit a representative agent. Using conditional probabilities, I show how to identify: how covariates affect the desirability of goods, (a measure of) welfare, factual changes in welfare, and bounds on counterfactual market shares."}, "https://arxiv.org/abs/2402.09837": {"title": "Conjugacy properties of multivariate unified skew-elliptical distributions", "link": "https://arxiv.org/abs/2402.09837", "description": "arXiv:2402.09837v1 Announce Type: new \nAbstract: The broad class of multivariate unified skew-normal (SUN) distributions has been recently shown to possess fundamental conjugacy properties. When used as priors for the vector of parameters in general probit, tobit, and multinomial probit models, these distributions yield posteriors that still belong to the SUN family. Although such a core result has led to important advancements in Bayesian inference and computation, its applicability beyond likelihoods associated with fully-observed, discretized, or censored realizations from multivariate Gaussian models remains yet unexplored.
This article fills this important gap by proving that the wider family of multivariate unified skew-elliptical (SUE) distributions, which extends SUNs to more general perturbations of elliptical densities, guarantees conjugacy for broader classes of models, beyond those relying on fully-observed, discretized or censored Gaussians. Such a result leverages the closure under linear combinations, conditioning and marginalization of SUE to prove that such a family is conjugate to the likelihood induced by general multivariate regression models for fully-observed, censored or dichotomized realizations from skew-elliptical distributions. This advancement substantially enlarges the set of models that enable conjugate Bayesian inference to general formulations arising from elliptical and skew-elliptical families, including the multivariate Student's t and skew-t, among others."}, "https://arxiv.org/abs/2402.09858": {"title": "An Approximation Based Theory of Linear Regression", "link": "https://arxiv.org/abs/2402.09858", "description": "arXiv:2402.09858v1 Announce Type: new \nAbstract: The goal of this paper is to provide a theory of linear regression based entirely on approximations. It will be argued that the standard linear regression model-based theory, whether frequentist or Bayesian, has failed and that this failure is due to an 'assumed (revealed?) truth' (John Tukey) attitude to the models. This is reflected in the language of statistical inference which involves a concept of truth, for example efficiency, consistency and hypothesis testing. The motivation behind this paper was to remove the word `true' from the theory and practice of linear regression and to replace it by approximation. The approximations considered are the least squares approximations. An approximation is called valid if it contains no irrelevant covariates. This is operationalized using the concept of a Gaussian P-value which is the probability that pure Gaussian noise is better in terms of least squares than the covariate. The precise definition given in the paper is intuitive and requires only four simple equations. Given this, a valid approximation is one where all the Gaussian P-values are less than a threshold $p0$ specified by the statistician, in this paper with the default value 0.01. This approximation-based approach is not only much simpler, it is overwhelmingly better than the standard model-based approach. This will be demonstrated using six real data sets, four from high dimensional regression and two from vector autoregression. Both the simplicity and the superiority of Gaussian P-values derive from their universal exactness and validity. This is in complete contrast to standard F P-values which are valid only for carefully designed simulations.\n The paper contains excerpts from an unpublished paper by John Tukey entitled `Issues relevant to an honest account of data-based inference partially in the light of Laurie Davies's paper'."}, "https://arxiv.org/abs/2402.09888": {"title": "Multinomial mixture for spatial data", "link": "https://arxiv.org/abs/2402.09888", "description": "arXiv:2402.09888v1 Announce Type: new \nAbstract: The purpose of this paper is to extend standard finite mixture models in the context of multinomial mixtures for spatial data, in order to cluster geographical units according to demographic characteristics. The spatial information is incorporated into the model through the mixing probabilities of each component. To be more specific, a Gibbs distribution is assumed for the prior probabilities.
In this way, the assignment of each observation is affected by its neighbors' clusters and spatial dependence is included in the model. Estimation is based on a modified EM algorithm which is enriched by an extra, initial step for approximating the field. The simulated field algorithm is used in this initial step. The presented model will be used for clustering municipalities of Attica with respect to the age distribution of residents."}, "https://arxiv.org/abs/2402.09895": {"title": "Spatial Data Analysis", "link": "https://arxiv.org/abs/2402.09895", "description": "arXiv:2402.09895v1 Announce Type: new \nAbstract: This handbook chapter provides an essential introduction to the field of spatial econometrics, offering a comprehensive overview of techniques and methodologies for analysing spatial data in the social sciences. Spatial econometrics addresses the unique challenges posed by spatially dependent observations, where spatial relationships among data points can significantly impact statistical analyses. The chapter begins by exploring the fundamental concepts of spatial dependence and spatial autocorrelation, and highlighting their implications for traditional econometric models. It then introduces a range of spatial econometric models, particularly spatial lag, spatial error, and spatial lag of X models, illustrating how these models accommodate spatial relationships and yield accurate and insightful results about the underlying spatial processes. The chapter provides an intuitive understanding of how these models compare to each other. A practical example on London house prices demonstrates the application of spatial econometrics, emphasising its relevance in uncovering hidden spatial patterns, addressing endogeneity, and providing robust estimates in the presence of spatial dependence."}, "https://arxiv.org/abs/2402.09928": {"title": "When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators", "link": "https://arxiv.org/abs/2402.09928", "description": "arXiv:2402.09928v1 Announce Type: new \nAbstract: The conventional Two-Way Fixed-Effects (TWFE) estimator has been under strain lately. Recent literature has revealed potential shortcomings of TWFE when the treatment effects are heterogeneous. Scholars have developed new advanced dynamic Difference-in-Differences (DiD) estimators to tackle these potential shortcomings. However, confusion remains in applied research as to when the conventional TWFE is biased and what issues the novel estimators can and cannot address. In this study, we first provide an intuitive explanation of the problems of TWFE and elucidate the key features of the novel alternative DiD estimators. We then systematically demonstrate the conditions under which the conventional TWFE is inconsistent. We employ Monte Carlo simulations to assess the performance of dynamic DiD estimators under violations of key assumptions, which likely happen in applied cases. While the new dynamic DiD estimators offer notable advantages in capturing heterogeneous treatment effects, we show that the conventional TWFE performs generally well if the model specifies an event-time function. All estimators are equally sensitive to violations of the parallel trends assumption, anticipation effects or violations of time-varying exogeneity. Despite their advantages, the new dynamic DiD estimators tackle a very specific problem and they do not serve as a universal remedy for violations of the most critical assumptions.
We finally derive, based on our simulations, recommendations for how and when to use TWFE and the new DiD estimators in applied research."}, "https://arxiv.org/abs/2402.09938": {"title": "Optimal Bayesian stepped-wedge cluster randomised trial designs for binary outcome data", "link": "https://arxiv.org/abs/2402.09938", "description": "arXiv:2402.09938v1 Announce Type: new \nAbstract: Under a generalised estimating equation analysis approach, approximate design theory is used to determine Bayesian D-optimal designs. For two examples, considering simple exchangeable and exponential decay correlation structures, we compare the efficiency of identified optimal designs to balanced stepped-wedge designs and corresponding stepped-wedge designs determined by optimising using a normal approximation approach. The dependence of the Bayesian D-optimal designs on the assumed correlation structure is explored; for the considered settings, smaller decay in the correlation between outcomes across time periods, along with larger values of the intra-cluster correlation, leads to designs closer to a balanced design being optimal. Unlike for normal data, it is shown that the optimal design need not be centro-symmetric in the binary outcome case. The efficiency of the Bayesian D-optimal design relative to a balanced design can be large, but situations are demonstrated in which the advantages are small. Similarly, the optimal design from a normal approximation approach is often not much less efficient than the Bayesian D-optimal design. Bayesian D-optimal designs can be readily identified for stepped-wedge cluster randomised trials with binary outcome data. In certain circumstances, principally ones with strong time period effects, they will indicate that a design unlikely to have been identified by previous methods may be substantially more efficient. However, they require a larger number of assumptions than existing optimal designs, and in many situations existing theory under a normal approximation will provide an easier means of identifying an efficient design for binary outcome data."}, "https://arxiv.org/abs/2402.10156": {"title": "Empirically assessing the plausibility of unconfoundedness in observational studies", "link": "https://arxiv.org/abs/2402.10156", "description": "arXiv:2402.10156v1 Announce Type: new \nAbstract: The possibility of unmeasured confounding is one of the main limitations for causal inference from observational studies. There are different methods for partially empirically assessing the plausibility of unconfoundedness. However, most currently available methods require (at least partial) assumptions about the confounding structure, which may be difficult to know in practice. In this paper we describe a simple strategy for empirically assessing the plausibility of conditional unconfoundedness (i.e., whether the candidate set of covariates suffices for confounding adjustment) which does not require any assumptions about the confounding structure, requiring instead assumptions related to temporal ordering between covariates, exposure and outcome (which can be guaranteed by design), measurement error and selection into the study. The proposed method essentially relies on testing the association between a subset of covariates (those associated with the exposure given all other covariates) and the outcome conditional on the remaining covariates and the exposure. 
We describe the assumptions underlying the method, provide proofs, use simulations to corroborate the theory and illustrate the method with an applied example assessing the causal effect of length-for-age measured in childhood on intelligence quotient measured in adulthood using data from the 1982 Pelotas (Brazil) birth cohort. We also discuss the implications of measurement error and some important limitations."}, "https://arxiv.org/abs/2106.05024": {"title": "Contamination Bias in Linear Regressions", "link": "https://arxiv.org/abs/2106.05024", "description": "arXiv:2106.05024v4 Announce Type: replace \nAbstract: We study regressions with multiple treatments and a set of controls that is flexible enough to purge omitted variable bias. We show that these regressions generally fail to estimate convex averages of heterogeneous treatment effects -- instead, estimates of each treatment's effect are contaminated by non-convex averages of the effects of other treatments. We discuss three estimation approaches that avoid such contamination bias, including the targeting of easiest-to-estimate weighted average effects. A re-analysis of nine empirical applications finds economically and statistically meaningful contamination bias in observational studies; contamination bias in experimental studies is more limited due to idiosyncratic effect heterogeneity."}, "https://arxiv.org/abs/2301.00277": {"title": "Higher-order Refinements of Small Bandwidth Asymptotics for Density-Weighted Average Derivative Estimators", "link": "https://arxiv.org/abs/2301.00277", "description": "arXiv:2301.00277v2 Announce Type: replace \nAbstract: The density weighted average derivative (DWAD) of a regression function is a canonical parameter of interest in economics. Classical first-order large sample distribution theory for kernel-based DWAD estimators relies on tuning parameter restrictions and model assumptions that imply an asymptotic linear representation of the point estimator. These conditions can be restrictive, and the resulting distributional approximation may not be representative of the actual sampling distribution of the statistic of interest. In particular, the approximation is not robust to bandwidth choice. Small bandwidth asymptotics offers an alternative, more general distributional approximation for kernel-based DWAD estimators that allows for, but does not require, asymptotic linearity. The resulting inference procedures based on small bandwidth asymptotics were found to exhibit superior finite sample performance in simulations, but no formal theory justifying that empirical success is available in the literature. Employing Edgeworth expansions, this paper shows that small bandwidth asymptotic approximations lead to inference procedures with higher-order distributional properties that are demonstrably superior to those of procedures based on asymptotic linear approximations."}, "https://arxiv.org/abs/2301.04804": {"title": "A Generalized Estimating Equation Approach to Network Regression", "link": "https://arxiv.org/abs/2301.04804", "description": "arXiv:2301.04804v2 Announce Type: replace \nAbstract: Regression models applied to network data where node attributes are the dependent variables pose a methodological challenge. As has been well studied, naive regression neither properly accounts for community structure, nor does it account for the dependent variable acting as both model outcome and covariate. 
To address this methodological gap, we propose a network regression model motivated by the important observation that controlling for community structure can, when a network is modular, significantly account for meaningful correlation between observations induced by network connections. We propose a generalized estimating equation (GEE) approach to learn model parameters based on clusters defined through any single-membership community detection algorithm applied to the observed network. We provide a necessary condition on the network size and edge formation probabilities to establish the asymptotic normality of the model parameters under the assumption that the graph structure is a stochastic block model. We evaluate the performance of our approach through simulations and apply it to estimate the joint impact of baseline covariates and network effects on COVID-19 incidence rate among countries connected by a network of commercial airline traffic. We find that during the beginning of the pandemic the network effect has some influence, whereas after the travel ban was in effect the percentage of urban population has more influence on the incidence rate than the network effect."}, "https://arxiv.org/abs/2301.09754": {"title": "Prioritizing Variables for Observational Study Design using the Joint Variable Importance Plot", "link": "https://arxiv.org/abs/2301.09754", "description": "arXiv:2301.09754v3 Announce Type: replace \nAbstract: Observational studies of treatment effects require adjustment for confounding variables. However, causal inference methods typically cannot deliver perfect adjustment on all measured baseline variables, and there is often ambiguity about which variables should be prioritized. Standard prioritization methods based on treatment imbalance alone neglect variables' relationships with the outcome. We propose the joint variable importance plot to guide variable prioritization for observational studies. Since not all variables are equally relevant to the outcome, the plot adds outcome associations to quantify the potential confounding jointly with the standardized mean difference. To enhance comparisons on the plot between variables with different confounding relationships, we also derive and plot bias curves. Variable prioritization using the plot can produce recommended values for tuning parameters in many existing matching and weighting methods. We showcase the use of the joint variable importance plots in the design of a balance-constrained matched study to evaluate whether taking an antidiabetic medication, glyburide, increases the incidence of C-section delivery among pregnant individuals with gestational diabetes."}, "https://arxiv.org/abs/2302.11487": {"title": "Improving Model Choice in Classification: An Approach Based on Clustering of Covariance Matrices", "link": "https://arxiv.org/abs/2302.11487", "description": "arXiv:2302.11487v2 Announce Type: replace \nAbstract: This work introduces a refinement of the Parsimonious Model for fitting a Gaussian Mixture. The improvement is based on the consideration of clusters of the involved covariance matrices according to a criterion, such as sharing Principal Directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. We show that such groupings of covariance matrices can be achieved through simple modifications of the CEM (Classification Expectation Maximization) algorithm. 
Our approach leads us to propose Gaussian Mixture Models for model-based clustering and discriminant analysis, in which covariance matrices are clustered according to a parsimonious criterion, creating intermediate steps between the fourteen widely known parsimonious models. The added versatility not only allows us to obtain models with fewer parameters for fitting the data, but also provides greater interpretability. We show its usefulness for model-based clustering and discriminant analysis, providing algorithms to find approximate solutions satisfying suitable size, shape and orientation constraints, and applying them to both simulation and real data examples."}, "https://arxiv.org/abs/2308.07248": {"title": "Maintaining the validity of inference from linear mixed models in stepped-wedge cluster randomized trials under misspecified random-effects structures", "link": "https://arxiv.org/abs/2308.07248", "description": "arXiv:2308.07248v3 Announce Type: replace \nAbstract: Linear mixed models are commonly used in analyzing stepped-wedge cluster randomized trials (SW-CRTs). A key consideration for analyzing a SW-CRT is accounting for the potentially complex correlation structure, which can be achieved by specifying a random effects structure. Common random effects structures for a SW-CRT include random intercept, random cluster-by-period, and discrete-time decay. Recently, more complex structures, such as the random intervention structure, have been proposed. In practice, specifying appropriate random effects can be challenging. Robust variance estimators (RVE) may be applied to linear mixed models to provide consistent estimators of standard errors of fixed effect parameters in the presence of random-effects misspecification. However, there has been no empirical investigation of RVE for SW-CRT. In this paper, we first review five RVEs (both standard and small-sample bias-corrected RVEs) that are available for linear mixed models. We then describe a comprehensive simulation study to examine the performance of these RVEs for SW-CRTs with a continuous outcome under different data generators. For each data generator, we investigate whether the use of a RVE with either the random intercept model or the random cluster-by-period model is sufficient to provide valid statistical inference for fixed effect parameters, when these working models are subject to misspecification. Our results indicate that the random intercept and random cluster-by-period models with RVEs performed similarly. The CR3 RVE estimator, coupled with a degrees-of-freedom correction equal to the number of clusters minus two, consistently gave the best coverage results, but could be slightly anti-conservative when the number of clusters was below 16. We summarize the implications of our results for linear mixed model analysis of SW-CRTs in practice."}, "https://arxiv.org/abs/2308.12260": {"title": "Estimating Causal Effects for Binary Outcomes Using Per-Decision Inverse Probability Weighting", "link": "https://arxiv.org/abs/2308.12260", "description": "arXiv:2308.12260v2 Announce Type: replace \nAbstract: Micro-randomized trials are commonly conducted for optimizing mobile health interventions such as push notifications for behavior change. In analyzing such trials, causal excursion effects are often of primary interest, and their estimation typically involves inverse probability weighting (IPW). 
However, in a micro-randomized trial additional treatments can often occur during the time window over which an outcome is defined, and this can greatly inflate the variance of the causal effect estimator because IPW would involve a product of numerous weights. To reduce variance and improve estimation efficiency, we propose a new estimator using a modified version of IPW, which we call \"per-decision IPW\". It is applicable when the outcome is binary and can be expressed as the maximum of a series of sub-outcomes defined over sub-intervals of time. We establish the estimator's consistency and asymptotic normality. Through simulation studies and real data applications, we demonstrate substantial efficiency improvement of the proposed estimator over existing estimators (relative efficiency up to 1.45 and sample size savings up to 31% in realistic settings). The new estimator can be used to improve the precision of primary and secondary analyses for micro-randomized trials with binary outcomes."}, "https://arxiv.org/abs/2309.06988": {"title": "A basket trial design based on power priors", "link": "https://arxiv.org/abs/2309.06988", "description": "arXiv:2309.06988v2 Announce Type: replace \nAbstract: In basket trials a treatment is investigated in several subgroups. They are primarily used in oncology in early clinical phases as single-arm trials with a binary endpoint. For their analysis primarily Bayesian methods have been suggested, as they allow partial sharing of information based on the observed similarity between subgroups. Fujikawa et al. (2020) suggested an approach using empirical Bayes methods that allows flexible sharing based on easily interpretable weights derived from the Jensen-Shannon divergence between the subgroup-wise posterior distributions. We show that this design is closely related to the method of power priors and investigate several modifications of Fujikawa's design using methods from the power prior literature. While in Fujikawa's design, the amount of information that is shared between two baskets is only determined by their pairwise similarity, we also discuss extensions where the outcomes of all baskets are considered in the computation of the sharing weights. The results of our comparison study show that the power prior design has comparable performance to fully Bayesian designs in a range of different scenarios. At the same time, the power prior design is computationally cheap and even allows analytical computation of operating characteristics in some settings."}, "https://arxiv.org/abs/2202.00190": {"title": "Sketching stochastic valuation functions", "link": "https://arxiv.org/abs/2202.00190", "description": "arXiv:2202.00190v3 Announce Type: replace-cross \nAbstract: We consider the problem of sketching a set valuation function, which is defined as the expectation of a valuation function of independent random item values. We show that for monotone subadditive or submodular valuation functions satisfying a weak homogeneity condition, or certain other conditions, there exist discretized distributions of item values with $O(k\\log(k))$ support sizes that yield a sketch valuation function which is a constant-factor approximation, for any value query for a set of items of cardinality less than or equal to $k$. The discretized distributions can be efficiently computed by an algorithm for each item's value distribution separately. 
Our results hold under conditions that accommodate a wide range of valuation functions arising in applications, such as the value of a team corresponding to the best performance of a team member, constant elasticity of substitution production functions exhibiting diminishing returns used in economics and consumer theory, and others. Sketch valuation functions are particularly valuable for finding approximate solutions to optimization problems such as best set selection and welfare maximization. They enable computationally efficient evaluation of approximate value oracle queries and provide an approximation guarantee for the underlying optimization problem."}, "https://arxiv.org/abs/2305.16842": {"title": "Accounting statement analysis at industry level", "link": "https://arxiv.org/abs/2305.16842", "description": "arXiv:2305.16842v4 Announce Type: replace-cross \nAbstract: Compositional data are contemporarily defined as positive vectors, the ratios among whose elements are of interest to the researcher. Financial statement analysis by means of accounting ratios fulfils this definition to the letter. Compositional data analysis solves the major problems in statistical analysis of standard financial ratios at industry level, such as skewness, non-normality, non-linearity and dependence of the results on the choice of which accounting figure goes to the numerator and to the denominator of the ratio. In spite of this, compositional applications to financial statement analysis are still rare. In this article, we present some transformations within compositional data analysis that are particularly useful for financial statement analysis. We show how to compute industry or sub-industry means of standard financial ratios from a compositional perspective. We show how to visualise firms in an industry with a compositional biplot, to classify them with compositional cluster analysis and to relate financial and non-financial indicators with compositional regression models. We show an application to the accounting statements of Spanish wineries using DuPont analysis, and a step-by-step tutorial to the compositional freeware CoDaPack."}, "https://arxiv.org/abs/2402.10537": {"title": "Quantifying Individual Risk for Binary Outcome: Bounds and Inference", "link": "https://arxiv.org/abs/2402.10537", "description": "arXiv:2402.10537v1 Announce Type: new \nAbstract: Understanding treatment heterogeneity is crucial for reliable decision-making in treatment evaluation and selection. While the conditional average treatment effect (CATE) is commonly used to capture treatment heterogeneity induced by covariates and design individualized treatment policies, it remains an averaging metric within subpopulations. This limitation prevents it from unveiling individual-level risks, potentially leading to misleading results. This article addresses this gap by examining individual risk for binary outcomes, specifically focusing on the fraction negatively affected (FNA) conditional on covariates -- a metric assessing the percentage of individuals experiencing worse outcomes with treatment compared to control. Under the strong ignorability assumption, FNA is unidentifiable, and we find that previous bounds are wide and practically unattainable except in certain degenerate cases. By introducing a plausible positive correlation assumption for the potential outcomes, we obtain significantly improved bounds compared to previous studies. 
We show that even with a positive and statistically significant CATE, the lower bound on FNA can be positive, i.e., in the best-case scenario many units will be harmed when receiving treatment. We establish a nonparametric sensitivity analysis framework for FNA using the Pearson correlation coefficient as the sensitivity parameter, thereby exploring the relationships among the correlation coefficient, FNA, and CATE. We also present a practical and tractable method for selecting the range of correlation coefficients. Furthermore, we propose flexible estimators for refined FNA bounds and prove their consistency and asymptotic normality."}, "https://arxiv.org/abs/2402.10545": {"title": "Spatial quantile clustering of climate data", "link": "https://arxiv.org/abs/2402.10545", "description": "arXiv:2402.10545v1 Announce Type: new \nAbstract: In the era of climate change, the distribution of climate variables evolves with changes not limited to the mean value. Consequently, clustering algorithms based on central tendency could produce misleading results when used to summarize spatial and/or temporal patterns. We present a novel approach to spatial clustering of time series based on quantiles using a Bayesian framework that incorporates a spatial dependence layer based on a Markov random field. The proposal was tested through a series of simulations and then applied to the sea surface temperature of the Mediterranean Sea, one of the first seas to be affected by climate change."}, "https://arxiv.org/abs/2402.10574": {"title": "Nowcasting with mixed frequency data using Gaussian processes", "link": "https://arxiv.org/abs/2402.10574", "description": "arXiv:2402.10574v1 Announce Type: new \nAbstract: We propose and discuss Bayesian machine learning methods for mixed data sampling (MIDAS) regressions. This involves handling frequency mismatches with restricted and unrestricted MIDAS variants and specifying functional relationships between many predictors and the dependent variable. We use Gaussian processes (GP) and Bayesian additive regression trees (BART) as flexible extensions to linear penalized estimation. In a nowcasting and forecasting exercise we focus on quarterly US output growth and inflation in the GDP deflator. The new models leverage macroeconomic Big Data in a computationally efficient way and offer gains in predictive accuracy along several dimensions."}, "https://arxiv.org/abs/2402.10624": {"title": "Functional principal component analysis as an alternative to mixed-effect models for describing sparse repeated measures in presence of missing data", "link": "https://arxiv.org/abs/2402.10624", "description": "arXiv:2402.10624v1 Announce Type: new \nAbstract: Analyzing longitudinal data in health studies is challenging due to sparse and error-prone measurements, strong within-individual correlation, missing data and various trajectory shapes. While mixed-effect models (MM) effectively address these challenges, they remain parametric models and may incur computational costs. In contrast, Functional Principal Component Analysis (FPCA) is a non-parametric approach developed for regular and dense functional data that flexibly describes temporal trajectories at a lower computational cost. This paper presents an empirical simulation study evaluating the behaviour of FPCA with sparse and error-prone repeated measures and its robustness under different missing data schemes in comparison with MM. 
The results show that FPCA is well-suited in the presence of missing at random data caused by dropout, except in scenarios involving the most frequent and systematic dropout. Like MM, FPCA fails under a missing not at random mechanism. FPCA was applied to describe the trajectories of four cognitive functions before clinical dementia and to contrast them with those of matched controls in a case-control study nested in a population-based aging cohort. The average cognitive declines of future dementia cases showed a sudden divergence from those of their matched controls with a sharp acceleration 5 to 2.5 years prior to diagnosis."}, "https://arxiv.org/abs/2402.10836": {"title": "Manipulation Test for Multidimensional RDD", "link": "https://arxiv.org/abs/2402.10836", "description": "arXiv:2402.10836v1 Announce Type: new \nAbstract: The causal inference model proposed by Lee (2008) for the regression discontinuity design (RDD) relies on assumptions that imply the continuity of the density of the assignment (running) variable. The test for this implication is commonly referred to as the manipulation test and is regularly reported in applied research to strengthen the design's validity. The multidimensional RDD (MRDD) extends the RDD to contexts where treatment assignment depends on several running variables. This paper introduces a manipulation test for the MRDD. First, it develops a theoretical model for causal inference with the MRDD, used to derive a testable implication on the conditional marginal densities of the running variables. Then, it constructs the test for the implication based on a quadratic form of a vector of statistics separately computed for each marginal density. Finally, the proposed test is compared with alternative procedures commonly employed in applied research."}, "https://arxiv.org/abs/2402.10242": {"title": "Signed Diverse Multiplex Networks: Clustering and Inference", "link": "https://arxiv.org/abs/2402.10242", "description": "arXiv:2402.10242v1 Announce Type: cross \nAbstract: The paper introduces a Signed Generalized Random Dot Product Graph (SGRDPG) model, which is a variant of the Generalized Random Dot Product Graph (GRDPG), where, in addition, edges can be positive or negative. The setting is extended to a multiplex version, where all layers have the same collection of nodes and follow the SGRDPG. The only common feature of the layers of the network is that they can be partitioned into groups with common subspace structures, while otherwise all matrices of connection probabilities can all be different. The setting above is extremely flexible and includes a variety of existing multiplex network models as its particular cases. The paper fulfills two objectives. First, it shows that keeping signs of the edges in the process of network construction leads to better precision of estimation and clustering and, hence, is beneficial for tackling real world problems such as analysis of brain networks. Second, by employing novel algorithms, our paper ensures accuracy equivalent or superior to that achieved in simpler multiplex network models. 
In addition to theoretical guarantees, both of those features are demonstrated using numerical simulations and a real data example."}, "https://arxiv.org/abs/2402.10255": {"title": "Benchmarking the Operation of Quantum Heuristics and Ising Machines: Scoring Parameter Setting Strategies on Optimization Applications", "link": "https://arxiv.org/abs/2402.10255", "description": "arXiv:2402.10255v1 Announce Type: cross \nAbstract: We discuss guidelines for evaluating the performance of parameterized stochastic solvers for optimization problems, with particular attention to systems that employ novel hardware, such as digital quantum processors running variational algorithms, analog processors performing quantum annealing, or coherent Ising Machines. We illustrate through an example a benchmarking procedure grounded in the statistical analysis of the expectation of a given performance metric measured in a test environment. In particular, we discuss the necessity and cost of setting parameters that affect the algorithm's performance. The optimal value of these parameters could vary significantly between instances of the same target problem. We present an open-source software package that facilitates the design, evaluation, and visualization of practical parameter tuning strategies for complex use of the heterogeneous components of the solver. We examine in detail an example using parallel tempering and a simulator of a photonic Coherent Ising Machine, computing and displaying the scoring of an illustrative baseline family of parameter-setting strategies that feature an exploration-exploitation trade-off."}, "https://arxiv.org/abs/2402.10456": {"title": "Generative Modeling for Tabular Data via Penalized Optimal Transport Network", "link": "https://arxiv.org/abs/2402.10456", "description": "arXiv:2402.10456v1 Announce Type: cross \nAbstract: The task of precisely learning the probability distribution of rows within tabular data and producing authentic synthetic samples is both crucial and non-trivial. Wasserstein generative adversarial network (WGAN) marks a notable improvement in generative modeling, addressing the challenges faced by its predecessor, generative adversarial network. However, due to the mixed data types and multimodalities prevalent in tabular data, the delicate equilibrium between the generator and discriminator, as well as the inherent instability of Wasserstein distance in high dimensions, WGAN often fails to produce high-fidelity samples. To this end, we propose POTNet (Penalized Optimal Transport Network), a generative deep neural network based on a novel, robust, and interpretable marginally-penalized Wasserstein (MPW) loss. POTNet can effectively model tabular data containing both categorical and continuous features. Moreover, it offers the flexibility to condition on a subset of features. We provide theoretical justifications for the motivation behind the MPW loss. We also empirically demonstrate the effectiveness of our proposed method on four different benchmarks across a variety of real-world and simulated datasets. 
Our proposed model achieves an orders-of-magnitude speedup during the sampling stage compared to state-of-the-art generative models for tabular data, thereby enabling efficient large-scale synthetic data generation."}, "https://arxiv.org/abs/2402.10592": {"title": "Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification", "link": "https://arxiv.org/abs/2402.10592", "description": "arXiv:2402.10592v1 Announce Type: cross \nAbstract: Practitioners conducting adaptive experiments often encounter two competing priorities: reducing the cost of experimentation by effectively assigning treatments during the experiment itself, and gathering information swiftly to conclude the experiment and implement a treatment across the population. Currently, the literature is divided, with studies on regret minimization addressing the former priority in isolation, and research on best-arm identification focusing solely on the latter. This paper proposes a unified model that accounts for both within-experiment performance and post-experiment outcomes. We then provide a sharp theory of optimal performance in large populations that unifies canonical results in the literature. This unification also uncovers novel insights. For example, the theory reveals that familiar algorithms, like the recently proposed top-two Thompson sampling algorithm, can be adapted to optimize a broad class of objectives by simply adjusting a single scalar parameter. In addition, the theory reveals that enormous reductions in experiment duration can sometimes be achieved with minimal impact on both within-experiment and post-experiment regret."}, "https://arxiv.org/abs/2402.10608": {"title": "A maximum likelihood estimation of L\\'evy-driven stochastic systems for univariate and multivariate time series of observations", "link": "https://arxiv.org/abs/2402.10608", "description": "arXiv:2402.10608v1 Announce Type: cross \nAbstract: The literature is full of inference techniques developed to estimate the parameters of stochastic dynamical systems driven by the well-known Brownian noise. Such diffusion models are often inappropriate for properly describing the dynamics reflected in many real-world data which are dominated by jump discontinuities of various sizes and frequencies. To account for the presence of jumps, jump-diffusion models are introduced and some inference techniques are developed. Jump-diffusion models are also inadequate models since they fail to reflect the frequent occurrence as well as the continuous spectrum of natural jumps. It is, therefore, crucial to depart from the classical stochastic systems like diffusion and jump-diffusion models and resort to stochastic systems where the regime of stochasticity is governed by the stochastic fluctuations of L\\'evy type. Reconstruction of L\\'evy-driven dynamical systems, however, has been a major challenge. The literature on the reconstruction of L\\'evy-driven systems is rather poor: the few reconstruction algorithms that have been developed suffer from one or several problems such as being data-hungry, failing to provide a full reconstruction of noise parameters, tackling only some specific systems, failing to cope with multivariate data in practice, lacking proper validation mechanisms, and many more. This letter introduces a maximum likelihood estimation procedure which grants a full reconstruction of the system, requires less data, and is quite straightforward to implement for multivariate data. 
To the best of our knowledge, this contribution is the first to tackle all the mentioned shortcomings. We apply our algorithm to simulated data as well as an ice-core dataset spanning the last glaciation. In particular, we find new insights about the dynamics of the climate in the course of the last glaciation which were not found in previous studies."}, "https://arxiv.org/abs/2402.10859": {"title": "Spatio-temporal point process modelling of fires in Sicily exploring human and environmental factors", "link": "https://arxiv.org/abs/2402.10859", "description": "arXiv:2402.10859v1 Announce Type: cross \nAbstract: In 2023, Sicily faced an escalating issue of uncontrolled fires, necessitating a thorough investigation into their spatio-temporal dynamics. Our study addresses this concern through point process theory. Each wildfire is treated as a unique point in both space and time, allowing us to assess the influence of environmental and anthropogenic factors by fitting a spatio-temporal separable Poisson point process model, with a particular focus on the role of land usage. First, a spatial log-linear Poisson model is applied to investigate the influence of land use types on wildfire distribution, controlling for other environmental covariates. The results highlight the significant effect of human activities, altitude, and slope on spatial fire occurrence. Then, a Generalized Additive Model with Poisson-distributed response further explores the temporal dynamics of wildfire occurrences, confirming their dependence on various environmental variables, including the maximum daily temperature, wind speed, surface pressure, and total precipitation."}, "https://arxiv.org/abs/2103.05161": {"title": "The Efficient Shrinkage Path: Maximum Likelihood of Minimum MSE Risk", "link": "https://arxiv.org/abs/2103.05161", "description": "arXiv:2103.05161v5 Announce Type: replace \nAbstract: A new generalized ridge regression shrinkage path is proposed that is as short as possible under the restriction that it must pass through the vector of regression coefficient estimators that make the overall Optimal Variance-Bias Trade-Off under Normal distribution-theory. Five distinct types of ridge TRACE displays plus other graphics for this efficient path are motivated and illustrated here. These visualizations provide invaluable data-analytic insights and improved self-confidence to researchers and data scientists fitting linear models to ill-conditioned (confounded) data."}, "https://arxiv.org/abs/2208.11756": {"title": "Testing Many Constraints in Possibly Irregular Models Using Incomplete U-Statistics", "link": "https://arxiv.org/abs/2208.11756", "description": "arXiv:2208.11756v3 Announce Type: replace \nAbstract: We consider the problem of testing a null hypothesis defined by equality and inequality constraints on a statistical parameter. Testing such hypotheses can be challenging because the number of relevant constraints may be on the same order or even larger than the number of observed samples. Moreover, standard distributional approximations may be invalid due to irregularities in the null hypothesis. We propose a general testing methodology that aims to circumvent these difficulties. The constraints are estimated by incomplete U-statistics, and we derive critical values by Gaussian multiplier bootstrap. 
We show that the bootstrap approximation of incomplete U-statistics is valid for kernels that we call mixed degenerate when the number of combinations used to compute the incomplete U-statistic is of the same order as the sample size. It follows that our test controls type I error even in irregular settings. Furthermore, the bootstrap approximation covers high-dimensional settings making our testing strategy applicable for problems with many constraints. The methodology is applicable, in particular, when the constraints to be tested are polynomials in U-estimable parameters. As an application, we consider goodness-of-fit tests of latent tree models for multivariate data."}, "https://arxiv.org/abs/2305.03149": {"title": "A Spectral Method for Identifiable Grade of Membership Analysis with Binary Responses", "link": "https://arxiv.org/abs/2305.03149", "description": "arXiv:2305.03149v3 Announce Type: replace \nAbstract: Grade of Membership (GoM) models are popular individual-level mixture models for multivariate categorical data. GoM allows each subject to have mixed memberships in multiple extreme latent profiles. Therefore GoM models have a richer modeling capacity than latent class models that restrict each subject to belong to a single profile. The flexibility of GoM comes at the cost of more challenging identifiability and estimation problems. In this work, we propose a singular value decomposition (SVD) based spectral approach to GoM analysis with multivariate binary responses. Our approach hinges on the observation that the expectation of the data matrix has a low-rank decomposition under a GoM model. For identifiability, we develop sufficient and almost necessary conditions for a notion of expectation identifiability. For estimation, we extract only a few leading singular vectors of the observed data matrix, and exploit the simplex geometry of these vectors to estimate the mixed membership scores and other parameters. We also establish the consistency of our estimator in the double-asymptotic regime where both the number of subjects and the number of items grow to infinity. Our spectral method has a huge computational advantage over Bayesian or likelihood-based methods and is scalable to large-scale and high-dimensional data. Extensive simulation studies demonstrate the superior efficiency and accuracy of our method. We also illustrate our method by applying it to a personality test dataset."}, "https://arxiv.org/abs/2305.17517": {"title": "Stochastic Nonparametric Estimation of the Density-Flow Curve", "link": "https://arxiv.org/abs/2305.17517", "description": "arXiv:2305.17517v3 Announce Type: replace \nAbstract: Recent advances in operations research and machine learning have revived interest in solving complex real-world, large-size traffic control problems. With the increasing availability of road sensor data, deterministic parametric models have proved inadequate in describing the variability of real-world data, especially in congested area of the density-flow diagram. In this paper we estimate the stochastic density-flow relation introducing a nonparametric method called convex quantile regression. The proposed method does not depend on any prior functional form assumptions, but thanks to the concavity constraints, the estimated function satisfies the theoretical properties of the density-flow curve. The second contribution is to develop the new convex quantile regression with bags (CQRb) approach to facilitate practical implementation of CQR to the real-world data. 
We illustrate the CQRb estimation process using the road sensor data from Finland in years 2016-2018. Our third contribution is to demonstrate the excellent out-of-sample predictive power of the proposed CQRb method in comparison to the standard parametric deterministic approach."}, "https://arxiv.org/abs/2309.01889": {"title": "The Local Projection Residual Bootstrap for AR(1) Models", "link": "https://arxiv.org/abs/2309.01889", "description": "arXiv:2309.01889v3 Announce Type: replace \nAbstract: This paper proposes a local projection residual bootstrap method to construct confidence intervals for impulse response coefficients of AR(1) models. Our bootstrap method is based on the local projection (LP) approach and involves a residual bootstrap procedure applied to AR(1) models. We present theoretical results for our bootstrap method and proposed confidence intervals. First, we prove the uniform consistency of the LP-residual bootstrap over a large class of AR(1) models that allow for a unit root. Then, we prove the asymptotic validity of our confidence intervals over the same class of AR(1) models. Finally, we show that the LP-residual bootstrap provides asymptotic refinements for confidence intervals on a restricted class of AR(1) models relative to those required for the uniform consistency of our bootstrap."}, "https://arxiv.org/abs/2312.00501": {"title": "Cautionary Tales on Synthetic Controls in Survival Analyses", "link": "https://arxiv.org/abs/2312.00501", "description": "arXiv:2312.00501v2 Announce Type: replace \nAbstract: Synthetic control (SC) methods have gained rapid popularity in economics recently, where they have been applied in the context of inferring the effects of treatments on standard continuous outcomes assuming linear input-output relations. In medical applications, conversely, survival outcomes are often of primary interest, a setup in which both commonly assumed data-generating processes (DGPs) and target parameters are different. In this paper, we therefore investigate whether and when SCs could serve as an alternative to matching methods in survival analyses. We find that, because SCs rely on a linearity assumption, they will generally be biased for the true expected survival time in commonly assumed survival DGPs -- even when taking into account the possibility of linearity on another scale as in accelerated failure time models. Additionally, we find that, because SC units follow distributions with lower variance than real control units, summaries of their distributions, such as survival curves, will be biased for the parameters of interest in many survival analyses. Nonetheless, we also highlight that using SCs can still improve upon matching whenever the biases described above are outweighed by extrapolation biases exhibited by imperfect matches, and investigate the use of regularization to trade off the shortcomings of both approaches."}, "https://arxiv.org/abs/2312.12966": {"title": "Rank-based Bayesian clustering via covariate-informed Mallows mixtures", "link": "https://arxiv.org/abs/2312.12966", "description": "arXiv:2312.12966v2 Announce Type: replace \nAbstract: Data in the form of rankings, ratings, pair comparisons or clicks are frequently collected in diverse fields, from marketing to politics, to understand assessors' individual preferences. Combining such preference data with features associated with the assessors can lead to a better understanding of the assessors' behaviors and choices. 
The Mallows model is a popular model for rankings, as it flexibly adapts to different types of preference data, and the previously proposed Bayesian Mallows Model (BMM) offers a computationally efficient framework for Bayesian inference, also allowing capturing the users' heterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite mixture model that performs clustering while also accounting for assessor-related features, called the Bayesian Mallows model with covariates (BMMx). BMMx is based on a similarity function that a priori favours the aggregation of assessors into a cluster when their covariates are similar, using the Product Partition models (PPMx) proposal. We present two approaches to measure the covariate similarity: one based on a novel deterministic function measuring the covariates' goodness-of-fit to the cluster, and one based on an augmented model as in PPMx. We investigate the performance of BMMx in both simulation experiments and real-data examples, showing the method's potential for advancing the understanding of assessor preferences and behaviors in different applications."}, "https://arxiv.org/abs/2401.04200": {"title": "Teacher bias or measurement error?", "link": "https://arxiv.org/abs/2401.04200", "description": "arXiv:2401.04200v2 Announce Type: replace \nAbstract: In many countries, teachers' track recommendations are used to allocate students to secondary school tracks. Previous studies have shown that students from families with low socioeconomic status (SES) receive lower track recommendations than their peers from high SES families, conditional on standardized test scores. It is often argued that this indicates teacher bias. However, this claim is invalid in the presence of measurement error in test scores. We discuss how measurement error in test scores generates a biased coefficient of the conditional SES gap, and consider three empirical strategies to address this bias. Using administrative data from the Netherlands, we find that measurement error explains 35 to 43% of the conditional SES gap in track recommendations."}, "https://arxiv.org/abs/2401.04512": {"title": "Robust Bayesian Method for Refutable Models", "link": "https://arxiv.org/abs/2401.04512", "description": "arXiv:2401.04512v2 Announce Type: replace \nAbstract: We propose a robust Bayesian method for economic models that can be rejected under some data distributions. The econometrician starts with a structural assumption which can be written as the intersection of several assumptions, and the joint assumption is refutable. To avoid the model rejection, the econometrician first takes a stance on which assumption $j$ is likely to be violated and considers a measurement of the degree of violation of this assumption $j$. She then considers a (marginal) prior belief on the degree of violation $(\\pi_{m_j})$: She considers a class of prior distributions $\\pi_s$ on all economic structures such that all $\\pi_s$ have the same marginal distribution $\\pi_m$. Compared to the standard nonparametric Bayesian method that puts a single prior on all economic structures, the robust Bayesian method imposes a single marginal prior distribution on the degree of violation. As a result, the robust Bayesian method allows the econometrician to take a stance only on the likeliness of violation of assumption $j$. Compared to the frequentist approach to relax the refutable assumption, the robust Bayesian method is transparent on the econometrician's stance of choosing models. 
We also show that many frequentists' ways to relax the refutable assumption can be shown to be equivalent to particular choices of robust Bayesian prior classes. We use the local average treatment effect (LATE) in the potential outcome framework as the leading illustrative example."}, "https://arxiv.org/abs/2401.14593": {"title": "Robust Estimation of Pareto's Scale Parameter from Grouped Data", "link": "https://arxiv.org/abs/2401.14593", "description": "arXiv:2401.14593v2 Announce Type: replace \nAbstract: Numerous robust estimators exist as alternatives to the maximum likelihood estimator (MLE) when a completely observed ground-up loss severity sample dataset is available. However, the options for robust alternatives to MLE become significantly limited when dealing with grouped loss severity data, with only a handful of methods like least squares, minimum Hellinger distance, and optimal bounded influence function available. This paper introduces a novel robust estimation technique, the Method of Truncated Moments (MTuM), specifically designed to estimate the tail index of a Pareto distribution from grouped data. Inferential justification of MTuM is established by employing the central limit theorem and validated through a comprehensive simulation study."}, "https://arxiv.org/abs/1906.01741": {"title": "Fr\\'echet random forests for metric space valued regression with non euclidean predictors", "link": "https://arxiv.org/abs/1906.01741", "description": "arXiv:1906.01741v3 Announce Type: replace-cross \nAbstract: Random forests are a statistical learning method widely used in many areas of scientific research because of their ability to learn complex relationships between input and output variables and also their capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fr\\'echet trees and Fr\\'echet random forests, which allow handling data for which input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. Then, the random forest out-of-bag error and variable importance score are naturally adapted. A consistency theorem for the Fr\\'echet regressogram predictor using data-driven partitions is given and applied to Fr\\'echet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, one real dataset about air quality is used to illustrate the use of the proposed method in practice."}, "https://arxiv.org/abs/2302.07930": {"title": "Interpretable Deep Learning Methods for Multiview Learning", "link": "https://arxiv.org/abs/2302.07930", "description": "arXiv:2302.07930v2 Announce Type: replace-cross \nAbstract: Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. 
Deep neural networks are used to learn a view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging the selection of related variables. iDeepViewLearn is tested on simulated data and two real-world datasets, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning."}, "https://arxiv.org/abs/2311.18048": {"title": "An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis", "link": "https://arxiv.org/abs/2311.18048", "description": "arXiv:2311.18048v2 Announce Type: replace-cross \nAbstract: We investigate the relationship between system identification and intervention design in dynamical systems. While previous research demonstrated how identifiable representation learning methods, such as Independent Component Analysis (ICA), can reveal cause-effect relationships, it relied on a passive perspective without considering how to collect data. Our work shows that in Gaussian Linear Time-Invariant (LTI) systems, the system parameters can be identified by introducing diverse intervention signals in a multi-environment setting. By harnessing appropriate diversity assumptions motivated by the ICA literature, our findings connect experiment design and representational identifiability in dynamical systems. We corroborate our findings on synthetic and (simulated) physical data. Additionally, we show that Hidden Markov Models, in general, and (Gaussian) LTI systems, in particular, fulfil a generalization of the Causal de Finetti theorem with continuous parameters."}, "https://arxiv.org/abs/2312.02867": {"title": "Semi-Supervised Health Index Monitoring with Feature Generation and Fusion", "link": "https://arxiv.org/abs/2312.02867", "description": "arXiv:2312.02867v2 Announce Type: replace-cross \nAbstract: The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to-failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as condition indicators to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend. 
Validation on the PHME 2010 milling dataset, a recognized benchmark with ground truth HIs, demonstrates meaningful HI estimations. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground truth HI labels is unfeasible."}, "https://arxiv.org/abs/2402.11020": {"title": "Proximal Causal Inference for Conditional Separable Effects", "link": "https://arxiv.org/abs/2402.11020", "description": "arXiv:2402.11020v1 Announce Type: new \nAbstract: Scientists often pose questions about treatment effects on outcomes conditional on a post-treatment event. However, defining, identifying, and estimating causal effects conditional on post-treatment events requires care, even in perfectly executed randomized experiments. Recently, the conditional separable effect (CSE) was proposed as an interventionist estimand, corresponding to scientifically meaningful questions in these settings. However, while the CSE is a single-world estimand that can be queried experimentally, existing identification results for the CSE require no unmeasured confounding between the outcome and post-treatment event. This assumption can be violated in many applications. In this work, we address this concern by developing new identification and estimation results for the CSE in the presence of unmeasured confounding. We establish nonparametric identification of the CSE in both observational and experimental settings when certain proxy variables are available for hidden common causes of the post-treatment event and outcome. We characterize the efficient influence function for the CSE under a semiparametric model of the observed data law in which nuisance functions are a priori unrestricted. Moreover, we develop a consistent, asymptotically linear, and locally semiparametric efficient estimator of the CSE using modern machine learning theory. We illustrate our framework with simulation studies and a real-world cancer therapy trial."}, "https://arxiv.org/abs/2402.11052": {"title": "Building Trees for Probabilistic Prediction via Scoring Rules", "link": "https://arxiv.org/abs/2402.11052", "description": "arXiv:2402.11052v1 Announce Type: new \nAbstract: Decision trees built with data remain in widespread use for nonparametric prediction. Predicting probability distributions is preferred over point predictions when uncertainty plays a prominent role in analysis and decision-making. We study modifying a tree to produce nonparametric predictive distributions. We find the standard method for building trees may not result in good predictive distributions and propose changing the splitting criterion for trees to one based on proper scoring rules. Analysis of both simulated data and several real datasets demonstrates that using these new splitting criteria results in trees with improved predictive properties considering the entire predictive distribution."}, "https://arxiv.org/abs/2402.11070": {"title": "Scalable Analysis of Bipartite Experiments", "link": "https://arxiv.org/abs/2402.11070", "description": "arXiv:2402.11070v1 Announce Type: new \nAbstract: Bipartite Experiments are randomized experiments where the treatment is applied to a set of units (randomization units) that is different from the units of analysis, and randomization units and analysis units are connected through a bipartite graph. 
The scale of experimentation at large online platforms necessitates both accurate inference in the presence of a large bipartite interference graph, as well as a highly scalable implementation. In this paper, we describe new methods for inference that enable practical, scalable analysis of bipartite experiments: (1) We propose CA-ERL, a covariate-adjusted variant of the exposure-reweighted-linear (ERL) estimator [9], which empirically yields 60-90% variance reduction. (2) We introduce a randomization-based method for inference and prove asymptotic validity of a Wald-type confidence interval under graph sparsity assumptions. (3) We present a linear-time algorithm for randomization inference of the CA-ERL estimator, which can be easily implemented in query engines like Presto or Spark. We evaluate our methods both on a real experiment at Meta that randomized treatment on Facebook Groups and analyzed user-level metrics, as well as simulations on synthetic data. The real-world data shows that our CA-ERL estimator reduces the confidence interval (CI) width by 60-90% (compared to ERL) in a practical setting. The simulations using synthetic data show that our randomization inference procedure achieves correct coverage across instances, while the ERL estimator has incorrectly small CI widths for instances with large true effect sizes and is overly conservative when the bipartite graph is dense."}, "https://arxiv.org/abs/2402.11092": {"title": "Adaptive Weight Learning for Multiple Outcome Optimization With Continuous Treatment", "link": "https://arxiv.org/abs/2402.11092", "description": "arXiv:2402.11092v1 Announce Type: new \nAbstract: To promote precision medicine, individualized treatment regimes (ITRs) are crucial for optimizing the expected clinical outcome based on patient-specific characteristics. However, existing ITR research has primarily focused on scenarios with categorical treatment options and a single outcome. In reality, clinicians often encounter scenarios with continuous treatment options and multiple, potentially competing outcomes, such as medicine efficacy and unavoidable toxicity. To balance these outcomes, a proper weight is necessary, which should be learned in a data-driven manner that considers both patient preference and clinician expertise. In this paper, we present a novel algorithm for developing individualized treatment regimes (ITRs) that incorporate continuous treatment options and multiple outcomes, utilizing observational data. Our approach assumes that clinicians are optimizing individualized patient utilities with sub-optimal treatment decisions that are at least better than random assignment. Treatment assignment is assumed to directly depend on the true underlying utility of the treatment rather than patient characteristics. The proposed method simultaneously estimates the weighting of composite outcomes and the decision-making process, allowing for construction of individualized treatment regimes with continuous doses. The proposed estimators can be used for inference and variable selection, facilitating the identification of informative treatment assignments and preference-associated variables. 
We evaluate the finite sample performance of our proposed method via simulation studies and apply it to a real data application of radiation oncology analysis."}, "https://arxiv.org/abs/2402.11133": {"title": "Two-Sample Hypothesis Testing for Large Random Graphs of Unequal Size", "link": "https://arxiv.org/abs/2402.11133", "description": "arXiv:2402.11133v1 Announce Type: new \nAbstract: Two-sample hypothesis testing for large graphs is popular in cognitive science, probabilistic machine learning and artificial intelligence. While numerous methods have been proposed in the literature to address this problem, less attention has been devoted to scenarios involving graphs of unequal size or situations where there are only one or a few samples of graphs. In this article, we propose a Frobenius test statistic tailored for small sample sizes and unequal-sized random graphs to test whether they are generated from the same model or not. Our approach involves an algorithm for generating bootstrapped adjacency matrices from estimated community-wise edge probability matrices, forming the basis of the Frobenius test statistic. We derive the asymptotic distribution of the proposed test statistic and validate its stability and efficiency in detecting minor differences in underlying models through simulations. Furthermore, we explore its application to fMRI data where we are able to distinguish brain activity patterns when subjects are exposed to sentences and pictures for two different stimuli and the control group."}, "https://arxiv.org/abs/2402.11292": {"title": "Semi-functional partial linear regression with measurement error: An approach based on $k$NN estimation", "link": "https://arxiv.org/abs/2402.11292", "description": "arXiv:2402.11292v1 Announce Type: new \nAbstract: This paper focuses on a semiparametric regression model in which the response variable is explained by the sum of two components. One of them is parametric (linear), the corresponding explanatory variable is measured with additive error and its dimension is finite ($p$). The other component models, in a nonparametric way, the effect of a functional variable (infinite dimension) on the response. $k$-NN based estimators are proposed for each component, and some asymptotic results are obtained. A simulation study illustrates the behaviour of such estimators for finite sample sizes, while an application to real data shows the usefulness of our proposal."}, "https://arxiv.org/abs/2402.11336": {"title": "Conditionally Affinely Invariant Rerandomization and its Admissibility", "link": "https://arxiv.org/abs/2402.11336", "description": "arXiv:2402.11336v1 Announce Type: new \nAbstract: Rerandomization utilizes modern computing ability to search for experimental designs with improved covariate balance while adhering to the randomization principle originally advocated by RA Fisher. Conditionally affinely invariant rerandomization has the ``Equal Percent Variance Reducing'' property on subsets of conditionally ellipsoidally symmetric covariates. It is suitable to deal with covariates of varying importance or mixed types and usually produces multiple balance scores. ``Unified'' and ``intersection'' methods are common ways of deciding on multiple scores. In general, ``intersection'' methods are computationally more efficient but asymptotically inadmissible. 
As computational cost is not a major concern in experimental design, we recommend ``unified'' methods to build admissible criteria for rerandomization."}, "https://arxiv.org/abs/2402.11341": {"title": "Between- and Within-Cluster Spearman Rank Correlations", "link": "https://arxiv.org/abs/2402.11341", "description": "arXiv:2402.11341v1 Announce Type: new \nAbstract: Clustered data are common in practice. Clustering arises when subjects are measured repeatedly, or subjects are nested in groups (e.g., households, schools). It is often of interest to evaluate the correlation between two variables with clustered data. There are three commonly used Pearson correlation coefficients (total, between-, and within-cluster), which together provide an enriched perspective of the correlation. However, these Pearson correlation coefficients are sensitive to extreme values and skewed distributions. They also depend on the scale of the data and are not applicable to ordered categorical data. Current non-parametric measures for clustered data are only for the total correlation. Here we define population parameters for the between- and within-cluster Spearman rank correlations. The definitions are natural extensions of the Pearson between- and within-cluster correlations to the rank scale. We show that the total Spearman rank correlation approximates a weighted sum of the between- and within-cluster Spearman rank correlations, where the weights are functions of rank intraclass correlations of the two random variables. We also discuss the equivalence between the within-cluster Spearman rank correlation and the covariate-adjusted partial Spearman rank correlation. Furthermore, we describe estimation and inference for the three Spearman rank correlations, conduct simulations to evaluate the performance of our estimators, and illustrate their use with data from a longitudinal biomarker study and a clustered randomized trial."}, "https://arxiv.org/abs/2402.11425": {"title": "Online Local False Discovery Rate Control: A Resource Allocation Approach", "link": "https://arxiv.org/abs/2402.11425", "description": "arXiv:2402.11425v1 Announce Type: new \nAbstract: We consider the problem of online local false discovery rate (FDR) control where multiple tests are conducted sequentially, with the goal of maximizing the total expected number of discoveries. We formulate the problem as an online resource allocation problem with accept/reject decisions, which from a high level can be viewed as an online knapsack problem, with the additional uncertainty of random budget replenishment. We start with general arrival distributions and propose a simple policy that achieves a $O(\\sqrt{T})$ regret. We complement the result by showing that such a regret rate is in general not improvable. We then shift our focus to discrete arrival distributions. We find that many existing re-solving heuristics in the online resource allocation literature, albeit achieving bounded loss in canonical settings, may incur a $\\Omega(\\sqrt{T})$ or even a $\\Omega(T)$ regret. With the observation that canonical policies tend to be too optimistic and over-accept arrivals, we propose a novel policy that incorporates budget buffers. We show that small additional logarithmic buffers suffice to reduce the regret from $\\Omega(\\sqrt{T})$ or even $\\Omega(T)$ to $O(\\ln^2 T)$. Numerical experiments are conducted to validate our theoretical findings. 
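For the between- and within-cluster Spearman correlations described above (arXiv:2402.11341), a naive plug-in illustration is to pool-rank both variables and then apply the usual Pearson between/within decomposition to the ranks. This is only a rough sketch of the rank-scale analogue, not the authors' estimators, and the column names are hypothetical.

import numpy as np
import pandas as pd
from scipy.stats import rankdata

def clustered_spearman(df, x="x", y="y", cluster="cluster"):
    # Rank both variables over the pooled sample (mid-ranks for ties).
    r = df.assign(rx=rankdata(df[x]), ry=rankdata(df[y]))
    total = np.corrcoef(r["rx"], r["ry"])[0, 1]
    # Between-cluster: correlation of cluster-mean ranks (unweighted here).
    m = r.groupby(cluster)[["rx", "ry"]].mean()
    between = np.corrcoef(m["rx"], m["ry"])[0, 1]
    # Within-cluster: correlation of rank deviations from cluster means.
    d = r[["rx", "ry"]] - r.groupby(cluster)[["rx", "ry"]].transform("mean")
    within = np.corrcoef(d["rx"], d["ry"])[0, 1]
    return {"total": total, "between": between, "within": within}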
Our formulation may have wider applications beyond the problem considered in this paper, and our results emphasize how effective policies should be designed to reach a balance between circumventing wrong accept and reducing wrong reject in online resource allocation problems with uncertain budgets."}, "https://arxiv.org/abs/2402.11466": {"title": "Nonparametric assessment of regimen response curve estimators", "link": "https://arxiv.org/abs/2402.11466", "description": "arXiv:2402.11466v1 Announce Type: new \nAbstract: Marginal structural models have been widely used in causal inference to estimate mean outcomes under either a static or a prespecified set of treatment decision rules. This approach requires imposing a working model for the mean outcome given a sequence of treatments and possibly baseline covariates. In this paper, we introduce a dynamic marginal structural model that can be used to estimate an optimal decision rule within a class of parametric rules. Specifically, we will estimate the mean outcome as a function of the parameters in the class of decision rules, referred to as a regimen-response curve. In general, misspecification of the working model may lead to a biased estimate with questionable causal interpretability. To mitigate this issue, we will leverage risk to assess \"goodness-of-fit\" of the imposed working model. We consider the counterfactual risk as our target parameter and derive inverse probability weighting and canonical gradients to map it to the observed data. We provide asymptotic properties of the resulting risk estimators, considering both fixed and data-dependent target parameters. We will show that the inverse probability weighting estimator can be efficient and asymptotic linear when the weight functions are estimated using a sieve-based estimator. The proposed method is implemented on the LS1 study to estimate a regimen-response curve for patients with Parkinson's disease."}, "https://arxiv.org/abs/2402.11563": {"title": "A Gibbs Sampling Scheme for a Generalised Poisson-Kingman Class", "link": "https://arxiv.org/abs/2402.11563", "description": "arXiv:2402.11563v1 Announce Type: new \nAbstract: A Bayesian nonparametric method of James, Lijoi \\& Prunster (2009) used to predict future values of observations from normalized random measures with independent increments is modified to a class of models based on negative binomial processes for which the increments are not independent, but are independent conditional on an underlying gamma variable. Like in James et al., the new algorithm is formulated in terms of two variables, one a function of the past observations, and the other an updating by means of a new observation. We outline an application of the procedure to population genetics, for the construction of realisations of genealogical trees and coalescents from samples of alleles."}, "https://arxiv.org/abs/2402.11609": {"title": "Risk-aware product decisions in A/B tests with multiple metrics", "link": "https://arxiv.org/abs/2402.11609", "description": "arXiv:2402.11609v1 Announce Type: new \nAbstract: In the past decade, AB tests have become the standard method for making product decisions in tech companies. They offer a scientific approach to product development, using statistical hypothesis testing to control the risks of incorrect decisions. Typically, multiple metrics are used in AB tests to serve different purposes, such as establishing evidence of success, guarding against regressions, or verifying test validity. 
To mitigate risks in AB tests with multiple outcomes, it's crucial to adapt the design and analysis to the varied roles of these outcomes. This paper introduces the theoretical framework for decision rules guiding the evaluation of experiments at Spotify. First, we show that if guardrail metrics with non-inferiority tests are used, the significance level does not need to be multiplicity-adjusted for those tests. Second, if the decision rule includes non-inferiority tests, deterioration tests, or tests for quality, the type II error rate must be corrected to guarantee the desired power level for the decision. We propose a decision rule encompassing success, guardrail, deterioration, and quality metrics, employing diverse tests. This is accompanied by a design and analysis plan that mitigates risks across any data-generating process. The theoretical results are demonstrated using Monte Carlo simulations."}, "https://arxiv.org/abs/2402.11640": {"title": "Protocols for Observational Studies: An Application to Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2402.11640", "description": "arXiv:2402.11640v1 Announce Type: new \nAbstract: In his 2022 IMS Medallion Lecture delivered at the Joint Statistical Meetings, Prof. Dylan S. Small eloquently advocated for the use of protocols in observational studies. We discuss his proposal and, inspired by his ideas, we develop a protocol for the regression discontinuity design."}, "https://arxiv.org/abs/2402.11652": {"title": "Doubly Robust Inference in Causal Latent Factor Models", "link": "https://arxiv.org/abs/2402.11652", "description": "arXiv:2402.11652v1 Announce Type: new \nAbstract: This article introduces a new framework for estimating average treatment effects under unobserved confounding in modern data-rich environments featuring large numbers of units and outcomes. The proposed estimator is doubly robust, combining outcome imputation, inverse probability weighting, and a novel cross-fitting procedure for matrix completion. We derive finite-sample and asymptotic guarantees, and show that the error of the new estimator converges to a mean-zero Gaussian distribution at a parametric rate. Simulation results demonstrate the practical relevance of the formal properties of the estimators analyzed in this article."}, "https://arxiv.org/abs/2402.11659": {"title": "Credible causal inference beyond toy models", "link": "https://arxiv.org/abs/2402.11659", "description": "arXiv:2402.11659v1 Announce Type: new \nAbstract: Causal inference with observational data critically relies on untestable and extra-statistical assumptions that have (sometimes) testable implications. Well-known sets of assumptions that are sufficient to justify the causal interpretation of certain estimators are called identification strategies. These templates for causal analysis, however, do not perfectly map into empirical research practice. Researchers are often left in the disjunctive of either abstracting away from their particular setting to fit in the templates, risking erroneous inferences, or avoiding situations in which the templates cannot be applied, missing valuable opportunities for conducting empirical analysis. In this article, I show how directed acyclic graphs (DAGs) can help researchers to conduct empirical research and assess the quality of evidence without excessively relying on research templates. First, I offer a concise introduction to causal inference frameworks. 
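The Spotify decision-rule abstract above (arXiv:2402.11609) can be illustrated with a toy ship/no-ship rule: success metrics are tested for superiority with a Bonferroni adjustment across the success metrics only, while guardrail non-inferiority tests are kept at the unadjusted level, in line with the abstract's first result. The specific rule, thresholds, and names below are assumptions for illustration, not the paper's procedure.

def ship_decision(success_p, guardrail_noninf_p, alpha=0.05):
    # Success metrics: superiority p-values, Bonferroni-adjusted across the
    # success metrics, so at least one must clear alpha / K.
    k = len(success_p)
    any_success = any(p <= alpha / k for p in success_p)
    # Guardrail metrics: non-inferiority p-values kept at the unadjusted
    # alpha; all of them must pass.
    all_guardrails_ok = all(p <= alpha for p in guardrail_noninf_p)
    return any_success and all_guardrails_ok

# Example: two success metrics, three guardrails.
print(ship_decision([0.012, 0.20], [0.01, 0.03, 0.002]))  # True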
Then I survey the arguments in the methodological literature in favor of using research templates, while either avoiding or limiting the use of causal graphical models. Third, I discuss the problems with the template model, arguing for a more flexible approach to DAGs that helps illuminate common problems in empirical settings and improve the credibility of causal claims. I demonstrate this approach in a series of worked examples, showing the gap between identification strategies as invoked by researchers and their actual applications. Finally, I conclude by highlighting the benefits that routinely incorporating causal graphical models in our scientific discussions would have in terms of transparency, testability, and generativity."}, "https://arxiv.org/abs/2402.11936": {"title": "Relative Jump Distance: a diagnostic for Nested Sampling", "link": "https://arxiv.org/abs/2402.11936", "description": "arXiv:2402.11936v1 Announce Type: new \nAbstract: Nested sampling is widely used in astrophysics for reliably inferring model parameters and comparing models within a Bayesian framework. To address models with many parameters, Markov Chain Monte Carlo (MCMC) random walks are incorporated within nested sampling to advance a live point population. Diagnostic tools for nested sampling are crucial to ensure the reliability of astrophysical conclusions. We develop a diagnostic to identify problematic random walks that fail to meet the requirements of nested sampling. The distance from the start to the end of the random walk, the jump distance, is divided by the typical neighbor distance between live points, computed robustly with the MLFriends algorithm, to obtain a relative jump distance (RJD). We propose the geometric mean RJD and the fraction of RJD>1 as new summary diagnostics. Relative jump distances are investigated with mock and real-world inference applications, including inferring the distance to gravitational wave event GW170817. Problematic nested sampling runs are identified based on significant differences to reruns with much longer MCMC chains. These consistently exhibit low average RJDs and f(RJD>1) values below 50 percent. The RJD is more sensitive than previous tests based on the live point insertion order. The RJD diagnostic is proposed as a widely applicable diagnostic to verify inference with nested sampling. It is implemented in the UltraNest package in version 4.1."}, "https://arxiv.org/abs/2402.12009": {"title": "Moduli of Continuity in Metric Models and Extension of Liveability Indices", "link": "https://arxiv.org/abs/2402.12009", "description": "arXiv:2402.12009v1 Announce Type: new \nAbstract: Index spaces serve as valuable metric models for studying properties relevant to various applications, such as social science or economics. These properties are represented by real Lipschitz functions that describe the degree of association with each element within the underlying metric space. After determining the index value within a given sample subset, the classic McShane and Whitney formulas allow a Lipschitz regression procedure to be performed to extend the index values over the entire metric space. To improve the adaptability of the metric model to specific scenarios, this paper introduces the concept of a composition metric, which involves composing a metric with an increasing, positive and subadditive function $\\phi$. The results presented here extend well-established results for Lipschitz indices on metric spaces to composition metrics. 
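A rough illustration of the relative jump distance diagnostic from arXiv:2402.11936: divide each walk's start-to-end distance by a typical neighbour distance among live points and summarize by the geometric mean and the fraction exceeding one. The MLFriends-based neighbour distance is replaced here by a plain nearest-neighbour average, so this is only a sketch under that substitution.

import numpy as np

def rjd_summaries(walk_starts, walk_ends, live_points):
    # Jump distance: Euclidean distance from the start to the end of each walk.
    jumps = np.linalg.norm(np.asarray(walk_ends) - np.asarray(walk_starts), axis=1)
    # Typical neighbour distance among live points; the paper uses MLFriends,
    # here the mean nearest-neighbour distance stands in for illustration.
    lp = np.asarray(live_points)
    d = np.linalg.norm(lp[:, None, :] - lp[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    typical = d.min(axis=1).mean()
    rjd = jumps / typical
    return {"geometric_mean_rjd": float(np.exp(np.mean(np.log(rjd)))),
            "fraction_rjd_gt_1": float(np.mean(rjd > 1.0))}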
In addition, we establish the corresponding approximation properties that facilitate the use of this functional structure. To illustrate the power and simplicity of this mathematical framework, we provide a concrete application involving the modelling of livability indices in North American cities."}, "https://arxiv.org/abs/2402.12083": {"title": "TrialEmulation: An R Package to Emulate Target Trials for Causal Analysis of Observational Time-to-event Data", "link": "https://arxiv.org/abs/2402.12083", "description": "arXiv:2402.12083v1 Announce Type: new \nAbstract: Randomised controlled trials (RCTs) are regarded as the gold standard for estimating causal treatment effects on health outcomes. However, RCTs are not always feasible, because of time, budget or ethical constraints. Observational data such as those from electronic health records (EHRs) offer an alternative way to estimate the causal effects of treatments. Recently, the `target trial emulation' framework was proposed by Hernan and Robins (2016) to provide a formal structure for estimating causal treatment effects from observational data. To promote more widespread implementation of target trial emulation in practice, we develop the R package TrialEmulation to emulate a sequence of target trials using observational time-to-event data, where individuals who start to receive treatment and those who have not been on the treatment at the baseline of the emulated trials are compared in terms of their risks of an outcome event. Specifically, TrialEmulation provides (1) data preparation for emulating a sequence of target trials, (2) calculation of the inverse probability of treatment and censoring weights to handle treatment switching and dependent censoring, (3) fitting of marginal structural models for the time-to-event outcome given baseline covariates, (4) estimation and inference of marginal intention to treat and per-protocol effects of the treatment in terms of marginal risk differences between treated and untreated for a user-specified target trial population. In particular, TrialEmulation can accommodate large data sets (e.g., from EHRs) within memory constraints of R by processing data in chunks and applying case-control sampling. We demonstrate the functionality of TrialEmulation using a simulated data set that mimics typical observational time-to-event data in practice."}, "https://arxiv.org/abs/2402.12171": {"title": "A frequentist test of proportional colocalization after selecting relevant genetic variants", "link": "https://arxiv.org/abs/2402.12171", "description": "arXiv:2402.12171v1 Announce Type: new \nAbstract: Colocalization analyses assess whether two traits are affected by the same or distinct causal genetic variants in a single gene region. A class of Bayesian colocalization tests are now routinely used in practice; for example, for genetic analyses in drug development pipelines. In this work, we consider an alternative frequentist approach to colocalization testing that examines the proportionality of genetic associations with each trait. The proportional colocalization approach uses markedly different assumptions to Bayesian colocalization tests, and therefore can provide valuable complementary evidence in cases where Bayesian colocalization results are inconclusive or sensitive to priors. We propose a novel conditional test of proportional colocalization, prop-coloc-cond, that aims to account for the uncertainty in variant selection, in order to recover accurate type I error control. 
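For the composition-metric extension of arXiv:2402.12009, a minimal sketch is to apply the classical McShane and Whitney formulas with the metric d replaced by phi(d). The choice phi = sqrt and the averaging of the two extensions are illustrative assumptions, not the paper's recipe.

import numpy as np

def lipschitz_extension(X_train, f_train, X_new, phi=np.sqrt):
    # Composition metric d_phi = phi(d) with phi increasing, non-negative and
    # subadditive; phi = sqrt is just one admissible example.
    def dphi(A, B):
        return phi(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
    # Lipschitz constant of the index on the sample w.r.t. d_phi.
    D = dphi(X_train, X_train)
    np.fill_diagonal(D, np.inf)
    L = np.max(np.abs(f_train[:, None] - f_train[None, :]) / D)
    # McShane (upper) and Whitney (lower) extensions; averaging them is a
    # common compromise for regression-style extension of the index.
    Dn = dphi(X_new, X_train)
    mcshane = np.min(f_train[None, :] + L * Dn, axis=1)
    whitney = np.max(f_train[None, :] - L * Dn, axis=1)
    return 0.5 * (mcshane + whitney)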
The test can be implemented straightforwardly, requiring only summary data on genetic associations. Simulation evidence and an empirical investigation into GLP1R gene expression demonstrates how tests of proportional colocalization can offer important insights in conjunction with Bayesian colocalization tests."}, "https://arxiv.org/abs/2402.12323": {"title": "Expressing and visualizing model uncertainty in Bayesian variable selection using Cartesian credible sets", "link": "https://arxiv.org/abs/2402.12323", "description": "arXiv:2402.12323v1 Announce Type: new \nAbstract: Modern regression applications can involve hundreds or thousands of variables which motivates the use of variable selection methods. Bayesian variable selection defines a posterior distribution on the possible subsets of the variables (which are usually termed models) to express uncertainty about which variables are strongly linked to the response. This can be used to provide Bayesian model averaged predictions or inference, and to understand the relative importance of different variables. However, there has been little work on meaningful representations of this uncertainty beyond first order summaries. We introduce Cartesian credible sets to address this gap. The elements of these sets are formed by concatenating sub-models defined on each block of a partition of the variables. Investigating these sub-models allow us to understand whether the models in the Cartesian credible set always/never/sometimes include a particular variable or group of variables and provide a useful summary of model uncertainty. We introduce methods to find these sets that emphasize ease of understanding. The potential of the method is illustrated on regression problems with both small and large numbers of variables."}, "https://arxiv.org/abs/2402.12349": {"title": "An optimal replacement policy under variable shocks and self-healing patterns", "link": "https://arxiv.org/abs/2402.12349", "description": "arXiv:2402.12349v1 Announce Type: new \nAbstract: We study a system that experiences damaging external shocks at stochastic intervals, continuous degradation, and self-healing. The motivation for such a system comes from real-life applications based on micro-electro-mechanical systems (MEMS). The system fails if the cumulative damage exceeds a time-dependent threshold. We develop a preventive maintenance policy to replace the system such that its lifetime is prudently utilized. Further, three variations on the healing pattern have been considered: (i) shocks heal for a fixed duration $\\tau$; (ii) a fixed proportion of shocks are non-healable (that is, $\\tau=0$); (iii) there are two types of shocks -- self healable shocks heal for a finite duration, and nonhealable shocks inflict a random system degradation. We implement a proposed preventive maintenance policy and compare the optimal replacement times in these new cases to that of the original case where all shocks heal indefinitely and thereby enable the system manager to take necessary decisions in generalized system set-ups."}, "https://arxiv.org/abs/2306.11380": {"title": "A Bayesian Take on Gaussian Process Networks", "link": "https://arxiv.org/abs/2306.11380", "description": "arXiv:2306.11380v4 Announce Type: cross \nAbstract: Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. 
The model allows the description of continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution."}, "https://arxiv.org/abs/2402.00623": {"title": "Bayesian Causal Inference with Gaussian Process Networks", "link": "https://arxiv.org/abs/2402.00623", "description": "arXiv:2402.00623v1 Announce Type: cross \nAbstract: Causal discovery and inference from observational data is an essential problem in statistics posing both modeling and computational challenges. These are typically addressed by imposing strict assumptions on the joint distribution such as linearity. We consider the problem of the Bayesian estimation of the effects of hypothetical interventions in the Gaussian Process Network (GPN) model, a flexible causal framework which allows describing the causal relationships nonparametrically. We detail how to perform causal inference on GPNs by simulating the effect of an intervention across the whole network and propagating the effect of the intervention on downstream variables. We further derive a simpler computational approximation by estimating the intervention distribution as a function of local variables only, modeling the conditional distributions via additive Gaussian processes. We extend both frameworks beyond the case of a known causal graph, incorporating uncertainty about the causal structure via Markov chain Monte Carlo methods. Simulation studies show that our approach is able to identify the effects of hypothetical interventions with non-Gaussian, non-linear observational data and accurately reflect the posterior uncertainty of the causal estimates. Finally we compare the results of our GPN-based causal inference approach to existing methods on a dataset of $A.~thaliana$ gene expressions."}, "https://arxiv.org/abs/2402.10982": {"title": "mshw, a forecasting library to predict short-term electricity demand based on multiple seasonal Holt-Winters", "link": "https://arxiv.org/abs/2402.10982", "description": "arXiv:2402.10982v1 Announce Type: cross \nAbstract: Transmission system operators have a growing need for more accurate forecasting of electricity demand. Current electricity systems largely require demand forecasting so that the electricity market establishes electricity prices as well as the programming of production units. The companies that are part of the electrical system use exclusive software to obtain predictions, based on the use of time series and prediction tools, whether statistical or artificial intelligence. However, the most common form of prediction is based on hybrid models that use both technologies. In any case, it is software with a complicated structure, with a large number of associated variables and that requires a high computational load to make predictions. 
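A bare-bones illustration of propagating an intervention through a Gaussian Process Network in the spirit of arXiv:2402.00623: fit one GP per node given its parents, then walk the DAG in topological order replacing downstream nodes by their predictive means under do(node = value). Using scikit-learn's GaussianProcessRegressor with its default kernel and ignoring posterior and graph uncertainty are assumptions made purely to keep the sketch short.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fit_gpn(data, parents):
    # One GP per node, regressing the node on its parents (roots get None).
    gps = {}
    for node, pa in parents.items():
        if pa:
            X = np.column_stack([data[p] for p in pa])
            gps[node] = GaussianProcessRegressor(normalize_y=True).fit(X, data[node])
        else:
            gps[node] = None
    return gps

def simulate_do(gps, parents, data, node, value, order):
    # Propagate a hard intervention do(node = value) through the network,
    # replacing each downstream node by its GP predictive mean.
    sim = {n: np.asarray(data[n], dtype=float).copy() for n in order}
    sim[node] = np.full_like(sim[node], value)
    for n in order:
        if n == node or not parents[n]:
            continue
        X = np.column_stack([sim[p] for p in parents[n]])
        sim[n] = gps[n].predict(X)
    return sim

# `order` must be a topological ordering of the DAG, e.g. ["Z", "X", "Y"].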
The predictions they offer are not much better than those of simpler models. In this paper we present a MATLAB toolbox created for the prediction of electrical demand. The toolbox implements multiple seasonal Holt-Winters exponential smoothing models and neural network models. The models incorporate discrete interval mobile seasonalities (DIMS) to improve forecasting on special days. Additionally, results of its application to various European electrical systems are presented. The library opens a new avenue of research for the use of models with discrete and complex seasonalities in other fields of application."}, "https://arxiv.org/abs/2402.11134": {"title": "Functional Partial Least-Squares: Optimal Rates and Adaptation", "link": "https://arxiv.org/abs/2402.11134", "description": "arXiv:2402.11134v1 Announce Type: cross \nAbstract: We consider the functional linear regression model with a scalar response and a Hilbert space-valued predictor, a well-known ill-posed inverse problem. We propose a new formulation of the functional partial least-squares (PLS) estimator related to the conjugate gradient method. We shall show that the estimator achieves the (nearly) optimal convergence rate on a class of ellipsoids and we introduce an early stopping rule which adapts to the unknown degree of ill-posedness. Some theoretical and simulation comparison between the estimator and the principal component regression estimator is provided."}, "https://arxiv.org/abs/2402.11219": {"title": "Estimators for multivariate allometric regression model", "link": "https://arxiv.org/abs/2402.11219", "description": "arXiv:2402.11219v1 Announce Type: cross \nAbstract: In a regression model with multiple response variables and multiple explanatory variables, if the difference of the mean vectors of the response variables for different values of explanatory variables is always in the direction of the first principal eigenvector of the covariance matrix of the response variables, then it is called a multivariate allometric regression model. This paper studies the estimation of the first principal eigenvector in the multivariate allometric regression model. A class of estimators that includes conventional estimators is proposed based on weighted sums of the regression sum-of-squares matrix and the residual sum-of-squares matrix. We establish an upper bound of the mean squared error of the estimators contained in this class, and the weight value minimizing the upper bound is derived. Sufficient conditions for the consistency of the estimators are discussed in weak identifiability regimes under which the difference of the largest and second largest eigenvalues of the covariance matrix decays asymptotically and in ``large $p$, large $n$\" regimes, where $p$ is the number of response variables and $n$ is the sample size. Several numerical results are also presented."}, "https://arxiv.org/abs/2402.11228": {"title": "Adaptive Split Balancing for Optimal Random Forest", "link": "https://arxiv.org/abs/2402.11228", "description": "arXiv:2402.11228v1 Announce Type: cross \nAbstract: While random forests are commonly used for regression problems, existing methods often lack adaptability in complex situations or lose optimality under simple, smooth scenarios. 
In this study, we introduce the adaptive split balancing forest (ASBF), capable of learning tree representations from data while simultaneously achieving minimax optimality under the Lipschitz class. To exploit higher-order smoothness levels, we further propose a localized version that attains the minimax rate under the H\\\"older class $\\mathcal{H}^{q,\\beta}$ for any $q\\in\\mathbb{N}$ and $\\beta\\in(0,1]$. Rather than relying on the widely-used random feature selection, we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal results. Conversely, a less random, more balanced approach demonstrates optimality. Additionally, we establish uniform upper bounds and explore the application of random forests in average treatment effect estimation problems. Through simulation studies and real-data applications, we demonstrate the superior empirical performance of the proposed methods over existing random forests."}, "https://arxiv.org/abs/2402.11394": {"title": "Maximal Inequalities for Empirical Processes under General Mixing Conditions with an Application to Strong Approximations", "link": "https://arxiv.org/abs/2402.11394", "description": "arXiv:2402.11394v1 Announce Type: cross \nAbstract: This paper provides a bound for the supremum of sample averages over a class of functions for a general class of mixing stochastic processes with arbitrary mixing rates. Regardless of the speed of mixing, the bound is comprised of a concentration rate and a novel measure of complexity. The speed of mixing, however, affects the former quantity implying a phase transition. Fast mixing leads to the standard root-n concentration rate, while slow mixing leads to a slower concentration rate, its speed depends on the mixing structure. Our findings are applied to derive strong approximation results for a general class of mixing processes with arbitrary mixing rates."}, "https://arxiv.org/abs/2402.11771": {"title": "Evaluating the Effectiveness of Index-Based Treatment Allocation", "link": "https://arxiv.org/abs/2402.11771", "description": "arXiv:2402.11771v1 Announce Type: cross \nAbstract: When resources are scarce, an allocation policy is needed to decide who receives a resource. This problem occurs, for instance, when allocating scarce medical resources and is often solved using modern ML methods. This paper introduces methods to evaluate index-based allocation policies -- that allocate a fixed number of resources to those who need them the most -- by using data from a randomized control trial. Such policies create dependencies between agents, which render the assumptions behind standard statistical tests invalid and limit the effectiveness of estimators. Addressing these challenges, we translate and extend recent ideas from the statistics literature to present an efficient estimator and methods for computing asymptotically correct confidence intervals. This enables us to effectively draw valid statistical conclusions, a critical gap in previous work. Our extensive experiments validate our methodology in practical settings, while also showcasing its statistical power. 
We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial to evaluate different ML allocation policies in the context of a mHealth program, drawing previously invisible conclusions."}, "https://arxiv.org/abs/2109.08351": {"title": "Regression Discontinuity Design with Potentially Many Covariates", "link": "https://arxiv.org/abs/2109.08351", "description": "arXiv:2109.08351v4 Announce Type: replace \nAbstract: This paper studies the case of possibly high-dimensional covariates in the regression discontinuity design (RDD) analysis. In particular, we propose estimation and inference methods for the RDD models with covariate selection which perform stably regardless of the number of covariates. The proposed methods combine the local approach using kernel weights with $\\ell_{1}$-penalization to handle high-dimensional covariates. We provide theoretical and numerical results which illustrate the usefulness of the proposed methods. Theoretically, we present risk and coverage properties for our point estimation and inference methods, respectively. Under certain special case, the proposed estimator becomes more efficient than the conventional covariate adjusted estimator at the cost of an additional sparsity condition. Numerically, our simulation experiments and empirical example show the robust behaviors of the proposed methods to the number of covariates in terms of bias and variance for point estimation and coverage probability and interval length for inference."}, "https://arxiv.org/abs/2110.13761": {"title": "Regime-Switching Density Forecasts Using Economists' Scenarios", "link": "https://arxiv.org/abs/2110.13761", "description": "arXiv:2110.13761v2 Announce Type: replace \nAbstract: We propose an approach for generating macroeconomic density forecasts that incorporate information on multiple scenarios defined by experts. We adopt a regime-switching framework in which sets of scenarios (\"views\") are used as Bayesian priors on economic regimes. Predictive densities coming from different views are then combined by optimizing objective functions of density forecasting. We illustrate the approach with an empirical application to quarterly real-time forecasts of U.S. GDP growth, in which we exploit the Fed's macroeconomic scenarios used for bank stress tests. We show that the approach achieves good accuracy in terms of average predictive scores and good calibration of forecast distributions. Moreover, it can be used to evaluate the contribution of economists' scenarios to density forecast performance."}, "https://arxiv.org/abs/2203.08014": {"title": "Non-Existent Moments of Earnings Growth", "link": "https://arxiv.org/abs/2203.08014", "description": "arXiv:2203.08014v3 Announce Type: replace \nAbstract: The literature often employs moment-based earnings risk measures like variance, skewness, and kurtosis. However, under heavy-tailed distributions, these moments may not exist in the population. Our empirical analysis reveals that population kurtosis, skewness, and variance often do not exist for the conditional distribution of earnings growth. This challenges moment-based analyses. We propose robust conditional Pareto exponents as novel earnings risk measures, developing estimation and inference methods. 
Using the UK New Earnings Survey Panel Dataset (NESPD) and US Panel Study of Income Dynamics (PSID), we find: 1) Moments often fail to exist; 2) Earnings risk increases over the life cycle; 3) Job stayers face higher earnings risk; 4) These patterns persist during the 2007--2008 recession and the 2015--2016 positive growth period."}, "https://arxiv.org/abs/2206.08051": {"title": "Voronoi Density Estimator for High-Dimensional Data: Computation, Compactification and Convergence", "link": "https://arxiv.org/abs/2206.08051", "description": "arXiv:2206.08051v2 Announce Type: replace \nAbstract: The Voronoi Density Estimator (VDE) is an established density estimation technique that adapts to the local geometry of data. However, its applicability has been so far limited to problems in two and three dimensions. This is because Voronoi cells rapidly increase in complexity as dimensions grow, making the necessary explicit computations infeasible. We define a variant of the VDE deemed Compactified Voronoi Density Estimator (CVDE), suitable for higher dimensions. We propose computationally efficient algorithms for numerical approximation of the CVDE and formally prove convergence of the estimated density to the original one. We implement and empirically validate the CVDE through a comparison with the Kernel Density Estimator (KDE). Our results indicate that the CVDE outperforms the KDE on sound and image data."}, "https://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "https://arxiv.org/abs/2302.13066", "description": "arXiv:2302.13066v4 Announce Type: replace \nAbstract: Different proxy variables used in fiscal policy SVARs lead to contradicting conclusions regarding the size of fiscal multipliers. We show that the conflicting results are due to violations of the exogeneity assumptions, i.e. the commonly used proxies are endogenously related to the structural shocks. We propose a novel approach to include proxy variables into a Bayesian non-Gaussian SVAR, tailored to accommodate for potentially endogenous proxy variables. Using our model, we show that increasing government spending is a more effective tool to stimulate the economy than reducing taxes."}, "https://arxiv.org/abs/2304.09046": {"title": "On clustering levels of a hierarchical categorical risk factor", "link": "https://arxiv.org/abs/2304.09046", "description": "arXiv:2304.09046v2 Announce Type: replace \nAbstract: Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. 
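The earnings-growth abstract above (arXiv:2203.08014) concerns Pareto tail exponents; the standard Hill estimator below illustrates how such an exponent is read off the largest order statistics. It is an unconditional textbook estimator, not the paper's robust conditional measure.

import numpy as np

def hill_tail_index(x, k):
    # Hill estimator of the Pareto tail exponent from the k largest
    # observations of a positive sample x.
    x = np.sort(np.asarray(x, dtype=float))
    top = x[-k:]            # k largest order statistics
    threshold = x[-k - 1]   # (k+1)-th largest, used as the tail threshold
    return 1.0 / np.mean(np.log(top / threshold))

# Example: draws from a Pareto law with tail exponent 2, for which the
# second and higher moments do not exist (so the variance is undefined).
rng = np.random.default_rng(0)
sample = (1.0 - rng.uniform(size=100_000)) ** (-1.0 / 2.0)  # inverse-CDF draw
print(hill_tail_index(sample, k=1_000))  # close to 2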
In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies."}, "https://arxiv.org/abs/2306.12003": {"title": "Difference-in-Differences with Interference", "link": "https://arxiv.org/abs/2306.12003", "description": "arXiv:2306.12003v4 Announce Type: replace \nAbstract: In many scenarios, such as the evaluation of place-based policies, potential outcomes are not only dependent upon the unit's own treatment but also its neighbors' treatment. Despite this, \"difference-in-differences\" (DID) type estimators typically ignore such interference among neighbors. I show in this paper that the canonical DID estimators generally fail to identify interesting causal effects in the presence of neighborhood interference. To incorporate interference structure into DID estimation, I propose doubly robust estimators for the direct average treatment effect on the treated as well as the average spillover effects under a modified parallel trends assumption. The approach in this paper relaxes common restrictions in the literature, such as partial interference and correctly specified spillover functions. Moreover, robust inference is discussed based on the asymptotic distribution of the proposed estimators."}, "https://arxiv.org/abs/2309.16129": {"title": "Counterfactual Density Estimation using Kernel Stein Discrepancies", "link": "https://arxiv.org/abs/2309.16129", "description": "arXiv:2309.16129v2 Announce Type: replace \nAbstract: Causal effects are usually studied in terms of the means of counterfactual distributions, which may be insufficient in many scenarios. Given a class of densities known up to normalizing constants, we propose to model counterfactual distributions by minimizing kernel Stein discrepancies in a doubly robust manner. This enables the estimation of counterfactuals over large classes of distributions while exploiting the desired double robustness. We present a theoretical analysis of the proposed estimator, providing sufficient conditions for consistency and asymptotic normality, as well as an examination of its empirical performance."}, "https://arxiv.org/abs/2311.14054": {"title": "Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2311.14054", "description": "arXiv:2311.14054v2 Announce Type: replace \nAbstract: Between 2011 and 2014 NHANES collected objectively measured physical activity data using wrist-worn accelerometers for tens of thousands of individuals for up to seven days. Here we analyze the minute-level indicators of being active, which can be viewed as binary (because there is an active indicator at every minute), multilevel (because there are multiple days of data for each study participant), functional (because within-day data can be viewed as a function of time) data. To extract within- and between-participant directions of variation in the data, we introduce Generalized Multilevel Functional Principal Component Analysis (GM-FPCA), an approach based on the dimension reduction of the linear predictor. 
Scores associated with specific patterns of activity are shown to be strongly associated with time to death. In particular, we confirm that increased activity is associated with time to death, a result that has been reported on other data sets. In addition, our method shows the previously unreported finding that maintaining a consistent day-to-day routine is strongly associated with a reduced risk of mortality (p-value $< 0.001$) even after adjusting for traditional risk factors. Extensive simulation studies indicate that GM-FPCA provides accurate estimation of model parameters, is computationally stable, and is scalable in the number of study participants, visits, and observations within visits. R code for implementing the method is provided."}, "https://arxiv.org/abs/2401.10235": {"title": "Semi-parametric local variable selection under misspecification", "link": "https://arxiv.org/abs/2401.10235", "description": "arXiv:2401.10235v2 Announce Type: replace \nAbstract: Local variable selection aims to discover localized effects by assessing the impact of covariates on outcomes within specific regions defined by other covariates. We outline some challenges of local variable selection in the presence of non-linear relationships and model misspecification. Specifically, we highlight a potential drawback of common semi-parametric methods: even slight model misspecification can result in a high rate of false positives. To address these shortcomings, we propose a methodology based on orthogonal cut splines that achieves consistent local variable selection in high-dimensional scenarios. Our approach offers simplicity, handles both continuous and discrete covariates, and provides theory for high-dimensional covariates and model misspecification. We discuss settings with either independent or dependent data. Our proposal allows including adjustment covariates that do not undergo selection, enhancing flexibility in modeling complex scenarios. We illustrate its application in simulation studies with both independent and functional data, as well as with two real datasets. One dataset evaluates salary gaps associated with discrimination factors at different ages, while the other examines the effects of covariates on brain activation over time. The approach is implemented in the R package mombf."}, "https://arxiv.org/abs/1910.08597": {"title": "Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic", "link": "https://arxiv.org/abs/1910.08597", "description": "arXiv:1910.08597v5 Announce Type: replace-cross \nAbstract: This paper proposes SplitSGD, a new dynamic learning rate schedule for stochastic optimization. This method decreases the learning rate for better adaptation to the local geometry of the objective function whenever a stationary phase is detected, that is, the iterates are likely to bounce at around a vicinity of a local minimum. The detection is performed by splitting the single thread into two and using the inner product of the gradients from the two threads as a measure of stationarity. Owing to this simple yet provably valid stationarity detection, SplitSGD is easy-to-implement and essentially does not incur additional computational cost than standard SGD. Through a series of extensive experiments, we show that this method is appropriate for both convex problems and training (non-convex) neural networks, with performance compared favorably to other stochastic optimization methods. 
Importantly, this method is observed to be very robust with a set of default parameters for a wide range of problems and, moreover, can yield better generalization performance than other adaptive gradient methods such as Adam."}, "https://arxiv.org/abs/2007.02192": {"title": "Tail-adaptive Bayesian shrinkage", "link": "https://arxiv.org/abs/2007.02192", "description": "arXiv:2007.02192v4 Announce Type: replace-cross \nAbstract: Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation method under diverse sparsity regimes, which has a tail-adaptive shrinkage property. In this property, the tail-heaviness of the prior adjusts adaptively, becoming larger or smaller as the sparsity level increases or decreases, respectively, to accommodate more or fewer signals, a posteriori. We propose a global-local-tail (GLT) Gaussian mixture distribution that ensures this property. We examine the role of the tail-index of the prior in relation to the underlying sparsity level and demonstrate that the GLT posterior contracts at the minimax optimal rate for sparse normal mean models. We apply both the GLT prior and the Horseshoe prior to a real data problem and simulation examples. Our findings indicate that the varying tail rule based on the GLT prior offers advantages over a fixed tail rule based on the Horseshoe prior in diverse sparsity regimes."}, "https://arxiv.org/abs/2109.02726": {"title": "Screening the Discrepancy Function of a Computer Model", "link": "https://arxiv.org/abs/2109.02726", "description": "arXiv:2109.02726v2 Announce Type: replace-cross \nAbstract: Screening traditionally refers to the problem of detecting active inputs in the computer model. In this paper, we develop methodology that applies to screening, but the main focus is on detecting active inputs not in the computer model itself but rather on the discrepancy function that is introduced to account for model inadequacy when linking the computer model with field observations. We contend this is an important problem as it informs the modeler which are the inputs that are potentially being mishandled in the model, but also along which directions it may be less recommendable to use the model for prediction. The methodology is Bayesian and is inspired by the continuous spike and slab prior popularized by the literature on Bayesian variable selection. In our approach, and in contrast with previous proposals, a single MCMC sample from the full model allows us to compute the posterior probabilities of all the competing models, resulting in a methodology that is computationally very fast. The approach hinges on the ability to obtain posterior inclusion probabilities of the inputs, which are very intuitive and easy to interpret quantities, as the basis for selecting active inputs. 
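The SplitSGD abstract above (arXiv:1910.08597) describes splitting one SGD thread into two and using cross-thread gradient inner products to detect stationarity. The sketch below captures that mechanism only; the round lengths, threshold, and decay factor are placeholder values, not the paper's defaults.

import numpy as np

def splitsgd(grad, x0, lr=0.1, rounds=5, thread_len=100, decay=0.5, q=0.4,
             rng=np.random.default_rng(0)):
    # grad(x, rng) is a user-supplied stochastic gradient oracle at x.
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        # Split into two threads started from the same iterate.
        x1, x2 = x.copy(), x.copy()
        negatives = 0
        for _ in range(thread_len):
            g1, g2 = grad(x1, rng), grad(x2, rng)
            negatives += np.dot(g1, g2) < 0  # sign of the cross-thread inner product
            x1 -= lr * g1
            x2 -= lr * g2
        # Merge threads; if the gradients mostly disagree in direction, the
        # iterates are bouncing around a minimum, so shrink the learning rate.
        x = 0.5 * (x1 + x2)
        if negatives / thread_len > q:
            lr *= decay
    return x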
For that reason, we name the methodology PIPS -- posterior inclusion probability screening."}, "https://arxiv.org/abs/2210.02171": {"title": "A uniform kernel trick for high-dimensional two-sample problems", "link": "https://arxiv.org/abs/2210.02171", "description": "arXiv:2210.02171v2 Announce Type: replace-cross \nAbstract: We use a suitable version of the so-called \"kernel trick\" to devise two-sample (homogeneity) tests, especially focussed on high-dimensional and functional data. Our proposal entails a simplification related to the important practical problem of selecting an appropriate kernel function. Specifically, we apply a uniform variant of the kernel trick which involves the supremum within a class of kernel-based distances. We obtain the asymptotic distribution (under the null and alternative hypotheses) of the test statistic. The proofs rely on empirical processes theory, combined with the delta method and Hadamard (directional) differentiability techniques, and functional Karhunen-Lo\\`eve-type expansions of the underlying processes. This methodology has some advantages over other standard approaches in the literature. We also give some experimental insight into the performance of our proposal compared to the original kernel-based approach \\cite{Gretton2007} and the test based on energy distances \\cite{Szekely-Rizzo-2017}."}, "https://arxiv.org/abs/2303.01385": {"title": "Hyperlink communities in higher-order networks", "link": "https://arxiv.org/abs/2303.01385", "description": "arXiv:2303.01385v3 Announce Type: replace-cross \nAbstract: Many networks can be characterised by the presence of communities, which are groups of units that are closely linked. Identifying these communities can be crucial for understanding the system's overall function. Recently, hypergraphs have emerged as a fundamental tool for modelling systems where interactions are not limited to pairs but may involve an arbitrary number of nodes. In this study, we adopt a dual approach to community detection and extend the concept of link communities to hypergraphs. This extension allows us to extract informative clusters of highly related hyperedges. We analyze the dendrograms obtained by applying hierarchical clustering to distance matrices among hyperedges across a variety of real-world data, showing that hyperlink communities naturally highlight the hierarchical and multiscale structure of higher-order networks. Moreover, hyperlink communities enable us to extract overlapping memberships from nodes, overcoming limitations of traditional hard clustering methods. Finally, we introduce higher-order network cartography as a practical tool for categorizing nodes into different structural roles based on their interaction patterns and community participation. This approach aids in identifying different types of individuals in a variety of real-world social systems. Our work contributes to a better understanding of the structural organization of real-world higher-order systems."}, "https://arxiv.org/abs/2308.07480": {"title": "Order-based Structure Learning with Normalizing Flows", "link": "https://arxiv.org/abs/2308.07480", "description": "arXiv:2308.07480v2 Announce Type: replace-cross \nAbstract: Estimating the causal structure of observational data is a challenging combinatorial search problem that scales super-exponentially with graph size. 
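For the uniform kernel trick of arXiv:2210.02171, a crude illustration is to take the largest squared MMD over a finite grid of Gaussian-kernel bandwidths; the grid, the biased MMD estimate, and calibration by permutation are all simplifying assumptions rather than the authors' construction.

import numpy as np

def mmd2(x, y, sigma):
    # Biased squared MMD with a Gaussian kernel of bandwidth sigma.
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def sup_mmd(x, y, sigmas=(0.25, 0.5, 1.0, 2.0, 4.0)):
    # Supremum of the kernel distance over a (finite) class of kernels,
    # standing in for the uniform kernel-trick statistic.
    return max(mmd2(x, y, s) for s in sigmas)

# A permutation test on sup_mmd(x, y) can then be used for calibration.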
Existing methods use continuous relaxations to make this problem computationally tractable but often restrict the data-generating process to additive noise models (ANMs) through explicit or implicit assumptions. We present Order-based Structure Learning with Normalizing Flows (OSLow), a framework that relaxes these assumptions using autoregressive normalizing flows. We leverage the insight that searching over topological orderings is a natural way to enforce acyclicity in structure discovery and propose a novel, differentiable permutation learning method to find such orderings. Through extensive experiments on synthetic and real-world data, we demonstrate that OSLow outperforms prior baselines and improves performance on the observational Sachs and SynTReN datasets as measured by structural hamming distance and structural intervention distance, highlighting the importance of relaxing the ANM assumption made by existing methods."}, "https://arxiv.org/abs/2309.02468": {"title": "Ab initio uncertainty quantification in scattering analysis of microscopy", "link": "https://arxiv.org/abs/2309.02468", "description": "arXiv:2309.02468v2 Announce Type: replace-cross \nAbstract: Estimating parameters from data is a fundamental problem in physics, customarily done by minimizing a loss function between a model and observed statistics. In scattering-based analysis, researchers often employ their domain expertise to select a specific range of wavevectors for analysis, a choice that can vary depending on the specific case. We introduce another paradigm that defines a probabilistic generative model from the beginning of data processing and propagates the uncertainty for parameter estimation, termed ab initio uncertainty quantification (AIUQ). As an illustrative example, we demonstrate this approach with differential dynamic microscopy (DDM) that extracts dynamical information through Fourier analysis at a selected range of wavevectors. We first show that DDM is equivalent to fitting a temporal variogram in the reciprocal space using a latent factor model as the generative model. Then we derive the maximum marginal likelihood estimator, which optimally weighs information at all wavevectors, therefore eliminating the need to select the range of wavevectors. Furthermore, we substantially reduce the computational cost by utilizing the generalized Schur algorithm for Toeplitz covariances without approximation. Simulated studies validate that AIUQ significantly improves estimation accuracy and enables model selection with automated analysis. The utility of AIUQ is also demonstrated by three distinct sets of experiments: first in an isotropic Newtonian fluid, pushing limits of optically dense systems compared to multiple particle tracking; next in a system undergoing a sol-gel transition, automating the determination of gelling points and critical exponent; and lastly, in discerning anisotropic diffusive behavior of colloids in a liquid crystal. These outcomes collectively underscore AIUQ's versatility to capture system dynamics in an efficient and automated manner."}, "https://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "https://arxiv.org/abs/2310.03114", "description": "arXiv:2310.03114v2 Announce Type: replace-cross \nAbstract: In this article we consider Bayesian parameter inference for a type of partially observed stochastic Volterra equation (SVE). SVEs are found in many areas such as physics and mathematical finance. 
In the latter field they can be used to represent long memory in unobserved volatility processes. In many cases of practical interest, SVEs must be time-discretized and then parameter inference is based upon the posterior associated to this time-discretized process. Based upon recent studies on time-discretization of SVEs (e.g. Richard et al. 2021), we use Euler-Maruyama methods for the aforementioned discretization. We then show how multilevel Markov chain Monte Carlo (MCMC) methods (Jasra et al. 2018) can be applied in this context. In the examples we study, we give a proof that shows that the cost to achieve a mean square error (MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is $\\mathcal{O}(\\epsilon^{-\\tfrac{4}{2H+1}})$, where $H$ is the Hurst parameter. If one uses a single level MCMC method then the cost is $\\mathcal{O}(\\epsilon^{-\\tfrac{2(2H+3)}{2H+1}})$ to achieve the same MSE. We illustrate these results in the context of state-space and stochastic volatility models, with the latter applied to real data."}, "https://arxiv.org/abs/2310.11122": {"title": "Sensitivity-Aware Amortized Bayesian Inference", "link": "https://arxiv.org/abs/2310.11122", "description": "arXiv:2310.11122v4 Announce Type: replace-cross \nAbstract: Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions."}, "https://arxiv.org/abs/2402.12548": {"title": "Composite likelihood inference for space-time point processes", "link": "https://arxiv.org/abs/2402.12548", "description": "arXiv:2402.12548v1 Announce Type: new \nAbstract: The dynamics of a rain forest is extremely complex involving births, deaths and growth of trees with complex interactions between trees, animals, climate, and environment. We consider the patterns of recruits (new trees) and dead trees between rain forest censuses. For a current census we specify regression models for the conditional intensity of recruits and the conditional probabilities of death given the current trees and spatial covariates. We estimate regression parameters using conditional composite likelihood functions that only involve the conditional first order properties of the data. 
When constructing assumption lean estimators of covariance matrices of parameter estimates we only need mild assumptions of decaying conditional correlations in space while assumptions regarding correlations over time are avoided by exploiting conditional centering of composite likelihood score functions. Time series of point patterns from rain forest censuses are quite short while each point pattern covers a fairly big spatial region. To obtain asymptotic results we therefore use a central limit theorem for the fixed timespan - increasing spatial domain asymptotic setting. This also allows us to handle the challenge of using stochastic covariates constructed from past point patterns. Conveniently, it suffices to impose weak dependence assumptions on the innovations of the space-time process. We investigate the proposed methodology by simulation studies and applications to rain forest data."}, "https://arxiv.org/abs/2402.12555": {"title": "Optimal Dynamic Treatment Regime Estimation in the Presence of Nonadherence", "link": "https://arxiv.org/abs/2402.12555", "description": "arXiv:2402.12555v1 Announce Type: new \nAbstract: Dynamic treatment regimes (DTRs) are sequences of functions that formalize the process of precision medicine. DTRs take as input patient information and output treatment recommendations. A major focus of the DTR literature has been on the estimation of optimal DTRs, the sequences of decision rules that result in the best outcome in expectation, across the complete population were they to be applied. While there is a rich literature on optimal DTR estimation, to date there has been minimal consideration of the impacts of nonadherence on these estimation procedures. Nonadherence refers to any process through which an individual's prescribed treatment does not match their true treatment. We explore the impacts of nonadherence and demonstrate that generally, when nonadherence is ignored, suboptimal regimes will be estimated. In light of these findings we propose a method for estimating optimal DTRs in the presence of nonadherence. The resulting estimators are consistent and asymptotically normal, with a double robustness property. Using simulations we demonstrate the reliability of these results, and illustrate comparable performance between the proposed estimation procedure adjusting for the impacts of nonadherence and estimators that are computed on data without nonadherence."}, "https://arxiv.org/abs/2402.12576": {"title": "Understanding Difference-in-differences methods to evaluate policy effects with staggered adoption: an application to Medicaid and HIV", "link": "https://arxiv.org/abs/2402.12576", "description": "arXiv:2402.12576v1 Announce Type: new \nAbstract: While a randomized control trial is considered the gold standard for estimating causal treatment effects, there are many research settings in which randomization is infeasible or unethical. In such cases, researchers rely on analytical methods for observational data to explore causal relationships. Difference-in-differences (DID) is one such method that, most commonly, estimates a difference in some mean outcome in a group before and after the implementation of an intervention or policy and compares this with a control group followed over the same time (i.e., a group that did not implement the intervention or policy). 
Although DID modeling approaches have been gaining popularity in public health research, the majority of these approaches and their extensions are developed and shared within the economics literature. While extensions of DID modeling approaches may be straightforward to apply to observational data in any field, the complexities and assumptions involved in newer approaches are often misunderstood. In this paper, we focus on recent extensions of the DID method and their relationships to linear models in the setting of staggered treatment adoption over multiple years. We detail the identification and estimation of the average treatment effect among the treated using potential outcomes notation, highlighting the assumptions necessary to produce valid estimates. These concepts are described within the context of Medicaid expansion and retention in care among people living with HIV (PWH) in the United States. While each DID approach is potentially valid, understanding their different assumptions and choosing an appropriate method can have important implications for policy-makers, funders, and public health as a whole."}, "https://arxiv.org/abs/2402.12583": {"title": "Non-linear Triple Changes Estimator for Targeted Policies", "link": "https://arxiv.org/abs/2402.12583", "description": "arXiv:2402.12583v1 Announce Type: new \nAbstract: The renowned difference-in-differences (DiD) estimator relies on the assumption of 'parallel trends,' which does not hold in many practical applications. To address this issue, the econometrics literature has turned to the triple difference estimator. Both DiD and triple difference are limited to assessing average effects exclusively. An alternative avenue is offered by the changes-in-changes (CiC) estimator, which provides an estimate of the entire counterfactual distribution at the cost of relying on (stronger) distributional assumptions. In this work, we extend the triple difference estimator to accommodate the CiC framework, presenting the `triple changes estimator' and its identification assumptions, thereby expanding the scope of the CiC paradigm. Subsequently, we empirically evaluate the proposed framework and apply it to a study examining the impact of Medicaid expansion on children's preventive care."}, "https://arxiv.org/abs/2402.12607": {"title": "Inference on LATEs with covariates", "link": "https://arxiv.org/abs/2402.12607", "description": "arXiv:2402.12607v1 Announce Type: new \nAbstract: In theory, two-stage least squares (TSLS) identifies a weighted average of covariate-specific local average treatment effects (LATEs) from a saturated specification without making parametric assumptions on how available covariates enter the model. In practice, TSLS is severely biased when saturation leads to a number of control dummies that is of the same order of magnitude as the sample size, and the use of many, arguably weak, instruments. This paper derives asymptotically valid tests and confidence intervals for an estimand that identifies the weighted average of LATEs targeted by saturated TSLS, even when the number of control dummies and instrument interactions is large. 
The proposed inference procedure is robust against four key features of saturated economic data: treatment effect heterogeneity, covariates with rich support, weak identification strength, and conditional heteroskedasticity."}, "https://arxiv.org/abs/2402.12710": {"title": "Integrating Active Learning in Causal Inference with Interference: A Novel Approach in Online Experiments", "link": "https://arxiv.org/abs/2402.12710", "description": "arXiv:2402.12710v1 Announce Type: new \nAbstract: In the domain of causal inference research, the prevalent potential outcomes framework, notably the Rubin Causal Model (RCM), often overlooks individual interference and assumes independent treatment effects. This assumption, however, is frequently misaligned with the intricate realities of real-world scenarios, where interference is not merely a possibility but a common occurrence. Our research endeavors to address this discrepancy by focusing on the estimation of direct and spillover treatment effects under two assumptions: (1) network-based interference, where treatments on neighbors within connected networks affect one's outcomes, and (2) non-random treatment assignments influenced by confounders. To improve the efficiency of estimating potentially complex effects functions, we introduce a novel active learning approach: Active Learning in Causal Inference with Interference (ACI). This approach uses a Gaussian process to flexibly model the direct and spillover treatment effects as a function of a continuous measure of neighbors' treatment assignment. The ACI framework sequentially identifies the experimental settings that demand further data. It further optimizes the treatment assignments under the network interference structure using genetic algorithms to achieve efficient learning outcomes. By applying our method to simulation data and a Tencent game dataset, we demonstrate its feasibility in achieving accurate effects estimations with reduced data requirements. This ACI approach marks a significant advancement in the realm of data efficiency for causal inference, offering a robust and efficient alternative to traditional methodologies, particularly in scenarios characterized by complex interference patterns."}, "https://arxiv.org/abs/2402.12719": {"title": "Restricted maximum likelihood estimation in generalized linear mixed models", "link": "https://arxiv.org/abs/2402.12719", "description": "arXiv:2402.12719v1 Announce Type: new \nAbstract: Restricted maximum likelihood (REML) estimation is a widely accepted and frequently used method for fitting linear mixed models, with its principal advantage being that it produces less biased estimates of the variance components. However, the concept of REML does not immediately generalize to the setting of non-normally distributed responses, and it is not always clear the extent to which, either asymptotically or in finite samples, such generalizations reduce the bias of variance component estimates compared to standard unrestricted maximum likelihood estimation. In this article, we review various attempts that have been made over the past four decades to extend REML estimation in generalized linear mixed models. We establish four major classes of approaches, namely approximate linearization, integrated likelihood, modified profile likelihoods, and direct bias correction of the score function, and show that while these four classes may have differing motivations and derivations, they often arrive at a similar if not the same REML estimate. 
We compare the finite sample performance of these four classes through a numerical study involving binary and count data, with results demonstrating that they perform similarly well in reducing the finite sample bias of variance components."}, "https://arxiv.org/abs/2402.12724": {"title": "Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression", "link": "https://arxiv.org/abs/2402.12724", "description": "arXiv:2402.12724v1 Announce Type: new \nAbstract: Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power."}, "https://arxiv.org/abs/2402.12803": {"title": "Joint Mean and Correlation Regression Models for Multivariate Data", "link": "https://arxiv.org/abs/2402.12803", "description": "arXiv:2402.12803v1 Announce Type: new \nAbstract: We propose a new joint mean and correlation regression model for correlated multivariate discrete responses, that simultaneously regresses the mean of each response against a set of covariates, and the correlations between responses against a set of similarity/distance measures. A set of joint estimating equations are formulated to construct an estimator of both the mean regression coefficients and the correlation regression parameters. Under a general setting where the number of responses can tend to infinity, the joint estimator is demonstrated to be consistent and asymptotically normally distributed, with differing rates of convergence due to the mean regression coefficients being heterogeneous across responses. An iterative estimation procedure is developed to obtain parameter estimates in the required, constrained parameter space. We apply the proposed model to a multivariate abundance dataset comprising overdispersed counts of 38 Carabidae ground beetle species sampled throughout Scotland, along with information about the environmental conditions of each site and the traits of each species. Results show in particular that the relationships between the mean abundances of various beetle species and environmental covariates are different and that beetle total length has statistically important effect in driving the correlations between the species. Simulations demonstrate the strong finite sample performance of the proposed estimator in terms of point estimation and inference."}, "https://arxiv.org/abs/2402.12825": {"title": "On scalable ARMA models", "link": "https://arxiv.org/abs/2402.12825", "description": "arXiv:2402.12825v1 Announce Type: new \nAbstract: This paper considers both the least squares and quasi-maximum likelihood estimation for the recently proposed scalable ARMA model, a parametric infinite-order vector AR model, and their asymptotic normality is also established. 
It makes feasible the inference on this computationally efficient model, especially for financial time series. An efficient block coordinate descent algorithm is further introduced to search for estimates, and a Bayesian information criterion is suggested for model selection. Simulation experiments are conducted to illustrate their finite sample performance, and a real application on six macroeconomic indicators illustrates the usefulness of the proposed methodology."}, "https://arxiv.org/abs/2402.12838": {"title": "Extending the Scope of Inference About Predictive Ability to Machine Learning Methods", "link": "https://arxiv.org/abs/2402.12838", "description": "arXiv:2402.12838v1 Announce Type: new \nAbstract: Though out-of-sample forecast evaluation is systematically employed with modern machine learning methods and there exists a well-established classic inference theory for predictive ability, see, e.g., West (1996, Asymptotic Inference About Predictive Ability, \\textit{Econometrica}, 64, 1067-1084), such theory is not directly applicable to modern machine learners such as the Lasso in the high dimensional setting. We investigate under which conditions such extensions are possible. Two key properties for standard out-of-sample asymptotic inference to be valid with machine learning are (i) a zero-mean condition for the score of the prediction loss function; and (ii) a fast rate of convergence for the machine learner. Monte Carlo simulations confirm our theoretical findings. For accurate finite sample inferences with machine learning, we recommend a small out-of-sample vs in-sample size ratio. We illustrate the wide applicability of our results with a new out-of-sample test for the Martingale Difference Hypothesis (MDH). We obtain the asymptotic null distribution of our test and use it to evaluate"}, "https://arxiv.org/abs/2402.12866": {"title": "On new tests for the Poisson distribution based on empirical weight functions", "link": "https://arxiv.org/abs/2402.12866", "description": "arXiv:2402.12866v1 Announce Type: new \nAbstract: We propose new goodness-of-fit tests for the Poisson distribution. The testing procedure entails fitting a weighted Poisson distribution, which has the Poisson as a special case, to observed data. Based on sample data, we calculate an empirical weight function which is compared to its theoretical counterpart under the Poisson assumption. Weighted Lp distances between these empirical and theoretical functions are proposed as test statistics and closed form expressions are derived for L1, L2 and L1 distances. A Monte Carlo study is included in which the newly proposed tests are shown to be powerful when compared to existing tests, especially in the case of overdispersed alternatives. We demonstrate the use of the tests with two practical examples."}, "https://arxiv.org/abs/2402.12901": {"title": "Bi-invariant Dissimilarity Measures for Sample Distributions in Lie Groups", "link": "https://arxiv.org/abs/2402.12901", "description": "arXiv:2402.12901v1 Announce Type: new \nAbstract: Data sets sampled in Lie groups are widespread, and as with multivariate data, it is important for many applications to assess the differences between the sets in terms of their distributions. Indices for this task are usually derived by considering the Lie group as a Riemannian manifold. Then, however, compatibility with the group operation is guaranteed only if a bi-invariant metric exists, which is not the case for most non-compact and non-commutative groups. 
We show here that if one considers an affine connection structure instead, one obtains bi-invariant generalizations of well-known dissimilarity measures: a Hotelling $T^2$ statistic, Bhattacharyya distance and Hellinger distance. Each of the dissimilarity measures matches its multivariate counterpart for Euclidean data and is translation-invariant, so that biases, e.g., through an arbitrary choice of reference, are avoided. We further derive non-parametric two-sample tests that are bi-invariant and consistent. We demonstrate the potential of these dissimilarity measures by performing group tests on data of knee configurations and epidemiological shape data. Significant differences are revealed in both cases."}, "https://arxiv.org/abs/2402.13023": {"title": "Bridging Methodologies: Angrist and Imbens' Contributions to Causal Identification", "link": "https://arxiv.org/abs/2402.13023", "description": "arXiv:2402.13023v1 Announce Type: new \nAbstract: In the 1990s, Joshua Angrist and Guido Imbens studied the causal interpretation of Instrumental Variable estimates (a widespread methodology in economics) through the lens of potential outcomes (a classical framework to formalize causality in statistics). Bridging a gap between those two strands of literature, they stress the importance of treatment effect heterogeneity and show that, under defendable assumptions in various applications, this method recovers an average causal effect for a specific subpopulation of individuals whose treatment is affected by the instrument. They were awarded the Nobel Prize primarily for this Local Average Treatment Effect (LATE). The first part of this article presents that methodological contribution in-depth: the origination in earlier applied articles, the different identification results and extensions, and related debates on the relevance of LATEs for public policy decisions. The second part reviews the main contributions of the authors beyond the LATE. J. Angrist has pursued the search for informative and varied empirical research designs in several fields, particularly in education. G. Imbens has complemented the toolbox for treatment effect estimation in many ways, notably through propensity score reweighting, matching, and, more recently, adapting machine learning procedures."}, "https://arxiv.org/abs/2402.13042": {"title": "Not all distributional shifts are equal: Fine-grained robust conformal inference", "link": "https://arxiv.org/abs/2402.13042", "description": "arXiv:2402.13042v1 Announce Type: new \nAbstract: We introduce a fine-grained framework for uncertainty quantification of predictive models under distributional shifts. This framework distinguishes the shift in covariate distributions from that in the conditional relationship between the outcome (Y) and the covariates (X). We propose to reweight the training samples to adjust for an identifiable covariate shift while protecting against worst-case conditional distribution shift bounded in an $f$-divergence ball. Based on ideas from conformal inference and distributionally robust learning, we present an algorithm that outputs (approximately) valid and efficient prediction intervals in the presence of distributional shifts. As a use case, we apply the framework to sensitivity analysis of individual treatment effects with hidden confounding. 
The proposed methods are evaluated in simulation studies and three real data applications, demonstrating superior robustness and efficiency compared with existing benchmarks."}, "https://arxiv.org/abs/2402.12384": {"title": "Comparing MCMC algorithms in Stochastic Volatility Models using Simulation Based Calibration", "link": "https://arxiv.org/abs/2402.12384", "description": "arXiv:2402.12384v1 Announce Type: cross \nAbstract: Simulation Based Calibration (SBC) is applied to analyse two commonly used, competing Markov chain Monte Carlo algorithms for estimating the posterior distribution of a stochastic volatility model. In particular, the bespoke 'off-set mixture approximation' algorithm proposed by Kim, Shephard, and Chib (1998) is explored together with a Hamiltonian Monte Carlo algorithm implemented through Stan. The SBC analysis involves a simulation study to assess whether each sampling algorithm has the capacity to produce valid inference for the correctly specified model, while also characterising statistical efficiency through the effective sample size. Results show that Stan's No-U-Turn sampler, an implementation of Hamiltonian Monte Carlo, produces a well-calibrated posterior estimate while the celebrated off-set mixture approach is less efficient and poorly calibrated, though model parameterisation also plays a role. Limitations and restrictions of generality are discussed."}, "https://arxiv.org/abs/2402.12785": {"title": "Stochastic Graph Heat Modelling for Diffusion-based Connectivity Retrieval", "link": "https://arxiv.org/abs/2402.12785", "description": "arXiv:2402.12785v1 Announce Type: cross \nAbstract: Heat diffusion describes the process by which heat flows from areas with higher temperatures to ones with lower temperatures. This concept was previously adapted to graph structures, whereby heat flows between nodes of a graph depending on the graph topology. Here, we combine the graph heat equation with the stochastic heat equation, which ultimately yields a model for multivariate time signals on a graph. We show theoretically how the model can be used to directly compute the diffusion-based connectivity structure from multivariate signals. Unlike other connectivity measures, our heat model-based approach is inherently multivariate and yields an absolute scaling factor, namely the graph thermal diffusivity, which captures the extent of heat-like graph propagation in the data. On two datasets, we show how the graph thermal diffusivity can be used to characterise Alzheimer's disease. We find that the graph thermal diffusivity is lower for Alzheimer's patients than healthy controls and correlates with dementia scores, suggesting structural impairment in patients in line with previous findings."}, "https://arxiv.org/abs/2402.12980": {"title": "Efficient adjustment for complex covariates: Gaining efficiency with DOPE", "link": "https://arxiv.org/abs/2402.12980", "description": "arXiv:2402.12980v1 Announce Type: cross \nAbstract: Covariate adjustment is a ubiquitous method used to estimate the average treatment effect (ATE) from observational data. Assuming a known graphical structure of the data generating model, recent results give graphical criteria for optimal adjustment, which enables efficient estimation of the ATE. However, graphical approaches are challenging for high-dimensional and complex data, and it is not straightforward to specify a meaningful graphical model of non-Euclidean data such as texts. 
We propose a general framework that accommodates adjustment for any subset of information expressed by the covariates. We generalize prior works and leverage these results to identify the optimal covariate information for efficient adjustment. This information is minimally sufficient for prediction of the outcome conditionally on treatment.\n Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the ATE, and we provide asymptotic results for the DOPE under general conditions. Compared to the augmented inverse propensity weighted (AIPW) estimator, the DOPE can retain its efficiency even when the covariates are highly predictive of treatment. We illustrate this with a single-index model, and with an implementation of the DOPE based on neural networks, we demonstrate its performance on simulated and real data. Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings."}, "https://arxiv.org/abs/2202.12540": {"title": "Familial inference: tests for hypotheses on a family of centres", "link": "https://arxiv.org/abs/2202.12540", "description": "arXiv:2202.12540v3 Announce Type: replace \nAbstract: Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions, often concerning their centre. Tests that assess statistical hypotheses of centre implicitly assume a specific centre, e.g., the mean or median. Yet, scientific hypotheses do not always specify a particular centre. This ambiguity leaves the possibility for a gap between scientific theory and statistical practice that can lead to rejection of a true null. In the face of replicability crises in many scientific disciplines, significant results of this kind are concerning. Rather than testing a single centre, this paper proposes testing a family of plausible centres, such as that induced by the Huber loss function (the Huber family). Each centre in the family generates a testing problem, and the resulting family of hypotheses constitutes a familial hypothesis. A Bayesian nonparametric procedure is devised to test familial hypotheses, enabled by a novel pathwise optimization routine to fit the Huber family. The favourable properties of the new test are demonstrated theoretically and experimentally. Two examples from psychology serve as real-world case studies."}, "https://arxiv.org/abs/2204.10508": {"title": "Identification enhanced generalised linear model estimation with nonignorable missing outcomes", "link": "https://arxiv.org/abs/2204.10508", "description": "arXiv:2204.10508v2 Announce Type: replace \nAbstract: Missing data often result in undesirable bias and loss of efficiency. These become substantial problems when the response mechanism is nonignorable, such that the response model depends on unobserved variables. It is necessary to estimate the joint distribution of unobserved variables and response indicators to manage nonignorable nonresponse. However, model misspecification and identification issues prevent robust estimates despite careful estimation of the target joint distribution. In this study, we modelled the distribution of the observed parts and derived sufficient conditions for model identifiability, assuming a logistic regression model as the response mechanism and generalised linear models as the main outcome model of interest. 
More importantly, the derived sufficient conditions are testable with the observed data and do not require any instrumental variables, which are often assumed to guarantee model identifiability but cannot be practically determined beforehand. To analyse missing data, we propose a new imputation method which incorporates verifiable identifiability using only observed data. Furthermore, we present the performance of the proposed estimators in numerical studies and apply the proposed method to two sets of real data: exit polls for the 19th South Korean election data and public data collected from the Korean Survey of Household Finances and Living Conditions."}, "https://arxiv.org/abs/2206.08541": {"title": "Ensemble distributional forecasting for insurance loss reserving", "link": "https://arxiv.org/abs/2206.08541", "description": "arXiv:2206.08541v4 Announce Type: replace \nAbstract: Loss reserving generally focuses on identifying a single model that can generate superior predictive performance. However, different loss reserving models specialise in capturing different aspects of loss data. This is recognised in practice in the sense that results from different models are often considered, and sometimes combined. For instance, actuaries may take a weighted average of the prediction outcomes from various loss reserving models, often based on subjective assessments.\n In this paper, we propose a systematic framework to objectively combine (i.e. ensemble) multiple _stochastic_ loss reserving models such that the strengths offered by different models can be utilised effectively. Our framework contains two main innovations compared to existing literature and practice. Firstly, our criteria for model combination consider the full distributional properties of the ensemble and not just the central estimate - which is of particular importance in the reserving context. Secondly, our framework is tailored to the features inherent to reserving data. These include, for instance, accident, development, calendar, and claim maturity effects. Crucially, the relative importance and scarcity of data across accident periods renders the problem distinct from the traditional ensembling techniques in statistical learning.\n Our framework is illustrated with a complex synthetic dataset. In the results, the optimised ensemble outperforms both (i) traditional model selection strategies, and (ii) an equally weighted ensemble. In particular, the improvement occurs not only with central estimates but also relevant quantiles, such as the 75th percentile of reserves (typically of interest to both insurers and regulators)."}, "https://arxiv.org/abs/2207.08964": {"title": "Sensitivity analysis for constructing optimal regimes in the presence of treatment non-compliance", "link": "https://arxiv.org/abs/2207.08964", "description": "arXiv:2207.08964v2 Announce Type: replace \nAbstract: The current body of research on developing optimal treatment strategies often places emphasis on intention-to-treat analyses, which fail to take into account the compliance behavior of individuals. Methods based on instrumental variables have been developed to determine optimal treatment strategies in the presence of endogeneity. However, these existing methods are not applicable when there are two active treatment options and the average causal effects of the treatments cannot be identified using a binary instrument. 
In order to address this limitation, we present a procedure that can identify an optimal treatment strategy and the corresponding value function as a function of a vector of sensitivity parameters. Additionally, we derive the canonical gradient of the target parameter and propose a multiply robust classification-based estimator for the optimal treatment strategy. Through simulations, we demonstrate the practical need for and usefulness of our proposed method. We apply our method to a randomized trial on Adaptive Treatment for Alcohol and Cocaine Dependence."}, "https://arxiv.org/abs/2211.07506": {"title": "Type I Tobit Bayesian Additive Regression Trees for Censored Outcome Regression", "link": "https://arxiv.org/abs/2211.07506", "description": "arXiv:2211.07506v4 Announce Type: replace \nAbstract: Censoring occurs when an outcome is unobserved beyond some threshold value. Methods that do not account for censoring produce biased predictions of the unobserved outcome. This paper introduces Type I Tobit Bayesian Additive Regression Tree (TOBART-1) models for censored outcomes. Simulation results and real data applications demonstrate that TOBART-1 produces accurate predictions of censored outcomes. TOBART-1 provides posterior intervals for the conditional expectation and other quantities of interest. The error term distribution can have a large impact on the expectation of the censored outcome. Therefore the error is flexibly modeled as a Dirichlet process mixture of normal distributions."}, "https://arxiv.org/abs/2305.01464": {"title": "Large Global Volatility Matrix Analysis Based on Observation Structural Information", "link": "https://arxiv.org/abs/2305.01464", "description": "arXiv:2305.01464v3 Announce Type: replace \nAbstract: In this paper, we develop a novel large volatility matrix estimation procedure for analyzing global financial markets. Practitioners often use lower-frequency data, such as weekly or monthly returns, to address the issue of different trading hours in the international financial market. However, this approach can lead to inefficiency due to information loss. To mitigate this problem, our proposed method, called Structured Principal Orthogonal complEment Thresholding (Structured-POET), incorporates observation structural information for both global and national factor models. We establish the asymptotic properties of the Structured-POET estimator, and also demonstrate the drawbacks of conventional covariance matrix estimation procedures when using lower-frequency data. Finally, we apply the Structured-POET estimator to an out-of-sample portfolio allocation study using international stock market data."}, "https://arxiv.org/abs/2307.07090": {"title": "Choice Models and Permutation Invariance: Demand Estimation in Differentiated Products Markets", "link": "https://arxiv.org/abs/2307.07090", "description": "arXiv:2307.07090v2 Announce Type: replace \nAbstract: Choice modeling is at the core of understanding how changes to the competitive landscape affect consumer choices and reshape market equilibria. In this paper, we propose a fundamental characterization of choice functions that encompasses a wide variety of extant choice models. We demonstrate how non-parametric estimators like neural nets can easily approximate such functionals and overcome the curse of dimensionality that is inherent in the non-parametric estimation of choice functions. 
We demonstrate through extensive simulations that our proposed functionals can flexibly capture underlying consumer behavior in a completely data-driven fashion and outperform traditional parametric models. As demand settings often exhibit endogenous features, we extend our framework to incorporate estimation under endogenous features. Further, we also describe a formal inference procedure to construct valid confidence intervals on objects of interest like price elasticity. Finally, to assess the practical applicability of our estimator, we utilize a real-world dataset from S. Berry, Levinsohn, and Pakes (1995). Our empirical analysis confirms that the estimator generates realistic and comparable own- and cross-price elasticities that are consistent with the observations reported in the existing literature."}, "https://arxiv.org/abs/2307.15681": {"title": "A Continuous-Time Dynamic Factor Model for Intensive Longitudinal Data Arising from Mobile Health Studies", "link": "https://arxiv.org/abs/2307.15681", "description": "arXiv:2307.15681v3 Announce Type: replace \nAbstract: Intensive longitudinal data (ILD) collected in mobile health (mHealth) studies contain rich information on multiple outcomes measured frequently over time that have the potential to capture short-term and long-term dynamics. Motivated by an mHealth study of smoking cessation in which participants self-report the intensity of many emotions multiple times per day, we describe a dynamic factor model that summarizes the ILD as a low-dimensional, interpretable latent process. This model consists of two submodels: (i) a measurement submodel--a factor model--that summarizes the multivariate longitudinal outcome as lower-dimensional latent variables and (ii) a structural submodel--an Ornstein-Uhlenbeck (OU) stochastic process--that captures the temporal dynamics of the multivariate latent process in continuous time. We derive a closed-form likelihood for the marginal distribution of the outcome and the computationally-simpler sparse precision matrix for the OU process. We propose a block coordinate descent algorithm for estimation. Finally, we apply our method to the mHealth data to summarize the dynamics of 18 different emotions as two latent processes. These latent processes are interpreted by behavioral scientists as the psychological constructs of positive and negative affect and are key in understanding vulnerability to lapsing back to tobacco use among smokers attempting to quit."}, "https://arxiv.org/abs/2312.17015": {"title": "Regularized Exponentially Tilted Empirical Likelihood for Bayesian Inference", "link": "https://arxiv.org/abs/2312.17015", "description": "arXiv:2312.17015v3 Announce Type: replace \nAbstract: Bayesian inference with empirical likelihood faces a challenge as the posterior domain is a proper subset of the original parameter space due to the convex hull constraint. We propose a regularized exponentially tilted empirical likelihood to address this issue. Our method removes the convex hull constraint using a novel regularization technique, incorporating a continuous exponential family distribution to satisfy a Kullback-Leibler divergence criterion. The regularization arises as a limiting procedure where pseudo-data are added to the formulation of exponentially tilted empirical likelihood in a structured fashion. 
We show that this regularized exponentially tilted empirical likelihood retains certain desirable asymptotic properties of (exponentially tilted) empirical likelihood and has improved finite sample performance. Simulation and data analysis demonstrate that the proposed method provides a suitable pseudo-likelihood for Bayesian inference. The implementation of our method is available as the R package retel. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2401.11278": {"title": "Handling incomplete outcomes and covariates in cluster-randomized trials: doubly-robust estimation, efficiency considerations, and sensitivity analysis", "link": "https://arxiv.org/abs/2401.11278", "description": "arXiv:2401.11278v2 Announce Type: replace \nAbstract: In cluster-randomized trials (CRTs), missing data can occur in various ways, including missing values in outcomes and baseline covariates at the individual or cluster level, or completely missing information for non-participants. Among the various types of missing data in CRTs, missing outcomes have attracted the most attention. However, no existing methods can simultaneously address all aforementioned types of missing data in CRTs. To fill in this gap, we propose a new doubly-robust estimator for the average treatment effect on a variety of scales. The proposed estimator simultaneously handles missing outcomes under missingness at random, missing covariates without constraining the missingness mechanism, and missing cluster-population sizes via a uniform sampling mechanism. Furthermore, we detail key considerations to improve precision by specifying the optimal weights, leveraging machine learning, and modeling the treatment assignment mechanism. Finally, to evaluate the impact of violating missing data assumptions, we contribute a new sensitivity analysis framework tailored to CRTs. Simulation studies and a real data application both demonstrate that our proposed methods are effective in handling missing data in CRTs and superior to the existing methods."}, "https://arxiv.org/abs/2211.14221": {"title": "Learning Large Causal Structures from Inverse Covariance Matrix via Sparse Matrix Decomposition", "link": "https://arxiv.org/abs/2211.14221", "description": "arXiv:2211.14221v3 Announce Type: replace-cross \nAbstract: Learning causal structures from observational data is a fundamental problem facing important computational challenges when the number of variables is large. In the context of linear structural equation models (SEMs), this paper focuses on learning causal structures from the inverse covariance matrix. The proposed method, called ICID for Independence-preserving Decomposition from Inverse Covariance matrix, is based on continuous optimization of a matrix decomposition model that preserves the nonzero patterns of the inverse covariance matrix. Through theoretical and empirical evidences, we show that ICID efficiently identifies the sought directed acyclic graph (DAG) assuming the knowledge of noise variances. Moreover, ICID is shown empirically to be robust under bounded misspecification of noise variances in the case where the noise variances are non-equal. 
The proposed method enjoys a low complexity, as reflected by its time efficiency in the experiments, and also enables a novel regularization scheme that yields highly accurate solutions on the Simulated fMRI data (Smith et al., 2011) in comparison with state-of-the-art algorithms."}, "https://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "https://arxiv.org/abs/2310.04563", "description": "arXiv:2310.04563v2 Announce Type: replace-cross \nAbstract: During the COVID-19 pandemic, safely implementing in-person indoor instruction was a high priority for universities nationwide. To support this effort at the University, we developed a mathematical model for estimating the risk of SARS-CoV-2 transmission in university classrooms. This model was used to evaluate combinations of feasible interventions for classrooms at the University during the pandemic and optimize the set of interventions that would allow higher occupancy levels, matching the pre-pandemic numbers of in-person courses. Importantly, we determined that requiring masking in dense classrooms with unrestricted seating with more than 90% of students vaccinated was easy to implement, incurred little logistical or financial cost, and allowed classes to be held at full capacity. A retrospective analysis at the end of the semester confirmed the model's assessment that the proposed classroom configuration would be safe. Our framework is generalizable and was used to support reopening decisions at Stanford University. In addition, our framework is flexible and applies to a wide range of indoor settings. It was repurposed for large university events and gatherings and could be used to support planning indoor space use to avoid transmission of infectious diseases across various industries, from secondary schools to movie theaters and restaurants."}, "https://arxiv.org/abs/2310.18212": {"title": "Robustness of Algorithms for Causal Structure Learning to Hyperparameter Choice", "link": "https://arxiv.org/abs/2310.18212", "description": "arXiv:2310.18212v2 Announce Type: replace-cross \nAbstract: Hyperparameters play a critical role in machine learning. Hyperparameter tuning can make the difference between state-of-the-art and poor prediction performance for any algorithm, but it is particularly challenging for structure learning due to its unsupervised nature. As a result, hyperparameter tuning is often neglected in favour of using the default values provided by a particular implementation of an algorithm. While there have been numerous studies on performance evaluation of causal discovery algorithms, how hyperparameters affect individual algorithms, as well as the choice of the best algorithm for a specific problem, has not been studied in depth before. This work addresses this gap by investigating the influence of hyperparameters on causal structure learning tasks. Specifically, we perform an empirical evaluation of hyperparameter selection for some seminal learning algorithms on datasets of varying levels of complexity. 
We find that, while the choice of algorithm remains crucial to obtaining state-of-the-art performance, hyperparameter selection in ensemble settings strongly influences the choice of algorithm, in that a poor choice of hyperparameters can lead to analysts using algorithms which do not give state-of-the-art performance for their data."}, "https://arxiv.org/abs/2402.10870": {"title": "Best of Three Worlds: Adaptive Experimentation for Digital Marketing in Practice", "link": "https://arxiv.org/abs/2402.10870", "description": "arXiv:2402.10870v2 Announce Type: replace-cross \nAbstract: Adaptive experimental design (AED) methods are increasingly being used in industry as a tool to boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing methods. However, the behavior and guarantees of such methods are not well-understood beyond idealized stationary settings. This paper shares lessons learned regarding the challenges of naively using AED systems in industrial settings where non-stationarity is prevalent, while also providing perspectives on the proper objectives and system specifications in such settings. We developed an AED framework for counterfactual inference based on these experiences, and tested it in a commercial environment."}, "https://arxiv.org/abs/2402.13259": {"title": "Fast Discrete-Event Simulation of Markovian Queueing Networks through Euler Approximation", "link": "https://arxiv.org/abs/2402.13259", "description": "arXiv:2402.13259v1 Announce Type: new \nAbstract: The efficient management of large-scale queueing networks is critical for a variety of sectors, including healthcare, logistics, and customer service, where system performance has profound implications for operational effectiveness and cost management. To address this key challenge, our paper introduces simulation techniques tailored for complex, large-scale Markovian queueing networks. We develop two simulation schemes based on Euler approximation, namely the backward and forward schemes. These schemes can accommodate time-varying dynamics and are optimized for efficient implementation using vectorization. Assuming a feedforward queueing network structure, we establish that the two schemes provide stochastic upper and lower bounds for the system state, while the approximation error remains bounded over the simulation horizon. With the recommended choice of time step, we show that our approximation schemes exhibit diminishing asymptotic relative error as the system scales up, while maintaining much lower computational complexity compared to traditional discrete-event simulation and achieving speedups up to tens of thousands times. This study highlights the substantial potential of Euler approximation in simulating large-scale discrete systems."}, "https://arxiv.org/abs/2402.13375": {"title": "A Strategic Model of Software Dependency Networks", "link": "https://arxiv.org/abs/2402.13375", "description": "arXiv:2402.13375v1 Announce Type: new \nAbstract: Modern software development involves collaborative efforts and reuse of existing code, which reduces the cost of developing new software. However, reusing code from existing packages exposes coders to vulnerabilities in these dependencies. We study the formation of dependency networks among software packages and libraries, guided by a structural model of network formation with observable and unobservable heterogeneity. 
We estimate costs, benefits, and link externalities of the network of 696,790 directed dependencies between 35,473 repositories of the Rust programming language using a novel scalable algorithm. We find evidence of a positive externality exerted on other coders when coders create dependencies. Furthermore, we show that coders are likely to link to more popular packages of the same software type but less popular packages of other types. We adopt models for the spread of infectious diseases to measure a package's systemicness as the number of downstream packages a vulnerability would affect. Systemicness is highly skewed with the most systemic repository affecting almost 90% of all repositories only two steps away. Lastly, we show that protecting only the ten most important repositories reduces vulnerability contagion by nearly 40%."}, "https://arxiv.org/abs/2402.13384": {"title": "Allowing Growing Dimensional Binary Outcomes via the Multivariate Probit Indian Buffet Process", "link": "https://arxiv.org/abs/2402.13384", "description": "arXiv:2402.13384v1 Announce Type: new \nAbstract: There is a rich literature on infinite latent feature models, with much of the focus being on the Indian Buffet Process (IBP). The current literature focuses on the case in which latent features are generated independently, so that there is no within-sample dependence in which features are selected. Motivated by ecology applications in which latent features correspond to which species are discovered in a sample, we propose a new class of dependent infinite latent feature models. Our construction starts with a probit Indian buffet process, and then incorporates dependence among features through a correlation matrix in a latent multivariate Gaussian random variable. We show that the proposed approach preserves many appealing theoretical properties of the IBP. To address the challenge of modeling and inference for a growing-dimensional correlation matrix, we propose Bayesian modeling approaches to reduce effective dimensionality. We additionally describe extensions to incorporate covariates and hierarchical structure to enable borrowing of information. For posterior inference, we describe Markov chain Monte Carlo sampling algorithms and efficient approximations. Simulation studies and applications to fungal biodiversity data provide compelling support for the new modeling class relative to competitors."}, "https://arxiv.org/abs/2402.13391": {"title": "De-Biasing the Bias: Methods for Improving Disparity Assessments with Noisy Group Measurements", "link": "https://arxiv.org/abs/2402.13391", "description": "arXiv:2402.13391v1 Announce Type: new \nAbstract: Health care decisions are increasingly informed by clinical decision support algorithms, but these algorithms may perpetuate or increase racial and ethnic disparities in access to and quality of health care. Further complicating the problem, clinical data often have missing or poor quality racial and ethnic information, which can lead to misleading assessments of algorithmic bias. We present novel statistical methods that allow for the use of probabilities of racial/ethnic group membership in assessments of algorithm performance and quantify the statistical bias that results from error in these imputed group probabilities. We propose a sensitivity analysis approach to estimating the statistical bias that allows practitioners to assess disparities in algorithm performance under a range of assumed levels of group probability error. 
We also prove theoretical bounds on the statistical bias for a set of commonly used fairness metrics and describe real-world scenarios where our theoretical results are likely to apply. We present a case study using imputed race and ethnicity from the Bayesian Improved Surname Geocoding (BISG) algorithm for estimation of disparities in a clinical decision support algorithm used to inform osteoporosis treatment. Our novel methods allow policy makers to understand the range of potential disparities under a given algorithm even when race and ethnicity information is missing and to make informed decisions regarding the implementation of machine learning for clinical decision support."}, "https://arxiv.org/abs/2402.13472": {"title": "Generalized linear models with spatial dependence and a functional covariate", "link": "https://arxiv.org/abs/2402.13472", "description": "arXiv:2402.13472v1 Announce Type: new \nAbstract: We extend generalized functional linear models under independence to a situation in which a functional covariate is related to a scalar response variable that exhibits spatial dependence. For estimation, we apply basis expansion and truncation for dimension reduction of the covariate process followed by a composite likelihood estimating equation to handle the spatial dependency. We develop asymptotic results for the proposed model under a repeating lattice asymptotic context, allowing us to construct a confidence interval for the spatial dependence parameter and a confidence band for the parameter function. A binary conditionals model is presented as a concrete illustration and is used in simulation studies to verify the applicability of the asymptotic inferential results."}, "https://arxiv.org/abs/2402.13719": {"title": "Informative Simultaneous Confidence Intervals for Graphical Test Procedures", "link": "https://arxiv.org/abs/2402.13719", "description": "arXiv:2402.13719v1 Announce Type: new \nAbstract: Simultaneous confidence intervals (SCIs) that are compatible with a given closed test procedure are often non-informative. More precisely, for a one-sided null hypothesis, the bound of the SCI can stick to the border of the null hypothesis, irrespective of how far the point estimate deviates from the null hypothesis. This has been illustrated for the Bonferroni-Holm and fall-back procedures, for which alternative SCIs have been suggested, that are free of this deficiency. These informative SCIs are not fully compatible with the initial multiple test, but are close to it and hence provide similar power advantages. They provide a multiple hypothesis test with strong family-wise error rate control that can be used in replacement of the initial multiple test. The current paper extends previous work for informative SCIs to graphical test procedures. The information gained from the newly suggested SCIs is shown to be always increasing with increasing evidence against a null hypothesis. The new SCIs provide a compromise between information gain and the goal to reject as many hypotheses as possible. The SCIs are defined via a family of dual graphs and the projection method. A simple iterative algorithm for the computation of the intervals is provided. 
A simulation study illustrates the results for a complex graphical test procedure."}, "https://arxiv.org/abs/2402.13890": {"title": "A unified Bayesian framework for interval hypothesis testing in clinical trials", "link": "https://arxiv.org/abs/2402.13890", "description": "arXiv:2402.13890v1 Announce Type: new \nAbstract: The American Statistical Association (ASA) statement on statistical significance and P-values \\cite{wasserstein2016asa} cautioned statisticians against making scientific decisions solely on the basis of traditional P-values. The statement delineated key issues with P-values, including a lack of transparency, an inability to quantify evidence in support of the null hypothesis, and an inability to measure the size of an effect or the importance of a result. In this article, we demonstrate that the interval null hypothesis framework (instead of the point null hypothesis framework), when used in tandem with Bayes factor-based tests, is instrumental in circumnavigating the key issues of P-values. Further, we note that specifying prior densities for Bayes factors is challenging and has been a reason for criticism of Bayesian hypothesis testing in existing literature. We address this by adapting Bayes factors directly based on common test statistics. We demonstrate, through numerical experiments and real data examples, that the proposed Bayesian interval hypothesis testing procedures can be calibrated to ensure frequentist error control while retaining their inherent interpretability. Finally, we illustrate the improved flexibility and applicability of the proposed methods by providing coherent frameworks for competitive landscape analysis and end-to-end Bayesian hypothesis tests in the context of reporting clinical trial outcomes."}, "https://arxiv.org/abs/2402.13907": {"title": "Quadratic inference with dense functional responses", "link": "https://arxiv.org/abs/2402.13907", "description": "arXiv:2402.13907v1 Announce Type: new \nAbstract: We address the challenge of estimation in the context of constant linear effect models with dense functional responses. In this framework, the conditional expectation of the response curve is represented by a linear combination of functional covariates with constant regression parameters. In this paper, we present an alternative solution by employing the quadratic inference approach, a well-established method for analyzing correlated data, to estimate the regression coefficients. Our approach leverages non-parametrically estimated basis functions, eliminating the need for choosing working correlation structures. Furthermore, we demonstrate that our method achieves a parametric $\\sqrt{n}$-convergence rate, contingent on an appropriate choice of bandwidth. This convergence is observed when the number of repeated measurements per trajectory exceeds a certain threshold, specifically, when it surpasses $n^{a_{0}}$, with $n$ representing the number of trajectories. Additionally, we establish the asymptotic normality of the resulting estimator. The performance of the proposed method is compared with that of existing methods through extensive simulation studies, where our proposed method outperforms. 
Real data analysis is also conducted to demonstrate the proposed method."}, "https://arxiv.org/abs/2402.13933": {"title": "Powerful Large-scale Inference in High Dimensional Mediation Analysis", "link": "https://arxiv.org/abs/2402.13933", "description": "arXiv:2402.13933v1 Announce Type: new \nAbstract: In genome-wide epigenetic studies, exposures (e.g., Single Nucleotide Polymorphisms) affect outcomes (e.g., gene expression) through intermediate variables such as DNA methylation. Mediation analysis offers a way to study these intermediate variables and identify the presence or absence of causal mediation effects. Testing for mediation effects leads to a composite null hypothesis. Existing methods like Sobel's test or the Max-P test are often underpowered because 1) statistical inference is often conducted based on distributions determined under a subset of the null and 2) they are not designed to shoulder the multiple testing burden. To tackle these issues, we introduce a technique called MLFDR (Mediation Analysis using Local False Discovery Rates) for high dimensional mediation analysis, which uses the local False Discovery Rates based on the coefficients of the structural equation model specifying the mediation relationship to construct a rejection region. We have shown theoretically as well as through simulation studies that in the high-dimensional setting, the new method of identifying the mediating variables controls the FDR asymptotically and performs better with respect to power than several existing methods such as DACT (Liu et al.) and JS-mixture (Dai et al.)."}, "https://arxiv.org/abs/2402.13332": {"title": "Double machine learning for causal hybrid modeling -- applications in the Earth sciences", "link": "https://arxiv.org/abs/2402.13332", "description": "arXiv:2402.13332v1 Announce Type: cross \nAbstract: Hybrid modeling integrates machine learning with scientific knowledge with the goal of enhancing interpretability, generalization, and adherence to natural laws. Nevertheless, equifinality and regularization biases pose challenges in hybrid modeling to achieve these purposes. This paper introduces a novel approach to estimating hybrid models via a causal inference framework, specifically employing Double Machine Learning (DML) to estimate causal effects. We showcase its use for the Earth sciences on two problems related to carbon dioxide fluxes. In the $Q_{10}$ model, we demonstrate that DML-based hybrid modeling is superior in estimating causal parameters over end-to-end deep neural network (DNN) approaches, proving efficiency, robustness to bias from regularization methods, and circumventing equifinality. Our approach, applied to carbon flux partitioning, exhibits flexibility in accommodating heterogeneous causal effects. The study emphasizes the necessity of explicitly defining causal graphs and relationships, advocating for this as a general best practice. We encourage the continued exploration of causality in hybrid models for more interpretable and trustworthy results in knowledge-guided machine learning."}, "https://arxiv.org/abs/2402.13604": {"title": "Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE", "link": "https://arxiv.org/abs/2402.13604", "description": "arXiv:2402.13604v1 Announce Type: cross \nAbstract: This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. 
The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We finetune a preexisting language model (CANINE) to do this automatically thereby performing in seconds and minutes what previously took days and weeks. The model is trained on 14 million pairs of occupational descriptions and HISCO codes in 13 different languages contributed by 22 different sources. Our approach is shown to have accuracy, recall and precision above 90 percent. Our tool breaks the metaphorical HISCO barrier and makes this data readily available for analysis of occupational structures with broad applicability in economics, economic history and various related disciplines."}, "https://arxiv.org/abs/2402.13678": {"title": "Weak Poincar\\'e inequality comparisons for ideal and hybrid slice sampling", "link": "https://arxiv.org/abs/2402.13678", "description": "arXiv:2402.13678v1 Announce Type: cross \nAbstract: Using the framework of weak Poincar{\\'e} inequalities, we provide a general comparison between the Hybrid and Ideal Slice Sampling Markov chains in terms of their Dirichlet forms. In particular, under suitable assumptions Hybrid Slice Sampling will inherit fast convergence from Ideal Slice Sampling and conversely. We apply our results to analyse the convergence of the Independent Metropolis--Hastings, Slice Sampling with Stepping-Out and Shrinkage, and Hit-and-Run-within-Slice Sampling algorithms."}, "https://arxiv.org/abs/1508.01167": {"title": "The Divergence Index: A Decomposable Measure of Segregation and Inequality", "link": "https://arxiv.org/abs/1508.01167", "description": "arXiv:1508.01167v3 Announce Type: replace \nAbstract: Decomposition analysis is a critical tool for understanding the social and spatial dimensions of segregation and diversity. In this paper, I highlight the conceptual, mathematical, and empirical distinctions between segregation and diversity and introduce the Divergence Index as a decomposable measure of segregation. Scholars have turned to the Information Theory Index as the best alternative to the Dissimilarity Index in decomposition studies, however it measures diversity rather than segregation. I demonstrate the importance of preserving this conceptual distinction with a decomposition analysis of segregation and diversity in U.S. metropolitan areas from 1990 to 2010, which shows that the Information Theory Index has tended to decrease, particularly within cities, while the Divergence Index has tended to increase, particularly within suburbs. Rather than being a substitute for measures of diversity, the Divergence Index complements existing measures by enabling the analysis and decomposition of segregation alongside diversity."}, "https://arxiv.org/abs/2007.12448": {"title": "A (tight) upper bound for the length of confidence intervals with conditional coverage", "link": "https://arxiv.org/abs/2007.12448", "description": "arXiv:2007.12448v3 Announce Type: replace \nAbstract: We show that two popular selective inference procedures, namely data carving (Fithian et al., 2017) and selection with a randomized response (Tian et al., 2018b), when combined with the polyhedral method (Lee et al., 2016), result in confidence intervals whose length is bounded. This contrasts results for confidence intervals based on the polyhedral method alone, whose expected length is typically infinite (Kivaranovic and Leeb, 2020). 
Moreover, we show that these two procedures always dominate corresponding sample-splitting methods in terms of interval length."}, "https://arxiv.org/abs/2101.05135": {"title": "A Latent Variable Model for Relational Events with Multiple Receivers", "link": "https://arxiv.org/abs/2101.05135", "description": "arXiv:2101.05135v3 Announce Type: replace \nAbstract: Directional relational event data, such as email data, often contain unicast messages (i.e., messages of one sender towards one receiver) and multicast messages (i.e., messages of one sender towards multiple receivers). The Enron email data that is the focus of this paper consists of 31% multicast messages. Multicast messages contain important information about the roles of actors in the network, which is needed for better understanding social interaction dynamics. In this paper, a multiplicative latent factor model is proposed to analyze such relational data. For a given message, all potential receiver actors are placed on a suitability scale, and the actors are included in the receiver set whose suitability score exceeds a threshold value. Unobserved heterogeneity in the social interaction behavior is captured using a multiplicative latent factor structure with latent variables for actors (which differ for actors as senders and receivers) and latent variables for individual messages. A Bayesian computational algorithm, which relies on Gibbs sampling, is proposed for model fitting. Model assessment is done using posterior predictive checks. Based on our analyses of the Enron email data, an mc-amen model with a 2-dimensional latent variable can accurately capture the empirical distribution of the cardinality of the receiver set and the composition of the receiver sets for commonly observed messages. Moreover, the results show that actors have a comparable (but not identical) role as a sender and as a receiver in the network."}, "https://arxiv.org/abs/2102.10154": {"title": "Truncated, Censored, and Actuarial Payment-type Moments for Robust Fitting of a Single-parameter Pareto Distribution", "link": "https://arxiv.org/abs/2102.10154", "description": "arXiv:2102.10154v3 Announce Type: replace \nAbstract: Under some regularity conditions, maximum likelihood estimators (MLEs) are asymptotically optimal (in the sense of consistency, efficiency, sufficiency, and unbiasedness). In general, however, MLEs lead to non-robust statistical inference, for example, in pricing models and risk measures. Actuarial claim severity is continuous, right-skewed, and frequently heavy-tailed. The data sets that such models are usually fitted to contain outliers that are difficult to identify and separate from genuine data. Moreover, due to commonly used actuarial \"loss control strategies\" in financial and insurance industries, the random variables we observe and wish to model are affected by truncation (due to deductibles), censoring (due to policy limits), scaling (due to coinsurance proportions) and other transformations. To alleviate the lack of robustness of MLE-based inference in risk modeling, in this paper we propose and develop a new method of estimation - the method of truncated moments (MTuM) - and generalize it for different loss control scenarios. Various asymptotic properties of those estimates are established by using central limit theory. New connections between different estimators are found. A comparative study of newly-designed methods with the corresponding MLEs is performed. 
A detailed investigation, including a simulation study, has been carried out for a single-parameter Pareto loss model."}, "https://arxiv.org/abs/2103.02089": {"title": "Robust Estimation of Loss Models for Lognormal Insurance Payment Severity Data", "link": "https://arxiv.org/abs/2103.02089", "description": "arXiv:2103.02089v4 Announce Type: replace \nAbstract: The primary objective of this scholarly work is to develop two estimation procedures - maximum likelihood estimator (MLE) and method of trimmed moments (MTM) - for the mean and variance of lognormal insurance payment severity data sets affected by different loss control mechanisms, for example, truncation (due to deductibles), censoring (due to policy limits), and scaling (due to coinsurance proportions), in insurance and financial industries. Maximum likelihood estimating equations for both payment-per-payment and payment-per-loss data sets are derived, which can be solved readily by any existing iterative numerical methods. The asymptotic distributions of those estimators are established via Fisher information matrices. Further, with a goal of balancing efficiency and robustness and to remove point masses at certain data points, we develop dynamic MTM estimation procedures for lognormal claim severity models for the above-mentioned transformed data scenarios. The asymptotic distributional properties and the comparison with the corresponding MLEs of those MTM estimators are established along with extensive simulation studies. Purely for illustrative purposes, numerical examples for 1500 US indemnity losses are provided, which illustrate the practical performance of the established results in this paper."}, "https://arxiv.org/abs/2202.13000": {"title": "Robust Estimation of Loss Models for Truncated and Censored Severity Data", "link": "https://arxiv.org/abs/2202.13000", "description": "arXiv:2202.13000v2 Announce Type: replace \nAbstract: In this paper, we consider robust estimation of claim severity models in insurance, when data are affected by truncation (due to deductibles), censoring (due to policy limits), and scaling (due to coinsurance). In particular, robust estimators based on the methods of trimmed moments (T-estimators) and winsorized moments (W-estimators) are pursued and fully developed. The general definitions of such estimators are formulated and their asymptotic properties are investigated. For illustrative purposes, specific formulas for T- and W-estimators of the tail parameter of a single-parameter Pareto distribution are derived. The practical performance of these estimators is then explored using the well-known Norwegian fire claims data. Our results demonstrate that T- and W-estimators offer a robust and computationally efficient alternative to the likelihood-based inference for models that are affected by deductibles, policy limits, and coinsurance."}, "https://arxiv.org/abs/2204.02477": {"title": "Method of Winsorized Moments for Robust Fitting of Truncated and Censored Lognormal Distributions", "link": "https://arxiv.org/abs/2204.02477", "description": "arXiv:2204.02477v2 Announce Type: replace \nAbstract: When constructing parametric models to predict the cost of future claims, several important details have to be taken into account: (i) models should be designed to accommodate deductibles, policy limits, and coinsurance factors, (ii) parameters should be estimated robustly to control the influence of outliers on model predictions, and (iii) all point predictions should be augmented with estimates of their uncertainty. 
The methodology proposed in this paper provides a framework for addressing all these aspects simultaneously. Using payment-per-payment and payment-per-loss variables, we construct the adaptive version of method of winsorized moments (MWM) estimators for the parameters of the truncated and censored lognormal distribution. Further, the asymptotic distributional properties of this approach are derived and compared with those of the maximum likelihood estimator (MLE) and method of trimmed moments (MTM) estimators, the latter being a primary competitor to MWM. Moreover, the theoretical results are validated with extensive simulation studies and risk measure sensitivity analysis. Finally, the practical performance of these methods is illustrated using the well-studied data set of 1500 U.S. indemnity losses. With this real data set, it is also demonstrated that the composite models do not provide much improvement in the quality of predictive models compared to a stand-alone fitted distribution, especially for truncated and censored sample data."}, "https://arxiv.org/abs/2211.16298": {"title": "Double Robust Bayesian Inference on Average Treatment Effects", "link": "https://arxiv.org/abs/2211.16298", "description": "arXiv:2211.16298v4 Announce Type: replace \nAbstract: We propose a double robust Bayesian inference procedure on the average treatment effect (ATE) under unconfoundedness. Our robust Bayesian approach involves two important modifications: first, we adjust the prior distributions of the conditional mean function; second, we correct the posterior distribution of the resulting ATE. Both adjustments make use of pilot estimators motivated by the semiparametric influence function for ATE estimation. We prove asymptotic equivalence of our Bayesian procedure and efficient frequentist ATE estimators by establishing a new semiparametric Bernstein-von Mises theorem under double robustness; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score and vice versa. Consequently, the resulting Bayesian credible sets form confidence intervals with asymptotically exact coverage probability. In simulations, our double robust Bayesian procedure leads to significant bias reduction of point estimation over conventional Bayesian methods and more accurate coverage of confidence intervals compared to existing frequentist methods. We illustrate our method in an application to the National Supported Work Demonstration."}, "https://arxiv.org/abs/2302.12627": {"title": "Cox reduction and confidence sets of models: a theoretical elucidation", "link": "https://arxiv.org/abs/2302.12627", "description": "arXiv:2302.12627v2 Announce Type: replace \nAbstract: For sparse high-dimensional regression problems, Cox and Battey [1, 9] emphasised the need for confidence sets of models: an enumeration of those small sets of variables that fit the data equivalently well in a suitable statistical sense. This is to be contrasted with the single model returned by penalised regression procedures, effective for prediction but potentially misleading for subject-matter understanding. The proposed construction of such sets relied on preliminary reduction of the full set of variables, and while various possibilities could be considered for this, [9] proposed a succession of regression fits based on incomplete block designs. The purpose of the present paper is to provide insight on both aspects of that work. 
For an unspecified reduction strategy, we begin by characterising models that are likely to be retained in the model confidence set, emphasising geometric aspects. We then evaluate possible reduction schemes based on penalised regression or marginal screening, before theoretically elucidating the reduction of [9]. We identify features of the covariate matrix that may reduce its efficacy, and indicate improvements to the original proposal. An advantage of the approach is its ability to reveal its own stability or fragility for the data at hand."}, "https://arxiv.org/abs/2303.17642": {"title": "Change Point Detection on A Separable Model for Dynamic Networks", "link": "https://arxiv.org/abs/2303.17642", "description": "arXiv:2303.17642v3 Announce Type: replace \nAbstract: This paper studies change point detection in time series of networks, with the Separable Temporal Exponential-family Random Graph Model (STERGM). Dynamic network patterns can be inherently complex due to dyadic and temporal dependence. Detection of the change points can identify the discrepancies in the underlying data generating processes and facilitate downstream analysis. The STERGM that utilizes network statistics to represent the structural patterns is a flexible model to fit dynamic networks. We propose a new estimator derived from the Alternating Direction Method of Multipliers (ADMM) and Group Fused Lasso to simultaneously detect multiple time points, where the parameters of a time-heterogeneous STERGM have changed. We also provide a Bayesian information criterion for model selection and an R package CPDstergm to implement the proposed method. Experiments on simulated and real data show good performance of the proposed framework."}, "https://arxiv.org/abs/2306.15088": {"title": "Locally tail-scale invariant scoring rules for evaluation of extreme value forecasts", "link": "https://arxiv.org/abs/2306.15088", "description": "arXiv:2306.15088v3 Announce Type: replace \nAbstract: Statistical analysis of extremes can be used to predict the probability of future extreme events, such as large rainfalls or devastating windstorms. The quality of these forecasts can be measured through scoring rules. Locally scale invariant scoring rules give equal importance to the forecasts at different locations regardless of differences in the prediction uncertainty. This is a useful feature when computing average scores but can be an unnecessarily strict requirement when mostly concerned with extremes. We propose the concept of local weight-scale invariance, describing scoring rules fulfilling local scale invariance in a certain region of interest, and as a special case local tail-scale invariance, for large events. Moreover, a new version of the weighted Continuous Ranked Probability score (wCRPS) called the scaled wCRPS (swCRPS) that possesses this property is developed and studied. The score is a suitable alternative for scoring extreme value models over areas with varying scale of extreme events, and we derive explicit formulas of the score for the Generalised Extreme Value distribution. 
The scoring rules are compared through simulation, and their usage is illustrated in modelling of extreme water levels, annual maximum rainfalls, and in an application to non-extreme forecast for the prediction of air pollution."}, "https://arxiv.org/abs/2309.06429": {"title": "Efficient Inference on High-Dimensional Linear Models with Missing Outcomes", "link": "https://arxiv.org/abs/2309.06429", "description": "arXiv:2309.06429v2 Announce Type: replace \nAbstract: This paper is concerned with inference on the regression function of a high-dimensional linear model when outcomes are missing at random. We propose an estimator which combines a Lasso pilot estimate of the regression function with a bias correction term based on the weighted residuals of the Lasso regression. The weights depend on estimates of the missingness probabilities (propensity scores) and solve a convex optimization program that trades off bias and variance optimally. Provided that the propensity scores can be pointwise consistently estimated at in-sample data points, our proposed estimator for the regression function is asymptotically normal and semi-parametrically efficient among all asymptotically linear estimators. Furthermore, the proposed estimator keeps its asymptotic properties even if the propensity scores are estimated by modern machine learning techniques. We validate the finite-sample performance of the proposed estimator through comparative simulation studies and the real-world problem of inferring the stellar masses of galaxies in the Sloan Digital Sky Survey."}, "https://arxiv.org/abs/2401.02939": {"title": "Penalized Distributed Lag Interaction Model: Air Pollution, Birth Weight and Neighborhood Vulnerability", "link": "https://arxiv.org/abs/2401.02939", "description": "arXiv:2401.02939v2 Announce Type: replace \nAbstract: Maternal exposure to air pollution during pregnancy has a substantial public health impact. Epidemiological evidence supports an association between maternal exposure to air pollution and low birth weight. A popular method to estimate this association while identifying windows of susceptibility is a distributed lag model (DLM), which regresses an outcome onto exposure history observed at multiple time points. However, the standard DLM framework does not allow for modification of the association between repeated measures of exposure and the outcome. We propose a distributed lag interaction model that allows modification of the exposure-time-response associations across individuals by including an interaction between a continuous modifying variable and the exposure history. Our model framework is an extension of a standard DLM that uses a cross-basis, or bi-dimensional function space, to simultaneously describe both the modification of the exposure-response relationship and the temporal structure of the exposure data. Through simulations, we showed that our model with penalization out-performs a standard DLM when the true exposure-time-response associations vary by a continuous variable. 
Using a Colorado, USA birth cohort, we estimated the association between birth weight and ambient fine particulate matter air pollution modified by an area-level metric of health and social adversities from Colorado EnviroScreen."}, "https://arxiv.org/abs/2205.12112": {"title": "Stereographic Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2205.12112", "description": "arXiv:2205.12112v2 Announce Type: replace-cross \nAbstract: High-dimensional distributions, especially those with heavy tails, are notoriously difficult for off-the-shelf MCMC samplers: the combination of unbounded state spaces, diminishing gradient information, and local moves results in empirically observed ``stickiness'' and poor theoretical mixing properties -- lack of geometric ergodicity. In this paper, we introduce a new class of MCMC samplers that map the original high-dimensional problem in Euclidean space onto a sphere and remedy these notorious mixing problems. In particular, we develop random-walk Metropolis type algorithms as well as versions of the Bouncy Particle Sampler that are uniformly ergodic for a large class of light and heavy-tailed distributions and also empirically exhibit rapid convergence in high dimensions. In the best scenario, the proposed samplers can enjoy the ``blessings of dimensionality'' that the convergence is faster in higher dimensions."}, "https://arxiv.org/abs/2306.09507": {"title": "Winsorized Robust Credibility Models", "link": "https://arxiv.org/abs/2306.09507", "description": "arXiv:2306.09507v3 Announce Type: replace-cross \nAbstract: The B\\\"uhlmann model, a branch of classical credibility theory, has been successively applied to the premium estimation for group insurance contracts and other insurance specifications. In this paper, we develop a robust B\\\"uhlmann credibility via the censored version of loss data, or the censored mean (a robust alternative to traditional individual mean). This framework yields explicit formulas of structural parameters in credibility estimation for both scale-shape distribution families, location-scale distribution families, and their variants, which are commonly used to model insurance risks. The asymptotic properties of the proposed method are provided and corroborated through simulations, and their performance is compared to that of credibility based on the trimmed mean. By varying the censoring/trimming threshold level in several parametric models, we find all structural parameters via censoring are less volatile compared to the corresponding quantities via trimming, and using censored mean as a robust risk measure will reduce the influence of parametric loss assumptions on credibility estimation. Besides, the non-parametric estimations in credibility are discussed using the theory of $L-$estimators. And a numerical illustration from Wisconsin Local Government Property Insurance Fund indicates that the proposed robust credibility can prevent the effect caused by model mis-specification and capture the risk behavior of loss data in a broader viewpoint."}, "https://arxiv.org/abs/2401.12126": {"title": "Biological species delimitation based on genetic and spatial dissimilarity: a comparative study", "link": "https://arxiv.org/abs/2401.12126", "description": "arXiv:2401.12126v2 Announce Type: replace-cross \nAbstract: The delimitation of biological species, i.e., deciding which individuals belong to the same species and whether and how many different species are represented in a data set, is key to the conservation of biodiversity. 
Much existing work uses only genetic data for species delimitation, often employing some kind of cluster analysis. This can be misleading, because geographically distant groups of individuals can be genetically quite different even if they belong to the same species. This paper investigates the problem of testing whether two potentially separated groups of individuals can belong to a single species or not based on genetic and spatial data. Various approaches are compared (some of which already exist in the literature) based on simulated metapopulations generated with SLiM and GSpace - two software packages that can simulate spatially-explicit genetic data at an individual level. Approaches involve partial Mantel testing, maximum likelihood mixed-effects models with a population effect, and jackknife-based homogeneity tests. A key challenge is that most tests perform on genetic and geographical distance data, violating standard independence assumptions. Simulations showed that partial Mantel tests and mixed-effects models have larger power than jackknife-based methods, but tend to display type-I-error rates slightly above the significance level. Moreover, a multiple regression model neglecting the dependence in the dissimilarities did not show inflated type-I-error rate. An application on brassy ringlets concludes the paper."}, "https://arxiv.org/abs/2401.15567": {"title": "Positive Semidefinite Supermartingales and Randomized Matrix Concentration Inequalities", "link": "https://arxiv.org/abs/2401.15567", "description": "arXiv:2401.15567v2 Announce Type: replace-cross \nAbstract: We present new concentration inequalities for either martingale dependent or exchangeable random symmetric matrices under a variety of tail conditions, encompassing now-standard Chernoff bounds to self-normalized heavy-tailed settings. These inequalities are often randomized in a way that renders them strictly tighter than existing deterministic results in the literature, are typically expressed in the Loewner order, and are sometimes valid at arbitrary data-dependent stopping times. Along the way, we explore the theory of positive semidefinite supermartingales and maximal inequalities, a natural matrix analog of scalar nonnegative supermartingales that is potentially of independent interest."}, "https://arxiv.org/abs/2402.14133": {"title": "Maximum likelihood estimation for aggregate current status data: Simulation study using the illness-death model for chronic diseases with duration dependency", "link": "https://arxiv.org/abs/2402.14133", "description": "arXiv:2402.14133v1 Announce Type: new \nAbstract: We use the illness-death model (IDM) for chronic conditions to derive a new analytical relation between the transition rates between the states of the IDM. The transition rates are the incidence rate (i) and the mortality rates of people without disease (m0) and with disease (m1). For the most generic case, the rates depend on age, calendar time and in case of m1 also on the duration of the disease. In this work, we show that the prevalence-odds can be expressed as a convolution-like product of the incidence rate and an exponentiated linear combination of i, m0 and m1. The analytical expression can be used as the basis for a maximum likelihood estimation (MLE) and associated large sample asymptotics. In a simulation study where a cross-sectional trial about a chronic condition is mimicked, we estimate the duration dependency of the mortality rate m1 based on aggregated current status data using the ML estimator. 
For this, the number of study participants and the number of diseased people in eleven age groups are considered. The ML estimator provides reasonable estimates for the parameters, including their large-sample confidence bounds."}, "https://arxiv.org/abs/2402.14206": {"title": "The impact of Facebook-Cambridge Analytica data scandal on the USA tech stock market: An event study based on clustering method", "link": "https://arxiv.org/abs/2402.14206", "description": "arXiv:2402.14206v1 Announce Type: new \nAbstract: This study delves into the intra-industry effects following a firm-specific scandal, with a particular focus on the Facebook data leakage scandal and its associated events within the U.S. tech industry and two additional relevant groups. We employ various metrics including daily spread, volatility, volume-weighted return, and CAPM-beta for the pre-analysis clustering, and subsequently utilize CAR (Cumulative Abnormal Return) to evaluate the impact on firms grouped within these clusters. From a broader industry viewpoint, significant positive CAARs are observed across U.S. sample firms over the three days post-scandal announcement, indicating no adverse impact on the tech sector overall. Conversely, after Facebook's initial quarterly earnings report, it showed a notable negative effect despite reported positive performance. The clustering principle should aid in identifying directly related companies and thus reducing the influence of randomness. This was indeed achieved for the effect of the key event, namely \"The Effect of Congressional Hearing on Certain Clusters across U.S. Tech Stock Market,\" which was identified as delayed and significantly negative. Therefore, we recommend applying the clustering method when conducting such or similar event studies."}, "https://arxiv.org/abs/2402.14260": {"title": "Linear Discriminant Regularized Regression", "link": "https://arxiv.org/abs/2402.14260", "description": "arXiv:2402.14260v1 Announce Type: new \nAbstract: Linear Discriminant Analysis (LDA) is an important classification approach. Its simple linear form makes it easy to interpret, and it is capable of handling multi-class responses. It is closely related to other classical multivariate statistical techniques, such as Fisher's discriminant analysis, canonical correlation analysis and linear regression. In this paper we strengthen its connection to multivariate response regression by characterizing the explicit relationship between the discriminant directions and the regression coefficient matrix. This key characterization leads to a new regression-based multi-class classification procedure that is flexible enough to deploy any existing structured, regularized, and even non-parametric, regression methods. Moreover, our new formulation is generically easy to analyze compared to existing regression-based LDA procedures. In particular, we provide complete theoretical guarantees for using the widely used $\\ell_1$-regularization that has not yet been fully analyzed in the LDA context. Our theoretical findings are corroborated by extensive simulation studies and real data analysis."}, "https://arxiv.org/abs/2402.14282": {"title": "Extention of Bagging MARS with Group LASSO for Heterogeneous Treatment Effect Estimation", "link": "https://arxiv.org/abs/2402.14282", "description": "arXiv:2402.14282v1 Announce Type: new \nAbstract: In recent years, large-scale clinical data such as patient surveys and medical record data have been playing an increasing role in medical data science. 
These large-scale clinical data are collectively referred to as \"real-world data (RWD)\". RWD are expected to be widely used in large-scale observational studies of specific diseases, in personalized or precision medicine, and in identifying responders to drugs or treatments. Applying RWD to estimate heterogeneous treatment effects (HTE) has already become a trending topic. HTE has the potential to considerably impact the development of precision medicine by helping doctors make more informed, precise treatment decisions and provide more personalized medical care. The statistical models used to estimate HTE are called treatment effect models. Powers et al. proposed several treatment effect models for observational studies, where they pointed out that the bagging causal MARS (BCM) performs outstandingly compared to other models. While BCM has excellent performance, it still has room for improvement. In this paper, we propose a new treatment effect model, the shrinkage causal bagging MARS method, to improve their shared-basis conditional mean regression framework based on the following points: first, we estimate basis functions using the transformed outcome, and then apply the group LASSO method to optimize the model and estimate parameters. Besides, we focus on pursuing better interpretability of the model to improve its ethical acceptance. We designed simulations to verify the performance of our proposed method, and it is superior in terms of mean squared error and bias in most simulation settings. We also applied it to the real data set ACTG 175 to verify its usability, where our results are supported by previous studies."}, "https://arxiv.org/abs/2402.14322": {"title": "Estimation of Spectral Risk Measure for Left Truncated and Right Censored Data", "link": "https://arxiv.org/abs/2402.14322", "description": "arXiv:2402.14322v1 Announce Type: new \nAbstract: Left truncated and right censored data are encountered frequently in insurance loss data due to deductibles and policy limits. Risk estimation is an important task in insurance as it is a necessary step for determining premiums under various policy terms. Spectral risk measures are inherently coherent and have the benefit of connecting the risk measure to the user's risk aversion. In this paper we study the estimation of the spectral risk measure based on left truncated and right censored data. We propose a non-parametric estimator of the spectral risk measure using the product limit estimator and establish the asymptotic normality of our proposed estimator. We also develop an Edgeworth expansion of our proposed estimator. The bootstrap is employed to approximate the distribution of our proposed estimator and shown to be second order ``accurate''. Monte Carlo studies are conducted to compare the proposed spectral risk measure estimator with the existing parametric and non-parametric estimators for left truncated and right censored data. Based on our simulation study, we estimate the exponential spectral risk measure for three data sets, viz., the Norwegian fire claims data set, Spanish automobile insurance claims, and French marine losses."}, "https://arxiv.org/abs/2402.14377": {"title": "Ratio of two independent xgamma random variables and some distributional properties", "link": "https://arxiv.org/abs/2402.14377", "description": "arXiv:2402.14377v1 Announce Type: new \nAbstract: In this investigation, the distribution of the ratio of two independently distributed xgamma (Sen et al. 2016) random variables X and Y, with different parameters, is proposed and studied. 
The related distributional properties such as, moments, entropy measures, are investigated. We have also shown a unique characterization of the proposed distribution based on truncated incomplete moments."}, "https://arxiv.org/abs/2402.14438": {"title": "Efficiency-improved doubly robust estimation with non-confounding predictive covariates", "link": "https://arxiv.org/abs/2402.14438", "description": "arXiv:2402.14438v1 Announce Type: new \nAbstract: In observational studies, covariates with substantial missing data are often omitted, despite their strong predictive capabilities. These excluded covariates are generally believed not to simultaneously affect both treatment and outcome, indicating that they are not genuine confounders and do not impact the identification of the average treatment effect (ATE). In this paper, we introduce an alternative doubly robust (DR) estimator that fully leverages non-confounding predictive covariates to enhance efficiency, while also allowing missing values in such covariates. Beyond the double robustness property, our proposed estimator is designed to be more efficient than the standard DR estimator. Specifically, when the propensity score model is correctly specified, it achieves the smallest asymptotic variance among the class of DR estimators, and brings additional efficiency gains by further integrating predictive covariates. Simulation studies demonstrate the notable performance of the proposed estimator over current popular methods. An illustrative example is provided to assess the effectiveness of right heart catheterization (RHC) for critically ill patients."}, "https://arxiv.org/abs/2402.14506": {"title": "Enhancing Rolling Horizon Production Planning Through Stochastic Optimization Evaluated by Means of Simulation", "link": "https://arxiv.org/abs/2402.14506", "description": "arXiv:2402.14506v1 Announce Type: new \nAbstract: Production planning must account for uncertainty in a production system, arising from fluctuating demand forecasts. Therefore, this article focuses on the integration of updated customer demand into the rolling horizon planning cycle. We use scenario-based stochastic programming to solve capacitated lot sizing problems under stochastic demand in a rolling horizon environment. This environment is replicated using a discrete event simulation-optimization framework, where the optimization problem is periodically solved, leveraging the latest demand information to continually adjust the production plan. We evaluate the stochastic optimization approach and compare its performance to solving a deterministic lot sizing model, using expected demand figures as input, as well as to standard Material Requirements Planning (MRP). In the simulation study, we analyze three different customer behaviors related to forecasting, along with four levels of shop load, within a multi-item and multi-stage production system. We test a range of significant parameter values for the three planning methods and compute the overall costs to benchmark them. The results show that the production plans obtained by MRP are outperformed by deterministic and stochastic optimization. 
Particularly, when facing tight resource restrictions and rising uncertainty in customer demand, the use of stochastic optimization becomes preferable compared to deterministic optimization."}, "https://arxiv.org/abs/2402.14562": {"title": "Recoverability of Causal Effects in a Longitudinal Study under Presence of Missing Data", "link": "https://arxiv.org/abs/2402.14562", "description": "arXiv:2402.14562v1 Announce Type: new \nAbstract: Missing data in multiple variables is a common issue. We investigate the applicability of the framework of graphical models for handling missing data to a complex longitudinal pharmacological study of HIV-positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial. Specifically, we examine whether the causal effects of interest, defined through static interventions on multiple continuous variables, can be recovered (estimated consistently) from the available data only. So far, no general algorithms are available to decide on recoverability, and decisions have to be made on a case-by-case basis. We emphasize sensitivity of recoverability to even the smallest changes in the graph structure, and present recoverability results for three plausible missingness directed acyclic graphs (m-DAGs) in the CHAPAS-3 study, informed by clinical knowledge. Furthermore, we propose the concept of ''closed missingness mechanisms'' and show that under these mechanisms an available case analysis is admissible for consistent estimation for any type of statistical and causal query, even if the underlying missingness mechanism is of missing not at random (MNAR) type. Both simulations and theoretical considerations demonstrate how, in the assumed MNAR setting of our study, a complete or available case analysis can be superior to multiple imputation, and estimation results vary depending on the assumed missingness DAG. Our analyses are possibly the first to show the applicability of missingness DAGs (m-DAGs) to complex longitudinal real-world data, while highlighting the sensitivity with respect to the assumed causal model."}, "https://arxiv.org/abs/2402.14574": {"title": "A discussion of the paper \"Safe testing\" by Gr\\\"unwald, de Heide, and Koolen", "link": "https://arxiv.org/abs/2402.14574", "description": "arXiv:2402.14574v1 Announce Type: new \nAbstract: This is a discussion of the paper \"Safe testing\" by Gr\\\"unwald, de Heide, and Koolen, Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, 24 January, 2024"}, "https://arxiv.org/abs/2402.14763": {"title": "Functional Spatial Autoregressive Models", "link": "https://arxiv.org/abs/2402.14763", "description": "arXiv:2402.14763v1 Announce Type: new \nAbstract: This study introduces a novel spatial autoregressive model in which the dependent variable is a function that may exhibit functional autocorrelation with the outcome functions of nearby units. This model can be characterized as a simultaneous integral equation system, which, in general, does not necessarily have a unique solution. For this issue, we provide a simple condition on the magnitude of the spatial interaction to ensure the uniqueness in data realization. For estimation, to account for the endogeneity caused by the spatial interaction, we propose a regularized two-stage least squares estimator based on a basis approximation for the functional parameter. 
The asymptotic properties of the estimator including the consistency and asymptotic normality are investigated under certain conditions. Additionally, we propose a simple Wald-type test for detecting the presence of spatial effects. As an empirical illustration, we apply the proposed model and method to analyze age distributions in Japanese cities."}, "https://arxiv.org/abs/2402.14775": {"title": "Localised Natural Causal Learning Algorithms for Weak Consistency Conditions", "link": "https://arxiv.org/abs/2402.14775", "description": "arXiv:2402.14775v1 Announce Type: new \nAbstract: By relaxing conditions for natural structure learning algorithms, a family of constraint-based algorithms containing all exact structure learning algorithms under the faithfulness assumption, we define localised natural structure learning algorithms (LoNS). We also provide a set of necessary and sufficient assumptions for consistency of LoNS, which can be thought of as a strict relaxation of the restricted faithfulness assumption. We provide a practical LoNS algorithm that runs in exponential time, which is then compared with related existing structure learning algorithms, namely PC/SGS and the relatively recent Sparsest Permutation algorithm. Simulation studies are also provided."}, "https://arxiv.org/abs/2402.14145": {"title": "Multiply Robust Estimation for Local Distribution Shifts with Multiple Domains", "link": "https://arxiv.org/abs/2402.14145", "description": "arXiv:2402.14145v1 Announce Type: cross \nAbstract: Distribution shifts are ubiquitous in real-world machine learning applications, posing a challenge to the generalization of models trained on one data distribution to another. We focus on scenarios where data distributions vary across multiple segments of the entire population and only make local assumptions about the differences between training and test (deployment) distributions within each segment. We propose a two-stage multiply robust estimation method to improve model performance on each individual segment for tabular data analysis. The method involves fitting a linear combination of the based models, learned using clusters of training data from multiple segments, followed by a refinement step for each segment. Our method is designed to be implemented with commonly used off-the-shelf machine learning models. We establish theoretical guarantees on the generalization bound of the method on the test risk. With extensive experiments on synthetic and real datasets, we demonstrate that the proposed method substantially improves over existing alternatives in prediction accuracy and robustness on both regression and classification tasks. We also assess its effectiveness on a user city prediction dataset from a large technology company."}, "https://arxiv.org/abs/2402.14220": {"title": "Estimating Unknown Population Sizes Using the Hypergeometric Distribution", "link": "https://arxiv.org/abs/2402.14220", "description": "arXiv:2402.14220v1 Announce Type: cross \nAbstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. 
We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data."}, "https://arxiv.org/abs/2402.14264": {"title": "Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation", "link": "https://arxiv.org/abs/2402.14264", "description": "arXiv:2402.14264v1 Announce Type: cross \nAbstract: Average treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, recently also incorporating generic machine learning estimators, the statistical optimality of these methods has still remained an open area of investigation. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that attain small errors; which is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as a black-box sub-process. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATTE), as well as weighted variants of the former, which arise in policy evaluation."}, "https://arxiv.org/abs/2402.14390": {"title": "Composite likelihood inference for the Poisson log-normal model", "link": "https://arxiv.org/abs/2402.14390", "description": "arXiv:2402.14390v1 Announce Type: cross \nAbstract: Inferring parameters of a latent variable model can be a daunting task when the conditional distribution of the latent variables given the observed ones is intractable. Variational approaches prove to be computationally efficient but, possibly, lack theoretical guarantees on the estimates, while sampling based solutions are quite the opposite. Starting from already available variational approximations, we define a first Monte Carlo EM algorithm to obtain maximum likelihood estimators, focusing on the Poisson log-normal model which provides a generic framework for the analysis of multivariate count data. We then extend this algorithm to the case of a composite likelihood in order to be able to handle higher dimensional count data."}, "https://arxiv.org/abs/2402.14481": {"title": "Towards Automated Causal Discovery: a case study on 5G telecommunication data", "link": "https://arxiv.org/abs/2402.14481", "description": "arXiv:2402.14481v1 Announce Type: cross \nAbstract: We introduce the concept of Automated Causal Discovery (AutoCD), defined as any system that aims to fully automate the application of causal discovery and causal reasoning methods. 
AutoCD's goal is to deliver all causal information that an expert human analyst would and answer a user's causal queries. We describe the architecture of such a platform, and illustrate its performance on synthetic data sets. As a case study, we apply it on temporal telecommunication data. The system is general and can be applied to a plethora of causal discovery problems."}, "https://arxiv.org/abs/2402.14538": {"title": "Interference Produces False-Positive Pricing Experiments", "link": "https://arxiv.org/abs/2402.14538", "description": "arXiv:2402.14538v1 Announce Type: cross \nAbstract: It is standard practice in online retail to run pricing experiments by randomizing at the article-level, i.e. by changing prices of different products to identify treatment effects. Due to customers' cross-price substitution behavior, such experiments suffer from interference bias: the observed difference between treatment groups in the experiment is typically significantly larger than the global effect that could be expected after a roll-out decision of the tested pricing policy. We show in simulations that such bias can be as large as 100%, and report experimental data implying bias of similar magnitude. Finally, we discuss approaches for de-biased pricing experiments, suggesting observational methods as a potentially attractive alternative to clustering."}, "https://arxiv.org/abs/2402.14764": {"title": "A Combinatorial Central Limit Theorem for Stratified Randomization", "link": "https://arxiv.org/abs/2402.14764", "description": "arXiv:2402.14764v1 Announce Type: cross \nAbstract: This paper establishes a combinatorial central limit theorem for stratified randomization that holds under Lindeberg-type conditions and allows for a growing number of large and small strata. The result is then applied to derive the asymptotic distributions of two test statistics proposed in a finite population setting with randomly assigned instruments and a super population instrumental variables model, both having many strata."}, "https://arxiv.org/abs/2402.14781": {"title": "Rao-Blackwellising Bayesian Causal Inference", "link": "https://arxiv.org/abs/2402.14781", "description": "arXiv:2402.14781v1 Announce Type: cross \nAbstract: Bayesian causal inference, i.e., inferring a posterior over causal models for the use in downstream causal reasoning tasks, poses a hard computational inference problem that is little explored in literature. In this work, we combine techniques from order-based MCMC structure learning with recent advances in gradient-based graph learning into an effective Bayesian causal inference framework. Specifically, we decompose the problem of inferring the causal structure into (i) inferring a topological order over variables and (ii) inferring the parent sets for each variable. When limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time. We further use Gaussian processes to model the unknown causal mechanisms, which also allows their exact marginalisation. This introduces a Rao-Blackwellization scheme, where all components are eliminated from the model, except for the causal order, for which we learn a distribution via gradient-based optimisation. 
The combination of Rao-Blackwellization with our sequential inference procedure for causal orders yields state-of-the-art on linear and non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures."}, "https://arxiv.org/abs/2203.03020": {"title": "Optimal regimes for algorithm-assisted human decision-making", "link": "https://arxiv.org/abs/2203.03020", "description": "arXiv:2203.03020v3 Announce Type: replace \nAbstract: We consider optimal regimes for algorithm-assisted human decision-making. Such regimes are decision functions of measured pre-treatment variables and, by leveraging natural treatment values, enjoy a \"superoptimality\" property whereby they are guaranteed to outperform conventional optimal regimes. When there is unmeasured confounding, the benefit of using superoptimal regimes can be considerable. When there is no unmeasured confounding, superoptimal regimes are identical to conventional optimal regimes. Furthermore, identification of the expected outcome under superoptimal regimes in non-experimental studies requires the same assumptions as identification of value functions under conventional optimal regimes when the treatment is binary. To illustrate the utility of superoptimal regimes, we derive new identification and estimation results in a common instrumental variable setting. We use these derivations to analyze examples from the optimal regimes literature, including a case study of the effect of prompt intensive care treatment on survival."}, "https://arxiv.org/abs/2211.02609": {"title": "How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests", "link": "https://arxiv.org/abs/2211.02609", "description": "arXiv:2211.02609v2 Announce Type: replace \nAbstract: There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard `point-form null' significance tests consider only within-experiment but ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This `distributional null' approach does not underestimate experimental variability and so is not overconfident in identifying significance; because this approach addresses between-experiment variation, it gives mathematically coherent estimates for the probability of replication of significant results. Using a large-scale replication dataset (the first `Many Labs' project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account in this approach. Further, grouping experiments in this dataset into `predictor-target' pairs we show that the predicted replication probabilities for target experiments produced in this approach (given predictor experiment results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. 
Distributional null hypothesis testing thus gives researchers a statistical tool for identifying statistically significant and reliably replicable results."}, "https://arxiv.org/abs/2303.09680": {"title": "Bootstrap based asymptotic refinements for high-dimensional nonlinear models", "link": "https://arxiv.org/abs/2303.09680", "description": "arXiv:2303.09680v2 Announce Type: replace \nAbstract: We consider penalized extremum estimation of a high-dimensional, possibly nonlinear model that is sparse in the sense that most of its parameters are zero but some are not. We use the SCAD penalty function, which provides model selection consistent and oracle efficient estimates under suitable conditions. However, asymptotic approximations based on the oracle model can be inaccurate with the sample sizes found in many applications. This paper gives conditions under which the bootstrap, based on estimates obtained through SCAD penalization with thresholding, provides asymptotic refinements of size \\(O \\left( n^{- 2} \\right)\\) for the error in the rejection (coverage) probability of a symmetric hypothesis test (confidence interval) and \\(O \\left( n^{- 1} \\right)\\) for the error in the rejection (coverage) probability of a one-sided or equal tailed test (confidence interval). The results of Monte Carlo experiments show that the bootstrap can provide large reductions in errors in rejection and coverage probabilities. The bootstrap is consistent, though it does not necessarily provide asymptotic refinements, even if some parameters are close but not equal to zero. Random-coefficients logit and probit models and nonlinear moment models are examples of models to which the procedure applies."}, "https://arxiv.org/abs/2303.13960": {"title": "Demystifying estimands in cluster-randomised trials", "link": "https://arxiv.org/abs/2303.13960", "description": "arXiv:2303.13960v3 Announce Type: replace \nAbstract: Estimands can help clarify the interpretation of treatment effects and ensure that estimators are aligned to the study's objectives. Cluster randomised trials require additional attributes to be defined within the estimand compared to individually randomised trials, including whether treatment effects are marginal or cluster specific, and whether they are participant or cluster average. In this paper, we provide formal definitions of estimands encompassing both these attributes using potential outcomes notation and describe differences between them. We then provide an overview of estimators for each estimand, describe their assumptions, and show consistency (i.e. asymptotically unbiased estimation) for a series of analyses based on cluster level summaries. Then, through a reanalysis of a published cluster randomised trial, we demonstrate that the choice of both estimand and estimator can affect interpretation. For instance, the estimated odds ratio ranged from 1.38 (p=0.17) to 1.83 (p=0.03) depending on the target estimand, and for some estimands, the choice of estimator affected the conclusions by leading to smaller treatment effect estimates. 
We conclude that careful specification of the estimand, along with an appropriate choice of estimator, are essential to ensuring that cluster randomised trials address the right question."}, "https://arxiv.org/abs/2306.08946": {"title": "Bootstrap aggregation and confidence measures to improve time series causal discovery", "link": "https://arxiv.org/abs/2306.08946", "description": "arXiv:2306.08946v2 Announce Type: replace \nAbstract: Learning causal graphs from multivariate time series is a ubiquitous challenge in all application domains dealing with time-dependent systems, such as in Earth sciences, biology, or engineering, to name a few. Recent developments for this causal discovery learning task have shown considerable skill, notably the specific time-series adaptations of the popular conditional independence-based learning framework. However, uncertainty estimation is challenging for conditional independence-based methods. Here, we introduce a novel bootstrap approach designed for time series causal discovery that preserves the temporal dependencies and lag structure. It can be combined with a range of time series causal discovery methods and provides a measure of confidence for the links of the time series graphs. Furthermore, next to confidence estimation, an aggregation, also called bagging, of the bootstrapped graphs by majority voting results in bagged causal discovery methods. In this work, we combine this approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves in precision and recall as compared to its base algorithm PCMCI+, at the cost of higher computational demands. These statistical performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications."}, "https://arxiv.org/abs/2307.14973": {"title": "Insufficient Gibbs Sampling", "link": "https://arxiv.org/abs/2307.14973", "description": "arXiv:2307.14973v2 Announce Type: replace \nAbstract: In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns; only aggregated, robust and inefficient statistics derived from the data are made accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. We consider a parametric framework and propose a method to sample from the posterior distribution of parameters conditioned on various robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or a collection of quantiles. Our approach leverages a Gibbs sampler and simulates latent augmented data, which facilitates simulation from the posterior distribution of parameters belonging to specific families of distributions. A by-product of these samples from the joint posterior distribution of parameters and data given the observed statistics is that we can estimate Bayes factors based on observed statistics via bridge sampling. 
We validate and outline the limitations of the proposed methods through toy examples and an application to real-world income data."}, "https://arxiv.org/abs/2308.08644": {"title": "Generalized Bradley-Terry Models for Score Estimation from Paired Comparisons", "link": "https://arxiv.org/abs/2308.08644", "description": "arXiv:2308.08644v2 Announce Type: replace \nAbstract: Many applications, e.g. in content recommendation, sports, or recruitment, leverage the comparisons of alternatives to score those alternatives. The classical Bradley-Terry model and its variants have been widely used to do so. The historical model considers binary comparisons (victory or defeat) between alternatives, while more recent developments allow finer comparisons to be taken into account. In this article, we introduce a probabilistic model encompassing a broad variety of paired comparisons that can take discrete or continuous values. We do so by considering a well-behaved subset of the exponential family, which we call the family of generalized Bradley-Terry (GBT) models, as it includes the classical Bradley-Terry model and many of its variants. Remarkably, we prove that all GBT models are guaranteed to yield a strictly convex negative log-likelihood. Moreover, assuming a Gaussian prior on alternatives' scores, we prove that the maximum a posteriori (MAP) of GBT models, whose existence, uniqueness and fast computation are thus guaranteed, varies monotonically with respect to comparisons (the more A beats B, the better the score of A) and is Lipschitz-resilient with respect to each new comparison (a single new comparison can only have a bounded effect on all the estimated scores). These desirable properties make GBT models appealing for practical use. We illustrate some features of GBT models on simulations."}, "https://arxiv.org/abs/2309.05092": {"title": "Adaptive conformal classification with noisy labels", "link": "https://arxiv.org/abs/2309.05092", "description": "arXiv:2309.05092v2 Announce Type: replace \nAbstract: This paper develops novel conformal prediction methods for classification tasks that can automatically adapt to random label contamination in the calibration sample, leading to more informative prediction sets with stronger coverage guarantees compared to state-of-the-art approaches. This is made possible by a precise characterization of the effective coverage inflation (or deflation) suffered by standard conformal inferences in the presence of label contamination, which is then made actionable through new calibration algorithms. Our solution is flexible and can leverage different modeling assumptions about the label contamination process, while requiring no knowledge of the underlying data distribution or of the inner workings of the machine-learning classifier. The advantages of the proposed methods are demonstrated through extensive simulations and an application to object classification with the CIFAR-10H image data set."}, "https://arxiv.org/abs/2310.14763": {"title": "Externally Valid Policy Evaluation Combining Trial and Observational Data", "link": "https://arxiv.org/abs/2310.14763", "description": "arXiv:2310.14763v2 Announce Type: replace \nAbstract: Randomized trials are widely considered as the gold standard for evaluating the effects of decision policies. Trial data is, however, drawn from a population which may differ from the intended target population and this raises a problem of external validity (aka. generalizability). 
In this paper we seek to use trial data to draw valid inferences about the outcome of a policy on the target population. Additional covariate data from the target population is used to model the sampling of individuals in the trial study. We develop a method that yields certifiably valid trial-based policy evaluations under any specified range of model miscalibrations. The method is nonparametric and the validity is assured even with finite samples. The certified policy evaluations are illustrated using both simulated and real data."}, "https://arxiv.org/abs/2401.11422": {"title": "Local Identification in Instrumental Variable Multivariate Quantile Regression Models", "link": "https://arxiv.org/abs/2401.11422", "description": "arXiv:2401.11422v2 Announce Type: replace \nAbstract: The instrumental variable (IV) quantile regression model introduced by Chernozhukov and Hansen (2005) is a useful tool for analyzing quantile treatment effects in the presence of endogeneity, but when outcome variables are multidimensional, it is silent on the joint distribution of different dimensions of each variable. To overcome this limitation, we propose an IV model built on the optimal-transport-based multivariate quantile that takes into account the correlation between the entries of the outcome variable. We then provide a local identification result for the model. Surprisingly, we find that the support size of the IV required for the identification is independent of the dimension of the outcome vector, as long as the IV is sufficiently informative. Our result follows from a general identification theorem that we establish, which has independent theoretical significance."}, "https://arxiv.org/abs/2401.12911": {"title": "Pretraining and the Lasso", "link": "https://arxiv.org/abs/2401.12911", "description": "arXiv:2401.12911v2 Announce Type: replace \nAbstract: Pretraining is a popular and powerful paradigm in machine learning. As an example, suppose one has a modest-sized dataset of images of cats and dogs, and plans to fit a deep neural network to classify them from the pixel features. With pretraining, we start with a neural network trained on a large corpus of images, consisting of not just cats and dogs but hundreds of other image types. Then we fix all of the network weights except for the top layer (which makes the final classification) and train (or \"fine tune\") those weights on our dataset. This often results in dramatically better performance than the network trained solely on our smaller dataset.\n In this paper, we ask the question \"Can pretraining help the lasso?\". We develop a framework for the lasso in which an overall model is fit to a large set of data, and then fine-tuned to a specific task on a smaller dataset. This latter dataset can be a subset of the original dataset, but does not need to be. We find that this framework has a wide variety of applications, including stratified models, multinomial targets, multi-response models, conditional average treatment estimation and even gradient boosting.\n In the stratified model setting, the pretrained lasso pipeline estimates the coefficients common to all groups at the first stage, and then group specific coefficients at the second \"fine-tuning\" stage. We show that under appropriate assumptions, the support recovery rate of the common coefficients is superior to that of the usual lasso trained only on individual groups. 
This separate identification of common and individual coefficients can also be useful for scientific understanding."}, "https://arxiv.org/abs/2310.19253": {"title": "Flow-based Distributionally Robust Optimization", "link": "https://arxiv.org/abs/2310.19253", "description": "arXiv:2310.19253v3 Announce Type: replace-cross \nAbstract: We present a computationally efficient framework, called $\\texttt{FlowDRO}$, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD) and sample from it. The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution and develop a Wasserstein proximal gradient flow type algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on high-dimensional real data."}, "https://arxiv.org/abs/2402.14942": {"title": "On Identification of Dynamic Treatment Regimes with Proxies of Hidden Confounders", "link": "https://arxiv.org/abs/2402.14942", "description": "arXiv:2402.14942v1 Announce Type: new \nAbstract: We consider identification of optimal dynamic treatment regimes in a setting where time-varying treatments are confounded by hidden time-varying confounders, but proxy variables of the unmeasured confounders are available. We show that, with two independent proxy variables at each time point that are sufficiently relevant for the hidden confounders, identification of the joint distribution of counterfactuals is possible, thereby facilitating identification of an optimal treatment regime."}, "https://arxiv.org/abs/2402.15004": {"title": "Repro Samples Method for a Performance Guaranteed Inference in General and Irregular Inference Problems", "link": "https://arxiv.org/abs/2402.15004", "description": "arXiv:2402.15004v1 Announce Type: new \nAbstract: Rapid advancements in data science require us to have fundamentally new frameworks to tackle prevalent but highly non-trivial \"irregular\" inference problems, to which the large sample central limit theorem does not apply. Typical examples are those involving discrete or non-numerical parameters and those involving non-numerical data, etc. In this article, we present an innovative, wide-reaching, and effective approach, called \"repro samples method,\" to conduct statistical inference for these irregular problems plus more. The development relates to but improves several existing simulation-inspired inference approaches, and we provide both exact and approximate theories to support our development. 
Moreover, the proposed approach is broadly applicable and subsumes the classical Neyman-Pearson framework as a special case. For the often-seen irregular inference problems that involve both discrete/non-numerical and continuous parameters, we propose an effective three-step procedure to make inferences for all parameters. We also develop a unique matching scheme that turns the discreteness of discrete/non-numerical parameters from an obstacle for forming inferential theories into a beneficial attribute for improving computational efficiency. We demonstrate the effectiveness of the proposed general methodology using various examples, including a case study example on a Gaussian mixture model with unknown number of components. This case study example provides a solution to a long-standing open inference question in statistics on how to quantify the estimation uncertainty for the unknown number of components and other associated parameters. Real data and simulation studies, with comparisons to existing approaches, demonstrate the far superior performance of the proposed method."}, "https://arxiv.org/abs/2402.15030": {"title": "Adjusting for Ascertainment Bias in Meta-Analysis of Penetrance for Cancer Risk", "link": "https://arxiv.org/abs/2402.15030", "description": "arXiv:2402.15030v1 Announce Type: new \nAbstract: Multi-gene panel testing allows efficient detection of pathogenic variants in cancer susceptibility genes including moderate-risk genes such as ATM and PALB2. A growing number of studies examine the risk of breast cancer (BC) conferred by pathogenic variants of such genes. A meta-analysis combining the reported risk estimates can provide an overall age-specific risk of developing BC, i.e., penetrance for a gene. However, estimates reported by case-control studies often suffer from ascertainment bias. Currently there are no methods available to adjust for such ascertainment bias in this setting. We consider a Bayesian random-effects meta-analysis method that can synthesize different types of risk measures and extend it to incorporate studies with ascertainment bias. This is achieved by introducing a bias term in the model and assigning appropriate priors. We validate the method through a simulation study and apply it to estimate BC penetrance for carriers of pathogenic variants of ATM and PALB2 genes. Our simulations show that the proposed method results in more accurate and precise penetrance estimates compared to when no adjustment is made for ascertainment bias or when such biased studies are discarded from the analysis. The estimated overall BC risk for individuals with pathogenic variants in (1) ATM is 5.77% (3.22%-9.67%) by age 50 and 26.13% (20.31%-32.94%) by age 80; (2) PALB2 is 12.99% (6.48%-22.23%) by age 50 and 44.69% (34.40%-55.80%) by age 80. The proposed method allows for meta-analyses to include studies with ascertainment bias resulting in a larger number of studies included and thereby more robust estimates."}, "https://arxiv.org/abs/2402.15060": {"title": "A uniformly ergodic Gibbs sampler for Bayesian survival analysis", "link": "https://arxiv.org/abs/2402.15060", "description": "arXiv:2402.15060v1 Announce Type: new \nAbstract: Finite sample inference for Cox models is an important problem in many settings, such as clinical trials. Bayesian procedures provide a means for finite sample inference and incorporation of prior information if MCMC algorithms and posteriors are well behaved. 
On the other hand, estimation procedures should also retain inferential properties in high dimensional settings. In addition, estimation procedures should be able to incorporate constraints and multilevel modeling such as cure models and frailty models in a straightforward manner. In order to tackle these modeling challenges, we propose a uniformly ergodic Gibbs sampler for a broad class of convex set constrained multilevel Cox models. We develop two key strategies. First, we exploit a connection between Cox models and negative binomial processes through the Poisson process to reduce Bayesian computation to iterative Gaussian sampling. Next, we appeal to sufficient dimension reduction to address the difficult computation of nonparametric baseline hazards, allowing for the collapse of the Markov transition operator within the Gibbs sampler based on sufficient statistics. We demonstrate our approach using open source data and simulations."}, "https://arxiv.org/abs/2402.15071": {"title": "High-Dimensional Covariate-Augmented Overdispersed Poisson Factor Model", "link": "https://arxiv.org/abs/2402.15071", "description": "arXiv:2402.15071v1 Announce Type: new \nAbstract: The current Poisson factor models often assume that the factors are unknown, which overlooks the explanatory potential of certain observable covariates. This study focuses on high dimensional settings, where the number of the count response variables and/or covariates can diverge as the sample size increases. A covariate-augmented overdispersed Poisson factor model is proposed to jointly perform a high-dimensional Poisson factor analysis and estimate a large coefficient matrix for overdispersed count data. A group of identifiability conditions are provided to theoretically guarantee computational identifiability. We incorporate the interdependence of both response variables and covariates by imposing a low-rank constraint on the large coefficient matrix. To address the computation challenges posed by nonlinearity, two high-dimensional latent matrices, and the low-rank constraint, we propose a novel variational estimation scheme that combines Laplace and Taylor approximations. We also develop a criterion based on a singular value ratio to determine the number of factors and the rank of the coefficient matrix. Comprehensive simulation studies demonstrate that the proposed method outperforms the state-of-the-art methods in estimation accuracy and computational efficiency. The practical merit of our method is demonstrated by an application to the CITE-seq dataset. A flexible implementation of our proposed method is available in the R package \\emph{COAP}."}, "https://arxiv.org/abs/2402.15086": {"title": "A modified debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization", "link": "https://arxiv.org/abs/2402.15086", "description": "arXiv:2402.15086v1 Announce Type: new \nAbstract: Mendelian randomization uses genetic variants as instrumental variables to make causal inferences about the effects of modifiable risk factors on diseases from observational data. One of the major challenges in Mendelian randomization is that many genetic variants are only modestly or even weakly associated with the risk factor of interest, a setting known as many weak instruments. Many existing methods, such as the popular inverse-variance weighted (IVW) method, could be biased when the instrument strength is weak. 
To address this issue, the debiased IVW (dIVW) estimator, which is shown to be robust to many weak instruments, was recently proposed. However, this estimator still has non-ignorable bias when the effective sample size is small. In this paper, we propose a modified debiased IVW (mdIVW) estimator by multiplying a modification factor to the original dIVW estimator. After this simple correction, we show that the bias of the mdIVW estimator converges to zero at a faster rate than that of the dIVW estimator under some regularity conditions. Moreover, the mdIVW estimator has smaller variance than the dIVW estimator.We further extend the proposed method to account for the presence of instrumental variable selection and balanced horizontal pleiotropy. We demonstrate the improvement of the mdIVW estimator over the dIVW estimator through extensive simulation studies and real data analysis."}, "https://arxiv.org/abs/2402.15137": {"title": "Benchmarking Observational Studies with Experimental Data under Right-Censoring", "link": "https://arxiv.org/abs/2402.15137", "description": "arXiv:2402.15137v1 Announce Type: new \nAbstract: Drawing causal inferences from observational studies (OS) requires unverifiable validity assumptions; however, one can falsify those assumptions by benchmarking the OS with experimental data from a randomized controlled trial (RCT). A major limitation of existing procedures is not accounting for censoring, despite the abundance of RCTs and OSes that report right-censored time-to-event outcomes. We consider two cases where censoring time (1) is independent of time-to-event and (2) depends on time-to-event the same way in OS and RCT. For the former, we adopt a censoring-doubly-robust signal for the conditional average treatment effect (CATE) to facilitate an equivalence test of CATEs in OS and RCT, which serves as a proxy for testing if the validity assumptions hold. For the latter, we show that the same test can still be used even though unbiased CATE estimation may not be possible. We verify the effectiveness of our censoring-aware tests via semi-synthetic experiments and analyze RCT and OS data from the Women's Health Initiative study."}, "https://arxiv.org/abs/2402.15292": {"title": "adjustedCurves: Estimating Confounder-Adjusted Survival Curves in R", "link": "https://arxiv.org/abs/2402.15292", "description": "arXiv:2402.15292v1 Announce Type: new \nAbstract: Kaplan-Meier curves stratified by treatment allocation are the most popular way to depict causal effects in studies with right-censored time-to-event endpoints. If the treatment is randomly assigned and the sample size of the study is adequate, this method produces unbiased estimates of the population-averaged counterfactual survival curves. However, in the presence of confounding, this is no longer the case. Instead, specific methods that allow adjustment for confounding must be used. We present the adjustedCurves R package, which can be used to estimate and plot these confounder-adjusted survival curves using a variety of methods from the literature. It provides a convenient wrapper around existing R packages on the topic and adds additional methods and functionality on top of it, uniting the sometimes vastly different methods under one consistent framework. Among the additional features are the estimation of confidence intervals, confounder-adjusted restricted mean survival times and confounder-adjusted survival time quantiles. 
After giving a brief overview of the implemented methods, we illustrate the package using publicly available data from an observational study including 2982 breast cancer patients."}, "https://arxiv.org/abs/2402.15357": {"title": "Rapid Bayesian identification of sparse nonlinear dynamics from scarce and noisy data", "link": "https://arxiv.org/abs/2402.15357", "description": "arXiv:2402.15357v1 Announce Type: new \nAbstract: We propose a fast probabilistic framework for identifying differential equations governing the dynamics of observed data. We recast the SINDy method within a Bayesian framework and use Gaussian approximations for the prior and likelihood to speed up computation. The resulting method, Bayesian-SINDy, not only quantifies uncertainty in the parameters estimated but also is more robust when learning the correct model from limited and noisy data. Using both synthetic and real-life examples such as Lynx-Hare population dynamics, we demonstrate the effectiveness of the new framework in learning correct model equations and compare its computational and data efficiency with existing methods. Because Bayesian-SINDy can quickly assimilate data and is robust against noise, it is particularly suitable for biological data and real-time system identification in control. Its probabilistic framework also enables the calculation of information entropy, laying the foundation for an active learning strategy."}, "https://arxiv.org/abs/2402.15489": {"title": "On inference for modularity statistics in structured networks", "link": "https://arxiv.org/abs/2402.15489", "description": "arXiv:2402.15489v1 Announce Type: new \nAbstract: This paper revisits the classical concept of network modularity and its spectral relaxations used throughout graph data analysis. We formulate and study several modularity statistic variants for which we establish asymptotic distributional results in the large-network limit for networks exhibiting nodal community structure. Our work facilitates testing for network differences and can be used in conjunction with existing theoretical guarantees for stochastic blockmodel random graphs. Our results are enabled by recent advances in the study of low-rank truncations of large network adjacency matrices. We provide confirmatory simulation studies and real data analysis pertaining to the network neuroscience study of psychosis, specifically schizophrenia. Collectively, this paper contributes to the limited existing literature to date on statistical inference for modularity-based network analysis. Supplemental materials for this article are available online."}, "https://arxiv.org/abs/2402.14966": {"title": "Smoothness Adaptive Hypothesis Transfer Learning", "link": "https://arxiv.org/abs/2402.14966", "description": "arXiv:2402.14966v1 Announce Type: cross \nAbstract: Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset in practice. In this paper, we address these problems by proposing Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression (KRR)-based algorithm. We first prove that employing the misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality and derive an adaptive procedure to the unknown Sobolev smoothness. 
Leveraging these results, SATL employs Gaussian kernels in both phases so that the estimators can adapt to the unknown smoothness of the target/source and their offset function. We derive the minimax lower bound of the learning problem in excess risk and show that SATL enjoys a matching upper bound up to a logarithmic factor. The minimax convergence rate sheds light on the factors influencing transfer dynamics and demonstrates the superiority of SATL compared to non-transfer learning settings. While our main objective is a theoretical analysis, we also conduct several experiments to confirm our results."}, "https://arxiv.org/abs/2402.14979": {"title": "Optimizing Language Models for Human Preferences is a Causal Inference Problem", "link": "https://arxiv.org/abs/2402.14979", "description": "arXiv:2402.14979v1 Announce Type: cross \nAbstract: As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions."}, "https://arxiv.org/abs/2402.15053": {"title": "Nonlinear Bayesian optimal experimental design using logarithmic Sobolev inequalities", "link": "https://arxiv.org/abs/2402.15053", "description": "arXiv:2402.15053v1 Announce Type: cross \nAbstract: We study the problem of selecting $k$ experiments from a larger candidate pool, where the goal is to maximize mutual information (MI) between the selected subset and the underlying parameters. Finding the exact solution to this combinatorial optimization problem is computationally costly, not only due to the complexity of the combinatorial search but also the difficulty of evaluating MI in nonlinear/non-Gaussian settings. We propose greedy approaches based on new computationally inexpensive lower bounds for MI, constructed via log-Sobolev inequalities. We demonstrate that our method outperforms random selection strategies, Gaussian approximations, and nested Monte Carlo (NMC) estimators of MI in various settings, including optimal design for nonlinear models with non-additive noise."}, "https://arxiv.org/abs/2402.15301": {"title": "Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models", "link": "https://arxiv.org/abs/2402.15301", "description": "arXiv:2402.15301v1 Announce Type: cross \nAbstract: Causal graph recovery is essential in the field of causal inference. 
Traditional methods are typically knowledge-based or statistical estimation-based, which are limited by data collection biases and individuals' knowledge about factors affecting the relations between variables of interests. The advance of large language models (LLMs) provides opportunities to address these problems. We propose a novel method that utilizes the extensive knowledge contained within a large corpus of scientific literature to deduce causal relationships in general causal graph recovery tasks. This method leverages Retrieval Augmented-Generation (RAG) based LLMs to systematically analyze and extract pertinent information from a comprehensive collection of research papers. Our method first retrieves relevant text chunks from the aggregated literature. Then, the LLM is tasked with identifying and labelling potential associations between factors. Finally, we give a method to aggregate the associational relationships to build a causal graph. We demonstrate our method is able to construct high quality causal graphs on the well-known SACHS dataset solely from literature."}, "https://arxiv.org/abs/2402.15460": {"title": "Potential outcome simulation for efficient head-to-head comparison of adaptive dose-finding designs", "link": "https://arxiv.org/abs/2402.15460", "description": "arXiv:2402.15460v1 Announce Type: cross \nAbstract: Dose-finding trials are a key component of the drug development process and rely on a statistical design to help inform dosing decisions. Triallists wishing to choose a design require knowledge of operating characteristics of competing methods. This is often assessed using a large-scale simulation study with multiple designs and configurations investigated, which can be time-consuming and therefore limits the scope of the simulation.\n We introduce a new approach to the design of simulation studies of dose-finding trials. The approach simulates all potential outcomes that individuals could experience at each dose level in the trial. Datasets are simulated in advance and then the same datasets are applied to each of the competing methods to enable a more efficient head-to-head comparison.\n In two case-studies we show sizeable reductions in Monte Carlo error for comparing a performance metric between two competing designs. Efficiency gains depend on the similarity of the designs. Comparing two Phase I/II design variants, with high correlation of recommending the same optimal biologic dose, we show that the new approach requires a simulation study that is approximately 30 times smaller than the conventional approach. Furthermore, advance-simulated trial datasets can be reused to assess the performance of designs across multiple configurations.\n We recommend researchers consider this more efficient simulation approach in their dose-finding studies and we have updated the R package escalation to help facilitate implementation."}, "https://arxiv.org/abs/2010.02848": {"title": "Unified Robust Estimation", "link": "https://arxiv.org/abs/2010.02848", "description": "arXiv:2010.02848v5 Announce Type: replace \nAbstract: Robust estimation is primarily concerned with providing reliable parameter estimates in the presence of outliers. Numerous robust loss functions have been proposed in regression and classification, along with various computing algorithms. In modern penalised generalised linear models (GLM), however, there is limited research on robust estimation that can provide weights to determine the outlier status of the observations. 
This article proposes a unified framework based on a large family of loss functions, a composite of concave and convex functions (CC-family). Properties of the CC-family are investigated, and CC-estimation is innovatively conducted via the iteratively reweighted convex optimisation (IRCO), which is a generalisation of the iteratively reweighted least squares in robust linear regression. For robust GLM, the IRCO becomes the iteratively reweighted GLM. The unified framework contains penalised estimation and robust support vector machine and is demonstrated with a variety of data applications."}, "https://arxiv.org/abs/2206.14275": {"title": "Dynamic CoVaR Modeling", "link": "https://arxiv.org/abs/2206.14275", "description": "arXiv:2206.14275v3 Announce Type: replace \nAbstract: The popular systemic risk measure CoVaR (conditional Value-at-Risk) is widely used in economics and finance. Formally, it is defined as a large quantile of one variable (e.g., losses in the financial system) conditional on some other variable (e.g., losses in a bank's shares) being in distress. In this article, we propose joint dynamic forecasting models for the Value-at-Risk (VaR) and CoVaR. We also introduce a two-step M-estimator for the model parameters drawing on recently proposed bivariate scoring functions for the pair (VaR, CoVaR). We prove consistency and asymptotic normality of our parameter estimator and analyze its finite-sample properties in simulations. Finally, we apply a specific subclass of our dynamic forecasting models, which we call CoCAViaR models, to log-returns of large US banks. It is shown that our CoCAViaR models generate CoVaR predictions that are superior to forecasts issued from current benchmark models."}, "https://arxiv.org/abs/2301.10640": {"title": "Adaptive enrichment trial designs using joint modeling of longitudinal and time-to-event data", "link": "https://arxiv.org/abs/2301.10640", "description": "arXiv:2301.10640v2 Announce Type: replace \nAbstract: Adaptive enrichment allows for pre-defined patient subgroups of interest to be investigated throughout the course of a clinical trial. Many trials which measure a long-term time-to-event endpoint often also routinely collect repeated measures on biomarkers which may be predictive of the primary endpoint. Although these data may not be leveraged directly to support subgroup selection decisions and early stopping decisions, we aim to make greater use of these data to increase efficiency and improve interim decision making. In this work, we present a joint model for longitudinal and time-to-event data and two methods for creating standardised statistics based on this joint model. We can use the estimates to define enrichment rules and efficacy and futility early stopping rules for a flexible efficient clinical trial with possible enrichment. Under this framework, we show asymptotically that the familywise error rate is protected in the strong sense. To assess the results, we consider a trial for the treatment of metastatic breast cancer where repeated ctDNA measurements are available and the subgroup criteria is defined by patients' ER and HER2 status. 
Using simulation, we show that incorporating biomarker information leads to accurate subgroup identification and increases in power."}, "https://arxiv.org/abs/2303.17478": {"title": "A Bayesian Dirichlet Auto-Regressive Moving Average Model for Forecasting Lead Times", "link": "https://arxiv.org/abs/2303.17478", "description": "arXiv:2303.17478v3 Announce Type: replace \nAbstract: Lead time data is compositional data found frequently in the hospitality industry. Hospitality businesses earn fees each day, however these fees cannot be recognized until later. For business purposes, it is important to understand and forecast the distribution of future fees for the allocation of resources, for business planning, and for staffing. Motivated by 5 years of daily fees data, we propose a new class of Bayesian time series models, a Bayesian Dirichlet Auto-Regressive Moving Average (B-DARMA) model for compositional time series, modeling the proportion of future fees that will be recognized in 11 consecutive 30 day windows and 1 last consecutive 35 day window. Each day's compositional datum is modeled as Dirichlet distributed given the mean and a scale parameter. The mean is modeled with a Vector Autoregressive Moving Average process after transforming with an additive log ratio link function and depends on previous compositional data, previous compositional parameters and daily covariates. The B-DARMA model offers solutions to data analyses of large compositional vectors and short or long time series, offers efficiency gains through choice of priors, provides interpretable parameters for inference, and makes reasonable forecasts."}, "https://arxiv.org/abs/2304.01363": {"title": "Sufficient and Necessary Conditions for the Identifiability of DINA Models with Polytomous Responses", "link": "https://arxiv.org/abs/2304.01363", "description": "arXiv:2304.01363v3 Announce Type: replace \nAbstract: Cognitive Diagnosis Models (CDMs) provide a powerful statistical and psychometric tool for researchers and practitioners to learn fine-grained diagnostic information about respondents' latent attributes. There has been a growing interest in the use of CDMs for polytomous response data, as more and more items with multiple response options become widely used. Similar to many latent variable models, the identifiability of CDMs is critical for accurate parameter estimation and valid statistical inference. However, the existing identifiability results are primarily focused on binary response models and have not adequately addressed the identifiability of CDMs with polytomous responses. This paper addresses this gap by presenting sufficient and necessary conditions for the identifiability of the widely used DINA model with polytomous responses, with the aim to provide a comprehensive understanding of the identifiability of CDMs with polytomous responses and to inform future research in this field."}, "https://arxiv.org/abs/2312.01735": {"title": "Weighted Q-learning for optimal dynamic treatment regimes with MNAR covariates", "link": "https://arxiv.org/abs/2312.01735", "description": "arXiv:2312.01735v3 Announce Type: replace \nAbstract: Dynamic treatment regimes (DTRs) formalize medical decision-making as a sequence of rules for different stages, mapping patient-level information to recommended treatments. 
In practice, estimating an optimal DTR using observational data from electronic medical record (EMR) databases can be complicated by covariates that are missing not at random (MNAR) due to informative monitoring of patients. Since complete case analysis can result in consistent estimation of outcome model parameters under the assumption of outcome-independent missingness, Q-learning is a natural approach to accommodating MNAR covariates. However, the backward induction algorithm used in Q-learning can introduce challenges, as MNAR covariates at later stages can result in MNAR pseudo-outcomes at earlier stages, leading to suboptimal DTRs, even if the longitudinal outcome variables are fully observed. To address this unique missing data problem in DTR settings, we propose two weighted Q-learning approaches where inverse probability weights for missingness of the pseudo-outcomes are obtained through estimating equations with valid nonresponse instrumental variables or sensitivity analysis. Asymptotic properties of the weighted Q-learning estimators are derived and the finite-sample performance of the proposed methods is evaluated and compared with alternative methods through extensive simulation studies. Using EMR data from the Medical Information Mart for Intensive Care database, we apply the proposed methods to investigate the optimal fluid strategy for sepsis patients in intensive care units."}, "https://arxiv.org/abs/2401.06082": {"title": "Borrowing from historical control data in a Bayesian time-to-event model with flexible baseline hazard function", "link": "https://arxiv.org/abs/2401.06082", "description": "arXiv:2401.06082v2 Announce Type: replace \nAbstract: There is currently a focus on statistical methods which can use historical trial information to help accelerate the discovery, development and delivery of medicine. Bayesian methods can be constructed so that the borrowing is \"dynamic\" in the sense that the similarity of the data helps to determine how much information is used. In the time-to-event setting with one historical data set, a popular model for a range of baseline hazards is the piecewise exponential model where the time points are fixed and a borrowing structure is imposed on the model. Although convenient for implementation, this approach affects the borrowing capability of the model. We propose a Bayesian model which allows the time points to vary and a dependency to be placed between the baseline hazards. This serves to smooth the posterior baseline hazard, improving both model estimation and borrowing characteristics. We explore a variety of prior structures for the borrowing within our proposed model and assess their performance against established approaches. We demonstrate that this leads to improved type I error in the presence of prior data conflict and increased power. We have developed accompanying software which is freely available and enables easy implementation of the approach."}, "https://arxiv.org/abs/2307.11390": {"title": "Fast spatial simulation of extreme high-resolution radar precipitation data using INLA", "link": "https://arxiv.org/abs/2307.11390", "description": "arXiv:2307.11390v2 Announce Type: replace-cross \nAbstract: Aiming to deliver improved precipitation simulations for hydrological impact assessment studies, we develop a methodology for modelling and simulating high-dimensional spatial precipitation extremes, focusing on both their marginal distributions and tail dependence structures. 
Tail dependence is a crucial property for assessing the consequences of an extreme precipitation event, yet most stochastic weather generators do not attempt to capture this property. We model extreme precipitation using a latent Gaussian version of the spatial conditional extremes model. This requires data with Laplace marginal distributions, but precipitation distributions contain point masses at zero that complicate necessary standardisation procedures. We therefore employ two separate models, one for describing extremes of nonzero precipitation and one for describing the probability of precipitation occurrence. Extreme precipitation is simulated by combining simulations from the two models. Nonzero precipitation marginals are modelled using latent Gaussian models with gamma and generalised Pareto likelihoods, and four different precipitation occurrence models are investigated. Fast inference is achieved using integrated nested Laplace approximations (INLA). We model and simulate spatial precipitation extremes in Central Norway, using high-density radar data. Inference on a 6000-dimensional data set is achieved within hours, and the simulations capture the main trends of the observed precipitation well."}, "https://arxiv.org/abs/2308.01054": {"title": "Simulation-based inference using surjective sequential neural likelihood estimation", "link": "https://arxiv.org/abs/2308.01054", "description": "arXiv:2308.01054v2 Announce Type: replace-cross \nAbstract: We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel method for simulation-based inference in models where the evaluation of the likelihood function is not tractable and only a simulator that can generate synthetic data is available. SSNL fits a dimensionality-reducing surjective normalizing flow model and uses it as a surrogate likelihood function which allows for conventional Bayesian inference using either Markov chain Monte Carlo methods or variational inference. By embedding the data in a low-dimensional space, SSNL solves several issues previous likelihood-based methods had when applied to high-dimensional data sets that, for instance, contain non-informative data dimensions or lie along a lower-dimensional manifold. We evaluate SSNL on a wide variety of experiments and show that it generally outperforms contemporary methods used in simulation-based inference, for instance, on a challenging real-world example from astrophysics which models the magnetic field strength of the sun using a solar dynamo model."}, "https://arxiv.org/abs/2402.15585": {"title": "Inference for Regression with Variables Generated from Unstructured Data", "link": "https://arxiv.org/abs/2402.15585", "description": "arXiv:2402.15585v1 Announce Type: new \nAbstract: The leading strategy for analyzing unstructured data uses two steps. First, latent variables of economic interest are estimated with an upstream information retrieval model. Second, the estimates are treated as \"data\" in a downstream econometric model. We establish theoretical arguments for why this two-step strategy leads to biased inference in empirically plausible settings. More constructively, we propose a one-step strategy for valid inference that uses the upstream and downstream models jointly. 
The one-step strategy (i) substantially reduces bias in simulations; (ii) has quantitatively important effects in a leading application using CEO time-use data; and (iii) can be readily adapted by applied researchers."}, "https://arxiv.org/abs/2402.15600": {"title": "A Graph-based Approach to Estimating the Number of Clusters", "link": "https://arxiv.org/abs/2402.15600", "description": "arXiv:2402.15600v1 Announce Type: new \nAbstract: We consider the problem of estimating the number of clusters ($k$) in a dataset. We propose a non-parametric approach to the problem that is based on maximizing a statistic constructed from similarity graphs. This graph-based statistic is a robust summary measure of the similarity information among observations and is applicable even if the number of dimensions or number of clusters is possibly large. The approach is straightforward to implement, computationally fast, and can be paired with any kind of clustering technique. Asymptotic theory is developed to establish the selection consistency of the proposed approach. Simulation studies demonstrate that the graph-based statistic outperforms existing methods for estimating $k$. We illustrate its utility on a high-dimensional image dataset and RNA-seq dataset."}, "https://arxiv.org/abs/2402.15705": {"title": "A Variational Approach for Modeling High-dimensional Spatial Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2402.15705", "description": "arXiv:2402.15705v1 Announce Type: new \nAbstract: Gaussian and discrete non-Gaussian spatial datasets are prevalent across many fields such as public health, ecology, geosciences, and social sciences. Bayesian spatial generalized linear mixed models (SGLMMs) are a flexible class of models designed for these data, but SGLMMs do not scale well, even to moderately large datasets. State-of-the-art scalable SGLMMs (i.e., basis representations or sparse covariance/precision matrices) require posterior sampling via Markov chain Monte Carlo (MCMC), which can be prohibitive for large datasets. While variational Bayes (VB) have been extended to SGLMMs, their focus has primarily been on smaller spatial datasets. In this study, we propose two computationally efficient VB approaches for modeling moderate-sized and massive (millions of locations) Gaussian and discrete non-Gaussian spatial data. Our scalable VB method embeds semi-parametric approximations for the latent spatial random processes and parallel computing offered by modern high-performance computing systems. Our approaches deliver nearly identical inferential and predictive performance compared to 'gold standard' methods but achieve computational speedups of up to 1000x. We demonstrate our approaches through a comparative numerical study as well as applications to two real-world datasets. Our proposed VB methodology enables practitioners to model millions of non-Gaussian spatial observations using a standard laptop within a short timeframe."}, "https://arxiv.org/abs/2402.15772": {"title": "Mean-preserving rounding integer-valued ARMA models", "link": "https://arxiv.org/abs/2402.15772", "description": "arXiv:2402.15772v1 Announce Type: new \nAbstract: In the past four decades, research on count time series has made significant progress, but research on $\\mathbb{Z}$-valued time series is relatively rare. Existing $\\mathbb{Z}$-valued models are mainly of autoregressive structure, where the use of the rounding operator is very natural. 
Because of the discontinuity of the rounding operator, the formulation of the corresponding model identifiability conditions and the computation of parameter estimators need special attention. It is also difficult to derive closed-form formulae for crucial stochastic properties. We rediscover a stochastic rounding operator, referred to as mean-preserving rounding, which overcomes the above drawbacks. Then, a novel class of $\\mathbb{Z}$-valued ARMA models based on the new operator is proposed, and the existence of stationary solutions of the models is established. Stochastic properties including closed-form formulae for (conditional) moments, autocorrelation function, and conditional distributions are obtained. The advantages of our novel model class compared to existing ones are demonstrated. In particular, our model construction avoids identifiability issues such that maximum likelihood estimation is possible. A simulation study is provided, and the appealing performance of the new models is demonstrated on several real-world data sets."}, "https://arxiv.org/abs/2402.16053": {"title": "Reducing multivariate independence testing to two bivariate means comparisons", "link": "https://arxiv.org/abs/2402.16053", "description": "arXiv:2402.16053v1 Announce Type: new \nAbstract: Testing for independence between two random vectors is a fundamental problem in statistics. It is observed from empirical studies that many existing omnibus consistent tests may not work well for some strongly nonmonotonic and nonlinear relationships. To explore the reasons behind this issue, we transform the multivariate independence testing problem into the equivalent problem of checking the equality of two bivariate means. An important observation we made is that the power loss is mainly due to cancellation of positive and negative terms in dependence metrics, making them very close to zero. Motivated by this observation, we propose a class of consistent metrics with a positive integer $\\gamma$ that exactly characterize independence. Theoretically, we show that the metrics with even or infinite $\\gamma$ can effectively avoid the cancellation, and have high powers under the alternatives that two mean differences offset each other. Since we target a wide range of dependence scenarios in practice, we further suggest combining the p-values of test statistics with different $\\gamma$'s through Fisher's method. We illustrate the advantages of our proposed tests through extensive numerical studies."}, "https://arxiv.org/abs/2402.16322": {"title": "Estimating Stochastic Block Models in the Presence of Covariates", "link": "https://arxiv.org/abs/2402.16322", "description": "arXiv:2402.16322v1 Announce Type: new \nAbstract: In the standard stochastic block model for networks, the probability of a connection between two nodes, often referred to as the edge probability, depends on the unobserved communities each of these nodes belongs to. We consider a flexible framework in which each edge probability, together with the probability of community assignment, is also impacted by observed covariates. We propose a computationally tractable two-step procedure to estimate the conditional edge probabilities as well as the community assignment probabilities. The first step relies on a spectral clustering algorithm applied to a localized adjacency matrix of the network. In the second step, k-nearest neighbor regression estimates are computed on the extracted communities.
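The abstract above does not spell out the operator, but "mean-preserving rounding" reads naturally as unbiased stochastic rounding: round up with probability equal to the fractional part, so the rounded value has the same mean as the input. The sketch below implements that standard construction; whether it matches the paper's operator in every detail is an assumption.

```python
# Minimal sketch of an unbiased ("mean-preserving") stochastic rounding
# operator: round x up with probability equal to its fractional part, so
# that E[round(x)] = x. Assumed to illustrate the idea in the abstract;
# the paper's exact operator may differ in details.
import numpy as np

def mean_preserving_round(x, rng):
    x = np.asarray(x, dtype=float)
    lower = np.floor(x)
    frac = x - lower
    return (lower + (rng.random(x.shape) < frac)).astype(int)

rng = np.random.default_rng(1)
x = np.full(100_000, 2.3)
rounded = mean_preserving_round(x, rng)
print(rounded[:10])          # values in {2, 3}
print(rounded.mean())        # close to 2.3, unlike deterministic rounding
```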
We study the statistical properties of these estimators by providing non-asymptotic bounds."}, "https://arxiv.org/abs/2402.16362": {"title": "Estimation of complex carryover effects in crossover designs with repeated measures", "link": "https://arxiv.org/abs/2402.16362", "description": "arXiv:2402.16362v1 Announce Type: new \nAbstract: It has been argued that the models used to analyze data from crossover designs are not appropriate when simple carryover effects are assumed. In this paper, the estimability conditions of the carryover effects are derived, together with a theoretical result that supports them. Additionally, two simulation examples are developed for a non-linear dose-response in a repeated measures crossover trial under two designs: the traditional AB/BA design and a Williams square. Both show that a semiparametric model can detect complex carryover effects and that this estimation improves the precision of treatment effect estimators. We conclude that when there are at least five replicates in each observation period per individual, semiparametric statistical models provide a good estimator of the treatment effect and reduce bias with respect to models that assume either the absence of carryover or simple carryover effects. In addition, an application of the methodology is presented, illustrating the richer analysis that is gained by being able to estimate complex carryover effects."}, "https://arxiv.org/abs/2402.16520": {"title": "Sequential design for surrogate modeling in Bayesian inverse problems", "link": "https://arxiv.org/abs/2402.16520", "description": "arXiv:2402.16520v1 Announce Type: new \nAbstract: Sequential design is a highly active field of research in active learning which provides a general framework for the design of computer experiments to make the most of a low computational budget. It has been widely used to generate efficient surrogate models able to replace complex computer codes, most notably for uncertainty quantification, Bayesian optimization, reliability analysis or model calibration tasks. In this work, a sequential design strategy is developed for Bayesian inverse problems, in which a Gaussian process surrogate model serves as an emulator for a costly computer code. The proposed strategy is based on a goal-oriented I-optimal criterion adapted to the Stepwise Uncertainty Reduction (SUR) paradigm. In SUR strategies, a new design point is chosen by minimizing the expectation of an uncertainty metric with respect to the yet unknown new data point. These methods have attracted increasing interest as they provide an accessible framework for the sequential design of experiments while ensuring almost-sure convergence for the most widely used metrics. In this paper, a weighted integrated mean square prediction error is introduced and serves as a metric of uncertainty for the newly proposed IP-SUR (Inverse Problem Stepwise Uncertainty Reduction) sequential design strategy derived from SUR methods. This strategy is shown to be tractable for both scalar and multi-output Gaussian process surrogate models with continuous sample paths, and comes with a theoretical guarantee for the almost-sure convergence of the metric of uncertainty.
The premises of this work are highlighted on various test cases in which the newly derived strategy is compared to other naive and sequential designs (D-optimal designs, Bayes risk minimization)."}, "https://arxiv.org/abs/2402.16580": {"title": "Information-Enriched Selection of Stationary and Non-Stationary Autoregressions using the Adaptive Lasso", "link": "https://arxiv.org/abs/2402.16580", "description": "arXiv:2402.16580v1 Announce Type: new \nAbstract: We propose a novel approach to elicit the weight of a potentially non-stationary regressor in the consistent and oracle-efficient estimation of autoregressive models using the adaptive Lasso. The enhanced weight builds on a statistic that exploits distinct orders in probability of the OLS estimator in time series regressions when the degree of integration differs. We provide theoretical results on the benefit of our approach for detecting stationarity when a tuning criterion selects the $\\ell_1$ penalty parameter. Monte Carlo evidence shows that our proposal is superior to using OLS-based weights, as suggested by Kock [Econom. Theory, 32, 2016, 243-259]. We apply the modified estimator to model selection for German inflation rates after the introduction of the Euro. The results indicate that energy commodity price inflation and headline inflation are best described by stationary autoregressions."}, "https://arxiv.org/abs/2402.16693": {"title": "Fast Algorithms for Quantile Regression with Selection", "link": "https://arxiv.org/abs/2402.16693", "description": "arXiv:2402.16693v1 Announce Type: new \nAbstract: This paper addresses computational challenges in estimating Quantile Regression with Selection (QRS). The estimation of the parameters that model self-selection requires the estimation of the entire quantile process several times. Moreover, closed-form expressions of the asymptotic variance are too cumbersome, making the bootstrap more convenient for performing inference. Taking advantage of recent advancements in the estimation of quantile regression, along with some specific characteristics of the QRS estimation problem, I propose streamlined algorithms for the QRS estimator. These algorithms significantly reduce computation time through preprocessing techniques and quantile grid reduction for the estimation of the copula and slope parameters. I show the optimization enhancements with some simulations. Lastly, I show how preprocessing methods can improve the precision of the estimates without sacrificing computational efficiency. Hence, they constitute a practical solution for estimators with non-differentiable and non-convex criterion functions such as those based on copulas."}, "https://arxiv.org/abs/2402.16725": {"title": "Inference on the proportion of variance explained in principal component analysis", "link": "https://arxiv.org/abs/2402.16725", "description": "arXiv:2402.16725v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored.\n In this paper, we consider inference on the PVE.
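For context on the adaptive-Lasso mechanics referenced in the autoregression abstract above, the following sketch fits an AR(p) model with the usual rescaling trick and plain OLS-based weights, i.e. the baseline of Kock (2016) that the abstract improves upon. The proposed information-enriched weight is not reproduced here, and the simulated process and penalty level are arbitrary choices.

```python
# Generic adaptive Lasso for an AR(p) model, using plain OLS-based weights
# (the baseline the abstract compares against, not the proposed enhanced
# weight). The standard trick: with penalty weights 1/|beta_ols_j|, rescale
# column j by |beta_ols_j|, run an ordinary Lasso, then rescale back.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, p = 400, 4
y = np.zeros(T)
for t in range(1, T):                              # simulate an AR(1) with coefficient 0.6
    y[t] = 0.6 * y[t - 1] + rng.normal()

Y = y[p:]
X = np.column_stack([y[p - j:T - j] for j in range(1, p + 1)])  # lags 1..p

beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
scale = np.abs(beta_ols)                           # |beta_ols|^gamma with gamma = 1

lasso = Lasso(alpha=0.05, fit_intercept=False).fit(X * scale, Y)
beta_adaptive = lasso.coef_ * scale                # back to the original scale
print(np.round(beta_adaptive, 3))                  # lag 1 kept, spurious lags shrunk to 0
```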
We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset."}, "https://arxiv.org/abs/2402.15619": {"title": "Towards Improved Uncertainty Quantification of Stochastic Epidemic Models Using Sequential Monte Carlo", "link": "https://arxiv.org/abs/2402.15619", "description": "arXiv:2402.15619v1 Announce Type: cross \nAbstract: Sequential Monte Carlo (SMC) algorithms represent a suite of robust computational methodologies utilized for state estimation and parameter inference within dynamical systems, particularly in real-time or online environments where data arrives sequentially over time. In this research endeavor, we propose an integrated framework that combines a stochastic epidemic simulator with a sequential importance sampling (SIS) scheme to dynamically infer model parameters, which evolve due to social as well as biological processes throughout the progression of an epidemic outbreak and are also influenced by evolving data measurement bias. Through iterative updates of a set of weighted simulated trajectories based on observed data, this framework enables the estimation of posterior distributions for these parameters, thereby capturing their temporal variability and associated uncertainties. Through simulation studies, we showcase the efficacy of SMC in accurately tracking the evolving dynamics of epidemics while appropriately accounting for uncertainties. Moreover, we delve into practical considerations and challenges inherent in implementing SMC for parameter estimation within dynamic epidemiological settings, areas where the substantial computational capabilities of high-performance computing resources can be usefully brought to bear."}, "https://arxiv.org/abs/2402.16131": {"title": "A VAE-based Framework for Learning Multi-Level Neural Granger-Causal Connectivity", "link": "https://arxiv.org/abs/2402.16131", "description": "arXiv:2402.16131v1 Announce Type: cross \nAbstract: Granger causality has been widely used in various application domains to capture lead-lag relationships amongst the components of complex dynamical systems, and the focus in extant literature has been on a single dynamical system. In certain applications in macroeconomics and neuroscience, one has access to data from a collection of related such systems, wherein the modeling task of interest is to extract the shared common structure that is embedded across them, as well as to identify the idiosyncrasies within individual ones. This paper introduces a Variational Autoencoder (VAE) based framework that jointly learns Granger-causal relationships amongst components in a collection of related-yet-heterogeneous dynamical systems, and handles the aforementioned task in a principled way. 
The performance of the proposed framework is evaluated on several synthetic data settings and benchmarked against existing approaches designed for individual system learning. The method is further illustrated on a real dataset involving time series data from a neurophysiological experiment and produces interpretable results."}, "https://arxiv.org/abs/2402.16225": {"title": "Discrete Fourier Transform Approximations Based on the Cooley-Tukey Radix-2 Algorithm", "link": "https://arxiv.org/abs/2402.16225", "description": "arXiv:2402.16225v1 Announce Type: cross \nAbstract: This report elaborates on approximations for the discrete Fourier transform by means of replacing the exact Cooley-Tukey algorithm twiddle-factors by low-complexity integers, such as $0, \\pm \\frac{1}{2}, \\pm 1$."}, "https://arxiv.org/abs/2402.16375": {"title": "Valuing insurance against small probability risks: A meta-analysis", "link": "https://arxiv.org/abs/2402.16375", "description": "arXiv:2402.16375v1 Announce Type: cross \nAbstract: The demand for voluntary insurance against low-probability, high-impact risks is lower than expected. To assess the magnitude of the demand, we conduct a meta-analysis of contingent valuation studies using a dataset of experimentally elicited and survey-based estimates. We find that the average stated willingness to pay (WTP) for insurance is 87% of expected losses. We perform a meta-regression analysis to examine the heterogeneity in aggregate WTP across these studies. The meta-regression reveals that information about loss probability and probability levels positively influence relative willingness to pay, whereas respondents' average income and age have a negative effect. Moreover, we identify cultural sub-factors, such as power distance and uncertainty avoidance, that provided additional explanations for differences in WTP across international samples. Methodological factors related to the sampling and data collection process significantly influence the stated WTP. Our results, robust to model specification and publication bias, are relevant to current debates on stated preferences for low-probability risks management."}, "https://arxiv.org/abs/2402.16661": {"title": "Penalized Generative Variable Selection", "link": "https://arxiv.org/abs/2402.16661", "description": "arXiv:2402.16661v1 Announce Type: cross \nAbstract: Deep networks are increasingly applied to a wide variety of data, including data with high-dimensional predictors. In such analysis, variable selection can be needed along with estimation/model building. Many of the existing deep network studies that incorporate variable selection have been limited to methodological and numerical developments. In this study, we consider modeling/estimation using the conditional Wasserstein Generative Adversarial networks. Group Lasso penalization is applied for variable selection, which may improve model estimation/prediction, interpretability, stability, etc. Significantly advancing from the existing literature, the analysis of censored survival data is also considered. We establish the convergence rate for variable selection while considering the approximation error, and obtain a more efficient distribution estimation. 
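The DFT-approximation report above quantizes Cooley-Tukey twiddle factors to low-complexity values. The sketch below illustrates the general idea by rounding the real and imaginary parts of each radix-2 twiddle factor to the nearest value in {0, ±1/2, ±1}; the report's specific approximation scheme and error analysis are not reproduced.

```python
# Sketch of a radix-2 decimation-in-time FFT in which the exact twiddle
# factors exp(-2*pi*1j*k/N) are replaced by low-complexity values whose real
# and imaginary parts lie in {0, +/-1/2, +/-1}. Illustrative only; the
# report's specific approximation scheme may differ.
import numpy as np

LEVELS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

def quantize(z):
    re = LEVELS[np.argmin(np.abs(LEVELS - z.real))]
    im = LEVELS[np.argmin(np.abs(LEVELS - z.imag))]
    return complex(re, im)

def approx_fft(x):
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even, odd = approx_fft(x[0::2]), approx_fft(x[1::2])
    tw = np.array([quantize(np.exp(-2j * np.pi * k / n)) for k in range(n // 2)])
    return np.concatenate([even + tw * odd, even - tw * odd])

rng = np.random.default_rng(3)
x = rng.normal(size=16)
err = np.abs(approx_fft(x) - np.fft.fft(x))
print(f"max abs error vs exact FFT: {err.max():.3f}")
```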
Simulations and the analysis of real experimental data demonstrate satisfactory practical utility of the proposed analysis."}, "https://arxiv.org/abs/2106.10770": {"title": "A Neural Frequency-Severity Model and Its Application to Insurance Claims", "link": "https://arxiv.org/abs/2106.10770", "description": "arXiv:2106.10770v2 Announce Type: replace \nAbstract: This paper proposes a flexible and analytically tractable class of frequency and severity models for predicting insurance claims. The proposed model is able to capture nonlinear relationships in explanatory variables by characterizing the logarithmic mean functions of frequency and severity distributions as neural networks. Moreover, a potential dependence between the claim frequency and severity can be incorporated. In particular, the paper provides analytic formulas for mean and variance of the total claim cost, making our model ideal for many applications such as pricing insurance contracts and the pure premium. A simulation study demonstrates that our method successfully recovers nonlinear features of explanatory variables as well as the dependency between frequency and severity. Then, this paper uses a French auto insurance claim dataset to illustrate that the proposed model is superior to the existing methods in fitting and predicting the claim frequency, severity, and the total claim loss. Numerical results indicate that the proposed model helps in maintaining the competitiveness of an insurer by accurately predicting insurance claims and avoiding adverse selection."}, "https://arxiv.org/abs/2107.06238": {"title": "GENIUS-MAWII: For Robust Mendelian Randomization with Many Weak Invalid Instruments", "link": "https://arxiv.org/abs/2107.06238", "description": "arXiv:2107.06238v3 Announce Type: replace \nAbstract: Mendelian randomization (MR) has become a popular approach to study causal effects by using genetic variants as instrumental variables. We propose a new MR method, GENIUS-MAWII, which simultaneously addresses the two salient phenomena that adversely affect MR analyses: many weak instruments and widespread horizontal pleiotropy. Similar to MR GENIUS (Tchetgen Tchetgen et al., 2021), we achieve identification of the treatment effect by leveraging heteroscedasticity of the exposure. We then derive the class of influence functions of the treatment effect, based on which, we construct a continuous updating estimator and establish its consistency and asymptotic normality under a many weak invalid instruments asymptotic regime by developing novel semiparametric theory. We also provide a measure of weak identification, an overidentification test, and a graphical diagnostic tool. We demonstrate in simulations that GENIUS-MAWII has clear advantages in the presence of directional or correlated horizontal pleiotropy compared to other methods. We apply our method to study the effect of body mass index on systolic blood pressure using UK Biobank."}, "https://arxiv.org/abs/2205.06989": {"title": "lsirm12pl: An R package for latent space item response modeling", "link": "https://arxiv.org/abs/2205.06989", "description": "arXiv:2205.06989v2 Announce Type: replace \nAbstract: The latent space item response model (LSIRM; Jeon et al., 2021) allows us to show interactions between respondents and items in item response data by embedding both items and respondents in a shared and unobserved metric space. 
The R package lsirm12pl implements Bayesian estimation of the LSIRM and its extensions for different response types, base model specifications, and missing data. Further, the lsirm12pl offers methods to improve model utilization and interpretation, such as clustering of item positions in an estimated interaction map. lsirm12pl also provides convenient summary and plotting options to assess and process estimated results. In this paper, we give an overview of the methodological basis of LSIRM and describe the LSIRM extensions considered in the package. We then present the utilization of the package lsirm12pl with real data examples that are contained in the package."}, "https://arxiv.org/abs/2208.02925": {"title": "Factor Network Autoregressions", "link": "https://arxiv.org/abs/2208.02925", "description": "arXiv:2208.02925v4 Announce Type: replace \nAbstract: We propose a factor network autoregressive (FNAR) model for time series with complex network structures. The coefficients of the model reflect many different types of connections between economic agents (``multilayer network\"), which are summarized into a smaller number of network matrices (``network factors\") through a novel tensor-based principal component approach. We provide consistency and asymptotic normality results for the estimation of the factors and the coefficients of the FNAR. Our approach combines two different dimension-reduction techniques and can be applied to ultra-high-dimensional datasets. Simulation results show the goodness of our approach in finite samples. In an empirical application, we use the FNAR to investigate the cross-country interdependence of GDP growth rates based on a variety of international trade and financial linkages. The model provides a rich characterization of macroeconomic network effects."}, "https://arxiv.org/abs/2209.01172": {"title": "An Interpretable and Efficient Infinite-Order Vector Autoregressive Model for High-Dimensional Time Series", "link": "https://arxiv.org/abs/2209.01172", "description": "arXiv:2209.01172v4 Announce Type: replace \nAbstract: As a special infinite-order vector autoregressive (VAR) model, the vector autoregressive moving average (VARMA) model can capture much richer temporal patterns than the widely used finite-order VAR model. However, its practicality has long been hindered by its non-identifiability, computational intractability, and difficulty of interpretation, especially for high-dimensional time series. This paper proposes a novel sparse infinite-order VAR model for high-dimensional time series, which avoids all above drawbacks while inheriting essential temporal patterns of the VARMA model. As another attractive feature, the temporal and cross-sectional structures of the VARMA-type dynamics captured by this model can be interpreted separately, since they are characterized by different sets of parameters. This separation naturally motivates the sparsity assumption on the parameters determining the cross-sectional dependence. As a result, greater statistical efficiency and interpretability can be achieved with little loss of temporal information. We introduce two $\\ell_1$-regularized estimation methods for the proposed model, which can be efficiently implemented via block coordinate descent algorithms, and derive the corresponding nonasymptotic error bounds. A consistent model order selection method based on the Bayesian information criteria is also developed. 
The merit of the proposed approach is supported by simulation studies and a real-world macroeconomic data analysis."}, "https://arxiv.org/abs/2211.04697": {"title": "$L^{\\infty}$- and $L^2$-sensitivity analysis for causal inference with unmeasured confounding", "link": "https://arxiv.org/abs/2211.04697", "description": "arXiv:2211.04697v4 Announce Type: replace \nAbstract: Sensitivity analysis for the unconfoundedness assumption is crucial in observational studies. For this purpose, the marginal sensitivity model (MSM) gained popularity recently due to its good interpretability and mathematical properties. However, as a quantification of confounding strength, the $L^{\\infty}$-bound it puts on the logit difference between the observed and full data propensity scores may render the analysis conservative. In this article, we propose a new sensitivity model that restricts the $L^2$-norm of the propensity score ratio, requiring only the average strength of unmeasured confounding to be bounded. By characterizing sensitivity analysis as an optimization problem, we derive closed-form sharp bounds of the average potential outcomes under our model. We propose efficient one-step estimators for these bounds based on the corresponding efficient influence functions. Additionally, we apply multiplier bootstrap to construct simultaneous confidence bands to cover the sensitivity curve that consists of bounds at different sensitivity parameters. Through a real-data study, we illustrate how the new $L^2$-sensitivity analysis can improve calibration using observed confounders and provide tighter bounds when the unmeasured confounder is additionally assumed to be independent of the measured confounders and only have an additive effect on the potential outcomes."}, "https://arxiv.org/abs/2306.16591": {"title": "Nonparametric Causal Decomposition of Group Disparities", "link": "https://arxiv.org/abs/2306.16591", "description": "arXiv:2306.16591v2 Announce Type: replace \nAbstract: We propose a nonparametric framework that decomposes the causal contributions of a treatment variable to an outcome disparity between two groups. We decompose the causal contributions of treatment into group differences in 1) treatment prevalence, 2) average treatment effects, and 3) selection into treatment based on individual-level treatment effects. Our framework reformulates the classic Kitagawa-Blinder-Oaxaca decomposition nonparametrically in causal terms, complements causal mediation analysis by explaining group disparities instead of group effects, and distinguishes more mechanisms than recent random equalization decomposition. In contrast to all prior approaches, our framework isolates the causal contribution of differential selection into treatment as a novel mechanism for explaining and ameliorating group disparities. We develop nonparametric estimators based on efficient influence functions that are $\\sqrt{n}$-consistent, asymptotically normal, semiparametrically efficient, and multiply robust to misspecification. We apply our framework to decompose the causal contributions of education to the disparity in adult income between parental income groups (intergenerational income persistence). 
We find that both differential prevalence of, and differential selection into, college graduation significantly contribute to intergenerational income persistence."}, "https://arxiv.org/abs/2308.05484": {"title": "Filtering Dynamical Systems Using Observations of Statistics", "link": "https://arxiv.org/abs/2308.05484", "description": "arXiv:2308.05484v3 Announce Type: replace \nAbstract: We consider the problem of filtering dynamical systems, possibly stochastic, using observations of statistics. Thus, the computational task is to estimate a time-evolving density $\\rho(v, t)$ given noisy observations of the true density $\\rho^\\dagger$; this contrasts with the standard filtering problem based on observations of the state $v$. The task is naturally formulated as an infinite-dimensional filtering problem in the space of densities $\\rho$. However, for the purposes of tractability, we seek algorithms in state space; specifically, we introduce a mean-field state-space model, and using interacting particle system approximations to this model, we propose an ensemble method. We refer to the resulting methodology as the ensemble Fokker-Planck filter (EnFPF).\n Under certain restrictive assumptions, we show that the EnFPF approximates the Kalman-Bucy filter for the Fokker-Planck equation, which is the exact solution to the infinite-dimensional filtering problem. Furthermore, our numerical experiments show that the methodology is useful beyond this restrictive setting. Specifically, the experiments show that the EnFPF is able to correct ensemble statistics, to accelerate convergence to the invariant density for autonomous systems, and to accelerate convergence to time-dependent invariant densities for non-autonomous systems. We discuss possible applications of the EnFPF to climate ensembles and to turbulence modeling."}, "https://arxiv.org/abs/2309.09115": {"title": "Fully Synthetic Data for Complex Surveys", "link": "https://arxiv.org/abs/2309.09115", "description": "arXiv:2309.09115v3 Announce Type: replace \nAbstract: When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. 
We apply the proposed methods to a subset of data from the American Community Survey."}, "https://arxiv.org/abs/2310.11724": {"title": "Simultaneous Nonparametric Inference of M-regression under Complex Temporal Dynamics", "link": "https://arxiv.org/abs/2310.11724", "description": "arXiv:2310.11724v2 Announce Type: replace \nAbstract: The paper considers simultaneous nonparametric inference for a wide class of M-regression models with time-varying coefficients. The covariates and errors of the regression model are tackled as a general class of nonstationary time series and are allowed to be cross-dependent. We construct $\\sqrt{n}$-consistent inference for the cumulative regression function, whose limiting properties are disclosed using Bahadur representation and Gaussian approximation theory. A simple and unified self-convolved bootstrap procedure is proposed. With only one tuning parameter, the bootstrap consistently simulates the desired limiting behavior of the M-estimators under complex temporal dynamics, even under the possible presence of breakpoints in time series. Our methodology leads to a unified framework to conduct general classes of Exact Function Tests, Lack-of-fit Tests, and Qualitative Tests for the time-varying coefficients under complex temporal dynamics. These tests enable one to, among many others, conduct variable selection procedures, check for constancy and linearity, as well as verify shape assumptions, including monotonicity and convexity. As applications, our method is utilized to study the time-varying properties of global climate data and Microsoft stock return, respectively."}, "https://arxiv.org/abs/2310.14399": {"title": "The role of randomization inference in unraveling individual treatment effects in early phase vaccine trials", "link": "https://arxiv.org/abs/2310.14399", "description": "arXiv:2310.14399v2 Announce Type: replace \nAbstract: Randomization inference is a powerful tool in early phase vaccine trials when estimating the causal effect of a regimen against a placebo or another regimen. Randomization-based inference often focuses on testing either Fisher's sharp null hypothesis of no treatment effect for any participant or Neyman's weak null hypothesis of no sample average treatment effect. Many recent efforts have explored conducting exact randomization-based inference for other summaries of the treatment effect profile, for instance, quantiles of the treatment effect distribution function. In this article, we systematically review methods that conduct exact, randomization-based inference for quantiles of individual treatment effects (ITEs) and extend some results to a special case where na\\\"ive participants are expected not to exhibit responses to highly specific endpoints. These methods are suitable for completely randomized trials, stratified completely randomized trials, and a matched study comparing two non-randomized arms from possibly different trials. We evaluate the usefulness of these methods using synthetic data in simulation studies. Finally, we apply these methods to HIV Vaccine Trials Network Study 086 (HVTN 086) and HVTN 205 and showcase a wide range of application scenarios of the methods. 
R code that replicates all analyses in this article can be found on the first author's GitHub page at https://github.com/Zhe-Chen-1999/ITE-Inference."}, "https://arxiv.org/abs/2401.02529": {"title": "Simulation-based transition density approximation for the inference of SDE models", "link": "https://arxiv.org/abs/2401.02529", "description": "arXiv:2401.02529v2 Announce Type: replace \nAbstract: Stochastic Differential Equations (SDEs) serve as a powerful modeling tool in various scientific domains, including systems science, engineering, and ecological science. While the specific form of SDEs is typically known for a given problem, certain model parameters remain unknown. Efficiently inferring these unknown parameters based on observations of the state in discrete time series represents a vital practical subject. The challenge arises in nonlinear SDEs, where maximum likelihood estimation of parameters is generally unfeasible due to the absence of closed-form expressions for transition and stationary probability density functions of the states. In response to this limitation, we propose a novel two-step parameter inference mechanism. This approach involves a global-search phase followed by a local-refining procedure. The global-search phase is dedicated to identifying the domain of high-value likelihood functions, while the local-refining procedure is specifically designed to enhance the surrogate likelihood within this localized domain. Additionally, we present two simulation-based approximations for the transition density, aiming to efficiently or accurately approximate the likelihood function. Numerical examples illustrate the efficacy of our proposed methodology in achieving posterior parameter estimation."}, "https://arxiv.org/abs/2401.11229": {"title": "Estimation with Pairwise Observations", "link": "https://arxiv.org/abs/2401.11229", "description": "arXiv:2401.11229v2 Announce Type: replace \nAbstract: The paper introduces a new estimation method for the standard linear regression model. The procedure is not driven by the optimisation of any objective function; rather, it is a simple weighted average of slopes from observation pairs. The paper shows that such an estimator is consistent for carefully selected weights. Other properties, such as asymptotic distributions, have also been derived to facilitate valid statistical inference. Unlike traditional methods, such as Least Squares and Maximum Likelihood, among others, the estimated residual of this estimator is not by construction orthogonal to the explanatory variables of the model. This property allows a wide range of practical applications, such as the testing of endogeneity, i.e., the correlation between the explanatory variables and the disturbance terms."}, "https://arxiv.org/abs/2401.15076": {"title": "Comparative Analysis of Practical Identifiability Methods for an SEIR Model", "link": "https://arxiv.org/abs/2401.15076", "description": "arXiv:2401.15076v2 Announce Type: replace \nAbstract: Identifiability of a mathematical model plays a crucial role in parameterization of the model. In this study, we establish the structural identifiability of a Susceptible-Exposed-Infected-Recovered (SEIR) model given different combinations of input data and investigate practical identifiability with respect to different observable data, data frequency, and noise distributions. The practical identifiability is explored by both Monte Carlo simulations and a Correlation Matrix approach.
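For reference, a minimal deterministic SEIR system of the kind whose parameters the identifiability analysis above concerns can be simulated in a few lines. The parameter values and observation choice (daily incidence, sigma * E) below are arbitrary illustrations rather than the paper's settings, and the paper may use a stochastic or differently parameterized variant.

```python
# Minimal deterministic SEIR model, the kind of system whose parameters
# (transmission rate beta, incubation rate sigma, recovery rate gamma) the
# identifiability analysis concerns. Parameter values here are arbitrary
# illustrations, not taken from the paper.
import numpy as np
from scipy.integrate import solve_ivp

def seir(t, y, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I
    return [dS, dE, dI, dR]

N = 10_000
y0 = [N - 10, 0, 10, 0]                      # 10 initial infectious cases
params = (0.4, 1 / 5.0, 1 / 7.0, N)          # beta, sigma, gamma, N

sol = solve_ivp(seir, (0, 180), y0, args=params, t_eval=np.arange(0, 181))
incidence = params[1] * sol.y[1]             # sigma * E: new cases per day
print(f"peak daily incidence ~ day {int(incidence.argmax())}, "
      f"height {incidence.max():.0f}")
```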
Our results show that practical identifiability benefits from higher data frequency and data from the peak of an outbreak. The incidence data gives the best practical identifiability results compared to prevalence and cumulative data. In addition, we compare and distinguish the practical identifiability by Monte Carlo simulations and a Correlation Matrix approach, providing insights for when to use which method for other applications."}, "https://arxiv.org/abs/1908.08600": {"title": "Online Causal Inference for Advertising in Real-Time Bidding Auctions", "link": "https://arxiv.org/abs/1908.08600", "description": "arXiv:1908.08600v4 Announce Type: replace-cross \nAbstract: Real-time bidding (RTB) systems, which utilize auctions to allocate user impressions to competing advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of such advertising remains a challenge in research and practice. This paper proposes a new approach to perform causal inference on advertising bought through such mechanisms. Leveraging the economic structure of first- and second-price auctions, we first show that the effects of advertising are identified by the optimal bids. Hence, since these optimal bids are the only objects that need to be recovered, we introduce an adapted Thompson sampling (TS) algorithm to solve a multi-armed bandit problem that succeeds in recovering such bids and, consequently, the effects of advertising while minimizing the costs of experimentation. We derive a regret bound for our algorithm which is order optimal and use data from RTB auctions to show that it outperforms commonly used methods that estimate the effects of advertising."}, "https://arxiv.org/abs/2009.10303": {"title": "On the representation and learning of monotone triangular transport maps", "link": "https://arxiv.org/abs/2009.10303", "description": "arXiv:2009.10303v3 Announce Type: replace-cross \nAbstract: Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps$\\unicode{x2014}$approximations of the Knothe$\\unicode{x2013}$Rosenblatt (KR) rearrangement$\\unicode{x2014}$are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. 
We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes."}, "https://arxiv.org/abs/2103.14726": {"title": "Random line graphs and edge-attributed network inference", "link": "https://arxiv.org/abs/2103.14726", "description": "arXiv:2103.14726v2 Announce Type: replace-cross \nAbstract: We extend the latent position random graph model to the line graph of a random graph, which is formed by creating a vertex for each edge in the original random graph, and connecting each pair of edges incident to a common vertex in the original graph. We prove concentration inequalities for the spectrum of a line graph, as well as limiting distribution results for the largest eigenvalue and the empirical spectral distribution in certain settings. For the stochastic blockmodel, we establish that although naive spectral decompositions can fail to extract necessary signal for edge clustering, there exist signal-preserving singular subspaces of the line graph that can be recovered through a carefully-chosen projection. Moreover, we can consistently estimate edge latent positions in a random line graph, even though such graphs are of a random size, typically have high rank, and possess no spectral gap. Our results demonstrate that the line graph of a stochastic block model exhibits underlying block structure, and in simulations, we synthesize and test our methods against several commonly-used techniques, including tensor decompositions, for cluster recovery and edge covariate inference. By naturally incorporating information encoded in both vertices and edges, the random line graph improves network inference."}, "https://arxiv.org/abs/2106.14077": {"title": "The Role of Contextual Information in Best Arm Identification", "link": "https://arxiv.org/abs/2106.14077", "description": "arXiv:2106.14077v3 Announce Type: replace-cross \nAbstract: We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the \"Track-and-Stop\" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). 
We experimentally confirm that context information contributes to faster best-arm identification."}, "https://arxiv.org/abs/2306.14142": {"title": "Estimating Policy Effects in a Social Network with Independent Set Sampling", "link": "https://arxiv.org/abs/2306.14142", "description": "arXiv:2306.14142v3 Announce Type: replace-cross \nAbstract: Evaluating the impact of policy interventions on respondents who are embedded in a social network is often challenging due to the presence of network interference within the treatment groups, as well as between treatment and non-treatment groups throughout the network. In this paper, we propose a modeling strategy that combines existing work on stochastic actor-oriented models (SAOM) with a novel network sampling method based on the identification of independent sets. By assigning respondents from an independent set to the treatment, we are able to block any spillover of the treatment and network influence, thereby allowing us to isolate the direct effect of the treatment from the indirect network-induced effects, in the immediate term. As a result, our method allows for the estimation of both the \\textit{direct} as well as the \\textit{net effect} of a chosen policy intervention, in the presence of network effects in the population. We perform a comparative simulation analysis to show that our proposed sampling technique leads to distinct direct and net effects of the policy, as well as significant network effects driven by policy-linked homophily. This study highlights the importance of network sampling techniques in improving policy evaluation studies and has the potential to help researchers and policymakers with better planning, designing, and anticipating policy responses in a networked society."}, "https://arxiv.org/abs/2306.16838": {"title": "Solving Kernel Ridge Regression with Gradient-Based Optimization Methods", "link": "https://arxiv.org/abs/2306.16838", "description": "arXiv:2306.16838v5 Announce Type: replace-cross \nAbstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\\ell_1$ and $\\ell_\\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\\ell_\\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. 
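As a rough numerical companion to the KRR/KGF connection described above: gradient descent on the unpenalized kernel least-squares problem, stopped early, behaves like ridge regularization. The sketch below uses a discrete RKHS gradient step rather than the paper's closed-form gradient flow, and the kernel, bandwidth, step size, stopping time, and the heuristic pairing lambda ~ 1/(step * iterations) are all illustrative assumptions.

```python
# Early stopping vs. ridge penalty in kernel regression: a discrete-time
# sketch of the KRR/KGF connection. The update alpha += step * (y - K @ alpha)
# is an RKHS gradient step on the squared loss; stopping it early plays the
# role of the ridge penalty. Kernel, step size, stopping time and the
# heuristic pairing lambda ~ 1/(step * t_stop) are illustrative choices only.
import numpy as np

rng = np.random.default_rng(4)
n = 80
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)

K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # Gaussian kernel matrix

step, t_stop = 1.0 / np.linalg.eigvalsh(K).max(), 200
alpha_gd = np.zeros(n)
for _ in range(t_stop):
    alpha_gd += step * (y - K @ alpha_gd)           # early-stopped kernel gradient descent

lam = 1.0 / (step * t_stop)                         # heuristic ridge level paired with t_stop
alpha_krr = np.linalg.solve(K + lam * np.eye(n), y) # kernel ridge regression

gap = np.abs(K @ alpha_gd - K @ alpha_krr).max()
print(f"max |early-stopped GD fit - KRR fit| = {gap:.3f} (noise sd = 0.3)")
```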
We show theoretically and empirically how the $\\ell_1$ and $\\ell_\\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively."}, "https://arxiv.org/abs/2310.02461": {"title": "Optimization-based frequentist confidence intervals for functionals in constrained inverse problems: Resolving the Burrus conjecture", "link": "https://arxiv.org/abs/2310.02461", "description": "arXiv:2310.02461v2 Announce Type: replace-cross \nAbstract: We present an optimization-based framework to construct confidence intervals for functionals in constrained inverse problems, ensuring valid one-at-a-time frequentist coverage guarantees. Our approach builds upon the now-called strict bounds intervals, originally pioneered by Burrus (1965) and Rust and Burrus (1972), which offer ways to directly incorporate any side information about the parameters during inference without introducing external biases. This family of methods allows for uncertainty quantification in ill-posed inverse problems without needing to select a regularizing prior. By tying optimization-based intervals to an inversion of a constrained likelihood ratio test, we translate interval coverage guarantees into type I error control and characterize the resulting interval via solutions to optimization problems. Along the way, we refute the Burrus conjecture, which posited that, for possibly rank-deficient linear Gaussian models with positivity constraints, a correction based on the quantile of the chi-squared distribution with one degree of freedom suffices to shorten intervals while maintaining frequentist coverage guarantees. Our framework provides a novel approach to analyzing the conjecture, and we construct a counterexample employing a stochastic dominance argument, which we also use to disprove a general form of the conjecture. We illustrate our framework with several numerical examples and provide directions for extensions beyond the Rust-Burrus method for nonlinear, non-Gaussian settings with general constraints."}, "https://arxiv.org/abs/2310.16281": {"title": "Improving Robust Decisions with Data", "link": "https://arxiv.org/abs/2310.16281", "description": "arXiv:2310.16281v3 Announce Type: replace-cross \nAbstract: A decision-maker faces uncertainty governed by a data-generating process (DGP), which is only known to belong to a set of sequences of independent but possibly non-identical distributions. A robust decision maximizes the expected payoff against the worst possible DGP in this set. This paper characterizes when and how such robust decisions can be improved with data, measured by the expected payoff under the true DGP, no matter which possible DGP is the truth. 
It further develops novel and simple inference methods to achieve it, as common methods (e.g., maximum likelihood or Bayesian) often fail to deliver such an improvement."}, "https://arxiv.org/abs/2312.02482": {"title": "Treatment heterogeneity with right-censored outcomes using grf", "link": "https://arxiv.org/abs/2312.02482", "description": "arXiv:2312.02482v3 Announce Type: replace-cross \nAbstract: This article walks through how to estimate conditional average treatment effects (CATEs) with right-censored time-to-event outcomes using the function causal_survival_forest (Cui et al., 2023) in the R package grf (Athey et al., 2019, Tibshirani et al., 2024), with data from the National Job Training Partnership Act."}, "https://arxiv.org/abs/2401.06091": {"title": "A Closer Look at AUROC and AUPRC under Class Imbalance", "link": "https://arxiv.org/abs/2401.06091", "description": "arXiv:2401.06091v2 Announce Type: replace-cross \nAbstract: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need."}, "https://arxiv.org/abs/2402.16917": {"title": "Bayesian Chain Ladder For Cumulative Run-Off Triangle Under Half-Normal Distribution Assumption", "link": "https://arxiv.org/abs/2402.16917", "description": "arXiv:2402.16917v1 Announce Type: new \nAbstract: An insurance company is required to prepare a certain amount of money, called a reserve, as a means to pay its policyholders' claims in the future. There are several types of reserve; one of them is the IBNR reserve, for which the payments are not made in the same calendar year as the events that trigger the claims. Wuthrich and Merz (2015) developed a Bayesian Chain Ladder method for the gamma distribution that applies Bayesian theory to the usual Chain Ladder method.
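For context on the method the Bayesian variants above build on, here is a minimal classical (deterministic) chain-ladder computation on an invented cumulative run-off triangle; the half-normal Bayesian model itself is not implemented, and all figures are made up.

```python
# Minimal classical chain-ladder computation on a toy cumulative run-off
# triangle, for context on the method the Bayesian variants build on.
# The numbers are invented; the half-normal Bayesian model is not implemented.
import numpy as np

# Rows: accident years, columns: development years (NaN = not yet observed).
C = np.array([
    [1000., 1800., 2100., 2200.],
    [1100., 1950., 2250.,  np.nan],
    [1200., 2150.,  np.nan, np.nan],
    [1300.,  np.nan, np.nan, np.nan],
])
n = C.shape[0]

# Volume-weighted development factors f_j = sum_i C[i, j+1] / sum_i C[i, j].
f = []
for j in range(n - 1):
    rows = ~np.isnan(C[:, j + 1])
    f.append(C[rows, j + 1].sum() / C[rows, j].sum())

# Complete the triangle by successively applying the development factors.
C_full = C.copy()
for i in range(n):
    for j in range(n - 1):
        if np.isnan(C_full[i, j + 1]):
            C_full[i, j + 1] = C_full[i, j] * f[j]

reserve = C_full[:, -1] - np.array([C[i, n - 1 - i] for i in range(n)])
print("development factors:", np.round(f, 3))
print("outstanding reserve per accident year:", np.round(reserve, 1))
print("total reserve:", round(reserve.sum(), 1))
```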
In this article, we modify the previous Bayesian Chain Ladder for the half-normal distribution, as this distribution is better suited for lighter-tailed claims."}, "https://arxiv.org/abs/2402.16969": {"title": "Robust Evaluation of Longitudinal Surrogate Markers with Censored Data", "link": "https://arxiv.org/abs/2402.16969", "description": "arXiv:2402.16969v1 Announce Type: new \nAbstract: The development of statistical methods to evaluate surrogate markers is an active area of research. In many clinical settings, the surrogate marker is not simply a single measurement but is instead a longitudinal trajectory of measurements over time, e.g., fasting plasma glucose measured every 6 months for 3 years. In general, available methods developed for the single-surrogate setting cannot accommodate a longitudinal surrogate marker. Furthermore, many of the methods have not been developed for use with primary outcomes that are time-to-event outcomes and/or subject to censoring. In this paper, we propose robust methods to evaluate a longitudinal surrogate marker in a censored time-to-event outcome setting. Specifically, we propose a method to define and estimate the proportion of the treatment effect on a censored primary outcome that is explained by the treatment effect on a longitudinal surrogate marker measured up to time $t_0$. We accommodate both potential censoring of the primary outcome and of the surrogate marker. A simulation study demonstrates good finite-sample performance of our proposed methods. We illustrate our procedures by examining repeated measures of fasting plasma glucose, a surrogate marker for diabetes diagnosis, using data from the Diabetes Prevention Program (DPP)."}, "https://arxiv.org/abs/2402.17042": {"title": "Towards Generalizing Inferences from Trials to Target Populations", "link": "https://arxiv.org/abs/2402.17042", "description": "arXiv:2402.17042v1 Announce Type: new \nAbstract: Randomized Controlled Trials (RCTs) are pivotal in generating internally valid estimates with minimal assumptions, serving as a cornerstone for researchers dedicated to advancing causal inference methods. However, extending these findings beyond the experimental cohort to achieve externally valid estimates is crucial for broader scientific inquiry. This paper delves into the forefront of addressing these external validity challenges, encapsulating the essence of a multidisciplinary workshop held at the Institute for Computational and Experimental Research in Mathematics (ICERM), Brown University, in Fall 2023. The workshop convened experts from diverse fields including social science, medicine, public health, statistics, computer science, and education, to tackle the unique obstacles each discipline faces in extrapolating experimental findings. Our study presents three key contributions: we integrate ongoing efforts, highlighting methodological synergies across fields; provide an exhaustive review of generalizability and transportability based on the workshop's discourse; and identify persistent hurdles while suggesting avenues for future research.
By doing so, this paper aims to enhance the collective understanding of the generalizability and transportability of causal effects, fostering cross-disciplinary collaboration and offering valuable insights for researchers working on refining and applying causal inference methods."}, "https://arxiv.org/abs/2402.17070": {"title": "Dempster-Shafer P-values: Thoughts on an Alternative Approach for Multinomial Inference", "link": "https://arxiv.org/abs/2402.17070", "description": "arXiv:2402.17070v1 Announce Type: new \nAbstract: In this paper, we present a new measure of evidence we developed, called the Dempster-Shafer p-value, which allows for insights and interpretations that retain most of the structure of the p-value while compensating for some of the disadvantages that traditional p-values face. Moreover, we show through classical large-sample bounds and simulations that there exists a close connection between our form of DS hypothesis testing and the classical frequentist testing paradigm. We also demonstrate how our approach gives unique insights into the dimensionality of a hypothesis test, as well as models the effects of adversarial attacks on multinomial data. Finally, we demonstrate how these insights can be used to analyze text data for public health through an analysis of the Population Health Metrics Research Consortium dataset for verbal autopsies."}, "https://arxiv.org/abs/2402.17103": {"title": "Causal Orthogonalization: Multicollinearity, Economic Interpretability, and the Gram-Schmidt Process", "link": "https://arxiv.org/abs/2402.17103", "description": "arXiv:2402.17103v1 Announce Type: new \nAbstract: This paper considers the problem of interpreting orthogonalization model coefficients. We derive a causal economic interpretation of the Gram-Schmidt orthogonalization process and provide the conditions for its equivalence to total effects from a recursive Directed Acyclic Graph. We extend the Gram-Schmidt process to groups of simultaneous regressors common in economic data sets and derive its finite sample properties, finding its coefficients to be unbiased, stable, and more efficient than those from Ordinary Least Squares. Finally, we apply the estimator to childhood reading comprehension scores, controlling for such highly collinear characteristics as race, education, and income. The model expands Bohren et al.'s decomposition of systemic discrimination into channel-specific effects and improves its coefficient significance levels."}, "https://arxiv.org/abs/2402.17274": {"title": "Sequential Change-point Detection for Binomial Time Series with Exogenous Variables", "link": "https://arxiv.org/abs/2402.17274", "description": "arXiv:2402.17274v1 Announce Type: new \nAbstract: Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used for data monitoring in practice. Meanwhile, binomial time series, which depicts independent binary individual behaviors within a group when the individual behaviors depend on past observations of the whole group, is an important type of model in practice but has not been well developed.
We first propose a Binomial AR($1$) model, and then consider a method for sequential change-point detection for the Binomial AR(1)."}, "https://arxiv.org/abs/2402.17297": {"title": "Combined Quantile Forecasting for High-Dimensional Non-Gaussian Data", "link": "https://arxiv.org/abs/2402.17297", "description": "arXiv:2402.17297v1 Announce Type: new \nAbstract: This study proposes a novel method for forecasting a scalar variable based on high-dimensional predictors that is applicable to various data distributions. In the literature, one of the popular approaches for forecasting with many predictors is to use factor models. However, these traditional methods are ineffective when the data exhibit non-Gaussian characteristics such as skewness or heavy tails. In this study, we newly utilize a quantile factor model to extract quantile factors that describe specific quantiles of the data beyond the mean factor. We then build a quantile-based forecast model using the estimated quantile factors at different quantile levels as predictors. Finally, the predicted values at the various quantile levels are combined into a single forecast as a weighted average with weights determined by a Markov chain based on past trends of the target variable. The main idea of the proposed method is to incorporate a quantile approach to a forecasting method to handle non-Gaussian characteristics effectively. The performance of the proposed method is evaluated through a simulation study and real data analysis of PM2.5 data in South Korea, where the proposed method outperforms other existing methods in most cases."}, "https://arxiv.org/abs/2402.17366": {"title": "Causal blind spots when using prediction models for treatment decisions", "link": "https://arxiv.org/abs/2402.17366", "description": "arXiv:2402.17366v1 Announce Type: new \nAbstract: Prediction models are increasingly proposed for guiding treatment decisions, but most fail to address the special role of treatments, leading to inappropriate use. This paper highlights the limitations of using standard prediction models for treatment decision support. We identify 'causal blind spots' in three common approaches to handling treatments in prediction modelling and illustrate potential harmful consequences in several medical applications. We advocate for an extension of guidelines for development, reporting, clinical evaluation and monitoring of prediction models to ensure that the intended use of the model is matched to an appropriate risk estimand. For decision support this requires a shift towards developing predictions under the specific treatment options under consideration ('predictions under interventions'). We argue that this will improve the efficacy of prediction models in guiding treatment decisions and prevent potential negative effects on patient outcomes."}, "https://arxiv.org/abs/2402.17374": {"title": "Quasi-Bayesian Estimation and Inference with Control Functions", "link": "https://arxiv.org/abs/2402.17374", "description": "arXiv:2402.17374v1 Announce Type: new \nAbstract: We consider a quasi-Bayesian method that combines a frequentist estimation in the first stage and a Bayesian estimation/inference approach in the second stage. The study is motivated by structural discrete choice models that use the control function methodology to correct for endogeneity bias. In this scenario, the first stage estimates the control function using some frequentist parametric or nonparametric approach. 
The structural equation in the second stage, associated with certain complicated likelihood functions, can be more conveniently dealt with using a Bayesian approach. This paper studies the asymptotic properties of the quasi-posterior distributions obtained from the second stage. We prove that the corresponding quasi-Bayesian credible set does not have the desired coverage in large samples. Nonetheless, the quasi-Bayesian point estimator remains consistent and is asymptotically equivalent to a frequentist two-stage estimator. We show that one can obtain valid inference by bootstrapping the quasi-posterior that takes into account the first-stage estimation uncertainty."}, "https://arxiv.org/abs/2402.17425": {"title": "Semi-parametric goodness-of-fit testing for INAR models", "link": "https://arxiv.org/abs/2402.17425", "description": "arXiv:2402.17425v1 Announce Type: new \nAbstract: Among the various models designed for dependent count data, integer-valued autoregressive (INAR) processes enjoy great popularity. Typically, statistical inference for INAR models uses asymptotic theory that relies on rather stringent (parametric) assumptions on the innovations such as Poisson or negative binomial distributions. In this paper, we present a novel semi-parametric goodness-of-fit test tailored for the INAR model class. Relying on the INAR-specific shape of the joint probability generating function, our approach allows for model validation of INAR models without specifying the (family of the) innovation distribution. We derive the limiting null distribution of our proposed test statistic, prove consistency under fixed alternatives and discuss its asymptotic behavior under local alternatives. By manifold Monte Carlo simulations, we illustrate the overall good performance of our testing procedure in terms of power and size properties. In particular, it turns out that the power can be considerably improved by using higher-order test statistics. We conclude the article with the application on three real-world economic data sets."}, "https://arxiv.org/abs/2402.17637": {"title": "Learning the Covariance of Treatment Effects Across Many Weak Experiments", "link": "https://arxiv.org/abs/2402.17637", "description": "arXiv:2402.17637v1 Announce Type: new \nAbstract: When primary objectives are insensitive or delayed, experimenters may instead focus on proxy metrics derived from secondary outcomes. For example, technology companies often infer long-term impacts of product interventions from their effects on weighted indices of short-term user engagement signals. We consider meta-analysis of many historical experiments to learn the covariance of treatment effects on different outcomes, which can support the construction of such proxies. Even when experiments are plentiful and large, if treatment effects are weak, the sample covariance of estimated treatment effects across experiments can be highly biased and remains inconsistent even as more experiments are considered. We overcome this by using techniques inspired by weak instrumental variable analysis, which we show can reliably estimate parameters of interest, even without a structural model. 
We show the Limited Information Maximum Likelihood (LIML) estimator learns a parameter that is equivalent to fitting total least squares to a transformation of the scatterplot of estimated treatment effects, and that Jackknife Instrumental Variables Estimation (JIVE) learns another parameter that can be computed from the average of Jackknifed covariance matrices across experiments. We also present a total-covariance-based estimator for the latter estimand under homoskedasticity, which we show is equivalent to a $k$-class estimator. We show how these parameters relate to causal quantities and can be used to construct unbiased proxy metrics under a structural model with both direct and indirect effects subject to the INstrument Strength Independent of Direct Effect (INSIDE) assumption of Mendelian randomization. Lastly, we discuss the application of our methods at Netflix."}, "https://arxiv.org/abs/2402.16909": {"title": "Impact of Physical Activity on Quality of Life During Pregnancy: A Causal ML Approach", "link": "https://arxiv.org/abs/2402.16909", "description": "arXiv:2402.16909v1 Announce Type: cross \nAbstract: The concept of Quality of Life (QoL) refers to a holistic measurement of an individual's well-being, incorporating psychological and social aspects. Pregnant women, especially those with obesity and stress, often experience lower QoL. Physical activity (PA) has shown the potential to enhance the QoL. However, pregnant women who are overweight and obese rarely meet the recommended level of PA. Studies have investigated the relationship between PA and QoL during pregnancy using correlation-based approaches. These methods risk capturing spurious correlations between variables rather than causal relationships. Besides, the existing methods mainly rely on physical activity parameters and neglect the use of different factors such as maternal (medical) history and context data, leading to biased estimates. Furthermore, the estimations lack an understanding of mediators and counterfactual scenarios that might affect them. In this paper, we investigate the causal relationship between being physically active (treatment variable) and the QoL (outcome) during pregnancy and postpartum. To estimate the causal effect, we develop a Causal Machine Learning method, integrating causal discovery and causal inference components. The data for our investigation is derived from a long-term wearable-based health monitoring study focusing on overweight and obese pregnant women. The machine learning (meta-learner) estimation technique is used to estimate the causal effect. Our results show that performing adequate physical activity during pregnancy and postpartum improves the QoL by 7.3 and 3.4 units on average in the physical health and psychological domains, respectively. In the final step, four refutation analysis techniques are employed to validate our estimation."}, "https://arxiv.org/abs/2402.17233": {"title": "Hybrid Square Neural ODE Causal Modeling", "link": "https://arxiv.org/abs/2402.17233", "description": "arXiv:2402.17233v1 Announce Type: cross \nAbstract: Hybrid models combine mechanistic ODE-based dynamics with flexible and expressive neural network components. Such models have grown rapidly in popularity, especially in scientific domains where such ODE-based modeling offers important interpretability and validated causal grounding (e.g., for counterfactual reasoning). 
The incorporation of mechanistic models also provides inductive bias in standard black-box modeling approaches, critical when learning from small datasets or partially observed, complex systems. Unfortunately, as hybrid models become more flexible, the causal grounding provided by the mechanistic model can quickly be lost. We address this problem by leveraging another common source of domain knowledge: ranking of treatment effects for a set of interventions, even if the precise treatment effect is unknown. We encode this information in a causal loss that we combine with the standard predictive loss to arrive at a hybrid loss that biases our learning towards causally valid hybrid models. We demonstrate our ability to achieve a win-win -- state-of-the-art predictive performance and causal validity -- in the challenging task of modeling glucose dynamics during exercise."}, "https://arxiv.org/abs/2402.17570": {"title": "Sparse Variational Contaminated Noise Gaussian Process Regression for Forecasting Geomagnetic Perturbations", "link": "https://arxiv.org/abs/2402.17570", "description": "arXiv:2402.17570v1 Announce Type: cross \nAbstract: Gaussian Processes (GP) have become popular machine learning methods for kernel-based learning on datasets with complicated covariance structures. In this paper, we present a novel extension to the GP framework using a contaminated normal likelihood function to better account for heteroscedastic variance and outlier noise. We propose a scalable inference algorithm based on the Sparse Variational Gaussian Process (SVGP) method for fitting sparse Gaussian process regression models with contaminated normal noise on large datasets. We examine an application to geomagnetic ground perturbations, where the state-of-the-art prediction model is based on neural networks. We show that our approach yields shorter prediction intervals for similar coverage and accuracy when compared to an artificial dense neural network baseline."}, "https://arxiv.org/abs/1809.04347": {"title": "High-dimensional Bayesian Fourier Analysis For Detecting Circadian Gene Expressions", "link": "https://arxiv.org/abs/1809.04347", "description": "arXiv:1809.04347v2 Announce Type: replace \nAbstract: In genomic applications, there is often interest in identifying genes whose time-course expression trajectories exhibit periodic oscillations with a period of approximately 24 hours. Such genes are usually referred to as circadian, and their identification is a crucial step toward discovering physiological processes that are clock-controlled. It is natural to expect that the expression of gene i at time j might depend to some degree on the expression of the other genes measured at the same time. However, widely-used rhythmicity detection techniques do not accommodate the potential dependence across genes. We develop a Bayesian approach for periodicity identification that explicitly takes into account the complex dependence structure across time-course trajectories in gene expressions. We employ a latent factor representation to accommodate dependence, while representing the true trajectories in the Fourier domain allows for inference on period, phase, and amplitude of the signal. Identification of circadian genes is allowed through a carefully chosen variable selection prior on the Fourier basis coefficients. The methodology is applied to a novel mouse liver circadian dataset. 
Although motivated by time-course gene expression array data, the proposed approach is applicable more broadly to the analysis of dependent functional data."}, "https://arxiv.org/abs/2202.06117": {"title": "Metric Statistics: Exploration and Inference for Random Objects With Distance Profiles", "link": "https://arxiv.org/abs/2202.06117", "description": "arXiv:2202.06117v4 Announce Type: replace \nAbstract: This article provides an overview of the statistical modeling of complex data as increasingly encountered in modern data analysis. It is argued that such data can often be described as elements of a metric space that satisfies certain structural conditions and features a probability measure. We refer to the random elements of such spaces as random objects and to the emerging field that deals with their statistical analysis as metric statistics. Metric statistics provides methodology, theory and visualization tools for the statistical description, quantification of variation, centrality and quantiles, regression and inference for populations of random objects, inferring these quantities from available data and samples. In addition to a brief review of current concepts, we focus on distance profiles as a major tool for object data in conjunction with the pairwise Wasserstein transports of the underlying one-dimensional distance distributions. These pairwise transports lead to the definition of intuitive and interpretable notions of transport ranks and transport quantiles as well as two-sample inference. An associated profile metric complements the original metric of the object space and may reveal important features of the object data in data analysis. We demonstrate these tools for the analysis of complex data through various examples and visualizations."}, "https://arxiv.org/abs/2206.11117": {"title": "Causal inference in multi-cohort studies using the target trial framework to identify and minimize sources of bias", "link": "https://arxiv.org/abs/2206.11117", "description": "arXiv:2206.11117v4 Announce Type: replace \nAbstract: Longitudinal cohort studies, which follow a group of individuals over time, provide the opportunity to examine causal effects of complex exposures on long-term health outcomes. Utilizing data from multiple cohorts has the potential to add further benefit by improving precision of estimates through data pooling and by allowing examination of effect heterogeneity through replication of analyses across cohorts. However, the interpretation of findings can be complicated by biases that may be compounded when pooling data, or contribute to discrepant findings when analyses are replicated. The 'target trial' is a powerful tool for guiding causal inference in single-cohort studies. Here we extend this conceptual framework to address the specific challenges that can arise in the multi-cohort setting. By representing a clear definition of the target estimand, the target trial provides a central point of reference against which biases arising in each cohort and from data pooling can be systematically assessed. Consequently, analyses can be designed to reduce these biases and the resulting findings appropriately interpreted in light of potential remaining biases. 
We use a case study to demonstrate the framework and its potential to strengthen causal inference in multi-cohort studies through improved analysis design and clarity in the interpretation of findings."}, "https://arxiv.org/abs/2301.03805": {"title": "Asymptotic Theory for Two-Way Clustering", "link": "https://arxiv.org/abs/2301.03805", "description": "arXiv:2301.03805v2 Announce Type: replace \nAbstract: This paper proves a new central limit theorem for a sample that exhibits two-way dependence and heterogeneity across clusters. Statistical inference for situations with both two-way dependence and cluster heterogeneity has thus far been an open issue. The existing theory for two-way clustering inference requires identical distributions across clusters (implied by the so-called separate exchangeability assumption). Yet no such homogeneity requirement is needed in the existing theory for one-way clustering. The new result therefore theoretically justifies the view that two-way clustering is a more robust version of one-way clustering, consistent with applied practice. The result is applied to linear regression, where it is shown that a standard plug-in variance estimator is valid for inference."}, "https://arxiv.org/abs/2304.00074": {"title": "Bayesian Clustering via Fusing of Localized Densities", "link": "https://arxiv.org/abs/2304.00074", "description": "arXiv:2304.00074v2 Announce Type: replace \nAbstract: Bayesian clustering typically relies on mixture models, with each component interpreted as a different cluster. After defining a prior for the component parameters and weights, Markov chain Monte Carlo (MCMC) algorithms are commonly used to produce samples from the posterior distribution of the component labels. The data are then clustered by minimizing the expectation of a clustering loss function that favours similarity to the component labels. Unfortunately, although these approaches are routinely implemented, clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components. To address this problem, we develop Fusing of Localized Densities (FOLD), a novel clustering method that melds components together using the posterior of the kernels. FOLD has a fully Bayesian decision theoretic justification, naturally leads to uncertainty quantification, can be easily implemented as an add-on to MCMC algorithms for mixtures, and favours a small number of distinct clusters. We provide theoretical support for FOLD including clustering optimality under kernel misspecification. In simulated experiments and real data, FOLD outperforms competitors by minimizing the number of clusters while inferring meaningful group structure."}, "https://arxiv.org/abs/2305.19675": {"title": "Testing Truncation Dependence: The Gumbel-Barnett Copula", "link": "https://arxiv.org/abs/2305.19675", "description": "arXiv:2305.19675v2 Announce Type: replace \nAbstract: In studies on lifetimes, occasionally, the population contains statistical units that were born before the data collection started. Units that died before this start are left-truncated. For all other units, the age at the study start is often recorded, and we aim to test whether this second measurement is independent of the genuine measure of interest, the lifetime. Our basic model of dependence is the one-parameter Gumbel-Barnett copula. 
For simplicity, the marginal distribution of the lifetime is assumed to be Exponential and, for the age-at-study-start, namely the distribution of birth dates, we assume a Uniform distribution. Also for simplicity, and to fit our application, we assume that units that die later than our study period are also truncated. Using a result from point process theory, we can approximate the truncated sample by a Poisson process and thereby derive its likelihood. Identification, consistency and asymptotic distribution of the maximum-likelihood estimator are derived. Testing for positive truncation dependence must include the hypothetical independence, which coincides with the boundary of the copula's parameter space. By non-standard theory, the maximum likelihood estimator of the exponential and the copula parameter is distributed as a mixture of a two- and a one-dimensional normal distribution. The application comprises 55 thousand double-truncated lifetimes of German businesses that closed down over the period 2014 to 2016. The likelihood has its maximum for the copula parameter at the parameter space boundary, so that the $p$-value of the test is $0.5$. The life expectancy does not increase relative to the year of foundation. Using a Farlie-Gumbel-Morgenstern copula, which models both positive and negative dependence, we find that the life expectancy of German enterprises even decreases significantly over time."}, "https://arxiv.org/abs/2310.18504": {"title": "Doubly Robust Identification of Causal Effects of a Continuous Treatment using Discrete Instruments", "link": "https://arxiv.org/abs/2310.18504", "description": "arXiv:2310.18504v3 Announce Type: replace \nAbstract: Many empirical applications estimate causal effects of a continuous endogenous variable (treatment) using a binary instrument. Estimation is typically done through linear 2SLS. This approach requires a mean treatment change, and causal interpretation requires the LATE-type monotonicity in the first stage. An alternative approach is to explore distributional changes in the treatment, where the first-stage restriction is treatment rank similarity. We propose causal estimands that are doubly robust in that they are valid under either of these two restrictions. We apply the doubly robust estimation to estimate the impacts of sleep on well-being. Our new estimates corroborate the usual 2SLS estimates."}, "https://arxiv.org/abs/2311.00439": {"title": "Robustify and Tighten the Lee Bounds: A Sample Selection Model under Stochastic Monotonicity and Symmetry Assumptions", "link": "https://arxiv.org/abs/2311.00439", "description": "arXiv:2311.00439v2 Announce Type: replace \nAbstract: In the presence of sample selection, Lee's (2009) nonparametric bounds are a popular tool for estimating a treatment effect. However, the Lee bounds rely on the monotonicity assumption, whose empirical validity is sometimes unclear, and the bounds are often regarded as too wide and less informative even under monotonicity. To address these practical issues, this paper introduces a stochastic version of the monotonicity assumption alongside a nonparametric distributional shape constraint. The former enhances the robustness of the Lee bounds with respect to monotonicity, while the latter serves to tighten these bounds. The obtained bounds do not rely on the exclusion restriction and can be root-$n$ consistently estimable, making them practically viable. 
The potential usefulness of the proposed methods is illustrated by their application to experimental data from the after-school instruction program studied by Muralidharan, Singh, and Ganimian (2019)."}, "https://arxiv.org/abs/2312.07177": {"title": "Fast Meta-Analytic Approximations for Relational Event Models: Applications to Data Streams and Multilevel Data", "link": "https://arxiv.org/abs/2312.07177", "description": "arXiv:2312.07177v2 Announce Type: replace \nAbstract: Large relational-event history data stemming from large networks are becoming increasingly available due to recent technological developments (e.g. digital communication, online databases, etc.). This opens many new doors to learning about complex interaction behavior between actors in temporal social networks. The relational event model has become the gold standard for relational event history analysis. Currently, however, the main bottleneck in fitting relational event models is computational, in the form of memory storage limitations and computational complexity. Relational event models are therefore mainly used for relatively small data sets while larger, more interesting datasets, including multilevel data structures and relational event data streams, cannot be analyzed on standard desktop computers. This paper addresses this problem by developing approximation algorithms based on meta-analysis methods that can fit relational event models significantly faster while avoiding the computational issues. In particular, meta-analytic approximations are proposed for analyzing streams of relational event data and multilevel relational event data, and potentially combinations thereof. The accuracy and the statistical properties of the methods are assessed using numerical simulations. Furthermore, real-world data are used to illustrate the potential of the methodology to study social interaction behavior in an organizational network and interaction behavior among political actors. The algorithms are implemented in a publicly available R package 'remx'."}, "https://arxiv.org/abs/2103.09603": {"title": "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R", "link": "https://arxiv.org/abs/2103.09603", "description": "arXiv:2103.09603v5 Announce Type: replace-cross \nAbstract: The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. 
In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods."}, "https://arxiv.org/abs/2111.10275": {"title": "Composite Goodness-of-fit Tests with Kernels", "link": "https://arxiv.org/abs/2111.10275", "description": "arXiv:2111.10275v4 Announce Type: replace-cross \nAbstract: Model misspecification can create significant challenges for the implementation of probabilistic models, and this has led to development of a range of robust methods which directly account for this issue. However, whether these more involved methods are required will depend on whether the model is really misspecified, and there is a lack of generally applicable methods to answer this question. In this paper, we propose one such method. More precisely, we propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on the maximum mean discrepancy and the kernel Stein discrepancy. They are widely applicable, including whenever the density of the parametric model is known up to normalisation constant, or if the model takes the form of a simulator. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of an unnormalised non-parametric density model, and an intractable generative model of a biological cellular network."}, "https://arxiv.org/abs/2203.12572": {"title": "Post-selection inference for e-value based confidence intervals", "link": "https://arxiv.org/abs/2203.12572", "description": "arXiv:2203.12572v4 Announce Type: replace-cross \nAbstract: Suppose that one can construct a valid $(1-\\delta)$-confidence interval (CI) for each of $K$ parameters of potential interest. If a data analyst uses an arbitrary data-dependent criterion to select some subset $S$ of parameters, then the aforementioned CIs for the selected parameters are no longer valid due to selection bias. We design a new method to adjust the intervals in order to control the false coverage rate (FCR). The main established method is the \"BY procedure\" by Benjamini and Yekutieli (JASA, 2005). The BY guarantees require certain restrictions on the selection criterion and on the dependence between the CIs. We propose a new simple method which, in contrast, is valid under any dependence structure between the original CIs, and any (unknown) selection criterion, but which only applies to a special, yet broad, class of CIs that we call e-CIs. To elaborate, our procedure simply reports $(1-\\delta|S|/K)$-CIs for the selected parameters, and we prove that it controls the FCR at $\\delta$ for confidence intervals that implicitly invert e-values; examples include those constructed via supermartingale methods, via universal inference, or via Chernoff-style bounds, among others. The e-BY procedure is admissible, and recovers the BY procedure as a special case via a particular calibrator. Our work also has implications for post-selection inference in sequential settings, since it applies at stopping times, to continuously-monitored confidence sequences, and under bandit sampling. 
We demonstrate the efficacy of our procedure using numerical simulations and real A/B testing data from Twitter."}, "https://arxiv.org/abs/2206.09107": {"title": "Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data", "link": "https://arxiv.org/abs/2206.09107", "description": "arXiv:2206.09107v2 Announce Type: replace-cross \nAbstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk."}, "https://arxiv.org/abs/2301.08994": {"title": "How to Measure Evidence and Its Strength: Bayes Factors or Relative Belief Ratios?", "link": "https://arxiv.org/abs/2301.08994", "description": "arXiv:2301.08994v2 Announce Type: replace-cross \nAbstract: Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. Certainly Bayes factors are regularly employed. The question then is: which of these measures of evidence is more appropriate? It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor based on a mixture prior and, when all is considered, the relative belief ratio has better properties as a measure of evidence. It is further shown that, when a natural restriction on the mixture prior is imposed, the Bayes factor equals the relative belief ratio obtained without using the mixture prior. Even with this restriction, this still leaves open the question of how the strength of evidence is to be measured. It is argued here that the current practice of using the size of the Bayes factor to measure strength is not correct and a solution to this issue is presented. 
Several general criticisms of these measures of evidence are also discussed and addressed."}, "https://arxiv.org/abs/2306.03816": {"title": "Parametrization, Prior Independence, and the Semiparametric Bernstein-von Mises Theorem for the Partially Linear Model", "link": "https://arxiv.org/abs/2306.03816", "description": "arXiv:2306.03816v4 Announce Type: replace-cross \nAbstract: I prove a semiparametric Bernstein-von Mises theorem for a partially linear regression model with independent priors for the low-dimensional parameter of interest and the infinite-dimensional nuisance parameters. My result avoids a prior invariance condition that arises from a loss of information in not knowing the nuisance parameter. The key idea is a feasible reparametrization of the regression function that mimics the Gaussian profile likelihood. This allows a researcher to assume independent priors for the model parameters while automatically accounting for the loss of information associated with not knowing the nuisance parameter. As these prior stability conditions often impose strong restrictions on the underlying data-generating process, my results provide a more robust asymptotic normality theorem than the original parametrization of the partially linear model."}, "https://arxiv.org/abs/2308.11050": {"title": "Optimal Dorfman Group Testing for Symmetric Distributions", "link": "https://arxiv.org/abs/2308.11050", "description": "arXiv:2308.11050v2 Announce Type: replace-cross \nAbstract: We study Dorfman's classical group testing protocol in a novel setting where individual specimen statuses are modeled as exchangeable random variables. We are motivated by infectious disease screening. In that case, specimens which arrive together for testing often originate from the same community and so their statuses may exhibit positive correlation. Dorfman's protocol screens a population of n specimens for a binary trait by partitioning it into non-overlapping groups, testing these, and only individually retesting the specimens of each positive group. The partition is chosen to minimize the expected number of tests under a probabilistic model of specimen statuses. We relax the typical assumption that these are independent and identically distributed and instead model them as exchangeable random variables. In this case, their joint distribution is symmetric in the sense that it is invariant under permutations. We give a characterization of such distributions in terms of a function q where q(h) is the marginal probability that any group of size h tests negative. We use this interpretable representation to show that the set partitioning problem arising in Dorfman's protocol can be reduced to an integer partitioning problem and efficiently solved. We apply these tools to an empirical dataset from the COVID-19 pandemic. The methodology helps explain the unexpectedly high empirical efficiency reported by the original investigators."}, "https://arxiv.org/abs/2309.15326": {"title": "Structural identifiability analysis of linear reaction-advection-diffusion processes in mathematical biology", "link": "https://arxiv.org/abs/2309.15326", "description": "arXiv:2309.15326v3 Announce Type: replace-cross \nAbstract: Effective application of mathematical models to interpret biological data and make accurate predictions often requires that model parameters are identifiable. 
Approaches to assess the so-called structural identifiability of models are well-established for ordinary differential equation models, yet there are no commonly adopted approaches that can be applied to assess the structural identifiability of the partial differential equation (PDE) models that are requisite to capture spatial features inherent to many phenomena. The differential algebra approach to structural identifiability has recently been demonstrated to be applicable to several specific PDE models. In this brief article, we present general methodology for performing structural identifiability analysis on partially observed reaction-advection-diffusion (RAD) PDE models that are linear in the unobserved quantities. We show that the differential algebra approach can always, in theory, be applied to such models. Moreover, despite the perceived complexity introduced by the addition of advection and diffusion terms, identifiability of spatial analogues of non-spatial models cannot decrease in structural identifiability. We conclude by discussing future possibilities and the computational cost of performing structural identifiability analysis on more general PDE models."}, "https://arxiv.org/abs/2310.19683": {"title": "An Online Bootstrap for Time Series", "link": "https://arxiv.org/abs/2310.19683", "description": "arXiv:2310.19683v2 Announce Type: replace-cross \nAbstract: Resampling methods such as the bootstrap have proven invaluable in the field of machine learning. However, the applicability of traditional bootstrap methods is limited when dealing with large streams of dependent data, such as time series or spatially correlated observations. In this paper, we propose a novel bootstrap method that is designed to account for data dependencies and can be executed online, making it particularly suitable for real-time applications. This method is based on an autoregressive sequence of increasingly dependent resampling weights. We prove the theoretical validity of the proposed bootstrap scheme under general conditions. We demonstrate the effectiveness of our approach through extensive simulations and show that it provides reliable uncertainty quantification even in the presence of complex data dependencies. Our work bridges the gap between classical resampling techniques and the demands of modern data analysis, providing a valuable tool for researchers and practitioners in dynamic, data-rich environments."}, "https://arxiv.org/abs/2311.07511": {"title": "Uncertainty estimation in satellite precipitation interpolation with machine learning", "link": "https://arxiv.org/abs/2311.07511", "description": "arXiv:2311.07511v2 Announce Type: replace-cross \nAbstract: Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We address this gap by benchmarking six algorithms, mostly novel for this task, for quantifying predictive uncertainty in spatial interpolation. On 15 years of monthly data over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). 
Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Feature importance analysis revealed satellite precipitation (PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals) datasets) as the most informative predictor, followed by gauge elevation and distance to satellite grid points. Compared to QR, LightGBM showed improved performance with respect to the quantile scoring rule by 11.10%, followed by QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial interpolation with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in interpolating spatial data, supported with a formal evaluation framework based on scoring functions and scoring rules."}, "https://arxiv.org/abs/2311.18460": {"title": "Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework", "link": "https://arxiv.org/abs/2311.18460", "description": "arXiv:2311.18460v2 Announce Type: replace-cross \nAbstract: Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications."}, "https://arxiv.org/abs/2402.17915": {"title": "Generation and analysis of synthetic data via Bayesian networks: a robust approach for uncertainty quantification via Bayesian paradigm", "link": "https://arxiv.org/abs/2402.17915", "description": "arXiv:2402.17915v1 Announce Type: new \nAbstract: Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach considers the generation of synthetic data, to be disclosed instead of the original data. Efficient approaches ought to deal with the trade-off between reliability and confidentiality of the released data. Ultimately, the aim is to be able to reproduce as accurately as possible statistical analysis of the original data using the synthetic one. 
Bayesian networks are a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and generate synthetic datasets. These ought to not only approximate the results of analyses with the original data but also robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generate and analyze synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology makes use of probability properties of the model to devise a computationally efficient algorithm to obtain the target predictive distributions via Monte Carlo. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples."}, "https://arxiv.org/abs/2402.17920": {"title": "Generalized Bayesian Additive Regression Trees for Restricted Mean Survival Time Inference", "link": "https://arxiv.org/abs/2402.17920", "description": "arXiv:2402.17920v1 Announce Type: new \nAbstract: Prediction methods for time-to-event outcomes often utilize survival models that rely on strong assumptions about noninformative censoring or on how individual-level covariates and survival functions are related. When the main interest is in predicting individual-level restricted mean survival times (RMST), reliance on such assumptions can lead to poor predictive performance if these assumptions are not satisfied. We propose a generalized Bayes framework that avoids full probability modeling of all survival outcomes by using an RMST-targeted loss function that depends on a collection of inverse probability of censoring weights (IPCW). In our generalized Bayes formulation, we utilize a flexible additive tree regression model for the RMST function, and the posterior distribution of interest is obtained through model-averaging IPCW-conditional loss function-based pseudo-Bayesian posteriors. Because informative censoring can be captured by the IPCW-dependent loss function, our approach only requires one to specify a model for the censoring distribution, thereby obviating the need for complex joint modeling to handle informative censoring. We evaluate the performance of our method through a series of simulations that compare it with several well-known survival machine learning methods, and we illustrate the application of our method using a multi-site cohort of breast cancer patients with clinical and genomic covariates."}, "https://arxiv.org/abs/2402.17984": {"title": "Sampling low-fidelity outputs for estimation of high-fidelity density and its tails", "link": "https://arxiv.org/abs/2402.17984", "description": "arXiv:2402.17984v1 Announce Type: new \nAbstract: In a multifidelity setting, data are available under the same conditions from two (or more) sources, e.g. computer codes, one being lower-fidelity but computationally cheaper, and the other higher-fidelity and more expensive. This work studies for which low-fidelity outputs one should obtain high-fidelity outputs, if the goal is to estimate the probability density function of the latter, especially when it comes to the distribution tails and extremes. 
It is suggested to approach this problem from the perspective of the importance sampling of low-fidelity outputs according to some proposal distribution, combined with special considerations for the distribution tails based on extreme value theory. The notion of an optimal proposal distribution is introduced and investigated, in both theory and simulations. The approach is motivated and illustrated with an application to estimate the probability density function of record extremes of ship motions, obtained through two computer codes of different fidelities."}, "https://arxiv.org/abs/2402.18105": {"title": "JEL ratio test for independence between a continuous and a categorical random variable", "link": "https://arxiv.org/abs/2402.18105", "description": "arXiv:2402.18105v1 Announce Type: new \nAbstract: The categorical Gini covariance is a dependence measure between a numerical variable and a categorical variable. The Gini covariance measures dependence by quantifying the difference between the conditional and unconditional distribution functions. A value of zero for the categorical Gini covariance implies independence of the numerical variable and the categorical variable. We propose a non-parametric test for testing the independence between a numerical and categorical variable using the categorical Gini covariance. We use the theory of U-statistics to derive the test statistic and study its properties. The test has an asymptotic normal distribution. As the implementation of a normal-based test is difficult, we develop a jackknife empirical likelihood (JEL) ratio test for testing independence. Extensive Monte Carlo simulation studies are carried out to validate the performance of the proposed JEL-based test. We illustrate the test procedure using a real data set."}, "https://arxiv.org/abs/2402.18130": {"title": "Sequential Change-point Detection for Compositional Time Series with Exogenous Variables", "link": "https://arxiv.org/abs/2402.18130", "description": "arXiv:2402.18130v1 Announce Type: new \nAbstract: Sequential change-point detection for time series enables us to sequentially check the hypothesis that the model still holds as more and more data are observed. It is widely used in data monitoring in practice. In this work, we consider sequential change-point detection for compositional time series, time series in which the observations are proportions. For fitting compositional time series, we propose a generalized Beta AR(1) model, which can incorporate exogenous variables upon which the time series observations are dependent. We show the compositional time series are strictly stationary and geometrically ergodic and consider maximum likelihood estimation for model parameters. We show the partial MLEs are consistent and asymptotically normal and propose a parametric sequential change-point detection method for the compositional time series model. The change-point detection method is illustrated using a time series of Covid-19 positivity rates."}, "https://arxiv.org/abs/2402.18185": {"title": "Quantile Outcome Adaptive Lasso: Covariate Selection for Inverse Probability Weighting Estimator of Quantile Treatment Effects", "link": "https://arxiv.org/abs/2402.18185", "description": "arXiv:2402.18185v1 Announce Type: new \nAbstract: When using the propensity score method to estimate the treatment effects, it is important to select the covariates to be included in the propensity score model. 
The inclusion of covariates unrelated to the outcome in the propensity score model leads to bias and large variance in the estimator of treatment effects. Many data-driven covariate selection methods have been proposed for selecting covariates related to outcomes. However, most of them assume average treatment effect estimation and may not be designed to estimate quantile treatment effects (QTE), which is the effect of treatment on the quantiles of the outcome distribution. In QTE estimation, we consider two types of relation with the outcome: through its expected value and through its quantiles. To achieve this, we propose a data-driven covariate selection method for propensity score models that allows for the selection of covariates related to the expected value and quantile of the outcome for QTE estimation. Assuming the quantile regression model as an outcome regression model, covariate selection was performed using a regularization method with the partial regression coefficients of the quantile regression model as weights. The proposed method was applied to artificial data and a dataset of mothers and children born in King County, Washington, to compare the performance of existing methods and QTE estimators. As a result, the proposed method performs well in the presence of covariates related to both the expected value and quantile of the outcome."}, "https://arxiv.org/abs/2402.18186": {"title": "Bayesian Geographically Weighted Regression using Fused Lasso Prior", "link": "https://arxiv.org/abs/2402.18186", "description": "arXiv:2402.18186v1 Announce Type: new \nAbstract: A main purpose of spatial data analysis is to predict the objective variable for unobserved locations. Although Geographically Weighted Regression (GWR) is often used for this purpose, estimation instability proves to be an issue. To address this issue, Bayesian Geographically Weighted Regression (BGWR) has been proposed. In BGWR, by setting the same prior distribution for all locations, the coefficients' estimation stability is improved. However, when the density of observation locations varies over space, these methods do not sufficiently consider the similarity of coefficients among locations. Moreover, the prediction accuracy of these methods becomes worse. To solve these issues, we propose Bayesian Geographically Weighted Sparse Regression (BGWSR) that uses Bayesian Fused Lasso for the prior distribution of the BGWR coefficients. Constraining the parameters to have the same values at adjacent locations is expected to improve the prediction accuracy at locations with a low number of adjacent locations. Furthermore, from the predictive distribution, it is also possible to evaluate the uncertainty of the predicted value of the objective variable. Through numerical studies, we confirmed that BGWSR has better prediction performance than the existing methods (GWR and BGWR) when the density of observation locations varies over space. Finally, BGWSR is applied to land price data in Tokyo. The results suggest that BGWSR has better prediction performance and smaller uncertainty than existing methods."}, "https://arxiv.org/abs/2402.18298": {"title": "Mapping between measurement scales in meta-analysis, with application to measures of body mass index in children", "link": "https://arxiv.org/abs/2402.18298", "description": "arXiv:2402.18298v1 Announce Type: new \nAbstract: Quantitative evidence synthesis methods aim to combine data from multiple medical trials to infer relative effects of different interventions. 
A challenge arises when trials report continuous outcomes on different measurement scales. To include all evidence in one coherent analysis, we require methods to `map' the outcomes onto a single scale. This is particularly challenging when trials report aggregate rather than individual data. We are motivated by a meta-analysis of interventions to prevent obesity in children. Trials report aggregate measurements of body mass index (BMI) either expressed as raw values or standardised for age and sex. We develop three methods for mapping between aggregate BMI data using known relationships between individual measurements on different scales. The first is an analytical method based on the mathematical definitions of z-scores and percentiles. The other two approaches involve sampling individual participant data on which to perform the conversions. One method is a straightforward sampling routine, while the other involves optimization with respect to the reported outcomes. In contrast to the analytical approach, these methods also have wider applicability for mapping between any pair of measurement scales with known or estimable individual-level relationships. We verify and contrast our methods using trials from our data set which report outcomes on multiple scales. We find that all methods recreate mean values with reasonable accuracy, but for standard deviations, optimization outperforms the other methods. However, the optimization method is more likely to underestimate standard deviations and is vulnerable to non-convergence."}, "https://arxiv.org/abs/2402.18450": {"title": "Copula Approximate Bayesian Computation Using Distribution Random Forests", "link": "https://arxiv.org/abs/2402.18450", "description": "arXiv:2402.18450v1 Announce Type: new \nAbstract: This paper introduces a novel Approximate Bayesian Computation (ABC) framework for estimating the posterior distribution and the maximum likelihood estimate (MLE) of the parameters of models defined by intractable likelihood functions. This framework can describe the possibly skewed and high-dimensional posterior distribution by a novel multivariate copula-based distribution, based on univariate marginal posterior distributions that can account for skewness and be accurately estimated by Distribution Random Forests (DRF) while performing automatic summary statistics (covariates) selection, and on robustly estimated copula dependence parameters. The framework employs a novel multivariate mode estimator to perform MLE and posterior mode estimation, and provides an optional step to perform model selection from a given set of models with posterior probabilities estimated by DRF. The posterior distribution estimation accuracy of the ABC framework is illustrated through simulation studies involving models with analytically computable posterior distributions, and involving exponential random graph and mechanistic network models which are each defined by an intractable likelihood from which it is costly to simulate large network datasets. 
Also, the framework is illustrated through analyses of large real-life networks with sizes ranging from 28,000 to 65.6 million nodes (3 million to 1.8 billion edges), including a large multilayer network with weighted directed edges."}, "https://arxiv.org/abs/2402.18533": {"title": "Constructing Bayesian Optimal Designs for Discrete Choice Experiments by Simulated Annealing", "link": "https://arxiv.org/abs/2402.18533", "description": "arXiv:2402.18533v1 Announce Type: new \nAbstract: Discrete Choice Experiments (DCEs) investigate the attributes that influence individuals' choices when selecting among various options. To enhance the quality of the estimated choice models, researchers opt for Bayesian optimal designs that utilize existing information about the attributes' preferences. Given the nonlinear nature of choice models, the construction of an appropriate design requires efficient algorithms. Among these, the Coordinate-Exchange (CE) algorithm is most commonly employed for constructing designs based on the multinomial logit model. Since this is a hill-climbing algorithm, obtaining better designs necessitates multiple random starting designs. This approach increases the algorithm's run-time, but may not lead to a significant improvement in results. We propose the use of a Simulated Annealing (SA) algorithm to construct Bayesian D-optimal designs. This algorithm accepts both superior and inferior solutions, avoiding premature convergence and allowing a more thorough exploration of potential designs. Consequently, it ultimately obtains higher-quality choice designs within the same time-frame. Our work represents the first application of an SA algorithm in constructing Bayesian optimal designs for DCEs. Through computational experiments and a real-life case study, we demonstrate that the SA designs consistently outperform the CE designs in terms of Bayesian D-efficiency, especially when the prior preference information is highly uncertain."}, "https://arxiv.org/abs/2402.17886": {"title": "Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion", "link": "https://arxiv.org/abs/2402.17886", "description": "arXiv:2402.17886v1 Announce Type: cross \nAbstract: This paper considers the problem of sampling from a non-log-concave distribution, based on queries of its unnormalized density. It first describes a framework, Diffusion Monte Carlo (DMC), based on the simulation of a denoising diffusion process with its score function approximated by a generic Monte Carlo estimator. DMC is an oracle-based meta-algorithm, where its oracle is the assumed access to samples that generate a Monte Carlo score estimator. Then we provide an implementation of this oracle, based on rejection sampling, and this turns DMC into a true algorithm, termed Zeroth-Order Diffusion Monte Carlo (ZOD-MC). We provide convergence analyses by first constructing a general framework, i.e. a performance guarantee for DMC, without assuming the target distribution to be log-concave or satisfying any isoperimetric inequality. Then we prove that ZOD-MC admits an inverse polynomial dependence on the desired sampling accuracy, albeit still suffering from the curse of dimensionality. Consequently, for low dimensional distributions, ZOD-MC is a very efficient sampler, with performance exceeding that of the latest samplers, including RDMC and RS-DMC, which are also based on denoising diffusion. 
Last, we experimentally demonstrate the insensitivity of ZOD-MC to increasingly higher barriers between modes or discontinuity in non-convex potential."}, "https://arxiv.org/abs/2402.18242": {"title": "A network-constrain Weibull AFT model for biomarkers discovery", "link": "https://arxiv.org/abs/2402.18242", "description": "arXiv:2402.18242v1 Announce Type: cross \nAbstract: We propose AFTNet, a novel network-constraint survival analysis method based on the Weibull accelerated failure time (AFT) model solved by a penalized likelihood approach for variable selection and estimation. When using the log-linear representation, the inference problem becomes a structured sparse regression problem for which we explicitly incorporate the correlation patterns among predictors using a double penalty that promotes both sparsity and grouping effect. Moreover, we establish the theoretical consistency for the AFTNet estimator and present an efficient iterative computational algorithm based on the proximal gradient descent method. Finally, we evaluate AFTNet performance both on synthetic and real data examples."}, "https://arxiv.org/abs/2402.18392": {"title": "Unveiling the Potential of Robustness in Evaluating Causal Inference Models", "link": "https://arxiv.org/abs/2402.18392", "description": "arXiv:2402.18392v1 Announce Type: cross \nAbstract: The growing demand for personalized decision-making has led to a surge of interest in estimating the Conditional Average Treatment Effect (CATE). The intersection of machine learning and causal inference has yielded various effective CATE estimators. However, deploying these estimators in practice is often hindered by the absence of counterfactual labels, making it challenging to select the desirable CATE estimator using conventional model selection procedures like cross-validation. Existing approaches for CATE estimator selection, such as plug-in and pseudo-outcome metrics, face two inherent challenges. Firstly, they are required to determine the metric form and the underlying machine learning models for fitting nuisance parameters or plug-in learners. Secondly, they lack a specific focus on selecting a robust estimator. To address these challenges, this paper introduces a novel approach, the Distributionally Robust Metric (DRM), for CATE estimator selection. The proposed DRM not only eliminates the need to fit additional models but also excels at selecting a robust CATE estimator. Experimental studies demonstrate the efficacy of the DRM method, showcasing its consistent effectiveness in identifying superior estimators while mitigating the risk of selecting inferior ones."}, "https://arxiv.org/abs/2108.00306": {"title": "A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions", "link": "https://arxiv.org/abs/2108.00306", "description": "arXiv:2108.00306v5 Announce Type: replace \nAbstract: With advances in scientific computing and mathematical modeling, complex scientific phenomena such as galaxy formations and rocket propulsion can now be reliably simulated. Such simulations can however be very time-intensive, requiring millions of CPU hours to perform. One solution is multi-fidelity emulation, which uses data of different fidelities to train an efficient predictive model which emulates the expensive simulator. For complex scientific problems and with careful elicitation from scientists, such multi-fidelity data may often be linked by a directed acyclic graph (DAG) representing its scientific model dependencies. 
We thus propose a new Graphical Multi-fidelity Gaussian Process (GMGP) model, which embeds this DAG structure (capturing scientific dependencies) within a Gaussian process framework. We show that the GMGP has desirable modeling traits via two Markov properties, and admits a scalable algorithm for recursive computation of the posterior mean and variance at each depth level of the DAG. We also present a novel experimental design methodology over the DAG given an experimental budget, and propose a nonlinear extension of the GMGP via deep Gaussian processes. The advantages of the GMGP are then demonstrated via a suite of numerical experiments and an application to emulation of heavy-ion collisions, which can be used to study the conditions of matter in the Universe shortly after the Big Bang. The proposed model has broader uses in data fusion applications with graphical structure, which we further discuss."}, "https://arxiv.org/abs/2112.02712": {"title": "Interpretable discriminant analysis for functional data supported on random nonlinear domains with an application to Alzheimer's disease", "link": "https://arxiv.org/abs/2112.02712", "description": "arXiv:2112.02712v2 Announce Type: replace \nAbstract: We introduce a novel framework for the classification of functional data supported on nonlinear, and possibly random, manifold domains. The motivating application is the identification of subjects with Alzheimer's disease from their cortical surface geometry and associated cortical thickness map. The proposed model is based upon a reformulation of the classification problem as a regularized multivariate functional linear regression model. This allows us to adopt a direct approach to the estimation of the most discriminant direction while controlling for its complexity with appropriate differential regularization. Our approach does not require prior estimation of the covariance structure of the functional predictors, which is computationally prohibitive in our application setting. We provide a theoretical analysis of the out-of-sample prediction error of the proposed model and explore the finite sample performance in a simulation setting. We apply the proposed method to a pooled dataset from the Alzheimer's Disease Neuroimaging Initiative and the Parkinson's Progression Markers Initiative. Through this application, we identify discriminant directions that capture both cortical geometric and thickness predictive features of Alzheimer's disease that are consistent with the existing neuroscience literature."}, "https://arxiv.org/abs/2202.03960": {"title": "Continuous permanent unobserved heterogeneity in dynamic discrete choice models", "link": "https://arxiv.org/abs/2202.03960", "description": "arXiv:2202.03960v3 Announce Type: replace \nAbstract: In dynamic discrete choice (DDC) analysis, it is common to use mixture models to control for unobserved heterogeneity. However, consistent estimation typically requires both restrictions on the support of unobserved heterogeneity and a high-level injectivity condition that is difficult to verify. This paper provides primitive conditions for point identification of a broad class of DDC models with multivariate continuous permanent unobserved heterogeneity. The results apply to both finite- and infinite-horizon DDC models, do not require a full support assumption or a long panel, and place no parametric restriction on the distribution of unobserved heterogeneity. 
In addition, I propose a seminonparametric estimator that is computationally attractive and can be implemented using familiar parametric methods."}, "https://arxiv.org/abs/2211.06755": {"title": "The chiPower transformation: a valid alternative to logratio transformations in compositional data analysis", "link": "https://arxiv.org/abs/2211.06755", "description": "arXiv:2211.06755v3 Announce Type: replace \nAbstract: The approach to analysing compositional data has been dominated by the use of logratio transformations, to ensure exact subcompositional coherence and, in some situations, exact isometry as well. A problem with this approach is that data zeros, found in most applications, have to be replaced to allow the logarithmic transformation. An alternative new approach, called the `chiPower' transformation, which allows data zeros, is to combine the standardization inherent in the chi-square distance in correspondence analysis, with the essential elements of the Box-Cox power transformation. The chiPower transformation is justified because it defines between-sample distances that tend to logratio distances for strictly positive data as the power parameter tends to zero, and are then equivalent to transforming to logratios. For data with zeros, a value of the power can be identified that brings the chiPower transformation as close as possible to a logratio transformation, without having to substitute the zeros. Especially in the area of high-dimensional data, this alternative approach can present such a high level of coherence and isometry as to be a valid approach to the analysis of compositional data. Furthermore, in a supervised learning context, if the compositional variables serve as predictors of a response in a modelling framework, for example generalized linear models, then the power can be used as a tuning parameter in optimizing the accuracy of prediction through cross-validation. The chiPower-transformed variables have a straightforward interpretation, since they are each identified with single compositional parts, not ratios."}, "https://arxiv.org/abs/2301.11057": {"title": "Empirical Bayes factors for common hypothesis tests", "link": "https://arxiv.org/abs/2301.11057", "description": "arXiv:2301.11057v4 Announce Type: replace \nAbstract: Bayes factors for composite hypotheses have difficulty in encoding vague prior knowledge, as improper priors cannot be used and objective priors may be subjectively unreasonable. To address these issues I revisit the posterior Bayes factor, in which the posterior distribution from the data at hand is re-used in the Bayes factor for the same data. I argue that this is biased when calibrated against proper Bayes factors, but propose adjustments to allow interpretation on the same scale. In the important case of a regular normal model, the bias in log scale is half the number of parameters. The resulting empirical Bayes factor is closely related to the widely applicable information criterion. I develop test-based empirical Bayes factors for several standard tests and propose an extension to multiple testing closely related to the optimal discovery procedure. When only a P-value is available, an approximate empirical Bayes factor is 10p. I propose interpreting the strength of Bayes factors on a logarithmic scale with base 3.73, reflecting the sharpest distinction between weaker and stronger belief. 
This provides an objective framework for interpreting statistical evidence, and realises a Bayesian/frequentist compromise."}, "https://arxiv.org/abs/2303.16599": {"title": "Difference-based covariance matrix estimate in time series nonparametric regression with applications to specification tests", "link": "https://arxiv.org/abs/2303.16599", "description": "arXiv:2303.16599v2 Announce Type: replace \nAbstract: Long-run covariance matrix estimation is the building block of time series inference. The corresponding difference-based estimator, which avoids detrending, has attracted considerable interest due to its robustness to both smooth and abrupt structural breaks and its competitive finite sample performance. However, existing methods mainly focus on estimators for the univariate process while their direct and multivariate extensions for most linear models are asymptotically biased. We propose a novel difference-based and debiased long-run covariance matrix estimator for functional linear models with time-varying regression coefficients, allowing time series non-stationarity, long-range dependence, state-heteroscedasticity and their mixtures. We apply the new estimator to (i) the structural stability test, overcoming the notorious non-monotonic power phenomena caused by piecewise smooth alternatives for regression coefficients, and (ii) the nonparametric residual-based tests for long memory, improving the performance via the residual-free formula of the proposed estimator. The effectiveness of the proposed method is justified theoretically and demonstrated by superior performance in simulation studies, while its usefulness is elaborated via real data analysis. Our method is implemented in the R package mlrv."}, "https://arxiv.org/abs/2307.11705": {"title": "Small Sample Inference for Two-way Capture Recapture Experiments", "link": "https://arxiv.org/abs/2307.11705", "description": "arXiv:2307.11705v2 Announce Type: replace \nAbstract: The properties of the generalized Waring distribution defined on the non-negative integers are reviewed. Formulas for its moments and its mode are given. A construction as a mixture of negative binomial distributions is also presented. Then we turn to the Petersen model for estimating the population size $N$ in a two-way capture recapture experiment. We construct a Bayesian model for $N$ by combining a Waring prior with the hypergeometric distribution for the number of units caught twice in the experiment. Credible intervals for $N$ are obtained using quantiles of the posterior, a generalized Waring distribution. The standard confidence interval for the population size constructed using the asymptotic variance of the Petersen estimator and the 0.5 logit transformed interval are shown to be special cases of the generalized Waring credible interval. The true coverage of this interval is shown to be greater than or equal to its nominal coverage in small populations, regardless of the capture probabilities. In addition, its length is substantially smaller than that of the 0.5 logit transformed interval. 
Thus, the proposed generalized Waring credible interval appears to be the best way to quantify the uncertainty of the Petersen estimator for the population size."}, "https://arxiv.org/abs/2310.18027": {"title": "Bayesian Prognostic Covariate Adjustment With Additive Mixture Priors", "link": "https://arxiv.org/abs/2310.18027", "description": "arXiv:2310.18027v4 Announce Type: replace \nAbstract: Effective and rapid decision-making from randomized controlled trials (RCTs) requires unbiased and precise treatment effect inferences. Two strategies to address this requirement are to adjust for covariates that are highly correlated with the outcome, and to leverage historical control information via Bayes' theorem. We propose a new Bayesian prognostic covariate adjustment methodology, referred to as Bayesian PROCOVA, that combines these two strategies. Covariate adjustment in Bayesian PROCOVA is based on generative artificial intelligence (AI) algorithms that construct a digital twin generator (DTG) for RCT participants. The DTG is trained on historical control data and yields a digital twin (DT) probability distribution for each RCT participant's outcome under the control treatment. The expectation of the DT distribution, referred to as the prognostic score, defines the covariate for adjustment. Historical control information is leveraged via an additive mixture prior with two components: an informative prior probability distribution specified based on historical control data, and a weakly informative prior distribution. The mixture weight determines the extent to which posterior inferences are drawn from the informative component, versus the weakly informative component. This weight has a prior distribution as well, and so the entire additive mixture prior is completely pre-specifiable without involving any RCT information. We establish an efficient Gibbs algorithm for sampling from the posterior distribution, and derive closed-form expressions for the posterior mean and variance of the treatment effect parameter conditional on the weight, in Bayesian PROCOVA. We evaluate efficiency gains of Bayesian PROCOVA via its bias control and variance reduction compared to frequentist PROCOVA in simulation studies that encompass different discrepancies. These gains translate to smaller RCTs."}, "https://arxiv.org/abs/2311.09015": {"title": "Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach", "link": "https://arxiv.org/abs/2311.09015", "description": "arXiv:2311.09015v2 Announce Type: replace \nAbstract: We consider the task of identifying and estimating a parameter of interest in settings where data is missing not at random (MNAR). In general, such parameters are not identified without strong assumptions on the missing data model. In this paper, we take an alternative approach and introduce a method inspired by data fusion, where information in an MNAR dataset is augmented by information in an auxiliary dataset subject to missingness at random (MAR). We show that even if the parameter of interest cannot be identified given either dataset alone, it can be identified given pooled data, under two complementary sets of assumptions. 
We derive an inverse probability weighted (IPW) estimator for identified parameters, and evaluate the performance of our estimation strategies via simulation studies and a data application."}, "https://arxiv.org/abs/2401.08941": {"title": "A Powerful and Precise Feature-level Filter using Group Knockoffs", "link": "https://arxiv.org/abs/2401.08941", "description": "arXiv:2401.08941v2 Announce Type: replace \nAbstract: Selecting important features that have substantial effects on the response with provable type-I error rate control is a fundamental concern in statistics, with wide-ranging practical applications. Existing knockoff filters, although shown to provide a theoretical guarantee on false discovery rate (FDR) control, often struggle to strike a balance between high power and precision in pinpointing important features when there exist large groups of strongly correlated features. To address this challenge, we develop a new filter using group knockoffs to achieve both powerful and precise selection of important features. Via experiments on simulated data and analysis of a real Alzheimer's disease genetic dataset, it is found that the proposed filter can not only control the proportion of false discoveries but also identify important features with comparable power and greater precision than the existing group knockoffs filter."}, "https://arxiv.org/abs/2107.12365": {"title": "Inference for Heteroskedastic PCA with Missing Data", "link": "https://arxiv.org/abs/2107.12365", "description": "arXiv:2107.12365v2 Announce Type: replace-cross \nAbstract: This paper studies how to construct confidence regions for principal component analysis (PCA) in high dimension, a problem that has been vastly under-explored. While computing measures of uncertainty for nonlinear/nonconvex estimators is in general difficult in high dimension, the challenge is further compounded by the prevalent presence of missing data and heteroskedastic noise. We propose a novel approach to performing valid inference on the principal subspace under a spiked covariance model with missing data, on the basis of an estimator called HeteroPCA (Zhang et al., 2022). We develop non-asymptotic distributional guarantees for HeteroPCA, and demonstrate how these can be invoked to compute both confidence regions for the principal subspace and entrywise confidence intervals for the spiked covariance matrix. Our inference procedures are fully data-driven and adaptive to heteroskedastic random noise, without requiring prior knowledge about the noise levels."}, "https://arxiv.org/abs/2306.05751": {"title": "Advancing Counterfactual Inference through Nonlinear Quantile Regression", "link": "https://arxiv.org/abs/2306.05751", "description": "arXiv:2306.05751v3 Announce Type: replace-cross \nAbstract: The capacity to address counterfactual \"what if\" inquiries is crucial for understanding and making use of causal influences. Traditional counterfactual inference, under Pearl's counterfactual framework, typically depends on having access to or estimating a structural causal model. Yet, in practice, this causal model is often unknown and might be challenging to identify. Hence, this paper aims to perform reliable counterfactual inference based solely on observational data and the (learned) qualitative causal structure, without necessitating a predefined causal model or even direct estimations of conditional distributions. 
To this end, we establish a novel connection between counterfactual inference and quantile regression and show that counterfactual inference can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual inference implemented with neural networks under a bi-level optimization scheme. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data, thereby providing an upper bound on the generalization error. Furthermore, empirical evidence demonstrates its superior statistical efficiency in comparison to existing methods. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions."}, "https://arxiv.org/abs/2307.16485": {"title": "Parameter Inference for Degenerate Diffusion Processes", "link": "https://arxiv.org/abs/2307.16485", "description": "arXiv:2307.16485v2 Announce Type: replace-cross \nAbstract: We study parametric inference for ergodic diffusion processes with a degenerate diffusion matrix. Existing research focuses on a particular class of hypo-elliptic SDEs, with components split into `rough'/`smooth' and noise from rough components propagating directly onto smooth ones, but some critical model classes arising in applications have yet to be explored. We aim to cover this gap, thus analyse the highly degenerate class of SDEs, where components split into further sub-groups. Such models include e.g. the notable case of generalised Langevin equations. We propose a tailored time-discretisation scheme and provide asymptotic results supporting our scheme in the context of high-frequency, full observations. The proposed discretisation scheme is applicable in much more general data regimes and is shown to overcome biases via simulation studies also in the practical case when only a smooth component is observed. Joint consideration of our study for highly degenerate SDEs and existing research provides a general `recipe' for the development of time-discretisation schemes to be used within statistical methods for general classes of hypo-elliptic SDEs."}, "https://arxiv.org/abs/2402.18612": {"title": "Understanding random forests and overfitting: a visualization and simulation study", "link": "https://arxiv.org/abs/2402.18612", "description": "arXiv:2402.18612v1 Announce Type: new \nAbstract: Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models trained with minimum node size 2 or 20 using ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a bigger peak, isolated events local peaks. 
In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models."}, "https://arxiv.org/abs/2402.18741": {"title": "Spectral Extraction of Unique Latent Variables", "link": "https://arxiv.org/abs/2402.18741", "description": "arXiv:2402.18741v1 Announce Type: new \nAbstract: Multimodal datasets contain observations generated by multiple types of sensors. Most works to date focus on uncovering latent structures in the data that appear in all modalities. However, important aspects of the data may appear in only one modality due to the differences between the sensors. Uncovering modality-specific attributes may provide insights into the sources of the variability of the data. For example, certain clusters may appear in the analysis of genetics but not in epigenetic markers. Another example is hyper-spectral satellite imaging, where various atmospheric and ground phenomena are detectable using different parts of the spectrum. In this paper, we address the problem of uncovering latent structures that are unique to a single modality. Our approach is based on computing a graph representation of datasets from two modalities and analyzing the differences between their connectivity patterns. We provide an asymptotic analysis of the convergence of our approach based on a product manifold model. To evaluate the performance of our method, we test its ability to uncover latent structures in multiple types of artificial and real datasets."}, "https://arxiv.org/abs/2402.18745": {"title": "Degree-heterogeneous Latent Class Analysis for High-dimensional Discrete Data", "link": "https://arxiv.org/abs/2402.18745", "description": "arXiv:2402.18745v1 Announce Type: new \nAbstract: The latent class model is a widely used mixture model for multivariate discrete data. Besides the existence of qualitatively heterogeneous latent classes, real data often exhibit additional quantitative heterogeneity nested within each latent class. The modern latent class analysis also faces extra challenges, including the high-dimensionality, sparsity, and heteroskedastic noise inherent in discrete data. Motivated by these phenomena, we introduce the Degree-heterogeneous Latent Class Model and propose a spectral approach to clustering and statistical inference in the challenging high-dimensional sparse data regime. We propose an easy-to-implement HeteroClustering algorithm. It uses heteroskedastic PCA with L2 normalization to remove degree effects and perform clustering in the top singular subspace of the data matrix. We establish an exponential error rate for HeteroClustering, leading to exact clustering under minimal signal-to-noise conditions. 
We further investigate the estimation and inference of the high-dimensional continuous item parameters in the model, which are crucial to interpreting and finding useful markers for latent classes. We provide comprehensive procedures for global testing and multiple testing of these parameters with valid error controls. The superior performance of our methods is demonstrated through extensive simulations and applications to three diverse real-world datasets from political voting records, genetic variations, and single-cell sequencing."}, "https://arxiv.org/abs/2402.18748": {"title": "Fast Bootstrapping Nonparametric Maximum Likelihood for Latent Mixture Models", "link": "https://arxiv.org/abs/2402.18748", "description": "arXiv:2402.18748v1 Announce Type: new \nAbstract: Estimating the mixing density of a latent mixture model is an important task in signal processing. Nonparametric maximum likelihood estimation is one popular approach to this problem. If the latent variable distribution is assumed to be continuous, then bootstrapping can be used to approximate it. However, traditional bootstrapping requires repeated evaluations on resampled data and is not scalable. In this letter, we construct a generative process to rapidly produce nonparametric maximum likelihood bootstrap estimates. Our method requires only a single evaluation of a novel two-stage optimization algorithm. Simulations and real data analyses demonstrate that our procedure accurately estimates the mixing density with little computational cost even when there are a hundred thousand observations."}, "https://arxiv.org/abs/2402.18900": {"title": "Prognostic Covariate Adjustment for Logistic Regression in Randomized Controlled Trials", "link": "https://arxiv.org/abs/2402.18900", "description": "arXiv:2402.18900v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) with binary primary endpoints introduce novel challenges for inferring the causal effects of treatments. The most significant challenge is non-collapsibility, in which the conditional odds ratio estimand under covariate adjustment differs from the unconditional estimand in the logistic regression analysis of RCT data. This issue gives rise to apparent paradoxes, such as the variance of the estimator for the conditional odds ratio from a covariate-adjusted model being greater than the variance of the estimator from the unadjusted model. We address this challenge in the context of adjustment based on predictions of control outcomes from generative artificial intelligence (AI) algorithms, which are referred to as prognostic scores. We demonstrate that prognostic score adjustment in logistic regression increases the power of the Wald test for the conditional odds ratio under a fixed sample size, or alternatively reduces the necessary sample size to achieve a desired power, compared to the unadjusted analysis. We derive formulae for prospective calculations of the power gain and sample size reduction that can result from adjustment for the prognostic score. Furthermore, we utilize g-computation to expand the scope of prognostic score adjustment to inferences on the marginal risk difference, relative risk, and odds ratio estimands. We demonstrate the validity of our formulae via extensive simulation studies that encompass different types of logistic regression model specifications. 
Our simulation studies also indicate how prognostic score adjustment can reduce the variance of g-computation estimators for the marginal estimands while maintaining frequentist properties such as asymptotic unbiasedness and Type I error rate control. Our methodology can ultimately enable more definitive and conclusive analyses for RCTs with binary primary endpoints."}, "https://arxiv.org/abs/2402.18904": {"title": "False Discovery Rate Control for Confounder Selection Using Mirror Statistics", "link": "https://arxiv.org/abs/2402.18904", "description": "arXiv:2402.18904v1 Announce Type: new \nAbstract: While data-driven confounder selection requires careful consideration, it is frequently employed in observational studies to adjust for confounding factors. Widely recognized criteria for confounder selection include the minimal set approach, which involves selecting variables relevant to both treatment and outcome, and the union set approach, which involves selecting variables for either treatment or outcome. These approaches are often implemented using heuristics and off-the-shelf statistical methods, where the degree of uncertainty may not be clear. In this paper, we focus on the false discovery rate (FDR) to measure uncertainty in confounder selection. We define the FDR specific to confounder selection and propose methods based on the mirror statistic, a recently developed approach for FDR control that does not rely on p-values. The proposed methods are free from p-values and require only the assumption of some symmetry in the distribution of the mirror statistic. It can be easily combined with sparse estimation and other methods that involve difficulties in deriving p-values. The properties of the proposed method are investigated by exhaustive numerical experiments. Particularly in high-dimensional data scenarios, our method outperforms conventional methods."}, "https://arxiv.org/abs/2402.19021": {"title": "Enhancing the Power of Gaussian Graphical Model Inference by Modeling the Graph Structure", "link": "https://arxiv.org/abs/2402.19021", "description": "arXiv:2402.19021v1 Announce Type: new \nAbstract: For the problem of inferring a Gaussian graphical model (GGM), this work explores the application of a recent approach from the multiple testing literature for graph inference. The main idea of the method by Rebafka et al. (2022) is to model the data by a latent variable model, the so-called noisy stochastic block model (NSBM), and then use the associated ${\\ell}$-values to infer the graph. The inferred graph controls the false discovery rate, that means that the proportion of falsely declared edges does not exceed a user-defined nominal level. Here it is shown that any test statistic from the GGM literature can be used as input for the NSBM approach to perform GGM inference. To make the approach feasible in practice, a new, computationally efficient inference algorithm for the NSBM is developed relying on a greedy approach to maximize the integrated complete-data likelihood. Then an extensive numerical study illustrates that the NSBM approach outperforms the state of the art for any of the here considered GGM-test statistics. 
In particular in sparse settings and on real datasets a significant gain in power is observed."}, "https://arxiv.org/abs/2402.19029": {"title": "Essential Properties of Type III* Methods", "link": "https://arxiv.org/abs/2402.19029", "description": "arXiv:2402.19029v1 Announce Type: new \nAbstract: Type III methods, introduced by SAS in 1976, formulate estimable functions that substitute, somehow, for classical ANOVA effects in multiple linear regression models. They have been controversial since, provoking wide use and satisfied users on the one hand and skepticism and scorn on the other. Their essential mathematical properties have not been established, although they are widely thought to be known: what those functions are, to what extent they coincide with classical ANOVA effects, and how they are affected by cell sample sizes, empty cells, and covariates. Those properties are established here."}, "https://arxiv.org/abs/2402.19046": {"title": "On the Improvement of Predictive Modeling Using Bayesian Stacking and Posterior Predictive Checking", "link": "https://arxiv.org/abs/2402.19046", "description": "arXiv:2402.19046v1 Announce Type: new \nAbstract: Model uncertainty is pervasive in real world analysis situations and is an often-neglected issue in applied statistics. However, standard approaches to the research process do not address the inherent uncertainty in model building and, thus, can lead to overconfident and misleading analysis interpretations. One strategy to incorporate more flexible models is to base inferences on predictive modeling. This approach provides an alternative to existing explanatory models, as inference is focused on the posterior predictive distribution of the response variable. Predictive modeling can advance explanatory ambitions in the social sciences and in addition enrich the understanding of social phenomena under investigation. Bayesian stacking is a methodological approach rooted in Bayesian predictive modeling. In this paper, we outline the method of Bayesian stacking but add to it the approach of posterior predictive checking (PPC) as a means of assessing the predictive quality of those elements of the stacking ensemble that are important to the research question. Thus, we introduce a viable workflow for incorporating PPC into predictive modeling using Bayesian stacking without presuming the existence of a true model. We apply these tools to the PISA 2018 data to investigate potential inequalities in reading competency with respect to gender and socio-economic background. Our empirical example serves as rough guideline for practitioners who want to implement the concepts of predictive modeling and model uncertainty in their work to similar research questions."}, "https://arxiv.org/abs/2402.19109": {"title": "Confidence and Assurance of Percentiles", "link": "https://arxiv.org/abs/2402.19109", "description": "arXiv:2402.19109v1 Announce Type: new \nAbstract: Confidence interval of mean is often used when quoting statistics. The same rigor is often missing when quoting percentiles and tolerance or percentile intervals. This article derives the expression for confidence in percentiles of a sample population. Confidence intervals of median is compared to those of mean for a few sample distributions. The concept of assurance from reliability engineering is then extended to percentiles. The assurance level of sorted samples simply matches the confidence and percentile levels. 
Numerical method to compute assurance using Brent's optimization method is provided as an open-source python package."}, "https://arxiv.org/abs/2402.19346": {"title": "Recanting witness and natural direct effects: Violations of assumptions or definitions?", "link": "https://arxiv.org/abs/2402.19346", "description": "arXiv:2402.19346v1 Announce Type: new \nAbstract: There have been numerous publications on the advantages and disadvantages of estimating natural (pure) effects compared to controlled effects. One of the main criticisms of natural effects is that it requires an additional assumption for identifiability, namely that the exposure does not cause a confounder of the mediator-outcome relationship. However, every analysis in every study should begin with a research question expressed in ordinary language. Researchers then develop/use mathematical expressions or estimators to best answer these ordinary language questions. When a recanting witness is present, the paper illustrates that there are no violations of assumptions. Rather, using directed acyclic graphs, the typical estimators for natural effects are simply no longer answering any meaningful question. Although some might view this as semantics, the proposed approach illustrates why the more recent methods of path-specific effects and separable effects are more valid and transparent compared to previous methods for decomposition analysis."}, "https://arxiv.org/abs/2402.19425": {"title": "Testing Information Ordering for Strategic Agents", "link": "https://arxiv.org/abs/2402.19425", "description": "arXiv:2402.19425v1 Announce Type: new \nAbstract: A key primitive of a strategic environment is the information available to players. Specifying a priori an information structure is often difficult for empirical researchers. We develop a test of information ordering that allows researchers to examine if the true information structure is at least as informative as a proposed baseline. We construct a computationally tractable test statistic by utilizing the notion of Bayes Correlated Equilibrium (BCE) to translate the ordering of information structures into an ordering of functions. We apply our test to examine whether hubs provide informational advantages to certain airlines in addition to market power."}, "https://arxiv.org/abs/2402.18651": {"title": "Quantifying Human Priors over Social and Navigation Networks", "link": "https://arxiv.org/abs/2402.18651", "description": "arXiv:2402.18651v1 Announce Type: cross \nAbstract: Human knowledge is largely implicit and relational -- do we have a friend in common? can I walk from here to there? In this work, we leverage the combinatorial structure of graphs to quantify human priors over such relational data. Our experiments focus on two domains that have been continuously relevant over evolutionary timescales: social interaction and spatial navigation. We find that some features of the inferred priors are remarkably consistent, such as the tendency for sparsity as a function of graph size. Other features are domain-specific, such as the propensity for triadic closure in social interactions. 
More broadly, our work demonstrates how nonclassical statistical analysis of indirect behavioral experiments can be used to efficiently model latent biases in the data."}, "https://arxiv.org/abs/2402.18689": {"title": "The VOROS: Lifting ROC curves to 3D", "link": "https://arxiv.org/abs/2402.18689", "description": "arXiv:2402.18689v1 Announce Type: cross \nAbstract: The area under the ROC curve is a common measure that is often used to rank the relative performance of different binary classifiers. However, as has been also previously noted, it can be a measure that ill-captures the benefits of different classifiers when either the true class values or misclassification costs are highly unbalanced between the two classes. We introduce a third dimension to capture these costs, and lift the ROC curve to a ROC surface in a natural way. We study both this surface and introduce the VOROS, the volume over this ROC surface, as a 3D generalization of the 2D area under the ROC curve. For problems where there are only bounds on the expected costs or class imbalances, we restrict consideration to the volume of the appropriate subregion of the ROC surface. We show how the VOROS can better capture the costs of different classifiers on both a classical and a modern example dataset."}, "https://arxiv.org/abs/2402.18810": {"title": "The numeraire e-variable", "link": "https://arxiv.org/abs/2402.18810", "description": "arXiv:2402.18810v1 Announce Type: cross \nAbstract: We consider testing a composite null hypothesis $\\mathcal{P}$ against a point alternative $\\mathbb{Q}$. This paper establishes a powerful and general result: under no conditions whatsoever on $\\mathcal{P}$ or $\\mathbb{Q}$, we show that there exists a special e-variable $X^*$ that we call the numeraire. It is strictly positive and for every $\\mathbb{P} \\in \\mathcal{P}$, $\\mathbb{E}_\\mathbb{P}[X^*] \\le 1$ (the e-variable property), while for every other e-variable $X$, we have $\\mathbb{E}_\\mathbb{Q}[X/X^*] \\le 1$ (the numeraire property). In particular, this implies $\\mathbb{E}_\\mathbb{Q}[\\log(X/X^*)] \\le 0$ (log-optimality). $X^*$ also identifies a particular sub-probability measure $\\mathbb{P}^*$ via the density $d \\mathbb{P}^*/d \\mathbb{Q} = 1/X^*$. As a result, $X^*$ can be seen as a generalized likelihood ratio of $\\mathbb{Q}$ against $\\mathcal{P}$. We show that $\\mathbb{P}^*$ coincides with the reverse information projection (RIPr) when additional assumptions are made that are required for the latter to exist. Thus $\\mathbb{P}^*$ is a natural definition of the RIPr in the absence of any assumptions on $\\mathcal{P}$ or $\\mathbb{Q}$. In addition to the abstract theory, we provide several tools for finding the numeraire in concrete cases. We discuss several nonparametric examples where we can indeed identify the numeraire, despite not having a reference measure. We end with a more general optimality theory that goes beyond the ubiquitous logarithmic utility. We focus on certain power utilities, leading to reverse R\\'enyi projections in place of the RIPr, which also always exists."}, "https://arxiv.org/abs/2402.18910": {"title": "DIGIC: Domain Generalizable Imitation Learning by Causal Discovery", "link": "https://arxiv.org/abs/2402.18910", "description": "arXiv:2402.18910v1 Announce Type: cross \nAbstract: Causality has been combined with machine learning to produce robust representations for domain generalization. 
Most existing methods of this type require massive data from multiple domains to identify causal features by cross-domain variations, which can be expensive or even infeasible and may lead to misidentification in some cases. In this work, we make a different attempt by leveraging the demonstration data distribution to discover the causal features for a domain generalizable policy. We design a novel framework, called DIGIC, to identify the causal features by finding the direct cause of the expert action from the demonstration data distribution via causal discovery. Our framework can achieve domain generalizable imitation learning with only single-domain data and serve as a complement for cross-domain variation-based methods under non-structural assumptions on the underlying causal models. Our empirical study in various control tasks shows that the proposed framework evidently improves the domain generalization performance and has comparable performance to the expert in the original domain simultaneously."}, "https://arxiv.org/abs/2402.18921": {"title": "Semi-Supervised U-statistics", "link": "https://arxiv.org/abs/2402.18921", "description": "arXiv:2402.18921v1 Announce Type: cross \nAbstract: Semi-supervised datasets are ubiquitous across diverse domains where obtaining fully labeled data is costly or time-consuming. The prevalence of such datasets has consistently driven the demand for new tools and methods that exploit the potential of unlabeled data. Responding to this demand, we introduce semi-supervised U-statistics enhanced by the abundance of unlabeled data, and investigate their statistical properties. We show that the proposed approach is asymptotically Normal and exhibits notable efficiency gains over classical U-statistics by effectively integrating various powerful prediction tools into the framework. To understand the fundamental difficulty of the problem, we derive minimax lower bounds in semi-supervised settings and showcase that our procedure is semi-parametrically efficient under regularity conditions. Moreover, tailored to bivariate kernels, we propose a refined approach that outperforms the classical U-statistic across all degeneracy regimes, and demonstrate its optimality properties. Simulation studies are conducted to corroborate our findings and to further demonstrate our framework."}, "https://arxiv.org/abs/2402.19162": {"title": "A Bayesian approach to uncover spatio-temporal determinants of heterogeneity in repeated cross-sectional health surveys", "link": "https://arxiv.org/abs/2402.19162", "description": "arXiv:2402.19162v1 Announce Type: cross \nAbstract: In several countries, including Italy, a prominent approach to population health surveillance involves conducting repeated cross-sectional surveys at short intervals of time. These surveys gather information on the health status of individual respondents, including details on their behaviors, risk factors, and relevant socio-demographic information. While the collected data undoubtedly provides valuable information, modeling such data presents several challenges. For instance, in health risk models, it is essential to consider behavioral information, spatio-temporal dynamics, and disease co-occurrence. In response to these challenges, our work proposes a multivariate spatio-temporal logistic model for chronic disease diagnoses. 
Predictors are modeled using individual risk factor covariates and a latent individual propensity to the disease.\n Leveraging a state space formulation of the model, we construct a framework in which spatio-temporal heterogeneity in regression parameters is informed by exogenous spatial information, corresponding to different spatial contextual risk factors that may affect health and the occurrence of chronic diseases in different ways. To explore the utility and the effectiveness of our method, we analyze behavioral and risk factor surveillance data collected in Italy (PASSI), which is well-known as a country characterized by high peculiar administrative, social and territorial diversities reflected on high variability in morbidity among population subgroups."}, "https://arxiv.org/abs/2402.19214": {"title": "A Bayesian approach with Gaussian priors to the inverse problem of source identification in elliptic PDEs", "link": "https://arxiv.org/abs/2402.19214", "description": "arXiv:2402.19214v1 Announce Type: cross \nAbstract: We consider the statistical linear inverse problem of making inference on an unknown source function in an elliptic partial differential equation from noisy observations of its solution. We employ nonparametric Bayesian procedures based on Gaussian priors, leading to convenient conjugate formulae for posterior inference. We review recent results providing theoretical guarantees on the quality of the resulting posterior-based estimation and uncertainty quantification, and we discuss the application of the theory to the important classes of Gaussian series priors defined on the Dirichlet-Laplacian eigenbasis and Mat\\'ern process priors. We provide an implementation of posterior inference for both classes of priors, and investigate its performance in a numerical simulation study."}, "https://arxiv.org/abs/2402.19268": {"title": "Extremal quantiles of intermediate orders under two-way clustering", "link": "https://arxiv.org/abs/2402.19268", "description": "arXiv:2402.19268v1 Announce Type: cross \nAbstract: This paper investigates extremal quantiles under two-way cluster dependence. We demonstrate that the limiting distribution of the unconditional intermediate order quantiles in the tails converges to a Gaussian distribution. This is remarkable as two-way cluster dependence entails potential non-Gaussianity in general, but extremal quantiles do not suffer from this issue. Building upon this result, we extend our analysis to extremal quantile regressions of intermediate order."}, "https://arxiv.org/abs/2402.19399": {"title": "An Empirical Analysis of Scam Token on Ethereum Blockchain: Counterfeit tokens on Uniswap", "link": "https://arxiv.org/abs/2402.19399", "description": "arXiv:2402.19399v1 Announce Type: cross \nAbstract: This article presents an empirical investigation into the determinants of total revenue generated by counterfeit tokens on Uniswap. It offers a detailed overview of the counterfeit token fraud process, along with a systematic summary of characteristics associated with such fraudulent activities observed in Uniswap. The study primarily examines the relationship between revenue from counterfeit token scams and their defining characteristics, and analyzes the influence of market economic factors such as return on market capitalization and price return on Ethereum. 
Key findings include a significant increase in overall transactions of counterfeit tokens on their first day of fraud, and a rise in upfront fraud costs leading to corresponding increases in revenue. Furthermore, a negative correlation is identified between the total revenue of counterfeit tokens and the volatility of Ethereum market capitalization return, while price return volatility on Ethereum is found to have a positive impact on counterfeit token revenue, albeit requiring further investigation for a comprehensive understanding. Additionally, the number of subscribers for the real token correlates positively with the realized volume of scam tokens, indicating that a larger community following the legitimate token may inadvertently contribute to the visibility and success of counterfeit tokens. Conversely, the number of Telegram subscribers exhibits a negative impact on the realized volume of scam tokens, suggesting that a higher level of scrutiny or awareness within Telegram communities may act as a deterrent to fraudulent activities. Finally, the timing of when the scam token is introduced on the Ethereum blockchain may have a negative impact on its success. Notably, the cumulative amount scammed by only 42 counterfeit tokens amounted to almost 11214 Ether."}, "https://arxiv.org/abs/1701.07078": {"title": "Measurement-to-Track Association and Finite-Set Statistics", "link": "https://arxiv.org/abs/1701.07078", "description": "arXiv:1701.07078v2 Announce Type: replace \nAbstract: This is a shortened, clarified, and mathematically more rigorous version of the original arXiv version. Its first four findings remain unchanged from the original: 1) measurement-to-track associations (MTAs) in multitarget tracking (MTT) are heuristic and physically erroneous multitarget state models; 2) MTAs occur in the labeled random finite set (LRFS) approach only as purely mathematical abstractions that do not occur singly; 3) the labeled random finite set (LRFS) approach is not a mathematically obfuscated replication of multi-hypothesis tracking (MHT); and 4) the conventional interpretation of MHT is more consistent with classical than Bayesian statistics. This version goes beyond the original in including the following additional main finding: 5) a generalized, RFS-like interpretation results in a correct Bayesian formulation of MHT, based on MTA likelihood functions and MTA Markov transitions."}, "https://arxiv.org/abs/1805.10869": {"title": "Tilting Approximate Models", "link": "https://arxiv.org/abs/1805.10869", "description": "arXiv:1805.10869v4 Announce Type: replace \nAbstract: Model approximations are common practice when estimating structural or quasi-structural models. The paper considers the econometric properties of estimators that utilize projections to reimpose information about the exact model in the form of conditional moments. The resulting estimator efficiently combines the information provided by the approximate law of motion and the moment conditions. The paper develops the corresponding asymptotic theory and provides simulation evidence that tilting substantially reduces the mean squared error for parameter estimates. It applies the methodology to pricing long-run risks in aggregate consumption in the US, where the model is solved using the Campbell and Shiller (1988) approximation. 
Tilting improves empirical fit and results suggest that approximation error is a source of upward bias in estimates of risk aversion and downward bias in the elasticity of intertemporal substitution."}, "https://arxiv.org/abs/2204.07672": {"title": "Abadie's Kappa and Weighting Estimators of the Local Average Treatment Effect", "link": "https://arxiv.org/abs/2204.07672", "description": "arXiv:2204.07672v4 Announce Type: replace \nAbstract: Recent research has demonstrated the importance of flexibly controlling for covariates in instrumental variables estimation. In this paper we study the finite sample and asymptotic properties of various weighting estimators of the local average treatment effect (LATE), motivated by Abadie's (2003) kappa theorem and offering the requisite flexibility relative to standard practice. We argue that two of the estimators under consideration, which are weight normalized, are generally preferable. Several other estimators, which are unnormalized, do not satisfy the properties of scale invariance with respect to the natural logarithm and translation invariance, thereby exhibiting sensitivity to the units of measurement when estimating the LATE in logs and the centering of the outcome variable more generally. We also demonstrate that, when noncompliance is one sided, certain weighting estimators have the advantage of being based on a denominator that is strictly greater than zero by construction. This is the case for only one of the two normalized estimators, and we recommend this estimator for wider use. We illustrate our findings with a simulation study and three empirical applications, which clearly document the sensitivity of unnormalized estimators to how the outcome variable is coded. We implement the proposed estimators in the Stata package kappalate."}, "https://arxiv.org/abs/2212.06669": {"title": "A scale of interpretation for likelihood ratios and Bayes factors", "link": "https://arxiv.org/abs/2212.06669", "description": "arXiv:2212.06669v3 Announce Type: replace \nAbstract: Several subjective proposals have been made for interpreting the strength of evidence in likelihood ratios and Bayes factors. I identify a more objective scaling by modelling the effect of evidence on belief. The resulting scale with base 3.73 aligns with previous proposals and may partly explain intuitions."}, "https://arxiv.org/abs/2304.05805": {"title": "GDP nowcasting with artificial neural networks: How much does long-term memory matter?", "link": "https://arxiv.org/abs/2304.05805", "description": "arXiv:2304.05805v3 Announce Type: replace \nAbstract: We apply artificial neural networks (ANNs) to nowcast quarterly GDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare the nowcasting performance of five different ANN architectures: the multilayer perceptron (MLP), the one-dimensional convolutional neural network (1D CNN), the Elman recurrent neural network (RNN), the long short-term memory network (LSTM), and the gated recurrent unit (GRU). The empirical analysis presents results from two distinctively different evaluation periods. The first (2012:Q1 -- 2019:Q4) is characterized by balanced economic growth, while the second (2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According to our results, longer input sequences result in more accurate nowcasts in periods of balanced economic growth. However, this effect ceases above a relatively low threshold value of around six quarters (eighteen months). 
During periods of economic turbulence (e.g., during the COVID-19 recession), longer input sequences do not help the models' predictive performance; instead, they seem to weaken their generalization capability. Combined results from the two evaluation periods indicate that architectural features enabling long-term memory do not result in more accurate nowcasts. Comparing network architectures, the 1D CNN has proved to be a highly suitable model for GDP nowcasting. The network has shown good nowcasting performance among the competitors during the first evaluation period and achieved the overall best accuracy during the second evaluation period. Consequently, we are the first in the literature to propose the application of the 1D CNN for economic nowcasting."}, "https://arxiv.org/abs/2305.01849": {"title": "Semiparametric Discovery and Estimation of Interaction in Mixed Exposures using Stochastic Interventions", "link": "https://arxiv.org/abs/2305.01849", "description": "arXiv:2305.01849v2 Announce Type: replace \nAbstract: This study introduces a nonparametric definition of interaction and provides an approach to both interaction discovery and efficient estimation of this parameter. Using stochastic shift interventions and ensemble machine learning, our approach identifies and quantifies interaction effects through a model-independent target parameter, estimated via targeted maximum likelihood and cross-validation. This method contrasts the expected outcomes of joint interventions with those of individual interventions. Validation through simulation and application to the National Institute of Environmental Health Sciences Mixtures Workshop data demonstrate the efficacy of our method in detecting true interaction directions and its consistency in identifying significant impacts of furan exposure on leukocyte telomere length. Our method, called SuperNOVA, advances the ability to analyze multiexposure interactions within high-dimensional data, offering significant methodological improvements to understand complex exposure dynamics in health research. We provide peer-reviewed open-source software that employs our proposed methodology in the \\texttt{SuperNOVA} R package."}, "https://arxiv.org/abs/2305.04634": {"title": "Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods", "link": "https://arxiv.org/abs/2305.04634", "description": "arXiv:2305.04634v3 Announce Type: replace \nAbstract: In spatial statistics, fast and accurate parameter estimation, coupled with a reliable means of uncertainty quantification, can be challenging when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or wholly intractable. In this work, we propose using convolutional neural networks to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling, which improves the accuracy of the neural likelihood surfaces. 
To demonstrate our approach, we compare neural likelihood surfaces and the resulting maximum likelihood estimates and approximate confidence regions with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian process and a Brown-Resnick process which have computationally intensive and intractable likelihoods, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate. The method is applicable to any spatial process on a grid from which fast simulations are available."}, "https://arxiv.org/abs/2306.10405": {"title": "A semi-parametric estimation method for quantile coherence with an application to bivariate financial time series clustering", "link": "https://arxiv.org/abs/2306.10405", "description": "arXiv:2306.10405v3 Announce Type: replace \nAbstract: In multivariate time series analysis, spectral coherence measures the linear dependency between two time series at different frequencies. However, real data applications often exhibit nonlinear dependency in the frequency domain. Conventional coherence analysis fails to capture such dependency. The quantile coherence, on the other hand, characterizes nonlinear dependency by defining the coherence at a set of quantile levels based on trigonometric quantile regression. This paper introduces a new estimation technique for quantile coherence. The proposed method is semi-parametric, which uses the parametric form of the spectrum of a vector autoregressive (VAR) model to approximate the quantile coherence, combined with nonparametric smoothing across quantiles. At a given quantile level, we compute the quantile autocovariance function (QACF) by performing the Fourier inverse transform of the quantile periodograms. Subsequently, we utilize the multivariate Durbin-Levinson algorithm to estimate the VAR parameters and derive the estimate of the quantile coherence. Finally, we smooth the preliminary estimate of quantile coherence across quantiles using a nonparametric smoother. Numerical results show that the proposed estimation method outperforms nonparametric methods. We show that quantile coherence-based bivariate time series clustering has advantages over the ordinary VAR coherence. For applications, the identified clusters of financial stocks by quantile coherence with a market benchmark are shown to have an intriguing and more informative structure of diversified investment portfolios that may be used by investors to make better decisions."}, "https://arxiv.org/abs/2202.08370": {"title": "CAREER: A Foundation Model for Labor Sequence Data", "link": "https://arxiv.org/abs/2202.08370", "description": "arXiv:2202.08370v4 Announce Type: replace-cross \nAbstract: Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a foundation model for job sequences. 
CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it on small longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use."}, "https://arxiv.org/abs/2210.14484": {"title": "Imputation of missing values in multi-view data", "link": "https://arxiv.org/abs/2210.14484", "description": "arXiv:2210.14484v3 Announce Type: replace-cross \nAbstract: Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible."}, "https://arxiv.org/abs/2309.16598": {"title": "Cross-Prediction-Powered Inference", "link": "https://arxiv.org/abs/2309.16598", "description": "arXiv:2309.16598v3 Announce Type: replace-cross \nAbstract: While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. 
We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability."}, "https://arxiv.org/abs/2311.08168": {"title": "Time-Uniform Confidence Spheres for Means of Random Vectors", "link": "https://arxiv.org/abs/2311.08168", "description": "arXiv:2311.08168v2 Announce Type: replace-cross \nAbstract: We derive and study time-uniform confidence spheres -- confidence sphere sequences (CSSs) -- which contain the mean of random vectors with high probability simultaneously across all sample sizes. Inspired by the original work of Catoni and Giulini, we unify and extend their analysis to cover both the sequential setting and to handle a variety of distributional assumptions. Our results include an empirical-Bernstein CSS for bounded random vectors (resulting in a novel empirical-Bernstein confidence interval with asymptotic width scaling proportionally to the true unknown variance), CSSs for sub-$\\psi$ random vectors (which includes sub-gamma, sub-Poisson, and sub-exponential), and CSSs for heavy-tailed random vectors (two moments only). Finally, we provide two CSSs that are robust to contamination by Huber noise. The first is a robust version of our empirical-Bernstein CSS, and the second extends recent work in the univariate setting to heavy-tailed multivariate distributions."}, "https://arxiv.org/abs/2403.00080": {"title": "Spatio-temporal modeling for record-breaking temperature events in Spain", "link": "https://arxiv.org/abs/2403.00080", "description": "arXiv:2403.00080v1 Announce Type: new \nAbstract: Record-breaking temperature events are now very frequently in the news, viewed as evidence of climate change. With this as motivation, we undertake the first substantial spatial modeling investigation of temperature record-breaking across years for any given day within the year. We work with a dataset consisting of over sixty years (1960-2021) of daily maximum temperatures across peninsular Spain. Formal statistical analysis of record-breaking events is an area that has received attention primarily within the probability community, dominated by results for the stationary record-breaking setting with some additional work addressing trends. Such effort is inadequate for analyzing actual record-breaking data. Effective analysis requires rich modeling of the indicator events which define record-breaking sequences. Resulting from novel and detailed exploratory data analysis, we propose hierarchical conditional models for the indicator events. After suitable model selection, we discover explicit trend behavior, necessary autoregression, significance of distance to the coast, useful interactions, helpful spatial random effects, and very strong daily random effects. 
Illustratively, the model estimates that global warming trends have increased the number of records expected in the past decade almost two-fold, 1.93 (1.89,1.98), but also estimates highly differentiated climate warming rates in space and by season."}, "https://arxiv.org/abs/2403.00140": {"title": "Estimating the linear relation between variables that are never jointly observed: an application in in vivo experiments", "link": "https://arxiv.org/abs/2403.00140", "description": "arXiv:2403.00140v1 Announce Type: new \nAbstract: This work is motivated by in vivo experiments in which measurements are destructive so that the variables of interest can never be observed simultaneously when the aim is to estimate the regression coefficients of a linear regression. Assuming that the global experiment can be decomposed into sub-experiments (corresponding, for example, to different doses) with distinct first moments, we propose different estimators of the linear regression which take account of that additional information. We consider estimators based on moments as well as estimators based on optimal transport theory. These estimators are proved to be consistent as well as asymptotically Gaussian under weak hypotheses. The asymptotic variance has no explicit expression, except in some particular cases, and specific bootstrap approaches are developed to build confidence intervals for the estimated parameter. A Monte Carlo study is conducted to assess and compare the finite sample performances of the different approaches."}, "https://arxiv.org/abs/2403.00224": {"title": "Tobit models for count time series", "link": "https://arxiv.org/abs/2403.00224", "description": "arXiv:2403.00224v1 Announce Type: new \nAbstract: Several models for count time series have been developed during the last decades, often inspired by traditional autoregressive moving average (ARMA) models for real-valued time series, including integer-valued ARMA (INARMA) and integer-valued generalized autoregressive conditional heteroscedasticity (INGARCH) models. Both INARMA and INGARCH models exhibit an ARMA-like autocorrelation function (ACF). To achieve negative ACF values within the class of INGARCH models, log and softplus link functions are suggested in the literature, where the softplus approach leads to conditional linearity in good approximation. However, the softplus approach is limited to the INGARCH family for unbounded counts, i.e. it can neither be used for bounded counts, nor for count processes from the INARMA family. In this paper, we present an alternative solution, named the Tobit approach, for achieving approximate linearity together with negative ACF values, which is more generally applicable than the softplus approach. A Skellam--Tobit INGARCH model for unbounded counts is studied in detail, including stationarity, approximate computation of moments, maximum likelihood and censored least absolute deviations estimation for unknown parameters and corresponding simulations. Extensions of the Tobit approach to other situations are also discussed, including underlying discrete distributions, INAR models, and bounded counts. 
Three real-data examples are considered to illustrate the usefulness of the new approach."}, "https://arxiv.org/abs/2403.00237": {"title": "Stable Reduced-Rank VAR Identification", "link": "https://arxiv.org/abs/2403.00237", "description": "arXiv:2403.00237v1 Announce Type: new \nAbstract: The vector autoregression (VAR) has been widely used in system identification, econometrics, natural science, and many other areas. However, when the state dimension becomes large, the parameter dimension explodes. So rank-reduced modelling is attractive and is well developed. But a fundamental requirement in almost all applications is stability of the fitted model. And this has not been addressed in the rank-reduced case. Here, we develop, for the first time, a closed-form formula for an estimator of a rank-reduced transition matrix which is guaranteed to be stable. We show that our estimator is consistent and asymptotically statistically efficient and illustrate it in comparative simulations."}, "https://arxiv.org/abs/2403.00281": {"title": "Wavelet Based Periodic Autoregressive Moving Average Models", "link": "https://arxiv.org/abs/2403.00281", "description": "arXiv:2403.00281v1 Announce Type: new \nAbstract: This paper proposes a wavelet-based method for analysing periodic autoregressive moving average (PARMA) time series. Even though Fourier analysis provides an effective method for analysing periodic time series, it requires the estimation of a large number of Fourier parameters when the PARMA parameters do not vary smoothly. The wavelet-based analysis helps us to obtain a parsimonious model with a reduced number of parameters. We have illustrated this with simulated and actual data sets."}, "https://arxiv.org/abs/2403.00304": {"title": "Coherent forecasting of NoGeAR(1) model", "link": "https://arxiv.org/abs/2403.00304", "description": "arXiv:2403.00304v1 Announce Type: new \nAbstract: This article focuses on the coherent forecasting of the recently introduced novel geometric AR(1) (NoGeAR(1)) model - an INAR model based on the inflated-parameter binomial thinning approach. Various techniques are available to achieve h-step ahead coherent forecasts of count time series, like median and mode forecasting. However, the literature addressing coherent forecasting in the context of overdispersed count time series remains limited. Here, we study the forecasting distribution corresponding to the NoGeAR(1) process using the Monte Carlo (MC) approximation method. Accordingly, several forecasting measures are employed in the simulation study to facilitate a thorough comparison of the forecasting capability of NoGeAR(1) with other models. The methodology is also demonstrated using real-life data, specifically the data on CW{\\ss} TeXpert downloads and Barbados COVID-19 data."}, "https://arxiv.org/abs/2403.00347": {"title": "Set-Valued Control Functions", "link": "https://arxiv.org/abs/2403.00347", "description": "arXiv:2403.00347v1 Announce Type: new \nAbstract: The control function approach allows the researcher to identify various causal effects of interest. While powerful, it requires a strong invertibility assumption, which limits its applicability. This paper expands the scope of the nonparametric control function approach by allowing the control function to be set-valued and derives sharp bounds on structural parameters. 
The proposed generalization accommodates a wide range of selection processes involving discrete endogenous variables, random coefficients, treatment selections with interference, and dynamic treatment selections."}, "https://arxiv.org/abs/2403.00383": {"title": "The Mollified (Discrete) Uniform Distribution and its Applications", "link": "https://arxiv.org/abs/2403.00383", "description": "arXiv:2403.00383v1 Announce Type: new \nAbstract: The mollified uniform distribution, which constitutes a ``soft'' version of the continuous uniform distribution, is rediscovered. Important stochastic properties are derived and used to demonstrate potential fields of application. For example, it constitutes a model covering platykurtic, mesokurtic and leptokurtic shapes. Its cumulative distribution function may also serve as the soft-clipping response function for defining generalized linear models with approximately linear dependence. Furthermore, it might be considered for teaching, as an appealing example of the convolution of random variables. Finally, a discrete type of mollified uniform distribution is briefly discussed as well."}, "https://arxiv.org/abs/2403.00422": {"title": "Inference for Interval-Identified Parameters Selected from an Estimated Set", "link": "https://arxiv.org/abs/2403.00422", "description": "arXiv:2403.00422v1 Announce Type: new \nAbstract: Interval identification of parameters such as average treatment effects, average partial effects and welfare is particularly common when using observational data and experimental data with imperfect compliance due to the endogeneity of individuals' treatment uptake. In this setting, a treatment or policy will typically become an object of interest to the researcher when it is either selected from the estimated set of best-performers or arises from a data-dependent selection rule. In this paper, we develop new inference tools for interval-identified parameters chosen via these forms of selection. We develop three types of confidence intervals for data-dependent and interval-identified parameters, discuss how they apply to several examples of interest and prove their uniform asymptotic validity under weak assumptions."}, "https://arxiv.org/abs/2403.00429": {"title": "Population Power Curves in ASCA with Permutation Testing", "link": "https://arxiv.org/abs/2403.00429", "description": "arXiv:2403.00429v1 Announce Type: new \nAbstract: In this paper, we revisit the Power Curves in ANOVA Simultaneous Component Analysis (ASCA) based on permutation testing, and introduce the Population Curves derived from population parameters describing the relative effect among factors and interactions. We distinguish Relative from Absolute Population Curves, where the former represent statistical power in terms of the normalized effect size between structure and noise, and the latter in terms of the sample size. Relative Population Curves are useful to find the optimal ASCA model (e.g., fixed/random factors, crossed/nested relationships, interactions, the test statistic, transformations, etc.) for the analysis of an experimental design at hand. Absolute Population Curves are useful to determine the sample size and the optimal number of levels for each factor during the planning phase of an experiment. We illustrate both types of curves through simulation. 
We expect Population Curves to become the go-to approach to plan the optimal analysis pipeline and the required sample size in an omics study analyzed with ASCA."}, "https://arxiv.org/abs/2403.00458": {"title": "Prices and preferences in the electric vehicle market", "link": "https://arxiv.org/abs/2403.00458", "description": "arXiv:2403.00458v1 Announce Type: new \nAbstract: Although electric vehicles are less polluting than gasoline-powered vehicles, adoption is challenged by higher procurement prices. Existing discourse emphasizes EV battery costs as being principally responsible for this price differential, and widespread adoption is routinely conditioned upon battery costs declining. We scrutinize such reasoning by sourcing data on EV attributes and market conditions between 2011 and 2023. Our findings are fourfold. First, EV prices are influenced principally by the number of amenities, additional features, and dealer-installed accessories sold as standard on an EV, and to a lesser extent, by EV horsepower. Second, EV range is negatively correlated with EV price, implying that range anxiety concerns may be less consequential than existing discourse suggests. Third, battery capacity is positively correlated with EV price, due to more capacity being synonymous with the delivery of more horsepower. Collectively, this suggests that higher procurement prices for EVs reflect consumer preference for vehicles that are feature-dense and more powerful. Fourth and finally, accommodating these preferences has produced vehicles with lower fuel economy, a shift that reduces envisioned lifecycle emissions benefits by at least 3.26 percent, subject to the battery pack chemistry leveraged and the carbon intensity of the electrical grid. These findings warrant attention as decarbonization efforts increasingly emphasize electrification as a pathway for complying with domestic and international climate agreements."}, "https://arxiv.org/abs/2403.00508": {"title": "Changepoint problem with angular data using a measure of variation based on the intrinsic geometry of torus", "link": "https://arxiv.org/abs/2403.00508", "description": "arXiv:2403.00508v1 Announce Type: new \nAbstract: In many temporally ordered data sets, it is observed that the parameters of the underlying distribution change abruptly at unknown times. The detection of such changepoints is important for many applications. While this problem has been studied substantially in the linear data setup, not much work has been done for angular data. In this article, we utilize the intrinsic geometry of a torus to introduce the notion of the `square of an angle' and use it to propose a new measure of variation, called the `curved variance', of an angular random variable. Using the above ideas, we propose new tests for the existence of changepoint(s) in the concentration, mean direction, and/or both of these. The limiting distributions of the test statistics are derived and their powers are obtained using extensive simulation. It is seen that the tests have better power than the corresponding existing tests. The proposed methods have been implemented on three real-life data sets revealing interesting insights. 
In particular, our method when used to detect simultaneous changes in mean direction and concentration for hourly wind direction measurements of the cyclonic storm `Amphan' identified changepoints that could be associated with important meteorological events."}, "https://arxiv.org/abs/2403.00600": {"title": "Random Interval Distillation for Detecting Multiple Changes in General Dependent Data", "link": "https://arxiv.org/abs/2403.00600", "description": "arXiv:2403.00600v1 Announce Type: new \nAbstract: We propose a new and generic approach for detecting multiple change-points in general dependent data, termed random interval distillation (RID). By collecting random intervals with sufficient strength of signals and reassembling them into a sequence of informative short intervals, our new approach captures the shifts in signal characteristics across diverse dependent data forms including locally stationary high-dimensional time series and dynamic networks with Markov formation. We further propose a range of secondary refinements tailored to various data types to enhance the localization precision. Notably, for univariate time series and low-rank autoregressive networks, our methods achieve the minimax optimality as their independent counterparts. For practical applications, we introduce a clustering-based and data-driven procedure to determine the optimal threshold for signal strength, which is adaptable to a wide array of dependent data scenarios utilizing the connection between RID and clustering. Additionally, our method has been extended to identify kinks and changes in signals characterized by piecewise polynomial trends. We examine the effectiveness and usefulness of our methodology via extensive simulation studies and a real data example, implementing it in the R-package rid."}, "https://arxiv.org/abs/2403.00617": {"title": "Distortion in Correspondence Analysis and in Taxicab Correspondence Analysis: A Comparison", "link": "https://arxiv.org/abs/2403.00617", "description": "arXiv:2403.00617v1 Announce Type: new \nAbstract: Distortion is a fundamental well-studied topic in dimension reduction papers, and intimately related with the underlying intrinsic dimension of a mapping of a high dimensional data set onto a lower dimension. In this paper, we study embedding distortions produced by Correspondence Analysis and its robust l1 variant Taxicab Correspondence analysis, which are visualization methods for contingency tables. For high dimensional data, distortions in Correspondence Analysis are contractions; while distortions in Taxicab Correspondence Analysis could be contractions or stretchings. This shows that Euclidean geometry is quite rigid, because of the orthogonality property; while Taxicab geometry is quite flexible, because the orthogonality property is replaced by the conjugacy property."}, "https://arxiv.org/abs/2403.00639": {"title": "Hierarchical Bayesian Models to Mitigate Systematic Disparities in Prediction with Proxy Outcomes", "link": "https://arxiv.org/abs/2403.00639", "description": "arXiv:2403.00639v1 Announce Type: new \nAbstract: Label bias occurs when the outcome of interest is not directly observable and instead modeling is performed with proxy labels. When the difference between the true outcome and the proxy label is correlated with predictors, this can yield systematic disparities in predictions for different groups of interest. We propose Bayesian hierarchical measurement models to address these issues. 
Through practical examples, we demonstrate how our approach improves accuracy and helps with algorithmic fairness."}, "https://arxiv.org/abs/2403.00687": {"title": "Structurally Aware Robust Model Selection for Mixtures", "link": "https://arxiv.org/abs/2403.00687", "description": "arXiv:2403.00687v1 Announce Type: new \nAbstract: Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. Thus, as the number of observations increases, rather than obtaining better inferences, the opposite occurs: the data is explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. However, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice-versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations."}, "https://arxiv.org/abs/2403.00701": {"title": "Bayesian Model Averaging for Partial Ordering Continual Reassessment Methods", "link": "https://arxiv.org/abs/2403.00701", "description": "arXiv:2403.00701v1 Announce Type: new \nAbstract: Phase I clinical trials are essential to bringing novel therapies from chemical development to widespread use. Traditional approaches to dose-finding in Phase I trials, such as the '3+3' method and the Continual Reassessment Method (CRM), provide a principled approach for escalating across dose levels. However, these methods lack the ability to incorporate uncertainty regarding the dose-toxicity ordering as found in combination drug trials. Under this setting, dose-levels vary across multiple drugs simultaneously, leading to multiple possible dose-toxicity orderings. The Partial Ordering CRM (POCRM) extends to these settings by allowing for multiple dose-toxicity orderings. In this work, it is shown that the POCRM is vulnerable to 'estimation incoherency', whereby toxicity estimates shift in an illogical way, threatening patient safety and undermining clinician trust in dose-finding models. To this end, the Bayesian model averaged POCRM (BMA-POCRM) is proposed. BMA-POCRM uses Bayesian model averaging to take into account all possible orderings simultaneously, reducing the frequency of estimation incoherencies. The effectiveness of BMA-POCRM in drug combination settings is demonstrated through a specific instance of estimation incoherency of the POCRM and simulation studies. 
The results highlight the improved safety, accuracy and reduced occurrence of estimate incoherency in trials applying the BMA-POCRM relative to the POCRM model."}, "https://arxiv.org/abs/2403.00158": {"title": "Automated Efficient Estimation using Monte Carlo Efficient Influence Functions", "link": "https://arxiv.org/abs/2403.00158", "description": "arXiv:2403.00158v1 Announce Type: cross \nAbstract: Many practical problems involve estimating low dimensional statistical quantities with high-dimensional models and datasets. Several approaches address these estimation tasks based on the theory of influence functions, such as debiased/double ML or targeted minimum loss estimation. This paper introduces \\textit{Monte Carlo Efficient Influence Functions} (MC-EIF), a fully automated technique for approximating efficient influence functions that integrates seamlessly with existing differentiable probabilistic programming systems. MC-EIF automates efficient statistical estimation for a broad class of models and target functionals that would previously require rigorous custom analysis. We prove that MC-EIF is consistent, and that estimators using MC-EIF achieve optimal $\\sqrt{N}$ convergence rates. We show empirically that estimators using MC-EIF are at parity with estimators using analytic EIFs. Finally, we demonstrate a novel capstone example using MC-EIF for optimal portfolio selection."}, "https://arxiv.org/abs/2403.00694": {"title": "Defining Expertise: Applications to Treatment Effect Estimation", "link": "https://arxiv.org/abs/2403.00694", "description": "arXiv:2403.00694v1 Announce Type: cross \nAbstract: Decision-makers are often experts of their domain and take actions based on their domain knowledge. Doctors, for instance, may prescribe treatments by predicting the likely outcome of each available treatment. Actions of an expert thus naturally encode part of their domain knowledge, and can help make inferences within the same domain: Knowing doctors try to prescribe the best treatment for their patients, we can tell treatments prescribed more frequently are likely to be more effective. Yet in machine learning, the fact that most decision-makers are experts is often overlooked, and \"expertise\" is seldom leveraged as an inductive bias. This is especially true for the literature on treatment effect estimation, where often the only assumption made about actions is that of overlap. In this paper, we argue that expertise - particularly the type of expertise the decision-makers of a domain are likely to have - can be informative in designing and selecting methods for treatment effect estimation. We formally define two types of expertise, predictive and prognostic, and demonstrate empirically that: (i) the prominent type of expertise in a domain significantly influences the performance of different methods in treatment effect estimation, and (ii) it is possible to predict the type of expertise present in a dataset, which can provide a quantitative basis for model selection."}, "https://arxiv.org/abs/2403.00749": {"title": "Shrinkage estimators in zero-inflated Bell regression model with application", "link": "https://arxiv.org/abs/2403.00749", "description": "arXiv:2403.00749v1 Announce Type: cross \nAbstract: We propose Stein-type estimators for zero-inflated Bell regression models by incorporating information on model parameters. These estimators combine the advantages of unrestricted and restricted estimators. 
We derive the asymptotic distributional properties, including bias and mean squared error, for the proposed shrinkage estimators. Monte Carlo simulations demonstrate the superior performance of our shrinkage estimators across various scenarios. Furthermore, we apply the proposed estimators to analyze a real dataset, showcasing their practical utility."}, "https://arxiv.org/abs/2107.08112": {"title": "Hamiltonian Monte Carlo for Regression with High-Dimensional Categorical Data", "link": "https://arxiv.org/abs/2107.08112", "description": "arXiv:2107.08112v2 Announce Type: replace \nAbstract: Latent variable models are increasingly used in economics for high-dimensional categorical data like text and surveys. We demonstrate the effectiveness of Hamiltonian Monte Carlo (HMC) with parallelized automatic differentiation for analyzing such data in a computationally efficient and methodologically sound manner. Our new model, Supervised Topic Model with Covariates, shows that carefully modeling this type of data can have significant implications on conclusions compared to a simpler, frequently used, yet methodologically problematic, two-step approach. A simulation study and revisiting Bandiera et al. (2020)'s study of executive time use demonstrate these results. The approach accommodates thousands of parameters and doesn't require custom algorithms specific to each model, making it accessible for applied researchers"}, "https://arxiv.org/abs/2111.12258": {"title": "On Recoding Ordered Treatments as Binary Indicators", "link": "https://arxiv.org/abs/2111.12258", "description": "arXiv:2111.12258v4 Announce Type: replace \nAbstract: Researchers using instrumental variables to investigate ordered treatments often recode treatment into an indicator for any exposure. We investigate this estimand under the assumption that the instruments shift compliers from no treatment to some but not from some treatment to more. We show that when there are extensive margin compliers only (EMCO) this estimand captures a weighted average of treatment effects that can be partially unbundled into each complier group's potential outcome means. We also establish an equivalence between EMCO and a two-factor selection model and apply our results to study treatment heterogeneity in the Oregon Health Insurance Experiment."}, "https://arxiv.org/abs/2211.16364": {"title": "Disentangling the structure of ecological bipartite networks from observation processes", "link": "https://arxiv.org/abs/2211.16364", "description": "arXiv:2211.16364v2 Announce Type: replace \nAbstract: The structure of a bipartite interaction network can be described by providing a clustering for each of the two types of nodes. Such clusterings are outputted by fitting a Latent Block Model (LBM) on an observed network that comes from a sampling of species interactions in the field. However, the sampling is limited and possibly uneven. This may jeopardize the fit of the LBM and then the description of the structure of the network by detecting structures which result from the sampling and not from actual underlying ecological phenomena. If the observed interaction network consists of a weighted bipartite network where the number of observed interactions between two species is available, the sampling efforts for all species can be estimated and used to correct the LBM fit. We propose to combine an observation model that accounts for sampling and an LBM for describing the structure of underlying possible ecological interactions. 
We develop an original inference procedure for this model, the efficiency of which is demonstrated in simulation studies. The practical interest of our model in ecology is highlighted on a large plant-pollinator network dataset."}, "https://arxiv.org/abs/2212.07602": {"title": "Modifying Survival Models To Accommodate Thresholding Behavior", "link": "https://arxiv.org/abs/2212.07602", "description": "arXiv:2212.07602v2 Announce Type: replace \nAbstract: Survival models capture the relationship between an accumulating hazard and the occurrence of a singular event stimulated by that accumulation. When the model for the hazard is sufficiently flexible, survival models can accommodate a wide range of behaviors. If the hazard model is less flexible, for example when it is constrained by an external physical process, then the resulting survival model can be much too rigid. In this paper I introduce a modified survival model that generalizes the relationship between accumulating hazard and event occurrence with particular emphasis on capturing thresholding behavior. Finally, I demonstrate the utility of this approach on a physiological application."}, "https://arxiv.org/abs/2212.13294": {"title": "Multivariate Bayesian variable selection with application to multi-trait genetic fine mapping", "link": "https://arxiv.org/abs/2212.13294", "description": "arXiv:2212.13294v3 Announce Type: replace \nAbstract: Variable selection has played a critical role in modern statistical learning and scientific discoveries. Numerous regularization and Bayesian variable selection methods have been developed in the past two decades for variable selection, but most of these methods consider selecting variables for only one response. As more data is being collected nowadays, it is common to analyze multiple related responses from the same study. Existing multivariate variable selection methods select variables for all responses without considering the possible heterogeneity across different responses, i.e. some features may only predict a subset of responses but not the rest. Motivated by the multi-trait fine mapping problem in genetics to identify the causal variants for multiple related traits, we developed a novel multivariate Bayesian variable selection method to select critical predictors from a large number of grouped predictors that target multiple correlated and possibly heterogeneous responses. Our new method features selection at multiple levels, incorporates prior biological knowledge to guide selection, and identifies the best subset of responses that the predictors target. We showed the advantage of our method via extensive simulations and a real fine mapping example to identify causal variants associated with different subsets of addictive behaviors."}, "https://arxiv.org/abs/2302.07976": {"title": "Discovery of Critical Thresholds in Mixed Exposures and Estimation of Policy Intervention Effects using Targeted Learning", "link": "https://arxiv.org/abs/2302.07976", "description": "arXiv:2302.07976v2 Announce Type: replace \nAbstract: Traditional regulations of chemical exposure tend to focus on single exposures, overlooking the potential amplified toxicity due to multiple concurrent exposures. We are interested in understanding the average outcome if exposures were limited to fall under a multivariate threshold. Because threshold levels are often unknown \\textit{a priori}, we provide an algorithm that finds exposure threshold levels where the expected outcome is maximized or minimized. 
Because both identifying thresholds and estimating policy effects on the same data would lead to overfitting bias, we also provide a data-adaptive estimation framework, which allows for both threshold discovery and policy estimation. Simulation studies show asymptotic convergence to the optimal exposure region and to the true effect of an intervention. We demonstrate how our method identifies true interactions in a public synthetic mixture data set. Finally, we applied our method to NHANES data to discover metal exposures that have the most harmful effects on telomere length. We provide an implementation in the \\texttt{CVtreeMLE} R package."}, "https://arxiv.org/abs/2304.00059": {"title": "Resolving power: A general approach to compare the distinguishing ability of threshold-free evaluation metrics", "link": "https://arxiv.org/abs/2304.00059", "description": "arXiv:2304.00059v2 Announce Type: replace \nAbstract: Selecting an evaluation metric is fundamental to model development, but uncertainty remains about when certain metrics are preferable and why. This paper introduces the concept of resolving power to describe the ability of an evaluation metric to distinguish between binary classifiers of similar quality. This ability depends on two attributes: 1. The metric's response to improvements in classifier quality (its signal), and 2. The metric's sampling variability (its noise). The paper defines resolving power generically as a metric's sampling uncertainty scaled by its signal. The primary application of resolving power is to assess threshold-free evaluation metrics, such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation study compares the AUROC and the AUPRC in a variety of contexts. It finds that the AUROC generally has greater resolving power, but that the AUPRC is better when searching among high-quality classifiers applied to low prevalence outcomes. The paper concludes by proposing an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model."}, "https://arxiv.org/abs/2306.08559": {"title": "Inference in IV models with clustered dependence, many instruments and weak identification", "link": "https://arxiv.org/abs/2306.08559", "description": "arXiv:2306.08559v2 Announce Type: replace \nAbstract: Data clustering reduces the effective sample size from the number of observations towards the number of clusters. For instrumental variable models I show that this reduced effective sample size makes the instruments more likely to be weak, in the sense that they contain little information about the endogenous regressor, and many, in the sense that their number is large compared to the sample size. Clustered data therefore increases the need for many and weak instrument robust tests. However, none of the previously developed many and weak instrument robust tests can be applied to this type of data as they all require independent observations. I therefore adapt two types of such tests to clustered data. First, I derive cluster jackknife Anderson-Rubin and score tests by removing clusters rather than individual observations from the statistics. Second, I propose a cluster many instrument Anderson-Rubin test which improves on the first type of tests by using a more optimal, but more complex, weighting matrix. I show that if the clusters satisfy an invariance assumption the higher complexity poses no problems. 
By revisiting a study on the effect of queenly reign on war I show the empirical relevance of the new tests."}, "https://arxiv.org/abs/2306.10635": {"title": "Finite Population Survey Sampling: An Unapologetic Bayesian Perspective", "link": "https://arxiv.org/abs/2306.10635", "description": "arXiv:2306.10635v2 Announce Type: replace \nAbstract: This article attempts to offer some perspectives on Bayesian inference for finite population quantities when the units in the population are assumed to exhibit complex dependencies. Beginning with an overview of Bayesian hierarchical models, including some that yield design-based Horvitz-Thompson estimators, the article proceeds to introduce dependence in finite populations and sets out inferential frameworks for ignorable and nonignorable responses. Multivariate dependencies using graphical models and spatial processes are discussed and some salient features of two recent analyses for spatial finite populations are presented."}, "https://arxiv.org/abs/2307.05234": {"title": "CR-Lasso: Robust cellwise regularized sparse regression", "link": "https://arxiv.org/abs/2307.05234", "description": "arXiv:2307.05234v2 Announce Type: replace \nAbstract: Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may not be feasible nor efficient in dealing with such contaminated datasets. We propose CR-Lasso, a robust Lasso-type cellwise regularization procedure that performs feature selection in the presence of cellwise outliers by minimising a regression loss and cell deviation measure simultaneously. To evaluate the approach, we conduct empirical studies comparing its selection and prediction performance with several sparse regression methods. We show that CR-Lasso is competitive under the settings considered. We illustrate the effectiveness of the proposed method on real data through an analysis of a bone mineral density dataset."}, "https://arxiv.org/abs/2308.02918": {"title": "Spectral Ranking Inferences based on General Multiway Comparisons", "link": "https://arxiv.org/abs/2308.02918", "description": "arXiv:2308.02918v3 Announce Type: replace \nAbstract: This paper studies the performance of the spectral method in the estimation and uncertainty quantification of the unobserved preference scores of compared entities in a general and more realistic setup. Specifically, the comparison graph consists of hyper-edges of possible heterogeneous sizes, and the number of comparisons can be as low as one for a given hyper-edge. Such a setting is pervasive in real applications, circumventing the need to specify the graph randomness and the restrictive homogeneous sampling assumption imposed in the commonly used Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Furthermore, in scenarios where the BTL or PL models are appropriate, we unravel the relationship between the spectral estimator and the Maximum Likelihood Estimator (MLE). We discover that a two-step spectral method, where we apply the optimal weighting estimated from the equal weighting vanilla spectral method, can achieve the same asymptotic efficiency as the MLE. Given the asymptotic distributions of the estimated preference scores, we also introduce a comprehensive framework to carry out both one-sample and two-sample ranking inferences, applicable to both fixed and random graph settings. 
It is noteworthy that this is the first time effective two-sample rank testing methods have been proposed. Finally, we substantiate our findings via comprehensive numerical simulations and subsequently apply our developed methodologies to perform statistical inferences for statistical journals and movie rankings."}, "https://arxiv.org/abs/2310.17009": {"title": "Simulation-based stacking", "link": "https://arxiv.org/abs/2310.17009", "description": "arXiv:2310.17009v2 Announce Type: replace \nAbstract: Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a consistency guarantee, we present a general posterior stacking framework to make use of all available approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias of the posterior approximation at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task."}, "https://arxiv.org/abs/2310.17434": {"title": "The `Why' behind including `Y' in your imputation model", "link": "https://arxiv.org/abs/2310.17434", "description": "arXiv:2310.17434v2 Announce Type: replace \nAbstract: Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. Here, we investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not necessarily clear if this recommendation always holds and why this is sometimes true. We examine deterministic imputation (i.e., single imputation with fixed values) and stochastic imputation (i.e., single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This paper aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model."}, "https://arxiv.org/abs/2204.10375": {"title": "lpcde: Estimation and Inference for Local Polynomial Conditional Density Estimators", "link": "https://arxiv.org/abs/2204.10375", "description": "arXiv:2204.10375v2 Announce Type: replace-cross \nAbstract: This paper discusses the R package lpcde, which stands for local polynomial conditional density estimation. It implements the kernel-based local polynomial smoothing methods introduced in Cattaneo, Chandak, Jansson, and Ma (2024) for statistical estimation and inference of conditional distributions, densities, and derivatives thereof. 
The package offers mean square error optimal bandwidth selection and associated point estimators, as well as uncertainty quantification based on robust bias correction both pointwise (e.g., confidence intervals) and uniformly (e.g., confidence bands) over evaluation points. The methods implemented are boundary adaptive whenever the data is compactly supported. The package also implements regularized conditional density estimation methods, ensuring the resulting density estimate is non-negative and integrates to one. We contrast the functionalities of lpcde with existing R packages for conditional density estimation, and showcase its main features using simulated data."}, "https://arxiv.org/abs/2210.01282": {"title": "Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees", "link": "https://arxiv.org/abs/2210.01282", "description": "arXiv:2210.01282v3 Announce Type: replace-cross \nAbstract: We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator sublinearly. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks."}, "https://arxiv.org/abs/2304.09126": {"title": "An improved BISG for inferring race from surname and geolocation", "link": "https://arxiv.org/abs/2304.09126", "description": "arXiv:2304.09126v3 Announce Type: replace-cross \nAbstract: Bayesian Improved Surname Geocoding (BISG) is a ubiquitous tool for predicting race and ethnicity using an individual's geolocation and surname. Here we demonstrate that statistical dependence of surname and geolocation within racial/ethnic categories in the United States results in biases for minority subpopulations, and we introduce a raking-based improvement. Our method augments the data used by BISG--distributions of race by geolocation and race by surname--with the distribution of surname by geolocation obtained from state voter files. 
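For context on the BISG abstract above, here is a minimal sketch of the classic (unadjusted) BISG posterior, which combines P(race | surname) with P(geolocation | race) via Bayes' rule under a conditional-independence assumption; the failure of that assumption is exactly what the paper targets, and its raking-based augmentation with surname-by-geolocation tables from voter files is not shown. All probability values below are toy numbers.

```python
import numpy as np

# Classic BISG posterior sketch: P(race | surname, geo) is proportional to
# P(race | surname) * P(geo | race), assuming surname and geolocation are
# independent given race. The paper's raking improvement is not implemented.
races = ["white", "black", "hispanic", "asian", "other"]

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    unnormalized = p_race_given_surname * p_geo_given_race
    return unnormalized / unnormalized.sum()

p_race_given_surname = np.array([0.05, 0.02, 0.90, 0.02, 0.01])   # toy surname table
p_geo_given_race = np.array([0.002, 0.001, 0.004, 0.003, 0.002])  # toy census-tract table
post = bisg_posterior(p_race_given_surname, p_geo_given_race)
print(dict(zip(races, np.round(post, 3))))
```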
We validate our algorithm on state voter registration lists that contain self-identified race/ethnicity."}, "https://arxiv.org/abs/2305.14916": {"title": "Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting", "link": "https://arxiv.org/abs/2305.14916", "description": "arXiv:2305.14916v2 Announce Type: replace-cross \nAbstract: We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning."}, "https://arxiv.org/abs/2401.04190": {"title": "Is it possible to know cosmological fine-tuning?", "link": "https://arxiv.org/abs/2401.04190", "description": "arXiv:2401.04190v2 Announce Type: replace-cross \nAbstract: Fine-tuning studies whether some physical parameters, or relevant ratios between them, are located within so-called life-permitting intervals of small probability outside of which carbon-based life would not be possible. Recent developments have found estimates of these probabilities that circumvent previous concerns of measurability and selection bias. However, the question remains if fine-tuning can indeed be known. Using a mathematization of the epistemological concepts of learning and knowledge acquisition, we argue that most examples that have been touted as fine-tuned cannot be formally assessed as such. Nevertheless, fine-tuning can be known when the physical parameter is seen as a random variable and it is supported in the nonnegative real line, provided the size of the life-permitting interval is small in relation to the observed value of the parameter."}, "https://arxiv.org/abs/2403.00957": {"title": "Resolution of Simpson's paradox via the common cause principle", "link": "https://arxiv.org/abs/2403.00957", "description": "arXiv:2403.00957v1 Announce Type: new \nAbstract: Simpson's paradox is an obstacle to establishing a probabilistic association between two events $a_1$ and $a_2$, given the third (lurking) random variable $B$. We focus on scenarios when the random variables $A$ (which combines $a_1$, $a_2$, and their complements) and $B$ have a common cause $C$ that need not be observed. Alternatively, we can assume that $C$ screens out $A$ from $B$. For such cases, the correct association between $a_1$ and $a_2$ is to be defined via conditioning over $C$. This set-up generalizes the original Simpson's paradox. 
Now its two contradicting options simply refer to two particular and different causes $C$. We show that if $B$ and $C$ are binary and $A$ is quaternary (the minimal and the most widespread situation for valid Simpson's paradox), the conditioning over any binary common cause $C$ establishes the same direction of the association between $a_1$ and $a_2$ as the conditioning over $B$ in the original formulation of the paradox. Thus, for the minimal common cause, one should choose the option of Simpson's paradox that assumes conditioning over $B$ and not its marginalization. For tertiary (unobserved) common causes $C$ all three options of Simpson's paradox become possible (i.e. marginalized, conditional, and none of them), and one needs prior information on $C$ to choose the right option."}, "https://arxiv.org/abs/2403.00968": {"title": "The Bridged Posterior: Optimization, Profile Likelihood and a New Approach to Generalized Bayes", "link": "https://arxiv.org/abs/2403.00968", "description": "arXiv:2403.00968v1 Announce Type: new \nAbstract: Optimization is widely used in statistics, thanks to its efficiency for delivering point estimates on useful spaces, such as those satisfying low cardinality or combinatorial structure. To quantify uncertainty, Gibbs posterior exponentiates the negative loss function to form a posterior density. Nevertheless, Gibbs posteriors are supported in a high-dimensional space, and do not inherit the computational efficiency or constraint formulations from optimization. In this article, we explore a new generalized Bayes approach, viewing the likelihood as a function of data, parameters, and latent variables conditionally determined by an optimization sub-problem. Marginally, the latent variable given the data remains stochastic, and is characterized by its posterior distribution. This framework, coined ``bridged posterior'', conforms to the Bayesian paradigm. Besides providing a novel generative model, we obtain a positively surprising theoretical finding that under mild conditions, the $\\sqrt{n}$-adjusted posterior distribution of the parameters under our model converges to the same normal distribution as that of the canonical integrated posterior. Therefore, our result formally dispels a long-held belief that partial optimization of latent variables may lead to under-estimation of parameter uncertainty. We demonstrate the practical advantages of our approach under several settings, including maximum-margin classification, latent normal models, and harmonization of multiple networks."}, "https://arxiv.org/abs/2403.01000": {"title": "A linear mixed model approach for measurement error adjustment: applications to sedentary behavior assessment from wearable devices", "link": "https://arxiv.org/abs/2403.01000", "description": "arXiv:2403.01000v1 Announce Type: new \nAbstract: In recent years, wearable devices have become more common to capture a wide range of health behaviors, especially for physical activity and sedentary behavior. These sensor-based measures are deemed to be objective and thus less prone to self-reported biases, inherent in questionnaire assessments. While this is undoubtedly a major advantage, there can still be measurement errors from the device recordings, which pose serious challenges for conducting statistical analysis and obtaining unbiased risk estimates. There is a vast literature proposing statistical methods for adjusting for measurement errors in self-reported behaviors, such as in dietary intake. 
However, there is much less research on error correction for sensor-based device measures, especially sedentary behavior. In this paper, we address this gap. Exploiting the excessive multiple-day assessments typically collected when sensor devices are deployed, we propose a two-stage linear mixed effect model (LME) based approach to correct bias caused by measurement errors. We provide theoretical proof of the debiasing process using the Best Linear Unbiased Predictors (BLUP), and use both simulation and real data from a cohort study to demonstrate the performance of the proposed approach while comparing to the na\\\"ive plug-in approach that directly uses device measures without appropriately adjusting measurement errors. Our results indicate that employing our easy-to-implement BLUP correction method can greatly reduce biases in disease risk estimates and thus enhance the validity of study findings."}, "https://arxiv.org/abs/2403.01150": {"title": "Error Analysis of a Simple Quaternion Estimator: the Gaussian Case", "link": "https://arxiv.org/abs/2403.01150", "description": "arXiv:2403.01150v1 Announce Type: new \nAbstract: Reference [1] introduces a novel closed-form quaternion estimator from two vector observations. The simplicity of the estimator enables clear physical insights and a closed-form expression for the bias as a function of the quaternion error covariance matrix. The latter could be approximated up to second order with respect to the underlying measurement noise assuming arbitrary probability distribution. The current note relaxes the second-order assumption and provides an expression for the error covariance that is exact to the fourth order, under the assumption of Gaussian distribution. This not only provides increased accuracy but also alleviates issues related to singularity. This technical note presents a comprehensive derivation of the individual components of the quaternion additive error covariance matrix."}, "https://arxiv.org/abs/2403.01330": {"title": "Re-evaluating the impact of hormone replacement therapy on heart disease using match-adaptive randomization inference", "link": "https://arxiv.org/abs/2403.01330", "description": "arXiv:2403.01330v1 Announce Type: new \nAbstract: Matching is an appealing way to design observational studies because it mimics the data structure produced by stratified randomized trials, pairing treated individuals with similar controls. After matching, inference is often conducted using methods tailored for stratified randomized trials in which treatments are permuted within matched pairs. However, in observational studies, matched pairs are not predetermined before treatment; instead, they are constructed based on observed treatment status. This introduces a challenge as the permutation distributions used in standard inference methods do not account for the possibility that permuting treatments might lead to a different selection of matched pairs ($Z$-dependence). To address this issue, we propose a novel and computationally efficient algorithm that characterizes and enables sampling from the correct conditional distribution of treatment after an optimal propensity score matching, accounting for $Z$-dependence. 
We show how this new procedure, called match-adaptive randomization inference, corrects for an anticonservative result in a well-known observational study investigating the impact of hormone replacement therapy (HRT) on coronary heart disease and corroborates experimental findings about heterogeneous effects of HRT across different ages of initiation in women. Keywords: matching, causal inference, propensity score, permutation test, Type I error, graphs."}, "https://arxiv.org/abs/2403.01386": {"title": "Minimax-Regret Sample Selection in Randomized Experiments", "link": "https://arxiv.org/abs/2403.01386", "description": "arXiv:2403.01386v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are often run in settings with many subpopulations that may have differential benefits from the treatment being evaluated. We consider the problem of sample selection, i.e., whom to enroll in an RCT, so as to optimize welfare in a heterogeneous population. We formalize this problem within the minimax-regret framework, and derive optimal sample-selection schemes under a variety of conditions. We also highlight how different objectives and decisions can lead to notably different guidance regarding optimal sample allocation through a synthetic experiment leveraging historical COVID-19 trial data."}, "https://arxiv.org/abs/2403.01477": {"title": "Two-phase rejective sampling", "link": "https://arxiv.org/abs/2403.01477", "description": "arXiv:2403.01477v1 Announce Type: new \nAbstract: Rejective sampling improves design and estimation efficiency of single-phase sampling when auxiliary information in a finite population is available. When such auxiliary information is unavailable, we propose to use two-phase rejective sampling (TPRS), which involves measuring auxiliary variables for the sample of units in the first phase, followed by the implementation of rejective sampling for the outcome in the second phase. We explore the asymptotic design properties of double expansion and regression estimators under TPRS. We show that TPRS enhances the efficiency of the double expansion estimator, rendering it comparable to a regression estimator. We further refine the design to accommodate varying importance of covariates and extend it to multi-phase sampling."}, "https://arxiv.org/abs/2403.01585": {"title": "Calibrating doubly-robust estimators with unbalanced treatment assignment", "link": "https://arxiv.org/abs/2403.01585", "description": "arXiv:2403.01585v1 Announce Type: new \nAbstract: Machine learning methods, particularly the double machine learning (DML) estimator (Chernozhukov et al., 2018), are increasingly popular for the estimation of the average treatment effect (ATE). However, datasets often exhibit unbalanced treatment assignments where only a few observations are treated, leading to unstable propensity score estimations. We propose a simple extension of the DML estimator which undersamples data for propensity score modeling and calibrates scores to match the original distribution. The paper provides theoretical results showing that the estimator retains the DML estimator's asymptotic properties. 
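As a rough illustration of the undersampling-plus-recalibration idea in the calibrated-DML abstract above, the sketch below fits a propensity model on data in which untreated units are undersampled and maps the fitted scores back with the standard prior-correction formula. The paper's exact calibration step and the surrounding cross-fitted DML estimator are not reproduced, and the data-generating process is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: keep all treated units, undersample controls, fit the propensity
# model, then correct the scores for the induced shift in prevalence. This
# mirrors the undersampling + recalibration idea only; it is not the paper's
# estimator.
def undersampled_propensity(X, A, rate, rng):
    keep = (A == 1) | (rng.random(len(A)) < rate)    # keep all treated, a fraction of controls
    model = LogisticRegression(max_iter=1000).fit(X[keep], A[keep])
    p_s = model.predict_proba(X)[:, 1]
    # prior correction: odds in the undersampled data are inflated by 1/rate
    return rate * p_s / (rate * p_s + (1.0 - p_s))

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 3.0))))        # rare treatment
e_hat = undersampled_propensity(X, A, rate=A.mean() / (1 - A.mean()), rng=rng)
print(A.mean(), e_hat.mean())    # corrected scores roughly match the treated share
```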
A simulation study illustrates the finite sample performance of the estimator."}, "https://arxiv.org/abs/2403.01684": {"title": "Dendrogram of mixing measures: Learning latent hierarchy and model selection for finite mixture models", "link": "https://arxiv.org/abs/2403.01684", "description": "arXiv:2403.01684v1 Announce Type: new \nAbstract: We present a new way to summarize and select mixture models via the hierarchical clustering tree (dendrogram) of an overfitted latent mixing measure. Our proposed method bridges agglomerative hierarchical clustering and mixture modeling. The dendrogram's construction is derived from the theory of convergence of the mixing measures, and as a result, we can both consistently select the true number of mixing components and obtain the pointwise optimal convergence rate for parameter estimation from the tree, even when the model parameters are only weakly identifiable. In theory, it explicates the choice of the optimal number of clusters in hierarchical clustering. In practice, the dendrogram reveals more information on the hierarchy of subpopulations compared to traditional ways of summarizing mixture models. Several simulation studies are carried out to support our theory. We also illustrate the methodology with an application to single-cell RNA sequence analysis."}, "https://arxiv.org/abs/2403.01838": {"title": "Graphical n-sample tests of correspondence of distributions", "link": "https://arxiv.org/abs/2403.01838", "description": "arXiv:2403.01838v1 Announce Type: new \nAbstract: Classical tests are available for the two-sample test of correspondence of distribution functions. From these, the Kolmogorov-Smirnov test provides also the graphical interpretation of the test results, in different forms. Here, we propose modifications of the Kolmogorov-Smirnov test with higher power. The proposed tests are based on the so-called global envelope test which allows for graphical interpretation, similarly as the Kolmogorov-Smirnov test. The tests are based on rank statistics and are suitable also for the comparison of $n$ samples, with $n \\geq 2$. We compare the alternatives for the two-sample case through an extensive simulation study and discuss their interpretation. Finally, we apply the tests to real data. Specifically, we compare the height distributions between boys and girls at different ages, as well as sepal length distributions of different flower species using the proposed methodologies."}, "https://arxiv.org/abs/2403.01948": {"title": "On Fractional Moment Estimation from Polynomial Chaos Expansion", "link": "https://arxiv.org/abs/2403.01948", "description": "arXiv:2403.01948v1 Announce Type: new \nAbstract: Fractional statistical moments are utilized for various tasks of uncertainty quantification, including the estimation of probability distributions. However, an estimation of fractional statistical moments of costly mathematical models by statistical sampling is challenging since it is typically not possible to create a large experimental design due to limitations in computing capacity. This paper presents a novel approach for the analytical estimation of fractional moments, directly from polynomial chaos expansions. Specifically, the first four statistical moments obtained from the deterministic PCE coefficients are used for an estimation of arbitrary fractional moments via H\\\"{o}lder's inequality. 
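The key fact behind bounding fractional moments by a few integer moments is the log-convexity of absolute moments, a direct consequence of Hölder's inequality; the paper's precise construction from the PCE coefficients may differ, but a bound of this form is what makes the interpolation possible:

```latex
% Moment interpolation from H\"older's inequality:
% for 0 < p < r < q and \theta = (q - r)/(q - p) \in (0,1),
\mathbb{E}|X|^{r}
  = \mathbb{E}\bigl[\,|X|^{\theta p}\,|X|^{(1-\theta)q}\,\bigr]
  \le \bigl(\mathbb{E}|X|^{p}\bigr)^{\theta}\,\bigl(\mathbb{E}|X|^{q}\bigr)^{1-\theta}.
```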
The proposed approach is utilized for an estimation of statistical moments and probability distributions in three numerical examples of increasing complexity. Obtained results show that the proposed approach achieves a superior performance in estimating the distribution of the response, in comparison to a standard Latin hypercube sampling in the presented examples."}, "https://arxiv.org/abs/2403.02058": {"title": "Utility-based optimization of Fujikawa's basket trial design - Pre-specified protocol of a comparison study", "link": "https://arxiv.org/abs/2403.02058", "description": "arXiv:2403.02058v1 Announce Type: new \nAbstract: Basket trial designs are a type of master protocol in which the same therapy is tested in several strata of the patient cohort. Many basket trial designs implement borrowing mechanisms. These allow sharing information between similar strata with the goal of increasing power in responsive strata while at the same time constraining type-I error inflation to a bearable threshold. These borrowing mechanisms can be tuned using numerical tuning parameters. The optimal choice of these tuning parameters is subject to research. In a comparison study using simulations and numerical calculations, we are planning to investigate the use of utility functions for quantifying the compromise between power and type-I error inflation and the use of numerical optimization algorithms for optimizing these functions. The present document is the protocol of this comparison study, defining each step of the study in accordance with the ADEMP scheme for pre-specification of simulation studies."}, "https://arxiv.org/abs/2403.02060": {"title": "Expectile Periodograms", "link": "https://arxiv.org/abs/2403.02060", "description": "arXiv:2403.02060v1 Announce Type: new \nAbstract: In this paper, we introduce a periodogram-like function, called expectile periodograms, for detecting and estimating hidden periodicity from observations with asymmetrically distributed noise. The expectile periodograms are constructed from trigonometric expectile regression where a specially designed objective function is used to substitute the squared $l_2$ norm that leads to the ordinary periodograms. The expectile periodograms have properties which are analogous to quantile periodograms, which provide a broader view of the time series by examining different expectile levels, but are much faster to calculate. The asymptotic properties are discussed and simulations show its efficiency and robustness in the presence of hidden periodicities with asymmetric or heavy-tailed noise. Finally, we leverage the inherent two-dimensional characteristics of the expectile periodograms and train a deep-learning (DL) model to classify the earthquake waveform data. Remarkably, our approach achieves heightened classification testing accuracy when juxtaposed with alternative periodogram-based methodologies."}, "https://arxiv.org/abs/2403.02144": {"title": "Improved Tests for Mediation", "link": "https://arxiv.org/abs/2403.02144", "description": "arXiv:2403.02144v1 Announce Type: new \nAbstract: Testing for a mediation effect is important in many disciplines, but is made difficult - even asymptotically - by the influence of nuisance parameters. Classical tests such as likelihood ratio (LR) and Wald (Sobel) tests have very poor power properties in parts of the parameter space, and many attempts have been made to produce improved tests, with limited success. 
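For reference, the classical Sobel (Wald) statistic mentioned in the mediation abstract above tests the product of the exposure-to-mediator coefficient a and the mediator-to-outcome coefficient b using a first-order delta-method standard error; its poor behaviour when a and b are both near zero is the problem the paper addresses (the paper's augmented-LR critical regions are not reproduced here):

```latex
% Classical Sobel test for the mediated effect a b:
z_{\mathrm{Sobel}}
  = \frac{\hat{a}\,\hat{b}}
         {\sqrt{\hat{a}^{2}\,\widehat{\mathrm{se}}(\hat{b})^{2}
               + \hat{b}^{2}\,\widehat{\mathrm{se}}(\hat{a})^{2}}}.
```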
In this paper we show that augmenting the critical region of the LR test can produce a test with much improved behavior everywhere. In fact, we first show that there exists a test of this type that is (asymptotically) exact for certain test levels $\\alpha $, including the common choices $\\alpha =.01,.05,.10.$ The critical region of this exact test has some undesirable properties. We go on to show that there is a very simple class of augmented LR critical regions which provides tests that are nearly exact, and avoid the issues inherent in the exact test. We suggest an optimal and coherent member of this class, provide the table needed to implement the test and to report p-values if desired. Simulation confirms validity with non-Gaussian disturbances, under heteroskedasticity, and in a nonlinear (logit) model. A short application of the method to an entrepreneurial attitudes study is included for illustration."}, "https://arxiv.org/abs/2403.02154": {"title": "Double trouble: Predicting new variant counts across two heterogeneous populations", "link": "https://arxiv.org/abs/2403.02154", "description": "arXiv:2403.02154v1 Announce Type: new \nAbstract: Collecting genomics data across multiple heterogeneous populations (e.g., across different cancer types) has the potential to improve our understanding of disease. Despite sequencing advances, though, resources often remain a constraint when gathering data. So it would be useful for experimental design if experimenters with access to a pilot study could predict the number of new variants they might expect to find in a follow-up study: both the number of new variants shared between the populations and the total across the populations. While many authors have developed prediction methods for the single-population case, we show that these predictions can fare poorly across multiple populations that are heterogeneous. We prove that, surprisingly, a natural extension of a state-of-the-art single-population predictor to multiple populations fails for fundamental reasons. We provide the first predictor for the number of new shared variants and new total variants that can handle heterogeneity in multiple populations. We show that our proposed method works well empirically using real cancer and population genetics data."}, "https://arxiv.org/abs/2403.02194": {"title": "Boosting Distributional Copula Regression for Bivariate Binary, Discrete and Mixed Responses", "link": "https://arxiv.org/abs/2403.02194", "description": "arXiv:2403.02194v1 Announce Type: new \nAbstract: Motivated by challenges in the analysis of biomedical data and observational studies, we develop statistical boosting for the general class of bivariate distributional copula regression with arbitrary marginal distributions, which is suited to model binary, count, continuous or mixed outcomes. In our framework, the joint distribution of arbitrary, bivariate responses is modelled through a parametric copula. To arrive at a model for the entire conditional distribution, not only the marginal distribution parameters but also the copula parameters are related to covariates through additive predictors. We suggest efficient and scalable estimation by means of an adapted component-wise gradient boosting algorithm with statistical models as base-learners. A key benefit of boosting as opposed to classical likelihood or Bayesian estimation is the implicit data-driven variable selection mechanism as well as shrinkage without additional input or assumptions from the analyst. 
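To make the implicit variable selection of component-wise gradient boosting concrete, here is a minimal L2-boosting sketch with univariate linear base-learners: at each step every covariate is fit to the current residuals and only the best-fitting one is updated, and the small step size yields shrinkage. This is a generic toy version, not the distributional copula boosting of the paper or its gamboostLSS implementation.

```python
import numpy as np

# Component-wise L2 gradient boosting with simple linear base-learners.
def componentwise_boost(X, y, steps=200, nu=0.1):
    n, p = X.shape
    coef = np.zeros(p)
    offset = y.mean()
    resid = y - offset
    for _ in range(steps):
        betas = X.T @ resid / (X ** 2).sum(axis=0)          # univariate fits to residuals
        rss = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))                             # best base-learner this iteration
        coef[j] += nu * betas[j]                            # small step => shrinkage
        resid = y - offset - X @ coef
    return offset, coef

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=500)
offset, coef = componentwise_boost(X, y)
print(np.nonzero(np.abs(coef) > 1e-8)[0])   # covariates ever selected; informative ones dominate
```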
To the best of our knowledge, our implementation is the only one that combines a wide range of covariate effects, marginal distributions, copula functions, and implicit data-driven variable selection. We showcase the versatility of our approach on data from genetic epidemiology, healthcare utilization and childhood undernutrition. Our developments are implemented in the R package gamboostLSS, fostering transparent and reproducible research."}, "https://arxiv.org/abs/2403.02245": {"title": "Dynamic programming principle in cost-efficient sequential design: application to switching measurements", "link": "https://arxiv.org/abs/2403.02245", "description": "arXiv:2403.02245v1 Announce Type: new \nAbstract: We study sequential cost-efficient design in a situation where each update of covariates involves a fixed time cost typically considerable compared to a single measurement time. The problem arises from parameter estimation in switching measurements on superconducting Josephson junctions which are components needed in quantum computers and other superconducting electronics. In switching measurements, a sequence of current pulses is applied to the junction and a binary voltage response is observed. The measurement requires a very low temperature that can be kept stable only for a relatively short time, and therefore it is essential to use an efficient design. We use the dynamic programming principle from the mathematical theory of optimal control to solve the optimal update times. Our simulations demonstrate the cost-efficiency compared to the previously used methods."}, "https://arxiv.org/abs/2403.00776": {"title": "A framework for understanding data science", "link": "https://arxiv.org/abs/2403.00776", "description": "arXiv:2403.00776v1 Announce Type: cross \nAbstract: The objective of this research is to provide a framework with which the data science community can understand, define, and develop data science as a field of inquiry. The framework is based on the classical reference framework (axiology, ontology, epistemology, methodology) used for 200 years to define knowledge discovery paradigms and disciplines in the humanities, sciences, algorithms, and now data science. I augmented it for automated problem-solving with (methods, technology, community). The resulting data science reference framework is used to define the data science knowledge discovery paradigm in terms of the philosophy of data science addressed in previous papers and the data science problem-solving paradigm, i.e., the data science method, and the data science problem-solving workflow, both addressed in this paper. The framework is a much called for unifying framework for data science as it contains the components required to define data science. For insights to better understand data science, this paper uses the framework to define the emerging, often enigmatic, data science problem-solving paradigm and workflow, and to compare them with their well-understood scientific counterparts, scientific problem-solving paradigm and workflow."}, "https://arxiv.org/abs/2403.00967": {"title": "Asymptotic expansion of the drift estimator for the fractional Ornstein-Uhlenbeck process", "link": "https://arxiv.org/abs/2403.00967", "description": "arXiv:2403.00967v1 Announce Type: cross \nAbstract: We present an asymptotic expansion formula of an estimator for the drift coefficient of the fractional Ornstein-Uhlenbeck process. 
As the main tool, we apply the general expansion scheme for Wiener functionals recently developed by the authors [26]. The central limit theorem in the principal part of the expansion has the classical scaling $T^{1/2}$. However, the asymptotic expansion formula is complex in that the order of the correction term becomes the classical $T^{-1/2}$ for $H \\in (1/2,5/8)$, but $T^{4H-3}$ for $H \\in [5/8, 3/4)$."}, "https://arxiv.org/abs/2403.01208": {"title": "The Science of Data Collection: Insights from Surveys can Improve Machine Learning Models", "link": "https://arxiv.org/abs/2403.01208", "description": "arXiv:2403.01208v1 Announce Type: cross \nAbstract: Whether future AI models make the world safer or less safe for humans rests in part on our ability to efficiently collect accurate data from people about what they want the models to do. However, collecting high quality data is difficult, and most AI/ML researchers are not trained in data collection methods. The growing emphasis on data-centric AI highlights the potential of data to enhance model performance. It also reveals an opportunity to gain insights from survey methodology, the science of collecting high-quality survey data.\n In this position paper, we summarize lessons from the survey methodology literature and discuss how they can improve the quality of training and feedback data, which in turn improve model performance. Based on the cognitive response process model, we formulate specific hypotheses about the aspects of label collection that may impact training data quality. We also suggest collaborative research ideas into how possible biases in data collection can be mitigated, making models more accurate and human-centric."}, "https://arxiv.org/abs/2403.01318": {"title": "High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media", "link": "https://arxiv.org/abs/2403.01318", "description": "arXiv:2403.01318v1 Announce Type: cross \nAbstract: Motivated by the empirical power law of the distributions of credits (e.g., the number of \"likes\") of viral posts in social media, we introduce the high-dimensional tail index regression and methods of estimation and inference for its parameters. We propose a regularized estimator, establish its consistency, and derive its convergence rate. To conduct inference, we propose to debias the regularized estimate, and establish the asymptotic normality of the debiased estimator. Simulation studies support our theory. These methods are applied to text analyses of viral posts in X (formerly Twitter) concerning LGBTQ+."}, "https://arxiv.org/abs/2403.01403": {"title": "Greedy selection of optimal location of sensors for uncertainty reduction in seismic moment tensor inversion", "link": "https://arxiv.org/abs/2403.01403", "description": "arXiv:2403.01403v1 Announce Type: cross \nAbstract: We address an optimal sensor placement problem through Bayesian experimental design for seismic full waveform inversion for the recovery of the associated moment tensor. The objective is that of optimally choosing the location of the sensors (stations) from which to collect the observed data. The Shannon expected information gain is used as the objective function to search for the optimal network of sensors. A closed form for such objective is available due to the linear structure of the forward problem, as well as the Gaussian modeling of the observational errors and prior distribution. 
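For a linear forward model with Gaussian noise and prior, the expected information gain of a candidate sensor network has the closed form 0.5 * log det(I + Sigma_prior G^T Sigma_noise^{-1} G), and it can be fed to a greedy selection loop of the kind described in the next sentence of the abstract. The sketch below is only a toy analogue: dimensions, noise level, and forward-operator rows are arbitrary stand-ins, not a seismic model.

```python
import numpy as np

# Closed-form expected information gain (EIG) for y = G m + noise with a
# Gaussian prior on m and isotropic Gaussian noise, plus a greedy loop that
# adds one sensor (one block of rows of G) at a time.
def eig(G_rows, prior_cov, noise_var):
    if len(G_rows) == 0:
        return 0.0
    G = np.vstack(G_rows)
    M = np.eye(prior_cov.shape[0]) + prior_cov @ G.T @ G / noise_var
    return 0.5 * np.linalg.slogdet(M)[1]

def greedy_sensors(candidates, prior_cov, noise_var, budget):
    chosen, rows = [], []
    for _ in range(budget):
        free = [c for c in range(len(candidates)) if c not in chosen]
        gains = [eig(rows + [candidates[c]], prior_cov, noise_var) for c in free]
        best = free[int(np.argmax(gains))]
        chosen.append(best)
        rows.append(candidates[best])
    return chosen

rng = np.random.default_rng(3)
prior_cov = np.eye(6)                                       # e.g. 6 moment-tensor components
candidates = [rng.normal(size=(3, 6)) for _ in range(20)]   # 3 data rows per hypothetical station
print(greedy_sensors(candidates, prior_cov, noise_var=0.5, budget=4))
```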
The resulting problem being inherently combinatorial, a greedy algorithm is deployed to sequentially select the sensor locations that form the best network for learning the moment tensor. Numerical results are presented and analyzed under several instances of the problem, including: use of full three-dimensional velocity-models, cases in which the earthquake-source location is unknown, as well as moment tensor inversion under model misspecification"}, "https://arxiv.org/abs/2403.01865": {"title": "Improving generalisation via anchor multivariate analysis", "link": "https://arxiv.org/abs/2403.01865", "description": "arXiv:2403.01865v1 Announce Type: cross \nAbstract: We introduce a causal regularisation extension to anchor regression (AR) for improved out-of-distribution (OOD) generalisation. We present anchor-compatible losses, aligning with the anchor framework to ensure robustness against distribution shifts. Various multivariate analysis (MVA) algorithms, such as (Orthonormalized) PLS, RRR, and MLR, fall within the anchor framework. We observe that simple regularisation enhances robustness in OOD settings. Estimators for selected algorithms are provided, showcasing consistency and efficacy in synthetic and real-world climate science problems. The empirical validation highlights the versatility of anchor regularisation, emphasizing its compatibility with MVA approaches and its role in enhancing replicability while guarding against distribution shifts. The extended AR framework advances causal inference methodologies, addressing the need for reliable OOD generalisation."}, "https://arxiv.org/abs/2107.08075": {"title": "Kpop: A kernel balancing approach for reducing specification assumptions in survey weighting", "link": "https://arxiv.org/abs/2107.08075", "description": "arXiv:2107.08075v2 Announce Type: replace \nAbstract: With the precipitous decline in response rates, researchers and pollsters have been left with highly non-representative samples, relying on constructed weights to make these samples representative of the desired target population. Though practitioners employ valuable expert knowledge to choose what variables, $X$ must be adjusted for, they rarely defend particular functional forms relating these variables to the response process or the outcome. Unfortunately, commonly-used calibration weights -- which make the weighted mean $X$ in the sample equal that of the population -- only ensure correct adjustment when the portion of the outcome and the response process left unexplained by linear functions of $X$ are independent. To alleviate this functional form dependency, we describe kernel balancing for population weighting (kpop). This approach replaces the design matrix $\\mathbf{X}$ with a kernel matrix, $\\mathbf{K}$ encoding high-order information about $\\mathbf{X}$. Weights are then found to make the weighted average row of $\\mathbf{K}$ among sampled units approximately equal that of the target population. This produces good calibration on a wide range of smooth functions of $X$, without relying on the user to decide which $X$ or what functions of them to include. We describe the method and illustrate it by application to polling data from the 2016 U.S. 
presidential election."}, "https://arxiv.org/abs/2202.10030": {"title": "Multivariate Tie-breaker Designs", "link": "https://arxiv.org/abs/2202.10030", "description": "arXiv:2202.10030v4 Announce Type: replace \nAbstract: In a tie-breaker design (TBD), subjects with high values of a running variable are given some (usually desirable) treatment, subjects with low values are not, and subjects in the middle are randomized. TBDs are intermediate between regression discontinuity designs (RDDs) and randomized controlled trials (RCTs) by allowing a tradeoff between the resource allocation efficiency of an RDD and the statistical efficiency of an RCT. We study a model where the expected response is one multivariate regression for treated subjects and another for control subjects. For given covariates, we show how to use convex optimization to choose treatment probabilities that optimize a D-optimality criterion. We can incorporate a variety of constraints motivated by economic and ethical considerations. In our model, D-optimality for the treatment effect coincides with D-optimality for the whole regression, and without economic constraints, an RCT is globally optimal. We show that a monotonicity constraint favoring more deserving subjects induces sparsity in the number of distinct treatment probabilities and this is different from preexisting sparsity results for constrained designs. We also study a prospective D-optimality, analogous to Bayesian optimal design, to understand design tradeoffs without reference to a specific data set. We apply the convex optimization solution to a semi-synthetic example involving triage data from the MIMIC-IV-ED database."}, "https://arxiv.org/abs/2206.09883": {"title": "Policy Learning under Endogeneity Using Instrumental Variables", "link": "https://arxiv.org/abs/2206.09883", "description": "arXiv:2206.09883v3 Announce Type: replace \nAbstract: This paper studies the identification and estimation of individualized intervention policies in observational data settings characterized by endogenous treatment selection and the availability of instrumental variables. We introduce encouragement rules that manipulate an instrument. Incorporating the marginal treatment effects (MTE) as policy invariant structural parameters, we establish the identification of the social welfare criterion for the optimal encouragement rule. Focusing on binary encouragement rules, we propose to estimate the optimal policy via the Empirical Welfare Maximization (EWM) method and derive convergence rates of the regret (welfare loss). We consider extensions to accommodate multiple instruments and budget constraints. Using data from the Indonesian Family Life Survey, we apply the EWM encouragement rule to advise on the optimal tuition subsidy assignment. Our framework offers interpretability regarding why a certain subpopulation is targeted."}, "https://arxiv.org/abs/2303.03520": {"title": "The Effect of Alcohol Consumption on Brain Ageing: A New Causal Inference Framework for Incomplete and Massive Phenomic Data", "link": "https://arxiv.org/abs/2303.03520", "description": "arXiv:2303.03520v2 Announce Type: replace \nAbstract: Although substance use, such as alcohol consumption, is known to be associated with cognitive decline during ageing, its direct influence on the central nervous system remains unclear. 
In this study, we aim to investigate the potential influence of alcohol intake frequency on accelerated brain ageing by estimating the mean potential brain-age gap (BAG) index, the difference between brain age and actual age, under different alcohol intake frequencies in a large UK Biobank (UKB) cohort with extensive phenomic data reflecting a comprehensive life-style profile. We face two major challenges: (1) a large number of phenomic variables as potential confounders and (2) a small proportion of participants with complete phenomic data. To address these challenges, we first develop a new ensemble learning framework to establish robust estimation of mean potential outcome in the presence of many confounders. We then construct a data integration step to borrow information from UKB participants with incomplete phenomic data to improve efficiency. Our analysis results reveal that daily intake or even a few times a week may have significant effects on accelerating brain ageing. Moreover, extensive numerical studies demonstrate the superiority of our method over competing methods, in terms of smaller estimation bias and variability."}, "https://arxiv.org/abs/2305.17615": {"title": "Estimating overidentified linear models with heteroskedasticity and outliers", "link": "https://arxiv.org/abs/2305.17615", "description": "arXiv:2305.17615v3 Announce Type: replace \nAbstract: Overidentified two-stage least square (TSLS) is commonly adopted by applied economists to address endogeneity. Though it potentially gives more efficient or informative estimate, overidentification comes with a cost. The bias of TSLS is severe when the number of instruments is large. Hence, Jackknife Instrumental Variable Estimator (JIVE) has been proposed to reduce bias of overidentified TSLS. A conventional heuristic rule to assess the performance of TSLS and JIVE is approximate bias. This paper formalizes this concept and applies the new definition of approximate bias to three classes of estimators that bridge between OLS, TSLS and a variant of JIVE, namely, JIVE1. Three new approximately unbiased estimators are proposed. They are called AUK, TSJI1 and UOJIVE. Interestingly, a previously proposed approximately unbiased estimator UIJIVE can be viewed as a special case of UOJIVE. While UIJIVE is approximately unbiased asymptotically, UOJIVE is approximately unbiased even in finite sample. Moreover, UOJIVE estimates parameters for both endogenous and control variables whereas UIJIVE only estimates the parameter of the endogenous variables. TSJI1 and UOJIVE are consistent and asymptotically normal under fixed number of instruments. They are also consistent under many-instrument asymptotics. This paper characterizes a series of moment existence conditions to establish all asymptotic results. In addition, the new estimators demonstrate good performances with simulated and empirical datasets."}, "https://arxiv.org/abs/2306.10976": {"title": "Empirical sandwich variance estimator for iterated conditional expectation g-computation", "link": "https://arxiv.org/abs/2306.10976", "description": "arXiv:2306.10976v2 Announce Type: replace \nAbstract: Iterated conditional expectation (ICE) g-computation is an estimation approach for addressing time-varying confounding for both longitudinal and time-to-event data. Unlike other g-computation implementations, ICE avoids the need to specify models for each time-varying covariate. For variance estimation, previous work has suggested the bootstrap. 
However, bootstrapping can be computationally intense and sensitive to the number of resamples used. Here, we present ICE g-computation as a set of stacked estimating equations. Therefore, the variance for the ICE g-computation estimator can be consistently estimated using the empirical sandwich variance estimator. Performance of the variance estimator was evaluated empirically with a simulation study. The proposed approach is also demonstrated with an illustrative example on the effect of cigarette smoking on the prevalence of hypertension. In the simulation study, the empirical sandwich variance estimator appropriately estimated the variance. When comparing runtimes between the sandwich variance estimator and the bootstrap for the applied example, the sandwich estimator was substantially faster, even when bootstraps were run in parallel. The empirical sandwich variance estimator is a viable option for variance estimation with ICE g-computation."}, "https://arxiv.org/abs/2309.03952": {"title": "The Causal Roadmap and Simulations to Improve the Rigor and Reproducibility of Real-Data Applications", "link": "https://arxiv.org/abs/2309.03952", "description": "arXiv:2309.03952v4 Announce Type: replace \nAbstract: The Causal Roadmap outlines a systematic approach to asking and answering questions of cause-and-effect: define quantity of interest, evaluate needed assumptions, conduct statistical estimation, and carefully interpret results. It is paramount that the algorithm for statistical estimation and inference be carefully pre-specified to optimize its expected performance for the specific real-data application. Simulations that realistically reflect the application, including key characteristics such as strong confounding and dependent or missing outcomes, can help us gain a better understanding of an estimator's applied performance. We illustrate this with two examples, using the Causal Roadmap and realistic simulations to inform estimator selection and full specification of the Statistical Analysis Plan. First, in an observational longitudinal study, outcome-blind simulations are used to inform nuisance parameter estimation and variance estimation for longitudinal targeted maximum likelihood estimation (TMLE). Second, in a cluster-randomized controlled trial with missing outcomes, treatment-blind simulations are used to ensure control for Type-I error in Two-Stage TMLE. In both examples, realistic simulations empower us to pre-specify an estimator that is expected to have strong finite sample performance and also yield quality-controlled computing code for the actual analysis. Together, this process helps to improve the rigor and reproducibility of our research."}, "https://arxiv.org/abs/2309.08982": {"title": "Least squares estimation in nonstationary nonlinear cohort panels with learning from experience", "link": "https://arxiv.org/abs/2309.08982", "description": "arXiv:2309.08982v3 Announce Type: replace \nAbstract: We discuss techniques of estimation and inference for nonstationary nonlinear cohort panels with learning from experience, showing, inter alia, the consistency and asymptotic normality of the nonlinear least squares estimator used in empirical practice. Potential pitfalls for hypothesis testing are identified and solutions proposed. 
Monte Carlo simulations verify the properties of the estimator and corresponding test statistics in finite samples, while an application to a panel of survey expectations demonstrates the usefulness of the theory developed."}, "https://arxiv.org/abs/2310.15069": {"title": "Second-order group knockoffs with applications to GWAS", "link": "https://arxiv.org/abs/2310.15069", "description": "arXiv:2310.15069v2 Announce Type: replace \nAbstract: Conditional testing via the knockoff framework allows one to identify -- among large number of possible explanatory variables -- those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome wide association studies (GWAS), which have the goal of identifying genetic variants which influence traits of medical relevance.\n While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct \"group knockoffs.\" While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.\n The described algorithms are implemented in an open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available."}, "https://arxiv.org/abs/2102.06202": {"title": "Private Prediction Sets", "link": "https://arxiv.org/abs/2102.06202", "description": "arXiv:2102.06202v3 Announce Type: replace-cross \nAbstract: In real-world settings involving consequential decision-making, the deployment of machine learning systems generally requires both reliable uncertainty quantification and protection of individuals' privacy. We present a framework that treats these two desiderata jointly. Our framework is based on conformal prediction, a methodology that augments predictive models to return prediction sets that provide uncertainty quantification -- they provably cover the true response with a user-specified probability, such as 90%. One might hope that when used with privately-trained models, conformal prediction would yield privacy guarantees for the resulting prediction sets; unfortunately, this is not the case. To remedy this key problem, we develop a method that takes any pre-trained predictive model and outputs differentially private prediction sets. Our method follows the general approach of split conformal prediction; we use holdout data to calibrate the size of the prediction sets but preserve privacy by using a privatized quantile subroutine. This subroutine compensates for the noise introduced to preserve privacy in order to guarantee correct coverage. 
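As background for the private prediction sets abstract above, the sketch below shows plain (non-private) split conformal regression: conformity scores on a held-out calibration set determine a quantile that is added around the point predictions, giving finite-sample marginal coverage. The paper's contribution, replacing this quantile step with a differentially private subroutine and compensating for the added noise, is not implemented; the model and data below are toy choices.

```python
import numpy as np

# Plain split conformal prediction for regression with absolute-residual scores.
def split_conformal(predict, X_cal, y_cal, X_test, alpha=0.1):
    scores = np.abs(y_cal - predict(X_cal))                 # conformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))                 # finite-sample corrected rank
    q = np.sort(scores)[min(k, n) - 1]
    preds = predict(X_test)
    return preds - q, preds + q

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=2000)
X_tr, X_cal, X_te = X[:1000], X[1000:1500], X[1500:]
y_tr, y_cal, y_te = y[:1000], y[1000:1500], y[1500:]
coef = np.polyfit(X_tr[:, 0], y_tr, deg=5)                  # any point predictor works
predict = lambda Z: np.polyval(coef, Z[:, 0])
lo, hi = split_conformal(predict, X_cal, y_cal, X_te)
print(np.mean((y_te >= lo) & (y_te <= hi)))                 # empirical coverage around 0.9
```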
We evaluate the method on large-scale computer vision datasets."}, "https://arxiv.org/abs/2210.00362": {"title": "Yurinskii's Coupling for Martingales", "link": "https://arxiv.org/abs/2210.00362", "description": "arXiv:2210.00362v2 Announce Type: replace-cross \nAbstract: Yurinskii's coupling is a popular theoretical tool for non-asymptotic distributional analysis in mathematical statistics and applied probability, offering a Gaussian strong approximation with an explicit error bound under easily verified conditions. Originally stated in $\\ell^2$-norm for sums of independent random vectors, it has recently been extended both to the $\\ell^p$-norm, for $1 \\leq p \\leq \\infty$, and to vector-valued martingales in $\\ell^2$-norm, under some strong conditions. We present as our main result a Yurinskii coupling for approximate martingales in $\\ell^p$-norm, under substantially weaker conditions than those previously imposed. Our formulation further allows for the coupling variable to follow a more general Gaussian mixture distribution, and we provide a novel third-order coupling method which gives tighter approximations in certain settings. We specialize our main result to mixingales, martingales, and independent data, and derive uniform Gaussian mixture strong approximations for martingale empirical processes. Applications to nonparametric partitioning-based and local polynomial regression procedures are provided."}, "https://arxiv.org/abs/2302.14186": {"title": "Approximately optimal domain adaptation with Fisher's Linear Discriminant", "link": "https://arxiv.org/abs/2302.14186", "description": "arXiv:2302.14186v3 Announce Type: replace-cross \nAbstract: We propose a class of models based on Fisher's Linear Discriminant (FLD) in the context of domain adaptation. The class is the convex combination of two hypotheses: i) an average hypothesis representing previously seen source tasks and ii) a hypothesis trained on a new target task. For a particular generative setting we derive the optimal convex combination of the two models under 0-1 loss, propose a computable approximation, and study the effect of various parameter settings on the relative risks between the optimal hypothesis, hypothesis i), and hypothesis ii). We demonstrate the effectiveness of the proposed optimal classifier in the context of EEG- and ECG-based classification settings and argue that the optimal classifier can be computed without access to direct information from any of the individual source tasks. We conclude by discussing further applications, limitations, and possible future directions."}, "https://arxiv.org/abs/2303.05328": {"title": "Simulation-based, Finite-sample Inference for Privatized Data", "link": "https://arxiv.org/abs/2303.05328", "description": "arXiv:2303.05328v4 Announce Type: replace-cross \nAbstract: Privacy protection methods, such as differentially private mechanisms, introduce noise into resulting statistics which often produces complex and intractable sampling distributions. In this paper, we propose a simulation-based \"repro sample\" approach to produce statistically valid confidence intervals and hypothesis tests, which builds on the work of Xie and Wang (2022). 
We show that this methodology is applicable to a wide variety of private inference problems, appropriately accounts for biases introduced by privacy mechanisms (such as by clamping), and improves over other state-of-the-art inference methods such as the parametric bootstrap in terms of the coverage and type I error of the private inference. We also develop significant improvements and extensions for the repro sample methodology for general models (not necessarily related to privacy), including 1) modifying the procedure to ensure guaranteed coverage and type I errors, even accounting for Monte Carlo error, and 2) proposing efficient numerical algorithms to implement the confidence intervals and $p$-values."}, "https://arxiv.org/abs/2306.03928": {"title": "Designing Decision Support Systems Using Counterfactual Prediction Sets", "link": "https://arxiv.org/abs/2306.03928", "description": "arXiv:2306.03928v2 Announce Type: replace-cross \nAbstract: Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at https://github.com/Networks-Learning/counterfactual-prediction-sets."}, "https://arxiv.org/abs/2306.06342": {"title": "Distribution-free inference with hierarchical data", "link": "https://arxiv.org/abs/2306.06342", "description": "arXiv:2306.06342v3 Announce Type: replace-cross \nAbstract: This paper studies distribution-free inference in settings where the data set has a hierarchical structure -- for example, groups of observations, or repeated measurements. In such settings, standard notions of exchangeability may not hold. To address this challenge, a hierarchical form of exchangeability is derived, facilitating extensions of distribution-free methods, including conformal prediction and jackknife+. 
While the standard theoretical guarantee obtained by the conformal prediction framework is a marginal predictive coverage guarantee, in the special case of independent repeated measurements, it is possible to achieve a stronger form of coverage -- the \"second-moment coverage\" property -- to provide better control of conditional miscoverage rates, and distribution-free prediction sets that achieve this property are constructed. Simulations illustrate that this guarantee indeed leads to uniformly small conditional miscoverage rates. Empirically, this stronger guarantee comes at the cost of a larger width of the prediction set in scenarios where the fitted model is poorly calibrated, but this cost is very mild in cases where the fitted model is accurate."}, "https://arxiv.org/abs/2310.01153": {"title": "Post-hoc and Anytime Valid Permutation and Group Invariance Testing", "link": "https://arxiv.org/abs/2310.01153", "description": "arXiv:2310.01153v3 Announce Type: replace-cross \nAbstract: We study post-hoc ($e$-value-based) and post-hoc anytime valid inference for testing exchangeability and general group invariance. Our methods satisfy a generalized Type I error control that permits a data-dependent selection of both the number of observations $n$ and the significance level $\\alpha$. We derive a simple analytical expression for all exact post-hoc valid $p$-values for group invariance, which allows for a flexible plug-in of the test statistic. For post-hoc anytime validity, we derive sequential $p$-processes by multiplying post-hoc $p$-values. In sequential testing, it is key to specify how the number of observations may depend on the data. We propose two approaches, and show how they nest existing efforts. To construct good post-hoc $p$-values, we develop the theory of likelihood ratios for group invariance, and generalize existing optimality results. These likelihood ratios turn out to exist in different flavors depending on which space we specify our alternative. We illustrate our methods by testing against a Gaussian location shift, which yields an improved optimality result for the $t$-test when testing sphericity, connections to the softmax function when testing exchangeability, and an improved method for testing sign-symmetry."}, "https://arxiv.org/abs/2311.18274": {"title": "Semiparametric Efficient Inference in Adaptive Experiments", "link": "https://arxiv.org/abs/2311.18274", "description": "arXiv:2311.18274v3 Announce Type: replace-cross \nAbstract: We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. 
Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control."}, "https://arxiv.org/abs/2403.02467": {"title": "Applied Causal Inference Powered by ML and AI", "link": "https://arxiv.org/abs/2403.02467", "description": "arXiv:2403.02467v1 Announce Type: new \nAbstract: An introduction to the emerging fusion of machine learning and causal inference. The book presents ideas from classical structural equation models (SEMs) and their modern AI equivalent, directed acyclic graphs (DAGs) and structural causal models (SCMs), and covers Double/Debiased Machine Learning methods to do inference in such models using modern predictive tools."}, "https://arxiv.org/abs/2403.02539": {"title": "Addressing the Influence of Unmeasured Confounding in Observational Studies with Time-to-Event Outcomes: A Semiparametric Sensitivity Analysis Approach", "link": "https://arxiv.org/abs/2403.02539", "description": "arXiv:2403.02539v1 Announce Type: new \nAbstract: In this paper, we develop a semiparametric sensitivity analysis approach designed to address unmeasured confounding in observational studies with time-to-event outcomes. We target estimation of the marginal distributions of potential outcomes under competing exposures using influence function-based techniques. We derive the non-parametric influence function for uncensored data and map the uncensored data influence function to the observed data influence function. Our methodology is motivated by and applied to an observational study evaluating the effectiveness of radical prostatectomy (RP) versus external beam radiotherapy with androgen deprivation (EBRT+AD) for the treatment of prostate cancer. We also present a simulation study to evaluate the statistical properties of our methodology."}, "https://arxiv.org/abs/2403.02591": {"title": "Matrix-based Prediction Approach for Intraday Instantaneous Volatility Vector", "link": "https://arxiv.org/abs/2403.02591", "description": "arXiv:2403.02591v1 Announce Type: new \nAbstract: In this paper, we introduce a novel method for predicting intraday instantaneous volatility based on Ito semimartingale models using high-frequency financial data. Several studies have highlighted stylized volatility time series features, such as interday auto-regressive dynamics and the intraday U-shaped pattern. To accommodate these volatility features, we propose an interday-by-intraday instantaneous volatility matrix process that can be decomposed into low-rank conditional expected instantaneous volatility and noise matrices. To predict the low-rank conditional expected instantaneous volatility matrix, we propose the Two-sIde Projected-PCA (TIP-PCA) procedure. We establish asymptotic properties of the proposed estimators and conduct a simulation study to assess the finite sample performance of the proposed prediction method. Finally, we apply the TIP-PCA method to an out-of-sample instantaneous volatility vector prediction study using high-frequency data from the S&P 500 index and 11 sector index funds."}, "https://arxiv.org/abs/2403.02625": {"title": "Determining the Number of Common Functional Factors with Twice Cross-Validation", "link": "https://arxiv.org/abs/2403.02625", "description": "arXiv:2403.02625v1 Announce Type: new \nAbstract: The semiparametric factor model serves as a vital tool to describe the dependence patterns in the data. 
It recognizes that the common features observed in the data are actually explained by functions of specific exogenous variables. Unlike traditional factor models, where the focus is on selecting the number of factors, our objective here is to identify the appropriate number of common functions, a crucial parameter in this model. In this paper, we develop a novel data-driven method to determine the number of functional factors using cross-validation (CV). Our proposed method employs a two-step CV process that ensures the orthogonality of functional factors, which we refer to as Functional Twice Cross-Validation (FTCV). Extensive simulations demonstrate that FTCV accurately selects the number of common functions and outperforms existing methods in most cases. Furthermore, by specifying market volatility as the exogenous force, we provide real data examples that illustrate the interpretability of selected common functions in characterizing the influence on U.S. Treasury Yields and the cross correlations between Dow30 returns."}, "https://arxiv.org/abs/2403.02979": {"title": "Regularised Canonical Correlation Analysis: graphical lasso, biplots and beyond", "link": "https://arxiv.org/abs/2403.02979", "description": "arXiv:2403.02979v1 Announce Type: new \nAbstract: Recent developments in regularized Canonical Correlation Analysis (CCA) promise powerful methods for high-dimensional, multiview data analysis. However, justifying the structural assumptions behind many popular approaches remains a challenge, and features of realistic biological datasets pose practical difficulties that are seldom discussed. We propose a novel CCA estimator rooted in an assumption of conditional independencies and based on the Graphical Lasso. Our method has desirable theoretical guarantees and good empirical performance, demonstrated through extensive simulations and real-world biological datasets. Recognizing the difficulties of model selection in high dimensions and other practical challenges of applying CCA in real-world settings, we introduce a novel framework for evaluating and interpreting regularized CCA models in the context of Exploratory Data Analysis (EDA), which we hope will empower researchers and pave the way for wider adoption."}, "https://arxiv.org/abs/2403.03058": {"title": "Machine Learning Assisted Adjustment Boosts Inferential Efficiency of Randomized Controlled Trials", "link": "https://arxiv.org/abs/2403.03058", "description": "arXiv:2403.03058v1 Announce Type: new \nAbstract: In this work, we propose a novel inferential procedure assisted by machine learning-based adjustment for randomized controlled trials. The method is developed under Rosenbaum's framework of exact tests in randomized experiments with covariate adjustments. Through extensive simulation experiments, we show that the proposed method can robustly control the type I error and can boost the inference efficiency for a randomized controlled trial (RCT). This advantage is further demonstrated in a real-world example. The simplicity and robustness of the proposed method make it a competitive candidate as a routine inference procedure for RCTs, especially when the number of baseline covariates is large, and when nonlinear association or interaction among covariates is expected. 
Its application may substantially reduce the required sample size and cost of RCTs, such as phase III clinical trials."}, "https://arxiv.org/abs/2403.03099": {"title": "Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure", "link": "https://arxiv.org/abs/2403.03099", "description": "arXiv:2403.03099v1 Announce Type: new \nAbstract: Big data, with NxP dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order PxN(N-1). To circumvent this problem, typically, the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of \"data nuggets\", which reduce a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2403.03138": {"title": "A novel methodological framework for the analysis of health trajectories and survival outcomes in heart failure patients", "link": "https://arxiv.org/abs/2403.03138", "description": "arXiv:2403.03138v1 Announce Type: new \nAbstract: Heart failure (HF) contributes to circa 200,000 annual hospitalizations in France. With the increasing age of HF patients, elucidating the specific causes of inpatient mortality has become a public health concern. We introduce a novel methodological framework designed to identify prevalent health trajectories and investigate their impact on death. The initial step involves applying sequential pattern mining to characterize patients' trajectories, followed by an unsupervised clustering algorithm based on a new metric for measuring the distance between hospitalization diagnoses. Finally, a survival analysis is conducted to assess survival outcomes. The application of this framework to HF patients from a representative sample of the French population demonstrates its methodological significance in enhancing the analysis of healthcare trajectories."}, "https://arxiv.org/abs/2403.02663": {"title": "The Influence of Validation Data on Logical and Scientific Interpretations of Forensic Expert Opinions", "link": "https://arxiv.org/abs/2403.02663", "description": "arXiv:2403.02663v1 Announce Type: cross \nAbstract: Forensic experts use specialized training and knowledge to enable other members of the judicial system to make better informed and more just decisions. Factfinders, in particular, are tasked with judging how much weight to give to experts' reports and opinions. Many references describe assessing evidential weight from the perspective of a forensic expert. 
Some recognize that stakeholders are each responsible for evaluating their own weight of evidence. Morris (1971, 1974, 1977) provided a general framework for recipients to update their own uncertainties after learning an expert's opinion. Although this framework is normative under Bayesian axioms and several forensic scholars advocate the use of Bayesian reasoning, few resources describe its application in forensic science. This paper addresses this gap by examining how recipients can combine principles of science and Bayesian reasoning to evaluate their own likelihood ratios for expert opinions. This exercise helps clarify how an expert's role depends on whether one envisions recipients to be logical and scientific or deferential. Illustrative examples with an expert's opinion expressed as a categorical conclusion, likelihood ratio, or range of likelihood ratios, or with likelihood ratios from multiple experts, each reveal the importance and influence of validation data for logical recipients' interpretations."}, "https://arxiv.org/abs/2403.02696": {"title": "Low-rank matrix estimation via nonconvex spectral regularized methods in errors-in-variables matrix regression", "link": "https://arxiv.org/abs/2403.02696", "description": "arXiv:2403.02696v1 Announce Type: cross \nAbstract: High-dimensional matrix regression has been studied in various aspects, such as statistical properties, computational efficiency and application to specific instances including multivariate regression, system identification and matrix compressed sensing. Current studies mainly consider the idealized case that the covariate matrix is obtained without noise, while the more realistic scenario in which the covariates may be corrupted by noise or missing data has received little attention. We consider the general errors-in-variables matrix regression model and propose a unified framework for low-rank estimation based on nonconvex spectral regularization. On the statistical side, recovery bounds for any stationary point are provided to establish statistical consistency. On the computational side, the proximal gradient method is applied to solve the nonconvex optimization problem and is proved to converge in polynomial time. Consequences for specific matrix compressed sensing models with additive noise and missing data are obtained by verifying the corresponding regularity conditions. Finally, the performance of the proposed nonconvex estimation method is illustrated by numerical experiments."}, "https://arxiv.org/abs/2403.02840": {"title": "On a theory of martingales for censoring", "link": "https://arxiv.org/abs/2403.02840", "description": "arXiv:2403.02840v1 Announce Type: cross \nAbstract: A theory of martingales for censoring is developed. The Doob-Meyer martingale is shown to be inadequate in general, and a repaired martingale is proposed with a non-predictable centering term. Associated martingale transforms, variation processes, and covariation processes are developed based on a measure of half-predictability that generalizes predictability. 
The development is applied to study the Kaplan-Meier estimator."}, "https://arxiv.org/abs/2403.03007": {"title": "Scalable Bayesian inference for the generalized linear mixed model", "link": "https://arxiv.org/abs/2403.03007", "description": "arXiv:2403.03007v1 Announce Type: cross \nAbstract: The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov Chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference that leverages the scalability of modern AI algorithms with the guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results, we establish our algorithm's statistical inference properties, and apply the method to a large electronic health records database."}, "https://arxiv.org/abs/2403.03208": {"title": "Active Statistical Inference", "link": "https://arxiv.org/abs/2403.03208", "description": "arXiv:2403.03208v1 Announce Type: cross \nAbstract: Inspired by the concept of active learning, we propose active inference$\unicode{x2013}$a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. 
We evaluate active inference on datasets from public opinion research, census analysis, and proteomics."}, "https://arxiv.org/abs/2103.12900": {"title": "Loss based prior for the degrees of freedom of the Wishart distribution", "link": "https://arxiv.org/abs/2103.12900", "description": "arXiv:2103.12900v2 Announce Type: replace \nAbstract: Motivated by the proliferation of extensive macroeconomic and health datasets necessitating accurate forecasts, a novel approach is introduced to address Vector Autoregressive (VAR) models. This approach employs the global-local shrinkage-Wishart prior. Unlike conventional VAR models, where degrees of freedom are predetermined to be equivalent to the size of the variable plus one or equal to zero, the proposed method integrates a hyperprior for the degrees of freedom to account for the uncertainty about the parameter values. Specifically, a loss-based prior is derived to leverage information regarding the data-inherent degrees of freedom. The efficacy of the proposed prior is demonstrated in a multivariate setting for forecasting macroeconomic data, as well as Dengue infection data."}, "https://arxiv.org/abs/2107.13737": {"title": "Design-Robust Two-Way-Fixed-Effects Regression For Panel Data", "link": "https://arxiv.org/abs/2107.13737", "description": "arXiv:2107.13737v3 Announce Type: replace \nAbstract: We propose a new estimator for average causal effects of a binary treatment with panel data in settings with general treatment patterns. Our approach augments the popular two-way-fixed-effects specification with unit-specific weights that arise from a model for the assignment mechanism. We show how to construct these weights in various settings, including the staggered adoption setting, where units opt into the treatment sequentially but permanently. The resulting estimator converges to an average (over units and time) treatment effect under the correct specification of the assignment model, even if the fixed effect model is misspecified. We show that our estimator is more robust than the conventional two-way estimator: it remains consistent if either the assignment mechanism or the two-way regression model is correctly specified. In addition, the proposed estimator performs better than the two-way-fixed-effect estimator if the outcome model and assignment mechanism are locally misspecified. This strong double robustness property underlines and quantifies the benefits of modeling the assignment process and motivates using our estimator in practice. We also discuss an extension of our estimator to handle dynamic treatment effects."}, "https://arxiv.org/abs/2108.04376": {"title": "Counterfactual Effect Generalization: A Combinatorial Definition", "link": "https://arxiv.org/abs/2108.04376", "description": "arXiv:2108.04376v4 Announce Type: replace \nAbstract: The widely used 'Counterfactual' definition of Causal Effects was derived for unbiasedness and accuracy - and not generalizability. We propose a Combinatorial definition for the External Validity (EV) of intervention effects. We first define the concept of an effect observation 'background'. We then formulate conditions for effect generalization based on their sets of (observable and unobservable) backgrounds. This reveals two limits for effect generalization: (1) when effects are observed under all their enumerable backgrounds, or, (2) when backgrounds have become sufficiently randomized. 
We use the resulting combinatorial framework to re-examine several issues in the original counterfactual formulation: out-of-sample validity, concurrent estimation of multiple effects, bias-variance tradeoffs, statistical power, and connections to current predictive and explaining techniques.\n Methodologically, the definitions also allow us to replace the parametric estimation problems that followed the counterfactual definition with combinatorial enumeration and randomization problems in non-experimental samples. We use this non-parametric framework to demonstrate (External Validity, Unconfoundedness and Precision) tradeoffs in the performance of popular supervised, explaining, and causal-effect estimators. We demonstrate that the approach also allows for the use of these methods in non-i.i.d. samples. The COVID-19 pandemic highlighted the need for learning solutions that provide predictions in severely incomplete samples. We demonstrate applications to this pressing problem."}, "https://arxiv.org/abs/2212.14411": {"title": "Near-Optimal Non-Parametric Sequential Tests and Confidence Sequences with Possibly Dependent Observations", "link": "https://arxiv.org/abs/2212.14411", "description": "arXiv:2212.14411v3 Announce Type: replace \nAbstract: Sequential tests and their implied confidence sequences, which are valid at arbitrary stopping times, promise flexible statistical inference and on-the-fly decision making. However, strong guarantees are limited to parametric sequential tests that under-cover in practice or concentration-bound-based sequences that over-cover and have suboptimal rejection times. In this work, we consider \cite{robbins1970boundary}'s delayed-start normal-mixture sequential probability ratio tests, and we provide the first asymptotic type-I-error and expected-rejection-time guarantees under general non-parametric data generating processes, where the asymptotics are indexed by the test's burn-in time. The type-I-error results primarily leverage a martingale strong invariance principle and establish that these tests (and their implied confidence sequences) have type-I error rates approaching a desired $\alpha$-level. The expected-rejection-time results primarily leverage an identity inspired by It\^o's lemma and imply that, in certain asymptotic regimes, the expected rejection time approaches the minimum possible among $\alpha$-level tests. We show how to apply our results to sequential inference on parameters defined by estimating equations, such as average treatment effects. Together, our results establish these (ostensibly parametric) tests as general-purpose, non-parametric, and near-optimal. We illustrate this via numerical experiments."}, "https://arxiv.org/abs/2307.12754": {"title": "Nonparametric Linear Feature Learning in Regression Through Regularisation", "link": "https://arxiv.org/abs/2307.12754", "description": "arXiv:2307.12754v3 Announce Type: replace \nAbstract: Representation learning plays a crucial role in automated feature selection, particularly in the context of high-dimensional data, where non-parametric methods often struggle. In this study, we focus on supervised learning scenarios where the pertinent information resides within a lower-dimensional linear subspace of the data, namely the multi-index model. If this subspace were known, it would greatly enhance prediction, computation, and interpretation. 
To address this challenge, we propose a novel method for linear feature learning with non-parametric prediction, which simultaneously estimates the prediction function and the linear subspace. Our approach employs empirical risk minimisation, augmented with a penalty on function derivatives, ensuring versatility. Leveraging the orthogonality and rotation invariance properties of Hermite polynomials, we introduce our estimator, named RegFeaL. By utilising alternating minimisation, we iteratively rotate the data to improve alignment with leading directions and accurately estimate the relevant dimension in practical settings. We establish that our method yields a consistent estimator of the prediction function with explicit rates. Additionally, we provide empirical results demonstrating the performance of RegFeaL in various experiments."}, "https://arxiv.org/abs/2311.10877": {"title": "Covariate adjustment in randomized experiments with missing outcomes and covariates", "link": "https://arxiv.org/abs/2311.10877", "description": "arXiv:2311.10877v2 Announce Type: replace \nAbstract: Covariate adjustment can improve precision in analyzing randomized experiments. With fully observed data, regression adjustment and propensity score weighting are asymptotically equivalent in improving efficiency over unadjusted analysis. When some outcomes are missing, we consider combining these two adjustment methods with inverse probability of observation weighting for handling missing outcomes, and show that the equivalence between the two methods breaks down. Regression adjustment no longer ensures efficiency gain over unadjusted analysis unless the true outcome model is linear in covariates or the outcomes are missing completely at random. Propensity score weighting, in contrast, still guarantees efficiency over unadjusted analysis, and including more covariates in adjustment never harms asymptotic efficiency. Moreover, we establish the value of using partially observed covariates to secure additional efficiency by the missingness indicator method, which imputes all missing covariates by zero and uses the union of the completed covariates and corresponding missingness indicators as the new, fully observed covariates. Based on these findings, we recommend using regression adjustment in combination with the missingness indicator method if the linear outcome model or missing completely at random assumption is plausible and using propensity score weighting with the missingness indicator method otherwise."}, "https://arxiv.org/abs/2401.11804": {"title": "Regression Copulas for Multivariate Responses", "link": "https://arxiv.org/abs/2401.11804", "description": "arXiv:2401.11804v2 Announce Type: replace \nAbstract: We propose a novel distributional regression model for a multivariate response vector based on a copula process over the covariate space. It uses the implicit copula of a Gaussian multivariate regression, which we call a ``regression copula''. To allow for large covariate vectors, their coefficients are regularized using a novel multivariate extension of the horseshoe prior. Bayesian inference and distributional predictions are evaluated using efficient variational inference methods, allowing application to large datasets. An advantage of the approach is that the marginal distributions of the response vector can be estimated separately and accurately, resulting in predictive distributions that are marginally-calibrated. 
Two substantive applications of the methodology highlight its efficacy in multivariate modeling. The first is the econometric modeling and prediction of half-hourly regional Australian electricity prices. Here, our approach produces more accurate distributional forecasts than leading benchmark methods. The second is the evaluation of multivariate posteriors in likelihood-free inference (LFI) of a model for tree species abundance data, extending a previous univariate regression copula LFI method. In both applications, we demonstrate that our new approach exhibits a desirable marginal calibration property."}, "https://arxiv.org/abs/2211.13715": {"title": "Trust Your $\\nabla$: Gradient-based Intervention Targeting for Causal Discovery", "link": "https://arxiv.org/abs/2211.13715", "description": "arXiv:2211.13715v4 Announce Type: replace-cross \nAbstract: Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime."}, "https://arxiv.org/abs/2302.01246": {"title": "Behavioral Carry-Over Effect and Power Consideration in Crossover Trials", "link": "https://arxiv.org/abs/2302.01246", "description": "arXiv:2302.01246v2 Announce Type: replace-cross \nAbstract: A crossover trial is an efficient trial design when there is no carry-over effect. To reduce the impact of the biological carry-over effect, a washout period is often designed. However, the carry-over effect remains an outstanding concern when a washout period is unethical or cannot sufficiently diminish the impact of the carry-over effect. The latter can occur in comparative effectiveness research where the carry-over effect is often non-biological but behavioral. In this paper, we investigate the crossover design under a potential outcomes framework with and without the carry-over effect. We find that when the carry-over effect exists and satisfies a sign condition, the basic estimator underestimates the treatment effect, which does not inflate the type I error of one-sided tests but negatively impacts the power. This leads to a power trade-off between the crossover design and the parallel-group design, and we derive the condition under which the crossover design does not lead to type I error inflation and is still more powerful than the parallel-group design. We also develop covariate adjustment methods for crossover trials. 
We evaluate the performance of the crossover design and covariate adjustment using data from the MTN-034/REACH study."}, "https://arxiv.org/abs/2306.14318": {"title": "Regularization of the ensemble Kalman filter using a non-parametric, non-stationary spatial model", "link": "https://arxiv.org/abs/2306.14318", "description": "arXiv:2306.14318v3 Announce Type: replace-cross \nAbstract: The sample covariance matrix of a random vector is a good estimate of the true covariance matrix if the sample size is much larger than the length of the vector. In high-dimensional problems, this condition is never met. As a result, in high dimensions the EnKF ensemble does not contain enough information to specify the prior covariance matrix accurately. This necessitates regularization of the analysis (observation update) problem. We propose a regularization technique based on a new spatial model on the sphere. The model is a constrained version of the general Gaussian process convolution model. The constraints on the location-dependent convolution kernel include local isotropy, positive definiteness as a function of distance, and smoothness as a function of location. The model allows for a rigorous definition of the local spectrum, which, in addition, is required to be a smooth function of spatial wavenumber. We regularize the ensemble Kalman filter by postulating that its prior covariances obey this model. The model is estimated online in a two-stage procedure. First, ensemble perturbations are bandpass filtered in several wavenumber bands to extract aggregated local spatial spectra. Second, a neural network recovers the local spectra from sample variances of the filtered fields. We show that, with growing ensemble size, the estimator is capable of extracting increasingly detailed spatially non-stationary structures. In simulation experiments, the new technique led to substantially better EnKF performance than several existing techniques."}, "https://arxiv.org/abs/2403.03228": {"title": "Multiwinner Elections and the Spoiler Effect", "link": "https://arxiv.org/abs/2403.03228", "description": "arXiv:2403.03228v1 Announce Type: new \nAbstract: In the popular debate over the use of ranked-choice voting, it is often claimed that the method of single transferable vote (STV) is immune or mostly immune to the so-called ``spoiler effect,'' where the removal of a losing candidate changes the set of winners. This claim has previously been studied only in the single-winner case. We investigate how susceptible STV is to the spoiler effect in multiwinner elections, where the output of the voting method is a committee of size at least two. To evaluate STV, we compare it to numerous other voting methods, including single non-transferable vote, $k$-Borda, and the Chamberlin-Courant rule. We provide simulation results under three different random models and empirical results using a large database of real-world multiwinner political elections from Scotland. 
Our results show that STV is not spoiler-proof in any meaningful sense in the multiwinner context, but it tends to perform well relative to other methods, especially when using real-world ballot data."}, "https://arxiv.org/abs/2403.03240": {"title": "Triple/Debiased Lasso for Statistical Inference of Conditional Average Treatment Effects", "link": "https://arxiv.org/abs/2403.03240", "description": "arXiv:2403.03240v1 Announce Type: new \nAbstract: This study investigates the estimation and the statistical inference about Conditional Average Treatment Effects (CATEs), which have garnered attention as a metric representing individualized causal effects. In our data-generating process, we assume linear models for the outcomes associated with binary treatments and define the CATE as a difference between the expected outcomes of these linear models. This study allows the linear models to be high-dimensional, and our interest lies in consistent estimation and statistical inference for the CATE. In high-dimensional linear regression, one typical approach is to assume sparsity. However, in our study, we do not assume sparsity directly. Instead, we consider sparsity only in the difference of the linear models. We first use a doubly robust estimator to approximate this difference and then regress the difference on covariates with Lasso regularization. Although this regression estimator is consistent for the CATE, we further reduce the bias using the techniques in double/debiased machine learning (DML) and debiased Lasso, leading to $\\sqrt{n}$-consistency and confidence intervals. We refer to the debiased estimator as the triple/debiased Lasso (TDL), applying both DML and debiased Lasso techniques. We confirm the soundness of our proposed method through simulation studies."}, "https://arxiv.org/abs/2403.03299": {"title": "Understanding and avoiding the \"weights of regression\": Heterogeneous effects, misspecification, and longstanding solutions", "link": "https://arxiv.org/abs/2403.03299", "description": "arXiv:2403.03299v1 Announce Type: new \nAbstract: Researchers in many fields endeavor to estimate treatment effects by regressing outcome data (Y) on a treatment (D) and observed confounders (X). Even absent unobserved confounding, the regression coefficient on the treatment reports a weighted average of strata-specific treatment effects (Angrist, 1998). Where heterogeneous treatment effects cannot be ruled out, the resulting coefficient is thus not generally equal to the average treatment effect (ATE), and is unlikely to be the quantity of direct scientific or policy interest. The difference between the coefficient and the ATE has led researchers to propose various interpretational, bounding, and diagnostic aids (Humphreys, 2009; Aronow and Samii, 2016; Sloczynski, 2022; Chattopadhyay and Zubizarreta, 2023). We note that the linear regression of Y on D and X can be misspecified when the treatment effect is heterogeneous in X. The \"weights of regression\", for which we provide a new (more general) expression, simply characterize how the OLS coefficient will depart from the ATE under the misspecification resulting from unmodeled treatment effect heterogeneity. Consequently, a natural alternative to suffering these weights is to address the misspecification that gives rise to them. For investigators committed to linear approaches, we propose relying on the slightly weaker assumption that the potential outcomes are linear in X. 
Numerous well-known estimators are unbiased for the ATE under this assumption, namely regression-imputation/g-computation/T-learner, regression with an interaction of the treatment and covariates (Lin, 2013), and balancing weights. Any of these approaches avoids the apparent weighting problem of the misspecified linear regression, at an efficiency cost that will be small when there are few covariates relative to sample size. We demonstrate these lessons using simulations in observational and experimental settings."}, "https://arxiv.org/abs/2403.03349": {"title": "A consensus-constrained parsimonious Gaussian mixture model for clustering hyperspectral images", "link": "https://arxiv.org/abs/2403.03349", "description": "arXiv:2403.03349v1 Announce Type: new \nAbstract: The use of hyperspectral imaging to investigate food samples has grown due to the improved performance and lower cost of spectroscopy instrumentation. Food engineers use hyperspectral images to classify the type and quality of a food sample, typically using classification methods. In order to train these methods, every pixel in each training image needs to be labelled. Typically, computationally cheap threshold-based approaches are used to label the pixels, and classification methods are trained based on those labels. However, threshold-based approaches are subjective and cannot be generalized across hyperspectral images taken in different conditions and of different foods. Here, a consensus-constrained parsimonious Gaussian mixture model (ccPGMM) is proposed to label pixels in hyperspectral images using a model-based clustering approach. The ccPGMM utilizes available information on the labels of a small number of pixels and the relationship between those pixels and neighbouring pixels as constraints when clustering the rest of the pixels in the image. A latent variable model is used to represent the high-dimensional data in terms of a small number of underlying latent factors. To ensure computational feasibility, a consensus clustering approach is employed, where the data are divided into multiple randomly selected subsets of variables and constrained clustering is applied to each data subset; the clustering results are then consolidated across all data subsets to provide a consensus clustering solution. The ccPGMM approach is applied to simulated datasets and real hyperspectral images of three types of puffed cereal: corn, rice, and wheat. Improved clustering performance and computational efficiency are demonstrated when compared to other current state-of-the-art approaches."}, "https://arxiv.org/abs/2403.03389": {"title": "Exploring Spatial Generalized Functional Linear Models: A Comparative Simulation Study and Analysis of COVID-19", "link": "https://arxiv.org/abs/2403.03389", "description": "arXiv:2403.03389v1 Announce Type: new \nAbstract: Implementation of spatial generalized linear models with a functional covariate can be accomplished through the use of a truncated basis expansion of the covariate process. In practice, one must select a truncation level for use. We compare five criteria for the selection of an appropriate truncation level, including AIC and BIC based on a log composite likelihood, a fraction of variance explained criterion, a fitted mean squared error, and a prediction error with one standard error rule. 
Based on extensive simulation studies, we propose that BIC constitutes a reasonable default criterion for the selection of the truncation level for use in a spatial functional generalized linear model. In addition, we demonstrate that the spatial model with a functional covariate outperforms other models when the data contain spatial structure and response variables are in fact influenced by a functional covariate process. We apply the spatial functional generalized linear model to a problem in which the objective is to relate COVID-19 vaccination rates in counties of states in the Midwestern United States to the number of new cases from previous weeks in those same geographic regions."}, "https://arxiv.org/abs/2403.03589": {"title": "Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate Choices", "link": "https://arxiv.org/abs/2403.03589", "description": "arXiv:2403.03589v1 Announce Type: new \nAbstract: This study designs an adaptive experiment for efficiently estimating average treatment effects (ATEs). We consider an adaptive experiment where an experimenter sequentially samples an experimental unit from a covariate density decided by the experimenter and assigns a treatment. After assigning a treatment, the experimenter observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using gathered samples. The objective of the experimenter is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose a framework under which an experimenter optimizes the covariate density, as well as the propensity score, and find that optimizing both covariate density and propensity score reduces the asymptotic variance more than optimizing only the propensity score. Based on this idea, in each round of our experiment, the experimenter optimizes the covariate density and propensity score based on past observations. To design an adaptive experiment, we first derive the efficient covariate density and propensity score that minimize the semiparametric efficiency bound, a lower bound for the asymptotic variance given a fixed covariate density and a fixed propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound."}, "https://arxiv.org/abs/2403.03613": {"title": "Reducing the dimensionality and granularity in hierarchical categorical variables", "link": "https://arxiv.org/abs/2403.03613", "description": "arXiv:2403.03613v1 Announce Type: new \nAbstract: Hierarchical categorical variables often exhibit many levels (high granularity) and many classes within each level (high dimensionality). This may cause overfitting and estimation issues when including such covariates in a predictive model. In the current literature, a hierarchical covariate is often incorporated via nested random effects. However, this does not facilitate the assumption of classes having the same effect on the response variable. In this paper, we propose a methodology to obtain a reduced representation of a hierarchical categorical variable. We show how entity embedding can be applied in a hierarchical setting. 
Subsequently, we propose a top-down clustering algorithm which leverages the information encoded in the embeddings to reduce both the within-level dimensionality as well as the overall granularity of the hierarchical categorical variable. In simulation experiments, we show that our methodology can effectively approximate the true underlying structure of a hierarchical covariate in terms of the effect on a response variable, and find that incorporating the reduced hierarchy improves model fit. We apply our methodology on a real dataset and find that the reduced hierarchy is an improvement over the original hierarchical structure and reduced structures proposed in the literature."}, "https://arxiv.org/abs/2403.03646": {"title": "Bayesian Generalized Distributed Lag Regression with Variable Selection", "link": "https://arxiv.org/abs/2403.03646", "description": "arXiv:2403.03646v1 Announce Type: new \nAbstract: Distributed Lag Models (DLMs) and similar regression approaches such as MIDAS have been used for many decades in econometrics, and more recently in the study of air quality and its impact on human health. They are useful not only for quantifying accumulating and delayed effects, but also for estimating the lags that are most susceptible to these effects. Among other things, they have been used to infer the period of exposure to poor air quality which might negatively impact child birth weight. The increased attention DLMs have received in recent years is reflective of their potential to help us understand a great many issues, particularly in the investigation of how the environment affects human health. In this paper we describe how to expand the utility of these models for Bayesian inference by leveraging latent-variables. In particular we explain how to perform binary regression to better handle imbalanced data, how to incorporate negative binomial regression, and how to estimate the probability of predictor inclusion. Extra parameters introduced through the DLM framework may require calibration for the MCMC algorithm, but this will not be the case in DLM-based analyses often seen in pollution exposure literature. In these cases, the parameters are inferred through a fully automatic Gibbs sampling procedure."}, "https://arxiv.org/abs/2403.03722": {"title": "Is Distance Correlation Robust?", "link": "https://arxiv.org/abs/2403.03722", "description": "arXiv:2403.03722v1 Announce Type: new \nAbstract: Distance correlation is a popular measure of dependence between random variables. It has some robustness properties, but not all. We prove that the influence function of the usual distance correlation is bounded, but that its breakdown value is zero. Moreover, it has an unbounded sensitivity function, converging to the bounded influence function for increasing sample size. To address this sensitivity to outliers we construct a more robust version of distance correlation, which is based on a new data transformation. Simulations indicate that the resulting method is quite robust, and has good power in the presence of outliers. We illustrate the method on genetic data. Comparing the classical distance correlation with its more robust version provides additional insight."}, "https://arxiv.org/abs/2403.03778": {"title": "Ancestor regression in structural vector autoregressive models", "link": "https://arxiv.org/abs/2403.03778", "description": "arXiv:2403.03778v1 Announce Type: new \nAbstract: We present a new method for causal discovery in linear structural vector autoregressive models. 
We adapt an idea designed for independent observations to the case of time series while retaining its favorable properties, i.e., explicit error control for false causal discovery, at least asymptotically. We apply our method to several real-world bivariate time series datasets and discuss its findings which mostly agree with common understanding. The arrow of time in a model can be interpreted as background knowledge on possible causal mechanisms. Hence, our ideas could be extended to incorporating different background knowledge, even for independent observations."}, "https://arxiv.org/abs/2403.03802": {"title": "Inequalities and bounds for expected order statistics from transform-ordered families", "link": "https://arxiv.org/abs/2403.03802", "description": "arXiv:2403.03802v1 Announce Type: new \nAbstract: We introduce a comprehensive method for establishing stochastic orders among order statistics in the i.i.d. case. This approach relies on the assumption that the underlying distribution is linked to a reference distribution through a transform order. Notably, this method exhibits broad applicability, particularly since several well-known nonparametric distribution families can be defined using relevant transform orders, including the convex and the star transform orders. In the context of convex-ordered families, we demonstrate that applying Jensen's inequality enables the derivation of bounds for the probability that a random variable exceeds the expected value of its corresponding order statistic."}, "https://arxiv.org/abs/2403.03837": {"title": "An Adaptive Multivariate Functional EWMA Control Chart", "link": "https://arxiv.org/abs/2403.03837", "description": "arXiv:2403.03837v1 Announce Type: new \nAbstract: In many modern industrial scenarios, the measurements of the quality characteristics of interest are often required to be represented as functional data or profiles. This motivates the growing interest in extending traditional univariate statistical process monitoring (SPM) schemes to the functional data setting. This article proposes a new SPM scheme, which is referred to as adaptive multivariate functional EWMA (AMFEWMA), to extend the well-known exponentially weighted moving average (EWMA) control chart from the univariate scalar to the multivariate functional setting. The favorable performance of the AMFEWMA control chart over existing methods is assessed via an extensive Monte Carlo simulation. Its practical applicability is demonstrated through a case study in the monitoring of the quality of a resistance spot welding process in the automotive industry through the online observations of dynamic resistance curves, which are associated with multiple spot welds on the same car body and recognized as the full technological signature of the process."}, "https://arxiv.org/abs/2403.03868": {"title": "Confidence on the Focal: Conformal Prediction with Selection-Conditional Coverage", "link": "https://arxiv.org/abs/2403.03868", "description": "arXiv:2403.03868v1 Announce Type: new \nAbstract: Conformal prediction builds marginally valid prediction intervals which cover the unknown outcome of a randomly drawn new test point with a prescribed probability. In practice, a common scenario is that, after seeing the test unit(s), practitioners decide which test unit(s) to focus on in a data-driven manner, and wish to quantify the uncertainty for the focal unit(s). In such cases, marginally valid prediction intervals for these focal units can be misleading due to selection bias. 
This paper presents a general framework for constructing a prediction set with finite-sample exact coverage conditional on the unit being selected. Its general form works for arbitrary selection rules, and generalizes Mondrian Conformal Prediction to multiple test units and non-equivariant classifiers. We then work out a computationally efficient implementation of our framework for a number of realistic selection rules, including top-K selection, optimization-based selection, selection based on conformal p-values, and selection based on properties of preliminary conformal prediction sets. The performance of our methods is demonstrated via applications in drug discovery and health risk prediction."}, "https://arxiv.org/abs/2403.03948": {"title": "Estimating the household secondary attack rate with the Incomplete Chain Binomial model", "link": "https://arxiv.org/abs/2403.03948", "description": "arXiv:2403.03948v1 Announce Type: new \nAbstract: The Secondary Attack Rate (SAR) is a measure of how infectious a communicable disease is, and is often estimated based on studies of disease transmission in households. The Chain Binomial model is a simple model for disease outbreaks, and the final size distribution derived from it can be used to estimate the SAR using simple summary statistics. The final size distribution of the Chain Binomial model assumes that the outbreaks have concluded, which in some instances may require a long follow-up time. We develop a way to compute the probability distribution of the number of infected individuals before the outbreak has concluded, which we call the Incomplete Chain Binomial distribution. We study a few theoretical properties of the model. We develop Maximum Likelihood estimation routines for inference on the SAR and explore the model by analyzing two real-world data sets."}, "https://arxiv.org/abs/1612.06040": {"title": "Monte Carlo goodness-of-fit tests for degree corrected and related stochastic blockmodels", "link": "https://arxiv.org/abs/1612.06040", "description": "arXiv:1612.06040v4 Announce Type: replace \nAbstract: We construct Bayesian and frequentist finite-sample goodness-of-fit tests for three different variants of the stochastic blockmodel for network data. Since all of the stochastic blockmodel variants are log-linear in form when block assignments are known, the tests for the \emph{latent} block model versions combine a block membership estimator with the algebraic statistics machinery for testing goodness-of-fit in log-linear models. We describe Markov bases and marginal polytopes of the variants of the stochastic blockmodel, and discuss how both facilitate the development of goodness-of-fit tests and understanding of model behavior.\n The general testing methodology developed here extends to any finite mixture of log-linear models on discrete data, and as such is the first application of the algebraic statistics machinery for latent-variable models."}, "https://arxiv.org/abs/2212.06108": {"title": "Tandem clustering with invariant coordinate selection", "link": "https://arxiv.org/abs/2212.06108", "description": "arXiv:2212.06108v4 Announce Type: replace \nAbstract: For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. 
To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach."}, "https://arxiv.org/abs/2302.11322": {"title": "Causal inference with misspecified network interference structure", "link": "https://arxiv.org/abs/2302.11322", "description": "arXiv:2302.11322v2 Announce Type: replace \nAbstract: Under interference, the potential outcomes of a unit depend on treatments assigned to other units. A network interference structure is typically assumed to be given and accurate. In this paper, we study the problems resulting from misspecifying these networks. First, we derive bounds on the bias arising from estimating causal effects under a misspecified network. We show that the maximal possible bias depends on the divergence between the assumed network and the true one with respect to the induced exposure probabilities. Then, we propose a novel estimator that leverages multiple networks simultaneously and is unbiased if one of the networks is correct, thus providing robustness to network specification. Additionally, we develop a probabilistic bias analysis that quantifies the impact of a postulated misspecification mechanism on the causal estimates. We illustrate key issues in simulations and demonstrate the utility of the proposed methods in a social network field experiment and a cluster-randomized trial with suspected cross-cluster contamination."}, "https://arxiv.org/abs/2305.00484": {"title": "Sequential Markov Chain Monte Carlo for Lagrangian Data Assimilation with Applications to Unknown Data Locations", "link": "https://arxiv.org/abs/2305.00484", "description": "arXiv:2305.00484v3 Announce Type: replace \nAbstract: We consider a class of high-dimensional spatial filtering problems, where the spatial locations of observations are unknown and driven by the partially observed hidden signal. This problem is exceptionally challenging as it is not only high-dimensional, but the model for the signal also yields longer-range time dependencies through the observation locations. 
Motivated by this model we revisit a lesser-known and \\emph{provably convergent} computational methodology from \\cite{berzuini, cent, martin} that uses sequential Markov Chain Monte Carlo (MCMC) chains. We extend this methodology for data filtering problems with unknown observation locations. We benchmark our algorithms on Linear Gaussian state space models against competing ensemble methods and demonstrate a significant improvement in both execution speed and accuracy. Finally, we implement a realistic case study on a high-dimensional rotating shallow water model (of about $10^4-10^5$ dimensions) with real and synthetic data. The data is provided by the National Oceanic and Atmospheric Administration (NOAA) and contains observations from ocean drifters in a domain of the Atlantic Ocean restricted to the longitude and latitude intervals $[-51^{\\circ}, -41^{\\circ}]$, $[17^{\\circ}, 27^{\\circ}]$ respectively."}, "https://arxiv.org/abs/2306.05829": {"title": "A reduced-rank approach to predicting multiple binary responses through machine learning", "link": "https://arxiv.org/abs/2306.05829", "description": "arXiv:2306.05829v2 Announce Type: replace \nAbstract: This paper investigates the problem of simultaneously predicting multiple binary responses by utilizing a shared set of covariates. Our approach incorporates machine learning techniques for binary classification, without making assumptions about the underlying observations. Instead, our focus lies on a group of predictors, aiming to identify the one that minimizes prediction error. Unlike previous studies that primarily address estimation error, we directly analyze the prediction error of our method using PAC-Bayesian bounds techniques. In this paper, we introduce a pseudo-Bayesian approach capable of handling incomplete response data. Our strategy is efficiently implemented using the Langevin Monte Carlo method. Through simulation studies and a practical application using real data, we demonstrate the effectiveness of our proposed method, producing comparable or sometimes superior results compared to the current state-of-the-art method."}, "https://arxiv.org/abs/2309.12380": {"title": "Methods for generating and evaluating synthetic longitudinal patient data: a systematic review", "link": "https://arxiv.org/abs/2309.12380", "description": "arXiv:2309.12380v2 Announce Type: replace \nAbstract: The proliferation of data in recent years has led to the advancement and utilization of various statistical and deep learning techniques, thus expediting research and development activities. However, not all industries have benefited equally from the surge in data availability, partly due to legal restrictions on data usage and privacy regulations, such as in medicine. To address this issue, various statistical disclosure and privacy-preserving methods have been proposed, including the use of synthetic data generation. Synthetic data are generated based on some existing data, with the aim of replicating them as closely as possible and acting as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. 
The collected information includes, but is not limited to, method type, source code availability, and approaches used to assess resemblance, utility, and privacy. Furthermore, the paper discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods."}, "https://arxiv.org/abs/2312.01209": {"title": "A Method of Moments Approach to Asymptotically Unbiased Synthetic Controls", "link": "https://arxiv.org/abs/2312.01209", "description": "arXiv:2312.01209v2 Announce Type: replace \nAbstract: A common approach to constructing a Synthetic Control unit is to fit on the outcome variable and covariates in pre-treatment time periods, but it has been shown by Ferman and Pinto (2019) that this approach does not provide asymptotic unbiasedness when the fit is imperfect and the number of controls is fixed. Many related panel methods have a similar limitation when the number of units is fixed. I introduce and evaluate a new method in which the Synthetic Control is constructed using a General Method of Moments approach where units not being included in the Synthetic Control are used as instruments. I show that a Synthetic Control Estimator of this form will be asymptotically unbiased as the number of pre-treatment time periods goes to infinity, even when pre-treatment fit is imperfect and the number of units is fixed. Furthermore, if both the number of pre-treatment and post-treatment time periods go to infinity, then averages of treatment effects can be consistently estimated. I conduct simulations and an empirical application to compare the performance of this method with existing approaches in the literature."}, "https://arxiv.org/abs/2312.12952": {"title": "High-dimensional sparse classification using exponential weighting with empirical hinge loss", "link": "https://arxiv.org/abs/2312.12952", "description": "arXiv:2312.12952v2 Announce Type: replace \nAbstract: In this study, we address the problem of high-dimensional binary classification. Our proposed solution involves employing an aggregation technique founded on exponential weights and empirical hinge loss. Through the employment of a suitable sparsity-inducing prior distribution, we demonstrate that our method yields favorable theoretical results on prediction error. The efficiency of our procedure is achieved through the utilization of Langevin Monte Carlo, a gradient-based sampling approach. To illustrate the effectiveness of our approach, we conduct comparisons with the logistic Lasso on simulated data and a real dataset. Our method frequently demonstrates superior performance compared to the logistic Lasso."}, "https://arxiv.org/abs/2211.12200": {"title": "Fast Computer Model Calibration using Annealed and Transformed Variational Inference", "link": "https://arxiv.org/abs/2211.12200", "description": "arXiv:2211.12200v2 Announce Type: replace-cross \nAbstract: Computer models play a crucial role in numerous scientific and engineering domains. To ensure the accuracy of simulations, it is essential to properly calibrate the input parameters of these models through statistical inference. While Bayesian inference is the standard approach for this task, employing Markov Chain Monte Carlo methods often encounters computational hurdles due to the costly evaluation of likelihood functions and slow mixing rates. Although variational inference (VI) can be a fast alternative to traditional Bayesian approaches, VI has limited applicability due to boundary issues and local optima problems. 
To address these challenges, we propose flexible VI methods based on deep generative models that do not require parametric assumptions on the variational distribution. We embed a surjective transformation in our framework to avoid posterior truncation at the boundary. Additionally, we provide theoretical conditions that guarantee the success of the algorithm. Furthermore, our temperature annealing scheme can prevent being trapped in local optima through a series of intermediate posteriors. We apply our method to infectious disease models and a geophysical model, illustrating that the proposed method can provide fast and accurate inference compared to its competitors."}, "https://arxiv.org/abs/2403.03975": {"title": "Robust covariance estimation and explainable outlier detection for matrix-valued data", "link": "https://arxiv.org/abs/2403.03975", "description": "arXiv:2403.03975v1 Announce Type: new \nAbstract: The minimum covariance determinant (MCD) estimator is a popular method for robustly estimating the mean and covariance of multivariate data. We extend the MCD to the setting where the observations are matrices rather than vectors and introduce the matrix minimum covariance determinant (MMCD) estimators for robust parameter estimation. These estimators hold equivariance properties, achieve a high breakdown point, and are consistent under elliptical matrix-variate distributions. We have also developed an efficient algorithm with convergence guarantees to compute the MMCD estimators. Using the MMCD estimators, we can compute robust Mahalanobis distances that can be used for outlier detection. Those distances can be decomposed into outlyingness contributions from each cell, row, or column of a matrix-variate observation using Shapley values, a concept for outlier explanation recently introduced in the multivariate setting. Simulations and examples reveal the excellent properties and usefulness of the robust estimators."}, "https://arxiv.org/abs/2403.04058": {"title": "Plant-Capture Methods for Estimating Population Size from Uncertain Plant Captures", "link": "https://arxiv.org/abs/2403.04058", "description": "arXiv:2403.04058v1 Announce Type: new \nAbstract: Plant-capture is a variant of classical capture-recapture methods used to estimate the size of a population. In this method, decoys referred to as \"plants\" are introduced into the population in order to estimate the capture probability. The method has shown considerable success in estimating population sizes from limited samples in many epidemiological, ecological, and demographic studies. However, previous plant-recapture studies have not systematically accounted for uncertainty in the capture status of each individual plant. In this work, we propose various approaches to formally incorporate uncertainty into the plant-capture model arising from (i) the capture status of plants and (ii) the heterogeneity between multiple survey sites. We present two inference methods and compare their performance in simulation studies. 
We then apply our methods to estimate the size of the homeless population in several US cities using the large-scale \"S-night\" study conducted by the US Census Bureau."}, "https://arxiv.org/abs/2403.04131": {"title": "Extract Mechanisms from Heterogeneous Effects: Identification Strategy for Mediation Analysis", "link": "https://arxiv.org/abs/2403.04131", "description": "arXiv:2403.04131v1 Announce Type: new \nAbstract: Understanding causal mechanisms is essential for explaining and generalizing empirical phenomena. Causal mediation analysis offers statistical techniques to quantify mediation effects. However, existing methods typically require strong identification assumptions or sophisticated research designs. We develop a new identification strategy that simplifies these assumptions, enabling the simultaneous estimation of causal and mediation effects. The strategy is based on a novel decomposition of total treatment effects, which transforms the challenging mediation problem into a simple linear regression problem. The new method establishes a link between causal mediation and causal moderation. We discuss several research designs and estimators to increase the usability of our identification strategy for a variety of empirical studies. We demonstrate the application of our method by estimating the causal mediation effect in experiments concerning common pool resource governance and voting information. Additionally, we have created statistical software to facilitate the implementation of our method."}, "https://arxiv.org/abs/2403.04345": {"title": "A Novel Theoretical Framework for Exponential Smoothing", "link": "https://arxiv.org/abs/2403.04345", "description": "arXiv:2403.04345v1 Announce Type: new \nAbstract: Simple Exponential Smoothing is a classical technique used for smoothing time series data by assigning exponentially decreasing weights to past observations through a recursive equation; it is sometimes presented as a rule of thumb procedure. We introduce a novel theoretical perspective where the recursive equation that defines simple exponential smoothing occurs naturally as a stochastic gradient ascent scheme to optimize a sequence of Gaussian log-likelihood functions. Under this lens of analysis, our main theorem shows that, in a general setting, simple exponential smoothing converges to a neighborhood of the trend of a trend-stationary stochastic process. This offers a novel theoretical assurance that the exponential smoothing procedure yields reliable estimators of the underlying trend, shedding light on long-standing observations in the literature regarding the robustness of simple exponential smoothing."}, "https://arxiv.org/abs/2403.04354": {"title": "A Logarithmic Mean Divisia Index Decomposition of CO$_2$ Emissions from Energy Use in Romania", "link": "https://arxiv.org/abs/2403.04354", "description": "arXiv:2403.04354v1 Announce Type: new \nAbstract: Carbon emissions have become an alarming indicator and an intricate challenge, fueling an extended debate about climate change. The growing reliance on fossil fuels for economic progress, alongside the need to reduce carbon output, has turned into a substantial global challenge. 
The aim of this paper is to examine the driving factors of CO$_2$ emissions from the energy sector in Romania during the period 2008-2022 using the log mean Divisia index (LMDI) method. The analysis takes into account five items: CO$_2$ emissions, primary energy resources, energy consumption, gross domestic product and population, based on which the contributions of carbon intensity, energy mix, generating efficiency, economy, and population to the change in emissions are calculated. The results indicate that the generating efficiency effect (-90968.57) is the largest inhibiting factor, while the economic effect (69084.04) is the largest positive factor, acting to increase CO$_2$ emissions."}, "https://arxiv.org/abs/2403.04564": {"title": "Estimating hidden population size from a single respondent-driven sampling survey", "link": "https://arxiv.org/abs/2403.04564", "description": "arXiv:2403.04564v1 Announce Type: new \nAbstract: This work is concerned with the estimation of hard-to-reach population sizes using a single respondent-driven sampling (RDS) survey, a variant of chain-referral sampling that leverages social relationships to reach members of a hidden population. The popularity of RDS as a standard approach for surveying hidden populations brings theoretical and methodological challenges regarding the estimation of population sizes, mainly for public health purposes. This paper proposes a frequentist, model-based framework for estimating the size of a hidden population using a network-based approach. An optimization algorithm is proposed for obtaining the identification region of the target parameter when model assumptions are violated. We characterize the asymptotic behavior of our proposed methodology and assess its finite sample performance under departures from model assumptions."}, "https://arxiv.org/abs/2403.04613": {"title": "Simultaneous Conformal Prediction of Missing Outcomes with Propensity Score $\\epsilon$-Discretization", "link": "https://arxiv.org/abs/2403.04613", "description": "arXiv:2403.04613v1 Announce Type: new \nAbstract: We study the problem of simultaneous predictive inference on multiple outcomes missing at random. We consider a suite of possible simultaneous coverage properties, conditionally on the missingness pattern and on the -- possibly discretized/binned -- feature values. For data with discrete feature distributions, we develop a procedure which attains feature- and missingness-conditional coverage; and further improve it via pooling its results after partitioning the unobserved outcomes. To handle general continuous feature distributions, we introduce methods based on discretized feature values. To mitigate the issue that feature-discretized data may fail to remain missing at random, we propose propensity score $\\epsilon$-discretization. This approach is inspired by the balancing property of the propensity score, namely that the missing data mechanism is independent of the outcome conditional on the propensity [Rosenbaum and Rubin (1983)]. We show that the resulting pro-CP method achieves propensity score discretized feature- and missingness-conditional coverage, when the propensity score is known exactly or is estimated sufficiently accurately. Furthermore, we consider a stronger inferential target, the squared-coverage guarantee, which penalizes the spread of the coverage proportion. We propose methods -- termed pro-CP2 -- to achieve it with similar conditional properties as we have shown for usual coverage. 
A key novel technical contribution in our results is that propensity score discretization leads to a notion of approximate balancing, which we formalize and characterize precisely. In extensive empirical experiments on simulated data and on a job search intervention dataset, we illustrate that our procedures provide informative prediction sets with valid conditional coverage."}, "https://arxiv.org/abs/2403.04766": {"title": "Nonparametric Regression under Cluster Sampling", "link": "https://arxiv.org/abs/2403.04766", "description": "arXiv:2403.04766v1 Announce Type: new \nAbstract: This paper develops a general asymptotic theory for nonparametric kernel regression in the presence of cluster dependence. We examine nonparametric density estimation, Nadaraya-Watson kernel regression, and local linear estimation. Our theory accommodates growing and heterogeneous cluster sizes. We derive asymptotic conditional bias and variance, establish uniform consistency, and prove asymptotic normality. Our findings reveal that under heterogeneous cluster sizes, the asymptotic variance includes a new term reflecting within-cluster dependence, which is overlooked when cluster sizes are presumed to be bounded. We propose valid approaches for bandwidth selection and inference, introduce estimators of the asymptotic variance, and demonstrate their consistency. In simulations, we verify the effectiveness of the cluster-robust bandwidth selection and show that the derived cluster-robust confidence interval improves the coverage ratio. We illustrate the application of these methods using a policy-targeting dataset in development economics."}, "https://arxiv.org/abs/2403.04039": {"title": "Sample size planning for conditional counterfactual mean estimation with a K-armed randomized experiment", "link": "https://arxiv.org/abs/2403.04039", "description": "arXiv:2403.04039v1 Announce Type: cross \nAbstract: We cover how to determine a sufficiently large sample size for a $K$-armed randomized experiment in order to estimate conditional counterfactual expectations in data-driven subgroups. The sub-groups can be output by any feature space partitioning algorithm, including as defined by binning users having similar predictive scores or as defined by a learned policy tree. After carefully specifying the inference target, a minimum confidence level, and a maximum margin of error, the key is to turn the original goal into a simultaneous inference problem where the recommended sample size to offset an increased possibility of estimation error is directly related to the number of inferences to be conducted. Given a fixed sample size budget, our result allows us to invert the question to one about the feasible number of treatment arms or partition complexity (e.g. number of decision tree leaves). Using policy trees to learn sub-groups, we evaluate our nominal guarantees on a large publicly-available randomized experiment test data set."}, "https://arxiv.org/abs/2403.04236": {"title": "Regularized DeepIV with Model Selection", "link": "https://arxiv.org/abs/2403.04236", "description": "arXiv:2403.04236v1 Announce Type: cross \nAbstract: In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. 
While recent advancements in machine learning have introduced flexible methods for IV estimation, they often encounter one or more of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) requiring minimax computation oracle, which is highly unstable in practice; (3) absence of model selection procedure. In this paper, we present the first method and analysis that can avoid all three limitations, while still enabling general function approximation. Specifically, we propose a minimax-oracle-free method called Regularized DeepIV (RDIV) regression that can converge to the least-norm IV solution. Our method consists of two stages: first, we learn the conditional distribution of covariates, and by utilizing the learned distribution, we learn the estimator by minimizing a Tikhonov-regularized loss function. We further show that our method allows model selection procedures that can achieve the oracle rates in the misspecified regime. When extended to an iterative estimator, our method matches the current state-of-the-art convergence rate. Our method is a Tikhonov regularized variant of the popular DeepIV method with a non-parametric MLE first-stage estimator, and our results provide the first rigorous guarantees for this empirically used method, showcasing the importance of regularization which was absent from the original work."}, "https://arxiv.org/abs/2403.04328": {"title": "A dual approach to nonparametric characterization for random utility models", "link": "https://arxiv.org/abs/2403.04328", "description": "arXiv:2403.04328v1 Announce Type: cross \nAbstract: This paper develops a novel characterization for random utility models (RUM), which turns out to be a dual representation of the characterization by Kitamura and Stoye (2018, ECMA). For a given family of budgets and its \"patch\" representation \\'a la Kitamura and Stoye, we construct a matrix $\\Xi$ of which each row vector indicates the structure of possible revealed preference relations in each subfamily of budgets. Then, it is shown that a stochastic demand system on the patches of budget lines, say $\\pi$, is consistent with a RUM, if and only if $\\Xi\\pi \\geq \\mathbb{1}$. In addition to providing a concise closed form characterization, especially when $\\pi$ is inconsistent with RUMs, the vector $\\Xi\\pi$ also contains information concerning (1) sub-families of budgets in which cyclical choices must occur with positive probabilities, and (2) the maximal possible weights on rational choice patterns in a population. The notion of Chv\\'atal rank of polytopes and the duality theorem in linear programming play key roles to obtain these results."}, "https://arxiv.org/abs/2211.00329": {"title": "Weak Identification in Low-Dimensional Factor Models with One or Two Factors", "link": "https://arxiv.org/abs/2211.00329", "description": "arXiv:2211.00329v2 Announce Type: replace \nAbstract: This paper describes how to reparameterize low-dimensional factor models with one or two factors to fit weak identification theory developed for generalized method of moments models. Some identification-robust tests, here called \"plug-in\" tests, require a reparameterization to distinguish weakly identified parameters from strongly identified parameters. The reparameterizations in this paper make plug-in tests available for subvector hypotheses in low-dimensional factor models with one or two factors. 
Simulations show that the plug-in tests are less conservative than identification-robust tests that use the original parameterization. An empirical application to a factor model of parental investments in children is included."}, "https://arxiv.org/abs/2301.13368": {"title": "Misspecification-robust Sequential Neural Likelihood for Simulation-based Inference", "link": "https://arxiv.org/abs/2301.13368", "description": "arXiv:2301.13368v2 Announce Type: replace \nAbstract: Simulation-based inference techniques are indispensable for parameter estimation of mechanistic and simulable models with intractable likelihoods. While traditional statistical approaches like approximate Bayesian computation and Bayesian synthetic likelihood have been studied under well-specified and misspecified settings, they often suffer from inefficiencies due to wasted model simulations. Neural approaches, such as sequential neural likelihood (SNL) avoid this wastage by utilising all model simulations to train a neural surrogate for the likelihood function. However, the performance of SNL under model misspecification is unreliable and can result in overconfident posteriors centred around an inaccurate parameter estimate. In this paper, we propose a novel SNL method, which through the incorporation of additional adjustment parameters, is robust to model misspecification and capable of identifying features of the data that the model is not able to recover. We demonstrate the efficacy of our approach through several illustrative examples, where our method gives more accurate point estimates and uncertainty quantification than SNL."}, "https://arxiv.org/abs/2302.09211": {"title": "Bayesian Covariance Estimation for Multi-group Matrix-variate Data", "link": "https://arxiv.org/abs/2302.09211", "description": "arXiv:2302.09211v2 Announce Type: replace \nAbstract: Multi-group covariance estimation for matrix-variate data with small within group sample sizes is a key part of many data analysis tasks in modern applications. To obtain accurate group-specific covariance estimates, shrinkage estimation methods which shrink an unstructured, group-specific covariance either across groups towards a pooled covariance or within each group towards a Kronecker structure have been developed. However, in many applications, it is unclear which approach will result in more accurate covariance estimates. In this article, we present a hierarchical prior distribution which flexibly allows for both types of shrinkage. The prior linearly combines shrinkage across groups towards a shared pooled covariance and shrinkage within groups towards a group-specific Kronecker covariance. We illustrate the utility of the proposed prior in speech recognition and an analysis of chemical exposure data."}, "https://arxiv.org/abs/2307.10651": {"title": "Distributional Regression for Data Analysis", "link": "https://arxiv.org/abs/2307.10651", "description": "arXiv:2307.10651v2 Announce Type: replace \nAbstract: Flexible modeling of the entire distribution as a function of covariates is an important generalization of mean-based regression that has seen growing interest over the past decades in both the statistics and machine learning literature. This review outlines selected state-of-the-art statistical approaches to distributional regression, complemented with alternatives from machine learning. 
Topics covered include the similarities and differences between these approaches, extensions, properties and limitations, estimation procedures, and the availability of software. In view of the increasing complexity and availability of large-scale data, this review also discusses the scalability of traditional estimation methods, current trends, and open challenges. Illustrations are provided using data on childhood malnutrition in Nigeria and Australian electricity prices."}, "https://arxiv.org/abs/2308.00014": {"title": "A new mapping of technological interdependence", "link": "https://arxiv.org/abs/2308.00014", "description": "arXiv:2308.00014v2 Announce Type: replace \nAbstract: How does technological interdependence affect a sector's ability to innovate? This paper answers this question by looking at knowledge interdependence (knowledge spillovers and technological complementarities) and structural interdependence (intersectoral network linkages). We examine these two dimensions of technological interdependence by applying novel methods of text mining and network analysis to the documents of 6.5 million patents granted by the United States Patent and Trademark Office (USPTO) between 1976 and 2021. We show that both dimensions positively affect sector innovation. While the impact of knowledge interdependence is slightly larger in the long-term horizon, positive shocks affecting the network linkages (structural interdependence) produce greater and more enduring effects on innovation performance in a relatively short run. Our analysis also highlights that patent text contains a wealth of information often not captured by traditional innovation metrics, such as patent citations."}, "https://arxiv.org/abs/2308.05564": {"title": "Large Skew-t Copula Models and Asymmetric Dependence in Intraday Equity Returns", "link": "https://arxiv.org/abs/2308.05564", "description": "arXiv:2308.05564v2 Announce Type: replace \nAbstract: Skew-t copula models are attractive for the modeling of financial data because they allow for asymmetric and extreme tail dependence. We show that the copula implicit in the skew-t distribution of Azzalini and Capitanio (2003) allows for a higher level of pairwise asymmetric dependence than two popular alternative skew-t copulas. Estimation of this copula in high dimensions is challenging, and we propose a fast and accurate Bayesian variational inference (VI) approach to do so. The method uses a conditionally Gaussian generative representation of the skew-t distribution to define an augmented posterior that can be approximated accurately. A fast stochastic gradient ascent algorithm is used to solve the variational optimization. The new methodology is used to estimate skew-t factor copula models for intraday returns from 2017 to 2021 on 93 U.S. equities. The copula captures substantial heterogeneity in asymmetric dependence over equity pairs, in addition to the variability in pairwise correlations. 
We show that intraday predictive densities from the skew-t copula are more accurate than from some other copula models, while portfolio selection strategies based on the estimated pairwise tail dependencies improve performance relative to the benchmark index."}, "https://arxiv.org/abs/2310.00864": {"title": "Multi-Label Residual Weighted Learning for Individualized Combination Treatment Rule", "link": "https://arxiv.org/abs/2310.00864", "description": "arXiv:2310.00864v3 Announce Type: replace \nAbstract: Individualized treatment rules (ITRs) have been widely applied in many fields such as precision medicine and personalized marketing. Beyond the extensive studies on ITR for binary or multiple treatments, there is considerable interest in applying combination treatments. This paper introduces a novel ITR estimation method for combination treatments incorporating interaction effects among treatments. Specifically, we propose the generalized $\\psi$-loss as a non-convex surrogate in the residual weighted learning framework, offering desirable statistical and computational properties. Statistically, the minimizer of the proposed surrogate loss is Fisher-consistent with the optimal decision rules, incorporating interaction effects at any intensity level - a significant improvement over existing methods. Computationally, the proposed method applies the difference-of-convex algorithm for efficient computation. Through simulation studies and real-world data applications, we demonstrate the superior performance of the proposed method in recommending combination treatments."}, "https://arxiv.org/abs/2310.01748": {"title": "A generative approach to frame-level multi-competitor races", "link": "https://arxiv.org/abs/2310.01748", "description": "arXiv:2310.01748v3 Announce Type: replace \nAbstract: Multi-competitor races often feature complicated within-race strategies that are difficult to capture when training models on race outcome level data. Further, models which do not account for such strategic effects may suffer from confounded inferences and predictions. In this work we develop a general generative model for multi-competitor races which allows analysts to explicitly model certain strategic effects such as changing lanes or drafting and separate these impacts from competitor ability. The generative model allows one to simulate full races from any real or created starting position, which opens new avenues for attributing value to within-race actions and for performing counterfactual analyses. This methodology is sufficiently general to apply to any track-based multi-competitor race where both tracking data is available and competitor movement is well described by simultaneous forward and lateral movements. We apply this methodology to one-mile horse races using data provided by the New York Racing Association (NYRA) and the New York Thoroughbred Horsemen's Association (NYTHA) for the Big Data Derby 2022 Kaggle Competition. This data features granular tracking data for all horses at the frame level (occurring at approximately 4 Hz). 
We demonstrate how this model can yield new inferences, such as the estimation of horse-specific speed profiles which vary over phases of the race, and examples of posterior predictive counterfactual simulations to answer questions of interest such as starting lane impacts on race outcomes."}, "https://arxiv.org/abs/2311.17100": {"title": "Automatic cross-validation in structured models: Is it time to leave out leave-one-out?", "link": "https://arxiv.org/abs/2311.17100", "description": "arXiv:2311.17100v2 Announce Type: replace \nAbstract: Standard techniques such as leave-one-out cross-validation (LOOCV) might not be suitable for evaluating the predictive performance of models incorporating structured random effects. In such cases, the correlation between the training and test sets could have a notable impact on the model's prediction error. To overcome this issue, an automatic group construction procedure for leave-group-out cross validation (LGOCV) has recently emerged as a valuable tool for enhancing predictive performance measurement in structured models. The purpose of this paper is (i) to compare LOOCV and LGOCV within structured models, emphasizing model selection and predictive performance, and (ii) to provide real data applications in spatial statistics using complex structured models fitted with INLA, showcasing the utility of the automatic LGOCV method. First, we briefly review the key aspects of the recently proposed LGOCV method for automatic group construction in latent Gaussian models. We also demonstrate the effectiveness of this method for selecting the model with the highest predictive performance by simulating extrapolation tasks in both temporal and spatial data analyses. Finally, we provide insights into the effectiveness of the LGOCV method in modelling complex structured data, encompassing spatio-temporal multivariate count data, spatial compositional data, and spatio-temporal geospatial data."}, "https://arxiv.org/abs/2401.14512": {"title": "Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population", "link": "https://arxiv.org/abs/2401.14512", "description": "arXiv:2401.14512v2 Announce Type: replace \nAbstract: Randomized controlled trials (RCTs) serve as the cornerstone for understanding causal effects, yet extending inferences to target populations presents challenges due to effect heterogeneity and underrepresentation. Our paper addresses the critical issue of identifying and characterizing underrepresented subgroups in RCTs, proposing a novel framework for refining target populations to improve generalizability. We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups. ROOT optimizes the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate, ensuring more precise treatment effect estimations. Notably, ROOT generates interpretable characteristics of the underrepresented population, aiding researchers in effective communication. Our approach demonstrates improved precision and interpretability compared to alternatives, as illustrated with synthetic data experiments. We apply our methodology to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial -- investigating the effectiveness of medication for opioid use disorder -- to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). 
By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations."}, "https://arxiv.org/abs/2208.07581": {"title": "Regression modelling of spatiotemporal extreme U", "link": "https://arxiv.org/abs/2208.07581", "description": "arXiv:2208.07581v4 Announce Type: replace-cross \nAbstract: Risk management in many environmental settings requires an understanding of the mechanisms that drive extreme events. Useful metrics for quantifying such risk are extreme quantiles of response variables conditioned on predictor variables that describe, e.g., climate, biosphere and environmental states. Typically these quantiles lie outside the range of observable data and so, for estimation, require specification of parametric extreme value models within a regression framework. Classical approaches in this context utilise linear or additive relationships between predictor and response variables and suffer in either their predictive capabilities or computational efficiency; moreover, their simplicity is unlikely to capture the truly complex structures that lead to the creation of extreme wildfires. In this paper, we propose a new methodological framework for performing extreme quantile regression using artificial neural networks, which are able to capture complex non-linear relationships and scale well to high-dimensional data. The \"black box\" nature of neural networks means that they lack the desirable trait of interpretability often favoured by practitioners; thus, we unify linear and additive regression methodology with deep learning to create partially-interpretable neural networks that can be used for statistical inference but retain high prediction accuracy. To complement this methodology, we further propose a novel point process model for extreme values which overcomes the finite lower-endpoint problem associated with the generalised extreme value class of distributions. Efficacy of our unified framework is illustrated on U.S. wildfire data with a high-dimensional predictor set, and we illustrate vast improvements in predictive performance over linear and spline-based regression techniques."}, "https://arxiv.org/abs/2403.04912": {"title": "Bayesian Level-Set Clustering", "link": "https://arxiv.org/abs/2403.04912", "description": "arXiv:2403.04912v1 Announce Type: new \nAbstract: Broadly, the goal when clustering data is to separate observations into meaningful subgroups. The rich variety of methods for clustering reflects the fact that the relevant notion of meaningful clusters varies across applications. The classical Bayesian approach clusters observations by their association with components of a mixture model; the choice of the class of components allows flexibility to capture a range of meaningful cluster notions. However, in practice the range is somewhat limited as difficulties with computation and cluster identifiability arise as components are made more flexible. Instead of mixture component attribution, we consider clusterings that are functions of the data and the density $f$, which allows us to separate flexible density estimation from clustering. Within this framework, we develop a method to cluster data into connected components of a level set of $f$. Under mild conditions, we establish that our Bayesian level-set (BALLET) clustering methodology yields consistent estimates, and we highlight its performance in a variety of toy and simulated data examples. 
Finally, through an application to astronomical data we show the method performs favorably relative to the popular level-set clustering algorithm DBSCAN in terms of accuracy, insensitivity to tuning parameters, and quantification of uncertainty."}, "https://arxiv.org/abs/2403.04915": {"title": "Bayesian Inference for High-dimensional Time Series by Latent Process Modeling", "link": "https://arxiv.org/abs/2403.04915", "description": "arXiv:2403.04915v1 Announce Type: new \nAbstract: Time series data arising in many applications nowadays are high-dimensional. A large number of parameters describe features of these time series. We propose a novel approach to modeling a high-dimensional time series through several independent univariate time series, which are then orthogonally rotated and sparsely linearly transformed. With this approach, any specified intrinsic relations among component time series given by a graphical structure can be maintained at all time snapshots. We call the resulting process an Orthogonally-rotated Univariate Time series (OUT). Key structural properties of time series such as stationarity and causality can be easily accommodated in the OUT model. For Bayesian inference, we put suitable prior distributions on the spectral densities of the independent latent time series, the orthogonal rotation matrix, and the common precision matrix of the component time series at every time point. A likelihood is constructed using the Whittle approximation for univariate latent time series. An efficient Markov Chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We study the convergence of the pseudo-posterior distribution based on the Whittle likelihood for the model's parameters by developing a new general posterior convergence theorem for pseudo-posteriors. We find that the posterior contraction rate for independent observations essentially prevails in the OUT model under very mild conditions on the temporal dependence, described in terms of the smoothness of the corresponding spectral densities. Through a simulation study, we compare the accuracy of estimating the parameters and identifying the graphical structure with other approaches. We apply the proposed methodology to analyze a dataset on different industrial components of the US gross domestic product between 2010 and 2019 and predict future observations."}, "https://arxiv.org/abs/2403.05080": {"title": "On an Empirical Likelihood based Solution to the Approximate Bayesian Computation Problem", "link": "https://arxiv.org/abs/2403.05080", "description": "arXiv:2403.05080v1 Announce Type: new \nAbstract: Approximate Bayesian Computation (ABC) methods are applicable to statistical models specified by generative processes with analytically intractable likelihoods. These methods try to approximate the posterior density of a model parameter by comparing the observed data with additional process-generated simulated datasets. For computational benefit, only the values of certain well-chosen summary statistics are usually compared, instead of the whole dataset. Most ABC procedures are computationally expensive, justified only heuristically, and have poor asymptotic properties. In this article, we introduce a new empirical likelihood-based approach to the ABC paradigm called ABCel. 
The proposed procedure is computationally tractable and approximates the target log posterior of the parameter as a sum of two functions of the data -- namely, the mean of the optimal log-empirical likelihood weights and the estimated differential entropy of the summary functions. We rigorously justify the procedure via direct and reverse information projections onto appropriate classes of probability densities. Past applications of empirical likelihood in ABC demanded constraints based on analytically tractable estimating functions that involve both the data and the parameter; although by the nature of the ABC problem such functions may not be available in general. In contrast, we use constraints that are functions of the summary statistics only. Equally importantly, we show that our construction directly connects to the reverse information projection. We show that ABCel is posterior consistent and has highly favourable asymptotic properties. Its construction justifies the use of simple summary statistics like moments, quantiles, etc, which in practice produce an accurate approximation of the posterior density. We illustrate the performance of the proposed procedure in a range of applications."}, "https://arxiv.org/abs/2403.05331": {"title": "Causality and extremes", "link": "https://arxiv.org/abs/2403.05331", "description": "arXiv:2403.05331v1 Announce Type: new \nAbstract: In this work, we summarize the state-of-the-art methods in causal inference for extremes. In a non-exhaustive way, we start by describing an extremal approach to quantile treatment effect where the treatment has an impact on the tail of the outcome. Then, we delve into two primary causal structures for extremes, offering in-depth insights into their identifiability. Additionally, we discuss causal structure learning in relation to these two models as well as in a model-agnostic framework. To illustrate the practicality of the approaches, we apply and compare these different methods using a Seine network dataset. This work concludes with a summary and outlines potential directions for future research."}, "https://arxiv.org/abs/2403.05336": {"title": "Estimating time-varying exposure effects through continuous-time modelling in Mendelian randomization", "link": "https://arxiv.org/abs/2403.05336", "description": "arXiv:2403.05336v1 Announce Type: new \nAbstract: Mendelian randomization is an instrumental variable method that utilizes genetic information to investigate the causal effect of a modifiable exposure on an outcome. In most cases, the exposure changes over time. Understanding the time-varying causal effect of the exposure can yield detailed insights into mechanistic effects and the potential impact of public health interventions. Recently, a growing number of Mendelian randomization studies have attempted to explore time-varying causal effects. However, the proposed approaches oversimplify temporal information and rely on overly restrictive structural assumptions, limiting their reliability in addressing time-varying causal problems. This paper considers a novel approach to estimate time-varying effects through continuous-time modelling by combining functional principal component analysis and weak-instrument-robust techniques. Our method effectively utilizes available data without making strong structural assumptions and can be applied in general settings where the exposure measurements occur at different timepoints for different individuals. 
We demonstrate through simulations that our proposed method performs well in estimating time-varying effects and provides reliable inference results when the time-varying effect form is correctly specified. The method could theoretically be used to estimate arbitrarily complex time-varying effects. However, there is a trade-off between model complexity and instrument strength: estimating complex time-varying effects requires instruments that are unrealistically strong. We illustrate the application of this method in a case study examining the time-varying effects of systolic blood pressure on urea levels."}, "https://arxiv.org/abs/2403.05343": {"title": "Disentangling the Timescales of a Complex System: A Bayesian Approach to Temporal Network Analysis", "link": "https://arxiv.org/abs/2403.05343", "description": "arXiv:2403.05343v1 Announce Type: new \nAbstract: Changes in the timescales at which complex systems evolve are essential to predicting critical transitions and catastrophic failures. Disentangling the timescales of the dynamics governing complex systems remains a key challenge. With this study, we introduce an integrated Bayesian framework based on temporal network models to address this challenge. We focus on two methodologies: change point detection for identifying shifts in system dynamics, and a spectrum analysis for inferring the distribution of timescales. Applied to synthetic and empirical datasets, these methodologies robustly identify critical transitions and comprehensively map the dominant and subsidiary timescales in complex systems. This dual approach offers a powerful tool for analyzing temporal networks, significantly enhancing our understanding of dynamic behaviors in complex systems."}, "https://arxiv.org/abs/2403.05359": {"title": "Applying Non-negative Matrix Factorization with Covariates to the Longitudinal Data as Growth Curve Model", "link": "https://arxiv.org/abs/2403.05359", "description": "arXiv:2403.05359v1 Announce Type: new \nAbstract: Using Non-negative Matrix Factorization (NMF), the observed matrix can be approximated by the product of the basis and coefficient matrices. Moreover, if the coefficient vectors are explained by the covariates for each individual, the coefficient matrix can be written as the product of the parameter matrix and the covariate matrix, and additionally described in the framework of Non-negative Matrix tri-Factorization (tri-NMF) with covariates. Consequently, this is equal to the mean structure of the Growth Curve Model (GCM). The difference is that the basis matrix for GCM is given by the analyst, whereas that for NMF with covariates is unknown and optimized. In this study, we applied NMF with covariates to longitudinal data and compared it with GCM. We have also published an R package that implements this method, and we show how to use it through examples of data analyses including longitudinal measurements, spatiotemporal data and text data. In particular, we demonstrate the usefulness of Gaussian kernel functions as covariates."}, "https://arxiv.org/abs/2403.05373": {"title": "Regularized Principal Spline Functions to Mitigate Spatial Confounding", "link": "https://arxiv.org/abs/2403.05373", "description": "arXiv:2403.05373v1 Announce Type: new \nAbstract: This paper proposes a new approach to address the problem of unmeasured confounding in spatial designs. 
Spatial confounding occurs when some confounding variables are unobserved and not included in the model, leading to distorted inferential results about the effect of an exposure on an outcome. We show the relationship existing between the confounding bias of a non-spatial model and that of a semi-parametric model that includes a basis matrix to represent the unmeasured confounder conditional on the exposure. This relationship holds for any basis expansion, however it is shown that using the semi-parametric approach guarantees a reduction in the confounding bias only under certain circumstances, which are related to the spatial structures of the exposure and the unmeasured confounder, the type of basis expansion utilized, and the regularization mechanism. To adjust for spatial confounding, and therefore try to recover the effect of interest, we propose a Bayesian semi-parametric regression model, where an expansion matrix of principal spline basis functions is used to approximate the unobserved factor, and spike-and-slab priors are imposed on the respective expansion coefficients in order to select the most important bases. From the results of an extensive simulation study, we conclude that our proposal is able to reduce the confounding bias with respect to the non-spatial model, and it also seems more robust to bias amplification than competing approaches."}, "https://arxiv.org/abs/2403.05503": {"title": "Linear Model Estimators and Consistency under an Infill Asymptotic Domain", "link": "https://arxiv.org/abs/2403.05503", "description": "arXiv:2403.05503v1 Announce Type: new \nAbstract: Functional data present as functions or curves possessing a spatial or temporal component. These components by nature have a fixed observational domain. Consequently, any asymptotic investigation requires modelling the increased correlation among observations as density increases due to this fixed domain constraint. One such appropriate stochastic process is the Ornstein-Uhlenbeck process. Utilizing this spatial autoregressive process, we demonstrate that parameter estimators for a simple linear regression model display inconsistency in an infill asymptotic domain. Such results are contrary to those expected under the customary increasing domain asymptotics. Although none of these estimator variances approach zero, they do display a pattern of diminishing return regarding decreasing estimator variance as sample size increases. This may prove invaluable to a practitioner as this indicates perhaps an optimal sample size to cease data collection. This in turn reduces time and data collection cost because little information is gained in sampling beyond a certain sample size."}, "https://arxiv.org/abs/2403.04793": {"title": "A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series", "link": "https://arxiv.org/abs/2403.04793", "description": "arXiv:2403.04793v1 Announce Type: cross \nAbstract: Causal inference is a fundamental research topic for discovering the cause-effect relationships in many disciplines. However, not all algorithms are equally well-suited for a given dataset. For instance, some approaches may only be able to identify linear relationships, while others are applicable for non-linearities. Algorithms further vary in their sensitivity to noise and their ability to infer causal information from coupled vs. non-coupled time series. Therefore, different algorithms often generate different causal relationships for the same input. 
To achieve a more robust causal inference result, this publication proposes a novel data-driven two-phase multi-split causal ensemble model to combine the strengths of different causality base algorithms. In comparison to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase. To achieve this, the data are initially divided into several partitions and the base algorithms are applied to each partition. Subsequently, Gaussian mixture models are used to identify the causal relationships derived from the different partitions that are likely to be valid. In the second phase, the identified relationships from each base algorithm are then merged based on three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments using three synthetic datasets with different volumes and complexity, which are specifically designed to test causality detection methods under different circumstances while knowing the ground truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the use of the proposed method could hence lead to more robust and reliable causality results."}, "https://arxiv.org/abs/2403.04919": {"title": "Identifying Causal Effects Under Functional Dependencies", "link": "https://arxiv.org/abs/2403.04919", "description": "arXiv:2403.04919v1 Announce Type: cross \nAbstract: We study the identification of causal effects, motivated by two improvements to identifiability which can be attained if one knows that some variables in a causal graph are functionally determined by their parents (without needing to know the specific functions). First, an unidentifiable causal effect may become identifiable when certain variables are functional. Second, certain functional variables can be excluded from being observed without affecting the identifiability of a causal effect, which may significantly reduce the number of needed variables in observational data. Our results are largely based on an elimination procedure which removes functional variables from a causal graph while preserving key properties in the resulting causal graph, including the identifiability of causal effects."}, "https://arxiv.org/abs/2403.05006": {"title": "Provable Multi-Party Reinforcement Learning with Diverse Human Feedback", "link": "https://arxiv.org/abs/2403.05006", "description": "arXiv:2403.05006v1 Announce Type: cross \nAbstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work \\textit{initiates} the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. 
We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity."}, "https://arxiv.org/abs/2403.05425": {"title": "An Adaptive Dimension Reduction Estimation Method for High-dimensional Bayesian Optimization", "link": "https://arxiv.org/abs/2403.05425", "description": "arXiv:2403.05425v1 Announce Type: cross \nAbstract: Bayesian optimization (BO) has shown impressive results in a variety of applications within low-to-moderate dimensional Euclidean spaces. However, extending BO to high-dimensional settings remains a significant challenge. We address this challenge by proposing a two-step optimization framework. Initially, we identify the effective dimension reduction (EDR) subspace for the objective function using the minimum average variance estimation (MAVE) method. Subsequently, we construct a Gaussian process model within this EDR subspace and optimize it using the expected improvement criterion. Our algorithm offers the flexibility to operate these steps either concurrently or in sequence. In the sequential approach, we meticulously balance the exploration-exploitation trade-off by distributing the sampling budget between subspace estimation and function optimization, and the convergence rate of our algorithm in high-dimensional contexts has been established. Numerical experiments validate the efficacy of our method in challenging scenarios."}, "https://arxiv.org/abs/2403.05461": {"title": "On varimax asymptotics in network models and spectral methods for dimensionality reduction", "link": "https://arxiv.org/abs/2403.05461", "description": "arXiv:2403.05461v1 Announce Type: cross \nAbstract: Varimax factor rotations, while popular among practitioners in psychology and statistics since being introduced by H. Kaiser, have historically been viewed with skepticism and suspicion by some theoreticians and mathematical statisticians. Now, work by K. Rohe and M. Zeng provides new, fundamental insight: varimax rotations provably perform statistical estimation in certain classes of latent variable models when paired with spectral-based matrix truncations for dimensionality reduction. We build on this newfound understanding of varimax rotations by developing further connections to network analysis and spectral methods rooted in entrywise matrix perturbation analysis. Concretely, this paper establishes the asymptotic multivariate normality of vectors in varimax-transformed Euclidean point clouds that represent low-dimensional node embeddings in certain latent space random graph models. We address related concepts including network sparsity, data denoising, and the role of matrix rank in latent variable parameterizations. Collectively, these findings, at the confluence of classical and contemporary multivariate analysis, reinforce methodology and inference procedures grounded in matrix factorization-based techniques. 
Numerical examples illustrate our findings and supplement our discussion."}, "https://arxiv.org/abs/2105.12081": {"title": "Group selection and shrinkage: Structured sparsity for semiparametric additive models", "link": "https://arxiv.org/abs/2105.12081", "description": "arXiv:2105.12081v3 Announce Type: replace \nAbstract: Sparse regression and classification estimators that respect group structures have application to an assortment of statistical and machine learning problems, from multitask learning to sparse additive modeling to hierarchical selection. This work introduces structured sparse estimators that combine group subset selection with shrinkage. To accommodate sophisticated structures, our estimators allow for arbitrary overlap between groups. We develop an optimization framework for fitting the nonconvex regularization surface and present finite-sample error bounds for estimation of the regression function. As an application requiring structure, we study sparse semiparametric additive modeling, a procedure that allows the effect of each predictor to be zero, linear, or nonlinear. For this task, the new estimators improve across several metrics on synthetic data compared to alternatives. Finally, we demonstrate their efficacy in modeling supermarket foot traffic and economic recessions using many predictors. These demonstrations suggest sparse semiparametric additive models, fit using the new estimators, are an excellent compromise between fully linear and fully nonparametric alternatives. All of our algorithms are made available in the scalable implementation grpsel."}, "https://arxiv.org/abs/2107.02739": {"title": "Shapes as Product Differentiation: Neural Network Embedding in the Analysis of Markets for Fonts", "link": "https://arxiv.org/abs/2107.02739", "description": "arXiv:2107.02739v2 Announce Type: replace \nAbstract: Many differentiated products have key attributes that are unstructured and thus high-dimensional (e.g., design, text). Instead of treating unstructured attributes as unobservables in economic models, quantifying them can be important to answer interesting economic questions. To propose an analytical framework for these types of products, this paper considers one of the simplest design products-fonts-and investigates merger and product differentiation using an original dataset from the world's largest online marketplace for fonts. We quantify font shapes by constructing embeddings from a deep convolutional neural network. Each embedding maps a font's shape onto a low-dimensional vector. In the resulting product space, designers are assumed to engage in Hotelling-type spatial competition. From the image embeddings, we construct two alternative measures that capture the degree of design differentiation. We then study the causal effects of a merger on the merging firm's creative decisions using the constructed measures in a synthetic control method. We find that the merger causes the merging firm to increase the visual variety of font design. 
Notably, such effects are not captured when using traditional measures for product offerings (e.g., specifications and the number of products) constructed from structured data."}, "https://arxiv.org/abs/2108.05858": {"title": "An Optimal Transport Approach to Estimating Causal Effects via Nonlinear Difference-in-Differences", "link": "https://arxiv.org/abs/2108.05858", "description": "arXiv:2108.05858v2 Announce Type: replace \nAbstract: We propose a nonlinear difference-in-differences method to estimate multivariate counterfactual distributions in classical treatment and control study designs with observational data. Our approach sheds new light on existing approaches like the changes-in-changes and the classical semiparametric difference-in-differences estimator and generalizes them to settings with multivariate heterogeneity in the outcomes. The main benefit of this extension is that it allows for arbitrary dependence and heterogeneity in the joint outcomes. We demonstrate its utility both on synthetic and real data. In particular, we revisit the classical Card \\& Krueger dataset, examining the effect of a minimum wage increase on employment in fast food restaurants; a reanalysis with our method reveals that restaurants tend to substitute full-time with part-time labor after a minimum wage increase at a faster pace. A previous version of this work was entitled \"An optimal transport approach to causal inference\"."}, "https://arxiv.org/abs/2112.13738": {"title": "Evaluation of binary classifiers for asymptotically dependent and independent extremes", "link": "https://arxiv.org/abs/2112.13738", "description": "arXiv:2112.13738v3 Announce Type: replace \nAbstract: Machine learning classification methods usually assume that all possible classes are sufficiently present within the training set. Due to their inherent rarities, extreme events are always under-represented and classifiers tailored for predicting extremes need to be carefully designed to handle this under-representation. In this paper, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This is also related to the topic of scoring rules used in forecasting literature. In this context, we propose and study a risk function adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. A simulation study compares different classifiers and indicates their performance with respect to our risk function. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and tests their capacity at forecasting river discharges from other river stations."}, "https://arxiv.org/abs/2211.01492": {"title": "Fast, effective, and coherent time series modeling using the sparsity-ranked lasso", "link": "https://arxiv.org/abs/2211.01492", "description": "arXiv:2211.01492v2 Announce Type: replace \nAbstract: The sparsity-ranked lasso (SRL) has been developed for model selection and estimation in the presence of interactions and polynomials. The main tenet of the SRL is that an algorithm should be more skeptical of higher-order polynomials and interactions *a priori* compared to main effects, and hence the inclusion of these more complex terms should require a higher level of evidence. 
In time series, the same idea of ranked prior skepticism can be applied to the possibly seasonal autoregressive (AR) structure of the series during the model fitting process, becoming especially useful in settings with uncertain or multiple modes of seasonality. The SRL can naturally incorporate exogenous variables, with streamlined options for inference and/or feature selection. The fitting process is quick even for large series with a high-dimensional feature set. In this work, we discuss both the formulation of this procedure and the software we have developed for its implementation via the **fastTS** R package. We explore the performance of our SRL-based approach in a novel application involving the autoregressive modeling of hourly emergency room arrivals at the University of Iowa Hospitals and Clinics. We find that the SRL is considerably faster than its competitors, while producing more accurate predictions."}, "https://arxiv.org/abs/2306.16406": {"title": "Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift", "link": "https://arxiv.org/abs/2306.16406", "description": "arXiv:2306.16406v3 Announce Type: replace \nAbstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \\emph{dataset shift} conditions are known as \\emph{domain adaptation} or \\emph{transfer learning}. Despite extensive literature on dataset shift, limited works address how to efficiently use the auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population.\n In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions."}, "https://arxiv.org/abs/2312.00130": {"title": "Sparse Data-Driven Random Projection in Regression for High-Dimensional Data", "link": "https://arxiv.org/abs/2312.00130", "description": "arXiv:2312.00130v2 Announce Type: replace \nAbstract: We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors where the vector of coefficients can vary from sparse to dense. In this setting, we propose a combination of probabilistic variable screening with random projection tools as a viable approach. More specifically, we introduce a new data-driven random projection tailored to the problem at hand and derive a theoretical bound on the gain in expected prediction error over conventional random projections. The variables to enter the projection are screened by accounting for predictor correlation. To reduce the dependence on fine-tuning choices, we aggregate over an ensemble of linear models. 
A thresholding parameter is introduced to obtain a higher degree of sparsity. Both this parameter and the number of models in the ensemble can be chosen by cross-validation. In extensive simulations, we compare the proposed method with other random projection tools and with classical sparse and dense methods and show that it is competitive in terms of prediction across a variety of scenarios with different sparsity and predictor covariance settings. We also show that the method with cross-validation is able to rank the variables satisfactorily. Finally, we showcase the method on two real data applications."}, "https://arxiv.org/abs/2104.14737": {"title": "Automatic Debiased Machine Learning via Riesz Regression", "link": "https://arxiv.org/abs/2104.14737", "description": "arXiv:2104.14737v2 Announce Type: replace-cross \nAbstract: A variety of interesting parameters may depend on high dimensional regressions. Machine learning can be used to estimate such parameters. However estimators based on machine learners can be severely biased by regularization and/or model selection. Debiased machine learning uses Neyman orthogonal estimating equations to reduce such biases. Debiased machine learning generally requires estimation of unknown Riesz representers. A primary innovation of this paper is to provide Riesz regression estimators of Riesz representers that depend on the parameter of interest, rather than explicit formulae, and that can employ any machine learner, including neural nets and random forests. End-to-end algorithms emerge where the researcher chooses the parameter of interest and the machine learner and the debiasing follows automatically. Another innovation here is debiased machine learners of parameters depending on generalized regressions, including high-dimensional generalized linear models. An empirical example of automatic debiased machine learning using neural nets is given. We find in Monte Carlo examples that automatic debiasing sometimes performs better than debiasing via inverse propensity scores and never worse. Finite sample mean square error bounds for Riesz regression estimators and asymptotic theory are also given."}, "https://arxiv.org/abs/2312.13454": {"title": "MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records", "link": "https://arxiv.org/abs/2312.13454", "description": "arXiv:2312.13454v2 Announce Type: replace-cross \nAbstract: Existing survival models either do not scale to high dimensional and multi-modal data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458 subjects with multi-modal EHR records. 
Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge. Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG."}, "https://arxiv.org/abs/2401.12924": {"title": "Performance Analysis of Support Vector Machine (SVM) on Challenging Datasets for Forest Fire Detection", "link": "https://arxiv.org/abs/2401.12924", "description": "arXiv:2401.12924v2 Announce Type: replace-cross \nAbstract: This article delves into the analysis of performance and utilization of Support Vector Machines (SVMs) for the critical task of forest fire detection using image datasets. With the increasing threat of forest fires to ecosystems and human settlements, the need for rapid and accurate detection systems is of utmost importance. SVMs, renowned for their strong classification capabilities, exhibit proficiency in recognizing patterns associated with fire within images. By training on labeled data, SVMs acquire the ability to identify distinctive attributes associated with fire, such as flames, smoke, or alterations in the visual characteristics of the forest area. The document thoroughly examines the use of SVMs, covering crucial elements like data preprocessing, feature extraction, and model training. It rigorously evaluates parameters such as accuracy, efficiency, and practical applicability. The knowledge gained from this study aids in the development of efficient forest fire detection systems, enabling prompt responses and improving disaster management. Moreover, the correlation between SVM accuracy and the difficulties presented by high-dimensional datasets is carefully investigated, demonstrated through a revealing case study. The relationship between accuracy scores and the different resolutions used for resizing the training datasets has also been discussed in this article. These comprehensive studies result in a definitive overview of the difficulties faced and the potential sectors requiring further improvement and focus."}, "https://arxiv.org/abs/2403.05644": {"title": "TSSS: A Novel Triangulated Spherical Spline Smoothing for Surface-based Data", "link": "https://arxiv.org/abs/2403.05644", "description": "arXiv:2403.05644v1 Announce Type: new \nAbstract: Surface-based data is commonly observed in diverse practical applications spanning various fields. In this paper, we introduce a novel nonparametric method to discover the underlying signals from data distributed on complex surface-based domains. Our approach involves a penalized spline estimator defined on a triangulation of surface patches, which enables effective signal extraction and recovery. The proposed method offers several advantages over existing methods, including superior handling of \"leakage\" or \"boundary effects\" over complex domains, enhanced computational efficiency, and potential applications in analyzing sparse and irregularly distributed data on complex objects. 
We provide rigorous theoretical guarantees for the proposed method, including convergence rates of the estimator in both the $L_2$ and supremum norms, as well as the asymptotic normality of the estimator. We also demonstrate that the convergence rates achieved by our estimation method are optimal within the framework of nonparametric estimation. Furthermore, we introduce a bootstrap method to quantify the uncertainty associated with the proposed estimators accurately. The superior performance of the proposed method is demonstrated through simulation experiments and data applications on cortical surface functional magnetic resonance imaging data and oceanic near-surface atmospheric data."}, "https://arxiv.org/abs/2403.05647": {"title": "Minor Issues Escalated to Critical Levels in Large Samples: A Permutation-Based Fix", "link": "https://arxiv.org/abs/2403.05647", "description": "arXiv:2403.05647v1 Announce Type: new \nAbstract: In the big data era, the need to reevaluate traditional statistical methods is paramount due to the challenges posed by vast datasets. While larger samples theoretically enhance accuracy and hypothesis testing power without increasing false positives, practical concerns about inflated Type-I errors persist. The prevalent belief is that larger samples can uncover subtle effects, necessitating dual consideration of p-value and effect size. Yet, the reliability of p-values from large samples remains debated.\n This paper warns that larger samples can exacerbate minor issues into significant errors, leading to false conclusions. Through our simulation study, we demonstrate how growing sample sizes amplify issues arising from two commonly encountered violations of model assumptions in real-world data and lead to incorrect decisions. This underscores the need for vigilant analytical approaches in the era of big data. In response, we introduce a permutation-based test to counterbalance the effects of sample size and assumption discrepancies by neutralizing them between actual and permuted data. We demonstrate that this approach effectively stabilizes nominal Type I error rates across various sample sizes, thereby ensuring robust statistical inferences even amidst breached conventional assumptions in big data.\n For reproducibility, our R codes are publicly available at: \\url{https://github.com/ubcxzhang/bigDataIssue}."}, "https://arxiv.org/abs/2403.05655": {"title": "PROTEST: Nonparametric Testing of Hypotheses Enhanced by Experts' Utility Judgements", "link": "https://arxiv.org/abs/2403.05655", "description": "arXiv:2403.05655v1 Announce Type: new \nAbstract: Instead of testing solely a precise hypothesis, it is often useful to enlarge it with alternatives that are deemed to differ from it negligibly. For instance, in a bioequivalence study one might consider the hypothesis that the concentration of an ingredient is exactly the same in two drugs. In such a context, it might be more relevant to test the enlarged hypothesis that the difference in concentration between the drugs is of no practical significance. While this concept is not alien to Bayesian statistics, applications remain confined to parametric settings and strategies on how to effectively harness experts' intuitions are often scarce or nonexistent. To resolve both issues, we introduce PROTEST, an accessible nonparametric testing framework that seamlessly integrates with Markov Chain Monte Carlo (MCMC) methods. We develop expanded versions of the model adherence, goodness-of-fit, quantile and two-sample tests. 
To demonstrate how PROTEST operates, we make use of examples, simulated studies - such as testing link functions in a binary regression setting, as well as a comparison between the performance of PROTEST and the PTtest (Holmes et al., 2015) - and an application with data on neuron spikes. Furthermore, we address the crucial issue of selecting the threshold - which controls how much a hypothesis is to be expanded - even when intuitions are limited or challenging to quantify."}, "https://arxiv.org/abs/2403.05679": {"title": "Debiased Projected Two-Sample Comparisons for Single-Cell Expression Data", "link": "https://arxiv.org/abs/2403.05679", "description": "arXiv:2403.05679v1 Announce Type: new \nAbstract: We study several variants of the high-dimensional mean inference problem motivated by modern single-cell genomics data. By taking advantage of low-dimensional and localized signal structures commonly seen in such data, our proposed methods not only have the usual frequentist validity but also provide useful information on the potential locations of the signal if the null hypothesis is rejected. Our method adaptively projects the high-dimensional vector onto a low-dimensional space, followed by a debiasing step using the semiparametric double-machine learning framework. Our analysis shows that debiasing is unnecessary under the global null, but necessary under a ``projected null'' that is of scientific interest. We also propose an ``anchored projection'' to maximize the power while avoiding the degeneracy issue under the null. Experiments on synthetic data and a real single-cell sequencing dataset demonstrate the effectiveness and interpretability of our methods."}, "https://arxiv.org/abs/2403.05704": {"title": "Non-robustness of diffusion estimates on networks with measurement error", "link": "https://arxiv.org/abs/2403.05704", "description": "arXiv:2403.05704v1 Announce Type: new \nAbstract: Network diffusion models are used to study things like disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models. Network data are often assumed to perfectly represent these possible interactions. We show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the locations of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks, and real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China."}, "https://arxiv.org/abs/2403.05756": {"title": "Model-Free Local Recalibration of Neural Networks", "link": "https://arxiv.org/abs/2403.05756", "description": "arXiv:2403.05756v1 Announce Type: new \nAbstract: Artificial neural networks (ANNs) are highly flexible predictive models. 
However, reliably quantifying uncertainty for their predictions is a continuing challenge. There has been much recent work on \"recalibration\" of predictive distributions for ANNs, so that forecast probabilities for events of interest are consistent with certain frequency evaluations of them. Uncalibrated probabilistic forecasts are of limited use for many important decision-making tasks. To address this issue, we propose a localized recalibration of ANN predictive distributions using the dimension-reduced representation of the input provided by the ANN hidden layers. Our novel method draws inspiration from recalibration techniques used in the literature on approximate Bayesian computation and likelihood-free inference methods. Most existing calibration methods for ANNs can be thought of as calibrating either on the input layer, which is difficult when the input is high-dimensional, or the output layer, which may not be sufficiently flexible. Through a simulation study, we demonstrate that our method has good performance compared to alternative approaches, and explore the benefits that can be achieved by localizing the calibration based on different layers of the network. Finally, we apply our proposed method to a diamond price prediction problem, demonstrating the potential of our approach to improve prediction and uncertainty quantification in real-world applications."}, "https://arxiv.org/abs/2403.05792": {"title": "Distributed Conditional Feature Screening via Pearson Partial Correlation with FDR Control", "link": "https://arxiv.org/abs/2403.05792", "description": "arXiv:2403.05792v1 Announce Type: new \nAbstract: This paper studies the distributed conditional feature screening for massive data with ultrahigh-dimensional features. Specifically, three distributed partial correlation feature screening methods (SAPS, ACPS and JDPS methods) are firstly proposed based on Pearson partial correlation. The corresponding consistency of distributed estimation and the sure screening property of feature screening methods are established. Secondly, because using a hard threshold in feature screening will lead to a high false discovery rate (FDR), this paper develops a two-step distributed feature screening method based on knockoff technique to control the FDR. It is shown that the proposed method can control the FDR in the finite sample, and also enjoys the sure screening property under some conditions. Different from the existing screening methods, this paper not only considers the influence of a conditional variable on both the response variable and feature variables in variable screening, but also studies the FDR control issue. Finally, the effectiveness of the proposed methods is confirmed by numerical simulations and a real data analysis."}, "https://arxiv.org/abs/2403.05803": {"title": "Semiparametric Inference for Regression-Discontinuity Designs", "link": "https://arxiv.org/abs/2403.05803", "description": "arXiv:2403.05803v1 Announce Type: new \nAbstract: Treatment effects in regression discontinuity designs (RDDs) are often estimated using local regression methods. However, global approximation methods are generally deemed inefficient. In this paper, we propose a semiparametric framework tailored for estimating treatment effects in RDDs. Our global approach conceptualizes the identification of treatment effects within RDDs as a partially linear modeling problem, with the linear component capturing the treatment effect. 
Furthermore, we utilize the P-spline method to approximate the nonparametric function and develop procedures for inferring treatment effects within this framework. We demonstrate through Monte Carlo simulations that the proposed method performs well across various scenarios. Furthermore, we illustrate using real-world datasets that our global approach may result in more reliable inference."}, "https://arxiv.org/abs/2403.05850": {"title": "Estimating Causal Effects of Discrete and Continuous Treatments with Binary Instruments", "link": "https://arxiv.org/abs/2403.05850", "description": "arXiv:2403.05850v1 Announce Type: new \nAbstract: We propose an instrumental variable framework for identifying and estimating average and quantile effects of discrete and continuous treatments with binary instruments. The basis of our approach is a local copula representation of the joint distribution of the potential outcomes and unobservables determining treatment assignment. This representation allows us to introduce an identifying assumption, so-called copula invariance, that restricts the local dependence of the copula with respect to the treatment propensity. We show that copula invariance identifies treatment effects for the entire population and other subpopulations such as the treated. The identification results are constructive and lead to straightforward semiparametric estimation procedures based on distribution regression. An application to the effect of sleep on well-being uncovers interesting patterns of heterogeneity."}, "https://arxiv.org/abs/2403.05899": {"title": "Online Identification of Stochastic Continuous-Time Wiener Models Using Sampled Data", "link": "https://arxiv.org/abs/2403.05899", "description": "arXiv:2403.05899v1 Announce Type: new \nAbstract: It is well known that ignoring the presence of stochastic disturbances in the identification of stochastic Wiener models leads to asymptotically biased estimators. On the other hand, optimal statistical identification, via likelihood-based methods, is sensitive to the assumptions on the data distribution and is usually based on relatively complex sequential Monte Carlo algorithms. We develop a simple recursive online estimation algorithm based on an output-error predictor, for the identification of continuous-time stochastic parametric Wiener models through stochastic approximation. The method is applicable to generic model parameterizations and, as demonstrated in the numerical simulation examples, it is robust with respect to the assumptions on the spectrum of the disturbance process."}, "https://arxiv.org/abs/2403.05969": {"title": "Sample Size Selection under an Infill Asymptotic Domain", "link": "https://arxiv.org/abs/2403.05969", "description": "arXiv:2403.05969v1 Announce Type: new \nAbstract: Experimental studies often fail to appropriately account for the number of collected samples within a fixed time interval for functional responses. Data of this nature appropriately falls under an Infill Asymptotic domain that is constrained by time and not considered infinite. Therefore, the sample size should account for this infill asymptotic domain. This paper provides general guidance on selecting an appropriate size for an experimental study for various simple linear regression models and tuning parameter values of the covariance structure used under an asymptotic domain, an Ornstein-Uhlenbeck process. 
Selecting an appropriate sample size is determined based on the percent of total variation that is captured at any given sample size for each parameter. Additionally, guidance on the selection of the tuning parameter is given by linking this value to the signal-to-noise ratio utilized for power calculations under design of experiments."}, "https://arxiv.org/abs/2403.05999": {"title": "Locally Regular and Efficient Tests in Non-Regular Semiparametric Models", "link": "https://arxiv.org/abs/2403.05999", "description": "arXiv:2403.05999v1 Announce Type: new \nAbstract: This paper considers hypothesis testing in semiparametric models which may be non-regular. I show that C($\\alpha$) style tests are locally regular under mild conditions, including in cases where locally regular estimators do not exist, such as models which are (semi-parametrically) weakly identified. I characterise the appropriate limit experiment in which to study local (asymptotic) optimality of tests in the non-regular case, permitting the generalisation of classical power bounds to this case. I give conditions under which these power bounds are attained by the proposed C($\\alpha$) style tests. The application of the theory to a single index model and an instrumental variables model is worked out in detail."}, "https://arxiv.org/abs/2403.06238": {"title": "Quantifying the Uncertainty of Imputed Demographic Disparity Estimates: The Dual-Bootstrap", "link": "https://arxiv.org/abs/2403.06238", "description": "arXiv:2403.06238v1 Announce Type: new \nAbstract: Measuring average differences in an outcome across racial or ethnic groups is a crucial first step for equity assessments, but researchers often lack access to data on individuals' races and ethnicities to calculate them. A common solution is to impute the missing race or ethnicity labels using proxies, then use those imputations to estimate the disparity. Conventional standard errors mischaracterize the resulting estimate's uncertainty because they treat the imputation model as given and fixed, instead of as an unknown object that must be estimated with uncertainty. We propose a dual-bootstrap approach that explicitly accounts for measurement uncertainty and thus enables more accurate statistical inference, which we demonstrate via simulation. In addition, we adapt our approach to the commonly used Bayesian Improved Surname Geocoding (BISG) imputation algorithm, where direct bootstrapping is infeasible because the underlying Census Bureau data are unavailable. In simulations, we find that measurement uncertainty is generally insignificant for BISG except in particular circumstances; bias, not variance, is likely the predominant source of error. We apply our method to quantify the uncertainty of prevalence estimates of common health conditions by race using data from the American Family Cohort."}, "https://arxiv.org/abs/2403.06246": {"title": "Estimating Factor-Based Spot Volatility Matrices with Noisy and Asynchronous High-Frequency Data", "link": "https://arxiv.org/abs/2403.06246", "description": "arXiv:2403.06246v1 Announce Type: new \nAbstract: We propose a new estimator of high-dimensional spot volatility matrices satisfying a low-rank plus sparse structure from noisy and asynchronous high-frequency data collected for an ultra-large number of assets. The noise processes are allowed to be temporally correlated, heteroskedastic, asymptotically vanishing and dependent on the efficient prices. 
We define a kernel-weighted pre-averaging method to jointly tackle the microstructure noise and asynchronicity issues, and we obtain uniformly consistent estimates for latent prices. We impose a continuous-time factor model with time-varying factor loadings on the price processes, and estimate the common factors and loadings via a local principal component analysis. Assuming a uniform sparsity condition on the idiosyncratic volatility structure, we combine the POET and kernel-smoothing techniques to estimate the spot volatility matrices for both the latent prices and idiosyncratic errors. Under some mild restrictions, the estimated spot volatility matrices are shown to be uniformly consistent under various matrix norms. We provide Monte-Carlo simulation and empirical studies to examine the numerical performance of the developed estimation methodology."}, "https://arxiv.org/abs/2403.06657": {"title": "Data-Driven Tuning Parameter Selection for High-Dimensional Vector Autoregressions", "link": "https://arxiv.org/abs/2403.06657", "description": "arXiv:2403.06657v1 Announce Type: new \nAbstract: Lasso-type estimators are routinely used to estimate high-dimensional time series models. The theoretical guarantees established for Lasso typically require the penalty level to be chosen in a suitable fashion often depending on unknown population quantities. Furthermore, the resulting estimates and the number of variables retained in the model depend crucially on the chosen penalty level. However, there is currently no theoretically founded guidance for this choice in the context of high-dimensional time series. Instead one resorts to selecting the penalty level in an ad hoc manner using, e.g., information criteria or cross-validation. We resolve this problem by considering estimation of the perhaps most commonly employed multivariate time series model, the linear vector autoregressive (VAR) model, and propose a weighted Lasso estimator with penalization chosen in a fully data-driven way. The theoretical guarantees that we establish for the resulting estimation and prediction error match those currently available for methods based on infeasible choices of penalization. We thus provide a first solution for choosing the penalization in high-dimensional time series models."}, "https://arxiv.org/abs/2403.06783": {"title": "A doubly robust estimator for the Mann Whitney Wilcoxon Rank Sum Test when applied for causal inference in observational studies", "link": "https://arxiv.org/abs/2403.06783", "description": "arXiv:2403.06783v1 Announce Type: new \nAbstract: The Mann-Whitney-Wilcoxon rank sum test (MWWRST) is a widely used method for comparing two treatment groups in randomized control trials, particularly when dealing with highly skewed data. However, when applied to observational study data, the MWWRST often yields invalid results for causal inference. To address this limitation, Wu et al. (2014) introduced an approach that incorporates inverse probability weighting (IPW) into this rank-based statistics to mitigate confounding effects. Subsequently, Mao (2018), Zhang et al. (2019), and Ai et al. (2020) extended this IPW estimator to develop doubly robust estimators.\n Nevertheless, each of these approaches has notable limitations. Mao's method imposes stringent assumptions that may not align with real-world study data. Zhang et al.'s (2019) estimators rely on bootstrap inference, which suffers from computational inefficiency and lacks known asymptotic properties. Meanwhile, Ai et al. 
(2020) primarily focus on testing the null hypothesis of equal distributions between two groups, which is a more stringent assumption that may not be well-suited to the primary practical application of MWWRST.\n In this paper, we aim to address these limitations by leveraging functional response models (FRM) to develop doubly robust estimators. We demonstrate the performance of our proposed approach using both simulated and real study data."}, "https://arxiv.org/abs/2403.06879": {"title": "Partially identified heteroskedastic SVARs", "link": "https://arxiv.org/abs/2403.06879", "description": "arXiv:2403.06879v1 Announce Type: new \nAbstract: This paper studies the identification of Structural Vector Autoregressions (SVARs) exploiting a break in the variances of the structural shocks. Point-identification for this class of models relies on an eigen-decomposition involving the covariance matrices of reduced-form errors and requires that all the eigenvalues are distinct. This point-identification, however, fails in the presence of multiplicity of eigenvalues. This occurs in an empirically relevant scenario where, for instance, only a subset of structural shocks had the break in their variances, or where a group of variables shows a variance shift of the same amount. Together with zero or sign restrictions on the structural parameters and impulse responses, we derive the identified sets for impulse responses and show how to compute them. We perform inference on the impulse response functions, building on the robust Bayesian approach developed for set identified SVARs. To illustrate our proposal, we present an empirical example based on the literature on the global crude oil market where the identification is expected to fail due to multiplicity of eigenvalues."}, "https://arxiv.org/abs/2403.05759": {"title": "Membership Testing in Markov Equivalence Classes via Independence Query Oracles", "link": "https://arxiv.org/abs/2403.05759", "description": "arXiv:2403.05759v1 Announce Type: cross \nAbstract: Understanding causal relationships between variables is a fundamental problem with broad impact in numerous scientific fields. While extensive research has been dedicated to learning causal graphs from data, its complementary concept of testing causal relationships has remained largely unexplored. While learning involves the task of recovering the Markov equivalence class (MEC) of the underlying causal graph from observational data, the testing counterpart addresses the following critical question: Given a specific MEC and observational data from some causal graph, can we determine if the data-generating causal graph belongs to the given MEC?\n We explore constraint-based testing methods by establishing bounds on the required number of conditional independence tests. Our bounds are in terms of the size of the maximum undirected clique ($s$) of the given MEC. In the worst case, we show a lower bound of $\\exp(\\Omega(s))$ independence tests. We then give an algorithm that resolves the task with $\\exp(O(s))$ tests, matching our lower bound. Compared to the learning problem, where algorithms often use a number of independence tests that is exponential in the maximum in-degree, this shows that testing is relatively easier. In particular, it requires exponentially less independence tests in graphs featuring high in-degrees and small clique sizes. 
Additionally, using the DAG associahedron, we provide a geometric interpretation of testing versus learning and discuss how our testing result can aid learning."}, "https://arxiv.org/abs/2403.06343": {"title": "Seasonal and Periodic Patterns in US COVID-19 Mortality using the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2403.06343", "description": "arXiv:2403.06343v1 Announce Type: cross \nAbstract: Since the emergence of the SARS-CoV-2 virus, research into the existence, extent, and pattern of seasonality has been of the highest importance for public health preparation. This study uses a novel bandpass bootstrap approach called the Variable Bandpass Periodic Block Bootstrap (VBPBB) to investigate the periodically correlated (PC) components including seasonality within US COVID-19 mortality. Bootstrapping to produce confidence intervals (CI) for periodic characteristics such as the seasonal mean requires preservation of the PC component's correlation structure during resampling. While existing bootstrap methods can preserve the PC component correlation structure, filtration of that PC component's frequency from interference is critical to bootstrap the PC component's characteristics accurately and efficiently. The VBPBB filters the PC time series to reduce interference from other components such as noise. This greatly reduces bootstrapped CI size and outperforms the statistical power and accuracy of other methods when estimating the periodic mean sampling distribution. VBPBB analysis of US COVID-19 mortality PC components are provided and compared against alternative bootstrapping methods. These results reveal crucial evidence supporting the presence of a seasonal PC pattern and existence of additional PC components, their timing, and CIs for their effect which will aid prediction and preparation for future COVID-19 responses."}, "https://arxiv.org/abs/2403.06371": {"title": "Ensemble Kalman filter in geoscience meets model predictive control", "link": "https://arxiv.org/abs/2403.06371", "description": "arXiv:2403.06371v1 Announce Type: cross \nAbstract: Although data assimilation originates from control theory, the relationship between modern data assimilation methods in geoscience and model predictive control has not been extensively explored. In the present paper, I discuss that the modern data assimilation methods in geoscience and model predictive control essentially minimize the similar quadratic cost functions. Inspired by this similarity, I propose a new ensemble Kalman filter (EnKF)-based method for controlling spatio-temporally chaotic systems, which can readily be applied to high-dimensional and nonlinear Earth systems. In this method, the reference vector, which serves as the control target, is assimilated into the state space as a pseudo-observation by ensemble Kalman smoother to obtain the appropriate perturbation to be added to a system. A proof-of-concept experiment using the Lorenz 63 model is presented. 
The system is constrained in one wing of the butterfly attractor without tipping to the other side by reasonably small control perturbations which are comparable with previous works."}, "https://arxiv.org/abs/1904.00136": {"title": "Estimating spillovers using imprecisely measured networks", "link": "https://arxiv.org/abs/1904.00136", "description": "arXiv:1904.00136v4 Announce Type: replace \nAbstract: In many experimental contexts, whether and how network interactions impact the outcome of interest for both treated and untreated individuals are key concerns. Networks data is often assumed to perfectly represent these possible interactions. This paper considers the problem of estimating treatment effects when measured connections are, instead, a noisy representation of the true spillover pathways. We show that existing methods, using the potential outcomes framework, yield biased estimators in the presence of this mismeasurement. We develop a new method, using a class of mixture models, that can account for missing connections and discuss its estimation via the Expectation-Maximization algorithm. We check our method's performance by simulating experiments on real network data from 43 villages in India. Finally, we use data from a previously published study to show that estimates using our method are more robust to the choice of network measure."}, "https://arxiv.org/abs/2108.05555": {"title": "Longitudinal Network Models and Permutation-Uniform Markov Chains", "link": "https://arxiv.org/abs/2108.05555", "description": "arXiv:2108.05555v2 Announce Type: replace \nAbstract: Consider longitudinal networks whose edges turn on and off according to a discrete-time Markov chain with exponential-family transition probabilities. We characterize when their joint distributions are also exponential families with the same parameter, improving data reduction. Further we show that the permutation-uniform subclass of these chains permit interpretation as an independent, identically distributed sequence on the same state space. We then apply these ideas to temporal exponential random graph models, for which permutation uniformity is well suited, and discuss mean-parameter convergence, dyadic independence, and exchangeability. Our framework facilitates our introducing a new network model; simplifies analysis of some network and autoregressive models from the literature, including by permitting closed-form expressions for maximum likelihood estimates for some models; and facilitates applying standard tools to longitudinal-network Markov chains from either asymptotics or single-observation exponential random graph models."}, "https://arxiv.org/abs/2201.13451": {"title": "Nonlinear Regression with Residuals: Causal Estimation with Time-varying Treatments and Covariates", "link": "https://arxiv.org/abs/2201.13451", "description": "arXiv:2201.13451v3 Announce Type: replace \nAbstract: Standard regression adjustment gives inconsistent estimates of causal effects when there are time-varying treatment effects and time-varying covariates. Loosely speaking, the issue is that some covariates are post-treatment variables because they may be affected by prior treatment status, and regressing out post-treatment variables causes bias. More precisely, the bias is due to certain non-confounding latent variables that create colliders in the causal graph. These latent variables, which we call phantoms, do not harm the identifiability of the causal effect, but they render naive regression estimates inconsistent. 
Motivated by this, we ask: how can we modify regression methods so that they hold up even in the presence of phantoms? We develop an estimator for this setting based on regression modeling (linear, log-linear, probit and Cox regression), proving that it is consistent for a reasonable causal estimand. In particular, the estimator is a regression model fit with a simple adjustment for collinearity, making it easy to understand and implement with standard regression software. The proposed estimators are instances of the parametric g-formula, extending the regression-with-residuals approach to several canonical nonlinear models."}, "https://arxiv.org/abs/2207.07318": {"title": "Flexible global forecast combinations", "link": "https://arxiv.org/abs/2207.07318", "description": "arXiv:2207.07318v3 Announce Type: replace \nAbstract: Forecast combination -- the aggregation of individual forecasts from multiple experts or models -- is a proven approach to economic forecasting. To date, research on economic forecasting has concentrated on local combination methods, which handle separate but related forecasting tasks in isolation. Yet, it has been known for over two decades in the machine learning community that global methods, which exploit task-relatedness, can improve on local methods that ignore it. Motivated by the possibility for improvement, this paper introduces a framework for globally combining forecasts while being flexible to the level of task-relatedness. Through our framework, we develop global versions of several existing forecast combinations. To evaluate the efficacy of these new global forecast combinations, we conduct extensive comparisons using synthetic and real data. Our real data comparisons, which involve forecasts of core economic indicators in the Eurozone, provide empirical evidence that the accuracy of global combinations of economic forecasts can surpass local combinations."}, "https://arxiv.org/abs/2301.09379": {"title": "Revisiting Panel Data Discrete Choice Models with Lagged Dependent Variables", "link": "https://arxiv.org/abs/2301.09379", "description": "arXiv:2301.09379v4 Announce Type: replace \nAbstract: This paper revisits the identification and estimation of a class of semiparametric (distribution-free) panel data binary choice models with lagged dependent variables, exogenous covariates, and entity fixed effects. We provide a novel identification strategy, using an \"identification at infinity\" argument. In contrast with the celebrated Honore and Kyriazidou (2000), our method permits time trends of any form and does not suffer from the \"curse of dimensionality\". We propose an easily implementable conditional maximum score estimator. The asymptotic properties of the proposed estimator are fully characterized. A small-scale Monte Carlo study demonstrates that our approach performs satisfactorily in finite samples. We illustrate the usefulness of our method by presenting an empirical application to enrollment in private hospital insurance using the Household, Income and Labour Dynamics in Australia (HILDA) Survey data."}, "https://arxiv.org/abs/2303.10117": {"title": "Estimation of Grouped Time-Varying Network Vector Autoregression Models", "link": "https://arxiv.org/abs/2303.10117", "description": "arXiv:2303.10117v2 Announce Type: replace \nAbstract: This paper introduces a flexible time-varying network vector autoregressive model framework for large-scale time series. 
A latent group structure is imposed on the heterogeneous and node-specific time-varying momentum and network spillover effects so that the number of unknown time-varying coefficients to be estimated can be reduced considerably. A classic agglomerative clustering algorithm with nonparametrically estimated distance matrix is combined with a ratio criterion to consistently estimate the latent group number and membership. A post-grouping local linear smoothing method is proposed to estimate the group-specific time-varying momentum and network effects, substantially improving the convergence rates of the preliminary estimates which ignore the latent structure. We further modify the methodology and theory to allow for structural breaks in either the group membership, group number or group-specific coefficient functions. Numerical studies including Monte-Carlo simulation and an empirical application are presented to examine the finite-sample performance of the developed model and methodology."}, "https://arxiv.org/abs/2305.11812": {"title": "Off-policy evaluation beyond overlap: partial identification through smoothness", "link": "https://arxiv.org/abs/2305.11812", "description": "arXiv:2305.11812v2 Announce Type: replace \nAbstract: Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions."}, "https://arxiv.org/abs/2306.09806": {"title": "Testing for Peer Effects without Specifying the Network Structure", "link": "https://arxiv.org/abs/2306.09806", "description": "arXiv:2306.09806v2 Announce Type: replace \nAbstract: This paper proposes an Anderson-Rubin (AR) test for the presence of peer effects in panel data without the need to specify the network structure. The unrestricted model of our test is a linear panel data model of social interactions with dyad-specific peer effects. The proposed AR test evaluates if the peer effect coefficients are all zero. As the number of peer effect coefficients increases with the sample size, so does the number of instrumental variables (IVs) employed to test the restrictions under the null, rendering Bekker's many-IV environment. By extending existing many-IV asymptotic results to panel data, we establish the asymptotic validity of the proposed AR test. Our Monte Carlo simulations show the robustness and superior performance of the proposed test compared to some existing tests with misspecified networks. 
We provide two applications to demonstrate its empirical relevance."}, "https://arxiv.org/abs/2308.10138": {"title": "On the Inconsistency of Cluster-Robust Inference and How Subsampling Can Fix It", "link": "https://arxiv.org/abs/2308.10138", "description": "arXiv:2308.10138v4 Announce Type: replace \nAbstract: Conventional methods of cluster-robust inference are inconsistent in the presence of unignorably large clusters. We formalize this claim by establishing a necessary and sufficient condition for the consistency of the conventional methods. We find that this condition for consistency is rejected for a majority of empirical research papers. In this light, we propose a novel score subsampling method that achieves uniform size control over a broad class of data generating processes, including those for which the conventional methods fail. Simulation studies support these claims. With real data used by an empirical paper, we showcase that the conventional methods conclude significance while our proposed method concludes insignificance."}, "https://arxiv.org/abs/2309.12819": {"title": "Doubly Robust Proximal Causal Learning for Continuous Treatments", "link": "https://arxiv.org/abs/2309.12819", "description": "arXiv:2309.12819v3 Announce Type: replace \nAbstract: Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Exploiting its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve for the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications."}, "https://arxiv.org/abs/2310.18905": {"title": "Incorporating nonparametric methods for estimating causal excursion effects in mobile health with zero-inflated count outcomes", "link": "https://arxiv.org/abs/2310.18905", "description": "arXiv:2310.18905v2 Announce Type: replace \nAbstract: In mobile health, tailoring interventions for real-time delivery is of paramount importance. Micro-randomized trials have emerged as the \"gold-standard\" methodology for developing such interventions. Analyzing data from these trials provides insights into the efficacy of interventions and the potential moderation by specific covariates. The \"causal excursion effect\", a novel class of causal estimand, addresses these inquiries. Yet, existing research mainly focuses on continuous or binary data, leaving count data largely unexplored. The current work is motivated by the Drink Less micro-randomized trial from the UK, which focuses on a zero-inflated proximal outcome, i.e., the number of screen views in the subsequent hour following the intervention decision point. 
Specifically, we revisit the concept of the causal excursion effect for zero-inflated count outcomes, and introduce novel estimation approaches that incorporate nonparametric techniques. Bidirectional asymptotics are established for the proposed estimators. Simulation studies are conducted to evaluate the performance of the proposed methods. As an illustration, we also apply these methods to the Drink Less trial data."}, "https://arxiv.org/abs/2311.17445": {"title": "Interaction tests with covariate-adaptive randomization", "link": "https://arxiv.org/abs/2311.17445", "description": "arXiv:2311.17445v2 Announce Type: replace \nAbstract: Treatment-covariate interaction tests are commonly applied by researchers to examine whether the treatment effect varies across patient subgroups defined by baseline characteristics. The objective of this study is to explore treatment-covariate interaction tests involving covariate-adaptive randomization. Without assuming a parametric data generating model, we investigate the usual interaction tests and observe that they tend to be conservative: specifically, their limiting rejection probabilities under the null hypothesis do not exceed the nominal level and are typically strictly lower than it. To address this problem, we propose modifications to the usual tests to obtain corresponding valid tests. Moreover, we introduce a novel class of stratified-adjusted interaction tests that are simple, more powerful than the usual and modified tests, and broadly applicable to most covariate-adaptive randomization methods. The results are general enough to encompass two types of interaction tests: one involving stratification covariates and the other involving additional covariates that are not used for randomization. Our study clarifies the application of interaction tests in clinical trials and offers valuable tools for revealing treatment heterogeneity, crucial for advancing personalized medicine."}, "https://arxiv.org/abs/2112.14249": {"title": "Nested Nonparametric Instrumental Variable Regression: Long Term, Mediated, and Time Varying Treatment Effects", "link": "https://arxiv.org/abs/2112.14249", "description": "arXiv:2112.14249v3 Announce Type: replace-cross \nAbstract: Several causal parameters in short panel data models are scalar summaries of a function called a nested nonparametric instrumental variable regression (nested NPIV). Examples include long term, mediated, and time varying treatment effects identified using proxy variables. However, it appears that no prior estimators or guarantees for nested NPIV exist, preventing flexible estimation and inference for these causal parameters. A major challenge is compounding ill posedness due to the nested inverse problems. We analyze adversarial estimators of nested NPIV, and provide sufficient conditions for efficient inference on the causal parameter. Our nonasymptotic analysis has three salient features: (i) introducing techniques that limit how ill posedness compounds; (ii) accommodating neural networks, random forests, and reproducing kernel Hilbert spaces; and (iii) extending to causal functions, e.g. long term heterogeneous treatment effects. 
We measure long term heterogeneous treatment effects of Project STAR and mediated proximal treatment effects of the Job Corps."}, "https://arxiv.org/abs/2211.12585": {"title": "Scalable couplings for the random walk Metropolis algorithm", "link": "https://arxiv.org/abs/2211.12585", "description": "arXiv:2211.12585v2 Announce Type: replace-cross \nAbstract: There has been a recent surge of interest in coupling methods for Markov chain Monte Carlo algorithms: they facilitate convergence quantification and unbiased estimation, while exploiting embarrassingly parallel computing capabilities. Motivated by these, we consider the design and analysis of couplings of the random walk Metropolis algorithm which scale well with the dimension of the target measure. Methodologically, we introduce a low-rank modification of the synchronous coupling that is provably optimally contractive in standard high-dimensional asymptotic regimes. We expose a shortcoming of the reflection coupling, the status quo at time of writing, and we propose a modification which mitigates the issue. Our analysis bridges the gap to the optimal scaling literature and builds a framework of asymptotic optimality which may be of independent interest. We illustrate the applicability of our proposed couplings, and the potential for extending our ideas, with various numerical experiments."}, "https://arxiv.org/abs/2302.07677": {"title": "Bayesian Federated Inference for estimating Statistical Models based on Non-shared Multicenter Data sets", "link": "https://arxiv.org/abs/2302.07677", "description": "arXiv:2302.07677v2 Announce Type: replace-cross \nAbstract: Identifying predictive factors for an outcome of interest via a multivariable analysis is often difficult when the data set is small. Combining data from different medical centers into a single (larger) database would alleviate this problem, but is in practice challenging due to regulatory and logistic problems. Federated Learning (FL) is a machine learning approach that aims to construct from local inferences in separate data centers what would have been inferred had the data sets been merged. It seeks to harvest the statistical power of larger data sets without actually creating them. The FL strategy is not always efficient and precise. Therefore, in this paper we refine and implement an alternative Bayesian Federated Inference (BFI) framework for multicenter data with the same aim as FL. The BFI framework is designed to cope with small data sets by inferring locally not only the optimal parameter values, but also additional features of the posterior parameter distribution, capturing information beyond what is used in FL. BFI has the additional benefit that a single inference cycle across the centers is sufficient, whereas FL needs multiple cycles. We quantify the performance of the proposed methodology on simulated and real life data."}, "https://arxiv.org/abs/2305.00164": {"title": "Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles", "link": "https://arxiv.org/abs/2305.00164", "description": "arXiv:2305.00164v2 Announce Type: replace-cross \nAbstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. 
Fully adaptive and computationally efficient algorithms are proposed, and sharp minimax lower bounds are given for both the estimation accuracy and expected length of confidence intervals for the minimizer and minimum.\n The nonasymptotic local minimax framework brings out new phenomena in simultaneous estimation and inference for the minimizer and minimum. We establish a novel uncertainty principle that provides a fundamental limit on how well the minimizer and minimum can be estimated simultaneously for any convex regression function. A similar result holds for the expected length of the confidence intervals for the minimizer and minimum."}, "https://arxiv.org/abs/2306.12690": {"title": "Uniform error bound for PCA matrix denoising", "link": "https://arxiv.org/abs/2306.12690", "description": "arXiv:2306.12690v2 Announce Type: replace-cross \nAbstract: Principal component analysis (PCA) is a simple and popular tool for processing high-dimensional data. We investigate its effectiveness for matrix denoising.\n We consider clean data generated from a low-dimensional subspace but masked by independent high-dimensional sub-Gaussian noise with standard deviation $\\sigma$. Under the low-rank assumption on the clean data with a mild spectral gap assumption, we prove that the distance between each PCA-denoised data point and its clean counterpart is uniformly bounded by $O(\\sigma \\log n)$. To illustrate the spectral gap assumption, we show it can be satisfied when the clean data are independently generated with a non-degenerate covariance matrix. We then provide a general lower bound for the error of the denoised data matrix, which indicates that PCA denoising gives a uniform error bound that is rate-optimal. Furthermore, we examine how the error bound impacts downstream applications such as clustering and manifold learning. Numerical results validate our theoretical findings and reveal the importance of the uniform error."}, "https://arxiv.org/abs/2310.02679": {"title": "Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization", "link": "https://arxiv.org/abs/2310.02679", "description": "arXiv:2310.02679v3 Announce Type: replace-cross \nAbstract: We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment because entire trajectories are used and the learning signal is present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional \"flow function\". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals. 
Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely-related prior methods."}, "https://arxiv.org/abs/2310.19788": {"title": "Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget", "link": "https://arxiv.org/abs/2310.19788", "description": "arXiv:2310.19788v3 Announce Type: replace-cross \nAbstract: This study investigates the experimental design problem for identifying the arm with the highest expected outcome, referred to as best arm identification (BAI). In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates an arm and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the arms. At the end of the experiment, the decision-maker recommends one of the arms as an estimate of the best arm. To design an experiment, we first discuss lower bounds for the probability of misidentification. Our analysis highlights that the available information on the outcome distribution, such as means (expected outcomes), variances, and the choice of the best arm, significantly influences the lower bounds. Because available information is limited in actual experiments, we develop a lower bound that is valid under the unknown means and the unknown choice of the best arm, which we refer to as the worst-case lower bound. We demonstrate that the worst-case lower bound depends solely on the variances of the outcomes. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size grows to infinity and the differences between the expected outcomes of the best and other suboptimal arms converge to the same values across arms. We refer to such strategies as asymptotically worst-case optimal."}, "https://arxiv.org/abs/2312.08150": {"title": "Active learning with biased non-response to label requests", "link": "https://arxiv.org/abs/2312.08150", "description": "arXiv:2312.08150v2 Announce Type: replace-cross \nAbstract: Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can plausibly be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. 
We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation."}, "https://arxiv.org/abs/2401.13045": {"title": "Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?", "link": "https://arxiv.org/abs/2401.13045", "description": "arXiv:2401.13045v2 Announce Type: replace-cross \nAbstract: Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. This, in turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes. This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions."}, "https://arxiv.org/abs/2403.07124": {"title": "Stochastic gradient descent-based inference for dynamic network models with attractors", "link": "https://arxiv.org/abs/2403.07124", "description": "arXiv:2403.07124v1 Announce Type: new \nAbstract: In Coevolving Latent Space Networks with Attractors (CLSNA) models, nodes in a latent space represent social actors, and edges indicate their dynamic interactions. Attractors are added at the latent level to capture the notion of attractive and repulsive forces between nodes, borrowing from dynamical systems theory. However, the CLSNA models' reliance on MCMC estimation makes scaling difficult, and the requirement that nodes be present throughout the study period limits practical applications. We address these issues by (i) introducing a stochastic gradient descent (SGD) parameter estimation method, (ii) developing a novel approach for uncertainty quantification using SGD, and (iii) extending the model to allow nodes to join and leave over time. 
Simulation results show that our extensions result in little loss of accuracy compared to MCMC, but can scale to much larger networks. We apply our approach to the longitudinal social networks of members of US Congress on the social media platform X. Accounting for node dynamics overcomes selection bias in the network and uncovers uniquely and increasingly repulsive forces within the Republican Party."}, "https://arxiv.org/abs/2403.07158": {"title": "Sample Splitting and Assessing Goodness-of-fit of Time Series", "link": "https://arxiv.org/abs/2403.07158", "description": "arXiv:2403.07158v1 Announce Type: new \nAbstract: A fundamental and often final step in time series modeling is to assess the quality of fit of a proposed model to the data. Since the underlying distribution of the innovations that generate a model is often not prescribed, goodness-of-fit tests typically take the form of testing the fitted residuals for serial independence. However, these fitted residuals are inherently dependent since they are based on the same parameter estimates, and thus standard tests of serial independence, such as those based on the autocorrelation function (ACF) or distance correlation function (ADCF) of the fitted residuals, need to be adjusted. The sample splitting procedure in Pfister et al. (2018) is one such fix for the case of models for independent data, but fails to work in the dependent setting. In this paper, sample splitting is leveraged in the time series setting to perform tests of serial dependence of fitted residuals using the ACF and ADCF. Here the first $f_n$ of the data points are used to estimate the parameters of the model and then, using these parameter estimates, the last $l_n$ of the data points are used to compute the estimated residuals. Tests for serial independence are then based on these $l_n$ residuals. As long as the overlap between the $f_n$ and $l_n$ data splits is asymptotically 1/2, the ACF and ADCF tests of serial independence often have the same limit distributions as though the underlying residuals were indeed iid. In particular if the first half of the data is used to estimate the parameters and the estimated residuals are computed for the entire data set based on these parameter estimates, then the ACF and ADCF can have the same limit distributions as though the residuals were iid. This procedure obviates the need for adjustment in the construction of confidence bounds for both the ACF and ADCF in goodness-of-fit testing."}, "https://arxiv.org/abs/2403.07236": {"title": "Partial Identification of Individual-Level Parameters Using Aggregate Data in a Nonparametric Binary Outcome Model", "link": "https://arxiv.org/abs/2403.07236", "description": "arXiv:2403.07236v1 Announce Type: new \nAbstract: It is well known that the relationship between variables at the individual level can be different from the relationship between those same variables aggregated over individuals. This problem of aggregation becomes relevant when the researcher wants to learn individual-level relationships, but only has access to data that has been aggregated. In this paper, I develop a methodology to partially identify linear combinations of conditional average outcomes from aggregate data when the outcome of interest is binary, while imposing as few restrictions on the underlying data generating process as possible. I construct identified sets using an optimization program that allows researchers to impose additional shape restrictions. 
I also provide consistency results and construct an inference procedure that is valid with aggregate data, which only provides marginal information about each variable. I apply the methodology to simulated and real-world data sets and find that the estimated identified sets are too wide to be useful. This suggests that to obtain useful information from aggregate data sets about individual-level relationships, researchers must impose further assumptions that are carefully justified."}, "https://arxiv.org/abs/2403.07288": {"title": "Efficient and Model-Agnostic Parameter Estimation Under Privacy-Preserving Post-randomization Data", "link": "https://arxiv.org/abs/2403.07288", "description": "arXiv:2403.07288v1 Announce Type: new \nAbstract: Protecting individual privacy is crucial when releasing sensitive data for public use. While data de-identification helps, it is not enough. This paper addresses parameter estimation in scenarios where data are perturbed using the Post-Randomization Method (PRAM) to enhance privacy. Existing methods for parameter estimation under PRAM data suffer from limitations like being parameter-specific, model-dependent, and lacking efficiency guarantees. We propose a novel, efficient method that overcomes these limitations. Our method is applicable to general parameters defined through estimating equations and makes no assumptions about the underlying data model. We further prove that the proposed estimator achieves the semiparametric efficiency bound, making it optimal in terms of asymptotic variance."}, "https://arxiv.org/abs/2403.07650": {"title": "A Class of Semiparametric Yang and Prentice Frailty Models", "link": "https://arxiv.org/abs/2403.07650", "description": "arXiv:2403.07650v1 Announce Type: new \nAbstract: The Yang and Prentice (YP) regression models have garnered interest from the scientific community due to their ability to analyze data whose survival curves exhibit intersection. These models include proportional hazards (PH) and proportional odds (PO) models as specific cases. However, they encounter limitations when dealing with multivariate survival data due to potential dependencies between the times-to-event. A solution is introducing a frailty term into the hazard functions, making it possible for the times-to-event to be considered independent, given the frailty term. In this study, we propose a new class of YP models that incorporate frailty. We use the exponential distribution, the piecewise exponential distribution (PE), and Bernstein polynomials (BP) as baseline functions. Our approach adopts a Bayesian methodology. The proposed models are evaluated through a simulation study, which shows that the YP frailty models with BP and PE baselines perform similarly to the parametric model that generated the data. We apply the models to two real data sets."}, "https://arxiv.org/abs/2403.07778": {"title": "Joint Modeling of Longitudinal Measurements and Time-to-event Outcomes Using BUGS", "link": "https://arxiv.org/abs/2403.07778", "description": "arXiv:2403.07778v1 Announce Type: new \nAbstract: The objective of this paper is to provide an introduction to the principles of Bayesian joint modeling of longitudinal measurements and time-to-event outcomes, as well as model implementation using the BUGS language syntax. This syntax can be executed directly using OpenBUGS or by utilizing convenient functions to invoke OpenBUGS and JAGS from R software. In this paper, all details of joint models are provided, ranging from simple to more advanced models. 
The presentation starts with the joint modeling of a Gaussian longitudinal marker and time-to-event outcome. The Bayesian implementation of the model is reviewed. The strategies for simulating data from the JM are also discussed. A proportional hazards model with various forms of baseline hazards is considered, along with a discussion of all possible association structures between the two sub-models. The paper covers joint models with multivariate longitudinal measurements, zero-inflated longitudinal measurements, competing risks, and time-to-event with cure fraction. The models are illustrated by the analyses of several real data sets. All simulated and real data and code are available at \\url{https://github.com/tbaghfalaki/JM-with-BUGS-and-JAGS}."}, "https://arxiv.org/abs/2403.07008": {"title": "AutoEval Done Right: Using Synthetic Data for Model Evaluation", "link": "https://arxiv.org/abs/2403.07008", "description": "arXiv:2403.07008v1 Announce Type: cross \nAbstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4."}, "https://arxiv.org/abs/2403.07031": {"title": "The Cram Method for Efficient Simultaneous Learning and Evaluation", "link": "https://arxiv.org/abs/2403.07031", "description": "arXiv:2403.07031v1 Announce Type: cross \nAbstract: We introduce the \"cram\" method, a general and efficient approach to simultaneous learning and evaluation using a generic machine learning (ML) algorithm. In a single pass of batched data, the proposed method repeatedly trains an ML algorithm and tests its empirical performance. Because it utilizes the entire sample for both learning and evaluation, cramming is significantly more data-efficient than sample-splitting. The cram method also naturally accommodates online learning algorithms, making its implementation computationally efficient. To demonstrate the power of the cram method, we consider the standard policy learning setting where cramming is applied to the same data to both develop an individualized treatment rule (ITR) and estimate the average outcome that would result if the learned ITR were to be deployed. We show that under a minimal set of assumptions, the resulting crammed evaluation estimator is consistent and asymptotically normal. While our asymptotic results require a relatively weak stabilization condition of the ML algorithm, we develop a simple, generic method that can be used with any policy learning algorithm to satisfy this condition. Our extensive simulation studies show that, when compared to sample-splitting, cramming reduces the evaluation standard error by more than 40% while improving the performance of the learned policy. We also apply the cram method to a randomized clinical trial to demonstrate its applicability to real-world problems. 
Finally, we briefly discuss future extensions of the cram method to other learning and evaluation settings."}, "https://arxiv.org/abs/2403.07104": {"title": "Shrinkage MMSE estimators of covariances beyond the zero-mean and stationary variance assumptions", "link": "https://arxiv.org/abs/2403.07104", "description": "arXiv:2403.07104v1 Announce Type: cross \nAbstract: We tackle covariance estimation in low-sample scenarios, employing a structured covariance matrix with shrinkage methods. These involve convexly combining a low-bias/high-variance empirical estimate with a biased regularization estimator, striking a bias-variance trade-off. Literature provides optimal settings of the regularization amount through risk minimization between the true covariance and its shrunk counterpart. Such estimators were derived for zero-mean statistics with i.i.d. diagonal regularization matrices accounting solely for the average sample variance. We extend these results to regularization matrices accounting for the sample variances both for centered and non-centered samples. In the latter case, the empirical estimate of the true mean is incorporated into our shrinkage estimators. Introducing confidence weights into the statistics also enhances estimator robustness against outliers. We compare our estimators to other shrinkage methods both on numerical simulations and on real data to solve a detection problem in astronomy."}, "https://arxiv.org/abs/2403.07464": {"title": "On Ranking-based Tests of Independence", "link": "https://arxiv.org/abs/2403.07464", "description": "arXiv:2403.07464v1 Announce Type: cross \nAbstract: In this paper we develop a novel nonparametric framework to test the independence of two random variables $\\mathbf{X}$ and $\\mathbf{Y}$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dx dy)$, based on Receiver Operating Characteristic (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that the independence hypothesis $\\mathcal{H}_0$ is necessarily false as soon as the optimal scoring function related to the pair of distributions $(H\\otimes G,\\; F)$, obtained from a bipartite ranking algorithm, has a ROC curve that deviates from the main diagonal of the unit square. We consider a wide class of rank statistics encompassing many ways of deviating from the diagonal in the ROC space to build tests of independence. Beyond its great flexibility, this new method has theoretical properties that far surpass those of its competitors. Nonasymptotic bounds for the two types of testing errors are established. From an empirical perspective, the novel procedure we promote in this paper exhibits a remarkable ability to detect small departures, of various types, from the null assumption $\\mathcal{H}_0$, even in high dimension, as supported by the numerical experiments presented here."}, "https://arxiv.org/abs/2403.07495": {"title": "Tuning diagonal scale matrices for HMC", "link": "https://arxiv.org/abs/2403.07495", "description": "arXiv:2403.07495v1 Announce Type: cross \nAbstract: Three approaches for adaptively tuning diagonal scale matrices for HMC are discussed and compared. The common practice of scaling according to estimated marginal standard deviations is taken as a benchmark. 
Two alternatives are considered: scaling according to the mean log-target gradient (ISG), and a scaling method that targets a uniform frequency, across dimensions, with which the underlying Hamiltonian dynamics crosses the respective medians. Numerical studies suggest that the ISG method leads in many cases to more efficient sampling than the benchmark, in particular in cases with strong correlations or non-linear dependencies. The ISG method is also easy to implement, computationally cheap, and would be relatively simple to include in automatically tuned codes as an alternative to the benchmark practice."}, "https://arxiv.org/abs/2403.07657": {"title": "Scalable Spatiotemporal Prediction with Bayesian Neural Fields", "link": "https://arxiv.org/abs/2403.07657", "description": "arXiv:2403.07657v1 Announce Type: cross \nAbstract: Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in many scientific and business-intelligence applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As modern datasets continue to increase in size and complexity, there is a growing need for new statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle large prediction problems. This work presents the Bayesian Neural Field (BayesNF), a domain-general statistical model for inferring rich probability distributions over a spatiotemporal domain, which can be used for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a novel deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust uncertainty quantification. By defining the prior through a sequence of smooth differentiable transforms, posterior inference is conducted on large-scale data using variationally learned surrogates trained via stochastic gradient descent. We evaluate BayesNF against prominent statistical and machine-learning baselines, showing considerable improvements on diverse prediction problems from climate and public health datasets that contain tens to hundreds of thousands of measurements. The paper is accompanied by an open-source software package (https://github.com/google/bayesnf) that is easy to use and compatible with modern GPU and TPU accelerators on the JAX machine learning platform."}, "https://arxiv.org/abs/2403.07728": {"title": "CAS: A General Algorithm for Online Selective Conformal Prediction with FCR Control", "link": "https://arxiv.org/abs/2403.07728", "description": "arXiv:2403.07728v1 Announce Type: cross \nAbstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR) to measure the averaged miscoverage error. We develop a general framework named CAS (Calibration after Adaptive Selection) that can wrap around any prediction model and online selection rule to output post-selection prediction intervals. If the current individual is selected, we first perform an adaptive selection on historical data to construct a calibration set, then output a conformal prediction interval for the unobserved label. 
We provide tractable constructions for the calibration set for popular online selection rules. We prove that CAS can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. For the decision-driven selection rule, including most online multiple-testing procedures, CAS can exactly control the real-time FCR below the target level without any distributional assumptions. For the online selection with symmetric thresholds, we establish the error bound for the control gap of FCR under mild distributional assumptions. To account for the distribution shift in online data, we also embed CAS into some recent dynamic conformal prediction methods and examine the long-run FCR control. Numerical results on both synthetic and real data corroborate that CAS can effectively control FCR around the target level and yield narrower prediction intervals than existing baselines across various settings."}, "https://arxiv.org/abs/2103.11066": {"title": "Treatment Allocation under Uncertain Costs", "link": "https://arxiv.org/abs/2103.11066", "description": "arXiv:2103.11066v4 Announce Type: replace \nAbstract: We consider the problem of learning how to optimally allocate treatments whose cost is uncertain and can vary with pre-treatment covariates. This setting may arise in medicine if we need to prioritize access to a scarce resource that different patients would use for different amounts of time, or in marketing if we want to target discounts whose cost to the company depends on how much the discounts are used. Here, we show that the optimal treatment allocation rule under budget constraints is a thresholding rule based on priority scores, and we propose a number of practical methods for learning these priority scores using data from a randomized trial. Our formal results leverage a statistical connection between our problem and that of learning heterogeneous treatment effects under endogeneity using an instrumental variable. We find our method to perform well in a number of empirical evaluations."}, "https://arxiv.org/abs/2201.01879": {"title": "Exponential family measurement error models for single-cell CRISPR screens", "link": "https://arxiv.org/abs/2201.01879", "description": "arXiv:2201.01879v3 Announce Type: replace \nAbstract: CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present substantial statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens -- \"thresholded regression\" -- exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV (\"GLM-based errors-in-variables\"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g., Microsoft Azure) and high-performance clusters. 
Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several novel insights."}, "https://arxiv.org/abs/2202.03513": {"title": "Causal survival analysis under competing risks using longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2202.03513", "description": "arXiv:2202.03513v2 Announce Type: replace \nAbstract: Longitudinal modified treatment policies (LMTP) have been recently developed as a novel method to define and estimate causal parameters that depend on the natural value of treatment. LMTPs represent an important advancement in causal inference for longitudinal studies as they allow the non-parametric definition and estimation of the joint effect of multiple categorical, numerical, or continuous exposures measured at several time points. We extend the LMTP methodology to problems in which the outcome is a time-to-event variable subject to right-censoring and competing risks. We present identification results and non-parametric locally efficient estimators that use flexible data-adaptive regression techniques to alleviate model misspecification bias, while retaining important asymptotic properties such as $\\sqrt{n}$-consistency. We present an application to the estimation of the effect of the time-to-intubation on acute kidney injury amongst COVID-19 hospitalized patients, where death by other causes is taken to be the competing event."}, "https://arxiv.org/abs/2301.00584": {"title": "Selective conformal inference with false coverage-statement rate control", "link": "https://arxiv.org/abs/2301.00584", "description": "arXiv:2301.00584v5 Announce Type: replace \nAbstract: Conformal inference is a popular tool for constructing prediction intervals (PI). We consider here the scenario of post-selection/selective conformal inference, that is, PIs are reported only for individuals selected from unlabeled test data. To account for multiplicity, we develop a general split conformal framework to construct selective PIs with the false coverage-statement rate (FCR) control. We first investigate Benjamini and Yekutieli's (2005) FCR-adjusted method in the present setting, and show that it is able to achieve FCR control but yields uniformly inflated PIs. We then propose a novel solution to the problem, named Selective COnditional conformal Predictions (SCOP), which entails performing selection procedures on both the calibration set and the test set and constructing marginal conformal PIs on the selected sets with the aid of the conditional empirical distribution obtained from the calibration set. Under a unified framework and exchangeability assumptions, we show that the SCOP can exactly control the FCR. More importantly, we provide non-asymptotic miscoverage bounds for a general class of selection procedures beyond exchangeability and discuss the conditions under which the SCOP is able to control the FCR. As special cases, the SCOP with quantile-based selection or conformal p-values-based multiple testing procedures enjoys a valid coverage guarantee under mild conditions. 
Numerical results confirm the effectiveness and robustness of SCOP in FCR control and show that it achieves narrower PIs than existing methods in many settings."}, "https://arxiv.org/abs/2302.07070": {"title": "Empirical study of periodic autoregressive models with additive noise -- estimation and testing", "link": "https://arxiv.org/abs/2302.07070", "description": "arXiv:2302.07070v2 Announce Type: replace \nAbstract: Periodic autoregressive (PAR) time series with finite variance are considered one of the most common models of second-order cyclostationary processes. However, in real applications, signals with periodic characteristics may be disturbed by additional noise related to measurement device disturbances or to other external sources. Thus, the known estimation techniques dedicated to PAR models may be inefficient for such cases. When the variance of the additive noise is relatively small, it can be ignored and the classical estimation techniques can be applied. However, for extreme cases, the additive noise can have a significant influence on the estimation results. In this paper, we propose four estimation techniques for the noise-corrupted PAR models with finite variance distributions. The methodology is based on Yule-Walker equations utilizing the autocovariance function. It can be used for any type of the finite variance additive noise. The presented simulation study clearly indicates the efficiency of the proposed techniques, also for the extreme case when the additive noise is a sum of Gaussian additive noise and additive outliers. The proposed estimation techniques are also applied for testing whether the data correspond to a noise-corrupted PAR model. This issue is strongly related to the identification of the informative component in the data when the model is disturbed by additive non-informative noise. The power of the test is studied for simulated data. Finally, the testing procedure is applied to two real time series describing particulate matter concentration in the air."}, "https://arxiv.org/abs/2303.04669": {"title": "Minimum contrast for the first-order intensity estimation of spatial and spatio-temporal point processes", "link": "https://arxiv.org/abs/2303.04669", "description": "arXiv:2303.04669v2 Announce Type: replace \nAbstract: In this paper, we harness a result in point process theory, specifically the expectation of the weighted $K$-function, where the weighting is done by the true first-order intensity function. This theoretical result can be employed as an estimation method to derive parameter estimates for a particular model assumed for the data. The underlying motivation is to avoid the difficulties associated with dealing with complex likelihoods in point process models and their maximization. The exploited result makes our method theoretically applicable to any model specification. In this paper, we restrict our study to Poisson models, whose likelihood represents the base for many more complex point process models.\n In this context, our proposed method can estimate the vector of local parameters that correspond to the points within the analyzed point pattern without introducing any additional complexity compared to the global estimation. 
We illustrate the method through simulation studies for both purely spatial and spatio-temporal point processes and show complex scenarios based on the Poisson model through the analysis of two real datasets concerning environmental problems."}, "https://arxiv.org/abs/2303.12378": {"title": "A functional spatial autoregressive model using signatures", "link": "https://arxiv.org/abs/2303.12378", "description": "arXiv:2303.12378v2 Announce Type: replace \nAbstract: We propose a new approach to the autoregressive spatial functional model, based on the notion of signature, which represents a function as an infinite series of its iterated integrals. It has the advantage of being applicable to a wide range of processes. After providing theoretical guarantees for the proposed model, we show in a simulation study and on a real data set that this new approach offers competitive performance compared to the traditional model."}, "https://arxiv.org/abs/2303.17823": {"title": "An interpretable neural network-based non-proportional odds model for ordinal regression", "link": "https://arxiv.org/abs/2303.17823", "description": "arXiv:2303.17823v4 Announce Type: replace \nAbstract: This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is defined for both continuous and discrete responses, whereas standard methods typically treat the ordered continuous variables as if they were discrete, and (2) instead of estimating response-dependent finite-dimensional coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets."}, "https://arxiv.org/abs/2306.05186": {"title": "Innovation Processes for Inference", "link": "https://arxiv.org/abs/2306.05186", "description": "arXiv:2306.05186v2 Announce Type: replace \nAbstract: Urn models for innovation have proven to capture fundamental empirical laws shared by several real-world processes. The so-called urn model with triggering includes, as particular cases, an urn representation of the two-parameter Poisson-Dirichlet process and the Dirichlet process, seminal in Bayesian non-parametric inference. In this work, we leverage this connection to introduce a novel approach for quantifying closeness between symbolic sequences and test it within the framework of the authorship attribution problem. The method demonstrates high accuracy when compared to other state-of-the-art methods in different scenarios, featuring a substantial gain in computational efficiency and theoretical transparency. Beyond the practical convenience, this work demonstrates how the recently established connection between urn models and non-parametric Bayesian inference can pave the way for designing more efficient inference methods. 
In particular, the hybrid approach that we propose allows us to relax the exchangeability hypothesis, which can be particularly relevant for systems exhibiting complex correlation patterns and non-stationary dynamics."}, "https://arxiv.org/abs/2306.08693": {"title": "Integrating Uncertainty Awareness into Conformalized Quantile Regression", "link": "https://arxiv.org/abs/2306.08693", "description": "arXiv:2306.08693v2 Announce Type: replace \nAbstract: Conformalized Quantile Regression (CQR) is a recently proposed method for constructing prediction intervals for a response $Y$ given covariates $X$, without making distributional assumptions. However, existing constructions of CQR can be ineffective for problems where the quantile regressors perform better in certain parts of the feature space than others. The reason is that the prediction intervals of CQR do not distinguish between two forms of uncertainty: first, the variability of the conditional distribution of $Y$ given $X$ (i.e., aleatoric uncertainty), and second, our uncertainty in estimating this conditional distribution (i.e., epistemic uncertainty). This can lead to intervals that are overly narrow in regions where epistemic uncertainty is high. To address this, we propose a new variant of the CQR methodology, Uncertainty-Aware CQR (UACQR), that explicitly separates these two sources of uncertainty to adjust quantile regressors differentially across the feature space. Compared to CQR, our methods enjoy the same distribution-free theoretical coverage guarantees, while demonstrating in our experiments stronger conditional coverage properties in simulated settings and real-world data sets alike."}, "https://arxiv.org/abs/2310.17165": {"title": "Price Experimentation and Interference", "link": "https://arxiv.org/abs/2310.17165", "description": "arXiv:2310.17165v2 Announce Type: replace \nAbstract: In this paper, we examine biases arising in A/B tests where firms modify a continuous parameter, such as price, to estimate the global treatment effect of a given performance metric, such as profit. These biases emerge in canonical experimental estimators due to interference among market participants. We employ structural modeling and differential calculus to derive intuitive characterizations of these biases. We then specialize our general model to a standard revenue management pricing problem. This setting highlights a key pitfall in the use of A/B pricing experiments to guide profit maximization: notably, the canonical estimator for the expected change in profits can have the {\\em wrong sign}. In other words, following the guidance of canonical estimators may lead firms to move prices in the wrong direction, inadvertently decreasing profits relative to the status quo. We apply these results to a two-sided market model and show how this ``change of sign\" regime depends on model parameters such as market imbalance, as well as the price markup. Finally, we discuss structural and practical implications for platform operators."}, "https://arxiv.org/abs/2311.06458": {"title": "Conditional Adjustment in a Markov Equivalence Class", "link": "https://arxiv.org/abs/2311.06458", "description": "arXiv:2311.06458v2 Announce Type: replace \nAbstract: We consider the problem of identifying a conditional causal effect through covariate adjustment. We focus on the setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). 
Both MPDAGs and PAGs represent equivalence classes of possible underlying causal models. After defining adjustment sets in this setting, we provide a necessary and sufficient graphical criterion -- the conditional adjustment criterion -- for finding these sets under conditioning on variables unaffected by treatment. We further provide explicit sets from the graph that satisfy the conditional adjustment criterion, and therefore, can be used as adjustment sets for conditional causal effect identification."}, "https://arxiv.org/abs/2312.17061": {"title": "Bayesian Analysis of High Dimensional Vector Error Correction Model", "link": "https://arxiv.org/abs/2312.17061", "description": "arXiv:2312.17061v2 Announce Type: replace \nAbstract: Vector Error Correction Model (VECM) is a classic method to analyse cointegration relationships amongst multivariate non-stationary time series. In this paper, we focus on the high-dimensional setting and seek a sample-size-efficient methodology to determine the level of cointegration. Our investigation centres on a Bayesian approach to analyse the cointegration matrix, henceforth determining the cointegration rank. We design two algorithms and implement them on simulated examples, yielding promising results particularly when dealing with a high number of variables and a relatively low number of observations. Furthermore, we extend this methodology to empirically investigate the constituents of the S&P 500 index, where low-volatility portfolios can be found during both in-sample training and out-of-sample testing periods."}, "https://arxiv.org/abs/2111.13800": {"title": "A Two-Stage Feature Selection Approach for Robust Evaluation of Treatment Effects in High-Dimensional Observational Data", "link": "https://arxiv.org/abs/2111.13800", "description": "arXiv:2111.13800v2 Announce Type: replace-cross \nAbstract: A Randomized Control Trial (RCT) is considered the gold standard for evaluating the effect of any intervention or treatment. However, its feasibility is often hindered by ethical, economical, and legal considerations, making observational data a valuable alternative for drawing causal conclusions. Nevertheless, healthcare observational data presents a difficult challenge due to its high dimensionality, requiring careful consideration to ensure unbiased, reliable, and robust causal inferences. To overcome this challenge, in this study, we propose a novel two-stage feature selection technique called Outcome Adaptive Elastic Net (OAENet), explicitly designed for making robust causal inference decisions using matching techniques. OAENet offers several key advantages over existing methods: superior performance on correlated and high-dimensional data compared to existing methods, and the ability to select specific sets of variables (including confounders and variables associated only with the outcome). This ensures robustness and facilitates an unbiased estimate of the causal effect. Numerical experiments on simulated data demonstrate that OAENet significantly outperforms state-of-the-art methods by either producing a higher-quality estimate or a comparable estimate in significantly less time. To illustrate the applicability of OAENet, we employ large-scale US healthcare data to estimate the effect of Opioid Use Disorder (OUD) on suicidal behavior. When compared to competing methods, OAENet closely aligns with existing literature on the relationship between OUD and suicidal behavior. 
Performance on both simulated and real-world data highlights that OAENet notably enhances the accuracy of estimating treatment effects or evaluating policy decision-making with causal inference."}, "https://arxiv.org/abs/2311.05330": {"title": "A Bayesian framework for measuring association and its application to emotional dynamics in Web discourse", "link": "https://arxiv.org/abs/2311.05330", "description": "arXiv:2311.05330v2 Announce Type: replace-cross \nAbstract: This paper introduces a Bayesian framework designed to measure the degree of association between categorical random variables. The method is grounded in the formal definition of variable independence and is implemented using Markov Chain Monte Carlo (MCMC) techniques. Unlike commonly employed techniques in Association Rule Learning, this approach enables a clear and precise estimation of confidence intervals and the statistical significance of the measured degree of association. We applied the method to non-exclusive emotions identified by annotators in 4,613 tweets written in Portuguese. This analysis revealed pairs of emotions that exhibit associations and mutually opposed pairs. Moreover, the method identifies hierarchical relations between categories, a feature observed in our data, and is utilized to cluster emotions into basic-level groups."}, "https://arxiv.org/abs/2403.07892": {"title": "Change Point Detection with Copula Entropy based Two-Sample Test", "link": "https://arxiv.org/abs/2403.07892", "description": "arXiv:2403.07892v1 Announce Type: new \nAbstract: Change point detection is a typical task that aims to find changes in time series and can be tackled with a two-sample test. Copula Entropy is a mathematical concept for measuring statistical independence and a two-sample test based on it was introduced recently. In this paper we propose a nonparametric multivariate method for multiple change point detection with the copula entropy-based two-sample test. Single change point detection is first formulated as a group of two-sample tests, one at every point of the time series, and the change point is taken to be the point with the maximum test statistic. Multiple change point detection is then obtained by combining the single change point detection method with a binary segmentation strategy. We verified the effectiveness of our method and compared it with other similar methods on simulated univariate and multivariate data and on the Nile data."}, "https://arxiv.org/abs/2403.07963": {"title": "Copula based dependent censoring in cure models", "link": "https://arxiv.org/abs/2403.07963", "description": "arXiv:2403.07963v1 Announce Type: new \nAbstract: In this paper we consider a time-to-event variable $T$ that is subject to random right censoring, and we assume that the censoring time $C$ is stochastically dependent on $T$ and that there is a positive probability of not observing the event. There are various situations in practice where this happens, and appropriate models and methods need to be considered to avoid biased estimators of the survival function or incorrect conclusions in clinical trials. We consider a fully parametric model for the bivariate distribution of $(T,C)$ that takes these features into account. The model depends on a parametric copula (with unknown association parameter) and on parametric marginal distributions for $T$ and $C$. Sufficient conditions are developed under which the model is identified, and an estimation procedure is proposed. 
In particular, our model allows to identify and estimate the association between $T$ and $C$, even though only the smallest of these variables is observable. The finite sample performance of the estimated parameters is illustrated by means of a thorough simulation study and the analysis of breast cancer data."}, "https://arxiv.org/abs/2403.08118": {"title": "Characterising harmful data sources when constructing multi-fidelity surrogate models", "link": "https://arxiv.org/abs/2403.08118", "description": "arXiv:2403.08118v1 Announce Type: new \nAbstract: Surrogate modelling techniques have seen growing attention in recent years when applied to both modelling and optimisation of industrial design problems. These techniques are highly relevant when assessing the performance of a particular design carries a high cost, as the overall cost can be mitigated via the construction of a model to be queried in lieu of the available high-cost source. The construction of these models can sometimes employ other sources of information which are both cheaper and less accurate. The existence of these sources however poses the question of which sources should be used when constructing a model. Recent studies have attempted to characterise harmful data sources to guide practitioners in choosing when to ignore a certain source. These studies have done so in a synthetic setting, characterising sources using a large amount of data that is not available in practice. Some of these studies have also been shown to potentially suffer from bias in the benchmarks used in the analysis. In this study, we present a characterisation of harmful low-fidelity sources using only the limited data available to train a surrogate model. We employ recently developed benchmark filtering techniques to conduct a bias-free assessment, providing objectively varied benchmark suites of different sizes for future research. Analysing one of these benchmark suites with the technique known as Instance Space Analysis, we provide an intuitive visualisation of when a low-fidelity source should be used and use this analysis to provide guidelines that can be used in an applied industrial setting."}, "https://arxiv.org/abs/2403.08130": {"title": "Imputation of Counterfactual Outcomes when the Errors are Predictable", "link": "https://arxiv.org/abs/2403.08130", "description": "arXiv:2403.08130v1 Announce Type: new \nAbstract: A crucial input into causal inference is the imputed counterfactual outcome.\n Imputation error can arise because of sampling uncertainty from estimating the prediction model using the untreated observations, or from out-of-sample information not captured by the model. While the literature has focused on sampling uncertainty, it vanishes with the sample size. Often overlooked is the possibility that the out-of-sample error can be informative about the missing counterfactual outcome if it is mutually or serially correlated. Motivated by the best linear unbiased predictor (\\blup) of \\citet{goldberger:62} in a time series setting, we propose an improved predictor of potential outcome when the errors are correlated. The proposed \\pup\\; is practical as it is not restricted to linear models,\n can be used with consistent estimators already developed, and improves mean-squared error for a large class of strong mixing error processes. Ignoring predictability in the errors can distort conditional inference. 
However, the precise impact will depend on the choice of estimator as well as the realized values of the residuals."}, "https://arxiv.org/abs/2403.08183": {"title": "Causal Interpretation of Estimands Defined by Exposure Mappings", "link": "https://arxiv.org/abs/2403.08183", "description": "arXiv:2403.08183v1 Announce Type: new \nAbstract: In settings with interference, it is common to utilize estimands defined by exposure mappings to summarize the impact of variation in treatment assignments local to the ego. This paper studies their causal interpretation under weak restrictions on interference. We demonstrate that the estimands can exhibit unpalatable sign reversals under conventional identification conditions. This motivates the formulation of sign preservation criteria for causal interpretability. To satisfy preferred criteria, it is necessary to impose restrictions on interference, either in potential outcomes or selection into treatment. We provide sufficient conditions and show that they are satisfied by a nonparametric model allowing for a complex form of interference in both the outcome and selection stages."}, "https://arxiv.org/abs/2403.08359": {"title": "Assessment of background noise properties in time and time-frequency domains in the context of vibration-based local damage detection in real environment", "link": "https://arxiv.org/abs/2403.08359", "description": "arXiv:2403.08359v1 Announce Type: new \nAbstract: Any measurement in condition monitoring applications is associated with disturbing noise. Till now, most of the diagnostic procedures have assumed the Gaussian distribution for the noise. This paper shares a novel perspective to the problem of local damage detection. The acquired vector of observations is considered as an additive mixture of signal of interest (SOI) and noise with strongly non-Gaussian, heavy-tailed properties, that masks the SOI. The distribution properties of the background noise influence the selection of tools used for the signal analysis, particularly for local damage detection. Thus, it is extremely important to recognize and identify possible non-Gaussian behavior of the noise. The problem considered here is more general than the classical goodness-of-fit testing. The paper highlights the important role of variance, as most of the methods for signal analysis are based on the assumption of the finite-variance distribution of the underlying signal. The finite variance assumption is crucial but implicit to most indicators used in condition monitoring, (such as the root-mean-square value, the power spectral density, the kurtosis, the spectral correlation, etc.), in view that infinite variance implies moments higher than 2 are also infinite. The problem is demonstrated based on three popular types of non-Gaussian distributions observed for real vibration signals. We demonstrate how the properties of noise distribution in the time domain may change by its transformations to the time-frequency domain (spectrogram). Additionally, we propose a procedure to check the presence of the infinite-variance of the background noise. Our investigations are illustrated using simulation studies and real vibration signals from various machines."}, "https://arxiv.org/abs/2403.08514": {"title": "Spatial Latent Gaussian Modelling with Change of Support", "link": "https://arxiv.org/abs/2403.08514", "description": "arXiv:2403.08514v1 Announce Type: new \nAbstract: Spatial data are often derived from multiple sources (e.g. 
satellites, in-situ sensors, survey samples) with different supports, but associated with the same properties of a spatial phenomenon of interest. It is common for predictors to also be measured on different spatial supports than the response variables. Although there is no standard way to work with spatial data with different supports, a prevalent approach used by practitioners has been to use downscaling or interpolation to project all the variables of analysis towards a common support, and then to use standard spatial models. The main disadvantage with this approach is that simple interpolation can introduce biases and, more importantly, the uncertainty associated with the change of support is not taken into account in parameter estimation. In this article, we propose a Bayesian spatial latent Gaussian model that can handle data with different rectilinear supports in both the response variable and predictors. Our approach allows changes of support to be handled more naturally according to the properties of the spatial stochastic process being used, and to take into account the uncertainty from the change of support in parameter estimation and prediction. We use spatial stochastic processes as linear combinations of basis functions where Gaussian Markov random fields define the weights. Our hierarchical modelling approach can be described by the following steps: (i) define a latent model where response variables and predictors are considered as latent stochastic processes with continuous support, (ii) link the continuous-index set stochastic processes with their projection to the support of the observed data, (iii) link the projected process with the observed data. We show the applicability of our approach by simulation studies and modelling land suitability for improved grassland in Rhondda Cynon Taf, a county borough in Wales."}, "https://arxiv.org/abs/2403.08577": {"title": "Evaluation and comparison of covariate balance metrics in studies with time-dependent confounding", "link": "https://arxiv.org/abs/2403.08577", "description": "arXiv:2403.08577v1 Announce Type: new \nAbstract: Marginal structural models have been increasingly used by analysts in recent years to account for confounding bias in studies with time-varying treatments. The parameters of these models are often estimated using inverse probability of treatment weighting. To ensure that the estimated weights adequately control confounding, it is possible to check for residual imbalance between treatment groups in the weighted data. Several balance metrics have been developed and compared in the cross-sectional case but have not yet been evaluated and compared in longitudinal studies with time-varying treatment. We have first extended the definition of several balance metrics to the case of a time-varying treatment, with or without censoring. We then compared the performance of these balance metrics in a simulation study by assessing the strength of the association between their estimated level of imbalance and bias. We found that the Mahalanobis balance performed best. Finally, the method was illustrated for estimating the cumulative effect of statins exposure over one year on the risk of cardiovascular disease or death in people aged 65 and over in population-wide administrative data. 
This illustration confirms the feasibility of employing our proposed metrics in large databases with multiple time-points."}, "https://arxiv.org/abs/2403.08630": {"title": "Leveraging Non-Decimated Wavelet Packet Features and Transformer Models for Time Series Forecasting", "link": "https://arxiv.org/abs/2403.08630", "description": "arXiv:2403.08630v1 Announce Type: new \nAbstract: This article combines wavelet analysis techniques with machine learning methods for univariate time series forecasting, focusing on three main contributions. Firstly, we consider the use of Daubechies wavelets with different numbers of vanishing moments as input features to both non-temporal and temporal forecasting methods, by selecting these numbers during the cross-validation phase. Secondly, we compare the use of both the non-decimated wavelet transform and the non-decimated wavelet packet transform for computing these features, the latter providing a much larger set of potentially useful coefficient vectors. The wavelet coefficients are computed using a shifted version of the typical pyramidal algorithm to ensure no leakage of future information into these inputs. Thirdly, we evaluate the use of these wavelet features on a significantly wider set of forecasting methods than previous studies, including both temporal and non-temporal models, and both statistical and deep learning-based methods. The latter include state-of-the-art transformer-based neural network architectures. Our experiments suggest significant benefit in replacing higher-order lagged features with wavelet features across all examined non-temporal methods for one-step-forward forecasting, and modest benefit when used as inputs for temporal deep learning-based models for long-horizon forecasting."}, "https://arxiv.org/abs/2403.08653": {"title": "Physics-Guided Inverse Regression for Crop Quality Assessment", "link": "https://arxiv.org/abs/2403.08653", "description": "arXiv:2403.08653v1 Announce Type: new \nAbstract: We present an innovative approach leveraging Physics-Guided Neural Networks (PGNNs) for enhancing agricultural quality assessments. Central to our methodology is the application of physics-guided inverse regression, a technique that significantly improves the model's ability to precisely predict quality metrics of crops. This approach directly addresses the challenges of scalability, speed, and practicality that traditional assessment methods face. By integrating physical principles, notably Fick's second law of diffusion, into neural network architectures, our developed PGNN model achieves a notable advancement in enhancing both the interpretability and accuracy of assessments. Empirical validation conducted on cucumbers and mushrooms demonstrates the superior capability of our model in outperforming conventional computer vision techniques in postharvest quality evaluation. 
This underscores our contribution as a scalable and efficient solution to the pressing demands of global food supply challenges."}, "https://arxiv.org/abs/2403.08753": {"title": "Invalid proxies and volatility changes", "link": "https://arxiv.org/abs/2403.08753", "description": "arXiv:2403.08753v1 Announce Type: new \nAbstract: When in proxy-SVARs the covariance matrix of VAR disturbances is subject to exogenous, permanent, nonrecurring breaks that generate target impulse response functions (IRFs) that change across volatility regimes, even strong, exogenous external instruments can result in inconsistent estimates of the dynamic causal effects of interest if the breaks are not properly accounted for. In such cases, it is essential to explicitly incorporate the shifts in unconditional volatility in order to point-identify the target structural shocks and possibly restore consistency. We demonstrate that, under a necessary and sufficient rank condition that leverages moments implied by changes in volatility, the target IRFs can be point-identified and consistently estimated. Importantly, standard asymptotic inference remains valid in this context despite (i) the covariance between the proxies and the instrumented structural shocks being local-to-zero, as in Staiger and Stock (1997), and (ii) the potential failure of instrument exogeneity. We introduce a novel identification strategy that appropriately combines external instruments with \"informative\" changes in volatility, thus obviating the need to assume proxy relevance and exogeneity in estimation. We illustrate the effectiveness of the suggested method by revisiting a fiscal proxy-SVAR previously estimated in the literature, complementing the fiscal instruments with information derived from the massive reduction in volatility observed in the transition from the Great Inflation to the Great Moderation regimes."}, "https://arxiv.org/abs/2403.08079": {"title": "BayesFLo: Bayesian fault localization of complex software systems", "link": "https://arxiv.org/abs/2403.08079", "description": "arXiv:2403.08079v1 Announce Type: cross \nAbstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods, however, are largely deterministic, and thus do not provide a principled approach for assessing probabilistic risk of potential root causes, or for integrating domain and/or structural knowledge from test engineers. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model on potential root cause combinations. A key feature of BayesFLo is its integration of the principles of combination hierarchy and heredity, which capture the structured nature of failure-inducing combinations. A critical challenge, however, is the sheer number of potential root cause scenarios to consider, which renders the computation of posterior root cause probabilities infeasible even for small software systems. We thus develop new algorithms for efficient computation of such probabilities, leveraging recent tools from integer programming and graph representations. 
We then demonstrate the effectiveness of BayesFLo over state-of-the-art fault localization methods, in a suite of numerical experiments and in two motivating case studies on the JMP XGBoost interface."}, "https://arxiv.org/abs/2403.08628": {"title": "Optimal sub-Gaussian variance proxy for truncated Gaussian and exponential random variables", "link": "https://arxiv.org/abs/2403.08628", "description": "arXiv:2403.08628v1 Announce Type: cross \nAbstract: This paper establishes the optimal sub-Gaussian variance proxy for truncated Gaussian and truncated exponential random variables. The proofs rely on first characterizing the optimal variance proxy as the unique solution to a set of two equations and then observing that for these two truncated distributions, one may find explicit solutions to this set of equations. Moreover, we establish the conditions under which the optimal variance proxy coincides with the variance, thereby characterizing the strict sub-Gaussianity of the truncated random variables. Specifically, we demonstrate that truncated Gaussian variables exhibit strict sub-Gaussian behavior if and only if they are symmetric, meaning their truncation is symmetric with respect to the mean. Conversely, truncated exponential variables are shown to never exhibit strict sub-Gaussian properties. These findings contribute to the understanding of these prevalent probability distributions in statistics and machine learning, providing a valuable foundation for improved and optimal modeling and decision-making processes."}, "https://arxiv.org/abs/2206.04643": {"title": "On the Performance of the Neyman Allocation with Small Pilots", "link": "https://arxiv.org/abs/2206.04643", "description": "arXiv:2206.04643v3 Announce Type: replace \nAbstract: The Neyman Allocation is used in many papers on experimental design, which typically assume that researchers have access to large pilot studies. This may be unrealistic. To understand the properties of the Neyman Allocation with small pilots, we study its behavior in an asymptotic framework that takes pilot size to be fixed even as the size of the main wave tends to infinity. Our analysis shows that the Neyman Allocation can lead to estimates of the ATE with higher asymptotic variance than with (non-adaptive) balanced randomization. In particular, this happens when the outcome variable is relatively homoskedastic with respect to treatment status or when it exhibits high kurtosis. We provide a series of empirical examples showing that such situations can arise in practice. Our results suggest that researchers with small pilots should not use the Neyman Allocation if they believe that outcomes are homoskedastic or heavy-tailed. We examine some potential methods for improving the finite sample performance of the FNA via simulations."}, "https://arxiv.org/abs/2208.11570": {"title": "Flexible control of the median of the false discovery proportion", "link": "https://arxiv.org/abs/2208.11570", "description": "arXiv:2208.11570v4 Announce Type: replace \nAbstract: We introduce a multiple testing procedure that controls the median of the proportion of false discoveries (FDP) in a flexible way. The procedure only requires a vector of p-values as input and is comparable to the Benjamini-Hochberg method, which controls the mean of the FDP. Our method allows freely choosing one or several values of alpha after seeing the data -- unlike Benjamini-Hochberg, which can be very liberal when alpha is chosen post hoc. 
We prove these claims and illustrate them with simulations. Our procedure is inspired by a popular estimator of the total number of true hypotheses. We adapt this estimator to provide simultaneously median unbiased estimators of the FDP, valid for finite samples. This simultaneity allows for the claimed flexibility. Our approach does not assume independence. The time complexity of our method is linear in the number of hypotheses, after sorting the p-values."}, "https://arxiv.org/abs/2209.00105": {"title": "Personalized Biopsy Schedules Using an Interval-censored Cause-specific Joint Model", "link": "https://arxiv.org/abs/2209.00105", "description": "arXiv:2209.00105v5 Announce Type: replace \nAbstract: Active surveillance (AS), where biopsies are conducted to detect cancer progression, has been acknowledged as an efficient way to reduce the overtreatment of prostate cancer. Most AS cohorts use fixed biopsy schedules for all patients. However, the ideal test frequency remains unknown, and the routine use of such invasive tests burdens the patients. An emerging idea is to generate personalized biopsy schedules based on each patient's progression-specific risk. To achieve that, we propose the interval-censored cause-specific joint model (ICJM), which models the impact of longitudinal biomarkers on cancer progression while considering the competing event of early treatment initiation. The underlying likelihood function incorporates the interval-censoring of cancer progression, the competing risk of treatment, and the uncertainty about whether cancer progression occurred since the last biopsy in patients that are right-censored or experience the competing event. The model can produce patient-specific risk profiles until a horizon time. If the risk exceeds a certain threshold, a biopsy is conducted. The optimal threshold can be chosen by balancing two indicators of the biopsy schedules: the expected number of biopsies and expected delay in detection of cancer progression. A simulation study showed that our personalized schedules could considerably reduce the number of biopsies per patient by 34%-54% compared to the fixed schedules, though at the cost of a slightly longer detection delay."}, "https://arxiv.org/abs/2306.01604": {"title": "On the minimum information checkerboard copulas under fixed Kendall's rank correlation", "link": "https://arxiv.org/abs/2306.01604", "description": "arXiv:2306.01604v2 Announce Type: replace \nAbstract: Copulas have gained widespread popularity as statistical models to represent dependence structures between multiple variables in various applications. The minimum information copula, given a finite number of constraints in advance, emerges as the copula closest to the uniform copula when measured in Kullback-Leibler divergence. In prior research, the focus has predominantly been on constraints related to expectations on moments, including Spearman's $\\rho$. This approach allows for obtaining the copula through convex programming. However, the existing framework for minimum information copulas does not encompass non-linear constraints such as Kendall's $\\tau$. To address this limitation, we introduce MICK, a novel minimum information copula under fixed Kendall's $\\tau$. We first characterize MICK by its local dependence property. Despite being defined as the solution to a non-convex optimization problem, we demonstrate that the uniqueness of this copula is guaranteed when the correlation is sufficiently small. 
Additionally, we provide numerical insights into applying MICK to real financial data."}, "https://arxiv.org/abs/2306.10601": {"title": "Sliced Wasserstein Regression", "link": "https://arxiv.org/abs/2306.10601", "description": "arXiv:2306.10601v2 Announce Type: replace \nAbstract: While statistical modeling of distributional data has gained increased attention, the case of multivariate distributions has been somewhat neglected despite its relevance in various applications. This is because the Wasserstein distance, commonly used in distributional data analysis, poses challenges for multivariate distributions. A promising alternative is the sliced Wasserstein distance, which offers a computationally simpler solution. We propose distributional regression models with multivariate distributions as responses paired with Euclidean vector predictors. The foundation of our methodology is a slicing transform from the multivariate distribution space to the sliced distribution space for which we establish a theoretical framework, with the Radon transform as a prominent example. We introduce and study the asymptotic properties of sample-based estimators for two regression approaches, one based on utilizing the sliced Wasserstein distance directly in the multivariate distribution space, and a second approach based on a new slice-wise distance, employing a univariate distribution regression for each slice. Both global and local Fr\\'echet regression methods are deployed for these approaches and illustrated in simulations and through applications. These include joint distributions of excess winter death rates and winter temperature anomalies in European countries as a function of base winter temperature and also data from finance."}, "https://arxiv.org/abs/2310.11969": {"title": "Survey calibration for causal inference: a simple method to balance covariate distributions", "link": "https://arxiv.org/abs/2310.11969", "description": "arXiv:2310.11969v2 Announce Type: replace \nAbstract: This paper proposes a~simple, yet powerful, method for balancing distributions of covariates for causal inference based on observational studies. The method makes it possible to balance an arbitrary number of quantiles (e.g., medians, quartiles, or deciles) together with means if necessary. The proposed approach is based on the theory of calibration estimators (Deville and S\\\"arndal 1992), in particular, calibration estimators for quantiles, proposed by Harms and Duchesne (2006). The method does not require numerical integration, kernel density estimation or assumptions about the distributions. Valid estimates can be obtained by drawing on existing asymptotic theory. An~illustrative example of the proposed approach is presented for the entropy balancing method and the covariate balancing propensity score method. Results of a~simulation study indicate that the method efficiently estimates average treatment effects on the treated (ATT), the average treatment effect (ATE), the quantile treatment effect on the treated (QTT) and the quantile treatment effect (QTE), especially in the presence of non-linearity and mis-specification of the models. The proposed approach can be further generalized to other designs (e.g. multi-category, continuous) or methods (e.g. synthetic control method). 
An open source software implementing proposed methods is available."}, "https://arxiv.org/abs/2401.14359": {"title": "Minimum Covariance Determinant: Spectral Embedding and Subset Size Determination", "link": "https://arxiv.org/abs/2401.14359", "description": "arXiv:2401.14359v2 Announce Type: replace \nAbstract: This paper introduces several enhancements to the minimum covariance determinant method of outlier detection and robust estimation of means and covariances. We leverage the principal component transform to achieve dimension reduction and ultimately better analyses. Our best subset selection algorithm strategically combines statistical depth and concentration steps. To ascertain the appropriate subset size and number of principal components, we introduce a bootstrap procedure that estimates the instability of the best subset algorithm. The parameter combination exhibiting minimal instability proves ideal for the purposes of outlier detection and robust estimation. Rigorous benchmarking against prominent MCD variants showcases our approach's superior statistical performance and computational speed in high dimensions. Application to a fruit spectra data set and a cancer genomics data set illustrates our claims."}, "https://arxiv.org/abs/2305.15754": {"title": "Bayesian Analysis for Over-parameterized Linear Model without Sparsity", "link": "https://arxiv.org/abs/2305.15754", "description": "arXiv:2305.15754v2 Announce Type: replace-cross \nAbstract: In the field of high-dimensional Bayesian statistics, a plethora of methodologies have been developed, including various prior distributions that result in parameter sparsity. However, such priors exhibit limitations in handling the spectral eigenvector structure of data, rendering estimations less effective for analyzing the over-parameterized models (high-dimensional linear models that do not assume sparsity) developed in recent years. This study introduces a Bayesian approach that employs a prior distribution dependent on the eigenvectors of data covariance matrices without inducing parameter sparsity. We also provide contraction rates of the derived posterior estimation and develop a truncated Gaussian approximation of the posterior distribution. The former demonstrates the efficiency of posterior estimation, whereas the latter facilitates the uncertainty quantification of parameters via a Bernstein--von Mises-type approach. These findings suggest that Bayesian methods capable of handling data spectra and estimating non-sparse high-dimensional parameters are feasible."}, "https://arxiv.org/abs/2401.06446": {"title": "Increasing dimension asymptotics for two-way crossed mixed effect models", "link": "https://arxiv.org/abs/2401.06446", "description": "arXiv:2401.06446v2 Announce Type: replace-cross \nAbstract: This paper presents asymptotic results for the maximum likelihood and restricted maximum likelihood (REML) estimators within a two-way crossed mixed effect model as the sizes of the rows, columns, and cells tend to infinity. Under very mild conditions which do not require the assumption of normality, the estimators are proven to be asymptotically normal, possessing a structured covariance matrix. 
The growth rate for the number of rows, columns, and cells is unrestricted, whether considered pairwise or collectively."}, "https://arxiv.org/abs/2403.08927": {"title": "Principal stratification with U-statistics under principal ignorability", "link": "https://arxiv.org/abs/2403.08927", "description": "arXiv:2403.08927v1 Announce Type: new \nAbstract: Principal stratification is a popular framework for causal inference in the presence of an intermediate outcome. While the principal average treatment effects have traditionally been the default target of inference, it may not be sufficient when the interest lies in the relative favorability of one potential outcome over the other within the principal stratum. We thus introduce the principal generalized causal effect estimands, which extend the principal average causal effects to accommodate nonlinear contrast functions. Under principal ignorability, we expand the theoretical results in Jiang et. al. (2022) to a much wider class of causal estimands in the presence of a binary intermediate variable. We develop identification formulas and derive the efficient influence functions of the generalized estimands for principal stratification analyses. These efficient influence functions motivate a set of multiply robust estimators and lay the ground for obtaining efficient debiased machine learning estimators via cross-fitting based on $U$-statistics. The proposed methods are illustrated through simulations and the analysis of a data example."}, "https://arxiv.org/abs/2403.09042": {"title": "Recurrent Events Modeling Based on a Reflected Brownian Motion with Application to Hypoglycemia", "link": "https://arxiv.org/abs/2403.09042", "description": "arXiv:2403.09042v1 Announce Type: new \nAbstract: Patients with type 2 diabetes need to closely monitor blood sugar levels as their routine diabetes self-management. Although many treatment agents aim to tightly control blood sugar, hypoglycemia often stands as an adverse event. In practice, patients can observe hypoglycemic events more easily than hyperglycemic events due to the perception of neurogenic symptoms. We propose to model each patient's observed hypoglycemic event as a lower-boundary crossing event for a reflected Brownian motion with an upper reflection barrier. The lower-boundary is set by clinical standards. To capture patient heterogeneity and within-patient dependence, covariates and a patient level frailty are incorporated into the volatility and the upper reflection barrier. This framework provides quantification for the underlying glucose level variability, patients heterogeneity, and risk factors' impact on glucose. We make inferences based on a Bayesian framework using Markov chain Monte Carlo. Two model comparison criteria, the Deviance Information Criterion and the Logarithm of the Pseudo-Marginal Likelihood, are used for model selection. The methodology is validated in simulation studies. 
In analyzing a dataset from the diabetic patients in the DURABLE trial, our model provides adequate fit, generates data similar to the observed data, and offers insights that could be missed by other models."}, "https://arxiv.org/abs/2403.09350": {"title": "A Bayes Factor Framework for Unified Parameter Estimation and Hypothesis Testing", "link": "https://arxiv.org/abs/2403.09350", "description": "arXiv:2403.09350v1 Announce Type: new \nAbstract: The Bayes factor, the data-based updating factor of the prior to posterior odds of two hypotheses, is a natural measure of statistical evidence for one hypothesis over the other. We show how Bayes factors can also be used for parameter estimation. The key idea is to consider the Bayes factor as a function of the parameter value under the null hypothesis. This 'Bayes factor function' is inverted to obtain point estimates ('maximum evidence estimates') and interval estimates ('support intervals'), similar to how P-value functions are inverted to obtain point estimates and confidence intervals. This provides data analysts with a unified inference framework as Bayes factors (for any tested parameter value), support intervals (at any level), and point estimates can be easily read off from a plot of the Bayes factor function. This approach shares similarities but is also distinct from conventional Bayesian and frequentist approaches: It uses the Bayesian evidence calculus, but without synthesizing data and prior, and it defines statistical evidence in terms of (integrated) likelihood ratios, but also includes a natural way for dealing with nuisance parameters. Applications to real-world examples illustrate how our framework is of practical value for making quantitative inferences."}, "https://arxiv.org/abs/2403.09503": {"title": "Shrinkage for Extreme Partial Least-Squares", "link": "https://arxiv.org/abs/2403.09503", "description": "arXiv:2403.09503v1 Announce Type: new \nAbstract: This work focuses on dimension-reduction techniques for modelling conditional extreme values. Specifically, we investigate the idea that extreme values of a response variable can be explained by nonlinear functions derived from linear projections of an input random vector. In this context, the estimation of projection directions is examined, as approached by the Extreme Partial Least Squares (EPLS) method--an adaptation of the original Partial Least Squares (PLS) method tailored to the extreme-value framework. Further, a novel interpretation of EPLS directions as maximum likelihood estimators is introduced, utilizing the von Mises-Fisher distribution applied to hyperballs. The dimension reduction process is enhanced through the Bayesian paradigm, enabling the incorporation of prior information into the projection direction estimation. The maximum a posteriori estimator is derived in two specific cases, elucidating it as a regularization or shrinkage of the EPLS estimator. We also establish its asymptotic behavior as the sample size approaches infinity. A simulation data study is conducted in order to assess the practical utility of our proposed method. This clearly demonstrates its effectiveness even in moderate data problems within high-dimensional settings. 
Furthermore, we provide an illustrative example of the method's applicability using French farm income data, highlighting its efficacy in real-world scenarios."}, "https://arxiv.org/abs/2403.09604": {"title": "Extremal graphical modeling with latent variables", "link": "https://arxiv.org/abs/2403.09604", "description": "arXiv:2403.09604v1 Announce Type: new \nAbstract: Extremal graphical models encode the conditional independence structure of multivariate extremes and provide a powerful tool for quantifying the risk of rare events. Prior work on learning these graphs from data has focused on the setting where all relevant variables are observed. For the popular class of H\\\"usler-Reiss models, we propose the \\texttt{eglatent} method, a tractable convex program for learning extremal graphical models in the presence of latent variables. Our approach decomposes the H\\\"usler-Reiss precision matrix into a sparse component encoding the graphical structure among the observed variables after conditioning on the latent variables, and a low-rank component encoding the effect of a few latent variables on the observed variables. We provide finite-sample guarantees of \\texttt{eglatent} and show that it consistently recovers the conditional graph as well as the number of latent variables. We highlight the improved performances of our approach on synthetic and real data."}, "https://arxiv.org/abs/2011.09437": {"title": "Trend and Variance Adaptive Bayesian Changepoint Analysis & Local Outlier Scoring", "link": "https://arxiv.org/abs/2011.09437", "description": "arXiv:2011.09437v4 Announce Type: replace \nAbstract: We adaptively estimate both changepoints and local outlier processes in a Bayesian dynamic linear model with global-local shrinkage priors in a novel model we call Adaptive Bayesian Changepoints with Outliers (ABCO). We utilize a state-space approach to identify a dynamic signal in the presence of outliers and measurement error with stochastic volatility. We find that global state equation parameters are inadequate for most real applications and we include local parameters to track noise at each time-step. This setup provides a flexible framework to detect unspecified changepoints in complex series, such as those with large interruptions in local trends, with robustness to outliers and heteroskedastic noise. Finally, we compare our algorithm against several alternatives to demonstrate its efficacy in diverse simulation scenarios and two empirical examples on the U.S. economy."}, "https://arxiv.org/abs/2208.07614": {"title": "Reweighting the RCT for generalization: finite sample error and variable selection", "link": "https://arxiv.org/abs/2208.07614", "description": "arXiv:2208.07614v5 Announce Type: replace \nAbstract: Randomized Controlled Trials (RCTs) may suffer from limited scope. In particular, samples may be unrepresentative: some RCTs over- or under- sample individuals with certain characteristics compared to the target population, for which one wants conclusions on treatment effectiveness. Re-weighting trial individuals to match the target population can improve the treatment effect estimation. In this work, we establish the exact expressions of the bias and variance of such reweighting procedures -- also called Inverse Propensity of Sampling Weighting (IPSW) -- in presence of categorical covariates for any sample size. Such results allow us to compare the theoretical performance of different versions of IPSW estimates. 
Besides, our results show how the performance (bias, variance, and quadratic risk) of IPSW estimates depends on the two sample sizes (RCT and target population). A by-product of our work is the proof of consistency of IPSW estimates. Results also reveal that IPSW performance is improved when the trial probability of being treated is estimated (rather than using its oracle counterpart). In addition, we study the choice of variables: how including covariates that are not necessary for identifiability of the causal effect may impact the asymptotic variance. Including covariates that are shifted between the two samples but are not treatment effect modifiers increases the variance, while covariates that are not shifted but are treatment effect modifiers do not. We illustrate all the takeaways in a didactic example, and on a semi-synthetic simulation inspired by critical care medicine."}, "https://arxiv.org/abs/2212.01832": {"title": "The flexible Gumbel distribution: A new model for inference about the mode", "link": "https://arxiv.org/abs/2212.01832", "description": "arXiv:2212.01832v2 Announce Type: replace \nAbstract: A new unimodal distribution family indexed by the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. A regression model concerning the mode of a response given covariates based on the proposed unimodal distribution can be easily formulated, which we apply to data from an application in criminology to reveal interesting data features that are obscured by outliers."}, "https://arxiv.org/abs/2306.15629": {"title": "A non-parametric approach to detect patterns in binary sequences", "link": "https://arxiv.org/abs/2306.15629", "description": "arXiv:2306.15629v2 Announce Type: replace \nAbstract: In many circumstances, given an ordered sequence of one or more types of elements/symbols, the objective is to determine whether one of the elements, say the type 1 element, occurs at random. Such a method can be useful in determining the existence of any non-random pattern in the wins or losses of a player in a series of games played. Existing tests based on the total number of runs or on the length of the longest run (Mosteller (1941)) can be used for testing the null hypothesis of randomness in the entire sequence, but not for a specific type of element. Additionally, the Runs Test tends to show results contradictory to the intuition visualised by graphs of, say, win proportions over time, owing to the method used in the computation of runs. 
This paper develops a test approach to address this problem by computing the gaps between two consecutive type 1 elements and thereafter, following the idea of \"pattern\" in occurrence and \"directional\" trend (increasing, decreasing or constant), employs the exact Binomial test, Kendall's Tau and the Siegel-Tukey test for the scale problem. Further modifications suggested by Jan Vegelius (1982) have been applied to the Siegel-Tukey test to adjust for tied ranks and achieve more accurate results. This approach is distribution-free and suitable for small sample sizes. Also, comparisons with the conventional runs test show the superiority of the proposed approach under the null hypothesis of randomness in the occurrence of type 1 elements."}, "https://arxiv.org/abs/2309.03067": {"title": "A modelling framework for detecting and leveraging node-level information in Bayesian network inference", "link": "https://arxiv.org/abs/2309.03067", "description": "arXiv:2309.03067v2 Announce Type: replace \nAbstract: Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modelling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximisation algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases."}, "https://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "https://arxiv.org/abs/2309.12833", "description": "arXiv:2309.12833v3 Announce Type: replace \nAbstract: Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control) and additive noise models are not suitable for applications in which the response is not measured on a continuous scale, but reflects categories or counts. 
Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, generally, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose TRAM-GCM based on the expected conditional covariance between environments and score residuals with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients."}, "https://arxiv.org/abs/2312.05345": {"title": "Spline-Based Multi-State Models for Analyzing Disease Progression", "link": "https://arxiv.org/abs/2312.05345", "description": "arXiv:2312.05345v2 Announce Type: replace \nAbstract: Motivated by disease progression-related studies, we propose an estimation method for fitting general non-homogeneous multi-state Markov models. The proposal can handle many types of multi-state processes, with several states and various combinations of observation schemes (e.g., intermittent, exactly observed, censored), and allows for the transition intensities to be flexibly modelled through additive (spline-based) predictors. The algorithm is based on a computationally efficient and stable penalized maximum likelihood estimation approach which exploits the information provided by the analytical Hessian matrix of the model log-likelihood. The proposed modeling framework is employed in case studies that aim at modeling the onset of cardiac allograft vasculopathy, and cognitive decline due to aging, where novel patterns are uncovered. To support applicability and reproducibility, all developed tools are implemented in the R package flexmsm."}, "https://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "https://arxiv.org/abs/2205.02274", "description": "arXiv:2205.02274v4 Announce Type: replace-cross \nAbstract: Marketplace companies rely heavily on experimentation when making changes to the design or operation of their platforms. The workhorse of experimentation is the randomized controlled trial (RCT), or A/B test, in which users are randomly assigned to treatment or control groups. However, marketplace interference causes the Stable Unit Treatment Value Assumption (SUTVA) to be violated, leading to bias in the standard RCT metric. In this work, we propose techniques for platforms to run standard RCTs and still obtain meaningful estimates despite the presence of marketplace interference. We specifically consider a generalized matching setting, in which the platform explicitly matches supply with demand via a linear programming algorithm. Our first proposal is for the platform to estimate the value of global treatment and global control via optimization. We prove that this approach is unbiased in the fluid limit. Our second proposal is to compare the average shadow price of the treatment and control groups rather than the total value accrued by each group. We prove that this technique corresponds to the correct first-order approximation (in a Taylor series sense) of the value function of interest even in a finite-size system. We then use this result to prove that, under reasonable assumptions, our estimator is less biased than the RCT estimator. 
At the heart of our result is the idea that it is relatively easy to model interference in matching-driven marketplaces since, in such markets, the platform mediates the spillover."}, "https://arxiv.org/abs/2303.12657": {"title": "Generalised Linear Mixed Model Specification, Analysis, Fitting, and Optimal Design in R with the glmmr Packages", "link": "https://arxiv.org/abs/2303.12657", "description": "arXiv:2303.12657v3 Announce Type: replace-cross \nAbstract: We describe the \\proglang{R} package \\pkg{glmmrBase} and an extension \\pkg{glmmrOptim}. \\pkg{glmmrBase} provides a flexible approach to specifying, fitting, and analysing generalised linear mixed models. We use an object-orientated class system within \\proglang{R} to provide methods for a wide range of covariance and mean functions, including specification of non-linear functions of data and parameters, relevant to multiple applications including cluster randomised trials, cohort studies, spatial and spatio-temporal modelling, and split-plot designs. The class generates relevant matrices and statistics and a wide range of methods including full likelihood estimation of generalised linear mixed models using stochastic Maximum Likelihood, Laplace approximation, power calculation, and access to relevant calculations. The class also includes Hamiltonian Monte Carlo simulation of random effects, sparse matrix methods, and other functionality to support efficient estimation. The \\pkg{glmmrOptim} package implements a set of algorithms to identify c-optimal experimental designs where observations are correlated and can be specified using the generalised linear mixed model classes. Several examples and comparisons to existing packages are provided to illustrate use of the packages."}, "https://arxiv.org/abs/2403.09726": {"title": "Inference for non-probability samples using the calibration approach for quantiles", "link": "https://arxiv.org/abs/2403.09726", "description": "arXiv:2403.09726v1 Announce Type: new \nAbstract: Non-probability survey samples are examples of data sources that have become increasingly popular in recent years, also in official statistics. However, statistical inference based on non-probability samples is much more difficult because they are biased and are not representative of the target population (Wu, 2022). In this paper we consider a method of joint calibration for totals (Deville & S\\\"arndal, 1992) and quantiles (Harms & Duchesne, 2006) and use the proposed approach to extend existing inference methods for non-probability samples, such as inverse probability weighting, mass imputation and doubly robust estimators. By including quantile information in the estimation process non-linear relationships between the target and auxiliary variables can be approximated the way it is done in step-wise (constant) regression. Our simulation study has demonstrated that the estimators in question are more robust against model mis-specification and, as a result, help to reduce bias and improve estimation efficiency. Variance estimation for our proposed approach is also discussed. We show that existing inference methods can be used and that the resulting confidence intervals are at nominal levels. Finally, we applied the proposed methods to estimate the share of vacancies aimed at Ukrainian workers in Poland using an integrated set of administrative and survey data about job vacancies. 
The proposed approaches have been implemented in two R packages (nonprobsvy and jointCalib), which were used to conduct the simulation and empirical study."}, "https://arxiv.org/abs/2403.09877": {"title": "Quantifying Distributional Input Uncertainty via Inflated Kolmogorov-Smirnov Confidence Band", "link": "https://arxiv.org/abs/2403.09877", "description": "arXiv:2403.09877v1 Announce Type: new \nAbstract: In stochastic simulation, input uncertainty refers to the propagation of the statistical noise in calibrating input models to impact output accuracy, in addition to the Monte Carlo simulation noise. The vast majority of the input uncertainty literature focuses on estimating target output quantities that are real-valued. However, outputs of simulation models are random and real-valued targets essentially serve only as summary statistics. To provide a more holistic assessment, we study the input uncertainty problem from a distributional view, namely we construct confidence bands for the entire output distribution function. Our approach utilizes a novel test statistic whose asymptotic distribution consists of the supremum of the sum of a Brownian bridge and a suitable mean-zero Gaussian process, which generalizes the Kolmogorov-Smirnov statistic to account for input uncertainty. Regarding implementation, we also demonstrate how to use subsampling to efficiently estimate the covariance function of the Gaussian process, thereby leading to an implementable estimation of the quantile of the test statistic and a statistically valid confidence band. Numerical results demonstrate how our new confidence bands provide valid coverage for output distributions under input uncertainty that is not achievable by conventional approaches."}, "https://arxiv.org/abs/2403.09907": {"title": "Multi-Layer Kernel Machines: Fast and Optimal Nonparametric Regression with Uncertainty Quantification", "link": "https://arxiv.org/abs/2403.09907", "description": "arXiv:2403.09907v1 Announce Type: new \nAbstract: Kernel ridge regression (KRR) is widely used for nonparametric regression over reproducing kernel Hilbert spaces. It offers powerful modeling capabilities at the cost of a significant computational burden, typically requiring $O(n^3)$ computational time and $O(n^2)$ storage space, where $n$ is the sample size. We introduce a novel framework of multi-layer kernel machines that approximate KRR by employing a multi-layer structure and random features, and study how the optimal number of random features and layer sizes can be chosen while still preserving the minimax optimality of the approximate KRR estimate. For various classes of random features, including those corresponding to Gaussian and Matern kernels, we prove that multi-layer kernel machines can achieve $O(n^2\\log^2n)$ computational time and $O(n\\log^2n)$ storage space, and yield fast and minimax optimal approximations to the KRR estimate for nonparametric regression. Moreover, we construct uncertainty quantification for multi-layer kernel machines by using conformal prediction techniques with robust coverage properties. 
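A minimal sketch of the generic random-feature idea that underlies fast approximations to kernel ridge regression: single-layer random Fourier features for a Gaussian kernel followed by ridge regression. This is only the textbook construction, not the multi-layer kernel machines of the abstract above; all sizes and parameters are arbitrary.

```python
# Single-layer random Fourier features (RFF) plus ridge regression as a cheap
# stand-in for kernel ridge regression; not the paper's multi-layer method.
import numpy as np

rng = np.random.default_rng(1)

n, d, D, sigma, lam = 2000, 5, 300, 1.0, 1e-2
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# z(x) = sqrt(2/D) * cos(W x + b) approximates the Gaussian kernel
# exp(-||x - x'||^2 / (2 sigma^2)) via the inner product z(x) . z(x').
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Ridge regression in feature space: O(n D^2) work instead of KRR's O(n^3).
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
y_hat = Z @ theta
print("train RMSE:", np.sqrt(np.mean((y_hat - y) ** 2)))
```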
The analysis and theoretical predictions are supported by simulations and real data examples."}, "https://arxiv.org/abs/2403.09928": {"title": "Identification and estimation of mediational effects of longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2403.09928", "description": "arXiv:2403.09928v1 Announce Type: new \nAbstract: We demonstrate a comprehensive semiparametric approach to causal mediation analysis, addressing the complexities inherent in settings with longitudinal and continuous treatments, confounders, and mediators. Our methodology utilizes a nonparametric structural equation model and a cross-fitted sequential regression technique based on doubly robust pseudo-outcomes, yielding an efficient, asymptotically normal estimator without relying on restrictive parametric modeling assumptions. We are motivated by a recent scientific controversy regarding the effects of invasive mechanical ventilation (IMV) on the survival of COVID-19 patients, considering acute kidney injury (AKI) as a mediating factor. We highlight the possibility of \"inconsistent mediation,\" in which the direct and indirect effects of the exposure operate in opposite directions. We discuss the significance of mediation analysis for scientific understanding and its potential utility in treatment decisions."}, "https://arxiv.org/abs/2403.09956": {"title": "On the distribution of isometric log-ratio transformations under extra-multinomial count data", "link": "https://arxiv.org/abs/2403.09956", "description": "arXiv:2403.09956v1 Announce Type: new \nAbstract: Compositional data arise when count observations are normalised into proportions adding up to unity. To allow use of standard statistical methods, compositional proportions can be mapped from the simplex into the Euclidean space through the isometric log-ratio (ilr) transformation. When the counts follow a multinomial distribution with fixed class-specific probabilities, the distribution of the ensuing ilr coordinates has been shown to be asymptotically multivariate normal. We here derive an asymptotic normal approximation to the distribution of the ilr coordinates when the counts show overdispersion under the Dirichlet-multinomial mixture model. Using a simulation study, we then investigate the practical applicability of the approximation against the empirical distribution of the ilr coordinates under varying levels of extra-multinomial variation and the total count. The approximation works well, except with a small total count or high amount of overdispersion. These empirical results remain even under population-level heterogeneity in the total count. Our work is motivated by microbiome data, which often exhibit considerable extra-multinomial variation and are increasingly treated as compositional through scaling taxon-specific counts into proportions. We conclude that if the analysis of empirical data relies on normality of the ilr coordinates, it may be advisable to choose a taxonomic level where counts are less sparse so that the distribution of taxon-specific class probabilities remains unimodal."}, "https://arxiv.org/abs/2403.09984": {"title": "Repro Samples Method for High-dimensional Logistic Model", "link": "https://arxiv.org/abs/2403.09984", "description": "arXiv:2403.09984v1 Announce Type: new \nAbstract: This paper presents a novel method to make statistical inferences for both the model support and regression coefficients in a high-dimensional logistic regression model. 
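For readers unfamiliar with the isometric log-ratio transformation mentioned above, here is a minimal sketch of one standard ilr map (pivot coordinates) applied to proportions derived from counts. The pseudo-count and example counts are arbitrary, and nothing here reproduces the paper's Dirichlet-multinomial approximation.

```python
# Pivot-coordinate version of the isometric log-ratio (ilr) transform; this
# only illustrates the map itself, not the distributional results above.
import numpy as np

def ilr_pivot(p):
    """Pivot-coordinate ilr transform of a composition p (sums to 1, all > 0)."""
    D = p.shape[-1]
    z = []
    for j in range(D - 1):
        # geometric mean of the remaining D - j - 1 parts
        gm_rest = np.exp(np.mean(np.log(p[..., j + 1:]), axis=-1))
        z.append(np.sqrt((D - j - 1) / (D - j)) * np.log(p[..., j] / gm_rest))
    return np.stack(z, axis=-1)

counts = np.array([30, 12, 5, 3], dtype=float)
props = (counts + 0.5) / (counts + 0.5).sum()   # small pseudo-count to avoid zeros
print(ilr_pivot(props))                         # D - 1 = 3 real-valued coordinates
```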
Our method is based on the repro samples framework, in which we conduct statistical inference by generating artificial samples mimicking the actual data-generating process. The proposed method has two major advantages. Firstly, for model support, we introduce the first method for constructing a model confidence set in a high-dimensional setting, and the proposed method requires only a weak signal strength assumption. Secondly, in terms of regression coefficients, we establish confidence sets for any group of linear combinations of regression coefficients. Our simulation results demonstrate that the proposed method produces valid and small model confidence sets and achieves better coverage for regression coefficients than the state-of-the-art debiasing methods. Additionally, we analyze single-cell RNA-seq data on the immune response. Besides identifying genes previously shown to be relevant in the literature, our method also discovers a significant gene that has not been studied before, revealing a potential new direction in understanding cellular immune response mechanisms."}, "https://arxiv.org/abs/2403.10034": {"title": "Inference for Heterogeneous Graphical Models using Doubly High-Dimensional Linear-Mixed Models", "link": "https://arxiv.org/abs/2403.10034", "description": "arXiv:2403.10034v1 Announce Type: new \nAbstract: Motivated by the problem of inferring the graph structure of functional connectivity networks from multi-level functional magnetic resonance imaging data, we develop a valid inference framework for high-dimensional graphical models that accounts for group-level heterogeneity. We introduce a neighborhood-based method to learn the graph structure and reframe the problem as that of inferring fixed effect parameters in a doubly high-dimensional linear mixed model. Specifically, we propose a LASSO-based estimator and a de-biased LASSO-based inference framework for the fixed effect parameters in the doubly high-dimensional linear mixed model, leveraging random matrix theory to deal with challenges induced by the identical fixed and random effect design matrices arising in our setting. Moreover, we introduce consistent estimators for the variance components to identify subject-specific edges in the inferred graph. To illustrate the generality of the proposed approach, we also adapt our method to account for serial correlation by learning heterogeneous graphs in the setting of a vector autoregressive model. We demonstrate the performance of the proposed framework using real data and benchmark simulation studies."}, "https://arxiv.org/abs/2403.10136": {"title": "Response Style Characterization for Repeated Measures Using the Visual Analogue Scale", "link": "https://arxiv.org/abs/2403.10136", "description": "arXiv:2403.10136v1 Announce Type: new \nAbstract: Self-report measures (e.g., Likert scales) are widely used to evaluate subjective health perceptions. Recently, the visual analog scale (VAS), a slider-based scale, has become popular owing to its ability to precisely and easily assess how people feel. These data can be influenced by the response style (RS), a user-dependent systematic tendency that occurs regardless of questionnaire instructions. Despite its importance, especially in between-individual analysis, little attention has been paid to handling the RS in the VAS (denoted as response profile (RP)), as it is mainly used for within-individual monitoring and is less affected by RP. 
However, VAS measurements often require repeated self-reports of the same questionnaire items, making it difficult to apply conventional methods on a Likert scale. In this study, we developed a novel RP characterization method for various types of repeatedly measured VAS data. This approach involves the modeling of RP as distributional parameters ${\\theta}$ through a mixture of RS-like distributions, and addressing the issue of unbalanced data through bootstrap sampling for treating repeated measures. We assessed the effectiveness of the proposed method using simulated pseudo-data and an actual dataset from an empirical study. The assessment of parameter recovery showed that our method accurately estimated the RP parameter ${\\theta}$, demonstrating its robustness. Moreover, applying our method to an actual VAS dataset revealed the presence of individual RP heterogeneity, even in repeated VAS measurements, similar to the findings of the Likert scale. Our proposed method enables RP heterogeneity-aware VAS data analysis, similar to Likert-scale data analysis."}, "https://arxiv.org/abs/2403.10165": {"title": "Finite mixture copulas for modeling dependence in longitudinal count data", "link": "https://arxiv.org/abs/2403.10165", "description": "arXiv:2403.10165v1 Announce Type: new \nAbstract: Dependence modeling of multivariate count data has been receiving a considerable attention in recent times. Multivariate elliptical copulas are typically preferred in statistical literature to analyze dependence between repeated measurements of longitudinal data since they allow for different choices of the correlation structure. But these copulas lack in flexibility to model dependence and inference is only feasible under parametric restrictions. In this article, we propose the use of finite mixture of elliptical copulas in order to capture complex and hidden temporal dependency of discrete longitudinal data. With guaranteed model identifiability, our approach permits to use different correlation matrices in each component of the mixture copula. We theoretically examine the dependence properties of finite mixture of copulas, before applying them for constructing regression models for count longitudinal data. The inference of the proposed class of models is based on composite likelihood approach and the finite sample performance of the parameter estimates are investigated through extensive simulation studies. For model validation, besides the standard techniques we extended the t-plot method to accommodate finite mixture of elliptical copulas. Finally, our models are applied to analyze the temporal dependency of two real world longitudinal data sets and shown to provide improvements if compared against standard elliptical copulas."}, "https://arxiv.org/abs/2403.10289": {"title": "Towards a power analysis for PLS-based methods", "link": "https://arxiv.org/abs/2403.10289", "description": "arXiv:2403.10289v1 Announce Type: new \nAbstract: In recent years, power analysis has become widely used in applied sciences, with the increasing importance of the replicability issue. When distribution-free methods, such as Partial Least Squares (PLS)-based approaches, are considered, formulating power analysis turns out to be challenging. In this study, we introduce the methodological framework of a new procedure for performing power analysis when PLS-based methods are used. 
Data are simulated by the Monte Carlo method, assuming the null hypothesis of no effect is false and exploiting the latent structure estimated by PLS in the pilot data. In this way, the complex correlation data structure is explicitly considered in power analysis and sample size estimation. The paper offers insights into selecting statistical tests for the power analysis procedure, comparing accuracy-based tests and those based on continuous parameters estimated by PLS. Simulated and real datasets are investigated to show how the method works in practice."}, "https://arxiv.org/abs/2403.10352": {"title": "Goodness-of-Fit for Conditional Distributions: An Approach Using Principal Component Analysis and Component Selection", "link": "https://arxiv.org/abs/2403.10352", "description": "arXiv:2403.10352v1 Announce Type: new \nAbstract: This paper introduces a novel goodness-of-fit test technique for parametric conditional distributions. The proposed tests are based on a residual marked empirical process, for which we develop a conditional Principal Component Analysis. The obtained components provide a basis for various types of new tests in addition to the omnibus one. Component tests, each based on a single component, serve as experts in detecting certain directions. Smooth tests that assemble a few components are also of great use in practice. To further improve testing efficiency, we introduce a component selection approach, aiming to identify the most contributory components. The finite sample performance of the proposed tests is illustrated through Monte Carlo experiments."}, "https://arxiv.org/abs/2403.10440": {"title": "Multivariate Bayesian models with flexible shared interactions for analyzing spatio-temporal patterns of rare cancers", "link": "https://arxiv.org/abs/2403.10440", "description": "arXiv:2403.10440v1 Announce Type: new \nAbstract: Rare cancers affect millions of people worldwide each year. However, estimating incidence or mortality rates associated with rare cancers presents important difficulties and poses new statistical methodological challenges. In this paper, we expand the collection of multivariate spatio-temporal models by introducing adaptable shared interactions to enable a comprehensive analysis of both incidence and cancer mortality in rare cancer cases. These models allow the modulation of spatio-temporal interactions between incidence and mortality, accommodating changes in their relationship over time. The new models have been implemented in INLA using r-generic constructions. We conduct a simulation study to evaluate the performance of the new spatio-temporal models in terms of sensitivity and specificity. Results show that multivariate spatio-temporal models with flexible shared interaction outperform conventional multivariate spatio-temporal models with independent interactions. We use these models to analyze incidence and mortality data for pancreatic cancer and leukaemia among males across 142 administrative healthcare districts of Great Britain over a span of nine biennial periods (2002-2019)."}, "https://arxiv.org/abs/2403.10514": {"title": "Multilevel functional distributional models with application to continuous glucose monitoring in diabetes clinical trials", "link": "https://arxiv.org/abs/2403.10514", "description": "arXiv:2403.10514v1 Announce Type: new \nAbstract: Continuous glucose monitoring (CGM) is a minimally invasive technology that allows continuous monitoring of an individual's blood glucose. 
We focus on a large clinical trial that collected CGM data every few minutes for 26 weeks, and we assume that the basic observation unit is the distribution of CGM observations in a four-week interval. The resulting data structure is multilevel (because each individual has multiple months of data) and distributional (because the data for each four-week interval is represented as a distribution). The scientific goals are to: (1) identify and quantify the effects of factors that affect glycemic control in type 1 diabetes (T1D) patients; and (2) identify and characterize the patients who respond to treatment. To address these goals, we propose a new multilevel functional model that treats the CGM distributions as a response. Methods are motivated by and applied to data collected by The Juvenile Diabetes Research Foundation Continuous Glucose Monitoring Group. Reproducible code for the methods introduced here is available on GitHub."}, "https://arxiv.org/abs/2403.09667": {"title": "Should the choice of BOIN design parameter p", "link": "https://arxiv.org/abs/2403.09667", "description": "arXiv:2403.09667v1 Announce Type: cross \nAbstract: When the early stopping parameter n.earlystop is relatively small or the cohortsize value is not optimized via simulation, it may be better to use p.tox < 1.4 * target.DLT.rate, or try out different cohort sizes, or increase n.earlystop, whichever is both feasible and provides better operating characteristics. This is because if the cohortsize was not optimized via simulation, even when n.earlystop = 12, the BOIN escalation/de-escalation rules generated using p.tox = 1.4 * target.DLT.rate could be exactly the same as those calculated using p.tox > 3 * target.DLT.rate, which might not be acceptable for some pediatric trials targeting a 10% DLT rate. The traditional 3+3 design stops the dose finding process when 3 patients have been treated at the current dose level, 0 DLT has been observed, and the next higher dose has already been eliminated. If an additional 3 patients were required to be treated at the current dose in the situation described above, the corresponding boundary table could be generated using BOIN design with target DLT rates ranging from 18% to 29%, p.saf ranging from 8% to 26%, and p.tox ranging from 39% to 99%. To generate the boundary table of this 3+3 design variant, BOIN parameters also need to satisfy a set of conditions."}, "https://arxiv.org/abs/2403.09869": {"title": "Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors", "link": "https://arxiv.org/abs/2403.09869", "description": "arXiv:2403.09869v1 Announce Type: cross \nAbstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. 
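A rough sketch of the point made in the BOIN abstract above: cut-offs derived for quite different p.tox values can collapse to the same integer decision rules at small cohort sizes. The lambda formulas below are the standard BOIN interval boundaries (Liu and Yuan, 2015), stated here as an assumption from memory, and p.saf = 0.6 * target is the conventional default; this is not the BOIN R package, and any printed boundaries should be checked against it before practical use.

```python
# Illustrative computation of BOIN-style escalation/de-escalation thresholds
# for two choices of p.tox, to see when the small-n integer rules coincide.
# The lambda formulas are assumed from Liu & Yuan (2015); verify before use.
import math

def boin_lambdas(target, p_saf, p_tox):
    lam_e = math.log((1 - p_saf) / (1 - target)) / math.log(target * (1 - p_saf) / (p_saf * (1 - target)))
    lam_d = math.log((1 - target) / (1 - p_tox)) / math.log(p_tox * (1 - target) / (target * (1 - p_tox)))
    return lam_e, lam_d

target = 0.10                                 # 10% target DLT rate
for mult in (1.4, 3.0):                       # p.tox = 1.4 * target vs 3 * target
    lam_e, lam_d = boin_lambdas(target, 0.6 * target, mult * target)
    for n in (3, 6, 9, 12):                   # patients treated at the current dose
        deesc = math.ceil(lam_d * n)          # fewest DLTs that trigger de-escalation
        print(f"p.tox = {mult} * target, n = {n:>2}: de-escalate if DLTs >= {deesc}")
```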
Group-aware priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts."}, "https://arxiv.org/abs/2403.10239": {"title": "A Big Data Approach to Understand Sub-national Determinants of FDI in Africa", "link": "https://arxiv.org/abs/2403.10239", "description": "arXiv:2403.10239v1 Announce Type: cross \nAbstract: Various macroeconomic and institutional factors hinder FDI inflows, including corruption, trade openness, access to finance, and political instability. Existing research mostly focuses on country-level data, with limited exploration of firm-level data, especially in developing countries. Recognizing this gap, recent calls for research emphasize the need for qualitative data analysis to delve into FDI determinants, particularly at the regional level. This paper proposes a novel methodology, based on text mining and social network analysis, to extract information from more than 167,000 online news articles to quantify regional-level (sub-national) attributes affecting FDI ownership in African companies. Our analysis extends information on obstacles to industrial development as mapped by the World Bank Enterprise Surveys. Findings suggest that regional (sub-national) structural and institutional characteristics can play an important role in determining foreign ownership."}, "https://arxiv.org/abs/2403.10250": {"title": "Interpretable Machine Learning for Survival Analysis", "link": "https://arxiv.org/abs/2403.10250", "description": "arXiv:2403.10250v1 Announce Type: cross \nAbstract: With the spread and rapid advancement of black box machine learning models, the field of interpretable machine learning (IML) or explainable artificial intelligence (XAI) has become increasingly important over the last decade. This is particularly relevant for survival analysis, where the adoption of IML techniques promotes transparency, accountability and fairness in sensitive areas, such as clinical decision making processes, the development of targeted therapies, interventions or in other medical or healthcare related contexts. More specifically, explainability can uncover a survival model's potential biases and limitations and provide more mathematically sound ways to understand how and which features are influential for prediction or constitute risk factors. However, the lack of readily available IML methods may have deterred medical practitioners and policy makers in public health from leveraging the full potential of machine learning for predicting time-to-event data. We present a comprehensive review of the limited existing work on IML methods for survival analysis within the context of the general IML taxonomy. In addition, we formally detail how commonly used IML methods, such as individual conditional expectation (ICE), partial dependence plots (PDP), accumulated local effects (ALE), different feature importance measures or Friedman's H-interaction statistics can be adapted to survival outcomes. 
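A generic numpy sketch of the individual conditional expectation (ICE) and partial dependence (PDP) computations named in the survival-IML abstract above, for an arbitrary prediction function; the stand-in model is made up, and no survival-specific adaptation is attempted.

```python
# ICE curves: vary one feature over a grid while holding the others at their
# observed values; the PDP is the pointwise average of the ICE curves.
import numpy as np

rng = np.random.default_rng(2)

def predict(X):
    # Stand-in "model"; any fitted predictor could be plugged in here.
    return 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))

X = rng.normal(size=(200, 2))
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 25)

ice = np.empty((X.shape[0], grid.size))
for g, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value                 # force feature 0 to the grid value for everyone
    ice[:, g] = predict(X_mod)

pdp = ice.mean(axis=0)                  # PDP = average of the ICE curves
print(np.round(pdp[:5], 3))
```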
An application of several IML methods to real data on under-5 year mortality of Ghanaian children from the Demographic and Health Surveys (DHS) Program serves as a tutorial or guide for researchers on how to utilize the techniques in practice to facilitate understanding of model decisions or predictions."}, "https://arxiv.org/abs/2312.17420": {"title": "Exact Consistency Tests for Gaussian Mixture Filters using Normalized Deviation Squared Statistics", "link": "https://arxiv.org/abs/2312.17420", "description": "arXiv:2312.17420v2 Announce Type: replace \nAbstract: We consider the problem of evaluating dynamic consistency in discrete time probabilistic filters that approximate stochastic system state densities with Gaussian mixtures. Dynamic consistency means that the estimated probability distributions correctly describe the actual uncertainties. As such, the problem of consistency testing naturally arises in applications with regards to estimator tuning and validation. However, due to the general complexity of the density functions involved, straightforward approaches for consistency testing of mixture-based estimators have remained challenging to define and implement. This paper derives a new exact result for Gaussian mixture consistency testing within the framework of normalized deviation squared (NDS) statistics. It is shown that NDS test statistics for generic multivariate Gaussian mixture models exactly follow mixtures of generalized chi-square distributions, for which efficient computational tools are available. The accuracy and utility of the resulting consistency tests are numerically demonstrated on static and dynamic mixture estimation examples."}, "https://arxiv.org/abs/2305.05998": {"title": "On the Time-Varying Structure of the Arbitrage Pricing Theory using the Japanese Sector Indices", "link": "https://arxiv.org/abs/2305.05998", "description": "arXiv:2305.05998v4 Announce Type: replace-cross \nAbstract: This paper is the first study to examine the time instability of the APT in the Japanese stock market. In particular, we measure how changes in each risk factor affect the stock risk premiums to investigate the validity of the APT over time, applying the rolling window method to Fama and MacBeth's (1973) two-step regression and Kamstra and Shi's (2023) generalized GRS test. We summarize our empirical results as follows: (1) the changes in monetary policy by major central banks greatly affect the validity of the APT in Japan, and (2) the time-varying estimates of the risk premiums for each factor are also unstable over time, and they are affected by the business cycle and economic crises. Therefore, we conclude that the validity of the APT as an appropriate model to explain the Japanese sector index is not stable over time."}, "https://arxiv.org/abs/2307.09552": {"title": "Self-Compatibility: Evaluating Causal Discovery without Ground Truth", "link": "https://arxiv.org/abs/2307.09552", "description": "arXiv:2307.09552v2 Announce Type: replace-cross \nAbstract: As causal ground truth is incredibly rare, causal discovery algorithms are commonly only evaluated on simulated data. This is concerning, given that simulations reflect preconceptions about generating processes regarding noise distributions, model classes, and more. In this work, we propose a novel method for falsifying the output of a causal discovery algorithm in the absence of ground truth. 
Our key insight is that while statistical learning seeks stability across subsets of data points, causal learning should seek stability across subsets of variables. Motivated by this insight, our method relies on a notion of compatibility between causal graphs learned on different subsets of variables. We prove that detecting incompatibilities can falsify wrongly inferred causal relations due to violation of assumptions or errors from finite sample effects. Although passing such compatibility tests is only a necessary criterion for good performance, we argue that it provides strong evidence for the causal models whenever compatibility entails strong implications for the joint distribution. We also demonstrate experimentally that detection of incompatibilities can aid in causal model selection."}, "https://arxiv.org/abs/2309.09452": {"title": "Beyond expected values: Making environmental decisions using value of information analysis when measurement outcome matters", "link": "https://arxiv.org/abs/2309.09452", "description": "arXiv:2309.09452v2 Announce Type: replace-cross \nAbstract: In ecological and environmental contexts, management actions must sometimes be chosen urgently. Value of information (VoI) analysis provides a quantitative toolkit for projecting the improved management outcomes expected after making additional measurements. However, traditional VoI analysis reports metrics as expected values (i.e. risk-neutral). This can be problematic because expected values hide uncertainties in projections. The true value of a measurement will only be known after the measurement's outcome is known, leaving large uncertainty in the measurement's value before it is performed. As a result, the expected value metrics produced in traditional VoI analysis may not align with the priorities of a risk-averse decision-maker who wants to avoid low-value measurement outcomes. In the present work, we introduce four new VoI metrics that can address a decision-maker's risk-aversion to different measurement outcomes. We demonstrate the benefits of the new metrics with two ecological case studies for which traditional VoI analysis has been previously applied. Using the new metrics, we also demonstrate a clear mathematical link between the often-separated environmental decision-making disciplines of VoI and optimal design of experiments. This mathematical link has the potential to catalyse future collaborations between ecologists and statisticians to work together to quantitatively address environmental decision-making questions of fundamental importance. Overall, the introduced VoI metrics complement existing metrics to provide decision-makers with a comprehensive view of the value of, and risks associated with, a proposed monitoring or measurement activity. This is critical for improved environmental outcomes when decisions must be urgently made."}, "https://arxiv.org/abs/2311.10685": {"title": "High-Throughput Asset Pricing", "link": "https://arxiv.org/abs/2311.10685", "description": "arXiv:2311.10685v2 Announce Type: replace-cross \nAbstract: We use empirical Bayes (EB) to mine data on 140,000 long-short strategies constructed from accounting ratios, past returns, and ticker symbols. This \"high-throughput asset pricing\" produces out-of-sample performance comparable to strategies in top finance journals. But unlike the published strategies, the data-mined strategies are free of look-ahead bias. 
EB predicts that high returns are concentrated in accounting strategies, small stocks, and pre-2004 samples, consistent with limited attention theories. The intuition is seen in the cross-sectional distribution of t-stats, which is far from the null for equal-weighted accounting strategies. High-throughput methods provide a rigorous, unbiased method for documenting asset pricing facts."}, "https://arxiv.org/abs/2403.10680": {"title": "Spatio-temporal Occupancy Models with INLA", "link": "https://arxiv.org/abs/2403.10680", "description": "arXiv:2403.10680v1 Announce Type: new \nAbstract: Modern methods for quantifying and predicting species distribution play a crucial part in biodiversity conservation. Occupancy models are a popular choice for analyzing species occurrence data as they allow to separate the observational error induced by imperfect detection, and the sources of bias affecting the occupancy process. However, the spatial and temporal variation in occupancy not accounted for by environmental covariates is often ignored or modelled through simple spatial structures as the computational costs of fitting explicit spatio-temporal models is too high. In this work, we demonstrate how INLA may be used to fit complex occupancy models and how the R-INLA package can provide a user-friendly interface to make such complex models available to users.\n We show how occupancy models, provided some simplification on the detection process, can be framed as latent Gaussian models and benefit from the powerful INLA machinery. A large selection of complex modelling features and random effect models have already been implemented in R-INLA. These become available for occupancy models, providing the user with an efficient and flexible toolbox.\n We illustrate how INLA provides a computationally efficient framework for developing and fitting complex occupancy models using two case studies. Through these, we show how different spatio-temporal models that include spatially varying trends, smooth terms, and spatio-temporal random effects can be fitted. At the cost of limiting the complexity of the detection model, INLA can incorporate a range of complex structures in the process.\n INLA-based occupancy models provide an alternative framework to fit complex spatio-temporal occupancy models. The need for new and more flexible computational approaches to fit such models makes INLA an attractive option for addressing complex ecological problems, and a promising area of research."}, "https://arxiv.org/abs/2403.10742": {"title": "Assessing Delayed Treatment Benefits of Immunotherapy Using Long-Term Average Hazard: A Novel Test/Estimation Approach", "link": "https://arxiv.org/abs/2403.10742", "description": "arXiv:2403.10742v1 Announce Type: new \nAbstract: Delayed treatment effects on time-to-event outcomes have often been observed in randomized controlled studies of cancer immunotherapies. In the case of delayed onset of treatment effect, the conventional test/estimation approach using the log-rank test for between-group comparison and Cox's hazard ratio to estimate the magnitude of treatment effect is not optimal, because the log-rank test is not the most powerful option, and the interpretation of the resulting hazard ratio is not obvious. Recently, alternative test/estimation approaches were proposed to address both the power issue and the interpretation problems of the conventional approach. 
One is a test/estimation approach based on long-term restricted mean survival time, and the other approach is based on average hazard with survival weight. This paper integrates these two ideas and proposes a novel test/estimation approach based on long-term average hazard (LT-AH) with survival weight. Numerical studies reveal specific scenarios where the proposed LT-AH method provides a higher power than the two alternative approaches. The proposed approach has test/estimation coherency and can provide robust estimates of the magnitude of treatment effect not dependent on study-specific censoring time distribution. Also, the proposed LT-AH approach can summarize the magnitude of the treatment effect in both absolute difference and relative terms using ``hazard'' (i.e., difference in LT-AH and ratio of LT-AH), meeting guideline recommendations and practical needs. This proposed approach can be a useful alternative to the traditional hazard-based test/estimation approach when delayed onset of survival benefit is expected."}, "https://arxiv.org/abs/2403.10878": {"title": "Cubature scheme for spatio-temporal Poisson point processes estimation", "link": "https://arxiv.org/abs/2403.10878", "description": "arXiv:2403.10878v1 Announce Type: new \nAbstract: This work presents the cubature scheme for the fitting of spatio-temporal Poisson point processes. The methodology is implemented in the R Core Team (2024) package stopp (D'Angelo and Adelfio, 2023), published on the Comprehensive R Archive Network (CRAN) and available from https://CRAN.R-project.org/package=stopp. Since the number of dummy points should be sufficient for an accurate estimate of the likelihood, numerical experiments are currently under development to give guidelines on this aspect."}, "https://arxiv.org/abs/2403.10907": {"title": "Macroeconomic Spillovers of Weather Shocks across U", "link": "https://arxiv.org/abs/2403.10907", "description": "arXiv:2403.10907v1 Announce Type: new \nAbstract: We estimate the short-run effects of severe weather shocks on local economic activity and assess cross-border spillovers operating through economic linkages between U.S. states. We measure weather shocks using a detailed county-level database on emergency declarations triggered by natural disasters and estimate their impacts with a monthly Global Vector Autoregressive (GVAR) model for the U.S. states. Impulse responses highlight significant country-wide macroeconomic effects of weather shocks hitting individual regions. We also show that (i) taking into account economic interconnections between states allows capturing much stronger spillover effects than those associated with mere spatial adjacency, (ii) geographical heterogeneity is critical for assessing country-wide effects of weather shocks, and (iii) network effects amplify the local impacts of these shocks."}, "https://arxiv.org/abs/2403.10945": {"title": "Zero-Inflated Stochastic Volatility Model for Disaggregated Inflation Data with Exact Zeros", "link": "https://arxiv.org/abs/2403.10945", "description": "arXiv:2403.10945v1 Announce Type: new \nAbstract: The disaggregated time-series data for Consumer Price Index often exhibits frequent instances of exact zero price changes, stemming from measurement errors inherent in the data collection process. However, the currently prominent stochastic volatility model of trend inflation is designed for aggregate measures of price inflation, where exact zero price changes rarely occur. 
We propose a zero-inflated stochastic volatility model applicable to such nonstationary real-valued multivariate time-series data with exact zeros, by a Bayesian dynamic generalized linear model that jointly specifies the dynamic zero-generating process. We also provide an efficient custom Gibbs sampler that leverages the P\\'olya-Gamma augmentation. Applying the model to disaggregated Japanese Consumer Price Index data, we find that the zero-inflated model provides more sensible and informative estimates of time-varying trend and volatility. Through an out-of-sample forecasting exercise, we find that the zero-inflated model provides improved point forecasts when zero-inflation is prominent, and better coverage of interval forecasts of the non-zero data by the non-zero distributional component."}, "https://arxiv.org/abs/2403.11003": {"title": "Extreme Treatment Effect: Extrapolating Causal Effects Into Extreme Treatment Domain", "link": "https://arxiv.org/abs/2403.11003", "description": "arXiv:2403.11003v1 Announce Type: new \nAbstract: The potential outcomes framework serves as a fundamental tool for quantifying the causal effects. When the treatment variable (exposure) is continuous, one is typically interested in the estimation of the effect curve (also called the average dose-response function), denoted as \\(\\mu(t)\\). In this work, we explore the ``extreme causal effect,'' where our focus lies in determining the impact of an extreme level of treatment, potentially beyond the range of observed values--that is, estimating \\(\\mu(t)\\) for very large \\(t\\). Our framework is grounded in the field of statistics known as extreme value theory. We establish the foundation for our approach, outlining key assumptions that enable the estimation of the extremal causal effect. Additionally, we present a novel and consistent estimation procedure that utilizes extreme value theory in order to potentially reduce the dimension of the confounders to at most 3. In practical applications, our framework proves valuable when assessing the effects of scenarios such as drug overdoses, extreme river discharges, or extremely high temperatures on a variable of interest."}, "https://arxiv.org/abs/2403.11016": {"title": "Comprehensive OOS Evaluation of Predictive Algorithms with Statistical Decision Theory", "link": "https://arxiv.org/abs/2403.11016", "description": "arXiv:2403.11016v1 Announce Type: new \nAbstract: We argue that comprehensive out-of-sample (OOS) evaluation using statistical decision theory (SDT) should replace the current practice of K-fold and Common Task Framework validation in machine learning (ML) research. SDT provides a formal framework for performing comprehensive OOS evaluation across all possible (1) training samples, (2) populations that may generate training data, and (3) populations of prediction interest. Regarding feature (3), we emphasize that SDT requires the practitioner to directly confront the possibility that the future may not look like the past and to account for a possible need to extrapolate from one population to another when building a predictive algorithm. SDT is simple in abstraction, but it is often computationally demanding to implement. We discuss progress in tractable implementation of SDT when prediction accuracy is measured by mean square error or by misclassification rate. We summarize research studying settings in which the training data will be generated from a subpopulation of the population of prediction interest. 
We also consider conditional prediction with alternative restrictions on the state space of possible populations that may generate training data. We conclude by calling on ML researchers to join with econometricians and statisticians in expanding the domain within which implementation of SDT is tractable."}, "https://arxiv.org/abs/2403.11017": {"title": "Continuous-time mediation analysis for repeatedly measured mediators and outcomes", "link": "https://arxiv.org/abs/2403.11017", "description": "arXiv:2403.11017v1 Announce Type: new \nAbstract: Mediation analysis aims to decipher the underlying causal mechanisms between an exposure, an outcome, and intermediate variables called mediators. Initially developed for fixed-time mediator and outcome, it has been extended to the framework of longitudinal data by discretizing the assessment times of mediator and outcome. Yet, processes in play in longitudinal studies are usually defined in continuous time and measured at irregular and subject-specific visits. This is the case in dementia research when cerebral and cognitive changes measured at planned visits in cohorts are of interest. We thus propose a methodology to estimate the causal mechanisms between a time-fixed exposure ($X$), a mediator process ($\\mathcal{M}_t$) and an outcome process ($\\mathcal{Y}_t$) both measured repeatedly over time in the presence of a time-dependent confounding process ($\\mathcal{L}_t$). We consider three types of causal estimands, the natural effects, path-specific effects and randomized interventional analogues to natural effects, and provide identifiability assumptions. We employ a dynamic multivariate model based on differential equations for their estimation. The performance of the methods is explored in simulations, and we illustrate the method in two real-world examples motivated by the 3C cerebral aging study to assess: (1) the effect of educational level on functional dependency through depressive symptomatology and cognitive functioning, and (2) the effect of a genetic factor on cognitive functioning potentially mediated by vascular brain lesions and confounded by neurodegeneration."}, "https://arxiv.org/abs/2403.11163": {"title": "A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques", "link": "https://arxiv.org/abs/2403.11163", "description": "arXiv:2403.11163v1 Announce Type: new \nAbstract: This paper presents a selective review of statistical computation methods for massive data analysis. A huge number of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset size is too large to be comfortably handled by a single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns the situation where the sample size of the dataset is small enough for the data to be placed on a single computer but too large to be easily processed by its memory as a whole. 
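A toy sketch of the subsampling idea described in the review abstract above: fit ordinary least squares on a uniform random subsample and compare with the full-data fit. This is a generic illustration, not a specific estimator from the review; the data-generating model and sizes are arbitrary.

```python
# Uniform row subsampling for least squares: the subsample fit is much cheaper
# while remaining close to the full-data coefficients in this easy setting.
import numpy as np

rng = np.random.default_rng(3)

n, d, m = 1_000_000, 5, 5_000               # full sample size vs subsample size
X = rng.normal(size=(n, d))
beta = np.arange(1, d + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

idx = rng.choice(n, size=m, replace=False)  # uniform subsampling without replacement
beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

print("full  :", np.round(beta_full, 3))
print("subset:", np.round(beta_sub, 3))
```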
The last class of literature studies minibatch gradient-related optimization techniques, which have been used extensively for optimizing various deep learning models."}, "https://arxiv.org/abs/2403.11276": {"title": "Effects of model misspecification on small area estimators", "link": "https://arxiv.org/abs/2403.11276", "description": "arXiv:2403.11276v1 Announce Type: new \nAbstract: Nested error regression models are commonly used to incorporate observational unit specific auxiliary variables to improve small area estimates. When the mean structure of this model is misspecified, there is generally an increase in the mean square prediction error (MSPE) of Empirical Best Linear Unbiased Predictors (EBLUP). The Observed Best Prediction (OBP) method has been proposed with the intent to improve on the MSPE over EBLUP. We conduct a Monte Carlo simulation experiment to understand the effect of misspecification of mean structures on different small area estimators. Our simulation results lead to the unexpected result that OBP may perform very poorly when observational unit level auxiliary variables are used and that OBP can be improved significantly when population means of those auxiliary variables (area level auxiliary variables) are used in the nested error regression model or when a corresponding area level model is used. Our simulation also indicates that the MSPE of OBP is an increasing function of the difference between the sample and population means of the auxiliary variables."}, "https://arxiv.org/abs/2403.11309": {"title": "Nonparametric Identification and Estimation with Non-Classical Errors-in-Variables", "link": "https://arxiv.org/abs/2403.11309", "description": "arXiv:2403.11309v1 Announce Type: new \nAbstract: This paper considers nonparametric identification and estimation of the regression function when a covariate is mismeasured. The measurement error need not be classical. Employing the small measurement error approximation, we establish nonparametric identification under weak and easy-to-interpret conditions on the instrumental variable. The paper also provides nonparametric estimators of the regression function and derives their rates of convergence."}, "https://arxiv.org/abs/2403.11356": {"title": "Multiscale Quantile Regression with Local Error Control", "link": "https://arxiv.org/abs/2403.11356", "description": "arXiv:2403.11356v1 Announce Type: new \nAbstract: For robust and efficient detection of change points, we introduce a novel methodology MUSCLE (multiscale quantile segmentation controlling local error) that partitions serial data into multiple segments, each sharing a common quantile. It leverages multiple tests for quantile changes over different scales and locations, and variational estimation. Unlike the often adopted global error control, MUSCLE focuses on local errors defined on individual segments, significantly improving detection power in finding change points. Meanwhile, due to the built-in model complexity penalty, it enjoys the finite sample guarantee that its false discovery rate (or the expected proportion of falsely detected change points) is upper bounded by its unique tuning parameter. Further, we obtain the consistency and the localisation error rates in estimating change points, under mild signal-to-noise-ratio conditions. Both match (up to log factors) the minimax optimality results in the Gaussian setup. All theories hold under the only distributional assumption of serial independence. 
Incorporating the wavelet tree data structure, we develop an efficient dynamic programming algorithm for computing MUSCLE. Extensive simulations as well as real data applications in electrophysiology and geophysics demonstrate its competitiveness and effectiveness. An implementation via R package muscle is available from GitHub."}, "https://arxiv.org/abs/2403.11438": {"title": "Models of linkage error for capture-recapture estimation without clerical reviews", "link": "https://arxiv.org/abs/2403.11438", "description": "arXiv:2403.11438v1 Announce Type: new \nAbstract: The capture-recapture method can be applied to measure the coverage of administrative and big data sources, in official statistics. In its basic form, it involves the linkage of two sources while assuming a perfect linkage and other standard assumptions. In practice, linkage errors arise and are a potential source of bias, where the linkage is based on quasi-identifiers. These errors include false positives and false negatives, where the former arise when linking a pair of records from different units, and the latter arise when not linking a pair of records from the same unit. So far, the existing solutions have resorted to costly clerical reviews, or they have made the restrictive conditional independence assumption. In this work, these requirements are relaxed by modeling the number of links from a record instead. The same approach may be taken to estimate the linkage accuracy without clerical reviews, when linking two sources that each have some undercoverage."}, "https://arxiv.org/abs/2403.11562": {"title": "A Comparison of Joint Species Distribution Models for Percent Cover Data", "link": "https://arxiv.org/abs/2403.11562", "description": "arXiv:2403.11562v1 Announce Type: new \nAbstract: 1. Joint species distribution models (JSDMs) have gained considerable traction among ecologists over the past decade, due to their capacity to answer a wide range of questions at both the species- and the community-level. The family of generalized linear latent variable models in particular has proven popular for building JSDMs, being able to handle many response types including presence-absence data, biomass, overdispersed and/or zero-inflated counts.\n 2. We extend latent variable models to handle percent cover data, with vegetation, sessile invertebrate, and macroalgal cover data representing the prime examples of such data arising in community ecology.\n 3. Sparsity is a commonly encountered challenge with percent cover data. Responses are typically recorded as percentages covered per plot, though some species may be completely absent or present, i.e., have 0% or 100% cover respectively, rendering the use of beta distribution inadequate.\n 4. We propose two JSDMs suitable for percent cover data, namely a hurdle beta model and an ordered beta model. We compare the two proposed approaches to a beta distribution for shifted responses, transformed presence-absence data, and an ordinal model for percent cover classes. 
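For context on the capture-recapture abstract above, here is the textbook two-source estimator under the perfect-linkage assumption (Lincoln-Petersen, plus the Chapman small-sample correction); the counts are invented, and the point is only that linkage errors bias the overlap count that this estimator relies on.

```python
# Basic two-source capture-recapture under perfect linkage; false links inflate
# the overlap m and missed links deflate it, which is the bias the linkage-error
# models in the abstract above aim to address.
n1 = 5000      # records in source 1 (e.g., an administrative register)
n2 = 4200      # records in source 2 (e.g., a big data source)
m = 3500       # linked records found in both sources

n_lp = n1 * n2 / m                                  # Lincoln-Petersen estimate
n_chapman = (n1 + 1) * (n2 + 1) / (m + 1) - 1       # Chapman correction

print(f"estimated population size: LP {n_lp:.0f}, Chapman {n_chapman:.0f}")
```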
Results demonstrate the hurdle beta JSDM was generally the most accurate at retrieving the latent variables and predicting ecological percent cover data."}, "https://arxiv.org/abs/2403.11564": {"title": "Spatio-temporal point process intensity estimation using zero-deflated subsampling applied to a lightning strikes dataset in France", "link": "https://arxiv.org/abs/2403.11564", "description": "arXiv:2403.11564v1 Announce Type: new \nAbstract: Cloud-to-ground lightning strikes observed in a specific geographical domain over time can be naturally modeled by a spatio-temporal point process. Our focus lies in the parametric estimation of its intensity function, incorporating both spatial factors (such as altitude) and spatio-temporal covariates (such as field temperature, precipitation, etc.). The events are observed in France over a span of three years. Spatio-temporal covariates are observed with resolution $0.1^\\circ \\times 0.1^\\circ$ ($\\approx 100$km$^2$) and six-hour periods. This results in an extensive dataset, further characterized by a significant excess of zeroes (i.e., spatio-temporal cells with no observed events). We reexamine composite likelihood methods commonly employed for spatial point processes, especially in situations where covariates are piecewise constant. Additionally, we extend these methods to account for zero-deflated subsampling, a strategy involving dependent subsampling, with a focus on selecting more cells in regions where events are observed. A simulation study is conducted to illustrate these novel methodologies, followed by their application to the dataset of lightning strikes."}, "https://arxiv.org/abs/2403.11767": {"title": "Multiple testing in game-theoretic probability: pictures and questions", "link": "https://arxiv.org/abs/2403.11767", "description": "arXiv:2403.11767v1 Announce Type: new \nAbstract: The usual way of testing probability forecasts in game-theoretic probability is via construction of test martingales. The standard assumption is that all forecasts are output by the same forecaster. In this paper I will discuss possible extensions of this picture to testing probability forecasts output by several forecasters. This corresponds to multiple hypothesis testing in statistics. One interesting phenomenon is that even a slight relaxation of the requirement of family-wise validity leads to a very significant increase in the efficiency of testing procedures. The main goal of this paper is to report results of preliminary simulation studies and list some directions of further research."}, "https://arxiv.org/abs/2403.11954": {"title": "Robust Estimation and Inference in Categorical Data", "link": "https://arxiv.org/abs/2403.11954", "description": "arXiv:2403.11954v1 Announce Type: new \nAbstract: In empirical science, many variables of interest are categorical. Like any model, models for categorical responses can be misspecified, leading to possibly large biases in estimation. One particularly troublesome source of misspecification is inattentive responding in questionnaires, which is well-known to jeopardize the validity of structural equation models (SEMs) and other survey-based analyses. I propose a general estimator that is designed to be robust to misspecification of models for categorical responses. Unlike hitherto approaches, the estimator makes no assumption whatsoever on the degree, magnitude, or type of misspecification. 
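A minimal betting-style test martingale for a single stream of Bernoulli forecasts, to make the game-theoretic testing idea in the abstract above concrete; the forecast, true probability, and betting fraction are arbitrary, and the multi-forecaster extensions discussed in the paper are not attempted.

```python
# Test martingale for one forecaster: the running wealth has expectation 1 under
# the forecaster's probabilities, so large wealth is evidence against them.
import numpy as np

rng = np.random.default_rng(4)

p_forecast = 0.5          # forecaster claims heads probability 0.5
p_true = 0.6              # reality is biased, so the forecasts are wrong
lam = 0.1                 # betting fraction; must lie in [-p_forecast, 1 - p_forecast]

x = rng.random(2000) < p_true
wealth = 1.0
for xi in x:
    # fair-odds bet against the forecast: expected multiplier is 1 if the forecast is right
    wealth *= 1 + lam * ((xi - p_forecast) / (p_forecast * (1 - p_forecast)))

print(f"final wealth (evidence against the forecaster): {wealth:.2e}")
```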
The proposed estimator generalizes maximum likelihood estimation, is strongly consistent, asymptotically Gaussian, has the same time complexity as maximum likelihood, and can be applied to any model for categorical responses. In addition, I develop a novel test that tests whether a given response can be fitted well by the assumed model, which allows one to trace back possible sources of misspecification. I verify the attractive theoretical properties of the proposed methodology in Monte Carlo experiments, and demonstrate its practical usefulness in an empirical application on a SEM of personality traits, where I find compelling evidence for the presence of inattentive responding whose adverse effects the proposed estimator can withstand, unlike maximum likelihood."}, "https://arxiv.org/abs/2403.11983": {"title": "Proposal of a general framework to categorize continuous predictor variables", "link": "https://arxiv.org/abs/2403.11983", "description": "arXiv:2403.11983v1 Announce Type: new \nAbstract: The use of discretized variables in the development of prediction models is a common practice, in part because the decision-making process is more natural when it is based on rules created from segmented models. Although this practice is perhaps more common in medicine, it is extensible to any area of knowledge where a predictive model helps in decision-making. Therefore, providing researchers with a useful and valid categorization method could be a relevant issue when developing prediction models. In this paper, we propose a new general methodology that can be applied to categorize a predictor variable in any regression model where the response variable belongs to the exponential family distribution. Furthermore, it can be applied in any multivariate context, allowing to categorize more than one continuous covariate simultaneously. In addition, a computationally very efficient method is proposed to obtain the optimal number of categories, based on a pseudo-BIC proposal. Several simulation studies have been conducted in which the efficiency of the method with respect to both the location and the number of estimated cut-off points is shown. Finally, the categorization proposal has been applied to a real data set of 543 patients with chronic obstructive pulmonary disease from Galdakao Hospital's five outpatient respiratory clinics, who were followed up for 10 years. We applied the proposed methodology to jointly categorize the continuous variables six-minute walking test and forced expiratory volume in one second in a multiple Poisson generalized additive model for the response variable rate of the number of hospital admissions by years of follow-up. The location and number of cut-off points obtained were clinically validated as being in line with the categorizations used in the literature."}, "https://arxiv.org/abs/2401.01998": {"title": "A Corrected Score Function Framework for Modelling Circadian Gene Expression", "link": "https://arxiv.org/abs/2401.01998", "description": "arXiv:2401.01998v1 Announce Type: cross \nAbstract: Many biological processes display oscillatory behavior based on an approximately 24 hour internal timing system specific to each individual. One process of particular interest is gene expression, for which several circadian transcriptomic studies have identified associations between gene expression during a 24 hour period and an individual's health. 
A challenge with analyzing data from these studies is that each individual's internal timing system is offset relative to the 24 hour day-night cycle, where day-night cycle time is recorded for each collected sample. Laboratory procedures can accurately determine each individual's offset and determine the internal time of sample collection. However, these laboratory procedures are labor-intensive and expensive. In this paper, we propose a corrected score function framework to obtain a regression model of gene expression given internal time when the offset of each individual is too burdensome to determine. A feature of this framework is that it does not require the probability distribution generating offsets to be symmetric with a mean of zero. Simulation studies validate the use of this corrected score function framework for cosinor regression, which is prevalent in circadian transcriptomic studies. Illustrations with three real circadian transcriptomic data sets further demonstrate that the proposed framework consistently mitigates bias relative to using a score function that does not account for this offset."}, "https://arxiv.org/abs/2403.10567": {"title": "Uncertainty estimation in spatial interpolation of satellite precipitation with ensemble learning", "link": "https://arxiv.org/abs/2403.10567", "description": "arXiv:2403.10567v1 Announce Type: cross \nAbstract: Predictions in the form of probability distributions are crucial for decision-making. Quantile regression enables this within spatial interpolation settings for merging remote sensing and gauge precipitation data. However, ensemble learning of quantile regression algorithms remains unexplored in this context. Here, we address this gap by introducing nine quantile-based ensemble learners and applying them to large precipitation datasets. We employed a novel feature engineering strategy, reducing predictors to distance-weighted satellite precipitation at relevant locations, combined with location elevation. Our ensemble learners include six stacking and three simple methods (mean, median, best combiner), combining six individual algorithms: quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machines (LightGBM), and quantile regression neural networks (QRNN). These algorithms serve as both base learners and combiners within different stacking methods. We evaluated performance against QR using quantile scoring functions in a large dataset comprising 15 years of monthly gauge-measured and satellite precipitation in contiguous US (CONUS). Stacking with QR and QRNN yielded the best results across quantile levels of interest (0.025, 0.050, 0.075, 0.100, 0.200, 0.300, 0.400, 0.500, 0.600, 0.700, 0.800, 0.900, 0.925, 0.950, 0.975), surpassing the reference method by 3.91% to 8.95%. This demonstrates the potential of stacking to improve probabilistic predictions in spatial interpolation and beyond."}, "https://arxiv.org/abs/2403.10618": {"title": "Limits of Approximating the Median Treatment Effect", "link": "https://arxiv.org/abs/2403.10618", "description": "arXiv:2403.10618v1 Announce Type: cross \nAbstract: Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. 
In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\\mathbf{a}, \\mathbf{b}$, much of the prior work focused on estimating median$(\\mathbf{a}) -$ median$(\\mathbf{b})$, where median($\\mathbf x$) denotes the median value in the sorted ordering of all the values in vector $\\mathbf x$. It is known that estimating the difference of medians is easier than the desired estimand of median$(\\mathbf{a-b})$, called the Median Treatment Effect (MTE). The fundamental problem of causal inference -- for every individual $i$, we can only observe one of the potential outcome values, i.e., either the value $a_i$ or $b_i$, but not both -- makes estimating the MTE particularly challenging. In this work, we argue that MTE is not estimable and detail a novel notion of approximation that relies on the sorted order of the values in $\\mathbf{a-b}$. Next, we identify a quantity called variability that exactly captures the complexity of MTE estimation. By drawing connections to instance-optimality studied in theoretical computer science, we show that every algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability. Finally, we provide a simple linear time algorithm for computing the variability exactly. Unlike much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., take one of $k$ discrete values."}, "https://arxiv.org/abs/2403.10766": {"title": "ODE Discovery for Longitudinal Heterogeneous Treatment Effects Inference", "link": "https://arxiv.org/abs/2403.10766", "description": "arXiv:2403.10766v1 Announce Type: cross \nAbstract: Inferring unbiased treatment effects has received widespread attention in the machine learning community. In recent years, our community has proposed numerous solutions in standard settings, high-dimensional treatment settings, and even longitudinal settings. While very diverse, these solutions have mostly relied on neural networks for inference and simultaneous correction of assignment bias. New approaches typically build on top of previous approaches by proposing new (or refined) architectures and learning algorithms. However, the end result -- a neural-network-based inference machine -- remains unchallenged. In this paper, we introduce a different type of solution in the longitudinal setting: a closed-form ordinary differential equation (ODE). While we still rely on continuous optimization to learn an ODE, the resulting inference machine is no longer a neural network. Doing so yields several advantages such as interpretability, irregular sampling, and a different set of identification assumptions. Above all, we consider the introduction of a completely new type of solution to be our most important contribution as it may spark entirely new innovations in treatment effects in general.
We facilitate this by formulating our contribution as a framework that can transform any ODE discovery method into a treatment effects method."}, "https://arxiv.org/abs/2403.11332": {"title": "Graph Neural Network based Double Machine Learning Estimator of Network Causal Effects", "link": "https://arxiv.org/abs/2403.11332", "description": "arXiv:2403.11332v1 Announce Type: cross \nAbstract: Our paper addresses the challenge of inferring causal effects in social network data, characterized by complex interdependencies among individuals resulting in challenges such as non-independence of units, interference (where a unit's outcome is affected by neighbors' treatments), and introduction of additional confounding factors from neighboring units. We propose a novel methodology combining graph neural networks and double machine learning, enabling accurate and efficient estimation of direct and peer effects using a single observational social network. Our approach utilizes graph isomorphism networks in conjunction with double machine learning to effectively adjust for network confounders and consistently estimate the desired causal effects. We demonstrate that our estimator is both asymptotically normal and semiparametrically efficient. A comprehensive evaluation against four state-of-the-art baseline methods using three semi-synthetic social network datasets reveals our method's on-par or superior efficacy in precise causal effect estimation. Further, we illustrate the practical application of our method through a case study that investigates the impact of Self-Help Group participation on financial risk tolerance. The results indicate a significant positive direct effect, underscoring the potential of our approach in social network analysis. Additionally, we explore the effects of network sparsity on estimation performance."}, "https://arxiv.org/abs/2403.11333": {"title": "Identification of Information Structures in Bayesian Games", "link": "https://arxiv.org/abs/2403.11333", "description": "arXiv:2403.11333v1 Announce Type: cross \nAbstract: To what extent can an external observer observing an equilibrium action distribution in an incomplete information game infer the underlying information structure? We investigate this issue in a general linear-quadratic-Gaussian framework. A simple class of canonical information structures is offered and proves rich enough to rationalize any possible equilibrium action distribution that can arise under an arbitrary information structure. We show that the class is parsimonious in the sense that the relevant parameters can be uniquely pinned down by an observed equilibrium outcome, up to some qualifications. Our result implies, for example, that the accuracy of each agent's signal about the state is identified, as measured by how much observing the signal reduces the state variance. Moreover, we show that a canonical information structure characterizes the lower bound on the amount by which each agent's signal can reduce the state variance, across all observationally equivalent information structures. 
The lower bound is tight, for example, when the actual information structure is uni-dimensional, or when there are no strategic interactions among agents, but in general, there is a gap since agents' strategic motives confound their private information about fundamental and strategic uncertainty."}, "https://arxiv.org/abs/2403.11343": {"title": "Federated Transfer Learning with Differential Privacy", "link": "https://arxiv.org/abs/2403.11343", "description": "arXiv:2403.11343v1 Announce Type: cross \nAbstract: Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of \\textit{federated differential privacy}, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets."}, "https://arxiv.org/abs/1805.05606": {"title": "Nonparametric Bayesian volatility learning under microstructure noise", "link": "https://arxiv.org/abs/1805.05606", "description": "arXiv:1805.05606v2 Announce Type: replace \nAbstract: In this work, we study the problem of learning the volatility under market microstructure noise. Specifically, we consider noisy discrete time observations from a stochastic differential equation and develop a novel computational method to learn the diffusion coefficient of the equation. We take a nonparametric Bayesian approach, where we \\emph{a priori} model the volatility function as piecewise constant. Its prior is specified via the inverse Gamma Markov chain. Sampling from the posterior is accomplished by incorporating the Forward Filtering Backward Simulation algorithm in the Gibbs sampler. Good performance of the method is demonstrated on two representative synthetic data examples. We also apply the method on a EUR/USD exchange rate dataset. Finally we present a limit result on the prior distribution."}, "https://arxiv.org/abs/2203.06056": {"title": "Identifying Causal Effects using Instrumental Time Series: Nuisance IV and Correcting for the Past", "link": "https://arxiv.org/abs/2203.06056", "description": "arXiv:2203.06056v2 Announce Type: replace \nAbstract: Instrumental variable (IV) regression relies on instruments to infer causal effects from observational data with unobserved confounding. We consider IV regression in time series models, such as vector auto-regressive (VAR) processes. Direct applications of i.i.d. techniques are generally inconsistent as they do not correctly adjust for dependencies in the past. 
In this paper, we outline the difficulties that arise due to time structure and propose methodology for constructing identifying equations that can be used for consistent parametric estimation of causal effects in time series data. One method uses extra nuisance covariates to obtain identifiability (an idea that can be of interest even in the i.i.d. case). We further propose a graph marginalization framework that allows us to apply nuisance IV and other IV methods in a principled way to time series. Our methods make use of a version of the global Markov property, which we prove holds for VAR(p) processes. For VAR(1) processes, we prove identifiability conditions that relate to Jordan forms and are different from the well-known rank conditions in the i.i.d. case (they do not require as many instruments as covariates, for example). We provide methods, prove their consistency, and show how the inferred causal effect can be used for distribution generalization. Simulation experiments corroborate our theoretical results. We provide ready-to-use Python code."}, "https://arxiv.org/abs/2206.12113": {"title": "Sequential adaptive design for emulating costly computer codes", "link": "https://arxiv.org/abs/2206.12113", "description": "arXiv:2206.12113v3 Announce Type: replace \nAbstract: Gaussian processes (GPs) are generally regarded as the gold standard surrogate model for emulating computationally expensive computer-based simulators. However, the problem of training GPs as accurately as possible with a minimum number of model evaluations remains challenging. We address this problem by suggesting a novel adaptive sampling criterion called VIGF (variance of improvement for global fit). The improvement function at any point is a measure of the deviation of the GP emulator from the nearest observed model output. At each iteration of the proposed algorithm, a new run is performed where VIGF is the largest. Then, the new sample is added to the design and the emulator is updated accordingly. A batch version of VIGF is also proposed which can save the user time when parallel computing is available. Additionally, VIGF is extended to the multi-fidelity case where the expensive high-fidelity model is predicted with the assistance of a lower fidelity simulator. This is performed via hierarchical kriging. The applicability of our method is assessed on a bunch of test functions and its performance is compared with several sequential sampling strategies. The results suggest that our method has a superior performance in predicting the benchmark functions in most cases."}, "https://arxiv.org/abs/2206.12525": {"title": "Causality for Complex Continuous-time Functional Longitudinal Studies", "link": "https://arxiv.org/abs/2206.12525", "description": "arXiv:2206.12525v3 Announce Type: replace \nAbstract: The paramount obstacle in longitudinal studies for causal inference is the complex \"treatment-confounder feedback.\" Traditional methodologies for elucidating causal effects in longitudinal analyses are primarily based on the assumption that time moves in specific intervals or that changes in treatment occur discretely. This conventional view confines treatment-confounder feedback to a limited, countable scope. The advent of real-time monitoring in modern medical research introduces functional longitudinal data with dynamically time-varying outcomes, treatments, and confounders, necessitating dealing with a potentially uncountably infinite treatment-confounder feedback. 
Thus, there is an urgent need for a more elaborate and refined theoretical framework to navigate these intricacies. Recently, Ying (2024) proposed a preliminary framework focusing on end-of-study outcomes and addressing causality in functional longitudinal data. Our paper expands significantly upon his foundation in four ways: First, we conduct a comprehensive review of existing literature, which not only fosters a deeper understanding of the underlying concepts but also illuminates the genesis of both Ying (2024)'s framework and ours. Second, we extend Ying (2024) to fully embrace a functional time-varying outcome process, incorporating right censoring and truncation by death, which are both significant and practical concerns. Third, we formalize previously informal propositions in Ying (2024), demonstrating how this framework broadens the existing frameworks in a nonparametric manner. Lastly, we delve into a detailed discussion of the interpretability and feasibility of our assumptions and outline a strategy for future numerical studies."}, "https://arxiv.org/abs/2207.00530": {"title": "The Target Study: A Conceptual Model and Framework for Measuring Disparity", "link": "https://arxiv.org/abs/2207.00530", "description": "arXiv:2207.00530v2 Announce Type: replace \nAbstract: We present a conceptual model to measure disparity--the target study--where social groups may be similarly situated (i.e., balanced) on allowable covariates. Our model, based on a sampling design, does not intervene to assign social group membership or alter allowable covariates. To address non-random sample selection, we extend our model to generalize or transport disparity or to assess disparity after an intervention on eligibility-related variables that eliminates forms of collider-stratification. To avoid bias from differential timing of enrollment, we aggregate time-specific study results by balancing calendar time of enrollment across social groups. To provide a framework for emulating our model, we discuss study designs, data structures, and G-computation and weighting estimators. We compare our sampling-based model to prominent decomposition-based models used in healthcare and algorithmic fairness. We provide R code for all estimators and apply our methods to measure health system disparities in hypertension control using electronic medical records."}, "https://arxiv.org/abs/2210.06927": {"title": "Prediction can be safely used as a proxy for explanation in causally consistent Bayesian generalized linear models", "link": "https://arxiv.org/abs/2210.06927", "description": "arXiv:2210.06927v4 Announce Type: replace \nAbstract: Bayesian modeling provides a principled approach to quantifying uncertainty in model parameters and model structure and has seen a surge of applications in recent years. Within the context of a Bayesian workflow, we are concerned with model selection for the purpose of finding models that best explain the data, that is, help us understand the underlying data generating process. Since we rarely have access to the true process, all we are left with during real-world analyses is incomplete causal knowledge from sources outside of the current data and model predictions of said data. This leads to the important question of when the use of prediction as a proxy for explanation for the purpose of model selection is valid. We approach this question by means of large-scale simulations of Bayesian generalized linear models where we investigate various causal and statistical misspecifications.
Our results indicate that the use of prediction as proxy for explanation is valid and safe only when the models under consideration are sufficiently consistent with the underlying causal structure of the true data generating process."}, "https://arxiv.org/abs/2211.01938": {"title": "A family of mixture models for beta valued DNA methylation data", "link": "https://arxiv.org/abs/2211.01938", "description": "arXiv:2211.01938v3 Announce Type: replace \nAbstract: As hypermethylation of promoter cytosine-guanine dinucleotide (CpG) islands has been shown to silence tumour suppressor genes, identifying differentially methylated CpG sites between different samples can assist in understanding disease. Differentially methylated CpG sites (DMCs) can be identified using moderated t-tests or nonparametric tests, but this typically requires the use of data transformations due to a lack of appropriate statistical methods able to adequately account for the bounded nature of DNA methylation data.\n We propose a family of beta mixture models (BMMs) which use a model-based approach to cluster CpG sites given their original beta-valued methylation data, with no need for transformations. The BMMs allow (i) objective inference of methylation state thresholds and (ii) identification of DMCs between different sample types. The BMMs employ different parameter constraints facilitating application to different study settings. Parameter estimation proceeds via an expectation-maximisation algorithm, with a novel approximation in the maximization step providing tractability and computational feasibility.\n Performance of BMMs is assessed through thorough simulation studies, and the BMMs are used to analyse a prostate cancer dataset. The BMMs objectively infer intuitive and biologically interpretable methylation state thresholds, and identify DMCs that are related to genes implicated in carcinogenesis and involved in cancer related pathways. An R package betaclust facilitates widespread use of BMMs."}, "https://arxiv.org/abs/2211.16059": {"title": "On Large-Scale Multiple Testing Over Networks: An Asymptotic Approach", "link": "https://arxiv.org/abs/2211.16059", "description": "arXiv:2211.16059v4 Announce Type: replace \nAbstract: This work concerns developing communication- and computation-efficient methods for large-scale multiple testing over networks, which is of interest to many practical applications. We take an asymptotic approach and propose two methods, proportion-matching and greedy aggregation, tailored to distributed settings. The proportion-matching method achieves the global BH performance yet only requires a one-shot communication of the (estimated) proportion of true null hypotheses as well as the number of p-values at each node. By focusing on the asymptotic optimal power, we go beyond the BH procedure by providing an explicit characterization of the asymptotic optimal solution. This leads to the greedy aggregation method that effectively approximates the optimal rejection regions at each node, while computation efficiency comes from the greedy-type approach naturally. Moreover, for both methods, we provide the rate of convergence for both the FDR and power. 
Extensive numerical results over a variety of challenging settings are provided to support our theoretical findings."}, "https://arxiv.org/abs/2212.01900": {"title": "Bayesian survival analysis with INLA", "link": "https://arxiv.org/abs/2212.01900", "description": "arXiv:2212.01900v3 Announce Type: replace \nAbstract: This tutorial shows how various Bayesian survival models can be fitted using the integrated nested Laplace approximation in a clear, legible, and comprehensible manner using the INLA and INLAjoint R-packages. Such models include accelerated failure time, proportional hazards, mixture cure, competing risks, multi-state, frailty, and joint models of longitudinal and survival data, originally presented in the article \"Bayesian survival analysis with BUGS\" (Alvares et al., 2021). In addition, we illustrate the implementation of a new joint model for a longitudinal semicontinuous marker, recurrent events, and a terminal event. Our proposal aims to provide the reader with syntax examples for implementing survival models using a fast and accurate approximate Bayesian inferential approach."}, "https://arxiv.org/abs/2302.13133": {"title": "Data-driven uncertainty quantification for constrained stochastic differential equations and application to solar photovoltaic power forecast data", "link": "https://arxiv.org/abs/2302.13133", "description": "arXiv:2302.13133v2 Announce Type: replace \nAbstract: In this work, we extend the data-driven It\\^{o} stochastic differential equation (SDE) framework for the pathwise assessment of short-term forecast errors to account for the time-dependent upper bound that naturally constrains the observable historical data and forecast. We propose a new nonlinear and time-inhomogeneous SDE model with a Jacobi-type diffusion term for the phenomenon of interest, simultaneously driven by the forecast and the constraining upper bound. We rigorously demonstrate the existence and uniqueness of a strong solution to the SDE model by imposing a condition for the time-varying mean-reversion parameter appearing in the drift term. The normalized forecast function is thresholded to keep such mean-reversion parameters bounded. The SDE model parameter calibration also covers the thresholding parameter of the normalized forecast by applying a novel iterative two-stage optimization procedure to user-selected approximations of the likelihood function. Another novel contribution is estimating the transition density of the forecast error process, not known analytically in a closed form, through a tailored kernel smoothing technique with the control variate method. We fit the model to the 2019 photovoltaic (PV) solar power daily production and forecast data in Uruguay, computing the daily maximum solar PV production estimation. Two statistical versions of the constrained SDE model are fit, with the beta and truncated normal distributions as proxies for the transition density. Empirical results include simulations of the normalized solar PV power production and pathwise confidence bands generated through an indirect inference method. 
An objective comparison of optimal parametric points associated with the two selected statistical approximations is provided by applying the innovative kernel density estimation technique of the transition function of the forecast error process."}, "https://arxiv.org/abs/2303.08528": {"title": "Translating predictive distributions into informative priors", "link": "https://arxiv.org/abs/2303.08528", "description": "arXiv:2303.08528v3 Announce Type: replace \nAbstract: When complex Bayesian models exhibit implausible behaviour, one solution is to assemble available information into an informative prior. Challenges arise as prior information is often only available for the observable quantity, or some model-derived marginal quantity, rather than directly pertaining to the natural parameters in our model. We propose a method for translating available prior information, in the form of an elicited distribution for the observable or model-derived marginal quantity, into an informative joint prior. Our approach proceeds given a parametric class of prior distributions with as yet undetermined hyperparameters, and minimises the difference between the supplied elicited distribution and corresponding prior predictive distribution. We employ a global, multi-stage Bayesian optimisation procedure to locate optimal values for the hyperparameters. Three examples illustrate our approach: a cure-fraction survival model, where censoring implies that the observable quantity is a priori a mixed discrete/continuous quantity; a setting in which prior information pertains to $R^{2}$ -- a model-derived quantity; and a nonlinear regression model."}, "https://arxiv.org/abs/2303.10215": {"title": "Statistical inference for association studies in the presence of binary outcome misclassification", "link": "https://arxiv.org/abs/2303.10215", "description": "arXiv:2303.10215v3 Announce Type: replace \nAbstract: In biomedical and public health association studies, binary outcome variables may be subject to misclassification, resulting in substantial bias in effect estimates. The feasibility of addressing binary outcome misclassification in regression models is often hindered by model identifiability issues. In this paper, we characterize the identifiability problems in this class of models as a specific case of ''label switching'' and leverage a pattern in the resulting parameter estimates to solve the permutation invariance of the complete data log-likelihood. Our proposed algorithm in binary outcome misclassification models does not require gold standard labels and relies only on the assumption that the sum of the sensitivity and specificity exceeds 1. A label switching correction is applied within estimation methods to recover unbiased effect estimates and to estimate misclassification rates. Open source software is provided to implement the proposed methods. We give a detailed simulation study for our proposed methodology and apply these methods to data from the 2020 Medical Expenditure Panel Survey (MEPS)."}, "https://arxiv.org/abs/2303.15158": {"title": "Discovering the Network Granger Causality in Large Vector Autoregressive Models", "link": "https://arxiv.org/abs/2303.15158", "description": "arXiv:2303.15158v2 Announce Type: replace \nAbstract: This paper proposes novel inferential procedures for discovering the network Granger causality in high-dimensional vector autoregressive models. In particular, we mainly offer two multiple testing procedures designed to control the false discovery rate (FDR). 
The first procedure is based on the limiting normal distribution of the $t$-statistics with the debiased lasso estimator. The second procedure is its bootstrap version. We also provide a robustification of the first procedure against any cross-sectional dependence using asymptotic e-variables. Their theoretical properties, including FDR control and power guarantee, are investigated. The finite sample evidence suggests that both procedures can successfully control the FDR while maintaining high power. Finally, the proposed methods are applied to discovering the network Granger causality in a large number of macroeconomic variables and regional house prices in the UK."}, "https://arxiv.org/abs/2306.14311": {"title": "Simple Estimation of Semiparametric Models with Measurement Errors", "link": "https://arxiv.org/abs/2306.14311", "description": "arXiv:2306.14311v2 Announce Type: replace \nAbstract: We develop a practical way of addressing the Errors-In-Variables (EIV) problem in the Generalized Method of Moments (GMM) framework. We focus on the settings in which the variability of the EIV is a fraction of that of the mismeasured variables, which is typical for empirical applications. For any initial set of moment conditions our approach provides a \"corrected\" set of moment conditions that are robust to the EIV. We show that the GMM estimator based on these moments is root-n-consistent, with the standard tests and confidence intervals providing valid inference. This is true even when the EIV are so large that naive estimators (that ignore the EIV problem) are heavily biased with their confidence intervals having 0% coverage. Our approach involves no nonparametric estimation, which is especially important for applications with many covariates, and settings with multivariate or non-classical EIV. In particular, the approach makes it easy to use instrumental variables to address EIV in nonlinear models."}, "https://arxiv.org/abs/2306.14862": {"title": "Marginal Effects for Probit and Tobit with Endogeneity", "link": "https://arxiv.org/abs/2306.14862", "description": "arXiv:2306.14862v2 Announce Type: replace \nAbstract: When evaluating partial effects, it is important to distinguish between structural endogeneity and measurement errors. In contrast to linear models, these two sources of endogeneity affect partial effects differently in nonlinear models. We study this issue focusing on the Instrumental Variable (IV) Probit and Tobit models. We show that even when a valid IV is available, failing to differentiate between the two types of endogeneity can lead to either under- or over-estimation of the partial effects. We develop simple estimators of the bounds on the partial effects and provide easy to implement confidence intervals that correctly account for both types of endogeneity. We illustrate the methods in a Monte Carlo simulation and an empirical application."}, "https://arxiv.org/abs/2308.05577": {"title": "Optimal Designs for Two-Stage Inference", "link": "https://arxiv.org/abs/2308.05577", "description": "arXiv:2308.05577v2 Announce Type: replace \nAbstract: The analysis of screening experiments is often done in two stages, starting with factor selection via an analysis under a main effects model. The success of this first stage is influenced by three components: (1) main effect estimators' variances and (2) bias, and (3) the estimate of the noise variance. Component (3) has only recently been given attention with design techniques that ensure an unbiased estimate of the noise variance. 
In this paper, we propose a design criterion based on expected confidence intervals of the first stage analysis that balances all three components. To address model misspecification, we propose a computationally-efficient all-subsets analysis and a corresponding constrained design criterion based on lack-of-fit. Scenarios found in existing design literature are revisited with our criteria and new designs are provided that improve upon existing methods."}, "https://arxiv.org/abs/2310.08536": {"title": "Real-time Prediction of the Great Recession and the Covid-19 Recession", "link": "https://arxiv.org/abs/2310.08536", "description": "arXiv:2310.08536v4 Announce Type: replace \nAbstract: A series of standard and penalized logistic regression models is employed to model and forecast the Great Recession and the Covid-19 recession in the US. The empirical analysis explores the predictive content of numerous macroeconomic and financial indicators with respect to NBER recession indicator. The predictive ability of the underlying models is evaluated using a set of statistical evaluation metrics. The recessions are scrutinized by closely examining the movement of five most influential predictors that are chosen through automatic variable selections of the Lasso regression, along with the regression coefficients and the predicted recession probabilities. The results strongly support the application of penalized logistic regression models in the area of recession forecasting. Specifically, the analysis indicates that the mixed usage of different penalized logistic regression models over different forecast horizons largely outperform standard logistic regression models in the prediction of Great recession in the US, as they achieve higher predictive accuracy across 5 different forecast horizons. The Great Recession is largely predictable, whereas the Covid-19 recession remains unpredictable, given that the Covid-19 pandemic is a real exogenous event. The empirical study reaffirms the traditional role of the term spread as one of the most important recession indicators. The results are validated by constructing via PCA on a set of selected variables a recession indicator that suffers less from publication lags and exhibits a very high association with the NBER recession indicator."}, "https://arxiv.org/abs/2310.17571": {"title": "Inside the black box: Neural network-based real-time prediction of US recessions", "link": "https://arxiv.org/abs/2310.17571", "description": "arXiv:2310.17571v2 Announce Type: replace \nAbstract: A standard feedforward neural network (FFN) and two specific types of recurrent neural networks, long short-term memory (LSTM) and gated recurrent unit (GRU), are used for modeling US recessions in the period from 1967 to 2021. The estimated models are then employed to conduct real-time predictions of the Great Recession and the Covid-19 recession in the US. Their predictive performances are compared to those of the traditional linear models, the standard logit model and the ridge logit model. The out-of-sample performance suggests the application of LSTM and GRU in the area of recession forecasting, especially for the long-term forecasting tasks. They outperform other types of models across five different forecast horizons with respect to a selected set of statistical metrics. 
The Shapley additive explanations (SHAP) method is applied to GRU and the ridge logit model, the best performers in the neural network and linear model groups, respectively, to gain insight into the variable importance. The evaluation of variable importance differs between GRU and the ridge logit model, as reflected in the different variable orderings determined by the SHAP values. These different weight assignments can be attributed to GRU's flexibility and capability to capture business cycle asymmetries and nonlinearities. The SHAP method delivers some key recession indicators. For forecasting up to 3 months, the stock price index, real GDP, and private residential fixed investment show great short-term predictability, while for longer-term forecasting up to 12 months, the term spread and the producer price index have strong explanatory power for recessions. These findings are robust against other interpretation methods such as the local interpretable model-agnostic explanations (LIME) for GRU and the marginal effects for the ridge logit model."}, "https://arxiv.org/abs/2311.03829": {"title": "Multilevel mixtures of latent trait analyzers for clustering multi-layer bipartite networks", "link": "https://arxiv.org/abs/2311.03829", "description": "arXiv:2311.03829v2 Announce Type: replace \nAbstract: Within network data analysis, bipartite networks represent a particular type of network where relationships occur between two disjoint sets of nodes, formally called sending and receiving nodes. In this context, sending nodes may be organized into layers on the basis of some defined characteristics, resulting in a special case of multilayer bipartite network, where each layer includes a specific set of sending nodes. To perform a clustering of sending nodes in a multi-layer bipartite network, we extend the Mixture of Latent Trait Analyzers (MLTA), also taking into account the influence of concomitant variables on cluster formation and the multi-layer structure of the data. To this aim, a multilevel approach offers a useful methodological tool to properly account for the hierarchical structure of the data and for the unobserved sources of heterogeneity at multiple levels. A simulation study is conducted to test the performance of the proposal in terms of parameter and clustering recovery. Furthermore, the model is applied to the European Social Survey data (ESS) to i) perform a clustering of individuals (sending nodes) based on their digital skills (receiving nodes); ii) understand how socio-economic and demographic characteristics influence the individual digitalization level; iii) account for the multilevel structure of the data; iv) obtain a clustering of countries in terms of the baseline attitude to digital technologies of their residents."}, "https://arxiv.org/abs/2312.12741": {"title": "Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances", "link": "https://arxiv.org/abs/2312.12741", "description": "arXiv:2312.12741v2 Announce Type: replace-cross \nAbstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develops a lower bound for the probability of misidentifying the best arm.
They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity, and the gap between the expected rewards of two arms approaches zero (small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variance, is asymptotically optimal even when the variances are unknown."}, "https://arxiv.org/abs/2403.12243": {"title": "Time-Since-Infection Model for Hospitalization and Incidence Data", "link": "https://arxiv.org/abs/2403.12243", "description": "arXiv:2403.12243v1 Announce Type: new \nAbstract: The Time Since Infection (TSI) models, which use disease surveillance data to model infectious diseases, have become increasingly popular recently due to their flexibility and capacity to address complex disease control questions. However, a notable limitation of TSI models is their primary reliance on incidence data. Even when hospitalization data are available, existing TSI models have not been crafted to estimate disease transmission or predict disease-related hospitalizations - metrics crucial for understanding a pandemic and planning hospital resources. Moreover, their dependence on reported infection data makes them vulnerable to variations in data quality. In this study, we advance TSI models by integrating hospitalization data, marking a significant step forward in modeling with TSI models. Our improvements enable the estimation of key infectious disease parameters without relying on contact tracing data, reduce bias in incidence data, and provide a foundation to connect TSI models with other infectious disease models. We introduce hospitalization propensity parameters to jointly model incidence and hospitalization data. We use a composite likelihood function to accommodate complex data structure and an MCEM algorithm to estimate model parameters. We apply our method to COVID-19 data to estimate disease transmission, assess risk factor impacts, and calculate hospitalization propensity."}, "https://arxiv.org/abs/2403.12250": {"title": "Bayesian Optimization Sequential Surrogate (BOSS) Algorithm: Fast Bayesian Inference for a Broad Class of Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2403.12250", "description": "arXiv:2403.12250v1 Announce Type: new \nAbstract: Approximate Bayesian inference based on Laplace approximation and quadrature methods have become increasingly popular for their efficiency at fitting latent Gaussian models (LGM), which encompass popular models such as Bayesian generalized linear models, survival models, and spatio-temporal models. However, many useful models fall under the LGM framework only if some conditioning parameters are fixed, as the design matrix would vary with these parameters otherwise. 
Such models are termed the conditional LGMs with examples in change-point detection, non-linear regression, etc. Existing methods for fitting conditional LGMs rely on grid search or Markov-chain Monte Carlo (MCMC); both require a large number of evaluations of the unnormalized posterior density of the conditioning parameters. As each evaluation of the density requires fitting a separate LGM, these methods become computationally prohibitive beyond simple scenarios. In this work, we introduce the Bayesian optimization sequential surrogate (BOSS) algorithm, which combines Bayesian optimization with approximate Bayesian inference methods to significantly reduce the computational resources required for fitting conditional LGMs. With orders of magnitude fewer evaluations compared to grid or MCMC methods, Bayesian optimization provides us with sequential design points that capture the majority of the posterior mass of the conditioning parameters, which subsequently yields an accurate surrogate posterior distribution that can be easily normalized. We illustrate the efficiency, accuracy, and practical utility of the proposed method through extensive simulation studies and real-world applications in epidemiology, environmental sciences, and astrophysics."}, "https://arxiv.org/abs/2403.12332": {"title": "A maximum penalised likelihood approach for semiparametric accelerated failure time models with time-varying covariates and partly interval censoring", "link": "https://arxiv.org/abs/2403.12332", "description": "arXiv:2403.12332v1 Announce Type: new \nAbstract: Accelerated failure time (AFT) models are frequently used for modelling survival data. This approach is attractive as it quantifies the direct relationship between the time until an event occurs and various covariates. It asserts that the failure times experience either acceleration or deceleration through a multiplicative factor when these covariates are present. While existing literature provides numerous methods for fitting AFT models with time-fixed covariates, adapting these approaches to scenarios involving both time-varying covariates and partly interval-censored data remains challenging. In this paper, we introduce a maximum penalised likelihood approach to fit a semiparametric AFT model. This method, designed for survival data with partly interval-censored failure times, accommodates both time-fixed and time-varying covariates. We utilise Gaussian basis functions to construct a smooth approximation of the nonparametric baseline hazard and fit the model via a constrained optimisation approach. To illustrate the effectiveness of our proposed method, we conduct a comprehensive simulation study. We also present an implementation of our approach on a randomised clinical trial dataset on advanced melanoma patients."}, "https://arxiv.org/abs/2403.12456": {"title": "Inflation Target at Risk: A Time-varying Parameter Distributional Regression", "link": "https://arxiv.org/abs/2403.12456", "description": "arXiv:2403.12456v1 Announce Type: new \nAbstract: Macro variables frequently display time-varying distributions, driven by the dynamic and evolving characteristics of economic, social, and environmental factors that consistently reshape the fundamental patterns and relationships governing these variables. 
To better understand the distributional dynamics beyond the central tendency, this paper introduces a novel semi-parametric approach for constructing time-varying conditional distributions, relying on recent advances in distributional regression. We present an efficient precision-based Markov Chain Monte Carlo algorithm that simultaneously estimates all model parameters while explicitly enforcing the monotonicity condition on the conditional distribution function. Our model is applied to construct the forecasting distribution of inflation for the U.S., conditional on a set of macroeconomic and financial indicators. The risks of future inflation deviating excessively above or below the desired range are carefully evaluated. Moreover, we provide a thorough discussion about the interplay between inflation and unemployment rates during the Global Financial Crisis, COVID, and the third quarter of 2023."}, "https://arxiv.org/abs/2403.12561": {"title": "A Bayesian multilevel hidden Markov model with Poisson-lognormal emissions for intense longitudinal count data", "link": "https://arxiv.org/abs/2403.12561", "description": "arXiv:2403.12561v1 Announce Type: new \nAbstract: Hidden Markov models (HMMs) are probabilistic methods in which observations are seen as realizations of a latent Markov process with discrete states that switch over time. Moving beyond standard statistical tests, HMMs offer a statistical environment to optimally exploit the information present in multivariate time series, uncovering the latent dynamics that rule them. Here, we extend the Poisson HMM to the multilevel framework, accommodating variability between individuals with continuously distributed individual random effects following a lognormal distribution, and describe how to estimate the model in a fully parametric Bayesian framework. The proposed multilevel HMM enables probabilistic decoding of hidden state sequences from multivariate count time-series based on individual-specific parameters, and offers a framework to formally quantify between-individual variability. Through a Monte Carlo study, we show that the multilevel HMM outperforms the HMM for scenarios involving heterogeneity between individuals, demonstrating improved decoding accuracy and estimation performance of parameters of the emission distribution, and performs equally well when no between-individual heterogeneity is present. Finally, we illustrate how to use our model to explore the latent dynamics governing complex multivariate count data in an empirical application concerning pilot whale diving behaviour in the wild, and how to identify neural states from multi-electrode recordings of motor neural cortex activity in a macaque monkey in an experimental setup. We make the multilevel HMM introduced in this study publicly available in the R-package mHMMbayes on CRAN."}, "https://arxiv.org/abs/2403.12624": {"title": "Large-scale metric objects filtering for binary classification with application to abnormal brain connectivity detection", "link": "https://arxiv.org/abs/2403.12624", "description": "arXiv:2403.12624v1 Announce Type: new \nAbstract: The classification of random objects within metric spaces without a vector structure has attracted increasing attention. However, the complexity inherent in such non-Euclidean data often restricts existing models to handling only a limited number of features, leaving a gap in real-world applications.
To address this, we propose a data-adaptive filtering procedure to identify informative features from large-scale random objects, leveraging a novel Kolmogorov-Smirnov-type statistic defined on the metric space. Our method, applicable to data in general metric spaces with binary labels, exhibits remarkable flexibility. It enjoys a model-free property, as its implementation does not rely on any specified classifier. Theoretically, it controls the false discovery rate while guaranteeing the sure screening property. Empirically, equipped with a Wasserstein metric, it demonstrates superior sample performance compared to Euclidean competitors. When applied to analyze a dataset on autism, our method identifies significant brain regions associated with the condition. Moreover, it reveals distinct interaction patterns among these regions between individuals with and without autism, achieved by filtering hundreds of thousands of covariance matrices representing various brain connectivities."}, "https://arxiv.org/abs/2403.12653": {"title": "Composite likelihood estimation of stationary Gaussian processes with a view toward stochastic volatility", "link": "https://arxiv.org/abs/2403.12653", "description": "arXiv:2403.12653v1 Announce Type: new \nAbstract: We develop a framework for composite likelihood inference of parametric continuous-time stationary Gaussian processes. We derive the asymptotic theory of the associated maximum composite likelihood estimator. We implement our approach on a pair of models that has been proposed to describe the random log-spot variance of financial asset returns. A simulation study shows that it delivers good performance in these settings and improves upon a method-of-moments estimation. In an application, we inspect the dynamic of an intraday measure of spot variance computed with high-frequency data from the cryptocurrency market. The empirical evidence supports a mechanism, where the short- and long-term correlation structure of stochastic volatility are decoupled in order to capture its properties at different time scales."}, "https://arxiv.org/abs/2403.12677": {"title": "Causal Change Point Detection and Localization", "link": "https://arxiv.org/abs/2403.12677", "description": "arXiv:2403.12677v1 Announce Type: new \nAbstract: Detecting and localizing change points in sequential data is of interest in many areas of application. Various notions of change points have been proposed, such as changes in mean, variance, or the linear regression coefficient. In this work, we consider settings in which a response variable $Y$ and a set of covariates $X=(X^1,\\ldots,X^{d+1})$ are observed over time and aim to find changes in the causal mechanism generating $Y$ from $X$. More specifically, we assume $Y$ depends linearly on a subset of the covariates and aim to determine at what time points either the dependency on the subset or the subset itself changes. We call these time points causal change points (CCPs) and show that they form a subset of the commonly studied regression change points. We propose general methodology to both detect and localize CCPs. Although motivated by causality, we define CCPs without referencing an underlying causal model. The proposed definition of CCPs exploits a notion of invariance, which is a purely observational quantity but -- under additional assumptions -- has a causal meaning. For CCP localization, we propose a loss function that can be combined with existing multiple change point algorithms to localize multiple CCPs efficiently. 
We evaluate and illustrate our methods on simulated datasets."}, "https://arxiv.org/abs/2403.12711": {"title": "Tests for categorical data beyond Pearson: A distance covariance and energy distance approach", "link": "https://arxiv.org/abs/2403.12711", "description": "arXiv:2403.12711v1 Announce Type: new \nAbstract: Categorical variables are of utmost importance in biomedical research. When two of them are considered, it is often the case that one wants to test whether or not they are statistically dependent. We show weaknesses of classical methods -- such as Pearson's and the G-test -- and we propose testing strategies based on distances that lack those drawbacks. We first develop this theory for classical two-dimensional contingency tables, within the context of distance covariance, an association measure that characterises general statistical independence of two variables. We then apply the same fundamental ideas to one-dimensional tables, namely to testing for goodness of fit to a discrete distribution, for which we resort to an analogous statistic called energy distance. We prove that our methodology has desirable theoretical properties, and we show how we can calibrate the null distribution of our test statistics without resorting to any resampling technique. We illustrate all of this in simulations, as well as with some real data examples, demonstrating the adequate performance of our approach for biostatistical practice."}, "https://arxiv.org/abs/2403.12714": {"title": "On the use of the cumulant generating function for inference on time series", "link": "https://arxiv.org/abs/2403.12714", "description": "arXiv:2403.12714v1 Announce Type: new \nAbstract: We introduce innovative inference procedures for analyzing time series data. Our methodology enables density approximation and composite hypothesis testing based on Whittle's estimator, a widely applied M-estimator in the frequency domain. Its core feature involves the (general Legendre transform of the) cumulant generating function of the Whittle likelihood score, as obtained using an approximated distribution of the periodogram ordinates. We present a testing algorithm that significantly expands the applicability of the state-of-the-art saddlepoint test, while maintaining the numerical accuracy of the saddlepoint approximation. Additionally, we demonstrate connections between our findings and three other prevalent frequency domain approaches: the bootstrap, empirical likelihood, and exponential tilting. Numerical examples using both simulated and real data illustrate the advantages and accuracy of our methodology."}, "https://arxiv.org/abs/2403.12757": {"title": "On Equivalence of Likelihood-Based Confidence Bands for Fatigue-Life and Fatigue-Strength Distributions", "link": "https://arxiv.org/abs/2403.12757", "description": "arXiv:2403.12757v1 Announce Type: new \nAbstract: Fatigue data arise in many research and applied areas, and statistical methods have been developed to model and analyze such data. The distributions of fatigue life and fatigue strength are often of interest to engineers designing products that might fail due to fatigue from cyclic-stress loading. Based on a specified statistical model and the maximum likelihood method, the cumulative distribution function (cdf) and quantile function (qf) can be estimated for the fatigue-life and fatigue-strength distributions. Likelihood-based confidence bands can then be obtained for the cdf and qf.
This paper provides equivalence results for confidence bands for fatigue-life and fatigue-strength models. These results are useful for data analysis and computing implementation. We show (a) the equivalence of the confidence bands for the fatigue-life cdf and the fatigue-life qf, (b) the equivalence of confidence bands for the fatigue-strength cdf and the fatigue-strength qf, and (c) the equivalence of confidence bands for the fatigue-life qf and the fatigue-strength qf. Then we illustrate the usefulness of those equivalence results with two examples using experimental fatigue data."}, "https://arxiv.org/abs/2403.12759": {"title": "Robust Numerical Methods for Nonlinear Regression", "link": "https://arxiv.org/abs/2403.12759", "description": "arXiv:2403.12759v1 Announce Type: new \nAbstract: Many scientific and engineering applications require fitting regression models that are nonlinear in the parameters. Advances in computer hardware and software in recent decades have made it easier to fit such models. Relative to fitting regression models that are linear in the parameters, however, fitting nonlinear regression models is more complicated. In particular, software like the $\\texttt{nls}$ R function requires care in how the model is parameterized and how initial values are chosen for the maximum likelihood iterations. Often special diagnostics are needed to detect and suggest approaches for dealing with identifiability problems that can arise with such model fitting. When using Bayesian inference, there is the added complication of having to specify (often noninformative or weakly informative) prior distributions. Generally, the details for these tasks must be determined for each new nonlinear regression model. This paper provides a step-by-step procedure for specifying these details for any appropriate nonlinear regression model. Following the procedure will result in a numerically robust algorithm for fitting the nonlinear regression model. We illustrate the methods with three different nonlinear models that are used in the analysis of experimental fatigue data and we include two detailed numerical examples."}, "https://arxiv.org/abs/2403.12789": {"title": "Bivariate temporal dependence via mixtures of rotated copulas", "link": "https://arxiv.org/abs/2403.12789", "description": "arXiv:2403.12789v1 Announce Type: new \nAbstract: Parametric bivariate copula families are known to flexibly capture a variety of dependence patterns, e.g., either positive or negative dependence in either the lower or upper tails of bivariate distributions. However, to the best of our knowledge, there is not a single parametric model adaptable enough to capture several of these features simultaneously. To address this, we propose a mixture of 4-way rotations of a parametric copula that is able to capture all these features. We illustrate the construction using the Clayton family but the concept is general and can be applied to other families. In order to include dynamic dependence regimes, the approach is extended to a time-dependent sequence of mixture copulas in which the mixture probabilities are allowed to evolve in time via a moving average type of relationship.
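To make the static building block of such a construction concrete, the following sketch evaluates a mixture of the four 90-degree rotations of a Clayton copula density. The time-varying mixture weights of the proposal are not shown, the rotation convention used is one common choice among several, and the parameter values are purely illustrative.

```python
import numpy as np

def clayton_density(u, v, theta):
    """Density of the Clayton copula with parameter theta > 0."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (theta + 1.0) * (u * v) ** (-theta - 1.0) * s ** (-1.0 / theta - 2.0)

def rotated_mixture_density(u, v, theta, weights):
    """Mixture of the four 90-degree rotations of a Clayton copula.
    Rotation conventions differ across software; this follows the common
    choice c90(u,v) = c(1-u, v) and c270(u,v) = c(u, 1-v)."""
    comps = [
        clayton_density(u, v, theta),          # 0 degrees: lower-tail dependence
        clayton_density(1 - u, v, theta),      # 90 degrees
        clayton_density(1 - u, 1 - v, theta),  # 180 degrees (survival): upper tail
        clayton_density(u, 1 - v, theta),      # 270 degrees
    ]
    w = np.asarray(weights, dtype=float)
    return sum(wk * ck for wk, ck in zip(w / w.sum(), comps))

u, v = np.meshgrid(np.linspace(0.05, 0.95, 5), np.linspace(0.05, 0.95, 5))
print(rotated_mixture_density(u, v, theta=2.0, weights=[0.4, 0.1, 0.4, 0.1]).round(2))
```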
The properties of the proposed model and its performance are examined using simulated and real data sets."}, "https://arxiv.org/abs/2403.12815": {"title": "A Unified Framework for Rerandomization using Quadratic Forms", "link": "https://arxiv.org/abs/2403.12815", "description": "arXiv:2403.12815v1 Announce Type: new \nAbstract: In the design stage of a randomized experiment, one way to ensure treatment and control groups exhibit similar covariate distributions is to randomize treatment until some prespecified level of covariate balance is satisfied. This experimental design strategy is known as rerandomization. Most rerandomization methods utilize balance metrics based on a quadratic form $v^TAv$ , where $v$ is a vector of covariate mean differences and $A$ is a positive semi-definite matrix. In this work, we derive general results for treatment-versus-control rerandomization schemes that employ quadratic forms for covariate balance. In addition to allowing researchers to quickly derive properties of rerandomization schemes not previously considered, our theoretical results provide guidance on how to choose the matrix $A$ in practice. We find the Mahalanobis and Euclidean distances optimize different measures of covariate balance. Furthermore, we establish how the covariates' eigenstructure and their relationship to the outcomes dictates which matrix $A$ yields the most precise mean-difference estimator for the average treatment effect. We find that the Euclidean distance is minimax optimal, in the sense that the mean-difference estimator's precision is never too far from the optimal choice, regardless of the relationship between covariates and outcomes. Our theoretical results are verified via simulation, where we find that rerandomization using the Euclidean distance has better performance in high-dimensional settings and typically achieves greater variance reduction to the mean-difference estimator than other quadratic forms."}, "https://arxiv.org/abs/2403.12822": {"title": "FORM-based global reliability sensitivity analysis of systems with multiple failure modes", "link": "https://arxiv.org/abs/2403.12822", "description": "arXiv:2403.12822v1 Announce Type: new \nAbstract: Global variance-based reliability sensitivity indices arise from a variance decomposition of the indicator function describing the failure event. The first-order indices reflect the main effect of each variable on the variance of the failure event and can be used for variable prioritization; the total-effect indices represent the total effect of each variable, including its interaction with other variables, and can be used for variable fixing. This contribution derives expressions for the variance-based reliability indices of systems with multiple failure modes that are based on the first-order reliability method (FORM). The derived expressions are a function of the FORM results and, hence, do not require additional expensive model evaluations. They do involve the evaluation of multinormal integrals, for which effective solutions are available. 
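A minimal sketch of how the quadratic-form acceptance rule works in practice is given below: candidate assignments are redrawn until $v^TAv$ falls under a threshold, with $A$ set to the Mahalanobis choice (the inverse covariance of the covariate mean difference) purely for illustration. The covariate data, acceptance threshold, and helper name are hypothetical.

```python
import numpy as np

def rerandomize(X, n_treat, threshold, A=None, max_draws=10_000, seed=0):
    """Redraw a complete randomization until the balance criterion
    v^T A v <= threshold, where v is the vector of covariate mean
    differences between treatment and control."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if A is None:
        # Mahalanobis choice: inverse covariance of the mean-difference vector
        A = np.linalg.inv(np.cov(X, rowvar=False) * (1 / n_treat + 1 / (n - n_treat)))
    for _ in range(max_draws):
        z = np.zeros(n, dtype=bool)
        z[rng.choice(n, size=n_treat, replace=False)] = True
        v = X[z].mean(axis=0) - X[~z].mean(axis=0)
        if v @ A @ v <= threshold:
            return z, float(v @ A @ v)
    raise RuntimeError("no acceptable assignment found; loosen the threshold")

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))           # illustrative covariates
z, balance = rerandomize(X, n_treat=50, threshold=1.0)
print(balance)
```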
We demonstrate that the derived expressions enable an accurate estimation of variance-based reliability sensitivities for general system problems to which FORM is applicable."}, "https://arxiv.org/abs/2403.12880": {"title": "Clustered Mallows Model", "link": "https://arxiv.org/abs/2403.12880", "description": "arXiv:2403.12880v1 Announce Type: new \nAbstract: Rankings are a type of preference elicitation that arise in experiments where assessors arrange items, for example, in decreasing order of utility. Orderings of n items labelled {1,...,n} are permutations that reflect strict preferences. For a number of reasons, strict preferences can be unrealistic assumptions for real data. For example, when items share common traits it may be reasonable to attribute them equal ranks. Also, different levels of importance can be attributed to the decisions that form the ranking. In a situation with, for example, a large number of items, an assessor may wish to rank a certain number of items at the top, rank other items at the bottom, and express indifference to all others. In addition, when aggregating opinions, a judging body might be decisive about some parts of the rank but ambiguous for others. In this paper we extend the well-known Mallows (Mallows, 1957) model (MM) to accommodate item indifference, a phenomenon that can be in place for a variety of reasons, such as those mentioned above. The underlying grouping of similar items motivates the proposed Clustered Mallows Model (CMM). The CMM can be interpreted as a Mallows distribution for tied ranks where ties are learned from the data. The CMM provides the flexibility to combine strict and indifferent relations, achieving a simpler and more robust representation of rank collections in the form of ordered clusters. Bayesian inference for the CMM is in the class of doubly-intractable problems since the model's normalisation constant is not available in closed form. We overcome this challenge by sampling from the posterior with a version of the exchange algorithm \\citep{murray2006}. Real data analysis of food preferences and results of Formula 1 races are presented, illustrating the CMM in practical situations."}, "https://arxiv.org/abs/2403.12908": {"title": "Regularised Spectral Estimation for High Dimensional Point Processes", "link": "https://arxiv.org/abs/2403.12908", "description": "arXiv:2403.12908v1 Announce Type: new \nAbstract: Advances in modern technology have enabled the simultaneous recording of neural spiking activity, which statistically can be represented by a multivariate point process. We characterise the second order structure of this process via the spectral density matrix, a frequency domain equivalent of the covariance matrix. In the context of neuronal analysis, statistics based on the spectral density matrix can be used to infer connectivity in the brain network between individual neurons. However, the high-dimensional nature of spike train data means that it is often difficult, or at times impossible, to compute these statistics. In this work, we discuss the importance of regularisation-based methods for spectral estimation, and propose novel methodology for use in the point process setting. We establish asymptotic properties for our proposed estimators and evaluate their performance on synthetic data simulated from multivariate Hawkes processes.
Finally, we apply our methodology to neuroscience spike train data in order to illustrate its ability to infer connectivity in the brain network."}, "https://arxiv.org/abs/2403.12108": {"title": "Does AI help humans make better decisions? A methodological framework for experimental evaluation", "link": "https://arxiv.org/abs/2403.12108", "description": "arXiv:2403.12108v1 Announce Type: cross \nAbstract: The use of Artificial Intelligence (AI) based on data-driven algorithms has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions as compared to a human alone or AI alone. We introduce a new methodological framework that can be used to answer this question experimentally with no additional assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded experimental design, in which the provision of AI-generated recommendations is randomized across cases with a human making final decisions. Under this experimental design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. We apply the proposed methodology to the data from our own randomized controlled trial of a pretrial risk assessment instrument. We find that AI recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Our analysis also shows that AI-alone decisions generally perform worse than human decisions with or without AI assistance. Finally, AI recommendations tend to impose cash bail on non-white arrestees more often than necessary when compared to white arrestees."}, "https://arxiv.org/abs/2403.12110": {"title": "Robust estimations from distribution structures: I", "link": "https://arxiv.org/abs/2403.12110", "description": "arXiv:2403.12110v1 Announce Type: cross \nAbstract: As the most fundamental problem in statistics, robust location estimation has many prominent solutions, such as the trimmed mean, Winsorized mean, Hodges-Lehmann estimator, Huber M-estimator, and median of means. Recent studies suggest that their maximum biases concerning the mean can be quite different, but the underlying mechanisms largely remain unclear. This study exploited a semiparametric method to classify distributions by the asymptotic orderliness of quantile combinations with varying breakdown points, showing their interrelations and connections to parametric distributions. Further deductions explain why the Winsorized mean typically has smaller biases compared to the trimmed mean; two sequences of semiparametric robust mean estimators emerge, particularly highlighting the superiority of the median Hodges-Lehmann mean. This article sheds light on the understanding of the common nature of probability distributions."}, "https://arxiv.org/abs/2403.12284": {"title": "The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control", "link": "https://arxiv.org/abs/2403.12284", "description": "arXiv:2403.12284v1 Announce Type: cross \nAbstract: Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc.
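For readers who want to see the estimators named in the robust location entry above side by side, here is a small sketch using SciPy for the trimmed and Winsorized means, with the Hodges-Lehmann and median-of-means estimators written out directly; the contamination level and the sample are illustrative only.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def hodges_lehmann(x):
    """Median of all pairwise (Walsh) averages."""
    return float(np.median([(a + b) / 2 for a, b in combinations(x, 2)]))

def median_of_means(x, n_blocks=10, seed=0):
    """Median of block means over a random partition of the sample."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(x), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 285), rng.normal(10, 1, 15)])   # 5% contamination

print("mean           ", x.mean())
print("trimmed mean   ", stats.trim_mean(x, 0.1))                    # 10% trimmed per tail
print("winsorized mean", stats.mstats.winsorize(x, limits=(0.1, 0.1)).mean())
print("Hodges-Lehmann ", hodges_lehmann(x))
print("median of means", median_of_means(x))
```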
While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially those graph features related to topological structures including cycles (i.e., wreaths), cliques, hubs, etc. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill the gap, we propose a novel inferential framework for general high dimensional graphical models to select graph features with false discovery rate controlled. Our method is based on the maximum of $p$-values from single edges that comprise the topological feature of interest, and is thus able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with the uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states and identify the residues crucial for protein conformational changes and thus potential targets for inhibition."}, "https://arxiv.org/abs/2403.12367": {"title": "Semisupervised score based matching algorithm to evaluate the effect of public health interventions", "link": "https://arxiv.org/abs/2403.12367", "description": "arXiv:2403.12367v1 Announce Type: cross \nAbstract: Multivariate matching algorithms \"pair\" similar study units in an observational study to remove potential bias and confounding effects caused by the absence of randomization. In one-to-one multivariate matching algorithms, a large number of \"pairs\" to be matched brings both the information of a large sample and a large number of matching tasks; an efficient matching algorithm that requires only comparatively limited auxiliary matching knowledge, provided through a \"training\" set of units paired by domain experts, is therefore of practical interest.\n We propose a novel one-to-one matching algorithm based on a quadratic score function $S_{\\beta}(x_i,x_j)= \\beta^T (x_i-x_j)(x_i-x_j)^T \\beta$. The weights $\\beta$, which can be interpreted as a variable importance measure, are designed to minimize the score difference between paired training units while maximizing the score difference between unpaired training units. Further, in the typical but intricate case where the training set is much smaller than the unpaired set, we propose a \\underline{s}emisupervised \\underline{c}ompanion \\underline{o}ne-\\underline{t}o-\\underline{o}ne \\underline{m}atching \\underline{a}lgorithm (SCOTOMA) that makes the best use of the unpaired units. The proposed weight estimator is proved to be consistent when the true matching criterion is indeed the quadratic score function. When the model assumptions are violated, we demonstrate that the proposed algorithm still outperforms some popular competing matching algorithms through a series of simulations.
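Note that the quadratic score simplifies to a squared projection, $S_{\beta}(x_i,x_j) = \beta^T (x_i-x_j)(x_i-x_j)^T \beta = (\beta^T(x_i-x_j))^2$, which makes it cheap to evaluate for all pairs. The sketch below exploits this and performs a simple greedy one-to-one matching; the fixed weight vector and the greedy pairing are illustrative stand-ins, since the paper's estimator learns $\beta$ from a training set of expert-paired units.

```python
import numpy as np

def greedy_one_to_one_match(beta, X_treat, X_ctrl):
    """Greedily pair each treated unit with the unmatched control unit
    minimizing the quadratic score (beta^T (x_i - x_j))^2."""
    scores = ((X_treat @ beta)[:, None] - (X_ctrl @ beta)[None, :]) ** 2
    available = np.ones(len(X_ctrl), dtype=bool)
    pairs = []
    for i in np.argsort(scores.min(axis=1)):                  # easiest-to-match first
        j = int(np.argmin(np.where(available, scores[i], np.inf)))
        pairs.append((int(i), j))
        available[j] = False
    return pairs

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5, 0.0])          # hypothetical variable-importance weights
X_treat, X_ctrl = rng.normal(size=(20, 3)), rng.normal(size=(40, 3))
print(greedy_one_to_one_match(beta, X_treat, X_ctrl)[:5])
```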
We applied the proposed algorithm to a real-world study to investigate the effect of in-person schooling on community Covid-19 transmission rate for policy making purpose."}, "https://arxiv.org/abs/2403.12417": {"title": "On Predictive planning and counterfactual learning in active inference", "link": "https://arxiv.org/abs/2403.12417", "description": "arXiv:2403.12417v1 Announce Type: cross \nAbstract: Given the rapid advancement of artificial intelligence, understanding the foundations of intelligent behaviour is increasingly important. Active inference, regarded as a general theory of behaviour, offers a principled approach to probing the basis of sophistication in planning and decision-making. In this paper, we examine two decision-making schemes in active inference based on 'planning' and 'learning from experience'. Furthermore, we also introduce a mixed model that navigates the data-complexity trade-off between these strategies, leveraging the strengths of both to facilitate balanced decision-making. We evaluate our proposed model in a challenging grid-world scenario that requires adaptability from the agent. Additionally, our model provides the opportunity to analyze the evolution of various parameters, offering valuable insights and contributing to an explainable framework for intelligent decision-making."}, "https://arxiv.org/abs/2403.12491": {"title": "A consistent test of spherical symmetry for multivariate and high-dimensional data via data augmentation", "link": "https://arxiv.org/abs/2403.12491", "description": "arXiv:2403.12491v1 Announce Type: cross \nAbstract: We develop a test for spherical symmetry of a multivariate distribution $P$ that works even when the dimension of the data $d$ is larger than the sample size $n$. We propose a non-negative measure $\\zeta(P)$ such that $\\zeta(P)=0$ if and only if $P$ is spherically symmetric. We construct a consistent estimator of $\\zeta(P)$ using the data augmentation method and investigate its large sample properties. The proposed test based on this estimator is calibrated using a novel resampling algorithm. Our test controls the Type-I error, and it is consistent against general alternatives. We also study its behaviour for a sequence of alternatives $(1-\\delta_n) F+\\delta_n G$, where $\\zeta(G)=0$ but $\\zeta(F)>0$, and $\\delta_n \\in [0,1]$. When $\\lim\\sup\\delta_n<1$, for any $G$, the power of our test converges to unity as $n$ increases. However, if $\\lim\\sup\\delta_n=1$, the asymptotic power of our test depends on $\\lim n(1-\\delta_n)^2$. We establish this by proving the minimax rate optimality of our test over a suitable class of alternatives and showing that it is Pitman efficient when $\\lim n(1-\\delta_n)^2>0$. Moreover, our test is provably consistent for high-dimensional data even when $d$ is larger than $n$. Our numerical results amply demonstrate the superiority of the proposed test over some state-of-the-art methods."}, "https://arxiv.org/abs/2403.12858": {"title": "Settlement Mapping for Population Density Modelling in Disease Risk Spatial Analysis", "link": "https://arxiv.org/abs/2403.12858", "description": "arXiv:2403.12858v1 Announce Type: cross \nAbstract: In disease risk spatial analysis, many researchers especially in Indonesia are still modelling population density as the ratio of total population to administrative area extent. This model oversimplifies the problem, because it covers large uninhabited areas, while the model should focus on inhabited areas. 
This study uses settlement mapping against satellite imagery to focus the model and calculate settlement area extent. To the best of our knowledge, no previous study has compared the use of settlement mapping with administrative area to model population density in computing its correlation to a disease case rate. This study investigates the comparison of both models using data on Tuberculosis (TB) case rate in Central and East Java, Indonesia. Our study shows that using administrative-area density the Spearman's $\\rho$ was considered \"Fair\" (0.566, p<0.01), while using settlement density it was \"Moderately Strong\" (0.673, p<0.01). The difference is significant according to Hotelling's t test. Based on this result, we encourage researchers to use settlement mapping to improve population density modelling in disease risk spatial analysis. Resources used by and resulting from this work are publicly available at https://github.com/mirzaalimm/PopulationDensityVsDisease."}, "https://arxiv.org/abs/2202.07277": {"title": "Exploiting deterministic algorithms to perform global sensitivity analysis of continuous-time Markov chain compartmental models with application to epidemiology", "link": "https://arxiv.org/abs/2202.07277", "description": "arXiv:2202.07277v3 Announce Type: replace \nAbstract: In this paper, we propose a generic approach to perform global sensitivity analysis (GSA) for compartmental models based on continuous-time Markov chains (CTMC). This approach enables a complete GSA for epidemic models, in which not only the effects of uncertain parameters such as epidemic parameters (transmission rate, mean sojourn duration in compartments) are quantified, but also those of intrinsic randomness and interactions between the two. The main step in our approach is to build a deterministic representation of the underlying continuous-time Markov chain by controlling the latent variables modeling intrinsic randomness. Then, model output can be written as a deterministic function of both uncertain parameters and controlled latent variables, so that it becomes possible to compute standard variance-based sensitivity indices, e.g. the so-called Sobol' indices. However, different simulation algorithms lead to different representations. We exhibit in this work three different representations for CTMC stochastic compartmental models and discuss the results obtained by implementing and comparing GSAs based on each of these representations on a SARS-CoV-2 epidemic model."}, "https://arxiv.org/abs/2205.13734": {"title": "An efficient tensor regression for high-dimensional data", "link": "https://arxiv.org/abs/2205.13734", "description": "arXiv:2205.13734v2 Announce Type: replace \nAbstract: Most currently used tensor regression models for high-dimensional data are based on Tucker decomposition, which has good properties but quickly loses its efficiency in compressing tensors as the order of the tensor increases, say beyond four or five. However, even for the simplest tensor autoregression for handling time series data, the coefficient tensor is already of order six. This paper revises a newly proposed tensor train (TT) decomposition and then applies it to tensor regression such that a nice statistical interpretation can be obtained. The new tensor regression can well match data with hierarchical structures, and it can even lead to a better interpretation for data with factorial structures, which are supposed to be better fitted by models with Tucker decomposition.
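Once the model output is written as a deterministic function of its inputs, as in the global sensitivity analysis entry above, first-order Sobol' indices can be estimated with a standard pick-freeze Monte Carlo scheme. The sketch below does this for a toy linear function with independent uniform inputs and is not tied to the CTMC representations discussed in that abstract.

```python
import numpy as np

def first_order_sobol(f, d, n=100_000, seed=0):
    """Pick-freeze estimate of the first-order Sobol' indices of
    f: [0,1]^d -> R with independent uniform inputs."""
    rng = np.random.default_rng(seed)
    A, B = rng.uniform(size=(n, d)), rng.uniform(size=(n, d))
    yA = f(A)
    var = yA.var()
    indices = []
    for i in range(d):
        ABi = B.copy()
        ABi[:, i] = A[:, i]                   # freeze coordinate i at its value in A
        yABi = f(ABi)
        indices.append(np.mean(yA * yABi) - yA.mean() * yABi.mean())   # Cov(Y, Y_i)
    return np.array(indices) / var

# toy model: Y = X1 + 2*X2 + 0*X3, with analytic indices 1/5, 4/5, 0
f = lambda X: X[:, 0] + 2.0 * X[:, 1] + 0.0 * X[:, 2]
print(first_order_sobol(f, d=3))
```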
More importantly, the new tensor regression can be easily applied to the case with higher order tensors since TT decomposition can compress the coefficient tensors much more efficiently. The methodology is also extended to tensor autoregression for time series data, and nonasymptotic properties are derived for the ordinary least squares estimations of both tensor regression and autoregression. A new algorithm is introduced to search for estimators, and its theoretical justification is also discussed. Theoretical and computational properties of the proposed methodology are verified by simulation studies, and the advantages over existing methods are illustrated by two real examples."}, "https://arxiv.org/abs/2207.08886": {"title": "Turning the information-sharing dial: efficient inference from different data sources", "link": "https://arxiv.org/abs/2207.08886", "description": "arXiv:2207.08886v3 Announce Type: replace \nAbstract: A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of if data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes."}, "https://arxiv.org/abs/2212.08282": {"title": "Early-Phase Local-Area Model for Pandemics Using Limited Data: A SARS-CoV-2 Application", "link": "https://arxiv.org/abs/2212.08282", "description": "arXiv:2212.08282v2 Announce Type: replace \nAbstract: The emergence of novel infectious agents presents challenges to statistical models of disease transmission. These challenges arise from limited, poor-quality data and an incomplete understanding of the agent. Moreover, outbreaks manifest differently across regions due to various factors, making it imperative for models to factor in regional specifics. In this work, we offer a model that effectively utilizes constrained data resources to estimate disease transmission rates at the local level, especially during the early outbreak phase when primarily infection counts and aggregated local characteristics are accessible. This model merges a pathogen transmission methodology based on daily infection numbers with regression techniques, drawing correlations between disease transmission and local-area factors, such as demographics, health policies, behavior, and even climate, to estimate and forecast daily infections. We incorporate the quasi-score method and an error term to navigate potential data concerns and mistaken assumptions. 
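The information-sharing dial described above can be illustrated with a precision-weighted combination of two unbiased estimates of a common scalar parameter, where a dial omega in [0, 1] scales how much weight the secondary source receives. This is only a schematic analogue of the proposal (it assumes homogeneous sources), and the estimates and Fisher-information values are hypothetical.

```python
def dial_combine(theta1, info1, theta2, info2, omega):
    """Precision-weighted combination of two unbiased estimates of the same
    scalar parameter; omega in [0, 1] is the information-sharing dial:
    omega = 0 ignores source 2, omega = 1 pools it fully."""
    w1, w2 = info1, omega * info2
    estimate = (w1 * theta1 + w2 * theta2) / (w1 + w2)
    variance = (info1 + omega ** 2 * info2) / (w1 + w2) ** 2   # Var of the weighted average
    return estimate, variance

# a well-measured primary source and a larger but possibly less relevant
# secondary source, with the dial set halfway
print(dial_combine(theta1=1.10, info1=50.0, theta2=0.90, info2=200.0, omega=0.5))
```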
Additionally, we introduce an online estimator that facilitates real-time data updates, complemented by an iterative algorithm for parameter estimation. This approach facilitates real-time analysis of disease transmission when data quality is suboptimal and knowledge of the infectious pathogen is limited. It is particularly useful in the early stages of outbreaks, providing support for local decision-making."}, "https://arxiv.org/abs/2306.07047": {"title": "Foundations of Causal Discovery on Groups of Variables", "link": "https://arxiv.org/abs/2306.07047", "description": "arXiv:2306.07047v3 Announce Type: replace \nAbstract: Discovering causal relationships from observational data is a challenging task that relies on assumptions connecting statistical quantities to graphical or algebraic causal models. In this work, we focus on widely employed assumptions for causal discovery when objects of interest are (multivariate) groups of random variables rather than individual (univariate) random variables, as is the case in a variety of problems in scientific domains such as climate science or neuroscience. If the group-level causal models are derived from partitioning a micro-level model into groups, we explore the relationship between micro and group-level causal discovery assumptions. We investigate the conditions under which assumptions like Causal Faithfulness hold or fail to hold. Our analysis encompasses graphical causal models that contain cycles and bidirected edges. We also discuss grouped time series causal graphs and variants thereof as special cases of our general theoretical framework. Thereby, we aim to provide researchers with a solid theoretical foundation for the development and application of causal discovery methods for variable groups."}, "https://arxiv.org/abs/2306.16821": {"title": "Efficient subsampling for exponential family models", "link": "https://arxiv.org/abs/2306.16821", "description": "arXiv:2306.16821v2 Announce Type: replace \nAbstract: We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we use a density-based clustering algorithm to identify an approximating design space for the predictors from an initial subsample. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to define the remaining subsample, such that its points are \"closest\" to the support points of the optimal design. Our approach reflects the specific nature of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points and applies to a large class of regression models including models where the Fisher information is of rank larger than $1$."}, "https://arxiv.org/abs/2308.08346": {"title": "RMST-based multiple contrast tests in general factorial designs", "link": "https://arxiv.org/abs/2308.08346", "description": "arXiv:2308.08346v2 Announce Type: replace \nAbstract: Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example for this is the restricted mean survival time (RMST). 
It is defined as the area under the survival curve up to a prespecified time point and, thus, summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found the inflation of the type I error of the asymptotic test for small samples and, therefore, a two-sample permutation test has already been developed. The first goal of the present paper is to further extend the permutation test for general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result. However, global tests do not provide this information. Therefore, multiple tests for the RMST are developed in a second step to infer several null hypotheses simultaneously. Hereby, the asymptotically exact dependence structure between the local test statistics is incorporated to gain more power. Finally, the small sample performance of the proposed global and multiple testing procedures is analyzed in simulations and illustrated in a real data example."}, "https://arxiv.org/abs/2310.18108": {"title": "Transductive conformal inference with adaptive scores", "link": "https://arxiv.org/abs/2310.18108", "description": "arXiv:2310.18108v2 Announce Type: replace \nAbstract: Conformal inference is a fundamental and versatile tool that provides distribution-free guarantees for many machine learning tasks. We consider the transductive setting, where decisions are made on a test sample of $m$ new points, giving rise to $m$ conformal $p$-values. While classical results only concern their marginal distribution, we show that their joint distribution follows a P\\'olya urn model, and establish a concentration inequality for their empirical distribution function. The results hold for arbitrary exchangeable scores, including adaptive ones that can use the covariates of the test+calibration samples at training stage for increased accuracy. We demonstrate the usefulness of these theoretical results through uniform, in-probability guarantees for two machine learning tasks of current interest: interval prediction for transductive transfer learning and novelty detection based on two-class classification."}, "https://arxiv.org/abs/2206.08235": {"title": "Identifying the Most Appropriate Order for Categorical Responses", "link": "https://arxiv.org/abs/2206.08235", "description": "arXiv:2206.08235v3 Announce Type: replace-cross \nAbstract: Categorical responses arise naturally within various scientific disciplines. In many circumstances, there is no predetermined order for the response categories, and the response has to be modeled as nominal. In this study, we regard the order of response categories as part of the statistical model, and show that the true order, when it exists, can be selected using likelihood-based model selection criteria. For predictive purposes, a statistical model with a chosen order may outperform models based on nominal responses, even if a true order does not exist. For multinomial logistic models, widely used for categorical responses, we show the existence of theoretically equivalent orders that cannot be differentiated based on likelihood criteria, and determine the connections between their maximum likelihood estimators. 
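Since the restricted mean survival time discussed above is simply the area under the survival curve up to a prespecified horizon, it can be computed directly from a Kaplan-Meier fit. The short sketch below builds the Kaplan-Meier step function by hand and integrates it up to tau; the simulated data and horizon are illustrative.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return event times and the Kaplan-Meier survival estimate S(t)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    uniq = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in uniq])
    deaths = np.array([((time == t) & (event == 1)).sum() for t in uniq])
    return uniq, np.cumprod(1.0 - deaths / at_risk)

def rmst(time, event, tau):
    """Restricted mean survival time: area under S(t) on [0, tau]."""
    t, s = kaplan_meier(time, event)
    grid = np.concatenate([[0.0], t[t < tau], [tau]])      # step boundaries
    steps = np.concatenate([[1.0], s[t < tau]])            # S value on each step
    return float(np.sum(steps * np.diff(grid)))

rng = np.random.default_rng(0)
t_true = rng.exponential(scale=2.0, size=200)
c = rng.exponential(scale=4.0, size=200)                   # independent censoring times
time, event = np.minimum(t_true, c), (t_true <= c).astype(int)
print(rmst(time, event, tau=3.0))
```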
We use simulation studies and a real-data analysis to confirm the need and benefits of choosing the most appropriate order for categorical responses."}, "https://arxiv.org/abs/2302.03788": {"title": "Toward a Theory of Causation for Interpreting Neural Code Models", "link": "https://arxiv.org/abs/2302.03788", "description": "arXiv:2302.03788v2 Announce Type: replace-cross \nAbstract: Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post hoc interpretability method specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and ten NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax. All our NCMs, except for the BERT-like model, statistically learn to predict tokens related to blocks of code (\\eg brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful method to detect and facilitate the elimination of confounding bias in NCMs."}, "https://arxiv.org/abs/2403.13069": {"title": "kDGLM: a R package for Bayesian analysis of Generalized Dynamic Linear Models", "link": "https://arxiv.org/abs/2403.13069", "description": "arXiv:2403.13069v1 Announce Type: new \nAbstract: This paper introduces kDGLM, an R package designed for Bayesian analysis of Generalized Dynamic Linear Models (GDLM), with a primary focus on both uni- and multivariate exponential families. Emphasizing sequential inference for time series data, the kDGLM package provides comprehensive support for fitting, smoothing, monitoring, and feed-forward interventions. The methodology employed by kDGLM, as proposed in Alves et al. (2024), seamlessly integrates with well-established techniques from the literature, particularly those used in (Gaussian) Dynamic Models. These include discount strategies, autoregressive components, transfer functions, and more. Leveraging key properties of the Kalman filter and smoothing, kDGLM exhibits remarkable computational efficiency, enabling virtually instantaneous fitting times that scale linearly with the length of the time series. This characteristic makes it an exceptionally powerful tool for the analysis of extended time series. For example, when modeling monthly hospital admissions in Brazil due to gastroenteritis from 2010 to 2022, the fitting process took a mere 0.11s. 
Even in a spatial-time variant of the model (27 outcomes, 110 latent states, and 156 months, yielding 17,160 parameters), the fitting time was only 4.24s. Currently, the kDGLM package supports a range of distributions, including univariate Normal (unknown mean and observational variance), bivariate Normal (unknown means, observational variances, and correlation), Poisson, Gamma (known shape and unknown mean), and Multinomial (known number of trials and unknown event probabilities). Additionally, kDGLM allows the joint modeling of multiple time series, provided each series follows one of the supported distributions. Ongoing efforts aim to continuously expand the supported distributions."}, "https://arxiv.org/abs/2403.13076": {"title": "Spatial Autoregressive Model on a Dirichlet Distribution", "link": "https://arxiv.org/abs/2403.13076", "description": "arXiv:2403.13076v1 Announce Type: new \nAbstract: Compositional data find broad application across diverse fields due to their efficacy in representing proportions or percentages of various components within a whole. Spatial dependencies often exist in compositional data, particularly when the data represents different land uses or ecological variables. Ignoring the spatial autocorrelations in modelling of compositional data may lead to incorrect estimates of parameters. Hence, it is essential to incorporate spatial information into the statistical analysis of compositional data to obtain accurate and reliable results. However, traditional statistical methods are not directly applicable to compositional data due to the correlation between its observations, which are constrained to lie on a simplex. To address this challenge, the Dirichlet distribution is commonly employed, as its support aligns with the nature of compositional vectors. Specifically, the R package DirichletReg provides a regression model, termed Dirichlet regression, tailored for compositional data. However, this model fails to account for spatial dependencies, thereby restricting its utility in spatial contexts. In this study, we introduce a novel spatial autoregressive Dirichlet regression model for compositional data, adeptly integrating spatial dependencies among observations. We construct a maximum likelihood estimator for a Dirichlet density function augmented with a spatial lag term. We compare this spatial autoregressive model with the same model without spatial lag, where we test both models on synthetic data as well as two real datasets, using different metrics. By considering the spatial relationships among observations, our model provides more accurate and reliable results for the analysis of compositional data. The model is further evaluated against a spatial multinomial regression model for compositional data, and their relative effectiveness is discussed."}, "https://arxiv.org/abs/2403.13118": {"title": "Modal Analysis of Spatiotemporal Data via Multivariate Gaussian Process Regression", "link": "https://arxiv.org/abs/2403.13118", "description": "arXiv:2403.13118v1 Announce Type: new \nAbstract: Modal analysis has become an essential tool to understand the coherent structure of complex flows. The classical modal analysis methods, such as dynamic mode decomposition (DMD) and spectral proper orthogonal decomposition (SPOD), rely on a sufficient amount of data that is regularly sampled in time. However, often one needs to deal with sparse temporally irregular data, e.g., due to experimental measurements and simulation algorithm. 
To overcome the limitations of data scarcity and irregular sampling, we propose a novel modal analysis technique using multi-variate Gaussian process regression (MVGPR). We first establish the connection between MVGPR and the existing modal analysis techniques, DMD and SPOD, from a linear system identification perspective. Next, leveraging this connection, we develop an MVGPR-based modal analysis technique that addresses the aforementioned limitations. The capability of MVGPR stems from its judiciously designed kernel structure for the correlation function, which is derived from the assumed linear dynamics. Subsequently, the proposed MVGPR method is benchmarked against DMD and SPOD on a range of examples, from academic and synthesized data to unsteady airfoil aerodynamics. The results demonstrate MVGPR as a promising alternative to classical modal analysis methods, especially in the scenario of scarce and temporally irregular data."}, "https://arxiv.org/abs/2403.13197": {"title": "Robust inference of cooperative behaviour of multiple ion channels in voltage-clamp recordings", "link": "https://arxiv.org/abs/2403.13197", "description": "arXiv:2403.13197v1 Announce Type: new \nAbstract: Recent experimental studies have shed light on the intriguing possibility that ion channels exhibit cooperative behaviour. However, a comprehensive understanding of such cooperativity remains elusive, primarily due to limitations in measuring separately the response of each channel. Rather, only the superimposed channel response can be observed, challenging existing data analysis methods. To address this gap, we propose IDC (Idealisation, Discretisation, and Cooperativity inference), a robust statistical data analysis methodology that requires only voltage-clamp current recordings of an ensemble of ion channels. The framework of IDC enables us to integrate recent advancements in idealisation techniques and coupled Markov models. Further, in the cooperativity inference phase of IDC, we introduce a minimum distance estimator and establish its statistical guarantee in the form of asymptotic consistency. We demonstrate the effectiveness and robustness of IDC through extensive simulation studies. As an application, we investigate gramicidin D channels. Our findings reveal that these channels act independently, even at varying applied voltages during voltage-clamp experiments. An implementation of IDC is available from GitLab."}, "https://arxiv.org/abs/2403.13256": {"title": "Bayesian Nonparametric Trees for Principal Causal Effects", "link": "https://arxiv.org/abs/2403.13256", "description": "arXiv:2403.13256v1 Announce Type: new \nAbstract: Principal stratification analysis evaluates how causal effects of a treatment on a primary outcome vary across strata of units defined by their treatment effect on some intermediate quantity. This endeavor is substantially challenged when the intermediate variable is continuously scaled and there are infinitely many basic principal strata. We employ a Bayesian nonparametric approach to flexibly evaluate treatment effects across flexibly-modeled principal strata. The approach uses Bayesian Causal Forests (BCF) to simultaneously specify two Bayesian Additive Regression Tree models; one for the principal stratum membership and one for the outcome, conditional on principal strata.
We show how the capability of BCF for capturing treatment effect heterogeneity is particularly relevant for assessing how treatment effects vary across the surface defined by continuously-scaled principal strata, in addition to other benefits relating to targeted selection and regularization-induced confounding. The capabilities of the proposed approach are illustrated with a simulation study, and the methodology is deployed to investigate how causal effects of power plant emissions control technologies on ambient particulate pollution vary as a function of the technologies' impact on sulfur dioxide emissions."}, "https://arxiv.org/abs/2403.13260": {"title": "A Bayesian Approach for Selecting Relevant External Data (BASE): Application to a study of Long-Term Outcomes in a Hemophilia Gene Therapy Trial", "link": "https://arxiv.org/abs/2403.13260", "description": "arXiv:2403.13260v1 Announce Type: new \nAbstract: Gene therapies aim to address the root causes of diseases, particularly those stemming from rare genetic defects that can be life-threatening or severely debilitating. While there has been notable progress in the development of gene therapies in recent years, understanding their long-term effectiveness remains challenging due to a lack of data on long-term outcomes, especially during the early stages of their introduction to the market. To address the critical question of estimating long-term efficacy without waiting for the completion of lengthy clinical trials, we propose a novel Bayesian framework. This framework selects pertinent data from external sources, often early-phase clinical trials with more comprehensive longitudinal efficacy data that could lead to improved inference on the long-term efficacy outcome. We apply this methodology to predict the long-term factor IX (FIX) levels of HEMGENIX (etranacogene dezaparvovec), the first FDA-approved gene therapy to treat adults with severe Hemophilia B, in a phase 3 study. Our application showcases the capability of the framework to estimate the 5-year FIX levels following HEMGENIX therapy, demonstrating sustained FIX levels induced by HEMGENIX infusion. Additionally, we provide theoretical insights into the methodology by establishing its posterior convergence properties."}, "https://arxiv.org/abs/2403.13340": {"title": "Forecasting density-valued functional panel data", "link": "https://arxiv.org/abs/2403.13340", "description": "arXiv:2403.13340v1 Announce Type: new \nAbstract: We introduce a statistical method for modeling and forecasting functional panel data, where each element is a density. Density functions are nonnegative and have a constrained integral and thus do not constitute a linear vector space. We implement a center log-ratio transformation to transform densities into unconstrained functions. These functions exhibit cross-sectional correlation and temporal dependence. Via a functional analysis of variance decomposition, we decompose the unconstrained functional panel data into a deterministic trend component and a time-varying residual component. To produce forecasts for the time-varying component, a functional time series forecasting method, based on the estimation of the long-range covariance, is implemented. By combining the forecasts of the time-varying residual component with the deterministic trend component, we obtain h-step-ahead forecast curves for multiple populations.
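The center log-ratio step used above is easy to make concrete: it maps a discretised density to an unconstrained vector, forecasting can then be done in the unconstrained space, and the inverse transform returns a valid density. The sketch below shows the transform pair with a naive placeholder forecast standing in for the functional time series model; the simulated panel is illustrative.

```python
import numpy as np

def clr(density, eps=1e-12):
    """Center log-ratio transform of a discretised density (rows sum to 1)."""
    logd = np.log(density + eps)
    return logd - logd.mean(axis=-1, keepdims=True)

def inv_clr(z):
    """Inverse transform: exponentiate and renormalise back onto the simplex."""
    expz = np.exp(z)
    return expz / expz.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
raw = rng.gamma(shape=5.0, scale=1.0, size=(30, 101))        # 30 years x 101 ages (toy panel)
densities = raw / raw.sum(axis=1, keepdims=True)

z = clr(densities)
z_forecast = z[-1]                                           # placeholder 1-step forecast
print(inv_clr(z_forecast).sum())                             # back on the simplex: sums to 1
```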
Illustrated by age- and sex-specific life-table death counts in the United States, we apply our proposed method to generate forecasts of the life-table death counts for 51 states."}, "https://arxiv.org/abs/2403.13398": {"title": "A unified framework for bounding causal effects on the always-survivor and other populations", "link": "https://arxiv.org/abs/2403.13398", "description": "arXiv:2403.13398v1 Announce Type: new \nAbstract: We investigate the bounding problem of causal effects in experimental studies in which the outcome is truncated by death, meaning that the subject dies before the outcome can be measured. Causal effects cannot be point identified without instruments and/or tight parametric assumptions but can be bounded under mild restrictions. Previous work on partial identification under the principal stratification framework has primarily focused on the `always-survivor' subpopulation. In this paper, we present a novel nonparametric unified framework to provide sharp bounds on causal effects on discrete and continuous square-integrable outcomes. These bounds are derived on the `always-survivor', `protected', and `harmed' subpopulations and on the entire population with/without assumptions of monotonicity and stochastic dominance. The main idea depends on rewriting the optimization problem in terms of the integrated tail probability expectation formula using a set of conditional probability distributions. The proposed procedure allows for settings with any type and number of covariates, and can be extended to incorporate average causal effects and complier average causal effects. Furthermore, we present several simulation studies conducted under various assumptions as well as the application of the proposed approach to a real dataset from the National Supported Work Demonstration."}, "https://arxiv.org/abs/2403.13544": {"title": "A class of bootstrap based residuals for compositional data", "link": "https://arxiv.org/abs/2403.13544", "description": "arXiv:2403.13544v1 Announce Type: new \nAbstract: Regression models for compositional data are common in several areas of knowledge. As in other classes of regression models, it is desirable to perform diagnostic analysis in these models using residuals that are approximately standard normally distributed. However, for regression models for compositional data, there has not been any multivariate residual that meets this requirement. In this work, we introduce a class of asymptotically standard normally distributed residuals for compositional data based on bootstrap. Monte Carlo simulation studies indicate that the distributions of the residuals of this class are well approximated by the standard normal distribution in small samples. An application to simulated data also suggests that one of the residuals of the proposed class is better to identify model misspecification than its competitors. Finally, the usefulness of the best residual of the proposed class is illustrated through an application on sleep stages. 
The class of residuals proposed here can also be used in other classes of multivariate regression models."}, "https://arxiv.org/abs/2403.13628": {"title": "Scalable Scalar-on-Image Cortical Surface Regression with a Relaxed-Thresholded Gaussian Process Prior", "link": "https://arxiv.org/abs/2403.13628", "description": "arXiv:2403.13628v1 Announce Type: new \nAbstract: In addressing the challenge of analysing the large-scale Adolescent Brain Cognition Development (ABCD) fMRI dataset, involving over 5,000 subjects and extensive neuroimaging data, we propose a scalable Bayesian scalar-on-image regression model for computational feasibility and efficiency. Our model employs a relaxed-thresholded Gaussian process (RTGP), integrating piecewise-smooth, sparse, and continuous functions capable of both hard- and soft-thresholding. This approach introduces additional flexibility in feature selection in scalar-on-image regression and leads to scalable posterior computation by adopting a variational approximation and utilising the Karhunen-Lo\\`eve expansion for Gaussian processes. This advancement substantially reduces the computational costs in vertex-wise analysis of cortical surface data in large-scale Bayesian spatial models. The model's parameter estimation and prediction accuracy and feature selection performance are validated through extensive simulation studies and an application to the ABCD study. Here, we perform regression analysis correlating intelligence scores with task-based functional MRI data, taking into account confounding factors including age, sex, and parental education level. This validation highlights our model's capability to handle large-scale neuroimaging data while maintaining computational feasibility and accuracy."}, "https://arxiv.org/abs/2403.13725": {"title": "Robust Inference in Locally Misspecified Bipartite Networks", "link": "https://arxiv.org/abs/2403.13725", "description": "arXiv:2403.13725v1 Announce Type: new \nAbstract: This paper introduces a methodology to conduct robust inference in bipartite networks under local misspecification. We focus on a class of dyadic network models with misspecified conditional moment restrictions. The framework of misspecification is local, as the effect of misspecification varies with the sample size. We utilize this local asymptotic approach to construct a robust estimator that is minimax optimal for the mean square error within a neighborhood of misspecification. Additionally, we introduce bias-aware confidence intervals that account for the effect of the local misspecification. These confidence intervals have the correct asymptotic coverage for the true parameter of interest under sparse network asymptotics. Monte Carlo experiments demonstrate that the robust estimator performs well in finite samples and sparse networks. As an empirical illustration, we study the formation of a scientific collaboration network among economists."}, "https://arxiv.org/abs/2403.13738": {"title": "Policy Relevant Treatment Effects with Multidimensional Unobserved Heterogeneity", "link": "https://arxiv.org/abs/2403.13738", "description": "arXiv:2403.13738v1 Announce Type: new \nAbstract: This paper provides a framework for the policy relevant treatment effects using instrumental variables. In the framework, a treatment selection may or may not satisfy the classical monotonicity condition and can accommodate multidimensional unobserved heterogeneity. We can bound the target parameter by extracting information from identifiable estimands. 
We also provide a more conservative yet computationally simpler bound by applying a convex relaxation method. Linear shape restrictions can be easily incorporated to further improve the bounds. Numerical and simulation results illustrate the informativeness of our convex-relaxation bounds, i.e., that our bounds are sufficiently tight."}, "https://arxiv.org/abs/2403.13750": {"title": "Data integration of non-probability and probability samples with predictive mean matching", "link": "https://arxiv.org/abs/2403.13750", "description": "arXiv:2403.13750v1 Announce Type: new \nAbstract: In this paper we study predictive mean matching mass imputation estimators to integrate data from probability and non-probability samples. We consider two approaches: matching predicted to observed ($\\hat{y}-y$ matching) or predicted to predicted ($\\hat{y}-\\hat{y}$ matching) values. We prove the consistency of two semi-parametric mass imputation estimators based on these approaches and derive their variance and estimators of variance. Our approach can be employed with non-parametric regression techniques, such as kernel regression, and the analytical expression for variance can also be applied in nearest neighbour matching for non-probability samples. We conduct extensive simulation studies in order to compare the properties of this estimator with existing approaches, discuss the selection of $k$-nearest neighbours, and study the effects of model mis-specification. The paper finishes with an empirical study on the integration of a job vacancy survey with vacancies submitted to public employment offices (administrative and online data). Open source software is available for the proposed approaches."}, "https://arxiv.org/abs/2403.13153": {"title": "Tensor Time Series Imputation through Tensor Factor Modelling", "link": "https://arxiv.org/abs/2403.13153", "description": "arXiv:2403.13153v1 Announce Type: cross \nAbstract: We propose tensor time series imputation when the missing pattern in the tensor data can be general, as long as any two data positions along a tensor fibre are both observed for enough time points. The method is based on a tensor time series factor model with Tucker decomposition of the common component. One distinguishing feature of the tensor time series factor model used is that there can be weak factors in the factor loadings matrix for each mode. This reflects reality better, since real data can have weak factors which drive only groups of observed variables, for instance, a sector factor in a financial market driving only the stocks in a particular sector. Using the data with missing entries, asymptotic normality is derived for rows of estimated factor loadings, while consistent covariance matrix estimation enables us to carry out inferences. As a first in the literature, we also propose a ratio-based estimator for the rank of the core tensor under general missing patterns. Rates of convergence are spelt out for the imputations from the estimated tensor factor models. We introduce a new measure for gauging imputation performances, and simulation results show that our imputation procedure works well, with asymptotic normality and corresponding inferences also demonstrated. Re-imputation performance is also gauged, and we demonstrate that using a slightly larger rank than the estimated one gives superior re-imputation performance.
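The $\hat{y}-\hat{y}$ matching idea from the data integration entry above can be sketched in a few lines: fit an outcome model on the non-probability sample (where the outcome is observed), predict for both samples, and impute each probability-sample unit with the observed outcome of its nearest donor in predicted-value space. The linear working model, single-donor matching, and simulated data below are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def pmm_mass_imputation(X_nonprob, y_nonprob, X_prob):
    """y-hat to y-hat predictive mean matching mass imputation sketch."""
    Z = np.column_stack([np.ones(len(X_nonprob)), X_nonprob])
    beta, *_ = np.linalg.lstsq(Z, y_nonprob, rcond=None)       # working linear model
    yhat_nonprob = Z @ beta
    yhat_prob = np.column_stack([np.ones(len(X_prob)), X_prob]) @ beta
    donors = np.abs(yhat_prob[:, None] - yhat_nonprob[None, :]).argmin(axis=1)
    return y_nonprob[donors]                                    # imputed outcomes

rng = np.random.default_rng(0)
X_nonprob = rng.normal(size=(500, 3))
y_nonprob = X_nonprob @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)
X_prob = rng.normal(size=(200, 3))
print(pmm_mass_imputation(X_nonprob, y_nonprob, X_prob)[:5])
```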
An NYC taxi traffic data set is also analyzed by imposing general missing patterns and gauging the imputation performances."}, "https://arxiv.org/abs/2403.13203": {"title": "Quadratic Point Estimate Method for Probabilistic Moments Computation", "link": "https://arxiv.org/abs/2403.13203", "description": "arXiv:2403.13203v1 Announce Type: cross \nAbstract: This paper presents in detail the originally developed Quadratic Point Estimate Method (QPEM), aimed at efficiently and accurately computing the first four output moments of probabilistic distributions, using 2n^2+1 sample (or sigma) points, with n, the number of input random variables. The proposed QPEM particularly offers an effective, superior, and practical alternative to existing sampling and quadrature methods for low- and moderately-high-dimensional problems. Detailed theoretical derivations are provided proving that the proposed method can achieve a fifth or higher-order accuracy for symmetric input distributions. Various numerical examples, from simple polynomial functions to nonlinear finite element analyses with random field representations, support the theoretical findings and further showcase the validity, efficiency, and applicability of the QPEM, from low- to high-dimensional problems."}, "https://arxiv.org/abs/2403.13361": {"title": "Multifractal wavelet dynamic mode decomposition modeling for marketing time series", "link": "https://arxiv.org/abs/2403.13361", "description": "arXiv:2403.13361v1 Announce Type: cross \nAbstract: Marketing is the way we ensure our sales are the best in the market, our prices the most accessible, and our clients satisfied, thus ensuring our brand has the widest distribution. This requires sophisticated and advanced understanding of the whole related network. Indeed, marketing data may exist in different forms such as qualitative and quantitative data. However, in the literature, it is easily noted that large bibliographies may be collected about qualitative studies, while only a few studies adopt a quantitative point of view. This is a major drawback that results in marketing science still focusing on design, although the market is strongly dependent on quantities such as money and time. Indeed, marketing data may form time series such as brand sales in specified periods, brand-related prices over specified periods, market shares, etc. The purpose of the present work is to investigate some marketing models based on time series for various brands. This paper aims to combine the dynamic mode decomposition and wavelet decomposition to study marketing series due to both prices, and volume sales in order to explore the effect of the time scale on the persistence of brand sales in the market and on the forecasting of such persistence, according to the characteristics of the brand and the related market competition or competitors. Our study is based on a sample of Saudi brands during the period 22 November 2017 to 30 December 2021."}, "https://arxiv.org/abs/2403.13489": {"title": "Antithetic Multilevel Methods for Elliptic and Hypo-Elliptic Diffusions with Applications", "link": "https://arxiv.org/abs/2403.13489", "description": "arXiv:2403.13489v1 Announce Type: cross \nAbstract: In this paper, we present a new antithetic multilevel Monte Carlo (MLMC) method for the estimation of expectations with respect to laws of diffusion processes that can be elliptic or hypo-elliptic. 
In particular, we consider the case where one has to resort to time discretization of the diffusion and numerical simulation of such schemes. Motivated by recent developments, we introduce a new MLMC estimator of expectations, which does not require simulation of intractable L\\'evy areas but has a weak error of order 2 and achieves the optimal computational complexity. We then show how this approach can be used in the context of the filtering problem associated to partially observed diffusions with discrete time observations. We illustrate with numerical simulations that our new approaches provide efficiency gains for several problems relative to some existing methods."}, "https://arxiv.org/abs/2403.13565": {"title": "AdaTrans: Feature-wise and Sample-wise Adaptive Transfer Learning for High-dimensional Regression", "link": "https://arxiv.org/abs/2403.13565", "description": "arXiv:2403.13565v1 Announce Type: cross \nAbstract: We consider the transfer learning problem in the high dimensional setting, where the feature dimension is larger than the sample size. To learn transferable information, which may vary across features or the source samples, we propose an adaptive transfer learning method that can detect and aggregate the feature-wise (F-AdaTrans) or sample-wise (S-AdaTrans) transferable structures. We achieve this by employing a novel fused-penalty, coupled with weights that can adapt according to the transferable structure. To choose the weight, we propose a theoretically informed, data-driven procedure, enabling F-AdaTrans to selectively fuse the transferable signals with the target while filtering out non-transferable signals, and S-AdaTrans to obtain the optimal combination of information transferred from each source sample. The non-asymptotic rates are established, which recover existing near-minimax optimal rates in special cases. The effectiveness of the proposed method is validated using both synthetic and real data."}, "https://arxiv.org/abs/2109.10330": {"title": "A Bayesian hierarchical model for disease mapping that accounts for scaling and heavy-tailed latent effects", "link": "https://arxiv.org/abs/2109.10330", "description": "arXiv:2109.10330v2 Announce Type: replace \nAbstract: In disease mapping, the relative risk of a disease is commonly estimated across different areas within a region of interest. The number of cases in an area is often assumed to follow a Poisson distribution whose mean is decomposed as the product between an offset and the logarithm of the disease's relative risk. The log risk may be written as the sum of fixed effects and latent random effects. The BYM2 model decomposes each latent effect into a weighted sum of independent and spatial effects. We build on the BYM2 model to allow for heavy-tailed latent effects and accommodate potentially outlying risks, after accounting for the fixed effects. We assume a scale mixture structure wherein the variance of the latent process changes across areas and allows for outlier identification. We propose two prior specifications for this scale mixture parameter. These are compared through simulation studies and in the analysis of Zika cases from the first (2015-2016) epidemic in Rio de Janeiro city, Brazil. The simulation studies show that, in terms of the model assessment criterion WAIC and outlier detection, the two proposed parametrisations perform better than the model proposed by Congdon (2017) to capture outliers. 
In particular, the proposed parametrisations are more efficient, in terms of outlier detection, than Congdon's when outliers are neighbours. Our analysis of Zika cases finds 19 out of 160 districts of Rio as potential outliers, after accounting for the socio-development index. Our proposed model may help prioritise interventions and identify potential issues in the recording of cases."}, "https://arxiv.org/abs/2210.13562": {"title": "Prediction intervals for economic fixed-event forecasts", "link": "https://arxiv.org/abs/2210.13562", "description": "arXiv:2210.13562v3 Announce Type: replace \nAbstract: The fixed-event forecasting setup is common in economic policy. It involves a sequence of forecasts of the same (`fixed') predictand, so that the difficulty of the forecasting problem decreases over time. Fixed-event point forecasts are typically published without a quantitative measure of uncertainty. To construct such a measure, we consider forecast postprocessing techniques tailored to the fixed-event case. We develop regression methods that impose constraints motivated by the problem at hand, and use these methods to construct prediction intervals for gross domestic product (GDP) growth in Germany and the US."}, "https://arxiv.org/abs/2211.07823": {"title": "Graph Neural Networks for Causal Inference Under Network Confounding", "link": "https://arxiv.org/abs/2211.07823", "description": "arXiv:2211.07823v3 Announce Type: replace \nAbstract: This paper studies causal inference with observational network data. A challenging aspect of this setting is the possibility of interference in both potential outcomes and selection into treatment, for example due to peer effects in either stage. We therefore consider a nonparametric setup in which both stages are reduced forms of simultaneous-equations models. This results in high-dimensional network confounding, where the network and covariates of all units constitute sources of selection bias. The literature predominantly assumes that confounding can be summarized by a known, low-dimensional function of these objects, and it is unclear what selection models justify common choices of functions. We show that graph neural networks (GNNs) are well suited to adjust for high-dimensional network confounding. We establish a network analog of approximate sparsity under primitive conditions on interference. This demonstrates that the model has low-dimensional structure that makes estimation feasible and justifies the use of shallow GNN architectures."}, "https://arxiv.org/abs/2301.12045": {"title": "Forward selection and post-selection inference in factorial designs", "link": "https://arxiv.org/abs/2301.12045", "description": "arXiv:2301.12045v2 Announce Type: replace \nAbstract: Ever since the seminal work of R. A. Fisher and F. Yates, factorial designs have been an important experimental tool to simultaneously estimate the effects of multiple treatment factors. In factorial designs, the number of treatment combinations grows exponentially with the number of treatment factors, which motivates the forward selection strategy based on the sparsity, hierarchy, and heredity principles for factorial effects. Although this strategy is intuitive and has been widely used in practice, its rigorous statistical theory has not been formally established. To fill this gap, we establish design-based theory for forward factor selection in factorial designs based on the potential outcome framework. 
We not only prove a consistency property for the factor selection procedure but also discuss statistical inference after factor selection. In particular, with selection consistency, we quantify the advantages of forward selection based on asymptotic efficiency gain in estimating factorial effects. With inconsistent selection in higher-order interactions, we propose two strategies and investigate their impact on subsequent inference. Our formulation differs from the existing literature on variable selection and post-selection inference because our theory is based solely on the physical randomization of the factorial design and does not rely on a correctly specified outcome model."}, "https://arxiv.org/abs/2305.13221": {"title": "Incorporating Subsampling into Bayesian Models for High-Dimensional Spatial Data", "link": "https://arxiv.org/abs/2305.13221", "description": "arXiv:2305.13221v3 Announce Type: replace \nAbstract: Additive spatial statistical models with weakly stationary process assumptions have become standard in spatial statistics. However, one disadvantage of such models is the computation time, which rapidly increases with the number of data points. The goal of this article is to apply an existing subsampling strategy to standard spatial additive models and to derive the spatial statistical properties. We call this strategy the ''spatial data subset model'' (SDSM) approach, which can be applied to big datasets in a computationally feasible way. Our approach has the advantage that one does not require any additional restrictive model assumptions. That is, computational gains increase as model assumptions are removed when using our model framework. This provides one solution to the computational bottlenecks that occur when applying methods such as Kriging to ''big data''. We provide several properties of this new spatial data subset model approach in terms of moments, sill, nugget, and range under several sampling designs. An advantage of our approach is that it subsamples without throwing away data, and can be implemented using datasets of any size that can be stored. We present the results of the spatial data subset model approach on simulated datasets, and on a large dataset consisting of 150,000 observations of daytime land surface temperatures measured by the MODIS instrument onboard the Terra satellite."}, "https://arxiv.org/abs/2309.03316": {"title": "Spatial data fusion adjusting for preferential sampling using INLA and SPDE", "link": "https://arxiv.org/abs/2309.03316", "description": "arXiv:2309.03316v2 Announce Type: replace \nAbstract: Spatially misaligned data can be fused by using a Bayesian melding model that assumes that underlying all observations there is a spatially continuous Gaussian random field process. This model can be used, for example, to predict air pollution levels by combining point data from monitoring stations and areal data from satellite imagery.\n However, if the data presents preferential sampling, that is, if the observed point locations are not independent of the underlying spatial process, the inference obtained from models that ignore such a dependence structure might not be valid.\n In this paper, we present a Bayesian spatial model for the fusion of point and areal data that takes into account preferential sampling. 
The model combines the Bayesian melding specification and a model for the stochastically dependent sampling and underlying spatial processes.\n Fast Bayesian inference is performed using the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approaches. The performance of the model is assessed using simulated data in a range of scenarios and sampling strategies that can appear in real settings. The model is also applied to predict air pollution in the USA."}, "https://arxiv.org/abs/2312.12823": {"title": "Detecting Multiple Change Points in Distributional Sequences Derived from Structural Health Monitoring Data: An Application to Bridge Damage Detection", "link": "https://arxiv.org/abs/2312.12823", "description": "arXiv:2312.12823v2 Announce Type: replace \nAbstract: Detecting damage in critical structures using monitored data is a fundamental task of structural health monitoring, which is extremely important for maintaining structures' safety and life-cycle management. Based on statistical pattern recognition paradigm, damage detection can be conducted by assessing changes in the distribution of properly extracted damage-sensitive features (DSFs). This can be naturally formulated as a distributional change-point detection problem. A good change-point detector for damage detection should be scalable to large DSF datasets, applicable to different types of changes, and capable of controlling for false-positive indications. This study proposes a new distributional change-point detection method for damage detection to address these challenges. We embed the elements of a DSF distributional sequence into the Wasserstein space and construct a moving sum (MOSUM) multiple change-point detector based on Fr\\'echet statistics and establish theoretical properties. Extensive simulation studies demonstrate the superiority of our proposed approach against other competitors to address the aforementioned practical requirements. We apply our method to the cable-tension measurements monitored from a long-span cable-stayed bridge for cable damage detection. We conduct a comprehensive change-point analysis for the extracted DSF data, and reveal interesting patterns from the detected changes, which provides valuable insights into cable system damage."}, "https://arxiv.org/abs/2401.10196": {"title": "Functional Gaussian Graphical Regression Models", "link": "https://arxiv.org/abs/2401.10196", "description": "arXiv:2401.10196v2 Announce Type: replace \nAbstract: Functional data describe a wide range of processes encountered in practice, such as growth curves and spectral absorption. Functional regression considers a version of regression, where both the response and covariates are functional data. Evaluating both the functional relatedness between the response and covariates and the relatedness of a multivariate response function can be challenging. In this paper, we propose a solution for both these issues, by means of a functional Gaussian graphical regression model. It extends the notion of conditional Gaussian graphical models to partially separable functions. For inference, we propose a double-penalized estimator. Additionally, we present a novel adaptation of Kullback-Leibler cross-validation tailored for graph estimators which accounts for precision and regression matrices when the population presents one or more sub-groups, named joint Kullback-Leibler cross-validation. 
Evaluation of model performance is done in terms of Kullback-Leibler divergence and graph recovery power. We illustrate the method on an air pollution dataset."}, "https://arxiv.org/abs/2206.13668": {"title": "Non-Independent Components Analysis", "link": "https://arxiv.org/abs/2206.13668", "description": "arXiv:2206.13668v4 Announce Type: replace-cross \nAbstract: A seminal result in the ICA literature states that for $AY = \\varepsilon$, if the components of $\\varepsilon$ are independent and at most one is Gaussian, then $A$ is identified up to sign and permutation of its rows (Comon, 1994). In this paper we study to what extent the independence assumption can be relaxed by replacing it with restrictions on higher order moment or cumulant tensors of $\\varepsilon$. We document new conditions that establish identification for several non-independent component models, e.g. common variance models, and propose efficient estimation methods based on the identification results. We show that in situations where independence cannot be assumed the efficiency gains can be significant relative to methods that rely on independence."}, "https://arxiv.org/abs/2311.00118": {"title": "Extracting the Multiscale Causal Backbone of Brain Dynamics", "link": "https://arxiv.org/abs/2311.00118", "description": "arXiv:2311.00118v2 Announce Type: replace-cross \nAbstract: The bulk of the research effort on brain connectivity revolves around statistical associations among brain regions, which do not directly relate to the causal mechanisms governing brain dynamics. Here we propose the multiscale causal backbone (MCB) of brain dynamics, shared by a set of individuals across multiple temporal scales, and devise a principled methodology to extract it.\n Our approach leverages recent advances in multiscale causal structure learning and optimizes the trade-off between the model fit and its complexity. Empirical assessment on synthetic data shows the superiority of our methodology over a baseline based on canonical functional connectivity networks. When applied to resting-state fMRI data, we find sparse MCBs for both the left and right brain hemispheres. Thanks to its multiscale nature, our approach shows that at low-frequency bands, causal dynamics are driven by brain regions associated with high-level cognitive functions; at higher frequencies instead, nodes related to sensory processing play a crucial role. Finally, our analysis of individual multiscale causal structures confirms the existence of a causal fingerprint of brain connectivity, thus supporting the existing extensive research in brain connectivity fingerprinting from a causal perspective."}, "https://arxiv.org/abs/2311.17885": {"title": "Are Ensembles Getting Better all the Time?", "link": "https://arxiv.org/abs/2311.17885", "description": "arXiv:2311.17885v2 Announce Type: replace-cross \nAbstract: Ensemble methods combine the predictions of several base models. We study whether or not including more models always improves their average performance. This question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform equally well, which is the case for several popular methods such as random forests or deep ensembles. In this setting, we show that ensembles are getting better all the time if, and only if, the considered loss function is convex. 
More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised as: ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a medical prediction problem (diagnosing melanomas using neural nets) and a \"wisdom of crowds\" experiment (guessing the ratings of upcoming movies)."}, "https://arxiv.org/abs/2403.13934": {"title": "Data integration methods for micro-randomized trials", "link": "https://arxiv.org/abs/2403.13934", "description": "arXiv:2403.13934v1 Announce Type: new \nAbstract: Existing statistical methods for the analysis of micro-randomized trials (MRTs) are designed to estimate causal excursion effects using data from a single MRT. In practice, however, researchers can often find previous MRTs that employ similar interventions. In this paper, we develop data integration methods that capitalize on this additional information, leading to statistical efficiency gains. To further increase efficiency, we demonstrate how to combine these approaches according to a generalization of multivariate precision weighting that allows for correlation between estimates, and we show that the resulting meta-estimator possesses an asymptotic optimality property. We illustrate our methods in simulation and in a case study involving two MRTs in the area of smoking cessation."}, "https://arxiv.org/abs/2403.14036": {"title": "Fused LASSO as Non-Crossing Quantile Regression", "link": "https://arxiv.org/abs/2403.14036", "description": "arXiv:2403.14036v1 Announce Type: new \nAbstract: Quantile crossing has been an ever-present thorn in the side of quantile regression. This has spurred research into obtaining densities and coefficients that obey the quantile monotonicity property. While these are important contributions, they do not provide insight into how exactly these constraints influence the estimated coefficients. This paper extends non-crossing constraints and shows that by varying a single hyperparameter ($\\alpha$) one can obtain commonly used quantile estimators. Namely, we obtain the quantile regression estimator of Koenker and Bassett (1978) when $\\alpha=0$, the non-crossing quantile regression estimator of Bondell et al. (2010) when $\\alpha=1$, and the composite quantile regression estimator of Koenker (1984) and Zou and Yuan (2008) when $\\alpha\\rightarrow\\infty$. As such, we show that non-crossing constraints are simply a special type of fused-shrinkage."}, "https://arxiv.org/abs/2403.14044": {"title": "Statistical tests for comparing the associations of multiple exposures with a common outcome in Cox proportional hazard models", "link": "https://arxiv.org/abs/2403.14044", "description": "arXiv:2403.14044v1 Announce Type: new \nAbstract: With the advancement of medicine, alternative exposures or interventions are emerging with respect to a common outcome, and there is a need to formally test the difference in the associations of multiple exposures. We propose a duplication method-based multivariate Wald test in Cox proportional hazard regression analyses to test the difference in the associations of multiple exposures with the same outcome. The proposed method applies to linear or categorical exposures. 
To illustrate our method, we applied our method to compare the associations between alignment to two different dietary patterns, either as continuous or quartile exposures, and incident chronic diseases, defined as a composite of CVD, cancer, and diabetes, in the Health Professional Follow-up Study. Relevant sample codes in R that implement the proposed approach are provided. The proposed duplication-method-based approach offers a flexible, formal statistical test of multiple exposures for the common outcome with minimal assumptions."}, "https://arxiv.org/abs/2403.14152": {"title": "Generalized Rosenbaum Bounds Sensitivity Analysis for Matched Observational Studies with Treatment Doses: Sufficiency, Consistency, and Efficiency", "link": "https://arxiv.org/abs/2403.14152", "description": "arXiv:2403.14152v1 Announce Type: new \nAbstract: In matched observational studies with binary treatments, the Rosenbaum bounds framework is arguably the most widely used sensitivity analysis framework for assessing sensitivity to unobserved covariates. Unlike the binary treatment case, although widely needed in practice, sensitivity analysis for matched observational studies with treatment doses (i.e., non-binary treatments such as ordinal treatments or continuous treatments) still lacks solid foundations and valid methodologies. We fill in this blank by establishing theoretical foundations and novel methodologies under a generalized Rosenbaum bounds sensitivity analysis framework. First, we present a criterion for assessing the validity of sensitivity analysis in matched observational studies with treatment doses and use that criterion to justify the necessity of incorporating the treatment dose information into sensitivity analysis through generalized Rosenbaum sensitivity bounds. We also generalize Rosenbaum's classic sensitivity parameter $\\Gamma$ to the non-binary treatment case and prove its sufficiency. Second, we study the asymptotic properties of sensitivity analysis by generalizing Rosenbaum's classic design sensitivity and Bahadur efficiency for testing Fisher's sharp null to the non-binary treatment case and deriving novel formulas for them. Our theoretical results showed the importance of appropriately incorporating the treatment dose into a test, which is an intrinsic distinction with the binary treatment case. Third, for testing Neyman's weak null (i.e., null sample average treatment effect), we propose the first valid sensitivity analysis procedure for matching with treatment dose through generalizing an existing optimization-based sensitivity analysis for the binary treatment case, built on the generalized Rosenbaum sensitivity bounds and large-scale mixed integer programming."}, "https://arxiv.org/abs/2403.14216": {"title": "A Gaussian smooth transition vector autoregressive model: An application to the macroeconomic effects of severe weather shocks", "link": "https://arxiv.org/abs/2403.14216", "description": "arXiv:2403.14216v1 Announce Type: new \nAbstract: We introduce a new smooth transition vector autoregressive model with a Gaussian conditional distribution and transition weights that, for a $p$th order model, depend on the full distribution of the preceding $p$ observations. Specifically, the transition weight of each regime increases in its relative weighted likelihood. This data-driven approach facilitates capturing complex switching dynamics, enhancing the identification of gradual regime shifts. 
In an empirical application to the macroeconomic effects of a severe weather shock, we find that in monthly U.S. data from 1961:1 to 2022:3, the impacts of the shock are stronger in the regime prevailing in the early part of the sample and in certain crisis periods than in the regime dominating the latter part of the sample. This suggests overall adaptation of the U.S. economy to increased severe weather over time."}, "https://arxiv.org/abs/2403.14336": {"title": "An empirical appraisal of methods for the dynamic prediction of survival with numerous longitudinal predictors", "link": "https://arxiv.org/abs/2403.14336", "description": "arXiv:2403.14336v1 Announce Type: new \nAbstract: Recently, the increasing availability of repeated measurements in biomedical studies has motivated the development of several statistical methods for the dynamic prediction of survival in settings where a large (potentially high-dimensional) number of longitudinal covariates is available. These methods differ in both how they model the longitudinal covariates trajectories, and how they specify the relationship between the longitudinal covariates and the survival outcome. Because these methods are still quite new, little is known about their applicability, limitations and performance when applied to real-world data.\n To investigate these questions, we present a comparison of the predictive performance of the aforementioned methods and two simpler prediction approaches to three datasets that differ in terms of outcome type, sample size, number of longitudinal covariates and length of follow-up. We discuss how different modelling choices can have an impact on the possibility to accommodate unbalanced study designs and on computing time, and compare the predictive performance of the different approaches using a range of performance measures and landmark times."}, "https://arxiv.org/abs/2403.14348": {"title": "Statistical modeling to adjust for time trends in adaptive platform trials utilizing non-concurrent controls", "link": "https://arxiv.org/abs/2403.14348", "description": "arXiv:2403.14348v1 Announce Type: new \nAbstract: Utilizing non-concurrent controls in the analysis of late-entering experimental arms in platform trials has recently received considerable attention, both on academic and regulatory levels. While incorporating this data can lead to increased power and lower required sample sizes, it might also introduce bias to the effect estimators if temporal drifts are present in the trial. Aiming to mitigate the potential calendar time bias, we propose various frequentist model-based approaches that leverage the non-concurrent control data, while adjusting for time trends. One of the currently available frequentist models incorporates time as a categorical fixed effect, separating the duration of the trial into periods, defined as time intervals bounded by any treatment arm entering or leaving the platform. In this work, we propose two extensions of this model. First, we consider an alternative definition of the time covariate by dividing the trial into fixed-length calendar time intervals. Second, we propose alternative methods to adjust for time trends. In particular, we investigate adjusting for autocorrelated random effects to account for dependency between closer time intervals and employing spline regression to model time with a smooth polynomial function. 
We evaluate the performance of the proposed approaches in a simulation study and illustrate their use by means of a case study."}, "https://arxiv.org/abs/2403.14451": {"title": "Phenology curve estimation via a mixed model representation of functional principal components: Characterizing time series of satellite-derived vegetation indices", "link": "https://arxiv.org/abs/2403.14451", "description": "arXiv:2403.14451v1 Announce Type: new \nAbstract: Vegetation phenology consists of studying synchronous stationary events, such as the vegetation green up and leaves senescence, that can be construed as adaptive responses to climatic constraints. In this paper, we propose a method to estimate the annual phenology curve from multi-annual observations of time series of vegetation indices derived from satellite images. We fitted the classical harmonic regression model to annual-based time series in order to construe the original data set as realizations of a functional process. Hierarchical clustering was applied to define a nearly homogeneous group of annual (smoothed) time series from which a representative and idealized phenology curve was estimated at the pixel level. This curve resulted from fitting a mixed model, based on functional principal components, to the homogeneous group of time series. Leveraging the idealized phenology curve, we employed standard calculus criteria to estimate the following phenological parameters (stationary events): green up, start of season, maturity, senescence, end of season and dormancy. By applying the proposed methodology to four different data cubes (time series from 2000 to 2023 of a popular satellite-derived vegetation index) recorded across grasslands, forests, and annual rainfed agricultural zones of a Flora and Fauna Protected Area in northern Mexico, we verified that our approach characterizes properly the phenological cycle in vegetation with nearly periodic dynamics, such as grasslands and agricultural areas. The R package sephora was used for all computations in this paper."}, "https://arxiv.org/abs/2403.14452": {"title": "On Weighted Trigonometric Regression for Suboptimal Designs in Circadian Biology Studies", "link": "https://arxiv.org/abs/2403.14452", "description": "arXiv:2403.14452v1 Announce Type: new \nAbstract: Studies in circadian biology often use trigonometric regression to model phenomena over time. Ideally, protocols in these studies would collect samples at evenly distributed and equally spaced time points over a 24 hour period. This sample collection protocol is known as an equispaced design, which is considered the optimal experimental design for trigonometric regression under multiple statistical criteria. However, implementing equispaced designs in studies involving individuals is logistically challenging, and failure to employ an equispaced design could cause a loss of statistical power when performing hypothesis tests with an estimated model. This paper is motivated by the potential loss of statistical power during hypothesis testing, and considers a weighted trigonometric regression as a remedy. Specifically, the weights for this regression are the normalized reciprocals of estimates derived from a kernel density estimator for sample collection time, which inflates the weight of samples collected at underrepresented time points. 
A search procedure is also introduced to identify the concentration hyperparameter for kernel density estimation that maximizes the Hessian of weighted squared loss, which relates to both maximizing the $D$-optimality criterion from experimental design literature and minimizing the generalized variance. Simulation studies consistently demonstrate that this weighted regression mitigates variability in inferences produced by an estimated model. Illustrations with three real circadian biology data sets further indicate that this weighted regression consistently yields larger test statistics than its unweighted counterpart for first-order trigonometric regression, or cosinor regression, which is prevalent in circadian biology studies."}, "https://arxiv.org/abs/2403.14531": {"title": "Green's matching: an efficient approach to parameter estimation in complex dynamic systems", "link": "https://arxiv.org/abs/2403.14531", "description": "arXiv:2403.14531v1 Announce Type: new \nAbstract: Parameters of differential equations are essential to characterize intrinsic behaviors of dynamic systems. Numerous methods for estimating parameters in dynamic systems are computationally and/or statistically inadequate, especially for complex systems with general-order differential operators, such as motion dynamics. This article presents Green's matching, a computationally tractable and statistically efficient two-step method, which only needs to approximate trajectories in dynamic systems but not their derivatives due to the inverse of differential operators by Green's function. This yields a statistically optimal guarantee for parameter estimation in general-order equations, a feature not shared by existing methods, and provides an efficient framework for broad statistical inferences in complex dynamic systems."}, "https://arxiv.org/abs/2403.14563": {"title": "Evaluating the impact of instrumental variables in propensity score models using synthetic and negative control experiments", "link": "https://arxiv.org/abs/2403.14563", "description": "arXiv:2403.14563v1 Announce Type: new \nAbstract: In pharmacoepidemiology research, instrumental variables (IVs) are variables that strongly predict treatment but have no causal effect on the outcome of interest except through the treatment. There remain concerns about the inclusion of IVs in propensity score (PS) models amplifying estimation bias and reducing precision. Some PS modeling approaches attempt to address the potential effects of IVs, including selecting only covariates for the PS model that are strongly associated to the outcome of interest, thus screening out IVs. We conduct a study utilizing simulations and negative control experiments to evaluate the effect of IVs on PS model performance and to uncover best PS practices for real-world studies. We find that simulated IVs have a weak effect on bias and precision in both simulations and negative control experiments based on real-world data. In simulation experiments, PS methods that utilize outcome data, including the high-dimensional propensity score, produce the least estimation bias. However, in real-world settings underlying causal structures are unknown, and negative control experiments can illustrate a PS model's ability to minimize systematic bias. 
We find that large-scale, regularized regression based PS models in this case provide the most centered negative control distributions, suggesting superior performance in real-world scenarios."}, "https://arxiv.org/abs/2403.14573": {"title": "A Transfer Learning Causal Approach to Evaluate Racial/Ethnic and Geographic Variation in Outcomes Following Congenital Heart Surgery", "link": "https://arxiv.org/abs/2403.14573", "description": "arXiv:2403.14573v1 Announce Type: new \nAbstract: Congenital heart defects (CHD) are the most prevalent birth defects in the United States and surgical outcomes vary considerably across the country. The outcomes of treatment for CHD differ for specific patient subgroups, with non-Hispanic Black and Hispanic populations experiencing higher rates of mortality and morbidity. A valid comparison of outcomes within racial/ethnic subgroups is difficult given large differences in case-mix and small subgroup sizes. We propose a causal inference framework for outcome assessment and leverage advances in transfer learning to incorporate data from both target and source populations to help estimate causal effects while accounting for different sources of risk factor and outcome differences across populations. Using the Society of Thoracic Surgeons' Congenital Heart Surgery Database (STS-CHSD), we focus on a national cohort of patients undergoing the Norwood operation from 2016-2022 to assess operative mortality and morbidity outcomes across U.S. geographic regions by race/ethnicity. We find racial and ethnic outcome differences after controlling for potential confounding factors. While geography does not have a causal effect on outcomes for non-Hispanic Caucasian patients, non-Hispanic Black patients experience wide variability in outcomes with estimated 30-day mortality ranging from 5.9% (standard error 2.2%) to 21.6% (4.4%) across U.S. regions."}, "https://arxiv.org/abs/2403.14067": {"title": "Automatic Outlier Rectification via Optimal Transport", "link": "https://arxiv.org/abs/2403.14067", "description": "arXiv:2403.14067v1 Announce Type: cross \nAbstract: In this paper, we propose a novel conceptual framework to detect outliers using optimal transport with a concave cost function. Conventional outlier detection approaches typically use a two-stage procedure: first, outliers are detected and removed, and then estimation is performed on the cleaned data. However, this approach does not inform outlier removal with the estimation task, leaving room for improvement. To address this limitation, we propose an automatic outlier rectification mechanism that integrates rectification and estimation within a joint optimization framework. We take the first step to utilize an optimal transport distance with a concave cost function to construct a rectification set in the space of probability distributions. Then, we select the best distribution within the rectification set to perform the estimation task. Notably, the concave cost function we introduced in this paper is the key to making our estimator effectively identify the outlier during the optimization process. We discuss the fundamental differences between our estimator and optimal transport-based distributionally robust optimization estimator. 
Finally, we demonstrate the effectiveness and superiority of our approach over conventional approaches in extensive simulation and empirical analyses for mean estimation, least absolute regression, and the fitting of option implied volatility surfaces."}, "https://arxiv.org/abs/2403.14385": {"title": "Estimating Causal Effects with Double Machine Learning -- A Method Evaluation", "link": "https://arxiv.org/abs/2403.14385", "description": "arXiv:2403.14385v1 Announce Type: cross \nAbstract: The estimation of causal effects with observational data continues to be a very active research area. In recent years, researchers have developed new frameworks which use machine learning to relax classical assumptions necessary for the estimation of causal effects. In this paper, we review one of the most prominent methods - \"double/debiased machine learning\" (DML) - and empirically evaluate it by comparing its performance on simulated data relative to more traditional statistical methods, before applying it to real-world data. Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships. This advantage enables a departure from traditional functional form assumptions typically necessary in causal effect estimation. However, we demonstrate that the method continues to critically depend on standard assumptions about causal structure and identification. When estimating the effects of air pollution on housing prices in our application, we find that DML estimates are consistently larger than estimates of less flexible methods. From our overall results, we provide actionable recommendations for specific choices researchers must make when applying DML in practice."}, "https://arxiv.org/abs/1905.07812": {"title": "Iterative Estimation of Nonparametric Regressions with Continuous Endogenous Variables and Discrete Instruments", "link": "https://arxiv.org/abs/1905.07812", "description": "arXiv:1905.07812v2 Announce Type: replace \nAbstract: We consider a nonparametric regression model with continuous endogenous independent variables when only discrete instruments are available that are independent of the error term. While this framework is very relevant for applied research, its implementation is cumbersome, as the regression function becomes the solution to a nonlinear integral equation. We propose a simple iterative procedure to estimate such models and showcase some of its asymptotic properties. In a simulation experiment, we discuss the details of its implementation in the case when the instrumental variable is binary. We conclude with an empirical application in which we examine the effect of pollution on house prices in a short panel of U.S. counties."}, "https://arxiv.org/abs/2111.10715": {"title": "Confidences in Hypotheses", "link": "https://arxiv.org/abs/2111.10715", "description": "arXiv:2111.10715v4 Announce Type: replace \nAbstract: This article outlines a broadly-applicable new method of statistical analysis for situations involving two competing hypotheses. Hypotheses assessment is a frequentist procedure designed to answer the question: Given the sample evidence (and assumed model), what is the relative plausibility of each hypothesis? Our aim is to determine frequentist confidences in the hypotheses that are relevant to the data at hand and are as powerful as the particular application allows. 
Hypotheses assessments complement significance tests because providing confidences in the hypotheses in addition to test results can better inform applied researchers about the strength of evidence provided by the data. For simple hypotheses, the method produces minimum and maximum confidences in each hypothesis. The composite case is more complex, and we introduce two conventions to aid with understanding the strength of evidence. Assessments are qualitatively different from significance test and confidence interval outcomes, and thus fill a gap in the statistician's toolkit."}, "https://arxiv.org/abs/2201.06694": {"title": "Homophily in preferences or meetings? Identifying and estimating an iterative network formation model", "link": "https://arxiv.org/abs/2201.06694", "description": "arXiv:2201.06694v4 Announce Type: replace \nAbstract: Is homophily in social and economic networks driven by a taste for homogeneity (preferences) or by a higher probability of meeting individuals with similar attributes (opportunity)? This paper studies identification and estimation of an iterative network game that distinguishes between these two mechanisms. Our approach enables us to assess the counterfactual effects of changing the meeting protocol between agents. As an application, we study the role of preferences and meetings in shaping classroom friendship networks in Brazil. In a network structure in which homophily due to preferences is stronger than homophily due to meeting opportunities, tracking students may improve welfare. Still, the relative benefit of this policy diminishes over the school year."}, "https://arxiv.org/abs/2204.11318": {"title": "Identification and Statistical Decision Theory", "link": "https://arxiv.org/abs/2204.11318", "description": "arXiv:2204.11318v2 Announce Type: replace \nAbstract: Econometricians have usefully separated study of estimation into identification and statistical components. Identification analysis, which assumes knowledge of the probability distribution generating observable data, places an upper bound on what may be learned about population parameters of interest with finite sample data. Yet Wald's statistical decision theory studies decision making with sample data without reference to identification, indeed without reference to estimation. This paper asks if identification analysis is useful to statistical decision theory. The answer is positive, as it can yield an informative and tractable upper bound on the achievable finite sample performance of decision criteria. The reasoning is simple when the decision relevant parameter is point identified. It is more delicate when the true state is partially identified and a decision must be made under ambiguity. Then the performance of some criteria, such as minimax regret, is enhanced by randomizing choice of an action. This may be accomplished by making choice a function of sample data. I find it useful to recast choice of a statistical decision function as selection of choice probabilities for the elements of the choice set. Using sample data to randomize choice conceptually differs from and is complementary to its traditional use to estimate population parameters."}, "https://arxiv.org/abs/2303.10712": {"title": "Mixture of segmentation for heterogeneous functional data", "link": "https://arxiv.org/abs/2303.10712", "description": "arXiv:2303.10712v2 Announce Type: replace \nAbstract: In this paper we consider functional data with heterogeneity in time and in population. 
We propose a mixture model with segmentation of time to represent this heterogeneity while keeping the functional structure. Maximum likelihood estimator is considered, proved to be identifiable and consistent. In practice, an EM algorithm is used, combined with dynamic programming for the maximization step, to approximate the maximum likelihood estimator. The method is illustrated on a simulated dataset, and used on a real dataset of electricity consumption."}, "https://arxiv.org/abs/2307.16798": {"title": "Forster-Warmuth Counterfactual Regression: A Unified Learning Approach", "link": "https://arxiv.org/abs/2307.16798", "description": "arXiv:2307.16798v4 Announce Type: replace \nAbstract: Series or orthogonal basis regression is one of the most popular non-parametric regression techniques in practice, obtained by regressing the response on features generated by evaluating the basis functions at observed covariate values. The most routinely used series estimator is based on ordinary least squares fitting, which is known to be minimax rate optimal in various settings, albeit under stringent restrictions on the basis functions and the distribution of covariates. In this work, inspired by the recently developed Forster-Warmuth (FW) learner, we propose an alternative series regression estimator that can attain the minimax estimation rate under strictly weaker conditions imposed on the basis functions and the joint law of covariates, than existing series estimators in the literature. Moreover, a key contribution of this work generalizes the FW-learner to a so-called counterfactual regression problem, in which the response variable of interest may not be directly observed (hence, the name ``counterfactual'') on all sampled units, and therefore needs to be inferred in order to identify and estimate the regression in view from the observed data. Although counterfactual regression is not entirely a new area of inquiry, we propose the first-ever systematic study of this challenging problem from a unified pseudo-outcome perspective. In fact, we provide what appears to be the first generic and constructive approach for generating the pseudo-outcome (to substitute for the unobserved response) which leads to the estimation of the counterfactual regression curve of interest with small bias, namely bias of second order. Several applications are used to illustrate the resulting FW-learner including many nonparametric regression problems in missing data and causal inference literature, for which we establish high-level conditions for minimax rate optimality of the proposed FW-learner."}, "https://arxiv.org/abs/2312.16241": {"title": "Analysis of Pleiotropy for Testosterone and Lipid Profiles in Males and Females", "link": "https://arxiv.org/abs/2312.16241", "description": "arXiv:2312.16241v2 Announce Type: replace \nAbstract: In modern scientific studies, it is often imperative to determine whether a set of phenotypes is affected by a single factor. If such an influence is identified, it becomes essential to discern whether this effect is contingent upon categories such as sex or age group, and importantly, to understand whether this dependence is rooted in purely non-environmental reasons. The exploration of such dependencies often involves studying pleiotropy, a phenomenon wherein a single genetic locus impacts multiple traits. 
This heightened interest in uncovering dependencies by pleiotropy is fueled by the growing accessibility of summary statistics from genome-wide association studies (GWAS) and the establishment of thoroughly phenotyped sample collections. This advancement enables a systematic and comprehensive exploration of the genetic connections among various traits and diseases. Additive genetic correlation illuminates the genetic connection between two traits, providing valuable insights into the shared biological pathways and underlying causal relationships between them. In this paper, we present a novel method to analyze such dependencies by studying additive genetic correlations between pairs of traits under consideration. Subsequently, we employ matrix comparison techniques to discern and elucidate sex-specific or age-group-specific associations, contributing to a deeper understanding of the nuanced dependencies within the studied traits. Our proposed method is computationally handy and requires only GWAS summary statistics. We validate our method by applying it to the UK Biobank data and present the results."}, "https://arxiv.org/abs/2312.10569": {"title": "Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data", "link": "https://arxiv.org/abs/2312.10569", "description": "arXiv:2312.10569v2 Announce Type: replace-cross \nAbstract: Many modern causal questions ask how treatments affect complex outcomes that are measured using wearable devices and sensors. Current analysis approaches require summarizing these data into scalar statistics (e.g., the mean), but these summaries can be misleading. For example, disparate distributions can have the same means, variances, and other statistics. Researchers can overcome the loss of information by instead representing the data as distributions. We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making: Analyzing Distributional Data via Matching After Learning to Stretch (ADD MALTS). We (i) provide analytical guarantees of the correctness of our estimation strategy, (ii) demonstrate via simulation that ADD MALTS outperforms other distributional data analysis methods at estimating treatment effects, and (iii) illustrate ADD MALTS' ability to verify whether there is enough cohesion between treatment and control units within subpopulations to trustworthily estimate treatment effects. We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks."}, "https://arxiv.org/abs/2403.14899": {"title": "Statistical Inference For Noisy Matrix Completion Incorporating Auxiliary Information", "link": "https://arxiv.org/abs/2403.14899", "description": "arXiv:2403.14899v1 Announce Type: new \nAbstract: This paper investigates statistical inference for noisy matrix completion in a semi-supervised model when auxiliary covariates are available. The model consists of two parts. One part is a low-rank matrix induced by unobserved latent factors; the other part models the effects of the observed covariates through a coefficient matrix which is composed of high-dimensional column vectors. We model the observational pattern of the responses through a logistic regression of the covariates, and allow its probability to go to zero as the sample size increases. We apply an iterative least squares (LS) estimation approach in our considered context. 
The iterative LS methods in general enjoy a low computational cost, but deriving the statistical properties of the resulting estimators is a challenging task. We show that our method only needs a few iterations, and the resulting entry-wise estimators of the low-rank matrix and the coefficient matrix are guaranteed to have asymptotic normal distributions. As a result, individual inference can be conducted for each entry of the unknown matrices. We also propose a simultaneous testing procedure with multiplier bootstrap for the high-dimensional coefficient matrix. This simultaneous inferential tool can help us further investigate the effects of covariates for the prediction of missing entries."}, "https://arxiv.org/abs/2403.14925": {"title": "Computational Approaches for Exponential-Family Factor Analysis", "link": "https://arxiv.org/abs/2403.14925", "description": "arXiv:2403.14925v1 Announce Type: new \nAbstract: We study a general factor analysis framework where the $n$-by-$p$ data matrix is assumed to follow a general exponential family distribution entry-wise. While this model framework has been proposed before, we here further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship on data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible and able to capture non-Gaussian covariance structures of the data matrix. Our main focus is on efficient computational approaches to perform the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method was shown to lead to asymptotic bias when the simulated sample size grows slower than the square root of the sample size $n$, eliminating its practical application for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solution does not show asymptotic biases, and scales even better for large matrix factorizations with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss its application in three case studies."}, "https://arxiv.org/abs/2403.14954": {"title": "Creating a Spatial Vulnerability Index for Environmental Health", "link": "https://arxiv.org/abs/2403.14954", "description": "arXiv:2403.14954v1 Announce Type: new \nAbstract: Extreme natural hazards are increasing in frequency and intensity. These natural changes in our environment, combined with man-made pollution, have substantial economic, social and health impacts globally. The impact of the environment on human health (environmental health) is becoming well understood in international research literature. However, there are significant barriers to understanding key characteristics of this impact, related to substantial data volumes, data access rights and the time required to compile and compare data over regions and time. 
This study aims to reduce these barriers in Australia by creating an open data repository of national environmental health data and presenting a methodology for the production of health outcome-weighted population vulnerability indices related to extreme heat, extreme cold and air pollution at various temporal and geographical resolutions.\n Current state-of-the-art methods for the calculation of vulnerability indices include equal weight percentile ranking and the use of principal component analysis (PCA). The weighted vulnerability index methodology proposed in this study offers an advantage over others in the literature by considering health outcomes in the calculation process. The resulting vulnerability percentiles more clearly align population sensitivity and adaptive capacity with health risks. The temporal and spatial resolutions of the indices enable national monitoring on a scale never before seen across Australia. Additionally, we show that a weekly temporal resolution can be used to identify spikes in vulnerability due to changes in relative national environmental exposure."}, "https://arxiv.org/abs/2403.15111": {"title": "Fast TTC Computation", "link": "https://arxiv.org/abs/2403.15111", "description": "arXiv:2403.15111v1 Announce Type: new \nAbstract: This paper proposes a fast Markov Matrix-based methodology for computing Top Trading Cycles (TTC) that delivers O(1) computational speed, that is speed independent of the number of agents and objects in the system. The proposed methodology is well suited for complex large-dimensional problems like housing choice. The methodology retains all the properties of TTC, namely, Pareto-efficiency, individual rationality and strategy-proofness."}, "https://arxiv.org/abs/2403.15220": {"title": "Modelling with Discretized Variables", "link": "https://arxiv.org/abs/2403.15220", "description": "arXiv:2403.15220v1 Announce Type: new \nAbstract: This paper deals with econometric models in which the dependent variable, some explanatory variables, or both are observed as censored interval data. This discretization often happens due to confidentiality of sensitive variables like income. Models using these variables cannot point identify regression parameters as the conditional moments are unknown, which led the literature to use interval estimates. Here, we propose a discretization method through which the regression parameters can be point identified while preserving data confidentiality. We demonstrate the asymptotic properties of the OLS estimator for the parameters in multivariate linear regressions for cross-sectional data. The theoretical findings are supported by Monte Carlo experiments and illustrated with an application to the Australian gender wage gap."}, "https://arxiv.org/abs/2403.15258": {"title": "Tests for almost stochastic dominance", "link": "https://arxiv.org/abs/2403.15258", "description": "arXiv:2403.15258v1 Announce Type: new \nAbstract: We introduce a 2-dimensional stochastic dominance (2DSD) index to characterize both strict and almost stochastic dominance. Based on this index, we derive an estimator for the minimum violation ratio (MVR), also known as the critical parameter, of the almost stochastic ordering condition between two variables. We determine the asymptotic properties of the empirical 2DSD index and MVR for the most frequently used stochastic orders. We also provide conditions under which the bootstrap estimators of these quantities are strongly consistent. 
As an application, we develop consistent bootstrap testing procedures for almost stochastic dominance. The performance of the tests is checked via simulations and the analysis of real data."}, "https://arxiv.org/abs/2403.15302": {"title": "Optimal Survival Analyses With Prevalent and Incident Patients", "link": "https://arxiv.org/abs/2403.15302", "description": "arXiv:2403.15302v1 Announce Type: new \nAbstract: Period-prevalent cohorts are often used for their cost-saving potential in epidemiological studies of survival outcomes. Under this design, prevalent patients allow for evaluations of long-term survival outcomes without the need for long follow-up, whereas incident patients allow for evaluations of short-term survival outcomes without the issue of left-truncation. In most period-prevalent survival analyses from the existing literature, patients have been recruited to achieve an overall sample size, with little attention given to the relative frequencies of prevalent and incident patients and their statistical implications. Furthermore, there are no existing methods available to rigorously quantify the impact of these relative frequencies on estimation and inference and incorporate this information into study design strategies. To address these gaps, we develop an approach to identify the optimal mix of prevalent and incident patients that maximizes precision over the entire estimated survival curve, subject to a flexible weighting scheme. In addition, we prove that inference based on the weighted log-rank test or Cox proportional hazards model is most powerful with an entirely prevalent or incident cohort, and we derive theoretical formulas to determine the optimal choice. Simulations confirm the validity of the proposed optimization criteria and show that substantial efficiency gains can be achieved by recruiting the optimal mix of prevalent and incident patients. The proposed methods are applied to assess waitlist outcomes among kidney transplant candidates."}, "https://arxiv.org/abs/2403.15327": {"title": "On two-sample testing for data with arbitrarily missing values", "link": "https://arxiv.org/abs/2403.15327", "description": "arXiv:2403.15327v1 Announce Type: new \nAbstract: We develop a new rank-based approach for univariate two-sample testing in the presence of missing data which makes no assumptions about the missingness mechanism. This approach is a theoretical extension of the Wilcoxon-Mann-Whitney test that controls the Type I error by providing exact bounds for the test statistic after accounting for the number of missing values. Greater statistical power is shown when the method is extended to account for a bounded domain. Furthermore, exact bounds are provided on the proportions of data that can be missing in the two samples while yielding a significant result. Simulations demonstrate that our method has good power, typically for cases of $10\\%$ to $20\\%$ missing data, while standard imputation approaches fail to control the Type I error. We illustrate our method on complex clinical trial data in which patients' withdrawal from the trial leads to missing values."}, "https://arxiv.org/abs/2403.15384": {"title": "Unifying area and unit-level small area estimation through calibration", "link": "https://arxiv.org/abs/2403.15384", "description": "arXiv:2403.15384v1 Announce Type: new \nAbstract: When estimating area means, direct estimators based on area-specific data are usually consistent under the sampling design without model assumptions.
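The two-sample testing abstract above (arXiv:2403.15327) controls the Type I error via exact bounds on the Wilcoxon-Mann-Whitney statistic over all possible values of the missing observations. The sketch below illustrates only the basic bounding idea, obtained by pushing missing values to the extremes of the pooled sample; the paper's exact procedure, including the bounded-domain refinement, is not reproduced, and the function names are illustrative.

```python
import numpy as np

def wmw_u(x, y):
    """Mann-Whitney U: number of (x_i, y_j) pairs with x_i > y_j, ties counted as 1/2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    gt = (x[:, None] > y[None, :]).sum()
    eq = (x[:, None] == y[None, :]).sum()
    return gt + 0.5 * eq

def wmw_bounds(x_obs, y_obs, n_miss_x, n_miss_y):
    """Range of U over all possible values of the missing observations."""
    u_obs = wmw_u(x_obs, y_obs)
    # Lower bound: missing x at -infinity, missing y at +infinity add nothing.
    u_min = u_obs
    # Upper bound: missing x at +infinity beat every y; observed x beat missing y at -infinity.
    u_max = u_obs + n_miss_x * (len(y_obs) + n_miss_y) + len(x_obs) * n_miss_y
    return u_min, u_max
```

If both bounds fall on the same side of the relevant critical value, the test decision is unaffected by the missing data, which is the spirit of the bounds described in the abstract.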
However, they are inefficient if the area sample size is small. In small area estimation, model assumptions linking the areas are used to \"borrow strength\" from other areas. The basic area-level model provides design-consistent estimators, but error variances are assumed to be known. In practice, they are estimated with the (scarce) area-specific data. These estimators are inefficient, and their error is not accounted for in the associated mean squared error estimators. Unit-level models do not require the error variances to be known but do not account for the survey design. Here we describe a unified estimator of an area mean that may be obtained from either an area-level model or a unit-level model and is based on consistent estimators of the model error variances as the number of areas increases. We propose bootstrap mean squared error estimators that account for the uncertainty due to the estimation of the error variances. We show improved performance of the new small area estimators and of our bootstrap estimators of the mean squared error. We apply the results to education data from Colombia."}, "https://arxiv.org/abs/2403.15175": {"title": "Double Cross-fit Doubly Robust Estimators: Beyond Series Regression", "link": "https://arxiv.org/abs/2403.15175", "description": "arXiv:2403.15175v1 Announce Type: cross \nAbstract: Doubly robust estimators with cross-fitting have gained popularity in causal inference due to their favorable structure-agnostic error guarantees. However, when additional structure, such as H\\"{o}lder smoothness, is available then more accurate \"double cross-fit doubly robust\" (DCDR) estimators can be constructed by splitting the training data and undersmoothing nuisance function estimators on independent samples. We study a DCDR estimator of the Expected Conditional Covariance, a functional of interest in causal inference and conditional independence testing, and derive a series of increasingly powerful results with progressively stronger assumptions. We first provide a structure-agnostic error analysis for the DCDR estimator with no assumptions on the nuisance functions or their estimators. Then, assuming the nuisance functions are H\\"{o}lder smooth, but without assuming knowledge of the true smoothness level or the covariate density, we establish that DCDR estimators with several linear smoothers are semiparametric efficient under minimal conditions and achieve fast convergence rates in the non-$\\sqrt{n}$ regime. When the covariate density and smoothnesses are known, we propose a minimax rate-optimal DCDR estimator based on undersmoothed kernel regression. Moreover, we show an undersmoothed DCDR estimator satisfies a slower-than-$\\sqrt{n}$ central limit theorem, and that inference is possible even in the non-$\\sqrt{n}$ regime.
Finally, we support our theoretical results with simulations, providing intuition for double cross-fitting and undersmoothing, demonstrating where our estimator achieves semiparametric efficiency while the usual \"single cross-fit\" estimator fails, and illustrating asymptotic normality for the undersmoothed DCDR estimator."}, "https://arxiv.org/abs/2403.15198": {"title": "On the Weighted Top-Difference Distance: Axioms, Aggregation, and Approximation", "link": "https://arxiv.org/abs/2403.15198", "description": "arXiv:2403.15198v1 Announce Type: cross \nAbstract: We study a family of distance functions on rankings that allow for asymmetric treatments of alternatives and consider the distinct relevance of the top and bottom positions for ordered lists. We provide a full axiomatic characterization of our distance. In doing so, we retrieve new characterizations of existing axioms and show how to effectively weaken them for our purposes. This analysis highlights the generality of our distance as it embeds many (semi)metrics previously proposed in the literature. Subsequently, we show that, notwithstanding its level of generality, our distance is still readily applicable. We apply it to preference aggregation, studying the features of the associated median voting rule. It is shown how the derived preference function satisfies many desirable features in the context of voting rules, ranging from fairness to majority and Pareto-related properties. We show how to compute consensus rankings exactly, and provide generalized Diaconis-Graham inequalities that can be leveraged to obtain approximation algorithms. Finally, we propose some truncation ideas for our distances inspired by Lu and Boutilier (2010). These can be leveraged to devise a Polynomial-Time-Approximation Scheme for the corresponding rank aggregation problem."}, "https://arxiv.org/abs/2403.15238": {"title": "WEEP: A method for spatial interpretation of weakly supervised CNN models in computational pathology", "link": "https://arxiv.org/abs/2403.15238", "description": "arXiv:2403.15238v1 Announce Type: cross \nAbstract: Deep learning enables the modelling of high-resolution histopathology whole-slide images (WSI). Weakly supervised learning of tile-level data is typically applied for tasks where labels only exist on the patient or WSI level (e.g. patient outcomes or histological grading). In this context, there is a need for improved spatial interpretability of predictions from such models. We propose a novel method, Wsi rEgion sElection aPproach (WEEP), for model interpretation. It provides a principled yet straightforward way to establish the spatial area of WSI required for assigning a particular prediction label. We demonstrate WEEP on a binary classification task in the area of breast cancer computational pathology. WEEP is easy to implement, is directly connected to the model-based decision process, and offers information relevant to both research and diagnostic applications."}, "https://arxiv.org/abs/2207.02626": {"title": "Estimating the limiting shape of bivariate scaled sample clouds: with additional benefits of self-consistent inference for existing extremal dependence properties", "link": "https://arxiv.org/abs/2207.02626", "description": "arXiv:2207.02626v3 Announce Type: replace \nAbstract: The key to successful statistical analysis of bivariate extreme events lies in flexible modelling of the tail dependence relationship between the two variables. 
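The DCDR abstract above (arXiv:2403.15175) studies the Expected Conditional Covariance, E[Cov(A, Y | X)] = E[(A - E[A|X])(Y - E[Y|X])], with the two nuisance regressions fit on independent folds. The sketch below shows a generic double cross-fit evaluation of that moment; the random-forest nuisance estimators are placeholders (the paper analyses particular linear smoothers and undersmoothed kernel regression), and the names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def dcdr_ecc(X, A, Y, n_splits=3, seed=0):
    """Double cross-fit estimate of E[Cov(A, Y | X)]: the two nuisance
    regressions are trained on disjoint folds, the moment is evaluated on a
    third fold, and the fold roles are rotated and averaged."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    folds = np.array_split(idx, n_splits)
    estimates = []
    for k in range(n_splits):
        eval_idx = folds[k]
        fit_a_idx = folds[(k + 1) % n_splits]   # fold used for E[A | X]
        fit_y_idx = folds[(k + 2) % n_splits]   # separate fold used for E[Y | X]
        pi_hat = RandomForestRegressor(random_state=0).fit(X[fit_a_idx], A[fit_a_idx])
        mu_hat = RandomForestRegressor(random_state=0).fit(X[fit_y_idx], Y[fit_y_idx])
        resid_a = A[eval_idx] - pi_hat.predict(X[eval_idx])
        resid_y = Y[eval_idx] - mu_hat.predict(X[eval_idx])
        estimates.append(np.mean(resid_a * resid_y))
    return float(np.mean(estimates))
```

The key structural point, which the code mirrors, is that the two nuisance estimators never see the same training observations, which is what distinguishes double cross-fitting from the usual single cross-fit construction.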
In the extreme value theory literature, various techniques are available to model separate aspects of tail dependence, based on different asymptotic limits. Results from Balkema and Nolde (2010) and Nolde (2014) highlight the importance of studying the limiting shape of an appropriately-scaled sample cloud when characterising the whole joint tail. We now develop the first statistical inference for this limit set, which has considerable practical importance for a unified inference framework across different aspects of the joint tail. Moreover, Nolde and Wadsworth (2022) link this limit set to various existing extremal dependence frameworks. Hence, a by-product of our new limit set inference is the first set of self-consistent estimators for several extremal dependence measures, avoiding the current possibility of contradictory conclusions. In simulations, our limit set estimator is successful across a range of distributions, and the corresponding extremal dependence estimators provide a major joint improvement and small marginal improvements over existing techniques. We consider an application to sea wave heights, where our estimates successfully capture the expected weakening extremal dependence as the distance between locations increases."}, "https://arxiv.org/abs/2211.04034": {"title": "Structured Mixture of Continuation-ratio Logits Models for Ordinal Regression", "link": "https://arxiv.org/abs/2211.04034", "description": "arXiv:2211.04034v2 Announce Type: replace \nAbstract: We develop a nonparametric Bayesian modeling approach to ordinal regression based on priors placed directly on the discrete distribution of the ordinal responses. The prior probability models are built from a structured mixture of multinomial distributions. We leverage a continuation-ratio logits representation to formulate the mixture kernel, with mixture weights defined through the logit stick-breaking process that incorporates the covariates through a linear function. The implied regression functions for the response probabilities can be expressed as weighted sums of parametric regression functions, with covariate-dependent weights. Thus, the modeling approach achieves flexible ordinal regression relationships, avoiding linearity or additivity assumptions in the covariate effects. Model flexibility is formally explored through the Kullback-Leibler support of the prior probability model. A key model feature is that the parameters for both the mixture kernel and the mixture weights can be associated with a continuation-ratio logits regression structure. Hence, an efficient and relatively easy to implement posterior simulation method can be designed, using P\\'olya-Gamma data augmentation. Moreover, the model is built from a conditional independence structure for category-specific parameters, which results in additional computational efficiency gains through partial parallel sampling. In addition to the general mixture structure, we study simplified model versions that incorporate covariate dependence only in the mixture kernel parameters or only in the mixture weights. For all proposed models, we discuss approaches to prior specification and develop Markov chain Monte Carlo methods for posterior simulation. 
The methodology is illustrated with several synthetic and real data examples."}, "https://arxiv.org/abs/2302.06054": {"title": "Single Proxy Control", "link": "https://arxiv.org/abs/2302.06054", "description": "arXiv:2302.06054v5 Announce Type: replace \nAbstract: Negative control variables are sometimes used in non-experimental studies to detect the presence of confounding by hidden factors. A negative control outcome (NCO) is an outcome that is influenced by unobserved confounders of the exposure effects on the outcome in view, but is not causally impacted by the exposure. Tchetgen Tchetgen (2013) introduced the Control Outcome Calibration Approach (COCA) as a formal NCO counterfactual method to detect and correct for residual confounding bias. For identification, COCA treats the NCO as an error-prone proxy of the treatment-free counterfactual outcome of interest, and involves regressing the NCO on the treatment-free counterfactual, together with a rank-preserving structural model which assumes a constant individual-level causal effect. In this work, we establish nonparametric COCA identification for the average causal effect for the treated, without requiring rank-preservation, therefore accommodating unrestricted effect heterogeneity across units. This nonparametric identification result has important practical implications, as it provides single proxy confounding control, in contrast to recently proposed proximal causal inference, which relies for identification on a pair of confounding proxies. For COCA estimation we propose three separate strategies: (i) an extended propensity score approach, (ii) an outcome bridge function approach, and (iii) a doubly-robust approach. Finally, we illustrate the proposed methods in an application evaluating the causal impact of a Zika virus outbreak on birth rate in Brazil."}, "https://arxiv.org/abs/2309.06985": {"title": "CARE: Large Precision Matrix Estimation for Compositional Data", "link": "https://arxiv.org/abs/2309.06985", "description": "arXiv:2309.06985v2 Announce Type: replace \nAbstract: High-dimensional compositional data are prevalent in many applications. The simplex constraint poses intrinsic challenges to inferring the conditional dependence relationships among the components forming a composition, as encoded by a large precision matrix. We introduce a precise specification of the compositional precision matrix and relate it to its basis counterpart, which is shown to be asymptotically identifiable under suitable sparsity assumptions. By exploiting this connection, we propose a composition adaptive regularized estimation (CARE) method for estimating the sparse basis precision matrix. We derive rates of convergence for the estimator and provide theoretical guarantees on support recovery and data-driven parameter tuning. Our theory reveals an intriguing trade-off between identification and estimation, thereby highlighting the blessing of dimensionality in compositional data analysis. In particular, in sufficiently high dimensions, the CARE estimator achieves minimax optimality and performs as well as if the basis were observed. We further discuss how our framework can be extended to handle data containing zeros, including sampling zeros and structural zeros. 
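The ordinal regression abstract above (arXiv:2211.04034) builds its mixture kernel from continuation-ratio logits. As a reference point, the sketch below evaluates the plain continuation-ratio representation P(Y = j | x) = pi_j(x) * prod_{l<j}(1 - pi_l(x)) with logistic pi_j; it is not the paper's nonparametric mixture, and all parameter values are made up for illustration.

```python
import numpy as np

def continuation_ratio_probs(x, alphas, betas):
    """Response probabilities for an ordinal variable with C categories under
    continuation-ratio logits:
        pi_j(x) = P(Y = j | Y >= j, x) = sigmoid(alpha_j + x @ beta_j), j = 1..C-1,
        P(Y = j | x) = pi_j(x) * prod_{l < j} (1 - pi_l(x)),  with pi_C = 1.
    `alphas` has length C-1 and `betas` has shape (C-1, len(x))."""
    pis = 1.0 / (1.0 + np.exp(-(alphas + betas @ x)))
    pis = np.append(pis, 1.0)                              # last category absorbs the rest
    survive = np.concatenate([[1.0], np.cumprod(1.0 - pis[:-1])])
    return pis * survive

# Example: 4 categories, 2 covariates (illustrative parameter values)
probs = continuation_ratio_probs(np.array([0.5, -1.0]),
                                 alphas=np.array([-0.2, 0.1, 0.4]),
                                 betas=np.array([[0.3, -0.5], [0.2, 0.0], [-0.1, 0.4]]))
assert np.isclose(probs.sum(), 1.0)
```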
The advantages of CARE over existing methods are illustrated by simulation studies and an application to inferring microbial ecological networks in the human gut."}, "https://arxiv.org/abs/2311.06086": {"title": "A three-step approach to production frontier estimation and the Matsuoka's distribution", "link": "https://arxiv.org/abs/2311.06086", "description": "arXiv:2311.06086v2 Announce Type: replace \nAbstract: In this work, we introduce a three-step semiparametric methodology for the estimation of production frontiers. We consider a model inspired by the well-known Cobb-Douglas production function, wherein input factors operate multiplicatively within the model. Efficiency in the proposed model is assumed to follow a continuous univariate uniparametric distribution in $(0,1)$, referred to as Matsuoka's distribution, which is discussed in detail. Following model linearization, the first step is to semiparametrically estimate the regression function through a local linear smoother. The second step focuses on the estimation of the efficiency parameter. Finally, we estimate the production frontier through a plug-in methodology. We present a rigorous asymptotic theory related to the proposed three-step estimation, including consistency, and asymptotic normality, and derive rates for the convergences presented. Incidentally, we also study the Matsuoka's distribution, deriving its main properties. The Matsuoka's distribution exhibits a versatile array of shapes capable of effectively encapsulating the typical behavior of efficiency within production frontier models. To complement the large sample results obtained, a Monte Carlo simulation study is conducted to assess the finite sample performance of the proposed three-step methodology. An empirical application using a dataset of Danish milk producers is also presented."}, "https://arxiv.org/abs/2311.15458": {"title": "Causal Models for Longitudinal and Panel Data: A Survey", "link": "https://arxiv.org/abs/2311.15458", "description": "arXiv:2311.15458v2 Announce Type: replace \nAbstract: This survey discusses the recent causal panel data literature. This recent literature has focused on credibly estimating causal effects of binary interventions in settings with longitudinal data, with an emphasis on practical advice for empirical researchers. It pays particular attention to heterogeneity in the causal effects, often in situations where few units are treated and with particular structures on the assignment pattern. The literature has extended earlier work on difference-in-differences or two-way-fixed-effect estimators. It has more generally incorporated factor models or interactive fixed effects. It has also developed novel methods using synthetic control approaches."}, "https://arxiv.org/abs/2312.08530": {"title": "Using Model-Assisted Calibration Methods to Improve Efficiency of Regression Analyses with Two-Phase Samples under Complex Survey Designs", "link": "https://arxiv.org/abs/2312.08530", "description": "arXiv:2312.08530v2 Announce Type: replace \nAbstract: Two-phase sampling designs are frequently employed in epidemiological studies and large-scale health surveys. In such designs, certain variables are exclusively collected within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or measurement assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. 
Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators. However, no existing methods provide appropriate calibration auxiliary variables while simultaneously considering the complex sample designs present in both the first- and second-phase samples in regression analyses. This paper proposes to calibrate the sample weights for the second-phase subsample to the weighted entire first-phase sample based on score functions of regression coefficients by using predictions of the covariate of interest, which can be computed for the entire first-phase sample. We establish the consistency of the proposed calibration estimation and provide variance estimation. Empirical evidence underscores the robustness of the calibration on score functions compared to the imputation method, which can be sensitive to misspecified prediction models for the variable only collected in the second phase. Examples using data from the National Health and Nutrition Examination Survey are provided."}, "https://arxiv.org/abs/2312.15496": {"title": "A Simple Bias Reduction for Chatterjee's Correlation", "link": "https://arxiv.org/abs/2312.15496", "description": "arXiv:2312.15496v2 Announce Type: replace \nAbstract: Chatterjee's rank correlation coefficient $\\xi_n$ is an empirical index for detecting functional dependencies between two variables $X$ and $Y$. It is an estimator for a theoretical quantity $\\xi$ that is zero for independence and one if $Y$ is a measurable function of $X$. Based on an equivalent characterization of sorted numbers, we derive an upper bound for $\\xi_n$ and suggest a simple normalization aimed at reducing its bias for small sample size $n$. In Monte Carlo simulations of various cases, the normalization reduced the bias in all cases. The mean squared error was reduced, too, for values of $\\xi$ greater than about 0.4. Moreover, we observed that non-parametric confidence intervals for $\\xi$ based on bootstrapping $\\xi_n$ in the usual n-out-of-n way have a coverage probability close to zero. This is remedied by an m-out-of-n bootstrap without replacement in combination with our normalization method."}, "https://arxiv.org/abs/2203.13776": {"title": "Sharp adaptive and pathwise stable similarity testing for scalar ergodic diffusions", "link": "https://arxiv.org/abs/2203.13776", "description": "arXiv:2203.13776v2 Announce Type: replace-cross \nAbstract: Within the nonparametric diffusion model, we develop a multiple test to infer about similarity of an unknown drift $b$ to some reference drift $b_0$: At prescribed significance, we simultaneously identify those regions where violation of similarity occurs, without a priori knowledge of their number, size and location. This test is shown to be minimax-optimal and adaptive. At the same time, the procedure is robust under small deviation from Brownian motion as the driving noise process. A detailed investigation for fractional driving noise, which is neither a semimartingale nor a Markov process, is provided for Hurst indices close to the Brownian motion case."}, "https://arxiv.org/abs/2306.15209": {"title": "Dynamic Reconfiguration of Brain Functional Network in Stroke", "link": "https://arxiv.org/abs/2306.15209", "description": "arXiv:2306.15209v2 Announce Type: replace-cross \nAbstract: The brain continually reorganizes its functional network to adapt to post-stroke functional impairments.
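The bias-reduction abstract above (arXiv:2312.15496) concerns Chatterjee's coefficient $\xi_n$. For orientation, the sketch below computes the standard ties-free version of $\xi_n$; the paper's upper bound and proposed normalization are not reproduced here.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n (version for data without ties):
    sort the pairs by x, rank the y's in that order, and use
        xi_n = 1 - 3 * sum_i |r_{i+1} - r_i| / (n^2 - 1)."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x, kind="stable")
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in x-order
    n = len(y)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)
```

Because the estimator is built from adjacent rank differences, its small-sample ceiling sits below 1, which is the bias the abstract's normalization targets.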
Previous studies using static modularity analysis have presented global-level behavior patterns of this network reorganization. However, it is far from understood how the brain reconfigures its functional network dynamically following a stroke. This study collected resting-state functional MRI data from 15 stroke patients, divided into mild (n = 6) and severe (n = 9) subgroups based on their clinical symptoms. Additionally, 15 age-matched healthy subjects were considered as controls. By applying a multilayer network method, a dynamic modular structure was recognized based on a time-resolved functional network. Then dynamic network measurements (recruitment, integration, and flexibility) were calculated to characterize the dynamic reconfiguration of post-stroke brain functional networks and hence reveal the neural functional rebuilding process. It was found from this investigation that severe patients tended to have reduced recruitment and increased between-network integration, while mild patients exhibited low network flexibility and less network integration. It is also noted that this severity-dependent alteration in network interaction could not be revealed by previous studies using static methods. Clinically, the knowledge of the diverse patterns of dynamic adjustment in brain functional networks observed from the brain signal could help in understanding the underlying mechanism of the motor, speech, and cognitive functional impairments caused by stroke. The proposed method could not only be used to evaluate patients' current brain status but also has the potential to provide insights into prognosis analysis and prediction."}, "https://arxiv.org/abs/2403.15670": {"title": "Computationally Scalable Bayesian SPDE Modeling for Censored Spatial Responses", "link": "https://arxiv.org/abs/2403.15670", "description": "arXiv:2403.15670v1 Announce Type: new \nAbstract: Observations of groundwater pollutants, such as arsenic or Perfluorooctane sulfonate (PFOS), are riddled with left censoring. These measurements have an impact on the health and lifestyle of the populace. Left censoring of these spatially correlated observations is usually addressed by applying Gaussian processes (GPs), which have theoretical advantages. However, this comes with a challenging computational complexity of $\\mathcal{O}(n^3)$, which is impractical for large datasets. Additionally, a sizable proportion of the data being left-censored creates further bottlenecks, since the likelihood computation now involves an intractable high-dimensional integral of the multivariate Gaussian density. In this article, we tackle these two problems simultaneously by approximating the GP with a Gaussian Markov random field (GMRF) approach that exploits an explicit link between a GP with Mat\\'ern correlation function and a GMRF using stochastic partial differential equations (SPDEs). We introduce a GMRF-based measurement error into the model, which alleviates the likelihood computation for the censored data, drastically improving the speed of the model while maintaining admirable accuracy. Our approach demonstrates robustness and substantial computational scalability, compared to state-of-the-art methods for censored spatial responses across various simulation settings.
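The stroke abstract above (arXiv:2306.15209) summarizes dynamic reconfiguration with recruitment, integration, and flexibility computed from time-resolved module assignments. The sketch below shows one common definition of node flexibility (the fraction of consecutive windows in which a node changes module); it assumes module labels are already available from a separate multilayer community detection step and is not the authors' pipeline.

```python
import numpy as np

def node_flexibility(module_assignments):
    """Flexibility of each node given a (n_windows, n_nodes) array of module
    labels from time-resolved community detection: the fraction of consecutive
    windows in which the node's module assignment changes."""
    labels = np.asarray(module_assignments)
    changes = labels[1:] != labels[:-1]       # (n_windows - 1, n_nodes) booleans
    return changes.mean(axis=0)

# Example: 4 time windows, 3 nodes
flex = node_flexibility([[1, 2, 2],
                         [1, 2, 3],
                         [1, 1, 3],
                         [1, 2, 3]])
# node 0 never switches (flexibility 0.0); node 1 switches in 2 of 3 transitions, etc.
```

Recruitment and integration are typically derived from a module-allegiance matrix (how often pairs of nodes share a module) rather than from single-node label changes, so they are not sketched here.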
Finally, the fit of this fully Bayesian model to the concentration of PFOS in groundwater available at 24,959 sites across California, where 46.62\\% responses are censored, produces prediction surface and uncertainty quantification in real time, thereby substantiating the applicability and scalability of the proposed method. Code for implementation is made available via GitHub."}, "https://arxiv.org/abs/2403.15755": {"title": "Optimized Model Selection for Estimating Treatment Effects from Costly Simulations of the US Opioid Epidemic", "link": "https://arxiv.org/abs/2403.15755", "description": "arXiv:2403.15755v1 Announce Type: new \nAbstract: Agent-based simulation with a synthetic population can help us compare different treatment conditions while keeping everything else constant within the same population (i.e., as digital twins). Such population-scale simulations require large computational power (i.e., CPU resources) to get accurate estimates for treatment effects. We can use meta models of the simulation results to circumvent the need to simulate every treatment condition. Selecting the best estimating model at a given sample size (number of simulation runs) is a crucial problem. Depending on the sample size, the ability of the method to estimate accurately can change significantly. In this paper, we discuss different methods to explore what model works best at a specific sample size. In addition to the empirical results, we provide a mathematical analysis of the MSE equation and how its components decide which model to select and why a specific method behaves that way in a range of sample sizes. The analysis showed why the direction estimation method is better than model-based methods in larger sample sizes and how the between-group variation and the within-group variation affect the MSE equation."}, "https://arxiv.org/abs/2403.15778": {"title": "Supervised Learning via Ensembles of Diverse Functional Representations: the Functional Voting Classifier", "link": "https://arxiv.org/abs/2403.15778", "description": "arXiv:2403.15778v1 Announce Type: new \nAbstract: Many conventional statistical and machine learning methods face challenges when applied directly to high dimensional temporal observations. In recent decades, Functional Data Analysis (FDA) has gained widespread popularity as a framework for modeling and analyzing data that are, by their nature, functions in the domain of time. Although supervised classification has been extensively explored in recent decades within the FDA literature, ensemble learning of functional classifiers has only recently emerged as a topic of significant interest. Thus, the latter subject presents unexplored facets and challenges from various statistical perspectives. The focal point of this paper lies in the realm of ensemble learning for functional data and aims to show how different functional data representations can be used to train ensemble members and how base model predictions can be combined through majority voting. The so-called Functional Voting Classifier (FVC) is proposed to demonstrate how different functional representations leading to augmented diversity can increase predictive accuracy. Many real-world datasets from several domains are used to display that the FVC can significantly enhance performance compared to individual models. 
The framework presented provides a foundation for voting ensembles with functional data and can stimulate a highly encouraging line of research in the FDA context."}, "https://arxiv.org/abs/2403.15802": {"title": "Augmented Doubly Robust Post-Imputation Inference for Proteomic Data", "link": "https://arxiv.org/abs/2403.15802", "description": "arXiv:2403.15802v1 Announce Type: new \nAbstract: Quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of proteins in molecular mechanisms. However, analysis of such data is challenging due to the large proportion of missing values. A common strategy to address this issue is to utilize an imputed dataset, which often introduces systematic bias into downstream analyses if the imputation errors are ignored. In this paper, we propose a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework combines powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data, and a parametric model to estimate the propensity score for debiasing imputed outcomes. Our estimator is compatible with the double machine learning framework and has provable properties. In application to both single-cell and bulk-cell proteomic data our method utilizes the imputed data to gain additional, meaningful discoveries and yet maintains good control of false positives."}, "https://arxiv.org/abs/2403.15862": {"title": "Non-monotone dependence modeling with copulas: an application to the volume-return relationship", "link": "https://arxiv.org/abs/2403.15862", "description": "arXiv:2403.15862v1 Announce Type: new \nAbstract: This paper introduces an innovative method for constructing copula models capable of describing arbitrary non-monotone dependence structures. The proposed method enables the creation of such copulas in parametric form, thus allowing the resulting models to adapt to diverse and intricate real-world data patterns. We apply this novel methodology to analyze the relationship between returns and trading volumes in financial markets, a domain where the existence of non-monotone dependencies is well-documented in the existing literature. Our approach exhibits superior adaptability compared to other models which have previously been proposed in the literature, enabling a deeper understanding of the dependence structure among the considered variables."}, "https://arxiv.org/abs/2403.15877": {"title": "Integrated path stability selection", "link": "https://arxiv.org/abs/2403.15877", "description": "arXiv:2403.15877v1 Announce Type: new \nAbstract: Stability selection is a widely used method for improving the performance of feature selection algorithms. However, stability selection has been found to be highly conservative, resulting in low sensitivity. Further, the theoretical bound on the expected number of false positives, E(FP), is relatively loose, making it difficult to know how many false positives to expect in practice. In this paper, we introduce a novel method for stability selection based on integrating the stability paths rather than maximizing over them. This yields a tighter bound on E(FP), resulting in a feature selection criterion that has higher sensitivity in practice and is better calibrated in terms of matching the target E(FP). 
Our proposed method requires the same amount of computation as the original stability selection algorithm, and only requires the user to specify one input parameter, a target value for E(FP). We provide theoretical bounds on performance, and demonstrate the method on simulations and real data from cancer gene expression studies."}, "https://arxiv.org/abs/2403.15910": {"title": "Difference-in-Differences with Unpoolable Data", "link": "https://arxiv.org/abs/2403.15910", "description": "arXiv:2403.15910v1 Announce Type: new \nAbstract: In this study, we identify and relax the assumption of data \"poolability\" in difference-in-differences (DID) estimation. Poolability, or the combination of observations from treated and control units into one dataset, is often not possible due to data privacy concerns. For instance, administrative health data stored in secure facilities is often not combinable across jurisdictions. We propose an innovative approach to estimate DID with unpoolable data: UN--DID. Our method incorporates adjustments for additional covariates, multiple groups, and staggered adoption. Without covariates, UN--DID and conventional DID give identical estimates of the average treatment effect on the treated (ATT). With covariates, we show mathematically and through simulations that UN--DID and conventional DID provide different, but equally informative, estimates of the ATT. An empirical example further underscores the utility of our methodology. The UN--DID method paves the way for more comprehensive analyses of policy impacts, even under data poolability constraints."}, "https://arxiv.org/abs/2403.15934": {"title": "Debiased Machine Learning when Nuisance Parameters Appear in Indicator Functions", "link": "https://arxiv.org/abs/2403.15934", "description": "arXiv:2403.15934v1 Announce Type: new \nAbstract: This paper studies debiased machine learning when nuisance parameters appear in indicator functions. An important example is maximized average welfare under optimal treatment assignment rules. For asymptotically valid inference for a parameter of interest, the current literature on debiased machine learning relies on Gateaux differentiability of the functions inside moment conditions, which does not hold when nuisance parameters appear in indicator functions. In this paper, we propose smoothing the indicator functions, and develop an asymptotic distribution theory for this class of models. The asymptotic behavior of the proposed estimator exhibits a trade-off between bias and variance due to smoothing. We study how a parameter which controls the degree of smoothing can be chosen optimally to minimize an upper bound of the asymptotic mean squared error. A Monte Carlo simulation supports the asymptotic distribution theory, and an empirical example illustrates the implementation of the method."}, "https://arxiv.org/abs/2403.15983": {"title": "Bayesian segmented Gaussian copula factor model for single-cell sequencing data", "link": "https://arxiv.org/abs/2403.15983", "description": "arXiv:2403.15983v1 Announce Type: new \nAbstract: Single-cell sequencing technologies have significantly advanced molecular and cellular biology, offering unprecedented insights into cellular heterogeneity by allowing for the measurement of gene expression at an individual cell level. However, the analysis of such data is challenged by the prevalence of low counts due to dropout events and the skewed nature of the data distribution, which conventional Gaussian factor models struggle to handle effectively. 
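The integrated path stability selection abstract above (arXiv:2403.15877) contrasts integrating the stability paths with maximizing over them. The sketch below computes only the lasso selection-frequency paths over subsamples and a lambda grid, the ingredient shared by both variants; the paper's integration rule and E(FP) bound are not given in the abstract and are not reproduced, and the subsample size and grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_paths(X, y, lambdas, n_subsamples=50, seed=0):
    """Selection-frequency paths: for each lambda, the fraction of size-n/2
    subsamples in which each feature enters the lasso model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        for k, lam in enumerate(lambdas):
            coef = Lasso(alpha=lam, max_iter=5000).fit(X[idx], y[idx]).coef_
            freq[k] += (coef != 0)
    return freq / n_subsamples

# Classical stability selection thresholds the maximum over the path,
# max_k freq[k, j]; an integrated variant instead summarizes the whole path,
# e.g. via freq.mean(axis=0) (a stand-in for the paper's integration).
```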
To address these challenges, we propose a novel Bayesian segmented Gaussian copula model to explicitly account for inflation of zero and near-zero counts, and to address the high skewness in the data. By employing a Dirichlet-Laplace prior for each column of the factor loadings matrix, we shrink the loadings of unnecessary factors towards zero, which leads to a simple approach to automatically determine the number of latent factors, and resolve the identifiability issue inherent in factor models due to the rotational invariance of the factor loadings matrix. Through simulation studies, we demonstrate the superior performance of our method over existing approaches in conducting factor analysis on data exhibiting the characteristics of single-cell data, such as excessive low counts and high skewness. Furthermore, we apply the proposed method to a real single-cell RNA-sequencing dataset from a lymphoblastoid cell line, successfully identifying biologically meaningful latent factors and detecting previously uncharacterized cell subtypes."}, "https://arxiv.org/abs/2403.16177": {"title": "The Informativeness of Combined Experimental and Observational Data under Dynamic Selection", "link": "https://arxiv.org/abs/2403.16177", "description": "arXiv:2403.16177v1 Announce Type: new \nAbstract: This paper addresses the challenge of estimating the Average Treatment Effect on the Treated Survivors (ATETS; Vikstrom et al., 2018) in the absence of long-term experimental data, utilizing available long-term observational data instead. We establish two theoretical results. First, it is impossible to obtain informative bounds for the ATETS with no model restriction and no auxiliary data. Second, to overturn this negative result, we explore as a promising avenue the recent econometric developments in combining experimental and observational data (e.g., Athey et al., 2020, 2019); we indeed find that exploiting short-term experimental data can be informative without imposing classical model restrictions. Furthermore, building on Chesher and Rosen (2017), we explore how to systematically derive sharp identification bounds, exploiting both the novel data-combination principles and classical model restrictions. Applying the proposed method, we explore what can be learned about the long-run effects of job training programs on employment without long-term experimental data."}, "https://arxiv.org/abs/2403.16256": {"title": "Covariate-adjusted marginal cumulative incidence curves for competing risk analysis", "link": "https://arxiv.org/abs/2403.16256", "description": "arXiv:2403.16256v1 Announce Type: new \nAbstract: Covariate imbalance between treatment groups makes it difficult to compare cumulative incidence curves in competing risk analyses. In this paper we discuss different methods to estimate adjusted cumulative incidence curves including inverse probability of treatment weighting and outcome regression modeling. For these methods to work, correct specification of the propensity score model or outcome regression model, respectively, is needed. We introduce a new doubly robust estimator, which requires correct specification of only one of the two models. We conduct a simulation study to assess the performance of these three methods, including scenarios with model misspecification of the relationship between covariates and treatment and/or outcome. 
We illustrate their usage in a cohort study of breast cancer patients estimating covariate-adjusted marginal cumulative incidence curves for recurrence, second primary tumour development and death after undergoing mastectomy treatment or breast-conserving therapy. Our study points out the advantages and disadvantages of each covariate adjustment method when applied in competing risk analysis."}, "https://arxiv.org/abs/2403.16283": {"title": "Sample Empirical Likelihood Methods for Causal Inference", "link": "https://arxiv.org/abs/2403.16283", "description": "arXiv:2403.16283v1 Announce Type: new \nAbstract: Causal inference is crucial for understanding the true impact of interventions, policies, or actions, enabling informed decision-making and providing insights into the underlying mechanisms that shape our world. In this paper, we establish a framework for the estimation and inference of average treatment effects using a two-sample empirical likelihood function. Two different approaches to incorporating propensity scores are developed. The first approach introduces propensity scores calibrated constraints in addition to the standard model-calibration constraints; the second approach uses the propensity scores to form weighted versions of the model-calibration constraints. The resulting estimators from both approaches are doubly robust. The limiting distributions of the two sample empirical likelihood ratio statistics are derived, facilitating the construction of confidence intervals and hypothesis tests for the average treatment effect. Bootstrap methods for constructing sample empirical likelihood ratio confidence intervals are also discussed for both approaches. Finite sample performances of the methods are investigated through simulation studies."}, "https://arxiv.org/abs/2403.16297": {"title": "Round Robin Active Sequential Change Detection for Dependent Multi-Channel Data", "link": "https://arxiv.org/abs/2403.16297", "description": "arXiv:2403.16297v1 Announce Type: new \nAbstract: This paper considers the problem of sequentially detecting a change in the joint distribution of multiple data sources under a sampling constraint. Specifically, the channels or sources generate observations that are independent over time, but not necessarily independent at any given time instant. The sources follow an initial joint distribution, and at an unknown time instant, the joint distribution of an unknown subset of sources changes. Importantly, there is a hard constraint that only a fixed number of sources are allowed to be sampled at each time instant. The goal is to sequentially observe the sources according to the constraint, and stop sampling as quickly as possible after the change while controlling the false alarm rate below a user-specified level. The sources can be selected dynamically based on the already collected data, and thus, a policy for this problem consists of a joint sampling and change-detection rule. 
A non-randomized policy is studied, and an upper bound is established on its worst-case conditional expected detection delay with respect to both the change point and the observations from the affected sources before the change."}, "https://arxiv.org/abs/2403.16544": {"title": "The Role of Mean Absolute Deviation Function in Obtaining Smooth Estimation for Distribution and Density Functions: Beta Regression Approach", "link": "https://arxiv.org/abs/2403.16544", "description": "arXiv:2403.16544v1 Announce Type: new \nAbstract: Smooth estimation of probability density and distribution functions from a sample is an attractive and important problem that has applications in several fields such as business, medicine, and the environment. This article introduces a simple but novel approach for estimating both functions in a single process, yielding smooth curves for both via the left mean absolute deviation (MAD) function and a beta regression approach. Our approach estimates both functions by smoothing the first derivative of the left MAD function to obtain the final optimal smooth estimates. The derivation of these final smooth estimates, under the conditions of a nondecreasing distribution function and a nonnegative density function, is performed by applying a polynomial beta regression to the first derivative of the left MAD function, where the polynomial degree is chosen among the models with the smallest mean absolute residuals under the constraint of nonnegativity for the first derivative of the regression vector of expected values. A general class of normal, logistic and Gumbel distributions is derived as proposed smooth estimators for the distribution and density functions using the logit, probit and cloglog links, respectively. This approach is applied to simulated data from unimodal, bimodal, tri-modal and skewed distributions, and an application to a real data set is given."}, "https://arxiv.org/abs/2403.16590": {"title": "Extremal properties of max-autoregressive moving average processes for modelling extreme river flows", "link": "https://arxiv.org/abs/2403.16590", "description": "arXiv:2403.16590v1 Announce Type: new \nAbstract: Max-autoregressive moving average (Max-ARMA) processes are powerful tools for modelling time series data with heavy-tailed behaviour; these are a non-linear version of the popular autoregressive moving average models. River flow data typically have features of heavy tails and non-linearity, as large precipitation events cause sudden spikes in the data that then exponentially decay. Therefore, stationary Max-ARMA models are a suitable candidate for capturing the unique temporal dependence structure exhibited by river flows. This paper contributes to advancing our understanding of the extremal properties of stationary Max-ARMA processes. We detail the first approach for deriving the extremal index, the lagged asymptotic dependence coefficient, and an efficient simulation for a general Max-ARMA process. We use the extremal properties, coupled with the belief that Max-ARMA processes provide only an approximation to extreme river flow, to fit such a model which can broadly capture river flow behaviour over a high threshold. We make our inference under a reparametrisation which gives a simpler parameter space that excludes cases where any parameter is non-identifiable.
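The Max-ARMA abstract above (arXiv:2403.16590) concerns extremal properties of max-autoregressive processes. As orientation, the sketch below simulates the simplest special case, a max-AR(1)/ARMAX process with unit-Frechet noise, whose extremal index is known to be 1 - alpha; the paper's general Max-ARMA simulation and derivations are not reproduced.

```python
import numpy as np

def simulate_max_ar1(n, alpha, seed=0, burn_in=100):
    """Simulate a max-AR(1) (ARMAX) process X_t = max(alpha * X_{t-1}, (1 - alpha) * Z_t)
    with unit-Frechet noise Z_t; this recursion preserves unit-Frechet margins."""
    rng = np.random.default_rng(seed)
    z = -1.0 / np.log(rng.uniform(size=n + burn_in))   # unit-Frechet draws
    x = np.empty(n + burn_in)
    x[0] = z[0]
    for t in range(1, n + burn_in):
        x[t] = max(alpha * x[t - 1], (1 - alpha) * z[t])
    return x[burn_in:]

# Large values occur in runs: with extremal index 1 - alpha, e.g. alpha = 0.7
# gives clusters of threshold exceedances with mean size about 1 / 0.3.
```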
We illustrate results for river flow data from the UK River Thames."}, "https://arxiv.org/abs/2403.16673": {"title": "Quasi-randomization tests for network interference", "link": "https://arxiv.org/abs/2403.16673", "description": "arXiv:2403.16673v1 Announce Type: new \nAbstract: Many classical inferential approaches fail to hold when interference exists among the population units. This amounts to the treatment status of one unit affecting the potential outcome of other units in the population. Testing for such spillover effects in this setting makes the null hypothesis non-sharp. An interesting approach to tackling the non-sharp nature of the null hypothesis in this setup is constructing conditional randomization tests such that the null is sharp on the restricted population. In randomized experiments, conditional randomized tests hold finite sample validity. Such approaches can pose computational challenges as finding these appropriate sub-populations based on experimental design can involve solving an NP-hard problem. In this paper, we view the network amongst the population as a random variable instead of being fixed. We propose a new approach that builds a conditional quasi-randomization test. Our main idea is to build the (non-sharp) null distribution of no spillover effects using random graph null models. We show that our method is exactly valid in finite-samples under mild assumptions. Our method displays enhanced power over other methods, with substantial improvement in complex experimental designs. We highlight that the method reduces to a simple permutation test, making it easy to implement in practice. We conduct a simulation study to verify the finite-sample validity of our approach and illustrate our methodology to test for interference in a weather insurance adoption experiment run in rural China."}, "https://arxiv.org/abs/2403.16706": {"title": "An alternative measure for quantifying the heterogeneity in meta-analysis", "link": "https://arxiv.org/abs/2403.16706", "description": "arXiv:2403.16706v1 Announce Type: new \nAbstract: Quantifying the heterogeneity is an important issue in meta-analysis, and among the existing measures, the $I^2$ statistic is most commonly used. In this paper, we first illustrate with a simple example that the $I^2$ statistic is heavily dependent on the study sample sizes, mainly because it is used to quantify the heterogeneity between the observed effect sizes. To reduce the influence of sample sizes, we introduce an alternative measure that aims to directly measure the heterogeneity between the study populations involved in the meta-analysis. We further propose a new estimator, namely the $I_A^2$ statistic, to estimate the newly defined measure of heterogeneity. For practical implementation, the exact formulas of the $I_A^2$ statistic are also derived under two common scenarios with the effect size as the mean difference (MD) or the standardized mean difference (SMD). Simulations and real data analysis demonstrate that the $I_A^2$ statistic provides an asymptotically unbiased estimator for the absolute heterogeneity between the study populations, and it is also independent of the study sample sizes as expected. To conclude, our newly defined $I_A^2$ statistic can be used as a supplemental measure of heterogeneity to monitor the situations where the study effect sizes are indeed similar with little biological difference. 
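The heterogeneity abstract above (arXiv:2403.16706) argues that $I^2$ depends heavily on study sample sizes. The sketch below computes the classical Q-based $I^2$ and illustrates that dependence with a stylized within-study variance of 1/n; the paper's $I_A^2$ formulas are not reproduced here.

```python
import numpy as np

def i_squared(effects, variances):
    """Classical I^2 from Cochran's Q: weights are inverse within-study
    variances, Q = sum w_i (y_i - y_bar)^2, and I^2 = max(0, (Q - df) / Q)."""
    y, v = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / v
    y_bar = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_bar) ** 2)
    df = len(y) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

# Same study-level effects, stylized within-study variances shrinking as 1/n:
effects = np.array([0.1, 0.3, 0.2, 0.4])
for n in (20, 200, 2000):
    print(n, round(i_squared(effects, np.full(4, 1.0 / n)), 3))
# I^2 climbs toward 1 as n grows even though the study effects are unchanged.
```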
In such a scenario, the fixed-effect model can be appropriate; nevertheless, when the sample sizes are sufficiently large, the $I^2$ statistic may still increase to 1 and subsequently suggest the random-effects model for meta-analysis."}, "https://arxiv.org/abs/2403.16773": {"title": "Privacy-Protected Spatial Autoregressive Model", "link": "https://arxiv.org/abs/2403.16773", "description": "arXiv:2403.16773v1 Announce Type: new \nAbstract: Spatial autoregressive (SAR) models are important tools for studying network effects. However, with an increasing emphasis on data privacy, data providers often implement privacy protection measures that make classical SAR models inapplicable. In this study, we introduce a privacy-protected SAR model with noise-added response and covariates to meet privacy-protection requirements. However, in this scenario, the traditional quasi-maximum likelihood estimator becomes infeasible because the likelihood function cannot be formulated. To address this issue, we first consider an explicit expression for the likelihood function with only noise-added responses. However, the derivatives are biased owing to the noise in the covariates. Therefore, we develop techniques that can correct the biases introduced by noise. Correspondingly, a Newton-Raphson-type algorithm is proposed to obtain the estimator, leading to a corrected likelihood estimator. To further enhance computational efficiency, we introduce a corrected least squares estimator based on the idea of bias correction. These two estimation methods ensure both data security and the attainment of statistically valid estimators. Theoretical analysis of both estimators is carefully conducted, and statistical inference methods are discussed. The finite sample performances of different methods are demonstrated through extensive simulations and the analysis of a real dataset."}, "https://arxiv.org/abs/2403.16813": {"title": "A Generalized Logrank-type Test for Comparison of Treatment Regimes in Sequential Multiple Assignment Randomized Trials", "link": "https://arxiv.org/abs/2403.16813", "description": "arXiv:2403.16813v1 Announce Type: new \nAbstract: The sequential multiple assignment randomized trial (SMART) is the ideal study design for the evaluation of multistage treatment regimes, which comprise sequential decision rules that recommend treatments for a patient at each of a series of decision points based on their evolving characteristics. A common goal is to compare the set of so-called embedded regimes represented in the design on the basis of a primary outcome of interest. In the study of chronic diseases and disorders, this outcome is often a time to an event, and a goal is to compare the distributions of the time-to-event outcome associated with each regime in the set. We present a general statistical framework in which we develop a logrank-type test for comparison of the survival distributions associated with regimes within a specified set based on the data from a SMART with an arbitrary number of stages that allows incorporation of covariate information to enhance efficiency and can also be used with data from an observational study. The framework provides clarification of the assumptions required to yield a principled test procedure, and the proposed test subsumes or offers an improved alternative to existing methods. We demonstrate performance of the methods in a suite of simulation studies.
The methods are applied to a SMART in patients with acute\n promyelocytic leukemia."}, "https://arxiv.org/abs/2403.16832": {"title": "Testing for sufficient follow-up in survival data with a cure fraction", "link": "https://arxiv.org/abs/2403.16832", "description": "arXiv:2403.16832v1 Announce Type: new \nAbstract: In order to estimate the proportion of `immune' or `cured' subjects who will never experience failure, a sufficiently long follow-up period is required. Several statistical tests have been proposed in the literature for assessing the assumption of sufficient follow-up, meaning that the study duration is longer than the support of the survival times for the uncured subjects. However, for practical purposes, the follow-up would be considered sufficiently long if the probability for the event to happen after the end of the study is very small. Based on this observation, we formulate a more relaxed notion of `practically' sufficient follow-up characterized by the quantiles of the distribution and develop a novel nonparametric statistical test. The proposed method relies mainly on the assumption of a non-increasing density function in the tail of the distribution. The test is then based on a shape constrained density estimator such as the Grenander or the kernel smoothed Grenander estimator and a bootstrap procedure is used for computation of the critical values. The performance of the test is investigated through an extensive simulation study, and the method is illustrated on breast cancer data."}, "https://arxiv.org/abs/2403.16844": {"title": "Resistant Inference in Instrumental Variable Models", "link": "https://arxiv.org/abs/2403.16844", "description": "arXiv:2403.16844v1 Announce Type: new \nAbstract: The classical tests in the instrumental variable model can behave arbitrarily if the data is contaminated. For instance, one outlying observation can be enough to change the outcome of a test. We develop a framework to construct testing procedures that are robust to weak instruments, outliers and heavy-tailed errors in the instrumental variable model. The framework is constructed upon M-estimators. By deriving the influence functions of the classical weak instrument robust tests, such as the Anderson-Rubin test, K-test and the conditional likelihood ratio (CLR) test, we prove their unbounded sensitivity to infinitesimal contamination. Therefore, we construct contamination resistant/robust alternatives. In particular, we show how to construct a robust CLR statistic based on Mallows type M-estimators and show that its asymptotic distribution is the same as that of the (classical) CLR statistic. The theoretical results are corroborated by a simulation study. Finally, we revisit three empirical studies affected by outliers and demonstrate how the new robust tests can be used in practice."}, "https://arxiv.org/abs/2403.16906": {"title": "Comparing basic statistical concepts with diagnostic probabilities based on directly observed proportions to help understand the replication crisis", "link": "https://arxiv.org/abs/2403.16906", "description": "arXiv:2403.16906v1 Announce Type: new \nAbstract: Instead of regarding an observed proportion as a sample from a population with an unknown parameter, diagnosticians intuitively use the observed proportion as a direct estimate of the posterior probability of a diagnosis. 
Therefore, a diagnostician might also regard a continuous Gaussian probability distribution of an outcome conditional on a study selection criterion as representing posterior probabilities. Fitting a distribution to its mean and standard deviation (SD) can be regarded as pooling data from an infinite number of imaginary or theoretical studies with an identical mean and SD but randomly different numerical values. For a distribution of possible means based on a SEM, the posterior probability Q of any theoretically true mean falling into a specified tail would be equal to the tail area as a proportion of the whole. If the reverse likelihood distribution of possible study means conditional on the same hypothetical tail threshold is assumed to be the same as the posterior probability distribution of means (as is customary), then by Bayes' rule the P value equals Q. Replication involves doing two independent studies, thus doubling the variance for the combined posterior probability distribution. Thus, if the original effect size was 1.96, the number of observations was 100, the SEM was 1 and the original P value was 0.025, the theoretical probability of a replicating study getting a P value of up to 0.025 again is only 0.283. By applying the double variance to power calculations, the required number of observations is doubled compared to conventional approaches. If these theoretical probabilities of replication are consistent with empirical replication study results, this might explain the replication crisis and make the concepts of statistics easier for diagnosticians and others to understand."}, "https://arxiv.org/abs/2403.15499": {"title": "A Causal Analysis of CO2 Reduction Strategies in Electricity Markets Through Machine Learning-Driven Metalearners", "link": "https://arxiv.org/abs/2403.15499", "description": "arXiv:2403.15499v1 Announce Type: cross \nAbstract: This study employs the Causal Machine Learning (CausalML) statistical method to analyze the influence of electricity pricing policies on carbon dioxide (CO2) levels in the household sector. Investigating the causality between potential outcomes and treatment effects, where changes in pricing policies are the treatment, our analysis challenges the conventional wisdom surrounding incentive-based electricity pricing. The study's findings suggest that adopting such policies may inadvertently increase CO2 intensity. Additionally, we integrate a machine learning-based meta-algorithm, reflecting a contemporary statistical approach, to enhance the depth of our causal analysis. The study conducts a comparative analysis of learners X, T, S, and R to ascertain the optimal methods based on the defined question's specified goals and contextual nuances. 
This research contributes valuable insights to the ongoing dialogue on sustainable development practices, emphasizing the importance of considering unintended consequences in policy formulation."}, "https://arxiv.org/abs/2403.15635": {"title": "Nonparametric inference of higher order interaction patterns in networks", "link": "https://arxiv.org/abs/2403.15635", "description": "arXiv:2403.15635v1 Announce Type: cross \nAbstract: We propose a method for obtaining parsimonious decompositions of networks into higher order interactions which can take the form of arbitrary motifs. The method is based on a class of analytically solvable generative models, where vertices are connected via explicit copies of motifs, which in combination with non-parametric priors allow us to infer higher order interactions from dyadic graph data without any prior knowledge of the types or frequencies of such interactions. Crucially, we also consider 'degree-corrected' models that correctly reflect the degree distribution of the network and consequently prove to be a better fit for many real-world networks compared to non-degree-corrected models. We test the presented approach on simulated data for which we recover the set of underlying higher order interactions to a high degree of accuracy. For empirical networks the method identifies concise sets of atomic subgraphs from within thousands of candidates that cover a large fraction of edges and include higher order interactions of known structural and functional significance. The method not only produces an explicit higher order representation of the network but also a fit of the network to analytically tractable models, opening new avenues for the systematic study of higher order network structures."}, "https://arxiv.org/abs/2403.15711": {"title": "Identifiable Latent Neural Causal Models", "link": "https://arxiv.org/abs/2403.15711", "description": "arXiv:2403.15711v1 Announce Type: cross \nAbstract: Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data. It is particularly good at predictions under unseen distribution shifts, because these shifts can generally be interpreted as consequences of interventions. Hence leveraging seen distribution shifts becomes a natural strategy to help identify causal representations, which in turn benefits predictions where distributions are previously unseen. Determining the types (or conditions) of such distribution shifts that do contribute to the identifiability of causal representations is critical. This work establishes a sufficient and necessary condition characterizing the types of distribution shifts for identifiability in the context of latent additive noise models. Furthermore, we present partial identifiability results when only a portion of distribution shifts meets the condition. In addition, we extend our findings to latent post-nonlinear causal models. We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations. Our algorithm, guided by our underlying theory, has demonstrated outstanding performance across a diverse range of synthetic and real-world datasets. 
The empirical observations align closely with the theoretical findings, affirming the robustness and effectiveness of our approach."}, "https://arxiv.org/abs/2403.15792": {"title": "Reviving pseudo-inverses: Asymptotic properties of large dimensional Moore-Penrose and Ridge-type inverses with applications", "link": "https://arxiv.org/abs/2403.15792", "description": "arXiv:2403.15792v1 Announce Type: cross \nAbstract: In this paper, we derive high-dimensional asymptotic properties of the Moore-Penrose inverse and the ridge-type inverse of the sample covariance matrix. In particular, the analytical expressions of the weighted sample trace moments are deduced for both generalized inverse matrices and are presented using the partial exponential Bell polynomials, which can easily be computed in practice. The existing results are extended in several directions: (i) First, the population covariance matrix is not assumed to be a multiple of the identity matrix; (ii) Second, the assumption of normality is not used in the derivation; (iii) Third, the asymptotic results are derived under the high-dimensional asymptotic regime. Our findings are used to construct improved shrinkage estimators of the precision matrix, which asymptotically minimize the quadratic loss with probability one. Finally, the finite sample properties of the derived theoretical results are investigated via an extensive simulation study."}, "https://arxiv.org/abs/2403.16031": {"title": "Learning Directed Acyclic Graphs from Partial Orderings", "link": "https://arxiv.org/abs/2403.16031", "description": "arXiv:2403.16031v1 Announce Type: cross \nAbstract: Directed acyclic graphs (DAGs) are commonly used to model causal relationships among random variables. In general, learning the DAG structure is both computationally and statistically challenging. Moreover, without additional information, the direction of edges may not be estimable from observational data. In contrast, given a complete causal ordering of the variables, the problem can be solved efficiently, even in high dimensions. In this paper, we consider the intermediate problem of learning DAGs when a partial causal ordering of variables is available. We propose a general estimation framework for leveraging the partial ordering and present efficient estimation algorithms for low- and high-dimensional problems. The advantages of the proposed framework are illustrated via numerical studies."}, "https://arxiv.org/abs/2403.16336": {"title": "Predictive Inference in Multi-environment Scenarios", "link": "https://arxiv.org/abs/2403.16336", "description": "arXiv:2403.16336v1 Announce Type: cross \nAbstract: We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include extensions for settings with non-real-valued responses and a theory of consistency for predictive inference in these general problems. 
We demonstrate a novel resizing method to adapt to problem difficulty, which applies both to existing approaches for predictive inference with hierarchical data and the methods we develop; this reduces prediction set sizes using limited information from the test environment, a key to the methods' practical performance, which we evaluate through neurochemical sensing and species classification datasets."}, "https://arxiv.org/abs/2403.16413": {"title": "Optimal testing in a class of nonregular models", "link": "https://arxiv.org/abs/2403.16413", "description": "arXiv:2403.16413v1 Announce Type: cross \nAbstract: This paper studies optimal hypothesis testing for nonregular statistical models with parameter-dependent support. We consider both one-sided and two-sided hypothesis testing and develop asymptotically uniformly most powerful tests based on the likelihood ratio process. The proposed one-sided test involves randomization to achieve asymptotic size control, some tuning constant to avoid discontinuities in the limiting likelihood ratio process, and a user-specified alternative hypothetical value to achieve the asymptotic optimality. Our two-sided test becomes asymptotically uniformly most powerful without imposing further restrictions such as unbiasedness. Simulation results illustrate desirable power properties of the proposed tests."}, "https://arxiv.org/abs/2403.16688": {"title": "Optimal convex $M$-estimation via score matching", "link": "https://arxiv.org/abs/2403.16688", "description": "arXiv:2403.16688v1 Announce Type: cross \nAbstract: In the context of linear regression, we construct a data-driven convex loss function with respect to which empirical risk minimisation yields optimal asymptotic variance in the downstream estimation of the regression coefficients. Our semiparametric approach targets the best decreasing approximation of the derivative of the log-density of the noise distribution. At the population level, this fitting process is a nonparametric extension of score matching, corresponding to a log-concave projection of the noise distribution with respect to the Fisher divergence. The procedure is computationally efficient, and we prove that our procedure attains the minimal asymptotic covariance among all convex $M$-estimators. As an example of a non-log-concave setting, for Cauchy errors, the optimal convex loss function is Huber-like, and our procedure yields an asymptotic efficiency greater than 0.87 relative to the oracle maximum likelihood estimator of the regression coefficients that uses knowledge of this error distribution; in this sense, we obtain robustness without sacrificing much efficiency. Numerical experiments confirm the practical merits of our proposal."}, "https://arxiv.org/abs/2403.16828": {"title": "Asymptotics of predictive distributions driven by sample means and variances", "link": "https://arxiv.org/abs/2403.16828", "description": "arXiv:2403.16828v1 Announce Type: cross \nAbstract: Let $\\alpha_n(\\cdot)=P\\bigl(X_{n+1}\\in\\cdot\\mid X_1,\\ldots,X_n\\bigr)$ be the predictive distributions of a sequence $(X_1,X_2,\\ldots)$ of $p$-variate random variables. Suppose $$\\alpha_n=\\mathcal{N}_p(M_n,Q_n)$$ where $M_n=\\frac{1}{n}\\sum_{i=1}^nX_i$ and $Q_n=\\frac{1}{n}\\sum_{i=1}^n(X_i-M_n)(X_i-M_n)^t$. Then, there is a random probability measure $\\alpha$ on $\\mathbb{R}^p$ such that $\\alpha_n\\rightarrow\\alpha$ weakly a.s. 
If $p\\in\\{1,2\\}$, one also obtains $\\lVert\\alpha_n-\\alpha\\rVert\\overset{a.s.}\\longrightarrow 0$ where $\\lVert\\cdot\\rVert$ is total variation distance. Moreover, the convergence rate of $\\lVert\\alpha_n-\\alpha\\rVert$ is arbitrarily close to $n^{-1/2}$. These results (apart from the one regarding the convergence rate) still apply even if $\\alpha_n=\\mathcal{L}_p(M_n,Q_n)$, where $\\mathcal{L}_p$ belongs to a class of distributions much larger than the normal. Finally, the asymptotic behavior of copula-based predictive distributions (introduced in [13]) is investigated and a numerical experiment is performed."}, "https://arxiv.org/abs/2008.12927": {"title": "Broadcasted Nonparametric Tensor Regression", "link": "https://arxiv.org/abs/2008.12927", "description": "arXiv:2008.12927v3 Announce Type: replace \nAbstract: We propose a novel use of a broadcasting operation, which distributes univariate functions to all entries of the tensor covariate, to model the nonlinearity in tensor regression nonparametrically. A penalized estimation and the corresponding algorithm are proposed. Our theoretical investigation, which allows the dimensions of the tensor covariate to diverge, indicates that the proposed estimation yields a desirable convergence rate. We also provide a minimax lower bound, which characterizes the optimality of the proposed estimator for a wide range of scenarios. Numerical experiments are conducted to confirm the theoretical findings, and they show that the proposed model has advantages over its existing linear counterparts."}, "https://arxiv.org/abs/2010.15864": {"title": "Identification and Estimation of Unconditional Policy Effects of an Endogenous Binary Treatment: An Unconditional MTE Approach", "link": "https://arxiv.org/abs/2010.15864", "description": "arXiv:2010.15864v5 Announce Type: replace \nAbstract: This paper studies the identification and estimation of policy effects when treatment status is binary and endogenous. We introduce a new class of marginal treatment effects (MTEs) based on the influence function of the functional underlying the policy target. We show that an unconditional policy effect can be represented as a weighted average of the newly defined MTEs over the individuals who are indifferent about their treatment status. We provide conditions for point identification of the unconditional policy effects. When a quantile is the functional of interest, we introduce the UNconditional Instrumental Quantile Estimator (UNIQUE) and establish its consistency and asymptotic distribution. In the empirical application, we estimate the effect of changing college enrollment status, induced by higher tuition subsidy, on the quantiles of the wage distribution."}, "https://arxiv.org/abs/2102.01155": {"title": "G-Formula for Observational Studies under Stratified Interference, with Application to Bed Net Use on Malaria", "link": "https://arxiv.org/abs/2102.01155", "description": "arXiv:2102.01155v2 Announce Type: replace \nAbstract: Assessing population-level effects of vaccines and other infectious disease prevention measures is important to the field of public health. In infectious disease studies, one person's treatment may affect another individual's outcome, i.e., there may be interference between units. For example, the use of bed nets to prevent malaria by one individual may have an indirect effect on other individuals living in close proximity. 
In some settings, individuals may form groups or clusters where interference only occurs within groups, i.e., there is partial interference. Inverse probability weighted estimators have previously been developed for observational studies with partial interference. Unfortunately, these estimators are not well suited for studies with large clusters. Therefore, in this paper, the parametric g-formula is extended to allow for partial interference. G-formula estimators are proposed for overall effects, effects when treated, and effects when untreated. The proposed estimators can accommodate large clusters and do not suffer from the g-null paradox that may occur in the absence of interference. The large sample properties of the proposed estimators are derived assuming no unmeasured confounders and that the partial interference takes a particular form (referred to as `weak stratified interference'). Simulation studies are presented demonstrating the finite-sample performance of the proposed estimators. The Demographic and Health Survey from the Democratic Republic of the Congo is then analyzed using the proposed g-formula estimators to assess the effects of bed net use on malaria."}, "https://arxiv.org/abs/2109.03087": {"title": "An unbiased estimator of the case fatality rate", "link": "https://arxiv.org/abs/2109.03087", "description": "arXiv:2109.03087v2 Announce Type: replace \nAbstract: During an epidemic outbreak of a new disease, computing the probability of dying once infected is considered an important though difficult task. Since it is very hard to know the true number of infected people, the focus is placed on estimating the case fatality rate, which is defined as the probability of dying once tested and confirmed as infected. The estimation of this rate at the beginning of an epidemic remains challenging for several reasons, including the time gap between diagnosis and death, and the rapid growth in the number of confirmed cases. In this work, an unbiased estimator of the case fatality rate of a virus is presented. The consistency of the estimator is demonstrated, and its asymptotic distribution is derived, enabling the corresponding confidence intervals (C.I.) to be established. The proposed method is based on the distribution F of the time between confirmation and death of individuals who die because of the virus. The estimator's performance is analyzed in both simulation scenarios and the real-world context of Argentina in 2020 for the COVID-19 pandemic, consistently achieving excellent results when compared to an existing proposal as well as to the conventional 'naive' estimator that was employed to report the case fatality rates during the last COVID-19 pandemic. In the simulated scenarios, the empirical coverage of our C.I. is studied, both using the F employed to generate the data and an estimated F, and it is observed that the desired level of confidence is reached quickly when using the real F and in a reasonable period of time when estimating F."}, "https://arxiv.org/abs/2212.09961": {"title": "Uncertainty Quantification of MLE for Entity Ranking with Covariates", "link": "https://arxiv.org/abs/2212.09961", "description": "arXiv:2212.09961v2 Announce Type: replace \nAbstract: This paper concerns statistical estimation and inference for ranking problems based on pairwise comparisons with additional covariate information such as the attributes of the compared items. 
Despite extensive studies, few prior works investigate this problem under the more realistic setting where covariate information exists. To tackle this issue, we propose a novel model, the Covariate-Assisted Ranking Estimation (CARE) model, that extends the well-known Bradley-Terry-Luce (BTL) model by incorporating the covariate information. Specifically, instead of assuming every compared item has a fixed latent score $\\{\\theta_i^*\\}_{i=1}^n$, we assume the underlying scores are given by $\\{\\alpha_i^*+{x}_i^\\top\\beta^*\\}_{i=1}^n$, where $\\alpha_i^*$ and ${x}_i^\\top\\beta^*$ represent the latent baseline and covariate scores of the $i$-th item, respectively. We impose natural identifiability conditions and derive the $\\ell_{\\infty}$- and $\\ell_2$-optimal rates for the maximum likelihood estimator of $\\{\\alpha_i^*\\}_{i=1}^{n}$ and $\\beta^*$ under a sparse comparison graph, using a novel `leave-one-out' technique (Chen et al., 2019). To conduct statistical inferences, we further derive asymptotic distributions for the MLE of $\\{\\alpha_i^*\\}_{i=1}^n$ and $\\beta^*$ with minimal sample complexity. This allows us to answer the question of whether some covariates have any explanatory power for latent scores and to threshold some sparse parameters to improve the ranking performance. We improve the approximation method used in (Gao et al., 2021) for the BTL model and generalize it to the CARE model. Moreover, we validate our theoretical results through large-scale numerical studies and an application to the mutual fund stock holding dataset."}, "https://arxiv.org/abs/2303.10016": {"title": "Improving instrumental variable estimators with post-stratification", "link": "https://arxiv.org/abs/2303.10016", "description": "arXiv:2303.10016v2 Announce Type: replace \nAbstract: Experiments studying get-out-the-vote (GOTV) efforts estimate the causal effect of various mobilization efforts on voter turnout. However, there is often substantial noncompliance in these studies. A usual approach is to use an instrumental variable (IV) analysis to estimate impacts for compliers, here being those actually contacted by the investigators. Unfortunately, popular IV estimators can be unstable in studies with a small fraction of compliers. We explore post-stratifying the data (e.g., taking a weighted average of IV estimates within each stratum) using variables that predict complier status (and, potentially, the outcome) to mitigate this. We present the benefits of post-stratification in terms of bias, variance, and improved standard error estimates, and provide a finite-sample asymptotic variance formula. We also compare the performance of different IV approaches and discuss the advantages of our design-based post-stratification approach over incorporating compliance-predictive covariates into the two-stage least squares estimator. In the end, we show that covariates predictive of compliance can increase precision, but only if one is willing to make a bias-variance trade-off by down-weighting or dropping strata with few compliers. By contrast, standard approaches such as two-stage least squares fail to use such information. 
We finally examine the benefits of our approach in two GOTV applications."}, "https://arxiv.org/abs/2304.04712": {"title": "Testing for linearity in scalar-on-function regression with responses missing at random", "link": "https://arxiv.org/abs/2304.04712", "description": "arXiv:2304.04712v2 Announce Type: replace \nAbstract: A goodness-of-fit test for the Functional Linear Model with Scalar Response (FLMSR) with responses Missing at Random (MAR) is proposed in this paper. The test statistic relies on a marked empirical process indexed by the projected functional covariate and its distribution under the null hypothesis is calibrated using a wild bootstrap procedure. The computation and performance of the test rely on having an accurate estimator of the functional slope of the FLMSR when the sample has MAR responses. Three estimation methods based on the Functional Principal Components (FPCs) of the covariate are considered. First, the simplified method estimates the functional slope by simply discarding observations with missing responses. Second, the imputed method estimates the functional slope by imputing the missing responses using the simplified estimator. Third, the inverse probability weighted method incorporates the missing response generation mechanism when imputing. Furthermore, both cross-validation and LASSO regression are used to select the FPCs used by each estimator. Several Monte Carlo experiments are conducted to analyze the behavior of the testing procedure in combination with the functional slope estimators. Results indicate that estimators performing missing-response imputation achieve the highest power. The testing procedure is applied to check for linear dependence between the average number of sunny days per year and the mean curve of daily temperatures at weather stations in Spain."}, "https://arxiv.org/abs/2306.14693": {"title": "Conformal link prediction for false discovery rate control", "link": "https://arxiv.org/abs/2306.14693", "description": "arXiv:2306.14693v2 Announce Type: replace \nAbstract: Most link prediction methods return estimates of the connection probability of missing edges in a graph. Such output can be used to rank the missing edges from most to least likely to be a true edge, but does not directly provide a classification into true and non-existent. In this work, we consider the problem of identifying a set of true edges with a control of the false discovery rate (FDR). We propose a novel method based on high-level ideas from the literature on conformal inference. The graph structure induces intricate dependence in the data, which we carefully take into account, as this makes the setup different from the usual setup in conformal inference, where data exchangeability is assumed. The FDR control is empirically demonstrated for both simulated and real data."}, "https://arxiv.org/abs/2306.16715": {"title": "Causal Meta-Analysis by Integrating Multiple Observational Studies with Multivariate Outcomes", "link": "https://arxiv.org/abs/2306.16715", "description": "arXiv:2306.16715v3 Announce Type: replace \nAbstract: Integrating multiple observational studies to make unconfounded causal or descriptive comparisons of group potential outcomes in a large natural population is challenging. Moreover, retrospective cohorts, being convenience samples, are usually unrepresentative of the natural population of interest and have groups with unbalanced covariates. 
We propose a general covariate-balancing framework based on pseudo-populations that extends established weighting methods to the meta-analysis of multiple retrospective cohorts with multiple groups. Additionally, by maximizing the effective sample sizes of the cohorts, we propose a FLEXible, Optimized, and Realistic (FLEXOR) weighting method appropriate for integrative analyses. We develop new weighted estimators for unconfounded inferences on wide-ranging population-level features and estimands relevant to group comparisons of quantitative, categorical, or multivariate outcomes. The asymptotic properties of these estimators are examined. Through simulation studies and meta-analyses of TCGA datasets, we demonstrate the versatility and reliability of the proposed weighting strategy, especially for the FLEXOR pseudo-population."}, "https://arxiv.org/abs/2307.02603": {"title": "Bayesian Structure Learning in Undirected Gaussian Graphical Models: Literature Review with Empirical Comparison", "link": "https://arxiv.org/abs/2307.02603", "description": "arXiv:2307.02603v2 Announce Type: replace \nAbstract: Gaussian graphical models provide a powerful framework to reveal the conditional dependency structure between multivariate variables. The process of uncovering the conditional dependency network is known as structure learning. Bayesian methods can measure the uncertainty of conditional relationships and include prior information. However, frequentist methods are often preferred due to the computational burden of the Bayesian approach. Over the last decade, Bayesian methods have seen substantial improvements, with some now capable of generating accurate estimates of graphs up to a thousand variables in mere minutes. Despite these advancements, a comprehensive review or empirical comparison of all recent methods has not been conducted. This paper delves into a wide spectrum of Bayesian approaches used for structure learning and evaluates their efficacy through a simulation study. We also demonstrate how to apply Bayesian structure learning to a real-world data set and provide directions for future research. This study gives an exhaustive overview of this dynamic field for newcomers, practitioners, and experts."}, "https://arxiv.org/abs/2311.02658": {"title": "Nonparametric Estimation and Comparison of Distance Distributions from Censored Data", "link": "https://arxiv.org/abs/2311.02658", "description": "arXiv:2311.02658v4 Announce Type: replace \nAbstract: Transportation distance information is a powerful resource, but location records are often censored due to privacy concerns or regulatory mandates. We outline methods to approximate, sample from, and compare distributions of distances between censored location pairs, a task with applications to public health informatics, logistics, and more. We validate empirically via simulation and demonstrate applicability to practical geospatial data analysis tasks."}, "https://arxiv.org/abs/2312.06437": {"title": "Posterior Ramifications of Prior Dependence Structures", "link": "https://arxiv.org/abs/2312.06437", "description": "arXiv:2312.06437v2 Announce Type: replace \nAbstract: In fully Bayesian analyses, prior distributions are specified before observing data. Prior elicitation methods transfigure prior information into quantifiable prior distributions. Recently, methods that leverage copulas have been proposed to accommodate more flexible dependence structures when eliciting multivariate priors. 
We prove that under broad conditions, the posterior cannot retain many of these flexible prior dependence structures in large-sample settings. We emphasize the impact of this result by overviewing several objectives for prior specification to help practitioners select prior dependence structures that align with their objectives for posterior analysis. Because correctly specifying the dependence structure a priori can be difficult, we consider how the choice of prior copula impacts the posterior distribution in terms of asymptotic convergence of the posterior mode. Our resulting recommendations streamline the prior elicitation process."}, "https://arxiv.org/abs/2009.13961": {"title": "Online Action Learning in High Dimensions: A Conservative Perspective", "link": "https://arxiv.org/abs/2009.13961", "description": "arXiv:2009.13961v4 Announce Type: replace-cross \nAbstract: Sequential learning problems are common in several fields of research and practical applications. Examples include dynamic pricing and assortment and the design of auctions and incentives, and they permeate a large number of sequential treatment experiments. In this paper, we extend one of the most popular learning solutions, the $\\epsilon_t$-greedy heuristics, to high-dimensional contexts considering a conservative directive. We do this by allocating part of the time the original rule uses to adopt completely new actions to a more focused search in a restrictive set of promising actions. The resulting rule might be useful for practical applications that still value surprises, although at a decreasing rate, while also having restrictions on the adoption of unusual actions. With high probability, we find reasonable bounds for the cumulative regret of a conservative high-dimensional decaying $\\epsilon_t$-greedy rule. Also, we provide a lower bound for the cardinality of the set of viable actions that implies an improved regret bound for the conservative version when compared to its non-conservative counterpart. Additionally, we show that end-users have sufficient flexibility when establishing how much safety they want, since it can be tuned without impacting theoretical properties. We illustrate our proposal both in a simulation exercise and using a real dataset."}, "https://arxiv.org/abs/2211.09284": {"title": "Iterative execution of discrete and inverse discrete Fourier transforms with applications for signal denoising via sparsification", "link": "https://arxiv.org/abs/2211.09284", "description": "arXiv:2211.09284v3 Announce Type: replace-cross \nAbstract: We describe a family of iterative algorithms that involve the repeated execution of discrete and inverse discrete Fourier transforms. One interesting member of this family is motivated by the discrete Fourier transform uncertainty principle and involves the application of a sparsification operation to both the real domain and frequency domain data with convergence obtained when real domain sparsity hits a stable pattern. This sparsification variant has practical utility for signal denoising, in particular the recovery of a periodic spike signal in the presence of Gaussian noise. General convergence properties and denoising performance relative to existing methods are demonstrated using simulation studies. 
An R package implementing this technique and related resources can be found at https://hrfrost.host.dartmouth.edu/IterativeFT."}, "https://arxiv.org/abs/2306.17667": {"title": "Bias-Free Estimation of Signals on Top of Unknown Backgrounds", "link": "https://arxiv.org/abs/2306.17667", "description": "arXiv:2306.17667v2 Announce Type: replace-cross \nAbstract: We present a method for obtaining unbiased signal estimates in the presence of a significant unknown background, eliminating the need for a parametric model for the background itself. Our approach is based on a minimal set of conditions for observation and background estimators, which are typically satisfied in practical scenarios. To showcase the effectiveness of our method, we apply it to simulated data from the planned dielectric axion haloscope MADMAX."}, "https://arxiv.org/abs/2311.02695": {"title": "Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions", "link": "https://arxiv.org/abs/2311.02695", "description": "arXiv:2311.02695v2 Announce Type: replace-cross \nAbstract: The task of inferring high-level causal variables from low-level observations, commonly referred to as causal representation learning, is fundamentally underconstrained. As such, recent works to address this problem focus on various assumptions that lead to identifiability of the underlying latent causal variables. A large corpus of these preceding approaches considers multi-environment data collected under different interventions on the causal model. What is common to virtually all of these works is the restrictive assumption that in each environment, only a single variable is intervened on. In this work, we relax this assumption and provide the first identifiability result for causal representation learning that allows for multiple variables to be targeted by an intervention within one environment. Our approach hinges on a general assumption on the coverage and diversity of interventions across environments, which also includes the shared assumption of single-node interventions of previous works. The main idea behind our approach is to exploit the trace that interventions leave on the variance of the ground truth causal variables and to regularize for a specific notion of sparsity with respect to this trace. In addition to and inspired by our theoretical contributions, we present a practical algorithm to learn causal representations from multi-node interventional data and provide empirical evidence that validates our identifiability results."}, "https://arxiv.org/abs/2403.17087": {"title": "Sparse inference in Poisson Log-Normal model by approximating the L0-norm", "link": "https://arxiv.org/abs/2403.17087", "description": "arXiv:2403.17087v1 Announce Type: new \nAbstract: Variable selection methods are required in practical statistical modeling, to identify and include only the most relevant predictors, thereby improving model interpretability. Such variable selection methods are typically employed in regression models, for instance in this article for the Poisson Log Normal model (PLN, Chiquet et al., 2021). This model aims to explain multivariate count data using dependent variables, and its utility has been demonstrated in scientific fields such as ecology and agronomy. In the case of the PLN model, most recent papers focus on sparse network inference through the combination of the likelihood with an L1-penalty on the precision matrix. 
In this paper, we propose to rely on a recent penalization method (SIC, O'Neill and Burke, 2023), which consists of smoothly approximating the L0-penalty and avoids the calibration of a tuning parameter with a cross-validation procedure. Moreover, this work focuses on the coefficient matrix of the PLN model and establishes an inference procedure ensuring effective variable selection performance, so that the resulting fitted model explains multivariate count data using only relevant explanatory variables. Our proposal involves implementing a procedure that integrates the SIC penalization algorithm (epsilon-telescoping) and the PLN model fitting algorithm (a variational EM algorithm). To support our proposal, we provide theoretical results and insights about the penalization method, and we perform simulation studies to assess the method, which is also applied to real datasets."}, "https://arxiv.org/abs/2403.17117": {"title": "Covariate-adjusted Group Sequential Comparisons of Survival Probabilities", "link": "https://arxiv.org/abs/2403.17117", "description": "arXiv:2403.17117v1 Announce Type: new \nAbstract: In confirmatory clinical trials, survival outcomes are frequently studied and interim analyses for efficacy and/or futility are often desirable. Methods such as the log rank test and Cox regression model are commonly used to compare treatments in this setting. They rely on a proportional hazards (PH) assumption and are subject to type I error rate inflation and loss of power when PH are violated. Such violations may be expected a priori, particularly when the mechanisms of treatments differ, such as immunotherapy vs. chemotherapy for treating cancer. We develop group sequential tests for comparing survival curves with covariate adjustment that allow for interim analyses in the presence of non-PH and offer easily interpreted, clinically meaningful summary measures of the treatment effect. The joint distribution of repeatedly computed test statistics converges to the canonical joint distribution with a Markov structure. The asymptotic distribution of the test statistics allows marginal comparisons of survival probabilities at multiple fixed time points and facilitates both critical value specification to maintain type I error control and sample size/power determination. Simulations demonstrate that the achieved type I error rate and power of the proposed tests meet targeted levels and are robust to the PH assumption and covariate influence. The proposed tests are illustrated using a clinical trial dataset from the Blood and Marrow Transplant Clinical Trials Network 1101 trial."}, "https://arxiv.org/abs/2403.17121": {"title": "High-dimensional Factor Analysis for Network-linked Data", "link": "https://arxiv.org/abs/2403.17121", "description": "arXiv:2403.17121v1 Announce Type: new \nAbstract: Factor analysis is a widely used statistical tool in many scientific disciplines, such as psychology, economics, and sociology. As observations linked by networks become increasingly common, incorporating network structures into factor analysis remains an open problem. In this paper, we focus on high-dimensional factor analysis involving network-connected observations, and propose a generalized factor model with latent factors that account for both the network structure and the dependence structure among high-dimensional variables. These latent factors can be shared by the high-dimensional variables and the network, or exclusively applied to either of them. 
We develop a computationally efficient estimation procedure and establish asymptotic inferential theories. Notably, we show that by borrowing information from the network, the proposed estimator of the factor loading matrix achieves optimal asymptotic variance under much milder identifiability constraints than existing literature. Furthermore, we develop a hypothesis testing procedure to tackle the challenge of discerning the shared and individual latent factors' structure. The finite sample performance of the proposed method is demonstrated through simulation studies and a real-world dataset involving a statistician co-authorship network."}, "https://arxiv.org/abs/2403.17127": {"title": "High-Dimensional Mean-Variance Spanning Tests", "link": "https://arxiv.org/abs/2403.17127", "description": "arXiv:2403.17127v1 Announce Type: new \nAbstract: We introduce a new framework for the mean-variance spanning (MVS) hypothesis testing. The procedure can be applied to any test-asset dimension and only requires stationary asset returns and the number of benchmark assets to be smaller than the number of time periods. It involves individually testing moment conditions using a robust Student-t statistic based on the batch-mean method and combining the p-values using the Cauchy combination test. Simulations demonstrate the superior performance of the test compared to state-of-the-art approaches. For the empirical application, we look at the problem of domestic versus international diversification in equities. We find that the advantages of diversification are influenced by economic conditions and exhibit cross-country variation. We also highlight that the rejection of the MVS hypothesis originates from the potential to reduce variance within the domestic global minimum-variance portfolio."}, "https://arxiv.org/abs/2403.17132": {"title": "A Personalized Predictive Model that Jointly Optimizes Discrimination and Calibration", "link": "https://arxiv.org/abs/2403.17132", "description": "arXiv:2403.17132v1 Announce Type: new \nAbstract: Precision medicine is accelerating rapidly in the field of health research. This includes fitting predictive models for individual patients based on patient similarity in an attempt to improve model performance. We propose an algorithm which fits a personalized predictive model (PPM) using an optimal size of a similar subpopulation that jointly optimizes model discrimination and calibration, as it is criticized that calibration is not assessed nearly as often as discrimination despite poorly calibrated models being potentially misleading. We define a mixture loss function that considers model discrimination and calibration, and allows for flexibility in emphasizing one performance measure over another. We empirically show that the relationship between the size of subpopulation and calibration is quadratic, which motivates the development of our jointly optimized model. 
We also investigate the effect of within-population patient weighting on performance and conclude that the size of subpopulation has a larger effect on the predictive performance of the PPM compared to the choice of weight function."}, "https://arxiv.org/abs/2403.17257": {"title": "Statistical Inference on Hierarchical Simultaneous Autoregressive Models with Missing Data", "link": "https://arxiv.org/abs/2403.17257", "description": "arXiv:2403.17257v1 Announce Type: new \nAbstract: Efficient estimation methods for simultaneous autoregressive (SAR) models with missing data in the response variable have been well-developed in the literature. It is common practice to introduce a measurement error into SAR models. The measurement error serves to distinguish the noise component from the spatial process. However, the previous literature has not considered adding a measurement error to the SAR models with missing data. The maximum likelihood estimation for such models with large datasets is challenging and computationally expensive. This paper proposes two efficient likelihood-based estimation methods: the marginal maximum likelihood (ML) and expectation-maximisation (EM) algorithms for estimating SAR models with both measurement errors and missing data in the response variable. The spatial error model (SEM) and the spatial autoregressive model (SAM), two popular SAR model types, are considered. The missing data mechanism is assumed to follow missing at random (MAR). While naive calculation approaches lead to computational complexities of $O(n^3)$, where n is the total number of observations, our computational approaches for both the marginal ML and EM algorithms are designed to reduce the computational complexity. The performance of the proposed methods is investigated empirically using simulated and real datasets."}, "https://arxiv.org/abs/2403.17318": {"title": "Statistical analysis and method to propagate the impact of measurement uncertainty on dynamic mode decomposition", "link": "https://arxiv.org/abs/2403.17318", "description": "arXiv:2403.17318v1 Announce Type: new \nAbstract: We apply random matrix theory to study the impact of measurement uncertainty on dynamic mode decomposition. Specifically, when the measurements follow a normal probability density function, we show how the moments of that density propagate through the dynamic mode decomposition. While we focus on the first and second moments, the analytical expressions we derive are general and can be extended to higher-order moments. Further, the proposed numerical method to propagate uncertainty is agnostic of specific dynamic mode decomposition formulations. Of particular relevance, the estimated second moments provide confidence bounds that may be used as a metric of trustworthiness, that is, how much one can rely on a finite-dimensional linear operator to represent an underlying dynamical system. We perform numerical experiments on two canonical systems and verify the estimated confidence levels by comparing the moments to those obtained from Monte Carlo simulations."}, "https://arxiv.org/abs/2403.17321": {"title": "A Bayesian shrinkage estimator for transfer learning", "link": "https://arxiv.org/abs/2403.17321", "description": "arXiv:2403.17321v1 Announce Type: new \nAbstract: Transfer learning (TL) has emerged as a powerful tool to supplement data collected for a target task with data collected for a related source task. 
The Bayesian framework is natural for TL because information from the source data can be incorporated in the prior distribution for the target data analysis. In this paper, we propose and study Bayesian TL methods for the normal-means problem and multiple linear regression. We propose two classes of prior distributions. The first class assumes the difference in the parameters for the source and target tasks is sparse, i.e., many parameters are shared across tasks. The second assumes that none of the parameters are shared across tasks, but the differences are bounded in $\\ell_2$-norm. For the sparse case, we propose a Bayes shrinkage estimator with theoretical guarantees under mild assumptions. The proposed methodology is tested on synthetic data and outperforms state-of-the-art TL methods. We then use this method to fine-tune the last layer of a neural network model to predict the molecular gap property in a material science application. We report improved performance compared to classical fine tuning and methods using only the target data."}, "https://arxiv.org/abs/2403.17481": {"title": "A Type of Nonlinear Fr\\'echet Regressions", "link": "https://arxiv.org/abs/2403.17481", "description": "arXiv:2403.17481v1 Announce Type: new \nAbstract: The existing Fr\\'echet regression is actually defined within a linear framework, since the weight function in the Fr\\'echet objective function is linearly defined, and the resulting Fr\\'echet regression function is identified to be a linear model when the random object belongs to a Hilbert space. Even for nonparametric and semiparametric Fr\\'echet regressions, which are usually nonlinear, the existing methods handle them by local linear (or local polynomial) technique, and the resulting Fr\\'echet regressions are (locally) linear as well. We in this paper introduce a type of nonlinear Fr\\'echet regressions. Such a framework can be utilized to fit the essentially nonlinear models in a general metric space and uniquely identify the nonlinear structure in a Hilbert space. Particularly, its generalized linear form can return to the standard linear Fr\\'echet regression through a special choice of the weight function. Moreover, the generalized linear form possesses methodological and computational simplicity because the Euclidean variable and the metric space element are completely separable. The favorable theoretical properties (e.g. the estimation consistency and presentation theorem) of the nonlinear Fr\\'echet regressions are established systemically. The comprehensive simulation studies and a human mortality data analysis demonstrate that the new strategy is significantly better than the competitors."}, "https://arxiv.org/abs/2403.17489": {"title": "Adaptive Bayesian Structure Learning of DAGs With Non-conjugate Prior", "link": "https://arxiv.org/abs/2403.17489", "description": "arXiv:2403.17489v1 Announce Type: new \nAbstract: Directed Acyclic Graphs (DAGs) are solid structures used to describe and infer the dependencies among variables in multivariate scenarios. Having a thorough comprehension of the accurate DAG-generating model is crucial for causal discovery and estimation. Our work suggests utilizing a non-conjugate prior for Gaussian DAG structure learning to enhance the posterior probability. We employ the idea of using the Bessel function to address the computational burden, providing faster MCMC computation compared to the use of conjugate priors. 
In addition, our proposal exhibits a greater rate of adaptation when compared to the conjugate prior, specifically for the inclusion of nodes in the DAG-generating model. Simulation studies demonstrate the superior accuracy of DAG learning, and we obtain the same maximum a posteriori and median probability model estimate for the AML data, using the non-conjugate prior."}, "https://arxiv.org/abs/2403.17580": {"title": "Measuring Dependence between Events", "link": "https://arxiv.org/abs/2403.17580", "description": "arXiv:2403.17580v1 Announce Type: new \nAbstract: Measuring dependence between two events, or equivalently between two binary random variables, amounts to expressing the dependence structure inherent in a $2\\times 2$ contingency table in a real number between $-1$ and $1$. Countless such dependence measures exist, but there is little theoretical guidance on how they compare and on their advantages and shortcomings. Thus, practitioners might be overwhelmed by the problem of choosing a suitable measure. We provide a set of natural desirable properties that a proper dependence measure should fulfill. We show that Yule's Q and the little-known Cole coefficient are proper, while the most widely-used measures, the phi coefficient and all contingency coefficients, are improper. They have a severe attainability problem, that is, even under perfect dependence they can be very far away from $-1$ and $1$, and often differ substantially from the proper measures in that they understate strength of dependence. The structural reason is that these are measures for equality of events rather than of dependence. We derive the (in some instances non-standard) limiting distributions of the measures and illustrate how asymptotically valid confidence intervals can be constructed. In a case study on drug consumption we demonstrate how misleading conclusions may arise from the use of improper dependence measures."}, "https://arxiv.org/abs/2403.17609": {"title": "A location Invariant Statistic-Based Consistent Estimation Method for Three-Parameter Generalized Exponential Distribution", "link": "https://arxiv.org/abs/2403.17609", "description": "arXiv:2403.17609v1 Announce Type: new \nAbstract: In numerous instances, the generalized exponential distribution can be used as an alternative to the gamma distribution or the Weibull distribution when analyzing lifetime or skewed data. This article offers a consistent method for estimating the parameters of a three-parameter generalized exponential distribution that sidesteps the issue of an unbounded likelihood function. The method is hinged on a maximum likelihood estimation of shape and scale parameters that uses a location-invariant statistic. Important estimator properties, such as uniqueness and consistency, are demonstrated. In addition, quantile estimates for the lifetime distribution are provided. We present a Monte Carlo simulation study along with comparisons to a number of well-known estimation techniques in terms of bias and root mean square error. For illustrative purposes, a real-world lifetime data set is analyzed."}, "https://arxiv.org/abs/2403.17624": {"title": "The inclusive Synthetic Control Method", "link": "https://arxiv.org/abs/2403.17624", "description": "arXiv:2403.17624v1 Announce Type: new \nAbstract: We introduce the inclusive synthetic control method (iSCM), a modification of synthetic control type methods that allows the inclusion of units potentially affected directly or indirectly by an intervention in the donor pool. 
This method is well suited for applications with either multiple treated units or in which some of the units in the donor pool might be affected by spillover effects. Our iSCM is very easy to implement using most synthetic control type estimators. As an illustrative empirical example, we re-estimate the causal effect of German reunification on GDP per capita allowing for spillover effects from West Germany to Austria."}, "https://arxiv.org/abs/2403.17670": {"title": "A family of Chatterjee's correlation coefficients and their properties", "link": "https://arxiv.org/abs/2403.17670", "description": "arXiv:2403.17670v1 Announce Type: new \nAbstract: Quantifying the strength of functional dependence between random scalars $X$ and $Y$ is an important statistical problem. While many existing correlation coefficients excel in identifying linear or monotone functional dependence, they fall short in capturing general non-monotone functional relationships. In response, we propose a family of correlation coefficients $\\xi^{(h,F)}_n$, characterized by a continuous bivariate function $h$ and a cdf function $F$. By offering a range of selections for $h$ and $F$, $\\xi^{(h,F)}_n$ encompasses a diverse class of novel correlation coefficients, while also incorporates the Chatterjee's correlation coefficient (Chatterjee, 2021) as a special case. We prove that $\\xi^{(h,F)}_n$ converges almost surely to a deterministic limit $\\xi^{(h,F)}$ as sample size $n$ approaches infinity. In addition, under appropriate conditions imposed on $h$ and $F$, the limit $\\xi^{(h,F)}$ satisfies the three appealing properties: (P1). it belongs to the range of $[0,1]$; (P2). it equals 1 if and only if $Y$ is a measurable function of $X$; and (P3). it equals 0 if and only if $Y$ is independent of $X$. As amplified by our numerical experiments, our proposals provide practitioners with a variety of options to choose the most suitable correlation coefficient tailored to their specific practical needs."}, "https://arxiv.org/abs/2403.17777": {"title": "Deconvolution from two order statistics", "link": "https://arxiv.org/abs/2403.17777", "description": "arXiv:2403.17777v1 Announce Type: new \nAbstract: Economic data are often contaminated by measurement errors and truncated by ranking. This paper shows that the classical measurement error model with independent and additive measurement errors is identified nonparametrically using only two order statistics of repeated measurements. The identification result confirms a hypothesis by Athey and Haile (2002) for a symmetric ascending auction model with unobserved heterogeneity. Extensions allow for heterogeneous measurement errors, broadening the applicability to additional empirical settings, including asymmetric auctions and wage offer models. We adapt an existing simulated sieve estimator and illustrate its performance in finite samples."}, "https://arxiv.org/abs/2403.17882": {"title": "On the properties of distance covariance for categorical data: Robustness, sure screening, and approximate null distributions", "link": "https://arxiv.org/abs/2403.17882", "description": "arXiv:2403.17882v1 Announce Type: new \nAbstract: Pearson's Chi-squared test, though widely used for detecting association between categorical variables, exhibits low statistical power in large sparse contingency tables. To address this limitation, two novel permutation tests have been recently developed: the distance covariance permutation test and the U-statistic permutation test. 
Both leverage the distance covariance functional but employ different estimators. In this work, we explore key statistical properties of the distance covariance for categorical variables. First, we show that unlike Chi-squared, the distance covariance functional is B-robust for any number of categories (fixed or diverging). Second, we establish the strong consistency of distance covariance screening under mild conditions, and simulations confirm its advantage over Chi-squared screening, especially for large sparse tables. Finally, we derive an approximate null distribution for a bias-corrected distance correlation estimate, demonstrating its effectiveness through simulations."}, "https://arxiv.org/abs/2403.17221": {"title": "Are Made and Missed Different? An analysis of Field Goal Attempts of Professional Basketball Players via Depth Based Testing Procedure", "link": "https://arxiv.org/abs/2403.17221", "description": "arXiv:2403.17221v1 Announce Type: cross \nAbstract: In this paper, we develop a novel depth-based testing procedure on spatial point processes to examine the difference in made and missed field goal attempts for NBA players. Specifically, our testing procedure can statistically detect the differences between made and missed field goal attempts for NBA players. We first obtain the depths of two processes under the polar coordinate system. A two-dimensional Kolmogorov-Smirnov test is then performed to test the difference between the depths of the two processes. Through extensive simulation studies, we show that our testing procedure has good frequentist properties under both the null and alternative hypotheses. A comparison against the competing methods shows that our proposed procedure has better testing reliability and testing power. Application to the shot chart data of 191 NBA players in the 2017-2018 regular season offers interesting insights about these players' made and missed shot patterns."}, "https://arxiv.org/abs/2403.17776": {"title": "Exploring the Boundaries of Ambient Awareness in Twitter", "link": "https://arxiv.org/abs/2403.17776", "description": "arXiv:2403.17776v1 Announce Type: cross \nAbstract: Ambient awareness refers to the ability of social media users to obtain knowledge about who knows what (i.e., users' expertise) in their network, by simply being exposed to other users' content (e.g., tweets on Twitter). Previous work, based on user surveys, reveals that individuals self-report ambient awareness only for parts of their networks. However, it is unclear whether it is their limited cognitive capacity or the limited exposure to diagnostic tweets (i.e., online content) that prevents people from developing ambient awareness for their complete network. In this work, we focus on in-wall ambient awareness (IWAA) in Twitter and conduct a two-step data-driven analysis that allows us to explore to what extent IWAA is likely, or even possible. First, we rely on reactions (e.g., likes), as strong evidence of users being aware of experts in Twitter. Unfortunately, such strong evidence can be only measured for active users, which represent the minority in the network. Thus to study the boundaries of IWAA to a larger extent, in the second part of our analysis, we instead focus on the passive exposure to content generated by other users -- which we refer to as in-wall visibility. 
This analysis shows that (in line with \\citet{levordashka2016ambient}) only for a subset of users IWAA is plausible, while for the majority it is unlikely, if even possible, to develop IWAA. We hope that our methodology paves the way for the emergence of data-driven approaches for the study of ambient awareness."}, "https://arxiv.org/abs/2206.04133": {"title": "Bayesian multivariate logistic regression for superiority and inferiority decision-making under observable treatment heterogeneity", "link": "https://arxiv.org/abs/2206.04133", "description": "arXiv:2206.04133v4 Announce Type: replace \nAbstract: The effects of treatments may differ between persons with different characteristics. Addressing such treatment heterogeneity is crucial to investigate whether patients with specific characteristics are likely to benefit from a new treatment. The current paper presents a novel Bayesian method for superiority decision-making in the context of randomized controlled trials with multivariate binary responses and heterogeneous treatment effects. The framework is based on three elements: a) Bayesian multivariate logistic regression analysis with a P\\'olya-Gamma expansion; b) a transformation procedure to transfer obtained regression coefficients to a more intuitive multivariate probability scale (i.e., success probabilities and the differences between them); and c) a compatible decision procedure for treatment comparison with prespecified decision error rates. Procedures for a priori sample size estimation under a non-informative prior distribution are included. A numerical evaluation demonstrated that decisions based on a priori sample size estimation resulted in anticipated error rates among the trial population as well as subpopulations. Further, average and conditional treatment effect parameters could be estimated unbiasedly when the sample was large enough. Illustration with the International Stroke Trial dataset revealed a trend towards heterogeneous effects among stroke patients: Something that would have remained undetected when analyses were limited to average treatment effects."}, "https://arxiv.org/abs/2301.08958": {"title": "A Practical Introduction to Regression Discontinuity Designs: Extensions", "link": "https://arxiv.org/abs/2301.08958", "description": "arXiv:2301.08958v2 Announce Type: replace \nAbstract: This monograph, together with its accompanying first part Cattaneo, Idrobo and Titiunik (2020), collects and expands the instructional materials we prepared for more than $50$ short courses and workshops on Regression Discontinuity (RD) methodology that we taught between 2014 and 2023. In this second monograph, we discuss several topics in RD methodology that build on and extend the analysis of RD designs introduced in Cattaneo, Idrobo and Titiunik (2020). Our first goal is to present an alternative RD conceptual framework based on local randomization ideas. This methodological approach can be useful in RD designs with discretely-valued scores, and can also be used more broadly as a complement to the continuity-based approach in other settings. Then, employing both continuity-based and local randomization approaches, we extend the canonical Sharp RD design in multiple directions: fuzzy RD designs, RD designs with discrete scores, and multi-dimensional RD designs. 
The goal of our two-part monograph is purposely practical and hence we focus on the empirical analysis of RD designs."}, "https://arxiv.org/abs/2302.03157": {"title": "A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data", "link": "https://arxiv.org/abs/2302.03157", "description": "arXiv:2302.03157v2 Announce Type: replace \nAbstract: Recent advancements in Mixed Integer Optimization (MIO) algorithms, paired with hardware enhancements, have led to significant speedups in resolving MIO problems. These strategies have been utilized for optimal subset selection, specifically for choosing $k$ features out of $p$ in linear regression given $n$ observations. In this paper, we broaden this method to facilitate cluster-aware regression, where selection aims to choose $\\lambda$ out of $K$ clusters in a linear mixed effects (LMM) model with $n_k$ observations for each cluster. Through comprehensive testing on a multitude of synthetic and real datasets, we demonstrate that our method efficiently solves problems within minutes. Through numerical experiments, we also show that the MIO approach outperforms both Gaussian- and Laplace-distributed LMMs in terms of generating sparse solutions with high predictive power. Traditional LMMs typically assume that clustering effects are independent of individual features. However, we introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model. The inferential and predictive efficacy of this approach is further illustrated through its application in student scoring and protein expression."}, "https://arxiv.org/abs/2303.04754": {"title": "Estimation of Long-Range Dependent Models with Missing Data: to Impute or not to Impute?", "link": "https://arxiv.org/abs/2303.04754", "description": "arXiv:2303.04754v2 Announce Type: replace \nAbstract: Among the most important models for long-range dependent time series is the class of ARFIMA$(p,d,q)$ (Autoregressive Fractionally Integrated Moving Average) models. Estimating the long-range dependence parameter $d$ in ARFIMA models is a well-studied problem, but the literature regarding the estimation of $d$ in the presence of missing data is very sparse. There are two basic approaches to dealing with the problem: missing data can be imputed using some plausible method, and then the estimation can proceed as if no data were missing, or we can use a specially tailored methodology to estimate $d$ in the presence of missing data. In this work, we review some of the methods available for both approaches and compare them through a Monte Carlo simulation study. We present a comparison among 35 different setups to estimate $d$, under tens of different scenarios, considering percentages of missing data ranging from as few as 10\% up to 70\% and several levels of dependence."}, "https://arxiv.org/abs/2304.07034": {"title": "Recursive Neyman Algorithm for Optimum Sample Allocation under Box Constraints on Sample Sizes in Strata", "link": "https://arxiv.org/abs/2304.07034", "description": "arXiv:2304.07034v4 Announce Type: replace \nAbstract: The optimum sample allocation in stratified sampling is one of the basic issues of survey methodology. 
It is a procedure of dividing the overall sample size into strata sample sizes in such a way that for given sampling designs in strata the variance of the stratified $\\pi$ estimator of the population total (or mean) for a given study variable assumes its minimum. In this work, we consider the optimum allocation of a sample, under lower and upper bounds imposed jointly on sample sizes in strata. We are concerned with the variance function of some generic form that, in particular, covers the case of the simple random sampling without replacement in strata. The goal of this paper is twofold. First, we establish (using the Karush-Kuhn-Tucker conditions) a generic form of the optimal solution, the so-called optimality conditions. Second, based on the established optimality conditions, we derive an efficient recursive algorithm, named RNABOX, which solves the allocation problem under study. The RNABOX can be viewed as a generalization of the classical recursive Neyman allocation algorithm, a popular tool for optimum allocation when only upper bounds are imposed on sample strata-sizes. We implement RNABOX in R as a part of our package stratallo which is available from the Comprehensive R Archive Network (CRAN) repository."}, "https://arxiv.org/abs/2304.10025": {"title": "Identification and multiply robust estimation in causal mediation analysis across principal strata", "link": "https://arxiv.org/abs/2304.10025", "description": "arXiv:2304.10025v3 Announce Type: replace \nAbstract: We consider assessing causal mediation in the presence of a post-treatment event (examples include noncompliance, a clinical event, or a terminal event). We identify natural mediation effects for the entire study population and for each principal stratum characterized by the joint potential values of the post-treatment event. We derive efficient influence functions for each mediation estimand, which motivate a set of multiply robust estimators for inference. The multiply robust estimators are consistent under four types of misspecifications and are efficient when all nuisance models are correctly specified. We illustrate our methods via simulations and two real data examples."}, "https://arxiv.org/abs/2306.10213": {"title": "A General Form of Covariate Adjustment in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2306.10213", "description": "arXiv:2306.10213v2 Announce Type: replace \nAbstract: In randomized clinical trials, adjusting for baseline covariates can improve credibility and efficiency for demonstrating and quantifying treatment effects. This article studies the augmented inverse propensity weighted (AIPW) estimator, which is a general form of covariate adjustment that uses linear, generalized linear, and non-parametric or machine learning models for the conditional mean of the response given covariates. Under covariate-adaptive randomization, we establish general theorems that show a complete picture of the asymptotic normality, efficiency gain, and applicability of AIPW estimators. In particular, we provide for the first time a rigorous theoretical justification of using machine learning methods with cross-fitting for dependent data under covariate-adaptive randomization. Based on the general theorems, we offer insights on the conditions for guaranteed efficiency gain and universal applicability under different randomization schemes, which also motivate a joint calibration strategy using some constructed covariates after applying AIPW. 
Our methods are implemented in the R package RobinCar."}, "https://arxiv.org/abs/2306.16378": {"title": "Spatiotemporal Besov Priors for Bayesian Inverse Problems", "link": "https://arxiv.org/abs/2306.16378", "description": "arXiv:2306.16378v2 Announce Type: replace \nAbstract: Fast development in science and technology has driven the need for proper statistical tools to capture special data features such as abrupt changes or sharp contrast. Many inverse problems in data science require spatiotemporal solutions derived from a sequence of time-dependent objects with these spatial features, e.g., dynamic reconstruction of computerized tomography (CT) images with edges. Conventional methods based on Gaussian processes (GP) often fall short in providing satisfactory solutions since they tend to offer over-smooth priors. Recently, the Besov process (BP), defined by wavelet expansions with random coefficients, has emerged as a more suitable prior for Bayesian inverse problems of this nature. While BP excels in handling spatial inhomogeneity, it does not automatically incorporate temporal correlation inherited in the dynamically changing objects. In this paper, we generalize BP to a novel spatiotemporal Besov process (STBP) by replacing the random coefficients in the series expansion with stochastic time functions as Q-exponential process (Q-EP) which governs the temporal correlation structure. We thoroughly investigate the mathematical and statistical properties of STBP. A white-noise representation of STBP is also proposed to facilitate the inference. Simulations, two limited-angle CT reconstruction examples and a highly non-linear inverse problem involving Navier-Stokes equation are used to demonstrate the advantage of the proposed STBP in preserving spatial features while accounting for temporal changes compared with the classic STGP and a time-uncorrelated approach."}, "https://arxiv.org/abs/2310.17248": {"title": "The observed Fisher information attached to the EM algorithm, illustrated on Shepp and Vardi estimation procedure for positron emission tomography", "link": "https://arxiv.org/abs/2310.17248", "description": "arXiv:2310.17248v2 Announce Type: replace \nAbstract: The Shepp & Vardi (1982) implementation of the EM algorithm for PET scan tumor estimation provides a point estimate of the tumor. The current study presents a closed-form formula of the observed Fisher information for Shepp & Vardi PET scan tumor estimation. Keywords: PET scan, EM algorithm, Fisher information matrix, standard errors."}, "https://arxiv.org/abs/2305.04937": {"title": "Randomly sampling bipartite networks with fixed degree sequences", "link": "https://arxiv.org/abs/2305.04937", "description": "arXiv:2305.04937v3 Announce Type: replace-cross \nAbstract: Statistical analysis of bipartite networks frequently requires randomly sampling from the set of all bipartite networks with the same degree sequence as an observed network. Trade algorithms offer an efficient way to generate samples of bipartite networks by incrementally `trading' the positions of some of their edges. However, it is difficult to know how many such trades are required to ensure that the sample is random. I propose a stopping rule that focuses on the distance between sampled networks and the observed network, and stops performing trades when this distribution stabilizes. 
Analyses demonstrate that, for over 300 different degree sequences, using this stopping rule ensures a random sample with a high probability, and that it is practical for use in empirical applications."}, "https://arxiv.org/abs/2305.16915": {"title": "When is cross impact relevant?", "link": "https://arxiv.org/abs/2305.16915", "description": "arXiv:2305.16915v2 Announce Type: replace-cross \nAbstract: Trading pressure from one asset can move the price of another, a phenomenon referred to as cross impact. Using tick-by-tick data spanning 5 years for 500 assets listed in the United States, we identify the features that make cross-impact relevant to explain the variance of price returns. We show that price formation occurs endogenously within highly liquid assets. Then, trades in these assets influence the prices of less liquid correlated products, with an impact velocity constrained by their minimum trading frequency. We investigate the implications of such a multidimensional price formation mechanism on interest rate markets. We find that the 10-year bond future serves as the primary liquidity reservoir, influencing the prices of cash bonds and futures contracts within the interest rate curve. Such behaviour challenges the validity of the theory in Financial Economics that regards long-term rates as agents anticipations of future short term rates."}, "https://arxiv.org/abs/2311.01453": {"title": "PPI++: Efficient Prediction-Powered Inference", "link": "https://arxiv.org/abs/2311.01453", "description": "arXiv:2311.01453v2 Announce Type: replace-cross \nAbstract: We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations."}, "https://arxiv.org/abs/2311.02766": {"title": "Riemannian Laplace Approximation with the Fisher Metric", "link": "https://arxiv.org/abs/2311.02766", "description": "arXiv:2311.02766v3 Announce Type: replace-cross \nAbstract: Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric, indeed the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. 
We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments."}, "https://arxiv.org/abs/2403.17948": {"title": "The Rule of link functions on Binomial Regression Model: A Cross Sectional Study on Child Malnutrition, Bangladesh", "link": "https://arxiv.org/abs/2403.17948", "description": "arXiv:2403.17948v1 Announce Type: new \nAbstract: The link function is a key tool in the binomial regression model, a non-linear model under the GLM approach. It transforms the nonlinear regression to a linear model by mapping the interval (-\infty,\infty) to the probability scale [0,1]. The binomial model with link functions (logit, probit, cloglog and cauchy) is applied to the proportion of child malnutrition among children aged 0-5 years at the household level. The Multiple Indicator Cluster Survey (MICS)-2019, Bangladesh, was conducted jointly by UNICEF and BBS. The survey covered 64000 households using a two-stage stratified sampling technique, of which around 21000 households have children aged 0-5 years. We use bivariate analysis to find the statistical association between the response and sociodemographic features. Among the binary regression models, the probit model provides the best result based on the lowest standard errors of the covariates and goodness-of-fit measures (deviance, AIC)."}, "https://arxiv.org/abs/2403.17982": {"title": "Markov chain models for inspecting response dynamics in psychological testing", "link": "https://arxiv.org/abs/2403.17982", "description": "arXiv:2403.17982v1 Announce Type: new \nAbstract: The importance of considering contextual probabilities in shaping response patterns within psychological testing is underscored, despite the ubiquitous nature of order effects discussed extensively in the methodological literature. Drawing from concepts such as path-dependency, first-order autocorrelation, state-dependency, and hysteresis, the present study is an attempt to address how earlier responses serve as an anchor for subsequent answers in tests, surveys, and questionnaires. Introducing the notion of non-commuting observables derived from quantum physics, I highlight their role in characterizing psychological processes and the impact of measurement instruments on participants' responses. We advocate for the utilization of first-order Markov chain modeling to capture and forecast sequential dependencies in survey and test responses. The rationale for employing the first-order Markov chain model lies in individuals' propensity to focus partially on preceding responses, with recent items most likely exerting a substantial influence on subsequent response selection. This study contributes to advancing our understanding of the dynamics inherent in sequential data within psychological research and provides a methodological framework for conducting longitudinal analyses of response patterns in tests and questionnaires."}, "https://arxiv.org/abs/2403.17986": {"title": "Comment on \"Safe Testing\" by Gr\\\"unwald, de Heide, and Koolen", "link": "https://arxiv.org/abs/2403.17986", "description": "arXiv:2403.17986v1 Announce Type: new \nAbstract: This comment briefly reflects on \"Safe Testing\" by Gr\\\"{u}nwald et al. (2024). 
The safety of fractional Bayes factors (O'Hagan, 1995) is illustrated and compared to (safe) Bayes factors based on the right Haar prior."}, "https://arxiv.org/abs/2403.18039": {"title": "Doubly robust causal inference through penalized bias-reduced estimation: combining non-probability samples with designed surveys", "link": "https://arxiv.org/abs/2403.18039", "description": "arXiv:2403.18039v1 Announce Type: new \nAbstract: Causal inference on the average treatment effect (ATE) using non-probability samples, such as electronic health records (EHR), faces challenges from sample selection bias and high-dimensional covariates. This requires considering a selection model alongside treatment and outcome models that are typical ingredients in causal inference. This paper considers integrating large non-probability samples with external probability samples from a designed survey, addressing moderately high-dimensional confounders and variables that influence selection. In contrast to the two-step approach that separates variable selection and debiased estimation, we propose a one-step plug-in doubly robust (DR) estimator of the ATE. We construct a novel penalized estimating equation by minimizing the squared asymptotic bias of the DR estimator. Our approach facilitates ATE inference in high-dimensional settings by ignoring the variability in estimating nuisance parameters, which is not guaranteed in conventional likelihood approaches with non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample."}, "https://arxiv.org/abs/2403.18069": {"title": "Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information", "link": "https://arxiv.org/abs/2403.18069", "description": "arXiv:2403.18069v1 Announce Type: new \nAbstract: The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric space based statistical objects. The aim of this paper is to introduce a novel two-step framework that comprises: (i) an imputation method for statistical objects taking values in metric spaces, and (ii) a criterion for personalizing imputation using conformal inference techniques. This work is motivated by the need to impute distributional functional representations of continuous glucose monitoring (CGM) data within the context of a longitudinal study on diabetes, where a significant fraction of patients do not have available CGM profiles. The importance of these methods is illustrated by evaluating the effectiveness of CGM data as new digital biomarkers to predict the time to diabetes onset in healthy populations. 
To address these scientific challenges, we propose: (i) a new regression algorithm for missing responses; (ii) novel conformal prediction algorithms tailored for metric spaces with a focus on density responses within the 2-Wasserstein geometry; (iii) a broadly applicable personalized imputation method criterion, designed to enhance both of the aforementioned strategies, yet valid across any statistical model and data structure. Our findings reveal that incorporating CGM data into diabetes time-to-event analysis, augmented with a novel personalization phase of imputation, significantly enhances predictive accuracy by over ten percent compared to traditional predictive models for time to diabetes."}, "https://arxiv.org/abs/2403.18115": {"title": "Assessing COVID-19 Vaccine Effectiveness in Observational Studies via Nested Trial Emulation", "link": "https://arxiv.org/abs/2403.18115", "description": "arXiv:2403.18115v1 Announce Type: new \nAbstract: Observational data are often used to estimate real-world effectiveness and durability of coronavirus disease 2019 (COVID-19) vaccines. A sequence of nested trials can be emulated to draw inference from such data while minimizing selection bias, immortal time bias, and confounding. Typically, when nested trial emulation (NTE) is employed, effect estimates are pooled across trials to increase statistical efficiency. However, such pooled estimates may lack a clear interpretation when the treatment effect is heterogeneous across trials. In the context of COVID-19, vaccine effectiveness quite plausibly will vary over calendar time due to newly emerging variants of the virus. This manuscript considers an NTE inverse probability weighted estimator of vaccine effectiveness that may vary over calendar time, time since vaccination, or both. Statistical testing of the trial effect homogeneity assumption is considered. Simulation studies are presented examining the finite-sample performance of these methods under a variety of scenarios. The methods are used to estimate vaccine effectiveness against COVID-19 outcomes using observational data on over 120,000 residents of Abruzzo, Italy during 2021."}, "https://arxiv.org/abs/2403.18248": {"title": "Statistical Inference of Optimal Allocations I: Regularities and their Implications", "link": "https://arxiv.org/abs/2403.18248", "description": "arXiv:2403.18248v1 Announce Type: new \nAbstract: In this paper, we develop a functional differentiability approach for solving statistical optimal allocation problems. We first derive Hadamard differentiability of the value function through a detailed analysis of the general properties of the sorting operator. Central to our framework are the concept of Hausdorff measure and the area and coarea integration formulas from geometric measure theory. Building on our Hadamard differentiability results, we demonstrate how the functional delta method can be used to directly derive the asymptotic properties of the value function process for binary constrained optimal allocation problems, as well as the two-step ROC curve estimator. Moreover, leveraging profound insights from geometric functional analysis on convex and local Lipschitz functionals, we obtain additional generic Fr\\'echet differentiability results for the value functions of optimal allocation problems. These compelling findings motivate us to study carefully the first order approximation of the optimal social welfare. In this paper, we then present a double/debiased estimator for the value functions. 
Importantly, the conditions outlined in the Hadamard differentiability section validate the margin assumption from the statistical classification literature employing plug-in methods, which justifies a faster convergence rate."}, "https://arxiv.org/abs/2403.18464": {"title": "Cumulative Incidence Function Estimation Based on Population-Based Biobank Data", "link": "https://arxiv.org/abs/2403.18464", "description": "arXiv:2403.18464v1 Announce Type: new \nAbstract: Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data is collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency, and (2) CIF estimation for ages before the lower limit, $c_L$."}, "https://arxiv.org/abs/2403.18503": {"title": "Distributional Treatment Effect with Finite Mixture", "link": "https://arxiv.org/abs/2403.18503", "description": "arXiv:2403.18503v1 Announce Type: new \nAbstract: Treatment effect heterogeneity is of great concern when evaluating a treatment. However, even in the simple case of a binary treatment, the distribution of treatment effect is difficult to identify due to the fundamental limitation that we cannot observe both treated potential outcome and untreated potential outcome for a given individual. This paper assumes a finite mixture model on the potential outcomes and a vector of control covariates to address treatment endogeneity and imposes a Markov condition on the potential outcomes and covariates within each type to identify the treatment effect distribution. The mixture weights of the finite mixture model are consistently estimated with a nonnegative matrix factorization algorithm, thus allowing us to consistently estimate the component distribution parameters, including ones for the treatment effect distribution."}, "https://arxiv.org/abs/2403.18549": {"title": "A communication-efficient, online changepoint detection method for monitoring distributed sensor networks", "link": "https://arxiv.org/abs/2403.18549", "description": "arXiv:2403.18549v1 Announce Type: new \nAbstract: We consider the challenge of efficiently detecting changes within a network of sensors, where we also need to minimise communication between sensors and the cloud. We propose an online, communication-efficient method to detect such changes. The procedure works by performing likelihood ratio tests at each time point, and two thresholds are chosen to filter unimportant test statistics and make decisions based on the aggregated test statistics respectively. We provide asymptotic theory concerning consistency and the asymptotic distribution if there are no changes. 
Simulation results suggest that our method can achieve similar performance to the idealised setting, where we have no constraints on communication between sensors, while substantially reducing the transmission costs."}, "https://arxiv.org/abs/2403.18602": {"title": "Collaborative graphical lasso", "link": "https://arxiv.org/abs/2403.18602", "description": "arXiv:2403.18602v1 Announce Type: new \nAbstract: In recent years, the availability of multi-omics data has increased substantially. Multi-omics data integration methods mainly aim to leverage different molecular data sets to gain a complete molecular description of biological processes. An attractive integration approach is the reconstruction of multi-omics networks. However, the development of effective multi-omics network reconstruction strategies lags behind. This hinders maximizing the potential of multi-omics data sets. With this study, we advance the frontier of multi-omics network reconstruction by introducing \"collaborative graphical lasso\" as a novel strategy. Our proposed algorithm synergizes \"graphical lasso\" with the concept of \"collaboration\", effectively harmonizing the integration of multi-omics data sets, thereby enhancing the accuracy of network inference. In addition, to tackle model selection in this framework, we designed an ad hoc procedure based on network stability. We assess the performance of collaborative graphical lasso and the corresponding model selection procedure through simulations, and we apply them to publicly available multi-omics data. This demonstrated that collaborative graphical lasso is able to reconstruct known biological connections and suggest previously unknown and biologically coherent interactions, enabling the generation of novel hypotheses. We implemented collaborative graphical lasso as an R package, available on CRAN as coglasso."}, "https://arxiv.org/abs/2403.18072": {"title": "Goal-Oriented Bayesian Optimal Experimental Design for Nonlinear Models using Markov Chain Monte Carlo", "link": "https://arxiv.org/abs/2403.18072", "description": "arXiv:2403.18072v1 Announce Type: cross \nAbstract: Optimal experimental design (OED) provides a systematic approach to quantify and maximize the value of experimental data. Under a Bayesian approach, conventional OED maximizes the expected information gain (EIG) on model parameters. However, we are often interested in not the parameters themselves, but predictive quantities of interest (QoIs) that depend on the parameters in a nonlinear manner. We present a computational framework of predictive goal-oriented OED (GO-OED) suitable for nonlinear observation and prediction models, which seeks the experimental design providing the greatest EIG on the QoIs. In particular, we propose a nested Monte Carlo estimator for the QoI EIG, featuring Markov chain Monte Carlo for posterior sampling and kernel density estimation for evaluating the posterior-predictive density and its Kullback-Leibler divergence from the prior-predictive. The GO-OED design is then found by maximizing the EIG over the design space using Bayesian optimization. 
We demonstrate the effectiveness of the overall nonlinear GO-OED method, and illustrate its differences versus conventional non-GO-OED, through various test problems and an application of sensor placement for source inversion in a convection-diffusion field."}, "https://arxiv.org/abs/2403.18245": {"title": "LocalCop: An R package for local likelihood inference for conditional copulas", "link": "https://arxiv.org/abs/2403.18245", "description": "arXiv:2403.18245v1 Announce Type: cross \nAbstract: Conditional copula models allow the dependence structure between multiple response variables to be modelled as a function of covariates. LocalCop (Acar & Lysy, 2024) is an R/C++ package for computationally efficient semiparametric conditional copula modelling using a local likelihood inference framework developed in Acar, Craiu, & Yao (2011), Acar, Craiu, & Yao (2013) and Acar, Czado, & Lysy (2019)."}, "https://arxiv.org/abs/2403.18782": {"title": "Beyond boundaries: Gary Lorden's groundbreaking contributions to sequential analysis", "link": "https://arxiv.org/abs/2403.18782", "description": "arXiv:2403.18782v1 Announce Type: cross \nAbstract: Gary Lorden provided a number of fundamental and novel insights into sequential hypothesis testing and changepoint detection. In this article we provide an overview of Lorden's contributions in the context of existing results in those areas, and some extensions made possible by Lorden's work, mentioning also areas of application including threat detection in physical-computer systems, near-Earth space informatics, epidemiology, clinical trials, and finance."}, "https://arxiv.org/abs/2109.05755": {"title": "IQ: Intrinsic measure for quantifying the heterogeneity in meta-analysis", "link": "https://arxiv.org/abs/2109.05755", "description": "arXiv:2109.05755v2 Announce Type: replace \nAbstract: Quantifying the heterogeneity is an important issue in meta-analysis, and among the existing measures, the $I^2$ statistic is the most commonly used measure in the literature. In this paper, we show that the $I^2$ statistic was, in fact, defined as problematic or even completely wrong from the very beginning. To confirm this statement, we first present a motivating example to show that the $I^2$ statistic is heavily dependent on the study sample sizes, and consequently it may yield contradictory results for the amount of heterogeneity. Moreover, by drawing a connection between ANOVA and meta-analysis, the $I^2$ statistic is shown to have, mistakenly, applied the sampling errors of the estimators rather than the variances of the study populations. Inspired by this, we introduce an Intrinsic measure for Quantifying the heterogeneity in meta-analysis, and meanwhile study its statistical properties to clarify why it is superior to the existing measures. We further propose an optimal estimator, referred to as the IQ statistic, for the new measure of heterogeneity that can be readily applied in meta-analysis. 
Simulations and real data analysis demonstrate that the IQ statistic provides a nearly unbiased estimate of the true heterogeneity and it is also independent of the study sample sizes."}, "https://arxiv.org/abs/2207.07020": {"title": "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO", "link": "https://arxiv.org/abs/2207.07020", "description": "arXiv:2207.07020v3 Announce Type: replace \nAbstract: The multivariate regression interpretation of the Gaussian chain graph model simultaneously parametrizes (i) the direct effects of $p$ predictors on $q$ outcomes and (ii) the residual partial covariances between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation Conditional Maximization algorithm to obtain sparse estimates of the $p \\times q$ matrix of direct effects and the $q \\times q$ residual precision matrix. Our algorithm iteratively solves a sequence of penalized maximum likelihood problems with self-adaptive penalties that gradually filter out negligible regression coefficients and partial covariances. Because it adaptively penalizes individual model parameters, our method is seen to outperform fixed-penalty competitors on simulated data. We establish the posterior contraction rate for our model, buttressing our method's excellent empirical performance with strong theoretical guarantees. Using our method, we estimated the direct effects of diet and residence type on the composition of the gut microbiome of elderly adults."}, "https://arxiv.org/abs/2210.09426": {"title": "Party On: The Labor Market Returns to Social Networks in Adolescence", "link": "https://arxiv.org/abs/2210.09426", "description": "arXiv:2210.09426v5 Announce Type: replace \nAbstract: We investigate the returns to adolescent friendships on earnings in adulthood using data from the National Longitudinal Study of Adolescent to Adult Health. Because both education and friendships are jointly determined in adolescence, OLS estimates of their returns are likely biased. We implement a novel procedure to obtain bounds on the causal returns to friendships: we assume that the returns to schooling range from 5 to 15% (based on prior literature), and instrument for friendships using similarity in age among peers. Having one more friend in adolescence increases earnings between 7 and 14%, substantially more than OLS estimates would suggest."}, "https://arxiv.org/abs/2306.13829": {"title": "Selective inference using randomized group lasso estimators for general models", "link": "https://arxiv.org/abs/2306.13829", "description": "arXiv:2306.13829v3 Announce Type: replace \nAbstract: Selective inference methods are developed for group lasso estimators for use with a wide class of distributions and loss functions. The method includes the use of exponential family distributions, as well as quasi-likelihood modeling for overdispersed count data, for example, and allows for categorical or grouped covariates as well as continuous covariates. A randomized group-regularized optimization problem is studied. The added randomization allows us to construct a post-selection likelihood which we show to be adequate for selective inference when conditioning on the event of the selection of the grouped covariates. This likelihood also provides a selective point estimator, accounting for the selection by the group lasso. 
Confidence regions for the regression parameters in the selected model take the form of Wald-type regions and are shown to have bounded volume. The selective inference method for grouped lasso is illustrated on data from the national health and nutrition examination survey while simulations showcase its behaviour and favorable comparison with other methods."}, "https://arxiv.org/abs/2306.15173": {"title": "Robust propensity score weighting estimation under missing at random", "link": "https://arxiv.org/abs/2306.15173", "description": "arXiv:2306.15173v3 Announce Type: replace \nAbstract: Missing data is frequently encountered in many areas of statistics. Propensity score weighting is a popular method for handling missing data. The propensity score method employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. Doubly robust estimation is attractive, as the consistency of the estimator is guaranteed when either the outcome regression model or the propensity score model is correctly specified. In this paper, we first employ information projection to develop an efficient and doubly robust estimator under indirect model calibration constraints. The resulting propensity score estimator can be equivalently expressed as a doubly robust regression imputation estimator by imposing the internal bias calibration condition in estimating the regression parameters. In addition, we generalize the information projection to allow for outlier-robust estimation. Some asymptotic properties are presented. The simulation study confirms that the proposed method allows robust inference against not only the violation of various model assumptions, but also outliers. A real-life application is presented using data from the Conservation Effects Assessment Project."}, "https://arxiv.org/abs/2307.00567": {"title": "A Note on Ising Network Analysis with Missing Data", "link": "https://arxiv.org/abs/2307.00567", "description": "arXiv:2307.00567v2 Announce Type: replace \nAbstract: The Ising model has become a popular psychometric model for analyzing item response data. The statistical inference of the Ising model is typically carried out via a pseudo-likelihood, as the standard likelihood approach suffers from a high computational cost when there are many variables (i.e., items). Unfortunately, the presence of missing values can hinder the use of pseudo-likelihood, and a listwise deletion approach for missing data treatment may introduce a substantial bias into the estimation and sometimes yield misleading interpretations. This paper proposes a conditional Bayesian framework for Ising network analysis with missing data, which integrates a pseudo-likelihood approach with iterative data imputation. An asymptotic theory is established for the method. Furthermore, a computationally efficient {P{\\'o}lya}-Gamma data augmentation procedure is proposed to streamline the sampling of model parameters. 
The method's performance is shown through simulations and a real-world application to data on major depressive and generalized anxiety disorders from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC)."}, "https://arxiv.org/abs/2307.09713": {"title": "Non-parametric inference on calibration of predicted risks", "link": "https://arxiv.org/abs/2307.09713", "description": "arXiv:2307.09713v3 Announce Type: replace \nAbstract: Moderate calibration, the expected event probability among observations with predicted probability z being equal to z, is a desired property of risk prediction models. Current graphical and numerical techniques for evaluating moderate calibration of risk prediction models are mostly based on smoothing or grouping the data. As well, there is no widely accepted inferential method for the null hypothesis that a model is moderately calibrated. In this work, we discuss recently-developed, and propose novel, methods for the assessment of moderate calibration for binary responses. The methods are based on the limiting distributions of functions of standardized partial sums of prediction errors converging to the corresponding laws of Brownian motion. The novel method relies on well-known properties of the Brownian bridge which enables joint inference on mean and moderate calibration, leading to a unified \"bridge\" test for detecting miscalibration. Simulation studies indicate that the bridge test is more powerful, often substantially, than the alternative test. As a case study we consider a prediction model for short-term mortality after a heart attack, where we provide suggestions on graphical presentation and the interpretation of results. Moderate calibration can be assessed without requiring arbitrary grouping of data or using methods that require tuning of parameters. An accompanying R package implements this method (see https://github.com/resplab/cumulcalib/)."}, "https://arxiv.org/abs/2308.11138": {"title": "NLP-based detection of systematic anomalies among the narratives of consumer complaints", "link": "https://arxiv.org/abs/2308.11138", "description": "arXiv:2308.11138v3 Announce Type: replace \nAbstract: We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau."}, "https://arxiv.org/abs/2309.10978": {"title": "Negative Spillover: A Potential Source of Bias in Pragmatic Clinical Trials", "link": "https://arxiv.org/abs/2309.10978", "description": "arXiv:2309.10978v4 Announce Type: replace \nAbstract: Pragmatic clinical trials evaluate the effectiveness of health interventions in real-world settings. Negative spillover can arise in a pragmatic trial if the study intervention affects how scarce resources are allocated between patients in the intervention and comparison groups. 
This can harm patients assigned to the control group and lead to overestimation of the treatment effect. While this type of negative spillover is often addressed in trials of social welfare and public health interventions, there is little recognition of this source of bias in the medical literature. In this article, I examine what causes negative spillover and how it may have led clinical trial investigators to overestimate the effect of patient navigation, AI-based physiological alarms, and elective induction of labor. I also suggest ways to detect negative spillover and design trials that avoid this potential source of bias."}, "https://arxiv.org/abs/2310.11471": {"title": "Modeling lower-truncated and right-censored insurance claims with an extension of the MBBEFD class", "link": "https://arxiv.org/abs/2310.11471", "description": "arXiv:2310.11471v2 Announce Type: replace \nAbstract: In general insurance, claims are often lower-truncated and right-censored because insurance contracts may involve deductibles and maximal covers. Most classical statistical models are not (directly) suited to model lower-truncated and right-censored claims. A surprisingly flexible family of distributions that can cope with lower-truncated and right-censored claims is the class of MBBEFD distributions that was originally introduced by Bernegger (1997) for reinsurance pricing, but which has not gained much attention outside the reinsurance literature. Interestingly, in general insurance, we mainly rely on unimodal skewed densities, whereas the reinsurance literature typically proposes monotonically decreasing densities within the MBBEFD class. We show that this class contains both types of densities, and we extend it to a bigger family of distribution functions suitable for modeling lower-truncated and right-censored claims. In addition, we discuss how changes in the deductible or the maximal cover affect the chosen distributions."}, "https://arxiv.org/abs/2310.13580": {"title": "Bayesian Hierarchical Modeling for Bivariate Multiscale Spatial Data with Application to Blood Test Monitoring", "link": "https://arxiv.org/abs/2310.13580", "description": "arXiv:2310.13580v2 Announce Type: replace \nAbstract: In public health applications, spatial data collected are often recorded at different spatial scales and over different correlated variables. Spatial change of support is a key inferential problem in these applications and has become standard in univariate settings; however, it is less standard in multivariate settings. There are several existing multivariate spatial models that can be easily combined with a multiscale spatial approach to analyze multivariate multiscale spatial data. In this paper, we propose three new models from such combinations for bivariate multiscale spatial data in a Bayesian context. In particular, we extend spatial random effects models, multivariate conditional autoregressive models, and ordered hierarchical models through a multiscale spatial approach. We run simulation studies for the three models and compare them in terms of prediction performance and computational efficiency. 
We motivate our models through an analysis of 2015 Texas annual average percentage receiving two blood tests from the Dartmouth Atlas Project."}, "https://arxiv.org/abs/2310.16502": {"title": "Assessing the overall and partial causal well-specification of nonlinear additive noise models", "link": "https://arxiv.org/abs/2310.16502", "description": "arXiv:2310.16502v3 Announce Type: replace \nAbstract: We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution. We then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data."}, "https://arxiv.org/abs/2401.00624": {"title": "Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures", "link": "https://arxiv.org/abs/2401.00624", "description": "arXiv:2401.00624v2 Announce Type: replace \nAbstract: Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about \"non-zero loadings\" and by the lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies \"non-zero loadings\" by learning the network structure of the large covariance matrix of observed variables, and then offers closed-form estimators for factor loadings, factor scores, covariances between common factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about \"non-zero loadings\". Through an extensive simulation analysis benchmarking against standard packages, SCFA exhibits superior performance in estimating model parameters with a much reduced computational time. We illustrate its practical application through factor analysis on a high-dimensional RNA-seq gene expression dataset."}, "https://arxiv.org/abs/2306.10594": {"title": "A nonparametric test for elliptical distribution based on kernel embedding of probabilities", "link": "https://arxiv.org/abs/2306.10594", "description": "arXiv:2306.10594v2 Announce Type: replace-cross \nAbstract: Elliptical distribution is a basic assumption underlying many multivariate statistical methods. For example, in sufficient dimension reduction and statistical graphical models, this assumption is routinely imposed to simplify the data dependence structure. Before applying such methods, we need to decide whether the data are elliptically distributed. Currently existing tests either focus exclusively on spherical distributions, or rely on bootstrap to determine the null distribution, or require specific forms of the alternative distribution. 
In this paper, we introduce a general nonparametric test for elliptical distribution based on kernel embedding of the probability measure that embodies the two properties that characterize an elliptical distribution: namely, after centering and rescaling, (1) the direction and length of the random vector are independent, and (2) the directional vector is uniformly distributed on the unit sphere. We derive the asymptotic distributions of the test statistic via von-Mises expansion, develop the sample-level procedure to determine the rejection region, and establish the consistency and validity of the proposed test. We also develop the concentration bounds of the test statistic, allowing the dimension to grow with the sample size, and further establish the consistency in this high-dimension setting. We compare our method with several existing methods via simulation studies, and apply our test to a SENIC dataset with and without a transformation aimed to achieve ellipticity."}, "https://arxiv.org/abs/2402.07868": {"title": "Nesting Particle Filters for Experimental Design in Dynamical Systems", "link": "https://arxiv.org/abs/2402.07868", "description": "arXiv:2402.07868v2 Announce Type: replace-cross \nAbstract: In this paper, we propose a novel approach to Bayesian experimental design for non-exchangeable data that formulates it as risk-sensitive policy optimization. We develop the Inside-Out SMC$^2$ algorithm, a nested sequential Monte Carlo technique to infer optimal designs, and embed it into a particle Markov chain Monte Carlo framework to perform gradient-based policy amortization. Our approach is distinct from other amortized experimental design techniques, as it does not rely on contrastive estimators. Numerical validation on a set of dynamical systems showcases the efficacy of our method in comparison to other state-of-the-art strategies."}, "https://arxiv.org/abs/2403.18951": {"title": "Robust estimations from distribution structures: V", "link": "https://arxiv.org/abs/2403.18951", "description": "arXiv:2403.18951v1 Announce Type: new \nAbstract: Due to the complexity of order statistics, the finite sample behaviour of robust statistics is generally not analytically solvable. While the Monte Carlo method can provide approximate solutions, its convergence rate is typically very slow, making the computational cost to achieve the desired accuracy unaffordable for ordinary users. In this paper, we propose an approach analogous to the Fourier transformation to decompose the finite sample structure of the uniform distribution. By obtaining sets of sequences that are consistent with parametric distributions for the first four sample moments, we can approximate the finite sample behavior of other estimators with significantly reduced computational costs. This article reveals the underlying structure of randomness and presents a novel approach to integrate multiple assumptions."}, "https://arxiv.org/abs/2403.19018": {"title": "Efficient global estimation of conditional-value-at-risk through stochastic kriging and extreme value theory", "link": "https://arxiv.org/abs/2403.19018", "description": "arXiv:2403.19018v1 Announce Type: new \nAbstract: We consider the problem of evaluating risk for a system that is modeled by a complex stochastic simulation with many possible input parameter values. 
Two sources of computational burden can be identified: the effort associated with extensive simulation runs required to accurately represent the tail of the loss distribution for each set of parameter values, and the computational cost of evaluating multiple candidate parameter values. The former concern can be addressed by using Extreme Value Theory (EVT) estimations, which specifically concentrate on the tails. Meta-modeling approaches are often used to tackle the latter concern. In this paper, we propose a particular meta-modeling framework, stochastic kriging, that is based on EVT-based estimation for a class of coherent measures of risk. The proposed approach requires an efficient estimator of the intrinsic variance, and so we derive an EVT-based expression for it. It then allows us to avoid multiple replications of the risk measure at each design point, which was required in similar previously proposed approaches, resulting in a substantial reduction in computational effort. We then perform a case study, outlining promising use cases and conditions under which the EVT-based approach outperforms simpler empirical estimators."}, "https://arxiv.org/abs/2403.19192": {"title": "Imputing missing not-at-random longitudinal marker values in time-to-event analysis: fully conditional specification multiple imputation in joint modeling", "link": "https://arxiv.org/abs/2403.19192", "description": "arXiv:2403.19192v1 Announce Type: new \nAbstract: We propose a procedure for imputing missing values of time-dependent covariates in a survival model using fully conditional specification. Specifically, we focus on imputing missing values of a longitudinal marker in joint modeling of the marker and time-to-event data, but the procedure can be easily applied to a time-varying covariate survival model as well. First, missing marker values are imputed via fully conditional specification multiple imputation, and then joint modeling is applied for estimating the association between the marker and the event. This procedure is recommended since in joint modeling marker measurements that are missing not-at-random can lead to bias (e.g. when patients with higher marker values tend to miss visits). Specifically, in cohort studies such a bias can occur since patients for whom all marker measurements during follow-up are missing are excluded from the analysis. Our procedure enables the inclusion of these patients by imputing their missing values using a modified version of fully conditional specification multiple imputation. The imputation model includes a special indicator for the subgroup with missing marker values during follow-up, and can be easily implemented in various software: R, SAS, Stata, etc. Using simulations, we show that the proposed procedure performs better than standard joint modeling in the missing not-at-random scenario with respect to bias, coverage and Type I error rate of the test, and as well as standard joint modeling in the completely missing at random scenario. Finally, we apply the procedure to real data on glucose control and cancer in diabetic patients."}, "https://arxiv.org/abs/2403.19363": {"title": "Dynamic Correlation of Market Connectivity, Risk Spillover and Abnormal Volatility in Stock Price", "link": "https://arxiv.org/abs/2403.19363", "description": "arXiv:2403.19363v1 Announce Type: new \nAbstract: The connectivity of stock markets reflects the information efficiency of capital markets and contributes to interior risk contagion and spillover effects. 
We compare Shanghai Stock Exchange A-shares (SSE A-shares) during tranquil periods with the subprime mortgage crisis and the high-leverage period of 2015. We use Pearson correlations of returns, the maximum strongly connected subgraph, and the $3\\sigma$ principle to iteratively determine the threshold value for building a dynamic correlation network of SSE A-shares. Analyses are carried out based on the networking structure, intra-sector connectivity, and node status, identifying several contributions. First, compared with tranquil periods, the SSE A-shares network experiences a more significant small-world and connective effect during the subprime mortgage crisis and the high leverage period in 2015. Second, the finance, energy and utilities sectors have stronger intra-industry connectivity than other sectors. Third, HUB nodes drive the growth of the SSE A-shares market during bull periods, while stocks have a thick-tail degree distribution in bear periods and show distinct characteristics in terms of market value and finance. Granger linear and non-linear causality networks are also considered for comparison purposes. Studies on the evolution of inter-cycle connectivity in the SSE A-share market may help investors improve portfolios and develop more robust risk management policies."}, "https://arxiv.org/abs/2403.19439": {"title": "Dynamic Analyses of Contagion Risk and Module Evolution on the SSE A-Shares Market Based on Minimum Information Entropy", "link": "https://arxiv.org/abs/2403.19439", "description": "arXiv:2403.19439v1 Announce Type: new \nAbstract: The interactive effect is significant in the Chinese stock market, exacerbating the abnormal market volatilities and risk contagion. Based on daily stock returns in the Shanghai Stock Exchange (SSE) A-shares, this paper divides the period between 2005 and 2018 into eight bull and bear market stages to investigate interactive patterns in the Chinese financial market. We employ the LASSO method to construct the stock network and further use the Map Equation method to analyze the evolution of modules in the SSE A-shares market. Empirical results show: (1) The connected effect is more significant in bear markets than in bull markets; (2) A system module can be found in the network during the first four stages, and the industry aggregation effect leads to module differentiation in the last four stages; (3) Some stocks have leading effects on others throughout eight periods, and medium- and small-cap stocks with poor financial conditions are more likely to become risk sources, especially in bear markets. Our conclusions are beneficial to improving investment strategies and making regulatory policies."}, "https://arxiv.org/abs/2403.19504": {"title": "Overlap violations in external validity", "link": "https://arxiv.org/abs/2403.19504", "description": "arXiv:2403.19504v1 Announce Type: new \nAbstract: Estimating externally valid causal effects is a foundational problem in the social and biomedical sciences. Generalizing or transporting causal estimates from an experimental sample to a target population of interest relies on an overlap assumption between the experimental sample and the target population--i.e., all units in the target population must have a non-zero probability of being included in the experiment. In practice, having full overlap between an experimental sample and a target population can be implausible. 
In the following paper, we introduce a framework for considering external validity in the presence of overlap violations. We introduce a novel bias decomposition that parameterizes the bias from an overlap violation into two components: (1) the proportion of units omitted, and (2) the degree to which omitting the units moderates the treatment effect. The bias decomposition offers an intuitive and straightforward approach to conducting sensitivity analysis to assess robustness to overlap violations. Furthermore, we introduce a suite of sensitivity tools in the form of summary measures and benchmarking, which help researchers consider the plausibility of the overlap violations. We apply the proposed framework to an experiment evaluating the impact of a cash transfer program in Northern Uganda."}, "https://arxiv.org/abs/2403.19515": {"title": "On Bootstrapping Lasso in Generalized Linear Models and the Cross Validation", "link": "https://arxiv.org/abs/2403.19515", "description": "arXiv:2403.19515v1 Announce Type: new \nAbstract: Generalized linear models (GLMs) constitute an important class of models that generalize ordinary linear regression by connecting the response variable with the covariates through arbitrary link functions. On the other hand, Lasso is a popular and easy-to-implement penalization method in regression when not all of the covariates are relevant. However, Lasso generally has an intractable asymptotic distribution, and hence an alternative method of distributional approximation is required for the purpose of statistical inference. In this paper, we develop a Bootstrap method which works as an approximation of the distribution of the Lasso estimator for all the sub-models of GLM. To connect the distributional approximation theory based on the proposed Bootstrap method with the practical implementation of Lasso, we explore the asymptotic properties of the K-fold cross-validation-based penalty parameter. The established results essentially justify drawing valid statistical inference regarding the unknown parameters based on the proposed Bootstrap method for any sub-model of GLM after selecting the penalty parameter using K-fold cross-validation. Good finite sample properties are also shown through a moderately large simulation study. The method is also implemented on a real data set."}, "https://arxiv.org/abs/2403.19605": {"title": "Data-Adaptive Tradeoffs among Multiple Risks in Distribution-Free Prediction", "link": "https://arxiv.org/abs/2403.19605", "description": "arXiv:2403.19605v1 Announce Type: new \nAbstract: Decision-making pipelines are generally characterized by tradeoffs among various risk functions. It is often desirable to manage such tradeoffs in a data-adaptive manner. As we demonstrate, if this is done naively, state-of-the-art uncertainty quantification methods can lead to significant violations of putative risk guarantees.\n To address this issue, we develop methods that permit valid control of risk when threshold and tradeoff parameters are chosen adaptively. 
Our methodology supports monotone and nearly-monotone risks, but otherwise makes no distributional assumptions.\n To illustrate the benefits of our approach, we carry out numerical experiments on synthetic data and the large-scale vision dataset MS-COCO."}, "https://arxiv.org/abs/2403.19606": {"title": "Positivity violations in marginal structural survival models with time-dependent confounding: a simulation study on IPTW-estimator performance", "link": "https://arxiv.org/abs/2403.19606", "description": "arXiv:2403.19606v1 Announce Type: new \nAbstract: In longitudinal observational studies, marginal structural models (MSMs) are a class of causal models used to analyze the effect of an exposure on the (survival) outcome of interest while accounting for exposure-affected time-dependent confounding. In the applied literature, inverse probability of treatment weighting (IPTW) has been widely adopted to estimate MSMs. An essential assumption for IPTW-based MSMs is the positivity assumption, which ensures that each individual in the population has a non-zero probability of receiving each exposure level within confounder strata. Positivity, along with consistency, conditional exchangeability, and correct specification of the weighting model, is crucial for valid causal inference through IPTW-based MSMs but is often overlooked compared to confounding bias. Positivity violations can arise from subjects having a zero probability of being exposed/unexposed (strict violations) or near-zero probabilities due to sampling variability (near violations). This article discusses the effect of violations in the positivity assumption on the estimates from IPTW-based MSMs. Building on the algorithms for simulating longitudinal survival data from MSMs by Havercroft and Didelez (2012) and Keogh et al. (2021), systematic simulations under strict/near positivity violations are performed. Various scenarios are explored by varying (i) the size of the confounder interval in which positivity violations arise, (ii) the sample size, (iii) the weight truncation strategy, and (iv) the subject's propensity to follow the protocol violation rule. This study underscores the importance of assessing positivity violations in IPTW-based MSMs to ensure robust and reliable causal inference in survival analyses."}, "https://arxiv.org/abs/2403.19289": {"title": "Graph Neural Networks for Treatment Effect Prediction", "link": "https://arxiv.org/abs/2403.19289", "description": "arXiv:2403.19289v1 Announce Type: cross \nAbstract: Estimating causal effects in e-commerce tends to involve costly treatment assignments which can be impractical in large-scale settings. Leveraging machine learning to predict such treatment effects without actual intervention is a standard practice to diminish the risk. However, existing methods for treatment effect prediction tend to rely on training sets of substantial size, which are built from real experiments and are thus inherently risky to create. In this work we propose a graph neural network to diminish the required training set size, relying on graphs that are common in e-commerce data. Specifically, we view the problem as node regression with a restricted number of labeled instances, develop a two-model neural architecture akin to previous causal effect estimators, and test varying message-passing layers for encoding. Furthermore, as an extra step, we combine the model with an acquisition function to guide the creation of the training set in settings with extremely low experimental budget. 
The framework is flexible since each step can be used separately with other models or policies. The experiments on real large-scale networks indicate a clear advantage of our methodology over the state of the art, which in many cases performs close to random underlining the need for models that can generalize with limited labeled samples to reduce experimental risks."}, "https://arxiv.org/abs/2403.19329": {"title": "Simulating Relational Event Histories -- Why and How", "link": "https://arxiv.org/abs/2403.19329", "description": "arXiv:2403.19329v1 Announce Type: cross \nAbstract: Many important social phenomena result from repeated interactions among individuals over time such as email exchanges in an organization, or face-to-face interactions in a classroom. Insights into the mechanisms underlying the dynamics of these interactions can be achieved through simulations of networks on a fine temporal granularity. In this paper, we present statistical frameworks to simulate relational event networks under dyadic and actor-oriented relational event models. These simulators have a broad applicability in temporal social network research such as model fit assessment, theory building, network intervention planning, making predictions, understanding the impact of network structures, to name a few. We show this in three extensive applications. First, it is shown why simulation-based techniques are crucial for relational event model assessment, for example to investigate how past events affect future interactions in the network. Second, we demonstrate how simulation techniques contribute to a better understanding of the longevity of network interventions. Third, we show how simulation techniques are important when building and extending theories about social phenomena such as understanding social identity dynamics using optimal distinctiveness theory."}, "https://arxiv.org/abs/2201.12045": {"title": "A loss discounting framework for model averaging and selection in time series models", "link": "https://arxiv.org/abs/2201.12045", "description": "arXiv:2201.12045v4 Announce Type: replace \nAbstract: We introduce a Loss Discounting Framework for model and forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. We use a loss function to score the performance of different models and introduce a multilevel discounting scheme which allows a flexible specification of the dynamics of the model weights. This novel and simple model combination approach can be easily applied to large scale model averaging/selection, can handle unusual features such as sudden regime changes, and can be tailored to different forecasting problems. We compare our method to both established methodologies and state of the art methods for a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques."}, "https://arxiv.org/abs/2203.11873": {"title": "Nonstationary Spatial Process Models with Spatially Varying Covariance Kernels", "link": "https://arxiv.org/abs/2203.11873", "description": "arXiv:2203.11873v2 Announce Type: replace \nAbstract: Spatial process models for capturing nonstationary behavior in scientific data present several challenges with regard to statistical inference and uncertainty quantification. 
While nonstationary spatially-varying kernels are attractive for their flexibility and richness, their practical implementation has been reported to be overwhelmingly cumbersome because of the high-dimensional parameter spaces resulting from the spatially varying process parameters. Matters are considerably exacerbated with the massive numbers of spatial locations over which measurements are available. With limited theoretical tractability offered by nonstationary spatial processes, overcoming such computational bottlenecks requires a synergy between model construction and algorithm development. We build a class of scalable nonstationary spatial process models using spatially varying covariance kernels. We present some novel consequences of such representations that befit computationally efficient implementation. More specifically, we operate within a coherent Bayesian modeling framework to achieve full uncertainty quantification using Hybrid Monte Carlo with nested interweaving. We carry out experiments on synthetic data sets to explore model selection and parameter identifiability and assess inferential improvements accrued from the nonstationary modeling. We illustrate strengths and pitfalls with a data set on the remotely sensed normalized difference vegetation index, with further analysis of a lead contamination data set in the Supplement."}, "https://arxiv.org/abs/2209.06704": {"title": "Causal chain event graphs for remedial maintenance", "link": "https://arxiv.org/abs/2209.06704", "description": "arXiv:2209.06704v2 Announce Type: replace \nAbstract: The analysis of system reliability has often benefited from graphical tools such as fault trees and Bayesian networks. In this article, instead of conventional graphical tools, we apply a probabilistic graphical model called the chain event graph (CEG) to represent the failures and processes of deterioration of a system. The CEG is derived from an event tree and can flexibly represent the unfolding of asymmetric processes. For this application we need to define a new class of formal intervention, which we call remedial, to model causal effects of remedial maintenance. This fixes the root causes of a failure and returns the status of the system to as good as new. We demonstrate that the semantics of the CEG are rich enough to express this novel type of intervention. Furthermore, through the bespoke causal algebras, the CEG provides a transparent framework with which to guide and express the rationale behind predictive inferences about the effects of various different types of remedial intervention. A back-door theorem is adapted to apply to these interventions to help discover when a system is only partially observed."}, "https://arxiv.org/abs/2305.17744": {"title": "Heterogeneous Matrix Factorization: When Features Differ by Datasets", "link": "https://arxiv.org/abs/2305.17744", "description": "arXiv:2305.17744v2 Announce Type: replace \nAbstract: In myriad statistical applications, data are collected from related but heterogeneous sources. These sources share some commonalities while containing idiosyncratic characteristics. One of the most fundamental challenges in such scenarios is to recover the shared and source-specific factors. Despite the existence of a few heuristic approaches, a generic algorithm with theoretical guarantees has yet to be established. In this paper, we tackle the problem by proposing a method called Heterogeneous Matrix Factorization to separate the shared and unique factors for a class of problems. 
HMF maintains the orthogonality between the shared and unique factors by leveraging an invariance property in the objective. The algorithm is easy to implement and intrinsically distributed. On the theoretical side, we show that for the squared error loss, HMF will converge to the optimal solutions, which are close to the ground truth. HMF can be integrated with auto-encoders to learn nonlinear feature mappings. Through a variety of case studies, we showcase HMF's benefits and applicability in video segmentation, time-series feature extraction, and recommender systems."}, "https://arxiv.org/abs/2308.06820": {"title": "Divisive Hierarchical Clustering of Variables Identified by Singular Vectors", "link": "https://arxiv.org/abs/2308.06820", "description": "arXiv:2308.06820v3 Announce Type: replace \nAbstract: In this work, we present a novel method for divisive hierarchical variable clustering. A cluster is a group of elements that exhibit higher similarity among themselves than to elements outside this cluster. The correlation coefficient serves as a natural measure to assess the similarity of variables. This means that in a correlation matrix, a cluster is represented by a block of variables with greater internal than external correlation. Our approach provides a nonparametric solution to identify such block structures in the correlation matrix using singular vectors of the underlying data matrix. When divisively clustering $p$ variables, there are $2^{p-1}$ possible splits. Using the singular vectors for cluster identification, we can effectively reduce this number to at most $p(p-1)$, thereby making it computationally efficient. We elaborate on the methodology and outline the incorporation of dissimilarity measures and linkage functions to assess distances between clusters. Additionally, we demonstrate that these distances are ultrametric, ensuring that the resulting hierarchical cluster structure can be uniquely represented by a dendrogram, with the heights of the dendrogram being interpretable. To validate the efficiency of our method, we perform simulation studies and analyze real-world data on personality traits and cognitive abilities. Supplementary materials for this article can be accessed online."}, "https://arxiv.org/abs/2309.01721": {"title": "Direct and Indirect Treatment Effects in the Presence of Semi-Competing Risks", "link": "https://arxiv.org/abs/2309.01721", "description": "arXiv:2309.01721v4 Announce Type: replace \nAbstract: Semi-competing risks refer to the phenomenon that the terminal event (such as death) can censor the non-terminal event (such as disease progression) but not vice versa. The treatment effect on the terminal event can be delivered either directly following the treatment or indirectly through the non-terminal event. We consider two strategies to decompose the total effect into a direct effect and an indirect effect under the framework of mediation analysis in completely randomized experiments by adjusting the prevalence and hazard of non-terminal events, respectively. They require slightly different assumptions on cross-world quantities to achieve identifiability. We establish asymptotic properties for the estimated counterfactual cumulative incidences and decomposed treatment effects. 
We illustrate the subtle difference between these two decompositions through simulation studies and two real-data applications."}, "https://arxiv.org/abs/2401.15680": {"title": "How to achieve model-robust inference in stepped wedge trials with model-based methods?", "link": "https://arxiv.org/abs/2401.15680", "description": "arXiv:2401.15680v2 Announce Type: replace \nAbstract: A stepped wedge design is a unidirectional crossover design where clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs -- via linear mixed models or generalized estimating equations -- is standard practice to evaluate treatment effects accounting for clustering and adjusting for baseline covariates, their properties under misspecification have not been systematically explored. In this article, we study when a potentially misspecified multilevel model can offer consistent estimation for treatment effect estimands that are functions of calendar time and/or exposure time. We define nonparametric treatment effect estimands using potential outcomes, and adapt model-based methods via g-computation to achieve estimand-aligned inference. We prove a central result that, as long as the working model includes a correctly specified treatment effect structure, the g-computation is guaranteed to be consistent even if all remaining model components are arbitrarily misspecified. Furthermore, valid inference is obtained via the sandwich variance estimator. The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge trial."}, "https://arxiv.org/abs/2205.05955": {"title": "Bayesian inference for stochastic oscillatory systems using the phase-corrected Linear Noise Approximation", "link": "https://arxiv.org/abs/2205.05955", "description": "arXiv:2205.05955v3 Announce Type: replace-cross \nAbstract: Likelihood-based inference in stochastic non-linear dynamical systems, such as those found in chemical reaction networks and biological clock systems, is inherently complex and has largely been limited to small and unrealistically simple systems. Recent advances in analytically tractable approximations to the underlying conditional probability distributions enable long-term dynamics to be accurately modelled, and make the large number of model evaluations required for exact Bayesian inference much more feasible. We propose a new methodology for inference in stochastic non-linear dynamical systems exhibiting oscillatory behaviour and show the parameters in these models can be realistically estimated from simulated data. Preliminary analyses based on the Fisher Information Matrix of the model can guide the implementation of Bayesian inference. We show that this parameter sensitivity analysis can predict which parameters are practically identifiable. Several Markov chain Monte Carlo algorithms are compared, with our results suggesting a parallel tempering algorithm consistently gives the best approach for these systems, which are shown to frequently exhibit multi-modal posterior distributions."}, "https://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "https://arxiv.org/abs/2310.06673", "description": "arXiv:2310.06673v2 Announce Type: replace-cross \nAbstract: An assurance calculation is a Bayesian alternative to a power calculation. 
One may be performed to aid the planning of a clinical trial, specifically setting the sample size or to support decisions about whether or not to perform a study. Immuno-oncology is a rapidly evolving area in the development of anticancer drugs. A common phenomenon that arises in trials of such drugs is one of delayed treatment effects, that is, there is a delay in the separation of the survival curves. To calculate assurance for a trial in which a delayed treatment effect is likely to be present, uncertainty about key parameters needs to be considered. If uncertainty is not considered, the number of patients recruited may not be enough to ensure we have adequate statistical power to detect a clinically relevant treatment effect and the risk of an unsuccessful trial is increased. We present a new elicitation technique for when a delayed treatment effect is likely and show how to compute assurance using these elicited prior distributions. We provide an example to illustrate how this can be used in practice and develop open-source software to implement our methods. Our methodology has the potential to improve the success rate and efficiency of Phase III trials in immuno-oncology and for other treatments where a delayed treatment effect is expected to occur."}, "https://arxiv.org/abs/2312.14810": {"title": "Accurate, scalable, and efficient Bayesian Optimal Experimental Design with derivative-informed neural operators", "link": "https://arxiv.org/abs/2312.14810", "description": "arXiv:2312.14810v2 Announce Type: replace-cross \nAbstract: We consider optimal experimental design (OED) problems in selecting the most informative observation sensors to estimate model parameters in a Bayesian framework. Such problems are computationally prohibitive when the parameter-to-observable (PtO) map is expensive to evaluate, the parameters are high-dimensional, and the optimization for sensor selection is combinatorial and high-dimensional. To address these challenges, we develop an accurate, scalable, and efficient computational framework based on derivative-informed neural operators (DINOs). The derivative of the PtO map is essential for accurate evaluation of the optimality criteria of OED in our consideration. We take the key advantage of DINOs, a class of neural operators trained with derivative information, to achieve high approximate accuracy of not only the PtO map but also, more importantly, its derivative. Moreover, we develop scalable and efficient computation of the optimality criteria based on DINOs and propose a modified swapping greedy algorithm for its optimization. We demonstrate that the proposed method is scalable to preserve the accuracy for increasing parameter dimensions and achieves high computational efficiency, with an over 1000x speedup accounting for both offline construction and online evaluation costs, compared to high-fidelity Bayesian OED solutions for a three-dimensional nonlinear convection-diffusion-reaction example with tens of thousands of parameters."}, "https://arxiv.org/abs/2403.19752": {"title": "Deep Learning Framework with Uncertainty Quantification for Survey Data: Assessing and Predicting Diabetes Mellitus Risk in the American Population", "link": "https://arxiv.org/abs/2403.19752", "description": "arXiv:2403.19752v1 Announce Type: new \nAbstract: Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential. 
This approach is key to minimizing potential selective biases in results. The objectives of this paper are: (i) To propose a general predictive framework for regression and classification using neural network (NN) modeling, which incorporates survey weights into the estimation process; (ii) To introduce an uncertainty quantification algorithm for model prediction, tailored for data from complex survey designs; (iii) To apply this method in developing robust risk score models to assess the risk of Diabetes Mellitus in the US population, utilizing data from the NHANES 2011-2014 cohort. The theoretical properties of our estimators are designed to ensure minimal bias and statistical consistency, thereby ensuring that our models yield reliable predictions and contribute novel scientific insights in diabetes research. While focused on diabetes, this NN predictive framework is adaptable to create clinical models for a diverse range of diseases and medical cohorts. The software and the data used in this paper are publicly available on GitHub."}, "https://arxiv.org/abs/2403.19757": {"title": "Nonparametric conditional risk mapping under heteroscedasticity", "link": "https://arxiv.org/abs/2403.19757", "description": "arXiv:2403.19757v1 Announce Type: new \nAbstract: A nonparametric procedure to estimate the conditional probability that a nonstationary geostatistical process exceeds a certain threshold value is proposed. The method consists of a bootstrap algorithm that combines conditional simulation techniques with nonparametric estimations of the trend and the variability. The nonparametric local linear estimator, considering a bandwidth matrix selected by a method that takes the spatial dependence into account, is used to estimate the trend. The variability is modeled by estimating the conditional variance and the variogram from corrected residuals to avoid biases. The proposed method allows one to obtain estimates of the conditional exceedance risk at non-observed spatial locations. The performance of the approach is analyzed by simulation and illustrated with an application to a real data set of precipitation in the U.S."}, "https://arxiv.org/abs/2403.19807": {"title": "Protocols for Observational Studies: Methods and Open Problems", "link": "https://arxiv.org/abs/2403.19807", "description": "arXiv:2403.19807v1 Announce Type: new \nAbstract: For learning about the causal effect of a treatment, a randomized controlled trial (RCT) is considered the gold standard. However, randomizing treatment is sometimes unethical or infeasible, and instead an observational study may be conducted. While some aspects of a well designed RCT cannot be replicated in an observational study, one aspect that can is to have a protocol with prespecified hypotheses about prespecified outcomes and a prespecified analysis. We illustrate the value of protocols for observational studies in three applications -- the effect of playing high school football on later life mental functioning, the effect of police seizing a gun when arresting a domestic violence suspect on future domestic violence, and the effect of mountaintop mining on health. We then discuss methodologies for observational study protocols. We discuss considerations for protocols that are similar between observational studies and RCTs, and considerations that are different. 
The considerations that are different include (i) whether the protocol should be specified before treatment assignment is known or after; (ii) how multiple outcomes should be incorporated into the planned analysis; and (iii) how subgroups should be incorporated into the planned analysis. We conclude with a discussion of a few open problems in the methodology of observational study protocols."}, "https://arxiv.org/abs/2403.19818": {"title": "Testing for common structures in high-dimensional factor models", "link": "https://arxiv.org/abs/2403.19818", "description": "arXiv:2403.19818v1 Announce Type: new \nAbstract: This work proposes a novel procedure to test for common structures across two high-dimensional factor models. The introduced test allows one to uncover whether two factor models are driven by the same loading matrix up to some linear transformation. The test can be used to discover inter-individual relationships between two data sets. In addition, it can be applied to test for structural changes over time in the loading matrix of an individual factor model. The test aims to reduce the set of possible alternatives in a classical change-point setting. The theoretical results establish the asymptotic behavior of the introduced test statistic. The theory is supported by a simulation study showing promising results in empirical test size and power. A data application investigates changes in the loadings when modeling the celebrated US macroeconomic data set of Stock and Watson."}, "https://arxiv.org/abs/2403.19835": {"title": "Constrained least squares simplicial-simplicial regression", "link": "https://arxiv.org/abs/2403.19835", "description": "arXiv:2403.19835v1 Announce Type: new \nAbstract: Simplicial-simplicial regression refers to the regression setting where both the responses and predictor variables lie within the simplex space, i.e. they are compositional. For this setting, constrained least squares, where the regression coefficients themselves lie within the simplex, is proposed. The model is transformation-free, but the adoption of a power transformation is straightforward; it can treat more than one compositional dataset as predictors and offers the possibility of weights among the simplicial predictors. Among the model's advantages are its ability to treat zeros in a natural way and a highly computationally efficient algorithm to estimate its coefficients. Resampling-based hypothesis testing procedures are employed for inference, such as testing for linear independence and for equality of the regression coefficients to some pre-specified values. The performance of the proposed technique, and its comparison to an existing methodology of the same spirit, is assessed using simulation studies and real data examples."}, "https://arxiv.org/abs/2403.19842": {"title": "Optimal regimes with limited resources", "link": "https://arxiv.org/abs/2403.19842", "description": "arXiv:2403.19842v1 Announce Type: new \nAbstract: Policy-makers are often faced with the task of distributing a limited supply of resources. To support decision-making in these settings, statisticians are confronted with two challenges: estimands are defined by allocation strategies that are functions of features of all individuals in a cluster; and, relatedly, the observed data are neither independent nor identically distributed when individuals compete for resources. Existing statistical approaches are inadequate because they ignore at least one of these core features. 
As a solution, we develop theory for a general policy class of dynamic regimes for clustered data, covering existing results in classical and interference settings as special cases. We cover policy-relevant estimands and articulate realistic conditions compatible with resource-limited observed data. We derive identification and inference results for settings with a finite number of individuals in a cluster, where the observed dataset is viewed as a single draw from a super-population of clusters. We also consider asymptotic estimands when the number of individuals in a cluster is allowed to grow; under explicit conditions, we recover previous results, thereby clarifying when the use of existing methods is permitted. Our general results lay the foundation for future research on dynamic regimes for clustered data, including the longitudinal cluster setting."}, "https://arxiv.org/abs/2403.19917": {"title": "Doubly robust estimation and inference for a log-concave counterfactual density", "link": "https://arxiv.org/abs/2403.19917", "description": "arXiv:2403.19917v1 Announce Type: new \nAbstract: We consider the problem of causal inference based on observational data (or the related missing data problem) with a binary or discrete treatment variable. In that context we study counterfactual density estimation, which provides more nuanced information than counterfactual mean estimation (i.e., the average treatment effect). We impose the shape-constraint of log-concavity (a unimodality constraint) on the counterfactual densities, and then develop doubly robust estimators of the log-concave counterfactual density (based on an augmented inverse-probability weighted pseudo-outcome), and show the consistency in various global metrics of that estimator. Based on that estimator we also develop asymptotically valid pointwise confidence intervals for the counterfactual density."}, "https://arxiv.org/abs/2403.19994": {"title": "Supervised Bayesian joint graphical model for simultaneous network estimation and subgroup identification", "link": "https://arxiv.org/abs/2403.19994", "description": "arXiv:2403.19994v1 Announce Type: new \nAbstract: Heterogeneity is a fundamental characteristic of cancer. To accommodate heterogeneity, subgroup identification has been extensively studied and broadly categorized into unsupervised and supervised analysis. Compared to unsupervised analysis, supervised approaches potentially hold greater clinical implications. Under the unsupervised analysis framework, several methods focusing on network-based subgroup identification have been developed, offering more comprehensive insights than those restricted to mean, variance, and other simplistic distributions by incorporating the interconnections among variables. However, research on supervised network-based subgroup identification remains limited. In this study, we develop a novel supervised Bayesian graphical model for jointly identifying multiple heterogeneous networks and subgroups. In the proposed model, heterogeneity is not only reflected in molecular data but also associated with a clinical outcome, and a novel similarity prior is introduced to effectively accommodate similarities among the networks of different subgroups, significantly facilitating clinically meaningful biological network construction and subgroup identification. The consistency properties of the estimates are rigorously established, and an efficient algorithm is developed. 
Extensive simulation studies and a real-world application to TCGA data are conducted, which demonstrate the advantages of the proposed approach in terms of both subgroup and network identification."}, "https://arxiv.org/abs/2403.20007": {"title": "Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization", "link": "https://arxiv.org/abs/2403.20007", "description": "arXiv:2403.20007v1 Announce Type: new \nAbstract: The selection of the best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal components analysis and partial least squares. Both approaches are popular linear dimension-reduction methods with numerous applications in several fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components, new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from the most relevant variables, we propose to cast the best subset solution path method into the principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for the best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real datasets. The first dataset is analyzed using principal component analysis, while the analysis of the second dataset is based on the partial least squares framework."}, "https://arxiv.org/abs/2403.20182": {"title": "Quantifying Uncertainty: All We Need is the Bootstrap?", "link": "https://arxiv.org/abs/2403.20182", "description": "arXiv:2403.20182v1 Announce Type: new \nAbstract: Standard errors, confidence intervals, hypothesis tests, and other quantifications of uncertainty are essential to statistical practice. However, they feature a plethora of different methods, mathematical formulas, and concepts. Could we not just replace them all with the general and relatively easy-to-understand non-parametric bootstrap? We contribute to answering this question with a review of related work and a simulation study of one- and two-sided confidence intervals over several different sample sizes, confidence levels, data generating processes, and functionals. Results show that the double bootstrap is the overall best method and a viable alternative to typically used approaches in all but the smallest sample sizes."}, "https://arxiv.org/abs/2403.20214": {"title": "Hypergraph adjusted plus-minus", "link": "https://arxiv.org/abs/2403.20214", "description": "arXiv:2403.20214v1 Announce Type: new \nAbstract: In team sports, traditional ranking statistics do not allow for the simultaneous evaluation of both individuals and combinations of players. Metrics for individual player rankings often fail to consider the interaction effects between groups of players, while methods for assessing full lineups cannot be used to identify the value of lower-order combinations of players (pairs, trios, etc.). Given that player and lineup rankings are inherently dependent on each other, these limitations may affect the accuracy of performance evaluations. 
To address this, we propose a novel adjusted box score plus-minus (APM) approach that allows for the simultaneous ranking of individual players, lower-order combinations of players, and full lineups. The method adjusts for the complete dependency structure and is motivated by the connection between APM and the hypergraph representation of a team. We discuss the similarities of our approach to other advanced metrics, demonstrate it using NBA data from 2012-2022, and suggest potential directions for future work."}, "https://arxiv.org/abs/2403.20313": {"title": "Towards a turnkey approach to unbiased Monte Carlo estimation of smooth functions of expectations", "link": "https://arxiv.org/abs/2403.20313", "description": "arXiv:2403.20313v1 Announce Type: new \nAbstract: Given a smooth function $f$, we develop a general approach to turn Monte\n Carlo samples with expectation $m$ into an unbiased estimate of $f(m)$.\n Specifically, we develop estimators that are based on randomly truncating\n the Taylor series expansion of $f$ and estimating the coefficients of the\n truncated series. We derive their properties and propose a strategy to set\n their tuning parameters -- which depend on $m$ -- automatically, with a\n view to make the whole approach simple to use. We develop our methods for\n the specific functions $f(x)=\\log x$ and $f(x)=1/x$, as they arise in\n several statistical applications such as maximum likelihood estimation of\n latent variable models and Bayesian inference for un-normalised models.\n Detailed numerical studies are performed for a range of applications to\n determine how competitive and reliable the proposed approach is."}, "https://arxiv.org/abs/2403.20200": {"title": "High-dimensional analysis of ridge regression for non-identically distributed data with a variance profile", "link": "https://arxiv.org/abs/2403.20200", "description": "arXiv:2403.20200v1 Announce Type: cross \nAbstract: High-dimensional linear regression has been thoroughly studied in the context of independent and identically distributed data. We propose to investigate high-dimensional regression models for independent but non-identically distributed data. To this end, we suppose that the set of observed predictors (or features) is a random matrix with a variance profile and with dimensions growing at a proportional rate. Assuming a random effect model, we study the predictive risk of the ridge estimator for linear regression with such a variance profile. In this setting, we provide deterministic equivalents of this risk and of the degree of freedom of the ridge estimator. For certain class of variance profile, our work highlights the emergence of the well-known double descent phenomenon in high-dimensional regression for the minimum norm least-squares estimator when the ridge regularization parameter goes to zero. We also exhibit variance profiles for which the shape of this predictive risk differs from double descent. The proofs of our results are based on tools from random matrix theory in the presence of a variance profile that have not been considered so far to study regression models. Numerical experiments are provided to show the accuracy of the aforementioned deterministic equivalents on the computation of the predictive risk of ridge regression. 
We also investigate the similarities and differences that exist with the standard setting of independent and identically distributed data."}, "https://arxiv.org/abs/2211.06808": {"title": "Flexible Basis Representations for Modeling Large Non-Gaussian Spatial Data", "link": "https://arxiv.org/abs/2211.06808", "description": "arXiv:2211.06808v3 Announce Type: replace \nAbstract: Nonstationary and non-Gaussian spatial data are common in various fields, including ecology (e.g., counts of animal species), epidemiology (e.g., disease incidence counts in susceptible regions), and environmental science (e.g., remotely-sensed satellite imagery). Due to modern data collection methods, the size of these datasets have grown considerably. Spatial generalized linear mixed models (SGLMMs) are a flexible class of models used to model nonstationary and non-Gaussian datasets. Despite their utility, SGLMMs can be computationally prohibitive for even moderately large datasets (e.g., 5,000 to 100,000 observed locations). To circumvent this issue, past studies have embedded nested radial basis functions into the SGLMM. However, two crucial specifications (knot placement and bandwidth parameters), which directly affect model performance, are typically fixed prior to model-fitting. We propose a novel approach to model large nonstationary and non-Gaussian spatial datasets using adaptive radial basis functions. Our approach: (1) partitions the spatial domain into subregions; (2) employs reversible-jump Markov chain Monte Carlo (RJMCMC) to infer the number and location of the knots within each partition; and (3) models the latent spatial surface using partition-varying and adaptive basis functions. Through an extensive simulation study, we show that our approach provides more accurate predictions than competing methods while preserving computational efficiency. We demonstrate our approach on two environmental datasets - incidences of plant species and counts of bird species in the United States."}, "https://arxiv.org/abs/2303.05032": {"title": "Sensitivity analysis for principal ignorability violation in estimating complier and noncomplier average causal effects", "link": "https://arxiv.org/abs/2303.05032", "description": "arXiv:2303.05032v4 Announce Type: replace \nAbstract: An important strategy for identifying principal causal effects, which are often used in settings with noncompliance, is to invoke the principal ignorability (PI) assumption. As PI is untestable, it is important to gauge how sensitive effect estimates are to its violation. We focus on this task for the common one-sided noncompliance setting where there are two principal strata, compliers and noncompliers. Under PI, compliers and noncompliers share the same outcome-mean-given-covariates function under the control condition. For sensitivity analysis, we allow this function to differ between compliers and noncompliers in several ways, indexed by an odds ratio, a generalized odds ratio, a mean ratio, or a standardized mean difference sensitivity parameter. We tailor sensitivity analysis techniques (with any sensitivity parameter choice) to several types of PI-based main analysis methods, including outcome regression, influence function (IF) based and weighting methods. We illustrate the proposed sensitivity analyses using several outcome types from the JOBS II study. This application estimates nuisance functions parametrically -- for simplicity and accessibility. 
In addition, we establish rate conditions on nonparametric nuisance estimation for IF-based estimators to be asymptotically normal -- with a view to inform nonparametric inference."}, "https://arxiv.org/abs/2305.17643": {"title": "Flexible sensitivity analysis for causal inference in observational studies subject to unmeasured confounding", "link": "https://arxiv.org/abs/2305.17643", "description": "arXiv:2305.17643v2 Announce Type: replace \nAbstract: Causal inference with observational studies often suffers from unmeasured confounding, yielding biased estimators based on the unconfoundedness assumption. Sensitivity analysis assesses how the causal conclusions change with respect to different degrees of unmeasured confounding. Most existing sensitivity analysis methods work well for specific types of statistical estimation or testing strategies. We propose a flexible sensitivity analysis framework that can deal with commonly used inverse probability weighting, outcome regression, and doubly robust estimators simultaneously. It is based on the well-known parametrization of the selection bias as comparisons of the observed and counterfactual outcomes conditional on observed covariates. It is attractive for practical use because it only requires simple modifications of the standard estimators. Moreover, it naturally extends to many other causal inference settings, including the causal risk ratio or odds ratio, the average causal effect on the treated units, and studies with survival outcomes. We also develop an R package saci to implement our sensitivity analysis estimators."}, "https://arxiv.org/abs/2306.17260": {"title": "Incorporating Auxiliary Variables to Improve the Efficiency of Time-Varying Treatment Effect Estimation", "link": "https://arxiv.org/abs/2306.17260", "description": "arXiv:2306.17260v2 Announce Type: replace \nAbstract: The use of smart devices (e.g., smartphones, smartwatches) and other wearables for context sensing and delivery of digital interventions to improve health outcomes has grown significantly in behavioral and psychiatric studies. Micro-randomized trials (MRTs) are a common experimental design for obtaining data-driven evidence on mobile health (mHealth) intervention effectiveness where each individual is repeatedly randomized to receive treatments over numerous time points. Individual characteristics and the contexts around randomizations are also collected throughout the study, some may be pre-specified as moderators when assessing time-varying causal effect moderation. Moreover, we have access to abundant measurements beyond just the moderators. Our study aims to leverage this auxiliary information to improve causal estimation and better understand the intervention effect. Similar problems have been raised in randomized control trials (RCTs), where extensive literature demonstrates that baseline covariate information can be incorporated to alleviate chance imbalances and increase asymptotic efficiency. However, covariate adjustment in the context of time-varying treatments and repeated measurements, as seen in MRTs, has not been studied. Recognizing the connection to Neyman Orthogonality, we address this gap by introducing an intuitive approach to incorporate auxiliary variables to improve the efficiency of moderated causal excursion effect estimation. 
The efficiency gain of our approach is proved theoretically and demonstrated through simulation studies and an analysis of data from the Intern Health Study (NeCamp et al., 2020)."}, "https://arxiv.org/abs/2310.13240": {"title": "Transparency challenges in policy evaluation with causal machine learning -- improving usability and accountability", "link": "https://arxiv.org/abs/2310.13240", "description": "arXiv:2310.13240v2 Announce Type: replace-cross \nAbstract: Causal machine learning tools are beginning to see use in real-world policy evaluation tasks to flexibly estimate treatment effects. One issue with these methods is that the machine learning models used are generally black boxes, i.e., there is no globally interpretable way to understand how a model makes estimates. This is a clear problem in policy evaluation applications, particularly in government, because it is difficult to understand whether such models are functioning in ways that are fair, based on the correct interpretation of evidence and transparent enough to allow for accountability if things go wrong. However, there has been little discussion of transparency problems in the causal machine learning literature and how these might be overcome. This paper explores why transparency issues are a problem for causal machine learning in public policy evaluation applications and considers ways these problems might be addressed through explainable AI tools and by simplifying models in line with interpretable AI principles. It then applies these ideas to a case-study using a causal forest model to estimate conditional average treatment effects for a hypothetical change in the school leaving age in Australia. It shows that existing tools for understanding black-box predictive models are poorly suited to causal machine learning and that simplifying the model to make it interpretable leads to an unacceptable increase in error (in this application). It concludes that new tools are needed to properly understand causal machine learning models and the algorithms that fit them."}, "https://arxiv.org/abs/2311.04855": {"title": "Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values", "link": "https://arxiv.org/abs/2311.04855", "description": "arXiv:2311.04855v2 Announce Type: replace-cross \nAbstract: Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. 
We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules."}, "https://arxiv.org/abs/2404.00164": {"title": "Sequential Synthetic Difference in Differences", "link": "https://arxiv.org/abs/2404.00164", "description": "arXiv:2404.00164v1 Announce Type: new \nAbstract: We study the estimation of treatment effects of a binary policy in environments with a staggered treatment rollout. We propose a new estimator -- Sequential Synthetic Difference in Difference (Sequential SDiD) -- and establish its theoretical properties in a linear model with interactive fixed effects. Our estimator is based on sequentially applying the original SDiD estimator proposed in Arkhangelsky et al. (2021) to appropriately aggregated data. To establish the theoretical properties of our method, we compare it to an infeasible OLS estimator based on the knowledge of the subspaces spanned by the interactive fixed effects. We show that this OLS estimator has a sequential representation and use this result to show that it is asymptotically equivalent to the Sequential SDiD estimator. This result implies the asymptotic normality of our estimator along with corresponding efficiency guarantees. The method developed in this paper presents a natural alternative to the conventional DiD strategies in staggered adoption designs."}, "https://arxiv.org/abs/2404.00221": {"title": "Robust Learning for Optimal Dynamic Treatment Regimes with Observational Data", "link": "https://arxiv.org/abs/2404.00221", "description": "arXiv:2404.00221v1 Announce Type: new \nAbstract: Many public policies and medical interventions involve dynamics in their treatment assignments, where treatments are sequentially assigned to the same individuals across multiple stages, and the effect of treatment at each stage is usually heterogeneous with respect to the history of prior treatments and associated characteristics. We study statistical learning of optimal dynamic treatment regimes (DTRs) that guide the optimal treatment assignment for each individual at each stage based on the individual's history. We propose a step-wise doubly-robust approach to learn the optimal DTR using observational data under the assumption of sequential ignorability. The approach solves the sequential treatment assignment problem through backward induction, where, at each step, we combine estimators of propensity scores and action-value functions (Q-functions) to construct augmented inverse probability weighting estimators of values of policies for each stage. The approach consistently estimates the optimal DTR if either a propensity score or Q-function for each stage is consistently estimated. Furthermore, the resulting DTR can achieve the optimal convergence rate $n^{-1/2}$ of regret under mild conditions on the convergence rate for estimators of the nuisance parameters."}, "https://arxiv.org/abs/2404.00256": {"title": "Objective Bayesian FDR", "link": "https://arxiv.org/abs/2404.00256", "description": "arXiv:2404.00256v1 Announce Type: new \nAbstract: Here, we develop an objective Bayesian analysis for large-scale datasets. When Bayesian analysis is applied to large-scale datasets, the cut point that provides the posterior probability is usually determined following customs. In this work, we propose setting the cut point in an objective manner, which is determined so as to match the posterior null number with the estimated true null number. 
The posterior probability obtained using an objective cut point is relatively similar to the real false discovery rate (FDR), which facilitates control of the FDR level."}, "https://arxiv.org/abs/2404.00319": {"title": "Direction Preferring Confidence Intervals", "link": "https://arxiv.org/abs/2404.00319", "description": "arXiv:2404.00319v1 Announce Type: new \nAbstract: Confidence intervals (CIs) are instrumental in statistical analysis, providing a range estimate of the parameters. In modern statistics, selective inference is common, where only certain parameters are highlighted. However, this selective approach can bias the inference, leading some to advocate for the use of CIs over p-values. To increase the flexibility of confidence intervals, we introduce direction-preferring CIs, enabling analysts to focus on parameters trending in a particular direction. We present these types of CIs in two settings: First, when there is no selection of parameters; and second, for situations involving parameter selection, where we offer a conditional version of the direction-preferring CIs. Both of these methods build upon the foundations of Modified Pratt CIs, which rely on non-equivariant acceptance regions to achieve longer intervals in exchange for improved sign exclusions. We show that for selected parameters out of m > 1 initial parameters of interest, CIs aimed at controlling the false coverage rate, have higher power to determine the sign compared to conditional CIs. We also show that conditional confidence intervals control the marginal false coverage rate (mFCR) under any dependency."}, "https://arxiv.org/abs/2404.00359": {"title": "Loss-based prior for tree topologies in BART models", "link": "https://arxiv.org/abs/2404.00359", "description": "arXiv:2404.00359v1 Announce Type: new \nAbstract: We present a novel prior for tree topology within Bayesian Additive Regression Trees (BART) models. This approach quantifies the hypothetical loss in information and the loss due to complexity associated with choosing the wrong tree structure. The resulting prior distribution is compellingly geared toward sparsity, a critical feature considering BART models' tendency to overfit. Our method incorporates prior knowledge into the distribution via two parameters that govern the tree's depth and balance between its left and right branches. Additionally, we propose a default calibration for these parameters, offering an objective version of the prior. We demonstrate our method's efficacy on both simulated and real datasets."}, "https://arxiv.org/abs/2404.00606": {"title": "\"Sound and Fury\": Nonlinear Functionals of Volatility Matrix in the Presence of Jump and Noise", "link": "https://arxiv.org/abs/2404.00606", "description": "arXiv:2404.00606v1 Announce Type: new \nAbstract: This paper resolves a pivotal open problem on nonparametric inference for nonlinear functionals of volatility matrix. Multiple prominent statistical tasks can be formulated as functionals of volatility matrix, yet a unified statistical theory of general nonlinear functionals based on noisy data remains challenging and elusive. Nonetheless, this paper shows it can be achieved by combining the strengths of pre-averaging, jump truncation and nonlinearity bias correction. In light of general nonlinearity, bias correction beyond linear approximation becomes necessary. Resultant estimators are nonparametric and robust over a wide spectrum of stochastic models. 
Moreover, the estimators can be rate-optimal, and stable central limit theorems are obtained. The proposed framework lends itself conveniently to uncertainty quantification and permits fully feasible inference. With strong theoretical guarantees, this paper provides an inferential foundation for a wealth of statistical methods for noisy high-frequency data, such as realized principal component analysis, continuous-time linear regression, realized Laplace transform, generalized method of integrated moments and specification tests, hence extending current application scopes to noisy data, which is more prevalent in practice."}, "https://arxiv.org/abs/2404.00735": {"title": "Two-Stage Nuisance Function Estimation for Causal Mediation Analysis", "link": "https://arxiv.org/abs/2404.00735", "description": "arXiv:2404.00735v1 Announce Type: new \nAbstract: When estimating the direct and indirect causal effects using the influence function-based estimator of the mediation functional, it is crucial to understand what aspects of the treatment, the mediator, and the outcome mean mechanisms should be focused on. Specifically, considering them as nuisance functions and attempting to fit these nuisance functions as accurately as possible is not necessarily the best approach to take. In this work, we propose a two-stage estimation strategy for the nuisance functions that estimates the nuisance functions based on the role they play in the structure of the bias of the influence function-based estimator of the mediation functional. We provide a robustness analysis of the proposed method, as well as sufficient conditions for consistency and asymptotic normality of the estimator of the parameter of interest."}, "https://arxiv.org/abs/2404.00788": {"title": "A Novel Stratified Analysis Method for Testing and Estimating Overall Treatment Effects on Time-to-Event Outcomes Using Average Hazard with Survival Weight", "link": "https://arxiv.org/abs/2404.00788", "description": "arXiv:2404.00788v1 Announce Type: new \nAbstract: Given the limitations of using the Cox hazard ratio to summarize the magnitude of the treatment effect, alternative measures that do not have these limitations are gaining attention. One of the recently proposed alternative methods uses the average hazard with survival weight (AH). This population quantity can be interpreted as the average intensity of the event occurrence in a given time window that does not involve study-specific censoring. Inference procedures for the ratio of AH and difference in AH have already been proposed in simple randomized controlled trial settings to compare two groups. However, methods with stratification factors have not been well discussed, although stratified analysis is often used in practice to adjust for confounding factors and increase the power to detect a between-group difference. The conventional stratified analysis or meta-analysis approach, which integrates stratum-specific treatment effects using an optimal weight, directly applies to the ratio of AH and difference in AH. However, this conventional approach has significant limitations similar to the Cochran-Mantel-Haenszel method for a binary outcome and the stratified Cox procedure for a time-to-event outcome. To address this, we propose a new stratified analysis method for AH using standardization. With the proposed method, one can summarize the between-group treatment effect in both absolute difference and relative terms, adjusting for stratification factors. 
This can be a valuable alternative to the traditional stratified Cox procedure to estimate and report the magnitude of the treatment effect on time-to-event outcomes using hazard."}, "https://arxiv.org/abs/2404.00820": {"title": "Visual analysis of bivariate dependence between continuous random variables", "link": "https://arxiv.org/abs/2404.00820", "description": "arXiv:2404.00820v1 Announce Type: new \nAbstract: Scatter plots are widely recognized as fundamental tools for illustrating the relationship between two numerical variables. Despite this, based on solid theoretical foundations, scatter plots generated from pairs of continuous random variables may not serve as reliable tools for assessing dependence. Sklar's Theorem implies that scatter plots created from ranked data are preferable for such analysis as they exclusively convey information pertinent to dependence. This is in stark contrast to conventional scatter plots, which also encapsulate information about the variables' marginal distributions. Such additional information is extraneous to dependence analysis and can obscure the visual interpretation of the variables' relationship. In this article, we delve into the theoretical underpinnings of these ranked data scatter plots, hereafter referred to as rank plots. We offer insights into interpreting the information they reveal and examine their connections with various association measures, including Pearson's and Spearman's correlation coefficients, as well as Schweizer-Wolff's measure of dependence. Furthermore, we introduce a novel graphical combination for dependence analysis, termed a dplot, and demonstrate its efficacy through real data examples."}, "https://arxiv.org/abs/2404.00864": {"title": "Convolution-t Distributions", "link": "https://arxiv.org/abs/2404.00864", "description": "arXiv:2404.00864v1 Announce Type: new \nAbstract: We introduce a new class of multivariate heavy-tailed distributions that are convolutions of heterogeneous multivariate t-distributions. Unlike commonly used heavy-tailed distributions, the multivariate convolution-t distributions embody cluster structures with flexible nonlinear dependencies and heterogeneous marginal distributions. Importantly, convolution-t distributions have simple density functions that facilitate estimation and likelihood-based inference. The characteristic features of convolution-t distributions are found to be important in an empirical analysis of realized volatility measures and help identify their underlying factor structure."}, "https://arxiv.org/abs/2404.01043": {"title": "The Mean Shape under the Relative Curvature Condition", "link": "https://arxiv.org/abs/2404.01043", "description": "arXiv:2404.01043v1 Announce Type: new \nAbstract: The relative curvature condition (RCC) serves as a crucial constraint, ensuring the avoidance of self-intersection problems in calculating the mean shape over a sample of swept regions. By considering the RCC, this work discusses estimating the mean shape for a class of swept regions called elliptical slabular objects based on a novel shape representation, namely elliptical tube representation (ETRep). The ETRep shape space equipped with extrinsic and intrinsic distances in accordance with object transformation is explained. The intrinsic distance is determined based on the intrinsic skeletal coordinate system of the shape space. Further, calculating the intrinsic mean shape based on the intrinsic distance over a set of ETReps is demonstrated. 
The proposed intrinsic methodology is applied to statistical shape analysis to design global and partial hypothesis testing methods to study the hippocampal structure in early Parkinson's disease."}, "https://arxiv.org/abs/2404.01076": {"title": "Debiased calibration estimation using generalized entropy in survey sampling", "link": "https://arxiv.org/abs/2404.01076", "description": "arXiv:2404.01076v1 Announce Type: new \nAbstract: Incorporating the auxiliary information into the survey estimation is a fundamental problem in survey sampling. Calibration weighting is a popular tool for incorporating the auxiliary information. The calibration weighting method of Deville and Sarndal (1992) uses a distance measure between the design weights and the final weights to solve the optimization problem with calibration constraints. This paper introduces a novel framework that leverages generalized entropy as the objective function for optimization, where design weights play a role in the constraints to ensure design consistency, rather than being part of the objective function. This innovative calibration framework is particularly attractive due to its generality and its ability to generate more efficient calibration weights compared to traditional methods based on Deville and Sarndal (1992). Furthermore, we identify the optimal choice of the generalized entropy function that achieves the minimum variance across various choices of the generalized entropy function under the same constraints. Asymptotic properties, such as design consistency and asymptotic normality, are presented rigorously. The results from a limited simulation study are also presented. We demonstrate a real-life application using agricultural survey data collected from Kynetec, Inc."}, "https://arxiv.org/abs/2404.01191": {"title": "A Semiparametric Approach for Robust and Efficient Learning with Biobank Data", "link": "https://arxiv.org/abs/2404.01191", "description": "arXiv:2404.01191v1 Announce Type: new \nAbstract: With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing its potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias to both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. 
Our developed method outperforms existing methods in extensive simulation studies, as well as in a real-world application in phenotyping and genetic risk modeling of type II diabetes."}, "https://arxiv.org/abs/2404.00751": {"title": "C-XGBoost: A tree boosting model for causal effect estimation", "link": "https://arxiv.org/abs/2404.00751", "description": "arXiv:2404.00751v1 Announce Type: cross \nAbstract: Causal effect estimation aims at estimating the Average Treatment Effect as well as the Conditional Average Treatment Effect of a treatment on an outcome from the available data. This knowledge is important in many safety-critical domains, where it often needs to be extracted from observational data. In this work, we propose a new causal inference model, named C-XGBoost, for the prediction of potential outcomes. The motivation of our approach is to exploit the superiority of tree-based models for handling tabular data together with the notable property of causal inference neural network-based models to learn representations that are useful for estimating the outcome for both the treatment and non-treatment cases. The proposed model also inherits the considerable advantages of the XGBoost model, such as efficiently handling features with missing values with minimal preprocessing effort, and it is equipped with regularization techniques to avoid overfitting/bias. Furthermore, we propose a new loss function for efficiently training the proposed causal inference model. The experimental analysis, which is based on the performance profiles of Dolan and Mor{\\'e} as well as on post-hoc and non-parametric statistical tests, provides strong evidence of the effectiveness of the proposed approach."}, "https://arxiv.org/abs/2404.00753": {"title": "A compromise criterion for weighted least squares estimates", "link": "https://arxiv.org/abs/2404.00753", "description": "arXiv:2404.00753v1 Announce Type: cross \nAbstract: When independent errors in a linear model have non-identity covariance, the ordinary least squares estimate of the model coefficients is less efficient than the weighted least squares estimate. However, the practical application of weighted least squares is challenging due to its reliance on the unknown error covariance matrix. Although feasible weighted least squares estimates, which use an approximation of this matrix, often outperform the ordinary least squares estimate in terms of efficiency, this is not always the case. In some situations, feasible weighted least squares can be less efficient than ordinary least squares. This study identifies the conditions under which feasible weighted least squares estimates using fixed weights demonstrate greater efficiency than the ordinary least squares estimate. These conditions provide guidance for the design of feasible estimates using random weights. They also shed light on how a certain robust regression estimate behaves with respect to the linear model with normal errors of unequal variance."}, "https://arxiv.org/abs/2404.00784": {"title": "Estimating sample paths of Gauss-Markov processes from noisy data", "link": "https://arxiv.org/abs/2404.00784", "description": "arXiv:2404.00784v1 Announce Type: cross \nAbstract: I derive the pointwise conditional means and variances of an arbitrary Gauss-Markov process, given noisy observations of points on a sample path. These moments depend on the process's mean and covariance functions, and on the conditional moments of the sampled points. 
I study the Brownian motion and bridge as special cases."}, "https://arxiv.org/abs/2404.00848": {"title": "Predictive Performance Comparison of Decision Policies Under Confounding", "link": "https://arxiv.org/abs/2404.00848", "description": "arXiv:2404.00848v1 Announce Type: cross \nAbstract: Predictive models are often introduced to decision-making tasks under the rationale that they improve performance over an existing decision-making policy. However, it is challenging to compare predictive performance against an existing decision-making policy that is generally under-specified and dependent on unobservable factors. These sources of uncertainty are often addressed in practice by making strong assumptions about the data-generating mechanism. In this work, we propose a method to compare the predictive performance of decision policies under a variety of modern identification approaches from the causal inference and off-policy evaluation literatures (e.g., instrumental variable, marginal sensitivity model, proximal variable). Key to our method is the insight that there are regions of uncertainty that we can safely ignore in the policy comparison. We develop a practical approach for finite-sample estimation of regret intervals under no assumptions on the parametric form of the status quo policy. We verify our framework theoretically and via synthetic data experiments. We conclude with a real-world application using our framework to support a pre-deployment evaluation of a proposed modification to a healthcare enrollment policy."}, "https://arxiv.org/abs/2404.00912": {"title": "Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms", "link": "https://arxiv.org/abs/2404.00912", "description": "arXiv:2404.00912v1 Announce Type: cross \nAbstract: Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to the randomized algorithm. We propose statistical inference methods for a broad range of sketching distributions, such as the subsampled randomized Hadamard transform (SRHT), Sparse Sign Embeddings (SSE) and CountSketch, sketching matrices with i.i.d. entries, and uniform subsampling. To our knowledge, no comparable methods are available for SSE and for SRHT in PCA. Our novel theoretical approach rests on showing the asymptotic normality of certain quadratic forms. As a contribution of broader interest, we show central limit theorems for quadratic forms of the SRHT, relying on a novel proof via a dyadic expansion that leverages the recursive structure of the Hadamard transform. 
Numerical experiments using both synthetic and empirical datasets support the efficacy of our methods, and in particular suggest that sketching methods can have better computation-estimation tradeoffs than recently proposed optimal subsampling methods."}, "https://arxiv.org/abs/2404.01153": {"title": "TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression", "link": "https://arxiv.org/abs/2404.01153", "description": "arXiv:2404.01153v1 Announce Type: cross \nAbstract: The main challenge that sets transfer learning apart from traditional supervised learning is the distribution shift, reflected as the shift between the source and target models and that between the marginal covariate distributions. In this work, we tackle model shifts in the presence of covariate shifts in the high-dimensional regression setting. Specifically, we propose a two-step method with a novel fused-regularizer that effectively leverages samples from source tasks to improve the learning performance on a target task with limited samples. Nonasymptotic bound is provided for the estimation error of the target model, showing the robustness of the proposed method to covariate shifts. We further establish conditions under which the estimator is minimax-optimal. Additionally, we extend the method to a distributed setting, allowing for a pretraining-finetuning strategy, requiring just one round of communication while retaining the estimation rate of the centralized version. Numerical tests validate our theory, highlighting the method's robustness to covariate shifts."}, "https://arxiv.org/abs/2404.01273": {"title": "TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model", "link": "https://arxiv.org/abs/2404.01273", "description": "arXiv:2404.01273v1 Announce Type: cross \nAbstract: Recently, there has been a burgeoning interest in virtual clinical trials, which simulate real-world scenarios and hold the potential to significantly enhance patient safety, expedite development, reduce costs, and contribute to the broader scientific knowledge in healthcare. Existing research often focuses on leveraging electronic health records (EHRs) to support clinical trial outcome prediction. Yet, trained with limited clinical trial outcome data, existing approaches frequently struggle to perform accurate predictions. Some research has attempted to generate EHRs to augment model development but has fallen short in personalizing the generation for individual patient profiles. Recently, the emergence of large language models has illuminated new possibilities, as their embedded comprehensive clinical knowledge has proven beneficial in addressing medical issues. In this paper, we propose a large language model-based digital twin creation approach, called TWIN-GPT. TWIN-GPT can establish cross-dataset associations of medical information given limited data, generating unique personalized digital twins for different patients, thereby preserving individual patient characteristics. Comprehensive experiments show that using digital twins created by TWIN-GPT can boost clinical trial outcome prediction, exceeding various previous prediction approaches. Besides, we also demonstrate that TWIN-GPT can generate high-fidelity trial data that closely approximate specific patients, aiding in more accurate result predictions in data-scarce situations. 
Moreover, our study provides practical evidence for the application of digital twins in healthcare, highlighting its potential significance."}, "https://arxiv.org/abs/2109.02309": {"title": "Hypothesis Testing for Functional Linear Models via Bootstrapping", "link": "https://arxiv.org/abs/2109.02309", "description": "arXiv:2109.02309v4 Announce Type: replace \nAbstract: Hypothesis testing for the slope function in functional linear regression is of both practical and theoretical interest. We develop a novel test for the nullity of the slope function, where testing the slope function is transformed into testing a high-dimensional vector based on functional principal component analysis. This transformation fully circumvents ill-posedness in functional linear regression, thereby enhancing numeric stability. The proposed method leverages the technique of bootstrapping max statistics and exploits the inherent variance decay property of functional data, improving the empirical power of tests especially when the sample size is limited or the signal is relatively weak. We establish validity and consistency of our proposed test when the functional principal components are derived from data. Moreover, we show that the test maintains its asymptotic validity and consistency, even when including \\emph{all} empirical functional principal components in our test statistics. This sharply contrasts with the task of estimating the slope function, which requires a delicate choice of the number (at most in the order of $\\sqrt n$) of functional principal components to ensure estimation consistency. This distinction highlights an interesting difference between estimation and statistical inference regarding the slope function in functional linear regression. To the best of our knowledge, the proposed test is the first of its kind to utilize all empirical functional principal components."}, "https://arxiv.org/abs/2201.10080": {"title": "Spatial meshing for general Bayesian multivariate models", "link": "https://arxiv.org/abs/2201.10080", "description": "arXiv:2201.10080v2 Announce Type: replace \nAbstract: Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. 
We demonstrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real-world remote sensing and community ecology applications with large-scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of the R package 'meshed', available on CRAN."}, "https://arxiv.org/abs/2209.07028": {"title": "Estimating large causal polytrees from small samples", "link": "https://arxiv.org/abs/2209.07028", "description": "arXiv:2209.07028v3 Announce Type: replace \nAbstract: We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions."}, "https://arxiv.org/abs/2303.16008": {"title": "Risk ratio, odds ratio, risk difference", "link": "https://arxiv.org/abs/2303.16008", "description": "arXiv:2303.16008v3 Announce Type: replace \nAbstract: There are many measures to report so-called treatment or causal effects: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, e.g., absolute versus relative, is often debated because it leads to different impressions of the benefit or risk of a treatment. Moreover, different causal measures may lead to different patterns of treatment effect heterogeneity: some input variables may have an influence on some causal measures and no effect at all on others. In addition, some measures - but not all - have appealing properties such as collapsibility, matching the intuition of a population summary. In this paper, we first review common causal measures and their pros and cons typically brought forward. Doing so, we clarify the notions of collapsibility and treatment effect heterogeneity, unifying existing definitions. Then, we show that for any causal measure there exists a generative model such that the conditional average treatment effect (CATE) captures the treatment effect. However, only the risk difference can disentangle the treatment effect from the baseline at both population and strata levels, regardless of the outcome type (continuous or binary). As our primary goal is the generalization of causal measures, we show that different sets of covariates are needed to generalize an effect to a target population depending on (i) the causal measure of interest, and (ii) the identification method chosen, that is, generalizing either conditional outcomes or local effects."}, "https://arxiv.org/abs/2307.03639": {"title": "Fast and Optimal Inference for Change Points in Piecewise Polynomials via Differencing", "link": "https://arxiv.org/abs/2307.03639", "description": "arXiv:2307.03639v2 Announce Type: replace \nAbstract: We consider the problem of uncertainty quantification in change point regressions, where the signal can be piecewise polynomial of arbitrary but fixed degree. That is, we seek disjoint intervals which, uniformly at a given confidence level, must each contain a change point location. 
We propose a procedure based on performing local tests at a number of scales and locations on a sparse grid, which adapts to the choice of grid in the sense that by choosing a sparser grid one explicitly pays a lower price for multiple testing. The procedure is fast as its computational complexity is always of the order $\\mathcal{O} (n \\log (n))$ where $n$ is the length of the data, and optimal in the sense that under certain mild conditions every change point is detected with high probability and the widths of the intervals returned match the mini-max localisation rates for the associated change point problem up to log factors. A detailed simulation study shows our procedure is competitive against state of the art algorithms for similar problems. Our procedure is implemented in the R package ChangePointInference which is available via https://github.com/gaviosha/ChangePointInference."}, "https://arxiv.org/abs/2307.15213": {"title": "PCA, SVD, and Centering of Data", "link": "https://arxiv.org/abs/2307.15213", "description": "arXiv:2307.15213v2 Announce Type: replace \nAbstract: The research detailed in this paper scrutinizes Principal Component Analysis (PCA), a seminal method employed in statistics and machine learning for the purpose of reducing data dimensionality. Singular Value Decomposition (SVD) is often employed as the primary means for computing PCA, a process that indispensably includes the step of centering - the subtraction of the mean location from the data set. In our study, we delve into a detailed exploration of the influence of this critical yet often ignored or downplayed data centering step. Our research meticulously investigates the conditions under which two PCA embeddings, one derived from SVD with centering and the other without, can be viewed as aligned. As part of this exploration, we analyze the relationship between the first singular vector and the mean direction, subsequently linking this observation to the congruity between two SVDs of centered and uncentered matrices. Furthermore, we explore the potential implications arising from the absence of centering in the context of performing PCA via SVD from a spectral analysis standpoint. Our investigation emphasizes the importance of a comprehensive understanding and acknowledgment of the subtleties involved in the computation of PCA. As such, we believe this paper offers a crucial contribution to the nuanced understanding of this foundational statistical method and stands as a valuable addition to the academic literature in the field of statistics."}, "https://arxiv.org/abs/2308.13069": {"title": "The diachronic Bayesian", "link": "https://arxiv.org/abs/2308.13069", "description": "arXiv:2308.13069v3 Announce Type: replace \nAbstract: It is well known that a Bayesian probability forecast for all future observations should be a probability measure in order to satisfy a natural condition of coherence. The main topics of this paper are the evolution of the Bayesian probability measure and ways of testing its adequacy as it evolves over time. The process of testing evolving Bayesian beliefs is modelled in terms of betting, similarly to the standard Dutch book treatment of coherence. 
The resulting picture is adapted to forecasting several steps ahead and making almost optimal decisions."}, "https://arxiv.org/abs/2310.01402": {"title": "Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs", "link": "https://arxiv.org/abs/2310.01402", "description": "arXiv:2310.01402v2 Announce Type: replace \nAbstract: We investigated whether large language models (LLMs) can develop data validation tests. We considered 96 conditions each for both GPT-3.5 and GPT-4, examining different prompt scenarios, learning modes, temperature settings, and roles. The prompt scenarios were: 1) Asking for expectations, 2) Asking for expectations with a given context, 3) Asking for expectations after requesting a data simulation, and 4) Asking for expectations with a provided data sample. The learning modes were: 1) zero-shot, 2) one-shot, and 3) few-shot learning. We also tested four temperature settings: 0, 0.4, 0.6, and 1. And the two distinct roles were: 1) helpful assistant, 2) expert data scientist. To gauge consistency, every setup was tested five times. The LLM-generated responses were benchmarked against a gold standard data validation suite, created by an experienced data scientist knowledgeable about the data in question. We find there are considerable returns to the use of few-shot learning, and that the more explicit the data setting can be the better, to a point. The best LLM configurations complement, rather than substitute, the gold standard results. This study underscores the value LLMs can bring to the data cleaning and preparation stages of the data science workflow, but highlights that they need considerable evaluation by experienced analysts."}, "https://arxiv.org/abs/2310.02968": {"title": "Sampling depth trade-off in function estimation under a two-level design", "link": "https://arxiv.org/abs/2310.02968", "description": "arXiv:2310.02968v3 Announce Type: replace \nAbstract: Many modern statistical applications involve a two-level sampling scheme that first samples subjects from a population and then samples observations on each subject. These schemes often are designed to learn both the population-level functional structures shared by the subjects and the functional characteristics specific to individual subjects. Common wisdom suggests that learning population-level structures benefits from sampling more subjects whereas learning subject-specific structures benefits from deeper sampling within each subject. Oftentimes these two objectives compete for limited sampling resources, which raises the question of how to optimally sample at the two levels. We quantify such sampling-depth trade-offs by establishing the $L_2$ minimax risk rates for learning the population-level and subject-specific structures under a hierarchical Gaussian process model framework where we consider a Bayesian and a frequentist perspective on the unknown population-level structure. These rates provide general lessons for designing two-level sampling schemes given a fixed sampling budget. Interestingly, they show that subject-specific learning occasionally benefits more by sampling more subjects than by deeper within-subject sampling. We show that the corresponding minimax rates can be readily achieved in practice through simple adaptive estimators without assuming prior knowledge on the underlying variability at the two sampling levels. We validate our theory and illustrate the sampling trade-off in practice through both simulation experiments and two real datasets. 
While we carry out all the theoretical analysis in the context of Gaussian process models for analytical tractability, the results provide insights into effective two-level sampling designs more broadly."}, "https://arxiv.org/abs/2311.02789": {"title": "Estimation and Inference for a Class of Generalized Hierarchical Models", "link": "https://arxiv.org/abs/2311.02789", "description": "arXiv:2311.02789v4 Announce Type: replace \nAbstract: In this paper, we consider estimation and inference for the unknown parameters and function involved in a class of generalized hierarchical models. Such models are of great interest in the literature of neural networks (such as Bauer and Kohler, 2019). We propose a rectified linear unit (ReLU) based deep neural network (DNN) approach, and contribute to the design of DNN by i) providing more transparency for practical implementation, ii) defining different types of sparsity, iii) showing the differentiability, iv) pointing out the set of effective parameters, and v) offering a new variant of the rectified linear activation function (ReLU), etc. Asymptotic properties are established accordingly, and a feasible procedure for the purpose of inference is also proposed. We conduct extensive numerical studies to examine the finite-sample performance of the estimation methods, and we also evaluate the empirical relevance and applicability of the proposed models and estimation methods to real data."}, "https://arxiv.org/abs/2311.14032": {"title": "Counterfactual Sensitivity in Quantitative Spatial Models", "link": "https://arxiv.org/abs/2311.14032", "description": "arXiv:2311.14032v2 Announce Type: replace \nAbstract: Counterfactuals in quantitative spatial models are functions of the current state of the world and the model parameters. Current practice treats the current state of the world as perfectly observed, but there is good reason to believe that it is measured with error. This paper provides tools for quantifying uncertainty about counterfactuals when the current state of the world is measured with error. I recommend an empirical Bayes approach to uncertainty quantification, which is both practical and theoretically justified. I apply the proposed method to the applications in Adao, Costinot, and Donaldson (2017) and Allen and Arkolakis (2022) and find non-trivial uncertainty about counterfactuals."}, "https://arxiv.org/abs/2312.02518": {"title": "The general linear hypothesis testing problem for multivariate functional data with applications", "link": "https://arxiv.org/abs/2312.02518", "description": "arXiv:2312.02518v2 Announce Type: replace \nAbstract: As technology continues to advance at a rapid pace, the prevalence of multivariate functional data (MFD) has expanded across diverse disciplines, spanning biology, climatology, finance, and numerous other fields of study. Although MFD are encountered in various fields, the development of methods for testing hypotheses on mean functions, especially for the general linear hypothesis testing (GLHT) problem for such data, has been limited. In this study, we propose and study a new global test for the GLHT problem for MFD, which includes the one-way FMANOVA, post hoc, and contrast analysis as special cases. The asymptotic null distribution of the test statistic is shown to be a chi-squared-type mixture dependent on the eigenvalues of the heteroscedastic covariance functions. 
The distribution of the chi-squared-type mixture can be well approximated by a three-cumulant matched chi-squared-approximation with its approximation parameters estimated from the data. By incorporating an adjustment coefficient, the proposed test performs effectively irrespective of the correlation structure in the functional data, even when dealing with a relatively small sample size. Additionally, the proposed test is shown to be root-n consistent, that is, it has a nontrivial power against a local alternative. Simulation studies and a real data example demonstrate finite-sample performance and broad applicability of the proposed test."}, "https://arxiv.org/abs/2401.00395": {"title": "Energetic Variational Gaussian Process Regression for Computer Experiments", "link": "https://arxiv.org/abs/2401.00395", "description": "arXiv:2401.00395v2 Announce Type: replace \nAbstract: The Gaussian process (GP) regression model is a widely employed surrogate modeling technique for computer experiments, offering precise predictions and statistical inference for the computer simulators that generate experimental data. Estimation and inference for GP can be performed in both frequentist and Bayesian frameworks. In this chapter, we construct the GP model through variational inference, particularly employing the recently introduced energetic variational inference method by Wang et al. (2021). Adhering to the GP model assumptions, we derive posterior distributions for its parameters. The energetic variational inference approach bridges the Bayesian sampling and optimization and enables approximation of the posterior distributions and identification of the posterior mode. By incorporating a normal prior on the mean component of the GP model, we also apply shrinkage estimation to the parameters, facilitating mean function variable selection. To showcase the effectiveness of our proposed GP model, we present results from three benchmark examples."}, "https://arxiv.org/abs/2110.00152": {"title": "ebnm: An R Package for Solving the Empirical Bayes Normal Means Problem Using a Variety of Prior Families", "link": "https://arxiv.org/abs/2110.00152", "description": "arXiv:2110.00152v3 Announce Type: replace-cross \nAbstract: The empirical Bayes normal means (EBNM) model is important to many areas of statistics, including (but not limited to) multiple testing, wavelet denoising, and gene expression analysis. There are several existing software packages that can fit EBNM models under different prior assumptions and using different algorithms; however, the differences across interfaces complicate direct comparisons. Further, a number of important prior assumptions do not yet have implementations. Motivated by these issues, we developed the R package ebnm, which provides a unified interface for efficiently fitting EBNM models using a variety of prior assumptions, including nonparametric approaches. In some cases, we incorporated existing implementations into ebnm; in others, we implemented new fitting procedures with a focus on speed and numerical stability. We illustrate the use of ebnm in a detailed analysis of baseball statistics. 
By providing a unified and easily extensible interface, the ebnm package can facilitate development of new methods in statistics, genetics, and other areas; as an example, we briefly discuss the R package flashier, which harnesses methods in ebnm to provide a flexible and robust approach to matrix factorization."}, "https://arxiv.org/abs/2205.13589": {"title": "Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes", "link": "https://arxiv.org/abs/2205.13589", "description": "arXiv:2205.13589v3 Announce Type: replace-cross \nAbstract: We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \\underline{P}roxy variable \\underline{P}essimistic \\underline{P}olicy \\underline{O}ptimization (\\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \\texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \\texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \\texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset."}, "https://arxiv.org/abs/2302.06809": {"title": "Large-scale Multiple Testing: Fundamental Limits of False Discovery Rate Control and Compound Oracle", "link": "https://arxiv.org/abs/2302.06809", "description": "arXiv:2302.06809v2 Announce Type: replace-cross \nAbstract: The false discovery rate (FDR) and the false non-discovery rate (FNR), defined as the expected false discovery proportion (FDP) and the false non-discovery proportion (FNP), are the most popular benchmarks for multiple testing. Despite the theoretical and algorithmic advances in recent years, the optimal tradeoff between the FDR and the FNR has been largely unknown except for certain restricted classes of decision rules, e.g., separable rules, or for other performance metrics, e.g., the marginal FDR and the marginal FNR (mFDR and mFNR). In this paper, we determine the asymptotically optimal FDR-FNR tradeoff under the two-group random mixture model when the number of hypotheses tends to infinity. Distinct from the optimal mFDR-mFNR tradeoff, which is achieved by separable decision rules, the optimal FDR-FNR tradeoff requires compound rules even in the large-sample limit and for models as simple as the Gaussian location model. This suboptimality of separable rules also holds for other objectives, such as maximizing the expected number of true discoveries. Finally, to address the limitation of the FDR which only controls the expectation but not the fluctuation of the FDP, we also determine the optimal tradeoff when the FDP is controlled with high probability and show it coincides with that of the mFDR and the mFNR. 
Extensions to models with a fixed non-null proportion are also obtained."}, "https://arxiv.org/abs/2310.15333": {"title": "Safe and Interpretable Estimation of Optimal Treatment Regimes", "link": "https://arxiv.org/abs/2310.15333", "description": "arXiv:2310.15333v2 Announce Type: replace-cross \nAbstract: Recent statistical and reinforcement learning methods have significantly advanced patient care strategies. However, these approaches face substantial challenges in high-stakes contexts, including missing data, inherent stochasticity, and the critical requirements for interpretability and patient safety. Our work operationalizes a safe and interpretable framework to identify optimal treatment regimes. This approach involves matching patients with similar medical and pharmacological characteristics, allowing us to construct an optimal policy via interpolation. We perform a comprehensive simulation study to demonstrate the framework's ability to identify optimal policies even in complex settings. Ultimately, we operationalize our approach to study regimes for treating seizures in critically ill patients. Our findings strongly support personalized treatment strategies based on a patient's medical history and pharmacological features. Notably, we identify that reducing medication doses for patients with mild and brief seizure episodes while adopting aggressive treatment for patients in intensive care unit experiencing intense seizures leads to more favorable outcomes."}, "https://arxiv.org/abs/2401.00104": {"title": "Causal State Distillation for Explainable Reinforcement Learning", "link": "https://arxiv.org/abs/2401.00104", "description": "arXiv:2401.00104v2 Announce Type: replace-cross \nAbstract: Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions can be quite challenging. This lack of transparency in RL models has been a long-standing problem, making it difficult for users to grasp the reasons behind an agent's behaviour. Various approaches have been explored to address this problem, with one promising avenue being reward decomposition (RD). RD is appealing as it sidesteps some of the concerns associated with other methods that attempt to rationalize an agent's behaviour in a post-hoc manner. RD works by exposing various facets of the rewards that contribute to the agent's objectives during training. However, RD alone has limitations as it primarily offers insights based on sub-rewards and does not delve into the intricate cause-and-effect relationships that occur within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework that leverages information-theoretic measures for explanation objectives that encourage three crucial properties of causal factors: causal sufficiency, sparseness, and orthogonality. These properties help us distill the cause-and-effect relationships between the agent's states and actions or rewards, allowing for a deeper understanding of its decision-making processes. Our framework is designed to generate local explanations and can be applied to a wide range of RL tasks with multiple reward channels. 
Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections."}, "https://arxiv.org/abs/2404.01495": {"title": "Estimating Heterogeneous Effects: Applications to Labor Economics", "link": "https://arxiv.org/abs/2404.01495", "description": "arXiv:2404.01495v1 Announce Type: new \nAbstract: A growing number of applications involve settings where, in order to infer heterogeneous effects, a researcher compares various units. Examples of research designs include children moving between different neighborhoods, workers moving between firms, patients migrating from one city to another, and banks offering loans to different firms. We present a unified framework for these settings, based on a linear model with normal random coefficients and normal errors. Using the model, we discuss how to recover the mean and dispersion of effects and other features of their distribution, and how to construct predictors of the effects. We provide moment conditions on the model's parameters, and outline various estimation strategies. A main objective of the paper is to clarify some of the underlying assumptions by highlighting their economic content, and to discuss and inform some of the key practical choices."}, "https://arxiv.org/abs/2404.01546": {"title": "Time-Varying Matrix Factor Models", "link": "https://arxiv.org/abs/2404.01546", "description": "arXiv:2404.01546v1 Announce Type: new \nAbstract: Matrix-variate data of high dimensions are frequently observed in finance and economics, spanning extended time periods, such as the long-term data on international trade flows among numerous countries. To address potential structural shifts and explore the matrix structure's informational context, we propose a time-varying matrix factor model. This model accommodates changing factor loadings over time, revealing the underlying dynamic structure through nonparametric principal component analysis and facilitating dimension reduction. We establish the consistency and asymptotic normality of our estimators under general conditions that allow for weak correlations across time, rows, or columns of the noise. A novel approach is introduced to overcome rotational ambiguity in the estimators, enhancing the clarity and interpretability of the estimated loading matrices. Our simulation study highlights the merits of the proposed estimators and the effectiveness of the smoothing operation. In an application to international trade flows, we investigate trading hubs, centrality, patterns, and trends in the trading network."}, "https://arxiv.org/abs/2404.01566": {"title": "Heterogeneous Treatment Effects and Causal Mechanisms", "link": "https://arxiv.org/abs/2404.01566", "description": "arXiv:2404.01566v1 Announce Type: new \nAbstract: The credibility revolution advances the use of research designs that permit identification and estimation of causal effects. However, understanding which mechanisms produce measured causal effects remains a challenge. A dominant current approach to the quantitative evaluation of mechanisms relies on the detection of heterogeneous treatment effects with respect to pre-treatment covariates. This paper develops a framework to understand when the existence of such heterogeneous treatment effects can support inferences about the activation of a mechanism. We show first that this design cannot provide evidence of mechanism activation without additional, generally implicit, assumptions. 
Further, even when these assumptions are satisfied, if a measured outcome is produced by a non-linear transformation of a directly-affected outcome of theoretical interest, heterogeneous treatment effects are not informative of mechanism activation. We provide novel guidance for interpretation and research design in light of these findings."}, "https://arxiv.org/abs/2404.01641": {"title": "The impact of geopolitical risk on the international agricultural market: Empirical analysis based on the GJR-GARCH-MIDAS model", "link": "https://arxiv.org/abs/2404.01641", "description": "arXiv:2404.01641v1 Announce Type: new \nAbstract: The current international landscape is turbulent and unstable, with frequent outbreaks of geopolitical conflicts worldwide. Geopolitical risk has emerged as a significant threat to regional and global peace, stability, and economic prosperity, causing serious disruptions to the global food system and food security. Focusing on the international food market, this paper builds different dimensions of geopolitical risk measures based on random matrix theory and constructs single- and two-factor GJR-GARCH-MIDAS models with fixed time span and rolling window, respectively, to investigate the impact of geopolitical risk on food market volatility. The findings indicate that modeling based on a rolling window performs better in describing the overall volatility of the wheat, maize, soybean, and rice markets, and the two-factor models exhibit stronger explanatory power in most cases. In terms of short-term fluctuations, all four staple food markets demonstrate obvious volatility clustering and high volatility persistence, without significant asymmetry. Regarding long-term volatility, the realized volatility of wheat, maize, and soybean significantly exacerbates their long-run market volatility. Additionally, geopolitical risks of different dimensions show varying directions and degrees of effects in explaining the long-term market volatility of the four staple food commodities. This study contributes to the understanding of the macro-drivers of food market fluctuations, provides useful information for investment using agricultural futures, and offers valuable insights into maintaining the stable operation of food markets and safeguarding global food security."}, "https://arxiv.org/abs/2404.01688": {"title": "Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis", "link": "https://arxiv.org/abs/2404.01688", "description": "arXiv:2404.01688v1 Announce Type: new \nAbstract: When building statistical models for Bayesian data analysis tasks, required and optional iterative adjustments and different modelling choices can give rise to numerous candidate models. In particular, checks and evaluations throughout the modelling process can motivate changes to an existing model or the consideration of alternative models to ultimately obtain models of sufficient quality for the problem at hand. Additionally, failing to consider alternative models can lead to overconfidence in the predictive or inferential ability of a chosen model. The search for suitable models requires modellers to work with multiple models without jeopardising the validity of their results. Multiverse analysis offers a framework for transparent creation of multiple models at once based on different sensible modelling choices, but the number of candidate models arising in the combination of iterations and possible modelling choices can become overwhelming in practice. 
Motivated by these challenges, this work proposes iterative filtering for multiverse analysis to support efficient and consistent assessment of multiple models and meaningful filtering towards fewer models of higher quality across different modelling contexts. Given that causal constraints have been taken into account, we show how multiverse analysis can be combined with recommendations from established Bayesian modelling workflows to identify promising candidate models by assessing predictive abilities and, if needed, tending to computational issues. We illustrate our suggested approach in different realistic modelling scenarios using real data examples."}, "https://arxiv.org/abs/2404.01734": {"title": "Expansion of net correlations in terms of partial correlations", "link": "https://arxiv.org/abs/2404.01734", "description": "arXiv:2404.01734v1 Announce Type: new \nAbstract: Graphical models are usually employed to represent statistical relationships between pairs of variables when all the remaining variables are fixed. In this picture, conditionally independent pairs are disconnected. In the real world, however, strict conditional independence is almost impossible to prove. Here we use a weaker version of the concept of graphical models, in which only the linear component of the conditional dependencies is represented. This notion enables us to relate the marginal Pearson correlation coefficient (a measure of linear marginal dependence) with the partial correlations (a measure of linear conditional dependence). Specifically, we use the graphical model to express the marginal Pearson correlation $\\rho_{ij}$ between variables $X_i$ and $X_j$ as a sum of the efficacies with which messages propagate along all the paths connecting the variables in the graph. The expansion is convergent, and provides a mechanistic interpretation of how global correlations arise from local interactions. Moreover, by weighing the relevance of each path and of each intermediate node, an intuitive way to imagine interventions is enabled, revealing for example what happens when a given edge is pruned, or the weight of an edge is modified. The expansion is also useful to construct minimal equivalent models, in which latent variables are introduced to replace a larger number of marginalised variables. In addition, the expansion yields an alternative algorithm to calculate marginal Pearson correlations, particularly beneficial when partial correlation matrix inversion is difficult. Finally, for Gaussian variables, the mutual information is also related to message-passing efficacies along paths in the graph."}, "https://arxiv.org/abs/2404.01736": {"title": "Nonparametric efficient causal estimation of the intervention-specific expected number of recurrent events with continuous-time targeted maximum likelihood and highly adaptive lasso estimation", "link": "https://arxiv.org/abs/2404.01736", "description": "arXiv:2404.01736v1 Announce Type: new \nAbstract: Longitudinal settings involving outcome, competing risks and censoring events occurring and recurring in continuous time are common in medical research, but are often analyzed with methods that do not allow for taking post-baseline information into account. In this work, we define statistical and causal target parameters via the g-computation formula by carrying out interventions directly on the product integral representing the observed data distribution in a continuous-time counting process model framework. 
In recurrent events settings, our target parameter identifies the expected number of recurrent events, including in settings where the censoring mechanism or post-baseline treatment decisions depend on past information of post-baseline covariates such as the recurrent event process. We propose a flexible estimation procedure based on targeted maximum likelihood estimation coupled with highly adaptive lasso estimation to provide a novel approach for doubly robust and nonparametric inference for the considered target parameter. We illustrate the methods in a simulation study."}, "https://arxiv.org/abs/2404.01977": {"title": "Least Squares Inference for Data with Network Dependency", "link": "https://arxiv.org/abs/2404.01977", "description": "arXiv:2404.01977v1 Announce Type: new \nAbstract: We address the inference problem concerning regression coefficients in a classical linear regression model using least squares estimates. The analysis is conducted under circumstances where network dependency exists across units in the sample. Neglecting the dependency among observations may lead to biased estimation of the asymptotic variance and often inflates the Type I error in coefficient inference. In this paper, we first establish a central limit theorem for the ordinary least squares estimate, with a verifiable dependence condition alongside corresponding neighborhood growth conditions. Subsequently, we propose a consistent estimator for the asymptotic variance of the estimated coefficients, which employs a data-driven method to balance the bias-variance trade-off. We find that the optimal tuning depends on the linear hypothesis under consideration and must be chosen adaptively. The presented theory and methods are illustrated and supported by numerical experiments and a data example."}, "https://arxiv.org/abs/2404.02093": {"title": "High-dimensional covariance regression with application to co-expression QTL detection", "link": "https://arxiv.org/abs/2404.02093", "description": "arXiv:2404.02093v1 Announce Type: new \nAbstract: While covariance matrices have been widely studied in many scientific fields, relatively limited progress has been made on estimating conditional covariances that permit a large covariance matrix to vary with high-dimensional subject-level covariates. In this paper, we present a new sparse multivariate regression framework that models the covariance matrix as a function of subject-level covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can be used to determine if and how gene co-expressions vary with genetic variations. To accommodate high-dimensional responses and covariates, we stipulate a combined sparsity structure that encourages covariates with non-zero effects and edges that are modulated by these covariates to be simultaneously sparse. We approach parameter estimation with a blockwise coordinate descent algorithm, and investigate the $\\ell_2$ convergence rate of the estimated parameters. In addition, we propose a computationally efficient debiased inference procedure for uncertainty quantification. 
The efficacy of the proposed method is demonstrated through numerical experiments and an application to a gene co-expression network study with brain cancer patients."}, "https://arxiv.org/abs/2404.02141": {"title": "Robustly estimating heterogeneity in factorial data using Rashomon Partitions", "link": "https://arxiv.org/abs/2404.02141", "description": "arXiv:2404.02141v1 Announce Type: new \nAbstract: Many statistical analyses, in both observational data and randomized control trials, ask: how does the outcome of interest vary with combinations of observable covariates? How do various drug combinations affect health outcomes, or how does technology adoption depend on incentives and demographics? Our goal is to partition this factorial space into ``pools'' of covariate combinations where the outcome differs across the pools (but not within a pool). Existing approaches (i) search for a single ``optimal'' partition under assumptions about the association between covariates or (ii) sample from the entire set of possible partitions. Both these approaches ignore the reality that, especially with correlation structure in covariates, many ways to partition the covariate space may be statistically indistinguishable, despite very different implications for policy or science. We develop an alternative perspective, called Rashomon Partition Sets (RPSs). Each item in the RPS partitions the space of covariates using a tree-like geometry. RPSs incorporate all partitions that have posterior values near the maximum a posteriori partition, even if they offer substantively different explanations, and do so using a prior that makes no assumptions about associations between covariates. This prior is the $\\ell_0$ prior, which we show is minimax optimal. Given the RPS we calculate the posterior of any measurable function of the feature effects vector on outcomes, conditional on being in the RPS. We also characterize approximation error relative to the entire posterior and provide bounds on the size of the RPS. Simulations demonstrate this framework allows for robust conclusions relative to conventional regularization techniques. We apply our method to three empirical settings: price effects on charitable giving, chromosomal structure (telomere length), and the introduction of microfinance."}, "https://arxiv.org/abs/2404.01466": {"title": "TS-CausalNN: Learning Temporal Causal Relations from Non-linear Non-stationary Time Series Data", "link": "https://arxiv.org/abs/2404.01466", "description": "arXiv:2404.01466v1 Announce Type: cross \nAbstract: The growing availability and importance of time series data across various domains, including environmental science, epidemiology, and economics, has led to an increasing need for time-series causal discovery methods that can identify the intricate relationships in the non-stationary, non-linear, and often noisy real world data. However, the majority of current time series causal discovery methods assume stationarity and linear relations in data, making them infeasible for the task. Further, the recent deep learning-based methods rely on the traditional causal structure learning approaches making them computationally expensive. In this paper, we propose a Time-Series Causal Neural Network (TS-CausalNN) - a deep learning technique to discover contemporaneous and lagged causal relations simultaneously. 
Our proposed architecture comprises (i) convolutional blocks with parallel custom causal layers, (ii) an acyclicity constraint, and (iii) optimization techniques using the augmented Lagrangian approach. In addition to the simple parallel design, an advantage of the proposed model is that it naturally handles the non-stationarity and non-linearity of the data. Through experiments on multiple synthetic and real world datasets, we demonstrate the empirical proficiency of our proposed approach as compared to several state-of-the-art methods. The inferred graphs for the real world dataset are in good agreement with the domain understanding."}, "https://arxiv.org/abs/2404.01467": {"title": "Transnational Network Dynamics of Problematic Information Diffusion", "link": "https://arxiv.org/abs/2404.01467", "description": "arXiv:2404.01467v1 Announce Type: cross \nAbstract: This study maps the spread of two cases of COVID-19 conspiracy theories and misinformation in Spanish and French in Latin American and French-speaking communities on Facebook, and thus contributes to understanding the dynamics, reach and consequences of emerging transnational misinformation networks. The findings show that co-sharing behavior of public Facebook groups created transnational networks by sharing videos of Medicos por la Verdad (MPV) conspiracy theories in Spanish and hydroxychloroquine-related misinformation sparked by microbiologist Didier Raoult (DR) in French, usually igniting the surge of locally led interest groups across the Global South. Using inferential methods, the study shows how these networks are enabled primarily by shared cultural and thematic attributes among Facebook groups, effectively creating very large, networked audiences. The study contributes to the understanding of how potentially harmful conspiracy theories and misinformation transcend national borders through non-English speaking online communities, further highlighting the overlooked role of transnationalism in global misinformation diffusion and the potentially disproportionate harm that it causes in vulnerable communities across the globe."}, "https://arxiv.org/abs/2404.01469": {"title": "A group testing based exploration of age-varying factors in chlamydia infections among Iowa residents", "link": "https://arxiv.org/abs/2404.01469", "description": "arXiv:2404.01469v1 Announce Type: cross \nAbstract: Group testing, a method that screens subjects in pooled samples rather than individually, has been employed as a cost-effective strategy for chlamydia screening among Iowa residents. In efforts to deepen our understanding of chlamydia epidemiology in Iowa, several group testing regression models have been proposed. Different from previous approaches, we expand upon the varying coefficient model to capture potential age-varying associations with chlamydia infection risk. In general, our model operates within a Bayesian framework, allowing regression associations to vary with a covariate of key interest. We employ a stochastic search variable selection process for regularization in estimation. Additionally, our model can integrate random effects to consider potential geographical factors and estimate unknown assay accuracy probabilities. The performance of our model is assessed through comprehensive simulation studies. Upon application to the Iowa group testing dataset, we reveal a significant age-varying racial disparity in chlamydia infections. 
We believe this discovery has the potential to inform the enhancement of interventions and prevention strategies, leading to more effective chlamydia control and management, thereby promoting health equity across all populations."}, "https://arxiv.org/abs/2404.01595": {"title": "Propensity Score Alignment of Unpaired Multimodal Data", "link": "https://arxiv.org/abs/2404.01595", "description": "arXiv:2404.01595v1 Announce Type: cross \nAbstract: Multimodal representation learning techniques typically rely on paired samples to learn common representations, but paired samples are challenging to collect in fields such as biology where measurement devices often destroy the samples. This paper presents an approach to address the challenge of aligning unpaired samples across disparate modalities in multimodal representation learning. We draw an analogy between potential outcomes in causal inference and potential views in multimodal observations, which allows us to use Rubin's framework to estimate a common space in which to match samples. Our approach assumes we collect samples that are experimentally perturbed by treatments, and uses this to estimate a propensity score from each modality, which encapsulates all shared information between a latent state and treatment and can be used to define a distance between samples. We experiment with two alignment techniques that leverage this distance -- shared nearest neighbours (SNN) and optimal transport (OT) matching -- and find that OT matching results in significant improvements over state-of-the-art alignment approaches in both a synthetic multi-modal setting and in real-world data from the NeurIPS Multimodal Single-Cell Integration Challenge."}, "https://arxiv.org/abs/2404.01608": {"title": "FAIRM: Learning invariant representations for algorithmic fairness and domain generalization with minimax optimality", "link": "https://arxiv.org/abs/2404.01608", "description": "arXiv:2404.01608v1 Announce Type: cross \nAbstract: Machine learning methods often assume that the test data have the same distribution as the training data. However, this assumption may not hold due to multiple levels of heterogeneity in applications, raising issues in algorithmic fairness and domain generalization. In this work, we address the problem of fair and generalizable machine learning by invariant principles. We propose a training environment-based oracle, FAIRM, which has desirable fairness and domain generalization properties under a diversity-type condition. We then provide an empirical FAIRM with finite-sample theoretical guarantees under weak distributional assumptions. We then develop efficient algorithms to realize FAIRM in linear models and demonstrate the nonasymptotic performance with minimax optimality. We evaluate our method in numerical experiments with synthetic data and MNIST data and show that it outperforms its counterparts."}, "https://arxiv.org/abs/2404.02120": {"title": "DEMO: Dose Exploration, Monitoring, and Optimization Using a Biological Mediator for Clinical Outcomes", "link": "https://arxiv.org/abs/2404.02120", "description": "arXiv:2404.02120v1 Announce Type: cross \nAbstract: Phase 1-2 designs provide a methodological advance over phase 1 designs for dose finding by using both clinical response and toxicity. A phase 1-2 trial still may fail to select a truly optimal dose, because early response is not a perfect surrogate for long-term therapeutic success. 
To address this problem, a generalized phase 1-2 design first uses a phase 1-2 design's components to identify a set of candidate doses, adaptively randomizes patients among the candidates, and after longer follow up selects a dose to maximize long-term success rate. In this paper, we extend this paradigm by proposing a design that exploits an early treatment-related, real-valued biological outcome, such as pharmacodynamic activity or an immunological effect, that may act as a mediator between dose and clinical outcomes, including tumor response, toxicity, and survival time. We assume multivariate dose-outcome models that include effects appearing in causal pathways from dose to the clinical outcomes. Bayesian model selection is used to identify and eliminate biologically inactive doses. At the end of the trial, a therapeutically optimal dose is chosen from the set of doses that are acceptably safe, clinically effective, and biologically active to maximize restricted mean survival time. Results of a simulation study show that the proposed design may provide substantial improvements over designs that ignore the biological variable."}, "https://arxiv.org/abs/2007.02404": {"title": "Semi-parametric TEnsor Factor Analysis by Iteratively Projected Singular Value Decomposition", "link": "https://arxiv.org/abs/2007.02404", "description": "arXiv:2007.02404v2 Announce Type: replace \nAbstract: This paper introduces a general framework of Semi-parametric TEnsor Factor Analysis (STEFA) that focuses on the methodology and theory of low-rank tensor decomposition with auxiliary covariates. Semi-parametric TEnsor Factor Analysis models extend tensor factor models by incorporating auxiliary covariates in the loading matrices. We propose an algorithm of iteratively projected singular value decomposition (IP-SVD) for the semi-parametric estimation. It iteratively projects tensor data onto the linear space spanned by the basis functions of covariates and applies singular value decomposition on matricized tensors over each mode. We establish the convergence rates of the loading matrices and the core tensor factor. The theoretical results only require a sub-exponential noise distribution, which is weaker than the assumption of sub-Gaussian tail of noise in the literature. Compared with the Tucker decomposition, IP-SVD yields more accurate estimators with a faster convergence rate. Besides estimation, we propose several prediction methods with new covariates based on the STEFA model. On both synthetic and real tensor data, we demonstrate the efficacy of the STEFA model and the IP-SVD algorithm on both the estimation and prediction tasks."}, "https://arxiv.org/abs/2307.04457": {"title": "Predicting milk traits from spectral data using Bayesian probabilistic partial least squares regression", "link": "https://arxiv.org/abs/2307.04457", "description": "arXiv:2307.04457v3 Announce Type: replace \nAbstract: High-dimensional spectral data--routinely generated in dairy production--are used to predict a range of traits in milk products. Partial least squares (PLS) regression is ubiquitously used for these prediction tasks. However, PLS regression is not typically viewed as arising from statistical inference of a probabilistic model, and parameter uncertainty is rarely quantified. 
Additionally, PLS regression does not easily lend itself to model-based modifications, coherent prediction intervals are not readily available, and the process of choosing the latent-space dimension, $\\mathtt{Q}$, can be subjective and sensitive to data size. We introduce a Bayesian latent-variable model, emulating the desirable properties of PLS regression while accounting for parameter uncertainty in prediction. The need to choose $\\mathtt{Q}$ is eschewed through a nonparametric shrinkage prior. The flexibility of the proposed Bayesian partial least squares (BPLS) regression framework is exemplified by considering sparsity modifications and allowing for multivariate response prediction. The BPLS regression framework is used in two motivating settings: 1) multivariate trait prediction from mid-infrared spectral analyses of milk samples, and 2) milk pH prediction from surface-enhanced Raman spectral data. The prediction performance of BPLS regression at least matches that of PLS regression. Additionally, the provision of correctly calibrated prediction intervals objectively provides richer, more informative inference for stakeholders in dairy production."}, "https://arxiv.org/abs/2308.06913": {"title": "Improving the Estimation of Site-Specific Effects and their Distribution in Multisite Trials", "link": "https://arxiv.org/abs/2308.06913", "description": "arXiv:2308.06913v2 Announce Type: replace \nAbstract: In multisite trials, researchers are often interested in several inferential goals: estimating treatment effects for each site, ranking these effects, and studying their distribution. This study seeks to identify optimal methods for estimating these targets. Through a comprehensive simulation study, we assess two strategies and their combined effects: semiparametric modeling of the prior distribution, and alternative posterior summary methods tailored to minimize specific loss functions. Our findings highlight that the success of different estimation strategies depends largely on the amount of within-site and between-site information available from the data. We discuss how our results can guide balancing the trade-offs associated with shrinkage in limited data environments."}, "https://arxiv.org/abs/2112.10151": {"title": "Edge differentially private estimation in the $\\beta$-model via jittering and method of moments", "link": "https://arxiv.org/abs/2112.10151", "description": "arXiv:2112.10151v2 Announce Type: replace-cross \nAbstract: A standing challenge in data privacy is the trade-off between the level of privacy and the efficiency of statistical inference. Here we conduct an in-depth study of this trade-off for parameter estimation in the $\\beta$-model (Chatterjee, Diaconis and Sly, 2011) for edge differentially private network data released via jittering (Karwa, Krivitsky and Slavkovi\\'{c}, 2017). Unlike most previous approaches based on maximum likelihood estimation for this network model, we proceed via method-of-moments. This choice facilitates our exploration of a substantially broader range of privacy levels - corresponding to stricter privacy - than has been possible to date. Over this new range we discover that our proposed estimator for the parameters exhibits an interesting phase transition, with both its convergence rate and asymptotic variance following one of three different regimes of behavior depending on the level of privacy. 
Because identification of the operable regime is difficult if not impossible in practice, we devise a novel adaptive bootstrap procedure to construct uniform inference across different phases. In fact, leveraging this bootstrap we are able to provide for simultaneous inference of all parameters in the $\\beta$-model (i.e., equal to the number of nodes), which, to our best knowledge, is the first result of its kind. Numerical experiments confirm the competitive and reliable finite sample performance of the proposed inference methods, next to a comparable maximum likelihood method, as well as significant advantages in terms of computational speed and memory."}, "https://arxiv.org/abs/2311.03381": {"title": "Separating and Learning Latent Confounders to Enhancing User Preferences Modeling", "link": "https://arxiv.org/abs/2311.03381", "description": "arXiv:2311.03381v2 Announce Type: replace-cross \nAbstract: Recommender models aim to capture user preferences from historical feedback and then predict user-specific feedback on candidate items. However, the presence of various unmeasured confounders causes deviations between the user preferences in the historical feedback and the true preferences, resulting in models not meeting their expected performance. Existing debias models either (1) specific to solving one particular bias or (2) directly obtain auxiliary information from user historical feedback, which cannot identify whether the learned preferences are true user preferences or mixed with unmeasured confounders. Moreover, we find that the former recommender system is not only a successor to unmeasured confounders but also acts as an unmeasured confounder affecting user preference modeling, which has always been neglected in previous studies. To this end, we incorporate the effect of the former recommender system and treat it as a proxy for all unmeasured confounders. We propose a novel framework, Separating and Learning Latent Confounders For Recommendation (SLFR), which obtains the representation of unmeasured confounders to identify the counterfactual feedback by disentangling user preferences and unmeasured confounders, then guides the target model to capture the true preferences of users. Extensive experiments in five real-world datasets validate the advantages of our method."}, "https://arxiv.org/abs/2404.02228": {"title": "Seemingly unrelated Bayesian additive regression trees for cost-effectiveness analyses in healthcare", "link": "https://arxiv.org/abs/2404.02228", "description": "arXiv:2404.02228v1 Announce Type: new \nAbstract: In recent years, theoretical results and simulation evidence have shown Bayesian additive regression trees to be a highly-effective method for nonparametric regression. Motivated by cost-effectiveness analyses in health economics, where interest lies in jointly modelling the costs of healthcare treatments and the associated health-related quality of life experienced by a patient, we propose a multivariate extension of BART applicable in regression and classification analyses with several correlated outcome variables. Our framework overcomes some key limitations of existing multivariate BART models by allowing each individual response to be associated with different ensembles of trees, while still handling dependencies between the outcomes. In the case of continuous outcomes, our model is essentially a nonparametric version of seemingly unrelated regression. 
Likewise, our proposal for binary outcomes is a nonparametric generalisation of the multivariate probit model. We give suggestions for easily interpretable prior distributions, which allow specification of both informative and uninformative priors. We provide detailed discussions of MCMC sampling methods to conduct posterior inference. Our methods are implemented in the R package `suBART'. We showcase their performance through extensive simulations and an application to an empirical case study from health economics. By also accommodating propensity scores in a manner befitting a causal analysis, we find substantial evidence for a novel trauma care intervention's cost-effectiveness."}, "https://arxiv.org/abs/2404.02283": {"title": "Integrating representative and non-representative survey data for efficient inference", "link": "https://arxiv.org/abs/2404.02283", "description": "arXiv:2404.02283v1 Announce Type: new \nAbstract: Non-representative surveys are commonly used and widely available but suffer from selection bias that generally cannot be entirely eliminated using weighting techniques. Instead, we propose a Bayesian method to synthesize longitudinal representative unbiased surveys with non-representative biased surveys by estimating the degree of selection bias over time. We show using a simulation study that synthesizing biased and unbiased surveys together outperforms using the unbiased surveys alone, even if the selection bias may evolve in a complex manner over time. Using COVID-19 vaccination data, we are able to synthesize two large sample biased surveys with an unbiased survey to reduce uncertainty in nowcasting and inference estimates while simultaneously retaining the empirical credible interval coverage. Ultimately, we are able to conceptually obtain the properties of a large sample unbiased survey if the assumed unbiased survey, used to anchor the estimates, is unbiased for all time-points."}, "https://arxiv.org/abs/2404.02313": {"title": "Optimal combination of composite likelihoods using approximate Bayesian computation with application to state-space models", "link": "https://arxiv.org/abs/2404.02313", "description": "arXiv:2404.02313v1 Announce Type: new \nAbstract: Composite likelihood provides approximate inference when the full likelihood is intractable and sub-likelihood functions of marginal events can be evaluated relatively easily. It has been successfully applied for many complex models. However, its wider application is limited by two issues. First, weight selection of marginal likelihood can have a significant impact on the information efficiency and is currently an open question. Second, calibrated Bayesian inference with composite likelihood requires curvature adjustment which is difficult for dependent data. This work shows that approximate Bayesian computation (ABC) can properly address these two issues by using multiple composite score functions as summary statistics. First, the summary-based posterior distribution gives the optimal Godambe information among a wide class of estimators defined by linear combinations of estimating functions. Second, to make ABC computationally feasible for models where marginal likelihoods have no closed form, a novel approach is proposed to estimate all simulated marginal scores using a Monte Carlo sample of size N. Sufficient conditions are given for the additional noise to be negligible with N fixed as the data size n goes to infinity, and the computational cost is O(n). 
Third, asymptotic properties of ABC with summary statistics having heterogeneous convergence rates are derived, and an adaptive scheme to choose the component composite scores is proposed. Numerical studies show that the new method significantly outperforms the existing Bayesian composite likelihood methods, and the efficiency of adaptively combined composite scores well approximates the efficiency of particle MCMC using the full likelihood."}, "https://arxiv.org/abs/2404.02400": {"title": "On Improved Semi-parametric Bounds for Tail Probability and Expected Loss", "link": "https://arxiv.org/abs/2404.02400", "description": "arXiv:2404.02400v1 Announce Type: new \nAbstract: We revisit the fundamental issue of tail behavior of accumulated random realizations when individual realizations are independent, and we develop new sharper bounds on the tail probability and expected linear loss. The underlying distribution is semi-parametric in the sense that it remains unrestricted other than the assumed mean and variance. Our sharp bounds complement well-established results in the literature, including those based on aggregation, which often fail to take full account of independence and use less elegant proofs. New insights include a proof that in the non-identical case, the distributions attaining the bounds have the equal range property, and that the impact of each random variable on the expected value of the sum can be isolated using an extension of the Korkine identity. We show that the new bounds not only complement the extant results but also open up abundant practical applications, including improved pricing of product bundles, more precise option pricing, more efficient insurance design, and better inventory management."}, "https://arxiv.org/abs/2404.02453": {"title": "Exploring the Connection Between the Normalized Power Prior and Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2404.02453", "description": "arXiv:2404.02453v1 Announce Type: new \nAbstract: The power prior is a popular class of informative priors for incorporating information from historical data. It involves raising the likelihood for the historical data to a power, which acts as a discounting parameter. When the discounting parameter is modeled as random, the normalized power prior is recommended. Bayesian hierarchical modeling is a widely used method for synthesizing information from different sources, including historical data. In this work, we examine the analytical relationship between the normalized power prior (NPP) and Bayesian hierarchical models (BHM) for \\emph{i.i.d.} normal data. We establish a direct relationship between the prior for the discounting parameter of the NPP and the prior for the variance parameter of the BHM. Such a relationship is first established for the case of a single historical dataset, and then extended to the case with multiple historical datasets with dataset-specific discounting parameters. For multiple historical datasets, we develop and establish theory for the BHM-matching NPP (BNPP), which establishes dependence between the dataset-specific discounting parameters leading to inferences that are identical to the BHM. Establishing this relationship not only justifies the NPP from the perspective of hierarchical modeling, but also provides insight on prior elicitation for the NPP. 
We present strategies on inducing priors on the discounting parameter based on hierarchical models, and investigate the borrowing properties of the BNPP."}, "https://arxiv.org/abs/2404.02584": {"title": "Moran's I 2-Stage Lasso: for Models with Spatial Correlation and Endogenous Variables", "link": "https://arxiv.org/abs/2404.02584", "description": "arXiv:2404.02584v1 Announce Type: new \nAbstract: We propose a novel estimation procedure for models with endogenous variables in the presence of spatial correlation based on Eigenvector Spatial Filtering. The procedure, called Moran's $I$ 2-Stage Lasso (Mi-2SL), uses a two-stage Lasso estimator where the Standardised Moran's I is used to set the Lasso tuning parameter. Unlike existing spatial econometric methods, this has the key benefit of not requiring the researcher to explicitly model the spatial correlation process, which is of interest in cases where they are only interested in removing the resulting bias when estimating the direct effect of covariates. We show the conditions necessary for consistent and asymptotically normal parameter estimation assuming the support (relevant) set of eigenvectors is known. Our Monte Carlo simulation results also show that Mi-2SL performs well against common alternatives in the presence of spatial correlation. Our empirical application replicates Cadena and Kovak (2016) instrumental variables estimates using Mi-2SL and shows that in that case, Mi-2SL can boost the performance of the first stage."}, "https://arxiv.org/abs/2404.02594": {"title": "Comparison of the LASSO and Integrative LASSO with Penalty Factors (IPF-LASSO) methods for multi-omics data: Variable selection with Type I error control", "link": "https://arxiv.org/abs/2404.02594", "description": "arXiv:2404.02594v1 Announce Type: new \nAbstract: Variable selection in relation to regression modeling has constituted a methodological problem for more than 60 years. Especially in the context of high-dimensional regression, developing stable and reliable methods, algorithms, and computational tools for variable selection has become an important research topic. Omics data is one source of such high-dimensional data, characterized by diverse genomic layers, and an additional analytical challenge is how to integrate these layers into various types of analyses. While the IPF-LASSO model has previously explored the integration of multiple omics modalities for feature selection and prediction by introducing distinct penalty parameters for each modality, the challenge of incorporating heterogeneous data layers into variable selection with Type I error control remains an open problem. To address this problem, we applied stability selection as a method for variable selection with false positives control in both IPF-LASSO and regular LASSO. The objective of this study was to compare the LASSO algorithm with IPF-LASSO, investigating whether introducing different penalty parameters per omics modality could improve statistical power while controlling false positives. Two high-dimensional data structures were investigated, one with independent data and the other with correlated data. 
The different models were also illustrated using data from a study on breast cancer treatment, where the IPF-LASSO model was able to select some highly relevant clinical variables."}, "https://arxiv.org/abs/2404.02671": {"title": "Bayesian Bi-level Sparse Group Regressions for Macroeconomic Forecasting", "link": "https://arxiv.org/abs/2404.02671", "description": "arXiv:2404.02671v1 Announce Type: new \nAbstract: We propose a Machine Learning approach for optimal macroeconomic forecasting in a high-dimensional setting with covariates presenting a known group structure. Our model encompasses forecasting settings with many series, mixed frequencies, and unknown nonlinearities. We introduce in time-series econometrics the concept of bi-level sparsity, i.e. sparsity holds at both the group level and within groups, and we assume the true model satisfies this assumption. We propose a prior that induces bi-level sparsity, and the corresponding posterior distribution is demonstrated to contract at the minimax-optimal rate, recover the model parameters, and have a support that includes the support of the model asymptotically. Our theory allows for correlation between groups, while predictors in the same group can be characterized by strong covariation as well as common characteristics and patterns. Finite sample performance is illustrated through comprehensive Monte Carlo experiments and a real-data nowcasting exercise of the US GDP growth rate."}, "https://arxiv.org/abs/2404.02685": {"title": "Testing Independence Between High-Dimensional Random Vectors Using Rank-Based Max-Sum Tests", "link": "https://arxiv.org/abs/2404.02685", "description": "arXiv:2404.02685v1 Announce Type: new \nAbstract: In this paper, we address the problem of testing independence between two high-dimensional random vectors. Our approach involves a series of max-sum tests based on three well-known classes of rank-based correlations. These correlation classes encompass several popular rank measures, including Spearman's $\\rho$, Kendall's $\\tau$, Hoeffding's D, Blum-Kiefer-Rosenblatt's R and Bergsma-Dassios-Yanagimoto's $\\tau^*$. The key advantages of our proposed tests are threefold: (1) they do not rely on specific assumptions about the distribution of random vectors, a flexibility that makes them applicable across various scenarios; (2) they can proficiently manage non-linear dependencies between random vectors, a critical aspect in high-dimensional contexts; (3) they have robust performance, regardless of whether the alternative hypothesis is sparse or dense. Notably, our proposed tests demonstrate significant advantages in various scenarios, which is suggested by extensive numerical results and an empirical application in RNA microarray analysis."}, "https://arxiv.org/abs/2404.02764": {"title": "Estimation of Quantile Functionals in Linear Model", "link": "https://arxiv.org/abs/2404.02764", "description": "arXiv:2404.02764v1 Announce Type: new \nAbstract: Various indicators and measures of real-life procedures arise as functionals of the quantile process of a parent random variable Z. However, Z can be observed only through a response in a linear model whose covariates are not under our control and the probability distribution of error terms is generally unknown. The problem is that of nonparametric estimation or other inference for such functionals. 
We propose an estimation procedure based on the averaged two-step regression quantile, recently developed by the authors, combined with an R-estimator of slopes of the linear model."}, "https://arxiv.org/abs/2404.02184": {"title": "What is to be gained by ensemble models in analysis of spectroscopic data?", "link": "https://arxiv.org/abs/2404.02184", "description": "arXiv:2404.02184v1 Announce Type: cross \nAbstract: An empirical study was carried out to compare different implementations of ensemble models aimed at improving prediction in spectroscopic data. A wide range of candidate models were fitted to benchmark datasets from regression and classification settings. A statistical analysis using a linear mixed model was carried out on prediction performance criteria resulting from model fits over random splits of the data. The results showed that the ensemble classifiers were able to consistently outperform candidate models in our application."}, "https://arxiv.org/abs/2404.02519": {"title": "Differentially Private Verification of Survey-Weighted Estimates", "link": "https://arxiv.org/abs/2404.02519", "description": "arXiv:2404.02519v1 Announce Type: cross \nAbstract: Several official statistics agencies release synthetic data as public use microdata files. In practice, synthetic data do not admit accurate results for every analysis. Thus, it is beneficial for agencies to provide users with feedback on the quality of their analyses of the synthetic data. One approach is to couple synthetic data with a verification server that provides users with measures of the similarity of estimates computed with the synthetic and underlying confidential data. However, such measures leak information about the confidential records, so that agencies may wish to apply disclosure control methods to the released verification measures. We present a verification measure that satisfies differential privacy and can be used when the underlying confidential data are collected with a complex survey design. We illustrate the verification measure using repeated sampling simulations where the confidential data are sampled with a probability proportional to size design, and the analyst estimates a population total or mean with the synthetic data. The simulations suggest that the verification measures can provide useful information about the quality of synthetic data inferences."}, "https://arxiv.org/abs/2404.02736": {"title": "On the Estimation of bivariate Conditional Transition Rates", "link": "https://arxiv.org/abs/2404.02736", "description": "arXiv:2404.02736v1 Announce Type: cross \nAbstract: Recent literature has found conditional transition rates to be a useful tool for avoiding Markov assumptions in multistate models. While the estimation of univariate conditional transition rates has been extensively studied, the intertemporal dependencies captured in the bivariate conditional transition rates still require a consistent estimator. We provide an estimator that is suitable for censored data and emphasize the connection to the rich theory of the estimation of bivariate survival functions. 
Bivariate conditional transition rates are necessary for various applications in the survival context but especially in the calculation of moments in life insurance mathematics."}, "https://arxiv.org/abs/1603.09326": {"title": "Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index", "link": "https://arxiv.org/abs/1603.09326", "description": "arXiv:1603.09326v4 Announce Type: replace \nAbstract: Estimating the long-term effects of treatments is of interest in many fields. A common challenge in estimating such treatment effects is that long-term outcomes are unobserved in the time frame needed to make policy decisions. One approach to overcome this missing data problem is to analyze treatment effects on an intermediate outcome, often called a statistical surrogate, if it satisfies the condition that treatment and outcome are independent conditional on the statistical surrogate. The validity of the surrogacy condition is often controversial. Here we exploit the fact that in modern datasets, researchers often observe a large number, possibly hundreds or thousands, of intermediate outcomes, thought to lie on or close to the causal chain between the treatment and the long-term outcome of interest. Even if none of the individual proxies satisfies the statistical surrogacy criterion by itself, using multiple proxies can be useful in causal inference. We focus primarily on a setting with two samples, an experimental sample containing data about the treatment indicator and the surrogates and an observational sample containing information about the surrogates and the primary outcome. We state assumptions under which the average treatment effect can be identified and estimated with a high-dimensional vector of proxies that collectively satisfy the surrogacy assumption, and derive the bias from violations of the surrogacy assumption, and show that even if the primary outcome is also observed in the experimental sample, there is still information to be gained from using surrogates."}, "https://arxiv.org/abs/2205.00901": {"title": "Beyond Neyman-Pearson: e-values enable hypothesis testing with a data-driven alpha", "link": "https://arxiv.org/abs/2205.00901", "description": "arXiv:2205.00901v3 Announce Type: replace \nAbstract: A standard practice in statistical hypothesis testing is to mention the p-value alongside the accept/reject decision. We show the advantages of mentioning an e-value instead. With p-values, it is not clear how to use an extreme observation (e.g. p $\\ll \\alpha$) for getting better frequentist decisions. With e-values it is straightforward, since they provide Type-I risk control in a generalized Neyman-Pearson setting with the decision task (a general loss function) determined post-hoc, after observation of the data -- thereby providing a handle on `roving $\\alpha$'s'. When Type-II risks are taken into consideration, the only admissible decision rules in the post-hoc setting turn out to be e-value-based. Similarly, if the loss incurred when specifying a faulty confidence interval is not fixed in advance, standard confidence intervals and distributions may fail whereas e-confidence sets and e-posteriors still provide valid risk guarantees. Sufficiently powerful e-values have by now been developed for a range of classical testing problems. 
We discuss the main challenges for wider development and deployment."}, "https://arxiv.org/abs/2205.09094": {"title": "High confidence inference on the probability an individual benefits from treatment using experimental or observational data with known propensity scores", "link": "https://arxiv.org/abs/2205.09094", "description": "arXiv:2205.09094v3 Announce Type: replace \nAbstract: We seek to understand the probability an individual benefits from treatment (PIBT), an inestimable quantity that must be bounded in practice. Given the innate uncertainty in the population-level bounds on PIBT, we seek to better understand the margin of error for their estimation in order to discern whether the estimated bounds on PIBT are tight or wide due to random chance or not. Toward this goal, we present guarantees to the estimation of bounds on marginal PIBT, with any threshold of interest, for a randomized experiment setting or an observational setting where propensity scores are known. We also derive results that permit us to understand heterogeneity in PIBT across learnable sub-groups delineated by pre-treatment features. These results can be used to help with formal statistical power analyses and frequentist confidence statements for settings where we are interested in assumption-lean inference on PIBT through the target bounds with minimal computational complexity compared to a bootstrap approach. Through a real data example from a large randomized experiment, we also demonstrate how our results for PIBT can allow us to understand the practical implication and goodness of fit of an estimate for the conditional average treatment effect (CATE), a function of an individual's baseline covariates."}, "https://arxiv.org/abs/2206.06991": {"title": "Concentration of discrepancy-based approximate Bayesian computation via Rademacher complexity", "link": "https://arxiv.org/abs/2206.06991", "description": "arXiv:2206.06991v4 Announce Type: replace \nAbstract: There has been an increasing interest on summary-free versions of approximate Bayesian computation (ABC), which replace distances among summaries with discrepancies between the empirical distributions of the observed data and the synthetic samples generated under the proposed parameter values. The success of these solutions has motivated theoretical studies on the limiting properties of the induced posteriors. However, current results (i) are often tailored to a specific discrepancy, (ii) require, either explicitly or implicitly, regularity conditions on the data generating process and the assumed statistical model, and (iii) yield bounds depending on sequences of control functions that are not made explicit. As such, there is the lack of a theoretical framework that (i) is unified, (ii) facilitates the derivation of limiting properties that hold uniformly, and (iii) relies on verifiable assumptions that provide concentration bounds clarifying which factors govern the limiting behavior of the ABC posterior. We address this gap via a novel theoretical framework that introduces the concept of Rademacher complexity in the analysis of the limiting properties for discrepancy-based ABC posteriors. This yields a unified theory that relies on constructive arguments and provides more informative asymptotic results and uniform concentration bounds, even in settings not covered by current studies. 
These advancements are obtained by relating the properties of summary-free ABC posteriors to the behavior of the Rademacher complexity associated with the chosen discrepancy within the family of integral probability semimetrics. This family extends summary-based ABC, and includes the Wasserstein distance and maximum mean discrepancy (MMD), among others. As clarified through a focus on the MMD case and via illustrative simulations, this perspective yields an improved understanding of summary-free ABC."}, "https://arxiv.org/abs/2211.04752": {"title": "Bayesian Neural Networks for Macroeconomic Analysis", "link": "https://arxiv.org/abs/2211.04752", "description": "arXiv:2211.04752v4 Announce Type: replace \nAbstract: Macroeconomic data is characterized by a limited number of observations (small T), many time series (big K) but also by featuring temporal dependence. Neural networks, by contrast, are designed for datasets with millions of observations and covariates. In this paper, we develop Bayesian neural networks (BNNs) that are well-suited for handling datasets commonly used for macroeconomic analysis in policy institutions. Our approach avoids extensive specification searches through a novel mixture specification for the activation function that appropriately selects the form of nonlinearities. Shrinkage priors are used to prune the network and force irrelevant neurons to zero. To cope with heteroskedasticity, the BNN is augmented with a stochastic volatility model for the error term. We illustrate how the model can be used in a policy institution by first showing that our different BNNs produce precise density forecasts, typically better than those from other machine learning methods. Finally, we showcase how our model can be used to recover nonlinearities in the reaction of macroeconomic aggregates to financial shocks."}, "https://arxiv.org/abs/2303.00281": {"title": "Posterior Robustness with Milder Conditions: Contamination Models Revisited", "link": "https://arxiv.org/abs/2303.00281", "description": "arXiv:2303.00281v2 Announce Type: replace \nAbstract: Robust Bayesian linear regression is a classical but essential statistical tool. Although novel robustness properties of posterior distributions have been proved recently under a certain class of error distributions, their sufficient conditions are restrictive and exclude several important situations. In this work, we revisit a classical two-component mixture model for response variables, also known as contamination model, where one component is a light-tailed regression model and the other component is heavy-tailed. The latter component is independent of the regression parameters, which is crucial in proving the posterior robustness. We obtain new sufficient conditions for posterior (non-)robustness and reveal non-trivial robustness results by using those conditions. In particular, we find that even the Student-$t$ error distribution can achieve the posterior robustness in our framework. 
A numerical study is performed to check the Kullback-Leibler divergence between the posterior distribution based on full data and that based on data obtained by removing outliers."}, "https://arxiv.org/abs/2303.13281": {"title": "Uncertain Short-Run Restrictions and Statistically Identified Structural Vector Autoregressions", "link": "https://arxiv.org/abs/2303.13281", "description": "arXiv:2303.13281v2 Announce Type: replace \nAbstract: This study proposes a combination of a statistical identification approach with potentially invalid short-run zero restrictions. The estimator shrinks towards imposed restrictions and stops shrinkage when the data provide evidence against a restriction. Simulation results demonstrate how incorporating valid restrictions through the shrinkage approach enhances the accuracy of the statistically identified estimator and how the impact of invalid restrictions decreases with the sample size. The estimator is applied to analyze the interaction between the stock and oil market. The results indicate that incorporating stock market data into the analysis is crucial, as it enables the identification of information shocks, which are shown to be important drivers of the oil price."}, "https://arxiv.org/abs/2307.01449": {"title": "A Double Machine Learning Approach to Combining Experimental and Observational Data", "link": "https://arxiv.org/abs/2307.01449", "description": "arXiv:2307.01449v2 Announce Type: replace \nAbstract: Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework tests for violations of external validity and ignorability under milder assumptions. When only one of these assumptions is violated, we provide semiparametrically efficient treatment effect estimators. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. Through comparative analyses, we show our framework's superiority over existing data fusion methods. The practical utility of our approach is further exemplified by three real-world case studies, underscoring its potential for widespread application in empirical research."}, "https://arxiv.org/abs/2309.12425": {"title": "Principal Stratification with Continuous Post-Treatment Variables: Nonparametric Identification and Semiparametric Estimation", "link": "https://arxiv.org/abs/2309.12425", "description": "arXiv:2309.12425v2 Announce Type: replace \nAbstract: Post-treatment variables often complicate causal inference. They appear in many scientific problems, including noncompliance, truncation by death, mediation, and surrogate endpoint evaluation. Principal stratification is a strategy to address these challenges by adjusting for the potential values of the post-treatment variables, defined as the principal strata. It allows for characterizing treatment effect heterogeneity across principal strata and unveiling the mechanism of the treatment's impact on the outcome related to post-treatment variables. However, the existing literature has primarily focused on binary post-treatment variables, leaving the case with continuous post-treatment variables largely unexplored. 
This gap persists due to the complexity of infinitely many principal strata, which present challenges to both the identification and estimation of causal effects. We fill this gap by providing nonparametric identification and semiparametric estimation theory for principal stratification with continuous post-treatment variables. We propose to use working models to approximate the underlying causal effect surfaces and derive the efficient influence functions of the corresponding model parameters. Based on the theory, we construct doubly robust estimators and implement them in an R package."}, "https://arxiv.org/abs/2310.18556": {"title": "Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment", "link": "https://arxiv.org/abs/2310.18556", "description": "arXiv:2310.18556v2 Announce Type: replace \nAbstract: Design-based causal inference, also known as randomization-based or finite-population causal inference, is one of the most widely used causal inference frameworks, largely due to the merit that its statistical validity can be guaranteed by the study design (e.g., randomized experiments) and does not require assuming specific outcome-generating distributions or super-population models. Despite its advantages, design-based causal inference can still suffer from other data-related issues, among which outcome missingness is a prevalent and significant challenge. This work systematically studies the outcome missingness problem in design-based causal inference. First, we propose a general and flexible outcome missingness mechanism that can facilitate finite-population-exact randomization tests for the null effect. Second, under this flexible missingness mechanism, we propose a general framework called \"imputation and re-imputation\" for conducting finite-population-exact randomization tests in design-based causal inference with missing outcomes. This framework can incorporate any imputation algorithms (from linear models to advanced machine learning-based imputation algorithms) while ensuring finite-population-exact type-I error rate control. Third, we extend our framework to conduct covariate adjustment in randomization tests and construct finite-population-valid confidence sets with missing outcomes. Our framework is evaluated via extensive simulation studies and applied to a cluster randomized experiment called the Work, Family, and Health Study. Open-source Python and R packages are also developed for implementation of our framework."}, "https://arxiv.org/abs/2311.03247": {"title": "Multivariate selfsimilarity: Multiscale eigen-structures for selfsimilarity parameter estimation", "link": "https://arxiv.org/abs/2311.03247", "description": "arXiv:2311.03247v2 Announce Type: replace \nAbstract: Scale-free dynamics, formalized by selfsimilarity, provides a versatile paradigm massively and ubiquitously used to model temporal dynamics in real-world data. However, its practical use has mostly remained univariate so far. By contrast, modern applications often demand multivariate data analysis. Accordingly, models for multivariate selfsimilarity were recently proposed. Nevertheless, they have remained rarely used in practice because of a lack of available robust estimation procedures for the vector of selfsimilarity parameters. 
Building upon recent mathematical developments, the present work puts forth an efficient estimation procedure based on the theoretical study of the multiscale eigenstructure of the wavelet spectrum of multivariate selfsimilar processes. The estimation performance is studied theoretically in the asymptotic limits of large scale and sample sizes, and computationally for finite-size samples. As a practical outcome, a fully operational and documented multivariate signal processing estimation toolbox is made freely available and is ready for practical use on real-world data. Its potential benefits are illustrated in epileptic seizure prediction from multi-channel EEG data."}, "https://arxiv.org/abs/2401.11507": {"title": "Inconsistent multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses", "link": "https://arxiv.org/abs/2401.11507", "description": "arXiv:2401.11507v3 Announce Type: replace \nAbstract: During multiple testing, researchers often adjust their alpha level to control the familywise error rate for a statistical inference about a joint union alternative hypothesis (e.g., \"H1,1 or H1,2\"). However, in some cases, they do not make this inference. Instead, they make separate inferences about each of the individual hypotheses that comprise the joint hypothesis (e.g., H1,1 and H1,2). For example, a researcher might use a Bonferroni correction to adjust their alpha level from the conventional level of 0.050 to 0.025 when testing H1,1 and H1,2, find a significant result for H1,1 (p < 0.025) and not for H1,2 (p > 0.025), and so claim support for H1,1 and not for H1,2. However, these separate individual inferences do not require an alpha adjustment. Only a statistical inference about the union alternative hypothesis \"H1,1 or H1,2\" requires an alpha adjustment because it is based on \"at least one\" significant result among the two tests, and so it refers to the familywise error rate. Hence, an inconsistent correction occurs when a researcher corrects their alpha level during multiple testing but does not make an inference about a union alternative hypothesis. In the present article, I discuss this inconsistent correction problem, including its reduction in statistical power for tests of individual hypotheses and its potential causes vis-a-vis error rate confusions and the alpha adjustment ritual. I also provide three illustrations of inconsistent corrections from recent psychology studies. I conclude that inconsistent corrections represent a symptom of statisticism, and I call for a more nuanced inference-based approach to multiple testing corrections."}, "https://arxiv.org/abs/2401.15461": {"title": "Anytime-Valid Tests of Group Invariance through Conformal Prediction", "link": "https://arxiv.org/abs/2401.15461", "description": "arXiv:2401.15461v2 Announce Type: replace \nAbstract: Many standard statistical hypothesis tests, including those for normality and exchangeability, can be reformulated as tests of invariance under a group of transformations. We develop anytime-valid tests of invariance under the action of general compact groups and show their optimality -- in a specific logarithmic-growth sense -- against certain alternatives. This is achieved by using the invariant structure of the problem to construct conformal test martingales, a class of objects associated with conformal prediction. 
We apply our methods to extend recent anytime-valid tests of independence, which leverage exchangeability, to work under general group invariances. Additionally, we show applications to testing for invariance under subgroups of rotations, which corresponds to testing the Gaussian-error assumptions behind linear models."}, "https://arxiv.org/abs/2211.13289": {"title": "Shapley Curves: A Smoothing Perspective", "link": "https://arxiv.org/abs/2211.13289", "description": "arXiv:2211.13289v5 Announce Type: replace-cross \nAbstract: This paper fills the limited statistical understanding of Shapley values as a variable importance measure from a nonparametric (or smoothing) perspective. We introduce population-level \\textit{Shapley curves} to measure the true variable importance, determined by the conditional expectation function and the distribution of covariates. Having defined the estimand, we derive minimax convergence rates and asymptotic normality under general conditions for the two leading estimation strategies. For finite sample inference, we propose a novel version of the wild bootstrap procedure tailored for capturing lower-order terms in the estimation of Shapley curves. Numerical studies confirm our theoretical findings, and an empirical application analyzes the determining factors of vehicle prices."}, "https://arxiv.org/abs/2211.16462": {"title": "Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target", "link": "https://arxiv.org/abs/2211.16462", "description": "arXiv:2211.16462v2 Announce Type: replace-cross \nAbstract: As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \\in \\{0, \\ldots, H-1\\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-,y^+].$ Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call Probability-space Conformalized Quantile Regression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain guarantees for the probability that the cumulative reward of an autonomous system will fall below a threshold sampled from the marginal distribution of the response variable (i.e., a calibrated CDF estimate) that we employ to predict coverage probabilities for user-specified target intervals. 
Experiments on two domains confirm that these probabilities are well-calibrated."}, "https://arxiv.org/abs/2404.02982": {"title": "Spatio-temporal Modeling of Count Data", "link": "https://arxiv.org/abs/2404.02982", "description": "arXiv:2404.02982v1 Announce Type: new \nAbstract: We introduce parsimonious parameterisations for multivariate autoregressive count time series models for spatio-temporal data, including possible regressions on covariates. The number of parameters is reduced by specifying spatial neighbourhood structures for possibly huge matrices that take into account spatio-temporal dependencies. Consistency and asymptotic normality of the parameter estimators are obtained under mild assumptions by employing quasi-maximum likelihood methodology. This is used to obtain an asymptotic Wald test for testing the significance of individual or group effects. Several simulations and two data examples support and illustrate the methods proposed in this paper."}, "https://arxiv.org/abs/2404.03024": {"title": "General Effect Modelling (GEM) -- Part 1", "link": "https://arxiv.org/abs/2404.03024", "description": "arXiv:2404.03024v1 Announce Type: new \nAbstract: We present a flexible tool, called General Effect Modelling (GEM), for the analysis of any multivariate data influenced by one or more qualitative (categorical) or quantitative (continuous) input variables. The variables can be design factors or observed values, e.g., age, sex, or income, or they may represent subgroups of the samples found by data exploration. The first step in GEM separates the variation in the multivariate data into effect matrices associated with each of the influencing variables (and possibly interactions between these) by applying a general linear model. The residuals of the model are added to each of the effect matrices and the results are called Effect plus Residual (ER) values. The tables of ER values have the same dimensions as the original multivariate data. The second step of GEM is a multi- or univariate exploration of the ER tables to learn more about the multivariate data in relation to each input variable. The exploration is simplified as it addresses one input variable at a time or, if preferred, a combination of input variables. One example is a study to identify molecular fingerprints associated with a disease that is not influenced by age, where individuals at different ages with and without the disease are included in the experiment. This situation can be described as an experiment with two input variables: the targeted disease and the individual age. Through GEM, the effect of age can be removed, thus focusing further analysis on the targeted disease without the influence of the confounding effect of age. ER values can also be the combined effect of several input variables. This publication has three parts: the first part describes the GEM methodology, Part 2 is a consideration of multivariate data and the benefits of treating such data by multivariate analysis, with a focus on omics data, and Part 3 is a case study in Multiple Sclerosis (MS)."}, "https://arxiv.org/abs/2404.03029": {"title": "General Effect Modelling (GEM) -- Part 2", "link": "https://arxiv.org/abs/2404.03029", "description": "arXiv:2404.03029v1 Announce Type: new \nAbstract: General Effect Modelling (GEM) is an umbrella over different methods that utilise effects in the analyses of data with multiple design variables and multivariate responses. 
To demonstrate the methodology, we here use GEM in gene expression data where we use GEM to combine data from different cohorts and apply multivariate analysis of the effects of the targeted disease across the cohorts. Omics data are by nature multivariate, yet univariate analysis is the dominating approach used for such data. A major challenge in omics data is that the number of features such as genes, proteins and metabolites are often very large, whereas the number of samples is limited. Furthermore, omics research aims to obtain results that are generically valid across different backgrounds. The present publication applies GEM to address these aspects. First, we emphasise the benefit of multivariate analysis for multivariate data. Then we illustrate the use of GEM to combine data from two different cohorts for multivariate analysis across the cohorts, and we highlight that multivariate analysis can detect information that is lost by univariate validation."}, "https://arxiv.org/abs/2404.03034": {"title": "General Effect Modelling (GEM) -- Part 3", "link": "https://arxiv.org/abs/2404.03034", "description": "arXiv:2404.03034v1 Announce Type: new \nAbstract: The novel data analytical platform General Effect Modelling (GEM), is an umbrella platform covering different data analytical methods that handle data with multiple design variables (or pseudo design variables) and multivariate responses. GEM is here demonstrated in an analysis of proteome data from cerebrospinal fluid (CSF) from two independent previously published datasets, one data set comprised of persons with relapsing-remitting multiple sclerosis, persons with other neurological disorders and persons without neurological disorders, and one data set had persons with clinically isolated syndrome (CIS), which is the first clinical symptom of MS, and controls. The primary aim of the present publication is to use these data to demonstrate how patient stratification can be utilised by GEM for multivariate analysis. We also emphasize how the findings shed light on important aspects of the molecular mechanism of MS that may otherwise be lost. We identified proteins involved in neural development as significantly lower for MS/CIS than for their respective controls. This information was only seen after stratification of the persons into two groups, which were found to have different inflammatory patterns and the utilisation of this by GEM. Our conclusion from the study of these data is that disrupted neural development may be an early event in CIS and MS."}, "https://arxiv.org/abs/2404.03059": {"title": "Asymptotically-exact selective inference for quantile regression", "link": "https://arxiv.org/abs/2404.03059", "description": "arXiv:2404.03059v1 Announce Type: new \nAbstract: When analyzing large datasets, it is common to select a model prior to making inferences. For reliable inferences, it is important to make adjustments that account for the model selection process, resulting in selective inferences. Our paper introduces an asymptotic pivot to infer about the effects of selected variables on conditional quantile functions. Utilizing estimators from smoothed quantile regression, our proposed pivot is easy to compute and ensures asymptotically-exact selective inferences without making strict distributional assumptions about the response variable. 
At the core of the pivot is the use of external randomization, which enables us to utilize the full sample for both selection and inference without the need to partition the data into independent data subsets or discard data at either step. On simulated data, we find that: (i) the asymptotic confidence intervals based on our pivot achieve the desired coverage rates, even in cases where sample splitting fails due to insufficient sample size for inference; (ii) our intervals are consistently shorter than those produced by sample splitting across various models and signal settings. We report similar findings when we apply our approach to study risk factors for low birth weights in a publicly accessible dataset of US birth records from 2022."}, "https://arxiv.org/abs/2404.03127": {"title": "A Bayesian factor analysis model for high-dimensional microbiome count data", "link": "https://arxiv.org/abs/2404.03127", "description": "arXiv:2404.03127v1 Announce Type: new \nAbstract: Dimension reduction techniques are among the most essential analytical tools in the analysis of high-dimensional data. Generalized principal component analysis (PCA) is an extension to standard PCA that has been widely used to identify low-dimensional features in high-dimensional discrete data, such as binary, multi-category and count data. For microbiome count data in particular, the multinomial PCA is a natural counterpart of the standard PCA. However, this technique fails to account for the excessive number of zero values, which is frequently observed in microbiome count data. To allow for sparsity, zero-inflated multivariate distributions can be used. We propose a zero-inflated probabilistic PCA model for latent factor analysis. The proposed model is a fully Bayesian factor analysis technique that is appropriate for microbiome count data analysis. In addition, we use the mean-field-type variational family to approximate the marginal likelihood and develop a classification variational approximation algorithm to fit the model. We demonstrate the efficiency of our procedure for predictions based on the latent factors and the model parameters through simulation experiments, showcasing its superiority over competing methods. This efficiency is further illustrated with two real microbiome count datasets. The method is implemented in R."}, "https://arxiv.org/abs/2404.03152": {"title": "Orthogonal calibration via posterior projections with applications to the Schwarzschild model", "link": "https://arxiv.org/abs/2404.03152", "description": "arXiv:2404.03152v1 Announce Type: new \nAbstract: The orbital superposition method originally developed by Schwarzschild (1979) is used to study the dynamics of growth of a black hole and its host galaxy, and has uncovered new relationships between the galaxy's global characteristics. Scientists are specifically interested in finding optimal parameter choices for this model that best match physical measurements along with quantifying the uncertainty of such procedures. This renders a statistical calibration problem with multivariate outcomes. In this article, we develop a Bayesian method for calibration with multivariate outcomes using orthogonal bias functions thus ensuring parameter identifiability. Our approach is based on projecting the posterior to an appropriate space which allows the user to choose any nonparametric prior on the bias function(s) instead of having to model it (them) with Gaussian processes. We develop a functional projection approach using the theory of Hilbert spaces. 
A finite-dimensional analogue of the projection problem is also considered. We illustrate the proposed approach using a BART prior and apply it to calibrate the Schwarzschild model, illustrating how a multivariate approach may resolve discrepancies resulting from a univariate calibration."}, "https://arxiv.org/abs/2404.03198": {"title": "Delaunay Weighted Two-sample Test for High-dimensional Data by Incorporating Geometric Information", "link": "https://arxiv.org/abs/2404.03198", "description": "arXiv:2404.03198v1 Announce Type: new \nAbstract: Two-sample hypothesis testing is a fundamental problem with various applications, which faces new challenges in the high-dimensional context. To mitigate the issue of the curse of dimensionality, high-dimensional data are typically assumed to lie on a low-dimensional manifold. To incorporate geometric information in the data, we propose to apply the Delaunay triangulation and develop the Delaunay weight to measure the geometric proximity among data points. In contrast to existing similarity measures that only utilize pairwise distances, the Delaunay weight can take both the distance and direction information into account. A detailed computation procedure to approximate the Delaunay weight for the unknown manifold is developed. We further propose a novel nonparametric test statistic using the Delaunay weight matrix to test whether the underlying distributions of two samples are the same or not. Applied to simulated data, the new test exhibits substantial power gain in detecting differences in principal directions between distributions. The proposed test also shows great power on a real dataset of human face images."}, "https://arxiv.org/abs/2404.03235": {"title": "Marginal Treatment Effects and Monotonicity", "link": "https://arxiv.org/abs/2404.03235", "description": "arXiv:2404.03235v1 Announce Type: new \nAbstract: How robust are analyses based on marginal treatment effects (MTE) to violations of Imbens and Angrist (1994) monotonicity? In this note, I present weaker forms of monotonicity under which popular MTE-based estimands still identify the parameters of interest."}, "https://arxiv.org/abs/2404.03250": {"title": "Multi-task learning via robust regularized clustering with non-convex group penalties", "link": "https://arxiv.org/abs/2404.03250", "description": "arXiv:2404.03250v1 Announce Type: new \nAbstract: Multi-task learning (MTL) aims to improve estimation and prediction performance by sharing common information among related tasks. One natural assumption in MTL is that tasks are classified into clusters based on their characteristics. However, existing MTL methods based on this assumption often ignore outlier tasks that have large task-specific components or no relation to other tasks. To address this issue, we propose a novel MTL method called Multi-Task Learning via Robust Regularized Clustering (MTLRRC). MTLRRC incorporates robust regularization terms inspired by robust convex clustering, which is further extended to handle non-convex and group-sparse penalties. The extension allows MTLRRC to simultaneously perform robust task clustering and outlier task detection. The connection between the extended robust clustering and the multivariate M-estimator is also established. This provides an interpretation of the robustness of MTLRRC against outlier tasks. An efficient algorithm based on a modified alternating direction method of multipliers is developed for the estimation of the parameters. 
The effectiveness of MTLRRC is demonstrated through simulation studies and application to real data."}, "https://arxiv.org/abs/2404.03319": {"title": "Early warning systems for financial markets of emerging economies", "link": "https://arxiv.org/abs/2404.03319", "description": "arXiv:2404.03319v1 Announce Type: new \nAbstract: We develop and apply a new online early warning system (EWS) for what is known in machine learning as concept drift, in economics as a regime shift and in statistics as a change point. The system goes beyond linearity assumed in many conventional methods, and is robust to heavy tails and tail-dependence in the data, making it particularly suitable for emerging markets. The key component is an effective change-point detection mechanism for conditional entropy of the data, rather than for a particular indicator of interest. Combined with recent advances in machine learning methods for high-dimensional random forests, the mechanism is capable of finding significant shifts in information transfer between interdependent time series when traditional methods fail. We explore when this happens using simulations and we provide illustrations by applying the method to Uzbekistan's commodity and equity markets as well as to Russia's equity market in 2021-2023."}, "https://arxiv.org/abs/2404.03404": {"title": "Robust inference for linear regression models with possibly skewed error distribution", "link": "https://arxiv.org/abs/2404.03404", "description": "arXiv:2404.03404v1 Announce Type: new \nAbstract: Traditional methods for linear regression generally assume that the underlying error distribution, equivalently the distribution of the responses, is normal. Yet, sometimes real life response data may exhibit a skewed pattern, and assuming normality would not give reliable results in such cases. This is often observed in cases of some biomedical, behavioral, socio-economic and other variables. In this paper, we propose to use the class of skew normal (SN) distributions, which also includes the ordinary normal distribution as its special case, as the model for the errors in a linear regression setup and perform subsequent statistical inference using the popular and robust minimum density power divergence approach to get stable insights in the presence of possible data contamination (e.g., outliers). We provide the asymptotic distribution of the proposed estimator of the regression parameters and also propose robust Wald-type tests of significance for these parameters. We provide an influence function analysis of these estimators and test statistics, and also provide level and power influence functions. Numerical verification including simulation studies and real data analysis is provided to substantiate the theory developed."}, "https://arxiv.org/abs/2404.03420": {"title": "Modeling temporal dependency of longitudinal data: use of multivariate geometric skew-normal copula", "link": "https://arxiv.org/abs/2404.03420", "description": "arXiv:2404.03420v1 Announce Type: new \nAbstract: Use of copula for the purpose of modeling dependence has been receiving considerable attention in recent times. On the other hand, search for multivariate copulas with desirable dependence properties also is an important area of research. When fitting regression models to non-Gaussian longitudinal data, multivariate Gaussian copula is commonly used to account for temporal dependence of the repeated measurements. 
But using symmetric multivariate Gaussian copula is not preferable in every situation, since it can not capture non-exchangeable dependence or tail dependence, if present in the data. Hence to ensure reliable inference, it is important to look beyond the Gaussian dependence assumption. In this paper, we construct geometric skew-normal copula from multivariate geometric skew-normal (MGSN) distribution proposed by Kundu (2014) and Kundu (2017) in order to model temporal dependency of non-Gaussian longitudinal data. First we investigate the theoretical properties of the proposed multivariate copula, and then develop regression models for both continuous and discrete longitudinal data. The quantile function of this copula is independent of the correlation matrix of its respective multivariate distribution, which provides computational advantage in terms of likelihood inference compared to the class of copulas derived from skew-elliptical distributions by Azzalini & Valle (1996). Moreover, composite likelihood inference is possible for this multivariate copula, which facilitates to estimate parameters from ordered probit model with same dependence structure as geometric skew-normal distribution. We conduct extensive simulation studies to validate our proposed models and therefore apply them to analyze the longitudinal dependence of two real world data sets. Finally, we report our findings in terms of improvements over multivariate Gaussian copula based regression models."}, "https://arxiv.org/abs/2404.03422": {"title": "Empirical Bayes for the Reluctant Frequentist", "link": "https://arxiv.org/abs/2404.03422", "description": "arXiv:2404.03422v1 Announce Type: new \nAbstract: Empirical Bayes methods offer valuable tools for a large class of compound decision problems. In this tutorial we describe some basic principles of the empirical Bayes paradigm stressing their frequentist interpretation. Emphasis is placed on recent developments of nonparametric maximum likelihood methods for estimating mixture models. A more extensive introductory treatment will eventually be available in \\citet{kg24}. The methods are illustrated with an extended application to models of heterogeneous income dynamics based on PSID data."}, "https://arxiv.org/abs/2404.03116": {"title": "ALAAMEE: Open-source software for fitting autologistic actor attribute models", "link": "https://arxiv.org/abs/2404.03116", "description": "arXiv:2404.03116v1 Announce Type: cross \nAbstract: The autologistic actor attribute model (ALAAM) is a model for social influence, derived from the more widely known exponential-family random graph model (ERGM). ALAAMs can be used to estimate parameters corresponding to multiple forms of social contagion associated with network structure and actor covariates. This work introduces ALAAMEE, open-source Python software for estimation, simulation, and goodness-of-fit testing for ALAAM models. ALAAMEE implements both the stochastic approximation and equilibrium expectation (EE) algorithms for ALAAM parameter estimation, including estimation from snowball sampled network data. It implements data structures and statistics for undirected, directed, and bipartite networks. 
We use a simulation study to assess the accuracy of the EE algorithm for ALAAM parameter estimation and statistical inference, and demonstrate the use of ALAAMEE with empirical examples using both small (fewer than 100 nodes) and large (more than 10 000 nodes) networks."}, "https://arxiv.org/abs/2208.06039": {"title": "Semiparametric adaptive estimation under informative sampling", "link": "https://arxiv.org/abs/2208.06039", "description": "arXiv:2208.06039v3 Announce Type: replace \nAbstract: In survey sampling, survey data do not necessarily represent the target population, and the samples are often biased. However, information on the survey weights aids in the elimination of selection bias. The Horvitz-Thompson estimator is a well-known unbiased, consistent, and asymptotically normal estimator; however, it is not efficient. Thus, this study derives the semiparametric efficiency bound for various target parameters by considering the survey weight as a random variable and consequently proposes a semiparametric optimal estimator with certain working models on the survey weights. The proposed estimator is consistent, asymptotically normal, and efficient in a class of the regular and asymptotically linear estimators. Further, a limited simulation study is conducted to investigate the finite sample performance of the proposed method. The proposed method is applied to the 1999 Canadian Workplace and Employee Survey data."}, "https://arxiv.org/abs/2303.06501": {"title": "Learning from limited temporal data: Dynamically sparse historical functional linear models with applications to Earth science", "link": "https://arxiv.org/abs/2303.06501", "description": "arXiv:2303.06501v2 Announce Type: replace \nAbstract: Scientists and statisticians often want to learn about the complex relationships that connect two time-varying variables. Recent work on sparse functional historical linear models confirms that they are promising for this purpose, but several notable limitations exist. Most importantly, previous works have imposed sparsity on the historical coefficient function, but have not allowed the sparsity, hence lag, to vary with time. We simplify the framework of sparse functional historical linear models by using a rectangular coefficient structure along with Whittaker smoothing, then reduce the assumptions of the previous frameworks by estimating the dynamic time lag from a hierarchical coefficient structure. We motivate our study by aiming to extract the physical rainfall-runoff processes hidden within hydrological data. We show the promise and accuracy of our method using eight simulation studies, further justified by two real sets of hydrological data."}, "https://arxiv.org/abs/2306.05299": {"title": "Heterogeneous Autoregressions in Short T Panel Data Models", "link": "https://arxiv.org/abs/2306.05299", "description": "arXiv:2306.05299v2 Announce Type: replace \nAbstract: This paper considers a first-order autoregressive panel data model with individual-specific effects and heterogeneous autoregressive coefficients defined on the interval (-1,1], thus allowing for some of the individual processes to have unit roots. It proposes estimators for the moments of the cross-sectional distribution of the autoregressive (AR) coefficients, assuming a random coefficient model for the autoregressive coefficients without imposing any restrictions on the fixed effects. It is shown the standard generalized method of moments estimators obtained under homogeneous slopes are biased. 
Small sample properties of the proposed estimators are investigated by Monte Carlo experiments and compared with a number of alternatives, both under homogeneous and heterogeneous slopes. It is found that a simple moment estimator of the mean of heterogeneous AR coefficients performs very well even for moderate sample sizes, but to reliably estimate the variance of AR coefficients much larger samples are required. It is also required that the true value of this variance is not too close to zero. The utility of the heterogeneous approach is illustrated in the case of earnings dynamics."}, "https://arxiv.org/abs/2306.14302": {"title": "Improved LM Test for Robust Model Specification Searches in Covariance Structure Analysis", "link": "https://arxiv.org/abs/2306.14302", "description": "arXiv:2306.14302v3 Announce Type: replace \nAbstract: Model specification searches and modifications are commonly employed in covariance structure analysis (CSA) or structural equation modeling (SEM) to improve the goodness-of-fit. However, these practices can be susceptible to capitalizing on chance, as a model that fits one sample may not generalize to another sample from the same population. This paper introduces the improved Lagrange Multipliers (LM) test, which provides a reliable method for conducting a thorough model specification search and effectively identifying missing parameters. By leveraging the stepwise bootstrap method in the standard LM and Wald tests, our data-driven approach enhances the accuracy of parameter identification. The results from Monte Carlo simulations and two empirical applications in political science demonstrate the effectiveness of the improved LM test, particularly when dealing with small sample sizes and models with large degrees of freedom. This approach contributes to better statistical fit and addresses the issue of capitalization on chance in model specification."}, "https://arxiv.org/abs/2309.17295": {"title": "covXtreme : MATLAB software for non-stationary penalised piecewise constant marginal and conditional extreme value models", "link": "https://arxiv.org/abs/2309.17295", "description": "arXiv:2309.17295v2 Announce Type: replace \nAbstract: The covXtreme software provides functionality for estimation of marginal and conditional extreme value models, non-stationary with respect to covariates, and environmental design contours. Generalised Pareto (GP) marginal models of peaks over threshold are estimated, using a piecewise-constant representation for the variation of GP threshold and scale parameters on the (potentially multidimensional) covariate domain of interest. The conditional variation of one or more associated variates, given a large value of a single conditioning variate, is described using the conditional extremes model of Heffernan and Tawn (2004), the slope term of which is also assumed to vary in a piecewise constant manner with covariates. Optimal smoothness of marginal and conditional extreme value model parameters with respect to covariates is estimated using cross-validated roughness-penalised maximum likelihood estimation. Uncertainties in model parameter estimates due to marginal and conditional extreme value threshold choice, and sample size, are quantified using a bootstrap resampling scheme. Estimates of environmental contours using various schemes, including the direct sampling approach of Huseby et al. 2013, are calculated by simulation or numerical integration under fitted models. 
The software was developed in MATLAB for metocean applications, but is applicable generally to multivariate samples of peaks over threshold. The software and case study data can be downloaded from GitHub, with an accompanying user guide."}, "https://arxiv.org/abs/2310.01076": {"title": "A Pareto tail plot without moment restrictions", "link": "https://arxiv.org/abs/2310.01076", "description": "arXiv:2310.01076v2 Announce Type: replace \nAbstract: We propose a mean functional which exists for any probability distributions, and which characterizes the Pareto distribution within the set of distributions with finite left endpoint. This is in sharp contrast to the mean excess plot which is not meaningful for distributions without existing mean, and which has a nonstandard behaviour if the mean is finite, but the second moment does not exist. The construction of the plot is based on the so called principle of a single huge jump, which differentiates between distributions with moderately heavy and super heavy tails. We present an estimator of the tail function based on $U$-statistics and study its large sample properties. The use of the new plot is illustrated by several loss datasets."}, "https://arxiv.org/abs/2310.10915": {"title": "Identifiability of the Multinomial Processing Tree-IRT model for the Philadelphia Naming Test", "link": "https://arxiv.org/abs/2310.10915", "description": "arXiv:2310.10915v3 Announce Type: replace \nAbstract: Naming tests represent an essential tool in gauging the severity of aphasia and monitoring the trajectory of recovery for individuals afflicted with this debilitating condition. In these assessments, patients are presented with images corresponding to common nouns, and their responses are evaluated for accuracy. The Philadelphia Naming Test (PNT) stands as a paragon in this domain, offering nuanced insights into the type of errors made in responses. In a groundbreaking advancement, Walker et al. (2018) introduced a model rooted in Item Response Theory and multinomial processing trees (MPT-IRT). This innovative approach seeks to unravel the intricate mechanisms underlying the various errors patients make when responding to an item, aiming to pinpoint the specific stage of word production where a patient's capability falters. However, given the sophisticated nature of the IRT-MPT model proposed by Walker et al. (2018), it is imperative to scrutinize both its conceptual as well as its statistical validity. Our endeavor here is to closely examine the model's formulation to ensure its parameters are identifiable as a first step in evaluating its validity."}, "https://arxiv.org/abs/2401.14582": {"title": "High-dimensional forecasting with known knowns and known unknowns", "link": "https://arxiv.org/abs/2401.14582", "description": "arXiv:2401.14582v2 Announce Type: replace \nAbstract: Forecasts play a central role in decision making under uncertainty. After a brief review of the general issues, this paper considers ways of using high-dimensional data in forecasting. We consider selecting variables from a known active set, known knowns, using Lasso and OCMT, and approximating unobserved latent factors, known unknowns, by various means. This combines both sparse and dense approaches. We demonstrate the various issues involved in variable selection in a high-dimensional setting with an application to forecasting UK inflation at different horizons over the period 2020q1-2023q1. 
This application shows both the power of parsimonious models and the importance of allowing for global variables."}, "https://arxiv.org/abs/2310.12140": {"title": "Online Estimation with Rolling Validation: Adaptive Nonparametric Estimation with Streaming Data", "link": "https://arxiv.org/abs/2310.12140", "description": "arXiv:2310.12140v2 Announce Type: replace-cross \nAbstract: Online nonparametric estimators are gaining popularity due to their efficient computation and competitive generalization abilities. An important example includes variants of stochastic gradient descent. These algorithms often take one sample point at a time and instantly update the parameter estimate of interest. In this work we consider model selection and hyperparameter tuning for such online algorithms. We propose a weighted rolling-validation procedure, an online variant of leave-one-out cross-validation, that costs minimal extra computation for many typical stochastic gradient descent estimators. Similar to batch cross-validation, it can boost base estimators to achieve a better, adaptive convergence rate. Our theoretical analysis is straightforward, relying mainly on some general statistical stability assumptions. The simulation study underscores the significance of diverging weights in rolling validation in practice and demonstrates its sensitivity even when there is only a slim difference between candidate estimators."}, "https://arxiv.org/abs/2312.00770": {"title": "Random Forest for Dynamic Risk Prediction of Recurrent Events: A Pseudo-Observation Approach", "link": "https://arxiv.org/abs/2312.00770", "description": "arXiv:2312.00770v2 Announce Type: replace-cross \nAbstract: Recurrent events are common in clinical, healthcare, social and behavioral studies. A recent analysis framework for potentially censored recurrent event data is to construct a censored longitudinal data set consisting of times to the first recurrent event in multiple prespecified follow-up windows of length $\\tau$. With the staggering number of potential predictors being generated from genetic, -omic, and electronic health records sources, machine learning approaches such as the random forest are growing in popularity, as they can incorporate information from highly correlated predictors with non-standard relationships. In this paper, we bridge this gap by developing a random forest approach for dynamically predicting probabilities of remaining event-free during a subsequent $\\tau$-duration follow-up period from a reconstructed censored longitudinal data set. We demonstrate the increased ability of our random forest algorithm for predicting the probability of remaining event-free over a $\\tau$-duration follow-up period when compared to the recurrent event modeling framework of Xia et al. (2020) in settings where the association between predictors and recurrent event outcomes is complex in nature. 
The proposed random forest algorithm is demonstrated using recurrent exacerbation data from the Azithromycin for the Prevention of Exacerbations of Chronic Obstructive Pulmonary Disease trial (Albert et al., 2011)."}, "https://arxiv.org/abs/2404.03737": {"title": "Forecasting with Neuro-Dynamic Programming", "link": "https://arxiv.org/abs/2404.03737", "description": "arXiv:2404.03737v1 Announce Type: new \nAbstract: Economic forecasting is concerned with the estimation of some variable like gross domestic product (GDP) in the next period given a set of variables that describes the current situation or state of the economy, including industrial production, retail trade turnover or economic confidence. Neuro-dynamic programming (NDP) provides tools to deal with forecasting and other sequential problems with such high-dimensional state spaces. Whereas conventional forecasting methods penalise the difference (or loss) between predicted and actual outcomes, NDP favours the difference between temporally successive predictions, following an interactive and trial-and-error approach. Past data provides guidance to train the models, but in a different way from ordinary least squares (OLS) and other supervised learning methods, signalling the adjustment costs between sequential states. We found that it is possible to train a GDP forecasting model with data from other countries that performs better than models trained with past data from the tested country (Portugal). In addition, we found that non-linear architectures to approximate the value function of a sequential problem, namely neural networks, can perform better than a simple linear architecture, lowering the out-of-sample mean absolute forecast error (MAE) by 32% from an OLS model."}, "https://arxiv.org/abs/2404.03781": {"title": "Signal cancellation factor analysis", "link": "https://arxiv.org/abs/2404.03781", "description": "arXiv:2404.03781v1 Announce Type: new \nAbstract: Signal cancellation provides a radically new and efficient approach to exploratory factor analysis, without matrix decomposition nor presetting the required number of factors. Its current implementation requires that each factor has at least two unique indicators. Its principle is that it is always possible to combine two indicator variables exclusive to the same factor with weights that cancel their common factor information. Successful combinations, consisting of noise only, are recognized by their null correlations with all remaining variables. The optimal combinations of multifactorial indicators, though, typically retain correlations with some other variables. Their signal, however, can be cancelled through combinations with unifactorial indicators of their contributing factors. The loadings are estimated from the relative signal cancellation weights of the variables involved along with their observed correlations. The factor correlations are obtained from those of their unifactorial indicators, corrected by their factor loadings. The method is illustrated with synthetic data from a complex six-factor structure that even includes two doublet factors. 
Another example using actual data documents that signal cancellation can rival confirmatory factor analysis."}, "https://arxiv.org/abs/2404.03805": {"title": "Blessing of dimension in Bayesian inference on covariance matrices", "link": "https://arxiv.org/abs/2404.03805", "description": "arXiv:2404.03805v1 Announce Type: new \nAbstract: Bayesian factor analysis is routinely used for dimensionality reduction in modeling of high-dimensional covariance matrices. Factor analytic decompositions express the covariance as a sum of a low rank and diagonal matrix. In practice, Gibbs sampling algorithms are typically used for posterior computation, alternating between updating the latent factors, loadings, and residual variances. In this article, we exploit a blessing of dimensionality to develop a provably accurate pseudo-posterior for the covariance matrix that bypasses the need for Gibbs or other variants of Markov chain Monte Carlo sampling. Our proposed Factor Analysis with BLEssing of dimensionality (FABLE) approach relies on a first-stage singular value decomposition (SVD) to estimate the latent factors, and then defines a jointly conjugate prior for the loadings and residual variances. The accuracy of the resulting pseudo-posterior for the covariance improves with increasing dimensionality. We show that FABLE has excellent performance in high-dimensional covariance matrix estimation, including producing well calibrated credible intervals, both theoretically and through simulation experiments. We also demonstrate the strength of our approach in terms of accurate inference and computational efficiency by applying it to a gene expression data set."}, "https://arxiv.org/abs/2404.03835": {"title": "Quantile-respectful density estimation based on the Harrell-Davis quantile estimator", "link": "https://arxiv.org/abs/2404.03835", "description": "arXiv:2404.03835v1 Announce Type: new \nAbstract: Traditional density and quantile estimators are often inconsistent with each other. Their simultaneous usage may lead to inconsistent results. To address this issue, we propose a novel smooth density estimator that is naturally consistent with the Harrell-Davis quantile estimator. We also provide a jittering implementation to support discrete-continuous mixture distributions."}, "https://arxiv.org/abs/2404.03837": {"title": "Inference for non-stationary time series quantile regression with inequality constraints", "link": "https://arxiv.org/abs/2404.03837", "description": "arXiv:2404.03837v1 Announce Type: new \nAbstract: We consider parameter inference for linear quantile regression with non-stationary predictors and errors, where the regression parameters are subject to inequality constraints. We show that the constrained quantile coefficient estimators are asymptotically equivalent to the metric projections of the unconstrained estimator onto the constrained parameter space. Utilizing a geometry-invariant property of this projection operation, we propose inference procedures - the Wald, likelihood ratio, and rank-based methods - that are consistent regardless of whether the true parameters lie on the boundary of the constrained parameter space. 
We also illustrate the advantages of considering the inequality constraints in analyses through simulations and an application to an electricity demand dataset."}, "https://arxiv.org/abs/2404.03878": {"title": "Wasserstein F-tests for Fr\\'echet regression on Bures-Wasserstein manifolds", "link": "https://arxiv.org/abs/2404.03878", "description": "arXiv:2404.03878v1 Announce Type: new \nAbstract: This paper considers the problem of regression analysis with random covariance matrix as outcome and Euclidean covariates in the framework of Fr\\'echet regression on the Bures-Wasserstein manifold. Such regression problems have many applications in single cell genomics and neuroscience, where we have covariance matrices measured over a large set of samples. Fr\\'echet regression on the Bures-Wasserstein manifold is formulated as estimating the conditional Fr\\'echet mean given covariates $x$. A non-asymptotic $\\sqrt{n}$-rate of convergence (up to $\\log n$ factors) is obtained for our estimator $\\hat{Q}_n(x)$ uniformly for $\\left\\|x\\right\\| \\lesssim \\sqrt{\\log n}$, which is crucial for deriving the asymptotic null distribution and power of our proposed statistical test for the null hypothesis of no association. In addition, a central limit theorem for the point estimate $\\hat{Q}_n(x)$ is obtained, giving insights into a test for covariate effects. The null distribution of the test statistic is shown to converge to a weighted sum of independent chi-squares, which implies that the proposed test has the desired significance level asymptotically. Also, the power performance of the test is demonstrated against a sequence of contiguous alternatives. Simulation results show the accuracy of the asymptotic distributions. The proposed methods are applied to a single cell gene expression data set that shows the change of gene co-expression network as people age."}, "https://arxiv.org/abs/2404.03957": {"title": "Bayesian Graphs of Intelligent Causation", "link": "https://arxiv.org/abs/2404.03957", "description": "arXiv:2404.03957v1 Announce Type: new \nAbstract: Probabilistic Graphical Bayesian models of causation have continued to impact on strategic analyses designed to help evaluate the efficacy of different interventions on systems. However, the standard causal algebras upon which these inferences are based typically assume that the intervened population does not react intelligently to frustrate an intervention. In an adversarial setting this is rarely an appropriate assumption. In this paper, we extend an established Bayesian methodology called Adversarial Risk Analysis to apply it to settings that can legitimately be designated as causal in this graphical sense. To embed this technology we first need to generalize the concept of a causal graph. We then proceed to demonstrate how the predictable intelligent reactions of adversaries to circumvent an intervention when they hear about it can be systematically modelled within such graphical frameworks, importing these recent developments from Bayesian game theory. 
The new methodologies and supporting protocols are illustrated through applications associated with an adversary attempting to infiltrate a friendly state."}, "https://arxiv.org/abs/2404.04074": {"title": "DGP-LVM: Derivative Gaussian process latent variable model", "link": "https://arxiv.org/abs/2404.04074", "description": "arXiv:2404.04074v1 Announce Type: new \nAbstract: We develop a framework for derivative Gaussian process latent variable models (DGP-LVM) that can handle multi-dimensional output data using modified derivative covariance functions. The modifications account for complexities in the underlying data generating process such as scaled derivatives, varying information across multiple output dimensions as well as interactions between outputs. Further, our framework provides uncertainty estimates for each latent variable sample using Bayesian inference. Through extensive simulations, we demonstrate that latent variable estimation accuracy can be drastically increased by including derivative information due to our proposed covariance function modifications. The developments are motivated by a concrete biological research problem involving the estimation of the unobserved cellular ordering from single-cell RNA (scRNA) sequencing data for gene expression and its corresponding derivative information known as RNA velocity. Since the RNA velocity is only an estimate of the exact derivative information, the derivative covariance functions need to account for potential scale differences. In a real-world case study, we illustrate the application of DGP-LVMs to such scRNA sequencing data. While motivated by this biological problem, our framework is generally applicable to all kinds of latent variable estimation problems involving derivative information irrespective of the field of study."}, "https://arxiv.org/abs/2404.04122": {"title": "Hidden Markov Models for Multivariate Panel Data", "link": "https://arxiv.org/abs/2404.04122", "description": "arXiv:2404.04122v1 Announce Type: new \nAbstract: While advances continue to be made in model-based clustering, challenges persist in modeling various data types such as panel data. Multivariate panel data present difficulties for clustering algorithms due to the unique correlation structure, a consequence of taking observations on several subjects over multiple time points. Additionally, panel data are often plagued by missing data and dropouts, presenting issues for estimation algorithms. This research presents a family of hidden Markov models that compensate for the unique correlation structures that arise in panel data. A modified expectation-maximization algorithm capable of handling missing not at random data and dropout is presented and used to perform model estimation."}, "https://arxiv.org/abs/2404.04213": {"title": "Modelling handball outcomes using univariate and bivariate approaches", "link": "https://arxiv.org/abs/2404.04213", "description": "arXiv:2404.04213v1 Announce Type: new \nAbstract: Handball has received growing interest in recent years, including academic research on many different aspects of the sport. On the other hand, modelling the outcome of the game has attracted less interest, mainly because of the additional challenges that occur. Data analysis has revealed that the number of goals scored by each team is under-dispersed relative to a Poisson distribution, and hence new models are needed for this purpose. Here we propose to circumvent the problem by modelling the score difference. 
This removes the need for special models since typical models for integer data like the Skellam distribution can provide sufficient fit and thus reveal some of the characteristics of the game. In the present paper we propose some models starting from a Skellam regression model and also considering zero inflated versions as well as other discrete distributions in $\\mathbb Z$. Furthermore, we develop some bivariate models using copulas to model the two halves of the game, thus providing insights into the game. Data from the German Bundesliga are used to show the potential of the new models."}, "https://arxiv.org/abs/2404.03764": {"title": "CONCERT: Covariate-Elaborated Robust Local Information Transfer with Conditional Spike-and-Slab Prior", "link": "https://arxiv.org/abs/2404.03764", "description": "arXiv:2404.03764v1 Announce Type: cross \nAbstract: The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only local information is shared. In this paper, we propose a novel Bayesian transfer learning method named \"CONCERT\" to allow robust local information transfer for high-dimensional data analysis. A novel conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize the local similarities and make the sources work collaboratively to help improve the performance on the target. Distinguished from existing work, CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. Variable selection consistency is established for our CONCERT. To make our algorithm scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and a genetic data analysis demonstrate the validity and the advantage of CONCERT over existing cutting-edge transfer learning methods. We also extend our CONCERT to logistic models, with numerical studies showing its superiority over other methods."}, "https://arxiv.org/abs/2404.03804": {"title": "TransformerLSR: Attentive Joint Model of Longitudinal Data, Survival, and Recurrent Events with Concurrent Latent Structure", "link": "https://arxiv.org/abs/2404.03804", "description": "arXiv:2404.03804v1 Announce Type: cross \nAbstract: In applications such as biomedical studies, epidemiology, and social sciences, recurrent events often co-occur with longitudinal measurements and a terminal event, such as death. Therefore, jointly modeling longitudinal measurements, recurrent events, and survival data while accounting for their dependencies is critical. While joint models for the three components exist in statistical literature, many of these approaches are limited by heavy parametric assumptions and scalability issues. Recently, incorporating deep learning techniques into joint modeling has shown promising results. However, current methods only address joint modeling of longitudinal measurements at regularly-spaced observation times and survival events, neglecting recurrent events. In this paper, we develop TransformerLSR, a flexible transformer-based deep modeling and inference framework to jointly model all three components simultaneously. 
TransformerLSR integrates deep temporal point processes into the joint modeling framework, treating recurrent and terminal events as two competing processes dependent on past longitudinal measurements and recurrent event times. Additionally, TransformerLSR introduces a novel trajectory representation and model architecture to potentially incorporate a priori knowledge of known latent structures among concurrent longitudinal variables. We demonstrate the effectiveness and necessity of TransformerLSR through simulation studies and the analysis of a real-world medical dataset on patients after kidney transplantation."}, "https://arxiv.org/abs/1907.01136": {"title": "Finding Outliers in Gaussian Model-Based Clustering", "link": "https://arxiv.org/abs/1907.01136", "description": "arXiv:1907.01136v5 Announce Type: replace \nAbstract: Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and \\textit{post hoc} outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers."}, "https://arxiv.org/abs/2306.05568": {"title": "Maximally Machine-Learnable Portfolios", "link": "https://arxiv.org/abs/2306.05568", "description": "arXiv:2306.05568v2 Announce Type: replace \nAbstract: When it comes to stock returns, any form of predictability can bolster risk-adjusted profitability. We develop a collaborative machine learning algorithm that optimizes portfolio weights so that the resulting synthetic security is maximally predictable. Precisely, we introduce MACE, a multivariate extension of Alternating Conditional Expectations that achieves the aforementioned goal by wielding a Random Forest on one side of the equation, and a constrained Ridge Regression on the other. There are two key improvements with respect to Lo and MacKinlay's original maximally predictable portfolio approach. First, it accommodates any (nonlinear) forecasting algorithm and predictor set. Second, it handles large portfolios. We conduct exercises at the daily and monthly frequency and report significant increases in predictability and profitability using very little conditioning information. Interestingly, predictability is found in bad as well as good times, and MACE successfully navigates the debacle of 2022."}, "https://arxiv.org/abs/2311.12392": {"title": "Individualized Dynamic Model for Multi-resolutional Data with Application to Mobile Health", "link": "https://arxiv.org/abs/2311.12392", "description": "arXiv:2311.12392v3 Announce Type: replace \nAbstract: Mobile health has emerged as a major success for tracking individual health status, due to the popularity and power of smartphones and wearable devices. This has also brought great challenges in handling heterogeneous, multi-resolution data which arise ubiquitously in mobile health due to irregular multivariate measurements collected from individuals. 
In this paper, we propose an individualized dynamic latent factor model for irregular multi-resolution time series data to interpolate unsampled measurements of time series with low resolution. One major advantage of the proposed method is the capability to integrate multiple irregular time series and multiple subjects by mapping the multi-resolution data to the latent space. In addition, the proposed individualized dynamic latent factor model is applicable to capturing heterogeneous longitudinal information through individualized dynamic latent factors. Our theory provides a bound on the integrated interpolation error and the convergence rate for B-spline approximation methods. Both the simulation studies and the application to smartwatch data demonstrate the superior performance of the proposed method compared to existing methods."}, "https://arxiv.org/abs/2404.04301": {"title": "Robust Nonparametric Stochastic Frontier Analysis", "link": "https://arxiv.org/abs/2404.04301", "description": "arXiv:2404.04301v1 Announce Type: new \nAbstract: Benchmarking tools, including stochastic frontier analysis (SFA), data envelopment analysis (DEA), and its stochastic extension (StoNED) are core tools in economics used to estimate an efficiency envelope and production inefficiencies from data. The problem appears in a wide range of fields -- for example, in global health the frontier can quantify efficiency of interventions and funding of health initiatives. Despite their wide use, classic benchmarking approaches have key limitations that preclude even wider applicability. Here we propose a robust non-parametric stochastic frontier meta-analysis (SFMA) approach that fills these gaps. First, we use flexible basis splines and shape constraints to model the frontier function, so specifying a functional form of the frontier as in classic SFA is no longer necessary. Second, the user can specify relative errors on input datapoints, enabling population-level analyses. Third, we develop a likelihood-based trimming strategy to robustify the approach to outliers, which otherwise break available benchmarking methods. We provide a custom optimization algorithm for fast and reliable performance. We implement the approach and algorithm in an open source Python package `sfma'. Synthetic and real examples show the new capabilities of the method, and are used to compare SFMA to state of the art benchmarking packages that implement DEA, SFA, and StoNED."}, "https://arxiv.org/abs/2404.04343": {"title": "Multi-way contingency tables with uniform margins", "link": "https://arxiv.org/abs/2404.04343", "description": "arXiv:2404.04343v1 Announce Type: new \nAbstract: We study the problem of transforming a multi-way contingency table into an equivalent table with uniform margins and same dependence structure. Such a problem relates to recent developments in copula modeling for discrete random vectors. Here, we focus on three-way binary tables and show that, even in such a simple case, the situation is quite different than for two-way tables. Many more constraints are needed to ensure a unique solution to the problem. Therefore, the uniqueness of the transformed table is subject to arbitrary choices of the practitioner. 
We illustrate the theory through some examples, and conclude with a discussion on the topic and future research directions."}, "https://arxiv.org/abs/2404.04398": {"title": "Bayesian Methods for Modeling Cumulative Exposure to Extensive Environmental Health Hazards", "link": "https://arxiv.org/abs/2404.04398", "description": "arXiv:2404.04398v1 Announce Type: new \nAbstract: Measuring the impact of an environmental point source exposure on the risk of disease, like cancer or childhood asthma, is well-developed. Modeling the impact of an environmental health hazard that is extensive in space, like a wastewater canal, is not. We propose a novel Bayesian generative semiparametric model for characterizing the cumulative spatial exposure to an environmental health hazard that is not well-represented by a single point in space. The model couples a dose-response model with a log-Gaussian Cox process integrated against a distance kernel with an unknown length-scale. We show that this model is a well-defined Bayesian inverse model, namely that the posterior exists under a Gaussian process prior for the log-intensity of exposure, and that a simple integral approximation adequately controls the computational error. We quantify the finite-sample properties and the computational tractability of the discretization scheme in a simulation study. Finally, we apply the model to survey data on household risk of childhood diarrheal illness from exposure to a system of wastewater canals in Mezquital Valley, Mexico."}, "https://arxiv.org/abs/2404.04403": {"title": "Low-Rank Robust Subspace Tensor Clustering for Metro Passenger Flow Modeling", "link": "https://arxiv.org/abs/2404.04403", "description": "arXiv:2404.04403v1 Announce Type: new \nAbstract: Tensor clustering has become an important topic, specifically in spatio-temporal modeling, due to its ability to cluster spatial modes (e.g., stations or road segments) and temporal modes (e.g., time of the day or day of the week). Our motivating example is from subway passenger flow modeling, where similarities between stations are commonly found. However, the challenges lie in the innate high-dimensionality of tensors and also the potential existence of anomalies. This is because the three tasks, i.e., dimension reduction, clustering, and anomaly decomposition, are inter-correlated with each other, and treating them in a separate manner will render suboptimal performance. Thus, in this work, we design a tensor-based subspace clustering and anomaly decomposition technique for simultaneously outlier-robust dimension reduction and clustering for high-dimensional tensors. To achieve this, a novel low-rank robust subspace clustering decomposition model is proposed by combining Tucker decomposition, sparse anomaly decomposition, and subspace clustering. An effective algorithm based on Block Coordinate Descent is proposed to update the parameters. Prudent experiments prove the effectiveness of the proposed framework via the simulation study, with a gain of +25% in clustering accuracy over benchmark methods in a hard case. The interrelations of the three tasks are also analyzed via ablation studies, validating the interrelation assumption. 
Moreover, a case study on station clustering based on real passenger flow data is conducted, yielding valuable insights."}, "https://arxiv.org/abs/2404.04406": {"title": "Optimality-based reward learning with applications to toxicology", "link": "https://arxiv.org/abs/2404.04406", "description": "arXiv:2404.04406v1 Announce Type: new \nAbstract: In toxicology research, experiments are often conducted to determine the effect of toxicant exposure on the behavior of mice, where mice are randomized to receive the toxicant or not. In particular, in fixed interval experiments, one provides a mouse with reinforcers (e.g., a food pellet), contingent upon some action taken by the mouse (e.g., a press of a lever), but the reinforcers are only provided after fixed time intervals. Often, to analyze fixed interval experiments, one specifies and then estimates the conditional state-action distribution (e.g., using an ANOVA). This existing approach, which in the reinforcement learning framework would be called modeling the mouse's \"behavioral policy,\" is sensitive to misspecification. It is likely that any model for the behavioral policy is misspecified; a mapping from a mouse's exposure to their actions can be highly complex. In this work, we avoid specifying the behavioral policy by instead learning the mouse's reward function. Specifying a reward function is as challenging as specifying a behavioral policy, but we propose a novel approach that incorporates knowledge of the optimal behavior, which is often known to the experimenter, to avoid specifying the reward function itself. In particular, we define the reward as a divergence of the mouse's actions from optimality, where the representations of the action and optimality can be arbitrarily complex. The parameters of the reward function then serve as a measure of the mouse's tolerance for divergence from optimality, which is a novel summary of the impact of the exposure. The parameter itself is scalar, and the proposed objective function is differentiable, allowing us to benefit from typical results on consistency of parametric estimators while making very few assumptions."}, "https://arxiv.org/abs/2404.04415": {"title": "Sample size planning for estimating the global win probability with assurance and precision", "link": "https://arxiv.org/abs/2404.04415", "description": "arXiv:2404.04415v1 Announce Type: new \nAbstract: Most clinical trials conducted in drug development contain multiple endpoints in order to collectively assess the intended effects of the drug on various disease characteristics. Focusing on the estimation of the global win probability, defined as the average win probability (WinP) across endpoints that a treated participant would have a better outcome than a control participant, we propose a closed-form sample size formula incorporating pre-specified precision and assurance, with precision denoted by the lower limit of the confidence interval and assurance denoted by the probability of achieving that lower limit. We make use of the equivalence of the WinP and the area under the receiver operating characteristic curve (AUC) and adapt a formula originally developed for the difference between two AUCs to handle the global WinP. Unequal variance is allowed. Simulation results suggest that the method performs very well. 
We illustrate the proposed formula using a Parkinson's disease clinical trial design example."}, "https://arxiv.org/abs/2404.04446": {"title": "Bounding Causal Effects with Leaky Instruments", "link": "https://arxiv.org/abs/2404.04446", "description": "arXiv:2404.04446v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are a popular and powerful tool for estimating causal effects in the presence of unobserved confounding. However, classical approaches rely on strong assumptions such as the $\\textit{exclusion criterion}$, which states that instrumental effects must be entirely mediated by treatments. This assumption often fails in practice. When IV methods are improperly applied to data that do not meet the exclusion criterion, estimated causal effects may be badly biased. In this work, we propose a novel solution that provides $\\textit{partial}$ identification in linear models given a set of $\\textit{leaky instruments}$, which are allowed to violate the exclusion criterion to some limited degree. We derive a convex optimization objective that provides provably sharp bounds on the average treatment effect under some common forms of information leakage, and implement inference procedures to quantify the uncertainty of resulting estimates. We demonstrate our method in a set of experiments with simulated data, where it performs favorably against the state of the art."}, "https://arxiv.org/abs/2404.04471": {"title": "Estimation and Inference in Ultrahigh Dimensional Partially Linear Single-Index Models", "link": "https://arxiv.org/abs/2404.04471", "description": "arXiv:2404.04471v1 Announce Type: new \nAbstract: This paper is concerned with estimation and inference for ultrahigh dimensional partially linear single-index models. The presence of high dimensional nuisance parameter and nuisance unknown function makes the estimation and inference problem very challenging. In this paper, we first propose a profile partial penalized least squares estimator and establish the sparsity, consistency and asymptotic representation of the proposed estimator in ultrahigh dimensional setting. We then propose an $F$-type test statistic for parameters of primary interest and show that the limiting null distribution of the test statistic is $\\chi^2$ distribution, and the test statistic can detect local alternatives, which converge to the null hypothesis at the root-$n$ rate. We further propose a new test for the specification testing problem of the nonparametric function. The test statistic is shown to be asymptotically normal. Simulation studies are conducted to examine the finite sample performance of the proposed estimators and tests. A real data example is used to illustrate the proposed procedures."}, "https://arxiv.org/abs/2404.04494": {"title": "Fast and simple inner-loop algorithms of static / dynamic BLP estimations", "link": "https://arxiv.org/abs/2404.04494", "description": "arXiv:2404.04494v1 Announce Type: new \nAbstract: This study investigates computationally efficient inner-loop algorithms for estimating static / dynamic BLP models. It provides the following ideas to reduce the number of inner-loop iterations: (1). Add a term concerning the outside option share in the BLP contraction mapping; (2). Analytically represent mean product utilities as a function of value functions and solve for the value functions (for dynamic BLP); (3-1). Combine the spectral / SQUAREM algorithms; (3-2). Choice of the step sizes. These methods are independent and easy to implement. 
This study shows good performance of these ideas by numerical experiments."}, "https://arxiv.org/abs/2404.04590": {"title": "Absolute Technical Efficiency Indices", "link": "https://arxiv.org/abs/2404.04590", "description": "arXiv:2404.04590v1 Announce Type: new \nAbstract: Technical efficiency indices (TEIs) can be estimated using the traditional stochastic frontier analysis approach, which yields relative indices that do not allow self-interpretations. In this paper, we introduce a single-step estimation procedure for TEIs that eliminates the need to identify best practices and avoids imposing restrictive hypotheses on the error term. The resulting indices are absolute and allow for individual interpretation. In our model, we estimate a distance function using the inverse coefficient of resource utilization, rather than treating it as unobservable. We employ a Tobit model with a translog distance function as our econometric framework. Applying this model to a sample of 19 airline companies from 2012 to 2021, we find that: (1) Absolute technical efficiency varies considerably between companies with medium-haul European airlines being technically the most efficient, while Asian airlines are the least efficient; (2) Our estimated TEIs are consistent with the observed data with a decline in efficiency especially during the Covid-19 crisis and Brexit period; (3) All airlines contained in our sample would be able to increase their average technical efficiency by 0.209% if they reduced their average kerosene consumption by 1%; (4) Total factor productivity (TFP) growth slowed between 2013 and 2019 due to a decrease in Disembodied Technical Change (DTC) and a small effect from Scale Economies (SE). Toward the end of our study period, TFP growth seemed increasingly driven by the SE effect, with a sharp decline in 2020 followed by an equally sharp recovery in 2021 for most airlines."}, "https://arxiv.org/abs/2404.04696": {"title": "Dynamic Treatment Regimes with Replicated Observations Available for Error-prone Covariates: a Q-learning Approach", "link": "https://arxiv.org/abs/2404.04696", "description": "arXiv:2404.04696v1 Announce Type: new \nAbstract: Dynamic treatment regimes (DTRs) have received an increasing interest in recent years. DTRs are sequences of treatment decision rules tailored to patient-level information. The main goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that yields the best expected clinical outcome. Q-learning has been considered as one of the most popular regression-based methods to estimate the optimal DTR. However, it is rarely studied in an error-prone setting, where the patient information is contaminated with measurement error. In this paper, we study the effect of covariate measurement error on Q-learning and propose a correction method to correct the measurement error in Q-learning. Simulation studies are conducted to assess the performance of the proposed method in Q-learning. We illustrate the use of the proposed method in an application to the sequenced treatment alternatives to relieve depression data."}, "https://arxiv.org/abs/2404.04697": {"title": "Q-learning in Dynamic Treatment Regimes with Misclassified Binary Outcome", "link": "https://arxiv.org/abs/2404.04697", "description": "arXiv:2404.04697v1 Announce Type: new \nAbstract: The study of precision medicine involves dynamic treatment regimes (DTRs), which are sequences of treatment decision rules recommended by taking patient-level information as input. 
The primary goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that leads to the best expected clinical outcome. Statistical methods have been developed in recent years to estimate an optimal DTR, including Q-learning, a regression-based method in the DTR literature. Although there are many studies concerning Q-learning, little attention has been given to settings with noisy data, such as misclassified outcomes. In this paper, we investigate the effect of outcome misclassification on Q-learning and propose a correction method to accommodate the misclassification effect. Simulation studies are conducted to demonstrate the satisfactory performance of the proposed method. We illustrate the proposed method in two examples from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study and the smoking cessation program."}, "https://arxiv.org/abs/2404.04700": {"title": "Stratifying on Treatment Status", "link": "https://arxiv.org/abs/2404.04700", "description": "arXiv:2404.04700v1 Announce Type: new \nAbstract: We investigate the estimation of treatment effects from a sample that is stratified on the binary treatment status. In the case of unconfounded assignment where the potential outcomes are independent of the treatment given covariates, we show that standard estimators of the average treatment effect are inconsistent. In the case of an endogenous treatment and a binary instrument, we show that the IV estimator is inconsistent for the local average treatment effect. In both cases, we propose simple alternative estimators that are consistent in stratified samples, assuming that the fraction treated in the population is known or can be estimated."}, "https://arxiv.org/abs/2404.04702": {"title": "Topological data analysis for random sets and its application in detecting outliers and goodness of fit testing", "link": "https://arxiv.org/abs/2404.04702", "description": "arXiv:2404.04702v1 Announce Type: new \nAbstract: In this paper we present the methodology for detecting outliers and testing the goodness-of-fit of random sets using topological data analysis. We construct the filtration from level sets of the signed distance function and consider various summary functions of the persistence diagram derived from the obtained persistence homology. The outliers are detected using functional depths for the summary functions. Global envelope tests using the summary statistics as test statistics were used to construct the goodness-of-fit test. The procedures were justified by a simulation study using germ-grain random set models."}, "https://arxiv.org/abs/2404.04719": {"title": "Change Point Detection in Dynamic Graphs with Generative Model", "link": "https://arxiv.org/abs/2404.04719", "description": "arXiv:2404.04719v1 Announce Type: new \nAbstract: This paper proposes a simple generative model to detect change points in time series of graphs. The proposed framework consists of learnable prior distributions for low-dimensional graph representations and of a decoder that can generate dynamic graphs from the latent representations. The informative prior distributions in the latent spaces are learned from observed data as empirical Bayes, and the expressive power of a generative model is exploited to assist change point detection. Specifically, the model parameters are learned via maximum approximate likelihood, with a Group Fused Lasso regularization. 
The optimization problem is then solved via Alternating Direction Method of Multipliers (ADMM), and Langevin Dynamics are recruited for posterior inference. Experiments in simulated and real data demonstrate the ability of the generative model in supporting change point detection with good performance."}, "https://arxiv.org/abs/2404.04775": {"title": "Bipartite causal inference with interference, time series data, and a random network", "link": "https://arxiv.org/abs/2404.04775", "description": "arXiv:2404.04775v1 Announce Type: new \nAbstract: In bipartite causal inference with interference there are two distinct sets of units: those that receive the treatment, termed interventional units, and those on which the outcome is measured, termed outcome units. Which interventional units' treatment can drive which outcome units' outcomes is often depicted in a bipartite network. We study bipartite causal inference with interference from observational data across time and with a changing bipartite network. Under an exposure mapping framework, we define causal effects specific to each outcome unit, representing average contrasts of potential outcomes across time. We establish unconfoundedness of the exposure received by the outcome units based on unconfoundedness assumptions on the interventional units' treatment assignment and the random graph, hence respecting the bipartite structure of the problem. By harvesting the time component of our setting, causal effects are estimable while controlling only for temporal trends and time-varying confounders. Our results hold for binary, continuous, and multivariate exposure mappings. In the case of a binary exposure, we propose three matching algorithms to estimate the causal effect based on matching exposed to unexposed time periods for the same outcome unit, and we show that the bias of the resulting estimators is bounded. We illustrate our approach with an extensive simulation study and an application on the effect of wildfire smoke on transportation by bicycle."}, "https://arxiv.org/abs/2404.04794": {"title": "A Deep Learning Approach to Nonparametric Propensity Score Estimation with Optimized Covariate Balance", "link": "https://arxiv.org/abs/2404.04794", "description": "arXiv:2404.04794v1 Announce Type: new \nAbstract: This paper proposes a novel propensity score weighting analysis. We define two sufficient and necessary conditions for a function of the covariates to be the propensity score. The first is \"local balance\", which ensures the conditional independence of covariates and treatment assignment across a dense grid of propensity score values. The second condition, \"local calibration\", guarantees that a balancing score is a propensity score. Using three-layer feed-forward neural networks, we develop a nonparametric propensity score model that satisfies these conditions, effectively circumventing the issue of model misspecification and optimizing covariate balance to minimize bias and stabilize the inverse probability weights. Our proposed method performed substantially better than existing methods in extensive numerical studies of both real and simulated benchmark datasets."}, "https://arxiv.org/abs/2404.04905": {"title": "Review for Handling Missing Data with special missing mechanism", "link": "https://arxiv.org/abs/2404.04905", "description": "arXiv:2404.04905v1 Announce Type: new \nAbstract: Missing data poses a significant challenge in data science, affecting decision-making processes and outcomes. 
Understanding what missing data is, how it occurs, and why it is crucial to handle it appropriately is paramount when working with real-world data, especially in tabular data, one of the most commonly used data types in the real world. Three missing mechanisms are defined in the literature: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), each presenting unique challenges in imputation. Most existing work focuses on MCAR, which is relatively easy to handle. The special missing mechanisms of MNAR and MAR are less explored and understood. This article reviews existing literature on handling missing values. It compares and contrasts existing methods in terms of their ability to handle different missing mechanisms and data types. It identifies research gaps in the existing literature and lays out potential directions for future research in the field. The information in this review will help data analysts and researchers to adopt and promote good practices for handling missing data in real-world problems."}, "https://arxiv.org/abs/2404.04974": {"title": "Neural Network Modeling for Forecasting Tourism Demand in Stopi\\'{c}a Cave: A Serbian Cave Tourism Study", "link": "https://arxiv.org/abs/2404.04974", "description": "arXiv:2404.04974v1 Announce Type: new \nAbstract: For modeling the number of visits in Stopi\\'{c}a cave (Serbia) we consider the classical Auto-regressive Integrated Moving Average (ARIMA) model, Machine Learning (ML) method Support Vector Regression (SVR), and hybrid NeuralPropeth method which combines classical and ML concepts. The most accurate predictions were obtained with NeuralPropeth which includes the seasonal component and growing trend of time-series. In addition, non-linearity is modeled by shallow Neural Network (NN), and Google Trend is incorporated as an exogenous variable. Modeling tourist demand is of great importance for management structures and decision-makers due to its applicability in establishing sustainable tourism utilization strategies in environmentally vulnerable destinations such as caves. The data provided insights into the tourist demand in Stopi\\'{c}a cave and preliminary data for addressing the issues of carrying capacity within the most visited cave in Serbia."}, "https://arxiv.org/abs/2404.04979": {"title": "CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference", "link": "https://arxiv.org/abs/2404.04979", "description": "arXiv:2404.04979v1 Announce Type: new \nAbstract: Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. 
In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis."}, "https://arxiv.org/abs/2404.04985": {"title": "Towards a generalized accessibility measure for transportation equity and efficiency", "link": "https://arxiv.org/abs/2404.04985", "description": "arXiv:2404.04985v1 Announce Type: new \nAbstract: Locational measures of accessibility are widely used in urban and transportation planning to understand the impact of the transportation system on people's access to places. However, there is a considerable lack of measurement standards and publicly available data. We propose a generalized measure of locational accessibility that has a comprehensible form for transportation planning analysis. This metric combines the cumulative opportunities approach with gravity-based measures and is capable of catering to multiple trip purposes, travel modes, cost thresholds, and scales of analysis. Using data from multiple publicly available datasets, this metric is computed by trip purpose and travel time threshold for all block groups in the United States, and the data is made publicly accessible. Further, case studies of three large metropolitan areas reveal substantial inefficiencies in transportation infrastructure, with the most inefficiency observed in sprawling and non-core urban areas, especially for bicycling. Subsequently, it is shown that targeted investment in facilities can contribute to a more equitable distribution of accessibility to essential shopping and service facilities. By assigning greater weights to socioeconomically disadvantaged neighborhoods, the proposed metric formally incorporates equity considerations into transportation planning, contributing to a more equitable distribution of accessibility to essential services and facilities."}, "https://arxiv.org/abs/2404.05021": {"title": "Context-dependent Causality (the Non-Monotonic Case)", "link": "https://arxiv.org/abs/2404.05021", "description": "arXiv:2404.05021v1 Announce Type: new \nAbstract: We develop a novel identification strategy as well as a new estimator for context-dependent causal inference in non-parametric triangular models with non-separable disturbances. Departing from the common practice, our analysis does not rely on the strict monotonicity assumption. Our key contribution lies in leveraging diffusion models to formulate the structural equations as a system evolving from noise accumulation to account for the influence of the latent context (confounder) on the outcome. Our identifiability strategy involves a system of Fredholm integral equations expressing the distributional relationship between a latent context variable and a vector of observables. These integral equations involve an unknown kernel and are governed by a set of structural form functions, inducing a non-monotonic inverse problem. We prove that if the kernel density can be represented as an infinite mixture of Gaussians, then there exists a unique solution for the unknown function. This is a significant result, as it shows that it is possible to solve a non-monotonic inverse problem even when the kernel is unknown. 
On the methodological front, we leverage a novel and enriched Contaminated Generative Adversarial (Neural) Network (CONGAN), which we provide as a solution to the non-monotonic inverse problem."}, "https://arxiv.org/abs/2404.05060": {"title": "Dir-SPGLM: A Bayesian semiparametric GLM with data-driven reference distribution", "link": "https://arxiv.org/abs/2404.05060", "description": "arXiv:2404.05060v1 Announce Type: new \nAbstract: The recently developed semi-parametric generalized linear model (SPGLM) offers more flexibility as compared to the classical GLM by including the baseline or reference distribution of the response as an additional parameter in the model. However, some inference summaries are not easily generated under existing maximum-likelihood based inference (ML-SPGLM). This includes uncertainty in estimation for model-derived functionals such as exceedance probabilities. The latter are critical in a clinical diagnostic or decision-making setting. In this article, by placing a Dirichlet prior on the baseline distribution, we propose a Bayesian model-based approach for inference to address these important gaps. We establish consistency and asymptotic normality results for the implied canonical parameter. Simulation studies and an illustration with data from an aging research study confirm that the proposed method performs comparably to or better than ML-SPGLM. The proposed Bayesian framework is most attractive for inference with small sample training data or in sparse-data scenarios."}, "https://arxiv.org/abs/2404.05118": {"title": "BayesPPDSurv: An R Package for Bayesian Sample Size Determination Using the Power and Normalized Power Prior for Time-To-Event Data", "link": "https://arxiv.org/abs/2404.05118", "description": "arXiv:2404.05118v1 Announce Type: new \nAbstract: The BayesPPDSurv (Bayesian Power Prior Design for Survival Data) R package supports Bayesian power and type I error calculations and model fitting using the power and normalized power priors incorporating historical data for the analysis of time-to-event outcomes. The package implements the stratified proportional hazards regression model with piecewise constant hazard within each stratum. The package allows the historical data to inform the treatment effect parameter, parameter effects for other covariates in the regression model, as well as the baseline hazard parameters. The use of multiple historical datasets is supported. A novel algorithm is developed for computationally efficient use of the normalized power prior. In addition, the package supports the use of arbitrary sampling priors for computing Bayesian power and type I error rates, and has built-in features that semi-automatically generate sampling priors from the historical data. We demonstrate the use of BayesPPDSurv in a comprehensive case study for a melanoma clinical trial design."}, "https://arxiv.org/abs/2404.05148": {"title": "Generalized Criterion for Identifiability of Additive Noise Models Using Majorization", "link": "https://arxiv.org/abs/2404.05148", "description": "arXiv:2404.05148v1 Announce Type: new \nAbstract: The discovery of causal relationships from observational data is very challenging. Many recent approaches rely on complexity or uncertainty concepts to impose constraints on probability distributions, aiming to identify specific classes of directed acyclic graph (DAG) models. 
In this paper, we introduce a novel identifiability criterion for DAGs that places constraints on the conditional variances of additive noise models. We demonstrate that this criterion extends and generalizes existing identifiability criteria in the literature that employ (conditional) variances as measures of uncertainty in (conditional) distributions. For linear Structural Equation Models, we present a new algorithm that leverages the concept of weak majorization applied to the diagonal elements of the Cholesky factor of the covariance matrix to learn a topological ordering of variables. Through extensive simulations and the analysis of bank connectivity data, we provide evidence of the effectiveness of our approach in successfully recovering DAGs. The code for reproducing the results in this paper is available in Supplementary Materials."}, "https://arxiv.org/abs/2404.05178": {"title": "Estimating granular house price distributions in the Australian market using Gaussian mixtures", "link": "https://arxiv.org/abs/2404.05178", "description": "arXiv:2404.05178v1 Announce Type: new \nAbstract: A new methodology is proposed to approximate the time-dependent house price distribution at a fine regional scale using Gaussian mixtures. The means, variances and weights of the mixture components are related to time, location and dwelling type through a non linear function trained by a deep functional approximator. Price indices are derived as means, medians, quantiles or other functions of the estimated distributions. Price densities for larger regions, such as a city, are calculated via a weighted sum of the component density functions. The method is applied to a data set covering all of Australia at a fine spatial and temporal resolution. In addition to enabling a detailed exploration of the data, the proposed index yields lower prediction errors in the practical task of individual dwelling price projection from previous sales values within the three major Australian cities. The estimated quantiles are also found to be well calibrated empirically, capturing the complexity of house price distributions."}, "https://arxiv.org/abs/2404.05209": {"title": "Maximally Forward-Looking Core Inflation", "link": "https://arxiv.org/abs/2404.05209", "description": "arXiv:2404.05209v1 Announce Type: new \nAbstract: Timely monetary policy decision-making requires timely core inflation measures. We create a new core inflation series that is explicitly designed to succeed at that goal. Precisely, we introduce the Assemblage Regression, a generalized nonnegative ridge regression problem that optimizes the price index's subcomponent weights such that the aggregate is maximally predictive of future headline inflation. Ordering subcomponents according to their rank in each period switches the algorithm to be learning supervised trimmed inflation - or, put differently, the maximally forward-looking summary statistic of the realized price changes distribution. In an extensive out-of-sample forecasting experiment for the US and the euro area, we find substantial improvements for signaling medium-term inflation developments in both the pre- and post-Covid years. Those coming from the supervised trimmed version are particularly striking, and are attributable to a highly asymmetric trimming which contrasts with conventional indicators. We also find that this metric was indicating first upward pressures on inflation as early as mid-2020 and quickly captured the turning point in 2022. 
We also consider extensions, like assembling inflation from geographical regions, trimmed temporal aggregation, and building core measures specialized for either upside or downside inflation risks."}, "https://arxiv.org/abs/2404.05246": {"title": "Assessing the causes of continuous effects by posterior effects of causes", "link": "https://arxiv.org/abs/2404.05246", "description": "arXiv:2404.05246v1 Announce Type: new \nAbstract: To evaluate a single cause of a binary effect, Dawid et al. (2014) defined the probability of causation, while Pearl (2015) defined the probabilities of necessity and sufficiency. For assessing the multiple correlated causes of a binary effect, Lu et al. (2023) defined the posterior causal effects based on post-treatment variables. In many scenarios, outcomes are continuous, simply binarizing them and applying previous methods may result in information loss or biased conclusions. To address this limitation, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous effect, including posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under the assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish the asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method."}, "https://arxiv.org/abs/2404.05349": {"title": "Common Trends and Long-Run Multipliers in Nonlinear Structural VARs", "link": "https://arxiv.org/abs/2404.05349", "description": "arXiv:2404.05349v1 Announce Type: new \nAbstract: While it is widely recognised that linear (structural) VARs may omit important features of economic time series, the use of nonlinear SVARs has to date been almost entirely confined to the modelling of stationary time series, because of a lack of understanding as to how common stochastic trends may be accommodated within nonlinear VAR models. This has unfortunately circumscribed the range of series to which such models can be applied -- and/or required that these series be first transformed to stationarity, a potential source of misspecification -- and prevented the use of long-run identifying restrictions in these models. To address these problems, we develop a flexible class of additively time-separable nonlinear SVARs, which subsume models with threshold-type endogenous regime switching, both of the piecewise linear and smooth transition varieties. We extend the Granger-Johansen representation theorem to this class of models, obtaining conditions that specialise exactly to the usual ones when the model is linear. We further show that, as a corollary, these models are capable of supporting the same kinds of long-run identifying restrictions as are available in linear cointegrated SVARs."}, "https://arxiv.org/abs/2404.05445": {"title": "Unsupervised Training of Convex Regularizers using Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2404.05445", "description": "arXiv:2404.05445v1 Announce Type: new \nAbstract: Unsupervised learning is a training approach in the situation where ground truth data is unavailable, such as inverse imaging problems. 
We present an unsupervised Bayesian training approach to learning convex neural network regularizers using a fixed noisy dataset, based on a dual Markov chain estimation method. Compared to classical supervised adversarial regularization methods, where there is access to both clean images and unlimited noisy copies, we demonstrate close performance on natural image Gaussian deconvolution and Poisson denoising tasks."}, "https://arxiv.org/abs/2404.05671": {"title": "Bayesian Inverse Ising Problem with Three-body Interactions", "link": "https://arxiv.org/abs/2404.05671", "description": "arXiv:2404.05671v1 Announce Type: new \nAbstract: In this paper, we solve the inverse Ising problem with three-body interaction. Using the mean-field approximation, we find a tractable expansion of the normalizing constant. This facilitates estimation, which is known to be quite challenging for the Ising model. We then develop a novel hybrid MCMC algorithm that integrates Adaptive Metropolis Hastings (AMH), Hamiltonian Monte Carlo (HMC), and the Manifold-Adjusted Langevin Algorithm (MALA), which converges quickly and mixes well. We demonstrate the robustness of our algorithm using data simulated with a structure under which parameter estimation is known to be challenging, such as in the presence of a phase transition and at the critical point of the system."}, "https://arxiv.org/abs/2404.05702": {"title": "On the estimation of complex statistics combining different surveys", "link": "https://arxiv.org/abs/2404.05702", "description": "arXiv:2404.05702v1 Announce Type: new \nAbstract: The importance of exploring a potential integration among surveys has been acknowledged in order to enhance effectiveness and minimize expenses. In this work, we employ the alignment method to combine information from two different surveys for the estimation of complex statistics. The derivation of the alignment weights poses challenges in the case of complex statistics due to their non-linear form. To overcome this, we propose to use a linearized variable associated with the complex statistic under consideration. Linearized variables have been widely used to derive variance estimates, thus allowing for the estimation of the variance of the combined complex statistics estimates. Simulations conducted show the effectiveness of the proposed approach, resulting in a reduction of the variance of the combined complex statistics estimates. Also, in some cases, the usage of the alignment weights derived using the linearized variable associated with a complex statistic could result in a further reduction of the variance of the combined estimates."}, "https://arxiv.org/abs/2404.04399": {"title": "Longitudinal Targeted Minimum Loss-based Estimation with Temporal-Difference Heterogeneous Transformer", "link": "https://arxiv.org/abs/2404.04399", "description": "arXiv:2404.04399v1 Announce Type: cross \nAbstract: We propose Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to estimate the counterfactual mean outcome under dynamic treatment policies in longitudinal problem settings. Our approach utilizes a transformer architecture with heterogeneous type embedding trained using temporal-difference learning. After obtaining an initial estimate using the transformer, following the targeted minimum loss-based estimation (TMLE) framework, we statistically corrected for the bias commonly associated with machine learning algorithms. 
Furthermore, our method also facilitates statistical inference by enabling the provision of 95% confidence intervals grounded in asymptotic statistical theory. Simulation results demonstrate our method's superior performance over existing approaches, particularly in complex, long time-horizon scenarios. It remains effective in small-sample, short-duration contexts, matching the performance of asymptotically efficient estimators. To demonstrate our method in practice, we applied our method to estimate counterfactual mean outcomes for standard versus intensive blood pressure management strategies in a real-world cardiovascular epidemiology cohort study."}, "https://arxiv.org/abs/2404.04498": {"title": "Bayesian Inference for Consistent Predictions in Overparameterized Nonlinear Regression", "link": "https://arxiv.org/abs/2404.04498", "description": "arXiv:2404.04498v1 Announce Type: cross \nAbstract: The remarkable generalization performance of overparameterized models has challenged the conventional wisdom of statistical learning theory. While recent theoretical studies have shed light on this behavior in linear models or nonlinear classifiers, a comprehensive understanding of overparameterization in nonlinear regression remains lacking. This paper explores the predictive properties of overparameterized nonlinear regression within the Bayesian framework, extending the methodology of adaptive prior based on the intrinsic spectral structure of the data. We establish posterior contraction for single-neuron models with Lipschitz continuous activation functions and for generalized linear models, demonstrating that our approach achieves consistent predictions in the overparameterized regime. Moreover, our Bayesian framework allows for uncertainty estimation of the predictions. The proposed method is validated through numerical simulations and a real data application, showcasing its ability to achieve accurate predictions and reliable uncertainty estimates. Our work advances the theoretical understanding of the blessing of overparameterization and offers a principled Bayesian approach for prediction in large nonlinear models."}, "https://arxiv.org/abs/2404.05062": {"title": "New methods for computing the generalized chi-square distribution", "link": "https://arxiv.org/abs/2404.05062", "description": "arXiv:2404.05062v1 Announce Type: cross \nAbstract: We present several exact and approximate mathematical methods and open-source software to compute the cdf, pdf and inverse cdf of the generalized chi-square distribution, which appears in Bayesian classification problems. Some methods are geared for speed, while others are designed to be accurate far into the tails, using which we can also measure large values of the discriminability index $d'$ between multinormals. We compare the accuracy and speed of these methods against the best existing methods."}, "https://arxiv.org/abs/2404.05545": {"title": "Evaluating Interventional Reasoning Capabilities of Large Language Models", "link": "https://arxiv.org/abs/2404.05545", "description": "arXiv:2404.05545v1 Announce Type: cross \nAbstract: Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consider using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. 
A recent line of work evaluates LLMs' ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to distinguish the ability of LLMs to accurately predict changes resulting from an intervention from their ability to memorize facts or find other shortcuts. Our analysis of four LLMs highlights that while GPT-4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts."}, "https://arxiv.org/abs/2404.05622": {"title": "How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation", "link": "https://arxiv.org/abs/2404.05622", "description": "arXiv:2404.05622v1 Announce Type: cross \nAbstract: Entity resolution (record linkage, microclustering) systems are notoriously difficult to evaluate. Looking for a needle in a haystack, traditional evaluation methods use sophisticated, application-specific sampling schemes to find matching pairs of records among an immense number of non-matches. We propose an alternative that facilitates the creation of representative, reusable benchmark data sets without necessitating complex sampling schemes. These benchmark data sets can then be used for model training and a variety of evaluation tasks. Specifically, we propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics, estimating key performance metrics such as cluster and pairwise precision and recall, and analyzing root causes for errors. We validate the framework in an application to inventor name disambiguation and through simulation studies. Software: https://github.com/OlivierBinette/er-evaluation/"}, "https://arxiv.org/abs/1910.02170": {"title": "Donor's Deferral and Return Behavior: Partial Identification from a Regression Discontinuity Design with Manipulation", "link": "https://arxiv.org/abs/1910.02170", "description": "arXiv:1910.02170v3 Announce Type: replace \nAbstract: Volunteer labor can temporarily yield lower benefits to charities than its costs. In such instances, organizations may wish to defer volunteer donations to a later date. Exploiting a discontinuity in blood donations' eligibility criteria, we show that deferring donors reduces their future volunteerism. In our setting, medical staff manipulates donors' reported hemoglobin levels over a threshold to facilitate donation. Such manipulation invalidates standard regression discontinuity design. To circumvent this issue, we propose a procedure for obtaining partial identification bounds where manipulation is present. Our procedure is applicable in various regression discontinuity settings where the running variable is manipulated and discrete."}, "https://arxiv.org/abs/2201.03616": {"title": "Scale Reliant Inference", "link": "https://arxiv.org/abs/2201.03616", "description": "arXiv:2201.03616v4 Announce Type: replace \nAbstract: Scientific fields such as genomics, ecology, and political science often collect multivariate count data. 
In these fields, the data are often sufficiently noisy such that inferences regarding the total size of the measured systems have substantial uncertainty. This uncertainty can hinder downstream analyses, such as differential analysis in case-control studies. There have historically been two approaches to this problem: one considers the data as compositional and the other as counts that can be normalized. In this article, we use the framework of partially identified models to rigorously study the types of scientific questions (estimands) that can be answered (estimated) using these data. We prove that satisfying Frequentist inferential criteria is impossible for many estimation problems. In contrast, we find that the criteria for Bayesian inference can be satisfied, yet it requires a particular type of model called a Bayesian partially identified model. We introduce Scale Simulation Random Variables as a flexible and computationally efficient form of Bayesian partially identified models for analyzing these data. We use simulations and data analysis to validate our theory."}, "https://arxiv.org/abs/2204.12699": {"title": "Randomness of Shapes and Statistical Inference on Shapes via the Smooth Euler Characteristic Transform", "link": "https://arxiv.org/abs/2204.12699", "description": "arXiv:2204.12699v4 Announce Type: replace \nAbstract: In this article, we establish the mathematical foundations for modeling the randomness of shapes and conducting statistical inference on shapes using the smooth Euler characteristic transform. Based on these foundations, we propose two chi-squared statistic-based algorithms for testing hypotheses on random shapes. Simulation studies are presented to validate our mathematical derivations and to compare our algorithms with state-of-the-art methods to demonstrate the utility of our proposed framework. As real applications, we analyze a data set of mandibular molars from four genera of primates and show that our algorithms have the power to detect significant shape differences that recapitulate known morphological variation across suborders. Altogether, our discussions bridge the following fields: algebraic and computational topology, probability theory and stochastic processes, Sobolev spaces and functional analysis, analysis of variance for functional data, and geometric morphometrics."}, "https://arxiv.org/abs/2205.09922": {"title": "Nonlinear Fore(Back)casting and Innovation Filtering for Causal-Noncausal VAR Models", "link": "https://arxiv.org/abs/2205.09922", "description": "arXiv:2205.09922v3 Announce Type: replace \nAbstract: We introduce closed-form formulas of out-of-sample predictive densities for forecasting and backcasting of mixed causal-noncausal (Structural) Vector Autoregressive VAR models. These nonlinear and time irreversible non-Gaussian VAR processes are shown to satisfy the Markov property in both calendar and reverse time. A post-estimation inference method for assessing the forecast interval uncertainty due to the preliminary estimation step is introduced too. The nonlinear past-dependent innovations of a mixed causal-noncausal VAR model are defined and their filtering and identification methods are discussed. 
Our approach is illustrated by a simulation study and an application to cryptocurrency prices."}, "https://arxiv.org/abs/2206.06344": {"title": "Sparse-group boosting -- Unbiased group and variable selection", "link": "https://arxiv.org/abs/2206.06344", "description": "arXiv:2206.06344v2 Announce Type: replace \nAbstract: In the presence of grouped covariates, we propose a framework for boosting that allows one to enforce sparsity within and between groups. By using component-wise and group-wise gradient boosting at the same time with adjusted degrees of freedom, a model with properties similar to those of the sparse group lasso can be fitted through boosting. We show that within-group and between-group sparsity can be controlled by a mixing parameter and discuss similarities and differences to the mixing parameter in the sparse group lasso. With simulations, gene data, and agricultural data, we show the effectiveness and predictive competitiveness of this estimator. The data and simulations suggest that, in the presence of grouped variables, the use of sparse group boosting is associated with less biased variable selection and higher predictability compared to component-wise boosting. Additionally, we propose a way of reducing bias in component-wise boosting through the degrees of freedom."}, "https://arxiv.org/abs/2212.04550": {"title": "Modern Statistical Models and Methods for Estimating Fatigue-Life and Fatigue-Strength Distributions from Experimental Data", "link": "https://arxiv.org/abs/2212.04550", "description": "arXiv:2212.04550v3 Announce Type: replace \nAbstract: Engineers and scientists have been collecting and analyzing fatigue data since the 1800s to ensure the reliability of life-critical structures. Applications include (but are not limited to) bridges, building structures, aircraft and spacecraft components, ships, ground-based vehicles, and medical devices. Engineers need to estimate S-N relationships (Stress or Strain versus Number of cycles to failure), typically with a focus on estimating small quantiles of the fatigue-life distribution. Estimates from this kind of model are used as input to models (e.g., cumulative damage models) that predict failure-time distributions under varying stress patterns. Also, design engineers need to estimate lower-tail quantiles of the closely related fatigue-strength distribution. The history of applying incorrect statistical methods is nearly as long, and such practices continue to the present. Examples include treating the applied stress (or strain) as the response and the number of cycles to failure as the explanatory variable in regression analyses (because of the need to estimate strength distributions) and ignoring or otherwise mishandling censored observations (known as runouts in the fatigue literature). The first part of the paper reviews the traditional modeling approach where a fatigue-life model is specified. We then show how this specification induces a corresponding fatigue-strength model. The second part of the paper presents a novel alternative modeling approach where a fatigue-strength model is specified and a corresponding fatigue-life model is induced. 
We explain and illustrate the important advantages of this new modeling approach."}, "https://arxiv.org/abs/2302.14230": {"title": "Optimal Priors for the Discounting Parameter of the Normalized Power Prior", "link": "https://arxiv.org/abs/2302.14230", "description": "arXiv:2302.14230v2 Announce Type: replace \nAbstract: The power prior is a popular class of informative priors for incorporating information from historical data. It involves raising the likelihood for the historical data to a power, which acts as a discounting parameter. When the discounting parameter is modelled as random, the normalized power prior is recommended. In this work, we prove that the marginal posterior for the discounting parameter for generalized linear models converges to a point mass at zero if there is any discrepancy between the historical and current data, and that it does not converge to a point mass at one when they are fully compatible. In addition, we explore the construction of optimal priors for the discounting parameter in a normalized power prior. In particular, we are interested in achieving the dual objectives of encouraging borrowing when the historical and current data are compatible and limiting borrowing when they are in conflict. We propose intuitive procedures for eliciting the shape parameters of a beta prior for the discounting parameter based on two minimization criteria, the Kullback-Leibler divergence and the mean squared error. Based on the proposed criteria, the optimal priors derived are often quite different from commonly used priors such as the uniform prior."}, "https://arxiv.org/abs/2305.11323": {"title": "Cumulative differences between paired samples", "link": "https://arxiv.org/abs/2305.11323", "description": "arXiv:2305.11323v2 Announce Type: replace \nAbstract: The simplest, most common paired samples consist of observations from two populations, with each observed response from one population corresponding to an observed response from the other population at the same value of an ordinal covariate. The pair of observed responses (one from each population) at the same value of the covariate is known as a \"matched pair\" (with the matching based on the value of the covariate). A graph of cumulative differences between the two populations reveals differences in responses as a function of the covariate. Indeed, the slope of the secant line connecting two points on the graph becomes the average difference over the wide interval of values of the covariate between the two points; i.e., the slope of the graph is the average difference in responses. (\"Average\" refers to the weighted average if the samples are weighted.) Moreover, a simple statistic known as the Kuiper metric summarizes into a single scalar the overall differences over all values of the covariate. The Kuiper metric is the absolute value of the total difference in responses between the two populations, totaled over the interval of values of the covariate for which the absolute value of the total is greatest. The total should be normalized such that it becomes the (weighted) average over all values of the covariate when the interval over which the total is taken is the entire range of the covariate (i.e., the sum for the total gets divided by the total number of observations, if the samples are unweighted, or divided by the total weight, if the samples are weighted). 
This cumulative approach is fully nonparametric and uniquely defined (with only one right way to construct the graphs and scalar summary statistics), unlike traditional methods such as reliability diagrams or parametric or semi-parametric regressions, which typically obscure significant differences due to their parameter settings."}, "https://arxiv.org/abs/2309.09371": {"title": "Gibbs Sampling using Anti-correlation Gaussian Data Augmentation, with Applications to L1-ball-type Models", "link": "https://arxiv.org/abs/2309.09371", "description": "arXiv:2309.09371v2 Announce Type: replace \nAbstract: L1-ball-type priors are a recent generalization of the spike-and-slab priors. By transforming a continuous precursor distribution to the L1-ball boundary, it induces exact zeros with positive prior and posterior probabilities. With great flexibility in choosing the precursor and threshold distributions, we can easily specify models under structured sparsity, such as those with dependent probability for zeros and smoothness among the non-zeros. Motivated to significantly accelerate the posterior computation, we propose a new data augmentation that leads to a fast block Gibbs sampling algorithm. The latent variable, named ``anti-correlation Gaussian'', cancels out the quadratic exponent term in the latent Gaussian distribution, making the parameters of interest conditionally independent so that they can be updated in a block. Compared to existing algorithms such as the No-U-Turn sampler, the new blocked Gibbs sampler has a very low computing cost per iteration and shows rapid mixing of Markov chains. We establish the geometric ergodicity guarantee of the algorithm in linear models. Further, we show useful extensions of our algorithm for posterior estimation of general latent Gaussian models, such as those involving multivariate truncated Gaussian or latent Gaussian process."}, "https://arxiv.org/abs/2310.17999": {"title": "Automated threshold selection and associated inference uncertainty for univariate extremes", "link": "https://arxiv.org/abs/2310.17999", "description": "arXiv:2310.17999v3 Announce Type: replace \nAbstract: Threshold selection is a fundamental problem in any threshold-based extreme value analysis. While models are asymptotically motivated, selecting an appropriate threshold for finite samples is difficult and highly subjective through standard methods. Inference for high quantiles can also be highly sensitive to the choice of threshold. Too low a threshold choice leads to bias in the fit of the extreme value model, while too high a choice leads to unnecessary additional uncertainty in the estimation of model parameters. We develop a novel methodology for automated threshold selection that directly tackles this bias-variance trade-off. We also develop a method to account for the uncertainty in the threshold estimation and propagate this uncertainty through to high quantile inference. Through a simulation study, we demonstrate the effectiveness of our method for threshold selection and subsequent extreme quantile estimation, relative to the leading existing methods, and show how the method's effectiveness is not sensitive to the tuning parameters. 
We apply our method to the well-known, troublesome example of the River Nidd dataset."}, "https://arxiv.org/abs/2102.13273": {"title": "Application-Driven Learning: A Closed-Loop Prediction and Optimization Approach Applied to Dynamic Reserves and Demand Forecasting", "link": "https://arxiv.org/abs/2102.13273", "description": "arXiv:2102.13273v5 Announce Type: replace-cross \nAbstract: Forecasting and decision-making are generally modeled as two sequential steps with no feedback, following an open-loop approach. In this paper, we present application-driven learning, a new closed-loop framework in which the processes of forecasting and decision-making are merged and co-optimized through a bilevel optimization problem. We present our methodology in a general format and prove that the solution converges to the best estimator in terms of the expected cost of the selected application. Then, we propose two solution methods: an exact method based on the KKT conditions of the second-level problem and a scalable heuristic approach suitable for decomposition methods. The proposed methodology is applied to the relevant problem of defining dynamic reserve requirements and conditional load forecasts, offering an alternative approach to current ad hoc procedures implemented in industry practices. We benchmark our methodology with the standard sequential least-squares forecast and dispatch planning process. We apply the proposed methodology to an illustrative system and to a wide range of instances, from dozens of buses to large-scale realistic systems with thousands of buses. Our results show that the proposed methodology is scalable and yields consistently better performance than the standard open-loop approach."}, "https://arxiv.org/abs/2211.13383": {"title": "A Non-Gaussian Bayesian Filter Using Power and Generalized Logarithmic Moments", "link": "https://arxiv.org/abs/2211.13383", "description": "arXiv:2211.13383v4 Announce Type: replace-cross \nAbstract: In this paper, we aim to propose a consistent non-Gaussian Bayesian filter of which the system state is a continuous function. The distributions of the true system states, and those of the system and observation noises, are only assumed Lebesgue integrable with no prior constraints on what function classes they fall within. This type of filter has significant merits in both theory and practice, which is able to ameliorate the curse of dimensionality for the particle filter, a popular non-Gaussian Bayesian filter of which the system state is parameterized by discrete particles and the corresponding weights. We first propose a new type of statistics, called the generalized logarithmic moments. Together with the power moments, they are used to form a density surrogate, parameterized as an analytic function, to approximate the true system state. The map from the parameters of the proposed density surrogate to both the power moments and the generalized logarithmic moments is proved to be a diffeomorphism, establishing the fact that there exists a unique density surrogate which satisfies both moment conditions. This diffeomorphism also allows us to use gradient methods to treat the convex optimization problem in determining the parameters. Last but not least, simulation results reveal the advantage of using both sets of moments for estimating mixtures of complicated types of functions. 
A robot localization simulation is also given, as an engineering application to validate the proposed filtering scheme."}, "https://arxiv.org/abs/2301.03038": {"title": "Skewed Bernstein-von Mises theorem and skew-modal approximations", "link": "https://arxiv.org/abs/2301.03038", "description": "arXiv:2301.03038v3 Announce Type: replace-cross \nAbstract: Gaussian approximations are routinely employed in Bayesian statistics to ease inference when the target posterior is intractable. Although these approximations are asymptotically justified by Bernstein-von Mises type results, in practice the expected Gaussian behavior may poorly represent the shape of the posterior, thus affecting approximation accuracy. Motivated by these considerations, we derive an improved class of closed-form approximations of posterior distributions which arise from a new treatment of a third-order version of the Laplace method yielding approximations in a tractable family of skew-symmetric distributions. Under general assumptions which account for misspecified models and non-i.i.d. settings, this family of approximations is shown to have a total variation distance from the target posterior whose rate of convergence improves by at least one order of magnitude the one established by the classical Bernstein-von Mises theorem. Specializing this result to the case of regular parametric models shows that the same improvement in approximation accuracy can be also derived for polynomially bounded posterior functionals. Unlike other higher-order approximations, our results prove that it is possible to derive closed-form and valid densities which are expected to provide, in practice, a more accurate, yet similarly-tractable, alternative to Gaussian approximations of the target posterior, while inheriting its limiting frequentist properties. We strengthen such arguments by developing a practical skew-modal approximation for both joint and marginal posteriors that achieves the same theoretical guarantees of its theoretical counterpart by replacing the unknown model parameters with the corresponding MAP estimate. Empirical studies confirm that our theoretical results closely match the remarkable performance observed in practice, even in finite, possibly small, sample regimes."}, "https://arxiv.org/abs/2404.05808": {"title": "Replicability analysis of high dimensional data accounting for dependence", "link": "https://arxiv.org/abs/2404.05808", "description": "arXiv:2404.05808v1 Announce Type: new \nAbstract: Replicability is the cornerstone of scientific research. We study the replicability of data from high-throughput experiments, where tens of thousands of features are examined simultaneously. Existing replicability analysis methods either ignore the dependence among features or impose strong modelling assumptions, producing overly conservative or overly liberal results. Based on $p$-values from two studies, we use a four-state hidden Markov model to capture the structure of local dependence. Our method effectively borrows information from different features and studies while accounting for dependence among features and heterogeneity across studies. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. 
Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods."}, "https://arxiv.org/abs/2404.05933": {"title": "fastcpd: Fast Change Point Detection in R", "link": "https://arxiv.org/abs/2404.05933", "description": "arXiv:2404.05933v1 Announce Type: new \nAbstract: Change point analysis is concerned with detecting and locating structure breaks in the underlying model of a sequence of observations ordered by time, space or other variables. A widely adopted approach for change point analysis is to minimize an objective function with a penalty term on the number of change points. This framework includes several well-established procedures, such as the penalized log-likelihood using the (modified) Bayesian information criterion (BIC) or the minimum description length (MDL). The resulting optimization problem can be solved in polynomial time by dynamic programming or its improved version, such as the Pruned Exact Linear Time (PELT) algorithm (Killick, Fearnhead, and Eckley 2012). However, existing computational methods often suffer from two primary limitations: (1) methods based on direct implementation of dynamic programming or PELT are often time-consuming for long data sequences due to repeated computation of the cost value over different segments of the data sequence; (2) state-of-the-art R packages do not provide enough flexibility for users to handle different change point settings and models. In this work, we present the fastcpd package, aiming to provide an efficient and versatile framework for change point detection in several commonly encountered settings. The core of our algorithm is built upon PELT and the sequential gradient descent method recently proposed by Zhang and Dawn (2023). We illustrate the usage of the fastcpd package through several examples, including mean/variance changes in a (multivariate) Gaussian sequence, parameter changes in regression models, structural breaks in ARMA/GARCH/VAR models, and changes in user-specified models."}, "https://arxiv.org/abs/2404.06064": {"title": "Constructing hierarchical time series through clustering: Is there an optimal way for forecasting?", "link": "https://arxiv.org/abs/2404.06064", "description": "arXiv:2404.06064v1 Announce Type: new \nAbstract: Forecast reconciliation has attracted significant research interest in recent years, with most studies taking the hierarchy of time series as given. We extend existing work that uses time series clustering to construct hierarchies, with the goal of improving forecast accuracy, in three ways. First, we investigate multiple approaches to clustering, including not only different clustering algorithms, but also the way time series are represented and how distance between time series is defined. We find that cluster-based hierarchies lead to improvements in forecast accuracy relative to two-level hierarchies. Second, we devise an approach based on random permutation of hierarchies, keeping the structure of the hierarchy fixed, while time series are randomly allocated to clusters. In doing so, we find that improvements in forecast accuracy that accrue from using clustering do not arise from grouping together similar series but from the structure of the hierarchy. Third, we propose an approach based on averaging forecasts across hierarchies constructed using different clustering methods, that is shown to outperform any single clustering method. 
All analysis is carried out on two benchmark datasets and a simulated dataset. Our findings provide new insights into the role of hierarchy construction in forecast reconciliation and offer valuable guidance on forecasting practice."}, "https://arxiv.org/abs/2404.06093": {"title": "Supervised Contamination Detection, with Flow Cytometry Application", "link": "https://arxiv.org/abs/2404.06093", "description": "arXiv:2404.06093v1 Announce Type: new \nAbstract: The contamination detection problem aims to determine whether a set of observations has been contaminated, i.e., whether it contains points drawn from a distribution different from the reference distribution. Here, we consider a supervised problem, where labeled samples drawn from both the reference distribution and the contamination distribution are available at training time. This problem is motivated by the detection of rare cells in flow cytometry. Compared to novelty detection problems or two-sample testing, where only samples from the reference distribution are available, the challenge lies in efficiently leveraging the observations from the contamination distribution to design more powerful tests. In this article, we introduce a test for the supervised contamination detection problem. We provide non-asymptotic guarantees on its Type I error, and characterize its detection rate. The test relies on estimating reference and contamination densities using histograms, and its power depends strongly on the choice of the corresponding partition. We present an algorithm for judiciously choosing the partition that results in a powerful test. Simulations illustrate the good empirical performance of our partition selection algorithm and the efficiency of our test. Finally, we showcase our method and apply it to a real flow cytometry dataset."}, "https://arxiv.org/abs/2404.06205": {"title": "Adaptive Unit Root Inference in Autoregressions using the Lasso Solution Path", "link": "https://arxiv.org/abs/2404.06205", "description": "arXiv:2404.06205v1 Announce Type: new \nAbstract: We show that the activation knot of a potentially non-stationary regressor on the adaptive Lasso solution path in autoregressions can be leveraged for selection-free inference about a unit root. The resulting test has asymptotic power against local alternatives in $1/T$ neighbourhoods, unlike post-selection inference methods based on consistent model selection. Exploiting the information enrichment principle devised by Reinschl\\\"ussel and Arnold arXiv:2402.16580 [stat.ME] to improve the Lasso-based selection of ADF models, we propose a composite statistic and analyse its asymptotic distribution and local power function. Monte Carlo evidence shows that the combined test dominates the comparable post-selection inference methods of Tibshirani et al. [JASA, 2016, 514, 600-620] and may surpass the power of established unit root tests against local alternatives. We apply the new tests to groundwater level time series for Germany and find evidence rejecting stochastic trends to explain observed long-term declines in mean water levels."}, "https://arxiv.org/abs/2404.06471": {"title": "Regression Discontinuity Design with Spillovers", "link": "https://arxiv.org/abs/2404.06471", "description": "arXiv:2404.06471v1 Announce Type: new \nAbstract: Researchers who estimate treatment effects using a regression discontinuity design (RDD) typically assume that there are no spillovers between the treated and control units. This may be unrealistic. 
We characterize the estimand of RDD in a setting where spillovers occur between units that are close in their values of the running variable. Under the assumption that spillovers are linear-in-means, we show that the estimand depends on the ratio of two terms: (1) the radius over which spillovers occur and (2) the choice of bandwidth used for the local linear regression. Specifically, RDD estimates the direct treatment effect when the radius is of larger order than the bandwidth, and the total treatment effect when the radius is of smaller order than the bandwidth. In the more realistic regime where the radius is of similar order to the bandwidth, the RDD estimand is a mix of the above effects. To recover direct and spillover effects, we propose incorporating estimated spillover terms into local linear regression -- the local analog of peer effects regression. We also clarify the settings under which the donut-hole RD is able to eliminate the effects of spillovers."}, "https://arxiv.org/abs/2404.05809": {"title": "Self-Labeling in Multivariate Causality and Quantification for Adaptive Machine Learning", "link": "https://arxiv.org/abs/2404.05809", "description": "arXiv:2404.05809v1 Announce Type: cross \nAbstract: Adaptive machine learning (ML) aims to allow ML models to adapt to ever-changing environments with potential concept drift after model deployment. Traditionally, adaptive ML requires a new dataset to be manually labeled to tailor deployed models to altered data distributions. Recently, an interactive causality-based self-labeling method was proposed to autonomously associate causally related data streams for domain adaptation, showing promising results compared to traditional feature similarity-based semi-supervised learning. Several unanswered research questions remain, including self-labeling's compatibility with multivariate causality and the quantitative analysis of the auxiliary models used in the self-labeling. The auxiliary models, the interaction time model (ITM) and the effect state detector (ESD), are vital to the success of self-labeling. This paper further develops the self-labeling framework and its theoretical foundations to address these research questions. A framework for the application of self-labeling to multivariate causal graphs is proposed using four basic causal relationships, and the impact of non-ideal ITM and ESD performance is analyzed. A simulated experiment is conducted based on a multivariate causal graph, validating the proposed theory."}, "https://arxiv.org/abs/2404.05929": {"title": "A feature-based information-theoretic approach for detecting interpretable, long-timescale pairwise interactions from time series", "link": "https://arxiv.org/abs/2404.05929", "description": "arXiv:2404.05929v1 Announce Type: cross \nAbstract: Quantifying relationships between components of a complex system is critical to understanding the rich network of interactions that characterize the behavior of the system. Traditional methods for detecting pairwise dependence of time series, such as Pearson correlation, Granger causality, and mutual information, are computed directly in the space of measured time-series values. But for systems in which interactions are mediated by statistical properties of the time series (`time-series features') over longer timescales, this approach can fail to capture the underlying dependence from limited and noisy time-series data, and can be challenging to interpret. 
Addressing these issues, here we introduce an information-theoretic method for detecting dependence between time series mediated by time-series features that provides interpretable insights into the nature of the interactions. Our method extracts a candidate set of time-series features from sliding windows of the source time series and assesses their role in mediating a relationship to values of the target process. Across simulations of three different generative processes, we demonstrate that our feature-based approach can outperform a traditional inference approach based on raw time-series values, especially in challenging scenarios characterized by short time-series lengths, high noise levels, and long interaction timescales. Our work introduces a new tool for inferring and interpreting feature-mediated interactions from time-series data, contributing to the broader landscape of quantitative analysis in complex systems research, with potential applications in various domains including but not limited to neuroscience, finance, climate science, and engineering."}, "https://arxiv.org/abs/2404.05976": {"title": "A Cyber Manufacturing IoT System for Adaptive Machine Learning Model Deployment by Interactive Causality Enabled Self-Labeling", "link": "https://arxiv.org/abs/2404.05976", "description": "arXiv:2404.05976v1 Announce Type: cross \nAbstract: Machine Learning (ML) has been demonstrated to improve productivity in many manufacturing applications. To host these ML applications, several software and Industrial Internet of Things (IIoT) systems have been proposed to deploy them in manufacturing and provide real-time intelligence. Recently, an interactive causality-enabled self-labeling method has been proposed to advance adaptive ML applications in cyber-physical systems, especially manufacturing, by automatically adapting and personalizing ML models after deployment to counter data distribution shifts. The unique features of the self-labeling method require a novel software system to support dynamism at various levels.\n This paper proposes the AdaptIoT system, comprising an end-to-end data streaming pipeline, ML service integration, and an automated self-labeling service. The self-labeling service consists of causal knowledge bases and automated full-cycle self-labeling workflows to adapt multiple ML models simultaneously. AdaptIoT employs a containerized microservice architecture to deliver a scalable and portable solution for small and medium-sized manufacturers. A field demonstration of a self-labeling adaptive ML application is conducted with a makerspace and shows reliable performance."}, "https://arxiv.org/abs/2404.06238": {"title": "Least Squares-Based Permutation Tests in Time Series", "link": "https://arxiv.org/abs/2404.06238", "description": "arXiv:2404.06238v1 Announce Type: cross \nAbstract: This paper studies permutation tests for regression parameters in a time series setting, where the time series is assumed stationary but may exhibit an arbitrary (but weak) dependence structure. In such a setting, it is perhaps surprising that permutation tests can offer any type of inference guarantees, since permuting the covariates can destroy their relationship with the response. Indeed, the fundamental assumption of exchangeability of errors required for the finite-sample exactness of permutation tests can easily fail. 
However, we show that permutation tests may be constructed which are asymptotically valid for a wide class of stationary processes, but remain exact when exchangeability holds. We also consider the problem of testing for no monotone trend, and we construct asymptotically valid permutation tests in this setting as well."}, "https://arxiv.org/abs/2404.06239": {"title": "Permutation Testing for Monotone Trend", "link": "https://arxiv.org/abs/2404.06239", "description": "arXiv:2404.06239v1 Announce Type: cross \nAbstract: In this paper, we consider the fundamental problem of testing for monotone trend in a time series. While the term \"trend\" is commonly used and has an intuitive meaning, it is first crucial to specify its exact meaning in a hypothesis testing context. A commonly used, well-known test is the Mann-Kendall test, which we show does not offer Type 1 error control even in large samples. On the other hand, by an appropriate studentization of the Mann-Kendall statistic, we construct permutation tests that offer asymptotic error control quite generally, but retain the exactness property of permutation tests for i.i.d. observations. We also introduce \"local\" Mann-Kendall statistics as a means of testing for local rather than global trend in a time series. Similar properties of permutation tests are obtained for these tests as well."}, "https://arxiv.org/abs/2107.03253": {"title": "Dynamic Ordered Panel Logit Models", "link": "https://arxiv.org/abs/2107.03253", "description": "arXiv:2107.03253v4 Announce Type: replace \nAbstract: This paper studies a dynamic ordered logit model for panel data with fixed effects. The main contribution of the paper is to construct a set of valid moment conditions that are free of the fixed effects. The moment functions can be computed using four or more periods of data, and the paper presents sufficient conditions for the moment conditions to identify the common parameters of the model, namely the regression coefficients, the autoregressive parameters, and the threshold parameters. The availability of moment conditions suggests that these common parameters can be estimated using the generalized method of moments, and the paper documents the performance of this estimator using Monte Carlo simulations and an empirical illustration to self-reported health status using the British Household Panel Survey."}, "https://arxiv.org/abs/2112.08199": {"title": "M-Estimation based on quasi-processes from discrete samples of Levy processes", "link": "https://arxiv.org/abs/2112.08199", "description": "arXiv:2112.08199v3 Announce Type: replace \nAbstract: We consider M-estimation problems, where the target value is determined using a minimizer of an expected functional of a Levy process. With discrete observations from the Levy process, we can produce a \"quasi-path\" by shuffling increments of the Levy process; we call this a quasi-process. Under a suitable sampling scheme, a quasi-process can converge weakly to the true process according to the properties of the stationary and independent increments. Using this resampling technique, we can estimate objective functionals similar to those estimated using Monte Carlo simulations, and the resulting estimate is available as a contrast function. 
The M-estimator based on these quasi-processes can be consistent and asymptotically normal."}, "https://arxiv.org/abs/2212.14444": {"title": "Empirical Bayes When Estimation Precision Predicts Parameters", "link": "https://arxiv.org/abs/2212.14444", "description": "arXiv:2212.14444v4 Announce Type: replace \nAbstract: Empirical Bayes methods usually maintain a prior independence assumption: The unknown parameters of interest are independent of the known standard errors of the estimates. This assumption is often theoretically questionable and empirically rejected. This paper instead models the conditional distribution of the parameter given the standard errors as a flexibly parametrized family of distributions, leading to a family of methods that we call CLOSE. This paper establishes that (i) CLOSE is rate-optimal for squared error Bayes regret, (ii) squared error regret control is sufficient for an important class of economic decision problems, and (iii) CLOSE is worst-case robust when our assumption on the conditional distribution is misspecified. Empirically, using CLOSE leads to sizable gains for selecting high-mobility Census tracts. Census tracts selected by CLOSE are substantially more mobile on average than those selected by the standard shrinkage method."}, "https://arxiv.org/abs/2306.16091": {"title": "Adaptive functional principal components analysis", "link": "https://arxiv.org/abs/2306.16091", "description": "arXiv:2306.16091v2 Announce Type: replace \nAbstract: Functional data analysis almost always involves smoothing discrete observations into curves, because they are never observed in continuous time and rarely without error. Although smoothing parameters affect the subsequent inference, data-driven methods for selecting these parameters are not well-developed, frustrated by the difficulty of using all the information shared by curves while being computationally efficient. On the one hand, smoothing individual curves in an isolated, albeit sophisticated way, ignores useful signals present in other curves. On the other hand, bandwidth selection by automatic procedures such as cross-validation after pooling all the curves together quickly becomes computationally infeasible due to the large number of data points. In this paper, we propose a new data-driven, adaptive kernel smoothing method, specifically tailored for functional principal components analysis through the derivation of sharp, explicit risk bounds for the eigen-elements. The minimization of these quadratic risk bounds provides refined, yet computationally efficient, bandwidth rules for each eigen-element separately. Both common and independent design cases are allowed. Rates of convergence for the estimators are derived. An extensive simulation study, designed in a versatile manner to closely mimic the characteristics of real data sets, supports our methodological contribution. An illustration on a real data application is provided."}, "https://arxiv.org/abs/2308.04325": {"title": "A Spatial Autoregressive Graphical Model with Applications in Intercropping", "link": "https://arxiv.org/abs/2308.04325", "description": "arXiv:2308.04325v3 Announce Type: replace \nAbstract: Within the statistical literature, a significant gap exists in methods capable of modeling asymmetric multivariate spatial effects that elucidate the relationships underlying complex spatial phenomena. 
For such a phenomenon, observations at any location are expected to arise from a combination of within- and between-location effects, where the latter exhibit asymmetry. This asymmetry is represented by heterogeneous spatial effects between locations belonging to different categories, where the category is a feature inherent to each location in the data; based on the feature labels, asymmetric spatial relations are postulated between neighbouring locations with different labels. Our novel approach synergises the principles of multivariate spatial autoregressive models and the Gaussian graphical model. This synergy enables us to effectively address the gap by accommodating asymmetric spatial relations, overcoming the usual constraints in spatial analyses. Using a Bayesian-estimation framework, the model performance is assessed in a simulation study. We apply the model to intercropping data, where spatial effects between different crops are unlikely to be symmetric, in order to illustrate the usage of the proposed methodology. An R package containing the proposed methodology can be found at https://CRAN.R-project.org/package=SAGM."}, "https://arxiv.org/abs/2308.05456": {"title": "Optimally weighted average derivative effects", "link": "https://arxiv.org/abs/2308.05456", "description": "arXiv:2308.05456v2 Announce Type: replace \nAbstract: Weighted average derivative effects (WADEs) are nonparametric estimands with uses in economics and causal inference. Debiased WADE estimators typically require learning the conditional mean outcome as well as a Riesz representer (RR) that characterises the requisite debiasing corrections. RR estimators for WADEs often rely on kernel estimators, introducing complicated bandwidth-dependent biases. In our work, we propose a new class of RRs that are isomorphic to the class of WADEs and we derive the WADE weight that is optimal, in the sense of having the minimum nonparametric efficiency bound. Our optimal WADE estimators require estimating conditional expectations only (e.g. using machine learning), thus overcoming the limitations of kernel estimators. Moreover, we connect our optimal WADE to projection parameters in partially linear models. We ascribe a causal interpretation to WADE and projection parameters in terms of so-called incremental effects. We propose efficient estimators for two WADE estimands in our class, which we evaluate in a numerical experiment and use to determine the effect of Warfarin dose on blood clotting function."}, "https://arxiv.org/abs/2404.06565": {"title": "Confidence Intervals on Multivariate Normal Quantiles for Environmental Specification Development in Multi-axis Shock and Vibration Testing", "link": "https://arxiv.org/abs/2404.06565", "description": "arXiv:2404.06565v1 Announce Type: new \nAbstract: This article describes two Monte Carlo methods for calculating confidence intervals on cumulative distribution function (CDF) based multivariate normal quantiles that allow for controlling the tail regions of a multivariate distribution where one is most concerned about extreme responses. The CDF based multivariate normal quantiles are represented as contours for bivariate distributions and as iso-surfaces for trivariate distributions. We first provide a novel methodology for an inverse problem, characterizing the uncertainty on the $\\tau^{\\mathrm{th}}$ multivariate quantile probability, when using concurrent univariate quantile probabilities. 
The uncertainty on the $\\tau^{\\mathrm{th}}$ multivariate quantile probability demonstrates the inadequacy of univariate methods, which neglect correlation between multiple variates. Limitations of traditional multivariate normal tolerance regions and simultaneous univariate tolerance methods are discussed, thereby motivating the need for confidence intervals on CDF based multivariate normal quantiles. Two Monte Carlo methods are discussed; the first calculates the CDF over a tessellated domain and then takes a bootstrap confidence interval over the tessellated CDF. The CDF based multivariate quantiles are then estimated from the CDF confidence intervals. For the second method, only the point associated with the highest probability density along the CDF based quantile is calculated, which greatly improves the computational speed compared to the first method. Monte Carlo simulation studies are used to assess the performance of the various methods. Finally, real data analysis is performed to illustrate a workflow for CDF based multivariate normal quantiles in the domain of mechanical shock and vibration to specify a minimum conservative test level for environmental specification."}, "https://arxiv.org/abs/2404.06602": {"title": "A General Identification Algorithm For Data Fusion Problems Under Systematic Selection", "link": "https://arxiv.org/abs/2404.06602", "description": "arXiv:2404.06602v1 Announce Type: new \nAbstract: Causal inference is made challenging by confounding, selection bias, and other complications. A common approach to addressing these difficulties is the inclusion of auxiliary data on the superpopulation of interest. Such data may measure a different set of variables, or be obtained under different experimental conditions than the primary dataset. Analysis based on multiple datasets must carefully account for similarities between datasets, while appropriately accounting for differences.\n In addition, selection of experimental units into different datasets may be systematic; similar difficulties are encountered in missing data problems. Existing methods for combining datasets either do not consider this issue, or assume simple selection mechanisms.\n In this paper, we provide a general approach, based on graphical causal models, for causal inference from data on the same superpopulation that is obtained under different experimental conditions. Our framework allows both arbitrary unobserved confounding and arbitrary selection processes into different experimental regimes in our data.\n We describe how systematic selection processes may be organized into a hierarchy similar to censoring processes in missing data: selected completely at random (SCAR), selected at random (SAR), and selected not at random (SNAR). In addition, we provide a general identification algorithm for interventional distributions in this setting."}, "https://arxiv.org/abs/2404.06698": {"title": "Bayesian Model Selection with Latent Group-Based Effects and Variances with the R Package slgf", "link": "https://arxiv.org/abs/2404.06698", "description": "arXiv:2404.06698v1 Announce Type: new \nAbstract: Linear modeling is ubiquitous, but performance can suffer when the model is misspecified. We have recently demonstrated that latent groupings in the levels of categorical predictors can complicate inference in a variety of fields including bioinformatics, agriculture, industry, engineering, and medicine. 
Here we present the R package slgf which enables the user to easily implement our recently-developed approach to detect group-based regression effects, latent interactions, and/or heteroscedastic error variance through Bayesian model selection. We focus on the scenario in which the levels of a categorical predictor exhibit two latent groups. We treat the detection of this grouping structure as an unsupervised learning problem by searching the space of possible groupings of factor levels. First we review the suspected latent grouping factor (SLGF) method. Next, using both observational and experimental data, we illustrate the usage of slgf in the context of several common linear model layouts: one-way analysis of variance (ANOVA), analysis of covariance (ANCOVA), a two-way replicated layout, and a two-way unreplicated layout. We have selected data that reveal the shortcomings of classical analyses to emphasize the advantage our method can provide when a latent grouping structure is present."}, "https://arxiv.org/abs/2404.06701": {"title": "Covariance Regression with High-Dimensional Predictors", "link": "https://arxiv.org/abs/2404.06701", "description": "arXiv:2404.06701v1 Announce Type: new \nAbstract: In the high-dimensional landscape, addressing the challenges of covariance regression with high-dimensional covariates has posed difficulties for conventional methodologies. This paper addresses these hurdles by presenting a novel approach for high-dimensional inference with covariance matrix outcomes. The proposed methodology is illustrated through its application in elucidating brain coactivation patterns observed in functional magnetic resonance imaging (fMRI) experiments and unraveling complex associations within anatomical connections between brain regions identified through diffusion tensor imaging (DTI). In the pursuit of dependable statistical inference, we introduce an integrative approach based on penalized estimation. This approach combines data splitting, variable selection, aggregation of low-dimensional estimators, and robust variance estimation. It enables the construction of reliable confidence intervals for covariate coefficients, supported by theoretical confidence levels under specified conditions, where asymptotic distributions are provided. Through various types of simulation studies, the proposed approach performs well for covariance regression in the presence of high-dimensional covariates. This innovative approach is applied to the Lifespan Human Connectome Project (HCP) Aging Study, which aims to uncover a typical aging trajectory and variations in the brain connectome among mature and older adults. The proposed approach effectively identifies brain networks and associated predictors of white matter integrity, aligning with established knowledge of the human brain."}, "https://arxiv.org/abs/2404.06803": {"title": "A new way to evaluate G-Wishart normalising constants via Fourier analysis", "link": "https://arxiv.org/abs/2404.06803", "description": "arXiv:2404.06803v1 Announce Type: new \nAbstract: The G-Wishart distribution is an essential component for the Bayesian analysis of Gaussian graphical models as the conjugate prior for the precision matrix. Evaluating the marginal likelihood of such models usually requires computing high-dimensional integrals to determine the G-Wishart normalising constant. 
Closed-form results are known for decomposable or chordal graphs, while an explicit representation as a formal series expansion has been derived recently for general graphs. The nested infinite sums, however, do not lend themselves to computation, remaining of limited practical value. Borrowing techniques from random matrix theory and Fourier analysis, we provide novel exact results well suited to the numerical evaluation of the normalising constant for a large class of graphs beyond chordal graphs. Furthermore, they open new possibilities for developing more efficient sampling schemes for Bayesian inference of Gaussian graphical models."}, "https://arxiv.org/abs/2404.06837": {"title": "Sensitivity analysis for publication bias in meta-analysis of sparse data based on exact likelihood", "link": "https://arxiv.org/abs/2404.06837", "description": "arXiv:2404.06837v1 Announce Type: new \nAbstract: Meta-analysis is a powerful tool to synthesize findings from multiple studies. The normal-normal random-effects model is widely used to account for between-study heterogeneity. However, meta-analysis of sparse data, which may arise when the event rate is low for binary or count outcomes, poses a challenge to the normal-normal random-effects model in terms of the accuracy and stability of inference, since the normal approximation in the within-study likelihood may not be good. To reduce bias arising from data sparsity, the generalized linear mixed model can be used by replacing the approximate normal within-study likelihood with an exact likelihood. Publication bias is one of the most serious threats in meta-analysis. Several objective sensitivity analysis methods for evaluating potential impacts of selective publication are available for the normal-normal random-effects model. We propose a sensitivity analysis method by extending the likelihood-based sensitivity analysis with the $t$-statistic selection function of Copas to several generalized linear mixed-effects models. In applications to several real-world meta-analyses and in simulation studies, the proposed method was shown to outperform the likelihood-based sensitivity analysis based on the normal-normal model. The proposed method gives useful guidance for addressing publication bias in meta-analysis of sparse data."}, "https://arxiv.org/abs/2404.06967": {"title": "Multiple imputation for longitudinal data: A tutorial", "link": "https://arxiv.org/abs/2404.06967", "description": "arXiv:2404.06967v1 Announce Type: new \nAbstract: Longitudinal studies are frequently used in medical research and involve collecting repeated measures on individuals over time. Observations from the same individual are invariably correlated and thus an analytic approach that accounts for this clustering by individual is required. While almost all research suffers from missing data, this can be particularly problematic in longitudinal studies as participation often becomes harder to maintain over time. Multiple imputation (MI) is widely used to handle missing data in such studies. When using MI, it is important that the imputation model is compatible with the proposed analysis model. In a longitudinal analysis, this implies that the clustering considered in the analysis model should be reflected in the imputation process. Several MI approaches have been proposed to impute incomplete longitudinal data, such as treating repeated measurements of the same variable as distinct variables or using generalized linear mixed imputation models. 
However, the uptake of these methods has been limited, as they require additional data manipulation and use of advanced imputation procedures. In this tutorial, we review the available MI approaches that can be used for handling incomplete longitudinal data, including where individuals are clustered within higher-level clusters. We illustrate implementation with replicable R and Stata code using a case study from the Childhood to Adolescence Transition Study."}, "https://arxiv.org/abs/2404.06984": {"title": "Adaptive Strategy of Testing Alphas in High Dimensional Linear Factor Pricing Models", "link": "https://arxiv.org/abs/2404.06984", "description": "arXiv:2404.06984v1 Announce Type: new \nAbstract: In recent years, there has been considerable research on testing alphas in high-dimensional linear factor pricing models. In our study, we introduce a novel max-type test procedure that performs well under sparse alternatives. Furthermore, we demonstrate that this new max-type test procedure is asymptotically independent from the sum-type test procedure proposed by Pesaran and Yamagata (2017). Building on this, we propose a Fisher combination test procedure that exhibits good performance for both dense and sparse alternatives."}, "https://arxiv.org/abs/2404.06995": {"title": "Model-free Change-point Detection Using Modern Classifiers", "link": "https://arxiv.org/abs/2404.06995", "description": "arXiv:2404.06995v1 Announce Type: new \nAbstract: In contemporary data analysis, it is increasingly common to work with non-stationary complex datasets. These datasets typically extend beyond the classical low-dimensional Euclidean space, making it challenging to detect shifts in their distribution without relying on strong structural assumptions. This paper introduces a novel offline change-point detection method that leverages modern classifiers developed in the machine-learning community. With suitable data splitting, the test statistic is constructed through sequential computation of the Area Under the Curve (AUC) of a classifier, which is trained on data segments on both ends of the sequence. It is shown that the resulting AUC process attains its maxima at the true change-point location, which facilitates the change-point estimation. The proposed method is characterized by its complete nonparametric nature, significant versatility, considerable flexibility, and absence of stringent assumptions pertaining to the underlying data or any distributional shifts. Theoretically, we derive the limiting pivotal distribution of the proposed test statistic under null, as well as the asymptotic behaviors under both local and fixed alternatives. The weak consistency of the change-point estimator is provided. Extensive simulation studies and the analysis of two real-world datasets illustrate the superior performance of our approach compared to existing model-free change-point detection methods."}, "https://arxiv.org/abs/2404.07136": {"title": "To impute or not to? Testing multivariate normality on incomplete dataset: Revisiting the BHEP test", "link": "https://arxiv.org/abs/2404.07136", "description": "arXiv:2404.07136v1 Announce Type: new \nAbstract: In this paper, we focus on testing multivariate normality using the BHEP test with data that are missing completely at random. 
Our objective is twofold: first, to gain insight into the asymptotic behavior of BHEP test statistics under two widely used approaches for handling missing data, namely complete-case analysis and imputation, and second, to compare the power performance of test statistic under these approaches. It is observed that under the imputation approach, the affine invariance of test statistics is not preserved. To address this issue, we propose an appropriate bootstrap algorithm for approximating p-values. Extensive simulation studies demonstrate that both mean and median approaches exhibit greater power compared to testing with complete-case analysis, and open some questions for further research."}, "https://arxiv.org/abs/2404.07141": {"title": "High-dimensional copula-based Wasserstein dependence", "link": "https://arxiv.org/abs/2404.07141", "description": "arXiv:2404.07141v1 Announce Type: new \nAbstract: We generalize 2-Wasserstein dependence coefficients to measure dependence between a finite number of random vectors. This generalization includes theoretical properties, and in particular focuses on an interpretation of maximal dependence and an asymptotic normality result for a proposed semi-parametric estimator under a Gaussian copula assumption. In addition, we discuss general axioms for dependence measures between multiple random vectors, other plausible normalizations, and various examples. Afterwards, we look into plug-in estimators based on penalized empirical covariance matrices in order to deal with high dimensionality issues and take possible marginal independencies into account by inducing (block) sparsity. The latter ideas are investigated via a simulation study, considering other dependence coefficients as well. We illustrate the use of the developed methods in two real data applications."}, "https://arxiv.org/abs/2404.06681": {"title": "Causal Unit Selection using Tractable Arithmetic Circuits", "link": "https://arxiv.org/abs/2404.06681", "description": "arXiv:2404.06681v1 Announce Type: cross \nAbstract: The unit selection problem aims to find objects, called units, that optimize a causal objective function which describes the objects' behavior in a causal context (e.g., selecting customers who are about to churn but would most likely change their mind if encouraged). While early studies focused mainly on bounding a specific class of counterfactual objective functions using data, more recent work allows one to find optimal units exactly by reducing the causal objective to a classical objective on a meta-model, and then applying a variant of the classical Variable Elimination (VE) algorithm to the meta-model -- assuming a fully specified causal model is available. In practice, however, finding optimal units using this approach can be very expensive because the used VE algorithm must be exponential in the constrained treewidth of the meta-model, which is larger and denser than the original model. We address this computational challenge by introducing a new approach for unit selection that is not necessarily limited by the constrained treewidth. This is done through compiling the meta-model into a special class of tractable arithmetic circuits that allows the computation of optimal units in time linear in the circuit size. 
We finally present empirical results on random causal models that show order-of-magnitude speedups based on the proposed method for solving unit selection."}, "https://arxiv.org/abs/2404.06735": {"title": "A Copula Graphical Model for Multi-Attribute Data using Optimal Transport", "link": "https://arxiv.org/abs/2404.06735", "description": "arXiv:2404.06735v1 Announce Type: cross \nAbstract: Motivated by modern data forms such as images and multi-view data, the multi-attribute graphical model aims to explore the conditional independence structure among vectors. Under the Gaussian assumption, the conditional independence between vectors is characterized by blockwise zeros in the precision matrix. To relax the restrictive Gaussian assumption, in this paper, we introduce a novel semiparametric multi-attribute graphical model based on a new copula named Cyclically Monotone Copula. This new copula treats the distribution of the node vectors as multivariate marginals and transforms them into Gaussian distributions based on the optimal transport theory. Since the model allows the node vectors to have arbitrary continuous distributions, it is more flexible than the classical Gaussian copula method that performs coordinatewise Gaussianization. We establish the concentration inequalities of the estimated covariance matrices and provide sufficient conditions for selection consistency of the group graphical lasso estimator. For the setting with high-dimensional attributes, a {Projected Cyclically Monotone Copula} model is proposed to address the curse of dimensionality issue that arises from solving high-dimensional optimal transport problems. Numerical results based on synthetic and real data show the efficiency and flexibility of our methods."}, "https://arxiv.org/abs/2404.07100": {"title": "A New Statistic for Testing Covariance Equality in High-Dimensional Gaussian Low-Rank Models", "link": "https://arxiv.org/abs/2404.07100", "description": "arXiv:2404.07100v1 Announce Type: cross \nAbstract: In this paper, we consider the problem of testing equality of the covariance matrices of L complex Gaussian multivariate time series of dimension $M$ . We study the special case where each of the L covariance matrices is modeled as a rank K perturbation of the identity matrix, corresponding to a signal plus noise model. A new test statistic based on the estimates of the eigenvalues of the different covariance matrices is proposed. In particular, we show that this statistic is consistent and with controlled type I error in the high-dimensional asymptotic regime where the sample sizes $N_1,\\ldots,N_L$ of each time series and the dimension $M$ both converge to infinity at the same rate, while $K$ and $L$ are kept fixed. We also provide some simulations on synthetic and real data (SAR images) which demonstrate significant improvements over some classical methods such as the GLRT, or other alternative methods relevant for the high-dimensional regime and the low-rank model."}, "https://arxiv.org/abs/2106.05031": {"title": "Estimation of Optimal Dynamic Treatment Assignment Rules under Policy Constraints", "link": "https://arxiv.org/abs/2106.05031", "description": "arXiv:2106.05031v4 Announce Type: replace \nAbstract: This paper studies statistical decisions for dynamic treatment assignment problems. 
Many policies involve dynamics in their treatment assignments where treatments are sequentially assigned to individuals across multiple stages and the effect of treatment at each stage is usually heterogeneous with respect to the prior treatments, past outcomes, and observed covariates. We consider estimating an optimal dynamic treatment rule that guides the optimal treatment assignment for each individual at each stage based on the individual's history. This paper proposes an empirical welfare maximization approach in a dynamic framework. The approach estimates the optimal dynamic treatment rule using data from an experimental or quasi-experimental study. The paper proposes two estimation methods: one solves the treatment assignment problem at each stage through backward induction, and the other solves the whole dynamic treatment assignment problem simultaneously across all stages. We derive finite-sample upper bounds on worst-case average welfare regrets for the proposed methods and show $1/\\sqrt{n}$-minimax convergence rates. We also modify the simultaneous estimation method to incorporate intertemporal budget/capacity constraints."}, "https://arxiv.org/abs/2309.11772": {"title": "Active Learning for a Recursive Non-Additive Emulator for Multi-Fidelity Computer Experiments", "link": "https://arxiv.org/abs/2309.11772", "description": "arXiv:2309.11772v2 Announce Type: replace \nAbstract: Computer simulations have become essential for analyzing complex systems, but high-fidelity simulations often come with significant computational costs. To tackle this challenge, multi-fidelity computer experiments have emerged as a promising approach that leverages both low-fidelity and high-fidelity simulations, enhancing both the accuracy and efficiency of the analysis. In this paper, we introduce a new and flexible statistical model, the Recursive Non-Additive (RNA) emulator, that integrates the data from multi-fidelity computer experiments. Unlike conventional multi-fidelity emulation approaches that rely on an additive auto-regressive structure, the proposed RNA emulator recursively captures the relationships between multi-fidelity data using Gaussian process priors without making the additive assumption, allowing the model to accommodate more complex data patterns. Importantly, we derive the posterior predictive mean and variance of the emulator, which can be efficiently computed in a closed-form manner, leading to significant improvements in computational efficiency. Additionally, based on this emulator, we introduce three active learning strategies that optimize the balance between accuracy and simulation costs to guide the selection of the fidelity level and input locations for the next simulation run. We demonstrate the effectiveness of the proposed approach in a suite of synthetic examples and a real-world problem. An R package RNAmf for the proposed methodology is provided on CRAN."}, "https://arxiv.org/abs/2401.06383": {"title": "Decomposition with Monotone B-splines: Fitting and Testing", "link": "https://arxiv.org/abs/2401.06383", "description": "arXiv:2401.06383v2 Announce Type: replace \nAbstract: A univariate continuous function can always be decomposed as the sum of a non-increasing function and a non-decreasing one. Based on this property, we propose a non-parametric regression method that combines two spline-fitted monotone curves. 
We demonstrate by extensive simulations that, compared to standard spline-fitting methods, the proposed approach is particularly advantageous in high-noise scenarios. Several theoretical guarantees are established for the proposed approach. Additionally, we present statistics to test the monotonicity of a function based on monotone decomposition, which can better control Type I error and achieve comparable (if not always higher) power compared to existing methods. Finally, we apply the proposed fitting and testing approaches to analyze the single-cell pseudotime trajectory datasets, identifying significant biological insights for non-monotonically expressed genes through Gene Ontology enrichment analysis. The source code implementing the methodology and producing all results is accessible at https://github.com/szcf-weiya/MonotoneDecomposition.jl."}, "https://arxiv.org/abs/2401.07625": {"title": "Statistics in Survey Sampling", "link": "https://arxiv.org/abs/2401.07625", "description": "arXiv:2401.07625v2 Announce Type: replace \nAbstract: Survey sampling theory and methods are introduced. Sampling designs and estimation methods are carefully discussed as a textbook for survey sampling. Topics includes Horvitz-Thompson estimation, simple random sampling, stratified sampling, cluster sampling, ratio estimation, regression estimation, variance estimation, two-phase sampling, and nonresponse adjustment methods."}, "https://arxiv.org/abs/2302.01831": {"title": "Trade-off between predictive performance and FDR control for high-dimensional Gaussian model selection", "link": "https://arxiv.org/abs/2302.01831", "description": "arXiv:2302.01831v3 Announce Type: replace-cross \nAbstract: In the context of the high-dimensional Gaussian linear regression for ordered variables, we study the variable selection procedure via the minimization of the penalized least-squares criterion. We focus on model selection where the penalty function depends on an unknown multiplicative constant commonly calibrated for prediction. We propose a new proper calibration of this hyperparameter to simultaneously control predictive risk and false discovery rate. We obtain non-asymptotic bounds on the False Discovery Rate with respect to the hyperparameter and we provide an algorithm to calibrate it. This algorithm is based on quantities that can typically be observed in real data applications. The algorithm is validated in an extensive simulation study and is compared with some existing variable selection procedures. Finally, we study an extension of our approach to the case in which an ordering of the variables is not available."}, "https://arxiv.org/abs/2302.08070": {"title": "Local Causal Discovery for Estimating Causal Effects", "link": "https://arxiv.org/abs/2302.08070", "description": "arXiv:2302.08070v4 Announce Type: replace-cross \nAbstract: Even when the causal graph underlying our data is unknown, we can use observational data to narrow down the possible values that an average treatment effect (ATE) can take by (1) identifying the graph up to a Markov equivalence class; and (2) estimating that ATE for each graph in the class. While the PC algorithm can identify this class under strong faithfulness assumptions, it can be computationally prohibitive. Fortunately, only the local graph structure around the treatment is required to identify the set of possible ATE values, a fact exploited by local discovery algorithms to improve computational efficiency. 
In this paper, we introduce Local Discovery using Eager Collider Checks (LDECC), a new local causal discovery algorithm that leverages unshielded colliders to orient the treatment's parents differently from existing methods. We show that there exist graphs where LDECC exponentially outperforms existing local discovery algorithms and vice versa. Moreover, we show that LDECC and existing algorithms rely on different faithfulness assumptions, leveraging this insight to weaken the assumptions for identifying the set of possible ATE values."}, "https://arxiv.org/abs/2305.04174": {"title": "Root-n consistent semiparametric learning with high-dimensional nuisance functions under minimal sparsity", "link": "https://arxiv.org/abs/2305.04174", "description": "arXiv:2305.04174v2 Announce Type: replace-cross \nAbstract: Treatment effect estimation under unconfoundedness is a fundamental task in causal inference. In response to the challenge of analyzing high-dimensional datasets collected in substantive fields such as epidemiology, genetics, economics, and social sciences, many methods for treatment effect estimation with high-dimensional nuisance parameters (the outcome regression and the propensity score) have been developed in recent years. However, it is still unclear what is the necessary and sufficient sparsity condition on the nuisance parameters for the treatment effect to be $\\sqrt{n}$-estimable. In this paper, we propose a new Double-Calibration strategy that corrects the estimation bias of the nuisance parameter estimates computed by regularized high-dimensional techniques and demonstrate that the corresponding Doubly-Calibrated estimator achieves $1 / \\sqrt{n}$-rate as long as one of the nuisance parameters is sparse with sparsity below $\\sqrt{n} / \\log p$, where $p$ denotes the ambient dimension of the covariates, whereas the other nuisance parameter can be arbitrarily complex and completely misspecified. The Double-Calibration strategy can also be applied to settings other than treatment effect estimation, e.g. regression coefficient estimation in the presence of diverging number of controls in a semiparametric partially linear model."}, "https://arxiv.org/abs/2305.07276": {"title": "multilevLCA: An R Package for Single-Level and Multilevel Latent Class Analysis with Covariates", "link": "https://arxiv.org/abs/2305.07276", "description": "arXiv:2305.07276v2 Announce Type: replace-cross \nAbstract: This contribution presents a guide to the R package multilevLCA, which offers a complete and innovative set of technical tools for the latent class analysis of single-level and multilevel categorical data. We describe the available model specifications, mainly falling within the fixed-effect or random-effect approaches. Maximum likelihood estimation of the model parameters, enhanced by a refined initialization strategy, is implemented either simultaneously, i.e., in one-step, or by means of the more advantageous two-step estimator. The package features i) semi-automatic model selection when a priori information on the number of classes is lacking, ii) predictors of class membership, and iii) output visualization tools for any of the available model specifications. 
All functionalities are illustrated by means of a real application on citizenship norms data, which are available in the package."}, "https://arxiv.org/abs/2311.07474": {"title": "A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals", "link": "https://arxiv.org/abs/2311.07474", "description": "arXiv:2311.07474v2 Announce Type: replace-cross \nAbstract: Most prognostic methods require a decent amount of data for model training. In reality, however, the amount of historical data owned by a single organization might be small or not large enough to train a reliable prognostic model. To address this challenge, this article proposes a federated prognostic model that allows multiple users to jointly construct a failure time prediction model using their multi-stream, high-dimensional, and incomplete data while keeping each user's data local and confidential. The prognostic model first employs multivariate functional principal component analysis to fuse the multi-stream degradation signals. Then, the fused features coupled with the times-to-failure are utilized to build a (log)-location-scale regression model for failure prediction. To estimate parameters using distributed datasets and keep the data privacy of all participants, we propose a new federated algorithm for feature extraction. Numerical studies indicate that the performance of the proposed model is the same as that of classic non-federated prognostic models and is better than that of the models constructed by each user itself."}, "https://arxiv.org/abs/2404.07248": {"title": "Parametric estimation of conditional Archimedean copula generators for censored data", "link": "https://arxiv.org/abs/2404.07248", "description": "arXiv:2404.07248v1 Announce Type: new \nAbstract: In this paper, we propose a novel approach for estimating Archimedean copula generators in a conditional setting, incorporating endogenous variables. Our method allows for the evaluation of the impact of the different levels of covariates on both the strength and shape of dependence by directly estimating the generator function rather than the copula itself. As such, we contribute to relaxing the simplifying assumption inherent in traditional copula modeling. We demonstrate the effectiveness of our methodology through applications in two diverse settings: a diabetic retinopathy study and a claims reserving analysis. In both cases, we show how considering the influence of covariates enables a more accurate capture of the underlying dependence structure in the data, thus enhancing the applicability of copula models, particularly in actuarial contexts."}, "https://arxiv.org/abs/2404.07323": {"title": "Surrogate modeling for probability distribution estimation:uniform or adaptive design?", "link": "https://arxiv.org/abs/2404.07323", "description": "arXiv:2404.07323v1 Announce Type: new \nAbstract: The active learning (AL) technique, one of the state-of-the-art methods for constructing surrogate models, has shown high accuracy and efficiency in forward uncertainty quantification (UQ) analysis. This paper provides a comprehensive study on AL-based global surrogates for computing the full distribution function, i.e., the cumulative distribution function (CDF) and the complementary CDF (CCDF). To this end, we investigate the three essential components for building surrogates, i.e., types of surrogate models, enrichment methods for experimental designs, and stopping criteria. 
For each component, we choose several representative methods and study their desirable configurations. In addition, we devise a uniform design (i.e., space-filling design) as a baseline for measuring the improvement of using AL. Combining all the representative methods, a total of 1,920 UQ analyses are carried out to solve 16 benchmark examples. The performance of the selected strategies is evaluated based on accuracy and efficiency. In the context of full distribution estimation, this study concludes that (i) AL techniques cannot provide a systematic improvement compared with uniform designs, (ii) the recommended surrogate modeling methods depend on the features of the problems (especially the local nonlinearity), target accuracy, and computational budget."}, "https://arxiv.org/abs/2404.07397": {"title": "Mediated probabilities of causation", "link": "https://arxiv.org/abs/2404.07397", "description": "arXiv:2404.07397v1 Announce Type: new \nAbstract: We propose a set of causal estimands that we call ``the mediated probabilities of causation.'' These estimands quantify the probabilities that an observed negative outcome was induced via a mediating pathway versus a direct pathway in a stylized setting involving a binary exposure or intervention, a single binary mediator, and a binary outcome. We outline a set of conditions sufficient to identify these effects given observed data, and propose a doubly-robust projection based estimation strategy that allows for the use of flexible non-parametric and machine learning methods for estimation. We argue that these effects may be more relevant than the probability of causation, particularly in settings where we observe both some negative outcome and negative mediating event, and we wish to distinguish between settings where the outcome was induced via the exposure inducing the mediator versus the exposure inducing the outcome directly. We motivate our quantities of interest by discussing applications to legal and medical questions of causal attribution."}, "https://arxiv.org/abs/2404.07411": {"title": "Joint mixed-effects models for causal inference in clustered network-based observational studies", "link": "https://arxiv.org/abs/2404.07411", "description": "arXiv:2404.07411v1 Announce Type: new \nAbstract: Causal inference on populations embedded in social networks poses technical challenges, since the typical no interference assumption frequently does not hold. Existing methods developed in the context of network interference rely upon the assumption of no unmeasured confounding. However, when faced with multilevel network data, there may be a latent factor influencing both the exposure and the outcome at the cluster level. We propose a Bayesian inference approach that combines a joint mixed-effects model for the outcome and the exposure with direct standardization to identify and estimate causal effects in the presence of network interference and unmeasured cluster confounding. In simulations, we compare our proposed method with linear mixed and fixed effects models and show that unbiased estimation is achieved using the joint model. 
Having derived valid tools for estimation, we examine the effect of maternal college education on adolescent school performance using data from the National Longitudinal Study of Adolescent Health."}, "https://arxiv.org/abs/2404.07440": {"title": "Bayesian Penalized Transformation Models: Structured Additive Location-Scale Regression for Arbitrary Conditional Distributions", "link": "https://arxiv.org/abs/2404.07440", "description": "arXiv:2404.07440v1 Announce Type: new \nAbstract: Penalized transformation models (PTMs) are a novel form of location-scale regression. In PTMs, the shape of the response's conditional distribution is estimated directly from the data, and structured additive predictors are placed on its location and scale. The core of the model is a monotonically increasing transformation function that relates the response distribution to a reference distribution. The transformation function is equipped with a smoothness prior that regularizes how much the estimated distribution diverges from the reference distribution. These models can be seen as a bridge between conditional transformation models and generalized additive models for location, scale and shape. Markov chain Monte Carlo inference for PTMs can be conducted with the No-U-Turn sampler and offers straightforward uncertainty quantification for the conditional distribution as well as for the covariate effects. A simulation study demonstrates the effectiveness of the approach. We apply the model to data from the Fourth Dutch Growth Study and the Framingham Heart Study. A full-featured implementation is available as a Python library."}, "https://arxiv.org/abs/2404.07459": {"title": "Safe subspace screening for the adaptive nuclear norm regularized trace regression", "link": "https://arxiv.org/abs/2404.07459", "description": "arXiv:2404.07459v1 Announce Type: new \nAbstract: Matrix-form data sets arise in many areas, so there is a large body of work on matrix regression models. One special case is the adaptive nuclear norm regularized trace regression, which has been shown to have good statistical performance. To accelerate the computation of this model, we consider a screening rule. Based on a matrix decomposition and the optimality conditions of the model, we develop a safe subspace screening rule that can be used to identify the inactive subspace of the solution decomposition and reduce the dimension of the solution. To evaluate the efficiency of the safe subspace screening rule, we embed it into the alternating direction method of multipliers algorithm run over a sequence of tuning parameters. In this process, each solution along the tuning-parameter path provides a matrix decomposition space. The safe subspace screening rule is then applied to eliminate the inactive subspace, reduce the solution dimension, and accelerate the computation. Numerical experiments on simulated and real data sets illustrate the efficiency of our screening rule."}, "https://arxiv.org/abs/2404.07632": {"title": "Consistent Distribution Free Affine Invariant Tests for the Validity of Independent Component Models", "link": "https://arxiv.org/abs/2404.07632", "description": "arXiv:2404.07632v1 Announce Type: new \nAbstract: We propose a family of tests of the validity of the assumptions underlying independent component analysis methods.
The tests are formulated as L2-type procedures based on characteristic functions and involve weights; a proper choice of these weights and the estimation method for the mixing matrix yields consistent and affine-invariant tests. Due to the complexity of the asymptotic null distribution of the resulting test statistics, implementation is based on permutational and resampling strategies. This leads to distribution-free procedures regardless of whether these procedures are performed on the estimated independent components themselves or the componentwise ranks of their components. A Monte Carlo study involving various estimation methods for the mixing matrix, various weights, and a competing test based on distance covariance is conducted under the null hypothesis as well as under alternatives. A real-data application demonstrates the practical utility and effectiveness of the method."}, "https://arxiv.org/abs/2404.07684": {"title": "Merger Analysis with Latent Price", "link": "https://arxiv.org/abs/2404.07684", "description": "arXiv:2404.07684v1 Announce Type: new \nAbstract: Standard empirical tools for merger analysis assume price data, which may not be readily available. This paper characterizes sufficient conditions for identifying the unilateral effects of mergers without price data. I show that revenues, margins, and revenue diversion ratios are sufficient for identifying the gross upward pricing pressure indices, impact on consumer/producer surplus, and compensating marginal cost reductions associated with a merger. I also describe assumptions on demand that facilitate the identification of revenue diversion ratios and merger simulations. I use the proposed framework to evaluate the Staples/Office Depot merger (2016)."}, "https://arxiv.org/abs/2404.07906": {"title": "WiNNbeta: Batch and drift correction method by white noise normalization for metabolomic studies", "link": "https://arxiv.org/abs/2404.07906", "description": "arXiv:2404.07906v1 Announce Type: new \nAbstract: We developed a method called batch and drift correction method by White Noise Normalization (WiNNbeta) to correct individual metabolites for batch effects and drifts. This method tests for white noise properties to identify metabolites in need of correction and corrects them by using fine-tuned splines. To test the method performance we applied WiNNbeta to LC-MS data from our metabolomic studies and computed CVs before and after WiNNbeta correction in quality control samples."}, "https://arxiv.org/abs/2404.07923": {"title": "A Bayesian Estimator of Sample Size", "link": "https://arxiv.org/abs/2404.07923", "description": "arXiv:2404.07923v1 Announce Type: new \nAbstract: We consider a Bayesian estimator of sample size (BESS) and an application to oncology dose optimization clinical trials. BESS is built upon balancing a trio of Sample size, Evidence from observed data, and Confidence in posterior inference. It uses a simple logic of \"given the evidence from data, a specific sample size can achieve a degree of confidence in the posterior inference.\" The key distinction between BESS and standard sample size estimation (SSE) is that SSE, typically based on Frequentist inference, specifies the true parameters values in its calculation while BESS assumes a possible outcome from the observed data. As a result, the calibration of the sample size is not based on Type I or Type II error rates, but on posterior probabilities. 
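To make the "evidence plus confidence determines the sample size" logic concrete, the following is a minimal sketch, assuming a Beta-Binomial model and an assumed observed response rate; it mimics the idea of searching for the smallest n at which the posterior probability of exceeding a null rate reaches a target confidence, and it is not the authors' BESS implementation (their R functions are linked in the abstract).

```python
# Hedged sketch of the BESS-style logic: assume a plausible observed response
# rate (an outcome from the data, not a true parameter), then find the smallest
# sample size for which P(p > p0 | data) reaches a desired confidence level.
from scipy.stats import beta

def smallest_n(observed_rate=0.4, p0=0.25, confidence=0.9,
               prior_a=1.0, prior_b=1.0, n_max=500):
    for n in range(1, n_max + 1):
        x = round(observed_rate * n)          # assumed number of responses
        post = beta(prior_a + x, prior_b + n - x)
        if post.sf(p0) >= confidence:         # P(p > p0 | x responses out of n)
            return n, post.sf(p0)
    return None

print(smallest_n())
```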
We argue that BESS leads to a more interpretable statement for investigators, and can easily accommodate prior information as well as sample size re-estimation. We explore its performance in comparison to SSE and demonstrate its usage through a case study of an oncology dose optimization trial. BESS can be applied to general hypothesis tests. R functions are available at https://ccte.uchicago.edu/bess."}, "https://arxiv.org/abs/2404.07222": {"title": "Liquidity Jump, Liquidity Diffusion, and Wash Trading of Crypto Assets", "link": "https://arxiv.org/abs/2404.07222", "description": "arXiv:2404.07222v1 Announce Type: cross \nAbstract: We propose that the liquidity of an asset includes two components: liquidity jump and liquidity diffusion. We find that the liquidity diffusion has a higher correlation with crypto wash trading than the liquidity jump. We demonstrate that the treatment of wash trading significantly reduces the liquidity diffusion, but only marginally reduces the liquidity jump. We find that the ARMA-GARCH/EGARCH models are highly effective in modeling the liquidity-adjusted return with and without treatment of wash trading. We argue that treatment of wash trading is unnecessary in modeling established crypto assets that trade in mainstream exchanges, even if these exchanges are unregulated."}, "https://arxiv.org/abs/2404.07238": {"title": "Application of the chemical master equation and its analytical solution to the illness-death model", "link": "https://arxiv.org/abs/2404.07238", "description": "arXiv:2404.07238v1 Announce Type: cross \nAbstract: The aim of this article is to relate the chemical master equation (CME) to the illness-death model for chronic diseases. We show that a recently developed differential equation for the prevalence directly follows from the CME. As an application, we use the theory of the CME in a simulation study about diabetes in Germany from a previous publication. We find good agreement between the theory and the simulations."}, "https://arxiv.org/abs/2404.07586": {"title": "State-Space Modeling of Shape-constrained Functional Time Series", "link": "https://arxiv.org/abs/2404.07586", "description": "arXiv:2404.07586v1 Announce Type: cross \nAbstract: Functional time series data frequently appear in economic applications, where the functions of interest are subject to some shape constraints, including monotonicity and convexity, as is typical in the estimation of the Lorenz curve. This paper proposes a state-space model for time-varying functions to extract trends and serial dependence from functional time series while imposing the shape constraints on the estimated functions. The function of interest is modeled by a convex combination of selected basis functions to satisfy the shape constraints, where the time-varying convex weights on the simplex follow dynamic multi-logit models. For the complicated likelihood of this model, a novel data augmentation technique is devised to enable posterior computation by an efficient Markov chain Monte Carlo method.
The proposed method is applied to the estimation of time-varying Lorenz curves, and its utility is illustrated through numerical experiments and analysis of panel data of household incomes in Japan."}, "https://arxiv.org/abs/2404.07593": {"title": "Diffusion posterior sampling for simulation-based inference in tall data settings", "link": "https://arxiv.org/abs/2404.07593", "description": "arXiv:2404.07593v1 Announce Type: cross \nAbstract: Determining which parameters of a non-linear model could best describe a set of experimental data is a fundamental problem in science and it has gained much traction lately with the rise of complex large-scale simulators (a.k.a. black-box simulators). The likelihood of such models is typically intractable, which is why classical MCMC methods can not be used. Simulation-based inference (SBI) stands out in this context by only requiring a dataset of simulations to train deep generative models capable of approximating the posterior distribution that relates input parameters to a given observation. In this work, we consider a tall data extension in which multiple observations are available and one wishes to leverage their shared information to better infer the parameters of the model. The method we propose is built upon recent developments from the flourishing score-based diffusion literature and allows us to estimate the tall data posterior distribution simply using information from the score network trained on individual observations. We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost."}, "https://arxiv.org/abs/2404.07661": {"title": "Robust performance metrics for imbalanced classification problems", "link": "https://arxiv.org/abs/2404.07661", "description": "arXiv:2404.07661v1 Announce Type: cross \nAbstract: We show that established performance metrics in binary classification, such as the F-score, the Jaccard similarity coefficient or Matthews' correlation coefficient (MCC), are not robust to class imbalance in the sense that if the proportion of the minority class tends to $0$, the true positive rate (TPR) of the Bayes classifier under these metrics tends to $0$ as well. Thus, in imbalanced classification problems, these metrics favour classifiers which ignore the minority class. To alleviate this issue we introduce robust modifications of the F-score and the MCC for which, even in strongly imbalanced settings, the TPR is bounded away from $0$. We numerically illustrate the behaviour of the various performance metrics in simulations as well as on a credit default data set. We also discuss connections to the ROC and precision-recall curves and give recommendations on how to combine their usage with performance metrics."}, "https://arxiv.org/abs/2011.08661": {"title": "Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders", "link": "https://arxiv.org/abs/2011.08661", "description": "arXiv:2011.08661v3 Announce Type: replace \nAbstract: We consider estimation of average treatment effects given observational data with high-dimensional pretreatment variables. Existing methods for this problem typically assume some form of sparsity for the regression functions. 
In this work, we introduce a debiased inverse propensity score weighting (DIPW) scheme for average treatment effect estimation that delivers $\\sqrt{n}$-consistent estimates when the propensity score follows a sparse logistic regression model; the outcome regression functions are permitted to be arbitrarily complex. We further demonstrate how confidence intervals centred on our estimates may be constructed. Our theoretical results quantify the price to pay for permitting the regression functions to be unestimable, which shows up as an inflation of the variance of the estimator compared to the semiparametric efficient variance by a constant factor, under mild conditions. We also show that when outcome regressions can be estimated faster than a slow $1/\\sqrt{ \\log n}$ rate, our estimator achieves semiparametric efficiency. As our results accommodate arbitrary outcome regression functions, averages of transformed responses under each treatment may also be estimated at the $\\sqrt{n}$ rate. Thus, for example, the variances of the potential outcomes may be estimated. We discuss extensions to estimating linear projections of the heterogeneous treatment effect function and explain how propensity score models with more general link functions may be handled within our framework. An R package \\texttt{dipw} implementing our methodology is available on CRAN."}, "https://arxiv.org/abs/2104.13871": {"title": "Selection and Aggregation of Conformal Prediction Sets", "link": "https://arxiv.org/abs/2104.13871", "description": "arXiv:2104.13871v3 Announce Type: replace \nAbstract: Conformal prediction is a generic methodology for finite-sample valid distribution-free prediction. This technique has garnered a lot of attention in the literature partly because it can be applied with any machine learning algorithm that provides point predictions to yield valid prediction regions. Of course, the efficiency (width/volume) of the resulting prediction region depends on the performance of the machine learning algorithm. In the context of point prediction, several techniques (such as cross-validation) exist to select one of many machine learning algorithms for better performance. In contrast, such selection techniques are seldom discussed in the context of set prediction (or prediction regions). In this paper, we consider the problem of obtaining the smallest conformal prediction region given a family of machine learning algorithms. We provide two general-purpose selection algorithms and consider coverage as well as width properties of the final prediction region. The first selection method yields the smallest width prediction region among the family of conformal prediction regions for all sample sizes but only has an approximate coverage guarantee. The second selection method has a finite sample coverage guarantee but only attains close to the smallest width. The approximate optimal width property of the second method is quantified via an oracle inequality. As an illustration, we consider the use of aggregation of non-parametric regression estimators in the split conformal method with the absolute residual conformal score."}, "https://arxiv.org/abs/2203.09000": {"title": "Lorenz map, inequality ordering and curves based on multidimensional rearrangements", "link": "https://arxiv.org/abs/2203.09000", "description": "arXiv:2203.09000v3 Announce Type: replace \nAbstract: We propose a multivariate extension of the Lorenz curve based on multivariate rearrangements of optimal transport theory. 
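For reference, the classical univariate object being generalized here is the Lorenz curve, the cumulative share of the total resource held by the poorest fraction of the population; the short numpy sketch below computes it, together with the associated Gini index, on simulated incomes (illustrative only, not the paper's multivariate construction).

```python
# Reference point for the multivariate extension: the classical univariate
# Lorenz curve and Gini index, computed on simulated lognormal incomes.
import numpy as np

def lorenz_curve(resource):
    x = np.sort(np.asarray(resource, dtype=float))
    pop_share = np.arange(1, x.size + 1) / x.size
    cum_share = np.cumsum(x) / x.sum()
    return pop_share, cum_share

def gini(resource):
    x = np.sort(np.asarray(resource, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    # Standard sample Gini index based on values and their ranks.
    return 2 * np.sum(ranks * x) / (n * x.sum()) - (n + 1) / n

incomes = np.random.default_rng(2).lognormal(mean=0.0, sigma=1.0, size=10_000)
pop, cum = lorenz_curve(incomes)
print("share held by the poorest 50%:", round(float(cum[incomes.size // 2 - 1]), 3))
print("Gini index:", round(gini(incomes), 3))
```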
We define a vector Lorenz map as the integral of the vector quantile map associated with a multivariate resource allocation. Each component of the Lorenz map is the cumulative share of each resource, as in the traditional univariate case. The pointwise ordering of such Lorenz maps defines a new multivariate majorization order, which is equivalent to preference by any social planner with an inequality-averse multivariate rank-dependent social evaluation functional. We define a family of multi-attribute Gini indices and a complete ordering based on the Lorenz map. We propose the level sets of an Inverse Lorenz Function as a practical tool to visualize and compare inequality in two dimensions, and apply it to income-wealth inequality in the United States between 1989 and 2022."}, "https://arxiv.org/abs/2207.04773": {"title": "Functional Regression Models with Functional Response: New Approaches and a Comparative Study", "link": "https://arxiv.org/abs/2207.04773", "description": "arXiv:2207.04773v4 Announce Type: replace \nAbstract: This paper proposes a new nonlinear approach for additive functional regression with functional response based on kernel methods along with some slight reformulation and implementation of the linear regression and the spectral additive model. The latter methods have in common that the covariates and the response are represented in a basis and so can only be applied when the response and the covariates belong to a Hilbert space, while the proposed method only uses the distances among data and thus can be applied in situations where the covariates or the response are not Hilbertian (typically normed or even metric spaces with a real vector structure). A comparison of these methods with other procedures readily available in R is performed in a simulation study and in real datasets, with the results showing the advantages of the nonlinear proposals and the small loss of efficiency when the simulation scenario is truly linear. The comparison is done in the Hilbertian case as it is the only scenario where all the procedures can be compared. Finally, the supplementary material provides a visualization tool for checking the linearity of the relationship between a single covariate and the response and a link to a GitHub repository where the code and data are available."}, "https://arxiv.org/abs/2207.09101": {"title": "Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability", "link": "https://arxiv.org/abs/2207.09101", "description": "arXiv:2207.09101v3 Announce Type: replace \nAbstract: Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome; the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., false positive and false negative rate) or their lower bounds.
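A minimal sketch of this kind of calculation, assuming a classical bivariate-normal true-score model in which the observed composite rating correlates with true quality at sqrt(IRR); this modeling assumption is made here purely for illustration and is not necessarily the authors' exact measurement model.

```python
# Hedged sketch of the link between inter-rater reliability and selection
# errors: under a bivariate-normal true-score model (an assumption made here),
# an applicant is "selected" if the observed rating is in the top q fraction,
# and a selection is a "false positive" if the true quality is not.
import numpy as np
from scipy.stats import norm, multivariate_normal

def false_positive_rate(irr, q):
    rho = np.sqrt(irr)                    # corr(observed, true) under the model
    t = norm.ppf(1 - q)                   # selection cutoff on the standard scale
    joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    p_both_top = 1 - 2 * norm.cdf(t) + joint.cdf([t, t])
    return 1 - p_both_top / q             # P(true not in top q | selected)

for irr in (0.4, 0.6, 0.8):
    print(f"IRR={irr:.1f}: FPR among selected ~ {false_positive_rate(irr, q=0.2):.2f}")
```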
We draw connections between the inter-rater reliability and the binary classification metrics, showing that binary classification metrics depend solely on the IRR coefficient and proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss possible other uses of the explored connections in other contexts, such as educational testing, psychological assessment, and health-related measurement and implement the computations in IRR2FPR R package."}, "https://arxiv.org/abs/2301.08324": {"title": "Differentially Private Confidence Intervals for Proportions under Stratified Random Sampling", "link": "https://arxiv.org/abs/2301.08324", "description": "arXiv:2301.08324v2 Announce Type: replace \nAbstract: Confidence intervals are a fundamental tool for quantifying the uncertainty of parameters of interest. With the increase of data privacy awareness, developing a private version of confidence intervals has gained growing attention from both statisticians and computer scientists. Differential privacy is a state-of-the-art framework for analyzing privacy loss when releasing statistics computed from sensitive data. Recent work has been done around differentially private confidence intervals, yet to the best of our knowledge, rigorous methodologies on differentially private confidence intervals in the context of survey sampling have not been studied. In this paper, we propose three differentially private algorithms for constructing confidence intervals for proportions under stratified random sampling. We articulate two variants of differential privacy that make sense for data from stratified sampling designs, analyzing each of our algorithms within one of these two variants. We establish analytical privacy guarantees and asymptotic properties of the estimators. In addition, we conduct simulation studies to evaluate the proposed private confidence intervals, and two applications to the 1940 Census data are provided."}, "https://arxiv.org/abs/2304.00117": {"title": "Efficiently transporting average treatment effects using a sufficient subset of effect modifiers", "link": "https://arxiv.org/abs/2304.00117", "description": "arXiv:2304.00117v2 Announce Type: replace \nAbstract: We develop flexible and nonparametric estimators of the average treatment effect (ATE) transported to a new population that offer potential efficiency gains by incorporating only a sufficient subset of effect modifiers that are differentially distributed between the source and target populations into the transport step. We develop both a one-step estimator when this sufficient subset of effect modifiers is known and a collaborative one-step estimator when it is unknown. We discuss when we would expect our estimators to be more efficient than those that assume all covariates may be relevant effect modifiers and the exceptions when we would expect worse efficiency. We use simulation to compare finite sample performance across our proposed estimators and existing estimators of the transported ATE, including in the presence of practical violations of the positivity assumption. 
Lastly, we apply our proposed estimators to a large-scale housing trial."}, "https://arxiv.org/abs/2309.11058": {"title": "require: Package dependencies for reproducible research", "link": "https://arxiv.org/abs/2309.11058", "description": "arXiv:2309.11058v2 Announce Type: replace \nAbstract: The ability to conduct reproducible research in Stata is often limited by the lack of version control for community-contributed packages. This article introduces the require command, a tool designed to ensure Stata package dependencies are compatible across users and computer systems. Given a list of Stata packages, require verifies that each package is installed, checks for a minimum or exact version or package release date, and optionally installs the package if prompted by the researcher."}, "https://arxiv.org/abs/2311.09972": {"title": "Inference in Auctions with Many Bidders Using Transaction Prices", "link": "https://arxiv.org/abs/2311.09972", "description": "arXiv:2311.09972v2 Announce Type: replace \nAbstract: This paper considers inference in first-price and second-price sealed-bid auctions in empirical settings where we observe auctions with a large number of bidders. Relevant applications include online auctions, treasury auctions, spectrum auctions, art auctions, and IPO auctions, among others. Given the abundance of bidders in each auction, we propose an asymptotic framework in which the number of bidders diverges while the number of auctions remains fixed. This framework allows us to perform asymptotically exact inference on key model features using only transaction price data. Specifically, we examine inference on the expected utility of the auction winner, the expected revenue of the seller, and the tail properties of the valuation distribution. Simulations confirm the accuracy of our inference methods in finite samples. Finally, we also apply them to Hong Kong car license auction data."}, "https://arxiv.org/abs/2312.13643": {"title": "Debiasing Welch's Method for Spectral Density Estimation", "link": "https://arxiv.org/abs/2312.13643", "description": "arXiv:2312.13643v2 Announce Type: replace \nAbstract: Welch's method provides an estimator of the power spectral density that is statistically consistent. This is achieved by averaging over periodograms calculated from overlapping segments of a time series. For a finite length time series, while the variance of the estimator decreases as the number of segments increase, the magnitude of the estimator's bias increases: a bias-variance trade-off ensues when setting the segment number. We address this issue by providing a novel method for debiasing Welch's method which maintains the computational complexity and asymptotic consistency, and leads to improved finite-sample performance. Theoretical results are given for fourth-order stationary processes with finite fourth-order moments and absolutely convergent fourth-order cumulant function. The significant bias reduction is demonstrated with numerical simulation and an application to real-world data. Our estimator also permits irregular spacing over frequency and we demonstrate how this may be employed for signal compression and further variance reduction. 
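The bias-variance trade-off in the classical Welch estimator is easy to reproduce numerically; the sketch below uses scipy.signal.welch on simulated AR(1) series and only illustrates the uncorrected estimator that the paper debiases, not the proposed correction.

```python
# Bias-variance trade-off in classical Welch estimation: more, shorter segments
# reduce the variance of the averaged periodogram but increase its bias near a
# spectral peak. Evaluated by Monte Carlo against the known AR(1) spectrum.
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(3)
n, phi, reps = 2048, 0.9, 200

def ar1(n):
    x = np.zeros(n)
    e = rng.normal(size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

f0 = 1 / 64                                   # frequency at which we evaluate
true_val = 2.0 / np.abs(1 - phi * np.exp(-2j * np.pi * f0)) ** 2  # one-sided PSD

for nperseg in (1024, 256, 64):               # few long vs. many short segments
    ests = []
    for _ in range(reps):
        f, pxx = welch(ar1(n), fs=1.0, nperseg=nperseg, noverlap=nperseg // 2)
        ests.append(np.interp(f0, f, pxx))
    ests = np.array(ests)
    print(f"nperseg={nperseg:4d}  bias={ests.mean() - true_val:+8.2f}  "
          f"std={ests.std():7.2f}  (truth {true_val:.1f})")
```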
Code accompanying this work is available in R and python."}, "https://arxiv.org/abs/2404.08105": {"title": "Uniform Inference in High-Dimensional Threshold Regression Models", "link": "https://arxiv.org/abs/2404.08105", "description": "arXiv:2404.08105v1 Announce Type: new \nAbstract: We develop uniform inference for high-dimensional threshold regression parameters and valid inference for the threshold parameter in this paper. We first establish oracle inequalities for prediction errors and $\\ell_1$ estimation errors for the Lasso estimator of the slope parameters and the threshold parameter, allowing for heteroskedastic non-subgaussian error terms and non-subgaussian covariates. Next, we derive the asymptotic distribution of tests involving an increasing number of slope parameters by debiasing (or desparsifying) the scaled Lasso estimator. The asymptotic distribution of tests without the threshold effect is identical to that with a fixed effect. Moreover, we perform valid inference for the threshold parameter using subsampling method. Finally, we conduct simulation studies to demonstrate the performance of our method in finite samples."}, "https://arxiv.org/abs/2404.08128": {"title": "Inference of treatment effect and its regional modifiers using restricted mean survival time in multi-regional clinical trials", "link": "https://arxiv.org/abs/2404.08128", "description": "arXiv:2404.08128v1 Announce Type: new \nAbstract: Multi-regional clinical trials (MRCTs) play an increasingly crucial role in global pharmaceutical development by expediting data gathering and regulatory approval across diverse patient populations. However, differences in recruitment practices and regional demographics often lead to variations in study participant characteristics, potentially biasing treatment effect estimates and undermining treatment effect consistency assessment across regions. To address this challenge, we propose novel estimators and inference methods utilizing inverse probability of sampling and calibration weighting. Our approaches aim to eliminate exogenous regional imbalance while preserving intrinsic differences across regions, such as race and genetic variants. Moreover, time-to-event outcomes in MRCT studies receive limited attention, with existing methodologies primarily focusing on hazard ratios. In this paper, we adopt restricted mean survival time to characterize the treatment effect, offering more straightforward interpretations of treatment effects with fewer assumptions than hazard ratios. Theoretical results are established for the proposed estimators, supported by extensive simulation studies. We illustrate the effectiveness of our methods through a real MRCT case study on acute coronary syndromes."}, "https://arxiv.org/abs/2404.08169": {"title": "AutoGFI: Streamlined Generalized Fiducial Inference for Modern Inference Problems", "link": "https://arxiv.org/abs/2404.08169", "description": "arXiv:2404.08169v1 Announce Type: new \nAbstract: The origins of fiducial inference trace back to the 1930s when R. A. Fisher first introduced the concept as a response to what he perceived as a limitation of Bayesian inference - the requirement for a subjective prior distribution on model parameters in cases where no prior information was available. However, Fisher's initial fiducial approach fell out of favor as complications arose, particularly in multi-parameter problems. 
In the wake of 2000, amidst a renewed interest in contemporary adaptations of fiducial inference, generalized fiducial inference (GFI) emerged to extend Fisher's fiducial argument, providing a promising avenue for addressing numerous crucial and practical inference challenges. Nevertheless, the adoption of GFI has been limited due to its often demanding mathematical derivations and the necessity for implementing complex Markov Chain Monte Carlo algorithms. This complexity has impeded its widespread utilization and practical applicability. This paper presents a significant advancement by introducing an innovative variant of GFI designed to alleviate these challenges. Specifically, this paper proposes AutoGFI, an easily implementable algorithm that streamlines the application of GFI to a broad spectrum of inference problems involving additive noise. AutoGFI can be readily implemented as long as a fitting routine is available, making it accessible to a broader audience of researchers and practitioners. To demonstrate its effectiveness, AutoGFI is applied to three contemporary and challenging problems: tensor regression, matrix completion, and regression with network cohesion. These case studies highlight the immense potential of GFI and illustrate AutoGFI's promising performance when compared to specialized solutions for these problems. Overall, this research paves the way for a more accessible and powerful application of GFI in a range of practical domains."}, "https://arxiv.org/abs/2404.08284": {"title": "A unified generalization of inverse regression via adaptive column selection", "link": "https://arxiv.org/abs/2404.08284", "description": "arXiv:2404.08284v1 Announce Type: new \nAbstract: A bottleneck of sufficient dimension reduction (SDR) in the modern era is that, among numerous methods, only the sliced inverse regression (SIR) is generally applicable under the high-dimensional settings. The higher-order inverse regression methods, which form a major family of SDR methods that are superior to SIR in the population level, suffer from the dimensionality of their intermediate matrix-valued parameters that have an excessive number of columns. In this paper, we propose the generic idea of using a small subset of columns of the matrix-valued parameter for SDR estimation, which breaks the convention of using the ambient matrix for the higher-order inverse regression methods. With the aid of a quick column selection procedure, we then generalize these methods as well as their ensembles towards sparsity under the ultrahigh-dimensional settings, in a uniform manner that resembles sparse SIR and without additional assumptions. This is the first promising attempt in the literature to free the higher-order inverse regression methods from their dimensionality, which facilitates the applicability of SDR. The gain of column selection with respect to SDR estimation efficiency is also studied under the fixed-dimensional settings. Simulation studies and a real data example are provided at the end."}, "https://arxiv.org/abs/2404.08331": {"title": "A Balanced Statistical Boosting Approach for GAMLSS via New Step Lengths", "link": "https://arxiv.org/abs/2404.08331", "description": "arXiv:2404.08331v1 Announce Type: new \nAbstract: Component-wise gradient boosting algorithms are popular for their intrinsic variable selection and implicit regularization, which can be especially beneficial for very flexible model classes. 
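As background for the step-length discussion that follows, here is a minimal sketch of generic component-wise L2 gradient boosting with univariate linear base-learners and a small fixed step length, illustrating the intrinsic variable selection just mentioned; it is not the GAMLSS variant or the adaptive step lengths proposed in the paper.

```python
# Minimal sketch of component-wise L2 gradient boosting: in each iteration only
# the best-fitting univariate base-learner is updated, which yields intrinsic
# variable selection, and the small fixed step length nu gives implicit
# regularization. Data and settings are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, p, nu, n_iter = 300, 10, 0.1, 250
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=n)  # only x0, x3 matter

coef = np.zeros(p)
offset = y.mean()
resid = y - offset                       # negative gradient of the squared-error loss
for _ in range(n_iter):
    # Fit each univariate least-squares base-learner to the current residuals.
    betas = X.T @ resid / np.sum(X ** 2, axis=0)
    rss = np.sum((resid[:, None] - X * betas) ** 2, axis=0)
    j = np.argmin(rss)                   # component-wise selection step
    coef[j] += nu * betas[j]             # update only the selected component
    resid = y - offset - X @ coef
print(np.round(coef, 2))                 # mostly zeros except coefficients 0 and 3
```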
When estimating generalized additive models for location, scale and shape (GAMLSS) by means of a component-wise gradient boosting algorithm, an important part of the estimation procedure is to determine the relative complexity of the submodels corresponding to the different distribution parameters. Existing methods either suffer from a computationally expensive tuning procedure or can be biased by structural differences in the negative gradients' sizes, which, if encountered, lead to imbalances between the different submodels. To address this issue, shrunk optimal step lengths have been suggested to replace the typical small fixed step lengths for a non-cyclical boosting algorithm limited to a Gaussian response variable. In this article, we propose a new adaptive step length approach that accounts for the relative size of the fitted base-learners to ensure a natural balance between the different submodels. The new balanced boosting approach thus represents a computationally efficient and easily generalizable alternative to shrunk optimal step lengths. We implement the balanced non-cyclical boosting algorithm for Gaussian, negative binomial and Weibull distributed response variables and demonstrate the competitive performance of the new adaptive step length approach by means of a simulation study, in an analysis of count data on the number of doctor's visits, as well as for survival data from an oncological trial."}, "https://arxiv.org/abs/2404.08352": {"title": "Comment on 'Exact-corrected confidence interval for risk difference in noninferiority binomial trials'", "link": "https://arxiv.org/abs/2404.08352", "description": "arXiv:2404.08352v1 Announce Type: new \nAbstract: The article by Hawila & Berg (2023) commented on here presents four relevant problems, apart from other less important ones that are also noted. First, the title is incorrect, since it leads readers to believe that the confidence interval defined is exact when in fact it is asymptotic. Second, contrary to what is assumed by the authors of the article, the statistic that they define is not monotonic in delta. But it is fundamental that this property holds, as the authors themselves recognize. Third, the inferences provided by the proposed confidence interval may be incoherent, which could lead the scientific community to reach incorrect conclusions in any practical application. For example, for fixed data it might happen that a certain delta value is within the 90%-confidence interval, but outside the 95%-confidence interval. Fourth, the authors do not validate their statistic through a simulation with diverse (and credible) values of the parameters involved. In fact, one of their two examples uses an alpha error of 70%!"}, "https://arxiv.org/abs/2404.08365": {"title": "Estimation and Inference for Three-Dimensional Panel Data Models", "link": "https://arxiv.org/abs/2404.08365", "description": "arXiv:2404.08365v1 Announce Type: new \nAbstract: Hierarchical panel data models have recently garnered significant attention. This study contributes to the relevant literature by introducing a novel three-dimensional (3D) hierarchical panel data model, which integrates panel regression with three sets of latent factor structures: one set of global factors and two sets of local factors. 
Instead of aggregating latent factors from various nodes, as seen in the literature on distributed principal component analysis (PCA), we propose an estimation approach capable of recovering the parameters of interest and disentangling latent factors at different levels and across different dimensions. We establish an asymptotic theory and provide a bootstrap procedure to obtain inference for the parameters of interest while accommodating various types of cross-sectional dependence and time series autocorrelation. Finally, we demonstrate the applicability of our framework by examining productivity convergence in manufacturing industries worldwide."}, "https://arxiv.org/abs/2404.08426": {"title": "confintROB Package: Confidence Intervals in robust linear mixed models", "link": "https://arxiv.org/abs/2404.08426", "description": "arXiv:2404.08426v1 Announce Type: new \nAbstract: Statistical inference is a major scientific endeavor for many researchers. In terms of inferential methods implemented for mixed-effects models, significant progress has been made in the R software. However, these advances primarily concern classical estimators (ML, REML) and mainly focus on fixed effects. In the confintROB package, we have implemented various bootstrap methods for computing confidence intervals (CIs) not only for fixed effects but also for variance components. These methods can be implemented with the widely used lmer function from the lme4 package, as well as with the rlmer function from the robustlmm package and the varComprob function from the robustvarComp package. These functions implement robust estimation methods suitable for data with outliers. The confintROB package implements the Wald method for fixed effects, whereas for both fixed effects and variance components, two bootstrap methods are implemented: the parametric bootstrap and the wild bootstrap. Moreover, the confintROB package can obtain both the percentile and the bias-corrected accelerated versions of CIs."}, "https://arxiv.org/abs/2404.08457": {"title": "A Latent Factor Model for High-Dimensional Binary Data", "link": "https://arxiv.org/abs/2404.08457", "description": "arXiv:2404.08457v1 Announce Type: new \nAbstract: In this study, we develop a latent factor model for analysing high-dimensional binary data. Specifically, a standard probit model is used to describe the regression relationship between the observed binary data and the continuous latent variables. Our method assumes that the dependency structure of the observed binary data can be fully captured by the continuous latent factors. To estimate the model, a moment-based estimation method is developed. The proposed method is able to deal with both discontinuity and high dimensionality. Most importantly, the asymptotic properties of the resulting estimators are rigorously established. Extensive simulation studies are presented to demonstrate the proposed methodology. A real dataset about product descriptions is analysed for illustration."}, "https://arxiv.org/abs/2404.08129": {"title": "One Factor to Bind the Cross-Section of Returns", "link": "https://arxiv.org/abs/2404.08129", "description": "arXiv:2404.08129v1 Announce Type: cross \nAbstract: We propose a new non-linear single-factor asset pricing model $r_{it}=h(f_{t}\\lambda_{i})+\\epsilon_{it}$. Despite its parsimony, this model represents exactly any non-linear model with an arbitrary number of factors and loadings -- a consequence of the Kolmogorov-Arnold representation theorem. 
It features only one pricing component $h(f_{t}\\lambda_{i})$, comprising a nonparametric link function of the time-dependent factor and factor loading that we jointly estimate with sieve-based estimators. Using 171 assets across major classes, our model delivers superior cross-sectional performance with a low-dimensional approximation of the link function. Most known finance and macro factors become insignificant after controlling for our single factor."}, "https://arxiv.org/abs/2404.08208": {"title": "Shifting the Paradigm: Estimating Heterogeneous Treatment Effects in the Development of Walkable Cities Design", "link": "https://arxiv.org/abs/2404.08208", "description": "arXiv:2404.08208v1 Announce Type: cross \nAbstract: The transformation of urban environments to accommodate growing populations has profoundly impacted public health and well-being. This paper addresses the critical challenge of estimating the impact of urban design interventions on diverse populations. Traditional approaches, reliant on questionnaires and stated preference techniques, are limited by recall bias and by difficulty in capturing the complex dynamics between environmental attributes and individual characteristics. To address these challenges, we integrate Virtual Reality (VR) with observational causal inference methods to estimate heterogeneous treatment effects, specifically employing Targeted Maximum Likelihood Estimation (TMLE) for its robustness against model misspecification. Our innovative approach leverages a VR-based experiment to collect data that reflects perceptual and experiential factors. The results show the heterogeneous impacts of urban design elements on public health and underscore the necessity for personalized urban design interventions. This study not only extends the application of TMLE to built environment research but also informs public health policy by illuminating the nuanced effects of urban design on mental well-being and advocating for tailored strategies that foster equitable, health-promoting urban spaces."}, "https://arxiv.org/abs/2206.14668": {"title": "Score Matching for Truncated Density Estimation on a Manifold", "link": "https://arxiv.org/abs/2206.14668", "description": "arXiv:2206.14668v2 Announce Type: replace \nAbstract: When observations are truncated, we are limited to an incomplete picture of our dataset. Recent methods propose to use score matching for truncated density estimation, where the access to the intractable normalising constant is not required. We present a novel extension of truncated score matching to a Riemannian manifold with boundary. Applications are presented for the von Mises-Fisher and Kent distributions on a two-dimensional sphere in $\\mathbb{R}^3$, as well as a real-world application to extreme storm observations in the USA. In simulated data experiments, our score matching estimator is able to approximate the true parameter values with a low estimation error and shows improvements over a naive maximum likelihood estimator."}, "https://arxiv.org/abs/2305.12643": {"title": "A two-way heterogeneity model for dynamic networks", "link": "https://arxiv.org/abs/2305.12643", "description": "arXiv:2305.12643v2 Announce Type: replace \nAbstract: Dynamic network data analysis requires jointly modelling individual snapshots and time dynamics. This paper proposes a new two-way heterogeneity model towards this goal. 
The new model equips each node of the network with two heterogeneity parameters, one to characterize the propensity of forming ties with other nodes and the other to differentiate the tendency of retaining existing ties over time. Though the negative log-likelihood function is non-convex, it is locally convex in a neighbourhood of the true value of the parameter vector. By using a novel method of moments estimator as the initial value, the consistent local maximum likelihood estimator (MLE) can be obtained by a gradient descent algorithm. To establish the upper bound for the estimation error of the MLE, we derive a new uniform deviation bound, which is of independent interest. The usefulness of the model and the associated theory are further supported by extensive simulation and the analysis of some real network data sets."}, "https://arxiv.org/abs/2312.06334": {"title": "Model validation for aggregate inferences in out-of-sample prediction", "link": "https://arxiv.org/abs/2312.06334", "description": "arXiv:2312.06334v3 Announce Type: replace \nAbstract: Generalization to new samples is a fundamental rationale for statistical modeling. For this purpose, model validation is particularly important, but recent work in survey inference has suggested that simple aggregation of individual prediction scores does not give a good measure of the score for population aggregate estimates. In this manuscript we explain why this occurs, propose two scoring metrics designed specifically for this problem, and demonstrate their use in three different ways. We show that these scoring metrics correctly order models when compared to the true score, although they do underestimate the magnitude of the score. We demonstrate with a problem in survey research, where multilevel regression and poststratification (MRP) has been used extensively to adjust convenience and low-response surveys to make population and subpopulation estimates."}, "https://arxiv.org/abs/2401.07152": {"title": "Inference for Synthetic Controls via Refined Placebo Tests", "link": "https://arxiv.org/abs/2401.07152", "description": "arXiv:2401.07152v2 Announce Type: replace \nAbstract: The synthetic control method is often applied to problems with one treated unit and a small number of control units. A common inferential task in this setting is to test null hypotheses regarding the average treatment effect on the treated. Inference procedures that are justified asymptotically are often unsatisfactory due to (1) small sample sizes that render large-sample approximation fragile and (2) simplification of the estimation procedure that is implemented in practice. An alternative is permutation inference, which is related to a common diagnostic called the placebo test. It has provable Type-I error guarantees in finite samples without simplification of the method, when the treatment is uniformly assigned. Despite this robustness, the placebo test suffers from low resolution since the null distribution is constructed from only $N$ reference estimates, where $N$ is the sample size. This creates a barrier for statistical inference at a common level like $\\alpha = 0.05$, especially when $N$ is small. We propose a novel leave-two-out procedure that bypasses this issue, while still maintaining the same finite-sample Type-I error guarantee under uniform assignment for a wide range of $N$. 
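As background for the classic placebo test discussed in the synthetic control abstract above, the following minimal Python sketch illustrates permutation-style placebo inference and its 1/N resolution limit. It is an editorial illustration only, not the authors' leave-two-out refinement; the function name and example data are hypothetical.

import numpy as np

def placebo_p_value(effect_estimates, treated_index):
    # Classic placebo (permutation) test for a synthetic control study.
    # effect_estimates: one estimated effect per unit, where each control
    # unit's "effect" comes from re-running the synthetic control procedure
    # as if that unit had been treated (the placebo runs).
    # Returns the share of units whose placebo effect is at least as extreme
    # as the treated unit's estimate; with N units the smallest attainable
    # p-value is 1/N, which is the resolution limit noted in the abstract.
    effects = np.asarray(effect_estimates, dtype=float)
    treated = abs(effects[treated_index])
    return np.mean(np.abs(effects) >= treated)

# Hypothetical example with N = 20 units: the p-value can never fall below 1/20 = 0.05.
rng = np.random.default_rng(0)
placebo_effects = rng.normal(0.0, 1.0, size=20)
placebo_effects[0] = 4.0  # pretend unit 0 is the treated unit with a large effect
print(placebo_p_value(placebo_effects, treated_index=0))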
Unlike the placebo test whose Type-I error always equals the theoretical upper bound, our procedure often achieves a lower unconditional Type-I error than theory suggests; this enables useful inference in the challenging regime when $\\alpha < 1/N$. Empirically, our procedure achieves a higher power when the effect size is reasonably large and a comparable power otherwise. We generalize our procedure to non-uniform assignments and show how to conduct sensitivity analysis. From a methodological perspective, our procedure can be viewed as a new type of randomization inference different from permutation or rank-based inference, which is particularly effective in small samples."}, "https://arxiv.org/abs/2301.04512": {"title": "Partial Conditioning for Inference of Many-Normal-Means with H\\\"older Constraints", "link": "https://arxiv.org/abs/2301.04512", "description": "arXiv:2301.04512v2 Announce Type: replace-cross \nAbstract: Inferential models have been proposed for valid and efficient prior-free probabilistic inference. As it gradually gained popularity, this theory is subject to further developments for practically challenging problems. This paper considers the many-normal-means problem with the means constrained to be in the neighborhood of each other, formally represented by a H\\\"older space. A new method, called partial conditioning, is proposed to generate valid and efficient marginal inference about the individual means. It is shown that the method outperforms both a fiducial-counterpart in terms of validity and a conservative-counterpart in terms of efficiency. We conclude the paper by remarking that a general theory of partial conditioning for inferential models deserves future development."}, "https://arxiv.org/abs/2306.00602": {"title": "Approximate Stein Classes for Truncated Density Estimation", "link": "https://arxiv.org/abs/2306.00602", "description": "arXiv:2306.00602v2 Announce Type: replace-cross \nAbstract: Estimating truncated density models is difficult, as these models have intractable normalising constants and hard to satisfy boundary conditions. Score matching can be adapted to solve the truncated density estimation problem, but requires a continuous weighting function which takes zero at the boundary and is positive elsewhere. Evaluation of such a weighting function (and its gradient) often requires a closed-form expression of the truncation boundary and finding a solution to a complicated optimisation problem. In this paper, we propose approximate Stein classes, which in turn leads to a relaxed Stein identity for truncated density estimation. We develop a novel discrepancy measure, truncated kernelised Stein discrepancy (TKSD), which does not require fixing a weighting function in advance, and can be evaluated using only samples on the boundary. We estimate a truncated density model by minimising the Lagrangian dual of TKSD. Finally, experiments show the accuracy of our method to be an improvement over previous works even without the explicit functional form of the boundary."}, "https://arxiv.org/abs/2404.08839": {"title": "Multiply-Robust Causal Change Attribution", "link": "https://arxiv.org/abs/2404.08839", "description": "arXiv:2404.08839v1 Announce Type: new \nAbstract: Comparing two samples of data, we observe a change in the distribution of an outcome variable. In the presence of multiple explanatory variables, how much of the change can be explained by each possible cause? 
We develop a new estimation strategy that, given a causal model, combines regression and re-weighting methods to quantify the contribution of each causal mechanism. Our proposed methodology is multiply robust, meaning that it still recovers the target parameter under partial misspecification. We prove that our estimator is consistent and asymptotically normal. Moreover, it can be incorporated into existing frameworks for causal attribution, such as Shapley values, which will inherit the consistency and large-sample distribution properties. Our method demonstrates excellent performance in Monte Carlo simulations, and we show its usefulness in an empirical application."}, "https://arxiv.org/abs/2404.08883": {"title": "Projection matrices and the sweep operator", "link": "https://arxiv.org/abs/2404.08883", "description": "arXiv:2404.08883v1 Announce Type: new \nAbstract: These notes have been adapted from an undergraduate course given by Professor Alan James at the University of Adelaide from around 1965 onwards. This adaptation puts the focus on the definition of projection matrices and the sweep operator. These devices were at the heart of the development of the statistical package Genstat, which initially focussed on the analysis of variance using the sweep operator. The notes provide an algebraic background to the sweep operator, which has since been used to good effect in a number of experimental design settings."}, "https://arxiv.org/abs/2404.09117": {"title": "Identifying Causal Effects under Kink Setting: Theory and Evidence", "link": "https://arxiv.org/abs/2404.09117", "description": "arXiv:2404.09117v1 Announce Type: new \nAbstract: This paper develops a generalized framework for identifying causal impacts in a reduced-form manner under kinked settings when agents can manipulate their choices around the threshold. The causal estimation using a bunching framework was initially developed by Diamond and Persson (2017) under notched settings. Many empirical applications of bunching designs involve kinked settings. We propose a model-free causal estimator in kinked settings with sharp bunching and then extend to the scenarios with diffuse bunching, misreporting, optimization frictions, and heterogeneity. The estimation method is mostly non-parametric and accounts for the interior response under kinked settings. Applying the proposed approach, we estimate how medical subsidies affect outpatient behaviors in China."}, "https://arxiv.org/abs/2404.09119": {"title": "Causal Inference for Genomic Data with Multiple Heterogeneous Outcomes", "link": "https://arxiv.org/abs/2404.09119", "description": "arXiv:2404.09119v1 Announce Type: new \nAbstract: With the evolution of single-cell RNA sequencing techniques into a standard approach in genomics, it has become possible to conduct cohort-level causal inferences based on single-cell-level measurements. However, the individual gene expression levels of interest are not directly observable; instead, only repeated proxy measurements from each individual's cells are available, providing a derived outcome to estimate the underlying outcome for each of many genes. In this paper, we propose a generic semiparametric inference framework for doubly robust estimation with multiple derived outcomes, which also encompasses the usual setting of multiple outcomes when the response of each unit is available. 
To reliably quantify the causal effects of heterogeneous outcomes, we specialize the analysis to the standardized average treatment effects and the quantile treatment effects. Through this, we demonstrate the use of the semiparametric inferential results for doubly robust estimators derived from both Von Mises expansions and estimating equations. A multiple testing procedure based on the Gaussian multiplier bootstrap is tailored for doubly robust estimators to control the false discovery exceedance rate. Applications in single-cell CRISPR perturbation analysis and individual-level differential expression analysis demonstrate the utility of the proposed methods and offer insights into the usage of different estimands for causal inference in genomics."}, "https://arxiv.org/abs/2404.09126": {"title": "Treatment Effect Heterogeneity and Importance Measures for Multivariate Continuous Treatments", "link": "https://arxiv.org/abs/2404.09126", "description": "arXiv:2404.09126v1 Announce Type: new \nAbstract: Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. We use nonparametric Bayesian methodology that is flexible to ensure our approach can capture a wide range of data generating processes. Additionally, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures, and therefore we introduce novel estimands that summarize the nature and extent of the heterogeneity and propose corresponding estimation procedures. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that the approach works well in simulated examples both with and without heterogeneity. We apply our approach to a study of the health effects of simultaneous exposure to the components of PM$_{2.5}$ and find that the negative health effects of exposure to these environmental pollutants are exacerbated by low socioeconomic status and age."}, "https://arxiv.org/abs/2404.09154": {"title": "Extreme quantile regression with deep learning", "link": "https://arxiv.org/abs/2404.09154", "description": "arXiv:2404.09154v1 Announce Type: new \nAbstract: Estimation of extreme conditional quantiles is often required for risk assessment of natural hazards in climate and geo-environmental sciences and for quantitative risk management in statistical finance, econometrics, and actuarial sciences. Interest often lies in extrapolating to quantile levels that exceed any past observations. Therefore, it is crucial to use a statistical framework that is well-adapted and especially designed for this purpose, and here extreme-value theory plays a key role. This chapter reviews how extreme quantile regression may be performed using theoretically-justified models, and how modern deep learning approaches can be harnessed in this context to enhance the model's performance in complex high-dimensional settings. 
The power of deep learning combined with the rigor of theoretically-justified extreme-value methods opens the door to efficient extreme quantile regression, in cases where both the number of covariates and the quantile level of interest can be simultaneously ``extreme''."}, "https://arxiv.org/abs/2404.09194": {"title": "Bayesian modeling of co-occurrence microbial interaction networks", "link": "https://arxiv.org/abs/2404.09194", "description": "arXiv:2404.09194v1 Announce Type: new \nAbstract: The human body consists of microbiomes associated with the development and prevention of several diseases. These microbial organisms form several complex interactions that are informative to the scientific community for explaining disease progression and prevention. Contrary to the traditional view of the microbiome as a singular, assortative network, we introduce a novel statistical approach using a weighted stochastic infinite block model to analyze the complex community structures within co-occurrence microbial interaction networks. Our model defines connections between microbial taxa using a novel semi-parametric rank-based correlation method on their transformed relative abundances within a fully connected network framework. Employing a Bayesian nonparametric approach, the proposed model effectively clusters taxa into distinct communities while estimating the number of communities. The posterior summary of the taxa community membership is obtained based on the posterior probability matrix, which naturally solves the label switching problem. Through simulation studies and a real-world application to microbiome data from postmenopausal patients with recurrent urinary tract infections, we demonstrate that our method has superior clustering accuracy over alternative approaches. This advancement provides a more nuanced understanding of microbiome organization, with significant implications for disease research."}, "https://arxiv.org/abs/2404.09309": {"title": "Julia as a universal platform for statistical software development", "link": "https://arxiv.org/abs/2404.09309", "description": "arXiv:2404.09309v1 Announce Type: new \nAbstract: Like Python and Java, which are integrated into Stata, Julia is a free programming language that runs on all major operating systems. The julia package links Stata to Julia as well. Users can transfer data between Stata and Julia at high speed, issue Julia commands from Stata to analyze and plot, and pass results back to Stata. Julia's econometric software ecosystem is not as mature as Stata's or R's, or even Python's. But Julia is an excellent environment for developing high-performance numerical applications, which can then be called from many platforms. The boottest program for wild bootstrap-based inference (Roodman et al. 2019) can call a Julia back end for a 33-50% speed-up, even as the R package fwildclusterboot (Fischer and Roodman 2021) uses the same back end for inference after instrumental variables estimation. reghdfejl mimics reghdfe (Correia 2016) in fitting linear models with high-dimensional fixed effects but calls an independently developed Julia package for tenfold acceleration on hard problems. 
reghdfejl also supports nonlinear models--preliminarily, as the Julia package for that purpose matures."}, "https://arxiv.org/abs/2404.09353": {"title": "A Unified Combination Framework for Dependent Tests with Applications to Microbiome Association Studies", "link": "https://arxiv.org/abs/2404.09353", "description": "arXiv:2404.09353v1 Announce Type: new \nAbstract: We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various microbiome association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating $p$-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components may lead to a severe size distortion phenomenon. Compared to the existing $p$-value combination methods, including the vanilla Cauchy combination method, the proposed combination framework can handle the dependence accurately and utilizes the information efficiently to construct tests with accurate size and enhanced power. The development is applied to Microbiome Association Studies, where we aggregate information from multiple existing tests using the same dataset. The combined tests harness the strengths of each individual test across a wide range of alternative spaces, enabling more efficient and meaningful discoveries of vital microbiome associations."}, "https://arxiv.org/abs/2404.09358": {"title": "Two-stage Spatial Regression Models for Spatial Confounding", "link": "https://arxiv.org/abs/2404.09358", "description": "arXiv:2404.09358v1 Announce Type: new \nAbstract: Public health data are often spatially dependent, but standard spatial regression methods can suffer from bias and invalid inference when the independent variable is associated with spatially-correlated residuals. This could occur if, for example, there is an unmeasured environmental contaminant. Geoadditive structural equation modeling (gSEM), in which an estimated spatial trend is removed from both the explanatory and response variables before estimating the parameters of interest, has previously been proposed as a solution, but there has been little investigation of gSEM's properties with point-referenced data. We link gSEM to results on double machine learning and semiparametric regression based on two-stage procedures. We propose using these semiparametric estimators for spatial regression using Gaussian processes with Mat\\`ern covariance to estimate the spatial trends, and term this class of estimators Double Spatial Regression (DSR). 
We derive regularity conditions for root-$n$ consistency and asymptotic normality along with closed-form variance estimation, and show that in simulations where standard spatial regression estimators are highly biased and have poor coverage, DSR can mitigate bias more effectively than competitors and obtain nominal coverage."}, "https://arxiv.org/abs/2404.09362": {"title": "A Bayesian Joint Modelling for Misclassified Interval-censoring and Competing Risks", "link": "https://arxiv.org/abs/2404.09362", "description": "arXiv:2404.09362v1 Announce Type: new \nAbstract: In active surveillance of prostate cancer, cancer progression is interval-censored and the examination to detect progression is subject to misclassification, usually false negatives. Meanwhile, patients may initiate early treatment before progression detection, constituting a competing risk. We developed the Misclassification-Corrected Interval-censored Cause-specific Joint Model (MCICJM) to estimate the association between longitudinal biomarkers and cancer progression in this setting. The sensitivity of the examination is considered in the likelihood of this model via a parameter that may be set to a specific value if the sensitivity is known, or for which a prior distribution can be specified if the sensitivity is unknown. Our simulation results show that misspecification of the sensitivity parameter or ignoring it entirely impacts the model parameters, especially the parameter uncertainty and the baseline hazards. Moreover, specification of a prior distribution for the sensitivity parameter may reduce the risk of misspecification in settings where the exact sensitivity is unknown, but may cause identifiability issues. Thus, imposing restrictions on the baseline hazards is recommended. A trade-off needs to be made between modelling with a constant sensitivity, at the risk of misspecification, and with a sensitivity prior, at the cost of flexibility."}, "https://arxiv.org/abs/2404.09414": {"title": "General Bayesian inference for causal effects using covariate balancing procedure", "link": "https://arxiv.org/abs/2404.09414", "description": "arXiv:2404.09414v1 Announce Type: new \nAbstract: In observational studies, the propensity score plays a central role in estimating causal effects of interest. The inverse probability weighting (IPW) estimator is particularly widely used. However, if the propensity score model is misspecified, the IPW estimator may produce biased estimates of causal effects. Previous studies have proposed some robust propensity score estimation procedures; these methods, however, require consideration of parameters that dominate the uncertainty of sampling and treatment allocation. In this manuscript, we propose a novel Bayesian estimating procedure that necessitates deciding these parameters probabilistically, rather than deterministically. Since both the IPW estimator and the propensity score estimator can be derived as solutions to certain loss functions, the general Bayesian paradigm, which does not require the consideration of the full likelihood, can be applied. 
In this sense, our proposed method only requires the same level of assumptions as in ordinary causal inference contexts."}, "https://arxiv.org/abs/2404.09467": {"title": "The Role of Carbon Pricing in Food Inflation: Evidence from Canadian Provinces", "link": "https://arxiv.org/abs/2404.09467", "description": "arXiv:2404.09467v1 Announce Type: new \nAbstract: Carbon pricing, including carbon tax and cap-and-trade, is usually seen as an effective policy tool for mitigating emissions. Although such policies are often perceived as worsening affordability issues, earlier studies find insignificant or deflationary effects of carbon pricing. We verify this result for the food sector by using provincial-level data on food CPI from Canada. By using a staggered difference-in-difference (DiD) approach, we show that the deflationary effects of carbon pricing on food do exist. Additionally, such effects are weak at first and grow stronger after two years of implementation. However, the overall magnitudes are too small for carbon pricing to be blamed for the current high inflation. Our subsequent analysis suggests a reduction in consumption is likely to be the cause of deflation. By contrast, carbon pricing has little impact on farm production costs owing to the special treatment farmers receive within carbon pricing systems. This paper decomposes the long-term influence Canadian carbon pricing has on food affordability and its possible mechanisms."}, "https://arxiv.org/abs/2404.09528": {"title": "Overfitting Reduction in Convex Regression", "link": "https://arxiv.org/abs/2404.09528", "description": "arXiv:2404.09528v1 Announce Type: new \nAbstract: Convex regression is a method for estimating an unknown function $f_0$ from a data set of $n$ noisy observations when $f_0$ is known to be convex. This method has played an important role in operations research, economics, machine learning, and many other areas. It has been empirically observed that the convex regression estimator produces inconsistent estimates of $f_0$ and extremely large subgradients near the boundary of the domain of $f_0$ as $n$ increases. In this paper, we provide theoretical evidence of this overfitting behaviour. We also prove that the penalised convex regression estimator, one of the variants of the convex regression estimator, exhibits overfitting behaviour. To eliminate this behaviour, we propose two new estimators by placing a bound on the subgradients of the estimated function. We further show that our proposed estimators do not exhibit the overfitting behaviour by proving that (a) they converge to $f_0$ and (b) their subgradients converge to the gradient of $f_0$, both uniformly over the domain of $f_0$ with probability one as $n \\rightarrow \\infty$. We apply the proposed methods to compute the cost frontier function for Finnish electricity distribution firms and confirm their superior performance in predictive power over some existing methods."}, "https://arxiv.org/abs/2404.09716": {"title": "Optimal Cut-Point Estimation for functional digital biomarkers: Application to Continuous Glucose Monitoring", "link": "https://arxiv.org/abs/2404.09716", "description": "arXiv:2404.09716v1 Announce Type: new \nAbstract: Establishing optimal cut points plays a crucial role in epidemiology and biomarker discovery, enabling the development of effective and practical clinical decision criteria. 
While there is an extensive literature on defining optimal cut-offs for scalar biomarkers, there is a notable lack of general methodologies for analyzing statistical objects in more complex spaces of functions and graphs, which are increasingly relevant in digital health applications. This paper proposes a new general methodology to define optimal cut points for random objects in separable Hilbert spaces. The paper is motivated by the need to create new clinical rules for diabetes mellitus, exploiting the functional information of a continuous glucose monitor (CGM) as a digital biomarker. More specifically, we provide a functional cut-off to identify diabetes cases from CGM information, based on distributional functional representations of glucose."}, "https://arxiv.org/abs/2404.09823": {"title": "Biclustering bipartite networks via extended Mixture of Latent Trait Analyzers", "link": "https://arxiv.org/abs/2404.09823", "description": "arXiv:2404.09823v1 Announce Type: new \nAbstract: In the context of network data, bipartite networks are of particular interest, as they provide a useful description of systems representing relationships between sending and receiving nodes. In this framework, we extend the Mixture of Latent Trait Analyzers (MLTA) to perform a joint clustering of sending and receiving nodes, as in the biclustering framework. In detail, sending nodes are partitioned into clusters (called components) via a finite mixture of latent trait models. In each component, receiving nodes are partitioned into clusters (called segments) by adopting a flexible and parsimonious specification of the linear predictor. Dependence between receiving nodes is modeled via a multidimensional latent trait, as in the original MLTA specification. The proposal also allows for the inclusion of concomitant variables in the latent layer of the model, with the aim of understanding how they influence component formation. To estimate model parameters, an EM-type algorithm based on a Gauss-Hermite approximation of intractable integrals is proposed. A simulation study is conducted to test the performance of the model in terms of clustering and parameter recovery. The proposed model is applied to a bipartite network on pediatric patients possibly affected by appendicitis with the objective of identifying groups of patients (sending nodes) that are similar with respect to subsets of clinical conditions (receiving nodes)."}, "https://arxiv.org/abs/2404.09863": {"title": "sfislands: An R Package for Accommodating Islands and Disjoint Zones in Areal Spatial Modelling", "link": "https://arxiv.org/abs/2404.09863", "description": "arXiv:2404.09863v1 Announce Type: new \nAbstract: Fitting areal models which use a spatial weights matrix to represent relationships between geographical units can be a cumbersome task, particularly when these units are not well-behaved. The two chief aims of sfislands are to simplify the process of creating an appropriate neighbourhood matrix, and to quickly visualise the predictions of subsequent models. The package uses visual aids in the form of easily-generated maps to help this process. This paper demonstrates how sfislands could be useful to researchers. It begins by describing the package's functions in the context of a proposed workflow. It then presents three worked examples showing a selection of potential use-cases. These range from earthquakes in Indonesia to river crossings in London and hierarchical models of output areas in Liverpool. 
We aim to show how the sfislands package streamlines much of the human workflow involved in creating and examining such models."}, "https://arxiv.org/abs/2404.09882": {"title": "A spatio-temporal model to detect potential outliers in disease mapping", "link": "https://arxiv.org/abs/2404.09882", "description": "arXiv:2404.09882v1 Announce Type: new \nAbstract: Spatio-temporal disease mapping models are commonly used to estimate the relative risk of a disease over time and across areas. For each area and time point, the disease count is modelled with a Poisson distribution whose mean is the product of an offset and the disease relative risk. This relative risk is commonly decomposed on the log scale as the sum of fixed and latent effects. The Rushworth model allows for spatio-temporal autocorrelation of the random effects. We build on the Rushworth model to accommodate and identify potentially outlying areas with respect to their disease relative risk evolution, after taking into account the fixed effects. An area may display outlying behaviour at some points in time but not all. At each time point, we assume the latent effects to be spatially structured and include scaling parameters in the precision matrix, to allow for heavy tails. Two prior specifications are considered for the scaling parameters: one where they are independent across space and one with spatial autocorrelation. We investigate the performance of the different prior specifications of the proposed model through simulation studies and analyse the weekly evolution of the number of COVID-19 cases across the 33 boroughs of Montreal and the 96 French departments during the second wave. In Montreal, 6 boroughs are found to be potentially outlying. In France, the model with spatially structured scaling parameters identified 21 departments as potential outliers. We find that these departments tend to be close to each other and within common French regions."}, "https://arxiv.org/abs/2404.09960": {"title": "Pseudo P-values for Assessing Covariate Balance in a Finite Study Population with Application to the California Sugar Sweetened Beverage Tax Study", "link": "https://arxiv.org/abs/2404.09960", "description": "arXiv:2404.09960v1 Announce Type: new \nAbstract: Assessing covariate balance (CB) is a common practice in various types of evaluation studies. Two-sample descriptive statistics, such as the standardized mean difference, have been widely applied in the scientific literature to assess the goodness of CB. Studies in health policy, health services research, built and social environment research, and many other fields often involve a finite number of units that may be subject to different treatment levels. Our case study, the California Sugar Sweetened Beverage (SSB) Tax Study, includes 332 study cities in the state of California, among which individual cities may elect to levy a city-wide excise tax on SSB sales. Evaluating the balance of covariates between study cities with and without the tax policy is essential for assessing the effects of the policy on health outcomes of interest. In this paper, we introduce the novel concepts of the pseudo p-value and the standardized pseudo p-value, which are descriptive statistics to assess the overall goodness of CB between study arms in a finite study population. While not meant as a hypothesis test, the pseudo p-values bear superficial similarity to the classic p-value, which makes them easy to apply and interpret in applications. 
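For context on the covariate-balance statistics mentioned in the preceding abstract, here is a minimal Python sketch of the standardized mean difference, the common two-sample descriptive statistic the abstract cites. It is an editorial illustration with hypothetical example data; it is not the proposed pseudo p-value.

import numpy as np

def standardized_mean_difference(x_treated, x_control):
    # Standardized mean difference for one covariate: the difference in group
    # means divided by the pooled standard deviation. Values near zero indicate
    # good balance; |SMD| > 0.1 is a common rule-of-thumb flag.
    x_t = np.asarray(x_treated, dtype=float)
    x_c = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return (x_t.mean() - x_c.mean()) / pooled_sd

# Hypothetical example: a city-level covariate in taxed vs. untaxed cities.
rng = np.random.default_rng(1)
taxed = rng.normal(50, 10, size=4)       # e.g. a covariate in cities with the tax
untaxed = rng.normal(48, 10, size=328)   # remaining study cities
print(standardized_mean_difference(taxed, untaxed))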
We discuss some theoretical properties of the pseudo p-values and present an algorithm to calculate them. We report a numerical simulation study to demonstrate their performance. We apply the pseudo p-values to the California SSB Tax study to assess the balance of city-level characteristics between the two study arms."}, "https://arxiv.org/abs/2404.09966": {"title": "A fully Bayesian approach for the imputation and analysis of derived outcome variables with missingness", "link": "https://arxiv.org/abs/2404.09966", "description": "arXiv:2404.09966v1 Announce Type: new \nAbstract: Derived variables are variables that are constructed from one or more source variables through established mathematical operations or algorithms. For example, body mass index (BMI) is a derived variable constructed from two source variables: weight and height. When using a derived variable as the outcome in a statistical model, complications arise when some of the source variables have missing values. In this paper, we propose how one can define a single fully Bayesian model to simultaneously impute missing values and sample from the posterior. We compare our proposed method with alternative approaches that rely on multiple imputation, and, with a simulated dataset, consider how best to estimate the risk of microcephaly in newborns exposed to the ZIKA virus."}, "https://arxiv.org/abs/2404.08694": {"title": "Musical Listening Qualia: A Multivariate Approach", "link": "https://arxiv.org/abs/2404.08694", "description": "arXiv:2404.08694v1 Announce Type: cross \nAbstract: French and American participants listened to new music stimuli and evaluated the stimuli using either adjectives or quantitative musical dimensions. Results were analyzed using correspondence analysis (CA), hierarchical cluster analysis (HCA), multiple factor analysis (MFA), and partial least squares correlation (PLSC). French and American listeners differed when they described the musical stimuli using adjectives, but not when using the quantitative dimensions. The present work serves as a case study in research methodology that allows for a balance between relaxing experimental control and maintaining statistical rigor."}, "https://arxiv.org/abs/2404.08816": {"title": "Evaluating the Quality of Answers in Political Q&A Sessions with Large Language Models", "link": "https://arxiv.org/abs/2404.08816", "description": "arXiv:2404.08816v1 Announce Type: cross \nAbstract: This paper presents a new approach to evaluating the quality of answers in political question-and-answer sessions. We propose to measure an answer's quality based on the degree to which it allows us to infer the initial question accurately. This conception of answer quality inherently reflects their relevance to initial questions. Drawing parallels with semantic search, we argue that this measurement approach can be operationalized by fine-tuning a large language model on the observed corpus of questions and answers without additional labeled data. We showcase our measurement approach within the context of the Question Period in the Canadian House of Commons. Our approach yields valuable insights into the correlates of the quality of answers in the Question Period. 
We find that answer quality varies significantly based on the party affiliation of the members of Parliament asking the questions and uncover a meaningful correlation between answer quality and the topics of the questions."}, "https://arxiv.org/abs/2404.09059": {"title": "Prevalence estimation methods for time-dependent antibody kinetics of infected and vaccinated individuals: a graph-theoretic approach", "link": "https://arxiv.org/abs/2404.09059", "description": "arXiv:2404.09059v1 Announce Type: cross \nAbstract: Immune events such as infection, vaccination, and a combination of the two result in distinct time-dependent antibody responses in affected individuals. These responses and event prevalences combine non-trivially to govern antibody levels sampled from a population. Time-dependence and disease prevalence pose considerable modeling challenges that need to be addressed to provide a rigorous mathematical underpinning of the underlying biology. We propose a time-inhomogeneous Markov chain model for event-to-event transitions coupled with a probabilistic framework for antibody kinetics and demonstrate its use in a setting in which individuals can be infected or vaccinated but not both. We prove the equivalency of this approach to the framework developed in our previous work. Synthetic data are used to demonstrate the modeling process and conduct prevalence estimation via transition probability matrices. This approach is ideal for modeling sequences of infections and vaccinations, or personal trajectories in a population, making it an important first step towards a mathematical characterization of reinfection, vaccination boosting, and cross-events of infection after vaccination or vice versa."}, "https://arxiv.org/abs/2404.09142": {"title": "Controlling the False Discovery Rate in Subspace Selection", "link": "https://arxiv.org/abs/2404.09142", "description": "arXiv:2404.09142v1 Announce Type: cross \nAbstract: Controlling the false discovery rate (FDR) is a popular approach to multiple testing, variable selection, and related problems of simultaneous inference. In many contemporary applications, models are not specified by discrete variables, which necessitates a broadening of the scope of the FDR control paradigm. Motivated by the ubiquity of low-rank models for high-dimensional matrices, we present methods for subspace selection in principal components analysis that provide control on a geometric analog of FDR that is adapted to subspace selection. Our methods crucially rely on recently-developed tools from random matrix theory, in particular on a characterization of the limiting behavior of eigenvectors and the gaps between successive eigenvalues of large random matrices. Our procedure is parameter-free, and we show that it provides FDR control in subspace selection for common noise models considered in the literature. We demonstrate the utility of our algorithm with numerical experiments on synthetic data and on problems arising in single-cell RNA sequencing and hyperspectral imaging."}, "https://arxiv.org/abs/2404.09156": {"title": "Statistics of extremes for natural hazards: landslides and earthquakes", "link": "https://arxiv.org/abs/2404.09156", "description": "arXiv:2404.09156v1 Announce Type: cross \nAbstract: In this chapter, we illustrate the use of split bulk-tail models and subasymptotic models motivated by extreme-value theory in the context of hazard assessment for earthquake-induced landslides. 
A spatial joint areal model is presented for modeling both landslides counts and landslide sizes, paying particular attention to extreme landslides, which are the most devastating ones."}, "https://arxiv.org/abs/2404.09157": {"title": "Statistics of Extremes for Neuroscience", "link": "https://arxiv.org/abs/2404.09157", "description": "arXiv:2404.09157v1 Announce Type: cross \nAbstract: This chapter illustrates how tools from univariate and multivariate statistics of extremes can complement classical methods used to study brain signals and enhance the understanding of brain activity and connectivity during specific cognitive tasks or abnormal episodes, such as an epileptic seizure."}, "https://arxiv.org/abs/2404.09629": {"title": "Quantifying fair income distribution in Thailand", "link": "https://arxiv.org/abs/2404.09629", "description": "arXiv:2404.09629v1 Announce Type: cross \nAbstract: Given a vast concern about high income inequality in Thailand as opposed to empirical findings around the world showing people's preference for fair income inequality over unfair income equality, it is therefore important to examine whether inequality in income distribution in Thailand over the past three decades is fair, and what fair inequality in income distribution in Thailand should be. To quantitatively measure fair income distribution, this study employs the fairness benchmarks that are derived from the distributions of athletes' salaries in professional sports which satisfy the concepts of distributive justice and procedural justice, the no-envy principle of fair allocation, and the general consensus or the international norm criterion of a meaningful benchmark. By using the data on quintile income shares and the income Gini index of Thailand from the National Social and Economic Development Council, this study finds that, throughout the period from 1988 to 2021, the Thai income earners in the bottom 20%, the second 20%, and the top 20% receive income shares more than the fair shares whereas those in the third 20% and the fourth 20% receive income shares less than the fair shares. Provided that there are infinite combinations of quintile income shares that can have the same value of income Gini index but only one of them is regarded as fair, this study demonstrates the use of fairness benchmarks as a practical guideline for designing policies with an aim to achieve fair income distribution in Thailand. Moreover, a comparative analysis is conducted by employing the method for estimating optimal (fair) income distribution representing feasible income equality in order to provide an alternative recommendation on what optimal (fair) income distribution characterizing feasible income equality in Thailand should be."}, "https://arxiv.org/abs/2404.09729": {"title": "Amplitude-Phase Fusion for Enhanced Electrocardiogram Morphological Analysis", "link": "https://arxiv.org/abs/2404.09729", "description": "arXiv:2404.09729v1 Announce Type: cross \nAbstract: Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG morphology, to comprehensively describe the fusion of amplitude and phase patterns. 
MEE is computed based on beat-level samples, enabling detailed analysis of each cardiac cycle. Experimental results demonstrate that MEE achieves rapid, accurate, and label-free localization of abnormal ECG arrhythmia regions. Furthermore, MEE provides a method for assessing sample diversity, facilitating compression of imbalanced training sets (via representative sample selection), and outperforms random pruning. Additionally, MEE exhibits the ability to describe areas of poor signal quality. A further discussion demonstrates the robustness of the MEE calculation to noise interference and its low computational complexity. Finally, we integrate this method into a clinical interactive interface to provide a more convenient and intuitive user experience. These findings indicate that MEE serves as a valuable clinical descriptor for ECG characterization. The implementation code can be referenced at the following link: https://github.com/fdu-harry/ECG-MEE-metric."}, "https://arxiv.org/abs/2404.09847": {"title": "Statistical learning for constrained functional parameters in infinite-dimensional models with applications in fair machine learning", "link": "https://arxiv.org/abs/2404.09847", "description": "arXiv:2404.09847v1 Announce Type: cross \nAbstract: Constrained learning has become increasingly important, especially in the realm of algorithmic fairness and machine learning. In these settings, predictive models are developed specifically to satisfy pre-defined notions of fairness. Here, we study the general problem of constrained statistical machine learning through a statistical functional lens. We consider learning a function-valued parameter of interest under the constraint that one or several pre-specified real-valued functional parameters equal zero or are otherwise bounded. We characterize the constrained functional parameter as the minimizer of a penalized risk criterion using a Lagrange multiplier formulation. We show that closed-form solutions for the optimal constrained parameter are often available, providing insight into mechanisms that drive fairness in predictive models. Our results also suggest natural estimators of the constrained parameter that can be constructed by combining estimates of unconstrained parameters of the data generating distribution. Thus, our estimation procedure for constructing fair machine learning algorithms can be applied in conjunction with any statistical learning approach and off-the-shelf software. We demonstrate the generality of our method by explicitly considering a number of examples of statistical fairness constraints and implementing the approach using several popular learning approaches."}, "https://arxiv.org/abs/2404.09962": {"title": "Invariant Subspace Decomposition", "link": "https://arxiv.org/abs/2404.09962", "description": "arXiv:2404.09962v1 Announce Type: cross \nAbstract: We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. In this work, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. 
As we show, this decomposition can be utilized both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure."}, "https://arxiv.org/abs/2111.09254": {"title": "Universal Inference Meets Random Projections: A Scalable Test for Log-concavity", "link": "https://arxiv.org/abs/2111.09254", "description": "arXiv:2111.09254v4 Announce Type: replace \nAbstract: Shape constraints yield flexible middle grounds between fully nonparametric and fully parametric approaches to modeling distributions of data. The specific assumption of log-concavity is motivated by applications across economics, survival modeling, and reliability theory. However, there do not currently exist valid tests for whether the underlying density of given data is log-concave. The recent universal inference methodology provides a valid test. The universal test relies on maximum likelihood estimation (MLE), and efficient methods already exist for finding the log-concave MLE. This yields the first test of log-concavity that is provably valid in finite samples in any dimension, for which we also establish asymptotic consistency results. Empirically, we find that a random projections approach that converts the d-dimensional testing problem into many one-dimensional problems can yield high power, leading to a simple procedure that is statistically and computationally efficient."}, "https://arxiv.org/abs/2202.00618": {"title": "Penalized Estimation of Frailty-Based Illness-Death Models for Semi-Competing Risks", "link": "https://arxiv.org/abs/2202.00618", "description": "arXiv:2202.00618v3 Announce Type: replace \nAbstract: Semi-competing risks refers to the survival analysis setting where the occurrence of a non-terminal event is subject to whether a terminal event has occurred, but not vice versa. Semi-competing risks arise in a broad range of clinical contexts, with a novel example being the pregnancy condition preeclampsia, which can only occur before the `terminal' event of giving birth. Models that acknowledge semi-competing risks enable investigation of relationships between covariates and the joint timing of the outcomes, but methods for model selection and prediction of semi-competing risks in high dimensions are lacking. Instead, researchers commonly analyze only a single or composite outcome, losing valuable information and limiting clinical utility -- in the obstetric setting, this means ignoring valuable insight into timing of delivery after preeclampsia has onset. To address this gap we propose a novel penalized estimation framework for frailty-based illness-death multi-state modeling of semi-competing risks. Our approach combines non-convex and structured fusion penalization, inducing global sparsity as well as parsimony across submodels. We perform estimation and model selection via a pathwise routine for non-convex optimization, and prove the first statistical error bound results in this setting. 
We present a simulation study investigating estimation error and model selection performance, and a comprehensive application of the method to joint risk modeling of preeclampsia and timing of delivery using pregnancy data from an electronic health record."}, "https://arxiv.org/abs/2206.08503": {"title": "Semiparametric Single-Index Estimation for Average Treatment Effects", "link": "https://arxiv.org/abs/2206.08503", "description": "arXiv:2206.08503v3 Announce Type: replace \nAbstract: We propose a semiparametric method to estimate the average treatment effect under the assumption of unconfoundedness given observational data. Our estimation method alleviates misspecification issues of the propensity score function by estimating the single-index link function involved using Hermite polynomials. Our approach is computationally tractable and allows for covariates of moderately large dimension. We provide the large sample properties of the estimator and show its validity. Also, the average treatment effect estimator achieves the parametric rate and asymptotic normality. Our extensive Monte Carlo study shows that the proposed estimator is valid in finite samples. We also provide an empirical analysis on the effect of maternal smoking on babies' birth weight and the effect of a job training program on future earnings."}, "https://arxiv.org/abs/2208.02657": {"title": "Using Instruments for Selection to Adjust for Selection Bias in Mendelian Randomization", "link": "https://arxiv.org/abs/2208.02657", "description": "arXiv:2208.02657v3 Announce Type: replace \nAbstract: Selection bias is a common concern in epidemiologic studies. In the literature, selection bias is often viewed as a missing data problem. Popular approaches to adjust for bias due to missing data, such as inverse probability weighting, rely on the assumption that data are missing at random and can yield biased results if this assumption is violated. In observational studies with outcome data missing not at random, Heckman's sample selection model can be used to adjust for bias due to missing data. In this paper, we review Heckman's method and a similar approach proposed by Tchetgen Tchetgen and Wirth (2017). We then discuss how to apply these methods to Mendelian randomization analyses using individual-level data, with missing data for either the exposure or outcome or both. We explore whether genetic variants associated with participation can be used as instruments for selection. We then describe how to obtain missingness-adjusted Wald ratio, two-stage least squares and inverse variance weighted estimates. The two methods are evaluated and compared in simulations, with results suggesting that they can both mitigate selection bias but may yield parameter estimates with large standard errors in some settings. In an illustrative real-data application, we investigate the effects of body mass index on smoking using data from the Avon Longitudinal Study of Parents and Children."}, "https://arxiv.org/abs/2209.08036": {"title": "mpower: An R Package for Power Analysis of Exposure Mixture Studies via Monte Carlo Simulations", "link": "https://arxiv.org/abs/2209.08036", "description": "arXiv:2209.08036v2 Announce Type: replace \nAbstract: Estimating sample size and statistical power is an essential part of a good study design. This R package allows users to conduct power analysis based on Monte Carlo simulations in settings in which consideration of the correlations between predictors is important. 
It runs power analyses given a data generative model and an inference model. It can set up a data generative model that preserves dependence structures among variables given existing data (continuous, binary, or ordinal) or high-level descriptions of the associations. Users can generate power curves to assess the trade-offs between sample size, effect size, and power of a design. This paper presents tutorials and examples focusing on applications for environmental mixture studies when predictors tend to be moderately to highly correlated. It easily interfaces with several existing and newly developed analysis strategies for assessing associations between exposures and health outcomes. However, the package is sufficiently general to facilitate power simulations in a wide variety of settings."}, "https://arxiv.org/abs/2209.09810": {"title": "The boosted HP filter is more general than you might think", "link": "https://arxiv.org/abs/2209.09810", "description": "arXiv:2209.09810v2 Announce Type: replace \nAbstract: The global financial crisis and Covid recession have renewed discussion concerning trend-cycle discovery in macroeconomic data, and boosting has recently upgraded the popular HP filter to a modern machine learning device suited to data-rich and rapid computational environments. This paper extends boosting's trend determination capability to higher order integrated processes and time series with roots that are local to unity. The theory is established by understanding the asymptotic effect of boosting on a simple exponential function. Given a universe of time series in FRED databases that exhibit various dynamic patterns, boosting captures downturns at crises and the recoveries that follow in a timely manner."}, "https://arxiv.org/abs/2305.13188": {"title": "Fast Variational Inference for Bayesian Factor Analysis in Single and Multi-Study Settings", "link": "https://arxiv.org/abs/2305.13188", "description": "arXiv:2305.13188v2 Announce Type: replace \nAbstract: Factor models are routinely used to analyze high-dimensional data in both single-study and multi-study settings. Bayesian inference for such models relies on Markov Chain Monte Carlo (MCMC) methods which scale poorly as the number of studies, observations, or measured variables increases. To address this issue, we propose variational inference algorithms to approximate the posterior distribution of Bayesian latent factor models using the multiplicative gamma process shrinkage prior. The proposed algorithms provide fast approximate inference at a fraction of the time and memory of MCMC-based implementations while maintaining comparable accuracy in characterizing the data covariance matrix. We conduct extensive simulations to evaluate our proposed algorithms and show their utility in estimating the model for high-dimensional multi-study gene expression data in ovarian cancers. Overall, our proposed approaches enable more efficient and scalable inference for factor models, facilitating their use in high-dimensional settings. An R package VIMSFA implementing our methods is available on GitHub (github.com/blhansen/VI-MSFA)."}, "https://arxiv.org/abs/2305.18809": {"title": "Discrete forecast reconciliation", "link": "https://arxiv.org/abs/2305.18809", "description": "arXiv:2305.18809v3 Announce Type: replace \nAbstract: This paper presents a formal framework and proposes algorithms to extend forecast reconciliation to discrete-valued data, including low counts. 
A novel method is introduced based on recasting the optimisation of scoring rules as an assignment problem, which is solved using quadratic programming. The proposed framework produces coherent joint probabilistic forecasts for count hierarchical time series. Two discrete reconciliation algorithms are also proposed and compared against generalisations of the top-down and bottom-up approaches for count data. Two simulation experiments and two empirical examples are conducted to validate that the proposed reconciliation algorithms improve forecast accuracy. The empirical applications are forecasting criminal offences in Washington D.C. and product unit sales in the M5 dataset. Compared to benchmarks, the proposed framework shows superior performance in both simulations and empirical studies."}, "https://arxiv.org/abs/2307.11401": {"title": "Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data", "link": "https://arxiv.org/abs/2307.11401", "description": "arXiv:2307.11401v2 Announce Type: replace \nAbstract: We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data."}, "https://arxiv.org/abs/2308.11672": {"title": "Simulation-Based Prior Knowledge Elicitation for Parametric Bayesian Models", "link": "https://arxiv.org/abs/2308.11672", "description": "arXiv:2308.11672v2 Announce Type: replace \nAbstract: A central characteristic of Bayesian statistics is the ability to consistently incorporate prior knowledge into various modeling processes. In this paper, we focus on translating domain expert knowledge into corresponding prior distributions over model parameters, a process known as prior elicitation. Expert knowledge can manifest itself in diverse formats, including information about raw data, summary statistics, or model parameters. A major challenge for existing elicitation methods is how to effectively utilize all of these different formats in order to formulate prior distributions that align with the expert's expectations, regardless of the model structure. 
To address these challenges, we develop a simulation-based elicitation method that can learn the hyperparameters of potentially any parametric prior distribution from a wide spectrum of expert knowledge using stochastic gradient descent. We validate the effectiveness and robustness of our elicitation method in four representative case studies covering linear models, generalized linear models, and hierarchical models. Our results support the claim that our method is largely independent of the underlying model structure and adaptable to various elicitation techniques, including quantile-based, moment-based, and histogram-based methods."}, "https://arxiv.org/abs/2401.12084": {"title": "Temporal Aggregation for the Synthetic Control Method", "link": "https://arxiv.org/abs/2401.12084", "description": "arXiv:2401.12084v2 Announce Type: replace \nAbstract: The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit with panel data. Two challenges arise with higher frequency data (e.g., monthly versus yearly): (1) achieving excellent pre-treatment fit is typically more challenging; and (2) overfitting to noise is more likely. Aggregating data over time can mitigate these problems but can also destroy important signal. In this paper, we bound the bias for SCM with disaggregated and aggregated outcomes and give conditions under which aggregating tightens the bounds. We then propose finding weights that balance both disaggregated and aggregated series."}, "https://arxiv.org/abs/2010.16271": {"title": "View selection in multi-view stacking: Choosing the meta-learner", "link": "https://arxiv.org/abs/2010.16271", "description": "arXiv:2010.16271v3 Announce Type: replace-cross \nAbstract: Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantage over the other three."}, "https://arxiv.org/abs/2303.05561": {"title": "Exploration of the search space of Gaussian graphical models for paired data", "link": "https://arxiv.org/abs/2303.05561", "description": "arXiv:2303.05561v2 Announce Type: replace-cross \nAbstract: We consider the problem of learning a Gaussian graphical model in the case where the observations come from two dependent groups sharing the same variables. We focus on a family of coloured Gaussian graphical models specifically suited for the paired data problem. 
Commonly, graphical models are ordered by the submodel relationship so that the search space is a lattice, called the model inclusion lattice. We introduce a novel order between models, named the twin order. We show that, embedded with this order, the model space is a lattice that, unlike the model inclusion lattice, is distributive. Furthermore, we provide the relevant rules for the computation of the neighbours of a model. The latter are more efficient than the same operations in the model inclusion lattice, and are then exploited to achieve a more efficient exploration of the search space. These results can be applied to improve the efficiency of both greedy and Bayesian model search procedures. Here we implement a stepwise backward elimination procedure and evaluate its performance by means of simulations. Finally, the procedure is applied to learn a brain network from fMRI data where the two groups correspond to the left and right hemispheres, respectively."}, "https://arxiv.org/abs/2306.11281": {"title": "Towards Characterizing Domain Counterfactuals For Invertible Latent Causal Models", "link": "https://arxiv.org/abs/2306.11281", "description": "arXiv:2306.11281v3 Announce Type: replace-cross \nAbstract: Answering counterfactual queries has important applications such as explainability, robustness, and fairness but is challenging when the causal variables are unobserved and the observations are non-linear mixtures of these latent variables, such as pixels in images. One approach is to recover the latent Structural Causal Model (SCM), which may be infeasible in practice due to requiring strong assumptions, e.g., linearity of the causal mechanisms or perfect atomic interventions. Meanwhile, more practical ML-based approaches using naive domain translation models to generate counterfactual samples lack theoretical grounding and may construct invalid counterfactuals. In this work, we strive to strike a balance between practicality and theoretical guarantees by analyzing a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). We show that recovering the latent SCM is unnecessary for estimating domain counterfactuals, thereby sidestepping some of the theoretic challenges. By assuming invertibility and sparsity of intervention, we prove domain counterfactual estimation error can be bounded by a data fit term and intervention sparsity term. Building upon our theoretical results, we develop a theoretically grounded practical algorithm that simplifies the modeling process to generative model estimation under autoregressive and shared parameter constraints that enforce intervention sparsity. Finally, we show an improvement in counterfactual estimation over baseline methods through extensive simulated and image-based experiments."}, "https://arxiv.org/abs/2404.10063": {"title": "Adjusting for bias due to measurement error in functional quantile regression models with error-prone functional and scalar covariates", "link": "https://arxiv.org/abs/2404.10063", "description": "arXiv:2404.10063v1 Announce Type: new \nAbstract: Wearable devices enable the continuous monitoring of physical activity (PA) but generate complex functional data with poorly characterized errors. Most work on functional data views the data as smooth, latent curves obtained at discrete time intervals with some random noise with mean zero and constant variance. 
Viewing this noise as homoscedastic and independent ignores potential serial correlations. Our preliminary studies indicate that failing to account for these serial correlations can bias estimations. In dietary assessments, epidemiologists often use self-reported measures based on food frequency questionnaires that are prone to recall bias. With the increased availability of complex, high-dimensional functional, and scalar biomedical data potentially prone to measurement errors, it is necessary to adjust for biases induced by these errors to permit accurate analyses in various regression settings. However, there has been limited work to address measurement errors in functional and scalar covariates in the context of quantile regression. Therefore, we developed new statistical methods based on simulation extrapolation (SIMEX) and mixed effects regression with repeated measures to correct for measurement error biases in this context. We conducted simulation studies to establish the finite sample properties of our new methods. The methods are illustrated through application to a real data set."}, "https://arxiv.org/abs/2404.10111": {"title": "From Predictive Algorithms to Automatic Generation of Anomalies", "link": "https://arxiv.org/abs/2404.10111", "description": "arXiv:2404.10111v1 Announce Type: new \nAbstract: Machine learning algorithms can find predictive signals that researchers fail to notice; yet they are notoriously hard-to-interpret. How can we extract theoretical insights from these black boxes? History provides a clue. Facing a similar problem -- how to extract theoretical insights from their intuitions -- researchers often turned to ``anomalies:'' constructed examples that highlight flaws in an existing theory and spur the development of new ones. Canonical examples include the Allais paradox and the Kahneman-Tversky choice experiments for expected utility theory. We suggest anomalies can extract theoretical insights from black box predictive algorithms. We develop procedures to automatically generate anomalies for an existing theory when given a predictive algorithm. We cast anomaly generation as an adversarial game between a theory and a falsifier, the solutions to which are anomalies: instances where the black box algorithm predicts - were we to collect data - we would likely observe violations of the theory. As an illustration, we generate anomalies for expected utility theory using a large, publicly available dataset on real lottery choices. Based on an estimated neural network that predicts lottery choices, our procedures recover known anomalies and discover new ones for expected utility theory. 
In incentivized experiments, subjects violate expected utility theory on these algorithmically generated anomalies; moreover, the violation rates are similar to observed rates for the Allais paradox and Common ratio effect."}, "https://arxiv.org/abs/2404.10251": {"title": "Perturbations of Markov Chains", "link": "https://arxiv.org/abs/2404.10251", "description": "arXiv:2404.10251v1 Announce Type: new \nAbstract: This chapter surveys progress on three related topics in perturbations of Markov chains: the motivating question of when and how \"perturbed\" MCMC chains are developed, the theoretical problem of how perturbation theory can be used to analyze such chains, and finally the question of how the theoretical analyses can lead to practical advice."}, "https://arxiv.org/abs/2404.10344": {"title": "Semi-parametric profile pseudolikelihood via local summary statistics for spatial point pattern intensity estimation", "link": "https://arxiv.org/abs/2404.10344", "description": "arXiv:2404.10344v1 Announce Type: new \nAbstract: Second-order statistics play a crucial role in analysing point processes. Previous research has specifically explored locally weighted second-order statistics for point processes, offering diagnostic tests in various spatial domains. However, there remains a need to improve inference for complex intensity functions, especially when the point process likelihood is intractable and in the presence of interactions among points. This paper addresses this gap by proposing a method that exploits local second-order characteristics to account for local dependencies in the fitting procedure. Our approach utilises the Papangelou conditional intensity function for general Gibbs processes, avoiding explicit assumptions about the degree of interaction and homogeneity. We provide simulation results and an application to real data to assess the proposed method's goodness-of-fit. Overall, this work contributes to advancing statistical techniques for point process analysis in the presence of spatial interactions."}, "https://arxiv.org/abs/2404.10381": {"title": "Covariate Ordered Systematic Sampling as an Improvement to Randomized Controlled Trials", "link": "https://arxiv.org/abs/2404.10381", "description": "arXiv:2404.10381v1 Announce Type: new \nAbstract: The Randomized Controlled Trial (RCT) or A/B testing is considered the gold standard method for estimating causal effects. Fisher famously advocated randomly allocating experiment units into treatment and control groups to preclude systematic biases. We propose a variant of systematic sampling called Covariate Ordered Systematic Sampling (COSS). In COSS, we order experimental units using a pre-experiment covariate and allocate them alternately into treatment and control groups. Using theoretical proofs, experiments on simulated data, and hundreds of A/B tests conducted within 3 real-world marketing campaigns, we show how our method achieves better sensitivity gains than commonly used variance reduction techniques like CUPED while retaining the simplicity of RCTs."}, "https://arxiv.org/abs/2404.10495": {"title": "Assumption-Lean Quantile Regression", "link": "https://arxiv.org/abs/2404.10495", "description": "arXiv:2404.10495v1 Announce Type: new \nAbstract: Quantile regression is a powerful tool for detecting exposure-outcome associations given covariates across different parts of the outcome's distribution, but has two major limitations when the aim is to infer the effect of an exposure. 
Firstly, the exposure coefficient estimator may not converge to a meaningful quantity when the model is misspecified, and secondly, variable selection methods may induce bias and excess uncertainty, rendering inferences biased and overly optimistic. In this paper, we address these issues via partially linear quantile regression models which parametrize the conditional association of interest, but do not restrict the association with other covariates in the model. We propose consistent estimators for the unknown model parameter by mapping it onto a nonparametric main effect estimand that captures the (conditional) association of interest even when the quantile model is misspecified. This estimand is estimated using the efficient influence function under the nonparametric model, allowing for the incorporation of data-adaptive procedures such as variable selection and machine learning. Our approach provides a flexible and reliable method for detecting associations that is robust to model misspecification and excess uncertainty induced by variable selection methods. The proposal is illustrated using simulation studies and data on annual health care costs associated with excess body weight."}, "https://arxiv.org/abs/2404.10594": {"title": "Nonparametric Isotropy Test for Spatial Point Processes using Random Rotations", "link": "https://arxiv.org/abs/2404.10594", "description": "arXiv:2404.10594v1 Announce Type: new \nAbstract: In spatial statistics, point processes are often assumed to be isotropic meaning that their distribution is invariant under rotations. Statistical tests for the null hypothesis of isotropy found in the literature are based either on asymptotics or on Monte Carlo simulation of a parametric null model. Here, we present a nonparametric test based on resampling the Fry points of the observed point pattern. Empirical levels and powers of the test are investigated in a simulation study for four point process models with anisotropy induced by different mechanisms. Finally, a real data set is tested for isotropy."}, "https://arxiv.org/abs/2404.10629": {"title": "Weighting methods for truncation by death in cluster-randomized trials", "link": "https://arxiv.org/abs/2404.10629", "description": "arXiv:2404.10629v1 Announce Type: new \nAbstract: Patient-centered outcomes, such as quality of life and length of hospital stay, are the focus in a wide array of clinical studies. However, participants in randomized trials for elderly or critically and severely ill patient populations may have truncated or undefined non-mortality outcomes if they do not survive through the measurement time point. To address truncation by death, the survivor average causal effect (SACE) has been proposed as a causally interpretable subgroup treatment effect defined under the principal stratification framework. However, the majority of methods for estimating SACE have been developed in the context of individually-randomized trials. Only limited discussions have been centered around cluster-randomized trials (CRTs), where methods typically involve strong distributional assumptions for outcome modeling. In this paper, we propose two weighting methods to estimate SACE in CRTs that obviate the need for potentially complicated outcome distribution modeling. We establish the requisite assumptions that address latent clustering effects to enable point identification of SACE, and we provide computationally-efficient asymptotic variance estimators for each weighting estimator. 
In simulations, we evaluate our weighting estimators, demonstrating their finite-sample operating characteristics and robustness to certain departures from the identification assumptions. We illustrate our methods using data from a CRT to assess the impact of a sedation protocol on mechanical ventilation among children with acute respiratory failure."}, "https://arxiv.org/abs/2404.10427": {"title": "Effect of Systematic Uncertainties on Density and Temperature Estimates in Coronae of Capella", "link": "https://arxiv.org/abs/2404.10427", "description": "arXiv:2404.10427v1 Announce Type: cross \nAbstract: We estimate the coronal density of Capella using the O VII and Fe XVII line systems in the soft X-ray regime that have been observed over the course of the Chandra mission. Our analysis combines measures of error due to uncertainty in the underlying atomic data with statistical errors in the Chandra data to derive meaningful overall uncertainties on the plasma density of the coronae of Capella. We consider two Bayesian frameworks. First, the so-called pragmatic-Bayesian approach considers the atomic data and their uncertainties as fully specified and uncorrectable. The fully-Bayesian approach, on the other hand, allows the observed spectral data to update the atomic data and their uncertainties, thereby reducing the overall errors on the inferred parameters. To incorporate atomic data uncertainties, we obtain a set of atomic data replicates, the distribution of which captures their uncertainty. A principal component analysis of these replicates allows us to represent the atomic uncertainty with a lower-dimensional multivariate Gaussian distribution. A $t$-distribution approximation of the uncertainties of a subset of plasma parameters including a priori temperature information, obtained from the temperature-sensitive-only Fe XVII spectral line analysis, is carried forward into the density- and temperature-sensitive O VII spectral line analysis. Markov Chain Monte Carlo based model fitting is implemented including Multi-step Monte Carlo Gibbs Sampler and Hamiltonian Monte Carlo. Our analysis recovers an isothermally approximated coronal plasma temperature of $\\approx$5 MK and a coronal plasma density of $\\approx$10$^{10}$ cm$^{-3}$, with uncertainties of 0.1 and 0.2 dex respectively."}, "https://arxiv.org/abs/2404.10436": {"title": "Tree Bandits for Generative Bayes", "link": "https://arxiv.org/abs/2404.10436", "description": "arXiv:2404.10436v1 Announce Type: cross \nAbstract: In generative models with obscured likelihood, Approximate Bayesian Computation (ABC) is often the tool of last resort for inference. However, ABC demands many prior parameter trials to keep only a small fraction that passes an acceptance test. To accelerate ABC rejection sampling, this paper develops a self-aware framework that learns from past trials and errors. We apply recursive partitioning classifiers on the ABC lookup table to sequentially refine high-likelihood regions into boxes. Each box is regarded as an arm in a binary bandit problem treating ABC acceptance as a reward. Each arm has a proclivity for being chosen for the next ABC evaluation, depending on the prior distribution and past rejections. The method places more splits in those areas where the likelihood resides, shying away from low-probability regions destined for ABC rejections. We provide two versions: (1) ABC-Tree for posterior sampling, and (2) ABC-MAP for maximum a posteriori estimation. 
We demonstrate accurate ABC approximability at much lower simulation cost. We justify the use of our tree-based bandit algorithms with nearly optimal regret bounds. Finally, we successfully apply our approach to the problem of masked image classification using deep generative models."}, "https://arxiv.org/abs/2404.10523": {"title": "Capturing the Macroscopic Behaviour of Molecular Dynamics with Membership Functions", "link": "https://arxiv.org/abs/2404.10523", "description": "arXiv:2404.10523v1 Announce Type: cross \nAbstract: Markov processes serve as foundational models in many scientific disciplines, such as molecular dynamics, and their simulation forms a common basis for analysis. While simulations produce useful trajectories, obtaining macroscopic information directly from microstate data presents significant challenges. This paper addresses this gap by introducing the concept of membership functions being the macrostates themselves. We derive equations for the holding times of these macrostates and demonstrate their consistency with the classical definition. Furthermore, we discuss the application of the ISOKANN method for learning these quantities from simulation data. In addition, we present a novel method for extracting transition paths based on the ISOKANN results and demonstrate its efficacy by applying it to simulations of the mu-opioid receptor. With this approach we provide a new perspective on analyzing the macroscopic behaviour of Markov systems."}, "https://arxiv.org/abs/2404.10530": {"title": "JCGM 101-compliant uncertainty evaluation using virtual experiments", "link": "https://arxiv.org/abs/2404.10530", "description": "arXiv:2404.10530v1 Announce Type: cross \nAbstract: Virtual experiments (VEs), a modern tool in metrology, can be used to help perform an uncertainty evaluation for the measurand. Current guidelines in metrology do not cover the many possibilities to incorporate VEs into an uncertainty evaluation, and it is often difficult to assess if the intended use of a VE complies with said guidelines. In recent work, it was shown that a VE can be used in conjunction with real measurement data and a Monte Carlo procedure to produce equal results to a supplement of the Guide to the Expression of Uncertainty in Measurement. However, this was shown only for linear measurement models. In this work, we extend this Monte Carlo approach to a common class of non-linear measurement models and more complex VEs, providing a reference approach for suitable uncertainty evaluations involving VEs. Numerical examples are given to show that the theoretical derivations hold in a practical scenario."}, "https://arxiv.org/abs/2404.10580": {"title": "Data-driven subgrouping of patient trajectories with chronic diseases: Evidence from low back pain", "link": "https://arxiv.org/abs/2404.10580", "description": "arXiv:2404.10580v1 Announce Type: cross \nAbstract: Clinical data informs the personalization of health care with a potential for more effective disease management. In practice, this is achieved by subgrouping, whereby clusters with similar patient characteristics are identified and then receive customized treatment plans with the goal of targeting subgroup-specific disease dynamics. In this paper, we propose a novel mixture hidden Markov model for subgrouping patient trajectories from chronic diseases. 
Our model is probabilistic and carefully designed to capture different trajectory phases of chronic diseases (i.e., \"severe\", \"moderate\", and \"mild\") through tailored latent states. We demonstrate our subgrouping framework based on a longitudinal study across 847 patients with non-specific low back pain. Here, our subgrouping framework identifies 8 subgroups. Further, we show that our subgrouping framework outperforms common baselines in terms of cluster validity indices. Finally, we discuss the applicability of the model to other chronic and long-lasting diseases."}, "https://arxiv.org/abs/1808.10522": {"title": "Optimal Instrument Selection using Bayesian Model Averaging for Model Implied Instrumental Variable Two Stage Least Squares Estimators", "link": "https://arxiv.org/abs/1808.10522", "description": "arXiv:1808.10522v2 Announce Type: replace \nAbstract: Model-Implied Instrumental Variable Two-Stage Least Squares (MIIV-2SLS) is a limited information, equation-by-equation, non-iterative estimator for latent variable models. Associated with this estimator are equation-specific tests of model misspecification. One issue with equation-specific tests is that they lack specificity, in that they indicate that some instruments are problematic without revealing which specific ones. Instruments that are poor predictors of their target variables (weak instruments) are a second potential problem. We propose a novel extension that provides instrument-specific tests of misspecification and detects weak instruments. We term this the Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging (MIIV-2SBMA) estimator. We evaluate the performance of MIIV-2SBMA against MIIV-2SLS in a simulation study and show that it has comparable performance in terms of parameter estimation. Additionally, our instrument-specific overidentification tests developed within the MIIV-2SBMA framework show increased power to detect specific problematic and weak instruments. Finally, we demonstrate MIIV-2SBMA using an empirical example."}, "https://arxiv.org/abs/2308.10375": {"title": "Model Selection over Partially Ordered Sets", "link": "https://arxiv.org/abs/2308.10375", "description": "arXiv:2308.10375v3 Announce Type: replace \nAbstract: In problems such as variable selection and graph estimation, models are characterized by Boolean logical structure such as presence or absence of a variable or an edge. Consequently, false positive error or false negative error can be specified as the number of variables/edges that are incorrectly included or excluded in an estimated model. However, there are several other problems such as ranking, clustering, and causal inference in which the associated model classes do not admit transparent notions of false positive and false negative errors due to the lack of an underlying Boolean logical structure. In this paper, we present a generic approach to endow a collection of models with partial order structure, which leads to a hierarchical organization of model classes as well as natural analogs of false positive and false negative errors. 
We describe model selection procedures that provide false positive error control in our general setting and we illustrate their utility with numerical experiments."}, "https://arxiv.org/abs/2310.09701": {"title": "A robust and powerful replicability analysis for high dimensional data", "link": "https://arxiv.org/abs/2310.09701", "description": "arXiv:2310.09701v2 Announce Type: replace \nAbstract: Identifying replicable signals across different studies provides stronger scientific evidence and more powerful inference. Existing literature on high dimensional replicability analysis either imposes strong modeling assumptions or has low power. We develop a powerful and robust empirical Bayes approach for high dimensional replicability analysis. Our method effectively borrows information from different features and studies while accounting for heterogeneity. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods."}, "https://arxiv.org/abs/2312.13450": {"title": "Precise FWER Control for Gaussian Related Fields: Riding the SuRF to continuous land -- Part 1", "link": "https://arxiv.org/abs/2312.13450", "description": "arXiv:2312.13450v2 Announce Type: replace \nAbstract: The Gaussian Kinematic Formula (GKF) is a powerful and computationally efficient tool to perform statistical inference on random fields and has become a well-established tool in the analysis of neuroimaging data. Using realistic error models, recent articles show that GKF based methods for \\emph{voxelwise inference} lead to conservative control of the familywise error rate (FWER) and for cluster-size inference lead to inflated false positive rates. In this series of articles we identify and resolve the main causes of these shortcomings in the traditional usage of the GKF for voxelwise inference. This first part removes the \\textit{good lattice assumption} and allows the data to be non-stationary, yet still assumes the data to be Gaussian. The latter assumption is resolved in part 2, where we also demonstrate that our GKF based methodology is non-conservative under realistic error models."}, "https://arxiv.org/abs/2401.08290": {"title": "Causal Machine Learning for Moderation Effects", "link": "https://arxiv.org/abs/2401.08290", "description": "arXiv:2401.08290v2 Announce Type: replace \nAbstract: It is valuable for any decision maker to know the impact of decisions (treatments) on average and for subgroups. The causal machine learning literature has recently provided tools for estimating group average treatment effects (GATE) to understand treatment heterogeneity better. This paper addresses the challenge of interpreting such differences in treatment effects between groups while accounting for variations in other covariates. We propose a new parameter, the balanced group average treatment effect (BGATE), which measures a GATE with a specific distribution of a priori-determined covariates. By taking the difference of two BGATEs, we can analyse heterogeneity more meaningfully than by comparing two GATEs. The estimation strategy for this parameter is based on double/debiased machine learning for discrete treatments in an unconfoundedness setting, and the estimator is shown to be $\\sqrt{N}$-consistent and asymptotically normal under standard conditions. 
Adding additional identifying assumptions allows specific balanced differences in treatment effects between groups to be interpreted causally, leading to the causal balanced group average treatment effect. We explore the finite sample properties in a small-scale simulation study and demonstrate the usefulness of these parameters in an empirical example."}, "https://arxiv.org/abs/2401.13057": {"title": "Inference under partial identification with minimax test statistics", "link": "https://arxiv.org/abs/2401.13057", "description": "arXiv:2401.13057v2 Announce Type: replace \nAbstract: We provide a means of computing and estimating the asymptotic distributions of statistics based on an outer minimization of an inner maximization. Such test statistics, which arise frequently in moment models, are of special interest in providing hypothesis tests under partial identification. Under general conditions, we provide an asymptotic characterization of such test statistics using the minimax theorem, and a means of computing critical values using the bootstrap. Making some light regularity assumptions, our results augment several asymptotic approximations that have been provided for partially identified hypothesis tests, and extend them by mitigating their dependence on local linear approximations of the parameter space. These asymptotic results are generally simple to state and straightforward to compute (esp.\\ adversarially)."}, "https://arxiv.org/abs/2401.14355": {"title": "Multiply Robust Difference-in-Differences Estimation of Causal Effect Curves for Continuous Exposures", "link": "https://arxiv.org/abs/2401.14355", "description": "arXiv:2401.14355v2 Announce Type: replace \nAbstract: Researchers commonly use difference-in-differences (DiD) designs to evaluate public policy interventions. While methods exist for estimating effects in the context of binary interventions, policies often result in varied exposures across regions implementing the policy. Yet, existing approaches for incorporating continuous exposures face substantial limitations in addressing confounding variables associated with intervention status, exposure levels, and outcome trends. These limitations significantly constrain policymakers' ability to fully comprehend policy impacts and design future interventions. In this work, we propose new estimators for causal effect curves within the DiD framework, accounting for multiple sources of confounding. Our approach accommodates misspecification of a subset of treatment, exposure, and outcome models while avoiding any parametric assumptions on the effect curve. We present the statistical properties of the proposed methods and illustrate their application through simulations and a study investigating the heterogeneous effects of a nutritional excise tax under different levels of accessibility to cross-border shopping."}, "https://arxiv.org/abs/2205.06812": {"title": "Principal-Agent Hypothesis Testing", "link": "https://arxiv.org/abs/2205.06812", "description": "arXiv:2205.06812v3 Announce Type: replace-cross \nAbstract: Consider the relationship between a regulator (the principal) and an experimenter (the agent) such as a pharmaceutical company. The pharmaceutical company wishes to sell a drug for profit, whereas the regulator wishes to allow only efficacious drugs to be marketed. The efficacy of the drug is not known to the regulator, so the pharmaceutical company must run a costly trial to prove efficacy to the regulator. 
Critically, the statistical protocol used to establish efficacy affects the behavior of a strategic, self-interested agent; a lower standard of statistical evidence incentivizes the agent to run more trials that are less likely to be effective. The interaction between the statistical protocol and the incentives of the pharmaceutical company is crucial for understanding this system and designing protocols with high social utility. In this work, we discuss how the regulator can set up a protocol with payoffs based on statistical evidence. We show how to design protocols that are robust to an agent's strategic actions, and derive the optimal protocol in the presence of strategic entrants."}, "https://arxiv.org/abs/2207.07978": {"title": "Robust Multivariate Functional Control Chart", "link": "https://arxiv.org/abs/2207.07978", "description": "arXiv:2207.07978v3 Announce Type: replace-cross \nAbstract: In modern Industry 4.0 applications, a huge amount of data is acquired during manufacturing processes that are often contaminated with anomalous observations in the form of both casewise and cellwise outliers. These can seriously reduce the performance of control charting procedures, especially in complex and high-dimensional settings. To mitigate this issue in the context of profile monitoring, we propose a new framework, referred to as robust multivariate functional control chart (RoMFCC), that is able to monitor multivariate functional data while being robust to both functional casewise and cellwise outliers. The RoMFCC relies on four main elements: (I) a functional univariate filter to identify functional cellwise outliers to be replaced by missing components; (II) a robust multivariate functional data imputation method of missing values; (III) a casewise robust dimensionality reduction; (IV) a monitoring strategy for the multivariate functional quality characteristic. An extensive Monte Carlo simulation study is performed to compare the RoMFCC with competing monitoring schemes already appeared in the literature. Finally, a motivating real-case study is presented where the proposed framework is used to monitor a resistance spot welding process in the automotive industry."}, "https://arxiv.org/abs/2304.01163": {"title": "The extended Ville's inequality for nonintegrable nonnegative supermartingales", "link": "https://arxiv.org/abs/2304.01163", "description": "arXiv:2304.01163v2 Announce Type: replace-cross \nAbstract: Following the initial work by Robbins, we rigorously present an extended theory of nonnegative supermartingales, requiring neither integrability nor finiteness. In particular, we derive a key maximal inequality foreshadowed by Robbins, which we call the extended Ville's inequality, that strengthens the classical Ville's inequality (for integrable nonnegative supermartingales), and also applies to our nonintegrable setting. We derive an extension of the method of mixtures, which applies to $\\sigma$-finite mixtures of our extended nonnegative supermartingales. 
We present some implications of our theory for sequential statistics, such as the use of improper mixtures (priors) in deriving nonparametric confidence sequences and (extended) e-processes."}, "https://arxiv.org/abs/2404.10834": {"title": "VARX Granger Analysis: Modeling, Inference, and Applications", "link": "https://arxiv.org/abs/2404.10834", "description": "arXiv:2404.10834v1 Announce Type: new \nAbstract: Vector Autoregressive models with exogenous input (VARX) provide a powerful framework for modeling complex dynamical systems like brains, markets, or societies. Their simplicity allows us to uncover linear effects between endogenous and exogenous variables. The Granger formalism is naturally suited for VARX models, but the connection between the two is not widely understood. We aim to bridge this gap by providing both the basic equations and easy-to-use code. We first explain how the Granger formalism can be combined with a VARX model using deviance as a test statistic. We also present a bias correction for the deviance in the case of L2 regularization, a technique used to reduce model complexity. To address the challenge of modeling long responses, we propose the use of basis functions, which further reduce parameter complexity. We demonstrate that p-values are correctly estimated, even for short signals where regularization influences the results. Additionally, we analyze the model's performance under various scenarios where model assumptions are violated, such as missing variables or indirect observations of the underlying dynamics. Finally, we showcase the practical utility of our approach by applying it to real-world data from neuroscience, physiology, and sociology. To facilitate its adoption, we make Matlab, Python, and R code available here: https://github.com/lcparra/varx"}, "https://arxiv.org/abs/2404.10884": {"title": "Modeling Interconnected Modules in Multivariate Outcomes: Evaluating the Impact of Alcohol Intake on Plasma Metabolomics", "link": "https://arxiv.org/abs/2404.10884", "description": "arXiv:2404.10884v1 Announce Type: new \nAbstract: Alcohol consumption has been shown to influence cardiovascular mechanisms in humans, leading to observable alterations in the plasma metabolomic profile. Regression models are commonly employed to investigate these effects, treating metabolomics features as the outcomes and alcohol intake as the exposure. Given the latent dependence structure among the numerous metabolomic features (e.g., co-expression networks with interconnected modules), modeling this structure is crucial for accurately identifying metabolomic features associated with alcohol intake. However, integrating dependence structures into regression models remains difficult in both estimation and inference procedures due to their large or high dimensionality. To bridge this gap, we propose an innovative multivariate regression model that accounts for correlations among outcome features by incorporating an interconnected community structure. Furthermore, we derive closed-form and likelihood-based estimators, accompanied by explicit exact and explicit asymptotic covariance matrix estimators, respectively. Simulation analysis demonstrates that our approach provides accurate estimation of both dependence and regression coefficients, and enhances sensitivity while maintaining a well-controlled discovery rate, as evidenced through benchmarking against existing regression models. 
Finally, we apply our approach to assess the impact of alcohol intake on $249$ metabolomic biomarkers measured using nuclear magnetic resonance spectroscopy. The results indicate that alcohol intake can elevate high-density lipoprotein levels by enhancing the transport rate of Apolipoprotein A1."}, "https://arxiv.org/abs/2404.10974": {"title": "Compressive Bayesian non-negative matrix factorization for mutational signatures analysis", "link": "https://arxiv.org/abs/2404.10974", "description": "arXiv:2404.10974v1 Announce Type: new \nAbstract: Non-negative matrix factorization (NMF) is widely used in many applications for dimensionality reduction. Inferring an appropriate number of factors for NMF is a challenging problem, and several approaches based on information criteria or sparsity-inducing priors have been proposed. However, inference in these models is often complicated and computationally challenging. In this paper, we introduce a novel methodology for overfitted Bayesian NMF models using \"compressive hyperpriors\" that force unneeded factors down to negligible values while only imposing mild shrinkage on needed factors. The method is based on using simple semi-conjugate priors to facilitate inference, while setting the strength of the hyperprior in a data-dependent way to achieve this compressive property. We apply our method to mutational signatures analysis in cancer genomics, where we find that it outperforms state-of-the-art alternatives. In particular, we illustrate how our compressive hyperprior enables the use of biologically informed priors on the signatures, yielding significantly improved accuracy. We provide theoretical results establishing the compressive property, and we demonstrate the method in simulations and on real data from a breast cancer application."}, "https://arxiv.org/abs/2404.11057": {"title": "Partial Identification of Heteroskedastic Structural VARs: Theory and Bayesian Inference", "link": "https://arxiv.org/abs/2404.11057", "description": "arXiv:2404.11057v1 Announce Type: new \nAbstract: We consider structural vector autoregressions identified through stochastic volatility. Our focus is on whether a particular structural shock is identified by heteroskedasticity without the need to impose any sign or exclusion restrictions. Three contributions emerge from our exercise: (i) a set of conditions under which the matrix containing structural parameters is partially or globally unique; (ii) a statistical procedure to assess the validity of the conditions mentioned above; and (iii) a shrinkage prior distribution for conditional variances centred on a hypothesis of homoskedasticity. Such a prior ensures that the evidence for identifying a structural shock comes only from the data and is not favoured by the prior. We illustrate our new methods using a U.S. fiscal structural model."}, "https://arxiv.org/abs/2404.11092": {"title": "Estimation for conditional moment models based on martingale difference divergence", "link": "https://arxiv.org/abs/2404.11092", "description": "arXiv:2404.11092v1 Announce Type: new \nAbstract: We provide a new estimation method for conditional moment models via the martingale difference divergence (MDD). Our MDD-based estimation method is formed in the framework of a continuum of unconditional moment restrictions. 
Unlike the existing estimation methods in this framework, the MDD-based estimation method adopts a non-integrable weighting function, which can extract more information from the unconditional moment restrictions than an integrable weighting function, thereby enhancing estimation efficiency. Due to the nature of shift-invariance in MDD, our MDD-based estimation method cannot identify the intercept parameters. To overcome this identification issue, we further provide a two-step estimation procedure for the model with intercept parameters. Under regularity conditions, we establish the asymptotics of the proposed estimators, which are not only easy to implement with analytic asymptotic variances, but also applicable to time series data with an unspecified form of conditional heteroskedasticity. Finally, we illustrate the usefulness of the proposed estimators by simulations and two real examples."}, "https://arxiv.org/abs/2404.11125": {"title": "Interval-censored linear quantile regression", "link": "https://arxiv.org/abs/2404.11125", "description": "arXiv:2404.11125v1 Announce Type: new \nAbstract: Censored quantile regression has emerged as a prominent alternative to classical Cox's proportional hazards model or accelerated failure time model in both theoretical and applied statistics. While quantile regression has been extensively studied for right-censored survival data, methodologies for analyzing interval-censored data remain limited in the survival analysis literature. This paper introduces a novel local weighting approach for estimating linear censored quantile regression, specifically tailored to handle diverse forms of interval-censored survival data. The estimation equation and the corresponding convex objective function for the regression parameter can be constructed as a weighted average of quantile loss contributions at two interval endpoints. The weighting components are nonparametrically estimated using local kernel smoothing or ensemble machine learning techniques. To estimate the nonparametric distribution mass for interval-censored data, a modified EM algorithm for nonparametric maximum likelihood estimation is employed by introducing subject-specific latent Poisson variables. The proposed method's empirical performance is demonstrated through extensive simulation studies and real data analyses of two HIV/AIDS datasets."}, "https://arxiv.org/abs/2404.11150": {"title": "Automated, efficient and model-free inference for randomized clinical trials via data-driven covariate adjustment", "link": "https://arxiv.org/abs/2404.11150", "description": "arXiv:2404.11150v1 Announce Type: new \nAbstract: In May 2023, the U.S. Food and Drug Administration (FDA) released guidance for industry on \"Adjustment for Covariates in Randomized Clinical Trials for Drugs and Biological Products\". Covariate adjustment is a statistical analysis method for improving precision and power in clinical trials by adjusting for pre-specified, prognostic baseline variables. Though recommended by the FDA and the European Medicines Agency (EMA), many trials do not exploit the available information in baseline variables or make use only of the baseline measurement of the outcome. This is likely (partly) due to the regulatory mandate to pre-specify baseline covariates for adjustment, leading to challenges in determining appropriate covariates and their functional forms. 
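For the martingale difference divergence abstract above (arXiv:2404.11092), the following is a hedged sketch of the basic building block only: a commonly used V-statistic form of the sample MDD, which equals zero exactly when the conditional mean of Y given X coincides with its unconditional mean. It is not the paper's continuum-of-moment-restrictions estimator, and the two-step intercept procedure is not shown; the function name and test data are illustrative.

```python
import numpy as np

def mdd_squared(y, x):
    """V-statistic estimate of MDD(Y|X)^2 for scalar y and an (n,) or (n, d) x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    yc = y - y.mean()
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))  # pairwise |x_i - x_j|
    return -np.mean(np.outer(yc, yc) * dist)

rng = np.random.default_rng(1)
x = rng.standard_normal(400)
print(mdd_squared(x ** 2 + 0.1 * rng.standard_normal(400), x))  # mean-dependent: clearly positive
print(mdd_squared(rng.standard_normal(400), x))                 # independent: near zero
```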
We will explore the potential of automated data-adaptive methods, such as machine learning algorithms, for covariate adjustment, addressing the challenge of pre-specification. Specifically, our approach allows the use of complex models or machine learning algorithms without compromising the interpretation or validity of the treatment effect estimate and its corresponding standard error, even in the presence of misspecified outcome working models. This contrasts the majority of competing works which assume correct model specification for the validity of standard errors. Our proposed estimators either necessitate ultra-sparsity in the outcome model (which can be relaxed by limiting the number of predictors in the model) or necessitate integration with sample splitting to enhance their performance. As such, we will arrive at simple estimators and standard errors for the marginal treatment effect in randomized clinical trials, which exploit data-adaptive outcome predictions based on prognostic baseline covariates, and have low (or no) bias in finite samples even when those predictions are themselves biased."}, "https://arxiv.org/abs/2404.11198": {"title": "Forecasting with panel data: Estimation uncertainty versus parameter heterogeneity", "link": "https://arxiv.org/abs/2404.11198", "description": "arXiv:2404.11198v1 Announce Type: new \nAbstract: We provide a comprehensive examination of the predictive accuracy of panel forecasting methods based on individual, pooling, fixed effects, and Bayesian estimation, and propose optimal weights for forecast combination schemes. We consider linear panel data models, allowing for weakly exogenous regressors and correlated heterogeneity. We quantify the gains from exploiting panel data and demonstrate how forecasting performance depends on the degree of parameter heterogeneity, whether such heterogeneity is correlated with the regressors, the goodness of fit of the model, and the cross-sectional ($N$) and time ($T$) dimensions. Monte Carlo simulations and empirical applications to house prices and CPI inflation show that forecast combination and Bayesian forecasting methods perform best overall and rarely produce the least accurate forecasts for individual series."}, "https://arxiv.org/abs/2404.11235": {"title": "Bayesian Markov-Switching Vector Autoregressive Process", "link": "https://arxiv.org/abs/2404.11235", "description": "arXiv:2404.11235v1 Announce Type: new \nAbstract: This study introduces marginal density functions of the general Bayesian Markov-Switching Vector Autoregressive (MS-VAR) process. In the case of the Bayesian MS-VAR process, we provide closed--form density functions and Monte-Carlo simulation algorithms, including the importance sampling method. The Monte--Carlo simulation method departs from the previous simulation methods because it removes the duplication in a regime vector."}, "https://arxiv.org/abs/2404.11323": {"title": "Bayesian Optimization for Identification of Optimal Biological Dose Combinations in Personalized Dose-Finding Trials", "link": "https://arxiv.org/abs/2404.11323", "description": "arXiv:2404.11323v1 Announce Type: new \nAbstract: Early phase, personalized dose-finding trials for combination therapies seek to identify patient-specific optimal biological dose (OBD) combinations, which are defined as safe dose combinations which maximize therapeutic benefit for a specific covariate pattern. 
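For the data-driven covariate adjustment abstract above (arXiv:2404.11150), here is a generic cross-fitted sketch in the same spirit, not the authors' exact estimator: a machine-learning outcome model combined with the known randomization probability in an augmented (AIPW-type) estimator, whose influence function supplies a simple standard error even when the outcome model is misspecified. The function name and the simulated trial are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def crossfit_aipw(y, a, X, pi=0.5, n_splits=5, seed=0):
    """Cross-fitted AIPW estimate of the marginal treatment effect in an RCT."""
    n = len(y)
    m1, m0 = np.zeros(n), np.zeros(n)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        for arm, m in ((1, m1), (0, m0)):
            idx = train[a[train] == arm]
            fit = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X[idx], y[idx])
            m[test] = fit.predict(X[test])
    phi = (m1 - m0
           + a * (y - m1) / pi
           - (1 - a) * (y - m0) / (1 - pi))     # influence-function contributions
    return phi.mean(), phi.std(ddof=1) / np.sqrt(n)

rng = np.random.default_rng(2)
n = 1000
X = rng.standard_normal((n, 5))
a = rng.binomial(1, 0.5, n)
y = 1.0 * a + X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.standard_normal(n)
est, se = crossfit_aipw(y, a, X)
print(f"ATE estimate {est:.3f} (SE {se:.3f})")
```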
Given the small sample sizes which are typical of these trials, it is challenging for traditional parametric approaches to identify OBD combinations across multiple dosing agents and covariate patterns. To address these challenges, we propose a Bayesian optimization approach to dose-finding which formally incorporates toxicity information into both the initial data collection process and the sequential search strategy. Independent Gaussian processes are used to model the efficacy and toxicity surfaces, and an acquisition function is utilized to define the dose-finding strategy and an early stopping rule. This work is motivated by a personalized dose-finding trial which considers a dual-agent therapy for obstructive sleep apnea, where OBD combinations are tailored to obstructive sleep apnea severity. To compare the performance of the personalized approach to a standard approach where covariate information is ignored, a simulation study is performed. We conclude that personalized dose-finding is essential in the presence of heterogeneity."}, "https://arxiv.org/abs/2404.11324": {"title": "Weighted-Average Least Squares for Negative Binomial Regression", "link": "https://arxiv.org/abs/2404.11324", "description": "arXiv:2404.11324v1 Announce Type: new \nAbstract: Model averaging methods have become an increasingly popular tool for improving predictions and dealing with model uncertainty, especially in Bayesian settings. Recently, frequentist model averaging methods such as information theoretic and least squares model averaging have emerged. This work focuses on the issue of covariate uncertainty where managing the computational resources is key: The model space grows exponentially with the number of covariates such that averaged models must often be approximated. Weighted-average least squares (WALS), first introduced for (generalized) linear models in the econometric literature, combines Bayesian and frequentist aspects and additionally employs a semiorthogonal transformation of the regressors to reduce the computational burden. This paper extends WALS for generalized linear models to the negative binomial (NB) regression model for overdispersed count data. A simulation experiment and an empirical application using data on doctor visits were conducted to compare the predictive power of WALS for NB regression to traditional estimators. The results show that WALS for NB improves on the maximum likelihood estimator in sparse situations and is competitive with lasso while being computationally more efficient."}, "https://arxiv.org/abs/2404.11345": {"title": "Jacobi Prior: An Alternate Bayesian Method for Supervised Learning", "link": "https://arxiv.org/abs/2404.11345", "description": "arXiv:2404.11345v1 Announce Type: new \nAbstract: This paper introduces the `Jacobi prior,' an alternative Bayesian method that aims to address the computational challenges inherent in traditional techniques. It demonstrates that the Jacobi prior performs better than well-known methods like Lasso, Ridge, Elastic Net, and MCMC-based Horse-Shoe Prior, especially in terms of predictive accuracy. We also show that the Jacobi prior is more than a hundred times faster than these methods while maintaining similar predictive accuracy. The method is implemented for Generalised Linear Models, Gaussian process regression, and classification, making it suitable for longitudinal/panel data analysis. 
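For the Bayesian optimization dose-finding abstract above (arXiv:2404.11323), a hedged sketch of one generic constrained acquisition step follows; it is not the paper's trial design or stopping rule. Independent Gaussian processes are fit to efficacy and toxicity observations, and the next dose combination maximizes an upper-confidence-bound on efficacy weighted by the estimated probability that toxicity stays below a tolerated level. The toxicity limit, kernel choice, and toy data are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def next_dose(doses_obs, eff_obs, tox_obs, candidates, tox_limit=0.3, beta=1.0):
    """One constrained acquisition step: UCB on efficacy times P(toxicity <= limit)."""
    gp_eff = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(doses_obs, eff_obs)
    gp_tox = GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True).fit(doses_obs, tox_obs)
    mu_e, sd_e = gp_eff.predict(candidates, return_std=True)
    mu_t, sd_t = gp_tox.predict(candidates, return_std=True)
    p_safe = norm.cdf((tox_limit - mu_t) / np.maximum(sd_t, 1e-9))
    acq = (mu_e + beta * sd_e) * p_safe
    return candidates[np.argmax(acq)]

# Toy two-agent example on a unit-square dose grid (hypothetical data).
rng = np.random.default_rng(3)
doses_obs = rng.uniform(0, 1, size=(8, 2))
eff_obs = doses_obs.sum(1) - 0.6 * doses_obs.prod(1) + 0.05 * rng.standard_normal(8)
tox_obs = 0.4 * doses_obs.sum(1) + 0.05 * rng.standard_normal(8)
grid = np.array([[a, b] for a in np.linspace(0, 1, 21) for b in np.linspace(0, 1, 21)])
print("next dose combination to try:", next_dose(doses_obs, eff_obs, tox_obs, grid))
```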
The Jacobi prior can also handle data partitioned across servers worldwide, making it useful for distributed computing environments. Because the method runs faster while still predicting accurately, it is well suited to organizations that want to reduce their environmental impact and meet ESG standards. To show how well the Jacobi prior works, we conducted a detailed simulation study with four experiments, examining statistical consistency, accuracy, and speed. Additionally, we present two empirical studies. First, we thoroughly evaluate credit risk by studying default probability using data from the U.S. Small Business Administration (SBA). Second, we use the Jacobi prior to classify stars, quasars, and galaxies in a 3-class problem using multinomial logit regression on Sloan Digital Sky Survey data. We use different filters as features. All codes and datasets for this paper are available in the following GitHub repository: https://github.com/sourish-cmi/Jacobi-Prior/"}, "https://arxiv.org/abs/2404.11427": {"title": "Matern Correlation: A Panoramic Primer", "link": "https://arxiv.org/abs/2404.11427", "description": "arXiv:2404.11427v1 Announce Type: new \nAbstract: Matern correlation is of pivotal importance in spatial statistics and machine learning. This paper serves as a panoramic primer for this correlation with an emphasis on the exposition of its changing behavior and smoothness properties in response to the change of its two parameters. Such exposition is achieved through a series of simulation studies, the use of an interactive 3D visualization applet, and a practical modeling example, all tailored for a wide-ranging statistical audience. Meanwhile, the thorough understanding of these parameter-smoothness relationships, in turn, serves as a pragmatic guide for researchers in their real-world modeling endeavors, such as setting appropriate initial values for these parameters and parameter-fine-tuning in their Bayesian modeling practice or simulation studies involving the Matern correlation. Derived problems surrounding Matern, such as inconsistent parameter inference, extended forms of Matern and limitations of Matern, are also explored and surveyed to impart a panoramic view of this correlation."}, "https://arxiv.org/abs/2404.11510": {"title": "Improved bounds and inference on optimal regimes", "link": "https://arxiv.org/abs/2404.11510", "description": "arXiv:2404.11510v1 Announce Type: new \nAbstract: Point identification of causal effects requires strong assumptions that are unreasonable in many practical settings. However, informative bounds on these effects can often be derived under plausible assumptions. Even when these bounds are wide or cover null effects, they can guide practical decisions based on formal decision theoretic criteria. Here we derive new results on optimal treatment regimes in settings where the effect of interest is bounded. These results are driven by consideration of superoptimal regimes; we define regimes that leverage an individual's natural treatment value, which is typically ignored in the existing literature. We obtain (sharp) bounds for the value function of superoptimal regimes, and provide performance guarantees relative to conventional optimal regimes. As a case study, we consider a commonly studied Marginal Sensitivity Model and illustrate that the superoptimal regime can be identified when conventional optimal regimes are not. We similarly illustrate this property in an instrumental variable setting. 
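For the Matern correlation primer above (arXiv:2404.11427), a short numerical illustration of the correlation's dependence on its two parameters may be useful; the parameterization below is the standard one with length-scale ell and smoothness nu, and the nu = 0.5 case reduces to the exponential correlation.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_corr(d, ell=1.0, nu=1.5):
    """Matern correlation at distances d, with range ell and smoothness nu."""
    d = np.asarray(d, dtype=float)
    out = np.ones_like(d)                       # correlation is 1 at distance 0
    pos = d > 0
    z = np.sqrt(2.0 * nu) * d[pos] / ell
    out[pos] = (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return out

d = np.linspace(0.0, 3.0, 7)
for nu in (0.5, 1.5, 2.5):
    print(f"nu={nu}:", np.round(matern_corr(d, ell=1.0, nu=nu), 3))
print("exp(-d):", np.round(np.exp(-d), 3))      # matches the nu = 0.5 row
```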
Finally, we derive efficient estimators for upper and lower bounds on the superoptimal value in instrumental variable settings, building on recent results on covariate adjusted Balke-Pearl bounds. These estimators are applied to study the effect of prompt ICU admission on survival."}, "https://arxiv.org/abs/2404.11579": {"title": "Spatial Heterogeneous Additive Partial Linear Model: A Joint Approach of Bivariate Spline and Forest Lasso", "link": "https://arxiv.org/abs/2404.11579", "description": "arXiv:2404.11579v1 Announce Type: new \nAbstract: Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an efficient clustering approach to identify the model heterogeneity of the spatial additive partial linear model. Specifically, we aim to detect the spatially contiguous clusters based on the regression coefficients while introducing a spatially varying intercept to deal with the smooth spatial effect. On the one hand, to approximate the spatially varying intercept, we use the method of bivariate spline over triangulation, which can effectively handle the data from a complex domain. On the other hand, a novel fusion penalty termed the forest lasso is proposed to reveal the spatial clustering pattern. Our proposed fusion penalty has advantages in both estimation and computational efficiency when dealing with large spatial data. Theoretical properties of our estimator are established, and simulation results show that our approach can achieve more accurate estimation with a limited computation cost compared with the existing approaches. To illustrate its practical use, we apply our approach to analyze the spatial pattern of the relationship between land surface temperature measured by satellites and air temperature measured by ground stations in the United States."}, "https://arxiv.org/abs/2404.10883": {"title": "Automated Discovery of Functional Actual Causes in Complex Environments", "link": "https://arxiv.org/abs/2404.10883", "description": "arXiv:2404.10883v1 Announce Type: cross \nAbstract: Reinforcement learning (RL) algorithms often struggle to learn policies that generalize to novel situations due to issues such as causal confusion, overfitting to irrelevant factors, and failure to isolate control of state factors. These issues stem from a common source: a failure to accurately identify and exploit state-specific causal relationships in the environment. While some prior works in RL aim to identify these relationships explicitly, they rely on informal domain-specific heuristics such as spatial and temporal proximity. Actual causality offers a principled and general framework for determining the causes of particular events. However, existing definitions of actual cause often attribute causality to a large number of events, even if many of them rarely influence the outcome. Prior work on actual causality proposes normality as a solution to this problem, but its existing implementations are challenging to scale to complex and continuous-valued RL environments. This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes. 
We additionally introduce Joint Optimization for Actual Cause Inference (JACI), an algorithm that learns from observational data to infer functional actual causes. We demonstrate empirically that FAC agrees with known results on a suite of examples from the actual causality literature, and JACI identifies actual causes with significantly higher accuracy than existing heuristic methods in a set of complex, continuous-valued environments."}, "https://arxiv.org/abs/2404.10942": {"title": "What Hides behind Unfairness? Exploring Dynamics Fairness in Reinforcement Learning", "link": "https://arxiv.org/abs/2404.10942", "description": "arXiv:2404.10942v1 Announce Type: cross \nAbstract: In sequential decision-making problems involving sensitive attributes like race and gender, reinforcement learning (RL) agents must carefully consider long-term fairness while maximizing returns. Recent works have proposed many different types of fairness notions, but how unfairness arises in RL problems remains unclear. In this paper, we address this gap in the literature by investigating the sources of inequality through a causal lens. We first analyse the causal relationships governing the data generation process and decompose the effect of sensitive attributes on long-term well-being into distinct components. We then introduce a novel notion called dynamics fairness, which explicitly captures the inequality stemming from environmental dynamics, distinguishing it from those induced by decision-making or inherited from the past. This notion requires evaluating the expected changes in the next state and the reward induced by changing the value of the sensitive attribute while holding everything else constant. To quantitatively evaluate this counterfactual concept, we derive identification formulas that allow us to obtain reliable estimations from data. Extensive experiments demonstrate the effectiveness of the proposed techniques in explaining, detecting, and reducing inequality in reinforcement learning."}, "https://arxiv.org/abs/2404.11006": {"title": "Periodicity in New York State COVID-19 Hospitalizations Leveraged from the Variable Bandpass Periodic Block Bootstrap", "link": "https://arxiv.org/abs/2404.11006", "description": "arXiv:2404.11006v1 Announce Type: cross \nAbstract: The outbreak of the SARS-CoV-2 virus, which led to an unprecedented global pandemic, has underscored the critical importance of understanding seasonal patterns. This knowledge is fundamental for decision-making in healthcare and public health domains. Investigating the presence, intensity, and precise nature of seasonal trends, as well as these temporal patterns, is essential for forecasting future occurrences, planning interventions, and making informed decisions based on the evolution of events over time. This study employs the Variable Bandpass Periodic Block Bootstrap (VBPBB) to separate and analyze different periodic components by frequency in time series data, focusing on annually correlated (PC) principal components. Bootstrapping, a method used to estimate statistical sampling distributions through random sampling with replacement, is particularly useful in this context. Specifically, block bootstrapping, a model-independent resampling method suitable for time series data, is utilized. Its extensions are aimed at preserving the correlation structures inherent in PC processes. The VBPBB applies a bandpass filter to isolate the relevant PC frequency, thereby minimizing contamination from extraneous frequencies and noise. 
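For the Variable Bandpass Periodic Block Bootstrap abstract above (arXiv:2404.11006), here is a deliberately crude sketch of the underlying idea rather than the VBPBB procedure itself: bandpass-filter a daily series around the annual frequency, then resample whole year-long blocks with replacement to approximate the sampling distribution of the filtered seasonal pattern. The filter order, band edges, and simulated series are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, low, high, fs=1.0, order=2):
    """Zero-phase Butterworth bandpass filter keeping frequencies in [low, high]."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def periodic_block_bootstrap(x, period=365, n_boot=500, seed=0):
    """Resample year-long blocks with replacement; return mean seasonal curves."""
    rng = np.random.default_rng(seed)
    blocks = x[: (len(x) // period) * period].reshape(-1, period)
    idx = rng.integers(0, len(blocks), size=(n_boot, len(blocks)))
    return blocks[idx].mean(axis=1)

rng = np.random.default_rng(4)
t = np.arange(8 * 365)
series = 10 * np.sin(2 * np.pi * t / 365) + 5 * rng.standard_normal(len(t))
annual = bandpass(series, low=1 / 400, high=1 / 330)    # keep periods near 365 days
boot = periodic_block_bootstrap(annual)
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print("pointwise 95% band width on day 0:", round(float(hi[0] - lo[0]), 3))
```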
This approach significantly narrows the confidence intervals, enhancing the precision of estimated sampling distributions for the investigated periodic characteristics. Furthermore, we compared the outcomes of block bootstrapping for periodically correlated time series with VBPBB against those from more traditional bootstrapping methods. Our analysis shows VBPBB provides strong evidence of the existence of an annual seasonal PC pattern in hospitalization rates not detectible by other methods, providing timing and confidence intervals for their impact."}, "https://arxiv.org/abs/2404.11341": {"title": "The Causal Chambers: Real Physical Systems as a Testbed for AI Methodology", "link": "https://arxiv.org/abs/2404.11341", "description": "arXiv:2404.11341v1 Announce Type: cross \nAbstract: In some fields of AI, machine learning and statistics, the validation of new methods and algorithms is often hindered by the scarcity of suitable real-world datasets. Researchers must often turn to simulated data, which yields limited information about the applicability of the proposed methods to real problems. As a step forward, we have constructed two devices that allow us to quickly and inexpensively produce large datasets from non-trivial but well-understood physical systems. The devices, which we call causal chambers, are computer-controlled laboratories that allow us to manipulate and measure an array of variables from these physical systems, providing a rich testbed for algorithms from a variety of fields. We illustrate potential applications through a series of case studies in fields such as causal discovery, out-of-distribution generalization, change point detection, independent component analysis, and symbolic regression. For applications to causal inference, the chambers allow us to carefully perform interventions. We also provide and empirically validate a causal model of each chamber, which can be used as ground truth for different tasks. All hardware and software is made open source, and the datasets are publicly available at causalchamber.org or through the Python package causalchamber."}, "https://arxiv.org/abs/2208.03489": {"title": "Forecasting Algorithms for Causal Inference with Panel Data", "link": "https://arxiv.org/abs/2208.03489", "description": "arXiv:2208.03489v3 Announce Type: replace \nAbstract: Conducting causal inference with panel data is a core challenge in social science research. We adapt a deep neural architecture for time series forecasting (the N-BEATS algorithm) to more accurately impute the counterfactual evolution of a treated unit had treatment not occurred. Across a range of settings, the resulting estimator (``SyNBEATS'') significantly outperforms commonly employed methods (synthetic controls, two-way fixed effects), and attains comparable or more accurate performance compared to recently proposed methods (synthetic difference-in-differences, matrix completion). An implementation of this estimator is available for public use. Our results highlight how advances in the forecasting literature can be harnessed to improve causal inference in panel data settings."}, "https://arxiv.org/abs/2209.03218": {"title": "Local Projection Inference in High Dimensions", "link": "https://arxiv.org/abs/2209.03218", "description": "arXiv:2209.03218v3 Announce Type: replace \nAbstract: In this paper, we estimate impulse responses by local projections in high-dimensional settings. 
We use the desparsified (de-biased) lasso to estimate the high-dimensional local projections, while leaving the impulse response parameter of interest unpenalized. We establish the uniform asymptotic normality of the proposed estimator under general conditions. Finally, we demonstrate small sample performance through a simulation study and consider two canonical applications in macroeconomic research on monetary policy and government spending."}, "https://arxiv.org/abs/2209.10587": {"title": "DeepVARwT: Deep Learning for a VAR Model with Trend", "link": "https://arxiv.org/abs/2209.10587", "description": "arXiv:2209.10587v3 Announce Type: replace \nAbstract: The vector autoregressive (VAR) model has been used to describe the dependence within and across multiple time series. This is a model for stationary time series which can be extended to allow the presence of a deterministic trend in each series. Detrending the data either parametrically or nonparametrically before fitting the VAR model introduces additional errors into the subsequent model fitting. In this study, we propose a new approach called DeepVARwT that employs deep learning methodology for maximum likelihood estimation of the trend and the dependence structure at the same time. A Long Short-Term Memory (LSTM) network is used for this purpose. To ensure the stability of the model, we enforce the causality condition on the autoregressive coefficients using the transformation of Ansley & Kohn (1986). We provide a simulation study and an application to real data. In the simulation study, we use realistic trend functions generated from real data and compare the estimates with true function/parameter values. In the real data application, we compare the prediction performance of this model with state-of-the-art models in the literature."}, "https://arxiv.org/abs/2305.00218": {"title": "Subdata selection for big data regression: an improved approach", "link": "https://arxiv.org/abs/2305.00218", "description": "arXiv:2305.00218v2 Announce Type: replace \nAbstract: In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may suffer due to the large sample size, since they involve inverting huge data matrices or because the data cannot fit in memory. Proposed approaches are based on selecting representative subdata to run the regression. Existing approaches select the subdata using information criteria and/or properties from orthogonal arrays. In the present paper we improve on existing algorithms by providing a new algorithm based on a D-optimality approach. We provide simulation evidence for its performance. Evidence about the parameters of the proposed algorithm is also provided in order to clarify the trade-offs between execution time and information gain. Real data applications are also provided."}, "https://arxiv.org/abs/2305.08201": {"title": "Efficient Computation of High-Dimensional Penalized Generalized Linear Mixed Models by Latent Factor Modeling of the Random Effects", "link": "https://arxiv.org/abs/2305.08201", "description": "arXiv:2305.08201v2 Announce Type: replace \nAbstract: Modern biomedical datasets are increasingly high dimensional and exhibit complex correlation structures. Generalized Linear Mixed Models (GLMMs) have long been employed to account for such dependencies. 
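For the high-dimensional local projection abstract above (arXiv:2209.03218), the sketch below shows only the baseline local-projection regression: for each horizon h, regress y_{t+h} on the shock x_t plus a few lags and read the impulse response off the shock coefficient. The desparsified-lasso machinery for high-dimensional controls and the uniform inference results are not reproduced here; the simulated data are illustrative.

```python
import numpy as np

def local_projection_irf(y, x, horizons=8, p=4):
    """OLS local projections: IRF of y to x at horizons 0..horizons, with p lags as controls."""
    T = len(y)
    irf = []
    for h in range(horizons + 1):
        rows = range(p, T - h)
        Y = np.array([y[t + h] for t in rows])
        X = np.array([[1.0, x[t]]
                      + [y[t - l] for l in range(1, p + 1)]
                      + [x[t - l] for l in range(1, p + 1)] for t in rows])
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        irf.append(beta[1])                     # coefficient on the shock x_t
    return np.array(irf)

rng = np.random.default_rng(5)
T = 600
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.7 * y[t - 1] + 0.5 * x[t] + rng.standard_normal()
print(np.round(local_projection_irf(y, x), 2))  # should decay roughly like 0.5 * 0.7**h
```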
However, proper specification of the fixed and random effects in GLMMs is increasingly difficult in high dimensions, and computational complexity grows with increasing dimension of the random effects. We present a novel reformulation of the GLMM using a factor model decomposition of the random effects, enabling scalable computation of GLMMs in high dimensions by reducing the latent space from a large number of random effects to a smaller set of latent factors. We also extend our prior work to estimate model parameters using a modified Monte Carlo Expectation Conditional Minimization algorithm, allowing us to perform variable selection on both the fixed and random effects simultaneously. We show through simulation that through this factor model decomposition, our method can fit high dimensional penalized GLMMs faster than comparable methods and more easily scale to larger dimensions not previously seen in existing approaches."}, "https://arxiv.org/abs/1710.00915": {"title": "Change Acceleration and Detection", "link": "https://arxiv.org/abs/1710.00915", "description": "arXiv:1710.00915v4 Announce Type: replace-cross \nAbstract: A novel sequential change detection problem is proposed, in which the goal is to not only detect but also accelerate the change. Specifically, it is assumed that the sequentially collected observations are responses to treatments selected in real time. The assigned treatments determine the pre-change and post-change distributions of the responses and also influence when the change happens. The goal is to find a treatment assignment rule and a stopping rule that minimize the expected total number of observations subject to a user-specified bound on the false alarm probability. The optimal solution is obtained under a general Markovian change-point model. Moreover, an alternative procedure is proposed, whose applicability is not restricted to Markovian change-point models and whose design requires minimal computation. For a large class of change-point models, the proposed procedure is shown to achieve the optimal performance in an asymptotic sense. Finally, its performance is found in simulation studies to be comparable to the optimal, uniformly with respect to the error probability."}, "https://arxiv.org/abs/1908.01109": {"title": "The Use of Binary Choice Forests to Model and Estimate Discrete Choices", "link": "https://arxiv.org/abs/1908.01109", "description": "arXiv:1908.01109v5 Announce Type: replace-cross \nAbstract: Problem definition. In retailing, discrete choice models (DCMs) are commonly used to capture the choice behavior of customers when offered an assortment of products. When estimating DCMs using transaction data, flexible models (such as machine learning models or nonparametric models) are typically not interpretable and hard to estimate, while tractable models (such as the multinomial logit model) tend to misspecify the complex behavior represented in the data. Methodology/results. In this study, we use a forest of binary decision trees to represent DCMs. This approach is based on random forests, a popular machine learning algorithm. The resulting model is interpretable: the decision trees can explain the decision-making process of customers during the purchase. We show that our approach can predict the choice probability of any DCM consistently and thus never suffers from misspecification. Moreover, our algorithm predicts assortments unseen in the training data. The mechanism and errors can be theoretically analyzed. 
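For the binary choice forest abstract above (arXiv:1908.01109), a hedged sketch in the same spirit, not the authors' algorithm: train an off-the-shelf random forest whose features are the binary assortment vector and whose label is the chosen product, and read estimated choice probabilities from the predicted class probabilities. The simulated multinomial-logit ground truth and product utilities are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
n_products, n_obs = 5, 5000
util = np.array([1.0, 0.6, 0.3, 0.1, -0.2])             # hypothetical product utilities

# Simulate multinomial-logit choices from random assortments (index 5 = no purchase).
assortments = rng.integers(0, 2, size=(n_obs, n_products))
choices = np.zeros(n_obs, dtype=int)
for i, offered in enumerate(assortments):
    w = np.exp(util) * offered                           # offered products only
    probs = np.append(w, 1.0) / (w.sum() + 1.0)          # last slot = outside option
    choices[i] = rng.choice(n_products + 1, p=probs)

# Features are the assortment vector; predicted class probabilities act as
# estimated choice probabilities for any queried assortment.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(assortments, choices)
offer = np.array([[1, 1, 0, 0, 1]])
print(dict(zip(forest.classes_, np.round(forest.predict_proba(offer)[0], 3))))
```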
We also prove that the random forest can recover preference rankings of customers thanks to the splitting criterion such as the Gini index and information gain ratio. Managerial implications. The framework has unique practical advantages. It can capture customers' behavioral patterns such as irrationality or sequential searches when purchasing a product. It handles nonstandard formats of training data that result from aggregation. It can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product. It can also incorporate price information and customer features. Our numerical experiments using synthetic and real data show that using random forests to estimate customer choices can outperform existing methods."}, "https://arxiv.org/abs/2204.01884": {"title": "Policy Learning with Competing Agents", "link": "https://arxiv.org/abs/2204.01884", "description": "arXiv:2204.01884v4 Announce Type: replace-cross \nAbstract: Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating estimation of the optimal policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy gradient. In a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior."}, "https://arxiv.org/abs/2211.07451": {"title": "Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain", "link": "https://arxiv.org/abs/2211.07451", "description": "arXiv:2211.07451v3 Announce Type: replace-cross \nAbstract: Forecasts of regional electricity net-demand, consumption minus embedded generation, are an essential input for reliable and economic power system operation, and energy trading. While such forecasts are typically performed region by region, operations such as managing power flows require spatially coherent joint forecasts, which account for cross-regional dependencies. Here, we forecast the joint distribution of net-demand across the 14 regions constituting Great Britain's electricity network. Joint modelling is complicated by the fact that the net-demand variability within each region, and the dependencies between regions, vary with temporal, socio-economical and weather-related factors. We accommodate for these characteristics by proposing a multivariate Gaussian model based on a modified Cholesky parametrisation, which allows us to model each unconstrained parameter via an additive model. Given that the number of model parameters and covariates is large, we adopt a semi-automated approach to model selection, based on gradient boosting. 
In addition to comparing the forecasting performance of several versions of the proposed model with that of two non-Gaussian copula-based models, we visually explore the model output to interpret how the covariates affect net-demand variability and dependencies.\n The code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.7315105, while methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM."}, "https://arxiv.org/abs/2305.08204": {"title": "glmmPen: High Dimensional Penalized Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2305.08204", "description": "arXiv:2305.08204v2 Announce Type: replace-cross \nAbstract: Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process since model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower-dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first to select fixed and random effects in higher dimension using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo Expectation Conditional Minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show our method has good performance in selecting both the fixed and random effects in high dimensional GLMMs."}, "https://arxiv.org/abs/2308.15838": {"title": "Adaptive Lasso, Transfer Lasso, and Beyond: An Asymptotic Perspective", "link": "https://arxiv.org/abs/2308.15838", "description": "arXiv:2308.15838v2 Announce Type: replace-cross \nAbstract: This paper presents a comprehensive exploration of the theoretical properties inherent in the Adaptive Lasso and the Transfer Lasso. The Adaptive Lasso, a well-established method, employs regularization divided by initial estimators and is characterized by asymptotic normality and variable selection consistency. In contrast, the recently proposed Transfer Lasso employs regularization subtracted by initial estimators with the demonstrated capacity to curtail non-asymptotic estimation errors. A pivotal question thus emerges: Given the distinct ways the Adaptive Lasso and the Transfer Lasso employ initial estimators, what benefits or drawbacks does this disparity confer upon each method? This paper conducts a theoretical examination of the asymptotic properties of the Transfer Lasso, thereby elucidating its differentiation from the Adaptive Lasso. Informed by the findings of this analysis, we introduce a novel method, one that amalgamates the strengths and compensates for the weaknesses of both methods. 
The paper concludes with validations of our theory and comparisons of the methods via simulation experiments."}, "https://arxiv.org/abs/2404.11678": {"title": "Corrected Correlation Estimates for Meta-Analysis", "link": "https://arxiv.org/abs/2404.11678", "description": "arXiv:2404.11678v1 Announce Type: new \nAbstract: Meta-analysis allows rigorous aggregation of estimates and uncertainty across multiple studies. When a given study reports multiple estimates, such as log odds ratios (ORs) or log relative risks (RRs) across exposure groups, accounting for within-study correlations improves accuracy and efficiency of meta-analytic results. The canonical approaches of Greenland-Longnecker and Hamling estimate pseudo cases and non-cases for exposure groups to obtain within-study correlations. However, currently available implementations for both methods fail on simple examples.\n We review both GL and Hamling methods through the lens of optimization. For ORs, we provide modifications of each approach that ensure convergence for any feasible inputs. For GL, this is achieved through a new connection to entropic minimization. For Hamling, a modification leads to a provably solvable equivalent set of equations given a specific initialization. For each, we provide implementations guaranteed to work for any feasible input.\n For RRs, we show the new GL approach is always guaranteed to succeed, but any Hamling approach may fail: we give counter-examples where no solutions exist. We derive a sufficient condition on reported RRs that guarantees success when reported variances are all equal."}, "https://arxiv.org/abs/2404.11713": {"title": "Propensity Score Analysis with Guaranteed Subgroup Balance", "link": "https://arxiv.org/abs/2404.11713", "description": "arXiv:2404.11713v1 Announce Type: new \nAbstract: Estimating the causal treatment effects by subgroups is important in observational studies when the treatment effect heterogeneity may be present. Existing propensity score methods rely on a correctly specified propensity score model. Model misspecification results in biased treatment effect estimation and covariate imbalance. We proposed a new algorithm, the propensity score analysis with guaranteed subgroup balance (G-SBPS), to achieve covariate mean balance in all subgroups. We further incorporated nonparametric kernel regression for the propensity scores and developed a kernelized G-SBPS (kG-SBPS) to improve the subgroup mean balance of covariate transformations in a rich functional class. This extension is more robust to propensity score model misspecification. Extensive numerical studies showed that G-SBPS and kG-SBPS improve both subgroup covariate balance and subgroup treatment effect estimation, compared to existing approaches. We applied G-SBPS and kG-SBPS to a dataset on right heart catheterization to estimate the subgroup average treatment effects on the hospital length of stay and a dataset on diabetes self-management training to estimate the subgroup average treatment effects for the treated on the hospitalization rate."}, "https://arxiv.org/abs/2404.11739": {"title": "Testing Mechanisms", "link": "https://arxiv.org/abs/2404.11739", "description": "arXiv:2404.11739v1 Announce Type: new \nAbstract: Economists are often interested in the mechanisms by which a particular treatment affects an outcome. 
This paper develops tests for the ``sharp null of full mediation'' that the treatment $D$ operates on the outcome $Y$ only through a particular conjectured mechanism (or set of mechanisms) $M$. A key observation is that if $D$ is randomly assigned and has a monotone effect on $M$, then $D$ is a valid instrumental variable for the local average treatment effect (LATE) of $M$ on $Y$. Existing tools for testing the validity of the LATE assumptions can thus be used to test the sharp null of full mediation when $M$ and $D$ are binary. We develop a more general framework that allows one to test whether the effect of $D$ on $Y$ is fully explained by a potentially multi-valued and multi-dimensional set of mechanisms $M$, allowing for relaxations of the monotonicity assumption. We further provide methods for lower-bounding the size of the alternative mechanisms when the sharp null is rejected. An advantage of our approach relative to existing tools for mediation analysis is that it does not require stringent assumptions about how $M$ is assigned; on the other hand, our approach helps to answer different questions than traditional mediation analysis by focusing on the sharp null rather than estimating average direct and indirect effects. We illustrate the usefulness of the testable implications in two empirical applications."}, "https://arxiv.org/abs/2404.11747": {"title": "Spatio-temporal patterns of diurnal temperature: a random matrix approach I-case of India", "link": "https://arxiv.org/abs/2404.11747", "description": "arXiv:2404.11747v1 Announce Type: new \nAbstract: We consider the spatio-temporal gridded daily diurnal temperature range (DTR) data across India during the 72-year period 1951--2022. We augment this data with information on the El Nino-Southern Oscillation (ENSO) and on the climatic regions (Stamp's and Koeppen's classification) and four seasons of India.\n We use various matrix theory approaches to trim out strong but routine signals, random matrix theory to remove noise, and novel empirical generalised singular-value distributions to establish retention of essential signals in the trimmed data. We make use of the spatial Bergsma statistics to measure spatial association and identify temporal change points in the spatial-association.\n In particular, our investigation captures a yet unknown change-point over the 72 years under study with drastic changes in spatial-association of DTR in India. It also brings out changes in spatial association with regard to ENSO.\n We conclude that while studying/modelling Indian DTR data, due consideration should be granted to the strong spatial association that is being persistently exhibited over decades, and provision should be kept for potential change points in the temporal behaviour, which in turn can bring moderate to dramatic changes in the spatial association pattern.\n Some of our analysis also reaffirms the conclusions made by other authors, regarding spatial and temporal behavior of DTR, adding our own insights. We consider the data from the yearly, seasonal and climatic zones points of view, and discover several new and interesting statistical structures which should be of interest, especially to climatologists and statisticians. 
Our methods are not country specific and could be used profitably for DTR data from other geographical areas."}, "https://arxiv.org/abs/2404.11767": {"title": "Regret Analysis in Threshold Policy Design", "link": "https://arxiv.org/abs/2404.11767", "description": "arXiv:2404.11767v1 Announce Type: new \nAbstract: Threshold policies are targeting mechanisms that assign treatments based on whether an observable characteristic exceeds a certain threshold. They are widespread across multiple domains, such as welfare programs, taxation, and clinical medicine. This paper addresses the problem of designing threshold policies using experimental data, when the goal is to maximize the population welfare. First, I characterize the regret (a measure of policy optimality) of the Empirical Welfare Maximizer (EWM) policy, popular in the literature. Next, I introduce the Smoothed Welfare Maximizer (SWM) policy, which improves the EWM's regret convergence rate under an additional smoothness condition. The two policies are compared studying how differently their regrets depend on the population distribution, and investigating their finite sample performances through Monte Carlo simulations. In many contexts, the welfare guaranteed by the novel SWM policy is larger than with the EWM. An empirical illustration demonstrates how the treatment recommendation of the two policies may in practice notably differ."}, "https://arxiv.org/abs/2404.11781": {"title": "Asymmetric canonical correlation analysis of Riemannian and high-dimensional data", "link": "https://arxiv.org/abs/2404.11781", "description": "arXiv:2404.11781v1 Announce Type: new \nAbstract: In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture."}, "https://arxiv.org/abs/2404.11813": {"title": "Detection of a structural break in intraday volatility pattern", "link": "https://arxiv.org/abs/2404.11813", "description": "arXiv:2404.11813v1 Announce Type: new \nAbstract: We develop theory leading to testing procedures for the presence of a change point in the intraday volatility pattern. The new theory is developed in the framework of Functional Data Analysis. It is based on a model akin to the stochastic volatility model for scalar point-to-point returns. In our context, we study intraday curves, one curve per trading day. 
After postulating a suitable model for such functional data, we present three tests focusing, respectively, on changes in the shape, the magnitude and arbitrary changes in the sequences of the curves of interest. We justify the respective procedures by showing that they have asymptotically correct size and by deriving consistency rates for all tests. These rates involve the sample size (the number of trading days) and the grid size (the number of observations per day). We also derive the corresponding change point estimators and their consistency rates. All procedures are additionally validated by a simulation study and an application to US stocks."}, "https://arxiv.org/abs/2404.11839": {"title": "(Empirical) Bayes Approaches to Parallel Trends", "link": "https://arxiv.org/abs/2404.11839", "description": "arXiv:2404.11839v1 Announce Type: new \nAbstract: We consider Bayes and Empirical Bayes (EB) approaches for dealing with violations of parallel trends. In the Bayes approach, the researcher specifies a prior over both the pre-treatment violations of parallel trends $\\delta_{pre}$ and the post-treatment violations $\\delta_{post}$. The researcher then updates their posterior about the post-treatment bias $\\delta_{post}$ given an estimate of the pre-trends $\\delta_{pre}$. This allows them to form posterior means and credible sets for the treatment effect of interest, $\\tau_{post}$. In the EB approach, the prior on the violations of parallel trends is learned from the pre-treatment observations. We illustrate these approaches in two empirical applications."}, "https://arxiv.org/abs/2404.12319": {"title": "Marginal Analysis of Count Time Series in the Presence of Missing Observations", "link": "https://arxiv.org/abs/2404.12319", "description": "arXiv:2404.12319v1 Announce Type: new \nAbstract: Time series in real-world applications often have missing observations, making typical analytical methods unsuitable. One method for dealing with missing data is the concept of amplitude modulation. While this principle works with any data, here, missing data for unbounded and bounded count time series are investigated, where tailor-made dispersion and skewness statistics are used for model diagnostics. General closed-form asymptotic formulas are derived for such statistics with only weak assumptions on the underlying process. Moreover, closed-form formulas are derived for the popular special cases of Poisson and binomial autoregressive processes, always under the assumption that missingness occurs. The finite-sample performances of the considered asymptotic approximations are analyzed with simulations. The practical application of the corresponding dispersion and skewness tests under missing data is demonstrated with three real-data examples."}, "https://arxiv.org/abs/2404.11675": {"title": "Decomposition of Longitudinal Disparities: an Application to the Fetal Growth-Singletons Study", "link": "https://arxiv.org/abs/2404.11675", "description": "arXiv:2404.11675v1 Announce Type: cross \nAbstract: Addressing health disparities among different demographic groups is a key challenge in public health. Despite many efforts, there is still a gap in understanding how these disparities unfold over time. Our paper focuses on this overlooked longitudinal aspect, which is crucial in both clinical and public health settings. 
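For the (Empirical) Bayes parallel-trends abstract above (arXiv:2404.11839), a stylized one-period Gaussian example may clarify the updating step; it is not the paper's general framework. A joint normal prior on (delta_pre, delta_post) is conditioned on a noisy pre-trend estimate, and the implied posterior for delta_post is subtracted from the post-period event-study coefficient to produce a posterior mean and credible interval for tau_post. The prior standard deviation and correlation are assumptions.

```python
import numpy as np

def bayes_parallel_trends(dpre_hat, se_pre, bpost_hat, se_post,
                          prior_sd=0.5, prior_corr=0.8):
    # Joint prior: (delta_pre, delta_post) ~ N(0, Sigma), with equal variances.
    s11 = prior_sd ** 2
    s12 = prior_corr * prior_sd ** 2
    s22 = prior_sd ** 2
    # Gaussian conditioning on dpre_hat = delta_pre + N(0, se_pre^2).
    gain = s12 / (s11 + se_pre ** 2)
    dpost_mean = gain * dpre_hat
    dpost_var = s22 - gain * s12
    # Debias the post-period coefficient; combine sampling and posterior variance.
    tau_mean = bpost_hat - dpost_mean
    tau_sd = np.sqrt(se_post ** 2 + dpost_var)
    return tau_mean, (tau_mean - 1.96 * tau_sd, tau_mean + 1.96 * tau_sd)

# Hypothetical event-study inputs: pre-trend estimate 0.10 (SE 0.05) and
# post-period coefficient 0.40 (SE 0.08).
print(bayes_parallel_trends(0.10, 0.05, 0.40, 0.08))
```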
In this paper, we introduce a longitudinal disparity decomposition method that decomposes disparities into three components: the explained disparity linked to differences in the exploratory variables' conditional distribution when the modifier distribution is identical between majority and minority groups, the explained disparity that emerges specifically from the unequal distribution of the modifier and its interaction with covariates, and the unexplained disparity. The proposed method offers a dynamic alternative to the traditional Peters-Belson decomposition approach, tackling both the potential reduction in disparity if the covariate distributions of minority groups matched those of the majority group and the evolving nature of disparity over time. We apply the proposed approach to a fetal growth study to gain insights into disparities between different race/ethnicity groups in fetal developmental progress throughout the course of pregnancy."}, "https://arxiv.org/abs/2404.11922": {"title": "Redefining the Shortest Path Problem Formulation of the Linear Non-Gaussian Acyclic Model: Pairwise Likelihood Ratios, Prior Knowledge, and Path Enumeration", "link": "https://arxiv.org/abs/2404.11922", "description": "arXiv:2404.11922v1 Announce Type: cross \nAbstract: Effective causal discovery is essential for learning the causal graph from observational data. The linear non-Gaussian acyclic model (LiNGAM) operates under the assumption of a linear data generating process with non-Gaussian noise in determining the causal graph. Its assumption of unmeasured confounders being absent, however, poses practical limitations. In response, empirical research has shown that the reformulation of LiNGAM as a shortest path problem (LiNGAM-SPP) addresses this limitation. Within LiNGAM-SPP, mutual information is chosen to serve as the measure of independence. A challenge is introduced - parameter tuning is now needed due to its reliance on kNN mutual information estimators. The paper proposes a threefold enhancement to the LiNGAM-SPP framework.\n First, the need for parameter tuning is eliminated by using the pairwise likelihood ratio in lieu of kNN-based mutual information. This substitution is validated on a general data generating process and benchmark real-world data sets, outperforming existing methods especially when given a larger set of features. The incorporation of prior knowledge is then enabled by a node-skipping strategy implemented on the graph representation of all causal orderings to eliminate violations based on the provided input of relative orderings. Flexibility relative to existing approaches is achieved. Last among the three enhancements is the utilization of the distribution of paths in the graph representation of all causal orderings. From this, crucial properties of the true causal graph such as the presence of unmeasured confounders and sparsity may be inferred. To some extent, the expected performance of the causal discovery algorithm may be predicted. 
The refinements above advance the practicality and performance of LiNGAM-SPP, showcasing the potential of graph-search-based methodologies in advancing causal discovery."}, "https://arxiv.org/abs/2404.12181": {"title": "Estimation of the invariant measure of a multidimensional diffusion from noisy observations", "link": "https://arxiv.org/abs/2404.12181", "description": "arXiv:2404.12181v1 Announce Type: cross \nAbstract: We introduce a new approach for estimating the invariant density of a multidimensional diffusion when dealing with high-frequency observations blurred by independent noises. We consider the intermediate regime, where observations occur at discrete time instances $k\\Delta_n$ for $k=0,\\dots,n$, under the conditions $\\Delta_n\\to 0$ and $n\\Delta_n\\to\\infty$. Our methodology involves the construction of a kernel density estimator that uses a pre-averaging technique to proficiently remove noise from the data while preserving the analytical characteristics of the underlying signal and its asymptotic properties. The rate of convergence of our estimator depends on both the anisotropic regularity of the density and the intensity of the noise. We establish conditions on the intensity of the noise that ensure the recovery of convergence rates similar to those achievable without any noise. Furthermore, we prove a Bernstein concentration inequality for our estimator, from which we derive an adaptive procedure for the kernel bandwidth selection."}, "https://arxiv.org/abs/2404.12238": {"title": "Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation", "link": "https://arxiv.org/abs/2404.12238", "description": "arXiv:2404.12238v1 Announce Type: cross \nAbstract: In recent years, there has been a growing interest in using machine learning techniques for the estimation of treatment effects. Most of the best-performing methods rely on representation learning strategies that encourage shared behavior among potential outcomes to increase the precision of treatment effect estimates. In this paper we discuss and classify these models in terms of their algorithmic inductive biases and present a new model, NN-CGC, that considers additional information from the causal graph. NN-CGC tackles bias resulting from spurious variable interactions by implementing novel constraints on models, and it can be integrated with other representation learning methods. We test the effectiveness of our method using three different base models on common benchmarks. Our results indicate that our model constraints lead to significant improvements, achieving new state-of-the-art results in treatment effects estimation. We also show that our method is robust to imperfect causal graphs and that using partial causal information is preferable to ignoring it."}, "https://arxiv.org/abs/2404.12290": {"title": "Debiased Distribution Compression", "link": "https://arxiv.org/abs/2404.12290", "description": "arXiv:2404.12290v1 Announce Type: cross \nAbstract: Modern compression methods can summarize a target distribution $\\mathbb{P}$ more succinctly than i.i.d. sampling but require access to a low-bias input sequence like a Markov chain converging quickly to $\\mathbb{P}$. We introduce a new suite of compression methods suitable for compression with biased input sequences. 
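For the invariant-density abstract above (arXiv:2404.12181), a hedged sketch of the pre-averaging idea only, not the paper's estimator or its adaptive bandwidth rule: average consecutive blocks of the noisy high-frequency observations to damp the measurement error, then apply an ordinary kernel density estimator to the block means. The Ornstein-Uhlenbeck toy example and block length are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def preaveraged_kde(obs, block=50):
    """KDE of the invariant density built from block means of noisy observations."""
    n_blocks = len(obs) // block
    block_means = obs[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)
    return gaussian_kde(block_means)

# Toy example: an Ornstein-Uhlenbeck path observed with additive noise; its
# invariant density is N(0, sigma^2 / (2 * theta)).
rng = np.random.default_rng(7)
theta, sigma, dt, n = 1.0, 1.0, 0.001, 200_000
shocks = rng.standard_normal(n)
x = np.zeros(n)
for k in range(1, n):
    x[k] = x[k - 1] - theta * x[k - 1] * dt + sigma * np.sqrt(dt) * shocks[k]
obs = x + 0.3 * rng.standard_normal(n)            # noisy high-frequency observations
kde = preaveraged_kde(obs, block=50)
grid = np.array([-1.0, 0.0, 1.0])
true = np.exp(-grid ** 2 * theta / sigma ** 2) / np.sqrt(np.pi * sigma ** 2 / theta)
print("estimated:", np.round(kde(grid), 3), "  true:", np.round(true, 3))
```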
Given $n$ points targeting the wrong distribution and quadratic time, Stein Kernel Thinning (SKT) returns $\\sqrt{n}$ equal-weighted points with $\\widetilde{O}(n^{-1/2})$ maximum mean discrepancy (MMD) to $\\mathbb {P}$. For larger-scale compression tasks, Low-rank SKT achieves the same feat in sub-quadratic time using an adaptive low-rank debiasing procedure that may be of independent interest. For downstream tasks that support simplex or constant-preserving weights, Stein Recombination and Stein Cholesky achieve even greater parsimony, matching the guarantees of SKT with as few as $\\operatorname{poly-log}(n)$ weighted points. Underlying these advances are new guarantees for the quality of simplex-weighted coresets, the spectral decay of kernel matrices, and the covering numbers of Stein kernel Hilbert spaces. In our experiments, our techniques provide succinct and accurate posterior summaries while overcoming biases due to burn-in, approximate Markov chain Monte Carlo, and tempering."}, "https://arxiv.org/abs/2306.00389": {"title": "R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models", "link": "https://arxiv.org/abs/2306.00389", "description": "arXiv:2306.00389v2 Announce Type: replace-cross \nAbstract: Models with random effects, such as generalised linear mixed models (GLMMs), are often used for analysing clustered data. Parameter inference with these models is difficult because of the presence of cluster-specific random effects, which must be integrated out when evaluating the likelihood function. Here, we propose a sequential variational Bayes algorithm, called Recursive Variational Gaussian Approximation for Latent variable models (R-VGAL), for estimating parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially, requires only a single pass through the data, and can provide parameter updates as new data are collected without the need of re-processing the previous data. At each update, the R-VGAL algorithm requires the gradient and Hessian of a \"partial\" log-likelihood function evaluated at the new observation, which are generally not available in closed form for GLMMs. To circumvent this issue, we propose using an importance-sampling-based approach for estimating the gradient and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be unstable when traversing the first few data points, but that this issue can be mitigated by using a variant of variational tempering in the initial steps of the algorithm. Through illustrations on both simulated and real datasets, we show that R-VGAL provides good approximations to the exact posterior distributions, that it can be made robust through tempering, and that it is computationally efficient."}, "https://arxiv.org/abs/2311.03343": {"title": "Distribution-uniform anytime-valid sequential inference", "link": "https://arxiv.org/abs/2311.03343", "description": "arXiv:2311.03343v2 Announce Type: replace-cross \nAbstract: Are asymptotic confidence sequences and anytime $p$-values uniformly valid for a nontrivial class of distributions $\\mathcal{P}$? We give a positive answer to this question by deriving distribution-uniform anytime-valid inference procedures. Historically, anytime-valid methods -- including confidence sequences, anytime $p$-values, and sequential hypothesis tests that enable inference at stopping times -- have been justified nonasymptotically. 
Nevertheless, asymptotic procedures such as those based on the central limit theorem occupy an important part of statistical toolbox due to their simplicity, universality, and weak assumptions. While recent work has derived asymptotic analogues of anytime-valid methods with the aforementioned benefits, these were not shown to be $\\mathcal{P}$-uniform, meaning that their asymptotics are not uniformly valid in a class of distributions $\\mathcal{P}$. Indeed, the anytime-valid inference literature currently has no central limit theory to draw from that is both uniform in $\\mathcal{P}$ and in the sample size $n$. This paper fills that gap by deriving a novel $\\mathcal{P}$-uniform strong Gaussian approximation theorem. We apply some of these results to obtain an anytime-valid test of conditional independence without the Model-X assumption, as well as a $\\mathcal{P}$-uniform law of the iterated logarithm."}, "https://arxiv.org/abs/2404.12462": {"title": "Axiomatic modeling of fixed proportion technologies", "link": "https://arxiv.org/abs/2404.12462", "description": "arXiv:2404.12462v1 Announce Type: new \nAbstract: Understanding input substitution and output transformation possibilities is critical for efficient resource allocation and firm strategy. There are important examples of fixed proportion technologies where certain inputs are non-substitutable and/or certain outputs are non-transformable. However, there is widespread confusion about the appropriate modeling of fixed proportion technologies in data envelopment analysis. We point out and rectify several misconceptions in the existing literature, and show how fixed proportion technologies can be correctly incorporated into the axiomatic framework. A Monte Carlo study is performed to demonstrate the proposed solution."}, "https://arxiv.org/abs/2404.12463": {"title": "Spatially Selected and Dependent Random Effects for Small Area Estimation with Application to Rent Burden", "link": "https://arxiv.org/abs/2404.12463", "description": "arXiv:2404.12463v1 Announce Type: new \nAbstract: Area-level models for small area estimation typically rely on areal random effects to shrink design-based direct estimates towards a model-based predictor. Incorporating the spatial dependence of the random effects into these models can further improve the estimates when there are not enough covariates to fully account for spatial dependence of the areal means. A number of recent works have investigated models that include random effects for only a subset of areas, in order to improve the precision of estimates. However, such models do not readily handle spatial dependence. In this paper, we introduce a model that accounts for spatial dependence in both the random effects as well as the latent process that selects the effects. We show how this model can significantly improve predictive accuracy via an empirical simulation study based on data from the American Community Survey, and illustrate its properties via an application to estimate county-level median rent burden."}, "https://arxiv.org/abs/2404.12499": {"title": "A Multivariate Copula-based Bayesian Framework for Doping Detection", "link": "https://arxiv.org/abs/2404.12499", "description": "arXiv:2404.12499v1 Announce Type: new \nAbstract: Doping control is an essential component of anti-doping organizations for protecting clean sports competitions. Since 2009, this mission has been complemented worldwide by the Athlete Biological Passport (ABP), used to monitor athletes' individual profiles over time. 
The practical implementation of the ABP is based on a Bayesian framework, called ADAPTIVE, intended to identify individual reference ranges outside of which an observation may indicate doping abuse. Currently, this method follows a univariate approach, relying on simultaneous univariate analysis of different markers. This work extends the ADAPTIVE method to a multivariate testing framework, making use of copula models to couple the marginal distribution of biomarkers with their dependency structure. After introducing the proposed copula-based hierarchical model, we discuss our approach to inference, grounded in a Bayesian spirit, and present an extension to multidimensional predictive reference regions. Focusing on the hematological module of the ABP, we evaluate the proposed framework in both data-driven simulations and real data."}, "https://arxiv.org/abs/2404.12581": {"title": "Two-step Estimation of Network Formation Models with Unobserved Heterogeneities and Strategic Interactions", "link": "https://arxiv.org/abs/2404.12581", "description": "arXiv:2404.12581v1 Announce Type: new \nAbstract: In this paper, I characterize the network formation process as a static game of incomplete information, where the latent payoff of forming a link between two individuals depends on the structure of the network, as well as private information on agents' attributes. I allow agents' private unobserved attributes to be correlated with observed attributes through individual fixed effects. Using data from a single large network, I propose a two-step estimator for the model primitives. In the first step, I estimate agents' equilibrium beliefs of other people's choice probabilities. In the second step, I plug in the first-step estimator to the conditional choice probability expression and estimate the model parameters and the unobserved individual fixed effects together using Joint MLE. Assuming that the observed attributes are discrete, I showed that the first step estimator is uniformly consistent with rate $N^{-1/4}$, where $N$ is the total number of linking proposals. I also show that the second-step estimator converges asymptotically to a normal distribution at the same rate."}, "https://arxiv.org/abs/2404.12592": {"title": "Integer Programming for Learning Directed Acyclic Graphs from Non-identifiable Gaussian Models", "link": "https://arxiv.org/abs/2404.12592", "description": "arXiv:2404.12592v1 Announce Type: new \nAbstract: We study the problem of learning directed acyclic graphs from continuous observational data, generated according to a linear Gaussian structural equation model. State-of-the-art structure learning methods for this setting have at least one of the following shortcomings: i) they cannot provide optimality guarantees and can suffer from learning sub-optimal models; ii) they rely on the stringent assumption that the noise is homoscedastic, and hence the underlying model is fully identifiable. We overcome these shortcomings and develop a computationally efficient mixed-integer programming framework for learning medium-sized problems that accounts for arbitrary heteroscedastic noise. We present an early stopping criterion under which we can terminate the branch-and-bound procedure to achieve an asymptotically optimal solution and establish the consistency of this approximate solution. 
In addition, we show via numerical experiments that our method outperforms three state-of-the-art algorithms and is robust to noise heteroscedasticity, whereas the performance of the competing methods deteriorates under strong violations of the identifiability assumption. The software implementation of our method is available as the Python package \\emph{micodag}."}, "https://arxiv.org/abs/2404.12696": {"title": "Gaussian dependence structure pairwise goodness-of-fit testing based on conditional covariance and the 20/60/20 rule", "link": "https://arxiv.org/abs/2404.12696", "description": "arXiv:2404.12696v1 Announce Type: new \nAbstract: We present a novel data-oriented statistical framework that assesses the presumed Gaussian dependence structure in a pairwise setting. This refers to both multivariate normality and normal copula goodness-of-fit testing. The proposed test clusters the data according to the 20/60/20 rule and confronts conditional covariance (or correlation) estimates on the obtained subsets. The corresponding test statistic has a natural practical interpretation, desirable statistical properties, and asymptotic pivotal distribution under the multivariate normality assumption. We illustrate the usefulness of the introduced framework using extensive power simulation studies and show that our approach outperforms popular benchmark alternatives. Also, we apply the proposed methodology to commodities market data."}, "https://arxiv.org/abs/2404.12756": {"title": "Why not a thin plate spline for spatial models? A comparative study using Bayesian inference", "link": "https://arxiv.org/abs/2404.12756", "description": "arXiv:2404.12756v1 Announce Type: new \nAbstract: Spatial modelling often uses Gaussian random fields to capture the stochastic nature of studied phenomena. However, this approach incurs significant computational burdens (O(n3)), primarily due to covariance matrix computations. In this study, we propose to use a low-rank approximation of a thin plate spline as a spatial random effect in Bayesian spatial models. We compare its statistical performance and computational efficiency with the approximated Gaussian random field (by the SPDE method). In this case, the dense matrix of the thin plate spline is approximated using a truncated spectral decomposition, resulting in computational complexity of O(kn2) operations, where k is the number of knots. Bayesian inference is conducted via the Hamiltonian Monte Carlo algorithm of the probabilistic software Stan, which allows us to evaluate performance and diagnostics for the proposed models. A simulation study reveals that both models accurately recover the parameters used to simulate data. However, models using a thin plate spline demonstrate superior execution time to achieve the convergence of chains compared to the models utilizing an approximated Gaussian random field. Furthermore, thin plate spline models exhibited better computational efficiency for simulated data coming from different spatial locations. In a real application, models using a thin plate spline as spatial random effect produced similar results in estimating a relative index of abundance for a benthic marine species when compared to models incorporating an approximated Gaussian random field. 
Although they were not the most computationally efficient models, their simplicity of parametrization, execution time, and predictive performance make them a valid alternative for spatial modelling under Bayesian inference."}, "https://arxiv.org/abs/2404.12882": {"title": "The modified conditional sum-of-squares estimator for fractionally integrated models", "link": "https://arxiv.org/abs/2404.12882", "description": "arXiv:2404.12882v1 Announce Type: new \nAbstract: In this paper, we analyse the influence of estimating a constant term on the bias of the conditional sum-of-squares (CSS) estimator in a stationary or non-stationary type-II ARFIMA ($p_1$,$d$,$p_2$) model. We derive expressions for the estimator's bias and show that the leading term can be easily removed by a simple modification of the CSS objective function. We call this new estimator the modified conditional sum-of-squares (MCSS) estimator. We show theoretically and by means of Monte Carlo simulations that its performance relative to that of the CSS estimator is markedly improved even for small sample sizes. Finally, we revisit three classical short datasets that have in the past been described by ARFIMA($p_1$,$d$,$p_2$) models with constant term, namely the post-Second World War real GNP data, the extended Nelson-Plosser data, and the Nile data."}, "https://arxiv.org/abs/2404.12997": {"title": "On the Asymmetric Volatility Connectedness", "link": "https://arxiv.org/abs/2404.12997", "description": "arXiv:2404.12997v1 Announce Type: new \nAbstract: Connectedness measures the degree to which a time-series variable spills volatility over to other variables, compared to the rate at which it receives volatility from them. The idea is based on the percentage of variance decomposition from one variable to the others, which is estimated by making use of a VAR model. Diebold and Yilmaz (2012, 2014) suggested estimating this simple and useful measure of percentage risk spillover impact. Their method is symmetric by nature, however. The current paper offers an alternative asymmetric approach for measuring the volatility spillover direction, which is based on estimating the asymmetric variance decompositions introduced by Hatemi-J (2011, 2014). This approach accounts explicitly for the asymmetric property in the estimations, which accords better with reality. An application is provided to capture the potential asymmetric volatility spillover impacts between the three largest financial markets in the world."}, "https://arxiv.org/abs/2404.12613": {"title": "A Fourier Approach to the Parameter Estimation Problem for One-dimensional Gaussian Mixture Models", "link": "https://arxiv.org/abs/2404.12613", "description": "arXiv:2404.12613v1 Announce Type: cross \nAbstract: The purpose of this paper is twofold. First, we propose a novel algorithm for estimating parameters in one-dimensional Gaussian mixture models (GMMs). The algorithm takes advantage of the Hankel structure inherent in the Fourier data obtained from independent and identically distributed (i.i.d.) samples of the mixture. For GMMs with a unified variance, a singular value ratio functional using the Fourier data is introduced and used to resolve the variance and component number simultaneously. The consistency of the estimator is derived. Compared to classic algorithms such as the method of moments and the maximum likelihood method, the proposed algorithm does not require prior knowledge of the number of Gaussian components or good initial guesses. 
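For context on the MCSS entry above (arXiv:2404.12882), here is a minimal sketch of the plain, unmodified conditional sum-of-squares objective for a type-II ARFIMA(0,$d$,0) model with a constant term, using the standard fractional-differencing expansion. The paper's bias-correcting modification is not implemented, and the function names and toy data are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def frac_diff(x, d):
    """Apply the type-II fractional difference (1-L)^d to a series x,
    using the expansion pi_0 = 1, pi_j = pi_{j-1} * (j - 1 - d) / j."""
    n = len(x)
    pi = np.empty(n)
    pi[0] = 1.0
    for j in range(1, n):
        pi[j] = pi[j - 1] * (j - 1 - d) / j
    # Type-II convention: the residual at time t only uses observations up to t.
    return np.array([np.dot(pi[: t + 1], x[t::-1]) for t in range(n)])

def css_objective(params, y):
    """Conditional sum of squares for ARFIMA(0, d, 0) with a constant term mu."""
    d, mu = params
    eps = frac_diff(y - mu, d)
    return np.sum(eps ** 2)

# Toy usage: estimate (d, mu) by minimizing the CSS objective on an arbitrary
# persistent series (not a calibrated ARFIMA simulation).
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=500)) * 0.05 + 1.0
fit = minimize(css_objective, x0=np.array([0.3, y.mean()]),
               args=(y,), method="Nelder-Mead")
d_hat, mu_hat = fit.x
```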
Numerical experiments demonstrate its superior performance in estimation accuracy and computational cost. Second, we reveal that there exists a fundamental limit to the problem of estimating the number of Gaussian components or model order in the mixture model if the number of i.i.d. samples is finite. For the case of a single variance, we show that the model order can be successfully estimated only if the minimum separation distance between the component means exceeds a certain threshold value, and can fail below it. We derive a lower bound for this threshold value, referred to as the computational resolution limit, in terms of the number of i.i.d. samples, the variance, and the number of Gaussian components. Numerical experiments confirm this phase transition phenomenon in estimating the model order. Moreover, we demonstrate that our algorithm achieves better scores in likelihood, AIC, and BIC when compared to the EM algorithm."}, "https://arxiv.org/abs/2404.12862": {"title": "A Guide to Feature Importance Methods for Scientific Inference", "link": "https://arxiv.org/abs/2404.12862", "description": "arXiv:2404.12862v1 Announce Type: cross \nAbstract: While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of FI methods. Through an extensive review of FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models."}, "https://arxiv.org/abs/1711.08265": {"title": "Sparse Variable Selection on High Dimensional Heterogeneous Data with Tree Structured Responses", "link": "https://arxiv.org/abs/1711.08265", "description": "arXiv:1711.08265v2 Announce Type: replace \nAbstract: We consider the problem of sparse variable selection on high-dimensional heterogeneous data sets, which has been taking on renewed interest recently due to the growth of biological and medical data sets with complex, non-i.i.d. structures and huge quantities of response variables. The heterogeneity is likely to confound the association between explanatory variables and responses, resulting in enormous false discoveries when Lasso or its variants are na\\\"ively applied. Therefore, developing effective confounder correction methods is an area of growing interest among researchers. However, directly employing recent confounder correction methods often results in undesirable performance because they ignore the convoluted interdependency among the response variables. 
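As a concrete example of one feature-importance method of the kind surveyed in the guide entry above (arXiv:2404.12862), here is a minimal permutation-importance sketch with scikit-learn; the dataset and model are placeholders, not from the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data-generating process with a handful of informative features.
X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance: the drop in test-set score when a feature is shuffled,
# one common model-agnostic notion of feature importance.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```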
To improve upon current variable selection methods, we introduce a model, the tree-guided sparse linear mixed model, that utilizes the dependency information across multiple responses to explore how the responses cluster and to select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that our proposed model outperforms the existing methods and achieves the highest ROC area."}, "https://arxiv.org/abs/2101.01157": {"title": "A tutorial on spatiotemporal partially observed Markov process models via the R package spatPomp", "link": "https://arxiv.org/abs/2101.01157", "description": "arXiv:2101.01157v4 Announce Type: replace \nAbstract: We describe a computational framework for modeling and statistical inference on high-dimensional stochastic dynamic systems. Our primary motivation is the investigation of metapopulation dynamics arising from a collection of spatially distributed, interacting biological populations. To make progress on this goal, we embed it in a more general problem: inference for a collection of interacting partially observed nonlinear non-Gaussian stochastic processes. Each process in the collection is called a unit; in the case of spatiotemporal models, the units correspond to distinct spatial locations. The dynamic state for each unit may be discrete or continuous, scalar or vector valued. In metapopulation applications, the state can represent a structured population or the abundances of a collection of species at a single location. We consider models where the collection of states has a Markov property. A sequence of noisy measurements is made on each unit, resulting in a collection of time series. A model of this form is called a spatiotemporal partially observed Markov process (SpatPOMP). The R package spatPomp provides an environment for implementing SpatPOMP models, analyzing data using existing methods, and developing new inference approaches. Our presentation of spatPomp reviews various methodologies in a unifying notational framework. We demonstrate the package on a simple Gaussian system and on a nontrivial epidemiological model for measles transmission within and between cities. We show how to construct user-specified SpatPOMP models within spatPomp."}, "https://arxiv.org/abs/2110.15517": {"title": "CP Factor Model for Dynamic Tensors", "link": "https://arxiv.org/abs/2110.15517", "description": "arXiv:2110.15517v2 Announce Type: replace \nAbstract: Observations in various applications are frequently represented as a time series of multidimensional arrays, called tensor time series, preserving the inherent multidimensional structure. In this paper, we present a factor model approach, in a form similar to tensor CP decomposition, to the analysis of high-dimensional dynamic tensor time series. As the loading vectors are uniquely defined but not necessarily orthogonal, it is significantly different from the existing tensor factor models based on Tucker-type tensor decomposition. The model structure allows for a set of uncorrelated one-dimensional latent dynamic factor processes, making it much more convenient to study the underlying dynamics of the time series. A new high order projection estimator is proposed for such a factor model, utilizing the special structure and the idea of the higher order orthogonal iteration procedures commonly used in Tucker-type tensor factor models and general tensor CP decomposition procedures. 
A theoretical investigation provides statistical error bounds for the proposed methods, which show the significant advantage of utilizing the special model structure. A simulation study is conducted to further demonstrate the finite sample properties of the estimators. A real data application is used to illustrate the model and its interpretations."}, "https://arxiv.org/abs/2303.06434": {"title": "Direct Bayesian Regression for Distribution-valued Covariates", "link": "https://arxiv.org/abs/2303.06434", "description": "arXiv:2303.06434v2 Announce Type: replace \nAbstract: In this manuscript, we study the problem of scalar-on-distribution regression; that is, instances where subject-specific distributions or densities, or in practice, repeated measures from those distributions, are the covariates related to a scalar outcome via a regression model. We propose a direct regression for such distribution-valued covariates that circumvents estimating subject-specific densities and directly uses the observed repeated measures as covariates. The model is invariant to any transformation or ordering of the repeated measures. Endowing the regression function with a Gaussian Process prior, we obtain closed form or conjugate Bayesian inference. Our method subsumes the standard Bayesian non-parametric regression using Gaussian Processes as a special case. Theoretically, we show that the method can achieve an optimal estimation error bound. To our knowledge, this is the first theoretical study on Bayesian regression using distribution-valued covariates. Through simulation studies and analysis of an activity count dataset, we demonstrate that our method performs better than approaches that require an intermediate density estimation step."}, "https://arxiv.org/abs/2311.05883": {"title": "Time-Varying Identification of Monetary Policy Shocks", "link": "https://arxiv.org/abs/2311.05883", "description": "arXiv:2311.05883v3 Announce Type: replace \nAbstract: We propose a new Bayesian heteroskedastic Markov-switching structural vector autoregression with data-driven time-varying identification. The model selects alternative exclusion restrictions over time and, as a condition for the search, allows identification to be verified through heteroskedasticity within each regime. Based on four alternative monetary policy rules, we show that a monthly six-variable system supports time variation in US monetary policy shock identification. In the sample-dominating first regime, systematic monetary policy follows a Taylor rule extended by the term spread, effectively curbing inflation. In the second regime, occurring after 2000 and gaining more persistence after the global financial and COVID crises, it is characterized by a money-augmented Taylor rule. This regime's unconventional monetary policy provides economic stimulus, features the liquidity effect, and is complemented by a pure term spread shock. Absent the specific monetary policy of the second regime, inflation would be over one percentage point higher on average after 2008."}, "https://arxiv.org/abs/2404.13177": {"title": "A Bayesian Hybrid Design with Borrowing from Historical Study", "link": "https://arxiv.org/abs/2404.13177", "description": "arXiv:2404.13177v1 Announce Type: new \nAbstract: In early phase drug development of combination therapy, the primary objective is to preliminarily assess whether there is additive activity when a novel agent is combined with an established monotherapy. 
Due to potential feasibility issues with a large randomized study, uncontrolled single-arm trials have been the mainstream approach in cancer clinical trials. However, such trials often present significant challenges in deciding whether to proceed to the next phase of development. A hybrid design, leveraging data from a completed historical clinical study of the monotherapy, offers a valuable option to enhance study efficiency and improve informed decision-making. Compared to traditional single-arm designs, the hybrid design may significantly enhance power by borrowing external information, enabling a more robust assessment of activity. The primary challenge of the hybrid design lies in handling information borrowing. We introduce a Bayesian dynamic power prior (DPP) framework with three components controlling the amount of dynamic borrowing. The framework offers flexible study design options with explicit interpretation of borrowing, allowing customization according to specific needs. Furthermore, the posterior distribution in the proposed framework has a closed form, offering significant advantages in computational efficiency. The proposed framework's utility is demonstrated through simulations and a case study."}, "https://arxiv.org/abs/2404.13233": {"title": "On a Notion of Graph Centrality Based on L1 Data Depth", "link": "https://arxiv.org/abs/2404.13233", "description": "arXiv:2404.13233v1 Announce Type: new \nAbstract: A new measure to assess the centrality of vertices in an undirected and connected graph is proposed. The proposed measure, L1 centrality, can adequately handle graphs with weights assigned to vertices and edges. The study provides tools for graphical and multiscale analysis based on the L1 centrality. Specifically, the suggested analysis tools include the target plot, L1 centrality-based neighborhood, local L1 centrality, multiscale edge representation, and heterogeneity plot and index. Most importantly, our work is closely associated with the concept of data depth for multivariate data, which allows for a wide range of practical applications of the proposed measure. Throughout the paper, we demonstrate our tools with two interesting examples: the Marvel Cinematic Universe movie network and the bill cosponsorship network of the 21st National Assembly of South Korea."}, "https://arxiv.org/abs/2404.13284": {"title": "Impact of methodological assumptions and covariates on the cutoff estimation in ROC analysis", "link": "https://arxiv.org/abs/2404.13284", "description": "arXiv:2404.13284v1 Announce Type: new \nAbstract: The Receiver Operating Characteristic (ROC) curve stands as a cornerstone in assessing the efficacy of biomarkers for disease diagnosis. Beyond merely evaluating performance, it provides an optimal cutoff for biomarker values, crucial for disease categorization. While diverse methodologies exist for threshold estimation, less attention has been paid to integrating covariate impact into this process. Covariates can strongly impact diagnostic summaries, leading to variations across different covariate levels. Therefore, a tailored covariate-based framework is imperative for outlining covariate-specific optimal cutoffs. Moreover, recent investigations into cutoff estimators have overlooked the influence of ROC curve estimation methodologies. This study endeavors to bridge this gap. 
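To make the borrowing mechanism in the hybrid-design entry above (arXiv:2404.13177) concrete, here is a minimal sketch of a static power prior with a binary endpoint and conjugate Beta updating. The discount factor delta is fixed by hand here, whereas the paper's dynamic power prior chooses the amount of borrowing from the data, so this only illustrates the mechanism; all names and numbers are placeholders.

```python
from scipy.stats import beta

def power_prior_posterior(x_cur, n_cur, x_hist, n_hist, delta, a0=1.0, b0=1.0):
    """Posterior of a response rate under a static power prior.

    The historical likelihood is raised to the power delta in [0, 1]:
    delta = 0 ignores the historical study, delta = 1 pools it fully.
    With a Beta(a0, b0) initial prior the posterior stays in closed form.
    """
    a_post = a0 + delta * x_hist + x_cur
    b_post = b0 + delta * (n_hist - x_hist) + (n_cur - x_cur)
    return beta(a_post, b_post)

# Toy usage: current single-arm data plus a completed historical study.
post = power_prior_posterior(x_cur=12, n_cur=30, x_hist=45, n_hist=100, delta=0.5)
print("posterior mean:", post.mean())
print("P(rate > 0.3):", 1 - post.cdf(0.3))
```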
Extensive simulation studies are conducted to scrutinize the performance of ROC curve estimation models in estimating different cutoffs in varying scenarios, encompassing diverse data-generating mechanisms and covariate effects. Additionally, leveraging the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the research assesses the performance of different biomarkers in diagnosing Alzheimer's disease and determines the suitable optimal cutoffs."}, "https://arxiv.org/abs/2404.13339": {"title": "Group COMBSS: Group Selection via Continuous Optimization", "link": "https://arxiv.org/abs/2404.13339", "description": "arXiv:2404.13339v1 Announce Type: new \nAbstract: We present a new optimization method for the group selection problem in linear regression. In this problem, predictors are assumed to have a natural group structure and the goal is to select a small set of groups that best fits the response. The incorporation of group structure in a predictor matrix is a key factor in obtaining better estimators and identifying associations between response and predictors. Such a discrete constrained problem is well-known to be hard, particularly in high-dimensional settings where the number of predictors is much larger than the number of observations. We propose to tackle this problem by framing the underlying discrete binary constrained problem into an unconstrained continuous optimization problem. The performance of our proposed approach is compared to state-of-the-art variable selection strategies on simulated data sets. We illustrate the effectiveness of our approach on a genetic dataset to identify grouping of markers across chromosomes."}, "https://arxiv.org/abs/2404.13356": {"title": "How do applied researchers use the Causal Forest? A methodological review of a method", "link": "https://arxiv.org/abs/2404.13356", "description": "arXiv:2404.13356v1 Announce Type: new \nAbstract: This paper conducts a methodological review of papers using the causal forest machine learning method for flexibly estimating heterogeneous treatment effects. It examines 133 peer-reviewed papers. It shows that the emerging best practice relies heavily on the approach and tools created by the original authors of the causal forest such as their grf package and the approaches given by them in examples. Generally researchers use the causal forest on a relatively low-dimensional dataset relying on randomisation or observed controls to identify effects. There are several common ways to then communicate results -- by mapping out the univariate distribution of individual-level treatment effect estimates, displaying variable importance results for the forest and graphing the distribution of treatment effects across covariates that are important either for theoretical reasons or because they have high variable importance. Some deviations from this common practice are interesting and deserve further development and use. Others are unnecessary or even harmful."}, "https://arxiv.org/abs/2404.13366": {"title": "Prior Effective Sample Size When Borrowing on the Treatment Effect Scale", "link": "https://arxiv.org/abs/2404.13366", "description": "arXiv:2404.13366v1 Announce Type: new \nAbstract: With the robust uptick in the applications of Bayesian external data borrowing, eliciting a prior distribution with the proper amount of information becomes increasingly critical. The prior effective sample size (ESS) is an intuitive and efficient measure for this purpose. 
The majority of ESS definitions have been proposed in the context of borrowing control information. While many Bayesian models can be naturally extended to leveraging external information on the treatment effect scale, very little attention has been directed to computing the prior ESS in this setting. In this research, we bridge this methodological gap by extending the popular ELIR ESS definition. We lay out the general framework, and derive the prior ESS for various types of endpoints and treatment effect measures. The posterior distribution and the predictive consistency property of ESS are also examined. The methods are implemented in R programs available on GitHub: https://github.com/squallteo/TrtEffESS."}, "https://arxiv.org/abs/2404.13442": {"title": "Difference-in-Differences under Bipartite Network Interference: A Framework for Quasi-Experimental Assessment of the Effects of Environmental Policies on Health", "link": "https://arxiv.org/abs/2404.13442", "description": "arXiv:2404.13442v1 Announce Type: new \nAbstract: Pollution from coal-fired power plants has been linked to substantial health and mortality burdens in the US. In recent decades, federal regulatory policies have spurred efforts to curb emissions through various actions, such as the installation of emissions control technologies on power plants. However, assessing the health impacts of these measures, particularly over longer periods of time, is complicated by several factors. First, the units that potentially receive the intervention (power plants) are disjoint from those on which outcomes are measured (communities), and second, pollution emitted from power plants disperses and affects geographically far-reaching areas. This creates a methodological challenge known as bipartite network interference (BNI). To our knowledge, no methods have been developed for conducting quasi-experimental studies with panel data in the BNI setting. In this study, motivated by the need for robust estimates of the total health impacts of power plant emissions control technologies in recent decades, we introduce a novel causal inference framework for difference-in-differences analysis under BNI with staggered treatment adoption. We explain the unique methodological challenges that arise in this setting and propose a solution via a data reconfiguration and mapping strategy. The proposed approach is advantageous because analysis is conducted at the intervention unit level, avoiding the need to arbitrarily define treatment status at the outcome unit level, but it permits interpretation of results at the more policy-relevant outcome unit level. Using this interference-aware approach, we investigate the impacts of installation of flue gas desulfurization scrubbers on coal-fired power plants on coronary heart disease hospitalizations among older Americans over the period 2003-2014, finding an overall beneficial effect in mitigating such disease outcomes."}, "https://arxiv.org/abs/2404.13487": {"title": "Generating Synthetic Rainfall Fields by R-vine Copulas Applied to Seamless Probabilistic Predictions", "link": "https://arxiv.org/abs/2404.13487", "description": "arXiv:2404.13487v1 Announce Type: new \nAbstract: Many post-processing methods improve forecasts at individual locations but remove their correlation structure, which is crucial for predicting larger-scale events like total precipitation amount over areas such as river catchments that are relevant for weather warnings and flood predictions. 
We propose a method to reintroduce spatial correlation into a post-processed forecast using an R-vine copula fitted to historical observations. This method works similarly to related approaches like the Schaake shuffle and ensemble copula coupling, i.e., by rearranging predictions at individual locations and reintroducing spatial correlation while maintaining the post-processed marginal distribution. Here, the copula measures how well an arrangement compares with the historical distribution of precipitation. No close relationship is needed between the post-processed marginal distributions and the spatial correlation source. This is an advantage compared to Schaake shuffle and ensemble copula coupling, which rely on a ranking with no ties at each considered location in their source for spatial correlations. However, weather variables such as the precipitation amount, whose distribution has an atom at zero, have rankings with ties. To evaluate the proposed method, it is applied to a precipitation forecast produced by a combination model with two input forecasts that deliver calibrated marginal distributions but without spatial correlations. The obtained results indicate that the calibration of the combination model carries over to the output of the proposed model, i.e., the evaluation of area predictions shows a similar improvement in forecast quality as the predictions for individual locations. Additionally, the spatial correlation of the forecast is evaluated with the help of object-based metrics, for which the proposed model also shows an improvement compared to both input forecasts."}, "https://arxiv.org/abs/2404.13589": {"title": "The quantile-based classifier with variable-wise parameters", "link": "https://arxiv.org/abs/2404.13589", "description": "arXiv:2404.13589v1 Announce Type: new \nAbstract: Quantile-based classifiers can classify high-dimensional observations by minimising a discrepancy of an observation to a class based on suitable quantiles of the within-class distributions, corresponding to a unique percentage for all variables. The present work extends these classifiers by introducing a way to determine potentially different optimal percentages for different variables. Furthermore, a variable-wise scale parameter is introduced. A simple greedy algorithm to estimate the parameters is proposed. Their consistency in a nonparametric setting is proved. Experiments using artificially generated and real data confirm the potential of the quantile-based classifier with variable-wise parameters."}, "https://arxiv.org/abs/2404.13707": {"title": "Robust inference for the unification of confidence intervals in meta-analysis", "link": "https://arxiv.org/abs/2404.13707", "description": "arXiv:2404.13707v1 Announce Type: new \nAbstract: Traditional meta-analysis assumes that the effect sizes estimated in individual studies follow a Gaussian distribution. However, this distributional assumption is not always satisfied in practice, leading to potentially biased results. In the situation when the number of studies, denoted as K, is large, the cumulative Gaussian approximation errors from each study could make the final estimation unreliable. In the situation when K is small, it is not realistic to assume the random-effect follows Gaussian distribution. In this paper, we present a novel empirical likelihood method for combining confidence intervals under the meta-analysis framework. 
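Since the rainfall entry above (arXiv:2404.13487) positions the R-vine approach against the Schaake shuffle and ensemble copula coupling, a minimal sketch of the classical Schaake shuffle may help: the post-processed ensemble at each location is reordered to follow the rank structure of a historical multivariate sample. This is the baseline technique, not the paper's copula-based proposal, and it assumes no ties in the historical values.

```python
import numpy as np

def schaake_shuffle(forecast, historical):
    """Reorder ensemble forecasts to mimic the spatial rank structure of history.

    forecast   : (m, p) array, m ensemble members at p locations (calibrated marginals)
    historical : (m, p) array, m historical multivariate observations (no ties assumed)
    Returns an (m, p) array with the same marginal values per location as `forecast`
    but the rank-correlation pattern of `historical`.
    """
    m, p = forecast.shape
    out = np.empty_like(forecast, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(historical[:, j]))   # rank of each historical case
        out[:, j] = np.sort(forecast[:, j])[ranks]          # i-th smallest value gets rank i
    return out

# Toy usage with random inputs standing in for calibrated forecasts and archive data.
rng = np.random.default_rng(2)
reordered = schaake_shuffle(rng.gamma(2.0, size=(50, 4)), rng.gamma(2.0, size=(50, 4)))
```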
This method is free of the Gaussian assumption in effect size estimates from individual studies and from the random-effects. We establish the large-sample properties of the non-parametric estimator, and introduce a criterion governing the relationship between the number of studies, K, and the sample size of each study, n_i. Our methodology supersedes conventional meta-analysis techniques in both theoretical robustness and computational efficiency. We assess the performance of our proposed methods using simulation studies, and apply our proposed methods to two examples."}, "https://arxiv.org/abs/2404.13735": {"title": "Identification and Estimation of Nonseparable Triangular Equations with Mismeasured Instruments", "link": "https://arxiv.org/abs/2404.13735", "description": "arXiv:2404.13735v1 Announce Type: new \nAbstract: In this paper, I study the nonparametric identification and estimation of the marginal effect of an endogenous variable $X$ on the outcome variable $Y$, given a potentially mismeasured instrument variable $W^*$, without assuming linearity or separability of the functions governing the relationship between observables and unobservables. To address the challenges arising from the co-existence of measurement error and nonseparability, I first employ the deconvolution technique from the measurement error literature to identify the joint distribution of $Y, X, W^*$ using two error-laden measurements of $W^*$. I then recover the structural derivative of the function of interest and the \"Local Average Response\" (LAR) from the joint distribution via the \"unobserved instrument\" approach in Matzkin (2016). I also propose nonparametric estimators for these parameters and derive their uniform rates of convergence. Monte Carlo exercises show evidence that the estimators I propose have good finite sample performance."}, "https://arxiv.org/abs/2404.13753": {"title": "A nonstandard application of cross-validation to estimate density functionals", "link": "https://arxiv.org/abs/2404.13753", "description": "arXiv:2404.13753v1 Announce Type: new \nAbstract: Cross-validation is usually employed to evaluate the performance of a given statistical methodology. When such a methodology depends on a number of tuning parameters, cross-validation proves to be helpful to select the parameters that optimize the estimated performance. In this paper, however, a very different and nonstandard use of cross-validation is investigated. Instead of focusing on the cross-validated parameters, the main interest is switched to the estimated value of the error criterion at optimal performance. It is shown that this approach is able to provide consistent and efficient estimates of some density functionals, with the noteworthy feature that these estimates do not rely on the choice of any further tuning parameter, so that, in that sense, they can be considered to be purely empirical. Here, a base case of application of this new paradigm is developed in full detail, while many other possible extensions are hinted as well."}, "https://arxiv.org/abs/2404.13825": {"title": "Change-point analysis for binomial autoregressive model with application to price stability counts", "link": "https://arxiv.org/abs/2404.13825", "description": "arXiv:2404.13825v1 Announce Type: new \nAbstract: The first-order binomial autoregressive (BAR(1)) model is the most frequently used tool to analyze the bounded count time series. 
The BAR(1) model is stationary and assumes the process parameters to remain constant throughout the observation period, which may be incompatible with real data that are non-stationary and exhibit piecewise-stationary behavior. To better analyze such non-stationary bounded count time series, this article introduces the BAR(1) process with multiple change-points, which contains the BAR(1) model as a special case. Our primary goals are not only to detect the change-points but also to estimate their number and locations. For this, the cumulative sum (CUSUM) test and minimum description length (MDL) principle are employed to deal with the testing and estimation problems. The proposed approaches are also applied to analysis of the Harmonised Index of Consumer Prices of the European Union."}, "https://arxiv.org/abs/2404.13834": {"title": "Inference for multiple change-points in generalized integer-valued autoregressive model", "link": "https://arxiv.org/abs/2404.13834", "description": "arXiv:2404.13834v1 Announce Type: new \nAbstract: In this paper, we propose a computationally valid and theoretically justified method, the likelihood ratio scan method (LRSM), for estimating multiple change-points in a piecewise stationary generalized conditional integer-valued autoregressive process. LRSM with the usual window parameter $h$ is better suited to long time series with few, evenly spread change-points, whereas LRSM with the multiple window parameter $h_{mix}$ performs well in short time series with many, densely spaced change-points. The LRSM can be computed efficiently, with computational complexity of order $O((\\log n)^3 n)$. Moreover, two bootstrap procedures, namely parametric and block bootstrap, are developed for constructing confidence intervals (CIs) for each of the change-points. Simulation experiments and real data analysis show that the LRSM and bootstrap procedures have excellent performance and are consistent with the theoretical analysis."}, "https://arxiv.org/abs/2404.13836": {"title": "MultiFun-DAG: Multivariate Functional Directed Acyclic Graph", "link": "https://arxiv.org/abs/2404.13836", "description": "arXiv:2404.13836v1 Announce Type: new \nAbstract: Directed Acyclic Graphical (DAG) models efficiently formulate causal relationships in complex systems. Traditional DAGs assume nodes to be scalar variables, characterizing complex systems in a facile and oversimplified form. This paper considers that nodes can be multivariate functional data and thus proposes a multivariate functional DAG (MultiFun-DAG). It constructs a hidden bilinear multivariate function-to-function regression to describe the causal relationships between different nodes. An Expectation-Maximization algorithm is then used to learn the graph structure as a score-based algorithm with acyclicity constraints. Theoretical properties are derived. Numerical studies and a case study from urban traffic congestion analysis are conducted to show MultiFun-DAG's effectiveness."}, "https://arxiv.org/abs/2404.13939": {"title": "Unlocking Insights: Enhanced Analysis of Covariance in General Factorial Designs through Multiple Contrast Tests under Variance Heteroscedasticity", "link": "https://arxiv.org/abs/2404.13939", "description": "arXiv:2404.13939v1 Announce Type: new \nAbstract: A common goal in clinical trials is to conduct tests on estimated treatment effects adjusted for covariates such as age or sex. 
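As a reference point for the CUSUM component used in the change-point entries above (arXiv:2404.13825, arXiv:2404.13834), here is a minimal sketch of the classical CUSUM statistic for a single mean shift; the BAR(1) likelihood and the scan/MDL machinery of those papers are not reproduced, and the function name and toy data are illustrative.

```python
import numpy as np

def cusum_change_point(x):
    """Classical CUSUM statistic for a single mean shift.

    Returns max_k |S_k - (k/n) S_n| / (sigma_hat * sqrt(n)) and the index where it
    is attained. For serially dependent data a long-run variance estimate would
    replace the sample standard deviation used here.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.cumsum(x)
    k = np.arange(1, n + 1)
    stat = np.abs(s - k / n * s[-1]) / (x.std(ddof=1) * np.sqrt(n))
    k_hat = int(np.argmax(stat))
    return stat[k_hat], k_hat

# Toy usage: a bounded count series whose success probability shifts mid-sample.
rng = np.random.default_rng(3)
series = np.concatenate([rng.binomial(10, 0.3, 150), rng.binomial(10, 0.6, 150)])
value, location = cusum_change_point(series)
print(f"CUSUM statistic {value:.2f} attained at index {location}")
```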
Analysis of Covariance (ANCOVA) is often used in these scenarios to test the global null hypothesis of no treatment effect using an $F$-test. However, in several samples, the $F$-test does not provide any information about individual null hypotheses and has strict assumptions such as variance homoscedasticity. We extend the method proposed by Konietschke et al. (2021) to a multiple contrast test procedure (MCTP), which allows us to test arbitrary linear hypotheses and provides information about the global as well as the individual null hypotheses. Further, we can calculate compatible simultaneous confidence intervals for the individual effects. We derive a small sample size approximation of the distribution of the test statistic via a multivariate t-distribution. As an alternative, we introduce a Wild-bootstrap method. Extensive simulations show that our methods are applicable even when sample sizes are small. Their application is further illustrated within a real data example."}, "https://arxiv.org/abs/2404.13986": {"title": "Stochastic Volatility in Mean: Efficient Analysis by a Generalized Mixture Sampler", "link": "https://arxiv.org/abs/2404.13986", "description": "arXiv:2404.13986v1 Announce Type: new \nAbstract: In this paper we consider the simulation-based Bayesian analysis of stochastic volatility in mean (SVM) models. Extending the highly efficient Markov chain Monte Carlo mixture sampler for the SV model proposed in Kim et al. (1998) and Omori et al. (2007), we develop an accurate approximation of the non-central chi-squared distribution as a mixture of thirty normal distributions. Under this mixture representation, we sample the parameters and latent volatilities in one block. We also detail a correction of the small approximation error by using additional Metropolis-Hastings steps. The proposed method is extended to the SVM model with leverage. The methodology and models are applied to excess holding yields in empirical studies, and the SVM model with leverage is shown to outperform competing volatility models based on marginal likelihoods."}, "https://arxiv.org/abs/2404.14124": {"title": "Gaussian distributional structural equation models: A framework for modeling latent heteroscedasticity", "link": "https://arxiv.org/abs/2404.14124", "description": "arXiv:2404.14124v1 Announce Type: new \nAbstract: Accounting for the complexity of psychological theories requires methods that can predict not only changes in the means of latent variables -- such as personality factors, creativity, or intelligence -- but also changes in their variances. Structural equation modeling (SEM) is the framework of choice for analyzing complex relationships among latent variables, but current methods do not allow modeling latent variances as a function of other latent variables. In this paper, we develop a Bayesian framework for Gaussian distributional SEM which overcomes this limitation. We validate our framework using extensive simulations, which demonstrate that the new models produce reliable statistical inference and can be computed with sufficient efficiency for practical everyday use. 
We illustrate our framework's applicability in a real-world case study that addresses a substantive hypothesis from personality psychology."}, "https://arxiv.org/abs/2404.14149": {"title": "A Linear Relationship between Correlation and Cohen's Kappa for Binary Data and Simulating Multivariate Nominal and Ordinal Data with Specified Kappa Matrix", "link": "https://arxiv.org/abs/2404.14149", "description": "arXiv:2404.14149v1 Announce Type: new \nAbstract: Cohen's kappa is a useful measure of agreement between judges, inter-rater reliability, and also goodness of fit in classification problems. For binary nominal and ordinal data, kappa and correlation are equally applicable. We have found a linear relationship between correlation and kappa for binary data. Exact bounds of kappa are important because kappa can be only 0.5 even when there is very strong agreement. The exact upper bound was developed by Cohen (1960), but the exact lower bound is also important if the range of kappa is small for some marginals. We have developed an algorithm to find the exact lower bound given marginal proportions. Our final contribution is a method to generate multivariate nominal and ordinal data with a specified kappa matrix, based on the rearrangement of independently generated marginal data into a multidimensional contingency table, where cell counts are found by solving a system of linear equations for positive roots."}, "https://arxiv.org/abs/2404.14213": {"title": "An Exposure Model Framework for Signal Detection based on Electronic Healthcare Data", "link": "https://arxiv.org/abs/2404.14213", "description": "arXiv:2404.14213v1 Announce Type: new \nAbstract: Despite extensive safety assessments of drugs prior to their introduction to the market, certain adverse drug reactions (ADRs) remain undetected. The primary objective of pharmacovigilance is to identify these ADRs (i.e., signals). In addition to traditional spontaneous reporting systems (SRSs), electronic health (EHC) data is being used for signal detection as well. Unlike SRS, EHC data is longitudinal and thus requires assumptions about the patient's drug exposure history and its impact on ADR occurrences over time, which many current methods make implicitly.\n We propose an exposure model framework that explicitly models the longitudinal relationship between the drug and the ADR. By considering multiple such models simultaneously, we can detect signals that might be missed by other approaches. The parameters of these models are estimated using maximum likelihood, and the Bayesian Information Criterion (BIC) is employed to select the most suitable model. Since BIC is connected to the posterior distribution, it serves the dual purpose of identifying the best-fitting model and determining the presence of a signal by evaluating the posterior probability of the null model.\n We evaluate the effectiveness of this framework through a simulation study, for which we develop an EHC data simulator. Additionally, we conduct a case study applying our approach to four drug-ADR pairs using an EHC dataset comprising over 1.2 million insured individuals. 
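To illustrate the two quantities related in the kappa/correlation entry above (arXiv:2404.14149), here is a small sketch that computes Cohen's kappa and the Pearson (phi) correlation from paired binary ratings so the two can be inspected side by side; the paper's closed-form linear relationship is not reproduced, and the simulated raters are placeholders.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary (0/1) rating vectors."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                                                 # observed agreement
    pe = np.mean(a) * np.mean(b) + (1 - np.mean(a)) * (1 - np.mean(b))   # chance agreement
    return (po - pe) / (1 - pe)

# Toy usage: two correlated binary raters (20% of labels flipped).
rng = np.random.default_rng(4)
x = rng.binomial(1, 0.4, 2000)
flip = rng.binomial(1, 0.2, 2000)
y = np.where(flip == 1, 1 - x, x)

print("kappa      :", round(cohens_kappa(x, y), 3))
print("correlation:", round(np.corrcoef(x, y)[0, 1], 3))   # phi coefficient for 0/1 data
```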
Both the method and the EHC data simulator code are publicly accessible as part of the R package https://github.com/bips-hb/expard."}, "https://arxiv.org/abs/2404.14275": {"title": "Maximally informative feature selection using Information Imbalance: Application to COVID-19 severity prediction", "link": "https://arxiv.org/abs/2404.14275", "description": "arXiv:2404.14275v1 Announce Type: new \nAbstract: Clinical databases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features, and automatically select the set of features which is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients. We apply this algorithm to a data set of ~ 1,300 patients treated for COVID-19 in Udine hospital before October 2021. Using this approach, we find combinations of features which, if used in combination, are maximally informative of the clinical fate and of the severity of the disease. The optimal number of features, which is determined automatically, turns out to be between 10 and 15. These features can be measured at admission. The approach can be used also if the features are available only for a fraction of the patients, does not require imputation and, importantly, is able to automatically select features with small inter-feature correlation. Clinical insights deriving from this study are also discussed."}, "https://arxiv.org/abs/2404.13056": {"title": "Variational Bayesian Optimal Experimental Design with Normalizing Flows", "link": "https://arxiv.org/abs/2404.13056", "description": "arXiv:2404.13056v1 Announce Type: cross \nAbstract: Bayesian optimal experimental design (OED) seeks experiments that maximize the expected information gain (EIG) in model parameters. Directly estimating the EIG using nested Monte Carlo is computationally expensive and requires an explicit likelihood. Variational OED (vOED), in contrast, estimates a lower bound of the EIG without likelihood evaluations by approximating the posterior distributions with variational forms, and then tightens the bound by optimizing its variational parameters. We introduce the use of normalizing flows (NFs) for representing variational distributions in vOED; we call this approach vOED-NFs. Specifically, we adopt NFs with a conditional invertible neural network architecture built from compositions of coupling layers, and enhanced with a summary network for data dimension reduction. We present Monte Carlo estimators to the lower bound along with gradient expressions to enable a gradient-based simultaneous optimization of the variational parameters and the design variables. The vOED-NFs algorithm is then validated in two benchmark problems, and demonstrated on a partial differential equation-governed application of cathodic electrophoretic deposition and an implicit likelihood case with stochastic modeling of aphid population. The findings suggest that a composition of 4--5 coupling layers is able to achieve lower EIG estimation bias, under a fixed budget of forward model runs, compared to previous approaches. 
The resulting NFs produce approximate posteriors that agree well with the true posteriors and are able to capture non-Gaussian and multi-modal features effectively."}, "https://arxiv.org/abs/2404.13147": {"title": "Multiclass ROC", "link": "https://arxiv.org/abs/2404.13147", "description": "arXiv:2404.13147v1 Announce Type: cross \nAbstract: Model evaluation is of crucial importance in modern statistics applications. The construction of ROC and calculation of AUC have been widely used for binary classification evaluation. Recent research generalizing the ROC/AUC analysis to multi-class classification has problems in at least one of four areas: 1. failure to provide sensible plots, 2. sensitivity to imbalanced data, 3. inability to specify mis-classification costs, and 4. inability to provide evaluation uncertainty quantification. Borrowing from a binomial matrix factorization model, we provide an evaluation metric summarizing the pair-wise multi-class True Positive Rate (TPR) and False Positive Rate (FPR) with one-dimensional vector representation. Visualization on the representation vector measures the relative speed of increment between TPR and FPR across all class pairs, which in turn provides a ROC plot for the multi-class counterpart. An integration over those factorized vectors provides a binary AUC-equivalent summary of the classifier performance. Mis-classification weight specification and bootstrapped confidence intervals are also enabled to accommodate a variety of evaluation criteria. To support our findings, we conducted extensive simulation studies and compared our method to the pair-wise averaged AUC statistics on benchmark datasets."}, "https://arxiv.org/abs/2404.13198": {"title": "An economically-consistent discrete choice model with flexible utility specification based on artificial neural networks", "link": "https://arxiv.org/abs/2404.13198", "description": "arXiv:2404.13198v1 Announce Type: cross \nAbstract: Random utility maximisation (RUM) models are one of the cornerstones of discrete choice modelling. However, specifying the utility function of RUM models is not straightforward and has a considerable impact on the resulting interpretable outcomes and welfare measures. In this paper, we propose a new discrete choice model based on artificial neural networks (ANNs) named \"Alternative-Specific and Shared weights Neural Network (ASS-NN)\", which provides a further balance between flexible utility approximation from the data and consistency with two assumptions: RUM theory and fungibility of money (i.e., \"one euro is one euro\"). Therefore, the ASS-NN can derive economically-consistent outcomes, such as marginal utilities or willingness to pay, without explicitly specifying the utility functional form. Using a Monte Carlo experiment and empirical data from the Swissmetro dataset, we show that ASS-NN outperforms (in terms of goodness of fit) conventional multinomial logit (MNL) models under different utility specifications. Furthermore, we show how the ASS-NN is used to derive marginal utilities and willingness to pay measures."}, "https://arxiv.org/abs/2404.13302": {"title": "Monte Carlo sampling with integrator snippets", "link": "https://arxiv.org/abs/2404.13302", "description": "arXiv:2404.13302v1 Announce Type: cross \nAbstract: Assume interest is in sampling from a probability distribution $\\mu$ defined on $(\\mathsf{Z},\\mathscr{Z})$. 
We develop a framework to construct sampling algorithms taking full advantage of numerical integrators of ODEs, say $\\psi\\colon\\mathsf{Z}\\rightarrow\\mathsf{Z}$ for one integration step, to explore $\\mu$ efficiently and robustly. The popular Hybrid/Hamiltonian Monte Carlo (HMC) algorithm [Duane, 1987], [Neal, 2011] and its derivatives are examples of such a use of numerical integrators. However, we show how the potential of integrators can be exploited beyond current ideas and HMC sampling in order to take into account aspects of the geometry of the target distribution. A key idea is the notion of an integrator snippet, a fragment of the orbit of an ODE numerical integrator $\\psi$, and its associated probability distribution $\\bar{\\mu}$, which takes the form of a mixture of distributions derived from $\\mu$ and $\\psi$. Exploiting properties of mixtures, we show how samples from $\\bar{\\mu}$ can be used to estimate expectations with respect to $\\mu$. We focus here primarily on Sequential Monte Carlo (SMC) algorithms, but the approach can be used in the context of Markov chain Monte Carlo algorithms as discussed at the end of the manuscript. We illustrate the performance of these new algorithms through numerical experimentation and provide preliminary theoretical results supporting observed performance."}, "https://arxiv.org/abs/2404.13371": {"title": "On Risk-Sensitive Decision Making Under Uncertainty", "link": "https://arxiv.org/abs/2404.13371", "description": "arXiv:2404.13371v1 Announce Type: cross \nAbstract: This paper studies a risk-sensitive decision-making problem under uncertainty. It considers a decision-making process that unfolds over a fixed number of stages, in which a decision-maker chooses among multiple alternatives, some of which are deterministic and others are stochastic. The decision-maker's cumulative value is updated at each stage, reflecting the outcomes of the chosen alternatives. After formulating this as a stochastic control problem, we delineate the necessary optimality conditions for it. Two illustrative examples from optimal betting and inventory management are provided to support our theory."}, "https://arxiv.org/abs/2404.13649": {"title": "Distributional Principal Autoencoders", "link": "https://arxiv.org/abs/2404.13649", "description": "arXiv:2404.13649v1 Announce Type: cross \nAbstract: Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data that follow the same distribution as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distribution of data given its low-dimensional latent variables. Motivated by this, we propose the Distributional Principal Autoencoder (DPA), which consists of an encoder that maps high-dimensional data to low-dimensional latent variables and a decoder that maps the latent variables back to the data space. For reducing the dimension, the DPA encoder aims to minimise the unexplained variability of the data with an adaptive choice of the latent dimension. For reconstructing data, the DPA decoder aims to match the conditional distribution of all data that are mapped to a certain latent value, thus ensuring that the reconstructed data retain the original data distribution. 
Our numerical results on climate data, single-cell data, and image benchmarks demonstrate the practical feasibility and success of the approach in reconstructing the original distribution of the data. DPA embeddings are shown to preserve meaningful structures of data such as the seasonal cycle for precipitations and cell types for gene expression."}, "https://arxiv.org/abs/2404.13964": {"title": "An Economic Solution to Copyright Challenges of Generative AI", "link": "https://arxiv.org/abs/2404.13964", "description": "arXiv:2404.13964v1 Announce Type: cross \nAbstract: Generative artificial intelligence (AI) systems are trained on large data corpora to generate new pieces of text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content. The metric for contributions is quantitatively determined by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. This framework enables a platform where AI developers benefit from access to high-quality training data, thus improving model performance. Meanwhile, copyright owners receive fair compensation, driving the continued provision of relevant data for generative model training. Experiments demonstrate that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners."}, "https://arxiv.org/abs/2404.14052": {"title": "Differential contributions of machine learning and statistical analysis to language and cognitive sciences", "link": "https://arxiv.org/abs/2404.14052", "description": "arXiv:2404.14052v1 Announce Type: cross \nAbstract: Data-driven approaches have revolutionized scientific research. Machine learning and statistical analysis are commonly utilized in this type of research. Despite their widespread use, these methodologies differ significantly in their techniques and objectives. Few studies have utilized a consistent dataset to demonstrate these differences within the social sciences, particularly in language and cognitive sciences. This study leverages the Buckeye Speech Corpus to illustrate how both machine learning and statistical analysis are applied in data-driven research to obtain distinct insights. This study significantly enhances our understanding of the diverse approaches employed in data-driven strategies."}, "https://arxiv.org/abs/2404.14136": {"title": "Elicitability and identifiability of tail risk measures", "link": "https://arxiv.org/abs/2404.14136", "description": "arXiv:2404.14136v1 Announce Type: cross \nAbstract: Tail risk measures are fully determined by the distribution of the underlying loss beyond its quantile at a certain level, with Value-at-Risk and Expected Shortfall being prime examples. They are induced by law-based risk measures, called their generators, evaluated on the tail distribution. This paper establishes joint identifiability and elicitability results of tail risk measures together with the corresponding quantile, provided that their generators are identifiable and elicitable, respectively. As an example, we establish the joint identifiability and elicitability of the tail expectile together with the quantile. 
The corresponding consistent scores constitute a novel class of weighted scores, nesting the known class of scores of Fissler and Ziegel for the Expected Shortfall together with the quantile. For statistical purposes, our results pave the way to easier model fitting for tail risk measures via regression and the generalized method of moments, but also model comparison and model validation in terms of established backtesting procedures."}, "https://arxiv.org/abs/2404.14328": {"title": "Preserving linear invariants in ensemble filtering methods", "link": "https://arxiv.org/abs/2404.14328", "description": "arXiv:2404.14328v1 Announce Type: cross \nAbstract: Formulating dynamical models for physical phenomena is essential for understanding the interplay between the different mechanisms and predicting the evolution of physical states. However, a dynamical model alone is often insufficient to address these fundamental tasks, as it suffers from model errors and uncertainties. One common remedy is to rely on data assimilation, where the state estimate is updated with observations of the true system. Ensemble filters sequentially assimilate observations by updating a set of samples over time. They operate in two steps: a forecast step that propagates each sample through the dynamical model and an analysis step that updates the samples with incoming observations. For accurate and robust predictions of dynamical systems, discrete solutions must preserve their critical invariants. While modern numerical solvers satisfy these invariants, existing invariant-preserving analysis steps are limited to Gaussian settings and are often not compatible with classical regularization techniques of ensemble filters, e.g., inflation and covariance tapering. The present work focuses on preserving linear invariants, such as mass, stoichiometric balance of chemical species, and electrical charges. Using tools from measure transport theory (Spantini et al., 2022, SIAM Review), we introduce a generic class of nonlinear ensemble filters that automatically preserve desired linear invariants in non-Gaussian filtering problems. By specializing this framework to the Gaussian setting, we recover a constrained formulation of the Kalman filter. Then, we show how to combine existing regularization techniques for the ensemble Kalman filter (Evensen, 1994, J. Geophys. Res.) with the preservation of the linear invariants. Finally, we assess the benefits of preserving linear invariants for the ensemble Kalman filter and nonlinear ensemble filters."}, "https://arxiv.org/abs/1908.04218": {"title": "Asymptotic Validity and Finite-Sample Properties of Approximate Randomization Tests", "link": "https://arxiv.org/abs/1908.04218", "description": "arXiv:1908.04218v3 Announce Type: replace \nAbstract: Randomization tests rely on simple data transformations and possess an appealing robustness property. In addition to being finite-sample valid if the data distribution is invariant under the transformation, these tests can be asymptotically valid under a suitable studentization of the test statistic, even if the invariance does not hold. However, practical implementation often encounters noisy data, resulting in approximate randomization tests that may not be as robust. In this paper, our key theoretical contribution is a non-asymptotic bound on the discrepancy between the size of an approximate randomization test and the size of the original randomization test using noiseless data. 
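To fix ideas, a plain two-sample randomization test with a studentized difference-in-means statistic, i.e. the noiseless benchmark against which the approximate tests are compared, can be sketched as follows (a generic illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

def studentized_diff(x, y):
    """Difference in means divided by its estimated standard error (Welch-type studentization)."""
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def randomization_pvalue(x, y, n_draws=5000):
    """Two-sample randomization test: recompute the studentized statistic under
    random reassignments of the group labels."""
    observed = studentized_diff(x, y)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_draws):
        perm = rng.permutation(pooled)
        count += abs(studentized_diff(perm[:len(x)], perm[len(x):])) >= abs(observed)
    return (1 + count) / (1 + n_draws)        # include the identity transformation

x = rng.normal(0.5, 1.0, size=30)             # treated group
y = rng.normal(0.0, 2.0, size=50)             # control group, different variance
print(randomization_pvalue(x, y))
```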
This allows us to derive novel conditions for the validity of approximate randomization tests under data invariances, while being able to leverage existing results based on studentization if the invariance does not hold. We illustrate our theory through several examples, including tests of significance in linear regression. Our theory can explain certain aspects of how randomization tests perform in small samples, addressing limitations of prior theoretical results."}, "https://arxiv.org/abs/2109.02204": {"title": "On the edge eigenvalues of the precision matrices of nonstationary autoregressive processes", "link": "https://arxiv.org/abs/2109.02204", "description": "arXiv:2109.02204v3 Announce Type: replace \nAbstract: This paper investigates the structural changes in the parameters of first-order autoregressive models by analyzing the edge eigenvalues of the precision matrices. Specifically, edge eigenvalues in the precision matrix are observed if and only if there is a structural change in the autoregressive coefficients. We demonstrate that these edge eigenvalues correspond to the zeros of some determinantal equation. Additionally, we propose a consistent estimator for detecting outliers within the panel time series framework, supported by numerical experiments."}, "https://arxiv.org/abs/2112.07465": {"title": "The multirank likelihood for semiparametric canonical correlation analysis", "link": "https://arxiv.org/abs/2112.07465", "description": "arXiv:2112.07465v4 Announce Type: replace \nAbstract: Many analyses of multivariate data focus on evaluating the dependence between two sets of variables, rather than the dependence among individual variables within each set. Canonical correlation analysis (CCA) is a classical data analysis technique that estimates parameters describing the dependence between such sets. However, inference procedures based on traditional CCA rely on the assumption that all variables are jointly normally distributed. We present a semiparametric approach to CCA in which the multivariate margins of each variable set may be arbitrary, but the dependence between variable sets is described by a parametric model that provides low-dimensional summaries of dependence. While maximum likelihood estimation in the proposed model is intractable, we propose two estimation strategies: one using a pseudolikelihood for the model and one using a Markov chain Monte Carlo (MCMC) algorithm that provides Bayesian estimates and confidence regions for the between-set dependence parameters. The MCMC algorithm is derived from a multirank likelihood function, which uses only part of the information in the observed data in exchange for being free of assumptions about the multivariate margins. We apply the proposed Bayesian inference procedure to Brazilian climate data and monthly stock returns from the materials and communications market sectors."}, "https://arxiv.org/abs/2202.02311": {"title": "Graphical criteria for the identification of marginal causal effects in continuous-time survival and event-history analyses", "link": "https://arxiv.org/abs/2202.02311", "description": "arXiv:2202.02311v2 Announce Type: replace \nAbstract: We consider continuous-time survival or more general event-history settings, where the aim is to infer the causal effect of a time-dependent treatment process. This is formalised as the effect on the outcome event of a (possibly hypothetical) intervention on the intensity of the treatment process, i.e. a stochastic intervention. 
To establish whether valid inference about the interventional situation can be drawn from typical observational, i.e. non-experimental, data we propose graphical rules indicating whether the observed information is sufficient to identify the desired causal effect by suitable re-weighting. In analogy to the well-known causal directed acyclic graphs, the corresponding dynamic graphs combine causal semantics with local independence models for multivariate counting processes. Importantly, we highlight that causal inference from censored data requires structural assumptions on the censoring process beyond the usual independent censoring assumption, which can be represented and verified graphically. Our results establish general non-parametric identifiability and do not rely on particular survival models. We illustrate our proposal with a data example on HPV-testing for cervical cancer screening, where the desired effect is estimated by re-weighted cumulative incidence curves."}, "https://arxiv.org/abs/2203.14511": {"title": "Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments", "link": "https://arxiv.org/abs/2203.14511", "description": "arXiv:2203.14511v3 Announce Type: replace \nAbstract: Researchers are increasingly turning to machine learning (ML) algorithms to investigate causal heterogeneity in randomized experiments. Despite their promise, ML algorithms may fail to accurately ascertain heterogeneous treatment effects under practical settings with many covariates and small sample size. In addition, the quantification of estimation uncertainty remains a challenge. We develop a general approach to statistical inference for heterogeneous treatment effects discovered by a generic ML algorithm. We apply the Neyman's repeated sampling framework to a common setting, in which researchers use an ML algorithm to estimate the conditional average treatment effect and then divide the sample into several groups based on the magnitude of the estimated effects. We show how to estimate the average treatment effect within each of these groups, and construct a valid confidence interval. In addition, we develop nonparametric tests of treatment effect homogeneity across groups, and rank-consistency of within-group average treatment effects. The validity of our methodology does not rely on the properties of ML algorithms because it is solely based on the randomization of treatment assignment and random sampling of units. Finally, we generalize our methodology to the cross-fitting procedure by accounting for the additional uncertainty induced by the random splitting of data."}, "https://arxiv.org/abs/2207.07533": {"title": "Selection of the Most Probable Best", "link": "https://arxiv.org/abs/2207.07533", "description": "arXiv:2207.07533v2 Announce Type: replace \nAbstract: We consider an expected-value ranking and selection (R&S) problem where all k solutions' simulation outputs depend on a common parameter whose uncertainty can be modeled by a distribution. We define the most probable best (MPB) to be the solution that has the largest probability of being optimal with respect to the distribution and design an efficient sequential sampling algorithm to learn the MPB when the parameter has a finite support. We derive the large deviations rate of the probability of falsely selecting the MPB and formulate an optimal computing budget allocation problem to find the rate-maximizing static sampling ratios. 
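Before any budget allocation enters the picture, the MPB itself is straightforward to compute once the conditional performance of each solution given the common parameter is available; a small sketch with a made-up performance function and a hypothetical finite-support parameter distribution is given below.

```python
import numpy as np

# Hypothetical setup: k solutions whose expected performance depends on a common
# uncertain parameter theta with a discrete (finite-support) distribution.
k = 4
support = np.array([-1.0, 0.0, 1.0])            # finite support of theta
weights = np.array([0.2, 0.5, 0.3])             # its probability mass function

def mean_performance(solution, theta):
    """Made-up conditional means; in practice these would be estimated by simulation."""
    return -(solution - theta) ** 2 + 0.1 * solution

solutions = np.arange(k)
# Conditional best solution for each support point of theta.
cond_means = np.array([[mean_performance(s, t) for s in solutions] for t in support])
best_given_theta = cond_means.argmax(axis=1)

# Probability each solution is optimal, averaged over the distribution of theta;
# the MPB is the solution maximizing this probability.
prob_optimal = np.array([(weights * (best_given_theta == s)).sum() for s in solutions])
print(prob_optimal, "MPB =", int(prob_optimal.argmax()))
```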
The problem is then relaxed to obtain a set of optimality conditions that are interpretable and computationally efficient to verify. We devise a series of algorithms that replace the unknown means in the optimality conditions with their estimates and prove the algorithms' sampling ratios achieve the conditions as the simulation budget increases. Furthermore, we show that the empirical performances of the algorithms can be significantly improved by adopting the kernel ridge regression for mean estimation while achieving the same asymptotic convergence results. The algorithms are benchmarked against a state-of-the-art contextual R&S algorithm and demonstrated to have superior empirical performances."}, "https://arxiv.org/abs/2209.10128": {"title": "Efficient Integrated Volatility Estimation in the Presence of Infinite Variation Jumps via Debiased Truncated Realized Variations", "link": "https://arxiv.org/abs/2209.10128", "description": "arXiv:2209.10128v3 Announce Type: replace \nAbstract: Statistical inference for stochastic processes based on high-frequency observations has been an active research area for more than two decades. One of the most well-known and widely studied problems has been the estimation of the quadratic variation of the continuous component of an It\\^o semimartingale with jumps. Several rate- and variance-efficient estimators have been proposed in the literature when the jump component is of bounded variation. However, to date, very few methods can deal with jumps of unbounded variation. By developing new high-order expansions of the truncated moments of a locally stable L\\'evy process, we propose a new rate- and variance-efficient volatility estimator for a class of It\\^o semimartingales whose jumps behave locally like those of a stable L\\'evy process with Blumenthal-Getoor index $Y\\in (1,8/5)$ (hence, of unbounded variation). The proposed method is based on a two-step debiasing procedure for the truncated realized quadratic variation of the process and can also cover the case $Y<1$. Our Monte Carlo experiments indicate that the method outperforms other efficient alternatives in the literature in the setting covered by our theoretical framework."}, "https://arxiv.org/abs/2210.12382": {"title": "Model-free controlled variable selection via data splitting", "link": "https://arxiv.org/abs/2210.12382", "description": "arXiv:2210.12382v3 Announce Type: replace \nAbstract: Addressing the simultaneous identification of contributory variables while controlling the false discovery rate (FDR) in high-dimensional data is a crucial statistical challenge. In this paper, we propose a novel model-free variable selection procedure in sufficient dimension reduction framework via a data splitting technique. The variable selection problem is first converted to a least squares procedure with several response transformations. We construct a series of statistics with global symmetry property and leverage the symmetry to derive a data-driven threshold aimed at error rate control. Our approach demonstrates the capability for achieving finite-sample and asymptotic FDR control under mild theoretical conditions. 
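The symmetry-based, data-driven threshold can be sketched generically: if the constructed statistics are symmetric about zero for null variables, their negative tail estimates the number of false discoveries in the positive tail. The sketch below applies this standard symmetrized rule to toy "mirror" statistics; the paper's construction of the statistics themselves differs and is not reproduced here.

```python
import numpy as np

def symmetry_threshold(W, q=0.1):
    """Smallest threshold t such that the estimated false discovery proportion
    #{W_j <= -t} / max(1, #{W_j >= t}) is at most q, using the null symmetry of W."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (W <= -t).sum() / max(1, (W >= t).sum())
        if fdp_hat <= q:
            return t
    return np.inf                                   # no threshold achieves the target level

# Toy mirror statistics: symmetric noise for nulls, positive shift for signals.
rng = np.random.default_rng(4)
W = np.concatenate([rng.normal(0, 1, 900), rng.normal(4, 1, 100)])
t = symmetry_threshold(W, q=0.1)
selected = np.flatnonzero(W >= t)
print(f"threshold={t:.2f}, selected={selected.size} variables")
```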
Numerical experiments confirm that our procedure has satisfactory FDR control and higher power compared with existing methods."}, "https://arxiv.org/abs/2212.04814": {"title": "The Falsification Adaptive Set in Linear Models with Instrumental Variables that Violate the Exclusion or Conditional Exogeneity Restriction", "link": "https://arxiv.org/abs/2212.04814", "description": "arXiv:2212.04814v2 Announce Type: replace \nAbstract: Masten and Poirier (2021) introduced the falsification adaptive set (FAS) in linear models with a single endogenous variable estimated with multiple correlated instrumental variables (IVs). The FAS reflects the model uncertainty that arises from falsification of the baseline model. We show that it applies to cases where a conditional exogeneity assumption holds and invalid instruments violate the exclusion assumption only. We propose a generalized FAS that reflects the model uncertainty when some instruments violate the exclusion assumption and/or some instruments violate the conditional exogeneity assumption. Under the assumption that invalid instruments are not themselves endogenous explanatory variables, if there is at least one relevant instrument that satisfies both the exclusion and conditional exogeneity assumptions then this generalized FAS is guaranteed to contain the parameter of interest."}, "https://arxiv.org/abs/2212.12501": {"title": "Learning Optimal Dynamic Treatment Regimens Subject to Stagewise Risk Controls", "link": "https://arxiv.org/abs/2212.12501", "description": "arXiv:2212.12501v2 Announce Type: replace \nAbstract: Dynamic treatment regimens (DTRs) aim at tailoring individualized sequential treatment rules that maximize cumulative beneficial outcomes by accommodating patients' heterogeneity in decision-making. For many chronic diseases including type 2 diabetes mellitus (T2D), treatments are usually multifaceted in the sense that aggressive treatments with a higher expected reward are also likely to elevate the risk of acute adverse events. In this paper, we propose a new weighted learning framework, namely benefit-risk dynamic treatment regimens (BR-DTRs), to address the benefit-risk trade-off. The new framework relies on a backward learning procedure by restricting the induced risk of the treatment rule to be no larger than a pre-specified risk constraint at each treatment stage. Computationally, the estimated treatment rule solves a weighted support vector machine problem with a modified smooth constraint. Theoretically, we show that the proposed DTRs are Fisher consistent, and we further obtain the convergence rates for both the value and risk functions. Finally, the performance of the proposed method is demonstrated via extensive simulation studies and application to a real study for T2D patients."}, "https://arxiv.org/abs/2303.08790": {"title": "lmw: Linear Model Weights for Causal Inference", "link": "https://arxiv.org/abs/2303.08790", "description": "arXiv:2303.08790v2 Announce Type: replace \nAbstract: The linear regression model is widely used in the biomedical and social sciences as well as in policy and business research to adjust for covariates and estimate the average effects of treatments. Behind every causal inference endeavor there is a hypothetical randomized experiment. 
However, in routine regression analyses in observational studies, it is unclear how well the adjustments made by regression approximate key features of randomized experiments, such as covariate balance, study representativeness, sample boundedness, and unweighted sampling. In this paper, we provide software to empirically address this question. We introduce the lmw package for R to compute the implied linear model weights and perform diagnostics for their evaluation. The weights are obtained as part of the design stage of the study; that is, without using outcome information. The implementation is general and applicable, for instance, in settings with instrumental variables and multi-valued treatments; in essence, in any situation where the linear model is the vehicle for adjustment and estimation of average treatment effects with discrete-valued interventions."}, "https://arxiv.org/abs/2303.08987": {"title": "Generalized Score Matching", "link": "https://arxiv.org/abs/2303.08987", "description": "arXiv:2303.08987v2 Announce Type: replace \nAbstract: Score matching is an estimation procedure that has been developed for statistical models whose probability density function is known up to proportionality but whose normalizing constant is intractable, so that maximum likelihood is difficult or impossible to implement. To date, applications of score matching have focused mostly on continuous IID models. Motivated by various data modelling problems, this article proposes a unified asymptotic theory of generalized score matching developed under the independence assumption, covering both continuous and discrete response data, thereby giving a sound basis for score-matching-based inference. Real data analyses and simulation studies provide convincing evidence of strong practical performance of the proposed methods."}, "https://arxiv.org/abs/2303.13237": {"title": "Improving estimation for asymptotically independent bivariate extremes via global estimators for the angular dependence function", "link": "https://arxiv.org/abs/2303.13237", "description": "arXiv:2303.13237v3 Announce Type: replace \nAbstract: Modelling the extremal dependence of bivariate variables is important in a wide variety of practical applications, including environmental planning, catastrophe modelling and hydrology. The majority of these approaches are based on the framework of bivariate regular variation, and a wide range of literature is available for estimating the dependence structure in this setting. However, such procedures are only applicable to variables exhibiting asymptotic dependence, even though asymptotic independence is often observed in practice. In this paper, we consider the so-called `angular dependence function'; this quantity summarises the extremal dependence structure for asymptotically independent variables. Until recently, only pointwise estimators of the angular dependence function have been available. We introduce a range of global estimators and compare them to another recently introduced technique for global estimation through a systematic simulation study, and a case study on river flow data from the north of England, UK."}, "https://arxiv.org/abs/2304.02476": {"title": "A Class of Models for Large Zero-inflated Spatial Data", "link": "https://arxiv.org/abs/2304.02476", "description": "arXiv:2304.02476v2 Announce Type: replace \nAbstract: Spatially correlated data with an excess of zeros, usually referred to as zero-inflated spatial data, arise in many disciplines. 
Examples include count data, for instance, abundance (or lack thereof) of animal species and disease counts, as well as semi-continuous data like observed precipitation. Spatial two-part models are a flexible class of models for such data. Fitting two-part models can be computationally expensive for large data due to high-dimensional dependent latent variables, costly matrix operations, and slow mixing Markov chains. We describe a flexible, computationally efficient approach for modeling large zero-inflated spatial data using the projection-based intrinsic conditional autoregression (PICAR) framework. We study our approach, which we call PICAR-Z, through extensive simulation studies and two environmental data sets. Our results suggest that PICAR-Z provides accurate predictions while remaining computationally efficient. An important goal of our work is to allow researchers who are not experts in computation to easily build computationally efficient extensions to zero-inflated spatial models; this also allows for a more thorough exploration of modeling choices in two-part models than was previously possible. We show that PICAR-Z is easy to implement and extend in popular probabilistic programming languages such as nimble and stan."}, "https://arxiv.org/abs/2304.05323": {"title": "A nonparametric framework for treatment effect modifier discovery in high dimensions", "link": "https://arxiv.org/abs/2304.05323", "description": "arXiv:2304.05323v2 Announce Type: replace \nAbstract: Heterogeneous treatment effects are driven by treatment effect modifiers, pre-treatment covariates that modify the effect of a treatment on an outcome. Current approaches for uncovering these variables are limited to low-dimensional data, data with weakly correlated covariates, or data generated according to parametric processes. We resolve these issues by developing a framework for defining model-agnostic treatment effect modifier variable importance parameters applicable to high-dimensional data with arbitrary correlation structure, deriving one-step, estimating equation and targeted maximum likelihood estimators of these parameters, and establishing these estimators' asymptotic properties. This framework is showcased by defining variable importance parameters for data-generating processes with continuous, binary, and time-to-event outcomes with binary treatments, and deriving accompanying multiply-robust and asymptotically linear estimators. Simulation experiments demonstrate that these estimators' asymptotic guarantees are approximately achieved in realistic sample sizes for observational and randomized studies alike. This framework is applied to gene expression data collected for a clinical trial assessing the effect of a monoclonal antibody therapy on disease-free survival in breast cancer patients. Genes predicted to have the greatest potential for treatment effect modification have previously been linked to breast cancer. 
An open-source R package implementing this methodology, unihtee, is made available on GitHub at https://github.com/insightsengineering/unihtee."}, "https://arxiv.org/abs/2305.12624": {"title": "Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors", "link": "https://arxiv.org/abs/2305.12624", "description": "arXiv:2305.12624v2 Announce Type: replace \nAbstract: Wearable devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behavior, such as sleep quality and physical activity. The continuous monitoring often occurs in epochs of 60 seconds over multiple days, resulting in high dimensional longitudinal curves that are best described and analyzed as functional data. From this perspective, the functional data are smooth, latent functions obtained at discrete time intervals and prone to homoscedastic white noise. However, the assumption of homoscedastic errors might not be appropriate in this setting because the devices collect the data serially. While researchers have previously addressed measurement error in scalar covariates prone to errors, less work has been done on correcting measurement error in high dimensional longitudinal curves prone to heteroscedastic errors. We present two new methods for correcting measurement error in longitudinal functional curves prone to complex measurement error structures in multi-level generalized functional linear regression models. These methods are based on two-stage scalable regression calibration. We assume that the distribution of the scalar responses and the surrogate measures prone to heteroscedastic errors both belong in the exponential family and that the measurement errors follow Gaussian processes. In simulations and sensitivity analyses, we established some finite sample properties of these methods. In our simulations, both regression calibration methods for correcting measurement error performed better than estimators based on averaging the longitudinal functional data and using observations from a single day. We also applied the methods to assess the relationship between physical activity and type 2 diabetes in community dwelling adults in the United States who participated in the National Health and Nutrition Examination Survey."}, "https://arxiv.org/abs/2306.09976": {"title": "Catch me if you can: Signal localization with knockoff e-values", "link": "https://arxiv.org/abs/2306.09976", "description": "arXiv:2306.09976v3 Announce Type: replace \nAbstract: We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. 
To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank."}, "https://arxiv.org/abs/2306.16785": {"title": "A location-scale joint model for studying the link between the time-dependent subject-specific variability of blood pressure and competing events", "link": "https://arxiv.org/abs/2306.16785", "description": "arXiv:2306.16785v2 Announce Type: replace \nAbstract: Given the high incidence of cardio- and cerebrovascular diseases (CVD), and their association with morbidity and mortality, their prevention is a major public health issue. A high level of blood pressure is a well-known risk factor for these events and an increasing number of studies suggest that blood pressure variability may also be an independent risk factor. However, these studies suffer from significant methodological weaknesses. In this work, we propose a new location-scale joint model for the repeated measures of a marker and competing events. This joint model combines a mixed model including a subject-specific and time-dependent residual variance modeled through random effects, and cause-specific proportional intensity models for the competing events. The risk of events may depend simultaneously on the current value of the variance, as well as the current value and the current slope of the marker trajectory. The model is estimated by maximizing the likelihood function using the Marquardt-Levenberg algorithm. The estimation procedure is implemented in an R package and is validated through a simulation study. This model is applied to study the association between blood pressure variability and the risk of CVD and death from other causes. Using data from a large clinical trial on the secondary prevention of stroke, we find that the current individual variability of blood pressure is associated with the risk of CVD and death. Moreover, the comparison with a model without heterogeneous variance shows the importance of taking into account this variability in the goodness-of-fit and for dynamic predictions."}, "https://arxiv.org/abs/2307.04527": {"title": "Automatic Debiased Machine Learning for Covariate Shifts", "link": "https://arxiv.org/abs/2307.04527", "description": "arXiv:2307.04527v3 Announce Type: replace \nAbstract: In this paper, we address the problem of bias in machine learning of parameters following covariate shifts. Covariate shift occurs when the distribution of input features changes between the training and deployment stages. Regularization and model selection associated with machine learning bias many parameter estimates. In this paper, we propose an automatic debiased machine learning approach to correct for this bias under covariate shifts. The proposed approach leverages state-of-the-art techniques in debiased machine learning to debias estimators of policy and causal parameters when covariate shift is present. The debiasing is automatic in that it relies only on the parameter of interest and does not require the form of the bias. We show that our estimator is asymptotically normal as the sample size grows. 
Finally, we demonstrate the proposed method on a regression problem using a Monte-Carlo simulation."}, "https://arxiv.org/abs/2307.07898": {"title": "A Graph-Prediction-Based Approach for Debiasing Underreported Data", "link": "https://arxiv.org/abs/2307.07898", "description": "arXiv:2307.07898v3 Announce Type: replace \nAbstract: We present a novel Graph-based debiasing Algorithm for Underreported Data (GRAUD) aiming at an efficient joint estimation of event counts and discovery probabilities across spatial or graphical structures. This innovative method provides a solution to problems seen in fields such as policing data and COVID-$19$ data analysis. Our approach avoids the need for strong priors typically associated with Bayesian frameworks. By leveraging the graph structures on unknown variables $n$ and $p$, our method debiases the under-report data and estimates the discovery probability at the same time. We validate the effectiveness of our method through simulation experiments and illustrate its practicality in one real-world application: police 911 calls-to-service data."}, "https://arxiv.org/abs/2307.14867": {"title": "One-step smoothing splines instrumental regression", "link": "https://arxiv.org/abs/2307.14867", "description": "arXiv:2307.14867v3 Announce Type: replace \nAbstract: We extend nonparametric regression smoothing splines to a context where there is endogeneity and instrumental variables are available. Unlike popular existing estimators, the resulting estimator is one-step and relies on a unique regularization parameter. We derive uniform rates of the convergence for the estimator and its first derivative. We also address the issue of imposing monotonicity in estimation and extend the approach to a partly linear model. Simulations confirm the good performances of our estimator compared to two-step procedures. Our method yields economically sensible results when used to estimate Engel curves."}, "https://arxiv.org/abs/2309.07261": {"title": "Simultaneous inference for generalized linear models with unmeasured confounders", "link": "https://arxiv.org/abs/2309.07261", "description": "arXiv:2309.07261v3 Announce Type: replace \nAbstract: Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. 
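The final multiplicity correction referred to here is the standard Benjamini-Hochberg step-up rule applied to two-sided p-values from the bias-corrected z-statistics; for reference, a compact generic implementation (with simulated z-statistics, not the paper's data):

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejections at FDR level q (standard BH step-up rule)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))      # largest index passing the step-up criterion
        reject[order[: k + 1]] = True
    return reject

# Toy bias-corrected z-statistics: mostly null, a few shifted.
rng = np.random.default_rng(5)
z = np.concatenate([rng.normal(0, 1, 950), rng.normal(4, 1, 50)])
pvals = 2 * norm.sf(np.abs(z))                 # two-sided p-values
print("rejections:", benjamini_hochberg(pvals, q=0.05).sum())
```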
By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model."}, "https://arxiv.org/abs/2312.03967": {"title": "Test-negative designs with various reasons for testing: statistical bias and solution", "link": "https://arxiv.org/abs/2312.03967", "description": "arXiv:2312.03967v2 Announce Type: replace \nAbstract: Test-negative designs are widely used for post-market evaluation of vaccine effectiveness, particularly in cases where randomization is not feasible. Differing from classical test-negative designs where only healthcare-seekers with symptoms are included, recent test-negative designs have involved individuals with various reasons for testing, especially in an outbreak setting. While including these data can increase sample size and hence improve precision, concerns have been raised about whether they introduce bias into the current framework of test-negative designs, thereby demanding a formal statistical examination of this modified design. In this article, using statistical derivations, causal graphs, and numerical simulations, we show that the standard odds ratio estimator may be biased if various reasons for testing are not accounted for. To eliminate this bias, we identify three categories of reasons for testing, including symptoms, disease-unrelated reasons, and case contact tracing, and characterize associated statistical properties and estimands. Based on our characterization, we show how to consistently estimate each estimand via stratification. Furthermore, we describe when these estimands correspond to the same vaccine effectiveness parameter, and, when appropriate, propose a stratified estimator that can incorporate multiple reasons for testing and improve precision. The performance of our proposed method is demonstrated through simulation studies."}, "https://arxiv.org/abs/2312.06415": {"title": "Bioequivalence Design with Sampling Distribution Segments", "link": "https://arxiv.org/abs/2312.06415", "description": "arXiv:2312.06415v3 Announce Type: replace \nAbstract: In bioequivalence design, power analyses dictate how much data must be collected to detect the absence of clinically important effects. Power is computed as a tail probability in the sampling distribution of the pertinent test statistics. When these test statistics cannot be constructed from pivotal quantities, their sampling distributions are approximated via repetitive, time-intensive computer simulation. We propose a novel simulation-based method to quickly approximate the power curve for many such bioequivalence tests by efficiently exploring segments (as opposed to the entirety) of the relevant sampling distributions. Despite not estimating the entire sampling distribution, this approach prompts unbiased sample size recommendations. We illustrate this method using two-group bioequivalence tests with unequal variances and overview its broader applicability in clinical design. All methods proposed in this work can be implemented using the developed dent package in R."}, "https://arxiv.org/abs/2203.05860": {"title": "Modelling non-stationarity in asymptotically independent extremes", "link": "https://arxiv.org/abs/2203.05860", "description": "arXiv:2203.05860v4 Announce Type: replace-cross \nAbstract: In many practical applications, evaluating the joint impact of combinations of environmental variables is important for risk management and structural design analysis. 
When such variables are considered simultaneously, non-stationarity can exist within both the marginal distributions and dependence structure, resulting in complex data structures. In the context of extremes, few methods have been proposed for modelling trends in extremal dependence, even though capturing this feature is important for quantifying joint impact. Moreover, most proposed techniques are only applicable to data structures exhibiting asymptotic dependence. Motivated by observed dependence trends of data from the UK Climate Projections, we propose a novel semi-parametric modelling framework for bivariate extremal dependence structures. This framework allows us to capture a wide variety of dependence trends for data exhibiting asymptotic independence. When applied to the climate projection dataset, our model detects significant dependence trends in observations and, in combination with models for marginal non-stationarity, can be used to produce estimates of bivariate risk measures at future time points."}, "https://arxiv.org/abs/2309.08043": {"title": "On Prediction Feature Assignment in the Heckman Selection Model", "link": "https://arxiv.org/abs/2309.08043", "description": "arXiv:2309.08043v2 Announce Type: replace-cross \nAbstract: Under missing-not-at-random (MNAR) sample selection bias, the performance of a prediction model is often degraded. This paper focuses on one classic instance of MNAR sample selection bias where a subset of samples have non-randomly missing outcomes. The Heckman selection model and its variants have commonly been used to handle this type of sample selection bias. The Heckman model uses two separate equations to model the prediction and selection of samples, where the selection features include all prediction features. When using the Heckman model, the prediction features must be properly chosen from the set of selection features. However, choosing the proper prediction features is a challenging task for the Heckman model. This is especially the case when the number of selection features is large. Existing approaches that use the Heckman model often provide a manually chosen set of prediction features. In this paper, we propose Heckman-FA as a novel data-driven framework for obtaining prediction features for the Heckman model. Heckman-FA first trains an assignment function that determines whether or not a selection feature is assigned as a prediction feature. Using the parameters of the trained function, the framework extracts a suitable set of prediction features based on the goodness-of-fit of the prediction model given the chosen prediction features and the correlation between noise terms of the prediction and selection equations. Experimental results on real-world datasets show that Heckman-FA produces a robust regression model under MNAR sample selection bias."}, "https://arxiv.org/abs/2310.01374": {"title": "Corrected generalized cross-validation for finite ensembles of penalized estimators", "link": "https://arxiv.org/abs/2310.01374", "description": "arXiv:2310.01374v2 Announce Type: replace-cross \nAbstract: Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. 
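As a baseline for the ensemble setting studied in this paper, recall the classical GCV computation for a single ridge estimator: the training mean squared error is inflated by (1 - df/n)^{-2}, where df is the trace of the smoothing matrix. A minimal sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(6)

def ridge_gcv(X, y, lam):
    """Generalized cross-validation for ridge regression: training MSE divided by
    (1 - df/n)^2, with df the trace of the smoothing ("hat") matrix."""
    n, p = X.shape
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    df = np.trace(hat)
    residuals = y - hat @ y
    return np.mean(residuals ** 2) / (1 - df / n) ** 2

# Toy data: GCV is typically used to pick the penalty lambda.
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.concatenate([np.ones(5), np.zeros(p - 5)])
y = X @ beta + rng.normal(size=n)
print({lam: round(ridge_gcv(X, y, lam), 3) for lam in (0.1, 1.0, 10.0, 100.0)})
```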
We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar correction (in an additive sense) based on degrees of freedom adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, or out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV."}, "https://arxiv.org/abs/2404.14534": {"title": "Random Indicator Imputation for Missing Not At Random Data", "link": "https://arxiv.org/abs/2404.14534", "description": "arXiv:2404.14534v1 Announce Type: new \nAbstract: Imputation methods for dealing with incomplete data typically assume that the missingness mechanism is at random (MAR). These methods can also be applied to missing not at random (MNAR) situations, where the user specifies some adjustment parameters that describe the degree of departure from MAR. The effect of different pre-chosen values is then studied on the inferences. This paper proposes a novel imputation method, the Random Indicator (RI) method, which, in contrast to the current methodology, estimates these adjustment parameters from the data. For an incomplete variable $X$, the RI method assumes that the observed part of $X$ is normal and the probability for $X$ to be missing follows a logistic function. The idea is to estimate the adjustment parameters by generating a pseudo response indicator from this logistic function. Our method iteratively draws imputations for $X$ and the realization of the response indicator $R$, to which we refer as $\\dot{R}$, for $X$. By cross-classifying $X$ by $R$ and $\\dot{R}$, we obtain various properties on the distribution of the missing data. These properties form the basis for estimating the degree of departure from MAR. Our numerical simulations show that the RI method performs very well across a variety of situations. We show how the method can be used in a real life data set. The RI method is automatic and opens up new ways to tackle the problem of MNAR data."}, "https://arxiv.org/abs/2404.14603": {"title": "Quantifying the Internal Validity of Weighted Estimands", "link": "https://arxiv.org/abs/2404.14603", "description": "arXiv:2404.14603v1 Announce Type: new \nAbstract: In this paper we study a class of weighted estimands, which we define as parameters that can be expressed as weighted averages of the underlying heterogeneous treatment effects. The popular ordinary least squares (OLS), two-stage least squares (2SLS), and two-way fixed effects (TWFE) estimands are all special cases within our framework. Our focus is on answering two questions concerning weighted estimands. First, under what conditions can they be interpreted as the average treatment effect for some (possibly latent) subpopulation? 
Second, when these conditions are satisfied, what is the upper bound on the size of that subpopulation, either in absolute terms or relative to a target population of interest? We argue that this upper bound provides a valuable diagnostic for empirical research. When a given weighted estimand corresponds to the average treatment effect for a small subset of the population of interest, we say its internal validity is low. Our paper develops practical tools to quantify the internal validity of weighted estimands."}, "https://arxiv.org/abs/2404.14623": {"title": "On Bayesian wavelet shrinkage estimation of nonparametric regression models with stationary errors", "link": "https://arxiv.org/abs/2404.14623", "description": "arXiv:2404.14623v1 Announce Type: new \nAbstract: This work proposes a Bayesian rule based on the mixture of a point mass function at zero and the logistic distribution to perform wavelet shrinkage in nonparametric regression models with stationary errors (with short or long-memory behavior). The proposal is assessed through Monte Carlo experiments and illustrated with real data. Simulation studies indicate that the precision of the estimates decreases as the amount of correlation increases. However, given a sample size and error correlated noise, the performance of the rule is almost the same while the signal-to-noise ratio decreases, compared to the performance of the rule under independent and identically distributed errors. Further, we find that the performance of the proposal is better than the standard soft thresholding rule with universal policy in most of the considered underlying functions, sample sizes and signal-to-noise ratios scenarios."}, "https://arxiv.org/abs/2404.14644": {"title": "Identifying sparse treatment effects", "link": "https://arxiv.org/abs/2404.14644", "description": "arXiv:2404.14644v1 Announce Type: new \nAbstract: Based on technological advances in sensing modalities, randomized trials with primary outcomes represented as high-dimensional vectors have become increasingly prevalent. For example, these outcomes could be week-long time-series data from wearable devices or high-dimensional neuroimaging data, such as from functional magnetic resonance imaging. This paper focuses on randomized treatment studies with such high-dimensional outcomes characterized by sparse treatment effects, where interventions may influence a small number of dimensions, e.g., small temporal windows or specific brain regions. Conventional practices, such as using fixed, low-dimensional summaries of the outcomes, result in significantly reduced power for detecting treatment effects. To address this limitation, we propose a procedure that involves subset selection followed by inference. Specifically, given a potentially large set of outcome summaries, we identify the subset that captures treatment effects, which requires only one call to the Lasso, and subsequently conduct inference on the selected subset. 
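One simple way to make the "select with the Lasso, then conduct inference" recipe concrete, offered only as an illustration and not as the authors' procedure, is to select outcome dimensions on one half of the randomized sample and test treatment-control mean differences on the selected dimensions in the held-out half:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# Hypothetical randomized trial with a high-dimensional outcome and a sparse effect.
n, d = 400, 500
treat = rng.binomial(1, 0.5, n)
effect = np.zeros(d)
effect[:5] = 1.0                                        # treatment shifts only 5 dimensions
Y = rng.normal(size=(n, d)) + np.outer(treat, effect)

# Selection half: Lasso of the (centered) treatment indicator on the outcome vector.
select_half = rng.permutation(n) < n // 2
lasso = Lasso(alpha=0.05).fit(Y[select_half], treat[select_half] - treat[select_half].mean())
chosen = np.flatnonzero(lasso.coef_ != 0)

# Inference half: plain two-sample tests on the selected dimensions only,
# with a Bonferroni correction over the (small) selected set.
test_half = ~select_half
pvals = np.array([
    ttest_ind(Y[test_half & (treat == 1), j], Y[test_half & (treat == 0), j]).pvalue
    for j in chosen
])
print("selected:", chosen, "significant:", chosen[pvals < 0.05 / max(1, len(chosen))])
```

Sample splitting is used here purely to keep the toy second-stage tests valid after data-driven selection.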
Via theoretical analysis as well as simulations, we demonstrate that our method asymptotically selects the correct subset and increases statistical power."}, "https://arxiv.org/abs/2404.14840": {"title": "Analysis of cohort stepped wedge cluster-randomized trials with non-ignorable dropout via joint modeling", "link": "https://arxiv.org/abs/2404.14840", "description": "arXiv:2404.14840v1 Announce Type: new \nAbstract: Stepped wedge cluster-randomized trial (CRTs) designs randomize clusters of individuals to intervention sequences, ensuring that every cluster eventually transitions from a control period to receive the intervention under study by the end of the study period. The analysis of stepped wedge CRTs is usually more complex than parallel-arm CRTs due to potential secular trends that result in changing intra-cluster and period-cluster correlations over time. A further challenge in the analysis of closed-cohort stepped wedge CRTs, which follow groups of individuals enrolled in each period longitudinally, is the occurrence of dropout. This is particularly problematic in studies of individuals at high risk for mortality, which causes non-ignorable missing outcomes. If not appropriately addressed, missing outcomes from death will erode statistical power, at best, and bias treatment effect estimates, at worst. Joint longitudinal-survival models can accommodate informative dropout and missingness patterns in longitudinal studies. Specifically, within this framework one directly models the dropout process via a time-to-event submodel together with the longitudinal outcome of interest. The two submodels are then linked using a variety of possible association structures. This work extends linear mixed-effects models by jointly modeling the dropout process to accommodate informative missing outcome data in closed-cohort stepped wedge CRTs. We focus on constant intervention and general time-on-treatment effect parametrizations for the longitudinal submodel and study the performance of the proposed methodology using Monte Carlo simulation under several data-generating scenarios. We illustrate the joint modeling methodology in practice by reanalyzing the `Frail Older Adults: Care in Transition' (ACT) trial, a stepped wedge CRT of a multifaceted geriatric care model versus usual care in the Netherlands."}, "https://arxiv.org/abs/2404.15017": {"title": "The mosaic permutation test: an exact and nonparametric goodness-of-fit test for factor models", "link": "https://arxiv.org/abs/2404.15017", "description": "arXiv:2404.15017v1 Announce Type: new \nAbstract: Financial firms often rely on factor models to explain correlations among asset returns. These models are important for managing risk, for example by modeling the probability that many assets will simultaneously lose value. Yet after major events, e.g., COVID-19, analysts may reassess whether existing models continue to fit well: specifically, after accounting for the factor exposures, are the residuals of the asset returns independent? With this motivation, we introduce the mosaic permutation test, a nonparametric goodness-of-fit test for preexisting factor models. Our method allows analysts to use nearly any machine learning technique to detect model violations while provably controlling the false positive rate, i.e., the probability of rejecting a well-fitting model. Notably, this result does not rely on asymptotic approximations and makes no parametric assumptions. 
This property helps prevent analysts from unnecessarily rebuilding accurate models, which can waste resources and increase risk. We illustrate our methodology by applying it to the Blackrock Fundamental Equity Risk (BFRE) model. Using the mosaic permutation test, we find that the BFRE model generally explains the most significant correlations among assets. However, we find evidence of unexplained correlations among certain real estate stocks, and we show that adding new factors improves model fit. We implement our methods in the python package mosaicperm."}, "https://arxiv.org/abs/2404.15060": {"title": "Fast and reliable confidence intervals for a variance component or proportion", "link": "https://arxiv.org/abs/2404.15060", "description": "arXiv:2404.15060v1 Announce Type: new \nAbstract: We show that confidence intervals for a variance component or proportion, with asymptotically correct uniform coverage probability, can be obtained by inverting certain test-statistics based on the score for the restricted likelihood. The results apply in settings where the variance or proportion is near or at the boundary of the parameter set. Simulations indicate the proposed test-statistics are approximately pivotal and lead to confidence intervals with near-nominal coverage even in small samples. We illustrate our methods' application in spatially-resolved transcriptomics where we compute approximately 15,000 confidence intervals, used for gene ranking, in less than 4 minutes. In the settings we consider, the proposed method is between two and 28,000 times faster than popular alternatives, depending on how many confidence intervals are computed."}, "https://arxiv.org/abs/2404.15073": {"title": "The Complex Estimand of Clone-Censor-Weighting When Studying Treatment Initiation Windows", "link": "https://arxiv.org/abs/2404.15073", "description": "arXiv:2404.15073v1 Announce Type: new \nAbstract: Clone-censor-weighting (CCW) is an analytic method for studying treatment regimens that are indistinguishable from one another at baseline without relying on landmark dates or creating immortal person time. One particularly interesting CCW application is estimating outcomes when starting treatment within specific time windows in observational data (e.g., starting a treatment within 30 days of hospitalization). In such cases, CCW estimates something fairly complex. We show how using CCW to study a regimen such as \"start treatment prior to day 30\" estimates the potential outcome of a hypothetical intervention where A) prior to day 30, everyone follows the treatment start distribution of the study population and B) everyone who has not initiated by day 30 initiates on day 30. As a result, the distribution of treatment initiation timings provides essential context for the results of CCW studies. We also show that if the exposure effect varies over time, ignoring exposure history when estimating inverse probability of censoring weights (IPCW) estimates the risk under an impossible intervention and can create selection bias. 
Finally, we examine some simplifying assumptions that can make this complex treatment effect more interpretable and allow everyone to contribute to IPCW."}, "https://arxiv.org/abs/2404.15115": {"title": "Principal Component Analysis and biplots", "link": "https://arxiv.org/abs/2404.15115", "description": "arXiv:2404.15115v1 Announce Type: new \nAbstract: Principal Component Analysis and biplots are so well-established and readily implemented that it is just too tempting to take their internal workings for granted. In this note I get back to basics in comparing how PCA and biplots are implemented in base-R and contributed R packages, leveraging an implementation-agnostic understanding of the computational structure of each technique. I do so with a view to illustrating discrepancies that users might find elusive, as these arise from seemingly innocuous computational choices made under the hood. The proposed evaluation grid elevates aspects that are usually disregarded, including relationships that should hold if the computational rationale underpinning each technique is followed correctly. Strikingly, what is expected from these equivalences rarely follows without caveats from the output of specific implementations alone."}, "https://arxiv.org/abs/2404.15245": {"title": "Mining Invariance from Nonlinear Multi-Environment Data: Binary Classification", "link": "https://arxiv.org/abs/2404.15245", "description": "arXiv:2404.15245v1 Announce Type: new \nAbstract: Making predictions in an unseen environment given data from multiple training environments is a challenging task. We approach this problem from an invariance perspective, focusing on binary classification to shed light on general nonlinear data generation mechanisms. We identify a unique form of invariance that exists solely in a binary setting that allows us to train models invariant over environments. We provide sufficient conditions for such invariance and show it is robust even when environmental conditions vary greatly. Our formulation admits a causal interpretation, allowing us to compare it with various frameworks. Finally, we propose a heuristic prediction method and conduct experiments using real and synthetic datasets."}, "https://arxiv.org/abs/2203.05120": {"title": "A low-rank ensemble Kalman filter for elliptic observations", "link": "https://arxiv.org/abs/2203.05120", "description": "arXiv:2203.05120v3 Announce Type: cross \nAbstract: We propose a regularization method for ensemble Kalman filtering (EnKF) with elliptic observation operators. Commonly used EnKF regularization methods suppress state correlations at long distances. For observations described by elliptic partial differential equations, such as the pressure Poisson equation (PPE) in incompressible fluid flows, distance localization cannot be applied, as we cannot disentangle slowly decaying physical interactions from spurious long-range correlations. This is particularly true for the PPE, in which distant vortex elements couple nonlinearly to induce pressure. Instead, these inverse problems have a low effective dimension: low-dimensional projections of the observations strongly inform a low-dimensional subspace of the state space. We derive a low-rank factorization of the Kalman gain based on the spectrum of the Jacobian of the observation operator. The identified eigenvectors generalize the source and target modes of the multipole expansion, independently of the underlying spatial distribution of the problem.
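The PCA/biplot note above (arXiv:2404.15115) is about discrepancies that arise from seemingly innocuous computational choices. As a minimal illustration of the kind of choice it dissects, and in Python rather than the R implementations the note actually compares, the sketch below derives two common but different score/loading scalings from the same SVD; the conventions shown are generic textbook ones, not taken from the note.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))

# Column-centre (some implementations also scale to unit variance -- already
# one of the "innocuous" choices that changes the output).
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Convention A ("principal component scores"): scores carry the variance.
scores_A = U * s              # = Xc @ Vt.T
loadings_A = Vt.T

# Convention B (common in biplots): standardized scores, loadings scaled
# by the singular values.
scores_B = U * np.sqrt(n - 1)
loadings_B = Vt.T * (s / np.sqrt(n - 1))

# Both conventions reconstruct the same centred data matrix ...
assert np.allclose(scores_A @ loadings_A.T, Xc)
assert np.allclose(scores_B @ loadings_B.T, Xc)
# ... but the numbers a user sees and plots differ by column-wise factors,
# which is the kind of discrepancy the note above dissects for R packages.
print(scores_A[:2] / scores_B[:2])   # constant per column: s / sqrt(n-1)
```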
Given rapid spectral decay, inference can be performed in the low-dimensional subspace spanned by the dominant eigenvectors. This low-rank EnKF is assessed on dynamical systems with Poisson observation operators, where we seek to estimate the positions and strengths of point singularities over time from potential or pressure observations. We also comment on the broader applicability of this approach to elliptic inverse problems outside the context of filtering."}, "https://arxiv.org/abs/2404.14446": {"title": "Spatio-temporal Joint Analysis of PM2", "link": "https://arxiv.org/abs/2404.14446", "description": "arXiv:2404.14446v1 Announce Type: cross \nAbstract: The substantial threat of concurrent air pollutants to public health is increasingly severe under climate change. To identify the common drivers and extent of spatio-temporal similarity of PM2.5 and ozone, this paper proposed a log Gaussian-Gumbel Bayesian hierarchical model allowing for sharing a SPDE-AR(1) spatio-temporal interaction structure. The proposed model outperforms in terms of estimation accuracy and prediction capacity for its increased parsimony and reduced uncertainty, especially for the shared ozone sub-model. Besides the consistently significant influence of temperature (positive), extreme drought (positive), fire burnt area (positive), and wind speed (negative) on both PM2.5 and ozone, surface pressure and GDP per capita (precipitation) demonstrate only positive associations with PM2.5 (ozone), while population density relates to neither. In addition, our results show the distinct spatio-temporal interactions and different seasonal patterns of PM2.5 and ozone, with peaks of PM2.5 and ozone in cold and hot seasons, respectively. Finally, with the aid of the excursion function, we see that the areas around the intersection of San Luis Obispo and Santa Barbara counties are likely to exceed the unhealthy ozone level for sensitive groups throughout the year. Our findings provide new insights for regional and seasonal strategies in the co-control of PM2.5 and ozone. Our methodology is expected to be utilized when interest lies in multiple interrelated processes in the fields of environment and epidemiology."}, "https://arxiv.org/abs/2404.14460": {"title": "Inference of Causal Networks using a Topological Threshold", "link": "https://arxiv.org/abs/2404.14460", "description": "arXiv:2404.14460v1 Announce Type: cross \nAbstract: We propose a constraint-based algorithm, which automatically determines causal relevance thresholds, to infer causal networks from data. We call these topological thresholds. We present two methods for determining the threshold: the first seeks a set of edges that leaves no disconnected nodes in the network; the second seeks a causal large connected component in the data.\n We tested these methods both for discrete synthetic and real data, and compared the results with those obtained for the PC algorithm, which we took as the benchmark. We show that this novel algorithm is generally faster and more accurate than the PC algorithm.\n The algorithm for determining the thresholds requires choosing a measure of causality. We tested our methods for Fisher Correlations, commonly used in PC algorithm (for instance in \\cite{kalisch2005}), and further proposed a discrete and asymmetric measure of causality, that we called Net Influence, which provided very good results when inferring causal networks from discrete data. 
This metric allows for inferring directionality of the edges in the process of applying the thresholds, speeding up the inference of causal DAGs."}, "https://arxiv.org/abs/2404.14786": {"title": "LLM-Enhanced Causal Discovery in Temporal Domain from Interventional Data", "link": "https://arxiv.org/abs/2404.14786", "description": "arXiv:2404.14786v1 Announce Type: cross \nAbstract: In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for constructing operation and maintenance graphs, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets with heavy reliance on intervention targets and ignore the textual information hidden in real-world systems, failing to conduct causal discovery for real industrial scenarios. To tackle this problem, in this paper we propose to investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations via leveraging the textual information in systems which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract the meta-knowledge from textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures."}, "https://arxiv.org/abs/2404.15133": {"title": "Bayesian Strategies for Repulsive Spatial Point Processes", "link": "https://arxiv.org/abs/2404.15133", "description": "arXiv:2404.15133v1 Announce Type: cross \nAbstract: There is increasing interest in developing Bayesian inferential algorithms for point process models with intractable likelihoods. A purpose of this paper is to illustrate the utility of using simulation-based strategies, including approximate Bayesian computation (ABC) and Markov chain Monte Carlo (MCMC) methods for this task. Shirota and Gelfand (2017) proposed an extended version of an ABC approach for repulsive spatial point processes, including the Strauss point process and the determinantal point process, but their algorithm was not correctly detailed. We explain that it is, in general, intractable and therefore impractical to use, except in some restrictive situations. This motivates us to instead consider an ABC-MCMC algorithm developed by Fearnhead and Prangle (2012). We further explore the use of the exchange algorithm, together with the recently proposed noisy Metropolis-Hastings algorithm (Alquier et al., 2016).
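The topological-threshold abstract above (arXiv:2404.14460) describes, as its first method, choosing the causal-relevance threshold so that no node in the inferred network is left disconnected. The sketch below is one generic reading of that idea using absolute Pearson correlation as a stand-in for the causality measure; it is not the authors' algorithm, and their Net Influence measure is not implemented.

```python
import numpy as np

def topological_threshold(data):
    """Pick the largest threshold t such that keeping edges with |corr| >= t
    leaves no isolated node (the "no disconnected nodes" criterion).

    data: (n_samples, n_vars) array; absolute Pearson correlation stands in
    for the causality measure, which the paper leaves pluggable."""
    C = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(C, 0.0)
    # A node stays connected only up to the strength of its strongest edge,
    # so the binding threshold is the minimum over nodes of that maximum.
    t = C.max(axis=1).min()
    return t, C >= t

rng = np.random.default_rng(2)
z = rng.standard_normal((500, 1))
data = np.hstack([z + 0.3 * rng.standard_normal((500, 1)) for _ in range(4)]
                 + [rng.standard_normal((500, 1))])   # last variable is pure noise
t, A = topological_threshold(data)
print("threshold:", round(t, 3))
print(A.astype(int))
```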
As an extension of the exchange algorithm, which requires a single simulation from the likelihood at each iteration, the noisy Metropolis-Hastings algorithm considers multiple draws from the same likelihood function. We find that both of these inferential approaches yield good performance for repulsive spatial point processes in both simulated and real data applications and should be considered as viable approaches for the analysis of these models."}, "https://arxiv.org/abs/2404.15209": {"title": "Data-Driven Knowledge Transfer in Batch $Q^*$ Learning", "link": "https://arxiv.org/abs/2404.15209", "description": "arXiv:2404.15209v1 Announce Type: cross \nAbstract: In data-driven decision-making in marketing, healthcare, and education, it is desirable to utilize a large amount of data from existing ventures to navigate high-dimensional feature spaces and address data scarcity in new ventures. We explore knowledge transfer in dynamic decision-making by concentrating on batch stationary environments and formally defining task discrepancies through the lens of Markov decision processes (MDPs). We propose a framework of Transferred Fitted $Q$-Iteration algorithm with general function approximation, enabling the direct estimation of the optimal action-state function $Q^*$ using both target and source data. We establish the relationship between statistical performance and MDP task discrepancy under sieve approximation, shedding light on the impact of source and target sample sizes and task discrepancy on the effectiveness of knowledge transfer. We show that the final learning error of the $Q^*$ function is significantly improved from the single task rate both theoretically and empirically."}, "https://arxiv.org/abs/2004.01623": {"title": "Estimation and Uniform Inference in Sparse High-Dimensional Additive Models", "link": "https://arxiv.org/abs/2004.01623", "description": "arXiv:2004.01623v2 Announce Type: replace \nAbstract: We develop a novel method to construct uniformly valid confidence bands for a nonparametric component $f_1$ in the sparse additive model $Y=f_1(X_1)+\\ldots + f_p(X_p) + \\varepsilon$ in a high-dimensional setting. Our method integrates sieve estimation into a high-dimensional Z-estimation framework, facilitating the construction of uniformly valid confidence bands for the target component $f_1$. To form these confidence bands, we employ a multiplier bootstrap procedure. Additionally, we provide rates for the uniform lasso estimation in high dimensions, which may be of independent interest. Through simulation studies, we demonstrate that our proposed method delivers reliable results in terms of estimation and coverage, even in small samples."}, "https://arxiv.org/abs/2205.07950": {"title": "The Power of Tests for Detecting $p$-Hacking", "link": "https://arxiv.org/abs/2205.07950", "description": "arXiv:2205.07950v3 Announce Type: replace \nAbstract: $p$-Hacking undermines the validity of empirical studies. A flourishing empirical literature investigates the prevalence of $p$-hacking based on the distribution of $p$-values across studies. Interpreting results in this literature requires a careful understanding of the power of methods for detecting $p$-hacking. We theoretically study the implications of likely forms of $p$-hacking on the distribution of $p$-values to understand the power of tests for detecting it. Power depends crucially on the $p$-hacking strategy and the distribution of true effects. 
Publication bias can enhance the power for testing the joint null of no $p$-hacking and no publication bias."}, "https://arxiv.org/abs/2311.16333": {"title": "From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks", "link": "https://arxiv.org/abs/2311.16333", "description": "arXiv:2311.16333v2 Announce Type: replace \nAbstract: We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density forecasting through a novel neural network architecture with dedicated mean and variance hemispheres. Our architecture features several key ingredients making MLE work in this context. First, the hemispheres share a common core at the entrance of the network which accommodates for various forms of time variation in the error variance. Second, we introduce a volatility emphasis constraint that breaks mean/variance indeterminacy in this class of overparametrized nonlinear models. Third, we conduct a blocked out-of-bag reality check to curb overfitting in both conditional moments. Fourth, the algorithm utilizes standard deep learning software and thus handles large data sets - both computationally and statistically. Ergo, our Hemisphere Neural Network (HNN) provides proactive volatility forecasts based on leading indicators when it can, and reactive volatility based on the magnitude of previous prediction errors when it must. We evaluate point and density forecasts with an extensive out-of-sample experiment and benchmark against a suite of models ranging from classics to more modern machine learning-based offerings. In all cases, HNN fares well by consistently providing accurate mean/variance forecasts for all targets and horizons. Studying the resulting volatility paths reveals its versatility, while probabilistic forecasting evaluation metrics showcase its enviable reliability. Finally, we also demonstrate how this machinery can be merged with other structured deep learning models by revisiting Goulet Coulombe (2022)'s Neural Phillips Curve."}, "https://arxiv.org/abs/2401.04263": {"title": "Two-Step Targeted Minimum-Loss Based Estimation for Non-Negative Two-Part Outcomes", "link": "https://arxiv.org/abs/2401.04263", "description": "arXiv:2401.04263v2 Announce Type: replace \nAbstract: Non-negative two-part outcomes are defined as outcomes with a density function that have a zero point mass but are otherwise positive. Examples, such as healthcare expenditure and hospital length of stay, are common in healthcare utilization research. Despite the practical relevance of non-negative two-part outcomes, very few methods exist to leverage knowledge of their semicontinuity to achieve improved performance in estimating causal effects. In this paper, we develop a nonparametric two-step targeted minimum-loss based estimator (denoted as hTMLE) for non-negative two-part outcomes. We present methods for a general class of interventions referred to as modified treatment policies, which can accommodate continuous, categorical, and binary exposures. The two-step TMLE uses a targeted estimate of the intensity component of the outcome to produce a targeted estimate of the binary component of the outcome that may improve finite sample efficiency. 
We demonstrate the efficiency gains achieved by the two-step TMLE with simulated examples and then apply it to a cohort of Medicaid beneficiaries to estimate the effect of chronic pain and physical disability on days' supply of opioids."}, "https://arxiv.org/abs/2404.15572": {"title": "Mapping Incidence and Prevalence Peak Data for SIR Forecasting Applications", "link": "https://arxiv.org/abs/2404.15572", "description": "arXiv:2404.15572v1 Announce Type: new \nAbstract: Infectious disease modeling and forecasting have played a key role in helping assess and respond to epidemics and pandemics. Recent work has leveraged data on disease peak infection and peak hospital incidence to fit compartmental models for the purpose of forecasting and describing the dynamics of a disease outbreak. Incorporating these data can greatly stabilize a compartmental model fit on early observations, where slight perturbations in the data may lead to model fits that project wildly unrealistic peak infection. We introduce a new method for incorporating historic data on the value and time of peak incidence of hospitalization into the fit for a Susceptible-Infectious-Recovered (SIR) model by formulating the relationship between an SIR model's starting parameters and peak incidence as a system of two equations that can be solved computationally. This approach is assessed for practicality in terms of accuracy and speed of computation via simulation. To exhibit the modeling potential, we update the Dirichlet-Beta State Space modeling framework to use hospital incidence data, as this framework was previously formulated to incorporate only data on total infections."}, "https://arxiv.org/abs/2404.15586": {"title": "Multiple testing with anytime-valid Monte-Carlo p-values", "link": "https://arxiv.org/abs/2404.15586", "description": "arXiv:2404.15586v1 Announce Type: new \nAbstract: In contemporary problems involving genetic or neuroimaging data, thousands of hypotheses need to be tested. Due to their high power, and finite sample guarantees on type-1 error under weak assumptions, Monte-Carlo permutation tests are often considered as gold standard for these settings. However, the enormous computational effort required for (thousands of) permutation tests is a major burden. Recently, Fischer and Ramdas (2024) constructed a permutation test for a single hypothesis in which the permutations are drawn sequentially one-by-one and the testing process can be stopped at any point without inflating the type I error. They showed that the number of permutations can be substantially reduced (under null and alternative) while the power remains similar. We show how their approach can be modified to make it suitable for a broad class of multiple testing procedures. In particular, we discuss its use with the Benjamini-Hochberg procedure and illustrate the application on a large dataset."}, "https://arxiv.org/abs/2404.15329": {"title": "Greedy Capon Beamformer", "link": "https://arxiv.org/abs/2404.15329", "description": "arXiv:2404.15329v1 Announce Type: cross \nAbstract: We propose greedy Capon beamformer (GBF) for direction finding of narrow-band sources present in the array's viewing field. After defining the grid covering the location search space, the algorithm greedily builds the interference-plus-noise covariance matrix by identifying a high-power source on the grid using Capon's principle of maximizing the signal to interference plus noise ratio (SINR) while enforcing unit gain towards the signal of interest. 
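The anytime-valid Monte-Carlo p-value abstract above (arXiv:2404.15586) combines sequentially drawn permutation p-values with procedures such as Benjamini-Hochberg. As background only, the sketch below applies standard BH to classical fixed-budget Monte Carlo permutation p-values; the sequential, anytime-valid construction of Fischer and Ramdas is not implemented, and the toy data are illustrative.

```python
import numpy as np

def mc_permutation_pvalue(x, y, B=999, rng=None):
    """Classical fixed-B Monte Carlo permutation p-value for a (one-sided)
    difference in means; the +1 terms keep the test valid at finite B."""
    if rng is None:
        rng = np.random.default_rng()
    pooled = np.concatenate([x, y])
    obs = x.mean() - y.mean()
    hits = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        hits += (perm[: len(x)].mean() - perm[len(x):].mean()) >= obs
    return (1 + hits) / (1 + B)

def benjamini_hochberg(pvals, alpha=0.1):
    """Indices rejected by the BH step-up procedure at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]

rng = np.random.default_rng(3)
pvals = []
for j in range(50):
    shift = 1.0 if j < 10 else 0.0          # 10 true signals
    x = rng.normal(shift, 1, 30)
    y = rng.normal(0, 1, 30)
    pvals.append(mc_permutation_pvalue(x, y, B=499, rng=rng))
print("rejected hypotheses:", sorted(benjamini_hochberg(pvals, alpha=0.1)))
```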
An estimate of the power of the detected source is derived by exploiting the unit power constraint, which subsequently allows the noise covariance matrix to be updated by a simple rank-1 addition: the outer product of the selected steering vector with itself, scaled by the signal power estimate. Our numerical examples demonstrate the effectiveness of the proposed GBF in direction finding, where it performs favourably compared to state-of-the-art algorithms under a broad variety of settings. Furthermore, GBF estimates of directions of arrival (DOAs) are very fast to compute."}, "https://arxiv.org/abs/2404.15440": {"title": "Exploring Convergence in Relation using Association Rules Mining: A Case Study in Collaborative Knowledge Production", "link": "https://arxiv.org/abs/2404.15440", "description": "arXiv:2404.15440v1 Announce Type: cross \nAbstract: This study delves into the pivotal role played by non-experts in knowledge production on open collaboration platforms, with a particular focus on the intricate process of tag development that culminates in the proposal of new glitch classes. Leveraging the power of Association Rule Mining (ARM), this research endeavors to unravel the underlying dynamics of collaboration among citizen scientists. By meticulously quantifying tag associations and scrutinizing their temporal dynamics, the study provides a comprehensive and nuanced understanding of how non-experts collaborate to generate valuable scientific insights. Furthermore, this investigation extends its purview to examine the phenomenon of ideological convergence within online citizen science knowledge production. To accomplish this, a novel measurement algorithm, based on the Mann-Kendall Trend Test, is introduced. This innovative approach sheds illuminating light on the dynamics of collaborative knowledge production, revealing both the vast opportunities and daunting challenges inherent in leveraging non-expert contributions for scientific research endeavors. Notably, the study uncovers a robust pattern of convergence in ideology, employing both the newly proposed convergence testing method and the traditional approach based on the stationarity of time series data. This groundbreaking discovery holds significant implications for understanding the dynamics of online citizen science communities and underscores the crucial role played by non-experts in shaping the scientific landscape of the digital age. Ultimately, this study contributes significantly to our understanding of online citizen science communities, highlighting their potential to harness collective intelligence for tackling complex scientific tasks and enriching our comprehension of collaborative knowledge production processes in the digital age."}, "https://arxiv.org/abs/2404.15484": {"title": "Uncertainty, Imprecise Probabilities and Interval Capacity Measures on a Product Space", "link": "https://arxiv.org/abs/2404.15484", "description": "arXiv:2404.15484v1 Announce Type: cross \nAbstract: In Basili and Pratelli (2024), a novel and coherent concept of interval probability measures has been introduced, providing a method for representing imprecise probabilities and uncertainty. Within the framework of set algebra, we introduced the concepts of weak complementation and interval probability measures associated with a family of random variables, which effectively capture the inherent uncertainty in any event. This paper conducts a comprehensive analysis of these concepts within a specific probability space.
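One plausible reading, in code, of the greedy Capon beamformer abstract above (arXiv:2404.15329): detect one source at a time with a minimum-variance beamformer built from the current interference-plus-noise covariance, form a crude power estimate, and fold the detected source back in via a rank-1 update. This is a sketch of the idea as described in the abstract, not the authors' algorithm; the array model, the power-estimation step, and all parameter choices are assumptions.

```python
import numpy as np

def steering(theta, m):
    """ULA steering vector, half-wavelength spacing, angle in radians."""
    return np.exp(-1j * np.pi * np.arange(m) * np.sin(theta))

def greedy_capon(X, grid, n_sources, noise_var):
    """Greedy Capon-style DOA estimation: at each pass, scan the grid with an
    MVDR beamformer (unit gain towards the scanned direction) built from the
    current interference-plus-noise covariance Q, pick the strongest output,
    estimate its power, and add it to Q as a rank-1 term."""
    m, t = X.shape
    R = X @ X.conj().T / t                     # sample covariance
    Q = noise_var * np.eye(m, dtype=complex)   # interference-plus-noise cov.
    A = np.column_stack([steering(th, m) for th in grid])
    doas, powers = [], []
    for _ in range(n_sources):
        Qinv = np.linalg.inv(Q)
        est = np.empty(len(grid))
        leak = np.empty(len(grid))
        for i in range(len(grid)):
            a = A[:, i]
            w = Qinv @ a / (a.conj() @ Qinv @ a)   # unit gain towards a
            est[i] = np.real(w.conj() @ R @ w)     # beamformer output power
            leak[i] = np.real(w.conj() @ Q @ w)    # residual noise it passes
        k = int(np.argmax(est))
        p_hat = max(est[k] - leak[k], 0.0)         # crude power estimate
        a_k = A[:, k]
        Q = Q + p_hat * np.outer(a_k, a_k.conj())  # rank-1 update
        doas.append(grid[k]); powers.append(p_hat)
    return np.degrees(doas), powers

# Two unit-power sources at -20 and 15 degrees, 8 sensors, 200 snapshots.
rng = np.random.default_rng(4)
m, t, noise_var = 8, 200, 0.1
true = np.radians([-20.0, 15.0])
S = (rng.standard_normal((2, t)) + 1j * rng.standard_normal((2, t))) / np.sqrt(2)
X = np.column_stack([steering(th, m) for th in true]) @ S
X += np.sqrt(noise_var / 2) * (rng.standard_normal((m, t)) + 1j * rng.standard_normal((m, t)))
grid = np.radians(np.arange(-90, 90.5, 0.5))
print(greedy_capon(X, grid, n_sources=2, noise_var=noise_var))
```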
Additionally, we elaborate on an updating rule for events, integrating essential concepts of statistical independence, dependence, and stochastic dominance."}, "https://arxiv.org/abs/2404.15495": {"title": "Correlations versus noise in the NFT market", "link": "https://arxiv.org/abs/2404.15495", "description": "arXiv:2404.15495v1 Announce Type: cross \nAbstract: The non-fungible token (NFT) market emerges as a recent trading innovation leveraging blockchain technology, mirroring the dynamics of the cryptocurrency market. To deepen the understanding of the dynamics of this market, in the current study, based on the capitalization changes and transaction volumes across a large number of token collections on the Ethereum platform, the degree of correlation in this market is examined by using the multivariate formalism of detrended correlation coefficient and correlation matrix. It appears that correlation strength is lower here than that observed in previously studied markets. Consequently, the eigenvalue spectra of the correlation matrix more closely follow the Marchenko-Pastur distribution; still, some departures indicating the existence of correlations remain. The comparison of results obtained from the correlation matrix built from the Pearson coefficients and, independently, from the detrended cross-correlation coefficients suggests that the global correlations in the NFT market arise from higher frequency fluctuations. Corresponding minimal spanning trees (MSTs) for capitalization variability exhibit a scale-free character while, for the number of transactions, they are somewhat more decentralized."}, "https://arxiv.org/abs/2404.15649": {"title": "The Impact of Loss Estimation on Gibbs Measures", "link": "https://arxiv.org/abs/2404.15649", "description": "arXiv:2404.15649v1 Announce Type: cross \nAbstract: In recent years, the shortcomings of Bayes posteriors as inferential devices have received increased attention. A popular strategy for fixing them has been to instead target a Gibbs measure based on losses that connect a parameter of interest to observed data. While existing theory for such inference procedures relies on these losses being analytically available, in many situations these losses must be stochastically estimated using pseudo-observations. The current paper fills this research gap and derives the first asymptotic theory for Gibbs measures based on estimated losses. Our findings reveal that the number of pseudo-observations required to accurately approximate the exact Gibbs measure depends on the rates at which the bias and variance of the estimated loss converge to zero. These results are particularly consequential for the emerging field of generalised Bayesian inference, for estimated intractable likelihoods, and for biased pseudo-marginal approaches. We apply our results to three Gibbs measures that have been proposed to deal with intractable likelihoods and model misspecification."}, "https://arxiv.org/abs/2404.15654": {"title": "Autoregressive Networks with Dependent Edges", "link": "https://arxiv.org/abs/2404.15654", "description": "arXiv:2404.15654v1 Announce Type: cross \nAbstract: We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models which accommodate, for example, transitivity, density dependence and other stylized features often observed in real network data.
By assuming the edges of the network at each time point are independent conditionally on their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and maximum likelihood estimation in a straightforward manner. Due to the possibly large number of parameters in the models, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed via a projection-based iteration that mitigates the impact of the other parameters (Chang et al., 2021, 2023). Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. Illustration with a transitivity model is carried out on both simulated data and a real network data set."}, "https://arxiv.org/abs/2404.15764": {"title": "Assessment of the quality of a prediction", "link": "https://arxiv.org/abs/2404.15764", "description": "arXiv:2404.15764v1 Announce Type: cross \nAbstract: Shannon defined the mutual information between two variables. We illustrate why the true mutual information between a variable and the predictions made by a prediction algorithm is not a suitable measure of prediction quality, but the apparent Shannon mutual information (ASI) is; indeed it is the unique prediction quality measure with either of two very different lists of desirable properties, as previously shown by de Finetti and other authors. However, estimating the uncertainty of the ASI is a difficult problem, because of long and non-symmetric heavy tails of the distribution of the individual values of $j(x,y)=\\log\\frac{Q_y(x)}{P(x)}$. We propose a Bayesian modelling method for the distribution of $j(x,y)$, from the posterior distribution of which the uncertainty in the ASI can be inferred. This method is based on Dirichlet-based mixtures of skew-Student distributions. We illustrate its use on data from a Bayesian model for prediction of the recurrence time of prostate cancer. We believe that this approach is generally appropriate for most problems, where it is infeasible to derive the explicit distribution of the samples of $j(x,y)$, though the precise modelling parameters may need adjustment to suit particular cases."}, "https://arxiv.org/abs/2404.15843": {"title": "Large-sample theory for inferential models: a possibilistic Bernstein--von Mises theorem", "link": "https://arxiv.org/abs/2404.15843", "description": "arXiv:2404.15843v1 Announce Type: cross \nAbstract: The inferential model (IM) framework offers alternatives to the familiar probabilistic (e.g., Bayesian and fiducial) uncertainty quantification in statistical inference. Allowing this uncertainty quantification to be imprecise makes it possible to achieve exact validity and reliability. But are imprecision and exact validity compatible with attainment of the classical notions of statistical efficiency? The present paper offers an affirmative answer to this question via a new possibilistic Bernstein--von Mises theorem that parallels a fundamental result in Bayesian inference.
Among other things, our result demonstrates that the IM solution is asymptotically efficient in the sense that its asymptotic credal set is the smallest that contains the Gaussian distribution whose variance agrees with the Cramer--Rao lower bound."}, "https://arxiv.org/abs/2404.15967": {"title": "Interpretable clustering with the Distinguishability criterion", "link": "https://arxiv.org/abs/2404.15967", "description": "arXiv:2404.15967v1 Announce Type: cross \nAbstract: Cluster analysis is a popular unsupervised learning tool used in many disciplines to identify heterogeneous sub-populations within a sample. However, validating cluster analysis results and determining the number of clusters in a data set remains an outstanding problem. In this work, we present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations. Our computational implementation of the Distinguishability criterion corresponds to the Bayes risk of a randomized classifier under the 0-1 loss. We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures, such as hierarchical clustering, k-means, and finite mixture models. We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications."}, "https://arxiv.org/abs/2106.07096": {"title": "Tests for partial correlation between repeatedly observed nonstationary nonlinear timeseries", "link": "https://arxiv.org/abs/2106.07096", "description": "arXiv:2106.07096v2 Announce Type: replace \nAbstract: We describe two families of statistical tests to detect partial correlation in vectorial timeseries. The tests measure whether an observed timeseries Y can be predicted from a second series X, even after accounting for a third series Z which may correlate with X. They do not make any assumptions on the nature of these timeseries, such as stationarity or linearity, but they do require that multiple statistically independent recordings of the 3 series are available. Intuitively, the tests work by asking if the series Y recorded on one experiment can be better predicted from X recorded on the same experiment than on a different experiment, after accounting for the prediction from Z recorded on both experiments."}, "https://arxiv.org/abs/2204.06960": {"title": "The replication of equivalence studies", "link": "https://arxiv.org/abs/2204.06960", "description": "arXiv:2204.06960v3 Announce Type: replace \nAbstract: Replication studies are increasingly conducted to assess the credibility of scientific findings. Most of these replication attempts target studies with a superiority design, but there is a lack of methodology regarding the analysis of replication studies with alternative types of designs, such as equivalence. In order to fill this gap, we propose two approaches, the two-trials rule and the sceptical TOST procedure, adapted from methods used in superiority settings. Both methods have the same overall Type-I error rate, but the sceptical TOST procedure allows replication success even for non-significant original or replication studies. This leads to a larger project power and other differences in relevant operating characteristics. Both methods can be used for sample size calculation of the replication study, based on the results from the original one. 
The two methods are applied to data from the Reproducibility Project: Cancer Biology."}, "https://arxiv.org/abs/2305.11126": {"title": "More powerful multiple testing under dependence via randomization", "link": "https://arxiv.org/abs/2305.11126", "description": "arXiv:2305.11126v3 Announce Type: replace \nAbstract: We show that two procedures for false discovery rate (FDR) control -- the Benjamini-Yekutieli procedure for dependent p-values, and the e-Benjamini-Hochberg procedure for dependent e-values -- can both be made more powerful by a simple randomization involving one independent uniform random variable. As a corollary, the Hommel test under arbitrary dependence is also improved. Importantly, our randomized improvements are never worse than the originals and are typically strictly more powerful, with marked improvements in simulations. The same technique also improves essentially every other multiple testing procedure based on e-values."}, "https://arxiv.org/abs/2306.01086": {"title": "Multi-Study R-Learner for Estimating Heterogeneous Treatment Effects Across Studies Using Statistical Machine Learning", "link": "https://arxiv.org/abs/2306.01086", "description": "arXiv:2306.01086v3 Announce Type: replace \nAbstract: Estimating heterogeneous treatment effects (HTEs) is crucial for precision medicine. While multiple studies can improve the generalizability of results, leveraging them for estimation is statistically challenging. Existing approaches often assume identical HTEs across studies, but this may be violated due to various sources of between-study heterogeneity, including differences in study design, study populations, and data collection protocols, among others. To this end, we propose a framework for multi-study HTE estimation that accounts for between-study heterogeneity in the nuisance functions and treatment effects. Our approach, the multi-study R-learner, extends the R-learner to obtain principled statistical estimation with machine learning (ML) in the multi-study setting. It involves a data-adaptive objective function that links study-specific treatment effects with nuisance functions through membership probabilities, which enable information to be borrowed across potentially heterogeneous studies. The multi-study R-learner framework can combine data from randomized controlled trials, observational studies, or a combination of both. It's easy to implement and flexible in its ability to incorporate ML for estimating HTEs, nuisance functions, and membership probabilities. In the series estimation framework, we show that the multi-study R-learner is asymptotically normal and more efficient than the R-learner when there is between-study heterogeneity in the propensity score model under homoscedasticity. We illustrate using cancer data that the proposed method performs favorably compared to existing approaches in the presence of between-study heterogeneity."}, "https://arxiv.org/abs/2306.11979": {"title": "Qini Curves for Multi-Armed Treatment Rules", "link": "https://arxiv.org/abs/2306.11979", "description": "arXiv:2306.11979v3 Announce Type: replace \nAbstract: Qini curves have emerged as an attractive and popular approach for evaluating the benefit of data-driven targeting rules for treatment allocation. We propose a generalization of the Qini curve to multiple costly treatment arms, that quantifies the value of optimally selecting among both units and treatment arms at different budget levels. 
We develop an efficient algorithm for computing these curves and propose bootstrap-based confidence intervals that are exact in large samples for any point on the curve. These confidence intervals can be used to conduct hypothesis tests comparing the value of treatment targeting using an optimal combination of arms with using just a subset of arms, or with a non-targeting assignment rule ignoring covariates, at different budget levels. We demonstrate the statistical performance in a simulation experiment and an application to treatment targeting for election turnout."}, "https://arxiv.org/abs/2308.04420": {"title": "Contour Location for Reliability in Airfoil Simulation Experiments using Deep Gaussian Processes", "link": "https://arxiv.org/abs/2308.04420", "description": "arXiv:2308.04420v2 Announce Type: replace \nAbstract: Bayesian deep Gaussian processes (DGPs) outperform ordinary GPs as surrogate models of complex computer experiments when response surface dynamics are non-stationary, which is especially prevalent in aerospace simulations. Yet DGP surrogates have not been deployed for the canonical downstream task in that setting: reliability analysis through contour location (CL). In that context, we are motivated by a simulation of an RAE-2822 transonic airfoil which demarcates efficient and inefficient flight conditions. Level sets separating passable versus failable operating conditions are best learned through strategic sequential design. There are two limitations to modern CL methodology which hinder DGP integration in this setting. First, derivative-based optimization underlying acquisition functions is thwarted by sampling-based Bayesian (i.e., MCMC) inference, which is essential for DGP posterior integration. Second, canonical acquisition criteria, such as entropy, are famously myopic to the extent that optimization may even be undesirable. Here we tackle both of these limitations at once, proposing a hybrid criterion that explores along the Pareto front of entropy and (predictive) uncertainty, requiring evaluation only at strategically located \"triangulation\" candidates. We showcase DGP CL performance in several synthetic benchmark exercises and on the RAE-2822 airfoil."}, "https://arxiv.org/abs/2308.12181": {"title": "Consistency of common spatial estimators under spatial confounding", "link": "https://arxiv.org/abs/2308.12181", "description": "arXiv:2308.12181v2 Announce Type: replace \nAbstract: This paper addresses the asymptotic performance of popular spatial regression estimators on the task of estimating the linear effect of an exposure on an outcome under \"spatial confounding\" -- the presence of an unmeasured spatially-structured variable influencing both the exposure and the outcome. The existing literature on spatial confounding is informal and inconsistent; this paper is an attempt to bring clarity through rigorous results on the asymptotic bias and consistency of estimators from popular spatial regression models. We consider two data generation processes: one where the confounder is a fixed function of space and one where it is a random function (i.e., a stochastic process on the spatial domain). We first show that the estimators from ordinary least squares (OLS) and restricted spatial regression are asymptotically biased under spatial confounding. 
We then prove a novel main result on the consistency of the generalized least squares (GLS) estimator using a Gaussian process (GP) covariance matrix in the presence of spatial confounding under in-fill (fixed domain) asymptotics. The result holds under very general conditions -- for any exposure with some non-spatial variation (noise), for any spatially continuous confounder, using any choice of Mat\\'ern or square exponential Gaussian process covariance used to construct the GLS estimator, and without requiring Gaussianity of errors. Finally, we prove that spatial estimators from GLS, GP regression, and spline models that are consistent under confounding by a fixed function will also be consistent under confounding by a random function. We conclude that, contrary to much of the literature on spatial confounding, traditional spatial estimators are capable of estimating linear exposure effects under spatial confounding in the presence of some noise in the exposure. We support our theoretical arguments with simulation studies."}, "https://arxiv.org/abs/2311.16451": {"title": "Variational Inference for the Latent Shrinkage Position Model", "link": "https://arxiv.org/abs/2311.16451", "description": "arXiv:2311.16451v2 Announce Type: replace \nAbstract: The latent position model (LPM) is a popular method used in network data analysis where nodes are assumed to be positioned in a $p$-dimensional latent space. The latent shrinkage position model (LSPM) is an extension of the LPM which automatically determines the number of effective dimensions of the latent space via a Bayesian nonparametric shrinkage prior. However, the LSPM reliance on Markov chain Monte Carlo for inference, while rigorous, is computationally expensive, making it challenging to scale to networks with large numbers of nodes. We introduce a variational inference approach for the LSPM, aiming to reduce computational demands while retaining the model's ability to intrinsically determine the number of effective latent dimensions. The performance of the variational LSPM is illustrated through simulation studies and its application to real-world network data. To promote wider adoption and ease of implementation, we also provide open-source code."}, "https://arxiv.org/abs/2111.09447": {"title": "Unbiased Risk Estimation in the Normal Means Problem via Coupled Bootstrap Techniques", "link": "https://arxiv.org/abs/2111.09447", "description": "arXiv:2111.09447v3 Announce Type: replace-cross \nAbstract: We develop a new approach for estimating the risk of an arbitrary estimator of the mean vector in the classical normal means problem. The key idea is to generate two auxiliary data vectors, by adding carefully constructed normal noise vectors to the original data. We then train the estimator of interest on the first auxiliary vector and test it on the second. In order to stabilize the risk estimate, we average this procedure over multiple draws of the synthetic noise vector. A key aspect of this coupled bootstrap (CB) approach is that it delivers an unbiased estimate of risk under no assumptions on the estimator of the mean vector, albeit for a modified and slightly \"harder\" version of the original problem, where the noise variance is elevated. We prove that, under the assumptions required for the validity of Stein's unbiased risk estimator (SURE), a limiting version of the CB estimator recovers SURE exactly. 
We then analyze a bias-variance decomposition of the error of the CB estimator, which elucidates the effects of the variance of the auxiliary noise and the number of bootstrap samples on the accuracy of the estimator. Lastly, we demonstrate that the CB estimator performs favorably in various simulated experiments."}, "https://arxiv.org/abs/2207.11332": {"title": "Comparing baseball players across eras via novel Full House Modeling", "link": "https://arxiv.org/abs/2207.11332", "description": "arXiv:2207.11332v2 Announce Type: replace-cross \nAbstract: A new methodological framework suitable for era-adjusting baseball statistics is developed in this article. Within this methodological framework, specific models are motivated. We call these models Full House Models. Full House Models work by balancing the achievements of Major League Baseball (MLB) players within a given season and the size of the MLB talent pool from which a player came. We demonstrate the utility of Full House Models in an application comparing baseball players' performance statistics across eras. Our results reveal a new ranking of baseball's greatest players which includes several modern players among the top all-time players. Modern players are elevated by Full House Modeling because they come from a larger talent pool. Sensitivity and multiverse analyses, which investigate how results change with changes to modeling inputs, including the estimate of the talent pool, are presented."}, "https://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "https://arxiv.org/abs/2310.03722", "description": "arXiv:2310.03722v3 Announce Type: replace-cross \nAbstract: In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$ of a Gaussian distribution with unknown variance $\\sigma^2$. Curiously, he employed both an improper (right Haar) mixture over $\\sigma$ and an improper (flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his construction, which uses generalized nonintegrable martingales and an extended Ville's inequality. While this does yield a sequential t-test, it does not yield an \"e-process\" (due to the nonintegrability of his martingale). In this paper, we develop two new e-processes and confidence sequences for the same setting: one is a test martingale in a reduced filtration, while the other is an e-process in the canonical data filtration. These are respectively obtained by swapping Lai's flat mixture for a Gaussian mixture, and swapping the right Haar mixture over $\\sigma$ with the maximum likelihood estimate under the null, as done in universal inference. We also analyze the width of the resulting confidence sequences, which have a curious polynomial dependence on the error probability $\\alpha$ that we prove to be not only unavoidable, but (for universal inference) even better than the classical fixed-sample t-test. Numerical experiments are provided along the way to compare and contrast the various approaches, including some recent suboptimal ones."}, "https://arxiv.org/abs/2312.11108": {"title": "Multiple change point detection in functional data with applications to biomechanical fatigue data", "link": "https://arxiv.org/abs/2312.11108", "description": "arXiv:2312.11108v3 Announce Type: replace-cross \nAbstract: Injuries to the lower extremity joints are often debilitating, particularly for professional athletes.
Understanding the onset of stressful conditions on these joints is therefore important in order to ensure prevention of injuries as well as individualised training for enhanced athletic performance. We study the biomechanical joint angles from the hip, knee and ankle for runners who are experiencing fatigue. The data is cyclic in nature and densely collected by body-worn sensors, which makes it ideal to work with in the functional data analysis (FDA) framework.\n We develop a new method for multiple change point detection for functional data, which improves the state of the art with respect to at least two novel aspects. First, the curves are compared with respect to their maximum absolute deviation, which leads to a better interpretation of local changes in the functional data compared to classical $L^2$-approaches. Secondly, as slight aberrations are often to be expected in human movement data, our method will not detect arbitrarily small changes but hunts for relevant changes, where the maximum absolute deviation between the curves exceeds a specified threshold, say $\\Delta >0$. We recover multiple changes in a long functional time series of biomechanical knee angle data, which are larger than the desired threshold $\\Delta$, allowing us to identify changes purely due to fatigue. In this work, we analyse data from both a controlled indoor setting and an uncontrolled outdoor (marathon) setting."}, "https://arxiv.org/abs/2404.16166": {"title": "Double Robust Variance Estimation", "link": "https://arxiv.org/abs/2404.16166", "description": "arXiv:2404.16166v1 Announce Type: new \nAbstract: Doubly robust estimators have gained popularity in the field of causal inference due to their ability to provide consistent point estimates when either an outcome or exposure model is correctly specified. However, the influence function based variance estimator frequently used with doubly robust estimators is only consistent when both the outcome and exposure models are correctly specified. Here, use of M-estimation and the empirical sandwich variance estimator for doubly robust point and variance estimation is demonstrated. Simulation studies illustrate the properties of the influence function based and empirical sandwich variance estimators. Estimators are applied to data from the Improving Pregnancy Outcomes with Progesterone (IPOP) trial to estimate the effect of maternal anemia on birth weight among women with HIV. In the example, birth weights if all women had anemia were estimated to be lower than birth weights if no women had anemia, though estimates were imprecise. Variance estimates were more stable under varying model specifications for the empirical sandwich variance estimator than the influence function based variance estimator."}, "https://arxiv.org/abs/2404.16209": {"title": "Exploring Spatial Context: A Comprehensive Bibliography of GWR and MGWR", "link": "https://arxiv.org/abs/2404.16209", "description": "arXiv:2404.16209v1 Announce Type: new \nAbstract: Local spatial models such as Geographically Weighted Regression (GWR) and Multiscale Geographically Weighted Regression (MGWR) serve as instrumental tools to capture intrinsic contextual effects through the estimates of the local intercepts and behavioral contextual effects through estimates of the local slope parameters. GWR and MGWR provide simple implementation yet powerful frameworks that could be extended to various disciplines that handle spatial data.
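For the "Double Robust Variance Estimation" abstract above (arXiv:2404.16166), the sketch below shows the standard augmented inverse-probability-weighted (AIPW) point estimate together with the influence-function-based standard error that the paper argues is fragile under model misspecification. The empirical sandwich (stacked M-estimation) variance advocated in the abstract is not implemented here; the scikit-learn models and the simulated data are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw(y, a, X):
    """AIPW estimate of E[Y(1)] - E[Y(0)] with the usual influence-function
    based standard error (the estimator the abstract contrasts with the
    empirical sandwich variance estimator)."""
    ps = LogisticRegression(max_iter=1000).fit(X, a).predict_proba(X)[:, 1]
    mu1 = LinearRegression().fit(X[a == 1], y[a == 1]).predict(X)
    mu0 = LinearRegression().fit(X[a == 0], y[a == 0]).predict(X)
    # Estimated influence function of the ATE for each observation.
    phi = a * (y - mu1) / ps - (1 - a) * (y - mu0) / (1 - ps) + (mu1 - mu0)
    ate = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(len(y))   # IF-based standard error
    return ate, se

rng = np.random.default_rng(5)
n = 2000
X = rng.standard_normal((n, 3))
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = 1.5 * a + X @ np.array([1.0, -0.5, 0.2]) + rng.standard_normal(n)
ate, se = aipw(y, a, X)
print(f"ATE ~ {ate:.3f} (IF-based SE {se:.3f})")
```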
This bibliography aims to serve as a comprehensive compilation of peer-reviewed papers that have utilized GWR or MGWR as a primary analytical method to conduct spatial analyses and acts as a useful guide to anyone searching the literature for previous examples of local statistical modeling in a wide variety of application fields."}, "https://arxiv.org/abs/2404.16490": {"title": "On Neighbourhood Cross Validation", "link": "https://arxiv.org/abs/2404.16490", "description": "arXiv:2404.16490v1 Announce Type: new \nAbstract: It is shown how to efficiently and accurately compute and optimize a range of cross validation criteria for a wide range of models estimated by minimizing a quadratically penalized smooth loss. Example models include generalized additive models for location scale and shape and smooth additive quantile regression. Example losses include negative log likelihoods and smooth quantile losses. Example cross validation criteria include leave-out-neighbourhood cross validation for dealing with un-modelled short range autocorrelation as well as the more familiar leave-one-out cross validation. For a $p$ coefficient model of $n$ data, estimable at $O(np^2)$ computational cost, the general $O(n^2p^2)$ cost of ordinary cross validation is reduced to $O(np^2)$, computing the cross validation criterion to $O(p^3n^{-2})$ accuracy. This is achieved by directly approximating the model coefficient estimates under data subset omission, via efficiently computed single step Newton updates of the full data coefficient estimates. Optimization of the resulting cross validation criterion, with respect to multiple smoothing/precision parameters, can be achieved efficiently using quasi-Newton optimization, adapted to deal with the indefiniteness that occurs when the optimal value for a smoothing parameter tends to infinity. The link between cross validation and the jackknife can be exploited to achieve reasonably well calibrated uncertainty quantification for the model coefficients in non standard settings such as leaving-out-neighbourhoods under residual autocorrelation or quantile regression. Several practical examples are provided, focussing particularly on dealing with un-modelled auto-correlation."}, "https://arxiv.org/abs/2404.16610": {"title": "Conformalized Ordinal Classification with Marginal and Conditional Coverage", "link": "https://arxiv.org/abs/2404.16610", "description": "arXiv:2404.16610v1 Announce Type: new \nAbstract: Conformal prediction is a general distribution-free approach for constructing prediction sets combined with any machine learning algorithm that achieve valid marginal or conditional coverage in finite samples. Ordinal classification is common in real applications where the target variable has natural ordering among the class labels. In this paper, we discuss constructing distribution-free prediction sets for such ordinal classification problems by leveraging the ideas of conformal prediction and multiple testing with FWER control. Newer conformal prediction methods are developed for constructing contiguous and non-contiguous prediction sets based on marginal and conditional (class-specific) conformal $p$-values, respectively. Theoretically, we prove that the proposed methods respectively achieve satisfactory levels of marginal and class-specific conditional coverages. 
Through simulation study and real data analysis, these proposed methods show promising performance compared to the existing conformal method."}, "https://arxiv.org/abs/2404.16709": {"title": "Understanding Reliability from a Regression Perspective", "link": "https://arxiv.org/abs/2404.16709", "description": "arXiv:2404.16709v1 Announce Type: new \nAbstract: Reliability is an important quantification of measurement precision based on a latent variable measurement model. Inspired by McDonald (2011), we present a regression framework of reliability, placing emphasis on whether latent or observed scores serve as the regression outcome. Our theory unifies two extant perspectives of reliability: (a) classical test theory (measurement decomposition), and (b) optimal prediction of latent scores (prediction decomposition). Importantly, reliability should be treated as a property of the observed score under a measurement decomposition, but a property of the latent score under a prediction decomposition. To facilitate the evaluation and interpretation of distinct reliability coefficients for complex measurement models, we introduce a Monte Carlo approach for approximate calculation of reliability. We illustrate the proposed computational procedure with an empirical data analysis, which concerns measuring susceptibility and severity of depressive symptoms using a two-dimensional item response theory model. We conclude with a discussion on computing reliability coefficients and outline future avenues of research."}, "https://arxiv.org/abs/2404.16745": {"title": "Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness", "link": "https://arxiv.org/abs/2404.16745", "description": "arXiv:2404.16745v1 Announce Type: new \nAbstract: In the era of data explosion, statisticians have been developing interpretable and computationally efficient statistical methods to measure latent factors (e.g., skills, abilities, and personalities) using large-scale assessment data. In addition to understanding the latent information, the covariate effect on responses controlling for latent factors is also of great scientific interest and has wide applications, such as evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased toward certain individual characteristics (e.g., gender and race) taking into account their latent abilities. However, the large sample size, substantial covariate dimension, and great test length pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete types of responses, nonlinear latent factor models are often assumed, bringing further complexity to the problem. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions to address the identifiability issue. Based on the identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects under a practical yet challenging asymptotic regime. Furthermore, we derive estimation and inference results for latent factors and the factor loadings. 
We illustrate the finite sample performance of the proposed method through extensive numerical studies and an application to an educational assessment dataset obtained from the Programme for International Student Assessment (PISA)."}, "https://arxiv.org/abs/2404.16746": {"title": "Estimating the Number of Components in Finite Mixture Models via Variational Approximation", "link": "https://arxiv.org/abs/2404.16746", "description": "arXiv:2404.16746v1 Announce Type: new \nAbstract: This work introduces a new method for selecting the number of components in finite mixture models (FMMs) using variational Bayes, inspired by the large-sample properties of the Evidence Lower Bound (ELBO) derived from mean-field (MF) variational approximation. Specifically, we establish matching upper and lower bounds for the ELBO without assuming conjugate priors, suggesting the consistency of model selection for FMMs based on maximizing the ELBO. As a by-product of our proof, we demonstrate that the MF approximation inherits the stable behavior (benefited from model singularity) of the posterior distribution, which tends to eliminate the extra components under model misspecification where the number of mixture components is over-specified. This stable behavior also leads to the $n^{-1/2}$ convergence rate for parameter estimation, up to a logarithmic factor, under this model overspecification. Empirical experiments are conducted to validate our theoretical findings and compare with other state-of-the-art methods for selecting the number of components in FMMs."}, "https://arxiv.org/abs/2404.16775": {"title": "Estimating Metocean Environments Associated with Extreme Structural Response", "link": "https://arxiv.org/abs/2404.16775", "description": "arXiv:2404.16775v1 Announce Type: new \nAbstract: Extreme value analysis (EVA) uses data to estimate long-term extreme environmental conditions for variables such as significant wave height and period, for the design of marine structures. Together with models for the short-term evolution of the ocean environment and for wave-structure interaction, EVA provides a basis for full probabilistic design analysis. Environmental contours provide an alternate approach to estimating structural integrity, without requiring structural knowledge. These contour methods also exploit statistical models, including EVA, but avoid the need for structural modelling by making what are believed to be conservative assumptions about the shape of the structural failure boundary in the environment space. These assumptions, however, may not always be appropriate, or may lead to unnecessary wasted resources from over design. We introduce a methodology for full probabilistic analysis to estimate the joint probability density of the environment, conditional on the occurrence of an extreme structural response, for simple structures. We use this conditional density of the environment as a basis to assess the performance of different environmental contour methods. 
We demonstrate the difficulty of estimating the contour boundary in the environment space for typical data samples, as well as the dependence of the performance of the environmental contour on the structure being considered."}, "https://arxiv.org/abs/2404.16287": {"title": "Differentially Private Federated Learning: Servers Trustworthiness, Estimation, and Statistical Inference", "link": "https://arxiv.org/abs/2404.16287", "description": "arXiv:2404.16287v1 Announce Type: cross \nAbstract: Differentially private federated learning is crucial for maintaining privacy in distributed environments. This paper investigates the challenges of high-dimensional estimation and inference under the constraints of differential privacy. First, we study scenarios involving an untrusted central server, demonstrating the inherent difficulties of accurate estimation in high-dimensional problems. Our findings indicate that the tight minimax rates depend on the high-dimensionality of the data even with sparsity assumptions. Second, we consider a scenario with a trusted central server and introduce a novel federated estimation algorithm tailored for linear regression models. This algorithm effectively handles the slight variations among models distributed across different machines. We also propose methods for statistical inference, including coordinate-wise confidence intervals for individual parameters and strategies for simultaneous inference. Extensive simulation experiments support our theoretical advances, underscoring the efficacy and reliability of our approaches."}, "https://arxiv.org/abs/2404.16583": {"title": "Fast Machine-Precision Spectral Likelihoods for Stationary Time Series", "link": "https://arxiv.org/abs/2404.16583", "description": "arXiv:2404.16583v1 Announce Type: cross \nAbstract: We provide in this work an algorithm for approximating a very broad class of symmetric Toeplitz matrices to machine precision in $\\mathcal{O}(n \\log n)$ time. In particular, for a Toeplitz matrix $\\mathbf{\\Sigma}$ with values $\\mathbf{\\Sigma}_{j,k} = h_{|j-k|} = \\int_{-1/2}^{1/2} e^{2 \\pi i |j-k| \\omega} S(\\omega) \\mathrm{d} \\omega$ where $S(\\omega)$ is piecewise smooth, we give an approximation $\\mathbf{\\mathcal{F}} \\mathbf{\\Sigma} \\mathbf{\\mathcal{F}}^H \\approx \\mathbf{D} + \\mathbf{U} \\mathbf{V}^H$, where $\\mathbf{\\mathcal{F}}$ is the DFT matrix, $\\mathbf{D}$ is diagonal, and the matrices $\\mathbf{U}$ and $\\mathbf{V}$ are in $\\mathbb{C}^{n \\times r}$ with $r \\ll n$. Studying these matrices in the context of time series, we offer a theoretical explanation of this structure and connect it to existing spectral-domain approximation frameworks. We then give a complete discussion of the numerical method for assembling the approximation and demonstrate its efficiency for improving Whittle-type likelihood approximations, including dramatic examples where a correction of rank $r = 2$ to the standard Whittle approximation increases the accuracy from $3$ to $14$ digits for a matrix $\\mathbf{\\Sigma} \\in \\mathbb{R}^{10^5 \\times 10^5}$. The method and analysis of this work apply well beyond time series analysis, providing an algorithm for extremely accurate direct solves with a wide variety of symmetric Toeplitz matrices. 
The analysis employed here largely depends on asymptotic expansions of oscillatory integrals, and also provides a new perspective on when existing spectral-domain approximation methods for Gaussian log-likelihoods can be particularly problematic."}, "https://arxiv.org/abs/2212.09833": {"title": "Direct covariance matrix estimation with compositional data", "link": "https://arxiv.org/abs/2212.09833", "description": "arXiv:2212.09833v2 Announce Type: replace \nAbstract: Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject's gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. Through simulation studies, we demonstrate that our estimator can outperform existing estimators. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with chronic fatigue syndrome."}, "https://arxiv.org/abs/2303.03521": {"title": "Bayesian Variable Selection for Function-on-Scalar Regression Models: a comparative analysis", "link": "https://arxiv.org/abs/2303.03521", "description": "arXiv:2303.03521v4 Announce Type: replace \nAbstract: In this work, we developed a new Bayesian method for variable selection in function-on-scalar regression (FOSR). Our method uses a hierarchical Bayesian structure and latent variables to enable an adaptive covariate selection process for FOSR. Extensive simulation studies show the proposed method's main properties, such as its accuracy in estimating the coefficients and high capacity to select variables correctly. Furthermore, we conducted a substantial comparative analysis with the main competing methods, the BGLSS (Bayesian Group Lasso with Spike and Slab prior) method, the group LASSO (Least Absolute Shrinkage and Selection Operator), the group MCP (Minimax Concave Penalty), and the group SCAD (Smoothly Clipped Absolute Deviation). Our results demonstrate that the proposed methodology is superior in correctly selecting covariates compared with the existing competing methods while maintaining a satisfactory level of goodness of fit. In contrast, the competing methods could not balance selection accuracy with goodness of fit. We also considered a COVID-19 dataset and some socioeconomic data from Brazil as an application and obtained satisfactory results. 
In short, the proposed Bayesian variable selection model is highly competitive, showing significant predictive and selective quality."}, "https://arxiv.org/abs/2304.02339": {"title": "Combining experimental and observational data through a power likelihood", "link": "https://arxiv.org/abs/2304.02339", "description": "arXiv:2304.02339v2 Announce Type: replace \nAbstract: Randomized controlled trials are the gold standard for causal inference and play a pivotal role in modern evidence-based medicine. However, the sample sizes they use are often too limited to draw significant causal conclusions for subgroups that are less prevalent in the population. In contrast, observational data are becoming increasingly accessible in large volumes but can be subject to bias as a result of hidden confounding. Given these complementary features, we propose a power likelihood approach to augmenting RCTs with observational data to improve the efficiency of treatment effect estimation. We provide a data-adaptive procedure for maximizing the expected log predictive density (ELPD) to select the learning rate that best regulates the information from the observational data. We validate our method through a simulation study that shows increased power while maintaining an approximate nominal coverage rate. Finally, we apply our method in a real-world data fusion study augmenting the PIONEER 6 clinical trial with a US health claims dataset, demonstrating the effectiveness of our method and providing detailed guidance on how to address practical considerations in its application."}, "https://arxiv.org/abs/2307.04225": {"title": "Copula-like inference for discrete bivariate distributions with rectangular supports", "link": "https://arxiv.org/abs/2307.04225", "description": "arXiv:2307.04225v3 Announce Type: replace \nAbstract: After reviewing a large body of literature on the modeling of bivariate discrete distributions with finite support, \\cite{Gee20} made a compelling case for the use of $I$-projections in the sense of \\cite{Csi75} as a sound way to attempt to decompose a bivariate probability mass function (p.m.f.) into its two univariate margins and a bivariate p.m.f.\\ with uniform margins playing the role of a discrete copula. From a practical perspective, the necessary $I$-projections on Fr\\'echet classes can be carried out using the iterative proportional fitting procedure (IPFP), also known as Sinkhorn's algorithm or matrix scaling in the literature. After providing conditions under which a bivariate p.m.f.\\ can be decomposed in the aforementioned sense, we investigate, for starting bivariate p.m.f.s with rectangular supports, nonparametric and parametric estimation procedures as well as goodness-of-fit tests for the underlying discrete copula. Related asymptotic results are provided and build upon a differentiability result for $I$-projections on Fr\\'echet classes which can be of independent interest. Theoretical results are complemented by finite-sample experiments and a data example."}, "https://arxiv.org/abs/2311.03769": {"title": "Nonparametric Screening for Additive Quantile Regression in Ultra-high Dimension", "link": "https://arxiv.org/abs/2311.03769", "description": "arXiv:2311.03769v2 Announce Type: replace \nAbstract: In practical applications, one often does not know the \"true\" structure of the underlying conditional quantile function, especially in the ultra-high dimensional setting. 
To deal with ultra-high dimensionality, quantile-adaptive marginal nonparametric screening methods have been recently developed. However, these approaches may miss important covariates that are marginally independent of the response, or may select unimportant covariates due to their high correlations with important covariates. To mitigate such shortcomings, we develop a conditional nonparametric quantile screening procedure (complemented by subsequent selection) for nonparametric additive quantile regression models. Under some mild conditions, we show that the proposed screening method can identify all relevant covariates in a small number of steps with probability approaching one. The subsequent narrowed best subset (via a modified Bayesian information criterion) also contains all the relevant covariates with overwhelming probability. The advantages of our proposed procedure are demonstrated through simulation studies and a real data example."}, "https://arxiv.org/abs/2312.05985": {"title": "Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions", "link": "https://arxiv.org/abs/2312.05985", "description": "arXiv:2312.05985v2 Announce Type: replace \nAbstract: To address the bias of the canonical two-way fixed effects estimator for difference-in-differences under staggered adoptions, Wooldridge (2021) proposed the extended two-way fixed effects estimator, which adds many parameters. However, this reduces efficiency. Restricting some of these parameters to be equal (for example, subsequent treatment effects within a cohort) helps, but ad hoc restrictions may reintroduce bias. We propose a machine learning estimator with a single tuning parameter, fused extended two-way fixed effects (FETWFE), that enables automatic data-driven selection of these restrictions. We prove that under an appropriate sparsity assumption FETWFE identifies the correct restrictions with probability tending to one, which improves efficiency. We also prove the consistency, oracle property, and asymptotic normality of FETWFE for several classes of heterogeneous marginal treatment effect estimators under either conditional or marginal parallel trends, and we prove the same results for conditional average treatment effects under conditional parallel trends. We demonstrate FETWFE in simulation studies and an empirical application."}, "https://arxiv.org/abs/2303.07854": {"title": "Empirical Bayes inference in sparse high-dimensional generalized linear models", "link": "https://arxiv.org/abs/2303.07854", "description": "arXiv:2303.07854v2 Announce Type: replace-cross \nAbstract: High-dimensional linear models have been widely studied, but the developments in high-dimensional generalized linear models, or GLMs, have been slower. In this paper, we propose an empirical or data-driven prior leading to an empirical Bayes posterior distribution which can be used for estimation of and inference on the coefficient vector in a high-dimensional GLM, as well as for variable selection. We prove that our proposed posterior concentrates around the true/sparse coefficient vector at the optimal rate, provide conditions under which the posterior can achieve variable selection consistency, and prove a Bernstein--von Mises theorem that implies asymptotically valid uncertainty quantification. 
Computation of the proposed empirical Bayes posterior is simple and efficient, and is shown to perform well in simulations compared to existing Bayesian and non-Bayesian methods in terms of estimation and variable selection."}, "https://arxiv.org/abs/2306.05571": {"title": "Heterogeneity-aware integrative regression for ancestry-specific association studies", "link": "https://arxiv.org/abs/2306.05571", "description": "arXiv:2306.05571v2 Announce Type: replace-cross \nAbstract: Ancestry-specific proteome-wide association studies (PWAS) based on genetically predicted protein expression can reveal complex disease etiology specific to certain ancestral groups. These studies require ancestry-specific models for protein expression as a function of SNP genotypes. In order to improve protein expression prediction in ancestral populations historically underrepresented in genomic studies, we propose a new penalized maximum likelihood estimator for fitting ancestry-specific joint protein quantitative trait loci models. Our estimator borrows information across ancestral groups, while simultaneously allowing for heterogeneous error variances and regression coefficients. We propose an alternative parameterization of our model which makes the objective function convex and the penalty scale invariant. To improve computational efficiency, we propose an approximate version of our method and study its theoretical properties. Our method provides a substantial improvement in protein expression prediction accuracy in individuals of African ancestry, and in a downstream PWAS analysis, leads to the discovery of multiple associations between protein expression and blood lipid traits in the African ancestry population."}, "https://arxiv.org/abs/2311.02610": {"title": "An adaptive standardisation methodology for Day-Ahead electricity price forecasting", "link": "https://arxiv.org/abs/2311.02610", "description": "arXiv:2311.02610v2 Announce Type: replace-cross \nAbstract: The study of Day-Ahead prices in the electricity market is one of the most popular problems in time series forecasting. Previous research has focused on employing increasingly complex learning algorithms to capture the sophisticated dynamics of the market. However, there is a threshold where increased complexity fails to yield substantial improvements. In this work, we propose an alternative approach by introducing an adaptive standardisation to mitigate the effects of dataset shifts that commonly occur in the market. By doing so, learning algorithms can prioritize uncovering the true relationship between the target variable and the explanatory variables. We investigate five distinct markets, including two novel datasets, previously unexplored in the literature. These datasets provide a more realistic representation of the current market context, that conventional datasets do not show. The results demonstrate a significant improvement across all five markets using the widely accepted learning algorithms in the literature (LEAR and DNN). In particular, the combination of the proposed methodology with the methodology previously presented in the literature obtains the best results. 
This significant advancement unveils new lines of research in this field, highlighting the potential of adaptive transformations in enhancing the performance of forecasting models."}, "https://arxiv.org/abs/2403.14713": {"title": "Auditing Fairness under Unobserved Confounding", "link": "https://arxiv.org/abs/2403.14713", "description": "arXiv:2403.14713v2 Announce Type: replace-cross \nAbstract: The presence of inequity is a fundamental problem in the outcomes of decision-making systems, especially when human lives are at stake. Yet, estimating notions of unfairness or inequity is difficult, particularly if they rely on hard-to-measure concepts such as risk. Such measurements of risk can be accurately obtained when no unobserved confounders have jointly influenced past decisions and outcomes. However, in the real world, this assumption rarely holds. In this paper, we show a surprising result that one can still give meaningful bounds on treatment rates to high-risk individuals, even when entirely eliminating or relaxing the assumption that all relevant risk factors are observed. We use the fact that in many real-world settings (e.g., the release of a new treatment) we have data from prior to any allocation to derive unbiased estimates of risk. This result is of immediate practical interest: we can audit unfair outcomes of existing decision-making systems in a principled manner. For instance, in a real-world study of Paxlovid allocation, our framework provably identifies that observed racial inequity cannot be explained by unobserved confounders of the same strength as important observed covariates."}, "https://arxiv.org/abs/2404.16961": {"title": "On the testability of common trends in panel data without placebo periods", "link": "https://arxiv.org/abs/2404.16961", "description": "arXiv:2404.16961v1 Announce Type: new \nAbstract: We demonstrate and discuss the testability of the common trend assumption imposed in Difference-in-Differences (DiD) estimation in panel data when not relying on multiple pre-treatment periods for running placebo tests. Our testing approach involves two steps: (i) constructing a control group of non-treated units whose pre-treatment outcome distribution matches that of treated units, and (ii) verifying if this control group and the original non-treated group share the same time trend in average outcomes. Testing is motivated by the fact that in several (but not all) panel data models, a common trend violation across treatment groups implies and is implied by a common trend violation across pre-treatment outcomes. For this reason, the test verifies a sufficient, but (depending on the model) not necessary condition for DiD-based identification. We investigate the finite sample performance of a testing procedure that is based on double machine learning, which permits controlling for covariates in a data-driven manner, in a simulation study and also apply it to labor market data from the National Supported Work Demonstration."}, "https://arxiv.org/abs/2404.17019": {"title": "Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules", "link": "https://arxiv.org/abs/2404.17019", "description": "arXiv:2404.17019v1 Announce Type: new \nAbstract: A century ago, Neyman showed how to evaluate the efficacy of treatment using a randomized experiment under a minimal set of assumptions. This classical repeated sampling framework serves as a basis of routine experimental analyses conducted by today's scientists across disciplines. 
In this paper, we demonstrate that Neyman's methodology can also be used to experimentally evaluate the efficacy of individualized treatment rules (ITRs), which are derived by modern causal machine learning algorithms. In particular, we show how to account for additional uncertainty resulting from a training process based on cross-fitting. The primary advantage of Neyman's approach is that it can be applied to any ITR regardless of the properties of machine learning algorithms that are used to derive the ITR. We also show, somewhat surprisingly, that for certain metrics, it is more efficient to conduct this ex-post experimental evaluation of an ITR than to conduct an ex-ante experimental evaluation that randomly assigns some units to the ITR. Our analysis demonstrates that Neyman's repeated sampling framework is as relevant for causal inference today as it has been since its inception."}, "https://arxiv.org/abs/2404.17049": {"title": "Overidentification in Shift-Share Designs", "link": "https://arxiv.org/abs/2404.17049", "description": "arXiv:2404.17049v1 Announce Type: new \nAbstract: This paper studies the testability of identifying restrictions commonly employed to assign a causal interpretation to two stage least squares (TSLS) estimators based on Bartik instruments. For homogeneous effects models applied to short panels, our analysis yields testable implications previously noted in the literature for the two major available identification strategies. We propose overidentification tests for these restrictions that remain valid in high dimensional regimes and are robust to heteroskedasticity and clustering. We further show that homogeneous effect models in short panels, and their corresponding overidentification tests, are of central importance by establishing that: (i) In heterogeneous effects models, interpreting TSLS as a positively weighted average of treatment effects can impose implausible assumptions on the distribution of the data; and (ii) Alternative identifying strategies relying on long panels can prove uninformative in short panel applications. We highlight the empirical relevance of our results by examining the viability of Bartik instruments for identifying the effect of rising Chinese import competition on US local labor markets."}, "https://arxiv.org/abs/2404.17181": {"title": "Consistent information criteria for regularized regression and loss-based learning problems", "link": "https://arxiv.org/abs/2404.17181", "description": "arXiv:2404.17181v1 Announce Type: new \nAbstract: Many problems in statistics and machine learning can be formulated as model selection problems, where the goal is to choose an optimal parsimonious model among a set of candidate models. It is typical to conduct model selection by penalizing the objective function via information criteria (IC), as with the pioneering work by Akaike and Schwarz. Via recent work, we propose a generalized IC framework to consistently estimate general loss-based learning problems. In this work, we propose a consistent estimation method for Generalized Linear Model (GLM) regressions by utilizing the recent IC developments. We advance the generalized IC framework by proposing model selection problems, where the model set consists of a potentially uncountable set of models. 
In addition to theoretical expositions, our proposal introduces a computational procedure for the implementation of our methods in the finite sample setting, which we demonstrate via an extensive simulation study."}, "https://arxiv.org/abs/2404.17380": {"title": "Correspondence analysis: handling cell-wise outliers via the reconstitution algorithm", "link": "https://arxiv.org/abs/2404.17380", "description": "arXiv:2404.17380v1 Announce Type: new \nAbstract: Correspondence analysis (CA) is a popular technique to visualize the relationship between two categorical variables. CA uses the data from a two-way contingency table and is affected by the presence of outliers. The supplementary points method is a popular method to handle outliers. Its disadvantage is that the information from entire rows or columns is removed. However, outliers can be caused by cells only. In this paper, a reconstitution algorithm is introduced to cope with such cells. This algorithm can reduce the contribution of cells in CA instead of deleting entire rows or columns. Thus the remaining information in the row and column involved can be used in the analysis. The reconstitution algorithm is compared with two alternative methods for handling outliers, the supplementary points method and MacroPCA. It is shown that the proposed strategy works well."}, "https://arxiv.org/abs/2404.17464": {"title": "Bayesian Federated Inference for Survival Models", "link": "https://arxiv.org/abs/2404.17464", "description": "arXiv:2404.17464v1 Announce Type: new \nAbstract: In cancer research, overall survival and progression free survival are often analyzed with the Cox model. To estimate accurately the parameters in the model, sufficient data and, more importantly, sufficient events need to be observed. In practice, this is often a problem. Merging data sets from different medical centers may help, but this is not always possible due to strict privacy legislation and logistic difficulties. Recently, the Bayesian Federated Inference (BFI) strategy for generalized linear models was proposed. With this strategy the statistical analyses are performed in the local centers where the data were collected (or stored) and only the inference results are combined to a single estimated model; merging data is not necessary. The BFI methodology aims to compute from the separate inference results in the local centers what would have been obtained if the analysis had been based on the merged data sets. In this paper we generalize the BFI methodology as initially developed for generalized linear models to survival models. Simulation studies and real data analyses show excellent performance; i.e., the results obtained with the BFI methodology are very similar to the results obtained by analyzing the merged data. An R package for doing the analyses is available."}, "https://arxiv.org/abs/2404.17482": {"title": "A comparison of the discrimination performance of lasso and maximum likelihood estimation in logistic regression model", "link": "https://arxiv.org/abs/2404.17482", "description": "arXiv:2404.17482v1 Announce Type: new \nAbstract: Logistic regression is widely used in many areas of knowledge. Several works compare the performance of lasso and maximum likelihood estimation in logistic regression. However, part of these works do not perform simulation studies and the remaining ones do not consider scenarios in which the ratio of the number of covariates to sample size is high. 
In this work, we compare the discrimination performance of lasso and maximum likelihood estimation in logistic regression using simulation studies and applications. Variable selection is done both by lasso and by stepwise when maximum likelihood estimation is used. We consider a wide range of values for the ratio of the number of covariates to sample size. The main conclusion of the work is that lasso has a better discrimination performance than maximum likelihood estimation when the ratio of the number of covariates to sample size is high."}, "https://arxiv.org/abs/2404.17561": {"title": "Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems", "link": "https://arxiv.org/abs/2404.17561", "description": "arXiv:2404.17561v1 Announce Type: new \nAbstract: We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set."}, "https://arxiv.org/abs/2404.17562": {"title": "Boosting e-BH via conditional calibration", "link": "https://arxiv.org/abs/2404.17562", "description": "arXiv:2404.17562v1 Announce Type: new \nAbstract: The e-BH procedure is an e-value-based multiple testing procedure that provably controls the false discovery rate (FDR) under any dependence structure between the e-values. Despite this appealing theoretical FDR control guarantee, the e-BH procedure often suffers from low power in practice. In this paper, we propose a general framework that boosts the power of e-BH without sacrificing its FDR control under arbitrary dependence. This is achieved by the technique of conditional calibration, where we take as input the e-values and calibrate them to be a set of \"boosted e-values\" that are guaranteed to be no less -- and are often more -- powerful than the original ones. Our general framework is explicitly instantiated in three classes of multiple testing problems: (1) testing under parametric models, (2) conditional independence testing under the model-X setting, and (3) model-free conformalized selection. Extensive numerical experiments show that our proposed method significantly improves the power of e-BH while continuing to control the FDR. 
We also demonstrate the effectiveness of our method through an application to an observational study dataset for identifying individuals whose counterfactuals satisfy certain properties."}, "https://arxiv.org/abs/2404.17211": {"title": "Pseudo-Observations and Super Learner for the Estimation of the Restricted Mean Survival Time", "link": "https://arxiv.org/abs/2404.17211", "description": "arXiv:2404.17211v1 Announce Type: cross \nAbstract: In the context of right-censored data, we study the problem of predicting the restricted time to event based on a set of covariates. Under a quadratic loss, this problem is equivalent to estimating the conditional Restricted Mean Survival Time (RMST). To that aim, we propose a flexible and easy-to-use ensemble algorithm that combines pseudo-observations and super learner. The classical theoretical results of the super learner are extended to right-censored data, using a new definition of pseudo-observations, the so-called split pseudo-observations. Simulation studies indicate that the split pseudo-observations and the standard pseudo-observations are similar even for small sample sizes. The method is applied to maintenance and colon cancer datasets, showing the interest of the method in practice, as compared to other prediction methods. We complement the predictions obtained from our method with our RMST-adapted risk measure, prediction intervals and variable importance measures developed in a previous work."}, "https://arxiv.org/abs/2404.17468": {"title": "On Elliptical and Inverse Elliptical Wishart distributions: Review, new results, and applications", "link": "https://arxiv.org/abs/2404.17468", "description": "arXiv:2404.17468v1 Announce Type: cross \nAbstract: This paper deals with matrix-variate distributions, from Wishart to Inverse Elliptical Wishart distributions over the set of symmetric definite positive matrices. Similar to the multivariate scenario, (Inverse) Elliptical Wishart distributions form a vast and general family of distributions, encompassing, for instance, Wishart or $t$-Wishart ones. The first objective of this study is to present a unified overview of Wishart, Inverse Wishart, Elliptical Wishart, and Inverse Elliptical Wishart distributions through their fundamental properties. This involves leveraging the stochastic representation of these distributions to establish key statistical properties of the Normalized Wishart distribution. Subsequently, this enables the computation of expectations, variances, and Kronecker moments for Elliptical Wishart and Inverse Elliptical Wishart distributions. As an illustrative application, the practical utility of these generalized Elliptical Wishart distributions is demonstrated using a real electroencephalographic dataset. This showcases their effectiveness in accurately modeling heterogeneous data."}, "https://arxiv.org/abs/2404.17483": {"title": "Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect Estimation", "link": "https://arxiv.org/abs/2404.17483", "description": "arXiv:2404.17483v1 Announce Type: cross \nAbstract: There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. 
To avoid losing such predictive feature information, existing methods learn separate feature representations using the inverse of probability weighting (IPW). However, due to the numerically unstable IPW weights, they suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator via weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Experimental results show that by effectively correcting the weight values, our method outperforms the existing ones, including traditional weighting schemes."}, "https://arxiv.org/abs/2101.00009": {"title": "Adversarial Estimation of Riesz Representers", "link": "https://arxiv.org/abs/2101.00009", "description": "arXiv:2101.00009v3 Announce Type: replace \nAbstract: Many causal parameters are linear functionals of an underlying regression. The Riesz representer is a key component in the asymptotic variance of a semiparametrically estimated linear functional. We propose an adversarial framework to estimate the Riesz representer using general function spaces. We prove a nonasymptotic mean square rate in terms of an abstract quantity called the critical radius, then specialize it for neural networks, random forests, and reproducing kernel Hilbert spaces as leading cases. Our estimators are highly compatible with targeted and debiased machine learning with sample splitting; our guarantees directly verify general conditions for inference that allow mis-specification. We also use our guarantees to prove inference without sample splitting, based on stability or complexity. Our estimators achieve nominal coverage in highly nonlinear simulations where some previous methods break down. They shed new light on the heterogeneous effects of matching grants."}, "https://arxiv.org/abs/2107.05936": {"title": "Testability of Reverse Causality Without Exogenous Variation", "link": "https://arxiv.org/abs/2107.05936", "description": "arXiv:2107.05936v2 Announce Type: replace \nAbstract: This paper shows that testability of reverse causality is possible even in the absence of exogenous variation, such as in the form of instrumental variables. Instead of relying on exogenous variation, we achieve testability by imposing relatively weak model restrictions and exploiting that a dependence of residual and purported cause is informative about the causal direction. Our main assumption is that the true functional relationship is nonlinear and that error terms are additively separable. We extend previous results by incorporating control variables and allowing heteroskedastic errors. We build on reproducing kernel Hilbert space (RKHS) embeddings of probability distributions to test conditional independence and demonstrate the efficacy in detecting the causal direction in both Monte Carlo simulations and an application to German survey data."}, "https://arxiv.org/abs/2304.09988": {"title": "The effect of estimating prevalences on the population-wise error rate", "link": "https://arxiv.org/abs/2304.09988", "description": "arXiv:2304.09988v2 Announce Type: replace \nAbstract: The population-wise error rate (PWER) is a type I error rate for clinical trials with multiple target populations. In such trials, one treatment is tested for its efficacy in each population. The PWER is defined as the probability that a randomly selected, future patient will be exposed to an inefficient treatment based on the study results. 
The PWER can be understood and computed as an average of strata-specific family-wise error rates and involves the prevalences of these strata. A major issue of this concept is that the prevalences are usually not known in practice, so that the PWER cannot be directly controlled. Instead, one could use an estimator based on the given sample, like their maximum-likelihood estimator under a multinomial distribution. In this paper we show in simulations that this does not substantially inflate the true PWER. We differentiate between the expected PWER, which is almost perfectly controlled, and study-specific values of the PWER which are conditioned on all subgroup sample sizes and vary within a narrow range. Thereby, we consider up to eight different overlapping patient populations and moderate to large sample sizes. In these settings, we also consider the maximum strata-wise family-wise error rate, which is found to be, on average, at least bounded by twice the significance level used for PWER control. Finally, we introduce an adjustment of the PWER that could be made when, by chance, no patients are recruited from a stratum, so that this stratum is not counted in PWER control. We would then reduce the PWER in order to control for multiplicity in this stratum as well."}, "https://arxiv.org/abs/2305.02434": {"title": "Uncertainty Quantification and Confidence Intervals for Naive Rare-Event Estimators", "link": "https://arxiv.org/abs/2305.02434", "description": "arXiv:2305.02434v2 Announce Type: replace \nAbstract: We consider the estimation of rare-event probabilities using sample proportions output by naive Monte Carlo or collected data. Unlike using variance reduction techniques, this naive estimator does not have a priori relative efficiency guarantee. On the other hand, due to the recent surge of sophisticated rare-event problems arising in safety evaluations of intelligent systems, efficiency-guaranteed variance reduction may face implementation challenges which, coupled with the availability of computation or data collection power, motivate the use of such a naive estimator. In this paper we study the uncertainty quantification, namely the construction, coverage validity and tightness of confidence intervals, for rare-event probabilities using only sample proportions. In addition to the known normality, Wilson's and exact intervals, we investigate and compare them with two new intervals derived from Chernoff's inequality and the Berry-Esseen theorem. Moreover, we generalize our results to the natural situation where sampling stops by reaching a target number of rare-event hits. Our findings show that the normality and Wilson's intervals are not always valid, but they are close to the newly developed valid intervals in terms of half-width. In contrast, the exact interval is conservative, but safely guarantees the attainment of the nominal confidence level. Our new intervals, while being more conservative than the exact interval, provide useful insights in understanding the tightness of the considered intervals."}, "https://arxiv.org/abs/2309.15983": {"title": "What To Do (and Not to Do) with Causal Panel Analysis under Parallel Trends: Lessons from A Large Reanalysis Study", "link": "https://arxiv.org/abs/2309.15983", "description": "arXiv:2309.15983v2 Announce Type: replace \nAbstract: Two-way fixed effects (TWFE) models are ubiquitous in causal panel analysis in political science. 
However, recent methodological discussions challenge their validity in the presence of heterogeneous treatment effects (HTE) and violations of the parallel trends assumption (PTA). This burgeoning literature has introduced multiple estimators and diagnostics, leading to confusion among empirical researchers on two fronts: the reliability of existing results based on TWFE models and the current best practices. To address these concerns, we examined, replicated, and reanalyzed 37 articles from three leading political science journals that employed observational panel data with binary treatments. Using six newly introduced HTE-robust estimators, we find that although precision may be affected, the core conclusions derived from TWFE estimates largely remain unchanged. PTA violations and insufficient statistical power, however, continue to be significant obstacles to credible inferences. Based on these findings, we offer recommendations for improving practice in empirical research."}, "https://arxiv.org/abs/2312.09633": {"title": "Natural Gradient Variational Bayes without Fisher Matrix Analytic Calculation and Its Inversion", "link": "https://arxiv.org/abs/2312.09633", "description": "arXiv:2312.09633v2 Announce Type: replace \nAbstract: This paper introduces a method for efficiently approximating the inverse of the Fisher information matrix, a crucial step in achieving effective variational Bayes inference. A notable aspect of our approach is the avoidance of analytically computing the Fisher information matrix and its explicit inversion. Instead, we introduce an iterative procedure for generating a sequence of matrices that converge to the inverse of Fisher information. The natural gradient variational Bayes algorithm without analytic expression of the Fisher matrix and its inversion is provably convergent and achieves a convergence rate of order O(log s/s), with s the number of iterations. We also obtain a central limit theorem for the iterates. Implementation of our method does not require storage of large matrices, and achieves a linear complexity in the number of variational parameters. Our algorithm exhibits versatility, making it applicable across a diverse array of variational Bayes domains, including Gaussian approximation and normalizing flow Variational Bayes. We offer a range of numerical examples to demonstrate the efficiency and reliability of the proposed variational Bayes method."}, "https://arxiv.org/abs/2401.06575": {"title": "A Weibull Mixture Cure Frailty Model for High-dimensional Covariates", "link": "https://arxiv.org/abs/2401.06575", "description": "arXiv:2401.06575v2 Announce Type: replace \nAbstract: A novel mixture cure frailty model is introduced for handling censored survival data. Mixture cure models are preferable when the existence of a cured fraction among patients can be assumed. However, such models are heavily underexplored: frailty structures within cure models remain largely undeveloped, and furthermore, most existing methods do not work for high-dimensional datasets, when the number of predictors is significantly larger than the number of observations. In this study, we introduce a novel extension of the Weibull mixture cure model that incorporates a frailty component, employed to model an underlying latent population heterogeneity with respect to the outcome risk. 
Additionally, high-dimensional covariates are integrated into both the cure rate and survival part of the model, providing a comprehensive approach to employ the model in the context of high-dimensional omics data. We also perform variable selection via an adaptive elastic-net penalization, and propose a novel approach to inference using the expectation-maximization (EM) algorithm. Extensive simulation studies are conducted across various scenarios to demonstrate the performance of the model, and results indicate that our proposed method outperforms competitor models. We apply the novel approach to analyze RNAseq gene expression data from bulk breast cancer patients included in The Cancer Genome Atlas (TCGA) database. A set of prognostic biomarkers is then derived from selected genes, and subsequently validated via both functional enrichment analysis and comparison to the existing biological literature. Finally, a prognostic risk score index based on the identified biomarkers is proposed and validated by exploring the patients' survival."}, "https://arxiv.org/abs/2310.19091": {"title": "Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector", "link": "https://arxiv.org/abs/2310.19091", "description": "arXiv:2310.19091v2 Announce Type: replace-cross \nAbstract: Machine Learning (ML) systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, they still face the challenge of aligning nuanced policy objectives with the precise formalization requirements necessitated by ML models. In this paper, we aim to bridge the gap between ML model requirements and public sector decision-making by presenting a comprehensive overview of key technical challenges where disjunctions between policy goals and ML models commonly arise. We concentrate on pivotal points of the ML pipeline that connect the model to its operational environment, discussing the significance of representative training data and highlighting the importance of a model setup that facilitates effective decision-making. Additionally, we link these challenges with emerging methodological advancements, encompassing causal ML, domain adaptation, uncertainty quantification, and multi-objective optimization, illustrating the path forward for harmonizing ML and public sector objectives."}, "https://arxiv.org/abs/2404.17615": {"title": "DeepVARMA: A Hybrid Deep Learning and VARMA Model for Chemical Industry Index Forecasting", "link": "https://arxiv.org/abs/2404.17615", "description": "arXiv:2404.17615v1 Announce Type: new \nAbstract: Since the chemical industry index is one of the important indicators to measure the development of the chemical industry, forecasting it is critical for understanding the economic situation and trends of the industry. Taking the multivariable nonstationary series-synthetic material index as the main research object, this paper proposes a new prediction model: DeepVARMA, and its variants Deep-VARMA-re and DeepVARMA-en, which combine LSTM and VARMAX models. 
The new model first uses a deep learning model such as the LSTM to remove the trend of the target time series and to learn a representation of the endogenous variables, then uses the VARMAX model to predict the detrended target time series with the embeddings of the endogenous variables, and finally combines the trend learned by the LSTM and the dependency learned by the VARMAX model to obtain the final predictive values. The experimental results show that (1) the new model achieves the best prediction accuracy by combining the LSTM encoding of the exogenous variables and the VARMAX model. (2) In multivariate non-stationary series prediction, DeepVARMA uses a phased processing strategy to show higher adaptability and accuracy compared to the traditional VARMA model as well as the machine learning models LSTM, RF and XGBoost. (3) Compared with smooth sequence prediction, the traditional VARMA and VARMAX models fluctuate more in predicting non-smooth sequences, while DeepVARMA shows more flexibility and robustness. This study provides more accurate tools and methods for future development and scientific decision-making in the chemical industry."}, "https://arxiv.org/abs/2404.17693": {"title": "A Survey Selection Correction using Nonrandom Followup with an Application to the Gender Entrepreneurship Gap", "link": "https://arxiv.org/abs/2404.17693", "description": "arXiv:2404.17693v1 Announce Type: new \nAbstract: Selection into samples undermines efforts to describe populations and to estimate relationships between variables. We develop a simple method for correcting for sample selection that explains differences in survey responses between early and late respondents with correlation between potential responses and preference for survey response. Our method relies on researchers observing the number of data collection attempts prior to each individual's survey response rather than covariates that affect response rates without affecting potential responses. Applying our method to a survey of entrepreneurial aspirations among undergraduates at University of Wisconsin-Madison, we find suggestive evidence that the entrepreneurial aspiration rate is larger among survey respondents than the population, as well as the male-female gender gap in the entrepreneurial aspiration rate, which we estimate as 21 percentage points in the sample and 19 percentage points in the population. Our results suggest that the male-female gap in entrepreneurial aspirations arises prior to direct exposure to the labor market."}, "https://arxiv.org/abs/2404.17734": {"title": "Manipulating a Continuous Instrumental Variable in an Observational Study of Premature Babies: Algorithm, Partial Identification Bounds, and Inference under Randomization and Biased Randomization Assumptions", "link": "https://arxiv.org/abs/2404.17734", "description": "arXiv:2404.17734v1 Announce Type: new \nAbstract: Regionalization of intensive care for premature babies refers to a triage system of mothers with high-risk pregnancies to hospitals of varied capabilities based on risks faced by infants. Due to the limited capacity of high-level hospitals, which are equipped with advanced expertise to provide critical care, understanding the effect of delivering premature babies at such hospitals on infant mortality for different subgroups of high-risk mothers could facilitate the design of an efficient perinatal regionalization system. Towards answering this question, Baiocchi et al. 
(2010) proposed to strengthen an excess-travel-time-based, continuous instrumental variable (IV) in an IV-based, matched-pair design by switching focus to a smaller cohort amenable to being paired with a larger separation in the IV dose. Three elements changed with the strengthened IV: the study cohort, compliance rate and latent complier subgroup. Here, we introduce a non-bipartite, template matching algorithm that embeds data into a target, pair-randomized encouragement trial which maintains fidelity to the original study cohort while strengthening the IV. We then study randomization-based and IV-dependent, biased-randomization-based inference of partial identification bounds for the sample average treatment effect (SATE) in an IV-based matched pair design, which deviates from the usual effect ratio estimand in that the SATE is agnostic to the IV and who is matched to whom, although a strengthened IV design could narrow the partial identification bounds. Based on our proposed strengthened-IV design, we found that delivering at a high-level NICU reduced preterm babies' mortality rate compared to a low-level NICU for $81,766 \\times 2 = 163,532$ mothers and their preterm babies and the effect appeared to be minimal among non-black, low-risk mothers."}, "https://arxiv.org/abs/2404.17763": {"title": "Likelihood Based Inference in Fully and Partially Observed Exponential Family Graphical Models with Intractable Normalizing Constants", "link": "https://arxiv.org/abs/2404.17763", "description": "arXiv:2404.17763v1 Announce Type: new \nAbstract: Probabilistic graphical models that encode an underlying Markov random field are fundamental building blocks of generative modeling to learn latent representations in modern multivariate data sets with complex dependency structures. Among these, the exponential family graphical models are especially popular, given their fairly well-understood statistical properties and computational scalability to high-dimensional data based on pseudo-likelihood methods. These models have been successfully applied in many fields, such as the Ising model in statistical physics and count graphical models in genomics. Another strand of models allows some nodes to be latent, so as to allow the marginal distribution of the observable nodes to depart from exponential family to capture more complex dependence. These approaches form the basis of generative models in artificial intelligence, such as the Boltzmann machines and their restricted versions. A fundamental barrier to likelihood-based (i.e., both maximum likelihood and fully Bayesian) inference in both fully and partially observed cases is the intractability of the likelihood. The usual workaround is via adopting pseudo-likelihood based approaches, following the pioneering work of Besag (1974). The goal of this paper is to demonstrate that full likelihood based analysis of these models is feasible in a computationally efficient manner. The chief innovation lies in using a technique of Geyer (1991) to estimate the intractable normalizing constant, as well as its gradient, for intractable graphical models. 
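For readers unfamiliar with the normalizing-constant technique referenced above, the following minimal Python sketch illustrates the general idea behind Geyer (1991)-style estimation on a toy Ising-type model: the ratio Z(theta)/Z(theta0) is estimated by importance sampling from a reference parameter theta0, using draws from a Gibbs sampler. The toy graph, parameter values, and sampler settings are illustrative assumptions, not the paper's implementation, and the model is kept small enough for a brute-force check.

import itertools

import numpy as np

rng = np.random.default_rng(0)
d = 10                                    # number of +1/-1 nodes; tiny so exact enumeration is possible
A = (rng.random((d, d)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric adjacency with zero diagonal

def suff_stat(x):
    # t(x) = sum_{i<j} A_ij x_i x_j for an Ising-type exponential family
    return 0.5 * x @ A @ x

def gibbs_sample(theta, n_samples, burn=500, thin=5):
    # Single-site Gibbs sampler for p_theta(x) proportional to exp(theta * t(x))
    x = rng.choice([-1.0, 1.0], size=d)
    draws = []
    for it in range(burn + n_samples * thin):
        for i in range(d):
            field = theta * (A[i] @ x)    # conditional log-odds of x_i = +1 equals 2 * field
            x[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
        if it >= burn and (it - burn) % thin == 0:
            draws.append(x.copy())
    return np.array(draws)

theta0, theta = 0.1, 0.3
t_vals = np.array([suff_stat(x) for x in gibbs_sample(theta0, n_samples=2000)])
# Importance-sampling estimate of Z(theta)/Z(theta0) = E_theta0[exp((theta - theta0) * t(X))]
ratio_hat = np.mean(np.exp((theta - theta0) * t_vals))

# Brute-force check, feasible only because d is tiny
states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
t_all = 0.5 * np.einsum("ij,jk,ik->i", states, A, states)
ratio_exact = np.exp(theta * t_all).sum() / np.exp(theta0 * t_all).sum()
print(f"estimated Z-ratio: {ratio_hat:.3f}   exact Z-ratio: {ratio_exact:.3f}")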
Extensive numerical results, supporting theory and comparisons with pseudo-likelihood based approaches demonstrate the applicability of the proposed method."}, "https://arxiv.org/abs/2404.17772": {"title": "PWEXP: An R Package Using Piecewise Exponential Model for Study Design and Event/Timeline Prediction", "link": "https://arxiv.org/abs/2404.17772", "description": "arXiv:2404.17772v1 Announce Type: new \nAbstract: Parametric assumptions such as exponential distribution are commonly used in clinical trial design and analysis. However, violation of distribution assumptions can introduce biases in sample size and power calculations. Piecewise exponential (PWE) hazard model partitions the hazard function into segments each with constant hazards and is easy for interpretation and computation. Due to its piecewise property, PWE can fit a wide range of survival curves and accurately predict the future number of events and analysis time in event-driven clinical trials, thus enabling more flexible and reliable study designs. Compared with other existing approaches, the PWE model provides a superior balance of flexibility and robustness in model fitting and prediction. The proposed PWEXP package is designed for estimating and predicting PWE hazard models for right-censored data. By utilizing well-established criteria such as AIC, BIC, and cross-validation log-likelihood, the PWEXP package chooses the optimal number of change-points and determines the optimal position of change-points. With its particular goodness-of-fit, the PWEXP provides accurate and robust hazard estimation, which can be used for reliable power calculation at study design and timeline prediction at study conduct. The package also offers visualization functions to facilitate the interpretation of survival curve fitting results."}, "https://arxiv.org/abs/2404.17792": {"title": "A General Framework for Random Effects Models for Binary, Ordinal, Count Type and Continuous Dependent Variables Including Variable Selection", "link": "https://arxiv.org/abs/2404.17792", "description": "arXiv:2404.17792v1 Announce Type: new \nAbstract: A general random effects model is proposed that allows for continuous as well as discrete distributions of the responses. Responses can be unrestricted continuous, bounded continuous, binary, ordered categorical or given in the form of counts. The distribution of the responses is not restricted to exponential families, which is a severe restriction in generalized mixed models. Generalized mixed models use fixed distributions for responses, for example the Poisson distribution in count data, which has the disadvantage of not accounting for overdispersion. By using a response function and a thresholds function the proposed mixed thresholds model can account for a variety of alternative distributions that often show better fits than fixed distributions used within the generalized linear model framework. A particular strength of the model is that it provides a tool for joint modeling, responses may be of different types, some can be discrete, others continuous. In addition to introducing the mixed thresholds model parameter sparsity is addressed. Random effects models can contain a large number of parameters, in particular if effects have to be assumed as measurement-specific. Methods to obtain sparser representations are proposed and illustrated. 
The methods are shown to work in the thresholds model but could also be adapted to other modeling approaches."}, "https://arxiv.org/abs/2404.17885": {"title": "Sequential monitoring for explosive volatility regimes", "link": "https://arxiv.org/abs/2404.17885", "description": "arXiv:2404.17885v1 Announce Type: new \nAbstract: In this paper, we develop two families of sequential monitoring procedures for the timely detection of changes in a GARCH(1,1) model. Whilst our methodologies can be applied for the general analysis of changepoints in GARCH(1,1) sequences, they are in particular designed to detect changes from stationarity to explosivity or vice versa, thus allowing one to check for volatility bubbles. Our statistics can be applied irrespective of whether the historical sample is stationary or not, and indeed without prior knowledge of the regime of the observations before and after the break. In particular, we construct our detectors as the CUSUM process of the quasi-Fisher scores of the log likelihood function. In order to ensure timely detection, we then construct our boundary function (exceeding which would indicate a break) by including a weighting sequence which is designed to shorten the detection delay in the presence of a changepoint. We consider two types of weights: a lighter set of weights, which ensures timely detection in the presence of changes occurring early, but not too early after the end of the historical sample; and a heavier set of weights, called Renyi weights, which are designed to ensure timely detection in the presence of changepoints occurring very early in the monitoring horizon. In both cases, we derive the limiting distribution of the detection delays, indicating the expected delay for each set of weights. Our theoretical results are validated via a comprehensive set of simulations, and an empirical application to daily returns of individual stocks."}, "https://arxiv.org/abs/2404.18000": {"title": "Thinking inside the bounds: Improved error distributions for indifference point data analysis and simulation via beta regression using common discounting functions", "link": "https://arxiv.org/abs/2404.18000", "description": "arXiv:2404.18000v1 Announce Type: new \nAbstract: Standard nonlinear regression is commonly used when modeling indifference points due to its ability to closely follow observed data, resulting in a good model fit. However, standard nonlinear regression currently lacks a reasonable distribution-based framework for indifference points, which limits its ability to adequately describe the inherent variability in the data. Software commonly assumes data follow a normal distribution with constant variance. However, typical indifference points do not follow a normal distribution or exhibit constant variance. To address these limitations, this paper introduces a class of nonlinear beta regression models that offers excellent fit to discounting data and enhances simulation-based approaches. This beta regression model can accommodate popular discounting functions. This work proposes three specific advances. First, our model automatically captures non-constant variance as a function of delay. Second, our model improves simulation-based approaches since it obeys the natural boundaries of observable data, unlike the ordinary assumption of normal residuals and constant variance. Finally, we introduce a scale-location-truncation trick that allows beta regression to accommodate observed values of zero and one.
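As a concrete illustration of the kind of model described in this abstract, the short Python sketch below fits a nonlinear beta regression to simulated indifference points, with a hyperbolic discounting mean mu(delay) = 1/(1 + k*delay) and a single precision parameter phi estimated by maximum likelihood. The simulated data, the choice of discounting function, and the optimizer are illustrative assumptions; the paper's scale-location-truncation handling of exact zeros and ones is not reproduced.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(1)

# Simulated indifference points from a hyperbolic discounter: mu(delay) = 1 / (1 + k * delay)
delays = np.repeat([1.0, 7.0, 30.0, 90.0, 180.0, 365.0], 20)
k_true, phi_true = 0.02, 25.0
mu_true = 1.0 / (1.0 + k_true * delays)
y = rng.beta(mu_true * phi_true, (1.0 - mu_true) * phi_true)   # responses live in (0, 1)

def negloglik(params):
    # params = (log k, log phi); the log scale enforces positivity
    k, phi = np.exp(params)
    mu = 1.0 / (1.0 + k * delays)
    return -np.sum(beta_dist.logpdf(y, mu * phi, (1.0 - mu) * phi))

fit = minimize(negloglik, x0=np.log([0.05, 5.0]), method="Nelder-Mead")
k_hat, phi_hat = np.exp(fit.x)
print(f"estimated discounting rate k = {k_hat:.4f}, precision phi = {phi_hat:.1f}")
# The implied variance mu*(1-mu)/(1+phi) changes with delay, so no constant-variance assumption is made.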
A comparison between beta regression and standard nonlinear regression reveals close agreement in the estimated discounting rate k obtained from both methods."}, "https://arxiv.org/abs/2404.18197": {"title": "A General Causal Inference Framework for Cross-Sectional Observational Data", "link": "https://arxiv.org/abs/2404.18197", "description": "arXiv:2404.18197v1 Announce Type: new \nAbstract: Causal inference methods for observational data are highly regarded due to their wide applicability. While there are already numerous methods available for de-confounding bias, these methods generally assume that covariates consist solely of confounders or make naive assumptions about the covariates. Such assumptions face challenges in both theory and practice, particularly when dealing with high-dimensional covariates. Relaxing these naive assumptions and identifying the confounding covariates that truly require correction can effectively enhance the practical significance of these methods. Therefore, this paper proposes a General Causal Inference (GCI) framework specifically designed for cross-sectional observational data, which precisely identifies the key confounding covariates and provides corresponding identification algorithm. Specifically, based on progressive derivations of the Markov property on Directed Acyclic Graph, we conclude that the key confounding covariates are equivalent to the common root ancestors of the treatment and the outcome variable. Building upon this conclusion, the GCI framework is composed of a novel Ancestor Set Identification (ASI) algorithm and de-confounding inference methods. Firstly, the ASI algorithm is theoretically supported by the conditional independence properties and causal asymmetry between variables, enabling the identification of key confounding covariates. Subsequently, the identified confounding covariates are used in the de-confounding inference methods to obtain unbiased causal effect estimation, which can support informed decision-making. Extensive experiments on synthetic datasets demonstrate that the GCI framework can effectively identify the critical confounding covariates and significantly improve the precision, stability, and interpretability of causal inference in observational studies."}, "https://arxiv.org/abs/2404.18207": {"title": "Testing for Asymmetric Information in Insurance with Deep Learning", "link": "https://arxiv.org/abs/2404.18207", "description": "arXiv:2404.18207v1 Announce Type: new \nAbstract: The positive correlation test for asymmetric information developed by Chiappori and Salanie (2000) has been applied in many insurance markets. Most of the literature focuses on the special case of constant correlation; it also relies on restrictive parametric specifications for the choice of coverage and the occurrence of claims. We relax these restrictions by estimating conditional covariances and correlations using deep learning methods. We test the positive correlation property by using the intersection test of Chernozhukov, Lee, and Rosen (2013) and the \"sorted groups\" test of Chernozhukov, Demirer, Duflo, and Fernandez-Val (2023). Our results confirm earlier findings that the correlation between risk and coverage is small. 
Random forests and gradient boosting trees produce similar results to neural networks."}, "https://arxiv.org/abs/2404.18232": {"title": "A cautious approach to constraint-based causal model selection", "link": "https://arxiv.org/abs/2404.18232", "description": "arXiv:2404.18232v1 Announce Type: new \nAbstract: We study the data-driven selection of causal graphical models using constraint-based algorithms, which determine the existence or non-existence of edges (causal connections) in a graph based on testing a series of conditional independence hypotheses. In settings where the ultimate scientific goal is to use the selected graph to inform estimation of some causal effect of interest (e.g., by selecting a valid and sufficient set of adjustment variables), we argue that a \"cautious\" approach to graph selection should control the probability of falsely removing edges and prefer dense, rather than sparse, graphs. We propose a simple inversion of the usual conditional independence testing procedure: to remove an edge, test the null hypothesis of conditional association greater than some user-specified threshold, rather than the null of independence. This equivalence-testing formulation of the independence constraints leads to a procedure with desirable statistical properties and behaviors that better match the inferential goals of certain scientific studies, for example observational epidemiological studies that aim to estimate causal effects in the face of causal model uncertainty. We illustrate our approach on a data example from environmental epidemiology."}, "https://arxiv.org/abs/2404.18256": {"title": "Semiparametric causal mediation analysis in cluster-randomized experiments", "link": "https://arxiv.org/abs/2404.18256", "description": "arXiv:2404.18256v1 Announce Type: new \nAbstract: In cluster-randomized experiments, there is emerging interest in exploring the causal mechanism in which a cluster-level treatment affects the outcome through an intermediate outcome. Despite an extensive development of causal mediation methods in the past decade, only a few exceptions have been considered in assessing causal mediation in cluster-randomized studies, all of which depend on parametric model-based estimators. In this article, we develop the formal semiparametric efficiency theory to motivate several doubly-robust methods for addressing several mediation effect estimands corresponding to both the cluster-average and the individual-level treatment effects in cluster-randomized experiments--the natural indirect effect, natural direct effect, and spillover mediation effect. We derive the efficient influence function for each mediation effect, and carefully parameterize each efficient influence function to motivate practical strategies for operationalizing each estimator. We consider both parametric working models and data-adaptive machine learners to estimate the nuisance functions, and obtain semiparametric efficient causal mediation estimators in the latter case. Our methods are illustrated via extensive simulations and two completed cluster-randomized experiments."}, "https://arxiv.org/abs/2404.18268": {"title": "Optimal Treatment Allocation under Constraints", "link": "https://arxiv.org/abs/2404.18268", "description": "arXiv:2404.18268v1 Announce Type: new \nAbstract: In optimal policy problems where treatment effects vary at the individual level, optimally allocating treatments to recipients is complex even when potential outcomes are known.
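For intuition about the allocation problem itself, and not the strongly polynomial algorithm the authors propose, the toy Python sketch below solves a small capacity-constrained multi-arm allocation with known estimated potential outcomes by duplicating each arm into capacity-many "slots" and applying a standard assignment solver. The arm capacities and outcome estimates are hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n_units, n_arms = 12, 3
capacity = [5, 4, 3]                       # hypothetical per-arm capacities (sum >= n_units)

# Estimated potential outcomes Y_hat[i, a]: outcome of unit i if assigned to arm a
Y_hat = rng.normal(size=(n_units, n_arms))

# Duplicate each arm into capacity-many "slots" and maximise the total estimated outcome
slots = np.concatenate([np.full(c, a) for a, c in enumerate(capacity)])
cost = -Y_hat[:, slots]                    # negate: the solver minimises total cost
rows, cols = linear_sum_assignment(cost)
arm_of_unit = slots[cols]                  # arm assigned to each unit

print("units per arm:", np.bincount(arm_of_unit, minlength=n_arms))
print("total estimated outcome:", Y_hat[rows, arm_of_unit].sum().round(2))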
We present an algorithm for multi-arm treatment allocation problems that is guaranteed to find the optimal allocation in strongly polynomial time, and which is able to handle arbitrary potential outcomes as well as constraints on treatment requirement and capacity. Further, starting from an arbitrary allocation, we show how to optimally re-allocate treatments in a Pareto-improving manner. To showcase our results, we use data from Danish nurse home visiting for infants. We estimate nurse specific treatment effects for children born 1959-1967 in Copenhagen, comparing nurses against each other. We exploit random assignment of newborn children to nurses within a district to obtain causal estimates of nurse-specific treatment effects using causal machine learning. Using these estimates, and treating the Danish nurse home visiting program as a case of an optimal treatment allocation problem (where a treatment is a nurse), we document room for significant productivity improvements by optimally re-allocating nurses to children. Our estimates suggest that optimal allocation of nurses to children could have improved average yearly earnings by USD 1,815 and length of education by around two months."}, "https://arxiv.org/abs/2404.18370": {"title": "Out-of-distribution generalization under random, dense distributional shifts", "link": "https://arxiv.org/abs/2404.18370", "description": "arXiv:2404.18370v1 Announce Type: new \nAbstract: Many existing approaches for estimating parameters in settings with distributional shifts operate under an invariance assumption. For example, under covariate shift, it is assumed that p(y|x) remains invariant. We refer to such distribution shifts as sparse, since they may be substantial but affect only a part of the data generating system. In contrast, in various real-world settings, shifts might be dense. More specifically, these dense distributional shifts may arise through numerous small and random changes in the population and environment. First, we will discuss empirical evidence for such random dense distributional shifts and explain why commonly used models for distribution shifts-including adversarial approaches-may not be appropriate under these conditions. Then, we will develop tools to infer parameters and make predictions for partially observed, shifted distributions. Finally, we will apply the framework to several real-world data sets and discuss diagnostics to evaluate the fit of the distributional uncertainty model."}, "https://arxiv.org/abs/2404.18377": {"title": "Inference for the panel ARMA-GARCH model when both $N$ and $T$ are large", "link": "https://arxiv.org/abs/2404.18377", "description": "arXiv:2404.18377v1 Announce Type: new \nAbstract: We propose a panel ARMA-GARCH model to capture the dynamics of large panel data with $N$ individuals over $T$ time periods. For this model, we provide a two-step estimation procedure to estimate the ARMA parameters and GARCH parameters stepwisely. Under some regular conditions, we show that all of the proposed estimators are asymptotically normal with the convergence rate $(NT)^{-1/2}$, and they have the asymptotic biases when both $N$ and $T$ diverge to infinity at the same rate. Particularly, we find that the asymptotic biases result from the fixed effect, estimation effect, and unobservable initial values. To correct the biases, we further propose the bias-corrected version of estimators by using either the analytical asymptotics or jackknife method. 
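As a generic illustration of jackknife bias correction in dynamic panels, the Python sketch below applies a half-panel (split-panel) jackknife to the within fixed-effects estimator of a simple AR(1) panel, removing the leading O(1/T) bias term. The simulated design is an assumption for illustration only, and the sketch is not the paper's ARMA-GARCH estimator.

import numpy as np

rng = np.random.default_rng(3)
N, T, rho = 200, 20, 0.5

# Simulate a dynamic panel y_it = alpha_i + rho * y_i,t-1 + e_it
alpha = rng.normal(size=N)
y = np.zeros((N, T))
y[:, 0] = alpha + rng.normal(size=N)
for t in range(1, T):
    y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)

def fe_ar1(panel):
    # Within (fixed-effects) estimator of the AR(1) coefficient
    y_lag, y_cur = panel[:, :-1], panel[:, 1:]
    y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
    y_cur = y_cur - y_cur.mean(axis=1, keepdims=True)
    return np.sum(y_lag * y_cur) / np.sum(y_lag ** 2)

rho_full = fe_ar1(y)
rho_half = 0.5 * (fe_ar1(y[:, : T // 2]) + fe_ar1(y[:, T // 2:]))
rho_bc = 2.0 * rho_full - rho_half          # half-panel jackknife removes the O(1/T) bias term
print(f"true rho = {rho}, FE estimate = {rho_full:.3f}, jackknife-corrected = {rho_bc:.3f}")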
Our asymptotic results are based on a new central limit theorem for the linear-quadratic form in the martingale difference sequence, when the weight matrix is uniformly bounded in row and column. Simulations and one real example are given to demonstrate the usefulness of our panel ARMA-GARCH model."}, "https://arxiv.org/abs/2404.18421": {"title": "Semiparametric mean and variance joint models with Laplace link functions for count time series", "link": "https://arxiv.org/abs/2404.18421", "description": "arXiv:2404.18421v1 Announce Type: new \nAbstract: Count time series data are frequently analyzed by modeling their conditional means and the conditional variance is often considered to be a deterministic function of the corresponding conditional mean and is not typically modeled independently. We propose a semiparametric mean and variance joint model, called random rounded count-valued generalized autoregressive conditional heteroskedastic (RRC-GARCH) model, to address this limitation. The RRC-GARCH model and its variations allow for the joint modeling of both the conditional mean and variance and offer a flexible framework for capturing various mean-variance structures (MVSs). One main feature of this model is its ability to accommodate negative values for regression coefficients and autocorrelation functions. The autocorrelation structure of the RRC-GARCH model using the proposed Laplace link functions with nonnegative regression coefficients is the same as that of an autoregressive moving-average (ARMA) process. For the new model, the stationarity and ergodicity are established and the consistency and asymptotic normality of the conditional least squares estimator are proved. Model selection criteria are proposed to evaluate the RRC-GARCH models. The performance of the RRC-GARCH model is assessed through analyses of both simulated and real data sets. The results indicate that the model can effectively capture the MVS of count time series data and generate accurate forecast means and variances."}, "https://arxiv.org/abs/2404.18678": {"title": "Sequential model confidence", "link": "https://arxiv.org/abs/2404.18678", "description": "arXiv:2404.18678v1 Announce Type: new \nAbstract: In most prediction and estimation situations, scientists consider various statistical models for the same problem, and naturally want to select amongst the best. Hansen et al. (2011) provide a powerful solution to this problem by the so-called model confidence set, a subset of the original set of available models that contains the best models with a given level of confidence. Importantly, model confidence sets respect the underlying selection uncertainty by being flexible in size. However, they presuppose a fixed sample size which stands in contrast to the fact that model selection and forecast evaluation are inherently sequential tasks where we successively collect new data and where the decision to continue or conclude a study may depend on the previous outcomes. In this article, we extend model confidence sets sequentially over time by relying on sequential testing methods. Recently, e-processes and confidence sequences have been introduced as new, safe methods for assessing statistical evidence. 
Sequential model confidence sets allow to continuously monitor the models' performances and come with time-uniform, nonasymptotic coverage guarantees."}, "https://arxiv.org/abs/2404.18732": {"title": "Two-way Homogeneity Pursuit for Quantile Network Vector Autoregression", "link": "https://arxiv.org/abs/2404.18732", "description": "arXiv:2404.18732v1 Announce Type: new \nAbstract: While the Vector Autoregression (VAR) model has received extensive attention for modelling complex time series, quantile VAR analysis remains relatively underexplored for high-dimensional time series data. To address this disparity, we introduce a two-way grouped network quantile (TGNQ) autoregression model for time series collected on large-scale networks, known for their significant heterogeneous and directional interactions among nodes. Our proposed model simultaneously conducts node clustering and model estimation to balance complexity and interpretability. To account for the directional influence among network nodes, each network node is assigned two latent group memberships that can be consistently estimated using our proposed estimation procedure. Theoretical analysis demonstrates the consistency of membership and parameter estimators even with an overspecified number of groups. With the correct group specification, estimated parameters are proven to be asymptotically normal, enabling valid statistical inferences. Moreover, we propose a quantile information criterion for consistently selecting the number of groups. Simulation studies show promising finite sample performance, and we apply the methodology to analyze connectedness and risk spillover effects among Chinese A-share stocks."}, "https://arxiv.org/abs/2404.18779": {"title": "Semiparametric fiducial inference", "link": "https://arxiv.org/abs/2404.18779", "description": "arXiv:2404.18779v1 Announce Type: new \nAbstract: R. A. Fisher introduced the concept of fiducial as a potential replacement for the Bayesian posterior distribution in the 1930s. During the past century, fiducial approaches have been explored in various parametric and nonparametric settings. However, to the best of our knowledge, no fiducial inference has been developed in the realm of semiparametric statistics. In this paper, we propose a novel fiducial approach for semiparametric models. To streamline our presentation, we use the Cox proportional hazards model, which is the most popular model for the analysis of survival data, as a running example. Other models and extensions are also discussed. In our experiments, we find our method to perform well especially in situations when the maximum likelihood estimator fails."}, "https://arxiv.org/abs/2404.18854": {"title": "Switching Models of Oscillatory Networks Greatly Improve Inference of Dynamic Functional Connectivity", "link": "https://arxiv.org/abs/2404.18854", "description": "arXiv:2404.18854v1 Announce Type: new \nAbstract: Functional brain networks can change rapidly as a function of stimuli or cognitive shifts. Tracking dynamic functional connectivity is particularly challenging as it requires estimating the structure of the network at each moment as well as how it is shifting through time. In this paper, we describe a general modeling framework and a set of specific models that provides substantially increased statistical power for estimating rhythmic dynamic networks, based on the assumption that for a particular experiment or task, the network state at any moment is chosen from a discrete set of possible network modes. 
Each model comprises three components: (1) a set of latent switching states that represent transitions between the expression of each network mode; (2) a set of latent oscillators, each characterized by an estimated mean oscillation frequency and an instantaneous phase and amplitude at each time point; and (3) an observation model that relates the observed activity at each electrode to a linear combination of the latent oscillators. We develop an expectation-maximization procedure to estimate the network structure for each switching state and the probability of each state being expressed at each moment. We conduct a set of simulation studies to illustrate the application of these models and quantify their statistical power, even in the face of model misspecification."}, "https://arxiv.org/abs/2404.18857": {"title": "VT-MRF-SPF: Variable Target Markov Random Field Scalable Particle Filter", "link": "https://arxiv.org/abs/2404.18857", "description": "arXiv:2404.18857v1 Announce Type: new \nAbstract: Markov random fields (MRFs) are invaluable tools across diverse fields, and spatiotemporal MRFs (STMRFs) amplify their effectiveness by integrating spatial and temporal dimensions. However, modeling spatiotemporal data introduces additional hurdles, including dynamic spatial dimensions and partial observations, prevalent in scenarios like disease spread analysis and environmental monitoring. Tracking high-dimensional targets with complex spatiotemporal interactions over extended periods poses significant challenges in accuracy, efficiency, and computational feasibility. To tackle these obstacles, we introduce the variable target MRF scalable particle filter (VT-MRF-SPF), a fully online learning algorithm designed for high-dimensional target tracking over STMRFs with varying dimensions under partial observation. We rigorously guarantee algorithm performance, explicitly showing how the curse of dimensionality is overcome. Additionally, we provide practical guidelines for tuning graphical parameters, leading to superior performance in extensive examinations."}, "https://arxiv.org/abs/2404.18862": {"title": "Conformal Prediction Sets for Populations of Graphs", "link": "https://arxiv.org/abs/2404.18862", "description": "arXiv:2404.18862v1 Announce Type: new \nAbstract: The analysis of data such as graphs has been gaining increasing attention in the past years. This is justified by the numerous applications in which they appear. Several methods exist to predict graphs, but far fewer quantify the uncertainty of the prediction. The present work proposes an uncertainty quantification methodology for graphs, based on conformal prediction. The method works both for graphs with the same set of nodes (labelled graphs) and graphs with no clear correspondence between the set of nodes across the observed graphs (unlabelled graphs). The unlabelled case is dealt with by creating prediction sets embedded in a quotient space. The proposed method does not rely on distributional assumptions, it achieves finite-sample validity, and it identifies interpretable prediction sets. To explore the features of this novel forecasting technique, we perform two simulation studies to show the methodology in both the labelled and the unlabelled case.
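To make the conformal construction concrete in the simpler labelled case, the Python sketch below builds a split-conformal prediction set for a population of graphs on a common node set, using the Frobenius distance to a training mean adjacency matrix as the nonconformity score. The simulated graph model and the particular score are illustrative assumptions, and the quotient-space construction for unlabelled graphs is not reproduced.

import numpy as np

rng = np.random.default_rng(4)
n_nodes, alpha = 8, 0.1

# Simulated population of labelled graphs sharing one edge-probability matrix
P = rng.uniform(0.2, 0.8, size=(n_nodes, n_nodes))
P = np.triu(P, 1)
P = P + P.T
U = rng.random((300, n_nodes, n_nodes))
G = np.triu((U < P).astype(float), 1)
graphs = G + G.transpose(0, 2, 1)          # symmetric adjacency matrices

train, calib, test = graphs[:100], graphs[100:200], graphs[200:]
center = train.mean(axis=0)                # "predicted" graph: the training mean adjacency

def score(g):
    # Nonconformity score: Frobenius distance to the training mean adjacency
    return np.linalg.norm(g - center)

calib_scores = np.sort([score(g) for g in calib])
n_cal = len(calib_scores)
k = int(np.ceil((n_cal + 1) * (1 - alpha)))    # conformal quantile index
q = calib_scores[k - 1]

# The prediction set is {g : score(g) <= q}; check its empirical coverage on test graphs
coverage = np.mean([score(g) <= q for g in test])
print(f"target coverage {1 - alpha:.2f}, empirical coverage {coverage:.2f}")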
We showcase the applicability of the method in analysing the performance of different teams during the FIFA 2018 football world championship via their player passing networks."}, "https://arxiv.org/abs/2404.18905": {"title": "Detecting critical treatment effect bias in small subgroups", "link": "https://arxiv.org/abs/2404.18905", "description": "arXiv:2404.18905v1 Announce Type: new \nAbstract: Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge."}, "https://arxiv.org/abs/2403.15352": {"title": "Universal Cold RNA Phase Transitions", "link": "https://arxiv.org/abs/2403.15352", "description": "arXiv:2403.15352v1 Announce Type: cross \nAbstract: RNA's diversity of structures and functions impacts all life forms since primordia. We use calorimetric force spectroscopy to investigate RNA folding landscapes in previously unexplored low-temperature conditions. We find that Watson-Crick RNA hairpins, the most basic secondary structure elements, undergo a glass-like transition below $\\mathbf{T_G\\sim 20 ^{\\circ}}$C where the heat capacity abruptly changes and the RNA folds into a diversity of misfolded structures. We hypothesize that an altered RNA biochemistry, determined by sequence-independent ribose-water interactions, outweighs sequence-dependent base pairing. The ubiquitous ribose-water interactions lead to universal RNA phase transitions below $\\mathbf{T_G}$, such as maximum stability at $\\mathbf{T_S\\sim 5 ^{\\circ}}$C where water density is maximum, and cold denaturation at $\\mathbf{T_C\\sim-50^{\\circ}}$C. RNA cold biochemistry may have a profound impact on RNA function and evolution."}, "https://arxiv.org/abs/2404.17682": {"title": "Testing for similarity of dose response in multi-regional clinical trials", "link": "https://arxiv.org/abs/2404.17682", "description": "arXiv:2404.17682v1 Announce Type: cross \nAbstract: This paper addresses the problem of deciding whether the dose response relationships between subgroups and the full population in a multi-regional trial are similar to each other. Similarity is measured in terms of the maximal deviation between the dose response curves. We consider a parametric framework and develop two powerful bootstrap tests for the similarity between the dose response curves of one subgroup and the full population, and for the similarity between the dose response curves of several subgroups and the full population. 
We prove the validity of the tests, investigate the finite sample properties by means of a simulation study and finally illustrate the methodology in a case study."}, "https://arxiv.org/abs/2404.17737": {"title": "Neutral Pivoting: Strong Bias Correction for Shared Information", "link": "https://arxiv.org/abs/2404.17737", "description": "arXiv:2404.17737v1 Announce Type: cross \nAbstract: In the absence of historical data for use as forecasting inputs, decision makers often ask a panel of judges to predict the outcome of interest, leveraging the wisdom of the crowd (Surowiecki 2005). Even if the crowd is large and skilled, shared information can bias the simple mean of judges' estimates. Addressing the issue of bias, Palley and Soll (2019) introduces a novel approach called pivoting. Pivoting can take several forms, most notably the powerful and reliable minimal pivot. We build on the intuition of the minimal pivot and propose a more aggressive bias correction known as the neutral pivot. The neutral pivot achieves the largest bias correction of its class that both avoids the need to directly estimate crowd composition or skill and maintains a smaller expected squared error than the simple mean for all considered settings. Empirical assessments on real datasets confirm the effectiveness of the neutral pivot compared to current methods."}, "https://arxiv.org/abs/2404.17769": {"title": "Conformal Ranked Retrieval", "link": "https://arxiv.org/abs/2404.17769", "description": "arXiv:2404.17769v1 Announce Type: cross \nAbstract: Given the wide adoption of ranked retrieval techniques in various information systems that significantly impact our daily lives, there is an increasing need to assess and address the uncertainty inherent in their predictions. This paper introduces a novel method using the conformal risk control framework to quantitatively measure and manage risks in the context of ranked retrieval problems. Our research focuses on a typical two-stage ranked retrieval problem, where the retrieval stage generates candidates for subsequent ranking. By carefully formulating the conformal risk for each stage, we have developed algorithms to effectively control these risks within their specified bounds. The efficacy of our proposed methods has been demonstrated through comprehensive experiments on three large-scale public datasets for ranked retrieval tasks, including the MSLR-WEB dataset, the Yahoo LTRC dataset and the MS MARCO dataset."}, "https://arxiv.org/abs/2404.17812": {"title": "High-Dimensional Single-Index Models: Link Estimation and Marginal Inference", "link": "https://arxiv.org/abs/2404.17812", "description": "arXiv:2404.17812v1 Announce Type: cross \nAbstract: This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging the information from the estimated link function, we propose more efficient estimators that are better aligned with the underlying model. Furthermore, we rigorously establish the asymptotic normality of each coordinate of the estimator. This provides a valid construction of confidence intervals and $p$-values for any finite collection of coordinates. 
Numerical experiments validate our theoretical results."}, "https://arxiv.org/abs/2404.17856": {"title": "Uncertainty quantification for iterative algorithms in linear models with application to early stopping", "link": "https://arxiv.org/abs/2404.17856", "description": "arXiv:2404.17856v1 Announce Type: cross \nAbstract: This paper investigates the iterates $\\hbb^1,\\dots,\\hbb^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable with the sample size $n$, i.e., $p \\asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators for the generalization error of the iterate $\\hbb^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\\sqrt n$-consistent under Gaussian designs. Applications to early-stopping are provided: when the generalization error of the iterates is a U-shape function of the iteration $t$, the estimates allow to select from the data an iteration $\\hat t$ that achieves the smallest generalization error along the trajectory. Additionally, we provide a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\\hbb^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results."}, "https://arxiv.org/abs/2404.18786": {"title": "Randomization-based confidence intervals for the local average treatment effect", "link": "https://arxiv.org/abs/2404.18786", "description": "arXiv:2404.18786v1 Announce Type: cross \nAbstract: We consider the problem of generating confidence intervals in randomized experiments with noncompliance. We show that a refinement of a randomization-based procedure proposed by Imbens and Rosenbaum (2005) has desirable properties. Namely, we show that using a studentized Anderson-Rubin-type statistic as a test statistic yields confidence intervals that are finite-sample exact under treatment effect homogeneity, and remain asymptotically valid for the Local Average Treatment Effect when the treatment effect is heterogeneous. We provide a uniform analysis of this procedure."}, "https://arxiv.org/abs/2006.13850": {"title": "Global Sensitivity and Domain-Selective Testing for Functional-Valued Responses: An Application to Climate Economy Models", "link": "https://arxiv.org/abs/2006.13850", "description": "arXiv:2006.13850v4 Announce Type: replace \nAbstract: Understanding the dynamics and evolution of climate change and associated uncertainties is key for designing robust policy actions. Computer models are key tools in this scientific effort, which have now reached a high level of sophistication and complexity. Model auditing is needed in order to better understand their results, and to deal with the fact that such models are increasingly opaque with respect to their inner workings. Current techniques such as Global Sensitivity Analysis (GSA) are limited to dealing either with multivariate outputs, stochastic ones, or finite-change inputs. This limits their applicability to time-varying variables such as future pathways of greenhouse gases. To provide additional semantics in the analysis of a model ensemble, we provide an extension of GSA methodologies tackling the case of stochastic functional outputs with finite change inputs. 
To deal with finite change inputs and functional outputs, we propose an extension of currently available GSA methodologies while we deal with the stochastic part by introducing a novel, domain-selective inferential technique for sensitivity indices. Our method is explored via a simulation study that shows its robustness and efficacy in detecting sensitivity patterns. We apply it to real world data, where its capabilities can provide to practitioners and policymakers additional information about the time dynamics of sensitivity patterns, as well as information about robustness."}, "https://arxiv.org/abs/2012.11679": {"title": "Discordant Relaxations of Misspecified Models", "link": "https://arxiv.org/abs/2012.11679", "description": "arXiv:2012.11679v5 Announce Type: replace \nAbstract: In many set-identified models, it is difficult to obtain a tractable characterization of the identified set. Therefore, researchers often rely on non-sharp identification conditions, and empirical results are often based on an outer set of the identified set. This practice is often viewed as conservative yet valid because an outer set is always a superset of the identified set. However, this paper shows that when the model is refuted by the data, two sets of non-sharp identification conditions derived from the same model could lead to disjoint outer sets and conflicting empirical results. We provide a sufficient condition for the existence of such discordancy, which covers models characterized by conditional moment inequalities and the Artstein (1983) inequalities. We also derive sufficient conditions for the non-existence of discordant submodels, therefore providing a class of models for which constructing outer sets cannot lead to misleading interpretations. In the case of discordancy, we follow Masten and Poirier (2021) by developing a method to salvage misspecified models, but unlike them, we focus on discrete relaxations. We consider all minimum relaxations of a refuted model that restores data-consistency. We find that the union of the identified sets of these minimum relaxations is robust to detectable misspecifications and has an intuitive empirical interpretation."}, "https://arxiv.org/abs/2204.13439": {"title": "Mahalanobis balancing: a multivariate perspective on approximate covariate balancing", "link": "https://arxiv.org/abs/2204.13439", "description": "arXiv:2204.13439v4 Announce Type: replace \nAbstract: In the past decade, various exact balancing-based weighting methods were introduced to the causal inference literature. Exact balancing alleviates the extreme weight and model misspecification issues that may incur when one implements inverse probability weighting. It eliminates covariate imbalance by imposing balancing constraints in an optimization problem. The optimization problem can nevertheless be infeasible when there is bad overlap between the covariate distributions in the treated and control groups or when the covariates are high-dimensional. Recently, approximate balancing was proposed as an alternative balancing framework, which resolves the feasibility issue by using inequality moment constraints instead. However, it can be difficult to select the threshold parameters when the number of constraints is large. Moreover, moment constraints may not fully capture the discrepancy of covariate distributions. In this paper, we propose Mahalanobis balancing, which approximately balances covariate distributions from a multivariate perspective. 
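A minimal sketch of the general idea, under simplified assumptions: approximate balancing weights for the control group can be obtained by minimizing a dispersion measure subject to a single quadratic (Mahalanobis-type) imbalance constraint. The Python code below sets this up with a generic solver on simulated data; the threshold, dispersion measure, and optimizer are illustrative choices, not the paper's exact formulation, dual analysis, or tuning procedure.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n_c, n_t, p = 200, 100, 5

# Toy observational data: control covariates are shifted relative to the treated group
X_c = rng.normal(size=(n_c, p))
X_t = rng.normal(loc=0.5, size=(n_t, p))
target = X_t.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X_c, rowvar=False))
delta = 0.05                                # single threshold on the overall quadratic imbalance

def imbalance(w):
    d = X_c.T @ w - target
    return d @ Sigma_inv @ d                # multivariate (Mahalanobis-type) imbalance measure

constraints = [
    {"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
    {"type": "ineq", "fun": lambda w: delta - imbalance(w)},
]
w0 = np.full(n_c, 1.0 / n_c)
res = minimize(lambda w: np.sum(w ** 2),    # dispersion penalty discourages extreme weights
               w0, method="SLSQP", bounds=[(0.0, None)] * n_c,
               constraints=constraints, options={"maxiter": 500})
w = res.x
print(f"imbalance (uniform weights):   {imbalance(w0):.3f}")
print(f"imbalance (balancing weights): {imbalance(w):.3f}  [threshold delta = {delta}]")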
We use a quadratic constraint to control overall imbalance with a single threshold parameter, which can be tuned by a simple selection procedure. We show that the dual problem of Mahalanobis balancing is an l_2 norm-based regularized regression problem, and establish interesting connection to propensity score models. We further generalize Mahalanobis balancing to the high-dimensional scenario. We derive asymptotic properties and make extensive comparisons with existing balancing methods in the numerical studies."}, "https://arxiv.org/abs/2206.02508": {"title": "Tucker tensor factor models: matricization and mode-wise PCA estimation", "link": "https://arxiv.org/abs/2206.02508", "description": "arXiv:2206.02508v3 Announce Type: replace \nAbstract: High-dimensional, higher-order tensor data are gaining prominence in a variety of fields, including but not limited to computer vision and network analysis. Tensor factor models, induced from noisy versions of tensor decompositions or factorizations, are natural potent instruments to study a collection of tensor-variate objects that may be dependent or independent. However, it is still in the early stage of developing statistical inferential theories for the estimation of various low-rank structures, which are customary to play the role of signals of tensor factor models. In this paper, we attempt to ``decode\" the estimation of a higher-order tensor factor model by leveraging tensor matricization. Specifically, we recast it into mode-wise traditional high-dimensional vector/fiber factor models, enabling the deployment of conventional principal components analysis (PCA) for estimation. Demonstrated by the Tucker tensor factor model (TuTFaM), which is induced from the noisy version of the widely-used Tucker decomposition, we summarize that estimations on signal components are essentially mode-wise PCA techniques, and the involvement of projection and iteration will enhance the signal-to-noise ratio to various extent. We establish the inferential theory of the proposed estimators, conduct rich simulation experiments, and illustrate how the proposed estimations can work in tensor reconstruction, and clustering for independent video and dependent economic datasets, respectively."}, "https://arxiv.org/abs/2301.13701": {"title": "On the Stability of General Bayesian Inference", "link": "https://arxiv.org/abs/2301.13701", "description": "arXiv:2301.13701v2 Announce Type: replace \nAbstract: We study the stability of posterior predictive inferences to the specification of the likelihood model and perturbations of the data generating process. In modern big data analyses, useful broad structural judgements may be elicited from the decision-maker but a level of interpolation is required to arrive at a likelihood model. As a result, an often computationally convenient canonical form is used in place of the decision-maker's true beliefs. Equally, in practice, observational datasets often contain unforeseen heterogeneities and recording errors and therefore do not necessarily correspond to how the process was idealised by the decision-maker. Acknowledging such imprecisions, a faithful Bayesian analysis should ideally be stable across reasonable equivalence classes of such inputs. 
We are able to guarantee that traditional Bayesian updating provides stability across only a very strict class of likelihood models and data generating processes, requiring the decision-maker to elicit their beliefs and understand how the data was generated with an unreasonable degree of accuracy. On the other hand, a generalised Bayesian alternative using the $\\beta$-divergence loss function is shown to be stable across practical and interpretable neighbourhoods, providing assurances that posterior inferences are not overly dependent on accidentally introduced spurious specifications or data collection errors. We illustrate this in linear regression, binary classification, and mixture modelling examples, showing that stable updating does not compromise the ability to learn about the data generating process. These stability results provide a compelling justification for using generalised Bayes to facilitate inference under simplified canonical models."}, "https://arxiv.org/abs/2304.04519": {"title": "On new omnibus tests of uniformity on the hypersphere", "link": "https://arxiv.org/abs/2304.04519", "description": "arXiv:2304.04519v5 Announce Type: replace \nAbstract: Two new omnibus tests of uniformity for data on the hypersphere are proposed. The new test statistics exploit closed-form expressions for orthogonal polynomials, feature tuning parameters, and are related to a ``smooth maximum'' function and the Poisson kernel. We obtain exact moments of the test statistics under uniformity and rotationally symmetric alternatives, and give their null asymptotic distributions. We consider approximate oracle tuning parameters that maximize the power of the tests against known generic alternatives and provide tests that estimate oracle parameters through cross-validated procedures while maintaining the significance level. Numerical experiments explore the effectiveness of null asymptotic distributions and the accuracy of inexpensive approximations of exact null distributions. A simulation study compares the powers of the new tests with other tests of the Sobolev class, showing the benefits of the former. The proposed tests are applied to the study of the (seemingly uniform) nursing times of wild polar bears."}, "https://arxiv.org/abs/2306.01198": {"title": "Confidence Intervals for Error Rates in 1:1 Matching Tasks: Critical Statistical Analysis and Recommendations", "link": "https://arxiv.org/abs/2306.01198", "description": "arXiv:2306.01198v3 Announce Type: replace \nAbstract: Matching algorithms are commonly used to predict matches between items in a collection. For example, in 1:1 face verification, a matching algorithm predicts whether two face images depict the same person. Accurately assessing the uncertainty of the error rates of such algorithms can be challenging when data are dependent and error rates are low, two aspects that have been often overlooked in the literature. In this work, we review methods for constructing confidence intervals for error rates in 1:1 matching tasks. We derive and examine the statistical properties of these methods, demonstrating how coverage and interval width vary with sample size, error rates, and degree of data dependence on both analysis and experiments with synthetic and real-world datasets. 
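One simple way to respect the data dependence discussed in this abstract is to resample at the identity (cluster) level rather than at the comparison level. The Python sketch below computes an identity-level bootstrap confidence interval for a false non-match rate on simulated genuine-pair scores; the data-generating process, threshold, and percentile interval are illustrative assumptions, not the specific methods reviewed in the paper.

import numpy as np

rng = np.random.default_rng(6)
n_ids, pairs_per_id = 200, 6

# Simulated genuine-pair similarity scores with an identity-level random effect,
# so comparisons that share an identity are dependent
id_effect = rng.normal(0.0, 0.5, size=n_ids)
scores = id_effect[:, None] + rng.normal(2.0, 1.0, size=(n_ids, pairs_per_id))
threshold = 0.5
errors = scores < threshold                 # genuine pair wrongly rejected (false non-match)
fnmr_hat = errors.mean()

# Bootstrap over identities (clusters) rather than over individual comparisons
B = 2000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_ids, size=n_ids)
    boot[b] = errors[idx].mean()
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"FNMR = {fnmr_hat:.3f}, 95% identity-level bootstrap CI = ({lo:.3f}, {hi:.3f})")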
Based on our findings, we provide recommendations for best practices for constructing confidence intervals for error rates in 1:1 matching tasks."}, "https://arxiv.org/abs/2306.01468": {"title": "Robust Bayesian Inference for Berkson and Classical Measurement Error Models", "link": "https://arxiv.org/abs/2306.01468", "description": "arXiv:2306.01468v2 Announce Type: replace \nAbstract: Measurement error occurs when a covariate influencing a response variable is corrupted by noise. This can lead to misleading inference outcomes, particularly in problems where accurately estimating the relationship between covariates and response variables is crucial, such as causal effect estimation. Existing methods for dealing with measurement error often rely on strong assumptions such as knowledge of the error distribution or its variance and availability of replicated measurements of the covariates. We propose a Bayesian Nonparametric Learning framework that is robust to mismeasured covariates, does not require the preceding assumptions, and can incorporate prior beliefs about the error distribution. This approach gives rise to a general framework that is suitable for both Classical and Berkson error models via the appropriate specification of the prior centering measure of a Dirichlet Process (DP). Moreover, it offers flexibility in the choice of loss function depending on the type of regression model. We provide bounds on the generalization error based on the Maximum Mean Discrepancy (MMD) loss which allows for generalization to non-Gaussian distributed errors and nonlinear covariate-response relationships. We showcase the effectiveness of the proposed framework versus prior art in real-world problems containing either Berkson or Classical measurement errors."}, "https://arxiv.org/abs/2306.09151": {"title": "Estimating the Sampling Distribution of Posterior Decision Summaries in Bayesian Clinical Trials", "link": "https://arxiv.org/abs/2306.09151", "description": "arXiv:2306.09151v2 Announce Type: replace \nAbstract: Bayesian inference and the use of posterior or posterior predictive probabilities for decision making have become increasingly popular in clinical trials. The current practice in Bayesian clinical trials relies on a hybrid Bayesian-frequentist approach where the design and decision criteria are assessed with respect to frequentist operating characteristics such as power and type I error rate conditioning on a given set of parameters. These operating characteristics are commonly obtained via simulation studies. The utility of Bayesian measures, such as ``assurance\", that incorporate uncertainty about model parameters in estimating the probabilities of various decisions in trials has been demonstrated recently. However, the computational burden remains an obstacle toward wider use of such criteria. In this article, we propose methodology which utilizes large sample theory of the posterior distribution to define parametric models for the sampling distribution of the posterior summaries used for decision making. The parameters of these models are estimated using a small number of simulation scenarios, thereby refining these models to capture the sampling distribution for small to moderate sample size. The proposed approach toward the assessment of conditional and marginal operating characteristics and sample size determination can be considered as simulation-assisted rather than simulation-based. 
It enables formal incorporation of uncertainty about the trial assumptions via a design prior and significantly reduces the computational burden for the design of Bayesian trials in general."}, "https://arxiv.org/abs/2307.16138": {"title": "A switching state-space transmission model for tracking epidemics and assessing interventions", "link": "https://arxiv.org/abs/2307.16138", "description": "arXiv:2307.16138v2 Announce Type: replace \nAbstract: The effective control of infectious diseases relies on accurate assessment of the impact of interventions, which is often hindered by the complex dynamics of the spread of disease. A Beta-Dirichlet switching state-space transmission model is proposed to track underlying dynamics of disease and evaluate the effectiveness of interventions simultaneously. As time evolves, the switching mechanism introduced in the susceptible-exposed-infected-recovered (SEIR) model is able to capture the timing and magnitude of changes in the transmission rate due to the effectiveness of control measures. The implementation of this model is based on a particle Markov Chain Monte Carlo algorithm, which can estimate the time evolution of SEIR states, switching states, and high-dimensional parameters efficiently. The efficacy of the proposed model and estimation procedure are demonstrated through simulation studies. With a real-world application to British Columbia's COVID-19 outbreak, the proposed switching state-space transmission model quantifies the reduction of transmission rate following interventions. The proposed model provides a promising tool to inform public health policies aimed at studying the underlying dynamics and evaluating the effectiveness of interventions during the spread of the disease."}, "https://arxiv.org/abs/2309.02072": {"title": "Data Scaling Effect of Deep Learning in Financial Time Series Forecasting", "link": "https://arxiv.org/abs/2309.02072", "description": "arXiv:2309.02072v4 Announce Type: replace \nAbstract: For many years, researchers have been exploring the use of deep learning in the forecasting of financial time series. However, they have continued to rely on the conventional econometric approach for model optimization, optimizing the deep learning models on individual assets. In this paper, we use the stock volatility forecast as an example to illustrate global training - optimizes the deep learning model across a wide range of stocks - is both necessary and beneficial for any academic or industry practitioners who is interested in employing deep learning to forecast financial time series. Furthermore, a pre-trained foundation model for volatility forecast is introduced, capable of making accurate zero-shot forecasts for any stocks."}, "https://arxiv.org/abs/2309.06305": {"title": "Sensitivity Analysis for Linear Estimators", "link": "https://arxiv.org/abs/2309.06305", "description": "arXiv:2309.06305v3 Announce Type: replace \nAbstract: We propose a novel sensitivity analysis framework for linear estimators with identification failures that can be viewed as seeing the wrong outcome distribution. Our approach measures the degree of identification failure through the change in measure between the observed distribution and a hypothetical target distribution that would identify the causal parameter of interest. The framework yields a sensitivity analysis that generalizes existing bounds for Average Potential Outcome (APO), Regression Discontinuity (RD), and instrumental variables (IV) exclusion failure designs. 
Our partial identification results extend results from the APO context to allow even unbounded likelihood ratios. Our proposed sensitivity analysis consistently estimates sharp bounds under plausible conditions and estimates valid bounds under mild conditions. We find that our method performs well in simulations even when targeting a discontinuous and nearly infinite bound."}, "https://arxiv.org/abs/2312.07881": {"title": "Efficiency of QMLE for dynamic panel data models with interactive effects", "link": "https://arxiv.org/abs/2312.07881", "description": "arXiv:2312.07881v2 Announce Type: replace \nAbstract: This paper derives the efficiency bound for estimating the parameters of dynamic panel data models in the presence of an increasing number of incidental parameters. We study the efficiency problem by formulating the dynamic panel as a simultaneous equations system, and show that the quasi-maximum likelihood estimator (QMLE) applied to the system achieves the efficiency bound. Comparison of QMLE with fixed effects estimators is made."}, "https://arxiv.org/abs/2111.04597": {"title": "Neyman-Pearson Multi-class Classification via Cost-sensitive Learning", "link": "https://arxiv.org/abs/2111.04597", "description": "arXiv:2111.04597v3 Announce Type: replace-cross \nAbstract: Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications such as loan default prediction, different types of errors can have varying consequences. To address this asymmetry issue, two popular paradigms have been developed: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Previous studies on the NP paradigm have primarily focused on the binary case, while the multi-class NP problem poses a greater challenge due to its unknown feasibility. In this work, we tackle the multi-class NP problem by establishing a connection with the CS problem via strong duality and propose two algorithms. We extend the concept of NP oracle inequalities, crucial in binary classifications, to NP oracle properties in the multi-class context. Our algorithms satisfy these NP oracle properties under certain conditions. Furthermore, we develop practical algorithms to assess the feasibility and strong duality in multi-class NP problems, which can offer practitioners the landscape of a multi-class NP problem with various target error levels. Simulations and real data studies validate the effectiveness of our algorithms. To our knowledge, this is the first study to address the multi-class NP problem with theoretical guarantees. The proposed algorithms have been implemented in the R package \\texttt{npcs}, which is available on CRAN."}, "https://arxiv.org/abs/2211.01939": {"title": "Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation", "link": "https://arxiv.org/abs/2211.01939", "description": "arXiv:2211.01939v3 Announce Type: replace-cross \nAbstract: We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike machine learning, there is no perfect analogue of cross-validation for model selection as we do not observe the counterfactual potential outcomes. Towards this, a variety of surrogate metrics have been proposed for CATE model selection that use only observed data. However, we do not have a good understanding regarding their effectiveness due to limited comparisons in prior studies. 
We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling."}, "https://arxiv.org/abs/2404.19118": {"title": "Identification and estimation of causal effects using non-concurrent controls in platform trials", "link": "https://arxiv.org/abs/2404.19118", "description": "arXiv:2404.19118v1 Announce Type: new \nAbstract: Platform trials are multi-arm designs that simultaneously evaluate multiple treatments for a single disease within the same overall trial structure. Unlike traditional randomized controlled trials, they allow treatment arms to enter and exit the trial at distinct times while maintaining a control arm throughout. This control arm comprises both concurrent controls, where participants are randomized concurrently to either the treatment or control arm, and non-concurrent controls, who enter the trial when the treatment arm under study is unavailable. While flexible, platform trials introduce a unique challenge with the use of non-concurrent controls, raising questions about how to efficiently utilize their data to estimate treatment effects. Specifically, what estimands should be used to evaluate the causal effect of a treatment versus control? Under what assumptions can these estimands be identified and estimated? Do we achieve any efficiency gains? In this paper, we use structural causal models and counterfactuals to clarify estimands and formalize their identification in the presence of non-concurrent controls in platform trials. We also provide outcome regression, inverse probability weighting, and doubly robust estimators for their estimation. We discuss efficiency gains, demonstrate their performance in a simulation study, and apply them to the ACTT platform trial, resulting in a 20% improvement in precision."}, "https://arxiv.org/abs/2404.19127": {"title": "A model-free subdata selection method for classification", "link": "https://arxiv.org/abs/2404.19127", "description": "arXiv:2404.19127v1 Announce Type: new \nAbstract: Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. 
Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods on extensive simulated and real datasets."}, "https://arxiv.org/abs/2404.19144": {"title": "A Locally Robust Semiparametric Approach to Examiner IV Designs", "link": "https://arxiv.org/abs/2404.19144", "description": "arXiv:2404.19144v1 Announce Type: new \nAbstract: I propose a locally robust semiparametric framework for estimating causal effects using the popular examiner IV design, in the presence of many examiners and possibly many covariates relative to the sample size. The key ingredient of this approach is an orthogonal moment function that is robust to biases and local misspecification from the first step estimation of the examiner IV. I derive the orthogonal moment function and show that it delivers multiple robustness where the outcome model or at least one of the first step components is misspecified but the estimating equation remains valid. The proposed framework not only allows for estimation of the examiner IV in the presence of many examiners and many covariates relative to sample size, using a wide range of nonparametric and machine learning techniques including LASSO, Dantzig, neural networks and random forests, but also delivers root-n consistent estimation of the parameter of interest under mild assumptions."}, "https://arxiv.org/abs/2404.19145": {"title": "Orthogonal Bootstrap: Efficient Simulation of Input Uncertainty", "link": "https://arxiv.org/abs/2404.19145", "description": "arXiv:2404.19145v1 Announce Type: new \nAbstract: Bootstrap is a popular methodology for simulating input uncertainty. However, it can be computationally expensive when the number of samples is large. We propose a new approach called \\textbf{Orthogonal Bootstrap} that reduces the number of required Monte Carlo replications. We decompose the target being simulated into two parts: the \\textit{non-orthogonal part} which has a closed-form result known as Infinitesimal Jackknife and the \\textit{orthogonal part} which is easier to simulate. We theoretically and numerically show that Orthogonal Bootstrap significantly reduces the computational cost of Bootstrap while improving empirical accuracy and maintaining the same width of the constructed interval."}, "https://arxiv.org/abs/2404.19325": {"title": "Correcting for confounding in longitudinal experiments: positioning non-linear mixed effects modeling as implementation of standardization using latent conditional exchangeability", "link": "https://arxiv.org/abs/2404.19325", "description": "arXiv:2404.19325v1 Announce Type: new \nAbstract: Non-linear mixed effects modeling and simulation (NLME M&S) is evaluated for use in standardization with longitudinal data in the presence of confounders. Standardization is a well-known method in causal inference to correct for confounding by analyzing and combining results from subgroups of patients. We show that non-linear mixed effects modeling is a particular implementation of standardization that conditions on individual parameters described by the random effects of the mixed effects model. Our motivation is that in pharmacometrics NLME M&S is routinely used to analyze clinical trials and to predict and compare potential outcomes of the same patient population under different treatment regimens. Such a comparison is a causal question sometimes referred to as causal prediction. 
Nonetheless, NLME M&S is rarely positioned as a method for causal prediction.\n As an example, a simulated clinical trial is used that assumes treatment confounder feedback in which early outcomes can cause deviations from the planned treatment schedule. Being interested in the outcome for the hypothetical situation that patients adhere to the planned treatment schedule, we put assumptions in a causal diagram. From the causal diagram, conditional independence assumptions are derived either using latent conditional exchangeability, conditioning on the individual parameters, or using sequential conditional exchangeability, conditioning on earlier outcomes. Both conditional independencies can be used to estimate the estimand of interest, e.g., with standardization, and they give unbiased estimates."}, "https://arxiv.org/abs/2404.19344": {"title": "Data-adaptive structural change-point detection via isolation", "link": "https://arxiv.org/abs/2404.19344", "description": "arXiv:2404.19344v1 Announce Type: new \nAbstract: In this paper, a new data-adaptive method, called DAIS (Data Adaptive ISolation), is introduced for the estimation of the number and the location of change-points in a given data sequence. The proposed method can detect changes in various different signal structures; we focus on the examples of piecewise-constant and continuous, piecewise-linear signals. We highlight, however, that our algorithm can be extended to other frameworks, such as piecewise-quadratic signals. The data-adaptivity of our methodology lies in the fact that, at each step, and for the data under consideration, we search for the most prominent change-point in a targeted neighborhood of the data sequence that contains this change-point with high probability. Using a suitably chosen contrast function, the change-point will then get detected after being isolated in an interval. The isolation feature enhances estimation accuracy, while the data-adaptive nature of DAIS is advantageous regarding, mainly, computational complexity and accuracy. The simulation results presented indicate that DAIS is at least as accurate as state-of-the-art competitors."}, "https://arxiv.org/abs/2404.19465": {"title": "Optimal E-Values for Exponential Families: the Simple Case", "link": "https://arxiv.org/abs/2404.19465", "description": "arXiv:2404.19465v1 Announce Type: new \nAbstract: We provide a general condition under which e-variables in the form of a simple-vs.-simple likelihood ratio exist when the null hypothesis is a composite, multivariate exponential family. Such `simple' e-variables are easy to compute and expected-log-optimal with respect to any stopping time. Simple e-variables were previously only known to exist in quite specific settings, but we offer a unifying theorem on their existence for testing exponential families. We start with a simple alternative $Q$ and a regular exponential family null. Together these induce a second exponential family ${\\cal Q}$ containing $Q$, with the same sufficient statistic as the null. Our theorem shows that simple e-variables exist whenever the covariance matrices of ${\\cal Q}$ and the null are in a certain relation. 
Examples in which this relation holds include some $k$-sample tests, Gaussian location- and scale tests, and tests for more general classes of natural exponential families."}, "https://arxiv.org/abs/2404.19472": {"title": "Multi-label Classification under Uncertainty: A Tree-based Conformal Prediction Approach", "link": "https://arxiv.org/abs/2404.19472", "description": "arXiv:2404.19472v1 Announce Type: new \nAbstract: Multi-label classification is a common challenge in various machine learning applications, where a single data instance can be associated with multiple classes simultaneously. The current paper proposes a novel tree-based method for multi-label classification using conformal prediction and multiple hypothesis testing. The proposed method employs hierarchical clustering with labelsets to develop a hierarchical tree, which is then formulated as a multiple-testing problem with a hierarchical structure. The split-conformal prediction method is used to obtain marginal conformal $p$-values for each tested hypothesis, and two \\textit{hierarchical testing procedures} are developed based on marginal conformal $p$-values, including a hierarchical Bonferroni procedure and its modification for controlling the family-wise error rate. The prediction sets are thus formed based on the testing outcomes of these two procedures. We establish a theoretical guarantee of valid coverage for the prediction sets through proven family-wise error rate control of those two procedures. We demonstrate the effectiveness of our method in a simulation study and two real data analyses, compared to other conformal methods for multi-label classification."}, "https://arxiv.org/abs/2404.19494": {"title": "The harms of class imbalance corrections for machine learning based prediction models: a simulation study", "link": "https://arxiv.org/abs/2404.19494", "description": "arXiv:2404.19494v1 Announce Type: new \nAbstract: Risk prediction models are increasingly used in healthcare to aid in clinical decision making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally represented in the data). It is common for researchers to correct this class imbalance, yet, the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class imbalance across different data-generating scenarios (varying sample size, the number of predictors and event fraction). Our findings were illustrated in a case study using MIMIC-III data. In all simulation scenarios, prediction models developed without a correction for class imbalance consistently had equal or better calibration performance than prediction models developed with a correction for class imbalance. The miscalibration introduced by correcting for class imbalance was characterized by an over-estimation of risk and could not always be corrected with re-calibration. 
Correcting for class imbalance is not always necessary and may even be harmful for clinical prediction models which aim to produce reliable risk estimates on an individual basis."}, "https://arxiv.org/abs/2404.19661": {"title": "PCA for Point Processes", "link": "https://arxiv.org/abs/2404.19661", "description": "arXiv:2404.19661v1 Announce Type: new \nAbstract: We introduce a novel statistical framework for the analysis of replicated point processes that allows for the study of point pattern variability at a population level. By treating point process realizations as random measures, we adopt a functional analysis perspective and propose a form of functional Principal Component Analysis (fPCA) for point processes. The originality of our method is to base our analysis on the cumulative mass functions of the random measures which gives us a direct and interpretable analysis. Key theoretical contributions include establishing a Karhunen-Lo\\`{e}ve expansion for the random measures and a Mercer Theorem for covariance measures. We establish convergence in a strong sense, and introduce the concept of principal measures, which can be seen as latent processes governing the dynamics of the observed point patterns. We propose an easy-to-implement estimation strategy of eigenelements for which parametric rates are achieved. We fully characterize the solutions of our approach to Poisson and Hawkes processes and validate our methodology via simulations and diverse applications in seismology, single-cell biology and neurosciences, demonstrating its versatility and effectiveness. Our method is implemented in the pppca R-package."}, "https://arxiv.org/abs/2404.19700": {"title": "Comparing Multivariate Distributions: A Novel Approach Using Optimal Transport-based Plots", "link": "https://arxiv.org/abs/2404.19700", "description": "arXiv:2404.19700v1 Announce Type: new \nAbstract: Quantile-Quantile (Q-Q) plots are widely used for assessing the distributional similarity between two datasets. Traditionally, Q-Q plots are constructed for univariate distributions, making them less effective in capturing complex dependencies present in multivariate data. In this paper, we propose a novel approach for constructing multivariate Q-Q plots, which extend the traditional Q-Q plot methodology to handle high-dimensional data. Our approach utilizes optimal transport (OT) and entropy-regularized optimal transport (EOT) to align the empirical quantiles of the two datasets. Additionally, we introduce another technique based on OT and EOT potentials which can effectively compare two multivariate datasets. Through extensive simulations and real data examples, we demonstrate the effectiveness of our proposed approach in capturing multivariate dependencies and identifying distributional differences such as tail behaviour. We also propose two test statistics based on the Q-Q and potential plots to compare two distributions rigorously."}, "https://arxiv.org/abs/2404.19707": {"title": "Identification by non-Gaussianity in structural threshold and smooth transition vector autoregressive models", "link": "https://arxiv.org/abs/2404.19707", "description": "arXiv:2404.19707v1 Announce Type: new \nAbstract: Linear structural vector autoregressive models can be identified statistically without imposing restrictions on the model if the shocks are mutually independent and at most one of them is Gaussian. 
We show that this result extends to structural threshold and smooth transition vector autoregressive models incorporating a time-varying impact matrix defined as a weighted sum of the impact matrices of the regimes. Our empirical application studies the effects of the climate policy uncertainty shock on the U.S. macroeconomy. In a structural logistic smooth transition vector autoregressive model consisting of two regimes, we find that a positive climate policy uncertainty shock decreases production in times of low economic policy uncertainty but slightly increases it in times of high economic policy uncertainty. The introduced methods are implemented in the accompanying R package sstvars."}, "https://arxiv.org/abs/2404.19224": {"title": "Variational approximations of possibilistic inferential models", "link": "https://arxiv.org/abs/2404.19224", "description": "arXiv:2404.19224v1 Announce Type: cross \nAbstract: Inferential models (IMs) offer reliable, data-driven, possibilistic statistical inference. But despite IMs' theoretical/foundational advantages, efficient computation in applications is a major challenge. This paper presents a simple and apparently powerful Monte Carlo-driven strategy for approximating the IM's possibility contour, or at least its $\\alpha$-level set for a specified $\\alpha$. Our proposal utilizes a parametric family that, in a certain sense, approximately covers the credal set associated with the IM's possibility measure, which is reminiscent of variational approximations now widely used in Bayesian statistics."}, "https://arxiv.org/abs/2404.19242": {"title": "A Minimal Set of Parameters Based Depth-Dependent Distortion Model and Its Calibration Method for Stereo Vision Systems", "link": "https://arxiv.org/abs/2404.19242", "description": "arXiv:2404.19242v1 Announce Type: cross \nAbstract: Depth position highly affects lens distortion, especially in close-range photography, which limits the measurement accuracy of existing stereo vision systems. Moreover, traditional depth-dependent distortion models and their calibration methods have remained complicated. In this work, we propose a minimal set of parameters based depth-dependent distortion model (MDM), which considers the radial and decentering distortions of the lens to improve the accuracy of stereo vision systems and simplify their calibration process. In addition, we present an easy and flexible calibration method for the MDM of stereo vision systems with a commonly used planar pattern, which requires cameras to observe the planar pattern in different orientations. The proposed technique is easy to use and flexible compared with classical calibration techniques for depth-dependent distortion models in which the lens must be perpendicular to the planar pattern. The experimental validation of the MDM and its calibration method showed that the MDM improved the calibration accuracy by 56.55% and 74.15% compared with Li's distortion model and the traditional Brown distortion model. Furthermore, an iteration-based reconstruction method is proposed to iteratively estimate the depth information in the MDM during three-dimensional reconstruction. 
The results showed that the accuracy of the iteration-based reconstruction method was improved by 9.08% compared with that of the non-iteration reconstruction method."}, "https://arxiv.org/abs/2404.19495": {"title": "Percentage Coefficient (bp) -- Effect Size Analysis (Theory Paper 1)", "link": "https://arxiv.org/abs/2404.19495", "description": "arXiv:2404.19495v1 Announce Type: cross \nAbstract: Percentage coefficient (bp) has emerged in recent publications as an additional and alternative estimator of effect size for regression analysis. This paper retraces the theory behind the estimator. It is posited that an estimator must first serve the fundamental function of enabling researchers and readers to comprehend an estimand, the target of estimation. It may then serve the instrumental function of enabling researchers and readers to compare two or more estimands. Defined as the regression coefficient when dependent variable (DV) and independent variable (IV) are both on conceptual 0-1 percentage scales, percentage coefficients (bp) feature 1) clearly comprehensible interpretation and 2) equitable scales for comparison. Thus, the coefficient (bp) serves both functions effectively and efficiently, thereby serving some needs not completely served by other indicators such as raw coefficient (bw) and standardized beta. Another fundamental premise of the functionalist theory is that \"effect\" is not a monolithic concept. Rather, it is a collection of compartments, each of which measures a component of the conglomerate that we call \"effect.\" A regression coefficient (b), for example, measures one aspect of effect, which is unit effect, aka efficiency, as it indicates the unit change in DV associated with a one-unit increase in IV. Percentage coefficient (bp) indicates the change in DV in percentage points associated with a whole scale increase in IV. It is not meant to be an all-encompassing indicator of the all-encompassing concept of effect, but rather an interpretable and comparable indicator of efficiency, one of the key components of effect."}, "https://arxiv.org/abs/2404.19640": {"title": "Attacking Bayes: On the Adversarial Robustness of Bayesian Neural Networks", "link": "https://arxiv.org/abs/2404.19640", "description": "arXiv:2404.19640v1 Announce Type: cross \nAbstract: Adversarial examples have been shown to cause neural networks to fail on a wide range of vision and language tasks, but recent work has claimed that Bayesian neural networks (BNNs) are inherently robust to adversarial perturbations. In this work, we examine this claim. To study the adversarial robustness of BNNs, we investigate whether it is possible to successfully break state-of-the-art BNN inference methods and prediction pipelines using even relatively unsophisticated attacks for three tasks: (1) label prediction under the posterior predictive mean, (2) adversarial example detection with Bayesian predictive uncertainty, and (3) semantic shift detection. We find that BNNs trained with state-of-the-art approximate inference methods, and even BNNs trained with Hamiltonian Monte Carlo, are highly susceptible to adversarial attacks. 
We also identify various conceptual and experimental errors in previous works that claimed inherent adversarial robustness of BNNs and conclusively demonstrate that BNNs and uncertainty-aware Bayesian prediction pipelines are not inherently robust against adversarial attacks."}, "https://arxiv.org/abs/2102.08809": {"title": "Testing for Nonlinear Cointegration under Heteroskedasticity", "link": "https://arxiv.org/abs/2102.08809", "description": "arXiv:2102.08809v3 Announce Type: replace \nAbstract: This article discusses tests for nonlinear cointegration in the presence of variance breaks. We build on cointegration test approaches under heteroskedasticity (Cavaliere and Taylor, 2006, Journal of Time Series Analysis) and nonlinearity (Choi and Saikkonen, 2010, Econometric Theory) to propose a bootstrap test and prove its consistency. A Monte Carlo study shows the approach to have satisfactory finite-sample properties. We provide an empirical application to the environmental Kuznets curves (EKC), finding that the cointegration test provides little evidence for the EKC hypothesis. Additionally, we examine a nonlinear relation between the US money demand and the interest rate, finding that our test does not reject the null of a smooth transition cointegrating relation."}, "https://arxiv.org/abs/2112.10248": {"title": "Efficient Modeling of Spatial Extremes over Large Geographical Domains", "link": "https://arxiv.org/abs/2112.10248", "description": "arXiv:2112.10248v3 Announce Type: replace \nAbstract: Various natural phenomena exhibit spatial extremal dependence at short spatial distances. However, existing models proposed in the spatial extremes literature often assume that extremal dependence persists across the entire domain. This is a strong limitation when modeling extremes over large geographical domains, and yet it has been mostly overlooked in the literature. We here develop a more realistic Bayesian framework based on a novel Gaussian scale mixture model, with the Gaussian process component defined by a stochastic partial differential equation yielding a sparse precision matrix, and the random scale component modeled as a low-rank Pareto-tailed or Weibull-tailed spatial process determined by compactly-supported basis functions. We show that our proposed model is approximately tail-stationary and that it can capture a wide range of extremal dependence structures. Its inherently sparse structure allows fast Bayesian computations in high spatial dimensions based on a customized Markov chain Monte Carlo algorithm prioritizing calibration in the tail. We fit our model to analyze heavy monsoon rainfall data in Bangladesh. Our study shows that our model outperforms natural competitors and that it fits precipitation extremes well. We finally use the fitted model to draw inference on long-term return levels for marginal precipitation and spatial aggregates."}, "https://arxiv.org/abs/2204.00473": {"title": "Finite Sample Inference in Incomplete Models", "link": "https://arxiv.org/abs/2204.00473", "description": "arXiv:2204.00473v2 Announce Type: replace \nAbstract: We propose confidence regions for the parameters of incomplete models with exact coverage of the true parameter in finite samples. Our confidence region inverts a test, which generalizes Monte Carlo tests to incomplete models. The test statistic is a discrete analogue of a new optimal transport characterization of the sharp identified region. 
Both test statistic and critical values rely on simulation drawn from the distribution of latent variables and are computed using solutions to discrete optimal transport, hence linear programming problems. We also propose a fast preliminary search in the parameter space with an alternative, more conservative yet consistent test, based on a parameter free critical value."}, "https://arxiv.org/abs/2210.01757": {"title": "Transportability of model-based estimands in evidence synthesis", "link": "https://arxiv.org/abs/2210.01757", "description": "arXiv:2210.01757v5 Announce Type: replace \nAbstract: In evidence synthesis, effect modifiers are typically described as variables that induce treatment effect heterogeneity at the individual level, through treatment-covariate interactions in an outcome model parametrized at such level. As such, effect modification is defined with respect to a conditional measure, but marginal effect estimates are required for population-level decisions in health technology assessment. For non-collapsible measures, purely prognostic variables that are not determinants of treatment response at the individual level may modify marginal effects, even where there is individual-level treatment effect homogeneity. With heterogeneity, marginal effects for measures that are not directly collapsible cannot be expressed in terms of marginal covariate moments, and generally depend on the joint distribution of conditional effect measure modifiers and purely prognostic variables. There are implications for recommended practices in evidence synthesis. Unadjusted anchored indirect comparisons can be biased in the absence of individual-level treatment effect heterogeneity, or when marginal covariate moments are balanced across studies. Covariate adjustment may be necessary to account for cross-study imbalances in joint covariate distributions involving purely prognostic variables. In the absence of individual patient data for the target, covariate adjustment approaches are inherently limited in their ability to remove bias for measures that are not directly collapsible. Directly collapsible measures would facilitate the transportability of marginal effects between studies by: (1) reducing dependence on model-based covariate adjustment where there is individual-level treatment effect homogeneity or marginal covariate moments are balanced; and (2) facilitating the selection of baseline covariates for adjustment where there is individual-level treatment effect heterogeneity."}, "https://arxiv.org/abs/2210.05792": {"title": "Flexible Modeling of Nonstationary Extremal Dependence using Spatially-Fused LASSO and Ridge Penalties", "link": "https://arxiv.org/abs/2210.05792", "description": "arXiv:2210.05792v3 Announce Type: replace \nAbstract: Statistical modeling of a nonstationary spatial extremal dependence structure is challenging. Max-stable processes are common choices for modeling spatially-indexed block maxima, where an assumption of stationarity is usual to make inference feasible. However, this assumption is often unrealistic for data observed over a large or complex domain. We propose a computationally-efficient method for estimating extremal dependence using a globally nonstationary, but locally-stationary, max-stable process by exploiting nonstationary kernel convolutions. We divide the spatial domain into a fine grid of subregions, assign each of them its own dependence parameters, and use LASSO ($L_1$) or ridge ($L_2$) penalties to obtain spatially-smooth parameter estimates. 
We then develop a novel data-driven algorithm to merge homogeneous neighboring subregions. The algorithm facilitates model parsimony and interpretability. To make our model suitable for high-dimensional data, we exploit a pairwise likelihood to draw inferences and discuss computational and statistical efficiency. An extensive simulation study demonstrates the superior performance of our proposed model and the subregion-merging algorithm over the approaches that either do not model nonstationarity or do not update the domain partition. We apply our proposed method to model monthly maximum temperatures at over 1400 sites in Nepal and the surrounding Himalayan and sub-Himalayan regions; we again observe significant improvements in model fit compared to a stationary process and a nonstationary process without subregion-merging. Furthermore, we demonstrate that the estimated merged partition is interpretable from a geographic perspective and leads to better model diagnostics by adequately reducing the number of subregion-specific parameters."}, "https://arxiv.org/abs/2210.13027": {"title": "E-Valuating Classifier Two-Sample Tests", "link": "https://arxiv.org/abs/2210.13027", "description": "arXiv:2210.13027v2 Announce Type: replace \nAbstract: We introduce a powerful deep classifier two-sample test for high-dimensional data based on E-values, called E-value Classifier Two-Sample Test (E-C2ST). Our test combines ideas from existing work on split likelihood ratio tests and predictive independence tests. The resulting E-values are suitable for anytime-valid sequential two-sample tests. This feature allows for more effective use of data in constructing test statistics. Through simulations and real data applications, we empirically demonstrate that E-C2ST achieves enhanced statistical power by partitioning datasets into multiple batches beyond the conventional two-split (training and testing) approach of standard classifier two-sample tests. This strategy increases the power of the test while keeping the type I error well below the desired significance level."}, "https://arxiv.org/abs/2302.03435": {"title": "Logistic regression with missing responses and predictors: a review of existing approaches and a case study", "link": "https://arxiv.org/abs/2302.03435", "description": "arXiv:2302.03435v2 Announce Type: replace \nAbstract: In this work logistic regression when both the response and the predictor variables may be missing is considered. Several existing approaches are reviewed, including complete case analysis, inverse probability weighting, multiple imputation and maximum likelihood. The methods are compared in a simulation study, which serves to evaluate the bias, the variance and the mean squared error of the estimators for the regression coefficients. In the simulations, the maximum likelihood methodology is the one that presents the best results, followed by multiple imputation with five imputations, which is the second best. The methods are applied to a case study on obesity among schoolchildren in the municipality of Viana do Castelo, North Portugal, where a logistic regression model is used to predict the International Obesity Task Force (IOTF) indicator from physical examinations and the past values of the obesity status. All the variables in the case study are potentially missing, with gender as the only exception. The results provided by the several methods are in good agreement, indicating the relevance of the past values of IOTF and physical scores for the prediction of obesity. 
Practical recommendations are given."}, "https://arxiv.org/abs/2303.17872": {"title": "Lancaster correlation -- a new dependence measure linked to maximum correlation", "link": "https://arxiv.org/abs/2303.17872", "description": "arXiv:2303.17872v2 Announce Type: replace \nAbstract: We suggest novel correlation coefficients which equal the maximum correlation for a class of bivariate Lancaster distributions while being only slightly smaller than maximum correlation for a variety of further bivariate distributions. In contrast to maximum correlation, however, our correlation coefficients allow for rank and moment-based estimators which are simple to compute and have tractable asymptotic distributions. Confidence intervals resulting from these asymptotic approximations and the covariance bootstrap show good finite-sample coverage. In a simulation, the power of asymptotic as well as permutation tests for independence based on our correlation measures compares favorably with competing methods based on distance correlation or rank coefficients for functional dependence, among others. Moreover, for the bivariate normal distribution, our correlation coefficients equal the absolute value of the Pearson correlation, an attractive feature for practitioners which is not shared by various competitors. We illustrate the practical usefulness of our methods in applications to two real data sets."}, "https://arxiv.org/abs/2401.11128": {"title": "Regularized Estimation of Sparse Spectral Precision Matrices", "link": "https://arxiv.org/abs/2401.11128", "description": "arXiv:2401.11128v2 Announce Type: replace \nAbstract: Spectral precision matrix, the inverse of a spectral density matrix, is an object of central interest in frequency-domain analysis of multivariate time series. Estimation of spectral precision matrix is a key step in calculating partial coherency and graphical model selection of stationary time series. When the dimension of a multivariate time series is moderate to large, traditional estimators of spectral density matrices such as averaged periodograms tend to be severely ill-conditioned, and one needs to resort to suitable regularization strategies involving optimization over complex variables.\n In this work, we propose complex graphical Lasso (CGLASSO), an $\\ell_1$-penalized estimator of spectral precision matrix based on local Whittle likelihood maximization. We develop fast $\\textit{pathwise coordinate descent}$ algorithms for implementing CGLASSO on large dimensional time series data sets. At its core, our algorithmic development relies on a ring isomorphism between complex and real matrices that helps map a number of optimization problems over complex variables to similar optimization problems over real variables. This finding may be of independent interest and more broadly applicable for high-dimensional statistical analysis with complex-valued data. We also present a complete non-asymptotic theory of our proposed estimator which shows that consistent estimation is possible in high-dimensional regime as long as the underlying spectral precision matrix is suitably sparse. 
We compare the performance of CGLASSO with competing alternatives on simulated data sets, and use it to construct partial coherence network among brain regions from a real fMRI data set."}, "https://arxiv.org/abs/2208.03215": {"title": "Hierarchical Bayesian data selection", "link": "https://arxiv.org/abs/2208.03215", "description": "arXiv:2208.03215v2 Announce Type: replace-cross \nAbstract: There are many issues that can cause problems when attempting to infer model parameters from data. Data and models are both imperfect, and as such there are multiple scenarios in which standard methods of inference will lead to misleading conclusions; corrupted data, models which are only representative of subsets of the data, or multiple regions in which the model is best fit using different parameters. Methods exist for the exclusion of some anomalous types of data, but in practice, data cleaning is often undertaken by hand before attempting to fit models to data. In this work, we will employ hierarchical Bayesian data selection; the simultaneous inference of both model parameters, and parameters which represent our belief that each observation within the data should be included in the inference. The aim, within a Bayesian setting, is to find the regions of observation space for which the model can well-represent the data, and to find the corresponding model parameters for those regions. A number of approaches will be explored, and applied to test problems in linear regression, and to the problem of fitting an ODE model, approximated by a finite difference method. The approaches are simple to implement, can aid mixing of Markov chains designed to sample from the arising densities, and are broadly applicable to many inferential problems."}, "https://arxiv.org/abs/2303.02756": {"title": "A New Class of Realistic Spatio-Temporal Processes with Advection and Their Simulation", "link": "https://arxiv.org/abs/2303.02756", "description": "arXiv:2303.02756v2 Announce Type: replace-cross \nAbstract: Traveling phenomena, frequently observed in a variety of scientific disciplines including atmospheric science, seismography, and oceanography, have long suffered from limitations due to lack of realistic statistical modeling tools and simulation methods. Our work primarily addresses this, introducing more realistic and flexible models for spatio-temporal random fields. We break away from the traditional confines of the classic frozen field by either relaxing the assumption of a single deterministic velocity or rethinking the hypothesis regarding the spectrum shape, thus enhancing the realism of our models. While the proposed models stand out for their realism and flexibility, they are also paired with simulation algorithms that are equally or less computationally complex than the commonly used circulant embedding for Gaussian random fields in $\\mathbb{R}^{2+1}$. This combination of realistic modeling with efficient simulation methods creates an effective solution for better understanding traveling phenomena."}, "https://arxiv.org/abs/2308.00957": {"title": "Causal Inference with Differentially Private (Clustered) Outcomes", "link": "https://arxiv.org/abs/2308.00957", "description": "arXiv:2308.00957v2 Announce Type: replace-cross \nAbstract: Estimating causal effects from randomized experiments is only feasible if participants agree to reveal their potentially sensitive responses. 
Of the many ways of ensuring privacy, label differential privacy is a widely used measure of an algorithm's privacy guarantee, which might encourage participants to share responses without running the risk of de-anonymization. Many differentially private mechanisms inject noise into the original data-set to achieve this privacy guarantee, which increases the variance of most statistical estimators and makes the precise measurement of causal effects difficult: there exists a fundamental privacy-variance trade-off to performing causal analyses from differentially private data. With the aim of achieving lower variance for stronger privacy guarantees, we suggest a new differential privacy mechanism, Cluster-DP, which leverages any given cluster structure of the data while still allowing for the estimation of causal effects. We show that, depending on an intuitive measure of cluster quality, we can improve the variance loss while maintaining our privacy guarantees. We compare its performance, theoretically and empirically, to that of its unclustered version and a more extreme uniform-prior version which does not use any of the original response distribution, both of which are special cases of the Cluster-DP algorithm."}, "https://arxiv.org/abs/2405.00158": {"title": "BayesBlend: Easy Model Blending using Pseudo-Bayesian Model Averaging, Stacking and Hierarchical Stacking in Python", "link": "https://arxiv.org/abs/2405.00158", "description": "arXiv:2405.00158v1 Announce Type: new \nAbstract: Averaging predictions from multiple competing inferential models frequently outperforms predictions from any single model, providing that models are optimally weighted to maximize predictive performance. This is particularly the case in so-called $\\mathcal{M}$-open settings where the true model is not in the set of candidate models, and may be neither mathematically reifiable nor known precisely. This practice of model averaging has a rich history in statistics and machine learning, and there are currently a number of methods to estimate the weights for constructing model-averaged predictive distributions. Nonetheless, there are few existing software packages that can estimate model weights from the full variety of methods available, and none that blend model predictions into a coherent predictive distribution according to the estimated weights. In this paper, we introduce the BayesBlend Python package, which provides a user-friendly programming interface to estimate weights and blend multiple (Bayesian) models' predictive distributions. BayesBlend implements pseudo-Bayesian model averaging, stacking and, uniquely, hierarchical Bayesian stacking to estimate model weights. We demonstrate the usage of BayesBlend with examples of insurance loss modeling."}, "https://arxiv.org/abs/2405.00161": {"title": "Estimating Heterogeneous Treatment Effects with Item-Level Outcome Data: Insights from Item Response Theory", "link": "https://arxiv.org/abs/2405.00161", "description": "arXiv:2405.00161v1 Announce Type: new \nAbstract: Analyses of heterogeneous treatment effects (HTE) are common in applied causal inference research. However, when outcomes are latent variables assessed via psychometric instruments such as educational tests, standard methods ignore the potential HTE that may exist among the individual items of the outcome measure. 
Failing to account for \"item-level\" HTE (IL-HTE) can lead to both estimated standard errors that are too small and identification challenges in the estimation of treatment-by-covariate interaction effects. We demonstrate how Item Response Theory (IRT) models that estimate a treatment effect for each assessment item can both address these challenges and provide new insights into HTE generally. This study articulates the theoretical rationale for the IL-HTE model and demonstrates its practical value using data from 20 randomized controlled trials in economics, education, and health. Our results show that the IL-HTE model reveals item-level variation masked by average treatment effects, provides more accurate statistical inference, allows for estimates of the generalizability of causal effects, resolves identification problems in the estimation of interaction effects, and provides estimates of standardized treatment effect sizes corrected for attenuation due to measurement error."}, "https://arxiv.org/abs/2405.00179": {"title": "A Bayesian joint longitudinal-survival model with a latent stochastic process for intensive longitudinal data", "link": "https://arxiv.org/abs/2405.00179", "description": "arXiv:2405.00179v1 Announce Type: new \nAbstract: The availability of mobile health (mHealth) technology has enabled increased collection of intensive longitudinal data (ILD). ILD have potential to capture rapid fluctuations in outcomes that may be associated with changes in the risk of an event. However, existing methods for jointly modeling longitudinal and event-time outcomes are not well-equipped to handle ILD due to the high computational cost. We propose a joint longitudinal and time-to-event model suitable for analyzing ILD. In this model, we summarize a multivariate longitudinal outcome as a smaller number of time-varying latent factors. These latent factors, which are modeled using an Ornstein-Uhlenbeck stochastic process, capture the risk of a time-to-event outcome in a parametric hazard model. We take a Bayesian approach to fit our joint model and conduct simulations to assess its performance. We use it to analyze data from an mHealth study of smoking cessation. We summarize the longitudinal self-reported intensity of nine emotions as the psychological states of positive and negative affect. These time-varying latent states capture the risk of the first smoking lapse after attempted quit. Understanding factors associated with smoking lapse is of keen interest to smoking cessation researchers."}, "https://arxiv.org/abs/2405.00185": {"title": "Finite-sample adjustments for comparing clustered adaptive interventions using data from a clustered SMART", "link": "https://arxiv.org/abs/2405.00185", "description": "arXiv:2405.00185v1 Announce Type: new \nAbstract: Adaptive interventions, aka dynamic treatment regimens, are sequences of pre-specified decision rules that guide the provision of treatment for an individual given information about their baseline and evolving needs, including in response to prior intervention. Clustered adaptive interventions (cAIs) extend this idea by guiding the provision of intervention at the level of clusters (e.g., clinics), but with the goal of improving outcomes at the level of individuals within the cluster (e.g., clinicians or patients within clinics). A clustered, sequential multiple-assignment randomized trials (cSMARTs) is a multistage, multilevel randomized trial design used to construct high-quality cAIs. 
In a cSMART, clusters are randomized at multiple intervention decision points; at each decision point, the randomization probability can depend on response to prior data. A challenge in cluster-randomized trials, including cSMARTs, is the deleterious effect of small samples of clusters on statistical inference, particularly via estimation of standard errors. \\par This manuscript develops finite-sample adjustment (FSA) methods for making improved statistical inference about the causal effects of cAIs in a cSMART. The paper develops FSA methods that (i) scale variance estimators using a degree-of-freedom adjustment, (ii) reference a t distribution (instead of a normal), and (iii) employ a ``bias corrected\" variance estimator. Method (iii) requires extensions that are unique to the analysis of cSMARTs. Extensive simulation experiments are used to test the performance of the methods. The methods are illustrated using the Adaptive School-based Implementation of CBT (ASIC) study, a cSMART designed to construct a cAI for improving the delivery of cognitive behavioral therapy (CBT) by school mental health professionals within high schools in Michigan."}, "https://arxiv.org/abs/2405.00294": {"title": "Conformal inference for random objects", "link": "https://arxiv.org/abs/2405.00294", "description": "arXiv:2405.00294v1 Announce Type: new \nAbstract: We develop an inferential toolkit for analyzing object-valued responses, which correspond to data situated in general metric spaces, paired with Euclidean predictors within the conformal framework. To this end we introduce conditional profile average transport costs, where we compare distance profiles that correspond to one-dimensional distributions of probability mass falling into balls of increasing radius through the optimal transport cost when moving from one distance profile to another. The average transport cost to transport a given distance profile to all others is crucial for statistical inference in metric spaces and underpins the proposed conditional profile scores. A key feature of the proposed approach is to utilize the distribution of conditional profile average transport costs as conformity score for general metric space-valued responses, which facilitates the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity for the prediction sets. The finite sample performance for synthetic data in various metric spaces demonstrates that the proposed conditional profile score outperforms existing methods in terms of both coverage level and size of the resulting prediction sets, even in the special case of scalar and thus Euclidean responses. We also demonstrate the practical utility of conditional profile scores for network data from New York taxi trips and for compositional data reflecting energy sourcing of U.S. states."}, "https://arxiv.org/abs/2405.00424": {"title": "Optimal Bias-Correction and Valid Inference in High-Dimensional Ridge Regression: A Closed-Form Solution", "link": "https://arxiv.org/abs/2405.00424", "description": "arXiv:2405.00424v1 Announce Type: new \nAbstract: Ridge regression is an indispensable tool in big data econometrics but suffers from bias issues affecting both statistical efficiency and scalability. We introduce an iterative strategy to correct the bias effectively when the dimension $p$ is less than the sample size $n$. 
For $p>n$, our method optimally reduces the bias to a level unachievable through linear transformations of the response. We employ a Ridge-Screening (RS) method to handle the remaining bias when $p>n$, creating a reduced model suitable for bias-correction. Under certain conditions, the selected model nests the true one, making RS a novel variable selection approach. We establish the asymptotic properties and valid inferences of our de-biased ridge estimators for both $p< n$ and $p>n$, where $p$ and $n$ may grow towards infinity, along with the number of iterations. Our method is validated using simulated and real-world data examples, providing a closed-form solution to bias challenges in ridge regression inferences."}, "https://arxiv.org/abs/2405.00535": {"title": "Bayesian Varying-Effects Vector Autoregressive Models for Inference of Brain Connectivity Networks and Covariate Effects in Pediatric Traumatic Brain Injury", "link": "https://arxiv.org/abs/2405.00535", "description": "arXiv:2405.00535v1 Announce Type: new \nAbstract: In this paper, we develop an analytical approach for estimating brain connectivity networks that accounts for subject heterogeneity. More specifically, we consider a novel extension of a multi-subject Bayesian vector autoregressive model that estimates group-specific directed brain connectivity networks and accounts for the effects of covariates on the network edges. We adopt a flexible approach, allowing for (possibly) non-linear effects of the covariates on edge strength via a novel Bayesian nonparametric prior that employs a weighted mixture of Gaussian processes. For posterior inference, we achieve computational scalability by implementing a variational Bayes scheme. Our approach enables simultaneous estimation of group-specific networks and selection of relevant covariate effects. We show improved performance over competing two-stage approaches on simulated data. We apply our method on resting-state fMRI data from children with a history of traumatic brain injury and healthy controls to estimate the effects of age and sex on the group-level connectivities. Our results highlight differences in the distribution of parent nodes. They also suggest alteration in the relation of age, with peak edge strength in children with traumatic brain injury (TBI), and differences in effective connectivity strength between males and females."}, "https://arxiv.org/abs/2405.00581": {"title": "Conformalized Tensor Completion with Riemannian Optimization", "link": "https://arxiv.org/abs/2405.00581", "description": "arXiv:2405.00581v1 Announce Type: new \nAbstract: Tensor data, or multi-dimensional array, is a data format popular in multiple fields such as social network analysis, recommender systems, and brain imaging. It is not uncommon to observe tensor data containing missing values, and tensor completion aims at estimating the missing values given the partially observed tensor. Substantial efforts have been devoted to devising scalable tensor completion algorithms but few to quantifying the uncertainty of the estimator. In this paper, we nest the uncertainty quantification (UQ) of tensor completion under a split conformal prediction framework and establish the connection of the UQ problem to a problem of estimating the missing propensity of each tensor entry. We model the data missingness of the tensor with a tensor Ising model parameterized by a low-rank tensor parameter. 
We propose to estimate the tensor parameter by maximum pseudo-likelihood estimation (MPLE) with a Riemannian gradient descent algorithm. Extensive simulation studies have been conducted to justify the validity of the resulting conformal interval. We apply our method to the regional total electron content (TEC) reconstruction problem."}, "https://arxiv.org/abs/2405.00619": {"title": "One-Bit Total Variation Denoising over Networks with Applications to Partially Observed Epidemics", "link": "https://arxiv.org/abs/2405.00619", "description": "arXiv:2405.00619v1 Announce Type: new \nAbstract: This paper introduces a novel approach for epidemic nowcasting and forecasting over networks using total variation (TV) denoising, a method inspired by classical signal processing techniques. Considering a network that models a population as a set of $n$ nodes characterized by their infection statuses $Y_i$ and that represents contacts as edges, we prove the consistency of graph-TV denoising for estimating the underlying infection probabilities $\\{p_i\\}_{ i \\in \\{1,\\cdots, n\\}}$ in the presence of Bernoulli noise. Our results provide an important extension of existing bounds derived in the Gaussian case to the study of binary variables -- an approach hereafter referred to as one-bit total variation denoising. The methodology is further extended to handle incomplete observations, thereby expanding its relevance to various real-world situations where observations over the full graph may not be accessible. Focusing on the context of epidemics, we establish that one-bit total variation denoising enhances both nowcasting and forecasting accuracy in networks, as further evidenced by comprehensive numerical experiments and two real-world examples. The contributions of this paper lie in its theoretical developments, particularly in addressing the incomplete data case, thereby paving the way for more precise epidemic modelling and enhanced surveillance strategies in practical settings."}, "https://arxiv.org/abs/2405.00626": {"title": "SARMA: Scalable Low-Rank High-Dimensional Autoregressive Moving Averages via Tensor Decomposition", "link": "https://arxiv.org/abs/2405.00626", "description": "arXiv:2405.00626v1 Announce Type: new \nAbstract: Existing models for high-dimensional time series are overwhelmingly developed within the finite-order vector autoregressive (VAR) framework, whereas the more flexible vector autoregressive moving averages (VARMA) have been much less considered. This paper introduces a high-dimensional model for capturing VARMA dynamics, namely the Scalable ARMA (SARMA) model, by combining novel reparameterization and tensor decomposition techniques. To ensure identifiability and computational tractability, we first consider a reparameterization of the VARMA model and discover that this interestingly amounts to a Tucker-low-rank structure for the AR coefficient tensor along the temporal dimension. Motivated by this finding, we further consider Tucker decomposition across the response and predictor dimensions of the AR coefficient tensor, enabling factor extraction across variables and time lags. Additionally, we consider sparsity assumptions on the factor loadings to accomplish automatic variable selection and greater estimation efficiency. For the proposed model, we develop both rank-constrained and sparsity-inducing estimators. Algorithms and model selection methods are also provided. 
Simulation studies and empirical examples confirm the validity of our theory and advantages of our approaches over existing competitors."}, "https://arxiv.org/abs/2405.00118": {"title": "Causal Inference with High-dimensional Discrete Covariates", "link": "https://arxiv.org/abs/2405.00118", "description": "arXiv:2405.00118v1 Announce Type: cross \nAbstract: When estimating causal effects from observational studies, researchers often need to adjust for many covariates to deconfound the non-causal relationship between exposure and outcome, among which many covariates are discrete. The behavior of commonly used estimators in the presence of many discrete covariates is not well understood since their properties are often analyzed under structural assumptions including sparsity and smoothness, which do not apply in discrete settings. In this work, we study the estimation of causal effects in a model where the covariates required for confounding adjustment are discrete but high-dimensional, meaning the number of categories $d$ is comparable with or even larger than sample size $n$. Specifically, we show the mean squared error of commonly used regression, weighting and doubly robust estimators is bounded by $\\frac{d^2}{n^2}+\\frac{1}{n}$. We then prove the minimax lower bound for the average treatment effect is of order $\\frac{d^2}{n^2 \\log^2 n}+\\frac{1}{n}$, which characterizes the fundamental difficulty of causal effect estimation in the high-dimensional discrete setting, and shows the estimators mentioned above are rate-optimal up to log-factors. We further consider additional structures that can be exploited, namely effect homogeneity and prior knowledge of the covariate distribution, and propose new estimators that enjoy faster convergence rates of order $\\frac{d}{n^2} + \\frac{1}{n}$, which achieve consistency in a broader regime. The results are illustrated empirically via simulation studies."}, "https://arxiv.org/abs/2405.00417": {"title": "Conformal Risk Control for Ordinal Classification", "link": "https://arxiv.org/abs/2405.00417", "description": "arXiv:2405.00417v1 Announce Type: cross \nAbstract: As a natural extension to the standard conformal prediction method, several conformal risk control methods have been recently developed and applied to various learning problems. In this work, we seek to control the conformal risk in expectation for ordinal classification tasks, which have broad applications to many real problems. For this purpose, we firstly formulated the ordinal classification task in the conformal risk control framework, and provided theoretic risk bounds of the risk control method. Then we proposed two types of loss functions specially designed for ordinal classification tasks, and developed corresponding algorithms to determine the prediction set for each case to control their risks at a desired level. We demonstrated the effectiveness of our proposed methods, and analyzed the difference between the two types of risks on three different datasets, including a simulated dataset, the UTKFace dataset and the diabetic retinopathy detection dataset."}, "https://arxiv.org/abs/2405.00576": {"title": "Calibration of the rating transition model for high and low default portfolios", "link": "https://arxiv.org/abs/2405.00576", "description": "arXiv:2405.00576v1 Announce Type: cross \nAbstract: In this paper we develop Maximum likelihood (ML) based algorithms to calibrate the model parameters in credit rating transition models. 
Since the credit rating transition models are not Gaussian linear models, the celebrated Kalman filter is not suitable to compute the likelihood of observed migrations. Therefore, we develop a Laplace approximation of the likelihood function and as a result the Kalman filter can be used in the end to compute the likelihood function. This approach is applied to so-called high-default portfolios, in which the number of migrations (defaults) is large enough to obtain high accuracy of the Laplace approximation. By contrast, low-default portfolios have a limited number of observed migrations (defaults). Therefore, in order to calibrate low-default portfolios, we develop a ML algorithm using a particle filter (PF) and Gaussian process regression. Experiments show that both algorithms are efficient and produce accurate approximations of the likelihood function and the ML estimates of the model parameters."}, "https://arxiv.org/abs/1902.09608": {"title": "On Binscatter", "link": "https://arxiv.org/abs/1902.09608", "description": "arXiv:1902.09608v5 Announce Type: replace \nAbstract: Binscatter is a popular method for visualizing bivariate relationships and conducting informal specification testing. We study the properties of this method formally and develop enhanced visualization and econometric binscatter tools. These include estimating conditional means with optimal binning and quantifying uncertainty. We also highlight a methodological problem related to covariate adjustment that can yield incorrect conclusions. We revisit two applications using our methodology and find substantially different results relative to those obtained using prior informal binscatter methods. General purpose software in Python, R, and Stata is provided. Our technical work is of independent interest for the nonparametric partition-based estimation literature."}, "https://arxiv.org/abs/2207.04248": {"title": "A Statistical-Modelling Approach to Feedforward Neural Network Model Selection", "link": "https://arxiv.org/abs/2207.04248", "description": "arXiv:2207.04248v5 Announce Type: replace \nAbstract: Feedforward neural networks (FNNs) can be viewed as non-linear regression models, where covariates enter the model through a combination of weighted summations and non-linear functions. Although these models have some similarities to the approaches used within statistical modelling, the majority of neural network research has been conducted outside of the field of statistics. This has resulted in a lack of statistically-based methodology, and, in particular, there has been little emphasis on model parsimony. Determining the input layer structure is analogous to variable selection, while the structure for the hidden layer relates to model complexity. In practice, neural network model selection is often carried out by comparing models using out-of-sample performance. However, in contrast, the construction of an associated likelihood function opens the door to information-criteria-based variable and architecture selection. A novel model selection method, which performs both input- and hidden-node selection, is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance as the model selection objective function leads to an increased probability of recovering the true model, while parsimoniously achieving favourable out-of-sample performance. 
Simulation studies are used to evaluate and justify the proposed method, and applications to real data are investigated."}, "https://arxiv.org/abs/2307.10503": {"title": "Regularizing threshold priors with sparse response patterns in Bayesian factor analysis with categorical indicators", "link": "https://arxiv.org/abs/2307.10503", "description": "arXiv:2307.10503v2 Announce Type: replace \nAbstract: Instruments comprising ordered responses to items are ubiquitous for studying many constructs of interest. However, using such an item response format may lead to items with response categories that are infrequently endorsed or entirely unendorsed. In maximum likelihood estimation, this results in non-existing estimates for thresholds. This work focuses on a Bayesian estimation approach to counter this issue. The issue changes from the existence of an estimate to how to effectively construct threshold priors. The proposed prior specification reconceptualizes the threshold prior as a prior on the probability of each response category, a metric that is easier to manipulate while maintaining the necessary ordering constraints on the thresholds. The resulting induced prior is more communicable, and we demonstrate statistical efficiency comparable to that of existing threshold priors. Evidence is provided using a simulated data set, a Monte Carlo simulation study, and an example multi-group item-factor model analysis. All analyses demonstrate how at least a relatively informative threshold prior is necessary to avoid inefficient posterior sampling and increase confidence in the coverage rates of posterior credible intervals."}, "https://arxiv.org/abs/2307.15330": {"title": "Group integrative dynamic factor models with application to multiple subject brain connectivity", "link": "https://arxiv.org/abs/2307.15330", "description": "arXiv:2307.15330v3 Announce Type: replace \nAbstract: This work introduces a novel framework for dynamic factor model-based group-level analysis of multiple subjects' time series data, called GRoup Integrative DYnamic factor (GRIDY) models. The framework identifies and characterizes inter-subject similarities and differences between two pre-determined groups by considering a combination of group spatial information and individual temporal dynamics. Furthermore, it enables the identification of intra-subject similarities and differences over time by employing different model configurations for each subject. Methodologically, the framework combines a novel principal angle-based rank selection algorithm and a non-iterative integrative analysis framework. Inspired by simultaneous component analysis, this approach also reconstructs identifiable latent factor series with flexible covariance structures. The performance of the GRIDY models is evaluated through simulations conducted under various scenarios. An application is also presented to compare resting-state functional MRI data collected from multiple subjects in autism spectrum disorder and control groups."}, "https://arxiv.org/abs/2309.09299": {"title": "Bounds on Average Effects in Discrete Choice Panel Data Models", "link": "https://arxiv.org/abs/2309.09299", "description": "arXiv:2309.09299v3 Announce Type: replace \nAbstract: In discrete choice panel data, the estimation of average effects is crucial for quantifying the effect of covariates, and for policy evaluation and counterfactual analysis.
This task is challenging in short panels with individual-specific effects due to partial identification and the incidental parameter problem. In particular, estimation of the sharp identified set is practically infeasible at realistic sample sizes whenever the number of support points of the observed covariates is large, such as when the covariates are continuous. In this paper, we therefore propose estimating outer bounds on the identified set of average effects. Our bounds are easy to construct, converge at the parametric rate, and are computationally simple to obtain even in moderately large samples, independent of whether the covariates are discrete or continuous. We also provide asymptotically valid confidence intervals on the identified set."}, "https://arxiv.org/abs/2309.13640": {"title": "Visualizing periodic stability in studies: the moving average meta-analysis (MA2)", "link": "https://arxiv.org/abs/2309.13640", "description": "arXiv:2309.13640v2 Announce Type: replace \nAbstract: Relative clinical benefits are often visually explored and formally analysed through a (cumulative) meta-analysis. In this manuscript, we introduce and further explore the moving average meta-analysis to aid towards the exploration and visualization of periodic stability in a meta-analysis."}, "https://arxiv.org/abs/2310.16207": {"title": "Propensity weighting plus adjustment in proportional hazards model is not doubly robust", "link": "https://arxiv.org/abs/2310.16207", "description": "arXiv:2310.16207v2 Announce Type: replace \nAbstract: Recently, it has become common for applied works to combine commonly used survival analysis modeling methods, such as the multivariable Cox model and propensity score weighting, with the intention of forming a doubly robust estimator of an exposure effect hazard ratio that is unbiased in large samples when either the Cox model or the propensity score model is correctly specified. This combination does not, in general, produce a doubly robust estimator, even after regression standardization, when there is truly a causal effect. We demonstrate via simulation this lack of double robustness for the semiparametric Cox model, the Weibull proportional hazards model, and a simple proportional hazards flexible parametric model, with both the latter models fit via maximum likelihood. We provide a novel proof that the combination of propensity score weighting and a proportional hazards survival model, fit either via full or partial likelihood, is consistent under the null of no causal effect of the exposure on the outcome under particular censoring mechanisms if either the propensity score or the outcome model is correctly specified and contains all confounders. Given our results suggesting that double robustness only exists under the null, we outline two simple alternative estimators that are doubly robust for the survival difference at a given time point (in the above sense), provided the censoring mechanism can be correctly modeled, and one doubly robust method of estimation for the full survival curve. 
We provide R code to use these estimators for estimation and inference in the supporting information."}, "https://arxiv.org/abs/2311.11054": {"title": "Modern extreme value statistics for Utopian extremes", "link": "https://arxiv.org/abs/2311.11054", "description": "arXiv:2311.11054v2 Announce Type: replace \nAbstract: Capturing the extremal behaviour of data often requires bespoke marginal and dependence models which are grounded in rigorous asymptotic theory, and hence provide reliable extrapolation into the upper tails of the data-generating distribution. We present a toolbox of four methodological frameworks, motivated by modern extreme value theory, that can be used to accurately estimate extreme exceedance probabilities or the corresponding level in either a univariate or multivariate setting. Our frameworks were used to facilitate the winning contribution of Team Yalla to the EVA (2023) Conference Data Challenge, which was organised for the 13$^\\text{th}$ International Conference on Extreme Value Analysis. This competition comprised seven teams competing across four separate sub-challenges, with each requiring the modelling of data simulated from known, yet highly complex, statistical distributions, and extrapolation far beyond the range of the available samples in order to predict probabilities of extreme events. Data were constructed to be representative of real environmental data, sampled from the fantasy country of \"Utopia\""}, "https://arxiv.org/abs/2311.17271": {"title": "Spatial-Temporal Extreme Modeling for Point-to-Area Random Effects (PARE)", "link": "https://arxiv.org/abs/2311.17271", "description": "arXiv:2311.17271v2 Announce Type: replace \nAbstract: One measurement modality for rainfall is a fixed location rain gauge. However, extreme rainfall, flooding, and other climate extremes often occur at larger spatial scales and affect more than one location in a community. For example, in 2017 Hurricane Harvey impacted all of Houston and the surrounding region causing widespread flooding. Flood risk modeling requires understanding of rainfall for hydrologic regions, which may contain one or more rain gauges. Further, policy changes to address the risks and damages of natural hazards such as severe flooding are usually made at the community/neighborhood level or higher geo-spatial scale. Therefore, spatial-temporal methods which convert results from one spatial scale to another are especially useful in applications for evolving environmental extremes. We develop a point-to-area random effects (PARE) modeling strategy for understanding spatial-temporal extreme values at the areal level, when the core information are time series at point locations distributed over the region."}, "https://arxiv.org/abs/2401.00264": {"title": "Identification of Nonlinear Dynamic Panels under Partial Stationarity", "link": "https://arxiv.org/abs/2401.00264", "description": "arXiv:2401.00264v3 Announce Type: replace \nAbstract: This paper studies identification for a wide range of nonlinear panel data models, including binary choice, ordered response, and other types of limited dependent variable models. Our approach accommodates dynamic models with any number of lagged dependent variables as well as other types of (potentially contemporary) endogeneity. Our identification strategy relies on a partial stationarity condition, which not only allows for an unknown distribution of errors but also for temporal dependencies in errors. 
We derive partial identification results under flexible model specifications and provide additional support conditions for point identification. We demonstrate the robust finite-sample performance of our approach using Monte Carlo simulations, and apply the approach to analyze the empirical application of income categories using various ordered choice models."}, "https://arxiv.org/abs/2312.14191": {"title": "Noisy Measurements Are Important, the Design of Census Products Is Much More Important", "link": "https://arxiv.org/abs/2312.14191", "description": "arXiv:2312.14191v2 Announce Type: replace-cross \nAbstract: McCartan et al. (2023) call for \"making differential privacy work for census data users.\" This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design-the query strategy output. The more important component is the query workload output-the statistics released to the public. Optimizing the query workload-the Redistricting Data (P.L. 94-171) Summary File, specifically-could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic."}, "https://arxiv.org/abs/2405.00827": {"title": "Overcoming model uncertainty -- how equivalence tests can benefit from model averaging", "link": "https://arxiv.org/abs/2405.00827", "description": "arXiv:2405.00827v1 Announce Type: new \nAbstract: A common problem in numerous research areas, particularly in clinical trials, is to test whether the effect of an explanatory variable on an outcome variable is equivalent across different groups. In practice, these tests are frequently used to compare the effect between patient groups, e.g. based on gender, age or treatments. Equivalence is usually assessed by testing whether the difference between the groups does not exceed a pre-specified equivalence threshold. Classical approaches are based on testing the equivalence of single quantities, e.g. the mean, the area under the curve (AUC) or other values of interest. However, when differences depending on a particular covariate are observed, these approaches can turn out to be not very accurate. Instead, whole regression curves over the entire covariate range, describing for instance the time window or a dose range, are considered and tests are based on a suitable distance measure of two such curves, as, for example, the maximum absolute distance between them. In this regard, a key assumption is that the true underlying regression models are known, which is rarely the case in practice. However, misspecification can lead to severe problems as inflated type I errors or, on the other hand, conservative test procedures. In this paper, we propose a solution to this problem by introducing a flexible extension of such an equivalence test using model averaging in order to overcome this assumption and making the test applicable under model uncertainty. 
More precisely, we introduce model averaging based on smooth AIC weights and we propose a testing procedure which makes use of the duality between confidence intervals and hypothesis testing. We demonstrate the validity of our approach by means of a simulation study and illustrate its practical relevance with a time-response case study involving toxicological gene expression data."}, "https://arxiv.org/abs/2405.00917": {"title": "Semiparametric mean and variance joint models with clipped-Laplace link functions for bounded integer-valued time series", "link": "https://arxiv.org/abs/2405.00917", "description": "arXiv:2405.00917v1 Announce Type: new \nAbstract: We present a novel approach for modeling bounded count time series data, by deriving accurate upper and lower bounds for the variance of a bounded count random variable while maintaining a fixed mean. Leveraging these bounds, we propose semiparametric mean and variance joint (MVJ) models utilizing a clipped-Laplace link function. These models offer a flexible and feasible structure for both mean and variance, accommodating various scenarios of under-dispersion, equi-dispersion, or over-dispersion in bounded time series. The proposed MVJ models feature a linear mean structure with positive regression coefficients summing to one and allow for negative regression coefficients and autocorrelations. We demonstrate that the autocorrelation structure of MVJ models mirrors that of an autoregressive moving-average (ARMA) process, provided the proposed clipped-Laplace link functions with nonnegative regression coefficients summing to one are utilized. We establish conditions ensuring the stationarity and ergodicity properties of the MVJ process, along with demonstrating the consistency and asymptotic normality of the conditional least squares estimators. To aid model selection and diagnostics, we introduce two model selection criteria and apply two model diagnostic statistics. Finally, we conduct simulations and real data analyses to investigate the finite-sample properties of the proposed MVJ models, providing insights into their efficacy and applicability in practical scenarios."}, "https://arxiv.org/abs/2405.00953": {"title": "Asymptotic Properties of the Distributional Synthetic Controls", "link": "https://arxiv.org/abs/2405.00953", "description": "arXiv:2405.00953v1 Announce Type: new \nAbstract: This paper enhances our comprehension of the Distributional Synthetic Control (DSC) proposed by Gunsilius (2023), focusing on its asymptotic properties. We first establish the DSC estimator's asymptotic optimality. The essence of this optimality is that the treatment effect estimator given by DSC achieves the lowest possible squared prediction error among all potential treatment effect estimators that depend on an average of quantiles of control units. We also establish the convergence of the DSC weights when some requirements are met, as well as the convergence rate. A significant aspect of our research is that we find DSC synthesis forms an optimal weighted average, particularly in situations where it is impractical to perfectly fit the treated unit's quantiles through the weighted average of the control units' quantiles.
To corroborate our theoretical insights, we provide empirical evidence derived from simulations."}, "https://arxiv.org/abs/2405.01072": {"title": "Statistical Inference on the Cumulative Distribution Function using Judgment Post Stratification", "link": "https://arxiv.org/abs/2405.01072", "description": "arXiv:2405.01072v1 Announce Type: new \nAbstract: In this work, we discuss a general class of the estimators for the cumulative distribution function (CDF) based on judgment post stratification (JPS) sampling scheme which includes both empirical and kernel distribution functions. Specifically, we obtain the expectation of the estimators in this class and show that they are asymptotically more efficient than their competitors in simple random sampling (SRS), as long as the rankings are better than random guessing. We find a mild condition that is necessary and sufficient for them to be asymptotically unbiased. We also prove that given the same condition, the estimators in this class are strongly uniformly consistent estimators of the true CDF, and converge in distribution to a normal distribution when the sample size goes to infinity.\n We then focus on the kernel distribution function (KDF) in the JPS design and obtain the optimal bandwidth. We next carry out a comprehensive Monte Carlo simulation to compare the performance of the KDF in the JPS design for different choices of sample size, set size, ranking quality, parent distribution, kernel function as well as both perfect and imperfect rankings set-ups with its counterpart in SRS design. It is found that the JPS estimator dramatically improves the efficiency of the KDF as compared to its SRS competitor for a wide range of the settings. Finally, we apply the described procedure on a real dataset from medical context to show their usefulness and applicability in practice."}, "https://arxiv.org/abs/2405.01110": {"title": "Investigating the causal effects of multiple treatments using longitudinal data: a simulation study", "link": "https://arxiv.org/abs/2405.01110", "description": "arXiv:2405.01110v1 Announce Type: new \nAbstract: Many clinical questions involve estimating the effects of multiple treatments using observational data. When using longitudinal data, the interest is often in the effect of treatment strategies that involve sustaining treatment over time. This requires causal inference methods appropriate for handling multiple treatments and time-dependent confounding. Robins Generalised methods (g-methods) are a family of methods which can deal with time-dependent confounding and some of these have been extended to situations with multiple treatments, although there are currently no studies comparing different methods in this setting. 
We show how five g-methods (inverse-probability-of-treatment weighted estimation of marginal structural models, g-formula, g-estimation, censoring and weighting, and a sequential trials approach) can be extended to situations with multiple treatments, compare their performances in a simulation study, and demonstrate their application with an example using data from the UK CF Registry."}, "https://arxiv.org/abs/2405.01182": {"title": "A Model-Based Approach to Shot Charts Estimation in Basketball", "link": "https://arxiv.org/abs/2405.01182", "description": "arXiv:2405.01182v1 Announce Type: new \nAbstract: Shot charts in basketball analytics provide an indispensable tool for evaluating players' shooting performance by visually representing the distribution of field goal attempts across different court locations. However, conventional methods often overlook the bounded nature of the basketball court, leading to inaccurate representations, particularly along the boundaries and corners. In this paper, we propose a novel model-based approach to shot chart estimation and visualization that explicitly considers the physical boundaries of the basketball court. By employing Gaussian mixtures for bounded data, our methodology allows to obtain more accurate estimation of shot density distributions for both made and missed shots. Bayes' rule is then applied to derive estimates for the probability of successful shooting from any given locations, and to identify the regions with the highest expected scores. To illustrate the efficacy of our proposal, we apply it to data from the 2022-23 NBA regular season, showing its usefulness through detailed analyses of shot patterns for two prominent players."}, "https://arxiv.org/abs/2405.01275": {"title": "Variable Selection in Ultra-high Dimensional Feature Space for the Cox Model with Interval-Censored Data", "link": "https://arxiv.org/abs/2405.01275", "description": "arXiv:2405.01275v1 Announce Type: new \nAbstract: We develop a set of variable selection methods for the Cox model under interval censoring, in the ultra-high dimensional setting where the dimensionality can grow exponentially with the sample size. The methods select covariates via a penalized nonparametric maximum likelihood estimation with some popular penalty functions, including lasso, adaptive lasso, SCAD, and MCP. We prove that our penalized variable selection methods with folded concave penalties or adaptive lasso penalty enjoy the oracle property. Extensive numerical experiments show that the proposed methods have satisfactory empirical performance under various scenarios. The utility of the methods is illustrated through an application to a genome-wide association study of age to early childhood caries."}, "https://arxiv.org/abs/2405.01281": {"title": "Demistifying Inference after Adaptive Experiments", "link": "https://arxiv.org/abs/2405.01281", "description": "arXiv:2405.01281v1 Announce Type: new \nAbstract: Adaptive experiments such as multi-arm bandits adapt the treatment-allocation policy and/or the decision to stop the experiment to the data observed so far. This has the potential to improve outcomes for study participants within the experiment, to improve the chance of identifying best treatments after the experiment, and to avoid wasting data. Seen as an experiment (rather than just a continually optimizing system) it is still desirable to draw statistical inferences with frequentist guarantees. 
The concentration inequalities and union bounds that generally underlie adaptive experimentation algorithms can yield overly conservative inferences, but at the same time the asymptotic normality we would usually appeal to in non-adaptive settings can be imperiled by adaptivity. In this article we aim to explain why, how, and when adaptivity is in fact an issue for inference and, when it is, understand the various ways to fix it: reweighting to stabilize variances and recover asymptotic normality, always-valid inference based on joint normality of an asymptotic limiting sequence, and characterizing and inverting the non-normal distributions induced by adaptivity."}, "https://arxiv.org/abs/2405.01336": {"title": "Quantification of vaccine waning as a challenge effect", "link": "https://arxiv.org/abs/2405.01336", "description": "arXiv:2405.01336v1 Announce Type: new \nAbstract: Knowing whether vaccine protection wanes over time is important for health policy and drug development. However, quantifying waning effects is difficult. A simple contrast of vaccine efficacy at two different times compares different populations of individuals: those who were uninfected at the first time versus those who remain uninfected until the second time. Thus, the contrast of vaccine efficacy at early and late times can not be interpreted as a causal effect. We propose to quantify vaccine waning using the challenge effect, which is a contrast of outcomes under controlled exposures to the infectious agent following vaccination. We identify sharp bounds on the challenge effect under non-parametric assumptions that are broadly applicable in vaccine trials using routinely collected data. We demonstrate that the challenge effect can differ substantially from the conventional vaccine efficacy due to depletion of susceptible individuals from the risk set over time. Finally, we apply the methods to derive bounds on the waning of the BNT162b2 COVID-19 vaccine using data from a placebo-controlled randomized trial. Our estimates of the challenge effect suggest waning protection after 2 months beyond administration of the second vaccine dose."}, "https://arxiv.org/abs/2405.01372": {"title": "Statistical algorithms for low-frequency diffusion data: A PDE approach", "link": "https://arxiv.org/abs/2405.01372", "description": "arXiv:2405.01372v1 Announce Type: new \nAbstract: We consider the problem of making nonparametric inference in multi-dimensional diffusion models from low-frequency data. Statistical analysis in this setting is notoriously challenging due to the intractability of the likelihood and its gradient, and computational methods have thus far largely resorted to expensive simulation-based techniques. In this article, we propose a new computational approach which is motivated by PDE theory and is built around the characterisation of the transition densities as solutions of the associated heat (Fokker-Planck) equation. Employing optimal regularity results from the theory of parabolic PDEs, we prove a novel characterisation for the gradient of the likelihood. Using these developments, for the nonlinear inverse problem of recovering the diffusivity (in divergence form models), we then show that the numerical evaluation of the likelihood and its gradient can be reduced to standard elliptic eigenvalue problems, solvable by powerful finite element methods. 
This enables the efficient implementation of a large class of statistical algorithms, including (i) preconditioned Crank-Nicolson and Langevin-type methods for posterior sampling, and (ii) gradient-based descent optimisation schemes to compute maximum likelihood and maximum-a-posteriori estimates. We showcase the effectiveness of these methods via extensive simulation studies in a nonparametric Bayesian model with Gaussian process priors. Interestingly, the optimisation schemes provided satisfactory numerical recovery while exhibiting rapid convergence towards stationary points despite the problem nonlinearity; thus our approach may lead to significant computational speed-ups. The reproducible code is available online at https://github.com/MattGiord/LF-Diffusion."}, "https://arxiv.org/abs/2405.01463": {"title": "Dynamic Local Average Treatment Effects", "link": "https://arxiv.org/abs/2405.01463", "description": "arXiv:2405.01463v1 Announce Type: new \nAbstract: We consider Dynamic Treatment Regimes (DTRs) with one sided non-compliance that arise in applications such as digital recommendations and adaptive medical trials. These are settings where decision makers encourage individuals to take treatments over time, but adapt encouragements based on previous encouragements, treatments, states, and outcomes. Importantly, individuals may choose to (not) comply with a treatment recommendation, whenever it is made available to them, based on unobserved confounding factors. We provide non-parametric identification, estimation, and inference for Dynamic Local Average Treatment Effects, which are expected values of multi-period treatment contrasts among appropriately defined complier subpopulations. Under standard assumptions in the Instrumental Variable and DTR literature, we show that one can identify local average effects of contrasts that correspond to offering treatment at any single time step. Under an additional cross-period effect-compliance independence assumption, which is satisfied in Staggered Adoption settings and a generalization of them, which we define as Staggered Compliance settings, we identify local average treatment effects of treating in multiple time periods."}, "https://arxiv.org/abs/2405.00727": {"title": "Generalised envelope spectrum-based signal-to-noise objectives: Formulation, optimisation and application for gear fault detection under time-varying speed conditions", "link": "https://arxiv.org/abs/2405.00727", "description": "arXiv:2405.00727v1 Announce Type: cross \nAbstract: In vibration-based condition monitoring, optimal filter design improves fault detection by enhancing weak fault signatures within vibration signals. This process involves optimising a derived objective function from a defined objective. The objectives are often based on proxy health indicators to determine the filter's parameters. However, these indicators can be compromised by irrelevant extraneous signal components and fluctuating operational conditions, affecting the filter's efficacy. Fault detection primarily uses the fault component's prominence in the squared envelope spectrum, quantified by a squared envelope spectrum-based signal-to-noise ratio. New optimal filter objective functions are derived from the proposed generalised envelope spectrum-based signal-to-noise objective for machines operating under variable speed conditions. 
Instead of optimising proxy health indicators, the optimal filter coefficients of the formulation directly maximise the squared envelope spectrum-based signal-to-noise ratio over targeted frequency bands using standard gradient-based optimisers. Four derived objective functions from the proposed objective effectively outperform five prominent methods in tests on three experimental datasets."}, "https://arxiv.org/abs/2405.00910": {"title": "De-Biasing Models of Biased Decisions: A Comparison of Methods Using Mortgage Application Data", "link": "https://arxiv.org/abs/2405.00910", "description": "arXiv:2405.00910v1 Announce Type: cross \nAbstract: Prediction models can improve efficiency by automating decisions such as the approval of loan applications. However, they may inherit bias against protected groups from the data they are trained on. This paper adds counterfactual (simulated) ethnic bias to real data on mortgage application decisions, and shows that this bias is replicated by a machine learning model (XGBoost) even when ethnicity is not used as a predictive variable. Next, several other de-biasing methods are compared: averaging over prohibited variables, taking the most favorable prediction over prohibited variables (a novel method), and jointly minimizing errors as well as the association between predictions and prohibited variables. De-biasing can recover some of the original decisions, but the results are sensitive to whether the bias is effected through a proxy."}, "https://arxiv.org/abs/2405.01404": {"title": "Random Pareto front surfaces", "link": "https://arxiv.org/abs/2405.01404", "description": "arXiv:2405.01404v1 Announce Type: cross \nAbstract: The Pareto front of a set of vectors is the subset which is comprised solely of all of the best trade-off points. By interpolating this subset, we obtain the optimal trade-off surface. In this work, we prove a very useful result which states that all Pareto front surfaces can be explicitly parametrised using polar coordinates. In particular, our polar parametrisation result tells us that we can fully characterise any Pareto front surface using the length function, which is a scalar-valued function that returns the projected length along any positive radial direction. Consequently, by exploiting this representation, we show how it is possible to generalise many useful concepts from linear algebra, probability and statistics, and decision theory to function over the space of Pareto front surfaces. Notably, we focus our attention on the stochastic setting where the Pareto front surface itself is a stochastic process. Among other things, we showcase how it is possible to define and estimate many statistical quantities of interest such as the expectation, covariance and quantile of any Pareto front surface distribution. As a motivating example, we investigate how these statistics can be used within a design of experiments setting, where the goal is to both infer and use the Pareto front surface distribution in order to make effective decisions. Besides this, we also illustrate how these Pareto front ideas can be used within the context of extreme value theory. 
Finally, as a numerical example, we applied some of our new methodology on a real-world air pollution data set."}, "https://arxiv.org/abs/2405.01450": {"title": "A mixed effects cosinor modelling framework for circadian gene expression", "link": "https://arxiv.org/abs/2405.01450", "description": "arXiv:2405.01450v1 Announce Type: cross \nAbstract: The cosinor model is frequently used to represent gene expression given the 24 hour day-night cycle time at which a corresponding tissue sample is collected. However, the timing of many biological processes are based on individual-specific internal timing systems that are offset relative to day-night cycle time. When these offsets are unknown, they pose a challenge in performing statistical analyses with a cosinor model. To clarify, when sample collection times are mis-recorded, cosinor regression can yield attenuated parameter estimates, which would also attenuate test statistics. This attenuation bias would inflate type II error rates in identifying genes with oscillatory behavior. This paper proposes a heuristic method to account for unknown offsets when tissue samples are collected in a longitudinal design. Specifically, this method involves first estimating individual-specific cosinor models for each gene. The times of sample collection for that individual are then translated based on the estimated phase-shifts across every gene. Simulation studies confirm that this method mitigates bias in estimation and inference. Illustrations with real data from three circadian biology studies highlight that this method produces parameter estimates and inferences akin to those obtained when each individual's offset is known."}, "https://arxiv.org/abs/2405.01484": {"title": "Designing Algorithmic Recommendations to Achieve Human-AI Complementarity", "link": "https://arxiv.org/abs/2405.01484", "description": "arXiv:2405.01484v1 Announce Type: cross \nAbstract: Algorithms frequently assist, rather than replace, human decision-makers. However, the design and analysis of algorithms often focus on predicting outcomes and do not explicitly model their effect on human decisions. This discrepancy between the design and role of algorithmic assistants becomes of particular concern in light of empirical evidence that suggests that algorithmic assistants again and again fail to improve human decisions. In this article, we formalize the design of recommendation algorithms that assist human decision-makers without making restrictive ex-ante assumptions about how recommendations affect decisions. We formulate an algorithmic-design problem that leverages the potential-outcomes framework from causal inference to model the effect of recommendations on a human decision-maker's binary treatment choice. Within this model, we introduce a monotonicity assumption that leads to an intuitive classification of human responses to the algorithm. Under this monotonicity assumption, we can express the human's response to algorithmic recommendations in terms of their compliance with the algorithm and the decision they would take if the algorithm sends no recommendation. We showcase the utility of our framework using an online experiment that simulates a hiring task. 
We argue that our approach explains the relative performance of different recommendation algorithms in the experiment, and can help design solutions that realize human-AI complementarity."}, "https://arxiv.org/abs/2305.05281": {"title": "Causal Discovery via Conditional Independence Testing with Proxy Variables", "link": "https://arxiv.org/abs/2305.05281", "description": "arXiv:2305.05281v3 Announce Type: replace \nAbstract: Distinguishing causal connections from correlations is important in many scenarios. However, the presence of unobserved variables, such as the latent confounder, can introduce bias in conditional independence testing commonly employed in constraint-based causal discovery for identifying causal relations. To address this issue, existing methods introduced proxy variables to adjust for the bias caused by unobserveness. However, these methods were either limited to categorical variables or relied on strong parametric assumptions for identification. In this paper, we propose a novel hypothesis-testing procedure that can effectively examine the existence of the causal relationship over continuous variables, without any parametric constraint. Our procedure is based on discretization, which under completeness conditions, is able to asymptotically establish a linear equation whose coefficient vector is identifiable under the causal null hypothesis. Based on this, we introduce our test statistic and demonstrate its asymptotic level and power. We validate the effectiveness of our procedure using both synthetic and real-world data."}, "https://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "https://arxiv.org/abs/2305.10817", "description": "arXiv:2305.10817v4 Announce Type: replace \nAbstract: We introduce an approach which allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables. This framework makes causality detection possible even between high-dimensional systems where only few of the variables are known or measured. Benchmark tests on coupled chaotic dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings. We also show that the method can be used to robustly detect causality in human electroencephalography data."}, "https://arxiv.org/abs/2401.03990": {"title": "Identification with possibly invalid IVs", "link": "https://arxiv.org/abs/2401.03990", "description": "arXiv:2401.03990v2 Announce Type: replace \nAbstract: This paper proposes a novel identification strategy relying on quasi-instrumental variables (quasi-IVs). A quasi-IV is a relevant but possibly invalid IV because it is not exogenous or not excluded. 
We show that a variety of models with discrete or continuous endogenous treatment which are usually identified with an IV - quantile models with rank invariance, additive models with homogenous treatment effects, and local average treatment effect models - can be identified under the joint relevance of two complementary quasi-IVs instead. To achieve identification, we complement one excluded but possibly endogenous quasi-IV (e.g., \"relevant proxies\" such as lagged treatment choice) with one exogenous (conditional on the excluded quasi-IV) but possibly included quasi-IV (e.g., random assignment or exogenous market shocks). Our approach also holds if any of the two quasi-IVs turns out to be a valid IV. In practice, being able to address endogeneity with complementary quasi-IVs instead of IVs is convenient since there are many applications where quasi-IVs are more readily available. Difference-in-differences is a notable example: time is an exogenous quasi-IV while the group assignment acts as a complementary excluded quasi-IV."}, "https://arxiv.org/abs/2303.11786": {"title": "Skeleton Regression: A Graph-Based Approach to Estimation with Manifold Structure", "link": "https://arxiv.org/abs/2303.11786", "description": "arXiv:2303.11786v2 Announce Type: replace-cross \nAbstract: We introduce a new regression framework designed to deal with large-scale, complex data that lies around a low-dimensional manifold with noises. Our approach first constructs a graph representation, referred to as the skeleton, to capture the underlying geometric structure. We then define metrics on the skeleton graph and apply nonparametric regression techniques, along with feature transformations based on the graph, to estimate the regression function. We also discuss the limitations of some nonparametric regressors with respect to the general metric space such as the skeleton graph. The proposed regression framework suggests a novel way to deal with data with underlying geometric structures and provides additional advantages in handling the union of multiple manifolds, additive noises, and noisy observations. We provide statistical guarantees for the proposed method and demonstrate its effectiveness through simulations and real data examples."}, "https://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "https://arxiv.org/abs/2310.05921", "description": "arXiv:2310.05921v3 Announce Type: replace-cross \nAbstract: We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing to exhibit high throughput and low error, to the choice of trusting a nominal policy versus switching to a safe backup policy at run-time. The decisions produced by our algorithms are safe in the sense that they come with provable statistical guarantees of having low risk without any assumptions on the world model whatsoever; the observations need not be I.I.D. and can even be adversarial. The theory extends results from conformal prediction to calibrate decisions directly, without requiring the construction of prediction sets. 
Experiments demonstrate the utility of our approach in robot motion planning around humans, automated stock trading, and robot manufacturing."}, "https://arxiv.org/abs/2405.01645": {"title": "Synthetic Controls with spillover effects: A comparative study", "link": "https://arxiv.org/abs/2405.01645", "description": "arXiv:2405.01645v1 Announce Type: new \nAbstract: This study introduces the Iterative Synthetic Control Method, a modification of the Synthetic Control Method (SCM) designed to improve its predictive performance by utilizing control units affected by the treatment in question. This method is then compared to other SCM modifications: SCM without any modifications, SCM after removing all spillover-affected units, Inclusive SCM, and the SP SCM model. For the comparison, Monte Carlo simulations are utilized, generating artificial datasets with known counterfactuals and comparing the predictive performance of the methods. Generally, the Inclusive SCM performed best in all settings and is relatively simple to implement. The Iterative SCM, introduced in this paper, was a close second, with only a small difference in performance and a simpler implementation."}, "https://arxiv.org/abs/2405.01651": {"title": "Confidence regions for a persistence diagram of a single image with one or more loops", "link": "https://arxiv.org/abs/2405.01651", "description": "arXiv:2405.01651v1 Announce Type: new \nAbstract: Topological data analysis (TDA) uses persistent homology to quantify loops and higher-dimensional holes in data, making it particularly relevant for examining the characteristics of images of cells in the field of cell biology. In the context of a cell injury, as time progresses, a wound in the form of a ring emerges in the cell image and then gradually vanishes. Performing statistical inference on this ring-like pattern in a single image is challenging due to the absence of repeated samples. In this paper, we develop a novel framework leveraging TDA to estimate underlying structures within individual images and quantify associated uncertainties through confidence regions. Our proposed method partitions the image into the background and the damaged cell regions. Then pixels within the affected cell region are used to establish confidence regions in the space of persistence diagrams (topological summary statistics). The method establishes estimates on the persistence diagrams which correct the bias of traditional TDA approaches. A simulation study is conducted to evaluate the coverage probabilities of the proposed confidence regions in comparison to an alternative approach proposed in this paper. We also illustrate our methodology with a real-world example from cell repair."}, "https://arxiv.org/abs/2405.01709": {"title": "Minimax Regret Learning for Data with Heterogeneous Subgroups", "link": "https://arxiv.org/abs/2405.01709", "description": "arXiv:2405.01709v1 Announce Type: new \nAbstract: Modern complex datasets often consist of various sub-populations. To develop robust and generalizable methods in the presence of sub-population heterogeneity, it is important to guarantee a uniform learning performance instead of an average one. In many applications, prior information is often available on which sub-population or group the data points belong to. Given the observed groups of data, we develop a min-max-regret (MMR) learning framework for general supervised learning, which aims to minimize the worst-group regret.
Motivated from the regret-based decision theoretic framework, the proposed MMR is distinguished from the value-based or risk-based robust learning methods in the existing literature. The regret criterion features several robustness and invariance properties simultaneously. In terms of generalizability, we develop the theoretical guarantee for the worst-case regret over a super-population of the meta data, which incorporates the observed sub-populations, their mixtures, as well as other unseen sub-populations that could be approximated by the observed ones. We demonstrate the effectiveness of our method through extensive simulation studies and an application to kidney transplantation data from hundreds of transplant centers."}, "https://arxiv.org/abs/2405.01913": {"title": "Unleashing the Power of AI: Transforming Marketing Decision-Making in Heavy Machinery with Machine Learning, Radar Chart Simulation, and Markov Chain Analysis", "link": "https://arxiv.org/abs/2405.01913", "description": "arXiv:2405.01913v1 Announce Type: new \nAbstract: This pioneering research introduces a novel approach for decision-makers in the heavy machinery industry, specifically focusing on production management. The study integrates machine learning techniques like Ridge Regression, Markov chain analysis, and radar charts to optimize North American Crawler Cranes market production processes. Ridge Regression enables growth pattern identification and performance assessment, facilitating comparisons and addressing industry challenges. Markov chain analysis evaluates risk factors, aiding in informed decision-making and risk management. Radar charts simulate benchmark product designs, enabling data-driven decisions for production optimization. This interdisciplinary approach equips decision-makers with transformative insights, enhancing competitiveness in the heavy machinery industry and beyond. By leveraging these techniques, companies can revolutionize their production management strategies, driving success in diverse markets."}, "https://arxiv.org/abs/2405.02087": {"title": "Testing for an Explosive Bubble using High-Frequency Volatility", "link": "https://arxiv.org/abs/2405.02087", "description": "arXiv:2405.02087v1 Announce Type: new \nAbstract: Based on a continuous-time stochastic volatility model with a linear drift, we develop a test for explosive behavior in financial asset prices at a low frequency when prices are sampled at a higher frequency. The test exploits the volatility information in the high-frequency data. The method consists of devolatizing log-asset price increments with realized volatility measures and performing a supremum-type recursive Dickey-Fuller test on the devolatized sample. The proposed test has a nuisance-parameter-free asymptotic distribution and is easy to implement. We study the size and power properties of the test in Monte Carlo simulations. A real-time date-stamping strategy based on the devolatized sample is proposed for the origination and conclusion dates of the explosive regime. Conditions under which the real-time date-stamping strategy is consistent are established. 
The test and the date-stamping strategy are applied to study explosive behavior in cryptocurrency and stock markets."}, "https://arxiv.org/abs/2405.02217": {"title": "Identifying and exploiting alpha in linear asset pricing models with strong, semi-strong, and latent factors", "link": "https://arxiv.org/abs/2405.02217", "description": "arXiv:2405.02217v1 Announce Type: new \nAbstract: The risk premia of traded factors are the sum of factor means and a parameter vector, denoted by phi, which is identified from the cross-section regression of the alphas of individual securities on the vector of factor loadings. If phi is non-zero, one can construct \"phi-portfolios\" which exploit the systematic components of non-zero alpha. We show that, for known values of betas and when phi is non-zero, there exist phi-portfolios that dominate mean-variance portfolios. The paper then proposes a two-step bias corrected estimator of phi and derives its asymptotic distribution allowing for idiosyncratic pricing errors, weak missing factors, and weak error cross-sectional dependence. Small sample results from extensive Monte Carlo experiments show that the proposed estimator has the correct size with good power properties. The paper also provides an empirical application to a large number of U.S. securities, with risk factors selected from a large pool of potential risk factors according to their strength, constructs phi-portfolios, and compares their Sharpe ratios to those of mean-variance and S&P 500 portfolios."}, "https://arxiv.org/abs/2405.02231": {"title": "Efficient spline orthogonal basis for representation of density functions", "link": "https://arxiv.org/abs/2405.02231", "description": "arXiv:2405.02231v1 Announce Type: new \nAbstract: Probability density functions form a specific class of functional data objects with intrinsic properties of scale invariance and relative scale characterized by the unit integral constraint. The Bayes spaces methodology respects their specific nature, and the centred log-ratio transformation enables processing such functional data in the standard Lebesgue space of square-integrable functions. As the data representing densities are frequently observed in their discrete form, the focus has been on their spline representation. Therefore, the crucial step in the approximation is to construct a proper spline basis reflecting their specific properties. Since the centred log-ratio transformation forms a subspace of functions with a zero integral constraint, the standard $B$-spline basis is no longer suitable. Recently, a new spline basis incorporating this zero integral property, called $Z\!B$-splines, was developed. However, this basis does not possess the orthogonality property, which is beneficial from a computational and applied point of view. In this paper, we describe an efficient method for constructing an orthogonal $Z\!B$-spline basis, called $Z\!B$-splinets. The main advantages of the $Z\!B$-splinet approach are computational efficiency and locality of basis supports, which are desirable for data interpretability, e.g. in the context of functional principal component analysis.
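The construction of the orthogonal basis is not spelled out in the abstract; as a generic illustration of how a zero-integral basis can be orthogonalized with respect to the L2 inner product (plain Gram-Schmidt on a discretized grid, not the paper's ZB-splinet algorithm), consider the sketch below.

```python
import numpy as np

def l2_inner(f, g, t):
    """Numerical L2 inner product of two functions sampled on a common grid t."""
    return np.trapz(f * g, t)

def gram_schmidt(basis, t):
    """Orthogonalize discretized basis functions (rows) w.r.t. the L2 inner product.
    Linear combinations of zero-integral inputs keep the zero-integral property."""
    ortho = []
    for f in basis:
        v = f.copy()
        for u in ortho:
            v -= l2_inner(v, u, t) / l2_inner(u, u, t) * u
        ortho.append(v)
    return np.array(ortho)

# toy zero-integral basis on [0, 1]: centred monomials
t = np.linspace(0.0, 1.0, 501)
raw = np.array([t**k - np.trapz(t**k, t) for k in range(1, 4)])
Z = gram_schmidt(raw, t)
print(np.round([[l2_inner(a, b, t) for b in Z] for a in Z], 6))  # ~diagonal Gram matrix
```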
The proposed approach is demonstrated on an empirical demographic dataset."}, "https://arxiv.org/abs/2405.01598": {"title": "Predictive Decision Synthesis for Portfolios: Betting on Better Models", "link": "https://arxiv.org/abs/2405.01598", "description": "arXiv:2405.01598v1 Announce Type: cross \nAbstract: We discuss and develop Bayesian dynamic modelling and predictive decision synthesis for portfolio analysis. The context involves model uncertainty with a set of candidate models for financial time series with main foci in sequential learning, forecasting, and recursive decisions for portfolio reinvestments. The foundational perspective of Bayesian predictive decision synthesis (BPDS) defines novel, operational analysis and resulting predictive and decision outcomes. A detailed case study of BPDS in financial forecasting of international exchange rate time series and portfolio rebalancing, with resulting BPDS-based decision outcomes compared to traditional Bayesian analysis, exemplifies and highlights the practical advances achievable under the expanded, subjective Bayesian approach that BPDS defines."}, "https://arxiv.org/abs/2405.01611": {"title": "Unifying and extending Precision Recall metrics for assessing generative models", "link": "https://arxiv.org/abs/2405.01611", "description": "arXiv:2405.01611v1 Announce Type: cross \nAbstract: With the recent success of generative models in image and text, the evaluation of generative models has gained a lot of attention. Whereas most generative models are compared in terms of scalar values such as Frechet Inception Distance (FID) or Inception Score (IS), Sajjadi et al. (2018) proposed a definition of a precision-recall curve to characterize the closeness of two distributions. Since then, various approaches to precision and recall have emerged (Kynkaanniemi et al., 2019; Naeem et al., 2020; Park & Kim, 2023). They center their attention on the extreme values of precision and recall, but beyond this, the connections between them remain elusive. In this paper, we unify most of these approaches under the same umbrella, relying on the work of Simon et al. (2019). In doing so, we are able not only to recover entire curves, but also to expose the sources of the documented pitfalls of the metrics concerned. We also provide consistency results that go well beyond the ones presented in the corresponding literature. Last, we study the different behaviors of the curves obtained experimentally."}, "https://arxiv.org/abs/2405.01744": {"title": "ALCM: Autonomous LLM-Augmented Causal Discovery Framework", "link": "https://arxiv.org/abs/2405.01744", "description": "arXiv:2405.01744v1 Announce Type: cross \nAbstract: To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicating their emergent capabilities and widespread applicability in facilitating causal reasoning across diverse domains, such as medicine, finance, and science. The expansive knowledge base of LLMs holds the potential to elevate the field of causal reasoning by offering interpretability, supporting inference, improving generalizability, and uncovering novel causal structures.
In this paper, we introduce a new framework, named Autonomous LLM-Augmented Causal Discovery Framework (ALCM), to synergize data-driven causal discovery algorithms and LLMs, automating the generation of a more resilient, accurate, and explicable causal graph. The ALCM consists of three integral components: causal structure learning, causal wrapper, and LLM-driven causal refiner. These components autonomously collaborate within a dynamic environment to address causal discovery questions and deliver plausible causal graphs. We evaluate the ALCM framework by implementing two demonstrations on seven well-known datasets. Experimental results demonstrate that ALCM outperforms existing LLM methods and conventional data-driven causal reasoning mechanisms. This study not only shows the effectiveness of the ALCM but also underscores new research directions in leveraging the causal reasoning capabilities of LLMs."}, "https://arxiv.org/abs/2405.02225": {"title": "Fair Risk Control: A Generalized Framework for Calibrating Multi-group Fairness Risks", "link": "https://arxiv.org/abs/2405.02225", "description": "arXiv:2405.02225v1 Announce Type: cross \nAbstract: This paper introduces a framework for post-processing machine learning models so that their predictions satisfy multi-group fairness guarantees. Based on the celebrated notion of multicalibration, we introduce $(\\mathbf{s},\\mathcal{G}, \\alpha)-$GMC (Generalized Multi-Dimensional Multicalibration) for multi-dimensional mappings $\\mathbf{s}$, constraint set $\\mathcal{G}$, and a pre-specified threshold level $\\alpha$. We propose associated algorithms to achieve this notion in general settings. This framework is then applied to diverse scenarios encompassing different fairness concerns, including false negative rate control in image segmentation, prediction set conditional uncertainty quantification in hierarchical classification, and de-biased text generation in language models. We conduct numerical studies on several datasets and tasks."}, "https://arxiv.org/abs/2112.05274": {"title": "Handling missing data when estimating causal effects with Targeted Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2112.05274", "description": "arXiv:2112.05274v4 Announce Type: replace \nAbstract: Targeted Maximum Likelihood Estimation (TMLE) is increasingly used for doubly robust causal inference, but how missing data should be handled when using TMLE with data-adaptive approaches is unclear. Based on the Victorian Adolescent Health Cohort Study, we conducted a simulation study to evaluate eight missing data methods in this context: complete-case analysis, extended TMLE incorporating outcome-missingness model, missing covariate missing indicator method, five multiple imputation (MI) approaches using parametric or machine-learning models. Six scenarios were considered, varying in exposure/outcome generation models (presence of confounder-confounder interactions) and missingness mechanisms (whether outcome influenced missingness in other variables and presence of interaction/non-linear terms in missingness models). Complete-case analysis and extended TMLE had small biases when outcome did not influence missingness in other variables. Parametric MI without interactions had large bias when exposure/outcome generation models included interactions. Parametric MI including interactions performed best in bias and variance reduction across all settings, except when missingness models included a non-linear term. 
When choosing a method to handle missing data in the context of TMLE, researchers must consider the missingness mechanism and, for MI, compatibility with the analysis method. In many settings, a parametric MI approach that incorporates interactions and non-linearities is expected to perform well."}, "https://arxiv.org/abs/2212.12539": {"title": "Stable Distillation and High-Dimensional Hypothesis Testing", "link": "https://arxiv.org/abs/2212.12539", "description": "arXiv:2212.12539v2 Announce Type: replace \nAbstract: While powerful methods have been developed for high-dimensional hypothesis testing assuming orthogonal parameters, current approaches struggle to generalize to the more common non-orthogonal case. We propose Stable Distillation (SD), a simple paradigm for iteratively extracting independent pieces of information from observed data, assuming a parametric model. When applied to hypothesis testing for large regression models, SD orthogonalizes the effect estimates of non-orthogonal predictors by judiciously introducing noise into the observed outcomes vector, yielding mutually independent p-values across predictors. Simulations and a real regression example using US campaign contributions show that SD yields a scalable approach for non-orthogonal designs that exceeds or matches the power of existing methods against sparse alternatives. While we only present explicit SD algorithms for hypothesis testing in ordinary least squares and logistic regression, we provide general guidance for deriving and improving the power of SD procedures."}, "https://arxiv.org/abs/2307.14282": {"title": "Causal Effects in Matching Mechanisms with Strategically Reported Preferences", "link": "https://arxiv.org/abs/2307.14282", "description": "arXiv:2307.14282v2 Announce Type: replace \nAbstract: A growing number of central authorities use assignment mechanisms to allocate students to schools in a way that reflects student preferences and school priorities. However, most real-world mechanisms incentivize students to strategically misreport their preferences. In this paper, we provide an approach for identifying the causal effects of school assignment on future outcomes that accounts for strategic misreporting. Misreporting may invalidate existing point-identification approaches, and we derive sharp bounds for causal effects that are robust to strategic behavior. Our approach applies to any mechanism as long as there exist placement scores and cutoffs that characterize that mechanism's allocation rule. We use data from a deferred acceptance mechanism that assigns students to more than 1,000 university-major combinations in Chile. Matching theory predicts that students' behavior in Chile should be strategic because they can list only up to eight options, and we find empirical evidence consistent with such behavior. Our bounds are informative enough to reveal significant heterogeneity in graduation success with respect to preferences and school assignment."}, "https://arxiv.org/abs/2310.10271": {"title": "A geometric power analysis for general log-linear models", "link": "https://arxiv.org/abs/2310.10271", "description": "arXiv:2310.10271v2 Announce Type: replace \nAbstract: Log-linear models are widely used to express the association in multivariate frequency data on contingency tables. The paper focuses on the power analysis for testing the goodness-of-fit hypothesis for this model type. 
Conventionally, for the power-related sample size calculations a deviation from the null hypothesis (effect size) is specified by means of the chi-square goodness-of-fit index. It is argued that the odds ratio is a more natural measure of effect size, with the advantage of having a data-relevant interpretation. Therefore, a class of log-affine models that are specified by odds ratios whose values deviate from those of the null by a small amount can be chosen as an alternative. Being expressed as sets of constraints on odds ratios, both hypotheses are represented by smooth surfaces in the probability simplex, and thus, the power analysis can be given a geometric interpretation as well. A concept of geometric power is introduced and a Monte-Carlo algorithm for its estimation is proposed. The framework is applied to the power analysis of goodness-of-fit in the context of multinomial sampling. An iterative scaling procedure for generating distributions from a log-affine model is described and its convergence is proved. To illustrate, the geometric power analysis is carried out for data from a clinical study."}, "https://arxiv.org/abs/2305.11672": {"title": "Nonparametric classification with missing data", "link": "https://arxiv.org/abs/2305.11672", "description": "arXiv:2305.11672v2 Announce Type: replace-cross \nAbstract: We introduce a new nonparametric framework for classification problems in the presence of missing data. The key aspect of our framework is that the regression function decomposes into an anova-type sum of orthogonal functions, of which some (or even many) may be zero. Working under a general missingness setting, which allows features to be missing not at random, our main goal is to derive the minimax rate for the excess risk in this problem. In addition to the decomposition property, the rate depends on parameters that control the tail behaviour of the marginal feature distributions, the smoothness of the regression function and a margin condition. The ambient data dimension does not appear in the minimax rate, which can therefore be faster than in the classical nonparametric setting. We further propose a new method, called the Hard-thresholding Anova Missing data (HAM) classifier, based on a careful combination of a k-nearest neighbour algorithm and a thresholding step. The HAM classifier attains the minimax rate up to polylogarithmic factors and numerical experiments further illustrate its utility."}, "https://arxiv.org/abs/2307.02375": {"title": "Online Learning of Order Flow and Market Impact with Bayesian Change-Point Detection Methods", "link": "https://arxiv.org/abs/2307.02375", "description": "arXiv:2307.02375v2 Announce Type: replace-cross \nAbstract: Financial order flow exhibits a remarkable level of persistence, wherein buy (sell) trades are often followed by subsequent buy (sell) trades over extended periods. This persistence can be attributed to the division and gradual execution of large orders. Consequently, distinct order flow regimes might emerge, which can be identified through suitable time series models applied to market data. In this paper, we propose the use of Bayesian online change-point detection (BOCPD) methods to identify regime shifts in real-time and enable online predictions of order flow and market impact. To enhance the effectiveness of our approach, we have developed a novel BOCPD method using a score-driven approach. This method accommodates temporal correlations and time-varying parameters within each regime. 
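For context, the score-driven method extends the standard BOCPD recursion of Adams and MacKay (2007), which propagates a posterior over the run length r_t (the time since the last change point); written in its usual notation, and not specific to this paper, the recursion is:

```latex
% Standard BOCPD run-length recursion (Adams & MacKay, 2007), shown for background:
% r_t is the run length, p(r_t | r_{t-1}) the change-point (hazard) transition,
% and x_t^{(r)} the observations belonging to the current run.
p(r_t, x_{1:t}) \;=\; \sum_{r_{t-1}}
  \underbrace{p\!\left(x_t \mid r_{t-1}, x_t^{(r)}\right)}_{\text{predictive}}\;
  \underbrace{p\!\left(r_t \mid r_{t-1}\right)}_{\text{hazard}}\;
  p\!\left(r_{t-1}, x_{1:t-1}\right),
\qquad
p(r_t \mid x_{1:t}) \;=\; \frac{p(r_t, x_{1:t})}{\sum_{r_t} p(r_t, x_{1:t})}
```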
Through empirical application to NASDAQ data, we have found that: (i) Our newly proposed model demonstrates superior out-of-sample predictive performance compared to existing models that assume i.i.d. behavior within each regime; (ii) When examining the residuals, our model demonstrates good specification in terms of both distributional assumptions and temporal correlations; (iii) Within a given regime, the price dynamics exhibit a concave relationship with respect to time and volume, mirroring the characteristics of actual large orders; (iv) By incorporating regime information, our model produces more accurate online predictions of order flow and market impact compared to models that do not consider regimes."}, "https://arxiv.org/abs/2311.04037": {"title": "Causal Discovery Under Local Privacy", "link": "https://arxiv.org/abs/2311.04037", "description": "arXiv:2311.04037v3 Announce Type: replace-cross \nAbstract: Differential privacy is a widely adopted framework designed to safeguard the sensitive information of data providers within a data set. It is based on the application of controlled noise at the interface between the server that stores and processes the data, and the data consumers. Local differential privacy is a variant that allows data providers to apply the privatization mechanism themselves on their data individually. Therefore it provides protection also in contexts in which the server, or even the data collector, cannot be trusted. The introduction of noise, however, inevitably affects the utility of the data, particularly by distorting the correlations between individual data components. This distortion can prove detrimental to tasks such as causal discovery. In this paper, we consider various well-known locally differentially private mechanisms and compare the trade-off between the privacy they provide, and the accuracy of the causal structure produced by algorithms for causal learning when applied to data obfuscated by these mechanisms. Our analysis yields valuable insights for selecting appropriate local differentially private protocols for causal discovery tasks. We foresee that our findings will aid researchers and practitioners in conducting locally private causal discovery."}, "https://arxiv.org/abs/2405.02343": {"title": "Rejoinder on \"Marked spatial point processes: current state and extensions to point processes on linear networks\"", "link": "https://arxiv.org/abs/2405.02343", "description": "arXiv:2405.02343v1 Announce Type: new \nAbstract: We are grateful to all discussants for their invaluable comments, suggestions, questions, and contributions to our article. We have attentively reviewed all discussions with keen interest. In this rejoinder, our objective is to address and engage with all points raised by the discussants in a comprehensive and considerate manner. Consistently, we identify the discussants, in alphabetical order, as follows: CJK for Cronie, Jansson, and Konstantinou, DS for Stoyan, GP for Grabarnik and Pommerening, MRS for Myllym\\\"aki, Rajala, and S\\\"arkk\\\"a, and MCvL for van Lieshout throughout this rejoinder."}, "https://arxiv.org/abs/2405.02480": {"title": "A Network Simulation of OTC Markets with Multiple Agents", "link": "https://arxiv.org/abs/2405.02480", "description": "arXiv:2405.02480v1 Announce Type: new \nAbstract: We present a novel agent-based approach to simulating an over-the-counter (OTC) financial market in which trades are intermediated solely by market makers and agent visibility is constrained to a network topology. 
Dynamics, such as changes in price, result from agent-level interactions that ubiquitously occur via market maker agents acting as liquidity providers. Two additional agents are considered: trend investors use a deep convolutional neural network paired with a deep Q-learning framework to inform trading decisions by analysing price history; and value investors use a static price-target to determine their trade directions and sizes. We demonstrate that our novel inclusion of a network topology with market makers facilitates explorations into various market structures. First, we present the model and an overview of its mechanics. Second, we validate our findings via comparison to the real-world: we demonstrate a fat-tailed distribution of price changes, auto-correlated volatility, a skew negatively correlated to market maker positioning, predictable price-history patterns and more. Finally, we demonstrate that our network-based model can lend insights into the effect of market-structure on price-action. For example, we show that markets with sparsely connected intermediaries can have a critical point of fragmentation, beyond which the market forms distinct clusters and arbitrage becomes rapidly possible between the prices of different market makers. A discussion is provided on future work that would be beneficial."}, "https://arxiv.org/abs/2405.02529": {"title": "Chauhan Weighted Trajectory Analysis reduces sample size requirements and expedites time-to-efficacy signals in advanced cancer clinical trials", "link": "https://arxiv.org/abs/2405.02529", "description": "arXiv:2405.02529v1 Announce Type: new \nAbstract: As Kaplan-Meier (KM) analysis is limited to single unidirectional endpoints, most advanced cancer randomized clinical trials (RCTs) are powered for either progression free survival (PFS) or overall survival (OS). This discards efficacy information carried by partial responses, complete responses, and stable disease that frequently precede progressive disease and death. Chauhan Weighted Trajectory Analysis (CWTA) is a generalization of KM that simultaneously assesses multiple rank-ordered endpoints. We hypothesized that CWTA could use this efficacy information to reduce sample size requirements and expedite efficacy signals in advanced cancer trials. We performed 100-fold and 1000-fold simulations of solid tumour systemic therapy RCTs with health statuses rank ordered from complete response (Stage 0) to death (Stage 4). At increments of sample size and hazard ratio, we compared KM PFS and OS with CWTA for (i) sample size requirements to achieve a power of 0.8 and (ii) time-to-first significant efficacy signal. CWTA consistently demonstrated greater power, and reduced sample size requirements by 18% to 35% compared to KM PFS and 14% to 20% compared to KM OS. CWTA also expedited time-to-efficacy signals 2- to 6-fold. CWTA, by incorporating all efficacy signals in the cancer treatment trajectory, provides clinically relevant reduction in required sample size and meaningfully expedites the efficacy signals of cancer treatments compared to KM PFS and KM OS. 
Using CWTA rather than KM as the primary trial outcome has the potential to meaningfully reduce the number of patients, trial duration, and costs required to evaluate therapies in advanced cancer."}, "https://arxiv.org/abs/2405.02539": {"title": "Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models", "link": "https://arxiv.org/abs/2405.02539", "description": "arXiv:2405.02539v1 Announce Type: new \nAbstract: While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method exhibits similar superiority."}, "https://arxiv.org/abs/2405.02551": {"title": "Power-Enhanced Two-Sample Mean Tests for High-Dimensional Compositional Data with Application to Microbiome Data Analysis", "link": "https://arxiv.org/abs/2405.02551", "description": "arXiv:2405.02551v1 Announce Type: new \nAbstract: Testing differences in mean vectors is a fundamental task in the analysis of high-dimensional compositional data. Existing methods may suffer from low power if the underlying signal pattern does not favor the deployed test. In this work, we develop two-sample power-enhanced mean tests for high-dimensional compositional data based on the combination of $p$-values, which integrates strengths from two popular types of tests: the maximum-type test and the quadratic-type test. We provide rigorous theoretical guarantees on the proposed tests, showing accurate Type-I error rate control and enhanced testing power. Our method boosts the testing power towards a broader alternative space, which yields robust performance across a wide range of signal pattern settings. Our theory also contributes to the literature on power enhancement and Gaussian approximation for high-dimensional hypothesis testing. We demonstrate the performance of our method on both simulated data and real-world microbiome data, showing that our proposed approach improves the testing power substantially compared to existing methods."}, "https://arxiv.org/abs/2405.02666": {"title": "The Analysis of Criminal Recidivism: A Hierarchical Model-Based Approach for the Analysis of Zero-Inflated, Spatially Correlated recurrent events Data", "link": "https://arxiv.org/abs/2405.02666", "description": "arXiv:2405.02666v1 Announce Type: new \nAbstract: The life course perspective in criminology has become prominent in recent years, offering valuable insights into various patterns of criminal offending and pathways. The study of criminal trajectories aims to understand the beginning, persistence, and desistance in crime, providing intriguing explanations of these moments in life.
Central to this analysis is the identification of patterns in the frequency of criminal victimization and recidivism, along with the factors that contribute to them. Specifically, this work introduces a new class of models that overcome limitations in traditional methods used to analyze criminal recidivism. These models are designed for recurrent events data characterized by an excess of zeros and spatial correlation. They extend the Non-Homogeneous Poisson Process, incorporating spatial dependence in the model through random effects, enabling the analysis of associations among individuals within the same spatial stratum. To deal with the excess of zeros in the data, a zero-inflated Poisson mixed model was incorporated. In addition to parametric models following the Power Law process for baseline intensity functions, we propose flexible semi-parametric versions approximating the intensity function using Bernstein Polynomials. The Bayesian approach offers advantages such as incorporating external evidence and modeling specific correlations between random effects and observed data. The performance of these models was evaluated in a simulation study with various scenarios, and we applied them to analyze criminal recidivism data in the Metropolitan Region of Belo Horizonte, Brazil. The results provide a detailed analysis of high-risk areas for recurrent crimes and the behavior of recidivism rates over time. This research significantly enhances our understanding of criminal trajectories, paving the way for more effective strategies in combating criminal recidivism."}, "https://arxiv.org/abs/2405.02715": {"title": "Grouping predictors via network-wide metrics", "link": "https://arxiv.org/abs/2405.02715", "description": "arXiv:2405.02715v1 Announce Type: new \nAbstract: When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, the identification of groups of predictors significantly associated with the response variable eases further downstream analysis and decision-making. This paper proposes a new data analysis methodology that utilizes the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed to determine the correct group, with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability.
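The abstract does not define the implicit network or the network-wide metrics; the sketch below is only a loose, assumed illustration of the general idea (edge weights from absolute pairwise correlations, weighted degree as a network-wide metric, groups from connected components of the thresholded graph) and is not the paper's population model or grouping algorithm.

```python
import numpy as np
import networkx as nx

def predictor_network_groups(X, threshold=0.3):
    """Generic illustration: build a weighted predictor network from absolute
    pairwise correlations, compute a simple network-wide metric (weighted degree),
    and group predictors via connected components of the thresholded graph."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    G = nx.Graph()
    G.add_nodes_from(range(X.shape[1]))
    for j in range(X.shape[1]):
        for k in range(j + 1, X.shape[1]):
            if corr[j, k] >= threshold:
                G.add_edge(j, k, weight=corr[j, k])
    weighted_degree = dict(G.degree(weight="weight"))
    groups = [sorted(c) for c in nx.connected_components(G)]
    return groups, weighted_degree

# toy usage: two blocks of correlated predictors
rng = np.random.default_rng(5)
z1, z2 = rng.standard_normal((2, 300, 1))
X = np.hstack([z1 + 0.3 * rng.standard_normal((300, 3)),
               z2 + 0.3 * rng.standard_normal((300, 3))])
print(predictor_network_groups(X)[0])   # expect two groups of three predictors each
```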
The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data."}, "https://arxiv.org/abs/2405.02779": {"title": "Estimating Complier Average Causal Effects with Mixtures of Experts", "link": "https://arxiv.org/abs/2405.02779", "description": "arXiv:2405.02779v1 Announce Type: new \nAbstract: Understanding the causal impact of medical interventions is essential in healthcare research, especially through randomized controlled trials (RCTs). Despite their prominence, challenges arise due to discrepancies between treatment allocation and actual intake, influenced by various factors like patient non-adherence or procedural errors. This paper focuses on the Complier Average Causal Effect (CACE), crucial for evaluating treatment efficacy among compliant patients. Existing methodologies often rely on assumptions such as exclusion restriction and monotonicity, which can be problematic in practice. We propose a novel approach, leveraging supervised learning architectures, to estimate CACE without depending on these assumptions. Our method involves a two-step process: first estimating compliance probabilities for patients, then using these probabilities to estimate two nuisance components relevant to CACE calculation. Building upon the principal ignorability assumption, we introduce four root-n consistent, asymptotically normal, CACE estimators, and prove that the underlying mixtures of experts' nuisance components are identifiable. Our causal framework allows our estimation procedures to enjoy reduced mean squared errors when exclusion restriction or monotonicity assumptions hold. Through simulations and application to a breastfeeding promotion RCT, we demonstrate the method's performance and applicability."}, "https://arxiv.org/abs/2405.02871": {"title": "Modeling frequency distribution above a priority in presence of IBNR", "link": "https://arxiv.org/abs/2405.02871", "description": "arXiv:2405.02871v1 Announce Type: new \nAbstract: In reinsurance, Poisson and Negative binomial distributions are employed for modeling frequency. However, the incomplete data regarding reported incurred claims above a priority level presents challenges in estimation. This paper focuses on frequency estimation using Schnieper's framework for claim numbering. We demonstrate that Schnieper's model is consistent with a Poisson distribution for the total number of claims above a priority at each year of development, providing a robust basis for parameter estimation. Additionally, we explain how to build an alternative assumption based on a Negative binomial distribution, which yields similar results. The study includes a bootstrap procedure to manage uncertainty in parameter estimation and a case study comparing assumptions and evaluating the impact of the bootstrap approach."}, "https://arxiv.org/abs/2405.02905": {"title": "Mixture of partially linear experts", "link": "https://arxiv.org/abs/2405.02905", "description": "arXiv:2405.02905v1 Announce Type: new \nAbstract: In the mixture of experts model, a common assumption is the linearity between a response variable and covariates. While this assumption has theoretical and computational benefits, it may lead to suboptimal estimates by overlooking potential nonlinear relationships among the variables. To address this limitation, we propose a partially linear structure that incorporates unspecified functions to capture nonlinear relationships. 
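To make the phrase "partially linear structure" concrete, a schematic K-component mixture with Gaussian experts is written out below; the symbols (mixing weights pi_k, linear coefficients beta_k, unspecified smooth functions g_k, normal density phi) are generic placeholders assumed for illustration, not the paper's notation.

```latex
% Schematic mixture-of-partially-linear-experts density (generic notation):
% each expert is linear in x but depends on z only through an unspecified smooth g_k,
% and \phi(\,\cdot\,; \mu, \sigma^2) denotes the normal density.
f(y \mid x, z)
  \;=\;
  \sum_{k=1}^{K}
  \pi_k(x, z)\;
  \phi\!\left(y;\; x^{\top}\beta_k + g_k(z),\; \sigma_k^{2}\right),
\qquad
\sum_{k=1}^{K}\pi_k(x,z) = 1
```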
We establish the identifiability of the proposed model under mild conditions and introduce a practical estimation algorithm. We present the performance of our approach through numerical studies, including simulations and real data analysis."}, "https://arxiv.org/abs/2405.02983": {"title": "CVXSADes: a stochastic algorithm for constructing optimal exact regression designs with single or multiple objectives", "link": "https://arxiv.org/abs/2405.02983", "description": "arXiv:2405.02983v1 Announce Type: new \nAbstract: We propose an algorithm to construct optimal exact designs (EDs). Most of the work in the optimal regression design literature focuses on the approximate design (AD) paradigm due to its desired properties, including the optimality verification conditions derived by Kiefer (1959, 1974). ADs may have unbalanced weights, and practitioners may have difficulty implementing them with a designated run size $n$. Some EDs are constructed using rounding methods to get an integer number of runs at each support point of an AD, but this approach may not yield optimal results. To construct EDs, one may need to perform new combinatorial constructions for each $n$, and there is no unified approach to construct them. Therefore, we develop a systematic way to construct EDs for any given $n$. Our method can transform ADs into EDs while retaining high statistical efficiency in two steps. The first step involves constructing an AD by utilizing the convex nature of many design criteria. The second step employs a simulated annealing algorithm to search for the ED stochastically. Through several applications, we demonstrate the utility of our method for various design problems. Additionally, we show that the design efficiency approaches unity as the number of design points increases."}, "https://arxiv.org/abs/2405.03021": {"title": "Tuning parameter selection in econometrics", "link": "https://arxiv.org/abs/2405.03021", "description": "arXiv:2405.03021v1 Announce Type: new \nAbstract: I review some of the main methods for selecting tuning parameters in nonparametric and $\\ell_1$-penalized estimation. For the nonparametric estimation, I consider the methods of Mallows, Stein, Lepski, cross-validation, penalization, and aggregation in the context of series estimation. For the $\\ell_1$-penalized estimation, I consider the methods based on the theory of self-normalized moderate deviations, bootstrap, Stein's unbiased risk estimation, and cross-validation in the context of Lasso estimation. I explain the intuition behind each of the methods and discuss their comparative advantages. I also give some extensions."}, "https://arxiv.org/abs/2405.03041": {"title": "Bayesian Functional Graphical Models with Change-Point Detection", "link": "https://arxiv.org/abs/2405.03041", "description": "arXiv:2405.03041v1 Announce Type: new \nAbstract: Functional data analysis, which models data as realizations of random functions over a continuum, has emerged as a useful tool for time series data. Often, the goal is to infer the dynamic connections (or time-varying conditional dependencies) among multiple functions or time series. For this task, we propose a dynamic and Bayesian functional graphical model. Our modeling approach prioritizes the careful definition of an appropriate graph to identify both time-invariant and time-varying connectivity patterns. 
We introduce a novel block-structured sparsity prior paired with a finite basis expansion, which together yield effective shrinkage and graph selection with efficient computations via a Gibbs sampling algorithm. Crucially, the model includes (one or more) graph changepoints, which are learned jointly with all model parameters and incorporate graph dynamics. Simulation studies demonstrate excellent graph selection capabilities, with significant improvements over competing methods. We apply the proposed approach to the study of dynamic connectivity patterns of sea surface temperatures in the Pacific Ocean and discover meaningful edges."}, "https://arxiv.org/abs/2405.03042": {"title": "Functional Post-Clustering Selective Inference with Applications to EHR Data Analysis", "link": "https://arxiv.org/abs/2405.03042", "description": "arXiv:2405.03042v1 Announce Type: new \nAbstract: In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post-clustering data often leads to an inflated type-I error. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing data collected over time. Our method extends classical selective inference methods for cross-sectional data to longitudinal data. We provide theoretical guarantees for our approach with upper bounds on the selective type-I and type-II errors. We apply the method to simulated data and real-world Acute Kidney Injury (AKI) EHR datasets, thereby illustrating the advantages of our approach."}, "https://arxiv.org/abs/2405.03083": {"title": "Causal K-Means Clustering", "link": "https://arxiv.org/abs/2405.03083", "description": "arXiv:2405.03083v1 Announce Type: new \nAbstract: Causal effects are often characterized with population summaries. These might provide an incomplete picture when there are heterogeneous treatment effects across subgroups. Since the subgroup structure is typically unknown, it is more challenging to identify and evaluate subgroup effects than population effects. We propose a new solution to this problem: Causal k-Means Clustering, which harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure. Our problem differs significantly from the conventional clustering setup since the variables to be clustered are unknown counterfactual functions. We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms, and study its rate of convergence. We also develop a new bias-corrected estimator based on nonparametric efficiency theory and double machine learning, and show that this estimator achieves fast root-n rates and asymptotic normality in large nonparametric models. Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels. Further, our framework is extensible to clustering with generic pseudo-outcomes, such as partially observed outcomes or otherwise unknown functions.
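The plug-in estimator is described as readily implementable with off-the-shelf algorithms; the sketch below is one hedged reading of that recipe (regress outcomes on covariates within each treatment arm, form each unit's vector of predicted counterfactual means, and run k-means on those vectors) and omits the paper's bias-corrected, double-machine-learning estimator. The regression learner and the cluster count are arbitrary choices here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def causal_kmeans_plugin(X, treatment, y, n_clusters=3):
    """Plug-in variant: cluster units on their estimated counterfactual mean
    vectors (mu_a(X))_a over treatment levels a. Illustrative sketch only."""
    levels = np.unique(treatment)
    mu_hat = np.empty((X.shape[0], len(levels)))
    for j, a in enumerate(levels):
        mask = treatment == a
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X[mask], y[mask])          # outcome regression within arm a
        mu_hat[:, j] = model.predict(X)      # predicted counterfactual mean for every unit
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(mu_hat)
    return km.labels_, km.cluster_centers_

# toy usage
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))
A = rng.integers(0, 2, size=500)
y = X[:, 0] * A + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500)
labels, centers = causal_kmeans_plugin(X, A, y, n_clusters=2)
print(centers)
```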
Finally, we explore finite sample properties via simulation, and illustrate the proposed methods in a study of treatment programs for adolescent substance abuse."}, "https://arxiv.org/abs/2405.03096": {"title": "Exact Sampling of Spanning Trees via Fast-forwarded Random Walks", "link": "https://arxiv.org/abs/2405.03096", "description": "arXiv:2405.03096v1 Announce Type: new \nAbstract: Tree graphs are routinely used in statistics. When estimating a Bayesian model with a tree component, sampling the posterior remains a core difficulty. Existing Markov chain Monte Carlo methods tend to rely on local moves, often leading to poor mixing. A promising approach is to instead directly sample spanning trees on an auxiliary graph. Current spanning tree samplers, such as the celebrated Aldous--Broder algorithm, predominantly rely on simulating random walks that are required to visit all the nodes of the graph. Such algorithms are prone to getting stuck in certain sub-graphs. We formalize this phenomenon using the bottlenecks in the random walk's transition probability matrix. We then propose a novel fast-forwarded cover algorithm that can break free from bottlenecks. The core idea is a marginalization argument that leads to a closed-form expression which allows for fast-forwarding to the event of visiting a new node. Unlike many existing approximation algorithms, our algorithm yields exact samples. We demonstrate the enhanced efficiency of the fast-forwarded cover algorithm, and illustrate its application in fitting a Bayesian dendrogram model on a Massachusetts crimes and communities dataset."}, "https://arxiv.org/abs/2405.03225": {"title": "Consistent response prediction for multilayer networks on unknown manifolds", "link": "https://arxiv.org/abs/2405.03225", "description": "arXiv:2405.03225v1 Announce Type: new \nAbstract: Our paper deals with a collection of networks on a common set of nodes, where some of the networks are associated with responses. Assuming that the networks correspond to points on a one-dimensional manifold in a higher dimensional ambient space, we propose an algorithm to consistently predict the response at an unlabeled network. Our model involves a specific multiple random network model, namely the common subspace independent edge model, where the networks share a common invariant subspace, and the heterogeneity amongst the networks is captured by a set of low dimensional matrices. Our algorithm estimates these low dimensional matrices that capture the heterogeneity of the networks, learns the underlying manifold by isomap, and consistently predicts the response at an unlabeled network. We provide theoretical justifications for the use of our algorithm, validated by numerical simulations. Finally, we demonstrate the use of our algorithm on larval Drosophila connectome data."}, "https://arxiv.org/abs/2405.03603": {"title": "Copas-Heckman-type sensitivity analysis for publication bias in rare-event meta-analysis under the framework of the generalized linear mixed model", "link": "https://arxiv.org/abs/2405.03603", "description": "arXiv:2405.03603v1 Announce Type: new \nAbstract: Publication bias (PB) is one of the serious issues in meta-analysis. Many existing methods dealing with PB are based on the normal-normal (NN) random-effects model assuming normal models in both the within-study and the between-study levels. For rare-event meta-analysis where the data contain rare occurrences of event, the standard NN random-effects model may perform poorly. 
Instead, the generalized linear mixed effects model (GLMM) using the exact within-study model is recommended. However, no method has been proposed for dealing with PB in rare-event meta-analysis using the GLMM. In this paper, we propose sensitivity analysis methods for evaluating the impact of PB on the GLMM based on the famous Copas-Heckman-type selection model. The proposed methods can be easily implemented with standard software for fitting nonlinear mixed-effects models. We use a real-world example to show the usefulness of the proposed methods in evaluating the potential impact of PB in a meta-analysis of the log-transformed odds ratio based on the GLMM, using the non-central hypergeometric or binomial distribution as the within-study model. An extension of the proposed method is also introduced for evaluating PB in meta-analysis of proportions based on the GLMM with the binomial within-study model."}, "https://arxiv.org/abs/2405.03606": {"title": "Strang Splitting for Parametric Inference in Second-order Stochastic Differential Equations", "link": "https://arxiv.org/abs/2405.03606", "description": "arXiv:2405.03606v1 Announce Type: new \nAbstract: We address parameter estimation in second-order stochastic differential equations (SDEs), prevalent in physics, biology, and ecology. The second-order SDE is converted to a first-order system by introducing an auxiliary velocity variable, which raises two main challenges. First, the system is hypoelliptic since the noise affects only the velocity, making the Euler-Maruyama estimator ill-conditioned. To overcome this, we propose an estimator based on the Strang splitting scheme. Second, since the velocity is rarely observed, we adjust the estimator for partial observations. We present four estimators for complete and partial observations, using full likelihood or only velocity marginal likelihood. These estimators are intuitive, easy to implement, and computationally fast, and we prove their consistency and asymptotic normality. Our analysis demonstrates that using full likelihood with complete observations reduces the asymptotic variance of the diffusion estimator. With partial observations, the asymptotic variance increases due to information loss but remains unaffected by the likelihood choice. However, a numerical study on the Kramers oscillator reveals that using marginal likelihood for partial observations yields less biased estimators. We apply our approach to paleoclimate data from the Greenland ice core and fit it to the Kramers oscillator model, capturing transitions between metastable states reflecting observed climatic conditions during glacial eras."}, "https://arxiv.org/abs/2405.02475": {"title": "Generalizing Orthogonalization for Models with Non-linearities", "link": "https://arxiv.org/abs/2405.02475", "description": "arXiv:2405.02475v1 Announce Type: cross \nAbstract: The complexity of black-box algorithms can lead to various challenges, including the introduction of biases. These biases present immediate risks in the algorithms' application. It was, for instance, shown that neural networks can deduce racial information solely from a patient's X-ray scan, a task beyond the capability of medical experts. If this fact is not known to the medical expert, automatic decision-making based on this algorithm could lead to prescribing a treatment (purely) based on racial information.
While current methodologies allow for the \"orthogonalization\" or \"normalization\" of neural networks with respect to such information, existing approaches are grounded in linear models. Our paper advances the discourse by introducing corrections for non-linearities such as ReLU activations. Our approach also encompasses scalar and tensor-valued predictions, facilitating its integration into neural network architectures. Through extensive experiments, we validate our method's effectiveness in safeguarding sensitive data in generalized linear models, normalizing convolutional neural networks for metadata, and rectifying pre-existing embeddings for undesired attributes."}, "https://arxiv.org/abs/2405.03063": {"title": "Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection", "link": "https://arxiv.org/abs/2405.03063", "description": "arXiv:2405.03063v1 Announce Type: cross \nAbstract: Suppose that we first apply the Lasso to a design matrix, and then update one of its columns. In general, the signs of the Lasso coefficients may change, and there is no closed-form expression for updating the Lasso solution exactly. In this work, we propose an approximate formula for updating a debiased Lasso coefficient. We provide general nonasymptotic error bounds in terms of the norms and correlations of a given design matrix's columns, and then prove asymptotic convergence results for the case of a random design matrix with i.i.d.\\ sub-Gaussian row vectors and i.i.d.\\ Gaussian noise. Notably, the approximate formula is asymptotically correct for most coordinates in the proportional growth regime, under the mild assumption that each row of the design matrix is sub-Gaussian with a covariance matrix having a bounded condition number. Our proof only requires certain concentration and anti-concentration properties to control various error terms and the number of sign changes. In contrast, rigorously establishing distributional limit properties (e.g.\\ Gaussian limits for the debiased Lasso) under similarly general assumptions has been considered an open problem in universality theory. As applications, we show that the approximate formula allows us to reduce the computational complexity of variable selection algorithms that require solving multiple Lasso problems, such as the conditional randomization test and a variant of the knockoff filter."}, "https://arxiv.org/abs/2405.03579": {"title": "Some Statistical and Data Challenges When Building Early-Stage Digital Experimentation and Measurement Capabilities", "link": "https://arxiv.org/abs/2405.03579", "description": "arXiv:2405.03579v1 Announce Type: cross \nAbstract: Digital experimentation and measurement (DEM) capabilities -- the knowledge and tools necessary to run experiments with digital products, services, or experiences and measure their impact -- are fast becoming part of the standard toolkit of digital/data-driven organisations in guiding business decisions. Many large technology companies report having mature DEM capabilities, and several businesses have been established purely to manage experiments for others. Given the growing evidence that data-driven organisations tend to outperform their non-data-driven counterparts, there has never been a greater need for organisations to build/acquire DEM capabilities to thrive in the current digital era.\n This thesis presents several novel approaches to statistical and data challenges for organisations building DEM capabilities.
We focus on the fundamentals associated with building DEM capabilities, which lead to a richer understanding of the underlying assumptions and thus enable us to develop more appropriate capabilities. We address why one should engage in DEM by quantifying the benefits and risks of acquiring DEM capabilities. This is done using a ranking under lower uncertainty model, enabling one to construct a business case. We also examine what ingredients are necessary to run digital experiments. In addition to clarifying the existing literature around statistical tests, datasets, and methods in experimental design and causal inference, we construct an additional dataset and detailed case studies on applying state-of-the-art methods. Finally, we investigate when a digital experiment design would outperform another, leading to an evaluation framework that compares competing designs' data efficiency."}, "https://arxiv.org/abs/2302.12728": {"title": "Statistical Principles for Platform Trials", "link": "https://arxiv.org/abs/2302.12728", "description": "arXiv:2302.12728v2 Announce Type: replace \nAbstract: While within a clinical study there may be multiple doses and endpoints, across different studies each study will result in either an approval or a lack of approval of the drug compound studied. The False Approval Rate (FAR) is the proportion of drug compounds that lack efficacy incorrectly approved by regulators. (In the U.S., compounds that have efficacy and are approved are not involved in the FAR consideration, according to our reading of the relevant U.S. Congressional statute).\n While Tukey's (1953) Error Rate Familywise (ERFw) is meant to be applied within a clinical study, Tukey's (1953) Error Rate per Family (ERpF), defined alongside ERFw, is meant to be applied across studies. We show that controlling Error Rate Familywise (ERFw) within a clinical study at 5% in turn controls Error Rate per Family (ERpF) across studies at 5-per-100, regardless of whether the studies are correlated or not. Further, we show that ongoing regulatory practice, the additive multiplicity adjustment method of controlling ERpF, is controlling False Approval Rate FAR exactly (not conservatively) at 5-per-100 (even for Platform trials).\n In contrast, if a regulatory agency chooses to control the False Discovery Rate (FDR) across studies at 5% instead, then this change in policy from ERpF control to FDR control will result in incorrectly approving drug compounds that lack efficacy at a rate higher than 5-per-100, because in essence it gives the industry additional rewards for successfully developing compounds that have efficacy and are approved. It seems to us that the discussion of such a change in policy would be at a level higher than merely statistical, needing harmonisation/harmonization. (In the U.S., policy is set by the Congress.)"}, "https://arxiv.org/abs/2305.06262": {"title": "Flexible cost-penalized Bayesian model selection: developing inclusion paths with an application to diagnosis of heart disease", "link": "https://arxiv.org/abs/2305.06262", "description": "arXiv:2305.06262v3 Announce Type: replace \nAbstract: We propose a Bayesian model selection approach that allows medical practitioners to select among predictor variables while taking their respective costs into account. Medical procedures almost always incur costs in time and/or money. These costs might exceed their usefulness for modeling the outcome of interest.
We develop Bayesian model selection that uses flexible model priors to penalize costly predictors a priori and select a subset of predictors useful relative to their costs. Our approach (i) gives the practitioner control over the magnitude of cost penalization, (ii) enables the prior to scale well with sample size, and (iii) enables the creation of our proposed inclusion path visualization, which can be used to make decisions about individual candidate predictors using both probabilistic and visual tools. We demonstrate the effectiveness of our inclusion path approach and the importance of being able to adjust the magnitude of the prior's cost penalization through a dataset pertaining to heart disease diagnosis in patients at the Cleveland Clinic Foundation, where several candidate predictors with various costs were recorded for patients, and through simulated data."}, "https://arxiv.org/abs/2306.00296": {"title": "Inference in Predictive Quantile Regressions", "link": "https://arxiv.org/abs/2306.00296", "description": "arXiv:2306.00296v2 Announce Type: replace \nAbstract: This paper studies inference in predictive quantile regressions when the predictive regressor has a near-unit root. We derive asymptotic distributions for the quantile regression estimator and its heteroskedasticity and autocorrelation consistent (HAC) t-statistic in terms of functionals of Ornstein-Uhlenbeck processes. We then propose a switching-fully modified (FM) predictive test for quantile predictability. The proposed test employs an FM style correction with a Bonferroni bound for the local-to-unity parameter when the predictor has a near unit root. It switches to a standard predictive quantile regression test with a slightly conservative critical value when the largest root of the predictor lies in the stationary range. Simulations indicate that the test has a reliable size in small samples and good power. We employ this new methodology to test the ability of three commonly employed, highly persistent and endogenous lagged valuation regressors - the dividend price ratio, earnings price ratio, and book-to-market ratio - to predict the median, shoulders, and tails of the stock return distribution."}, "https://arxiv.org/abs/2306.14761": {"title": "Doubly ranked tests for grouped functional data", "link": "https://arxiv.org/abs/2306.14761", "description": "arXiv:2306.14761v2 Announce Type: replace \nAbstract: Nonparametric tests for functional data are a challenging class of tests to work with because of the potentially high dimensional nature of functional data. One of the main challenges for considering rank-based tests, like the Mann-Whitney or Wilcoxon Rank Sum tests (MWW), is that the unit of observation is a curve. Thus any rank-based test must consider ways of ranking curves. While several procedures, including depth-based methods, have recently been used to create scores for rank-based tests, these scores are not constructed under the null and often introduce additional, uncontrolled for variability. We therefore reconsider the problem of rank-based tests for functional data and develop an alternative approach that incorporates the null hypothesis throughout. Our approach first ranks realizations from the curves at each time point, then summarizes the ranks for each subject using a sufficient statistic we derive, and finally re-ranks the sufficient statistics in a procedure we refer to as a doubly ranked test. 
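A minimal sketch of the two-sample version of this procedure follows; note that the paper derives a specific sufficient statistic for each subject's ranks, and the per-subject mean rank used here is only an illustrative stand-in for it.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

def doubly_ranked_test(curves_a, curves_b):
    """Two-sample doubly ranked test sketch.
    curves_a, curves_b: arrays of shape (n_subjects, n_timepoints).
    Step 1: rank all curves at each time point.
    Step 2: summarize each subject's ranks (mean rank here, as a stand-in for
            the paper's sufficient statistic).
    Step 3: apply the Mann-Whitney U test to the re-ranked summaries."""
    combined = np.vstack([curves_a, curves_b])
    ranks = np.apply_along_axis(rankdata, 0, combined)   # rank across subjects, per time point
    summary = ranks.mean(axis=1)                          # one summary per subject
    n_a = curves_a.shape[0]
    return mannwhitneyu(summary[:n_a], summary[n_a:], alternative="two-sided")

# toy usage: group B shifted upward
rng = np.random.default_rng(3)
A = rng.standard_normal((20, 50))
B = rng.standard_normal((20, 50)) + 0.5
print(doubly_ranked_test(A, B))
```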
As we demonstrate, doubly ranked tests are more powerful while maintaining ideal type I error in the two-sample, MWW setting. We also extend our framework to more than two samples, developing a Kruskal-Wallis test for functional data which exhibits good test characteristics as well. Finally, we illustrate the use of doubly ranked tests in functional data contexts from material science, climatology, and public health policy."}, "https://arxiv.org/abs/2309.10017": {"title": "A Change-Point Approach to Estimating the Proportion of False Null Hypotheses in Multiple Testing", "link": "https://arxiv.org/abs/2309.10017", "description": "arXiv:2309.10017v2 Announce Type: replace \nAbstract: For estimating the proportion of false null hypotheses in multiple testing, a family of estimators by Storey (2002) is widely used in the applied and statistical literature, with many methods suggested for selecting the parameter $\\lambda$. Inspired by change-point concepts, our new approach to the latter problem first approximates the $p$-value plot with a piecewise linear function with a single change-point and then selects the $p$-value at the change-point location as $\\lambda$. Simulations show that our method has among the smallest RMSE across various settings, and we extend it to address the estimation in cases of superuniform $p$-values. We provide asymptotic theory for our estimator, relying on the theory of quantile processes. Additionally, we propose an application in the change-point literature and illustrate it using high-dimensional CNV data."}, "https://arxiv.org/abs/2312.07520": {"title": "Estimating Counterfactual Matrix Means with Short Panel Data", "link": "https://arxiv.org/abs/2312.07520", "description": "arXiv:2312.07520v2 Announce Type: replace \nAbstract: We develop a new, spectral approach for identifying and estimating average counterfactual outcomes under a low-rank factor model with short panel data and general outcome missingness patterns. Applications include event studies and studies of outcomes of \"matches\" between agents of two types, e.g. workers and firms, typically conducted under less-flexible Two-Way-Fixed-Effects (TWFE) models of outcomes. Given an infinite population of units and a finite number of outcomes, we show our approach identifies all counterfactual outcome means, including those not estimable by existing methods, if a particular graph constructed based on overlaps in observed outcomes between subpopulations is connected. Our analogous, computationally efficient estimation procedure yields consistent, asymptotically normal estimates of counterfactual outcome means under fixed-$T$ (number of outcomes), large-$N$ (sample size) asymptotics. In a semi-synthetic simulation study based on matched employer-employee data, our estimator has lower bias and only slightly higher variance than a TWFE-model-based estimator when estimating average log-wages."}, "https://arxiv.org/abs/2312.09884": {"title": "Investigating the heterogeneity of \"study twins\"", "link": "https://arxiv.org/abs/2312.09884", "description": "arXiv:2312.09884v2 Announce Type: replace \nAbstract: Meta-analyses are commonly performed based on random-effects models, while in certain cases one might also argue in favour of a common-effect model. One such case may be given by the example of two \"study twins\" that are performed according to a common (or at least very similar) protocol. Here we investigate the particular case of meta-analysis of a pair of studies, e.g.
summarizing the results of two confirmatory clinical trials in phase III of a clinical development programme. Thereby, we focus on the question of to what extent homogeneity or heterogeneity may be discernible, and include an empirical investigation of published (\"twin\") pairs of studies. A pair of estimates from two studies only provides very little evidence on homogeneity or heterogeneity of effects, and ad-hoc decision criteria may often be misleading."}, "https://arxiv.org/abs/2306.10614": {"title": "Identifiable causal inference with noisy treatment and no side information", "link": "https://arxiv.org/abs/2306.10614", "description": "arXiv:2306.10614v2 Announce Type: replace-cross \nAbstract: In some causal inference scenarios, the treatment variable is measured inaccurately, for instance in epidemiology or econometrics. Failure to correct for the effect of this measurement error can lead to biased causal effect estimates. Previous research has not studied methods that address this issue from a causal viewpoint while allowing for complex nonlinear dependencies and without assuming access to side information. For such a scenario, this study proposes a model that assumes a continuous treatment variable that is inaccurately measured. Building on existing results for measurement error models, we prove that our model's causal effect estimates are identifiable, even without knowledge of the measurement error variance or other side information. Our method relies on a deep latent variable model in which Gaussian conditionals are parameterized by neural networks, and we develop an amortized importance-weighted variational objective for training the model. Empirical results demonstrate the method's good performance with unknown measurement error. More broadly, our work extends the range of applications in which reliable causal inference can be conducted."}, "https://arxiv.org/abs/2310.00809": {"title": "Towards Causal Foundation Model: on Duality between Causal Inference and Attention", "link": "https://arxiv.org/abs/2310.00809", "description": "arXiv:2310.00809v2 Announce Type: replace-cross \nAbstract: Foundation models have brought changes to the landscape of machine learning, demonstrating sparks of human-level intelligence across a diverse array of tasks. However, a gap persists in complex tasks such as causal inference, primarily due to challenges associated with intricate reasoning steps and high numerical precision requirements. In this work, we take a first step towards building causally-aware foundation models for complex tasks. We propose a novel, theoretically sound method called Causal Inference with Attention (CInA), which utilizes multiple unlabeled datasets to perform self-supervised causal learning, and subsequently enables zero-shot causal inference on unseen tasks with new data. This is based on our theoretical results that demonstrate the primal-dual connection between optimal covariate balancing and self-attention, facilitating zero-shot causal inference through the final layer of a trained transformer-type architecture. 
We demonstrate empirically that our approach CInA effectively generalizes to out-of-distribution datasets and various real-world datasets, matching or even surpassing traditional per-dataset causal inference methodologies."}, "https://arxiv.org/abs/2405.03778": {"title": "An Autoregressive Model for Time Series of Random Objects", "link": "https://arxiv.org/abs/2405.03778", "description": "arXiv:2405.03778v1 Announce Type: new \nAbstract: Random variables in metric spaces indexed by time and observed at equally spaced time points are receiving increased attention due to their broad applicability. However, the absence of inherent structure in metric spaces has resulted in a literature that is predominantly non-parametric and model-free. To address this gap in models for time series of random objects, we introduce an adaptation of the classical linear autoregressive model tailored for data lying in a Hadamard space. The parameters of interest in this model are the Fr\\'echet mean and a concentration parameter, both of which we prove can be consistently estimated from data. Additionally, we propose a test statistic and establish its asymptotic normality, thereby enabling hypothesis testing for the absence of serial dependence. Finally, we introduce a bootstrap procedure to obtain critical values for the test statistic under the null hypothesis. Theoretical results of our method, including the convergence of the estimators as well as the size and power of the test, are illustrated through simulations, and the utility of the model is demonstrated by an analysis of a time series of consumer inflation expectations."}, "https://arxiv.org/abs/2405.03815": {"title": "Statistical inference for a stochastic generalized logistic differential equation", "link": "https://arxiv.org/abs/2405.03815", "description": "arXiv:2405.03815v1 Announce Type: new \nAbstract: This research aims to estimate three parameters in a stochastic generalized logistic differential equation. We assume the intrinsic growth rate and shape parameters are constant but unknown. To estimate these two parameters, we use the maximum likelihood method and establish that the estimators for these two parameters are strongly consistent. We estimate the diffusion parameter by using the quadratic variation processes. To test our results, we evaluate two data scenarios, complete and incomplete, with fixed values assigned to the three parameters. In the incomplete data scenario, we apply an Expectation Maximization algorithm."}, "https://arxiv.org/abs/2405.03826": {"title": "A quantile-based nonadditive fixed effects model", "link": "https://arxiv.org/abs/2405.03826", "description": "arXiv:2405.03826v1 Announce Type: new \nAbstract: I propose a quantile-based nonadditive fixed effects panel model to study heterogeneous causal effects. Similar to standard fixed effects (FE) model, my model allows arbitrary dependence between regressors and unobserved heterogeneity, but it generalizes the additive separability of standard FE to allow the unobserved heterogeneity to enter nonseparably. Similar to structural quantile models, my model's random coefficient vector depends on an unobserved, scalar ''rank'' variable, in which outcomes (excluding an additive noise term) are monotonic at a particular value of the regressor vector, which is much weaker than the conventional monotonicity assumption that must hold at all possible values. 
This rank is assumed to be stable over time, which is often more economically plausible than the panel quantile studies that assume individual rank is iid over time. It uncovers the heterogeneous causal effects as functions of the rank variable. I provide identification and estimation results, establishing uniform consistency and uniform asymptotic normality of the heterogeneous causal effect function estimator. Simulations show reasonable finite-sample performance and show my model complements fixed effects quantile regression. Finally, I illustrate the proposed methods by examining the causal effect of a country's oil wealth on its military defense spending."}, "https://arxiv.org/abs/2405.03834": {"title": "Covariance-free Multifidelity Control Variates Importance Sampling for Reliability Analysis of Rare Events", "link": "https://arxiv.org/abs/2405.03834", "description": "arXiv:2405.03834v1 Announce Type: new \nAbstract: Multifidelity modeling has been steadily gaining attention as a tool to address the problem of exorbitant model evaluation costs that makes the estimation of failure probabilities a significant computational challenge for complex real-world problems, particularly when failure is a rare event. To implement multifidelity modeling, estimators that efficiently combine information from multiple models/sources are necessary. In past works, the variance reduction techniques of Control Variates (CV) and Importance Sampling (IS) have been leveraged for this task. In this paper, we present the CVIS framework; a creative take on a coupled Control Variates and Importance Sampling estimator for bifidelity reliability analysis. The framework addresses some of the practical challenges of the CV method by using an estimator for the control variate mean and side-stepping the need to estimate the covariance between the original estimator and the control variate through a clever choice for the tuning constant. The task of selecting an efficient IS distribution is also considered, with a view towards maximally leveraging the bifidelity structure and maintaining expressivity. Additionally, a diagnostic is provided that indicates both the efficiency of the algorithm as well as the relative predictive quality of the models utilized. Finally, the behavior and performance of the framework is explored through analytical and numerical examples."}, "https://arxiv.org/abs/2405.03910": {"title": "A Primer on the Analysis of Randomized Experiments and a Survey of some Recent Advances", "link": "https://arxiv.org/abs/2405.03910", "description": "arXiv:2405.03910v1 Announce Type: new \nAbstract: The past two decades have witnessed a surge of new research in the analysis of randomized experiments. The emergence of this literature may seem surprising given the widespread use and long history of experiments as the \"gold standard\" in program evaluation, but this body of work has revealed many subtle aspects of randomized experiments that may have been previously unappreciated. 
This article provides an overview of some of these topics, primarily focused on stratification, regression adjustment, and cluster randomization."}, "https://arxiv.org/abs/2405.03985": {"title": "Bayesian Multilevel Compositional Data Analysis: Introduction, Evaluation, and Application", "link": "https://arxiv.org/abs/2405.03985", "description": "arXiv:2405.03985v1 Announce Type: new \nAbstract: Multilevel compositional data commonly occur in various fields, particularly in intensive, longitudinal studies using ecological momentary assessments. Examples include data repeatedly measured over time that are non-negative and sum to a constant value, such as sleep-wake movement behaviours in a 24-hour day. This article presents a novel methodology for analysing multilevel compositional data using a Bayesian inference approach. This method can be used to investigate how reallocation of time between sleep-wake movement behaviours may be associated with other phenomena (e.g., emotions, cognitions) at a daily level. We explain the theoretical details of the data and the models, and outline the steps necessary to implement this method. We introduce the R package multilevelcoda to facilitate the application of this method and illustrate using a real data example. An extensive parameter recovery simulation study verified the robust performance of the method. Across all simulation conditions investigated in the simulation study, the model had minimal convergence issues (convergence rate > 99%) and achieved excellent quality of parameter estimates and inference, with an average bias of 0.00 (range -0.09, 0.05) and coverage of 0.95 (range 0.93, 0.97). We conclude the article with recommendations on the use of the Bayesian compositional multilevel modelling approach, and hope to promote wider application of this method to answer robust questions using the increasingly available data from intensive, longitudinal studies."}, "https://arxiv.org/abs/2405.04193": {"title": "A generalized ordinal quasi-symmetry model and its separability for analyzing multi-way tables", "link": "https://arxiv.org/abs/2405.04193", "description": "arXiv:2405.04193v1 Announce Type: new \nAbstract: This paper addresses the challenge of modeling multi-way contingency tables for matched set data with ordinal categories. Although the complete symmetry and marginal homogeneity models are well established, they may not always provide a satisfactory fit to the data. To address this issue, we propose a generalized ordinal quasi-symmetry model that offers increased flexibility when the complete symmetry model fails to capture the underlying structure. We investigate the properties of this new model and provide an information-theoretic interpretation, elucidating its relationship to the ordinal quasi-symmetry model. Moreover, we revisit Agresti's findings and present a new necessary and sufficient condition for the complete symmetry model, proving that the proposed model and the marginal moment equality model are separable hypotheses. The separability of the proposed model and marginal moment equality model is a significant development in the analysis of multi-way contingency tables. It enables researchers to examine the symmetry structure in the data with greater precision, providing a more thorough understanding of the underlying patterns. 
This powerful framework equips researchers with the necessary tools to explore the complexities of ordinal variable relationships in matched set data, paving the way for new discoveries and insights."}, "https://arxiv.org/abs/2405.04226": {"title": "NEST: Neural Estimation by Sequential Testing", "link": "https://arxiv.org/abs/2405.04226", "description": "arXiv:2405.04226v1 Announce Type: new \nAbstract: Adaptive psychophysical procedures aim to increase the efficiency and reliability of measurements. With increasing stimulus and experiment complexity in the last decade, estimating multi-dimensional psychometric functions has become a challenging task for adaptive procedures. If the experimenter has limited information about the underlying psychometric function, it is not possible to use parametric techniques developed for the multi-dimensional stimulus space. Although there are non-parametric approaches that use Gaussian process methods and specific hand-crafted acquisition functions, their performance is sensitive to proper selection of the kernel function, which is not always straightforward. In this work, we use a neural network as the psychometric function estimator and introduce a novel acquisition function for stimulus selection. We thoroughly benchmark our technique both using simulations and by conducting psychovisual experiments under realistic conditions. We show that our method outperforms the state of the art without the need to select a kernel function and significantly reduces the experiment duration."}, "https://arxiv.org/abs/2405.04238": {"title": "Homogeneity of multinomial populations when data are classified into a large number of groups", "link": "https://arxiv.org/abs/2405.04238", "description": "arXiv:2405.04238v1 Announce Type: new \nAbstract: Suppose that we are interested in the comparison of two independent categorical variables. Suppose also that the population is divided into subpopulations or groups. Notice that the distribution of the target variable may vary across subpopulations, moreover, it may happen that the two independent variables have the same distribution in the whole population, but their distributions could differ in some groups. So, instead of testing the homogeneity of the two categorical variables, one may be interested in simultaneously testing the homogeneity in all groups. A novel procedure is proposed for carrying out such a testing problem. The test statistic is shown to be asymptotically normal, avoiding the use of complicated resampling methods to get $p$-values. Here by asymptotic we mean when the number of groups increases; the sample sizes of the data from each group can either stay bounded or grow with the number of groups. The finite sample performance of the proposal is empirically evaluated through an extensive simulation study. The usefulness of the proposal is illustrated by three data sets coming from diverse experimental fields such as education, the COVID-19 pandemic and digital elevation models."}, "https://arxiv.org/abs/2405.04254": {"title": "Distributed variable screening for generalized linear models", "link": "https://arxiv.org/abs/2405.04254", "description": "arXiv:2405.04254v1 Announce Type: new \nAbstract: In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. 
Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather than just the marginal effect, and this characteristic enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that with a high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite sample performance of the proposed method, and an application to a real dataset showcases its practical utility."}, "https://arxiv.org/abs/2405.04365": {"title": "Detailed Gender Wage Gap Decompositions: Controlling for Worker Unobserved Heterogeneity Using Network Theory", "link": "https://arxiv.org/abs/2405.04365", "description": "arXiv:2405.04365v1 Announce Type: new \nAbstract: Recent advances in the literature of decomposition methods in economics have allowed for the identification and estimation of detailed wage gap decompositions. In this context, building reliable counterfactuals requires using tighter controls to ensure that similar workers are correctly identified by making sure that important unobserved variables such as skills are controlled for, as well as comparing only workers with similar observable characteristics. This paper contributes to the wage decomposition literature in two main ways: (i) developing an economically principled, network-based approach to control for unobserved worker skills heterogeneity in the presence of potential discrimination; and (ii) extending existing generic decomposition tools to accommodate a potential lack of overlapping supports in covariates between groups being compared, which is likely to be the norm in more detailed decompositions. We illustrate the methodology by decomposing the gender wage gap in Brazil."}, "https://arxiv.org/abs/2405.04419": {"title": "Transportability of Principal Causal Effects", "link": "https://arxiv.org/abs/2405.04419", "description": "arXiv:2405.04419v1 Announce Type: new \nAbstract: Recent research in causal inference has made important progress in addressing challenges to the external validity of trial findings. Such methods weight trial participant data to more closely resemble the distribution of effect-modifying covariates in a well-defined target population. In the presence of participant non-adherence to study medication, these methods effectively transport an intention-to-treat effect that averages over heterogeneous compliance behaviors. In this paper, we develop a principal stratification framework to identify causal effects conditioning on both compliance behavior and membership in the target population. We also develop non-parametric efficiency theory for and construct efficient estimators of such \"transported\" principal causal effects and characterize their finite-sample performance in simulation experiments. While this work focuses on treatment non-adherence, the framework is applicable to a broad class of estimands that target effects in clinically-relevant, possibly latent subsets of a target population."}, "https://arxiv.org/abs/2405.04446": {"title": "Causal Inference in the Multiverse of Hazard", "link": "https://arxiv.org/abs/2405.04446", "description": "arXiv:2405.04446v1 Announce Type: new \nAbstract: Hazard serves as a pivotal estimand in both practical applications and methodological frameworks.
However, its causal interpretation poses notable challenges, including inherent selection biases and ill-defined populations to be compared between different treatment groups. In response, we propose a novel definition of counterfactual hazard within the framework of possible worlds. Instead of conditioning on prior survival status as a conditional probability, our new definition involves intervening in the prior status, treating it as a marginal probability. Using single-world intervention graphs, we demonstrate that the proposed counterfactual hazard is a type of controlled direct effect. Conceptually, intervening in survival status at each time point generates a new possible world, where the proposed hazards across time points represent risks in these hypothetical scenarios, forming a \"multiverse of hazard.\" The cumulative and average counterfactual hazards correspond to the sum and average of risks across this multiverse, respectively, with the actual world's risk lying between the two. This conceptual shift reframes hazards in the actual world as a collection of risks across possible worlds, marking a significant advancement in the causal interpretation of hazards."}, "https://arxiv.org/abs/2405.04465": {"title": "Two-way Fixed Effects and Differences-in-Differences Estimators in Heterogeneous Adoption Designs", "link": "https://arxiv.org/abs/2405.04465", "description": "arXiv:2405.04465v1 Announce Type: new \nAbstract: We consider treatment-effect estimation under a parallel trends assumption, in heterogeneous adoption designs where no unit is treated at period one, and units receive a weakly positive dose at period two. First, we develop a test of the assumption that the treatment effect is mean independent of the treatment, under which the commonly-used two-way-fixed-effects estimator is consistent. When this test is rejected, we propose alternative, robust estimators. If there are stayers with a period-two treatment equal to 0, the robust estimator is a difference-in-differences (DID) estimator using stayers as the control group. If there are quasi-stayers with a period-two treatment arbitrarily close to zero, the robust estimator is a DID using units with a period-two treatment below a bandwidth as controls. Finally, without stayers or quasi-stayers, we propose non-parametric bounds, and an estimator relying on a parametric specification of treatment-effect heterogeneity. We use our results to revisit Pierce and Schott (2016) and Enikolopov et al. (2011)."}, "https://arxiv.org/abs/2405.04475": {"title": "Bayesian Copula Density Estimation Using Bernstein Yett-Uniform Priors", "link": "https://arxiv.org/abs/2405.04475", "description": "arXiv:2405.04475v1 Announce Type: new \nAbstract: Probability density estimation is a central task in statistics. Copula-based models provide a great deal of flexibility in modelling multivariate distributions, allowing for the specification of models for the marginal distributions separately from the dependence structure (copula) that links them to form a joint distribution. Choosing a class of copula models is not a trivial task and its misspecification can lead to wrong conclusions. We introduce a novel class of random Bernstein copula functions, and study its support and the behavior of its posterior distribution. The proposal is based on a particular class of random grid-uniform copulas, referred to as yett-uniform copulas.
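The "stayers as the control group" estimator described in the heterogeneous adoption entry above (arXiv:2405.04465) can be sketched in a few lines for the two-period case. The data frame, column names, and the per-unit-of-dose normalization below are illustrative assumptions, not the paper's exact estimator or inference procedure.

```python
# Minimal sketch of a difference-in-differences estimator that uses "stayers"
# (units whose period-two dose is exactly zero) as the control group, in the
# spirit of the heterogeneous adoption design above (arXiv:2405.04465).
# Columns and the dose normalization are hypothetical; the paper's estimator
# and inference are more involved than this two-period comparison of means.
import numpy as np
import pandas as pd

def did_with_stayers(df: pd.DataFrame) -> float:
    """df has one row per unit with columns y1, y2 (period-one and period-two
    outcomes) and dose (weakly positive treatment dose received in period two)."""
    movers = df[df["dose"] > 0]
    stayers = df[df["dose"] == 0]
    # Change over time among dosed units minus change among stayers.
    did = (movers["y2"] - movers["y1"]).mean() - (stayers["y2"] - stayers["y1"]).mean()
    # Scale by the movers' average dose to express a per-unit-of-dose effect
    # (one common normalization; the paper's exact estimand may differ).
    return did / movers["dose"].mean()

# Toy example: true effect of 2.0 per unit of dose, parallel trends hold.
rng = np.random.default_rng(1)
n = 500
dose = np.where(rng.random(n) < 0.3, 0.0, rng.random(n))   # ~30% stayers
alpha = rng.normal(0, 1, n)                                 # unit fixed effects
y1 = alpha + rng.normal(0, 0.5, n)
y2 = alpha + 1.0 + 2.0 * dose + rng.normal(0, 0.5, n)       # common trend of +1
print(did_with_stayers(pd.DataFrame({"y1": y1, "y2": y2, "dose": dose})))
```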
Alternative Markov chain Monte Carlo algorithms for exploring the posterior distribution under the proposed model are also studied. The methodology is illustrated by means of simulated and real data."}, "https://arxiv.org/abs/2405.04531": {"title": "Stochastic Gradient MCMC for Massive Geostatistical Data", "link": "https://arxiv.org/abs/2405.04531", "description": "arXiv:2405.04531v1 Announce Type: new \nAbstract: Gaussian processes (GPs) are commonly used for prediction and inference for spatial data analyses. However, since estimation and prediction tasks have cubic time and quadratic memory complexity in number of locations, GPs are difficult to scale to large spatial datasets. The Vecchia approximation induces sparsity in the dependence structure and is one of several methods proposed to scale GP inference. Our work adds to the substantial research in this area by developing a stochastic gradient Markov chain Monte Carlo (SGMCMC) framework for efficient computation in GPs. At each step, the algorithm subsamples a minibatch of locations and subsequently updates process parameters through a Vecchia-approximated GP likelihood. Since the Vecchia-approximated GP has a time complexity that is linear in the number of locations, this results in scalable estimation in GPs. Through simulation studies, we demonstrate that SGMCMC is competitive with state-of-the-art scalable GP algorithms in terms of computational time and parameter estimation. An application of our method is also provided using the Argo dataset of ocean temperature measurements."}, "https://arxiv.org/abs/2405.03720": {"title": "Spatial Transfer Learning with Simple MLP", "link": "https://arxiv.org/abs/2405.03720", "description": "arXiv:2405.03720v1 Announce Type: cross \nAbstract: First step to investigate the potential of transfer learning applied to the field of spatial statistics"}, "https://arxiv.org/abs/2405.03723": {"title": "Generative adversarial learning with optimal input dimension and its adaptive generator architecture", "link": "https://arxiv.org/abs/2405.03723", "description": "arXiv:2405.03723v1 Announce Type: cross \nAbstract: We investigate the impact of the input dimension on the generalization error in generative adversarial networks (GANs). In particular, we first provide both theoretical and practical evidence to validate the existence of an optimal input dimension (OID) that minimizes the generalization error. Then, to identify the OID, we introduce a novel framework called generalized GANs (G-GANs), which includes existing GANs as a special case. By incorporating the group penalty and the architecture penalty developed in the paper, G-GANs have several intriguing features. First, our framework offers adaptive dimensionality reduction from the initial dimension to a dimension necessary for generating the target distribution. Second, this reduction in dimensionality also shrinks the required size of the generator network architecture, which is automatically identified by the proposed architecture penalty. Both reductions in dimensionality and the generator network significantly improve the stability and the accuracy of the estimation and prediction. Theoretical support for the consistent selection of the input dimension and the generator network is provided. Third, the proposed algorithm involves an end-to-end training process, and the algorithm allows for dynamic adjustments between the input dimension and the generator network during training, further enhancing the overall performance of G-GANs. 
Extensive experiments conducted with simulated and benchmark data demonstrate the superior performance of G-GANs. In particular, compared to that of off-the-shelf methods, G-GANs achieves an average improvement of 45.68% in the CT slice dataset, 43.22% in the MNIST dataset and 46.94% in the FashionMNIST dataset in terms of the maximum mean discrepancy or Frechet inception distance. Moreover, the features generated based on the input dimensions identified by G-GANs align with visually significant features."}, "https://arxiv.org/abs/2405.04043": {"title": "Scalable Vertical Federated Learning via Data Augmentation and Amortized Inference", "link": "https://arxiv.org/abs/2405.04043", "description": "arXiv:2405.04043v1 Announce Type: cross \nAbstract: Vertical federated learning (VFL) has emerged as a paradigm for collaborative model estimation across multiple clients, each holding a distinct set of covariates. This paper introduces the first comprehensive framework for fitting Bayesian models in the VFL setting. We propose a novel approach that leverages data augmentation techniques to transform VFL problems into a form compatible with existing Bayesian federated learning algorithms. We present an innovative model formulation for specific VFL scenarios where the joint likelihood factorizes into a product of client-specific likelihoods. To mitigate the dimensionality challenge posed by data augmentation, which scales with the number of observations and clients, we develop a factorized amortized variational approximation that achieves scalability independent of the number of observations. We showcase the efficacy of our framework through extensive numerical experiments on logistic regression, multilevel regression, and a novel hierarchical Bayesian split neural net model. Our work paves the way for privacy-preserving, decentralized Bayesian inference in vertically partitioned data scenarios, opening up new avenues for research and applications in various domains."}, "https://arxiv.org/abs/2012.00180": {"title": "Anisotropic local constant smoothing for change-point regression function estimation", "link": "https://arxiv.org/abs/2012.00180", "description": "arXiv:2012.00180v2 Announce Type: replace \nAbstract: Understanding forest fire spread in any region of Canada is critical to promoting forest health, and protecting human life and infrastructure. Quantifying fire spread from noisy images, where regions of a fire are separated by change-point boundaries, is critical to faithfully estimating fire spread rates. In this research, we develop a statistically consistent smooth estimator that allows us to denoise fire spread imagery from micro-fire experiments. We develop an anisotropic smoothing method for change-point data that uses estimates of the underlying data generating process to inform smoothing. We show that the anisotropic local constant regression estimator is consistent with convergence rate $O\\left(n^{-1/{(q+2)}}\\right)$. We demonstrate its effectiveness on simulated one- and two-dimensional change-point data and fire spread imagery from micro-fire experiments."}, "https://arxiv.org/abs/2205.00171": {"title": "A Heteroskedasticity-Robust Overidentifying Restriction Test with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2205.00171", "description": "arXiv:2205.00171v3 Announce Type: replace \nAbstract: This paper proposes an overidentifying restriction test for high-dimensional linear instrumental variable models. 
The novelty of the proposed test is that it allows the number of covariates and instruments to be larger than the sample size. The test is scale-invariant and is robust to heteroskedastic errors. To construct the final test statistic, we first introduce a test based on the maximum norm of multiple parameters that could be high-dimensional. The theoretical power based on the maximum norm is higher than that in the modified Cragg-Donald test (Koles\\'{a}r, 2018), the only existing test allowing for large-dimensional covariates. Second, following the principle of power enhancement (Fan et al., 2015), we introduce the power-enhanced test, with an asymptotically zero component used to enhance the power to detect some extreme alternatives with many locally invalid instruments. Finally, an empirical example of the trade and economic growth nexus demonstrates the usefulness of the proposed test."}, "https://arxiv.org/abs/2207.09098": {"title": "ReBoot: Distributed statistical learning via refitting bootstrap samples", "link": "https://arxiv.org/abs/2207.09098", "description": "arXiv:2207.09098v3 Announce Type: replace \nAbstract: In this paper, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. ReBoot refits a new model to mini-batches of bootstrap samples that are continuously drawn from each of the locally fitted models. It requires only one round of communication of model parameters without much memory. Theoretically, we analyze the statistical error rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. In both cases, ReBoot provably achieves the full-sample statistical rate. In particular, we show that the systematic bias of ReBoot, the error that is independent of the number of subsamples (i.e., the number of sites), is $O(n ^ {-2})$ in GLM, where $n$ is the subsample size (the sample size of each local site). This rate is sharper than that of model parameter averaging and its variants, implying the higher tolerance of ReBoot with respect to data splits to maintain the full-sample rate. Our simulation study demonstrates the statistical advantage of ReBoot over competing methods. Finally, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification. FedReBoot exhibits substantial superiority over Federated Averaging (FedAvg) within early rounds of communication."}, "https://arxiv.org/abs/2304.02563": {"title": "The transcoding sampler for stick-breaking inferences on Dirichlet process mixtures", "link": "https://arxiv.org/abs/2304.02563", "description": "arXiv:2304.02563v2 Announce Type: replace \nAbstract: Dirichlet process mixture models suffer from slow mixing of the MCMC posterior chain produced by stick-breaking Gibbs samplers, as opposed to collapsed Gibbs samplers based on the Polya urn representation which have shorter integrated autocorrelation time (IAT).\n We study how cluster membership information is encoded under the two aforementioned samplers, and we introduce the transcoding algorithm to switch between encodings. 
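A very rough sketch of the one-shot "refit on samples drawn from the locally fitted models" idea behind ReBoot (arXiv:2207.09098), specialized to logistic regression. It assumes, purely for illustration, that the aggregator can draw covariates from a known design distribution; the paper's actual resampling scheme, communication protocol, and theory are more general.

```python
# Rough sketch of the ReBoot idea (arXiv:2207.09098): each site communicates
# only its locally fitted parameters; the aggregator draws pseudo-samples from
# every locally fitted model and refits once on the pooled pseudo-data.
# Assumption for this toy example only: covariates can be redrawn from a known
# standard-normal design, and there is no intercept in the true model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, n_sites, n_local = 5, 10, 400
beta_true = np.array([1.0, -0.5, 0.25, 0.0, 0.75])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each site fits a local logistic regression (large C ~ no regularization).
local_coefs = []
for _ in range(n_sites):
    X = rng.normal(size=(n_local, d))
    y = rng.binomial(1, sigmoid(X @ beta_true))
    fit = LogisticRegression(C=1e6, fit_intercept=False).fit(X, y)
    local_coefs.append(fit.coef_.ravel())

# Aggregator: simulate pseudo-samples from each locally fitted model, refit once.
X_parts, y_parts = [], []
for coef in local_coefs:
    Xb = rng.normal(size=(n_local, d))
    X_parts.append(Xb)
    y_parts.append(rng.binomial(1, sigmoid(Xb @ coef)))
refit = LogisticRegression(C=1e6, fit_intercept=False).fit(
    np.vstack(X_parts), np.concatenate(y_parts))

print("naive average of local fits:", np.mean(local_coefs, axis=0).round(2))
print("ReBoot-style refit:         ", refit.coef_.ravel().round(2))
```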
We also develop the transcoding sampler, which consists of undertaking posterior partition inference with any high-efficiency sampler, such as collapsed Gibbs, and subsequently transcoding it to the stick-breaking representation via the transcoding algorithm, thereby allowing inference on all stick-breaking parameters of interest while retaining the shorter IAT of the high-efficiency sampler.\n The transcoding sampler is substantially simpler to implement than the slice sampler; it can inherit the shorter IAT of collapsed Gibbs samplers and it can also achieve zero IAT when paired with a posterior partition sampler that is i.i.d., such as the sequential importance sampler."}, "https://arxiv.org/abs/2311.02043": {"title": "Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective", "link": "https://arxiv.org/abs/2311.02043", "description": "arXiv:2311.02043v2 Announce Type: replace \nAbstract: Quantile regression is a powerful tool for inferring how covariates affect specific percentiles of the response distribution. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or non-parametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Further, neither approach is well-suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model, we derive optimal and interpretable linear estimates and uncertainty quantification for each model-based conditional quantile. Our approach introduces a quantile-focused squared error loss, which enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, variable selection, and inference over frequentist and Bayesian competitors. We apply these tools to identify the quantile-specific impacts of social and environmental stressors on educational outcomes for a large cohort of children in North Carolina."}, "https://arxiv.org/abs/2401.00097": {"title": "Recursive identification with regularization and on-line hyperparameters estimation", "link": "https://arxiv.org/abs/2401.00097", "description": "arXiv:2401.00097v2 Announce Type: replace \nAbstract: This paper presents a regularized recursive identification algorithm with simultaneous on-line estimation of both the model parameters and the algorithm's hyperparameters. A new kernel is proposed to facilitate the algorithm development. The performance of this novel scheme is compared with that of the recursive least squares algorithm in simulation."}, "https://arxiv.org/abs/2401.02048": {"title": "Random Effect Restricted Mean Survival Time Model", "link": "https://arxiv.org/abs/2401.02048", "description": "arXiv:2401.02048v2 Announce Type: replace \nAbstract: The restricted mean survival time (RMST) model has been garnering attention as a way to provide a clinically intuitive measure: the mean survival time. RMST models, which use methods based on pseudo time-to-event values and inverse probability censoring weighting, can adjust for covariates.
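The pseudo time-to-event values just mentioned for RMST regression (arXiv:2401.02048) are typically leave-one-out jackknife pseudo-observations of the Kaplan-Meier restricted mean. The sketch below computes them from scratch; the random-effect extension in the paper would then feed these pseudo-values into a generalized mixed model, a step not shown here. The helper functions are illustrative, not the authors' code.

```python
# Illustrative computation of RMST pseudo-values: a leave-one-out jackknife of
# the Kaplan-Meier restricted mean survival time. These pseudo-observations can
# then be regressed on covariates (or passed to a mixed model, as in the
# random-effect RMST abstract above). Assumes untied, continuous event times.
import numpy as np

def km_rmst(time, event, tau):
    """Restricted mean survival time up to tau from a Kaplan-Meier estimate."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    at_risk, surv, prev_t, rmst = len(t), 1.0, 0.0, 0.0
    for ti, di in zip(t, d):
        if ti > tau:
            break
        rmst += surv * (ti - prev_t)        # area under S(t) on [prev_t, ti]
        if di:                              # event: KM step down
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1                        # censored units also leave risk set
        prev_t = ti
    return rmst + surv * (tau - prev_t)     # remaining area up to tau

def rmst_pseudo_values(time, event, tau):
    """Jackknife pseudo-observations: n*theta_hat - (n-1)*theta_hat_(-i)."""
    n = len(time)
    full = km_rmst(time, event, tau)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False
        pseudo[i] = n * full - (n - 1) * km_rmst(time[keep], event[keep], tau)
        keep[i] = True
    return pseudo

rng = np.random.default_rng(3)
t_event = rng.exponential(10, 200)
t_cens = rng.exponential(15, 200)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)
print(rmst_pseudo_values(time, event, tau=8.0)[:5].round(2))
```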
However, no approach has yet been introduced that considers random effects for clusters. In this paper, we propose a new random-effect RMST. We present two methods of analysis that consider variable effects by i) using a generalized mixed model with pseudo-values and ii) integrating the estimated results from the inverse probability censoring weighting estimating equations for each cluster. We evaluate our proposed methods through computer simulations. In addition, we analyze the effect of a mother's age at birth on under-five deaths in India using states as clusters."}, "https://arxiv.org/abs/2202.08081": {"title": "Reasoning with fuzzy and uncertain evidence using epistemic random fuzzy sets: general framework and practical models", "link": "https://arxiv.org/abs/2202.08081", "description": "arXiv:2202.08081v4 Announce Type: replace-cross \nAbstract: We introduce a general theory of epistemic random fuzzy sets for reasoning with fuzzy or crisp evidence. This framework generalizes both the Dempster-Shafer theory of belief functions, and possibility theory. Independent epistemic random fuzzy sets are combined by the generalized product-intersection rule, which extends both Dempster's rule for combining belief functions, and the product conjunctive combination of possibility distributions. We introduce Gaussian random fuzzy numbers and their multi-dimensional extensions, Gaussian random fuzzy vectors, as practical models for quantifying uncertainty about scalar or vector quantities. Closed-form expressions for the combination, projection and vacuous extension of Gaussian random fuzzy numbers and vectors are derived."}, "https://arxiv.org/abs/2209.01328": {"title": "Optimal empirical Bayes estimation for the Poisson model via minimum-distance methods", "link": "https://arxiv.org/abs/2209.01328", "description": "arXiv:2209.01328v2 Announce Type: replace-cross \nAbstract: The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that the Robbins estimator lacks the desired smoothness and monotonicity of Bayes estimators and can be easily derailed by those data points that were rarely observed before. Based on the minimum-distance method, we propose a suite of empirical Bayes estimators, including the classical nonparametric maximum likelihood, that outperform the Robbins method in a variety of synthetic and real data sets and retain its optimality in terms of minimax regret."}, "https://arxiv.org/abs/2211.02039": {"title": "The Projected Covariance Measure for assumption-lean variable significance testing", "link": "https://arxiv.org/abs/2211.02039", "description": "arXiv:2211.02039v4 Announce Type: replace-cross \nAbstract: Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections.
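For reference, the classical Robbins estimator discussed in the minimum-distance empirical Bayes entry above (arXiv:2209.01328) is easy to write down: the Poisson empirical Bayes estimate of the mean at an observed count $x$ is $(x+1)\,\hat{p}(x+1)/\hat{p}(x)$ with empirical frequencies plugged in. The sketch below also illustrates the instability on rarely observed counts that motivates smoother minimum-distance and NPMLE alternatives; the simulation setup is invented for the example.

```python
# The classical Robbins estimator for the Poisson empirical Bayes problem:
# E[theta | X = x] = (x + 1) * p(x + 1) / p(x), with the marginal pmf p
# replaced by empirical frequencies. Note the "derailment" on the largest
# observed count, where the plug-in ratio collapses to zero.
import numpy as np

def robbins(counts: np.ndarray) -> np.ndarray:
    """Return Robbins estimates of E[theta | X = x] for each observed count."""
    counts = np.asarray(counts)
    max_x = counts.max()
    # Raw frequencies; the ratio freq[x+1]/freq[x] equals the empirical pmf ratio.
    freq = np.bincount(counts, minlength=max_x + 2).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        est_by_value = (np.arange(max_x + 1) + 1) * freq[1:max_x + 2] / freq[:max_x + 1]
    return est_by_value[counts]

rng = np.random.default_rng(4)
theta = rng.gamma(shape=2.0, scale=1.5, size=5000)   # unknown prior on Poisson means
x = rng.poisson(theta)
est = robbins(x)
print("MSE of Robbins vs. MLE (x itself):",
      round(np.mean((est - theta) ** 2), 3), "vs.", round(np.mean((x - theta) ** 2), 3))
```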
In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches."}, "https://arxiv.org/abs/2405.04624": {"title": "Adaptive design of experiments methodology for noise resistance with unreplicated experiments", "link": "https://arxiv.org/abs/2405.04624", "description": "arXiv:2405.04624v1 Announce Type: new \nAbstract: A new gradient-based adaptive sampling method is proposed for design of experiments applications which balances space filling, local refinement, and error minimization objectives while reducing reliance on delicate tuning parameters. High order local maximum entropy approximants are used for metamodelling, which take advantage of boundary-corrected kernel density estimation to increase accuracy and robustness on highly clumped datasets, as well as endowing the resulting metamodel with some robustness against data noise in the common case of unreplicated experiments. Two-dimensional test cases are analyzed against full factorial and latin hypercube designs and compare favourably. The proposed method is then applied in a unique manner to the problem of adaptive spatial resolution in time-varying non-linear functions, opening up the possibility to adapt the method to solve partial differential equations."}, "https://arxiv.org/abs/2405.04769": {"title": "Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets", "link": "https://arxiv.org/abs/2405.04769", "description": "arXiv:2405.04769v1 Announce Type: new \nAbstract: Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we will study here, synthetic datasets from confidential data. Methods to generate such datasets are increasingly numerous, using varied tools including Bayesian models, deep neural networks and copulas. However, little is still known about how to properly perform statistical inference with these differentially private synthetic (DIPS) datasets. The challenge is for the analyses to take into account the variability from the synthetic data generation in addition to the usual sampling variability. A similar challenge also occurs when missing data is imputed before analysis, and statisticians have developed appropriate inference procedures for this case, which were then extended to the case of synthetic datasets for privacy. In this work, we study the applicability of these procedures, based on combining rules, to the analysis of DIPS datasets.
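The "combining rules" referenced above are, in their classical multiple-imputation form (Rubin, 1987), simple to state: compute a point estimate and a variance estimate on each synthetic dataset, then pool them as below. This is only the familiar baseline; the variants tailored to (differentially private) synthetic data studied in arXiv:2405.04769 adjust the total-variance formula.

```python
# Classical Rubin-style combining rules: an analyst computes a point estimate
# q_l and a variance estimate u_l on each of m synthetic (or imputed) datasets
# and pools them. The synthetic-data / DIPS variants modify the total variance;
# this sketch shows only the standard multiple-imputation version.
import numpy as np

def rubin_combine(q: np.ndarray, u: np.ndarray):
    """q: per-dataset point estimates; u: per-dataset variance estimates."""
    m = len(q)
    q_bar = q.mean()                     # pooled point estimate
    u_bar = u.mean()                     # average within-dataset variance
    b = q.var(ddof=1)                    # between-dataset variance
    total_var = u_bar + (1 + 1 / m) * b  # Rubin's total variance
    return q_bar, total_var

# Toy use: estimates of a mean from m = 5 synthetic datasets.
q = np.array([2.10, 1.95, 2.30, 2.05, 2.18])
u = np.array([0.04, 0.05, 0.04, 0.06, 0.05])
q_bar, total_var = rubin_combine(q, u)
print(f"pooled estimate {q_bar:.3f}, standard error {np.sqrt(total_var):.3f}")
```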
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases."}, "https://arxiv.org/abs/2405.04816": {"title": "Testing the Fairness-Improvability of Algorithms", "link": "https://arxiv.org/abs/2405.04816", "description": "arXiv:2405.04816v1 Announce Type: new \nAbstract: Many algorithms have a disparate impact in that their benefits or harms fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, however, because it is not always clear whether there exists an alternative more-fair algorithm that does not compromise on other key objectives such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policy maker can formally substantiate, or refute, this necessity defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can incorporate any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and demonstrate its use empirically by evaluating a healthcare algorithm originally considered by Obermeyer et al. (2019). In this demonstration, we find strong statistically significant evidence that it is possible to reduce the algorithm's disparate impact without compromising on the accuracy of its predictions."}, "https://arxiv.org/abs/2405.04845": {"title": "Weighted Particle-Based Optimization for Efficient Generalized Posterior Calibration", "link": "https://arxiv.org/abs/2405.04845", "description": "arXiv:2405.04845v1 Announce Type: new \nAbstract: In the realm of statistical learning, the increasing volume of accessible data and increasing model complexity necessitate robust methodologies. This paper explores two branches of robust Bayesian methods in response to this trend. The first is generalized Bayesian inference, which introduces a learning rate parameter to enhance robustness against model misspecifications. The second is Gibbs posterior inference, which formulates inferential problems using generic loss functions rather than probabilistic models. In such approaches, it is necessary to calibrate the spread of the posterior distribution by selecting a learning rate parameter. The study aims to enhance the generalized posterior calibration (GPC) algorithm proposed by Syring and Martin (2019) [Biometrika, Volume 106, Issue 2, pp. 479-486]. Their algorithm chooses the learning rate to achieve the nominal frequentist coverage probability, but it is computationally intensive because it requires repeated posterior simulations for bootstrap samples. We propose a more efficient version of the GPC inspired by sequential Monte Carlo (SMC) samplers. A target distribution with a different learning rate is evaluated without posterior simulation as in the reweighting step in SMC sampling. Thus, the proposed algorithm can reach the desired value within a few iterations. This improvement substantially reduces the computational cost of the GPC. 
Its efficacy is demonstrated through synthetic and real data applications."}, "https://arxiv.org/abs/2405.04895": {"title": "On Correlation and Prediction Interval Reduction", "link": "https://arxiv.org/abs/2405.04895", "description": "arXiv:2405.04895v1 Announce Type: new \nAbstract: Pearson's correlation coefficient is a popular statistical measure to summarize the strength of association between two continuous variables. It is usually interpreted via its square as percentage of variance of one variable predicted by the other in a linear regression model. It can be generalized for multiple regression via the coefficient of determination, which is not straightforward to interpret in terms of prediction accuracy. In this paper, we propose to assess the prediction accuracy of a linear model via the prediction interval reduction (PIR) by comparing the width of the prediction interval derived from this model with the width of the prediction interval obtained without this model. At the population level, PIR is one-to-one related to the correlation and the coefficient of determination. In particular, a correlation of 0.5 corresponds to a PIR of only 13%. It is also the one's complement of the coefficient of alienation introduced at the beginning of last century. We argue that PIR is easily interpretable and useful to keep in mind how difficult it is to make accurate individual predictions, an important message in the era of precision medicine and artificial intelligence. Different estimates of PIR are compared in the context of a linear model and an extension of the PIR concept to non-linear models is outlined."}, "https://arxiv.org/abs/2405.04904": {"title": "Dependence-based fuzzy clustering of functional time series", "link": "https://arxiv.org/abs/2405.04904", "description": "arXiv:2405.04904v1 Announce Type: new \nAbstract: Time series clustering is an important data mining task with a wide variety of applications. While most methods focus on time series taking values on the real line, very few works consider functional time series. However, functional objects frequently arise in many fields, such as actuarial science, demography or finance. Functional time series are indexed collections of infinite-dimensional curves viewed as random elements taking values in a Hilbert space. In this paper, the problem of clustering functional time series is addressed. To this aim, a distance between functional time series is introduced and used to construct a clustering procedure. The metric relies on a measure of serial dependence which can be seen as a natural extension of the classical quantile autocorrelation function to the functional setting. Since the dynamics of the series may vary over time, we adopt a fuzzy approach, which enables the procedure to locate each series into several clusters with different membership degrees. The resulting algorithm can group series generated from similar stochastic processes, reaching accurate results with series coming from a broad variety of functional models and requiring minimum hyperparameter tuning. Several simulation experiments show that the method exhibits a high clustering accuracy besides being computationally efficient. 
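The population-level relationship stated in the PIR entry above (arXiv:2405.04895) can be checked directly: under a simple linear model the prediction interval shrinks by the factor $\sqrt{1-\rho^2}$, so PIR $= 1-\sqrt{1-\rho^2}$, which is about 13% at $\rho = 0.5$. The snippet below is just that arithmetic plus a Monte Carlo sanity check; it is not code from the paper.

```python
# Population-level check of the PIR claim: PIR = 1 - sqrt(1 - rho^2), i.e. the
# one's complement of the coefficient of alienation. A correlation of 0.5
# corresponds to a prediction interval reduction of only about 13%.
import numpy as np

def pir_from_correlation(rho: float) -> float:
    return 1.0 - np.sqrt(1.0 - rho ** 2)

for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"rho = {rho:.1f}  ->  PIR = {100 * pir_from_correlation(rho):.1f}%")

# Empirical check: width of a 95% prediction interval with vs. without the model.
rng = np.random.default_rng(5)
rho = 0.5
x = rng.normal(size=200_000)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=200_000)
width_without = 2 * 1.96 * y.std()           # marginal interval for y
width_with = 2 * 1.96 * (y - rho * x).std()  # residual interval given x
print("empirical PIR:", round(1 - width_with / width_without, 3))
```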
Two interesting applications involving high-frequency financial time series and age-specific mortality improvement rates illustrate the potential of the proposed approach."}, "https://arxiv.org/abs/2405.04917": {"title": "Guiding adaptive shrinkage by co-data to improve regression-based prediction and feature selection", "link": "https://arxiv.org/abs/2405.04917", "description": "arXiv:2405.04917v1 Announce Type: new \nAbstract: The high dimensional nature of genomics data complicates feature selection, in particular in low sample size studies - not uncommon in clinical prediction settings. It is widely recognized that complementary data on the features, `co-data', may improve results. Examples are prior feature groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories. Yet, the uptake of learning methods that structurally use such co-data is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, crucial for the performance of those learners. We discuss technical aspects, but also the applicability in terms of types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating feature selection. Finally, we demonstrate the versatility of the guided shrinkage methodology by showing how to `do-it-yourself': we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving feature selection in genetics studies."}, "https://arxiv.org/abs/2405.04928": {"title": "A joint model for DHS and MICS surveys: Spatial modeling with anonymized locations", "link": "https://arxiv.org/abs/2405.04928", "description": "arXiv:2405.04928v1 Announce Type: new \nAbstract: Anonymizing the GPS locations of observations can bias a spatial model's parameter estimates and attenuate spatial predictions when improperly accounted for, and is relevant in applications from public health to paleoseismology. In this work, we demonstrate that a newly introduced method for geostatistical modeling in the presence of anonymized point locations can be extended to account for more general kinds of positional uncertainty due to location anonymization, including both jittering (a form of random perturbations of GPS coordinates) and geomasking (reporting only the name of the area containing the true GPS coordinates). We further provide a numerical integration scheme that flexibly accounts for the positional uncertainty as well as spatial and covariate information.\n We apply the method to women's secondary education completion data in the 2018 Nigeria demographic and health survey (NDHS) containing jittered point locations, and the 2016 Nigeria multiple indicator cluster survey (NMICS) containing geomasked locations. We show that accounting for the positional uncertainty in the surveys can improve predictions in terms of their continuous rank probability score."}, "https://arxiv.org/abs/2405.04973": {"title": "SVARs with breaks: Identification and inference", "link": "https://arxiv.org/abs/2405.04973", "description": "arXiv:2405.04973v1 Announce Type: new \nAbstract: In this paper we propose a class of structural vector autoregressions (SVARs) characterized by structural breaks (SVAR-WB). 
Together with standard restrictions on the parameters and on functions of them, we also consider constraints across the different regimes. Such constraints can be either (a) in the form of stability restrictions, indicating that not all the parameters or impulse responses are subject to structural changes, or (b) in terms of inequalities regarding particular characteristics of the SVAR-WB across the regimes. We show that all these kinds of restrictions provide benefits in terms of identification. We derive conditions for point and set identification of the structural parameters of the SVAR-WB, mixing equality, sign, rank and stability restrictions, as well as constraints on forecast error variances (FEVs). As point identification, when achieved, holds locally but not globally, there will be a set of isolated structural parameters that are observationally equivalent in the parametric space. In this respect, both common frequentist and Bayesian approaches produce unreliable inference as the former focuses on just one of these observationally equivalent points, while the latter suffers from a non-vanishing sensitivity to the prior. To overcome these issues, we propose alternative approaches for estimation and inference that account for all admissible observationally equivalent structural parameters. Moreover, we develop a pure Bayesian and a robust Bayesian approach for doing inference in set-identified SVAR-WBs. Both the identification theory and the inference procedures are illustrated through a set of examples and an empirical application on the transmission of US monetary policy over the great inflation and great moderation regimes."}, "https://arxiv.org/abs/2405.05119": {"title": "Combining Rollout Designs and Clustering for Causal Inference under Low-order Interference", "link": "https://arxiv.org/abs/2405.05119", "description": "arXiv:2405.05119v1 Announce Type: new \nAbstract: Estimating causal effects under interference is pertinent to many real-world settings. However, the true interference network may be unknown to the practitioner, precluding many existing techniques that leverage this information. A recent line of work with low-order potential outcomes models uses staggered rollout designs to obtain unbiased estimators that require no network information. However, their use of polynomial extrapolation can lead to prohibitively high variance. To address this, we propose a two-stage experimental design that restricts treatment rollout to a sub-population. We analyze the bias and variance of an interpolation-style estimator under this experimental design. Through numerical simulations, we explore the trade-off between the error attributable to the subsampling of our experimental design and the extrapolation of the estimator. Under low-order interaction models with degree greater than 1, the proposed design greatly reduces the error of the polynomial interpolation estimator, such that it outperforms baseline estimators, especially when the treatment probability is small."}, "https://arxiv.org/abs/2405.05121": {"title": "A goodness-of-fit diagnostic for count data derived from half-normal plots with a simulated envelope", "link": "https://arxiv.org/abs/2405.05121", "description": "arXiv:2405.05121v1 Announce Type: new \nAbstract: Traditional methods of model diagnostics may include a plethora of graphical techniques based on residual analysis, as well as formal tests (e.g. Shapiro-Wilk test for normality and Bartlett test for homogeneity of variance). 
In this paper we derive a new distance metric based on the half-normal plot with a simulation envelope, a graphical model evaluation method, and investigate its properties through simulation studies. The proposed metric can help to assess the fit of a given model, and also act as a model selection criterion by being comparable across models, whether based or not on a true likelihood. More specifically, it quantitatively encompasses the model evaluation principles and removes the subjective bias when closely related models are involved. We validate the technique by means of an extensive simulation study carried out using count data, and illustrate with two case studies in ecology and fisheries research."}, "https://arxiv.org/abs/2405.05139": {"title": "Multivariate group sequential tests for global summary statistics", "link": "https://arxiv.org/abs/2405.05139", "description": "arXiv:2405.05139v1 Announce Type: new \nAbstract: We describe group sequential tests which efficiently incorporate information from multiple endpoints allowing for early stopping at pre-planned interim analyses. We formulate a testing procedure where several outcomes are examined, and interim decisions are based on a global summary statistic. An error spending approach to this problem is defined which allows for unpredictable group sizes and nuisance parameters such as the correlation between endpoints. We present and compare three methods for implementation of the testing procedure including numerical integration, the Delta approximation and Monte Carlo simulation. In our evaluation, numerical integration techniques performed best for implementation with error rate calculations accurate to five decimal places. Our proposed testing method is flexible and accommodates summary statistics derived from general, non-linear functions of endpoints informed by the statistical model. Type 1 error rates are controlled, and sample size calculations can easily be performed to satisfy power requirements."}, "https://arxiv.org/abs/2405.05220": {"title": "Causal Duration Analysis with Diff-in-Diff", "link": "https://arxiv.org/abs/2405.05220", "description": "arXiv:2405.05220v1 Announce Type: new \nAbstract: In economic program evaluation, it is common to obtain panel data in which outcomes are indicators that an individual has reached an absorbing state. For example, they may indicate whether an individual has exited a period of unemployment, passed an exam, left a marriage, or had their parole revoked. The parallel trends assumption that underpins difference-in-differences generally fails in such settings. We suggest identifying conditions that are analogous to those of difference-in-differences but apply to hazard rates rather than mean outcomes. These alternative assumptions motivate estimators that retain the simplicity and transparency of standard diff-in-diff, and we suggest analogous specification tests. Our approach can be adapted to general linear restrictions between the hazard rates of different groups, motivating duration analogues of the triple differences and synthetic control methods. 
We apply our procedures to examine the impact of a policy that increased the generosity of unemployment benefits, using a cross-cohort comparison."}, "https://arxiv.org/abs/2405.04711": {"title": "Community detection in multi-layer bipartite networks", "link": "https://arxiv.org/abs/2405.04711", "description": "arXiv:2405.04711v1 Announce Type: cross \nAbstract: The problem of community detection in multi-layer undirected networks has received considerable attention in recent years. However, practical scenarios often involve multi-layer bipartite networks, where each layer consists of two distinct types of nodes. Existing community detection algorithms tailored for multi-layer undirected networks are not directly applicable to multi-layer bipartite networks. To address this challenge, this paper introduces a novel multi-layer degree-corrected stochastic co-block model specifically designed to capture the underlying community structure within multi-layer bipartite networks. Within this framework, we propose an efficient debiased spectral co-clustering algorithm for detecting nodes' communities. We establish the consistent estimation property of our proposed algorithm and demonstrate that an increased number of layers in bipartite networks improves the accuracy of community detection. Through extensive numerical experiments, we showcase the superior performance of our algorithm compared to existing methods. Additionally, we validate our algorithm by applying it to real-world multi-layer network datasets, yielding meaningful and insightful results."}, "https://arxiv.org/abs/2405.04715": {"title": "Causality Pursuit from Heterogeneous Environments via Neural Adversarial Invariance Learning", "link": "https://arxiv.org/abs/2405.04715", "description": "arXiv:2405.04715v1 Announce Type: cross \nAbstract: Statistics suffers from a fundamental problem, \"the curse of endogeneity\" -- the regression function, or more broadly the prediction risk minimizer with infinite data, may not be the target we wish to pursue. This is because when complex data are collected from multiple sources, the biases deviated from the interested (causal) association inherited in individuals or sub-populations are not expected to be canceled. Traditional remedies are of hindsight and restrictive in being tailored to prior knowledge like untestable cause-effect structures, resulting in methods that risk model misspecification and lack scalable applicability. This paper seeks to offer a purely data-driven and universally applicable method that only uses the heterogeneity of the biases in the data rather than following pre-offered commandments. Such an idea is formulated as a nonparametric invariance pursuit problem, whose goal is to unveil the invariant conditional expectation $m^\\star(x)\\equiv \\mathbb{E}[Y^{(e)}|X_{S^\\star}^{(e)}=x_{S^\\star}]$ with unknown important variable set $S^\\star$ across heterogeneous environments $e\\in \\mathcal{E}$. Under the structural causal model framework, $m^\\star$ can be interpreted as certain data-driven causality in general. The paper contributes to proposing a novel framework, called Focused Adversarial Invariance Regularization (FAIR), formulated as a single minimax optimization program that can solve the general invariance pursuit problem. As illustrated by the unified non-asymptotic analysis, our adversarial estimation framework can attain provable sample-efficient estimation akin to standard regression under a minimal identification condition for various tasks and models. 
As an application, the FAIR-NN estimator realized by two Neural Network classes is highlighted as the first approach to attain statistically efficient estimation in general nonparametric invariance learning."}, "https://arxiv.org/abs/2405.04919": {"title": "Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression", "link": "https://arxiv.org/abs/2405.04919", "description": "arXiv:2405.04919v1 Announce Type: cross \nAbstract: We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one needs to fit $(k+1)$-NN regression only once, and does not need to repeat training and validation of $k$-NN regression as many times as there are training data. Numerical experiments confirm the validity of the fast computation method."}, "https://arxiv.org/abs/1905.04028": {"title": "Demand and Welfare Analysis in Discrete Choice Models with Social Interactions", "link": "https://arxiv.org/abs/1905.04028", "description": "arXiv:1905.04028v2 Announce Type: replace \nAbstract: Many real-life settings of consumer-choice involve social interactions, causing targeted policies to have spillover-effects. This paper develops novel empirical tools for analyzing demand and welfare-effects of policy-interventions in binary choice settings with social interactions. Examples include subsidies for health-product adoption and vouchers for attending a high-achieving school. We establish the connection between econometrics of large games and Brock-Durlauf-type interaction models, under both I.I.D. and spatially correlated unobservables. We develop new convergence results for associated beliefs and estimates of preference-parameters under increasing-domain spatial asymptotics. Next, we show that even with fully parametric specifications and unique equilibrium, choice data, which are sufficient for counterfactual demand-prediction under interactions, are insufficient for welfare-calculations. This is because distinct underlying mechanisms producing the same interaction coefficient can imply different welfare-effects and deadweight-loss from a policy-intervention. Standard index-restrictions imply distribution-free bounds on welfare. We illustrate our results using experimental data on mosquito-net adoption in rural Kenya."}, "https://arxiv.org/abs/2106.10503": {"title": "Robust Bayesian Modeling of Counts with Zero inflation and Outliers: Theoretical Robustness and Efficient Computation", "link": "https://arxiv.org/abs/2106.10503", "description": "arXiv:2106.10503v2 Announce Type: replace \nAbstract: Count data with zero inflation and large outliers are ubiquitous in many scientific applications. However, posterior analysis under a standard statistical model, such as Poisson or negative binomial distribution, is sensitive to such contamination. This study introduces a novel framework for Bayesian modeling of counts that is robust to both zero inflation and large outliers. In doing so, we introduce a rescaled beta distribution and adopt it to absorb undesirable effects from zero and outlying counts. 
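The $(k+1)^2/k^2$ identity in the k-NN LOOCV abstract above is simple to verify numerically; the following is a minimal Python sketch on synthetic data (illustrative code, not from the paper), assuming continuous covariates so that nearest-neighbour ties essentially never occur:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
X = rng.uniform(size=(n, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=n)

D = np.abs(X - X[:, 0])  # pairwise distance matrix, shape (n, n)

# Direct LOOCV for k-NN regression: predict y_i from the k nearest points excluding i.
loocv_errors = []
for i in range(n):
    neighbours = [j for j in np.argsort(D[i]) if j != i][:k]
    loocv_errors.append((y[i] - y[neighbours].mean()) ** 2)

# (k+1)-NN regression evaluated on the training data (each point is its own nearest neighbour).
train_errors = []
for i in range(n):
    neighbours = np.argsort(D[i])[:k + 1]
    train_errors.append((y[i] - y[neighbours].mean()) ** 2)

lhs = np.mean(loocv_errors)
rhs = np.mean(train_errors) * (k + 1) ** 2 / k ** 2
print(lhs, rhs)  # the two values agree up to floating-point error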
The proposed approach has two appealing features: the efficiency of the posterior computation via a custom Gibbs sampling algorithm and a theoretically guaranteed posterior robustness, where extreme outliers are automatically removed from the posterior distribution. We demonstrate the usefulness of the proposed method by applying it to trend filtering and spatial modeling using predictive Gaussian processes."}, "https://arxiv.org/abs/2208.00174": {"title": "Bump hunting through density curvature features", "link": "https://arxiv.org/abs/2208.00174", "description": "arXiv:2208.00174v3 Announce Type: replace \nAbstract: Bump hunting deals with finding in sample spaces meaningful data subsets known as bumps. These have traditionally been conceived as modal or concave regions in the graph of the underlying density function. We define an abstract bump construct based on curvature functionals of the probability density. Then, we explore several alternative characterizations involving derivatives up to second order. In particular, a suitable implementation of Good and Gaskins' original concave bumps is proposed in the multivariate case. Moreover, we bring to exploratory data analysis concepts like the mean curvature and the Laplacian that have produced good results in applied domains. Our methodology addresses the approximation of the curvature functional with a plug-in kernel density estimator. We provide theoretical results that assure the asymptotic consistency of bump boundaries in the Hausdorff distance with affordable convergence rates. We also present asymptotically valid and consistent confidence regions bounding curvature bumps. The theory is illustrated through several use cases in sports analytics with datasets from the NBA, MLB and NFL. We conclude that the different curvature instances effectively combine to generate insightful visualizations."}, "https://arxiv.org/abs/2210.02599": {"title": "The Local to Unity Dynamic Tobit Model", "link": "https://arxiv.org/abs/2210.02599", "description": "arXiv:2210.02599v3 Announce Type: replace \nAbstract: This paper considers highly persistent time series that are subject to nonlinearities in the form of censoring or an occasionally binding constraint, such as are regularly encountered in macroeconomics. A tractable candidate model for such series is the dynamic Tobit with a root local to unity. We show that this model generates a process that converges weakly to a non-standard limiting process, that is constrained (regulated) to be positive. Surprisingly, despite the presence of censoring, the OLS estimators of the model parameters are consistent. We show that this allows OLS-based inferences to be drawn on the overall persistence of the process (as measured by the sum of the autoregressive coefficients), and for the null of a unit root to be tested in the presence of censoring. Our simulations illustrate that the conventional ADF test substantially over-rejects when the data is generated by a dynamic Tobit with a unit root, whereas our proposed test is correctly sized. 
We provide an application of our methods to testing for a unit root in the Swiss franc / euro exchange rate, during a period when this was subject to an occasionally binding lower bound."}, "https://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "https://arxiv.org/abs/2304.03853", "description": "arXiv:2304.03853v5 Announce Type: replace \nAbstract: StepMix is an open-source Python package for the pseudo-likelihood estimation (one-, two- and three-step approaches) of generalized finite mixture models (latent profile and latent class analysis) with external variables (covariates and distal outcomes). In many applications in social sciences, the main objective is not only to cluster individuals into latent classes, but also to use these classes to develop more complex statistical models. These models generally divide into a measurement model that relates the latent classes to observed indicators, and a structural model that relates covariates and outcome variables to the latent classes. The measurement and structural models can be estimated jointly using the so-called one-step approach or sequentially using stepwise methods, which present significant advantages for practitioners regarding the interpretability of the estimated latent classes. In addition to the one-step approach, StepMix implements the most important stepwise estimation methods from the literature, including the bias-adjusted three-step methods with Bolk-Croon-Hagenaars and maximum likelihood corrections and the more recent two-step approach. These pseudo-likelihood estimators are presented in this paper under a unified framework as specific expectation-maximization subroutines. To facilitate and promote their adoption among the data science community, StepMix follows the object-oriented design of the scikit-learn library and provides an additional R wrapper."}, "https://arxiv.org/abs/2304.07726": {"title": "Bayesian Causal Synthesis for Meta-Inference on Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2304.07726", "description": "arXiv:2304.07726v2 Announce Type: replace \nAbstract: The estimation of heterogeneous treatment effects in the potential outcome setting is biased when there exists model misspecification or unobserved confounding. As these biases are unobservable, what model to use when remains a critical open question. In this paper, we propose a novel Bayesian methodology to mitigate misspecification and improve estimation via a synthesis of multiple causal estimates, which we call Bayesian causal synthesis. Our development is built upon identifying a synthesis function that correctly specifies the heterogeneous treatment effect under no unobserved confounding, and achieves the irreducible bias under unobserved confounding. We show that our proposed method results in consistent estimates of the heterogeneous treatment effect; either with no bias or with irreducible bias. We provide a computational algorithm for fast posterior sampling. 
Several benchmark simulations and an empirical study highlight the efficacy of the proposed approach compared to existing methodologies, providing improved point and density estimation of the heterogeneous treatment effect, even under unobserved confounding."}, "https://arxiv.org/abs/2305.06465": {"title": "Occam Factor for Random Graphs: Erd\\\"{o}s-R\\'{e}nyi, Independent Edge, and Rank-1 Stochastic Blockmodel", "link": "https://arxiv.org/abs/2305.06465", "description": "arXiv:2305.06465v4 Announce Type: replace \nAbstract: We investigate the evidence/flexibility (i.e., \"Occam\") paradigm and demonstrate the theoretical and empirical consistency of Bayesian evidence for the task of determining an appropriate generative model for network data. This model selection framework involves determining a collection of candidate models, equipping each of these models' parameters with prior distributions derived via the encompassing priors method, and computing or approximating each models' evidence. We demonstrate how such a criterion may be used to select the most suitable model among the Erd\\\"{o}s-R\\'{e}nyi (ER) model, independent edge (IE) model, and rank-1 stochastic blockmodel (SBM). The Erd\\\"{o}s-R\\'{e}nyi may be considered as being linearly nested within IE, a fact which permits exponential family results. The rank-1 SBM is not so ideal, so we propose a numerical method to approximate its evidence. We apply this paradigm to brain connectome data. Future work necessitates deriving and equipping additional candidate random graph models with appropriate priors so they may be included in the paradigm."}, "https://arxiv.org/abs/2307.01284": {"title": "Does regional variation in wage levels identify the effects of a national minimum wage?", "link": "https://arxiv.org/abs/2307.01284", "description": "arXiv:2307.01284v3 Announce Type: replace \nAbstract: I evaluate the performance of estimators that exploit regional variation in wage levels to identify the employment and wage effects of national minimum wage laws. For the \"effective minimum wage\" design, I show that the identification assumptions in Lee (1999) are difficult to satisfy in settings without regional minimum wages. For the \"fraction affected\" design, I show that economic factors such as skill-biased technical change or regional convergence may cause parallel trends violations and should be investigated using pre-treatment data. I also show that this design is subject to misspecification biases that are not easily solved with changes in specification."}, "https://arxiv.org/abs/2307.03317": {"title": "Fitted value shrinkage", "link": "https://arxiv.org/abs/2307.03317", "description": "arXiv:2307.03317v5 Announce Type: replace \nAbstract: We propose a penalized least-squares method to fit the linear regression model with fitted values that are invariant to invertible linear transformations of the design matrix. This invariance is important, for example, when practitioners have categorical predictors and interactions. Our method has the same computational cost as ridge-penalized least squares, which lacks this invariance. We derive the expected squared distance between the vector of population fitted values and its shrinkage estimator as well as the tuning parameter value that minimizes this expectation. In addition to using cross validation, we construct two estimators of this optimal tuning parameter value and study their asymptotic properties. 
Our numerical experiments and data examples show that our method performs similarly to ridge-penalized least-squares."}, "https://arxiv.org/abs/2307.05825": {"title": "Bayesian taut splines for estimating the number of modes", "link": "https://arxiv.org/abs/2307.05825", "description": "arXiv:2307.05825v3 Announce Type: replace \nAbstract: The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts."}, "https://arxiv.org/abs/2308.00836": {"title": "Differentially Private Linear Regression with Linked Data", "link": "https://arxiv.org/abs/2308.00836", "description": "arXiv:2308.00836v2 Announce Type: replace \nAbstract: There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data."}, "https://arxiv.org/abs/2309.08707": {"title": "Fixed-b Asymptotics for Panel Models with Two-Way Clustering", "link": "https://arxiv.org/abs/2309.08707", "description": "arXiv:2309.08707v3 Announce Type: replace \nAbstract: This paper studies a cluster robust variance estimator proposed by Chiang, Hansen and Sasaki (2024) for linear panels. 
First, we show algebraically that this variance estimator (CHS estimator, hereafter) is a linear combination of three common variance estimators: the one-way unit cluster estimator, the \"HAC of averages\" estimator, and the \"average of HACs\" estimator. Based on this finding, we obtain a fixed-b asymptotic result for the CHS estimator and corresponding test statistics as the cross-section and time sample sizes jointly go to infinity. Furthermore, we propose two simple bias-corrected versions of the variance estimator and derive the fixed-b limits. In a simulation study, we find that the two bias-corrected variance estimators along with fixed-b critical values provide improvements in finite sample coverage probabilities. We illustrate the impact of bias-correction and use of the fixed-b critical values on inference in an empirical example from Thompson (2011) on the relationship between industry profitability and market concentration."}, "https://arxiv.org/abs/2312.07873": {"title": "Bayesian Estimation of Propensity Scores for Integrating Multiple Cohorts with High-Dimensional Covariates", "link": "https://arxiv.org/abs/2312.07873", "description": "arXiv:2312.07873v2 Announce Type: replace \nAbstract: Comparative meta-analyses of groups of subjects by integrating multiple observational studies rely on estimated propensity scores (PSs) to mitigate covariate imbalances. However, PS estimation grapples with the theoretical and practical challenges posed by high-dimensional covariates. Motivated by an integrative analysis of breast cancer patients across seven medical centers, this paper tackles the challenges associated with integrating multiple observational datasets. The proposed inferential technique, called Bayesian Motif Submatrices for Covariates (B-MSC), addresses the curse of dimensionality by a hybrid of Bayesian and frequentist approaches. B-MSC uses nonparametric Bayesian \"Chinese restaurant\" processes to eliminate redundancy in the high-dimensional covariates and discover latent motifs or lower-dimensional structure. With these motifs as potential predictors, standard regression techniques can be utilized to accurately infer the PSs and facilitate covariate-balanced group comparisons. Simulations and meta-analysis of the motivating cancer investigation demonstrate the efficacy of the B-MSC approach to accurately estimate the propensity scores and efficiently address covariate imbalance when integrating observational health studies with high-dimensional covariates."}, "https://arxiv.org/abs/1812.07318": {"title": "Zero-Inflated Autoregressive Conditional Duration Model for Discrete Trade Durations with Excessive Zeros", "link": "https://arxiv.org/abs/1812.07318", "description": "arXiv:1812.07318v4 Announce Type: replace-cross \nAbstract: In finance, durations between successive transactions are usually modeled by the autoregressive conditional duration model based on a continuous distribution omitting zero values. Zero or close-to-zero durations can be caused by either split transactions or independent transactions. We propose a discrete model allowing for excessive zero values based on the zero-inflated negative binomial distribution with score dynamics. This model allows to distinguish between the processes generating split and standard transactions. 
We use the existing theory on score models to establish the invertibility of the score filter and verify that sufficient conditions hold for the consistency and asymptotic normality of the maximum likelihood estimator of the model parameters. In an empirical study, we find that split transactions cause between 92 and 98 percent of zero and close-to-zero values. Furthermore, the loss of decimal places in the proposed approach is less severe than the incorrect treatment of zero values in continuous models."}, "https://arxiv.org/abs/2304.09010": {"title": "Causal Flow-based Variational Auto-Encoder for Disentangled Causal Representation Learning", "link": "https://arxiv.org/abs/2304.09010", "description": "arXiv:2304.09010v4 Announce Type: replace-cross \nAbstract: Disentangled representation learning aims to learn low-dimensional representations of data, where each dimension corresponds to an underlying generative factor. Currently, Variational Auto-Encoders (VAEs) are widely used for disentangled representation learning, with the majority of methods assuming independence among generative factors. However, in real-world scenarios, generative factors typically exhibit complex causal relationships. We thus design a new VAE-based framework named Disentangled Causal Variational Auto-Encoder (DCVAE), which includes a variant of autoregressive flows known as causal flows, capable of learning effective causal disentangled representations. We provide a theoretical analysis of the disentanglement identifiability of DCVAE, ensuring that our model can effectively learn causal disentangled representations. The performance of DCVAE is evaluated on both synthetic and real-world datasets, demonstrating its outstanding capability in achieving causal disentanglement and performing intervention experiments. Moreover, DCVAE exhibits remarkable performance on downstream tasks and has the potential to learn the true causal structure among factors."}, "https://arxiv.org/abs/2311.05532": {"title": "Uncertainty-Aware Bayes' Rule and Its Applications", "link": "https://arxiv.org/abs/2311.05532", "description": "arXiv:2311.05532v2 Announce Type: replace-cross \nAbstract: Bayes' rule has enabled innumerable powerful algorithms of statistical signal processing and statistical machine learning. However, when there exist model misspecifications in prior distributions and/or data distributions, the direct application of Bayes' rule is questionable. Philosophically, the key is to balance the relative importance of prior and data distributions when calculating posterior distributions: if prior (resp. data) distributions are overly conservative, we should upweight the prior belief (resp. data evidence); if prior (resp. data) distributions are overly opportunistic, we should downweight the prior belief (resp. data evidence). This paper derives a generalized Bayes' rule, called uncertainty-aware Bayes' rule, to technically realize the above philosophy, i.e., to combat the model uncertainties in prior distributions and/or data distributions. 
Simulated and real-world experiments on classification and estimation showcase the superiority of the presented uncertainty-aware Bayes' rule over the conventional Bayes' rule: In particular, the uncertainty-aware Bayes classifier, the uncertainty-aware Kalman filter, the uncertainty-aware particle filter, and the uncertainty-aware interactive-multiple-model filter are suggested and validated."}, "https://arxiv.org/abs/2405.05389": {"title": "On foundation of generative statistics with F-entropy: a gradient-based approach", "link": "https://arxiv.org/abs/2405.05389", "description": "arXiv:2405.05389v1 Announce Type: new \nAbstract: This paper explores the interplay between statistics and generative artificial intelligence. Generative statistics, an integral part of the latter, aims to construct models that can {\\it generate} new data efficiently and meaningfully across the whole of the (usually high dimensional) sample space, e.g. a new photo. Within it, the gradient-based approach is a current favourite that exploits effectively, for the above purpose, the information contained in the observed sample, e.g. an old photo. However, often there are missing data in the observed sample, e.g. missing bits in the old photo. To handle this situation, we have proposed a gradient-based algorithm for generative modelling. More importantly, our paper rigorously underpins this powerful approach by introducing a new F-entropy that is related to Fisher's divergence. (The F-entropy is also of independent interest.) The underpinning has enabled the gradient-based approach to expand its scope. Possible future projects include discrete data and Bayesian variational inference."}, "https://arxiv.org/abs/2405.05403": {"title": "A fast and accurate inferential method for complex parametric models: the implicit bootstrap", "link": "https://arxiv.org/abs/2405.05403", "description": "arXiv:2405.05403v1 Announce Type: new \nAbstract: Performing inference, such as computing confidence intervals, is traditionally done, in the parametric case, by first fitting a model and then using the estimates to compute quantities derived at the asymptotic level or by means of simulations such as the ones from the family of bootstrap methods. These methods require the derivation and computation of a consistent estimator, which can be very challenging to obtain when the models are complex, as is the case, for example, when the data are subject to censoring or misclassification errors, or contain outliers. In this paper, we propose a simulation-based inferential method, the implicit bootstrap, that bypasses the need to compute a consistent estimator and can therefore be easily implemented. The implicit bootstrap is transformation respecting, and we show that, under conditions similar to those for the studentized bootstrap but without the need for a consistent estimator, it is first and second order accurate. Using simulation studies, we also show the coverage accuracy of the method with data settings for which traditional methods are computationally very involved and also lead to poor coverage, especially when the sample size is relatively small. 
Based on these empirical results, we also explore theoretically the case of exact inference."}, "https://arxiv.org/abs/2405.05459": {"title": "Estimation and Inference for Change Points in Functional Regression Time Series", "link": "https://arxiv.org/abs/2405.05459", "description": "arXiv:2405.05459v1 Announce Type: new \nAbstract: In this paper, we study the estimation and inference of change points under a functional linear regression model with changes in the slope function. We present a novel Functional Regression Binary Segmentation (FRBS) algorithm which is computationally efficient as well as achieving consistency in multiple change point detection. This algorithm utilizes the predictive power of piece-wise constant functional linear regression models in the reproducing kernel Hilbert space framework. We further propose a refinement step that improves the localization rate of the initial estimator output by FRBS, and derive asymptotic distributions of the refined estimators for two different regimes determined by the magnitude of a change. To facilitate the construction of confidence intervals for underlying change points based on the limiting distribution, we propose a consistent block-type long-run variance estimator. Our theoretical justifications for the proposed approach accommodate temporal dependence and heavy-tailedness in both the functional covariates and the measurement errors. Empirical effectiveness of our methodology is demonstrated through extensive simulation studies and an application to the Standard and Poor's 500 index dataset."}, "https://arxiv.org/abs/2405.05534": {"title": "Sequential Validation of Treatment Heterogeneity", "link": "https://arxiv.org/abs/2405.05534", "description": "arXiv:2405.05534v1 Announce Type: new \nAbstract: We use the martingale construction of Luedtke and van der Laan (2016) to develop tests for the presence of treatment heterogeneity. The resulting sequential validation approach can be instantiated using various validation metrics, such as BLPs, GATES, QINI curves, etc., and provides an alternative to cross-validation-like cross-fold application of these metrics."}, "https://arxiv.org/abs/2405.05638": {"title": "An Efficient Finite Difference Approximation via a Double Sample-Recycling Approach", "link": "https://arxiv.org/abs/2405.05638", "description": "arXiv:2405.05638v1 Announce Type: new \nAbstract: Estimating stochastic gradients is pivotal in fields like service systems within operations research. The classical method for this estimation is the finite difference approximation, which entails generating samples at perturbed inputs. Nonetheless, practical challenges persist in determining the perturbation and obtaining an optimal finite difference estimator in the sense of possessing the smallest mean squared error (MSE). To tackle this problem, we propose a double sample-recycling approach in this paper. Firstly, pilot samples are recycled to estimate the optimal perturbation. Secondly, recycling these pilot samples again and generating new samples at the estimated perturbation, lead to an efficient finite difference estimator. We analyze its bias, variance and MSE. Our analyses demonstrate a reduction in asymptotic variance, and in some cases, a decrease in asymptotic bias, compared to the optimal finite difference estimator. Therefore, our proposed estimator consistently coincides with, or even outperforms the optimal finite difference estimator. 
In numerical experiments, we apply the estimator to several examples, and the numerical results demonstrate its robustness, as well as its agreement with the theory presented, especially in the case of small sample sizes."}, "https://arxiv.org/abs/2405.05730": {"title": "Change point localisation and inference in fragmented functional data", "link": "https://arxiv.org/abs/2405.05730", "description": "arXiv:2405.05730v1 Announce Type: new \nAbstract: We study the problem of change point localisation and inference for sequentially collected fragmented functional data, where each curve is observed only over discrete grids randomly sampled over a short fragment. The sequence of underlying covariance functions is assumed to be piecewise constant, with changes happening at unknown time points. To localise the change points, we propose a computationally efficient fragmented functional dynamic programming (FFDP) algorithm with consistent change point localisation rates. With an extra step of local refinement, we derive the limiting distributions for the refined change point estimators in two different regimes where the minimal jump size vanishes and where it remains constant as the sample size diverges. Such results are the first of their kind in the fragmented functional data literature. As a byproduct of independent interest, we also present a non-asymptotic result on the estimation error of the covariance function estimators over intervals with change points, inspired by Lin et al. (2021). Our result accounts for the effects of the sampling grid size within each fragment under novel identifiability conditions. Extensive numerical studies are also provided to support our theoretical results."}, "https://arxiv.org/abs/2405.05759": {"title": "Advancing Distribution Decomposition Methods Beyond Common Supports: Applications to Racial Wealth Disparities", "link": "https://arxiv.org/abs/2405.05759", "description": "arXiv:2405.05759v1 Announce Type: new \nAbstract: I generalize state-of-the-art approaches that decompose differences in the distribution of a variable of interest between two groups into a portion explained by covariates and a residual portion. The method that I propose relaxes the overlapping supports assumption, allowing the groups being compared to not necessarily share exactly the same covariate support. I illustrate my method by revisiting the black-white wealth gap in the U.S. as a function of labor income and other variables. Traditionally used decomposition methods would trim (or assign zero weight to) observations that lie outside the common covariate support region. On the other hand, by allowing all observations to contribute to the existing wealth gap, I find that otherwise trimmed observations contribute from 3% to 19% to the overall wealth gap, at different portions of the wealth distribution."}, "https://arxiv.org/abs/2405.05773": {"title": "Parametric Analysis of Bivariate Current Status data with Competing risks using Frailty model", "link": "https://arxiv.org/abs/2405.05773", "description": "arXiv:2405.05773v1 Announce Type: new \nAbstract: Shared and correlated Gamma frailty models are widely used in the literature to model the association in multivariate current status data. In this paper, we have proposed two other new Gamma frailty models, namely shared cause-specific and correlated cause-specific Gamma frailty, to capture association in bivariate current status data with competing risks. 
We have investigated the identifiability of the bivariate models with competing risks for each of the four frailty variables. We have considered maximum likelihood estimation of the model parameters. Thorough simulation studies have been performed to study the finite sample behaviour of the estimated parameters. Also, we have analyzed a real data set on hearing loss in two ears using Exponential type and Weibull type cause-specific baseline hazard functions with the four different Gamma frailty variables and compare the fits using AIC."}, "https://arxiv.org/abs/2405.05781": {"title": "Nonparametric estimation of a future entry time distribution given the knowledge of a past state occupation in a progressive multistate model with current status data", "link": "https://arxiv.org/abs/2405.05781", "description": "arXiv:2405.05781v1 Announce Type: new \nAbstract: Case-I interval-censored (current status) data from multistate systems are often encountered in cancer and other epidemiological studies. In this article, we focus on the problem of estimating state entry distribution and occupation probabilities, contingent on a preceding state occupation. This endeavor is particularly complex owing to the inherent challenge of the unavailability of directly observed counts of individuals at risk of transitioning from a state, due to the cross-sectional nature of the data. We propose two nonparametric approaches, one using the fractional at-risk set approach recently adopted in the right-censoring framework and the other a new estimator based on the ratio of marginal state occupation probabilities. Both estimation approaches utilize innovative applications of concepts from the competing risks paradigm. The finite-sample behavior of the proposed estimators is studied via extensive simulation studies where we show that the estimators based on severely censored current status data have good performance when compared with those based on complete data. We demonstrate the application of the two methods to analyze data from patients diagnosed with breast cancer."}, "https://arxiv.org/abs/2405.05868": {"title": "Trustworthy Dimensionality Reduction", "link": "https://arxiv.org/abs/2405.05868", "description": "arXiv:2405.05868v1 Announce Type: new \nAbstract: Different unsupervised models for dimensionality reduction like PCA, LLE, Shannon's mapping, tSNE, UMAP, etc. work on different principles, hence, they are difficult to compare on the same ground. Although they are usually good for visualisation purposes, they can produce spurious patterns that are not present in the original data, losing its trustability (or credibility). On the other hand, information about some response variable (or knowledge of class labels) allows us to do supervised dimensionality reduction such as SIR, SAVE, etc. which work to reduce the data dimension without hampering its ability to explain the particular response at hand. Therefore, the reduced dataset cannot be used to further analyze its relationship with some other kind of responses, i.e., it loses its generalizability. To make a better dimensionality reduction algorithm with a better balance between these two, we shall formally describe the mathematical model used by dimensionality reduction algorithms and provide two indices to measure these intuitive concepts such as trustability and generalizability. Then, we propose a Localized Skeletonization and Dimensionality Reduction (LSDR) algorithm which approximately achieves optimality in both these indices to some extent. 
The proposed algorithm has been compared with state-of-the-art algorithms such as tSNE and UMAP and is found to be better overall in preserving global structure while retaining useful local information as well. We also propose some of the possible extensions of LSDR which could make this algorithm universally applicable for various types of data similar to tSNE and UMAP."}, "https://arxiv.org/abs/2405.05419": {"title": "Decompounding Under General Mixing Distributions", "link": "https://arxiv.org/abs/2405.05419", "description": "arXiv:2405.05419v1 Announce Type: cross \nAbstract: This study focuses on statistical inference for compound models of the form $X=\\xi_1+\\ldots+\\xi_N$, where $N$ is a random variable denoting the count of summands, which are independent and identically distributed (i.i.d.) random variables $\\xi_1, \\xi_2, \\ldots$. The paper addresses the problem of reconstructing the distribution of $\\xi$ from observed samples of $X$'s distribution, a process referred to as decompounding, with the assumption that $N$'s distribution is known. This work diverges from the conventional scope by not limiting $N$'s distribution to the Poisson type, thus embracing a broader context. We propose a nonparametric estimate for the density of $\\xi$, derive its rates of convergence and prove that these rates are minimax optimal for suitable classes of distributions for $\\xi$ and $N$. Finally, we illustrate the numerical performance of the algorithm on simulated examples."}, "https://arxiv.org/abs/2405.05596": {"title": "Measuring Strategization in Recommendation: Users Adapt Their Behavior to Shape Future Content", "link": "https://arxiv.org/abs/2405.05596", "description": "arXiv:2405.05596v1 Announce Type: cross \nAbstract: Most modern recommendation algorithms are data-driven: they generate personalized recommendations by observing users' past behaviors. A common assumption in recommendation is that how a user interacts with a piece of content (e.g., whether they choose to \"like\" it) is a reflection of the content, but not of the algorithm that generated it. Although this assumption is convenient, it fails to capture user strategization: that users may attempt to shape their future recommendations by adapting their behavior to the recommendation algorithm. In this work, we test for user strategization by conducting a lab experiment and survey. To capture strategization, we adopt a model in which strategic users select their engagement behavior based not only on the content, but also on how their behavior affects downstream recommendations. Using a custom music player that we built, we study how users respond to different information about their recommendation algorithm as well as to different incentives about how their actions affect downstream outcomes. We find strong evidence of strategization across outcome metrics, including participants' dwell time and use of \"likes.\" For example, participants who are told that the algorithm mainly pays attention to \"likes\" and \"dislikes\" use those functions 1.9x more than participants told that the algorithm mainly pays attention to dwell time. A close analysis of participant behavior (e.g., in response to our incentive conditions) rules out experimenter demand as the main driver of these trends. Further, in our post-experiment survey, nearly half of participants self-report strategizing \"in the wild,\" with some stating that they ignore content they actually like to avoid over-recommendation of that content in the future. 
Together, our findings suggest that user strategization is common and that platforms cannot ignore the effect of their algorithms on user behavior."}, "https://arxiv.org/abs/2405.05656": {"title": "Consistent Empirical Bayes estimation of the mean of a mixing distribution without identifiability assumption", "link": "https://arxiv.org/abs/2405.05656", "description": "arXiv:2405.05656v1 Announce Type: cross \nAbstract: Consider a Non-Parametric Empirical Bayes (NPEB) setup. We observe independent $Y_i \\sim f(y|\\theta_i)$, $\\theta_i \\in \\Theta$, where the $\\theta_i \\sim G$ are independent, $i=1,...,n$. The mixing distribution $G$ is unknown, $G \\in \\{G\\}$, with no parametric assumptions about the class $\\{G\\}$. The common NPEB task is to estimate $\\theta_i, \\; i=1,...,n$. Conditions that imply 'optimality' of such NPEB estimators typically require identifiability of $G$ based on $Y_1,...,Y_n$. We consider the task of estimating $E_G \\theta$. We show that `often' consistent estimation of $E_G \\theta$ is implied without identifiability.\n We motivate the latter task, especially in setups with non-response and missing data. We demonstrate consistency in simulations."}, "https://arxiv.org/abs/2405.05969": {"title": "Learned harmonic mean estimation of the Bayesian evidence with normalizing flows", "link": "https://arxiv.org/abs/2405.05969", "description": "arXiv:2405.05969v1 Announce Type: cross \nAbstract: We present the learned harmonic mean estimator with normalizing flows - a robust, scalable and flexible estimator of the Bayesian evidence for model comparison. Since the estimator is agnostic to sampling strategy and simply requires posterior samples, it can be applied to compute the evidence using any Markov chain Monte Carlo (MCMC) sampling technique, including saved down MCMC chains, or any variational inference approach. The learned harmonic mean estimator was recently introduced, where machine learning techniques were developed to learn a suitable internal importance sampling target distribution to solve the issue of exploding variance of the original harmonic mean estimator. In this article we present the use of normalizing flows as the internal machine learning technique within the learned harmonic mean estimator. Normalizing flows can be elegantly coupled with the learned harmonic mean to provide an approach that is more robust, flexible and scalable than the machine learning models considered previously. We perform a series of numerical experiments, applying our method to benchmark problems and to a cosmological example in up to 21 dimensions. We find the learned harmonic mean estimator is in agreement with ground truth values and nested sampling estimates. The open-source harmonic Python package implementing the learned harmonic mean, now with normalizing flows included, is publicly available."}, "https://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "https://arxiv.org/abs/2212.04620", "description": "arXiv:2212.04620v3 Announce Type: replace \nAbstract: Production functions are potentially misspecified when revenue is used as a proxy for output. I formalize and strengthen this common knowledge by showing that neither the production function nor Hicks-neutral productivity can be identified with such a revenue proxy. This result obtains when relaxing the standard assumptions used in the literature to allow for imperfect competition. 
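A minimal illustration of the estimation target E_G(theta) in the NPEB abstract above, for the Poisson special case (an example chosen here for concreteness, not the paper's estimator): since E[Y_i] = E[E[Y_i | theta_i]] = E_G(theta), the plain sample mean of the observations is consistent for E_G(theta) without any attempt to identify G.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two different mixing distributions G with the same mean 2.0.
theta_gamma = rng.gamma(shape=4.0, scale=0.5, size=n)  # G1: Gamma(4, 0.5), mean 2.0
theta_two_point = rng.choice([1.0, 3.0], size=n)       # G2: uniform on {1, 3}, mean 2.0

# Poisson observations Y_i | theta_i ~ Poisson(theta_i).
y1 = rng.poisson(theta_gamma)
y2 = rng.poisson(theta_two_point)

print(y1.mean(), y2.mean())  # both sample means are close to E_G(theta) = 2.0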
It holds for a large class of production functions, including all commonly used parametric forms. Among the prevalent approaches to address this issue, only those that impose assumptions on the underlying demand system can possibly identify the production function."}, "https://arxiv.org/abs/2305.19885": {"title": "Reliability analysis of arbitrary systems based on active learning and global sensitivity analysis", "link": "https://arxiv.org/abs/2305.19885", "description": "arXiv:2305.19885v2 Announce Type: replace \nAbstract: System reliability analysis aims at computing the probability of failure of an engineering system given a set of uncertain inputs and limit state functions. Active-learning solution schemes have been shown to be a viable tool but as of yet they are not as efficient as in the context of component reliability analysis. This is due to some peculiarities of system problems, such as the presence of multiple failure modes and their uneven contribution to failure, or the dependence on the system configuration (e.g., series or parallel). In this work, we propose a novel active learning strategy designed for solving general system reliability problems. This algorithm combines subset simulation and Kriging/PC-Kriging, and relies on an enrichment scheme tailored to specifically address the weaknesses of this class of methods. More specifically, it relies on three components: (i) a new learning function that does not require the specification of the system configuration, (ii) a density-based clustering technique that allows one to automatically detect the different failure modes, and (iii) sensitivity analysis to estimate the contribution of each limit state to system failure so as to select only the most relevant ones for enrichment. The proposed method is validated on two analytical examples and compared against results gathered in the literature. Finally, a complex engineering problem related to power transmission is solved, thereby showcasing the efficiency of the proposed method in a real-case scenario."}, "https://arxiv.org/abs/2312.06204": {"title": "Multilayer Network Regression with Eigenvector Centrality and Community Structure", "link": "https://arxiv.org/abs/2312.06204", "description": "arXiv:2312.06204v2 Announce Type: replace \nAbstract: In the analysis of complex networks, centrality measures and community structures are two important aspects. For multilayer networks, one crucial task is to integrate information across different layers, especially taking the dependence structure within and between layers into consideration. In this study, we introduce a novel two-stage regression model (CC-MNetR) that leverages the eigenvector centrality and network community structure of fourth-order tensor-like multilayer networks. In particular, we construct community-based centrality measures, which are then incorporated into the regression model. In addition, considering the noise of network data, we analyze the centrality measure with and without measurement errors respectively, and establish the consistent properties of the least squares estimates in the regression. 
Our proposed method is then applied to the World Input-Output Database (WIOD) dataset to explore how input-output network data between different countries and different industries affect the Gross Output of each industry."}, "https://arxiv.org/abs/2312.15624": {"title": "Negative Control Falsification Tests for Instrumental Variable Designs", "link": "https://arxiv.org/abs/2312.15624", "description": "arXiv:2312.15624v2 Announce Type: replace \nAbstract: We develop theoretical foundations for widely used falsification tests for instrumental variable (IV) designs. We characterize these tests as conditional independence tests between negative control variables - proxies for potential threats - and either the IV or the outcome. We find that conventional applications of these falsification tests would flag problems in exogenous IV designs, and propose simple solutions to avoid this. We also propose new falsification tests that incorporate new types of negative control variables or alternative statistical tests. Finally, we illustrate that under stronger assumptions, negative control variables can also be used for bias correction."}, "https://arxiv.org/abs/2212.09706": {"title": "Multiple testing under negative dependence", "link": "https://arxiv.org/abs/2212.09706", "description": "arXiv:2212.09706v4 Announce Type: replace-cross \nAbstract: The multiple testing literature has primarily dealt with three types of dependence assumptions between p-values: independence, positive regression dependence, and arbitrary dependence. In this paper, we provide what we believe are the first theoretical results under various notions of negative dependence (negative Gaussian dependence, negative regression dependence, negative association, negative orthant dependence and weak negative dependence). These include the Simes global null test and the Benjamini-Hochberg procedure, which are known experimentally to be anti-conservative under negative dependence. The anti-conservativeness of these procedures is bounded by factors smaller than that under arbitrary dependence (in particular, by factors independent of the number of hypotheses). We also provide new results about negatively dependent e-values, and provide several examples as to when negative dependence may arise. Our proofs are elementary and short, thus amenable to extensions."}, "https://arxiv.org/abs/2404.17735": {"title": "Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models", "link": "https://arxiv.org/abs/2404.17735", "description": "arXiv:2404.17735v2 Announce Type: replace-cross \nAbstract: Diffusion probabilistic models (DPMs) have become the state-of-the-art in high-quality image generation. However, DPMs have an arbitrary noisy latent space with no interpretable or controllable semantics. Although there has been significant research effort to improve image sample quality, there is little work on representation-controlled generation using diffusion models. Specifically, causal modeling and controllable counterfactual generation using DPMs is an underexplored area. In this work, we propose CausalDiffAE, a diffusion-based causal representation learning framework to enable counterfactual generation according to a specified causal model. Our key idea is to use an encoder to extract high-level semantically meaningful causal variables from high-dimensional data and model stochastic variation using reverse diffusion. 
We propose a causal encoding mechanism that maps high-dimensional data to causally related latent factors and parameterize the causal mechanisms among latent factors using neural networks. To enforce the disentanglement of causal variables, we formulate a variational objective and leverage auxiliary label information in a prior to regularize the latent space. We propose a DDIM-based counterfactual generation procedure subject to do-interventions. Finally, to address the limited label supervision scenario, we also study the application of CausalDiffAE when a part of the training data is unlabeled, which also enables granular control over the strength of interventions in generating counterfactuals during inference. We empirically show that CausalDiffAE learns a disentangled latent space and is capable of generating high-quality counterfactual images."}, "https://arxiv.org/abs/2405.06135": {"title": "Local Longitudinal Modified Treatment Policies", "link": "https://arxiv.org/abs/2405.06135", "description": "arXiv:2405.06135v1 Announce Type: new \nAbstract: Longitudinal Modified Treatment Policies (LMTPs) provide a framework for defining a broad class of causal target parameters for continuous and categorical exposures. We propose Local LMTPs, a generalization of LMTPs to settings where the target parameter is conditional on subsets of units defined by the treatment or exposure. Such parameters have wide scientific relevance, with well-known parameters such as the Average Treatment Effect on the Treated (ATT) falling within the class. We provide a formal causal identification result that expresses the Local LMTP parameter in terms of sequential regressions, and derive the efficient influence function of the parameter which defines its semi-parametric and local asymptotic minimax efficiency bound. Efficient semi-parametric inference of Local LMTP parameters requires estimating the ratios of functions of complex conditional probabilities (or densities). We propose an estimator for Local LMTP parameters that directly estimates these required ratios via empirical loss minimization, drawing on the theory of Riesz representers. The estimator is implemented using a combination of ensemble machine learning algorithms and deep neural networks, and evaluated via simulation studies. We illustrate in simulation that estimation of the density ratios using Riesz representation might provide more stable estimators in finite samples in the presence of empirical violations of the overlap/positivity assumption."}, "https://arxiv.org/abs/2405.06156": {"title": "A Sharp Test for the Judge Leniency Design", "link": "https://arxiv.org/abs/2405.06156", "description": "arXiv:2405.06156v1 Announce Type: new \nAbstract: We propose a new specification test to assess the validity of the judge leniency design. We characterize a set of sharp testable implications, which exploit all the relevant information in the observed data distribution to detect violations of the judge leniency design assumptions. The proposed sharp test is asymptotically valid and consistent and will not make discordant recommendations. When the judge's leniency design assumptions are rejected, we propose a way to salvage the model using partial monotonicity and exclusion assumptions, under which a variant of the Local Instrumental Variable (LIV) estimand can recover the Marginal Treatment Effect. Simulation studies show our test outperforms existing non-sharp tests by significant margins. 
We apply our test to assess the validity of the judge leniency design using data from Stevenson (2018), and it rejects the validity for three crime categories: robbery, drug selling, and drug possession."}, "https://arxiv.org/abs/2405.06335": {"title": "Bayesian factor zero-inflated Poisson model for multiple grouped count data", "link": "https://arxiv.org/abs/2405.06335", "description": "arXiv:2405.06335v1 Announce Type: new \nAbstract: This paper proposes a computationally efficient Bayesian factor model for multiple grouped count data. Adopting the link function approach, the proposed model can capture the association within and between the at-risk probabilities and Poisson counts over multiple dimensions. The likelihood function for the grouped count data consists of the differences of the cumulative distribution functions evaluated at the endpoints of the groups, defining the probabilities of each data point falling in the groups. The combination of the data augmentation of underlying counts, the P\\'{o}lya-Gamma augmentation to approximate the Poisson distribution, and parameter expansion for the factor components is used to facilitate posterior computing. The efficacy of the proposed factor model is demonstrated using simulated data and real data on the involvement of youths in nineteen illegal activities."}, "https://arxiv.org/abs/2405.06353": {"title": "Next generation clinical trials: Seamless designs and master protocols", "link": "https://arxiv.org/abs/2405.06353", "description": "arXiv:2405.06353v1 Announce Type: new \nAbstract: Background: Drug development is often inefficient, costly and lengthy, yet it is essential for evaluating the safety and efficacy of new interventions. Compared with other disease areas, this is particularly true for Phase II/III cancer clinical trials, where high attrition rates and reduced regulatory approvals are being seen. In response to these challenges, seamless clinical trials and master protocols have emerged to streamline the drug development process. Methods: Seamless clinical trials, characterized by their ability to transition seamlessly from one phase to another, can accelerate the development of promising therapies, while master protocols provide a framework for investigating multiple treatment options and patient subgroups within a single trial. Results: We discuss the advantages of these methods through real trial examples and the principles that lead to their success, while also acknowledging the associated regulatory considerations and challenges. Conclusion: Seamless designs and master protocols have the potential to improve confirmatory clinical trials. In the disease area of cancer, this ultimately means that patients can receive life-saving treatments sooner."}, "https://arxiv.org/abs/2405.06366": {"title": "Accounting for selection biases in population analyses: equivalence of the in-likelihood and post-processing approaches", "link": "https://arxiv.org/abs/2405.06366", "description": "arXiv:2405.06366v1 Announce Type: new \nAbstract: In this paper I show the equivalence, under appropriate assumptions, of two alternative methods to account for the presence of selection biases (also called selection effects) in population studies: one is to include the selection effects in the likelihood directly; the other follows the procedure of first inferring the observed distribution and then removing selection effects a posteriori. 
Moreover, I investigate a potential bias allegedly induced by the latter approach: I show that this procedure, if applied under the appropriate assumptions, does not produce the aforementioned bias."}, "https://arxiv.org/abs/2405.06479": {"title": "Informativeness of Weighted Conformal Prediction", "link": "https://arxiv.org/abs/2405.06479", "description": "arXiv:2405.06479v1 Announce Type: new \nAbstract: Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, it is pointed out in this paper that the effectiveness of WCP heavily relies on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations."}, "https://arxiv.org/abs/2405.06559": {"title": "The landscapemetrics and motif packages for measuring landscape patterns and processes", "link": "https://arxiv.org/abs/2405.06559", "description": "arXiv:2405.06559v1 Announce Type: new \nAbstract: This book chapter emphasizes the significance of categorical raster data in ecological studies, specifically land use or land cover (LULC) data, and highlights the pivotal role of landscape metrics and pattern-based spatial analysis in comprehending environmental patterns and their dynamics. It explores the usage of R packages, particularly landscapemetrics and motif, for quantifying and analyzing landscape patterns using LULC data from three distinct European regions. It showcases the computation, visualization, and comparison of landscape metrics, while also addressing additional features such as patch value extraction, sub-region sampling, and moving window computation. Furthermore, the chapter delves into the intricacies of pattern-based spatial analysis, explaining how spatial signatures are computed and how the motif package facilitates comparisons and clustering of landscape patterns. The chapter concludes by discussing the potential of customization and expansion of the presented tools."}, "https://arxiv.org/abs/2405.06613": {"title": "Simultaneously detecting spatiotemporal changes with penalized Poisson regression models", "link": "https://arxiv.org/abs/2405.06613", "description": "arXiv:2405.06613v1 Announce Type: new \nAbstract: In the realm of large-scale spatiotemporal data, abrupt changes are commonly occurring across both spatial and temporal domains. This study aims to address the concurrent challenges of detecting change points and identifying spatial clusters within spatiotemporal count data. We introduce an innovative method based on the Poisson regression model, employing doubly fused penalization to unveil the underlying spatiotemporal change patterns. To efficiently estimate the model, we present an iterative shrinkage and threshold based algorithm to minimize the doubly penalized likelihood function. We establish the statistical consistency properties of the proposed estimator, confirming its reliability and accuracy. 
Furthermore, we conduct extensive numerical experiments to validate our theoretical findings, thereby highlighting the superior performance of our method when compared to existing competitive approaches."}, "https://arxiv.org/abs/2405.06635": {"title": "Multivariate Interval-Valued Models in Frequentist and Bayesian Schemes", "link": "https://arxiv.org/abs/2405.06635", "description": "arXiv:2405.06635v1 Announce Type: new \nAbstract: In recent years, addressing the challenges posed by massive datasets has led researchers to explore aggregated data, particularly leveraging interval-valued data, akin to traditional symbolic data analysis. While much recent research, with the exception of Samdai et al. (2023) who focused on the bivariate case, has primarily concentrated on parameter estimation in single-variable scenarios, this paper extends such investigations to the multivariate domain for the first time. We derive maximum likelihood (ML) estimators for the parameters and establish their asymptotic distributions. Additionally, we pioneer a theoretical Bayesian framework, previously confined to the univariate setting, for multivariate data. We provide a detailed exposition of the proposed estimators and conduct comparative performance analyses. Finally, we validate the effectiveness of our estimators through simulations and real-world data analysis."}, "https://arxiv.org/abs/2405.06013": {"title": "Variational Inference for Acceleration of SN Ia Photometric Distance Estimation with BayeSN", "link": "https://arxiv.org/abs/2405.06013", "description": "arXiv:2405.06013v1 Announce Type: cross \nAbstract: Type Ia supernovae (SNe Ia) are standardizable candles whose observed light curves can be used to infer their distances, which can in turn be used in cosmological analyses. As the quantity of observed SNe Ia grows with current and upcoming surveys, increasingly scalable analyses are necessary to take full advantage of these new datasets for precise estimation of cosmological parameters. Bayesian inference methods enable fitting SN Ia light curves with robust uncertainty quantification, but traditional posterior sampling using Markov Chain Monte Carlo (MCMC) is computationally expensive. We present an implementation of variational inference (VI) to accelerate the fitting of SN Ia light curves using the BayeSN hierarchical Bayesian model for time-varying SN Ia spectral energy distributions (SEDs). We demonstrate and evaluate its performance on both simulated light curves and data from the Foundation Supernova Survey with two different forms of surrogate posterior -- a multivariate normal and a custom multivariate zero-lower-truncated normal distribution -- and compare them with the Laplace Approximation and full MCMC analysis. To validate our variational approximation, we calculate the Pareto-smoothed importance sampling (PSIS) diagnostic, and perform variational simulation-based calibration (VSBC). The VI approximation achieves similar results to MCMC but with an order-of-magnitude speedup for the inference of the photometric distance moduli. 
Overall, we show that VI is a promising method for scalable parameter inference that enables analysis of larger datasets for precision cosmology."}, "https://arxiv.org/abs/2405.06540": {"title": "Separating States in Astronomical Sources Using Hidden Markov Models: With a Case Study of Flaring and Quiescence on EV Lac", "link": "https://arxiv.org/abs/2405.06540", "description": "arXiv:2405.06540v1 Announce Type: cross \nAbstract: We present a new method to distinguish between different states (e.g., high and low, quiescent and flaring) in astronomical sources with count data. The method models the underlying physical process as latent variables following a continuous-space Markov chain that determines the expected Poisson counts in observed light curves in multiple passbands. For the underlying state process, we consider several autoregressive processes, yielding continuous-space hidden Markov models of varying complexity. Under these models, we can infer the state that the object is in at any given time. The state predictions from these models are then dichotomized with the help of a finite-mixture model to produce state classifications. We apply these techniques to X-ray data from the active dMe flare star EV Lac, splitting the data into quiescent and flaring states. We find that a first-order vector autoregressive process efficiently separates flaring from quiescence: flaring occurs over 30-40% of the observation durations, a well-defined persistent quiescent state can be identified, and the flaring state is characterized by higher temperatures and emission measures."}, "https://arxiv.org/abs/2405.06558": {"title": "Random matrix theory improved Fr\\'echet mean of symmetric positive definite matrices", "link": "https://arxiv.org/abs/2405.06558", "description": "arXiv:2405.06558v1 Announce Type: cross \nAbstract: In this study, we consider the realm of covariance matrices in machine learning, particularly focusing on computing Fr\\'echet means on the manifold of symmetric positive definite matrices, commonly referred to as Karcher or geometric means. Such means are leveraged in numerous machine-learning tasks. Relying on advanced statistical tools, we introduce a random matrix theory-based method that estimates Fr\\'echet means, which is particularly beneficial when dealing with low sample support and a high number of matrices to average. Our experimental evaluation, involving both synthetic and real-world EEG and hyperspectral datasets, shows that we largely outperform state-of-the-art methods."}, "https://arxiv.org/abs/2109.04146": {"title": "Forecasting high-dimensional functional time series with dual-factor structures", "link": "https://arxiv.org/abs/2109.04146", "description": "arXiv:2109.04146v2 Announce Type: replace \nAbstract: We propose a dual-factor model for high-dimensional functional time series (HDFTS) that considers multiple populations. The HDFTS is first decomposed into a collection of functional time series (FTS) in a lower dimension and a group of population-specific basis functions. The system of basis functions describes cross-sectional heterogeneity, while the reduced-dimension FTS retains most of the information common to multiple populations. The low-dimensional FTS is further decomposed into a product of common functional loadings and a matrix-valued time series that contains the most temporal dynamics embedded in the original HDFTS. The proposed general-form dual-factor structure is connected to several commonly used functional factor models. 
We demonstrate the finite-sample performances of the proposed method in recovering cross-sectional basis functions and extracting common features using simulated HDFTS. An empirical study shows that the proposed model produces more accurate point and interval forecasts for subnational age-specific mortality rates in Japan. The financial benefits associated with the improved mortality forecasts are translated into a life annuity pricing scheme."}, "https://arxiv.org/abs/2303.07167": {"title": "When Respondents Don't Care Anymore: Identifying the Onset of Careless Responding", "link": "https://arxiv.org/abs/2303.07167", "description": "arXiv:2303.07167v2 Announce Type: replace \nAbstract: Questionnaires in the behavioral and organizational sciences tend to be lengthy: survey measures comprising hundreds of items are the norm rather than the exception. However, literature suggests that the longer a questionnaire takes, the higher the probability that participants lose interest and start responding carelessly. Consequently, in long surveys a large number of participants may engage in careless responding, posing a major threat to internal validity. We propose a novel method for identifying the onset of careless responding (or an absence thereof) for each participant. It is based on combined measurements of multiple dimensions in which carelessness may manifest, such as inconsistency and invariability. Since a structural break in either dimension is potentially indicative of carelessness, the proposed method searches for evidence for changepoints along the combined measurements. It is highly flexible, based on machine learning, and provides statistical guarantees on its performance. An empirical application on data from a seminal study on the incidence of careless responding reveals that the reported incidence has likely been substantially underestimated due to the presence of respondents that were careless for only parts of the questionnaire. In simulation experiments, we find that the proposed method achieves high reliability in correctly identifying carelessness onset, discriminates well between careless and attentive respondents, and captures a variety of careless response types, even when a large number of careless respondents are present. Furthermore, we provide freely available open source software to enhance accessibility and facilitate adoption by empirical researchers."}, "https://arxiv.org/abs/2101.00726": {"title": "Distributionally robust halfspace depth", "link": "https://arxiv.org/abs/2101.00726", "description": "arXiv:2101.00726v2 Announce Type: replace-cross \nAbstract: Tukey's halfspace depth can be seen as a stochastic program and as such it is not guarded against optimizer's curse, so that a limited training sample may easily result in a poor out-of-sample performance. We propose a generalized halfspace depth concept relying on the recent advances in distributionally robust optimization, where every halfspace is examined using the respective worst-case distribution in the Wasserstein ball of radius $\\delta\\geq 0$ centered at the empirical law. This new depth can be seen as a smoothed and regularized classical halfspace depth which is retrieved as $\\delta\\downarrow 0$. It inherits most of the main properties of the latter and, additionally, enjoys various new attractive features such as continuity and strict positivity beyond the convex hull of the support. We provide numerical illustrations of the new depth and its advantages, and develop some fundamental theory. 
In particular, we study the upper level sets and the median region including their breakdown properties."}, "https://arxiv.org/abs/2307.08079": {"title": "Flexible and efficient spatial extremes emulation via variational autoencoders", "link": "https://arxiv.org/abs/2307.08079", "description": "arXiv:2307.08079v3 Announce Type: replace-cross \nAbstract: Many real-world processes have complex tail dependence structures that cannot be characterized using classical Gaussian processes. More flexible spatial extremes models exhibit appealing extremal dependence properties but are often exceedingly prohibitive to fit and simulate from in high dimensions. In this paper, we aim to push the boundaries on computation and modeling of high-dimensional spatial extremes via integrating a new spatial extremes model that has flexible and non-stationary dependence properties in the encoding-decoding structure of a variational autoencoder called the XVAE. The XVAE can emulate spatial observations and produce outputs that have the same statistical properties as the inputs, especially in the tail. Our approach also provides a novel way of making fast inference with complex extreme-value processes. Through extensive simulation studies, we show that our XVAE is substantially more time-efficient than traditional Bayesian inference while outperforming many spatial extremes models with a stationary dependence structure. Lastly, we analyze a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea, which includes 30 years of daily measurements at 16703 grid cells. We demonstrate how to use XVAE to identify regions susceptible to marine heatwaves under climate change and examine the spatial and temporal variability of the extremal dependence structure."}, "https://arxiv.org/abs/2312.16214": {"title": "Stochastic Equilibrium the Lucas Critique and Keynesian Economics", "link": "https://arxiv.org/abs/2312.16214", "description": "arXiv:2312.16214v2 Announce Type: replace-cross \nAbstract: In this paper, a mathematically rigorous solution overturns existing wisdom regarding New Keynesian Dynamic Stochastic General Equilibrium. I develop a formal concept of stochastic equilibrium. I prove uniqueness and necessity, when agents are patient, with general application. Existence depends on appropriately specified eigenvalue conditions. Otherwise, no solution of any kind exists. I construct the equilibrium with Calvo pricing. I provide novel comparative statics with the non-stochastic model of mathematical significance. I uncover a bifurcation between neighbouring stochastic systems and approximations taken from the Zero Inflation Non-Stochastic Steady State (ZINSS). The correct Phillips curve agrees with the zero limit from the trend inflation framework. It contains a large lagged inflation coefficient and a small response to expected inflation. Price dispersion can be first or second order depending how shocks are scaled. The response to the output gap is always muted and is zero at standard parameters. A neutrality result is presented to explain why and align Calvo with Taylor pricing. Present and lagged demand shocks enter the Phillips curve so there is no Divine Coincidence and the system is identified from structural shocks alone. The lagged inflation slope is increasing in the inflation response, embodying substantive policy trade-offs. The Taylor principle is reversed, inactive settings are necessary, pointing towards inertial policy. 
The observational equivalence idea of the Lucas critique is disproven. The bifurcation results from the breakdown of the constraints implied by lagged nominal rigidity, associated with cross-equation cancellation possible only at ZINSS. There is a dual relationship between restrictions on the econometrician and constraints on repricing firms. Thus, if the model is correct, goodness of fit will jump."}, "https://arxiv.org/abs/2405.06763": {"title": "Post-selection inference for causal effects after causal discovery", "link": "https://arxiv.org/abs/2405.06763", "description": "arXiv:2405.06763v1 Announce Type: new \nAbstract: Algorithms for constraint-based causal discovery select graphical causal models among a space of possible candidates (e.g., all directed acyclic graphs) by executing a sequence of conditional independence tests. These may be used to inform the estimation of causal effects (e.g., average treatment effects) when there is uncertainty about which covariates ought to be adjusted for, or which variables act as confounders versus mediators. However, naively using the data twice, for model selection and estimation, would lead to invalid confidence intervals. Moreover, if the selected graph is incorrect, the inferential claims may apply to a selected functional that is distinct from the actual causal effect. We propose an approach to post-selection inference that is based on a resampling and screening procedure, which essentially performs causal discovery multiple times with randomly varying intermediate test statistics. Then, an estimate of the target causal effect and corresponding confidence sets are constructed from a union of individual graph-based estimates and intervals. We show that this construction has asymptotically correct coverage for the true causal effect parameter. Importantly, the guarantee holds for a fixed population-level effect, not a data-dependent or selection-dependent quantity. Most of our exposition focuses on the PC-algorithm for learning directed acyclic graphs and the multivariate Gaussian case for simplicity, but the approach is general and modular, so it may be used with other conditional independence based discovery algorithms and distributional families."}, "https://arxiv.org/abs/2405.06779": {"title": "Generalization Problems in Experiments Involving Multidimensional Decisions", "link": "https://arxiv.org/abs/2405.06779", "description": "arXiv:2405.06779v1 Announce Type: new \nAbstract: Can the causal effects estimated in experiment be generalized to real-world scenarios? This question lies at the heart of social science studies. External validity primarily assesses whether experimental effects persist across different settings, implicitly presuming the experiment's ecological validity-that is, the consistency of experimental effects with their real-life counterparts. However, we argue that this presumed consistency may not always hold, especially in experiments involving multidimensional decision processes, such as conjoint experiments. We introduce a formal model to elucidate how attention and salience effects lead to three types of inconsistencies between experimental findings and real-world phenomena: amplified effect magnitude, effect sign reversal, and effect importance reversal. We derive testable hypotheses from each theoretical outcome and test these hypotheses using data from various existing conjoint experiments and our own experiments. 
Drawing on our theoretical framework, we propose several recommendations for experimental design aimed at enhancing the generalizability of survey experiment findings."}, "https://arxiv.org/abs/2405.06796": {"title": "The Multiple Change-in-Gaussian-Mean Problem", "link": "https://arxiv.org/abs/2405.06796", "description": "arXiv:2405.06796v1 Announce Type: new \nAbstract: A manuscript version of the chapter \"The Multiple Change-in-Gaussian-Mean Problem\" from the book \"Change-Point Detection and Data Segmentation\" by Fearnhead and Fryzlewicz, currently in preparation. All R code and data to accompany this chapter and the book are gradually being made available through https://github.com/pfryz/cpdds."}, "https://arxiv.org/abs/2405.06799": {"title": "Riemannian Statistics for Any Type of Data", "link": "https://arxiv.org/abs/2405.06799", "description": "arXiv:2405.06799v1 Announce Type: new \nAbstract: This paper introduces a novel approach to statistics and data analysis, departing from the conventional assumption of data residing in Euclidean space to consider a Riemannian Manifold. The challenge lies in the absence of vector space operations on such manifolds. Pennec X. et al. in their book Riemannian Geometric Statistics in Medical Image Analysis proposed analyzing data on Riemannian manifolds through geometry, this approach is effective with structured data like medical images, where the intrinsic manifold structure is apparent. Yet, its applicability to general data lacking implicit local distance notions is limited. We propose a solution to generalize Riemannian statistics for any type of data."}, "https://arxiv.org/abs/2405.06813": {"title": "A note on distance variance for categorical variables", "link": "https://arxiv.org/abs/2405.06813", "description": "arXiv:2405.06813v1 Announce Type: new \nAbstract: This study investigates the extension of distance variance, a validated spread metric for continuous and binary variables [Edelmann et al., 2020, Ann. Stat., 48(6)], to quantify the spread of general categorical variables. We provide both geometric and algebraic characterizations of distance variance, revealing its connections to some commonly used entropy measures, and the variance-covariance matrix of the one-hot encoded representation. However, we demonstrate that distance variance fails to satisfy the Schur-concavity axiom for categorical variables with more than two categories, leading to counterintuitive results. This limitation hinders its applicability as a universal measure of spread."}, "https://arxiv.org/abs/2405.06850": {"title": "Identifying Peer Effects in Networks with Unobserved Effort and Isolated Students", "link": "https://arxiv.org/abs/2405.06850", "description": "arXiv:2405.06850v1 Announce Type: new \nAbstract: Peer influence on effort devoted to some activity is often studied using proxy variables when actual effort is unobserved. For instance, in education, academic effort is often proxied by GPA. We propose an alternative approach that circumvents this approximation. Our framework distinguishes unobserved shocks to GPA that do not affect effort from preference shocks that do affect effort levels. We show that peer effects estimates obtained using our approach can differ significantly from classical estimates (where effort is approximated) if the network includes isolated students. 
Applying our approach to data on high school students in the United States, we find that peer effect estimates relying on GPA as a proxy for effort are 40% lower than those obtained using our approach."}, "https://arxiv.org/abs/2405.06866": {"title": "Dynamic Contextual Pricing with Doubly Non-Parametric Random Utility Models", "link": "https://arxiv.org/abs/2405.06866", "description": "arXiv:2405.06866v1 Announce Type: new \nAbstract: In the evolving landscape of digital commerce, adaptive dynamic pricing strategies are essential for gaining a competitive edge. This paper introduces novel {\\em doubly nonparametric random utility models} that eschew traditional parametric assumptions used in estimating consumer demand's mean utility function and noise distribution. Existing nonparametric methods like multi-scale {\\em Distributional Nearest Neighbors (DNN and TDNN)}, initially designed for offline regression, face challenges in dynamic online pricing due to design limitations, such as the indirect observability of utility-related variables and the absence of uniform convergence guarantees. We address these challenges with innovative population equations that facilitate nonparametric estimation within decision-making frameworks and establish new analytical results on the uniform convergence rates of DNN and TDNN, enhancing their applicability in dynamic environments.\n Our theoretical analysis confirms that the statistical learning rates for the mean utility function and noise distribution are minimax optimal. We also derive a regret bound that illustrates the critical interaction between model dimensionality and noise distribution smoothness, deepening our understanding of dynamic pricing under varied market conditions. These contributions offer substantial theoretical insights and practical tools for implementing effective, data-driven pricing strategies, advancing the theoretical framework of pricing models and providing robust methodologies for navigating the complexities of modern markets."}, "https://arxiv.org/abs/2405.06889": {"title": "Tuning parameter selection for the adaptive nuclear norm regularized trace regression", "link": "https://arxiv.org/abs/2405.06889", "description": "arXiv:2405.06889v1 Announce Type: new \nAbstract: Regularized models have been applied in many areas, with high-dimensional data sets being especially common. Because the tuning parameter determines the theoretical performance and computational efficiency of a regularized model, tuning parameter selection is a basic and important issue. We consider tuning parameter selection for adaptive nuclear norm regularized trace regression, which is carried out via the Bayesian information criterion (BIC). The proposed BIC is established with the help of an unbiased estimator of degrees of freedom. Under some regularity conditions, this BIC is proved to achieve rank consistency of the tuning parameter selection. That is, the model solution under the selected tuning parameter converges to the true solution and, with probability tending to one, has the same rank as the true solution. 
Some numerical results are presented to evaluate the performance of the proposed BIC on tuning parameter selection."}, "https://arxiv.org/abs/2405.07026": {"title": "Selective Randomization Inference for Adaptive Experiments", "link": "https://arxiv.org/abs/2405.07026", "description": "arXiv:2405.07026v1 Announce Type: new \nAbstract: Adaptive experiments use preliminary analyses of the data to inform further course of action and are commonly used in many disciplines including medical and social sciences. Because the null hypothesis and experimental design are not pre-specified, it has long been recognized that statistical inference for adaptive experiments is not straightforward. Most existing methods only apply to specific adaptive designs and rely on strong assumptions. In this work, we propose selective randomization inference as a general framework for analyzing adaptive experiments. In a nutshell, our approach applies conditional post-selection inference to randomization tests. By using directed acyclic graphs to describe the data generating process, we derive a selective randomization p-value that controls the selective type-I error without requiring independent and identically distributed data or any other modelling assumptions. We show how rejection sampling and Markov Chain Monte Carlo can be used to compute the selective randomization p-values and construct confidence intervals for a homogeneous treatment effect. To mitigate the risk of disconnected confidence intervals, we propose the use of hold-out units. Lastly, we demonstrate our method and compare it with other randomization tests using synthetic and real-world data."}, "https://arxiv.org/abs/2405.07102": {"title": "Nested Instrumental Variables Design: Switcher Average Treatment Effect, Identification, Efficient Estimation and Generalizability", "link": "https://arxiv.org/abs/2405.07102", "description": "arXiv:2405.07102v1 Announce Type: new \nAbstract: Instrumental variables (IV) are a commonly used tool to estimate causal effects from non-randomized data. A prototype of an IV is a randomized trial with non-compliance where the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV non-parametrically identifies the average treatment effect among a non-identifiable complier subgroup, whose generalizability is often under debate. In many studies, there could exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multi-center clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates based on comparing the conditional average treatment effect among the always-compliers and that among the switchers under the nested IV framework. 
We apply the proposed framework and method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability."}, "https://arxiv.org/abs/2405.07109": {"title": "Bridging Binarization: Causal Inference with Dichotomized Continuous Treatments", "link": "https://arxiv.org/abs/2405.07109", "description": "arXiv:2405.07109v1 Announce Type: new \nAbstract: The average treatment effect (ATE) is a common parameter estimated in causal inference literature, but it is only defined for binary treatments. Thus, despite concerns raised by some researchers, many studies seeking to estimate the causal effect of a continuous treatment create a new binary treatment variable by dichotomizing the continuous values into two categories. In this paper, we affirm binarization as a statistically valid method for answering causal questions about continuous treatments by showing the equivalence between the binarized ATE and the difference in the average outcomes of two specific modified treatment policies. These policies impose cut-offs corresponding to the binarized treatment variable and assume preservation of relative self-selection. Relative self-selection is the ratio of the probability density of an individual having an exposure equal to one value of the continuous treatment variable versus another. The policies assume that, for any two values of the treatment variable with non-zero probability density after the cut-off, this ratio will remain unchanged. Through this equivalence, we clarify the assumptions underlying binarization and discuss how to properly interpret the resulting estimator. Additionally, we introduce a new target parameter that can be computed after binarization that considers the status-quo world. We argue that this parameter addresses more relevant causal questions than the traditional binarized ATE parameter. Finally, we present a simulation study to illustrate the implications of these assumptions when analyzing data and to demonstrate how to correctly implement estimators of the parameters discussed."}, "https://arxiv.org/abs/2405.07134": {"title": "On the Ollivier-Ricci curvature as fragility indicator of the stock markets", "link": "https://arxiv.org/abs/2405.07134", "description": "arXiv:2405.07134v1 Announce Type: new \nAbstract: Recently, an indicator for stock market fragility and crash size in terms of the Ollivier-Ricci curvature has been proposed. We study analytical and empirical properties of such indicator, test its elasticity with respect to different parameters and provide heuristics for the parameters involved. We show when and how the indicator accurately describes a financial crisis. We also propose an alternate method for calculating the indicator using a specific sub-graph with special curvature properties."}, "https://arxiv.org/abs/2405.07138": {"title": "Large-dimensional Robust Factor Analysis with Group Structure", "link": "https://arxiv.org/abs/2405.07138", "description": "arXiv:2405.07138v1 Announce Type: new \nAbstract: In this paper, we focus on exploiting the group structure for large-dimensional factor models, which captures the homogeneous effects of common factors on individuals within the same group. 
In view of the fact that datasets in macroeconomics and finance are typically heavy-tailed, we propose to identify the unknown group structure using the agglomerative hierarchical clustering algorithm and an information criterion with the robust two-step (RTS) estimates as initial values. The loadings and factors are then re-estimated conditional on the identified groups. Theoretically, we demonstrate the consistency of the estimators for both group membership and the number of groups determined by the information criterion. Under finite second moment condition, we provide the convergence rate for the newly estimated factor loadings with group information, which are shown to achieve efficiency gains compared to those obtained without group structure information. Numerical simulations and real data analysis demonstrate the nice finite sample performance of our proposed approach in the presence of both group structure and heavy-tailedness."}, "https://arxiv.org/abs/2405.07186": {"title": "Adaptive-TMLE for the Average Treatment Effect based on Randomized Controlled Trial Augmented with Real-World Data", "link": "https://arxiv.org/abs/2405.07186", "description": "arXiv:2405.07186v1 Announce Type: new \nAbstract: We consider the problem of estimating the average treatment effect (ATE) when both randomized control trial (RCT) data and real-world data (RWD) are available. We decompose the ATE estimand as the difference between a pooled-ATE estimand that integrates RCT and RWD and a bias estimand that captures the conditional effect of RCT enrollment on the outcome. We introduce an adaptive targeted minimum loss-based estimation (A-TMLE) framework to estimate them. We prove that the A-TMLE estimator is root-n-consistent and asymptotically normal. Moreover, in finite sample, it achieves the super-efficiency one would obtain had one known the oracle model for the conditional effect of the RCT enrollment on the outcome. Consequently, the smaller the working model of the bias induced by the RWD is, the greater our estimator's efficiency, while our estimator will always be at least as efficient as an efficient estimator that uses the RCT data only. A-TMLE outperforms existing methods in simulations by having smaller mean-squared-error and 95% confidence intervals. A-TMLE could help utilize RWD to improve the efficiency of randomized trial results without biasing the estimates of intervention effects. This approach could allow for smaller, faster trials, decreasing the time until patients can receive effective treatments."}, "https://arxiv.org/abs/2405.07292": {"title": "Kernel Three Pass Regression Filter", "link": "https://arxiv.org/abs/2405.07292", "description": "arXiv:2405.07292v1 Announce Type: new \nAbstract: We forecast a single time series using a high-dimensional set of predictors. When these predictors share common underlying dynamics, an approximate latent factor model provides a powerful characterization of their co-movements Bai(2003). These latent factors succinctly summarize the data and can also be used for prediction, alleviating the curse of dimensionality in high-dimensional prediction exercises, see Stock & Watson (2002a). However, forecasting using these latent factors suffers from two potential drawbacks. First, not all pervasive factors among the set of predictors may be relevant, and using all of them can lead to inefficient forecasts. The second shortcoming is the assumption of linear dependence of predictors on the underlying factors. 
The first issue can be addressed by using some form of supervision, which leads to the omission of irrelevant information. One example is the three-pass regression filter proposed by Kelly & Pruitt (2015). We extend their framework to cases where the form of dependence might be nonlinear by developing a new estimator, which we refer to as the Kernel Three-Pass Regression Filter (K3PRF). This alleviates the aforementioned second shortcoming. The estimator is computationally efficient and performs well empirically. The short-term performance matches or exceeds that of established models, while the long-term performance shows significant improvement."}, "https://arxiv.org/abs/2405.07294": {"title": "Factor Strength Estimation in Vector and Matrix Time Series Factor Models", "link": "https://arxiv.org/abs/2405.07294", "description": "arXiv:2405.07294v1 Announce Type: new \nAbstract: Most factor modelling research in vector or matrix-valued time series assume all factors are pervasive/strong and leave weaker factors and their corresponding series to the noise. Weaker factors can in fact be important to a group of observed variables, for instance a sector factor in a large portfolio of stocks may only affect particular sectors, but can be important both in interpretations and predictions for those stocks. While more recent factor modelling researches do consider ``local'' factors which are weak factors with sparse corresponding factor loadings, there are real data examples in the literature where factors are weak because of weak influence on most/all observed variables, so that the corresponding factor loadings are not sparse (non-local). As a first in the literature, we propose estimators of factor strengths for both local and non-local weak factors, and prove their consistency with rates of convergence spelt out for both vector and matrix-valued time series factor models. Factor strength has an important indication in what estimation procedure of factor models to follow, as well as the estimation accuracy of various estimators (Chen and Lam, 2024). Simulation results show that our estimators have good performance in recovering the true factor strengths, and an analysis on the NYC taxi traffic data indicates the existence of weak factors in the data which may not be localized."}, "https://arxiv.org/abs/2405.07397": {"title": "The Spike-and-Slab Quantile LASSO for Robust Variable Selection in Cancer Genomics Studies", "link": "https://arxiv.org/abs/2405.07397", "description": "arXiv:2405.07397v1 Announce Type: new \nAbstract: Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the non-robust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ro\\v{c}kov\\'a and George, 2018). 
Furthermore, the spike-and-slab quantile LASSO has a computational advantage in locating the posterior modes via soft-thresholding-rule-guided Expectation-Maximization (EM) steps within the coordinate descent framework, a phenomenon rarely observed for robust regularization with non-differentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinoma (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA)."}, "https://arxiv.org/abs/2405.07408": {"title": "Bayesian Spatially Clustered Compositional Regression: Linking intersectoral GDP contributions to Gini Coefficients", "link": "https://arxiv.org/abs/2405.07408", "description": "arXiv:2405.07408v1 Announce Type: new \nAbstract: The Gini coefficient is a universally used measure of income inequality. Intersectoral GDP contributions reveal the economic development of different sectors of the national economy. Linking intersectoral GDP contributions to Gini coefficients provides a better understanding of how the Gini coefficient is influenced by different industries. In this paper, a compositional regression with spatially clustered coefficients is proposed to explore heterogeneous effects over spatial locations under a nonparametric Bayesian framework. Specifically, a Markov random field constrained mixture of finite mixtures prior is designed for Bayesian log contrast regression with compositional covariates, which allows for both spatially contiguous clusters and discontinuous clusters. In addition, an efficient Markov chain Monte Carlo algorithm for posterior sampling that enables simultaneous inference on both cluster configurations and cluster-wise parameters is designed. The compelling empirical performance of the proposed method is demonstrated via extensive simulation studies and an application to 51 states of the United States using data from the 2019 Bureau of Economic Analysis."}, "https://arxiv.org/abs/2405.07420": {"title": "Robust Inference for High-Dimensional Panel Data Models", "link": "https://arxiv.org/abs/2405.07420", "description": "arXiv:2405.07420v1 Announce Type: new \nAbstract: In this paper, we propose a robust estimation and inferential method for high-dimensional panel data models. Specifically, (1) we investigate the case where the number of regressors can grow faster than the sample size, (2) we pay particular attention to non-Gaussian, serially and cross-sectionally correlated and heteroskedastic error processes, and (3) we develop an estimation method for the high-dimensional long-run covariance matrix using a thresholded estimator.\n Methodologically and technically, we develop two Nagaev-type concentration inequalities: one for a partial sum and the other for a quadratic form, subject to a set of easily verifiable conditions. Leveraging these two inequalities, we also derive a non-asymptotic bound for the LASSO estimator, achieve asymptotic normality via the node-wise LASSO regression, and establish a sharp convergence rate for the thresholded heteroskedasticity and autocorrelation consistent (HAC) estimator.\n Our study thus provides the relevant literature with a complete toolkit for conducting inference about the parameters of interest involved in a high-dimensional panel data framework. 
We also demonstrate the practical relevance of these theoretical results by investigating a high-dimensional panel data model with interactive fixed effects. Moreover, we conduct extensive numerical studies using simulated and real data examples."}, "https://arxiv.org/abs/2405.07504": {"title": "Hierarchical inference of evidence using posterior samples", "link": "https://arxiv.org/abs/2405.07504", "description": "arXiv:2405.07504v1 Announce Type: new \nAbstract: The Bayesian evidence, crucial ingredient for model selection, is arguably the most important quantity in Bayesian data analysis: at the same time, however, it is also one of the most difficult to compute. In this paper we present a hierarchical method that leverages on a multivariate normalised approximant for the posterior probability density to infer the evidence for a model in a hierarchical fashion using a set of posterior samples drawn using an arbitrary sampling scheme."}, "https://arxiv.org/abs/2405.07631": {"title": "Improving prediction models by incorporating external data with weights based on similarity", "link": "https://arxiv.org/abs/2405.07631", "description": "arXiv:2405.07631v1 Announce Type: new \nAbstract: In clinical settings, we often face the challenge of building prediction models based on small observational data sets. For example, such a data set might be from a medical center in a multi-center study. Differences between centers might be large, thus requiring specific models based on the data set from the target center. Still, we want to borrow information from the external centers, to deal with small sample sizes. There are approaches that either assign weights to each external data set or each external observation. To incorporate information on differences between data sets and observations, we propose an approach that combines both into weights that can be incorporated into a likelihood for fitting regression models. Specifically, we suggest weights at the data set level that incorporate information on how well the models that provide the observation weights distinguish between data sets. Technically, this takes the form of inverse probability weighting. We explore different scenarios where covariates and outcomes differ among data sets, informing our simulation design for method evaluation. The concept of effective sample size is used for understanding the effectiveness of our subgroup modeling approach. We demonstrate our approach through a clinical application, predicting applied radiotherapy doses for cancer patients. Generally, the proposed approach provides improved prediction performance when external data sets are similar. We thus provide a method for quantifying similarity of external data sets to the target data set and use this similarity to include external observations for improving performance in a target data set prediction modeling task with small data."}, "https://arxiv.org/abs/2405.07860": {"title": "Uniform Inference for Subsampled Moment Regression", "link": "https://arxiv.org/abs/2405.07860", "description": "arXiv:2405.07860v1 Announce Type: new \nAbstract: We propose a method for constructing a confidence region for the solution to a conditional moment equation. The method is built around a class of algorithms for nonparametric regression based on subsampled kernels. This class includes random forest regression. 
We bound the error in the confidence region's nominal coverage probability, under the restriction that the conditional moment equation of interest satisfies a local orthogonality condition. The method is applicable to the construction of confidence regions for conditional average treatment effects in randomized experiments, among many other similar problems encountered in applied economics and causal inference. As a by-product, we obtain several new order-explicit results on the concentration and normal approximation of high-dimensional $U$-statistics."}, "https://arxiv.org/abs/2405.07979": {"title": "Low-order outcomes and clustered designs: combining design and analysis for causal inference under network interference", "link": "https://arxiv.org/abs/2405.07979", "description": "arXiv:2405.07979v1 Announce Type: new \nAbstract: Variance reduction for causal inference in the presence of network interference is often achieved through either outcome modeling, which is typically analyzed under unit-randomized Bernoulli designs, or clustered experimental designs, which are typically analyzed without strong parametric assumptions. In this work, we study the intersection of these two approaches and consider the problem of estimation in low-order outcome models using data from a general experimental design. Our contributions are threefold. First, we present an estimator of the total treatment effect (also called the global average treatment effect) in a low-degree outcome model when the data are collected under general experimental designs, generalizing previous results for Bernoulli designs. We refer to this estimator as the pseudoinverse estimator and give bounds on its bias and variance in terms of properties of the experimental design. Second, we evaluate these bounds for the case of cluster randomized designs with both Bernoulli and complete randomization. For clustered Bernoulli randomization, we find that our estimator is always unbiased and that its variance scales like the smaller of the variance obtained from a low-order assumption and the variance obtained from cluster randomization, showing that combining these variance reduction strategies is preferable to using either individually. For clustered complete randomization, we find a notable bias-variance trade-off mediated by specific features of the clustering. Third, when choosing a clustered experimental design, our bounds can be used to select a clustering from a set of candidate clusterings. Across a range of graphs and clustering algorithms, we show that our method consistently selects clusterings that perform well on a range of response models, suggesting that our bounds are useful to practitioners."}, "https://arxiv.org/abs/2405.07985": {"title": "Improved LARS algorithm for adaptive LASSO in the linear regression model", "link": "https://arxiv.org/abs/2405.07985", "description": "arXiv:2405.07985v1 Announce Type: new \nAbstract: The adaptive LASSO has been used for consistent variable selection in place of LASSO in the linear regression model. In this article, we propose a modified LARS algorithm to combine adaptive LASSO with some biased estimators, namely the Almost Unbiased Ridge Estimator (AURE), Liu Estimator (LE), Almost Unbiased Liu Estimator (AULE), Principal Component Regression Estimator (PCRE), r-k class estimator, and r-d class estimator. 
Furthermore, we examine the performance of the proposed algorithm using a Monte Carlo simulation study and real-world examples."}, "https://arxiv.org/abs/2405.07343": {"title": "Graph neural networks for power grid operational risk assessment under evolving grid topology", "link": "https://arxiv.org/abs/2405.07343", "description": "arXiv:2405.07343v1 Announce Type: cross \nAbstract: This article investigates the ability of graph neural networks (GNNs) to identify risky conditions in a power grid over the subsequent few hours, without explicit, high-resolution information regarding future generator on/off status (grid topology) or power dispatch decisions. The GNNs are trained using supervised learning, to predict the power grid's aggregated bus-level (either zonal or system-level) or individual branch-level state under different power supply and demand conditions. The variability of the stochastic grid variables (wind/solar generation and load demand), and their statistical correlations, are rigorously considered while generating the inputs for the training data. The outputs in the training data, obtained by solving numerous mixed-integer linear programming (MILP) optimal power flow problems, correspond to system-level, zonal and transmission line-level quantities of interest (QoIs). The QoIs predicted by the GNNs are used to conduct hours-ahead, sampling-based reliability and risk assessment w.r.t. zonal and system-level (load shedding) as well as branch-level (overloading) failure events. The proposed methodology is demonstrated for three synthetic grids with sizes ranging from 118 to 2848 buses. Our results demonstrate that GNNs are capable of providing fast and accurate prediction of QoIs and can be good proxies for computationally expensive MILP algorithms. The excellent accuracy of GNN-based reliability and risk assessment suggests that GNN models can substantially improve situational awareness by quickly providing rigorous reliability and risk estimates."}, "https://arxiv.org/abs/2405.07359": {"title": "Forecasting with an N-dimensional Langevin Equation and a Neural-Ordinary Differential Equation", "link": "https://arxiv.org/abs/2405.07359", "description": "arXiv:2405.07359v1 Announce Type: cross \nAbstract: Accurate prediction of electricity day-ahead prices is essential in competitive electricity markets. Although stationary electricity-price forecasting techniques have received considerable attention, research on non-stationary methods is comparatively scarce, despite the common prevalence of non-stationary features in electricity markets. Specifically, existing non-stationary techniques will often aim to address individual non-stationary features in isolation, leaving aside the exploration of concurrent multiple non-stationary effects. Our overarching objective here is the formulation of a framework to systematically model and forecast non-stationary electricity-price time series, encompassing the broader scope of non-stationary behavior. For this purpose we develop a data-driven model that combines an N-dimensional Langevin equation (LE) with a neural-ordinary differential equation (NODE). The LE captures fine-grained details of the electricity-price behavior in stationary regimes but is inadequate for non-stationary conditions. To overcome this inherent limitation, we adopt a NODE approach to learn, and at the same time predict, the difference between the actual electricity-price time series and the simulated price trajectories generated by the LE. 
By learning this difference, the NODE reconstructs the non-stationary components of the time series that the LE is not able to capture. We exemplify the effectiveness of our framework using the Spanish electricity day-ahead market as a prototypical case study. Our findings reveal that the NODE nicely complements the LE, providing a comprehensive strategy to tackle both stationary and non-stationary electricity-price behavior. The framework's dependability and robustness is demonstrated through different non-stationary scenarios by comparing it against a range of basic naive methods."}, "https://arxiv.org/abs/2405.07552": {"title": "Distributed High-Dimensional Quantile Regression: Estimation Efficiency and Support Recovery", "link": "https://arxiv.org/abs/2405.07552", "description": "arXiv:2405.07552v1 Announce Type: cross \nAbstract: In this paper, we focus on distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is a popular alternative tool to the least squares regression for robustness against outliers and data heterogeneity. However, the non-smoothness of the check loss function poses big challenges to both computation and theory in the distributed setting. To tackle these problems, we transform the original quantile regression into the least-squares optimization. By applying a double-smoothing approach, we extend a previous Newton-type distributed approach without the restrictive independent assumption between the error term and covariates. An efficient algorithm is developed, which enjoys high computation and communication efficiency. Theoretically, the proposed distributed estimator achieves a near-oracle convergence rate and high support recovery accuracy after a constant number of iterations. Extensive experiments on synthetic examples and a real data application further demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2405.07836": {"title": "Forecasting with Hyper-Trees", "link": "https://arxiv.org/abs/2405.07836", "description": "arXiv:2405.07836v1 Announce Type: cross \nAbstract: This paper introduces the concept of Hyper-Trees and offers a new direction in applying tree-based models to time series data. Unlike conventional applications of decision trees that forecast time series directly, Hyper-Trees are designed to learn the parameters of a target time series model. Our framework leverages the gradient-based nature of boosted trees, which allows us to extend the concept of Hyper-Networks to Hyper-Trees and to induce a time-series inductive bias to tree models. By relating the parameters of a target time series model to features, Hyper-Trees address the challenge of parameter non-stationarity and enable tree-based forecasts to extend beyond their initial training range. With our research, we aim to explore the effectiveness of Hyper-Trees across various forecasting scenarios and to expand the application of gradient boosted decision trees past their conventional use in time series forecasting."}, "https://arxiv.org/abs/2405.07910": {"title": "A Unification of Exchangeability and Continuous Exposure and Confounder Measurement Errors: Probabilistic Exchangeability", "link": "https://arxiv.org/abs/2405.07910", "description": "arXiv:2405.07910v1 Announce Type: cross \nAbstract: Exchangeability concerning a continuous exposure, X, implies no confounding bias when identifying average exposure effects of X, AEE(X). When X is measured with error (Xep), two challenges arise in identifying AEE(X). 
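To make concrete the non-smooth check loss whose minimisation defines quantile regression in the distributed quantile regression entry above (and whose non-differentiability motivates the smoothing step there), here is a small numerical sketch; it does not implement the paper's double-smoothing or distributed algorithm, and the sample is illustrative.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (pinball) check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0).astype(float))

# The minimiser of E[rho_tau(Y - c)] over c is the tau-th quantile of Y.
rng = np.random.default_rng(2)
y = rng.standard_t(df=3, size=5000)            # heavy-tailed sample
grid = np.linspace(-3, 3, 601)
risk = np.array([check_loss(y - c, 0.9).mean() for c in grid])
print("argmin of empirical risk:", grid[risk.argmin()].round(3))
print("empirical 0.9 quantile: ", np.quantile(y, 0.9).round(3))
```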
Firstly, exchangeability regarding Xep does not equal exchangeability regarding X. Secondly, the necessity of the non-differential error assumption (NDEA), overly stringent in practice, remains uncertain. To address these challenges, this article proposes unifying exchangeability and exposure and confounder measurement errors with three novel concepts. The first, Probabilistic Exchangeability (PE), states that the outcomes of those with Xep=e are probabilistically exchangeable with the outcomes of those truly exposed to X=eT. The relationship between AEE(Xep) and AEE(X) in risk difference and ratio scales is mathematically expressed as a probabilistic certainty, termed exchangeability probability (Pe). Squared Pe (Pe.sq) quantifies the extent to which AEE(Xep) differs from AEE(X) due to exposure measurement error not akin to confounding mechanisms. In realistic settings, the coefficient of determination (R.sq) in the regression of X against Xep may be sufficient to measure Pe.sq. The second concept, Emergent Pseudo Confounding (EPC), describes the bias introduced by exposure measurement error, akin to confounding mechanisms. PE can hold when EPC is controlled for, which is weaker than NDEA. The third, Emergent Confounding, describes when bias due to confounder measurement error arises. Adjustment for E(P)C can be performed like confounding adjustment to ensure PE. This paper provides justification for using AEE(Xep) and maximal insight into the potential divergence of AEE(Xep) from AEE(X) and its measurement. Differential errors do not necessarily compromise causal inference."}, "https://arxiv.org/abs/2405.07971": {"title": "Sensitivity Analysis for Active Sampling, with Applications to the Simulation of Analog Circuits", "link": "https://arxiv.org/abs/2405.07971", "description": "arXiv:2405.07971v1 Announce Type: cross \nAbstract: We propose an active sampling flow, with the use-case of simulating the impact of combined variations on analog circuits. In such a context, given the large number of parameters, it is difficult to fit a surrogate model and to efficiently explore the space of design features.\n By combining a drastic dimension reduction using sensitivity analysis and Bayesian surrogate modeling, we obtain a flexible active sampling flow. On synthetic and real datasets, this flow outperforms the usual Monte Carlo sampling, which often forms the foundation of design space exploration."}, "https://arxiv.org/abs/2008.13087": {"title": "Efficient Nested Simulation Experiment Design via the Likelihood Ratio Method", "link": "https://arxiv.org/abs/2008.13087", "description": "arXiv:2008.13087v3 Announce Type: replace \nAbstract: In the nested simulation literature, a common assumption is that the experimenter can choose the number of outer scenarios to sample. This paper considers the case when the experimenter is given a fixed set of outer scenarios from an external entity. We propose a nested simulation experiment design that pools inner replications from one scenario to estimate another scenario's conditional mean via the likelihood ratio method. Given the outer scenarios, we decide how many inner replications to run at each outer scenario as well as how to pool the inner replications by solving a bi-level optimization problem that minimizes the total simulation effort. We provide asymptotic analyses on the convergence rates of the performance measure estimators computed from the optimized experiment design.
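A toy, self-normalised version of the likelihood-ratio pooling idea in the nested simulation entry above: inner replications generated under one outer scenario are reweighted by a density ratio to estimate another scenario's conditional mean. The Gaussian scenarios and the squared-value performance measure are illustrative assumptions, not the paper's experiment design.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
# Inner replications were simulated under scenario A: Y ~ N(mu_a, 1).
mu_a, mu_b = 0.0, 0.4
y = rng.normal(mu_a, 1.0, size=10_000)

# Likelihood ratio reweights the scenario-A sample toward scenario B.
w = norm.pdf(y, loc=mu_b, scale=1.0) / norm.pdf(y, loc=mu_a, scale=1.0)

h = y ** 2                                     # example performance measure
naive = h.mean()                               # conditional mean under A
pooled = np.sum(w * h) / np.sum(w)             # self-normalised LR estimate for B
print("E[Y^2] under A (naive):    ", round(naive, 3))
print("E[Y^2] under B (pooled LR):", round(pooled, 3), "truth:", round(mu_b**2 + 1, 3))
```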
Under some assumptions, the optimized design achieves $\\mathcal{O}(\\Gamma^{-1})$ mean squared error of the estimators given simulation budget $\\Gamma$. Numerical experiments demonstrate that our design outperforms a state-of-the-art design that pools replications via regression."}, "https://arxiv.org/abs/2102.10778": {"title": "Interactive identification of individuals with positive treatment effect while controlling false discoveries", "link": "https://arxiv.org/abs/2102.10778", "description": "arXiv:2102.10778v3 Announce Type: replace \nAbstract: Out of the participants in a randomized experiment with anticipated heterogeneous treatment effects, is it possible to identify which subjects have a positive treatment effect? While subgroup analysis has received attention, claims about individual participants are much more challenging. We frame the problem in terms of multiple hypothesis testing: each individual has a null hypothesis (stating that the potential outcomes are equal, for example) and we aim to identify those for whom the null is false (the treatment potential outcome stochastically dominates the control one, for example). We develop a novel algorithm that identifies such a subset, with nonasymptotic control of the false discovery rate (FDR). Our algorithm allows for interaction -- a human data scientist (or a computer program) may adaptively guide the algorithm in a data-dependent manner to gain power. We show how to extend the methods to observational settings and achieve a type of doubly-robust FDR control. We also propose several extensions: (a) relaxing the null to nonpositive effects, (b) moving from unpaired to paired samples, and (c) subgroup identification. We demonstrate via numerical experiments and theoretical analysis that the proposed method has valid FDR control in finite samples and reasonably high identification power."}, "https://arxiv.org/abs/2210.14205": {"title": "Unit Averaging for Heterogeneous Panels", "link": "https://arxiv.org/abs/2210.14205", "description": "arXiv:2210.14205v3 Announce Type: replace \nAbstract: In this work we introduce a unit averaging procedure to efficiently recover unit-specific parameters in a heterogeneous panel model. The procedure consists in estimating the parameter of a given unit using a weighted average of all the unit-specific parameter estimators in the panel. The weights of the average are determined by minimizing an MSE criterion we derive. We analyze the properties of the resulting minimum MSE unit averaging estimator in a local heterogeneity framework inspired by the literature on frequentist model averaging, and we derive the local asymptotic distribution of the estimator and the corresponding weights. The benefits of the procedure are showcased with an application to forecasting unemployment rates for a panel of German regions."}, "https://arxiv.org/abs/2301.11472": {"title": "Fast Bayesian inference for spatial mean-parameterized Conway-Maxwell-Poisson models", "link": "https://arxiv.org/abs/2301.11472", "description": "arXiv:2301.11472v4 Announce Type: replace \nAbstract: Count data with complex features arise in many disciplines, including ecology, agriculture, criminology, medicine, and public health. Zero inflation, spatial dependence, and non-equidispersion are common features in count data. There are two classes of models that allow for these features -- the mode-parameterized Conway--Maxwell--Poisson (COMP) distribution and the generalized Poisson model.
However both require the use of either constraints on the parameter space or a parameterization that leads to challenges in interpretability. We propose a spatial mean-parameterized COMP model that retains the flexibility of these models while resolving the above issues. We use a Bayesian spatial filtering approach in order to efficiently handle high-dimensional spatial data and we use reversible-jump MCMC to automatically choose the basis vectors for spatial filtering. The COMP distribution poses two additional computational challenges -- an intractable normalizing function in the likelihood and no closed-form expression for the mean. We propose a fast computational approach that addresses these challenges by, respectively, introducing an efficient auxiliary variable algorithm and pre-computing key approximations for fast likelihood evaluation. We illustrate the application of our methodology to simulated and real datasets, including Texas HPV-cancer data and US vaccine refusal data."}, "https://arxiv.org/abs/2309.04793": {"title": "Interpreting TSLS Estimators in Information Provision Experiments", "link": "https://arxiv.org/abs/2309.04793", "description": "arXiv:2309.04793v3 Announce Type: replace \nAbstract: To estimate the causal effects of beliefs on actions, researchers often conduct information provision experiments. We consider the causal interpretation of two-stage least squares (TSLS) estimators in these experiments. In particular, we characterize common TSLS estimators as weighted averages of causal effects, and interpret these weights under general belief updating conditions that nest parametric models from the literature. Our framework accommodates TSLS estimators for both active and passive control designs. Notably, we find that some passive control estimators allow for negative weights, which compromises their causal interpretation. We give practical guidance on such issues, and illustrate our results in two empirical applications."}, "https://arxiv.org/abs/2311.08691": {"title": "On Doubly Robust Estimation with Nonignorable Missing Data Using Instrumental Variables", "link": "https://arxiv.org/abs/2311.08691", "description": "arXiv:2311.08691v2 Announce Type: replace \nAbstract: Suppose we are interested in the mean of an outcome that is subject to nonignorable nonresponse. This paper develops new semiparametric estimation methods with instrumental variables which affect nonresponse, but not the outcome. The proposed estimators remain consistent and asymptotically normal even under partial model misspecifications for two variation independent nuisance components. We evaluate the performance of the proposed estimators via a simulation study, and apply them in adjusting for missing data induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer experience as an instrumental variable."}, "https://arxiv.org/abs/2312.05802": {"title": "Enhancing Scalability in Bayesian Nonparametric Factor Analysis of Spatiotemporal Data", "link": "https://arxiv.org/abs/2312.05802", "description": "arXiv:2312.05802v3 Announce Type: replace \nAbstract: This manuscript puts forward novel practicable spatiotemporal Bayesian factor analysis frameworks computationally feasible for moderate to large data. 
Our models exhibit significantly enhanced computational scalability and storage efficiency, deliver high overall modeling performance, and possess powerful inferential capabilities for adequately predicting outcomes at future time points or new spatial locations and satisfactorily clustering spatial locations into regions with similar temporal trajectories, a crucial and frequently encountered task. On top of a baseline separable factor model with temporally dependent latent factors and spatially dependent factor loadings under a probit stick-breaking process (PSBP) prior, we integrate a new slice sampling algorithm that permits an unknown, varying number of spatial mixture components across all factors and guarantees it to be non-increasing through the MCMC iterations, thus considerably enhancing model flexibility, efficiency, and scalability. We further introduce a novel spatial latent nearest-neighbor Gaussian process (NNGP) prior and new sequential updating algorithms for the spatially varying latent variables in the PSBP prior, thereby attaining high spatial scalability. The markedly accelerated posterior sampling and spatial prediction, as well as the strong modeling and inferential performance of our models, are substantiated by our simulation experiments."}, "https://arxiv.org/abs/2401.03881": {"title": "Density regression via Dirichlet process mixtures of normal structured additive regression models", "link": "https://arxiv.org/abs/2401.03881", "description": "arXiv:2401.03881v2 Announce Type: replace \nAbstract: Within Bayesian nonparametrics, dependent Dirichlet process mixture models provide a highly flexible approach for conducting inference about the conditional density function. However, several formulations of this class make either rather restrictive modelling assumptions or involve intricate algorithms for posterior inference, thus preventing their widespread use. In response to these challenges, we present a flexible, versatile, and computationally tractable model for density regression based on a single-weights dependent Dirichlet process mixture of normal distributions model for univariate continuous responses. We assume an additive structure for the mean of each mixture component and incorporate the effects of continuous covariates through smooth nonlinear functions. The key components of our modelling approach are penalised B-splines and their bivariate tensor product extension. Our proposed method also seamlessly accommodates parametric effects of categorical covariates, linear effects of continuous covariates, interactions between categorical and/or continuous covariates, varying coefficient terms, and random effects, which is why we refer to our model as a Dirichlet process mixture of normal structured additive regression models. A noteworthy feature of our method is its efficiency in posterior simulation through Gibbs sampling, as closed-form full conditional distributions for all model parameters are available. Results from a simulation study demonstrate that our approach successfully recovers true conditional densities and other regression functionals in various challenging scenarios. Applications to a toxicology, disease diagnosis, and agricultural study are provided and further underpin the broad applicability of our modelling framework.
An R package, DDPstar, implementing the proposed method is publicly available at https://bitbucket.org/mxrodriguez/ddpstar."}, "https://arxiv.org/abs/2401.07018": {"title": "Graphical models for cardinal paired comparisons data", "link": "https://arxiv.org/abs/2401.07018", "description": "arXiv:2401.07018v2 Announce Type: replace \nAbstract: Graphical models for cardinal paired comparison data with and without covariates are rigorously analyzed. Novel, graph--based, necessary and sufficient conditions which guarantee strong consistency, asymptotic normality and the exponential convergence of the estimated ranks are emphasized. A complete theory for models with covariates is laid out. In particular conditions under which covariates can be safely omitted from the model are provided. The methodology is employed in the analysis of both finite and infinite sets of ranked items specifically in the case of large sparse comparison graphs. The proposed methods are explored by simulation and applied to the ranking of teams in the National Basketball Association (NBA)."}, "https://arxiv.org/abs/2203.02605": {"title": "Reinforcement Learning in Modern Biostatistics: Constructing Optimal Adaptive Interventions", "link": "https://arxiv.org/abs/2203.02605", "description": "arXiv:2203.02605v3 Announce Type: replace-cross \nAbstract: In recent years, reinforcement learning (RL) has acquired a prominent position in health-related sequential decision-making problems, gaining traction as a valuable tool for delivering adaptive interventions (AIs). However, in part due to a poor synergy between the methodological and the applied communities, its real-life application is still limited and its potential is still to be realized. To address this gap, our work provides the first unified technical survey on RL methods, complemented with case studies, for constructing various types of AIs in healthcare. In particular, using the common methodological umbrella of RL, we bridge two seemingly different AI domains, dynamic treatment regimes and just-in-time adaptive interventions in mobile health, highlighting similarities and differences between them and discussing the implications of using RL. Open problems and considerations for future research directions are outlined. Finally, we leverage our experience in designing case studies in both areas to showcase the significant collaborative opportunities between statistical, RL, and healthcare researchers in advancing AIs."}, "https://arxiv.org/abs/2302.08854": {"title": "Post Reinforcement Learning Inference", "link": "https://arxiv.org/abs/2302.08854", "description": "arXiv:2302.08854v3 Announce Type: replace-cross \nAbstract: We consider estimation and inference using data collected from reinforcement learning algorithms. These algorithms, characterized by their adaptive experimentation, interact with individual units over multiple stages, dynamically adjusting their strategies based on previous interactions. Our goal is to evaluate a counterfactual policy post-data collection and estimate structural parameters, like dynamic treatment effects, which can be used for credit assignment and determining the effect of earlier actions on final outcomes. Such parameters of interest can be framed as solutions to moment equations, but not minimizers of a population loss function, leading to Z-estimation approaches for static data. 
However, in the adaptive data collection environment of reinforcement learning, where algorithms deploy nonstationary behavior policies, standard estimators do not achieve asymptotic normality due to the fluctuating variance. We propose a weighted Z-estimation approach with carefully designed adaptive weights to stabilize the time-varying estimation variance. We identify proper weighting schemes to restore the consistency and asymptotic normality of the weighted Z-estimators for target parameters, which allows for hypothesis testing and constructing uniform confidence regions. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation."}, "https://arxiv.org/abs/2405.08177": {"title": "Parameter identifiability, parameter estimation and model prediction for differential equation models", "link": "https://arxiv.org/abs/2405.08177", "description": "arXiv:2405.08177v1 Announce Type: new \nAbstract: Interpreting data with mathematical models is an important aspect of real-world applied mathematical modeling. Very often we are interested to understand the extent to which a particular data set informs and constrains model parameters. This question is closely related to the concept of parameter identifiability, and in this article we present a series of computational exercises to introduce tools that can be used to assess parameter identifiability, estimate parameters and generate model predictions. Taking a likelihood-based approach, we show that very similar ideas and algorithms can be used to deal with a range of different mathematical modelling frameworks. The exercises and results presented in this article are supported by a suite of open access codes that can be accessed on GitHub."}, "https://arxiv.org/abs/2405.08180": {"title": "An adaptive enrichment design using Bayesian model averaging for selection and threshold-identification of tailoring variables", "link": "https://arxiv.org/abs/2405.08180", "description": "arXiv:2405.08180v1 Announce Type: new \nAbstract: Precision medicine stands as a transformative approach in healthcare, offering tailored treatments that can enhance patient outcomes and reduce healthcare costs. As understanding of complex disease improves, clinical trials are being designed to detect subgroups of patients with enhanced treatment effects. Biomarker-driven adaptive enrichment designs, which enroll a general population initially and later restrict accrual to treatment-sensitive patients, are gaining popularity. Current practice often assumes either pre-trial knowledge of biomarkers defining treatment-sensitive subpopulations or a simple, linear relationship between continuous markers and treatment effectiveness. Motivated by a trial studying rheumatoid arthritis treatment, we propose a Bayesian adaptive enrichment design which identifies important tailoring variables out of a larger set of candidate biomarkers. Our proposed design is equipped with a flexible modelling framework where the effects of continuous biomarkers are introduced using free knot B-splines. The parameters of interest are then estimated by marginalizing over the space of all possible variable combinations using Bayesian model averaging. At interim analyses, we assess whether a biomarker-defined subgroup has enhanced or reduced treatment effects, allowing for early termination due to efficacy or futility and restricting future enrollment to treatment-sensitive patients. 
We consider pre-categorized and continuous biomarkers, the latter of which may have complex, nonlinear relationships to the outcome and treatment effect. Using simulations, we derive the operating characteristics of our design and compare its performance to two existing approaches."}, "https://arxiv.org/abs/2405.08222": {"title": "Random Utility Models with Skewed Random Components: the Smallest versus Largest Extreme Value Distribution", "link": "https://arxiv.org/abs/2405.08222", "description": "arXiv:2405.08222v1 Announce Type: new \nAbstract: At the core of most random utility models (RUMs) is an individual agent with a random utility component following a largest extreme value Type I (LEVI) distribution. What if, instead, the random component follows its mirror image -- the smallest extreme value Type I (SEVI) distribution? Differences between these specifications, closely tied to the random component's skewness, can be quite profound. For the same preference parameters, the two RUMs, equivalent with only two choice alternatives, diverge progressively as the number of alternatives increases, resulting in substantially different estimates and predictions for key measures, such as elasticities and market shares.\n The LEVI model imposes the well-known independence-of-irrelevant-alternatives property, while SEVI does not. Instead, the SEVI choice probability for a particular option involves enumerating all subsets that contain this option. The SEVI model, though more complex to estimate, is shown to have computationally tractable closed-form choice probabilities. Much of the paper delves into explicating the properties of the SEVI model and exploring implications of the random component's skewness.\n Conceptually, the difference between the LEVI and SEVI models centers on whether information, known only to the agent, is more likely to increase or decrease the systematic utility parameterized using observed attributes. LEVI does the former; SEVI the latter. An immediate implication is that if choice is characterized by SEVI random components, then the observed choice is more likely to correspond to the systematic-utility-maximizing choice than if characterized by LEVI. Examining standard empirical examples from different applied areas, we find that the SEVI model outperforms the LEVI model, suggesting the relevance of its inclusion in applied researchers' toolkits."}, "https://arxiv.org/abs/2405.08284": {"title": "Predicting NVIDIA's Next-Day Stock Price: A Comparative Analysis of LSTM, MLP, ARIMA, and ARIMA-GARCH Models", "link": "https://arxiv.org/abs/2405.08284", "description": "arXiv:2405.08284v1 Announce Type: new \nAbstract: Forecasting stock prices remains a considerable challenge in financial markets, bearing significant implications for investors, traders, and financial institutions. Amid the ongoing AI revolution, NVIDIA has emerged as a key player driving innovation across various sectors. Given its prominence, we chose NVIDIA as the subject of our study."}, "https://arxiv.org/abs/2405.08307": {"title": "Sequential Maximal Updated Density Parameter Estimation for Dynamical Systems with Parameter Drift", "link": "https://arxiv.org/abs/2405.08307", "description": "arXiv:2405.08307v1 Announce Type: new \nAbstract: We present a novel method for generating sequential parameter estimates and quantifying epistemic uncertainty in dynamical systems within a data-consistent (DC) framework. 
The DC framework differs from traditional Bayesian approaches due to the incorporation of the push-forward of an initial density, which performs selective regularization in parameter directions not informed by the data in the resulting updated density. This extends a previous study that included the linear Gaussian theory within the DC framework and introduced the maximal updated density (MUD) estimate as an alternative to both least squares and maximum a posteriori (MAP) estimates. In this work, we introduce algorithms for operational settings of MUD estimation in real or near-real time, where spatio-temporal datasets arrive in packets to provide updated estimates of parameters and identify potential parameter drift. Computational diagnostics within the DC framework prove critical for evaluating (1) the quality of the DC update and MUD estimate and (2) the detection of parameter value drift. The algorithms are applied to estimate (1) wind drag parameters in a high-fidelity storm surge model, (2) the thermal diffusivity field for a heat conductivity problem, and (3) changing infection and incubation rates of an epidemiological model."}, "https://arxiv.org/abs/2405.08525": {"title": "Doubly-robust inference and optimality in structure-agnostic models with smoothness", "link": "https://arxiv.org/abs/2405.08525", "description": "arXiv:2405.08525v1 Announce Type: new \nAbstract: We study the problem of constructing an estimator of the average treatment effect (ATE) that exhibits doubly-robust asymptotic linearity (DRAL). This is a stronger requirement than doubly-robust consistency. A DRAL estimator can yield asymptotically valid Wald-type confidence intervals even when the propensity score or the outcome model is inconsistently estimated. By contrast, the celebrated doubly-robust, augmented-IPW (AIPW) estimator generally requires consistent estimation of both nuisance functions for standard root-n inference. We make three main contributions. First, we propose a new hybrid class of distributions that combines the structure-agnostic class introduced in Balakrishnan et al. (2023) with additional smoothness constraints. While DRAL is generally not possible in the pure structure-agnostic class, we show that it can be attained in the new hybrid one. Second, we calculate minimax lower bounds for estimating the ATE in the new class, as well as in the pure structure-agnostic one. Third, building upon the literature on doubly-robust inference (van der Laan, 2014; Benkeser et al., 2017; Dukes et al., 2021), we propose a new estimator of the ATE that enjoys DRAL. Under certain conditions, we show that its rate of convergence in the new class can be much faster than that achieved by the AIPW estimator and, in particular, matches the minimax lower bound rate, thereby establishing its optimality. Finally, we clarify the connection between DRAL estimators and those based on higher-order influence functions (Robins et al., 2017) and complement our theoretical findings with simulations."}, "https://arxiv.org/abs/2405.08675": {"title": "Simplifying Debiased Inference via Automatic Differentiation and Probabilistic Programming", "link": "https://arxiv.org/abs/2405.08675", "description": "arXiv:2405.08675v1 Announce Type: new \nAbstract: We introduce an algorithm that simplifies the construction of efficient estimators, making them accessible to a broader audience. 'Dimple' takes as input computer code representing a parameter of interest and outputs an efficient estimator.
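The Dimple entry above (continuing below) replaces hand-derived efficient influence functions with differentiation of the statistical functional itself. The following is a minimal sketch of that idea using a numerical derivative along a mixture perturbation of the empirical distribution, applied to the mean functional, where the pathwise derivative recovers the familiar influence function x - mean. This illustrates only the underlying concept and is not the Dimple package's actual API.

```python
import numpy as np

def mean_functional(weights, data):
    """Plug-in mean under a reweighted empirical distribution."""
    return np.sum(weights * data) / np.sum(weights)

def influence_function_at(x_new, data, eps=1e-6):
    """Numerical pathwise derivative of the functional along the mixture
    (1 - eps) * P_n + eps * delta_{x_new}; for the mean this recovers
    IF(x) = x - mean(data)."""
    base = np.ones(len(data)) / len(data)
    def along_path(e):
        data_aug = np.append(data, x_new)
        w = np.append((1 - e) * base, e)
        return mean_functional(w, data_aug)
    return (along_path(eps) - along_path(0.0)) / eps

rng = np.random.default_rng(7)
data = rng.normal(2.0, 1.0, size=1000)
x = 5.0
print("numerical IF:", round(influence_function_at(x, data), 4))
print("analytic  IF:", round(x - data.mean(), 4))
```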
Unlike standard approaches, it does not require users to derive a functional derivative known as the efficient influence function. Dimple avoids this task by applying automatic differentiation to the statistical functional of interest. Doing so requires expressing this functional as a composition of primitives satisfying a novel differentiability condition. Dimple also uses this composition to determine the nuisances it must estimate. In software, primitives can be implemented independently of one another and reused across different estimation problems. We provide a proof-of-concept Python implementation and showcase through examples how it allows users to go from parameter specification to efficient estimation with just a few lines of code."}, "https://arxiv.org/abs/2405.08687": {"title": "Latent group structure in linear panel data models with endogenous regressors", "link": "https://arxiv.org/abs/2405.08687", "description": "arXiv:2405.08687v1 Announce Type: new \nAbstract: This paper concerns the estimation of linear panel data models with endogenous regressors and a latent group structure in the coefficients. We consider instrumental variables estimation of the group-specific coefficient vector. We show that direct application of the Kmeans algorithm to the generalized method of moments objective function does not yield unique estimates. We newly develop and theoretically justify two-stage estimation methods that apply the Kmeans algorithm to a regression of the dependent variable on predicted values of the endogenous regressors. The results of Monte Carlo simulations demonstrate that two-stage estimation with the first stage modeled using a latent group structure achieves good classification accuracy, even if the true first-stage regression is fully heterogeneous. We apply our estimation methods to revisiting the relationship between income and democracy."}, "https://arxiv.org/abs/2405.08727": {"title": "Intervention effects based on potential benefit", "link": "https://arxiv.org/abs/2405.08727", "description": "arXiv:2405.08727v1 Announce Type: new \nAbstract: Optimal treatment rules are mappings from individual patient characteristics to tailored treatment assignments that maximize mean outcomes. In this work, we introduce a conditional potential benefit (CPB) metric that measures the expected improvement under an optimally chosen treatment compared to the status quo, within covariate strata. The potential benefit combines (i) the magnitude of the treatment effect, and (ii) the propensity for subjects to naturally select a suboptimal treatment. As a consequence, heterogeneity in the CPB can provide key insights into the mechanism by which a treatment acts and/or highlight potential barriers to treatment access or adverse effects. Moreover, we demonstrate that CPB is the natural prioritization score for individualized treatment policies when intervention capacity is constrained. That is, in the resource-limited setting where treatment options are freely accessible, but the ability to intervene on a portion of the target population is constrained (e.g., if the population is large, and follow-up and encouragement of treatment uptake is labor-intensive), targeting subjects with highest CPB maximizes the mean outcome. Focusing on this resource-limited setting, we derive formulas for optimal constrained treatment rules, and for any given budget, quantify the loss compared to the optimal unconstrained rule. 
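One way to make the conditional potential benefit (CPB) in the entry above concrete, under simplifying assumptions: for a binary treatment with known CATE tau(x) and natural-uptake propensity pi(x), the stratum-level gain of the optimal rule over the status quo is (1 - pi) * max(tau, 0) + pi * max(-tau, 0), and a capacity-constrained policy prioritises strata by this score. This is an illustrative reading of the definition with hypothetical numbers, not the authors' identification strategy or estimator.

```python
import numpy as np

def conditional_potential_benefit(tau, pi_treat):
    """Expected improvement of the optimal rule over the status quo in a
    covariate stratum, for a binary treatment:
      - if tau > 0 the optimal action is 'treat', and the gain accrues only
        to the fraction (1 - pi_treat) who would not naturally take it;
      - if tau < 0 the optimal action is 'do not treat', and the gain accrues
        only to the fraction pi_treat who would naturally take it."""
    return (1 - pi_treat) * np.maximum(tau, 0.0) + pi_treat * np.maximum(-tau, 0.0)

# Toy strata: (estimated CATE, natural uptake probability).
tau = np.array([0.30, 0.30, -0.20, 0.05])
pi = np.array([0.90, 0.10, 0.80, 0.50])
cpb = conditional_potential_benefit(tau, pi)
print("CPB by stratum:", cpb.round(3))

# Budget allows intervening on two strata: target the largest CPB first.
budget = 2
print("priority order:", np.argsort(-cpb)[:budget])
```

Note how the second stratum outranks the first despite an identical treatment effect, because hardly anyone there would take the treatment on their own.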
We describe sufficient identification assumptions, and propose nonparametric, robust, and efficient estimators of the proposed quantities emerging from our framework."}, "https://arxiv.org/abs/2405.08730": {"title": "A Generalized Difference-in-Differences Estimator for Unbiased Estimation of Desired Estimands from Staggered Adoption and Stepped-Wedge Settings", "link": "https://arxiv.org/abs/2405.08730", "description": "arXiv:2405.08730v1 Announce Type: new \nAbstract: Staggered treatment adoption arises in the evaluation of policy impact and implementation in a variety of settings. This occurs in both randomized stepped-wedge trials and non-randomized quasi-experimental designs using causal inference methods based on difference-in-differences analysis. In both settings, it is crucial to carefully consider the target estimand and possible treatment effect heterogeneities in order to estimate the effect without bias and in an interpretable fashion. This paper proposes a novel non-parametric approach to this estimation for either setting. By constructing an estimator using two-by-two difference-in-difference comparisons as building blocks with arbitrary weights, the investigator can select weights to target the desired estimand in an unbiased manner under assumed treatment effect homogeneity, and minimize the variance under an assumed working covariance structure. This provides desirable bias properties with a relatively small sacrifice in variance and power by using the comparisons efficiently. The method is demonstrated on toy examples to show the process, as well as in the re-analysis of a stepped wedge trial on the impact of novel tuberculosis diagnostic tools. A full algorithm with R code is provided to implement this method. The proposed method allows for high flexibility and clear targeting of desired effects, providing one solution to the bias-variance-generalizability tradeoff."}, "https://arxiv.org/abs/2405.08738": {"title": "Calibrated sensitivity models", "link": "https://arxiv.org/abs/2405.08738", "description": "arXiv:2405.08738v1 Announce Type: new \nAbstract: In causal inference, sensitivity models assess how unmeasured confounders could alter causal analyses. However, the sensitivity parameter in these models -- which quantifies the degree of unmeasured confounding -- is often difficult to interpret. For this reason, researchers will sometimes compare the magnitude of the sensitivity parameter to an estimate for measured confounding. This is known as calibration. We propose novel calibrated sensitivity models, which directly incorporate measured confounding, and bound the degree of unmeasured confounding by a multiple of measured confounding. We illustrate how to construct calibrated sensitivity models via several examples. We also demonstrate their advantages over standard sensitivity analyses and calibration; in particular, the calibrated sensitivity parameter is an intuitive unit-less ratio of unmeasured divided by measured confounding, unlike standard sensitivity parameters, and one can correctly incorporate uncertainty due to estimating measured confounding, which standard calibration methods fail to do. By incorporating uncertainty due to measured confounding, we observe that causal analyses can be less robust or more robust to unmeasured confounding than would have been shown with standard approaches. 
We develop efficient estimators and methods for inference for bounds on the average treatment effect with three calibrated sensitivity models, and establish that our estimators are doubly robust and attain parametric efficiency and asymptotic normality under nonparametric conditions on their nuisance function estimators. We illustrate our methods with data analyses on the effect of exposure to violence on attitudes towards peace in Darfur and the effect of mothers' smoking on infant birthweight."}, "https://arxiv.org/abs/2405.08759": {"title": "Optimal Sequential Procedure for Early Detection of Multiple Side Effects", "link": "https://arxiv.org/abs/2405.08759", "description": "arXiv:2405.08759v1 Announce Type: new \nAbstract: In this paper, we propose an optimal sequential procedure for the early detection of potential side effects resulting from the administration of some treatment (e.g., a vaccine). The results presented here extend previous results obtained in Wang and Boukai (2024), who study the single-side-effect case, to the case of two (or more) side effects. While the sequential procedure we employ simultaneously monitors several of the treatment's side effects, the $(\\alpha, \\beta)$-optimal test we propose does not require any information about the inter-correlation between these potential side effects. However, in all of the subsequent analyses, including the derivations of the exact expressions of the Average Sample Number (ASN), the power function, and the properties of the post-test (or post-detection) estimators, we accounted specifically for the correlation between the potential side effects. In real-life applications (such as post-marketing surveillance), the number of available observations is large enough to justify asymptotic analyses of the sequential procedure's properties (testing and post-detection estimation). Accordingly, we also derive the consistency and asymptotic normality of our post-test estimators; these results enable us to also provide (asymptotic, post-detection) confidence intervals for the probabilities of the various side effects. Moreover, to compare two specific side effects, their relative risk plays an important role. We derive the distribution of the estimated relative risk in the asymptotic framework to provide appropriate inference. To illustrate the theoretical results presented, we provide two detailed examples based on data on COVID-19 vaccine side effects collected in Nigeria (see Ilori et al. (2022))."}, "https://arxiv.org/abs/2405.08203": {"title": "Community detection in bipartite signed networks is highly dependent on parameter choice", "link": "https://arxiv.org/abs/2405.08203", "description": "arXiv:2405.08203v1 Announce Type: cross \nAbstract: Decision-making processes often involve voting. Human interactions with exogenous entities such as legislation or products can be effectively modeled as two-mode (bipartite) signed networks, where people can vote positively, vote negatively, or abstain from voting on the entities. Detecting communities in such networks could help us understand underlying properties: for example, ideological camps or consumer preferences. While community detection is an established practice separately for bipartite and signed networks, it remains largely unexplored in the case of bipartite signed networks. In this paper, we systematically evaluate the efficacy of community detection methods on bipartite signed networks using a synthetic benchmark and real-world datasets.
Our findings reveal that when no communities are present in the data, these methods often recover spurious communities. When communities are present, the algorithms exhibit promising performance, although their performance is highly susceptible to parameter choice. This indicates that researchers using community detection methods in the context of bipartite signed networks should not take the communities found at face value: it is essential to assess the robustness of parameter choices or perform domain-specific external validation."}, "https://arxiv.org/abs/2405.08290": {"title": "MCMC using $\\textit{bouncy}$ Hamiltonian dynamics: A unifying framework for Hamiltonian Monte Carlo and piecewise deterministic Markov process samplers", "link": "https://arxiv.org/abs/2405.08290", "description": "arXiv:2405.08290v1 Announce Type: cross \nAbstract: Piecewise-deterministic Markov process (PDMP) samplers constitute a state of the art Markov chain Monte Carlo (MCMC) paradigm in Bayesian computation, with examples including the zig-zag and bouncy particle sampler (BPS). Recent work on the zig-zag has indicated its connection to Hamiltonian Monte Carlo, a version of the Metropolis algorithm that exploits Hamiltonian dynamics. Here we establish that, in fact, the connection between the paradigms extends far beyond the specific instance. The key lies in (1) the fact that any time-reversible deterministic dynamics provides a valid Metropolis proposal and (2) how PDMPs' characteristic velocity changes constitute an alternative to the usual acceptance-rejection. We turn this observation into a rigorous framework for constructing rejection-free Metropolis proposals based on bouncy Hamiltonian dynamics which simultaneously possess Hamiltonian-like properties and generate discontinuous trajectories similar in appearance to PDMPs. When combined with periodic refreshment of the inertia, the dynamics converge strongly to PDMP equivalents in the limit of increasingly frequent refreshment. We demonstrate the practical implications of this new paradigm, with a sampler based on a bouncy Hamiltonian dynamics closely related to the BPS. The resulting sampler exhibits competitive performance on challenging real-data posteriors involving tens of thousands of parameters."}, "https://arxiv.org/abs/2405.08719": {"title": "Addressing Misspecification in Simulation-based Inference through Data-driven Calibration", "link": "https://arxiv.org/abs/2405.08719", "description": "arXiv:2405.08719v1 Announce Type: cross \nAbstract: Driven by steady progress in generative modeling, simulation-based inference (SBI) has enabled inference over stochastic simulators. However, recent work has demonstrated that model misspecification can harm SBI's reliability. This work introduces robust posterior estimation (ROPE), a framework that overcomes model misspecification with a small real-world calibration set of ground truth parameter measurements. We formalize the misspecification gap as the solution of an optimal transport problem between learned representations of real-world and simulated observations. Assuming the prior distribution over the parameters of interest is known and well-specified, our method offers a controllable balance between calibrated uncertainty and informative inference under all possible misspecifications of the simulator. 
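The ROPE entry above formalises the misspecification gap as an optimal transport problem between learned representations of real-world and simulated observations. For one-dimensional summaries this reduces to the classical Wasserstein distance, sketched here on illustrative samples rather than learned representations; the distributions and sample sizes are assumptions for demonstration only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(6)
simulated = rng.normal(0.0, 1.0, size=2000)    # stand-in for simulator summaries
observed = rng.normal(0.3, 1.2, size=500)      # stand-in for real-world summaries
# A larger value indicates a larger gap between the simulator and reality.
print("1-D Wasserstein gap:", round(wasserstein_distance(simulated, observed), 3))
```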
Our empirical results on four synthetic tasks and two real-world problems demonstrate that ROPE outperforms baselines and consistently returns informative and calibrated credible intervals."}, "https://arxiv.org/abs/2405.08796": {"title": "Variational Bayes and non-Bayesian Updating", "link": "https://arxiv.org/abs/2405.08796", "description": "arXiv:2405.08796v1 Announce Type: cross \nAbstract: I show how variational Bayes can be used as a microfoundation for a popular model of non-Bayesian updating. All the results here are mathematically trivial, but I think this direction is potentially interesting."}, "https://arxiv.org/abs/2405.08806": {"title": "Bounds on the Distribution of a Sum of Two Random Variables: Revisiting a problem of Kolmogorov with application to Individual Treatment Effects", "link": "https://arxiv.org/abs/2405.08806", "description": "arXiv:2405.08806v1 Announce Type: cross \nAbstract: We revisit the following problem, proposed by Kolmogorov: given prescribed marginal distributions $F$ and $G$ for random variables $X,Y$ respectively, characterize the set of compatible distribution functions for the sum $Z=X+Y$. Bounds on the distribution function for $Z$ were given by Makarov (1982) and Frank et al. (1987), the latter using copula theory. However, though they obtain the same bounds, they make different assertions concerning their sharpness. In addition, their solutions leave some open problems in the case when the given marginal distribution functions are discontinuous. These issues have led to some confusion and erroneous statements in the subsequent literature, which we correct.\n Kolmogorov's problem is closely related to inferring possible distributions for individual treatment effects $Y_1 - Y_0$ given the marginal distributions of $Y_1$ and $Y_0$, the latter being identified from a randomized experiment. We use our new insights to sharpen and correct results due to Fan and Park (2010) concerning individual treatment effects, and to fill some other logical gaps."}, "https://arxiv.org/abs/2009.05079": {"title": "Finding Groups of Cross-Correlated Features in Bi-View Data", "link": "https://arxiv.org/abs/2009.05079", "description": "arXiv:2009.05079v4 Announce Type: replace \nAbstract: Datasets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A,B) of feature sets from two data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A,B) is stable if A coincides with the set of features that have significant aggregate correlation with the features in B, and vice versa. This paper proposes an iterative-testing based bimodule search procedure (BSP) to identify stable bimodules.\n Compared to existing methods for detecting cross-correlated features, BSP was the best at recovering true bimodules with sufficient signal, while limiting false discoveries. In addition, we applied BSP to the problem of expression quantitative trait loci (eQTL) analysis using data from the GTEx consortium. BSP identified several thousand SNP-gene bimodules.
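A grid-based sketch of the classical Makarov-type bounds that the Kolmogorov / individual-treatment-effect entry above revisits: with only the two marginal CDFs of Y1 and Y0, the CDF of Delta = Y1 - Y0 is bounded as in Fan and Park (2010). The data and grid are illustrative assumptions, and the sharpness refinements and discontinuity issues discussed in the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
y1 = rng.normal(1.0, 1.0, size=5000)           # treated outcomes (marginal only)
y0 = rng.normal(0.0, 1.0, size=5000)           # control outcomes (marginal only)

def ecdf(sample):
    s = np.sort(sample)
    return lambda t: np.searchsorted(s, t, side="right") / len(s)

F1, F0 = ecdf(y1), ecdf(y0)
y_grid = np.linspace(-5, 7, 1201)

def makarov_bounds(delta):
    """Grid approximation of the Makarov / Fan-Park bounds on
    P(Y1 - Y0 <= delta) from the two marginal CDFs alone."""
    diff = F1(y_grid) - F0(y_grid - delta)
    lower = max(diff.max(), 0.0)
    upper = 1.0 + min(diff.min(), 0.0)
    return lower, upper

for d in (0.0, 1.0, 2.0):
    lo, up = makarov_bounds(d)
    print(f"delta={d:+.1f}:  {lo:.3f} <= F_Delta(delta) <= {up:.3f}")
```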
While many of the individual SNP-gene pairs appearing in the discovered bimodules were identified by standard eQTL methods, the discovered bimodules revealed genomic subnetworks that appeared to be biologically meaningful and worthy of further scientific investigation."}, "https://arxiv.org/abs/2202.07234": {"title": "Long-term Causal Inference Under Persistent Confounding via Data Combination", "link": "https://arxiv.org/abs/2202.07234", "description": "arXiv:2202.07234v4 Announce Type: replace \nAbstract: We study the identification and estimation of long-term treatment effects when both experimental and observational data are available. Since the long-term outcome is observed only after a long delay, it is not measured in the experimental data, but only recorded in the observational data. However, both types of data include observations of some short-term outcomes. In this paper, we uniquely tackle the challenge of persistent unmeasured confounders, i.e., some unmeasured confounders that can simultaneously affect the treatment, short-term outcomes and the long-term outcome, noting that they invalidate identification strategies in previous literature. To address this challenge, we exploit the sequential structure of multiple short-term outcomes, and develop three novel identification strategies for the average long-term treatment effect. We further propose three corresponding estimators and prove their asymptotic consistency and asymptotic normality. We finally apply our methods to estimate the effect of a job training program on long-term employment using semi-synthetic data. We numerically show that our proposals outperform existing methods that fail to handle persistent confounders."}, "https://arxiv.org/abs/2204.12023": {"title": "A One-Covariate-at-a-Time Method for Nonparametric Additive Models", "link": "https://arxiv.org/abs/2204.12023", "description": "arXiv:2204.12023v3 Announce Type: replace \nAbstract: This paper proposes a one-covariate-at-a-time multiple testing (OCMT) approach to choose significant variables in high-dimensional nonparametric additive regression models. Similarly to Chudik, Kapetanios and Pesaran (2018), we consider the statistical significance of individual nonparametric additive components one at a time and take into account the multiple testing nature of the problem. One-stage and multiple-stage procedures are both considered. The former works well in terms of the true positive rate only if the marginal effects of all signals are strong enough; the latter helps to pick up hidden signals that have weak marginal effects. Simulations demonstrate the good finite sample performance of the proposed procedures. As an empirical application, we use the OCMT procedure on a dataset we extracted from the Longitudinal Survey on Rural Urban Migration in China. We find that our procedure works well in terms of the out-of-sample forecast root mean square errors, compared with competing methods."}, "https://arxiv.org/abs/2206.10676": {"title": "Conditional probability tensor decompositions for multivariate categorical response regression", "link": "https://arxiv.org/abs/2206.10676", "description": "arXiv:2206.10676v2 Announce Type: replace \nAbstract: In many modern regression applications, the response consists of multiple categorical random variables whose probability mass is a function of a common set of predictors. 
In this article, we propose a new method for modeling such a probability mass function in settings where the number of response variables, the number of categories per response, and the dimension of the predictor are large. Our method relies on a functional probability tensor decomposition: a decomposition of a tensor-valued function such that its range is a restricted set of low-rank probability tensors. This decomposition is motivated by the connection between the conditional independence of responses, or lack thereof, and their probability tensor rank. We show that the model implied by such a low-rank functional probability tensor decomposition can be interpreted in terms of a mixture of regressions and can thus be fit using maximum likelihood. We derive an efficient and scalable penalized expectation maximization algorithm to fit this model and examine its statistical properties. We demonstrate the encouraging performance of our method through both simulation studies and an application to modeling the functional classes of genes."}, "https://arxiv.org/abs/2302.13999": {"title": "Forecasting Macroeconomic Tail Risk in Real Time: Do Textual Data Add Value?", "link": "https://arxiv.org/abs/2302.13999", "description": "arXiv:2302.13999v2 Announce Type: replace \nAbstract: We examine the incremental value of news-based data relative to the FRED-MD economic indicators for quantile predictions of employment, output, inflation and consumer sentiment in a high-dimensional setting. Our results suggest that news data contain valuable information that is not captured by a large set of economic indicators. We provide empirical evidence that this information can be exploited to improve tail risk predictions. The added value is largest when media coverage and sentiment are combined to compute text-based predictors. Methods that capture quantile-specific non-linearities produce overall superior forecasts relative to methods that feature linear predictive relationships. The results are robust along different modeling choices."}, "https://arxiv.org/abs/2304.09460": {"title": "Studying continuous, time-varying, and/or complex exposures using longitudinal modified treatment policies", "link": "https://arxiv.org/abs/2304.09460", "description": "arXiv:2304.09460v3 Announce Type: replace \nAbstract: This tutorial discusses methodology for causal inference using longitudinal modified treatment policies. This method facilitates the mathematical formalization, identification, and estimation of many novel parameters, and mathematically generalizes many commonly used parameters, such as the average treatment effect. Longitudinal modified treatment policies apply to a wide variety of exposures, including binary, multivariate, and continuous, and can accommodate time-varying treatments and confounders, competing risks, loss-to-follow-up, as well as survival, binary, or continuous outcomes. Longitudinal modified treatment policies can be seen as an extension of static and dynamic interventions to involve the natural value of treatment, and, like dynamic interventions, can be used to define alternative estimands with a positivity assumption that is more likely to be satisfied than estimands corresponding to static interventions. This tutorial aims to illustrate several practical uses of the longitudinal modified treatment policy methodology, including describing different estimation strategies and their corresponding advantages and disadvantages. 
We provide numerous examples of types of research questions which can be answered using longitudinal modified treatment policies. We go into more depth with one of these examples--specifically, estimating the effect of delaying intubation on critically ill COVID-19 patients' mortality. We demonstrate the use of the open-source R package lmtp to estimate the effects, and we provide code on https://github.com/kathoffman/lmtp-tutorial."}, "https://arxiv.org/abs/2310.09239": {"title": "Causal Quantile Treatment Effects with missing data by double-sampling", "link": "https://arxiv.org/abs/2310.09239", "description": "arXiv:2310.09239v2 Announce Type: replace \nAbstract: Causal weighted quantile treatment effects (WQTE) are a useful complement to standard causal contrasts that focus on the mean when interest lies at the tails of the counterfactual distribution. To date, however, methods for estimation and inference regarding causal WQTEs have assumed complete data on all relevant factors. In most practical settings, however, data will be missing or incomplete, particularly when the data are not collected for research purposes, as is the case for electronic health records and disease registries. Furthermore, such data sources may be particularly susceptible to the outcome data being missing-not-at-random (MNAR). In this paper, we consider the use of double-sampling, through which the otherwise missing data are ascertained on a sub-sample of study units, as a strategy to mitigate bias due to MNAR data in the estimation of causal WQTEs. With the additional data in-hand, we present identifying conditions that do not require assumptions regarding missingness in the original data. We then propose a novel inverse-probability weighted estimator and derive its asymptotic properties, both pointwise at specific quantiles and uniformly across a range of quantiles over some compact subset of (0,1), allowing the propensity score and double-sampling probabilities to be estimated. For practical inference, we develop a bootstrap method that can be used for both pointwise and uniform inference. A simulation study is conducted to examine the finite sample performance of the proposed estimators. The proposed method is illustrated with data from an EHR-based study examining the relative effects of two bariatric surgery procedures on BMI loss at 3 years post-surgery."}, "https://arxiv.org/abs/2310.13446": {"title": "Simple binning algorithm and SimDec visualization for comprehensive sensitivity analysis of complex computational models", "link": "https://arxiv.org/abs/2310.13446", "description": "arXiv:2310.13446v2 Announce Type: replace \nAbstract: Models of complex technological systems inherently contain interactions and dependencies among their input variables that affect their joint influence on the output. Such models are often computationally expensive and few sensitivity analysis methods can effectively process such complexities. Moreover, the sensitivity analysis field as a whole pays limited attention to the nature of interaction effects, whose understanding can prove to be critical for the design of safe and reliable systems. In this paper, we introduce and extensively test a simple binning approach for computing sensitivity indices and demonstrate how complementing it with the smart visualization method, simulation decomposition (SimDec), can permit important insights into the behavior of complex engineering models. 
The simple binning approach computes first- and second-order effects, and a combined sensitivity index, and is considerably more computationally efficient than the mainstream measure for Sobol indices introduced by Saltelli et al. The totality of the sensitivity analysis framework provides an efficient and intuitive way to analyze the behavior of complex systems containing interactions and dependencies."}, "https://arxiv.org/abs/2401.10057": {"title": "A method for characterizing disease emergence curves from paired pathogen detection and serology data", "link": "https://arxiv.org/abs/2401.10057", "description": "arXiv:2401.10057v2 Announce Type: replace \nAbstract: Wildlife disease surveillance programs and research studies track infection and identify risk factors for wild populations, humans, and agriculture. Often, several types of samples are collected from individuals to provide more complete information about an animal's infection history. Methods that jointly analyze multiple data streams to study disease emergence and drivers of infection via epidemiological process models remain underdeveloped. Joint-analysis methods can more thoroughly analyze all available data, more precisely quantifying epidemic processes, outbreak status, and risks. We contribute a paired data modeling approach that analyzes multiple samples from individuals. We use \"characterization maps\" to link paired data to epidemiological processes through a hierarchical statistical observation model. Our approach can provide both Bayesian and frequentist estimates of epidemiological parameters and state. We motivate our approach through the need to use paired pathogen and antibody detection tests to estimate parameters and infection trajectories for the widely applicable susceptible, infectious, recovered (SIR) model. We contribute general formulas to link characterization maps to arbitrary process models and datasets and an extended SIR model that better accommodates paired data. We find via simulation that paired data can more efficiently estimate SIR parameters than unpaired data, requiring samples from 5-10 times fewer individuals. We then study SARS-CoV-2 in wild White-tailed deer (Odocoileus virginianus) from three counties in the United States. Estimates for average infectious times corroborate captive animal studies. Our methods use general statistical theory to let applications extend beyond the SIR model we consider, and to more complicated examples of paired data."}, "https://arxiv.org/abs/2003.12408": {"title": "On the role of surrogates in the efficient estimation of treatment effects with limited outcome data", "link": "https://arxiv.org/abs/2003.12408", "description": "arXiv:2003.12408v3 Announce Type: replace-cross \nAbstract: In many experiments and observational studies, the outcome of interest is often difficult or expensive to observe, reducing effective sample sizes for estimating average treatment effects (ATEs) even when identifiable. We study how incorporating data on units for which only surrogate outcomes not of primary interest are observed can increase the precision of ATE estimation. We refrain from imposing stringent surrogacy conditions, which permit surrogates as perfect replacements for the target outcome. Instead, we supplement the available, albeit limited, observations of the target outcome (which by themselves identify the ATE) with abundant observations of surrogate outcomes, without any assumptions beyond random assignment and missingness and corresponding overlap conditions. 
To quantify the potential gains, we derive the difference in efficiency bounds on ATE estimation with and without surrogates, both when an overwhelming or comparable number of units have missing outcomes. We develop robust ATE estimation and inference methods that realize these efficiency gains. We empirically demonstrate the gains by studying the long-term-earning effects of job training."}, "https://arxiv.org/abs/2107.10955": {"title": "Learning Linear Polytree Structural Equation Models", "link": "https://arxiv.org/abs/2107.10955", "description": "arXiv:2107.10955v4 Announce Type: replace-cross \nAbstract: We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under the Gaussian polytree models, we study sufficient conditions on the sample sizes for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. On the other hand, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are also derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider the problem of inverse correlation matrix estimation under the linear polytree models, and establish the estimation error bound in terms of the dimension and the total number of v-structures. We also consider an extension of group linear polytree models, in which each node represents a group of variables. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structures can only be approximated by polytrees."}, "https://arxiv.org/abs/2306.05665": {"title": "Causal health impacts of power plant emission controls under modeled and uncertain physical process interference", "link": "https://arxiv.org/abs/2306.05665", "description": "arXiv:2306.05665v2 Announce Type: replace-cross \nAbstract: Causal inference with spatial environmental data is often challenging due to the presence of interference: outcomes for observational units depend on some combination of local and non-local treatment. This is especially relevant when estimating the effect of power plant emissions controls on population health, as pollution exposure is dictated by (i) the location of point-source emissions, as well as (ii) the transport of pollutants across space via dynamic physical-chemical processes. In this work, we estimate the effectiveness of air quality interventions at coal-fired power plants in reducing two adverse health outcomes in Texas in 2016: pediatric asthma ED visits and Medicare all-cause mortality. We develop methods for causal inference with interference when the underlying network structure is not known with certainty and instead must be estimated from ancillary data. Notably, uncertainty in the interference structure is propagated to the resulting causal effect estimates. We offer a Bayesian, spatial mechanistic model for the interference mapping which we combine with a flexible non-parametric outcome model to marginalize estimates of causal effects over uncertainty in the structure of interference. 
Our analysis finds some evidence that emissions controls at upwind power plants reduce asthma ED visits and all-cause mortality; however, accounting for uncertainty in the interference renders the results largely inconclusive."}, "https://arxiv.org/abs/2405.08841": {"title": "Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data", "link": "https://arxiv.org/abs/2405.08841", "description": "arXiv:2405.08841v1 Announce Type: new \nAbstract: Epidemiological delays, such as incubation periods, serial intervals, and hospital lengths of stay, are among key quantities in infectious disease epidemiology that inform public health policy and clinical practice. This information is used to inform mathematical and statistical models, which in turn can inform control strategies. There are three main challenges that make delay distributions difficult to estimate. First, the data are commonly censored (e.g., symptom onset may only be reported by date instead of the exact time of day). Second, delays are often right truncated when being estimated in real time (not all events that have occurred have been observed yet). Third, during a rapidly growing or declining outbreak, overrepresentation or underrepresentation, respectively, of recently infected cases in the data can lead to bias in estimates. Studies that estimate delays rarely address all these factors and sometimes report several estimates using different combinations of adjustments, which can lead to conflicting answers and confusion about which estimates are most accurate. In this work, we formulate a checklist of best practices for estimating and reporting epidemiological delays with a focus on the incubation period and serial interval. We also propose strategies for handling common biases and identify areas where more work is needed. Our recommendations can help improve the robustness and utility of reported estimates and provide guidance for the evaluation of estimates for downstream use in transmission models or other analyses."}, "https://arxiv.org/abs/2405.08853": {"title": "Evaluating the Uncertainty in Mean Residual Times: Estimators Based on Residence Times from Discrete Time Processes", "link": "https://arxiv.org/abs/2405.08853", "description": "arXiv:2405.08853v1 Announce Type: new \nAbstract: In this work, we propose estimators for the uncertainty in mean residual times that require, for their evaluation, statistically independent individual residence times obtained from a discrete time process. We examine their performance through numerical experiments involving well-known probability distributions, and an application example using molecular dynamics simulation results, from an aqueous NaCl solution, is provided. These computationally inexpensive estimators, capable of achieving very accurate outcomes, serve as useful tools for assessing and reporting uncertainties in mean residual times across a wide range of simulations."}, "https://arxiv.org/abs/2405.08912": {"title": "High dimensional test for functional covariates", "link": "https://arxiv.org/abs/2405.08912", "description": "arXiv:2405.08912v1 Announce Type: new \nAbstract: As medical devices become more complex, they routinely collect extensive and complicated data. While classical regressions typically examine the relationship between an outcome and a vector of predictors, it becomes imperative to identify the relationship with predictors possessing functional structures. 
In this article, we introduce a novel inference procedure for examining the relationship between outcomes and large-scale functional predictors. We target testing the linear hypothesis on the functional parameters under the generalized functional linear regression framework, where the number of the functional parameters grows with the sample size. We develop the estimation procedure for the high dimensional generalized functional linear model incorporating B-spline functional approximation and amenable regularization. Furthermore, we construct a procedure that is able to test the local alternative hypothesis on the linear combinations of the functional parameters. We establish the statistical guarantees in terms of non-asymptotic convergence of the parameter estimation and the oracle property and asymptotic normality of the estimators. Moreover, we derive the asymptotic distribution of the test statistic. We carry out intensive simulations and illustrate with a new dataset from an Alzheimer's disease magnetoencephalography study."}, "https://arxiv.org/abs/2405.09003": {"title": "Nonparametric Inference on Dose-Response Curves Without the Positivity Condition", "link": "https://arxiv.org/abs/2405.09003", "description": "arXiv:2405.09003v1 Announce Type: new \nAbstract: Existing statistical methods in causal inference often rely on the assumption that every individual has some chance of receiving any treatment level regardless of its associated covariates, which is known as the positivity condition. This assumption could be violated in observational studies with continuous treatments. In this paper, we present a novel integral estimator of the causal effects with continuous treatments (i.e., dose-response curves) without requiring the positivity condition. Our approach involves estimating the derivative function of the treatment effect on each observed data sample and integrating it to the treatment level of interest so as to address the bias resulting from the lack of positivity condition. The validity of our approach relies on an alternative weaker assumption that can be satisfied by additive confounding models. We provide a fast and reliable numerical recipe for computing our estimator in practice and derive its related asymptotic theory. To conduct valid inference on the dose-response curve and its derivative, we propose using the nonparametric bootstrap and establish its consistency. The practical performances of our proposed estimators are validated through simulation studies and an analysis of the effect of air pollution exposure (PM$_{2.5}$) on cardiovascular mortality rates."}, "https://arxiv.org/abs/2405.09080": {"title": "Causal Inference for a Hidden Treatment", "link": "https://arxiv.org/abs/2405.09080", "description": "arXiv:2405.09080v1 Announce Type: new \nAbstract: In many empirical settings, directly observing a treatment variable may be infeasible although an error-prone surrogate measurement of the latter will often be available. Causal inference based solely on the observed surrogate measurement of the hidden treatment may be particularly challenging without an additional assumption or auxiliary data. To address this issue, we propose a method that carefully incorporates the surrogate measurement together with a proxy of the hidden treatment to identify its causal effect on any scale for which identification would in principle be feasible had contrary to fact the treatment been observed error-free. 
Beyond identification, we provide general semiparametric theory for causal effects identified using our approach, and we derive a large class of semiparametric estimators with an appealing multiple robustness property. A significant obstacle to our approach is the estimation of nuisance functions involving the hidden treatment, which prevents the direct application of standard machine learning algorithms. To resolve this, we introduce a novel semiparametric EM algorithm, thus adding a practical dimension to our theoretical contributions. This methodology can be adapted to analyze a large class of causal parameters in the proposed hidden treatment model, including the population average treatment effect, the effect of treatment on the treated, quantile treatment effects, and causal effects under marginal structural models. We examine the finite-sample performance of our method using simulations and an application which aims to estimate the causal effect of Alzheimer's disease on hippocampal volume using data from the Alzheimer's Disease Neuroimaging Initiative."}, "https://arxiv.org/abs/2405.09149": {"title": "Exploring uniformity and maximum entropy distribution on torus through intrinsic geometry: Application to protein-chemistry", "link": "https://arxiv.org/abs/2405.09149", "description": "arXiv:2405.09149v1 Announce Type: new \nAbstract: A generic family of distributions, defined on the surface of a curved torus, is introduced using its area element. The area uniformity and the maximum entropy distribution are identified using the trigonometric moments of the proposed family. A marginal distribution is obtained as a three-parameter modification of the von Mises distribution that encompasses the von Mises, Cardioid, and Uniform distributions as special cases. The proposed family of the marginal distribution exhibits both symmetric and asymmetric, unimodal or bimodal shapes, contingent upon parameters. Furthermore, we scrutinize a two-parameter symmetric submodel, examining its moments, measure of variation, Kullback-Leibler divergence, and maximum likelihood estimation, among other properties. In addition, we introduce a modified acceptance-rejection sampling with a thin envelope obtained from the upper-Riemann-sum of a circular density, achieving a high rate of acceptance. This proposed sampling scheme will accelerate the empirical studies for a large-scale simulation, reducing the processing time. Furthermore, we extend the Uniform, Wrapped Cauchy, and Kato-Jones distributions to the surface of the curved torus and implement the proposed bivariate toroidal distribution for different groups of protein data, namely, $\\alpha$-helix, $\\beta$-sheet, and their mixture. A marginal of this proposed distribution is fitted to the wind direction data."}, "https://arxiv.org/abs/2405.09331": {"title": "Multi-Source Conformal Inference Under Distribution Shift", "link": "https://arxiv.org/abs/2405.09331", "description": "arXiv:2405.09331v1 Announce Type: new \nAbstract: Recent years have experienced increasing utilization of complex machine learning models across multiple sources of data to inform more generalizable decision-making. However, distribution shifts across data sources and privacy concerns related to sharing individual-level data, coupled with a lack of uncertainty quantification from machine learning predictions, make it challenging to achieve valid inferences in multi-source environments. 
In this paper, we consider the problem of obtaining distribution-free prediction intervals for a target population, leveraging multiple potentially biased data sources. We derive the efficient influence functions for the quantiles of unobserved outcomes in the target and source populations, and show that one can incorporate machine learning prediction algorithms in the estimation of nuisance functions while still achieving parametric rates of convergence to nominal coverage probabilities. Moreover, when conditional outcome invariance is violated, we propose a data-adaptive strategy to upweight informative data sources for efficiency gain and downweight non-informative data sources for bias reduction. We highlight the robustness and efficiency of our proposals for a variety of conformal scores and data-generating mechanisms via extensive synthetic experiments. Hospital length of stay prediction intervals for pediatric patients undergoing a high-risk cardiac surgical procedure between 2016-2022 in the U.S. illustrate the utility of our methodology."}, "https://arxiv.org/abs/2405.09485": {"title": "Predicting Future Change-points in Time Series", "link": "https://arxiv.org/abs/2405.09485", "description": "arXiv:2405.09485v1 Announce Type: new \nAbstract: Change-point detection and estimation procedures have been widely developed in the literature. However, commonly used approaches in change-point analysis have mainly been focusing on detecting change-points within an entire time series (off-line methods), or quickest detection of change-points in sequentially observed data (on-line methods). Both classes of methods are concerned with change-points that have already occurred. The arguably more important question of when future change-points may occur, remains largely unexplored. In this paper, we develop a novel statistical model that describes the mechanism of change-point occurrence. Specifically, the model assumes a latent process in the form of a random walk driven by non-negative innovations, and an observed process which behaves differently when the latent process belongs to different regimes. By construction, an occurrence of a change-point is equivalent to hitting a regime threshold by the latent process. Therefore, by predicting when the latent process will hit the next regime threshold, future change-points can be forecasted. The probabilistic properties of the model such as stationarity and ergodicity are established. A composite likelihood-based approach is developed for parameter estimation and model selection. Moreover, we construct the predictor and prediction interval for future change points based on the estimated model."}, "https://arxiv.org/abs/2405.09509": {"title": "Double Robustness of Local Projections and Some Unpleasant VARithmetic", "link": "https://arxiv.org/abs/2405.09509", "description": "arXiv:2405.09509v1 Announce Type: new \nAbstract: We consider impulse response inference in a locally misspecified stationary vector autoregression (VAR) model. The conventional local projection (LP) confidence interval has correct coverage even when the misspecification is so large that it can be detected with probability approaching 1. This follows from a \"double robustness\" property analogous to that of modern estimators for partially linear regressions. In contrast, VAR confidence intervals dramatically undercover even for misspecification so small that it is difficult to detect statistically and cannot be ruled out based on economic theory. 
This is because of a \"no free lunch\" result for VARs: the worst-case bias and coverage distortion are small if, and only if, the variance is close to that of LP. While VAR coverage can be restored by using a bias-aware critical value or a large lag length, the resulting confidence interval tends to be at least as wide as the LP interval."}, "https://arxiv.org/abs/2405.09536": {"title": "Wasserstein Gradient Boosting: A General Framework with Applications to Posterior Regression", "link": "https://arxiv.org/abs/2405.09536", "description": "arXiv:2405.09536v1 Announce Type: new \nAbstract: Gradient boosting is a sequential ensemble method that fits a new base learner to the gradient of the remaining loss at each step. We propose a novel family of gradient boosting, Wasserstein gradient boosting, which fits a new base learner to an exactly or approximately available Wasserstein gradient of a loss functional on the space of probability distributions. Wasserstein gradient boosting returns a set of particles that approximates a target probability distribution assigned at each input. In probabilistic prediction, a parametric probability distribution is often specified on the space of output variables, and a point estimate of the output-distribution parameter is produced for each input by a model. Our main application of Wasserstein gradient boosting is a novel distributional estimate of the output-distribution parameter, which approximates the posterior distribution over the output-distribution parameter determined pointwise at each data point. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with various existing methods."}, "https://arxiv.org/abs/2405.08907": {"title": "Properties of stationary cyclical processes", "link": "https://arxiv.org/abs/2405.08907", "description": "arXiv:2405.08907v1 Announce Type: cross \nAbstract: The paper investigates the theoretical properties of zero-mean stationary time series with cyclical components, admitting the representation $y_t=\\alpha_t \\cos \\lambda t + \\beta_t \\sin \\lambda t$, with $\\lambda \\in (0,\\pi]$ and $[\\alpha_t\\,\\, \\beta_t]$ following some bivariate process. We diagnose that in the extant literature on cyclic time series, a prevalent assumption of Gaussianity for $[\\alpha_t\\,\\, \\beta_t]$ imposes inadvertently a severe restriction on the amplitude of the process. Moreover, it is shown that other common distributions may suffer from either similar defects or fail to guarantee the stationarity of $y_t$. To address both of the issues, we propose to introduce a direct stochastic modulation of the amplitude and phase shift in an almost periodic function. We prove that this novel approach may lead, in general, to a stationary (up to any order) time series, and specifically, to a zero-mean stationary time series featuring cyclicity, with a pseudo-cyclical autocovariance function that may even decay at a very slow rate. 
The proposed process fills an important gap in this type of models and allows for flexible modeling of amplitude and phase shift."}, "https://arxiv.org/abs/2405.09076": {"title": "Enhancing Airline Customer Satisfaction: A Machine Learning and Causal Analysis Approach", "link": "https://arxiv.org/abs/2405.09076", "description": "arXiv:2405.09076v1 Announce Type: cross \nAbstract: This study explores the enhancement of customer satisfaction in the airline industry, a critical factor for retaining customers and building brand reputation, which are vital for revenue growth. Utilizing a combination of machine learning and causal inference methods, we examine the specific impact of service improvements on customer satisfaction, with a focus on the online boarding pass experience. Through detailed data analysis involving several predictive and causal models, we demonstrate that improvements in the digital aspects of customer service significantly elevate overall customer satisfaction. This paper highlights how airlines can strategically leverage these insights to make data-driven decisions that enhance customer experiences and, consequently, their market competitiveness."}, "https://arxiv.org/abs/2405.09500": {"title": "Identifying Heterogeneous Decision Rules From Choices When Menus Are Unobserved", "link": "https://arxiv.org/abs/2405.09500", "description": "arXiv:2405.09500v1 Announce Type: cross \nAbstract: Given only aggregate choice data and limited information about how menus are distributed across the population, we describe what can be inferred robustly about the distribution of preferences (or more general decision rules). We strengthen and generalize existing results on such identification and provide an alternative analytical approach to study the problem. We show further that our model and results are applicable, after suitable reinterpretation, to other contexts. One application is to the robust identification of the distribution of updating rules given only the population distribution of beliefs and limited information about heterogeneous information sources."}, "https://arxiv.org/abs/2209.07295": {"title": "A new set of tools for goodness-of-fit validation", "link": "https://arxiv.org/abs/2209.07295", "description": "arXiv:2209.07295v2 Announce Type: replace \nAbstract: We introduce two new tools to assess the validity of statistical distributions. These tools are based on components derived from a new statistical quantity, the $comparison$ $curve$. The first tool is a graphical representation of these components on a $bar$ $plot$ (B plot), which can provide a detailed appraisal of the validity of the statistical model, in particular when supplemented by acceptance regions related to the model. The knowledge gained from this representation can sometimes suggest an existing $goodness$-$of$-$fit$ test to supplement this visual assessment with a control of the type I error. Otherwise, an adaptive test may be preferable and the second tool is the combination of these components to produce a powerful $\\chi^2$-type goodness-of-fit test. Because the number of these components can be large, we introduce a new selection rule to decide, in a data driven fashion, on their proper number to take into consideration. In a simulation, our goodness-of-fit tests are seen to be powerwise competitive with the best solutions that have been recommended in the context of a fully specified model as well as when some parameters must be estimated. 
Practical examples show how to use these tools to derive principled information about where the model departs from the data."}, "https://arxiv.org/abs/2212.11304": {"title": "Powerful Partial Conjunction Hypothesis Testing via Conditioning", "link": "https://arxiv.org/abs/2212.11304", "description": "arXiv:2212.11304v3 Announce Type: replace \nAbstract: A Partial Conjunction Hypothesis (PCH) test combines information across a set of base hypotheses to determine whether some subset is non-null. PCH tests arise in a diverse array of fields, but standard PCH testing methods can be highly conservative, leading to low power especially in low signal settings commonly encountered in applications. In this paper, we introduce the conditional PCH (cPCH) test, a new method for testing a single PCH that directly corrects the conservativeness of standard approaches by conditioning on certain order statistics of the base p-values. Under distributional assumptions commonly encountered in PCH testing, the cPCH test is valid and produces nearly uniformly distributed p-values under the null (i.e., cPCH p-values are only very slightly conservative). We demonstrate that the cPCH test matches or outperforms existing single PCH tests with particular power gains in low signal settings, maintains Type I error control even under model misspecification, and can be used to outperform state-of-the-art multiple PCH testing procedures in certain settings, particularly when side information is present. Finally, we illustrate an application of the cPCH test through a replicability analysis across DNA microarray studies."}, "https://arxiv.org/abs/2307.11127": {"title": "Asymptotically Unbiased Synthetic Control Methods by Distribution Matching", "link": "https://arxiv.org/abs/2307.11127", "description": "arXiv:2307.11127v3 Announce Type: replace \nAbstract: Synthetic Control Methods (SCMs) have become an essential tool for comparative case studies. The fundamental idea of SCMs is to estimate the counterfactual outcomes of a treated unit using a weighted sum of the observed outcomes of untreated units. The accuracy of the synthetic control (SC) is critical for evaluating the treatment effect of a policy intervention; therefore, the estimation of SC weights has been the focus of extensive research. In this study, we first point out that existing SCMs suffer from an endogeneity problem, the correlation between the outcomes of untreated units and the error term of the synthetic control, which yields a bias in the treatment effect estimator. We then propose a novel SCM based on density matching, assuming that the density of outcomes of the treated unit can be approximated by a weighted average of the joint density of untreated units (i.e., a mixture model). Based on this assumption, we estimate SC weights by matching the moments of treated outcomes with the weighted sum of moments of untreated outcomes. Our proposed method has three advantages over existing methods: first, our estimator is asymptotically unbiased under the assumption of the mixture model; second, due to the asymptotic unbiasedness, we can reduce the mean squared error in counterfactual predictions; third, our method generates full densities of the treatment effect, not merely expected values, which broadens the applicability of SCMs. 
We provide experimental results to demonstrate the effectiveness of our proposed method."}, "https://arxiv.org/abs/2310.09105": {"title": "Estimating Individual Responses when Tomorrow Matters", "link": "https://arxiv.org/abs/2310.09105", "description": "arXiv:2310.09105v3 Announce Type: replace \nAbstract: We propose a regression-based approach to estimate how individuals' expectations influence their responses to a counterfactual change. We provide conditions under which average partial effects based on regression estimates recover structural effects. We propose a practical three-step estimation method that relies on panel data on subjective expectations. We illustrate our approach in a model of consumption and saving, focusing on the impact of an income tax that not only changes current income but also affects beliefs about future income. Applying our approach to Italian survey data, we find that individuals' beliefs matter for evaluating the impact of tax policies on consumption decisions."}, "https://arxiv.org/abs/2310.14448": {"title": "Semiparametrically Efficient Score for the Survival Odds Ratio", "link": "https://arxiv.org/abs/2310.14448", "description": "arXiv:2310.14448v2 Announce Type: replace \nAbstract: We consider a general proportional odds model for survival data under binary treatment, where the functional form of the covariates is left unspecified. We derive the efficient score for the conditional survival odds ratio given the covariates using modern semiparametric theory. The efficient score may be useful in the development of doubly robust estimators, although computational challenges remain."}, "https://arxiv.org/abs/2312.08174": {"title": "Double Machine Learning for Static Panel Models with Fixed Effects", "link": "https://arxiv.org/abs/2312.08174", "description": "arXiv:2312.08174v3 Announce Type: replace \nAbstract: Recent advances in causal inference have seen the development of methods which make use of the predictive power of machine learning algorithms. In this paper, we use double machine learning (DML) (Chernozhukov et al., 2018) to approximate high-dimensional and non-linear nuisance functions of the confounders to make inferences about the effects of policy interventions from panel data. We propose new estimators by adapting correlated random effects, within-group and first-difference estimation for linear models to an extension of Robinson (1988)'s partially linear regression model to static panel data models with individual fixed effects and unspecified non-linear confounder effects. Using Monte Carlo simulations, we compare the relative performance of different machine learning algorithms and find that conventional least squares estimators perform well when the data generating process is mildly non-linear and smooth, but there are substantial performance gains with DML in terms of bias reduction when the true effect of the regressors is non-linear and discontinuous. However, inference based on individual learners can lead to badly biased inference. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the minimum wage on voting behavior in the UK."}, "https://arxiv.org/abs/2212.01792": {"title": "Classification by sparse generalized additive models", "link": "https://arxiv.org/abs/2212.01792", "description": "arXiv:2212.01792v4 Announce Type: replace-cross \nAbstract: We consider (nonparametric) sparse (generalized) additive models (SpAM) for classification. 
The design of a SpAM classifier is based on minimizing the logistic loss with sparse group Lasso/Slope-type penalties on the coefficients of univariate additive components' expansions in orthonormal series (e.g., Fourier or wavelets). The resulting classifier is inherently adaptive to the unknown sparsity and smoothness. We show that under a certain sparse group restricted eigenvalue condition it is nearly-minimax (up to log-factors) simultaneously across the entire range of analytic, Sobolev and Besov classes. The performance of the proposed classifier is illustrated on simulated and real-data examples."}, "https://arxiv.org/abs/2308.01156": {"title": "A new adaptive local polynomial density estimation procedure on complicated domains", "link": "https://arxiv.org/abs/2308.01156", "description": "arXiv:2308.01156v2 Announce Type: replace-cross \nAbstract: This paper presents a novel approach for pointwise estimation of multivariate density functions on known domains of arbitrary dimensions using nonparametric local polynomial estimators. Our method is highly flexible, as it applies to both simple domains, such as open connected sets, and more complicated domains that are not star-shaped around the point of estimation. This enables us to handle domains with sharp concavities, holes, and local pinches, such as polynomial sectors. Additionally, we introduce a data-driven selection rule based on the general ideas of Goldenshluger and Lepski. Our results demonstrate that the local polynomial estimators are minimax under an $L^2$ risk across a wide range of H\\\"older-type functional classes. In the adaptive case, we provide oracle inequalities and explicitly determine the convergence rate of our statistical procedure. Simulations on polynomial sectors show that our oracle estimates outperform those of the most popular alternative method, found in the sparr package for the R software. Our statistical procedure is implemented in an online R package which is readily accessible."}, "https://arxiv.org/abs/2405.09797": {"title": "Identification of Single-Treatment Effects in Factorial Experiments", "link": "https://arxiv.org/abs/2405.09797", "description": "arXiv:2405.09797v1 Announce Type: new \nAbstract: Despite their cost, randomized controlled trials (RCTs) are widely regarded as gold-standard evidence in disciplines ranging from social science to medicine. In recent decades, researchers have increasingly sought to reduce the resource burden of repeated RCTs with factorial designs that simultaneously test multiple hypotheses, e.g. experiments that evaluate the effects of many medications or products simultaneously. Here I show that when multiple interventions are randomized in experiments, the effect any single intervention would have outside the experimental setting is not identified absent heroic assumptions, even if otherwise perfectly realistic conditions are achieved. This happens because single-treatment effects involve a counterfactual world with a single focal intervention, allowing other variables to take their natural values (which may be confounded or modified by the focal intervention). In contrast, observational studies and factorial experiments provide information about potential-outcome distributions with zero and multiple interventions, respectively. In this paper, I formalize sufficient conditions for the identifiability of those isolated quantities. 
I show that researchers who rely on this type of design have to either justify linearity of functional forms or -- in the nonparametric case -- specify with Directed Acyclic Graphs how variables are related in the real world. Finally, I develop nonparametric sharp bounds -- i.e., maximally informative best-/worst-case estimates consistent with limited RCT data -- that show when extrapolations about effect signs are empirically justified. These new results are illustrated with simulated data."}, "https://arxiv.org/abs/2405.09810": {"title": "Trajectory-Based Individualized Treatment Rules", "link": "https://arxiv.org/abs/2405.09810", "description": "arXiv:2405.09810v1 Announce Type: new \nAbstract: A core component of precision medicine research involves optimizing individualized treatment rules (ITRs) based on patient characteristics. Many studies used to estimate ITRs are longitudinal in nature, collecting outcomes over time. Yet, to date, methods developed to estimate ITRs often ignore the longitudinal structure of the data. Information available from the longitudinal nature of the data can be especially useful in mental health studies. Although treatment means might appear similar, understanding the trajectory of outcomes over time can reveal important differences between treatments and placebo effects. This longitudinal perspective is especially beneficial in mental health research, where subtle shifts in outcome patterns can hold significant implications. Despite numerous studies involving the collection of outcome data across various time points, most precision medicine methods used to develop ITRs overlook the information available from the longitudinal structure. The prevalence of missing data in such studies exacerbates the issue, as neglecting the longitudinal nature of the data can significantly impair the effectiveness of treatment rules. This paper develops a powerful longitudinal trajectory-based ITR construction method that incorporates baseline variables, via a single-index or biosignature, into the modeling of longitudinal outcomes. This trajectory-based ITR approach substantially minimizes the negative impact of missing data compared to more traditional ITR approaches. The approach is illustrated through simulation studies and a clinical trial for depression, contrasting it with more traditional ITRs that ignore longitudinal information."}, "https://arxiv.org/abs/2405.09887": {"title": "Quantization-based LHS for dependent inputs: application to sensitivity analysis of environmental models", "link": "https://arxiv.org/abs/2405.09887", "description": "arXiv:2405.09887v1 Announce Type: new \nAbstract: Numerical modeling is essential for comprehending intricate physical phenomena in different domains. To handle complexity, sensitivity analysis, particularly screening, is crucial for identifying influential input parameters. Kernel-based methods, such as the Hilbert Schmidt Independence Criterion (HSIC), are valuable for analyzing dependencies between inputs and outputs. Moreover, due to the computational expense of such models, metamodels (or surrogate models) are often unavoidable. Implementing metamodels and HSIC requires data from the original model, which leads to the need for space-filling designs. While existing methods like Latin Hypercube Sampling (LHS) are effective for independent variables, incorporating dependence is challenging. 
This paper introduces a novel LHS variant, Quantization-based LHS, which leverages Voronoi vector quantization to address correlated inputs. The method ensures comprehensive coverage of stratified variables, enhancing distribution across marginals. The paper outlines expectation estimators based on Quantization-based LHS in various dependency settings, demonstrating their unbiasedness. The method is applied to several models of growing complexity, first on simple examples to illustrate the theory, then on more complex environmental hydrological models, when the dependence is known or not, and with more and more interactive processes and factors. The last application is on the digital twin of a French vineyard catchment (Beaujolais region) to design a vegetative filter strip and reduce water, sediment and pesticide transfers from the fields to the river. Quantization-based LHS is used to compute HSIC measures and independence tests, demonstrating its usefulness, especially in the context of complex models."}, "https://arxiv.org/abs/2405.09906": {"title": "Process-based Inference for Spatial Energetics Using Bayesian Predictive Stacking", "link": "https://arxiv.org/abs/2405.09906", "description": "arXiv:2405.09906v1 Announce Type: new \nAbstract: Rapid developments in streaming data technologies have enabled real-time monitoring of human activity that can deliver high-resolution data on health variables over trajectories or paths carved out by subjects as they conduct their daily physical activities. Wearable devices, such as wrist-worn sensors that monitor gross motor activity, have become prevalent and have kindled the emerging field of ``spatial energetics'' in environmental health sciences. We devise a Bayesian inferential framework for analyzing such data while accounting for information available on specific spatial coordinates comprising a trajectory or path using a Global Positioning System (GPS) device embedded within the wearable device. We offer full probabilistic inference with uncertainty quantification using spatial-temporal process models adapted for data generated from ``actigraph'' units as the subject traverses a path or trajectory in their daily routine. Anticipating the need for fast inference for mobile health data, we pursue exact inference using conjugate Bayesian models and employ predictive stacking to assimilate inference across these individual models. This circumvents issues with iterative estimation algorithms such as Markov chain Monte Carlo. We devise Bayesian predictive stacking in this context for models that treat time as discrete epochs and that treat time as continuous. We illustrate our methods with simulation experiments and analysis of data from the Physical Activity through Sustainable Transport Approaches (PASTA-LA) study conducted by the Fielding School of Public Health at the University of California, Los Angeles."}, "https://arxiv.org/abs/2405.10026": {"title": "The case for specifying the \"ideal\" target trial", "link": "https://arxiv.org/abs/2405.10026", "description": "arXiv:2405.10026v1 Announce Type: new \nAbstract: The target trial is an increasingly popular conceptual device for guiding the design and analysis of observational studies that seek to perform causal inference. As tends to occur with concepts like this, there is variability in how certain aspects of the approach are understood, which may lead to potentially consequential differences in how the approach is taught, implemented, and interpreted in practice. 
In this commentary, we provide a perspective on two of these aspects: how the target trial should be specified, and relatedly, how the target trial fits within a formal causal inference framework."}, "https://arxiv.org/abs/2405.10036": {"title": "Large-scale Data Integration using Matrix Denoising and Geometric Factor Matching", "link": "https://arxiv.org/abs/2405.10036", "description": "arXiv:2405.10036v1 Announce Type: new \nAbstract: Unsupervised integrative analysis of multiple data sources has become commonplace, and scalable algorithms are necessary to accommodate the ever-increasing availability of data. Only a few current methods have estimation speed as their focus, and those that do are only applicable to restricted data layouts such as different data types measured on the same observation units. We introduce a novel point of view on low-rank matrix integration phrased as a graph estimation problem which allows development of a method, large-scale Collective Matrix Factorization (lsCMF), which is able to integrate data in flexible layouts in a speedy fashion. It utilizes a matrix denoising framework for rank estimation and geometric properties of singular vectors to efficiently integrate data. The quick estimation speed of lsCMF, while retaining good estimation of data structure, is then demonstrated in simulation studies."}, "https://arxiv.org/abs/2405.10067": {"title": "Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layouts", "link": "https://arxiv.org/abs/2405.10067", "description": "arXiv:2405.10067v1 Announce Type: new \nAbstract: Interest in unsupervised methods for joint analysis of heterogeneous data sources has risen in recent years. Low-rank latent factor models have proven to be an effective tool for data integration and have been extended to a large number of data source layouts. Of particular interest is the separation of variation present in data sources into shared and individual subspaces. In addition, interpretability of estimated latent factors is crucial to further understanding.\n We present sparse and orthogonal low-rank Collective Matrix Factorization (solrCMF) to estimate low-rank latent factor models for flexible data layouts. These encompass traditional multi-view (one group, multiple data types) and multi-grid (multiple groups, multiple data types) layouts, as well as augmented layouts, which allow the inclusion of side information between data types or groups. In addition, solrCMF allows tensor-like layouts (repeated layers), estimates interpretable factors, and determines variation structure among factors and data sources.\n Using a penalized optimization approach, we automatically separate variability into the globally and partially shared as well as individual components and estimate sparse representations of factors. To further increase interpretability of factors, we enforce orthogonality between them. Estimation is performed efficiently in a recent multi-block ADMM framework which we adapted to support embedded manifold constraints.\n The performance of solrCMF is demonstrated in simulation studies and compares favorably to existing methods."}, "https://arxiv.org/abs/2405.10198": {"title": "Comprehensive Causal Machine Learning", "link": "https://arxiv.org/abs/2405.10198", "description": "arXiv:2405.10198v1 Announce Type: new \nAbstract: Uncovering causal effects at various levels of granularity provides substantial value to decision makers. 
Comprehensive machine learning approaches to causal effect estimation allow the use of a single causal machine learning approach for estimation and inference of causal mean effects for all levels of granularity. Focusing on selection-on-observables, this paper compares three such approaches, the modified causal forest (mcf), the generalized random forest (grf), and double machine learning (dml). It also provides proven theoretical guarantees for the mcf and compares the theoretical properties of the approaches. The findings indicate that dml-based methods excel for average treatment effects at the population level (ATE) and group level (GATE) with few groups, when selection into treatment is not too strong. However, for finer causal heterogeneity, explicitly outcome-centred forest-based approaches are superior. The mcf has three additional benefits: (i) It is the most robust estimator in cases when dml-based approaches underperform because of substantial selectivity; (ii) it is the best estimator for GATEs when the number of groups gets larger; and (iii), it is the only estimator that is internally consistent, in the sense that low-dimensional causal ATEs and GATEs are obtained as aggregates of finer-grained causal parameters."}, "https://arxiv.org/abs/2405.10302": {"title": "Optimal Aggregation of Prediction Intervals under Unsupervised Domain Shift", "link": "https://arxiv.org/abs/2405.10302", "description": "arXiv:2405.10302v1 Announce Type: new \nAbstract: As machine learning models are increasingly deployed in dynamic environments, it becomes paramount to assess and quantify uncertainties associated with distribution shifts. A distribution shift occurs when the underlying data-generating process changes, leading to a deviation in the model's performance. The prediction interval, which captures the range of likely outcomes for a given prediction, serves as a crucial tool for characterizing uncertainties induced by their underlying distribution. In this paper, we propose methodologies for aggregating prediction intervals to obtain one with minimal width and adequate coverage on the target domain under unsupervised domain shift, under which we have labeled samples from a related source domain and unlabeled covariates from the target domain. Our analysis encompasses scenarios where the source and the target domain are related via i) a bounded density ratio, and ii) a measure-preserving transformation. Our proposed methodologies are computationally efficient and easy to implement. Beyond illustrating the performance of our method through a real-world dataset, we also delve into the theoretical details. This includes establishing rigorous theoretical guarantees, coupled with finite sample bounds, regarding the coverage and width of our prediction intervals. Our approach excels in practical applications and is underpinned by a solid theoretical framework, ensuring its reliability and effectiveness across diverse contexts."}, "https://arxiv.org/abs/2405.09596": {"title": "Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)", "link": "https://arxiv.org/abs/2405.09596", "description": "arXiv:2405.09596v1 Announce Type: cross \nAbstract: The prediction of ship trajectories is a growing field of study in artificial intelligence. Traditional methods rely on the use of LSTM, GRU networks, and even Transformer architectures for the prediction of spatio-temporal series. 
This study proposes a viable alternative for predicting these trajectories using only GNSS positions. It considers this spatio-temporal problem as a natural language processing problem. The latitude/longitude coordinates of AIS messages are transformed into cell identifiers using the H3 index. Thanks to the pseudo-octal representation, it becomes easier for language models to learn the spatial hierarchy of the H3 index. The method is compared with a classical Kalman filter, widely used in the maritime domain, and introduces the Fr\\'echet distance as the main evaluation metric. We show that it is possible to predict ship trajectories quite precisely up to 8 hours with 30 minutes of context. We demonstrate that this alternative works well enough to predict trajectories worldwide."}, "https://arxiv.org/abs/2405.09989": {"title": "A Gaussian Process Model for Ordinal Data with Applications to Chemoinformatics", "link": "https://arxiv.org/abs/2405.09989", "description": "arXiv:2405.09989v1 Announce Type: cross \nAbstract: With the proliferation of screening tools for chemical testing, it is now possible to create vast databases of chemicals easily. However, rigorous statistical methodologies employed to analyse these databases are in their infancy, and further development to facilitate chemical discovery is imperative. In this paper, we present conditional Gaussian process models to predict ordinal outcomes from chemical experiments, where the inputs are chemical compounds. We implement the Tanimoto distance, a metric on the chemical space, within the covariance of the Gaussian processes to capture correlated effects in the chemical space. A novel aspect of our model is that the kernel contains a scaling parameter, a feature not previously examined in the literature, that controls the strength of the correlation between elements of the chemical space. Using molecular fingerprints, a numerical representation of a compound's location within the chemical space, we show that accounting for correlation amongst chemical compounds improves predictive performance over the uncorrelated model, where effects are assumed to be independent. Moreover, we present a genetic algorithm for the facilitation of chemical discovery and identification of important features to the compound's efficacy. A simulation study is conducted to demonstrate the suitability of the proposed methods. Our proposed methods are demonstrated on a hazard classification problem of organic solvents."}, "https://arxiv.org/abs/2208.13370": {"title": "A Consistent ICM-based $\\chi^2$ Specification Test", "link": "https://arxiv.org/abs/2208.13370", "description": "arXiv:2208.13370v2 Announce Type: replace \nAbstract: In spite of the omnibus property of Integrated Conditional Moment (ICM) specification tests, they are not commonly used in empirical practice owing to, e.g., the non-pivotality of the test and the high computational cost of available bootstrap schemes especially in large samples. This paper proposes specification and mean independence tests based on a class of ICM metrics termed the generalized martingale difference divergence (GMDD). The proposed tests exhibit consistency, asymptotic $\\chi^2$-distribution under the null hypothesis, and computational efficiency. Moreover, they demonstrate robustness to heteroskedasticity of unknown form and can be adapted to enhance power towards specific alternatives. A power comparison with classical bootstrap-based ICM tests using Bahadur slopes is also provided. 
Monte Carlo simulations are conducted to showcase the proposed tests' excellent size control and competitive power."}, "https://arxiv.org/abs/2309.04047": {"title": "Fully Latent Principal Stratification With Measurement Models", "link": "https://arxiv.org/abs/2309.04047", "description": "arXiv:2309.04047v2 Announce Type: replace \nAbstract: There is wide agreement on the importance of implementation data from randomized effectiveness studies in behavioral science; however, there are few methods available to incorporate these data into causal models, especially when they are multivariate or longitudinal, and interest is in low-dimensional summaries. We introduce a framework for studying how treatment effects vary between subjects who implement an intervention differently, combining principal stratification with latent variable measurement models; since principal strata are latent in both treatment arms, we call it \"fully-latent principal stratification\" or FLPS. We describe FLPS models including item-response-theory measurement, show that they are feasible in a simulation study, and illustrate them in an analysis of hint usage from a randomized study of computerized mathematics tutors."}, "https://arxiv.org/abs/2310.02278": {"title": "A Stable and Efficient Covariate-Balancing Estimator for Causal Survival Effects", "link": "https://arxiv.org/abs/2310.02278", "description": "arXiv:2310.02278v2 Announce Type: replace \nAbstract: We propose an empirically stable and asymptotically efficient covariate-balancing approach to the problem of estimating survival causal effects in data with conditionally-independent censoring. This addresses a challenge often encountered in state-of-the-art nonparametric methods: the use of inverses of small estimated probabilities and the resulting amplification of estimation error. We validate our theoretical results in experiments on synthetic and semi-synthetic data."}, "https://arxiv.org/abs/2312.04077": {"title": "When is Plasmode simulation superior to parametric simulation when estimating the MSE of the least squares estimator in linear regression?", "link": "https://arxiv.org/abs/2312.04077", "description": "arXiv:2312.04077v2 Announce Type: replace \nAbstract: Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest for researchers developing new methods and practitioners confronted with the choice of the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments using the combination of resampling feature data from a real-life dataset and generating the target variable with a known user-selected outcome-generating model (OGM), is an alternative that is often claimed to produce more realistic data. We compare parametric and Plasmode simulation for the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the OGM were known, parametric simulation would obviously be the best choice in terms of estimating the MSE well. However, in reality, both are usually unknown, so researchers have to make assumptions: in Plasmode simulation for the OGM, in parametric simulation for both DGP and OGM. Most likely, these assumptions do not exactly reflect the truth. 
Here, we aim to find out how assumptions deviating from the true DGP and the true OGM affect the performance of parametric and Plasmode simulations in the context of MSE estimation for the LSE, and in which situations each simulation type is preferable. Our results suggest that the preferable simulation method depends on many factors, including the number of features, and on how and to what extent the assumptions of a parametric simulation differ from the true DGP. Also, the resampling strategy used for Plasmode influences the results. In particular, subsampling with a small sampling proportion can be recommended."}, "https://arxiv.org/abs/2312.12641": {"title": "Robust Point Matching with Distance Profiles", "link": "https://arxiv.org/abs/2312.12641", "description": "arXiv:2312.12641v2 Announce Type: replace \nAbstract: While matching procedures based on pairwise distances are conceptually appealing and thus favored in practice, theoretical guarantees for such procedures are rarely found in the literature. We propose and analyze matching procedures based on distance profiles that are easily implementable in practice, showing that these procedures are robust to outliers and noise. We demonstrate the performance of the proposed method using a real data example and provide simulation studies to complement the theoretical findings."}, "https://arxiv.org/abs/2203.15945": {"title": "A Framework for Improving the Reliability of Black-box Variational Inference", "link": "https://arxiv.org/abs/2203.15945", "description": "arXiv:2203.15945v2 Announce Type: replace-cross \nAbstract: Black-box variational inference (BBVI) now sees widespread use in machine learning and statistics as a fast yet flexible alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, stochastic optimization methods for BBVI remain unreliable and require substantial expertise and hand-tuning to apply effectively. In this paper, we propose Robust and Automated Black-box VI (RABVI), a framework for improving the reliability of BBVI optimization. RABVI is based on rigorously justified automation techniques, includes just a small number of intuitive tuning parameters, and detects inaccurate estimates of the optimal variational approximation. RABVI adaptively decreases the learning rate by detecting convergence of the fixed--learning-rate iterates, then estimates the symmetrized Kullback--Leibler (KL) divergence between the current variational approximation and the optimal one. It also employs a novel optimization termination criterion that enables the user to balance desired accuracy against computational cost by comparing (i) the predicted relative decrease in the symmetrized KL divergence if a smaller learning rate were used and (ii) the predicted computation required to converge with the smaller learning rate. We validate the robustness and accuracy of RABVI through carefully designed simulation studies and on a diverse set of real-world model and data examples."}, "https://arxiv.org/abs/2209.09936": {"title": "Solving Fredholm Integral Equations of the First Kind via Wasserstein Gradient Flows", "link": "https://arxiv.org/abs/2209.09936", "description": "arXiv:2209.09936v3 Announce Type: replace-cross \nAbstract: Solving Fredholm equations of the first kind is crucial in many areas of the applied sciences. In this work we adopt a probabilistic and variational point of view by considering a minimization problem in the space of probability measures with an entropic regularization. 
Contrary to classical approaches which discretize the domain of the solutions, we introduce an algorithm to asymptotically sample from the unique solution of the regularized minimization problem. As a result our estimators do not depend on any underlying grid and have better scalability properties than most existing methods. Our algorithm is based on a particle approximation of the solution of a McKean--Vlasov stochastic differential equation associated with the Wasserstein gradient flow of our variational formulation. We prove the convergence towards a minimizer and provide practical guidelines for its numerical implementation. Finally, our method is compared with other approaches on several examples including density deconvolution and epidemiology."}, "https://arxiv.org/abs/2405.10371": {"title": "Causal Discovery in Multivariate Extremes with a Hydrological Analysis of Swiss River Discharges", "link": "https://arxiv.org/abs/2405.10371", "description": "arXiv:2405.10371v1 Announce Type: new \nAbstract: Causal asymmetry is based on the principle that an event is a cause only if its absence would not have been a cause. From there, uncovering causal effects becomes a matter of comparing a well-defined score in both directions. Motivated by studying causal effects at extreme levels of a multivariate random vector, we propose to construct a model-agnostic causal score relying solely on the assumption of the existence of a max-domain of attraction. Based on a representation of a Generalized Pareto random vector, we construct the causal score as the Wasserstein distance between the margins and a well-specified random variable. The proposed methodology is illustrated on a hydrologically simulated dataset of different characteristics of catchments in Switzerland: discharge, precipitation, and snowmelt."}, "https://arxiv.org/abs/2405.10449": {"title": "Optimal Text-Based Time-Series Indices", "link": "https://arxiv.org/abs/2405.10449", "description": "arXiv:2405.10449v1 Announce Type: new \nAbstract: We propose an approach to construct text-based time-series indices in an optimal way--typically, indices that maximize the contemporaneous relation or the predictive performance with respect to a target variable, such as inflation. We illustrate our methodology with a corpus of news articles from the Wall Street Journal by optimizing text-based indices focusing on tracking the VIX index and inflation expectations. Our results highlight the superior performance of our approach compared to existing indices."}, "https://arxiv.org/abs/2405.10461": {"title": "Prediction in Measurement Error Models", "link": "https://arxiv.org/abs/2405.10461", "description": "arXiv:2405.10461v1 Announce Type: new \nAbstract: We study the well known difficult problem of prediction in measurement error models. By targeting directly at the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval which achieves a pre-determined prediction level. The constructing procedure requires a working model for the distribution of the variable prone to error. If the working model is correct, the prediction interval estimator obtains the smallest variability in terms of assessing the true center and length. If the working model is incorrect, the prediction interval estimation is still consistent. 
We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining minimal prediction interval length. Numerical experiments are conducted to illustrate the performance, and we apply our method to predict the concentration of Abeta1-12 in cerebrospinal fluid in an Alzheimer's disease dataset."}, "https://arxiv.org/abs/2405.10490": {"title": "Neural Optimization with Adaptive Heuristics for Intelligent Marketing System", "link": "https://arxiv.org/abs/2405.10490", "description": "arXiv:2405.10490v1 Announce Type: new \nAbstract: Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing."}, "https://arxiv.org/abs/2405.10527": {"title": "Hawkes Models And Their Applications", "link": "https://arxiv.org/abs/2405.10527", "description": "arXiv:2405.10527v1 Announce Type: new \nAbstract: The Hawkes process is a model for counting the number of arrivals to a system which exhibits the self-exciting property - that one arrival creates a heightened chance of further arrivals in the near future. The model, and its generalizations, have been applied in a plethora of disparate domains, though two particularly developed applications are in seismology and in finance. As the original model is elegantly simple, generalizations have been proposed which: track marks for each arrival, are multivariate, have a spatial component, are driven by renewal processes, treat time as discrete, and so on. This paper provides a cohesive review of the traditional Hawkes model and the modern generalizations, providing details on their construction, simulation algorithms, and giving key references to the appropriate literature for a detailed treatment."}, "https://arxiv.org/abs/2405.10539": {"title": "Overcoming Medical Overuse with AI Assistance: An Experimental Investigation", "link": "https://arxiv.org/abs/2405.10539", "description": "arXiv:2405.10539v1 Announce Type: new \nAbstract: This study evaluates the effectiveness of Artificial Intelligence (AI) in mitigating medical overtreatment, a significant issue characterized by unnecessary interventions that inflate healthcare costs and pose risks to patients. 
We conducted a lab-in-the-field experiment at a medical school, utilizing a novel medical prescription task, manipulating monetary incentives and the availability of AI assistance among medical students using a three-by-two factorial design. We tested three incentive schemes: Flat (constant pay regardless of treatment quantity), Progressive (pay increases with the number of treatments), and Regressive (penalties for overtreatment) to assess their influence on the adoption and effectiveness of AI assistance. Our findings demonstrate that AI significantly reduced overtreatment rates by up to 62% in the Regressive incentive conditions, where (prospective) physician and patient interests were most aligned. Diagnostic accuracy improved by 17% to 37%, depending on the incentive scheme. Adoption of AI advice was high, with approximately half of the participants modifying their decisions based on AI input across all settings. For policy implications, we quantified the monetary (57%) and non-monetary (43%) incentives of overtreatment and highlighted AI's potential to mitigate non-monetary incentives and enhance social welfare. Our results provide valuable insights for healthcare administrators considering AI integration into healthcare systems."}, "https://arxiv.org/abs/2405.10655": {"title": "Macroeconomic Factors, Industrial Indexes and Bank Spread in Brazil", "link": "https://arxiv.org/abs/2405.10655", "description": "arXiv:2405.10655v1 Announce Type: new \nAbstract: The main objective of this paper is to identify which macroeconomic factors and industrial indexes influenced the total Brazilian banking spread between March 2011 and March 2015. This paper considers the subclassification of industrial activities in Brazil. Monthly time series data were used in multivariate linear regression models using Eviews (7.0). Eighteen variables were considered as candidates to be determinants. Variables which positively influenced bank spread are: default, IPIs (Industrial Production Indexes) for capital goods, intermediate goods, durable consumer goods, semi-durable and non-durable goods, the Selic, GDP, the unemployment rate and EMBI+. Variables which negatively influenced bank spread are: consumer and general consumer goods IPIs, IPCA, the balance of the loan portfolio and the retail sales index. A p-value of 5% was considered. The main conclusion of this work is that the progress of industry, job creation and consumption can reduce bank spread. Keywords: Credit. Bank spread. Macroeconomics. Industrial Production Indexes. Finance."}, "https://arxiv.org/abs/2405.10719": {"title": "$\\ell_1$-Regularized Generalized Least Squares", "link": "https://arxiv.org/abs/2405.10719", "description": "arXiv:2405.10719v1 Announce Type: new \nAbstract: In this paper we propose an $\\ell_1$-regularized GLS estimator for high-dimensional regressions with potentially autocorrelated errors. We establish non-asymptotic oracle inequalities for estimation accuracy in a framework that allows for highly persistent autoregressive errors. In practice, the whitening matrix required to implement the GLS is unknown; we present a feasible estimator for this matrix, derive consistency results and ultimately show how our proposed feasible GLS can closely recover the optimal performance (as if the errors were white noise) of the LASSO. 
A simulation study verifies the performance of the proposed method, demonstrating that the penalized (feasible) GLS-LASSO estimator performs on par with the LASSO in the case of white noise errors, whilst outperforming it in terms of sign-recovery and estimation error when the errors exhibit significant correlation."}, "https://arxiv.org/abs/2405.10742": {"title": "Efficient Sampling in Disease Surveillance through Subpopulations: Sampling Canaries in the Coal Mine", "link": "https://arxiv.org/abs/2405.10742", "description": "arXiv:2405.10742v1 Announce Type: new \nAbstract: We consider disease outbreak detection settings where the population under study consists of various subpopulations available for stratified surveillance. These subpopulations can for example be based on age cohorts, but may also correspond to other subgroups of the population under study such as international travellers. Rather than sampling uniformly over the entire population, one may elevate the effectiveness of the detection methodology by optimally choosing a subpopulation for sampling. We show (under some assumptions) the relative sampling efficiency between two subpopulations is inversely proportional to the ratio of their respective baseline disease risks. This leads to a considerable potential increase in sampling efficiency when sampling from the subpopulation with higher baseline disease risk, if the two subpopulation baseline risks differ strongly. Our mathematical results require a careful treatment of the power curves of exact binomial tests as a function of their sample size, which are erratic and non-monotonic due to the discreteness of the underlying distribution. Subpopulations with comparatively high baseline disease risk are typically in greater contact with health professionals, and thus when sampled for surveillance purposes this is typically motivated merely through a convenience argument. With this study, we aim to elevate the status of such \"convenience surveillance\" to optimal subpopulation surveillance."}, "https://arxiv.org/abs/2405.10769": {"title": "Efficient estimation of target population treatment effect from multiple source trials under effect-measure transportability", "link": "https://arxiv.org/abs/2405.10769", "description": "arXiv:2405.10769v1 Announce Type: new \nAbstract: When the marginal causal effect comparing the same treatment pair is available from multiple trials, we wish to transport all results to make inference on the target population effect. To account for the differences between populations, statistical analysis is often performed controlling for relevant variables. However, when transportability assumptions are placed on conditional causal effects, rather than the distribution of potential outcomes, we need to carefully choose these effect measures. In particular, we present identifiability results in two cases: target population average treatment effect for a continuous outcome and causal mean ratio for a positive outcome. We characterize the semiparametric efficiency bounds of the causal effects under the respective transportability assumptions and propose estimators that are doubly robust against model misspecifications. 
We highlight an important discussion on the tension between the non-collapsibility of conditional effects and the variational independence induced by transportability in the case of multiple source trials."}, "https://arxiv.org/abs/2405.10773": {"title": "Proximal indirect comparison", "link": "https://arxiv.org/abs/2405.10773", "description": "arXiv:2405.10773v1 Announce Type: new \nAbstract: We consider the problem of indirect comparison, where a treatment arm of interest is absent by design in the target randomized control trial (RCT) but available in a source RCT. The identifiability of the target population average treatment effect often relies on conditional transportability assumptions. However, it is a common concern whether all relevant effect modifiers are measured and controlled for. We highlight a new proximal identification result in the presence of shifted, unobserved effect modifiers based on proxies: an adjustment proxy in both RCTs and an additional reweighting proxy in the source RCT. We propose an estimator which is doubly-robust against misspecifications of the so-called bridge functions and asymptotically normal under mild consistency of the nuisance models. An alternative estimator is presented to accommodate missing outcomes in the source RCT, which we then apply to conduct a proximal indirect comparison analysis using two weight management trials."}, "https://arxiv.org/abs/2405.10925": {"title": "High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates", "link": "https://arxiv.org/abs/2405.10925", "description": "arXiv:2405.10925v1 Announce Type: new \nAbstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. 
HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors."}, "https://arxiv.org/abs/2405.10469": {"title": "Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions", "link": "https://arxiv.org/abs/2405.10469", "description": "arXiv:2405.10469v1 Announce Type: cross \nAbstract: The development of open benchmarking platforms could greatly accelerate the adoption of AI agents in retail. This paper presents comprehensive simulations of customer shopping behaviors for the purpose of benchmarking reinforcement learning (RL) agents that optimize coupon targeting. The difficulty of this learning problem is largely driven by the sparsity of customer purchase events. We trained agents using offline batch data comprising summarized customer purchase histories to help mitigate this effect. Our experiments revealed that contextual bandit and deep RL methods that are less prone to over-fitting the sparse reward distributions significantly outperform static policies. This study offers a practical framework for simulating AI agents that optimize the entire retail customer journey. It aims to inspire the further development of simulation tools for retail AI systems."}, "https://arxiv.org/abs/2405.10795": {"title": "Non trivial optimal sampling rate for estimating a Lipschitz-continuous function in presence of mean-reverting Ornstein-Uhlenbeck noise", "link": "https://arxiv.org/abs/2405.10795", "description": "arXiv:2405.10795v1 Announce Type: cross \nAbstract: We examine a mean-reverting Ornstein-Uhlenbeck process that perturbs an unknown Lipschitz-continuous drift and aim to estimate the drift's value at a predetermined time horizon by sampling the path of the process. Due to the time varying nature of the drift we propose an estimation procedure that involves an online, time-varying optimization scheme implemented using a stochastic gradient ascent algorithm to maximize the log-likelihood of our observations. The objective of the paper is to investigate the optimal sample size/rate for achieving the minimum mean square distance between our estimator and the true value of the drift. In this setting we uncover a trade-off between the correlation of the observations, which increases with the sample size, and the dynamic nature of the unknown drift, which is weakened by increasing the frequency of observation. The mean square error is shown to be non monotonic in the sample size, attaining a global minimum whose precise description depends on the parameters that govern the model. In the static case, i.e. when the unknown drift is constant, our method outperforms the arithmetic mean of the observations in highly correlated regimes, despite the latter being a natural candidate estimator. We then compare our online estimator with the global maximum likelihood estimator."}, "https://arxiv.org/abs/2110.07051": {"title": "Fast and Scalable Inference for Spatial Extreme Value Models", "link": "https://arxiv.org/abs/2110.07051", "description": "arXiv:2110.07051v3 Announce Type: replace \nAbstract: The generalized extreme value (GEV) distribution is a popular model for analyzing and forecasting extreme weather data. To increase prediction accuracy, spatial information is often pooled via a latent Gaussian process (GP) on the GEV parameters. 
Inference for GEV-GP models is typically carried out using Markov chain Monte Carlo (MCMC) methods, or using approximate inference methods such as the integrated nested Laplace approximation (INLA). However, MCMC becomes prohibitively slow as the number of spatial locations increases, whereas INLA is only applicable in practice to a limited subset of GEV-GP models. In this paper, we revisit the original Laplace approximation for fitting spatial GEV models. In combination with a popular sparsity-inducing spatial covariance approximation technique, we show through simulations that our approach accurately estimates the Bayesian predictive distribution of extreme weather events, is scalable to several thousand spatial locations, and is several orders of magnitude faster than MCMC. A case study in forecasting extreme snowfall across Canada is presented."}, "https://arxiv.org/abs/2112.08934": {"title": "Lassoed Boosting and Linear Prediction in the Equities Market", "link": "https://arxiv.org/abs/2112.08934", "description": "arXiv:2112.08934v3 Announce Type: replace \nAbstract: We consider a two-stage estimation method for linear regression. First, it uses the lasso in Tibshirani (1996) to screen variables and, second, re-estimates the coefficients using the least-squares boosting method in Friedman (2001) on every set of selected variables. Based on the large-scale simulation experiment in Hastie et al. (2020), lassoed boosting performs as well as the relaxed lasso in Meinshausen (2007) and, under certain scenarios, can yield a sparser model. Applied to predicting equity returns, lassoed boosting gives the smallest mean-squared prediction error compared to several other methods."}, "https://arxiv.org/abs/2210.05983": {"title": "Model-based clustering in simple hypergraphs through a stochastic blockmodel", "link": "https://arxiv.org/abs/2210.05983", "description": "arXiv:2210.05983v3 Announce Type: replace \nAbstract: We propose a model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochastic blockmodel for graphs and assumes the existence of latent node groups and hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational approximation Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection.\n To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods using synthetic data generated from the model, as well as from a line clustering experiment and a co-authorship dataset."}, "https://arxiv.org/abs/2210.09560": {"title": "A Bayesian Convolutional Neural Network-based Generalized Linear Model", "link": "https://arxiv.org/abs/2210.09560", "description": "arXiv:2210.09560v3 Announce Type: replace \nAbstract: Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. 
To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear models (GLMs) framework. We use extracted nodes from the last hidden layer of CNN with Monte Carlo (MC) dropout as informative covariates in GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems, which have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm can be broadly applicable to image regressions or correlated data analysis by enabling accurate Bayesian inference quickly."}, "https://arxiv.org/abs/2306.08485": {"title": "Graph-Aligned Random Partition Model (GARP)", "link": "https://arxiv.org/abs/2306.08485", "description": "arXiv:2306.08485v2 Announce Type: replace \nAbstract: Bayesian nonparametric mixtures and random partition models are powerful tools for probabilistic clustering. However, standard independent mixture models can be restrictive in some applications such as inference on cell lineage due to the biological relations of the clusters. The increasing availability of large genomic data requires new statistical tools to perform model-based clustering and infer the relationship between homogeneous subgroups of units. Motivated by single-cell RNA applications we develop a novel dependent mixture model to jointly perform cluster analysis and align the clusters on a graph. Our flexible graph-aligned random partition model (GARP) exploits Gibbs-type priors as building blocks, allowing us to derive analytical results on the graph-aligned random partition's probability mass function (pmf). We derive a generalization of the Chinese restaurant process from the pmf and a related efficient and neat MCMC algorithm to perform Bayesian inference. We perform posterior inference on real single-cell RNA data from mice stem cells. We further investigate the performance of our model in capturing the underlying clustering structure as well as the underlying graph by means of simulation studies."}, "https://arxiv.org/abs/2306.09555": {"title": "Geometric-Based Pruning Rules For Change Point Detection in Multiple Independent Time Series", "link": "https://arxiv.org/abs/2306.09555", "description": "arXiv:2306.09555v2 Announce Type: replace \nAbstract: We consider the problem of detecting multiple changes in multiple independent time series. The search for the best segmentation can be expressed as a minimization problem over a given cost function. We focus on dynamic programming algorithms that solve this problem exactly. When the number of changes is proportional to data length, an inequality-based pruning rule encoded in the PELT algorithm leads to a linear time complexity. Another type of pruning, called functional pruning, gives a close-to-linear time complexity whatever the number of changes, but only for the analysis of univariate time series.\n We propose a few extensions of functional pruning for multiple independent time series based on the use of simple geometric shapes (balls and hyperrectangles). 
We focus on the Gaussian case, but some of our rules can be easily extended to the exponential family. In a simulation study we compare the computational efficiency of different geometric-based pruning rules. We show that for small dimensions (2, 3, 4) some of them ran significantly faster than inequality-based approaches in particular when the underlying number of changes is small compared to the data length."}, "https://arxiv.org/abs/2306.15075": {"title": "Differences in academic preparedness do not fully explain Black-White enrollment disparities in advanced high school coursework", "link": "https://arxiv.org/abs/2306.15075", "description": "arXiv:2306.15075v2 Announce Type: replace \nAbstract: Whether racial disparities in enrollment in advanced high school coursework can be attributed to differences in prior academic preparation is a central question in sociological research and education policy. However, previous investigations face methodological limitations, for they compare race-specific enrollment rates of students after adjusting for characteristics only partially related to their academic preparedness for advanced coursework. Informed by a recently-developed statistical technique, we propose and estimate a novel measure of students' academic preparedness and use administrative data from the New York City Department of Education to measure differences in AP mathematics enrollment rates among similarly prepared students of different races. We find that preexisting differences in academic preparation do not fully explain the under-representation of Black students relative to White students in AP mathematics. Our results imply that achieving equal opportunities for AP enrollment not only requires equalizing earlier academic experiences, but also addressing inequities that emerge from coursework placement processes."}, "https://arxiv.org/abs/2307.09864": {"title": "Asymptotic equivalence of Principal Components and Quasi Maximum Likelihood estimators in Large Approximate Factor Models", "link": "https://arxiv.org/abs/2307.09864", "description": "arXiv:2307.09864v4 Announce Type: replace \nAbstract: We provide an alternative derivation of the asymptotic results for the Principal Components estimator of a large approximate factor model. Results are derived under a minimal set of assumptions and, in particular, we require only the existence of 4th order moments. A special focus is given to the time series setting, a case considered in almost all recent econometric applications of factor models. Hence, estimation is based on the classical $n\\times n$ sample covariance matrix and not on a $T\\times T$ covariance matrix often considered in the literature. Indeed, despite the two approaches being asymptotically equivalent, the former is more coherent with a time series setting and it immediately allows us to write more intuitive asymptotic expansions for the Principal Component estimators showing that they are equivalent to OLS as long as $\\sqrt n/T\\to 0$ and $\\sqrt T/n\\to 0$, that is the loadings are estimated in a time series regression as if the factors were known, while the factors are estimated in a cross-sectional regression as if the loadings were known. 
Finally, we give some alternative sets of primitive sufficient conditions for mean-squared consistency of the sample covariance matrix of the factors, of the idiosyncratic components, and of the observed time series, which is the starting point for Principal Component Analysis."}, "https://arxiv.org/abs/2311.03644": {"title": "BOB: Bayesian Optimized Bootstrap for Uncertainty Quantification in Gaussian Mixture Models", "link": "https://arxiv.org/abs/2311.03644", "description": "arXiv:2311.03644v2 Announce Type: replace \nAbstract: A natural way to quantify uncertainties in Gaussian mixture models (GMMs) is through Bayesian methods. That said, sampling from the joint posterior distribution of GMMs via standard Markov chain Monte Carlo (MCMC) imposes several computational challenges, which have prevented a broader full Bayesian implementation of these models. A growing body of literature has introduced the Weighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as alternatives to MCMC sampling. The core idea of these methods is to repeatedly compute maximum a posteriori (MAP) estimates on many randomly weighted posterior densities. These MAP estimates then can be treated as approximate posterior draws. Nonetheless, a central question remains unanswered: How to select the random weights under arbitrary sample sizes. We, therefore, introduce the Bayesian Optimized Bootstrap (BOB), a computational method to automatically select these random weights by minimizing, through Bayesian Optimization, a black-box and noisy version of the reverse Kullback-Leibler (KL) divergence between the Bayesian posterior and an approximate posterior obtained via random weighting. Our proposed method outperforms competing approaches in recovering the Bayesian posterior, it provides a better uncertainty quantification, and it retains key asymptotic properties from existing methods. BOB's performance is demonstrated through extensive simulations, along with real-world data analyses."}, "https://arxiv.org/abs/2311.05794": {"title": "An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed Bandits", "link": "https://arxiv.org/abs/2311.05794", "description": "arXiv:2311.05794v2 Announce Type: replace \nAbstract: In multi-armed bandit (MAB) experiments, it is often advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. We develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit experiments that produces powerful and anytime-valid inference on the ATE for \\emph{any} bandit algorithm of the experimenter's choice, even those without probabilistic treatment assignment. Intuitively, the MAD \"mixes\" any bandit algorithm of the experimenter's choice with a Bernoulli design through a tuning parameter $\\delta_t$, where $\\delta_t$ is a deterministic sequence that decreases the priority placed on the Bernoulli design as the sample size grows. We prove that for $\\delta_t = \\omega\\left(t^{-1/4}\\right)$, the MAD generates anytime-valid asymptotic confidence sequences that are guaranteed to shrink around the true ATE. Hence, the experimenter is guaranteed to detect a true non-zero treatment effect in finite time. Additionally, we prove that the regret of the MAD approaches that of its underlying bandit algorithm over time, and hence, incurs a relatively small loss in regret in return for powerful inferential guarantees. 
Finally, we conduct an extensive simulation study showing that the MAD achieves finite-sample anytime validity and high power without significant losses in finite-sample reward."}, "https://arxiv.org/abs/2312.10563": {"title": "Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration", "link": "https://arxiv.org/abs/2312.10563", "description": "arXiv:2312.10563v2 Announce Type: replace \nAbstract: Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variations as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide association study (GWAS) summary data, and provide solid statistical guarantees. Our framework employs carefully crafted estimating equations, allowing for different sets of genetic variations to instrument the exposure and the mediator, to efficiently integrate information stored in three independent GWAS. As part of this endeavor, we demonstrate that in mediation analysis, the challenge raised by instrument selection goes beyond the well-known winner's curse issue, and therefore, addressing it requires special treatment. We then develop bias correction techniques to address the instrument selection issue and the commonly encountered measurement error bias issue. Collectively, through our theoretical investigations, we show that our framework provides valid statistical inference for both direct and mediation effects with enhanced statistical efficiency compared to existing methods. We further illustrate the finite-sample performance of our approach through simulation experiments and a case study."}, "https://arxiv.org/abs/2003.05492": {"title": "An asymptotic Peskun ordering and its application to lifted samplers", "link": "https://arxiv.org/abs/2003.05492", "description": "arXiv:2003.05492v5 Announce Type: replace-cross \nAbstract: A Peskun ordering between two samplers, implying a dominance of one over the other, is known among the Markov chain Monte Carlo community for being a remarkably strong result. It is however also known for being a result that is notably difficult to establish. Indeed, one has to prove that the probability of reaching a state $\\mathbf{y}$ from a state $\\mathbf{x}$, using a sampler, is greater than or equal to the probability using the other sampler, and this must hold for all pairs $(\\mathbf{x}, \\mathbf{y})$ such that $\\mathbf{x} \\neq \\mathbf{y}$. We provide in this paper a weaker version that does not require an inequality between the probabilities for all these states: essentially, the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the probabilities are greater than or equal to belong to a mass-concentrating set. The weak ordering turns out to be useful to compare lifted samplers for partially-ordered discrete state-spaces with their Metropolis--Hastings counterparts. An analysis in great generality yields a qualitative conclusion: they asymptotically perform better in certain situations (and we are able to identify them), but not necessarily in others (and the reasons why are made clear). 
A quantitative study in a specific context of graphical-model simulation is also conducted."}, "https://arxiv.org/abs/2311.00905": {"title": "Data-driven fixed-point tuning for truncated realized variations", "link": "https://arxiv.org/abs/2311.00905", "description": "arXiv:2311.00905v2 Announce Type: replace-cross \nAbstract: Many methods for estimating integrated volatility and related functionals of semimartingales in the presence of jumps require specification of tuning parameters for their use in practice. In much of the available theory, tuning parameters are assumed to be deterministic and their values are specified only up to asymptotic constraints. However, in empirical work and in simulation studies, they are typically chosen to be random and data-dependent, with explicit choices often relying entirely on heuristics. In this paper, we consider novel data-driven tuning procedures for the truncated realized variations of a semimartingale with jumps based on a type of random fixed-point iteration. Being effectively automated, our approach alleviates the need for delicate decision-making regarding tuning parameters in practice and can be implemented using information regarding sampling frequency alone. We show that our methods can lead to asymptotically efficient estimation of integrated volatility and exhibit superior finite-sample performance compared to popular alternatives in the literature."}, "https://arxiv.org/abs/2312.07792": {"title": "Differentially private projection-depth-based medians", "link": "https://arxiv.org/abs/2312.07792", "description": "arXiv:2312.07792v2 Announce Type: replace-cross \nAbstract: We develop $(\\epsilon,\\delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure (e.g. we do not assume any moment bounds), we quantify the probability that the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We then present a new definition of the finite sample breakdown point which applies to a mechanism, and present a lower bound on the finite sample breakdown point of the projection-depth-based median. We demonstrate our main results on the canonical projection-depth-based median, as well as on projection-depth-based medians derived from trimmed estimators. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation. In the Cauchy setting, we show that the \"outlier error amplification\" effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics, which may be of general interest."}, "https://arxiv.org/abs/2405.11081": {"title": "What are You Weighting For? Improved Weights for Gaussian Mixture Filtering With Application to Cislunar Orbit Determination", "link": "https://arxiv.org/abs/2405.11081", "description": "arXiv:2405.11081v1 Announce Type: new \nAbstract: This work focuses on the critical aspect of accurate weight computation during the measurement incorporation phase of Gaussian mixture filters. The proposed novel approach computes weights by linearizing the measurement model about each component's posterior estimate rather than the prior, as traditionally done. 
This work proves equivalence with traditional methods for linear models, provides novel sigma-point extensions to the traditional and proposed methods, and empirically demonstrates improved performance in nonlinear cases. Two illustrative examples, the Avocado and a cislunar single target tracking scenario, serve to highlight the advantages of the new weight computation technique by analyzing filter accuracy and consistency through varying the number of Gaussian mixture components."}, "https://arxiv.org/abs/2405.11111": {"title": "Euclidean mirrors and first-order changepoints in network time series", "link": "https://arxiv.org/abs/2405.11111", "description": "arXiv:2405.11111v1 Announce Type: new \nAbstract: We describe a model for a network time series whose evolution is governed by an underlying stochastic process, known as the latent position process, in which network evolution can be represented in Euclidean space by a curve, called the Euclidean mirror. We define the notion of a first-order changepoint for a time series of networks, and construct a family of latent position process networks with underlying first-order changepoints. We prove that a spectral estimate of the associated Euclidean mirror localizes these changepoints, even when the graph distribution evolves continuously, but at a rate that changes. Simulated and real data examples on organoid networks show that this localization captures empirically significant shifts in network evolution."}, "https://arxiv.org/abs/2405.11156": {"title": "A Randomized Permutation Whole-Model Test Heuristic for Self-Validated Ensemble Models (SVEM)", "link": "https://arxiv.org/abs/2405.11156", "description": "arXiv:2405.11156v1 Announce Type: new \nAbstract: We introduce a heuristic to test the significance of fit of Self-Validated Ensemble Models (SVEM) against the null hypothesis of a constant response. A SVEM model averages predictions from nBoot fits of a model, applied to fractionally weighted bootstraps of the target dataset. It tunes each fit on a validation copy of the training data, utilizing anti-correlated weights for training and validation. The proposed test computes SVEM predictions centered by the response column mean and normalized by the ensemble variability at each of nPoint points spaced throughout the factor space. A reference distribution is constructed by refitting the SVEM model to nPerm randomized permutations of the response column and recording the corresponding standardized predictions at the nPoint points. A reduced-rank singular value decomposition applied to the centered and scaled nPerm x nPoint reference matrix is used to calculate the Mahalanobis distance for each of the nPerm permutation results as well as the jackknife (holdout) Mahalanobis distance of the original response column. The process is repeated independently for each response in the experiment, producing a joint graphical summary. We present a simulation driven power analysis and discuss limitations of the test relating to model flexibility and design adequacy. 
The test maintains the nominal Type I error rate even when the base SVEM model contains more parameters than observations."}, "https://arxiv.org/abs/2405.11248": {"title": "Generalized extremiles and risk measures of distorted random variables", "link": "https://arxiv.org/abs/2405.11248", "description": "arXiv:2405.11248v1 Announce Type: new \nAbstract: Quantiles, expectiles and extremiles can be seen as concepts defined via an optimization problem, where this optimization problem is driven by two important ingredients: the loss function as well as a distributional weight function. This leads to the formulation of a general class of functionals that contains next to the above concepts many interesting quantities, including also a subclass of distortion risks. The focus of the paper is on developing estimators for such functionals and to establish asymptotic consistency and asymptotic normality of these estimators. The advantage of the general framework is that it allows application to a very broad range of concepts, providing as such estimation tools and tools for statistical inference (for example for construction of confidence intervals) for all involved concepts. After developing the theory for the general functional we apply it to various settings, illustrating the broad applicability. In a real data example the developed tools are used in an analysis of natural disasters."}, "https://arxiv.org/abs/2405.11358": {"title": "A Bayesian Nonparametric Approach for Clustering Functional Trajectories over Time", "link": "https://arxiv.org/abs/2405.11358", "description": "arXiv:2405.11358v1 Announce Type: new \nAbstract: Functional concurrent, or varying-coefficient, regression models are commonly used in biomedical and clinical settings to investigate how the relation between an outcome and observed covariate varies as a function of another covariate. In this work, we propose a Bayesian nonparametric approach to investigate how clusters of these functional relations evolve over time. Our model clusters individual functional trajectories within and across time periods while flexibly accommodating the evolution of the partitions across time periods with covariates. Motivated by mobile health data collected in a novel, smartphone-based smoking cessation intervention study, we demonstrate how our proposed method can simultaneously cluster functional trajectories, accommodate temporal dependence, and provide insights into the transitions between functional clusters over time."}, "https://arxiv.org/abs/2405.11477": {"title": "Analyze Additive and Interaction Effects via Collaborative Trees", "link": "https://arxiv.org/abs/2405.11477", "description": "arXiv:2405.11477v1 Announce Type: new \nAbstract: We present Collaborative Trees, a novel tree model designed for regression prediction, along with its bagging version, which aims to analyze complex statistical associations between features and uncover potential patterns inherent in the data. We decompose the mean decrease in impurity from the proposed tree model to analyze the additive and interaction effects of features on the response variable. Additionally, we introduce network diagrams to visually depict how each feature contributes additively to the response and how pairs of features contribute interaction effects. Through a detailed demonstration using an embryo growth dataset, we illustrate how the new statistical tools aid data analysis, both visually and numerically. 
Moreover, we delve into critical aspects of tree modeling, such as prediction performance, inference stability, and bias in feature importance measures, leveraging real datasets and simulation experiments for comprehensive discussions. On the theory side, we show that Collaborative Trees, built upon a ``sum of trees'' approach with our own innovative tree model regularization, exhibit characteristics akin to matching pursuit, under the assumption of high-dimensional independent binary input features (or one-hot feature groups). This newfound link sheds light on the superior capability of our tree model in estimating additive effects of features, a crucial factor for accurate interaction effect estimation."}, "https://arxiv.org/abs/2405.11522": {"title": "A comparative study of augmented inverse propensity weighted estimators using outcome-adaptive lasso and other penalized regression methods", "link": "https://arxiv.org/abs/2405.11522", "description": "arXiv:2405.11522v1 Announce Type: new \nAbstract: Confounder selection may be efficiently conducted using penalized regression methods when causal effects are estimated from observational data with many variables. An outcome-adaptive lasso was proposed to build a model for the propensity score that can be employed in conjunction with other variable selection methods for the outcome model to apply the augmented inverse propensity weighted (AIPW) estimator. However, researchers may not know which method is optimal to use for outcome model when applying the AIPW estimator with the outcome-adaptive lasso. This study provided hints on readily implementable penalized regression methods that should be adopted for the outcome model as a counterpart of the outcome-adaptive lasso. We evaluated the bias and variance of the AIPW estimators using the propensity score (PS) model and an outcome model based on penalized regression methods under various conditions by analyzing a clinical trial example and numerical experiments; the estimates and standard errors of the AIPW estimators were almost identical in an example with over 5000 participants. The AIPW estimators using penalized regression methods with the oracle property performed well in terms of bias and variance in numerical experiments with smaller sample sizes. Meanwhile, the bias of the AIPW estimator using the ordinary lasso for the PS and outcome models was considerably larger."}, "https://arxiv.org/abs/2405.11615": {"title": "Approximation of bivariate densities with compositional splines", "link": "https://arxiv.org/abs/2405.11615", "description": "arXiv:2405.11615v1 Announce Type: new \nAbstract: Reliable estimation and approximation of probability density functions is fundamental for their further processing. However, their specific properties, i.e. scale invariance and relative scale, prevent the use of standard methods of spline approximation and have to be considered when building a suitable spline basis. Bayes Hilbert space methodology allows to account for these properties of densities and enables their conversion to a standard Lebesgue space of square integrable functions using the centered log-ratio transformation. As the transformed densities fulfill a zero integral constraint, the constraint should likewise be respected by any spline basis used. Bayes Hilbert space methodology also allows to decompose bivariate densities into their interactive and independent parts with univariate marginals. 
As this yields a useful framework for studying the dependence structure between random variables, a spline basis ideally should admit a corresponding decomposition. This paper proposes a new spline basis for (transformed) bivariate densities respecting the desired zero integral property. We show that there is a one-to-one correspondence of this basis to a corresponding basis in the Bayes Hilbert space of bivariate densities using tools of this methodology. Furthermore, the spline representation and the resulting decomposition into interactive and independent parts are derived. Finally, this novel spline representation is evaluated in a simulation study and applied to empirical geochemical data."}, "https://arxiv.org/abs/2405.11624": {"title": "On Generalized Transmuted Lifetime Distribution", "link": "https://arxiv.org/abs/2405.11624", "description": "arXiv:2405.11624v1 Announce Type: new \nAbstract: This article presents a new class of generalized transmuted lifetime distributions which includes a large number of lifetime distributions as sub-families. Several important mathematical quantities such as density function, distribution function, quantile function, moments, moment generating function, stress-strength reliability function, order statistics, R\\'enyi and q-entropy, residual and reversed residual life function, and cumulative information generating function are obtained. The methods of maximum likelihood, ordinary least squares, weighted least squares, Cram\\'er-von Mises, Anderson Darling, and Right-tail Anderson Darling are considered to estimate the model parameters in a general way. Further, well-organized Monte Carlo simulation experiments have been performed to observe the behavior of the estimators. Finally, two real data sets have also been analyzed to demonstrate the effectiveness of the proposed distribution in real-life modeling."}, "https://arxiv.org/abs/2405.11626": {"title": "Distribution-in-distribution-out Regression", "link": "https://arxiv.org/abs/2405.11626", "description": "arXiv:2405.11626v1 Announce Type: new \nAbstract: Regression analysis with probability measures as input predictors and output response has recently drawn great attention. However, it is challenging to handle multiple input probability measures due to the non-flat Riemannian geometry of the Wasserstein space, hindering the definition of arithmetic operations, hence an additive linear structure is not well-defined. In this work, a distribution-in-distribution-out regression model is proposed by introducing parallel transport to achieve provable commutativity and additivity of newly defined arithmetic operations in Wasserstein space. The appealing properties of the DIDO regression model can serve as a foundation for model estimation, prediction, and inference. Specifically, the Fr\\'echet least squares estimator is employed to obtain the best linear unbiased estimate, supported by the newly established Fr\\'echet Gauss-Markov Theorem. Furthermore, we investigate a special case when predictors and response are all univariate Gaussian measures, leading to a simple closed-form solution of linear model coefficients and $R^2$ metric. 
A simulation study and real case study in intraoperative cardiac output prediction are performed to evaluate the performance of the proposed method."}, "https://arxiv.org/abs/2405.11681": {"title": "Distributed Tensor Principal Component Analysis", "link": "https://arxiv.org/abs/2405.11681", "description": "arXiv:2405.11681v1 Announce Type: new \nAbstract: As tensors become widespread in modern data analysis, Tucker low-rank Principal Component Analysis (PCA) has become essential for dimensionality reduction and structural discovery in tensor datasets. Motivated by the common scenario where large-scale tensors are distributed across diverse geographic locations, this paper investigates tensor PCA within a distributed framework where direct data pooling is impractical.\n We offer a comprehensive analysis of three specific scenarios in distributed Tensor PCA: a homogeneous setting in which tensors at various locations are generated from a single noise-affected model; a heterogeneous setting where tensors at different locations come from distinct models but share some principal components, aiming to improve estimation across all locations; and a targeted heterogeneous setting, designed to boost estimation accuracy at a specific location with limited samples by utilizing transferred knowledge from other sites with ample data.\n We introduce novel estimation methods tailored to each scenario, establish statistical guarantees, and develop distributed inference techniques to construct confidence regions. Our theoretical findings demonstrate that these distributed methods achieve sharp rates of accuracy by efficiently aggregating shared information across different tensors, while maintaining reasonable communication costs. Empirical validation through simulations and real-world data applications highlights the advantages of our approaches, particularly in managing heterogeneous tensor data."}, "https://arxiv.org/abs/2405.11720": {"title": "Estimating optimal tailored active surveillance strategy under interval censoring", "link": "https://arxiv.org/abs/2405.11720", "description": "arXiv:2405.11720v1 Announce Type: new \nAbstract: Active surveillance (AS) using repeated biopsies to monitor disease progression has been a popular alternative to immediate surgical intervention in cancer care. However, a biopsy procedure is invasive and sometimes leads to severe side effects of infection and bleeding. To reduce the burden of repeated surveillance biopsies, biomarker-assistant decision rules are sought to replace the fix-for-all regimen with tailored biopsy intensity for individual patients. Constructing or evaluating such decision rules is challenging. The key AS outcome is often ascertained subject to interval censoring. Furthermore, patients will discontinue their participation in the AS study once they receive a positive surveillance biopsy. Thus, patient dropout is affected by the outcomes of these biopsies. In this work, we propose a nonparametric kernel-based method to estimate the true positive rates (TPRs) and true negative rates (TNRs) of a tailored AS strategy, accounting for interval censoring and immediate dropouts. Based on these estimates, we develop a weighted classification framework to estimate the optimal tailored AS strategy and further incorporate the cost-benefit ratio for cost-effectiveness in medical decision-making. Theoretically, we provide a uniform generalization error bound of the derived AS strategy accommodating all possible trade-offs between TPRs and TNRs. 
Simulation and application to a prostate cancer surveillance study show the superiority of the proposed method."}, "https://arxiv.org/abs/2405.11723": {"title": "Inference with non-differentiable surrogate loss in a general high-dimensional classification framework", "link": "https://arxiv.org/abs/2405.11723", "description": "arXiv:2405.11723v1 Announce Type: new \nAbstract: Penalized empirical risk minimization with a surrogate loss function is often used to derive a high-dimensional linear decision rule in classification problems. Although much of the literature focuses on the generalization error, there is a lack of valid inference procedures to identify the driving factors of the estimated decision rule, especially when the surrogate loss is non-differentiable. In this work, we propose a kernel-smoothed decorrelated score to construct hypothesis testing and interval estimations for the linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version in a high-dimensional setup. Simulation and real data analysis are conducted to demonstrate the validity and superiority of the proposed method."}, "https://arxiv.org/abs/2405.11759": {"title": "Testing Sign Congruence", "link": "https://arxiv.org/abs/2405.11759", "description": "arXiv:2405.11759v1 Announce Type: new \nAbstract: We consider testing the null hypothesis that two parameters $({\\mu}_1, {\\mu}_2)$ have the same sign, assuming that (asymptotically) normal estimators are available. Examples of this problem include the analysis of heterogeneous treatment effects, causal interpretation of reduced-form estimands, meta-studies, and mediation analysis. A number of tests were recently proposed. We recommend a test that is simple and rejects more often than many of these recent proposals. Like all other tests in the literature, it is conservative if the truth is near (0, 0) and therefore also biased. To clarify whether these features are avoidable, we also provide a test that is unbiased and has exact size control on the boundary of the null hypothesis, but which has counterintuitive properties and hence we do not recommend. The method that we recommend can be used to revisit existing findings using information typically reported in empirical research papers."}, "https://arxiv.org/abs/2405.11781": {"title": "Structural Nested Mean Models Under Parallel Trends with Interference", "link": "https://arxiv.org/abs/2405.11781", "description": "arXiv:2405.11781v1 Announce Type: new \nAbstract: Despite the common occurrence of interference in Difference-in-Differences (DiD) applications, standard DiD methods rely on an assumption that interference is absent, and comparatively little work has considered how to accommodate and learn about spillover effects within a DiD framework. Here, we extend the so-called `DiD-SNMMs' of Shahn et al (2022) to accommodate interference in a time-varying DiD setting. Doing so enables estimation of a richer set of effects than previous DiD approaches. 
For example, DiD-SNMMs do not assume the absence of spillover effects after direct exposures and can model how effects of direct or indirect (i.e. spillover) exposures depend on past and concurrent (direct or indirect) exposure and covariate history. We consider both cluster and network interference structures and illustrate the methodology in simulations."}, "https://arxiv.org/abs/2405.11954": {"title": "Comparing predictive ability in presence of instability over a very short time", "link": "https://arxiv.org/abs/2405.11954", "description": "arXiv:2405.11954v1 Announce Type: new \nAbstract: We consider forecast comparison in the presence of instability when this affects only a short period of time. We demonstrate that global tests do not perform well in this case, as they were not designed to capture very short-lived instabilities, and their power vanishes altogether when the magnitude of the shock is very large. We then discuss and propose approaches that are more suitable to detect such situations, such as nonparametric methods (S test or MAX procedure). We illustrate these results in different Monte Carlo exercises and in evaluating the nowcast of the quarterly US nominal GDP from the Survey of Professional Forecasters (SPF) against a naive benchmark of no growth, over the period that includes the GDP instability brought by the Covid-19 crisis. We recommend that the forecaster should not pool the sample, but exclude the short periods of high local instability from the evaluation exercise."}, "https://arxiv.org/abs/2405.12083": {"title": "Instrumented Difference-in-Differences with heterogeneous treatment effects", "link": "https://arxiv.org/abs/2405.12083", "description": "arXiv:2405.12083v1 Announce Type: new \nAbstract: Many studies exploit variation in the timing of policy adoption across units as an instrument for treatment, and use instrumental variable techniques. This paper formalizes the underlying identification strategy as an instrumented difference-in-differences (DID-IV). In a simple setting with two periods and two groups, our DID-IV design mainly consists of a monotonicity assumption, and parallel trends assumptions in the treatment and the outcome. In this design, a Wald-DID estimand, which scales the DID estimand of the outcome by the DID estimand of the treatment, captures the local average treatment effect on the treated (LATET). In contrast to the Fuzzy DID design considered in \\cite{De_Chaisemartin2018-xe}, our DID-IV design does not {\\it ex-ante} require strong restrictions on the treatment adoption behavior across units, and our target parameter, the LATET, is policy-relevant if the instrument is based on the policy change of interest to the researcher. We extend the canonical DID-IV design to multiple period settings with the staggered adoption of the instrument across units, which we call staggered DID-IV designs. We propose an estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity. We illustrate our findings in the setting of \\cite{Oreopoulos2006-bn}, estimating returns to schooling in the United Kingdom. 
In this application, the two-way fixed effects instrumental variable regression, which is the conventional approach to implement staggered DID-IV designs, yields the negative estimate, whereas our estimation method indicates the substantial gain from schooling."}, "https://arxiv.org/abs/2405.12157": {"title": "Asymmetry models and separability for multi-way contingency tables with ordinal categories", "link": "https://arxiv.org/abs/2405.12157", "description": "arXiv:2405.12157v1 Announce Type: new \nAbstract: In this paper, we propose a model that indicates the asymmetry structure for cell probabilities in multivariate contingency tables with the same ordered categories. The proposed model is the closest to the symmetry model in terms of the $f$-divergence under certain conditions and incorporates various asymmetry models as special cases, including existing models. We elucidate the relationship between the proposed model and conventional models from several aspects of divergence in $f$-divergence. Furthermore, we provide theorems showing that the symmetry model can be decomposed into two or more models, each imposing less restrictive parameter constraints than the symmetry condition. We also discuss the properties of goodness-of-fit statistics, particularly focusing on the likelihood ratio test statistics and Wald test statistics. Finally, we summarize the proposed model and discuss some problems and future work."}, "https://arxiv.org/abs/2405.12180": {"title": "Estimating the Impact of Social Distance Policy in Mitigating COVID-19 Spread with Factor-Based Imputation Approach", "link": "https://arxiv.org/abs/2405.12180", "description": "arXiv:2405.12180v1 Announce Type: new \nAbstract: We identify the effectiveness of social distancing policies in reducing the transmission of the COVID-19 spread. We build a model that measures the relative frequency and geographic distribution of the virus growth rate and provides hypothetical infection distribution in the states that enacted the social distancing policies, where we control time-varying, observed and unobserved, state-level heterogeneities. Using panel data on infection and deaths in all US states from February 20 to April 20, 2020, we find that stay-at-home orders and other types of social distancing policies significantly reduced the growth rate of infection and deaths. We show that the effects are time-varying and range from the weakest at the beginning of policy intervention to the strongest by the end of our sample period. We also found that social distancing policies were more effective in states with higher income, better education, more white people, more democratic voters, and higher CNN viewership."}, "https://arxiv.org/abs/2405.10991": {"title": "Relative Counterfactual Contrastive Learning for Mitigating Pretrained Stance Bias in Stance Detection", "link": "https://arxiv.org/abs/2405.10991", "description": "arXiv:2405.10991v1 Announce Type: cross \nAbstract: Stance detection classifies stance relations (namely, Favor, Against, or Neither) between comments and targets. Pretrained language models (PLMs) are widely used to mine the stance relation to improve the performance of stance detection through pretrained knowledge. However, PLMs also embed ``bad'' pretrained knowledge concerning stance into the extracted stance relation semantics, resulting in pretrained stance bias. It is not trivial to measure pretrained stance bias due to its weak quantifiability. 
In this paper, we propose Relative Counterfactual Contrastive Learning (RCCL), in which pretrained stance bias is mitigated as relative stance bias instead of absolute stance bias to overcome the difficulty of measuring bias. Firstly, we present a new structural causal model for characterizing complicated relationships among context, PLMs and stance relations to locate pretrained stance bias. Then, based on masked language model prediction, we present a target-aware relative stance sample generation method for obtaining relative bias. Finally, we use contrastive learning based on counterfactual theory to mitigate pretrained stance bias and preserve context stance relation. Experiments show that the proposed method is superior to stance detection and debiasing baselines."}, "https://arxiv.org/abs/2405.11377": {"title": "Causal Customer Churn Analysis with Low-rank Tensor Block Hazard Model", "link": "https://arxiv.org/abs/2405.11377", "description": "arXiv:2405.11377v1 Announce Type: cross \nAbstract: This study introduces an innovative method for analyzing the impact of various interventions on customer churn, using the potential outcomes framework. We present a new causal model, the tensorized latent factor block hazard model, which incorporates tensor completion methods for a principled causal analysis of customer churn. A crucial element of our approach is the formulation of a 1-bit tensor completion for the parameter tensor. This captures hidden customer characteristics and temporal elements from churn records, effectively addressing the binary nature of churn data and its time-monotonic trends. Our model also uniquely categorizes interventions by their similar impacts, enhancing the precision and practicality of implementing customer retention strategies. For computational efficiency, we apply a projected gradient descent algorithm combined with spectral clustering. We lay down the theoretical groundwork for our model, including its non-asymptotic properties. The efficacy and superiority of our model are further validated through comprehensive experiments on both simulated and real-world applications."}, "https://arxiv.org/abs/2405.11688": {"title": "Performance Analysis of Monte Carlo Algorithms in Dense Subgraph Identification", "link": "https://arxiv.org/abs/2405.11688", "description": "arXiv:2405.11688v1 Announce Type: cross \nAbstract: The exploration of network structures through the lens of graph theory has become a cornerstone in understanding complex systems across diverse fields. Identifying densely connected subgraphs within larger networks is crucial for uncovering functional modules in biological systems, cohesive groups within social networks, and critical paths in technological infrastructures. The most representative approach, the SM algorithm, cannot locate subgraphs with large sizes and therefore cannot identify dense subgraphs, while the SA algorithm previously used by researchers combines simulated annealing and efficient moves for the Markov chain. However, the global optima cannot be guaranteed to be located by simulated annealing methods, including SA, unless a logarithmic cooling schedule is used. To this end, our study introduces and evaluates the performance of the Simulated Annealing Algorithm (SAA), which combines simulated annealing with the stochastic approximation Monte Carlo algorithm. 
The performance of SAA against two other numerical algorithms, SM and SA, is examined in the context of identifying these critical subgraph structures using simulated graphs with embedded cliques. We have found that SAA outperforms both SA and SM in terms of 1) the number of iterations needed to find the densest subgraph, 2) the percentage of time the algorithm is able to find a clique after 10,000 iterations, and 3) computation time. The promising result of the SAA algorithm could offer a robust tool for dissecting complex systems and potentially transform our approach to solving problems in interdisciplinary fields."}, "https://arxiv.org/abs/2405.11923": {"title": "Rate Optimality and Phase Transition for User-Level Local Differential Privacy", "link": "https://arxiv.org/abs/2405.11923", "description": "arXiv:2405.11923v1 Announce Type: cross \nAbstract: Most of the literature on differential privacy considers the item-level case where each user has a single observation, but a growing field of interest is that of user-level privacy where each of the $n$ users holds $T$ observations and wishes to maintain the privacy of their entire collection.\n In this paper, we derive a general minimax lower bound, which shows that, for locally private user-level estimation problems, the risk cannot, in general, be made to vanish for a fixed number of users even when each user holds an arbitrarily large number of observations. We then derive matching, up to logarithmic factors, lower and upper bounds for univariate and multidimensional mean estimation, sparse mean estimation and non-parametric density estimation. In particular, with other model parameters held fixed, we observe phase transition phenomena in the minimax rates as $T$, the number of observations each user holds, varies.\n In the case of (non-sparse) mean estimation and density estimation, we see that, for $T$ below a phase transition boundary, the rate is the same as having $nT$ users in the item-level setting. Different behaviour is however observed in the case of $s$-sparse $d$-dimensional mean estimation, wherein consistent estimation is impossible when $d$ exceeds the number of observations in the item-level setting, but is possible in the user-level setting when $T \\gtrsim s \\log (d)$, up to logarithmic factors. This may be of independent interest for applications as an example of a high-dimensional problem that is feasible under local privacy constraints."}, "https://arxiv.org/abs/2112.01611": {"title": "Robust changepoint detection in the variability of multivariate functional data", "link": "https://arxiv.org/abs/2112.01611", "description": "arXiv:2112.01611v2 Announce Type: replace \nAbstract: We consider the problem of robustly detecting changepoints in the variability of a sequence of independent multivariate functions. We develop a novel changepoint procedure, called the functional Kruskal--Wallis for covariance (FKWC) changepoint procedure, based on rank statistics and multivariate functional data depth. The FKWC changepoint procedure allows the user to test for at most one changepoint (AMOC) or an epidemic period, or to estimate the number and locations of an unknown number of changepoints in the data. We show that when the ``signal-to-noise'' ratio is bounded below, the changepoint estimates produced by the FKWC procedure attain the minimax localization rate for detecting general changes in distribution in the univariate setting (Theorem 1). 
We also provide the behavior of the proposed test statistics for the AMOC and epidemic setting under the null hypothesis (Theorem 2) and, as a simple consequence of our main result, these tests are consistent (Corollary 1). In simulation, we show that our method is particularly robust when compared to similar changepoint methods. We present an application of the FKWC procedure to intraday asset returns and f-MRI scans. As a by-product of Theorem 1, we provide a concentration result for integrated functional depth functions (Lemma 2), which may be of general interest."}, "https://arxiv.org/abs/2212.09844": {"title": "Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding", "link": "https://arxiv.org/abs/2212.09844", "description": "arXiv:2212.09844v5 Announce Type: replace \nAbstract: Predictive algorithms inform consequential decisions in settings where the outcome is selectively observed given choices made by human decision makers. We propose a unified framework for the robust design and evaluation of predictive algorithms in selectively observed data. We impose general assumptions on how much the outcome may vary on average between unselected and selected units conditional on observed covariates and identified nuisance parameters, formalizing popular empirical strategies for imputing missing data such as proxy outcomes and instrumental variables. We develop debiased machine learning estimators for the bounds on a large class of predictive performance estimands, such as the conditional likelihood of the outcome, a predictive algorithm's mean square error, true/false positive rate, and many others, under these assumptions. In an administrative dataset from a large Australian financial institution, we illustrate how varying assumptions on unobserved confounding leads to meaningful changes in default risk predictions and evaluations of credit scores across sensitive groups."}, "https://arxiv.org/abs/2303.11777": {"title": "Quasi Maximum Likelihood Estimation of High-Dimensional Factor Models: A Critical Review", "link": "https://arxiv.org/abs/2303.11777", "description": "arXiv:2303.11777v5 Announce Type: replace \nAbstract: We review Quasi Maximum Likelihood estimation of factor models for high-dimensional panels of time series. We consider two cases: (1) estimation when no dynamic model for the factors is specified (Bai and Li, 2012, 2016); (2) estimation based on the Kalman smoother and the Expectation Maximization algorithm thus allowing to model explicitly the factor dynamics (Doz et al., 2012, Barigozzi and Luciani, 2019). Our interest is in approximate factor models, i.e., when we allow for the idiosyncratic components to be mildly cross-sectionally, as well as serially, correlated. Although such setting apparently makes estimation harder, we show, in fact, that factor models do not suffer of the {\\it curse of dimensionality} problem, but instead they enjoy a {\\it blessing of dimensionality} property. In particular, given an approximate factor structure, if the cross-sectional dimension of the data, $N$, grows to infinity, we show that: (i) identification of the model is still possible, (ii) the mis-specification error due to the use of an exact factor model log-likelihood vanishes. Moreover, if we let also the sample size, $T$, grow to infinity, we can also consistently estimate all parameters of the model and make inference. 
The same is true for estimation of the latent factors, which can be carried out by weighted least-squares, linear projection, or Kalman filtering/smoothing. We also compare the approaches presented with Principal Component analysis and the classical, fixed $N$, exact Maximum Likelihood approach. We conclude with a discussion on the efficiency of the considered estimators."}, "https://arxiv.org/abs/2305.07581": {"title": "Nonparametric data segmentation in multivariate time series via joint characteristic functions", "link": "https://arxiv.org/abs/2305.07581", "description": "arXiv:2305.07581v3 Announce Type: replace \nAbstract: Modern time series data often exhibit complex dependence and structural changes which are not easily characterised by shifts in the mean or model parameters. We propose a nonparametric data segmentation methodology for multivariate time series termed NP-MOJO. By considering joint characteristic functions between the time series and its lagged values, NP-MOJO is able to detect change points in the marginal distribution, but also those in possibly non-linear serial dependence, all without the need to pre-specify the type of changes. We show the theoretical consistency of NP-MOJO in estimating the total number and the locations of the change points, and demonstrate the good performance of NP-MOJO against a variety of change point scenarios. We further demonstrate its usefulness in applications to seismology and economic time series."}, "https://arxiv.org/abs/2309.01334": {"title": "Average treatment effect on the treated, under lack of positivity", "link": "https://arxiv.org/abs/2309.01334", "description": "arXiv:2309.01334v3 Announce Type: replace \nAbstract: The use of propensity score (PS) methods has become ubiquitous in causal inference. At the heart of these methods is the positivity assumption. Violation of the positivity assumption leads to the presence of extreme PS weights when estimating average causal effects of interest, such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT), which renders the related statistical inference invalid. To circumvent this issue, trimming or truncating the extreme estimated PSs has been widely used. However, these methods require that we specify a priori a threshold and sometimes an additional smoothing parameter. While there are a number of methods dealing with the lack of positivity when estimating the ATE, surprisingly little effort has been devoted to the same issue for the ATT. In this paper, we first review widely used methods, such as trimming and truncation, for the ATT. We emphasize the underlying intuition behind these methods to better understand their applications and highlight their main limitations. Then, we argue that the current methods simply target estimands that are scaled ATT (and thus move the goalpost to a different target of interest), where we specify the scale and the target populations. We further propose a PS weight-based alternative for the average causal effect on the treated, called overlap weighted average treatment effect on the treated (OWATT). The appeal of our proposed method lies in its ability to obtain similar or even better results than trimming and truncation while relaxing the constraint to choose a priori a threshold (or even specify a smoothing parameter). 
The performance of the proposed method is illustrated via a series of Monte Carlo simulations and a data analysis on racial disparities in health care expenditures."}, "https://arxiv.org/abs/2309.04926": {"title": "Testing for Stationary or Persistent Coefficient Randomness in Predictive Regressions", "link": "https://arxiv.org/abs/2309.04926", "description": "arXiv:2309.04926v4 Announce Type: replace \nAbstract: This study considers tests for coefficient randomness in predictive regressions. Our focus is on how tests for coefficient randomness are influenced by the persistence of the random coefficient. We show that when the random coefficient is stationary, or I(0), Nyblom's (1989) LM test loses its optimality (in terms of power), which is established against the alternative of an integrated, or I(1), random coefficient. We demonstrate this by constructing a test that is more powerful than the LM test when the random coefficient is stationary, although the test is dominated in terms of power by the LM test when the random coefficient is integrated. This implies that the best test for coefficient randomness differs from context to context, and the persistence of the random coefficient determines which test is the best one. We apply those tests to the U.S. stock returns data."}, "https://arxiv.org/abs/2104.14412": {"title": "Nonparametric Test for Volatility in Clustered Multiple Time Series", "link": "https://arxiv.org/abs/2104.14412", "description": "arXiv:2104.14412v3 Announce Type: replace-cross \nAbstract: Contagion arising from clustering of multiple time series like those in the stock market indicators can further complicate the nature of volatility, causing a parametric test (relying on an asymptotic distribution) to suffer from issues with size and power. We propose a test on volatility based on the bootstrap method for multiple time series, intended to account for the possible presence of a contagion effect. While the test is fairly robust to distributional assumptions, it depends on the nature of volatility. The test is correctly sized even in cases where the time series are almost nonstationary. The test is also powerful, especially when the time series are stationary in mean and volatility is contained in only a few clusters. We illustrate the method on global stock price data."}, "https://arxiv.org/abs/2109.08793": {"title": "Estimations of the Local Conditional Tail Average Treatment Effect", "link": "https://arxiv.org/abs/2109.08793", "description": "arXiv:2109.08793v3 Announce Type: replace-cross \nAbstract: The conditional tail average treatment effect (CTATE) is defined as a difference between the conditional tail expectations of potential outcomes, which can capture heterogeneity and deliver aggregated local information on treatment effects over different quantile levels and is closely related to the notion of second-order stochastic dominance and the Lorenz curve. These properties render it a valuable tool for policy evaluation. In this paper, we study estimation of the CTATE locally for a group of compliers (local CTATE or LCTATE) under the two-sided noncompliance framework. We consider a semiparametric treatment effect framework under endogeneity for the LCTATE estimation using a newly introduced class of consistent loss functions jointly for the conditional tail expectation and quantile. We establish the asymptotic theory of our proposed LCTATE estimator and provide an efficient algorithm for its implementation. 
We then apply the method to evaluate the effects of participating in programs under the Job Training Partnership Act in the US."}, "https://arxiv.org/abs/2305.19461": {"title": "Residual spectrum: Brain functional connectivity detection beyond coherence", "link": "https://arxiv.org/abs/2305.19461", "description": "arXiv:2305.19461v2 Announce Type: replace-cross \nAbstract: Coherence is a widely used measure to assess linear relationships between time series. However, it fails to capture nonlinear dependencies. To overcome this limitation, this paper introduces the notion of residual spectral density as a higher-order extension of the squared coherence. The method is based on an orthogonal decomposition of time series regression models. We propose a test for the existence of the residual spectrum and derive its fundamental properties. A numerical study illustrates finite sample performance of the proposed method. An application of the method shows that the residual spectrum can effectively detect brain connectivity. Our study reveals a noteworthy contrast in connectivity patterns between schizophrenia patients and healthy individuals. Specifically, we observed that non-linear connectivity in schizophrenia patients surpasses that of healthy individuals, which stands in stark contrast to the established understanding that linear connectivity tends to be higher in healthy individuals. This finding sheds new light on the intricate dynamics of brain connectivity in schizophrenia."}, "https://arxiv.org/abs/2308.05945": {"title": "Improving Ego-Cluster for Network Effect Measurement", "link": "https://arxiv.org/abs/2308.05945", "description": "arXiv:2308.05945v2 Announce Type: replace-cross \nAbstract: The network effect, wherein one user's activity impacts another user, is common in social network platforms. Many new features in social networks are specifically designed to create a network effect, enhancing user engagement. For instance, content creators tend to produce more when their articles and posts receive positive feedback from followers. This paper discusses a new cluster-level experimentation methodology for measuring creator-side metrics in the context of A/B experiments. The methodology is designed to address cases where the experiment randomization unit and the metric measurement unit differ. It is a crucial part of LinkedIn's overall strategy to foster a robust creator community and ecosystem. The method is developed based on widely-cited research at LinkedIn but significantly improves the efficiency and flexibility of the clustering algorithm. This improvement results in a stronger capability for measuring creator-side metrics and an increased velocity for creator-related experiments."}, "https://arxiv.org/abs/2405.12467": {"title": "Conditional Choice Probability Estimation of Dynamic Discrete Choice Models with 2-period Finite Dependence", "link": "https://arxiv.org/abs/2405.12467", "description": "arXiv:2405.12467v1 Announce Type: new \nAbstract: This paper extends the work of Arcidiacono and Miller (2011, 2019) by introducing a novel characterization of finite dependence within dynamic discrete choice models, demonstrating that numerous models display 2-period finite dependence. We recast finite dependence as a problem of sequentially searching for weights and introduce a computationally efficient method for determining these weights by utilizing the Kronecker product structure embedded in state transitions. 
With the estimated weights, we develop a computationally attractive Conditional Choice Probability estimator with 2-period finite dependence. The computational efficacy of our proposed estimator is demonstrated through Monte Carlo simulations."}, "https://arxiv.org/abs/2405.12581": {"title": "Spectral analysis for noisy Hawkes processes inference", "link": "https://arxiv.org/abs/2405.12581", "description": "arXiv:2405.12581v1 Announce Type: new \nAbstract: Classic estimation methods for Hawkes processes rely on the assumption that observed event times are indeed a realisation of a Hawkes process, without considering any potential perturbation of the model. However, in practice, observations are often altered by some noise, the form of which depends on the context. It is then required to model the alteration mechanism in order to infer accurately such a noisy Hawkes process. While several models exist, we consider, in this work, the observations to be the indistinguishable union of event times coming from a Hawkes process and from an independent Poisson process. Since standard inference methods (such as maximum likelihood or Expectation-Maximisation) are either unworkable or numerically prohibitive in this context, we propose an estimation procedure based on the spectral analysis of second order properties of the noisy Hawkes process. Novel results include sufficient conditions for identifiability of the ensuing statistical model with exponential interaction functions for both univariate and bivariate processes. Although we mainly focus on the exponential scenario, other types of kernels are investigated and discussed. A new estimator based on maximising the spectral log-likelihood is then described, and its behaviour is numerically illustrated on synthetic data. Besides being free from knowing the source of each observed time (Hawkes or Poisson process), the proposed estimator is shown to perform accurately in estimating both processes."}, "https://arxiv.org/abs/2405.12622": {"title": "Asymptotic Properties of Matthews Correlation Coefficient", "link": "https://arxiv.org/abs/2405.12622", "description": "arXiv:2405.12622v1 Announce Type: new \nAbstract: Evaluating classifications is crucial in statistics and machine learning, as it influences decision-making across various fields, such as patient prognosis and therapy in critical conditions. The Matthews correlation coefficient (MCC) is recognized as a performance metric with high reliability, offering a balanced measurement even in the presence of class imbalances. Despite its importance, there remains a notable lack of comprehensive research on the statistical inference of MCC. This deficiency often leads to studies merely validating and comparing MCC point estimates, a practice that, while common, overlooks the statistical significance and reliability of results. Addressing this research gap, our paper introduces and evaluates several methods to construct asymptotic confidence intervals for the single MCC and the differences between MCCs in paired designs. Through simulations across various scenarios, we evaluate the finite-sample behavior of these methods and compare their performances. 
Furthermore, through real data analysis, we illustrate the potential utility of our findings in comparing binary classifiers, highlighting the possible contributions of our research in this field."}, "https://arxiv.org/abs/2405.12668": {"title": "Short and simple introduction to Bellman filtering and smoothing", "link": "https://arxiv.org/abs/2405.12668", "description": "arXiv:2405.12668v1 Announce Type: new \nAbstract: Based on Bellman's dynamic-programming principle, Lange (2024) presents an approximate method for filtering, smoothing and parameter estimation for possibly non-linear and/or non-Gaussian state-space models. While the approach applies more generally, this pedagogical note highlights the main results in the case where (i) the state transition remains linear and Gaussian while (ii) the observation density is log-concave and sufficiently smooth in the state variable. I demonstrate how Kalman's (1960) filter and Rauch et al.'s (1965) smoother can be obtained as special cases within the proposed framework. The main aim is to present non-experts (and my own students) with an accessible introduction, enabling them to implement the proposed methods."}, "https://arxiv.org/abs/2405.12694": {"title": "Parameter estimation in Comparative Judgement", "link": "https://arxiv.org/abs/2405.12694", "description": "arXiv:2405.12694v1 Announce Type: new \nAbstract: Comparative Judgement is an assessment method where item ratings are estimated based on rankings of subsets of the items. These rankings are typically pairwise, with ratings taken to be the estimated parameters from fitting a Bradley-Terry model. Likelihood penalization is often employed. Adaptive scheduling of the comparisons can increase the efficiency of the assessment. We show that the most commonly used penalty is not the best-performing penalty under adaptive scheduling and can lead to substantial bias in parameter estimates. We demonstrate this using simulated and real data and provide a theoretical explanation for the relative performance of the penalties considered. Further, we propose a superior approach based on bootstrapping. It is shown to produce better parameter estimates for adaptive schedules and to be robust to variations in underlying strength distributions and initial penalization method."}, "https://arxiv.org/abs/2405.12816": {"title": "A Non-Parametric Box-Cox Approach to Robustifying High-Dimensional Linear Hypothesis Testing", "link": "https://arxiv.org/abs/2405.12816", "description": "arXiv:2405.12816v1 Announce Type: new \nAbstract: The mainstream theory of hypothesis testing in high-dimensional regression typically assumes the underlying true model is a low-dimensional linear regression model, yet the Box-Cox transformation is a regression technique commonly used to mitigate anomalies like non-additivity and heteroscedasticity. This paper introduces a more flexible framework, the non-parametric Box-Cox model with unspecified transformation, to address model mis-specification in high-dimensional linear hypothesis testing while preserving the interpretation of regression coefficients. Model estimation and computation in high dimensions poses challenges beyond traditional sparse penalization methods. We propose the constrained partial penalized composite probit regression method for sparse estimation and investigate its statistical properties. 
Additionally, we present a computationally efficient algorithm using augmented Lagrangian and coordinate majorization descent for solving regularization problems with folded concave penalization and linear constraints. For testing linear hypotheses, we propose the partial penalized composite likelihood ratio test, score test and Wald test, and show that their limiting distributions under the null and local alternatives follow generalized chi-squared distributions with the same degrees of freedom and noncentrality parameter. Extensive simulation studies are conducted to examine the finite sample performance of the proposed tests. Our analysis of supermarket data illustrates potential discrepancies between our testing procedures and standard high-dimensional methods, highlighting the importance of our robustified approach."}, "https://arxiv.org/abs/2405.12924": {"title": "Robust Nonparametric Regression for Compositional Data: the Simplicial--Real case", "link": "https://arxiv.org/abs/2405.12924", "description": "arXiv:2405.12924v1 Announce Type: new \nAbstract: Statistical analysis on compositional data has gained a lot of attention due to their great potential for applications. A feature of these data is that they are multivariate vectors that lie in the simplex, that is, the components of each vector are positive and sum up to a constant value. This fact poses a challenge to the analyst due to the internal dependency of the components, which exhibit a spurious negative correlation. Since classical multivariate techniques are not appropriate in this scenario, it is necessary to endow the simplex with a suitable algebraic-geometrical structure, which is a starting point to develop adequate methodology and strategies to handle compositions. We center our attention on regression problems with real responses and compositional covariates and adopt a nonparametric approach due to the flexibility it provides. Aware of the potential damage that outliers may produce, we introduce a robust estimator in the framework of nonparametric regression for compositional data. The performance of the estimators is investigated by means of a numerical study where different contamination schemes are simulated. Through a real data analysis, the advantages of using a robust procedure are illustrated."}, "https://arxiv.org/abs/2405.12953": {"title": "Quantifying Uncertainty in Classification Performance: ROC Confidence Bands Using Conformal Prediction", "link": "https://arxiv.org/abs/2405.12953", "description": "arXiv:2405.12953v1 Announce Type: new \nAbstract: To evaluate a classification algorithm, it is common practice to plot the ROC curve using test data. However, the inherent randomness in the test data can undermine our confidence in the conclusions drawn from the ROC curve, necessitating uncertainty quantification. In this article, we propose an algorithm to construct confidence bands for the ROC curve, quantifying the uncertainty of classification on the test data in terms of sensitivity and specificity. The algorithm is based on a procedure called conformal prediction, which constructs individualized confidence intervals for the test set; the confidence bands for the ROC curve can then be obtained by combining these individualized intervals. Furthermore, we address both scenarios where the test data are either iid or non-iid relative to the observed data set and propose distinct algorithms for each case with valid coverage probability. 
The proposed method is validated through both theoretical results and numerical experiments."}, "https://arxiv.org/abs/2405.12343": {"title": "Determine the Number of States in Hidden Markov Models via Marginal Likelihood", "link": "https://arxiv.org/abs/2405.12343", "description": "arXiv:2405.12343v1 Announce Type: cross \nAbstract: Hidden Markov models (HMM) have been widely used by scientists to model stochastic systems: the underlying process is a discrete Markov chain and the observations are noisy realizations of the underlying process. Determining the number of hidden states for an HMM is a model selection problem, which is yet to be satisfactorily solved, especially for the popular Gaussian HMM with heterogeneous covariance. In this paper, we propose a consistent method for determining the number of hidden states of an HMM based on the marginal likelihood, which is obtained by integrating out both the parameters and hidden states. Moreover, we show that the model selection problem of HMM includes the order selection problem of finite mixture models as a special case. We give a rigorous proof of the consistency of the proposed marginal likelihood method and provide an efficient computation method for practical implementation. We numerically compare the proposed method with the Bayesian information criterion (BIC), demonstrating the effectiveness of the proposed marginal likelihood method."}, "https://arxiv.org/abs/2207.11532": {"title": "Change Point Detection for High-dimensional Linear Models: A General Tail-adaptive Approach", "link": "https://arxiv.org/abs/2207.11532", "description": "arXiv:2207.11532v3 Announce Type: replace \nAbstract: We propose a novel approach for detecting change points in high-dimensional linear regression models. Unlike previous research that relied on strict Gaussian/sub-Gaussian error assumptions and required prior knowledge of change points, we propose a tail-adaptive method for change point detection and estimation. We use a weighted combination of composite quantile and least squares losses to build a new loss function, allowing us to leverage information from both conditional means and quantiles. For change point testing, we develop a family of individual testing statistics with different weights to account for unknown tail structures. These individual tests are further aggregated to construct a powerful tail-adaptive test for sparse regression coefficient changes. For change point estimation, we propose a family of argmax-based individual estimators. We provide theoretical justifications for the validity of these tests and change point estimators. Additionally, we introduce a new algorithm for detecting multiple change points in a tail-adaptive manner using wild binary segmentation. Extensive numerical results show the effectiveness of our method. Lastly, an R package called ``TailAdaptiveCpt\" is developed to implement our algorithms."}, "https://arxiv.org/abs/2303.14508": {"title": "A spectral based goodness-of-fit test for stochastic block models", "link": "https://arxiv.org/abs/2303.14508", "description": "arXiv:2303.14508v2 Announce Type: replace \nAbstract: Community detection is a fundamental problem in complex network data analysis. Though many methods have been proposed, most existing methods require the number of communities to be a known parameter, which is often not the case in practice. In this paper, we propose a novel goodness-of-fit test for the stochastic block model. The test statistic is based on the linear spectral statistic of the adjacency matrix. 
Under the null hypothesis, we prove that the linear spectral statistic converges in distribution to $N(0,1)$. Some recent results on generalized Wigner matrices are used to prove the main theorems. Numerical experiments and real-world data examples illustrate that our proposed linear spectral statistic has good performance."}, "https://arxiv.org/abs/2304.12414": {"title": "Bayesian Geostatistics Using Predictive Stacking", "link": "https://arxiv.org/abs/2304.12414", "description": "arXiv:2304.12414v2 Announce Type: replace \nAbstract: We develop Bayesian predictive stacking for geostatistical models, where the primary inferential objective is to provide inference on the latent spatial random field and conduct spatial predictions at arbitrary locations. We exploit analytically tractable posterior distributions for regression coefficients of predictors and the realizations of the spatial process conditional upon process parameters. We subsequently combine such inference by stacking these models across the range of values of the hyper-parameters. We devise stacking of means and posterior densities in a manner that is computationally efficient without resorting to iterative algorithms such as Markov chain Monte Carlo (MCMC) and can exploit the benefits of parallel computations. We offer novel theoretical insights into the resulting inference within an infill asymptotic paradigm and through empirical results showing that stacked inference is comparable to full sampling-based Bayesian inference at a significantly lower computational cost."}, "https://arxiv.org/abs/2311.00596": {"title": "Evaluating Binary Outcome Classifiers Estimated from Survey Data", "link": "https://arxiv.org/abs/2311.00596", "description": "arXiv:2311.00596v3 Announce Type: replace \nAbstract: Surveys are commonly used to facilitate research in epidemiology, health, and the social and behavioral sciences. Often, these surveys are not simple random samples, and respondents are given weights reflecting their probability of selection into the survey. It is well known that analysts can use these survey weights to produce unbiased estimates of population quantities like totals. In this article, we show that survey weights also can be beneficial for evaluating the quality of predictive models when splitting data into training and test sets. In particular, we characterize model assessment statistics, such as sensitivity and specificity, as finite population quantities, and compute survey-weighted estimates of these quantities with sample test data comprising a random subset of the original data. Using simulations with data from the National Survey on Drug Use and Health and the National Comorbidity Survey, we show that unweighted metrics estimated with sample test data can misrepresent population performance, but weighted metrics appropriately adjust for the complex sampling design. We also show that this conclusion holds for models trained using upsampling for mitigating class imbalance. The results suggest that weighted metrics should be used when evaluating performance on sample test data."}, "https://arxiv.org/abs/2312.00590": {"title": "Inference on common trends in functional time series", "link": "https://arxiv.org/abs/2312.00590", "description": "arXiv:2312.00590v4 Announce Type: replace \nAbstract: We study statistical inference on unit roots and cointegration for time series in a Hilbert space. 
We develop statistical inference on the number of common stochastic trends embedded in the time series, i.e., the dimension of the nonstationary subspace. We also consider tests of hypotheses on the nonstationary and stationary subspaces themselves. The Hilbert space can be of an arbitrarily large dimension, and our methods remain asymptotically valid even when the time series of interest takes values in a subspace of possibly unknown dimension. This has wide applicability in practice; for example, to the case of cointegrated vector time series that are either high-dimensional or of finite dimension, to high-dimensional factor models that include a finite number of nonstationary factors, to cointegrated curve-valued (or function-valued) time series, and to nonstationary dynamic functional factor models. We include two empirical illustrations to the term structure of interest rates and labor market indices, respectively."}, "https://arxiv.org/abs/2210.14086": {"title": "A Global Wavelet Based Bootstrapped Test of Covariance Stationarity", "link": "https://arxiv.org/abs/2210.14086", "description": "arXiv:2210.14086v3 Announce Type: replace-cross \nAbstract: We propose a covariance stationarity test for an otherwise dependent and possibly globally non-stationary time series. We work in a generalized version of the new setting in Jin, Wang and Wang (2015), who exploit Walsh (1923) functions in order to compare sub-sample covariances with the full sample counterpart. They impose strict stationarity under the null, only consider linear processes under either hypothesis in order to achieve a parametric estimator for an inverted high dimensional asymptotic covariance matrix, and do not consider any other orthonormal basis. Conversely, we work with a general orthonormal basis under mild conditions that include Haar wavelet and Walsh functions; and we allow for linear or nonlinear processes with possibly non-iid innovations. This is important in macroeconomics and finance where nonlinear feedback and random volatility occur in many settings. We completely sidestep asymptotic covariance matrix estimation and inversion by bootstrapping a max-correlation difference statistic, where the maximum is taken over the correlation lag $h$ and basis generated sub-sample counter $k$ (the number of systematic samples). We achieve a higher feasible rate of increase for the maximum lag and counter $\\mathcal{H}_{T}$ and $\\mathcal{K}_{T}$. Of particular note, our test is capable of detecting breaks in variance, and distant, or very mild, deviations from stationarity."}, "https://arxiv.org/abs/2307.03687": {"title": "Leveraging text data for causal inference using electronic health records", "link": "https://arxiv.org/abs/2307.03687", "description": "arXiv:2307.03687v2 Announce Type: replace-cross \nAbstract: In studies that rely on data from electronic health records (EHRs), unstructured text data such as clinical progress notes offer a rich source of information about patient characteristics and care that may be missing from structured data. Despite the prevalence of text in clinical research, these data are often ignored for the purposes of quantitative analysis due to their complexity. This paper presents a unified framework for leveraging text data to support causal inference with electronic health data at multiple stages of analysis. 
In particular, we consider how natural language processing and statistical text analysis can be combined with standard inferential techniques to address common challenges due to missing data, confounding bias, and treatment effect heterogeneity. Through an application to a recent EHR study investigating the effects of a non-randomized medical intervention on patient outcomes, we show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect and identify patient subgroups that may benefit most from treatment. We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited, such as in developing countries. To this end, we provide code and open-source replication materials to encourage adoption and broader exploration of these techniques in clinical research."}, "https://arxiv.org/abs/2312.10002": {"title": "On the Injectivity of Euler Integral Transforms with Hyperplanes and Quadric Hypersurfaces", "link": "https://arxiv.org/abs/2312.10002", "description": "arXiv:2312.10002v2 Announce Type: replace-cross \nAbstract: The Euler characteristic transform (ECT) is an integral transform used widely in topological data analysis. Previous efforts by Curry et al. and Ghrist et al. have independently shown that the ECT is injective on all compact definable sets. In this work, we first study the injectivity of the ECT on definable sets that are not necessarily compact and prove a complete classification of constructible functions that the Euler characteristic transform is not injective on. We then introduce the quadric Euler characteristic transform (QECT) as a natural generalization of the ECT by detecting definable shapes with quadric hypersurfaces rather than hyperplanes. We also discuss some criteria for the injectivity of QECT."}, "https://arxiv.org/abs/2405.13100": {"title": "Better Simulations for Validating Causal Discovery with the DAG-Adaptation of the Onion Method", "link": "https://arxiv.org/abs/2405.13100", "description": "arXiv:2405.13100v1 Announce Type: new \nAbstract: The number of artificial intelligence algorithms for learning causal models from data is growing rapidly. Most ``causal discovery'' or ``causal structure learning'' algorithms are primarily validated through simulation studies. However, no widely accepted simulation standards exist and publications often report conflicting performance statistics -- even when only considering publications that simulate data from linear models. In response, several manuscripts have criticized a popular simulation design for validating algorithms in the linear case.\n We propose a new simulation design for generating linear models for directed acyclic graphs (DAGs): the DAG-adaptation of the Onion (DaO) method. DaO simulations are fundamentally different from existing simulations because they prioritize the distribution of correlation matrices rather than the distribution of linear effects. Specifically, the DaO method uniformly samples the space of all correlation matrices consistent with (i.e. Markov to) a DAG. We also discuss how to sample DAGs and present methods for generating DAGs with scale-free in-degree or out-degree. We compare the DaO method against two alternative simulation designs and provide implementations of the DaO method in Python and R: https://github.com/bja43/DaO_simulation. 
We advocate for others to adopt DaO simulations as a fair universal benchmark."}, "https://arxiv.org/abs/2405.13342": {"title": "Scalable Bayesian inference for heat kernel Gaussian processes on manifolds", "link": "https://arxiv.org/abs/2405.13342", "description": "arXiv:2405.13342v1 Announce Type: new \nAbstract: We develop scalable manifold learning methods and theory, motivated by the problem of estimating the manifold of fMRI activation in the Human Connectome Project (HCP). We propose the Fast Graph Laplacian Estimation for Heat Kernel Gaussian Processes (FLGP) in the natural exponential family model. FLGP handles large sample sizes $ n $, preserves the intrinsic geometry of data, and significantly reduces computational complexity from $ \\mathcal{O}(n^3) $ to $ \\mathcal{O}(n) $ via a novel reduced-rank approximation of the graph Laplacian's transition matrix and truncated Singular Value Decomposition for eigenpair computation. Our numerical experiments demonstrate FLGP's scalability and improved accuracy for manifold learning from large-scale complex data."}, "https://arxiv.org/abs/2405.13353": {"title": "Adaptive Bayesian Multivariate Spline Knot Inference with Prior Specifications on Model Complexity", "link": "https://arxiv.org/abs/2405.13353", "description": "arXiv:2405.13353v1 Announce Type: new \nAbstract: In multivariate spline regression, the number and locations of knots influence the performance and interpretability significantly. However, due to non-differentiability and varying dimensions, there is no desirable frequentist method to make inference on knots. In this article, we propose a fully Bayesian approach for knot inference in multivariate spline regression. The existing Bayesian method often uses BIC to calculate the posterior, but BIC is too liberal and heavily overestimates the knot number when the candidate model space is large. We specify a new prior on the knot number to take into account the complexity of the model space and derive an analytic formula in the normal model. In the non-normal cases, we utilize the extended Bayesian information criterion to approximate the posterior density. The samples are simulated in the space with differing dimensions via reversible jump Markov chain Monte Carlo. We apply the proposed method in knot inference and manifold denoising. Experiments demonstrate the splendid capability of the algorithm, especially in fitting functions with jump discontinuities."}, "https://arxiv.org/abs/2405.13400": {"title": "Ensemble size dependence of the logarithmic score for forecasts issued as multivariate normal distributions", "link": "https://arxiv.org/abs/2405.13400", "description": "arXiv:2405.13400v1 Announce Type: new \nAbstract: Multivariate probabilistic verification is concerned with the evaluation of joint probability distributions of vector quantities such as a weather variable at multiple locations or a wind vector, for instance. The logarithmic score is a proper score that is useful in this context. In order to apply this score to ensemble forecasts, a choice for the density is required. Here, we are interested in the specific case when the density is multivariate normal with mean and covariance given by the ensemble mean and ensemble covariance, respectively. Under the assumptions of multivariate normality and exchangeability of the ensemble members, a relationship is derived which describes how the logarithmic score depends on ensemble size. 
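For contrast with the DaO entry above, the sketch below implements the generic linear-SEM simulation design that such work critiques, not the DaO sampler itself: draw a random DAG, draw edge weights, and form the implied correlation matrix. Graph size and weight ranges are arbitrary choices for illustration.

```python
# Minimal sketch of a conventional linear-SEM simulation (NOT the DaO sampler):
# sample a random DAG, sample linear edge weights, and compute the correlation
# matrix implied by the structural equations with unit-variance noise.
import numpy as np

rng = np.random.default_rng(0)
p, edge_prob = 5, 0.4

# Lower-triangular adjacency => acyclic by construction (variables in causal order).
A = np.tril(rng.random((p, p)) < edge_prob, k=-1)
B = A * rng.uniform(0.5, 1.5, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))

# X = B @ X + eps with unit-variance noise  =>  Cov(X) = (I - B)^{-1} (I - B)^{-T}
I = np.eye(p)
M = np.linalg.inv(I - B)
Sigma = M @ M.T

# Standardize to a correlation matrix that is Markov to the sampled DAG.
d = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(d, d)
print(np.round(R, 2))
```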
It permits estimating the score in the limit of infinite ensemble size from a small ensemble and thus produces a fair logarithmic score for multivariate ensemble forecasts under the assumption of normality. This generalises a study from 2018 which derived the ensemble size adjustment of the logarithmic score in the univariate case.\n An application to medium-range forecasts examines the usefulness of the ensemble size adjustments when multivariate normality is only an approximation. Predictions of vectors consisting of several different combinations of upper air variables are considered. Logarithmic scores are calculated for these vectors using ECMWF's daily extended-range forecasts which consist of a 100-member ensemble. The probabilistic forecasts of these vectors are verified against operational ECMWF analyses in the Northern mid-latitudes in autumn 2023. Scores are computed for ensemble sizes from 8 to 100. The fair logarithmic scores of ensembles with different cardinalities are very close, in contrast to the unadjusted scores which decrease considerably with ensemble size. This provides evidence for the practical usefulness of the derived relationships."}, "https://arxiv.org/abs/2405.13537": {"title": "Sequential Bayesian inference for stochastic epidemic models of cumulative incidence", "link": "https://arxiv.org/abs/2405.13537", "description": "arXiv:2405.13537v1 Announce Type: new \nAbstract: Epidemics are inherently stochastic, and stochastic models provide an appropriate way to describe and analyse such phenomena. Given temporal incidence data consisting of, for example, the number of new infections or removals in a given time window, a continuous-time discrete-valued Markov process provides a natural description of the dynamics of each model component, typically taken to be the number of susceptible, exposed, infected or removed individuals. Fitting the SEIR model to time-course data is a challenging problem due to incomplete observations and, consequently, the intractability of the observed data likelihood. Whilst sampling-based inference schemes such as Markov chain Monte Carlo are routinely applied, their computational cost typically restricts analysis to data sets of no more than a few thousand infective cases. Instead, we develop a sequential inference scheme that makes use of a computationally cheap approximation of the most natural Markov process model. Crucially, the resulting model allows a tractable conditional parameter posterior which can be summarised in terms of a set of low dimensional statistics. This is used to rejuvenate parameter samples in conjunction with a novel bridge construct for propagating state trajectories conditional on the next observation of cumulative incidence. The resulting inference framework also allows for stochastic infection and reporting rates. We illustrate our approach using synthetic and real data applications."}, "https://arxiv.org/abs/2405.13553": {"title": "Hidden semi-Markov models with inhomogeneous state dwell-time distributions", "link": "https://arxiv.org/abs/2405.13553", "description": "arXiv:2405.13553v1 Announce Type: new \nAbstract: The well-established methodology for the estimation of hidden semi-Markov models (HSMMs) as hidden Markov models (HMMs) with extended state spaces is further developed to incorporate covariate influences across all aspects of the state process model, in particular, regarding the distributions governing the state dwell time. 
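To make the setup of the logarithmic-score entry above concrete, this sketch evaluates the unadjusted logarithmic score of an ensemble forecast using a multivariate normal density with the ensemble mean and covariance; the paper's ensemble-size ("fair") adjustment is not reproduced, and all numbers are synthetic.

```python
# Minimal sketch: (unadjusted) logarithmic score of a multivariate ensemble
# forecast, using a normal density with the ensemble mean and covariance.
# The paper's ensemble-size ("fair") adjustment is not reproduced here.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, d = 20, 3                      # ensemble members, forecast dimension
ensemble = rng.normal(size=(m, d))
obs = rng.normal(size=d)          # verifying observation (synthetic)

mean = ensemble.mean(axis=0)
cov = np.cov(ensemble, rowvar=False)   # ensemble covariance estimate

# Logarithmic score = negative log predictive density at the observation.
log_score = -multivariate_normal(mean=mean, cov=cov).logpdf(obs)
print(log_score)
```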
The special case of periodically varying covariate effects on the state dwell-time distributions - and possibly the conditional transition probabilities - is examined in detail to derive important properties of such models, namely the periodically varying unconditional state distribution as well as the overall state dwell-time distribution. Through simulation studies, we ascertain key properties of these models and develop recommendations for hyperparameter settings. Furthermore, we provide a case study involving an HSMM with periodically varying dwell-time distributions to analyse the movement trajectory of an arctic muskox, demonstrating the practical relevance of the developed methodology."}, "https://arxiv.org/abs/2405.13591": {"title": "Running in circles: is practical application feasible for data fission and data thinning in post-clustering differential analysis?", "link": "https://arxiv.org/abs/2405.13591", "description": "arXiv:2405.13591v1 Announce Type: new \nAbstract: The standard pipeline to analyse single-cell RNA sequencing (scRNA-seq) often involves two steps: clustering and Differential Expression Analysis (DEA) to annotate cell populations based on gene expression. However, using clustering results for data-driven hypothesis formulation compromises statistical properties, especially Type I error control. Data fission was introduced to split the information contained in each observation into two independent parts that can be used for clustering and testing. However, data fission was originally designed for non-mixture distributions, and adapting it for mixtures requires knowledge of the unknown clustering structure to estimate component-specific scale parameters. As components are typically unavailable in practice, scale parameter estimators often exhibit bias. We explicitly quantify how this bias affects subsequent post-clustering differential analysis Type I error rate despite employing data fission. In response, we propose a novel approach that involves modeling each observation as a realization of its distribution, with scale parameters estimated non-parametrically. Simulation studies showcase the efficacy of our method when components are clearly separated. However, the level of separability required to reach good performance complicates its application to real scRNA-seq data."}, "https://arxiv.org/abs/2405.13621": {"title": "Interval identification of natural effects in the presence of outcome-related unmeasured confounding", "link": "https://arxiv.org/abs/2405.13621", "description": "arXiv:2405.13621v1 Announce Type: new \nAbstract: With reference to a binary outcome and a binary mediator, we derive identification bounds for natural effects under a reduced set of assumptions. Specifically, no assumptions about confounding are made that involve the outcome; we only assume no unobserved exposure-mediator confounding as well as a condition termed partially constant cross-world dependence (PC-CWD), which poses fewer constraints on the counterfactual probabilities than the usual cross-world independence assumption. The proposed strategy can be used also to achieve interval identification of the total effect, which is no longer point identified under the considered set of assumptions. Our derivations are based on postulating a logistic regression model for the mediator as well as for the outcome. 
However, in both cases the functional form governing the dependence on the explanatory variables is allowed to be arbitrary, thereby resulting in a semi-parametric approach. To account for sampling variability, we provide delta-method approximations of standard errors in order to build uncertainty intervals from identification bounds. The proposed method is applied to a dataset gathered from a Spanish prospective cohort study. The aim is to evaluate whether the effect of smoking on lung cancer risk is mediated by the onset of pulmonary emphysema."}, "https://arxiv.org/abs/2405.13767": {"title": "Enhancing Dose Selection in Phase I Cancer Trials: Extending the Bayesian Logistic Regression Model with Non-DLT Adverse Events Integration", "link": "https://arxiv.org/abs/2405.13767", "description": "arXiv:2405.13767v1 Announce Type: new \nAbstract: This paper presents the Burdened Bayesian Logistic Regression Model (BBLRM), an enhancement to the Bayesian Logistic Regression Model (BLRM) for dose-finding in phase I oncology trials. Traditionally, the BLRM determines the maximum tolerated dose (MTD) based on dose-limiting toxicities (DLTs). However, clinicians often perceive model-based designs like BLRM as complex and less conservative than rule-based designs, such as the widely used 3+3 method. To address these concerns, the BBLRM incorporates non-DLT adverse events (nDLTAEs) into the model. These events, although not severe enough to qualify as DLTs, provide additional information suggesting that higher doses might result in DLTs. In the BBLRM, an additional parameter $\\delta$ is introduced to account for nDLTAEs. This parameter adjusts the toxicity probability estimates, making the model more conservative in dose escalation. The $\\delta$ parameter is derived from the proportion of patients experiencing nDLTAEs within each cohort and is tuned to balance the model's conservatism. This approach aims to reduce the likelihood of assigning toxic doses as MTD while involving clinicians more directly in the decision-making process. The paper includes a simulation study comparing BBLRM with the traditional BLRM across various scenarios. The simulations demonstrate that BBLRM significantly reduces the selection of toxic doses as MTD without compromising, and sometimes even increasing, the accuracy of MTD identification. These results suggest that integrating nDLTAEs into the dose-finding process can enhance the safety and acceptance of model-based designs in phase I oncology trials."}, "https://arxiv.org/abs/2405.13783": {"title": "Nonparametric quantile regression for spatio-temporal processes", "link": "https://arxiv.org/abs/2405.13783", "description": "arXiv:2405.13783v1 Announce Type: new \nAbstract: In this paper, we develop a new and effective approach to nonparametric quantile regression that accommodates ultrahigh-dimensional data arising from spatio-temporal processes. This approach proves advantageous in staving off computational challenges that constitute known hindrances to existing nonparametric quantile regression methods when the number of predictors is much larger than the available sample size. We investigate conditions under which estimation is feasible and of good overall quality and obtain sharp approximations that we employ in devising statistical inference methodology. These include simultaneous confidence intervals and tests of hypotheses, whose asymptotics is borne by a non-trivial functional central limit theorem tailored to martingale differences. 
Additionally, we provide finite-sample results through various simulations which, accompanied by an illustrative application to real-worldesque data (on electricity demand), offer guarantees on the performance of the proposed methodology."}, "https://arxiv.org/abs/2405.13799": {"title": "Extending Kernel Testing To General Designs", "link": "https://arxiv.org/abs/2405.13799", "description": "arXiv:2405.13799v1 Announce Type: new \nAbstract: Kernel-based testing has revolutionized the field of non-parametric tests through the embedding of distributions in an RKHS. This strategy has proven to be powerful and flexible, yet its applicability has been limited to the standard two-sample case, while practical situations often involve more complex experimental designs. To extend kernel testing to any design, we propose a linear model in the RKHS that allows for the decomposition of mean embeddings into additive functional effects. We then introduce a truncated kernel Hotelling-Lawley statistic to test the effects of the model, demonstrating that its asymptotic distribution is chi-square, which remains valid with its Nystrom approximation. We discuss a homoscedasticity assumption that, although absent in the standard two-sample case, is necessary for general designs. Finally, we illustrate our framework using a single-cell RNA sequencing dataset and provide kernel-based generalizations of classical diagnostic and exploration tools to broaden the scope of kernel testing in any experimental design."}, "https://arxiv.org/abs/2405.13801": {"title": "Bayesian Inference Under Differential Privacy: Prior Selection Considerations with Application to Univariate Gaussian Data and Regression", "link": "https://arxiv.org/abs/2405.13801", "description": "arXiv:2405.13801v1 Announce Type: new \nAbstract: We describe Bayesian inference for the mean and variance of bounded data protected by differential privacy and modeled as Gaussian. Using this setting, we demonstrate that analysts can and should take the constraints imposed by the bounds into account when specifying prior distributions. Additionally, we provide theoretical and empirical results regarding what classes of default priors produce valid inference for a differentially private release in settings where substantial prior information is not available. We discuss how these results can be applied to Bayesian inference for regression with differentially private data."}, "https://arxiv.org/abs/2405.13844": {"title": "Causal Inference with Cocycles", "link": "https://arxiv.org/abs/2405.13844", "description": "arXiv:2405.13844v1 Announce Type: new \nAbstract: Many interventions in causal inference can be represented as transformations. We identify a local symmetry property satisfied by a large class of causal models under such interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional and counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show they achieve semiparametric efficiency under typical conditions. Since (infinitely) many distributions can share the same cocycle, these estimators make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. 
We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset."}, "https://arxiv.org/abs/2405.13926": {"title": "Some models are useful, but for how long?: A decision theoretic approach to choosing when to refit large-scale prediction models", "link": "https://arxiv.org/abs/2405.13926", "description": "arXiv:2405.13926v1 Announce Type: new \nAbstract: Large-scale prediction models (typically using tools from artificial intelligence, AI, or machine learning, ML) are increasingly ubiquitous across a variety of industries and scientific domains. Such methods are often paired with detailed data from sources such as electronic health records, wearable sensors, and omics data (high-throughput technology used to understand biology). Despite their utility, implementing AI and ML tools at the scale necessary to work with this data introduces two major challenges. First, it can cost tens of thousands of dollars to train a modern AI/ML model at scale. Second, once the model is trained, its predictions may become less relevant as patient and provider behavior change, and predictions made for one geographical area may be less accurate for another. These two challenges raise a fundamental question: how often should you refit the AI/ML model to optimally trade-off between cost and relevance? Our work provides a framework for making decisions about when to {\\it refit} AI/ML models when the goal is to maintain valid statistical inference (e.g. estimating a treatment effect in a clinical trial). Drawing on portfolio optimization theory, we treat the decision of {\\it recalibrating} versus {\\it refitting} the model as a choice between ''investing'' in one of two ''assets.'' One asset, recalibrating the model based on another model, is quick and relatively inexpensive but bears uncertainty from sampling and the possibility that the other model is not relevant to current circumstances. The other asset, {\\it refitting} the model, is costly but removes the irrelevance concern (though not the risk of sampling error). We explore the balancing act between these two potential investments in this paper."}, "https://arxiv.org/abs/2405.13945": {"title": "Exogenous Consideration and Extended Random Utility", "link": "https://arxiv.org/abs/2405.13945", "description": "arXiv:2405.13945v1 Announce Type: new \nAbstract: In a consideration set model, an individual maximizes utility among the considered alternatives. I relate a consideration set additive random utility model to classic discrete choice and the extended additive random utility model, in which utility can be $-\\infty$ for infeasible alternatives. When observable utility shifters are bounded, all three models are observationally equivalent. Moreover, they have the same counterfactual bounds and welfare formulas for changes in utility shifters like price. For attention interventions, welfare cannot change in the full consideration model but is completely unbounded in the limited consideration model. 
The identified set for consideration set probabilities has a minimal width for any bounded support of shifters, but with unbounded support it is a point: identification \"towards\" infinity does not resemble identification \"at\" infinity."}, "https://arxiv.org/abs/2405.13970": {"title": "Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces", "link": "https://arxiv.org/abs/2405.13970", "description": "arXiv:2405.13970v1 Announce Type: new \nAbstract: Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees in the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different function objects as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics."}, "https://arxiv.org/abs/2405.14048": {"title": "fsemipar: an R package for SoF semiparametric regression", "link": "https://arxiv.org/abs/2405.14048", "description": "arXiv:2405.14048v1 Announce Type: new \nAbstract: Functional data analysis has become a tool of interest in applied areas such as economics, medicine, and chemistry. Among the techniques developed in recent literature, functional semiparametric regression stands out for its balance between flexible modelling and output interpretation. Despite the large variety of research papers dealing with scalar-on-function (SoF) semiparametric models, there is a notable gap in software tools for their implementation. This article introduces the R package \\texttt{fsemipar}, tailored for these models. \\texttt{fsemipar} not only estimates functional single-index models using kernel smoothing techniques but also estimates and selects relevant scalar variables in semi-functional models with multivariate linear components. A standout feature is its ability to identify impact points of a curve on the response, even in models with multiple functional covariates, and to integrate both continuous and pointwise effects of functional predictors within a single model. In addition, it allows the use of location-adaptive estimators based on the $k$-nearest-neighbours approach for all the semiparametric models included. 
Its flexible interface empowers users to customise a wide range of input parameters and includes the standard S3 methods for prediction, statistical analysis, and estimate visualization (\\texttt{predict}, \\texttt{summary}, \\texttt{print}, and \\texttt{plot}), enhancing clear result interpretation. Throughout the article, we illustrate the functionalities and the practicality of \\texttt{fsemipar} using two chemometric datasets."}, "https://arxiv.org/abs/2405.14104": {"title": "On the Identifying Power of Monotonicity for Average Treatment Effects", "link": "https://arxiv.org/abs/2405.14104", "description": "arXiv:2405.14104v1 Announce Type: new \nAbstract: In the context of a binary outcome, treatment, and instrument, Balke and Pearl (1993, 1997) establish that adding monotonicity to the instrument exogeneity assumption does not decrease the identified sets for average potential outcomes and average treatment effect parameters when those assumptions are consistent with the distribution of the observable data. We show that the same results hold in the broader context of multi-valued outcome, treatment, and instrument. An important example of such a setting is a multi-arm randomized controlled trial with noncompliance."}, "https://arxiv.org/abs/2405.14145": {"title": "Generalised Bayes Linear Inference", "link": "https://arxiv.org/abs/2405.14145", "description": "arXiv:2405.14145v1 Announce Type: new \nAbstract: Motivated by big data and the vast parameter spaces in modern machine learning models, optimisation approaches to Bayesian inference have seen a surge in popularity in recent years. In this paper, we address the connection between the popular new methods termed generalised Bayesian inference and Bayes linear methods. We propose a further generalisation to Bayesian inference that unifies these and other recent approaches by considering the Bayesian inference problem as one of finding the closest point in a particular solution space to a data generating process, where these notions differ depending on user-specified geometries and foundational belief systems. Motivated by this framework, we propose a generalisation to Bayes linear approaches that enables fast and principled inferences that obey the coherence requirements implied by domain restrictions on random quantities. We demonstrate the efficacy of generalised Bayes linear inference on a number of examples, including monotonic regression and inference for spatial counts. This paper is accompanied by an R package available at github.com/astfalckl/bayeslinear."}, "https://arxiv.org/abs/2405.14149": {"title": "A Direct Importance Sampling-based Framework for Rare Event Uncertainty Quantification in Non-Gaussian Spaces", "link": "https://arxiv.org/abs/2405.14149", "description": "arXiv:2405.14149v1 Announce Type: new \nAbstract: This work introduces a novel framework for precisely and efficiently estimating rare event probabilities in complex, high-dimensional non-Gaussian spaces, building on our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach. An unnormalized sampling target is first constructed and sampled, relaxing the optimal importance sampling distribution and appropriately designed for non-Gaussian spaces. Post-sampling, its normalizing constant is estimated using a stable inverse importance sampling procedure, employing an importance sampling density based on the already available samples. The sought probability is then computed based on the estimates evaluated in these two stages. 
The proposed estimator is theoretically analyzed, proving its unbiasedness and deriving its analytical coefficient of variation. To sample the constructed target, we resort to our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) and we prove that it converges to the correct stationary target distribution. To avoid the challenging task of tuning the trajectory length in complex spaces, QNp-HMCMC is effectively utilized in this work with a single-step integration. We thus show the equivalence of QNp-HMCMC with single-step implementation to a unique and efficient preconditioned Metropolis-adjusted Langevin algorithm (MALA). An optimization approach is also leveraged to initiate QNp-HMCMC effectively, and the implementation of the developed framework in bounded spaces is eventually discussed. A series of diverse problems involving high dimensionality (several hundred inputs), strong nonlinearity, and non-Gaussianity is presented, showcasing the capabilities and efficiency of the suggested framework and demonstrating its advantages compared to relevant state-of-the-art sampling methods."}, "https://arxiv.org/abs/2405.14166": {"title": "Optimal Bayesian predictive probability for delayed response in single-arm clinical trials with binary efficacy outcome", "link": "https://arxiv.org/abs/2405.14166", "description": "arXiv:2405.14166v1 Announce Type: new \nAbstract: In oncology, phase II or multiple expansion cohort trials are crucial for clinical development plans. This is because they aid in identifying potent agents with sufficient activity to continue development and confirm the proof of concept. Typically, these clinical trials are single-arm trials, with the primary endpoint being short-term treatment efficacy. Despite the development of several well-designed methodologies, there may be a practical impediment in that the endpoints may not be observed within a sufficient time for adaptive go/no-go decisions to be made in a timely manner at each interim monitoring. Specifically, Response Evaluation Criteria in Solid Tumors guideline defines a confirmed response and necessitates it in non-randomized trials, where the response is the primary endpoint. However, obtaining the confirmed outcome from all participants entered at interim monitoring may be time-consuming as non-responders should be followed up until the disease progresses. Thus, this study proposed an approach to accelerate the decision-making process that incorporated the outcome without confirmation by discounting its contribution to the decision-making framework using the generalized Bayes' theorem. Further, the behavior of the proposed approach was evaluated through a simple simulation study. The results demonstrated that the proposed approach made appropriate interim go/no-go decisions."}, "https://arxiv.org/abs/2405.14208": {"title": "An Empirical Comparison of Methods to Produce Business Statistics Using Non-Probability Data", "link": "https://arxiv.org/abs/2405.14208", "description": "arXiv:2405.14208v1 Announce Type: new \nAbstract: There is a growing trend among statistical agencies to explore non-probability data sources for producing more timely and detailed statistics, while reducing costs and respondent burden. Coverage and measurement error are two issues that may be present in such data. 
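For background on the delayed-response entry above, this sketch computes a standard Bayesian predictive probability of trial success for a single-arm binary endpoint with a Beta prior. It does not implement the paper's discounting of unconfirmed responses; the prior, interim counts, and success threshold are invented.

```python
# Minimal sketch: Bayesian predictive probability of success for a single-arm
# trial with a binary endpoint and a Beta(a, b) prior. Standard construction
# only; the paper's discounting of unconfirmed responses via the generalized
# Bayes' theorem is not reproduced. All numbers are illustrative.
from scipy.stats import betabinom

a, b = 0.5, 0.5                  # Beta prior hyperparameters (assumed)
n_interim, x_interim = 15, 6     # patients and responders at interim (synthetic)
n_max = 40                       # planned maximum sample size
threshold = 14                   # success: at least `threshold` responders overall (assumed)

# Posterior after the interim data: Beta(a + x, b + n - x).
a_post, b_post = a + x_interim, b + n_interim - x_interim

# Future responses follow a beta-binomial predictive distribution.
n_remaining = n_max - n_interim
needed = threshold - x_interim
# P(at least `needed` further responders among the remaining patients).
pred_prob = 1.0 - betabinom(n_remaining, a_post, b_post).cdf(needed - 1)
print(round(pred_prob, 3))
```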
The imperfections may be corrected using available information relating to the population of interest, such as a census or a reference probability sample.\n In this paper, we compare a wide range of existing methods for producing population estimates using a non-probability dataset through a simulation study based on a realistic business population. The study was conducted to examine the performance of the methods under different missingness and data quality assumptions. The results confirm the ability of the methods examined to address selection bias. When no measurement error is present in the non-probability dataset, a screening dual-frame approach for the probability sample tends to yield lower sample size and mean squared error results. The presence of measurement error and/or nonignorable missingness increases mean squared errors for estimators that depend heavily on the non-probability data. In this case, the best approach tends to be to fall back to a model-assisted estimator based on the probability sample."}, "https://arxiv.org/abs/2405.14392": {"title": "Markovian Flow Matching: Accelerating MCMC with Continuous Normalizing Flows", "link": "https://arxiv.org/abs/2405.14392", "description": "arXiv:2405.14392v1 Announce Type: new \nAbstract: Continuous normalizing flows (CNFs) learn the probability path between a reference and a target density by modeling the vector field generating said path using neural networks. Recently, Lipman et al. (2022) introduced a simple and inexpensive method for training CNFs in generative modeling, termed flow matching (FM). In this paper, we re-purpose this method for probabilistic inference by incorporating Markovian sampling methods in evaluating the FM objective and using the learned probability path to improve Monte Carlo sampling. We propose a sequential method, which uses samples from a Markov chain to fix the probability path defining the FM objective. We augment this scheme with an adaptive tempering mechanism that allows the discovery of multiple modes in the target. Under mild assumptions, we establish convergence to a local optimum of the FM objective, discuss improvements in the convergence rate, and illustrate our methods on synthetic and real-world examples."}, "https://arxiv.org/abs/2405.14456": {"title": "Cumulant-based approximation for fast and efficient prediction for species distribution", "link": "https://arxiv.org/abs/2405.14456", "description": "arXiv:2405.14456v1 Announce Type: new \nAbstract: Species distribution modeling plays an important role in estimating the habitat suitability of species using environmental variables. For this purpose, Maxent and the Poisson point process are popular and powerful methods extensively employed across various ecological and biological sciences. However, the computational speed becomes prohibitively slow when using huge background datasets, which is often the case with fine-resolution data or global-scale estimations. To address this problem, we propose a computationally efficient species distribution model using a cumulant-based approximation (CBA) applied to the loss function of $\\gamma$-divergence. Additionally, we introduce a sequential estimating algorithm with an $L_1$ penalty to select important environmental variables closely associated with species distribution. The regularized geometric-mean method, derived from the CBA, demonstrates high computational efficiency and estimation accuracy. 
Moreover, by applying CBA to Maxent, we establish that Maxent and Fisher linear discriminant analysis are equivalent under a normality assumption. This equivalence leads to a highly efficient computational method for estimating species distribution. The effectiveness of our proposed methods is illustrated through simulation studies and by analyzing data on 226 species from the National Centre for Ecological Analysis and Synthesis and 709 Japanese vascular plant species. The computational efficiency of the proposed methods is significantly improved compared to Maxent, while maintaining comparable estimation accuracy. An R package {\\tt CBA} is also prepared to provide all programming code used in the simulation studies and real data analysis."}, "https://arxiv.org/abs/2405.14492": {"title": "Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data", "link": "https://arxiv.org/abs/2405.14492", "description": "arXiv:2405.14492v1 Announce Type: new \nAbstract: Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, we consider full-scale approximations (FSAs) that combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce the computational costs for calculating likelihoods, gradients, and predictive distributions with FSAs. We introduce a novel preconditioner and show that it accelerates the conjugate gradient method's convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Further, we present a novel, accurate, and fast way to calculate predictive variances relying on stochastic estimations and iterative methods. In both simulated and real-world data experiments, we find that our proposed methodology achieves the same accuracy as Cholesky-based computations with a substantial reduction in computational time. Finally, we also compare different approaches for determining inducing points in predictive process and FSA models. All methods are implemented in a free C++ software library with high-level Python and R packages."}, "https://arxiv.org/abs/2405.14509": {"title": "Closed-form estimators for an exponential family derived from likelihood equations", "link": "https://arxiv.org/abs/2405.14509", "description": "arXiv:2405.14509v1 Announce Type: new \nAbstract: In this paper, we derive closed-form estimators for the parameters of some probability distributions belonging to the exponential family. A bootstrap bias-reduced version of these proposed closed-form estimators is also derived. A Monte Carlo simulation is performed for the assessment of the estimators. The results are seen to be quite favorable to the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2405.14628": {"title": "Online robust estimation and bootstrap inference for function-on-scalar regression", "link": "https://arxiv.org/abs/2405.14628", "description": "arXiv:2405.14628v1 Announce Type: new \nAbstract: We propose a novel and robust online function-on-scalar regression technique via geometric median to learn associations between functional responses and scalar covariates based on massive or streaming datasets. 
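The bootstrap bias reduction mentioned in the closed-form estimator entry above follows a generic recipe that can be sketched as follows; the exponential-rate example is a stand-in, not one of the paper's estimators.

```python
# Minimal sketch of generic bootstrap bias reduction for a point estimator:
#   theta_bc = 2 * theta_hat - mean(bootstrap replicates).
# The estimator here (MLE of an exponential rate) is a stand-in example,
# not one of the paper's closed-form estimators.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=30)   # true rate = 2.0 (synthetic data)

def rate_mle(sample):
    return 1.0 / np.mean(sample)              # known to be biased upward for small n

theta_hat = rate_mle(x)
B = 2000
boot = np.array([rate_mle(rng.choice(x, size=x.size, replace=True)) for _ in range(B)])

theta_bias_corrected = 2.0 * theta_hat - boot.mean()
print(theta_hat, theta_bias_corrected)
```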
The online estimation procedure, developed using the average stochastic gradient descent algorithm, offers an efficient and cost-effective method for analyzing sequentially augmented datasets, eliminating the need to store large volumes of data in memory. We establish the almost sure consistency, $L_p$ convergence, and asymptotic normality of the online estimator. To enable efficient and fast inference of the parameters of interest, including the derivation of confidence intervals, we also develop an innovative two-step online bootstrap procedure to approximate the limiting error distribution of the robust online estimator. Numerical studies under a variety of scenarios demonstrate the effectiveness and efficiency of the proposed online learning method. A real application analyzing PM$_{2.5}$ air-quality data is also included to exemplify the proposed online approach."}, "https://arxiv.org/abs/2405.14652": {"title": "Statistical inference for high-dimensional convoluted rank regression", "link": "https://arxiv.org/abs/2405.14652", "description": "arXiv:2405.14652v1 Announce Type: new \nAbstract: High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression is recently proposed, and penalized convoluted rank regression estimators are introduced. However, these developed estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods."}, "https://arxiv.org/abs/2405.14686": {"title": "Efficient Algorithms for the Sensitivities of the Pearson Correlation Coefficient and Its Statistical Significance to Online Data", "link": "https://arxiv.org/abs/2405.14686", "description": "arXiv:2405.14686v1 Announce Type: new \nAbstract: Reliably measuring the collinearity of bivariate data is crucial in statistics, particularly for time-series analysis or ongoing studies in which incoming observations can significantly impact current collinearity estimates. Leveraging identities from Welford's online algorithm for sample variance, we develop a rigorous theoretical framework for analyzing the maximal change to the Pearson correlation coefficient and its p-value that can be induced by additional data. Further, we show that the resulting optimization problems yield elegant closed-form solutions that can be accurately computed by linear- and constant-time algorithms. Our work not only creates new theoretical avenues for robust correlation measures, but also has broad practical implications for disciplines that span econometrics, operations research, clinical trials, climatology, differential privacy, and bioinformatics. 
Software implementations of our algorithms in Cython-wrapped C are made available at https://github.com/marc-harary/sensitivity for reproducibility, practical deployment, and future theoretical development."}, "https://arxiv.org/abs/2405.14711": {"title": "Zero-inflation in the Multivariate Poisson Lognormal Family", "link": "https://arxiv.org/abs/2405.14711", "description": "arXiv:2405.14711v1 Announce Type: new \nAbstract: Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts stem solely from the latent dependencies. However, PLN does not account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The zero-inflation can be fixed, site-specific, feature-specific, or depend on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousand variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to $90\\%$ of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing $90.6\\%$ of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination."}, "https://arxiv.org/abs/2405.13224": {"title": "Integrating behavioral experimental findings into dynamical models to inform social change interventions", "link": "https://arxiv.org/abs/2405.13224", "description": "arXiv:2405.13224v1 Announce Type: cross \nAbstract: Addressing global challenges -- from public health to climate change -- often involves stimulating the large-scale adoption of new products or behaviors. Research traditions that focus on individual decision making suggest that achieving this objective requires better identifying the drivers of individual adoption choices. On the other hand, computational approaches rooted in complexity science focus on maximizing the propagation of a given product or behavior throughout social networks of interconnected adopters. The integration of these two perspectives -- although advocated by several research communities -- has remained elusive so far. Here we show how achieving this integration could inform seeding policies to facilitate the large-scale adoption of a given behavior or product. Drawing on complex contagion and discrete choice theories, we propose a method to estimate individual-level thresholds to adoption, and validate its predictive power in two choice experiments. 
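The online identities underlying the Pearson-sensitivity entry above can be illustrated with a Welford-style streaming correlation. This is a generic sketch of the standard recurrences, not the authors' Cython implementation, and it computes no sensitivity bounds.

```python
# Minimal sketch: streaming (Welford-style) update of the Pearson correlation,
# illustrating the kind of online identities the sensitivity entry above builds on.
# Not the authors' implementation; no sensitivity or p-value bounds are computed.
import numpy as np

class OnlinePearson:
    def __init__(self):
        self.n = 0
        self.mean_x = self.mean_y = 0.0
        self.m2_x = self.m2_y = self.c_xy = 0.0   # sums of squared/cross deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # One factor uses the *updated* mean (standard Welford cross-term trick).
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def corr(self):
        return self.c_xy / np.sqrt(self.m2_x * self.m2_y)

rng = np.random.default_rng(0)
xs = rng.normal(size=500)
ys = 0.6 * xs + rng.normal(size=500)

op = OnlinePearson()
for x, y in zip(xs, ys):
    op.update(x, y)
print(op.corr(), np.corrcoef(xs, ys)[0, 1])   # the two values should agree closely
```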
By integrating the estimated thresholds into computational simulations, we show that state-of-the-art seeding methods for social influence maximization might be suboptimal if they neglect individual-level behavioral drivers, which can be corrected through the proposed experimental method."}, "https://arxiv.org/abs/2405.13794": {"title": "Conditioning diffusion models by explicit forward-backward bridging", "link": "https://arxiv.org/abs/2405.13794", "description": "arXiv:2405.13794v1 Announce Type: cross \nAbstract: Given an unconditional diffusion model $\\pi(x, y)$, using it to perform conditional simulation $\\pi(x \\mid y)$ is still largely an open question and is typically achieved by learning conditional drifts to the denoising SDE after the fact. In this work, we express conditional simulation as an inference problem on an augmented space corresponding to a partial SDE bridge. This perspective allows us to implement efficient and principled particle Gibbs and pseudo-marginal samplers marginally targeting the conditional distribution $\\pi(x \\mid y)$. Contrary to existing methodology, our methods do not introduce any additional approximation to the unconditional diffusion model aside from the Monte Carlo error. We showcase the benefits and drawbacks of our approach on a series of synthetic and real data examples."}, "https://arxiv.org/abs/2007.13804": {"title": "The Spectral Approach to Linear Rational Expectations Models", "link": "https://arxiv.org/abs/2007.13804", "description": "arXiv:2007.13804v5 Announce Type: replace \nAbstract: This paper considers linear rational expectations models in the frequency domain. The paper characterizes existence and uniqueness of solutions to particular as well as generic systems. The set of all solutions to a given system is shown to be a finite dimensional affine space in the frequency domain. It is demonstrated that solutions can be discontinuous with respect to the parameters of the models in the context of non-uniqueness, invalidating mainstream frequentist and Bayesian methods. The ill-posedness of the problem motivates regularized solutions with theoretically guaranteed uniqueness, continuity, and even differentiability properties."}, "https://arxiv.org/abs/2012.02708": {"title": "A Multivariate Realized GARCH Model", "link": "https://arxiv.org/abs/2012.02708", "description": "arXiv:2012.02708v2 Announce Type: replace \nAbstract: We propose a novel class of multivariate GARCH models that utilize realized measures of volatilities and correlations. The central component is an unconstrained vector parametrization of the conditional correlation matrix that facilitates factor models for correlations. This offers an elegant solution to the primary challenge that plagues multivariate GARCH models in high-dimensional settings. As an illustration, we consider block correlation structures that naturally simplify to linear factor models for the conditional correlations. We apply the model to returns of nine assets and inspect in-sample and out-of-sample model performance in comparison with several popular benchmarks."}, "https://arxiv.org/abs/2203.15897": {"title": "Calibrated Model Criticism Using Split Predictive Checks", "link": "https://arxiv.org/abs/2203.15897", "description": "arXiv:2203.15897v3 Announce Type: replace \nAbstract: Checking how well a fitted model explains the data is one of the most fundamental parts of a Bayesian data analysis. 
However, existing model checking methods suffer from trade-offs between being well-calibrated, automated, and computationally efficient. To overcome these limitations, we propose split predictive checks (SPCs), which combine the ease-of-use and speed of posterior predictive checks with the good calibration properties of predictive checks that rely on model-specific derivations or inference schemes. We develop an asymptotic theory for two types of SPCs: single SPCs and divided SPCs. Our results demonstrate that they offer complementary strengths. Single SPCs work well with smaller datasets and provide excellent power when there is substantial misspecification, such as when the uncertainty in the test statistic is significantly underestimated. When the sample size is large, divided SPCs can provide better power and are able to detect more subtle forms of misspecification. We validate the finite-sample utility of SPCs through extensive simulation experiments in exponential family and hierarchical models, and provide three real-data examples where SPCs offer novel insights and additional flexibility beyond what is available when using posterior predictive checks."}, "https://arxiv.org/abs/2303.05659": {"title": "A marginal structural model for normal tissue complication probability", "link": "https://arxiv.org/abs/2303.05659", "description": "arXiv:2303.05659v4 Announce Type: replace \nAbstract: The goal of radiation therapy for cancer is to deliver the prescribed radiation dose to the tumor while minimizing dose to the surrounding healthy tissues. To evaluate treatment plans, the dose distribution to healthy organs is commonly summarized as dose-volume histograms (DVHs). Normal tissue complication probability (NTCP) modelling has centered around making patient-level risk predictions with features extracted from the DVHs, but few have considered adapting a causal framework to evaluate the safety of alternative treatment plans. We propose causal estimands for NTCP based on deterministic and stochastic interventions, as well as propose estimators based on marginal structural models that impose bivariable monotonicity between dose, volume, and toxicity risk. The properties of these estimators are studied through simulations, and their use is illustrated in the context of radiotherapy treatment of anal canal cancer patients."}, "https://arxiv.org/abs/2303.09575": {"title": "Sample size determination via learning-type curves", "link": "https://arxiv.org/abs/2303.09575", "description": "arXiv:2303.09575v2 Announce Type: replace \nAbstract: This paper is concerned with sample size determination methodology for prediction models. We propose combining the individual calculations via a learning-type curve. We suggest two distinct ways of doing so, a deterministic skeleton of a learning curve and a Gaussian process centred upon its deterministic counterpart. We employ several learning algorithms for modelling the primary endpoint and distinct measures for trial efficacy. We find that the performance may vary with the sample size, but borrowing information across sample sizes universally improves the performance of such calculations. The Gaussian process-based learning curve appears more robust and statistically efficient, while computational efficiency is comparable. We suggest that anchoring against historical evidence when extrapolating sample sizes should be adopted when such data are available. 
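To ground the split predictive check idea from the entry above, here is a toy single-split check for a normal mean model with known variance: form the posterior on one half of the data and compare predictive replicates of the held-out half with what was observed. This is a schematic under simple conjugate assumptions, not the paper's SPC variants or asymptotic theory.

```python
# Toy sketch of a single split predictive check (SPC) for a normal mean model
# with known variance: fit the posterior on one half of the data, then compare
# a test statistic on posterior-predictive replicates of the held-out half with
# its observed value. Schematic only; not the paper's estimators or theory.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=200)     # observed data (synthetic)
y_fit, y_test = y[:100], y[100:]                 # the "split"

sigma = 1.0                                      # known observation SD (assumed)
tau0 = 10.0                                      # vague normal prior SD for the mean

# Conjugate posterior for the mean given the fitting half.
post_var = 1.0 / (1.0 / tau0**2 + len(y_fit) / sigma**2)
post_mean = post_var * (np.sum(y_fit) / sigma**2)

# Posterior-predictive replicates of the held-out half and a simple check statistic.
S = 4000
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=S)
reps = rng.normal(mu_draws[:, None], sigma, size=(S, len(y_test)))
stat_rep = reps.mean(axis=1)
stat_obs = y_test.mean()

p_value = np.mean(stat_rep >= stat_obs)          # tail probability of the check
print(round(p_value, 3))
```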
The methods are illustrated on binary and survival endpoints."}, "https://arxiv.org/abs/2305.14194": {"title": "A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments", "link": "https://arxiv.org/abs/2305.14194", "description": "arXiv:2305.14194v2 Announce Type: replace \nAbstract: We develop new methodology to improve our understanding of the causal effects of multivariate air pollution exposures on public health. Typically, exposure to air pollution for an individual is measured at their home geographic region, though people travel to different regions with potentially different levels of air pollution. To account for this, we incorporate estimates of the mobility of individuals from cell phone mobility data to get an improved estimate of their exposure to air pollution. We treat this as an interference problem, where individuals in one geographic region can be affected by exposures in other regions due to mobility into those areas. We propose policy-relevant estimands and derive expressions showing the extent of bias one would obtain by ignoring this mobility. We additionally highlight the benefits of the proposed interference framework relative to a measurement error framework for accounting for mobility. We develop novel estimation strategies to estimate causal effects that account for this spatial spillover utilizing flexible Bayesian methodology. Lastly, we use the proposed methodology to study the health effects of ambient air pollution on mortality among Medicare enrollees in the United States."}, "https://arxiv.org/abs/2306.11697": {"title": "Treatment Effects in Extreme Regimes", "link": "https://arxiv.org/abs/2306.11697", "description": "arXiv:2306.11697v2 Announce Type: replace \nAbstract: Understanding treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the unavailability of counterfactual outcomes and the rarity and difficulty of collecting extreme data in practice. To address this issue, we propose a new framework based on extreme value theory for estimating treatment effects in extreme regimes. We quantify these effects using variations in tail decay rates of potential outcomes in the presence and absence of treatments. We establish algorithms for calculating these quantities and develop related theoretical results. We demonstrate the efficacy of our approach on various standard synthetic and semi-synthetic datasets."}, "https://arxiv.org/abs/2309.15600": {"title": "pencal: an R Package for the Dynamic Prediction of Survival with Many Longitudinal Predictors", "link": "https://arxiv.org/abs/2309.15600", "description": "arXiv:2309.15600v2 Announce Type: replace \nAbstract: In survival analysis, longitudinal information on the health status of a patient can be used to dynamically update the predicted probability that a patient will experience an event of interest. Traditional approaches to dynamic prediction such as joint models become computationally unfeasible with more than a handful of longitudinal covariates, warranting the development of methods that can handle a larger number of longitudinal covariates. We introduce the R package pencal, which implements a Penalized Regression Calibration approach that makes it possible to handle many longitudinal covariates as predictors of survival. 
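The "Treatment Effects in Extreme Regimes" abstract above quantifies effects through variations in tail decay rates of potential outcomes. The following hedged sketch, which is not the authors' estimator, fits generalized Pareto tails to exceedances in each arm with scipy and compares the estimated shape parameters; the threshold quantile and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)

# Synthetic outcomes: the treated arm has a heavier upper tail (illustrative only).
y_control = rng.pareto(a=4.0, size=5000)
y_treated = rng.pareto(a=2.5, size=5000)

def tail_shape(y, q=0.95):
    """Fit a generalized Pareto distribution to exceedances over the q-th
    sample quantile and return the estimated shape (tail index) parameter."""
    u = np.quantile(y, q)
    excess = y[y > u] - u
    shape, _, _ = genpareto.fit(excess, floc=0.0)
    return shape

xi_treated = tail_shape(y_treated)
xi_control = tail_shape(y_control)
# A crude tail-based contrast: the change in the estimated tail decay rate.
print(f"shape treated={xi_treated:.3f}, control={xi_control:.3f}, "
      f"difference={xi_treated - xi_control:.3f}")
```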
pencal uses mixed-effects models to summarize the trajectories of the longitudinal covariates up to a prespecified landmark time, and a penalized Cox model to predict survival based on both baseline covariates and summary measures of the longitudinal covariates. This article illustrates the structure of the R package, provides a step by step example showing how to estimate PRC, compute dynamic predictions of survival and validate performance, and shows how parallelization can be used to significantly reduce computing time."}, "https://arxiv.org/abs/2310.15512": {"title": "Inference for Rank-Rank Regressions", "link": "https://arxiv.org/abs/2310.15512", "description": "arXiv:2310.15512v2 Announce Type: replace \nAbstract: Slope coefficients in rank-rank regressions are popular measures of intergenerational mobility. In this paper, we first point out two important properties of the OLS estimator in such regressions: commonly used variance estimators do not consistently estimate the asymptotic variance of the OLS estimator and, when the underlying distribution is not continuous, the OLS estimator may be highly sensitive to the way in which ties are handled. Motivated by these findings we derive the asymptotic theory for the OLS estimator in a general rank-rank regression specification without making assumptions about the continuity of the underlying distribution. We then extend the asymptotic theory to other regressions involving ranks that have been used in empirical work. Finally, we apply our new inference methods to three empirical studies. We find that the confidence intervals based on estimators of the correct variance may sometimes be substantially shorter and sometimes substantially longer than those based on commonly used variance estimators. The differences in confidence intervals concern economically meaningful values of mobility and thus may lead to different conclusions when comparing mobility across different regions or countries."}, "https://arxiv.org/abs/2311.11153": {"title": "Biarchetype analysis: simultaneous learning of observations and features based on extremes", "link": "https://arxiv.org/abs/2311.11153", "description": "arXiv:2311.11153v2 Announce Type: replace \nAbstract: We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. 
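To make the rank-rank regression setup in the abstract above concrete, here is a minimal sketch that computes the OLS slope of child ranks on parent ranks using mid-ranks ("average") for ties; the data are synthetic, and the inference issues the paper addresses (correct variance estimation, sensitivity to tie handling) are not implemented.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(2)

# Toy parent/child incomes with ties induced by rounding (illustrative only).
parent = np.round(rng.lognormal(mean=10.0, sigma=0.5, size=1000), -3)
child = np.round(0.4 * parent + rng.lognormal(mean=9.5, sigma=0.5, size=1000), -3)

# Rank-rank regression: OLS of child rank on parent rank, using mid-ranks for
# ties -- the abstract stresses that the tie-handling rule can matter.
rp = rankdata(parent, method="average") / len(parent)
rc = rankdata(child, method="average") / len(child)

slope = np.cov(rp, rc, ddof=1)[0, 1] / np.var(rp, ddof=1)
intercept = rc.mean() - slope * rp.mean()
print(f"rank-rank slope (mobility measure): {slope:.3f}, intercept: {intercept:.3f}")
```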
The application of biarchetype analysis across various machine learning challenges underscores its value, and both the source code and examples are readily accessible in R and Python at https://github.com/aleixalcacer/JA-BIAA."}, "https://arxiv.org/abs/2311.15322": {"title": "False Discovery Rate Control For Structured Multiple Testing: Asymmetric Rules And Conformal Q-values", "link": "https://arxiv.org/abs/2311.15322", "description": "arXiv:2311.15322v3 Announce Type: replace \nAbstract: The effective utilization of structural information in data while ensuring statistical validity poses a significant challenge in false discovery rate (FDR) analyses. Conformal inference provides rigorous theory for grounding complex machine learning methods without relying on strong assumptions or highly idealized models. However, existing conformal methods have limitations in handling structured multiple testing. This is because their validity requires the deployment of symmetric rules, which assume the exchangeability of data points and permutation-invariance of fitting algorithms. To overcome these limitations, we introduce the pseudo local index of significance (PLIS) procedure, which is capable of accommodating asymmetric rules and requires only pairwise exchangeability between the null conformity scores. We demonstrate that PLIS offers finite-sample guarantees in FDR control and the ability to assign higher weights to relevant data points. Numerical results confirm the effectiveness and robustness of PLIS and show improvements in power compared to existing model-free methods in various scenarios."}, "https://arxiv.org/abs/2306.13214": {"title": "Prior-itizing Privacy: A Bayesian Approach to Setting the Privacy Budget in Differential Privacy", "link": "https://arxiv.org/abs/2306.13214", "description": "arXiv:2306.13214v2 Announce Type: replace-cross \nAbstract: When releasing outputs from confidential data, agencies need to balance the analytical usefulness of the released data with the obligation to protect data subjects' confidentiality. For releases satisfying differential privacy, this balance is reflected by the privacy budget, $\\varepsilon$. We provide a framework for setting $\\varepsilon$ based on its relationship with Bayesian posterior probabilities of disclosure. The agency responsible for the data release decides how much posterior risk it is willing to accept at various levels of prior risk, which implies a unique $\\varepsilon$. Agencies can evaluate different risk profiles to determine one that leads to an acceptable trade-off in risk and utility."}, "https://arxiv.org/abs/2309.09367": {"title": "ForLion: A New Algorithm for D-optimal Designs under General Parametric Statistical Models with Mixed Factors", "link": "https://arxiv.org/abs/2309.09367", "description": "arXiv:2309.09367v3 Announce Type: replace-cross \nAbstract: In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. 
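As a worked illustration loosely related to the privacy-budget framework above, the snippet below converts an acceptable prior-to-posterior disclosure risk profile into an implied epsilon using the standard bound that an epsilon-DP release multiplies prior odds by at most exp(epsilon); whether this matches the paper's exact mapping is an assumption, and the risk numbers are hypothetical.

```python
import math

def implied_epsilon(prior_risk: float, max_posterior_risk: float) -> float:
    """Smallest epsilon consistent with the requirement that an adversary's
    posterior disclosure probability stays below `max_posterior_risk` when the
    prior probability is `prior_risk`, using the odds-ratio bound
    posterior_odds <= exp(epsilon) * prior_odds for an epsilon-DP release."""
    prior_odds = prior_risk / (1.0 - prior_risk)
    posterior_odds = max_posterior_risk / (1.0 - max_posterior_risk)
    return math.log(posterior_odds / prior_odds)

# Hypothetical risk profile: the agency tolerates at most a 25% posterior
# disclosure probability for records with a 5% prior disclosure probability.
print(f"implied privacy budget: {implied_epsilon(0.05, 0.25):.3f}")
```

Scanning several (prior, posterior) pairs in this way gives a whole risk profile, from which an agency could pick the epsilon it is prepared to accept.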
We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs."}, "https://arxiv.org/abs/2310.09818": {"title": "MCMC for Bayesian nonparametric mixture modeling under differential privacy", "link": "https://arxiv.org/abs/2310.09818", "description": "arXiv:2310.09818v2 Announce Type: replace-cross \nAbstract: Estimating the probability density of a population while preserving the privacy of individuals in that population is an important and challenging problem that has received considerable attention in recent years. While the previous literature focused on frequentist approaches, in this paper, we propose a Bayesian nonparametric mixture model under differential privacy (DP) and present two Markov chain Monte Carlo (MCMC) algorithms for posterior inference. One is a marginal approach, resembling Neal's algorithm 5 with a pseudo-marginal Metropolis-Hastings move, and the other is a conditional approach. Although our focus is primarily on local DP, we show that our MCMC algorithms can be easily extended to deal with global differential privacy mechanisms. Moreover, for some carefully chosen mechanisms and mixture kernels, we show how auxiliary parameters can be analytically marginalized, allowing standard MCMC algorithms (i.e., non-privatized, such as Neal's Algorithm 2) to be efficiently employed. Our approach is general and applicable to any mixture model and privacy mechanism. In several simulations and a real case study, we discuss the performance of our algorithms and evaluate different privacy mechanisms proposed in the frequentist literature."}, "https://arxiv.org/abs/2311.00541": {"title": "An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek", "link": "https://arxiv.org/abs/2311.00541", "description": "arXiv:2311.00541v3 Announce Type: replace-cross \nAbstract: Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as ``kosmos'' (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. 
It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed."}, "https://arxiv.org/abs/2311.05025": {"title": "Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients", "link": "https://arxiv.org/abs/2311.05025", "description": "arXiv:2311.05025v2 Announce Type: replace-cross \nAbstract: We present an unbiased method for Bayesian posterior means based on kinetic Langevin dynamics that combines advanced splitting methods with enhanced gradient approximations. Our approach avoids Metropolis correction by coupling Markov chains at different discretization levels in a multilevel Monte Carlo approach. Theoretical analysis demonstrates that our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem. It can achieve accuracy $\\epsilon>0$ for estimating expectations of Lipschitz functions in $d$ dimensions with $\\mathcal{O}(d^{1/4}\\epsilon^{-2})$ expected gradient evaluations, without assuming warm start. We exhibit similar bounds using both approximate and stochastic gradients, and our method's computational cost is shown to scale independently of the size of the dataset. The proposed method is tested using a multinomial regression problem on the MNIST dataset and a Poisson regression model for soccer scores. Experiments indicate that the number of gradient evaluations per effective sample is independent of dimension, even when using inexact gradients. For product distributions, we give dimension-independent variance bounds. Our results demonstrate that the unbiased algorithm we present can be much more efficient than the ``gold-standard\" randomized Hamiltonian Monte Carlo."}, "https://arxiv.org/abs/2401.06687": {"title": "Proximal Causal Inference With Text Data", "link": "https://arxiv.org/abs/2401.06687", "description": "arXiv:2401.06687v2 Announce Type: replace-cross \nAbstract: Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses multiple instances of pre-treatment text data, infers two proxies from two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove that our text-based proxy method satisfies identification conditions required by the proximal g-formula while other seemingly reasonable proposals do not. We evaluate our method in synthetic and semi-synthetic settings and find that it produces estimates with low bias. To address untestable assumptions associated with the proximal g-formula, we further propose an odds ratio falsification heuristic. 
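For readers unfamiliar with the kinetic Langevin dynamics underlying the unbiased estimator described above, here is a minimal sketch of one standard BAOAB splitting step targeting a Gaussian; the step size, friction, and target are illustrative assumptions, and the paper's multilevel coupling and inexact-gradient machinery are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_U(x):
    """Gradient of the negative log-density; a standard Gaussian target is
    assumed here purely for illustration."""
    return x

def baoab_step(x, v, h, gamma=1.0):
    """One BAOAB step of kinetic (underdamped) Langevin dynamics with unit
    mass and unit temperature."""
    v = v - 0.5 * h * grad_U(x)           # B: half kick
    x = x + 0.5 * h * v                   # A: half drift
    c = np.exp(-gamma * h)                # O: exact Ornstein-Uhlenbeck update
    v = c * v + np.sqrt(1.0 - c**2) * rng.standard_normal(x.shape)
    x = x + 0.5 * h * v                   # A: half drift
    v = v - 0.5 * h * grad_U(x)           # B: half kick
    return x, v

d, n_steps, h = 10, 5000, 0.2
x, v = np.zeros(d), np.zeros(d)
samples = np.empty((n_steps, d))
for t in range(n_steps):
    x, v = baoab_step(x, v, h)
    samples[t] = x

print("sample variance per coordinate (target is 1):",
      samples[1000:].var(axis=0).round(2))
```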
This new combination of proximal causal inference and zero-shot classifiers expands the set of text-specific causal methods available to practitioners."}, "https://arxiv.org/abs/2405.14913": {"title": "High Rank Path Development: an approach of learning the filtration of stochastic processes", "link": "https://arxiv.org/abs/2405.14913", "description": "arXiv:2405.14913v1 Announce Type: new \nAbstract: Since the weak convergence for stochastic processes does not account for the growth of information over time which is represented by the underlying filtration, a slightly erroneous stochastic model in weak topology may cause huge loss in multi-periods decision making problems. To address such discontinuities Aldous introduced the extended weak convergence, which can fully characterise all essential properties, including the filtration, of stochastic processes; however was considered to be hard to find efficient numerical implementations. In this paper, we introduce a novel metric called High Rank PCF Distance (HRPCFD) for extended weak convergence based on the high rank path development method from rough path theory, which also defines the characteristic function for measure-valued processes. We then show that such HRPCFD admits many favourable analytic properties which allows us to design an efficient algorithm for training HRPCFD from data and construct the HRPCF-GAN by using HRPCFD as the discriminator for conditional time series generation. Our numerical experiments on both hypothesis testing and generative modelling validate the out-performance of our approach compared with several state-of-the-art methods, highlighting its potential in broad applications of synthetic time series generation and in addressing classic financial and economic challenges, such as optimal stopping or utility maximisation problems."}, "https://arxiv.org/abs/2405.14990": {"title": "Dispersion Modeling in Zero-inflated Tweedie Models with Applications to Insurance Claim Data Analysis", "link": "https://arxiv.org/abs/2405.14990", "description": "arXiv:2405.14990v1 Announce Type: new \nAbstract: The Tweedie generalized linear models are commonly applied in the insurance industry to analyze semicontinuous claim data. For better prediction of the aggregated claim size, the mean and dispersion of the Tweedie model are often estimated together using the double generalized linear models. In some actuarial applications, it is common to observe an excessive percentage of zeros, which often results in a decline in the performance of the Tweedie model. The zero-inflated Tweedie model has been recently considered in the literature, which draws inspiration from the zero-inflated Poisson model. In this article, we consider the problem of dispersion modeling of the Tweedie state in the zero-inflated Tweedie model, in addition to the mean modeling. We also model the probability of the zero state based on the generalized expectation-maximization algorithm. To potentially incorporate nonlinear and interaction effects of the covariates, we estimate the mean, dispersion, and zero-state probability using decision-tree-based gradient boosting. 
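As a baseline for the zero-inflated Tweedie modelling described above, the following sketch fits a plain Tweedie GLM for the mean with statsmodels on synthetic semicontinuous data; the variance power of 1.5 and the data-generating process are assumptions, and the dispersion, zero-state, and boosting components of the proposed method are not implemented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Synthetic semicontinuous "claim" data: many exact zeros plus positive,
# right-skewed amounts (illustrative only).
n = 2000
x = rng.normal(size=(n, 2))
mu = np.exp(0.5 + 0.8 * x[:, 0] - 0.4 * x[:, 1])
is_claim = rng.random(n) < 0.3
y = np.where(is_claim, rng.gamma(shape=2.0, scale=mu / 2.0), 0.0)

# Plain Tweedie GLM for the mean: a variance power between 1 and 2 corresponds
# to a compound Poisson-gamma distribution with a point mass at zero.
X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Tweedie(var_power=1.5))
result = model.fit()
print(result.params)
```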
We conduct extensive numerical studies to demonstrate the improved performance of our method over existing ones."}, "https://arxiv.org/abs/2405.15038": {"title": "Preferential Latent Space Models for Networks with Textual Edges", "link": "https://arxiv.org/abs/2405.15038", "description": "arXiv:2405.15038v1 Announce Type: new \nAbstract: Many real-world networks contain rich textual information in the edges, such as email networks where an edge between two nodes is an email exchange. Other examples include co-author networks and social media networks. The useful textual information carried in the edges is often discarded in most network analyses, resulting in an incomplete view of the relationships between nodes. In this work, we propose to represent the text document between each pair of nodes as a vector counting the appearances of keywords extracted from the corpus, and introduce a new and flexible preferential latent space network model that can offer direct insights on how contents of the textual exchanges modulate the relationships between nodes. We establish identifiability conditions for the proposed model and tackle model estimation with a computationally efficient projected gradient descent algorithm. We further derive the non-asymptotic error bound of the estimator from each step of the algorithm. The efficacy of our proposed method is demonstrated through simulations and an analysis of the Enron email network."}, "https://arxiv.org/abs/2405.15042": {"title": "Modularity, Higher-Order Recombination, and New Venture Success", "link": "https://arxiv.org/abs/2405.15042", "description": "arXiv:2405.15042v1 Announce Type: new \nAbstract: Modularity is critical for the emergence and evolution of complex social, natural, and technological systems robust to exploratory failure. We consider this in the context of emerging business organizations, which can be understood as complex systems. We build a theory of organizational emergence as higher-order, modular recombination wherein successful start-ups assemble novel combinations of successful modular components, rather than engage in the lower-order combination of disparate, singular components. Lower-order combinations are critical for long-term socio-economic transformation, but manifest diffuse benefits requiring support as public goods. Higher-order combinations facilitate rapid experimentation and attract private funding. We evaluate this with U.S. venture-funded start-ups over 45 years using company descriptions. We build a dynamic semantic space with word embedding models constructed from evolving business discourse, which allow us to measure the modularity of and distance between new venture components. Using event history models, we demonstrate how ventures more likely achieve successful IPOs and high-priced acquisitions when they combine diverse modules of clustered components. We demonstrate how higher-order combination enables venture success by accelerating firm development and diversifying investment, and we reflect on its implications for social innovation."}, "https://arxiv.org/abs/2405.15053": {"title": "A Latent Variable Approach to Learning High-dimensional Multivariate longitudinal Data", "link": "https://arxiv.org/abs/2405.15053", "description": "arXiv:2405.15053v1 Announce Type: new \nAbstract: High-dimensional multivariate longitudinal data, which arise when many outcome variables are measured repeatedly over time, are becoming increasingly common in social, behavioral and health sciences. 
We propose a latent variable model for drawing statistical inferences on covariate effects and predicting future outcomes based on high-dimensional multivariate longitudinal data. This model introduces unobserved factors to account for the between-variable and across-time dependence and assist the prediction. Statistical inference and prediction tools are developed under a general setting that allows outcome variables to be of mixed types and possibly unobserved for certain time points, for example, due to right censoring. A central limit theorem is established for drawing statistical inferences on regression coefficients. Additionally, an information criterion is introduced to choose the number of factors. The proposed model is applied to customer grocery shopping records to predict and understand shopping behavior."}, "https://arxiv.org/abs/2405.15192": {"title": "Addressing Duplicated Data in Point Process Models", "link": "https://arxiv.org/abs/2405.15192", "description": "arXiv:2405.15192v1 Announce Type: new \nAbstract: Spatial point process models are widely applied to point pattern data from various fields in the social and environmental sciences. However, a serious hurdle in fitting point process models is the presence of duplicated points, wherein multiple observations share identical spatial coordinates. This often occurs because of decisions made in the geo-coding process, such as assigning representative locations (e.g., aggregate-level centroids) to observations when data producers lack exact location information. Because spatial point process models like the Log-Gaussian Cox Process (LGCP) assume unique locations, researchers often employ {\\it ad hoc} solutions (e.g., jittering) to address duplicated data before analysis. As an alternative, this study proposes a Modified Minimum Contrast (MMC) method that adapts the inference procedure to account for the effect of duplicates without needing to alter the data. The proposed MMC method is applied to LGCP models, with simulation results demonstrating the gains of our method relative to existing approaches in terms of parameter estimation. Interestingly, simulation results also show the effect of the geo-coding process on parameter estimates, which can be utilized in the implementation of the MMC method. The MMC approach is then used to infer the spatial clustering characteristics of conflict events in Afghanistan (2008-2009)."}, "https://arxiv.org/abs/2405.15204": {"title": "A New Fit Assessment Framework for Common Factor Models Using Generalized Residuals", "link": "https://arxiv.org/abs/2405.15204", "description": "arXiv:2405.15204v1 Announce Type: new \nAbstract: Standard common factor models, such as the linear normal factor model, rely on strict parametric assumptions, which require rigorous model-data fit assessment to prevent fallacious inferences. However, overall goodness-of-fit diagnostics conventionally used in factor analysis do not offer diagnostic information on where the misfit originates. In the current work, we propose a new fit assessment framework for common factor models by extending the theory of generalized residuals (Haberman & Sinharay, 2013). This framework allows for the flexible adaptation of test statistics to identify various sources of misfit. In addition, the resulting goodness-of-fit tests provide more informative diagnostics, as the evaluation is performed conditionally on latent variables. 
Several examples of test statistics suitable for assessing various model assumptions are presented within this framework, and their performance is evaluated by simulation studies and a real data example."}, "https://arxiv.org/abs/2405.15242": {"title": "Causal machine learning methods and use of sample splitting in settings with high-dimensional confounding", "link": "https://arxiv.org/abs/2405.15242", "description": "arXiv:2405.15242v1 Announce Type: new \nAbstract: Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjustment for potential confounding bias in modern studies is challenging due to the presence of high-dimensional confounding, induced when there are many confounders relative to sample size, or complex relationships between continuous confounders and exposure and outcome. As a promising avenue to overcome this challenge, doubly robust methods (Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE)) enable the use of data-adaptive approaches to fit the two models they involve. Biased standard errors may result when the data-adaptive approaches used are very complex. The coupling of doubly robust methods with cross-fitting has been proposed to tackle this. Despite advances, limited evaluation, comparison, and guidance are available on the implementation of AIPW and TMLE with data-adaptive approaches and cross-fitting in realistic settings where high-dimensional confounding is present. We conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches in estimating the average causal effect (ACE) and evaluated the benefits of using cross-fitting with a varying number of folds, as well as the impact of using a reduced versus full (larger, more diverse) library in the Super Learner (SL) ensemble learning approach used for the data-adaptive models. A range of scenarios in terms of data generation, and sample size were considered. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, with the number of folds a less important consideration. Using a full SL library was important to reduce bias and variance in the complex scenarios typical of modern health research studies."}, "https://arxiv.org/abs/2405.15531": {"title": "MMD Two-sample Testing in the Presence of Arbitrarily Missing Data", "link": "https://arxiv.org/abs/2405.15531", "description": "arXiv:2405.15531v1 Announce Type: new \nAbstract: In many real-world applications, it is common that a proportion of the data may be missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) which accounts for missing data in both samples, without making assumptions about the missingness mechanism. Our approach is based on deriving the mathematically precise bounds of the MMD test statistic after accounting for all possible missing values. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data where data may be arbitrarily missing. Simulation results show that our method has good statistical power, typically for cases where 5% to 10% of the data are missing. 
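To make the AIPW estimator discussed in the causal machine learning abstract above concrete, here is a compact sketch of AIPW with two-fold cross-fitting on synthetic data, using single scikit-learn learners in place of the Super Learner ensemble; the data-generating process, learners, and propensity clipping threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)

# Synthetic observational data with confounding (illustrative only).
n = 4000
X = rng.normal(size=(n, 5))
propensity = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, propensity)
Y = 1.0 * A + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)

# AIPW estimate of the average causal effect with 2-fold cross-fitting:
# nuisance models are fit on one fold and evaluated on the other.
psi = np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    ps = GradientBoostingClassifier().fit(X[train], A[train])
    out1 = GradientBoostingRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
    out0 = GradientBoostingRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])

    e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
    m1, m0 = out1.predict(X[test]), out0.predict(X[test])
    psi[test] = (m1 - m0
                 + A[test] * (Y[test] - m1) / e
                 - (1 - A[test]) * (Y[test] - m0) / (1 - e))

ace = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f"AIPW ACE estimate: {ace:.3f} (SE {se:.3f}); true effect is 1.0")
```

Swapping the two learners for a richer ensemble and increasing the number of folds gives the kind of configuration the simulation study above compares.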
We highlight the value of our approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error."}, "https://arxiv.org/abs/2405.15576": {"title": "Online Changepoint Detection via Dynamic Mode Decomposition", "link": "https://arxiv.org/abs/2405.15576", "description": "arXiv:2405.15576v1 Announce Type: new \nAbstract: Detecting changes in data streams is a vital task in many applications. There is increasing interest in changepoint detection in the online setting, to enable real-time monitoring and support prompt responses and informed decision-making. Many approaches assume stationary sequences before encountering an abrupt change in the mean or variance. Notably less attention has focused on the challenging case where the monitored sequences exhibit trend, periodicity and seasonality. Dynamic mode decomposition is a data-driven dimensionality reduction technique that extracts the essential components of a dynamical system. We propose a changepoint detection method that leverages this technique to sequentially model the dynamics of a moving window of data and produce a low-rank reconstruction. A change is identified when there is a significant difference between this reconstruction and the observed data, and we provide theoretical justification for this approach. Extensive simulations demonstrate that our approach has superior detection performance compared to other methods for detecting small changes in mean, variance, periodicity, and second-order structure, among others, in data that exhibits seasonality. Results on real-world datasets also show excellent performance compared to contemporary approaches."}, "https://arxiv.org/abs/2405.15579": {"title": "Generating density nowcasts for U", "link": "https://arxiv.org/abs/2405.15579", "description": "arXiv:2405.15579v1 Announce Type: new \nAbstract: Recent results in the literature indicate that artificial neural networks (ANNs) can outperform the dynamic factor model (DFM) in terms of the accuracy of GDP nowcasts. Compared to the DFM, the performance advantage of these highly flexible, nonlinear estimators is particularly evident in periods of recessions and structural breaks. From the perspective of policy-makers, however, nowcasts are the most useful when they are conveyed with uncertainty attached to them. While the DFM and other classical time series approaches analytically derive the predictive (conditional) distribution for GDP growth, ANNs can only produce point nowcasts based on their default training procedure (backpropagation). To fill this gap, first in the literature, we adapt two different deep learning algorithms that enable ANNs to generate density nowcasts for U.S. GDP growth: Bayes by Backprop and Monte Carlo dropout. The accuracy of point nowcasts, defined as the mean of the empirical predictive distribution, is evaluated relative to a naive constant growth model for GDP and a benchmark DFM specification. Using a 1D CNN as the underlying ANN architecture, both algorithms outperform those benchmarks during the evaluation period (2012:Q1 -- 2022:Q4). Furthermore, both algorithms are able to dynamically adjust the location (mean), scale (variance), and shape (skew) of the empirical predictive distribution. 
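The MMD abstract above derives bounds for arbitrarily missing data; as background only, the sketch below implements the standard complete-data MMD two-sample permutation test with an RBF kernel and the median-heuristic bandwidth, which are common defaults rather than the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf_kernel(a, b, bandwidth):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2(x, y, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD with an RBF kernel."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2 * kxy

# Two fully observed samples that differ in mean (illustrative only).
x = rng.normal(0.0, 1.0, size=(150, 2))
y = rng.normal(0.4, 1.0, size=(150, 2))

pooled = np.vstack([x, y])
# Median-heuristic bandwidth over pooled pairwise distances.
bw = np.median(np.sqrt(((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(-1)))
obs = mmd2(x, y, bw)

# Permutation p-value under exchangeability of the pooled sample.
n_perm, n_x = 500, len(x)
perm_stats = []
for _ in range(n_perm):
    idx = rng.permutation(len(pooled))
    perm_stats.append(mmd2(pooled[idx[:n_x]], pooled[idx[n_x:]], bw))
p_value = (1 + sum(s >= obs for s in perm_stats)) / (1 + n_perm)
print(f"MMD^2 = {obs:.4f}, permutation p-value = {p_value:.3f}")
```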
The results indicate that both Bayes by Backprop and Monte Carlo dropout can effectively augment the scope and functionality of ANNs, rendering them a fully compatible and competitive alternative to classical time series approaches."}, "https://arxiv.org/abs/2405.15641": {"title": "Predictive Uncertainty Quantification with Missing Covariates", "link": "https://arxiv.org/abs/2405.15641", "description": "arXiv:2405.15641v1 Announce Type: new \nAbstract: Predictive uncertainty quantification is crucial in decision-making problems. We investigate how to adequately quantify predictive uncertainty with missing covariates. A bottleneck is that missing values induce heteroskedasticity on the response's predictive distribution given the observed covariates. Thus, we focus on building predictive sets for the response that are valid conditionally to the missing values pattern. We show that this goal is impossible to achieve informatively in a distribution-free fashion, and we propose useful restrictions on the distribution class. Motivated by these hardness results, we characterize how missing values and predictive uncertainty intertwine. Particularly, we rigorously formalize the idea that the more missing values, the higher the predictive uncertainty. Then, we introduce a generalized framework, coined CP-MDA-Nested*, outputting predictive sets in both regression and classification. Under independence between the missing value pattern and both the features and the response (an assumption justified by our hardness results), these predictive sets are valid conditionally to any pattern of missing values. Moreover, it provides great flexibility in the trade-off between statistical variability and efficiency. Finally, we experimentally assess the performances of CP-MDA-Nested* beyond its scope of theoretical validity, demonstrating promising outcomes in more challenging configurations than independence."}, "https://arxiv.org/abs/2405.15670": {"title": "Post-selection inference for quantifying uncertainty in changes in variance", "link": "https://arxiv.org/abs/2405.15670", "description": "arXiv:2405.15670v1 Announce Type: new \nAbstract: Quantifying uncertainty in detected changepoints is an important problem. However, it is challenging as the naive approach would use the data twice, first to detect the changes, and then to test them. This will bias the test, and can lead to anti-conservative p-values. One approach to avoid this is to use ideas from post-selection inference, which conditions on the information in the data used to choose which changes to test. As a result, this produces valid p-values; that is, p-values that have a uniform distribution if there is no change. Currently such methods have been developed for detecting changes in mean only. This paper presents two approaches for constructing post-selection p-values for detecting changes in variance. These vary depending on the method used to detect the changes, but are general in terms of being applicable for a range of change-detection methods and a range of hypotheses that we may wish to test."}, "https://arxiv.org/abs/2405.15716": {"title": "Empirical Crypto Asset Pricing", "link": "https://arxiv.org/abs/2405.15716", "description": "arXiv:2405.15716v1 Announce Type: new \nAbstract: We motivate the study of the crypto asset class with eleven empirical facts, and study the drivers of crypto asset returns through the lens of univariate factors. We argue crypto assets are a new, attractive, and independent asset class. 
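The CP-MDA-Nested* framework above builds on conformal prediction; as a reference point only, here is a minimal split conformal regression sketch on fully observed covariates, with the learner, miscoverage level, and data all chosen for illustration rather than taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

# Synthetic regression data (fully observed covariates, illustrative only).
n = 3000
X = rng.uniform(-2, 2, size=(n, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

# Split conformal prediction: train on one part, calibrate residuals on another.
train, calib, test = np.split(rng.permutation(n), [1500, 2500])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], y[train])

alpha = 0.1
resid = np.abs(y[calib] - model.predict(X[calib]))
k = int(np.ceil((1 - alpha) * (len(calib) + 1)))
q = np.sort(resid)[k - 1]  # conformal quantile of the calibration residuals

pred = model.predict(X[test])
lower, upper = pred - q, pred + q
coverage = np.mean((y[test] >= lower) & (y[test] <= upper))
print(f"empirical coverage on test: {coverage:.3f} (target {1 - alpha:.2f})")
```

Plain split conformal of this kind is only marginally valid; the point of the framework described above is to retain validity conditionally on the pattern of missing covariates.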
In a novel and rigorously built panel of crypto assets, we examine pricing ability of sixty three asset characteristics to find rich signal content across the characteristics and at several future horizons. Only univariate financial factors (i.e., functions of previous returns) were associated with statistically significant long-short strategies, suggestive of speculatively driven returns as opposed to more fundamental pricing factors."}, "https://arxiv.org/abs/2405.15721": {"title": "Dynamic Latent-Factor Model with High-Dimensional Asset Characteristics", "link": "https://arxiv.org/abs/2405.15721", "description": "arXiv:2405.15721v1 Announce Type: new \nAbstract: We develop novel estimation procedures with supporting econometric theory for a dynamic latent-factor model with high-dimensional asset characteristics, that is, the number of characteristics is on the order of the sample size. Utilizing the Double Selection Lasso estimator, our procedure employs regularization to eliminate characteristics with low signal-to-noise ratios yet maintains asymptotically valid inference for asset pricing tests. The crypto asset class is well-suited for applying this model given the limited number of tradable assets and years of data as well as the rich set of available asset characteristics. The empirical results present out-of-sample pricing abilities and risk-adjusted returns for our novel estimator as compared to benchmark methods. We provide an inference procedure for measuring the risk premium of an observable nontradable factor, and employ this to find that the inflation-mimicking portfolio in the crypto asset class has positive risk compensation."}, "https://arxiv.org/abs/2405.15740": {"title": "On Flexible Inverse Probability of Treatment and Intensity Weighting: Informative Censoring, Variable Inclusion, and Weight Trimming", "link": "https://arxiv.org/abs/2405.15740", "description": "arXiv:2405.15740v1 Announce Type: new \nAbstract: Many observational studies feature irregular longitudinal data, where the observation times are not common across individuals in the study. Further, the observation times may be related to the longitudinal outcome. In this setting, failing to account for the informative observation process may result in biased causal estimates. This can be coupled with other sources of bias, including non-randomized treatment assignments and informative censoring. This paper provides an overview of a flexible weighting method used to adjust for informative observation processes and non-randomized treatment assignments. We investigate the sensitivity of the flexible weighting method to violations of the noninformative censoring assumption, examine variable selection for the observation process weighting model, known as inverse intensity weighting, and look at the impacts of weight trimming for the flexible weighting model. We show that the flexible weighting method is sensitive to violations of the noninformative censoring assumption and show that a previously proposed extension fails under such violations. We also show that variables confounding the observation and outcome processes should always be included in the observation intensity model. Finally, we show that weight trimming should be applied in the flexible weighting model when the treatment assignment process is highly informative and driving the extreme weights. 
We conclude with an application of the methodology to a real data set to examine the impacts of household water sources on malaria diagnoses."}, "https://arxiv.org/abs/2405.14893": {"title": "YUI: Day-ahead Electricity Price Forecasting Using Invariance Simplified Supply and Demand Curve", "link": "https://arxiv.org/abs/2405.14893", "description": "arXiv:2405.14893v1 Announce Type: cross \nAbstract: In day-ahead electricity market, it is crucial for all market participants to have access to reliable and accurate price forecasts for their decision-making processes. Forecasting methods currently utilized in industrial applications frequently neglect the underlying mechanisms of price formation, while economic research from the perspective of supply and demand have stringent data collection requirements, making it difficult to apply in actual markets. Observing the characteristics of the day-ahead electricity market, we introduce two invariance assumptions to simplify the modeling of supply and demand curves. Upon incorporating the time invariance assumption, we can forecast the supply curve using the market equilibrium points from multiple time slots in the recent period. By introducing the price insensitivity assumption, we can approximate the demand curve using a straight line. The point where these two curves intersect provides us with the forecast price. The proposed model, forecasting suppl\\textbf{Y} and demand cUrve simplified by Invariance, termed as YUI, is more efficient than state-of-the-art methods. Our experiment results in Shanxi day-ahead electricity market show that compared with existing methods, YUI can reduce forecast error by 13.8\\% in MAE and 28.7\\% in sMAPE. Code is publicly available at https://github.com/wangln19/YUI."}, "https://arxiv.org/abs/2405.15132": {"title": "Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification", "link": "https://arxiv.org/abs/2405.15132", "description": "arXiv:2405.15132v1 Announce Type: cross \nAbstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. Since to estimate the density it is necessary to know the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets."}, "https://arxiv.org/abs/2405.15141": {"title": "Likelihood distortion and Bayesian local robustness", "link": "https://arxiv.org/abs/2405.15141", "description": "arXiv:2405.15141v1 Announce Type: cross \nAbstract: Robust Bayesian analysis has been mainly devoted to detecting and measuring robustness to the prior distribution. 
Indeed, many contributions in the literature aim to define suitable classes of priors which allow the computation of variations of quantities of interest while the prior changes within those classes. The literature has devoted much less attention to the robustness of Bayesian methods to the likelihood function due to mathematical and computational complexity, and because it is often arguably considered a more objective choice compared to the prior. In this contribution, a new approach to Bayesian local robustness to the likelihood function is proposed and extended to robustness to the prior and to both. This approach is based on the notion of distortion function introduced in the literature on risk theory, and then successfully adopted to build suitable classes of priors for Bayesian global robustness to the prior. The novel robustness measure is a local sensitivity measure that turns out to be very tractable and easy to compute for certain classes of distortion functions. Asymptotic properties are derived and numerical experiments illustrate the theory and its applicability for modelling purposes."}, "https://arxiv.org/abs/2405.15294": {"title": "Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision", "link": "https://arxiv.org/abs/2405.15294", "description": "arXiv:2405.15294v1 Announce Type: cross \nAbstract: We provide a theoretical and computational investigation of the Gamma-Maximin method with soft revision, which was recently proposed as a robust criterion for pseudo-label selection (PLS) in semi-supervised learning. Opposed to traditional methods for PLS we use credal sets of priors (\"generalized Bayes\") to represent the epistemic modeling uncertainty. These latter are then updated by the Gamma-Maximin method with soft revision. We eventually select pseudo-labeled data that are most likely in light of the least favorable distribution from the so updated credal set. We formalize the task of finding optimal pseudo-labeled data w.r.t. the Gamma-Maximin method with soft revision as an optimization problem. A concrete implementation for the class of logistic models then allows us to compare the predictive power of the method with competing approaches. It is observed that the Gamma-Maximin method with soft revision can achieve very promising results, especially when the proportion of labeled data is low."}, "https://arxiv.org/abs/2405.15357": {"title": "Strong screening rules for group-based SLOPE models", "link": "https://arxiv.org/abs/2405.15357", "description": "arXiv:2405.15357v1 Announce Type: cross \nAbstract: Tuning the regularization parameter in penalized regression models is an expensive task, requiring multiple models to be fit along a path of parameters. Strong screening rules drastically reduce computational costs by lowering the dimensionality of the input prior to fitting. We develop strong screening rules for group-based Sorted L-One Penalized Estimation (SLOPE) models: Group SLOPE and Sparse-group SLOPE. The developed rules are applicable for the wider family of group-based OWL models, including OSCAR. Our experiments on both synthetic and real data show that the screening rules significantly accelerate the fitting process. 
The screening rules make it accessible for group SLOPE and sparse-group SLOPE to be applied to high-dimensional datasets, particularly those encountered in genetics."}, "https://arxiv.org/abs/2405.15600": {"title": "Transfer Learning for Spatial Autoregressive Models", "link": "https://arxiv.org/abs/2405.15600", "description": "arXiv:2405.15600v1 Announce Type: cross \nAbstract: The spatial autoregressive (SAR) model has been widely applied in various empirical economic studies to characterize the spatial dependence among subjects. However, the precision of estimating the SAR model diminishes when the sample size of the target data is limited. In this paper, we propose a new transfer learning framework for the SAR model to borrow the information from similar source data to improve both estimation and prediction. When the informative source data sets are known, we introduce a two-stage algorithm, including a transferring stage and a debiasing stage, to estimate the unknown parameters and also establish the theoretical convergence rates for the resulting estimators. If we do not know which sources to transfer, a transferable source detection algorithm is proposed to detect informative sources data based on spatial residual bootstrap to retain the necessary spatial dependence. Its detection consistency is also derived. Simulation studies demonstrate that using informative source data, our transfer learning algorithm significantly enhances the performance of the classical two-stage least squares estimator. In the empirical application, we apply our method to the election prediction in swing states in the 2020 U.S. presidential election, utilizing polling data from the 2016 U.S. presidential election along with other demographic and geographical data. The empirical results show that our method outperforms traditional estimation methods."}, "https://arxiv.org/abs/2105.12891": {"title": "Identification and Estimation of Partial Effects in Nonlinear Semiparametric Panel Models", "link": "https://arxiv.org/abs/2105.12891", "description": "arXiv:2105.12891v5 Announce Type: replace \nAbstract: Average partial effects (APEs) are often not point identified in panel models with unrestricted unobserved individual heterogeneity, such as a binary response panel model with fixed effects and logistic errors as a special case. This lack of point identification occurs despite the identification of these models' common coefficients. We provide a unified framework to establish the point identification of various partial effects in a wide class of nonlinear semiparametric models under an index sufficiency assumption on the unobserved heterogeneity, even when the error distribution is unspecified and non-stationary. This assumption does not impose parametric restrictions on the unobserved heterogeneity and idiosyncratic errors. We also present partial identification results when the support condition fails. We then propose three-step semiparametric estimators for APEs, average structural functions, and average marginal effects, and show their consistency and asymptotic normality. Finally, we illustrate our approach in a study of determinants of married women's labor supply."}, "https://arxiv.org/abs/2207.05281": {"title": "Constrained D-optimal Design for Paid Research Study", "link": "https://arxiv.org/abs/2207.05281", "description": "arXiv:2207.05281v4 Announce Type: replace \nAbstract: We consider constrained sampling problems in paid research studies or clinical trials. 
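The transfer learning SAR abstract above benchmarks against the classical two-stage least squares estimator; the sketch below simulates a small SAR model and estimates it by spatial 2SLS with instruments built from spatial lags of the covariates, an assumed textbook construction rather than the paper's transfer and debiasing stages.

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulate a SAR model y = rho * W y + X beta + eps on a ring-neighbour
# weight matrix (illustrative only).
n, rho, beta = 400, 0.4, np.array([1.0, -0.5])
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5  # row-standardised neighbours
X = rng.normal(size=(n, 2))
eps = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# Classical spatial 2SLS: instrument the spatial lag Wy with [X, WX, W^2 X].
D = np.column_stack([W @ y, X])              # endogenous spatial lag + exogenous X
H = np.column_stack([X, W @ X, W @ W @ X])   # instrument matrix
P = H @ np.linalg.solve(H.T @ H, H.T)        # projection onto the instruments
theta = np.linalg.solve(D.T @ P @ D, D.T @ P @ y)
print("estimates [rho, beta1, beta2]:", theta.round(3), "true:", [rho, *beta])
```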
When there are more qualified volunteers than the budget allows, we recommend a D-optimal sampling strategy based on optimal design theory and develop a constrained lift-one algorithm to find the optimal allocation. Unlike the literature, which mainly deals with linear models, our solution solves the constrained sampling problem under fairly general statistical models, including generalized linear models and multinomial logistic models, and with more general constraints. We justify theoretically the optimality of our sampling strategy and show by simulation studies and real-world examples the advantages over simple random sampling and proportionally stratified sampling strategies."}, "https://arxiv.org/abs/2301.01085": {"title": "The Chained Difference-in-Differences", "link": "https://arxiv.org/abs/2301.01085", "description": "arXiv:2301.01085v3 Announce Type: replace \nAbstract: This paper studies the identification, estimation, and inference of long-term (binary) treatment effect parameters when balanced panel data is not available, or consists of only a subset of the available data. We develop a new estimator: the chained difference-in-differences, which leverages the overlapping structure of many unbalanced panel data sets. This approach consists in aggregating a collection of short-term treatment effects estimated on multiple incomplete panels. Our estimator accommodates (1) multiple time periods, (2) variation in treatment timing, (3) treatment effect heterogeneity, (4) general missing data patterns, and (5) sample selection on observables. We establish the asymptotic properties of the proposed estimator and discuss identification and efficiency gains in comparison to existing methods. Finally, we illustrate its relevance through (i) numerical simulations, and (ii) an application to the effects of an innovation policy in France."}, "https://arxiv.org/abs/2306.10779": {"title": "Bootstrap test procedure for variance components in nonlinear mixed effects models in the presence of nuisance parameters and a singular Fisher Information Matrix", "link": "https://arxiv.org/abs/2306.10779", "description": "arXiv:2306.10779v2 Announce Type: replace \nAbstract: We examine the problem of variance components testing in general mixed effects models using the likelihood ratio test. We account for the presence of nuisance parameters, i.e. the fact that some untested variances might also be equal to zero. Two main issues arise in this context, leading to a non-regular setting. First, under the null hypothesis the true parameter value lies on the boundary of the parameter space. Moreover, due to the presence of nuisance parameters the exact location of these boundary points is not known, which prevents the use of classical asymptotic theory of maximum likelihood estimation. Then, in the specific context of nonlinear mixed-effects models, the Fisher information matrix is singular at the true parameter value. We address these two points by proposing a shrinked parametric bootstrap procedure, which is straightforward to apply even for nonlinear models. We show that the procedure is consistent, solving both the boundary and the singularity issues, and we provide a verifiable criterion for the applicability of our theoretical results. We show through a simulation study that, compared to the asymptotic approach, our procedure has a better small sample performance and is more robust to the presence of nuisance parameters. 
A real data application is also provided."}, "https://arxiv.org/abs/2307.13627": {"title": "A flexible class of priors for orthonormal matrices with basis function-specific structure", "link": "https://arxiv.org/abs/2307.13627", "description": "arXiv:2307.13627v2 Announce Type: replace \nAbstract: Statistical modeling of high-dimensional matrix-valued data motivates the use of a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction. Low-rank representations commonly factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for basis function correlation structure a priori. While there exist Bayesian methods that allow for a common correlation structure across basis functions, empirical examples motivate the need for basis function-specific dependence structure. We propose a prior distribution for orthonormal matrices that can explicitly model basis function-specific structure. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. We discuss how the prior specification can be used for various scenarios and demonstrate favorable model properties through synthetic data examples. Finally, we apply our method to two-meter air temperature data from the Pacific Northwest, enhancing our understanding of the Earth system's internal variability."}, "https://arxiv.org/abs/2310.02414": {"title": "Sharp and Robust Estimation of Partially Identified Discrete Response Models", "link": "https://arxiv.org/abs/2310.02414", "description": "arXiv:2310.02414v3 Announce Type: replace \nAbstract: Semiparametric discrete choice models are widely used in a variety of practical applications. While these models are point identified in the presence of continuous covariates, they can become partially identified when covariates are discrete. In this paper we find that classic estimators, including the maximum score estimator (Manski (1975)), lose their attractive statistical properties without point identification. First, they are not sharp, with the estimator converging to an outer region of the identified set (Komarova (2013)), and in many discrete designs it weakly converges to a random set. Second, they are not robust, with their distribution limit discontinuously changing with respect to the parameters of the model. We propose a novel class of estimators based on the concept of a quantile of a random set, which we show to be both sharp and robust. We demonstrate that our approach extends from cross-sectional settings to classic static and dynamic discrete panel data models."}, "https://arxiv.org/abs/2311.07034": {"title": "Regularized Halfspace Depth for Functional Data", "link": "https://arxiv.org/abs/2311.07034", "description": "arXiv:2311.07034v2 Announce Type: replace \nAbstract: Data depth is a powerful nonparametric tool originally proposed to rank multivariate data from the center outward. In this context, one of the most archetypical depth notions is Tukey's halfspace depth. In the last few decades, notions of depth have also been proposed for functional data. However, Tukey's depth cannot be extended to handle functional data because of its degeneracy. 
Here, we propose a new halfspace depth for functional data which avoids degeneracy by regularization. The halfspace projection directions are constrained to have a small reproducing kernel Hilbert space norm. Desirable theoretical properties of the proposed depth, such as isometry invariance, maximality at the center, monotonicity relative to a deepest point, upper semi-continuity, and consistency, are established. Moreover, the regularized halfspace depth can rank functional data with varying emphasis on shape or magnitude, depending on the regularization. A new outlier detection approach is also proposed, which is capable of detecting both shape and magnitude outliers. It is applicable to trajectories in $L^2$, a very general space of functions that includes non-smooth trajectories. Based on extensive numerical studies, our methods are shown to perform well in terms of detecting outliers of different types. Three real data examples showcase the proposed depth notion."}, "https://arxiv.org/abs/2311.17021": {"title": "Optimal Categorical Instrumental Variables", "link": "https://arxiv.org/abs/2311.17021", "description": "arXiv:2311.17021v2 Announce Type: replace \nAbstract: This paper discusses estimation with a categorical instrumental variable in settings with potentially few observations per category. The proposed categorical instrumental variable estimator (CIV) leverages a regularization assumption that implies the existence of a latent categorical variable with fixed finite support achieving the same first-stage fit as the observed instrument. In asymptotic regimes that allow the number of observations per category to grow at an arbitrarily small polynomial rate with the sample size, I show that when the cardinality of the support of the optimal instrument is known, CIV is root-n asymptotically normal, achieves the same asymptotic variance as the oracle IV estimator that presumes knowledge of the optimal instrument, and is semiparametrically efficient under homoskedasticity. Under-specifying the number of support points reduces efficiency but maintains asymptotic normality. In an application that leverages judge fixed effects as instruments, CIV compares favorably to commonly used jackknife-based instrumental variable estimators."}, "https://arxiv.org/abs/2312.03643": {"title": "Propagating moments in probabilistic graphical models with polynomial regression forms for decision support systems", "link": "https://arxiv.org/abs/2312.03643", "description": "arXiv:2312.03643v2 Announce Type: replace \nAbstract: Probabilistic graphical models are widely used to model complex systems under uncertainty. Traditionally, Gaussian directed graphical models are applied for analysis of large networks with continuous variables, as they can provide conditional and marginal distributions in closed form, simplifying the inferential task. The Gaussianity and linearity assumptions are often adequate, yet can lead to poor performance when dealing with some practical applications. In this paper, we model each variable in a graph G as a polynomial regression of its parents to capture complex relationships between individual variables, together with a utility function of polynomial form. We develop a message-passing algorithm to propagate information throughout the network solely using moments, which enables the expected utility scores to be calculated exactly. Our propagation method scales up well and enables inference to be performed in terms of a finite number of expectations.
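As an illustration of the projection-based depth idea underlying the regularized halfspace depth abstract above, here is a minimal sketch for curves observed on a common grid: the depth of a curve is approximated by minimizing one-sided projection ranks over random smooth directions. Drawing directions from a Gaussian process with a squared-exponential kernel stands in for the paper's RKHS-norm constraint; the grid, kernel bandwidth, and number of directions are assumptions of ours, and this is not the authors' estimator.

```python
# Minimal sketch: projection-based halfspace depth for curves on a grid, with
# smooth random directions standing in for RKHS-norm-constrained directions.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)                                            # evaluation grid
curves = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=(100, t.size))  # toy sample

# Squared-exponential covariance used to draw smooth projection directions.
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.2) ** 2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(t.size))

def projection_depth(x, sample, n_dir=200):
    """Approximate halfspace depth: the minimum, over random smooth directions u,
    of the fraction of sample curves whose projection lies on either side of x's."""
    depth = 1.0
    for _ in range(n_dir):
        u = L @ rng.normal(size=t.size)
        proj_sample, proj_x = sample @ u, x @ u
        depth = min(depth, np.mean(proj_sample >= proj_x), np.mean(proj_sample <= proj_x))
    return depth

depths = np.array([projection_depth(c, curves) for c in curves[:10]])
print(np.round(depths, 3))  # more central curves receive larger depth values
```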
We illustrate how the proposed methodology works with examples and in an application to decision problems in energy planning and for real-time clinical decision support."}, "https://arxiv.org/abs/2312.06478": {"title": "Prediction De-Correlated Inference: A safe approach for post-prediction inference", "link": "https://arxiv.org/abs/2312.06478", "description": "arXiv:2312.06478v3 Announce Type: replace \nAbstract: In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called post-prediction inference. We propose a novel assumption-lean framework for statistical inference under post-prediction setting, called Prediction De-Correlated Inference (PDC). Our approach is safe, in the sense that PDC can automatically adapt to any black-box machine-learning model and consistently outperform the supervised counterparts. The PDC framework also offers easy extensibility for accommodating multiple predictive models. Both numerical results and real-world data analysis demonstrate the superiority of PDC over the state-of-the-art methods."}, "https://arxiv.org/abs/2203.15756": {"title": "Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data", "link": "https://arxiv.org/abs/2203.15756", "description": "arXiv:2203.15756v3 Announce Type: replace-cross \nAbstract: Constraint-based causal discovery methods leverage conditional independence tests to infer causal relationships in a wide variety of applications. Just as the majority of machine learning methods, existing work focuses on studying $\\textit{independent and identically distributed}$ data. However, it is known that even with infinite i.i.d.$\\ $ data, constraint-based methods can only identify causal structures up to broad Markov equivalence classes, posing a fundamental limitation for causal discovery. In this work, we observe that exchangeable data contains richer conditional independence structure than i.i.d.$\\ $ data, and show how the richer structure can be leveraged for causal discovery. We first present causal de Finetti theorems, which state that exchangeable distributions with certain non-trivial conditional independences can always be represented as $\\textit{independent causal mechanism (ICM)}$ generative processes. We then present our main identifiability theorem, which shows that given data from an ICM generative process, its unique causal structure can be identified through performing conditional independence tests. We finally develop a causal discovery algorithm and demonstrate its applicability to inferring causal relationships from multi-environment data. Our code and models are publicly available at: https://github.com/syguo96/Causal-de-Finetti"}, "https://arxiv.org/abs/2305.16539": {"title": "On the existence of powerful p-values and e-values for composite hypotheses", "link": "https://arxiv.org/abs/2305.16539", "description": "arXiv:2305.16539v3 Announce Type: replace-cross \nAbstract: Given a composite null $ \\mathcal P$ and composite alternative $ \\mathcal Q$, when and how can we construct a p-value whose distribution is exactly uniform under the null, and stochastically smaller than uniform under the alternative? Similarly, when and how can we construct an e-value whose expectation exactly equals one under the null, but its expected logarithm under the alternative is positive? 
We answer these basic questions, and other related ones, when $ \\mathcal P$ and $ \\mathcal Q$ are convex polytopes (in the space of probability measures). We prove that such constructions are possible if and only if $ \\mathcal Q$ does not intersect the span of $ \\mathcal P$. If the p-value is allowed to be stochastically larger than uniform under $P\\in \\mathcal P$, and the e-value can have expectation at most one under $P\\in \\mathcal P$, then it is achievable whenever $ \\mathcal P$ and $ \\mathcal Q$ are disjoint. More generally, even when $ \\mathcal P$ and $ \\mathcal Q$ are not polytopes, we characterize the existence of a bounded nontrivial e-variable whose expectation exactly equals one under any $P \\in \\mathcal P$. The proofs utilize recently developed techniques in simultaneous optimal transport. A key role is played by coarsening the filtration: sometimes, no such p-value or e-value exists in the richest data filtration, but it does exist in some reduced filtration, and our work provides the first general characterization of this phenomenon. We also provide an iterative procedure that explicitly constructs such processes, and under certain conditions it finds the one that grows fastest under a specific alternative $Q$. We discuss implications for the construction of composite nonnegative (super)martingales, and end with some conjectures and open problems."}, "https://arxiv.org/abs/2405.15887": {"title": "Data-adaptive exposure thresholds for the Horvitz-Thompson estimator of the Average Treatment Effect in experiments with network interference", "link": "https://arxiv.org/abs/2405.15887", "description": "arXiv:2405.15887v1 Announce Type: new \nAbstract: Randomized controlled trials often suffer from interference, a violation of the Stable Unit Treatment Value Assumption (SUTVA) in which a unit's treatment assignment affects the outcomes of its neighbors. This interference causes bias in naive estimators of the average treatment effect (ATE). A popular method to achieve unbiasedness is to pair the Horvitz-Thompson estimator of the ATE with a known exposure mapping: a function that identifies which units in a given randomization are not subject to interference. For example, an exposure mapping can specify that any unit with at least an $h$-fraction of its neighbors having the same treatment status does not experience interference. However, this threshold $h$ is difficult to elicit from domain experts, and a misspecified threshold can induce bias. In this work, we propose a data-adaptive method to select the \"$h$\"-fraction threshold that minimizes the mean squared error of the Horvitz-Thompson estimator. Our method estimates the bias and variance of the Horvitz-Thompson estimator under different thresholds using a linear dose-response model of the potential outcomes. We present simulations illustrating that our method improves upon non-adaptive choices of the threshold. We further illustrate the performance of our estimator by running experiments on a publicly available Amazon product similarity graph.
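To make the exposure-mapping construction above concrete, here is a minimal sketch of the Horvitz-Thompson ATE estimator under an $h$-fraction exposure mapping, assuming independent Bernoulli(p) treatment assignment on a known graph. The paper's data-adaptive selection of $h$ is not implemented; the function name and toy data are ours.

```python
# Minimal sketch: Horvitz-Thompson ATE under an h-fraction exposure mapping,
# assuming independent Bernoulli(p) assignment on a known adjacency matrix.
import numpy as np
from scipy.stats import binom

def ht_ate(adj, treat, y, p, h):
    """adj: (n, n) symmetric 0/1 adjacency; treat: 0/1 assignment; y: outcomes;
    p: Bernoulli treatment probability; h: exposure threshold in [0, 1]."""
    n = len(y)
    deg = adj.sum(axis=1)
    # Fraction of neighbors sharing the unit's own status (isolated units count as exposed).
    n_same = np.where(treat == 1, adj @ treat, adj @ (1 - treat))
    frac_same = np.divide(n_same, deg, out=np.ones(n, dtype=float), where=deg > 0)
    exposed = frac_same >= h
    # Exposure probabilities: P(own status) * P(at least ceil(h * degree) neighbors share it).
    k = np.ceil(h * deg).astype(int)
    pi_t = p * binom.sf(k - 1, deg, p)
    pi_c = (1 - p) * binom.sf(k - 1, deg, 1 - p)
    term_t = np.where((treat == 1) & exposed, y / pi_t, 0.0)
    term_c = np.where((treat == 0) & exposed, y / pi_c, 0.0)
    return term_t.mean() - term_c.mean()

rng = np.random.default_rng(2)
n = 200
upper = np.triu((rng.random((n, n)) < 0.03).astype(int), 1)
adj = upper + upper.T
treat = rng.binomial(1, 0.5, n)
deg = np.maximum(adj.sum(axis=1), 1)
y = 1.0 + 2.0 * treat + 0.5 * (adj @ treat) / deg + rng.normal(size=n)  # outcome with spillovers
print(f"HT estimate of the ATE with h = 0.75: {ht_ate(adj, treat, y, p=0.5, h=0.75):.2f}")
```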
Furthermore, we demonstrate that our method is robust to deviations from the linear potential outcomes model."}, "https://arxiv.org/abs/2405.15948": {"title": "Multicalibration for Censored Survival Data: Towards Universal Adaptability in Predictive Modeling", "link": "https://arxiv.org/abs/2405.15948", "description": "arXiv:2405.15948v1 Announce Type: new \nAbstract: Traditional statistical and machine learning methods assume identical distribution for the training and test data sets. This assumption, however, is often violated in real applications, particularly in health care research, where the training data~(source) may underrepresent specific subpopulations in the testing or target domain. Such disparities, coupled with censored observations, present significant challenges for investigators aiming to make predictions for those minority groups. This paper focuses on target-independent learning under covariate shift, where we study multicalibration for survival probability and restricted mean survival time, and propose a black-box post-processing boosting algorithm designed for censored survival data. Our algorithm, leveraging the pseudo observations, yields a multicalibrated predictor competitive with propensity scoring regarding predictions on the unlabeled target domain, not just overall but across diverse subpopulations. Our theoretical analysis for pseudo observations relies on functional delta method and $p$-variational norm. We further investigate the algorithm's sample complexity and convergence properties, as well as the multicalibration guarantee for post-processed predictors. Our theoretical insights reveal the link between multicalibration and universal adaptability, suggesting that our calibrated function performs comparably to, if not better than, the inverse propensity score weighting estimator. The performance of our proposed methods is corroborated through extensive numerical simulations and a real-world case study focusing on prediction of cardiovascular disease risk in two large prospective cohort studies. These empirical results confirm its potential as a powerful tool for predictive analysis with censored outcomes in diverse and shifting populations."}, "https://arxiv.org/abs/2405.16046": {"title": "Sensitivity Analysis for Attributable Effects in Case$^2$ Studies", "link": "https://arxiv.org/abs/2405.16046", "description": "arXiv:2405.16046v1 Announce Type: new \nAbstract: The case$^2$ study, also referred to as the case-case study design, is a valuable approach for conducting inference for treatment effects. Unlike traditional case-control studies, the case$^2$ design compares treatment in two types of cases with the same disease. A key quantity of interest is the attributable effect, which is the number of cases of disease among treated units which are caused by the treatment. Two key assumptions that are usually made for making inferences about the attributable effect in case$^2$ studies are 1.) treatment does not cause the second type of case, and 2.) the treatment does not alter an individual's case type. However, these assumptions are not realistic in many real-data applications. In this article, we present a sensitivity analysis framework to scrutinize the impact of deviations from these assumptions on obtained results. We also include sensitivity analyses related to the assumption of unmeasured confounding, recognizing the potential bias introduced by unobserved covariates. 
The proposed methodology is exemplified through an investigation into whether violent behavior in the last year of life increases suicide risk, using the 1993 National Mortality Followback Survey dataset."}, "https://arxiv.org/abs/2405.16106": {"title": "On the PM2", "link": "https://arxiv.org/abs/2405.16106", "description": "arXiv:2405.16106v1 Announce Type: new \nAbstract: Spatial confounding, often regarded as a major concern in epidemiological studies, relates to the difficulty of recovering the effect of an exposure on an outcome when these variables are associated with unobserved factors. This issue is particularly challenging in spatio-temporal analyses, where it has been less explored so far. To study the effects of air pollution on mortality in Italy, we argue that a model that simultaneously accounts for spatio-temporal confounding and for the non-linear form of the effect of interest is needed. To this end, we propose a Bayesian dynamic generalized linear model, which allows for a non-linear association and for a decomposition of the exposure effect into two components. This decomposition accommodates associations with the outcome at fine and coarse temporal and spatial scales of variation. These features, when combined, allow reducing the spatio-temporal confounding bias and recovering the true shape of the association, as demonstrated through simulation studies. The results from the real-data application indicate that the exposure effect seems to have different magnitudes in different seasons, with peaks in the summer. We hypothesize that this could be due to possible interactions of the exposure variable with air temperature and unmeasured confounders."}, "https://arxiv.org/abs/2405.16161": {"title": "Inference for Optimal Linear Treatment Regimes in Personalized Decision-making", "link": "https://arxiv.org/abs/2405.16161", "description": "arXiv:2405.16161v1 Announce Type: new \nAbstract: Personalized decision-making, tailored to individual characteristics, is gaining significant attention. The optimal treatment regime aims to provide the best expected outcome in the entire population, known as the value function. One approach to determine this optimal regime is by maximizing the Augmented Inverse Probability Weighting (AIPW) estimator of the value function. However, the derived treatment regime can be intricate and nonlinear, limiting its use. For clarity and interpretability, we emphasize linear regimes and determine the optimal linear regime by optimizing the AIPW estimator within set constraints.\n While the AIPW estimator offers a viable path to estimating the optimal regime, current methodologies predominantly focus on its asymptotic distribution, leaving a gap in studying the linear regime itself. However, there are many benefits to understanding the regime, as pinpointing significant covariates can enhance treatment effects and provide future clinical guidance. In this paper, we explore the asymptotic distribution of the estimated linear regime. Our results show that the parameter associated with the linear regime follows a cube-root convergence to a non-normal limiting distribution characterized by the maximizer of a centered Gaussian process with a quadratic drift. When making inferences for the estimated linear regimes with cube-root convergence in practical scenarios, the standard nonparametric bootstrap is invalid. As a solution, we employ the Cattaneo et al.
(2020) bootstrap technique to provide a consistent distributional approximation for the estimated linear regimes, validated further through simulations and real-world data applications from the eICU Collaborative Research Database."}, "https://arxiv.org/abs/2405.16192": {"title": "Novel closed-form point estimators for a weighted exponential family derived from likelihood equations", "link": "https://arxiv.org/abs/2405.16192", "description": "arXiv:2405.16192v1 Announce Type: new \nAbstract: In this paper, we propose and investigate closed-form point estimators for a weighted exponential family. We also develop a bias-reduced version of these proposed closed-form estimators through bootstrap methods. Estimators are assessed using a Monte Carlo simulation, revealing favorable results for the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2405.16246": {"title": "Conformalized Late Fusion Multi-View Learning", "link": "https://arxiv.org/abs/2405.16246", "description": "arXiv:2405.16246v1 Announce Type: new \nAbstract: Uncertainty quantification for multi-view learning is motivated by the increasing use of multi-view data in scientific problems. A common variant of multi-view learning is late fusion: train separate predictors on individual views and combine them after single-view predictions are available. Existing methods for uncertainty quantification for late fusion often rely on undesirable distributional assumptions for validity. Conformal prediction is one approach that avoids such distributional assumptions. However, naively applying conformal prediction to late-stage fusion pipelines often produces overly conservative and uninformative prediction regions, limiting its downstream utility. We propose a novel methodology, Multi-View Conformal Prediction (MVCP), where conformal prediction is instead performed separately on the single-view predictors and only fused subsequently. Our framework extends the standard scalar formulation of a score function to a multivariate score that produces more efficient downstream prediction regions in both classification and regression settings. We then demonstrate that such improvements can be realized in methods built atop conformalized regressors, specifically in robust predict-then-optimize pipelines."}, "https://arxiv.org/abs/2405.16298": {"title": "Fast Emulation and Modular Calibration for Simulators with Functional Response", "link": "https://arxiv.org/abs/2405.16298", "description": "arXiv:2405.16298v1 Announce Type: new \nAbstract: Scalable surrogate models enable efficient emulation of computer models (or simulators), particularly when dealing with large ensembles of runs. While Gaussian Process (GP) models are commonly employed for emulation, they face limitations in scaling to truly large datasets. Furthermore, when dealing with dense functional output, such as spatial or time-series data, additional complexities arise, requiring careful handling to ensure fast emulation. This work presents a highly scalable emulator for functional data, building upon the works of Kennedy and O'Hagan (2001) and Higdon et al. (2008), while incorporating the local approximate Gaussian Process framework proposed by Gramacy and Apley (2015). The emulator utilizes global GP lengthscale parameter estimates to scale the input space, leading to a substantial improvement in prediction speed. We demonstrate that our fast approximation-based emulator can serve as a viable alternative to the methods outlined in Higdon et al. 
(2008) for functional response, while drastically reducing computational costs. The proposed emulator is applied to quickly calibrate the multiphysics continuum hydrodynamics simulator FLAG with a large ensemble of 20000 runs. The methods presented are implemented in the R package FlaGP."}, "https://arxiv.org/abs/2405.16379": {"title": "Selective inference for multiple pairs of clusters after K-means clustering", "link": "https://arxiv.org/abs/2405.16379", "description": "arXiv:2405.16379v1 Announce Type: new \nAbstract: If the same data is used for both clustering and for testing a null hypothesis that is formulated in terms of the estimated clusters, then the traditional hypothesis testing framework often fails to control the Type I error. Gao et al. [2022] and Chen and Witten [2023] provide selective inference frameworks for testing if a pair of estimated clusters indeed stem from underlying differences, for the case where hierarchical clustering and K-means clustering, respectively, are used to define the clusters. In applications, however, it is often of interest to test for multiple pairs of clusters. In our work, we extend the pairwise test of Chen and Witten [2023] to a test for multiple pairs of clusters, where the cluster assignments are produced by K-means clustering. We further develop an analogous test for the setting where the variance is unknown, building on the work of Yun and Barber [2023] that extends Gao et al. [2022]'s pairwise test to the case of unknown variance. For both known and unknown variance settings, we present methods that address certain forms of data-dependence in the choice of pairs of clusters to test for. We show that our proposed tests control the Type I error, both theoretically and empirically, and provide a numerical study of their empirical powers under various settings."}, "https://arxiv.org/abs/2405.16467": {"title": "Two-way fixed effects instrumental variable regressions in staggered DID-IV designs", "link": "https://arxiv.org/abs/2405.16467", "description": "arXiv:2405.16467v1 Announce Type: new \nAbstract: Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation in the timing of policy adoption across units as an instrument for treatment. This paper studies the properties of the TWFEIV estimator in staggered instrumented difference-in-differences (DID-IV) designs. We show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator can be decomposed into a weighted average of all possible two-group/two-period Wald-DID estimators. Under staggered DID-IV designs, a causal interpretation of the TWFEIV estimand hinges on the stable effects of the instrument on the treatment and the outcome over time. We illustrate the use of our decomposition theorem for the TWFEIV estimator through an empirical application."}, "https://arxiv.org/abs/2405.16492": {"title": "A joint model for (un)bounded longitudinal markers, competing risks, and recurrent events using patient registry data", "link": "https://arxiv.org/abs/2405.16492", "description": "arXiv:2405.16492v1 Announce Type: new \nAbstract: Joint models for longitudinal and survival data have become a popular framework for studying the association between repeatedly measured biomarkers and clinical events. Nevertheless, addressing complex survival data structures, especially handling both recurrent and competing event times within a single model, remains a challenge. This causes important information to be disregarded. 
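The selective-inference abstract above is motivated by the fact that clustering and testing on the same data ("double dipping") inflates Type I error. The following minimal simulation illustrates that problem only; it does not implement the proposed corrected tests, and the sample sizes and nominal level are arbitrary choices of ours.

```python
# Minimal sketch: naive testing of K-means clusters found on the same data
# inflates Type I error far beyond the nominal 5% level.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_sim, rejections = 500, 0
for _ in range(n_sim):
    X = rng.normal(size=(100, 2))                       # no true clusters
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Naive two-sample t-test on the first coordinate between estimated clusters.
    p = stats.ttest_ind(X[labels == 0, 0], X[labels == 1, 0]).pvalue
    rejections += p < 0.05

print(f"naive Type I error rate: {rejections / n_sim:.2f} (nominal 0.05)")
```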
Moreover, existing frameworks rely on a Gaussian distribution for continuous markers, which may be unsuitable for bounded biomarkers, resulting in biased estimates of associations. To address these limitations, we propose a Bayesian shared-parameter joint model that simultaneously accommodates multiple (possibly bounded) longitudinal markers, a recurrent event process, and competing risks. We use the beta distribution to model responses bounded within any interval (a,b) without sacrificing the interpretability of the association. The model offers various forms of association, discontinuous risk intervals, and both gap and calendar timescales. A simulation study shows that it outperforms simpler joint models. We utilize the US Cystic Fibrosis Foundation Patient Registry to study the associations between changes in lung function and body mass index, and the risk of recurrent pulmonary exacerbations, while accounting for the competing risks of death and lung transplantation. Our efficient implementation allows fast fitting of the model despite its complexity and the large sample size from this patient registry. Our comprehensive approach provides new insights into cystic fibrosis disease progression by quantifying the relationship between the most important clinical markers and events more precisely than has been possible before. The model implementation is available in the R package JMbayes2."}, "https://arxiv.org/abs/2405.16547": {"title": "Estimating Dyadic Treatment Effects with Unknown Confounders", "link": "https://arxiv.org/abs/2405.16547", "description": "arXiv:2405.16547v1 Announce Type: new \nAbstract: This paper proposes a statistical inference method for assessing treatment effects with dyadic data. Under the assumption that the treatments follow an exchangeable distribution, our approach allows for the presence of any unobserved confounding factors that potentially cause endogeneity of treatment choice without requiring additional information other than the treatments and outcomes. Building on the literature of graphon estimation in network data analysis, we propose a neighborhood kernel smoothing method for estimating dyadic average treatment effects. We also develop a permutation inference method for testing the sharp null hypothesis. Under certain regularity conditions, we derive the rate of convergence of the proposed estimator and demonstrate the size control property of our test. We apply our method to international trade data to assess the impact of free trade agreements on bilateral trade flows."}, "https://arxiv.org/abs/2405.16602": {"title": "Multiple imputation of missing covariates when using the Fine-Gray model", "link": "https://arxiv.org/abs/2405.16602", "description": "arXiv:2405.16602v1 Announce Type: new \nAbstract: The Fine-Gray model for the subdistribution hazard is commonly used for estimating associations between covariates and competing risks outcomes. When there are missing values in the covariates included in a given model, researchers may wish to multiply impute them. Assuming interest lies in estimating the risk of only one of the competing events, this paper develops a substantive-model-compatible multiple imputation approach that exploits the parallels between the Fine-Gray model and the standard (single-event) Cox model. 
In the presence of right-censoring, this involves first imputing the potential censoring times for those failing from competing events, and thereafter imputing the missing covariates by leveraging methodology previously developed for the Cox model in the setting without competing risks. In a simulation study, we compared the proposed approach to alternative methods, such as imputing compatibly with cause-specific Cox models. The proposed method performed well (in terms of estimation of both subdistribution log hazard ratios and cumulative incidences) when data were generated assuming proportional subdistribution hazards, and performed satisfactorily when this assumption was not satisfied. The gain in efficiency compared to a complete-case analysis was demonstrated in both the simulation study and an applied data example on competing outcomes following an allogeneic stem cell transplantation. For individual-specific cumulative incidence estimation, assuming proportionality on the correct scale at the analysis phase appears to be more important than correctly specifying the imputation procedure used to impute the missing covariates."}, "https://arxiv.org/abs/2405.16780": {"title": "Analysis of Broken Randomized Experiments by Principal Stratification", "link": "https://arxiv.org/abs/2405.16780", "description": "arXiv:2405.16780v1 Announce Type: new \nAbstract: Although randomized controlled trials have long been regarded as the ``gold standard'' for evaluating treatment effects, they offer no natural safeguard against post-treatment events. For example, non-compliance makes the actual treatment different from the assigned treatment, truncation-by-death renders the outcome undefined or ill-defined, and missingness prevents the outcomes from being measured. In this paper, we develop a statistical analysis framework using principal stratification to investigate the treatment effect in broken randomized experiments. The average treatment effect in compliers and always-survivors is adopted as the target causal estimand. We establish the asymptotic properties of the estimator. We apply the framework to study the effect of training on earnings in the Job Corps Study and find that the training program does not have an effect on employment but may have an effect on improving earnings after employment."}, "https://arxiv.org/abs/2405.16859": {"title": "Gaussian Mixture Model with Rare Events", "link": "https://arxiv.org/abs/2405.16859", "description": "arXiv:2405.16859v1 Announce Type: new \nAbstract: We study here a Gaussian Mixture Model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits an extremely slow numerical convergence rate. To theoretically understand this phenomenon, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the empirically observed slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. As compared with the standard EM algorithm, the key feature of the MEM algorithm is that it requires additional labeled data.
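For reference, the standard EM updates that the rare-events analysis above is concerned with look as follows in the simplest two-component univariate case. This sketch only illustrates the slow-convergence setting and is not the paper's MEM algorithm; the mixture weight, starting values, and iteration count are ours.

```python
# Minimal sketch: standard EM for a two-component univariate GMM with a rare
# component (1% weight), illustrating slow convergence from a poor start.
import numpy as np

rng = np.random.default_rng(4)
n, pi_true = 5000, 0.01
z = rng.binomial(1, pi_true, n)
x = np.where(z == 1, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pi, mu0, mu1, s0, s1 = 0.5, -1.0, 1.0, 1.0, 1.0   # deliberately poor starting values
for _ in range(500):
    # E-step: responsibilities of the rare component.
    num = pi * norm_pdf(x, mu1, s1)
    r = num / (num + (1 - pi) * norm_pdf(x, mu0, s0))
    # M-step: weighted updates of the weight, means, and standard deviations.
    pi = r.mean()
    mu1 = np.sum(r * x) / np.sum(r)
    mu0 = np.sum((1 - r) * x) / np.sum(1 - r)
    s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
    s0 = np.sqrt(np.sum((1 - r) * (x - mu0) ** 2) / np.sum(1 - r))

print(f"estimated rare-component weight after 500 iterations: {pi:.4f} (truth {pi_true})")
```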
We find that the MEM algorithm significantly improves the numerical convergence rate as compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs."}, "https://arxiv.org/abs/2405.16885": {"title": "Hidden Markov modelling of spatio-temporal dynamics of measles in 1750-1850 Finland", "link": "https://arxiv.org/abs/2405.16885", "description": "arXiv:2405.16885v1 Announce Type: new \nAbstract: Real-world spatio-temporal datasets, and phenomena related to them, are often challenging to visualise or gain a general overview of. In order to summarise information encompassed in such data, we combine two well-known statistical modelling methods. To account for the spatial dimension, we use the intrinsic modification of the conditional autoregression, and combine it with the hidden Markov model, allowing the spatial patterns to vary over time. We apply our method to parish register data on deaths caused by measles in Finland in 1750-1850, and gain novel insight into previously undiscovered infection dynamics. Five distinctive, recurring states describing spatially and temporally differing infection burden and potential routes of spread are identified. We also find that there is a change in the occurrences of the most typical spatial patterns circa 1812, possibly due to changes in communication routes after major administrative transformations in Finland."}, "https://arxiv.org/abs/2405.16989": {"title": "Uncertainty Learning for High-dimensional Mean-variance Portfolio", "link": "https://arxiv.org/abs/2405.16989", "description": "arXiv:2405.16989v1 Announce Type: new \nAbstract: Accounting for uncertainty in data quality is important for accurate statistical inference. We aim at an optimal conservative allocation for a large universe of assets in a mean-variance portfolio (MVP), namely the worst-case choice under uncertainty in the data distribution. Unlike the low-dimensional MVP studied in Blanchet et al. (2022, Management Science), the large number of assets raises a challenging problem in quantifying the uncertainty, due to the large deviation of the sample covariance matrix from its population version. To overcome this difficulty, we propose a data-adaptive method to quantify the uncertainty with the help of a factor structure. Monte Carlo simulations are conducted to show the superiority of our method in high-dimensional cases: avoiding the over-conservative results in Blanchet et al. (2022), our allocation is closer to the oracle version in terms of risk minimization and control of the expected portfolio return."}, "https://arxiv.org/abs/2405.17064": {"title": "The Probability of Improved Prediction: a new concept in statistical inference", "link": "https://arxiv.org/abs/2405.17064", "description": "arXiv:2405.17064v1 Announce Type: new \nAbstract: In an attempt to provide an answer to the increasing criticism against p-values and to bridge the gap between statistical inference and prediction modelling, we introduce the probability of improved prediction (PIP). In general, the PIP is a probabilistic measure for comparing two competing models. Three versions of the PIP and several estimators are introduced, and the relationships between them, p-values, and the mean squared error are investigated. The performance of the estimators is assessed in a simulation study.
An application shows how the PIP can complement p-values, strengthening conclusions or possibly pointing to issues with, e.g., replicability."}, "https://arxiv.org/abs/2405.17117": {"title": "Robust Reproducible Network Exploration", "link": "https://arxiv.org/abs/2405.17117", "description": "arXiv:2405.17117v1 Announce Type: new \nAbstract: We propose a novel method of network detection that is robust against any complex dependence structure. Our goal is to conduct exploratory network detection, meaning that we attempt to detect a network composed of ``connectable'' edges that are worth investigating in detail for further modelling or precise network analysis. For reproducible network detection, we pursue high power while controlling the false discovery rate (FDR). In particular, we formalize the problem as a multiple testing problem, and propose p-variables that are used in the Benjamini-Hochberg procedure. We show that the proposed method controls the FDR under an arbitrary dependence structure and any sample size, and has asymptotic power one. The validity is also confirmed by simulations and a real data example."}, "https://arxiv.org/abs/2405.17166": {"title": "Cross-border cannibalization: Spillover effects of wind and solar energy on interconnected European electricity markets", "link": "https://arxiv.org/abs/2405.17166", "description": "arXiv:2405.17166v1 Announce Type: new \nAbstract: The average revenue, or market value, of wind and solar energy tends to fall with increasing market shares, as is now evident across European electricity markets. At the same time, these markets have become more interconnected. In this paper, we empirically study the multiple cross-border effects on the value of renewable energy: on one hand, interconnection is a flexibility resource that allows energy to be exported when it is locally abundant, benefitting renewables. On the other hand, wind and solar radiation are correlated across space, so neighboring supply adds to local supply and depresses domestic prices. We estimate both effects, using spatial panel regression on electricity market data from 2015 to 2023 from 30 European bidding zones. We find that domestic wind and solar value is depressed not only by domestic but also by neighboring renewables expansion. The better interconnected a market is, the smaller the effect of domestic but the larger the effect of neighboring renewables. While wind value is stabilized by interconnection, solar value is not. If wind market share increases both at home and in neighboring markets by one percentage point, the value factor of wind energy is reduced by just above 1 percentage point. For solar, this number is almost 4 percentage points."}, "https://arxiv.org/abs/2405.17225": {"title": "Quantifying the Reliance of Black-Box Decision-Makers on Variables of Interest", "link": "https://arxiv.org/abs/2405.17225", "description": "arXiv:2405.17225v1 Announce Type: new \nAbstract: This paper introduces a framework for measuring how much black-box decision-makers rely on variables of interest. The framework adapts a permutation-based measure of variable importance from the explainable machine learning literature. With an emphasis on applicability, I present some of the framework's theoretical and computational properties, explain how reliance computations have policy implications, and work through an illustrative example.
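The network-exploration abstract above feeds edge-wise p-variables into the Benjamini-Hochberg procedure. The following minimal sketch shows that selection step on toy p-values; the paper's specific p-variables are not constructed here, and the simulated signal/null split and FDR level are ours.

```python
# Minimal sketch: FDR-controlled edge detection via the Benjamini-Hochberg step
# applied to edge-wise p-values (toy p-values, not the paper's p-variables).
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(5)
# Toy edge-wise p-values: 900 null edges (uniform) and 100 "connectable" edges.
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10, size=100)])
detected = benjamini_hochberg(pvals, alpha=0.10)
print(f"{detected.sum()} edges detected, {detected[900:].sum()} of the 100 signal edges")
```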
In the empirical application to interruptions by Supreme Court Justices during oral argument, I find that the effect of gender is more muted compared to the existing literature's estimate; I then use this paper's framework to compare Justices' reliance on gender and alignment to their reliance on experience, which are incomparable using regression coefficients."}, "https://arxiv.org/abs/2405.17237": {"title": "Mixing it up: Inflation at risk", "link": "https://arxiv.org/abs/2405.17237", "description": "arXiv:2405.17237v1 Announce Type: new \nAbstract: Assessing the contribution of various risk factors to future inflation risks was crucial for guiding monetary policy during the recent high inflation period. However, existing methodologies often provide limited insights by focusing solely on specific percentiles of the forecast distribution. In contrast, this paper introduces a comprehensive framework that examines how economic indicators impact the entire forecast distribution of macroeconomic variables, facilitating the decomposition of the overall risk outlook into its underlying drivers. Additionally, the framework allows for the construction of risk measures that align with central bank preferences, serving as valuable summary statistics. Applied to the recent inflation surge, the framework reveals that U.S. inflation risk was primarily influenced by the recovery of the U.S. business cycle and surging commodity prices, partially mitigated by adjustments in monetary policy and credit spreads."}, "https://arxiv.org/abs/2405.17254": {"title": "Estimating treatment-effect heterogeneity across sites in multi-site randomized experiments with imperfect compliance", "link": "https://arxiv.org/abs/2405.17254", "description": "arXiv:2405.17254v1 Announce Type: new \nAbstract: We consider multi-site randomized controlled trials with a large number of small sites and imperfect compliance, conducted in non-random convenience samples in each site. We show that an Empirical-Bayes (EB) estimator can be used to estimate a lower bound of the variance of intention-to-treat (ITT) effects across sites. We also propose bounds for the coefficient from a regression of site-level ITTs on sites' control-group outcome. Turning to local average treatment effects (LATEs), the EB estimator cannot be used to estimate their variance, because site-level LATE estimators are biased. Instead, we propose two testable assumptions under which the LATEs' variance can be written as a function of sites' ITT and first-stage (FS) effects, thus allowing us to use an EB estimator leveraging only unbiased ITT and FS estimators. We revisit Behaghel et al. (2014), who study the effect of counselling programs on job seekers job-finding rate, in more than 200 job placement agencies in France. We find considerable ITT heterogeneity, and even more LATE heterogeneity: our lower bounds on ITTs' (resp. LATEs') standard deviation are more than three (resp. four) times larger than the average ITT (resp. LATE) across sites. Sites with a lower job-finding rate in the control group have larger ITT effects."}, "https://arxiv.org/abs/2405.17259": {"title": "The state learner -- a super learner for right-censored data", "link": "https://arxiv.org/abs/2405.17259", "description": "arXiv:2405.17259v1 Announce Type: new \nAbstract: In survival analysis, prediction models are needed as stand-alone tools and in applications of causal inference to estimate nuisance parameters. 
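Here is a minimal sketch of the permutation-based reliance idea referenced in the black-box decision-maker abstract above: permute the variable of interest and record how often the decisions change. The decision rule, variable names, and flip-rate summary are hypothetical stand-ins, not the paper's exact measure.

```python
# Minimal sketch: permutation-based reliance of a black-box decision rule on a variable.
import numpy as np

rng = np.random.default_rng(6)
n = 2000
X = {"experience": rng.normal(size=n),
     "gender": rng.binomial(1, 0.5, n),
     "alignment": rng.binomial(1, 0.5, n)}

def black_box_decision(X):
    """Stand-in for an opaque decision-maker (e.g., interrupt / do not interrupt)."""
    score = 0.8 * X["experience"] + 0.3 * X["gender"] + 0.1 * X["alignment"]
    return (score > 0).astype(int)

def reliance(X, var, n_perm=50):
    """Average share of decisions that flip when `var` is randomly permuted."""
    base = black_box_decision(X)
    flips = []
    for _ in range(n_perm):
        X_perm = dict(X)
        X_perm[var] = rng.permutation(X[var])
        flips.append(np.mean(black_box_decision(X_perm) != base))
    return float(np.mean(flips))

for var in X:
    print(f"reliance on {var}: {reliance(X, var):.3f}")
```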
The super learner is a machine learning algorithm which combines a library of prediction models into a meta learner based on cross-validated loss. In right-censored data, the choice of the loss function and the estimation of the expected loss need careful consideration. We introduce the state learner, a new super learner for survival analysis, which simultaneously evaluates libraries of prediction models for the event of interest and the censoring distribution. The state learner can be applied to all types of survival models, works in the presence of competing risks, and does not require a single pre-specified estimator of the conditional censoring distribution. We establish an oracle inequality for the state learner and investigate its performance through numerical experiments. We illustrate the application of the state learner with prostate cancer data, as a stand-alone prediction tool, and, for causal inference, as a way to estimate the nuisance parameter models of a smooth statistical functional."}, "https://arxiv.org/abs/2405.17265": {"title": "Assessing uncertainty in Gaussian mixtures-based entropy estimation", "link": "https://arxiv.org/abs/2405.17265", "description": "arXiv:2405.17265v1 Announce Type: new \nAbstract: Entropy estimation plays a crucial role in various fields, such as information theory, statistical data science, and machine learning. However, traditional entropy estimation methods often struggle with complex data distributions. Mixture-based estimation of entropy has been recently proposed and gained attention due to its ease of use and accuracy. This paper presents a novel approach to quantify the uncertainty associated with this mixture-based entropy estimation method using weighted likelihood bootstrap. Unlike standard methods, our approach leverages the underlying mixture structure by assigning random weights to observations in a weighted likelihood bootstrap procedure, leading to more accurate uncertainty estimation. The generation of weights is also investigated, leading to the proposal of using weights obtained from a Dirichlet distribution with parameter $\\alpha = 0.8137$ instead of the usual $\\alpha = 1$. Furthermore, the use of centered percentile intervals emerges as the preferred choice to ensure empirical coverage close to the nominal level. Extensive simulation studies comparing different resampling strategies are presented and results discussed. The proposed approach is illustrated by analyzing the log-returns of daily Gold prices at COMEX for the years 2014--2022, and the Net Rating scores, an advanced statistic used in basketball analytics, for NBA teams with reference to the 2022/23 regular season."}, "https://arxiv.org/abs/2405.17290": {"title": "Count Data Models with Heterogeneous Peer Effects under Rational Expectations", "link": "https://arxiv.org/abs/2405.17290", "description": "arXiv:2405.17290v1 Announce Type: new \nAbstract: This paper develops a micro-founded peer effect model for count responses using a game of incomplete information. The model incorporates heterogeneity in peer effects through agents' groups based on observed characteristics. Parameter identification is established using the identification condition of linear models, which relies on the presence of friends' friends who are not direct friends in the network. I show that this condition extends to a large class of nonlinear models. The model parameters are estimated using the nested pseudo-likelihood approach, controlling for network endogeneity. 
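To illustrate the weighted likelihood bootstrap used in the entropy-estimation abstract above, here is a minimal sketch simplified to a single Gaussian, whose weighted maximum-likelihood estimates and entropy are available in closed form; the paper works with mixtures instead. The Dirichlet parameter 0.8137 follows the value quoted in the abstract, while the data-generating values and number of replicates are ours.

```python
# Minimal sketch: weighted likelihood bootstrap for entropy, simplified to a single
# Gaussian with closed-form weighted MLEs, using Dirichlet(0.8137) observation weights.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=500)
n, alpha, B = x.size, 0.8137, 1000

draws = []
for _ in range(B):
    w = rng.dirichlet(np.full(n, alpha))                 # random observation weights
    mu = np.sum(w * x)                                   # weighted MLE of the mean
    var = np.sum(w * (x - mu) ** 2)                      # weighted MLE of the variance
    draws.append(0.5 * np.log(2 * np.pi * np.e * var))   # Gaussian entropy

lo, hi = np.percentile(draws, [2.5, 97.5])               # centered percentile interval
truth = 0.5 * np.log(2 * np.pi * np.e * 1.5 ** 2)
print(f"95% interval for the entropy: ({lo:.3f}, {hi:.3f}); truth {truth:.3f}")
```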
I present an empirical application on students' participation in extracurricular activities. I find that females are more responsive to their peers than males, whereas male peers do not influence male students. An easy-to-use R package, named CDatanet, is available for implementing the model."}, "https://arxiv.org/abs/2405.15950": {"title": "A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction", "link": "https://arxiv.org/abs/2405.15950", "description": "arXiv:2405.15950v1 Announce Type: cross \nAbstract: Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that deviate substantially from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased, while those for small-valued outcomes are positively biased. We refer to this linear central tendency warped bias as the \"systematic bias of machine learning regression\". In this paper, we first demonstrate that this issue persists across various machine learning models, and then delve into its theoretical underpinnings. We propose a general constrained optimization approach designed to correct this bias and develop a computationally efficient algorithm to implement our method. Our simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning models, our method effectively addresses the longstanding issue of \"systematic bias of machine learning regression\" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age."}, "https://arxiv.org/abs/2405.16055": {"title": "Federated Learning for Non-factorizable Models using Deep Generative Prior Approximations", "link": "https://arxiv.org/abs/2405.16055", "description": "arXiv:2405.16055v1 Announce Type: cross \nAbstract: Federated learning (FL) allows for collaborative model training across decentralized clients while preserving privacy by avoiding data sharing. However, current FL methods assume conditional independence between client models, limiting the use of priors that capture dependence, such as Gaussian processes (GPs). We introduce the Structured Independence via deep Generative Model Approximation (SIGMA) prior, which enables FL for non-factorizable models across clients, expanding the applicability of FL to fields such as spatial statistics, epidemiology, environmental science, and other domains where modeling dependencies is crucial. The SIGMA prior is a pre-trained deep generative model that approximates the desired prior and induces a specified conditional independence structure in the latent variables, creating an approximate model suitable for FL settings. We demonstrate the SIGMA prior's effectiveness on synthetic data and showcase its utility in a real-world example of FL for spatial data, using a conditional autoregressive prior to model spatial dependence across Australia.
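The "systematic bias of machine learning regression" described above is easy to reproduce. The following minimal sketch fits a random forest to simulated data and summarizes the warping by regressing the prediction error on the true outcome; the paper's constrained-optimization correction is not implemented, and the model and data choices are ours.

```python
# Minimal sketch: predictions for large outcomes are biased downward and for small
# outcomes upward, visible as a negative slope of (prediction error) on the truth.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n_train, n_test, p = 2000, 2000, 10
X = rng.normal(size=(n_train + n_test, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_train + n_test)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:n_train], y[:n_train])
pred = model.predict(X[n_train:])
y_test = y[n_train:]

slope = np.polyfit(y_test, pred - y_test, 1)[0]
print(f"slope of prediction error vs. true outcome: {slope:.2f} (0 would mean no warping)")
```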
Our work enables new FL applications in domains where modeling dependent data is essential for accurate predictions and decision-making."}, "https://arxiv.org/abs/2405.16069": {"title": "IncomeSCM: From tabular data set to time-series simulator and causal estimation benchmark", "link": "https://arxiv.org/abs/2405.16069", "description": "arXiv:2405.16069v1 Announce Type: cross \nAbstract: Evaluating observational estimators of causal effects demands information that is rarely available: unconfounded interventions and outcomes from the population of interest, created either by randomization or adjustment. As a result, it is customary to fall back on simulators when creating benchmark tasks. Simulators offer great control but are often too simplistic to make challenging tasks, either because they are hand-designed and lack the nuances of real-world data, or because they are fit to observational data without structural constraints. In this work, we propose a general, repeatable strategy for turning observational data into sequential structural causal models and challenging estimation tasks by following two simple principles: 1) fitting real-world data where possible, and 2) creating complexity by composing simple, hand-designed mechanisms. We implement these ideas in a highly configurable software package and apply it to the well-known Adult income data set to construct the IncomeSCM simulator. From this, we devise multiple estimation tasks and sample data sets to compare established estimators of causal effects. The tasks present a suitable challenge, with effect estimates varying greatly in quality between methods, despite similar performance in the modeling of factual outcomes, highlighting the need for dedicated causal estimators and model selection criteria."}, "https://arxiv.org/abs/2405.16130": {"title": "Automating the Selection of Proxy Variables of Unmeasured Confounders", "link": "https://arxiv.org/abs/2405.16130", "description": "arXiv:2405.16130v1 Announce Type: cross \nAbstract: Recently, interest has grown in the use of proxy variables of unobserved confounding for inferring the causal effect in the presence of unmeasured confounders from observational data. One difficulty inhibiting the practical use is finding valid proxy variables of unobserved confounding to a target causal effect of interest. These proxy variables are typically justified by background knowledge. In this paper, we investigate the estimation of causal effects among multiple treatments and a single outcome, all of which are affected by unmeasured confounders, within a linear causal model, without prior knowledge of the validity of proxy variables. To be more specific, we first extend the existing proxy variable estimator, originally addressing a single unmeasured confounder, to accommodate scenarios where multiple unmeasured confounders exist between the treatments and the outcome. Subsequently, we present two different sets of precise identifiability conditions for selecting valid proxy variables of unmeasured confounders, based on the second-order statistics and higher-order statistics of the data, respectively. Moreover, we propose two data-driven methods for the selection of proxy variables and for the unbiased estimation of causal effects. Theoretical analysis demonstrates the correctness of our proposed algorithms. 
Experimental results on both synthetic and real-world data show the effectiveness of the proposed approach."}, "https://arxiv.org/abs/2405.16250": {"title": "Conformal Robust Control of Linear Systems", "link": "https://arxiv.org/abs/2405.16250", "description": "arXiv:2405.16250v1 Announce Type: cross \nAbstract: End-to-end engineering design pipelines, in which designs are evaluated using concurrently defined optimal controllers, are becoming increasingly common in practice. To discover designs that perform well even under the misspecification of system dynamics, such end-to-end pipelines have now begun evaluating designs with a robust control objective in place of the nominal optimal control setup. Current approaches of specifying such robust control subproblems, however, rely on hand specification of perturbations anticipated to be present upon deployment or margin methods that ignore problem structure, resulting in a lack of theoretical guarantees and overly conservative empirical performance. We, instead, propose a novel methodology for LQR systems that leverages conformal prediction to specify such uncertainty regions in a data-driven fashion. Such regions have distribution-free coverage guarantees on the true system dynamics, in turn allowing for a probabilistic characterization of the regret of the resulting robust controller. We then demonstrate that such a controller can be efficiently produced via a novel policy gradient method that has convergence guarantees. We finally demonstrate the superior empirical performance of our method over alternate robust control specifications in a collection of engineering control systems, specifically for airfoils and a load-positioning system."}, "https://arxiv.org/abs/2405.16455": {"title": "On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization", "link": "https://arxiv.org/abs/2405.16455", "description": "arXiv:2405.16455v1 Announce Type: cross \nAbstract: Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. 
Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF."}, "https://arxiv.org/abs/2405.16564": {"title": "Contextual Linear Optimization with Bandit Feedback", "link": "https://arxiv.org/abs/2405.16564", "description": "arXiv:2405.16564v1 Announce Type: cross \nAbstract: Contextual linear optimization (CLO) uses predictive observations to reduce uncertainty in random cost coefficients and thereby improve average-cost performance. An example is a stochastic shortest path with random edge costs (e.g., traffic) and predictive features (e.g., lagged traffic, weather). Existing work on CLO assumes the data has fully observed cost coefficient vectors, but in many applications, we can only see the realized cost of a historical decision, that is, just one projection of the random cost coefficient vector, to which we refer as bandit feedback. We study a class of algorithms for CLO with bandit feedback, which we term induced empirical risk minimization (IERM), where we fit a predictive model to directly optimize the downstream performance of the policy it induces. We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses. A byproduct of our theory of independent interest is fast-rate regret bound for IERM with full feedback and misspecified policy class. We compare the performance of different modeling choices numerically using a stochastic shortest path example and provide practical insights from the empirical results."}, "https://arxiv.org/abs/2405.16672": {"title": "Transfer Learning Under High-Dimensional Graph Convolutional Regression Model for Node Classification", "link": "https://arxiv.org/abs/2405.16672", "description": "arXiv:2405.16672v1 Announce Type: cross \nAbstract: Node classification is a fundamental task, but obtaining node classification labels can be challenging and expensive in many real-world scenarios. Transfer learning has emerged as a promising solution to address this challenge by leveraging knowledge from source domains to enhance learning in a target domain. Existing transfer learning methods for node classification primarily focus on integrating Graph Convolutional Networks (GCNs) with various transfer learning techniques. While these approaches have shown promising results, they often suffer from a lack of theoretical guarantees, restrictive conditions, and high sensitivity to hyperparameter choices. To overcome these limitations, we propose a Graph Convolutional Multinomial Logistic Regression (GCR) model and a transfer learning method based on the GCR model, called Trans-GCR. We provide theoretical guarantees of the estimate obtained under GCR model in high-dimensional settings. Moreover, Trans-GCR demonstrates superior empirical performance, has a low computational cost, and requires fewer hyperparameters than existing methods."}, "https://arxiv.org/abs/2405.17178": {"title": "Statistical Mechanism Design: Robust Pricing, Estimation, and Inference", "link": "https://arxiv.org/abs/2405.17178", "description": "arXiv:2405.17178v1 Announce Type: cross \nAbstract: This paper tackles challenges in pricing and revenue projections due to consumer uncertainty. 
We propose a novel data-based approach for firms facing unknown consumer type distributions. Unlike existing methods, we assume firms only observe a finite sample of consumers' types. We introduce \\emph{empirically optimal mechanisms}, a simple and intuitive class of sample-based mechanisms with strong finite-sample revenue guarantees. Furthermore, we leverage our results to develop a toolkit for statistical inference on profits. Our approach allows us to reliably estimate the profits associated with any particular mechanism, to construct confidence intervals, and, more generally, to conduct valid hypothesis testing."}, "https://arxiv.org/abs/2405.17318": {"title": "Extremal correlation coefficient for functional data", "link": "https://arxiv.org/abs/2405.17318", "description": "arXiv:2405.17318v1 Announce Type: cross \nAbstract: We propose a coefficient that measures dependence in paired samples of functions. It has properties similar to the Pearson correlation, but differs in significant ways: 1) it is designed to measure dependence between curves, 2) it focuses only on extreme curves. The new coefficient is derived within the framework of regular variation in Banach spaces. A consistent estimator is proposed and justified by an asymptotic analysis and a simulation study. The usefulness of the new coefficient is illustrated on financial and climate functional data."}, "https://arxiv.org/abs/2112.13398": {"title": "Long Story Short: Omitted Variable Bias in Causal Machine Learning", "link": "https://arxiv.org/abs/2112.13398", "description": "arXiv:2112.13398v5 Announce Type: replace \nAbstract: We develop a general theory of omitted variable bias for a wide range of common causal parameters, including (but not limited to) averages of potential outcomes, average treatment effects, average causal derivatives, and policy effects from covariate shifts. Our theory applies to nonparametric models, while naturally allowing for (semi-)parametric restrictions (such as partial linearity) when such assumptions are made. We show how simple plausibility judgments on the maximum explanatory power of omitted variables are sufficient to bound the magnitude of the bias, thus facilitating sensitivity analysis in otherwise complex, nonlinear models. Finally, we provide flexible and efficient statistical inference methods for the bounds, which can leverage modern machine learning algorithms for estimation. These results allow empirical researchers to perform sensitivity analyses in a flexible class of machine-learned causal models using very simple, and interpretable, tools. We demonstrate the utility of our approach with two empirical examples."}, "https://arxiv.org/abs/2208.08638": {"title": "Lost in the Shuffle: Testing Power in the Presence of Errorful Network Vertex Labels", "link": "https://arxiv.org/abs/2208.08638", "description": "arXiv:2208.08638v5 Announce Type: replace \nAbstract: Two-sample network hypothesis testing is an important inference task with applications across diverse fields such as medicine, neuroscience, and sociology. Many of these testing methodologies operate under the implicit assumption that the vertex correspondence across networks is a priori known. This assumption is often untrue, and the power of the subsequent test can degrade when there are misaligned/label-shuffled vertices across networks. 
This power loss due to shuffling is theoretically explored in the context of random dot product and stochastic block model networks for a pair of hypothesis tests based on Frobenius norm differences between estimated edge probability matrices or between adjacency matrices. The loss in testing power is further reinforced by numerous simulations and experiments, both in the stochastic block model and in the random dot product graph model, where the power loss across multiple recently proposed tests in the literature is considered. Lastly, the impact that shuffling can have in real-data testing is demonstrated in a pair of examples from neuroscience and from social network analysis."}, "https://arxiv.org/abs/2210.00697": {"title": "A flexible model for correlated count data, with application to multi-condition differential expression analyses of single-cell RNA sequencing data", "link": "https://arxiv.org/abs/2210.00697", "description": "arXiv:2210.00697v3 Announce Type: replace \nAbstract: Detecting differences in gene expression is an important part of single-cell RNA sequencing experiments, and many statistical methods have been developed for this aim. Most differential expression analyses focus on comparing expression between two groups (e.g., treatment vs. control). But there is increasing interest in multi-condition differential expression analyses in which expression is measured in many conditions, and the aim is to accurately detect and estimate expression differences in all conditions. We show that directly modeling single-cell RNA-seq counts in all conditions simultaneously, while also inferring how expression differences are shared across conditions, leads to greatly improved performance for detecting and estimating expression differences compared to existing methods. We illustrate the potential of this new approach by analyzing data from a single-cell experiment studying the effects of cytokine stimulation on gene expression. We call our new method \"Poisson multivariate adaptive shrinkage\", and it is implemented in an R package available online at https://github.com/stephenslab/poisson.mash.alpha."}, "https://arxiv.org/abs/2301.09694": {"title": "Flexible Modeling of Demographic Transition Processes with a Bayesian Hierarchical B-splines Model", "link": "https://arxiv.org/abs/2301.09694", "description": "arXiv:2301.09694v2 Announce Type: replace \nAbstract: Several demographic and health indicators, including the total fertility rate (TFR) and modern contraceptive use rate (mCPR), evolve similarly over time, characterized by a transition between stable states. Existing approaches for estimation or projection of transitions in multiple populations have successfully used parametric functions to capture the relation between the rate of change of an indicator and its level. However, incorrect parametric forms may result in bias or incorrect coverage in long-term projections. We propose a new class of models to capture demographic transitions in multiple populations. Our proposal, the B-spline Transition Model (BTM), models the relationship between the rate of change of an indicator and its level using B-splines, allowing for data-adaptive estimation of transition functions. Bayesian hierarchical models are used to share information on the transition function between populations. We apply the BTM to estimate and project country-level TFR and mCPR and compare the results against those from extant parametric models. 
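The "Lost in the Shuffle" entry above refers to test statistics built from Frobenius-norm differences between adjacency (or estimated edge-probability) matrices. The toy sketch below is not the paper's testing procedure; it only illustrates, under an assumed two-block stochastic block model, how that statistic is computed and how permuting one network's vertex labels changes it:

```python
import numpy as np

rng = np.random.default_rng(0)

def sbm(block_sizes, B, rng):
    """Sample an undirected stochastic block model adjacency matrix."""
    z = np.repeat(np.arange(len(block_sizes)), block_sizes)
    P = B[np.ix_(z, z)]
    U = rng.random(P.shape)
    A = np.triu((U < P).astype(float), k=1)
    return A + A.T

B1 = np.array([[0.5, 0.2], [0.2, 0.5]])
B2 = np.array([[0.5, 0.2], [0.2, 0.3]])     # second block differs, so the networks differ
A1 = sbm([50, 50], B1, rng)
A2 = sbm([50, 50], B2, rng)

stat_aligned = np.linalg.norm(A1 - A2, "fro")       # correct vertex correspondence

perm = rng.permutation(A2.shape[0])                 # errorful / shuffled vertex labels
A2_shuffled = A2[np.ix_(perm, perm)]
stat_shuffled = np.linalg.norm(A1 - A2_shuffled, "fro")

print(f"aligned statistic:  {stat_aligned:.2f}")
print(f"shuffled statistic: {stat_shuffled:.2f}")
```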
For TFR, BTM projections have generally lower error than the comparison model. For mCPR, while results are comparable between BTM and a parametric approach, the B-spline model generally improves out-of-sample predictions. The case studies suggest that the BTM may be considered for demographic applications"}, "https://arxiv.org/abs/2302.01607": {"title": "dynamite: An R Package for Dynamic Multivariate Panel Models", "link": "https://arxiv.org/abs/2302.01607", "description": "arXiv:2302.01607v2 Announce Type: replace \nAbstract: dynamite is an R package for Bayesian inference of intensive panel (time series) data comprising multiple measurements per multiple individuals measured in time. The package supports joint modeling of multiple response variables, time-varying and time-invariant effects, a wide range of discrete and continuous distributions, group-specific random effects, latent factors, and customization of prior distributions of the model parameters. Models in the package are defined via a user-friendly formula interface, and estimation of the posterior distribution of the model parameters takes advantage of state-of-the-art Markov chain Monte Carlo methods. The package enables efficient computation of both individual-level and summarized predictions and offers a comprehensive suite of tools for visualization and model diagnostics."}, "https://arxiv.org/abs/2307.13686": {"title": "Characteristics and Predictive Modeling of Short-term Impacts of Hurricanes on the US Employment", "link": "https://arxiv.org/abs/2307.13686", "description": "arXiv:2307.13686v3 Announce Type: replace \nAbstract: The physical and economic damages of hurricanes can acutely affect employment and the well-being of employees. However, a comprehensive understanding of these impacts remains elusive as many studies focused on narrow subsets of regions or hurricanes. Here we present an open-source dataset that serves interdisciplinary research on hurricane impacts on US employment. Compared to past domain-specific efforts, this dataset has greater spatial-temporal granularity and variable coverage. To demonstrate potential applications of this dataset, we focus on the short-term employment disruptions related to hurricanes during 1990-2020. The observed county-level employment changes in the initial month are small on average, though large employment losses (>30%) can occur after extreme storms. The overall small changes partly result from compensation among different employment sectors, which may obscure large, concentrated employment losses after hurricanes. Additional econometric analyses concur on the post-storm employment losses in hospitality and leisure but disagree on employment changes in the other industries. The dataset also enables data-driven analyses that highlight vulnerabilities such as pronounced employment losses related to Puerto Rico and rainy hurricanes. Furthermore, predictive modeling of short-term employment changes shows promising performance for service-providing industries and high-impact storms. In the examined cases, the nonlinear Random Forests model greatly outperforms the multiple linear regression model. The nonlinear model also suggests that more severe hurricane hazards projected by physical models may cause more extreme losses in US service-providing employment. 
Finally, we share our dataset and analytical code to facilitate the study and modeling of hurricane impacts in a changing climate."}, "https://arxiv.org/abs/2308.05205": {"title": "Dynamic survival analysis: modelling the hazard function via ordinary differential equations", "link": "https://arxiv.org/abs/2308.05205", "description": "arXiv:2308.05205v4 Announce Type: replace \nAbstract: The hazard function represents one of the main quantities of interest in the analysis of survival data. We propose a general approach for parametrically modelling the dynamics of the hazard function using systems of autonomous ordinary differential equations (ODEs). This modelling approach can be used to provide qualitative and quantitative analyses of the evolution of the hazard function over time. Our proposal capitalises on the extensive literature on ODEs, which, in particular, allows for establishing basic rules or laws on the dynamics of the hazard function via the use of autonomous ODEs. We show how to implement the proposed modelling framework in cases where there is an analytic solution to the system of ODEs or where an ODE solver is required to obtain a numerical solution. We focus on the use of a Bayesian modelling approach, but the proposed methodology can also be coupled with maximum likelihood estimation. A simulation study is presented to illustrate the performance of these models and the interplay of sample size and censoring. Two case studies using real data are presented to illustrate the use of the proposed approach and to highlight the interpretability of the corresponding models. We conclude with a discussion on potential extensions of our work and strategies to include covariates into our framework. Although we focus on examples in Medical Statistics, the proposed framework is applicable in any context where the interest lies in estimating and interpreting the dynamics of the hazard function."}, "https://arxiv.org/abs/2308.10583": {"title": "The Multivariate Bernoulli detector: Change point estimation in discrete survival analysis", "link": "https://arxiv.org/abs/2308.10583", "description": "arXiv:2308.10583v2 Announce Type: replace \nAbstract: Time-to-event data are often recorded on a discrete scale with multiple, competing risks as potential causes for the event. In this context, application of continuous survival analysis methods with a single risk suffers from biased estimation. Therefore, we propose the Multivariate Bernoulli detector for competing risks with discrete times involving a multivariate change point model on the cause-specific baseline hazards. Through the prior on the number of change points and their location, we impose dependence between change points across risks, as well as allowing for data-driven learning of their number. Then, conditionally on these change points, a Multivariate Bernoulli prior is used to infer which risks are involved. The focus of posterior inference is on cause-specific hazard rates and dependence across risks. Such dependence is often present due to subject-specific changes across time that affect all risks. Full posterior inference is performed through a tailored local-global Markov chain Monte Carlo (MCMC) algorithm, which exploits a data augmentation trick and MCMC updates from non-conjugate Bayesian nonparametric methods. 
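A toy illustration of the idea in the "Dynamic survival analysis" entry above: a hazard governed by a hypothetical autonomous ODE (here a logistic law with illustrative parameter values, not a model taken from the paper), solved numerically together with the cumulative hazard so the survival function can be recovered:

```python
import numpy as np
from scipy.integrate import solve_ivp

a, K, h0 = 1.5, 2.0, 0.1   # assumed parameter values, purely illustrative

def rhs(t, y):
    h, H = y
    return [a * h * (1.0 - h / K),   # autonomous law for the hazard h(t)
            h]                        # dH/dt = h(t), the cumulative hazard

sol = solve_ivp(rhs, (0.0, 5.0), [h0, 0.0], dense_output=True)
t = np.linspace(0.0, 5.0, 6)
h, H = sol.sol(t)
S = np.exp(-H)                        # survival function S(t) = exp(-H(t))
for ti, hi, Si in zip(t, h, S):
    print(f"t={ti:.1f}  hazard={hi:.3f}  survival={Si:.3f}")
```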
We illustrate our model in simulations and on ICU data, comparing its performance with existing approaches."}, "https://arxiv.org/abs/2310.12402": {"title": "Sufficient dimension reduction for regression with metric space-valued responses", "link": "https://arxiv.org/abs/2310.12402", "description": "arXiv:2310.12402v2 Announce Type: replace \nAbstract: Data visualization and dimension reduction for regression between a general metric space-valued response and Euclidean predictors are proposed. Current Fr\\'echet dimension reduction methods require that the response metric space be continuously embeddable into a Hilbert space, which imposes restrictions on the type of metric and kernel choice. We relax this assumption by proposing a Euclidean embedding technique which avoids the use of kernels. Under this framework, classical dimension reduction methods such as ordinary least squares and sliced inverse regression are extended. An extensive simulation experiment demonstrates the superior performance of the proposed method on synthetic data compared to existing methods where applicable. Real data analyses of factors influencing the distribution of COVID-19 transmission in the U.S. and of the association between BMI and structural brain connectivity of healthy individuals are also investigated."}, "https://arxiv.org/abs/2310.12427": {"title": "Fast Power Curve Approximation for Posterior Analyses", "link": "https://arxiv.org/abs/2310.12427", "description": "arXiv:2310.12427v2 Announce Type: replace \nAbstract: Bayesian hypothesis tests leverage posterior probabilities, Bayes factors, or credible intervals to inform data-driven decision making. We propose a framework for power curve approximation with such hypothesis tests. We present a fast approach to explore the approximate sampling distribution of posterior probabilities when the conditions for the Bernstein-von Mises theorem are satisfied. We extend that approach to consider segments of such sampling distributions in a targeted manner for each sample size explored. These sampling distribution segments are used to construct power curves for various types of posterior analyses. Our resulting method for power curve approximation is orders of magnitude faster than conventional power curve estimation for Bayesian hypothesis tests. We also prove the consistency of the corresponding power estimates and sample size recommendations under certain conditions."}, "https://arxiv.org/abs/2311.02543": {"title": "Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling", "link": "https://arxiv.org/abs/2311.02543", "description": "arXiv:2311.02543v2 Announce Type: replace \nAbstract: This paper discusses estimation and limited information goodness-of-fit test statistics in factor models for binary data using pairwise likelihood estimation and sampling weights. The paper extends the applicability of pairwise likelihood estimation for factor models with binary data to accommodate complex sampling designs. Additionally, it introduces two key limited information test statistics: the Pearson chi-squared test and the Wald test. To enhance computational efficiency, the paper introduces modifications to both test statistics. 
The performance of the estimation and the proposed test statistics under simple random sampling and unequal probability sampling is evaluated using simulated data."}, "https://arxiv.org/abs/2311.13017": {"title": "W-kernel and essential subspace for frequencist evaluation of Bayesian estimators", "link": "https://arxiv.org/abs/2311.13017", "description": "arXiv:2311.13017v2 Announce Type: replace \nAbstract: The posterior covariance matrix W defined by the log-likelihood of each observation plays important roles both in the sensitivity analysis and frequencist evaluation of the Bayesian estimators. This study is focused on the matrix W and its principal space; we term the latter the essential subspace. A key tool for treating frequencist properties is the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)). The matrix W can be interpreted as a reproducing kernel and is denoted as W-kernel. Using W-kernel, the essential subspace is expressed as a principal space given by the kernel principal component analysis. A relation to the Fisher kernel and neural tangent kernel is established, which elucidates the connection to the classical asymptotic theory. We also discuss a type of Bayesian-frequencist duality, which arises naturally from the kernel framework. Finally, two applications are discussed: the selection of a representative set of observations and dimensional reduction in the approximate bootstrap. In the former, incomplete Cholesky decomposition is introduced as an efficient method for computing the essential subspace. In the latter, different implementations of the approximate bootstrap for posterior means are compared."}, "https://arxiv.org/abs/2312.06098": {"title": "Mixture Matrix-valued Autoregressive Model", "link": "https://arxiv.org/abs/2312.06098", "description": "arXiv:2312.06098v2 Announce Type: replace \nAbstract: Time series of matrix-valued data are increasingly available in various areas including economics, finance, social science, etc. These data may shed light on the inter-dynamical relationships between two sets of attributes, for instance, countries and economic indices. The matrix autoregressive (MAR) model provides a parsimonious approach for analyzing such data. However, the MAR model, being a linear model with parametric constraints, cannot capture the nonlinear patterns in the data, such as regime shifts in the dynamics. We propose a mixture matrix autoregressive (MMAR) model for analyzing potential regime shifts in the dynamics between two attributes, for instance, due to recession vs. booming, or quiet period vs. pandemic. We propose an EM algorithm for maximum likelihood estimation. We derive some theoretical properties of the proposed method including consistency and asymptotic distribution, and illustrate its performance via simulations and real applications."}, "https://arxiv.org/abs/2312.10596": {"title": "A maximin optimal approach for sampling designs in two-phase studies", "link": "https://arxiv.org/abs/2312.10596", "description": "arXiv:2312.10596v2 Announce Type: replace \nAbstract: Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often employed to save data collection costs. In two-phase studies, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a subset of subjects in the second phase based on a predetermined sampling rule. 
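A hedged sketch of the central object in the W-kernel entry above: forming the posterior covariance matrix of per-observation log-likelihoods from posterior draws and taking its leading eigenvectors as an "essential subspace". The normal-mean setup, the use of i.i.d. approximate-posterior draws instead of MCMC, and the choice of three components are all assumptions for illustration only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=1.0, size=40)                 # observed data
theta_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=2000)

# loglik[s, i] = log p(y_i | theta_s), one row per posterior draw
loglik = norm.logpdf(y[None, :], loc=theta_draws[:, None], scale=1.0)

W = np.cov(loglik, rowvar=False)            # n x n posterior covariance of log-likelihoods
eigval, eigvec = np.linalg.eigh(W)
order = np.argsort(eigval)[::-1]
k = 3
essential_basis = eigvec[:, order[:k]]      # top-k principal directions (kernel-PCA style)
print("leading eigenvalues of W:", np.round(eigval[order[:k]], 4))
```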
The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on designing sampling rules for estimating a scalar parameter in some parametric models or specific estimating problems. However, real-world scenarios are usually model-unknown and involve two-phase designs for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method is model-free and applicable to general estimating problems. The resulting sampling rule can minimize the semiparametric efficiency bound when the parameter is scalar and improve the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. The implementation of the proposed design is illustrated in a real data analysis."}, "https://arxiv.org/abs/2007.03069": {"title": "Outcome-Driven Dynamic Refugee Assignment with Allocation Balancing", "link": "https://arxiv.org/abs/2007.03069", "description": "arXiv:2007.03069v5 Announce Type: replace-cross \nAbstract: This study proposes two new dynamic assignment algorithms to match refugees and asylum seekers to geographic localities within a host country. The first, currently implemented in a multi-year randomized control trial in Switzerland, seeks to maximize the average predicted employment level (or any measured outcome of interest) of refugees through a minimum-discord online assignment algorithm. The performance of this algorithm is tested on real refugee resettlement data from both the US and Switzerland, where we find that it is able to achieve near-optimal expected employment compared to the hindsight-optimal solution, and is able to improve upon the status quo procedure by 40-50%. However, pure outcome maximization can result in a periodically imbalanced allocation to the localities over time, leading to implementation difficulties and an undesirable workflow for resettlement resources and agents. To address these problems, the second algorithm balances the goal of improving refugee outcomes with the desire for an even allocation over time. We find that this algorithm can achieve near-perfect balance over time with only a small loss in expected employment compared to the employment-maximizing algorithm. In addition, the allocation balancing algorithm offers a number of ancillary benefits compared to pure outcome maximization, including robustness to unknown arrival flows and greater exploration."}, "https://arxiv.org/abs/2012.02985": {"title": "Selecting the number of components in PCA via random signflips", "link": "https://arxiv.org/abs/2012.02985", "description": "arXiv:2012.02985v3 Announce Type: replace-cross \nAbstract: Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, it turns out that these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. 
This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of \"empirical null\" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an illustration on data from astronomy."}, "https://arxiv.org/abs/2311.01412": {"title": "Causal Temporal Regime Structure Learning", "link": "https://arxiv.org/abs/2311.01412", "description": "arXiv:2311.01412v2 Announce Type: replace-cross \nAbstract: We address the challenge of structure learning from multivariate time series that are characterized by a sequence of different, unknown regimes. We introduce a new optimization-based method (CASTOR) that concurrently learns the Directed Acyclic Graph (DAG) for each regime and determines the number of regimes along with their sequential arrangement. Through the optimization of a score function via an expectation maximization (EM) algorithm, CASTOR alternates between learning the regime indices (Expectation step) and inferring causal relationships in each regime (Maximization step). We further prove the identifiability of regimes and DAGs within the CASTOR framework. We conduct extensive experiments and show that our method consistently outperforms causal discovery models across various settings (linear and nonlinear causal relationships) and datasets (synthetic and real data)."}, "https://arxiv.org/abs/2405.17591": {"title": "Individualized Dynamic Mediation Analysis Using Latent Factor Models", "link": "https://arxiv.org/abs/2405.17591", "description": "arXiv:2405.17591v1 Announce Type: new \nAbstract: Mediation analysis plays a crucial role in causal inference as it can investigate the pathways through which treatment influences outcome. Most existing mediation analyses assume that mediation effects are static and homogeneous within populations. However, mediation effects usually change over time and exhibit significant heterogeneity in many real-world applications. Additionally, the presence of unobserved confounding variables poses a significant challenge to inferring both causal and mediation effects. To address these issues, we propose an individualized dynamic mediation analysis method. Our approach can identify the significant mediators at the population level while capturing the time-varying and heterogeneous mediation effects via latent factor modeling on coefficients of structural equation models. Another advantage of our method is that we can infer individualized mediation effects in the presence of unmeasured time-varying confounders. We provide estimation consistency for our proposed causal estimand and selection consistency for significant mediators. 
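A minimal sketch of the signflip idea in the FlipPA entry above, not the authors' implementation: data singular values are compared against a quantile of the top singular value of sign-flipped "empirical null" matrices (the simulated low-rank-plus-noise data and the single-threshold rule are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, true_rank = 200, 50, 2
X = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, d)) * 0.8
X += rng.normal(size=(n, d))                     # noisy low-rank data

data_sv = np.linalg.svd(X, compute_uv=False)

n_null, alpha = 50, 0.05
null_top = np.empty(n_null)
for b in range(n_null):
    signs = rng.choice([-1.0, 1.0], size=X.shape)     # flip each entry's sign w.p. 1/2
    null_top[b] = np.linalg.svd(X * signs, compute_uv=False)[0]

threshold = np.quantile(null_top, 1.0 - alpha)
selected_rank = int(np.sum(data_sv > threshold))      # components rising above the null
print("selected rank:", selected_rank)
```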
Extensive simulation studies and an application to a DNA methylation study demonstrate the effectiveness and advantages of our method."}, "https://arxiv.org/abs/2405.17669": {"title": "Bayesian Nonparametrics for Principal Stratification with Continuous Post-Treatment Variables", "link": "https://arxiv.org/abs/2405.17669", "description": "arXiv:2405.17669v1 Announce Type: new \nAbstract: Principal stratification provides a causal inference framework that allows adjustment for confounded post-treatment variables when comparing treatments. Although the literature has focused mainly on binary post-treatment variables, there is a growing interest in principal stratification involving continuous post-treatment variables. However, characterizing the latent principal strata with a continuous post-treatment presents a significant challenge, which is further complicated in observational studies where the treatment is not randomized. In this paper, we introduce the Confounders-Aware SHared atoms BAyesian mixture (CASBAH), a novel approach for principal stratification with continuous post-treatment variables that can be directly applied to observational studies. CASBAH leverages a dependent Dirichlet process, utilizing shared atoms across treatment levels, to effectively control for measured confounders and facilitate information sharing between treatment groups in the identification of principal strata membership. CASBAH also offers a comprehensive quantification of uncertainty surrounding the membership of the principal strata. Through Monte Carlo simulations, we show that the proposed methodology has excellent performance in characterizing the latent principal strata and estimating the effects of treatment on post-treatment variables and outcomes. Finally, CASBAH is applied to a case study in which we estimate the causal effects of US national air quality regulations on pollution levels and health outcomes."}, "https://arxiv.org/abs/2405.17684": {"title": "ZIKQ: An innovative centile chart method for utilizing natural history data in rare disease clinical development", "link": "https://arxiv.org/abs/2405.17684", "description": "arXiv:2405.17684v1 Announce Type: new \nAbstract: Utilizing natural history data as an external control plays an important role in the clinical development of rare diseases, since placebo groups in double-blind randomization trials may not be available due to ethical reasons and low disease prevalence. This article proposes an innovative approach for utilizing natural history data to support rare disease clinical development by constructing reference centile charts. Due to the deteriorating nature of certain rare diseases, the distributions of clinical endpoints can be age-dependent and have an absorbing state of zero, which can result in censored natural history data. Existing methods for reference centile charts cannot be directly applied to censored natural history data. Therefore, we propose a new calibrated zero-inflated kernel quantile (ZIKQ) estimation method to construct reference centile charts from censored natural history data. 
Using the application to Duchenne Muscular Dystrophy drug development, we demonstrate that the reference centile charts using the ZIKQ method can be implemented to evaluate treatment efficacy and facilitate a more targeted patient enrollment in rare disease clinical development."}, "https://arxiv.org/abs/2405.17707": {"title": "The Multiplex $p_2$ Model: Mixed-Effects Modeling for Multiplex Social Networks", "link": "https://arxiv.org/abs/2405.17707", "description": "arXiv:2405.17707v1 Announce Type: new \nAbstract: Social actors are often embedded in multiple social networks, and there is a growing interest in studying social systems from a multiplex network perspective. In this paper, we propose a mixed-effects model for cross-sectional multiplex network data that assumes dyads to be conditionally independent. Building on the uniplex $p_2$ model, we incorporate dependencies between different network layers via cross-layer dyadic effects and actor random effects. These cross-layer effects model the tendencies for ties between two actors and the ties to and from the same actor to be dependent across different relational dimensions. The model can also study the effect of actor and dyad covariates. As simulation-based goodness-of-fit analyses are common practice in applied network studies, we here propose goodness-of-fit measures for multiplex network analyses. We evaluate our choice of priors and the computational faithfulness and inferential properties of the proposed method through simulation. We illustrate the utility of the multiplex $p_2$ model in a replication study of a toxic chemical policy network. An original study that reflects on gossip as perceived by gossip senders and gossip targets, and their differences in perspectives, based on data from 34 Hungarian elementary school classes, highlights the applicability of the proposed method."}, "https://arxiv.org/abs/2405.17744": {"title": "Factor Augmented Matrix Regression", "link": "https://arxiv.org/abs/2405.17744", "description": "arXiv:2405.17744v1 Announce Type: new \nAbstract: We introduce \\underline{F}actor-\\underline{A}ugmented \\underline{Ma}trix \\underline{R}egression (FAMAR) to address the growing applications of matrix-variate data and their associated challenges, particularly with high-dimensionality and covariate correlations. FAMAR encompasses two key algorithms. The first is a novel non-iterative approach that efficiently estimates the factors and loadings of the matrix factor model, utilizing techniques of pre-training, diverse projection, and block-wise averaging. The second algorithm offers an accelerated solution for penalized matrix factor regression. Both algorithms are supported by established statistical and numerical convergence properties. Empirical evaluations, conducted on synthetic and real economics datasets, demonstrate FAMAR's superiority in terms of accuracy, interpretability, and computational speed. Our application to economic data showcases how matrix factors can be incorporated to predict the GDPs of the countries of interest, and the influence of these factors on the GDPs."}, "https://arxiv.org/abs/2405.17787": {"title": "Dyadic Regression with Sample Selection", "link": "https://arxiv.org/abs/2405.17787", "description": "arXiv:2405.17787v1 Announce Type: new \nAbstract: This paper addresses the sample selection problem in panel dyadic regression analysis. Dyadic data often include many zeros in the main outcomes due to the underlying network formation process. 
This not only contaminates popular estimators used in practice but also complicates the inference due to the dyadic dependence structure. We extend Kyriazidou (1997)'s approach to dyadic data and characterize the asymptotic distribution of our proposed estimator. The convergence rates are $\\sqrt{n}$ or $\\sqrt{n^{2}h_{n}}$, depending on the degeneracy of the H\\'{a}jek projection part of the estimator, where $n$ is the number of nodes and $h_{n}$ is a bandwidth. We propose a bias-corrected confidence interval and a variance estimator that adapts to the degeneracy. A Monte Carlo simulation shows the good finite-sample performance of our estimator and highlights the importance of bias correction in both asymptotic regimes when the fraction of zeros in outcomes varies. We illustrate our procedure using data from Moretti and Wilson (2017)'s paper on migration."}, "https://arxiv.org/abs/2405.17828": {"title": "On Robust Clustering of Temporal Point Process", "link": "https://arxiv.org/abs/2405.17828", "description": "arXiv:2405.17828v1 Announce Type: new \nAbstract: Clustering of event stream data is of great importance in many application scenarios, including, but not limited to, e-commerce, electronic health, online testing, and mobile music services. Existing clustering algorithms fail to take outlier data into consideration and are implemented without theoretical guarantees. In this paper, we propose a robust temporal point process clustering framework that works under mild assumptions and addresses several important issues in the event stream clustering problem. Specifically, we introduce a computationally efficient model-free distance function to quantify the dissimilarity between different event streams so that outliers can be detected and good initial clusters can be obtained. We further consider an expectation-maximization-type algorithm incorporating Catoni's influence function for robust estimation and fine-tuning of clusters. We also establish theoretical results, including algorithmic convergence, estimation error bounds, and outlier detection. Simulation results corroborate our theoretical findings and real data applications show the effectiveness of our proposed methodology."}, "https://arxiv.org/abs/2405.17919": {"title": "Fisher's Legacy of Directional Statistics, and Beyond to Statistics on Manifolds", "link": "https://arxiv.org/abs/2405.17919", "description": "arXiv:2405.17919v1 Announce Type: new \nAbstract: It will not be an exaggeration to say that R A Fisher is the Albert Einstein of Statistics. He pioneered almost all the main branches of statistics, but it is not as well known that he opened the area of Directional Statistics with his 1953 paper introducing a distribution on the sphere which is now known as the Fisher distribution. He stressed that for spherical data one should take into account that the data is on a manifold. We will describe this Fisher distribution and reanalyse his geological data. We also comment on the two goals he set himself in that paper, and how he reinvented the von Mises distribution on the circle. Since then, many extensions of this distribution have appeared bearing Fisher's name such as the von Mises Fisher distribution and the matrix Fisher distribution. In fact, the subject of Directional Statistics has grown tremendously in the last two decades with new applications emerging in Life Sciences, Image Analysis, Machine Learning and so on. 
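A hedged sketch related to the "On Robust Clustering of Temporal Point Process" entry above: Catoni's influence function and a robust location estimate obtained by solving sum_i psi(alpha * (x_i - theta)) = 0 by bisection. This illustrates only the Catoni-type robustification, not the clustering algorithm itself; alpha is an illustrative tuning value:

```python
import numpy as np

def catoni_psi(x):
    """Catoni's (widest) influence function."""
    return np.where(x >= 0,
                    np.log1p(x + 0.5 * x**2),
                    -np.log1p(-x + 0.5 * x**2))

def catoni_mean(x, alpha=1.0, tol=1e-8):
    """Robust location estimate: root of sum_i psi(alpha*(x_i - theta)) in theta."""
    lo, hi = x.min(), x.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if catoni_psi(alpha * (x - mid)).sum() > 0:
            lo = mid          # the root lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), np.full(5, 50.0)])  # a few gross outliers
print("sample mean :", round(x.mean(), 3))
print("catoni mean :", round(catoni_mean(x), 3))
```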
We give a recent new method of constructing the Fisher type distribution which has been motivated by some problems in Machine Learning. The subject related to his distribution has evolved since then more broadly as Statistics on Manifolds which also includes the new field of Shape Analysis. We end with a historical note pointing out some correspondence between D'Arcy Thompson and R A Fisher related to Shape Analysis."}, "https://arxiv.org/abs/2405.17954": {"title": "Comparison of predictive values with paired samples", "link": "https://arxiv.org/abs/2405.17954", "description": "arXiv:2405.17954v1 Announce Type: new \nAbstract: Positive predictive value and negative predictive value are two widely used parameters to assess the clinical usefulness of a medical diagnostic test. When there are two diagnostic tests, it is recommendable to make a comparative assessment of the values of these two parameters after applying the two tests to the same subjects (paired samples). The objective is then to make individual or global inferences about the difference or the ratio of the predictive value of the two diagnostic tests. These inferences are usually based on complex and not very intuitive expressions, some of which have subsequently been reformulated. We define the two properties of symmetry which any inference method must verify - symmetry in diagnoses and symmetry in the tests -, we propose new inference methods, and we define them with simple expressions. All of the methods are compared with each other, selecting the optimal method: (a) to obtain a confidence interval for the difference or ratio; (b) to perform an individual homogeneity test of the two predictive values; and (c) to carry out a global homogeneity test of the two predictive values."}, "https://arxiv.org/abs/2405.18089": {"title": "Semi-nonparametric models of multidimensional matching: an optimal transport approach", "link": "https://arxiv.org/abs/2405.18089", "description": "arXiv:2405.18089v1 Announce Type: new \nAbstract: This paper proposes empirically tractable multidimensional matching models, focusing on worker-job matching. We generalize the parametric model proposed by Lindenlaub (2017), which relies on the assumption of joint normality of observed characteristics of workers and jobs. In our paper, we allow unrestricted distributions of characteristics and show identification of the production technology, and equilibrium wage and matching functions using tools from optimal transport theory. Given identification, we propose efficient, consistent, asymptotically normal sieve estimators. We revisit Lindenlaub's empirical application and show that, between 1990 and 2010, the U.S. economy experienced much larger technological progress favoring cognitive abilities than the original findings suggest. Furthermore, our flexible model specifications provide a significantly better fit for patterns in the evolution of wage inequality."}, "https://arxiv.org/abs/2405.18288": {"title": "Stagewise Boosting Distributional Regression", "link": "https://arxiv.org/abs/2405.18288", "description": "arXiv:2405.18288v1 Announce Type: new \nAbstract: Forward stagewise regression is a simple algorithm that can be used to estimate regularized models. The updating rule adds a small constant to a regression coefficient in each iteration, such that the underlying optimization problem is solved slowly with small improvements. 
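A minimal sketch of forward stagewise regression as characterized in the "Stagewise Boosting Distributional Regression" entry above: each iteration nudges the coefficient most correlated with the current residual by a small constant eps. The simulated design and step settings are illustrative only:

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_iter=5000):
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        corr = X.T @ residual
        j = np.argmax(np.abs(corr))            # predictor most correlated with residual
        step = eps * np.sign(corr[j])          # small constant update
        beta[j] += step
        residual -= step * X[:, j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=200)
print(np.round(forward_stagewise(X, y), 2))
```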
This is similar to gradient boosting, with the essential difference that the step size is determined by the product of the gradient and a step length parameter in the latter algorithm. One often overlooked challenge in gradient boosting for distributional regression is the issue of a vanishingly small gradient, which practically halts the algorithm's progress. We show that gradient boosting in this case oftentimes results in suboptimal models, especially for complex problems where certain distributional parameters are never updated due to the vanishing gradient. Therefore, we propose a stagewise boosting-type algorithm for distributional regression, combining stagewise regression ideas with gradient boosting. Additionally, we extend it with a novel regularization method, correlation filtering, to provide additional stability when the problem involves a large number of covariates. Furthermore, the algorithm includes best-subset selection for parameters and can be applied to big data problems by leveraging stochastic approximations of the updating steps. Besides the advantage of processing large datasets, the stochastic nature of the approximations can lead to better results, especially for complex distributions, by reducing the risk of being trapped in a local optimum. The performance of our proposed stagewise boosting distributional regression approach is investigated in an extensive simulation study and by estimating a full probabilistic model for lightning counts with data of more than 9.1 million observations and 672 covariates."}, "https://arxiv.org/abs/2405.18323": {"title": "Optimal Design in Repeated Testing for Count Data", "link": "https://arxiv.org/abs/2405.18323", "description": "arXiv:2405.18323v1 Announce Type: new \nAbstract: In this paper, we develop optimal designs for growth curve models with count data based on the Rasch Poisson-Gamma counts (RPGCM) model. This model is often used in educational and psychological testing when test results yield count data. In the RPGCM, the test scores are determined by respondents' ability and item difficulty. Locally D-optimal designs are derived for maximum quasi-likelihood estimation to efficiently estimate the mean abilities of the respondents over time. Using the log link, both unstructured, linear and nonlinear growth curves of log mean abilities are taken into account. Finally, the sensitivity of the derived optimal designs to an imprecise choice of parameter values is analyzed using D-efficiency."}, "https://arxiv.org/abs/2405.18413": {"title": "Homophily-adjusted social influence estimation", "link": "https://arxiv.org/abs/2405.18413", "description": "arXiv:2405.18413v1 Announce Type: new \nAbstract: Homophily and social influence are two key concepts of social network analysis. Distinguishing between these phenomena is difficult, and approaches to disambiguate the two have been primarily limited to longitudinal data analyses. In this study, we provide sufficient conditions for valid estimation of social influence through cross-sectional data, leading to a novel homophily-adjusted social influence model which addresses the backdoor pathway of latent homophilic features. The oft-used network autocorrelation model (NAM) is the special case of our proposed model with no latent homophily, suggesting that the NAM is only valid when all homophilic attributes are observed. 
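For reference, a sketch of the baseline network autocorrelation model (NAM) mentioned in the entry above, y = rho*W*y + X*beta + eps, simulated through its reduced form y = (I - rho*W)^(-1)(X*beta + eps); the homophily-adjusted extension is not shown, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, beta = 100, 0.3, np.array([1.0, -2.0])

A = rng.random((n, n)) < 0.05                  # random adjacency
A = np.triu(A, 1)
A = (A | A.T).astype(float)                    # undirected, no self-loops
W = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)   # row-normalized weights

X = rng.normal(size=(n, 2))
eps = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)  # reduced-form outcome
print("simulated outcome, first five:", np.round(y[:5], 2))
```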
We conducted an extensive simulation study to evaluate the performance of our proposed homophily-adjusted model, comparing its results with those from the conventional NAM. Our findings shed light on the nuanced dynamics of social networks, presenting a valuable tool for researchers seeking to estimate the effects of social influence while accounting for homophily. Code to implement our approach is available at https://github.com/hanhtdpham/hanam."}, "https://arxiv.org/abs/2405.17640": {"title": "Probabilistically Plausible Counterfactual Explanations with Normalizing Flows", "link": "https://arxiv.org/abs/2405.17640", "description": "arXiv:2405.17640v1 Announce Type: cross \nAbstract: We present PPCEF, a novel method for generating probabilistically plausible counterfactual explanations (CFs). PPCEF advances beyond existing methods by combining a probabilistic formulation that leverages the data distribution with the optimization of plausibility within a unified framework. Compared to reference approaches, our method enforces plausibility by directly optimizing the explicit density function without assuming a particular family of parametrized distributions. This ensures CFs are not only valid (i.e., achieve class change) but also align with the underlying data's probability density. For that purpose, our approach leverages normalizing flows as powerful density estimators to capture the complex high-dimensional data distribution. Furthermore, we introduce a novel loss that balances the trade-off between achieving class change and maintaining closeness to the original instance while also incorporating a probabilistic plausibility term. PPCEF's unconstrained formulation allows for efficient gradient-based optimization with batch processing, leading to orders of magnitude faster computation compared to prior methods. Moreover, the unconstrained formulation of PPCEF allows for the seamless integration of future constraints tailored to specific counterfactual properties. Finally, extensive evaluations demonstrate PPCEF's superiority in generating high-quality, probabilistically plausible counterfactual explanations in high-dimensional tabular settings. This makes PPCEF a powerful tool not only for interpreting complex machine learning models but also for improving fairness, accountability, and trust in AI systems."}, "https://arxiv.org/abs/2405.17642": {"title": "Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels", "link": "https://arxiv.org/abs/2405.17642", "description": "arXiv:2405.17642v1 Announce Type: cross \nAbstract: Growing regulatory and societal pressures demand increased transparency in AI, particularly in understanding the decisions made by complex machine learning models. Counterfactual Explanations (CFs) have emerged as a promising technique within Explainable AI (xAI), offering insights into individual model predictions. However, to understand the systemic biases and disparate impacts of AI models, it is crucial to move beyond local CFs and embrace global explanations, which offer a holistic view across diverse scenarios and populations. Unfortunately, generating Global Counterfactual Explanations (GCEs) faces challenges in computational complexity, defining the scope of \"global,\" and ensuring the explanations are both globally representative and locally plausible. 
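A schematic counterfactual search in the spirit of the PPCEF entry above, not the authors' code: gradient descent on a candidate x' trades off class change, closeness to the original x, and plausibility via an explicit log-density. A diagonal Gaussian stands in for a fitted normalizing flow, and the classifier, weights, and step counts are placeholders:

```python
import torch

torch.manual_seed(0)
d = 5
clf = torch.nn.Linear(d, 2)                       # stand-in classifier
mu, log_sigma = torch.zeros(d), torch.zeros(d)    # stand-in for a fitted flow

def log_density(x):                               # placeholder for flow.log_prob(x)
    return -0.5 * (((x - mu) / log_sigma.exp()) ** 2).sum()

x = torch.randn(d)                                # instance to explain
x_cf = x.clone().requires_grad_(True)
target = torch.tensor(1)                          # desired class
lam, nu = 1.0, 0.1                                # closeness / plausibility weights
opt = torch.optim.Adam([x_cf], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    loss = (torch.nn.functional.cross_entropy(clf(x_cf)[None], target[None])
            + lam * torch.sum((x_cf - x) ** 2)    # stay close to the original
            - nu * log_density(x_cf))             # stay in a high-density region
    loss.backward()
    opt.step()

print("predicted class of counterfactual:", int(clf(x_cf).argmax()))
```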
We introduce a novel unified approach for generating Local, Group-wise, and Global Counterfactual Explanations for differentiable classification models via gradient-based optimization to address these challenges. This framework aims to bridge the gap between individual and systemic insights, enabling a deeper understanding of model decisions and their potential impact on diverse populations. Our approach further innovates by incorporating a probabilistic plausibility criterion, enhancing actionability and trustworthiness. By offering a cohesive solution to the optimization and plausibility challenges in GCEs, our work significantly advances the interpretability and accountability of AI models, marking a step forward in the pursuit of transparent AI."}, "https://arxiv.org/abs/2405.18206": {"title": "Multi-CATE: Multi-Accurate Conditional Average Treatment Effect Estimation Robust to Unknown Covariate Shifts", "link": "https://arxiv.org/abs/2405.18206", "description": "arXiv:2405.18206v1 Announce Type: cross \nAbstract: Estimating heterogeneous treatment effects is important to tailor treatments to those individuals who would most likely benefit. However, conditional average treatment effect predictors may often be trained on one population but deployed on different, possibly unknown populations. We use methodology for learning multi-accurate predictors to post-process CATE T-learners (differenced regressions) to become robust to unknown covariate shifts at the time of deployment. The method works in general for pseudo-outcome regression, such as the DR-learner. We show how this approach can combine (large) confounded observational and (smaller) randomized datasets by learning a confounded predictor from the observational dataset, and auditing for multi-accuracy on the randomized controlled trial. We show improvements in bias and mean squared error in simulations with increasingly larger covariate shift, and on a semi-synthetic case study of a parallel large observational study and smaller randomized controlled experiment. Overall, we establish a connection between methods developed for multi-distribution learning and achieve appealing desiderata (e.g., external validity) in causal inference and machine learning."}, "https://arxiv.org/abs/2405.18379": {"title": "A Note on the Prediction-Powered Bootstrap", "link": "https://arxiv.org/abs/2405.18379", "description": "arXiv:2405.18379v1 Announce Type: cross \nAbstract: We introduce PPBoot: a bootstrap-based method for prediction-powered inference. PPBoot is applicable to arbitrary estimation problems and is very simple to implement, essentially only requiring one application of the bootstrap. Through a series of examples, we demonstrate that PPBoot often performs nearly identically to (and sometimes better than) the earlier PPI(++) method based on asymptotic normality, when the latter is applicable, without requiring any asymptotic characterizations. 
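A hedged sketch of a prediction-powered bootstrap for a mean in the spirit of the PPBoot entry above. It assumes the usual PPI-style rectified point estimate, theta_hat = mean(f(X_unlabeled)) + [mean(Y_labeled) - mean(f(X_labeled))], and simply bootstraps both samples; the note's actual recipe may differ in details:

```python
import numpy as np

rng = np.random.default_rng(0)
n_lab, n_unlab = 200, 5000
y_lab = rng.normal(1.0, 1.0, n_lab)                       # labeled outcomes
f_lab = y_lab + rng.normal(0.3, 0.5, n_lab)               # biased predictions on labeled data
f_unlab = rng.normal(1.0, 1.0, n_unlab) + rng.normal(0.3, 0.5, n_unlab)  # predictions on unlabeled data

def pp_estimate(y_l, f_l, f_u):
    return f_u.mean() + (y_l.mean() - f_l.mean())          # rectified (debiased) estimate

B = 2000
boot = np.empty(B)
for b in range(B):
    i = rng.integers(0, n_lab, n_lab)                      # resample labeled data
    j = rng.integers(0, n_unlab, n_unlab)                  # resample unlabeled data
    boot[b] = pp_estimate(y_lab[i], f_lab[i], f_unlab[j])

lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"point estimate: {pp_estimate(y_lab, f_lab, f_unlab):.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```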
Given its versatility, PPBoot could simplify and expand the scope of application of prediction-powered inference to problems where central limit theorems are hard to prove."}, "https://arxiv.org/abs/2405.18412": {"title": "Tensor Methods in High Dimensional Data Analysis: Opportunities and Challenges", "link": "https://arxiv.org/abs/2405.18412", "description": "arXiv:2405.18412v1 Announce Type: cross \nAbstract: Large amount of multidimensional data represented by multiway arrays or tensors are prevalent in modern applications across various fields such as chemometrics, genomics, physics, psychology, and signal processing. The structural complexity of such data provides vast new opportunities for modeling and analysis, but efficiently extracting information content from them, both statistically and computationally, presents unique and fundamental challenges. Addressing these challenges requires an interdisciplinary approach that brings together tools and insights from statistics, optimization and numerical linear algebra among other fields. Despite these hurdles, significant progress has been made in the last decade. This review seeks to examine some of the key advancements and identify common threads among them, under eight different statistical settings."}, "https://arxiv.org/abs/2008.09263": {"title": "Empirical Likelihood Covariate Adjustment for Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2008.09263", "description": "arXiv:2008.09263v3 Announce Type: replace \nAbstract: This paper proposes a versatile covariate adjustment method that directly incorporates covariate balance in regression discontinuity (RD) designs. The new empirical entropy balancing method reweights the standard local polynomial RD estimator by using the entropy balancing weights that minimize the Kullback--Leibler divergence from the uniform weights while satisfying the covariate balance constraints. Our estimator can be formulated as an empirical likelihood estimator that efficiently incorporates the information from the covariate balance condition as correctly specified over-identifying moment restrictions, and thus has an asymptotic variance no larger than that of the standard estimator without covariates. We demystify the asymptotic efficiency gain of Calonico, Cattaneo, Farrell, and Titiunik (2019)'s regression-based covariate-adjusted estimator, as their estimator has the same asymptotic variance as ours. Further efficiency improvement from balancing over sieve spaces is possible if our entropy balancing weights are computed using stronger covariate balance constraints that are imposed on functions of covariates. We then show that our method enjoys favorable second-order properties from empirical likelihood estimation and inference: the estimator has a small (bounded) nonlinearity bias, and the likelihood ratio based confidence set admits a simple analytical correction that can be used to improve coverage accuracy. The coverage accuracy of our confidence set is robust against slight perturbation to the covariate balance condition, which may happen in cases such as data contamination and misspecified \"unaffected\" outcomes used as covariates. 
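A stand-alone illustration of the entropy balancing reweighting referred to in the regression discontinuity entry above: weights that minimize the Kullback-Leibler divergence from uniform weights subject to covariate balance constraints, computed via the Hainmueller-style dual. The RD-specific machinery (local polynomials, kernels, sieve constraints) is omitted, and the simulated covariates and targets are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 300
g = rng.normal(size=(n, 2))              # covariate balance functions g(x_i)
target = np.array([0.2, -0.1])           # moments the weights must reproduce

def dual(lmbda):
    # Dual of: min_w sum w_i log(n w_i)  s.t.  sum w_i = 1,  sum w_i g_i = target
    return np.log(np.exp((g - target) @ lmbda).sum())

res = minimize(dual, np.zeros(2), method="BFGS")
w = np.exp((g - target) @ res.x)
w /= w.sum()                             # entropy balancing weights

print("weighted moments:", np.round(w @ g, 3))                 # approximately equal to target
print("KL from uniform :", round(float(np.sum(w * np.log(n * w))), 4))
```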
The proposed entropy balancing approach for covariate adjustment is applicable to other RD-related settings."}, "https://arxiv.org/abs/2012.02601": {"title": "Asymmetric uncertainty : Nowcasting using skewness in real-time data", "link": "https://arxiv.org/abs/2012.02601", "description": "arXiv:2012.02601v4 Announce Type: replace \nAbstract: This paper presents a new way to account for downside and upside risks when producing density nowcasts of GDP growth. The approach relies on modelling location, scale and shape common factors in real-time macroeconomic data. While movements in the location generate shifts in the central part of the predictive density, the scale controls its dispersion (akin to general uncertainty) and the shape its asymmetry, or skewness (akin to downside and upside risks). The empirical application is centred on US GDP growth and the real-time data come from Fred-MD. The results show that there is more to real-time data than their levels or means: their dispersion and asymmetry provide valuable information for nowcasting economic activity. Scale and shape common factors (i) yield more reliable measures of uncertainty and (ii) improve precision when macroeconomic uncertainty is at its peak."}, "https://arxiv.org/abs/2203.04080": {"title": "On Robust Inference in Time Series Regression", "link": "https://arxiv.org/abs/2203.04080", "description": "arXiv:2203.04080v3 Announce Type: replace \nAbstract: Least squares regression with heteroskedasticity consistent standard errors (\"OLS-HC regression\") has proved very useful in cross section environments. However, several major difficulties, which are generally overlooked, must be confronted when transferring the HC technology to time series environments via heteroskedasticity and autocorrelation consistent standard errors (\"OLS-HAC regression\"). First, in plausible time-series environments, OLS parameter estimates can be inconsistent, so that OLS-HAC inference fails even asymptotically. Second, most economic time series have autocorrelation, which renders OLS parameter estimates inefficient. Third, autocorrelation similarly renders conditional predictions based on OLS parameter estimates inefficient. Finally, the structure of popular HAC covariance matrix estimators is ill-suited for capturing the autoregressive autocorrelation typically present in economic time series, which produces large size distortions and reduced power in HAC-based hypothesis testing, in all but the largest samples. We show that all four problems are largely avoided by the use of a simple and easily-implemented dynamic regression procedure, which we call DURBIN. We demonstrate the advantages of DURBIN with detailed simulations covering a range of practical issues."}, "https://arxiv.org/abs/2210.04146": {"title": "Inference in parametric models with many L-moments", "link": "https://arxiv.org/abs/2210.04146", "description": "arXiv:2210.04146v3 Announce Type: replace \nAbstract: L-moments are expected values of linear combinations of order statistics that provide robust alternatives to traditional moments. The estimation of parametric models by matching sample L-moments has been shown to outperform maximum likelihood estimation (MLE) in small samples from popular distributions. The choice of the number of L-moments to be used in estimation remains ad-hoc, though: researchers typically set the number of L-moments equal to the number of parameters, as to achieve an order condition for identification. 
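For the L-moments entry above, a sketch of the quantities that "matching sample L-moments" operates on: the first four sample L-moments computed from the standard unbiased probability-weighted moments b_r. This is not the paper's generalised estimator, and the Gumbel sample is purely illustrative:

```python
import numpy as np
from math import comb

def sample_l_moments(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # b_r = n^-1 * sum_i [C(i, r) / C(n-1, r)] * x_(i), with 0-based order index i
    b = [np.mean([comb(i, r) / comb(n - 1, r) * x[i] for i in range(n)])
         for r in range(4)]
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4

rng = np.random.default_rng(0)
x = rng.gumbel(size=1000)
l1, l2, l3, l4 = sample_l_moments(x)
print(f"l1={l1:.3f}  l2={l2:.3f}  L-skewness={l3 / l2:.3f}  L-kurtosis={l4 / l2:.3f}")
```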
This approach is generally inefficient in larger samples. In this paper, we show that, by properly choosing the number of L-moments and weighting these accordingly, we are able to construct an estimator that outperforms MLE in finite samples, and yet does not suffer from efficiency losses asymptotically. We do so by considering a \"generalised\" method of L-moments estimator and deriving its asymptotic properties in a framework where the number of L-moments varies with sample size. We then propose methods to automatically select the number of L-moments in a given sample. Monte Carlo evidence shows our proposed approach is able to outperform (in a mean-squared error sense) MLE in smaller samples, whilst working as well as it in larger samples. We then consider extensions of our approach to conditional and semiparametric models, and apply the latter to study expenditure patterns in a ridesharing platform in Brazil."}, "https://arxiv.org/abs/2302.09526": {"title": "Mixed Semi-Supervised Generalized-Linear-Regression with applications to Deep-Learning and Interpolators", "link": "https://arxiv.org/abs/2302.09526", "description": "arXiv:2302.09526v3 Announce Type: replace \nAbstract: We present a methodology for using unlabeled data to design semi supervised learning (SSL) methods that improve the prediction performance of supervised learning for regression tasks. The main idea is to design different mechanisms for integrating the unlabeled data, and include in each of them a mixing parameter $\\alpha$, controlling the weight given to the unlabeled data. Focusing on Generalized Linear Models (GLM) and linear interpolators classes of models, we analyze the characteristics of different mixing mechanisms, and prove that in all cases, it is invariably beneficial to integrate the unlabeled data with some nonzero mixing ratio $\\alpha>0$, in terms of predictive performance. Moreover, we provide a rigorous framework to estimate the best mixing ratio $\\alpha^*$ where mixed SSL delivers the best predictive performance, while using the labeled and unlabeled data on hand.\n The effectiveness of our methodology in delivering substantial improvement compared to the standard supervised models, in a variety of settings, is demonstrated empirically through extensive simulation, in a manner that supports the theoretical analysis. We also demonstrate the applicability of our methodology (with some intuitive modifications) to improve more complex models, such as deep neural networks, in real-world regression tasks."}, "https://arxiv.org/abs/2309.06668": {"title": "On the uses and abuses of regression models: a call for reform of statistical practice and teaching", "link": "https://arxiv.org/abs/2309.06668", "description": "arXiv:2309.06668v2 Announce Type: replace \nAbstract: Regression methods dominate the practice of biostatistical analysis, but biostatistical training emphasises the details of regression models and methods ahead of the purposes for which such modelling might be useful. More broadly, statistics is widely understood to provide a body of techniques for \"modelling data\", underpinned by what we describe as the \"true model myth\": that the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. 
By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective has led to a range of problems in the application of regression methods, including misguided \"adjustment\" for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline a new approach to the teaching and application of biostatistical methods, which situates them within a framework that first requires clear definition of the substantive research question at hand within one of three categories: descriptive, predictive, or causal. Within this approach, the simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models as well as other advanced biostatistical methods should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of \"input\" variables, but their conceptualisation and usage should follow from the purpose at hand."}, "https://arxiv.org/abs/2311.01217": {"title": "The learning effects of subsidies to bundled goods: a semiparametric approach", "link": "https://arxiv.org/abs/2311.01217", "description": "arXiv:2311.01217v3 Announce Type: replace \nAbstract: Can temporary subsidies to bundles induce long-run changes in demand due to learning about the quality of one of the constituent goods? This paper provides theoretical support and empirical evidence on this mechanism. Theoretically, we introduce a model where an agent learns about the quality of an innovation through repeated consumption. We then assess the predictions of our theory in a randomised experiment in a ridesharing platform. The experiment subsidised car trips integrating with a train or metro station, which we interpret as a bundle. Given the heavy-tailed nature of our data, we propose a semiparametric specification for treatment effects that enables the construction of more efficient estimators. We then introduce an efficient estimator for our specification by relying on L-moments. Our results indicate that a ten-weekday 50\\% discount on integrated trips leads to a large contemporaneous increase in the demand for integration, and, consistent with our model, persistent changes in the mean and dispersion of nonintegrated app rides. These effects last for over four months. A calibration of our theoretical model suggests that around 40\\% of the contemporaneous increase in integrated rides may be attributable to increased incentives to learning. Our results have nontrivial policy implications for the design of public transit systems."}, "https://arxiv.org/abs/2401.10592": {"title": "Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights", "link": "https://arxiv.org/abs/2401.10592", "description": "arXiv:2401.10592v2 Announce Type: replace \nAbstract: Randomized controlled clinical trials provide the gold standard for evidence generation in relation to the efficacy of a new treatment in medical research. Relevant information from previous studies may be desirable to incorporate in the design and analysis of a new trial, with the Bayesian paradigm providing a coherent framework to formally incorporate prior knowledge. 
Many established methods involve the use of a discounting factor, sometimes related to a measure of `similarity' between historical and the new trials. However, it is often the case that the sample size is highly nonlinear in those discounting factors. This hinders communication with subject-matter experts to elicit sensible values for borrowing strength at the trial design stage. Focusing on a commensurate predictive prior method that can incorporate historical data from multiple sources, we highlight a particular issue of nonmonotonicity and explain why this causes issues with interpretability of the discounting factors (hereafter referred to as `weights'). We propose a solution for this, from which an analytical sample size formula is derived. We then propose a linearization technique such that the sample size changes uniformly over the weights. Our approach leads to interpretable weights that represent the probability that historical data are (ir)relevant to the new trial, and could therefore facilitate easier elicitation of expert opinion on their values.\n Keywords: Bayesian sample size determination; Commensurate priors; Historical borrowing; Prior aggregation; Uniform shrinkage."}, "https://arxiv.org/abs/2207.09054": {"title": "Towards a Low-SWaP 1024-beam Digital Array: A 32-beam Sub-system at 5", "link": "https://arxiv.org/abs/2207.09054", "description": "arXiv:2207.09054v2 Announce Type: replace-cross \nAbstract: Millimeter wave communications require multibeam beamforming in order to utilize wireless channels that suffer from obstructions, path loss, and multi-path effects. Digital multibeam beamforming has maximum degrees of freedom compared to analog phased arrays. However, circuit complexity and power consumption are important constraints for digital multibeam systems. A low-complexity digital computing architecture is proposed for a multiplication-free 32-point linear transform that approximates multiple simultaneous RF beams similar to a discrete Fourier transform (DFT). Arithmetic complexity due to multiplication is reduced from the FFT complexity of $\\mathcal{O}(N\\: \\log N)$ for DFT realizations, down to zero, thus yielding a 46% and 55% reduction in chip area and dynamic power consumption, respectively, for the $N=32$ case considered. The paper describes the proposed 32-point DFT approximation targeting a 1024-beams using a 2D array, and shows the multiplierless approximation and its mapping to a 32-beam sub-system consisting of 5.8 GHz antennas that can be used for generating 1024 digital beams without multiplications. Real-time beam computation is achieved using a Xilinx FPGA at 120 MHz bandwidth per beam. Theoretical beam performance is compared with measured RF patterns from both a fixed-point FFT as well as the proposed multiplier-free algorithm and are in good agreement."}, "https://arxiv.org/abs/2303.15029": {"title": "Random measure priors in Bayesian recovery from sketches", "link": "https://arxiv.org/abs/2303.15029", "description": "arXiv:2303.15029v2 Announce Type: replace-cross \nAbstract: This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. 
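The random-hashing compression underlying the frequency-recovery abstract above can be illustrated very simply: data points are hashed into a small number of buckets, and the naive estimate of a symbol's frequency is its bucket count, which is inflated by collisions. The Bayesian posterior the paper derives is not reproduced here; the hash choice and toy data below are illustrative assumptions.

```python
# Minimal random-hashing sketch: bucket counts plus the naive (collision-inflated)
# frequency estimate that Bayesian recovery methods aim to improve on.
import hashlib
from collections import Counter

def hash_bucket(symbol, J, seed=0):
    """Deterministic 'random' hash of a symbol into one of J buckets."""
    h = hashlib.sha256(f"{seed}:{symbol}".encode()).hexdigest()
    return int(h, 16) % J

def build_sketch(data, J, seed=0):
    counts = [0] * J
    for x in data:
        counts[hash_bucket(x, J, seed)] += 1
    return counts

data = ["a", "b", "a", "c", "a", "b", "d", "e", "a"]
J = 4
sketch = build_sketch(data, J)
true_freq = Counter(data)
for s in sorted(true_freq):
    est = sketch[hash_bucket(s, J)]
    print(s, "true:", true_freq[s], "bucket count (naive upper bound):", est)
```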
This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general ``traits'' setting, where each data point has integer levels of association with multiple symbols, typically referred to as ``traits''. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions."}, "https://arxiv.org/abs/2405.18531": {"title": "Difference-in-Discontinuities: Estimation, Inference and Validity Tests", "link": "https://arxiv.org/abs/2405.18531", "description": "arXiv:2405.18531v1 Announce Type: new \nAbstract: This paper investigates the econometric theory behind the newly developed difference-in-discontinuities design (DiDC). Despite its increasing use in applied research, there are currently limited studies of its properties. The method combines elements of regression discontinuity (RDD) and difference-in-differences (DiD) designs, allowing researchers to eliminate the effects of potential confounders at the discontinuity. We formalize the difference-in-discontinuity theory by stating the identification assumptions and proposing a nonparametric estimator, deriving its asymptotic properties and examining the scenarios in which the DiDC has desirable bias properties when compared to the standard RDD. We also provide comprehensive tests for one of the identification assumptions of the DiDC. Monte Carlo simulation studies show that the estimators have good performance in finite samples. Finally, we revisit Grembi et al. (2016), which studies the effects of relaxing fiscal rules on public finance outcomes in Italian municipalities. The results show that the proposed estimator exhibits substantially smaller confidence intervals for the estimated effects."}, "https://arxiv.org/abs/2405.18597": {"title": "Causal inference in the closed-loop: marginal structural models for sequential excursion effects", "link": "https://arxiv.org/abs/2405.18597", "description": "arXiv:2405.18597v1 Announce Type: new \nAbstract: Optogenetics is widely used to study the effects of neural circuit manipulation on behavior. However, the paucity of causal inference methodological work on this topic has resulted in analysis conventions that discard information, and constrain the scientific questions that can be posed. To fill this gap, we introduce a nonparametric causal inference framework for analyzing \"closed-loop\" designs, which use dynamic policies that assign treatment based on covariates. In this setting, standard methods can introduce bias and occlude causal effects. Building on the sequentially randomized experiments literature in causal inference, our approach extends history-restricted marginal structural models for dynamic regimes. 
In practice, our framework can identify a wide range of causal effects of optogenetics on trial-by-trial behavior, such as, fast/slow-acting, dose-response, additive/antagonistic, and floor/ceiling. Importantly, it does so without requiring negative controls, and can estimate how causal effect magnitudes evolve across time points. From another view, our work extends \"excursion effect\" methods--popular in the mobile health literature--to enable estimation of causal contrasts for treatment sequences greater than length one, in the presence of positivity violations. We derive rigorous statistical guarantees, enabling hypothesis testing of these causal effects. We demonstrate our approach on data from a recent study of dopaminergic activity on learning, and show how our method reveals relevant effects obscured in standard analyses."}, "https://arxiv.org/abs/2405.18722": {"title": "Adaptive and Efficient Learning with Blockwise Missing and Semi-Supervised Data", "link": "https://arxiv.org/abs/2405.18722", "description": "arXiv:2405.18722v1 Announce Type: new \nAbstract: Data fusion is an important way to realize powerful and generalizable analyses across multiple sources. However, different capability of data collection across the sources has become a prominent issue in practice. This could result in the blockwise missingness (BM) of covariates troublesome for integration. Meanwhile, the high cost of obtaining gold-standard labels can cause the missingness of response on a large proportion of samples, known as the semi-supervised (SS) problem. In this paper, we consider a challenging scenario confronting both the BM and SS issues, and propose a novel Data-adaptive projecting Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE). Starting with a complete-data-only estimator, it involves two successive projection steps to reduce its variance without incurring bias. Compared to existing approaches, DEFUSE achieves a two-fold improvement. First, it leverages the BM labeled sample more efficiently through a novel data-adaptive projection approach robust to model misspecification on the missing covariates, leading to better variance reduction. Second, our method further incorporates the large unlabeled sample to enhance the estimation efficiency through imputation and projection. Compared to the previous SS setting with complete covariates, our work reveals a more essential role of the unlabeled sample in the BM setting. These advantages are justified in asymptotic and simulation studies. We also apply DEFUSE for the risk modeling and inference of heart diseases with the MIMIC-III electronic medical record (EMR) data."}, "https://arxiv.org/abs/2405.18836": {"title": "Do Finetti: On Causal Effects for Exchangeable Data", "link": "https://arxiv.org/abs/2405.18836", "description": "arXiv:2405.18836v1 Announce Type: new \nAbstract: We study causal effect estimation in a setting where the data are not i.i.d. (independent and identically distributed). We focus on exchangeable data satisfying an assumption of independent causal mechanisms. Traditional causal effect estimation frameworks, e.g., relying on structural causal models and do-calculus, are typically limited to i.i.d. data and do not extend to more general exchangeable generative processes, which naturally arise in multi-environment data. 
To address this gap, we develop a generalized framework for exchangeable data and introduce a truncated factorization formula that facilitates both the identification and estimation of causal effects in our setting. To illustrate potential applications, we introduce a causal P\\'olya urn model and demonstrate how intervention propagates effects in exchangeable data settings. Finally, we develop an algorithm that performs simultaneous causal discovery and effect estimation given multi-environment data."}, "https://arxiv.org/abs/2405.18856": {"title": "Inference under covariate-adaptive randomization with many strata", "link": "https://arxiv.org/abs/2405.18856", "description": "arXiv:2405.18856v1 Announce Type: new \nAbstract: Covariate-adaptive randomization is widely employed to balance baseline covariates in interventional studies such as clinical trials and experiments in development economics. Recent years have witnessed substantial progress in inference under covariate-adaptive randomization with a fixed number of strata. However, concerns have been raised about the impact of a large number of strata on its design and analysis, which is a common scenario in practice, such as in multicenter randomized clinical trials. In this paper, we propose a general framework for inference under covariate-adaptive randomization, which extends the seminal works of Bugni et al. (2018, 2019) by allowing for a diverging number of strata. Furthermore, we introduce a novel weighted regression adjustment that ensures efficiency improvement. On top of establishing the asymptotic theory, practical algorithms for handling situations involving an extremely large number of strata are also developed. Moreover, by linking design balance and inference robustness, we highlight the advantages of stratified block randomization, which enforces better covariate balance within strata compared to simple randomization. This paper offers a comprehensive landscape of inference under covariate-adaptive randomization, spanning from fixed to diverging to extremely large numbers of strata."}, "https://arxiv.org/abs/2405.18873": {"title": "A Return to Biased Nets: New Specifications and Approximate Bayesian Inference", "link": "https://arxiv.org/abs/2405.18873", "description": "arXiv:2405.18873v1 Announce Type: new \nAbstract: The biased net paradigm was the first general and empirically tractable scheme for parameterizing complex patterns of dependence in networks, expressing deviations from uniform random graph structure in terms of latent ``bias events,'' whose realizations enhance reciprocity, transitivity, or other structural features. Subsequent developments have introduced local specifications of biased nets, which reduce the need for approximations required in early specifications based on tracing processes. Here, we show that while one such specification leads to inconsistencies, a closely related Markovian specification both evades these difficulties and can be extended to incorporate new types of effects. We introduce the notion of inhibitory bias events, with satiation as an example, which are useful for avoiding degeneracies that can arise from closure bias terms. Although our approach does not lead to a computable likelihood, we provide a strategy for approximate Bayesian inference using random forest prevision. 
We demonstrate our approach on a network of friendship ties among college students, recapitulating a relationship between the sibling bias and tie strength posited in earlier work by Fararo."}, "https://arxiv.org/abs/2405.18987": {"title": "Transmission Channel Analysis in Dynamic Models", "link": "https://arxiv.org/abs/2405.18987", "description": "arXiv:2405.18987v1 Announce Type: new \nAbstract: We propose a framework for the analysis of transmission channels in a large class of dynamic models. To this end, we formulate our approach both using graph theory and potential outcomes, which we show to be equivalent. Our method, labelled Transmission Channel Analysis (TCA), allows for the decomposition of total effects captured by impulse response functions into the effects flowing along transmission channels, thereby providing a quantitative assessment of the strength of various transmission channels. We establish that this requires no additional identification assumptions beyond the identification of the structural shock whose effects the researcher wants to decompose. Additionally, we prove that impulse response functions are sufficient statistics for the computation of transmission effects. We also demonstrate the empirical relevance of TCA for policy evaluation by decomposing the effects of various monetary policy shock measures into instantaneous implementation effects and effects that likely relate to forward guidance."}, "https://arxiv.org/abs/2405.19058": {"title": "Participation bias in the estimation of heritability and genetic correlation", "link": "https://arxiv.org/abs/2405.19058", "description": "arXiv:2405.19058v1 Announce Type: new \nAbstract: It is increasingly recognized that participation bias can pose problems for genetic studies. Recently, to overcome the challenge that genetic information of non-participants is unavailable, it is shown that by comparing the IBD (identity by descent) shared and not-shared segments among the participants, one can estimate the genetic component underlying participation. That, however, does not directly address how to adjust estimates of heritability and genetic correlation for phenotypes correlated with participation. Here, for phenotypes whose mean differences between population and sample are known, we demonstrate a way to do so by adopting a statistical framework that separates out the genetic and non-genetic correlations between participation and these phenotypes. Crucially, our method avoids making the assumption that the effect of the genetic component underlying participation is manifested entirely through these other phenotypes. Applying the method to 12 UK Biobank phenotypes, we found 8 have significant genetic correlations with participation, including body mass index, educational attainment, and smoking status. For most of these phenotypes, without adjustments, estimates of heritability and the absolute value of genetic correlation would have underestimation biases."}, "https://arxiv.org/abs/2405.19145": {"title": "L-Estimation in Instrumental Variables Regression for Censored Data in Presence of Endogeneity and Dependent Errors", "link": "https://arxiv.org/abs/2405.19145", "description": "arXiv:2405.19145v1 Announce Type: new \nAbstract: In this article, we propose L-estimators of the unknown parameters in the instrumental variables regression in the presence of censored data under endogeneity. We allow the random errors involved in the model to be dependent. 
The proposed estimation procedure is a two-stage procedure, and the large sample properties of the proposed estimators are established. The utility of the proposed methodology is demonstrated for various simulated data and a benchmark real data set."}, "https://arxiv.org/abs/2405.19231": {"title": "Covariate Shift Corrected Conditional Randomization Test", "link": "https://arxiv.org/abs/2405.19231", "description": "arXiv:2405.19231v1 Announce Type: new \nAbstract: Conditional independence tests are crucial across various disciplines in determining the independence of an outcome variable $Y$ from a treatment variable $X$, conditioning on a set of confounders $Z$. The Conditional Randomization Test (CRT) offers a powerful framework for such testing by assuming known distributions of $X \\mid Z$; it controls the Type-I error exactly, allowing for the use of flexible, black-box test statistics. In practice, testing for conditional independence often involves using data from a source population to draw conclusions about a target population. This can be challenging due to covariate shift -- differences in the distribution of $X$, $Z$, and surrogate variables, which can affect the conditional distribution of $Y \\mid X, Z$ -- rendering traditional CRT approaches invalid. To address this issue, we propose a novel Covariate Shift Corrected Pearson Chi-squared Conditional Randomization (csPCR) test. This test adapts to covariate shifts by integrating importance weights and employing the control variates method to reduce variance in the test statistics and thus enhance power. Theoretically, we establish that the csPCR test controls the Type-I error asymptotically. Empirically, through simulation studies, we demonstrate that our method not only maintains control over Type-I errors but also exhibits superior power, confirming its efficacy and practical utility in real-world scenarios where covariate shifts are prevalent. Finally, we apply our methodology to a real-world dataset to assess the impact of a COVID-19 treatment on the 90-day mortality rate among patients."}, "https://arxiv.org/abs/2405.19312": {"title": "Causal Inference for Balanced Incomplete Block Designs", "link": "https://arxiv.org/abs/2405.19312", "description": "arXiv:2405.19312v1 Announce Type: new \nAbstract: Researchers often turn to block randomization to increase the precision of their inference or due to practical considerations, such as in multi-site trials. However, if the number of treatments under consideration is large it might not be practical or even feasible to assign all treatments within each block. We develop novel inference results under the finite-population design-based framework for a natural alternative to the complete block design that does not require reducing the number of treatment arms, the balanced incomplete block design (BIBD). This includes deriving the properties of two estimators for BIBDs and proposing conservative variance estimators. To assist practitioners in understanding the trade-offs of using BIBDs over other designs, the precisions of resulting estimators are compared to standard estimators for the complete block design, the cluster-randomized design, and the completely randomized design. Simulations and a data illustration demonstrate the strengths and weaknesses of using BIBDs. 
This work highlights BIBDs as practical and currently underutilized designs."}, "https://arxiv.org/abs/2405.18459": {"title": "Probing the Information Theoretical Roots of Spatial Dependence Measures", "link": "https://arxiv.org/abs/2405.18459", "description": "arXiv:2405.18459v1 Announce Type: cross \nAbstract: Intuitively, there is a relation between measures of spatial dependence and information theoretical measures of entropy. For instance, we can provide an intuition of why spatial data is special by stating that, on average, spatial data samples contain less than expected information. Similarly, spatial data, e.g., remotely sensed imagery, that is easy to compress is also likely to show significant spatial autocorrelation. Formulating our (highly specific) core concepts of spatial information theory in the widely used language of information theory opens new perspectives on their differences and similarities and also fosters cross-disciplinary collaboration, e.g., with the broader AI/ML communities. Interestingly, however, this intuitive relation is challenging to formalize and generalize, leading prior work to rely mostly on experimental results, e.g., for describing landscape patterns. In this work, we will explore the information theoretical roots of spatial autocorrelation, more specifically Moran's I, through the lens of self-information (also known as surprisal) and provide both formal proofs and experiments."}, "https://arxiv.org/abs/2405.18518": {"title": "LSTM-COX Model: A Concise and Efficient Deep Learning Approach for Handling Recurrent Events", "link": "https://arxiv.org/abs/2405.18518", "description": "arXiv:2405.18518v1 Announce Type: cross \nAbstract: In the current field of clinical medicine, traditional methods for analyzing recurrent events have limitations when dealing with complex time-dependent data. This study combines Long Short-Term Memory networks (LSTM) with the Cox model to enhance the model's performance in analyzing recurrent events with dynamic temporal information. Compared to classical models, the LSTM-Cox model significantly improves the accuracy of extracting clinical risk features and exhibits lower Akaike Information Criterion (AIC) values, while maintaining good performance on simulated datasets. In an empirical analysis of bladder cancer recurrence data, the model successfully reduced the mean squared error during the training phase and achieved a Concordance index of up to 0.90 on the test set. Furthermore, the model effectively distinguished between high and low-risk patient groups, and the identified recurrence risk features such as the number of tumor recurrences and maximum size were consistent with other research and clinical trial results. This study not only provides a straightforward and efficient method for analyzing recurrent data and extracting features but also offers a convenient pathway for integrating deep learning techniques into clinical risk prediction systems."}, "https://arxiv.org/abs/2405.18563": {"title": "Counterfactual Explanations for Multivariate Time-Series without Training Datasets", "link": "https://arxiv.org/abs/2405.18563", "description": "arXiv:2405.18563v1 Announce Type: cross \nAbstract: Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. 
Counterfactual explanations (CFEs) have emerged as a solution, offering interpretations of opaque ML models and providing a pathway to transition from one decision to another. However, most existing CFE methods require access to the model's training dataset, few methods can handle multivariate time-series, and none can handle multivariate time-series without training datasets. These limitations can be formidable in many scenarios. In this paper, we present CFWoT, a novel reinforcement-learning-based CFE method that generates CFEs when training datasets are unavailable. CFWoT is model-agnostic and suitable for both static and multivariate time-series datasets with continuous and discrete features. Users have the flexibility to specify non-actionable, immutable, and preferred features, as well as causal constraints which CFWoT guarantees will be respected. We demonstrate the performance of CFWoT against four baselines on several datasets and find that, despite not having access to a training dataset, CFWoT finds CFEs that make significantly fewer and significantly smaller changes to the input time-series. These properties make CFEs more actionable, as the magnitude of change required to alter an outcome is vastly reduced."}, "https://arxiv.org/abs/2405.18601": {"title": "From Conformal Predictions to Confidence Regions", "link": "https://arxiv.org/abs/2405.18601", "description": "arXiv:2405.18601v1 Announce Type: cross \nAbstract: Conformal prediction methodologies have significantly advanced the quantification of uncertainties in predictive models. Yet, the construction of confidence regions for model parameters presents a notable challenge, often necessitating stringent assumptions regarding data distribution or merely providing asymptotic guarantees. We introduce a novel approach termed CCR, which employs a combination of conformal prediction intervals for the model outputs to establish confidence regions for model parameters. We present coverage guarantees under minimal assumptions on noise and that is valid in finite sample regime. Our approach is applicable to both split conformal predictions and black-box methodologies including full or cross-conformal approaches. In the specific case of linear models, the derived confidence region manifests as the feasible set of a Mixed-Integer Linear Program (MILP), facilitating the deduction of confidence intervals for individual parameters and enabling robust optimization. We empirically compare CCR to recent advancements in challenging settings such as with heteroskedastic and non-Gaussian noise."}, "https://arxiv.org/abs/2405.18621": {"title": "Multi-Armed Bandits with Network Interference", "link": "https://arxiv.org/abs/2405.18621", "description": "arXiv:2405.18621v1 Announce Type: cross \nAbstract: Online experimentation with interference is a common challenge in modern applications such as e-commerce and adaptive clinical trials in medicine. For example, in online marketplaces, the revenue of a good depends on discounts applied to competing goods. Statistical inference with interference is widely studied in the offline setting, but far less is known about how to adaptively assign treatments to minimize regret. We address this gap by studying a multi-armed bandit (MAB) problem where a learner (e-commerce platform) sequentially assigns one of possible $\\mathcal{A}$ actions (discounts) to $N$ units (goods) over $T$ rounds to minimize regret (maximize revenue). 
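For reference, the traditional no-interference setting that the bandit abstract above departs from can be sketched as an independent epsilon-greedy learner per unit; the sparse-interference, regression-based algorithms of the paper are not reproduced here, and the reward model and parameters below are illustrative assumptions.

```python
# Independent epsilon-greedy bandits, one per unit, ignoring network interference.
import numpy as np

rng = np.random.default_rng(2)
N, A, T, eps = 5, 3, 2000, 0.1
true_mean = rng.uniform(0, 1, size=(N, A))   # (unit, action) mean rewards, no interference
best = true_mean.max(axis=1)

counts = np.zeros((N, A))
value = np.zeros((N, A))                     # running mean reward per (unit, action)
cum_regret = 0.0
units = np.arange(N)

for t in range(T):
    explore = rng.random(N) < eps            # epsilon-greedy choice for each unit
    actions = np.where(explore, rng.integers(0, A, size=N), value.argmax(axis=1))
    rewards = rng.normal(true_mean[units, actions], 0.1)
    counts[units, actions] += 1
    value[units, actions] += (rewards - value[units, actions]) / counts[units, actions]
    cum_regret += (best - true_mean[units, actions]).sum()

print("cumulative regret:", round(cum_regret, 1))
print("learned best actions:", value.argmax(axis=1), "true best actions:", true_mean.argmax(axis=1))
```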
Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units. With $\\mathcal{A}$ actions and $N$ units, minimizing regret is combinatorially difficult since the action space grows as $\\mathcal{A}^N$. To overcome this issue, we study a sparse network interference model, where the reward of a unit is only affected by the treatments assigned to $s$ neighboring units. We use tools from discrete Fourier analysis to develop a sparse linear representation of the unit-specific reward $r_n: [\\mathcal{A}]^N \\rightarrow \\mathbb{R} $, and propose simple, linear regression-based algorithms to minimize regret. Importantly, our algorithms achieve provably low regret both when the learner observes the interference neighborhood for all units and when it is unknown. This significantly generalizes other works on this topic which impose strict conditions on the strength of interference on a known network, and also compare regret to a markedly weaker optimal action. Empirically, we corroborate our theoretical findings via numerical simulations."}, "https://arxiv.org/abs/2405.18671": {"title": "Watermarking Counterfactual Explanations", "link": "https://arxiv.org/abs/2405.18671", "description": "arXiv:2405.18671v1 Announce Type: cross \nAbstract: The field of Explainable Artificial Intelligence (XAI) focuses on techniques for providing explanations to end-users about the decision-making processes that underlie modern-day machine learning (ML) models. Within the vast universe of XAI techniques, counterfactual (CF) explanations are often preferred by end-users as they help explain the predictions of ML models by providing an easy-to-understand & actionable recourse (or contrastive) case to individual end-users who are adversely impacted by predicted outcomes. However, recent studies have shown significant security concerns with using CF explanations in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on proprietary ML models. In this paper, we propose a model-agnostic watermarking framework (for adding watermarks to CF explanations) that can be leveraged to detect unauthorized model extraction attacks (which rely on the watermarked CF explanations). Our novel framework solves a bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks that rely on these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme, while ensuring that these embedded watermarks do not compromise the quality of the generated CF explanations. We evaluate this framework's performance across a diverse set of real-world datasets, CF explanation methods, and model extraction techniques, and show that our watermarking detection system can be used to accurately identify extracted ML models that are trained using the watermarked CF explanations. Our work paves the way for the secure adoption of CF explanations in real-world applications."}, "https://arxiv.org/abs/2405.19225": {"title": "Synthetic Potential Outcomes for Mixtures of Treatment Effects", "link": "https://arxiv.org/abs/2405.19225", "description": "arXiv:2405.19225v1 Announce Type: cross \nAbstract: Modern data analysis frequently relies on the use of large datasets, often constructed as amalgamations of diverse populations or data-sources. 
Heterogeneity across these smaller datasets constitutes two major challenges for causal inference: (1) the source of each sample can introduce latent confounding between treatment and effect, and (2) diverse populations may respond differently to the same treatment, giving rise to heterogeneous treatment effects (HTEs). The issues of latent confounding and HTEs have been studied separately but not in conjunction. In particular, previous works only report the conditional average treatment effect (CATE) among similar individuals (with respect to the measured covariates). CATEs cannot resolve mixtures of potential treatment effects driven by latent heterogeneity, which we call mixtures of treatment effects (MTEs). Inspired by method of moment approaches to mixture models, we propose \"synthetic potential outcomes\" (SPOs). Our new approach deconfounds heterogeneity while also guaranteeing the identifiability of MTEs. This technique bypasses full recovery of a mixture, which significantly simplifies its requirements for identifiability. We demonstrate the efficacy of SPOs on synthetic data."}, "https://arxiv.org/abs/2405.19317": {"title": "Adaptive Generalized Neyman Allocation: Local Asymptotic Minimax Optimal Best Arm Identification", "link": "https://arxiv.org/abs/2405.19317", "description": "arXiv:2405.19317v1 Announce Type: cross \nAbstract: This study investigates a local asymptotic minimax optimal strategy for fixed-budget best arm identification (BAI). We propose the Adaptive Generalized Neyman Allocation (AGNA) strategy and show that its worst-case upper bound of the probability of misidentifying the best arm aligns with the worst-case lower bound under the small-gap regime, where the gap between the expected outcomes of the best and suboptimal arms is small. Our strategy corresponds to a generalization of the Neyman allocation for two-armed bandits (Neyman, 1934; Kaufmann et al., 2016) and a refinement of existing strategies such as the ones proposed by Glynn & Juneja (2004) and Shin et al. (2018). Compared to Komiyama et al. (2022), which proposes a minimax rate-optimal strategy, our proposed strategy has a tighter upper bound that exactly matches the lower bound, including the constant terms, by restricting the class of distributions to the class of small-gap distributions. Our result contributes to the longstanding open issue about the existence of asymptotically optimal strategies in fixed-budget BAI, by presenting the local asymptotic minimax optimal strategy."}, "https://arxiv.org/abs/2106.14083": {"title": "Bayesian Time-Varying Tensor Vector Autoregressive Models for Dynamic Effective Connectivity", "link": "https://arxiv.org/abs/2106.14083", "description": "arXiv:2106.14083v2 Announce Type: replace \nAbstract: In contemporary neuroscience, a key area of interest is dynamic effective connectivity, which is crucial for understanding the dynamic interactions and causal relationships between different brain regions. Dynamic effective connectivity can provide insights into how brain network interactions are altered in neurological disorders such as dyslexia. Time-varying vector autoregressive (TV-VAR) models have been employed to draw inferences for this purpose. However, their significant computational requirements pose challenges, since the number of parameters to be estimated increases quadratically with the number of time series. In this paper, we propose a computationally efficient Bayesian time-varying VAR approach. 
For dealing with large-dimensional time series, the proposed framework employs a tensor decomposition for the VAR coefficient matrices at different lags. Dynamically varying connectivity patterns are captured by assuming that at any given time only a subset of components in the tensor decomposition is active. Latent binary time series select the active components at each time via an innovative and parsimonious Ising model in the time-domain. Furthermore, we propose sparsity-inducing priors to achieve global-local shrinkage of the VAR coefficients, automatically determine the rank of the tensor decomposition, and guide the selection of the lags of the auto-regression. We show the performance of our model formulation via simulation studies and data from a real fMRI study involving a book reading experiment."}, "https://arxiv.org/abs/2307.11941": {"title": "Visibility graph-based covariance functions for scalable spatial analysis in non-convex domains", "link": "https://arxiv.org/abs/2307.11941", "description": "arXiv:2307.11941v3 Announce Type: replace \nAbstract: We present a new method for constructing valid covariance functions of Gaussian processes for spatial analysis in irregular, non-convex domains such as bodies of water. Standard covariance functions based on geodesic distances are not guaranteed to be positive definite on such domains, while existing non-Euclidean approaches fail to respect the partially Euclidean nature of these domains where the geodesic distance agrees with the Euclidean distances for some pairs of points. Using a visibility graph on the domain, we propose a class of covariance functions that preserve Euclidean-based covariances between points that are connected in the domain while incorporating the non-convex geometry of the domain via conditional independence relationships. We show that the proposed method preserves the partially Euclidean nature of the intrinsic geometry on the domain while maintaining validity (positive definiteness) and marginal stationarity of the covariance function over the entire parameter space, properties which are not always fulfilled by existing approaches to construct covariance functions on non-convex domains. We provide useful approximations to improve computational efficiency, resulting in a scalable algorithm. We compare the performance of our method with those of competing state-of-the-art methods using simulation studies on synthetic non-convex domains. The method is applied to data regarding acidity levels in the Chesapeake Bay, showing its potential for ecological monitoring in real-world spatial applications on irregular domains."}, "https://arxiv.org/abs/2206.12235": {"title": "Guided sequential ABC schemes for intractable Bayesian models", "link": "https://arxiv.org/abs/2206.12235", "description": "arXiv:2206.12235v5 Announce Type: replace-cross \nAbstract: Sequential algorithms such as sequential importance sampling (SIS) and sequential Monte Carlo (SMC) have proven fundamental in Bayesian inference for models not admitting a readily available likelihood function. For approximate Bayesian computation (ABC), SMC-ABC is the state-of-the-art sampler. However, since the ABC paradigm is intrinsically wasteful, sequential ABC schemes can benefit from well-targeted proposal samplers that efficiently avoid improbable parameter regions. We contribute to the ABC modeller's toolbox with novel proposal samplers that are conditional on summary statistics of the data. 
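For context on the ABC abstract just above, here is a minimal ABC rejection sketch, the basic scheme that sequential and guided ABC samplers (SIS-ABC, SMC-ABC) improve on; the guided, summary-conditional proposals of the paper are not reproduced, and the model, prior, summary statistic and tolerance are illustrative assumptions.

```python
# Basic ABC rejection sampling: keep prior draws whose simulated summary matches the data.
import numpy as np

rng = np.random.default_rng(3)
observed = rng.normal(loc=2.0, scale=1.0, size=50)
s_obs = observed.mean()                      # summary statistic

def simulate(theta, n=50):
    return rng.normal(loc=theta, scale=1.0, size=n)

accepted = []
tolerance = 0.05
for _ in range(20000):
    theta = rng.normal(0.0, 5.0)             # draw from the prior
    s_sim = simulate(theta).mean()
    if abs(s_sim - s_obs) < tolerance:       # accept if summaries are close enough
        accepted.append(theta)

accepted = np.array(accepted)
print(len(accepted), "accepted; approximate posterior mean:", accepted.mean().round(2))
```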
In a sense, the proposed parameters are \"guided\" to rapidly reach regions of the posterior surface that are compatible with the observed data. This speeds up the convergence of these sequential samplers, thus reducing the computational effort, while preserving the accuracy in the inference. We provide a variety of guided Gaussian and copula-based samplers for both SIS-ABC and SMC-ABC easing inference for challenging case-studies, including multimodal posteriors, highly correlated posteriors, hierarchical models with about 20 parameters, and a simulation study of cell movements using more than 400 summary statistics."}, "https://arxiv.org/abs/2305.15988": {"title": "Non-Log-Concave and Nonsmooth Sampling via Langevin Monte Carlo Algorithms", "link": "https://arxiv.org/abs/2305.15988", "description": "arXiv:2305.15988v2 Announce Type: replace-cross \nAbstract: We study the problem of approximate sampling from non-log-concave distributions, e.g., Gaussian mixtures, which is often challenging even in low dimensions due to their multimodality. We focus on performing this task via Markov chain Monte Carlo (MCMC) methods derived from discretizations of the overdamped Langevin diffusions, which are commonly known as Langevin Monte Carlo algorithms. Furthermore, we are also interested in two nonsmooth cases for which a large class of proximal MCMC methods have been developed: (i) a nonsmooth prior is considered with a Gaussian mixture likelihood; (ii) a Laplacian mixture distribution. Such nonsmooth and non-log-concave sampling tasks arise from a wide range of applications to Bayesian inference and imaging inverse problems such as image deconvolution. We perform numerical simulations to compare the performance of most commonly used Langevin Monte Carlo algorithms."}, "https://arxiv.org/abs/2405.19523": {"title": "Comparison of Point Process Learning and its special case Takacs-Fiksel estimation", "link": "https://arxiv.org/abs/2405.19523", "description": "arXiv:2405.19523v1 Announce Type: new \nAbstract: Recently, Cronie et al. (2024) introduced the notion of cross-validation for point processes and a new statistical methodology called Point Process Learning (PPL). In PPL one splits a point process/pattern into a training and a validation set, and then predicts the latter from the former through a parametrised Papangelou conditional intensity. The model parameters are estimated by minimizing a point process prediction error; this notion was introduced as the second building block of PPL. It was shown that PPL outperforms the state-of-the-art in both kernel intensity estimation and estimation of the parameters of the Gibbs hard-core process. In the latter case, the state-of-the-art was represented by pseudolikelihood estimation. In this paper we study PPL in relation to Takacs-Fiksel estimation, of which pseudolikelihood is a special case. We show that Takacs-Fiksel estimation is a special case of PPL in the sense that PPL with a specific loss function asymptotically reduces to Takacs-Fiksel estimation if we let the cross-validation regime tend to leave-one-out cross-validation. Moreover, PPL involves a certain type of hyperparameter given by a weight function which ensures that the prediction errors have expectation zero if and only if we have the correct parametrisation. We show that the weight function takes an explicit but intractable form for general Gibbs models. Consequently, we propose different approaches to estimate the weight function in practice. 
In order to assess how the general PPL setup performs in relation to its special case Takacs-Fiksel estimation, we conduct a simulation study where we find that for common Gibbs models we can find loss functions and hyperparameters so that PPL typically outperforms Takacs-Fiksel estimation significantly in terms of mean square error. Here, the hyperparameters are the cross-validation parameters and the weight function estimate."}, "https://arxiv.org/abs/2405.19539": {"title": "Canonical Correlation Analysis as Reduced Rank Regression in High Dimensions", "link": "https://arxiv.org/abs/2405.19539", "description": "arXiv:2405.19539v1 Announce Type: new \nAbstract: Canonical Correlation Analysis (CCA) is a widespread technique for discovering linear relationships between two sets of variables $X \\in \\mathbb{R}^{n \\times p}$ and $Y \\in \\mathbb{R}^{n \\times q}$. In high dimensions however, standard estimates of the canonical directions cease to be consistent without assuming further structure. In this setting, a possible solution consists in leveraging the presumed sparsity of the solution: only a subset of the covariates span the canonical directions. While the last decade has seen a proliferation of sparse CCA methods, practical challenges regarding the scalability and adaptability of these methods still persist. To circumvent these issues, this paper suggests an alternative strategy that uses reduced rank regression to estimate the canonical directions when one of the datasets is high-dimensional while the other remains low-dimensional. By casting the problem of estimating the canonical direction as a regression problem, our estimator is able to leverage the rich statistics literature on high-dimensional regression and is easily adaptable to accommodate a wider range of structural priors. Our proposed solution maintains computational efficiency and accuracy, even in the presence of very high-dimensional data. We validate the benefits of our approach through a series of simulated experiments and further illustrate its practicality by applying it to three real-world datasets."}, "https://arxiv.org/abs/2405.19637": {"title": "Inference in semiparametric formation models for directed networks", "link": "https://arxiv.org/abs/2405.19637", "description": "arXiv:2405.19637v1 Announce Type: new \nAbstract: We propose a semiparametric model for dyadic link formations in directed networks. The model contains a set of degree parameters that measure different effects of popularity or outgoingness across nodes, a regression parameter vector that reflects the homophily effect resulting from the nodal attributes or pairwise covariates associated with edges, and a set of latent random noises with unknown distributions. Our interest lies in inferring the unknown degree parameters and homophily parameters. The dimension of the degree parameters increases with the number of nodes. Under the high-dimensional regime, we develop a kernel-based least squares approach to estimate the unknown parameters. The major advantage of our estimator is that it does not encounter the incidental parameter problem for the homophily parameters. We prove consistency of all the resulting estimators of the degree parameters and homophily parameters. We establish high-dimensional central limit theorems for the proposed estimators and provide several applications of our general theory, including testing the existence of degree heterogeneity, testing sparse signals and recovering the support. 
Simulation studies and a real data application are conducted to illustrate the finite sample performance of the proposed methods."}, "https://arxiv.org/abs/2405.19666": {"title": "Bayesian Joint Modeling for Longitudinal Magnitude Data with Informative Dropout: an Application to Critical Care Data", "link": "https://arxiv.org/abs/2405.19666", "description": "arXiv:2405.19666v1 Announce Type: new \nAbstract: In various biomedical studies, the focus of analysis centers on the magnitudes of data, particularly when algebraic signs are irrelevant or lost. To analyze the magnitude outcomes in repeated measures studies, using models with random effects is essential. This is because random effects can account for individual heterogeneity, enhancing parameter estimation precision. However, there are currently no established regression methods that incorporate random effects and are specifically designed for magnitude outcomes. This article bridges this gap by introducing Bayesian regression modeling approaches for analyzing magnitude data, with a key focus on the incorporation of random effects. Additionally, the proposed method is extended to address multiple causes of informative dropout, commonly encountered in repeated measures studies. To tackle the missing data challenge arising from dropout, a joint modeling strategy is developed, building upon the previously introduced regression techniques. Two numerical simulation studies are conducted to assess the validity of our method. The chosen simulation scenarios aim to resemble the conditions of our motivating study. The results demonstrate that the proposed method for magnitude data exhibits good performance in terms of both estimation accuracy and precision, and the joint models effectively mitigate bias due to missing data. Finally, we apply proposed models to analyze the magnitude data from the motivating study, investigating if sex impacts the magnitude change in diaphragm thickness over time for ICU patients."}, "https://arxiv.org/abs/2405.19803": {"title": "Dynamic Factor Analysis of High-dimensional Recurrent Events", "link": "https://arxiv.org/abs/2405.19803", "description": "arXiv:2405.19803v1 Announce Type: new \nAbstract: Recurrent event time data arise in many studies, including biomedicine, public health, marketing, and social media analysis. High-dimensional recurrent event data involving large numbers of event types and observations become prevalent with the advances in information technology. This paper proposes a semiparametric dynamic factor model for the dimension reduction and prediction of high-dimensional recurrent event data. The proposed model imposes a low-dimensional structure on the mean intensity functions of the event types while allowing for dependencies. A nearly rate-optimal smoothing-based estimator is proposed. An information criterion that consistently selects the number of factors is also developed. Simulation studies demonstrate the effectiveness of these inference tools. 
The proposed method is applied to grocery shopping data, for which an interpretable factor structure is obtained."}, "https://arxiv.org/abs/2405.19849": {"title": "Modelling and Forecasting Energy Market Volatility Using GARCH and Machine Learning Approach", "link": "https://arxiv.org/abs/2405.19849", "description": "arXiv:2405.19849v1 Announce Type: new \nAbstract: This paper presents a comparative analysis of univariate and multivariate GARCH-family models and machine learning algorithms in modeling and forecasting the volatility of major energy commodities: crude oil, gasoline, heating oil, and natural gas. It uses a comprehensive dataset incorporating financial, macroeconomic, and environmental variables to assess predictive performance and discusses volatility persistence and transmission across these commodities. Aspects of volatility persistence and transmission, traditionally examined by GARCH-class models, are jointly explored using the SHAP (Shapley Additive exPlanations) method. The findings reveal that machine learning models demonstrate superior out-of-sample forecasting performance compared to traditional GARCH models. Machine learning models tend to underpredict, while GARCH models tend to overpredict energy market volatility, suggesting a hybrid use of both types of models. There is volatility transmission from crude oil to the gasoline and heating oil markets. The volatility transmission in the natural gas market is less prevalent."}, "https://arxiv.org/abs/2405.19865": {"title": "Reduced Rank Regression for Mixed Predictor and Response Variables", "link": "https://arxiv.org/abs/2405.19865", "description": "arXiv:2405.19865v1 Announce Type: new \nAbstract: In this paper, we propose the generalized mixed reduced rank regression method, GMR$^3$ for short. GMR$^3$ is a regression method for a mix of numeric, binary and ordinal response variables. The predictor variables can be a mix of binary, nominal, ordinal, and numeric variables. For dealing with the categorical predictors we use optimal scaling. A majorization-minimization algorithm is derived for maximum likelihood estimation under a local independence assumption. We discuss in detail model selection for the dimensionality or rank, and the selection of predictor variables. We show an application of GMR$^3$ using the Eurobarometer Surveys data set of 2023."}, "https://arxiv.org/abs/2405.19897": {"title": "The Political Resource Curse Redux", "link": "https://arxiv.org/abs/2405.19897", "description": "arXiv:2405.19897v1 Announce Type: new \nAbstract: In the study of the Political Resource Curse (Brollo et al.,2013), the authors identified a new channel to investigate whether the windfalls of resources are unambiguously beneficial to society, both with theory and empirical evidence. This paper revisits the framework with a new dataset. Specifically, we implemented a regression discontinuity design and difference-in-difference specification"}, "https://arxiv.org/abs/2405.19985": {"title": "Targeted Sequential Indirect Experiment Design", "link": "https://arxiv.org/abs/2405.19985", "description": "arXiv:2405.19985v1 Announce Type: new \nAbstract: Scientific hypotheses typically concern specific aspects of complex, imperfectly understood or entirely unknown mechanisms, such as the effect of gene expression levels on phenotypes or how microbial communities influence environmental health. 
Such queries are inherently causal (rather than purely associational), but in many settings, experiments cannot be conducted directly on the target variables of interest but only indirectly: they perturb the target variable without removing potential confounding factors. If, additionally, the resulting experimental measurements are multi-dimensional and the studied mechanisms nonlinear, the query of interest is generally not identified. We develop an adaptive strategy to design indirect experiments that optimally inform a targeted query about the ground truth mechanism in terms of sequentially narrowing the gap between an upper and lower bound on the query. While the general formulation consists of a bi-level optimization procedure, we derive an efficiently estimable analytical kernel-based estimator of the bounds for the causal effect, a query of key interest, and demonstrate the efficacy of our approach in confounded, multivariate, nonlinear synthetic settings."}, "https://arxiv.org/abs/2405.20137": {"title": "A unified framework of principal component analysis and factor analysis", "link": "https://arxiv.org/abs/2405.20137", "description": "arXiv:2405.20137v1 Announce Type: new \nAbstract: Principal component analysis and factor analysis are fundamental multivariate analysis methods. In this paper, a unified framework to connect them is introduced. Under a general latent variable model, we present matrix optimization problems from the viewpoint of loss function minimization, and show that the two methods can be viewed as solutions to the optimization problems with specific loss functions. Specifically, principal component analysis can be derived from a broad class of loss functions including the L2 norm, while factor analysis corresponds to a modified L0 norm problem. Related problems are discussed, including algorithms, penalized maximum likelihood estimation under the latent variable model, and a principal component factor model. These results can lead to new tools of data analysis and research topics."}, "https://arxiv.org/abs/2405.20149": {"title": "Accounting for Mismatch Error in Small Area Estimation with Linked Data", "link": "https://arxiv.org/abs/2405.20149", "description": "arXiv:2405.20149v1 Announce Type: new \nAbstract: In small area estimation, different data sources are integrated in order to produce reliable estimates of target parameters (e.g., a mean or a proportion) for a collection of small subsets (areas) of a finite population. Regression models such as the linear mixed effects model or M-quantile regression are often used to improve the precision of survey sample estimates by leveraging auxiliary information for which means or totals are known at the area level. In many applications, the unit-level linkage of records from different sources is probabilistic and potentially error-prone. In this paper, we present adjustments of the small area predictors that are based on either the linear mixed effects model or M-quantile regression to account for the presence of linkage error. These adjustments are developed from a two-component mixture model that hinges on the assumption of independence of the target and auxiliary variable given incorrect linkage. Estimation and inference are based on composite likelihoods and machinery revolving around the Expectation-Maximization Algorithm. For each of the two regression methods, we propose modified small area predictors and approximations for their mean squared errors. 
The empirical performance of the proposed approaches is studied in both design-based and model-based simulations that include comparisons to a variety of baselines."}, "https://arxiv.org/abs/2405.20164": {"title": "Item response parameter estimation performance using Gaussian quadrature and Laplace", "link": "https://arxiv.org/abs/2405.20164", "description": "arXiv:2405.20164v1 Announce Type: new \nAbstract: Item parameter estimation in pharmacometric item response theory (IRT) models is predominantly performed using the Laplace estimation algorithm as implemented in NONMEM. In psychometrics, a wide range of software tools for implementing IRT, including several packages for the open-source software R, are also available. Each has its own set of benefits and limitations, and to date a systematic comparison of the primary estimation algorithms has not been performed. A simulation study evaluating varying numbers of hypothetical sample sizes and item scenarios at baseline was performed using both Laplace and Gauss-Hermite quadrature (GHQ-EM). In scenarios with at least 20 items and more than 100 subjects, item parameters were estimated with good precision and were similar between estimation algorithms as demonstrated by several measures of bias and precision. The minimal differences observed for certain parameters or sample size scenarios were reduced when translating to the total score scale. The ease of use, speed of estimation and relative accuracy of the GHQ-EM method employed in mirt make it an appropriate alternative or supportive analytical approach to NONMEM for potential pharmacometrics IRT applications."}, "https://arxiv.org/abs/2405.20342": {"title": "Evaluating Approximations of Count Distributions and Forecasts for Poisson-Lindley Integer Autoregressive Processes", "link": "https://arxiv.org/abs/2405.20342", "description": "arXiv:2405.20342v1 Announce Type: new \nAbstract: Although many time series are realizations from discrete processes, it is often the case that a continuous Gaussian model is implemented for modeling and forecasting the data, resulting in incoherent forecasts. Forecasts using a Poisson-Lindley integer autoregressive (PLINAR) model are compared to variations of Gaussian forecasts via simulation by equating relevant moments of the marginals of the PLINAR to the Gaussian AR. To illustrate utility, the methods discussed are applied and compared using a discrete series with model parameters being estimated using each of conditional least squares, Yule-Walker, and maximum likelihood."}, "https://arxiv.org/abs/2405.19407": {"title": "Tempered Multifidelity Importance Sampling for Gravitational Wave Parameter Estimation", "link": "https://arxiv.org/abs/2405.19407", "description": "arXiv:2405.19407v1 Announce Type: cross \nAbstract: Estimating the parameters of compact binaries which coalesce and produce gravitational waves is a challenging Bayesian inverse problem. Gravitational-wave parameter estimation lies within the class of multifidelity problems, where a variety of models with differing assumptions, levels of fidelity, and computational cost are available for use in inference. In an effort to accelerate the solution of a Bayesian inverse problem, cheaper surrogates for the best models may be used to reduce the cost of likelihood evaluations when sampling the posterior. Importance sampling can then be used to reweight these samples to represent the true target posterior, incurring a reduction in the effective sample size. 
In cases where the problem is high-dimensional, or when the surrogate model produces a poor approximation of the true posterior, this reduction in effective samples can be dramatic and render multifidelity importance sampling ineffective. We propose a novel method of tempered multifidelity importance sampling in order to remedy this issue. With this method, the biasing distribution produced by the low-fidelity model is tempered, allowing for potentially better overlap with the target distribution. There is an optimal temperature which maximizes the efficiency in this setting, and we propose a low-cost strategy for approximating this optimal temperature using samples from the untempered distribution. In this paper, we motivate this method by applying it to Gaussian target and biasing distributions. Finally, we apply it to a series of problems in gravitational wave parameter estimation and demonstrate improved efficiencies when applying the method to real gravitational wave detections."}, "https://arxiv.org/abs/2405.19463": {"title": "Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data", "link": "https://arxiv.org/abs/2405.19463", "description": "arXiv:2405.19463v1 Announce Type: cross \nAbstract: We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms require neither matrix inversions nor mini-batches and provide a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation of order $\\mathcal{O}(\\log T/T)$ and $\\mathcal{O}(1/T^{1-\\iota})$ for any $\\iota>0$ under the availability of two-sample and one-sample oracles, respectively, where $T$ is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure avoids explicitly modeling and estimating the relationship between the confounder and the instrumental variables, demonstrating the benefit of the proposed approach over recent works based on reformulating the problem as minimax optimization problems. Numerical experiments are provided to corroborate the theoretical results."}, "https://arxiv.org/abs/2405.19610": {"title": "Factor Augmented Tensor-on-Tensor Neural Networks", "link": "https://arxiv.org/abs/2405.19610", "description": "arXiv:2405.19610v1 Announce Type: cross \nAbstract: This paper studies the prediction task of tensor-on-tensor regression in which both covariates and responses are multi-dimensional arrays (a.k.a., tensors) across time with arbitrary tensor order and data dimension. Existing methods either focused on linear models without accounting for possibly nonlinear relationships between covariates and responses, or directly employed black-box deep learning algorithms that failed to utilize the inherent tensor structure. In this work, we propose a Factor Augmented Tensor-on-Tensor Neural Network (FATTNN) that integrates tensor factor models into deep neural networks. We begin with summarizing and extracting useful predictive information (represented by the ``factor tensor'') from the complex structured tensor covariates, and then proceed with the prediction task using the estimated factor tensor as input of a temporal convolutional neural network. 
The proposed methods effectively handle nonlinearity between complex data structures, and improve over traditional statistical models and conventional deep learning approaches in both prediction accuracy and computational cost. By leveraging tensor factor models, our proposed methods exploit the underlying latent factor structure to enhance the prediction, and in the meantime, drastically reduce the data dimensionality that speeds up the computation. The empirical performances of our proposed methods are demonstrated via simulation studies and real-world applications to three public datasets. Numerical results show that our proposed algorithms achieve substantial increases in prediction accuracy and significant reductions in computational time compared to benchmark methods."}, "https://arxiv.org/abs/2405.19704": {"title": "Enhancing Sufficient Dimension Reduction via Hellinger Correlation", "link": "https://arxiv.org/abs/2405.19704", "description": "arXiv:2405.19704v1 Announce Type: cross \nAbstract: In this work, we develop a new theory and method for sufficient dimension reduction (SDR) in single-index models, where SDR is a sub-field of supervised dimension reduction based on conditional independence. Our work is primarily motivated by the recent introduction of the Hellinger correlation as a dependency measure. Utilizing this measure, we develop a method capable of effectively detecting the dimension reduction subspace, complete with theoretical justification. Through extensive numerical experiments, we demonstrate that our proposed method significantly enhances and outperforms existing SDR methods. This improvement is largely attributed to our proposed method's deeper understanding of data dependencies and the refinement of existing SDR techniques."}, "https://arxiv.org/abs/2405.19920": {"title": "The ARR2 prior: flexible predictive prior definition for Bayesian auto-regressions", "link": "https://arxiv.org/abs/2405.19920", "description": "arXiv:2405.19920v1 Announce Type: cross \nAbstract: We present the ARR2 prior, a joint prior over the auto-regressive components in Bayesian time-series models and their induced $R^2$. Compared to other priors designed for times-series models, the ARR2 prior allows for flexible and intuitive shrinkage. We derive the prior for pure auto-regressive models, and extend it to auto-regressive models with exogenous inputs, and state-space models. Through both simulations and real-world modelling exercises, we demonstrate the efficacy of the ARR2 prior in improving sparse and reliable inference, while showing greater inference quality and predictive performance than other shrinkage priors. An open-source implementation of the prior is provided."}, "https://arxiv.org/abs/2405.20039": {"title": "Task-Agnostic Machine Learning-Assisted Inference", "link": "https://arxiv.org/abs/2405.20039", "description": "arXiv:2405.20039v1 Announce Type: cross \nAbstract: Machine learning (ML) is playing an increasingly important role in scientific research. In conjunction with classical statistical approaches, ML-assisted analytical strategies have shown great promise in accelerating research findings. This has also opened up a whole new field of methodological research focusing on integrative approaches that leverage both ML and statistics to tackle data science challenges. One type of study that has quickly gained popularity employs ML to predict unobserved outcomes in massive samples and then uses the predicted outcomes in downstream statistical inference. 
However, existing methods designed to ensure the validity of this type of post-prediction inference are limited to very basic tasks such as linear regression analysis. This is because any extension of these approaches to new, more sophisticated statistical tasks requires task-specific algebraic derivations and software implementations, which ignores the massive library of existing software tools already developed for complex inference tasks and severely constrains the scope of post-prediction inference in real applications. To address this challenge, we propose a novel statistical framework for task-agnostic ML-assisted inference. It provides a post-prediction inference solution that can be easily plugged into almost any established data analysis routine. It delivers valid and efficient inference that is robust to arbitrary choices of ML models, while allowing nearly all existing analytical frameworks to be incorporated into the analysis of ML-predicted outcomes. Through extensive experiments, we showcase the validity, versatility, and superiority of our method compared to existing approaches."}, "https://arxiv.org/abs/2405.20088": {"title": "Personalized Predictions from Population Level Experiments: A Study on Alzheimer's Disease", "link": "https://arxiv.org/abs/2405.20088", "description": "arXiv:2405.20088v1 Announce Type: cross \nAbstract: The purpose of this article is to infer patient level outcomes from population level randomized control trials (RCTs). In this pursuit, we utilize the recently proposed synthetic nearest neighbors (SNN) estimator. At its core, SNN leverages information across patients to impute missing data associated with each patient of interest. We focus on two types of missing data: (i) unrecorded outcomes from discontinuing the assigned treatments and (ii) unobserved outcomes associated with unassigned treatments. Data imputation in the former powers and de-biases RCTs, while data imputation in the latter simulates \"synthetic RCTs\" to predict the outcomes for each patient under every treatment. The SNN estimator is interpretable, transparent, and causally justified under a broad class of missing data scenarios. Relative to several standard methods, we empirically find that SNN performs well for the above two applications using Phase 3 clinical trial data on patients with Alzheimer's Disease. Our findings directly suggest that SNN can tackle a current pain point within the clinical trial workflow on patient dropouts and serve as a new tool towards the development of precision medicine. Building on our insights, we discuss how SNN can further generalize to real-world applications."}, "https://arxiv.org/abs/2405.20191": {"title": "Multidimensional spatiotemporal clustering -- An application to environmental sustainability scores in Europe", "link": "https://arxiv.org/abs/2405.20191", "description": "arXiv:2405.20191v1 Announce Type: cross \nAbstract: The assessment of corporate sustainability performance is extremely relevant in facilitating the transition to a green and low-carbon intensity economy. However, companies located in different areas may be subject to different sustainability and environmental risks and policies. Henceforth, the main objective of this paper is to investigate the spatial and temporal pattern of the sustainability evaluations of European firms. 
We leverage on a large dataset containing information about companies' sustainability performances, measured by MSCI ESG ratings, and geographical coordinates of firms in Western Europe between 2013 and 2023. By means of a modified version of the Chavent et al. (2018) hierarchical algorithm, we conduct a spatial clustering analysis, combining sustainability and spatial information, and a spatiotemporal clustering analysis, which combines the time dynamics of multiple sustainability features and spatial dissimilarities, to detect groups of firms with homogeneous sustainability performance. We are able to build cross-national and cross-industry clusters with remarkable differences in terms of sustainability scores. Among other results, in the spatio-temporal analysis, we observe a high degree of geographical overlap among clusters, indicating that the temporal dynamics in sustainability assessment are relevant within a multidimensional approach. Our findings help to capture the diversity of ESG ratings across Western Europe and may assist practitioners and policymakers in evaluating companies facing different sustainability-linked risks in different areas."}, "https://arxiv.org/abs/2208.02024": {"title": "Time-Varying Dispersion Integer-Valued GARCH Models", "link": "https://arxiv.org/abs/2208.02024", "description": "arXiv:2208.02024v2 Announce Type: replace \nAbstract: We propose a general class of INteger-valued Generalized AutoRegressive Conditionally Heteroscedastic (INGARCH) processes by allowing time-varying mean and dispersion parameters, which we call time-varying dispersion INGARCH (tv-DINGARCH) models. More specifically, we consider mixed Poisson INGARCH models and allow for dynamic modeling of the dispersion parameter (as well as the mean), similar to the spirit of the ordinary GARCH models. We derive conditions to obtain first and second-order stationarity, and ergodicity as well. Estimation of the parameters is addressed and their associated asymptotic properties are established as well. A restricted bootstrap procedure is proposed for testing constant dispersion against time-varying dispersion. Monte Carlo simulation studies are presented for checking point estimation, standard errors, and the performance of the restricted bootstrap approach. We apply the tv-DINGARCH process to model the weekly number of reported measles infections in North Rhine-Westphalia, Germany, from January 2001 to May 2013, and compare its performance to the ordinary INGARCH approach."}, "https://arxiv.org/abs/2208.07590": {"title": "Neural Networks for Extreme Quantile Regression with an Application to Forecasting of Flood Risk", "link": "https://arxiv.org/abs/2208.07590", "description": "arXiv:2208.07590v3 Announce Type: replace \nAbstract: Risk assessment for extreme events requires accurate estimation of high quantiles that go beyond the range of historical observations. When the risk depends on the values of observed predictors, regression techniques are used to interpolate in the predictor space. We propose the EQRN model that combines tools from neural networks and extreme value theory into a method capable of extrapolation in the presence of complex predictor dependence. Neural networks can naturally incorporate additional structure in the data. We develop a recurrent version of EQRN that is able to capture complex sequential dependence in time series. We apply this method to forecast flood risk in the Swiss Aare catchment. 
It exploits information from multiple covariates in space and time to provide one-day-ahead predictions of return levels and exceedance probabilities. This output complements the static return level from a traditional extreme value analysis, and the predictions are able to adapt to distributional shifts as experienced in a changing climate. Our model can help authorities to manage flooding more effectively and to minimize their disastrous impacts through early warning systems."}, "https://arxiv.org/abs/2305.10656": {"title": "Spectral Change Point Estimation for High Dimensional Time Series by Sparse Tensor Decomposition", "link": "https://arxiv.org/abs/2305.10656", "description": "arXiv:2305.10656v2 Announce Type: replace \nAbstract: Multivariate time series may be subject to partial structural changes over certain frequency band, for instance, in neuroscience. We study the change point detection problem with high dimensional time series, within the framework of frequency domain. The overarching goal is to locate all change points and delineate which series are activated by the change, over which frequencies. In practice, the number of activated series per change and frequency could span from a few to full participation. We solve the problem by first computing a CUSUM tensor based on spectra estimated from blocks of the time series. A frequency-specific projection approach is applied for dimension reduction. The projection direction is estimated by a proposed tensor decomposition algorithm that adjusts to the sparsity level of changes. Finally, the projected CUSUM vectors across frequencies are aggregated for change point detection. We provide theoretical guarantees on the number of estimated change points and the convergence rate of their locations. We derive error bounds for the estimated projection direction for identifying the frequency-specific series activated in a change. We provide data-driven rules for the choice of parameters. The efficacy of the proposed method is illustrated by simulation and a stock returns application."}, "https://arxiv.org/abs/2306.14004": {"title": "Latent Factor Analysis in Short Panels", "link": "https://arxiv.org/abs/2306.14004", "description": "arXiv:2306.14004v2 Announce Type: replace \nAbstract: We develop inferential tools for latent factor analysis in short panels. The pseudo maximum likelihood setting under a large cross-sectional dimension n and a fixed time series dimension T relies on a diagonal TxT covariance matrix of the errors without imposing sphericity nor Gaussianity. We outline the asymptotic distributions of the latent factor and error covariance estimates as well as of an asymptotically uniformly most powerful invariant (AUMPI) test for the number of factors based on the likelihood ratio statistic. We derive the AUMPI characterization from inequalities ensuring the monotone likelihood ratio property for positive definite quadratic forms in normal variables. An empirical application to a large panel of monthly U.S. stock returns separates month after month systematic and idiosyncratic risks in short subperiods of bear vs. bull market based on the selected number of factors. We observe an uptrend in the paths of total and idiosyncratic volatilities while the systematic risk explains a large part of the cross-sectional total variance in bear markets but is not driven by a single factor. 
Rank tests show that observed factors struggle to span the latent factors, with the discrepancy between the dimensions of the two factor spaces decreasing over time."}, "https://arxiv.org/abs/2212.14857": {"title": "Nuisance Function Tuning for Optimal Doubly Robust Estimation", "link": "https://arxiv.org/abs/2212.14857", "description": "arXiv:2212.14857v2 Announce Type: replace-cross \nAbstract: Estimators of doubly robust functionals typically rely on estimating two complex nuisance functions, such as the propensity score and conditional outcome mean for the average treatment effect functional. We consider the problem of how to estimate nuisance functions to obtain optimal rates of convergence for a doubly robust nonparametric functional that has witnessed applications across the causal inference and conditional independence testing literature. For several plug-in type estimators and a one-step type estimator, we illustrate the interplay between different tuning parameter choices for the nuisance function estimators and sample splitting strategies on the optimal rate of estimating the functional of interest. For each of these estimators and each sample splitting strategy, we show the necessity to undersmooth the nuisance function estimators under low regularity conditions to obtain optimal rates of convergence for the functional of interest. By performing suitable nuisance function tuning and sample splitting strategies, we show that some of these estimators can achieve minimax rates of convergence in all H\\\"older smoothness classes of the nuisance functions."}, "https://arxiv.org/abs/2308.13047": {"title": "Federated Causal Inference from Observational Data", "link": "https://arxiv.org/abs/2308.13047", "description": "arXiv:2308.13047v2 Announce Type: replace-cross \nAbstract: Decentralized data sources are prevalent in real-world applications, posing a formidable challenge for causal inference. These sources cannot be consolidated into a single entity owing to privacy constraints. The presence of dissimilar data distributions and missing values within them can potentially introduce bias to the causal estimands. In this article, we propose a framework to estimate causal effects from decentralized data sources. The proposed framework avoids exchanging raw data among the sources, thus contributing towards privacy-preserving causal learning. Three instances of the proposed framework are introduced to estimate causal effects across a wide range of diverse scenarios within a federated setting. (1) FedCI: a Bayesian framework based on Gaussian processes for estimating causal effects from federated observational data sources. It estimates the posterior distributions of the causal effects to compute the higher-order statistics that capture the uncertainty. (2) CausalRFF: an adaptive transfer algorithm that learns the similarities among the data sources by utilizing Random Fourier Features to disentangle the loss function into multiple components, each of which is associated with a data source. It estimates the similarities among the sources through transfer coefficients, and hence requires no prior information about the similarity measures. (3) CausalFI: a new approach for federated causal inference from incomplete data, enabling the estimation of causal effects from multiple decentralized and incomplete data sources. It accounts for the missing data under the missing at random assumption, while also estimating higher-order statistics of the causal estimands. 
The proposed federated framework and its instances are an important step towards a privacy-preserving causal learning model."}, "https://arxiv.org/abs/2309.10211": {"title": "Loop Polarity Analysis to Avoid Underspecification in Deep Learning", "link": "https://arxiv.org/abs/2309.10211", "description": "arXiv:2309.10211v2 Announce Type: replace-cross \nAbstract: Deep learning is a powerful set of techniques for detecting complex patterns in data. However, when the causal structure of that process is underspecified, deep learning models can be brittle, lacking robustness to shifts in the distribution of the data-generating process. In this paper, we turn to loop polarity analysis as a tool for specifying the causal structure of a data-generating process, in order to encode a more robust understanding of the relationship between system structure and system behavior within the deep learning pipeline. We use simulated epidemic data based on an SIR model to demonstrate how measuring the polarity of the different feedback loops that compose a system can lead to more robust inferences on the part of neural networks, improving the out-of-distribution performance of a deep learning model and infusing a system-dynamics-inspired approach into the machine learning development pipeline."}, "https://arxiv.org/abs/2309.15769": {"title": "Algebraic and Statistical Properties of the Ordinary Least Squares Interpolator", "link": "https://arxiv.org/abs/2309.15769", "description": "arXiv:2309.15769v2 Announce Type: replace-cross \nAbstract: Deep learning research has uncovered the phenomenon of benign overfitting for overparameterized statistical models, which has drawn significant theoretical interest in recent years. Given its simplicity and practicality, the ordinary least squares (OLS) interpolator has become essential to gain foundational insights into this phenomenon. While properties of OLS are well established in classical, underparameterized settings, its behavior in high-dimensional, overparameterized regimes is less explored (unlike for ridge or lasso regression) though significant progress has been made of late. We contribute to this growing literature by providing fundamental algebraic and statistical results for the minimum $\\ell_2$-norm OLS interpolator. In particular, we provide algebraic equivalents of (i) the leave-$k$-out residual formula, (ii) Cochran's formula, and (iii) the Frisch-Waugh-Lovell theorem in the overparameterized regime. These results aid in understanding the OLS interpolator's ability to generalize and have substantive implications for causal inference. Under the Gauss-Markov model, we present statistical results such as an extension of the Gauss-Markov theorem and an analysis of variance estimation under homoskedastic errors for the overparameterized regime. To substantiate our theoretical contributions, we conduct simulations that further explore the stochastic properties of the OLS interpolator."}, "https://arxiv.org/abs/2401.14535": {"title": "CaRiNG: Learning Temporal Causal Representation under Non-Invertible Generation Process", "link": "https://arxiv.org/abs/2401.14535", "description": "arXiv:2401.14535v2 Announce Type: replace-cross \nAbstract: Identifying the underlying time-delayed latent causal processes in sequential data is vital for grasping temporal dynamics and making downstream reasoning. 
While some recent methods can robustly identify these latent causal variables, they rely on strict assumptions about the invertible generation process from latent variables to observed data. However, these assumptions are often hard to satisfy in real-world applications containing information loss. For instance, the visual perception process translates a 3D space into 2D images, or the phenomenon of persistence of vision incorporates historical data into current perceptions. To address this challenge, we establish an identifiability theory that allows for the recovery of independent latent components even when they come from a nonlinear and non-invertible mixture. Using this theory as a foundation, we propose a principled approach, CaRiNG, to learn the CAusal RepresentatIon of Non-invertible Generative temporal data with identifiability guarantees. Specifically, we utilize temporal context to recover lost latent information and apply the conditions in our theory to guide the training process. Through experiments conducted on synthetic datasets, we validate that our CaRiNG method reliably identifies the causal process, even when the generation process is non-invertible. Moreover, we demonstrate that our approach considerably improves temporal understanding and reasoning in practical applications."}, "https://arxiv.org/abs/2405.20400": {"title": "Fast leave-one-cluster-out cross-validation by clustered Network Information Criteria (NICc)", "link": "https://arxiv.org/abs/2405.20400", "description": "arXiv:2405.20400v1 Announce Type: new \nAbstract: This paper introduces a clustered estimator of the Network Information Criterion (NICc) to approximate leave-one-cluster-out cross-validated deviance, which can be used as an alternative to cluster-based cross-validation when modeling clustered data. Stone proved that the Akaike Information Criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation if the parametric model is true. Ripley pointed out that the Network Information Criterion (NIC), derived in Stone's proof, is a better approximation to leave-one-observation-out cross-validation when the model is not true. For clustered data, we derived a clustered estimator of NIC, referred to as NICc, by substituting the Fisher information matrix in NIC with its estimator that adjusts for clustering. This adjustment imposes a larger penalty in NICc than the unclustered estimator of NIC when modeling clustered data, thereby preventing overfitting more effectively. In a simulation study and an empirical example, we used linear and logistic regression to model clustered data with Gaussian or binomial response, respectively. We showed that NICc is a better approximation to leave-one-cluster-out deviance and prevents overfitting more effectively than AIC and Bayesian Information Criterion (BIC). NICc leads to more accurate model selection, as determined by cluster-based cross-validation, compared to AIC and BIC."}, "https://arxiv.org/abs/2405.20415": {"title": "Differentially Private Boxplots", "link": "https://arxiv.org/abs/2405.20415", "description": "arXiv:2405.20415v1 Announce Type: new \nAbstract: Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains relatively underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparing multiple datasets. Consequently, we introduce a differentially private boxplot. 
We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms a boxplot naively constructed from existing differentially private quantile algorithms. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization."}, "https://arxiv.org/abs/2405.20601": {"title": "Bayesian Nonparametric Quasi Likelihood", "link": "https://arxiv.org/abs/2405.20601", "description": "arXiv:2405.20601v1 Announce Type: new \nAbstract: A recent trend in Bayesian research has been revisiting generalizations of the likelihood that enable Bayesian inference without requiring the specification of a model for the data generating mechanism. This paper focuses on a Bayesian nonparametric extension of Wedderburn's quasi-likelihood, using Bayesian additive regression trees to model the mean function. Here, the analyst posits only a structural relationship between the mean and variance of the outcome. We show that this approach provides a unified, computationally efficient, framework for extending Bayesian decision tree ensembles to many new settings, including simplex-valued and heavily heteroskedastic data. We also introduce Bayesian strategies for inferring the dispersion parameter of the quasi-likelihood, a task which is complicated by the fact that the quasi-likelihood itself does not contain information about this parameter; despite these challenges, we are able to inject updates for the dispersion parameter into a Markov chain Monte Carlo inference scheme in a way that, in the parametric setting, leads to a Bernstein-von Mises result for the stationary distribution of the resulting Markov chain. We illustrate the utility of our approach on a variety of both synthetic and non-synthetic datasets."}, "https://arxiv.org/abs/2405.20644": {"title": "Fixed-budget optimal designs for multi-fidelity computer experiments", "link": "https://arxiv.org/abs/2405.20644", "description": "arXiv:2405.20644v1 Announce Type: new \nAbstract: This work focuses on the design of experiments of multi-fidelity computer experiments. We consider the autoregressive Gaussian process model proposed by Kennedy and O'Hagan (2000) and the optimal nested design that maximizes the prediction accuracy subject to a budget constraint. An approximate solution is identified through the idea of multi-level approximation and recent error bounds of Gaussian process regression. The proposed (approximately) optimal designs admit a simple analytical form. We prove that, to achieve the same prediction accuracy, the proposed optimal multi-fidelity design requires much lower computational cost than any single-fidelity design in the asymptotic sense. Numerical studies confirm this theoretical assertion."}, "https://arxiv.org/abs/2405.20655": {"title": "Statistical inference for case-control logistic regression via integrating external summary data", "link": "https://arxiv.org/abs/2405.20655", "description": "arXiv:2405.20655v1 Announce Type: new \nAbstract: Case-control sampling is a commonly used retrospective sampling design to alleviate imbalanced structure of binary data. 
When fitting the logistic regression model with case-control data, although the slope parameter of the model can be consistently estimated, the intercept parameter is not identifiable, and the marginal case proportion is not estimable, either. We consider situations in which, besides the case-control data from the main study, called the internal study, there also exists summary-level information from related external studies. An empirical likelihood-based approach is proposed to make inference for the logistic model by incorporating the internal case-control data and external information. We show that the intercept parameter is identifiable with the help of external information, and then all the regression parameters as well as the marginal case proportion can be estimated consistently. The proposed method also accounts for the possible variability in external studies. The resultant estimators are shown to be asymptotically normally distributed. The asymptotic variance-covariance matrix can be consistently estimated by the case-control data. The optimal way to utilize external information is discussed. Simulation studies are conducted to verify the theoretical findings. A real data set is analyzed for illustration."}, "https://arxiv.org/abs/2405.20758": {"title": "Fast Bayesian Basis Selection for Functional Data Representation with Correlated Errors", "link": "https://arxiv.org/abs/2405.20758", "description": "arXiv:2405.20758v1 Announce Type: new \nAbstract: Functional data analysis (FDA) finds widespread application across various fields, due to data being recorded continuously over a time interval or at several discrete points. Since the data is not observed at every point but rather across a dense grid, smoothing techniques are often employed to convert the observed data into functions. In this work, we propose a novel Bayesian approach for selecting basis functions for smoothing one or multiple curves simultaneously. Our method differs from other Bayesian approaches in two key ways: (i) by accounting for correlated errors and (ii) by developing a variational EM algorithm instead of a Gibbs sampler. Simulation studies demonstrate that our method effectively identifies the true underlying structure of the data across various scenarios, and it is applicable to different types of functional data. Our variational EM algorithm not only recovers the basis coefficients and the correct set of basis functions but also estimates the existing within-curve correlation. When applied to the motorcycle dataset, our method demonstrates comparable, and in some cases superior, performance in terms of adjusted $R^2$ compared to other techniques such as regression splines, Bayesian LASSO and LASSO. Additionally, when assuming independence among observations within a curve, our method, utilizing only a variational Bayes algorithm, is on average thousands of times faster than a Gibbs sampler. Our proposed method is implemented in R and code is available at https://github.com/acarolcruz/VB-Bases-Selection."}, "https://arxiv.org/abs/2405.20817": {"title": "Extremile scalar-on-function regression with application to climate scenarios", "link": "https://arxiv.org/abs/2405.20817", "description": "arXiv:2405.20817v1 Announce Type: new \nAbstract: Extremiles provide a generalization of quantiles which are not only robust, but also have an intrinsic link with extreme value theory. This paper introduces an extremile regression model tailored for functional covariate spaces. 
The estimation procedure turns out to be a weighted version of local linear scalar-on-function regression, where now a double kernel approach plays a crucial role. Asymptotic expressions for the bias and variance are established, applicable to both decreasing bandwidth sequences and automatically selected bandwidths. The methodology is then investigated in detail through a simulation study. Furthermore, we highlight the applicability of the model through the analysis of data sourced from the CH2018 Swiss climate scenarios project, offering insights into its ability to serve as a modern tool to quantify climate behaviour."}, "https://arxiv.org/abs/2405.20856": {"title": "Parameter identification in linear non-Gaussian causal models under general confounding", "link": "https://arxiv.org/abs/2405.20856", "description": "arXiv:2405.20856v1 Announce Type: new \nAbstract: Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result, explore a generalization to models with feedback loops, and provide new results on the identifiability of the causal graph."}, "https://arxiv.org/abs/2405.20936": {"title": "Bayesian Deep Generative Models for Replicated Networks with Multiscale Overlapping Clusters", "link": "https://arxiv.org/abs/2405.20936", "description": "arXiv:2405.20936v1 Announce Type: new \nAbstract: Our interest is in replicated network data with multiple networks observed across the same set of nodes. Examples include brain connection networks, in which nodes corresponds to brain regions and replicates to different individuals, and ecological networks, in which nodes correspond to species and replicates to samples collected at different locations and/or times. Our goal is to infer a hierarchical structure of the nodes at a population level, while performing multi-resolution clustering of the individual replicates. In brain connectomics, the focus is on inferring common relationships among the brain regions, while characterizing inter-individual variability in an easily interpretable manner. To accomplish this, we propose a Bayesian hierarchical model, while providing theoretical support in terms of identifiability and posterior consistency, and design efficient methods for posterior computation. We provide novel technical tools for proving model identifiability, which are of independent interest. 
Our simulations and application to brain connectome data provide support for the proposed methodology."}, "https://arxiv.org/abs/2405.20957": {"title": "Data Fusion for Heterogeneous Treatment Effect Estimation with Multi-Task Gaussian Processes", "link": "https://arxiv.org/abs/2405.20957", "description": "arXiv:2405.20957v1 Announce Type: new \nAbstract: Bridging the gap between internal and external validity is crucial for heterogeneous treatment effect estimation. Randomised controlled trials (RCTs), favoured for their internal validity due to randomisation, often encounter challenges in generalising findings due to strict eligibility criteria. Observational studies on the other hand, provide external validity advantages through larger and more representative samples but suffer from compromised internal validity due to unmeasured confounding. Motivated by these complementary characteristics, we propose a novel Bayesian nonparametric approach leveraging multi-task Gaussian processes to integrate data from both RCTs and observational studies. In particular, we introduce a parameter which controls the degree of borrowing between the datasets and prevents the observational dataset from dominating the estimation. The value of the parameter can be either user-set or chosen through a data-adaptive procedure. Our approach outperforms other methods in point predictions across the covariate support of the observational study, and furthermore provides a calibrated measure of uncertainty for the estimated treatment effects, which is crucial when extrapolating. We demonstrate the robust performance of our approach in diverse scenarios through multiple simulation studies and a real-world education randomised trial."}, "https://arxiv.org/abs/2405.21020": {"title": "Bayesian Estimation of Hierarchical Linear Models from Incomplete Data: Cluster-Level Interaction Effects and Small Sample Sizes", "link": "https://arxiv.org/abs/2405.21020", "description": "arXiv:2405.21020v1 Announce Type: new \nAbstract: We consider Bayesian estimation of a hierarchical linear model (HLM) from small sample sizes where 37 patient-physician encounters are repeatedly measured at four time points. The continuous response $Y$ and continuous covariates $C$ are partially observed and assumed missing at random. With $C$ having linear effects, the HLM may be efficiently estimated by available methods. When $C$ includes cluster-level covariates having interactive or other nonlinear effects given small sample sizes, however, maximum likelihood estimation is suboptimal, and existing Gibbs samplers are based on a Bayesian joint distribution compatible with the HLM, but impute missing values of $C$ by a Metropolis algorithm via a proposal density having a constant variance while the target conditional distribution has a nonconstant variance. Therefore, the samplers are not guaranteed to be compatible with the joint distribution and, thus, not guaranteed to always produce unbiased estimation of the HLM. We introduce a compatible Gibbs sampler that imputes parameters and missing values directly from the exact conditional distributions. 
We analyze repeated measurements from patient-physician encounters by our sampler, and compare our estimators with those of existing methods by simulation."}, "https://arxiv.org/abs/2405.20418": {"title": "A Bayesian joint model of multiple nonlinear longitudinal and competing risks outcomes for dynamic prediction in multiple myeloma: joint estimation and corrected two-stage approaches", "link": "https://arxiv.org/abs/2405.20418", "description": "arXiv:2405.20418v1 Announce Type: cross \nAbstract: Predicting cancer-associated clinical events is challenging in oncology. In Multiple Myeloma (MM), a cancer of plasma cells, disease progression is determined by changes in biomarkers, such as serum concentration of the paraprotein secreted by plasma cells (M-protein). Therefore, the time-dependent behaviour of M-protein and the transition across lines of therapy (LoT) that may be a consequence of disease progression should be accounted for in statistical models to predict relevant clinical outcomes. Furthermore, it is important to understand the contribution of the patterns of longitudinal biomarkers, upon each LoT initiation, to time-to-death or time-to-next-LoT. Motivated by these challenges, we propose a Bayesian joint model for trajectories of multiple longitudinal biomarkers, such as M-protein, and the competing risks of death and transition to next LoT. Additionally, we explore two estimation approaches for our joint model: simultaneous estimation of all parameters (joint estimation) and sequential estimation of parameters using a corrected two-stage strategy aiming to reduce computational time. Our proposed model and estimation methods are applied to a retrospective cohort study from a real-world database of patients diagnosed with MM in the US from January 2015 to February 2022. We split the data into training and test sets in order to validate the joint model using both estimation approaches and make dynamic predictions of times until clinical events of interest, informed by longitudinally measured biomarkers and baseline variables available up to the time of prediction."}, "https://arxiv.org/abs/2405.20715": {"title": "Transforming Japan Real Estate", "link": "https://arxiv.org/abs/2405.20715", "description": "arXiv:2405.20715v1 Announce Type: cross \nAbstract: The Japanese real estate market, valued over 35 trillion USD, offers significant investment opportunities. Accurate rent and price forecasting could provide a substantial competitive edge. This paper explores using alternative data variables to predict real estate performance in 1100 Japanese municipalities. A comprehensive house price index was created, covering all municipalities from 2005 to the present, using a dataset of over 5 million transactions. This core dataset was enriched with economic factors spanning decades, allowing for price trajectory predictions.\n The findings show that alternative data variables can indeed forecast real estate performance effectively. Investment signals based on these variables yielded notable returns with low volatility. For example, the net migration ratio delivered an annualized return of 4.6% with a Sharpe ratio of 1.5. Taxable income growth and new dwellings ratio also performed well, with annualized returns of 4.1% (Sharpe ratio of 1.3) and 3.3% (Sharpe ratio of 0.9), respectively. 
When combined with transformer models to predict risk-adjusted returns 4 years in advance, the model achieved an R-squared score of 0.28, explaining nearly 30% of the variation in future municipality prices.\n These results highlight the potential of alternative data variables in real estate investment. They underscore the need for further research to identify more predictive factors. Nonetheless, the evidence suggests that such data can provide valuable insights into real estate price drivers, enabling more informed investment decisions in the Japanese market."}, "https://arxiv.org/abs/2405.20779": {"title": "Asymptotic utility of spectral anonymization", "link": "https://arxiv.org/abs/2405.20779", "description": "arXiv:2405.20779v1 Announce Type: cross \nAbstract: In the contemporary data landscape characterized by multi-source data collection and third-party sharing, ensuring individual privacy stands as a critical concern. While various anonymization methods exist, their utility preservation and privacy guarantees remain challenging to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods that directly modify the original data, SA operates by perturbing the data in a spectral basis and subsequently reverting them to their original basis. Alongside the original version $\\mathcal{P}$-SA, employing random permutation transformation, we introduce two novel SA variants: $\\mathcal{J}$-spectral anonymization and $\\mathcal{O}$-spectral anonymization, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. Our results reveal, in particular, that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% when compared to the original data. To assess the applicability of these asymptotic results in practice, we conduct a simulation study with finite data and also evaluate the privacy protection offered by these algorithms using distance-based record linkage. Our research reveals that while no method exhibits clear superiority in finite-sample utility, $\\mathcal{O}$-SA distinguishes itself for its exceptional privacy preservation, never producing identical records, albeit with increased computational complexity. Conversely, $\\mathcal{P}$-SA emerges as a computationally efficient alternative, demonstrating unmatched efficiency in mean estimation."}, "https://arxiv.org/abs/2405.21012": {"title": "G-Transformer for Conditional Average Potential Outcome Estimation over Time", "link": "https://arxiv.org/abs/2405.21012", "description": "arXiv:2405.21012v1 Announce Type: cross \nAbstract: Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task suffer from either (a) bias or (b) large variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model designed for unbiased, low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. 
In sum, this work represents a significant step towards personalized decision-making from electronic health records."}, "https://arxiv.org/abs/2201.02532": {"title": "Approximate Factor Models for Functional Time Series", "link": "https://arxiv.org/abs/2201.02532", "description": "arXiv:2201.02532v3 Announce Type: replace \nAbstract: We propose a novel approximate factor model tailored for analyzing time-dependent curve data. Our model decomposes such data into two distinct components: a low-dimensional predictable factor component and an unpredictable error term. These components are identified through the autocovariance structure of the underlying functional time series. The model parameters are consistently estimated using the eigencomponents of a cumulative autocovariance operator and an information criterion is proposed to determine the appropriate number of factors. The methodology is applied to yield curve modeling and forecasting. Our results indicate that more than three factors are required to characterize the dynamics of the term structure of bond yields."}, "https://arxiv.org/abs/2303.01887": {"title": "Fast Forecasting of Unstable Data Streams for On-Demand Service Platforms", "link": "https://arxiv.org/abs/2303.01887", "description": "arXiv:2303.01887v2 Announce Type: replace \nAbstract: On-demand service platforms face a challenging problem of forecasting a large collection of high-frequency regional demand data streams that exhibit instabilities. This paper develops a novel forecast framework that is fast and scalable, and automatically assesses changing environments without human intervention. We empirically test our framework on a large-scale demand data set from a leading on-demand delivery platform in Europe, and find strong performance gains from using our framework against several industry benchmarks, across all geographical regions, loss functions, and both pre- and post-Covid periods. We translate forecast gains to economic impacts for this on-demand service platform by computing financial gains and reductions in computing costs."}, "https://arxiv.org/abs/2309.13251": {"title": "Nonparametric estimation of conditional densities by generalized random forests", "link": "https://arxiv.org/abs/2309.13251", "description": "arXiv:2309.13251v3 Announce Type: replace \nAbstract: Considering a continuous random variable $Y$ together with a continuous random vector $X$, I propose a nonparametric estimator $\\hat{f}(\\cdot|x)$ for the conditional density of $Y$ given $X=x$. This estimator takes the form of an exponential series whose coefficients $\\hat{\\theta}_{x}=(\\hat{\\theta}_{x,1}, \\dots,\\hat{\\theta}_{x,J})$ are the solution of a system of nonlinear equations that depends on an estimator of the conditional expectation $E[\\phi (Y)|X=x]$, where $\\phi$ is a $J$-dimensional vector of basis functions. The distinguishing feature of the proposed estimator is that $E[\\phi(Y)|X=x]$ is estimated by generalized random forest (Athey, Tibshirani, and Wager, Annals of Statistics, 2019), targeting the heterogeneity of $\\hat{\\theta}_{x}$ across $x$. I show that $\\hat{f}(\\cdot|x)$ is uniformly consistent and asymptotically normal, allowing $J \\rightarrow \\infty$. I also provide a standard error formula to construct asymptotically valid confidence intervals. 
Results from Monte Carlo experiments and an empirical illustration are provided."}, "https://arxiv.org/abs/2104.11702": {"title": "Correlated Dynamics in Marketing Sensitivities", "link": "https://arxiv.org/abs/2104.11702", "description": "arXiv:2104.11702v2 Announce Type: replace-cross \nAbstract: Understanding individual customers' sensitivities to prices, promotions, brands, and other marketing mix elements is fundamental to a wide swath of marketing problems. An important but understudied aspect of this problem is the dynamic nature of these sensitivities, which change over time and vary across individuals. Prior work has developed methods for capturing such dynamic heterogeneity within product categories, but neglected the possibility of correlated dynamics across categories. In this work, we introduce a framework to capture such correlated dynamics using a hierarchical dynamic factor model, where individual preference parameters are influenced by common cross-category dynamic latent factors, estimated through Bayesian nonparametric Gaussian processes. We apply our model to grocery purchase data, and find that a surprising degree of dynamic heterogeneity can be accounted for by only a few global trends. We also characterize the patterns in how consumers' sensitivities evolve across categories. Managerially, the proposed framework not only enhances predictive accuracy by leveraging cross-category data, but enables more precise estimation of quantities of interest, like price elasticity."}, "https://arxiv.org/abs/2406.00196": {"title": "A Seamless Phase II/III Design with Dose Optimization for Oncology Drug Development", "link": "https://arxiv.org/abs/2406.00196", "description": "arXiv:2406.00196v1 Announce Type: new \nAbstract: The US FDA's Project Optimus initiative that emphasizes dose optimization prior to marketing approval represents a pivotal shift in oncology drug development. It has a ripple effect for rethinking what changes may be made to conventional pivotal trial designs to incorporate a dose optimization component. Aligned with this initiative, we propose a novel Seamless Phase II/III Design with Dose Optimization (SDDO framework). The proposed design starts with dose optimization in a randomized setting, leading to an interim analysis focused on optimal dose selection, trial continuation decisions, and sample size re-estimation (SSR). Based on the decision at interim analysis, patient enrollment continues for both the selected dose arm and control arm, and the significance of treatment effects will be determined at final analysis. The SDDO framework offers increased flexibility and cost-efficiency through sample size adjustment, while stringently controlling the Type I error. This proposed design also facilitates both Accelerated Approval (AA) and regular approval in a \"one-trial\" approach. Extensive simulation studies confirm that our design reliably identifies the optimal dosage and makes preferable decisions with a reduced sample size while retaining statistical power."}, "https://arxiv.org/abs/2406.00245": {"title": "Model-based Clustering of Zero-Inflated Single-Cell RNA Sequencing Data via the EM Algorithm", "link": "https://arxiv.org/abs/2406.00245", "description": "arXiv:2406.00245v1 Announce Type: new \nAbstract: Biological cells can be distinguished by their phenotype or at the molecular level, based on their genome, epigenome, and transcriptome. 
This paper focuses on the transcriptome, which encompasses all the RNA transcripts in a given cell population, indicating the genes being expressed at a given time. We consider single-cell RNA sequencing data and develop a novel model-based clustering method to group cells based on their transcriptome profiles. Our clustering approach takes into account the presence of zero inflation in the data, which can occur due to genuine biological zeros or technological noise. The proposed model for clustering involves a mixture of zero-inflated Poisson or zero-inflated negative binomial distributions, and parameter estimation is carried out using the EM algorithm. We evaluate the performance of our proposed methodology through simulation studies and analyses of publicly available datasets."}, "https://arxiv.org/abs/2406.00322": {"title": "Adaptive Penalized Likelihood method for Markov Chains", "link": "https://arxiv.org/abs/2406.00322", "description": "arXiv:2406.00322v1 Announce Type: new \nAbstract: Maximum Likelihood Estimation (MLE) and Likelihood Ratio Test (LRT) are widely used methods for estimating the transition probability matrix in Markov chains and identifying significant relationships between transitions, such as equality. However, the estimated transition probability matrix derived from MLE lacks accuracy compared to the real one, and LRT is inefficient in high-dimensional Markov chains. In this study, we extended the adaptive Lasso technique from linear models to Markov chains and proposed a novel model by applying penalized maximum likelihood estimation to optimize the estimation of the transition probability matrix. Meanwhile, we demonstrated that the new model enjoys oracle properties, which means the estimated transition probability matrix has the same performance as the real one when given. Simulations show that our new method behaves very well overall in comparison with various competitors. Real data analysis further confirms the value of our proposed method."}, "https://arxiv.org/abs/2406.00442": {"title": "Optimizing hydrogen and e-methanol production through Power-to-X integration in biogas plants", "link": "https://arxiv.org/abs/2406.00442", "description": "arXiv:2406.00442v1 Announce Type: new \nAbstract: The European Union strategy for net-zero emissions relies on developing hydrogen and electrofuels infrastructure. These fuels will be crucial as energy carriers and balancing agents for renewable energy variability. Large-scale production requires more renewable capacity, and various Power-to-X (PtX) concepts are emerging in renewable-rich countries. However, sourcing renewable carbon to scale carbon-based electrofuels is a significant challenge. This study explores a PtX hub that sources renewable CO2 from biogas plants, integrating renewable energy, hydrogen production, and methanol synthesis on site. This concept creates an internal market for energy and materials, interfacing with the external energy system. The size and operation of the PtX hub were optimized, considering integration with local energy systems and a potential hydrogen grid. The levelized costs of hydrogen and methanol were estimated for a 2030 start, considering new legislation on renewable fuels of non-biological origin (RFNBOs). Our results show the PtX hub can rely mainly on on-site renewable energy, selling excess electricity to the grid. A local hydrogen grid connection improves operations, and the behind-the-meter market lowers energy prices, buffering against market variability. 
We found methanol costs could be below 650 euros per ton and hydrogen production costs below 3 euros per kg, with standalone methanol plants costing 23 per cent more. The ratio of CO2 recovery to methanol production is crucial, with over 90 per cent recovery requiring significant investment in CO2 and H2 storage. Overall, our findings support planning PtX infrastructures integrated with the agricultural sector as a cost-effective way to access renewable carbon."}, "https://arxiv.org/abs/2406.00472": {"title": "Financial Deepening and Economic Growth in Select Emerging Markets with Currency Board Systems: Theory and Evidence", "link": "https://arxiv.org/abs/2406.00472", "description": "arXiv:2406.00472v1 Announce Type: new \nAbstract: This paper investigates some indicators of financial development in select countries with currency board systems and raises some questions about the connection between financial development and growth in currency board systems. Most of those cases are long-past episodes of what we would now call emerging markets. However, the paper also looks at Hong Kong, the currency board system that is one of the world's largest and most advanced financial markets. The global financial crisis of 2008-09 created doubts about the efficiency of financial markets in advanced economies, including in Hong Kong, and unsettled the previous consensus that a large financial sector would be more stable than a smaller one."}, "https://arxiv.org/abs/2406.00493": {"title": "Assessment of Case Influence in the Lasso with a Case-weight Adjusted Solution Path", "link": "https://arxiv.org/abs/2406.00493", "description": "arXiv:2406.00493v1 Announce Type: new \nAbstract: We study case influence in the Lasso regression using Cook's distance, which measures overall change in the fitted values when one observation is deleted. Unlike in ordinary least squares regression, the estimated coefficients in the Lasso do not have a closed form due to the nondifferentiability of the $\\ell_1$ penalty, and neither does Cook's distance. To find the case-deleted Lasso solution without refitting the model, we approach it from the full data solution by introducing a weight parameter ranging from 1 to 0 and generating a solution path indexed by this parameter. We show that the solution path is piecewise linear with respect to a simple function of the weight parameter under a fixed penalty. The resulting case influence is a function of the penalty and weight, and it becomes Cook's distance when the weight is 0. As the penalty parameter changes, selected variables change, and the magnitude of Cook's distance for the same data point may vary with the subset of variables selected. In addition, we introduce a case influence graph to visualize how the contribution of each data point changes with the penalty parameter. From the graph, we can identify influential points at different penalty levels and make modeling decisions accordingly. 
Moreover, we find that case influence graphs exhibit different patterns between underfitting and overfitting phases, which can provide additional information for model selection."}, "https://arxiv.org/abs/2406.00549": {"title": "Zero Inflation as a Missing Data Problem: a Proxy-based Approach", "link": "https://arxiv.org/abs/2406.00549", "description": "arXiv:2406.00549v1 Announce Type: new \nAbstract: A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data).\n Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated.\n This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, given the proxy-indicator relationship is known.\n If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using methods in Duarte et al.[2023].\n We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs)."}, "https://arxiv.org/abs/2406.00650": {"title": "Cluster-robust jackknife and bootstrap inference for binary response models", "link": "https://arxiv.org/abs/2406.00650", "description": "arXiv:2406.00650v1 Announce Type: new \nAbstract: We study cluster-robust inference for binary response models. Inference based on the most commonly-used cluster-robust variance matrix estimator (CRVE) can be very unreliable. We study several alternatives. Conceptually the simplest of these, but also the most computationally demanding, involves jackknifing at the cluster level. We also propose a linearized version of the cluster-jackknife variance matrix estimator as well as linearized versions of the wild cluster bootstrap. The linearizations are based on empirical scores and are computationally efficient. Throughout we use the logit model as a leading example. We also discuss a new Stata software package called logitjack which implements these procedures. 
Simulation results strongly favor the new methods, and two empirical examples suggest that it can be important to use them in practice."}, "https://arxiv.org/abs/2406.00700": {"title": "On the modelling and prediction of high-dimensional functional time series", "link": "https://arxiv.org/abs/2406.00700", "description": "arXiv:2406.00700v1 Announce Type: new \nAbstract: We propose a two-step procedure to model and predict high-dimensional functional time series, where the number of function-valued time series $p$ is large in relation to the length of time series $n$. Our first step performs an eigenanalysis of a positive definite matrix, which leads to a one-to-one linear transformation for the original high-dimensional functional time series, and the transformed curve series can be segmented into several groups such that any two subseries from any two different groups are uncorrelated both contemporaneously and serially. Consequently in our second step those groups are handled separately without the information loss on the overall linear dynamic structure. The second step is devoted to establishing a finite-dimensional dynamical structure for all the transformed functional time series within each group. Furthermore the finite-dimensional structure is represented by that of a vector time series. Modelling and forecasting for the original high-dimensional functional time series are realized via those for the vector time series in all the groups. We investigate the theoretical properties of our proposed methods, and illustrate the finite-sample performance through both extensive simulation and two real datasets."}, "https://arxiv.org/abs/2406.00730": {"title": "Assessing survival models by interval testing", "link": "https://arxiv.org/abs/2406.00730", "description": "arXiv:2406.00730v1 Announce Type: new \nAbstract: When considering many survival models, decisions become more challenging in health economic evaluation. In this paper, we present a set of methods to assist with selecting the most appropriate survival models. The methods highlight areas of particularly poor fit. Furthermore, plots and overall p-values provide guidance on whether a survival model should be rejected or not."}, "https://arxiv.org/abs/2406.00804": {"title": "On the Addams family of discrete frailty distributions for modelling multivariate case I interval-censored data", "link": "https://arxiv.org/abs/2406.00804", "description": "arXiv:2406.00804v1 Announce Type: new \nAbstract: Random effect models for time-to-event data, also known as frailty models, provide a conceptually appealing way of quantifying association between survival times and of representing heterogeneities resulting from factors which may be difficult or impossible to measure. In the literature, the random effect is usually assumed to have a continuous distribution. However, in some areas of application, discrete frailty distributions may be more appropriate. The present paper is about the implementation and interpretation of the Addams family of discrete frailty distributions. We propose methods of estimation for this family of densities in the context of shared frailty models for the hazard rates for case I interval-censored data. Our optimization framework allows for stratification of random effect distributions by covariates. We highlight interpretational advantages of the Addams family of discrete frailty distributions and the K-point distribution as compared to other frailty distributions. 
A unique feature of the Addams family and the K-point distribution is that the support of the frailty distribution depends on its parameters. This feature is best exploited by imposing a model on the distributional parameters, resulting in a model with non-homogeneous covariate effects that can be analysed using standard measures such as the hazard ratio. Our methods are illustrated with applications to multivariate case I interval-censored infection data."}, "https://arxiv.org/abs/2406.00827": {"title": "LaLonde (1986) after Nearly Four Decades: Lessons Learned", "link": "https://arxiv.org/abs/2406.00827", "description": "arXiv:2406.00827v1 Announce Type: new \nAbstract: In 1986, Robert LaLonde published an article that compared nonexperimental estimates to experimental benchmarks LaLonde (1986). He concluded that the nonexperimental methods at the time could not systematically replicate experimental benchmarks, casting doubt on the credibility of these methods. Following LaLonde's critical assessment, there have been significant methodological advances and practical changes, including (i) an emphasis on estimators based on unconfoundedness, (ii) a focus on the importance of overlap in covariate distributions, (iii) the introduction of propensity score-based methods leading to doubly robust estimators, (iv) a greater emphasis on validation exercises to bolster research credibility, and (v) methods for estimating and exploiting treatment effect heterogeneity. To demonstrate the practical lessons from these advances, we reexamine the LaLonde data and the Imbens-Rubin-Sacerdote lottery data. We show that modern methods, when applied in contexts with significant covariate overlap, yield robust estimates for the adjusted differences between the treatment and control groups. However, this does not mean that these estimates are valid. To assess their credibility, validation exercises (such as placebo tests) are essential, whereas goodness of fit tests alone are inadequate. Our findings highlight the importance of closely examining the assignment process, carefully inspecting overlap, and conducting validation exercises when analyzing causal effects with nonexperimental data."}, "https://arxiv.org/abs/2406.00866": {"title": "Planning for Gold: Sample Splitting for Valid Powerful Design of Observational Studies", "link": "https://arxiv.org/abs/2406.00866", "description": "arXiv:2406.00866v1 Announce Type: new \nAbstract: Observational studies are valuable tools for inferring causal effects in the absence of controlled experiments. However, these studies may be biased due to the presence of some relevant, unmeasured set of covariates. The design of an observational study has a prominent effect on its sensitivity to hidden biases, and the best design may not be apparent without examining the data. One approach to facilitate a data-inspired design is to split the sample into a planning sample for choosing the design and an analysis sample for making inferences. We devise a powerful and flexible method for selecting outcomes in the planning sample when an unknown number of outcomes are affected by the treatment. We investigate the theoretical properties of our method and conduct extensive simulations that demonstrate pronounced benefits, especially at higher levels of allowance for unmeasured confounding. 
Finally, we demonstrate our method in an observational study of the multi-dimensional impacts of a devastating flood in Bangladesh."}, "https://arxiv.org/abs/2406.00906": {"title": "A Bayesian Generalized Bridge Regression Approach to Covariance Estimation in the Presence of Covariates", "link": "https://arxiv.org/abs/2406.00906", "description": "arXiv:2406.00906v1 Announce Type: new \nAbstract: A hierarchical Bayesian approach that permits simultaneous inference for the regression coefficient matrix and the error precision (inverse covariance) matrix in the multivariate linear model is proposed. Assuming a natural ordering of the elements of the response, the precision matrix is reparameterized so it can be estimated with univariate-response linear regression techniques. A novel generalized bridge regression prior that accommodates both sparse and dense settings and is competitive with alternative methods for univariate-response regression is proposed and used in this framework. Two component-wise Markov chain Monte Carlo algorithms are developed for sampling, including a data augmentation algorithm based on a scale mixture of normals representation. Numerical examples demonstrate that the proposed method is competitive with comparable joint mean-covariance models, particularly in estimation of the precision matrix. The method is also used to estimate the 253 by 253 precision matrices of two classes of spectra extracted from images taken by the Hubble Space Telescope. Some interesting structural patterns in the estimates are discussed."}, "https://arxiv.org/abs/2406.00930": {"title": "A class of sequential multi-hypothesis tests", "link": "https://arxiv.org/abs/2406.00930", "description": "arXiv:2406.00930v1 Announce Type: new \nAbstract: In this paper, we deal with sequential testing of multiple hypotheses. In the general scheme of construction of optimal tests based on backward induction, we propose a modification which provides a simplified (generally speaking, suboptimal) version of the optimal test, for any particular criterion of optimization. We call this the DBC version (the one with Dropped Backward Control) of the optimal test. In particular, for the case of two simple hypotheses, dropping backward control in the Bayesian test produces the classical sequential probability ratio test (SPRT). Similarly, dropping backward control in the modified Kiefer-Weiss solutions produces Lorden's 2-SPRTs.\n In the case of more than two hypotheses, we obtain in this way new classes of sequential multi-hypothesis tests, and investigate their properties. The efficiency of the DBC-tests is evaluated with respect to the optimal Bayesian multi-hypothesis test and with respect to the matrix sequential probability ratio test (MSPRT) by Armitage. In a multihypothesis variant of the Kiefer-Weiss problem for binomial proportions, the performance of the DBC-test is numerically compared with that of the exact solution. In a model of normal observations with a linear trend, the performance of the DBC-test is numerically compared with that of the MSPRT. 
Some other numerical examples are presented.\n In all the cases the proposed tests exhibit a very high efficiency with respect to the optimal tests (more than 99.3\\% when sampling from Bernoulli populations) and/or with respect to the MSPRT (even outperforming the latter in some scenarios)."}, "https://arxiv.org/abs/2406.00940": {"title": "Measurement Error-Robust Causal Inference via Constructed Instrumental Variables", "link": "https://arxiv.org/abs/2406.00940", "description": "arXiv:2406.00940v1 Announce Type: new \nAbstract: Measurement error can often be harmful when estimating causal effects. Two scenarios in which this is the case are in the estimation of (a) the average treatment effect when confounders are measured with error and (b) the natural indirect effect when the exposure and/or confounders are measured with error. Methods adjusting for measurement error typically require external data or knowledge about the measurement error distribution. Here, we propose methodology not requiring any such information. Instead, we show that when the outcome regression is linear in the error-prone variables, consistent estimation of these causal effects can be recovered using constructed instrumental variables under certain conditions. These variables, which are functions of only the observed data, behave like instrumental variables for the error-prone variables. Using data from a study of the effects of prenatal exposure to heavy metals on growth and neurodevelopment in Bangladeshi mother-infant pairs, we apply our methodology to estimate (a) the effect of lead exposure on birth length while controlling for maternal protein intake, and (b) lead exposure's role in mediating the effect of maternal protein intake on birth length. Protein intake is calculated from food journal entries, and is suspected to be highly prone to measurement error."}, "https://arxiv.org/abs/2406.00941": {"title": "A Robust Residual-Based Test for Structural Changes in Factor Models", "link": "https://arxiv.org/abs/2406.00941", "description": "arXiv:2406.00941v1 Announce Type: new \nAbstract: In this paper, we propose an easy-to-implement residual-based specification testing procedure for detecting structural changes in factor models, which is powerful against both smooth and abrupt structural changes with unknown break dates. The proposed test is robust against the over-specified number of factors, and serially and cross-sectionally correlated error processes. A new central limit theorem is given for the quadratic forms of panel data with dependence over both dimensions, thereby filling a gap in the literature. We establish the asymptotic properties of the proposed test statistic, and accordingly develop a simulation-based scheme to select critical value in order to improve finite sample performance. Through extensive simulations and a real-world application, we confirm our theoretical results and demonstrate that the proposed test exhibits desirable size and power in practice."}, "https://arxiv.org/abs/2406.01002": {"title": "Random Subspace Local Projections", "link": "https://arxiv.org/abs/2406.01002", "description": "arXiv:2406.01002v1 Announce Type: new \nAbstract: We show how random subspace methods can be adapted to estimating local projections with many controls. Random subspace methods have their roots in the machine learning literature and are implemented by averaging over regressions estimated over different combinations of subsets of these controls. 
We document three key results: (i) Our approach can successfully recover the impulse response functions across Monte Carlo experiments representative of different macroeconomic settings and identification schemes. (ii) Our results suggest that random subspace methods are more accurate than other dimension reduction methods if the underlying large dataset has a factor structure similar to typical macroeconomic datasets such as FRED-MD. (iii) Our approach leads to differences in the estimated impulse response functions relative to benchmark methods when applied to two widely studied empirical applications."}, "https://arxiv.org/abs/2406.01218": {"title": "Sequential FDR and pFDR control under arbitrary dependence, with application to pharmacovigilance database monitoring", "link": "https://arxiv.org/abs/2406.01218", "description": "arXiv:2406.01218v1 Announce Type: new \nAbstract: We propose sequential multiple testing procedures which control the false discovery rate (FDR) or the positive false discovery rate (pFDR) under arbitrary dependence between the data streams. This is accomplished by \"optimizing\" an upper bound on these error metrics for a class of step-down sequential testing procedures. Both open-ended and truncated versions of these sequential procedures are given, both being able to control the type I multiple testing metric (FDR or pFDR) at specified levels, and the former being able to control both the type I and type II metrics (e.g., FDR and the false nondiscovery rate, FNR). In simulation studies, these procedures provide 45-65% savings in average sample size over their fixed-sample competitors. We illustrate our procedures on drug data from the United Kingdom's Yellow Card Pharmacovigilance Database."}, "https://arxiv.org/abs/2406.01242": {"title": "Multiple Comparison Procedures for Simultaneous Inference in Functional MANOVA", "link": "https://arxiv.org/abs/2406.01242", "description": "arXiv:2406.01242v1 Announce Type: new \nAbstract: Functional data analysis is becoming increasingly popular for studying data from real-valued random functions. Nevertheless, there is a lack of multiple testing procedures for such data. These are particularly important in factorial designs to compare different groups or to infer factor effects. We propose a new class of testing procedures for arbitrary linear hypotheses in general factorial designs with functional data. Our methods allow global as well as multiple inference of both univariate and multivariate mean functions without assuming particular error distributions or homoscedasticity. That is, we allow for different structures of the covariance functions between groups. To this end, we use point-wise quadratic-form-type test functions that take potential heteroscedasticity into account. Taking the supremum over each test function, we define a class of local test statistics. We analyse their (joint) asymptotic behaviour and propose a resampling approach to approximate the limit distributions. The resulting global and multiple testing procedures are asymptotically valid under weak conditions and applicable in general functional MANOVA settings. 
We evaluate their small-sample performance in extensive simulations and finally illustrate their applicability by analysing a multivariate functional air pollution data set."}, "https://arxiv.org/abs/2406.01557": {"title": "Bayesian compositional regression with flexible microbiome feature aggregation and selection", "link": "https://arxiv.org/abs/2406.01557", "description": "arXiv:2406.01557v1 Announce Type: new \nAbstract: Ongoing advances in microbiome profiling have allowed unprecedented insights into the molecular activities of microbial communities. This has fueled a strong scientific interest in understanding the critical role the microbiome plays in governing human health, by identifying microbial features associated with clinical outcomes of interest. Several aspects of microbiome data limit the applicability of existing variable selection approaches. In particular, microbiome data are high-dimensional, extremely sparse, and compositional. Importantly, many of the observed features, although categorized as different taxa, may play related functional roles. To address these challenges, we propose a novel compositional regression approach that leverages the data-adaptive clustering and variable selection properties of the spiked Dirichlet process to identify taxa that exhibit similar functional roles. Our proposed method, Bayesian Regression with Agglomerated Compositional Effects using a dirichLET process (BRACElet), enables the identification of a sparse set of features with shared impacts on the outcome, facilitating dimension reduction and model interpretation. We demonstrate that BRACElet outperforms existing approaches for microbiome variable selection through simulation studies and an application elucidating the impact of oral microbiome composition on insulin resistance."}, "https://arxiv.org/abs/2406.00128": {"title": "Matrix-valued Factor Model with Time-varying Main Effects", "link": "https://arxiv.org/abs/2406.00128", "description": "arXiv:2406.00128v1 Announce Type: cross \nAbstract: We introduce the matrix-valued time-varying Main Effects Factor Model (MEFM). MEFM is a generalization to the traditional matrix-valued factor model (FM). We give rigorous definitions of MEFM and its identifications, and propose estimators for the time-varying grand mean, row and column main effects, and the row and column factor loading matrices for the common component. Rates of convergence for different estimators are spelt out, with asymptotic normality shown. The core rank estimator for the common component is also proposed, with consistency of the estimators presented. We propose a test for testing if FM is sufficient against the alternative that MEFM is necessary, and demonstrate the power of such a test in various simulation settings. We also demonstrate numerically the accuracy of our estimators in extended simulation experiments. A set of NYC Taxi traffic data is analysed and our test suggests that MEFM is indeed necessary for analysing the data against a traditional FM."}, "https://arxiv.org/abs/2406.00317": {"title": "Combining Experimental and Historical Data for Policy Evaluation", "link": "https://arxiv.org/abs/2406.00317", "description": "arXiv:2406.00317v1 Announce Type: cross \nAbstract: This paper studies policy evaluation with multiple data sources, especially in scenarios that involve one experimental dataset with two arms, complemented by a historical dataset generated under a single control arm. 
We propose novel data integration methods that linearly integrate base policy value estimators constructed based on the experimental and historical data, with weights optimized to minimize the mean square error (MSE) of the resulting combined estimator. We further apply the pessimistic principle to obtain more robust estimators, and extend these developments to sequential decision making. Theoretically, we establish non-asymptotic error bounds for the MSEs of our proposed estimators, and derive their oracle, efficiency and robustness properties across a broad spectrum of reward shift scenarios. Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators."}, "https://arxiv.org/abs/2406.00326": {"title": "Far beyond day-ahead with econometric models for electricity price forecasting", "link": "https://arxiv.org/abs/2406.00326", "description": "arXiv:2406.00326v1 Announce Type: cross \nAbstract: The surge in global energy prices during the recent energy crisis, which peaked in 2022, has intensified the need for mid-term to long-term forecasting for hedging and valuation purposes. This study analyzes the statistical predictability of power prices before, during, and after the energy crisis, using econometric models with an hourly resolution. To stabilize the model estimates, we define fundamentally derived coefficient bounds. We provide an in-depth analysis of the unit root behavior of the power price series, showing that the long-term stochastic trend is explained by the prices of commodities used as fuels for power generation: gas, coal, oil, and emission allowances (EUA). However, as the forecasting horizon increases, spurious effects become extremely relevant, leading to highly significant but economically meaningless results. To mitigate these spurious effects, we propose the \"current\" model: estimating the current same-day relationship between power prices and their regressors and projecting this relationship into the future. This flexible and interpretable method is applied to hourly German day-ahead power prices for forecasting horizons up to one year ahead, utilizing a combination of regularized regression methods and generalized additive models."}, "https://arxiv.org/abs/2406.00394": {"title": "Learning Causal Abstractions of Linear Structural Causal Models", "link": "https://arxiv.org/abs/2406.00394", "description": "arXiv:2406.00394v1 Announce Type: cross \nAbstract: The need for modelling causal knowledge at different levels of granularity arises in several settings. Causal Abstraction provides a framework for formalizing this problem by relating two Structural Causal Models at different levels of detail. Despite increasing interest in applying causal abstraction, e.g. in the interpretability of large machine learning models, the graphical and parametrical conditions under which a causal model can abstract another are not known. Furthermore, learning causal abstractions from data is still an open problem. In this work, we tackle both issues for linear causal models with linear abstraction functions. First, we characterize how the low-level coefficients and the abstraction function determine the high-level coefficients and how the high-level model constrains the causal ordering of low-level variables. Then, we apply our theoretical results to learn high-level and low-level causal models and their abstraction function from observational data. 
In particular, we introduce Abs-LiNGAM, a method that leverages the constraints induced by the learned high-level model and the abstraction function to speed up the recovery of the larger low-level model, under the assumption of non-Gaussian noise terms. In simulated settings, we show the effectiveness of learning causal abstractions from data and the potential of our method in improving scalability of causal discovery."}, "https://arxiv.org/abs/2406.00535": {"title": "Causal Contrastive Learning for Counterfactual Regression Over Time", "link": "https://arxiv.org/abs/2406.00535", "description": "arXiv:2406.00535v1 Announce Type: cross \nAbstract: Estimating treatment effects over time holds significance in various domains, including precision medicine, epidemiology, economics, and marketing. This paper introduces a unique approach to counterfactual regression over time, emphasizing long-term predictions. Distinguishing itself from existing models like Causal Transformer, our approach highlights the efficacy of employing RNNs for long-term forecasting, complemented by Contrastive Predictive Coding (CPC) and Information Maximization (InfoMax). Emphasizing efficiency, we avoid the need for computationally expensive transformers. Leveraging CPC, our method captures long-term dependencies in the presence of time-varying confounders. Notably, recent models have disregarded the importance of invertible representation, compromising identification assumptions. To remedy this, we employ the InfoMax principle, maximizing a lower bound of mutual information between sequence data and its representation. Our method achieves state-of-the-art counterfactual estimation results using both synthetic and real-world data, marking the pioneering incorporation of Contrastive Predictive Coding in causal inference."}, "https://arxiv.org/abs/2406.00610": {"title": "Portfolio Optimization with Robust Covariance and Conditional Value-at-Risk Constraints", "link": "https://arxiv.org/abs/2406.00610", "description": "arXiv:2406.00610v1 Announce Type: cross \nAbstract: The measure of portfolio risk is an important input of the Markowitz framework. In this study, we explored various methods to obtain robust covariance estimators that are less susceptible to financial data noise. We evaluated the performance of a large-cap portfolio using various forms of the Ledoit Shrinkage Covariance and the Robust Gerber Covariance matrix during the period 2012 to 2022. Out-of-sample performance indicates that robust covariance estimators can outperform the market capitalization-weighted benchmark portfolio, particularly during bull markets. The Gerber covariance with Mean-Absolute-Deviation (MAD) emerged as the top performer. However, robust estimators do not manage tail risk well under extreme market conditions, for example, during the Covid-19 period. When we aim to control for tail risk, we should add a constraint on Conditional Value-at-Risk (CVaR) to make more conservative decisions on risk exposure. Additionally, we incorporated the unsupervised K-means clustering algorithm into the optimization algorithm (i.e., Nested Clustering Optimization, NCO). 
It not only helps mitigate numerical instability of the optimization algorithm, but also contributes to lower drawdown as well."}, "https://arxiv.org/abs/2406.00611": {"title": "DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation", "link": "https://arxiv.org/abs/2406.00611", "description": "arXiv:2406.00611v1 Announce Type: cross \nAbstract: Designing faithful yet accurate AI models is challenging, particularly in the field of individual treatment effect estimation (ITE). ITE prediction models deployed in critical settings such as healthcare should ideally be (i) accurate, and (ii) provide faithful explanations. However, current solutions are inadequate: state-of-the-art black-box models do not supply explanations, post-hoc explainers for black-box models lack faithfulness guarantees, and self-interpretable models greatly compromise accuracy. To address these issues, we propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample. A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples. We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space. We evaluate DISCRET on diverse tasks involving tabular, image, and text data. DISCRET outperforms the best self-interpretable models and has accuracy comparable to the best black-box models while providing faithful explanations. DISCRET is available at https://github.com/wuyinjun-1993/DISCRET-ICML2024."}, "https://arxiv.org/abs/2406.00701": {"title": "Profiled Transfer Learning for High Dimensional Linear Model", "link": "https://arxiv.org/abs/2406.00701", "description": "arXiv:2406.00701v1 Announce Type: cross \nAbstract: We develop here a novel transfer learning methodology called Profiled Transfer Learning (PTL). The method is based on the \\textit{approximate-linear} assumption between the source and target parameters. Compared with the commonly assumed \\textit{vanishing-difference} assumption and \\textit{low-rank} assumption in the literature, the \\textit{approximate-linear} assumption is more flexible and less stringent. Specifically, the PTL estimator is constructed by two major steps. Firstly, we regress the response on the transferred feature, leading to the profiled responses. Subsequently, we learn the regression relationship between profiled responses and the covariates on the target data. The final estimator is then assembled based on the \\textit{approximate-linear} relationship. To theoretically support the PTL estimator, we derive the non-asymptotic upper bound and minimax lower bound. We find that the PTL estimator is minimax optimal under appropriate regularity conditions. Extensive simulation studies are presented to demonstrate the finite sample performance of the new method. A real data example about sentence prediction is also presented with very encouraging results."}, "https://arxiv.org/abs/2406.00713": {"title": "Logistic Variational Bayes Revisited", "link": "https://arxiv.org/abs/2406.00713", "description": "arXiv:2406.00713v1 Announce Type: cross \nAbstract: Variational logistic regression is a popular method for approximate Bayesian inference seeing wide-spread use in many areas of machine learning including: Bayesian optimization, reinforcement learning and multi-instance learning to name a few. 
However, due to the intractability of the Evidence Lower Bound, authors have turned to the use of Monte Carlo, quadrature or bounds to perform inference, methods which are costly or give poor approximations to the true posterior.\n In this paper we introduce a new bound for the expectation of softplus function and subsequently show how this can be applied to variational logistic regression and Gaussian process classification. Unlike other bounds, our proposal does not rely on extending the variational family, or introducing additional parameters to ensure the bound is tight. In fact, we show that this bound is tighter than the state-of-the-art, and that the resulting variational posterior achieves state-of-the-art performance, whilst being significantly faster to compute than Monte-Carlo methods."}, "https://arxiv.org/abs/2406.00778": {"title": "Bayesian Joint Additive Factor Models for Multiview Learning", "link": "https://arxiv.org/abs/2406.00778", "description": "arXiv:2406.00778v1 Announce Type: cross \nAbstract: It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. We propose a joint additive factor regression model (JAFAR) with a structured additive design, accounting for shared and view-specific components. We ensure identifiability via a novel dependent cumulative shrinkage process (D-CUSP) prior. We provide an efficient implementation via a partially collapsed Gibbs sampler and extend our approach to allow flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar."}, "https://arxiv.org/abs/2406.00853": {"title": "A Tutorial on Doubly Robust Learning for Causal Inference", "link": "https://arxiv.org/abs/2406.00853", "description": "arXiv:2406.00853v1 Announce Type: cross \nAbstract: Doubly robust learning offers a robust framework for causal inference from observational data by integrating propensity score and outcome modeling. Despite its theoretical appeal, practical adoption remains limited due to perceived complexity and inaccessible software. This tutorial aims to demystify doubly robust methods and demonstrate their application using the EconML package. We provide an introduction to causal inference, discuss the principles of outcome modeling and propensity scores, and illustrate the doubly robust approach through simulated case studies. 
By simplifying the methodology and offering practical coding examples, we intend to make doubly robust learning accessible to researchers and practitioners in data science and statistics."}, "https://arxiv.org/abs/2406.00998": {"title": "Distributional Refinement Network: Distributional Forecasting via Deep Learning", "link": "https://arxiv.org/abs/2406.00998", "description": "arXiv:2406.00998v1 Announce Type: cross \nAbstract: A key task in actuarial modelling involves modelling the distributional properties of losses. Classic (distributional) regression approaches like Generalized Linear Models (GLMs; Nelder and Wedderburn, 1972) are commonly used, but challenges remain in developing models that can (i) allow covariates to flexibly impact different aspects of the conditional distribution, (ii) integrate developments in machine learning and AI to maximise the predictive power while considering (i), and, (iii) maintain a level of interpretability in the model to enhance trust in the model and its outputs, which is often compromised in efforts pursuing (i) and (ii). We tackle this problem by proposing a Distributional Refinement Network (DRN), which combines an inherently interpretable baseline model (such as GLMs) with a flexible neural network-a modified Deep Distribution Regression (DDR; Li et al., 2019) method. Inspired by the Combined Actuarial Neural Network (CANN; Schelldorfer and W{\\''u}thrich, 2019), our approach flexibly refines the entire baseline distribution. As a result, the DRN captures varying effects of features across all quantiles, improving predictive performance while maintaining adequate interpretability. Using both synthetic and real-world data, we demonstrate the DRN's superior distributional forecasting capacity. The DRN has the potential to be a powerful distributional regression model in actuarial science and beyond."}, "https://arxiv.org/abs/2406.01259": {"title": "Aging modeling and lifetime prediction of a proton exchange membrane fuel cell using an extended Kalman filter", "link": "https://arxiv.org/abs/2406.01259", "description": "arXiv:2406.01259v1 Announce Type: cross \nAbstract: This article presents a methodology that aims to model and to provide predictive capabilities for the lifetime of Proton Exchange Membrane Fuel Cell (PEMFC). The approach integrates parametric identification, dynamic modeling, and Extended Kalman Filtering (EKF). The foundation is laid with the creation of a representative aging database, emphasizing specific operating conditions. Electrochemical behavior is characterized through the identification of critical parameters. The methodology extends to capture the temporal evolution of the identified parameters. We also address challenges posed by the limiting current density through a differential analysis-based modeling technique and the detection of breakpoints. This approach, involving Monte Carlo simulations, is coupled with an EKF for predicting voltage degradation. The Remaining Useful Life (RUL) is also estimated. The results show that our approach accurately predicts future voltage and RUL with very low relative errors."}, "https://arxiv.org/abs/2111.13226": {"title": "A Kernel Test for Causal Association via Noise Contrastive Backdoor Adjustment", "link": "https://arxiv.org/abs/2111.13226", "description": "arXiv:2111.13226v4 Announce Type: replace \nAbstract: Causal inference grows increasingly complex as the number of confounders increases. 
Given treatments $X$, confounders $Z$ and outcomes $Y$, we develop a non-parametric method to test the \\textit{do-null} hypothesis $H_0:\\; p(y|\\text{\\it do}(X=x))=p(y)$ against the general alternative. Building on the Hilbert Schmidt Independence Criterion (HSIC) for marginal independence testing, we propose backdoor-HSIC (bd-HSIC) and demonstrate that it is calibrated and has power for both binary and continuous treatments under a large number of confounders. Additionally, we establish convergence properties of the estimators of covariance operators used in bd-HSIC. We investigate the advantages and disadvantages of bd-HSIC against parametric tests as well as the importance of using the do-null testing in contrast to marginal independence testing or conditional independence testing. A complete implementation can be found at \\hyperlink{https://github.com/MrHuff/kgformula}{\\texttt{https://github.com/MrHuff/kgformula}}."}, "https://arxiv.org/abs/2302.03996": {"title": "High-Dimensional Granger Causality for Climatic Attribution", "link": "https://arxiv.org/abs/2302.03996", "description": "arXiv:2302.03996v2 Announce Type: replace \nAbstract: In this paper we test for Granger causality in high-dimensional vector autoregressive models (VARs) to disentangle and interpret the complex causal chains linking radiative forcings and global temperatures. By allowing for high dimensionality in the model, we can enrich the information set with relevant natural and anthropogenic forcing variables to obtain reliable causal relations. This provides a step forward from existing climatology literature, which has mostly treated these variables in isolation in small models. Additionally, our framework allows to disregard the order of integration of the variables by directly estimating the VAR in levels, thus avoiding accumulating biases coming from unit-root and cointegration tests. This is of particular appeal for climate time series which are well known to contain stochastic trends and long memory. We are thus able to establish causal networks linking radiative forcings to global temperatures and to connect radiative forcings among themselves, thereby allowing for tracing the path of dynamic causal effects through the system."}, "https://arxiv.org/abs/2303.02438": {"title": "Bayesian clustering of high-dimensional data via latent repulsive mixtures", "link": "https://arxiv.org/abs/2303.02438", "description": "arXiv:2303.02438v2 Announce Type: replace \nAbstract: Model-based clustering of moderate or large dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering by assuming a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2023). The authors use a factor-analytic representation and assume a mixture model for the latent factors. However, performance can deteriorate in the presence of model misspecification. Assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores is shown to yield a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. The repulsive point process must be anisotropic to favor well-separated clusters of data, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes. 
We illustrate our model in simulations as well as a plant species co-occurrence dataset."}, "https://arxiv.org/abs/2303.12687": {"title": "On Weighted Orthogonal Learners for Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2303.12687", "description": "arXiv:2303.12687v2 Announce Type: replace \nAbstract: Motivated by applications in personalized medicine and individualized policymaking, there is a growing interest in techniques for quantifying treatment effect heterogeneity in terms of the conditional average treatment effect (CATE). Some of the most prominent methods for CATE estimation developed in recent years are T-Learner, DR-Learner and R-Learner. The latter two were designed to improve on the former by being Neyman-orthogonal. However, the relations between them remain unclear, and likewise the literature remains vague on whether these learners converge to a useful quantity or (functional) estimand when the underlying optimization procedure is restricted to a class of functions that does not include the CATE. In this article, we provide insight into these questions by discussing DR-Learner and R-Learner as special cases of a general class of weighted Neyman-orthogonal learners for the CATE, for which we moreover derive oracle bounds. Our results shed light on how one may construct Neyman-orthogonal learners with desirable properties, on when DR-Learner may be preferred over R-Learner (and vice versa), and on novel learners that may sometimes be preferable to either of these. Theoretical findings are confirmed using results from simulation studies on synthetic data, as well as an application in critical care medicine."}, "https://arxiv.org/abs/2305.04140": {"title": "A Nonparametric Mixed-Effects Mixture Model for Patterns of Clinical Measurements Associated with COVID-19", "link": "https://arxiv.org/abs/2305.04140", "description": "arXiv:2305.04140v2 Announce Type: replace \nAbstract: Some patients with COVID-19 show changes in signs and symptoms such as temperature and oxygen saturation days before being positively tested for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to these subgroups. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and trajectories in the latent groups using smoothing splines. We developed an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients."}, "https://arxiv.org/abs/2309.02584": {"title": "Multivariate Mat\\'ern Models -- A Spectral Approach", "link": "https://arxiv.org/abs/2309.02584", "description": "arXiv:2309.02584v2 Announce Type: replace \nAbstract: The classical Mat\\'ern model has been a staple in spatial statistics. Novel data-rich applications in environmental and physical sciences, however, call for new, flexible vector-valued spatial and space-time models. 
Therefore, the extension of the classical Mat\\'ern model has been a problem of active theoretical and methodological interest. In this paper, we offer a new perspective on extending the Mat\\'ern covariance model to the vector-valued setting. We adopt a spectral, stochastic integral approach, which allows us to address challenging issues on the validity of the covariance structure and at the same time to obtain new, flexible, and interpretable models. In particular, our multivariate extensions of the Mat\\'ern model allow for asymmetric covariance structures. Moreover, the spectral approach provides essentially complete flexibility in modeling the local structure of the process. We establish closed-form representations of the cross-covariances when available, compare them with existing models, simulate Gaussian instances of these new processes, and demonstrate estimation of the model's parameters through maximum likelihood. An application of the new class of multivariate Mat\\'ern models to environmental data indicates their success in capturing inherent covariance-asymmetry phenomena."}, "https://arxiv.org/abs/2310.01198": {"title": "Likelihood Based Inference for ARMA Models", "link": "https://arxiv.org/abs/2310.01198", "description": "arXiv:2310.01198v3 Announce Type: replace \nAbstract: Autoregressive moving average (ARMA) models are frequently used to analyze time series data. Despite the popularity of these models, likelihood-based inference for ARMA models has subtleties that have been previously identified but continue to cause difficulties in widely used data analysis strategies. We provide a summary of parameter estimation via maximum likelihood and discuss common pitfalls that may lead to sub-optimal parameter estimates. We propose a random initialization algorithm for parameter estimation that frequently yields higher likelihoods than traditional maximum likelihood estimation procedures. We then investigate the parameter uncertainty of maximum likelihood estimates, and propose the use of profile confidence intervals as a superior alternative to intervals derived from the Fisher information matrix. Through a series of simulation studies, we demonstrate the efficacy of our proposed algorithm and the improved nominal coverage of profile confidence intervals compared to the normal approximation based on the Fisher information."}, "https://arxiv.org/abs/2310.07151": {"title": "Identification and Estimation of a Semiparametric Logit Model using Network Data", "link": "https://arxiv.org/abs/2310.07151", "description": "arXiv:2310.07151v2 Announce Type: replace \nAbstract: This paper studies the identification and estimation of a semiparametric binary network model in which the unobserved social characteristic is endogenous, that is, the unobserved individual characteristic influences both the binary outcome of interest and how links are formed within the network. The exact functional form of the latent social characteristic is not known. The proposed estimators are obtained based on matching pairs of agents whose network formation distributions are the same. The consistency and the asymptotic distribution of the estimators are established. The finite-sample properties of the proposed estimators are assessed in a Monte Carlo simulation. 
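The ARMA entry above recommends random initialization because a single likelihood optimization can stall at a sub-optimal stationary point. A minimal multiple-restart sketch with statsmodels is below; the restart count, parameter ranges, and simulated series are illustrative choices of mine, not the authors' algorithm.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import arma_generate_sample

rng = np.random.default_rng(2)
# Simulate an ARMA(1,1) series: AR coefficient 0.7, MA coefficient 0.4.
y = arma_generate_sample(ar=[1, -0.7], ma=[1, 0.4], nsample=300,
                         distrvs=rng.standard_normal)

best = None
for _ in range(20):
    # Random starting values for [ar.L1, ma.L1, sigma2].
    start = rng.uniform(-0.9, 0.9, size=2).tolist() + [1.0]
    try:
        fit = ARIMA(y, order=(1, 0, 1), trend="n").fit(start_params=start)
    except Exception:
        continue                                  # skip restarts that fail to converge
    if best is None or fit.llf > best.llf:
        best = fit
print(best.params, best.llf)
```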
We conclude this study with an empirical application."}, "https://arxiv.org/abs/2310.08672": {"title": "Machine Learning Who to Nudge: Causal vs Predictive Targeting in a Field Experiment on Student Financial Aid Renewal", "link": "https://arxiv.org/abs/2310.08672", "description": "arXiv:2310.08672v2 Announce Type: replace \nAbstract: In many settings, interventions may be more effective for some individuals than others, so that targeting interventions may be beneficial. We analyze the value of targeting in the context of a large-scale field experiment with over 53,000 college students, where the goal was to use \"nudges\" to encourage students to renew their financial-aid applications before a non-binding deadline. We begin with baseline approaches to targeting. First, we target based on a causal forest that estimates heterogeneous treatment effects and then assigns students to treatment according to those estimated to have the highest treatment effects. Next, we evaluate two alternative targeting policies, one targeting students with low predicted probability of renewing financial aid in the absence of the treatment, the other targeting those with high probability. The predicted baseline outcome is not the ideal criterion for targeting, nor is it a priori clear whether to prioritize low, high, or intermediate predicted probability. Nonetheless, targeting on low baseline outcomes is common in practice, for example because the relationship between individual characteristics and treatment effects is often difficult or impossible to estimate with historical data. We propose hybrid approaches that incorporate the strengths of both predictive approaches (accurate estimation) and causal approaches (correct criterion); we show that targeting intermediate baseline outcomes is most effective in our specific application, while targeting based on low baseline outcomes is detrimental. In one year of the experiment, nudging all students improved early filing by an average of 6.4 percentage points over a baseline average of 37% filing, and we estimate that targeting half of the students using our preferred policy attains around 75% of this benefit."}, "https://arxiv.org/abs/2310.10329": {"title": "Towards Data-Conditional Simulation for ABC Inference in Stochastic Differential Equations", "link": "https://arxiv.org/abs/2310.10329", "description": "arXiv:2310.10329v2 Announce Type: replace \nAbstract: We develop a Bayesian inference method for discretely-observed stochastic differential equations (SDEs). Inference is challenging for most SDEs, due to the analytical intractability of the likelihood function. Nevertheless, forward simulation via numerical methods is straightforward, motivating the use of approximate Bayesian computation (ABC). We propose a conditional simulation scheme for SDEs that is based on lookahead strategies for sequential Monte Carlo (SMC) and particle smoothing using backward simulation. This leads to the simulation of trajectories that are consistent with the observed trajectory, thereby increasing the ABC acceptance rate. We additionally employ an invariant neural network, previously developed for Markov processes, to learn the summary statistics function required in ABC. The neural network is incrementally retrained by exploiting an ABC-SMC sampler, which provides new training data at each round. 
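The ABC entry above exploits the fact that forward simulation of an SDE is easy even when the likelihood is intractable. For contrast with the paper's data-conditional scheme, here is the plain forward-simulation ABC rejection baseline on a toy mean-reverting SDE; the drift, summary statistics, prior, and tolerance are all invented for illustration.

```python
import numpy as np

def euler_maruyama(theta, x0=1.0, n_steps=200, dt=0.05, rng=None):
    """Forward-simulate dX = theta * (1 - X) dt + 0.3 dW via Euler-Maruyama."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = (x[t - 1] + theta * (1.0 - x[t - 1]) * dt
                + 0.3 * np.sqrt(dt) * rng.standard_normal())
    return x

def summaries(x):
    """Crude hand-picked summaries standing in for the learned ones."""
    return np.array([x.mean(), x.std(), np.corrcoef(x[:-1], x[1:])[0, 1]])

rng = np.random.default_rng(3)
s_obs = summaries(euler_maruyama(theta=0.8, rng=rng))    # pretend-observed data

# Rejection ABC: keep prior draws whose simulated summaries land near the observed ones.
accepted = [th for th in rng.uniform(0.0, 2.0, size=5000)
            if np.linalg.norm(summaries(euler_maruyama(th, rng=rng)) - s_obs) < 0.25]
print(len(accepted), np.mean(accepted) if accepted else "no acceptances")
```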
Since the SDEs simulation scheme differs from standard forward simulation, we propose a suitable importance sampling correction, which has the added advantage of guiding the parameters towards regions of high posterior density, especially in the first ABC-SMC round. Our approach achieves accurate inference and is about three times faster than standard (forward-only) ABC-SMC. We illustrate our method in five simulation studies, including three examples from the Chan-Karaolyi-Longstaff-Sanders SDE family, a stochastic bi-stable model (Schl{\\\"o}gl) that is notoriously challenging for ABC methods, and a two dimensional biochemical reaction network."}, "https://arxiv.org/abs/2311.01341": {"title": "Composite Dyadic Models for Spatio-Temporal Data", "link": "https://arxiv.org/abs/2311.01341", "description": "arXiv:2311.01341v3 Announce Type: replace \nAbstract: Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully-connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe."}, "https://arxiv.org/abs/2401.04498": {"title": "Efficient designs for multivariate crossover trials", "link": "https://arxiv.org/abs/2401.04498", "description": "arXiv:2401.04498v2 Announce Type: replace \nAbstract: This article aims to study efficient/trace optimal designs for crossover trials with multiple responses recorded from each subject in the time periods. A multivariate fixed effects model is proposed with direct and carryover effects corresponding to the multiple responses. The corresponding error dispersion matrix is chosen to be either of the proportional or the generalized Markov covariance type, permitting the existence of direct and cross-correlations within and between the multiple responses. The corresponding information matrices for direct effects under the two types of dispersions are used to determine efficient designs. The efficiency of orthogonal array designs of Type $I$ and strength $2$ is investigated for a wide choice of covariance functions, namely, Mat($0.5$), Mat($1.5$) and Mat($\\infty$). To motivate these multivariate crossover designs, a gene expression dataset in a $3 \\times 3$ framework is utilized."}, "https://arxiv.org/abs/2102.11076": {"title": "Kernel Ridge Riesz Representers: Generalization Error and Mis-specification", "link": "https://arxiv.org/abs/2102.11076", "description": "arXiv:2102.11076v3 Announce Type: replace-cross \nAbstract: Kernel balancing weights provide confidence intervals for average treatment effects, based on the idea of balancing covariates for the treated group and untreated group in feature space, often with ridge regularization. 
Previous works on the classical kernel ridge balancing weights have certain limitations: (i) not articulating generalization error for the balancing weights, (ii) typically requiring correct specification of features, and (iii) providing inference for only average effects.\n I interpret kernel balancing weights as kernel ridge Riesz representers (KRRR) and address these limitations via a new characterization of the counterfactual effective dimension. KRRR is an exact generalization of kernel ridge regression and kernel ridge balancing weights. I prove strong properties similar to kernel ridge regression: population $L_2$ rates controlling generalization error, and a standalone closed form solution that can interpolate. The framework relaxes the stringent assumption that the underlying regression model is correctly specified by the features. It extends inference beyond average effects to heterogeneous effects, i.e. causal functions. I use KRRR to infer heterogeneous treatment effects, by age, of 401(k) eligibility on assets."}, "https://arxiv.org/abs/2309.11028": {"title": "The Topology and Geometry of Neural Representations", "link": "https://arxiv.org/abs/2309.11028", "description": "arXiv:2309.11028v3 Announce Type: replace-cross \nAbstract: A central question for neuroscience is how to characterize brain representations of perceptual and cognitive content. An ideal characterization should distinguish different functional regions with robustness to noise and idiosyncrasies of individual brains that do not correspond to computational differences. Previous studies have characterized brain representations by their representational geometry, which is defined by the representational dissimilarity matrix (RDM), a summary statistic that abstracts from the roles of individual neurons (or responses channels) and characterizes the discriminability of stimuli. Here we explore a further step of abstraction: from the geometry to the topology of brain representations. We propose topological representational similarity analysis (tRSA), an extension of representational similarity analysis (RSA) that uses a family of geo-topological summary statistics that generalizes the RDM to characterize the topology while de-emphasizing the geometry. We evaluate this new family of statistics in terms of the sensitivity and specificity for model selection using both simulations and fMRI data. In the simulations, the ground truth is a data-generating layer representation in a neural network model and the models are the same and other layers in different model instances (trained from different random seeds). In fMRI, the ground truth is a visual area and the models are the same and other areas measured in different subjects. Results show that topology-sensitive characterizations of population codes are robust to noise and interindividual variability and maintain excellent sensitivity to the unique representational signatures of different neural network layers and brain regions. These methods enable researchers to calibrate comparisons among representations in brains and models to be sensitive to the geometry, the topology, or a combination of both."}, "https://arxiv.org/abs/2310.13444": {"title": "Testing for the extent of instability in nearly unstable processes", "link": "https://arxiv.org/abs/2310.13444", "description": "arXiv:2310.13444v2 Announce Type: replace-cross \nAbstract: This paper deals with unit root issues in time series analysis. 
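The tRSA entry above generalizes the representational dissimilarity matrix (RDM). As background, this is a minimal geometry-only RSA sketch: build correlation-distance RDMs from two sets of response patterns and rank-correlate their upper triangles. The geo-topological statistics of the paper are not implemented here, and the synthetic patterns are mine.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n_stimuli, n_channels = 20, 100
patterns_a = rng.normal(size=(n_stimuli, n_channels))                      # "region" A
patterns_b = patterns_a + 0.5 * rng.normal(size=(n_stimuli, n_channels))   # noisy copy

# Representational dissimilarity matrices: correlation distance between stimulus patterns.
rdm_a = squareform(pdist(patterns_a, metric="correlation"))
rdm_b = squareform(pdist(patterns_b, metric="correlation"))

# Classic RSA compares geometries by rank-correlating the RDM upper triangles.
iu = np.triu_indices(n_stimuli, k=1)
rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
print(rho)
```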
It has been known for a long time that unit root tests may be flawed when a series although stationary has a root close to unity. That motivated recent papers dedicated to autoregressive processes where the bridge between stability and instability is expressed by means of time-varying coefficients. The process we consider has a companion matrix $A_{n}$ with spectral radius $\\rho(A_{n}) < 1$ satisfying $\\rho(A_{n}) \\rightarrow 1$, a situation described as `nearly-unstable'. The question we investigate is: given an observed path supposed to come from a nearly-unstable process, is it possible to test for the `extent of instability', i.e. to test how close we are to the unit root? In this regard, we develop a strategy to evaluate $\\alpha$ and to test for $\\mathcal{H}_0 : ``\\alpha = \\alpha_0\"$ against $\\mathcal{H}_1 : ``\\alpha > \\alpha_0\"$ when $\\rho(A_{n})$ lies in an inner $O(n^{-\\alpha})$-neighborhood of the unity, for some $0 < \\alpha < 1$. Empirical evidence is given about the advantages of the flexibility induced by such a procedure compared to the common unit root tests. We also build a symmetric procedure for the usually left out situation where the dominant root lies around $-1$."}, "https://arxiv.org/abs/2406.01652": {"title": "Distributional bias compromises leave-one-out cross-validation", "link": "https://arxiv.org/abs/2406.01652", "description": "arXiv:2406.01652v1 Announce Type: new \nAbstract: Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called \"leave-one-out cross-validation\" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses."}, "https://arxiv.org/abs/2406.01819": {"title": "Bayesian Linear Models: A compact general set of results", "link": "https://arxiv.org/abs/2406.01819", "description": "arXiv:2406.01819v1 Announce Type: new \nAbstract: I present all the details in calculating the posterior distribution of the conjugate Normal-Gamma prior in Bayesian Linear Models (BLM), including correlated observations, prediction, model selection and comments on efficient numeric implementations. A Python implementation is also presented. 
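The distributional-bias entry above makes a very concrete claim: under leave-one-out cross-validation, the mean label of each training fold is negatively correlated with its held-out label, which distorts pooled rank-based metrics. A few lines of numpy reproduce the effect on pure-noise labels (the simulation design is mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
y = rng.binomial(1, 0.5, size=n).astype(float)      # pure-noise binary labels

# Mean label of each leave-one-out training fold vs. its held-out label.
fold_means = np.array([np.delete(y, i).mean() for i in range(n)])
print(np.corrcoef(fold_means, y)[0, 1])              # negative by construction

# A model that regresses to its training mean (plus a little noise) therefore
# scores held-out positives lower than negatives, dragging pooled AUC below 0.5.
scores = fold_means + 0.05 * rng.normal(size=n)
pos, neg = scores[y == 1], scores[y == 0]
print((pos[:, None] > neg[None, :]).mean())          # pooled AUC estimate
```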
These results have been presented and are available in many books and texts, but I believe a general, compact, and simple presentation is always welcome and not always easy to find. Since correlated observations are also included, these results may also be useful for time series analysis and spatial statistics. Other particular cases presented include regression, Gaussian processes and Bayesian Dynamic Models."}, "https://arxiv.org/abs/2406.02028": {"title": "How should parallel cluster randomized trials with a baseline period be analyzed? A survey of estimands and common estimators", "link": "https://arxiv.org/abs/2406.02028", "description": "arXiv:2406.02028v1 Announce Type: new \nAbstract: The parallel cluster randomized trial with baseline (PB-CRT) is a common variant of the standard parallel cluster randomized trial (P-CRT) that maintains parallel randomization but additionally allows for both within- and between-cluster comparisons. We define two estimands of interest in the context of PB-CRTs, the participant-average treatment effect (pATE) and cluster-average treatment effect (cATE), to address participant- and cluster-level hypotheses. Previous work has indicated that under informative cluster sizes, commonly used mixed-effects models may yield inconsistent estimators for the estimands of interest. In this work, we theoretically derive the convergence of the unweighted and inverse cluster-period size weighted (i.) independence estimating equation, (ii.) fixed-effects model, (iii.) exchangeable mixed-effects model, and (iv.) nested-exchangeable mixed-effects model treatment effect estimators in a PB-CRT with continuous outcomes. We report a simulation study to evaluate the bias and inference with these different treatment effect estimators and their corresponding model-based or jackknife variance estimators. We then re-analyze a PB-CRT examining the effects of community youth teams on improving mental health among adolescent girls in rural eastern India. We demonstrate that the unweighted and weighted independence estimating equation and fixed-effects model regularly yield consistent estimators for the pATE and cATE estimands, whereas the mixed-effects models yield inconsistent estimators under informative cluster sizes. However, we demonstrate that unlike the nested-exchangeable mixed-effects model and corresponding analyses in P-CRTs, the exchangeable mixed-effects model is surprisingly robust to bias in many PB-CRT scenarios."}, "https://arxiv.org/abs/2406.02124": {"title": "Measuring the Dispersion of Discrete Distributions", "link": "https://arxiv.org/abs/2406.02124", "description": "arXiv:2406.02124v1 Announce Type: new \nAbstract: Measuring dispersion is among the most fundamental and ubiquitous concepts in statistics, both in applied and theoretical contexts. In order to ensure that dispersion measures like the standard deviation indeed capture the dispersion of any given distribution, they are by definition required to preserve a stochastic order of dispersion. The most basic order that functions as a foundation underneath the concept of dispersion measures is the so-called dispersive order. However, that order is incompatible with almost all discrete distributions, including all lattice distributions and most empirical distributions. Thus, there is no guarantee that popular measures properly capture the dispersion of these distributions.\n In this paper, discrete adaptations of the dispersive order are derived and analyzed. 
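The Bayesian linear model note above collects the conjugate Normal-Gamma results in closed form. As a reminder of what that computation looks like in practice, here is a sketch of the standard conjugate update for Gaussian regression; the notation, prior hyperparameters, and toy data are my own, not the note's code.

```python
import numpy as np

def normal_gamma_posterior(X, y, m0, V0, a0, b0):
    """Conjugate update for y = X b + e, e ~ N(0, s^2 I), with
    b | s^2 ~ N(m0, s^2 V0) and 1/s^2 ~ Gamma(a0, b0)."""
    n = len(y)
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    mn = Vn @ (V0_inv @ m0 + X.T @ y)
    an = a0 + n / 2.0
    bn = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0 - mn @ np.linalg.inv(Vn) @ mn)
    return mn, Vn, an, bn

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=100)
mn, Vn, an, bn = normal_gamma_posterior(X, y, m0=np.zeros(2),
                                        V0=10 * np.eye(2), a0=2.0, b0=1.0)
print(mn, bn / (an - 1))   # posterior mean of the coefficients and of s^2
```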
Their derivation is directly informed by key properties of the dispersive order in order to obtain a foundation for the measurement of discrete dispersion that is as similar as possible to the continuous setting. Two slightly different orders are obtained that both have numerous properties that the original dispersive order also has. Their behaviour on well-known families of lattice distribution is generally as expected if the parameter differences are large enough. Most popular dispersion measures preserve both discrete dispersive orders, which rigorously ensures that they are also meaningful in discrete settings. However, the interquantile range preserves neither discrete order, yielding that it should not be used to measure the dispersion of discrete distributions."}, "https://arxiv.org/abs/2406.02152": {"title": "A sequential test procedure for the choice of the number of regimes in multivariate nonlinear models", "link": "https://arxiv.org/abs/2406.02152", "description": "arXiv:2406.02152v1 Announce Type: new \nAbstract: This paper proposes a sequential test procedure for determining the number of regimes in nonlinear multivariate autoregressive models. The procedure relies on linearity and no additional nonlinearity tests for both multivariate smooth transition and threshold autoregressive models. We conduct a simulation study to evaluate the finite-sample properties of the proposed test in small samples. Our findings indicate that the test exhibits satisfactory size properties, with the rescaled version of the Lagrange Multiplier test statistics demonstrating the best performance in most simulation settings. The sequential procedure is also applied to two empirical cases, the US monthly interest rates and Icelandic river flows. In both cases, the detected number of regimes aligns well with the existing literature."}, "https://arxiv.org/abs/2406.02241": {"title": "Enabling Decision-Making with the Modified Causal Forest: Policy Trees for Treatment Assignment", "link": "https://arxiv.org/abs/2406.02241", "description": "arXiv:2406.02241v1 Announce Type: new \nAbstract: Decision-making plays a pivotal role in shaping outcomes in various disciplines, such as medicine, economics, and business. This paper provides guidance to practitioners on how to implement a decision tree designed to address treatment assignment policies using an interpretable and non-parametric algorithm. Our Policy Tree is motivated on the method proposed by Zhou, Athey, and Wager (2023), distinguishing itself for the policy score calculation, incorporating constraints, and handling categorical and continuous variables. We demonstrate the usage of the Policy Tree for multiple, discrete treatments on data sets from different fields. The Policy Tree is available in Python's open-source package mcf (Modified Causal Forest)."}, "https://arxiv.org/abs/2406.02297": {"title": "Optimal Stock Portfolio Selection with a Multivariate Hidden Markov Model", "link": "https://arxiv.org/abs/2406.02297", "description": "arXiv:2406.02297v1 Announce Type: new \nAbstract: The underlying market trends that drive stock price fluctuations are often referred to in terms of bull and bear markets. Optimal stock portfolio selection methods need to take into account these market trends; however, the bull and bear market states tend to be unobserved and can only be assigned retrospectively. We fit a linked hidden Markov model (LHMM) to relative stock price changes for S&P 500 stocks from 2011--2016 based on weekly closing values. 
The LHMM consists of a multivariate state process whose individual components correspond to HMMs for each of the 12 sectors of the S\\&P 500 stocks. The state processes are linked using a Gaussian copula so that the states of the component chains are correlated at any given time point. The LHMM allows us to capture more heterogeneity in the underlying market dynamics for each sector. In this study, stock performances are evaluated in terms of capital gains using the LHMM by utilizing historical stock price data. Based on the fitted LHMM, optimal stock portfolios are constructed to maximize capital gain while balancing reward and risk. Under out-of-sample testing, the annual capital gain for the portfolios for 2016--2017 are calculated. Portfolios constructed using the LHMM are able to generate returns comparable to the S&P 500 index."}, "https://arxiv.org/abs/2406.02320": {"title": "Compositional dynamic modelling for causal prediction in multivariate time series", "link": "https://arxiv.org/abs/2406.02320", "description": "arXiv:2406.02320v1 Announce Type: new \nAbstract: Theoretical developments in sequential Bayesian analysis of multivariate dynamic models underlie new methodology for causal prediction. This extends the utility of existing models with computationally efficient methodology, enabling routine exploration of Bayesian counterfactual analyses with multiple selected time series as synthetic controls. Methodological contributions also define the concept of outcome adaptive modelling to monitor and inferentially respond to changes in experimental time series following interventions designed to explore causal effects. The benefits of sequential analyses with time-varying parameter models for causal investigations are inherited in this broader setting. A case study in commercial causal analysis-- involving retail revenue outcomes related to marketing interventions-- highlights the methodological advances."}, "https://arxiv.org/abs/2406.02321": {"title": "A Bayesian nonlinear stationary model with multiple frequencies for business cycle analysis", "link": "https://arxiv.org/abs/2406.02321", "description": "arXiv:2406.02321v1 Announce Type: new \nAbstract: We design a novel, nonlinear single-source-of-error model for analysis of multiple business cycles. The model's specification is intended to capture key empirical characteristics of business cycle data by allowing for simultaneous cycles of different types and lengths, as well as time-variable amplitude and phase shift. The model is shown to feature relevant theoretical properties, including stationarity and pseudo-cyclical autocovariance function, and enables a decomposition of overall cyclic fluctuations into separate frequency-specific components. We develop a Bayesian framework for estimation and inference in the model, along with an MCMC procedure for posterior sampling, combining the Gibbs sampler and the Metropolis-Hastings algorithm, suitably adapted to address encountered numerical issues. Empirical results obtained from the model applied to the Polish GDP growth rates imply co-existence of two types of economic fluctuations: the investment and inventory cycles, and support the stochastic variability of the amplitude and phase shift, also capturing some business cycle asymmetries. 
Finally, the Bayesian framework enables a fully probabilistic inference on the business cycle clocks and dating, which seems the most relevant approach in view of economic uncertainties."}, "https://arxiv.org/abs/2406.02369": {"title": "Identifying Sample Size and Accuracy and Precision of the Estimators in Case-Crossover Analysis with Distributed Lags of Heteroskedastic Time-Varying Continuous Exposures Measured with Simple or Complex Error", "link": "https://arxiv.org/abs/2406.02369", "description": "arXiv:2406.02369v1 Announce Type: new \nAbstract: Power analyses help investigators design robust and reproducible research. Understanding of determinants of statistical power is helpful in interpreting results in publications and in analysis for causal inference. Case-crossover analysis, a matched case-control analysis, is widely used to estimate health effects of short-term exposures. Despite its widespread use, understanding of sample size, statistical power, and the accuracy and precision of the estimator in real-world data settings is very limited. First, the variance of exposures that exhibit spatiotemporal patterns may be heteroskedastic (e.g., air pollution and temperature exposures, impacted by climate change). Second, distributed lags of the exposure variable may be used to identify critical exposure time-windows. Third, exposure measurement error is not uncommon, impacting the accuracy and/or precision of the estimator, depending on the measurement error mechanism. Exposure measurement errors result in covariate measurement errors of distributed lags. All these issues complicate the understanding. Therefore, I developed approximation equations for sample size, estimates of the estimators and standard errors, and identified conditions for applications. I discussed polynomials for non-linear effect estimation. I analyzed air pollution estimates in the United States (U.S.), developed by U.S. Environmental Protection Agency to examine errors, and conducted statistical simulations. Overall, sample size can be calculated based on external information about exposure variable validation, without validation data in hand. For estimators of distributed lags, calculations may perform well if residual confounding due to covariate measurement error is not severe. This condition may sometimes be difficult to identify without validation data, suggesting investigators should consider validation research."}, "https://arxiv.org/abs/2406.02525": {"title": "The Impact of Acquisition on Product Quality in the Console Gaming Industry", "link": "https://arxiv.org/abs/2406.02525", "description": "arXiv:2406.02525v1 Announce Type: new \nAbstract: The console gaming industry, a dominant force in the global entertainment sector, has witnessed a wave of consolidation in recent years, epitomized by Microsoft's high-profile acquisitions of Activision Blizzard and Zenimax. This study investigates the repercussions of such mergers on consumer welfare and innovation within the gaming landscape, focusing on product quality as a key metric. Through a comprehensive analysis employing a difference-in-difference model, the research evaluates the effects of acquisition on game review ratings, drawing from a dataset comprising over 16,000 console games released between 2000 and 2023. The research addresses key assumptions underlying the difference-in-difference methodology, including parallel trends and spillover effects, to ensure the robustness of the findings. 
The DID results suggest a positive and statistically significant impact of acquisition on game review ratings, when controlling for genre and release year. The study contributes to the literature by offering empirical evidence on the direct consequences of industry consolidation for consumer welfare and competition dynamics within the gaming sector."}, "https://arxiv.org/abs/2406.02530": {"title": "LongBet: Heterogeneous Treatment Effect Estimation in Panel Data", "link": "https://arxiv.org/abs/2406.02530", "description": "arXiv:2406.02530v1 Announce Type: new \nAbstract: This paper introduces a novel approach for estimating heterogeneous treatment effects of a binary treatment in panel data, particularly focusing on short panels with large cross-sectional data and observed confounders. In contrast to the traditional difference-in-differences literature, which often relies on the parallel trends assumption, our proposed model does not necessitate such an assumption. Instead, it leverages observed confounders to impute potential outcomes and identify treatment effects. The method presented is a Bayesian semi-parametric approach based on the Bayesian causal forest model, which is extended here to suit panel data settings. The approach offers the usual Bayesian advantage of providing uncertainty quantification for the estimates. Simulation studies demonstrate its performance with and without parallel trends. Additionally, our proposed model enables the estimation of conditional average treatment effects, a capability that is rarely available in panel data settings."}, "https://arxiv.org/abs/2406.01649": {"title": "CoLa-DCE -- Concept-guided Latent Diffusion Counterfactual Explanations", "link": "https://arxiv.org/abs/2406.01649", "description": "arXiv:2406.01649v1 Announce Type: cross \nAbstract: Recent advancements in generative AI have introduced novel prospects and practical implementations. Diffusion models in particular show their strength in generating diverse and, at the same time, realistic features, positioning them well for generating counterfactual explanations for computer vision models. Answering \"what if\" questions of what needs to change to make an image classifier change its prediction, counterfactual explanations align well with human understanding and consequently help in making model behavior more comprehensible. Current methods succeed in generating authentic counterfactuals, but lack transparency as feature changes are not directly perceivable. To address this limitation, we introduce Concept-guided Latent Diffusion Counterfactual Explanations (CoLa-DCE). CoLa-DCE generates concept-guided counterfactuals for any classifier with a high degree of control regarding concept selection and spatial conditioning. The counterfactuals achieve increased granularity through minimal feature changes. The reference feature visualization ensures better comprehensibility, while the feature localization provides increased transparency of \"where\" changed \"what\". 
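The acquisition study above rests on a two-way difference-in-differences interaction design. A minimal version of that regression, on fabricated review ratings with made-up variable names and effect sizes (not the paper's data), looks like the following sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4000
df = pd.DataFrame({
    "acquired": rng.binomial(1, 0.3, n),     # ever-treated studio indicator
    "post": rng.binomial(1, 0.5, n),         # released after the acquisition
    "year": rng.integers(2000, 2024, n),
})
# Simulated ratings with a +2 point effect for acquired x post observations.
df["rating"] = (70 + 3 * df["acquired"] + 1 * df["post"]
                + 2 * df["acquired"] * df["post"] + rng.normal(0, 5, n))

# Two-way DID: the interaction coefficient is the treatment effect estimate.
fit = smf.ols("rating ~ acquired * post + C(year)", data=df).fit(cov_type="HC1")
print(fit.params["acquired:post"], fit.bse["acquired:post"])
```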
We demonstrate the advantages of our approach in minimality and comprehensibility across multiple image classification models and datasets and provide insights into how our CoLa-DCE explanations help comprehend model errors like misclassification cases."}, "https://arxiv.org/abs/2406.01653": {"title": "An efficient Wasserstein-distance approach for reconstructing jump-diffusion processes using parameterized neural networks", "link": "https://arxiv.org/abs/2406.01653", "description": "arXiv:2406.01653v1 Announce Type: cross \nAbstract: We analyze the Wasserstein distance ($W$-distance) between two probability distributions associated with two multidimensional jump-diffusion processes. Specifically, we analyze a temporally decoupled squared $W_2$-distance, which provides both upper and lower bounds associated with the discrepancies in the drift, diffusion, and jump amplitude functions between the two jump-diffusion processes. Then, we propose a temporally decoupled squared $W_2$-distance method for efficiently reconstructing unknown jump-diffusion processes from data using parameterized neural networks. We further show its performance can be enhanced by utilizing prior information on the drift function of the jump-diffusion process. The effectiveness of our proposed reconstruction method is demonstrated across several examples and applications."}, "https://arxiv.org/abs/2406.01663": {"title": "An efficient solution to Hidden Markov Models on trees with coupled branches", "link": "https://arxiv.org/abs/2406.01663", "description": "arXiv:2406.01663v1 Announce Type: cross \nAbstract: Hidden Markov Models (HMMs) are powerful tools for modeling sequential data, where the underlying states evolve in a stochastic manner and are only indirectly observable. Traditional HMM approaches are well-established for linear sequences, and have been extended to other structures such as trees. In this paper, we extend the framework of HMMs on trees to address scenarios where the tree-like structure of the data includes coupled branches -- a common feature in biological systems where entities within the same lineage exhibit dependent characteristics. We develop a dynamic programming algorithm that efficiently solves the likelihood, decoding, and parameter learning problems for tree-based HMMs with coupled branches. Our approach scales polynomially with the number of states and nodes, making it computationally feasible for a wide range of applications and does not suffer from the underflow problem. We demonstrate our algorithm by applying it to simulated data and propose self-consistency checks for validating the assumptions of the model used for inference. This work not only advances the theoretical understanding of HMMs on trees but also provides a practical tool for analyzing complex biological data where dependencies between branches cannot be ignored."}, "https://arxiv.org/abs/2406.01750": {"title": "Survival Data Simulation With the R Package rsurv", "link": "https://arxiv.org/abs/2406.01750", "description": "arXiv:2406.01750v1 Announce Type: cross \nAbstract: In this paper we propose a novel R package, called rsurv, developed for general survival data simulation purposes. The package is built under a new approach to simulate survival data that depends heavily on the use of dplyr verbs. 
The proposed package allows simulations of survival data from a wide range of regression models, including accelerated failure time (AFT), proportional hazards (PH), proportional odds (PO), accelerated hazard (AH), Yang and Prentice (YP), and extended hazard (EH) models. The package rsurv also stands out by its ability to generate survival data from an unlimited number of baseline distributions, provided that an implementation of the quantile function of the chosen baseline distribution is available in R. Another convenient feature of the package rsurv is that linear predictors are specified using R formulas, facilitating the inclusion of categorical variables, interaction terms and offset variables. The functions implemented in the package rsurv can also be employed to simulate survival data with more complex structures, such as survival data with different types of censoring mechanisms, survival data with a cure fraction, survival data with random effects (frailties), multivariate survival data, and competing risks survival data."}, "https://arxiv.org/abs/2406.01813": {"title": "Diffusion Boosted Trees", "link": "https://arxiv.org/abs/2406.01813", "description": "arXiv:2406.01813v1 Announce Type: cross \nAbstract: Combining the merits of both denoising diffusion probabilistic models and gradient boosting, the diffusion boosting paradigm is introduced for tackling supervised learning problems. We develop Diffusion Boosted Trees (DBT), which can be viewed as both a new denoising diffusion generative model parameterized by decision trees (one single tree for each diffusion timestep), and a new boosting algorithm that combines the weak learners into a strong learner of conditional distributions without making explicit parametric assumptions on their density forms. We demonstrate through experiments the advantages of DBT over deep neural network-based diffusion models as well as the competence of DBT on real-world regression tasks, and present a business application (fraud detection) of DBT for classification on tabular data with the ability to learn to defer."}, "https://arxiv.org/abs/2406.01823": {"title": "Causal Discovery with Fewer Conditional Independence Tests", "link": "https://arxiv.org/abs/2406.01823", "description": "arXiv:2406.01823v1 Announce Type: cross \nAbstract: Many questions in science center around the fundamental problem of understanding causal relationships. However, most constraint-based causal discovery algorithms, including the celebrated PC algorithm, often incur an exponential number of conditional independence (CI) tests, posing limitations in various applications. Addressing this, our work focuses on characterizing what can be learned about the underlying causal graph with a reduced number of CI tests. We show that it is possible to learn a coarser representation of the hidden causal graph with a polynomial number of tests. This coarser representation, named Causal Consistent Partition Graph (CCPG), comprises a partition of the vertices and a directed graph defined over its components. CCPG satisfies consistency of orientations and additional constraints which favor finer partitions. Furthermore, it reduces to the underlying causal graph when the causal graph is identifiable. 
As a consequence, our results offer the first efficient algorithm for recovering the true causal graph with a polynomial number of tests, in special cases where the causal graph is fully identifiable through observational data and potentially additional interventions."}, "https://arxiv.org/abs/2406.01933": {"title": "Orthogonal Causal Calibration", "link": "https://arxiv.org/abs/2406.01933", "description": "arXiv:2406.01933v1 Announce Type: cross \nAbstract: Estimates of causal parameters such as conditional average treatment effects and conditional quantile treatment effects play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters.\n In this work, we provide a general framework for calibrating predictors involving nuisance estimation. We consider a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\\ell$, under which we say an estimator $\\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. We prove generic upper bounds on the calibration error of any causal parameter estimate $\\theta$ with respect to any loss $\\ell$ using a concept called Neyman Orthogonality. Our bounds involve two decoupled terms - one measuring the error in estimating the unknown nuisance parameters, and the other representing the calibration error in a hypothetical world where the learned nuisance estimates were true. We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration. One algorithm, which applies to universally orthogonalizable loss functions, transforms the data into generalized pseudo-outcomes and applies an off-the-shelf calibration procedure. The other algorithm, which applies to conditionally orthogonalizable loss functions, extends the classical uniform mass binning algorithm to include nuisance estimation. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation."}, "https://arxiv.org/abs/2406.02049": {"title": "Causal Effect Identification in LiNGAM Models with Latent Confounders", "link": "https://arxiv.org/abs/2406.02049", "description": "arXiv:2406.02049v1 Announce Type: cross \nAbstract: We study the generic identifiability of causal effects in linear non-Gaussian acyclic models (LiNGAM) with latent variables. We consider the problem in two main settings: When the causal graph is known a priori, and when it is unknown. In both settings, we provide a complete graphical characterization of the identifiable direct or total causal effects among observed variables. Moreover, we propose efficient algorithms to certify the graphical conditions. Finally, we propose an adaptation of the reconstruction independent component analysis (RICA) algorithm that estimates the causal effects from the observational data given the causal graph. 
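The causal calibration entry above extends the classical uniform mass binning algorithm to settings with nuisance estimation. For reference, the classical (non-causal) uniform mass binning step is just a quantile-bin recalibration, sketched below on synthetic predicted probabilities; the nuisance and orthogonality machinery of the paper is deliberately omitted, and the miscalibration pattern is invented.

```python
import numpy as np

def uniform_mass_binning(scores, targets, n_bins=10):
    """Replace each score with the mean target in its equal-mass (quantile) bin."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    bin_means = np.array([targets[bins == b].mean() for b in range(n_bins)])
    return bin_means[bins]

rng = np.random.default_rng(10)
p_true = rng.uniform(size=5000)
y = rng.binomial(1, p_true)
raw = np.clip(p_true + 0.2 * (p_true - 0.5), 0, 1)   # systematically miscalibrated scores
cal = uniform_mass_binning(raw, y)
# Mean absolute calibration error before and after binning (lower is better).
print(np.abs(raw - p_true).mean(), np.abs(cal - p_true).mean())
```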
Experimental results show the effectiveness of the proposed method in estimating the causal effects."}, "https://arxiv.org/abs/2406.02360": {"title": "A Practical Approach for Exploring Granger Connectivity in High-Dimensional Networks of Time Series", "link": "https://arxiv.org/abs/2406.02360", "description": "arXiv:2406.02360v1 Announce Type: cross \nAbstract: This manuscript presents a novel method for discovering effective connectivity between specified pairs of nodes in a high-dimensional network of time series. To accurately perform Granger causality analysis from the first node to the second node, it is essential to eliminate the influence of all other nodes within the network. The approach proposed is to create a low-dimensional representation of all other nodes in the network using frequency-domain-based dynamic principal component analysis (spectral DPCA). The resulting scores are subsequently removed from the first and second nodes of interest, thus eliminating the confounding effect of other nodes within the high-dimensional network. To conduct hypothesis testing on Granger causality, we propose a permutation-based causality test. This test enhances the accuracy of our findings when the error structures are non-Gaussian. The approach has been validated in extensive simulation studies, which demonstrate the efficacy of the methodology as a tool for causality analysis in complex time series networks. The proposed methodology has also been demonstrated to be both expedient and viable on real datasets, with particular success observed on multichannel EEG networks."}, "https://arxiv.org/abs/2406.02424": {"title": "Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints", "link": "https://arxiv.org/abs/2406.02424", "description": "arXiv:2406.02424v1 Announce Type: cross \nAbstract: We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers that behave according to an unknown demand model. The firm aims to maximize its revenue, i.e. minimize its regret over a clairvoyant that knows the model in advance. The demand model is a generalized linear model (GLM), allowing for a stochastic feature vector in $\\mathbb R^d$ that encodes product and consumer information. We first show that the optimal regret upper bound is of order $\\sqrt{dT}$, up to a logarithmic factor, improving upon existing upper bounds in the literature by a $\\sqrt{d}$ factor. This sharper rate is materialised by two algorithms: a confidence bound-type (supCB) algorithm and an explore-then-commit (ETC) algorithm. A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem with many arms based on a careful discretization. We further study contextual dynamic pricing under the local differential privacy (LDP) constraints. In particular, we propose a stochastic gradient descent based ETC algorithm that achieves an optimal regret upper bound of order $d\\sqrt{T}/\\epsilon$, up to a logarithmic factor, where $\\epsilon>0$ is the privacy parameter. The regret upper bounds with and without LDP constraints are accompanied by newly constructed minimax lower bounds, which further characterize the cost of privacy. 
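The Granger connectivity entry above pairs a causality statistic with a permutation-based test. Stripped of the spectral DPCA adjustment for the other network nodes, the basic idea reduces to something like the bivariate sketch below; the lag length, permutation scheme, and data-generating process are my own simplifications, not the paper's procedure.

```python
import numpy as np

def granger_gain(x, y, lag=2):
    """Residual-sum-of-squares gain from adding lags of x when predicting y."""
    n = len(y)
    Y = y[lag:]
    lags_y = np.column_stack([y[lag - k:n - k] for k in range(1, lag + 1)])
    lags_x = np.column_stack([x[lag - k:n - k] for k in range(1, lag + 1)])

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        return np.sum((Y - design @ beta) ** 2)

    ones = np.ones((len(Y), 1))
    return rss(np.hstack([ones, lags_y])) - rss(np.hstack([ones, lags_y, lags_x]))

rng = np.random.default_rng(8)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()   # x Granger-causes y

obs = granger_gain(x, y)
null = [granger_gain(rng.permutation(x), y) for _ in range(500)]
print(np.mean([g >= obs for g in null]))                    # permutation p-value
```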
Extensive numerical experiments and a real data application on online lending are conducted to illustrate the efficiency and practical value of the proposed algorithms in dynamic pricing."}, "https://arxiv.org/abs/1604.04318": {"title": "Principal Sub-manifolds", "link": "https://arxiv.org/abs/1604.04318", "description": "arXiv:1604.04318v5 Announce Type: replace \nAbstract: We propose a novel method of finding principal components in multivariate data sets that lie on an embedded nonlinear Riemannian manifold within a higher-dimensional space. Our aim is to extend the geometric interpretation of PCA, while being able to capture non-geodesic modes of variation in the data. We introduce the concept of a principal sub-manifold, a manifold passing through a reference point, and at any point on the manifold extending in the direction of highest variation in the space spanned by the eigenvectors of the local tangent space PCA. Compared to recent work for the case where the sub-manifold is of dimension one Panaretos et al. (2014)$-$essentially a curve lying on the manifold attempting to capture one-dimensional variation$-$the current setting is much more general. The principal sub-manifold is therefore an extension of the principal flow, accommodating to capture higher dimensional variation in the data. We show the principal sub-manifold yields the ball spanned by the usual principal components in Euclidean space. By means of examples, we illustrate how to find, use and interpret a principal sub-manifold and we present an application in shape analysis."}, "https://arxiv.org/abs/2303.00178": {"title": "Disentangling Structural Breaks in Factor Models for Macroeconomic Data", "link": "https://arxiv.org/abs/2303.00178", "description": "arXiv:2303.00178v2 Announce Type: replace \nAbstract: Through a routine normalization of the factor variance, standard methods for estimating factor models in macroeconomics do not distinguish between breaks of the factor variance and factor loadings. We argue that it is important to distinguish between structural breaks in the factor variance and loadings within factor models commonly employed in macroeconomics as both can lead to markedly different interpretations when viewed via the lens of the underlying dynamic factor model. We then develop a projection-based decomposition that leads to two standard and easy-to-implement Wald tests to disentangle structural breaks in the factor variance and factor loadings. Applying our procedure to U.S. macroeconomic data, we find evidence of both types of breaks associated with the Great Moderation and the Great Recession. Through our projection-based decomposition, we estimate that the Great Moderation is associated with an over 60% reduction in the total factor variance, highlighting the relevance of disentangling breaks in the factor structure."}, "https://arxiv.org/abs/2306.13257": {"title": "Semiparametric Estimation of the Shape of the Limiting Bivariate Point Cloud", "link": "https://arxiv.org/abs/2306.13257", "description": "arXiv:2306.13257v3 Announce Type: replace \nAbstract: We propose a model to flexibly estimate joint tail properties by exploiting the convergence of an appropriately scaled point cloud onto a compact limit set. Characteristics of the shape of the limit set correspond to key tail dependence properties. We directly model the shape of the limit set using Bezier splines, which allow flexible and parsimonious specification of shapes in two dimensions. 
We then fit the Bezier splines to data in pseudo-polar coordinates using Markov chain Monte Carlo sampling, utilizing a limiting approximation to the conditional likelihood of the radii given angles. By imposing appropriate constraints on the parameters of the Bezier splines, we guarantee that each posterior sample is a valid limit set boundary, allowing direct posterior analysis of any quantity derived from the shape of the curve. Furthermore, we obtain interpretable inference on the asymptotic dependence class by using mixture priors with point masses on the corner of the unit box. Finally, we apply our model to bivariate datasets of extremes of variables related to fire risk and air pollution."}, "https://arxiv.org/abs/2401.00618": {"title": "Changes-in-Changes for Ordered Choice Models: Too Many \"False Zeros\"?", "link": "https://arxiv.org/abs/2401.00618", "description": "arXiv:2401.00618v2 Announce Type: replace \nAbstract: In this paper, we develop a Difference-in-Differences model for discrete, ordered outcomes, building upon elements from a continuous Changes-in-Changes model. We focus on outcomes derived from self-reported survey data eliciting socially undesirable, illegal, or stigmatized behaviors like tax evasion, substance abuse, or domestic violence, where too many \"false zeros\", or more broadly, underreporting are likely. We provide characterizations for distributional parallel trends, a concept central to our approach, within a general threshold-crossing model framework. In cases where outcomes are assumed to be reported correctly, we propose a framework for identifying and estimating treatment effects across the entire distribution. This framework is then extended to modeling underreported outcomes, allowing the reporting decision to depend on treatment status. A simulation study documents the finite sample performance of the estimators. Applying our methodology, we investigate the impact of recreational marijuana legalization for adults in several U.S. states on the short-term consumption behavior of 8th-grade high-school students. The results indicate small, but significant increases in consumption probabilities at each level. These effects are further amplified upon accounting for misreporting."}, "https://arxiv.org/abs/2209.04419": {"title": "Majority Vote for Distributed Differentially Private Sign Selection", "link": "https://arxiv.org/abs/2209.04419", "description": "arXiv:2209.04419v2 Announce Type: replace-cross \nAbstract: Privacy-preserving data analysis has become more prevalent in recent years. In this study, we propose a distributed group differentially private Majority Vote mechanism, for the sign selection problem in a distributed setup. To achieve this, we apply the iterative peeling to the stability function and use the exponential mechanism to recover the signs. For enhanced applicability, we study the private sign selection for mean estimation and linear regression problems, in distributed systems. Our method recovers the support and signs with the optimal signal-to-noise ratio as in the non-private scenario, which is better than contemporary works of private variable selections. Moreover, the sign selection consistency is justified by theoretical guarantees. 
Simulation studies are conducted to demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2303.02119": {"title": "Conditional Aalen--Johansen estimation", "link": "https://arxiv.org/abs/2303.02119", "description": "arXiv:2303.02119v2 Announce Type: replace-cross \nAbstract: The conditional Aalen--Johansen estimator, a general-purpose non-parametric estimator of conditional state occupation probabilities, is introduced. The estimator is applicable for any finite-state jump process and supports conditioning on external as well as internal covariate information. The conditioning feature permits for a much more detailed analysis of the distributional characteristics of the process. The estimator reduces to the conditional Kaplan--Meier estimator in the special case of a survival model and also englobes other, more recent, landmark estimators when covariates are discrete. Strong uniform consistency and asymptotic normality are established under lax moment conditions on the multivariate counting process, allowing in particular for an unbounded number of transitions."}, "https://arxiv.org/abs/2306.05937": {"title": "Robust Data-driven Prescriptiveness Optimization", "link": "https://arxiv.org/abs/2306.05937", "description": "arXiv:2306.05937v2 Announce Type: replace-cross \nAbstract: The abundance of data has led to the emergence of a variety of optimization techniques that attempt to leverage available side information to provide more anticipative decisions. The wide range of methods and contexts of application have motivated the design of a universal unitless measure of performance known as the coefficient of prescriptiveness. This coefficient was designed to quantify both the quality of contextual decisions compared to a reference one and the prescriptive power of side information. To identify policies that maximize the former in a data-driven context, this paper introduces a distributionally robust contextual optimization model where the coefficient of prescriptiveness substitutes for the classical empirical risk minimization objective. We present a bisection algorithm to solve this model, which relies on solving a series of linear programs when the distributional ambiguity set has an appropriate nested form and polyhedral structure. Studying a contextual shortest path problem, we evaluate the robustness of the resulting policies against alternative methods when the out-of-sample dataset is subject to varying amounts of distribution shift."}, "https://arxiv.org/abs/2406.02751": {"title": "Bayesian Statistics: A Review and a Reminder for the Practicing Reliability Engineer", "link": "https://arxiv.org/abs/2406.02751", "description": "arXiv:2406.02751v1 Announce Type: new \nAbstract: This paper introduces and reviews some of the principles and methods used in Bayesian reliability. 
It specifically discusses methods used in the analysis of success/no-success data and then reminds the reader of a simple (yet infrequently applied) Monte Carlo algorithm that can be used to calculate the posterior distribution of a system's reliability."}, "https://arxiv.org/abs/2406.02794": {"title": "PriME: Privacy-aware Membership profile Estimation in networks", "link": "https://arxiv.org/abs/2406.02794", "description": "arXiv:2406.02794v1 Announce Type: new \nAbstract: This paper presents a novel approach to estimating community membership probabilities for network vertices generated by the Degree Corrected Mixed Membership Stochastic Block Model while preserving individual edge privacy. Operating within the $\\varepsilon$-edge local differential privacy framework, we introduce an optimal private algorithm based on a symmetric edge flip mechanism and spectral clustering for accurate estimation of vertex community memberships. We conduct a comprehensive analysis of the estimation risk and establish the optimality of our procedure by providing matching lower bounds to the minimax risk under privacy constraints. To validate our approach, we demonstrate its performance through numerical simulations and its practical application to real-world data. This work represents a significant step forward in balancing accurate community membership estimation with stringent privacy preservation in network data analysis."}, "https://arxiv.org/abs/2406.02834": {"title": "Asymptotic inference with flexible covariate adjustment under rerandomization and stratified rerandomization", "link": "https://arxiv.org/abs/2406.02834", "description": "arXiv:2406.02834v1 Announce Type: new \nAbstract: Rerandomization is an effective treatment allocation procedure to control for baseline covariate imbalance. For estimating the average treatment effect, rerandomization has been previously shown to improve the precision of the unadjusted and the linearly-adjusted estimators over simple randomization without compromising consistency. However, it remains unclear whether such results apply more generally to the class of M-estimators, including the g-computation formula with generalized linear regression and doubly-robust methods, and more broadly, to efficient estimators with data-adaptive machine learners. In this paper, using a super-population framework, we develop the asymptotic theory for a more general class of covariate-adjusted estimators under rerandomization and its stratified extension. We prove that the asymptotic linearity and the influence function remain identical for any M-estimator under simple randomization and rerandomization, but rerandomization may lead to a non-Gaussian asymptotic distribution. We further explain, drawing examples from several common M-estimators, that asymptotic normality can be achieved if rerandomization variables are appropriately adjusted for in the final estimator. These results are extended to stratified rerandomization. Finally, we study the asymptotic theory for efficient estimators based on data-adaptive machine learners, and prove their efficiency optimality under rerandomization and stratified rerandomization. 
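The Bayesian reliability review above ends with a simple Monte Carlo algorithm for the posterior distribution of a system's reliability from success/no-success data. A sketch of that style of computation, for a two-component series system with Beta(1, 1) priors, is given below; the test counts and system structure are invented, and the paper's exact algorithm may differ.

```python
import numpy as np

rng = np.random.default_rng(9)

# Success/no-success test data for two components in series, Beta(1, 1) priors.
tests  = np.array([20, 15])     # number of trials per component
passes = np.array([19, 13])     # number of successes per component

# Monte Carlo: draw component reliabilities from their Beta posteriors,
# then propagate them through the series-system structure function.
draws = rng.beta(1 + passes, 1 + tests - passes, size=(100_000, 2))
system = draws.prod(axis=1)     # series system: all components must work

print(system.mean(), np.percentile(system, [2.5, 97.5]))
```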
Our results are demonstrated via simulations and re-analyses of a cluster-randomized experiment that used stratified rerandomization."}, "https://arxiv.org/abs/2406.02835": {"title": "When is IV identification agnostic about outcomes?", "link": "https://arxiv.org/abs/2406.02835", "description": "arXiv:2406.02835v1 Announce Type: new \nAbstract: Many identification results in instrumental variables (IV) models have the property that identification holds with no restrictions on the joint distribution of potential outcomes or how these outcomes are correlated with selection behavior. This enables many IV models to allow for arbitrary heterogeneity in treatment effects and the possibility of selection on gains in the outcome variable. I call this type of identification result \"outcome-agnostic\", and provide a necessary and sufficient condition for counterfactual means or treatment effects to be identified in an outcome-agnostic manner, when the instruments and treatments have finite support. In addition to unifying many existing IV identification results, this characterization suggests a brute-force approach to revealing all restrictions on selection behavior that yield identification of treatment effect parameters. While computationally intensive, the approach uncovers, even in simple settings, new selection models that afford identification of interpretable causal parameters."}, "https://arxiv.org/abs/2406.02840": {"title": "Statistical inference of convex order by Wasserstein projection", "link": "https://arxiv.org/abs/2406.02840", "description": "arXiv:2406.02840v1 Announce Type: new \nAbstract: Ranking distributions according to a stochastic order has wide applications in diverse areas. Although stochastic dominance has received much attention, convex order, particularly in general dimensions, has yet to be investigated from a statistical point of view. This article addresses this gap by introducing a simple statistical test for convex order based on the Wasserstein projection distance. This projection distance not only encodes whether two distributions are indeed in convex order, but also quantifies the deviation from the desired convex order and produces an optimal convex order approximation. Lipschitz stability of the backward and forward Wasserstein projection distance is proved, which leads to elegant consistency results for the estimator we employ as our test statistic. Combining these with state-of-the-art results regarding the convergence rate of empirical distributions, we also derive upper bounds for the $p$-value and type I error of our test statistic, as well as upper bounds on the type II error for an appropriate class of strict alternatives. Lastly, we provide an efficient numerical scheme for our test statistic, by way of an entropic Frank-Wolfe algorithm. Some experiments based on synthetic data sets illuminate the success of our approach empirically."}, "https://arxiv.org/abs/2406.02948": {"title": "Copula-based semiparametric nonnormal transformed linear model for survival data with dependent censoring", "link": "https://arxiv.org/abs/2406.02948", "description": "arXiv:2406.02948v1 Announce Type: new \nAbstract: Although the independent censoring assumption is commonly used in survival analysis, it can be violated when the censoring time is related to the survival time, which often happens in many practical applications. To address this issue, we propose a flexible semiparametric method for dependent censored data. 
Our approach involves fitting the survival time and the censoring time with a joint transformed linear model, where the transformed function is unspecified. This allows for a very general class of models that can account for possible covariate effects, while also accommodating administrative censoring. We assume that the transformed variables have a bivariate nonnormal distribution based on parametric copulas and parametric marginals, which further enhances the flexibility of our method. We demonstrate the identifiability of the proposed model and establish the consistency and asymptotic normality of the model parameters under appropriate regularity conditions and assumptions. Furthermore, we evaluate the performance of our method through extensive simulation studies, and provide a real data example for illustration."}, "https://arxiv.org/abs/2406.03022": {"title": "Is local opposition taking the wind out of the energy transition?", "link": "https://arxiv.org/abs/2406.03022", "description": "arXiv:2406.03022v1 Announce Type: new \nAbstract: Local opposition to the installation of renewable energy sources is a potential threat to the energy transition. Local communities tend to oppose the construction of energy plants due to the associated negative externalities (the so-called 'not in my backyard' or NIMBY phenomenon) according to widespread belief, mostly based on anecdotal evidence. Using administrative data on wind turbine installation and electoral outcomes across municipalities located in the South of Italy during 2000-19, we estimate the impact of wind turbines' installation on incumbent regional governments' electoral support during the next elections. Our main findings, derived by a wind-speed based instrumental variable strategy, point in the direction of a mild and not statistically significant electoral backlash for right-wing regional administrations and of a strong and statistically significant positive reinforcement for left-wing regional administrations. Based on our analysis, the hypothesis of an electoral effect of NIMBY type of behavior in connection with the development of wind turbines appears not to be supported by the data."}, "https://arxiv.org/abs/2406.03053": {"title": "Identification of structural shocks in Bayesian VEC models with two-state Markov-switching heteroskedasticity", "link": "https://arxiv.org/abs/2406.03053", "description": "arXiv:2406.03053v1 Announce Type: new \nAbstract: We develop a Bayesian framework for cointegrated structural VAR models identified by two-state Markovian breaks in conditional covariances. The resulting structural VEC specification with Markov-switching heteroskedasticity (SVEC-MSH) is formulated in the so-called B-parameterization, in which the prior distribution is specified directly for the matrix of the instantaneous reactions of the endogenous variables to structural innovations. We discuss some caveats pertaining to the identification conditions presented earlier in the literature on stationary structural VAR-MSH models, and revise the restrictions to actually ensure the unique global identification through the two-state heteroskedasticity. To enable the posterior inference in the proposed model, we design an MCMC procedure, combining the Gibbs sampler and the Metropolis-Hastings algorithm. 
The methodology is illustrated with both simulated and real-world data examples."}, "https://arxiv.org/abs/2406.03056": {"title": "Sparse two-stage Bayesian meta-analysis for individualized treatments", "link": "https://arxiv.org/abs/2406.03056", "description": "arXiv:2406.03056v1 Announce Type: new \nAbstract: Individualized treatment rules tailor treatments to patients based on clinical, demographic, and other characteristics. Estimation of individualized treatment rules requires the identification of individuals who benefit most from the particular treatments and thus the detection of variability in treatment effects. To develop an effective individualized treatment rule, data from multisite studies may be required due to the low power provided by smaller datasets for detecting the often small treatment-covariate interactions. However, sharing of individual-level data is sometimes constrained. Furthermore, sparsity may arise in two senses: different data sites may recruit from different populations, making it infeasible to estimate identical models or all parameters of interest at all sites, and the number of non-zero parameters in the model for the treatment rule may be small. To address these issues, we adopt a two-stage Bayesian meta-analysis approach to estimate individualized treatment rules which optimize expected patient outcomes using multisite data without disclosing individual-level data beyond the sites. Simulation results demonstrate that our approach can provide consistent estimates of the parameters which fully characterize the optimal individualized treatment rule. We estimate the optimal Warfarin dose strategy using data from the International Warfarin Pharmacogenetics Consortium, where data sparsity and small treatment-covariate interaction effects pose additional statistical challenges."}, "https://arxiv.org/abs/2406.03130": {"title": "Ordinal Mixed-Effects Random Forest", "link": "https://arxiv.org/abs/2406.03130", "description": "arXiv:2406.03130v1 Announce Type: new \nAbstract: We propose an innovative statistical method, called Ordinal Mixed-Effect Random Forest (OMERF), that extends the use of random forest to the analysis of hierarchical data and ordinal responses. The model preserves the flexibility and the ability to model complex patterns of both categorical and continuous variables, typical of tree-based ensemble methods, and, at the same time, takes into account the structure of hierarchical data, modeling the dependence structure induced by the grouping and allowing statistical inference at all data levels. A simulation study is conducted to validate the performance of the proposed method and to compare it to that of other state-of-the-art models. The application of OMERF is exemplified in a case study focusing on predicting students' performance using data from the Programme for International Student Assessment (PISA) 2022. The model identifies discriminating student characteristics and estimates the school effect."}, "https://arxiv.org/abs/2406.03252": {"title": "Continuous-time modeling and bootstrap for chain ladder reserving", "link": "https://arxiv.org/abs/2406.03252", "description": "arXiv:2406.03252v1 Announce Type: new \nAbstract: We revisit Mack's famous model, which gives an estimate for the mean square error of prediction of the chain ladder claims reserves. We introduce a stochastic differential equation driven by a Brownian motion to model the accumulated total claims amount for the chain ladder method. 
Within this continuous-time framework, we propose a bootstrap technique for estimating the distribution of claims reserves. It turns out that our approach inherently captures asymmetry and non-negativity, eliminating the need for additional assumptions. We conclude with a case study and a comparative analysis against alternative methodologies based on Mack's model."}, "https://arxiv.org/abs/2406.03296": {"title": "Multi-relational Network Autoregression Model with Latent Group Structures", "link": "https://arxiv.org/abs/2406.03296", "description": "arXiv:2406.03296v1 Announce Type: new \nAbstract: Multi-relational networks among entities are frequently observed in the era of big data. Quantifying the effects of multiple networks has attracted significant research interest recently. In this work, we model multiple network effects through an autoregressive framework for tensor-valued time series. To characterize the potential heterogeneity of the networks and handle the high dimensionality of the time series data simultaneously, we assume a separate group structure for entities in each network and estimate all group memberships in a data-driven fashion. Specifically, we propose a group tensor network autoregression (GTNAR) model, which assumes that within each network, entities in the same group share the same set of model parameters, and the parameters differ across networks. An iterative algorithm is developed to estimate the model parameters and the latent group memberships simultaneously. Theoretically, we show that the group-wise parameters and group memberships can be consistently estimated when the group numbers are correctly- or possibly over-specified. An information criterion for group number estimation of each network is also provided to consistently select the group numbers. Lastly, we apply the method to a Yelp dataset to illustrate its usefulness."}, "https://arxiv.org/abs/2406.03302": {"title": "Combining an experimental study with external data: study designs and identification strategies", "link": "https://arxiv.org/abs/2406.03302", "description": "arXiv:2406.03302v1 Announce Type: new \nAbstract: There is increasing interest in combining information from experimental studies, including randomized and single-group trials, with information from external experimental or observational data sources. Such efforts are usually motivated by the desire to compare treatments evaluated in different studies -- for instance, through the introduction of external treatment groups -- or to estimate treatment effects with greater precision. Proposals to combine experimental studies with external data were made at least as early as the 1970s, but in recent years have come under increasing consideration by regulatory agencies involved in drug and device evaluation, particularly with the increasing availability of rich observational data. In this paper, we describe basic templates of study designs and data structures for combining information from experimental studies with external data, and use the potential (counterfactual) outcomes framework to elaborate identification strategies for potential outcome means and average treatment effects in these designs. 
In formalizing designs and identification strategies for combining information from experimental studies with external data, we hope to provide a conceptual foundation to support the systematic use and evaluation of such efforts."}, "https://arxiv.org/abs/2406.03321": {"title": "Decision synthesis in monetary policy", "link": "https://arxiv.org/abs/2406.03321", "description": "arXiv:2406.03321v1 Announce Type: new \nAbstract: The macroeconomy is a sophisticated dynamic system involving significant uncertainties that complicate modelling. In response, decision makers consider multiple models that provide different predictions and policy recommendations which are then synthesized into a policy decision. In this setting, we introduce and develop Bayesian predictive decision synthesis (BPDS) to formalize monetary policy decision processes. BPDS draws on recent developments in model combination and statistical decision theory that yield new opportunities in combining multiple models, emphasizing the integration of decision goals, expectations and outcomes into the model synthesis process. Our case study concerns central bank policy decisions about target interest rates with a focus on implications for multi-step macroeconomic forecasting."}, "https://arxiv.org/abs/2406.03336": {"title": "Griddy-Gibbs sampling for Bayesian P-splines models with Poisson data", "link": "https://arxiv.org/abs/2406.03336", "description": "arXiv:2406.03336v1 Announce Type: new \nAbstract: P-splines are appealing for smoothing Poisson distributed counts. They provide a flexible setting for modeling nonlinear model components based on a discretized penalty structure with a relatively simple computational backbone. Under a Bayesian inferential process relying on Markov chain Monte Carlo, estimates of spline coefficients are typically obtained by means of Metropolis-type algorithms, which may suffer from convergence issues if the proposal distribution is not properly chosen. To avoid such a sensitive calibration choice, we extend the Griddy-Gibbs sampler to Bayesian P-splines models with a Poisson response variable. In this model class, conditional posterior distributions of spline components are shown to have attractive mathematical properties. Despite their non-conjugate nature, conditional posteriors of spline coefficients can be efficiently explored with a Gibbs sampling scheme by relying on grid-based approximations. The proposed Griddy-Gibbs sampler for Bayesian P-splines (GGSBPS) algorithm is an interesting calibration-free tool for density estimation and histogram smoothing that is made available in a compact and user-friendly routine. The performance of our approach is assessed in different simulation settings and the GGSBPS algorithm is illustrated on two real datasets."}, "https://arxiv.org/abs/2406.03358": {"title": "Bayesian Quantile Estimation and Regression with Martingale Posteriors", "link": "https://arxiv.org/abs/2406.03358", "description": "arXiv:2406.03358v1 Announce Type: new \nAbstract: Quantile estimation and regression within the Bayesian framework is challenging as the choice of likelihood and prior is not obvious. In this paper, we introduce a novel Bayesian nonparametric method for quantile estimation and regression based on the recently introduced martingale posterior (MP) framework. The core idea of the MP is that posterior sampling is equivalent to predictive imputation, which allows us to break free of the stringent likelihood-prior specification. 
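The "Griddy-Gibbs sampling for Bayesian P-splines models with Poisson data" entry above describes exploring non-conjugate conditional posteriors of spline coefficients with grid-based approximations. A generic griddy-Gibbs step can be sketched on a toy Poisson log-linear model of my own choosing (two coefficients, flat priors); this is not the GGSBPS implementation, only the grid-then-inverse-CDF idea.

```python
# Generic griddy-Gibbs sketch (my illustration, not the GGSBPS code): each coefficient
# of a toy Poisson log-linear model is updated by evaluating its unnormalized
# conditional posterior on a grid and sampling from the discretized distribution.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = rng.poisson(np.exp(0.5 + 1.5 * x))          # synthetic counts

def log_post(beta):                              # flat prior, Poisson log-likelihood
    eta = beta[0] + beta[1] * x
    return np.sum(y * eta - np.exp(eta))

beta = np.zeros(2)
grid = np.linspace(-3, 3, 301)
draws = []
for it in range(1000):
    for j in range(2):                           # one griddy-Gibbs step per coefficient
        logp = np.array([log_post(np.where(np.arange(2) == j, g, beta)) for g in grid])
        p = np.exp(logp - logp.max())
        p /= p.sum()
        beta[j] = rng.choice(grid, p=p)          # categorical draw = inverse CDF on the grid
    draws.append(beta.copy())

print(np.mean(draws[200:], axis=0))              # should land near the true (0.5, 1.5)
```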
We demonstrate that a recursive estimate of a smooth quantile function, subject to a martingale condition, is entirely sufficient for full nonparametric Bayesian inference. We term the resulting posterior distribution as the quantile martingale posterior (QMP), which arises from an implicit generative predictive distribution. Associated with the QMP is an expedient, MCMC-free and parallelizable posterior computation scheme, which can be further accelerated with an asymptotic approximation based on a Gaussian process. Furthermore, the well-known issue of monotonicity in quantile estimation is naturally alleviated through increasing rearrangement due to the connections to the Bayesian bootstrap. Finally, the QMP has a particularly tractable form that allows for comprehensive theoretical study, which forms a main focus of the work. We demonstrate the ease of posterior computation in simulations and real data experiments."}, "https://arxiv.org/abs/2406.03385": {"title": "Discrete Autoregressive Switching Processes in Sparse Graphical Modeling of Multivariate Time Series Data", "link": "https://arxiv.org/abs/2406.03385", "description": "arXiv:2406.03385v1 Announce Type: new \nAbstract: We propose a flexible Bayesian approach for sparse Gaussian graphical modeling of multivariate time series. We account for temporal correlation in the data by assuming that observations are characterized by an underlying and unobserved hidden discrete autoregressive process. We assume multivariate Gaussian emission distributions and capture spatial dependencies by modeling the state-specific precision matrices via graphical horseshoe priors. We characterize the mixing probabilities of the hidden process via a cumulative shrinkage prior that accommodates zero-inflated parameters for non-active components, and further incorporate a sparsity-inducing Dirichlet prior to estimate the effective number of states from the data. For posterior inference, we develop a sampling procedure that allows estimation of the number of discrete autoregressive lags and the number of states, and that cleverly avoids having to deal with the changing dimensions of the parameter space. We thoroughly investigate performance of our proposed methodology through several simulation studies. We further illustrate the use of our approach for the estimation of dynamic brain connectivity based on fMRI data collected on a subject performing a task-based experiment on latent learning"}, "https://arxiv.org/abs/2406.03400": {"title": "Non-stationary Spatio-Temporal Modeling Using the Stochastic Advection-Diffusion Equation", "link": "https://arxiv.org/abs/2406.03400", "description": "arXiv:2406.03400v1 Announce Type: new \nAbstract: We construct flexible spatio-temporal models through stochastic partial differential equations (SPDEs) where both diffusion and advection can be spatially varying. Computations are done through a Gaussian Markov random field approximation of the solution of the SPDE, which is constructed through a finite volume method. The new flexible non-separable model is compared to a flexible separable model both for reconstruction and forecasting and evaluated in terms of root mean square errors and continuous rank probability scores. A simulation study demonstrates that the non-separable model performs better when the data is simulated with non-separable effects such as diffusion and advection. 
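The quantile martingale posterior entry above notes that quantile crossing is alleviated through increasing rearrangement. The generic rearrangement device (sorting the estimated quantiles across quantile levels) can be sketched as follows; the crossing toy estimates are my own and this is not the paper's code.

```python
# Increasing rearrangement sketch (generic device, not the paper's implementation):
# sorting estimated quantiles across quantile levels removes quantile crossing while
# leaving an already-monotone estimate unchanged.
import numpy as np

levels = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
q_hat = np.array([1.2, 1.1, 1.8, 1.7, 2.3])   # toy estimates with two crossings

q_rearranged = np.sort(q_hat)                  # monotone in the quantile level
print(dict(zip(levels, q_rearranged)))
```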
Further, we estimate surrogate models for emulating the output of an ocean model in Trondheimsfjorden, Norway, and simulate observations of autonomous underwater vehicles. The results show that the flexible non-separable model outperforms the flexible separable model for real-time prediction of unobserved locations."}, "https://arxiv.org/abs/2406.03432": {"title": "Bayesian inference for scale mixtures of skew-normal linear models under the centered parameterization", "link": "https://arxiv.org/abs/2406.03432", "description": "arXiv:2406.03432v1 Announce Type: new \nAbstract: In many situations we are interested in modeling real data where the response distribution, even conditionally on the covariates, presents asymmetry and/or heavy/light tails. In these situations, it is more suitable to consider models based on skewed and/or heavy/light-tailed distributions, such as the class of scale mixtures of skew-normal distributions. The classical parameterization of these distributions may not be suitable due to some inferential issues that arise when the skewness parameter is in a neighborhood of 0; in such cases, the centered parameterization becomes more appropriate. In this paper, we develop a class of scale mixtures of skew-normal distributions under the centered parameterization and propose a linear regression model based on it. We explore a hierarchical representation and set up an MCMC scheme for parameter estimation. Furthermore, we develop residual and influence analysis tools. A Monte Carlo experiment is conducted to evaluate the performance of the MCMC algorithm and the behavior of the residual distribution. The methodology is illustrated with the analysis of a real data set."}, "https://arxiv.org/abs/2406.03463": {"title": "Gaussian Copula Models for Nonignorable Missing Data Using Auxiliary Marginal Quantiles", "link": "https://arxiv.org/abs/2406.03463", "description": "arXiv:2406.03463v1 Announce Type: new \nAbstract: We present an approach for modeling and imputation of nonignorable missing data under Gaussian copulas. The analyst posits a set of quantiles of the marginal distributions of the study variables, for example, reflecting information from external data sources or elicited expert opinion. When these quantiles are accurately specified, we prove it is possible to consistently estimate the copula correlation and perform multiple imputation in the presence of nonignorable missing data. We develop algorithms for estimation and imputation that are computationally efficient, which we evaluate in simulation studies of multiple imputation inferences. We apply the model to analyze associations between lead exposure levels and end-of-grade test scores for 170,000 students in North Carolina. These measurements are not missing at random, as children deemed at-risk for high lead exposure are more likely to be measured. We construct plausible marginal quantiles for lead exposure using national statistics provided by the Centers for Disease Control and Prevention. 
Complete cases and missing at random analyses appear to underestimate the relationships between certain variables and end-of-grade test scores, while multiple imputation inferences under our model support stronger adverse associations between lead exposure and educational outcomes."}, "https://arxiv.org/abs/2406.02584": {"title": "Planetary Causal Inference: Implications for the Geography of Poverty", "link": "https://arxiv.org/abs/2406.02584", "description": "arXiv:2406.02584v1 Announce Type: cross \nAbstract: Earth observation data such as satellite imagery can, when combined with machine learning, have profound impacts on our understanding of the geography of poverty through the prediction of living conditions, especially where government-derived economic indicators are either unavailable or potentially untrustworthy. Recent work has progressed in using EO data not only to predict spatial economic outcomes, but also to explore cause and effect, an understanding which is critical for downstream policy analysis. In this review, we first document the growth of interest in EO-ML analyses in the causal space. We then trace the relationship between spatial statistics and EO-ML methods before discussing the four ways in which EO data has been used in causal ML pipelines -- (1.) poverty outcome imputation for downstream causal analysis, (2.) EO image deconfounding, (3.) EO-based treatment effect heterogeneity, and (4.) EO-based transportability analysis. We conclude by providing a workflow for how researchers can incorporate EO data in causal ML analysis going forward."}, "https://arxiv.org/abs/2406.03341": {"title": "Tackling GenAI Copyright Issues: Originality Estimation and Genericization", "link": "https://arxiv.org/abs/2406.03341", "description": "arXiv:2406.03341v1 Announce Type: cross \nAbstract: The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. While some studies explore methods to mitigate copyright risks by steering the outputs of generative models away from those resembling copyrighted data, little attention has been paid to the question of how much of a resemblance is undesirable; more original or unique data are afforded stronger protection, and the threshold level of resemblance for constituting infringement correspondingly lower. Here, leveraging this principle, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to infringe copyright. To achieve this, we introduce a metric for quantifying the level of originality of data in a manner that is consistent with the legal framework. This metric can be practically estimated by drawing samples from a generative model, which is then used for the genericization process. Experiments demonstrate that our genericization method successfully modifies the output of a text-to-image generative model so that it produces more generic, copyright-compliant images."}, "https://arxiv.org/abs/2212.12874": {"title": "Test and Measure for Partial Mean Dependence Based on Machine Learning Methods", "link": "https://arxiv.org/abs/2212.12874", "description": "arXiv:2212.12874v2 Announce Type: replace \nAbstract: It is of importance to investigate the significance of a subset of covariates $W$ for the response $Y$ given covariates $Z$ in regression modeling. 
To this end, we propose a significance test for the partial mean independence problem based on machine learning methods and data splitting. The test statistic converges to the standard chi-squared distribution under the null hypothesis, while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability are also discussed. If the null hypothesis is rejected, we propose a partial Generalized Measure of Correlation (pGMC) to measure the partial mean dependence of $Y$ given $W$ after controlling for the nonlinear effect of $Z$. We present the appealing theoretical properties of the pGMC and establish the asymptotic normality of its estimator with the optimal root-$N$ convergence rate. Furthermore, a valid confidence interval for the pGMC is also derived. As an important special case when there are no conditional covariates $Z$, we introduce a new test of overall significance of covariates for the response in a model-free setting. Numerical studies and real data analysis are also conducted to compare with existing approaches and to demonstrate the validity and flexibility of our proposed procedures."}, "https://arxiv.org/abs/2303.07272": {"title": "Accounting for multiplicity in machine learning benchmark performance", "link": "https://arxiv.org/abs/2303.07272", "description": "arXiv:2303.07272v4 Announce Type: replace \nAbstract: Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows for multiple methods, oftentimes several thousands, to be evaluated under identical conditions and across time. The highest ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. The highest-ranked performance, used as an estimate of SOTA, is a biased estimator, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well-studied in the context of multiple comparisons and multiple testing, but has, as far as the authors are aware, been nearly absent from the discussion regarding SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss three real-world examples: Kaggle competitions that demonstrate various aspects."}, "https://arxiv.org/abs/2306.00382": {"title": "Calibrated and Conformal Propensity Scores for Causal Effect Estimation", "link": "https://arxiv.org/abs/2306.00382", "description": "arXiv:2306.00382v2 Announce Type: replace \nAbstract: Propensity scores are commonly used to estimate treatment effects from observational data. We argue that the probabilistic output of a learned propensity score model should be calibrated -- i.e., a predictive treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group -- and we propose simple recalibration techniques to ensure this property. 
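The "Calibrated and Conformal Propensity Scores" entry above argues that predicted treatment probabilities should be calibrated and proposes simple recalibration techniques. One standard recalibration device, chosen here for illustration and not necessarily matching the paper's exact procedure, is a monotone map from cross-fitted propensity predictions to treatment labels via isotonic regression; the data-generating process and model are toy choices.

```python
# Hedged sketch of propensity recalibration with isotonic regression (a standard
# technique; the paper's exact procedure may differ). Data and model are toy choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
true_ps = 1 / (1 + np.exp(-X[:, 0]))          # true propensity depends on X[:, 0]
T = rng.binomial(1, true_ps)

# Cross-fitted raw propensity predictions (flexible models are often miscalibrated).
raw = cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0),
                        X, T, cv=5, method="predict_proba")[:, 1]

# Monotone recalibration: map raw scores onto observed treatment frequencies.
iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
calibrated = iso.fit_transform(raw, T)

# Calibrated scores should typically track the true propensities more closely.
print("MAE raw vs true:", np.abs(raw - true_ps).mean())
print("MAE calibrated vs true:", np.abs(calibrated - true_ps).mean())
```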
We prove that calibration is a necessary condition for unbiased treatment effect estimation when using popular inverse propensity weighted and doubly robust estimators. We derive error bounds on causal effect estimates that directly relate to the quality of uncertainties provided by the probabilistic propensity score model and show that calibration strictly improves this error bound while also avoiding extreme propensity weights. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks including high-dimensional image covariates and genome-wide association studies (GWASs). Calibrated propensity scores improve the speed of GWAS analysis by more than two-fold by enabling the use of simpler models that are faster to train."}, "https://arxiv.org/abs/2306.11907": {"title": "Exact Inference for Random Effects Meta-Analyses with Small, Sparse Data", "link": "https://arxiv.org/abs/2306.11907", "description": "arXiv:2306.11907v2 Announce Type: replace \nAbstract: Meta-analysis aggregates information across related studies to provide more reliable statistical inference and has been a vital tool for assessing the safety and efficacy of many high profile pharmaceutical products. A key challenge in conducting a meta-analysis is that the number of related studies is typically small. Applying classical methods that are asymptotic in the number of studies can compromise the validity of inference, particularly when heterogeneity across studies is present. Moreover, serious adverse events are often rare and can result in one or more studies with no events in at least one study arm. Practitioners often apply arbitrary continuity corrections or remove zero-event studies to stabilize or define effect estimates in such settings, which can further invalidate subsequent inference. To address these significant practical issues, we introduce an exact inference method for comparing event rates in two treatment arms under a random effects framework, which we coin \"XRRmeta\". In contrast to existing methods, the coverage of the confidence interval from XRRmeta is guaranteed to be at or above the nominal level (up to Monte Carlo error) when the event rates, number of studies, and/or the within-study sample sizes are small. Extensive numerical studies indicate that XRRmeta does not yield overly conservative inference and we apply our proposed method to two real-data examples using our open source R package."}, "https://arxiv.org/abs/2307.04400": {"title": "ARK: Robust Knockoffs Inference with Coupling", "link": "https://arxiv.org/abs/2307.04400", "description": "arXiv:2307.04400v2 Announce Type: replace \nAbstract: We investigate the robustness of the model-X knockoffs framework with respect to the misspecified or estimated feature distribution. We achieve such a goal by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name as the approximate knockoffs (ARK) procedure, under the measures of the false discovery rate (FDR) and $k$-familywise error rate ($k$-FWER). The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that random variables in these two procedures can be close in realizations. 
We prove that if such coupled model-X knockoffs procedure exists, the approximate knockoffs procedure can achieve the asymptotic FDR or $k$-FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework. Additionally, we formally connect our concept of knockoff variable coupling to a type of Wasserstein distance."}, "https://arxiv.org/abs/2308.11805": {"title": "The Impact of Stocks on Correlations between Crop Yields and Prices and on Revenue Insurance Premiums using Semiparametric Quantile Regression", "link": "https://arxiv.org/abs/2308.11805", "description": "arXiv:2308.11805v2 Announce Type: replace-cross \nAbstract: Crop yields and harvest prices are often considered to be negatively correlated, thus acting as a natural risk management hedge through stabilizing revenues. Storage theory gives reason to believe that the correlation is an increasing function of stocks carried over from previous years. Stock-conditioned second moments have implications for price movements during shortages and for hedging needs, while spatially varying yield-price correlation structures have implications for who benefits from commodity support policies. In this paper, we propose to use semi-parametric quantile regression (SQR) with penalized B-splines to estimate a stock-conditioned joint distribution of yield and price. The proposed method, validated through a comprehensive simulation study, enables sampling from the true joint distribution using SQR. Then it is applied to approximate stock-conditioned correlation and revenue insurance premium for both corn and soybeans in the United States. For both crops, Cornbelt core regions have more negative correlations than do peripheral regions. We find strong evidence that correlation becomes less negative as stocks increase. We also show that conditioning on stocks is important when calculating actuarially fair revenue insurance premiums. In particular, revenue insurance premiums in the Cornbelt core will be biased upward if the model for calculating premiums does not allow correlation to vary with stocks available. The stock-dependent correlation can be viewed as a form of tail dependence that, if unacknowledged, leads to mispricing of revenue insurance products."}, "https://arxiv.org/abs/2401.00139": {"title": "Is Knowledge All Large Language Models Needed for Causal Reasoning?", "link": "https://arxiv.org/abs/2401.00139", "description": "arXiv:2401.00139v2 Announce Type: replace-cross \nAbstract: This paper explores the causal reasoning of large language models (LLMs) to enhance their interpretability and reliability in advancing artificial intelligence. Despite the proficiency of LLMs in a range of tasks, their potential for understanding causality requires further exploration. We propose a novel causal attribution model that utilizes ``do-operators\" for constructing counterfactual scenarios, allowing us to systematically quantify the influence of input numerical data and LLMs' pre-existing knowledge on their causal reasoning processes. Our newly developed experimental setup assesses LLMs' reliance on contextual information and inherent knowledge across various domains. Our evaluation reveals that LLMs' causal reasoning ability mainly depends on the context and domain-specific knowledge provided. 
In the absence of such knowledge, LLMs can still maintain a degree of causal reasoning using the available numerical data, albeit with limitations in the calculations. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively leveraging both knowledge and numerical information."}, "https://arxiv.org/abs/2406.03596": {"title": "A Multivariate Equivalence Test Based on Mahalanobis Distance with a Data-Driven Margin", "link": "https://arxiv.org/abs/2406.03596", "description": "arXiv:2406.03596v1 Announce Type: new \nAbstract: Multivariate equivalence testing is needed in a variety of scenarios for drug development. For example, drug products obtained from natural sources may contain many components for which the individual effects and/or their interactions on clinical efficacy and safety cannot be completely characterized. Such lack of sufficient characterization poses a challenge for both generic drug developers to demonstrate and regulatory authorities to determine the sameness of a proposed generic product to its reference product. Another case is to ensure batch-to-batch consistency of naturally derived products containing a vast number of components, such as botanical products. The equivalence or sameness between products containing many components that cannot be individually evaluated needs to be studied in a holistic manner. A multivariate equivalence test based on Mahalanobis distance may be suitable for evaluating many variables holistically. Existing studies based on this approach assumed either a predetermined constant margin, for which a consensus is difficult to achieve, or a margin derived from the data whose randomness is, however, ignored during testing. In this study, we propose a multivariate equivalence test based on Mahalanobis distance with a data-driven margin that accounts for the randomness in the margin. Several possible implementations are compared with existing approaches via extensive simulation studies."}, "https://arxiv.org/abs/2406.03681": {"title": "Multiscale Tests for Point Processes and Longitudinal Networks", "link": "https://arxiv.org/abs/2406.03681", "description": "arXiv:2406.03681v1 Announce Type: new \nAbstract: We propose a new testing framework applicable to both the two-sample problem on point processes and the community detection problem on rectangular arrays of point processes, which we refer to as longitudinal networks; the latter problem is useful in situations where we observe interactions among a group of individuals over time. Our framework is based on a multiscale discretization scheme that considers not just the global null but also a collection of nulls local to small regions in the domain; in the two-sample problem, the local rejections tell us where the intensity functions differ, and in the longitudinal network problem, the local rejections tell us when the community structure is most salient. We provide theoretical analysis for the two-sample problem and show that our method has minimax optimal power under a H\\"older continuity condition. 
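The Mahalanobis-distance equivalence-test entry above is built on the distance between a test product's and a reference product's mean vectors, scaled by a pooled covariance. The basic statistic can be sketched as below; the toy data and the pooling choice are mine, and the paper's data-driven margin and decision rule are not reproduced.

```python
# Sketch of the Mahalanobis distance between two multivariate samples, the building
# block of the equivalence test above. Pooled covariance and toy data are my choices;
# the data-driven margin and decision rule from the paper are not reproduced.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.multivariate_normal(mean=[0.0, 0.0, 0.0], cov=np.eye(3), size=60)
test = rng.multivariate_normal(mean=[0.2, 0.1, -0.1], cov=np.eye(3), size=60)

diff = test.mean(axis=0) - ref.mean(axis=0)
pooled_cov = ((len(ref) - 1) * np.cov(ref, rowvar=False) +
              (len(test) - 1) * np.cov(test, rowvar=False)) / (len(ref) + len(test) - 2)

d2 = diff @ np.linalg.solve(pooled_cov, diff)   # squared Mahalanobis distance
print("Mahalanobis distance:", np.sqrt(d2))
# Equivalence would be declared if the distance (or an upper confidence bound on it)
# falls below a margin; constructing that margin from the data is the paper's focus.
```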
We provide extensive simulation and real data analysis demonstrating the practicality of our proposed method."}, "https://arxiv.org/abs/2406.03861": {"title": "Small area estimation with generalized random forests: Estimating poverty rates in Mexico", "link": "https://arxiv.org/abs/2406.03861", "description": "arXiv:2406.03861v1 Announce Type: new \nAbstract: Identifying and addressing poverty is challenging in administrative units with limited information on income distribution and well-being. To overcome this obstacle, small area estimation methods have been developed to provide reliable and efficient estimators at disaggregated levels, enabling informed decision-making by policymakers despite the data scarcity. From a theoretical perspective, we propose a robust and flexible approach for estimating poverty indicators based on binary response variables within the small area estimation context: the generalized mixed effects random forest. Our method employs machine learning techniques to identify predictive, non-linear relationships from data, while also modeling hierarchical structures. Mean squared error estimation is explored using a parametric bootstrap. From an applied perspective, we examine the impact of information loss due to converting continuous variables into binary variables on the performance of small area estimation methods. We evaluate the proposed point and uncertainty estimates in both model- and design-based simulations. Finally, we apply our method to a case study revealing spatial patterns of poverty in the Mexican state of Tlaxcala."}, "https://arxiv.org/abs/2406.03900": {"title": "Enhanced variable selection for boosting sparser and less complex models in distributional copula regression", "link": "https://arxiv.org/abs/2406.03900", "description": "arXiv:2406.03900v1 Announce Type: new \nAbstract: Structured additive distributional copula regression allows to model the joint distribution of multivariate outcomes by relating all distribution parameters to covariates. Estimation via statistical boosting enables accounting for high-dimensional data and incorporating data-driven variable selection, both of which are useful given the complexity of the model class. However, as known from univariate (distributional) regression, the standard boosting algorithm tends to select too many variables with minor importance, particularly in settings with large sample sizes, leading to complex models with difficult interpretation. To counteract this behavior and to avoid selecting base-learners with only a negligible impact, we combined the ideas of probing, stability selection and a new deselection approach with statistical boosting for distributional copula regression. In a simulation study and an application to the joint modelling of weight and length of newborns, we found that all proposed methods enhance variable selection by reducing the number of false positives. However, only stability selection and the deselection approach yielded similar predictive performance to classical boosting. 
Finally, the deselection approach scales better to larger datasets and leads to competitive predictive performance, which we further illustrate in a genomic cohort study from the UK Biobank by modelling the joint genetic predisposition for two phenotypes."}, "https://arxiv.org/abs/2406.03971": {"title": "Comments on B", "link": "https://arxiv.org/abs/2406.03971", "description": "arXiv:2406.03971v1 Announce Type: new \nAbstract: In P\\\"otscher and Preinerstorfer (2022) and in the abridged version P\\\"otscher and Preinerstorfer (2024, published in Econometrica) we have tried to clear up the confusion introduced in Hansen (2022a) and in the earlier versions Hansen (2021a,b). Unfortunately, Hansen's (2024) reply to P\\\"otscher and Preinerstorfer (2024) further adds to the confusion. While we are already somewhat tired of the matter, for the sake of the econometrics community we feel compelled to provide clarification. We also add a comment on Portnoy (2023), a \"correction\" to Portnoy (2022), as well as on Lei and Wooldridge (2022)."}, "https://arxiv.org/abs/2406.04072": {"title": "Variational Prior Replacement in Bayesian Inference and Inversion", "link": "https://arxiv.org/abs/2406.04072", "description": "arXiv:2406.04072v1 Announce Type: new \nAbstract: Many scientific investigations require that the values of a set of model parameters be estimated using recorded data. In Bayesian inference, information from both observed data and prior knowledge is combined to update model parameters probabilistically. Prior information represents our belief about the range of values that the variables can take and their relative probabilities when considered independently of recorded data. Situations arise in which we wish to change prior information: (i) prior information is inherently subjective, (ii) we may wish to test different states of prior information as hypothesis tests, and (iii) information from new studies may emerge, so prior information may evolve over time. Estimating the solution to any single inference problem is usually computationally costly, as it typically requires thousands of model samples and their forward simulations. Therefore, recalculating the Bayesian solution every time prior information changes can be extremely expensive. We develop a mathematical formulation that allows prior information to be changed in a solution using variational methods, without performing Bayesian inference on each occasion. In this method, existing prior information is removed from a previously obtained posterior distribution and is replaced by new prior information. We therefore call the methodology variational prior replacement (VPR). We demonstrate VPR using a 2D seismic full waveform inversion example, where VPR provides almost identical posterior solutions compared to those obtained by solving independent inference problems using different priors. The former can be completed within minutes even on a laptop, whereas the latter requires days of computation using high-performance computing resources. 
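The variational prior replacement entry above describes removing old prior information from a posterior and replacing it with new prior information. The identity behind that operation follows from Bayes' rule: the new posterior is proportional to the old posterior times the ratio of new to old prior densities. A numerical check on a 1-D grid is sketched below with toy Gaussian choices; the paper's variational machinery, which is what makes this feasible in high dimensions, is not reproduced.

```python
# Numerical check (on a 1-D grid) of the identity behind prior replacement:
# posterior_new is proportional to posterior_old * prior_new / prior_old.
# The Gaussian likelihood and priors are toy choices, not the paper's model.
import numpy as np
from scipy.stats import norm

m = np.linspace(-5, 5, 2001)
likelihood = norm.pdf(1.2, loc=m, scale=1.0)       # one observation d = 1.2

prior_old = norm.pdf(m, loc=0.0, scale=2.0)
prior_new = norm.pdf(m, loc=-1.0, scale=0.5)

post_old = likelihood * prior_old
post_old /= post_old.sum()

# Replace the prior by reweighting the old posterior ...
post_replaced = post_old * prior_new / prior_old
post_replaced /= post_replaced.sum()

# ... and compare with a full re-run of Bayes' rule under the new prior.
post_direct = likelihood * prior_new
post_direct /= post_direct.sum()

print("max abs difference:", np.abs(post_replaced - post_direct).max())  # ~0
```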
We demonstrate the value of the method by comparing the posterior solutions obtained using three different types of prior information."}, "https://arxiv.org/abs/2406.04077": {"title": "Why recommended visit intervals should be extracted when conducting longitudinal analyses using electronic health record data: examining visit mechanism and sensitivity to assessment not at random", "link": "https://arxiv.org/abs/2406.04077", "description": "arXiv:2406.04077v1 Announce Type: new \nAbstract: Electronic health records (EHRs) provide an efficient approach to generating rich longitudinal datasets. However, since patients visit as needed, the assessment times are typically irregular and may be related to the patient's health. Failing to account for this informative assessment process could result in biased estimates of the disease course. In this paper, we show how estimation of the disease trajectory can be enhanced by leveraging an underutilized piece of information that is often in the patient's EHR: physician-recommended intervals between visits. Specifically, we demonstrate how recommended intervals can be used in characterizing the assessment process, and in investigating the sensitivity of the results to assessment not at random (ANAR). We illustrate our proposed approach in a clinic-based cohort study of juvenile dermatomyositis (JDM). In this study, we found that the recommended intervals explained 78% of the variability in the assessment times. Under a specific case of ANAR where we assumed that a worsening in disease led to patients visiting earlier than recommended, the estimated population average disease activity trajectory was shifted downward relative to the trajectory assuming assessment at random. These results demonstrate the crucial role recommended intervals play in improving the rigour of the analysis by allowing us to assess both the plausibility of the AAR assumption and the sensitivity of the results to departures from this assumption. Thus, we advise that studies using irregular longitudinal data should extract recommended visit intervals and follow our procedure for incorporating them into analyses."}, "https://arxiv.org/abs/2406.04085": {"title": "Copula-based models for correlated circular data", "link": "https://arxiv.org/abs/2406.04085", "description": "arXiv:2406.04085v1 Announce Type: new \nAbstract: We exploit Gaussian copulas to specify a class of multivariate circular distributions and obtain parametric models for the analysis of correlated circular data. This approach provides a straightforward extension of traditional multivariate normal models to the circular setting, without imposing restrictions on the marginal data distribution nor requiring overwhelming routines for parameter estimation. The proposal is illustrated on two case studies of animal orientation and sea currents, where we propose an autoregressive model for circular time series and a geostatistical model for circular spatial series."}, "https://arxiv.org/abs/2406.04133": {"title": "GLOBUS: Global building renovation potential by 2070", "link": "https://arxiv.org/abs/2406.04133", "description": "arXiv:2406.04133v1 Announce Type: new \nAbstract: Surpassing the two large emission sectors of transportation and industry, the building sector accounted for 34% and 37% of global energy consumption and carbon emissions in 2021, respectively. 
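The "Copula-based models for correlated circular data" entry above builds multivariate circular distributions from Gaussian copulas with circular marginals. A minimal generator in that spirit is sketched below, using von Mises marginals via the probability integral transform; the correlation matrix and marginal parameters are illustrative choices of mine, not taken from the paper.

```python
# Minimal generator in the spirit of the Gaussian-copula circular model above:
# latent correlated Gaussians -> uniforms -> von Mises marginals on (-pi, pi).
# The copula correlation and marginal parameters are illustrative, not the paper's.
import numpy as np
from scipy.stats import norm, vonmises

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])                      # copula correlation (toy choice)

z = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=5000)
u = norm.cdf(z)                                  # Gaussian copula uniforms
theta = vonmises.ppf(u, kappa=2.0, loc=0.0)      # von Mises margins

# Dependence between the two circular variables is induced by the latent correlation.
print(theta.shape, float(theta.min()), float(theta.max()))
```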
The building sector, the final piece to be addressed in the transition to net-zero carbon emissions, requires a comprehensive, multisectoral strategy for reducing emissions. Until now, the absence of data on global building floorspace has impeded the measurement of building carbon intensity (carbon emissions per floorspace) and the identification of ways to achieve carbon neutrality for buildings. For this study, we develop a global building stock model (GLOBUS) to fill that data gap. Our study's primary contribution lies in providing a dataset of global building stock turnover using scenarios that incorporate various levels of building renovation. By unifying the evaluation indicators, the dataset empowers building science researchers to perform comparative analyses based on floorspace. Specifically, the building stock dataset establishes a reference for measuring the carbon emission intensity and decarbonization intensity of buildings within different countries. Further, we emphasize the sufficiency of existing buildings by incorporating building renovation into the model. Renovation can minimize the need to expand the building stock, thereby bolstering decarbonization of the building sector."}, "https://arxiv.org/abs/2406.04150": {"title": "A novel robust meta-analysis model using the $t$ distribution for outlier accommodation and detection", "link": "https://arxiv.org/abs/2406.04150", "description": "arXiv:2406.04150v1 Announce Type: new \nAbstract: The random effects meta-analysis model is an important tool for integrating results from multiple independent studies. However, the standard model is based on the assumption of normal distributions for both random effects and within-study errors, making it susceptible to outlying studies. Although robust modeling using the $t$ distribution is an appealing idea, the existing work, which explores the use of the $t$ distribution only for random effects, involves complicated numerical integration and numerical optimization. In this paper, a novel robust meta-analysis model using the $t$ distribution is proposed ($t$Meta). The novelty is that the marginal distribution of the effect size in $t$Meta follows the $t$ distribution, enabling $t$Meta to simultaneously accommodate and detect outlying studies in a simple and adaptive manner. A simple and fast EM-type algorithm is developed for maximum likelihood estimation. Due to the mathematical tractability of the $t$ distribution, $t$Meta is free from numerical integration and allows for efficient optimization. Experiments on real data demonstrate that $t$Meta compares favorably with related competitors in situations involving mild outliers. Moreover, in the presence of gross outliers, while related competitors may fail, $t$Meta continues to perform consistently and robustly."}, "https://arxiv.org/abs/2406.04167": {"title": "Comparing estimators of discriminative performance of time-to-event models", "link": "https://arxiv.org/abs/2406.04167", "description": "arXiv:2406.04167v1 Announce Type: new \nAbstract: Predicting the timing and occurrence of events is a major focus of data science applications, especially in the context of biomedical research. Performance for models estimating these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. 
Many estimators for these quantities have been proposed, which can be broadly categorized as either semi-parametric estimators or non-parametric estimators. In this paper, we review various estimators' mathematical construction and compare the behavior of the two classes of estimators. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly over-optimistic out-of-sample estimation of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify here suggests this class of estimators may be inappropriate for use in model assessment and selection based on out-of-sample evaluation criteria. This is due to the semi-parametric estimators' bias in favor of models that are overfit when using out-of-sample prediction criteria (e.g., cross validation). Non-parametric estimators, which do not exhibit this behavior, are highly variable for local discrimination. We propose to address the high variability problem through penalized regression spline smoothing. The behavior of various estimators of time-dependent AUC and concordance is illustrated via a simulation study using two different mechanisms that produce over-optimistic out-of-sample estimates when using semi-parametric estimators. Estimators are further compared in a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014."}, "https://arxiv.org/abs/2406.04256": {"title": "Gradient Boosting for Hierarchical Data in Small Area Estimation", "link": "https://arxiv.org/abs/2406.04256", "description": "arXiv:2406.04256v1 Announce Type: new \nAbstract: This paper introduces Mixed Effect Gradient Boosting (MEGB), which combines the strengths of Gradient Boosting with Mixed Effects models to address complex, hierarchical data structures often encountered in statistical analysis. The methodological foundations, including a review of the Mixed Effects model and the Extreme Gradient Boosting method leading to the introduction of MEGB, are presented in detail. It highlights how MEGB can derive area-level mean estimations from unit-level data and calculate Mean Squared Error (MSE) estimates using a nonparametric bootstrap approach. The paper evaluates MEGB's performance through model-based and design-based simulation studies, comparing it against established estimators. The findings indicate that MEGB provides promising area mean estimations and may outperform existing small area estimators in various scenarios. The paper concludes with a discussion of future research directions, highlighting the possibility of extending MEGB's framework to accommodate different types of outcome variables or non-linear area-level indicators."}, "https://arxiv.org/abs/2406.03821": {"title": "Bayesian generalized method of moments applied to pseudo-observations in survival analysis", "link": "https://arxiv.org/abs/2406.03821", "description": "arXiv:2406.03821v1 Announce Type: cross \nAbstract: Bayesian inference for survival regression modeling offers numerous advantages, especially for decision-making and external data borrowing, but demands the specification of the baseline hazard function, which may be a challenging task. We propose an alternative approach that does not need the specification of this function. 
Our approach combines pseudo-observations to convert censored data into longitudinal data with the Generalized Methods of Moments (GMM) to estimate the parameters of interest from the survival function directly. GMM may be viewed as an extension of the Generalized Estimating Equation (GEE) currently used for frequentist pseudo-observations analysis and can be extended to the Bayesian framework using a pseudo-likelihood function. We assessed the behavior of the frequentist and Bayesian GMM in the new context of analyzing pseudo-observations. We compared their performances to the Cox, GEE, and Bayesian piecewise exponential models through a simulation study of two-arm randomized clinical trials. Frequentist and Bayesian GMM gave valid inferences with similar performances compared to the three benchmark methods, except for small sample sizes and high censoring rates. For illustration, three post-hoc efficacy analyses were performed on randomized clinical trials involving patients with Ewing Sarcoma, producing results similar to those of the benchmark methods. Through a simple application of estimating hazard ratios, these findings confirm the effectiveness of this new Bayesian approach based on pseudo-observations and the generalized method of moments. This offers new insights on using pseudo-observations for Bayesian survival analysis."}, "https://arxiv.org/abs/2406.03924": {"title": "Statistical Multicriteria Benchmarking via the GSD-Front", "link": "https://arxiv.org/abs/2406.03924", "description": "arXiv:2406.03924v1 Announce Type: cross \nAbstract: Given the vast number of classifiers that have been (and continue to be) proposed, reliable methods for comparing them are becoming increasingly important. The desire for reliability is broken down into three main aspects: (1) Comparisons should allow for different quality metrics simultaneously. (2) Comparisons should take into account the statistical uncertainty induced by the choice of benchmark suite. (3) The robustness of the comparisons under small deviations in the underlying assumptions should be verifiable. To address (1), we propose to compare classifiers using a generalized stochastic dominance ordering (GSD) and present the GSD-front as an information-efficient alternative to the classical Pareto-front. For (2), we propose a consistent statistical estimator for the GSD-front and construct a statistical test for whether a (potentially new) classifier lies in the GSD-front of a set of state-of-the-art classifiers. For (3), we relax our proposed test using techniques from robust statistics and imprecise probabilities. We illustrate our concepts on the benchmark suite PMLB and on the platform OpenML."}, "https://arxiv.org/abs/2406.04191": {"title": "Strong Approximations for Empirical Processes Indexed by Lipschitz Functions", "link": "https://arxiv.org/abs/2406.04191", "description": "arXiv:2406.04191v1 Announce Type: cross \nAbstract: This paper presents new uniform Gaussian strong approximations for empirical processes indexed by classes of functions based on $d$-variate random vectors ($d\\geq1$). First, a uniform Gaussian strong approximation is established for general empirical processes indexed by Lipschitz functions, encompassing and improving on all previous results in the literature. 
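The Bayesian GMM entry above feeds on pseudo-observations of the survival function, conventionally defined via the jackknife as n*S_hat(t) - (n-1)*S_hat_(-i)(t) with S_hat the Kaplan--Meier estimator. A small self-contained computation of such pseudo-observations at one time horizon is sketched below with toy data; the GMM/Bayesian estimation layer of the paper is not reproduced.

```python
# Sketch of jackknife pseudo-observations for S(t0), the input quantity the entry
# above feeds into GMM/Bayesian estimation (that layer is not reproduced here).
# Kaplan-Meier is implemented directly; times and censoring indicators are toy data.
import numpy as np

def km_surv(time, event, t0):
    """Kaplan-Meier estimate of S(t0)."""
    s = 1.0
    for t in np.unique(time[event == 1]):
        if t > t0:
            break
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk
    return s

rng = np.random.default_rng(0)
n = 200
latent = rng.exponential(2.0, n)                 # latent event times
censor = rng.exponential(3.0, n)                 # censoring times
time = np.minimum(latent, censor)
event = (latent <= censor).astype(int)

t0 = 1.5
s_full = km_surv(time, event, t0)
pseudo = np.array([n * s_full - (n - 1) * km_surv(np.delete(time, i), np.delete(event, i), t0)
                   for i in range(n)])

# Pseudo-observations behave like uncensored responses with mean close to S(t0).
print("KM S(t0):", round(s_full, 3), "mean pseudo-observation:", round(pseudo.mean(), 3))
```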
When specialized to the setting considered by Rio (1994), and certain constraints on the function class hold, our result improves the approximation rate $n^{-1/(2d)}$ to $n^{-1/\\max\\{d,2\\}}$, up to the same $\\operatorname{polylog} n$ term, where $n$ denotes the sample size. Remarkably, we establish a valid uniform Gaussian strong approximation at the optimal rate $n^{-1/2}\\log n$ for $d=2$, which was previously known to be valid only for univariate ($d=1$) empirical processes via the celebrated Hungarian construction (Koml\\'os et al., 1975). Second, a uniform Gaussian strong approximation is established for a class of multiplicative separable empirical processes indexed by Lipschitz functions, which addresses some outstanding problems in the literature (Chernozhukov et al., 2014, Section 3). In addition, two other uniform Gaussian strong approximation results are presented for settings where the function class takes the form of a sequence of Haar bases based on generalized quasi-uniform partitions. We demonstrate the improvements and usefulness of our new strong approximation results with several statistical applications to nonparametric density and regression estimation."}, "https://arxiv.org/abs/1903.05054": {"title": "Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions", "link": "https://arxiv.org/abs/1903.05054", "description": "arXiv:1903.05054v2 Announce Type: replace \nAbstract: Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets."}, "https://arxiv.org/abs/2206.06821": {"title": "DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models", "link": "https://arxiv.org/abs/2206.06821", "description": "arXiv:2206.06821v2 Announce Type: replace \nAbstract: We present DoWhy-GCM, an extension of the DoWhy Python library, which leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation, DoWhy-GCM addresses diverse causal queries, such as identifying the root causes of outliers and distributional changes, attributing causal influences to the data generating process of each node, or diagnosing causal structures. With DoWhy-GCM, users typically specify cause-effect relations via a causal graph, fit causal mechanisms, and pose causal queries -- all with just a few lines of code. The general documentation is available at https://www.pywhy.org/dowhy and the DoWhy-GCM specific code at https://github.com/py-why/dowhy/tree/main/dowhy/gcm."}, "https://arxiv.org/abs/2304.01273": {"title": "Heterogeneity-robust granular instruments", "link": "https://arxiv.org/abs/2304.01273", "description": "arXiv:2304.01273v3 Announce Type: replace \nAbstract: Granular instrumental variables (GIV) has experienced sharp growth in empirical macro-finance.
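The "few lines of code" workflow described in the DoWhy-GCM entry above typically looks like the sketch below. The graph, data, and column names are invented, and the exact calls should be checked against the documentation at https://www.pywhy.org/dowhy for the installed version.

```python
import networkx as nx
import numpy as np
import pandas as pd
from dowhy import gcm

# Toy data consistent with an assumed graph X -> Y -> Z (hypothetical columns).
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = 2.0 * X + rng.normal(size=1000)
Z = 3.0 * Y + rng.normal(size=1000)
data = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# 1) Specify cause-effect relations via a causal graph.
causal_model = gcm.StructuralCausalModel(nx.DiGraph([("X", "Y"), ("Y", "Z")]))

# 2) Fit causal mechanisms (auto-assigned here).
gcm.auto.assign_causal_mechanisms(causal_model, data)
gcm.fit(causal_model, data)

# 3) Pose a causal query, e.g. draw samples under an intervention on X.
#    (API names follow the DoWhy docs; verify against the installed version.)
samples = gcm.interventional_samples(
    causal_model, {"X": lambda x: 1.0}, num_samples_to_draw=500
)
print(samples[["Y", "Z"]].mean())
```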
The methodology's rise showcases granularity's potential for identification across many economic environments, like the estimation of spillovers and demand systems. I propose a new estimator--called robust granular instrumental variables (RGIV)--that enables studying unit-level heterogeneity in spillovers. Unlike existing methods that assume heterogeneity is a function of observables, RGIV leaves heterogeneity unrestricted. In contrast to the baseline GIV estimator, RGIV allows for unknown shock variances and equal-sized units. Applied to the Euro area, I find strong evidence of country-level heterogeneity in sovereign yield spillovers."}, "https://arxiv.org/abs/2304.14545": {"title": "Augmented balancing weights as linear regression", "link": "https://arxiv.org/abs/2304.14545", "description": "arXiv:2304.14545v3 Announce Type: replace \nAbstract: We provide a novel characterization of augmented balancing weights, also known as automatic debiased machine learning (AutoDML). These popular doubly robust or de-biased machine learning estimators combine outcome modeling with balancing weights - weights that achieve covariate balance directly in lieu of estimating and inverting the propensity score. When the outcome and weighting models are both linear in some (possibly infinite) basis, we show that the augmented estimator is equivalent to a single linear model with coefficients that combine the coefficients from the original outcome model and coefficients from an unpenalized ordinary least squares (OLS) fit on the same data. We see that, under certain choices of regularization parameters, the augmented estimator often collapses to the OLS estimator alone; this occurs for example in a re-analysis of the Lalonde 1986 dataset. We then extend these results to specific choices of outcome and weighting models. We first show that the augmented estimator that uses (kernel) ridge regression for both outcome and weighting models is equivalent to a single, undersmoothed (kernel) ridge regression. This holds numerically in finite samples and lays the groundwork for a novel analysis of undersmoothing and asymptotic rates of convergence. When the weighting model is instead lasso-penalized regression, we give closed-form expressions for special cases and demonstrate a ``double selection'' property. Our framework opens the black box on this increasingly popular class of estimators, bridges the gap between existing results on the semiparametric efficiency of undersmoothed and doubly robust estimators, and provides new insights into the performance of augmented balancing weights."}, "https://arxiv.org/abs/2406.04423": {"title": "Determining the Number of Communities in Sparse and Imbalanced Settings", "link": "https://arxiv.org/abs/2406.04423", "description": "arXiv:2406.04423v1 Announce Type: new \nAbstract: Community structures represent a crucial aspect of network analysis, and various methods have been developed to identify these communities. However, a common hurdle lies in determining the number of communities K, a parameter that often requires estimation in practice. Existing approaches for estimating K face two notable challenges: the weak community signal present in sparse networks and the imbalance in community sizes or edge densities that result in unequal per-community expected degree. We propose a spectral method based on a novel network operator whose spectral properties effectively overcome both challenges. 
This operator is a refined version of the non-backtracking operator, adapted from a \"centered\" adjacency matrix. Its leading eigenvalues are more concentrated than those of the adjacency matrix for sparse networks, while they also demonstrate enhanced signal under imbalance scenarios, a benefit attributed to the centering step. This is justified, either theoretically or numerically, under the null model K = 1, in both dense and ultra-sparse settings. A goodness-of-fit test based on the leading eigenvalue can be applied to determine the number of communities K."}, "https://arxiv.org/abs/2406.04448": {"title": "Bayesian Methods to Improve The Accuracy of Differentially Private Measurements of Constrained Parameters", "link": "https://arxiv.org/abs/2406.04448", "description": "arXiv:2406.04448v1 Announce Type: new \nAbstract: Formal disclosure avoidance techniques are necessary to ensure that published data can not be used to identify information about individuals. The addition of statistical noise to unpublished data can be implemented to achieve differential privacy, which provides a formal mathematical privacy guarantee. However, the infusion of noise results in data releases which are less precise than if no noise had been added, and can lead to some of the individual data points being nonsensical. Examples of this are estimates of population counts which are negative, or estimates of the ratio of counts which violate known constraints. A straightforward way to guarantee that published estimates satisfy these known constraints is to specify a statistical model and incorporate a prior on census counts and ratios which properly constrains the parameter space. We utilize rejection sampling methods for drawing samples from the posterior distribution and we show that this implementation produces estimates of population counts and ratios which maintain formal privacy, are more precise than the original unconstrained noisy measurements, and are guaranteed to satisfy prior constraints."}, "https://arxiv.org/abs/2406.04498": {"title": "Conformal Multi-Target Hyperrectangles", "link": "https://arxiv.org/abs/2406.04498", "description": "arXiv:2406.04498v1 Announce Type: new \nAbstract: We propose conformal hyperrectangular prediction regions for multi-target regression. We propose split conformal prediction algorithms for both point and quantile regression to form hyperrectangular prediction regions, which allow for easy marginal interpretation and do not require covariance estimation. In practice, it is preferable that a prediction region is balanced, that is, having identical marginal prediction coverage, since prediction accuracy is generally equally important across components of the response vector. The proposed algorithms possess two desirable properties, namely, tight asymptotic overall nominal coverage as well as asymptotic balance, that is, identical asymptotic marginal coverage, under mild conditions. We then compare our methods to some existing methods on both simulated and real data sets. 
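A minimal sketch of the constrained-posterior idea in the differential-privacy entry above, under simplifying assumptions: a single count is released with Laplace noise (the standard epsilon-DP mechanism for a count query), and rejection sampling against a hypothetical Poisson prior yields posterior draws that respect the non-negativity constraint. The prior and all constants are illustrative, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)

true_count = 3                       # small population count (unpublished)
epsilon = 1.0                        # privacy-loss budget
noisy = true_count + rng.laplace(scale=1.0 / epsilon)   # DP release; may be negative

def posterior_samples(noisy, epsilon, prior_mean=5.0, n_draws=5000):
    """Rejection sampler for the count given the noisy release, with a Poisson prior."""
    draws = []
    while len(draws) < n_draws:
        k = rng.poisson(prior_mean)                       # proposal from the prior (k >= 0)
        lik = 0.5 * epsilon * np.exp(-epsilon * abs(noisy - k))   # Laplace likelihood
        if rng.uniform() < lik / (0.5 * epsilon):         # bound: likelihood <= epsilon / 2
            draws.append(k)
    return np.array(draws)

post = posterior_samples(noisy, epsilon)
print(f"noisy release: {noisy:.2f}  posterior mean: {post.mean():.2f}")
```

The posterior mean is never negative and is typically closer to the true count than the raw noisy release, which is the gain the abstract describes.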
Our simulation results and real data analysis show that our methods outperform existing methods while achieving the desired nominal coverage and good balance between dimensions."}, "https://arxiv.org/abs/2406.04505": {"title": "Causal Inference in Randomized Trials with Partial Clustering and Imbalanced Dependence Structures", "link": "https://arxiv.org/abs/2406.04505", "description": "arXiv:2406.04505v1 Announce Type: new \nAbstract: In many randomized trials, participants are grouped into clusters, such as neighborhoods or schools, and these clusters are assumed to be the independent unit. This assumption, however, might not reflect the underlying dependence structure, with serious consequences for statistical power. First, consider a cluster randomized trial where participants are artificially grouped together for the purposes of randomization. For intervention participants the groups are the basis for intervention delivery, but for control participants the groups are dissolved. Second, consider an individually randomized group treatment trial where participants are randomized and then post-randomization, intervention participants are grouped together for intervention delivery, while the control participants continue with the standard of care. In both trial designs, outcomes among intervention participants will be dependent within each cluster, while outcomes for control participants will be effectively independent. We use causal models to non-parametrically describe the data generating process for each trial design and formalize the conditional independence in the observed data distribution. For estimation and inference, we propose a novel implementation of targeted minimum loss-based estimation (TMLE) accounting for partial clustering and the imbalanced dependence structure. TMLE is a model-robust approach, leverages covariate adjustment and machine learning to improve precision, and facilitates estimation of a large set of causal effects. In finite sample simulations, TMLE achieved comparable or markedly higher statistical power than common alternatives. Finally, application of TMLE to real data from the SEARCH-IPT trial resulted in 20-57\\% efficiency gains, demonstrating the real-world consequences of our proposed approach."}, "https://arxiv.org/abs/2406.04518": {"title": "A novel multivariate regression model for unbalanced binary data : a strong conjugacy under random effect approach", "link": "https://arxiv.org/abs/2406.04518", "description": "arXiv:2406.04518v1 Announce Type: new \nAbstract: In this paper, we derive a new multivariate regression model designed to fit correlated binary data. The multivariate distribution is derived from a Bernoulli mixed model with a nonnormal random intercept under the marginal approach. The random effect distribution is assumed to be the generalized log-gamma (GLG) distribution by considering a particular parameter setting. The complementary log-log link function is specified to lead to strong conjugacy between the response variable and the random effect. The new discrete multivariate distribution, named MBerGLG distribution, has location and dispersion parameters. The MBerGLG distribution leads to the MBerGLG regression (MBerGLGR) model, providing an alternative approach to fitting both unbalanced and balanced correlated response binary data. Monte Carlo simulation studies show that its maximum likelihood estimators are unbiased, efficient, and consistent asymptotically.
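A simplified sketch in the spirit of the multi-target conformal entry above: per-dimension half-widths are taken as quantiles of calibration residuals around a multi-output linear fit, producing a hyperrectangular region with easy marginal interpretation. The data, the Bonferroni-style split of the miscoverage level across coordinates, and the choice of model are illustrative, not the paper's balanced procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy multi-target regression data with 2 responses (purely illustrative).
n, alpha = 2000, 0.1
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(scale=[1.0, 2.0], size=(n, 2))

# Split into a proper training set and a calibration set.
X_tr, Y_tr = X[:1000], Y[:1000]
X_cal, Y_cal = X[1000:], Y[1000:]

model = LinearRegression().fit(X_tr, Y_tr)

# Per-dimension conformal half-widths from calibration residuals; splitting
# alpha equally across the coordinates is one crude way to aim for balance.
resid = np.abs(Y_cal - model.predict(X_cal))          # shape (n_cal, 2)
q_level = 1.0 - alpha / Y.shape[1]
half_width = np.quantile(resid, q_level, axis=0)

# Hyperrectangular prediction region for a new point.
x_new = rng.normal(size=(1, 3))
center = model.predict(x_new)[0]
lower, upper = center - half_width, center + half_width
print("rectangle:", list(zip(lower.round(2), upper.round(2))))
```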
Randomized quantile residuals are used to identify possible departures of the data from the proposed model and to detect atypical subjects. Finally, two applications are presented in the data analysis section."}, "https://arxiv.org/abs/2406.04599": {"title": "Imputation of Nonignorable Missing Data in Surveys Using Auxiliary Margins Via Hot Deck and Sequential Imputation", "link": "https://arxiv.org/abs/2406.04599", "description": "arXiv:2406.04599v1 Announce Type: new \nAbstract: Survey data collection is often plagued by unit and item nonresponse. To reduce reliance on strong assumptions about the missingness mechanisms, statisticians can use information about population marginal distributions known, for example, from censuses or administrative databases. One approach that does so is the Missing Data with Auxiliary Margins, or MD-AM, framework, which uses multiple imputation for both unit and item nonresponse so that survey-weighted estimates accord with the known marginal distributions. However, this framework relies on specifying and estimating a joint distribution for the survey data and nonresponse indicators, which can be computationally and practically daunting in data with many variables of mixed types. We propose two adaptations to the MD-AM framework to simplify the imputation task. First, rather than specifying a joint model for unit respondents' data, we use random hot deck imputation while still leveraging the known marginal distributions. Second, instead of sampling from conditional distributions implied by the joint model for the missing data due to item nonresponse, we apply multiple imputation by chained equations for item nonresponse before imputation for unit nonresponse. Using simulation studies with nonignorable missingness mechanisms, we demonstrate that the proposed approach can provide more accurate point and interval estimates than models that do not leverage the auxiliary information. We illustrate the approach using data on voter turnout from the U.S. Current Population Survey."}, "https://arxiv.org/abs/2406.04653": {"title": "Dynamical mixture modeling with fast, automatic determination of Markov chains", "link": "https://arxiv.org/abs/2406.04653", "description": "arXiv:2406.04653v1 Announce Type: new \nAbstract: Markov state modeling has gained popularity in various scientific fields due to its ability to reduce complex time series data into transitions between a few states. Yet, current frameworks are limited by assuming a single Markov chain describes the data, and they suffer from an inability to discern heterogeneities. As a solution, this paper proposes a variational expectation-maximization algorithm that identifies a mixture of Markov chains in a time-series data set. The method is agnostic to the definition of the Markov states, whether data-driven (e.g. by spectral clustering) or based on domain knowledge. Variational EM efficiently and organically identifies the number of Markov chains and the dynamics of each chain without expensive model comparisons or posterior sampling. The approach is supported by a theoretical analysis and numerical experiments, including simulated and observational data sets based on ${\\tt Last.fm}$ music listening, ultramarathon running, and gene expression.
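The random hot deck step mentioned in the imputation entry above can be illustrated with a small sketch: missing items are filled by drawing a donor value at random from observed respondents within the same class. The survey variables and classes are made up, and the sketch omits the auxiliary-margin and unit-nonresponse machinery of the MD-AM framework.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy survey with item nonresponse in `income_band` (categories are invented).
df = pd.DataFrame({
    "age_group":   rng.choice(["18-34", "35-64", "65+"], size=200),
    "income_band": rng.choice(["low", "mid", "high"], size=200),
})
df.loc[rng.uniform(size=200) < 0.25, "income_band"] = np.nan   # inject missingness

def random_hot_deck(data, target, by):
    """Fill missing values of `target` with a random donor from the same `by` class."""
    out = data.copy()
    for _, group in data.groupby(by):
        donors = group[target].dropna().to_numpy()
        missing_idx = group.index[group[target].isna()]
        if len(donors) and len(missing_idx):
            out.loc[missing_idx, target] = rng.choice(donors, size=len(missing_idx))
    return out

imputed = random_hot_deck(df, target="income_band", by="age_group")
print(imputed["income_band"].value_counts(normalize=True).round(2))
```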
The results show the new algorithm is competitive with contemporary mixture modeling approaches and powerful in identifying meaningful heterogeneities in time series data."}, "https://arxiv.org/abs/2406.04655": {"title": "Bayesian Inference for Spatial-temporal Non-Gaussian Data Using Predictive Stacking", "link": "https://arxiv.org/abs/2406.04655", "description": "arXiv:2406.04655v1 Announce Type: new \nAbstract: Analysing non-Gaussian spatial-temporal data typically requires introducing spatial dependence in generalised linear models through the link function of an exponential family distribution. However, unlike in Gaussian likelihoods, inference is considerably encumbered by the inability to analytically integrate out the random effects and reduce the dimension of the parameter space. Iterative estimation algorithms struggle to converge due to the presence of weakly identified parameters. We devise an approach that obviates these issues by exploiting generalised conjugate multivariate distribution theory for exponential families, which enables exact sampling from analytically available posterior distributions conditional upon some fixed process parameters. More specifically, we expand upon the Diaconis-Ylvisaker family of conjugate priors to achieve analytically tractable posterior inference for spatially-temporally varying regression models conditional on some kernel parameters. Subsequently, we assimilate inference from these individual posterior distributions over a range of values of these parameters using Bayesian predictive stacking. We evaluate inferential performance on simulated data, compare with fully Bayesian inference using Markov chain Monte Carlo and apply our proposed method to analyse spatially-temporally referenced avian count data from the North American Breeding Bird Survey database."}, "https://arxiv.org/abs/2406.04796": {"title": "Robust Inference of Dynamic Covariance Using Wishart Processes and Sequential Monte Carlo", "link": "https://arxiv.org/abs/2406.04796", "description": "arXiv:2406.04796v1 Announce Type: new \nAbstract: Several disciplines, such as econometrics, neuroscience, and computational psychology, study the dynamic interactions between variables over time. A Bayesian nonparametric model known as the Wishart process has been shown to be effective in this situation, but its inference remains highly challenging. In this work, we introduce a Sequential Monte Carlo (SMC) sampler for the Wishart process, and show how it compares to conventional inference approaches, namely MCMC and variational inference. Using simulations we show that SMC sampling results in the most robust estimates and out-of-sample predictions of dynamic covariance. SMC especially outperforms the alternative approaches when using composite covariance functions with correlated parameters. We demonstrate the practical applicability of our proposed approach on a dataset of clinical depression (n=1), and show how using an accurate representation of the posterior distribution can be used to test for dynamics on covariance"}, "https://arxiv.org/abs/2406.04849": {"title": "Dynamic prediction of death risk given a renewal hospitalization process", "link": "https://arxiv.org/abs/2406.04849", "description": "arXiv:2406.04849v1 Announce Type: new \nAbstract: Predicting the risk of death for chronic patients is highly valuable for informed medical decision-making. 
This paper proposes a general framework for dynamic prediction of the risk of death of a patient given her hospitalization history, which is generally available to physicians. Predictions are based on a joint model for the death and hospitalization processes, thereby avoiding the potential bias arising from selection of survivors. The framework accommodates various submodels for the hospitalization process. In particular, we study prediction of the risk of death in a renewal model for hospitalizations, a common approach to recurrent event modelling. In the renewal model, the distribution of hospitalizations throughout the follow-up period impacts the risk of death. This result differs from prediction in the Poisson model, previously studied, where only the number of hospitalizations matters. We apply our methodology to a prospective, observational cohort study of 401 patients treated for COPD in one of six outpatient respiratory clinics run by the Respiratory Service of Galdakao University Hospital, with a median follow-up of 4.16 years. We find that more concentrated hospitalizations increase the risk of death."}, "https://arxiv.org/abs/2406.04874": {"title": "Approximate Bayesian Computation with Deep Learning and Conformal prediction", "link": "https://arxiv.org/abs/2406.04874", "description": "arXiv:2406.04874v1 Announce Type: new \nAbstract: Approximate Bayesian Computation (ABC) methods are commonly used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Classical ABC methods are based on nearest neighbor type algorithms and rely on the choice of so-called summary statistics, distances between datasets and a tolerance threshold. Recently, methods combining ABC with more complex machine learning algorithms have been proposed to mitigate the impact of these \"user-choices\". In this paper, we propose the first, to our knowledge, ABC method completely free of summary statistics, distance and tolerance threshold. Moreover, in contrast with usual generalizations of the ABC method, it associates a confidence interval (having a proper frequentist marginal coverage) with the posterior mean estimation (or other moment-type estimates).\n Our method, ABCD-Conformal, uses a neural network with Monte Carlo Dropout to provide an estimation of the posterior mean (or other moment-type functionals), and conformal theory to obtain associated confidence sets. The method is efficient for estimating multidimensional parameters; we test it on three different applications and compare it with other ABC methods in the literature."}, "https://arxiv.org/abs/2406.04994": {"title": "Unguided structure learning of DAGs for count data", "link": "https://arxiv.org/abs/2406.04994", "description": "arXiv:2406.04994v1 Announce Type: new \nAbstract: Mainly motivated by the problem of modelling directional dependence relationships for multivariate count data in high-dimensional settings, we present a new algorithm, called learnDAG, for learning the structure of directed acyclic graphs (DAGs). In particular, the proposed algorithm tackles the problem of learning DAGs from observational data in two main steps: (i) estimation of candidate parent sets; and (ii) feature selection. We experimentally compare learnDAG to several popular competitors in recovering the true structure of the graphs in situations where relatively moderate sample sizes are available.
Furthermore, to strengthen the case for our algorithm, we validate it through the analysis of real datasets."}, "https://arxiv.org/abs/2406.05010": {"title": "Testing common invariant subspace of multilayer networks", "link": "https://arxiv.org/abs/2406.05010", "description": "arXiv:2406.05010v1 Announce Type: new \nAbstract: A graph (or network) is a mathematical structure that has been widely used to model relational data. As real-world systems get more complex, multilayer (or multiple) networks are employed to represent diverse patterns of relationships among the objects in the systems. One active research problem in multilayer network analysis is to study the common invariant subspace of the networks, because such a common invariant subspace could capture the fundamental structural patterns and interactions across all layers. Many methods have been proposed to estimate the common invariant subspace. However, whether real-world multilayer networks share the same common subspace remains unknown. In this paper, we first attempt to answer this question by means of hypothesis testing. The null hypothesis states that the multilayer networks share the same subspace, and under the alternative hypothesis, there exist at least two networks that do not have the same subspace. We propose a Weighted Degree Difference Test, derive the limiting distribution of the test statistic and provide an analytical analysis of its power. A simulation study shows that the proposed test has satisfactory performance, and a real data application is provided."}, "https://arxiv.org/abs/2406.05012": {"title": "TrendLSW: Trend and Spectral Estimation of Nonstationary Time Series in R", "link": "https://arxiv.org/abs/2406.05012", "description": "arXiv:2406.05012v1 Announce Type: new \nAbstract: The TrendLSW R package has been developed to provide users with a suite of wavelet-based techniques to analyse the statistical properties of nonstationary time series. The key components of the package are (a) two approaches for the estimation of the evolutionary wavelet spectrum in the presence of trend; (b) wavelet-based trend estimation in the presence of locally stationary wavelet errors via both linear and nonlinear wavelet thresholding; and (c) the calculation of associated pointwise confidence intervals. Lastly, the package directly implements boundary handling methods that enable the methods to be performed on data of arbitrary length, not just dyadic length as is common for wavelet-based methods, ensuring no pre-processing of data is necessary. The key functionality of the package is demonstrated through two data examples, arising from biology and activity monitoring."}, "https://arxiv.org/abs/2406.04915": {"title": "Bayesian inference of Latent Spectral Shapes", "link": "https://arxiv.org/abs/2406.04915", "description": "arXiv:2406.04915v1 Announce Type: cross \nAbstract: This paper proposes a hierarchical spatial-temporal model for modelling the spectrograms of animal calls. The motivation stems from analyzing recordings of the so-called grunt calls emitted by various lemur species. Our goal is to identify a latent spectral shape that characterizes each species and facilitates measuring dissimilarities between them. The model addresses the synchronization of animal vocalizations, due to varying time-lengths and speeds, with non-stationary temporal patterns and accounts for periodic sampling artifacts produced by the time discretization of analog signals.
The former is achieved through a synchronization function, and the latter is modeled using a circular representation of time. To overcome the curse of dimensionality inherent in the model's implementation, we employ the Nearest Neighbor Gaussian Process, and posterior samples are obtained using the Markov Chain Monte Carlo method. We apply the model to a real dataset comprising sounds from 8 different species. We define a representative sound for each species and compare them using a simple distance measure. Cross-validation is used to evaluate the predictive capability of our proposal and explore special cases. Additionally, a simulation example is provided to demonstrate that the algorithm is capable of retrieving the true parameters."}, "https://arxiv.org/abs/2105.07685": {"title": "Time-lag bias induced by unobserved heterogeneity: comparing treated patients to controls with a different start of follow-up", "link": "https://arxiv.org/abs/2105.07685", "description": "arXiv:2105.07685v2 Announce Type: replace \nAbstract: In comparative effectiveness research, treated and control patients might have a different start of follow-up as treatment is often started later in the disease trajectory. This typically occurs when data from treated and controls are not collected within the same source. Only patients who did not yet experience the event of interest whilst in the control condition end up in the treatment data source. In the case of unobserved heterogeneity, these treated patients will have a lower average risk than the controls. We illustrate how failing to account for this time-lag between treated and controls leads to bias in the estimated treatment effect. We define estimands and time axes, then explore five methods to adjust for this time-lag bias by utilising the time between diagnosis and treatment initiation in different ways. We conducted a simulation study to evaluate whether these methods reduce the bias and then applied the methods to a comparison between fertility patients treated with insemination and similar but untreated patients. We conclude that time-lag bias can be vast and that the time between diagnosis and treatment initiation should be taken into account in the analysis to respect the chronology of the disease and treatment trajectory."}, "https://arxiv.org/abs/2307.07342": {"title": "Bounded-memory adjusted scores estimation in generalized linear models with large data sets", "link": "https://arxiv.org/abs/2307.07342", "description": "arXiv:2307.07342v4 Announce Type: replace \nAbstract: The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression, in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood estimates are not, which is a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters and appear to deliver a substantial reduction in the persistent bias of the maximum likelihood estimator in high-dimensional settings where the number of parameters grows asymptotically as a proportion of the number of observations.
In this work, we develop and present two new variants of iteratively reweighted least squares for estimating generalized linear models with adjusted score equations for mean bias reduction and maximization of the likelihood penalized by a positive power of the Jeffreys-prior penalty, which eliminate the requirement of storing $O(n)$ quantities in memory, and can operate with data sets that exceed computer memory or even hard drive capacity. We achieve this through incremental QR decompositions, which enable IWLS iterations to have access only to data chunks of predetermined size. Both procedures can also be readily adapted to fit generalized linear models when distinct parts of the data are stored across different sites and, due to privacy concerns, cannot be fully transferred across sites. We assess the procedures through a real-data application with millions of observations."}, "https://arxiv.org/abs/2401.11352": {"title": "A Connection Between Covariate Adjustment and Stratified Randomization in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2401.11352", "description": "arXiv:2401.11352v2 Announce Type: replace \nAbstract: The statistical efficiency of randomized clinical trials can be improved by incorporating information from baseline covariates (i.e., pre-treatment patient characteristics). This can be done in the design stage using stratified (permuted block) randomization or in the analysis stage through covariate adjustment. This article makes a connection between covariate adjustment and stratified randomization in a general framework where all regular, asymptotically linear estimators are identified as augmented estimators. From a geometric perspective, covariate adjustment can be viewed as an attempt to approximate the optimal augmentation function, and stratified randomization improves a given approximation by moving it closer to the optimal augmentation function. The efficiency benefit of stratified randomization is asymptotically equivalent to attaching an optimal augmentation term based on the stratification factor. Under stratified randomization, adjusting for the stratification factor only in data analysis is not expected to improve efficiency, and the key to efficient estimation is incorporating new prognostic information from other covariates. In designing a trial with stratified randomization, it is not essential to include all important covariates in the stratification, because their prognostic information can be incorporated through covariate adjustment. These observations are confirmed in a simulation study and illustrated using real clinical trial data."}, "https://arxiv.org/abs/2406.05188": {"title": "Numerically robust square root implementations of statistical linear regression filters and smoothers", "link": "https://arxiv.org/abs/2406.05188", "description": "arXiv:2406.05188v1 Announce Type: new \nAbstract: In this article, square-root formulations of the statistical linear regression filter and smoother are developed. Crucially, the method uses QR decompositions rather than Cholesky downdates. This makes the method inherently more numerically robust than the downdate-based methods, which may fail in the face of rounding errors. This increased robustness is demonstrated in an ill-conditioned problem, where it is compared against a reference implementation in both double and single precision arithmetic.
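The incremental-QR idea in the bounded-memory GLM entry above can be sketched for ordinary least squares, the building block of an IWLS iteration (each IWLS step solves a weighted problem of exactly this form): only one data chunk plus a small triangular factor is ever held in memory. The chunk sizes and data below are illustrative, not the authors' implementation.

```python
import numpy as np

def chunked_ols(chunks):
    """Least squares fit that only ever holds one chunk plus a small
    (p+1) x (p+1) triangular factor in memory."""
    R = None
    for X_c, y_c in chunks:
        aug = np.column_stack([X_c, y_c])            # [X_c | y_c]
        stacked = aug if R is None else np.vstack([R, aug])
        R = np.linalg.qr(stacked, mode="r")          # keep only the triangular factor
    p = R.shape[1] - 1
    return np.linalg.solve(R[:p, :p], R[:p, p])      # R11 * beta = R12

# Check against a full-data fit on simulated chunks (toy sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=10_000)
chunks = [(X[i:i + 1000], y[i:i + 1000]) for i in range(0, 10_000, 1000)]
print(chunked_ols(chunks).round(3))
print(np.linalg.lstsq(X, y, rcond=None)[0].round(3))   # same coefficients
```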
The new implementation is found to be more robust, when implemented in lower precision arithmetic as compared to the alternative."}, "https://arxiv.org/abs/2406.05193": {"title": "Probabilistic Clustering using Shared Latent Variable Model for Assessing Alzheimers Disease Biomarkers", "link": "https://arxiv.org/abs/2406.05193", "description": "arXiv:2406.05193v1 Announce Type: new \nAbstract: The preclinical stage of many neurodegenerative diseases can span decades before symptoms become apparent. Understanding the sequence of preclinical biomarker changes provides a critical opportunity for early diagnosis and effective intervention prior to significant loss of patients' brain functions. The main challenge to early detection lies in the absence of direct observation of the disease state and the considerable variability in both biomarkers and disease dynamics among individuals. Recent research hypothesized the existence of subgroups with distinct biomarker patterns due to co-morbidities and degrees of brain resilience. Our ability to early diagnose and intervene during the preclinical stage of neurodegenerative diseases will be enhanced by further insights into heterogeneity in the biomarker-disease relationship. In this paper, we focus on Alzheimer's disease (AD) and attempt to identify the systematic patterns within the heterogeneous AD biomarker-disease cascade. Specifically, we quantify the disease progression using a dynamic latent variable whose mixture distribution represents patient subgroups. Model estimation uses Hamiltonian Monte Carlo with the number of clusters determined by the Bayesian Information Criterion (BIC). We report simulation studies that investigate the performance of the proposed model in finite sample settings that are similar to our motivating application. We apply the proposed model to the BIOCARD data, a longitudinal study that was conducted over two decades among individuals who were initially cognitively normal. Our application yields evidence consistent with the hypothetical model of biomarker dynamics presented in Jack et al. (2013). In addition, our analysis identified two subgroups with distinct disease-onset patterns. Finally, we develop a dynamic prediction approach to improve the precision of prognoses."}, "https://arxiv.org/abs/2406.05304": {"title": "Polytomous Explanatory Item Response Models for Item Discrimination: Assessing Negative-Framing Effects in Social-Emotional Learning Surveys", "link": "https://arxiv.org/abs/2406.05304", "description": "arXiv:2406.05304v1 Announce Type: new \nAbstract: Modeling item parameters as a function of item characteristics has a long history but has generally focused on models for item location. Explanatory item response models for item discrimination are available but rarely used. In this study, we extend existing approaches for modeling item discrimination from dichotomous to polytomous item responses. We illustrate our proposed approach with an application to four social-emotional learning surveys of preschool children to investigate how item discrimination depends on whether an item is positively or negatively framed. Negative framing predicts significantly lower item discrimination on two of the four surveys, and a plausibly causal estimate from a regression discontinuity analysis shows that negative framing reduces discrimination by about 30\\% on one survey. 
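Returning to the square-root filter and smoother entry just above (arXiv:2406.05188): the "QR instead of Cholesky downdates" idea can be illustrated with a generic square-root covariance prediction step. This is a textbook Kalman-style sketch under assumed linear dynamics, not the statistical linear regression formulation developed in that paper.

```python
import numpy as np

def sqrt_predict(S, F, L_Q):
    """Propagate a covariance square root S (P = S S^T) through x' = F x + w,
    cov(w) = L_Q L_Q^T, using a QR factorization instead of Cholesky downdates."""
    stacked = np.vstack([S.T @ F.T, L_Q.T])          # shape (2n, n)
    R = np.linalg.qr(stacked, mode="r")              # (n, n) upper triangular
    return R.T                                       # square root of F P F^T + Q

# Quick numerical check on a toy 2-state model.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
P = A @ A.T                                          # an SPD covariance
S = np.linalg.cholesky(P)
F = np.array([[1.0, 0.1], [0.0, 1.0]])
L_Q = np.linalg.cholesky(np.diag([0.01, 0.02]))

S_pred = sqrt_predict(S, F, L_Q)
print(np.allclose(S_pred @ S_pred.T, F @ P @ F.T + L_Q @ L_Q.T))   # True
```

Because the covariance is carried as a factor and rebuilt only as R^T R, the update cannot produce an indefinite matrix from rounding errors, which is the robustness advantage the abstract highlights.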
We conclude with a discussion of potential applications of explanatory models for item discrimination."}, "https://arxiv.org/abs/2406.05340": {"title": "Selecting the Number of Communities for Weighted Degree-Corrected Stochastic Block Models", "link": "https://arxiv.org/abs/2406.05340", "description": "arXiv:2406.05340v1 Announce Type: new \nAbstract: We investigate how to select the number of communities for weighted networks without full likelihood modeling. First, we propose a novel weighted degree-corrected stochastic block model (DCSBM), in which the mean adjacency matrix is modeled as in the standard DCSBM, while the variance profile matrix is assumed to be related to the mean adjacency matrix through a given variance function. Our method for selecting the number of communities is based on a sequential testing framework; in each step, the weighted DCSBM is fitted via a spectral clustering method. A key step is to carry out matrix scaling on the estimated variance profile matrix. The resulting scaling factors can be used to normalize the adjacency matrix, from which the testing statistic is obtained. Under mild conditions on the weighted DCSBM, our proposed procedure is shown to be consistent in estimating the true number of communities. Numerical experiments on both simulated and real network data also demonstrate the desirable empirical properties of our method."}, "https://arxiv.org/abs/2406.05548": {"title": "Causal Interpretation of Regressions With Ranks", "link": "https://arxiv.org/abs/2406.05548", "description": "arXiv:2406.05548v1 Announce Type: new \nAbstract: In studies of educational production functions or intergenerational mobility, it is common to transform the key variables into percentile ranks. Yet, it remains unclear what the regression coefficient estimates when ranks of the outcome or the treatment are used. In this paper, we derive effective causal estimands for a broad class of commonly-used regression methods, including ordinary least squares (OLS), two-stage least squares (2SLS), difference-in-differences (DiD), and regression discontinuity designs (RDD). Specifically, we introduce a novel primitive causal estimand, the Rank Average Treatment Effect (rank-ATE), and prove that it serves as the building block of the effective estimands of all the aforementioned econometric methods. For 2SLS, DiD, and RDD, we show that direct applications to outcome ranks identify parameters that are difficult to interpret. To address this issue, we develop alternative methods to identify more interpretable causal parameters."}, "https://arxiv.org/abs/2406.05592": {"title": "Constrained Design of a Binary Instrument in a Partially Linear Model", "link": "https://arxiv.org/abs/2406.05592", "description": "arXiv:2406.05592v1 Announce Type: new \nAbstract: We study the question of how best to assign an encouragement in a randomized encouragement study. In our setting, units arrive with covariates, receive a nudge toward treatment or control, acquire one of those statuses in a way that need not align with the nudge, and finally have a response observed. The nudge can be seen as a binary instrument that affects the response only via the treatment status. Our goal is to assign the nudge as a function of covariates in a way that best estimates the local average treatment effect (LATE). We assume a partially linear model, wherein the baseline model is non-parametric and the treatment term is linear in the covariates.
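To make concrete what "regression with ranks" computes in the rank-regression entry above, the sketch below forms percentile ranks and runs OLS of the outcome rank on the treatment rank (the familiar rank-rank slope). The data are simulated; the paper's contribution concerns how to interpret such coefficients causally, which this sketch does not address.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)

# Toy intergenerational-mobility-style data: child outcome depends on parent outcome.
n = 5000
parent = rng.normal(size=n)
child = 0.4 * parent + rng.normal(size=n)

# Transform both variables to percentile ranks in (0, 1], as is common practice.
parent_rank = rankdata(parent) / n
child_rank = rankdata(child) / n

# OLS of child rank on parent rank: the "rank-rank slope".
X = np.column_stack([np.ones(n), parent_rank])
slope = np.linalg.lstsq(X, child_rank, rcond=None)[0][1]
print(f"rank-rank slope: {slope:.3f}")
```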
Under this model, we outline a two-stage procedure to consistently estimate the LATE. Though the variance of the LATE is intractable, we derive a finite sample approximation and thus a design criterion to minimize. This criterion is convex, allowing for constraints that might arise for budgetary or ethical reasons. We prove conditions under which our solution asymptotically recovers the lowest true variance among all possible nudge propensities. We apply our method to a semi-synthetic example involving triage in an emergency department and find significant gains relative to a regression discontinuity design."}, "https://arxiv.org/abs/2406.05607": {"title": "HAL-based Plugin Estimation of the Causal Dose-Response Curve", "link": "https://arxiv.org/abs/2406.05607", "description": "arXiv:2406.05607v1 Announce Type: new \nAbstract: Estimating the marginally adjusted dose-response curve for continuous treatments is a longstanding statistical challenge critical across multiple fields. In the context of parametric models, mis-specification may result in substantial bias, hindering the accurate discernment of the true data generating distribution and the associated dose-response curve. In contrast, non-parametric models face difficulties as the dose-response curve isn't pathwise differentiable, and then there is no $\\sqrt{n}$-consistent estimator. The emergence of the Highly Adaptive Lasso (HAL) MLE by van der Laan [2015] and van der Laan [2017] and the subsequent theoretical evidence by van der Laan [2023] regarding its pointwise asymptotic normality and uniform convergence rates, have highlighted the asymptotic efficacy of the HAL-based plug-in estimator for this intricate problem. This paper delves into the HAL-based plug-in estimators, including those with cross-validation and undersmoothing selectors, and introduces the undersmoothed smoothness-adaptive HAL-based plug-in estimator. We assess these estimators through extensive simulations, employing detailed evaluation metrics. Building upon the theoretical proofs in van der Laan [2023], our empirical findings underscore the asymptotic effectiveness of the undersmoothed smoothness-adaptive HAL-based plug-in estimator in estimating the marginally adjusted dose-response curve."}, "https://arxiv.org/abs/2406.05805": {"title": "Toward identifiability of total effects in summary causal graphs with latent confounders: an extension of the front-door criterion", "link": "https://arxiv.org/abs/2406.05805", "description": "arXiv:2406.05805v1 Announce Type: new \nAbstract: Conducting experiments to estimate total effects can be challenging due to cost, ethical concerns, or practical limitations. As an alternative, researchers often rely on causal graphs to determine if it is possible to identify these effects from observational data. Identifying total effects in fully specified non-temporal causal graphs has garnered considerable attention, with Pearl's front-door criterion enabling the identification of total effects in the presence of latent confounding even when no variable set is sufficient for adjustment. However, specifying a complete causal graph is challenging in many domains. Extending these identifiability results to partially specified graphs is crucial, particularly in dynamic systems where causal relationships evolve over time. 
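For reference alongside the front-door discussion above: for a treatment $X$, mediator $M$, and outcome $Y$ in the classical, fully specified, non-temporal setting, Pearl's front-door adjustment identifies the total effect even when $X$ and $Y$ share a latent confounder, provided $M$ intercepts all directed paths from $X$ to $Y$ and is itself unconfounded with $X$, and with $Y$ given $X$:

$$
P\big(y \mid \mathrm{do}(x)\big) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\big(y \mid x', m\big)\, P(x').
$$

The summary-causal-graph paper above extends this style of identification to partially specified graphs that omit temporal lags and may contain cycles.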
This paper addresses the challenge of identifying total effects using a specific and well-known partially specified graph in dynamic systems called a summary causal graph, which does not specify the temporal lag between causal relations and can contain cycles. In particular, this paper presents sufficient graphical conditions for identifying total effects from observational data, even in the presence of hidden confounding and when no variable set is sufficient for adjustment, contributing to the ongoing effort to understand and estimate causal effects from observational data using summary causal graphs."}, "https://arxiv.org/abs/2406.05944": {"title": "Embedding Network Autoregression for time series analysis and causal peer effect inference", "link": "https://arxiv.org/abs/2406.05944", "description": "arXiv:2406.05944v1 Announce Type: new \nAbstract: We propose an Embedding Network Autoregressive Model (ENAR) for multivariate networked longitudinal data. We assume the network is generated from a latent variable model, and these unobserved variables are included in a structural peer effect model or a time series network autoregressive model as additive effects. This approach takes a unified view of two related problems, (1) modeling and predicting multivariate time series data and (2) causal peer influence estimation in the presence of homophily from finite time longitudinal data. Our estimation strategy comprises estimating latent factors from the observed network adjacency matrix either through spectral embedding or maximum likelihood estimation, followed by least squares estimation of the network autoregressive model. We show that the estimated momentum and peer effect parameters are consistent and asymptotically normal in asymptotic setups with a growing number of network vertices N while including a growing number of time points T and finite T cases. We allow the number of latent vectors K to grow at appropriate rates, which improves upon existing rates when such results are available for related models."}, "https://arxiv.org/abs/2406.05987": {"title": "Data-Driven Real-time Coupon Allocation in the Online Platform", "link": "https://arxiv.org/abs/2406.05987", "description": "arXiv:2406.05987v1 Announce Type: new \nAbstract: Traditionally, firms have offered coupons to customer groups at predetermined discount rates. However, advancements in machine learning and the availability of abundant customer data now enable platforms to provide real-time customized coupons to individuals. In this study, we partner with Meituan, a leading shopping platform, to develop a real-time, end-to-end coupon allocation system that is fast and effective in stimulating demand while adhering to marketing budgets when faced with uncertain traffic from a diverse customer base. Leveraging comprehensive customer and product features, we estimate Conversion Rates (CVR) under various coupon values and employ isotonic regression to ensure the monotonicity of predicted CVRs with respect to coupon value. Using calibrated CVR predictions as input, we propose a Lagrangian Dual-based algorithm that efficiently determines optimal coupon values for each arriving customer within 50 milliseconds. We theoretically and numerically investigate the model performance under parameter misspecifications and apply a control loop to adapt to real-time updated information, thereby better adhering to the marketing budget. 
Finally, we demonstrate through large-scale field experiments and observational data that our proposed coupon allocation algorithm outperforms traditional approaches in terms of both higher conversion rates and increased revenue. As of May 2024, Meituan has implemented our framework to distribute coupons to over 100 million users across more than 110 major cities in China, resulting in an additional CNY 8 million in annual profit. We demonstrate how to integrate a machine learning prediction model for estimating customer CVR, a Lagrangian Dual-based coupon value optimizer, and a control system to achieve real-time coupon delivery while dynamically adapting to random customer arrival patterns."}, "https://arxiv.org/abs/2406.06071": {"title": "Bayesian Parametric Methods for Deriving Distribution of Restricted Mean Survival Time", "link": "https://arxiv.org/abs/2406.06071", "description": "arXiv:2406.06071v1 Announce Type: new \nAbstract: We propose a Bayesian method for deriving the distribution of restricted mean survival time (RMST) using posterior samples, which accounts for covariates and heterogeneity among clusters based on a parametric model for survival time. We derive an explicit RMST equation from an integral of the survival function, allowing for the calculation of not only the mean and credible interval but also the mode, median, and probability of exceeding a certain value. Additionally, we propose two methods: one using random effects to account for heterogeneity among clusters and another utilizing frailty. We developed custom Stan code for the exponential, Weibull, log-normal frailty, and log-logistic models, as they cannot be processed using the brm functions in R. We evaluate our proposed methods through computer simulations and analyze real data from the eight Empowered Action Group states in India to confirm consistent results across states after adjusting for cluster differences. In conclusion, we derived explicit RMST formulas for parametric models and their distributions, enabling the calculation of the mean, median, mode, and credible interval. Our simulations confirmed the robustness of the proposed methods, and using the shrinkage effect allowed for more accurate results for each cluster."}, "https://arxiv.org/abs/2406.06426": {"title": "Biomarker-Guided Adaptive Enrichment Design with Threshold Detection for Clinical Trials with Time-to-Event Outcome", "link": "https://arxiv.org/abs/2406.06426", "description": "arXiv:2406.06426v1 Announce Type: new \nAbstract: Biomarker-guided designs are increasingly used to evaluate personalized treatments based on patients' biomarker status in Phase II and III clinical trials. With adaptive enrichment, these designs can improve the efficiency of evaluating the treatment effect in biomarker-positive patients by increasing their proportion in the randomized trial. While time-to-event outcomes are often used as the primary endpoint to measure treatment effects for a new therapy in severe diseases like cancer and cardiovascular diseases, there is limited research on biomarker-guided adaptive enrichment trials in this context. Such trials almost always adopt hazard ratio methods for statistical measurement of treatment effects.
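The RMST-as-an-integral relation used in the Bayesian RMST entry above is easy to illustrate numerically: RMST up to a horizon tau is the area under the survival curve, shown here for an exponential model against its closed form. In a Bayesian analysis this integral would be evaluated for each posterior draw of the parameters; the hazard and horizon below are illustrative choices.

```python
import numpy as np

# RMST(tau) is the area under the survival curve: the integral of S(t) from 0 to tau.
lam, tau = 0.2, 5.0                        # illustrative exponential hazard and horizon
t = np.linspace(0.0, tau, 2001)
S = np.exp(-lam * t)                       # exponential survival function

rmst_numeric = np.sum((S[1:] + S[:-1]) / 2 * np.diff(t))   # trapezoid rule
rmst_closed = (1.0 - np.exp(-lam * tau)) / lam             # closed form for this model
print(rmst_numeric, rmst_closed)           # both approximately 3.16
```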
In contrast, restricted mean survival time (RMST) has gained popularity for analyzing time-to-event outcomes because it offers more straightforward interpretations of treatment effects and does not require the proportional hazard assumption. This paper proposes a two-stage biomarker-guided adaptive RMST design with threshold detection and patient enrichment. We develop sophisticated methods for identifying the optimal biomarker threshold, treatment effect estimators in the biomarker-positive subgroup, and approaches for type I error rate, power analysis, and sample size calculation. We present a numerical example of re-designing an oncology trial. An extensive simulation study is conducted to evaluate the performance of the proposed design."}, "https://arxiv.org/abs/2406.06452": {"title": "Estimating Heterogeneous Treatment Effects by Combining Weak Instruments and Observational Data", "link": "https://arxiv.org/abs/2406.06452", "description": "arXiv:2406.06452v1 Announce Type: new \nAbstract: Accurately predicting conditional average treatment effects (CATEs) is crucial in personalized medicine and digital platform analytics. Since often the treatments of interest cannot be directly randomized, observational data is leveraged to learn CATEs, but this approach can incur significant bias from unobserved confounding. One strategy to overcome these limitations is to seek latent quasi-experiments in instrumental variables (IVs) for the treatment, for example, a randomized intent to treat or a randomized product recommendation. This approach, on the other hand, can suffer from low compliance, i.e., IV weakness. Some subgroups may even exhibit zero compliance meaning we cannot instrument for their CATEs at all. In this paper we develop a novel approach to combine IV and observational data to enable reliable CATE estimation in the presence of unobserved confounding in the observational data and low compliance in the IV data, including no compliance for some subgroups. We propose a two-stage framework that first learns biased CATEs from the observational data, and then applies a compliance-weighted correction using IV data, effectively leveraging IV strength variability across covariates. We characterize the convergence rates of our method and validate its effectiveness through a simulation study. Additionally, we demonstrate its utility with real data by analyzing the heterogeneous effects of 401(k) plan participation on wealth."}, "https://arxiv.org/abs/2406.06516": {"title": "Distribution-Free Predictive Inference under Unknown Temporal Drift", "link": "https://arxiv.org/abs/2406.06516", "description": "arXiv:2406.06516v1 Announce Type: new \nAbstract: Distribution-free prediction sets play a pivotal role in uncertainty quantification for complex statistical models. Their validity hinges on reliable calibration data, which may not be readily available as real-world environments often undergo unknown changes over time. In this paper, we propose a strategy for choosing an adaptive window and use the data therein to construct prediction sets. The window is selected by optimizing an estimated bias-variance tradeoff. We provide sharp coverage guarantees for our method, showing its adaptivity to the underlying temporal drift. 
We also illustrate its efficacy through numerical experiments on synthetic and real data."}, "https://arxiv.org/abs/2406.05242": {"title": "Markov chain Monte Carlo without evaluating the target: an auxiliary variable approach", "link": "https://arxiv.org/abs/2406.05242", "description": "arXiv:2406.05242v1 Announce Type: cross \nAbstract: In sampling tasks, it is common for target distributions to be known up to a normalizing constant. However, in many situations, evaluating even the unnormalized distribution can be costly or infeasible. This issue arises in scenarios such as sampling from the Bayesian posterior for tall datasets and the `doubly-intractable' distributions. In this paper, we begin by observing that seemingly different Markov chain Monte Carlo (MCMC) algorithms, such as the exchange algorithm, PoissonMH, and TunaMH, can be unified under a simple common procedure. We then extend this procedure into a novel framework that allows the use of auxiliary variables in both the proposal and acceptance-rejection steps. We develop the theory of the new framework, applying it to existing algorithms to simplify and extend their results. Several new algorithms emerge from this framework, with improved performance demonstrated on both synthetic and real datasets."}, "https://arxiv.org/abs/2406.05264": {"title": "\"Minus-One\" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity", "link": "https://arxiv.org/abs/2406.05264", "description": "arXiv:2406.05264v1 Announce Type: cross \nAbstract: We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that \"learns\" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected."}, "https://arxiv.org/abs/2406.05633": {"title": "Heterogeneous Treatment Effects in Panel Data", "link": "https://arxiv.org/abs/2406.05633", "description": "arXiv:2406.05633v1 Announce Type: cross \nAbstract: We address a core problem in causal inference: estimating heterogeneous treatment effects using panel data with general treatment patterns. Many existing methods either do not utilize the potential underlying structure in panel data or have limitations in the allowable treatment patterns. In this work, we propose and evaluate a new method that first partitions observations into disjoint clusters with similar treatment effects using a regression tree, and then leverages the (assumed) low-rank structure of the panel data to estimate the average treatment effect for each cluster. Our theoretical results establish the convergence of the resulting estimates to the true treatment effects. Computation experiments with semi-synthetic data show that our method achieves superior accuracy compared to alternative approaches, using a regression tree with no more than 40 leaves. 
Hence, our method provides more accurate and interpretable estimates than alternative methods."}, "https://arxiv.org/abs/2406.06014": {"title": "Network two-sample test for block models", "link": "https://arxiv.org/abs/2406.06014", "description": "arXiv:2406.06014v1 Announce Type: cross \nAbstract: We consider the two-sample testing problem for networks, where the goal is to determine whether two sets of networks originated from the same stochastic model. Assuming no vertex correspondence and allowing for different numbers of nodes, we address a fundamental network testing problem that goes beyond simple adjacency matrix comparisons. We adopt the stochastic block model (SBM) for network distributions, due to its interpretability and its potential to approximate more general models. The lack of meaningful node labels and vertex correspondence translates into a graph matching challenge when developing a test for SBMs. We introduce an efficient algorithm to match estimated network parameters, allowing us to properly combine and contrast information within and across samples, leading to a powerful test. We show that the matching algorithm and the overall test are consistent under mild conditions on the sparsity of the networks and the sample sizes, and derive a chi-squared asymptotic null distribution for the test. Through a mixture of theoretical insights and empirical validations, including experiments with both synthetic and real-world data, this study advances robust statistical inference for complex network data."}, "https://arxiv.org/abs/2406.06348": {"title": "Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning", "link": "https://arxiv.org/abs/2406.06348", "description": "arXiv:2406.06348v1 Announce Type: cross \nAbstract: The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way -- without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of directed acyclic graphs, to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap. In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees. We leverage the idea of a superstructure -- a set of learned or existing candidate hypotheses -- to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks with up to ${10^4}$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces."}, "https://arxiv.org/abs/2203.14223": {"title": "Identifying Peer Influence in Therapeutic Communities Adjusting for Latent Homophily", "link": "https://arxiv.org/abs/2203.14223", "description": "arXiv:2203.14223v4 Announce Type: replace \nAbstract: We investigate peer role model influence on successful graduation from Therapeutic Communities (TCs) for substance abuse and criminal behavior. 
We use data from 3 TCs that kept records of exchanges of affirmations among residents and their precise entry and exit dates, allowing us to form peer networks and define a causal effect of interest. The role model effect measures the difference in the expected outcome of a resident (ego) who can observe one of their peers graduate before the ego's exit vs not graduating. To identify peer influence in the presence of unobserved homophily in observational data, we model the network with a latent variable model. We show that our peer influence estimator is asymptotically unbiased when the unobserved latent positions are estimated from the observed network. We additionally propose a measurement error bias correction method to further reduce bias due to estimating latent positions. Our simulations show the proposed latent homophily adjustment and bias correction perform well in finite samples. We also extend the methodology to the case of binary response with a probit model. Our results indicate a positive effect of peers' graduation on residents' graduation and that it differs based on gender, race, and the definition of the role model effect. A counterfactual exercise quantifies the potential benefits of an intervention directly on the treated resident and indirectly on their peers through network propagation."}, "https://arxiv.org/abs/2212.08697": {"title": "Multi-Task Learning for Sparsity Pattern Heterogeneity: Statistical and Computational Perspectives", "link": "https://arxiv.org/abs/2212.08697", "description": "arXiv:2212.08697v2 Announce Type: replace \nAbstract: We consider a problem in Multi-Task Learning (MTL) where multiple linear models are jointly trained on a collection of datasets (\"tasks\"). A key novelty of our framework is that it allows the sparsity pattern of regression coefficients and the values of non-zero coefficients to differ across tasks while still leveraging partially shared structure. Our methods encourage models to share information across tasks through separately encouraging 1) coefficient supports, and/or 2) nonzero coefficient values to be similar. This allows models to borrow strength during variable selection even when non-zero coefficient values differ across tasks. We propose a novel mixed-integer programming formulation for our estimator. We develop custom scalable algorithms based on block coordinate descent and combinatorial local search to obtain high-quality (approximate) solutions for our estimator. Additionally, we propose a novel exact optimization algorithm to obtain globally optimal solutions. We investigate the theoretical properties of our estimators. We formally show how our estimators leverage the shared support information across tasks to achieve better variable selection performance. We evaluate the performance of our methods in simulations and two biomedical applications. Our proposed approaches appear to outperform other sparse MTL methods in variable selection and prediction accuracy. We provide the sMTL package on CRAN."}, "https://arxiv.org/abs/2212.09145": {"title": "Classification of multivariate functional data on different domains with Partial Least Squares approaches", "link": "https://arxiv.org/abs/2212.09145", "description": "arXiv:2212.09145v3 Announce Type: replace \nAbstract: Classification (supervised-learning) of multivariate functional data is considered when the elements of the random functional vector of interest are defined on different domains. 
In this setting, PLS classification and tree PLS-based methods for multivariate functional data are presented. From a computational point of view, we show that the PLS components of the regression with multivariate functional data can be obtained using only the PLS methodology with univariate functional data. This offers an alternative way to present the PLS algorithm for multivariate functional data."}, "https://arxiv.org/abs/2212.10024": {"title": "Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples", "link": "https://arxiv.org/abs/2212.10024", "description": "arXiv:2212.10024v2 Announce Type: replace \nAbstract: Data subsampling has become widely recognized as a tool to overcome computational and economic bottlenecks in analyzing massive datasets. We contribute to the development of adaptive design for estimation of finite population characteristics, using active learning and adaptive importance sampling. We propose an active sampling strategy that iterates between estimation and data collection with optimal subsamples, guided by machine learning predictions on yet unseen data. The method is illustrated on virtual simulation-based safety assessment of advanced driver assistance systems. Substantial performance improvements are demonstrated compared to traditional sampling methods."}, "https://arxiv.org/abs/2301.01660": {"title": "Projection predictive variable selection for discrete response families with finite support", "link": "https://arxiv.org/abs/2301.01660", "description": "arXiv:2301.01660v3 Announce Type: replace \nAbstract: The projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach achieving an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback-Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The cost of the slightly better performance of the augmented-data projection is a substantial increase in runtime. Thus, in such cases, we recommend the latent projection in the early phase of a model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection."}, "https://arxiv.org/abs/2303.06661": {"title": "Bayesian size-and-shape regression modelling", "link": "https://arxiv.org/abs/2303.06661", "description": "arXiv:2303.06661v3 Announce Type: replace \nAbstract: Building on Dryden et al. (2021), this note presents the Bayesian estimation of a regression model for size-and-shape response variables with Gaussian landmarks. 
Our proposal fits into the framework of Bayesian latent variable models and allows a highly flexible modelling framework."}, "https://arxiv.org/abs/2306.15286": {"title": "Multilayer random dot product graphs: Estimation and online change point detection", "link": "https://arxiv.org/abs/2306.15286", "description": "arXiv:2306.15286v4 Announce Type: replace \nAbstract: We study the multilayer random dot product graph (MRDPG) model, an extension of the random dot product graph to multilayer networks. To estimate the edge probabilities, we deploy a tensor-based methodology and demonstrate its superiority over existing approaches. Moving to dynamic MRDPGs, we formulate and analyse an online change point detection framework. At every time point, we observe a realization from an MRDPG. Across layers, we assume fixed shared common node sets and latent positions but allow for different connectivity matrices. We propose efficient tensor algorithms under both fixed and random latent position cases to minimize the detection delay while controlling false alarms. Notably, in the random latent position case, we devise a novel nonparametric change point detection algorithm based on density kernel estimation that is applicable to a wide range of scenarios, including stochastic block models as special cases. Our theoretical findings are supported by extensive numerical experiments, with the code available online https://github.com/MountLee/MRDPG."}, "https://arxiv.org/abs/2306.15537": {"title": "Sparse estimation in ordinary kriging for functional data", "link": "https://arxiv.org/abs/2306.15537", "description": "arXiv:2306.15537v2 Announce Type: replace \nAbstract: We introduce a sparse estimation in the ordinary kriging for functional data. The functional kriging predicts a feature given as a function at a location where the data are not observed by a linear combination of data observed at other locations. To estimate the weights of the linear combination, we apply the lasso-type regularization in minimizing the expected squared error. We derive an algorithm to derive the estimator using the augmented Lagrange method. Tuning parameters included in the estimation procedure are selected by cross-validation. Since the proposed method can shrink some of the weights of the linear combination toward zeros exactly, we can investigate which locations are necessary or unnecessary to predict the feature. Simulation and real data analysis show that the proposed method appropriately provides reasonable results."}, "https://arxiv.org/abs/2306.15607": {"title": "Assessing small area estimates via bootstrap-weighted k-Nearest-Neighbor artificial populations", "link": "https://arxiv.org/abs/2306.15607", "description": "arXiv:2306.15607v2 Announce Type: replace \nAbstract: Comparing and evaluating small area estimation (SAE) models for a given application is inherently difficult. Typically, many areas lack enough data to check unit-level modeling assumptions or to assess unit-level predictions empirically; and no ground truth is available for checking area-level estimates. Design-based simulation from artificial populations can help with each of these issues, but only if the artificial populations realistically represent the application at hand and are not built using assumptions that inherently favor one SAE model over another. 
In this paper, we borrow ideas from random hot deck, approximate Bayesian bootstrap (ABB), and k Nearest Neighbor (kNN) imputation methods to propose a kNN-based approximation to ABB (KBAABB), for generating an artificial population when rich unit-level auxiliary data is available. We introduce diagnostic checks on the process of building the artificial population, and we demonstrate how to use such an artificial population for design-based simulation studies to compare and evaluate SAE models, using real data from the Forest Inventory and Analysis (FIA) program of the US Forest Service."}, "https://arxiv.org/abs/2309.14621": {"title": "Confidence Intervals for the F1 Score: A Comparison of Four Methods", "link": "https://arxiv.org/abs/2309.14621", "description": "arXiv:2309.14621v2 Announce Type: replace \nAbstract: In Natural Language Processing (NLP), binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is an estimate of the population F1 score, it is not sufficient to report the sample F1 score without an indication of how accurate it is. Confidence intervals are an indication of how accurate the sample F1 score is. However, most studies either do not report them or report them using methods that demonstrate poor statistical properties. In the present study, I review current analytical methods (i.e., Clopper-Pearson method and Wald method) to construct confidence intervals for the population F1 score, propose two new analytical methods (i.e., Wilson direct method and Wilson indirect method) to do so, and compare these methods based on their coverage probabilities and interval lengths, as well as whether these methods suffer from overshoot and degeneracy. Theoretical results demonstrate that both proposed methods do not suffer from overshoot and degeneracy. Experimental results suggest that both proposed methods perform better, as compared to current methods, in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks. I discuss the practical implications of these results, and suggest areas for future research."}, "https://arxiv.org/abs/2311.12016": {"title": "Estimating Heterogeneous Exposure Effects in the Case-Crossover Design using BART", "link": "https://arxiv.org/abs/2311.12016", "description": "arXiv:2311.12016v2 Announce Type: replace \nAbstract: Epidemiological approaches for examining human health responses to environmental exposures in observational studies often control for confounding by implementing clever matching schemes and using statistical methods based on conditional likelihood. Nonparametric regression models have surged in popularity in recent years as a tool for estimating individual-level heterogeneous effects, which provide a more detailed picture of the exposure-response relationship but can also be aggregated to obtain improved marginal estimates at the population level. In this work we incorporate Bayesian additive regression trees (BART) into the conditional logistic regression model to identify heterogeneous exposure effects in a case-crossover design. Conditional logistic BART (CL-BART) utilizes reversible jump Markov chain Monte Carlo to bypass the conditional conjugacy requirement of the original BART algorithm. Our work is motivated by the growing interest in identifying subpopulations more vulnerable to environmental exposures. 
We apply CL-BART to a study of the impact of heat waves on people with Alzheimer's disease in California and effect modification by other chronic conditions. Through this application, we also describe strategies to examine heterogeneous odds ratios through variable importance, partial dependence, and lower-dimensional summaries."}, "https://arxiv.org/abs/2301.12537": {"title": "Non-Asymptotic State-Space Identification of Closed-Loop Stochastic Linear Systems using Instrumental Variables", "link": "https://arxiv.org/abs/2301.12537", "description": "arXiv:2301.12537v4 Announce Type: replace-cross \nAbstract: The paper suggests a generalization of the Sign-Perturbed Sums (SPS) finite sample system identification method for the identification of closed-loop observable stochastic linear systems in state-space form. The solution builds on the theory of matrix-variate regression and instrumental variable methods to construct distribution-free confidence regions for the state-space matrices. Both direct and indirect identification are studied, and the exactness as well as the strong consistency of the construction are proved. Furthermore, a new, computationally efficient ellipsoidal outer-approximation algorithm for the confidence regions is proposed. The new construction results in a semidefinite optimization problem which has an order-of-magnitude smaller number of constraints, as if one applied the ellipsoidal outer-approximation after vectorization. The effectiveness of the approach is also demonstrated empirically via a series of numerical experiments."}, "https://arxiv.org/abs/2308.06718": {"title": "Generalized Independent Noise Condition for Estimating Causal Structure with Latent Variables", "link": "https://arxiv.org/abs/2308.06718", "description": "arXiv:2308.06718v2 Announce Type: replace-cross \nAbstract: We investigate the task of learning causal structure in the presence of latent variables, including locating latent variables and determining their quantity, and identifying causal relationships among both latent and observed variables. To this end, we propose a Generalized Independent Noise (GIN) condition for linear non-Gaussian acyclic causal models that incorporate latent variables, which establishes the independence between a linear combination of certain measured variables and some other measured variables. Specifically, for two observed random vectors $\\bf{Y}$ and $\\bf{Z}$, GIN holds if and only if $\\omega^{\\intercal}\\mathbf{Y}$ and $\\mathbf{Z}$ are independent, where $\\omega$ is a non-zero parameter vector determined by the cross-covariance between $\\mathbf{Y}$ and $\\mathbf{Z}$. We then give necessary and sufficient graphical criteria of the GIN condition in linear non-Gaussian acyclic models. Roughly speaking, GIN implies the existence of a set $\\mathcal{S}$ such that $\\mathcal{S}$ is causally earlier (w.r.t. the causal ordering) than $\\mathbf{Y}$, and that every active (collider-free) path between $\\mathbf{Y}$ and $\\mathbf{Z}$ must contain a node from $\\mathcal{S}$. Interestingly, we find that the independent noise condition (i.e., if there is no confounder, causes are independent of the residual derived from regressing the effect on the causes) can be seen as a special case of GIN. 
With such a connection between GIN and latent causal structures, we further leverage the proposed GIN condition, together with a well-designed search procedure, to efficiently estimate Linear, Non-Gaussian Latent Hierarchical Models (LiNGLaHs), where latent confounders may also be causally related and may even follow a hierarchical structure. We show that the causal structure of a LiNGLaH is identifiable in light of GIN conditions. Experimental results show the effectiveness of the proposed method."}, "https://arxiv.org/abs/2310.00250": {"title": "The oracle property of the generalized outcome adaptive lasso", "link": "https://arxiv.org/abs/2310.00250", "description": "arXiv:2310.00250v2 Announce Type: replace-cross \nAbstract: The generalized outcome-adaptive lasso (GOAL) is a variable selection method for high-dimensional causal inference proposed by Bald\\'e et al. [2023, {\\em Biometrics} {\\bfseries 79(1)}, 514--520]. When the dimension is high, it is now well established that an ideal variable selection method should have the oracle property to ensure optimal large-sample performance. However, the oracle property of GOAL has not been proven. In this paper, we show that the GOAL estimator enjoys the oracle property. Our simulation shows that the GOAL method deals with the collinearity problem better than the oracle-like method, the outcome-adaptive lasso (OAL)."}, "https://arxiv.org/abs/2312.12844": {"title": "Effective Causal Discovery under Identifiable Heteroscedastic Noise Model", "link": "https://arxiv.org/abs/2312.12844", "description": "arXiv:2312.12844v2 Announce Type: replace-cross \nAbstract: Capturing the underlying structural causal relations represented by Directed Acyclic Graphs (DAGs) has been a fundamental task in various AI disciplines. Causal DAG learning via the continuous optimization framework has recently achieved promising performance in terms of both accuracy and efficiency. However, most methods make strong assumptions of homoscedastic noise, i.e., exogenous noises have equal variances across variables, observations, or even both. The noises in real data usually violate both assumptions due to the biases introduced by different data collection processes. To address the issue of heteroscedastic noise, we introduce relaxed and implementable sufficient conditions, proving the identifiability of a general class of SEMs subject to these conditions. Based on the identifiable general SEM, we propose a novel formulation for DAG learning that accounts for the variation in noise variance across variables and observations. We then propose an effective two-phase iterative DAG learning algorithm to address the increasing optimization difficulties and to learn a causal DAG from data with heteroscedastic variable noise under varying variance. We show significant empirical gains of the proposed approaches over state-of-the-art methods on both synthetic data and real data."}, "https://arxiv.org/abs/2406.06767": {"title": "ULV: A robust statistical method for clustered data, with applications to multisubject, single-cell omics data", "link": "https://arxiv.org/abs/2406.06767", "description": "arXiv:2406.06767v1 Announce Type: new \nAbstract: Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions. 
These breakthroughs have enabled us to achieve increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all the issues mentioned above simultaneously. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity."}, "https://arxiv.org/abs/2406.06768": {"title": "Data-Driven Switchback Experiments: Theoretical Tradeoffs and Empirical Bayes Designs", "link": "https://arxiv.org/abs/2406.06768", "description": "arXiv:2406.06768v1 Announce Type: new \nAbstract: We study the design and analysis of switchback experiments conducted on a single aggregate unit. The design problem is to partition the continuous time space into intervals and switch treatments between intervals, in order to minimize the estimation error of the treatment effect. We show that the estimation error depends on four factors: carryover effects, periodicity, serially correlated outcomes, and impacts from simultaneous experiments. We derive a rigorous bias-variance decomposition and show the tradeoffs of the estimation error from these factors. The decomposition provides three new insights in choosing a design: First, balancing the periodicity between treated and control intervals reduces the variance; second, switching less frequently reduces the bias from carryover effects while increasing the variance from correlated outcomes, and vice versa; third, randomizing interval start and end points reduces both bias and variance from simultaneous experiments. Combining these insights, we propose a new empirical Bayes design approach. This approach uses prior data and experiments for designing future experiments. We illustrate this approach using real data from a ride-sharing platform, yielding a design that reduces MSE by 33% compared to the status quo design used on the platform."}, "https://arxiv.org/abs/2406.06804": {"title": "Robustness to Missing Data: Breakdown Point Analysis", "link": "https://arxiv.org/abs/2406.06804", "description": "arXiv:2406.06804v1 Announce Type: new \nAbstract: Missing data is pervasive in econometric applications, and rarely is it plausible that the data are missing (completely) at random. 
This paper proposes a methodology for studying the robustness of results drawn from incomplete datasets. Selection is measured as the squared Hellinger divergence between the distributions of complete and incomplete observations, which has a natural interpretation. The breakdown point is defined as the minimal amount of selection needed to overturn a given result. Reporting point estimates and lower confidence intervals of the breakdown point is a simple, concise way to communicate the robustness of a result. An estimator of the breakdown point of a result drawn from a generalized method of moments model is proposed and shown to be root-n consistent and asymptotically normal under mild assumptions. Lower confidence intervals of the breakdown point are simple to construct. The paper concludes with a simulation study illustrating the finite sample performance of the estimators in several common models."}, "https://arxiv.org/abs/2406.06834": {"title": "Power Analysis for Experiments with Clustered Data, Ratio Metrics, and Regression for Covariate Adjustment", "link": "https://arxiv.org/abs/2406.06834", "description": "arXiv:2406.06834v1 Announce Type: new \nAbstract: We describe how to calculate standard errors for A/B tests that include clustered data, ratio metrics, and/or covariate adjustment. We may do this for power analysis/sample size calculations prior to running an experiment using historical data, or after an experiment for hypothesis testing and confidence intervals. The different applications have a common framework, using the sample variance of certain residuals. The framework is compatible with modular software, can be plugged into standard tools, doesn't require computing covariance matrices, and is numerically stable. Using this approach, we estimate that covariate adjustment gives a median 66% variance reduction for a key metric, reducing experiment run time by 66%."}, "https://arxiv.org/abs/2406.06851": {"title": "Unbiased Markov Chain Monte Carlo: what, why, and how", "link": "https://arxiv.org/abs/2406.06851", "description": "arXiv:2406.06851v1 Announce Type: new \nAbstract: This document presents methods to remove the initialization or burn-in bias from Markov chain Monte Carlo (MCMC) estimates, with consequences for parallel computing, convergence diagnostics and performance assessment. The document is written as an introduction to these methods for MCMC users. Some theoretical results are mentioned, but the focus is on the methodology."}, "https://arxiv.org/abs/2406.06860": {"title": "Cluster GARCH", "link": "https://arxiv.org/abs/2406.06860", "description": "arXiv:2406.06860v1 Announce Type: new \nAbstract: We introduce a novel multivariate GARCH model with flexible convolution-t distributions that is applicable in high-dimensional systems. The model is called Cluster GARCH because it can accommodate cluster structures in the conditional correlation matrix and in the tail dependencies. The expressions for the log-likelihood function and its derivatives are tractable, and the latter facilitate a score-driven model for the dynamic correlation structure. We apply the Cluster GARCH model to daily returns for 100 assets and find it outperforms existing models, both in-sample and out-of-sample. 
Moreover, the convolution-t distribution provides a better empirical performance than the conventional multivariate t-distribution."}, "https://arxiv.org/abs/2406.06924": {"title": "A Novel Nonlinear Nonparametric Correlation Measurement With A Case Study on Surface Roughness in Finish Turning", "link": "https://arxiv.org/abs/2406.06924", "description": "arXiv:2406.06924v1 Announce Type: new \nAbstract: Estimating the correlation coefficient has become a daunting task with the increasing complexity of dataset patterns. One of the problems in manufacturing applications consists of the estimation of a critical process variable during a machining operation from directly measurable process variables; for example, the prediction of surface roughness of a workpiece during finish turning processes. In this paper, we conduct an exhaustive study of the existing popular correlation coefficients: Pearson correlation coefficient, Spearman's rank correlation coefficient, Kendall's Tau correlation coefficient, Fechner correlation coefficient, and Nonlinear correlation coefficient. However, none of them can capture all linear and nonlinear correlations. We therefore present a universal non-linear non-parametric correlation measure, the g-correlation coefficient. Unlike other correlation measures, g-correlation does not require assumptions and picks the dominating pattern of the dataset after examining all the major patterns, whether linear or nonlinear. Tests on both linearly and non-linearly correlated datasets, and comparisons with the correlation coefficients introduced in the literature, show that g-correlation is robust on all linearly correlated datasets and outperforms the alternatives on some non-linearly correlated datasets. Results of the application of different correlation concepts to surface roughness assessment show that g-correlation has a central role among all standard concepts of correlation."}, "https://arxiv.org/abs/2406.06941": {"title": "Efficient combination of observational and experimental datasets under general restrictions on outcome mean functions", "link": "https://arxiv.org/abs/2406.06941", "description": "arXiv:2406.06941v1 Announce Type: new \nAbstract: A researcher collecting data from a randomized controlled trial (RCT) often has access to an auxiliary observational dataset that may be confounded or otherwise biased for estimating causal effects. Common modeling assumptions impose restrictions on the outcome mean function -- the conditional expectation of the outcome of interest given observed covariates -- in the two datasets. Running examples from the literature include settings where the observational dataset is subject to outcome-mediated selection bias or to confounding bias taking an assumed parametric form. We propose a succinct framework to derive the efficient influence function for any identifiable pathwise differentiable estimand under a general class of restrictions on the outcome mean function. This uncovers the surprising result that, with homoskedastic outcomes and a constant propensity score in the RCT, even strong parametric assumptions cannot improve the semiparametric lower bound for estimating various average treatment effects. We then leverage double machine learning to construct a one-step estimator that achieves the semiparametric efficiency bound even in cases when the outcome mean function and other nuisance parameters are estimated nonparametrically. 
The goal is to empower a researcher with custom, previously unstudied modeling restrictions on the outcome mean function to systematically construct causal estimators that maximally leverage their assumptions for variance reduction. We demonstrate the finite sample precision gains of our estimator over existing approaches in extensions of various numerical studies and data examples from the literature."}, "https://arxiv.org/abs/2406.06980": {"title": "Sensitivity Analysis for the Test-Negative Design", "link": "https://arxiv.org/abs/2406.06980", "description": "arXiv:2406.06980v1 Announce Type: new \nAbstract: The test-negative design has become popular for evaluating the effectiveness of post-licensure vaccines using observational data. In addition to its logistical convenience in data collection, the design is also believed to control for the differential health-care-seeking behavior between vaccinated and unvaccinated individuals, which is an important but often unmeasured confounder between vaccination and infection. Hence, the design has been employed routinely to monitor seasonal flu vaccines and more recently to measure COVID-19 vaccine effectiveness. Despite its popularity, the design has been questioned, in particular about its ability to fully control for the unmeasured confounding. In this paper, we explore deviations from a perfect test-negative design, and propose various sensitivity analysis methods for estimating the effect of vaccination measured by the causal odds ratio on the subpopulation of individuals with good health-care-seeking behavior. We start with point identification of the causal odds ratio under a test-negative design, considering two forms of assumptions on the unmeasured confounder. These assumptions then lead to two approaches for conducting sensitivity analysis, addressing the influence of the unmeasured confounding in different ways. Specifically, one approach investigates partial control for the unmeasured confounder in the test-negative design, while the other examines the impact of the unmeasured confounder on both vaccination and infection. Furthermore, these approaches can be combined to provide narrower bounds on the true causal odds ratio, and can be further extended to sharpen the bounds by restricting the treatment effect heterogeneity. Finally, we apply the proposed methods to evaluate the effectiveness of COVID-19 vaccines using observational data from test-negative designs."}, "https://arxiv.org/abs/2406.07449": {"title": "Boosted Conformal Prediction Intervals", "link": "https://arxiv.org/abs/2406.07449", "description": "arXiv:2406.07449v1 Announce Type: new \nAbstract: This paper introduces a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length. We employ machine learning techniques, notably gradient boosting, to systematically improve upon a predefined conformity score function. This process is guided by carefully constructed loss functions that measure the deviation of prediction intervals from the targeted properties. The procedure operates post-training, relying solely on model predictions and without modifying the trained model (e.g., the deep network). 
Systematic experiments demonstrate that, starting from conventional conformal methods, our boosted procedure achieves substantial improvements in reducing interval length and decreasing deviation from target conditional coverage."}, "https://arxiv.org/abs/2406.06654": {"title": "Training and Validating a Treatment Recommender with Partial Verification Evidence", "link": "https://arxiv.org/abs/2406.06654", "description": "arXiv:2406.06654v1 Announce Type: cross \nAbstract: Current clinical decision support systems (DSS) are trained and validated on observational data from the target clinic. This is problematic for treatments validated in a randomized clinical trial (RCT), but not yet introduced in any clinic. In this work, we report on a method for training and validating the DSS using the RCT data. The key challenges we address are those of missingness -- missing rationale for treatment assignment (the assignment is at random), and missing verification evidence, since the effectiveness of a treatment for a patient can only be verified (ground truth) for treatments that were actually assigned to a patient. We use data from a multi-armed RCT that investigated the effectiveness of single- and combination-treatments for 240+ tinnitus patients recruited and treated in 5 clinical centers.\n To deal with the 'missing rationale' challenge, we re-model the target variable (outcome) in order to suppress the effect of the randomly-assigned treatment, and control for the effect of treatment in general. Our methods are also robust to missing values in features and with a small number of patients per RCT arm. We deal with 'missing verification evidence' by using counterfactual treatment verification, which compares the effectiveness of the DSS recommendations to the effectiveness of the RCT assignments when they are aligned versus not aligned.\n We demonstrate that our approach leverages the RCT data for learning and verification, by showing that the DSS suggests treatments that improve the outcome. The results are limited by the small number of patients per treatment; while our ensemble is designed to mitigate this effect, the predictive performance of the methods is affected by the smallness of the data. We provide a basis for the establishment of decision support routines on treatments that have been tested in RCTs but have not yet been deployed clinically."}, "https://arxiv.org/abs/2406.06671": {"title": "Controlling Counterfactual Harm in Decision Support Systems Based on Prediction Sets", "link": "https://arxiv.org/abs/2406.06671", "description": "arXiv:2406.06671v1 Announce Type: cross \nAbstract: Decision support systems based on prediction sets help humans solve multiclass classification tasks by narrowing down the set of potential label values to a subset of them, namely a prediction set, and asking them to always predict label values from the prediction sets. While this type of system has been proven to be effective at improving the average accuracy of the predictions made by humans, by restricting human agency, it may cause harm -- a human who has succeeded at predicting the ground-truth label of an instance on their own may have failed had they used these systems. In this paper, our goal is to control how frequently a decision support system based on prediction sets may cause harm, by design. To this end, we start by characterizing the above notion of harm using the theoretical framework of structural causal models. 
Then, we show that, under a natural, albeit unverifiable, monotonicity assumption, we can estimate how frequently a system may cause harm using only predictions made by humans on their own. Further, we also show that, under a weaker monotonicity assumption, which can be verified experimentally, we can bound how frequently a system may cause harm again using only predictions made by humans on their own. Building upon these assumptions, we introduce a computational framework to design decision support systems based on prediction sets that are guaranteed to cause harm less frequently than a user-specified value using conformal risk control. We validate our framework using real human predictions from two different human subject studies and show that, in decision support systems based on prediction sets, there is a trade-off between accuracy and counterfactual harm."}, "https://arxiv.org/abs/2406.06868": {"title": "Causality for Complex Continuous-time Functional Longitudinal Studies with Dynamic Treatment Regimes", "link": "https://arxiv.org/abs/2406.06868", "description": "arXiv:2406.06868v1 Announce Type: cross \nAbstract: Causal inference in longitudinal studies is often hampered by treatment-confounder feedback. Existing methods typically assume discrete time steps or step-like data changes, which we term ``regular and irregular functional studies,'' limiting their applicability to studies with continuous monitoring data, like intensive care units or continuous glucose monitoring. These studies, which we formally term ``functional longitudinal studies,'' require new approaches. Moreover, existing methods tailored for ``functional longitudinal studies'' can only investigate static treatment regimes, which are independent of historical covariates or treatments, leading to either stringent parametric assumptions or strong positivity assumptions. This restriction has limited the range of causal questions these methods can answer and their practicality. We address these limitations by developing a nonparametric framework for functional longitudinal data, accommodating dynamic treatment regimes that depend on historical covariates or treatments, and may or may not depend on the actual treatment administered. To build intuition and explain our approach, we provide a comprehensive review of existing methods for regular and irregular longitudinal studies. We then formally define the potential outcomes and causal effects of interest, develop identification assumptions, and derive g-computation and inverse probability weighting formulas through novel applications of stochastic process and measure theory. Additionally, we compute the efficient influence curve using semiparametric theory. Our framework generalizes existing literature, and achieves double robustness under specific conditions. Finally, to aid interpretation, we provide sufficient and intuitive conditions for our identification assumptions, enhancing the applicability of our methodology to real-world scenarios."}, "https://arxiv.org/abs/2406.07075": {"title": "New density/likelihood representations for Gibbs models based on generating functionals of point processes", "link": "https://arxiv.org/abs/2406.07075", "description": "arXiv:2406.07075v1 Announce Type: cross \nAbstract: Deriving exact density functions for Gibbs point processes has been challenging due to their general intractability, stemming from the intractability of their normalising constants/partition functions. 
This paper offers a solution to this open problem by exploiting a recent alternative representation of point process densities. Here, for a finite point process, the density is expressed as the void probability multiplied by a higher-order Papangelou conditional intensity function. By leveraging recent results on dependent thinnings, exact expressions for generating functionals and void probabilities of locally stable point processes are derived. Consequently, exact expressions for density/likelihood functions, partition functions and posterior densities are also obtained. The paper finally extends the results to locally stable Gibbsian random fields on lattices by representing them as point processes."}, "https://arxiv.org/abs/2406.07121": {"title": "The Treatment of Ties in Rank-Biased Overlap", "link": "https://arxiv.org/abs/2406.07121", "description": "arXiv:2406.07121v1 Announce Type: cross \nAbstract: Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines by comparing the rankings of documents they retrieve for the same queries. In these situations, though, it is very frequent to find tied documents that have the same score. Unfortunately, the treatment of ties in RBO remains superficial and incomplete, in the sense that it is not clear how to calculate it from the ranking prefixes only. In addition, the existing way of dealing with ties is very different from the one traditionally followed in the field of Statistics, most notably found in rank correlation coefficients such as Kendall's and Spearman's. In this paper we propose a generalized formulation for RBO to handle ties, thanks to which we complete the original definitions by showing how to perform prefix evaluation. We also use it to fully develop two variants that align with the ones found in the Statistics literature: one when there is a reference ranking to compare to, and one when there is not. Overall, these three variants provide researchers with flexibility when comparing rankings with RBO, by clearly determining what ties mean, and how they should be treated. Finally, using both synthetic and TREC data, we demonstrate the use of these new tie-aware RBO measures. We show that the scores may differ substantially from the original tie-unaware RBO measure, where ties had to be broken at random or by arbitrary criteria such as by document ID. Overall, these results evidence the need for a proper account of ties in rank similarity measures such as RBO."}, "https://arxiv.org/abs/2106.15675": {"title": "Estimating Gaussian mixtures using sparse polynomial moment systems", "link": "https://arxiv.org/abs/2106.15675", "description": "arXiv:2106.15675v3 Announce Type: replace \nAbstract: The method of moments is a classical statistical technique for density estimation that solves a system of moment equations to estimate the parameters of an unknown distribution. A fundamental question critical to understanding identifiability asks how many moment equations are needed to get finitely many solutions and how many solutions there are. We answer this question for classes of Gaussian mixture models using the tools of polyhedral geometry. In addition, we show that a generic Gaussian $k$-mixture model is identifiable from its first $3k+2$ moments. 
Using these results, we present a homotopy algorithm that performs parameter recovery for high dimensional Gaussian mixture models where the number of paths tracked scales linearly in the dimension."}, "https://arxiv.org/abs/2109.11990": {"title": "Optimization-based Causal Estimation from Heterogenous Environments", "link": "https://arxiv.org/abs/2109.11990", "description": "arXiv:2109.11990v3 Announce Type: replace \nAbstract: This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments-and ones that exhibit sufficient heterogeneity-CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model and more accurate predictions under interventions."}, "https://arxiv.org/abs/2211.16921": {"title": "Binary De Bruijn Processes", "link": "https://arxiv.org/abs/2211.16921", "description": "arXiv:2211.16921v2 Announce Type: replace \nAbstract: Binary time series data are very common in many applications, and are typically modelled independently via a Bernoulli process with a single probability of success. However, the probability of a success can be dependent on the outcome successes of past events. Presented here is a novel approach for modelling binary time series data called a binary de Bruijn process which takes into account temporal correlation. The structure is derived from de Bruijn Graphs - a directed graph, where given a set of symbols, V, and a 'word' length, m, the nodes of the graph consist of all possible sequences of V of length m. De Bruijn Graphs are equivalent to mth order Markov chains, where the 'word' length controls the number of states that each individual state is dependent on. This increases correlation over a wider area. To quantify how clustered a sequence generated from a de Bruijn process is, the run lengths of letters are observed along with run length properties. Inference is also presented along with two application examples: precipitation data and the Oxford and Cambridge boat race."}, "https://arxiv.org/abs/2303.00982": {"title": "Aggregated Intersection Bounds and Aggregated Minimax Values", "link": "https://arxiv.org/abs/2303.00982", "description": "arXiv:2303.00982v2 Announce Type: replace \nAbstract: This paper proposes a novel framework of aggregated intersection bounds, where the target parameter is obtained by averaging the minimum (or maximum) of a collection of regression functions over the covariate space. 
Examples of such quantities include the lower and upper bounds on distributional effects (Fr\\'echet-Hoeffding, Makarov) as well as the optimal welfare in statistical treatment choice problems. The proposed estimator -- the envelope score estimator -- is shown to have an oracle property, where the oracle knows the identity of the minimizer for each covariate value. Next, the result is extended to the aggregated minimax values of a collection of regression functions, covering optimal distributional welfare in worst-case and best-case, respectively. This proposed estimator -- the envelope saddle value estimator -- is shown to have an oracle property, where the oracle knows the identity of the saddle point."}, "https://arxiv.org/abs/2406.07564": {"title": "Optimizing Sales Forecasts through Automated Integration of Market Indicators", "link": "https://arxiv.org/abs/2406.07564", "description": "arXiv:2406.07564v1 Announce Type: new \nAbstract: Recognizing that traditional forecasting models often rely solely on historical demand, this work investigates the potential of data-driven techniques to automatically select and integrate market indicators for improving customer demand predictions. By adopting an exploratory methodology, we integrate macroeconomic time series, such as national GDP growth, from the \\textit{Eurostat} database into \\textit{Neural Prophet} and \\textit{SARIMAX} forecasting models. Suitable time series are automatically identified through different state-of-the-art feature selection methods and applied to sales data from our industrial partner. It could be shown that forecasts can be significantly enhanced by incorporating external information. Notably, the potential of feature selection methods stands out, especially due to their capability for automation without expert knowledge and manual selection effort. In particular, the Forward Feature Selection technique consistently yielded superior forecasting accuracy for both SARIMAX and Neural Prophet across different company sales datasets. In the comparative analysis of the errors of the selected forecasting models, namely Neural Prophet and SARIMAX, it is observed that neither model demonstrates a significant superiority over the other."}, "https://arxiv.org/abs/2406.07651": {"title": "surveygenmod2: A SAS macro for estimating complex survey adjusted generalized linear models and Wald-type tests", "link": "https://arxiv.org/abs/2406.07651", "description": "arXiv:2406.07651v1 Announce Type: new \nAbstract: surveygenmod2 builds on the macro written by da Silva (2017) for generalized linear models under complex survey designs. The updated macro fixed several minor bugs we encountered while updating the macro for use in SAS\\textregistered. We added additional features for conducting basic Wald-type tests on groups of parameters based on the estimated regression coefficients and parameter variance-covariance matrix."}, "https://arxiv.org/abs/2406.07756": {"title": "The Exchangeability Assumption for Permutation Tests of Multiple Regression Models: Implications for Statistics and Data Science", "link": "https://arxiv.org/abs/2406.07756", "description": "arXiv:2406.07756v1 Announce Type: new \nAbstract: Permutation tests are a powerful and flexible approach to inference via resampling. As computational methods become more ubiquitous in the statistics curriculum, use of permutation tests has become more tractable. 
At the heart of the permutation approach is the exchangeability assumption, which determines the appropriate null sampling distribution. We explore the exchangeability assumption in the context of permutation tests for multiple linear regression models. Various permutation schemes for the multiple linear regression setting have been previously proposed and assessed in the literature. As has been demonstrated previously, in most settings, the choice of how to permute a multiple linear regression model does not materially change inferential conclusions. Regardless, we believe that (1) understanding exchangeability in the multiple linear regression setting and also (2) how it relates to the null hypothesis of interest is valuable. We also briefly explore model settings beyond multiple linear regression (e.g., settings where clustering or hierarchical relationships exist) as a motivation for the benefit and flexibility of permutation tests. We close with pedagogical recommendations for instructors who want to bring multiple linear regression permutation inference into their classroom as a way to deepen student understanding of resampling-based inference."}, "https://arxiv.org/abs/2406.07787": {"title": "A Diagnostic Tool for Functional Causal Discovery", "link": "https://arxiv.org/abs/2406.07787", "description": "arXiv:2406.07787v1 Announce Type: new \nAbstract: Causal discovery methods aim to determine the causal direction between variables using observational data. Functional causal discovery methods, such as those based on the Linear Non-Gaussian Acyclic Model (LiNGAM), rely on structural and distributional assumptions to infer the causal direction. However, approaches for assessing causal discovery methods' performance as a function of sample size or the impact of assumption violations, inevitable in real-world scenarios, are lacking. To address this need, we propose Causal Direction Detection Rate (CDDR) diagnostic that evaluates whether and to what extent the interaction between assumption violations and sample size affects the ability to identify the hypothesized causal direction. Given a bivariate dataset of size N on a pair of variables, X and Y, CDDR diagnostic is the plotted comparison of the probability of each causal discovery outcome (e.g. X causes Y, Y causes X, or inconclusive) as a function of sample size less than N. We fully develop CDDR diagnostic in a bivariate case and demonstrate its use for two methods, LiNGAM and our new test-based causal discovery approach. We find CDDR diagnostic for the test-based approach to be more informative since it uses a richer set of causal discovery outcomes. Under certain assumptions, we prove that the probability estimates of detecting each possible causal discovery outcome are consistent and asymptotically normal. Through simulations, we study CDDR diagnostic's behavior when linearity and non-Gaussianity assumptions are violated. Additionally, we illustrate CDDR diagnostic on four real datasets, including three for which the causal direction is known."}, "https://arxiv.org/abs/2406.07809": {"title": "Did Harold Zuercher Have Time-Separable Preferences?", "link": "https://arxiv.org/abs/2406.07809", "description": "arXiv:2406.07809v1 Announce Type: new \nAbstract: This paper proposes an empirical model of dynamic discrete choice to allow for non-separable time preferences, generalizing the well-known Rust (1987) model. Under weak conditions, we show the existence of value functions and hence well-defined optimal choices. 
We construct a contraction mapping of the value function and propose an estimation method similar to Rust's nested fixed point algorithm. Finally, we apply the framework to the bus engine replacement data. We improve the fit of the data with our general model and reject the null hypothesis that Harold Zuercher has separable time preferences. Misspecifying an agent's preference as time-separable when it is not leads to biased inferences about structure parameters (such as the agent's risk attitudes) and misleading policy recommendations."}, "https://arxiv.org/abs/2406.07868": {"title": "Bridging multiple worlds: multi-marginal optimal transport for causal partial-identification problem", "link": "https://arxiv.org/abs/2406.07868", "description": "arXiv:2406.07868v1 Announce Type: new \nAbstract: Under the prevalent potential outcome model in causal inference, each unit is associated with multiple potential outcomes but at most one of which is observed, leading to many causal quantities being only partially identified. The inherent missing data issue echoes the multi-marginal optimal transport (MOT) problem, where marginal distributions are known, but how the marginals couple to form the joint distribution is unavailable. In this paper, we cast the causal partial identification problem in the framework of MOT with $K$ margins and $d$-dimensional outcomes and obtain the exact partial identified set. In order to estimate the partial identified set via MOT, statistically, we establish a convergence rate of the plug-in MOT estimator for general quadratic objective functions and prove it is minimax optimal for a quadratic objective function stemming from the variance minimization problem with arbitrary $K$ and $d \\le 4$. Numerically, we demonstrate the efficacy of our method over several real-world datasets where our proposal consistently outperforms the baseline by a significant margin (over 70%). In addition, we provide efficient off-the-shelf implementations of MOT with general objective functions."}, "https://arxiv.org/abs/2406.07940": {"title": "Simple yet Sharp Sensitivity Analysis for Any Contrast Under Unmeasured Confounding", "link": "https://arxiv.org/abs/2406.07940", "description": "arXiv:2406.07940v1 Announce Type: new \nAbstract: We extend our previous work on sensitivity analysis for the risk ratio and difference contrasts under unmeasured confounding to any contrast. We prove that the bounds produced are still arbitrarily sharp, i.e. practically attainable. We illustrate the usability of the bounds with real data."}, "https://arxiv.org/abs/2406.08019": {"title": "Assessing Extreme Risk using Stochastic Simulation of Extremes", "link": "https://arxiv.org/abs/2406.08019", "description": "arXiv:2406.08019v1 Announce Type: new \nAbstract: Risk management is particularly concerned with extreme events, but analysing these events is often hindered by the scarcity of data, especially in a multivariate context. This data scarcity complicates risk management efforts. Various tools can assess the risk posed by extreme events, even under extraordinary circumstances. This paper studies the evaluation of univariate risk for a given risk factor using metrics that account for its asymptotic dependence on other risk factors. Data availability is crucial, particularly for extreme events where it is often limited by the nature of the phenomenon itself, making estimation challenging. To address this issue, two non-parametric simulation algorithms based on multivariate extreme theory are developed. 
These algorithms aim to extend a sample of extremes jointly and conditionally for asymptotically dependent variables using stochastic simulation and multivariate Generalised Pareto Distributions. The approach is illustrated with numerical analyses of both simulated and real data to assess the accuracy of extreme risk metric estimations."}, "https://arxiv.org/abs/2406.08022": {"title": "Null hypothesis Bayes factor estimates can be biased in (some) common factorial designs: A simulation study", "link": "https://arxiv.org/abs/2406.08022", "description": "arXiv:2406.08022v1 Announce Type: new \nAbstract: Bayes factor null hypothesis tests provide a viable alternative to frequentist measures of evidence quantification. Bayes factors for realistic, interesting models cannot be calculated exactly but have to be estimated, which involves approximations to complex integrals. Crucially, the accuracy of these estimates, i.e., whether an estimated Bayes factor corresponds to the true Bayes factor, is unknown, and may depend on data, prior, and likelihood. We have recently developed a novel statistical procedure, namely simulation-based calibration (SBC) for Bayes factors, to test, for a given analysis, whether the computed Bayes factors are accurate. Here, we use SBC for Bayes factors to test, for some common cognitive designs, whether Bayes factors are estimated accurately. We use the bridgesampling/brms packages as well as the BayesFactor package in R. We find that Bayes factor estimates are accurate and exhibit only little bias in Latin square designs with (a) random effects for subjects only and (b) crossed random effects for subjects and items but a single fixed factor. However, Bayes factor estimates turn out biased and liberal in a 2x2 design with crossed random effects for subjects and items. These results suggest that researchers should test, for their individual analysis, whether Bayes factor estimates are accurate. Moreover, future research is needed to determine the boundary conditions under which Bayes factor estimates are accurate or biased, as well as software development to improve estimation accuracy."}, "https://arxiv.org/abs/2406.08168": {"title": "Global Tests for Smoothed Functions in Mean Field Variational Additive Models", "link": "https://arxiv.org/abs/2406.08168", "description": "arXiv:2406.08168v1 Announce Type: new \nAbstract: Variational regression methods are an increasingly popular tool for their efficient estimation of complex models. Given the mixed model representation of penalized effects, additive regression models with smoothed effects and scalar-on-function regression models can be fit relatively efficiently in a variational framework. However, inferential procedures for smoothed and functional effects in such a context are limited. We demonstrate that by using the Mean Field Variational Bayesian (MFVB) approximation to the additive model and the subsequent Coordinate Ascent Variational Inference (CAVI) algorithm, we can obtain a form of the estimated effects required for a Frequentist test for semiparametric curves. We establish MFVB approximations and CAVI algorithms for both Gaussian and binary additive models with an arbitrary number of smoothed and functional effects. We then derive a global testing framework for smoothed and functional effects. Our empirical study demonstrates that the test maintains good Frequentist properties in the variational framework and can be used to directly test results from a converged MFVB approximation and CAVI algorithm.
We illustrate the applicability of this approach in a wide range of data illustrations."}, "https://arxiv.org/abs/2406.08172": {"title": "inlamemi: An R package for missing data imputation and measurement error modelling using INLA", "link": "https://arxiv.org/abs/2406.08172", "description": "arXiv:2406.08172v1 Announce Type: new \nAbstract: Measurement error and missing data in variables used in statistical models are common, and can at worst lead to serious biases in analyses if they are ignored. Yet, these problems are often not dealt with adequately, presumably in part because analysts lack simple enough tools to account for error and missingness. In this R package, we provide functions to aid fitting hierarchical Bayesian models that account for cases where either measurement error (classical or Berkson), missing data, or both are present in continuous covariates. Model fitting is done in a Bayesian framework using integrated nested Laplace approximations (INLA), an approach that is growing in popularity due to its combination of computational speed and accuracy. The {inlamemi} R package is suitable for data analysts who have little prior experience using the R package {R-INLA}, and aids in formulating suitable hierarchical models for a variety of scenarios in order to appropriately capture the processes that generate the measurement error and/or missingness. Numerous examples are given to help analysts identify scenarios similar to their own, and make the process of specifying a suitable model easier."}, "https://arxiv.org/abs/2406.08174": {"title": "A computationally efficient procedure for combining ecological datasets by means of sequential consensus inference", "link": "https://arxiv.org/abs/2406.08174", "description": "arXiv:2406.08174v1 Announce Type: new \nAbstract: Combining data has become an indispensable tool for managing the current diversity and abundance of data. But, as data complexity and data volume swell, the computational demands of previously proposed models for combining data escalate proportionally, posing a significant challenge to practical implementation. This study presents a sequential consensus Bayesian inference procedure that allows for a flexible definition of models, aiming to emulate the versatility of integrated models while significantly reducing their computational cost. The method is based on updating the distribution of the fixed effects and hyperparameters from their marginal posterior distribution throughout a sequential inference procedure, and performing a consensus on the random effects after the sequential inference is completed. The applicability, together with its strengths and limitations, is outlined in the methodological description of the procedure. The sequential consensus method is presented in two distinct algorithms. The first algorithm performs a sequential updating and consensus from the stored values of the marginal or joint posterior distribution of the random effects. The second algorithm performs an extra step, addressing the deficiencies that may arise when the model partition does not share the whole latent field. 
The performance of the procedure is shown by three different examples -- one simulated and two with real data -- intended to expose its strengths and limitations."}, "https://arxiv.org/abs/2406.08241": {"title": "Mode-based estimation of the center of symmetry", "link": "https://arxiv.org/abs/2406.08241", "description": "arXiv:2406.08241v1 Announce Type: new \nAbstract: In the mean-median-mode triad of univariate centrality measures, the mode has been overlooked for estimating the center of symmetry in continuous and unimodal settings. This paper expands on the connection between kernel mode estimators and M-estimators for location, bridging the gap between the nonparametrics and robust statistics communities. The variance of modal estimators is studied in terms of a bandwidth parameter, establishing conditions for an optimal solution that outperforms the household sample mean. A purely nonparametric approach is adopted, modeling heavy-tailedness through regular variation. The results lead to an estimator proposal that includes a novel one-parameter family of kernels with compact support, offering extra robustness and efficiency. The effectiveness and versatility of the new method are demonstrated in a real-world case study and a thorough simulation study, comparing favorably to traditional and more competitive alternatives. Several myths about the mode are clarified along the way, reopening the quest for flexible and efficient nonparametric estimators."}, "https://arxiv.org/abs/2406.08279": {"title": "Positive and negative word of mouth in the United States", "link": "https://arxiv.org/abs/2406.08279", "description": "arXiv:2406.08279v1 Announce Type: new \nAbstract: Word of mouth is a process by which consumers transmit positive or negative sentiment to other consumers about a business. While this process has long been recognized as a type of promotion for businesses, the value of word of mouth is questionable. This study examines the correlates of word of mouth with demographic variables, including the role of the trust of business owners. Education level, region of residence, and income level were found to be significant predictors of positive word of mouth. Although the results generally suggest that the majority of respondents do not engage in word of mouth, there are valuable insights to be learned."}, "https://arxiv.org/abs/2406.08366": {"title": "Highest Probability Density Conformal Regions", "link": "https://arxiv.org/abs/2406.08366", "description": "arXiv:2406.08366v1 Announce Type: new \nAbstract: We propose a new method for finding the highest predictive density set or region using signed conformal inference. The proposed method is computationally efficient, while also carrying conformal coverage guarantees. We prove that, under mild regularity conditions, the conformal prediction set is asymptotically close to its oracle counterpart. The efficacy of the method is illustrated through simulations and real applications."}, "https://arxiv.org/abs/2406.08367": {"title": "Hierarchical Bayesian Emulation of the Expected Net Present Value Utility Function via a Multi-Model Ensemble Member Decomposition", "link": "https://arxiv.org/abs/2406.08367", "description": "arXiv:2406.08367v1 Announce Type: new \nAbstract: Computer models are widely used to study complex real-world physical systems. However, there are major limitations to their direct use, including their complex structure, large numbers of inputs and outputs, and long evaluation times.
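The kernel mode estimator discussed in the mode-based-estimation abstract above is easy to prototype. The sketch below uses a Gaussian KDE with its default bandwidth and a grid search for the maximizer; the kernel, bandwidth rule, and heavy-tailed example are assumptions for illustration and do not reproduce the paper's proposed kernel family or bandwidth theory.

```python
# Minimal kernel mode estimator of the center of symmetry (illustrative sketch).
import numpy as np
from scipy.stats import gaussian_kde

def kde_mode(x, grid_size=2001):
    """Return the argmax of a Gaussian KDE over a grid spanning the data."""
    kde = gaussian_kde(x)                        # Scott's rule bandwidth by default
    grid = np.linspace(x.min(), x.max(), grid_size)
    return grid[np.argmax(kde(grid))]

# Heavy-tailed symmetric sample centered at 3: the modal estimate is typically
# far less affected by the tails than the sample mean.
rng = np.random.default_rng(0)
x = 3.0 + rng.standard_t(1.5, size=2000)
print("mean:  ", x.mean())
print("median:", np.median(x))
print("mode:  ", kde_mode(x))
```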
Bayesian emulators are an effective means of addressing these challenges providing fast and efficient statistical approximation for computer model outputs. It is commonly assumed that computer models behave like a ``black-box'' function with no knowledge of the output prior to its evaluation. This ensures that emulators are generalisable but potentially limits their accuracy compared with exploiting such knowledge of constrained or structured output behaviour. We assume a ``grey-box'' computer model and establish a hierarchical emulation framework encompassing structured emulators which exploit known constrained and structured behaviour of constituent computer model outputs. This achieves greater physical interpretability and more accurate emulator predictions. This research is motivated by and applied to the commercially important TNO OLYMPUS Well Control Optimisation Challenge from the petroleum industry. We re-express this as a decision support under uncertainty problem. First, we reduce the computational expense of the analysis by identifying a representative subset of models using an efficient multi-model ensemble subsampling technique. Next we apply our hierarchical emulation methodology to the expected Net Present Value utility function with well control decision parameters as inputs."}, "https://arxiv.org/abs/2406.08390": {"title": "Coordinated Trading Strategies for Battery Storage in Reserve and Spot Markets", "link": "https://arxiv.org/abs/2406.08390", "description": "arXiv:2406.08390v1 Announce Type: new \nAbstract: Quantity and price risks are key uncertainties market participants face in electricity markets with increased volatility, for instance, due to high shares of renewables. From day ahead until real-time, there is a large variation in the best available information, leading to price changes that flexible assets, such as battery storage, can exploit economically. This study contributes to understanding how coordinated bidding strategies can enhance multi-market trading and large-scale energy storage integration. Our findings shed light on the complexities arising from interdependencies and the high-dimensional nature of the problem. We show how stochastic dual dynamic programming is a suitable solution technique for such an environment. We include the three markets of the frequency containment reserve, day-ahead, and intraday in stochastic modelling and develop a multi-stage stochastic program. Prices are represented in a multidimensional Markov Chain, following the scheduling of the markets and allowing for time-dependent randomness. Using the example of a battery storage in the German energy sector, we provide valuable insights into the technical aspects of our method and the economic feasibility of battery storage operation. We find that capacity reservation in the frequency containment reserve dominates over the battery's cycling in spot markets at the given resolution on prices in 2022. In an adjusted price environment, we find that coordination can yield an additional value of up to 12.5%."}, "https://arxiv.org/abs/2406.08419": {"title": "Identification and Inference on Treatment Effects under Covariate-Adaptive Randomization and Imperfect Compliance", "link": "https://arxiv.org/abs/2406.08419", "description": "arXiv:2406.08419v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) frequently utilize covariate-adaptive randomization (CAR) (e.g., stratified block randomization) and commonly suffer from imperfect compliance. 
This paper studies the identification and inference for the average treatment effect (ATE) and the average treatment effect on the treated (ATT) in such RCTs with a binary treatment.\n We first develop characterizations of the identified sets for both estimands. Since data are generally not i.i.d. under CAR, these characterizations do not follow from existing results. We then provide consistent estimators of the identified sets and asymptotically valid confidence intervals for the parameters. Our asymptotic analysis leads to concrete practical recommendations regarding how to estimate the treatment assignment probabilities that enter in estimated bounds. In the case of the ATE, using sample analog assignment frequencies is more efficient than using the true assignment probabilities. On the contrary, using the true assignment probabilities is preferable for the ATT."}, "https://arxiv.org/abs/2406.07555": {"title": "Sequential Monte Carlo for Cut-Bayesian Posterior Computation", "link": "https://arxiv.org/abs/2406.07555", "description": "arXiv:2406.07555v1 Announce Type: cross \nAbstract: We propose a sequential Monte Carlo (SMC) method to efficiently and accurately compute cut-Bayesian posterior quantities of interest, variations of standard Bayesian approaches constructed primarily to account for model misspecification. We prove finite sample concentration bounds for estimators derived from the proposed method along with a linear tempering extension and apply these results to a realistic setting where a computer model is misspecified. We then illustrate the SMC method for inference in a modular chemical reactor example that includes submodels for reaction kinetics, turbulence, mass transfer, and diffusion. The samples obtained are commensurate with a direct-sampling approach that consists of running multiple Markov chains, with computational efficiency gains using the SMC method. Overall, the SMC method presented yields a novel, rigorous approach to computing with cut-Bayesian posterior distributions."}, "https://arxiv.org/abs/2406.07825": {"title": "Shape-Constrained Distributional Optimization via Importance-Weighted Sample Average Approximation", "link": "https://arxiv.org/abs/2406.07825", "description": "arXiv:2406.07825v1 Announce Type: cross \nAbstract: Shape-constrained optimization arises in a wide range of problems including distributionally robust optimization (DRO) that has surging popularity in recent years. In the DRO literature, these problems are usually solved via reduction into moment-constrained problems using the Choquet representation. While powerful, such an approach could face tractability challenges arising from the geometries and the compatibility between the shape and the objective function and moment constraints. In this paper, we propose an alternative methodology to solve shape-constrained optimization problems by integrating sample average approximation with importance sampling, the latter used to convert the distributional optimization into an optimization problem over the likelihood ratio with respect to a sampling distribution. We demonstrate how our approach, which relies on finite-dimensional linear programs, can handle a range of shape-constrained problems beyond the reach of previous Choquet-based reformulations, and entails vanishing and quantifiable optimality gaps. 
Moreover, our theoretical analyses based on strong duality and empirical processes reveal the critical role of shape constraints in guaranteeing desirable consistency and convergence rates."}, "https://arxiv.org/abs/2406.08041": {"title": "HARd to Beat: The Overlooked Impact of Rolling Windows in the Era of Machine Learning", "link": "https://arxiv.org/abs/2406.08041", "description": "arXiv:2406.08041v1 Announce Type: cross \nAbstract: We investigate the predictive abilities of the heterogeneous autoregressive (HAR) model compared to machine learning (ML) techniques across an unprecedented dataset of 1,455 stocks. Our analysis focuses on the role of fitting schemes, particularly the training window and re-estimation frequency, in determining the HAR model's performance. Despite extensive hyperparameter tuning, ML models fail to surpass the linear benchmark set by HAR when utilizing a refined fitting approach for the latter. Moreover, the simplicity of HAR allows for an interpretable model with drastically lower computational costs. We assess performance using QLIKE, MSE, and realized utility metrics, finding that HAR consistently outperforms its ML counterparts when both rely solely on realized volatility and VIX as predictors. Our results underscore the importance of a correctly specified fitting scheme. They suggest that properly fitted HAR models provide superior forecasting accuracy, establishing robust guidelines for their practical application and use as a benchmark. This study not only reaffirms the efficacy of the HAR model but also provides a critical perspective on the practical limitations of ML approaches in realized volatility forecasting."}, "https://arxiv.org/abs/2406.08097": {"title": "Inductive Global and Local Manifold Approximation and Projection", "link": "https://arxiv.org/abs/2406.08097", "description": "arXiv:2406.08097v1 Announce Type: cross \nAbstract: Nonlinear dimensional reduction with the manifold assumption, often called manifold learning, has proven its usefulness in a wide range of high-dimensional data analysis. The significant impact of t-SNE and UMAP has catalyzed intense research interest, seeking further innovations toward visualizing not only the local but also the global structure information of the data. Moreover, there have been consistent efforts toward generalizable dimensional reduction that handles unseen data. In this paper, we first propose GLoMAP, a novel manifold learning method for dimensional reduction and high-dimensional data visualization. GLoMAP preserves locally and globally meaningful distance estimates and displays a progression from global to local formation during the course of optimization. Furthermore, we extend GLoMAP to its inductive version, iGLoMAP, which utilizes a deep neural network to map data to its lower-dimensional representation. This allows iGLoMAP to provide lower-dimensional embeddings for unseen points without needing to re-train the algorithm. iGLoMAP is also well-suited for mini-batch learning, enabling large-scale, accelerated gradient calculations. 
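The HAR benchmark discussed in the realized-volatility abstract above is simple to reproduce: realized variance is regressed on its lagged daily value and its weekly and monthly averages, refit over a rolling window. The sketch below is a generic illustration on simulated data; the window length, lag convention, and evaluation metric are assumptions and do not reproduce the paper's fitting schemes.

```python
# Heterogeneous autoregressive (HAR) model for realized volatility:
#   RV_{t+1} = b0 + b1*RV_t + b2*mean(RV_{t-4..t}) + b3*mean(RV_{t-21..t}) + e.
# Generic rolling-window OLS illustration (window length assumed).
import numpy as np

def har_design(rv):
    """Build lagged daily/weekly/monthly HAR regressors aligned with RV_{t+1}."""
    t = np.arange(21, len(rv) - 1)               # need 22 lags of history
    daily = rv[t]
    weekly = np.array([rv[i - 4:i + 1].mean() for i in t])
    monthly = np.array([rv[i - 21:i + 1].mean() for i in t])
    X = np.column_stack([np.ones(len(t)), daily, weekly, monthly])
    return X, rv[t + 1]

def rolling_har_forecasts(rv, window=500):
    X, y = har_design(rv)
    preds = []
    for end in range(window, len(y)):
        beta, *_ = np.linalg.lstsq(X[end - window:end], y[end - window:end],
                                   rcond=None)
        preds.append(X[end] @ beta)              # one-step-ahead forecast
    return np.array(preds), y[window:]

# Synthetic persistent positive series standing in for realized variance.
rng = np.random.default_rng(0)
log_rv = np.zeros(1500)
for i in range(1, 1500):
    log_rv[i] = 0.97 * log_rv[i - 1] + 0.2 * rng.normal()
rv = np.exp(log_rv)
preds, actual = rolling_har_forecasts(rv)
print("out-of-sample MSE:", np.mean((preds - actual) ** 2))
```

The rolling re-estimation loop is the "fitting scheme" whose choice the abstract argues matters as much as the model class itself.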
We have successfully applied both GLoMAP and iGLoMAP to the simulated and real-data settings, with competitive experiments against the state-of-the-art methods."}, "https://arxiv.org/abs/2406.08180": {"title": "Stochastic Process-based Method for Degree-Degree Correlation of Evolving Networks", "link": "https://arxiv.org/abs/2406.08180", "description": "arXiv:2406.08180v1 Announce Type: cross \nAbstract: Existing studies on the degree correlation of evolving networks typically rely on differential equations and statistical analysis, resulting in only approximate solutions due to inherent randomness. To address this limitation, we propose an improved Markov chain method for modeling degree correlation in evolving networks. By redesigning the network evolution rules to reflect actual network dynamics more accurately, we achieve a topological structure that closely matches real-world network evolution. Our method models the degree correlation evolution process for both directed and undirected networks and provides theoretical results that are verified through simulations. This work offers the first theoretical solution for the steady-state degree correlation in evolving network models and is applicable to more complex evolution mechanisms and networks with directional attributes. Additionally, it supports the study of dynamic characteristic control based on network structure at any given time, offering a new tool for researchers in the field."}, "https://arxiv.org/abs/2406.08322": {"title": "MMIL: A novel algorithm for disease associated cell type discovery", "link": "https://arxiv.org/abs/2406.08322", "description": "arXiv:2406.08322v1 Announce Type: cross \nAbstract: Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. To address this, we introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers using patient-level labels. Our approach can be used to train e.g. lasso logistic regression models, gradient boosted trees, and neural networks. When applied to clinically-annotated, primary patient samples in Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), our method accurately identifies cancer cells, generalizes across tissues and treatment timepoints, and selects biologically relevant features. In addition, MMIL is capable of incorporating cell labels into model training when they are known, providing a powerful framework for leveraging both labeled and unlabeled data simultaneously. Mixture Modeling for MIL offers a novel approach for cell classification, with significant potential to advance disease understanding and management, especially in scenarios with unknown gold-standard labels and high dimensionality."}, "https://arxiv.org/abs/2111.15524": {"title": "Efficiency and Robustness of Rosenbaum's Regression (Un)-Adjusted Rank-based Estimator in Randomized Experiments", "link": "https://arxiv.org/abs/2111.15524", "description": "arXiv:2111.15524v4 Announce Type: replace \nAbstract: Mean-based estimators of the causal effect in a completely randomized experiment may behave poorly if the potential outcomes have a heavy-tail, or contain outliers. We study an alternative estimator by Rosenbaum that estimates the constant additive treatment effect by inverting a randomization test using ranks. 
By investigating the breakdown point and asymptotic relative efficiency of this rank-based estimator, we show that it is provably robust against outliers and heavy-tailed potential outcomes, and has asymptotic variance at most 1.16 times that of the difference-in-means estimator (and much smaller when the potential outcomes are not light-tailed). We further derive a consistent estimator of the asymptotic standard error for Rosenbaum's estimator which yields a readily computable confidence interval for the treatment effect. We also study a regression adjusted version of Rosenbaum's estimator to incorporate additional covariate information in randomization inference. We prove gain in efficiency by this regression adjustment method under a linear regression model. We illustrate through synthetic and real data that, unlike the mean-based estimators, these rank-based estimators (both unadjusted or regression adjusted) are efficient and robust against heavy-tailed distributions, contamination, and model misspecification. Finally, we initiate the study of Rosenbaum's estimator when the constant treatment effect assumption may be violated."}, "https://arxiv.org/abs/2208.10910": {"title": "A flexible empirical Bayes approach to multiple linear regression and connections with penalized regression", "link": "https://arxiv.org/abs/2208.10910", "description": "arXiv:2208.10910v3 Announce Type: replace \nAbstract: We introduce a new empirical Bayes approach for large-scale multiple linear regression. Our approach combines two key ideas: (i) the use of flexible \"adaptive shrinkage\" priors, which approximate the nonparametric family of scale mixture of normal distributions by a finite mixture of normal distributions; and (ii) the use of variational approximations to efficiently estimate prior hyperparameters and compute approximate posteriors. Combining these two ideas results in fast and flexible methods, with computational speed comparable to fast penalized regression methods such as the Lasso, and with competitive prediction accuracy across a wide range of scenarios. Further, we provide new results that establish conceptual connections between our empirical Bayes methods and penalized methods. Specifically, we show that the posterior mean from our method solves a penalized regression problem, with the form of the penalty function being learned from the data by directly solving an optimization problem (rather than being tuned by cross-validation). Our methods are implemented in an R package, mr.ash.alpha, available from https://github.com/stephenslab/mr.ash.alpha."}, "https://arxiv.org/abs/2211.06568": {"title": "Effective experience rating for large insurance portfolios via surrogate modeling", "link": "https://arxiv.org/abs/2211.06568", "description": "arXiv:2211.06568v3 Announce Type: replace \nAbstract: Experience rating in insurance uses a Bayesian credibility model to upgrade the current premiums of a contract by taking into account policyholders' attributes and their claim history. Most data-driven models used for this task are mathematically intractable, and premiums must be obtained through numerical methods such as simulation via MCMC. However, these methods can be computationally expensive and even prohibitive for large portfolios when applied at the policyholder level. Additionally, these computations become ``black-box\" procedures as there is no analytical expression showing how the claim history of policyholders is used to upgrade their premiums. 
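Returning to Rosenbaum's rank-based estimator described above: the core idea of estimating a constant additive effect by inverting a rank test can be sketched as a search for the shift that centers a Wilcoxon rank-sum statistic at its null expectation. This is a generic Hodges-Lehmann-style illustration of the inversion idea only; the Wilcoxon statistic, the grid search, the simulated data, and the absence of regression adjustment are assumptions of the sketch, not the authors' implementation.

```python
# Point estimate of a constant additive treatment effect by inverting a
# Wilcoxon rank-sum test: pick tau so that the rank sum of the tau-adjusted
# treated outcomes is closest to its null expectation.
import numpy as np
from scipy.stats import rankdata

def rank_inversion_estimate(y_treat, y_ctrl, grid=None):
    if grid is None:
        # Candidate shifts: all pairwise treated-minus-control differences.
        grid = np.sort((y_treat[:, None] - y_ctrl[None, :]).ravel())
    n_t, n_c = len(y_treat), len(y_ctrl)
    expected = n_t * (n_t + n_c + 1) / 2.0       # null expectation of the rank sum
    stats = []
    for tau in grid:
        combined = np.concatenate([y_treat - tau, y_ctrl])
        stats.append(rankdata(combined)[:n_t].sum())
    stats = np.asarray(stats)
    return grid[np.argmin(np.abs(stats - expected))]

# Heavy-tailed outcomes with a true additive effect of 2 (assumed example).
rng = np.random.default_rng(0)
y_ctrl = rng.standard_cauchy(80)
y_treat = rng.standard_cauchy(80) + 2.0
print("difference in means:   ", y_treat.mean() - y_ctrl.mean())
print("rank-inversion estimate:", rank_inversion_estimate(y_treat, y_ctrl))
```

With Cauchy outcomes the difference in means can be wildly unstable, while the rank-inversion estimate stays near the true shift, which is the robustness phenomenon the abstract quantifies.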
To address these challenges, this paper proposes a surrogate modeling approach to inexpensively derive an analytical expression for computing the Bayesian premiums for any given model, approximately. As a part of the methodology, the paper introduces a \\emph{likelihood-based summary statistic} of the policyholder's claim history that serves as the main input of the surrogate model and that is sufficient for certain families of distribution, including the exponential dispersion family. As a result, the computational burden of experience rating for large portfolios is reduced through the direct evaluation of such analytical expression, which can provide a transparent and interpretable way of computing Bayesian premiums."}, "https://arxiv.org/abs/2307.09404": {"title": "Continuous-time multivariate analysis", "link": "https://arxiv.org/abs/2307.09404", "description": "arXiv:2307.09404v3 Announce Type: replace \nAbstract: The starting point for much of multivariate analysis (MVA) is an $n\\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. Here we introduce a framework for extending techniques of multivariate analysis to such settings. The proposed continuous-time multivariate analysis (CTMVA) framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as $B$-splines, as in the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We present continuous-time extensions of the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering. We show that CTMVA can improve on the performance of classical MVA, in particular for correlation estimation and clustering, and can be applied in some settings where classical MVA cannot, including variables observed at disparate time points. CTMVA is illustrated with a novel perspective on a well-known Canadian weather data set, and with applications to data sets involving international development, brain signals, and air quality. The proposed methods are implemented in the publicly available R package \\texttt{ctmva}."}, "https://arxiv.org/abs/2307.10808": {"title": "Claim Reserving via Inverse Probability Weighting: A Micro-Level Chain-Ladder Method", "link": "https://arxiv.org/abs/2307.10808", "description": "arXiv:2307.10808v3 Announce Type: replace \nAbstract: Claim reserving primarily relies on macro-level models, with the Chain-Ladder method being the most widely adopted. These methods were heuristically developed without minimal statistical foundations, relying on oversimplified data assumptions and neglecting policyholder heterogeneity, often resulting in conservative reserve predictions. Micro-level reserving, utilizing stochastic modeling with granular information, can improve predictions but tends to involve less attractive and complex models for practitioners. 
This paper aims to strike a practical balance between aggregate and individual models by introducing a methodology that enables the Chain-Ladder method to incorporate individual information. We achieve this by proposing a novel framework, formulating the claim reserving problem within a population sampling context. We introduce a reserve estimator in a frequency and severity distribution-free manner that utilizes inverse probability weights (IPW) driven by individual information, akin to propensity scores. We demonstrate that the Chain-Ladder method emerges as a particular case of such an IPW estimator, thereby inheriting a statistically sound foundation based on population sampling theory that enables the use of granular information, and other extensions."}, "https://arxiv.org/abs/2309.15408": {"title": "A smoothed-Bayesian approach to frequency recovery from sketched data", "link": "https://arxiv.org/abs/2309.15408", "description": "arXiv:2309.15408v2 Announce Type: replace \nAbstract: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a {\\em smoothed-Bayesian} method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on \\emph{multi-view} learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives."}, "https://arxiv.org/abs/2310.17820": {"title": "Sparse Bayesian Multidimensional Item Response Theory", "link": "https://arxiv.org/abs/2310.17820", "description": "arXiv:2310.17820v2 Announce Type: replace \nAbstract: Multivariate Item Response Theory (MIRT) is sought-after widely by applied researchers looking for interpretable (sparse) explanations underlying response patterns in questionnaire data. There is, however, an unmet demand for such sparsity discovery tools in practice. Our paper develops a Bayesian platform for binary and ordinal item MIRT which requires minimal tuning and scales well on large datasets due to its parallelizable features. Bayesian methodology for MIRT models has traditionally relied on MCMC simulation, which cannot only be slow in practice, but also often renders exact sparsity recovery impossible without additional thresholding. In this work, we develop a scalable Bayesian EM algorithm to estimate sparse factor loadings from mixed continuous, binary, and ordinal item responses. 
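For readers less familiar with the classical baseline that the IPW reserving framework above generalizes, a minimal chain-ladder computation on a cumulative run-off triangle looks like the sketch below. The toy triangle is assumed for illustration, and the paper's individual-level IPW estimator is not implemented here.

```python
# Classical chain-ladder on a cumulative run-off triangle (the aggregate
# baseline recovered as a special case of the IPW estimator).  Toy data assumed.
import numpy as np

# Rows: accident years, columns: development years; NaN = not yet observed.
tri = np.array([
    [1000., 1800., 2100., 2200.],
    [1100., 2000., 2300., np.nan],
    [1200., 2100., np.nan, np.nan],
    [1300., np.nan, np.nan, np.nan],
])

n = tri.shape[1]
factors = []
for j in range(n - 1):
    obs = ~np.isnan(tri[:, j + 1])               # rows with both columns observed
    factors.append(tri[obs, j + 1].sum() / tri[obs, j].sum())

# Project the lower triangle with the estimated development factors.
full = tri.copy()
for j in range(n - 1):
    missing = np.isnan(full[:, j + 1])
    full[missing, j + 1] = full[missing, j] * factors[j]

latest = np.array([row[~np.isnan(row)][-1] for row in tri])
reserve = full[:, -1] - latest                   # ultimate minus latest diagonal
print("development factors:", np.round(factors, 3))
print("estimated reserves by accident year:", np.round(reserve, 1))
```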
We address the seemingly insurmountable problem of unknown latent factor dimensionality with tools from Bayesian nonparametrics, which enable estimating the number of factors. Rotations to sparsity through parameter expansion further enhance convergence and interpretability without identifiability constraints. In our simulation study, we show that our method reliably recovers both the factor dimensionality and the latent structure on high-dimensional synthetic data even for small samples. We demonstrate the practical usefulness of our approach on three datasets: an educational assessment dataset, a quality-of-life measurement dataset, and a bio-behavioral dataset. All demonstrations show that our tool yields interpretable estimates, facilitating interesting discoveries that might otherwise go unnoticed under a pure confirmatory factor analysis setting."}, "https://arxiv.org/abs/2312.05411": {"title": "Deep Bayes Factors", "link": "https://arxiv.org/abs/2312.05411", "description": "arXiv:2312.05411v2 Announce Type: replace \nAbstract: There is no other model or hypothesis verification tool in Bayesian statistics that is as widely used as the Bayes factor. We focus on generative models that are likelihood-free and, therefore, render the computation of Bayes factors (marginal likelihood ratios) far from obvious. We propose a deep learning estimator of the Bayes factor based on simulated data from two competing models using the likelihood ratio trick. This estimator is devoid of summary statistics and obviates some of the difficulties with ABC model choice. We establish sufficient conditions for consistency of our Deep Bayes Factor estimator as well as its consistency as a model selection tool. We investigate the performance of our estimator on various examples using a wide range of quality metrics related to estimation and model decision accuracy. After training, our deep learning approach enables rapid evaluations of the Bayes factor estimator at any fictional data arriving from either hypothesized model, not just the observed data $Y_0$. This allows us to inspect entire Bayes factor distributions under the two models and to quantify the relative location of the Bayes factor evaluated at $Y_0$ in light of these distributions. Such tail area evaluations are not possible for Bayes factor estimators tailored to $Y_0$. We find the performance of our Deep Bayes Factors competitive with existing MCMC techniques that require knowledge of the likelihood function. We also consider variants for posterior or intrinsic Bayes factor estimation. We demonstrate the usefulness of our approach on a relatively high-dimensional real data example about determining cognitive biases."}, "https://arxiv.org/abs/2406.08628": {"title": "Empirical Evidence That There Is No Such Thing As A Validated Prediction Model", "link": "https://arxiv.org/abs/2406.08628", "description": "arXiv:2406.08628v1 Announce Type: new \nAbstract: Background: External validations are essential to assess clinical prediction models (CPMs) before deployment. Apart from model misspecification, differences in patient population and other factors influence a model's AUC (c-statistic). We aimed to quantify variation in AUCs across external validation studies and adjust expectations of a model's performance in a new setting.\n Methods: The Tufts-PACE CPM Registry contains CPMs for cardiovascular disease prognosis. We analyzed the AUCs of 469 CPMs with a total of 1,603 external validations.
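The likelihood ratio trick behind the Deep Bayes Factors abstract above can be illustrated on a toy problem where the true Bayes factor has a closed form. The logistic-regression classifier, the conjugate normal models, and the use of the sufficient statistic as input are assumptions chosen so the classifier is well specified; they stand in for the deep network and the summary-free setup of the paper.

```python
# Likelihood ratio trick for Bayes factor estimation on a conjugate toy problem.
# A logistic regression on xbar^2 stands in for the paper's deep network.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m = 10, 20000                     # observations per dataset, datasets per model

# M1: x_i ~ N(0,1).   M2: theta ~ N(0,1), x_i | theta ~ N(theta,1).
xbar_m1 = rng.normal(0, 1, size=(m, n)).mean(axis=1)
theta = rng.normal(0, 1, size=m)
xbar_m2 = (theta[:, None] + rng.normal(0, 1, size=(m, n))).mean(axis=1)

# With balanced training classes, the classifier's posterior odds estimate BF_21.
features = np.concatenate([xbar_m1, xbar_m2])[:, None] ** 2
labels = np.concatenate([np.zeros(m), np.ones(m)])
clf = LogisticRegression(max_iter=1000).fit(features, labels)

def bf21_classifier(xbar):
    p = clf.predict_proba(np.array([[xbar ** 2]]))[0, 1]
    return p / (1 - p)

def bf21_exact(xbar):
    # Closed form here: BF_21 = (n+1)^(-1/2) * exp((n*xbar)^2 / (2*(n+1))).
    return np.exp((n * xbar) ** 2 / (2 * (n + 1))) / np.sqrt(n + 1)

for xb in (0.0, 0.5, 1.0):
    print(f"xbar={xb:.1f}  classifier BF={bf21_classifier(xb):8.3f}  "
          f"exact BF={bf21_exact(xb):8.3f}")
```

Because the classifier can be evaluated at any input, not just the observed data, the same trained model can map out the whole distribution of Bayes factors under either hypothesis, which is the tail-area diagnostic the abstract highlights.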
For each CPM, we performed a random effects meta-analysis to estimate the between-study standard deviation $\\tau$ among the AUCs. Since the majority of these meta-analyses has only a handful of validations, this leads to very poor estimates of $\\tau$. So, we estimated a log normal distribution of $\\tau$ across all CPMs and used this as an empirical prior. We compared this empirical Bayesian approach with frequentist meta-analyses using cross-validation.\n Results: The 469 CPMs had a median of 2 external validations (IQR: [1-3]). The estimated distribution of $\\tau$ had a mean of 0.055 and a standard deviation of 0.015. If $\\tau$ = 0.05, the 95% prediction interval for the AUC in a new setting is at least +/- 0.1, regardless of the number of validations. Frequentist methods underestimate the uncertainty about the AUC in a new setting. Accounting for $\\tau$ in a Bayesian approach achieved near nominal coverage.\n Conclusion: Due to large heterogeneity among the validated AUC values of a CPM, there is great irreducible uncertainty in predicting the AUC in a new setting. This uncertainty is underestimated by existing methods. The proposed empirical Bayes approach addresses this problem which merits wide application in judging the validity of prediction models."}, "https://arxiv.org/abs/2406.08668": {"title": "Causal Inference on Missing Exposure via Robust Estimation", "link": "https://arxiv.org/abs/2406.08668", "description": "arXiv:2406.08668v1 Announce Type: new \nAbstract: How to deal with missing data in observational studies is a common concern for causal inference. When the covariates are missing at random (MAR), multiple approaches have been provided to help solve the issue. However, if the exposure is MAR, few approaches are available and careful adjustments on both missingness and confounding issues are required to ensure a consistent estimate of the true causal effect on the response. In this article, a new inverse probability weighting (IPW) estimator based on weighted estimating equations (WEE) is proposed to incorporate weights from both the missingness and propensity score (PS) models, which can reduce the joint effect of extreme weights in finite samples. Additionally, we develop a triple robust (TR) estimator via WEE to further protect against the misspecification of the missingness model. The asymptotic properties of WEE estimators are proved using properties of estimating equations. Based on the simulation studies, WEE methods outperform others including imputation-based approaches in terms of bias and variability. Finally, an application study is conducted to identify the causal effect of the presence of cardiovascular disease on mortality for COVID-19 patients."}, "https://arxiv.org/abs/2406.08685": {"title": "Variational Bayes Inference for Spatial Error Models with Missing Data", "link": "https://arxiv.org/abs/2406.08685", "description": "arXiv:2406.08685v1 Announce Type: new \nAbstract: The spatial error model (SEM) is a type of simultaneous autoregressive (SAR) model for analysing spatially correlated data. Markov chain Monte Carlo (MCMC) is one of the most widely used Bayesian methods for estimating SEM, but it has significant limitations when it comes to handling missing data in the response variable due to its high computational cost. Variational Bayes (VB) approximation offers an alternative solution to this problem. 
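The per-model random-effects step in the AUC-validation abstract above can be sketched with a standard DerSimonian-Laird estimate of the between-study standard deviation tau. The AUCs and standard errors below are assumed for illustration, and the paper's empirical-Bayes log-normal prior on tau across CPMs is not implemented; the prediction interval uses a simple normal approximation.

```python
# DerSimonian-Laird estimate of between-study heterogeneity (tau) for one
# prediction model's external-validation AUCs (illustrative numbers assumed).
import numpy as np

auc = np.array([0.72, 0.68, 0.75, 0.63, 0.70])   # external-validation AUCs
se = np.array([0.03, 0.04, 0.05, 0.04, 0.03])    # their standard errors

w = 1.0 / se ** 2                                # fixed-effect weights
mu_fe = np.sum(w * auc) / np.sum(w)              # fixed-effect pooled AUC
Q = np.sum(w * (auc - mu_fe) ** 2)               # Cochran's Q
k = len(auc)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects pooled AUC and an approximate 95% prediction interval for the
# AUC in a new validation setting.
w_re = 1.0 / (se ** 2 + tau2)
mu_re = np.sum(w_re * auc) / np.sum(w_re)
pred_sd = np.sqrt(tau2 + 1.0 / np.sum(w_re))
print(f"tau = {np.sqrt(tau2):.3f}")
print(f"pooled AUC = {mu_re:.3f}, 95% prediction interval = "
      f"({mu_re - 1.96 * pred_sd:.3f}, {mu_re + 1.96 * pred_sd:.3f})")
```

Even a modest tau of about 0.05 widens the prediction interval for a new setting to roughly +/- 0.1 around the pooled AUC, which is the point the abstract emphasizes.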
Two VB-based algorithms employing Gaussian variational approximation with factor covariance structure are presented, joint VB (JVB) and hybrid VB (HVB), suitable for both missing at random and not at random inference. When dealing with many missing values, the JVB is inaccurate, and the standard HVB algorithm struggles to achieve accurate inferences. Our modified versions of HVB enable accurate inference within a reasonable computational time, thus improving its performance. The performance of the VB methods is evaluated using simulated and real datasets."}, "https://arxiv.org/abs/2406.08776": {"title": "Learning Joint and Individual Structure in Network Data with Covariates", "link": "https://arxiv.org/abs/2406.08776", "description": "arXiv:2406.08776v1 Announce Type: new \nAbstract: Datasets consisting of a network and covariates associated with its vertices have become ubiquitous. One problem pertaining to this type of data is to identify information unique to the network, information unique to the vertex covariates and information that is shared between the network and the vertex covariates. Existing techniques for network data and vertex covariates focus on capturing structure that is shared but are usually not able to differentiate structure that is unique to each dataset. This work formulates a low-rank model that simultaneously captures joint and individual information in network data with vertex covariates. A two-step estimation procedure is proposed, composed of an efficient spectral method followed by a refinement optimization step. Theoretically, we show that the spectral method is able to consistently recover the joint and individual components under a general signal-plus-noise model.\n Simulations and real data examples demonstrate the ability of the methods to recover accurate and interpretable components. In particular, the application of the methodology to a food trade network between countries with economic, developmental and geographical country-level indicators as covariates yields joint and individual factors that explain the trading patterns."}, "https://arxiv.org/abs/2406.08784": {"title": "Improved methods for empirical Bayes multivariate multiple testing and effect size estimation", "link": "https://arxiv.org/abs/2406.08784", "description": "arXiv:2406.08784v1 Announce Type: new \nAbstract: Estimating the sharing of genetic effects across different conditions is important to many statistical analyses of genomic data. The patterns of sharing arising from these data are often highly heterogeneous. To flexibly model these heterogeneous sharing patterns, Urbut et al. (2019) proposed the multivariate adaptive shrinkage (MASH) method to jointly analyze genetic effects across multiple conditions. However, multivariate analyses using MASH (as well as other multivariate analyses) require good estimates of the sharing patterns, and estimating these patterns efficiently and accurately remains challenging. Here we describe new empirical Bayes methods that provide improvements in speed and accuracy over existing methods. The two key ideas are: (1) adaptive regularization to improve accuracy in settings with many conditions; (2) improving the speed of the model fitting algorithms by exploiting analytical results on covariance estimation. In simulations, we show that the new methods provide better model fits, better out-of-sample performance, and improved power and accuracy in detecting the true underlying signals. 
In an analysis of eQTLs in 49 human tissues, our new analysis pipeline achieves better model fits and better out-of-sample performance than the existing MASH analysis pipeline. We have implemented the new methods, which we call ``Ultimate Deconvolution'', in an R package, udr, available on GitHub."}, "https://arxiv.org/abs/2406.08867": {"title": "A Robust Bayesian approach for reliability prognosis of one-shot devices under cumulative risk model", "link": "https://arxiv.org/abs/2406.08867", "description": "arXiv:2406.08867v1 Announce Type: new \nAbstract: The reliability prognosis of one-shot devices is drawing increasing attention because of their wide applicability. The present study aims to determine the lifetime prognosis of highly durable one-shot device units under a step-stress accelerated life testing (SSALT) experiment applying a cumulative risk model (CRM). In an SSALT experiment, CRM retains the continuity of hazard function by allowing the lag period before the effects of stress change emerge. In an analysis of such lifetime data, plentiful datasets might have outliers where conventional methods like maximum likelihood estimation or likelihood-based Bayesian estimation frequently fail. This work develops a robust estimation method based on density power divergence in classical and Bayesian frameworks. The hypothesis is tested by implementing the Bayes factor based on a robustified posterior. In Bayesian estimation, we exploit Hamiltonian Monte Carlo, which has certain advantages over the conventional Metropolis-Hastings algorithms. Further, the influence functions are examined to evaluate the robust behaviour of the estimators and the Bayes factor. Finally, the analytical development is validated through a simulation study and a real data analysis."}, "https://arxiv.org/abs/2406.08880": {"title": "Jackknife inference with two-way clustering", "link": "https://arxiv.org/abs/2406.08880", "description": "arXiv:2406.08880v1 Announce Type: new \nAbstract: For linear regression models with cross-section or panel data, it is natural to assume that the disturbances are clustered in two dimensions. However, the finite-sample properties of two-way cluster-robust tests and confidence intervals are often poor. We discuss several ways to improve inference with two-way clustering. Two of these are existing methods for avoiding, or at least ameliorating, the problem of undefined standard errors when a cluster-robust variance matrix estimator (CRVE) is not positive definite. One is a new method that always avoids the problem. More importantly, we propose a family of new two-way CRVEs based on the cluster jackknife. Simulations for models with two-way fixed effects suggest that, in many cases, the cluster-jackknife CRVE combined with our new method yields surprisingly accurate inferences. We provide a simple software package, twowayjack for Stata, that implements our recommended variance estimator."}, "https://arxiv.org/abs/2406.08968": {"title": "Covariate Selection for Optimizing Balance with Covariate-Adjusted Response-Adaptive Randomization", "link": "https://arxiv.org/abs/2406.08968", "description": "arXiv:2406.08968v1 Announce Type: new \nAbstract: Balancing influential covariates is crucial for valid treatment comparisons in clinical studies. While covariate-adaptive randomization is commonly used to achieve balance, its performance can be inadequate when the number of baseline covariates is large. 
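The cluster jackknife underlying the two-way CRVEs proposed above can be sketched in its simpler one-way form: re-estimate the coefficients with each cluster deleted and build a variance estimate from the spread of those estimates. This is a generic one-way delete-a-cluster illustration on simulated data, not the paper's two-way construction or the twowayjack Stata package.

```python
# One-way delete-a-cluster jackknife variance for OLS coefficients, the basic
# building block behind cluster-jackknife CRVEs (generic illustration).
import numpy as np

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cluster_jackknife_variance(X, y, clusters):
    labels = np.unique(clusters)
    G = len(labels)
    betas = np.array([ols(X[clusters != g], y[clusters != g]) for g in labels])
    dev = betas - betas.mean(axis=0)
    return (G - 1) / G * dev.T @ dev             # jackknife variance matrix

# Simulated data with a common shock within each cluster (assumed example).
rng = np.random.default_rng(0)
G, n_per = 30, 40
clusters = np.repeat(np.arange(G), n_per)
x = rng.normal(size=G * n_per)
u = np.repeat(rng.normal(size=G), n_per)
y = 1.0 + 0.5 * x + u + rng.normal(size=G * n_per)
X = np.column_stack([np.ones_like(x), x])

V = cluster_jackknife_variance(X, y, clusters)
print("slope estimate:", ols(X, y)[1])
print("cluster-jackknife s.e. of slope:", np.sqrt(V[1, 1]))
```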
It is therefore essential to identify the influential factors associated with the outcome and ensure balance among these critical covariates. In this article, we propose a novel covariate-adjusted response-adaptive randomization that integrates the patients' responses and covariate information to sequentially select significant covariates and maintain their balance. We theoretically establish the consistency of our covariate selection method and demonstrate that the improved covariate balancing, as evidenced by a faster convergence rate of the imbalance measure, leads to higher efficiency in estimating treatment effects. Furthermore, we provide extensive numerical and empirical studies to illustrate the benefits of our proposed method across various settings."}, "https://arxiv.org/abs/2406.09010": {"title": "A geometric approach to informed MCMC sampling", "link": "https://arxiv.org/abs/2406.09010", "description": "arXiv:2406.09010v1 Announce Type: new \nAbstract: A Riemannian geometric framework for Markov chain Monte Carlo (MCMC) is developed in which informed proposal densities for Metropolis-Hastings (MH) algorithms are constructed using the Fisher-Rao metric on the manifold of probability density functions (pdfs). We exploit the square-root representation of pdfs under which the Fisher-Rao metric boils down to the standard $L^2$ metric on the positive orthant of the unit hypersphere. The square-root representation allows us to easily compute the geodesic distance between densities, resulting in a straightforward implementation of the proposed geometric MCMC methodology. Unlike the random walk MH algorithm, which blindly proposes a candidate state using no information about the target, the geometric MH algorithms effectively move an uninformed base density (e.g., a random walk proposal density) towards different global/local approximations of the target density. We compare the proposed geometric MH algorithm with other MCMC algorithms for various Markov chain orderings, namely the covariance, efficiency, Peskun, and spectral gap orderings. The superior performance of the geometric algorithms over other MH algorithms like the random walk Metropolis, independent MH, and variants of Metropolis adjusted Langevin algorithms is demonstrated in the context of various multimodal, nonlinear and high dimensional examples. In particular, we use extensive simulation and real data applications to compare these algorithms for analyzing mixture models, logistic regression models and ultra-high dimensional Bayesian variable selection models. A publicly available R package accompanies the article."}, "https://arxiv.org/abs/2406.09055": {"title": "Relational event models with global covariates", "link": "https://arxiv.org/abs/2406.09055", "description": "arXiv:2406.09055v1 Announce Type: new \nAbstract: Traditional inference in relational event models from dynamic network data involves only dyadic and node-specific variables, as anything that is global, i.e., constant across dyads, drops out of the partial likelihood. We address this with the use of nested case-control sampling on a time-shifted version of the event process. This leads to a partial likelihood of a degenerate logistic additive model, enabling efficient estimation of global and non-global covariate effects. The method's effectiveness is demonstrated through a simulation study.
An application to bike sharing data reveals significant influences of global covariates like weather and time of day on bike-sharing dynamics."}, "https://arxiv.org/abs/2406.09163": {"title": "Covariate balancing with measurement error", "link": "https://arxiv.org/abs/2406.09163", "description": "arXiv:2406.09163v1 Announce Type: new \nAbstract: In recent years, there is a growing body of causal inference literature focusing on covariate balancing methods. These methods eliminate observed confounding by equalizing covariate moments between the treated and control groups. The validity of covariate balancing relies on an implicit assumption that all covariates are accurately measured, which is frequently violated in observational studies. Nevertheless, the impact of measurement error on covariate balancing is unclear, and there is no existing work on balancing mismeasured covariates adequately. In this article, we show that naively ignoring measurement error reversely increases the magnitude of covariate imbalance and induces bias to treatment effect estimation. We then propose a class of measurement error correction strategies for the existing covariate balancing methods. Theoretically, we show that these strategies successfully recover balance for all covariates, and eliminate bias of treatment effect estimation. We assess the proposed correction methods in simulation studies and real data analysis."}, "https://arxiv.org/abs/2406.09195": {"title": "When Pearson $\\chi^2$ and other divisible statistics are not goodness-of-fit tests", "link": "https://arxiv.org/abs/2406.09195", "description": "arXiv:2406.09195v1 Announce Type: new \nAbstract: Thousands of experiments are analyzed and papers are published each year involving the statistical analysis of grouped data. While this area of statistics is often perceived - somewhat naively - as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and what are the new possibilities in their hands.\n Motivated by this need, the article introduces a unifying approach to the analysis of grouped data which allows us to study the class of divisible statistics - that includes Pearson's $\\chi^2$, the likelihood ratio as special cases - with a fresh perspective. The contributions collected in this manuscript span from modeling and estimation to distribution-free goodness-of-fit tests.\n Perhaps the most surprising result presented here is that, in a sparse regime, all tests proposed in the literature are dominated by a class of weighted linear statistics."}, "https://arxiv.org/abs/2406.09254": {"title": "General Bayesian Predictive Synthesis", "link": "https://arxiv.org/abs/2406.09254", "description": "arXiv:2406.09254v1 Announce Type: new \nAbstract: This study investigates Bayesian ensemble learning for improving the quality of decision-making. We consider a decision-maker who selects an action from a set of candidates based on a policy trained using observations. In our setting, we assume the existence of experts who provide predictive distributions based on their own policies. Our goal is to integrate these predictive distributions within the Bayesian framework. Our proposed method, which we refer to as General Bayesian Predictive Synthesis (GBPS), is characterized by a loss minimization framework and does not rely on parameter estimation, unlike existing studies. 
Inspired by Bayesian predictive synthesis and general Bayes frameworks, we evaluate the performance of our proposed method through simulation studies."}, "https://arxiv.org/abs/2406.08666": {"title": "Interventional Causal Discovery in a Mixture of DAGs", "link": "https://arxiv.org/abs/2406.08666", "description": "arXiv:2406.08666v1 Announce Type: cross \nAbstract: Causal interactions among a group of variables are often modeled by a single causal graph. In some domains, however, these interactions are best described by multiple co-existing causal graphs, e.g., in dynamical systems or genomics. This paper addresses the hitherto unknown role of interventions in learning causal interactions among variables governed by a mixture of causal systems, each modeled by one directed acyclic graph (DAG). Causal discovery from mixtures is fundamentally more challenging than single-DAG causal discovery. Two major difficulties stem from (i) inherent uncertainty about the skeletons of the component DAGs that constitute the mixture and (ii) possibly cyclic relationships across these component DAGs. This paper addresses these challenges and aims to identify edges that exist in at least one component DAG of the mixture, referred to as true edges. First, it establishes matching necessary and sufficient conditions on the size of interventions required to identify the true edges. Next, guided by the necessity results, an adaptive algorithm is designed that learns all true edges using ${\\cal O}(n^2)$ interventions, where $n$ is the number of nodes. Remarkably, the size of the interventions is optimal if the underlying mixture model does not contain cycles across its components. More generally, the gap between the intervention size used by the algorithm and the optimal size is quantified. It is shown to be bounded by the cyclic complexity number of the mixture model, defined as the size of the minimal intervention that can break the cycles in the mixture, which is upper bounded by the number of cycles among the ancestors of a node."}, "https://arxiv.org/abs/2406.08697": {"title": "Orthogonalized Estimation of Difference of $Q$-functions", "link": "https://arxiv.org/abs/2406.08697", "description": "arXiv:2406.08697v1 Announce Type: cross \nAbstract: Offline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of causal contrast functions such as CATE, which is sufficient for optimizing decisions and can adapt to potentially smoother structure. We develop a dynamic generalization of the R-learner (Nie and Wager 2021, Lewis and Syrgkanis 2021) for estimating and optimizing the difference of $Q^\\pi$-functions, $Q^\\pi(s,1)-Q^\\pi(s,0)$ (which can be used to optimize multiple-valued actions). We leverage orthogonal estimation to improve convergence rates in the presence of slower nuisance estimation rates and prove consistency of policy optimization under a margin condition. 
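For orientation on the orthogonalized estimation described in the difference-of-$Q$-functions abstract above, here is a minimal single-period sketch in the spirit of the R-learner it generalizes: outcome residuals are regressed on treatment residuals to target a contrast analogous to $Q^\pi(s,1)-Q^\pi(s,0)$. The simulated data, the gradient-boosting nuisance models, and the omission of cross-fitting are illustrative simplifications, not the paper's estimator.

```python
# Sketch only: residual-on-residual (orthogonalized) estimation of a treatment contrast.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
S = rng.normal(size=(n, 3))                                # state / covariates
e = 1.0 / (1.0 + np.exp(-S[:, 0]))                         # behavior policy P(A=1|S)
A = rng.binomial(1, e)
tau = 1.0 + S[:, 1]                                        # true contrast
Y = S.sum(axis=1) + A * tau + rng.normal(size=n)

m_hat = GradientBoostingRegressor().fit(S, Y).predict(S)                 # E[Y|S]
e_hat = GradientBoostingClassifier().fit(S, A).predict_proba(S)[:, 1]    # P(A=1|S)
e_hat = np.clip(e_hat, 0.01, 0.99)                         # guard against extreme weights

# Weighted least-squares form of the R-learner loss; cross-fitting omitted for brevity.
R_a = A - e_hat
R_y = Y - m_hat
contrast = LinearRegression().fit(S, R_y / R_a, sample_weight=R_a ** 2)
print("estimated contrast coefficients:", contrast.intercept_, contrast.coef_)
```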
The method can leverage black-box nuisance estimators of the $Q$-function and behavior policy to target estimation of a more structured $Q$-function contrast."}, "https://arxiv.org/abs/2406.08709": {"title": "Introducing Diminutive Causal Structure into Graph Representation Learning", "link": "https://arxiv.org/abs/2406.08709", "description": "arXiv:2406.08709v1 Announce Type: cross \nAbstract: When engaging in end-to-end graph representation learning with Graph Neural Networks (GNNs), the intricate causal relationships and rules inherent in graph data pose a formidable challenge for the model in accurately capturing authentic data relationships. A proposed mitigating strategy involves the direct integration of rules or relationships corresponding to the graph data into the model. However, within the domain of graph representation learning, the inherent complexity of graph data obstructs the derivation of a comprehensive causal structure that encapsulates universal rules or relationships governing the entire dataset. Instead, only specialized diminutive causal structures, delineating specific causal relationships within constrained subsets of graph data, emerge as discernible. Motivated by empirical insights, it is observed that GNN models exhibit a tendency to converge towards such specialized causal structures during the training process. Consequently, we posit that the introduction of these specific causal structures is advantageous for the training of GNN models. Building upon this proposition, we introduce a novel method that enables GNN models to glean insights from these specialized diminutive causal structures, thereby enhancing overall performance. Our method specifically extracts causal knowledge from the model representation of these diminutive causal structures and incorporates interchange intervention to optimize the learning process. Theoretical analysis serves to corroborate the efficacy of our proposed method. Furthermore, empirical experiments consistently demonstrate significant performance improvements across diverse datasets."}, "https://arxiv.org/abs/2406.08738": {"title": "Volatility Forecasting Using Similarity-based Parameter Correction and Aggregated Shock Information", "link": "https://arxiv.org/abs/2406.08738", "description": "arXiv:2406.08738v1 Announce Type: cross \nAbstract: We develop a procedure for forecasting the volatility of a time series immediately following a news shock. Adapting the similarity-based framework of Lin and Eck (2020), we exploit series that have experienced similar shocks. We aggregate their shock-induced excess volatilities by positing the shocks to be affine functions of exogenous covariates. The volatility shocks are modeled as random effects and estimated as fixed effects. The aggregation of these estimates is done in service of adjusting the $h$-step-ahead GARCH forecast of the time series under study by an additive term. The adjusted and unadjusted forecasts are evaluated using the unobservable but easily-estimated realized volatility (RV). A real-world application is provided, as are simulation results suggesting the conditions and hyperparameters under which our method thrives."}, "https://arxiv.org/abs/2406.09169": {"title": "Empirical Networks are Sparse: Enhancing Multi-Edge Models with Zero-Inflation", "link": "https://arxiv.org/abs/2406.09169", "description": "arXiv:2406.09169v1 Announce Type: cross \nAbstract: Real-world networks are sparse. 
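The volatility-forecasting abstract above adjusts an $h$-step-ahead GARCH forecast by an additive excess-volatility term aggregated from similar shocks. Below is a minimal sketch of that adjustment step only; the GARCH(1,1) parameters, the simulated returns, and the size of the additive term are illustrative assumptions, and the similarity-based aggregation itself is not shown.

```python
# Sketch only: h-step-ahead GARCH(1,1) variance forecast plus an additive shock adjustment.
import numpy as np

def garch11_forecast(returns, omega, alpha, beta, h):
    """Recursive h-step-ahead conditional-variance forecasts for a GARCH(1,1) model."""
    sigma2 = np.var(returns)                      # initialize at the sample variance
    for r in returns:                             # filter through the observed sample
        sigma2 = omega + alpha * r ** 2 + beta * sigma2
    forecasts = []
    f = sigma2                                    # one-step-ahead variance
    for _ in range(h):
        forecasts.append(f)
        f = omega + (alpha + beta) * f            # iterate the forecast recursion
    return np.array(forecasts)

rng = np.random.default_rng(1)
returns = 0.01 * rng.standard_normal(500)                      # illustrative return series
base = garch11_forecast(returns, omega=1e-6, alpha=0.05, beta=0.90, h=5)
shock_adjustment = 2e-5                                        # aggregated excess volatility (assumed)
adjusted = base + shock_adjustment
print("unadjusted:", base)
print("adjusted:  ", adjusted)
```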
As we show in this article, even when a large number of interactions is observed, most node pairs remain disconnected. We demonstrate that classical multi-edge network models, such as the $G(N,p)$, configuration models, and stochastic block models, fail to accurately capture this phenomenon. To mitigate this issue, zero-inflation must be integrated into these traditional models. Through zero-inflation, we incorporate a mechanism that accounts for the excess number of zeroes (disconnected pairs) observed in empirical data. By performing an analysis on all the datasets from the Sociopatterns repository, we illustrate how zero-inflated models more accurately reflect the sparsity and heavy-tailed edge count distributions observed in empirical data. Our findings underscore that failing to account for these ubiquitous properties in real-world networks inadvertently leads to biased models which do not accurately represent complex systems and their dynamics."}, "https://arxiv.org/abs/2406.09172": {"title": "Generative vs", "link": "https://arxiv.org/abs/2406.09172", "description": "arXiv:2406.09172v1 Announce Type: cross \nAbstract: Learning a parametric model from a given dataset enables capturing intrinsic dependencies between random variables via a parametric conditional probability distribution and, in turn, predicting the value of a label variable given observed variables. In this paper, we undertake a comparative analysis of generative and discriminative approaches, which differ in their construction and the structure of the underlying inference problem. Our objective is to compare the ability of both approaches to leverage information from various sources in epistemic-uncertainty-aware inference via the posterior predictive distribution. We assess the role of a prior distribution, explicit in the generative case and implicit in the discriminative case, leading to a discussion about discriminative models suffering from imbalanced datasets. We next examine the double role played by the observed variables in the generative case, and discuss the compatibility of both approaches with semi-supervised learning. We also provide practical insights and examine how the modeling choice impacts sampling from the posterior predictive distribution. With regard to this, we propose a general sampling scheme enabling supervised learning for both approaches, as well as semi-supervised learning when compatible with the considered modeling approach. Throughout this paper, we illustrate our arguments and conclusions using the example of affine regression, and validate our comparative analysis through classification simulations using neural-network-based models."}, "https://arxiv.org/abs/2406.09387": {"title": "Oblivious subspace embeddings for compressed Tucker decompositions", "link": "https://arxiv.org/abs/2406.09387", "description": "arXiv:2406.09387v1 Announce Type: cross \nAbstract: Emphasis in the tensor literature on random embeddings (tools for low-distortion dimension reduction) for the canonical polyadic (CP) tensor decomposition has left analogous results for the more expressive Tucker decomposition comparatively lacking. This work establishes general Johnson-Lindenstrauss (JL) type guarantees for the estimation of Tucker decompositions when an oblivious random embedding is applied along each mode. 
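A minimal sketch of the zero-inflation mechanism described in the multi-edge network abstract above: a zero-inflated Poisson is fit by maximum likelihood to dyadic edge counts with an excess of zeros. The simulated counts and the simple intercept-only parameterization are illustrative assumptions; the paper works with $G(N,p)$, configuration, and stochastic block models rather than this toy likelihood.

```python
# Sketch only: maximum-likelihood fit of a zero-inflated Poisson to edge counts.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(2)
n_pairs = 5000
true_pi, true_lam = 0.8, 3.0                       # 80% structural zeros (disconnected pairs)
is_zero = rng.random(n_pairs) < true_pi
counts = np.where(is_zero, 0, rng.poisson(true_lam, n_pairs))

def neg_loglik(theta, y):
    pi, lam = expit(theta[0]), np.exp(theta[1])    # keep parameters in their valid ranges
    log_p0 = np.log(pi + (1 - pi) * np.exp(-lam))  # P(Y=0): structural or sampling zero
    log_pk = np.log1p(-pi) - lam + y * np.log(lam) - gammaln(y + 1)
    return -np.sum(np.where(y == 0, log_p0, log_pk))

fit = minimize(neg_loglik, x0=np.zeros(2), args=(counts,), method="BFGS")
pi_hat, lam_hat = expit(fit.x[0]), np.exp(fit.x[1])
print(f"estimated zero-inflation {pi_hat:.3f}, Poisson rate {lam_hat:.3f}")
```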
When these embeddings are drawn from a JL-optimal family, the decomposition can be estimated within $\\varepsilon$ relative error under restrictions on the embedding dimension that are in line with recent CP results. We implement a higher-order orthogonal iteration (HOOI) decomposition algorithm with random embeddings to demonstrate the practical benefits of this approach and its potential to improve the accessibility of otherwise prohibitive tensor analyses. On moderately large face image and fMRI neuroimaging datasets, empirical results show that substantial dimension reduction is possible with minimal increase in reconstruction error relative to traditional HOOI ($\\leq$5% larger error, 50%-60% lower computation time for large models with 50% dimension reduction along each mode). Especially for large tensors, our method outperforms traditional higher-order singular value decomposition (HOSVD) and recently proposed TensorSketch methods."}, "https://arxiv.org/abs/2110.06692": {"title": "A procedure for multiple testing of partial conjunction hypotheses based on a hazard rate inequality", "link": "https://arxiv.org/abs/2110.06692", "description": "arXiv:2110.06692v4 Announce Type: replace \nAbstract: The partial conjunction null hypothesis is tested in order to discover a signal that is present in multiple studies. The standard approach of carrying out a multiple test procedure on the partial conjunction (PC) $p$-values can be extremely conservative. We suggest alleviating this conservativeness, by eliminating many of the conservative PC $p$-values prior to the application of a multiple test procedure. This leads to the following two step procedure: first, select the set with PC $p$-values below a selection threshold; second, within the selected set only, apply a family-wise error rate or false discovery rate controlling procedure on the conditional PC $p$-values. The conditional PC $p$-values are valid if the null p-values are uniform and the combining method is Fisher. The proof of their validity is based on a novel inequality in hazard rate order of partial sums of order statistics which may be of independent interest. We also provide the conditions for which the false discovery rate controlling procedures considered will be below the nominal level. We demonstrate the potential usefulness of our novel method, CoFilter (conditional testing after filtering), for analyzing multiple genome wide association studies of Crohn's disease."}, "https://arxiv.org/abs/2310.07850": {"title": "Conformal prediction with local weights: randomization enables local guarantees", "link": "https://arxiv.org/abs/2310.07850", "description": "arXiv:2310.07850v2 Announce Type: replace \nAbstract: In this work, we consider the problem of building distribution-free prediction intervals with finite-sample conditional coverage guarantees. Conformal prediction (CP) is an increasingly popular framework for building prediction intervals with distribution-free guarantees, but these guarantees only ensure marginal coverage: the probability of coverage is averaged over a random draw of both the training and test data, meaning that there might be substantial undercoverage within certain subpopulations. Instead, ideally, we would want to have local coverage guarantees that hold for each possible value of the test point's features. 
While the impossibility of achieving pointwise local coverage is well established in the literature, many variants of conformal prediction algorithm show favorable local coverage properties empirically. Relaxing the definition of local coverage can allow for a theoretical understanding of this empirical phenomenon. We aim to bridge this gap between theoretical validation and empirical performance by proving achievable and interpretable guarantees for a relaxed notion of local coverage. Building on the localized CP method of Guan (2023) and the weighted CP framework of Tibshirani et al. (2019), we propose a new method, randomly-localized conformal prediction (RLCP), which returns prediction intervals that are not only marginally valid but also achieve a relaxed local coverage guarantee and guarantees under covariate shift. Through a series of simulations and real data experiments, we validate these coverage guarantees of RLCP while comparing it with the other local conformal prediction methods."}, "https://arxiv.org/abs/2311.16529": {"title": "Efficient and Globally Robust Causal Excursion Effect Estimation", "link": "https://arxiv.org/abs/2311.16529", "description": "arXiv:2311.16529v3 Announce Type: replace \nAbstract: Causal excursion effect (CEE) characterizes the effect of an intervention under policies that deviate from the experimental policy. It is widely used to study the effect of time-varying interventions that have the potential to be frequently adaptive, such as those delivered through smartphones. We study the semiparametric efficient estimation of CEE and we derive a semiparametric efficiency bound for CEE with identity or log link functions under working assumptions, in the context of micro-randomized trials. We propose a class of two-stage estimators that achieve the efficiency bound and are robust to misspecified nuisance models. In deriving the asymptotic property of the estimators, we establish a general theory for globally robust Z-estimators with either cross-fitted or non-cross-fitted nuisance parameters. We demonstrate substantial efficiency gain of the proposed estimator compared to existing ones through simulations and a real data application using the Drink Less micro-randomized trial."}, "https://arxiv.org/abs/2110.02318": {"title": "Approximate Message Passing for orthogonally invariant ensembles: Multivariate non-linearities and spectral initialization", "link": "https://arxiv.org/abs/2110.02318", "description": "arXiv:2110.02318v2 Announce Type: replace-cross \nAbstract: We study a class of Approximate Message Passing (AMP) algorithms for symmetric and rectangular spiked random matrix models with orthogonally invariant noise. The AMP iterates have fixed dimension $K \\geq 1$, a multivariate non-linearity is applied in each AMP iteration, and the algorithm is spectrally initialized with $K$ super-critical sample eigenvectors. We derive the forms of the Onsager debiasing coefficients and corresponding AMP state evolution, which depend on the free cumulants of the noise spectral distribution. This extends previous results for such models with $K=1$ and an independent initialization.\n Applying this approach to Bayesian principal components analysis, we introduce a Bayes-OAMP algorithm that uses as its non-linearity the posterior mean conditional on all preceding AMP iterates. 
We describe a practical implementation of this algorithm, where all debiasing and state evolution parameters are estimated from the observed data, and we illustrate the accuracy and stability of this approach in simulations."}, "https://arxiv.org/abs/2305.12883": {"title": "Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors", "link": "https://arxiv.org/abs/2305.12883", "description": "arXiv:2305.12883v3 Announce Type: replace-cross \nAbstract: In recent years, there has been a significant growth in research focusing on minimum $\\ell_2$ norm (ridgeless) interpolation least squares estimators. However, the majority of these analyses have been limited to an unrealistic regression error structure, assuming independent and identically distributed errors with zero mean and common variance. In this paper, we explore prediction risk as well as estimation risk under more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Notably, we establish that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data."}, "https://arxiv.org/abs/2406.09473": {"title": "Multidimensional clustering in judge designs", "link": "https://arxiv.org/abs/2406.09473", "description": "arXiv:2406.09473v1 Announce Type: new \nAbstract: Estimates in judge designs run the risk of being biased due to the many judge identities that are implicitly or explicitly used as instrumental variables. The usual method to analyse judge designs, via a leave-out mean instrument, eliminates this many instrument bias only in case the data are clustered in at most one dimension. What is left out in the mean defines this clustering dimension. How most judge designs cluster their standard errors, however, implies that there are additional clustering dimensions, which makes that a many instrument bias remains. We propose two estimators that are many instrument bias free, also in multidimensional clustered judge designs. The first generalises the one dimensional cluster jackknife instrumental variable estimator, by removing from this estimator the additional bias terms due to the extra dependence in the data. The second models all but one clustering dimensions by fixed effects and we show how these numerous fixed effects can be removed without introducing extra bias. A Monte-Carlo experiment and the revisitation of two judge designs show the empirical relevance of properly accounting for multidimensional clustering in estimation."}, "https://arxiv.org/abs/2406.09521": {"title": "Randomization Inference: Theory and Applications", "link": "https://arxiv.org/abs/2406.09521", "description": "arXiv:2406.09521v1 Announce Type: new \nAbstract: We review approaches to statistical inference based on randomization. Permutation tests are treated as an important special case. Under a certain group invariance property, referred to as the ``randomization hypothesis,'' randomization tests achieve exact control of the Type I error rate in finite samples. Although this unequivocal precision is very appealing, the range of problems that satisfy the randomization hypothesis is somewhat limited. 
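A minimal sketch of the ridgeless (minimum $\ell_2$-norm) interpolator studied in the abstract above, obtained via the pseudoinverse in an overparameterized regime; the dimensions and error distribution below are illustrative assumptions, whereas the paper's focus is on clustered or serially dependent errors.

```python
# Sketch only: minimum l2-norm least squares (ridgeless interpolation) with p > n.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 200                                     # more parameters than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)   # errors need not be i.i.d. in the paper

beta_ridgeless = np.linalg.pinv(X) @ y             # min-norm solution of X beta = y
print("training residual norm:", np.linalg.norm(X @ beta_ridgeless - y))  # ~0: interpolation
print("estimator l2 norm:     ", np.linalg.norm(beta_ridgeless))
```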
We show that randomization tests are often asymptotically, or approximately, valid and efficient in settings that deviate from the conditions required for finite-sample error control. When randomization tests fail to offer even asymptotic Type 1 error control, their asymptotic validity may be restored by constructing an asymptotically pivotal test statistic. Randomization tests can then provide exact error control for tests of highly structured hypotheses with good performance in a wider class of problems. We give a detailed overview of several prominent applications of randomization tests, including two-sample permutation tests, regression, and conformal inference."}, "https://arxiv.org/abs/2406.09597": {"title": "Ridge Regression for Paired Comparisons: A Tractable New Approach, with Application to Premier League Football", "link": "https://arxiv.org/abs/2406.09597", "description": "arXiv:2406.09597v1 Announce Type: new \nAbstract: Paired comparison models, such as Bradley-Terry and Thurstone-Mosteller, are commonly used to estimate relative strengths of pairwise compared items in tournament-style datasets. With predictive performance as primary criterion, we discuss estimation of paired comparison models with a ridge penalty. A new approach is derived which combines empirical Bayes and composite likelihoods without any need to re-fit the model, as a convenient alternative to cross-validation of the ridge tuning parameter. Simulation studies, together with application to 28 seasons of English Premier League football, demonstrate much better predictive accuracy of the new approach relative to ordinary maximum likelihood. While the application of a standard bias-reducing penalty was found to improve appreciably the performance of maximum likelihood, the ridge penalty with tuning as developed here yields greater accuracy still."}, "https://arxiv.org/abs/2406.09625": {"title": "Time Series Forecasting with Many Predictors", "link": "https://arxiv.org/abs/2406.09625", "description": "arXiv:2406.09625v1 Announce Type: new \nAbstract: We propose a novel approach for time series forecasting with many predictors, referred to as the GO-sdPCA, in this paper. The approach employs a variable selection method known as the group orthogonal greedy algorithm and the high-dimensional Akaike information criterion to mitigate the impact of irrelevant predictors. Moreover, a novel technique, called peeling, is used to boost the variable selection procedure so that many factor-relevant predictors can be included in prediction. Finally, the supervised dynamic principal component analysis (sdPCA) method is adopted to account for the dynamic information in factor recovery. In simulation studies, we found that the proposed method adapts well to unknown degrees of sparsity and factor strength, which results in good performance even when the number of relevant predictors is large compared to the sample size. Applying to economic and environmental studies, the proposed method consistently performs well compared to some commonly used benchmarks in one-step-ahead out-sample forecasts."}, "https://arxiv.org/abs/2406.09714": {"title": "Large language model validity via enhanced conformal prediction methods", "link": "https://arxiv.org/abs/2406.09714", "description": "arXiv:2406.09714v1 Announce Type: cross \nAbstract: We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). 
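A minimal sketch of a two-sample permutation test, the canonical special case discussed in the randomization-inference review above; the Gaussian samples and the difference-in-means statistic are illustrative assumptions.

```python
# Sketch only: two-sided, two-sample permutation test with a difference-in-means statistic.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=30)                  # control sample
y = rng.normal(0.5, 1.0, size=30)                  # treated sample
observed = y.mean() - x.mean()

pooled = np.concatenate([x, y])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                 # re-randomize group labels
    count += abs(perm[30:].mean() - perm[:30].mean()) >= abs(observed)
print(f"two-sided permutation p-value: {(count + 1) / (n_perm + 1):.4f}")
```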
Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM's original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on both synthetic and real-world datasets."}, "https://arxiv.org/abs/2406.10086": {"title": "Discovering influential text using convolutional neural networks", "link": "https://arxiv.org/abs/2406.10086", "description": "arXiv:2406.10086v1 Announce Type: cross \nAbstract: Experimental methods for estimating the impacts of text on human evaluation have been widely used in the social sciences. However, researchers in experimental settings are usually limited to testing a small number of pre-specified text treatments. While efforts to mine unstructured texts for features that causally affect outcomes have been ongoing in recent years, these models have primarily focused on the topics or specific words of text, which may not always be the mechanism of the effect. We connect these efforts with NLP interpretability techniques and present a method for flexibly discovering clusters of similar text phrases that are predictive of human reactions to texts using convolutional neural networks. When used in an experimental setting, this method can identify text treatments and their effects under certain assumptions. We apply the method to two datasets. The first enables direct validation of the model's ability to detect phrases known to cause the outcome. The second demonstrates its ability to flexibly discover text treatments with varying textual structures. In both cases, the model learns a greater variety of text treatments compared to benchmark methods, and these text features quantitatively meet or exceed the ability of benchmark methods to predict the outcome."}, "https://arxiv.org/abs/1707.07215": {"title": "Sparse Recovery With Multiple Data Streams: A Sequential Adaptive Testing Approach", "link": "https://arxiv.org/abs/1707.07215", "description": "arXiv:1707.07215v3 Announce Type: replace \nAbstract: Multistage design has been used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multi-stage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and missed discovery rate (MDR). 
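The conformal abstracts above all build on the split-conformal calibration step, in which a threshold is taken as an empirical quantile of calibration scores. A minimal regression sketch of that step follows; the linear model, data, and miscoverage level are illustrative assumptions, and the claim-filtering and conditional extensions described in the paper are not shown.

```python
# Sketch only: split conformal prediction interval with a calibrated score threshold.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(600)
X_tr, y_tr = X[:300], y[:300]                      # proper training set
X_cal, y_cal = X[300:], y[300:]                    # calibration set

model = LinearRegression().fit(X_tr, y_tr)
scores = np.abs(y_cal - model.predict(X_cal))      # nonconformity scores
alpha = 0.1
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n       # conformal quantile level
q_hat = np.quantile(scores, q_level, method="higher")

x_new = np.array([[1.0]])
pred = model.predict(x_new)[0]
print(f"90% marginal prediction interval: [{pred - q_hat:.3f}, {pred + q_hat:.3f}]")
```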
The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, controls the error rates and leads to great savings in total study costs. Numerical studies confirm the effectiveness of SMART. The SMART procedure is illustrated through the analysis of large-scale A/B tests, high-throughput screening and image analysis."}, "https://arxiv.org/abs/2208.00139": {"title": "Another look at forecast trimming for combinations: robustness, accuracy and diversity", "link": "https://arxiv.org/abs/2208.00139", "description": "arXiv:2208.00139v2 Announce Type: replace \nAbstract: Forecast combination is widely recognized as a preferred strategy over forecast selection due to its ability to mitigate the uncertainty associated with identifying a single \"best\" forecast. Nonetheless, sophisticated combinations are often empirically dominated by simple averaging, which is commonly attributed to the weight estimation error. The issue becomes more problematic when dealing with a forecast pool containing a large number of individual forecasts. In this paper, we propose a new forecast trimming algorithm to identify an optimal subset from the original forecast pool for forecast combination tasks. In contrast to existing approaches, our proposed algorithm simultaneously takes into account the robustness, accuracy and diversity issues of the forecast pool, rather than isolating each one of these issues. We also develop five forecast trimming algorithms as benchmarks, including one trimming-free algorithm and several trimming algorithms that isolate each one of the three key issues. Experimental results show that our algorithm achieves superior forecasting performance in general in terms of both point forecasts and prediction intervals. Nevertheless, we argue that diversity does not always have to be addressed in forecast trimming. Based on the results, we offer some practical guidelines on the selection of forecast trimming algorithms for a target series."}, "https://arxiv.org/abs/2302.08348": {"title": "A robust statistical framework for cyber-vulnerability prioritisation under partial information in threat intelligence", "link": "https://arxiv.org/abs/2302.08348", "description": "arXiv:2302.08348v4 Announce Type: replace \nAbstract: Proactive cyber-risk assessment is gaining momentum due to the wide range of sectors that can benefit from the prevention of cyber-incidents by preserving integrity, confidentiality, and the availability of data. The rising attention to cybersecurity also results from the increasing connectivity of cyber-physical systems, which generates multiple sources of uncertainty about emerging cyber-vulnerabilities. This work introduces a robust statistical framework for quantitative and qualitative reasoning under uncertainty about cyber-vulnerabilities and their prioritisation. Specifically, we take advantage of mid-quantile regression to deal with ordinal risk assessments, and we compare it to current alternatives for cyber-risk ranking and graded responses. For this purpose, we identify a novel accuracy measure suited for rank invariance under partial knowledge of the whole set of existing vulnerabilities. The model is tested on both simulated and real data from selected databases that support the evaluation, exploitation, or response to cyber-vulnerabilities in realistic contexts. 
Such datasets allow us to compare multiple models and accuracy measures, discussing the implications of partial knowledge about cyber-vulnerabilities on threat intelligence and decision-making in operational scenarios."}, "https://arxiv.org/abs/2303.03009": {"title": "Identification of Ex ante Returns Using Elicited Choice Probabilities: an Application to Preferences for Public-sector Jobs", "link": "https://arxiv.org/abs/2303.03009", "description": "arXiv:2303.03009v2 Announce Type: replace \nAbstract: Ex ante returns, the net value that agents perceive before they take an investment decision, are understood as the main drivers of individual decisions. Hence, their distribution in a population is an important tool for counterfactual analysis and policy evaluation. This paper studies the identification of the population distribution of ex ante returns using stated choice experiments, in the context of binary investment decisions. The environment is characterised by uncertainty about future outcomes, with some uncertainty being resolved over time. In this context, each individual holds a probability distribution over different levels of returns. The paper provides novel, nonparametric identification results for the population distribution of returns, accounting for uncertainty. It complements these with a nonparametric/semiparametric estimation methodology, which is new to the stated-preference literature. Finally, it uses these results to study the preference of high ability students in Cote d'Ivoire for public-sector jobs and how the competition for talent affects the expansion of the private sector."}, "https://arxiv.org/abs/2305.01201": {"title": "Estimating Input Coefficients for Regional Input-Output Tables Using Deep Learning with Mixup", "link": "https://arxiv.org/abs/2305.01201", "description": "arXiv:2305.01201v3 Announce Type: replace \nAbstract: An input-output table is an important data for analyzing the economic situation of a region. Generally, the input-output table for each region (regional input-output table) in Japan is not always publicly available, so it is necessary to estimate the table. In particular, various methods have been developed for estimating input coefficients, which are an important part of the input-output table. Currently, non-survey methods are often used to estimate input coefficients because they require less data and computation, but these methods have some problems, such as discarding information and requiring additional data for estimation.\n In this study, the input coefficients are estimated by approximating the generation process with an artificial neural network (ANN) to mitigate the problems of the non-survey methods and to estimate the input coefficients with higher precision. To avoid over-fitting due to the small data used, data augmentation, called mixup, is introduced to increase the data size by generating virtual regions through region composition and scaling.\n By comparing the estimates of the input coefficients with those of Japan as a whole, it is shown that the accuracy of the method of this research is higher and more stable than that of the conventional non-survey methods. 
In addition, the estimated input coefficients for the three cities in Japan are generally close to the published values for each city."}, "https://arxiv.org/abs/2310.02600": {"title": "Neural Bayes Estimators for Irregular Spatial Data using Graph Neural Networks", "link": "https://arxiv.org/abs/2310.02600", "description": "arXiv:2310.02600v2 Announce Type: replace \nAbstract: Neural Bayes estimators are neural networks that approximate Bayes estimators in a fast and likelihood-free manner. Although they are appealing to use with spatial models, where estimation is often a computational bottleneck, neural Bayes estimators in spatial applications have, to date, been restricted to data collected over a regular grid. These estimators are also currently dependent on a prescribed set of spatial locations, which means that the neural network needs to be re-trained for new data sets; this renders them impractical in many applications and impedes their widespread adoption. In this work, we employ graph neural networks to tackle the important problem of parameter point estimation from data collected over arbitrary spatial locations. In addition to extending neural Bayes estimation to irregular spatial data, our architecture leads to substantial computational benefits, since the estimator can be used with any configuration or number of locations and independent replicates, thus amortising the cost of training for a given spatial model. We also facilitate fast uncertainty quantification by training an accompanying neural Bayes estimator that approximates a set of marginal posterior quantiles. We illustrate our methodology on Gaussian and max-stable processes. Finally, we showcase our methodology on a data set of global sea-surface temperature, where we estimate the parameters of a Gaussian process model in 2161 spatial regions, each containing thousands of irregularly-spaced data points, in just a few minutes with a single graphics processing unit."}, "https://arxiv.org/abs/2312.09825": {"title": "Extreme value methods for estimating rare events in Utopia", "link": "https://arxiv.org/abs/2312.09825", "description": "arXiv:2312.09825v2 Announce Type: replace \nAbstract: To capture the extremal behaviour of complex environmental phenomena in practice, flexible techniques for modelling tail behaviour are required. In this paper, we introduce a variety of such methods, which were used by the Lancopula Utopiversity team to tackle the EVA (2023) Conference Data Challenge. This data challenge was split into four challenges, labelled C1-C4. Challenges C1 and C2 comprise univariate problems, where the goal is to estimate extreme quantiles for a non-stationary time series exhibiting several complex features. For these, we propose a flexible modelling technique, based on generalised additive models, with diagnostics indicating generally good performance for the observed data. Challenges C3 and C4 concern multivariate problems where the focus is on estimating joint extremal probabilities. For challenge C3, we propose an extension of available models in the multivariate literature and use this framework to estimate extreme probabilities in the presence of non-stationary dependence. 
Finally, for challenge C4, which concerns a 50 dimensional random vector, we employ a clustering technique to achieve dimension reduction and use a conditional modelling approach to estimate extremal probabilities across independent groups of variables."}, "https://arxiv.org/abs/2312.12361": {"title": "Improved multifidelity Monte Carlo estimators based on normalizing flows and dimensionality reduction techniques", "link": "https://arxiv.org/abs/2312.12361", "description": "arXiv:2312.12361v2 Announce Type: replace \nAbstract: We study the problem of multifidelity uncertainty propagation for computationally expensive models. In particular, we consider the general setting where the high-fidelity and low-fidelity models have a dissimilar parameterization both in terms of number of random inputs and their probability distributions, which can be either known in closed form or provided through samples. We derive novel multifidelity Monte Carlo estimators which rely on a shared subspace between the high-fidelity and low-fidelity models where the parameters follow the same probability distribution, i.e., a standard Gaussian. We build the shared space employing normalizing flows to map different probability distributions into a common one, together with linear and nonlinear dimensionality reduction techniques, active subspaces and autoencoders, respectively, which capture the subspaces where the models vary the most. We then compose the existing low-fidelity model with these transformations and construct modified models with an increased correlation with the high-fidelity model, which therefore yield multifidelity Monte Carlo estimators with reduced variance. A series of numerical experiments illustrate the properties and advantages of our approaches."}, "https://arxiv.org/abs/2105.04981": {"title": "Quantifying patient and neighborhood risks for stillbirth and preterm birth in Philadelphia with a Bayesian spatial model", "link": "https://arxiv.org/abs/2105.04981", "description": "arXiv:2105.04981v5 Announce Type: replace-cross \nAbstract: Stillbirth and preterm birth are major public health challenges. Using a Bayesian spatial model, we quantified patient-specific and neighborhood risks of stillbirth and preterm birth in the city of Philadelphia. We linked birth data from electronic health records at Penn Medicine hospitals from 2010 to 2017 with census-tract-level data from the United States Census Bureau. We found that both patient-level characteristics (e.g. self-identified race/ethnicity) and neighborhood-level characteristics (e.g. violent crime) were significantly associated with patients' risk of stillbirth or preterm birth. Our neighborhood analysis found that higher-risk census tracts had 2.68 times the average risk of stillbirth and 2.01 times the average risk of preterm birth compared to lower-risk census tracts. Higher neighborhood rates of women in poverty or on public assistance were significantly associated with greater neighborhood risk for these outcomes, whereas higher neighborhood rates of college-educated women or women in the labor force were significantly associated with lower risk. Several of these neighborhood associations were missed by the patient-level analysis. These results suggest that neighborhood-level analyses of adverse pregnancy outcomes can reveal nuanced relationships and, thus, should be considered by epidemiologists. 
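A minimal sketch of the control-variate structure behind multifidelity Monte Carlo estimators such as those in the abstract above: a small number of expensive high-fidelity evaluations is combined with many cheap low-fidelity evaluations on shared standard-Gaussian inputs. The two models, sample sizes, and control-variate weight are illustrative assumptions; the normalizing-flow and dimension-reduction machinery of the paper is omitted.

```python
# Sketch only: multifidelity Monte Carlo estimate via a low-fidelity control variate.
import numpy as np

rng = np.random.default_rng(6)
f_high = lambda z: np.exp(0.3 * z) + 0.1 * z ** 2     # expensive model (assumed)
f_low = lambda z: 1.0 + 0.3 * z + 0.05 * z ** 2       # cheap correlated surrogate (assumed)

n_hi, n_lo = 100, 10_000
z_hi = rng.standard_normal(n_hi)                       # shared standard-Gaussian inputs
z_lo = rng.standard_normal(n_lo)

y_hi, y_lo_on_hi = f_high(z_hi), f_low(z_hi)
cov = np.cov(y_hi, y_lo_on_hi)
alpha = cov[0, 1] / cov[1, 1]                          # control-variate weight
mfmc = y_hi.mean() + alpha * (f_low(z_lo).mean() - y_lo_on_hi.mean())
print(f"plain MC (high fidelity only): {y_hi.mean():.4f}   multifidelity: {mfmc:.4f}")
```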
Our findings can potentially guide place-based public health interventions to reduce stillbirth and preterm birth rates."}, "https://arxiv.org/abs/2406.10308": {"title": "Quick and Simple Kernel Differential Equation Regression Estimators for Data with Sparse Design", "link": "https://arxiv.org/abs/2406.10308", "description": "arXiv:2406.10308v1 Announce Type: new \nAbstract: Local polynomial regression of order at least one often performs poorly in regions of sparse data. Local constant regression is exceptional in this regard, though it is the least accurate method in general, especially at the boundaries of the data. Incorporating information from differential equations which may approximately or exactly hold is one way of extending the sparse design capacity of local constant regression while reducing bias and variance. A nonparametric regression method that exploits first order differential equations is introduced in this paper and applied to noisy mouse tumour growth data. Asymptotic biases and variances of kernel estimators using Taylor polynomials with different degrees are discussed. Model comparison is performed for different estimators through simulation studies under various scenarios which simulate exponential-type growth."}, "https://arxiv.org/abs/2406.10360": {"title": "Causal inference for N-of-1 trials", "link": "https://arxiv.org/abs/2406.10360", "description": "arXiv:2406.10360v1 Announce Type: new \nAbstract: The aim of personalized medicine is to tailor treatment decisions to individuals' characteristics. N-of-1 trials are within-person crossover trials that hold the promise of targeting individual-specific effects. While the idea behind N-of-1 trials might seem simple, analyzing and interpreting N-of-1 trials is not straightforward. In particular, there exists confusion about the role of randomization in this design, the (implicit) target estimand, and the need for covariate adjustment. Here we ground N-of-1 trials in a formal causal inference framework and formalize intuitive claims from the N-of-1 trial literature. We focus on causal inference from a single N-of-1 trial and define a conditional average treatment effect (CATE) that represents a target in this setting, which we call the U-CATE. We discuss the assumptions sufficient for identification and estimation of the U-CATE under different causal models in which the treatment schedule is assigned at baseline. A simple mean difference is shown to be an unbiased, asymptotically normal estimator of the U-CATE in simple settings, such as when participants have stable conditions (e.g., chronic pain) and interventions have effects limited in time (no carryover). We also consider settings where carryover effects, trends over time, time-varying common causes of the outcome, and outcome-outcome effects are present. In these more complex settings, we show that a time-varying g-formula identifies the U-CATE under explicit assumptions. Finally, we analyze data from N-of-1 trials about acne symptoms. Using this example, we show how different assumptions about the underlying data generating process can lead to different analytical strategies in N-of-1 trials."}, "https://arxiv.org/abs/2406.10473": {"title": "Design-based variance estimation of the H\\'ajek effect estimator in stratified and clustered experiments", "link": "https://arxiv.org/abs/2406.10473", "description": "arXiv:2406.10473v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are used to evaluate treatment effects. 
When individuals are grouped together, clustered RCTs are conducted. Stratification is recommended to reduce imbalance of baseline covariates between treatment and control. In practice, this can lead to comparisons between clusters of very different sizes. As a result, direct adjustment estimators that average differences of means within the strata may be inconsistent. We study differences of inverse probability weighted means of a treatment and a control group -- H\\'ajek effect estimators -- under two common forms of stratification: small strata that increase in number; or larger strata with growing numbers of clusters in each. Under either scenario, mild conditions give consistency and asymptotic Normality. We propose a variance estimator applicable to designs with any number of strata and strata of any size. We describe a special use of the variance estimator that improves small sample performance of Wald-type confidence intervals. The H\\'ajek effect estimator lends itself to covariance adjustment, and our variance estimator remains applicable. Simulations and real-world applications in children's nutrition and education confirm favorable operating characteristics, demonstrating advantages of the H\\'ajek effect estimator beyond its simplicity and ease of use."}, "https://arxiv.org/abs/2406.10499": {"title": "Functional Clustering for Longitudinal Associations between County-Level Social Determinants of Health and Stroke Mortality in the US", "link": "https://arxiv.org/abs/2406.10499", "description": "arXiv:2406.10499v1 Announce Type: new \nAbstract: Understanding longitudinally changing associations between Social determinants of health (SDOH) and stroke mortality is crucial for timely stroke management. Previous studies have revealed a significant regional disparity in the SDOH -- stroke mortality associations. However, they do not develop data-driven methods based on these longitudinal associations for regional division in stroke control. To fill this gap, we propose a novel clustering method for SDOH -- stroke mortality associations in the US counties. To enhance interpretability and statistical efficiency of the clustering outcomes, we introduce a new class of smoothness-sparsity pursued penalties for simultaneous clustering and variable selection in the longitudinal associations. As a result, we can identify important SDOH that contribute to longitudinal changes in the stroke mortality, facilitating clustering of US counties into several regions based on how these SDOH relate to stroke mortality. The effectiveness of our proposed method is demonstrated through extensive numerical studies. By applying our method to a county-level SDOH and stroke mortality longitudinal data, we identify 18 important SDOH for stroke mortality and divide the US counties into two clusters based on these selected SDOH. Our findings unveil complex regional heterogeneity in the longitudinal associations between SDOH and stroke mortality, providing valuable insights in region-specific SDOH adjustments for mitigating stroke mortality."}, "https://arxiv.org/abs/2406.10554": {"title": "Causal Inference with Outcomes Truncated by Death and Missing Not at Random", "link": "https://arxiv.org/abs/2406.10554", "description": "arXiv:2406.10554v1 Announce Type: new \nAbstract: In clinical trials, principal stratification analysis is commonly employed to address the issue of truncation by death, where a subject dies before the outcome can be measured. 
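A minimal sketch of the H\'ajek effect estimator defined in the abstract above, i.e. the difference of inverse-probability-weighted means of treated and control observations; the assignment probabilities and outcomes are illustrative assumptions, and the stratified/clustered design and the proposed variance estimator are not shown.

```python
# Sketch only: Hajek effect estimator with known assignment probabilities.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
pi = rng.uniform(0.3, 0.7, size=n)                 # known assignment probabilities
Z = rng.binomial(1, pi)                            # treatment indicator
Y = 1.0 + 2.0 * Z + rng.standard_normal(n)         # outcomes with true effect 2

w1, w0 = Z / pi, (1 - Z) / (1 - pi)                # inverse probability weights
hajek = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
print(f"Hajek effect estimate: {hajek:.3f}")
```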
However, in practice, many survivor outcomes may remain uncollected or be missing not at random, posing a challenge to standard principal stratification analyses. In this paper, we explore the identification, estimation, and bounds of the average treatment effect within a subpopulation of individuals who would potentially survive under both treatment and control conditions. We show that the causal parameter of interest can be identified by introducing a proxy variable that affects the outcome only through the principal strata, while requiring that the treatment variable does not directly affect the missingness mechanism. Subsequently, we propose an approach for estimating causal parameters and derive nonparametric bounds in cases where identification assumptions are violated. We illustrate the performance of the proposed method through simulation studies and a real dataset obtained from a Human Immunodeficiency Virus (HIV) study."}, "https://arxiv.org/abs/2406.10612": {"title": "Producing treatment hierarchies in network meta-analysis using probabilistic models and treatment-choice criteria", "link": "https://arxiv.org/abs/2406.10612", "description": "arXiv:2406.10612v1 Announce Type: new \nAbstract: A key output of network meta-analysis (NMA) is the relative ranking of the treatments; nevertheless, it has attracted a lot of criticism. This is mainly due to the fact that ranking is an influential output and prone to over-interpretations even when relative effects imply small differences between treatments. To date, common ranking methods rely on metrics that lack a straightforward interpretation, while it is still unclear how to measure their uncertainty. We introduce a novel framework for estimating treatment hierarchies in NMA. At first, we formulate a mathematical expression that defines a treatment choice criterion (TCC) based on clinically important values. This TCC is applied to the study treatment effects to generate paired data indicating treatment preferences or ties. Then, we synthesize the paired data across studies using an extension of the so-called \"Bradley-Terry\" model. We assign to each treatment a latent variable interpreted as the treatment \"ability\" and we estimate the ability parameters within a regression model. Higher ability estimates correspond to higher positions in the final ranking. We further extend our model to adjust for covariates that may affect treatment selection. We illustrate the proposed approach and compare it with alternatives in two datasets: a network comparing 18 antidepressants for major depression and a network comparing 6 antihypertensives for the incidence of diabetes. Our approach provides a robust and interpretable treatment hierarchy which accounts for clinically important values and is presented alongside with uncertainty measures. Overall, the proposed framework offers a novel approach for ranking in NMA based on concrete criteria and preserves from over-interpretation of unimportant differences between treatments."}, "https://arxiv.org/abs/2406.10733": {"title": "A Laplace transform-based test for the equality of positive semidefinite matrix distributions", "link": "https://arxiv.org/abs/2406.10733", "description": "arXiv:2406.10733v1 Announce Type: new \nAbstract: In this paper, we present a novel test for determining equality in distribution of matrix distributions. Our approach is based on the integral squared difference of the empirical Laplace transforms with respect to the noncentral Wishart measure. 
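A minimal sketch of the Bradley-Terry "ability" model that the ranking framework in the network meta-analysis abstract above extends: the probability that treatment $i$ is preferred to treatment $j$ depends on the difference of latent abilities, estimated here by maximum likelihood from simulated pairwise preferences. The data, the fixed comparison count, and the omission of ties and covariates are illustrative simplifications.

```python
# Sketch only: maximum-likelihood Bradley-Terry abilities from pairwise preference counts.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n_treat = 4
abilities_true = np.array([0.0, 0.5, 1.0, 1.5])
pairs, wins = [], []
for i in range(n_treat):
    for j in range(i + 1, n_treat):
        p_ij = 1.0 / (1.0 + np.exp(-(abilities_true[i] - abilities_true[j])))
        pairs.append((i, j))
        wins.append(rng.binomial(50, p_ij))          # i preferred in `wins` of 50 comparisons

def neg_loglik(a):
    a = np.append(a, 0.0)                            # fix one ability at 0 for identifiability
    ll = 0.0
    for (i, j), w in zip(pairs, wins):
        p = 1.0 / (1.0 + np.exp(-(a[i] - a[j])))
        ll += w * np.log(p) + (50 - w) * np.log(1 - p)
    return -ll

fit = minimize(neg_loglik, x0=np.zeros(n_treat - 1), method="BFGS")
abilities_hat = np.append(fit.x, 0.0)
print("estimated abilities:", np.round(abilities_hat, 2))
print("ranking (best first):", np.argsort(-abilities_hat))
```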
We conduct an extensive power study to assess the performance of the test and determine the optimal choice of parameters. Furthermore, we demonstrate the applicability of the test on financial and non-life insurance data, illustrating its effectiveness in practical scenarios."}, "https://arxiv.org/abs/2406.10792": {"title": "Data-Adaptive Identification of Subpopulations Vulnerable to Chemical Exposures using Stochastic Interventions", "link": "https://arxiv.org/abs/2406.10792", "description": "arXiv:2406.10792v1 Announce Type: new \nAbstract: In environmental epidemiology, identifying subpopulations vulnerable to chemical exposures and those who may benefit differently from exposure-reducing policies is essential. For instance, sex-specific vulnerabilities, age, and pregnancy are critical factors for policymakers when setting regulatory guidelines. However, current semi-parametric methods for heterogeneous treatment effects are often limited to binary exposures and function as black boxes, lacking clear, interpretable rules for subpopulation-specific policy interventions. This study introduces a novel method using cross-validated targeted minimum loss-based estimation (TMLE) paired with a data-adaptive target parameter strategy to identify subpopulations with the most significant differential impact from simulated policy interventions that reduce exposure. Our approach is assumption-lean, allowing for the integration of machine learning while still yielding valid confidence intervals. We demonstrate the robustness of our methodology through simulations and application to NHANES data. Our analysis of NHANES data for persistent organic pollutants on leukocyte telomere length (LTL) identified age as the maximum effect modifier. Specifically, we found that exposure to 3,3',4,4',5-pentachlorobiphenyl (pcnb) consistently had a differential impact on LTL, with a one standard deviation reduction in exposure leading to a more pronounced increase in LTL among younger populations compared to older ones. We offer our method as an open-source software package, \\texttt{EffectXshift}, enabling researchers to investigate the effect modification of continuous exposures. The \\texttt{EffectXshift} package provides clear and interpretable results, informing targeted public health interventions and policy decisions."}, "https://arxiv.org/abs/2406.10837": {"title": "EM Estimation of Conditional Matrix Variate $t$ Distributions", "link": "https://arxiv.org/abs/2406.10837", "description": "arXiv:2406.10837v1 Announce Type: new \nAbstract: Conditional matrix variate student $t$ distribution was introduced by Battulga (2024a). In this paper, we propose a new version of the conditional matrix variate student $t$ distribution. The paper provides EM algorithms, which estimate parameters of the conditional matrix variate student $t$ distributions, including general cases and special cases with Minnesota prior."}, "https://arxiv.org/abs/2406.10962": {"title": "SynthTree: Co-supervised Local Model Synthesis for Explainable Prediction", "link": "https://arxiv.org/abs/2406.10962", "description": "arXiv:2406.10962v1 Announce Type: new \nAbstract: Explainable machine learning (XML) has emerged as a major challenge in artificial intelligence (AI). Although black-box models such as Deep Neural Networks and Gradient Boosting often exhibit exceptional predictive accuracy, their lack of interpretability is a notable drawback, particularly in domains requiring transparency and trust. 
This paper tackles this core AI problem by proposing a novel method to enhance explainability with minimal accuracy loss, using a Mixture of Linear Models (MLM) estimated under the co-supervision of black-box models. We have developed novel methods for estimating MLM by leveraging AI techniques. Specifically, we explore two approaches for partitioning the input space: agglomerative clustering and decision trees. The agglomerative clustering approach provides greater flexibility in model construction, while the decision tree approach further enhances explainability, yielding a decision tree model with linear or logistic regression models at its leaf nodes. Comparative analyses with widely-used and state-of-the-art predictive models demonstrate the effectiveness of our proposed methods. Experimental results show that statistical models can significantly enhance the explainability of AI, thereby broadening their potential for real-world applications. Our findings highlight the critical role that statistical methodologies can play in advancing explainable AI."}, "https://arxiv.org/abs/2406.11184": {"title": "HEDE: Heritability estimation in high dimensions by Ensembling Debiased Estimators", "link": "https://arxiv.org/abs/2406.11184", "description": "arXiv:2406.11184v1 Announce Type: new \nAbstract: Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results. Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank."}, "https://arxiv.org/abs/2406.11216": {"title": "Bayesian Hierarchical Modelling of Noisy Gamma Processes: Model Formulation, Identifiability, Model Fitting, and Extensions to Unit-to-Unit Variability", "link": "https://arxiv.org/abs/2406.11216", "description": "arXiv:2406.11216v1 Announce Type: new \nAbstract: The gamma process is a natural model for monotonic degradation processes. In practice, it is desirable to extend the single gamma process to incorporate measurement error and to construct models for the degradation of several nominally identical units. 
In this paper, we show how these extensions are easily facilitated through the Bayesian hierarchical modelling framework. Following the precepts of the Bayesian statistical workflow, we show the principled construction of a noisy gamma process model. We also reparameterise the gamma process to simplify the specification of priors and make it obvious how the single gamma process model can be extended to include unit-to-unit variability or covariates. We first fit the noisy gamma process model to a single simulated degradation trace. In doing so, we find an identifiability problem between the volatility of the gamma process and the measurement error when there are only a few noisy degradation observations. However, this lack of identifiability can be resolved by including extra information in the analysis through a stronger prior or extra data that informs one of the non-identifiable parameters, or by borrowing information from multiple units. We then explore extensions of the model to account for unit-to-unit variability and demonstrate them using a crack-propagation data set with added measurement error. Lastly, we perform model selection in a fully Bayesian framework by using cross-validation to approximate the expected log probability density of a new observation. We also show how failure time distributions with uncertainty intervals can be calculated for new units or units that are currently under test but are yet to fail."}, "https://arxiv.org/abs/2406.11306": {"title": "Bayesian Variable Selection via Hierarchical Gaussian Process Model in Computer Experiments", "link": "https://arxiv.org/abs/2406.11306", "description": "arXiv:2406.11306v1 Announce Type: new \nAbstract: Identifying the active factors that have significant impacts on the output of a complex system is an important but challenging variable selection problem in computer experiments. In this paper, a Bayesian hierarchical Gaussian process model is developed, and latent indicator variables are embedded into this setting to label the important variables. The parameter estimation and variable selection can be carried out simultaneously in a full Bayesian framework through an efficient Markov Chain Monte Carlo (MCMC) method -- the Metropolis-within-Gibbs sampler. The superior performance of the proposed method compared with related competitors is demonstrated through the analysis of simulated examples and a practical application."}, "https://arxiv.org/abs/2406.11399": {"title": "Spillover Detection for Donor Selection in Synthetic Control Models", "link": "https://arxiv.org/abs/2406.11399", "description": "arXiv:2406.11399v1 Announce Type: new \nAbstract: Synthetic control (SC) models are widely used to estimate causal effects in settings with observational time-series data. To identify the causal effect on a target unit, SC requires the existence of correlated units that are not impacted by the intervention. Given one of these potential donor units, how can we decide whether it is in fact a valid donor - that is, one not subject to spillover effects from the intervention? Such a decision typically requires appealing to strong a priori domain knowledge specifying the units, which becomes infeasible in situations with large pools of potential donors. In this paper, we introduce a practical, theoretically-grounded donor selection procedure, aiming to weaken this domain knowledge requirement. 
Our main result is a theorem that yields the assumptions required to identify donor values at post-intervention time points using only pre-intervention data. We show how this theorem - and the assumptions underpinning it - can be turned into a practical method for detecting potential spillover effects and excluding invalid donors when constructing SCs. Importantly, we employ sensitivity analysis to formally bound the bias in our SC causal estimate in situations where an excluded donor was indeed valid, or where a selected donor was invalid. Using ideas from the proximal causal inference and instrumental variables literature, we show that the excluded donors can nevertheless be leveraged to further debias causal effect estimates. Finally, we illustrate our donor selection procedure on both simulated and real-world datasets."}, "https://arxiv.org/abs/2406.11467": {"title": "Resilience of international oil trade networks under extreme event shock-recovery simulations", "link": "https://arxiv.org/abs/2406.11467", "description": "arXiv:2406.11467v1 Announce Type: new \nAbstract: With the frequent occurrence of black swan events, the global energy security situation has become increasingly complex and severe. Assessing the resilience of the international oil trade network (iOTN) is crucial for evaluating its ability to withstand extreme shocks and recover thereafter, ensuring energy security. We overcome the limitations of discrete historical data by developing a simulation model for extreme event shock-recovery in the iOTNs. We introduce a network efficiency indicator to measure oil resource allocation efficiency and evaluate network performance. Then, we construct a resilience index to explore the resilience of the iOTNs from the dimensions of resistance and recoverability. Our findings indicate that extreme events can lead to sharp declines in the performance of the iOTNs, especially when economies with significant trading positions and relations suffer shocks. The upward trend in recoverability and resilience reflects the self-organizing nature of the iOTNs, demonstrating their capacity for optimizing their own structure and functionality. Unlike traditional energy security research based solely on discrete historical data or resistance indicators, our model evaluates resilience from multiple dimensions, offering insights for global energy governance systems while providing diverse perspectives for various economies to mitigate risks and uphold energy security."}, "https://arxiv.org/abs/2406.11573": {"title": "Bayesian Outcome Weighted Learning", "link": "https://arxiv.org/abs/2406.11573", "description": "arXiv:2406.11573v1 Announce Type: new \nAbstract: One of the primary goals of statistical precision medicine is to learn optimal individualized treatment rules (ITRs). The classification-based, or machine learning-based, approach to estimating optimal ITRs was first introduced in outcome-weighted learning (OWL). OWL recasts the optimal ITR learning problem into a weighted classification problem, which can be solved using machine learning methods, e.g., support vector machines. In this paper, we introduce a Bayesian formulation of OWL. Starting from the OWL objective function, we generate a pseudo-likelihood which can be expressed as a scale mixture of normal distributions. A Gibbs sampling algorithm is developed to sample the posterior distribution of the parameters. 
In addition to providing a strategy for learning an optimal ITR, Bayesian OWL provides a natural, probabilistic approach to estimate uncertainty in ITR treatment recommendations themselves. We demonstrate the performance of our method through several simulation studies."}, "https://arxiv.org/abs/2406.11584": {"title": "The analysis of paired comparison data in the presence of cyclicality and intransitivity", "link": "https://arxiv.org/abs/2406.11584", "description": "arXiv:2406.11584v1 Announce Type: new \nAbstract: A principled approach to cyclicality and intransitivity in cardinal paired comparison data is developed within the framework of graphical linear models. Fundamental to our developments is a detailed understanding and study of the parameter space which accommodates cyclicality and intransitivity. In particular, the relationships between the reduced, completely transitive model, the full, not necessarily transitive model, and all manner of intermediate models are explored for both complete and incomplete paired comparison graphs. It is shown that identifying cyclicality and intransitivity reduces to a model selection problem and a new method for model selection employing geometrical insights, unique to the problem at hand, is proposed. The large sample properties of the estimators as well as guarantees on the selected model are provided. It is thus shown that in large samples all cyclicalities and intransitivities can be identified. The method is exemplified using simulations and the analysis of an illustrative example."}, "https://arxiv.org/abs/2406.11585": {"title": "Bayesian regression discontinuity design with unknown cutoff", "link": "https://arxiv.org/abs/2406.11585", "description": "arXiv:2406.11585v1 Announce Type: new \nAbstract: Regression discontinuity design (RDD) is a quasi-experimental approach used to estimate the causal effects of an intervention assigned based on a cutoff criterion. RDD exploits the idea that close to the cutoff units below and above are similar; hence, they can be meaningfully compared. Consequently, the causal effect can be estimated only locally at the cutoff point. This makes the cutoff point an essential element of RDD. However, especially in medical applications, the exact cutoff location may not always be disclosed to the researcher, and even when it is, the actual location may deviate from the official one. As we illustrate on the application of RDD to the HIV treatment eligibility data, estimating the causal effect at an incorrect cutoff point leads to meaningless results. Moreover, since the cutoff criterion often acts as a guideline rather than as a strict rule, the location of the cutoff may be unclear from the data. The method we present can be applied both as an estimation and validation tool in RDD. We use a Bayesian approach to incorporate prior knowledge and uncertainty about the cutoff location in the causal effect estimation. At the same time, our Bayesian model LoTTA is fitted globally to the whole data, whereas RDD is a local, boundary point estimation problem. 
In this work, we address a natural question that arises: how to make Bayesian inference more local to render a meaningful and powerful estimate of the treatment effect?"}, "https://arxiv.org/abs/2406.11806": {"title": "A conservation law for posterior predictive variance", "link": "https://arxiv.org/abs/2406.11806", "description": "arXiv:2406.11806v1 Announce Type: new \nAbstract: We use the law of total variance to generate multiple expressions for the posterior predictive variance in Bayesian hierarchical models. These expressions are sums of terms involving conditional expectations and conditional variances. Since the posterior predictive variance is fixed given the hierarchical model, it represents a constant quantity that is conserved over the various expressions for it. The terms in the expressions can be assessed in absolute or relative terms to understand the main contributors to the length of prediction intervals. Also, sometimes these terms can be interpreted in the context of the hierarchical model. We show several examples, closed-form and computational, to illustrate the uses of this approach in model assessment."}, "https://arxiv.org/abs/2406.10366": {"title": "Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework", "link": "https://arxiv.org/abs/2406.10366", "description": "arXiv:2406.10366v1 Announce Type: cross \nAbstract: Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal on examples of commonly used evaluation methodologies - involving cross-validation, clustering evaluation, and LLM benchmarking - that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how the estimands framework can help uncover underlying issues, their causes, and potential solutions. Ultimately, we believe this framework can improve the validity of evaluations through better-aligned inference, and help decision-makers and model users interpret reported results more effectively."}, "https://arxiv.org/abs/2406.10464": {"title": "The data augmentation algorithm", "link": "https://arxiv.org/abs/2406.10464", "description": "arXiv:2406.10464v1 Announce Type: cross \nAbstract: Data augmentation (DA) algorithms are popular Markov chain Monte Carlo (MCMC) algorithms often used for sampling from intractable probability distributions. This review article comprehensively surveys DA MCMC algorithms, highlighting their theoretical foundations, methodological implementations, and diverse applications in frequentist and Bayesian statistics. The article discusses tools for studying the convergence properties of DA algorithms. Furthermore, it presents various strategies for accelerating the convergence of DA algorithms, describes different extensions of DA algorithms, and outlines promising directions for future research. 
This paper aims to serve as a resource for researchers and practitioners seeking to leverage data augmentation techniques in MCMC algorithms by providing key insights and synthesizing recent developments."}, "https://arxiv.org/abs/2406.10481": {"title": "DCDILP: a distributed learning method for large-scale causal structure learning", "link": "https://arxiv.org/abs/2406.10481", "description": "arXiv:2406.10481v1 Announce Type: cross \nAbstract: This paper presents a novel approach to causal discovery through a divide-and-conquer framework. By decomposing the problem into smaller subproblems defined on Markov blankets, the proposed DCDILP method first explores in parallel the local causal graphs of these subproblems. However, this local discovery phase encounters systematic challenges due to the presence of hidden confounders (variables within each Markov blanket may be influenced by external variables). Moreover, aggregating these local causal graphs into a consistent global graph defines a large combinatorial optimization problem. DCDILP addresses these challenges by: i) restricting the local subgraphs to causal links related only to the central variable of the Markov blanket; ii) formulating the reconciliation of local causal graphs as an integer linear programming problem. The merits of the approach, in terms of both causal discovery accuracy and scalability with the size of the problem, are showcased by experiments and comparisons with the state of the art."}, "https://arxiv.org/abs/2406.10738": {"title": "Adaptive Experimentation When You Can't Experiment", "link": "https://arxiv.org/abs/2406.10738", "description": "arXiv:2406.10738v1 Announce Type: cross \nAbstract: This paper introduces the \\emph{confounded pure exploration transductive linear bandit} (\\texttt{CPET-LB}) problem. As a motivating example, online services often cannot directly assign users to specific control or treatment experiences, for either business or practical reasons. In these settings, naively comparing treatment and control groups that may result from self-selection can lead to biased estimates of underlying treatment effects. Instead, online services can employ a properly randomized encouragement that incentivizes users toward a specific treatment. Our methodology provides online services with an adaptive experimental design approach for learning the best-performing treatment for such \\textit{encouragement designs}. We consider a more general underlying model captured by a linear structural equation and formulate pure exploration linear bandits in this setting. Though pure exploration has been extensively studied in standard adaptive experimental design settings, we believe this is the first work considering a setting where noise is confounded. Elimination-style algorithms using experimental design methods in combination with a novel finite-time confidence interval on an instrumental variable style estimator are presented with sample complexity upper bounds nearly matching a minimax lower bound. 
Finally, experiments are conducted that demonstrate the efficacy of our approach."}, "https://arxiv.org/abs/2406.11043": {"title": "Statistical Considerations for Evaluating Treatment Effect under Various Non-proportional Hazard Scenarios", "link": "https://arxiv.org/abs/2406.11043", "description": "arXiv:2406.11043v1 Announce Type: cross \nAbstract: We conducted a systematic comparison of statistical methods used for the analysis of time-to-event outcomes under various proportional and nonproportional hazard (NPH) scenarios. Our study used data from recently published oncology trials to compare the Log-rank test, still by far the most widely used option, against some available alternatives, including the MaxCombo test, the Restricted Mean Survival Time Difference (dRMST) test, the Generalized Gamma Model (GGM) and the Generalized F Model (GFM). Power, type I error rate, and time-dependent bias with respect to the RMST difference, survival probability difference, and median survival time were used to evaluate and compare the performance of these methods. In addition to the real data, we simulated three hypothetical scenarios with crossing hazards chosen so that the early and late effects 'cancel out' and used them to evaluate the ability of the aforementioned methods to detect time-specific and overall treatment effects. We implemented novel metrics for assessing the time-dependent bias in treatment effect estimates to provide a more comprehensive evaluation in NPH scenarios. Recommendations under each NPH scenario are provided by examining the type I error rate, power, and time-dependent bias associated with each statistical approach."}, "https://arxiv.org/abs/2406.11046": {"title": "Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data", "link": "https://arxiv.org/abs/2406.11046", "description": "arXiv:2406.11046v1 Announce Type: cross \nAbstract: Advancements in Artificial Intelligence, particularly with ChatGPT, have significantly impacted software development. Utilizing novel data from GitHub Innovation Graph, we hypothesize that ChatGPT enhances software production efficiency. Utilizing natural experiments where some governments banned ChatGPT, we employ Difference-in-Differences (DID), Synthetic Control (SC), and Synthetic Difference-in-Differences (SDID) methods to estimate its effects. Our findings indicate a significant positive impact on the number of git pushes, repositories, and unique developers per 100,000 people, particularly for high-level, general purpose, and shell scripting languages. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns."}, "https://arxiv.org/abs/2406.11490": {"title": "Interventional Imbalanced Multi-Modal Representation Learning via $\\beta$-Generalization Front-Door Criterion", "link": "https://arxiv.org/abs/2406.11490", "description": "arXiv:2406.11490v1 Announce Type: cross \nAbstract: Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. 
Benchmark methods offer a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we aim to capture the true causality between the discriminative knowledge of the predominant modality and the predictive label while considering the auxiliary modality. Thus, we introduce the $\\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method."}, "https://arxiv.org/abs/2406.11501": {"title": "Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality", "link": "https://arxiv.org/abs/2406.11501", "description": "arXiv:2406.11501v1 Announce Type: cross \nAbstract: Leveraging the development of the structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionist causal analysis encounters theoretical adaptation limitations. Accordingly, the cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. The twin network is a subtle attempt, establishing a symbiotic relationship, to bridge the gap between graphical modeling and the introduction of counterfactuals, albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical applications, we show that introducing the proposed teleporter theory directly yields the conditional independence between counterfactual variables and real-world variables from the cross-world SCM, without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory in practical applications. Adhering to the proposed theory, we build a plug-and-play module, the effectiveness of which is substantiated by experiments on benchmarks."}, "https://arxiv.org/abs/2406.11761": {"title": "Joint Linked Component Analysis for Multiview Data", "link": "https://arxiv.org/abs/2406.11761", "description": "arXiv:2406.11761v1 Announce Type: cross \nAbstract: In this work, we propose the joint linked component analysis (joint\\_LCA) for multiview data. 
Unlike classic methods which extract the shared components in a sequential manner, the objective of joint\\_LCA is to identify the view-specific loading matrices and the rank of the common latent subspace simultaneously. We formulate a matrix decomposition model where a joint structure and an individual structure are present in each data view, which enables us to arrive at a clean svd representation for the cross covariance between any pair of data views. An objective function with a novel penalty term is then proposed to achieve simultaneous estimation and rank selection. In addition, a refitting procedure is employed as a remedy to reduce the shrinkage bias caused by the penalization."}, "https://arxiv.org/abs/1908.04822": {"title": "R-miss-tastic: a unified platform for missing values methods and workflows", "link": "https://arxiv.org/abs/1908.04822", "description": "arXiv:1908.04822v4 Announce Type: replace \nAbstract: Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This led to the development of various methods, formalizations, and tools. For practitioners, it remains nevertheless challenging to decide which method is most suited for their problem, partially due to a lack of systematic covering of this topic in statistics or data science curricula.\n To help address this challenge, we have launched the \"R-miss-tastic\" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), \"R-miss-tastic\" covers the development of standardized analysis workflows. Indeed, we have developed several pipelines in R and Python to allow for hands-on illustration of and recommendations on missing values handling in various statistical tasks such as matrix completion, estimation and prediction, while ensuring reproducibility of the analyses. Finally, the platform is dedicated to users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and also teachers who are looking for didactic materials (notebooks, video, slides)."}, "https://arxiv.org/abs/2109.11647": {"title": "Treatment Effects in Market Equilibrium", "link": "https://arxiv.org/abs/2109.11647", "description": "arXiv:2109.11647v3 Announce Type: replace \nAbstract: Policy-relevant treatment effect estimation in a marketplace setting requires taking into account both the direct benefit of the treatment and any spillovers induced by changes to the market equilibrium. The standard way to address these challenges is to evaluate interventions via cluster-randomized experiments, where each cluster corresponds to an isolated market. This approach, however, cannot be used when we only have access to a single market (or a small number of markets). Here, we show how to identify and estimate policy-relevant treatment effects using a unit-level randomized trial run within a single large market. 
A standard Bernoulli-randomized trial allows consistent estimation of direct effects, and of treatment heterogeneity measures that can be used for welfare-improving targeting. Estimating spillovers - as well as providing confidence intervals for the direct effect - requires estimates of price elasticities, which we provide using an augmented experimental design. Our results rely on all spillovers being mediated via the (observed) prices of a finite number of traded goods, and the market power of any single unit decaying as the market gets large. We illustrate our results using a simulation calibrated to a conditional cash transfer experiment in the Philippines."}, "https://arxiv.org/abs/2110.11771": {"title": "Additive Density-on-Scalar Regression in Bayes Hilbert Spaces with an Application to Gender Economics", "link": "https://arxiv.org/abs/2110.11771", "description": "arXiv:2110.11771v3 Announce Type: replace \nAbstract: Motivated by research on gender identity norms and the distribution of the woman's share in a couple's total labor income, we consider functional additive regression models for probability density functions as responses with scalar covariates. To preserve nonnegativity and integration to one under vector space operations, we formulate the model for densities in a Bayes Hilbert space, which allows us to consider not only continuous densities, but also, e.g., discrete or mixed densities. Mixed ones occur in our application, as the woman's income share is a continuous variable having discrete point masses at zero and one for single-earner couples. Estimation is based on a gradient boosting algorithm, allowing for potentially numerous flexible covariate effects and model selection. We develop properties of Bayes Hilbert spaces related to subcompositional coherence, yielding (odds-ratio) interpretation of effect functions and simplified estimation for mixed densities via an orthogonal decomposition. Applying our approach to data from the German Socio-Economic Panel Study (SOEP) shows a more symmetric distribution in East German than in West German couples after reunification and a smaller child penalty comparing couples with and without minor children. These West-East differences become smaller, but are persistent over time."}, "https://arxiv.org/abs/2201.01357": {"title": "Estimating Heterogeneous Causal Effects of High-Dimensional Treatments: Application to Conjoint Analysis", "link": "https://arxiv.org/abs/2201.01357", "description": "arXiv:2201.01357v4 Announce Type: replace \nAbstract: Estimation of heterogeneous treatment effects is an active area of research. Most of the existing methods, however, focus on estimating the conditional average treatment effects of a single, binary treatment given a set of pre-treatment covariates. In this paper, we propose a method to estimate the heterogeneous causal effects of high-dimensional treatments, which poses unique challenges in terms of estimation and interpretation. The proposed approach finds maximally heterogeneous groups and uses a Bayesian mixture of regularized logistic regressions to identify groups of units who exhibit similar patterns of treatment effects. By directly modeling group membership with covariates, the proposed methodology allows one to explore the unit characteristics that are associated with different patterns of treatment effects. 
Our motivating application is conjoint analysis, which is a popular type of survey experiment in social science and marketing research and is based on a high-dimensional factorial design. We apply the proposed methodology to the conjoint data, where survey respondents are asked to select one of two immigrant profiles with randomly selected attributes. We find that a group of respondents with a relatively high degree of prejudice appears to discriminate against immigrants from non-European countries like Iraq. An open-source software package is available for implementing the proposed methodology."}, "https://arxiv.org/abs/2210.15253": {"title": "An extended generalized Pareto regression model for count data", "link": "https://arxiv.org/abs/2210.15253", "description": "arXiv:2210.15253v2 Announce Type: replace \nAbstract: The statistical modeling of discrete extremes has received less attention than their continuous counterparts in the Extreme Value Theory (EVT) literature. One approach to the transition from continuous to discrete extremes is the modeling of threshold exceedances of integer random variables by the discrete version of the generalized Pareto distribution. However, the optimal choice of thresholds defining exceedances remains a problematic issue. Moreover, in a regression framework, the treatment of the majority of non-extreme data below the selected threshold is either ignored or separated from the extremes. To tackle these issues, we expand on the concept of employing a smooth transition between the bulk and the upper tail of the distribution. In the case of zero inflation, we also develop models with an additional parameter. To incorporate possible predictors, we relate the parameters to additive smoothed predictors via an appropriate link, as in the generalized additive model (GAM) framework. A penalized maximum likelihood estimation procedure is implemented. We illustrate our modeling proposal with a real dataset of avalanche activity in the French Alps. With the advantage of bypassing the threshold selection step, our results indicate that the proposed models are more flexible and robust than competing models, such as the negative binomial distribution"}, "https://arxiv.org/abs/2211.16552": {"title": "Bayesian inference for aggregated Hawkes processes", "link": "https://arxiv.org/abs/2211.16552", "description": "arXiv:2211.16552v3 Announce Type: replace \nAbstract: The Hawkes process, a self-exciting point process, has a wide range of applications in modeling earthquakes, social networks and stock markets. The established estimation process requires that researchers have access to the exact time stamps and spatial information. However, available data are often rounded or aggregated. We develop a Bayesian estimation procedure for the parameters of a Hawkes process based on aggregated data. Our approach is developed for temporal, spatio-temporal, and mutually exciting Hawkes processes where data are available over discrete time periods and regions. We show theoretically that the parameters of the Hawkes process are identifiable from aggregated data under general specifications. We demonstrate the method on simulated temporal and spatio-temporal data with various model specifications in the presence of one or more interacting processes, and under varying coarseness of data aggregation. 
Finally, we examine the internal and cross-excitation effects of airstrikes and insurgent violence events from February 2007 to June 2008, with some data aggregated by day."}, "https://arxiv.org/abs/2301.06718": {"title": "Sparse and Integrative Principal Component Analysis for Multiview Data", "link": "https://arxiv.org/abs/2301.06718", "description": "arXiv:2301.06718v2 Announce Type: replace \nAbstract: We consider dimension reduction of multiview data, which are emerging in scientific studies. Formulating multiview data as multivariate data with block structures corresponding to the different views of the data, we estimate top eigenvectors from multiview data that have two-fold sparsity: elementwise sparsity and blockwise sparsity. We propose a Fantope-based optimization criterion with multiple penalties to enforce the desired sparsity patterns, and a denoising step is employed to handle the potential presence of heteroskedastic noise across different data views. An alternating direction method of multipliers (ADMM) algorithm is used for optimization. We derive the l2 convergence of the estimated top eigenvectors and establish their sparsity and support recovery properties. Numerical studies are used to illustrate the proposed method."}, "https://arxiv.org/abs/2302.04354": {"title": "Consider or Choose? The Role and Power of Consideration Sets", "link": "https://arxiv.org/abs/2302.04354", "description": "arXiv:2302.04354v3 Announce Type: replace \nAbstract: Consideration sets play a crucial role in discrete choice modeling, where customers are commonly assumed to go through a two-stage decision-making process. Specifically, customers are assumed to form consideration sets in the first stage and then use a second-stage choice mechanism to pick the product with the highest utility from the consideration sets. Recent studies mostly aim to propose more powerful choice mechanisms based on advanced non-parametric models to improve prediction accuracy. In contrast, this paper takes a step back from exploring more complex second-stage choice mechanisms and instead focuses on how effectively we can model customer choice relying only on the first-stage consideration set formation. To this end, we study a class of nonparametric choice models that is only specified by a distribution over consideration sets and has a bounded rationality interpretation. We denote it as the consideration set model. Intriguingly, we show that this class of choice models can be characterized by the axiom of symmetric demand cannibalization, which enables complete statistical identification. We further consider the model's downstream assortment planning as an application. We first present an exact description of the optimal assortment, proving that it is revenue-ordered based on the blocks defined by the consideration sets. Despite this compelling structure, we establish that the assortment optimization problem under this model is NP-hard even to approximate. This result shows that accounting for consideration sets in the model inevitably results in inapproximability in assortment planning, even though the consideration set model uses the simplest possible uniform second-stage choice mechanism. 
Finally, using a real-world dataset, we show the tremendous power of the first-stage consideration sets when modeling customers' decision-making processes."}, "https://arxiv.org/abs/2303.08568": {"title": "Generating contingency tables with fixed marginal probabilities and dependence structures described by loglinear models", "link": "https://arxiv.org/abs/2303.08568", "description": "arXiv:2303.08568v2 Announce Type: replace \nAbstract: We present a method to generate contingency tables that follow loglinear models with prescribed marginal probabilities and dependence structures. We make use of (loglinear) Poisson regression, where the dependence structures, described using odds ratios, are implemented using an offset term. We apply this methodology to carry out simulation studies in the context of population size estimation using dual system and triple system estimators, popular in official statistics. These estimators use contingency tables that summarise the counts of elements enumerated or captured within lists that are linked. The simulation is used to investigate these estimators in the situation in which the model assumptions are fulfilled, and in the situation in which they are violated."}, "https://arxiv.org/abs/2309.07893": {"title": "Choosing a Proxy Metric from Past Experiments", "link": "https://arxiv.org/abs/2309.07893", "description": "arXiv:2309.07893v2 Announce Type: replace \nAbstract: In many randomized experiments, the treatment effect of the long-term metric (i.e., the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to estimate faithfully in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope that they closely track the long-term metric -- so they can be used to effectively guide decision-making in the near term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not fixed a priori; rather it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines."}, "https://arxiv.org/abs/2309.10083": {"title": "Invariant Probabilistic Prediction", "link": "https://arxiv.org/abs/2309.10083", "description": "arXiv:2309.10083v2 Announce Type: replace \nAbstract: In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. 
While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method to yield invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure on simulated as well as on single-cell data."}, "https://arxiv.org/abs/2401.00245": {"title": "Alternative Approaches for Estimating Highest-Density Regions", "link": "https://arxiv.org/abs/2401.00245", "description": "arXiv:2401.00245v2 Announce Type: replace \nAbstract: Among the variety of statistical intervals, highest-density regions (HDRs) stand out for their ability to effectively summarize a distribution or sample, unveiling its distinctive and salient features. An HDR represents the minimum size set that satisfies a certain probability coverage, and current methods for their computation require knowledge or estimation of the underlying probability distribution or density $f$. In this work, we illustrate a broader framework for computing HDRs, which generalizes the classical density quantile method introduced in the seminal paper of Hyndman (1996). The framework is based on neighbourhood measures, i.e., measures that preserve the order induced in the sample by $f$, and include the density $f$ as a special case. We explore a number of suitable distance-based measures, such as the $k$-nearest neighborhood distance, and some probabilistic variants based on copula models. An extensive comparison is provided, showing the advantages of the copula-based strategy, especially in those scenarios that exhibit complex structures (e.g., multimodalities or particular dependencies). Finally, we discuss the practical implications of our findings for estimating HDRs in real-world applications."}, "https://arxiv.org/abs/2401.16286": {"title": "Robust Functional Data Analysis for Stochastic Evolution Equations in Infinite Dimensions", "link": "https://arxiv.org/abs/2401.16286", "description": "arXiv:2401.16286v2 Announce Type: replace \nAbstract: We develop an asymptotic theory for the jump robust measurement of covariations in the context of stochastic evolution equation in infinite dimensions. Namely, we identify scaling limits for realized covariations of solution processes with the quadratic covariation of the latent random process that drives the evolution equation which is assumed to be a Hilbert space-valued semimartingale. 
We discuss applications to dynamically consistent and outlier-robust dimension reduction in the spirit of functional principal components and the estimation of infinite-dimensional stochastic volatility models."}, "https://arxiv.org/abs/2208.02807": {"title": "Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport", "link": "https://arxiv.org/abs/2208.02807", "description": "arXiv:2208.02807v3 Announce Type: replace-cross \nAbstract: We study the problem of data-driven background estimation, arising in the search of physics signals predicted by the Standard Model at the Large Hadron Collider. Our work is motivated by the search for the production of pairs of Higgs bosons decaying into four bottom quarks. A number of other physical processes, known as background, also share the same final state. The data arising in this problem is therefore a mixture of unlabeled background and signal events, and the primary aim of the analysis is to determine whether the proportion of unlabeled signal events is nonzero. A challenging but necessary first step is to estimate the distribution of background events. Past work in this area has determined regions of the space of collider events where signal is unlikely to appear, and where the background distribution is therefore identifiable. The background distribution can be estimated in these regions, and extrapolated into the region of primary interest using transfer learning with a multivariate classifier. We build upon this existing approach in two ways. First, we revisit this method by developing a customized residual neural network which is tailored to the structure and symmetries of collider data. Second, we develop a new method for background estimation, based on the optimal transport problem, which relies on modeling assumptions distinct from earlier work. These two methods can serve as cross-checks for each other in particle physics analyses, due to the complementarity of their underlying assumptions. We compare their performance on simulated double Higgs boson data."}, "https://arxiv.org/abs/2312.12477": {"title": "When Graph Neural Network Meets Causality: Opportunities, Methodologies and An Outlook", "link": "https://arxiv.org/abs/2312.12477", "description": "arXiv:2312.12477v2 Announce Type: replace-cross \nAbstract: Graph Neural Networks (GNNs) have emerged as powerful representation learning tools for capturing complex dependencies within diverse graph-structured data. Despite their success in a wide range of graph mining tasks, GNNs have raised serious concerns regarding their trustworthiness, including susceptibility to distribution shift, biases towards certain populations, and lack of explainability. Recently, integrating causal learning techniques into GNNs has sparked numerous ground-breaking studies since many GNN trustworthiness issues can be alleviated by capturing the underlying data causality rather than superficial correlations. In this survey, we comprehensively review recent research efforts on Causality-Inspired GNNs (CIGNNs). Specifically, we first employ causal tools to analyze the primary trustworthiness risks of existing GNNs, underscoring the necessity for GNNs to comprehend the causal mechanisms within graph data. Moreover, we introduce a taxonomy of CIGNNs based on the type of causal learning capability they are equipped with, i.e., causal reasoning and causal representation learning. 
Besides, we systematically introduce typical methods within each category and discuss how they mitigate trustworthiness risks. Finally, we summarize useful resources and discuss several future directions, hoping to shed light on new research opportunities in this emerging field. The representative papers, along with open-source data and codes, are available in https://github.com/usail-hkust/Causality-Inspired-GNNs."}, "https://arxiv.org/abs/2406.11892": {"title": "Simultaneous comparisons of the variances of k treatments with that of a control: a Levene-Dunnett type procedure", "link": "https://arxiv.org/abs/2406.11892", "description": "arXiv:2406.11892v1 Announce Type: new \nAbstract: There are some global tests for heterogeneity of variance in k-sample one-way layouts, but few consider pairwise comparisons between treatment levels. For experimental designs with a control, comparisons of the variances between the treatment levels and the control are of interest - in analogy to the location parameter with the Dunnett (1955) procedure. Such a many-to-one approach for variances is proposed using the Levene transformation, a kind of residuals. Its properties are characterized with simulation studies and corresponding data examples are evaluated with R code."}, "https://arxiv.org/abs/2406.11940": {"title": "Model-Based Inference and Experimental Design for Interference Using Partial Network Data", "link": "https://arxiv.org/abs/2406.11940", "description": "arXiv:2406.11940v1 Announce Type: new \nAbstract: The stable unit treatment value assumption states that the outcome of an individual is not affected by the treatment statuses of others, however in many real world applications, treatments can have an effect on many others beyond the immediately treated. Interference can generically be thought of as mediated through some network structure. In many empirically relevant situations however, complete network data (required to adjust for these spillover effects) are too costly or logistically infeasible to collect. Partially or indirectly observed network data (e.g., subsamples, aggregated relational data (ARD), egocentric sampling, or respondent-driven sampling) reduce the logistical and financial burden of collecting network data, but the statistical properties of treatment effect adjustments from these design strategies are only beginning to be explored. In this paper, we present a framework for the estimation and inference of treatment effect adjustments using partial network data through the lens of structural causal models. We also illustrate procedures to assign treatments using only partial network data, with the goal of either minimizing estimator variance or optimally seeding. We derive single network asymptotic results applicable to a variety of choices for an underlying graph model. We validate our approach using simulated experiments on observed graphs with applications to information diffusion in India and Malawi."}, "https://arxiv.org/abs/2406.11942": {"title": "Clustering functional data with measurement errors: a simulation-based approach", "link": "https://arxiv.org/abs/2406.11942", "description": "arXiv:2406.11942v1 Announce Type: new \nAbstract: Clustering analysis of functional data, which comprises observations that evolve continuously over time or space, has gained increasing attention across various scientific disciplines. 
Practical applications often involve functional data that are contaminated with measurement errors arising from imprecise instruments, sampling errors, or other sources. These errors can significantly distort the inherent data structure, resulting in erroneous clustering outcomes. In this paper, we propose a simulation-based approach designed to mitigate the impact of measurement errors. Our proposed method estimates the distribution of functional measurement errors through repeated measurements. Subsequently, the clustering algorithm is applied to simulated data generated from the conditional distribution of the unobserved true functional data given the observed contaminated functional data, accounting for the adjustments made to rectify measurement errors. We show through simulations that the proposed method has better numerical performance than naive methods that neglect such errors. Our proposed method was applied to a childhood obesity study, giving more reliable clustering results."}, "https://arxiv.org/abs/2406.12028": {"title": "Mixed-resolution hybrid modeling in an element-based framework", "link": "https://arxiv.org/abs/2406.12028", "description": "arXiv:2406.12028v1 Announce Type: new \nAbstract: Computational modeling of a complex system is limited by the parts of the system with the least information. While detailed models and high-resolution data may be available for parts of a system, abstract relationships are often necessary to connect the parts and model the full system. For example, modeling food security necessitates the interaction of climate and socioeconomic factors, with models of system components existing at different levels of information in terms of granularity and resolution. Connecting these models is an ongoing challenge. In this work, we demonstrate a methodology to quantize and integrate information from data and detailed component models alongside abstract relationships in a hybrid element-based modeling and simulation framework. In a case study of modeling food security, we apply quantization methods to generate (1) time-series model input from climate data and (2) a discrete representation of a component model (a statistical emulator of crop yield), which we then incorporate as an update rule in the hybrid element-based model, bridging differences in model granularity and resolution. Simulation of the hybrid element-based model recapitulated the trends of the original emulator, supporting the use of this methodology to integrate data and information from component models to simulate complex systems."}, "https://arxiv.org/abs/2406.12171": {"title": "Model Selection for Causal Modeling in Missing Exposure Problems", "link": "https://arxiv.org/abs/2406.12171", "description": "arXiv:2406.12171v1 Announce Type: new \nAbstract: In causal inference, properly selecting the propensity score (PS) model is a popular topic and has been widely investigated in observational studies. In addition, there is a large literature concerning the missing data problem. However, there are very few studies investigating the model selection issue for causal inference when the exposure is missing at random (MAR). In this paper, we discuss how to select both imputation and PS models, which can result in the smallest RMSE of the estimated causal effect. Then, we provide a new criterion, called the ``rank score\", for evaluating the overall performance of both models. 
The simulation studies show that the full imputation plus the outcome-related PS models lead to the smallest RMSE and the rank score can also pick the best models. An application study is conducted to study the causal effect of CVD on the mortality of COVID-19 patients."}, "https://arxiv.org/abs/2406.12237": {"title": "Lasso regularization for mixture experiments with noise variables", "link": "https://arxiv.org/abs/2406.12237", "description": "arXiv:2406.12237v1 Announce Type: new \nAbstract: We apply classical and Bayesian lasso regularizations to a family of models with the presence of mixture and process variables. We analyse the performance of these estimates with respect to ordinary least squares estimators by a simulation study and a real data application. Our results demonstrate the superior performance of Bayesian lasso, particularly via coordinate ascent variational inference, in terms of variable selection accuracy and response optimization."}, "https://arxiv.org/abs/2406.12780": {"title": "Bayesian Consistency for Long Memory Processes: A Semiparametric Perspective", "link": "https://arxiv.org/abs/2406.12780", "description": "arXiv:2406.12780v1 Announce Type: new \nAbstract: In this work, we will investigate a Bayesian approach to estimating the parameters of long memory models. Long memory, characterized by the phenomenon of hyperbolic autocorrelation decay in time series, has garnered significant attention. This is because, in many situations, the assumption of short memory, such as the Markovianity assumption, can be deemed too restrictive. Applications for long memory models can be readily found in fields such as astronomy, finance, and environmental sciences. However, current parametric and semiparametric approaches to modeling long memory present challenges, particularly in the estimation process.\n In this study, we will introduce various methods applied to this problem from a Bayesian perspective, along with a novel semiparametric approach for deriving the posterior distribution of the long memory parameter. Additionally, we will establish the asymptotic properties of the model. An advantage of this approach is that it allows to implement state-of-the-art efficient algorithms for nonparametric Bayesian models."}, "https://arxiv.org/abs/2406.12817": {"title": "Intrinsic Modeling of Shape-Constrained Functional Data, With Applications to Growth Curves and Activity Profiles", "link": "https://arxiv.org/abs/2406.12817", "description": "arXiv:2406.12817v1 Announce Type: new \nAbstract: Shape-constrained functional data encompass a wide array of application fields especially in the life sciences, such as activity profiling, growth curves, healthcare and mortality. Most existing methods for general functional data analysis often ignore that such data are subject to inherent shape constraints, while some specialized techniques rely on strict distributional assumptions. We propose an approach for modeling such data that harnesses the intrinsic geometry of functional trajectories by decomposing them into size and shape components. We focus on the two most prevalent shape constraints, positivity and monotonicity, and develop individual-level estimators for the size and shape components. Furthermore, we demonstrate the applicability of our approach by conducting subsequent analyses involving Fr\\'{e}chet mean and Fr\\'{e}chet regression and establish rates of convergence for the empirical estimators. 
Illustrative examples include simulations and data applications for activity profiles for Mediterranean fruit flies during their entire lifespan and for data from the Z\\\"{u}rich longitudinal growth study."}, "https://arxiv.org/abs/2406.12212": {"title": "Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data", "link": "https://arxiv.org/abs/2406.12212", "description": "arXiv:2406.12212v1 Announce Type: cross \nAbstract: Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors from hundreds of thousands of single nucleotide polymorphisms (SNPs) for obesity. We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously perform variable selection and modeling for ultra-high dimensional genetic data, focusing on high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness."}, "https://arxiv.org/abs/2406.12474": {"title": "Exploring Intra and Inter-language Consistency in Embeddings with ICA", "link": "https://arxiv.org/abs/2406.12474", "description": "arXiv:2406.12474v1 Announce Type: cross \nAbstract: Word embeddings represent words as multidimensional real vectors, facilitating data analysis and processing, but are often challenging to interpret. Independent Component Analysis (ICA) creates clearer semantic axes by identifying independent key features. Previous research has shown ICA's potential to reveal universal semantic axes across languages. However, it lacked verification of the consistency of independent components within and across languages. We investigated the consistency of semantic axes in two ways: both within a single language and across multiple languages. We first probed into intra-language consistency, focusing on the reproducibility of axes by performing ICA multiple times and clustering the outcomes. Then, we statistically examined inter-language consistency by verifying those axes' correspondences using statistical tests. We applied statistical methods in a novel way to establish a robust framework that ensures the reliability and universality of semantic axes."}, "https://arxiv.org/abs/2104.00262": {"title": "Statistical significance revisited", "link": "https://arxiv.org/abs/2104.00262", "description": "arXiv:2104.00262v3 Announce Type: replace \nAbstract: Statistical significance measures the reliability of a result obtained from a random experiment. We investigate the number of repetitions needed for a statistical result to have a certain significance. 
In the first step, we consider binomially distributed variables in the example of medication testing with fixed placebo efficacy, asking how many experiments are needed in order to achieve a significance of 95 %. In the next step, we take the probability distribution of the placebo efficacy into account, which to the best of our knowledge has not been done so far. Depending on the specifics, we show that in order to obtain identical significance, it may be necessary to perform twice as many experiments as in a setting where the placebo distribution is neglected. We proceed by considering more general probability distributions and close with comments on some erroneous assumptions on probability distributions which lead, for instance, to a trivial explanation of the fat tail."}, "https://arxiv.org/abs/2109.08109": {"title": "Standard Errors for Calibrated Parameters", "link": "https://arxiv.org/abs/2109.08109", "description": "arXiv:2109.08109v3 Announce Type: replace \nAbstract: Calibration, the practice of choosing the parameters of a structural model to match certain empirical moments, can be viewed as minimum distance estimation. Existing standard error formulas for such estimators require a consistent estimate of the correlation structure of the empirical moments, which is often unavailable in practice. Instead, the variances of the individual empirical moments are usually readily estimable. Using only these variances, we derive conservative standard errors and confidence intervals for the structural parameters that are valid even under the worst-case correlation structure. In the over-identified case, we show that the moment weighting scheme that minimizes the worst-case estimator variance amounts to a moment selection problem with a simple solution. Finally, we develop tests of over-identifying or parameter restrictions. We apply our methods empirically to a model of menu cost pricing for multi-product firms and to a heterogeneous agent New Keynesian model."}, "https://arxiv.org/abs/2306.15642": {"title": "Neural Bayes estimators for censored inference with peaks-over-threshold models", "link": "https://arxiv.org/abs/2306.15642", "description": "arXiv:2306.15642v4 Announce Type: replace \nAbstract: Making inference with spatial extremal dependence models can be computationally burdensome since they involve intractable and/or censored likelihoods. Building on recent advances in likelihood-free inference with neural Bayes estimators, that is, neural networks that approximate Bayes estimators, we develop highly efficient estimators for censored peaks-over-threshold models that use data augmentation techniques to encode censoring information in the neural network input. Our new method provides a paradigm shift that challenges traditional censored likelihood-based inference methods for spatial extremal dependence models. Our simulation studies highlight significant gains in both computational and statistical efficiency, relative to competing likelihood-based approaches, when applying our novel estimators to make inference with popular extremal dependence models, such as max-stable, $r$-Pareto, and random scale mixture process models. We also illustrate that it is possible to train a single neural Bayes estimator for a general censoring level, precluding the need to retrain the network when the censoring level is changed. 
We illustrate the efficacy of our estimators by making fast inference on hundreds-of-thousands of high-dimensional spatial extremal dependence models to assess extreme particulate matter 2.5 microns or less in diameter (${\\rm PM}_{2.5}$) concentration over the whole of Saudi Arabia."}, "https://arxiv.org/abs/2310.08063": {"title": "Inference for Nonlinear Endogenous Treatment Effects Accounting for High-Dimensional Covariate Complexity", "link": "https://arxiv.org/abs/2310.08063", "description": "arXiv:2310.08063v3 Announce Type: replace \nAbstract: Nonlinearity and endogeneity are prevalent challenges in causal analysis using observational data. This paper proposes an inference procedure for a nonlinear and endogenous marginal effect function, defined as the derivative of the nonparametric treatment function, with a primary focus on an additive model that includes high-dimensional covariates. Using the control function approach for identification, we implement a regularized nonparametric estimation to obtain an initial estimator of the model. Such an initial estimator suffers from two biases: the bias in estimating the control function and the regularization bias for the high-dimensional outcome model. Our key innovation is to devise the double bias correction procedure that corrects these two biases simultaneously. Building on this debiased estimator, we further provide a confidence band of the marginal effect function. Simulations and an empirical study of air pollution and migration demonstrate the validity of our procedures."}, "https://arxiv.org/abs/2311.00553": {"title": "Polynomial Chaos Surrogate Construction for Random Fields with Parametric Uncertainty", "link": "https://arxiv.org/abs/2311.00553", "description": "arXiv:2311.00553v2 Announce Type: replace \nAbstract: Engineering and applied science rely on computational experiments to rigorously study physical systems. The mathematical models used to probe these systems are highly complex, and sampling-intensive studies often require prohibitively many simulations for acceptable accuracy. Surrogate models provide a means of circumventing the high computational expense of sampling such complex models. In particular, polynomial chaos expansions (PCEs) have been successfully used for uncertainty quantification studies of deterministic models where the dominant source of uncertainty is parametric. We discuss an extension to conventional PCE surrogate modeling to enable surrogate construction for stochastic computational models that have intrinsic noise in addition to parametric uncertainty. We develop a PCE surrogate on a joint space of intrinsic and parametric uncertainty, enabled by Rosenblatt transformations, and then extend the construction to random field data via the Karhunen-Loeve expansion. We then take advantage of closed-form solutions for computing PCE Sobol indices to perform a global sensitivity analysis of the model which quantifies the intrinsic noise contribution to the overall model output variance. Additionally, the resulting joint PCE is generative in the sense that it allows generating random realizations at any input parameter setting that are statistically approximately equivalent to realizations from the underlying stochastic model. 
The method is demonstrated on a chemical catalysis example model."}, "https://arxiv.org/abs/2311.18501": {"title": "Perturbation-based Effect Measures for Compositional Data", "link": "https://arxiv.org/abs/2311.18501", "description": "arXiv:2311.18501v2 Announce Type: replace \nAbstract: Existing effect measures for compositional features are inadequate for many modern applications for two reasons. First, modern datasets with compositional covariates, for example in microbiome research, display traits such as high-dimensionality and sparsity that can be poorly modelled with traditional parametric approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike many existing effect measures for compositional features, we do not define our effects based on a parametric model or a transformation of the data. Instead, we use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These effects naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated and semi-synthetic data and demonstrate advantages over existing techniques on data from New York schools and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees."}, "https://arxiv.org/abs/2312.05682": {"title": "Valid Cross-Covariance Models via Multivariate Mixtures with an Application to the Confluent Hypergeometric Class", "link": "https://arxiv.org/abs/2312.05682", "description": "arXiv:2312.05682v2 Announce Type: replace \nAbstract: Modeling of multivariate random fields through Gaussian processes calls for the construction of valid cross-covariance functions describing the dependence between any two component processes at different spatial locations. The required validity conditions often present challenges that lead to complicated restrictions on the parameter space. The purpose of this work is to present techniques using multivariate mixtures for establishing validity that are simultaneously simplified and comprehensive. This is accomplished using results on conditionally negative semidefinite matrices and the Schur product theorem. For illustration, we use the recently-introduced Confluent Hypergeometric (CH) class of covariance functions. In addition, we establish the spectral density of the Confluent Hypergeometric covariance and use this to construct valid multivariate models as well as propose new cross-covariances. Our approach leads to valid multivariate cross-covariance models that inherit the desired marginal properties of the Confluent Hypergeometric model and outperform the multivariate Mat\\'ern model in out-of-sample prediction under slowly-decaying correlation of the underlying multivariate random field. We also establish properties of the new models, including results on equivalence of Gaussian measures. 
We demonstrate the new model's use for a multivariate oceanography dataset consisting of temperature, salinity and oxygen, as measured by autonomous floats in the Southern Ocean."}, "https://arxiv.org/abs/2209.01679": {"title": "Orthogonal and Linear Regressions and Pencils of Confocal Quadrics", "link": "https://arxiv.org/abs/2209.01679", "description": "arXiv:2209.01679v3 Announce Type: replace-cross \nAbstract: This paper enhances and develops bridges between statistics, mechanics, and geometry. For a given system of points in $\\mathbb R^k$ representing a sample of full rank, we construct an explicit pencil of confocal quadrics with the following properties: (i) All the hyperplanes for which the hyperplanar moments of inertia for the given system of points are equal, are tangent to the same quadrics from the pencil of quadrics. As an application, we develop regularization procedures for the orthogonal least squares method, analogues of lasso and ridge methods from linear regression. (ii) For any given point $P$ among all the hyperplanes that contain it, the best fit is the tangent hyperplane to the quadric from the confocal pencil corresponding to the maximal Jacobi coordinate of the point $P$; the worst fit among the hyperplanes containing $P$ is the tangent hyperplane to the ellipsoid from the confocal pencil that contains $P$. The confocal pencil of quadrics provides a universal tool to solve the principal component analysis restricted at any given point. Both results (i) and (ii) can be seen as generalizations of the classical result of Pearson on orthogonal regression. They have natural and important applications in the statistics of the errors-in-variables models (EIV). For the classical linear regressions we provide a geometric characterization of hyperplanes of least squares in a given direction among all hyperplanes which contain a given point. The obtained results have applications in restricted regressions, both ordinary and orthogonal ones. For the latter, a new formula for the test statistic is derived. The developed methods and results are illustrated in natural statistics examples."}, "https://arxiv.org/abs/2302.07658": {"title": "SUrvival Control Chart EStimation Software in R: the success package", "link": "https://arxiv.org/abs/2302.07658", "description": "arXiv:2302.07658v2 Announce Type: replace-cross \nAbstract: Monitoring the quality of statistical processes has been of great importance, mostly in industrial applications. Control charts are widely used for this purpose, but often lack the ability to monitor survival outcomes. Recently, inspecting survival outcomes has become of interest, especially in medical settings where outcomes often depend on risk factors of patients. For this reason, many new survival control charts have been devised and existing ones have been extended to incorporate survival outcomes. The R package success allows users to construct risk-adjusted control charts for survival data. Functions to determine control chart parameters are included, which can be used even without expert knowledge on the subject of control charts. 
The package allows users to create static as well as interactive charts, which are built using ggplot2 (Wickham 2016) and plotly (Sievert 2020)."}, "https://arxiv.org/abs/2303.05263": {"title": "Fast post-process Bayesian inference with Variational Sparse Bayesian Quadrature", "link": "https://arxiv.org/abs/2303.05263", "description": "arXiv:2303.05263v2 Announce Type: replace-cross \nAbstract: In applied Bayesian inference scenarios, users may have access to a large number of pre-existing model evaluations, for example from maximum-a-posteriori (MAP) optimization runs. However, traditional approximate inference techniques make little to no use of this available information. We propose the framework of post-process Bayesian inference as a means to obtain a quick posterior approximation from existing target density evaluations, with no further model calls. Within this framework, we introduce Variational Sparse Bayesian Quadrature (VSBQ), a method for post-process approximate inference for models with black-box and potentially noisy likelihoods. VSBQ reuses existing target density evaluations to build a sparse Gaussian process (GP) surrogate model of the log posterior density function. Subsequently, we leverage sparse-GP Bayesian quadrature combined with variational inference to achieve fast approximate posterior inference over the surrogate. We validate our method on challenging synthetic scenarios and real-world applications from computational neuroscience. The experiments show that VSBQ builds high-quality posterior approximations by post-processing existing optimization traces, with no further model evaluations."}, "https://arxiv.org/abs/2406.13052": {"title": "Distance Covariance, Independence, and Pairwise Differences", "link": "https://arxiv.org/abs/2406.13052", "description": "arXiv:2406.13052v1 Announce Type: new \nAbstract: (To appear in The American Statistician.) Distance covariance (Sz\\'ekely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an independent copy of $(X,Y)$. This raises natural questions about independence of variables like $X-X'$ and $Y-Y'$, about the connection between Cov$(|X-X'|,|Y-Y'|)$ and the covariance between doubly centered distances, and about necessary and sufficient conditions for independence. We show some basic results and present a new and nontechnical counterexample to a common fallacy, which provides more insight. We also show some motivating examples involving bivariate distributions and contingency tables, which can be used as didactic material for introducing distance correlation."}, "https://arxiv.org/abs/2406.13111": {"title": "Nonparametric Motion Control in Functional Connectivity Studies in Children with Autism Spectrum Disorder", "link": "https://arxiv.org/abs/2406.13111", "description": "arXiv:2406.13111v1 Announce Type: new \nAbstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental condition associated with difficulties with social interactions, communication, and restricted or repetitive behaviors. To characterize ASD, investigators often use functional connectivity derived from resting-state functional magnetic resonance imaging of the brain. However, participants' head motion during the scanning session can induce motion artifacts. 
Many studies remove scans with excessive motion, which can lead to drastic reductions in sample size and introduce selection bias. To avoid such exclusions, we propose an estimand inspired by causal inference methods that quantifies the difference in average functional connectivity in autistic and non-ASD children while standardizing motion relative to the low motion distribution in scans that pass motion quality control. We introduce a nonparametric estimator for motion control, called MoCo, that uses all participants and flexibly models the impacts of motion and other relevant features using an ensemble of machine learning methods. We establish large-sample efficiency and multiple robustness of our proposed estimator. The framework is applied to estimate the difference in functional connectivity between 132 autistic and 245 non-ASD children, of which 34 and 126 pass motion quality control. MoCo appears to dramatically reduce motion artifacts relative to no participant removal, while more efficiently utilizing participant data and accounting for possible selection biases relative to the na\\\"ive approach with participant removal."}, "https://arxiv.org/abs/2406.13122": {"title": "Testing for Underpowered Literatures", "link": "https://arxiv.org/abs/2406.13122", "description": "arXiv:2406.13122v1 Announce Type: new \nAbstract: How many experimental studies would have come to different conclusions had they been run on larger samples? I show how to estimate the expected number of statistically significant results that a set of experiments would have reported had their sample sizes all been counterfactually increased by a chosen factor. The estimator is consistent and asymptotically normal. Unlike existing methods, my approach requires no assumptions about the distribution of true effects of the interventions being studied other than continuity. This method includes an adjustment for publication bias in the reported t-scores. An application to randomized controlled trials (RCTs) published in top economics journals finds that doubling every experiment's sample size would only increase the power of two-sided t-tests by 7.2 percentage points on average. This effect is small and is comparable to the effect for systematic replication projects in laboratory psychology where previous studies enabled accurate power calculations ex ante. These effects are both smaller than for non-RCTs. This comparison suggests that RCTs are on average relatively insensitive to sample size increases. The policy implication is that grant givers should generally fund more experiments rather than fewer, larger ones."}, "https://arxiv.org/abs/2406.13197": {"title": "Representation Transfer Learning for Semiparametric Regression", "link": "https://arxiv.org/abs/2406.13197", "description": "arXiv:2406.13197v1 Announce Type: new \nAbstract: We propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. 
We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data."}, "https://arxiv.org/abs/2406.13310": {"title": "A finite-infinite shared atoms nested model for the Bayesian analysis of large grouped data", "link": "https://arxiv.org/abs/2406.13310", "description": "arXiv:2406.13310v1 Announce Type: new \nAbstract: The use of hierarchical mixture priors with shared atoms has recently flourished in the Bayesian literature for partially exchangeable data. Leveraging nested levels of mixtures, these models allow the estimation of a two-layered data partition: across groups and across observations. This paper discusses and compares the properties of such modeling strategies when the mixing weights are assigned either a finite-dimensional Dirichlet distribution or a Dirichlet process prior. Based on these considerations, we introduce a novel hierarchical nonparametric prior based on a finite set of shared atoms, a specification that enhances the flexibility of the induced random measures and the availability of fast posterior inference. To support these findings, we analytically derive the induced prior correlation structure and partially exchangeable partition probability function. Additionally, we develop a novel mean-field variational algorithm for posterior inference to boost the applicability of our nested model to large multivariate data. We then assess and compare the performance of the different shared-atom specifications via simulation. We also show that our variational proposal is highly scalable and that the accuracy of the posterior density estimate and the estimated partition is comparable with state-of-the-art Gibbs sampler algorithms. Finally, we apply our model to a real dataset of Spotify's song features, simultaneously segmenting artists and songs with similar characteristics."}, "https://arxiv.org/abs/2406.13395": {"title": "Bayesian Inference for Multidimensional Welfare Comparisons", "link": "https://arxiv.org/abs/2406.13395", "description": "arXiv:2406.13395v1 Announce Type: new \nAbstract: Using both single-index measures and stochastic dominance concepts, we show how Bayesian inference can be used to make multivariate welfare comparisons. A four-dimensional distribution for the well-being attributes income, mental health, education, and happiness is estimated via Bayesian Markov chain Monte Carlo using unit-record data taken from the Household, Income and Labour Dynamics in Australia survey. Marginal distributions of beta and gamma mixtures and discrete ordinal distributions are combined using a copula. Improvements in both well-being generally and poverty magnitude are assessed using posterior means of single-index measures and posterior probabilities of stochastic dominance. The conditions for stochastic dominance depend on the class of utility functions that is assumed to define a social welfare function and the number of attributes in the utility function. 
Three classes of utility functions are considered, and posterior probabilities of dominance are computed for one-, two-, and four-attribute utility functions for three time intervals within the period 2001 to 2019."}, "https://arxiv.org/abs/2406.13478": {"title": "Semiparametric Localized Principal Stratification Analysis with Continuous Strata", "link": "https://arxiv.org/abs/2406.13478", "description": "arXiv:2406.13478v1 Announce Type: new \nAbstract: Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinitely many principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify the principal causal effect under weak principal ignorability. We then target the local functional substitute of the principal causal effect, which is statistically regular and can accurately approximate the principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and the principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries."}, "https://arxiv.org/abs/2406.13500": {"title": "Gradient-Boosted Generalized Linear Models for Conditional Vine Copulas", "link": "https://arxiv.org/abs/2406.13500", "description": "arXiv:2406.13500v1 Announce Type: new \nAbstract: Vine copulas are flexible dependence models using bivariate copulas as building blocks. If the parameters of the bivariate copulas in the vine copula depend on covariates, one obtains a conditional vine copula. We propose an extension for the estimation of continuous conditional vine copulas, where the parameters of continuous conditional bivariate copulas are estimated sequentially and separately via gradient-boosting. For this purpose, we link covariates via generalized linear models (GLMs) to Kendall's $\\tau$ correlation coefficient, from which the corresponding copula parameter can be obtained. Consequently, the gradient-boosting algorithm estimates the copula parameters providing a natural covariate selection. In a second step, an additional covariate deselection procedure is applied. The performance of the gradient-boosted conditional vine copulas is illustrated in a simulation study. Linear covariate effects in low- and high-dimensional settings are investigated for the conditional bivariate copulas separately and for conditional vine copulas. Moreover, the gradient-boosted conditional vine copulas are applied to the temporal postprocessing of ensemble weather forecasts in a low-dimensional setting. The results show that our suggested method is able to outperform the benchmark methods and identifies temporal correlations better. 
Finally, we provide an R package called boostCopula for this method."}, "https://arxiv.org/abs/2406.13635": {"title": "Temporal label recovery from noisy dynamical data", "link": "https://arxiv.org/abs/2406.13635", "description": "arXiv:2406.13635v1 Announce Type: new \nAbstract: Analyzing dynamical data often requires information about the temporal labels, but such information is unavailable in many applications. Recovery of these temporal labels, closely related to the seriation or sequencing problem, becomes crucial in the study. However, challenges arise due to the nonlinear nature of the data and the complexity of the underlying dynamical system, which may be periodic or non-periodic. Additionally, noise within the feature space complicates the theoretical analysis. Our work develops spectral algorithms that leverage manifold learning concepts to recover temporal labels from noisy data. We first construct the graph Laplacian of the data, and then employ the second (and the third) Fiedler vectors to recover temporal labels. This method can be applied to both periodic and aperiodic cases. It also does not require monotone properties on the similarity matrix, which are commonly assumed in existing spectral seriation algorithms. We derive the $\\ell_{\\infty}$ error of our estimators for the temporal labels and ranking, without assumptions on the eigen-gap. In numerical experiments, our method outperforms spectral seriation algorithms based on a similarity matrix. The performance of our algorithms is further demonstrated on a synthetic biomolecule data example."}, "https://arxiv.org/abs/2406.13691": {"title": "Computationally efficient multi-level Gaussian process regression for functional data observed under completely or partially regular sampling designs", "link": "https://arxiv.org/abs/2406.13691", "description": "arXiv:2406.13691v1 Announce Type: new \nAbstract: Gaussian process regression is a frequently used statistical method for flexible yet fully probabilistic non-linear regression modeling. A common obstacle is its computational complexity, which scales poorly with the number of observations. This is especially an issue when applying Gaussian process models to multiple functions simultaneously in various applications of functional data analysis.\n We consider a multi-level Gaussian process regression model where a common mean function and individual subject-specific deviations are modeled simultaneously as latent Gaussian processes. We derive exact analytic and computationally efficient expressions for the log-likelihood function and the posterior distributions in the case where the observations are sampled on either a completely or partially regular grid. This enables us to fit the model to large data sets that are currently computationally inaccessible using a standard implementation. 
We show through a simulation study that our analytic expressions are several orders of magnitude faster compared to a standard implementation, and we provide an implementation in the probabilistic programming language Stan."}, "https://arxiv.org/abs/2406.13826": {"title": "Testing identification in mediation and dynamic treatment models", "link": "https://arxiv.org/abs/2406.13826", "description": "arXiv:2406.13826v1 Announce Type: new \nAbstract: We propose a test for the identification of causal effects in mediation and dynamic treatment models that is based on two sets of observed variables, namely covariates to be controlled for and suspected instruments, building on the test by Huber and Kueck (2022) for single treatment models. We consider models with a sequential assignment of a treatment and a mediator to assess the direct treatment effect (net of the mediator), the indirect treatment effect (via the mediator), or the joint effect of both treatment and mediator. We establish testable conditions for identifying such effects in observational data. These conditions jointly imply (1) the exogeneity of the treatment and the mediator conditional on covariates and (2) the validity of distinct instruments for the treatment and the mediator, meaning that the instruments do not directly affect the outcome (other than through the treatment or mediator) and are unconfounded given the covariates. Our framework extends to post-treatment sample selection or attrition problems when replacing the mediator by a selection indicator for observing the outcome, enabling joint testing of the selectivity of treatment and attrition. We propose a machine learning-based test to control for covariates in a data-driven manner and analyze its finite sample performance in a simulation study. Additionally, we apply our method to Slovak labor market data and find that our testable implications are not rejected for a sequence of training programs typically considered in dynamic treatment evaluations."}, "https://arxiv.org/abs/2406.13833": {"title": "Cluster Quilting: Spectral Clustering for Patchwork Learning", "link": "https://arxiv.org/abs/2406.13833", "description": "arXiv:2406.13833v1 Announce Type: new \nAbstract: Patchwork learning arises as a new and challenging data collection paradigm where both samples and features are observed in fragmented subsets. Due to technological limits, measurement expense, or multimodal data integration, such patchwork data structures are frequently seen in neuroscience, healthcare, and genomics, among others. Instead of analyzing each data patch separately, it is highly desirable to extract comprehensive knowledge from the whole data set. In this work, we focus on the clustering problem in patchwork learning, aiming at discovering clusters amongst all samples even when some are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure amongst all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both properties of the patch-wise observation regime as well as the clustering signal and noise dependencies. 
We also validate our Cluster Quilting algorithm through extensive empirical studies on both simulated and real data sets in neuroscience and genomics, where it discovers more accurate and scientifically more plausible clusters than other approaches."}, "https://arxiv.org/abs/2406.13836": {"title": "Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions", "link": "https://arxiv.org/abs/2406.13836", "description": "arXiv:2406.13836v1 Announce Type: new \nAbstract: In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the optimal subsample size. We focus on three settings: the Cox regression model for survival data with rare events and logistic regression for both balanced and imbalanced datasets. Additionally, we present a novel optimal subsampling procedure tailored for logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets."}, "https://arxiv.org/abs/2406.13876": {"title": "An Empirical Bayes Jackknife Regression Framework for Covariance Matrix Estimation", "link": "https://arxiv.org/abs/2406.13876", "description": "arXiv:2406.13876v1 Announce Type: new \nAbstract: Covariance matrix estimation, a classical statistical topic, poses significant challenges when the sample size is comparable to or smaller than the number of features. In this paper, we frame covariance matrix estimation as a compound decision problem and apply an optimal decision rule to estimate covariance parameters. To approximate this rule, we introduce an algorithm that integrates jackknife techniques with machine learning regression methods. This algorithm exhibits adaptability across diverse scenarios without relying on assumptions about data distribution. Simulation results and gene network inference from an RNA-seq experiment in mice demonstrate that our approach either matches or surpasses several state-of-the-art methods"}, "https://arxiv.org/abs/2406.13906": {"title": "Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data", "link": "https://arxiv.org/abs/2406.13906", "description": "arXiv:2406.13906v1 Announce Type: new \nAbstract: The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. 
Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application."}, "https://arxiv.org/abs/2406.13938": {"title": "Coverage of Credible Sets for Regression under Variable Selection", "link": "https://arxiv.org/abs/2406.13938", "description": "arXiv:2406.13938v1 Announce Type: new \nAbstract: We study the asymptotic frequentist coverage of credible sets based on a novel Bayesian approach for a multiple linear regression model under variable selection. We initially ignore the issue of variable selection, which allows us to put a conjugate normal prior on the coefficient vector. The variable selection step is incorporated directly in the posterior through a sparsity-inducing map and uses the induced prior for making an inference instead of the natural conjugate posterior. The sparsity-inducing map minimizes the sum of the squared l2-distance weighted by the data matrix and a suitably scaled l1-penalty term. We obtain the limiting coverage of various credible regions and demonstrate that a modified credible interval for a component has the exact asymptotic frequentist coverage if the corresponding predictor is asymptotically uncorrelated with other predictors. Through extensive simulation, we provide a guideline for choosing the penalty parameter as a function of the credibility level appropriate for the corresponding coverage. We also show finite-sample numerical results that support the conclusions from the asymptotic theory. In addition, we provide the credInt package that implements the method in R to obtain the credible intervals along with the posterior samples."}, "https://arxiv.org/abs/2406.14046": {"title": "Estimating Time-Varying Parameters of Various Smoothness in Linear Models via Kernel Regression", "link": "https://arxiv.org/abs/2406.14046", "description": "arXiv:2406.14046v1 Announce Type: new \nAbstract: We consider estimating nonparametric time-varying parameters in linear models using kernel regression. Our contributions are twofold. First, we consider a broad class of time-varying parameters including deterministic smooth functions, the rescaled random walk, structural breaks, the threshold model and their mixtures. We show that those time-varying parameters can be consistently estimated by kernel regression. Our analysis exploits the smoothness of time-varying parameters rather than their specific form. The second contribution is to reveal that the bandwidth used in kernel regression determines the trade-off between the rate of convergence and the size of the class of time-varying parameters that can be estimated. An implication from our result is that the bandwidth should be proportional to $T^{-1/2}$ if the time-varying parameter follows the rescaled random walk, where $T$ is the sample size. We propose a specific choice of the bandwidth that accommodates a wide range of time-varying parameter models. 
An empirical application shows that the kernel-based estimator with this choice can capture the random-walk dynamics in time-varying parameters."}, "https://arxiv.org/abs/2406.14182": {"title": "Averaging polyhazard models using Piecewise deterministic Monte Carlo with applications to data with long-term survivors", "link": "https://arxiv.org/abs/2406.14182", "description": "arXiv:2406.14182v1 Announce Type: new \nAbstract: Polyhazard models are a class of flexible parametric models for modelling survival over extended time horizons. Their additive hazard structure allows for flexible, non-proportional hazards whose characteristics can change over time while retaining a parametric form, which allows for survival to be extrapolated beyond the observation period of a study. Significant user input is required, however, in selecting the number of latent hazards to model, their distributions and the choice of which variables to associate with each hazard. The resulting set of models is too large to explore manually, limiting their practical usefulness. Motivated by applications to stroke survivor and kidney transplant patient survival times, we extend the standard polyhazard model through a prior structure allowing for joint inference of parameters and structural quantities, and develop a sampling scheme that utilises state-of-the-art Piecewise Deterministic Markov Processes to sample from the resulting transdimensional posterior with minimal user tuning."}, "https://arxiv.org/abs/2406.14184": {"title": "On integral priors for multiple comparison in Bayesian model selection", "link": "https://arxiv.org/abs/2406.14184", "description": "arXiv:2406.14184v1 Announce Type: new \nAbstract: Noninformative priors constructed for estimation purposes are usually not appropriate for model selection and testing. The methodology of integral priors was developed to obtain prior distributions for Bayesian model selection when comparing two models, modifying initial improper reference priors. We propose a generalization of this methodology to more than two models. Our approach adds an artificial copy of each model under comparison by compactifying the parametric space and creating an ergodic Markov chain across all models that returns the integral priors as marginals of the stationary distribution. Besides the guarantee of their existence and the lack of paradoxes attached to estimation reference priors, an additional advantage of this methodology is that the simulation of this Markov chain is straightforward as it only requires simulations of imaginary training samples for all models and from the corresponding posterior distributions. This renders its implementation automatic and generic, both in the nested case and in the nonnested case."}, "https://arxiv.org/abs/2406.14380": {"title": "Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach", "link": "https://arxiv.org/abs/2406.14380", "description": "arXiv:2406.14380v1 Announce Type: new \nAbstract: Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate the treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. 
This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a \"recommender choice model\" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias. We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator."}, "https://arxiv.org/abs/2406.14453": {"title": "The Effective Number of Parameters in Kernel Density Estimation", "link": "https://arxiv.org/abs/2406.14453", "description": "arXiv:2406.14453v1 Announce Type: new \nAbstract: The quest for a formula that satisfactorily measures the effective degrees of freedom in kernel density estimation (KDE) is a long standing problem with few solutions. Starting from the orthogonal polynomial sequence (OPS) expansion for the ratio of the empirical to the oracle density, we show how convolution with the kernel leads to a new OPS with respect to which one may express the resulting KDE. The expansion coefficients of the two OPS systems can then be related via a kernel sensitivity matrix, and this then naturally leads to a definition of effective parameters by taking the trace of a symmetrized positive semi-definite normalized version. The resulting effective degrees of freedom (EDoF) formula is an oracle-based quantity; the first ever proposed in the literature. Asymptotic properties of the empirical EDoF are worked out through influence functions. Numerical investigations confirm the theoretical insights."}, "https://arxiv.org/abs/2406.14535": {"title": "On estimation and order selection for multivariate extremes via clustering", "link": "https://arxiv.org/abs/2406.14535", "description": "arXiv:2406.14535v1 Announce Type: new \nAbstract: We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution involves devising a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and enjoys intuitive implementation in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. 
Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimations of heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation."}, "https://arxiv.org/abs/2406.11308": {"title": "Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?", "link": "https://arxiv.org/abs/2406.11308", "description": "arXiv:2406.11308v1 Announce Type: cross \nAbstract: In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2 - 3% during the color-conversion process of white light-emitting diodes (LEDs)."}, "https://arxiv.org/abs/2406.12908": {"title": "Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal Lens", "link": "https://arxiv.org/abs/2406.12908", "description": "arXiv:2406.12908v1 Announce Type: cross \nAbstract: AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of Multi-Modal Time-Series Forecasting Models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method on a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings. 
Our work will help different stakeholders of time-series forecasting understand the models' behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making."}, "https://arxiv.org/abs/2406.13814": {"title": "Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches", "link": "https://arxiv.org/abs/2406.13814", "description": "arXiv:2406.13814v1 Announce Type: cross \nAbstract: Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates."}, "https://arxiv.org/abs/2406.13944": {"title": "Generalization error of min-norm interpolators in transfer learning", "link": "https://arxiv.org/abs/2406.13944", "description": "arXiv:2406.13944v1 Announce Type: cross \nAbstract: This paper establishes the generalization error of pooled min-$\\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. 
By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results."}, "https://arxiv.org/abs/2406.13966": {"title": "Causal Inference with Latent Variables: Recent Advances and Future Prospectives", "link": "https://arxiv.org/abs/2406.13966", "description": "arXiv:2406.13966v1 Announce Type: cross \nAbstract: Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs)."}, "https://arxiv.org/abs/2406.14003": {"title": "Deep Optimal Experimental Design for Parameter Estimation Problems", "link": "https://arxiv.org/abs/2406.14003", "description": "arXiv:2406.14003v1 Announce Type: cross \nAbstract: Optimal experimental design is a well studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques are changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. 
We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations."}, "https://arxiv.org/abs/2406.14145": {"title": "Temperature in the Iberian Peninsula: Trend, seasonality, and heterogeneity", "link": "https://arxiv.org/abs/2406.14145", "description": "arXiv:2406.14145v1 Announce Type: cross \nAbstract: In this paper, we propose fitting unobserved component models to represent the dynamic evolution of bivariate systems of centre and log-range temperatures obtained monthly from minimum/maximum temperatures observed at a given location. In doing so, the centre and log-range temperature are decomposed into potentially stochastic trends, seasonal, and transitory components. Since our model encompasses deterministic trends and seasonal components as limiting cases, we contribute to the debate on whether stochastic or deterministic components better represent the trend and seasonal components. The methodology is implemented to centre and log-range temperature observed in four locations in the Iberian Peninsula, namely, Barcelona, Coru\\~{n}a, Madrid, and Seville. We show that, at each location, the centre temperature can be represented by a smooth integrated random walk with time-varying slope, while a stochastic level better represents the log-range. We also show that centre and log-range temperature are unrelated. The methodology is then extended to simultaneously model centre and log-range temperature observed at several locations in the Iberian Peninsula. We fit a multi-level dynamic factor model to extract potential commonalities among centre (log-range) temperature while also allowing for heterogeneity in different areas in the Iberian Peninsula. We show that, although the commonality in trends of average temperature is considerable, the regional components are also relevant."}, "https://arxiv.org/abs/2406.14163": {"title": "A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics", "link": "https://arxiv.org/abs/2406.14163", "description": "arXiv:2406.14163v1 Announce Type: cross \nAbstract: Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. 
It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows."}, "https://arxiv.org/abs/2112.07755": {"title": "Separate Exchangeability as Modeling Principle in Bayesian Nonparametrics", "link": "https://arxiv.org/abs/2112.07755", "description": "arXiv:2112.07755v2 Announce Type: replace \nAbstract: We argue for the use of separate exchangeability as a modeling principle in Bayesian nonparametric (BNP) inference. Separate exchangeability is \\emph{de facto} widely applied in the Bayesian parametric case, e.g., it naturally arises in simple mixed models. However, while in some areas, such as random graphs, separate and (closely related) joint exchangeability are widely used, it is curiously underused for several other applications in BNP. We briefly review the definition of separate exchangeability focusing on the implications of such a definition in Bayesian modeling. We then discuss two tractable classes of models that implement separate exchangeability that are the natural counterparts of familiar partially exchangeable BNP models.\n The first is nested random partitions for a data matrix, defining a partition of columns and nested partitions of rows, nested within column clusters. Many recent models for nested partitions implement partially exchangeable models related to variations of the well-known nested Dirichlet process. We argue that inference under such models in some cases ignores important features of the experimental setup. We obtain the separately exchangeable counterpart of such partially exchangeable partition structures.\n The second class is about setting up separately exchangeable priors for a nonparametric regression model when multiple sets of experimental units are involved. We highlight how a Dirichlet process mixture of linear models known as ANOVA DDP can naturally implement separate exchangeability in such regression problems. Finally, we illustrate how to perform inference under such models in two real data examples."}, "https://arxiv.org/abs/2205.10310": {"title": "Treatment Effects in Bunching Designs: The Impact of Mandatory Overtime Pay on Hours", "link": "https://arxiv.org/abs/2205.10310", "description": "arXiv:2205.10310v4 Announce Type: replace \nAbstract: This paper studies the identifying power of bunching at kinks when the researcher does not assume a parametric choice model. I find that in a general choice model, identifying the average causal response to the policy switch at a kink amounts to confronting two extrapolation problems, each about the distribution of a counterfactual choice that is observed only in a censored manner. I apply this insight to partially identify the effect of overtime pay regulation on the hours of U.S. workers using administrative payroll data, assuming that each distribution satisfies a weak non-parametric shape constraint in the region where it is not observed. 
The resulting bounds are informative and indicate a relatively small elasticity of demand for weekly hours, addressing a long-standing question about the causal effects of the overtime mandate."}, "https://arxiv.org/abs/2302.03200": {"title": "Multivariate Bayesian dynamic modeling for causal prediction", "link": "https://arxiv.org/abs/2302.03200", "description": "arXiv:2302.03200v2 Announce Type: replace \nAbstract: Bayesian forecasting is developed in multivariate time series analysis for causal inference. Causal evaluation of sequentially observed time series data from control and treated units focuses on the impacts of interventions using contemporaneous outcomes in control units. Methodological developments here concern multivariate dynamic models for time-varying effects across multiple treated units with explicit foci on sequential learning and aggregation of intervention effects. Analysis explores dimension reduction across multiple synthetic counterfactual predictors. Computational advances leverage fully conjugate models for efficient sequential learning and inference, including cross-unit correlations and their time variation. This allows full uncertainty quantification on model hyper-parameters via Bayesian model averaging. A detailed case study evaluates interventions in a supermarket promotions experiment, with coupled predictive analyses in selected regions of a large-scale commercial system. Comparisons with existing methods highlight the issues of appropriate uncertainty quantification in causal inference in aggregation across treated units, among other practical concerns."}, "https://arxiv.org/abs/2303.14298": {"title": "Sensitivity Analysis in Unconditional Quantile Effects", "link": "https://arxiv.org/abs/2303.14298", "description": "arXiv:2303.14298v3 Announce Type: replace \nAbstract: This paper proposes a framework to analyze the effects of counterfactual policies on the unconditional quantiles of an outcome variable. For a given counterfactual policy, we obtain identified sets for the effect of both marginal and global changes in the proportion of treated individuals. To conduct a sensitivity analysis, we introduce the quantile breakdown frontier, a curve that (i) indicates whether a sensitivity analysis is possible or not, and (ii) when a sensitivity analysis is possible, quantifies the amount of selection bias consistent with a given conclusion of interest across different quantiles. To illustrate our method, we perform a sensitivity analysis on the effect of unionizing low income workers on the quantiles of the distribution of (log) wages."}, "https://arxiv.org/abs/2305.19089": {"title": "Impulse Response Analysis of Structural Nonlinear Time Series Models", "link": "https://arxiv.org/abs/2305.19089", "description": "arXiv:2305.19089v4 Announce Type: replace \nAbstract: This paper proposes a semiparametric sieve approach to estimate impulse response functions of nonlinear time series within a general class of structural autoregressive models. We prove that a two-step procedure can flexibly accommodate nonlinear specifications while avoiding the need to choose fixed parametric forms. Sieve impulse responses are proven to be consistent by deriving uniform estimation guarantees, and an iterative algorithm makes it straightforward to compute them in practice. With simulations, we show that the proposed semiparametric approach proves effective against misspecification while suffering only minor efficiency losses. 
In a US monetary policy application, we find that the pointwise sieve GDP response associated with an interest rate increase is larger than that of a linear model. Finally, in an analysis of interest rate uncertainty shocks, sieve responses imply significantly more substantial contractionary effects both on production and inflation."}, "https://arxiv.org/abs/2306.12949": {"title": "On the use of the Gram matrix for multivariate functional principal components analysis", "link": "https://arxiv.org/abs/2306.12949", "description": "arXiv:2306.12949v2 Announce Type: replace \nAbstract: Dimension reduction is crucial in functional data analysis (FDA). The key tool to reduce the dimension of the data is functional principal component analysis. Existing approaches for functional principal component analysis usually involve the diagonalization of the covariance operator. With the increasing size and complexity of functional datasets, estimating the covariance operator has become more challenging. Therefore, there is a growing need for efficient methodologies to estimate the eigencomponents. Using the duality of the space of observations and the space of functional features, we propose to use the inner-product between the curves to estimate the eigenelements of multivariate and multidimensional functional datasets. The relationship between the eigenelements of the covariance operator and those of the inner-product matrix is established. We explore the application of these methodologies in several FDA settings and provide general guidance on their usability."}, "https://arxiv.org/abs/2307.05732": {"title": "From isotonic to Lipschitz regression: a new interpolative perspective on shape-restricted estimation", "link": "https://arxiv.org/abs/2307.05732", "description": "arXiv:2307.05732v3 Announce Type: replace \nAbstract: This manuscript seeks to bridge two seemingly disjoint paradigms of nonparametric regression estimation based on smoothness assumptions and shape constraints. The proposed approach is motivated by a conceptually simple observation: Every Lipschitz function is a sum of monotonic and linear functions. This principle is further generalized to the higher-order monotonicity and multivariate covariates. A family of estimators is proposed based on a sample-splitting procedure, which inherits desirable methodological, theoretical, and computational properties of shape-restricted estimators. Our theoretical analysis provides convergence guarantees of the estimator under heteroscedastic and heavy-tailed errors, as well as adaptive properties to the complexity of the true regression function. The generality of the proposed decomposition framework is demonstrated through new approximation results, and extensive numerical studies validate the theoretical properties and empirical evidence for the practicalities of the proposed estimation framework."}, "https://arxiv.org/abs/2309.01404": {"title": "Hierarchical Regression Discontinuity Design: Pursuing Subgroup Treatment Effects", "link": "https://arxiv.org/abs/2309.01404", "description": "arXiv:2309.01404v2 Announce Type: replace \nAbstract: Regression discontinuity design (RDD) is widely adopted for causal inference under intervention determined by a continuous variable. While one is interested in treatment effect heterogeneity by subgroups in many applications, RDD typically suffers from small subgroup-wise sample sizes, which makes the estimation results highly unstable. 
To solve this issue, we introduce hierarchical RDD (HRDD), a hierarchical Bayes approach for pursuing treatment effect heterogeneity in RDD. A key feature of HRDD is to employ a pseudo-model based on a loss function to estimate subgroup-level parameters of treatment effects under RDD, and assign a hierarchical prior distribution to ''borrow strength'' from other subgroups. The posterior computation can be easily done by a simple Gibbs sampling, and the optimal bandwidth can be automatically selected by the Hyv\\\"{a}rinen scores for unnormalized models. We demonstrate the proposed HRDD through simulation and real data analysis, and show that HRDD provides much more stable point and interval estimation than separately applying the standard RDD method to each subgroup."}, "https://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "https://arxiv.org/abs/2310.04924", "description": "arXiv:2310.04924v3 Announce Type: replace \nAbstract: Monte Carlo significance tests are a general tool that produces p-values by generating samples from the null distribution. However, Monte Carlo tests are limited to null hypotheses from which we can sample exactly. Markov chain Monte Carlo (MCMC) significance tests are a way to produce statistically valid p-values for null hypotheses we can only approximately sample from. These methods were first introduced by Besag and Clifford in 1989 and make no assumptions on the mixing time of the MCMC procedure. Here we review the two methods of Besag and Clifford and introduce a new method that unifies the existing procedures. We use simple examples to highlight the difference between MCMC significance tests and standard Monte Carlo tests based on exact sampling. We also survey a range of contemporary applications in the literature including goodness-of-fit testing for the Rasch model, tests for detecting gerrymandering [8] and a permutation based test of conditional independence [3]."}, "https://arxiv.org/abs/2310.17806": {"title": "Transporting treatment effects from difference-in-differences studies", "link": "https://arxiv.org/abs/2310.17806", "description": "arXiv:2310.17806v2 Announce Type: replace \nAbstract: Difference-in-differences (DID) is a popular approach to identify the causal effects of treatments and policies in the presence of unmeasured confounding. DID identifies the sample average treatment effect in the treated (SATT). However, a goal of such research is often to inform decision-making in target populations outside the treated sample. Transportability methods have been developed to extend inferences from study samples to external target populations; these methods have primarily been developed and applied in settings where identification is based on conditional independence between the treatment and potential outcomes, such as in a randomized trial. We present a novel approach to identifying and estimating effects in a target population, based on DID conducted in a study sample that differs from the target population. We present a range of assumptions under which one may identify causal effects in the target population and employ causal diagrams to illustrate these assumptions. In most realistic settings, results depend critically on the assumption that any unmeasured confounders are not effect measure modifiers on the scale of the effect of interest (e.g., risk difference, odds ratio). 
We develop several estimators of transported effects, including g-computation, inverse odds weighting, and a doubly robust estimator based on the efficient influence function. Simulation results support theoretical properties of the proposed estimators. As an example, we apply our approach to study the effects of a 2018 US federal smoke-free public housing law on air quality in public housing across the US, using data from a DID study conducted in New York City alone."}, "https://arxiv.org/abs/2310.20088": {"title": "Functional Principal Component Analysis for Distribution-Valued Processes", "link": "https://arxiv.org/abs/2310.20088", "description": "arXiv:2310.20088v2 Announce Type: replace \nAbstract: We develop statistical models for samples of distribution-valued stochastic processes featuring time-indexed univariate distributions, with emphasis on functional principal component analysis. The proposed model presents an intrinsic rather than transformation-based approach. The starting point is a transport process representation for distribution-valued processes under the Wasserstein metric. Substituting transports for distributions addresses the challenge of centering distribution-valued processes and leads to a useful and interpretable decomposition of each realized process into a process-specific single transport and a real-valued trajectory. This representation makes it possible to utilize a scalar multiplication operation for transports and facilitates not only functional principal component analysis but also the introduction of a latent Gaussian process. This Gaussian process proves especially useful for the case where the distribution-valued processes are only observed on a sparse grid of time points, establishing an approach for longitudinal distribution-valued data. We study the convergence of the key components of this novel representation to their population targets and demonstrate the practical utility of the proposed approach through simulations and several data illustrations."}, "https://arxiv.org/abs/2312.16260": {"title": "Multinomial Link Models", "link": "https://arxiv.org/abs/2312.16260", "description": "arXiv:2312.16260v2 Announce Type: replace \nAbstract: We propose a unified multinomial link model for analyzing categorical responses. It not only covers the existing multinomial logistic models and their extensions as special cases, but also includes new models that can incorporate the observations with NA or Unknown responses in the data analysis. We provide explicit formulae and detailed algorithms for finding the maximum likelihood estimates of the model parameters and computing the Fisher information matrix. Our algorithms solve the infeasibility issue of existing statistical software in estimating parameters of cumulative link models. The applications to real datasets show that the new models can fit the data significantly better, and the corresponding data analysis may correct the misleading conclusions due to missing responses."}, "https://arxiv.org/abs/2406.14636": {"title": "MSmix: An R Package for clustering partial rankings via mixtures of Mallows Models with Spearman distance", "link": "https://arxiv.org/abs/2406.14636", "description": "arXiv:2406.14636v1 Announce Type: new \nAbstract: MSmix is a recently developed R package implementing maximum likelihood estimation of finite mixtures of Mallows models with Spearman distance for full and partial rankings. 
The package is designed to implement computationally tractable estimation routines of the model parameters, with the ability to handle arbitrary forms of partial rankings and sequences of a large number of items. The frequentist estimation task is accomplished via EM algorithms, integrating data augmentation strategies to recover the unobserved heterogeneity and the missing ranks. The package also provides functionalities for uncertainty quantification of the estimated parameters, via diverse bootstrap methods and asymptotic confidence intervals. Generic methods for S3 class objects are constructed for more effectively managing the output of the main routines. The usefulness of the package and its computational performance compared with competing software is illustrated via applications to both simulated and original real ranking datasets."}, "https://arxiv.org/abs/2406.14650": {"title": "Conditional correlation estimation and serial dependence identification", "link": "https://arxiv.org/abs/2406.14650", "description": "arXiv:2406.14650v1 Announce Type: new \nAbstract: It has been recently shown in Jaworski, P., Jelito, D. and Pitera, M. (2024), 'A note on the equivalence between the conditional uncorrelation and the independence of random variables', Electronic Journal of Statistics 18(1), that one can characterise the independence of random variables via the family of conditional correlations on quantile-induced sets. This effectively shows that the localized linear measure of dependence is able to detect any form of nonlinear dependence for appropriately chosen conditioning sets. In this paper, we expand this concept, focusing on the statistical properties of conditional correlation estimators and their potential usage in serial dependence identification. In particular, we show how to estimate conditional correlations in generic and serial dependence setups, discuss key properties of the related estimators, define the conditional equivalent of the autocorrelation function, and provide a series of examples which prove that the proposed framework could be efficiently used in many practical econometric applications."}, "https://arxiv.org/abs/2406.14717": {"title": "Analysis of Linked Files: A Missing Data Perspective", "link": "https://arxiv.org/abs/2406.14717", "description": "arXiv:2406.14717v1 Announce Type: new \nAbstract: In many applications, researchers seek to identify overlapping entities across multiple data files. Record linkage algorithms facilitate this task, in the absence of unique identifiers. As these algorithms rely on semi-identifying information, they may miss records that represent the same entity, or incorrectly link records that do not represent the same entity. Analysis of linked files commonly ignores such linkage errors, resulting in biased, or overly precise estimates of the associations of interest. We view record linkage as a missing data problem, and delineate the linkage mechanisms that underpin analysis methods with linked files. Following the missing data literature, we group these methods under three categories: likelihood and Bayesian methods, imputation methods, and weighting methods. 
We summarize the assumptions and limitations of the methods, and evaluate their performance in a wide range of simulation scenarios."}, "https://arxiv.org/abs/2406.14738": {"title": "Robust parameter estimation for partially observed second-order diffusion processes", "link": "https://arxiv.org/abs/2406.14738", "description": "arXiv:2406.14738v1 Announce Type: new \nAbstract: Estimating parameters of a diffusion process given continuous-time observations of the process via maximum likelihood approaches or, online, via stochastic gradient descent or Kalman filter formulations constitutes a well-established research area. It has also been established previously that these techniques are, in general, not robust to perturbations in the data in the form of temporal correlations. While the subject is relatively well understood and appropriate modifications have been suggested in the context of multi-scale diffusion processes and their reduced model equations, we consider here an alternative setting where a second-order diffusion process in positions and velocities is only observed via its positions. In this note, we propose a simple modification to standard stochastic gradient descent and Kalman filter formulations, which eliminates the arising systematic estimation biases. The modification can be extended to standard maximum likelihood approaches and avoids computation of previously proposed correction terms."}, "https://arxiv.org/abs/2406.14772": {"title": "Consistent community detection in multi-layer networks with heterogeneous differential privacy", "link": "https://arxiv.org/abs/2406.14772", "description": "arXiv:2406.14772v1 Announce Type: new \nAbstract: As network data has become increasingly prevalent, a substantial amount of attention has been paid to the privacy issue in publishing network data. One of the critical challenges for data publishers is to preserve the topological structures of the original network while protecting sensitive information. In this paper, we propose a personalized edge flipping mechanism that allows data publishers to protect edge information based on each node's privacy preference. It can achieve differential privacy while preserving the community structure under the multi-layer degree-corrected stochastic block model after appropriately debiasing, and thus consistent community detection in the privatized multi-layer networks is achievable. Theoretically, we establish the consistency of community detection in the privatized multi-layer network and show that better privacy protection of edges can be obtained for a proportion of nodes while allowing other nodes to give up their privacy. Furthermore, the advantage of the proposed personalized edge-flipping mechanism is also supported by its numerical performance on various synthetic networks and a real-life multi-layer network."}, "https://arxiv.org/abs/2406.14814": {"title": "Frank copula is minimum information copula under fixed Kendall's $\\tau$", "link": "https://arxiv.org/abs/2406.14814", "description": "arXiv:2406.14814v1 Announce Type: new \nAbstract: In dependence modeling, various copulas have been utilized. Among them, the Frank copula has been one of the most typical choices due to its simplicity. In this work, we demonstrate that the Frank copula is the minimum information copula under fixed Kendall's $\\tau$ (MICK), both theoretically and numerically. First, we explain that both MICK and the Frank density follow the hyperbolic Liouville equation. 
Moreover, we show that the copula density satisfying the Liouville equation is uniquely the Frank copula. Our result asserts that selecting the Frank copula as an appropriate copula model is equivalent to using Kendall's $\\tau$ as the sole available information about the true distribution, based on the entropy maximization principle."}, "https://arxiv.org/abs/2406.14904": {"title": "Enhancing reliability in prediction intervals using point forecasters: Heteroscedastic Quantile Regression and Width-Adaptive Conformal Inference", "link": "https://arxiv.org/abs/2406.14904", "description": "arXiv:2406.14904v1 Announce Type: new \nAbstract: Building prediction intervals for time series forecasting problems presents a complex challenge, particularly when relying solely on point predictors, a common scenario for practitioners in the industry. While research has primarily focused on achieving increasingly efficient valid intervals, we argue that, when evaluating a set of intervals, traditional measures alone are insufficient. There are additional crucial characteristics: the intervals must vary in length, with this variation directly linked to the difficulty of the prediction, and the coverage of the interval must remain independent of the difficulty of the prediction for practical utility. We propose the Heteroscedastic Quantile Regression (HQR) model and the Width-Adaptive Conformal Inference (WACI) method, providing theoretical coverage guarantees, to overcome those issues, respectively. The methodologies are evaluated in the context of Electricity Price Forecasting and Wind Power Forecasting, representing complex scenarios in time series forecasting. The results demonstrate that HQR and WACI not only improve or achieve typical measures of validity and efficiency but also successfully fulfil the aforementioned, commonly ignored characteristics."}, "https://arxiv.org/abs/2406.15078": {"title": "The Influence of Nuisance Parameter Uncertainty on Statistical Inference in Practical Data Science Models", "link": "https://arxiv.org/abs/2406.15078", "description": "arXiv:2406.15078v1 Announce Type: new \nAbstract: For multiple reasons -- such as avoiding overtraining from one data set or because of having received numerical estimates for some parameters in a model from an alternative source -- it is sometimes useful to divide a model's parameters into one group of primary parameters and one group of nuisance parameters. However, uncertainty in the values of nuisance parameters is an inevitable factor that impacts the model's reliability. This paper examines the issue of uncertainty calculation for primary parameters of interest in the presence of nuisance parameters. We illustrate a general procedure on two distinct model forms: 1) the GARCH time series model with univariate nuisance parameter and 2) multiple hidden layer feed-forward neural network models with multivariate nuisance parameters. Leveraging an existing theoretical framework for nuisance parameter uncertainty, we show how to modify the confidence regions for the primary parameters while considering the inherent uncertainty introduced by nuisance parameters. Furthermore, our study validates the practical effectiveness of adjusted confidence regions that properly account for uncertainty in nuisance parameters. 
Such an adjustment helps data scientists produce results that more honestly reflect the overall uncertainty."}, "https://arxiv.org/abs/2406.15157": {"title": "MIDAS-QR with 2-Dimensional Structure", "link": "https://arxiv.org/abs/2406.15157", "description": "arXiv:2406.15157v1 Announce Type: new \nAbstract: Mixed frequency data has been shown to improve the performance of growth-at-risk models in the literature. Most of the research has focused on imposing structure on the high-frequency lags when estimating MIDAS-QR models akin to what is done in mean models. However, only imposing structure on the lag-dimension can potentially induce quantile variation that would otherwise not be there. In this paper we extend the framework by introducing structure on both the lag dimension and the quantile dimension. In this way we are able to shrink unnecessary quantile variation in the high-frequency variables. This leads to more gradual lag profiles in both dimensions compared to the MIDAS-QR and UMIDAS-QR. We show that this proposed method leads to further gains in nowcasting and forecasting on a pseudo-out-of-sample exercise on US data."}, "https://arxiv.org/abs/2406.15170": {"title": "Inference for Delay Differential Equations Using Manifold-Constrained Gaussian Processes", "link": "https://arxiv.org/abs/2406.15170", "description": "arXiv:2406.15170v1 Announce Type: new \nAbstract: Dynamic systems described by differential equations often involve feedback among system components. When there are time delays for components to sense and respond to feedback, delay differential equation (DDE) models are commonly used. This paper considers the problem of inferring unknown system parameters, including the time delays, from noisy and sparse experimental data observed from the system. We propose an extension of manifold-constrained Gaussian processes to conduct parameter inference for DDEs, whereas the time delay parameters have posed a challenge for existing methods that bypass numerical solvers. Our method uses a Bayesian framework to impose a Gaussian process model over the system trajectory, conditioned on the manifold constraint that satisfies the DDEs. For efficient computation, a linear interpolation scheme is developed to approximate the values of the time-delayed system outputs, along with corresponding theoretical error bounds on the approximated derivatives. Two simulation examples, based on Hutchinson's equation and the lac operon system, together with a real-world application using Ontario COVID-19 data, are used to illustrate the efficacy of our method."}, "https://arxiv.org/abs/2406.15285": {"title": "Monte Carlo Integration in Simple and Complex Simulation Designs", "link": "https://arxiv.org/abs/2406.15285", "description": "arXiv:2406.15285v1 Announce Type: new \nAbstract: Simulation studies are used to evaluate and compare the properties of statistical methods in controlled experimental settings. In most cases, performing a simulation study requires knowledge of the true value of the parameter, or estimand, of interest. However, in many simulation designs, the true value of the estimand is difficult to compute analytically. Here, we illustrate the use of Monte Carlo integration to compute true estimand values in simple and complex simulation designs. 
We provide general pseudocode that can be replicated in any software program of choice to demonstrate key principles in using Monte Carlo integration in two scenarios: a simple three-variable simulation where interest lies in the marginally adjusted odds ratio; and a more complex causal mediation analysis where interest lies in the controlled direct effect in the presence of mediator-outcome confounders affected by the exposure. We discuss general strategies that can be used to minimize Monte Carlo error, and to serve as checks on the simulation program to avoid coding errors. R programming code is provided illustrating the application of our pseudocode in these settings."}, "https://arxiv.org/abs/2406.15288": {"title": "Difference-in-Differences with Time-Varying Covariates in the Parallel Trends Assumption", "link": "https://arxiv.org/abs/2406.15288", "description": "arXiv:2406.15288v1 Announce Type: new \nAbstract: In this paper, we study difference-in-differences identification and estimation strategies where the parallel trends assumption holds after conditioning on time-varying covariates and/or time-invariant covariates. Our first main contribution is to point out a number of weaknesses of commonly used two-way fixed effects (TWFE) regressions in this context. In addition to issues related to multiple periods and variation in treatment timing that have been emphasized in the literature, we show that, even in the case with only two time periods, TWFE regressions are not generally robust to (i) paths of untreated potential outcomes depending on the level of time-varying covariates (as opposed to only the change in the covariates over time), (ii) paths of untreated potential outcomes depending on time-invariant covariates, and (iii) violations of linearity conditions for outcomes over time and/or the propensity score. Even in cases where none of the previous three issues hold, we show that TWFE regressions can suffer from negative weighting and weight-reversal issues. Thus, TWFE regressions can deliver misleading estimates of causal effect parameters in a number of empirically relevant cases. Second, we extend these arguments to the case of multiple periods and variation in treatment timing. Third, we provide simple diagnostics for assessing the extent of misspecification bias arising due to TWFE regressions. Finally, we propose alternative (and simple) estimation strategies that can circumvent these issues with two-way fixed effects regressions."}, "https://arxiv.org/abs/2406.14753": {"title": "A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms", "link": "https://arxiv.org/abs/2406.14753", "description": "arXiv:2406.14753v1 Announce Type: cross \nAbstract: We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of our approach and derive an algorithm based on a specific instance of this approach. Our empirical results demonstrate the significant benefits of our approach."}, "https://arxiv.org/abs/2406.14808": {"title": "On the estimation rate of Bayesian PINN for inverse problems", "link": "https://arxiv.org/abs/2406.14808", "description": "arXiv:2406.14808v1 Announce Type: cross \nAbstract: Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. 
Although several architectures exist for PINNs that work remarkably well in practice, our theoretical understanding of their performance is somewhat limited. In this work, we study the behavior of a Bayesian PINN estimator of the solution of a PDE from $n$ independent noisy measurements of the solution. We focus on a class of equations that are linear in their parameters (with unknown coefficients $\\theta_\\star$). We show that when the partial differential equation admits a classical solution (say $u_\\star$), differentiable to order $\\beta$, the mean square error of the Bayesian posterior mean is at least of order $n^{-2\\beta/(2\\beta + d)}$. Furthermore, we establish a convergence rate of the linear coefficients of $\\theta_\\star$ depending on the order of the underlying differential operator. Last but not least, our theoretical results are validated through extensive simulations."}, "https://arxiv.org/abs/2406.15195": {"title": "Multiscale modelling of animal movement with persistent dynamics", "link": "https://arxiv.org/abs/2406.15195", "description": "arXiv:2406.15195v1 Announce Type: cross \nAbstract: Wild animals are commonly fitted with trackers that record their position through time, to learn about their behaviour. Broadly, statistical models for tracking data often fall into two categories: local models focus on describing small-scale movement decisions, and global models capture large-scale spatial distributions. Due to this dichotomy, it is challenging to describe mathematically how animals' distributions arise from their short-term movement patterns, and to combine data sets collected at different scales. We propose a multiscale model of animal movement and space use based on the underdamped Langevin process, widely used in statistical physics. The model is convenient to describe animal movement for three reasons: it is specified in continuous time (such that its parameters are not dependent on an arbitrary time scale), its speed and direction are autocorrelated (similarly to real animal trajectories), and it has a closed form stationary distribution that we can view as a model of long-term space use. We use the common form of a resource selection function for the stationary distribution, to model the environmental drivers behind the animal's movement decisions. We further increase flexibility by allowing movement parameters to be time-varying, e.g., to account for daily cycles in an animal's activity. We formulate the model as a state-space model and present a method of inference based on the Kalman filter. The approach requires discretising the continuous-time process, and we use simulations to investigate performance for various time resolutions of observation. The approach works well at fine resolutions, though the estimated stationary distribution tends to be too flat when time intervals between observations are very long."}, "https://arxiv.org/abs/2406.15311": {"title": "The disruption index suffers from citation inflation and is confounded by shifts in scholarly citation practice", "link": "https://arxiv.org/abs/2406.15311", "description": "arXiv:2406.15311v1 Announce Type: cross \nAbstract: Measuring the rate of innovation in academia and industry is fundamental to monitoring the efficiency and competitiveness of the knowledge economy. To this end, a disruption index (CD) was recently developed and applied to publication and patent citation networks (Wu et al., Nature 2019; Park et al., Nature 2023). 
Here we show that CD systematically decreases over time due to secular growth in research and patent production, following two distinct mechanisms unrelated to innovation -- one behavioral and the other structural. Whereas the behavioral explanation reflects shifts associated with techno-social factors (e.g. self-citation practices), the structural explanation follows from `citation inflation' (CI), an inextricable feature of real citation networks attributable to increasing reference list lengths, which causes CD to systematically decrease. We demonstrate this causal link by way of mathematical deduction, computational simulation, multi-variate regression, and quasi-experimental comparison of the disruptiveness of PNAS versus PNAS Plus articles, which differ only in their lengths. Accordingly, we analyze CD data available in the SciSciNet database and find that disruptiveness incrementally increased from 2005-2015, and that the negative relationship between disruption and team-size is remarkably small in overall effect size, and shifts from negative to positive for team size $\\geq$ 8 coauthors."}, "https://arxiv.org/abs/2107.09235": {"title": "Distributional Effects with Two-Sided Measurement Error: An Application to Intergenerational Income Mobility", "link": "https://arxiv.org/abs/2107.09235", "description": "arXiv:2107.09235v3 Announce Type: replace \nAbstract: This paper considers identification and estimation of distributional effect parameters that depend on the joint distribution of an outcome and another variable of interest (\"treatment\") in a setting with \"two-sided\" measurement error -- that is, where both variables are possibly measured with error. Examples of these parameters in the context of intergenerational income mobility include transition matrices, rank-rank correlations, and the poverty rate of children as a function of their parents' income, among others. Building on recent work on quantile regression (QR) with measurement error in the outcome (particularly, Hausman, Liu, Luo, and Palmer (2021)), we show that, given (i) two linear QR models separately for the outcome and treatment conditional on other observed covariates and (ii) assumptions about the measurement error for each variable, one can recover the joint distribution of the outcome and the treatment. Besides these conditions, our approach does not require an instrument, repeated measurements, or distributional assumptions about the measurement error. Using recent data from the 1997 National Longitudinal Study of Youth, we find that accounting for measurement error notably reduces several estimates of intergenerational mobility parameters."}, "https://arxiv.org/abs/2206.01824": {"title": "Estimation of Over-parameterized Models from an Auto-Modeling Perspective", "link": "https://arxiv.org/abs/2206.01824", "description": "arXiv:2206.01824v3 Announce Type: replace \nAbstract: From a model-building perspective, we propose a paradigm shift for fitting over-parameterized models. Philosophically, the mindset is to fit models to future observations rather than to the observed sample. Technically, given an imputation method to generate future observations, we fit over-parameterized models to these future observations by optimizing an approximation of the desired expected loss function based on its sample counterpart and an adaptive $\\textit{duality function}$. The required imputation method is also developed using the same estimation technique with an adaptive $m$-out-of-$n$ bootstrap approach. 
We illustrate its applications with the many-normal-means problem, $n < p$ linear regression, and neural network-based image classification of MNIST digits. The numerical results demonstrate its superior performance across these diverse applications. While primarily expository, the paper conducts an in-depth investigation into the theoretical aspects of the topic. It concludes with remarks on some open problems."}, "https://arxiv.org/abs/2210.17063": {"title": "Shrinkage Methods for Treatment Choice", "link": "https://arxiv.org/abs/2210.17063", "description": "arXiv:2210.17063v2 Announce Type: replace \nAbstract: This study examines the problem of determining whether to treat individuals based on observed covariates. The most common decision rule is the conditional empirical success (CES) rule proposed by Manski (2004), which assigns individuals to treatments that yield the best experimental outcomes conditional on the observed covariates. Conversely, using shrinkage estimators, which shrink unbiased but noisy preliminary estimates toward the average of these estimates, is a common approach in statistical estimation problems because it is well-known that shrinkage estimators have smaller mean squared errors than unshrunk estimators. Inspired by this idea, we propose a computationally tractable shrinkage rule that selects the shrinkage factor by minimizing the upper bound of the maximum regret. Then, we compare the maximum regret of the proposed shrinkage rule with that of CES and pooling rules when the parameter space is correctly specified or misspecified. Our theoretical results demonstrate that the shrinkage rule performs well in many cases and these findings are further supported by numerical experiments. Specifically, we show that the maximum regret of the shrinkage rule can be strictly smaller than that of the CES and pooling rules in certain cases when the parameter space is correctly specified. In addition, we find that the shrinkage rule is robust against misspecifications of the parameter space. Finally, we apply our method to experimental data from the National Job Training Partnership Act Study."}, "https://arxiv.org/abs/2212.11442": {"title": "Small-time approximation of the transition density for diffusions with singularities", "link": "https://arxiv.org/abs/2212.11442", "description": "arXiv:2212.11442v2 Announce Type: replace \nAbstract: The Wright-Fisher (W-F) diffusion model serves as a foundational framework for interpreting population evolution through allele frequency dynamics over time. Despite the known transition probability between consecutive generations, an exact analytical expression for the transition density at arbitrary time intervals remains elusive. Commonly utilized distributions such as Gaussian or Beta inadequately address the fixation issue at extreme allele frequencies (0 or 1), particularly for short periods. In this study, we introduce two alternative parametric functions, namely the Asymptotic Expansion (AE) and the Gaussian approximation (GaussA), derived through probabilistic methodologies, aiming to better approximate this density. The AE function provides a suitable density for allele frequency distributions, encompassing extreme values within the interval [0,1]. Additionally, we outline the range of validity for the GaussA approximation. While our primary focus is on W-F diffusion, we demonstrate how our findings extend to other diffusion models featuring singularities. 
Through simulations of allele frequencies under a W-F process and employing a recently developed adaptive density estimation method, we conduct a comparative analysis to assess the fit of the proposed densities against the Beta and Gaussian distributions."}, "https://arxiv.org/abs/2302.12648": {"title": "New iterative algorithms for estimation of item functioning", "link": "https://arxiv.org/abs/2302.12648", "description": "arXiv:2302.12648v2 Announce Type: replace \nAbstract: When the item functioning of multi-item measurement is modeled with three or four-parameter models, parameter estimation may become challenging. Effective algorithms are crucial in such scenarios. This paper explores innovations to parameter estimation in generalized logistic regression models, which may be used in item response modeling to account for guessing/pretending or slipping/dissimulation and for the effect of covariates. We introduce a new implementation of the EM algorithm and propose a new algorithm based on the parametrized link function. The two novel iterative algorithms are compared to existing methods in a simulation study. Additionally, the study examines software implementation, including the specification of initial values for numerical algorithms and asymptotic properties with an estimation of standard errors. Overall, the newly proposed algorithm based on the parametrized link function outperforms other procedures, especially for small sample sizes. Moreover, the newly implemented EM algorithm provides additional information regarding respondents' inclination to guess or pretend and slip or dissimulate when answering the item. The study also discusses applications of the methods in the context of the detection of differential item functioning. Methods are demonstrated using real data from psychological and educational assessments."}, "https://arxiv.org/abs/2306.10221": {"title": "Dynamic Modeling of Sparse Longitudinal Data and Functional Snippets With Stochastic Differential Equations", "link": "https://arxiv.org/abs/2306.10221", "description": "arXiv:2306.10221v2 Announce Type: replace \nAbstract: Sparse functional/longitudinal data have attracted widespread interest due to the prevalence of such data in social and life sciences. A prominent scenario where such data are routinely encountered are accelerated longitudinal studies, where subjects are enrolled in the study at a random time and are only tracked for a short amount of time relative to the domain of interest. The statistical analysis of such functional snippets is challenging since information for the far-off-diagonal regions of the covariance structure is missing. Our main methodological contribution is to address this challenge by bypassing covariance estimation and instead modeling the underlying process as the solution of a data-adaptive stochastic differential equation. Taking advantage of the interface between Gaussian functional data and stochastic differential equations makes it possible to efficiently reconstruct the target process by estimating its dynamic distribution. The proposed approach allows one to consistently recover forward sample paths from functional snippets at the subject level. We establish the existence and uniqueness of the solution to the proposed data-driven stochastic differential equation and derive rates of convergence for the corresponding estimators. 
The finite-sample performance is demonstrated with simulation studies and functional snippets arising from a growth study and spinal bone mineral density data."}, "https://arxiv.org/abs/2308.15770": {"title": "Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data", "link": "https://arxiv.org/abs/2308.15770", "description": "arXiv:2308.15770v3 Announce Type: replace \nAbstract: Concentrations of pathogen genomes measured in wastewater have recently become available as a new data source to use when modeling the spread of infectious diseases. One promising use for this data source is inference of the effective reproduction number, the average number of individuals a newly infected person will infect. We propose a model where new infections arrive according to a time-varying immigration rate which can be interpreted as an average number of secondary infections produced by one infectious individual per unit time. This model allows us to estimate the effective reproduction number from concentrations of pathogen genomes while avoiding difficult to verify assumptions about the dynamics of the susceptible population. As a byproduct of our primary goal, we also produce a new model for estimating the effective reproduction number from case data using the same framework. We test this modeling framework in an agent-based simulation study with a realistic data generating mechanism which accounts for the time-varying dynamics of pathogen shedding. Finally, we apply our new model to estimating the effective reproduction number of SARS-CoV-2 in Los Angeles, California, using pathogen RNA concentrations collected from a large wastewater treatment facility."}, "https://arxiv.org/abs/2211.14692": {"title": "Radial Neighbors for Provably Accurate Scalable Approximations of Gaussian Processes", "link": "https://arxiv.org/abs/2211.14692", "description": "arXiv:2211.14692v4 Announce Type: replace-cross \nAbstract: In geostatistical problems with massive sample size, Gaussian processes can be approximated using sparse directed acyclic graphs to achieve scalable $O(n)$ computational complexity. In these models, data at each location are typically assumed conditionally dependent on a small set of parents which usually include a subset of the nearest neighbors. These methodologies often exhibit excellent empirical performance, but the lack of theoretical validation leads to unclear guidance in specifying the underlying graphical model and sensitivity to graph choice. We address these issues by introducing radial neighbors Gaussian processes (RadGP), a class of Gaussian processes based on directed acyclic graphs in which directed edges connect every location to all of its neighbors within a predetermined radius. We prove that any radial neighbors Gaussian process can accurately approximate the corresponding unrestricted Gaussian process in Wasserstein-2 distance, with an error rate determined by the approximation radius, the spatial covariance function, and the spatial dispersion of samples. 
We offer further empirical validation of our approach via applications on simulated and real world data showing excellent performance in both prior and posterior approximations to the original Gaussian process."}, "https://arxiv.org/abs/2311.12717": {"title": "Phylogenetic least squares estimation without genetic distances", "link": "https://arxiv.org/abs/2311.12717", "description": "arXiv:2311.12717v2 Announce Type: replace-cross \nAbstract: Least squares estimation of phylogenies is an established family of methods with good statistical properties. State-of-the-art least squares phylogenetic estimation proceeds by first estimating a distance matrix, which is then used to determine the phylogeny by minimizing a squared-error loss function. Here, we develop a method for least squares phylogenetic inference that does not rely on a pre-estimated distance matrix. Our approach allows us to circumvent the typical need to first estimate a distance matrix by forming a new loss function inspired by the phylogenetic likelihood score function; in this manner, inference is not based on a summary statistic of the sequence data, but directly on the sequence data itself. We use a Jukes-Cantor substitution model to show that our method leads to improvements over ordinary least squares phylogenetic inference, and is even observed to rival maximum likelihood estimation in terms of topology estimation efficiency. Using a Kimura 2-parameter model, we show that our method also allows for estimation of the global transition/transversion ratio simultaneously with the phylogeny and its branch lengths. This is impossible to accomplish with any other distance-based method as far as we know. Our developments pave the way for more optimal phylogenetic inference under the least squares framework, particularly in settings under which likelihood-based inference is infeasible, including when one desires to build a phylogeny based on information provided by only a subset of all possible nucleotide substitutions such as synonymous or non-synonymous substitutions."}, "https://arxiv.org/abs/2406.15573": {"title": "Sparse Bayesian multidimensional scaling(s)", "link": "https://arxiv.org/abs/2406.15573", "description": "arXiv:2406.15573v1 Announce Type: new \nAbstract: Bayesian multidimensional scaling (BMDS) is a probabilistic dimension reduction tool that allows one to model and visualize data consisting of dissimilarities between pairs of objects. Although BMDS has proven useful within, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require a burdensome order of $N^2$ floating-point operations, where $N$ is the number of data points. Thus, BMDS becomes impractical as $N$ grows large. We propose and compare two sparse versions of BMDS (sBMDS) that apply log-likelihood and gradient computations to subsets of the observed dissimilarity matrix data. Landmark sBMDS (L-sBMDS) extracts columns, while banded sBMDS (B-sBMDS) extracts diagonals of the data. These sparse variants let one specify a time complexity between $N^2$ and $N$. Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency across all models using both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. 
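The banded variant described above evaluates the likelihood only on dissimilarities lying within a fixed number of diagonals of the data matrix. The sketch below restricts a simple Gaussian-error MDS log-likelihood to those pairs; the plain Gaussian error model is an illustrative stand-in for the truncated-normal likelihood usually used in BMDS.

```python
import numpy as np

def banded_sbmds_loglik(delta, X, sigma2, bands):
    """Gaussian-error MDS log-likelihood restricted to pairs (i, j) with 0 < j - i <= bands.

    delta  : observed dissimilarity matrix (n x n)
    X      : latent low-dimensional coordinates (n x d)
    sigma2 : error variance
    """
    n = X.shape[0]
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, min(i + bands + 1, n)):
            d_ij = np.linalg.norm(X[i] - X[j])
            ll += -0.5 * np.log(2 * np.pi * sigma2) - (delta[i, j] - d_ij) ** 2 / (2 * sigma2)
    return ll

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # noise-free toy dissimilarities
print(banded_sbmds_loglik(delta, X, sigma2=0.1, bands=10))
```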
We observe approximately 3-fold, 10-fold and 40-fold speedups with negligible loss of accuracy, when applying the sBMDS likelihoods and gradients to 500, 1,000 and 5,000 data points with 50 bands (landmarks); these speedups only increase with the size of data considered. Finally, we apply the sBMDS variants to the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks."}, "https://arxiv.org/abs/2406.15582": {"title": "Graphical copula GARCH modeling with dynamic conditional dependence", "link": "https://arxiv.org/abs/2406.15582", "description": "arXiv:2406.15582v1 Announce Type: new \nAbstract: Modeling returns on large portfolios is a challenging problem as the number of parameters in the covariance matrix grows as the square of the size of the portfolio. Traditional correlation models, for example, the dynamic conditional correlation (DCC)-GARCH model, often ignore the nonlinear dependencies in the tail of the return distribution. In this paper, we aim to develop a framework to model the nonlinear dependencies dynamically, namely the graphical copula GARCH (GC-GARCH) model. Motivated from the capital asset pricing model, to allow modeling of large portfolios, the number of parameters can be greatly reduced by introducing conditional independence among stocks given some risk factors. The joint distribution of the risk factors is factorized using a directed acyclic graph (DAG) with pair-copula construction (PCC) to enhance the modeling of the tails of the return distribution while offering the flexibility of having complex dependent structures. The DAG induces topological orders to the risk factors, which can be regarded as a list of directions of the flow of information. The conditional distributions among stock returns are also modeled using PCC. Dynamic conditional dependence structures are incorporated to allow the parameters in the copulas to be time-varying. Three-stage estimation is used to estimate parameters in the marginal distributions, the risk factor copulas, and the stock copulas. The simulation study shows that the proposed estimation procedure can estimate the parameters and the underlying DAG structure accurately. In the investment experiment of the empirical study, we demonstrate that the GC-GARCH model produces more precise conditional value-at-risk prediction and considerably higher cumulative portfolio returns than the DCC-GARCH model."}, "https://arxiv.org/abs/2406.15608": {"title": "Nonparametric FBST for Validating Linear Models", "link": "https://arxiv.org/abs/2406.15608", "description": "arXiv:2406.15608v1 Announce Type: new \nAbstract: The Full Bayesian Significance Test (FBST) possesses many desirable aspects, such as not requiring a non-zero prior probability for hypotheses while also producing a measure of evidence for $H_0$. Still, few attempts have been made to bring the FBST to nonparametric settings, with the main drawback being the need to obtain the highest posterior density (HPD) in a function space. 
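For orientation, the parametric FBST that the nonparametric extension above generalizes computes the evidence in favour of a sharp hypothesis as one minus the posterior mass of the region where the posterior density exceeds its supremum over the hypothesis. A minimal Monte Carlo sketch for a one-dimensional Gaussian posterior follows; the Gaussian posterior is only a toy stand-in, not the Gaussian-process construction of the paper.

```python
import numpy as np
from scipy import stats

def fbst_evidence(posterior_sample, posterior_pdf, theta0=0.0):
    """Monte Carlo FBST e-value for the sharp hypothesis H0: theta = theta0."""
    f0 = posterior_pdf(theta0)                 # supremum of the posterior density over H0
    f_draws = posterior_pdf(posterior_sample)  # density evaluated at posterior draws
    return 1.0 - np.mean(f_draws > f0)         # 1 - posterior mass of the tangent set

post = stats.norm(loc=1.0, scale=0.5)          # toy posterior for theta
draws = post.rvs(size=100_000, random_state=0)
print(fbst_evidence(draws, post.pdf))          # small value: little evidence for theta = 0
```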
In this work, we use Gaussian processes to provide an analytically tractable FBST for hypotheses of the type $$ H_0: g(\\boldsymbol{x}) = \\boldsymbol{b}(\\boldsymbol{x})\\boldsymbol{\\beta}, \\quad \\forall \\boldsymbol{x} \\in \\mathcal{X}, \\quad \\boldsymbol{\\beta} \\in \\mathbb{R}^k, $$ where $g(\\cdot)$ is the regression function, $\\boldsymbol{b}(\\cdot)$ is a vector of linearly independent linear functions -- such as $\\boldsymbol{b}(\\boldsymbol{x}) = \\boldsymbol{x}'$ -- and $\\mathcal{X}$ is the covariates' domain. We also make use of pragmatic hypotheses to verify if the adherence of linear models may be approximately instead of exactly true, allowing for the inclusion of valuable information such as measurement errors and utility judgments. This contribution extends the theory of the FBST, allowing its application in nonparametric settings and providing a procedure that easily tests if linear models are adequate for the data and that can automatically perform variable selection."}, "https://arxiv.org/abs/2406.15667": {"title": "Identification and Estimation of Causal Effects in High-Frequency Event Studies", "link": "https://arxiv.org/abs/2406.15667", "description": "arXiv:2406.15667v1 Announce Type: new \nAbstract: We provide precise conditions for nonparametric identification of causal effects by high-frequency event study regressions, which have been used widely in the recent macroeconomics, financial economics and political economy literatures. The high-frequency event study method regresses changes in an outcome variable on a measure of unexpected changes in a policy variable in a narrow time window around an event or a policy announcement (e.g., a 30-minute window around an FOMC announcement). We show that, contrary to popular belief, the narrow size of the window is not sufficient for identification. Rather, the population regression coefficient identifies a causal estimand when (i) the effect of the policy shock on the outcome does not depend on the other shocks (separability) and (ii) the surprise component of the news or event dominates all other shocks that are present in the event window (relative exogeneity). Technically, the latter condition requires the policy shock to have infinite variance in the event window. Under these conditions, we establish the causal meaning of the event study estimand corresponding to the regression coefficient and the consistency and asymptotic normality of the event study estimator. Notably, this standard linear regression estimator is robust to general forms of nonlinearity. We apply our results to Nakamura and Steinsson's (2018a) analysis of the real economic effects of monetary policy, providing a simple empirical procedure to analyze the extent to which the standard event study estimator adequately estimates causal effects of interest."}, "https://arxiv.org/abs/2406.15700": {"title": "Mixture of Directed Graphical Models for Discrete Spatial Random Fields", "link": "https://arxiv.org/abs/2406.15700", "description": "arXiv:2406.15700v1 Announce Type: new \nAbstract: Current approaches for modeling discrete-valued outcomes associated with spatially-dependent areal units incur computational and theoretical challenges, especially in the Bayesian setting when full posterior inference is desired. As an alternative, we propose a novel statistical modeling framework for this data setting, namely a mixture of directed graphical models (MDGMs). 
The components of the mixture, directed graphical models, can be represented by directed acyclic graphs (DAGs) and are computationally quick to evaluate. The DAGs representing the mixture components are selected to correspond to an undirected graphical representation of an assumed spatial contiguity/dependence structure of the areal units, which underlies the specification of traditional modeling approaches for discrete spatial processes such as Markov random fields (MRFs). We introduce the concept of compatibility to show how an undirected graph can be used as a template for the structural dependencies between areal units to create sets of DAGs which, as a collection, preserve the structural dependencies represented in the template undirected graph. We then introduce three classes of compatible DAGs and corresponding algorithms for fitting MDGMs based on these classes. In addition, we compare MDGMs to MRFs and a popular Bayesian MRF model approximation used in high-dimensional settings in a series of simulations and an analysis of ecometrics data collected as part of the Adolescent Health and Development in Context Study."}, "https://arxiv.org/abs/2406.15702": {"title": "Testing for Restricted Stochastic Dominance under Survey Nonresponse with Panel Data: Theory and an Evaluation of Poverty in Australia", "link": "https://arxiv.org/abs/2406.15702", "description": "arXiv:2406.15702v1 Announce Type: new \nAbstract: This paper lays the groundwork for a unifying approach to stochastic dominance testing under survey nonresponse that integrates the partial identification approach to incomplete data and design-based inference for complex survey data. We propose a novel inference procedure for restricted $s$th-order stochastic dominance, tailored to accommodate a broad spectrum of nonresponse assumptions. The method uses pseudo-empirical likelihood to formulate the test statistic and compares it to a critical value from the chi-squared distribution with one degree of freedom. We detail the procedure's asymptotic properties under both null and alternative hypotheses, establishing its uniform validity under the null and consistency against various alternatives. Using the Household, Income and Labour Dynamics in Australia survey, we demonstrate the procedure's utility in a sensitivity analysis of temporal poverty comparisons among Australian households."}, "https://arxiv.org/abs/2406.15844": {"title": "Bayesian modeling of multi-species labeling errors in ecological studies", "link": "https://arxiv.org/abs/2406.15844", "description": "arXiv:2406.15844v1 Announce Type: new \nAbstract: Ecological and conservation studies monitoring bird communities typically rely on species classification based on bird vocalizations. Historically, this has been based on expert volunteers going into the field and making lists of the bird species that they observe. Recently, machine learning algorithms have emerged that can accurately classify bird species based on audio recordings of their vocalizations. Such algorithms crucially rely on training data that are labeled by experts. Automated classification is challenging when multiple species are vocalizing simultaneously, there is background noise, and/or the bird is far from the microphone. In continuously monitoring different locations, the size of the audio data becomes immense and it is only possible for human experts to label a tiny proportion of the available data. In addition, experts can vary in their accuracy and breadth of knowledge about different species. 
This article focuses on the important problem of combining sparse expert annotations to improve bird species classification while providing uncertainty quantification. We are additionally interested in providing expert performance scores to increase their engagement and encourage improvements. We propose a Bayesian hierarchical modeling approach and evaluate this approach on a new community science platform developed in Finland."}, "https://arxiv.org/abs/2406.15912": {"title": "Clustering and Meta-Analysis Using a Mixture of Dependent Linear Tail-Free Priors", "link": "https://arxiv.org/abs/2406.15912", "description": "arXiv:2406.15912v1 Announce Type: new \nAbstract: We propose a novel nonparametric Bayesian approach for meta-analysis with event time outcomes. The model is an extension of linear dependent tail-free processes. The extension includes a modification to facilitate (conditionally) conjugate posterior updating and a hierarchical extension with a random partition of studies. The partition is formalized as a Dirichlet process mixture. The model development is motivated by a meta-analysis of cancer immunotherapy studies. The aim is to validate the use of relevant biomarkers in the design of immunotherapy studies. The hypothesis is about immunotherapy in general, rather than about a specific tumor type, therapy and marker. This broad hypothesis leads to a very diverse set of studies being included in the analysis and gives rise to substantial heterogeneity across studies."}, "https://arxiv.org/abs/2406.15933": {"title": "On the use of splines for representing ordered factors", "link": "https://arxiv.org/abs/2406.15933", "description": "arXiv:2406.15933v1 Announce Type: new \nAbstract: In the context of regression-type statistical models, the inclusion of some ordered factors among the explanatory variables requires the conversion of qualitative levels to numeric components of the linear predictor. The present note represents a follow-up to a methodology proposed by Azzalini (2023) for constructing numeric scores assigned to the factor levels. The aim of the present supplement is to allow additional flexibility in the mapping from ordered levels to numeric scores."}, "https://arxiv.org/abs/2406.16136": {"title": "Distribution-Free Online Change Detection for Low-Rank Images", "link": "https://arxiv.org/abs/2406.16136", "description": "arXiv:2406.16136v1 Announce Type: new \nAbstract: We present a distribution-free CUSUM procedure designed for online change detection in a time series of low-rank images, particularly when the change causes a mean shift. We represent images as matrix data and allow for temporal dependence, in addition to inherent spatial dependence, before and after the change. The marginal distributions are assumed to be general, not limited to any specific parametric distribution. We propose new monitoring statistics that utilize the low-rank structure of the in-control mean matrix. Additionally, we study the properties of the proposed detection procedure, assessing whether the monitoring statistics effectively capture a mean shift and evaluating the rate of increase in average run length relative to the control limit in both in-control and out-of-control cases. 
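The change-detection abstract above studies how fast a CUSUM-type statistic reacts to a mean shift and how the average run length scales with the control limit. The recursion below is the generic one-sided CUSUM underlying such procedures; the paper's monitoring statistics, which exploit the low-rank structure of the mean matrix, would replace the raw observations fed into it.

```python
import numpy as np

def cusum_run_length(stream, drift, control_limit):
    """One-sided CUSUM: return the first time the statistic exceeds the control limit."""
    s = 0.0
    for t, w in enumerate(stream, start=1):
        s = max(0.0, s + w - drift)            # classic CUSUM recursion with allowance `drift`
        if s > control_limit:
            return t
    return None                                # no alarm raised

rng = np.random.default_rng(3)
stream = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])  # shift at t = 201
print(cusum_run_length(stream, drift=0.5, control_limit=5.0))
```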
The effectiveness of our procedure is demonstrated through simulated and real data experiments."}, "https://arxiv.org/abs/2406.16171": {"title": "Exploring the difficulty of estimating win probability: a simulation study", "link": "https://arxiv.org/abs/2406.16171", "description": "arXiv:2406.16171v1 Announce Type: new \nAbstract: Estimating win probability is one of the classic modeling tasks of sports analytics. Many widely used win probability estimators are statistical win probability models, which fit the relationship between a binary win/loss outcome variable and certain game-state variables using data-driven regression or machine learning approaches. To illustrate just how difficult it is to accurately fit a statistical win probability model from noisy and highly correlated observational data, in this paper we conduct a simulation study. We create a simplified random walk version of football in which true win probability at each game-state is known, and we see how well a model recovers it. We find that the dependence structure of observational play-by-play data substantially inflates the bias and variance of estimators and lowers the effective sample size. This makes it essential to quantify uncertainty in win probability estimates, but typical bootstrapped confidence intervals are too narrow and don't achieve nominal coverage. Hence, we introduce a novel method, the fractional bootstrap, to calibrate these intervals to achieve adequate coverage."}, "https://arxiv.org/abs/2406.16234": {"title": "Efficient estimation of longitudinal treatment effects using difference-in-differences and machine learning", "link": "https://arxiv.org/abs/2406.16234", "description": "arXiv:2406.16234v1 Announce Type: new \nAbstract: Difference-in-differences is based on a parallel trends assumption, which states that changes over time in average potential outcomes are independent of treatment assignment, possibly conditional on covariates. With time-varying treatments, parallel trends assumptions can identify many types of parameters, but most work has focused on group-time average treatment effects and similar parameters conditional on the treatment trajectory. This paper focuses instead on identification and estimation of the intervention-specific mean - the mean potential outcome had everyone been exposed to a proposed intervention - which may be directly policy-relevant in some settings. Previous estimators for this parameter under parallel trends have relied on correctly-specified parametric models, which may be difficult to guarantee in applications. We develop multiply-robust and efficient estimators of the intervention-specific mean based on the efficient influence function, and derive conditions under which data-adaptive machine learning methods can be used to relax modeling assumptions. Our approach allows the parallel trends assumption to be conditional on the history of time-varying covariates, thus allowing for adjustment for time-varying covariates possibly impacted by prior treatments. Simulation results support the use of the proposed methods at modest sample sizes. 
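In the spirit of the simplified random-walk version of football used in the win-probability study above, the sketch below computes the exact win probability at a given game state when the remaining score differential evolves as a symmetric plus/minus-one random walk, counting ties as half a win. It is a toy model, not the paper's simulator.

```python
import numpy as np
from scipy import stats

def true_win_prob(lead, steps_left):
    """P(final lead > 0) when the remaining change in the lead is a sum of
    `steps_left` independent +/-1 steps; ties count as half a win."""
    k = np.arange(steps_left + 1)                      # number of +1 steps
    final = lead + (2 * k - steps_left)                # possible final leads
    p = stats.binom.pmf(k, steps_left, 0.5)
    return float(np.sum(p * ((final > 0) + 0.5 * (final == 0))))

print(true_win_prob(lead=3, steps_left=20))
```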
As an example, we estimate the effect of a hypothetical federal minimum wage increase on self-rated health in the US."}, "https://arxiv.org/abs/2406.16458": {"title": "Distance-based Chatterjee correlation: a new generalized robust measure of directed association for multivariate real and complex-valued data", "link": "https://arxiv.org/abs/2406.16458", "description": "arXiv:2406.16458v1 Announce Type: new \nAbstract: Building upon the Chatterjee correlation (2021: J. Am. Stat. Assoc. 116, p2009) for two real-valued variables, this study introduces a generalized measure of directed association between two vector variables, real or complex-valued, and of possibly different dimensions. The new measure is denoted as the \"distance-based Chatterjee correlation\", owing to the use here of the \"distance transformed data\" defined in Szekely et al (2007: Ann. Statist. 35, p2769) for the distance correlation. A main property of the new measure, inherited from the original Chatterjee correlation, is its predictive and asymmetric nature: it measures how well one variable can be predicted by the other, asymmetrically. This allows for inferring the causal direction of the association, by using the method of Blobaum et al (2019: PeerJ Comput. Sci. 1, e169). Since the original Chatterjee correlation is based on ranks, it is not available for complex variables, nor for general multivariate data. The novelty of our work is the extension to multivariate real and complex-valued pairs of vectors, offering a robust measure of directed association in a completely non-parametric setting. Informally, the intuitive assumption used here is that distance correlation is mathematically equivalent to Pearson's correlation when applied to \"distance transformed\" data. The next logical step is to compute Chatterjee's correlation on the same \"distance transformed\" data, thereby extending the analysis to multivariate vectors of real and complex valued data. As a bonus, the new measure here is robust to outliers, which is not true for the distance correlation of Szekely et al. Additionally, this approach allows for inference regarding the causal direction of the association between the variables."}, "https://arxiv.org/abs/2406.16485": {"title": "Influence analyses of \"designs\" for evaluating inconsistency in network meta-analysis", "link": "https://arxiv.org/abs/2406.16485", "description": "arXiv:2406.16485v1 Announce Type: new \nAbstract: Network meta-analysis is an evidence synthesis method for comparative effectiveness analyses of multiple available treatments. To justify evidence synthesis, consistency is a relevant assumption; however, existing methods founded on statistical testing possibly have substantial limitations of statistical powers or several drawbacks in treating multi-arm studies. Besides, inconsistency is theoretically explained as design-by-treatment interactions, and the primary purpose of these analyses is prioritizing \"designs\" for further investigations to explore sources of biases and irregular issues that might influence the overall results. In this article, we propose an alternative framework for inconsistency evaluations using influence diagnostic methods that enable quantitative evaluations of the influences of individual designs to the overall results. We provide four new methods to quantify the influences of individual designs through a \"leave-one-design-out\" analysis framework. 
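The distance-based measure above starts from Chatterjee's rank correlation and applies it to distance-transformed data. For reference, a minimal implementation of the original bivariate coefficient is sketched below (assuming no ties in y); the distance-transformation step of the paper is not included.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi(x -> y); assumes no ties in y."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y_by_x = y[np.argsort(x, kind="stable")]       # order the pairs by x
    ranks = np.argsort(np.argsort(y_by_x)) + 1     # ranks of y in that order
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(ranks))) / (n ** 2 - 1)

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 1000)
y = x ** 2 + 0.05 * rng.normal(size=1000)          # strong but non-monotonic dependence
print(chatterjee_xi(x, y), chatterjee_xi(y, x))    # asymmetric: x predicts y better than y predicts x
```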
We also propose a simple summary measure, the O-value, for prioritizing designs and interpreting these influence analyses straightforwardly. Furthermore, we propose another testing approach based on the leave-one-design-out analysis framework. By applying the new methods to a network meta-analysis of antihypertensive drugs, we demonstrate that the new methods locate potential sources of inconsistency accurately. The proposed methods provide new insights into alternatives to existing test-based methods, especially quantifications of influences of individual designs on the overall network meta-analysis results."}, "https://arxiv.org/abs/2406.16507": {"title": "Statistical ranking with dynamic covariates", "link": "https://arxiv.org/abs/2406.16507", "description": "arXiv:2406.16507v1 Announce Type: new \nAbstract: We consider a covariate-assisted ranking model grounded in the Plackett--Luce framework. Unlike existing works focusing on pure covariates or individual effects with fixed covariates, our approach integrates individual effects with dynamic covariates. This added flexibility enhances realistic ranking yet poses significant challenges for analyzing the associated estimation procedures. This paper makes an initial attempt to address these challenges. We begin by discussing the sufficient and necessary condition for the model's identifiability. We then introduce an efficient alternating maximization algorithm to compute the maximum likelihood estimator (MLE). Under suitable assumptions on the topology of comparison graphs and dynamic covariates, we establish a quantitative uniform consistency result for the MLE with convergence rates characterized by the asymptotic graph connectivity. The proposed graph topology assumption holds for several popular random graph models under optimal leading-order sparsity conditions. A comprehensive numerical study is conducted to corroborate our theoretical findings and demonstrate the application of the proposed model to real-world datasets, including horse racing and tennis competitions."}, "https://arxiv.org/abs/2406.16523": {"title": "YEAST: Yet Another Sequential Test", "link": "https://arxiv.org/abs/2406.16523", "description": "arXiv:2406.16523v1 Announce Type: new \nAbstract: Large-scale randomised experiments have become a standard tool for developing products and improving user experience. To reduce losses from shipping harmful changes, experimental results are, in practice, often checked repeatedly, which leads to inflated false alarm rates. To alleviate this problem, one can use sequential testing techniques as they control false discovery rates despite repeated checks. While multiple sequential testing methods exist in the literature, they either restrict the number of interim checks the experimenter can perform or have tuning parameters that require calibration. In this paper, we propose a novel sequential testing method that does not limit the number of interim checks and at the same time does not have any tuning parameters. The proposed method is new and does not stem from existing experiment monitoring procedures. It controls false discovery rates by ``inverting'' a bound on the threshold crossing probability derived from a classical maximal inequality. We demonstrate both in simulations and using real-world data that the proposed method outperforms current state-of-the-art sequential tests for continuous test monitoring. 
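YEAST, described above, controls false alarms by inverting a bound on the threshold-crossing probability of the monitored process. The sketch below illustrates that general idea with a textbook bound (Doob's maximal inequality combined with a Hoeffding-type argument) for a zero-mean process with increments bounded in [-1, 1] over a fixed maximum horizon; it is not the paper's actual boundary, which avoids both tuning parameters and restrictions on the number of interim checks.

```python
import numpy as np

def crossing_boundary(n_max, alpha):
    """Constant boundary b with P(max_{t<=n_max} S_t >= b) <= alpha for a zero-mean
    process S with independent increments bounded in [-1, 1]."""
    return np.sqrt(2.0 * n_max * np.log(1.0 / alpha))

def monitor(increments, alpha=0.05):
    """Continuously monitored one-sided test: reject the zero-mean null at the first
    time the running sum crosses the boundary; return None if it never does."""
    s = np.cumsum(increments)
    b = crossing_boundary(len(increments), alpha)
    hits = np.flatnonzero(s >= b)
    return int(hits[0]) + 1 if hits.size else None

rng = np.random.default_rng(5)
null_stream = rng.uniform(-1.0, 1.0, size=2000)    # mean zero: alarms with probability <= alpha
alt_stream = rng.uniform(-0.8, 1.0, size=2000)     # mean 0.1: the running sum eventually crosses
print(monitor(null_stream), monitor(alt_stream))
```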
In addition, we illustrate the method's effectiveness with a real-world application on a major online fashion platform."}, "https://arxiv.org/abs/2406.16820": {"title": "EFECT -- A Method and Metric to Assess the Reproducibility of Stochastic Simulation Studies", "link": "https://arxiv.org/abs/2406.16820", "description": "arXiv:2406.16820v1 Announce Type: new \nAbstract: Reproducibility is a foundational standard for validating scientific claims in computational research. Stochastic computational models are employed across diverse fields such as systems biology, financial modelling and environmental sciences. Existing infrastructure and software tools support various aspects of reproducible model development, application, and dissemination, but do not adequately address independently reproducing simulation results that form the basis of scientific conclusions. To bridge this gap, we introduce the Empirical Characteristic Function Equality Convergence Test (EFECT), a data-driven method to quantify the reproducibility of stochastic simulation results. EFECT employs empirical characteristic functions to compare reported results with those independently generated by assessing distributional inequality, termed EFECT error, a metric to quantify the likelihood of equality. Additionally, we establish the EFECT convergence point, a metric for determining the required number of simulation runs to achieve an EFECT error value of a priori statistical significance, setting a reproducibility benchmark. EFECT supports all real-valued and bounded results irrespective of the model or method that produced them, and accommodates stochasticity from intrinsic model variability and random sampling of model inputs. We tested EFECT with stochastic differential equations, agent-based models, and Boolean networks, demonstrating its broad applicability and effectiveness. EFECT standardizes stochastic simulation reproducibility, establishing a workflow that guarantees reliable results, supporting a wide range of stakeholders, and thereby enhancing validation of stochastic simulation studies, across a model's lifecycle. To promote future standardization efforts, we are developing open source software library libSSR in diverse programming languages for easy integration of EFECT."}, "https://arxiv.org/abs/2406.16830": {"title": "Adjusting for Selection Bias Due to Missing Eligibility Criteria in Emulated Target Trials", "link": "https://arxiv.org/abs/2406.16830", "description": "arXiv:2406.16830v1 Announce Type: new \nAbstract: Target trial emulation (TTE) is a popular framework for observational studies based on electronic health records (EHR). A key component of this framework is determining the patient population eligible for inclusion in both a target trial of interest and its observational emulation. Missingness in variables that define eligibility criteria, however, presents a major challenge towards determining the eligible population when emulating a target trial with an observational study. In practice, patients with incomplete data are almost always excluded from analysis despite the possibility of selection bias, which can arise when subjects with observed eligibility data are fundamentally different than excluded subjects. Despite this, to the best of our knowledge, very little work has been done to mitigate this concern. 
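EFECT, as described above, compares the empirical characteristic function of reported results with that of independently regenerated results. The sketch below evaluates an empirical characteristic function on a frequency grid and a simple mean-squared discrepancy between two of them; this illustrative discrepancy is an assumption for the sketch, not the paper's EFECT error or its convergence criterion.

```python
import numpy as np

def ecf(sample, t_grid):
    """Empirical characteristic function evaluated on a grid of frequencies."""
    return np.exp(1j * np.outer(t_grid, sample)).mean(axis=1)

def ecf_discrepancy(sample_a, sample_b, t_grid):
    """Mean squared modulus of the difference between two ECFs (illustrative only)."""
    return float(np.mean(np.abs(ecf(sample_a, t_grid) - ecf(sample_b, t_grid)) ** 2))

rng = np.random.default_rng(6)
reported = rng.normal(0.0, 1.0, 2000)
replicated = rng.normal(0.2, 1.0, 2000)            # slightly shifted replication
t = np.linspace(-3.0, 3.0, 61)
print(ecf_discrepancy(reported, replicated, t))
```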
In this paper, we propose a novel conceptual framework to address selection bias in TTE studies, tailored towards time-to-event endpoints, and describe estimation and inferential procedures via inverse probability weighting (IPW). Under an EHR-based simulation infrastructure, developed to reflect the complexity of EHR data, we characterize common settings under which missing eligibility data poses the threat of selection bias and investigate the ability of the proposed methods to address it. Finally, using EHR databases from Kaiser Permanente, we demonstrate the use of our method to evaluate the effect of bariatric surgery on microvascular outcomes among a cohort of severely obese patients with Type II diabetes mellitus (T2DM)."}, "https://arxiv.org/abs/2406.16859": {"title": "On the extensions of the Chatterjee-Spearman test", "link": "https://arxiv.org/abs/2406.16859", "description": "arXiv:2406.16859v1 Announce Type: new \nAbstract: Chatterjee (2021) introduced a novel independence test that is rank-based, asymptotically normal and consistent against all alternatives. One limitation of Chatterjee's test is its low statistical power for detecting monotonic relationships. To address this limitation, in our previous work (Zhang, 2024, Commun. Stat. - Theory Methods), we proposed to combine Chatterjee's and Spearman's correlations into a max-type test and established the asymptotic joint normality. This work examines three key extensions of the combined test. First, motivated by its original asymmetric form, we extend the Chatterjee-Spearman test to a symmetric version, and derive the asymptotic null distribution of the symmetrized statistic. Second, we investigate the relationships between Chatterjee's correlation and other popular rank correlations, including Kendall's tau and quadrant correlation. We demonstrate that, under independence, Chatterjee's correlation and any of these rank correlations are asymptotically jointly normal and independent. Simulation studies demonstrate that the Chatterjee-Kendall test has better power than the Chatterjee-Spearman test. Finally, we explore two possible extensions to the multivariate case. These extensions expand the applicability of the rank-based combined tests to a broader range of scenarios."}, "https://arxiv.org/abs/2406.15514": {"title": "How big does a population need to be before demographers can ignore individual-level randomness in demographic events?", "link": "https://arxiv.org/abs/2406.15514", "description": "arXiv:2406.15514v1 Announce Type: cross \nAbstract: When studying a national-level population, demographers can safely ignore the effect of individual-level randomness on age-sex structure. When studying a single community, or group of communities, however, the potential importance of individual-level randomness is less clear. We seek to measure the effect of individual-level randomness in births and deaths on standard summary indicators of age-sex structure, for populations of different sizes, focusing on demographic conditions typical of historical populations. We conduct a microsimulation experiment where we simulate events and age-sex structure under a range of settings for demographic rates and population size. The experiment results suggest that individual-level randomness strongly affects age-sex structure for populations of about 100, but has a much smaller effect on populations of 1,000, and a negligible effect on populations of 10,000. 
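The microsimulation experiment above is far richer than this, but a toy calculation already shows how individual-level randomness in demographic events washes out with population size: below, the sex composition of simulated birth cohorts is much noisier at 100 births than at 10,000. The birth probability and the summary indicator are illustrative choices only.

```python
import numpy as np

def sex_ratio_spread(n_births, n_reps=5_000, p_female=0.4886, seed=0):
    """Standard deviation of the sex ratio at birth (males per 100 females)
    across replicate cohorts of a given size."""
    rng = np.random.default_rng(seed)
    females = rng.binomial(n_births, p_female, size=n_reps)
    ratios = 100.0 * (n_births - females) / females
    return ratios.std()

for n in (100, 1_000, 10_000):
    print(n, round(sex_ratio_spread(n), 1))        # spread shrinks roughly like 1/sqrt(n)
```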
Our conclusion is that analyses of age-sex structure in historical populations with sizes on the order of 100 must account for individual-level randomness in demographic events. Analyses of populations with sizes on the order of 1,000 may need to make some allowance for individual-level variation, but other issues, such as measurement error, probably deserve more attention. Analyses of populations of 10,000 can safely ignore individual-level variation."}, "https://arxiv.org/abs/2406.15522": {"title": "Statistical Inference and A/B Testing in Fisher Markets and Paced Auctions", "link": "https://arxiv.org/abs/2406.15522", "description": "arXiv:2406.15522v1 Announce Type: cross \nAbstract: We initiate the study of statistical inference and A/B testing for two market equilibrium models: linear Fisher market (LFM) equilibrium and first-price pacing equilibrium (FPPE). LFM arises from fair resource allocation systems such as allocation of food to food banks and notification opportunities to different types of notifications. For LFM, we assume that the data observed is captured by the classical finite-dimensional Fisher market equilibrium, and its steady-state behavior is modeled by a continuous limit Fisher market. The second type of equilibrium we study, FPPE, arises from internet advertising where advertisers are constrained by budgets and advertising opportunities are sold via first-price auctions. For platforms that use pacing-based methods to smooth out the spending of advertisers, FPPE provides a hindsight-optimal configuration of the pacing method. We propose a statistical framework for the FPPE model, in which a continuous limit FPPE models the steady-state behavior of the auction platform, and a finite FPPE provides the data to estimate primitives of the limit FPPE. Both LFM and FPPE have an Eisenberg-Gale convex program characterization, the pillar upon which we derive our statistical theory. We start by deriving basic convergence results for the finite market to the limit market. We then derive asymptotic distributions, and construct confidence intervals. Furthermore, we establish the asymptotic local minimax optimality of estimation based on finite markets. We then show that the theory can be used for conducting statistically valid A/B testing on auction platforms. Synthetic and semi-synthetic experiments verify the validity and practicality of our theory."}, "https://arxiv.org/abs/2406.15867": {"title": "Hedging in Sequential Experiments", "link": "https://arxiv.org/abs/2406.15867", "description": "arXiv:2406.15867v1 Announce Type: cross \nAbstract: Experimentation involves risk. The investigator expends time and money in the pursuit of data that supports a hypothesis. In the end, the investigator may find that all of these costs were for naught and the data fail to reject the null. Furthermore, the investigator may not be able to test other hypotheses with the same data set in order to avoid false positives due to p-hacking. Therefore, there is a need for a mechanism for investigators to hedge the risk of financial and statistical bankruptcy in the business of experimentation.\n In this work, we build on the game-theoretic statistics framework to enable an investigator to hedge their bets against the null hypothesis and thus avoid ruin. First, we describe a method by which the investigator's test martingale wealth process can be capitalized by solving for the risk-neutral price. 
Then, we show that a portfolio that comprises the risky test martingale and a risk-free process is still a test martingale, which enables the investigator to select a particular risk-return position using Markowitz portfolio theory. Finally, we show that a function that is a derivative of the test martingale process can be constructed and used as a hedging instrument by the investigator or as a speculative instrument by a risk-seeking investor who wants to participate in the potential returns of the uncertain experiment wealth process. Together, these instruments enable an investigator to hedge the risk of ruin and they enable an investigator to efficiently hedge experimental risk."}, "https://arxiv.org/abs/2406.15904": {"title": "Learning When the Concept Shifts: Confounding, Invariance, and Dimension Reduction", "link": "https://arxiv.org/abs/2406.15904", "description": "arXiv:2406.15904v1 Announce Type: cross \nAbstract: Practitioners often deploy a learned prediction model in a new environment where the joint distribution of covariate and response has shifted. In observational data, the distribution shift is often driven by unobserved confounding factors lurking in the environment, with the underlying mechanism unknown. Confounding can obfuscate the definition of the best prediction model (concept shift) and shift covariates to domains yet unseen (covariate shift). Therefore, a model maximizing prediction accuracy in the source environment could suffer a significant accuracy drop in the target environment. This motivates us to study the domain adaptation problem with observational data: given labeled covariate and response pairs from a source environment, and unlabeled covariates from a target environment, how can one predict the missing target response reliably? We root the adaptation problem in a linear structural causal model to address endogeneity and unobserved confounding. We study the necessity and benefit of leveraging exogenous, invariant covariate representations to cure concept shifts and improve target prediction. This further motivates a new representation learning method for adaptation that optimizes for a lower-dimensional linear subspace and, subsequently, a prediction model confined to that subspace. The procedure operates on a non-convex objective, one that naturally interpolates between predictability and stability/invariance, constrained on the Stiefel manifold. We study the optimization landscape and prove that, when the regularization is sufficient, nearly all local optima align with an invariant linear subspace resilient to both concept and covariate shift. In terms of predictability, we show a model that uses the learned lower-dimensional subspace can incur a nearly ideal gap between target and source risk. Three real-world data sets are investigated to validate our method and theory."}, "https://arxiv.org/abs/2406.16174": {"title": "Comparison of methods for mediation analysis with multiple correlated mediators", "link": "https://arxiv.org/abs/2406.16174", "description": "arXiv:2406.16174v1 Announce Type: cross \nAbstract: Various methods have emerged for conducting mediation analyses with multiple correlated mediators, each with distinct strengths and limitations. However, a comparative evaluation of these methods is lacking, providing the motivation for this paper. This study examines six mediation analysis methods for multiple correlated mediators that provide insights into the contributors to health disparities. 
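As a toy illustration of the risky-plus-risk-free portfolio idea in the hedging abstract above, the sketch below tracks a Bernoulli test martingale (betting against a fair-coin null) and mixes it with a constant risk-free unit process; the mixture stays nonnegative with expectation one under the null, so it is still a test martingale. The betting fraction and mixing weight are arbitrary choices, and none of this reproduces the paper's risk-neutral pricing construction.

```python
import numpy as np

def test_martingale(xs, bet=0.2):
    """Wealth process betting against H0: P(X = 1) = 1/2; expectation 1 under H0."""
    increments = 1.0 + bet * (2.0 * np.asarray(xs) - 1.0)
    return np.cumprod(increments)

def hedged_wealth(xs, bet=0.2, risky_share=0.3):
    """Mix the risky test martingale with a risk-free unit process; still a test martingale."""
    return risky_share * test_martingale(xs, bet) + (1.0 - risky_share)

rng = np.random.default_rng(7)
data = rng.binomial(1, 0.6, size=200)      # the null is false here (true p = 0.6)
print(test_martingale(data)[-1], hedged_wealth(data)[-1])
```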
We assessed the performance of each method in identifying joint or path-specific mediation effects in the context of binary outcome variables, varying mediator types and levels of residual correlation between mediators. Through comprehensive simulations, the performance of six methods in estimating joint and/or path-specific mediation effects was assessed rigorously using a variety of metrics including bias, mean squared error, coverage and width of the 95$\\%$ confidence intervals. Subsequently, these methods were applied to the REasons for Geographic And Racial Differences in Stroke (REGARDS) study, where differing conclusions were obtained depending on the mediation method employed. This evaluation provides valuable guidance for researchers grappling with complex multi-mediator scenarios, enabling them to select an optimal mediation method for their research question and dataset."}, "https://arxiv.org/abs/2406.16221": {"title": "F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data", "link": "https://arxiv.org/abs/2406.16221", "description": "arXiv:2406.16221v1 Announce Type: cross \nAbstract: Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stake sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural networks (GNNs)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset."}, "https://arxiv.org/abs/2406.16227": {"title": "VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data", "link": "https://arxiv.org/abs/2406.16227", "description": "arXiv:2406.16227v1 Announce Type: cross \nAbstract: Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growth in availability of high-dimensional categorical data, including `omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency, while maintaining high accuracy. 
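VICatMix, introduced above, fits a Bayesian finite mixture of categorical variables by variational inference with variable selection. The sketch below shows only the simplest related computation, the responsibilities (cluster membership probabilities) of a mixture of independent categoricals given current parameters; the variational updates, variable selection, and model averaging of the paper are not attempted here.

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(X, log_pi, log_theta):
    """E-step-style responsibilities for a mixture of independent categorical variables.

    X         : integer data matrix (n x p) of category codes
    log_pi    : log mixing weights, shape (K,)
    log_theta : log category probabilities, shape (K, p, C)
    """
    n, p = X.shape
    log_r = np.tile(log_pi, (n, 1))
    for j in range(p):
        log_r += log_theta[:, j, X[:, j]].T           # add each variable's log-likelihood
    log_r -= logsumexp(log_r, axis=1, keepdims=True)  # normalize over clusters
    return np.exp(log_r)

rng = np.random.default_rng(8)
n, p, C, K = 50, 4, 3, 2
X = rng.integers(0, C, size=(n, p))
theta = rng.dirichlet(np.ones(C), size=(K, p))        # shape (K, p, C)
r = responsibilities(X, np.log(np.full(K, 1.0 / K)), np.log(theta))
print(r.shape, r.sum(axis=1)[:3])                     # each row sums to 1
```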
VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing for improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix with both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We demonstrate VICatMix's utility in integrative cluster analysis with different `omics datasets, enabling the discovery of novel subtypes.\n \\textbf{Availability:} VICatMix is freely available as an R package, incorporating C++ for faster computation, at \\url{https://github.com/j-ackierao/VICatMix}."}, "https://arxiv.org/abs/2406.16241": {"title": "Position: Benchmarking is Limited in Reinforcement Learning Research", "link": "https://arxiv.org/abs/2406.16241", "description": "arXiv:2406.16241v1 Announce Type: cross \nAbstract: Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking."}, "https://arxiv.org/abs/2406.16351": {"title": "METRIK: Measurement-Efficient Randomized Controlled Trials using Transformers with Input Masking", "link": "https://arxiv.org/abs/2406.16351", "description": "arXiv:2406.16351v1 Announce Type: cross \nAbstract: Clinical randomized controlled trials (RCTs) collect hundreds of measurements spanning various metric types (e.g., laboratory tests, cognitive/motor assessments, etc.) across 100s-1000s of subjects to evaluate the effect of a treatment, but do so at the cost of significant trial expense. To reduce the number of measurements, trial protocols can be revised to remove metrics extraneous to the study's objective, but doing so requires additional human labor and limits the set of hypotheses that can be studied with the collected data. In contrast, a planned missing design (PMD) can reduce the amount of data collected without removing any metric by imputing the unsampled data. Standard PMDs randomly sample data to leverage statistical properties of imputation algorithms, but are ad hoc, hence suboptimal. Methods that learn PMDs produce more sample-efficient PMDs, but are not suitable for RCTs because they require ample prior data (150+ subjects) to model the data distribution. Therefore, we introduce a framework called Measurement EfficienT Randomized Controlled Trials using Transformers with Input MasKing (METRIK), which, for the first time, calculates a PMD specific to the RCT from a modest amount of prior data (e.g., 60 subjects). 
Specifically, METRIK models the PMD as a learnable input masking layer that is optimized with a state-of-the-art imputer based on the Transformer architecture. METRIK implements a novel sampling and selection algorithm to generate a PMD that satisfies the trial designer's objective, i.e., whether to maximize sampling efficiency or imputation performance for a given sampling budget. Evaluated across five real-world clinical RCT datasets, METRIK increases the sampling efficiency of and imputation performance under the generated PMD by leveraging correlations over time and across metrics, thereby removing the need to manually remove metrics from the RCT."}, "https://arxiv.org/abs/2406.16605": {"title": "CLEAR: Can Language Models Really Understand Causal Graphs?", "link": "https://arxiv.org/abs/2406.16605", "description": "arXiv:2406.16605v1 Announce Type: cross \nAbstract: Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR."}, "https://arxiv.org/abs/2406.16708": {"title": "CausalFormer: An Interpretable Transformer for Temporal Causal Discovery", "link": "https://arxiv.org/abs/2406.16708", "description": "arXiv:2406.16708v1 Announce Type: cross \nAbstract: Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we proposed an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution which aggregates each input time series along the temporal dimension under the temporal priority constraint. 
Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality. Our code is available at https://github.com/lingbai-kong/CausalFormer."}, "https://arxiv.org/abs/2107.12420": {"title": "Semiparametric Estimation of Treatment Effects in Observational Studies with Heterogeneous Partial Interference", "link": "https://arxiv.org/abs/2107.12420", "description": "arXiv:2107.12420v3 Announce Type: replace \nAbstract: In many observational studies in social science and medicine, subjects or units are connected, and one unit's treatment and attributes may affect another's treatment and outcome, violating the stable unit treatment value assumption (SUTVA) and resulting in interference. To enable feasible estimation and inference, many previous works assume exchangeability of interfering units (neighbors). However, in many applications with distinctive units, interference is heterogeneous and needs to be modeled explicitly. In this paper, we focus on the partial interference setting, and only restrict units to be exchangeable conditional on observable characteristics. Under this framework, we propose generalized augmented inverse propensity weighted (AIPW) estimators for general causal estimands that include heterogeneous direct and spillover effects. We show that they are semiparametric efficient and robust to heterogeneous interference as well as model misspecifications. We apply our methods to the Add Health dataset to study the direct effects of alcohol consumption on academic performance and the spillover effects of parental incarceration on adolescent well-being."}, "https://arxiv.org/abs/2202.02903": {"title": "Difference in Differences with Time-Varying Covariates", "link": "https://arxiv.org/abs/2202.02903", "description": "arXiv:2202.02903v3 Announce Type: replace \nAbstract: This paper considers identification and estimation of causal effect parameters from participating in a binary treatment in a difference in differences (DID) setup when the parallel trends assumption holds after conditioning on observed covariates. Relative to existing work in the econometrics literature, we consider the case where the value of covariates can change over time and, potentially, where participating in the treatment can affect the covariates themselves. We propose new empirical strategies in both cases. We also consider two-way fixed effects (TWFE) regressions that include time-varying regressors, which is the most common way that DID identification strategies are implemented under conditional parallel trends. 
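Because the two-way fixed effects implementation of conditional DID is the reference point above (and its limitations with time-varying covariates are detailed next), a bare-bones version of that regression is sketched here: unit dummies, period dummies, a treatment indicator, and optional covariates, estimated by least squares. This is the specification being critiqued, not the paper's proposed doubly robust or imputation estimators.

```python
import numpy as np

def twfe_coefficient(y, unit, time, d, x=None):
    """OLS coefficient on the treatment dummy in a two-way fixed effects regression."""
    unit_dummies = (unit[:, None] == np.unique(unit)[None, :]).astype(float)
    time_dummies = (time[:, None] == np.unique(time)[None, 1:]).astype(float)  # one period omitted
    cols = [d[:, None].astype(float), unit_dummies, time_dummies]
    if x is not None:
        cols.append(np.atleast_2d(x).reshape(len(y), -1))
    design = np.hstack(cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0]

# Toy two-period panel with a true effect of 2.0 for units treated in period 1.
rng = np.random.default_rng(9)
n_units = 200
unit = np.repeat(np.arange(n_units), 2)
time = np.tile(np.array([0, 1]), n_units)
treated_unit = np.repeat(rng.binomial(1, 0.5, n_units), 2)
d = treated_unit * (time == 1)
y = np.repeat(rng.normal(size=n_units), 2) + 0.5 * time + 2.0 * d + rng.normal(size=2 * n_units)
print(twfe_coefficient(y, unit, time, d))
```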
We show that, even in the case with only two time periods, these TWFE regressions are not generally robust to (i) time-varying covariates being affected by the treatment, (ii) treatment effects and/or paths of untreated potential outcomes depending on the level of time-varying covariates in addition to only the change in the covariates over time, (iii) treatment effects and/or paths of untreated potential outcomes depending on time-invariant covariates, (iv) treatment effect heterogeneity with respect to observed covariates, and (v) violations of strong functional form assumptions, both for outcomes over time and the propensity score, that are unlikely to be plausible in most DID applications. Thus, TWFE regressions can deliver misleading estimates of causal effect parameters in a number of empirically relevant cases. We propose both doubly robust estimands and regression adjustment/imputation strategies that are robust to these issues while not being substantially more challenging to implement."}, "https://arxiv.org/abs/2301.05743": {"title": "Re-thinking Spatial Confounding in Spatial Linear Mixed Models", "link": "https://arxiv.org/abs/2301.05743", "description": "arXiv:2301.05743v2 Announce Type: replace \nAbstract: In the last two decades, considerable research has been devoted to a phenomenon known as spatial confounding. Spatial confounding is thought to occur when there is multicollinearity between a covariate and the random effect in a spatial regression model. This multicollinearity is considered highly problematic when the inferential goal is estimating regression coefficients and various methodologies have been proposed to attempt to alleviate it. Recently, it has become apparent that many of these methodologies are flawed, yet the field continues to expand. In this paper, we offer a novel perspective of synthesizing the work in the field of spatial confounding. We propose that at least two distinct phenomena are currently conflated with the term spatial confounding. We refer to these as the ``analysis model'' and the ``data generation'' types of spatial confounding. We show that these two issues can lead to contradicting conclusions about whether spatial confounding exists and whether methods to alleviate it will improve inference. Our results also illustrate that in most cases, traditional spatial linear mixed models do help to improve inference on regression coefficients. Drawing on the insights gained, we offer a path forward for research in spatial confounding."}, "https://arxiv.org/abs/2302.12111": {"title": "Communication-Efficient Distributed Estimation and Inference for Cox's Model", "link": "https://arxiv.org/abs/2302.12111", "description": "arXiv:2302.12111v3 Announce Type: replace \nAbstract: Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. 
In addition, we provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods."}, "https://arxiv.org/abs/2304.01921": {"title": "Individual Welfare Analysis: Random Quasilinear Utility, Independence, and Confidence Bounds", "link": "https://arxiv.org/abs/2304.01921", "description": "arXiv:2304.01921v3 Announce Type: replace \nAbstract: We introduce a novel framework for individual-level welfare analysis. It builds on a parametric model for continuous demand with a quasilinear utility function, allowing for heterogeneous coefficients and unobserved individual-good-level preference shocks. We obtain bounds on the individual-level consumer welfare loss at any confidence level due to a hypothetical price increase, solving a scalable optimization problem constrained by a novel confidence set under an independence restriction. This confidence set is computationally simple and robust to weak instruments, nonlinearity, and partial identification. The validity of the confidence set is guaranteed by our new results on the joint limiting distribution of the independence test by Chatterjee (2021). These results together with the confidence set may have applications beyond welfare analysis. Monte Carlo simulations and two empirical applications on gasoline and food demand demonstrate the effectiveness of our method."}, "https://arxiv.org/abs/2305.16018": {"title": "Accommodating informative visit times for analysing irregular longitudinal data: a sensitivity analysis approach with balancing weights estimators", "link": "https://arxiv.org/abs/2305.16018", "description": "arXiv:2305.16018v2 Announce Type: replace \nAbstract: Irregular longitudinal data with informative visit times arise when patients' visits are partly driven by concurrent disease outcomes. However, existing methods, such as inverse intensity weighting (IIW), often overlook or do not adequately assess the influence of informative visit times on estimation and inference. Based on novel balancing weights estimators, we propose a new sensitivity analysis approach to addressing informative visit times within the IIW framework. The balancing weights are obtained by balancing observed history variable distributions over time and including a selection function with specified sensitivity parameters to characterise the additional influence of the concurrent outcome on the visit process. A calibration procedure is proposed to anchor the range of the sensitivity parameters to the amount of variation in the visit process that could be additionally explained by the concurrent outcome given the observed history and time. Simulations demonstrate that our balancing weights estimators outperform existing weighted estimators for robustness and efficiency. 
We provide an R Markdown tutorial of the proposed methods and apply them to analyse data from a clinic-based cohort of psoriatic arthritis."}, "https://arxiv.org/abs/2308.01704": {"title": "Similarity-based Random Partition Distribution for Clustering Functional Data", "link": "https://arxiv.org/abs/2308.01704", "description": "arXiv:2308.01704v3 Announce Type: replace \nAbstract: Random partition distribution is a crucial tool for model-based clustering. This study advances the field of random partition in the context of functional spatial data, focusing on the challenges posed by hourly population data across various regions and dates. We propose an extended generalized Dirichlet process, named the similarity-based generalized Dirichlet process (SGDP), to address the limitations of simple random partition distributions (e.g., those induced by the Dirichlet process), such as an overabundance of clusters. This model prevents excess cluster production as well as incorporates pairwise similarity information to ensure accurate and meaningful grouping. The theoretical properties of the SGDP are studied. Then, SGDP-based random partition is applied to a real-world dataset of hourly population flow in $500\\text{m}^2$ meshes in the central part of Tokyo. In this empirical context, our method excels at detecting meaningful patterns in the data while accounting for spatial nuances. The results underscore the adaptability and utility of the method, showcasing its prowess in revealing intricate spatiotemporal dynamics. The proposed SGDP will significantly contribute to urban planning, transportation, and policy-making and will be a helpful tool for understanding population dynamics and their implications."}, "https://arxiv.org/abs/2308.15681": {"title": "Scalable Composite Likelihood Estimation of Probit Models with Crossed Random Effects", "link": "https://arxiv.org/abs/2308.15681", "description": "arXiv:2308.15681v2 Announce Type: replace \nAbstract: Crossed random effects structures arise in many scientific contexts. They raise severe computational problems with likelihood computations scaling like $N^{3/2}$ or worse for $N$ data points. In this paper we develop a new composite likelihood approach for crossed random effects probit models. For data arranged in R rows and C columns, the likelihood function includes a very difficult R + C dimensional integral. The composite likelihood we develop uses the marginal distribution of the response along with two hierarchical models. The cost is reduced to $\\mathcal{O}(N)$ and it can be computed with $R + C$ one dimensional integrals. We find that the commonly used Laplace approximation has a cost that grows superlinearly. We get consistent estimates of the probit slope and variance components from our composite likelihood algorithm. We also show how to estimate the covariance of the estimated regression coefficients. The algorithm scales readily to a data set of five million observations from Stitch Fix with $R + C > 700{,}000$."}, "https://arxiv.org/abs/2308.15986": {"title": "Sensitivity Analysis of Inverse Probability Weighting Estimators of Causal Effects in Observational Studies with Multivalued Treatments", "link": "https://arxiv.org/abs/2308.15986", "description": "arXiv:2308.15986v4 Announce Type: replace \nAbstract: One of the fundamental challenges in drawing causal inferences from observational studies is that the assumption of no unmeasured confounding is not testable from observed data. 
Therefore, assessing sensitivity to this assumption's violation is important to obtain valid causal conclusions in observational studies. Although several sensitivity analysis frameworks are available in the causal inference literature, very few of them are applicable to observational studies with multivalued treatments. To address this issue, we propose a framework for performing sensitivity analysis in multivalued treatment settings. Within this framework, a general class of additive causal estimands is proposed. We demonstrate that the estimation of the causal estimands under the proposed sensitivity model can be performed very efficiently. Simulation results show that the proposed framework performs well in terms of bias of the point estimates and coverage of the confidence intervals when there is sufficient overlap in the covariate distributions. We illustrate the application of our proposed method by conducting an observational study that estimates the causal effect of fish consumption on blood mercury levels."}, "https://arxiv.org/abs/2310.13764": {"title": "Statistical Inference for Bures-Wasserstein Flows", "link": "https://arxiv.org/abs/2310.13764", "description": "arXiv:2310.13764v2 Announce Type: replace \nAbstract: We develop a statistical framework for conducting inference on collections of time-varying covariance operators (covariance flows) over a general, possibly infinite dimensional, Hilbert space. We model the intrinsically non-linear structure of covariances by means of the Bures-Wasserstein metric geometry. We make use of the Riemannian-like structure induced by this metric to define a notion of mean and covariance of a random flow, and develop an associated Karhunen-Lo\\`eve expansion. We then treat the problem of estimation and construction of functional principal components from a finite collection of covariance flows, observed fully or irregularly.\n Our theoretical results are motivated by modern problems in functional data analysis, where one observes operator-valued random processes -- for instance when analysing dynamic functional connectivity and fMRI data, or when analysing multiple functional time series in the frequency domain. Nevertheless, our framework is also novel in finite dimensions (the matrix case), and we demonstrate what simplifications can be afforded then. We illustrate our methodology by means of simulations and data analyses."}, "https://arxiv.org/abs/2311.13410": {"title": "Assessing the Unobserved: Enhancing Causal Inference in Sociology with Sensitivity Analysis", "link": "https://arxiv.org/abs/2311.13410", "description": "arXiv:2311.13410v2 Announce Type: replace \nAbstract: Explaining social events is a primary objective of applied data-driven sociology. To achieve that objective, many sociologists use statistical causal inference to identify causality using observational studies, a research context where the analyst does not control the data-generating process. However, it is often challenging in observational studies to satisfy the no-unmeasured-confounding assumption, namely, that there is no lurking third variable affecting the causal relationship of interest. In this article, we develop a framework enabling sociologists to employ a different strategy to enhance the quality of observational studies. 
Our framework builds on a surprisingly simple statistical approach, sensitivity analysis: a thought-experimental framework in which the analyst imagines a lever that can be pulled to probe a range of theoretically driven magnitudes of posited unmeasured confounding, each of which distorts the causal effect of interest. By pulling that lever, the analyst can identify how strong an unmeasured confounder must be to wash away the estimated causal effect. Although each sensitivity analysis method requires its own assumptions, this sort of post-hoc analysis provides underutilized tools to bound causal quantities. Extending Lundberg et al., we develop a five-step approach to how applied sociological research can incorporate sensitivity analysis, empowering scholars to rejuvenate causal inference in observational studies."}, "https://arxiv.org/abs/2311.14889": {"title": "Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data", "link": "https://arxiv.org/abs/2311.14889", "description": "arXiv:2311.14889v2 Announce Type: replace \nAbstract: In this paper we review recent advances in statistical methods for the evaluation of the heterogeneity of treatment effects (HTE), including subgroup identification and estimation of individualized treatment regimens, from randomized clinical trials and observational studies. We identify several types of approaches using the features introduced in Lipkovich, Dmitrienko and D'Agostino (2017) that distinguish the recommended principled methods from basic methods for HTE evaluation that typically rely on rules of thumb and general guidelines (the methods are often referred to as common practices). We discuss the advantages and disadvantages of various principled methods as well as common measures for evaluating their performance. We use simulated data and a case study based on a historical clinical trial to illustrate several new approaches to HTE evaluation."}, "https://arxiv.org/abs/2401.06261": {"title": "Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables", "link": "https://arxiv.org/abs/2401.06261", "description": "arXiv:2401.06261v2 Announce Type: replace \nAbstract: Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated with multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent, which is usually not possible when considering a group of candidate genes from the same locus. We used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition. This is a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results even at modest sample sizes. Importantly, the causal effect estimates remain unbiased and their variance small when instruments are highly correlated. 
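To make the role of correlated instruments concrete, here is a generic two-stage least squares sketch with two exposures driven collectively by a correlated instrument set; it is a simplified stand-in for MVMR, with simulated data and parameter values chosen purely for illustration and not taken from the paper. Projecting the exposures onto the full instrument space is what lets a set of individually non-exclusive, correlated instruments identify the direct effects jointly.

```python
# Generic 2SLS with multiple exposures and correlated instruments (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
n, k_z = 5000, 4
# Correlated instruments, e.g. variants from the same locus.
corr = 0.7 * np.ones((k_z, k_z)) + 0.3 * np.eye(k_z)
Z = rng.normal(size=(n, k_z)) @ np.linalg.cholesky(corr).T
U = rng.normal(size=n)                                        # unobserved confounder
X = Z @ rng.normal(size=(k_z, 2)) + np.outer(U, [1.0, 1.0]) + rng.normal(size=(n, 2))
beta_true = np.array([0.5, -0.25])
Y = X @ beta_true + U + rng.normal(size=n)

# Stage 1: project exposures onto the instrument space; Stage 2: regress Y on the projections.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ Y)
print("2SLS estimates:", beta_2sls, " true:", beta_true)
```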
We applied MVMR with correlated instrumental variable sets at risk loci from genome-wide association studies (GWAS) for coronary artery disease using eQTL data from the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR is run on a tissue-by-tissue basis, and testing all gene-tissue pairs at a given locus in a single model to predict causal gene-tissue combinations remains infeasible."}, "https://arxiv.org/abs/2305.11561": {"title": "Causal Inference on Process Graphs, Part I: The Structural Equation Process Representation", "link": "https://arxiv.org/abs/2305.11561", "description": "arXiv:2305.11561v2 Announce Type: replace-cross \nAbstract: When dealing with time series data, causal inference methods often employ structural vector autoregressive (SVAR) processes to model time-evolving random systems. In this work, we rephrase recursive SVAR processes with possible latent component processes as a linear Structural Causal Model (SCM) of stochastic processes on a simple causal graph, the \\emph{process graph}, that models every process as a single node. Using this reformulation, we generalise Wright's well-known path-rule for linear Gaussian SCMs to the newly introduced process SCMs and we express the auto-covariance sequence of an SVAR process by means of a generalised trek-rule. Employing the Fourier-Transformation, we derive compact expressions for causal effects in the frequency domain that allow us to efficiently visualise the causal interactions in a multivariate SVAR process. Finally, we observe that the process graph can be used to formulate graphical criteria for identifying causal effects and to derive algebraic relations with which these frequency domain causal effects can be recovered from the observed spectral density."}, "https://arxiv.org/abs/2312.07320": {"title": "Convergence rates of non-stationary and deep Gaussian process regression", "link": "https://arxiv.org/abs/2312.07320", "description": "arXiv:2312.07320v3 Announce Type: replace-cross \nAbstract: The focus of this work is the convergence of non-stationary and deep Gaussian process regression. More precisely, we follow a Bayesian approach to regression or interpolation, where the prior placed on the unknown function $f$ is a non-stationary or deep Gaussian process, and we derive convergence rates of the posterior mean to the true function $f$ in terms of the number of observed training points. In some cases, we also show convergence of the posterior variance to zero. The only assumption imposed on the function $f$ is that it is an element of a certain reproducing kernel Hilbert space, which we in particular cases show to be norm-equivalent to a Sobolev space. Our analysis includes the case of estimated hyper-parameters in the covariance kernels employed, both in an empirical Bayes' setting and the particular hierarchical setting constructed through deep Gaussian processes. We consider the settings of noise-free or noisy observations on deterministic or random training points. We establish general assumptions sufficient for the convergence of deep Gaussian process regression, along with explicit examples demonstrating the fulfilment of these assumptions. 
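The following small sketch illustrates the basic phenomenon studied in the Gaussian process regression abstract above, namely the posterior mean approaching the true function as the number of training points grows; it uses a stationary Matern kernel from scikit-learn as a stand-in, whereas the paper's non-stationary and deep settings are richer, and the test function and noise level are arbitrary choices made here.

```python
# Posterior mean error of GP regression shrinking with the sample size (illustrative sketch).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

f = lambda x: np.sin(3 * x) + 0.5 * x
rng = np.random.default_rng(3)
for n in (10, 40, 160):
    X = rng.uniform(0, 1, size=(n, 1))
    y = f(X).ravel() + 0.05 * rng.normal(size=n)
    gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.05 ** 2).fit(X, y)
    Xt = np.linspace(0, 1, 200).reshape(-1, 1)
    err = np.max(np.abs(gpr.predict(Xt) - f(Xt).ravel()))
    print(f"n={n:4d}  sup-norm error of posterior mean = {err:.3f}")
```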
Specifically, our examples require that the H\\\"older or Sobolev norms of the penultimate layer are bounded almost surely."}, "https://arxiv.org/abs/2406.17056": {"title": "Efficient two-sample instrumental variable estimators with change points and near-weak identification", "link": "https://arxiv.org/abs/2406.17056", "description": "arXiv:2406.17056v1 Announce Type: new \nAbstract: We consider estimation and inference in a linear model with endogenous regressors where the parameters of interest change across two samples. If the first-stage is common, we show how to use this information to obtain more efficient two-sample GMM estimators than the standard split-sample GMM, even in the presence of near-weak instruments. We also propose two tests to detect change points in the parameters of interest, depending on whether the first-stage is common or not. We derive the limiting distribution of these tests and show that they have non-trivial power even under weaker and possibly time-varying identification patterns. The finite sample properties of our proposed estimators and testing procedures are illustrated in a series of Monte-Carlo experiments, and in an application to the open-economy New Keynesian Phillips curve. Our empirical analysis using US data provides strong support for a New Keynesian Phillips curve with incomplete pass-through and reveals important time variation in the relationship between inflation and exchange rate pass-through."}, "https://arxiv.org/abs/2406.17058": {"title": "Bayesian Deep ICE", "link": "https://arxiv.org/abs/2406.17058", "description": "arXiv:2406.17058v1 Announce Type: new \nAbstract: Deep Independent Component Estimation (DICE) has many applications in modern day machine learning as a feature engineering extraction method. We provide a novel latent variable representation of independent component analysis that enables both point estimates via expectation-maximization (EM) and full posterior sampling via Markov Chain Monte Carlo (MCMC) algorithms. Our methodology also applies to flow-based methods for nonlinear feature extraction. We discuss how to implement conditional posteriors and envelope-based methods for optimization. Through this representation hierarchy, we unify a number of hitherto disjoint estimation procedures. We illustrate our methodology and algorithms on a numerical example. Finally, we conclude with directions for future research."}, "https://arxiv.org/abs/2406.17131": {"title": "Bayesian temporal biclustering with applications to multi-subject neuroscience studies", "link": "https://arxiv.org/abs/2406.17131", "description": "arXiv:2406.17131v1 Announce Type: new \nAbstract: We consider the problem of analyzing multivariate time series collected on multiple subjects, with the goal of identifying groups of subjects exhibiting similar trends in their recorded measurements over time as well as time-varying groups of associated measurements. To this end, we propose a Bayesian model for temporal biclustering featuring nested partitions, where a time-invariant partition of subjects induces a time-varying partition of measurements. Our approach allows for data-driven determination of the number of subject and measurement clusters as well as estimation of the number and location of changepoints in measurement partitions. To efficiently perform model fitting and posterior estimation with Markov Chain Monte Carlo, we derive a blocked update of measurements' cluster-assignment sequences. 
We illustrate the performance of our model in two applications to functional magnetic resonance imaging data and to an electroencephalogram dataset. The results indicate that the proposed model can combine information from potentially many subjects to discover a set of interpretable, dynamic patterns. Experiments on simulated data compare the estimation performance of the proposed model against ground-truth values and other statistical methods, showing that it performs well at identifying ground-truth subject and measurement clusters even when no subject or time dependence is present."}, "https://arxiv.org/abs/2406.17278": {"title": "Estimation and Inference for CP Tensor Factor Models", "link": "https://arxiv.org/abs/2406.17278", "description": "arXiv:2406.17278v1 Announce Type: new \nAbstract: High-dimensional tensor-valued data have recently gained attention from researchers in economics and finance. We consider the estimation and inference of high-dimensional tensor factor models, where each dimension of the tensor diverges. Our focus is on a factor model that admits CP-type tensor decomposition, which allows for non-orthogonal loading vectors. Based on the contemporary covariance matrix, we propose an iterative simultaneous projection estimation method. Our estimator is robust to weak dependence among factors and weak correlation across different dimensions in the idiosyncratic shocks. We establish an inferential theory, demonstrating both consistency and asymptotic normality under relaxed assumptions. Within a unified framework, we consider two eigenvalue ratio-based estimators for the number of factors in a tensor factor model and justify their consistency. Through a simulation study and two empirical applications featuring sorted portfolios and international trade flows, we illustrate the advantages of our proposed estimator over existing methodologies in the literature."}, "https://arxiv.org/abs/2406.17318": {"title": "Model Uncertainty in Latent Gaussian Models with Univariate Link Function", "link": "https://arxiv.org/abs/2406.17318", "description": "arXiv:2406.17318v1 Announce Type: new \nAbstract: We consider a class of latent Gaussian models with a univariate link function (ULLGMs). These are based on standard likelihood specifications (such as Poisson, Binomial, Bernoulli, Erlang, etc.) but incorporate a latent normal linear regression framework on a transformation of a key scalar parameter. We allow for model uncertainty regarding the covariates included in the regression. The ULLGM class typically accommodates extra dispersion in the data and has clear advantages for deriving theoretical properties and designing computational procedures. We formally characterize posterior existence under a convenient and popular improper prior and propose an efficient Markov chain Monte Carlo algorithm for Bayesian model averaging in ULLGMs. Simulation results suggest that the framework provides accurate results that are robust to some degree of misspecification. 
The methodology is successfully applied to measles vaccination coverage data from Ethiopia and to data on bilateral migration flows between OECD countries."}, "https://arxiv.org/abs/2406.17361": {"title": "Tree-based variational inference for Poisson log-normal models", "link": "https://arxiv.org/abs/2406.17361", "description": "arXiv:2406.17361v1 Announce Type: new \nAbstract: When studying ecosystems, hierarchical trees are often used to organize entities based on proximity criteria, such as the taxonomy in microbiology, social classes in geography, or product types in retail businesses, offering valuable insights into entity relationships. Despite their significance, current count-data models do not leverage this structured information. In particular, the widely used Poisson log-normal (PLN) model, known for its ability to model interactions between entities from count data, lacks the possibility to incorporate such hierarchical tree structures, limiting its applicability in domains characterized by such complexities. To address this matter, we introduce the PLN-Tree model as an extension of the PLN model, specifically designed for modeling hierarchical count data. By integrating structured variational inference techniques, we propose an adapted training procedure and establish identifiability results, enhancing both theoretical foundations and practical interpretability. Additionally, we extend our framework to classification tasks as a preprocessing pipeline, showcasing its versatility. Experimental evaluations on synthetic datasets as well as real-world microbiome data demonstrate the superior performance of the PLN-Tree model in capturing hierarchical dependencies and providing valuable insights into complex data structures, showing the practical interest of knowledge graphs like the taxonomy in ecosystems modeling."}, "https://arxiv.org/abs/2406.17444": {"title": "Bayesian Partial Reduced-Rank Regression", "link": "https://arxiv.org/abs/2406.17444", "description": "arXiv:2406.17444v1 Announce Type: new \nAbstract: Reduced-rank (RR) regression may be interpreted as a dimensionality reduction technique able to reveal complex relationships among the data parsimoniously. However, RR regression models typically overlook any potential group structure among the responses by assuming a low-rank structure on the coefficient matrix. To address this limitation, a Bayesian Partial RR (BPRR) regression is exploited, where the response vector and the coefficient matrix are partitioned into low- and full-rank sub-groups. As opposed to the literature, which assumes known group structure and rank, a novel strategy is introduced that treats them as unknown parameters to be estimated. The main contribution is two-fold: an approach to infer the low- and full-rank group memberships from the data is proposed, and then, conditionally on this allocation, the corresponding (reduced) rank is estimated. Both steps are carried out in a Bayesian approach, allowing for full uncertainty quantification and based on a partially collapsed Gibbs sampler. It relies on a Laplace approximation of the marginal likelihood and the Metropolized Shotgun Stochastic Search to estimate the group allocation efficiently. 
Applications to synthetic and real-world data demonstrate the potential of the proposed method to reveal hidden structures in the data."}, "https://arxiv.org/abs/2406.17445": {"title": "Copula-Based Estimation of Causal Effects in Multiple Linear and Path Analysis Models", "link": "https://arxiv.org/abs/2406.17445", "description": "arXiv:2406.17445v1 Announce Type: new \nAbstract: Regression analysis is one of the most popular statistical techniques, but it only measures the direct effect of independent variables on the dependent variable. Path analysis looks for both direct and indirect effects of independent variables and may overcome several hurdles associated with regression models. It utilizes one or more structural regression equations in the model which are used to estimate the unknown parameters. The aim of this work is to study path analysis models when the endogenous (dependent) variable and exogenous (independent) variables are linked through elliptical copulas. Using well-organized numerical schemes, we investigate the performance of path models when direct and indirect effects are estimated by applying classical ordinary least squares and copula-based regression approaches in different scenarios. Finally, two real data applications are also presented to demonstrate the performance of path analysis using the copula approach."}, "https://arxiv.org/abs/2406.17466": {"title": "Two-Stage Testing in a high dimensional setting", "link": "https://arxiv.org/abs/2406.17466", "description": "arXiv:2406.17466v1 Announce Type: new \nAbstract: In a high dimensional regression setting in which the number of variables ($p$) is much larger than the sample size ($n$), the number of possible two-way interactions between the variables is immense. If the number of variables is on the order of one million, which is usually the case in, e.g., genetics, the number of two-way interactions is of the order one million squared. In the pursuit of detecting two-way interactions, testing all pairs for interactions one-by-one is computationally infeasible and the multiple testing correction will be severe. In this paper we describe a two-stage testing procedure consisting of a screening and an evaluation stage. It is proven that, under some assumptions, the test statistics in the two stages are asymptotically independent. As a result, multiplicity correction in the second stage is only needed for the number of statistical tests that are actually performed in that stage. This increases the power of the testing procedure. Also, since the testing procedure in the first stage is computationally simple, the computational burden is lowered. Simulations have been performed for multiple settings and regression models (generalized linear models and Cox PH model) to study the performance of the two-stage testing procedure. The results show type I error control and an increase in power compared to the procedure in which the pairs are tested one-by-one."}, "https://arxiv.org/abs/2406.17567": {"title": "Transfer Learning for High Dimensional Robust Regression", "link": "https://arxiv.org/abs/2406.17567", "description": "arXiv:2406.17567v1 Announce Type: new \nAbstract: Transfer learning has become an essential technique for utilizing information from source datasets to improve the performance of the target task. However, in the context of high-dimensional data, heterogeneity arises due to heteroscedastic variance or inhomogeneous covariate effects. 
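The screen-then-evaluate idea in the "Two-Stage Testing in a high dimensional setting" abstract above can be sketched as follows; the marginal-correlation screen, the thresholds, and the simulated data are illustrative assumptions rather than the paper's exact procedure, but they show why the multiplicity correction only needs to cover the tests actually performed in the second stage.

```python
# Illustrative screen-then-test sketch for two-way interactions (not the paper's exact procedure).
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n, p = 400, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 1.5 * X[:, 0] * X[:, 1] + rng.normal(size=n)    # one true interaction

# Stage 1 (screening): keep the variables most correlated with the outcome.
marginal = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = sorted(np.argsort(marginal)[-10:])

# Stage 2 (evaluation): test interactions only among screened variables and
# correct for the number of tests actually performed in this stage.
pairs = list(itertools.combinations(keep, 2))
alpha = 0.05 / len(pairs)
for j, k in pairs:
    design = sm.add_constant(np.column_stack([X[:, j], X[:, k], X[:, j] * X[:, k]]))
    pval = sm.OLS(y, design).fit().pvalues[3]                           # interaction term
    if pval < alpha:
        print(f"interaction ({j}, {k}) flagged: p = {pval:.2e}")
```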
To solve this problem, this paper proposes a robust transfer learning method based on Huber regression, specifically designed for scenarios where the transferable source data set is known. This method effectively mitigates the impact of data heteroscedasticity, leading to improvements in estimation and prediction accuracy. Moreover, when the transferable source data set is unknown, the paper introduces an efficient detection algorithm to identify informative sources. The effectiveness of the proposed method is demonstrated through numerical simulation and an empirical analysis using superconductor data."}, "https://arxiv.org/abs/2406.17571": {"title": "Causal Responder Detection", "link": "https://arxiv.org/abs/2406.17571", "description": "arXiv:2406.17571v1 Announce Type: new \nAbstract: We introduce causal responder detection (CARD), a novel method for responder analysis that identifies treated subjects who significantly respond to a treatment. Leveraging recent advances in conformal prediction, CARD employs machine learning techniques to accurately identify responders while controlling the false discovery rate in finite sample sizes. Additionally, we incorporate a propensity score adjustment to mitigate bias arising from non-random treatment allocation, enhancing the robustness of our method in observational settings. Simulation studies demonstrate that CARD effectively detects responders with high power in diverse scenarios."}, "https://arxiv.org/abs/2406.17637": {"title": "Nowcasting in triple-system estimation", "link": "https://arxiv.org/abs/2406.17637", "description": "arXiv:2406.17637v1 Announce Type: new \nAbstract: When samples that each cover part of a population for a certain reference date become available slowly over time, an estimate of the population size can be obtained when at least two samples are available. Ideally one uses all the available samples, but if some samples become available much later one may want to use the samples that are available earlier, to obtain a preliminary or nowcast estimate. However, a limited number of samples may no longer lead to asymptotically unbiased estimates, particularly in the case of two early available samples that suffer from pairwise dependence. In this paper we propose a multiple system nowcasting model that deals with this issue by combining the early available samples with samples from a previous reference date and the expectation-maximisation algorithm. This leads to a nowcast estimate that is asymptotically unbiased under more relaxed assumptions than the dual-system estimator. The multiple system nowcasting model is applied to the problem of estimating the number of homeless people in The Netherlands, which leads to reasonably accurate nowcast estimates."}, "https://arxiv.org/abs/2406.17708": {"title": "Forecast Relative Error Decomposition", "link": "https://arxiv.org/abs/2406.17708", "description": "arXiv:2406.17708v1 Announce Type: new \nAbstract: We introduce a class of relative error decomposition measures that are well-suited for the analysis of shocks in nonlinear dynamic models. They include the Forecast Relative Error Decomposition (FRED), Forecast Error Kullback Decomposition (FEKD) and Forecast Error Laplace Decomposition (FELD). These measures are preferable to the traditional Forecast Error Variance Decomposition (FEVD) because they account for nonlinear dependence in both a serial and cross-sectional sense. 
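A generic two-step transfer sketch in the spirit of the robust transfer-learning abstract above: fit a Huber regression on the large source sample and estimate a sparse correction on the small target sample. The specific estimators, penalty level, and simulated data are assumptions made here for illustration and are not the paper's algorithm.

```python
# "Fit on source, correct on target" transfer sketch with Huber loss (illustrative).
import numpy as np
from sklearn.linear_model import HuberRegressor, Lasso

rng = np.random.default_rng(5)
p = 20
beta_src = rng.normal(size=p)
beta_tgt = beta_src.copy()
beta_tgt[:2] += 0.5                                    # small contrast between source and target

def simulate(n, beta):
    X = rng.normal(size=(n, p))
    return X, X @ beta + rng.standard_t(df=2, size=n)  # heavy-tailed noise

Xs, ys = simulate(2000, beta_src)                      # large source sample
Xt, yt = simulate(100, beta_tgt)                       # small target sample

base = HuberRegressor(max_iter=1000).fit(Xs, ys)                 # step 1: robust fit on source
delta = Lasso(alpha=0.05).fit(Xt, yt - Xt @ base.coef_)          # step 2: sparse correction on target
beta_hat = base.coef_ + delta.coef_

target_only = HuberRegressor(max_iter=1000).fit(Xt, yt).coef_
print("error, target-only Huber :", np.linalg.norm(target_only - beta_tgt))
print("error, transfer estimate :", np.linalg.norm(beta_hat - beta_tgt))
```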
This is illustrated by applications to dynamic models for qualitative data, count data, stochastic volatility and cyberrisk."}, "https://arxiv.org/abs/2406.17422": {"title": "Causal Inference on Process Graphs, Part II: Causal Structure and Effect Identification", "link": "https://arxiv.org/abs/2406.17422", "description": "arXiv:2406.17422v1 Announce Type: cross \nAbstract: A structural vector autoregressive (SVAR) process is a linear causal model for variables that evolve over a discrete set of time points and between which there may be lagged and instantaneous effects. The qualitative causal structure of an SVAR process can be represented by its finite and directed process graph, in which a directed link connects two processes whenever there is a lagged or instantaneous effect between them. At the process graph level, the causal structure of SVAR processes is compactly parameterised in the frequency domain. In this paper, we consider the problem of causal discovery and causal effect estimation from the spectral density, the frequency domain analogue of the auto covariance, of the SVAR process. Causal discovery concerns the recovery of the process graph and causal effect estimation concerns the identification and estimation of causal effects in the frequency domain.\n We show that information about the process graph, in terms of $d$- and $t$-separation statements, can be identified by verifying algebraic constraints on the spectral density. Furthermore, we introduce a notion of rational identifiability for frequency causal effects that may be confounded by exogenous latent processes, and show that the recent graphical latent factor half-trek criterion can be used on the process graph to assess whether a given (confounded) effect can be identified by rational operations on the entries of the spectral density."}, "https://arxiv.org/abs/2406.17714": {"title": "Compositional Models for Estimating Causal Effects", "link": "https://arxiv.org/abs/2406.17714", "description": "arXiv:2406.17714v1 Announce Type: cross \nAbstract: Many real-world systems can be represented as sets of interacting components. Examples of such systems include computational systems such as query processors, natural systems such as cells, and social systems such as families. Many approaches have been proposed in traditional (associational) machine learning to model such structured systems, including statistical relational models and graph neural networks. Despite this prior work, existing approaches to estimating causal effects typically treat such systems as single units, represent them with a fixed set of variables and assume a homogeneous data-generating process. We study a compositional approach for estimating individual treatment effects (ITE) in structured systems, where each unit is represented by the composition of multiple heterogeneous components. This approach uses a modular architecture to model potential outcomes at each component and aggregates component-level potential outcomes to obtain the unit-level potential outcomes. We discover novel benefits of the compositional approach in causal inference - systematic generalization to estimate counterfactual outcomes of unseen combinations of components and improved overlap guarantees between treatment and control groups compared to the classical methods for causal effect estimation. 
We also introduce a set of novel environments for empirically evaluating the compositional approach and demonstrate the effectiveness of our approach using both simulated and real-world data."}, "https://arxiv.org/abs/2109.13648": {"title": "Gaussian and Student's $t$ mixture vector autoregressive model with application to the effects of the Euro area monetary policy shock", "link": "https://arxiv.org/abs/2109.13648", "description": "arXiv:2109.13648v4 Announce Type: replace \nAbstract: A new mixture vector autoregressive model based on Gaussian and Student's $t$ distributions is introduced. As its mixture components, our model incorporates conditionally homoskedastic linear Gaussian vector autoregressions and conditionally heteroskedastic linear Student's $t$ vector autoregressions. For a $p$th order model, the mixing weights depend on the full distribution of the preceding $p$ observations, which leads to attractive practical and theoretical properties such as ergodicity and full knowledge of the stationary distribution of $p+1$ consecutive observations. A structural version of the model with statistically identified shocks is also proposed. The empirical application studies the effects of the Euro area monetary policy shock. We fit a two-regime model to the data and find the effects, particularly on inflation, stronger in the regime that mainly prevails before the Financial crisis than in the regime that mainly dominates after it. The introduced methods are implemented in the accompanying R package gmvarkit."}, "https://arxiv.org/abs/2210.09828": {"title": "Modelling Large Dimensional Datasets with Markov Switching Factor Models", "link": "https://arxiv.org/abs/2210.09828", "description": "arXiv:2210.09828v4 Announce Type: replace \nAbstract: We study a novel large dimensional approximate factor model with regime changes in the loadings driven by a latent first order Markov process. By exploiting the equivalent linear representation of the model, we first recover the latent factors by means of Principal Component Analysis. We then cast the model in state-space form, and we estimate loadings and transition probabilities through an EM algorithm based on a modified version of the Baum-Lindgren-Hamilton-Kim filter and smoother that makes use of the factors previously estimated. Our approach is appealing as it provides closed form expressions for all estimators. More importantly, it does not require knowledge of the true number of factors. We derive the theoretical properties of the proposed estimation procedure, and we show their good finite sample performance through a comprehensive set of Monte Carlo experiments. The empirical usefulness of our approach is illustrated through three applications to large U.S. datasets of stock returns, macroeconomic variables, and inflation indexes."}, "https://arxiv.org/abs/2307.09319": {"title": "Estimation of the Number Needed to Treat, the Number Needed to be Exposed, and the Exposure Impact Number with Instrumental Variables", "link": "https://arxiv.org/abs/2307.09319", "description": "arXiv:2307.09319v3 Announce Type: replace \nAbstract: The Number needed to treat (NNT) is an efficacy index defined as the average number of patients needed to treat to attain one additional treatment benefit. In observational studies, specifically in epidemiology, the adequacy of the populationwise NNT is questionable since the exposed group characteristics may substantially differ from the unexposed. 
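The first step described in the Markov-switching factor model abstract above, recovering latent factors by principal components, can be sketched as follows; the simulated panel, the choice of two factors, and the normalization are illustrative assumptions rather than the paper's full estimation procedure.

```python
# Principal-component extraction of latent factors from a large panel (illustrative sketch).
import numpy as np

rng = np.random.default_rng(6)
T, N, r = 300, 100, 2
F = rng.normal(size=(T, r))                      # latent factors
Lam = rng.normal(size=(N, r))                    # loadings
X = F @ Lam.T + rng.normal(size=(T, N))          # observed panel

Xc = X - X.mean(axis=0)
# Eigen-decomposition of the T x T cross-product; factors are the leading eigenvectors.
eigval, eigvec = np.linalg.eigh(Xc @ Xc.T / (T * N))
F_hat = np.sqrt(T) * eigvec[:, ::-1][:, :r]      # estimated factors (up to rotation)

# Sanity check: the estimated factors should span almost the same space as the true ones.
resid = np.linalg.lstsq(F_hat, F, rcond=None)[1]
print("residual sum of squares when regressing true factors on estimates:", resid)
```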
To address this issue, groupwise efficacy indices were defined: the Exposure Impact Number (EIN) for the exposed group and the Number Needed to be Exposed (NNE) for the unexposed. Each defined index answers a unique research question since it targets a unique sub-population. In observational studies, the group allocation is typically affected by confounders that might be unmeasured. The available estimation methods that rely either on randomization or the sufficiency of the measured covariates for confounding control will result in inconsistent estimators of the true NNT (EIN, NNE) in such settings. Using Rubin's potential outcomes framework, we explicitly define the NNT and its derived indices as causal contrasts. Next, we introduce a novel method that uses instrumental variables to estimate the three aforementioned indices in observational studies. We present two analytical examples and a corresponding simulation study. The simulation study illustrates that the novel estimators are statistically consistent, unlike the previously available methods, and their analytical confidence intervals' empirical coverage rates converge to their nominal values. Finally, a real-world data example of an analysis of the effect of vitamin D deficiency on the mortality rate is presented."}, "https://arxiv.org/abs/2406.17827": {"title": "Practical identifiability and parameter estimation of compartmental epidemiological models", "link": "https://arxiv.org/abs/2406.17827", "description": "arXiv:2406.17827v1 Announce Type: new \nAbstract: Practical parameter identifiability in ODE-based epidemiological models is a known issue, yet one that merits further study. It is essentially ubiquitous due to noise and errors in real data. In this study, to avoid uncertainty stemming from data of unknown quality, simulated data with added noise are used to investigate practical identifiability in two distinct epidemiological models. Particular emphasis is placed on the role of initial conditions, which are assumed unknown, except those that are directly measured. Instead of just focusing on one method of estimation, we use and compare results from various broadly used methods, including maximum likelihood and Markov Chain Monte Carlo (MCMC) estimation.\n Among other findings, our analysis revealed that the MCMC estimator is overall more robust than the point estimators considered. Its estimates and predictions are improved when the initial conditions of certain compartments are fixed so that the model becomes globally identifiable. For the point estimators, whether fixing or fitting the initial conditions that are not directly measured improves parameter estimates is model-dependent. Specifically, in the standard SEIR model, fixing the initial condition for the susceptible population S(0) improved parameter estimates, while this was not true when fixing the initial condition of the asymptomatic population in a more involved model. Our study also corroborates that the quality of parameter estimates changes depending on whether pre-peak or post-peak time series are used. 
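One simple way to see how an instrumental variable can back out an NNT-type quantity, loosely in the spirit of the NNT abstract above and not the authors' estimators, is the classical Wald estimator of a risk difference followed by taking its reciprocal; the binary instrument, the simulated data-generating process, and the names below are illustrative.

```python
# Wald-type IV estimate of a risk difference and the implied NNT-type index (illustrative sketch).
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
Z = rng.binomial(1, 0.5, size=n)                                    # binary instrument
U = rng.normal(size=n)                                              # unmeasured confounder
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.5 * Z + U - 0.5))))     # exposure
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 * A + U - 1.0))))     # binary outcome

# Wald estimator: ratio of the reduced-form and first-stage contrasts.
rd = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())
print("IV risk difference:", rd, " implied NNT-type index:", 1.0 / rd)
```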
Finally, our examples suggest that in the presence of significantly noisy data, the value of structural identifiability is moot."}, "https://arxiv.org/abs/2406.17971": {"title": "Robust integration of external control data in randomized trials", "link": "https://arxiv.org/abs/2406.17971", "description": "arXiv:2406.17971v1 Announce Type: new \nAbstract: One approach for increasing the efficiency of randomized trials is the use of \"external controls\" -- individuals who received the control treatment in the trial during routine practice or in prior experimental studies. Existing external control methods, however, can have substantial bias if the populations underlying the trial and the external control data are not exchangeable. Here, we characterize a randomization-aware class of treatment effect estimators in the population underlying the trial that remain consistent and asymptotically normal when using external control data, even when exchangeability does not hold. We consider two members of this class of estimators: the well-known augmented inverse probability weighting trial-only estimator, which is the efficient estimator when only trial data are used; and a more efficient member of the class when exchangeability holds and external control data are available, which we refer to as the optimized randomization-aware estimator. To achieve robust integration of external control data in trial analyses, we then propose a combined estimator based on the efficient trial-only estimator and the optimized randomization-aware estimator. We show that the combined estimator is consistent and no less efficient than the most efficient of the two component estimators, whether the exchangeability assumption holds or not. We examine the estimators' performance in simulations and we illustrate their use with data from two trials of paliperidone extended-release for schizophrenia."}, "https://arxiv.org/abs/2406.18047": {"title": "Shrinkage Estimators for Beta Regression Models", "link": "https://arxiv.org/abs/2406.18047", "description": "arXiv:2406.18047v1 Announce Type: new \nAbstract: The beta regression model is a useful framework to model response variables that are rates or proportions, that is to say, response variables which are continuous and restricted to the interval (0,1). As with any other regression model, parameter estimates may be affected by collinearity or even perfect collinearity among the explanatory variables. To handle these situations shrinkage estimators are proposed. In particular we develop ridge regression and LASSO estimators from a penalized likelihood perspective with a logit link function. The properties of the resulting estimators are evaluated through a simulation study and a real data application"}, "https://arxiv.org/abs/2406.18052": {"title": "Flexible Conformal Highest Predictive Conditional Density Sets", "link": "https://arxiv.org/abs/2406.18052", "description": "arXiv:2406.18052v1 Announce Type: new \nAbstract: We introduce our method, conformal highest conditional density sets (CHCDS), that forms conformal prediction sets using existing estimated conditional highest density predictive regions. We prove the validity of the method and that conformal adjustment is negligible under some regularity conditions. In particular, if we correctly specify the underlying conditional density estimator, the conformal adjustment will be negligible. When the underlying model is incorrect, the conformal adjustment provides guaranteed nominal unconditional coverage. 
We compare the proposed method to other existing methods via simulation and a real data analysis. Our numerical results show that the flexibility of being able to use any existing conditional density estimation method is a large advantage for CHCDS compared to existing methods."}, "https://arxiv.org/abs/2406.18154": {"title": "Errors-In-Variables Model Fitting for Partially Unpaired Data Utilizing Mixture Models", "link": "https://arxiv.org/abs/2406.18154", "description": "arXiv:2406.18154v1 Announce Type: new \nAbstract: The goal of this paper is to introduce a general argumentation framework for regression in the errors-in-variables regime, allowing for full flexibility about the dimensionality of the data, error probability density types, the (linear or nonlinear) model type and the avoidance of explicit definition of loss functions. Further, within this framework we introduce model fitting for partially unpaired data, i.e., data groups for which the pairing information between input and output is lost (a semi-supervised setting). This is achieved by constructing mixture model densities, which directly model this loss of pairing information and allow for inference. In a numerical simulation study, linear and nonlinear model fits are illustrated, and a real data study based on life expectancy data from the World Bank is presented, utilizing a multiple linear regression model. These results allow the conclusion that high quality model fitting is possible with partially unpaired data, which opens the possibility for new applications with unfortunate or deliberate loss of pairing information in the data."}, "https://arxiv.org/abs/2406.18189": {"title": "Functional knockoffs selection with applications to functional data analysis in high dimensions", "link": "https://arxiv.org/abs/2406.18189", "description": "arXiv:2406.18189v1 Announce Type: new \nAbstract: The knockoffs framework is a recently proposed powerful approach that effectively controls the false discovery rate (FDR) for variable selection. However, none of the existing knockoff solutions are directly suited to handle multivariate or high-dimensional functional data, which has become increasingly prevalent in various scientific applications. In this paper, we propose a novel functional model-X knockoffs selection framework tailored to sparse high-dimensional functional models, and show that our proposal can achieve effective FDR control for any sample size. Furthermore, we illustrate the proposed functional model-X knockoffs selection procedure along with the associated theoretical guarantees for both FDR control and asymptotic power using examples of commonly adopted functional linear additive regression models and the functional graphical model. In the construction of functional knockoffs, we integrate essential components including the correlation operator matrix, the Karhunen-Lo\\`eve expansion, and semidefinite programming, and develop executable algorithms. 
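A stripped-down split-conformal version of the CHCDS idea from the abstract above, using a homoscedastic Gaussian conditional density estimate as the plug-in; the density model, split sizes, and coverage target are illustrative assumptions rather than the authors' implementation. The linear mean model is deliberately misspecified for the sinusoidal data, yet the calibrated threshold still delivers roughly nominal marginal coverage, mirroring the abstract's point about model misspecification.

```python
# Split-conformal prediction sets from an estimated conditional density (illustrative sketch).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
def simulate(n):
    X = rng.uniform(-2, 2, size=(n, 1))
    return X, np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=n)

X_tr, y_tr = simulate(1000)        # fit the conditional density estimate
X_cal, y_cal = simulate(1000)      # calibrate the conformal threshold
X_te, y_te = simulate(1000)        # evaluate coverage

reg = LinearRegression().fit(X_tr, y_tr)
sigma = np.std(y_tr - reg.predict(X_tr))
dens = lambda X, y: norm.pdf(y, loc=reg.predict(X), scale=sigma)

# Prediction set: all y whose estimated conditional density exceeds a calibrated threshold.
alpha = 0.1
scores = dens(X_cal, y_cal)
k = int(np.floor(alpha * (len(scores) + 1)))           # lower empirical quantile index
threshold = np.sort(scores)[max(k - 1, 0)]
covered = dens(X_te, y_te) >= threshold
print("empirical coverage (target 0.9):", covered.mean())
```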
We demonstrate the superiority of our proposed methods over the competitors through both extensive simulations and the analysis of two brain imaging datasets."}, "https://arxiv.org/abs/2406.18191": {"title": "Asymptotic Uncertainty in the Estimation of Frequency Domain Causal Effects for Linear Processes", "link": "https://arxiv.org/abs/2406.18191", "description": "arXiv:2406.18191v1 Announce Type: new \nAbstract: Structural vector autoregressive (SVAR) processes are commonly used time series models to identify and quantify causal interactions between dynamically interacting processes from observational data. The causal relationships between these processes can be effectively represented by a finite directed process graph - a graph that connects two processes whenever there is a direct delayed or simultaneous effect between them. Recent research has introduced a framework for quantifying frequency domain causal effects along paths on the process graph. This framework allows to identify how the spectral density of one process is contributing to the spectral density of another. In the current work, we characterise the asymptotic distribution of causal effect and spectral contribution estimators in terms of algebraic relations dictated by the process graph. Based on the asymptotic distribution we construct approximate confidence intervals and Wald type hypothesis tests for the estimated effects and spectral contributions. Under the assumption of causal sufficiency, we consider the class of differentiable estimators for frequency domain causal quantities, and within this class we identify the asymptotically optimal estimator. We illustrate the frequency domain Wald tests and uncertainty approximation on synthetic data, and apply them to analyse the impact of the 10 to 11 year solar cycle on the North Atlantic Oscillation (NAO). Our results confirm a significant effect of the solar cycle on the NAO at the 10 to 11 year time scale."}, "https://arxiv.org/abs/2406.18390": {"title": "The $\\ell$-test: leveraging sparsity in the Gaussian linear model for improved inference", "link": "https://arxiv.org/abs/2406.18390", "description": "arXiv:2406.18390v1 Announce Type: new \nAbstract: We develop novel LASSO-based methods for coefficient testing and confidence interval construction in the Gaussian linear model with $n\\ge d$. Our methods' finite-sample guarantees are identical to those of their ubiquitous ordinary-least-squares-$t$-test-based analogues, yet have substantially higher power when the true coefficient vector is sparse. In particular, our coefficient test, which we call the $\\ell$-test, performs like the one-sided $t$-test (despite not being given any information about the sign) under sparsity, and the corresponding confidence intervals are more than 10% shorter than the standard $t$-test based intervals. The nature of the $\\ell$-test directly provides a novel exact adjustment conditional on LASSO selection for post-selection inference, allowing for the construction of post-selection p-values and confidence intervals. None of our methods require resampling or Monte Carlo estimation. We perform a variety of simulations and a real data analysis on an HIV drug resistance data set to demonstrate the benefits of the $\\ell$-test. 
We end with a discussion of how the $\\ell$-test may asymptotically apply to a much more general class of parametric models."}, "https://arxiv.org/abs/2406.18484": {"title": "An Understanding of Principal Differential Analysis", "link": "https://arxiv.org/abs/2406.18484", "description": "arXiv:2406.18484v1 Announce Type: new \nAbstract: In functional data analysis, replicate observations of a smooth functional process and its derivatives offer a unique opportunity to flexibly estimate continuous-time ordinary differential equation models. Ramsay (1996) first proposed to estimate a linear ordinary differential equation from functional data in a technique called Principal Differential Analysis, by formulating a functional regression in which the highest-order derivative of a function is modelled as a time-varying linear combination of its lower-order derivatives. Principal Differential Analysis was introduced as a technique for data reduction and representation, using solutions of the estimated differential equation as a basis to represent the functional data. In this work, we re-formulate PDA as a generative statistical model in which functional observations arise as solutions of a deterministic ODE that is forced by a smooth random error process. This viewpoint defines a flexible class of functional models based on differential equations and leads to an improved understanding and characterisation of the sources of variability in Principal Differential Analysis. It does, however, result in parameter estimates that can be heavily biased under the standard estimation approach of PDA. Therefore, we introduce an iterative bias-reduction algorithm that can be applied to improve parameter estimates. We also examine the utility of our approach when the form of the deterministic part of the differential equation is unknown and possibly non-linear, where Principal Differential Analysis is treated as an approximate model based on time-varying linearisation. We demonstrate our approach on simulated data from linear and non-linear differential equations and on real data from human movement biomechanics. Supplementary R code for this manuscript is available at \\url{https://github.com/edwardgunning/UnderstandingOfPDAManuscript}."}, "https://arxiv.org/abs/2406.17972": {"title": "LABOR-LLM: Language-Based Occupational Representations with Large Language Models", "link": "https://arxiv.org/abs/2406.17972", "description": "arXiv:2406.17972v1 Announce Type: cross \nAbstract: Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. 
(2024) introduced a transformer-based \"foundation model\", CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions."}, "https://arxiv.org/abs/2406.18240": {"title": "Concordance in basal cell carcinoma diagnosis", "link": "https://arxiv.org/abs/2406.18240", "description": "arXiv:2406.18240v1 Announce Type: cross \nAbstract: Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists."}, "https://arxiv.org/abs/2108.03464": {"title": "Bayesian $L_{\\frac{1}{2}}$ regression", "link": "https://arxiv.org/abs/2108.03464", "description": "arXiv:2108.03464v2 Announce Type: replace \nAbstract: It is well known that Bridge regression enjoys superior theoretical properties when compared to traditional LASSO. 
However, the current latent variable representation of its Bayesian counterpart, based on the exponential power prior, is computationally expensive in higher dimensions. In this paper, we show that the exponential power prior has a closed-form scale mixture of normals decomposition for $\\alpha=(\\frac{1}{2})^\\gamma, \\gamma \\in \\{1, 2,\\ldots\\}$. We call these types of priors $L_{\\frac{1}{2}}$ priors for short. We develop an efficient partially collapsed Gibbs sampling scheme for computation using the $L_{\\frac{1}{2}}$ prior and study theoretical properties when $p>n$. In addition, we introduce a non-separable Bridge penalty function inspired by the fully Bayesian formulation and a novel, efficient coordinate descent algorithm. We prove the algorithm's convergence and show that the local minimizer from our optimisation algorithm has an oracle property. Finally, simulation studies were carried out to illustrate the performance of the new algorithms. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2306.16297": {"title": "A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation", "link": "https://arxiv.org/abs/2306.16297", "description": "arXiv:2306.16297v2 Announce Type: replace \nAbstract: Twin revolutions in wearable technologies and health interventions delivered by smartphones have greatly increased the accessibility of mobile health (mHealth) interventions. Micro-randomized trials (MRTs) are designed to assess the effectiveness of the mHealth intervention and introduce a novel class of causal estimands called \"causal excursion effects.\" These estimands enable the evaluation of how intervention effects change over time and are influenced by individual characteristics or context. However, existing analysis methods for causal excursion effects require prespecified features of the observed high-dimensional history to build a working model for a critical nuisance parameter. Machine learning appears ideal for automatic feature construction, but its naive application can lead to bias under model misspecification. To address this issue, this paper revisits the estimation of causal excursion effects from a meta-learner perspective, where the analyst remains agnostic to the supervised learning algorithms used to estimate nuisance parameters. We present the bidirectional asymptotic properties of the proposed estimators and compare them both theoretically and through extensive simulations. The results show relative efficiency gains and support the suggestion of a doubly robust alternative to existing methods. Finally, the proposed methods' practical utilities are demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States (NeCamp et al., 2020)."}, "https://arxiv.org/abs/2309.04685": {"title": "Simultaneous Modeling of Disease Screening and Severity Prediction: A Multi-task and Sparse Regularization Approach", "link": "https://arxiv.org/abs/2309.04685", "description": "arXiv:2309.04685v2 Announce Type: replace \nAbstract: The exploration of biomarkers, which are clinically useful biomolecules, and the development of prediction models using them are important problems in biomedical research. Biomarkers are widely used for disease screening, and some are related not only to the presence or absence of a disease but also to its severity. These biomarkers can be useful for prioritization of treatment and clinical decision-making.
Considering a model helpful for both disease screening and severity prediction, this paper focuses on regression modeling for an ordinal response equipped with a hierarchical structure.\n If the response variable is a combination of the presence of disease and severity such as \\{{\\it healthy, mild, intermediate, severe}\\}, for example, the simplest method would be to apply the conventional ordinal regression model. However, the conventional model has flexibility issues and may not be suitable for the problems addressed in this paper, where the levels of the response variable might be heterogeneous. Therefore, this paper proposes a model that treats screening and severity prediction as different tasks, and an estimation method based on structural sparse regularization that leverages any common structure between the tasks when such commonality exists. In numerical experiments, the proposed method demonstrated stable performance across many scenarios compared to existing ordinal regression methods."}, "https://arxiv.org/abs/2401.05330": {"title": "Hierarchical Causal Models", "link": "https://arxiv.org/abs/2401.05330", "description": "arXiv:2401.05330v2 Announce Type: replace \nAbstract: Scientists often want to learn about cause and effect from hierarchical data, collected from subunits nested inside units. Consider students in schools, cells in patients, or cities in states. In such settings, unit-level variables (e.g. each school's budget) may affect subunit-level variables (e.g. the test scores of each student in each school) and vice versa. To address causal questions with hierarchical data, we propose hierarchical causal models, which extend structural causal models and causal graphical models by adding inner plates. We develop a general graphical identification technique for hierarchical causal models that extends do-calculus. We find many situations in which hierarchical data can enable causal identification even when it would be impossible with non-hierarchical data, that is, if we had only unit-level summaries of subunit-level variables (e.g. the school's average test score, rather than each student's score). We develop estimation techniques for hierarchical causal models, using methods including hierarchical Bayesian models. We illustrate our results in simulation and via a reanalysis of the classic \"eight schools\" study."}, "https://arxiv.org/abs/2301.13152": {"title": "STEEL: Singularity-aware Reinforcement Learning", "link": "https://arxiv.org/abs/2301.13152", "description": "arXiv:2301.13152v5 Announce Type: replace-cross \nAbstract: Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. The existing methods require an absolute continuity assumption (e.g., there do not exist non-overlapping regions) on the distribution induced by target policies with respect to the data distribution over either the state or action or both. We propose a new batch RL algorithm that allows for singularity for both state and action spaces (e.g., existence of non-overlapping regions between offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning.
Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms, by requiring only a minimal data-coverage assumption, STEEL improves the applicability and robustness of batch RL. In addition, a two-step adaptive STEEL, which is nearly tuning-free, is proposed. Extensive simulation studies and one (semi)-real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL."}, "https://arxiv.org/abs/2406.18681": {"title": "Data Sketching and Stacking: A Confluence of Two Strategies for Predictive Inference in Gaussian Process Regressions with High-Dimensional Features", "link": "https://arxiv.org/abs/2406.18681", "description": "arXiv:2406.18681v1 Announce Type: new \nAbstract: This article focuses on drawing computationally-efficient predictive inference from Gaussian process (GP) regressions with a large number of features when the response is conditionally independent of the features given the projection to a noisy low dimensional manifold. Bayesian estimation of the regression relationship using Markov Chain Monte Carlo and subsequent predictive inference is computationally prohibitive and may lead to inferential inaccuracies since accurate variable selection is essentially impossible in such high-dimensional GP regressions. As an alternative, this article proposes a strategy to sketch the high-dimensional feature vector with a carefully constructed sketching matrix, before fitting a GP with the scalar outcome and the sketched feature vector to draw predictive inference. The analysis is performed in parallel with many different sketching matrices and smoothing parameters in different processors, and the predictive inferences are combined using Bayesian predictive stacking. Since the posterior predictive distribution in each processor is analytically tractable, the algorithm allows bypassing the robustness issues due to convergence and mixing of MCMC chains, leading to fast implementation with a very large number of features. Simulation studies show superior performance of the proposed approach compared with a wide variety of competitors. The approach outperforms competitors in drawing point prediction with predictive uncertainties of outdoor air pollution from satellite images."}, "https://arxiv.org/abs/2406.18819": {"title": "MultiObjMatch: Matching with Optimal Tradeoffs between Multiple Objectives in R", "link": "https://arxiv.org/abs/2406.18819", "description": "arXiv:2406.18819v1 Announce Type: new \nAbstract: In an observational study, matching aims to create many small sets of similar treated and control units from initial samples that may differ substantially in order to permit more credible causal inferences. The problem of constructing matched sets may be formulated as an optimization problem, but it can be challenging to specify a single objective function that adequately captures all the design considerations at work. One solution, proposed by \\citet{pimentel2019optimal}, is to explore a family of matched designs that are Pareto optimal for multiple objective functions.
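As a concrete companion to the STEEL abstract above (arXiv:2301.13152), which characterizes off-policy evaluation error via maximum mean discrepancy, here is a minimal sketch of the standard unbiased MMD^2 estimator between two samples. The Gaussian kernel, bandwidth, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of squared maximum mean discrepancy with an RBF kernel.

    X, Y: arrays of shape (n, d) and (m, d) holding samples from the two
    distributions being compared (e.g., offline state-action pairs vs. pairs
    induced by a candidate target policy).
    """
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))

    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    n, m = len(X), len(Y)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
offline = rng.normal(0.0, 1.0, size=(200, 2))   # hypothetical offline data
target = rng.normal(0.5, 1.0, size=(200, 2))    # hypothetical target-policy samples
print(f"MMD^2 estimate: {mmd2_unbiased(offline, target):.4f}")
```

A larger value signals greater mismatch between the offline data and the distribution a target policy would induce, which is the kind of singularity the abstract's error analysis has to control.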
We present an R package, \\href{https://github.com/ShichaoHan/MultiObjMatch}{\\texttt{MultiObjMatch}}, that implements this multi-objective matching strategy using a network flow algorithm for several common design goals: marginal balance on important covariates, size of the matched sample, and average within-pair multivariate distances. We demonstrate the package's flexibility in exploring user-defined tradeoffs of interest via two case studies, a reanalysis of the canonical National Supported Work dataset and a novel analysis of a clinical dataset to estimate the impact of diabetic kidney disease on hospitalization costs."}, "https://arxiv.org/abs/2406.18829": {"title": "Full Information Linked ICA: addressing missing data problem in multimodal fusion", "link": "https://arxiv.org/abs/2406.18829", "description": "arXiv:2406.18829v1 Announce Type: new \nAbstract: Recent advances in multimodal imaging acquisition techniques have allowed us to measure different aspects of brain structure and function. Multimodal fusion, such as linked independent component analysis (LICA), is popularly used to integrate complementary information. However, it has suffered from missing data, commonly occurring in neuroimaging data. Therefore, in this paper, we propose a Full Information LICA algorithm (FI-LICA) to handle the missing data problem during multimodal fusion under the LICA framework. Built upon complete cases, our method employs the principle of full information and utilizes all available information to recover the missing latent information. Our simulation experiments showed the ideal performance of FI-LICA compared to current practices. Further, we applied FI-LICA to multimodal data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, showcasing better performance in classifying current diagnosis and in predicting the AD transition of participants with mild cognitive impairment (MCI), thereby highlighting the practical utility of our proposed method."}, "https://arxiv.org/abs/2406.18905": {"title": "Bayesian inference: More than Bayes's theorem", "link": "https://arxiv.org/abs/2406.18905", "description": "arXiv:2406.18905v1 Announce Type: new \nAbstract: Bayesian inference gets its name from *Bayes's theorem*, expressing posterior probabilities for hypotheses about a data generating process as the (normalized) product of prior probabilities and a likelihood function. But Bayesian inference uses all of probability theory, not just Bayes's theorem. Many hypotheses of scientific interest are *composite hypotheses*, with the strength of evidence for the hypothesis dependent on knowledge about auxiliary factors, such as the values of nuisance parameters (e.g., uncertain background rates or calibration factors). Many important capabilities of Bayesian methods arise from use of the law of total probability, which instructs analysts to compute probabilities for composite hypotheses by *marginalization* over auxiliary factors. This tutorial targets relative newcomers to Bayesian inference, aiming to complement tutorials that focus on Bayes's theorem and how priors modulate likelihoods. The emphasis here is on marginalization over parameter spaces -- both how it is the foundation for important capabilities, and how it may motivate caution when parameter spaces are large. 
Topics covered include the difference between likelihood and probability, understanding the impact of priors beyond merely shifting the maximum likelihood estimate, and the role of marginalization in accounting for uncertainty in nuisance parameters, systematic error, and model misspecification."}, "https://arxiv.org/abs/2406.18913": {"title": "A Note on Identification of Match Fixed Effects as Interpretable Unobserved Match Affinity", "link": "https://arxiv.org/abs/2406.18913", "description": "arXiv:2406.18913v1 Announce Type: new \nAbstract: We highlight that match fixed effects, represented by the coefficients of interaction terms involving dummy variables for two elements, lack identification without specific restrictions on parameters. Consequently, the coefficients typically reported as relative match fixed effects by statistical software are not interpretable. To address this, we establish normalization conditions that enable identification of match fixed effect parameters as interpretable indicators of unobserved match affinity, facilitating comparisons among observed matches."}, "https://arxiv.org/abs/2406.19021": {"title": "Nonlinear Multivariate Function-on-function Regression with Variable Selection", "link": "https://arxiv.org/abs/2406.19021", "description": "arXiv:2406.19021v1 Announce Type: new \nAbstract: This paper proposes a multivariate nonlinear function-on-function regression model, which allows both the response and the covariates to be multi-dimensional functions. The model is built upon the multivariate functional reproducing kernel Hilbert space (RKHS) theory. It predicts the response function by linearly combining each covariate function in their respective functional RKHS, and extends the representation theorem to accommodate model estimation. Further variable selection is proposed by adding the lasso penalty to the coefficients of the kernel functions. A block coordinate descent algorithm is proposed for model estimation, and several theoretical properties are discussed. Finally, we evaluate the efficacy of our proposed model using simulation data and a real-case dataset in meteorology."}, "https://arxiv.org/abs/2406.19033": {"title": "Factor multivariate stochastic volatility models of high dimension", "link": "https://arxiv.org/abs/2406.19033", "description": "arXiv:2406.19033v1 Announce Type: new \nAbstract: Building upon the pertinence of the factor decomposition to break the curse of dimensionality inherent to multivariate volatility processes, we develop a factor model-based multivariate stochastic volatility (fMSV) framework that relies on two viewpoints: sparse approximate factor model and sparse factor loading matrix. We propose a two-stage estimation procedure for the fMSV model: the first stage obtains the estimators of the factor model, and the second stage estimates the MSV part using the estimated common factor variables. We derive the asymptotic properties of the estimators. Simulated experiments are performed to assess the forecasting performances of the covariance matrices. The empirical analysis based on vectors of asset returns illustrates that the forecasting performance of the fMSV models outperforms competing conditional covariance models."}, "https://arxiv.org/abs/2406.19152": {"title": "Mixture priors for replication studies", "link": "https://arxiv.org/abs/2406.19152", "description": "arXiv:2406.19152v1 Announce Type: new \nAbstract: Replication of scientific studies is important for assessing the credibility of their results.
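To make the role of marginalization emphasized in the Bayesian-inference tutorial abstract above (arXiv:2406.18905) concrete, the sketch below evaluates a composite hypothesis by numerically integrating the likelihood over an uncertain background rate instead of plugging in a point estimate. The Poisson-count setting, Gamma prior, and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy import stats

y_obs = 12       # observed event count (hypothetical)
signal = 5.0     # signal rate under the composite hypothesis being evaluated

# Uncertain background rate b: prior knowledge encoded as a Gamma distribution.
b_grid = np.linspace(1e-3, 30.0, 2000)
prior_b = stats.gamma.pdf(b_grid, a=4.0, scale=1.5)

# Likelihood of the data for each possible background rate.
lik = stats.poisson.pmf(y_obs, mu=signal + b_grid)

# Marginal (integrated) likelihood: average the likelihood over the prior for b.
marginal = np.trapz(lik * prior_b, b_grid)

# Plug-in alternative that ignores the background uncertainty entirely.
plugin = stats.poisson.pmf(y_obs, mu=signal + 6.0)

print(f"marginal likelihood: {marginal:.4f}")
print(f"plug-in likelihood:  {plugin:.4f}")
```

The gap between the two numbers illustrates the tutorial's point: accounting for the nuisance parameter changes the strength of evidence assigned to the composite hypothesis.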
However, there is no consensus on how to quantify the extent to which a replication study replicates an original result. We propose a novel Bayesian approach based on mixture priors. The idea is to use a mixture of the posterior distribution based on the original study and a non-informative distribution as the prior for the analysis of the replication study. The mixture weight then determines the extent to which the original and replication data are pooled.\n Two distinct strategies are presented: one with fixed mixture weights, and one that introduces uncertainty by assigning a prior distribution to the mixture weight itself. Furthermore, it is shown how within this framework Bayes factors can be used for formal testing of scientific hypotheses, such as tests regarding the presence or absence of an effect. To showcase the practical application of the methodology, we analyze data from three replication studies. Our findings suggest that mixture priors are a valuable and intuitive alternative to other Bayesian methods for analyzing replication studies, such as hierarchical models and power priors. We provide the free and open source R package repmix that implements the proposed methodology."}, "https://arxiv.org/abs/2406.19157": {"title": "How to build your latent Markov model -- the role of time and space", "link": "https://arxiv.org/abs/2406.19157", "description": "arXiv:2406.19157v1 Announce Type: new \nAbstract: Statistical models that involve latent Markovian state processes have become immensely popular tools for analysing time series and other sequential data. However, the plethora of model formulations, the inconsistent use of terminology, and the various inferential approaches and software packages can be overwhelming to practitioners, especially when they are new to this area. With this review-like paper, we thus aim to provide guidance for both statisticians and practitioners working with latent Markov models by offering a unifying view on what otherwise are often considered separate model classes, from hidden Markov models over state-space models to Markov-modulated Poisson processes. In particular, we provide a roadmap for identifying a suitable latent Markov model formulation given the data to be analysed. Furthermore, we emphasise that it is key to applied work with any of these model classes to understand how recursive techniques exploiting the models' dependence structure can be used for inference. The R package LaMa adapts this unified view and provides an easy-to-use framework for very fast (C++ based) evaluation of the likelihood of any of the models discussed in this paper, allowing users to tailor a latent Markov model to their data using a Lego-type approach."}, "https://arxiv.org/abs/2406.19213": {"title": "Comparing Lasso and Adaptive Lasso in High-Dimensional Data: A Genetic Survival Analysis in Triple-Negative Breast Cancer", "link": "https://arxiv.org/abs/2406.19213", "description": "arXiv:2406.19213v1 Announce Type: new \nAbstract: This study aims to evaluate the performance of Cox regression with lasso penalty and adaptive lasso penalty in high-dimensional settings. Variable selection methods are necessary in this context to reduce dimensionality and make the problem feasible. Several weight calculation procedures for adaptive lasso are proposed to determine if they offer an improvement over lasso, as adaptive lasso addresses its inherent bias. 
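As a worked illustration of the mixture-prior idea in the replication abstract above (arXiv:2406.19152), the sketch below uses the conjugate normal case with a fixed mixture weight: the prior mixes the original study's approximate normal posterior with a vague normal component, and the replication estimate updates both the component posteriors and the mixture weight. All numerical values are hypothetical, and this is a simplified sketch rather than the repmix implementation.

```python
import numpy as np
from scipy import stats

def mixture_posterior(theta_o, se_o, theta_r, se_r, w=0.5, tau=10.0):
    """Posterior for the effect in a replication study under the two-component
    normal mixture prior w * N(theta_o, se_o^2) + (1 - w) * N(0, tau^2)."""
    components = [(theta_o, se_o), (0.0, tau)]
    post_means, post_sds, marg_liks = [], [], []
    for m0, s0 in components:
        post_var = 1.0 / (1.0 / s0**2 + 1.0 / se_r**2)
        post_means.append(post_var * (m0 / s0**2 + theta_r / se_r**2))
        post_sds.append(np.sqrt(post_var))
        # Marginal likelihood of the replication estimate under this component.
        marg_liks.append(stats.norm.pdf(theta_r, loc=m0,
                                        scale=np.sqrt(s0**2 + se_r**2)))
    weights = np.array([w, 1.0 - w]) * np.array(marg_liks)
    weights /= weights.sum()
    return weights, post_means, post_sds

# Hypothetical original and replication results (estimate, standard error).
weights, means, sds = mixture_posterior(theta_o=0.40, se_o=0.15,
                                        theta_r=0.10, se_r=0.12, w=0.5)
print("posterior mixture weights:", np.round(weights, 3))
print("component posterior means:", np.round(means, 3))
```

When the replication estimate conflicts with the original result, the updated weight on the informative component shrinks, which is exactly the pooling behaviour the abstract describes.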
These proposed weights are based on principal component analysis, ridge regression, univariate Cox regressions and random survival forest (RSF). The proposals are evaluated in simulated datasets.\n A real application of these methodologies in the context of genomic data is also carried out. The study consists of determining the variables, clinical and genetic, that influence the survival of patients with triple-negative breast cancer (TNBC), which is a type of breast cancer with low survival rates due to its aggressive nature."}, "https://arxiv.org/abs/2406.19346": {"title": "Eliciting prior information from clinical trials via calibrated Bayes factor", "link": "https://arxiv.org/abs/2406.19346", "description": "arXiv:2406.19346v1 Announce Type: new \nAbstract: In the Bayesian framework, power prior distributions are increasingly adopted in clinical trials and similar studies to incorporate external and past information, typically to inform the parameter associated with a treatment effect. Their use is particularly effective in scenarios with small sample sizes and where robust prior information is actually available. A crucial component of this methodology is represented by its weight parameter, which controls the volume of historical information incorporated into the current analysis. This parameter can be considered as either fixed or random. Although various strategies exist for its determination, eliciting the prior distribution of the weight parameter according to a full Bayesian approach remains a challenge. In general, this parameter should be carefully selected to accurately reflect the available prior information without dominating the posterior inferential conclusions. To this aim, we propose a novel method for eliciting the prior distribution of the weight parameter through a simulation-based calibrated Bayes factor procedure. This approach allows for the prior distribution to be updated based on the strength of evidence provided by the data: The goal is to facilitate the integration of historical data when it aligns with current information and to limit it when discrepancies arise, for instance in the form of prior-data conflicts. The performance of the proposed method is tested through simulation studies and applied to real data from clinical trials."}, "https://arxiv.org/abs/2406.18623": {"title": "Unbiased least squares regression via averaged stochastic gradient descent", "link": "https://arxiv.org/abs/2406.18623", "description": "arXiv:2406.18623v1 Announce Type: cross \nAbstract: We consider an on-line least squares regression problem with optimal solution $\\theta^*$ and Hessian matrix H, and study a time-average stochastic gradient descent estimator of $\\theta^*$. For $k\\ge2$, we provide an unbiased estimator of $\\theta^*$ that is a modification of the time-average estimator, runs with an expected number of time-steps of order k, with O(1/k) expected excess risk. The constant behind the O notation depends on parameters of the regression and is a poly-logarithmic function of the smallest eigenvalue of H. We provide both a biased and unbiased estimator of the expected excess risk of the time-average estimator and of its unbiased counterpart, without requiring knowledge of either H or $\\theta^*$. We describe an \"average-start\" version of our estimators with similar properties. Our approach is based on randomized multilevel Monte Carlo.
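For readers unfamiliar with the time-average estimator that the on-line least squares abstract above (arXiv:2406.18623) modifies, here is a minimal sketch of plain averaged (Polyak-Ruppert) SGD on a simulated regression problem. The multilevel Monte Carlo debiasing that is the paper's contribution is deliberately omitted, and the data, step size, and step count are illustrative assumptions.

```python
import numpy as np

def averaged_sgd(X, y, steps, lr=0.05, seed=0):
    """Run SGD on the squared loss and return both the last iterate and the
    time-average of the iterates (the Polyak-Ruppert estimator)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    running_sum = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                      # sample one observation
        grad = (X[i] @ theta - y[i]) * X[i]      # stochastic gradient of 0.5*(x'theta - y)^2
        theta -= lr * grad
        running_sum += theta
    return theta, running_sum / steps

rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(5000, 3))
y = X @ theta_star + rng.normal(scale=0.5, size=5000)

last, avg = averaged_sgd(X, y, steps=20000)
print("last iterate:   ", np.round(last, 3))
print("time average:   ", np.round(avg, 3))
print("true parameter: ", theta_star)
```

The averaged iterate is typically much less noisy than the last iterate, which is the starting point the paper's unbiased modification builds on.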
Our numerical experiments confirm our theoretical findings."}, "https://arxiv.org/abs/2406.18814": {"title": "Length Optimization in Conformal Prediction", "link": "https://arxiv.org/abs/2406.18814", "description": "arXiv:2406.18814v1 Announce Type: cross \nAbstract: Conditional validity and length efficiency are two crucial aspects of conformal prediction (CP). Achieving conditional validity ensures accurate uncertainty quantification for data subpopulations, while proper length efficiency ensures that the prediction sets remain informative and non-trivial. Despite significant efforts to address each of these issues individually, a principled framework that reconciles these two objectives has been missing in the CP literature. In this paper, we develop Conformal Prediction with Length-Optimization (CPL) - a novel framework that constructs prediction sets with (near-) optimal length while ensuring conditional validity under various classes of covariate shifts, including the key cases of marginal and group-conditional coverage. In the infinite sample regime, we provide strong duality results which indicate that CPL achieves conditional validity and length optimality. In the finite sample regime, we show that CPL constructs conditionally valid prediction sets. Our extensive empirical evaluations demonstrate the superior prediction set size performance of CPL compared to state-of-the-art methods across diverse real-world and synthetic datasets in classification, regression, and text-related settings."}, "https://arxiv.org/abs/2406.19082": {"title": "Gratia: An R package for exploring generalized additive models", "link": "https://arxiv.org/abs/2406.19082", "description": "arXiv:2406.19082v1 Announce Type: cross \nAbstract: Generalized additive models (GAMs, Hastie & Tibshirani, 1990; Wood, 2017) are an extension of the generalized linear model that allows the effects of covariates to be modelled as smooth functions. GAMs are increasingly used in many areas of science (e.g. Pedersen, Miller, Simpson, & Ross, 2019; Simpson, 2018) because the smooth functions allow nonlinear relationships between covariates and the response to be learned from the data through the use of penalized splines. Within the R (R Core Team, 2024) ecosystem, Simon Wood's mgcv package (Wood, 2017) is widely used to fit GAMs and is a Recommended package that ships with R as part of the default install. A growing number of other R packages build upon mgcv, for example as an engine to fit specialised models not handled by mgcv itself (e.g. GJMR, Marra & Radice, 2023), or to make use of the wide range of splines available in mgcv (e.g. brms, B\\\"urkner, 2017).\n The gratia package builds upon mgcv by providing functions that make working with GAMs easier. gratia takes a tidy approach (Wickham, 2014) providing ggplot2 (Wickham, 2016) replacements for mgcv's base graphics-based plots, functions for model diagnostics and exploration of fitted models, and a family of functions for drawing samples from the posterior distribution of a fitted GAM. Additional functionality is provided to facilitate the teaching and understanding of GAMs."}, "https://arxiv.org/abs/2208.07831": {"title": "Structured prior distributions for the covariance matrix in latent factor models", "link": "https://arxiv.org/abs/2208.07831", "description": "arXiv:2208.07831v3 Announce Type: replace \nAbstract: Factor models are widely used for dimension reduction in the analysis of multivariate data. 
This is achieved through decomposition of a p x p covariance matrix into the sum of two components. Through a latent factor representation, they can be interpreted as a diagonal matrix of idiosyncratic variances and a shared variation matrix, that is, the product of a p x k factor loadings matrix and its transpose. If k << p, this defines a parsimonious factorisation of the covariance matrix. Historically, little attention has been paid to incorporating prior information in Bayesian analyses using factor models where, at best, the prior for the factor loadings is order invariant. In this work, a class of structured priors is developed that can encode ideas of dependence structure about the shared variation matrix. The construction allows data-informed shrinkage towards sensible parametric structures while also facilitating inference over the number of factors. Using an unconstrained reparameterisation of stationary vector autoregressions, the methodology is extended to stationary dynamic factor models. For computational inference, parameter-expanded Markov chain Monte Carlo samplers are proposed, including an efficient adaptive Gibbs sampler. Two substantive applications showcase the scope of the methodology and its inferential benefits."}, "https://arxiv.org/abs/2305.06466": {"title": "The Bayesian Infinitesimal Jackknife for Variance", "link": "https://arxiv.org/abs/2305.06466", "description": "arXiv:2305.06466v2 Announce Type: replace \nAbstract: The frequentist variability of Bayesian posterior expectations can provide meaningful measures of uncertainty even when models are misspecified. Classical methods to asymptotically approximate the frequentist covariance of Bayesian estimators such as the Laplace approximation and the nonparametric bootstrap can be practically inconvenient, since the Laplace approximation may require an intractable integral to compute the marginal log posterior, and the bootstrap requires computing the posterior for many different bootstrap datasets. We develop and explore the infinitesimal jackknife (IJ), an alternative method for computing asymptotic frequentist covariance of smooth functionals of exchangeable data, which is based on the \"influence function\" of robust statistics. We show that the influence function for posterior expectations has the form of a simple posterior covariance, and that the IJ covariance estimate is, in turn, easily computed from a single set of posterior samples. Under conditions similar to those required for a Bayesian central limit theorem to apply, we prove that the corresponding IJ covariance estimate is asymptotically equivalent to the Laplace approximation and the bootstrap. In the presence of nuisance parameters that may not obey a central limit theorem, we argue using a von Mises expansion that the IJ covariance is inconsistent, but can remain a good approximation to the limiting frequentist variance. We demonstrate the accuracy and computational benefits of the IJ covariance estimates with simulated and real-world experiments."}, "https://arxiv.org/abs/2306.06756": {"title": "Semi-Parametric Inference for Doubly Stochastic Spatial Point Processes: An Approximate Penalized Poisson Likelihood Approach", "link": "https://arxiv.org/abs/2306.06756", "description": "arXiv:2306.06756v2 Announce Type: replace \nAbstract: Doubly-stochastic point processes model the occurrence of events over a spatial domain as an inhomogeneous Poisson process conditioned on the realization of a random intensity function. 
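The Bayesian infinitesimal jackknife abstract above (arXiv:2305.06466) states that the influence function of a posterior expectation is a posterior covariance that can be computed from a single set of posterior draws. The sketch below shows one common form of that calculation: per-observation influences are approximated by posterior covariances between the functional and the per-observation log-likelihoods, and their squared sum gives a variance estimate. The toy normal-mean model, the flat-prior shortcut for generating draws, and the exact combination rule are assumptions for illustration; the paper should be consulted for the precise estimator and conditions.

```python
import numpy as np

def ij_variance(g_draws, loglik_draws):
    """Infinitesimal-jackknife-style variance estimate for a posterior expectation.

    g_draws:      (S,) values of the functional g(theta) at S posterior draws.
    loglik_draws: (S, N) per-observation log-likelihoods at the same draws.
    """
    g_centred = g_draws - g_draws.mean()
    ll_centred = loglik_draws - loglik_draws.mean(axis=0, keepdims=True)
    influences = g_centred @ ll_centred / len(g_draws)  # (N,) posterior covariances
    return np.sum(influences ** 2)

# Toy normal model with known variance 4, flat prior on the mean, so posterior
# draws can be simulated directly without MCMC (all values hypothetical).
rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=200)
post_draws = rng.normal(loc=x.mean(), scale=2.0 / np.sqrt(len(x)), size=4000)
loglik = -0.5 * np.log(2 * np.pi * 4.0) - (x[None, :] - post_draws[:, None]) ** 2 / 8.0

print(f"IJ variance of posterior mean: {ij_variance(post_draws, loglik):.5f}")
print(f"sample variance of the mean:   {x.var(ddof=1) / len(x):.5f}")
```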
They are flexible tools for capturing spatial heterogeneity and dependence. However, existing implementations of doubly-stochastic spatial models are computationally demanding, often have limited theoretical guarantee, and/or rely on restrictive assumptions. We propose a penalized regression method for estimating covariate effects in doubly-stochastic point processes that is computationally efficient and does not require a parametric form or stationarity of the underlying intensity. Our approach is based on an approximate (discrete and deterministic) formulation of the true (continuous and stochastic) intensity function. We show that consistency and asymptotic normality of the covariate effect estimates can be achieved despite the model misspecification, and develop a covariance estimator that leads to a valid, albeit conservative, statistical inference procedure. A simulation study shows the validity of our approach under less restrictive assumptions on the data generating mechanism, and an application to Seattle crime data demonstrates better prediction accuracy compared with existing alternatives."}, "https://arxiv.org/abs/2307.00450": {"title": "Bayesian Hierarchical Modeling and Inference for Mechanistic Systems in Industrial Hygiene", "link": "https://arxiv.org/abs/2307.00450", "description": "arXiv:2307.00450v2 Announce Type: replace \nAbstract: A series of experiments in stationary and moving passenger rail cars were conducted to measure removal rates of particles in the size ranges of SARS-CoV-2 viral aerosols, and the air changes per hour provided by existing and modified air handling systems. Such methods for exposure assessments are customarily based on mechanistic models derived from physical laws of particle movement that are deterministic and do not account for measurement errors inherent in data collection. The resulting analysis compromises on reliably learning about mechanistic factors such as ventilation rates, aerosol generation rates and filtration efficiencies from field measurements. This manuscript develops a Bayesian state space modeling framework that synthesizes information from the mechanistic system as well as the field data. We derive a stochastic model from finite difference approximations of differential equations explaining particle concentrations. Our inferential framework trains the mechanistic system using the field measurements from the chamber experiments and delivers reliable estimates of the underlying physical process with fully model-based uncertainty quantification. Our application falls within the realm of Bayesian \"melding\" of mechanistic and statistical models and is of significant relevance to industrial hygienists and public health researchers working on assessment of exposure to viral aerosols in rail car fleets."}, "https://arxiv.org/abs/2307.13094": {"title": "Inference in Experiments with Matched Pairs and Imperfect Compliance", "link": "https://arxiv.org/abs/2307.13094", "description": "arXiv:2307.13094v2 Announce Type: replace \nAbstract: This paper studies inference for the local average treatment effect in randomized controlled trials with imperfect compliance where treatment status is determined according to \"matched pairs.\" By \"matched pairs,\" we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. 
Under weak assumptions governing the quality of the pairings, we first derive the limit distribution of the usual Wald (i.e., two-stage least squares) estimator of the local average treatment effect. We show further that conventional heteroskedasticity-robust estimators of the Wald estimator's limiting variance are generally conservative, in that their probability limits are (typically strictly) larger than the limiting variance. We therefore provide an alternative estimator of the limiting variance that is consistent. Finally, we consider the use of additional observed, baseline covariates not used in pairing units to increase the precision with which we can estimate the local average treatment effect. To this end, we derive the limiting behavior of a two-stage least squares estimator of the local average treatment effect which includes both the additional covariates in addition to pair fixed effects, and show that its limiting variance is always less than or equal to that of the Wald estimator. To complete our analysis, we provide a consistent estimator of this limiting variance. A simulation study confirms the practical relevance of our theoretical results. Finally, we apply our results to revisit a prominent experiment studying the effect of macroinsurance on microenterprise in Egypt."}, "https://arxiv.org/abs/2310.00803": {"title": "A Bayesian joint model for mediation analysis with matrix-valued mediators", "link": "https://arxiv.org/abs/2310.00803", "description": "arXiv:2310.00803v2 Announce Type: replace \nAbstract: Unscheduled treatment interruptions may lead to reduced quality of care in radiation therapy (RT). Identifying the RT prescription dose effects on the outcome of treatment interruptions, mediated through doses distributed into different organs-at-risk (OARs), can inform future treatment planning. The radiation exposure to OARs can be summarized by a matrix of dose-volume histograms (DVH) for each patient. Although various methods for high-dimensional mediation analysis have been proposed recently, few studies investigated how matrix-valued data can be treated as mediators. In this paper, we propose a novel Bayesian joint mediation model for high-dimensional matrix-valued mediators. In this joint model, latent features are extracted from the matrix-valued data through an adaptation of probabilistic multilinear principal components analysis (MPCA), retaining the inherent matrix structure. We derive and implement a Gibbs sampling algorithm to jointly estimate all model parameters, and introduce a Varimax rotation method to identify active indicators of mediation among the matrix-valued data. Our simulation study finds that the proposed joint model has higher efficiency in estimating causal decomposition effects compared to an alternative two-step method, and demonstrates that the mediation effects can be identified and visualized in the matrix form. We apply the method to study the effect of prescription dose on treatment interruptions in anal canal cancer patients."}, "https://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "https://arxiv.org/abs/2310.03521", "description": "arXiv:2310.03521v2 Announce Type: replace \nAbstract: In copula models the marginal distributions and copula function are specified separately. We treat these as two modules in a modular Bayesian inference framework, and propose conducting modified Bayesian inference by \"cutting feedback\". 
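For the matched-pairs experiment abstract above (arXiv:2307.13094), the central point estimator is the usual Wald (two-stage least squares with a single instrument) estimator of the local average treatment effect. Here is a minimal sketch of that ratio estimator on simulated data with imperfect compliance; the pair structure, covariate adjustment, and the paper's variance corrections are omitted, and the simulated design is an illustrative assumption.

```python
import numpy as np

def wald_late(assigned, treated, outcome):
    """Wald estimator of the LATE: the intent-to-treat effect on the outcome
    divided by the effect of assignment on treatment take-up."""
    assigned = np.asarray(assigned, dtype=bool)
    itt_outcome = outcome[assigned].mean() - outcome[~assigned].mean()
    itt_takeup = treated[assigned].mean() - treated[~assigned].mean()
    return itt_outcome / itt_takeup

# Hypothetical experiment: one unit per pair assigned to treatment, 70% take-up.
rng = np.random.default_rng(6)
n_pairs = 500
assigned = np.tile([1, 0], n_pairs)
treated = assigned * rng.binomial(1, 0.7, size=2 * n_pairs)      # no always-takers
outcome = 1.0 + 2.0 * treated + rng.normal(size=2 * n_pairs)     # true LATE = 2
print(f"Wald estimate of the LATE: {wald_late(assigned, treated, outcome):.3f}")
```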
Cutting feedback limits the influence of potentially misspecified modules in posterior inference. We consider two types of cuts. The first limits the influence of a misspecified copula on inference for the marginals, which is a Bayesian analogue of the popular Inference for Margins (IFM) estimator. The second limits the influence of misspecified marginals on inference for the copula parameters by using a pseudo likelihood of the ranks to define the cut model. We establish that if only one of the modules is misspecified, then the appropriate cut posterior gives accurate uncertainty quantification asymptotically for the parameters in the other module. Computation of the cut posteriors is difficult, and new variational inference methods to do so are proposed. The efficacy of the new methodology is demonstrated using both simulated data and a substantive multivariate time series application from macroeconomic forecasting. In the latter, cutting feedback from misspecified marginals to a 1096 dimension copula improves posterior inference and predictive accuracy greatly, compared to conventional Bayesian inference."}, "https://arxiv.org/abs/2311.08340": {"title": "Causal Message Passing: A Method for Experiments with Unknown and General Network Interference", "link": "https://arxiv.org/abs/2311.08340", "description": "arXiv:2311.08340v2 Announce Type: replace \nAbstract: Randomized experiments are a powerful methodology for data-driven evaluation of decisions or interventions. Yet, their validity may be undermined by network interference. This occurs when the treatment of one unit impacts not only its outcome but also that of connected units, biasing traditional treatment effect estimations. Our study introduces a new framework to accommodate complex and unknown network interference, moving beyond specialized models in the existing literature. Our framework, termed causal message-passing, is grounded in high-dimensional approximate message passing methodology. It is tailored for multi-period experiments and is particularly effective in settings with many units and prevalent network interference. The framework models causal effects as a dynamic process where a treated unit's impact propagates through the network via neighboring units until equilibrium is reached. This approach allows us to approximate the dynamics of potential outcomes over time, enabling the extraction of valuable information before treatment effects reach equilibrium. Utilizing causal message-passing, we introduce a practical algorithm to estimate the total treatment effect, defined as the impact observed when all units are treated compared to the scenario where no unit receives treatment. We demonstrate the effectiveness of this approach across five numerical scenarios, each characterized by a distinct interference structure."}, "https://arxiv.org/abs/2312.10695": {"title": "Nonparametric Strategy Test", "link": "https://arxiv.org/abs/2312.10695", "description": "arXiv:2312.10695v3 Announce Type: replace \nAbstract: We present a nonparametric statistical test for determining whether an agent is following a given mixed strategy in a repeated strategic-form game given samples of the agent's play. This involves two components: determining whether the agent's frequencies of pure strategies are sufficiently close to the target frequencies, and determining whether the pure strategies selected are independent between different game iterations. 
Our integrated test involves applying a chi-squared goodness of fit test for the first component and a generalized Wald-Wolfowitz runs test for the second component. The results from both tests are combined using Bonferroni correction to produce a complete test for a given significance level $\\alpha.$ We applied the test to publicly available data of human rock-paper-scissors play. The data consists of 50 iterations of play for 500 human players. We test with a null hypothesis that the players are following a uniform random strategy independently at each game iteration. Using a significance level of $\\alpha = 0.05$, we conclude that 305 (61%) of the subjects are following the target strategy."}, "https://arxiv.org/abs/2312.15205": {"title": "X-Vine Models for Multivariate Extremes", "link": "https://arxiv.org/abs/2312.15205", "description": "arXiv:2312.15205v2 Announce Type: replace \nAbstract: Regular vine sequences permit the organisation of variables in a random vector along a sequence of trees. Regular vine models have become greatly popular in dependence modelling as a way to combine arbitrary bivariate copulas into higher-dimensional ones, offering flexibility, parsimony, and tractability. In this project, we use regular vine structures to decompose and construct the exponent measure density of a multivariate extreme value distribution, or, equivalently, the tail copula density. Although these densities pose theoretical challenges due to their infinite mass, their homogeneity property offers simplifications. The theory sheds new light on existing parametric families and facilitates the construction of new ones, called X-vines. Computations proceed via recursive formulas in terms of bivariate model components. We develop simulation algorithms for X-vine multivariate Pareto distributions as well as methods for parameter estimation and model selection on the basis of threshold exceedances. The methods are illustrated by Monte Carlo experiments and a case study on US flight delay data."}, "https://arxiv.org/abs/2311.10263": {"title": "Stable Differentiable Causal Discovery", "link": "https://arxiv.org/abs/2311.10263", "description": "arXiv:2311.10263v2 Announce Type: replace-cross \nAbstract: Inferring causal relationships as directed acyclic graphs (DAGs) is an important but challenging problem. Differentiable Causal Discovery (DCD) is a promising approach to this problem, framing the search as a continuous optimization. But existing DCD methods are numerically unstable, with poor performance beyond tens of variables. In this paper, we propose Stable Differentiable Causal Discovery (SDCD), a new method that improves previous DCD methods in two ways: (1) It employs an alternative constraint for acyclicity; this constraint is more stable, both theoretically and empirically, and fast to compute. (2) It uses a training procedure tailored for sparse causal graphs, which are common in real-world scenarios. We first derive SDCD and prove its stability and correctness. We then evaluate it with both observational and interventional data and on both small-scale and large-scale settings. We find that SDCD outperforms existing methods in both convergence speed and accuracy and can scale to thousands of variables. 
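The sketch below illustrates the two ingredients of the nonparametric strategy test summarized above (arXiv:2312.10695) for the special case of a two-action game: a chi-squared goodness-of-fit test for the action frequencies and a runs test for serial independence, combined with a Bonferroni correction. The classical binary Wald-Wolfowitz runs test is used here in place of the generalized version the paper applies to more than two pure strategies, and the simulated play data are hypothetical.

```python
import numpy as np
from scipy import stats

def strategy_test(actions, target_probs, alpha=0.05):
    """Test H0: play follows target_probs i.i.d. across game iterations.

    actions:      1-D integer array of observed pure strategies (0 or 1 here).
    target_probs: hypothesized mixed strategy, e.g. [0.5, 0.5].
    """
    n = len(actions)
    # Component 1: chi-squared goodness of fit for the action frequencies.
    observed = np.bincount(actions, minlength=len(target_probs))
    expected = n * np.asarray(target_probs)
    p_gof = stats.chisquare(observed, expected).pvalue

    # Component 2: binary Wald-Wolfowitz runs test for independence over time.
    n1, n2 = observed[0], observed[1]
    runs = 1 + np.sum(actions[1:] != actions[:-1])
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1))
    z = (runs - mu) / np.sqrt(var)
    p_runs = 2 * stats.norm.sf(abs(z))

    # Bonferroni combination: reject if either component p-value < alpha / 2.
    reject = min(p_gof, p_runs) < alpha / 2
    return p_gof, p_runs, reject

rng = np.random.default_rng(3)
play = rng.integers(0, 2, size=50)            # hypothetical 50 iterations of play
print(strategy_test(play, target_probs=[0.5, 0.5]))
```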
We provide code at https://github.com/azizilab/sdcd."}, "https://arxiv.org/abs/2406.19432": {"title": "Estimation of Shannon differential entropy: An extensive comparative review", "link": "https://arxiv.org/abs/2406.19432", "description": "arXiv:2406.19432v1 Announce Type: new \nAbstract: In this research work, a total of 45 different estimators of the Shannon differential entropy were reviewed. The estimators were mainly based on three classes, namely: window size spacings, kernel density estimation (KDE) and k-nearest neighbour (kNN) estimation. A total of 16, 5 and 6 estimators were selected from each of the classes, respectively, for comparison. The performances of the 27 selected estimators, in terms of their bias values and root mean squared errors (RMSEs) as well as their asymptotic behaviours, were compared through extensive Monte Carlo simulations. The empirical comparisons were carried out at different sample sizes of 10, 50, and 100 and different variable dimensions of 1, 2, 3, and 5, for three groups of continuous distributions according to their symmetry and support. The results showed that the spacings-based estimators generally performed better than the estimators from the other two classes at the univariate level, but suffered from non-existence at the multivariate level. The kNN-based estimators were generally inferior to the estimators from the other two classes considered, but had the advantage of existing in all dimensions. Also, a new class of optimal window size was obtained and sets of estimators were recommended for different groups of distributions at different variable dimensions. Finally, the asymptotic biases, variances and distributions of the 'best estimators' were considered."}, "https://arxiv.org/abs/2406.19503": {"title": "Improving Finite Sample Performance of Causal Discovery by Exploiting Temporal Structure", "link": "https://arxiv.org/abs/2406.19503", "description": "arXiv:2406.19503v1 Announce Type: new \nAbstract: Methods of causal discovery aim to identify causal structures in a data driven way. Existing algorithms are known to be unstable and sensitive to statistical errors, and are therefore rarely used with biomedical or epidemiological data. We present an algorithm that efficiently exploits temporal structure, so-called tiered background knowledge, for estimating causal structures. Tiered background knowledge is readily available from, e.g., cohort or registry data. When used efficiently, it renders the algorithm more robust to statistical errors and ultimately increases accuracy in finite samples. We describe the algorithm and illustrate how it proceeds. Moreover, we offer formal proofs as well as examples of desirable properties of the algorithm, which we demonstrate empirically in an extensive simulation study. To illustrate its usefulness in practice, we apply the algorithm to data from a children's cohort study investigating the interplay of diet, physical activity and other lifestyle factors for health outcomes."}, "https://arxiv.org/abs/2406.19535": {"title": "Modeling trajectories using functional linear differential equations", "link": "https://arxiv.org/abs/2406.19535", "description": "arXiv:2406.19535v1 Announce Type: new \nAbstract: We are motivated by a study that seeks to better understand the dynamic relationship between muscle activation and paw position during locomotion.
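As a concrete example of the spacings-based class that the entropy review above (arXiv:2406.19432) finds strongest in the univariate comparisons, here is a minimal sketch of Vasicek's m-spacing estimator of Shannon differential entropy. The default window size, the tie guard, and the standard-normal test case are illustrative assumptions rather than choices from the review.

```python
import numpy as np

def vasicek_entropy(x, m=None):
    """Vasicek's m-spacing estimator of Shannon differential entropy (in nats)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    if m is None:
        m = max(1, int(round(np.sqrt(n) / 2)))   # a common heuristic window size
    # Boundary convention: x_(j) = x_(1) for j < 1 and x_(j) = x_(n) for j > n.
    lower = x[np.maximum(np.arange(n) - m, 0)]
    upper = x[np.minimum(np.arange(n) + m, n - 1)]
    spacings = np.maximum(upper - lower, 1e-12)  # guard against tied observations
    return np.mean(np.log(n / (2 * m) * spacings))

rng = np.random.default_rng(4)
sample = rng.normal(size=1000)
true_entropy = 0.5 * np.log(2 * np.pi * np.e)    # entropy of a standard normal
print(f"estimate: {vasicek_entropy(sample):.3f}  truth: {true_entropy:.3f}")
```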
For each gait cycle in this experiment, activation in the biceps and triceps is measured continuously and in parallel with paw position as a mouse trotted on a treadmill. We propose an innovative general regression method that draws from both ordinary differential equations and functional data analysis to model the relationship between these functional inputs and responses as a dynamical system that evolves over time. Specifically, our model addresses gaps in both literatures and borrows strength across curves estimating ODE parameters across all curves simultaneously rather than separately modeling each functional observation. Our approach compares favorably to related functional data methods in simulations and in cross-validated predictive accuracy of paw position in the gait data. In the analysis of the gait cycles, we find that paw speed and position are dynamically influenced by inputs from the biceps and triceps muscles, and that the effect of muscle activation persists beyond the activation itself."}, "https://arxiv.org/abs/2406.19550": {"title": "Provably Efficient Posterior Sampling for Sparse Linear Regression via Measure Decomposition", "link": "https://arxiv.org/abs/2406.19550", "description": "arXiv:2406.19550v1 Announce Type: new \nAbstract: We consider the problem of sampling from the posterior distribution of a $d$-dimensional coefficient vector $\\boldsymbol{\\theta}$, given linear observations $\\boldsymbol{y} = \\boldsymbol{X}\\boldsymbol{\\theta}+\\boldsymbol{\\varepsilon}$. In general, such posteriors are multimodal, and therefore challenging to sample from. This observation has prompted the exploration of various heuristics that aim at approximating the posterior distribution.\n In this paper, we study a different approach based on decomposing the posterior distribution into a log-concave mixture of simple product measures. This decomposition allows us to reduce sampling from a multimodal distribution of interest to sampling from a log-concave one, which is tractable and has been investigated in detail. We prove that, under mild conditions on the prior, for random designs, such measure decomposition is generally feasible when the number of samples per parameter $n/d$ exceeds a constant threshold. We thus obtain a provably efficient (polynomial time) sampling algorithm in a regime where this was previously not known. Numerical simulations confirm that the algorithm is practical, and reveal that it has attractive statistical properties compared to state-of-the-art methods."}, "https://arxiv.org/abs/2406.19563": {"title": "Bayesian Rank-Clustering", "link": "https://arxiv.org/abs/2406.19563", "description": "arXiv:2406.19563v1 Announce Type: new \nAbstract: In a traditional analysis of ordinal comparison data, the goal is to infer an overall ranking of objects from best to worst with each object having a unique rank. However, the ranks of some objects may not be statistically distinguishable. This could happen due to insufficient data or to the true underlying abilities or qualities being equal for some objects. In such cases, practitioners may prefer an overall ranking where groups of objects are allowed to have equal ranks or to be $\\textit{rank-clustered}$. Existing models related to rank-clustering are limited by their inability to handle a variety of ordinal data types, to quantify uncertainty, or by the need to pre-specify the number and size of potential rank-clusters. 
We solve these limitations through the proposed Bayesian $\\textit{Rank-Clustered Bradley-Terry-Luce}$ model. We allow for rank-clustering via parameter fusion by imposing a novel spike-and-slab prior on object-specific worth parameters in Bradley-Terry-Luce family of distributions for ordinal comparisons. We demonstrate the model on simulated and real datasets in survey analysis, elections, and sports."}, "https://arxiv.org/abs/2406.19597": {"title": "What's the Weight? Estimating Controlled Outcome Differences in Complex Surveys for Health Disparities Research", "link": "https://arxiv.org/abs/2406.19597", "description": "arXiv:2406.19597v1 Announce Type: new \nAbstract: A basic descriptive question in statistics often asks whether there are differences in mean outcomes between groups based on levels of a discrete covariate (e.g., racial disparities in health outcomes). However, when this categorical covariate of interest is correlated with other factors related to the outcome, direct comparisons may lead to biased estimates and invalid inferential conclusions without appropriate adjustment. Propensity score methods are broadly employed with observational data as a tool to achieve covariate balance, but how to implement them in complex surveys is less studied - in particular, when the survey weights depend on the group variable under comparison. In this work, we focus on a specific example when sample selection depends on race. We propose identification formulas to properly estimate the average controlled difference (ACD) in outcomes between Black and White individuals, with appropriate weighting for covariate imbalance across the two racial groups and generalizability. Via extensive simulation, we show that our proposed methods outperform traditional analytic approaches in terms of bias, mean squared error, and coverage. We are motivated by the interplay between race and social determinants of health when estimating racial differences in telomere length using data from the National Health and Nutrition Examination Survey. We build a propensity for race to properly adjust for other social determinants while characterizing the controlled effect of race on telomere length. We find that evidence of racial differences in telomere length between Black and White individuals attenuates after accounting for confounding by socioeconomic factors and after utilizing appropriate propensity score and survey weighting techniques. Software to implement these methods can be found in the R package svycdiff at https://github.com/salernos/svycdiff."}, "https://arxiv.org/abs/2406.19604": {"title": "Geodesic Causal Inference", "link": "https://arxiv.org/abs/2406.19604", "description": "arXiv:2406.19604v1 Announce Type: new \nAbstract: Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for scalar outcomes and recently for distributional outcomes. We introduce here a general framework for causal inference when outcomes reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the characterization of treatment effects through the concept of the geodesic average treatment effect. 
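A minimal sketch of the propensity-score weighting idea behind the average controlled difference discussed above (arXiv:2406.19597): fit a propensity model for group membership given confounders, then compare weighted outcome means. The survey-weight adjustments that are the focus of that paper are omitted, and the simulated data, variable names, and logistic model are illustrative assumptions (this is not the svycdiff implementation).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 5000
ses = rng.normal(size=n)                                   # hypothetical confounder
group = rng.binomial(1, 1 / (1 + np.exp(-ses)))            # group depends on confounder
outcome = 2.0 + 0.5 * ses + rng.normal(scale=1.0, size=n)  # no true group effect

# Step 1: propensity model for group membership given the confounder.
ps = LogisticRegression().fit(ses.reshape(-1, 1), group).predict_proba(ses.reshape(-1, 1))[:, 1]

# Step 2: inverse-probability weights balance the confounder across groups.
w = np.where(group == 1, 1 / ps, 1 / (1 - ps))

naive = outcome[group == 1].mean() - outcome[group == 0].mean()
weighted = (np.average(outcome[group == 1], weights=w[group == 1])
            - np.average(outcome[group == 0], weights=w[group == 0]))
print(f"naive difference:    {naive:+.3f}")
print(f"weighted difference: {weighted:+.3f}   (true controlled difference is 0)")
```

The naive comparison is biased away from zero because the confounder drives both group membership and the outcome, while the weighted comparison recovers the controlled difference, which is the basic mechanism the paper extends to complex survey designs.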
Using ideas from Fr\\'echet regression, we develop estimation methods of the geodesic average treatment effect and derive consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Our methodology is illustrated by a simulation study and real data examples for compositional outcomes of U.S. statewise energy source data to study the effect of coal mining, network data of New York taxi trips, where the effect of the COVID-19 pandemic is of interest, and brain functional connectivity network data to study the effect of Alzheimer's disease."}, "https://arxiv.org/abs/2406.19673": {"title": "Extended sample size calculations for evaluation of prediction models using a threshold for classification", "link": "https://arxiv.org/abs/2406.19673", "description": "arXiv:2406.19673v1 Announce Type: new \nAbstract: When evaluating the performance of a model for individualised risk prediction, the sample size needs to be large enough to precisely estimate the performance measures of interest. Current sample size guidance is based on precisely estimating calibration, discrimination, and net benefit, which should be the first stage of calculating the minimum required sample size. However, when a clinically important threshold is used for classification, other performance measures can also be used. We extend the previously published guidance to precisely estimate threshold-based performance measures. We have developed closed-form solutions to estimate the sample size required to target sufficiently precise estimates of accuracy, specificity, sensitivity, PPV, NPV, and F1-score in an external evaluation study of a prediction model with a binary outcome. This approach requires the user to pre-specify the target standard error and the expected value for each performance measure. We describe how the sample size formulae were derived and demonstrate their use in an example. Extension to time-to-event outcomes is also considered. In our examples, the minimum sample size required was lower than that required to precisely estimate the calibration slope, and we expect this would most often be the case. Our formulae, along with corresponding Python code and updated R and Stata commands (pmvalsampsize), enable researchers to calculate the minimum sample size needed to precisely estimate threshold-based performance measures in an external evaluation study. These criteria should be used alongside previously published criteria to precisely estimate the calibration, discrimination, and net-benefit."}, "https://arxiv.org/abs/2406.19691": {"title": "Optimal subsampling for functional composite quantile regression in massive data", "link": "https://arxiv.org/abs/2406.19691", "description": "arXiv:2406.19691v1 Announce Type: new \nAbstract: As computer resources become increasingly limited, traditional statistical methods face challenges in analyzing massive data, especially in functional data analysis. To address this issue, subsampling offers a viable solution by significantly reducing computational requirements. This paper introduces a subsampling technique for composite quantile regression, designed for efficient application within the functional linear model on large datasets. We establish the asymptotic distribution of the subsampling estimator and introduce an optimal subsampling method based on the functional L-optimality criterion. 
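To make the precision-targeting logic of the sample-size abstract above (arXiv:2406.19673) concrete, the sketch below applies the generic binomial standard-error formula to sensitivity, specificity, and accuracy given anticipated values, a target standard error, and an assumed outcome prevalence. These are the standard closed-form precision calculations rather than a transcription of the paper's (or pmvalsampsize's) exact criteria, and all numbers are hypothetical.

```python
import math

def n_for_proportion(p, target_se):
    """Subjects needed so that a proportion p is estimated with the target SE."""
    return math.ceil(p * (1 - p) / target_se**2)

# Anticipated performance at the chosen classification threshold and the
# outcome prevalence in the evaluation population (all hypothetical).
sensitivity, specificity, accuracy, prevalence = 0.80, 0.70, 0.75, 0.20
target_se = 0.025

n_events = n_for_proportion(sensitivity, target_se)        # subjects with the outcome
n_nonevents = n_for_proportion(specificity, target_se)     # subjects without the outcome
n_total_candidates = [
    math.ceil(n_events / prevalence),            # driven by sensitivity precision
    math.ceil(n_nonevents / (1 - prevalence)),   # driven by specificity precision
    n_for_proportion(accuracy, target_se),       # driven by overall accuracy precision
]
print("minimum total sample size:", max(n_total_candidates))
```

Taking the maximum across the criteria mirrors the abstract's recommendation to satisfy all targeted performance measures simultaneously, alongside the previously published calibration and discrimination criteria.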
Results from simulation studies and the real data analysis consistently demonstrate the superiority of the L-optimality criterion-based optimal subsampling method over the uniform subsampling approach."}, "https://arxiv.org/abs/2406.19702": {"title": "Vector AutoRegressive Moving Average Models: A Review", "link": "https://arxiv.org/abs/2406.19702", "description": "arXiv:2406.19702v1 Announce Type: new \nAbstract: Vector AutoRegressive Moving Average (VARMA) models form a powerful and general model class for analyzing dynamics among multiple time series. While VARMA models encompass the Vector AutoRegressive (VAR) models, their popularity in empirical applications is dominated by the latter. Can this phenomenon be explained fully by the simplicity of VAR models? Perhaps many users of VAR models have not fully appreciated what VARMA models can provide. The goal of this review is to provide a comprehensive resource for researchers and practitioners seeking insights into the advantages and capabilities of VARMA models. We start by reviewing the identification challenges inherent to VARMA models thereby encompassing classical and modern identification schemes and we continue along the same lines regarding estimation, specification and diagnosis of VARMA models. We then highlight the practical utility of VARMA models in terms of Granger Causality analysis, forecasting and structural analysis as well as recent advances and extensions of VARMA models to further facilitate their adoption in practice. Finally, we discuss some interesting future research directions where VARMA models can fulfill their potentials in applications as compared to their subclass of VAR models."}, "https://arxiv.org/abs/2406.19716": {"title": "Functional Time Transformation Model with Applications to Digital Health", "link": "https://arxiv.org/abs/2406.19716", "description": "arXiv:2406.19716v1 Announce Type: new \nAbstract: The advent of wearable and sensor technologies now leads to functional predictors which are intrinsically infinite dimensional. While the existing approaches for functional data and survival outcomes lean on the well-established Cox model, the proportional hazard (PH) assumption might not always be suitable in real-world applications. Motivated by physiological signals encountered in digital medicine, we develop a more general and flexible functional time-transformation model for estimating the conditional survival function with both functional and scalar covariates. A partially functional regression model is used to directly model the survival time on the covariates through an unknown monotone transformation and a known error distribution. We use Bernstein polynomials to model the monotone transformation function and the smooth functional coefficients. A sieve method of maximum likelihood is employed for estimation. Numerical simulations illustrate a satisfactory performance of the proposed method in estimation and inference. We demonstrate the application of the proposed model through two case studies involving wearable data i) Understanding the association between diurnal physical activity pattern and all-cause mortality based on accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014 and ii) Modelling Time-to-Hypoglycemia events in a cohort of diabetic patients based on distributional representation of continuous glucose monitoring (CGM) data. 
The results provide important epidemiological insights into the direct association between survival times and the physiological signals and also exhibit superior predictive performance compared to traditional summary based biomarkers in the CGM study."}, "https://arxiv.org/abs/2406.19722": {"title": "Exact Bayesian Gaussian Cox Processes Using Random Integral", "link": "https://arxiv.org/abs/2406.19722", "description": "arXiv:2406.19722v1 Announce Type: new \nAbstract: A Gaussian Cox process is a popular model for point process data, in which the intensity function is a transformation of a Gaussian process. Posterior inference of this intensity function involves an intractable integral (i.e., the cumulative intensity function) in the likelihood, resulting in a doubly intractable posterior distribution. Here, we propose a nonparametric Bayesian approach for estimating the intensity function of an inhomogeneous Poisson process without reliance on large data augmentation or approximations of the likelihood function. We propose to jointly model the intensity and the cumulative intensity function as a transformed Gaussian process, allowing us to directly bypass the need to approximate the cumulative intensity function in the likelihood. We propose an exact MCMC sampler for posterior inference and evaluate its performance on simulated data. We demonstrate the utility of our method in three real-world scenarios including temporal and spatial event data, as well as aggregated time count data collected at multiple resolutions. Finally, we discuss extensions of our proposed method to other point processes."}, "https://arxiv.org/abs/2406.19778": {"title": "A multiscale Bayesian nonparametric framework for partial hierarchical clustering", "link": "https://arxiv.org/abs/2406.19778", "description": "arXiv:2406.19778v1 Announce Type: new \nAbstract: In recent years, there has been a growing demand to discern clusters of subjects in datasets characterized by a large set of features. Often, these clusters may be highly variable in size and present partial hierarchical structures. In this context, model-based clustering approaches with nonparametric priors are gaining attention in the literature due to their flexibility and adaptability to new data. However, current approaches still face challenges in recognizing hierarchical cluster structures and in managing tiny clusters or singletons. To address these limitations, we propose a novel infinite mixture model with kernels organized within a multiscale structure. Leveraging a careful specification of the kernel parameters, our method allows the inclusion of additional information guiding possible hierarchies among clusters while maintaining flexibility. We provide theoretical support and an elegant, parsimonious formulation based on infinite factorization that allows efficient inference via Gibbs sampler."}, "https://arxiv.org/abs/2406.19887": {"title": "Confidence intervals for tree-structured varying coefficients", "link": "https://arxiv.org/abs/2406.19887", "description": "arXiv:2406.19887v1 Announce Type: new \nAbstract: The tree-structured varying coefficient model (TSVC) is a flexible regression approach that allows the effects of covariates to vary with the values of the effect modifiers. Relevant effect modifiers are identified inherently using recursive partitioning techniques. To quantify uncertainty in TSVC models, we propose a procedure to construct confidence intervals of the estimated partition-specific coefficients. 
This task constitutes a selective inference problem as the coefficients of a TSVC model result from data-driven model building. To account for this issue, we introduce a parametric bootstrap approach, which is tailored to the complex structure of TSVC. Finite sample properties, particularly coverage proportions, of the proposed confidence intervals are evaluated in a simulation study. For illustration, we consider applications to data from COVID-19 patients and from patients suffering from acute odontogenic infection. The proposed approach may also be adapted for constructing confidence intervals for other tree-based methods."}, "https://arxiv.org/abs/2406.19903": {"title": "Joint estimation of insurance loss development factors using Bayesian hidden Markov models", "link": "https://arxiv.org/abs/2406.19903", "description": "arXiv:2406.19903v1 Announce Type: new \nAbstract: Loss development modelling is the actuarial practice of predicting the total 'ultimate' losses incurred on a set of policies once all claims are reported and settled. This poses a challenging prediction task as losses frequently take years to fully emerge from reported claims, and not all claims might yet be reported. Loss development models frequently estimate a set of 'link ratios' from insurance loss triangles, which are multiplicative factors transforming losses at one time point to ultimate. However, link ratios estimated using classical methods typically underestimate ultimate losses and cannot be extrapolated outside the domains of the triangle, requiring extension by 'tail factors' from another model. Although flexible, this two-step process relies on subjective decision points that might bias inference. Methods that jointly estimate 'body' link ratios and smooth tail factors offer an attractive alternative. This paper proposes a novel application of Bayesian hidden Markov models to loss development modelling, where discrete, latent states representing body and tail processes are automatically learned from the data. The hidden Markov development model is found to perform comparably to, and frequently better than, the two-step approach on numerical examples and industry datasets."}, "https://arxiv.org/abs/2406.19936": {"title": "Deep Learning of Multivariate Extremes via a Geometric Representation", "link": "https://arxiv.org/abs/2406.19936", "description": "arXiv:2406.19936v1 Announce Type: new \nAbstract: The study of geometric extremes, where extremal dependence properties are inferred from the deterministic limiting shapes of scaled sample clouds, provides an exciting approach to modelling the extremes of multivariate data. These shapes, termed limit sets, link together several popular extremal dependence modelling frameworks. Although the geometric approach is becoming an increasingly popular modelling tool, current inference techniques are limited to a low dimensional setting (d < 4), and generally require rigid modelling assumptions. In this work, we propose a range of novel theoretical results to aid with the implementation of the geometric extremes framework and introduce the first approach to modelling limit sets using deep learning. By leveraging neural networks, we construct asymptotically-justified yet flexible semi-parametric models for extremal dependence of high-dimensional data. 
We showcase the efficacy of our deep approach by modelling the complex extremal dependencies between meteorological and oceanographic variables in the North Sea off the coast of the UK."}, "https://arxiv.org/abs/2406.19940": {"title": "Closed-Form Power and Sample Size Calculations for Bayes Factors", "link": "https://arxiv.org/abs/2406.19940", "description": "arXiv:2406.19940v1 Announce Type: new \nAbstract: Determining an appropriate sample size is a critical element of study design, and the method used to determine it should be consistent with the planned analysis. When the planned analysis involves Bayes factor hypothesis testing, the sample size is usually desired to ensure a sufficiently high probability of obtaining a Bayes factor indicating compelling evidence for a hypothesis, given that the hypothesis is true. In practice, Bayes factor sample size determination is typically performed using computationally intensive Monte Carlo simulation. Here, we summarize alternative approaches that enable sample size determination without simulation. We show how, under approximate normality assumptions, sample sizes can be determined numerically, and provide the R package bfpwr for this purpose. Additionally, we identify conditions under which sample sizes can even be determined in closed-form, resulting in novel, easy-to-use formulas that also help foster intuition, enable asymptotic analysis, and can also be used for hybrid Bayesian/likelihoodist design. Furthermore, we show how in our framework power and sample size can be computed without simulation for more complex analysis priors, such as Jeffreys-Zellner-Siow priors or nonlocal normal moment priors. Case studies from medicine and psychology illustrate how researchers can use our methods to design informative yet cost-efficient studies."}, "https://arxiv.org/abs/2406.19956": {"title": "Three Scores and 15 Years (1948-2023) of Rao's Score Test: A Brief History", "link": "https://arxiv.org/abs/2406.19956", "description": "arXiv:2406.19956v1 Announce Type: new \nAbstract: Rao (1948) introduced the score test statistic as an alternative to the likelihood ratio and Wald test statistics. In spite of the optimality properties of the score statistic shown in Rao and Poti (1946), the Rao score (RS) test remained unnoticed for almost 20 years. Today, the RS test is part of the ``Holy Trinity'' of hypothesis testing and has found its place in the Statistics and Econometrics textbooks and related software. Reviewing the history of the RS test, we note that remarkable test statistics proposed in the literature earlier or around the time of Rao (1948) mostly from intuition, such as the Pearson (1900) goodness-of-fit test, the Moran (1948) I test for spatial dependence and the Durbin and Watson (1950) test for serial correlation, can be given an RS test statistic interpretation. At the same time, recent developments in robust hypothesis testing under certain forms of misspecification make the RS test an active area of research in Statistics and Econometrics. From our brief account of the history of the RS test we conclude that its impact in science goes far beyond its calendar starting point with promising future research activities for many years to come."}, "https://arxiv.org/abs/2406.19965": {"title": "Futility analyses for the MCP-Mod methodology based on longitudinal models", "link": "https://arxiv.org/abs/2406.19965", "description": "arXiv:2406.19965v1 Announce Type: new \nAbstract: This article discusses futility analyses for the MCP-Mod methodology. 
Formulas are derived for calculating predictive and conditional power for MCP-Mod, which also cover the case when longitudinal models are used, allowing incomplete data from patients to be utilized at interim. A simulation study is conducted to evaluate the repeated sampling properties of the proposed decision rules and to assess the benefit of using a longitudinal versus a completer-only model for decision making at interim. The results suggest that the proposed methods perform adequately and a longitudinal analysis outperforms a completer-only analysis, particularly when the recruitment speed is higher and the correlation over time is larger. The proposed methodology is illustrated using real data from a dose-finding study for severe uncontrolled asthma."}, "https://arxiv.org/abs/2406.19986": {"title": "Instrumental Variable Estimation of Distributional Causal Effects", "link": "https://arxiv.org/abs/2406.19986", "description": "arXiv:2406.19986v1 Announce Type: new \nAbstract: Estimating the causal effect of a treatment on the entire response distribution is an important yet challenging task. For instance, one might be interested in how a pension plan affects not only the average savings among all individuals but also how it affects the entire savings distribution. While sufficiently large randomized studies can be used to estimate such distributional causal effects, they are often either not feasible in practice or involve non-compliance. A well-established class of methods for estimating average causal effects from either observational studies with unmeasured confounding or randomized studies with non-compliance are instrumental variable (IV) methods. In this work, we develop an IV-based approach for identifying and estimating distributional causal effects. We introduce a distributional IV model with corresponding assumptions, which leads to a novel identification result for the interventional cumulative distribution function (CDF) under a binary treatment. We then use this identification to construct a nonparametric estimator, called DIVE, for estimating the interventional CDFs under both treatments. We empirically assess the performance of DIVE in a simulation experiment and illustrate the usefulness of distributional causal effects on two real-data applications."}, "https://arxiv.org/abs/2406.19989": {"title": "A Closed-Form Solution to the 2-Sample Problem for Quantifying Changes in Gene Expression using Bayes Factors", "link": "https://arxiv.org/abs/2406.19989", "description": "arXiv:2406.19989v1 Announce Type: new \nAbstract: Sequencing technologies have revolutionised the field of molecular biology. We now have the ability to routinely capture the complete RNA profile in tissue samples. This wealth of data allows for comparative analyses of RNA levels at different times, shedding light on the dynamics of developmental processes, and under different environmental responses, providing insights into gene expression regulation and stress responses. However, given the inherent variability of the data stemming from biological and technological sources, quantifying changes in gene expression proves to be a statistical challenge. Here, we present a closed-form Bayesian solution to this problem. Our approach is tailored to the differential gene expression analysis of processed RNA-Seq data. 
The framework unifies and streamlines an otherwise complex analysis, typically involving parameter estimations and multiple statistical tests, into a concise mathematical equation for the calculation of Bayes factors. Using conjugate priors we can solve the equations analytically. For each gene, we calculate a Bayes factor, which can be used for ranking genes according to the statistical evidence for the gene's expression change given RNA-Seq data. The presented closed-form solution is derived under minimal assumptions and may be applied to a variety of other 2-sample problems."}, "https://arxiv.org/abs/2406.19412": {"title": "Dynamically Consistent Analysis of Realized Covariations in Term Structure Models", "link": "https://arxiv.org/abs/2406.19412", "description": "arXiv:2406.19412v1 Announce Type: cross \nAbstract: In this article we show how to analyze the covariation of bond prices nonparametrically and robustly, staying consistent with a general no-arbitrage setting. This is, in particular, motivated by the problem of identifying the number of statistically relevant factors in the bond market under minimal conditions. We apply this method in an empirical study which suggests that a high number of factors is needed to describe the term structure evolution and that the term structure of volatility varies over time."}, "https://arxiv.org/abs/2406.19573": {"title": "On Counterfactual Interventions in Vector Autoregressive Models", "link": "https://arxiv.org/abs/2406.19573", "description": "arXiv:2406.19573v1 Announce Type: cross \nAbstract: Counterfactual reasoning allows us to explore hypothetical scenarios in order to explain the impacts of our decisions. However, addressing such inquiries is impossible without establishing the appropriate mathematical framework. In this work, we introduce the problem of counterfactual reasoning in the context of vector autoregressive (VAR) processes. We also formulate the inference of a causal model as a joint regression task where for inference we use both data with and without interventions. After learning the model, we exploit the linearity of the VAR model to make exact predictions about the effects of counterfactual interventions. Furthermore, we quantify the total causal effects of past counterfactual interventions. The source code for this project is freely available at https://github.com/KurtButler/counterfactual_interventions."}, "https://arxiv.org/abs/2406.19974": {"title": "Generalizing self-normalized importance sampling with couplings", "link": "https://arxiv.org/abs/2406.19974", "description": "arXiv:2406.19974v1 Announce Type: cross \nAbstract: An essential problem in statistics and machine learning is the estimation of expectations involving PDFs with intractable normalizing constants. The self-normalized importance sampling (SNIS) estimator, which normalizes the IS weights, has become the standard approach due to its simplicity. However, the SNIS has been shown to exhibit high variance in challenging estimation problems, e.g., involving rare events or posterior predictive distributions in Bayesian statistics. Further, most of the state-of-the-art adaptive importance sampling (AIS) methods adapt the proposal as if the weights had not been normalized. In this paper, we propose a framework that considers the original task as estimation of a ratio of two integrals. 
In our new formulation, we obtain samples from a joint proposal distribution in an extended space, with two of its marginals playing the role of proposals used to estimate each integral. Importantly, the framework allows us to induce and control a dependency between both estimators. We propose a construction of the joint proposal that decomposes in two (multivariate) marginals and a coupling. This leads to a two-stage framework suitable to be integrated with existing or new AIS and/or variational inference (VI) algorithms. The marginals are adapted in the first stage, while the coupling can be chosen and adapted in the second stage. We show in several examples the benefits of the proposed methodology, including an application to Bayesian prediction with misspecified models."}, "https://arxiv.org/abs/2406.20088": {"title": "Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints", "link": "https://arxiv.org/abs/2406.20088", "description": "arXiv:2406.20088v1 Announce Type: cross \nAbstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate, precisely characterizing the effects of privacy constraints, source samples, and target samples on classification accuracy. The results reveal interesting phase transition phenomena and highlight the intricate trade-offs between preserving privacy and achieving classification accuracy. We then develop a data-driven adaptive classifier that achieves the optimal rate within a logarithmic factor across a large collection of parameter spaces while satisfying the same set of differential privacy constraints. Simulation studies and real-world data applications further elucidate the theoretical analysis with numerical results."}, "https://arxiv.org/abs/2109.11271": {"title": "Design-based theory for Lasso adjustment in randomized block experiments and rerandomized experiments", "link": "https://arxiv.org/abs/2109.11271", "description": "arXiv:2109.11271v3 Announce Type: replace \nAbstract: Blocking, a special case of rerandomization, is routinely implemented in the design stage of randomized experiments to balance the baseline covariates. This study proposes a regression adjustment method based on the least absolute shrinkage and selection operator (Lasso) to efficiently estimate the average treatment effect in randomized block experiments with high-dimensional covariates. We derive the asymptotic properties of the proposed estimator and outline the conditions under which this estimator is more efficient than the unadjusted one. We provide a conservative variance estimator to facilitate valid inferences. Our framework allows one treated or control unit in some blocks and heterogeneous propensity scores across blocks, thus including paired experiments and finely stratified experiments as special cases. We further accommodate rerandomized experiments and a combination of blocking and rerandomization. Moreover, our analysis allows both the number of blocks and block sizes to tend to infinity, as well as heterogeneous treatment effects across blocks without assuming a true outcome data-generating model. 
Simulation studies and two real-data analyses demonstrate the advantages of the proposed method."}, "https://arxiv.org/abs/2310.20376": {"title": "Hierarchical Mixture of Finite Mixtures", "link": "https://arxiv.org/abs/2310.20376", "description": "arXiv:2310.20376v2 Announce Type: replace \nAbstract: Statistical modelling in the presence of data organized in groups is a crucial task in Bayesian statistics. The present paper conceives a mixture model based on a novel family of Bayesian priors designed for multilevel data and obtained by normalizing a finite point process. In particular, the work extends the popular Mixture of Finite Mixture model to the hierarchical framework to capture heterogeneity within and between groups. A full distribution theory for this new family and the induced clustering is developed, including the marginal, posterior, and predictive distributions. Efficient marginal and conditional Gibbs samplers are designed to provide posterior inference. The proposed mixture model overcomes the Hierarchical Dirichlet Process, the utmost tool for handling multilevel data, in terms of analytical feasibility, clustering discovery, and computational time. The motivating application comes from the analysis of shot put data, which contains performance measurements of athletes across different seasons. In this setting, the proposed model is exploited to induce clustering of the observations across seasons and athletes. By linking clusters across seasons, similarities and differences in athletes' performances are identified."}, "https://arxiv.org/abs/2301.12616": {"title": "Active Sequential Two-Sample Testing", "link": "https://arxiv.org/abs/2301.12616", "description": "arXiv:2301.12616v4 Announce Type: replace-cross \nAbstract: A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \\emph{active sequential two-sample testing framework} that not only sequentially but also \\emph{actively queries}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \\emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. 
In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control."}, "https://arxiv.org/abs/2310.00125": {"title": "Covariance Expressions for Multi-Fidelity Sampling with Multi-Output, Multi-Statistic Estimators: Application to Approximate Control Variates", "link": "https://arxiv.org/abs/2310.00125", "description": "arXiv:2310.00125v2 Announce Type: replace-cross \nAbstract: We provide a collection of results on covariance expressions between Monte Carlo based multi-output mean, variance, and Sobol main effect variance estimators from an ensemble of models. These covariances can be used within multi-fidelity uncertainty quantification strategies that seek to reduce the estimator variance of high-fidelity Monte Carlo estimators with an ensemble of low-fidelity models. Such covariance expressions are required within approaches like the approximate control variate and multi-level best linear unbiased estimator. While the literature provides these expressions for some single-output cases such as mean and variance, our results are relevant to both multiple function outputs and multiple statistics across any sampling strategy. Following the description of these results, we use them within an approximate control variate scheme to show that leveraging multiple outputs can dramatically reduce estimator variance compared to single-output approaches. Synthetic examples are used to highlight the effects of optimal sample allocation and pilot sample estimation. A flight-trajectory simulation of entry, descent, and landing is used to demonstrate multi-output estimation in practical applications."}, "https://arxiv.org/abs/2312.10499": {"title": "Censored extreme value estimation", "link": "https://arxiv.org/abs/2312.10499", "description": "arXiv:2312.10499v4 Announce Type: replace-cross \nAbstract: A novel and comprehensive methodology designed to tackle the challenges posed by extreme values in the context of random censorship is introduced. The main focus is on the analysis of integrals based on the product-limit estimator of normalized upper order statistics, called extreme Kaplan--Meier integrals. These integrals allow for the transparent derivation of various important asymptotic distributional properties, offering an alternative approach to conventional plug-in estimation methods. Notably, this methodology demonstrates robustness and wide applicability within the scope of max-domains of attraction. A noteworthy by-product is the extension of generalized Hill-type estimators of extremes to encompass all max-domains of attraction, which is of independent interest. The theoretical framework is applied to construct novel estimators for positive and real-valued extreme value indices for right-censored data. Simulation studies supporting the theory are provided."}, "https://arxiv.org/abs/2407.00139": {"title": "A Calibrated Sensitivity Analysis for Weighted Causal Decompositions", "link": "https://arxiv.org/abs/2407.00139", "description": "arXiv:2407.00139v1 Announce Type: new \nAbstract: Disparities in health or well-being experienced by minority groups can be difficult to study using the traditional exposure-outcome paradigm in causal inference, since potential outcomes in variables such as race or sexual minority status are challenging to interpret. 
Causal decomposition analysis addresses this gap by positing causal effects on disparities under interventions to other, intervenable exposures that may play a mediating role in the disparity. While invoking weaker assumptions than causal mediation approaches, decomposition analyses are often conducted in observational settings and require uncheckable assumptions that eliminate unmeasured confounders. Leveraging the marginal sensitivity model, we develop a sensitivity analysis for weighted causal decomposition estimators and use the percentile bootstrap to construct valid confidence intervals for causal effects on disparities. We also propose a two-parameter amplification that enhances interpretability and facilitates an intuitive understanding of the plausibility of unmeasured confounders and their effects. We illustrate our framework on a study examining the effect of parental acceptance on disparities in suicidal ideation among sexual minority youth. We find that the effect is small and sensitive to unmeasured confounding, suggesting that further screening studies are needed to identify mitigating interventions in this vulnerable population."}, "https://arxiv.org/abs/2407.00364": {"title": "Medical Knowledge Integration into Reinforcement Learning Algorithms for Dynamic Treatment Regimes", "link": "https://arxiv.org/abs/2407.00364", "description": "arXiv:2407.00364v1 Announce Type: new \nAbstract: The goal of precision medicine is to provide individualized treatment at each stage of chronic diseases, a concept formalized by Dynamic Treatment Regimes (DTR). These regimes adapt treatment strategies based on decision rules learned from clinical data to enhance therapeutic effectiveness. Reinforcement Learning (RL) algorithms allow to determine these decision rules conditioned by individual patient data and their medical history. The integration of medical expertise into these models makes possible to increase confidence in treatment recommendations and facilitate the adoption of this approach by healthcare professionals and patients. In this work, we examine the mathematical foundations of RL, contextualize its application in the field of DTR, and present an overview of methods to improve its effectiveness by integrating medical expertise."}, "https://arxiv.org/abs/2407.00381": {"title": "Climate change analysis from LRD manifold functional regression", "link": "https://arxiv.org/abs/2407.00381", "description": "arXiv:2407.00381v1 Announce Type: new \nAbstract: A functional nonlinear regression approach, incorporating time information in the covariates, is proposed for temporal strong correlated manifold map data sequence analysis. Specifically, the functional regression parameters are supported on a connected and compact two--point homogeneous space. The Generalized Least--Squares (GLS) parameter estimator is computed in the linearized model, having error term displaying manifold scale varying Long Range Dependence (LRD). The performance of the theoretical and plug--in nonlinear regression predictors is illustrated by simulations on sphere, in terms of the empirical mean of the computed spherical functional absolute errors. In the case where the second--order structure of the functional error term in the linearized model is unknown, its estimation is performed by minimum contrast in the functional spectral domain. 
The linear case is illustrated in the Supplementary Material, revealing the effect of the slow decay velocity in time of the trace norms of the covariance operator family of the regression LRD error term. The purely spatial statistical analysis of atmospheric pressure at high cloud bottom, and downward solar radiation flux in Alegria et al. (2021) is extended to the spatiotemporal context, illustrating the numerical results from a generated synthetic data set."}, "https://arxiv.org/abs/2407.00561": {"title": "Advancing Information Integration through Empirical Likelihood: Selective Reviews and a New Idea", "link": "https://arxiv.org/abs/2407.00561", "description": "arXiv:2407.00561v1 Announce Type: new \nAbstract: Information integration plays a pivotal role in biomedical studies by facilitating the combination and analysis of independent datasets from multiple studies, thereby uncovering valuable insights that might otherwise remain obscured due to the limited sample size in individual studies. However, sharing raw data from independent studies presents significant challenges, primarily due to the need to safeguard sensitive participant information and the cumbersome paperwork involved in data sharing. In this article, we first provide a selective review of recent methodological developments in information integration via empirical likelihood, wherein only summary information is required, rather than the raw data. Following this, we introduce a new insight and a potentially promising framework that could broaden the application of information integration across a wider spectrum. Furthermore, this new framework offers computational convenience compared to classic empirical likelihood-based methods. We provide numerical evaluations to assess its performance and discuss various extensions in the end."}, "https://arxiv.org/abs/2407.00564": {"title": "Variational Nonparametric Inference in Functional Stochastic Block Model", "link": "https://arxiv.org/abs/2407.00564", "description": "arXiv:2407.00564v1 Announce Type: new \nAbstract: We propose a functional stochastic block model whose vertices involve functional data information. This new model extends the classic stochastic block model with vector-valued nodal information, and finds applications in real-world networks whose nodal information could be functional curves. Examples include international trade data in which a network vertex (country) is associated with the annual or quarterly GDP over a certain time period, and MyFitnessPal data in which a network vertex (MyFitnessPal user) is associated with daily calorie information measured over a certain time period. Two statistical tasks will be jointly executed. First, we will detect community structures of the network vertices assisted by the functional nodal information. Second, we propose a computationally efficient variational test to examine the significance of the functional nodal information. We show that the community detection algorithms achieve weak and strong consistency, and the variational test is asymptotically chi-square with diverging degrees of freedom. As a byproduct, we propose pointwise confidence intervals for the slope function of the functional nodal information. 
Our methods are examined through both simulated and real datasets."}, "https://arxiv.org/abs/2407.00650": {"title": "Proper Scoring Rules for Multivariate Probabilistic Forecasts based on Aggregation and Transformation", "link": "https://arxiv.org/abs/2407.00650", "description": "arXiv:2407.00650v1 Announce Type: new \nAbstract: Proper scoring rules are an essential tool to assess the predictive performance of probabilistic forecasts. However, propriety alone does not ensure an informative characterization of predictive performance and it is recommended to compare forecasts using multiple scoring rules. With that in mind, interpretable scoring rules providing complementary information are necessary. We formalize a framework based on aggregation and transformation to build interpretable multivariate proper scoring rules. Aggregation-and-transformation-based scoring rules are able to target specific features of the probabilistic forecasts, which improves the characterization of the predictive performance. This framework is illustrated through examples taken from the literature and studied using numerical experiments showcasing its benefits. In particular, it is shown that it can help bridge the gap between proper scoring rules and spatial verification tools."}, "https://arxiv.org/abs/2407.00655": {"title": "Markov Switching Multiple-equation Tensor Regressions", "link": "https://arxiv.org/abs/2407.00655", "description": "arXiv:2407.00655v1 Announce Type: new \nAbstract: We propose a new flexible tensor model for multiple-equation regression that accounts for latent regime changes. The model allows for dynamic coefficients and multi-dimensional covariates that vary across equations. We assume the coefficients are driven by a common hidden Markov process that addresses structural breaks to enhance the model flexibility and preserve parsimony. We introduce a new Soft PARAFAC hierarchical prior to achieve dimensionality reduction while preserving the structural information of the covariate tensor. The proposed prior includes a new multi-way shrinking effect to address over-parametrization issues. We developed theoretical results to help with hyperparameter choice. An efficient MCMC algorithm based on a random scan Gibbs and back-fitting strategy is developed to achieve better computational scalability of the posterior sampling. The validity of the MCMC algorithm is demonstrated theoretically, and its computational efficiency is studied using numerical experiments in different parameter settings. The effectiveness of the model framework is illustrated using two original real data analyses. The proposed model exhibits superior performance when compared to the current benchmark, Lasso regression."}, "https://arxiv.org/abs/2407.00716": {"title": "On a General Theoretical Framework of Reliability", "link": "https://arxiv.org/abs/2407.00716", "description": "arXiv:2407.00716v1 Announce Type: new \nAbstract: Reliability is an essential measure of how closely observed scores represent latent scores (reflecting constructs), assuming some latent variable measurement model. We present a general theoretical framework of reliability, placing emphasis on measuring association between latent and observed scores. This framework was inspired by McDonald's (2011) regression framework, which highlighted the coefficient of determination as a measure of reliability. 
We extend McDonald's (2011) framework beyond coefficients of determination and introduce four desiderata for reliability measures (estimability, normalization, symmetry, and invariance). We also present theoretical examples to illustrate distinct measures of reliability and report on a numerical study that demonstrates the behavior of different reliability measures. We conclude with a discussion on the use of reliability coefficients and outline future avenues of research."}, "https://arxiv.org/abs/2407.00791": {"title": "inlabru: software for fitting latent Gaussian models with non-linear predictors", "link": "https://arxiv.org/abs/2407.00791", "description": "arXiv:2407.00791v1 Announce Type: new \nAbstract: The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian models and the likelihoods currently implemented in {INLA}, the main software implementation of the INLA methodology.\n {inlabru} is a software package that extends the types of models that can be fitted using INLA by allowing the latent predictor to be non-linear in its parameters, moving beyond the additive linear predictor framework to allow more complex functional relationships. For inference it uses an approximate iterative method based on the first-order Taylor expansion of the non-linear predictor, fitting the model using INLA for each linearised model configuration.\n {inlabru} automates much of the workflow required to fit models using {R-INLA}, simplifying the process for users to specify, fit and predict from models. There is additional support for fitting joint likelihood models by building each likelihood individually. {inlabru} also supports the direct use of spatial data structures, such as those implemented in the {sf} and {terra} packages.\n In this paper we outline the statistical theory, model structure and basic syntax required for users to understand and develop their own models using {inlabru}. We evaluate the approximate inference method using a Bayesian method checking approach. We provide three examples modelling simulated spatial data that demonstrate the benefits of the additional flexibility provided by {inlabru}."}, "https://arxiv.org/abs/2407.00797": {"title": "A placement-value based approach to concave ROC analysis", "link": "https://arxiv.org/abs/2407.00797", "description": "arXiv:2407.00797v1 Announce Type: new \nAbstract: The receiver operating characteristic (ROC) curve is an important graphic tool for evaluating a test in a wide range of disciplines. While useful, an ROC curve can cross the chance line, either by having an S-shape or a hook at the extreme specificity. These non-concave ROC curves are sub-optimal according to decision theory, as there are points that are superior than those corresponding to the portions below the chance line with either the same sensitivity or specificity. We extend the literature by proposing a novel placement value-based approach to ensure concave curvature of the ROC curve, and utilize Bayesian paradigm to make estimations under both a parametric and a semiparametric framework. 
We conduct extensive simulation studies to assess the performance of the proposed methodology under various scenarios, and apply it to a pancreatic cancer dataset."}, "https://arxiv.org/abs/2407.00846": {"title": "Estimating the cognitive effects of statins from observational data using the survival-incorporated median: a summary measure for clinical outcomes in the presence of death", "link": "https://arxiv.org/abs/2407.00846", "description": "arXiv:2407.00846v1 Announce Type: new \nAbstract: The issue of \"truncation by death\" commonly arises in clinical research: subjects may die before their follow-up assessment, resulting in undefined clinical outcomes. This article addresses truncation by death by analyzing the Long Life Family Study (LLFS), a multicenter observational study involving over 4000 older adults with familial longevity. We are interested in the cognitive effects of statins in LLFS participants, as the impact of statins on cognition remains unclear despite their widespread use. In this application, rather than treating death as a mechanism through which clinical outcomes are missing, we advocate treating death as part of the outcome measure. We focus on the survival-incorporated median, the median of a composite outcome combining death and cognitive scores, to summarize the effect of statins. We propose an estimator for the survival-incorporated median from observational data, applicable in both point-treatment settings and time-varying treatment settings. Simulations demonstrate the survival-incorporated median as a simple and useful summary measure. We apply this method to estimate the effect of statins on the change in cognitive function (measured by the Digit Symbol Substitution Test), incorporating death. Our results indicate no significant difference in cognitive decline between participants with a similar age distribution on and off statins from baseline. Through this application, we aim to not only contribute to this clinical question but also offer insights into analyzing clinical outcomes in the presence of death."}, "https://arxiv.org/abs/2407.00859": {"title": "Statistical inference on partially shape-constrained function-on-scalar linear regression models", "link": "https://arxiv.org/abs/2407.00859", "description": "arXiv:2407.00859v1 Announce Type: new \nAbstract: We consider functional linear regression models where functional outcomes are associated with scalar predictors by coefficient functions with shape constraints, such as monotonicity and convexity, that apply to sub-domains of interest. To validate the partial shape constraints, we propose testing a composite hypothesis of linear functional constraints on regression coefficients. Our approach employs kernel- and spline-based methods within a unified inferential framework, evaluating the statistical significance of the hypothesis by measuring an $L^2$-distance between constrained and unconstrained model fits. In the theoretical study of large-sample analysis under mild conditions, we show that both methods achieve the standard rate of convergence observed in the nonparametric estimation literature. Through numerical experiments of finite-sample analysis, we demonstrate that the type I error rate keeps the significance level as specified across various scenarios and that the power increases with sample size, confirming the consistency of the test procedure under both estimation methods. 
Our theoretical and numerical results provide researchers the flexibility to choose a method based on computational preference. The practicality of partial shape-constrained inference is illustrated by two data applications: one involving clinical trials of NeuroBloc in type A-resistant cervical dystonia and the other with the National Institute of Mental Health Schizophrenia Study."}, "https://arxiv.org/abs/2407.00882": {"title": "Subgroup Identification with Latent Factor Structure", "link": "https://arxiv.org/abs/2407.00882", "description": "arXiv:2407.00882v1 Announce Type: new \nAbstract: Subgroup analysis has attracted growing attention due to its ability to identify meaningful subgroups from a heterogeneous population and thereby improve predictive power. However, in many scenarios such as social science and biology, the covariates are possibly highly correlated due to the existence of common factors, which brings great challenges for group identification and is neglected in the existing literature. In this paper, we aim to fill this gap in the ``diverging dimension'' regime and propose a center-augmented subgroup identification method under the Factor Augmented (sparse) Linear Model framework, which bridges dimension reduction and sparse regression together. The proposed method is flexible to the possibly high cross-sectional dependence among covariates and inherits the computational advantage with complexity $O(nK)$, in contrast to the $O(n^2)$ complexity of the conventional pairwise fusion penalty method in the literature, where $n$ is the sample size and $K$ is the number of subgroups. We also investigate the asymptotic properties of its oracle estimators under conditions on the minimal distance between group centroids. To implement the proposed approach, we introduce a Difference of Convex functions based Alternating Direction Method of Multipliers (DC-ADMM) algorithm and demonstrate its convergence to a local minimizer in finite steps. We illustrate the superiority of the proposed method through extensive numerical experiments and a real macroeconomic data example. An \\texttt{R} package \\texttt{SILFS} implementing the method is also available on CRAN."}, "https://arxiv.org/abs/2407.00890": {"title": "Macroeconomic Forecasting with Large Language Models", "link": "https://arxiv.org/abs/2407.00890", "description": "arXiv:2407.00890v1 Announce Type: new \nAbstract: This paper presents a comparative analysis evaluating the accuracy of Large Language Models (LLMs) against traditional macro time series forecasting approaches. In recent times, LLMs have surged in popularity for forecasting due to their ability to capture intricate patterns in data and quickly adapt across very different domains. However, their effectiveness in forecasting macroeconomic time series data compared to conventional methods remains an area of interest. To address this, we conduct a rigorous evaluation of LLMs against traditional macro forecasting methods, using as common ground the FRED-MD database. 
Our findings provide valuable insights into the strengths and limitations of LLMs in forecasting macroeconomic time series, shedding light on their applicability in real-world scenarios"}, "https://arxiv.org/abs/2407.01036": {"title": "Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests", "link": "https://arxiv.org/abs/2407.01036", "description": "arXiv:2407.01036v1 Announce Type: new \nAbstract: A/B testers conducting large-scale tests prioritize lifts and want to be able to control false rejections of the null. This work develops a decision-theoretic framework for maximizing profits subject to false discovery rate (FDR) control. We build an empirical Bayes solution for the problem via the greedy knapsack approach. We derive an oracle rule based on ranking the ratio of expected lifts and the cost of wrong rejections using the local false discovery rate (lfdr) statistic. Our oracle decision rule is valid and optimal for large-scale tests. Further, we establish asymptotic validity for the data-driven procedure and demonstrate finite-sample validity in experimental studies. We also demonstrate the merit of the proposed method over other FDR control methods. Finally, we discuss an application to actual Optimizely experiments."}, "https://arxiv.org/abs/2407.01055": {"title": "Exact statistical analysis for response-adaptive clinical trials: a general and computationally tractable approach", "link": "https://arxiv.org/abs/2407.01055", "description": "arXiv:2407.01055v1 Announce Type: new \nAbstract: Response-adaptive (RA) designs of clinical trials allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. RA designs face greater regulatory scrutiny due to potential type I error inflation, which limits their uptake in practice. Existing approaches to type I error control either only work for specific designs, have a risk of Monte Carlo/approximation error, are conservative, or computationally intractable. We develop a general and computationally tractable approach for exact analysis in two-arm RA designs with binary outcomes. We use the approach to construct exact tests applicable to designs that use either randomized or deterministic RA procedures, allowing for complexities such as delayed outcomes, early stopping or allocation of participants in blocks. Our efficient forward recursion implementation allows for testing of two-arm trials with 1,000 participants on a standard computer. Through an illustrative computational study of trials using randomized dynamic programming we show that, contrary to what is known for equal allocation, a conditional exact test has, almost uniformly, higher power than the unconditional test. Two real-world trials with the above-mentioned complexities are re-analyzed to demonstrate the value of our approach in controlling type I error and/or improving the statistical power."}, "https://arxiv.org/abs/2407.01186": {"title": "Data fusion for efficiency gain in ATE estimation: A practical review with simulations", "link": "https://arxiv.org/abs/2407.01186", "description": "arXiv:2407.01186v1 Announce Type: new \nAbstract: The integration of real-world data (RWD) and randomized controlled trials (RCT) is increasingly important for advancing causal inference in scientific research. This combination holds great promise for enhancing the efficiency of causal effect estimation, offering benefits such as reduced trial participant numbers and expedited drug access for patients. 
Despite the availability of numerous data fusion methods, selecting the most appropriate one for a specific research question remains challenging. This paper systematically reviews and compares these methods regarding their assumptions, limitations, and implementation complexities. Through simulations reflecting real-world scenarios, we identify a prevalent risk-reward trade-off across different methods. We investigate and interpret this trade-off, providing key insights into the strengths and weaknesses of various methods; thereby helping researchers navigate through the application of data fusion for improved causal inference."}, "https://arxiv.org/abs/2407.00417": {"title": "Obtaining $(\\epsilon,\\delta)$-differential privacy guarantees when using a Poisson mechanism to synthesize contingency tables", "link": "https://arxiv.org/abs/2407.00417", "description": "arXiv:2407.00417v1 Announce Type: cross \nAbstract: We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain $(\\epsilon, \\delta)$-probabilistic differential privacy guarantees via the Poisson distribution's cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database."}, "https://arxiv.org/abs/2407.01171": {"title": "Neural Conditional Probability for Inference", "link": "https://arxiv.org/abs/2407.01171", "description": "arXiv:2407.01171v1 Announce Type: cross \nAbstract: We introduce NCP (Neural Conditional Probability), a novel operator-theoretic approach for learning conditional distributions with a particular focus on inference tasks. NCP can be used to build conditional confidence regions and extract important statistics like conditional quantiles, mean, and covariance. It offers streamlined learning through a single unconditional training phase, facilitating efficient inference without the need for retraining even when conditioning changes. By tapping into the powerful approximation capabilities of neural networks, our method efficiently handles a wide variety of complex probability distributions, effectively dealing with nonlinear relationships between input and output variables. Theoretical guarantees ensure both optimization consistency and statistical accuracy of the NCP method. Our experiments show that our approach matches or beats leading methods using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This demonstrates that a minimalistic architecture with a theoretically grounded loss function can achieve competitive results without sacrificing performance, even in the face of more complex architectures."}, "https://arxiv.org/abs/2010.00729": {"title": "Individual-centered partial information in social networks", "link": "https://arxiv.org/abs/2010.00729", "description": "arXiv:2010.00729v4 Announce Type: replace \nAbstract: In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. 
Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive theoretical properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual's partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure."}, "https://arxiv.org/abs/2106.09499": {"title": "Maximum Entropy Spectral Analysis: an application to gravitational waves data analysis", "link": "https://arxiv.org/abs/2106.09499", "description": "arXiv:2106.09499v2 Announce Type: replace \nAbstract: The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, offers a powerful tool for spectral estimation of a time-series. It relies on Jaynes' maximum entropy principle, allowing the spectrum of a stochastic process to be inferred using the coefficients of an autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides estimates for both the autoregressive coefficients and the order $p$ of the process. We provide a ready-to-use implementation of this algorithm in a Python package called \\texttt{memspectrum}, characterized through power spectral density (PSD) analysis on synthetic data with known PSD and comparisons of different criteria for stopping the recursion. Additionally, we compare the performance of our implementation with the ubiquitous Welch algorithm, using synthetic data generated from the GW150914 strain spectrum released by the LIGO-Virgo-Kagra collaboration. Our findings indicate that Burg's method provides PSD estimates with systematically lower variance and bias. This is particularly manifest in the case of a small (O($5000$)) number of data points, making Burg's method most suitable to work in this regime. Since this is close to the typical length of analysed gravitational waves data, improving the estimate of the PSD in this regime leads to more reliable posterior profiles for the system under study. We conclude our investigation by utilising MESA, and its particularly easy parametrisation where the only free parameter is the order $p$ of the AR process, to marginalise over the interferometers' noise PSD in conjunction with inferring the parameters of GW150914."}, "https://arxiv.org/abs/2201.04811": {"title": "Binary response model with many weak instruments", "link": "https://arxiv.org/abs/2201.04811", "description": "arXiv:2201.04811v4 Announce Type: replace \nAbstract: This paper considers an endogenous binary response model with many weak instruments. We employ a control function approach and a regularization scheme to obtain better estimation results for the endogenous binary response model in the presence of many weak instruments. Two consistent and asymptotically normally distributed estimators are provided: a regularized conditional maximum likelihood estimator (RCMLE) and a regularized nonlinear least squares estimator (RNLSE). 
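A minimal synthetic sketch of a regularized control-function estimator in the spirit of the binary-response entry above: a ridge-penalized first stage on many weak instruments, followed by a probit that includes the first-stage residual as a control function. This is a generic recipe for illustration; the paper's RCMLE/RNLSE and its particular regularization scheme may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 2000, 50                        # many instruments
Z = rng.normal(size=(n, k))            # instruments
pi = np.full(k, 0.05)                  # weak first-stage coefficients
u = rng.normal(size=n)
x = Z @ pi + u                         # endogenous regressor
eps = 0.8 * u + rng.normal(size=n)     # error correlated with u -> endogeneity
y = (0.5 * x + eps > 0).astype(int)    # binary outcome

# Stage 1: regularized (ridge) first stage to cope with many weak instruments.
first_stage = Ridge(alpha=10.0).fit(Z, x)
v_hat = x - first_stage.predict(Z)     # control function (first-stage residual)

# Stage 2: probit of y on x and the control function.
X2 = sm.add_constant(np.column_stack([x, v_hat]))
probit = sm.Probit(y, X2).fit(disp=0)
print(probit.params)                   # the coefficient on x is the object of interest
```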
Monte Carlo simulations show that the proposed estimators outperform the existing ones when there are many weak instruments. We use the proposed estimation method to examine the effect of family income on college completion."}, "https://arxiv.org/abs/2211.10776": {"title": "Bayesian Modal Regression based on Mixture Distributions", "link": "https://arxiv.org/abs/2211.10776", "description": "arXiv:2211.10776v5 Announce Type: replace \nAbstract: Compared to mean regression and quantile regression, the literature on modal regression is very sparse. A unifying framework for Bayesian modal regression is proposed, based on a family of unimodal distributions indexed by the mode, along with other parameters that allow for flexible shapes and tail behaviors. Sufficient conditions for posterior propriety under an improper prior on the mode parameter are derived. Following prior elicitation, regression analyses of simulated data and datasets from several real-life applications are conducted. Besides drawing inference for covariate effects that are easy to interpret, prediction and model selection under the proposed Bayesian modal regression framework are also considered. Evidence from these analyses suggests that the proposed inference procedures are very robust to outliers, enabling one to discover interesting covariate effects missed by mean or median regression, and to construct much tighter prediction intervals than those from mean or median regression. Computer programs for implementing the proposed Bayesian modal regression are available at https://github.com/rh8liuqy/Bayesian_modal_regression."}, "https://arxiv.org/abs/2212.01699": {"title": "Parametric Modal Regression with Error in Covariates", "link": "https://arxiv.org/abs/2212.01699", "description": "arXiv:2212.01699v3 Announce Type: replace \nAbstract: An inference procedure is proposed to provide consistent estimators of parameters in a modal regression model with a covariate prone to measurement error. A score-based diagnostic tool exploiting parametric bootstrap is developed to assess adequacy of parametric assumptions imposed on the regression model. The proposed estimation method and diagnostic tool are applied to synthetic data generated from simulation experiments and data from real-world applications to demonstrate their implementation and performance. These empirical examples illustrate the importance of adequately accounting for measurement error in the error-prone covariate when inferring the association between a response and covariates based on a modal regression model that is especially suitable for skewed and heavy-tailed response data."}, "https://arxiv.org/abs/2212.04746": {"title": "Model-based clustering of categorical data based on the Hamming distance", "link": "https://arxiv.org/abs/2212.04746", "description": "arXiv:2212.04746v2 Announce Type: replace \nAbstract: A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components.\n Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. 
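A small sketch of a Hamming-distance kernel of the kind used for model-based clustering in the entry above, assuming the illustrative parameterization $p(x) \propto \exp\{-\lambda\, d_H(x, c)\}$, which factorizes over coordinates with a closed-form normalizer. The paper's exact parameterization and its conjugate prior are not reproduced here.

```python
import numpy as np

def hamming_pmf(x, center, lam, n_levels):
    """Probability of an integer-coded categorical vector x under a
    Hamming-distance kernel centered at `center`:
        p(x) proportional to exp(-lam * d_H(x, center)),
    which factorizes over coordinates; coordinate j with m_j categories has
    normalizer 1 + (m_j - 1) * exp(-lam).  Larger `lam` concentrates mass
    closer to the center."""
    x, center, n_levels = map(np.asarray, (x, center, n_levels))
    mismatch = (x != center).astype(float)
    log_kernel = -lam * mismatch
    log_norm = np.log1p((n_levels - 1) * np.exp(-lam))
    return float(np.exp(np.sum(log_kernel - log_norm)))

# toy usage: 4 categorical variables, 3 levels each
print(hamming_pmf([0, 1, 2, 0], [0, 1, 1, 0], lam=1.5, n_levels=[3, 3, 3, 3]))
```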
The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches."}, "https://arxiv.org/abs/2301.04625": {"title": "Enhanced Response Envelope via Envelope Regularization", "link": "https://arxiv.org/abs/2301.04625", "description": "arXiv:2301.04625v2 Announce Type: replace \nAbstract: The response envelope model provides substantial efficiency gains over the standard multivariate linear regression by identifying the material part of the response to the model and by excluding the immaterial part. In this paper, we propose the enhanced response envelope by incorporating a novel envelope regularization term based on a nonconvex manifold formulation. It is shown that the enhanced response envelope can yield better prediction risk than the original envelope estimator. The enhanced response envelope naturally handles high-dimensional data for which the original response envelope is not serviceable without necessary remedies. In an asymptotic high-dimensional regime where the ratio of the number of predictors over the number of samples converges to a non-zero constant, we characterize the risk function and reveal an interesting double descent phenomenon for the envelope model. A simulation study confirms our main theoretical findings. Simulations and real data applications demonstrate that the enhanced response envelope does have significantly improved prediction performance over the original envelope method, especially when the number of predictors is close to or moderately larger than the number of samples. Proofs and additional simulation results are shown in the supplementary file to this paper."}, "https://arxiv.org/abs/2301.13088": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces", "link": "https://arxiv.org/abs/2301.13088", "description": "arXiv:2301.13088v3 Announce Type: replace \nAbstract: Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. 
This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners."}, "https://arxiv.org/abs/2307.02331": {"title": "Differential recall bias in estimating treatment effects in observational studies", "link": "https://arxiv.org/abs/2307.02331", "description": "arXiv:2307.02331v2 Announce Type: replace \nAbstract: Observational studies are frequently used to estimate the effect of an exposure or treatment on an outcome. To obtain an unbiased estimate of the treatment effect, it is crucial to measure the exposure accurately. A common type of exposure misclassification is recall bias, which occurs in retrospective cohort studies when study subjects may inaccurately recall their past exposure. Particularly challenging is differential recall bias in the context of self-reported binary exposures, where the bias may be directional rather than random, and its extent varies according to the outcomes experienced. This paper makes several contributions: (1) it establishes bounds for the average treatment effect (ATE) even when a validation study is not available; (2) it proposes multiple estimation methods across various strategies predicated on different assumptions; and (3) it suggests a sensitivity analysis technique to assess the robustness of the causal conclusion, incorporating insights from prior research. The effectiveness of these methods is demonstrated through simulation studies that explore various model misspecification scenarios. These approaches are then applied to investigate the effect of childhood physical abuse on mental health in adulthood."}, "https://arxiv.org/abs/2309.10642": {"title": "Correcting Selection Bias in Standardized Test Scores Comparisons", "link": "https://arxiv.org/abs/2309.10642", "description": "arXiv:2309.10642v4 Announce Type: replace \nAbstract: This paper addresses the issue of sample selection bias when comparing countries using International assessments like PISA (Program for International Student Assessment). Despite its widespread use, PISA rankings may be biased due to different attrition patterns in different countries, leading to inaccurate comparisons. This study proposes a methodology to correct for sample selection bias using a quantile selection model. Applying the method to PISA 2018 data, I find that correcting for selection bias significantly changes the rankings (based on the mean) of countries' educational performances. My results highlight the importance of accounting for sample selection bias in international educational comparisons."}, "https://arxiv.org/abs/2310.01575": {"title": "Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis", "link": "https://arxiv.org/abs/2310.01575", "description": "arXiv:2310.01575v2 Announce Type: replace \nAbstract: Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. 
Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States."}, "https://arxiv.org/abs/2310.10393": {"title": "Statistical and Causal Robustness for Causal Null Hypothesis Tests", "link": "https://arxiv.org/abs/2310.10393", "description": "arXiv:2310.10393v2 Announce Type: replace \nAbstract: Prior work applying semiparametric theory to causal inference has primarily focused on deriving estimators that exhibit statistical robustness under a prespecified causal model that permits identification of a desired causal parameter. However, a fundamental challenge is correct specification of such a model, which usually involves making untestable assumptions. Evidence factors is an approach to combining hypothesis tests of a common causal null hypothesis under two or more candidate causal models. Under certain conditions, this yields a test that is valid if at least one of the underlying models is correct, which is a form of causal robustness. We propose a method of combining semiparametric theory with evidence factors. We develop a causal null hypothesis test based on joint asymptotic normality of K asymptotically linear semiparametric estimators, where each estimator is based on a distinct identifying functional derived from each of K candidate causal models. We show that this test provides both statistical and causal robustness in the sense that it is valid if at least one of the K proposed causal models is correct, while also allowing for slower than parametric rates of convergence in estimating nuisance functions. We demonstrate the effectiveness of our method via simulations and applications to the Framingham Heart Study and Wisconsin Longitudinal Study."}, "https://arxiv.org/abs/2308.01198": {"title": "Analyzing the Reporting Error of Public Transport Trips in the Danish National Travel Survey Using Smart Card Data", "link": "https://arxiv.org/abs/2308.01198", "description": "arXiv:2308.01198v3 Announce Type: replace-cross \nAbstract: Household travel surveys have been used for decades to collect individuals' and households' travel behavior. However, self-reported surveys are subject to recall bias, as respondents might struggle to recall and report their activities accurately. 
This study examines the time reporting error of public transit users in a nationwide household travel survey by matching, at the individual level, five consecutive years of data from two sources, namely the Danish National Travel Survey (TU) and the Danish Smart Card system (Rejsekort). Survey respondents are matched with travel cards from the Rejsekort data solely based on the respondents' declared spatiotemporal travel behavior. Approximately 70% of the respondents were successfully matched with Rejsekort travel cards. The findings reveal a median time reporting error of 11.34 minutes, with an Interquartile Range of 28.14 minutes. Furthermore, a statistical analysis was performed to explore the relationships between the survey respondents' reporting error and their socio-economic and demographic characteristics. The results indicate that females and respondents with a fixed schedule are in general more accurate than males and respondents with a flexible schedule in reporting their times of travel. Moreover, trips reported during weekdays or via the internet displayed higher accuracies compared to trips reported during weekends and holidays or via telephone interviews. This disaggregated analysis provides valuable insights that could help in improving the design and analysis of travel surveys, as well as accounting for reporting errors/biases in travel survey-based applications. Furthermore, it offers valuable insights into the psychology of travel recall by survey respondents."}, "https://arxiv.org/abs/2401.13665": {"title": "Entrywise Inference for Missing Panel Data: A Simple and Instance-Optimal Approach", "link": "https://arxiv.org/abs/2401.13665", "description": "arXiv:2401.13665v2 Announce Type: replace-cross \nAbstract: Longitudinal or panel data can be represented as a matrix with rows indexed by units and columns indexed by time. We consider inferential questions associated with the missing data version of panel data induced by staggered adoption. We propose a computationally efficient procedure for estimation, involving only simple matrix algebra and singular value decomposition, and prove non-asymptotic and high-probability bounds on its error in estimating each missing entry. By controlling proximity to a suitably scaled Gaussian variable, we develop and analyze a data-driven procedure for constructing entrywise confidence intervals with pre-specified coverage. Despite its simplicity, our procedure turns out to be instance-optimal: we prove that the width of our confidence intervals matches a non-asymptotic instance-wise lower bound derived via a Bayesian Cram\\'{e}r-Rao argument. We illustrate the sharpness of our theoretical characterization on a variety of numerical examples. Our analysis is based on a general inferential toolbox for SVD-based algorithms applied to the matrix denoising model, which might be of independent interest."}, "https://arxiv.org/abs/2407.01565": {"title": "A pseudo-outcome-based framework to analyze treatment heterogeneity in survival data using electronic health records", "link": "https://arxiv.org/abs/2407.01565", "description": "arXiv:2407.01565v1 Announce Type: new \nAbstract: An important aspect of precision medicine focuses on characterizing diverse responses to treatment due to unique patient characteristics, also known as heterogeneous treatment effects (HTE), and identifying beneficial subgroups with enhanced treatment effects. Estimating HTE with right-censored data in observational studies remains challenging. 
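To illustrate the meta-learner step of a pseudo-outcome framework like the one in the entry above, the sketch below fits a simple T-learner on pseudo-outcomes that are assumed to have already been constructed (for example, via an IPCW transformation of censored survival times); the pseudo-outcome construction itself, the variable importance metric, and the subgroup selection procedure are not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_cate(X, treated, pseudo_outcome, **rf_kwargs):
    """T-learner sketch: fit separate outcome models on treated and control
    units using pre-computed pseudo-outcomes, then estimate the conditional
    average treatment effect (CATE) as the difference of the two predictions."""
    treated = np.asarray(treated).astype(bool)
    m1 = RandomForestRegressor(**rf_kwargs).fit(X[treated], pseudo_outcome[treated])
    m0 = RandomForestRegressor(**rf_kwargs).fit(X[~treated], pseudo_outcome[~treated])
    return m1.predict(X) - m0.predict(X)

# toy usage with a synthetic stand-in for the pseudo-outcome
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
A = rng.integers(0, 2, size=500)
Y = X[:, 0] + A * (1 + X[:, 1]) + rng.normal(size=500)
cate = t_learner_cate(X, A, Y, n_estimators=200, random_state=0)
print(cate[:5])
```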
In this paper, we propose a pseudo-outcome-based framework for analyzing HTE in survival data, which includes a list of meta-learners for estimating HTE, a variable importance metric for identifying predictive variables to HTE, and a data-adaptive procedure to select subgroups with enhanced treatment effects. We evaluate the finite sample performance of the framework under various settings of observational studies. Furthermore, we applied the proposed methods to analyze the treatment heterogeneity of a Written Asthma Action Plan (WAAP) on time-to-ED (Emergency Department) return due to asthma exacerbation using a large asthma electronic health records dataset with visit records expanded from pre- to post-COVID-19 pandemic. We identified vulnerable subgroups of patients with poorer asthma outcomes but enhanced benefits from WAAP and characterized patient profiles. Our research provides valuable insights for healthcare providers on the strategic distribution of WAAP, particularly during disruptive public health crises, ultimately improving the management and control of pediatric asthma."}, "https://arxiv.org/abs/2407.01631": {"title": "Model Identifiability for Bivariate Failure Time Data with Competing Risks: Parametric Cause-specific Hazards and Non-parametric Frailty", "link": "https://arxiv.org/abs/2407.01631", "description": "arXiv:2407.01631v1 Announce Type: new \nAbstract: One of the commonly used approaches to capture dependence in multivariate survival data is through the frailty variables. The identifiability issues should be carefully investigated while modeling multivariate survival with or without competing risks. The use of non-parametric frailty distribution(s) is sometimes preferred for its robustness and flexibility properties. In this paper, we consider modeling of bivariate survival data with competing risks through four different kinds of non-parametric frailty and parametric baseline cause-specific hazard functions to investigate the corresponding model identifiability. We make the common assumption of the frailty mean being equal to unity."}, "https://arxiv.org/abs/2407.01763": {"title": "A Cepstral Model for Efficient Spectral Analysis of Covariate-dependent Time Series", "link": "https://arxiv.org/abs/2407.01763", "description": "arXiv:2407.01763v1 Announce Type: new \nAbstract: This article introduces a novel and computationally fast model to study the association between covariates and power spectra of replicated time series. A random covariate-dependent Cram\\'{e}r spectral representation and a semiparametric log-spectral model are used to quantify the association between the log-spectra and covariates. Each replicate-specific log-spectrum is represented by the cepstrum, inducing a cepstral-based multivariate linear model with the cepstral coefficients as the responses. By using only a small number of cepstral coefficients, the model parsimoniously captures frequency patterns of time series and saves a significant amount of computational time compared to existing methods. A two-stage estimation procedure is proposed. In the first stage, a Whittle likelihood-based approach is used to estimate the truncated replicate-specific cepstral coefficients. In the second stage, parameters of the cepstral-based multivariate linear model, and consequently the effect functions of covariates, are estimated. 
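A crude sketch of the cepstral representation described in the covariate-dependent spectral analysis entry above: cepstral coefficients are obtained from the log-periodogram of each replicate series and then regressed on covariates. The raw log-periodogram first stage is a simple stand-in for the Whittle-likelihood-based estimate in the paper, and the simulated data and variable names are illustrative assumptions.

```python
import numpy as np

def cepstral_coefficients(x, n_coef=10):
    """First-stage sketch: periodogram -> log-periodogram -> inverse FFT,
    keeping only the first few (truncated) cepstral coefficients."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    fx = np.fft.rfft(x - x.mean())
    pgram = (np.abs(fx) ** 2) / n          # periodogram at Fourier frequencies
    pgram = np.maximum(pgram, 1e-12)       # guard the logarithm
    ceps = np.fft.irfft(np.log(pgram), n=n)  # real cepstrum of the series
    return ceps[:n_coef]

# second-stage sketch: regress replicate-specific cepstra on a covariate
rng = np.random.default_rng(3)
covariate = rng.normal(size=40)
ceps_matrix = np.vstack([
    cepstral_coefficients(rng.normal(scale=1 + 0.5 * c, size=256))
    for c in np.abs(covariate)
])
design = np.column_stack([np.ones_like(covariate), covariate])
beta_hat, *_ = np.linalg.lstsq(design, ceps_matrix, rcond=None)  # multivariate OLS
print(beta_hat.shape)  # (2, n_coef): intercept and covariate effect per cepstral coefficient
```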
The model is flexible in the sense that it can accommodate various estimation methods for the multivariate linear model, depending on the application, domain knowledge, or characteristics of the covariates. Numerical studies confirm that the proposed method outperforms some existing methods despite its simplicity and shorter computational time. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2407.01765": {"title": "A General Framework for Design-Based Treatment Effect Estimation in Paired Cluster-Randomized Experiments", "link": "https://arxiv.org/abs/2407.01765", "description": "arXiv:2407.01765v1 Announce Type: new \nAbstract: Paired cluster-randomized experiments (pCRTs) are common across many disciplines because there is often natural clustering of individuals, and paired randomization can help balance baseline covariates to improve experimental precision. Although pCRTs are common, there is surprisingly no obvious way to analyze this randomization design if an individual-level (rather than cluster-level) treatment effect is of interest. Variance estimation is also complicated due to the dependency created through pairing clusters. Therefore, we aim to provide an intuitive and practical comparison between different estimation strategies in pCRTs in order to inform practitioners' choice of strategy. To this end, we present a general framework for design-based estimation in pCRTs for average individual effects. This framework offers a novel and intuitive view on the bias-variance trade-off between estimators and emphasizes the benefits of covariate adjustment for estimation with pCRTs. In addition to providing a general framework for estimation in pCRTs, the point and variance estimators we present support fixed-sample unbiased estimation with similar precision to a common regression model and consistently conservative variance estimation. Through simulation studies, we compare the performance of the point and variance estimators reviewed. Finally, we compare the performance of estimators with simulations using real data from an educational efficacy trial. Our analysis and simulation studies inform the choice of point and variance estimators for analyzing pCRTs in practice."}, "https://arxiv.org/abs/2407.01770": {"title": "Exploring causal effects of hormone- and radio-treatments in an observational study of breast cancer using copula-based semi-competing risks models", "link": "https://arxiv.org/abs/2407.01770", "description": "arXiv:2407.01770v1 Announce Type: new \nAbstract: Breast cancer patients may experience relapse or death after surgery during the follow-up period, leading to dependent censoring of relapse. This phenomenon, known as semi-competing risk, imposes challenges in analyzing treatment effects on breast cancer and necessitates advanced statistical tools for unbiased analysis. Despite progress in estimation and inference within semi-competing risks regression, its application to causal inference is still in its early stages. This article aims to propose a frequentist and semi-parametric framework based on copula models that can facilitate valid causal inference, net quantity estimation and interpretation, and sensitivity analysis for unmeasured factors under right-censored semi-competing risks data. We also propose novel procedures to enhance parameter estimation and its applicability in real practice. 
After that, we apply the proposed framework to a breast cancer study and detect the time-varying causal effects of hormone- and radio-treatments on patients' relapse-free survival and overall survival. Moreover, extensive numerical evaluations demonstrate the method's feasibility, highlighting minimal estimation bias and reliable statistical inference."}, "https://arxiv.org/abs/2407.01868": {"title": "Forecast Linear Augmented Projection (FLAP): A free lunch to reduce forecast error variance", "link": "https://arxiv.org/abs/2407.01868", "description": "arXiv:2407.01868v1 Announce Type: new \nAbstract: A novel forecast linear augmented projection (FLAP) method is introduced, which reduces the forecast error variance of any unbiased multivariate forecast without introducing bias. The method first constructs new component series which are linear combinations of the original series. Forecasts are then generated for both the original and component series. Finally, the full vector of forecasts is projected onto a linear subspace where the constraints implied by the combination weights hold. It is proven that the trace of the forecast error variance is non-increasing with the number of components, and mild conditions are established for which it is strictly decreasing. It is also shown that the proposed method achieves maximum forecast error variance reduction among linear projection methods. The theoretical results are validated through simulations and two empirical applications based on Australian tourism and FRED-MD data. Notably, using FLAP with Principal Component Analysis (PCA) to construct the new series leads to substantial forecast error variance reduction."}, "https://arxiv.org/abs/2407.01883": {"title": "Robust Linear Mixed Models using Hierarchical Gamma-Divergence", "link": "https://arxiv.org/abs/2407.01883", "description": "arXiv:2407.01883v1 Announce Type: new \nAbstract: Linear mixed models (LMMs), which typically assume normality for both the random effects and error terms, are a popular class of methods for analyzing longitudinal and clustered data. However, such models can be sensitive to outliers, and this can lead to poor statistical results (e.g., biased inference on model parameters and inaccurate prediction of random effects) if the data are contaminated. We propose a new approach to robust estimation and inference for LMMs using a hierarchical gamma divergence, which offers an automated, data-driven approach to downweight the effects of outliers occurring in both the error, and the random effects, using normalized powered density weights. For estimation and inference, we develop a computationally scalable minorization-maximization algorithm for the resulting objective function, along with a clustered bootstrap method for uncertainty quantification and a Hyvarinen score criterion for selecting a tuning parameter controlling the degree of robustness. When the genuine and contamination mixed effects distributions are sufficiently separated, then under suitable regularity conditions assuming the number of clusters tends to infinity, we show the resulting robust estimates can be asymptotically controlled even under a heavy level of (covariate-dependent) contamination. Simulation studies demonstrate hierarchical gamma divergence consistently outperforms several currently available methods for robustifying LMMs, under a wide range of scenarios of outlier generation at both the response and random effects levels. 
We illustrate the proposed method using data from a multi-center AIDS cohort study, where the use of a robust LMMs using hierarchical gamma divergence approach produces noticeably different results compared to methods that do not adequately adjust for potential outlier contamination."}, "https://arxiv.org/abs/2407.02085": {"title": "Regularized estimation of Monge-Kantorovich quantiles for spherical data", "link": "https://arxiv.org/abs/2407.02085", "description": "arXiv:2407.02085v1 Announce Type: new \nAbstract: Tools from optimal transport (OT) theory have recently been used to define a notion of quantile function for directional data. In practice, regularization is mandatory for applications that require out-of-sample estimates. To this end, we introduce a regularized estimator built from entropic optimal transport, by extending the definition of the entropic map to the spherical setting. We propose a stochastic algorithm to directly solve a continuous OT problem between the uniform distribution and a target distribution, by expanding Kantorovich potentials in the basis of spherical harmonics. In addition, we define the directional Monge-Kantorovich depth, a companion concept for OT-based quantiles. We show that it benefits from desirable properties related to Liu-Zuo-Serfling axioms for the statistical analysis of directional data. Building on our regularized estimators, we illustrate the benefits of our methodology for data analysis."}, "https://arxiv.org/abs/2407.02178": {"title": "Reverse time-to-death as time-scale in time-to-event analysis for studies of advanced illness and palliative care", "link": "https://arxiv.org/abs/2407.02178", "description": "arXiv:2407.02178v1 Announce Type: new \nAbstract: Background: Incidence of adverse outcome events rises as patients with advanced illness approach end-of-life. Exposures that tend to occur near end-of-life, e.g., use of wheelchair, oxygen therapy and palliative care, may therefore be found associated with the incidence of the adverse outcomes. We propose a strategy for time-to-event analysis to mitigate the time-varying confounding. Methods: We propose a concept of reverse time-to-death (rTTD) and its use for the time-scale in time-to-event analysis. We used data on community-based palliative care uptake (exposure) and emergency department visits (outcome) among patients with advanced cancer in Singapore to illustrate. We compare the results against that of the common practice of using time-on-study (TOS) as time-scale. Results: Graphical analysis demonstrated that cancer patients receiving palliative care had higher rate of emergency department visits than non-recipients mainly because they were closer to end-of-life, and that rTTD analysis made comparison between patients at the same time-to-death. Analysis of emergency department visits in relation to palliative care using TOS time-scale showed significant increase in hazard ratio estimate when observed time-varying covariates were omitted from statistical adjustment (change-in-estimate=0.38; 95% CI 0.15 to 0.60). There was no such change in otherwise the same analysis using rTTD (change-in-estimate=0.04; 95% CI -0.02 to 0.11), demonstrating the ability of rTTD time-scale to mitigate confounding that intensifies in relation to time-to-death. 
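A minimal pandas sketch of re-expressing follow-up on a reverse time-to-death (rTTD) scale as described in the entry above, assuming hypothetical column names and that the time remaining until death is known at study entry and at the event or censoring time; the anchoring at a common horizon is also an assumption for illustration, chosen so that the transformed times increase and can be passed to any survival routine that supports left truncation.

```python
import pandas as pd

def add_reverse_time_to_death(df, horizon=None):
    """Sketch: anchor all subjects at a common horizon and measure time as
    'horizon minus time-to-death', so the scale increases toward death.
    Expects hypothetical columns 'entry_ttd' and 'exit_ttd' (days until death
    at study entry and at the event/censoring time)."""
    horizon = horizon if horizon is not None else df["entry_ttd"].max()
    out = df.copy()
    out["rttd_entry"] = horizon - out["entry_ttd"]
    out["rttd_exit"] = horizon - out["exit_ttd"]
    return out

# toy usage: two decedents with known death dates
toy = pd.DataFrame({
    "entry_ttd": [400.0, 250.0],  # days to death at study entry
    "exit_ttd": [120.0, 30.0],    # days to death at the outcome/censoring time
    "event": [1, 0],
})
print(add_reverse_time_to_death(toy))
```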
Conclusion: Use of rTTD as time-scale in time-to-event analysis provides a simple and robust approach to control time-varying confounding in studies of advanced illness, even if the confounders are unmeasured."}, "https://arxiv.org/abs/2407.02183": {"title": "How do financial variables impact public debt growth in China? An empirical study based on Markov regime-switching model", "link": "https://arxiv.org/abs/2407.02183", "description": "arXiv:2407.02183v1 Announce Type: new \nAbstract: The deep financial turmoil in China caused by the COVID-19 pandemic has exacerbated fiscal shocks and soaring public debt levels, which raises concerns about the stability and sustainability of China's public debt growth in the future. This paper employs the Markov regime-switching model with time-varying transition probability (TVTP-MS) to investigate the growth pattern of China's public debt and the impact of financial variables such as credit, house prices and stock prices on the growth of public debt. We identify two distinct regimes of China's public debt, i.e., the surge regime with high growth rate and high volatility and the steady regime with low growth rate and low volatility. The main results are twofold. On the one hand, an increase in the growth rate of the financial variables helps to moderate the growth rate of public debt, whereas the effects differ between the two regimes. More specifically, the impacts of credit and house prices are significant in the surge regime, whereas stock prices affect public debt growth significantly in the steady regime. On the other hand, a higher growth rate of financial variables also increases the probability of public debt either staying in or switching to the steady regime. These findings highlight the necessity of aligning financial adjustments with the prevailing public debt regime when developing sustainable fiscal policies."}, "https://arxiv.org/abs/2407.02262": {"title": "Conditional Forecasts in Large Bayesian VARs with Multiple Equality and Inequality Constraints", "link": "https://arxiv.org/abs/2407.02262", "description": "arXiv:2407.02262v1 Announce Type: new \nAbstract: Conditional forecasts, i.e. projections of a set of variables of interest on the future paths of some other variables, are used routinely by empirical macroeconomists in a number of applied settings. In spite of this, the existing algorithms used to generate conditional forecasts tend to be very computationally intensive, especially when working with large Vector Autoregressions or when multiple linear equality and inequality constraints are imposed at once. We introduce a novel precision-based sampler that is fast, scales well, and yields conditional forecasts from linear equality and inequality constraints. We show in a simulation study that the proposed method produces forecasts that are identical to those from the existing algorithms but in a fraction of the time. 
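The basic operation behind such conditional forecasts, conditioning a joint Gaussian predictive distribution on linear equality constraints, can be written in a few lines of numpy. This is the textbook conditioning formula shown only for orientation; it is not the paper's precision-based sampler, which additionally handles inequality constraints and scales to large systems.

```python
import numpy as np

def condition_gaussian(mu, Sigma, C, c):
    """Condition z ~ N(mu, Sigma) on the linear restriction C z = c and
    return the constrained mean and covariance (standard formulas)."""
    S = C @ Sigma @ C.T
    K = Sigma @ C.T @ np.linalg.solve(S, np.eye(S.shape[0]))
    mu_c = mu + K @ (c - C @ mu)
    Sigma_c = Sigma - K @ C @ Sigma
    return mu_c, Sigma_c

# toy usage: three future variables, with the first constrained to equal 1.0
mu = np.array([0.5, 0.2, -0.1])
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
C = np.array([[1.0, 0.0, 0.0]])
c = np.array([1.0])
print(condition_gaussian(mu, Sigma, C, c))
```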
We then illustrate the performance of our method in a large Bayesian Vector Autoregression where we simultaneously impose a mix of linear equality and inequality constraints on the future trajectories of key US macroeconomic indicators over the 2020--2022 period."}, "https://arxiv.org/abs/2407.02367": {"title": "Rediscovering Bottom-Up: Effective Forecasting in Temporal Hierarchies", "link": "https://arxiv.org/abs/2407.02367", "description": "arXiv:2407.02367v1 Announce Type: new \nAbstract: Forecast reconciliation has become a prominent topic in recent forecasting literature, with a primary distinction made between cross-sectional and temporal hierarchies. This work focuses on temporal hierarchies, such as aggregating monthly time series data to annual data. We explore the impact of various forecast reconciliation methods on temporally aggregated ARIMA models, thereby bridging the fields of hierarchical forecast reconciliation and temporal aggregation both theoretically and experimentally. Our paper is the first to theoretically examine the effects of temporal hierarchical forecast reconciliation, demonstrating that the optimal method aligns with a bottom-up aggregation approach. To assess the practical implications and performance of the reconciled forecasts, we conduct a series of simulation studies, confirming that the findings extend to more complex models. This result helps explain the strong performance of the bottom-up approach observed in many prior studies. Finally, we apply our methods to real data examples, where we observe similar results."}, "https://arxiv.org/abs/2407.02401": {"title": "Fuzzy Social Network Analysis: Theory and Application in a University Department's Collaboration Network", "link": "https://arxiv.org/abs/2407.02401", "description": "arXiv:2407.02401v1 Announce Type: new \nAbstract: Social network analysis (SNA) helps us understand the relationships and interactions between individuals, groups, organisations, or other social entities. In SNA, ties are generally binary or weighted based on their strength. Nonetheless, when actors are individuals, the relationships between actors are often imprecise and identifying them with simple scalars leads to information loss. Social relationships are often vague in real life. Despite many classical social network techniques contemplate the use of weighted links, these approaches do not align with the original philosophy of fuzzy logic, which instead aims to preserve the vagueness inherent in human language and real life. Dealing with imprecise ties and introducing fuzziness in the definition of relationships requires an extension of social network analysis to fuzzy numbers instead of crisp values. The mathematical formalisation for this generalisation needs to extend classical centrality indices and operations to fuzzy numbers. For this reason, this paper proposes a generalisation of the so-called Fuzzy Social Network Analysis (FSNA) to the context of imprecise relationships among actors. 
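As a small illustration of carrying fuzziness through a centrality index in the spirit of the FSNA entry above: when ties are triangular fuzzy numbers, fuzzy addition is componentwise, so a fuzzy degree centrality can be computed directly. The specific centrality indices in the paper may be defined differently; this is only a sketch.

```python
import numpy as np

def fuzzy_degree_centrality(fuzzy_adj):
    """Degree centrality when ties are triangular fuzzy numbers (l, m, u).
    Fuzzy addition of triangular numbers is componentwise, so each node's
    fuzzy degree is the componentwise sum of its tie strengths.
    `fuzzy_adj` has shape (n, n, 3)."""
    fuzzy_adj = np.asarray(fuzzy_adj, dtype=float)
    return fuzzy_adj.sum(axis=1)  # shape (n, 3): one (l, m, u) triple per node

# toy usage: 3 actors with imprecise collaboration strengths
A = np.zeros((3, 3, 3))
A[0, 1] = A[1, 0] = (0.2, 0.5, 0.8)
A[1, 2] = A[2, 1] = (0.6, 0.7, 0.9)
print(fuzzy_degree_centrality(A))
```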
The article shows the theory and application of real data collected through a fascinating mouse tracking technique to study the fuzzy relationships in a collaboration network among the members of a University department."}, "https://arxiv.org/abs/2407.01621": {"title": "Deciphering interventional dynamical causality from non-intervention systems", "link": "https://arxiv.org/abs/2407.01621", "description": "arXiv:2407.01621v1 Announce Type: cross \nAbstract: Detecting and quantifying causality is a focal topic in the fields of science, engineering, and interdisciplinary studies. However, causal studies on non-intervention systems attract much attention but remain extremely challenging. To address this challenge, we propose a framework named Interventional Dynamical Causality (IntDC) for such non-intervention systems, along with its computational criterion, Interventional Embedding Entropy (IEE), to quantify causality. The IEE criterion theoretically and numerically enables the deciphering of IntDC solely from observational (non-interventional) time-series data, without requiring any knowledge of dynamical models or real interventions in the considered system. Demonstrations of performance showed the accuracy and robustness of IEE on benchmark simulated systems as well as real-world systems, including the neural connectomes of C. elegans, COVID-19 transmission networks in Japan, and regulatory networks surrounding key circadian genes."}, "https://arxiv.org/abs/2407.01623": {"title": "Uncertainty estimation in satellite precipitation spatial prediction by combining distributional regression algorithms", "link": "https://arxiv.org/abs/2407.01623", "description": "arXiv:2407.01623v1 Announce Type: cross \nAbstract: To facilitate effective decision-making, gridded satellite precipitation products should include uncertainty estimates. Machine learning has been proposed for issuing such estimates. However, most existing algorithms for this purpose rely on quantile regression. Distributional regression offers distinct advantages over quantile regression, including the ability to model intermittency as well as a stronger ability to extrapolate beyond the training data, which is critical for predicting extreme precipitation. In this work, we introduce the concept of distributional regression for the engineering task of creating precipitation datasets through data merging. Building upon this concept, we propose new ensemble learning methods that can be valuable not only for spatial prediction but also for prediction problems in general. These methods exploit conditional zero-adjusted probability distributions estimated with generalized additive models for location, scale, and shape (GAMLSS), spline-based GAMLSS and distributional regression forests as well as their ensembles (stacking based on quantile regression, and equal-weight averaging). To identify the most effective methods for our specific problem, we compared them to benchmarks using a large, multi-source precipitation dataset. Stacking emerged as the most successful strategy. Three specific stacking methods achieved the best performance based on the quantile scoring rule, although the ranking of these methods varied across quantile levels. 
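The quantile (pinball) scoring rule referred to above, used to rank the stacking and benchmark methods across quantile levels, can be sketched as follows; the toy comparison at the bottom is illustrative only.

```python
import numpy as np

def pinball_loss(y_true, q_pred, tau):
    """Quantile (pinball) score for a predicted tau-quantile; lower is better."""
    y_true, q_pred = np.asarray(y_true, float), np.asarray(q_pred, float)
    diff = y_true - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# toy usage: compare two candidate 0.9-quantile predictors on skewed data
rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.0, size=1000)      # skewed, precipitation-like
pred_a = np.full_like(y, np.quantile(y, 0.9))        # well-placed constant quantile
pred_b = np.full_like(y, np.quantile(y, 0.5))        # systematically too low
print(pinball_loss(y, pred_a, 0.9), pinball_loss(y, pred_b, 0.9))
```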
This suggests that a task-specific combination of multiple algorithms could yield significant benefits."}, "https://arxiv.org/abs/2407.01751": {"title": "Asymptotic tests for monotonicity and convexity of a probability mass function", "link": "https://arxiv.org/abs/2407.01751", "description": "arXiv:2407.01751v1 Announce Type: cross \nAbstract: In shape-constrained nonparametric inference, it is often necessary to perform preliminary tests to verify whether a probability mass function (p.m.f.) satisfies qualitative constraints such as monotonicity, convexity or in general $k$-monotonicity. In this paper, we are interested in testing $k$-monotonicity of a compactly supported p.m.f. and we put our main focus on monotonicity and convexity; i.e., $k \\in \\{1,2\\}$. We consider new testing procedures that are directly derived from the definition of $k$-monotonicity and rely exclusively on the empirical measure, as well as tests that are based on the projection of the empirical measure on the class of $k$-monotone p.m.f.s. The asymptotic behaviour of the introduced test statistics is derived and a simulation study is performed to assess the finite sample performance of all the proposed tests. Applications to real datasets are presented to illustrate the theory."}, "https://arxiv.org/abs/2407.01794": {"title": "Conditionally valid Probabilistic Conformal Prediction", "link": "https://arxiv.org/abs/2407.01794", "description": "arXiv:2407.01794v1 Announce Type: cross \nAbstract: We develop a new method for creating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution $P_{Y \\mid X}$. Most existing methods, such as conformalized quantile regression and probabilistic conformal prediction, only offer marginal coverage guarantees. Our approach extends these methods to achieve conditional coverage, which is essential for many practical applications. While exact conditional guarantees are impossible without assumptions on the data distribution, we provide non-asymptotic bounds that explicitly depend on the quality of the available estimate of the conditional distribution. Our confidence sets are highly adaptive to the local structure of the data, making them particularly useful in high heteroskedasticity situations. We demonstrate the effectiveness of our approach through extensive simulations, showing that it outperforms existing methods in terms of conditional coverage and improves the reliability of statistical inference in a wide range of applications."}, "https://arxiv.org/abs/2407.01874": {"title": "Simultaneous semiparametric inference for single-index models", "link": "https://arxiv.org/abs/2407.01874", "description": "arXiv:2407.01874v1 Announce Type: cross \nAbstract: In the common partially linear single-index model we establish a Bahadur representation for a smoothing spline estimator of all model parameters and use this result to prove the joint weak convergence of the estimator of the index link function at a given point, together with the estimators of the parametric regression coefficients. We obtain the surprising result that, despite the nature of single-index models where the link function is evaluated at a linear combination of the index-coefficients, the estimator of the link function and the estimator of the index-coefficients are asymptotically independent. 
Our approach leverages a delicate analysis based on reproducing kernel Hilbert space and empirical process theory.\n We show that the smoothing spline estimator achieves the minimax optimal rate with respect to the $L^2$-risk and consider several statistical applications where joint inference on all model parameters is of interest. In particular, we develop a simultaneous confidence band for the link function and propose inference tools to investigate if the maximum absolute deviation between the (unknown) link function and a given function exceeds a given threshold. We also construct tests for joint hypotheses regarding model parameters which involve both the nonparametric and parametric components and propose novel multiplier bootstrap procedures to avoid the estimation of unknown asymptotic quantities."}, "https://arxiv.org/abs/2208.07044": {"title": "On minimum contrast method for multivariate spatial point processes", "link": "https://arxiv.org/abs/2208.07044", "description": "arXiv:2208.07044v3 Announce Type: replace \nAbstract: Compared to widely used likelihood-based approaches, the minimum contrast (MC) method offers a computationally efficient method for estimation and inference of spatial point processes. These relative gains in computing time become more pronounced when analyzing complicated multivariate point process models. Despite this, there has been little exploration of the MC method for multivariate spatial point processes. Therefore, this article introduces a new MC method for parametric multivariate spatial point processes. A contrast function is computed based on the trace of the power of the difference between the conjectured $K$-function matrix and its nonparametric unbiased edge-corrected estimator. Under standard assumptions, we derive the asymptotic normality of our MC estimator. The performance of the proposed method is demonstrated through simulation studies of bivariate log-Gaussian Cox processes and five-variate product-shot-noise Cox processes."}, "https://arxiv.org/abs/2401.00249": {"title": "Forecasting CPI inflation under economic policy and geopolitical uncertainties", "link": "https://arxiv.org/abs/2401.00249", "description": "arXiv:2401.00249v2 Announce Type: replace \nAbstract: Forecasting consumer price index (CPI) inflation is of paramount importance for both academics and policymakers at the central banks. This study introduces a filtered ensemble wavelet neural network (FEWNet) to forecast CPI inflation, which is tested on BRIC countries. FEWNet breaks down inflation data into high and low-frequency components using wavelets and utilizes them along with other economic factors (economic policy uncertainty and geopolitical risk) to produce forecasts. All the wavelet-transformed series and filtered exogenous variables are fed into downstream autoregressive neural networks to make the final ensemble forecast. Theoretically, we show that FEWNet reduces the empirical risk compared to fully connected autoregressive neural networks. FEWNet is more accurate than other forecasting methods and can also estimate the uncertainty in its predictions due to its capacity to effectively capture non-linearities and long-range dependencies in the data through its adaptable architecture. 
This makes FEWNet a valuable tool for central banks to manage inflation."}, "https://arxiv.org/abs/2211.15771": {"title": "Approximate Gibbs Sampler for Efficient Inference of Hierarchical Bayesian Models for Grouped Count Data", "link": "https://arxiv.org/abs/2211.15771", "description": "arXiv:2211.15771v2 Announce Type: replace-cross \nAbstract: Hierarchical Bayesian Poisson regression models (HBPRMs) provide a flexible modeling approach for the relationship between predictors and count response variables. The applications of HBPRMs to large-scale datasets require efficient inference algorithms due to the high computational cost of inferring many model parameters based on random sampling. Although Markov Chain Monte Carlo (MCMC) algorithms have been widely used for Bayesian inference, sampling using this class of algorithms is time-consuming for applications with large-scale data and time-sensitive decision-making, partially due to the non-conjugacy of many models. To overcome this limitation, this research develops an approximate Gibbs sampler (AGS) to efficiently learn the HBPRMs while maintaining the inference accuracy. In the proposed sampler, the data likelihood is approximated with a Gaussian distribution such that the conditional posterior of the coefficients has a closed-form solution. Numerical experiments using real and synthetic datasets with small and large counts demonstrate the superior performance of AGS in comparison to the state-of-the-art sampling algorithm, especially for large datasets."}, "https://arxiv.org/abs/2407.02583": {"title": "Generalized Ridge Regression: Biased Estimation for Multiple Linear Regression Models", "link": "https://arxiv.org/abs/2407.02583", "description": "arXiv:2407.02583v1 Announce Type: new \nAbstract: When the regressors of an econometric linear model are nonorthogonal, it is well known that their estimation by ordinary least squares can present various problems that discourage the use of this model. Ridge regression is the most commonly used alternative; however, its generalized version has hardly been analyzed. The present work addresses the estimation of this generalized version, as well as the calculation of its mean squared error, goodness of fit and bootstrap inference."}, "https://arxiv.org/abs/2407.02671": {"title": "When Do Natural Mediation Effects Differ from Their Randomized Interventional Analogues: Test and Theory", "link": "https://arxiv.org/abs/2407.02671", "description": "arXiv:2407.02671v1 Announce Type: new \nAbstract: In causal mediation analysis, the natural direct and indirect effects (natural effects) are nonparametrically unidentifiable in the presence of treatment-induced confounding, which motivated the development of randomized interventional analogues (RIAs) of the natural effects. The RIAs are easier to identify and widely used in practice. Applied researchers often interpret RIA estimates as if they were the natural effects, even though the RIAs could be poor proxies for the natural effects. This calls for practical and theoretical guidance on when the RIAs differ from or coincide with the natural effects, which this paper aims to address. We develop a novel empirical test for the divergence between the RIAs and the natural effects under the weak assumptions sufficient for identifying the RIAs and illustrate the test using the Moving to Opportunity Study. 
We also provide new theoretical insights on the relationship between the RIAs and the natural effects from a covariance perspective and a structural equation perspective. Additionally, we discuss previously undocumented connections between the natural effects, the RIAs, and estimands in instrumental variable analysis and Wilcoxon-Mann-Whitney tests."}, "https://arxiv.org/abs/2407.02676": {"title": "Covariate-dependent hierarchical Dirichlet process", "link": "https://arxiv.org/abs/2407.02676", "description": "arXiv:2407.02676v1 Announce Type: new \nAbstract: The intricacies inherent in contemporary real datasets demand more advanced statistical models to effectively address complex challenges. In this article we delve into problems related to identifying clusters across related groups, when additional covariate information is available. We formulate a novel Bayesian nonparametric approach based on mixture models, integrating ideas from the hierarchical Dirichlet process and \"single-atoms\" dependent Dirichlet process. The proposed method exhibits exceptional generality and flexibility, accommodating both continuous and discrete covariates through the utilization of appropriate kernel functions. We construct a robust and efficient Markov chain Monte Carlo (MCMC) algorithm involving data augmentation to tackle the intractable normalized weights. The versatility of the proposed model extends our capability to discern the relationship between covariates and clusters. Through testing on both simulated and real-world datasets, our model demonstrates its capacity to identify meaningful clusters across groups, providing valuable insights for a spectrum of applications."}, "https://arxiv.org/abs/2407.02684": {"title": "A dimension reduction approach to edge weight estimation for use in spatial models", "link": "https://arxiv.org/abs/2407.02684", "description": "arXiv:2407.02684v1 Announce Type: new \nAbstract: Models for areal data are traditionally defined using the neighborhood structure of the regions on which data are observed. The unweighted adjacency matrix of a graph is commonly used to characterize the relationships between locations, resulting in the implicit assumption that all pairs of neighboring regions interact similarly, an assumption which may not be true in practice. It has been shown that more complex spatial relationships between graph nodes may be represented when edge weights are allowed to vary. Christensen and Hoff (2023) introduced a covariance model for data observed on graphs which is more flexible than traditional alternatives, parameterizing covariance as a function of an unknown edge weights matrix. A potential issue with their approach is that each edge weight is treated as a unique parameter, resulting in increasingly challenging parameter estimation as graph size increases. Within this article we propose a framework for estimating edge weight matrices that reduces their effective dimension via a basis function representation of the edge weights. 
We show that this method may be used to enhance the performance and flexibility of covariance models parameterized by such matrices in a series of illustrations, simulations and data examples."}, "https://arxiv.org/abs/2407.02902": {"title": "Instrumental Variable methods to target Hypothetical Estimands with longitudinal repeated measures data: Application to the STEP 1 trial", "link": "https://arxiv.org/abs/2407.02902", "description": "arXiv:2407.02902v1 Announce Type: new \nAbstract: The STEP 1 randomized trial evaluated the effect of taking semaglutide vs placebo on body weight over a 68 week duration. As with any study evaluating an intervention delivered over a sustained period, non-adherence was observed. This was addressed in the original trial analysis within the Estimand Framework by viewing non-adherence as an intercurrent event. The primary analysis applied a treatment policy strategy which viewed it as an aspect of the treatment regimen, and thus made no adjustment for its presence. A supplementary analysis used a hypothetical strategy, targeting an estimand that would have been realised had all participants adhered, under the assumption that no post-baseline variables confounded adherence and change in body weight. In this paper we propose an alternative Instrumental Variable method to adjust for non-adherence which does not rely on the same `unconfoundedness' assumption and is less vulnerable to positivity violations (e.g., it can give valid results even under conditions where non-adherence is guaranteed). Unlike many previous Instrumental Variable approaches, it makes full use of the repeatedly measured outcome data, and allows for a time-varying effect of treatment adherence on a participant's weight. We show that it provides a natural vehicle for defining two distinct hypothetical estimands: the treatment effect if all participants would have adhered to semaglutide, and the treatment effect if all participants would have adhered to both semaglutide and placebo. When applied to the STEP 1 study, they both suggest a sustained, slowly decaying weight loss effect of semaglutide treatment."}, "https://arxiv.org/abs/2407.03085": {"title": "Accelerated Inference for Partially Observed Markov Processes using Automatic Differentiation", "link": "https://arxiv.org/abs/2407.03085", "description": "arXiv:2407.03085v1 Announce Type: new \nAbstract: Automatic differentiation (AD) has driven recent advances in machine learning, including deep neural networks and Hamiltonian Markov Chain Monte Carlo methods. Partially observed nonlinear stochastic dynamical systems have proved resistant to AD techniques because widely used particle filter algorithms yield an estimated likelihood function that is discontinuous as a function of the model parameters. We show how to embed two existing AD particle filter methods in a theoretical framework that provides an extension to a new class of algorithms. This new class permits a bias/variance tradeoff and hence a mean squared error substantially lower than the existing algorithms. We develop likelihood maximization algorithms suited to the Monte Carlo properties of the AD gradient estimate. Our algorithms require only a differentiable simulator for the latent dynamic system; by contrast, most previous approaches to AD likelihood maximization for particle filters require access to the system's transition probabilities. 
Numerical results indicate that a hybrid algorithm that uses AD to refine a coarse solution from an iterated filtering algorithm shows substantial improvement on current state-of-the-art methods for a challenging scientific benchmark problem."}, "https://arxiv.org/abs/2407.03167": {"title": "Tail calibration of probabilistic forecasts", "link": "https://arxiv.org/abs/2407.03167", "description": "arXiv:2407.03167v1 Announce Type: new \nAbstract: Probabilistic forecasts comprehensively describe the uncertainty in the unknown future outcome, making them essential for decision making and risk management. While several methods have been introduced to evaluate probabilistic forecasts, existing evaluation techniques are ill-suited to the evaluation of tail properties of such forecasts. However, these tail properties are often of particular interest to forecast users due to the severe impacts caused by extreme outcomes. In this work, we introduce a general notion of tail calibration for probabilistic forecasts, which allows forecasters to assess the reliability of their predictions for extreme outcomes. We study the relationships between tail calibration and standard notions of forecast calibration, and discuss connections to peaks-over-threshold models in extreme value theory. Diagnostic tools are introduced and applied in a case study on European precipitation forecasts."}, "https://arxiv.org/abs/2407.03265": {"title": "Policymaker meetings as heteroscedasticity shifters: Identification and simultaneous inference in unstable SVARs", "link": "https://arxiv.org/abs/2407.03265", "description": "arXiv:2407.03265v1 Announce Type: new \nAbstract: We propose a novel approach to identification in structural vector autoregressions (SVARs) that uses external instruments for heteroscedasticity of a structural shock of interest. This approach does not require lead/lag exogeneity for identification, does not require heteroskedasticity to be persistent, and facilitates interpretation of the structural shocks. To implement this identification approach in applications, we develop a new method for simultaneous inference of structural impulse responses and other parameters, employing a dependent wild-bootstrap of local projection estimators. This method is robust to an arbitrary number of unit roots and cointegration relationships, time-varying local means and drifts, and conditional heteroskedasticity of unknown form and can be used with other identification schemes, including Cholesky and the conventional external IV. We show how to construct pointwise and simultaneous confidence bounds for structural impulse responses and how to compute smoothed local projections with the corresponding confidence bounds. Using simulated data from a standard log-linearized DSGE model, we show that the method can reliably recover the true impulse responses in realistic datasets. As an empirical application, we adopt the proposed method in order to identify a monetary policy shock using the dates of FOMC meetings in a standard six-variable VAR. The robustness of our identification and inference methods allows us to construct an instrumental variable for the monetary policy shock that dates back to 1965. The resulting impulse response functions for all variables align with the classical Cholesky identification scheme and are different from the narrative sign restricted Bayesian VAR estimates.
In particular, the response to inflation manifests a price puzzle that is indicative of the cost channel of the interest rates."}, "https://arxiv.org/abs/2407.03279": {"title": "Finely Stratified Rerandomization Designs", "link": "https://arxiv.org/abs/2407.03279", "description": "arXiv:2407.03279v1 Announce Type: new \nAbstract: We study estimation and inference on causal parameters under finely stratified rerandomization designs, which use baseline covariates to match units into groups (e.g. matched pairs), then rerandomize within-group treatment assignments until a balance criterion is satisfied. We show that finely stratified rerandomization does partially linear regression adjustment by design, providing nonparametric control over the covariates used for stratification, and linear control over the rerandomization covariates. We also introduce novel rerandomization criteria, allowing for nonlinear imbalance metrics and proposing a minimax scheme that optimizes the balance criterion using pilot data or prior information provided by the researcher. While the asymptotic distribution of generalized method of moments (GMM) estimators under stratified rerandomization is generically non-Gaussian, we show how to restore asymptotic normality using optimal ex-post linear adjustment. This allows us to provide simple asymptotically exact inference methods for superpopulation parameters, as well as efficient conservative inference methods for finite population parameters."}, "https://arxiv.org/abs/2405.14896": {"title": "Study on spike-and-wave detection in epileptic signals using t-location-scale distribution and the K-nearest neighbors classifier", "link": "https://arxiv.org/abs/2405.14896", "description": "arXiv:2405.14896v1 Announce Type: cross \nAbstract: Pattern classification in electroencephalography (EEG) signals is an important problem in biomedical engineering since it enables the detection of brain activity, particularly the early detection of epileptic seizures. In this paper, we propose a k-nearest neighbors classification for epileptic EEG signals based on a t-location-scale statistical representation to detect spike-and-waves. The proposed approach is demonstrated on a real dataset containing both spike-and-wave events and normal brain function signals, where our performance is evaluated in terms of classification accuracy, sensitivity, and specificity."}, "https://arxiv.org/abs/2407.02657": {"title": "Large Scale Hierarchical Industrial Demand Time-Series Forecasting incorporating Sparsity", "link": "https://arxiv.org/abs/2407.02657", "description": "arXiv:2407.02657v1 Announce Type: cross \nAbstract: Hierarchical time-series forecasting (HTSF) is an important problem for many real-world business applications where the goal is to simultaneously forecast multiple time-series that are related to each other via a hierarchical relation. Recent works, however, do not address two important challenges that are typically observed in many demand forecasting applications at large companies. First, many time-series at lower levels of the hierarchy have high sparsity i.e., they have a significant number of zeros. Most HTSF methods do not address this varying sparsity across the hierarchy. Further, they do not scale well to the large size of the real-world hierarchy typically unseen in benchmarks used in literature. 
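As a concrete illustration of the t-location-scale representation combined with a k-nearest neighbors classifier described in the spike-and-wave abstract above, here is a minimal hypothetical sketch; the synthetic EEG windows and the choice of the fitted (degrees of freedom, location, scale) triplet as the feature vector are illustrative assumptions, not the authors' pipeline.

    import numpy as np
    from scipy.stats import t
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    def t_features(window):
        # Fit a t-location-scale distribution; use (df, loc, scale) as features.
        df, loc, scale = t.fit(window)
        return [df, loc, scale]

    # Synthetic stand-in for EEG windows: heavy-tailed "spike-and-wave" vs near-Gaussian "normal".
    spike = [rng.standard_t(df=2.5, size=256) * 40 for _ in range(100)]
    normal = [rng.normal(scale=20, size=256) for _ in range(100)]
    X = np.array([t_features(w) for w in spike + normal])
    y = np.array([1] * 100 + [0] * 100)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print("accuracy:", clf.score(X_te, y_te))

The same three fitted parameters per window could be replaced by any other summary of the t-location-scale fit; the classifier itself is standard k-NN.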
We resolve both these challenges by proposing HAILS, a novel probabilistic hierarchical model that enables accurate and calibrated probabilistic forecasts across the hierarchy by adaptively modeling sparse and dense time-series with different distributional assumptions and reconciling them to adhere to hierarchical constraints. We show the scalability and effectiveness of our methods by evaluating them against real-world demand forecasting datasets. We deploy HAILS at a large chemical manufacturing company for a product demand forecasting application with over ten thousand products and observe a significant 8.5\\% improvement in forecast accuracy and a 23% improvement for sparse time-series. The enhanced accuracy and scalability make HAILS a valuable tool for improved business planning and customer experience."}, "https://arxiv.org/abs/2407.02702": {"title": "Practical Guide for Causal Pathways and Sub-group Disparity Analysis", "link": "https://arxiv.org/abs/2407.02702", "description": "arXiv:2407.02702v1 Announce Type: cross \nAbstract: In this study, we introduce the application of causal disparity analysis to unveil intricate relationships and causal pathways between sensitive attributes and the targeted outcomes within real-world observational data. Our methodology involves employing causal decomposition analysis to quantify and examine the causal interplay between sensitive attributes and outcomes. We also emphasize the significance of integrating heterogeneity assessment in causal disparity analysis to gain deeper insights into the impact of sensitive attributes within specific sub-groups on outcomes. Our two-step investigation focuses on datasets where race serves as the sensitive attribute. The results on two datasets indicate the benefit of leveraging causal analysis and heterogeneity assessment not only for quantifying biases in the data but also for disentangling their influences on outcomes. We demonstrate that the sub-groups identified by our approach as the most affected by disparities are the ones with the largest ML classification errors. We also show that grouping the data based only on a sensitive attribute is not enough, and through these analyses, we can find sub-groups that are directly affected by disparities. We hope that our findings will encourage the adoption of such methodologies in future ethical AI practices and bias audits, fostering a more equitable and fair technological landscape."}, "https://arxiv.org/abs/2407.02754": {"title": "Is Cross-Validation the Gold Standard to Evaluate Model Performance?", "link": "https://arxiv.org/abs/2407.02754", "description": "arXiv:2407.02754v1 Announce Type: cross \nAbstract: Cross-Validation (CV) is the default choice for evaluating the performance of machine learning models. Despite its wide usage, its statistical benefits have remained only partially understood, especially in challenging nonparametric regimes. In this paper we fill this gap and show that, in fact, for a wide spectrum of models, CV does not statistically outperform the simple \"plug-in\" approach where one reuses training data for testing evaluation. Specifically, in terms of both the asymptotic bias and coverage accuracy of the associated interval for out-of-sample evaluation, $K$-fold CV provably cannot outperform plug-in regardless of the rate at which the parametric or nonparametric models converge.
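To make the CV-versus-plug-in comparison above concrete, here is a small hypothetical simulation contrasting a 5-fold CV estimate of out-of-sample error with the plug-in estimate that simply reuses the training data; the linear-Gaussian data-generating process and model are illustrative assumptions, not the paper's experimental setup.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(size=n)

    model = LinearRegression().fit(X, y)

    # Plug-in evaluation: reuse the training data for testing.
    plug_in = mean_squared_error(y, model.predict(X))

    # 5-fold cross-validation estimate of the same out-of-sample error.
    cv = -cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error",
                          cv=KFold(n_splits=5, shuffle=True, random_state=1)).mean()

    print(f"plug-in MSE: {plug_in:.3f}  5-fold CV MSE: {cv:.3f}  true noise variance: 1.0")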
Leave-one-out CV can have a smaller bias as compared to plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases leave-one-out again does not outperform plug-in once this variability is taken into account. We obtain our theoretical comparisons via a novel higher-order Taylor analysis that allows us to derive necessary conditions for limit theorems of testing evaluations, which applies to model classes that are not amenable to previously known sufficient conditions. Our numerical results demonstrate that plug-in performs indeed no worse than CV across a wide range of examples."}, "https://arxiv.org/abs/2407.03094": {"title": "Conformal Prediction for Causal Effects of Continuous Treatments", "link": "https://arxiv.org/abs/2407.03094", "description": "arXiv:2407.03094v1 Announce Type: cross \nAbstract: Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data."}, "https://arxiv.org/abs/2108.06473": {"title": "Evidence Aggregation for Treatment Choice", "link": "https://arxiv.org/abs/2108.06473", "description": "arXiv:2108.06473v2 Announce Type: replace \nAbstract: Consider a planner who has limited knowledge of the policy's causal impact on a certain local population of interest due to a lack of data, but does have access to the publicized intervention studies performed for similar policies on different populations. How should the planner make use of and aggregate this existing evidence to make her policy decision? Following Manski (2020; Towards Credible Patient-Centered Meta-Analysis, \\textit{Epidemiology}), we formulate the planner's problem as a statistical decision problem with a social welfare objective, and solve for an optimal aggregation rule under the minimax-regret criterion. We investigate the analytical properties, computational feasibility, and welfare regret performance of this rule. 
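As background for the conformal-prediction abstract above, here is a minimal sketch of generic split conformal prediction for regression; it shows only the finite-sample interval construction, not the paper's extension to continuous treatments with estimated propensity scores, and the model and data below are placeholder assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)
    n = 1000
    X = rng.uniform(-2, 2, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=n)

    # Split into a proper training set and a calibration set.
    X_tr, y_tr = X[:500], y[:500]
    X_cal, y_cal = X[500:], y[500:]
    model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

    # Calibration: absolute residuals and a finite-sample-adjusted quantile.
    alpha = 0.1
    scores = np.abs(y_cal - model.predict(X_cal))
    k = int(np.ceil((1 - alpha) * (len(scores) + 1)))
    q = np.sort(scores)[k - 1]

    # Prediction interval for a new point: [f(x) - q, f(x) + q].
    x_new = np.array([[0.5]])
    pred = model.predict(x_new)[0]
    print(f"90% interval: [{pred - q:.2f}, {pred + q:.2f}]")

The coverage guarantee of this construction is marginal and distribution-free; the cited paper's contribution is to retain validity when the conformity scores depend on an estimated propensity for a continuous treatment.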
We apply the minimax regret decision rule to two settings: whether to enact an active labor market policy based on 14 randomized control trial studies; and whether to approve a drug (Remdesivir) for COVID-19 treatment using a meta-database of clinical trials."}, "https://arxiv.org/abs/2112.01709": {"title": "Optimized variance estimation under interference and complex experimental designs", "link": "https://arxiv.org/abs/2112.01709", "description": "arXiv:2112.01709v2 Announce Type: replace \nAbstract: Unbiased and consistent variance estimators generally do not exist for design-based treatment effect estimators because experimenters never observe more than one potential outcome for any unit. The problem is exacerbated by interference and complex experimental designs. Experimenters must accept conservative variance estimators in these settings, but they can strive to minimize conservativeness. In this paper, we show that the task of constructing a minimally conservative variance estimator can be interpreted as an optimization problem that aims to find the lowest estimable upper bound of the true variance given the experimenter's risk preference and knowledge of the potential outcomes. We characterize the set of admissible bounds in the class of quadratic forms, and we demonstrate that the optimization problem is a convex program for many natural objectives. The resulting variance estimators are guaranteed to be conservative regardless of whether the background knowledge used to construct the bound is correct, but the estimators are less conservative if the provided information is reasonably accurate. Numerical results show that the resulting variance estimators can be considerably less conservative than existing estimators, allowing experimenters to draw more informative inferences about treatment effects."}, "https://arxiv.org/abs/2207.10513": {"title": "A flexible and interpretable spatial covariance model for data on graphs", "link": "https://arxiv.org/abs/2207.10513", "description": "arXiv:2207.10513v2 Announce Type: replace \nAbstract: Spatial models for areal data are often constructed such that all pairs of adjacent regions are assumed to have near-identical spatial autocorrelation. In practice, data can exhibit dependence structures more complicated than can be represented under this assumption. In this article we develop a new model for spatially correlated data observed on graphs, which can flexibly represent many types of spatial dependence patterns while retaining aspects of the original graph geometry. Our method implies an embedding of the graph into Euclidean space wherein covariance can be modeled using traditional covariance functions, such as those from the Mat\\'{e}rn family. We parameterize our model using a class of graph metrics compatible with such covariance functions, and which characterize distance in terms of network flow, a property useful for understanding proximity in many ecological settings. By estimating the parameters underlying these metrics, we recover the \"intrinsic distances\" between graph nodes, which assist in the interpretation of the estimated covariance and allow us to better understand the relationship between the observed process and spatial domain. We compare our model to existing methods for spatially dependent graph data, primarily conditional autoregressive models and their variants, and illustrate advantages of our method over traditional approaches.
We fit our model to bird abundance data for several species in North Carolina, and show how it provides insight into the interactions between species-specific spatial distributions and geography."}, "https://arxiv.org/abs/2212.02335": {"title": "Policy Learning with the polle package", "link": "https://arxiv.org/abs/2212.02335", "description": "arXiv:2212.02335v4 Announce Type: replace \nAbstract: The R package polle is a unifying framework for learning and evaluating finite stage policies based on observational data. The package implements a collection of existing and novel methods for causal policy learning including doubly robust restricted Q-learning, policy tree learning, and outcome weighted learning. The package deals with (near) positivity violations by only considering realistic policies. Highly flexible machine learning methods can be used to estimate the nuisance components and valid inference for the policy value is ensured via cross-fitting. The library is built up around a simple syntax with four main functions: policy_data(), policy_def(), policy_learn(), and policy_eval(), used to specify the data structure, define user-specified policies, specify policy learning methods and evaluate (learned) policies. The functionality of the package is illustrated via extensive reproducible examples."}, "https://arxiv.org/abs/2306.08940": {"title": "Spatial modeling of extremes and an angular component", "link": "https://arxiv.org/abs/2306.08940", "description": "arXiv:2306.08940v2 Announce Type: replace \nAbstract: Many environmental processes such as rainfall, wind or snowfall are inherently spatial and the modelling of extremes has to take into account that feature. In addition, environmental processes are often associated with an angle, e.g., wind speed and direction or extreme snowfall and time of occurrence in the year. This article proposes a Bayesian hierarchical model with a conditional independence assumption that aims at modelling simultaneously spatial extremes and an angular component. The proposed model relies on extreme value theory as well as recent developments for handling directional statistics over a continuous domain. Working within a Bayesian setting, a Gibbs sampler is introduced whose performance is analysed through a simulation study. The paper ends with an application on extreme wind speed in France. Results show that extreme wind events in France mainly come from the West, apart from the Mediterranean part of France and the Alps."}, "https://arxiv.org/abs/1607.00393": {"title": "Frequentist properties of Bayesian inequality tests", "link": "https://arxiv.org/abs/1607.00393", "description": "arXiv:1607.00393v4 Announce Type: replace-cross \nAbstract: Bayesian and frequentist criteria fundamentally differ, but often posterior and sampling distributions agree asymptotically (e.g., Gaussian with same covariance). For the corresponding single-draw experiment, we characterize the frequentist size of a certain Bayesian hypothesis test of (possibly nonlinear) inequalities. If the null hypothesis is that the (possibly infinite-dimensional) parameter lies in a certain half-space, then the Bayesian test's size is $\\alpha$; if the null hypothesis is a subset of a half-space, then size is above $\\alpha$; and in other cases, size may be above, below, or equal to $\\alpha$. Rejection probabilities at certain points in the parameter space are also characterized.
Two examples illustrate our results: translog cost function curvature and ordinal distribution relationships."}, "https://arxiv.org/abs/2010.03832": {"title": "Estimation of the Spectral Measure from Convex Combinations of Regularly Varying Random Vectors", "link": "https://arxiv.org/abs/2010.03832", "description": "arXiv:2010.03832v2 Announce Type: replace-cross \nAbstract: The extremal dependence structure of a regularly varying random vector X is fully described by its limiting spectral measure. In this paper, we investigate how to recover characteristics of the measure, such as extremal coefficients, from the extremal behaviour of convex combinations of components of X. Our considerations result in a class of new estimators of moments of the corresponding combinations for the spectral vector. We show asymptotic normality by means of a functional limit theorem and, focusing on the estimation of extremal coefficients, we verify that the minimal asymptotic variance can be achieved by a plug-in estimator using subsampling bootstrap. We illustrate the benefits of our approach on simulated and real data."}, "https://arxiv.org/abs/2407.03379": {"title": "missForestPredict -- Missing data imputation for prediction settings", "link": "https://arxiv.org/abs/2407.03379", "description": "arXiv:2407.03379v1 Announce Type: new \nAbstract: Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times."}, "https://arxiv.org/abs/2407.03383": {"title": "Continuous Optimization for Offline Change Point Detection and Estimation", "link": "https://arxiv.org/abs/2407.03383", "description": "arXiv:2407.03383v1 Announce Type: new \nAbstract: This work explores the use of novel advances in best subset selection for regression modelling via continuous optimization for offline change point detection and estimation in univariate Gaussian data sequences. The approach exploits reformulating the normal mean multiple change point model into a regularized statistical inverse problem enforcing sparsity. After introducing the problem statement, criteria and previous investigations via Lasso-regularization, the recently developed framework of continuous optimization for best subset selection (COMBSS) is briefly introduced and related to the problem at hand.
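To illustrate the reformulation just described, where the normal-mean multiple change-point model becomes a sparse regularized regression, here is a hypothetical sketch using the familiar Lasso penalty on step-function regressors; COMBSS itself replaces this L1 penalty with a continuous best-subset relaxation, which is not reproduced here, and the simulated sequence is an illustrative assumption.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    # Piecewise-constant mean with true changes at t = 40 and t = 70.
    mu = np.concatenate([np.zeros(40), 2 * np.ones(30), -1 * np.ones(30)])
    y = mu + rng.normal(scale=0.5, size=100)
    n = len(y)

    # Step-function design: column j is the indicator of t > j, so a nonzero
    # coefficient at j corresponds to a mean shift starting at time j + 1.
    X = np.tril(np.ones((n, n)), k=-1)[:, :-1]

    fit = Lasso(alpha=0.1, fit_intercept=True).fit(X, y)
    changes = np.flatnonzero(np.abs(fit.coef_) > 1e-6) + 1
    print("estimated change points:", changes)

The intercept absorbs the baseline level; sparsity in the step coefficients is what encodes "few change points" in this inverse-problem view.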
Supervised and unsupervised perspectives are explored with the latter testing different approaches for the choice of regularization penalty parameters via the discrepancy principle and a confidence bound. The main result is an adaptation and evaluation of the COMBSS approach for offline normal mean multiple change-point detection via experimental results on simulated data for different choices of regularisation parameters. Results and future directions are discussed."}, "https://arxiv.org/abs/2407.03389": {"title": "A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data", "link": "https://arxiv.org/abs/2407.03389", "description": "arXiv:2407.03389v1 Announce Type: new \nAbstract: In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The method is a variant of the Deterministic Information Bottleneck algorithm which optimally compresses the data while retaining relevant information about the underlying structure. We compare the performance of the proposed method to that of three well-established clustering methods (KAMILA, K-Prototypes, and Partitioning Around Medoids with Gower's dissimilarity) on simulated and real-world datasets. The results demonstrate that the proposed approach represents a competitive alternative to conventional clustering techniques under specific conditions."}, "https://arxiv.org/abs/2407.03420": {"title": "Balancing events, not patients, maximizes power of the logrank test: and other insights on unequal randomization in survival trials", "link": "https://arxiv.org/abs/2407.03420", "description": "arXiv:2407.03420v1 Announce Type: new \nAbstract: We revisit the question of what randomization ratio (RR) maximizes power of the logrank test in event-driven survival trials under proportional hazards (PH). By comparing three approximations of the logrank test (Schoenfeld, Freedman, Rubinstein) to empirical simulations, we find that the RR that maximizes power is the RR that balances number of events across treatment arms at the end of the trial. This contradicts the common misconception implied by Schoenfeld's approximation that 1:1 randomization maximizes power. Besides power, we consider other factors that might influence the choice of RR (accrual, trial duration, sample size, etc.). We perform simulations to better understand how unequal randomization might impact these factors in practice. Altogether, we derive 6 insights to guide statisticians in the design of survival trials considering unequal randomization."}, "https://arxiv.org/abs/2407.03539": {"title": "Population Size Estimation with Many Lists and Heterogeneity: A Conditional Log-Linear Model Among the Unobserved", "link": "https://arxiv.org/abs/2407.03539", "description": "arXiv:2407.03539v1 Announce Type: new \nAbstract: We contribute a general and flexible framework to estimate the size of a closed population in the presence of $K$ capture-recapture lists and heterogeneous capture probabilities. Our novel identifying strategy leverages the fact that it is sufficient for identification that a subset of the $K$ lists are not arbitrarily dependent \\textit{within the subset of the population unobserved by the remaining lists}, conditional on covariates. 
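As a worked companion to the logrank-power abstract above: under Schoenfeld's normal approximation, power for d events, hazard ratio HR and allocation fraction p is approximately Phi(sqrt(d p (1 - p)) |log HR| - z_{1-alpha/2}). The short sketch below evaluates this formula over allocation fractions and reproduces the implication, which the abstract argues is misleading, that p = 1/2 maximizes power; it is not the paper's simulation study, and d and HR are placeholder values.

    import numpy as np
    from scipy.stats import norm

    def schoenfeld_power(d, hr, p, alpha=0.05):
        # Schoenfeld approximation: logrank Z ~ N(log(hr) * sqrt(d * p * (1 - p)), 1).
        z_crit = norm.ppf(1 - alpha / 2)
        ncp = abs(np.log(hr)) * np.sqrt(d * p * (1 - p))
        return norm.cdf(ncp - z_crit)

    d, hr = 200, 0.7
    for p in [0.3, 0.4, 0.5, 0.6, 0.7]:
        print(f"allocation fraction {p:.1f}: power {schoenfeld_power(d, hr, p):.3f}")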
This identification approach is interpretable and actionable, interpolating between the two predominant approaches in the literature as special cases: (conditional) independence across lists and log-linear models with no highest-order interaction. We derive nonparametric doubly-robust estimators for the resulting identification expression that are nearly optimal and approximately normal for any finite sample size, even when the heterogeneous capture probabilities are estimated nonparametrically using machine learning methods. Additionally, we devise a sensitivity analysis to show how deviations from the identification assumptions affect the resulting population size estimates, allowing for the integration of domain-specific knowledge into the identification and estimation processes more transparently. We empirically demonstrate the advantages of our method using both synthetic data and real data from the Peruvian internal armed conflict to estimate the number of casualties. The proposed methodology addresses recent critiques of capture-recapture models by allowing for a weaker and more interpretable identifying assumption and accommodating complex heterogeneous capture probabilities depending on high-dimensional or continuous covariates."}, "https://arxiv.org/abs/2407.03558": {"title": "Aggregated Sure Independence Screening for Variable Selection with Interaction Structures", "link": "https://arxiv.org/abs/2407.03558", "description": "arXiv:2407.03558v1 Announce Type: new \nAbstract: A new method called the aggregated sure independence screening is proposed for the computational challenges in variable selection of interactions when the number of explanatory variables is much higher than the number of observations (i.e., $p\gg n$). In this problem, the two main challenges are the strong hierarchical restriction and the number of candidates for the main effects and interactions. If $n$ is a few hundred and $p$ is ten thousand, then the memory needed for the augmented matrix of the full model is more than $100{\rm GB}$ in size, beyond the memory capacity of a personal computer. This issue can be solved by our proposed method but not by our competitors. Two advantages are that the proposed method can include important interactions even if the related main effects are weak or absent, and it can be combined with an arbitrary variable selection method for interactions. The research addresses the main concern for variable selection of interactions because it makes previous methods applicable to the case when $p$ is extremely large."}, "https://arxiv.org/abs/2407.03616": {"title": "When can weak latent factors be statistically inferred?", "link": "https://arxiv.org/abs/2407.03616", "description": "arXiv:2407.03616v1 Announce Type: new \nAbstract: This article establishes a new and comprehensive estimation and inference theory for principal component analysis (PCA) under the weak factor model that allows for cross-sectionally dependent idiosyncratic components under nearly minimal factor strength relative to the noise level, or signal-to-noise ratio. Our theory is applicable regardless of the relative growth rate between the cross-sectional dimension $N$ and temporal dimension $T$. This more realistic assumption and notable result require a completely new technical device, as the commonly-used leave-one-out trick is no longer applicable to the case with cross-sectional dependence.
Another notable advancement of our theory is on PCA inference $ - $ for example, under the regime where $N\\asymp T$, we show that the asymptotic normality for the PCA-based estimator holds as long as the signal-to-noise ratio (SNR) grows faster than a polynomial rate of $\\log N$. This finding significantly surpasses prior work that required a polynomial rate of $N$. Our theory is entirely non-asymptotic, offering finite-sample characterizations for both the estimation error and the uncertainty level of statistical inference. A notable technical innovation is our closed-form first-order approximation of PCA-based estimator, which paves the way for various statistical tests. Furthermore, we apply our theories to design easy-to-implement statistics for validating whether given factors fall in the linear spans of unknown latent factors, testing structural breaks in the factor loadings for an individual unit, checking whether two units have the same risk exposures, and constructing confidence intervals for systematic risks. Our empirical studies uncover insightful correlations between our test results and economic cycles."}, "https://arxiv.org/abs/2407.03619": {"title": "Multivariate Representations of Univariate Marked Hawkes Processes", "link": "https://arxiv.org/abs/2407.03619", "description": "arXiv:2407.03619v1 Announce Type: new \nAbstract: Univariate marked Hawkes processes are used to model a range of real-world phenomena including earthquake aftershock sequences, contagious disease spread, content diffusion on social media platforms, and order book dynamics. This paper illustrates a fundamental connection between univariate marked Hawkes processes and multivariate Hawkes processes. Exploiting this connection renders a framework that can be built upon for expressive and flexible inference on diverse data. Specifically, multivariate unmarked Hawkes representations are introduced as a tool to parameterize univariate marked Hawkes processes. We show that such multivariate representations can asymptotically approximate a large class of univariate marked Hawkes processes, are stationary given the approximated process is stationary, and that resultant conditional intensity parameters are identifiable. A simulation study demonstrates the efficacy of this approach, and provides heuristic bounds for error induced by the relatively larger parameter space of multivariate Hawkes processes."}, "https://arxiv.org/abs/2407.03690": {"title": "Robust CATE Estimation Using Novel Ensemble Methods", "link": "https://arxiv.org/abs/2407.03690", "description": "arXiv:2407.03690v1 Announce Type: new \nAbstract: The estimation of Conditional Average Treatment Effects (CATE) is crucial for understanding the heterogeneity of treatment effects in clinical trials. We evaluate the performance of common methods, including causal forests and various meta-learners, across a diverse set of scenarios revealing that each of the methods fails in one or more of the tested scenarios. Given the inherent uncertainty of the data-generating process in real-life scenarios, the robustness of a CATE estimator to various scenarios is critical for its reliability.\n To address this limitation of existing methods, we propose two new ensemble methods that integrate multiple estimators to enhance prediction stability and performance - Stacked X-Learner which uses the X-Learner with model stacking for estimating the nuisance functions, and Consensus Based Averaging (CBA), which averages only the models with highest internal agreement. 
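For orientation on the ensemble-CATE abstract above, here is a generic X-learner sketch with gradient-boosting base learners; the paper's Stacked X-Learner additionally stacks models for the nuisance functions, and CBA averages only estimators with high internal agreement, neither of which is reproduced here. The simulated data-generating process is an illustrative assumption.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

    rng = np.random.default_rng(4)
    n = 2000
    X = rng.normal(size=(n, 3))
    e = 1 / (1 + np.exp(-X[:, 0]))          # true propensity
    W = rng.binomial(1, e)                   # treatment indicator
    tau = 1 + X[:, 1]                        # true CATE
    Y = X[:, 0] + W * tau + rng.normal(size=n)

    # Stage 1: outcome models fit separately in each arm.
    mu0 = GradientBoostingRegressor().fit(X[W == 0], Y[W == 0])
    mu1 = GradientBoostingRegressor().fit(X[W == 1], Y[W == 1])

    # Stage 2: imputed individual effects, then CATE models per arm.
    d1 = Y[W == 1] - mu0.predict(X[W == 1])
    d0 = mu1.predict(X[W == 0]) - Y[W == 0]
    tau1 = GradientBoostingRegressor().fit(X[W == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[W == 0], d0)

    # Stage 3: propensity-weighted combination of the two CATE models.
    g = GradientBoostingClassifier().fit(X, W).predict_proba(X)[:, 1]
    cate = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
    print("RMSE of CATE estimate:", np.sqrt(np.mean((cate - tau) ** 2)))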
We show that these models achieve good performance across a wide range of scenarios varying in complexity, sample size and structure of the underlying-mechanism, including a biologically driven model for PD-L1 inhibition pathway for cancer treatment."}, "https://arxiv.org/abs/2407.03725": {"title": "Under the null of valid specification, pre-tests of valid specification do not distort inference", "link": "https://arxiv.org/abs/2407.03725", "description": "arXiv:2407.03725v1 Announce Type: new \nAbstract: Consider a parameter of interest, which can be consistently estimated under some conditions. Suppose also that we can at least partly test these conditions with specification tests. We consider the common practice of conducting inference on the parameter of interest conditional on not rejecting these tests. We show that if the tested conditions hold, conditional inference is valid but possibly conservative. This holds generally, without imposing any assumption on the asymptotic dependence between the estimator of the parameter of interest and the specification test."}, "https://arxiv.org/abs/2407.03726": {"title": "Absolute average and median treatment effects as causal estimands on metric spaces", "link": "https://arxiv.org/abs/2407.03726", "description": "arXiv:2407.03726v1 Announce Type: new \nAbstract: We define the notions of absolute average and median treatment effects as causal estimands on general metric spaces such as Riemannian manifolds, propose estimators using stratification, and prove several properties, including strong consistency. In the process, we also demonstrate the strong consistency of the weighted sample Fr\\'echet means and geometric medians. Stratification allows these estimators to be utilized beyond the narrow constraints of a completely randomized experiment. After constructing confidence intervals using bootstrapping, we outline how to use the proposed estimates to test Fisher's sharp null hypothesis that the absolute average or median treatment effect is zero. Empirical evidence for the strong consistency of the estimators and the reasonable asymptotic coverage of the confidence intervals is provided through simulations in both randomized experiments and observational study settings. We also apply our methods to real data from an observational study to investigate the causal relationship between Alzheimer's disease and the shape of the corpus callosum, rejecting the aforementioned null hypotheses in cases where conventional Euclidean methods fail to do so. Our proposed methods are more generally applicable than past studies in dealing with general metric spaces."}, "https://arxiv.org/abs/2407.03774": {"title": "Mixture Modeling for Temporal Point Processes with Memory", "link": "https://arxiv.org/abs/2407.03774", "description": "arXiv:2407.03774v1 Announce Type: new \nAbstract: We propose a constructive approach to building temporal point processes that incorporate dependence on their history. The dependence is modeled through the conditional density of the duration, i.e., the interval between successive event times, using a mixture of first-order conditional densities for each one of a specific number of lagged durations. Such a formulation for the conditional duration density accommodates high-order dynamics, and it thus enables flexible modeling for point processes with memory. The implied conditional intensity function admits a representation as a local mixture of first-order hazard functions. 
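One plausible formalization of the mixture construction described above, stated only as a reading of the abstract rather than the authors' exact specification: the conditional duration density is a lag-indexed mixture of first-order densities, which in turn implies that the conditional hazard is a local (duration-dependent) mixture of the component hazards,

    f(\tau_i \mid \tau_{i-1}, \ldots, \tau_{i-L}) = \sum_{l=1}^{L} w_l \, f_l(\tau_i \mid \tau_{i-l}), \qquad \sum_{l=1}^{L} w_l = 1,

    h(\tau \mid \cdot) = \frac{f(\tau \mid \cdot)}{S(\tau \mid \cdot)}
                       = \sum_{l=1}^{L} \pi_l(\tau) \, h_l(\tau \mid \tau_{i-l}), \qquad
    \pi_l(\tau) = \frac{w_l \, S_l(\tau \mid \tau_{i-l})}{\sum_{m=1}^{L} w_m \, S_m(\tau \mid \tau_{i-m})},

where f_l, S_l and h_l denote the density, survival function and hazard of the l-th first-order component; the local weights \pi_l(\tau) vary with the elapsed duration, which is what makes the hazard a "local" mixture.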
By specifying appropriate families of distributions for the first-order conditional densities, with different shapes for the associated hazard functions, we can obtain either self-exciting or self-regulating point processes. From the perspective of duration processes, we develop a method to specify a stationary marginal density. The resulting model, interpreted as a dependent renewal process, introduces high-order Markov dependence among identically distributed durations. Furthermore, we provide extensions to cluster point processes. These can describe duration clustering behaviors attributed to different factors, thus expanding the scope of the modeling framework to a wider range of applications. Regarding implementation, we develop a Bayesian approach to inference, model checking, and prediction. We investigate point process model properties analytically, and illustrate the methodology with both synthetic and real data examples."}, "https://arxiv.org/abs/2407.04071": {"title": "Three- and four-parameter item response model in factor analysis framework", "link": "https://arxiv.org/abs/2407.04071", "description": "arXiv:2407.04071v1 Announce Type: new \nAbstract: This work proposes a 4-parameter factor analytic (4P FA) model for multi-item measurements composed of binary items as an extension to the dichotomized single latent variable FA model. We provide an analytical derivation of the relationship between the newly proposed 4P FA model and its counterpart in the item response theory (IRT) framework, the 4P IRT model. A Bayesian estimation method for the proposed 4P FA model is provided to estimate the four item parameters, the respondents' latent scores, and the scores cleaned of the guessing and inattention effects. The newly proposed algorithm is implemented in R and Python, and the relationship between the 4P FA and 4P IRT is empirically demonstrated using real datasets from admission tests and the assessment of anxiety."}, "https://arxiv.org/abs/2407.04104": {"title": "Network-based Neighborhood regression", "link": "https://arxiv.org/abs/2407.04104", "description": "arXiv:2407.04104v1 Announce Type: new \nAbstract: Given the ubiquity of modularity in biological systems, module-level regulation analysis is vital for understanding biological systems across various levels and their dynamics. Current statistical analysis on biological modules predominantly focuses on either detecting the functional modules in biological networks or sub-group regression on the biological features without using the network data. This paper proposes a novel network-based neighborhood regression framework whose regression functions depend on both the global community-level information and local connectivity structures among entities. An efficient community-wise least square optimization approach is developed to uncover the strength of regulation among the network modules while enabling asymptotic inference. With random graph theory, we derive non-asymptotic estimation error bounds for the proposed estimator, achieving exact minimax optimality. Unlike the root-n consistency typical in canonical linear regression, our model exhibits linear consistency in the number of nodes n, highlighting the advantage of incorporating neighborhood information. The effectiveness of the proposed framework is further supported by extensive numerical experiments. 
Application to whole-exome sequencing and RNA-sequencing Autism datasets demonstrates the usage of the proposed method in identifying the association between the gene modules of genetic variations and the gene modules of genomic differential expressions."}, "https://arxiv.org/abs/2407.04142": {"title": "Bayesian Structured Mediation Analysis With Unobserved Confounders", "link": "https://arxiv.org/abs/2407.04142", "description": "arXiv:2407.04142v1 Announce Type: new \nAbstract: We explore methods to reduce the impact of unobserved confounders on the causal mediation analysis of high-dimensional mediators with spatially smooth structures, such as brain imaging data. The key approach is to incorporate the latent individual effects, which influence the structured mediators, as unobserved confounders in the outcome model, thereby potentially debiasing the mediation effects. We develop BAyesian Structured Mediation analysis with Unobserved confounders (BASMU) framework, and establish its model identifiability conditions. Theoretical analysis is conducted on the asymptotic bias of the Natural Indirect Effect (NIE) and the Natural Direct Effect (NDE) when the unobserved confounders are omitted in mediation analysis. For BASMU, we propose a two-stage estimation algorithm to mitigate the impact of these unobserved confounders on estimating the mediation effect. Extensive simulations demonstrate that BASMU substantially reduces the bias in various scenarios. We apply BASMU to the analysis of fMRI data in the Adolescent Brain Cognitive Development (ABCD) study, focusing on four brain regions previously reported to exhibit meaningful mediation effects. Compared with the existing image mediation analysis method, BASMU identifies two to four times more voxels that have significant mediation effects, with the NIE increased by 41%, and the NDE decreased by 26%."}, "https://arxiv.org/abs/2407.04437": {"title": "Overeducation under different macroeconomic conditions: The case of Spanish university graduates", "link": "https://arxiv.org/abs/2407.04437", "description": "arXiv:2407.04437v1 Announce Type: new \nAbstract: This paper examines the incidence and persistence of overeducation in the early careers of Spanish university graduates. We investigate the role played by the business cycle and field of study and their interaction in shaping both phenomena. We also analyse the relevance of specific types of knowledge and skills as driving factors in reducing overeducation risk. We use data from the Survey on the Labour Insertion of University Graduates (EILU) conducted by the Spanish National Statistics Institute in 2014 and 2019. The survey collects rich information on cohorts that graduated in the 2009/2010 and 2014/2015 academic years during the Great Recession and the subsequent economic recovery, respectively. Our results show, first, the relevance of the economic scenario when graduates enter the labour market. Graduation during a recession increased overeducation risk and persistence. Second, a clear heterogeneous pattern occurs across fields of study, with health sciences graduates displaying better performance in terms of both overeducation incidence and persistence and less impact of the business cycle. Third, we find evidence that some transversal skills (language, IT, management) can help to reduce overeducation risk in the absence of specific knowledge required for the job, thus indicating some kind of compensatory role. Finally, our findings have important policy implications. 
Overeducation, and more importantly overeducation persistence, imply a non-neglectable misallocation of resources. Therefore, policymakers need to address this issue in the design of education and labour market policies."}, "https://arxiv.org/abs/2407.04446": {"title": "Random-Effect Meta-Analysis with Robust Between-Study Variance", "link": "https://arxiv.org/abs/2407.04446", "description": "arXiv:2407.04446v1 Announce Type: new \nAbstract: Meta-analyses are widely employed to demonstrate strong evidence across numerous studies. On the other hand, in the context of rare diseases, meta-analyses are often conducted with a limited number of studies in which the analysis methods are based on theoretical frameworks assuming that the between-study variance is known. That is, the estimate of between-study variance is substituted for the true value, neglecting the randomness with the between-study variance estimated from the data. Consequently, excessively narrow confidence intervals for the overall treatment effect for meta-analyses have been constructed in only a few studies. In the present study, we propose overcoming this problem by estimating the distribution of between-study variance using the maximum likelihood-like estimator. We also suggest an approach for estimating the overall treatment effect via the distribution of the between-study variance. Our proposed method can extend many existing approaches to allow more adequate estimation under a few studies. Through simulation and analysis of real data, we demonstrate that our method remains consistently conservative compared to existing methods, which enables meta-analyses to consider the randomness of the between-study variance."}, "https://arxiv.org/abs/2407.04448": {"title": "Learning control variables and instruments for causal analysis in observational data", "link": "https://arxiv.org/abs/2407.04448", "description": "arXiv:2407.04448v1 Announce Type: new \nAbstract: This study introduces a data-driven, machine learning-based method to detect suitable control variables and instruments for assessing the causal effect of a treatment on an outcome in observational data, if they exist. Our approach tests the joint existence of instruments, which are associated with the treatment but not directly with the outcome (at least conditional on observables), and suitable control variables, conditional on which the treatment is exogenous, and learns the partition of instruments and control variables from the observed data. The detection of sets of instruments and control variables relies on the condition that proper instruments are conditionally independent of the outcome given the treatment and suitable control variables. We establish the consistency of our method for detecting control variables and instruments under certain regularity conditions, investigate the finite sample performance through a simulation study, and provide an empirical application to labor market data from the Job Corps study."}, "https://arxiv.org/abs/2407.04530": {"title": "A spatial-correlated multitask linear mixed-effects model for imaging genetics", "link": "https://arxiv.org/abs/2407.04530", "description": "arXiv:2407.04530v1 Announce Type: new \nAbstract: Imaging genetics aims to uncover the hidden relationship between imaging quantitative traits (QTs) and genetic markers (e.g. single nucleotide polymorphism (SNP)), and brings valuable insights into the pathogenesis of complex diseases, such as cancers and cognitive disorders (e.g. the Alzheimer's Disease). 
However, most linear models in imaging genetics didn't explicitly model the inner relationship among QTs, which might miss some potential efficiency gains from information borrowing across brain regions. In this work, we developed a novel Bayesian regression framework for identifying significant associations between QTs and genetic markers while explicitly modeling spatial dependency between QTs, with the main contributions as follows. Firstly, we developed a spatial-correlated multitask linear mixed-effects model (LMM) to account for dependencies between QTs. We incorporated a population-level mixed effects term into the model, taking full advantage of the dependent structure of brain imaging-derived QTs. Secondly, we implemented the model in the Bayesian framework and derived a Markov chain Monte Carlo (MCMC) algorithm to achieve the model inference. Further, we incorporated the MCMC samples with the Cauchy combination test (CCT) to examine the association between SNPs and QTs, which avoided computationally intractable multi-test issues. The simulation studies indicated improved power of our proposed model compared to classic models where inner dependencies of QTs were not modeled. We also applied the new spatial model to an imaging dataset obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database."}, "https://arxiv.org/abs/2407.04659": {"title": "Simulation-based Calibration of Uncertainty Intervals under Approximate Bayesian Estimation", "link": "https://arxiv.org/abs/2407.04659", "description": "arXiv:2407.04659v1 Announce Type: new \nAbstract: The mean field variational Bayes (VB) algorithm implemented in Stan is relatively fast and efficient, making it feasible to produce model-estimated official statistics on a rapid timeline. Yet, while consistent point estimates of parameters are achieved for continuous data models, the mean field approximation often produces inaccurate uncertainty quantification to the extent that parameters are correlated a posteriori. In this paper, we propose a simulation procedure that calibrates uncertainty intervals for model parameters estimated under approximate algorithms to achieve nominal coverages. Our procedure detects and corrects biased estimation of both first and second moments of approximate marginal posterior distributions induced by any estimation algorithm that produces consistent first moments under specification of the correct model. The method generates replicate datasets using parameters estimated in an initial model run. The model is subsequently re-estimated on each replicate dataset, and we use the empirical distribution over the re-samples to formulate calibrated confidence intervals of parameter estimates of the initial model run that are guaranteed to asymptotically achieve nominal coverage. We demonstrate the performance of our procedure in Monte Carlo simulation study and apply it to real data from the Current Employment Statistics survey."}, "https://arxiv.org/abs/2407.04667": {"title": "The diameter of a stochastic matrix: A new measure for sensitivity analysis in Bayesian networks", "link": "https://arxiv.org/abs/2407.04667", "description": "arXiv:2407.04667v1 Announce Type: new \nAbstract: Bayesian networks are one of the most widely used classes of probabilistic models for risk management and decision support because of their interpretability and flexibility in including heterogeneous pieces of information. 
In any applied modelling, it is critical to assess how robust the inferences on certain target variables are to changes in the model. In Bayesian networks, these analyses fall under the umbrella of sensitivity analysis, which is most commonly carried out by quantifying dissimilarities using Kullback-Leibler information measures. In this paper, we argue that robustness methods based instead on the familiar total variation distance provide simple and more valuable bounds on robustness to misspecification, which are both formally justifiable and transparent. We introduce a novel measure of dependence in conditional probability tables called the diameter to derive such bounds. This measure quantifies the strength of dependence between a variable and its parents. We demonstrate how such formal robustness considerations can be embedded in building a Bayesian network."}, "https://arxiv.org/abs/2407.03336": {"title": "Efficient and Precise Calculation of the Confluent Hypergeometric Function", "link": "https://arxiv.org/abs/2407.03336", "description": "arXiv:2407.03336v1 Announce Type: cross \nAbstract: Kummer's function, also known as the confluent hypergeometric function (CHF), is an important mathematical function, in particular due to its many special cases, which include the Bessel function, the incomplete Gamma function and the error function (erf). The CHF has no closed form expression, but instead is most commonly expressed as an infinite sum of ratios of rising factorials, which makes its precise and efficient calculation challenging. It is a function of three parameters, the first two being the rising factorial base of the numerator and denominator, and the third being a scale parameter. Accurate and efficient calculation for large values of the scale parameter is particularly challenging due to numeric underflow and overflow which easily occur when summing the underlying component terms. This work presents an elegant and precise mathematical algorithm for the calculation of the CHF, which is of particular advantage for large values of the scale parameter. This method massively reduces the number and range of component terms which need to be summed to achieve any required precision, thus obviating the need for the computationally intensive transformations needed by current algorithms."}, "https://arxiv.org/abs/2407.03781": {"title": "Block-diagonal idiosyncratic covariance estimation in high-dimensional factor models for financial time series", "link": "https://arxiv.org/abs/2407.03781", "description": "arXiv:2407.03781v1 Announce Type: cross \nAbstract: Estimation of high-dimensional covariance matrices in latent factor models is an important topic in many fields and especially in finance. Since the number of financial assets grows while the estimation window length remains of limited size, the often used sample estimator yields noisy estimates which are not even positive definite. Under the assumption of latent factor models, the covariance matrix is decomposed into a common low-rank component and a full-rank idiosyncratic component. In this paper we focus on the estimation of the idiosyncratic component, under the assumption of a grouped structure of the time series, which may arise due to specific factors such as industries, asset classes or countries. We propose a generalized methodology for estimation of the block-diagonal idiosyncratic component by clustering the residual series and applying shrinkage to the obtained blocks in order to ensure positive definiteness. 
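To make the numerical difficulty discussed in the confluent hypergeometric abstract above tangible, here is a naive term-recurrence summation of Kummer's series M(a, b, z) = sum_{n>=0} (a)_n z^n / ((b)_n n!), compared against scipy's reference routine; the individual terms grow rapidly for large z, which is exactly the overflow/underflow problem the paper targets, and this sketch is not the paper's algorithm.

    import numpy as np
    from scipy.special import hyp1f1  # reference implementation for comparison

    def kummer_naive(a, b, z, tol=1e-15, max_terms=10_000):
        # Sum M(a, b, z) term by term using the recurrence
        # term_{n+1} = term_n * (a + n) * z / ((b + n) * (n + 1)).
        term, total = 1.0, 1.0
        for n in range(max_terms):
            term *= (a + n) * z / ((b + n) * (n + 1))
            total += term
            if abs(term) < tol * abs(total):
                break
        return total

    for z in [1.0, 10.0, 50.0]:
        print(z, kummer_naive(0.5, 1.5, z), hyp1f1(0.5, 1.5, z))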
We derive two different estimators based on different clustering methods and test their performance using simulation and historical data. The proposed methods are shown to provide reliable estimates and outperform other state-of-the-art estimators based on thresholding methods."}, "https://arxiv.org/abs/2407.04138": {"title": "Online Bayesian changepoint detection for network Poisson processes with community structure", "link": "https://arxiv.org/abs/2407.04138", "description": "arXiv:2407.04138v1 Announce Type: cross \nAbstract: Network point processes often exhibit latent structure that govern the behaviour of the sub-processes. It is not always reasonable to assume that this latent structure is static, and detecting when and how this driving structure changes is often of interest. In this paper, we introduce a novel online methodology for detecting changes within the latent structure of a network point process. We focus on block-homogeneous Poisson processes, where latent node memberships determine the rates of the edge processes. We propose a scalable variational procedure which can be applied on large networks in an online fashion via a Bayesian forgetting factor applied to sequential variational approximations to the posterior distribution. The proposed framework is tested on simulated and real-world data, and it rapidly and accurately detects changes to the latent edge process rates, and to the latent node group memberships, both in an online manner. In particular, in an application on the Santander Cycles bike-sharing network in central London, we detect changes within the network related to holiday periods and lockdown restrictions between 2019 and 2020."}, "https://arxiv.org/abs/2407.04214": {"title": "Investigating symptom duration using current status data: a case study of post-acute COVID-19 syndrome", "link": "https://arxiv.org/abs/2407.04214", "description": "arXiv:2407.04214v1 Announce Type: cross \nAbstract: For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least 28 days after testing positive and asked to report current symptom status. This study design yielded current status data: Outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates specialized statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. 
Female sex, history of seasonal allergies, fatigue during acute infection, and higher viral load were associated with slower symptom resolution."}, "https://arxiv.org/abs/2105.03067": {"title": "The $s$-value: evaluating stability with respect to distributional shifts", "link": "https://arxiv.org/abs/2105.03067", "description": "arXiv:2105.03067v4 Announce Type: replace \nAbstract: Common statistical measures of uncertainty such as $p$-values and confidence intervals quantify the uncertainty due to sampling, that is, the uncertainty due to not observing the full population. However, sampling is not the only source of uncertainty. In practice, distributions change between locations and across time. This makes it difficult to gather knowledge that transfers across data sets. We propose a measure of instability that quantifies the distributional instability of a statistical parameter with respect to Kullback-Leibler divergence, that is, the sensitivity of the parameter under general distributional perturbations within a Kullback-Leibler divergence ball. In addition, we quantify the instability of parameters with respect to directional or variable-specific shifts. Measuring instability with respect to directional shifts can be used to detect the type of shifts a parameter is sensitive to. We discuss how such knowledge can inform data collection for improved estimation of statistical parameters under shifted distributions. We evaluate the performance of the proposed measure on real data and show that it can elucidate the distributional instability of a parameter with respect to certain shifts and can be used to improve estimation accuracy under shifted distributions."}, "https://arxiv.org/abs/2107.06141": {"title": "Identification of Average Marginal Effects in Fixed Effects Dynamic Discrete Choice Models", "link": "https://arxiv.org/abs/2107.06141", "description": "arXiv:2107.06141v2 Announce Type: replace \nAbstract: In nonlinear panel data models, fixed effects methods are often criticized because they cannot identify average marginal effects (AMEs) in short panels. The common argument is that identifying AMEs requires knowledge of the distribution of unobserved heterogeneity, but this distribution is not identified in a fixed effects model with a short panel. In this paper, we derive identification results that contradict this argument. In a panel data dynamic logit model, and for $T$ as small as three, we prove the point identification of different AMEs, including causal effects of changes in the lagged dependent variable or the last choice's duration. Our proofs are constructive and provide simple closed-form expressions for the AMEs in terms of probabilities of choice histories. We illustrate our results using Monte Carlo experiments and with an empirical application of a dynamic structural model of consumer brand choice with state dependence."}, "https://arxiv.org/abs/2307.00835": {"title": "Engression: Extrapolation through the Lens of Distributional Regression", "link": "https://arxiv.org/abs/2307.00835", "description": "arXiv:2307.00835v3 Announce Type: replace \nAbstract: Distributional regression aims to estimate the full conditional distribution of a target variable, given covariates. Popular methods include linear and tree-ensemble based quantile regression. We propose a neural network-based distributional regression methodology called `engression'. 
An engression model is generative in the sense that we can sample from the fitted conditional distribution and is also suitable for high-dimensional outcomes. Furthermore, we find that modelling the conditional distribution on training data can constrain the fitted function outside of the training support, which offers a new perspective to the challenging extrapolation problem in nonlinear regression. In particular, for `pre-additive noise' models, where noise is added to the covariates before applying a nonlinear transformation, we show that engression can successfully perform extrapolation under some assumptions such as monotonicity, whereas traditional regression approaches such as least-squares or quantile regression fall short under the same assumptions. Our empirical results, from both simulated and real data, validate the effectiveness of the engression method and indicate that the pre-additive noise model is typically suitable for many real-world scenarios. The software implementations of engression are available in both R and Python."}, "https://arxiv.org/abs/2310.11680": {"title": "Trimmed Mean Group Estimation of Average Effects in Ultra Short T Panels under Correlated Heterogeneity", "link": "https://arxiv.org/abs/2310.11680", "description": "arXiv:2310.11680v2 Announce Type: replace \nAbstract: The commonly used two-way fixed effects estimator is biased under correlated heterogeneity and can lead to misleading inference. This paper proposes a new trimmed mean group (TMG) estimator which is consistent at the irregular rate of n^{1/3} even if the time dimension of the panel is as small as the number of its regressors. Extensions to panels with time effects are provided, and a Hausman test of correlated heterogeneity is proposed. Small sample properties of the TMG estimator (with and without time effects) are investigated by Monte Carlo experiments and shown to be satisfactory and to perform better than other trimmed estimators proposed in the literature. The proposed test of correlated heterogeneity is also shown to have the correct size and satisfactory power. The utility of the TMG approach is illustrated with an empirical application."}, "https://arxiv.org/abs/2311.16614": {"title": "A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections", "link": "https://arxiv.org/abs/2311.16614", "description": "arXiv:2311.16614v4 Announce Type: replace \nAbstract: Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $\\alpha$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. 
Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters."}, "https://arxiv.org/abs/2312.01168": {"title": "MacroPARAFAC for handling rowwise and cellwise outliers in incomplete multi-way data", "link": "https://arxiv.org/abs/2312.01168", "description": "arXiv:2312.01168v2 Announce Type: replace \nAbstract: Multi-way data extend two-way matrices into higher-dimensional tensors, often explored through dimensional reduction techniques. In this paper, we study the Parallel Factor Analysis (PARAFAC) model for handling multi-way data, representing it more compactly through a concise set of loading matrices and scores. We assume that the data may be incomplete and could contain both rowwise and cellwise outliers, signifying cases that deviate from the majority and outlying cells dispersed throughout the data array. To address these challenges, we present a novel algorithm designed to robustly estimate both loadings and scores. Additionally, we introduce an enhanced outlier map to distinguish various patterns of outlying behavior. Through simulations and the analysis of fluorescence Excitation-Emission Matrix (EEM) data, we demonstrate the robustness of our approach. Our results underscore the effectiveness of diagnostic tools in identifying and interpreting unusual patterns within the data."}, "https://arxiv.org/abs/2401.09381": {"title": "Modelling clusters in network time series with an application to presidential elections in the USA", "link": "https://arxiv.org/abs/2401.09381", "description": "arXiv:2401.09381v2 Announce Type: replace \nAbstract: Network time series are becoming increasingly relevant in the study of dynamic processes characterised by a known or inferred underlying network structure. Generalised Network Autoregressive (GNAR) models provide a parsimonious framework for exploiting the underlying network, even in the high-dimensional setting. We extend the GNAR framework by presenting the $\\textit{community}$-$\\alpha$ GNAR model that exploits prior knowledge and/or exogenous variables for identifying and modelling dynamic interactions across communities in the network. We further analyse the dynamics of $\\textit{ Red, Blue}$ and $\\textit{Swing}$ states throughout presidential elections in the USA. Our analysis suggests interesting global and communal effects."}, "https://arxiv.org/abs/2008.04267": {"title": "Robust Validation: Confident Predictions Even When Distributions Shift", "link": "https://arxiv.org/abs/2008.04267", "description": "arXiv:2008.04267v3 Announce Type: replace-cross \nAbstract: While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy -- coming from robust statistics and optimization -- is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. 
An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.'s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity."}, "https://arxiv.org/abs/2208.03737": {"title": "Finite Tests from Functional Characterizations", "link": "https://arxiv.org/abs/2208.03737", "description": "arXiv:2208.03737v5 Announce Type: replace-cross \nAbstract: Classically, testing whether decision makers belong to specific preference classes involves two main approaches. The first, known as the functional approach, assumes access to an entire demand function. The second, the revealed preference approach, constructs inequalities to test finite demand data. This paper bridges these methods by using the functional approach to test finite data through preference learnability results. We develop a computationally efficient algorithm that generates tests for choice data based on functional characterizations of preference families. We provide these restrictions for various applications, including homothetic and weakly separable preferences, where the latter's revealed preference characterization is provably NP-Hard. We also address choice under uncertainty, offering tests for betweenness preferences. Lastly, we perform a simulation exercise demonstrating that our tests are effective in finite samples and accurately reject demands not belonging to a specified class."}, "https://arxiv.org/abs/2302.00934": {"title": "High-dimensional variable clustering based on maxima of a weakly dependent random process", "link": "https://arxiv.org/abs/2302.00934", "description": "arXiv:2302.00934v3 Announce Type: replace-cross \nAbstract: We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which defines population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm depending on a tuning parameter that recovers the clusters of variables without specifying the number of clusters \\emph{a priori}. Our work provides some theoretical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. A data-driven selection method for the tuning parameter is also proposed. To further illustrate the significance of our work, we applied our method to neuroscience and environmental real-datasets. These applications highlight the potential and versatility of the proposed approach."}, "https://arxiv.org/abs/2303.13598": {"title": "Bootstrap-Assisted Inference for Generalized Grenander-type Estimators", "link": "https://arxiv.org/abs/2303.13598", "description": "arXiv:2303.13598v3 Announce Type: replace-cross \nAbstract: Westling and Carone (2020) proposed a framework for studying the large sample distributional properties of generalized Grenander-type estimators, a versatile class of nonparametric estimators of monotone functions. 
The limiting distribution of those estimators is representable as the left derivative of the greatest convex minorant of a Gaussian process whose monomial mean can be of unknown order (when the degree of flatness of the function of interest is unknown). The standard nonparametric bootstrap is unable to consistently approximate the large sample distribution of the generalized Grenander-type estimators even if the monomial order of the mean is known, making statistical inference a challenging endeavour in applications. To address this inferential problem, we present a bootstrap-assisted inference procedure for generalized Grenander-type estimators. The procedure relies on a carefully crafted, yet automatic, transformation of the estimator. Moreover, our proposed method can be made ``flatness robust'' in the sense that it can be made adaptive to the (possibly unknown) degree of flatness of the function of interest. The method requires only the consistent estimation of a single scalar quantity, for which we propose an automatic procedure based on numerical derivative estimation and the generalized jackknife. Under random sampling, our inference method can be implemented using a computationally attractive exchangeable bootstrap procedure. We illustrate our methods with examples and we also provide a small simulation study. The development of formal results is made possible by some technical results that may be of independent interest."}, "https://arxiv.org/abs/2308.13451": {"title": "Gotta match 'em all: Solution diversification in graph matching matched filters", "link": "https://arxiv.org/abs/2308.13451", "description": "arXiv:2308.13451v3 Announce Type: replace-cross \nAbstract: We present a novel approach for finding multiple noisily embedded template graphs in a very large background graph. Our method builds upon the graph-matching-matched-filter technique proposed in Sussman et al., with the discovery of multiple diverse matchings being achieved by iteratively penalizing a suitable node-pair similarity matrix in the matched filter algorithm. In addition, we propose algorithmic speed-ups that greatly enhance the scalability of our matched-filter approach. We present theoretical justification of our methodology in the setting of correlated Erdos-Renyi graphs, showing its ability to sequentially discover multiple templates under mild model conditions. We additionally demonstrate our method's utility via extensive experiments using both simulated models and real-world datasets, including human brain connectomes and a large transactional knowledge base."}, "https://arxiv.org/abs/2407.04812": {"title": "Active-Controlled Trial Design for HIV Prevention Trials with a Counterfactual Placebo", "link": "https://arxiv.org/abs/2407.04812", "description": "arXiv:2407.04812v1 Announce Type: new \nAbstract: In the quest for enhanced HIV prevention methods, the advent of antiretroviral drugs as pre-exposure prophylaxis (PrEP) has marked a significant stride forward. However, the ethical challenges in conducting placebo-controlled trials for new PrEP agents against a backdrop of highly effective existing PrEP options necessitate innovative approaches. This manuscript delves into the design and implementation of active-controlled trials that incorporate a counterfactual placebo estimate - a theoretical estimate of what HIV incidence would have been without effective prevention. 
We introduce a novel statistical framework for regulatory approval of new PrEP agents, predicated on the assumption of an available and consistent counterfactual placebo estimate. Our approach aims to assess the absolute efficacy (i.e., against placebo) of the new PrEP agent relative to the absolute efficacy of the active control. We propose a two-step procedure for hypothesis testing and further develop an approach that addresses potential biases inherent in non-randomized comparison to counterfactual placebos. By exploring different scenarios with moderately and highly effective active controls and counterfactual placebo estimates from various sources, we demonstrate how our design can significantly reduce sample sizes compared to traditional non-inferiority trials and offer a robust framework for evaluating new PrEP agents. This work contributes to the methodological repertoire for HIV prevention trials and underscores the importance of adaptability in the face of ethical and practical challenges."}, "https://arxiv.org/abs/2407.04933": {"title": "A Trigonometric Seasonal Component Model and its Application to Time Series with Two Types of Seasonality", "link": "https://arxiv.org/abs/2407.04933", "description": "arXiv:2407.04933v1 Announce Type: new \nAbstract: A finite trigonometric series model for seasonal time series is considered in this paper. This component model is shown to be useful, in particular, for the modeling of time series with two types of seasonality, a long period and a short period. This component model is also shown to be effective in the case of ordinary seasonal time series with only one seasonal component, if the seasonal pattern is simple and can be well represented by a small number of trigonometric components. As examples, electricity demand data, bi-hourly temperature data, CO2 data, and two economic time series are considered. The last section summarizes the findings from the empirical studies."}, "https://arxiv.org/abs/2407.05001": {"title": "Treatment effect estimation under covariate-adaptive randomization with heavy-tailed outcomes", "link": "https://arxiv.org/abs/2407.05001", "description": "arXiv:2407.05001v1 Announce Type: new \nAbstract: Randomized experiments are the gold standard for investigating causal relationships, with comparisons of potential outcomes under different treatment groups used to estimate treatment effects. However, outcomes with heavy-tailed distributions pose significant challenges to traditional statistical approaches. While recent studies have explored these issues under simple randomization, their application in more complex randomization designs, such as stratified randomization or covariate-adaptive randomization, has not been adequately addressed. To fill the gap, this paper examines the properties of the estimated influence function-based M-estimator under covariate-adaptive randomization with heavy-tailed outcomes, demonstrating its consistency and asymptotic normality. Yet, the existing variance estimator tends to overestimate the asymptotic variance, especially under more balanced designs, and lacks universal applicability across randomization methods. To remedy this, we introduce a novel stratified transformed difference-in-means estimator to enhance efficiency and propose a universally applicable variance estimator to facilitate valid inferences. Additionally, we establish the consistency of kernel-based density estimation in the context of covariate-adaptive randomization. 
Numerical results demonstrate the effectiveness of the proposed methods in finite samples."}, "https://arxiv.org/abs/2407.05089": {"title": "Bayesian network-guided sparse regression with flexible varying effects", "link": "https://arxiv.org/abs/2407.05089", "description": "arXiv:2407.05089v1 Announce Type: new \nAbstract: In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates including sex and dietary intake variables to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors."}, "https://arxiv.org/abs/2407.05159": {"title": "Roughness regularization for functional data analysis with free knots spline estimation", "link": "https://arxiv.org/abs/2407.05159", "description": "arXiv:2407.05159v1 Announce Type: new \nAbstract: In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically, at distinct time intervals. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. The currently proposed FDA methods can often encounter challenges, especially when dealing with curves of varying shapes. This can largely be attributed to the method's strong dependence on data approximation as a key aspect of the analysis process. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data."}, "https://arxiv.org/abs/2407.05241": {"title": "Joint identification of spatially variable genes via a network-assisted Bayesian regularization approach", "link": "https://arxiv.org/abs/2407.05241", "description": "arXiv:2407.05241v1 Announce Type: new \nAbstract: Identifying genes that display spatial patterns is critical to investigating expression interactions within a spatial context and further dissecting biological understanding of complex mechanistic functionality. 
Despite the increase in statistical methods designed to identify spatially variable genes, they are mostly based on marginal analysis and share the limitation that the dependence (network) structures among genes are not well accommodated, where a biological process usually involves changes in multiple genes that interact in a complex network. Moreover, the latent cellular composition within spots may introduce confounding variations, negatively affecting identification accuracy. In this study, we develop a novel Bayesian regularization approach for spatial transcriptomic data, with the confounding variations induced by varying cellular distributions effectively corrected. Significantly advancing from the existing studies, a thresholded graph Laplacian regularization is proposed to simultaneously identify spatially variable genes and accommodate the network structure among genes. The proposed method is based on a zero-inflated negative binomial distribution, effectively accommodating the count nature, zero inflation, and overdispersion of spatial transcriptomic data. Extensive simulations and the application to real data demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2407.05288": {"title": "Efficient Bayesian dynamic closed skew-normal model preserving mean and covariance for spatio-temporal data", "link": "https://arxiv.org/abs/2407.05288", "description": "arXiv:2407.05288v1 Announce Type: new \nAbstract: Although Bayesian skew-normal models are useful for flexibly modeling spatio-temporal processes, they still have difficulty in computation cost and interpretability in their mean and variance parameters, including regression coefficients. To address these problems, this study proposes a spatio-temporal model that incorporates skewness while maintaining mean and variance, by applying the flexible subclass of the closed skew-normal distribution. An efficient sampling method is introduced, leveraging the autoregressive representation of the model. Additionally, the model's symmetry concerning spatial order is demonstrated, and Mardia's skewness and kurtosis are derived, showing independence from the mean and variance. Simulation studies compare the estimation performance of the proposed model with that of the Gaussian model. The result confirms its superiority in high skewness and low observation noise scenarios. The identification of Cobb-Douglas production functions across US states is examined as an application to real data, revealing that the proposed model excels in both goodness-of-fit and predictive performance."}, "https://arxiv.org/abs/2407.05372": {"title": "A Convexified Matching Approach to Imputation and Individualized Inference", "link": "https://arxiv.org/abs/2407.05372", "description": "arXiv:2407.05372v1 Announce Type: new \nAbstract: We introduce a new convexified matching method for missing value imputation and individualized inference inspired by computational optimal transport. Our method integrates favorable features from mainstream imputation approaches: optimal matching, regression imputation, and synthetic control. We impute counterfactual outcomes based on convex combinations of observed outcomes, defined based on an optimal coupling between the treated and control data sets. The optimal coupling problem is considered a convex relaxation to the combinatorial optimal matching problem. 
We estimate granular-level individual treatment effects while maintaining a desirable aggregate-level summary by properly constraining the coupling. We construct transparent, individual confidence intervals for the estimated counterfactual outcomes. We devise fast iterative entropic-regularized algorithms to solve the optimal coupling problem that scales favorably when the number of units to match is large. Entropic regularization plays a crucial role in both inference and computation; it helps control the width of the individual confidence intervals and design fast optimization algorithms."}, "https://arxiv.org/abs/2407.05400": {"title": "Collaborative Analysis for Paired A/B Testing Experiments", "link": "https://arxiv.org/abs/2407.05400", "description": "arXiv:2407.05400v1 Announce Type: new \nAbstract: With the extensive use of digital devices, online experimental platforms are commonly used to conduct experiments to collect data for evaluating different variations of products, algorithms, and interface designs, a.k.a., A/B tests. In practice, multiple A/B testing experiments are often carried out based on a common user population on the same platform. The same user's responses to different experiments can be correlated to some extent due to the individual effect of the user. In this paper, we propose a novel framework that collaboratively analyzes the data from paired A/B tests, namely, a pair of A/B testing experiments conducted on the same set of experimental subjects. The proposed analysis approach for paired A/B tests can lead to more accurate estimates than the traditional separate analysis of each experiment. We obtain the asymptotic distribution of the proposed estimators and demonstrate that the proposed estimators are asymptotically the best linear unbiased estimators under certain assumptions. Moreover, the proposed analysis approach is computationally efficient, easy to implement, and robust to different types of responses. Both numerical simulations and numerical studies based on a real case are used to examine the performance of the proposed method."}, "https://arxiv.org/abs/2407.05431": {"title": "Without Pain -- Clustering Categorical Data Using a Bayesian Mixture of Finite Mixtures of Latent Class Analysis Models", "link": "https://arxiv.org/abs/2407.05431", "description": "arXiv:2407.05431v1 Announce Type: new \nAbstract: We propose a Bayesian approach for model-based clustering of multivariate categorical data where variables are allowed to be associated within clusters and the number of clusters is unknown. The approach uses a two-layer mixture of finite mixtures model where the cluster distributions are approximated using latent class analysis models. A careful specification of priors with suitable hyperparameter values is crucial to identify the two-layer structure and obtain a parsimonious cluster solution. We outline the Bayesian estimation based on Markov chain Monte Carlo sampling with the telescoping sampler and describe how to obtain an identified clustering model by resolving the label switching issue. 
Empirical demonstrations in a simulation study using artificial data as well as a data set on low back pain indicate the good clustering performance of the proposed approach, provided hyperparameters are selected which induce sufficient shrinkage."}, "https://arxiv.org/abs/2407.05470": {"title": "Bayesian Finite Mixture Models", "link": "https://arxiv.org/abs/2407.05470", "description": "arXiv:2407.05470v1 Announce Type: new \nAbstract: Finite mixture models are a useful statistical model class for clustering and density approximation. In the Bayesian framework, finite mixture models require the specification of suitable priors in addition to the data model. These priors allow one to avoid spurious results and provide a principled way to define cluster shapes and a preference for specific cluster solutions. A generic model estimation scheme for finite mixtures with a fixed number of components is available using Markov chain Monte Carlo (MCMC) sampling with data augmentation. The posterior allows one to assess uncertainty in a comprehensive way, but component-specific posterior inference requires resolving the label switching issue.\n In this paper we focus on the application of Bayesian finite mixture models for clustering. We start with discussing suitable specification, estimation and inference of the model if the number of components is assumed to be known. We then continue to explain suitable strategies for fitting Bayesian finite mixture models when the number of components is not known. In addition, all steps required to perform Bayesian finite mixture modeling are illustrated on a data example where a finite mixture model of multivariate Gaussian distributions is fitted. Suitable prior specification, estimation using MCMC and posterior inference are discussed for this example assuming the number of components to be known as well as unknown."}, "https://arxiv.org/abs/2407.05537": {"title": "Optimal treatment strategies for prioritized outcomes", "link": "https://arxiv.org/abs/2407.05537", "description": "arXiv:2407.05537v1 Announce Type: new \nAbstract: Dynamic treatment regimes formalize precision medicine as a sequence of decision rules, one for each stage of clinical intervention, that map current patient information to a recommended intervention. Optimal regimes are typically defined as maximizing some functional of a scalar outcome's distribution, e.g., the distribution's mean or median. However, in many clinical applications, there are multiple outcomes of interest. We consider the problem of estimating an optimal regime when there are multiple outcomes that are ordered by priority but which cannot be readily combined by domain experts into a meaningful single scalar outcome. We propose a definition of optimality in this setting and show that an optimal regime with respect to this definition leads to maximal mean utility under a large class of utility functions. Furthermore, we use inverse reinforcement learning to identify a composite outcome that most closely aligns with our definition within a pre-specified class. 
Simulation experiments and an application to data from a sequential multiple assignment randomized trial (SMART) on HIV/STI prevention illustrate the usefulness of the proposed approach."}, "https://arxiv.org/abs/2407.05543": {"title": "Functional Principal Component Analysis for Truncated Data", "link": "https://arxiv.org/abs/2407.05543", "description": "arXiv:2407.05543v1 Announce Type: new \nAbstract: Functional principal component analysis (FPCA) is a key tool in the study of functional data, driving both exploratory analyses and feature construction for use in formal modeling and testing procedures. However, existing methods for FPCA do not apply when functional observations are truncated, e.g., the measurement instrument only supports recordings within a pre-specified interval, thereby truncating values outside of the range to the nearest boundary. A naive application of existing methods without correction for truncation induces bias. We extend the FPCA framework to accommodate truncated noisy functional data by first recovering smooth mean and covariance surface estimates that are representative of the latent process's mean and covariance functions. Unlike traditional sample covariance smoothing techniques, our procedure yields a positive semi-definite covariance surface, computed without the need to retroactively remove negative eigenvalues in the covariance operator decomposition. Additionally, we construct a FPC score predictor and demonstrate its use in the generalized functional linear model. Convergence rates for the proposed estimators are provided. In simulation experiments, the proposed method yields better predictive performance and lower bias than existing alternatives. We illustrate its practical value through an application to a study with truncated blood glucose measurements."}, "https://arxiv.org/abs/2407.05585": {"title": "Unmasking Bias: A Framework for Evaluating Treatment Benefit Predictors Using Observational Studies", "link": "https://arxiv.org/abs/2407.05585", "description": "arXiv:2407.05585v1 Announce Type: new \nAbstract: Treatment benefit predictors (TBPs) map patient characteristics into an estimate of the treatment benefit tailored to individual patients, which can support optimizing treatment decisions. However, the assessment of their performance might be challenging with the non-random treatment assignment. This study conducts a conceptual analysis, which can be applied to finite-sample studies. We present a framework for evaluating TBPs using observational data from a target population of interest. We then explore the impact of confounding bias on TBP evaluation using measures of discrimination and calibration, which are the moderate calibration and the concentration of the benefit index ($C_b$), respectively. We illustrate that failure to control for confounding can lead to misleading values of performance metrics and establish how the confounding bias propagates to an evaluation bias to quantify the explicit bias for the performance metrics. 
These findings underscore the necessity of accounting for confounding factors when evaluating TBPs, ensuring more reliable and contextually appropriate treatment decisions."}, "https://arxiv.org/abs/2407.05596": {"title": "Methodology for Calculating CO2 Absorption by Tree Planting for Greening Projects", "link": "https://arxiv.org/abs/2407.05596", "description": "arXiv:2407.05596v1 Announce Type: new \nAbstract: In order to explore the possibility of carbon credits for greening projects, which play an important role in climate change mitigation, this paper examines a formula for estimating the amount of carbon fixation for greening activities in urban areas through tree planting. The usefulness of the formula studied was examined by conducting calculations based on actual data through measurements made by on-site surveys of a greening company. A series of calculation results suggest that this formula may be useful. Recognizing carbon credits for green businesses for the carbon sequestration of their projects is an important incentive not only as part of environmental improvement and climate change action, but also to improve the health and well-being of local communities and to generate economic benefits. This study is a pioneering exploration of the methodology."}, "https://arxiv.org/abs/2407.05624": {"title": "Dynamic Matrix Factor Models for High Dimensional Time Series", "link": "https://arxiv.org/abs/2407.05624", "description": "arXiv:2407.05624v1 Announce Type: new \nAbstract: Matrix time series, which consist of matrix-valued data observed over time, are prevalent in various fields such as economics, finance, and engineering. Such matrix time series data are often observed in high dimensions. Matrix factor models are employed to reduce the dimensionality of such data, but they lack the capability to make predictions without specified dynamics in the latent factor process. To address this issue, we propose a two-component dynamic matrix factor model that extends the standard matrix factor model by incorporating a matrix autoregressive structure for the low-dimensional latent factor process. This two-component model injects prediction capability into the matrix factor model and provides deeper insights into the dynamics of high-dimensional matrix time series. We present the estimation procedures of the model and their theoretical properties, as well as empirical analysis of the estimation procedures via simulations, and a case study of New York City taxi data, demonstrating the performance and usefulness of the model."}, "https://arxiv.org/abs/2407.05625": {"title": "New User Event Prediction Through the Lens of Causal Inference", "link": "https://arxiv.org/abs/2407.05625", "description": "arXiv:2407.05625v1 Announce Type: new \nAbstract: Modeling and analysis for event series generated by heterogeneous users of various behavioral patterns are closely involved in our daily lives, including credit card fraud detection, online platform user recommendation, and social network analysis. The most commonly adopted approach to this task is to classify users into behavior-based categories and analyze each of them separately. However, this approach requires extensive data to fully understand user behavior, presenting challenges in modeling newcomers without historical knowledge. In this paper, we propose a novel discrete event prediction framework for new users through the lens of causal inference. Our method offers an unbiased prediction for new users without needing to know their categories. 
We treat the user event history as the ''treatment'' for future events and the user category as the key confounder. Thus, the prediction problem can be framed as counterfactual outcome estimation, with the new user model trained on an adjusted dataset where each event is re-weighted by its inverse propensity score. We demonstrate the superior performance of the proposed framework with a numerical simulation study and two real-world applications, including Netflix rating prediction and seller contact prediction for customer support at Amazon."}, "https://arxiv.org/abs/2407.05691": {"title": "Multi-resolution subsampling for large-scale linear classification", "link": "https://arxiv.org/abs/2407.05691", "description": "arXiv:2407.05691v1 Announce Type: new \nAbstract: Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information of the full data. The present work takes the view that sampling techniques are recommended for the region we focus on and summary measures are enough to collect the information for the rest according to a well-designed data partitioning. We propose a multi-resolution subsampling strategy that combines global information described by summary measures and local information obtained from selected subsample points. We show that the proposed method will lead to a more efficient subsample-based estimator for general large-scale classification problems. Some asymptotic properties of the proposed method are established and connections to existing subsampling procedures are explored. Finally, we illustrate the proposed subsampling strategy via simulated and real-world examples."}, "https://arxiv.org/abs/2407.05824": {"title": "Counting on count regression: overlooked aspects of the Negative Binomial specification", "link": "https://arxiv.org/abs/2407.05824", "description": "arXiv:2407.05824v1 Announce Type: new \nAbstract: Negative Binomial regression is a staple in Operations Management empirical research. Most of its analytical aspects are considered either self-evident, or minutiae that are better left to specialised textbooks. But what if the evidence provided by trusted sources disagrees? In this note I set out to verify results about the Negative Binomial regression specification presented in widely-cited academic sources. I identify problems in how these sources approach the gamma function and its derivatives, with repercussions on the Fisher Information Matrix that may ultimately affect statistical testing. By elevating computations that are rarely specified in full, I provide recommendations to improve methodological evidence that is typically presented without proof."}, "https://arxiv.org/abs/2407.05849": {"title": "Small area prediction of counts under machine learning-type mixed models", "link": "https://arxiv.org/abs/2407.05849", "description": "arXiv:2407.05849v1 Announce Type: new \nAbstract: This paper proposes small area estimation methods that utilize generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas using discrete survey data. Specifically, we present two approaches based on random forests: the Generalized Mixed Effects Random Forest (GMERF) and a Mixed Effects Random Forest (MERF), both tailored to address challenges associated with count outcomes, particularly overdispersion. 
Our analysis reveals that the MERF, which does not assume a Poisson distribution to model the mean behavior of count data, excels in scenarios of severe overdispersion. Conversely, the GMERF performs best under conditions where Poisson distribution assumptions are moderately met. Additionally, we introduce and evaluate three bootstrap methodologies - one parametric and two non-parametric - designed to assess the reliability of point estimators for area-level means. The effectiveness of these methodologies is tested through model-based (and design-based) simulations and applied to a real-world dataset from the state of Guerrero in Mexico, demonstrating their robustness and potential for practical applications."}, "https://arxiv.org/abs/2407.05854": {"title": "A Low-Rank Bayesian Approach for Geoadditive Modeling", "link": "https://arxiv.org/abs/2407.05854", "description": "arXiv:2407.05854v1 Announce Type: new \nAbstract: Kriging is an established methodology for predicting spatial data in geostatistics. Current kriging techniques can handle linear dependencies on spatially referenced covariates. Although splines have shown promise in capturing nonlinear dependencies of covariates, their combination with kriging, especially in handling count data, remains underexplored. This paper proposes a novel Bayesian approach to the low-rank representation of geoadditive models, which integrates splines and kriging to account for both spatial correlations and nonlinear dependencies of covariates. The proposed method accommodates Gaussian and count data inherent in many geospatial datasets. Additionally, Laplace approximations to selected posterior distributions enhance computational efficiency, resulting in faster computation times compared to Markov chain Monte Carlo techniques commonly used for Bayesian inference. Method performance is assessed through a simulation study, demonstrating the effectiveness of the proposed approach. The methodology is applied to the analysis of heavy metal concentrations in the Meuse river and vulnerability to the coronavirus disease 2019 (COVID-19) in Belgium. Through this work, we provide a new flexible and computationally efficient framework for analyzing spatial data."}, "https://arxiv.org/abs/2407.05896": {"title": "A new multivariate Poisson model", "link": "https://arxiv.org/abs/2407.05896", "description": "arXiv:2407.05896v1 Announce Type: new \nAbstract: Multi-dimensional data frequently occur in many different fields, including risk management, insurance, biology, environmental sciences, and many more. In analyzing multivariate data, it is imperative that the underlying modelling assumptions adequately reflect both the marginal behavior as well as the associations between components. This work focuses specifically on developing a new multivariate Poisson model appropriate for multi-dimensional count data. The proposed formulation is based on convolutions of comonotonic shock vectors with Poisson distributed components and allows for flexibility in capturing different degrees of positive dependence. In this paper, the general model framework will be presented along with various distributional properties. 
Several estimation techniques will be explored and assessed both through simulations and in a real data application involving extreme rainfall events."}, "https://arxiv.org/abs/2407.05914": {"title": "Constructing Level Sets Using Smoothed Approximate Bayesian Computation", "link": "https://arxiv.org/abs/2407.05914", "description": "arXiv:2407.05914v1 Announce Type: new \nAbstract: This paper presents a novel approach to level set estimation for any function/simulation with an arbitrary number of continuous inputs and arbitrary numbers of continuous responses. We present a method that uses existing data from computer model simulations to fit a Gaussian process surrogate and use a newly proposed Markov Chain Monte Carlo technique, which we refer to as Smoothed Approximate Bayesian Computation to sample sets of parameters that yield a desired response, which improves on ``hard-clipped\" versions of ABC. We prove that our method converges to the correct distribution (i.e. the posterior distribution of level sets, or probability contours) and give results of our method on known functions and a dam breach simulation where the relationship between input parameters and responses of interest is unknown. Two versions of S-ABC are offered based on: 1) surrogating an accurately known target model and 2) surrogating an approximate model, which leads to uncertainty in estimating the level sets. In addition, we show how our method can be extended to multiple responses with an accompanying example. As demonstrated, S-ABC is able to estimate a level set accurately without the use of a predefined grid or signed distance function."}, "https://arxiv.org/abs/2407.05957": {"title": "A likelihood ratio test for circular multimodality", "link": "https://arxiv.org/abs/2407.05957", "description": "arXiv:2407.05957v1 Announce Type: new \nAbstract: The modes of a statistical population are high frequency points around which most of the probability mass is accumulated. For the particular case of circular densities, we address the problem of testing if, given an observed sample of a random angle, the underlying circular distribution model is multimodal. Our work is motivated by the analysis of migration patterns of birds and the methodological proposal follows a novel approach based on likelihood ratio ideas, combined with critical bandwidths. Theoretical results support the behaviour of the test, whereas simulation examples show its finite sample performance."}, "https://arxiv.org/abs/2407.06038": {"title": "Comparing Causal Inference Methods for Point Exposures with Missing Confounders: A Simulation Study", "link": "https://arxiv.org/abs/2407.06038", "description": "arXiv:2407.06038v1 Announce Type: new \nAbstract: Causal inference methods based on electronic health record (EHR) databases must simultaneously handle confounding and missing data. Vast scholarship exists aimed at addressing these two issues separately, but surprisingly few papers attempt to address them simultaneously. In practice, when faced with simultaneous missing data and confounding, analysts may proceed by first imputing missing data and subsequently using outcome regression or inverse-probability weighting (IPW) to address confounding. However, little is known about the theoretical performance of such $\\textit{ad hoc}$ methods. 
In a recent paper Levis $\\textit{et al.}$ outline a robust framework for tackling these problems together under certain identifying conditions, and introduce a pair of estimators for the average treatment effect (ATE), one of which is non-parametric efficient. In this work we present a series of simulations, motivated by a published EHR based study of the long-term effects of bariatric surgery on weight outcomes, to investigate these new estimators and compare them to existing $\\textit{ad hoc}$ methods. While the latter perform well in certain scenarios, no single estimator is uniformly best. As such, the work of Levis $\\textit{et al.}$ may serve as a reasonable default for causal inference when handling confounding and missing data together."}, "https://arxiv.org/abs/2407.06069": {"title": "How to Add Baskets to an Ongoing Basket Trial with Information Borrowing", "link": "https://arxiv.org/abs/2407.06069", "description": "arXiv:2407.06069v1 Announce Type: new \nAbstract: Basket trials test a single therapeutic treatment on several patient populations under one master protocol. A desirable adaptive design feature in these studies may be the incorporation of new baskets to an ongoing study. Limited basket sample sizes can cause issues in power and precision of treatment effect estimates which could be amplified in added baskets due to the shortened recruitment time. While various Bayesian information borrowing techniques have been introduced to tackle the issue of small sample sizes, the impact of including new baskets in the trial and into the borrowing model has yet to be investigated. We explore approaches for adding baskets to an ongoing trial under information borrowing and highlight when it is beneficial to add a basket compared to running a separate investigation for new baskets. We also propose a novel calibration approach for the decision criteria that is more robust to false decision making. Simulation studies are conducted to assess the performance of approaches which is monitored primarily through type I error control and precision of estimates. Results display a substantial improvement in power for a new basket when information borrowing is utilized, however, this comes with potential inflation of error rates which can be shown to be reduced under the proposed calibration procedure."}, "https://arxiv.org/abs/2407.06173": {"title": "Large Row-Constrained Supersaturated Designs for High-throughput Screening", "link": "https://arxiv.org/abs/2407.06173", "description": "arXiv:2407.06173v1 Announce Type: new \nAbstract: High-throughput screening, in which multiwell plates are used to test large numbers of compounds against specific targets, is widely used across many areas of the biological sciences and most prominently in drug discovery. We propose a statistically principled approach to these screening experiments, using the machinery of supersaturated designs and the Lasso. To accommodate limitations on the number of biological entities that can be applied to a single microplate well, we present a new class of row-constrained supersaturated designs. We develop a computational procedure to construct these designs, provide some initial lower bounds on the average squared off-diagonal values of their main-effects information matrix, and study the impact of the constraint on design quality. 
We also show via simulation that the proposed constrained row screening method is statistically superior to existing methods and demonstrate the use of the new methodology on a real drug-discovery system."}, "https://arxiv.org/abs/2407.04980": {"title": "Enabling Causal Discovery in Post-Nonlinear Models with Normalizing Flows", "link": "https://arxiv.org/abs/2407.04980", "description": "arXiv:2407.04980v1 Announce Type: cross \nAbstract: Post-nonlinear (PNL) causal models stand out as a versatile and adaptable framework for modeling intricate causal relationships. However, accurately capturing the invertibility constraint required in PNL models remains challenging in existing studies. To address this problem, we introduce CAF-PoNo (Causal discovery via Normalizing Flows for Post-Nonlinear models), harnessing the power of the normalizing flows architecture to enforce the crucial invertibility constraint in PNL models. Through normalizing flows, our method precisely reconstructs the hidden noise, which plays a vital role in cause-effect identification through statistical independence testing. Furthermore, the proposed approach exhibits remarkable extensibility, as it can be seamlessly expanded to facilitate multivariate causal discovery via causal order identification, empowering us to efficiently unravel complex causal relationships. Extensive experimental evaluations on both simulated and real datasets consistently demonstrate that the proposed method outperforms several state-of-the-art approaches in both bivariate and multivariate causal discovery tasks."}, "https://arxiv.org/abs/2407.04992": {"title": "Scalable Variational Causal Discovery Unconstrained by Acyclicity", "link": "https://arxiv.org/abs/2407.04992", "description": "arXiv:2407.04992v1 Announce Type: cross \nAbstract: Bayesian causal discovery offers the power to quantify epistemic uncertainties among a broad range of structurally diverse causal theories potentially explaining the data, represented in forms of directed acyclic graphs (DAGs). However, existing methods struggle with efficient DAG sampling due to the complex acyclicity constraint. In this study, we propose a scalable Bayesian approach to effectively learn the posterior distribution over causal graphs given observational data thanks to the ability to generate DAGs without explicitly enforcing acyclicity. Specifically, we introduce a novel differentiable DAG sampling method that can generate a valid acyclic causal graph by mapping an unconstrained distribution of implicit topological orders to a distribution over DAGs. Given this efficient DAG sampling scheme, we are able to model the posterior distribution over causal graphs using a simple variational distribution over a continuous domain, which can be learned via the variational inference framework. Extensive empirical experiments on both simulated and real datasets demonstrate the superior performance of the proposed model compared to several state-of-the-art baselines."}, "https://arxiv.org/abs/2407.05330": {"title": "Fast Proxy Experiment Design for Causal Effect Identification", "link": "https://arxiv.org/abs/2407.05330", "description": "arXiv:2407.05330v1 Announce Type: cross \nAbstract: Identifying causal effects is a key problem of interest across many disciplines. The two long-standing approaches to estimate causal effects are observational and experimental (randomized) studies. Observational studies can suffer from unmeasured confounding, which may render the causal effects unidentifiable. 
On the other hand, direct experiments on the target variable may be too costly or even infeasible to conduct. A middle ground between these two approaches is to estimate the causal effect of interest through proxy experiments, which are conducted on variables with a lower cost to intervene on compared to the main target. Akbari et al. [2022] studied this setting and demonstrated that the problem of designing the optimal (minimum-cost) experiment for causal effect identification is NP-complete and provided a naive algorithm that may require solving exponentially many NP-hard problems as a sub-routine in the worst case. In this work, we provide a few reformulations of the problem that allow for designing significantly more efficient algorithms to solve it as witnessed by our extensive simulations. Additionally, we study the closely-related problem of designing experiments that enable us to identify a given effect through valid adjustments sets."}, "https://arxiv.org/abs/2407.05492": {"title": "Gaussian Approximation and Output Analysis for High-Dimensional MCMC", "link": "https://arxiv.org/abs/2407.05492", "description": "arXiv:2407.05492v1 Announce Type: cross \nAbstract: The widespread use of Markov Chain Monte Carlo (MCMC) methods for high-dimensional applications has motivated research into the scalability of these algorithms with respect to the dimension of the problem. Despite this, numerous problems concerning output analysis in high-dimensional settings have remained unaddressed. We present novel quantitative Gaussian approximation results for a broad range of MCMC algorithms. Notably, we analyse the dependency of the obtained approximation errors on the dimension of both the target distribution and the feature space. We demonstrate how these Gaussian approximations can be applied in output analysis. This includes determining the simulation effort required to guarantee Markov chain central limit theorems and consistent variance estimation in high-dimensional settings. We give quantitative convergence bounds for termination criteria and show that the termination time of a wide class of MCMC algorithms scales polynomially in dimension while ensuring a desired level of precision. Our results offer guidance to practitioners for obtaining appropriate standard errors and deciding the minimum simulation effort of MCMC algorithms in both multivariate and high-dimensional settings."}, "https://arxiv.org/abs/2407.05954": {"title": "Causality-driven Sequence Segmentation for Enhancing Multiphase Industrial Process Data Analysis and Soft Sensing", "link": "https://arxiv.org/abs/2407.05954", "description": "arXiv:2407.05954v1 Announce Type: cross \nAbstract: The dynamic characteristics of multiphase industrial processes present significant challenges in the field of industrial big data modeling. Traditional soft sensing models frequently neglect the process dynamics and have difficulty in capturing transient phenomena like phase transitions. To address this issue, this article introduces a causality-driven sequence segmentation (CDSS) model. This model first identifies the local dynamic properties of the causal relationships between variables, which are also referred to as causal mechanisms. It then segments the sequence into different phases based on the sudden shifts in causal mechanisms that occur during phase transitions. 
Additionally, a novel metric, similarity distance, is designed to evaluate the temporal consistency of causal mechanisms, which includes both causal similarity distance and stable similarity distance. The discovered causal relationships in each phase are represented as a temporal causal graph (TCG). Furthermore, a soft sensing model called temporal-causal graph convolutional network (TC-GCN) is trained for each phase using the time-extended data and the adjacency matrix of the TCG. Numerical examples are used to validate the proposed CDSS model, and the segmentation results demonstrate that CDSS has excellent performance in segmenting both stable and unstable multiphase series. In particular, it has higher accuracy in separating non-stationary time series compared to other methods. The effectiveness of the proposed CDSS model and the TC-GCN model is also verified through a penicillin fermentation process. Experimental results indicate that the breakpoints discovered by CDSS align well with the reaction mechanisms and that TC-GCN achieves excellent predictive accuracy."}, "https://arxiv.org/abs/1906.03661": {"title": "Community Correlations and Testing Independence Between Binary Graphs", "link": "https://arxiv.org/abs/1906.03661", "description": "arXiv:1906.03661v3 Announce Type: replace \nAbstract: Graph data has a unique structure that deviates from standard data assumptions, often necessitating modifications to existing methods or the development of new ones to ensure valid statistical analysis. In this paper, we explore the notion of correlation and dependence between two binary graphs. Given vertex communities, we propose community correlations to measure the edge association, which equals zero if and only if the two graphs are conditionally independent within a specific pair of communities. The set of community correlations naturally leads to the maximum community correlation, indicating conditional independence on all possible pairs of communities, and to the overall graph correlation, which equals zero if and only if the two binary graphs are unconditionally independent. We then compute the sample community correlations via graph encoder embedding, proving they converge to their respective population versions, and derive the asymptotic null distribution to enable a fast, valid, and consistent test for conditional or unconditional independence between two binary graphs. The theoretical results are validated through comprehensive simulations, and we provide two real-data examples: one using Enron email networks and another using mouse connectome graphs, to demonstrate the utility of the proposed correlation measures."}, "https://arxiv.org/abs/2005.12017": {"title": "Estimating spatially varying health effects of wildland fire smoke using mobile health data", "link": "https://arxiv.org/abs/2005.12017", "description": "arXiv:2005.12017v3 Announce Type: replace \nAbstract: Wildland fire smoke exposures are an increasing threat to public health, and thus there is a growing need for studying the effects of protective behaviors on reducing health outcomes. Emerging smartphone applications provide unprecedented opportunities to deliver health risk communication messages to a large number of individuals when and where they experience the exposure and subsequently study the effectiveness, but also pose novel methodological challenges. 
Smoke Sense, a citizen science project, provides an interactive smartphone app platform for participants to engage with information about air quality and ways to protect their health and record their own health symptoms and actions taken to reduce smoke exposure. We propose a new, doubly robust estimator of the structural nested mean model parameter that accounts for spatially- and time-varying effects via a local estimating equation approach with geographical kernel weighting. Moreover, our analytical framework is flexible enough to handle informative missingness by inverse probability weighting of estimating functions. We evaluate the new method using extensive simulation studies and apply it to Smoke Sense data reported by the citizen scientists to increase the knowledge base about the relationship between health preventive measures and improved health outcomes. Our results estimate how the effects of protective behaviors vary over space and time and find that protective behaviors have more significant effects on reducing health symptoms in the Southwest than in the Northwest region of the USA."}, "https://arxiv.org/abs/2206.01779": {"title": "Bayesian and Frequentist Inference for Synthetic Controls", "link": "https://arxiv.org/abs/2206.01779", "description": "arXiv:2206.01779v3 Announce Type: replace \nAbstract: The synthetic control method has become a widely popular tool to estimate causal effects with observational data. Despite this, inference for synthetic control methods remains challenging. Often, inferential results rely on linear factor model data generating processes. In this paper, we characterize the conditions on the factor model primitives (the factor loadings) for which the statistical risk minimizers are synthetic controls (in the simplex). Then, we propose a Bayesian alternative to the synthetic control method that preserves the main features of the standard method and provides a new way of doing valid inference. We explore a Bernstein-von Mises style result to link our Bayesian inference to the frequentist inference. For linear factor model frameworks we show that a maximum likelihood estimator (MLE) of the synthetic control weights can consistently estimate the predictive function of the potential outcomes for the treated unit and that our Bayes estimator is asymptotically close to the MLE in the total variation sense. Through simulations, we show that there is convergence between the Bayes and frequentist approach even in sparse settings. Finally, we apply the method to revisit the study of the economic costs of the German reunification and the Catalan secession movement. The Bayesian synthetic control method is available in the bsynth R-package."}, "https://arxiv.org/abs/2308.01724": {"title": "Reconciling Functional Data Regression with Excess Bases", "link": "https://arxiv.org/abs/2308.01724", "description": "arXiv:2308.01724v2 Announce Type: replace \nAbstract: As the development of measuring instruments and computers has accelerated the collection of massive amounts of data, functional data analysis (FDA) has experienced a surge of attention. The FDA methodology treats longitudinal data as a set of functions on which inference, including regression, is performed. Functionalizing data typically involves fitting the data with basis functions. In general, a number of basis functions smaller than the sample size is selected. This paper casts doubt on this convention. 
Recent statistical theory has revealed the so-called double-descent phenomenon in which excess parameters overcome overfitting and lead to precise interpolation. Applying this idea to choosing the number of bases to be used for functional data, we show that choosing an excess number of bases can lead to more accurate predictions. Specifically, we explored this phenomenon in a functional regression context and examined its validity through numerical experiments. In addition, we introduce two real-world datasets to demonstrate that the double-descent phenomenon goes beyond theoretical and numerical experiments, confirming its importance in practical applications."}, "https://arxiv.org/abs/2310.18858": {"title": "Estimating a function of the scale parameter in a gamma distribution with bounded variance", "link": "https://arxiv.org/abs/2310.18858", "description": "arXiv:2310.18858v2 Announce Type: replace \nAbstract: Given a gamma population with known shape parameter $\\alpha$, we develop a general theory for estimating a function $g(\\cdot)$ of the scale parameter $\\beta$ with bounded variance. We begin by defining a sequential sampling procedure with $g(\\cdot)$ satisfying some desired condition in proposing the stopping rule, and show the procedure enjoys appealing asymptotic properties. After these general conditions, we substitute $g(\\cdot)$ with specific functions including the gamma mean, the gamma variance, the gamma rate parameter, and a gamma survival probability as four possible illustrations. For each illustration, Monte Carlo simulations are carried out to justify the remarkable performance of our proposed sequential procedure. This is further substantiated with a real data study on weights of newly born babies."}, "https://arxiv.org/abs/1809.01796": {"title": "Optimal Sparse Singular Value Decomposition for High-dimensional High-order Data", "link": "https://arxiv.org/abs/1809.01796", "description": "arXiv:1809.01796v2 Announce Type: replace-cross \nAbstract: In this article, we consider the sparse tensor singular value decomposition, which aims for dimension reduction on high-dimensional high-order data with certain sparsity structure. A method named Sparse Tensor Alternating Thresholding for Singular Value Decomposition (STAT-SVD) is proposed. The proposed procedure features a novel double projection \\& thresholding scheme, which provides a sharp criterion for thresholding in each iteration. Compared with regular tensor SVD model, STAT-SVD permits more robust estimation under weaker assumptions. Both the upper and lower bounds for estimation accuracy are developed. The proposed procedure is shown to be minimax rate-optimal in a general class of situations. Simulation studies show that STAT-SVD performs well under a variety of configurations. We also illustrate the merits of the proposed procedure on a longitudinal tensor dataset on European country mortality rates."}, "https://arxiv.org/abs/2305.07993": {"title": "The Nonstationary Newsvendor with (and without) Predictions", "link": "https://arxiv.org/abs/2305.07993", "description": "arXiv:2305.07993v3 Announce Type: replace-cross \nAbstract: The classic newsvendor model yields an optimal decision for a \"newsvendor\" selecting a quantity of inventory, under the assumption that the demand is drawn from a known distribution. 
Motivated by applications such as cloud provisioning and staffing, we consider a setting in which newsvendor-type decisions must be made sequentially, in the face of demand drawn from a stochastic process that is both unknown and nonstationary. All prior work on this problem either (a) assumes that the level of nonstationarity is known, or (b) imposes additional statistical assumptions that enable accurate predictions of the unknown demand.\n We study the Nonstationary Newsvendor, with and without predictions. We first, in the setting without predictions, design a policy which we prove (via matching upper and lower bounds) achieves order-optimal regret -- ours is the first policy to accomplish this without being given the level of nonstationarity of the underlying demand. We then, for the first time, introduce a model for generic (i.e. with no statistical assumptions) predictions with arbitrary accuracy, and propose a policy that incorporates these predictions without being given their accuracy. We upper bound the regret of this policy, and show that it matches the best achievable regret had the accuracy of the predictions been known. Finally, we empirically validate our new policy with experiments based on three real-world datasets containing thousands of time-series, showing that it succeeds in closing approximately 74% of the gap between the best approaches based on nonstationarity and predictions alone."}, "https://arxiv.org/abs/2401.11646": {"title": "Nonparametric Density Estimation via Variance-Reduced Sketching", "link": "https://arxiv.org/abs/2401.11646", "description": "arXiv:2401.11646v2 Announce Type: replace-cross \nAbstract: Nonparametric density models are of great interest in various scientific and engineering disciplines. Classical density kernel methods, while numerically robust and statistically sound in low-dimensional settings, become inadequate even in moderate higher-dimensional settings due to the curse of dimensionality. In this paper, we introduce a new framework called Variance-Reduced Sketching (VRS), specifically designed to estimate multivariable density functions with a reduced curse of dimensionality. Our framework conceptualizes multivariable functions as infinite-size matrices, and facilitates a new sketching technique motivated by numerical linear algebra literature to reduce the variance in density estimation problems. We demonstrate the robust numerical performance of VRS through a series of simulated experiments and real-world data applications. Notably, VRS shows remarkable improvement over existing neural network estimators and classical kernel methods in numerous density models. Additionally, we offer theoretical justifications for VRS to support its ability to deliver nonparametric density estimation with a reduced curse of dimensionality."}, "https://arxiv.org/abs/2407.06350": {"title": "A Surrogate Endpoint Based Provisional Approval Causal Roadmap", "link": "https://arxiv.org/abs/2407.06350", "description": "arXiv:2407.06350v1 Announce Type: new \nAbstract: For many rare diseases with no approved preventive interventions, promising interventions exist, yet it has been difficult to conduct a pivotal phase 3 trial that could provide direct evidence demonstrating a beneficial effect on the target disease outcome. When a promising putative surrogate endpoint(s) for the target outcome is available, surrogate-based provisional approval of an intervention may be pursued. 
We apply the Causal Roadmap rubric to define a surrogate endpoint based provisional approval causal roadmap, which combines observational study data that estimates the relationship between the putative surrogate and the target outcome, with a phase 3 surrogate endpoint study that collects the same data but is very under-powered to assess the treatment effect (TE) on the target outcome. The objective is conservative estimation/inference for the TE with an estimated lower uncertainty bound that allows (through two bias functions) for an imperfect surrogate and imperfect transport of the conditional target outcome risk in the untreated between the observational and phase 3 studies. Two estimators of TE (plug-in, nonparametric efficient one-step) with corresponding inference procedures are developed. Finite-sample performance of the plug-in estimator is evaluated in two simulation studies, with R code provided. The roadmap is illustrated with contemporary Group B Streptococcus vaccine development."}, "https://arxiv.org/abs/2407.06387": {"title": "Conditional Rank-Rank Regression", "link": "https://arxiv.org/abs/2407.06387", "description": "arXiv:2407.06387v1 Announce Type: new \nAbstract: Rank-rank regressions are widely used in economic research to evaluate phenomena such as intergenerational income persistence or mobility. However, when covariates are incorporated to capture between-group persistence, the resulting coefficients can be difficult to interpret as such. We propose the conditional rank-rank regression, which uses conditional ranks instead of unconditional ranks, to measure average within-group income persistence. This property is analogous to that of the unconditional rank-rank regression that measures the overall income persistence. The difference between conditional and unconditional rank-rank regression coefficients therefore can measure between-group persistence. We develop a flexible estimation approach using distribution regression and establish a theoretical framework for large sample inference. An empirical study on intergenerational income mobility in Switzerland demonstrates the advantages of this approach. The study reveals stronger intergenerational persistence between fathers and sons compared to fathers and daughters, with the within-group persistence explaining 62% of the overall income persistence for sons and 52% for daughters. Families of small size or with highly educated fathers exhibit greater persistence in passing on their economic status."}, "https://arxiv.org/abs/2407.06395": {"title": "Logit unfolding choice models for binary data", "link": "https://arxiv.org/abs/2407.06395", "description": "arXiv:2407.06395v1 Announce Type: new \nAbstract: Discrete choice models with non-monotonic response functions are important in many areas of application, especially political sciences and marketing. This paper describes a novel unfolding model for binary data that allows for heavy-tailed shocks to the underlying utilities. One of our key contributions is a Markov chain Monte Carlo algorithm that requires little or no parameter tuning, fully explores the support of the posterior distribution, and can be used to fit various extensions of our core model that involve (Bayesian) hypothesis testing on the latent construct. 
Our empirical evaluations of the model and the associated algorithm suggest that they provide better complexity-adjusted fit to voting data from the United States House of Representatives."}, "https://arxiv.org/abs/2407.06466": {"title": "Increased risk of type I errors for detecting heterogeneity of treatment effects in cluster-randomized trials using mixed-effect models", "link": "https://arxiv.org/abs/2407.06466", "description": "arXiv:2407.06466v1 Announce Type: new \nAbstract: Evaluating heterogeneity of treatment effects (HTE) across subgroups is common in both randomized trials and observational studies. Although several statistical challenges of HTE analyses, including low statistical power and multiple comparisons, are widely acknowledged, issues arising for clustered data, including cluster randomized trials (CRTs), have received less attention. Notably, the potential for model misspecification is increased given the complex clustering structure (e.g., due to correlation among individuals within a subgroup and cluster), which could impact inference and type 1 errors. To illustrate this issue, we conducted a simulation study to evaluate the performance of common analytic approaches for testing the presence of HTE for continuous, binary, and count outcomes: generalized linear mixed models (GLMM) and generalized estimating equations (GEE) including interaction terms between treatment group and subgroup. We found that standard GLMM analyses that assume a common correlation of participants within clusters can lead to severely elevated type 1 error rates of up to 47.2% compared to the 5% nominal level if the within-cluster correlation varies across subgroups. A flexible GLMM, which allows subgroup-specific within-cluster correlations, achieved the nominal type 1 error rate, as did GEE (though rates were slightly elevated even with as many as 50 clusters). Applying the methods to a real-world CRT using the count outcome of healthcare utilization, we found a large impact of the model specification on inference: the standard GLMM yielded a highly significant interaction by sex (P=0.01), whereas the interaction was not statistically significant under the flexible GLMM and GEE (P=0.64 and 0.93, respectively). We recommend that HTE analyses using GLMM account for within-subgroup correlation to avoid anti-conservative inference."}, "https://arxiv.org/abs/2407.06497": {"title": "Bayesian design for mathematical models of fruit growth based on misspecified prior information", "link": "https://arxiv.org/abs/2407.06497", "description": "arXiv:2407.06497v1 Announce Type: new \nAbstract: Bayesian design can be used for efficient data collection over time when the process can be described by the solution to an ordinary differential equation (ODE). Typically, Bayesian designs in such settings are obtained by maximising the expected value of a utility function that is derived from the joint probability distribution of the parameters and the response, given prior information about an appropriate ODE. However, in practice, appropriately defining such information \textit{a priori} can be difficult due to incomplete knowledge about the mechanisms that govern how the process evolves over time. In this paper, we propose a method for finding Bayesian designs based on a flexible class of ODEs. Specifically, we consider the inclusion of spline terms into ODEs to provide flexibility in modelling how the process changes over time. 
We then propose to leverage this flexibility to form designs that are efficient even when the prior information is misspecified. Our approach is motivated by a sampling problem in agriculture, where the goal is to provide a better understanding of fruit growth but the prior information is based on studies conducted overseas and is therefore potentially misspecified."}, "https://arxiv.org/abs/2407.06522": {"title": "Independent Approximates provide a maximum likelihood estimate for heavy-tailed distributions", "link": "https://arxiv.org/abs/2407.06522", "description": "arXiv:2407.06522v1 Announce Type: new \nAbstract: Heavy-tailed distributions are infamously difficult to estimate because their moments tend to infinity as the shape of the tail decay increases. Nevertheless, this study shows how a modified group of moments can be used to estimate a heavy-tailed distribution. These modified moments are determined from powers of the original distribution. The nth-power distribution is guaranteed to have finite moments up to n-1. Samples from the nth-power distribution are drawn from n-tuple Independent Approximates, which are the set of independent samples grouped into n-tuples and sub-selected to be approximately equal to each other. We show that Independent Approximates are a maximum likelihood estimator for the generalized Pareto and the Student's t distributions, which are members of the family of coupled exponential distributions. We use the first (original), second, and third power distributions to estimate their zeroth (geometric mean), first, and second power-moments, respectively. In turn, these power-moments are used to estimate the scale and shape of the distributions. A least absolute deviation criterion is used to select the optimal set of Independent Approximates. Estimates using higher powers and moments are possible, though the number of n-tuples that are approximately equal may be limited."}, "https://arxiv.org/abs/2407.06722": {"title": "Femicide Laws, Unilateral Divorce, and Abortion Decriminalization Fail to Stop Women's Killings in Mexico", "link": "https://arxiv.org/abs/2407.06722", "description": "arXiv:2407.06722v1 Announce Type: new \nAbstract: This paper evaluates the effectiveness of femicide laws in combating gender-based killings of women, a major cause of premature female mortality globally. Focusing on Mexico, a pioneer in adopting such legislation, the paper leverages variations in the enactment of femicide laws and associated prison sentences across states. Using the difference-in-difference estimator, the analysis reveals that these laws have not significantly affected the incidence of femicides, homicides of women, or reports of women who have disappeared. These findings remain robust even when accounting for differences in prison sentencing, whether states also implemented unilateral divorce laws, or decriminalized abortion alongside femicide legislation. The results suggest that legislative measures are insufficient to address violence against women in settings where impunity prevails."}, "https://arxiv.org/abs/2407.06733": {"title": "Causes and Electoral Consequences of Political Assassinations: The Role of Organized Crime in Mexico", "link": "https://arxiv.org/abs/2407.06733", "description": "arXiv:2407.06733v1 Announce Type: new \nAbstract: Mexico has experienced a notable surge in assassinations of political candidates and mayors. 
This article argues that these killings are largely driven by organized crime, aiming to influence candidate selection, control local governments for rent-seeking, and retaliate against government crackdowns. Using a new dataset of political assassinations in Mexico from 2000 to 2021 and instrumental variables, we address endogeneity concerns in the location and timing of government crackdowns. Our instruments include historical Chinese immigration patterns linked to opium cultivation in Mexico, local corn prices, and U.S. illicit drug prices. The findings reveal that candidates in municipalities near oil pipelines face an increased risk of assassination due to drug trafficking organizations expanding into oil theft, particularly during elections and fuel price hikes. Government arrests or killings of organized crime members trigger retaliatory violence, further endangering incumbent mayors. This political violence has a negligible impact on voter turnout, as it targets politicians rather than voters. However, voter turnout increases in areas where authorities disrupt drug smuggling, raising the chances of the local party being re-elected. These results offer new insights into how criminal groups attempt to capture local governments and the implications for democracy under criminal governance."}, "https://arxiv.org/abs/2407.06835": {"title": "A flexible model for Record Linkage", "link": "https://arxiv.org/abs/2407.06835", "description": "arXiv:2407.06835v1 Announce Type: new \nAbstract: Combining data from various sources empowers researchers to explore innovative questions, for example those raised by conducting healthcare monitoring studies. However, the lack of a unique identifier often poses challenges. Record linkage procedures determine whether pairs of observations collected on different occasions belong to the same individual using partially identifying variables (e.g. birth year, postal code). Existing methodologies typically involve a compromise between computational efficiency and accuracy. Traditional approaches simplify this task by condensing information, yet they neglect dependencies among linkage decisions and disregard the one-to-one relationship required to establish coherent links. Modern approaches offer a comprehensive representation of the data generation process, at the expense of computational overhead and reduced flexibility. We propose a flexible method, that adapts to varying data complexities, addressing registration errors and accommodating changes of the identifying information over time. Our approach balances accuracy and scalability, estimating the linkage using a Stochastic Expectation Maximisation algorithm on a latent variable model. We illustrate the ability of our methodology to connect observations using large real data applications and demonstrate the robustness of our model to the linking variables quality in a simulation study. The proposed algorithm FlexRL is implemented and available in an open source R package."}, "https://arxiv.org/abs/2407.06867": {"title": "Distributionally robust risk evaluation with an isotonic constraint", "link": "https://arxiv.org/abs/2407.06867", "description": "arXiv:2407.06867v1 Announce Type: new \nAbstract: Statistical learning under distribution shift is challenging when neither prior knowledge nor fully accessible data from the target distribution is available. 
Distributionally robust learning (DRL) aims to control the worst-case statistical performance within an uncertainty set of candidate distributions, but how to properly specify the set remains challenging. To enable distributional robustness without being overly conservative, in this paper, we propose a shape-constrained approach to DRL, which incorporates prior information about the way in which the unknown target distribution differs from its estimate. More specifically, we assume the unknown density ratio between the target distribution and its estimate is isotonic with respect to some partial order. At the population level, we provide a solution to the shape-constrained optimization problem that does not involve the isotonic constraint. At the sample level, we provide consistency results for an empirical estimator of the target in a range of different settings. Empirical studies on both synthetic and real data examples demonstrate the improved accuracy of the proposed shape-constrained approach."}, "https://arxiv.org/abs/2407.06883": {"title": "Dealing with idiosyncratic cross-correlation when constructing confidence regions for PC factors", "link": "https://arxiv.org/abs/2407.06883", "description": "arXiv:2407.06883v1 Announce Type: new \nAbstract: In this paper, we propose a computationally simple estimator of the asymptotic covariance matrix of the Principal Components (PC) factors valid in the presence of cross-correlated idiosyncratic components. The proposed estimator of the asymptotic Mean Square Error (MSE) of PC factors is based on adaptively thresholding the sample covariances of the idiosyncratic residuals, with the threshold based on their individual variances. We compare the finite sample performance of confidence regions for the PC factors obtained using the proposed asymptotic MSE with those of available extant asymptotic and bootstrap regions and show that the former beats all alternative procedures for a wide variety of idiosyncratic cross-correlation structures."}, "https://arxiv.org/abs/2407.06892": {"title": "When Knockoffs fail: diagnosing and fixing non-exchangeability of Knockoffs", "link": "https://arxiv.org/abs/2407.06892", "description": "arXiv:2407.06892v1 Announce Type: new \nAbstract: Knockoffs are a popular statistical framework that addresses the challenging problem of conditional variable selection in high-dimensional settings with statistical control. Such statistical control is essential for the reliability of inference. However, knockoff guarantees rely on an exchangeability assumption that is difficult to test in practice, and there is little discussion in the literature on how to deal with unfulfilled hypotheses. This assumption is related to the ability to generate data similar to the observed data. To maintain reliable inference, we introduce a diagnostic tool based on Classifier Two-Sample Tests. Using simulations and real data, we show that violations of this assumption occur in common settings for classical Knockoffs generators, especially when the data have a strong dependence structure. We show that the diagnostic tool correctly detects such behavior. To fix knockoff generation, we propose a nonparametric, computationally efficient alternative knockoff construction, which is based on constructing a predictor of each variable based on all others. We show that this approach achieves asymptotic exchangeability with the original variables under standard assumptions on the predictive model. 
We show empirically that the proposed approach restores error control on simulated data."}, "https://arxiv.org/abs/2407.06970": {"title": "Effect estimation in the presence of a misclassified binary mediator", "link": "https://arxiv.org/abs/2407.06970", "description": "arXiv:2407.06970v1 Announce Type: new \nAbstract: Mediation analyses allow researchers to quantify the effect of an exposure variable on an outcome variable through a mediator variable. If a binary mediator variable is misclassified, the resulting analysis can be severely biased. Misclassification is especially difficult to deal with when it is differential and when there are no gold standard labels available. Previous work has addressed this problem using a sensitivity analysis framework or by assuming that misclassification rates are known. We leverage a variable related to the misclassification mechanism to recover unbiased parameter estimates without using gold standard labels. The proposed methods require the reasonable assumption that the sum of the sensitivity and specificity is greater than 1. Three correction methods are presented: (1) an ordinary least squares correction for Normal outcome models, (2) a multi-step predictive value weighting method, and (3) a seamless expectation-maximization algorithm. We apply our misclassification correction strategies to investigate the mediating role of gestational hypertension on the association between maternal age and pre-term birth."}, "https://arxiv.org/abs/2407.07067": {"title": "Aggregate Bayesian Causal Forests: The ABCs of Flexible Causal Inference for Hierarchically Structured Data", "link": "https://arxiv.org/abs/2407.07067", "description": "arXiv:2407.07067v1 Announce Type: new \nAbstract: This paper introduces aggregate Bayesian Causal Forests (aBCF), a new Bayesian model for causal inference using aggregated data. Aggregated data are common in policy evaluations where we observe individuals such as students, but participation in an intervention is determined at a higher level of aggregation, such as schools implementing a curriculum. Interventions often have millions of individuals but far fewer higher-level units, making aggregation computationally attractive. To analyze aggregated data, a model must account for heteroskedasticity and intraclass correlation (ICC). Like Bayesian Causal Forests (BCF), aBCF estimates heterogeneous treatment effects with minimal parametric assumptions, but accounts for these aggregated data features, improving estimation of average and aggregate unit-specific effects.\n After introducing the aBCF model, we demonstrate via simulation that aBCF improves performance for aggregated data over BCF. We anchor our simulation on an evaluation of a large-scale Medicare primary care model. We demonstrate that aBCF produces treatment effect estimates with a lower root mean squared error and narrower uncertainty intervals while achieving the same level of coverage. We show that aBCF is not sensitive to the prior distribution used and that estimation improvements relative to BCF decline as the ICC approaches one. 
Code is available at https://github.com/mathematica-mpr/bcf-1."}, "https://arxiv.org/abs/2407.06390": {"title": "JANET: Joint Adaptive predictioN-region Estimation for Time-series", "link": "https://arxiv.org/abs/2407.06390", "description": "arXiv:2407.06390v1 Announce Type: cross \nAbstract: Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (Joint Adaptive predictioN-region Estimation for Time-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlled K-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET's superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data."}, "https://arxiv.org/abs/2407.06533": {"title": "LETS-C: Leveraging Language Embedding for Time Series Classification", "link": "https://arxiv.org/abs/2407.06533", "description": "arXiv:2407.06533v1 Announce Type: cross \nAbstract: Recent advancements in language modeling have shown promising results when applied to time series data. In particular, fine-tuning pre-trained large language models (LLMs) for time series classification tasks has achieved state-of-the-art (SOTA) performance on standard benchmarks. However, these LLM-based models have a significant drawback due to the large model size, with the number of trainable parameters in the millions. In this paper, we propose an alternative approach to leveraging the success of language modeling in the time series domain. Instead of fine-tuning LLMs, we utilize a language embedding model to embed time series and then pair the embeddings with a simple classification head composed of convolutional neural networks (CNN) and multilayer perceptron (MLP). We conducted extensive experiments on well-established time series classification benchmark datasets. We demonstrated LETS-C not only outperforms the current SOTA in classification accuracy but also offers a lightweight solution, using only 14.5% of the trainable parameters on average compared to the SOTA model. Our findings suggest that leveraging language encoders to embed time series data, combined with a simple yet effective classification head, offers a promising direction for achieving high-performance time series classification while maintaining a lightweight model architecture."}, "https://arxiv.org/abs/2407.06875": {"title": "Extending the blended generalized extreme value distribution", "link": "https://arxiv.org/abs/2407.06875", "description": "arXiv:2407.06875v1 Announce Type: cross \nAbstract: The generalized extreme value (GEV) distribution is commonly employed to help estimate the likelihood of extreme events in many geophysical and other application areas. The recently proposed blended generalized extreme value (bGEV) distribution modifies the GEV with positive shape parameter to avoid a hard lower bound that complicates fitting and inference. 
Here, the bGEV is extended to the GEV with negative shape parameter, avoiding a hard upper bound that is unrealistic in many applications. This extended bGEV is shown to improve on the GEV for forecasting future heat extremes based on past data. Software implementing this bGEV and applying it to the example temperature data is provided."}, "https://arxiv.org/abs/2407.07018": {"title": "End-To-End Causal Effect Estimation from Unstructured Natural Language Data", "link": "https://arxiv.org/abs/2407.07018", "description": "arXiv:2407.07018v1 Announce Type: cross \nAbstract: Knowing the effect of an intervention is critical for human decision-making, but current approaches for causal effect estimation rely on manual data collection and structuring, regardless of the causal assumptions. This increases both the cost and time-to-completion for studies. We show how large, diverse observational text data can be mined with large language models (LLMs) to produce inexpensive causal effect estimates under appropriate causal assumptions. We introduce NATURAL, a novel family of causal effect estimators built with LLMs that operate over datasets of unstructured text. Our estimators use LLM conditional distributions (over variables of interest, given the text data) to assist in the computation of classical estimators of causal effect. We overcome a number of technical challenges to realize this idea, such as automating data curation and using LLMs to impute missing information. We prepare six (two synthetic and four real) observational datasets, paired with corresponding ground truth in the form of randomized trials, which we used to systematically evaluate each step of our pipeline. NATURAL estimators demonstrate remarkable performance, yielding causal effect estimates that fall within 3 percentage points of their ground truth counterparts, including on real-world Phase 3/4 clinical trials. Our results suggest that unstructured text data is a rich source of causal effect information, and NATURAL is a first step towards an automated pipeline to tap this resource."}, "https://arxiv.org/abs/2407.07072": {"title": "Assumption Smuggling in Intermediate Outcome Tests of Causal Mechanisms", "link": "https://arxiv.org/abs/2407.07072", "description": "arXiv:2407.07072v1 Announce Type: cross \nAbstract: Political scientists are increasingly attuned to the promises and pitfalls of establishing causal effects. But the vital question for many is not if a causal effect exists but why and how it exists. Even so, many researchers avoid causal mediation analyses due to the assumptions required, instead opting to explore causal mechanisms through what we call intermediate outcome tests. These tests use the same research design used to estimate the effect of treatment on the outcome to estimate the effect of the treatment on one or more mediators, with authors often concluding that evidence of the latter is evidence of a causal mechanism. We show in this paper that, without further assumptions, this can neither establish nor rule out the existence of a causal mechanism. Instead, such conclusions about the indirect effect of treatment rely on implicit and usually very strong assumptions that are often unmet. 
Thus, such causal mechanism tests, though very common in political science, should not be viewed as a free lunch but rather should be used judiciously, and researchers should explicitly state and defend the requisite assumptions."}, "https://arxiv.org/abs/2107.07942": {"title": "Flexible Covariate Adjustments in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2107.07942", "description": "arXiv:2107.07942v3 Announce Type: replace \nAbstract: Empirical regression discontinuity (RD) studies often use covariates to increase the precision of their estimates. In this paper, we propose a novel class of estimators that use such covariate information more efficiently than existing methods and can accommodate many covariates. It involves running a standard RD analysis in which a function of the covariates has been subtracted from the original outcome variable. We characterize the function that leads to the estimator with the smallest asymptotic variance, and consider feasible versions of such estimators in which this function is estimated, for example, through modern machine learning techniques."}, "https://arxiv.org/abs/2112.13651": {"title": "Factor modelling for high-dimensional functional time series", "link": "https://arxiv.org/abs/2112.13651", "description": "arXiv:2112.13651v3 Announce Type: replace \nAbstract: Many economic and scientific problems involve the analysis of high-dimensional functional time series, where the number of functional variables $p$ diverges as the number of serially dependent observations $n$ increases. In this paper, we present a novel functional factor model for high-dimensional functional time series that maintains and makes use of the functional and dynamic structure to achieve great dimension reduction and find the latent factor structure. To estimate the number of functional factors and the factor loadings, we propose a fully functional estimation procedure based on an eigenanalysis for a nonnegative definite and symmetric matrix. Our proposal involves a weight matrix to improve the estimation efficiency and tackle the issue of heterogeneity, the rationale of which is illustrated by formulating the estimation from a novel regression perspective. Asymptotic properties of the proposed method are studied when $p$ diverges at some polynomial rate as $n$ increases. To provide a parsimonious model and enhance interpretability for near-zero factor loadings, we impose sparsity assumptions on the factor loading space and then develop a regularized estimation procedure with theoretical guarantees when $p$ grows exponentially fast relative to $n.$ Finally, we demonstrate that our proposed estimators significantly outperform the competing methods through both simulations and applications to a U.K. temperature data set and a Japanese mortality data set."}, "https://arxiv.org/abs/2303.11721": {"title": "Using Forests in Multivariate Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2303.11721", "description": "arXiv:2303.11721v2 Announce Type: replace \nAbstract: We discuss estimation and inference of conditional treatment effects in regression discontinuity designs with multiple scores. 
Aside from the commonly used local linear regression approach and a minimax-optimal estimator recently proposed by Imbens and Wager (2019), we consider two estimators based on random forests -- honest regression forests and local linear forests -- whose construction resembles that of standard local regressions, with theoretical validity following from results in Wager and Athey (2018) and Friedberg et al. (2020). We design a systematic Monte Carlo study with data generating processes built both from functional forms that we specify and from Wasserstein Generative Adversarial Networks that can closely mimic the observed data. We find that no single estimator dominates across all simulations: (i) local linear regressions perform well in univariate settings, but can undercover when multivariate scores are transformed into a univariate score -- which is commonly done in practice -- possibly due to the \"zero-density\" issue of the collapsed univariate score at the transformed cutoff; (ii) good performance of the minimax-optimal estimator depends on accurate estimation of a nuisance parameter and its current implementation only accepts up to two scores; (iii) forest-based estimators are not designed for estimation at boundary points and can suffer from bias in finite samples, but their flexibility in modeling multivariate scores opens the door to a wide range of empirical applications in multivariate regression discontinuity designs."}, "https://arxiv.org/abs/2307.05818": {"title": "What Does it Take to Control Global Temperatures? A toolbox for testing and estimating the impact of economic policies on climate", "link": "https://arxiv.org/abs/2307.05818", "description": "arXiv:2307.05818v2 Announce Type: replace \nAbstract: This paper tests the feasibility and estimates the cost of climate control through economic policies. It provides a toolbox for a statistical historical assessment of a Stochastic Integrated Model of Climate and the Economy, and its use in (possibly counterfactual) policy analysis. Recognizing that stabilization requires suppressing a trend, we use an integrated-cointegrated Vector Autoregressive Model estimated using a newly compiled dataset ranging between years A.D. 1000-2008, extending previous results on Control Theory in nonstationary systems. We test statistically whether, and quantify to what extent, carbon abatement policies can effectively stabilize or reduce global temperatures. Our formal test of policy feasibility shows that carbon abatement can have a significant long run impact and policies can render temperatures stationary around a chosen long run mean. In a counterfactual empirical illustration of the possibilities of our modeling strategy, we study a retrospective policy aiming to keep global temperatures close to their 1900 historical level. Achieving this via carbon abatement may cost about 75% of the observed 2008 level of world GDP, a cost equivalent to reverting to levels of output historically observed in the mid 1960s. 
By contrast, investment in carbon neutral technology could achieve the policy objective and be self-sustainable as long as it costs less than 50% of 2008 global GDP and 75% of consumption."}, "https://arxiv.org/abs/2310.12285": {"title": "Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm", "link": "https://arxiv.org/abs/2310.12285", "description": "arXiv:2310.12285v2 Announce Type: replace \nAbstract: High-dimensional longitudinal data is increasingly used in a wide range of scientific studies. To properly account for dependence between longitudinal observations, statistical methods for high-dimensional linear mixed models (LMMs) have been developed. However, few packages implementing these high-dimensional LMMs are available in the statistical software R. Additionally, some packages suffer from scalability issues. This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators of hyperparameters for increased flexibility and an Expectation-Conditional-Minimization (ECM) algorithm for computationally efficient maximum a posteriori probability (MAP) estimation of parameters. The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation. We illustrate Linear Mixed Modeling with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies evaluating fixed and random effects estimation along with computation time. A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time. Supplementary materials are available online."}, "https://arxiv.org/abs/2401.07259": {"title": "Inference for bivariate extremes via a semi-parametric angular-radial model", "link": "https://arxiv.org/abs/2401.07259", "description": "arXiv:2401.07259v2 Announce Type: replace \nAbstract: The modelling of multivariate extreme events is important in a wide variety of applications, including flood risk analysis, metocean engineering and financial modelling. A wide variety of statistical techniques have been proposed in the literature; however, many such methods are limited in the forms of dependence they can capture, or make strong parametric assumptions about data structures. In this article, we introduce a novel inference framework for multivariate extremes based on a semi-parametric angular-radial model. This model overcomes the limitations of many existing approaches and provides a unified paradigm for assessing joint tail behaviour. Alongside inferential tools, we also introduce techniques for assessing uncertainty and goodness of fit. Our proposed technique is tested on simulated data sets alongside observed metocean time series', with results indicating generally good performance."}, "https://arxiv.org/abs/2308.12227": {"title": "Semiparametric Modeling and Analysis for Longitudinal Network Data", "link": "https://arxiv.org/abs/2308.12227", "description": "arXiv:2308.12227v2 Announce Type: replace-cross \nAbstract: We introduce a semiparametric latent space model for analyzing longitudinal network data. The model consists of a static latent space component and a time-varying node-specific baseline component. We develop a semiparametric efficient score equation for the latent space parameter by adjusting for the baseline nuisance component. 
Estimation is accomplished through a one-step update estimator and an appropriately penalized maximum likelihood estimator. We derive oracle error bounds for the two estimators and address identifiability concerns from a quotient manifold perspective. Our approach is demonstrated using the New York Citi Bike Dataset."}, "https://arxiv.org/abs/2407.07217": {"title": "The Hidden Subsidy of the Affordable Care Act", "link": "https://arxiv.org/abs/2407.07217", "description": "arXiv:2407.07217v1 Announce Type: new \nAbstract: Under the ACA, the federal government paid a substantially larger share of medical costs of newly eligible Medicaid enrollees than previously eligible ones. States could save up to 100% of their per-enrollee costs by reclassifying original enrollees into the newly eligible group. We examine whether this fiscal incentive changed states' enrollment practices. We find that Medicaid expansion caused large declines in the number of beneficiaries enrolled in the original Medicaid population, suggesting widespread reclassifications. In 2019 alone, this phenomenon affected 4.4 million Medicaid enrollees at a federal cost of $8.3 billion. Our results imply that reclassifications inflated the federal cost of Medicaid expansion by 18.2%."}, "https://arxiv.org/abs/2407.07251": {"title": "R", "link": "https://arxiv.org/abs/2407.07251", "description": "arXiv:2407.07251v1 Announce Type: new \nAbstract: This note provides a conceptual clarification of Ronald Aylmer Fisher's (1935) pioneering exact test in the context of the Lady Tasting Tea experiment. It unveils a critical implicit assumption in Fisher's calibration: the taster minimizes expected misclassification given fixed probabilistic information. Without similar assumptions or an explicit alternative hypothesis, the rationale behind Fisher's specification of the rejection region remains unclear."}, "https://arxiv.org/abs/2407.07637": {"title": "Function-valued marked spatial point processes on linear networks: application to urban cycling profiles", "link": "https://arxiv.org/abs/2407.07637", "description": "arXiv:2407.07637v1 Announce Type: new \nAbstract: In the literature on spatial point processes, there is an emerging challenge in studying marked point processes with points being labelled by functions. In this paper, we focus on point processes living on linear networks and, from distinct points of view, propose several marked summary characteristics that are of great use in studying the average association and dispersion of the function-valued marks. Through a simulation study, we evaluate the performance of our proposed marked summary characteristics, both when marks are independent and when some sort of spatial dependence is evident among them. Finally, we employ our proposed mark summary characteristics to study the spatial structure of urban cycling profiles in Vancouver, Canada."}, "https://arxiv.org/abs/2407.07647": {"title": "Two Stage Least Squares with Time-Varying Instruments: An Application to an Evaluation of Treatment Intensification for Type-2 Diabetes", "link": "https://arxiv.org/abs/2407.07647", "description": "arXiv:2407.07647v1 Announce Type: new \nAbstract: As longitudinal data becomes more available in many settings, policy makers are increasingly interested in the effect of time-varying treatments (e.g. sustained treatment strategies). In settings such as this, the preferred analysis techniques are the g-methods; however, these require the untestable assumption of no unmeasured confounding. 
Instrumental variable analyses can minimise bias through unmeasured confounding. Of these methods, the Two Stage Least Squares technique is one of the most well used in Econometrics, but it has not been fully extended, and evaluated, in full time-varying settings. This paper proposes a robust two stage least squares method for the econometric evaluation of time-varying treatment. Using a simulation study we found that, unlike standard two stage least squares, it performs relatively well across a wide range of circumstances, including model misspecification. It compares well with recent time-varying instrument approaches via g-estimation. We illustrate the methods in an evaluation of treatment intensification for Type-2 Diabetes Mellitus, exploring the exogeneity in prescribing preferences to operationalise a time-varying instrument."}, "https://arxiv.org/abs/2407.07717": {"title": "High-dimensional Covariance Estimation by Pairwise Likelihood Truncation", "link": "https://arxiv.org/abs/2407.07717", "description": "arXiv:2407.07717v1 Announce Type: new \nAbstract: Pairwise likelihood offers a useful approximation to the full likelihood function for covariance estimation in high-dimensional context. It simplifies high-dimensional dependencies by combining marginal bivariate likelihood objects, thereby making estimation more manageable. In certain models, including the Gaussian model, both pairwise and full likelihoods are known to be maximized by the same parameter values, thus retaining optimal statistical efficiency, when the number of variables is fixed. Leveraging this insight, we introduce the estimation of sparse high-dimensional covariance matrices by maximizing a truncated version of the pairwise likelihood function, which focuses on pairwise terms corresponding to nonzero covariance elements. To achieve a meaningful truncation, we propose to minimize the discrepancy between pairwise and full likelihood scores plus an L1-penalty discouraging the inclusion of uninformative terms. Differently from other regularization approaches, our method selects whole pairwise likelihood objects rather than individual covariance parameters, thus retaining the inherent unbiasedness of the pairwise likelihood estimating equations. This selection procedure is shown to have the selection consistency property as the covariance dimension increases exponentially fast. As a result, the implied pairwise likelihood estimator is consistent and converges to the oracle maximum likelihood estimator that assumes knowledge of nonzero covariance entries."}, "https://arxiv.org/abs/2407.07809": {"title": "Direct estimation and inference of higher-level correlations from lower-level measurements with applications in gene-pathway and proteomics studies", "link": "https://arxiv.org/abs/2407.07809", "description": "arXiv:2407.07809v1 Announce Type: new \nAbstract: This paper tackles the challenge of estimating correlations between higher-level biological variables (e.g., proteins and gene pathways) when only lower-level measurements are directly observed (e.g., peptides and individual genes). Existing methods typically aggregate lower-level data into higher-level variables and then estimate correlations based on the aggregated data. However, different data aggregation methods can yield varying correlation estimates as they target different higher-level quantities. Our solution is a latent factor model that directly estimates these higher-level correlations from lower-level data without the need for data aggregation. 
We further introduce a shrinkage estimator to ensure the positive definiteness and improve the accuracy of the estimated correlation matrix. Furthermore, we establish the asymptotic normality of our estimator, enabling efficient computation of p-values for the identification of significant correlations. The effectiveness of our approach is demonstrated through comprehensive simulations and the analysis of proteomics and gene expression datasets. We develop the R package highcor for implementing our method."}, "https://arxiv.org/abs/2407.07134": {"title": "Calibrating satellite maps with field data for improved predictions of forest biomass", "link": "https://arxiv.org/abs/2407.07134", "description": "arXiv:2407.07134v1 Announce Type: cross \nAbstract: Spatially explicit quantification of forest biomass is important for forest-health monitoring and carbon accounting. Direct field measurements of biomass are laborious and expensive, typically limiting their spatial and temporal sampling density and therefore the precision and resolution of the resulting inference. Satellites can provide biomass predictions at a far greater density, but these predictions are often biased relative to field measurements and exhibit heterogeneous errors. We developed and implemented a coregionalization model between sparse field measurements and a predictive satellite map to deliver improved predictions of biomass density at a 1-by-1 km resolution throughout the Pacific states of California, Oregon and Washington. The model accounts for zero-inflation in the field measurements and the heterogeneous errors in the satellite predictions. A stochastic partial differential equation approach to spatial modeling is applied to handle the magnitude of the satellite data. The spatial detail rendered by the model is much finer than would be possible with the field measurements alone, and the model provides substantial noise-filtering and bias-correction to the satellite map."}, "https://arxiv.org/abs/2407.07297": {"title": "Geometric quantile-based measures of multivariate distributional characteristics", "link": "https://arxiv.org/abs/2407.07297", "description": "arXiv:2407.07297v1 Announce Type: cross \nAbstract: Several new geometric quantile-based measures for multivariate dispersion, skewness, kurtosis, and spherical asymmetry are defined. These measures differ from existing measures, which use volumes and are easy to calculate. Some theoretical justification is given, followed by experiments illustrating that they are reasonable measures of these distributional characteristics and computing confidence regions with the desired coverage."}, "https://arxiv.org/abs/2407.07338": {"title": "Towards Complete Causal Explanation with Expert Knowledge", "link": "https://arxiv.org/abs/2407.07338", "description": "arXiv:2407.07338v1 Announce Type: cross \nAbstract: We study the problem of restricting Markov equivalence classes of maximal ancestral graphs (MAGs) containing certain edge marks, which we refer to as expert knowledge. MAGs forming a Markov equivalence class can be uniquely represented by an essential ancestral graph. We seek to learn the restriction of the essential ancestral graph containing the proposed expert knowledge. Our contributions are several-fold. First, we prove certain properties for the entire Markov equivalence class including a conjecture from Ali et al. (2009). 
Second, we present three sound graphical orientation rules, two of which generalize previously known rules, for adding expert knowledge to an essential graph. We also show that some orientation rules of Zhang (2008) are not needed for restricting the Markov equivalence class with expert knowledge. We provide an algorithm for including this expert knowledge and show that our algorithm is complete in certain settings, i.e., in these settings, the output of our algorithm is a restricted essential ancestral graph. We conjecture this algorithm is complete generally. Outside of our specified settings, we provide an algorithm for checking whether a graph is a restricted essential graph and discuss its runtime. This work can be seen as a generalization of Meek (1995)."}, "https://arxiv.org/abs/2407.07596": {"title": "Learning treatment effects while treating those in need", "link": "https://arxiv.org/abs/2407.07596", "description": "arXiv:2407.07596v1 Announce Type: cross \nAbstract: Many social programs attempt to allocate scarce resources to people with the greatest need. Indeed, public services increasingly use algorithmic risk assessments motivated by this goal. However, targeting the highest-need recipients often conflicts with attempting to evaluate the causal effect of the program as a whole, as the best evaluations would be obtained by randomizing the allocation. We propose a framework to design randomized allocation rules which optimally balance targeting high-need individuals with learning treatment effects, presenting policymakers with a Pareto frontier between the two goals. We give sample complexity guarantees for the policy learning problem and provide a computationally efficient strategy to implement it. We then apply our framework to data from human services in Allegheny County, Pennsylvania. Optimized policies can substantially mitigate the tradeoff between learning and targeting. For example, it is often possible to obtain 90% of the optimal utility in targeting high-need individuals while ensuring that the average treatment effect can be estimated with less than 2 times the samples that a randomized controlled trial would require. Mechanisms for targeting public services often focus on measuring need as accurately as possible. However, our results suggest that algorithmic systems in public services can be most impactful if they incorporate program evaluation as an explicit goal alongside targeting."}, "https://arxiv.org/abs/2301.01480": {"title": "A new over-dispersed count model", "link": "https://arxiv.org/abs/2301.01480", "description": "arXiv:2301.01480v3 Announce Type: replace \nAbstract: A new two-parameter discrete distribution, namely the PoiG distribution, is derived by the convolution of a Poisson variate and an independently distributed geometric random variable. This distribution generalizes both the Poisson and geometric distributions and can be used for modelling over-dispersed as well as equi-dispersed count data. A number of important statistical properties of the proposed count model are derived, such as the probability generating function, the moment generating function, the moments, the survival function and the hazard rate function. Monotonic properties, such as log concavity and stochastic ordering, are also investigated in detail. The method of moments and maximum likelihood estimators of the parameters of the proposed model are presented. 
It is envisaged that the proposed distribution may prove to be useful for practitioners for modelling over-dispersed count data compared to its closest competitors."}, "https://arxiv.org/abs/2302.10367": {"title": "JOINTVIP: Prioritizing variables in observational study design with joint variable importance plot in R", "link": "https://arxiv.org/abs/2302.10367", "description": "arXiv:2302.10367v4 Announce Type: replace \nAbstract: Credible causal effect estimation requires treated subjects and controls to be otherwise similar. In observational settings, such as analysis of electronic health records, this is not guaranteed. Investigators must balance background variables so they are similar in treated and control groups. Common approaches include matching (grouping individuals into small homogeneous sets) or weighting (upweighting or downweighting individuals) to create similar profiles. However, creating identical distributions may be impossible if many variables are measured, and not all variables are of equal importance to the outcome. The joint variable importance plot (jointVIP) package guides decisions about which variables to prioritize for adjustment by quantifying and visualizing each variable's relationship to both treatment and outcome."}, "https://arxiv.org/abs/2307.13364": {"title": "Testing for sparse idiosyncratic components in factor-augmented regression models", "link": "https://arxiv.org/abs/2307.13364", "description": "arXiv:2307.13364v4 Announce Type: replace \nAbstract: We propose a novel bootstrap test of a dense model, namely factor regression, against a sparse plus dense alternative augmenting the model with sparse idiosyncratic components. The asymptotic properties of the test are established under time series dependence and polynomial tails. We outline a data-driven rule to select the tuning parameter and prove its theoretical validity. In simulation experiments, our procedure exhibits high power against sparse alternatives and low power against dense deviations from the null. Moreover, we apply our test to various datasets in macroeconomics and finance and often reject the null. This suggests the presence of sparsity -- on top of a dense model -- in commonly studied economic applications. The R package FAS implements our approach."}, "https://arxiv.org/abs/2401.02917": {"title": "Bayesian changepoint detection via logistic regression and the topological analysis of image series", "link": "https://arxiv.org/abs/2401.02917", "description": "arXiv:2401.02917v2 Announce Type: replace \nAbstract: We present a Bayesian method for multivariate changepoint detection that allows for simultaneous inference on the location of a changepoint and the coefficients of a logistic regression model for distinguishing pre-changepoint data from post-changepoint data. In contrast to many methods for multivariate changepoint detection, the proposed method is applicable to data of mixed type and avoids strict assumptions regarding the distribution of the data and the nature of the change. The regression coefficients provide an interpretable description of a potentially complex change. For posterior inference, the model admits a simple Gibbs sampling algorithm based on P\\'olya-gamma data augmentation. We establish conditions under which the proposed method is guaranteed to recover the true underlying changepoint. As a testing ground for our method, we consider the problem of detecting topological changes in time series of images. 
We demonstrate that our proposed method $\\mathtt{bclr}$, combined with a topological feature embedding, performs well on both simulated and real image data. The method also successfully recovers the location and nature of changes in more traditional changepoint tasks."}, "https://arxiv.org/abs/2407.07933": {"title": "Identification and Estimation of the Bi-Directional MR with Some Invalid Instruments", "link": "https://arxiv.org/abs/2407.07933", "description": "arXiv:2407.07933v1 Announce Type: new \nAbstract: We consider the challenging problem of estimating causal effects from purely observational data in the bi-directional Mendelian randomization (MR), where some invalid instruments, as well as unmeasured confounding, usually exist. To address this problem, most existing methods attempt to find proper valid instrumental variables (IVs) for the target causal effect by expert knowledge or by assuming that the causal model is a one-directional MR model. As such, in this paper, we first theoretically investigate the identification of the bi-directional MR from observational data. In particular, we provide necessary and sufficient conditions under which valid IV sets are correctly identified such that the bi-directional MR model is identifiable, including the causal directions of a pair of phenotypes (i.e., the treatment and outcome). Moreover, based on the identification theory, we develop a cluster fusion-like method to discover valid IV sets and estimate the causal effects of interest. We theoretically demonstrate the correctness of the proposed algorithm. Experimental results show the effectiveness of our method for estimating causal effects in bi-directional MR."}, "https://arxiv.org/abs/2407.07934": {"title": "Identifying macro conditional independencies and macro total effects in summary causal graphs with latent confounding", "link": "https://arxiv.org/abs/2407.07934", "description": "arXiv:2407.07934v1 Announce Type: new \nAbstract: Understanding causal relationships in dynamic systems is essential for numerous scientific fields, including epidemiology, economics, and biology. While causal inference methods have been extensively studied, they often rely on fully specified causal graphs, which may not always be available or practical in complex dynamic systems. Partially specified causal graphs, such as summary causal graphs (SCGs), provide a simplified representation of causal relationships, omitting temporal information and focusing on high-level causal structures. This simplification introduces new challenges concerning the types of queries of interest: macro queries, which involve relationships between clusters represented as vertices in the graph, and micro queries, which pertain to relationships between variables that are not directly visible through the vertices of the graph. In this paper, we first clearly distinguish between macro conditional independencies and micro conditional independencies and between macro total effects and micro total effects. Then, we demonstrate the soundness and completeness of the d-separation to identify macro conditional independencies in SCGs. Furthermore, we establish that the do-calculus is sound and complete for identifying macro total effects in SCGs. 
Conversely, we also show through various examples that these results do not hold when considering micro conditional independencies and micro total effects."}, "https://arxiv.org/abs/2407.07973": {"title": "Reduced-Rank Matrix Autoregressive Models: A Medium $N$ Approach", "link": "https://arxiv.org/abs/2407.07973", "description": "arXiv:2407.07973v1 Announce Type: new \nAbstract: Reduced-rank regressions are powerful tools used to identify co-movements within economic time series. However, this task becomes challenging when we observe matrix-valued time series, where each dimension may have a different co-movement structure. We propose reduced-rank regressions with a tensor structure for the coefficient matrix to provide new insights into co-movements within and between the dimensions of matrix-valued time series. Moreover, we relate the co-movement structures to two commonly used reduced-rank models, namely the serial correlation common feature and the index model. Two empirical applications involving U.S.\\ states and economic indicators for the Eurozone and North American countries illustrate how our new tools identify co-movements."}, "https://arxiv.org/abs/2407.07988": {"title": "Production function estimation using subjective expectations data", "link": "https://arxiv.org/abs/2407.07988", "description": "arXiv:2407.07988v1 Announce Type: new \nAbstract: Standard methods for estimating production functions in the Olley and Pakes (1996) tradition require assumptions on input choices. We introduce a new method that exploits (increasingly available) data on a firm's expectations of its future output and inputs that allows us to obtain consistent production function parameter estimates while relaxing these input demand assumptions. In contrast to dynamic panel methods, our proposed estimator can be implemented on very short panels (including a single cross-section), and Monte Carlo simulations show it outperforms alternative estimators when firms' material input choices are subject to optimization error. Implementing a range of production function estimators on UK data, we find our proposed estimator yields results that are either similar to or more credible than commonly-used alternatives. These differences are larger in industries where material inputs appear harder to optimize. We show that TFP implied by our proposed estimator is more strongly associated with future jobs growth than existing methods, suggesting that failing to adequately account for input endogeneity may underestimate the degree of dynamic reallocation in the economy."}, "https://arxiv.org/abs/2407.08140": {"title": "Variational Bayes for Mixture of Gaussian Structural Equation Models", "link": "https://arxiv.org/abs/2407.08140", "description": "arXiv:2407.08140v1 Announce Type: new \nAbstract: Structural equation models (SEMs) are commonly used to study the structural relationship between observed variables and latent constructs. Recently, Bayesian fitting procedures for SEMs have received more attention thanks to their potential to facilitate the adoption of more flexible model structures, and variational approximations have been shown to provide fast and accurate inference for Bayesian analysis of SEMs. However, the application of variational approximations is currently limited to very simple, elemental SEMs. We develop mean-field variational Bayes algorithms for two SEM formulations for data that present non-Gaussian features such as skewness and multimodality. 
The proposed models exploit the use of mixtures of Gaussians, include covariates for the analysis of latent traits and consider missing data. We also examine two variational information criteria for model selection that are straightforward to compute in our variational inference framework. The performance of the MFVB algorithms and information criteria is investigated in a simulated data study and a real data application."}, "https://arxiv.org/abs/2407.08228": {"title": "Wasserstein $k$-Centres Clustering for Distributional Data", "link": "https://arxiv.org/abs/2407.08228", "description": "arXiv:2407.08228v1 Announce Type: new \nAbstract: We develop a novel clustering method for distributional data, where each data point is regarded as a probability distribution on the real line. For distributional data, it has been challenging to develop a clustering method that utilizes the mode of variation of data because the space of probability distributions lacks a vector space structure, preventing the application of existing methods for functional data. In this study, we propose a novel clustering method for distributional data on the real line, which takes account of differences in both the mean and mode of variation structures of clusters, in the spirit of the $k$-centres clustering approach proposed for functional data. Specifically, we consider the space of distributions equipped with the Wasserstein metric and define a geodesic mode of variation of distributional data using geodesic principal component analysis. Then, we utilize the geodesic mode of each cluster to predict the cluster membership of each distribution. We theoretically show the validity of the proposed clustering criterion by studying the probability of correct membership. Through a simulation study and real data application, we demonstrate that the proposed distributional clustering method can improve cluster quality compared to conventional clustering algorithms."}, "https://arxiv.org/abs/2407.08278": {"title": "Structuring, Sequencing, Staging, Selecting: the 4S method for the longitudinal analysis of multidimensional measurement scales in chronic diseases", "link": "https://arxiv.org/abs/2407.08278", "description": "arXiv:2407.08278v1 Announce Type: new \nAbstract: In clinical studies, measurement scales are often collected to report disease-related manifestations from clinician or patient perspectives. Their analysis can help identify relevant manifestations throughout the disease course, enhancing knowledge of disease progression and guiding clinicians in providing appropriate support. However, the analysis of measurement scales in health studies is not straightforward, as they consist of repeated, ordinal, and potentially multidimensional item data. Their sum-score summaries may considerably reduce information and impede interpretation, their change over time occurs along clinical progression, and, as with many other longitudinal processes, their observation may be truncated by events. This work establishes a comprehensive strategy in four consecutive steps to leverage repeated data from multidimensional measurement scales. 
The 4S method successively (1) identifies the scale structure into subdimensions satisfying three calibration assumptions (unidimensionality, conditional independence, increasing monotonicity), (2) describes each subdimension progression using a joint latent process model which includes a continuous-time item response theory model for the longitudinal subpart, (3) aligns each subdimension's progression with disease stages through a projection approach, and (4) identifies the most informative items across disease stages using the Fisher's information. The method is comprehensively illustrated in multiple system atrophy (MSA), an alpha-synucleinopathy, with the analysis of daily activity and motor impairments over disease progression. The 4S method provides an effective and complete analytical strategy for any measurement scale repeatedly collected in health studies."}, "https://arxiv.org/abs/2407.08317": {"title": "Inference procedures in sequential trial emulation with survival outcomes: comparing confidence intervals based on the sandwich variance estimator, bootstrap and jackknife", "link": "https://arxiv.org/abs/2407.08317", "description": "arXiv:2407.08317v1 Announce Type: new \nAbstract: Sequential trial emulation (STE) is an approach to estimating causal treatment effects by emulating a sequence of target trials from observational data. In STE, inverse probability weighting is commonly utilised to address time-varying confounding and/or dependent censoring. Then structural models for potential outcomes are applied to the weighted data to estimate treatment effects. For inference, the simple sandwich variance estimator is popular but conservative, while nonparametric bootstrap is computationally expensive, and a more efficient alternative, linearised estimating function (LEF) bootstrap, has not been adapted to STE. We evaluated the performance of various methods for constructing confidence intervals (CIs) of marginal risk differences in STE with survival outcomes by comparing the coverage of CIs based on nonparametric/LEF bootstrap, jackknife, and the sandwich variance estimator through simulations. LEF bootstrap CIs demonstrated the best coverage with small/moderate sample sizes, low event rates and low treatment prevalence, which were the motivating scenarios for STE. They were less affected by treatment group imbalance and faster to compute than nonparametric bootstrap CIs. With large sample sizes and medium/high event rates, the sandwich-variance-estimator-based CIs had the best coverage and were the fastest to compute. These findings offer guidance in constructing CIs in causal survival analysis using STE."}, "https://arxiv.org/abs/2407.08382": {"title": "Adjusting for Participation Bias in Case-Control Genetic Association Studies for Rare Diseases", "link": "https://arxiv.org/abs/2407.08382", "description": "arXiv:2407.08382v1 Announce Type: new \nAbstract: Collection of genotype data in case-control genetic association studies may often be incomplete for reasons related to genes themselves. This non-ignorable missingness structure, if not appropriately accounted for, can result in participation bias in association analyses. To deal with this issue, Chen et al. (2016) proposed to collect additional genetic information from family members of individuals whose genotype data were not available, and developed a maximum likelihood method for bias correction. 
In this study, we develop an estimating equation approach to analyzing data collected from this design that allows adjustment of covariates. It jointly estimates odds ratio parameters for genetic association and missingness, where a logistic regression model is used to relate missingness to genotype and other covariates. Our method allows correlation between genotype and covariates while using genetic information from family members to provide information on the missing genotype data. In the estimating equation for genetic association parameters, we weight the contribution of each genotyped subject to the empirical likelihood score function by the inverse probability that the genotype data are available. We evaluate large and finite sample performance of our method via simulation studies and apply it to a family-based case-control study of breast cancer."}, "https://arxiv.org/abs/2407.08510": {"title": "Comparative analysis of Mixed-Data Sampling (MIDAS) model compared to Lag-Llama model for inflation nowcasting", "link": "https://arxiv.org/abs/2407.08510", "description": "arXiv:2407.08510v1 Announce Type: new \nAbstract: Inflation is one of the most important economic indicators closely watched by both public institutions and private agents. This study compares the performance of a traditional econometric model, Mixed Data Sampling regression, with one of the newest developments from the field of Artificial Intelligence, a foundational time series forecasting model based on a Long short-term memory neural network called Lag-Llama, in their ability to nowcast the Harmonized Index of Consumer Prices in the Euro area. The two models were compared to assess whether Lag-Llama can outperform the MIDAS regression, ensuring that the MIDAS regression is evaluated under the best-case scenario, using a dataset spanning from 2010 to 2022. The following metrics were used to evaluate the models: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), correlation with the target, R-squared and adjusted R-squared. The results show better performance of the pre-trained Lag-Llama across all metrics."}, "https://arxiv.org/abs/2407.08565": {"title": "Information matrix test for normality of innovations in stationary time series models", "link": "https://arxiv.org/abs/2407.08565", "description": "arXiv:2407.08565v1 Announce Type: new \nAbstract: This study focuses on the problem of testing for normality of innovations in stationary time series models. To achieve this, we introduce an information matrix (IM) based test. While the IM test was originally developed to test for model misspecification, our study shows that the test can also be used to test for the normality of innovations in various time series models. We provide sufficient conditions under which the limiting null distribution of the test statistics exists. As applications, a first-order threshold moving average model, GARCH model and double autoregressive model are considered. We conduct simulations to evaluate the performance of the proposed test and compare it with other tests, and provide a real data analysis."}, "https://arxiv.org/abs/2407.08599": {"title": "Goodness of fit of relational event models", "link": "https://arxiv.org/abs/2407.08599", "description": "arXiv:2407.08599v1 Announce Type: new \nAbstract: A type of dynamic network involves temporally ordered interactions between actors, where past network configurations may influence future ones. 
The relational event model can be used to identify the underlying dynamics that drive interactions among system components. Despite the rapid development of this model over the past 15 years, an ongoing area of research revolves around evaluating the goodness of fit of this model, especially when it incorporates time-varying and random effects. Current methodologies often rely on comparing observed and simulated events using specific statistics, but this can be computationally intensive, and requires various assumptions.\n We propose an additive mixed-effect relational event model estimated via case-control sampling, and introduce a versatile framework for testing the goodness of fit of such models using weighted martingale residuals. Our focus is on a Kolmogorov-Smirnov type test designed to assess if covariates are accurately modeled. Our approach can be easily extended to evaluate whether other features of network dynamics have been appropriately incorporated into the model. We assess the goodness of fit of various relational event models using synthetic data to evaluate the test's power and coverage. Furthermore, we apply the method to a social study involving 57,791 emails sent by 159 employees of a Polish manufacturing company in 2010.\n The method is implemented in the R package mgcv."}, "https://arxiv.org/abs/2407.08602": {"title": "An Introduction to Causal Discovery", "link": "https://arxiv.org/abs/2407.08602", "description": "arXiv:2407.08602v1 Announce Type: new \nAbstract: In social sciences and economics, causal inference traditionally focuses on assessing the impact of predefined treatments (or interventions) on predefined outcomes, such as the effect of education programs on earnings. Causal discovery, in contrast, aims to uncover causal relationships among multiple variables in a data-driven manner, by investigating statistical associations rather than relying on predefined causal structures. This approach, more common in computer science, seeks to understand causality in an entire system of variables, which can be visualized by causal graphs. This survey provides an introduction to key concepts, algorithms, and applications of causal discovery from the perspectives of economics and social sciences. It covers fundamental concepts like d-separation, causal faithfulness, and Markov equivalence, sketches various algorithms for causal discovery, and discusses the back-door and front-door criteria for identifying causal effects. The survey concludes with more specific examples of causal discovery, e.g. for learning all variables that directly affect an outcome of interest and/or testing identification of causal effects in observational data."}, "https://arxiv.org/abs/2407.07996": {"title": "Gradual changes in functional time series", "link": "https://arxiv.org/abs/2407.07996", "description": "arXiv:2407.07996v1 Announce Type: cross \nAbstract: We consider the problem of detecting gradual changes in the sequence of mean functions from a not necessarily stationary functional time series. Our approach is based on the maximum deviation (calculated over a given time interval) between a benchmark function and the mean functions at different time points. We speak of a gradual change of size $\\Delta $, if this quantity exceeds a given threshold $\\Delta>0$. 
For example, the benchmark function could represent an average of yearly temperature curves from the pre-industrial time, and we are interested in the question if the yearly temperature curves afterwards deviate from the pre-industrial average by more than $\\Delta =1.5$ degrees Celsius, where the deviations are measured with respect to the sup-norm. Using Gaussian approximations for high-dimensional data we develop a test for hypotheses of this type and estimators for the time where a deviation of size larger than $\\Delta$ appears for the first time. We prove the validity of our approach and illustrate the new methods by a simulation study and a data example, where we analyze yearly temperature curves at different stations in Australia."}, "https://arxiv.org/abs/2407.08271": {"title": "Gaussian process interpolation with conformal prediction: methods and comparative analysis", "link": "https://arxiv.org/abs/2407.08271", "description": "arXiv:2407.08271v1 Announce Type: cross \nAbstract: This article advocates the use of conformal prediction (CP) methods for Gaussian process (GP) interpolation to enhance the calibration of prediction intervals. We begin by illustrating that using a GP model with parameters selected by maximum likelihood often results in predictions that are not optimally calibrated. CP methods can adjust the prediction intervals, leading to better uncertainty quantification while maintaining the accuracy of the underlying GP model. We compare different CP variants and introduce a novel variant based on an asymmetric score. Our numerical experiments demonstrate the effectiveness of CP methods in improving calibration without compromising accuracy. This work aims to facilitate the adoption of CP methods in the GP community."}, "https://arxiv.org/abs/2407.08485": {"title": "Local logistic regression for dimension reduction in classification", "link": "https://arxiv.org/abs/2407.08485", "description": "arXiv:2407.08485v1 Announce Type: cross \nAbstract: Sufficient dimension reduction has received much interest over the past 30 years. Most existing approaches focus on statistical models linking the response to the covariate through a regression equation, and as such are not adapted to binary classification problems. We address the question of dimension reduction for binary classification by fitting a localized nearest-neighbor logistic model with $\\ell_1$-penalty in order to estimate the gradient of the conditional probability of interest. Our theoretical analysis shows that the pointwise convergence rate of the gradient estimator is optimal under very mild conditions. The dimension reduction subspace is estimated using an outer product of such gradient estimates at several points in the covariate space. Our implementation uses cross-validation on the misclassification rate to estimate the dimension of this subspace. We find that the proposed approach outperforms existing competitors in synthetic and real data applications."}, "https://arxiv.org/abs/2407.08560": {"title": "Causal inference through multi-stage learning and doubly robust deep neural networks", "link": "https://arxiv.org/abs/2407.08560", "description": "arXiv:2407.08560v1 Announce Type: cross \nAbstract: Deep neural networks (DNNs) have demonstrated remarkable empirical performance in large-scale supervised learning problems, particularly in scenarios where both the sample size $n$ and the dimension of covariates $p$ are large. 
This study delves into the application of DNNs across a wide spectrum of intricate causal inference tasks, where direct estimation falls short and necessitates multi-stage learning. Examples include estimating the conditional average treatment effect and dynamic treatment effect. In this framework, DNNs are constructed sequentially, with subsequent stages building upon preceding ones. To mitigate the impact of estimation errors from early stages on subsequent ones, we integrate DNNs in a doubly robust manner. In contrast to previous research, our study offers theoretical assurances regarding the effectiveness of DNNs in settings where the dimensionality $p$ expands with the sample size. These findings are significant independently and extend to degenerate single-stage learning problems."}, "https://arxiv.org/abs/2407.08718": {"title": "The exact non-Gaussian weak lensing likelihood: A framework to calculate analytic likelihoods for correlation functions on masked Gaussian random fields", "link": "https://arxiv.org/abs/2407.08718", "description": "arXiv:2407.08718v1 Announce Type: cross \nAbstract: We present exact non-Gaussian joint likelihoods for auto- and cross-correlation functions on arbitrarily masked spherical Gaussian random fields. Our considerations apply to spin-0 as well as spin-2 fields but are demonstrated here for the spin-2 weak-lensing correlation function.\n We motivate that this likelihood cannot be Gaussian and show how it can nevertheless be calculated exactly for any mask geometry and on a curved sky, as well as jointly for different angular-separation bins and redshift-bin combinations. Splitting our calculation into a large- and small-scale part, we apply a computationally efficient approximation for the small scales that does not alter the overall non-Gaussian likelihood shape.\n To compare our exact likelihoods to correlation-function sampling distributions, we simulated a large number of weak-lensing maps, including shape noise, and find excellent agreement for one-dimensional as well as two-dimensional distributions. Furthermore, we compare the exact likelihood to the widely employed Gaussian likelihood and find significant levels of skewness at angular separations $\\gtrsim 1^{\\circ}$ such that the mode of the exact distributions is shifted away from the mean towards lower values of the correlation function. We find that the assumption of a Gaussian random field for the weak-lensing field is well valid at these angular separations.\n Considering the skewness of the non-Gaussian likelihood, we evaluate its impact on the posterior constraints on $S_8$. On a simplified weak-lensing-survey setup with an area of $10 \\ 000 \\ \\mathrm{deg}^2$, we find that the posterior mean of $S_8$ is up to $2\\%$ higher when using the non-Gaussian likelihood, a shift comparable to the precision of current stage-III surveys."}, "https://arxiv.org/abs/2302.05747": {"title": "Individualized Treatment Allocation in Sequential Network Games", "link": "https://arxiv.org/abs/2302.05747", "description": "arXiv:2302.05747v4 Announce Type: replace \nAbstract: Designing individualized allocation of treatments so as to maximize the equilibrium welfare of interacting agents has many policy-relevant applications. Focusing on sequential decision games of interacting agents, this paper develops a method to obtain optimal treatment assignment rules that maximize a social welfare criterion by evaluating stationary distributions of outcomes. 
Stationary distributions in sequential decision games are given by Gibbs distributions, which are difficult to optimize with respect to a treatment allocation due to analytical and computational complexity. We apply a variational approximation to the stationary distribution and optimize the approximated equilibrium welfare with respect to treatment allocation using a greedy optimization algorithm. We characterize the performance of the variational approximation, deriving a performance guarantee for the greedy optimization algorithm via a welfare regret bound. We implement our proposed method in simulation exercises and an empirical application using the Indian microfinance data (Banerjee et al., 2013), and show it delivers significant welfare gains."}, "https://arxiv.org/abs/2308.14625": {"title": "Models for temporal clustering of extreme events with applications to mid-latitude winter cyclones", "link": "https://arxiv.org/abs/2308.14625", "description": "arXiv:2308.14625v2 Announce Type: replace \nAbstract: The occurrence of extreme events like heavy precipitation or storms at a certain location often shows a clustering behaviour and is thus not described well by a Poisson process. We construct a general model for the inter-exceedance times in between such events which combines different candidate models for such behaviour. This allows us to distinguish data generating mechanisms leading to clusters of dependent events with exponential inter-exceedance times in between clusters from independent events with heavy-tailed inter-exceedance times, and even allows us to combine these two mechanisms for better descriptions of such occurrences. We propose a modification of the Cram\\'er-von Mises distance for model fitting. An application to mid-latitude winter cyclones illustrates the usefulness of our work."}, "https://arxiv.org/abs/2312.06883": {"title": "Adaptive Experiments Toward Learning Treatment Effect Heterogeneity", "link": "https://arxiv.org/abs/2312.06883", "description": "arXiv:2312.06883v3 Announce Type: replace \nAbstract: Understanding treatment effect heterogeneity has become an increasingly popular task in various fields, as it helps design personalized advertisements in e-commerce or targeted treatment in biomedical studies. However, most of the existing work in this research area focused on either analyzing observational data based on strong causal assumptions or conducting post hoc analyses of randomized controlled trial data, and there has been limited effort dedicated to the design of randomized experiments specifically for uncovering treatment effect heterogeneity. In the manuscript, we develop a framework for designing and analyzing response adaptive experiments toward better learning treatment effect heterogeneity. Concretely, we provide response adaptive experimental design frameworks that sequentially revise the data collection mechanism according to the accrued evidence during the experiment. Such design strategies allow for the identification of subgroups with the largest treatment effects with enhanced statistical efficiency. The proposed frameworks not only unify adaptive enrichment designs and response-adaptive randomization designs but also complement A/B test designs in e-commerce and randomized trial designs in clinical settings. 
We demonstrate the merit of our design with theoretical justifications and in simulation studies with synthetic e-commerce and clinical trial data."}, "https://arxiv.org/abs/2407.08761": {"title": "Bayesian analysis for pretest-posttest binary outcomes with adaptive significance levels", "link": "https://arxiv.org/abs/2407.08761", "description": "arXiv:2407.08761v1 Announce Type: new \nAbstract: Count outcomes in longitudinal studies are frequent in clinical and engineering studies. In frequentist and Bayesian statistical analysis, methods such as Mixed linear models allow the variability or correlation within individuals to be taken into account. However, in more straightforward scenarios, where only two stages of an experiment are observed (pre-treatment vs. post-treatment), there are only a few tools available, mainly for continuous outcomes. Thus, this work introduces a Bayesian statistical methodology for comparing paired samples in binary pretest-posttest scenarios. We establish a Bayesian probabilistic model for the inferential analysis of the unknown quantities, which is validated and refined through simulation analyses, and present an application to a dataset taken from the Television School and Family Smoking Prevention and Cessation Project (TVSFP) (Flay et al., 1995). The application of the Full Bayesian Significance Test (FBST) for precise hypothesis testing, along with the implementation of adaptive significance levels in the decision-making process, is included."}, "https://arxiv.org/abs/2407.08814": {"title": "Covariate Assisted Entity Ranking with Sparse Intrinsic Scores", "link": "https://arxiv.org/abs/2407.08814", "description": "arXiv:2407.08814v1 Announce Type: new \nAbstract: This paper addresses the item ranking problem with associate covariates, focusing on scenarios where the preference scores can not be fully explained by covariates, and the remaining intrinsic scores, are sparse. Specifically, we extend the pioneering Bradley-Terry-Luce (BTL) model by incorporating covariate information and considering sparse individual intrinsic scores. Our work introduces novel model identification conditions and examines the regularized penalized Maximum Likelihood Estimator (MLE) statistical rates. We then construct a debiased estimator for the penalized MLE and analyze its distributional properties. Additionally, we apply our method to the goodness-of-fit test for models with no latent intrinsic scores, namely, the covariates fully explaining the preference scores of individual items. We also offer confidence intervals for ranks. Our numerical studies lend further support to our theoretical findings, demonstrating validation for our proposed method"}, "https://arxiv.org/abs/2407.08827": {"title": "Estimating Methane Emissions from the Upstream Oil and Gas Industry Using a Multi-Stage Framework", "link": "https://arxiv.org/abs/2407.08827", "description": "arXiv:2407.08827v1 Announce Type: new \nAbstract: Measurement-based methane inventories, which involve surveying oil and gas facilities and compiling data to estimate methane emissions, are becoming the gold standard for quantifying emissions. However, there is a current lack of statistical guidance for the design and analysis of such surveys. The only existing method is a Monte Carlo procedure which is difficult to interpret, computationally intensive, and lacks available open-source code for its implementation. We provide an alternative method by framing methane surveys in the context of multi-stage sampling designs. 
We contribute estimators of the total emissions along with variance estimators which do not require simulation, as well as stratum-level total estimators. We show that the variance contribution from each stage of sampling can be estimated to inform the design of future surveys. We also introduce a more efficient modification of the estimator. Finally, we propose combining the multi-stage approach with a simple Monte Carlo procedure to model measurement error. The resulting methods are interpretable and require minimal computational resources. We apply the methods to aerial survey data of oil and gas facilities in British Columbia, Canada, to estimate the methane emissions in the province. An R package is provided to facilitate the use of the methods."}, "https://arxiv.org/abs/2407.08862": {"title": "Maximum Entropy Estimation of Heterogeneous Causal Effects", "link": "https://arxiv.org/abs/2407.08862", "description": "arXiv:2407.08862v1 Announce Type: new \nAbstract: For the purpose of causal inference we employ a stochastic model of the data generating process, utilizing individual propensity probabilities for the treatment, and also individual and counterfactual prognosis probabilities for the outcome. We assume a generalized version of the stable unit treatment value assumption, but we do not assume any version of strongly ignorable treatment assignment. Instead of conducting a sensitivity analysis, we utilize the principle of maximum entropy to estimate the distribution of causal effects. We develop a principled middle-way between extreme explanations of the observed data: we do not conclude that an observed association is wholly spurious, and we do not conclude that it is wholly causal. Rather, our conclusions are tempered and we conclude that the association is part spurious and part causal. In an example application we apply our methodology to analyze an observed association between marijuana use and hard drug use."}, "https://arxiv.org/abs/2407.08911": {"title": "Computationally efficient and statistically accurate conditional independence testing with spaCRT", "link": "https://arxiv.org/abs/2407.08911", "description": "arXiv:2407.08911v1 Announce Type: new \nAbstract: We introduce the saddlepoint approximation-based conditional randomization test (spaCRT), a novel conditional independence test that effectively balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens. Resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computational cost. The spaCRT leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands. We prove that the spaCRT p-value approximates the dCRT p-value with vanishing relative error, and that these two tests are asymptotically equivalent. Through extensive simulations and real data analysis, we demonstrate that the spaCRT controls Type-I error and maintains high power, outperforming other asymptotic and resampling-based tests. 
Our method is particularly well-suited for large-scale single-cell CRISPR screen analyses, facilitating the efficient and accurate assessment of perturbation-gene associations."}, "https://arxiv.org/abs/2407.09062": {"title": "Temporal M-quantile models and robust bias-corrected small area predictors", "link": "https://arxiv.org/abs/2407.09062", "description": "arXiv:2407.09062v1 Announce Type: new \nAbstract: In small area estimation, it is a smart strategy to rely on data measured over time. However, linear mixed models struggle to properly capture time dependencies when the number of lags is large. Given the lack of published studies addressing robust prediction in small areas using time-dependent data, this research seeks to extend M-quantile models to this field. Indeed, our methodology successfully addresses this challenge and offers flexibility to the widely imposed assumption of unit-level independence. Under the new model, robust bias-corrected predictors for small area linear indicators are derived. Additionally, the optimal selection of the robustness parameter for bias correction is explored, contributing theoretically to the field and enhancing outlier detection. For the estimation of the mean squared error (MSE), a first-order approximation and analytical estimators are obtained under general conditions. Several simulation experiments are conducted to evaluate the performance of the fitting algorithm, the new predictors, and the resulting MSE estimators, as well as the optimal selection of the robustness parameter. Finally, an application to the Spanish Living Conditions Survey data illustrates the usefulness of the proposed predictors."}, "https://arxiv.org/abs/2407.09293": {"title": "Sample size for developing a prediction model with a binary outcome: targeting precise individual risk estimates to improve clinical decisions and fairness", "link": "https://arxiv.org/abs/2407.09293", "description": "arXiv:2407.09293v1 Announce Type: new \nAbstract: When developing a clinical prediction model, the sample size of the development dataset is a key consideration. Small sample sizes lead to greater concerns of overfitting, instability, poor performance and lack of fairness. Previous research has outlined minimum sample size calculations to minimise overfitting and precisely estimate the overall risk. However even when meeting these criteria, the uncertainty (instability) in individual-level risk estimates may be considerable. In this article we propose how to examine and calculate the sample size required for developing a model with acceptably precise individual-level risk estimates to inform decisions and improve fairness. We outline a five-step process to be used before data collection or when an existing dataset is available. It requires researchers to specify the overall risk in the target population, the (anticipated) distribution of key predictors in the model, and an assumed 'core model' either specified directly (i.e., a logistic regression equation is provided) or based on specified C-statistic and relative effects of (standardised) predictors. We produce closed-form solutions that decompose the variance of an individual's risk estimate into Fisher's unit information matrix, predictor values and total sample size; this allows researchers to quickly calculate and examine individual-level uncertainty interval widths and classification instability for specified sample sizes. 
Such information can be presented to key stakeholders (e.g., health professionals, patients, funders) using prediction and classification instability plots to help identify the (target) sample size required to improve trust, reliability and fairness in individual predictions. Our proposal is implemented in software module pmstabilityss. We provide real examples and emphasise the importance of clinical context including any risk thresholds for decision making."}, "https://arxiv.org/abs/2407.09371": {"title": "Computationally Efficient Estimation of Large Probit Models", "link": "https://arxiv.org/abs/2407.09371", "description": "arXiv:2407.09371v1 Announce Type: new \nAbstract: Probit models are useful for modeling correlated discrete responses in many disciplines, including discrete choice data in economics. However, the Gaussian latent variable feature of probit models coupled with identification constraints poses significant computational challenges for their estimation and inference, especially when the dimension of the discrete response variable is large. In this paper, we propose a computationally efficient Expectation-Maximization (EM) algorithm for estimating large probit models. Our work is distinct from existing methods in two important aspects. First, instead of simulation or sampling methods, we apply and customize expectation propagation (EP), a deterministic method originally proposed for approximate Bayesian inference, to estimate moments of the truncated multivariate normal (TMVN) in the E (expectation) step. Second, we take advantage of a symmetric identification condition to transform the constrained optimization problem in the M (maximization) step into a one-dimensional problem, which is solved efficiently using Newton's method instead of off-the-shelf solvers. Our method enables the analysis of correlated choice data in the presence of more than 100 alternatives, which is a reasonable size in modern applications, such as online shopping and booking platforms, but has been difficult in practice with probit models. We apply our probit estimation method to study ordering effects in hotel search results on Expedia.com."}, "https://arxiv.org/abs/2407.09390": {"title": "Tail-robust factor modelling of vector and tensor time series in high dimensions", "link": "https://arxiv.org/abs/2407.09390", "description": "arXiv:2407.09390v1 Announce Type: new \nAbstract: We study the problem of factor modelling vector- and tensor-valued time series in the presence of heavy tails in the data, which produce anomalous observations with non-negligible probability. For this, we propose to combine a two-step procedure with data truncation, which is easy to implement and does not require iteratively searching for a numerical solution. Departing from the light-tail assumptions often adopted in the time series factor modelling literature, we derive the theoretical properties of the proposed estimators while only assuming the existence of the $(2 + 2\\varepsilon)$-th moment for some $\\varepsilon \\in (0, 1)$, fully characterising the effect of heavy tails on the rates of estimation as well as the level of truncation. 
Numerical experiments on simulated datasets demonstrate the good performance of the proposed estimator, which is further supported by applications to two macroeconomic datasets."}, "https://arxiv.org/abs/2407.09443": {"title": "Addressing Confounding and Continuous Exposure Measurement Error Using Corrected Score Functions", "link": "https://arxiv.org/abs/2407.09443", "description": "arXiv:2407.09443v1 Announce Type: new \nAbstract: Confounding and exposure measurement error can introduce bias when drawing inference about the marginal effect of an exposure on an outcome of interest. While there are broad methodologies for addressing each source of bias individually, confounding and exposure measurement error frequently co-occur and there is a need for methods that address them simultaneously. In this paper, corrected score methods are derived under classical additive measurement error to draw inference about marginal exposure effects using only measured variables. Three estimators are proposed based on g-formula, inverse probability weighting, and doubly-robust estimation techniques. The estimators are shown to be consistent and asymptotically normal, and the doubly-robust estimator is shown to exhibit its namesake property. The methods, which are implemented in the R package mismex, perform well in finite samples under both confounding and measurement error as demonstrated by simulation studies. The proposed doubly-robust estimator is applied to study the effects of two biomarkers on HIV-1 infection using data from the HVTN 505 preventative vaccine trial."}, "https://arxiv.org/abs/2407.08750": {"title": "ROLCH: Regularized Online Learning for Conditional Heteroskedasticity", "link": "https://arxiv.org/abs/2407.08750", "description": "arXiv:2407.08750v1 Announce Type: cross \nAbstract: Large-scale streaming data are common in modern machine learning applications and have led to the development of online learning algorithms. Many fields, such as supply chain management, weather and meteorology, energy markets, and finance, have pivoted towards using probabilistic forecasts, which yields the need not only for accurate learning of the expected value but also for learning the conditional heteroskedasticity. Against this backdrop, we present a methodology for online estimation of regularized linear distributional models for conditional heteroskedasticity. The proposed algorithm is based on a combination of recent developments for the online estimation of LASSO models and the well-known GAMLSS framework. We provide a case study on day-ahead electricity price forecasting, in which we show the competitive performance of the adaptive estimation combined with strongly reduced computational effort. Our algorithms are implemented in a computationally efficient Python package."}, "https://arxiv.org/abs/2407.09130": {"title": "On goodness-of-fit testing for self-exciting point processes", "link": "https://arxiv.org/abs/2407.09130", "description": "arXiv:2407.09130v1 Announce Type: cross \nAbstract: Despite the wide usage of parametric point processes in theory and applications, a sound goodness-of-fit procedure to test whether a given parametric model is appropriate for data coming from a self-exciting point processes has been missing in the literature. In this work, we establish a bootstrap-based goodness-of-fit test which empirically works for all kinds of self-exciting point processes (and even beyond). 
In an infill-asymptotic setting we also prove its asymptotic consistency, albeit only in the particular case that the underlying point process is inhomogeneous Poisson."}, "https://arxiv.org/abs/2407.09387": {"title": "Meta-Analysis with Untrusted Data", "link": "https://arxiv.org/abs/2407.09387", "description": "arXiv:2407.09387v1 Announce Type: cross \nAbstract: [See paper for full abstract] Meta-analysis is a crucial tool for answering scientific questions. It is usually conducted on a relatively small amount of ``trusted'' data -- ideally from randomized, controlled trials -- which allow causal effects to be reliably estimated with minimal assumptions. We show how to answer causal questions much more precisely by making two changes. First, we incorporate untrusted data drawn from large observational databases, related scientific literature and practical experience -- without sacrificing rigor or introducing strong assumptions. Second, we train richer models capable of handling heterogeneous trials, addressing a long-standing challenge in meta-analysis. Our approach is based on conformal prediction, which fundamentally produces rigorous prediction intervals, but doesn't handle indirect observations: in meta-analysis, we observe only noisy effects due to the limited number of participants in each trial. To handle noise, we develop a simple, efficient version of fully-conformal kernel ridge regression, based on a novel condition called idiocentricity. We introduce noise-correcting terms in the residuals and analyze their interaction with a ``variance shaving'' technique. In multiple experiments on healthcare datasets, our algorithms deliver tighter, sounder intervals than traditional ones. This paper charts a new course for meta-analysis and evidence-based medicine, where heterogeneity and untrusted data are embraced for more nuanced and precise predictions."}, "https://arxiv.org/abs/2308.00812": {"title": "Causal exposure-response curve estimation with surrogate confounders: a study of air pollution and children's health in Medicaid claims data", "link": "https://arxiv.org/abs/2308.00812", "description": "arXiv:2308.00812v2 Announce Type: replace \nAbstract: In this paper, we undertake a case study to estimate a causal exposure-response function (ERF) for long-term exposure to fine particulate matter (PM$_{2.5}$) and respiratory hospitalizations in socioeconomically disadvantaged children using nationwide Medicaid claims data. These data present specific challenges. First, family income-based Medicaid eligibility criteria for children differ by state, creating socioeconomically distinct populations and leading to clustered data. Second, Medicaid enrollees' socioeconomic status, a confounder and an effect modifier of the exposure-response relationships under study, is not measured. However, two surrogates are available: median household income of each enrollee's zip code and state-level Medicaid family income eligibility thresholds for children. We introduce a customized approach for causal ERF estimation called MedMatch, building on generalized propensity score (GPS) matching methods. MedMatch adapts these methods to (1) leverage the surrogate variables to account for potential confounding and/or effect modification by socioeconomic status and (2) address practical challenges presented by differing exposure distributions across clusters. We also propose a new hyperparameter selection criterion for MedMatch and traditional GPS matching methods. 
Through extensive simulation studies, we demonstrate the strong performance of MedMatch relative to conventional approaches in this setting. We apply MedMatch to estimate the causal ERF between PM$_{2.5}$ and respiratory hospitalization among children in Medicaid, 2000-2012. We find a positive association, with a steeper curve at lower PM$_{2.5}$ concentrations that levels off at higher concentrations."}, "https://arxiv.org/abs/2311.04540": {"title": "On the estimation of the number of components in multivariate functional principal component analysis", "link": "https://arxiv.org/abs/2311.04540", "description": "arXiv:2311.04540v2 Announce Type: replace \nAbstract: Happ and Greven (2018) developed a methodology for principal components analysis of multivariate functional data for data observed on different dimensional domains. Their approach relies on an estimation of univariate functional principal components for each univariate functional feature. In this paper, we present extensive simulations to investigate choosing the number of principal components to retain. We show empirically that the conventional approach of using a percentage of variance explained threshold for each univariate functional feature may be unreliable when aiming to explain an overall percentage of variance in the multivariate functional data, and thus we advise practitioners to be careful when using it."}, "https://arxiv.org/abs/2312.11991": {"title": "Outcomes truncated by death in RCTs: a simulation study on the survivor average causal effect", "link": "https://arxiv.org/abs/2312.11991", "description": "arXiv:2312.11991v2 Announce Type: replace \nAbstract: Continuous outcome measurements truncated by death present a challenge for the estimation of unbiased treatment effects in randomized controlled trials (RCTs). One way to deal with such situations is to estimate the survivor average causal effect (SACE), but this requires making non-testable assumptions. Motivated by an ongoing RCT in very preterm infants with intraventricular hemorrhage, we performed a simulation study to compare a SACE estimator with complete case analysis (CCA) and an analysis after multiple imputation of missing outcomes. We set up 9 scenarios combining positive, negative and no treatment effect on the outcome (cognitive development) and on survival at 2 years of age. Treatment effect estimates from all methods were compared in terms of bias, mean squared error and coverage with regard to two true treatment effects: the treatment effect on the outcome used in the simulation and the SACE, which was derived by simulation of both potential outcomes per patient. Despite targeting different estimands (principal stratum estimand, hypothetical estimand), the SACE-estimator and multiple imputation gave similar estimates of the treatment effect and efficiently reduced the bias compared to CCA. Also, both methods were relatively robust to omission of one covariate in the analysis, and thus violation of relevant assumptions. Although the SACE is not without controversy, we find it useful if mortality is inherent to the study population. 
Some degree of violation of the required assumptions is almost certain, but may be acceptable in practice."}, "https://arxiv.org/abs/2309.10476": {"title": "Thermodynamically rational decision making under uncertainty", "link": "https://arxiv.org/abs/2309.10476", "description": "arXiv:2309.10476v3 Announce Type: replace-cross \nAbstract: An analytical characterization of thermodynamically rational agent behaviour is obtained for a simple, yet non--trivial example of a ``Maxwell's demon\" operating with partial information. Our results provide the first fully transparent physical understanding of a decision problem under uncertainty."}, "https://arxiv.org/abs/2407.09565": {"title": "A Short Note on Event-Study Synthetic Difference-in-Differences Estimators", "link": "https://arxiv.org/abs/2407.09565", "description": "arXiv:2407.09565v1 Announce Type: new \nAbstract: I propose an event study extension of Synthetic Difference-in-Differences (SDID) estimators. I show that, in simple and staggered adoption designs, estimators from Arkhangelsky et al. (2021) can be disaggregated into dynamic treatment effect estimators, comparing the lagged outcome differentials of treated and synthetic controls to their pre-treatment average. Estimators presented in this note can be computed using the sdid_event Stata package."}, "https://arxiv.org/abs/2407.09696": {"title": "Regularizing stock return covariance matrices via multiple testing of correlations", "link": "https://arxiv.org/abs/2407.09696", "description": "arXiv:2407.09696v1 Announce Type: new \nAbstract: This paper develops a large-scale inference approach for the regularization of stock return covariance matrices. The framework allows for the presence of heavy tails and multivariate GARCH-type effects of unknown form among the stock returns. The approach involves simultaneous testing of all pairwise correlations, followed by setting non-statistically significant elements to zero. This adaptive thresholding is achieved through sign-based Monte Carlo resampling within multiple testing procedures, controlling either the traditional familywise error rate, a generalized familywise error rate, or the false discovery proportion. Subsequent shrinkage ensures that the final covariance matrix estimate is positive definite and well-conditioned while preserving the achieved sparsity. Compared to alternative estimators, this new regularization method demonstrates strong performance in simulation experiments and real portfolio optimization."}, "https://arxiv.org/abs/2407.09735": {"title": "Positive and Unlabeled Data: Model, Estimation, Inference, and Classification", "link": "https://arxiv.org/abs/2407.09735", "description": "arXiv:2407.09735v1 Announce Type: new \nAbstract: This study introduces a new approach to addressing positive and unlabeled (PU) data through the double exponential tilting model (DETM). Traditional methods often fall short because they only apply to selected completely at random (SCAR) PU data, where the labeled positive and unlabeled positive data are assumed to be from the same distribution. In contrast, our DETM's dual structure effectively accommodates the more complex and underexplored selected at random PU data, where the labeled and unlabeled positive data can be from different distributions. We rigorously establish the theoretical foundations of DETM, including identifiability, parameter estimation, and asymptotic properties. 
Additionally, we proceed to statistical inference by developing a goodness-of-fit test for the SCAR condition and constructing confidence intervals for the proportion of positive instances in the target domain. We leverage an approximated Bayes classifier for classification tasks, demonstrating DETM's robust performance in prediction. Through theoretical insights and practical applications, this study highlights DETM as a comprehensive framework for addressing the challenges of PU data."}, "https://arxiv.org/abs/2407.09738": {"title": "Sparse Asymptotic PCA: Identifying Sparse Latent Factors Across Time Horizon", "link": "https://arxiv.org/abs/2407.09738", "description": "arXiv:2407.09738v1 Announce Type: new \nAbstract: This paper proposes a novel method for sparse latent factor modeling using a new sparse asymptotic Principal Component Analysis (APCA). This approach analyzes the co-movements of large-dimensional panel data systems over time horizons within a general approximate factor model framework. Unlike existing sparse factor modeling approaches based on sparse PCA, which assume sparse loading matrices, our sparse APCA assumes that factor processes are sparse over the time horizon, while the corresponding loading matrices are not necessarily sparse. This development is motivated by the observation that the assumption of sparse loadings may not be appropriate for financial returns, where exposure to market factors is generally universal and non-sparse. We propose a truncated power method to estimate the first sparse factor process and a sequential deflation method for multi-factor cases. Additionally, we develop a data-driven approach to identify the sparsity of risk factors over the time horizon using a novel cross-sectional cross-validation method. Theoretically, we establish that our estimators are consistent under mild conditions. Monte Carlo simulations demonstrate that the proposed method performs well in finite samples. Empirically, we analyze daily stock returns for a balanced panel of S&P 500 stocks from January 2004 to December 2016. Through textual analysis, we examine specific events associated with the identified sparse factors that systematically influence the stock market. Our approach offers a new pathway for economists to study and understand the systematic risks of economic and financial systems over time."}, "https://arxiv.org/abs/2407.09759": {"title": "Estimation of Integrated Volatility Functionals with Kernel Spot Volatility Estimators", "link": "https://arxiv.org/abs/2407.09759", "description": "arXiv:2407.09759v1 Announce Type: new \nAbstract: For a multidimensional It\\^o semimartingale, we consider the problem of estimating integrated volatility functionals. Jacod and Rosenbaum (2013) studied a plug-in type of estimator based on a Riemann sum approximation of the integrated functional and a spot volatility estimator with a forward uniform kernel. Motivated by recent results that show that spot volatility estimators with general two-sided kernels of unbounded support are more accurate, in this paper, an estimator using a general kernel spot volatility estimator as the plug-in is considered. A biased central limit theorem for estimating the integrated functional is established with an optimal convergence rate. Unbiased central limit theorems for estimators with proper de-biasing terms are also obtained both at the optimal convergence regime for the bandwidth and when applying undersmoothing. 
Our results show that one can significantly reduce the estimator's bias by adopting a general kernel instead of the standard uniform kernel. Our proposed bias-corrected estimators are found to maintain remarkable robustness against bandwidth selection in a variety of sampling frequencies and functions."}, "https://arxiv.org/abs/2407.09761": {"title": "Exploring Differences between Two Decades of Mental Health Related Emergency Department Visits by Youth via Recurrent Events Analyses", "link": "https://arxiv.org/abs/2407.09761", "description": "arXiv:2407.09761v1 Announce Type: new \nAbstract: We aim to develop a tool for understanding how the mental health of youth aged less than 18 years evolves over time through administrative records of mental health related emergency department (MHED) visits over two decades. Administrative health data usually contain rich information for investigating public health issues; however, many restrictions and regulations apply to their use. Moreover, the data are usually not in a conventional format since administrative databases are created and maintained to serve non-research purposes and only information for people who seek health services is accessible. Analysis of administrative health data is thus challenging in general. In the MHED data analyses, we are particularly concerned with (i) evaluating dynamic patterns and impacts with doubly-censored recurrent event data, and (ii) re-calibrating estimators developed based on truncated data by leveraging summary statistics from the population. The findings are verified empirically via simulation. We have established the asymptotic properties of the inference procedures. The contributions of this paper are twofold. We present innovative strategies for processing doubly-censored recurrent event data, and overcoming the truncation induced by the data collection. In addition, through exploring the pediatric MHED visit records, we provide new insights into children/youths' mental health changes over time."}, "https://arxiv.org/abs/2407.09772": {"title": "Valid standard errors for Bayesian quantile regression with clustered and independent data", "link": "https://arxiv.org/abs/2407.09772", "description": "arXiv:2407.09772v1 Announce Type: new \nAbstract: In Bayesian quantile regression, the most commonly used likelihood is the asymmetric Laplace (AL) likelihood. The reason for this choice is not that it is a plausible data-generating model but that the corresponding maximum likelihood estimator is identical to the classical estimator by Koenker and Bassett (1978), and in that sense, the AL likelihood can be thought of as a working likelihood. AL-based quantile regression has been shown to produce good finite-sample Bayesian point estimates and to be consistent. However, if the AL distribution does not correspond to the data-generating distribution, credible intervals based on posterior standard deviations can have poor coverage. Yang, Wang, and He (2016) proposed an adjustment to the posterior covariance matrix that produces asymptotically valid intervals. However, we show that this adjustment is sensitive to the choice of scale parameter for the AL likelihood and can lead to poor coverage when the sample size is small to moderate. We therefore propose using Infinitesimal Jackknife (IJ) standard errors (Giordano & Broderick, 2023). These standard errors do not require resampling but can be obtained from a single MCMC run. We also propose a version of IJ standard errors for clustered data. 
Simulations and applications to real data show that the IJ standard errors have good frequentist properties, both for independent and clustered data. We provide an R-package that computes IJ standard errors for clustered or independent data after estimation with the brms wrapper in R for Stan."}, "https://arxiv.org/abs/2407.10014": {"title": "Identification of Average Causal Effects in Confounded Additive Noise Models", "link": "https://arxiv.org/abs/2407.10014", "description": "arXiv:2407.10014v1 Announce Type: new \nAbstract: Additive noise models (ANMs) are an important setting studied in causal inference. Most of the existing works on ANMs assume causal sufficiency, i.e., there are no unobserved confounders. This paper focuses on confounded ANMs, where a set of treatment variables and a target variable are affected by an unobserved confounder that follows a multivariate Gaussian distribution. We introduce a novel approach for estimating the average causal effects (ACEs) of any subset of the treatment variables on the outcome and demonstrate that a small set of interventional distributions is sufficient to estimate all of them. In addition, we propose a randomized algorithm that further reduces the number of required interventions to poly-logarithmic in the number of nodes. Finally, we demonstrate that these interventions are also sufficient to recover the causal structure between the observed variables. This establishes that a poly-logarithmic number of interventions is sufficient to infer the causal effects of any subset of treatments on the outcome in confounded ANMs with high probability, even when the causal structure between treatments is unknown. The simulation results indicate that our method can accurately estimate all ACEs in the finite-sample regime. We also demonstrate the practical significance of our algorithm by evaluating it on semi-synthetic data."}, "https://arxiv.org/abs/2407.10089": {"title": "The inverse Kalman filter", "link": "https://arxiv.org/abs/2407.10089", "description": "arXiv:2407.10089v1 Announce Type: new \nAbstract: In this study, we introduce a new approach, the inverse Kalman filter (IKF), which enables accurate matrix-vector multiplication between a covariance matrix from a dynamic linear model and any real-valued vector with linear computational cost. We incorporate the IKF with the conjugate gradient algorithm, which substantially accelerates the computation of matrix inversion for a general form of covariance matrices, whereas other approximation approaches may not be directly applicable. We demonstrate the scalability and efficiency of the IKF approach through distinct applications, including nonparametric estimation of particle interaction functions and predicting incomplete lattices of correlated data, using both simulation and real-world observations, including cell trajectory and satellite radar interferogram."}, "https://arxiv.org/abs/2407.10185": {"title": "Semiparametric Efficient Inference for the Probability of Necessary and Sufficient Causation", "link": "https://arxiv.org/abs/2407.10185", "description": "arXiv:2407.10185v1 Announce Type: new \nAbstract: Causal attribution, which aims to explain why events or behaviors occur, is crucial in causal inference and enhances our understanding of cause-and-effect relationships in scientific research. The probabilities of necessary causation (PN) and sufficient causation (PS) are two of the most common quantities for attribution in causal inference. 
While many works have explored the identification or bounds of PN and PS, efficient estimation remains unaddressed. To fill this gap, this paper focuses on obtaining semiparametric efficient estimators of PN and PS under two sets of identifiability assumptions: strong ignorability and monotonicity, and strong ignorability and conditional independence. We derive efficient influence functions and semiparametric efficiency bounds for PN and PS under the two sets of identifiability assumptions, respectively. Based on this, we propose efficient estimators for PN and PS, and show their large sample properties. Extensive simulations validate the superiority of our estimators compared to competing methods. We apply our methods to a real-world dataset to assess various risk factors affecting stroke."}, "https://arxiv.org/abs/2407.10272": {"title": "Two-way Threshold Matrix Autoregression", "link": "https://arxiv.org/abs/2407.10272", "description": "arXiv:2407.10272v1 Announce Type: new \nAbstract: Matrix-valued time series data are widely available in various applications, attracting increasing attention in the literature. However, while nonlinearity has been recognized, the literature has so far neglected a deeper and more intricate level of nonlinearity, namely the {\\it row-level} nonlinear dynamics and the {\\it column-level} nonlinear dynamics, which are often observed in economic and financial data. In this paper, we propose a novel two-way threshold matrix autoregression (TWTMAR) model. This model is designed to effectively characterize the threshold structure in both rows and columns of matrix-valued time series. Unlike existing models that consider a single threshold variable or assume a uniform structure change across the matrix, the TWTMAR model allows for distinct threshold effects for rows and columns using two threshold variables. This approach achieves greater dimension reduction and yields better interpretation compared to existing methods. Moreover, we propose a parameter estimation procedure leveraging the intrinsic matrix structure and investigate the asymptotic properties. The efficacy and flexibility of the model are demonstrated through both simulation studies and an empirical analysis of the Fama-French Portfolio dataset."}, "https://arxiv.org/abs/2407.10418": {"title": "An integrated perspective of robustness in regression through the lens of the bias-variance trade-off", "link": "https://arxiv.org/abs/2407.10418", "description": "arXiv:2407.10418v1 Announce Type: new \nAbstract: This paper presents an integrated perspective on robustness in regression. Specifically, we examine the relationship between traditional outlier-resistant robust estimation and robust optimization, which focuses on parameter estimation resistant to imaginary dataset-perturbations. While both are commonly regarded as robust methods, these concepts demonstrate a bias-variance trade-off, indicating that they follow roughly converse strategies."}, "https://arxiv.org/abs/2407.10442": {"title": "Inference at the data's edge: Gaussian processes for modeling and inference under model-dependency, poor overlap, and extrapolation", "link": "https://arxiv.org/abs/2407.10442", "description": "arXiv:2407.10442v1 Announce Type: new \nAbstract: The Gaussian Process (GP) is a highly flexible non-linear regression approach that provides a principled approach to handling our uncertainty over predicted (counterfactual) values. 
It does so by computing a posterior distribution over predicted points as a function of a chosen model space and the observed data, in contrast to conventional approaches that effectively compute uncertainty estimates conditionally on placing full faith in a fitted model. This is especially valuable under conditions of extrapolation or weak overlap, where model dependency poses a severe threat. We first offer an accessible explanation of GPs, and provide an implementation suitable for social science inference problems. In doing so, we reduce the number of user-chosen hyperparameters from three to zero. We then illustrate the settings in which GPs can be most valuable: those where conventional approaches have poor properties due to model-dependency/extrapolation in data-sparse regions. Specifically, we apply it to (i) comparisons in which treated and control groups have poor covariate overlap; (ii) interrupted time-series designs, where models are fitted prior to an event but extrapolated after it; and (iii) regression discontinuity, which depends on model estimates taken at or just beyond the edge of their supporting data."}, "https://arxiv.org/abs/2407.10653": {"title": "The Dynamic, the Static, and the Weak factor models and the analysis of high-dimensional time series", "link": "https://arxiv.org/abs/2407.10653", "description": "arXiv:2407.10653v1 Announce Type: new \nAbstract: Several fundamental and closely interconnected issues related to factor models are reviewed and discussed: dynamic versus static loadings, rate-strong versus rate-weak factors, the concept of weakly common component recently introduced by Gersing et al. (2023), the irrelevance of cross-sectional ordering and the assumption of cross-sectional exchangeability, and the problem of undetected strong factors."}, "https://arxiv.org/abs/2407.10721": {"title": "Nonparametric Multivariate Profile Monitoring Via Tree Ensembles", "link": "https://arxiv.org/abs/2407.10721", "description": "arXiv:2407.10721v1 Announce Type: new \nAbstract: Monitoring random profiles over time is used to assess whether the system of interest, generating the profiles, is operating under desired conditions at any time-point. In practice, accurate detection of a change-point within a sequence of responses that exhibit a functional relationship with multiple explanatory variables is an important goal for effectively monitoring such profiles. We present a nonparametric method utilizing ensembles of regression trees and random forests to model the functional relationship along with an associated Kolmogorov-Smirnov statistic to monitor profile behavior. Through a simulation study considering multiple factors, we demonstrate that our method offers strong performance and competitive detection capability when compared to existing methods."}, "https://arxiv.org/abs/2407.10846": {"title": "Joint Learning from Heterogeneous Rank Data", "link": "https://arxiv.org/abs/2407.10846", "description": "arXiv:2407.10846v1 Announce Type: new \nAbstract: The statistical modelling of ranking data has a long history and encompasses various perspectives on how observed rankings arise. One of the most common models, the Plackett-Luce model, is frequently used to aggregate rankings from multiple rankers into a single ranking that corresponds to the underlying quality of the ranked objects. Given that rankers frequently exhibit heterogeneous preferences, mixture-type models have been developed to group rankers with more or less homogeneous preferences together to reduce bias. 
However, occasionally, these preference groups are known a priori. Under these circumstances, current practice consists of fitting Plackett-Luce models separately for each group. Nevertheless, there might be some commonalities between different groups of rankers, such that separate estimation implies a loss of information. We propose an extension of the Plackett-Luce model, the Sparse Fused Plackett-Luce model, that allows for joint learning of such heterogeneous rank data, whereby information from different groups is utilised to achieve better model performance. The observed rankings can be considered a function of variables pertaining to the ranked objects. As such, we allow for these types of variables, where information on the coefficients is shared across groups. Moreover, as not all variables might be relevant for the ranking of an object, we impose sparsity on the coefficients to improve interpretability, estimation and prediction of the model. Simulation studies indicate superior performance of the proposed method compared to existing approaches. To illustrate the usage and interpretation of the method, an application to data consisting of consumer preferences regarding various sweet potato varieties is provided. An R package containing the proposed methodology can be found at https://CRAN.R-project.org/package=SFPL."}, "https://arxiv.org/abs/2407.09542": {"title": "Multi-object Data Integration in the Study of Primary Progressive Aphasia", "link": "https://arxiv.org/abs/2407.09542", "description": "arXiv:2407.09542v1 Announce Type: cross \nAbstract: This article focuses on a multi-modal imaging data application where structural/anatomical information from gray matter (GM) and brain connectivity information in the form of a brain connectome network from functional magnetic resonance imaging (fMRI) are available for a number of subjects with different degrees of primary progressive aphasia (PPA), a neurodegenerative disorder (ND) measured through a speech rate measure on motor speech loss. The clinical/scientific goal in this study becomes the identification of brain regions of interest significantly related to the speech rate measure to gain insight into ND patterns. Viewing the brain connectome network and GM images as objects, we develop an integrated object response regression framework of network and GM images on the speech rate measure. A novel integrated prior formulation is proposed on network and structural image coefficients in order to exploit network information of the brain connectome while leveraging the interconnections between the two objects. The principled Bayesian framework allows the characterization of uncertainty in ascertaining a region being actively related to the speech rate measure. Our framework yields new insights into the relationship of brain regions associated with PPA, offering a deeper understanding of neurodegenerative patterns of PPA. The supplementary file adds details about posterior computation and additional empirical results."}, "https://arxiv.org/abs/2407.09632": {"title": "Granger Causality in Extremes", "link": "https://arxiv.org/abs/2407.09632", "description": "arXiv:2407.09632v1 Announce Type: cross \nAbstract: We introduce a rigorous mathematical framework for Granger causality in extremes, designed to identify causal links from extreme events in time series. Granger causality plays a pivotal role in uncovering directional relationships among time-varying variables. 
While this notion gains heightened importance during extreme and highly volatile periods, state-of-the-art methods primarily focus on causality within the body of the distribution, often overlooking causal mechanisms that manifest only during extreme events. Our framework is designed to infer causality mainly from extreme events by leveraging the causal tail coefficient. We establish equivalences between causality in extremes and other causal concepts, including (classical) Granger causality, Sims causality, and structural causality. We prove other key properties of Granger causality in extremes and show that the framework is especially helpful under the presence of hidden confounders. We also propose a novel inference method for detecting the presence of Granger causality in extremes from data. Our method is model-free, can handle non-linear and high-dimensional time series, outperforms current state-of-the-art methods in all considered setups, both in performance and speed, and was found to uncover coherent effects when applied to financial and extreme weather observations."}, "https://arxiv.org/abs/2407.09664": {"title": "An Introduction to Permutation Processes (version 0", "link": "https://arxiv.org/abs/2407.09664", "description": "arXiv:2407.09664v1 Announce Type: cross \nAbstract: These lecture notes were prepared for a special topics course in the Department of Statistics at the University of Washington, Seattle. They comprise the first eight chapters of a book currently in progress."}, "https://arxiv.org/abs/2407.09832": {"title": "Molecular clouds: do they deserve a non-Gaussian description?", "link": "https://arxiv.org/abs/2407.09832", "description": "arXiv:2407.09832v1 Announce Type: cross \nAbstract: Molecular clouds show complex structures reflecting their non-linear dynamics. Many studies, investigating the bridge between their morphology and physical properties, have shown the interest provided by non-Gaussian higher-order statistics to grasp physical information. Yet, as this bridge is usually characterized in the supervised world of simulations, transferring it onto observations can be hazardous, especially when the discrepancy between simulations and observations remains unknown. In this paper, we aim at identifying relevant summary statistics directly from the observation data. To do so, we develop a test that compares the informative power of two sets of summary statistics for a given dataset. Contrary to supervised approaches, this test does not require the knowledge of any data label or parameter, but focuses instead on comparing the degeneracy levels of these descriptors, relying on a notion of statistical compatibility. We apply this test to column density maps of 14 nearby molecular clouds observed by Herschel, and iteratively compare different sets of usual summary statistics. We show that a standard Gaussian description of these clouds is highly degenerate but can be substantially improved when being estimated on the logarithm of the maps. This illustrates that low-order statistics, properly used, remain a very powerful tool. We then further show that such descriptions still exhibit a small quantity of degeneracies, some of which are lifted by the higher order statistics provided by reduced wavelet scattering transforms. This property of observations quantitatively differs from state-of-the-art simulations of dense molecular cloud collapse and is not reproduced by logfBm models. 
Finally we show how the summary statistics identified can be cooperatively used to build a morphological distance, which is evaluated visually, and gives very satisfactory results."}, "https://arxiv.org/abs/2407.10132": {"title": "Optimal Kernel Choice for Score Function-based Causal Discovery", "link": "https://arxiv.org/abs/2407.10132", "description": "arXiv:2407.10132v1 Announce Type: cross \nAbstract: Score-based methods have demonstrated their effectiveness in discovering causal relationships by scoring different causal structures based on their goodness of fit to the data. Recently, Huang et al. proposed a generalized score function that can handle general data distributions and causal relationships by modeling the relations in reproducing kernel Hilbert space (RKHS). The selection of an appropriate kernel within this score function is crucial for accurately characterizing causal relationships and ensuring precise causal discovery. However, the current method involves manual heuristic selection of kernel parameters, making the process tedious and less likely to ensure optimality. In this paper, we propose a kernel selection method within the generalized score function that automatically selects the optimal kernel that best fits the data. Specifically, we model the generative process of the variables involved in each step of the causal graph search procedure as a mixture of independent noise variables. Based on this model, we derive an automatic kernel selection method by maximizing the marginal likelihood of the variables involved in each search step. We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms heuristic kernel selection methods."}, "https://arxiv.org/abs/2407.10175": {"title": "Low Volatility Stock Portfolio Through High Dimensional Bayesian Cointegration", "link": "https://arxiv.org/abs/2407.10175", "description": "arXiv:2407.10175v1 Announce Type: cross \nAbstract: We employ a Bayesian modelling technique for high dimensional cointegration estimation to construct low volatility portfolios from a large number of stocks. The proposed Bayesian framework effectively identifies sparse and important cointegration relationships amongst large baskets of stocks across various asset spaces, resulting in portfolios with reduced volatility. Such cointegration relationships persist well over the out-of-sample testing time, providing practical benefits in portfolio construction and optimization. Further studies on drawdown and volatility minimization also highlight the benefits of including cointegrated portfolios as risk management instruments."}, "https://arxiv.org/abs/2407.10659": {"title": "A nonparametric test for rough volatility", "link": "https://arxiv.org/abs/2407.10659", "description": "arXiv:2407.10659v1 Announce Type: cross \nAbstract: We develop a nonparametric test for deciding whether volatility of an asset follows a standard semimartingale process, with paths of finite quadratic variation, or a rough process with paths of infinite quadratic variation. The test utilizes the fact that volatility is rough if and only if volatility increments are negatively autocorrelated at high frequencies. It is based on the sample autocovariance of increments of spot volatility estimates computed from high-frequency asset return data. 
By showing a feasible CLT for this statistic under the null hypothesis of semimartingale volatility paths, we construct a test with fixed asymptotic size and an asymptotic power equal to one. The test is derived under very general conditions for the data-generating process. In particular, it is robust to jumps with arbitrary activity and to the presence of market microstructure noise. In an application of the test to SPY high-frequency data, we find evidence for rough volatility."}, "https://arxiv.org/abs/2407.10869": {"title": "Hidden Markov models with an unknown number of states and a repulsive prior on the state parameters", "link": "https://arxiv.org/abs/2407.10869", "description": "arXiv:2407.10869v1 Announce Type: cross \nAbstract: Hidden Markov models (HMMs) offer a robust and efficient framework for analyzing time series data, modelling both the underlying latent state progression over time and the observation process, conditional on the latent state. However, a critical challenge lies in determining the appropriate number of underlying states, often unknown in practice. In this paper, we employ a Bayesian framework, treating the number of states as a random variable and employing reversible jump Markov chain Monte Carlo to sample from the posterior distributions of all parameters, including the number of states. Additionally, we introduce repulsive priors for the state parameters in HMMs, and hence avoid overfitting issues and promote parsimonious models with dissimilar state components. We perform an extensive simulation study comparing performance of models with independent and repulsive prior distributions on the state parameters, and demonstrate our proposed framework on two ecological case studies: GPS tracking data on muskox in Antarctica and acoustic data on Cape gannets in South Africa. Our results highlight how our framework effectively explores the model space, defined by models with different latent state dimensions, while leading to latent states that are distinguished better and hence are more interpretable, enabling better understanding of complex dynamic systems."}, "https://arxiv.org/abs/2205.01061": {"title": "Robust inference for matching under rolling enrollment", "link": "https://arxiv.org/abs/2205.01061", "description": "arXiv:2205.01061v3 Announce Type: replace \nAbstract: Matching in observational studies faces complications when units enroll in treatment on a rolling basis. While each treated unit has a specific time of entry into the study, control units each have many possible comparison, or \"pseudo-treatment,\" times. The recent GroupMatch framework (Pimentel et al., 2020) solves this problem by searching over all possible pseudo-treatment times for each control and selecting those permitting the closest matches based on covariate histories. However, valid methods of inference have been described only for special cases of the general GroupMatch design, and these rely on strong assumptions. We provide three important innovations to address these problems. First, we introduce a new design, GroupMatch with instance replacement, that allows additional flexibility in control selection and proves more amenable to analysis. Second, we propose a block bootstrap approach for inference in GroupMatch with instance replacement and demonstrate that it accounts properly for complex correlations across matched sets. 
Third, we develop a permutation-based falsification test to detect possible violations of the important timepoint agnosticism assumption underpinning GroupMatch, which requires homogeneity of potential outcome means across time. Via simulation and a case study of the impact of short-term injuries on batting performance in major league baseball, we demonstrate the effectiveness of our methods for data analysis in practice."}, "https://arxiv.org/abs/2210.00091": {"title": "Factorized Fusion Shrinkage for Dynamic Relational Data", "link": "https://arxiv.org/abs/2210.00091", "description": "arXiv:2210.00091v3 Announce Type: replace \nAbstract: Modern data science applications often involve complex relational data with dynamic structures. An abrupt change in such dynamic relational data is typically observed in systems that undergo regime changes due to interventions. In such a case, we consider a factorized fusion shrinkage model in which all decomposed factors are dynamically shrunk towards group-wise fusion structures, where the shrinkage is obtained by applying global-local shrinkage priors to the successive differences of the row vectors of the factorized matrices. The proposed priors enjoy many favorable properties in comparison and clustering of the estimated dynamic latent factors. Comparing estimated latent factors involves both adjacent and long-term comparisons, with the time range of comparison considered as a variable. Under certain conditions, we demonstrate that the posterior distribution attains the minimax optimal rate up to logarithmic factors. In terms of computation, we present a structured mean-field variational inference framework that balances optimal posterior inference with computational scalability, exploiting both the dependence among components and across time. The framework can accommodate a wide variety of models, including dynamic matrix factorization, latent space models for networks and low-rank tensors. The effectiveness of our methodology is demonstrated through extensive simulations and real-world data analysis."}, "https://arxiv.org/abs/2210.07491": {"title": "Latent process models for functional network data", "link": "https://arxiv.org/abs/2210.07491", "description": "arXiv:2210.07491v3 Announce Type: replace \nAbstract: Network data are often sampled with auxiliary information or collected through the observation of a complex system over time, leading to multiple network snapshots indexed by a continuous variable. Many methods in statistical network analysis are traditionally designed for a single network, and can be applied to an aggregated network in this setting, but that approach can miss important functional structure. Here we develop an approach to estimating the expected network explicitly as a function of a continuous index, be it time or another indexing variable. We parameterize the network expectation through low dimensional latent processes, whose components we represent with a fixed, finite-dimensional functional basis. 
We derive a gradient descent estimation algorithm, establish theoretical guarantees for recovery of the low dimensional structure, compare our method to competitors, and apply it to a data set of international political interactions over time, showing our proposed method to adapt well to data, outperform competitors, and provide interpretable and meaningful results."}, "https://arxiv.org/abs/2306.16549": {"title": "UTOPIA: Universally Trainable Optimal Prediction Intervals Aggregation", "link": "https://arxiv.org/abs/2306.16549", "description": "arXiv:2306.16549v2 Announce Type: replace \nAbstract: Uncertainty quantification in prediction presents a compelling challenge with vast applications across various domains, including biomedical science, economics, and weather forecasting. There exists a wide array of methods for constructing prediction intervals, such as quantile regression and conformal prediction. However, practitioners often face the challenge of selecting the most suitable method for a specific real-world data problem. In response to this dilemma, we introduce a novel and universally applicable strategy called Universally Trainable Optimal Predictive Intervals Aggregation (UTOPIA). This technique excels in efficiently aggregating multiple prediction intervals while maintaining a small average width of the prediction band and ensuring coverage. UTOPIA is grounded in linear or convex programming, making it straightforward to train and implement. In the specific case where the prediction methods are elementary basis functions, as in kernel and spline bases, our method becomes the construction of a prediction band. Our proposed methodologies are supported by theoretical guarantees on the coverage probability and the average width of the aggregated prediction interval, which are detailed in this paper. The practicality and effectiveness of UTOPIA are further validated through its application to synthetic data and two real-world datasets in finance and macroeconomics."}, "https://arxiv.org/abs/2309.08783": {"title": "Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression", "link": "https://arxiv.org/abs/2309.08783", "description": "arXiv:2309.08783v4 Announce Type: replace \nAbstract: Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, which can be violated in practice. For example, Aphasia Quotient (AQ) is a critical measure of language impairment and informs treatment decisions, but it is challenging to measure in stroke patients. It is of interest to use high-resolution T2 neuroimages of brain damage to predict AQ. However, sparse regression models show marked evidence of heteroscedastic error even after transformations are applied. This violation of the homoscedasticity assumption can lead to bias in estimated coefficients, prediction intervals (PI) with improper length, and increased type I errors. Bayesian heteroscedastic linear regression models relax the homoscedastic error assumption but can enforce restrictive prior assumptions on parameters, and many are computationally infeasible in the high-dimensional setting. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. 
H-PROBE is a computationally efficient maximum a posteriori estimation approach that requires minimal prior assumptions and can incorporate covariates hypothesized to impact heterogeneity. We apply this method by using high-dimensional neuroimages to predict and provide PIs for AQ that accurately quantify predictive uncertainty. Our analysis demonstrates that H-PROBE can provide narrower PI widths than standard methods without sacrificing coverage. Narrower PIs are clinically important for determining the risk of moderate to severe aphasia. Additionally, through extensive simulation studies, we show that H-PROBE results in superior prediction, variable selection, and predictive inference compared to alternative methods."}, "https://arxiv.org/abs/2311.13327": {"title": "Regressions under Adverse Conditions", "link": "https://arxiv.org/abs/2311.13327", "description": "arXiv:2311.13327v2 Announce Type: replace \nAbstract: We introduce a new regression method that relates the mean of an outcome variable to covariates, given the \"adverse condition\" that a distress variable falls in its tail. This allows us to tailor classical mean regressions to adverse economic scenarios, which receive increasing interest in managing macroeconomic and financial risks, among many others. In the terminology of the systemic risk literature, our method can be interpreted as a regression for the Marginal Expected Shortfall. We propose a two-step procedure to estimate the new models, show consistency and asymptotic normality of the estimator, and propose feasible inference under weak conditions allowing for cross-sectional and time series applications. The accuracy of the asymptotic approximations of the two-step estimator is verified in simulations. Two empirical applications show that our regressions under adverse conditions are valuable in such diverse fields as the study of the relation between systemic risk and asset price bubbles, and dissecting macroeconomic growth vulnerabilities into individual components."}, "https://arxiv.org/abs/2311.13556": {"title": "Universally Optimal Multivariate Crossover Designs", "link": "https://arxiv.org/abs/2311.13556", "description": "arXiv:2311.13556v2 Announce Type: replace \nAbstract: In this article, universally optimal multivariate crossover designs are studied. The multiple response crossover design is motivated by a $3 \\times 3$ crossover setup, where the effects of $3$ doses of an oral drug on gene expressions related to mucosal inflammation are studied. Subjects are assigned to three treatment sequences and response measurements on $5$ different gene expressions are taken from each subject in each of the $3$ time periods. To model multiple or $g$ responses, where $g>1$, in a crossover setup, a multivariate fixed effect model with both direct and carryover treatment effects is considered. It is assumed that there are nonzero within-response correlations, while between-response correlations are taken to be zero. The information matrix corresponding to the direct effects is obtained and some results are studied. The information matrix in the multivariate case is shown to differ from the univariate case, particularly in the completely symmetric property. 
For the $g>1$ case, with $t$ treatments and $p$ periods, for $p=t \\geq 3$, the design represented by a Type $\\rm{I}$ orthogonal array of strength $2$ is proved to be universally optimal over the class of binary designs, for the direct treatment effects."}, "https://arxiv.org/abs/2309.07692": {"title": "A minimum Wasserstein distance approach to Fisher's combination of independent discrete p-values", "link": "https://arxiv.org/abs/2309.07692", "description": "arXiv:2309.07692v2 Announce Type: replace-cross \nAbstract: This paper introduces a comprehensive framework to adjust a discrete test statistic for improving its hypothesis testing procedure. The adjustment minimizes the Wasserstein distance to a null-approximating continuous distribution, tackling some fundamental challenges inherent in combining statistical significances derived from discrete distributions. The related theory justifies Lancaster's mid-p and mean-value chi-squared statistics for Fisher's combination as special cases. However, in order to counter the conservative nature of Lancaster's testing procedures, we propose an updated null-approximating distribution. It is achieved by further minimizing the Wasserstein distance to the adjusted statistics within a proper distribution family. Specifically, in the context of Fisher's combination, we propose an optimal gamma distribution as a substitute for the traditionally used chi-squared distribution. This new approach yields an asymptotically consistent test that significantly improves type I error control and enhances statistical power."}, "https://arxiv.org/abs/2407.11035": {"title": "Optimal estimators of cross-partial derivatives and surrogates of functions", "link": "https://arxiv.org/abs/2407.11035", "description": "arXiv:2407.11035v1 Announce Type: new \nAbstract: Computing cross-partial derivatives using fewer model runs is relevant in modeling, such as stochastic approximation, derivative-based ANOVA, exploring complex models, and active subspaces. This paper introduces surrogates of all the cross-partial derivatives of functions by evaluating such functions at $N$ randomized points and using a set of $L$ constraints. Randomized points rely on independent, central, and symmetric variables. The associated estimators, based on $NL$ model runs, reach the optimal rates of convergence (i.e., $\\mathcal{O}(N^{-1})$), and the biases of our approximations do not suffer from the curse of dimensionality for a wide class of functions. Such results are used for i) computing the main and upper-bounds of sensitivity indices, and ii) deriving emulators of simulators or surrogates of functions thanks to the derivative-based ANOVA. Simulations are presented to show the accuracy of our emulators and estimators of sensitivity indices. The plug-in estimates of indices using the U-statistics of one sample are numerically much stable."}, "https://arxiv.org/abs/2407.11094": {"title": "Robust Score-Based Quickest Change Detection", "link": "https://arxiv.org/abs/2407.11094", "description": "arXiv:2407.11094v1 Announce Type: new \nAbstract: Methods in the field of quickest change detection rapidly detect in real-time a change in the data-generating distribution of an online data stream. Existing methods have been able to detect this change point when the densities of the pre- and post-change distributions are known. Recent work has extended these results to the case where the pre- and post-change distributions are known only by their score functions. 
This work considers the case where the pre- and post-change score functions are known only to correspond to distributions in two disjoint sets. This work employs a pair of \"least-favorable\" distributions to robustify the existing score-based quickest change detection algorithm, the properties of which are studied. This paper calculates the least-favorable distributions for specific model classes and provides methods of estimating the least-favorable distributions for common constructions. Simulation results are provided demonstrating the performance of our robust change detection algorithm."}, "https://arxiv.org/abs/2407.11173": {"title": "Approximate Bayesian inference for high-resolution spatial disaggregation using alternative data sources", "link": "https://arxiv.org/abs/2407.11173", "description": "arXiv:2407.11173v1 Announce Type: new \nAbstract: This paper addresses the challenge of obtaining precise demographic information at a fine-grained spatial level, a necessity for planning localized public services such as water distribution networks, or understanding local human impacts on the ecosystem. While population sizes are commonly available for large administrative areas, such as wards in India, practical applications often demand knowledge of population density at smaller spatial scales. We explore the integration of alternative data sources, specifically satellite-derived products, including land cover, land use, street density, building heights, vegetation coverage, and drainage density. Using a case study focused on Bangalore City, India, with a ward-level population dataset for 198 wards and satellite-derived sources covering 786,702 pixels at a resolution of 30mX30m, we propose a semiparametric Bayesian spatial regression model for obtaining pixel-level population estimates. Given the high dimensionality of the problem, exact Bayesian inference is deemed impractical; we discuss an approximate Bayesian inference scheme based on the recently proposed max-and-smooth approach, a combination of Laplace approximation and Markov chain Monte Carlo. A simulation study validates the reasonable performance of our inferential approach. Mapping pixel-level estimates to the ward level demonstrates the effectiveness of our method in capturing the spatial distribution of population sizes. While our case study focuses on a demographic application, the methodology developed here readily applies to count-type spatial datasets from various scientific disciplines, where high-resolution alternative data sources are available."}, "https://arxiv.org/abs/2407.11342": {"title": "GenTwoArmsTrialSize: An R Statistical Software Package to estimate Generalized Two Arms Randomized Clinical Trial Sample Size", "link": "https://arxiv.org/abs/2407.11342", "description": "arXiv:2407.11342v1 Announce Type: new \nAbstract: The precise calculation of sample sizes is a crucial aspect in the design of clinical trials particularly for pharmaceutical statisticians. While various R statistical software packages have been developed by researchers to estimate required sample sizes under different assumptions, there has been a notable absence of a standalone R statistical software package that allows researchers to comprehensively estimate sample sizes under generalized scenarios. This paper introduces the R statistical software package \"GenTwoArmsTrialSize\" available on the Comprehensive R Archive Network (CRAN), designed for estimating the required sample size in two-arm clinical trials. 
The package incorporates four endpoint types, two trial treatment designs, four types of hypothesis tests, as well as considerations for noncompliance and loss of follow-up, providing researchers with the capability to estimate sample sizes across 24 scenarios. To facilitate understanding of the estimation process and illuminate the impact of noncompliance and loss of follow-up on the size and variability of estimations, the paper includes four hypothetical examples and one applied example. The discussion encompasses the package's limitations and outlines directions for future extensions and improvements."}, "https://arxiv.org/abs/2407.11614": {"title": "Restricted mean survival times for comparing grouped survival data: a Bayesian nonparametric approach", "link": "https://arxiv.org/abs/2407.11614", "description": "arXiv:2407.11614v1 Announce Type: new \nAbstract: Comparing survival experiences of different groups of data is an important issue in several applied problems. A typical example is where one wishes to investigate treatment effects. Here we propose a new Bayesian approach based on restricted mean survival times (RMST). A nonparametric prior is specified for the underlying survival functions: this extends the standard univariate neutral to the right processes to a multivariate setting and induces a prior for the RMST's. We rely on a representation as exponential functionals of compound subordinators to determine closed form expressions of prior and posterior mixed moments of RMST's. These results are used to approximate functionals of the posterior distribution of RMST's and are essential for comparing time--to--event data arising from different samples."}, "https://arxiv.org/abs/2407.11634": {"title": "A goodness-of-fit test for testing exponentiality based on normalized dynamic survival extropy", "link": "https://arxiv.org/abs/2407.11634", "description": "arXiv:2407.11634v1 Announce Type: new \nAbstract: The cumulative residual extropy (CRJ) is a measure of uncertainty that serves as an alternative to extropy. It replaces the probability density function with the survival function in the expression of extropy. This work introduces a new concept called normalized dynamic survival extropy (NDSE), a dynamic variation of CRJ. We observe that NDSE is equivalent to CRJ of the random variable of interest $X_{[t]}$ in the age replacement model at a fixed time $t$. Additionally, we have demonstrated that NDSE remains constant exclusively for exponential distribution at any time. We categorize two classes, INDSE and DNDSE, based on their increasing and decreasing NDSE values. Next, we present a non-parametric test to assess whether a distribution follows an exponential pattern against INDSE. We derive the exact and asymptotic distribution for the test statistic $\\widehat{\\Delta}^*$. Additionally, a test for asymptotic behavior is presented in the paper for right censoring data. Finally, we determine the critical values and power of our exact test through simulation. The simulation demonstrates that the suggested test is easy to compute and has significant statistical power, even with small sample sizes. We also conduct a power comparison analysis among other tests, which shows better power for the proposed test against other alternatives mentioned in this paper. 
Some numerical real-life examples validating the test are also included."}, "https://arxiv.org/abs/2407.11646": {"title": "Discovery and inference of possibly bi-directional causal relationships with invalid instrumental variables", "link": "https://arxiv.org/abs/2407.11646", "description": "arXiv:2407.11646v1 Announce Type: new \nAbstract: Learning causal relationships between pairs of complex traits from observational studies is of great interest across various scientific domains. However, most existing methods assume the absence of unmeasured confounding and restrict causal relationships between two traits to be uni-directional, which may be violated in real-world systems. In this paper, we address the challenge of causal discovery and effect inference for two traits while accounting for unmeasured confounding and potential feedback loops. By leveraging possibly invalid instrumental variables, we provide identification conditions for causal parameters in a model that allows for bi-directional relationships, and we also establish identifiability of the causal direction under the introduced conditions. Then we propose a data-driven procedure to detect the causal direction and provide inference results about causal effects along the identified direction. We show that our method consistently recovers the true direction and produces valid confidence intervals for the causal effect. We conduct extensive simulation studies to show that our proposal outperforms existing methods. We finally apply our method to analyze real data sets from UK Biobank."}, "https://arxiv.org/abs/2407.11674": {"title": "Effect Heterogeneity with Earth Observation in Randomized Controlled Trials: Exploring the Role of Data, Model, and Evaluation Metric Choice", "link": "https://arxiv.org/abs/2407.11674", "description": "arXiv:2407.11674v1 Announce Type: new \nAbstract: Many social and environmental phenomena are associated with macroscopic changes in the built environment, captured by satellite imagery on a global scale and with daily temporal resolution. While widely used for prediction, these images and especially image sequences remain underutilized for causal inference, especially in the context of randomized controlled trials (RCTs), where causal identification is established by design. In this paper, we develop and compare a set of general tools for analyzing Conditional Average Treatment Effects (CATEs) from temporal satellite data that can be applied to any RCT where geographical identifiers are available. Through a simulation study, we analyze different modeling strategies for estimating CATE in sequences of satellite images. We find that image sequence representation models with more parameters generally yield a higher ability to detect heterogeneity. To explore the role of model and data choice in practice, we apply the approaches to two influential RCTs--Banerjee et al. (2015), a poverty study in Cusco, Peru, and Bolsen et al. (2014), a water conservation experiment in the USA. We benchmark our image sequence models against image-only, tabular-only, and combined image-tabular data sources. We detect a stronger heterogeneity signal in the Peru experiment and for image sequence over image-only data. Land cover classifications over satellite images facilitate interpretation of what image features drive heterogeneity. These satellite-based CATE models enable generalizing the RCT results to larger geographical areas outside the original experimental context. 
While promising, transportability estimates highlight the need for sensitivity analysis. Overall, this paper shows how satellite sequence data can be incorporated into the analysis of RCTs, and how choices regarding satellite image data and model can be improved using evaluation metrics."}, "https://arxiv.org/abs/2407.11729": {"title": "Using shrinkage methods to estimate treatment effects in overlapping subgroups in randomized clinical trials with a time-to-event endpoint", "link": "https://arxiv.org/abs/2407.11729", "description": "arXiv:2407.11729v1 Announce Type: new \nAbstract: In randomized controlled trials, forest plots are frequently used to investigate the homogeneity of treatment effect estimates in subgroups. However, the interpretation of subgroup-specific treatment effect estimates requires great care due to the smaller sample size of subgroups and the large number of investigated subgroups. Bayesian shrinkage methods have been proposed to address these issues, but they often focus on disjoint subgroups while subgroups displayed in forest plots are overlapping, i.e., each subject appears in multiple subgroups. In our approach, we first build a flexible Cox model based on all available observations, including categorical covariates that identify the subgroups of interest and their interactions with the treatment group variable. We explore both penalized partial likelihood estimation with a lasso or ridge penalty for treatment-by-covariate interaction terms, and Bayesian estimation with a regularized horseshoe prior. One advantage of the Bayesian approach is the ability to derive credible intervals for shrunken subgroup-specific estimates. In a second step, the Cox model is marginalized to obtain treatment effect estimates for all subgroups. We illustrate these methods using data from a randomized clinical trial in follicular lymphoma and evaluate their properties in a simulation study. In all simulation scenarios, the overall mean-squared error is substantially smaller for penalized and shrinkage estimators compared to the standard subgroup-specific treatment effect estimator but leads to some bias for heterogeneous subgroups. We recommend that subgroup-specific estimators, which are typically displayed in forest plots, are more routinely complemented by treatment effect estimators based on shrinkage methods. The proposed methods are implemented in the R package bonsaiforest."}, "https://arxiv.org/abs/2407.11765": {"title": "Nowcasting R&D Expenditures: A Machine Learning Approach", "link": "https://arxiv.org/abs/2407.11765", "description": "arXiv:2407.11765v1 Announce Type: new \nAbstract: Macroeconomic data are crucial for monitoring countries' performance and driving policy. However, traditional data acquisition processes are slow, subject to delays, and performed at a low frequency. We address this 'ragged-edge' problem with a two-step framework. The first step is a supervised learning model predicting observed low-frequency figures. We propose a neural-network-based nowcasting model that exploits mixed-frequency, high-dimensional data. The second step uses the elasticities derived from the previous step to interpolate unobserved high-frequency figures. We apply our method to nowcast countries' yearly research and development (R&D) expenditure series. These series are collected through infrequent surveys, making them ideal candidates for this task. 
We exploit a range of predictors, chiefly Internet search volume data, and document the relevance of these data in improving out-of-sample predictions. Furthermore, we leverage the high frequency of our data to derive monthly estimates of R&D expenditures, which are currently unobserved. We compare our results with those obtained from the classical regression-based and the sparse temporal disaggregation methods. Finally, we validate our results by reporting a strong correlation with monthly R&D employment data."}, "https://arxiv.org/abs/2407.11937": {"title": "Generalized Difference-in-Differences", "link": "https://arxiv.org/abs/2407.11937", "description": "arXiv:2407.11937v1 Announce Type: new \nAbstract: In many social science applications, researchers use the difference-in-differences (DID) estimator to establish causal relationships, exploiting cross-sectional variation in a baseline factor and temporal variation in exposure to an event that presumably may affect all units. This approach, often referred to as generalized DID (GDID), differs from canonical DID in that it lacks a \"clean control group\" unexposed to the event after the event occurs. In this paper, we clarify GDID as a research design in terms of its data structure, feasible estimands, and identifying assumptions that allow the DID estimator to recover these estimands. We frame GDID as a factorial design with two factors: the baseline factor, denoted by $G$, and the exposure level to the event, denoted by $Z$, and define effect modification and causal interaction as the associative and causal effects of $G$ on the effect of $Z$, respectively. We show that under the canonical no anticipation and parallel trends assumptions, the DID estimator identifies only the effect modification of $G$ in GDID, and propose an additional generalized parallel trends assumption to identify causal interaction. Moreover, we show that the canonical DID research design can be framed as a special case of the GDID research design with an additional exclusion restriction assumption, thereby reconciling the two approaches. We illustrate these findings with empirical examples from economics and political science, and provide recommendations for improving practice and interpretation under GDID."}, "https://arxiv.org/abs/2407.11032": {"title": "Mechanisms for Data Sharing in Collaborative Causal Inference (Extended Version)", "link": "https://arxiv.org/abs/2407.11032", "description": "arXiv:2407.11032v1 Announce Type: cross \nAbstract: Collaborative causal inference (CCI) is a federated learning method for pooling data from multiple, often self-interested, parties, to achieve a common learning goal over causal structures, e.g. estimation and optimization of treatment variables in a medical setting. Since obtaining data can be costly for the participants and sharing unique data poses the risk of losing competitive advantages, motivating the participation of all parties through equitable rewards and incentives is necessary. This paper devises an evaluation scheme to measure the value of each party's data contribution to the common learning task, tailored to causal inference's statistical demands, by comparing completed partially directed acyclic graphs (CPDAGs) inferred from observational data contributed by the participants. The Data Valuation Scheme thus obtained can then be used to introduce mechanisms that incentivize the agents to contribute data. 
It can be leveraged to reward agents fairly, according to the quality of their data, or to maximize all agents' data contributions."}, "https://arxiv.org/abs/2407.11056": {"title": "Industrial-Grade Time-Dependent Counterfactual Root Cause Analysis through the Unanticipated Point of Incipient Failure: a Proof of Concept", "link": "https://arxiv.org/abs/2407.11056", "description": "arXiv:2407.11056v1 Announce Type: cross \nAbstract: This paper describes the development of a counterfactual Root Cause Analysis diagnosis approach for an industrial multivariate time series environment. It drives the attention toward the Point of Incipient Failure, which is the moment in time when the anomalous behavior is first observed, and where the root cause is assumed to be found before the issue propagates. The paper presents the elementary but essential concepts of the solution and illustrates them experimentally on a simulated setting. Finally, it discusses avenues of improvement for the maturity of the causal technology to meet the robustness challenges of increasingly complex environments in the industry."}, "https://arxiv.org/abs/2407.11426": {"title": "Generally-Occurring Model Change for Robust Counterfactual Explanations", "link": "https://arxiv.org/abs/2407.11426", "description": "arXiv:2407.11426v1 Announce Type: cross \nAbstract: With the increasing impact of algorithmic decision-making on human lives, the interpretability of models has become a critical issue in machine learning. Counterfactual explanation is an important method in the field of interpretable machine learning, which can not only help users understand why machine learning models make specific decisions, but also help users understand how to change these decisions. Naturally, it is an important task to study the robustness of counterfactual explanation generation algorithms to model changes. Previous literature has proposed the concept of Naturally-Occurring Model Change, which has given us a deeper understanding of robustness to model change. In this paper, we first further generalize the concept of Naturally-Occurring Model Change, proposing a more general concept of model parameter changes, Generally-Occurring Model Change, which has a wider range of applicability. We also prove the corresponding probabilistic guarantees. In addition, we consider a more specific problem, data set perturbation, and give relevant theoretical results by combining optimization theory."}, "https://arxiv.org/abs/2407.11465": {"title": "Testing by Betting while Borrowing and Bargaining", "link": "https://arxiv.org/abs/2407.11465", "description": "arXiv:2407.11465v1 Announce Type: cross \nAbstract: Testing by betting has been a cornerstone of the game-theoretic statistics literature. In this framework, a betting score (or more generally an e-process), as opposed to a traditional p-value, is used to quantify the evidence against a null hypothesis: the higher the betting score, the more money one has made betting against the null, and thus the larger the evidence that the null is false. A key ingredient assumed throughout past works is that one cannot bet more money than one currently has. In this paper, we ask what happens if the bettor is allowed to borrow money after going bankrupt, allowing further financial flexibility in this game of hypothesis testing. We propose various definitions of (adjusted) evidence relative to the wealth borrowed, indebted, and accumulated. 
We also ask what happens if the bettor can \"bargain\", in order to obtain odds better than specified by the null hypothesis. The adjustment of wealth in order to serve as evidence appeals to the characterization of arbitrage, interest rates, and num\\'eraire-adjusted pricing in this setting."}, "https://arxiv.org/abs/2407.11676": {"title": "SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation", "link": "https://arxiv.org/abs/2407.11676", "description": "arXiv:2407.11676v1 Announce Type: cross \nAbstract: Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-Bench, we propose a framework to evaluate DA methods and present a fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data with specific feature extraction. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-Bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors. SKADA-Bench is available on GitHub at https://github.com/scikit-adaptation/skada-bench."}, "https://arxiv.org/abs/2407.11887": {"title": "On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting", "link": "https://arxiv.org/abs/2407.11887", "description": "arXiv:2407.11887v1 Announce Type: cross \nAbstract: The prediction of extreme events in time series is a fundamental problem arising in many financial, scientific, engineering, and other applications. We begin by establishing a general Neyman-Pearson-type characterization of optimal extreme event predictors in terms of density ratios. This yields new insights and several closed-form optimal extreme event predictors for additive models. These results naturally extend to time series, where we study optimal extreme event prediction for heavy-tailed autoregressive and moving average models. Using a uniform law of large numbers for ergodic time series, we establish the asymptotic optimality of an empirical version of the optimal predictor for autoregressive models. Using multivariate regular variation, we also obtain expressions for the optimal extremal precision in heavy-tailed infinite moving averages, which provide theoretical bounds on the ability to predict extremes in this general class of models. The developed theory and methodology are applied to the important problem of solar flare prediction based on the state-of-the-art GOES satellite flux measurements of the Sun. 
Our results demonstrate the success and limitations of long-memory autoregressive as well as long-range dependent heavy-tailed FARIMA models for the prediction of extreme solar flares."}, "https://arxiv.org/abs/2407.11901": {"title": "Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows", "link": "https://arxiv.org/abs/2407.11901", "description": "arXiv:2407.11901v1 Announce Type: cross \nAbstract: We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG), a system of a backward Hamilton-Jacobi (HJ) and a forward continuity partial differential equations (PDEs) whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner even for learning high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures."}, "https://arxiv.org/abs/2209.11691": {"title": "Linear multidimensional regression with interactive fixed-effects", "link": "https://arxiv.org/abs/2209.11691", "description": "arXiv:2209.11691v3 Announce Type: replace \nAbstract: This paper studies a linear and additively separable model for multidimensional panel data of three or more dimensions with unobserved interactive fixed effects. Two approaches are considered to account for these unobserved interactive fixed-effects when estimating coefficients on the observed covariates. First, the model is embedded within the standard two dimensional panel framework and restrictions are formed under which the factor structure methods in Bai (2009) lead to consistent estimation of model parameters, but at slow rates of convergence. The second approach develops a kernel weighted fixed-effects method that is more robust to the multidimensional nature of the problem and can achieve the parametric rate of consistency under certain conditions. Theoretical results and simulations show some benefits to standard two-dimensional panel methods when the structure of the interactive fixed-effect term is known, but also highlight how the kernel weighted method performs well without knowledge of this structure. 
The methods are implemented to estimate the demand elasticity for beer."}, "https://arxiv.org/abs/2305.08488": {"title": "Hierarchical DCC-HEAVY Model for High-Dimensional Covariance Matrices", "link": "https://arxiv.org/abs/2305.08488", "description": "arXiv:2305.08488v2 Announce Type: replace \nAbstract: We introduce an HD DCC-HEAVY class of hierarchical-type factor models for high-dimensional covariance matrices, employing the realized measures built from higher-frequency data. The modelling approach features straightforward estimation and forecasting schemes, independent of the cross-sectional dimension of the assets under consideration, and accounts for sophisticated asymmetric dynamics in the covariances. Empirical analyses suggest that the HD DCC-HEAVY models have a better in-sample fit and deliver statistically and economically significant out-of-sample gains relative to the existing hierarchical factor model and standard benchmarks. The results are robust under different frequencies and market conditions."}, "https://arxiv.org/abs/2306.02813": {"title": "Variational inference based on a subclass of closed skew normals", "link": "https://arxiv.org/abs/2306.02813", "description": "arXiv:2306.02813v2 Announce Type: replace \nAbstract: Gaussian distributions are widely used in Bayesian variational inference to approximate intractable posterior densities, but the ability to accommodate skewness can improve approximation accuracy significantly, when data or prior information is scarce. We study the properties of a subclass of closed skew normals constructed using affine transformation of independent standardized univariate skew normals as the variational density, and illustrate how it provides increased flexibility and accuracy in approximating the joint posterior in various applications, by overcoming limitations in existing skew normal variational approximations. The evidence lower bound is optimized using stochastic gradient ascent, where analytic natural gradient updates are derived. We also demonstrate how problems in maximum likelihood estimation of skew normal parameters occur similarly in stochastic variational inference, and can be resolved using the centered parametrization. Supplemental materials are available online."}, "https://arxiv.org/abs/2312.15494": {"title": "Variable Selection in High Dimensional Linear Regressions with Parameter Instability", "link": "https://arxiv.org/abs/2312.15494", "description": "arXiv:2312.15494v2 Announce Type: replace \nAbstract: This paper considers the problem of variable selection allowing for parameter instability. It distinguishes between signal and pseudo-signal variables that are correlated with the target variable, and noise variables that are not, and investigates the asymptotic properties of the One Covariate at a Time Multiple Testing (OCMT) method proposed by Chudik et al. (2018) under parameter instability. It is established that OCMT continues to asymptotically select an approximating model that includes all the signals and none of the noise variables. Properties of post selection regressions are also investigated, and in-sample fit of the selected regression is shown to have the oracle property. The theoretical results support the use of unweighted observations at the selection stage of OCMT, whilst applying down-weighting of observations only at the forecasting stage. 
Monte Carlo and empirical applications show that OCMT without down-weighting at the selection stage yields smaller mean squared forecast errors compared to Lasso, Adaptive Lasso, and boosting."}, "https://arxiv.org/abs/2401.14558": {"title": "Simulation Model Calibration with Dynamic Stratification and Adaptive Sampling", "link": "https://arxiv.org/abs/2401.14558", "description": "arXiv:2401.14558v2 Announce Type: replace \nAbstract: Calibrating simulation models that take large quantities of multi-dimensional data as input is a hard simulation optimization problem. Existing adaptive sampling strategies offer a methodological solution. However, they may not sufficiently reduce the computational cost for estimation and solution algorithm's progress within a limited budget due to extreme noise levels and heteroskedasticity of system responses. We propose integrating stratification with adaptive sampling for the purpose of efficiency in optimization. Stratification can exploit local dependence in the simulation inputs and outputs. Yet, the state-of-the-art does not provide a full capability to adaptively stratify the data as different solution alternatives are evaluated. We devise two procedures for data-driven calibration problems that involve a large dataset with multiple covariates to calibrate models within a fixed overall simulation budget. The first approach dynamically stratifies the input data using binary trees, while the second approach uses closed-form solutions based on linearity assumptions between the objective function and concomitant variables. We find that dynamic adjustment of stratification structure accelerates optimization and reduces run-to-run variability in generated solutions. Our case study for calibrating a wind power simulation model, widely used in the wind industry, using the proposed stratified adaptive sampling, shows better-calibrated parameters under a limited budget."}, "https://arxiv.org/abs/2407.12100": {"title": "Agglomerative Clustering of Simulation Output Distributions Using Regularized Wasserstein Distance", "link": "https://arxiv.org/abs/2407.12100", "description": "arXiv:2407.12100v1 Announce Type: new \nAbstract: We investigate the use of clustering methods on data produced by a stochastic simulator, with applications in anomaly detection, pre-optimization, and online monitoring. We introduce an agglomerative clustering algorithm that clusters multivariate empirical distributions using the regularized Wasserstein distance and apply the proposed methodology on a call-center model."}, "https://arxiv.org/abs/2407.12114": {"title": "Bounds on causal effects in $2^{K}$ factorial experiments with non-compliance", "link": "https://arxiv.org/abs/2407.12114", "description": "arXiv:2407.12114v1 Announce Type: new \nAbstract: Factorial experiments are ubiquitous in the social and biomedical sciences, but when units fail to comply with each assigned factor, identification and estimation of the average treatment effects become impossible. Leveraging an instrumental variables approach, previous studies have shown how to identify and estimate the causal effect of treatment uptake among respondents who comply with treatment. A major caveat is that these identification results rely on strong assumptions on the effect of randomization on treatment uptake. 
This paper shows how to bound these complier average treatment effects under more mild assumptions on non-compliance."}, "https://arxiv.org/abs/2407.12175": {"title": "Temporal Configuration Model: Statistical Inference and Spreading Processes", "link": "https://arxiv.org/abs/2407.12175", "description": "arXiv:2407.12175v1 Announce Type: new \nAbstract: We introduce a family of parsimonious network models that are intended to generalize the configuration model to temporal settings. We present consistent estimators for the model parameters and perform numerical simulations to illustrate the properties of the estimators on finite samples. We also develop analytical solutions for basic and effective reproductive numbers for the early stage of discrete-time SIR spreading process. We apply three distinct temporal configuration models to empirical student proximity networks and compare their performance."}, "https://arxiv.org/abs/2407.12348": {"title": "MM Algorithms for Statistical Estimation in Quantile Regression", "link": "https://arxiv.org/abs/2407.12348", "description": "arXiv:2407.12348v1 Announce Type: new \nAbstract: Quantile regression is a robust and practically useful way to efficiently model quantile varying correlation and predict varied response quantiles of interest. This article constructs and tests MM algorithms, which are simple to code and have been suggested superior to some other prominent quantile regression methods in nonregularized problems, in an array of quantile regression settings including linear (modeling different quantile coefficients both separately and simultaneously), nonparametric, regularized, and monotone quantile regression. Applications to various real data sets and two simulation studies comparing MM to existing tested methods have corroborated our algorithms' effectiveness. We have made one key advance by generalizing our MM algorithm to efficiently fit easy-to-predict-and-interpret parametric quantile regression models for data sets exhibiting manifest complicated nonlinear correlation patterns, which has not yet been covered by current literature to the best of our knowledge."}, "https://arxiv.org/abs/2407.12422": {"title": "Conduct Parameter Estimation in Homogeneous Goods Markets with Equilibrium Existence and Uniqueness Conditions: The Case of Log-linear Specification", "link": "https://arxiv.org/abs/2407.12422", "description": "arXiv:2407.12422v1 Announce Type: new \nAbstract: We propose a constrained generalized method of moments estimator (GMM) incorporating theoretical conditions for the unique existence of equilibrium prices for estimating conduct parameters in a log-linear model with homogeneous goods markets. First, we derive such conditions. Second, Monte Carlo simulations confirm that in a log-linear model, incorporating the conditions resolves the problems of implausibly low or negative values of conduct parameters."}, "https://arxiv.org/abs/2407.12557": {"title": "Comparing Homogeneous And Inhomogeneous Time Markov Chains For Modelling Degradation In Sewer Pipe Networks", "link": "https://arxiv.org/abs/2407.12557", "description": "arXiv:2407.12557v1 Announce Type: new \nAbstract: Sewer pipe systems are essential for social and economic welfare. Managing these systems requires robust predictive models for degradation behaviour. This study focuses on probability-based approaches, particularly Markov chains, for their ability to associate random variables with degradation. 
Literature predominantly uses homogeneous and inhomogeneous Markov chains for this purpose. However, their effectiveness in sewer pipe degradation modelling is still debatable. Some studies support homogeneous Markov chains, while others challenge their utility. We examine this issue using a large-scale sewer network in the Netherlands, incorporating historical inspection data. We model degradation with homogeneous discrete and continuous time Markov chains, and inhomogeneous-time Markov chains using Gompertz, Weibull, Log-Logistic and Log-Normal density functions. Our analysis suggests that, despite their higher computational requirements, inhomogeneous-time Markov chains are more appropriate for modelling the nonlinear stochastic characteristics related to sewer pipe degradation, particularly the Gompertz distribution. However, they pose a risk of over-fitting, necessitating significant improvements in parameter inference processes to effectively address this issue."}, "https://arxiv.org/abs/2407.12633": {"title": "Bayesian spatial functional data clustering: applications in disease surveillance", "link": "https://arxiv.org/abs/2407.12633", "description": "arXiv:2407.12633v1 Announce Type: new \nAbstract: Our method extends the application of random spanning trees to cases where the response variable belongs to the exponential family, making it suitable for a wide range of real-world scenarios, including non-Gaussian likelihoods. The proposed model addresses the limitations of previous spatial clustering methods by allowing all within-cluster model parameters to be cluster-specific, thus offering greater flexibility. Additionally, we propose a Bayesian inference algorithm that overcomes the computational challenges associated with the reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm by employing composition sampling and the integrated nested Laplace approximation (INLA) to compute the marginal distribution necessary for the acceptance probability. This enhancement improves the mixing and feasibility of Bayesian inference for complex models. We demonstrate the effectiveness of our approach through simulation studies and apply it to real-world disease mapping applications: COVID-19 in the United States of America, and dengue fever in the states of Minas Gerais and S\\~ao Paulo, Brazil. Our results highlight the model's capability to uncover meaningful spatial patterns and temporal dynamics in disease outbreaks, providing valuable insights for public health decision-making and resource allocation."}, "https://arxiv.org/abs/2407.12700": {"title": "Bayesian Joint Modeling of Interrater and Intrarater Reliability with Multilevel Data", "link": "https://arxiv.org/abs/2407.12700", "description": "arXiv:2407.12700v1 Announce Type: new \nAbstract: We formulate three generalized Bayesian models for analyzing interrater and intrarater reliability in the presence of multilevel data. Stan implementations of these models provide new estimates of interrater and intrarater reliability. We also derive formulas for calculating marginal correlations under each of the three models. Comparisons of the kappa estimates and marginal correlations across the different models are presented from two real-world datasets. 
Simulations demonstrate properties of the different measures of agreement under different model assumptions."}, "https://arxiv.org/abs/2407.12132": {"title": "Maximum-likelihood regression with systematic errors for astronomy and the physical sciences: I", "link": "https://arxiv.org/abs/2407.12132", "description": "arXiv:2407.12132v1 Announce Type: cross \nAbstract: The paper presents a new statistical method that enables the use of systematic errors in the maximum-likelihood regression of integer-count Poisson data to a parametric model. The method is primarily aimed at the characterization of the goodness-of-fit statistic in the presence of the over-dispersion that is induced by sources of systematic error, and is based on a quasi-maximum-likelihood method that retains the Poisson distribution of the data. We show that the Poisson deviance, which is the usual goodness-of-fit statistic and that is commonly referred to in astronomy as the Cash statistics, can be easily generalized in the presence of systematic errors, under rather general conditions. The method and the associated statistics are first developed theoretically, and then they are tested with the aid of numerical simulations and further illustrated with real-life data from astronomical observations. The statistical methods presented in this paper are intended as a simple general-purpose framework to include additional sources of uncertainty for the analysis of integer-count data in a variety of practical data analysis situations."}, "https://arxiv.org/abs/2407.12254": {"title": "COKE: Causal Discovery with Chronological Order and Expert Knowledge in High Proportion of Missing Manufacturing Data", "link": "https://arxiv.org/abs/2407.12254", "description": "arXiv:2407.12254v1 Announce Type: cross \nAbstract: Understanding causal relationships between machines is crucial for fault diagnosis and optimization in manufacturing processes. Real-world datasets frequently exhibit up to 90% missing data and high dimensionality from hundreds of sensors. These datasets also include domain-specific expert knowledge and chronological order information, reflecting the recording order across different machines, which is pivotal for discerning causal relationships within the manufacturing data. However, previous methods for handling missing data in scenarios akin to real-world conditions have not been able to effectively utilize expert knowledge. Conversely, prior methods that can incorporate expert knowledge struggle with datasets that exhibit missing values. Therefore, we propose COKE to construct causal graphs in manufacturing datasets by leveraging expert knowledge and chronological order among sensors without imputing missing data. Utilizing the characteristics of the recipe, we maximize the use of samples with missing values, derive embeddings from intersections with an initial graph that incorporates expert knowledge and chronological order, and create a sensor ordering graph. The graph-generating process has been optimized by an actor-critic architecture to obtain a final graph that has a maximum reward. Experimental evaluations in diverse settings of sensor quantities and missing proportions demonstrate that our approach compared with the benchmark methods shows an average improvement of 39.9% in the F1-score. Moreover, the F1-score improvement can reach 62.6% when considering the configuration similar to real-world datasets, and 85.0% in real-world semiconductor datasets. 
The source code is available at https://github.com/OuTingYun/COKE."}, "https://arxiv.org/abs/2407.12708": {"title": "An Approximation for the 32-point Discrete Fourier Transform", "link": "https://arxiv.org/abs/2407.12708", "description": "arXiv:2407.12708v1 Announce Type: cross \nAbstract: This brief note aims at condensing some results on the 32-point approximate DFT and discussing its arithmetic complexity."}, "https://arxiv.org/abs/2407.12751": {"title": "Scalable Monte Carlo for Bayesian Learning", "link": "https://arxiv.org/abs/2407.12751", "description": "arXiv:2407.12751v1 Announce Type: cross \nAbstract: This book aims to provide a graduate-level introduction to advanced topics in Markov chain Monte Carlo (MCMC) algorithms, as applied broadly in the Bayesian computational context. Most, if not all of these topics (stochastic gradient MCMC, non-reversible MCMC, continuous time MCMC, and new techniques for convergence assessment) have emerged as recently as the last decade, and have driven substantial recent practical and theoretical advances in the field. A particular focus is on methods that are scalable with respect to either the amount of data, or the data dimension, motivated by the emerging high-priority application areas in machine learning and AI."}, "https://arxiv.org/abs/2205.05002": {"title": "Estimating Discrete Games of Complete Information: Bringing Logit Back in the Game", "link": "https://arxiv.org/abs/2205.05002", "description": "arXiv:2205.05002v3 Announce Type: replace \nAbstract: Estimating discrete games of complete information is often computationally difficult due to partial identification and the absence of closed-form moment characterizations. For both unordered and ordered-actions games, I propose computationally tractable approaches to estimation and inference that leverage convex programming theory via logit expressions. These methods are scalable to settings with many players and actions as they remove the computational burden associated with equilibria enumeration, numerical simulation, and grid search. I use simulation and empirical examples to show that my approaches can be several orders of magnitude faster than existing approaches."}, "https://arxiv.org/abs/2303.02498": {"title": "A stochastic network approach to clustering and visualising single-cell genomic count data", "link": "https://arxiv.org/abs/2303.02498", "description": "arXiv:2303.02498v3 Announce Type: replace \nAbstract: Important tasks in the study of genomic data include the identification of groups of similar cells (for example by clustering), and visualisation of data summaries (for example by dimensional reduction). In this paper, we propose a novel approach to studying single-cell genomic data, by modelling the observed genomic data count matrix $\\mathbf{X}\\in\\mathbb{Z}_{\\geq0}^{p\\times n}$ as a bipartite network with multi-edges. Utilising this first-principles network representation of the raw data, we propose clustering single cells in a suitably identified $d$-dimensional Laplacian Eigenspace (LE) via a Gaussian mixture model (GMM-LE), and employing UMAP to non-linearly project the LE to two dimensions for visualisation (UMAP-LE). This LE representation of the data estimates transformed latent positions (of genes and cells), under a latent position model of nodes in a bipartite stochastic network. 
We demonstrate how these estimated latent positions can enable fine-grained clustering and visualisation of single-cell genomic data, by application to data from three recent genomics studies in different biological contexts. In each data application, clusters of cells independently learned by our proposed methodology are found to correspond to cells expressing specific marker genes that were independently defined by domain experts. In this validation setting, our proposed clustering methodology outperforms the industry-standard for these data. Furthermore, we validate components of the LE decomposition of the data by contrasting healthy cells from normal and at-risk groups in a machine-learning model, thereby generating an LE cancer biomarker that significantly predicts long-term patient survival outcome in an independent validation dataset."}, "https://arxiv.org/abs/2304.02171": {"title": "Faster estimation of dynamic discrete choice models using index invertibility", "link": "https://arxiv.org/abs/2304.02171", "description": "arXiv:2304.02171v3 Announce Type: replace \nAbstract: Many estimators of dynamic discrete choice models with persistent unobserved heterogeneity have desirable statistical properties but are computationally intensive. In this paper we propose a method to quicken estimation for a broad class of dynamic discrete choice problems by exploiting semiparametric index restrictions. Specifically, we propose an estimator for models whose reduced form parameters are invertible functions of one or more linear indices (Ahn, Ichimura, Powell and Ruud 2018), a property we term index invertibility. We establish that index invertibility implies a set of equality constraints on the model parameters. Our proposed estimator uses the equality constraints to decrease the dimension of the optimization problem, thereby generating computational gains. Our main result shows that the proposed estimator is asymptotically equivalent to the unconstrained, computationally heavy estimator. In addition, we provide a series of results on the number of independent index restrictions on the model parameters, providing theoretical guidance on the extent of computational gains. Finally, we demonstrate the advantages of our approach via Monte Carlo simulations."}, "https://arxiv.org/abs/2308.04971": {"title": "Stein Variational Rare Event Simulation", "link": "https://arxiv.org/abs/2308.04971", "description": "arXiv:2308.04971v2 Announce Type: replace \nAbstract: Rare event simulation and rare event probability estimation are important tasks within the analysis of systems subject to uncertainty and randomness. Simultaneously, accurately estimating rare event probabilities is an inherently difficult task that calls for dedicated tools and methods. One way to improve estimation efficiency on difficult rare event estimation problems is to leverage gradients of the computational model representing the system in consideration, e.g., to explore the rare event faster and more reliably. We present a novel approach for estimating rare event probabilities using such model gradients by drawing on a technique to generate samples from non-normalized posterior distributions in Bayesian inference - the Stein variational gradient descent. We propagate samples generated from a tractable input distribution towards a near-optimal rare event importance sampling distribution by exploiting a similarity of the latter with Bayesian posterior distributions. 
Sample propagation takes the shape of passing samples through a sequence of invertible transforms such that their densities can be tracked and used to construct an unbiased importance sampling estimate of the rare event probability - the Stein variational rare event estimator. We discuss settings and parametric choices of the algorithm and suggest a method for balancing convergence speed with stability by choosing the step width or base learning rate adaptively. We analyze the method's performance on several analytical test functions and two engineering examples in low to high stochastic dimensions ($d = 2 - 869$) and find that it consistently outperforms other state-of-the-art gradient-based rare event simulation methods."}, "https://arxiv.org/abs/2312.01870": {"title": "Extreme-value modelling of migratory bird arrival dates: Insights from citizen science data", "link": "https://arxiv.org/abs/2312.01870", "description": "arXiv:2312.01870v3 Announce Type: replace \nAbstract: Citizen science mobilises many observers and gathers huge datasets but often without strict sampling protocols, which results in observation biases due to heterogeneity in sampling effort that can lead to biased statistical inferences. We develop a spatiotemporal Bayesian hierarchical model for bias-corrected estimation of arrival dates of the first migratory bird individuals at a breeding site. Higher sampling effort could be correlated with earlier observed dates. We implement data fusion of two citizen-science datasets with fundamentally different protocols (BBS, eBird) and map posterior distributions of the latent process, which contains four spatial components with Gaussian process priors: species niche; sampling effort; position and scale parameters of annual first date of arrival. The data layer includes four response variables: counts of observed eBird locations (Poisson); presence-absence at observed eBird locations (Binomial); BBS occurrence counts (Poisson); first arrival dates (Generalized Extreme-Value). We devise a Markov Chain Monte Carlo scheme and check by simulation that the latent process components are identifiable. We apply our model to several migratory bird species in the northeastern US for 2001--2021, and find that the sampling effort significantly modulates the observed first arrival date. We exploit this relationship to effectively bias-correct predictions of the true first arrival dates."}, "https://arxiv.org/abs/2101.09271": {"title": "Representation of Context-Specific Causal Models with Observational and Interventional Data", "link": "https://arxiv.org/abs/2101.09271", "description": "arXiv:2101.09271v4 Announce Type: replace-cross \nAbstract: We address the problem of representing context-specific causal models based on both observational and experimental data collected under general (e.g. hard or soft) interventions by introducing a new family of context-specific conditional independence models called CStrees. This family is defined via a novel factorization criterion that allows for a generalization of the factorization property defining general interventional DAG models. We derive a graphical characterization of model equivalence for observational CStrees that extends the Verma and Pearl criterion for DAGs. This characterization is then extended to CStree models under general, context-specific interventions. To obtain these results, we formalize a notion of context-specific intervention that can be incorporated into concise graphical representations of CStree models. 
We relate CStrees to other context-specific models, showing that the families of DAGs, CStrees, labeled DAGs and staged trees form a strict chain of inclusions. We end with an application of interventional CStree models to a real data set, revealing the context-specific nature of the data dependence structure and the soft, interventional perturbations."}, "https://arxiv.org/abs/2302.04412": {"title": "Spatiotemporal factor models for functional data with application to population map forecast", "link": "https://arxiv.org/abs/2302.04412", "description": "arXiv:2302.04412v3 Announce Type: replace-cross \nAbstract: The proliferation of mobile devices has led to the collection of large amounts of population data. This situation has prompted the need to utilize this rich, multidimensional data in practical applications. In response to this trend, we have integrated functional data analysis (FDA) and factor analysis to address the challenge of predicting hourly population changes across various districts in Tokyo. Specifically, by assuming a Gaussian process, we avoided the large covariance matrix parameters of the multivariate normal distribution. In addition, the data were both time and spatially dependent between districts. To capture these characteristics, a Bayesian factor model was introduced, which modeled the time series of a small number of common factors and expressed the spatial structure through factor loading matrices. Furthermore, the factor loading matrices were made identifiable and sparse to ensure the interpretability of the model. We also proposed a Bayesian shrinkage method as a systematic approach for factor selection. Through numerical experiments and data analysis, we investigated the predictive accuracy and interpretability of our proposed method. We concluded that the flexibility of the method allows for the incorporation of additional time series features, thereby improving its accuracy."}, "https://arxiv.org/abs/2407.13029": {"title": "Bayesian Inference and the Principle of Maximum Entropy", "link": "https://arxiv.org/abs/2407.13029", "description": "arXiv:2407.13029v1 Announce Type: new \nAbstract: Bayes' theorem incorporates distinct types of information through the likelihood and prior. Direct observations of state variables enter the likelihood and modify posterior probabilities through consistent updating. Information in terms of expected values of state variables modify posterior probabilities by constraining prior probabilities to be consistent with the information. Constraints on the prior can be exact, limiting hypothetical frequency distributions to only those that satisfy the constraints, or be approximate, allowing residual deviations from the exact constraint to some degree of tolerance. When the model parameters and constraint tolerances are known, posterior probability follows directly from Bayes' theorem. When parameters and tolerances are unknown a prior for them must be specified. When the system is close to statistical equilibrium the computation of posterior probabilities is simplified due to the concentration of the prior on the maximum entropy hypothesis. 
The relationship between maximum entropy reasoning and Bayes' theorem from this point of view is that maximum entropy reasoning is a special case of Bayesian inference with a constrained entropy-favoring prior."}, "https://arxiv.org/abs/2407.13169": {"title": "Combining Climate Models using Bayesian Regression Trees and Random Paths", "link": "https://arxiv.org/abs/2407.13169", "description": "arXiv:2407.13169v1 Announce Type: new \nAbstract: Climate models, also known as general circulation models (GCMs), are essential tools for climate studies. Each climate model may have varying accuracy across the input domain, but no single model is uniformly better than the others. One strategy for improving climate model prediction performance is to integrate multiple model outputs using input-dependent weights. Along with this concept, weight functions modeled using Bayesian Additive Regression Trees (BART) were recently shown to be useful for integrating multiple Effective Field Theories in nuclear physics applications. However, a restriction of this approach is that the weights could only be modeled as piecewise constant functions. To smoothly integrate multiple climate models, we propose a new tree-based model, Random Path BART (RPBART), that incorporates random path assignments into the BART model to produce smooth weight functions and smooth predictions of the physical system, all in a matrix-free formulation. The smoothness feature of RPBART requires a more complex prior specification, for which we introduce a semivariogram to guide its hyperparameter selection. This approach is easy to interpret, computationally cheap, and avoids an expensive cross-validation study. Finally, we propose a posterior projection technique to enable detailed analysis of the fitted posterior weight functions. This allows us to identify a sparse set of climate models that can largely recover the underlying system within a given spatial region, as well as to quantify model discrepancy within the model set under consideration. Our method is demonstrated on an ensemble of 8 GCMs modeling the average monthly surface temperature."}, "https://arxiv.org/abs/2407.13261": {"title": "Enhanced inference for distributions and quantiles of individual treatment effects in various experiments", "link": "https://arxiv.org/abs/2407.13261", "description": "arXiv:2407.13261v1 Announce Type: new \nAbstract: Understanding treatment effect heterogeneity has become increasingly important in many fields. In this paper we study distributions and quantiles of individual treatment effects to provide a more comprehensive and robust understanding of treatment effects beyond usual averages, even though they are more challenging to infer due to nonidentifiability from observed data. Recent randomization-based approaches offer finite-sample valid inference for treatment effect distributions and quantiles in both completely randomized and stratified randomized experiments, but can be overly conservative by assuming the worst-case scenario where units with large effects are all assigned to the treated (or control) group. We introduce two improved methods to enhance the power of these existing approaches. The first method reinterprets existing approaches as inferring treatment effects among only treated or control units, and then combines the inference for treated and control units to infer treatment effects for all units. The second method explicitly controls for the actual number of treated units with large effects. 
Both simulations and applications demonstrate the substantial gains from the improved methods. These methods are further extended to sampling-based experiments as well as quasi-experiments from matching, in which the ideas for both improved methods play critical and complementary roles."}, "https://arxiv.org/abs/2407.13283": {"title": "Heterogeneous Clinical Trial Outcomes via Multi-Output Gaussian Processes", "link": "https://arxiv.org/abs/2407.13283", "description": "arXiv:2407.13283v1 Announce Type: new \nAbstract: We make use of Kronecker structure for scaling Gaussian Process models to large-scale, heterogeneous, clinical data sets. Repeated measures, commonly performed in clinical research, facilitate computational acceleration for nonlinear Bayesian nonparametric models and enable exact sampling for non-conjugate inference, when combinations of continuous and discrete endpoints are observed. Model inference is performed in Stan, and comparisons are made with brms on simulated data and two real clinical data sets, following a radiological image quality theme. Scalable Gaussian Process models compare favourably with parametric models on real data sets with 17,460 observations. Different GP model specifications are explored, with components analogous to random effects, and their theoretical properties are described."}, "https://arxiv.org/abs/2407.13302": {"title": "Non-zero block selector: A linear correlation coefficient measure for blocking-selection models", "link": "https://arxiv.org/abs/2407.13302", "description": "arXiv:2407.13302v1 Announce Type: new \nAbstract: Multiple-group data is widely used in genomic studies, finance, and social science. This study investigates a block structure that consists of covariate and response groups. It examines the block-selection problem of high-dimensional models with group structures for both responses and covariates, where both the number of blocks and the dimension within each block are allowed to grow larger than the sample size. We propose a novel strategy for detecting the block structure, which includes the block-selection model and a non-zero block selector (NBS). We establish the uniform consistency of the NBS and propose three estimators based on the NBS to enhance modeling efficiency. We prove that the estimators achieve the oracle solution and show that they are consistent, jointly asymptotically normal, and efficient in modeling extremely high-dimensional data. Simulations generate complex data settings and demonstrate the superiority of the proposed method. A gene-data analysis also demonstrates its effectiveness."}, "https://arxiv.org/abs/2407.13314": {"title": "NIRVAR: Network Informed Restricted Vector Autoregression", "link": "https://arxiv.org/abs/2407.13314", "description": "arXiv:2407.13314v1 Announce Type: new \nAbstract: High-dimensional panels of time series arise in many scientific disciplines such as neuroscience, finance, and macroeconomics. Often, co-movements within groups of the panel components occur. Extracting these groupings from the data provides a coarse-grained description of the complex system in question and can inform subsequent prediction tasks. We develop a novel methodology to model such a panel as a restricted vector autoregressive process, where the coefficient matrix is the weighted adjacency matrix of a stochastic block model. 
This network time series model, which we call the Network Informed Restricted Vector Autoregression (NIRVAR) model, yields a coefficient matrix that has a sparse block-diagonal structure. We propose an estimation procedure that embeds each panel component in a low-dimensional latent space and clusters the embedded points to recover the blocks of the coefficient matrix. Crucially, the method allows for network-based time series modelling when the underlying network is unobserved. We derive the bias, consistency and asymptotic normality of the NIRVAR estimator. Simulation studies suggest that the NIRVAR estimated embedded points are Gaussian distributed around the ground truth latent positions. On three applications to finance, macroeconomics, and transportation systems, NIRVAR outperforms competing factor and network time series models in terms of out-of-sample prediction."}, "https://arxiv.org/abs/2407.13374": {"title": "A unifying modelling approach for hierarchical distributed lag models", "link": "https://arxiv.org/abs/2407.13374", "description": "arXiv:2407.13374v1 Announce Type: new \nAbstract: We present a statistical modelling framework for implementing Distributed Lag Models (DLMs), encompassing several extensions of the approach to capture the temporally distributed effect from covariates via regression. We place DLMs in the context of penalised Generalized Additive Models (GAMs) and illustrate their implementation via the R package \\texttt{mgcv}, which allows for flexible and interpretable inference in addition to thorough model assessment. We show how the interpretation of penalised splines as random quantities enables approximate Bayesian inference and hierarchical structures in the same practical setting. We focus on epidemiological studies and demonstrate the approach with application to mortality data from Cyprus and Greece. For the Cyprus case study, we investigate, for the first time, the joint lagged effects from both temperature and humidity on mortality risk with the unexpected result that humidity severely increases risk during cold rather than hot conditions. Another novel application is the use of the proposed framework for hierarchical pooling, to estimate district-specific covariate-lag risk on mortality and the use of posterior simulation to compare risk across districts."}, "https://arxiv.org/abs/2407.13402": {"title": "Block-Additive Gaussian Processes under Monotonicity Constraints", "link": "https://arxiv.org/abs/2407.13402", "description": "arXiv:2407.13402v1 Announce Type: new \nAbstract: We generalize the additive constrained Gaussian process framework to handle interactions between input variables while enforcing monotonicity constraints everywhere on the input space. The block-additive structure of the model is particularly suitable in the presence of interactions, while maintaining tractable computations. In addition, we develop a sequential algorithm, MaxMod, for model selection (i.e., the choice of the active input variables and of the blocks). We speed up our implementations through efficient matrix computations and thanks to explicit expressions of criteria involved in MaxMod. 
The performance and scalability of our methodology are showcased with several numerical examples in dimensions up to 120, as well as in a 5D real-world coastal flooding application, where interpretability is enhanced by the selection of the blocks."}, "https://arxiv.org/abs/2407.13446": {"title": "Subsampled One-Step Estimation for Fast Statistical Inference", "link": "https://arxiv.org/abs/2407.13446", "description": "arXiv:2407.13446v1 Announce Type: new \nAbstract: Subsampling is an effective approach to alleviate the computational burden associated with large-scale datasets. Nevertheless, existing subsampling estimators incur a substantial loss in estimation efficiency compared to estimators based on the full dataset. Specifically, the convergence rate of existing subsampling estimators is typically $n^{-1/2}$ rather than $N^{-1/2}$, where $n$ and $N$ denote the subsample and full data sizes, respectively. This paper proposes a subsampled one-step (SOS) method to mitigate the estimation efficiency loss by utilizing the asymptotic expansions of the subsampling and full-data estimators. The resulting SOS estimator is computationally efficient and achieves a fast convergence rate of $\\max\\{n^{-1}, N^{-1/2}\\}$ rather than $n^{-1/2}$. We establish the asymptotic distribution of the SOS estimator, which can be non-normal in general, and construct confidence intervals on top of the asymptotic distribution. Furthermore, we prove that the SOS estimator is asymptotically normal and equivalent to the full data-based estimator when $n / \\sqrt{N} \\to \\infty$. Simulation studies and real data analyses were conducted to demonstrate the finite sample performance of the SOS estimator. Numerical results suggest that the SOS estimator is almost as computationally efficient as the uniform subsampling estimator while achieving similar estimation efficiency to the full data-based estimator."}, "https://arxiv.org/abs/2407.13546": {"title": "Treatment-control comparisons in platform trials including non-concurrent controls", "link": "https://arxiv.org/abs/2407.13546", "description": "arXiv:2407.13546v1 Announce Type: new \nAbstract: Shared controls in platform trials comprise concurrent and non-concurrent controls. For a given experimental arm, non-concurrent controls refer to data from patients allocated to the control arm before the arm enters the trial. The use of non-concurrent controls in the analysis is attractive because it may increase the trial's power for testing treatment differences while decreasing the sample size. However, since arms are added sequentially in the trial, randomization occurs at different times, which can introduce bias in the estimates due to time trends. In this article, we present methods to incorporate non-concurrent control data in treatment-control comparisons, allowing for time trends. We focus mainly on frequentist approaches that model the time trend and Bayesian strategies that limit the borrowing level depending on the heterogeneity between concurrent and non-concurrent controls. We examine the impact of time trends, overlap between experimental treatment arms and entry times of arms in the trial on the operating characteristics of treatment effect estimators for each method under different patterns for the time trends. 
We argue under which conditions the methods lead to type 1 error control and discuss the gain in power compared to trials using only concurrent controls by means of a simulation study in which methods are compared."}, "https://arxiv.org/abs/2407.13613": {"title": "Revisiting Randomization with the Cube Method", "link": "https://arxiv.org/abs/2407.13613", "description": "arXiv:2407.13613v1 Announce Type: new \nAbstract: We propose a novel randomization approach for randomized controlled trials (RCTs), named the cube method. The cube method allows for the selection of balanced samples across various covariate types, ensuring consistent adherence to balance tests and, hence, substantial precision gains when estimating treatment effects. We establish several statistical properties for the population and sample average treatment effects (PATE and SATE, respectively) under randomization using the cube method. The relevance of the cube method is particularly striking when compared with the behavior of prevailing treatment-allocation methods as the number of covariates to balance increases. We formally derive and compare bounds of balancing adjustments depending on the number of units $n$ and the number of covariates $p$ and show that our randomization approach outperforms methods proposed in the literature when $p$ is large and $p/n$ tends to 0. We run simulation studies to illustrate the substantial gains from the cube method for a large set of covariates."}, "https://arxiv.org/abs/2407.13678": {"title": "Joint modelling of time-to-event and longitudinal response using robust skew normal-independent distributions", "link": "https://arxiv.org/abs/2407.13678", "description": "arXiv:2407.13678v1 Announce Type: new \nAbstract: Joint modelling of longitudinal observations and event times remains a topic of considerable interest in biomedical research. For example, in HIV studies, a longitudinal biomarker such as CD4 cell count in a patient's blood over follow-up months is jointly modelled with the time to disease progression, death or dropout via a random intercept term mostly assumed to be Gaussian. However, longitudinal observations in these kinds of studies often exhibit non-Gaussian behavior (due to a high degree of skewness), and parameter estimation is often compromised under violations of the Gaussian assumptions. In linear mixed-effects models, the subject-specific random effects are typically assumed to be Gaussian, which may not be true in many situations. Further, this assumption makes the model extremely sensitive to outlying observations. We address these issues in this work by devising a joint model which uses a robust distribution in a parametric setup along with a conditional distributional assumption that ensures dependency between the two processes given the subject-specific random effects."}, "https://arxiv.org/abs/2407.13495": {"title": "Identifying Research Hotspots and Future Development Trends in Current Psychology: A Bibliometric Analysis of the Past Decade's Publications", "link": "https://arxiv.org/abs/2407.13495", "description": "arXiv:2407.13495v1 Announce Type: cross \nAbstract: By conducting a bibliometric analysis on 4,869 publications in Current Psychology from 2013 to 2022, this paper examined the annual publications and annual citations, as well as the leading institutions, countries, and keywords. CiteSpace, VOSviewer and SCImago Graphica were utilized for visualization analysis. 
On one hand, this paper analyzed the academic influence of Current Psychology over the past decade. On the other hand, it explored the research hotspots and future development trends within the field of international psychology. The results revealed that the three main research areas covered in the publications of Current Psychology were: the psychological well-being of young people, the negative emotions of adults, and self-awareness and management. The latest research hotspots highlighted in the journal include negative emotions, personality, and mental health. The three main development trends of Current Psychology are: 1) exploring the personality psychology of both adolescents and adults, 2) promoting interdisciplinary research to study social psychological issues through the use of diversified research methods, and 3) emphasizing the emotional psychology of individuals and their interaction with social reality, from a people-oriented perspective."}, "https://arxiv.org/abs/2407.13514": {"title": "Topological Analysis of Seizure-Induced Changes in Brain Hierarchy Through Effective Connectivity", "link": "https://arxiv.org/abs/2407.13514", "description": "arXiv:2407.13514v1 Announce Type: cross \nAbstract: Traditional Topological Data Analysis (TDA) methods, such as Persistent Homology (PH), rely on distance measures (e.g., cross-correlation, partial correlation, coherence, and partial coherence) that are symmetric by definition. While useful for studying topological patterns in functional brain connectivity, the main limitation of these methods is their inability to capture the directional dynamics, which is crucial for understanding effective brain connectivity. We propose the Causality-Based Topological Ranking (CBTR) method, which integrates Causal Inference (CI) to assess effective brain connectivity with Hodge Decomposition (HD) to rank brain regions based on their mutual influence. Our simulations confirm that the CBTR method accurately and consistently identifies hierarchical structures in multivariate time series data. Moreover, this method effectively identifies brain regions showing the most significant interaction changes with other regions during seizures using electroencephalogram (EEG) data. These results provide novel insights into the brain's hierarchical organization and illuminate the impact of seizures on its dynamics."}, "https://arxiv.org/abs/2407.13641": {"title": "Optimal rates for estimating the covariance kernel from synchronously sampled functional data", "link": "https://arxiv.org/abs/2407.13641", "description": "arXiv:2407.13641v1 Announce Type: cross \nAbstract: We obtain minimax-optimal convergence rates in the supremum norm, including information-theoretic lower bounds, for estimating the covariance kernel of a stochastic process which is repeatedly observed at discrete, synchronous design points. In particular, for dense design we obtain the $\\sqrt n$-rate of convergence in the supremum norm without additional logarithmic factors which typically occur in the results in the literature. Surprisingly, in the transition from dense to sparse design the rates do not reflect the two-dimensional nature of the covariance kernel but correspond to those for univariate mean function estimation. Our estimation method can make use of higher-order smoothness of the covariance kernel away from the diagonal, and does not require the same smoothness on the diagonal itself. 
Hence, as in Mohammadi and Panaretos (2024) we can cover covariance kernels of processes with rough sample paths. Moreover, the estimator does not use mean function estimation to form residuals, and no smoothness assumptions on the mean have to be imposed. In the dense case we also obtain a central limit theorem in the supremum norm, which can be used as the basis for the construction of uniform confidence sets. Simulations and real-data applications illustrate the practical usefulness of the methods."}, "https://arxiv.org/abs/2006.02611": {"title": "Tensor Factor Model Estimation by Iterative Projection", "link": "https://arxiv.org/abs/2006.02611", "description": "arXiv:2006.02611v3 Announce Type: replace \nAbstract: Tensor time series, which is a time series consisting of tensorial observations, has become ubiquitous. It typically exhibits high dimensionality. One approach for dimension reduction is to use a factor model structure, in a form similar to Tucker tensor decomposition, except that the time dimension is treated as a dynamic process with a time dependent structure. In this paper we introduce two approaches to estimate such a tensor factor model by using iterative orthogonal projections of the original tensor time series. These approaches extend the existing estimation procedures and improve the estimation accuracy and convergence rate significantly as proven in our theoretical investigation. Our algorithms are similar to the higher order orthogonal projection method for tensor decomposition, but with significant differences due to the need to unfold tensors in the iterations and the use of autocorrelation. Consequently, our analysis is significantly different from the existing ones. Computational and statistical lower bounds are derived to prove the optimality of the sample size requirement and convergence rate for the proposed methods. Simulation study is conducted to further illustrate the statistical properties of these estimators."}, "https://arxiv.org/abs/2101.05774": {"title": "Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables", "link": "https://arxiv.org/abs/2101.05774", "description": "arXiv:2101.05774v4 Announce Type: replace \nAbstract: We propose a procedure which combines hierarchical clustering with a test of overidentifying restrictions for selecting valid instrumental variables (IV) from a large set of IVs. Some of these IVs may be invalid in that they fail the exclusion restriction. We show that if the largest group of IVs is valid, our method achieves oracle properties. Unlike existing techniques, our work deals with multiple endogenous regressors. Simulation results suggest an advantageous performance of the method in various settings. The method is applied to estimating the effect of immigration on wages."}, "https://arxiv.org/abs/2212.14075": {"title": "Forward Orthogonal Deviations GMM and the Absence of Large Sample Bias", "link": "https://arxiv.org/abs/2212.14075", "description": "arXiv:2212.14075v2 Announce Type: replace \nAbstract: It is well known that generalized method of moments (GMM) estimators of dynamic panel data regressions can have significant bias when the number of time periods ($T$) is not small compared to the number of cross-sectional units ($n$). The bias is attributed to the use of many instrumental variables. 
This paper shows that if the maximum number of instrumental variables used in a period increases with $T$ at a rate slower than $T^{1/2}$, then GMM estimators that exploit the forward orthogonal deviations (FOD) transformation do not have asymptotic bias, regardless of how fast $T$ increases relative to $n$. This conclusion is specific to using the FOD transformation. A similar conclusion does not necessarily apply when other transformations are used to remove fixed effects. Monte Carlo evidence illustrating the analytical results is provided."}, "https://arxiv.org/abs/2307.00093": {"title": "Design Sensitivity and Its Implications for Weighted Observational Studies", "link": "https://arxiv.org/abs/2307.00093", "description": "arXiv:2307.00093v2 Announce Type: replace \nAbstract: Sensitivity to unmeasured confounding is not typically a primary consideration in designing treated-control comparisons in observational studies. We introduce a framework allowing researchers to optimize robustness to omitted variable bias at the design stage using a measure called design sensitivity. Design sensitivity, which describes the asymptotic power of a sensitivity analysis, allows transparent assessment of the impact of different estimation strategies on sensitivity. We apply this general framework to two commonly-used sensitivity models, the marginal sensitivity model and the variance-based sensitivity model. By comparing design sensitivities, we interrogate how key features of weighted designs, including choices about trimming of weights and model augmentation, impact robustness to unmeasured confounding, and how these impacts may differ for the two different sensitivity models. We illustrate the proposed framework on a study examining drivers of support for the 2016 Colombian peace agreement."}, "https://arxiv.org/abs/2307.10454": {"title": "Latent Gaussian dynamic factor modeling and forecasting for multivariate count time series", "link": "https://arxiv.org/abs/2307.10454", "description": "arXiv:2307.10454v2 Announce Type: replace \nAbstract: This work considers estimation and forecasting in a multivariate, possibly high-dimensional count time series model constructed from a transformation of a latent Gaussian dynamic factor series. The estimation of the latent model parameters is based on second-order properties of the count and underlying Gaussian time series, yielding estimators of the underlying covariance matrices for which standard principal component analysis applies. Theoretical consistency results are established for the proposed estimation, building on certain concentration results for the models of the type considered. They also involve the memory of the latent Gaussian process, quantified through a spectral gap, shown to be suitably bounded as the model dimension increases, which is of independent interest. In addition, novel cross-validation schemes are suggested for model selection. The forecasting is carried out through a particle-based sequential Monte Carlo, leveraging Kalman filtering techniques. 
A simulation study and an application are also considered."}, "https://arxiv.org/abs/2212.01621": {"title": "A simple extension of Azadkia & Chatterjee's rank correlation to multi-response vectors", "link": "https://arxiv.org/abs/2212.01621", "description": "arXiv:2212.01621v3 Announce Type: replace-cross \nAbstract: Recently, Chatterjee (2023) recognized the lack of a direct generalization of his rank correlation $\\xi$ in Azadkia and Chatterjee (2021) to a multi-dimensional response vector. As a natural solution to this problem, we here propose an extension of $\\xi$ that is applicable to a set of $q \\geq 1$ response variables, where our approach builds upon converting the original vector-valued problem into a univariate problem and then applying the rank correlation $\\xi$ to it. Our novel measure $T$ quantifies the scale-invariant extent of functional dependence of a response vector $\\mathbf{Y} = (Y_1,\\dots,Y_q)$ on predictor variables $\\mathbf{X} = (X_1, \\dots,X_p)$, characterizes independence of $\\mathbf{X}$ and $\\mathbf{Y}$ as well as perfect dependence of $\\mathbf{Y}$ on $\\mathbf{X}$ and hence fulfills all the characteristics of a measure of predictability. Aiming at maximum interpretability, we provide various invariance results for $T$ as well as a closed-form expression in multivariate normal models. Building upon the graph-based estimator for $\\xi$ in Azadkia and Chatterjee (2021), we obtain a non-parametric, strongly consistent estimator for $T$ and show its asymptotic normality. Based on this estimator, we develop a model-free and dependence-based feature ranking and forward feature selection for multiple-outcome data. Simulation results and real case studies illustrate $T$'s broad applicability."}, "https://arxiv.org/abs/2302.03391": {"title": "Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection", "link": "https://arxiv.org/abs/2302.03391", "description": "arXiv:2302.03391v2 Announce Type: replace-cross \nAbstract: Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple l1 penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model. We demonstrate the performances of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses."}, "https://arxiv.org/abs/2311.12978": {"title": "Physics-Informed Priors with Application to Boundary Layer Velocity", "link": "https://arxiv.org/abs/2311.12978", "description": "arXiv:2311.12978v2 Announce Type: replace-cross \nAbstract: One of the most popular recent areas of machine learning predicates the use of neural networks augmented by information about the underlying process in the form of Partial Differential Equations (PDEs). 
These physics-informed neural networks are obtained by penalizing the inference with a PDE, and have been cast as a minimization problem currently lacking a formal approach to quantify the uncertainty. In this work, we propose a novel model-based framework which regards the PDE as a prior information of a deep Bayesian neural network. The prior is calibrated without data to resemble the PDE solution in the prior mean, while our degree in confidence on the PDE with respect to the data is expressed in terms of the prior variance. The information embedded in the PDE is then propagated to the posterior yielding physics-informed forecasts with uncertainty quantification. We apply our approach to a simulated viscous fluid and to experimentally-obtained turbulent boundary layer velocity in a water tunnel using an appropriately simplified Navier-Stokes equation. Our approach requires very few observations to produce physically-consistent forecasts as opposed to non-physical forecasts stemming from non-informed priors, thereby allowing forecasting complex systems where some amount of data as well as some contextual knowledge is available."}, "https://arxiv.org/abs/2407.13814": {"title": "Building Population-Informed Priors for Bayesian Inference Using Data-Consistent Stochastic Inversion", "link": "https://arxiv.org/abs/2407.13814", "description": "arXiv:2407.13814v1 Announce Type: new \nAbstract: Bayesian inference provides a powerful tool for leveraging observational data to inform model predictions and uncertainties. However, when such data is limited, Bayesian inference may not adequately constrain uncertainty without the use of highly informative priors. Common approaches for constructing informative priors typically rely on either assumptions or knowledge of the underlying physics, which may not be available in all scenarios. In this work, we consider the scenario where data are available on a population of assets/individuals, which occurs in many problem domains such as biomedical or digital twin applications, and leverage this population-level data to systematically constrain the Bayesian prior and subsequently improve individualized inferences. The approach proposed in this paper is based upon a recently developed technique known as data-consistent inversion (DCI) for constructing a pullback probability measure. Succinctly, we utilize DCI to build population-informed priors for subsequent Bayesian inference on individuals. While the approach is general and applies to nonlinear maps and arbitrary priors, we prove that for linear inverse problems with Gaussian priors, the population-informed prior produces an increase in the information gain as measured by the determinant and trace of the inverse posterior covariance. We also demonstrate that the Kullback-Leibler divergence often improves with high probability. Numerical results, including linear-Gaussian examples and one inspired by digital twins for additively manufactured assets, indicate that there is significant value in using these population-informed priors."}, "https://arxiv.org/abs/2407.13865": {"title": "Projection-pursuit Bayesian regression for symmetric matrix predictors", "link": "https://arxiv.org/abs/2407.13865", "description": "arXiv:2407.13865v1 Announce Type: new \nAbstract: This paper develops a novel Bayesian approach for nonlinear regression with symmetric matrix predictors, often used to encode connectivity of different nodes. 
Unlike methods that vectorize matrices as predictors, which results in a large number of model parameters and unstable estimation, we propose a Bayesian multi-index regression method, resulting in a projection-pursuit-type estimator that leverages the structure of matrix-valued predictors. We establish the model identifiability conditions and impose a sparsity-inducing prior on the projection directions for sparse sampling to prevent overfitting and enhance interpretability of the parameter estimates. Posterior inference is conducted through Bayesian backfitting. The performance of the proposed method is evaluated through simulation studies and a case study investigating the relationship between brain connectivity features and cognitive scores."}, "https://arxiv.org/abs/2407.13904": {"title": "In defense of MAR over latent ignorability (or latent MAR) for outcome missingness in studying principal causal effects: a causal graph view", "link": "https://arxiv.org/abs/2407.13904", "description": "arXiv:2407.13904v1 Announce Type: new \nAbstract: This paper concerns outcome missingness in principal stratification analysis. We revisit a common assumption known as latent ignorability or latent missing-at-random (LMAR), often considered a relaxation of missing-at-random (MAR). LMAR posits that the outcome is independent of its missingness if one conditions on principal stratum (which is partially unobservable) in addition to observed variables. The literature has focused on methods assuming LMAR (usually supplemented with a more specific assumption about the missingness), without considering the theoretical plausibility and necessity of LMAR. In this paper, we devise a way to represent principal stratum in causal graphs, and use causal graphs to examine this assumption. We find that LMAR is harder to satisfy than MAR, and for the purpose of breaking the dependence between the outcome and its missingness, no benefit is gained from conditioning on principal stratum on top of conditioning on observed variables. This finding has an important implication: MAR should be preferred over LMAR. This is convenient because MAR is easier to handle and, unlike LMAR, requires no additional assumption once assumed. We thus turn to focus on the plausibility of MAR and its implications, with a view to facilitating appropriate use of this assumption. We clarify conditions on the causal structure and on auxiliary variables (if available) that need to hold for MAR to hold, and we use MAR to recover effect identification under two dominant identification assumptions (exclusion restriction and principal ignorability). We briefly comment on cases where MAR does not hold. In terms of broader connections, most of the MAR findings are also relevant to classic instrumental variable analysis that targets the local average treatment effect; and the LMAR finding suggests general caution with assumptions that condition on principal stratum."}, "https://arxiv.org/abs/2407.13958": {"title": "Flexible max-stable processes for fast and efficient inference", "link": "https://arxiv.org/abs/2407.13958", "description": "arXiv:2407.13958v1 Announce Type: new \nAbstract: Max-stable processes serve as the fundamental distributional family in extreme value theory. However, because the full densities are intractable, likelihood-based inference methods for max-stable processes still heavily rely on composite likelihoods, which in turn become intractable in high dimensions. 
In this paper, we introduce a fast and efficient inference method, based on angular densities, for a class of max-stable processes whose angular densities do not put mass on the boundary space of the simplex, which can be used to construct r-Pareto processes. We demonstrate the efficiency of the proposed method through two new max-stable processes, the truncated extremal-t process and the skewed Brown-Resnick process. The proposed method is shown to be computationally efficient and can be applied to large datasets. Furthermore, the skewed Brown-Resnick process contains the popular Brown-Resnick model as a special case and possesses nonstationary extremal dependence structures. We showcase the new max-stable processes on simulated and real data."}, "https://arxiv.org/abs/2407.13971": {"title": "Dimension-reduced Reconstruction Map Learning for Parameter Estimation in Likelihood-Free Inference Problems", "link": "https://arxiv.org/abs/2407.13971", "description": "arXiv:2407.13971v1 Announce Type: new \nAbstract: Many application areas rely on models that can be readily simulated but lack a closed-form likelihood, or an accurate approximation under arbitrary parameter values. Existing parameter estimation approaches in this setting are generally approximate. Recent work on using neural network models to reconstruct the mapping from the data space to the parameters from a set of synthetic parameter-data pairs suffers from the curse of dimensionality, resulting in inaccurate estimation as the data size grows. We propose a dimension-reduced approach to likelihood-free estimation which combines the ideas of reconstruction map estimation with dimension-reduction approaches based on subject-specific knowledge. We examine the properties of reconstruction map estimation with and without dimension reduction and explore the trade-off between the approximation error due to information loss from reducing the data dimension and the estimation error arising from the curse of dimensionality. Numerical examples show that the proposed approach compares favorably with reconstruction map estimation, approximate Bayesian computation, and synthetic likelihood estimation."}, "https://arxiv.org/abs/2407.13980": {"title": "Byzantine-tolerant distributed learning of finite mixture models", "link": "https://arxiv.org/abs/2407.13980", "description": "arXiv:2407.13980v1 Announce Type: new \nAbstract: This paper proposes two split-and-conquer (SC) learning estimators for finite mixture models that are tolerant to Byzantine failures. In SC learning, individual machines obtain local estimates, which are then transmitted to a central server for aggregation. During this communication, the server may receive malicious or incorrect information from some local machines, a scenario known as Byzantine failures. While SC learning approaches have been devised to mitigate Byzantine failures in statistical models with Euclidean parameters, developing Byzantine-tolerant methods for finite mixture models with non-Euclidean parameters requires a distinct strategy. Our proposed distance-based methods are free of hyperparameter tuning, unlike existing methods, and are resilient to Byzantine failures while achieving high statistical efficiency. We validate the effectiveness of our methods both theoretically and empirically via experiments on simulated and real data from machine learning applications for digit recognition. 
The code for the experiment can be found at https://github.com/SarahQiong/RobustSCGMM."}, "https://arxiv.org/abs/2407.14002": {"title": "Derandomized Truncated D-vine Copula Knockoffs with e-values to control the false discovery rate", "link": "https://arxiv.org/abs/2407.14002", "description": "arXiv:2407.14002v1 Announce Type: new \nAbstract: Model-X knockoffs is a practical methodology for variable selection, which stands out from other selection strategies since it allows for the control of the false discovery rate (FDR), relying on finite-sample guarantees. In this article, we propose a Truncated D-vine Copula Knockoffs (TDCK) algorithm for sampling approximate knockoffs from complex multivariate distributions. Our algorithm enhances and improves features of previous attempts to sample knockoffs under the multivariate setting, with the three main contributions being: 1) the truncation of the D-vine copula, which reduces the dependence between the original variables and their corresponding knockoffs, improving the statistical power; 2) the employment of a straightforward non-parametric formulation for marginal transformations, eliminating the need for a specific parametric family or a kernel density estimator; 3) the use of the \"rvinecopulib\" R package, which offers better flexibility than existing vine copula knockoff fitting methods. To eliminate the randomness that yields different sets of selected variables across distinct realizations, we wrap the TDCK method with an existing derandomizing procedure for knockoffs, leading to a Derandomized Truncated D-vine Copula Knockoffs with e-values (DTDCKe) procedure. We demonstrate the robustness of the DTDCKe procedure under various scenarios with extensive simulation studies. We further illustrate its efficacy using a gene expression dataset, showing it achieves a more reliable gene selection than other competing methods when the findings are compared with those of a meta-analysis. The results indicate that our Truncated D-vine copula approach is robust and has superior power, representing an appealing approach for variable selection in different multivariate applications, particularly in gene expression analysis."}, "https://arxiv.org/abs/2407.14022": {"title": "Causal Inference with Complex Treatments: A Survey", "link": "https://arxiv.org/abs/2407.14022", "description": "arXiv:2407.14022v1 Announce Type: new \nAbstract: Causal inference plays an important role in explanatory analysis and decision making across various fields like statistics, marketing, health care, and education. Its main task is to estimate treatment effects and make intervention policies. Traditionally, most previous work focuses on the binary treatment setting, in which a unit either adopts a single treatment or does not. However, in practice, the treatment can be much more complex, encompassing multi-valued, continuous, or bundle options. In this paper, we refer to these as complex treatments and systematically and comprehensively review the causal inference methods for addressing them. First, we formally revisit the problem definition, the basic assumptions, and their possible variations under specific conditions. Second, we sequentially review the related methods for multi-valued, continuous, and bundled treatment settings. In each situation, we tentatively divide the methods into two categories: those conforming to the unconfoundedness assumption and those violating it. Subsequently, we discuss the available datasets and open-source codes. 
Finally, we provide a brief summary of these works and suggest potential directions for future research."}, "https://arxiv.org/abs/2407.14074": {"title": "Regression Adjustment for Estimating Distributional Treatment Effects in Randomized Controlled Trials", "link": "https://arxiv.org/abs/2407.14074", "description": "arXiv:2407.14074v1 Announce Type: new \nAbstract: In this paper, we address estimation and inference for distributional treatment effects in randomized experiments. The distributional treatment effect provides a more comprehensive understanding of treatment effects by characterizing heterogeneous effects across individual units, as opposed to relying solely on the average treatment effect. To enhance the precision of distributional treatment effect estimation, we propose a regression adjustment method that utilizes distributional regression and pre-treatment information. Our method is designed to be free from restrictive distributional assumptions. We establish theoretical efficiency gains and develop a practical, statistically sound inferential framework. Through extensive simulation studies and empirical applications, we illustrate the substantial advantages of our method, equipping researchers with a powerful tool for capturing the full spectrum of treatment effects in experimental research."}, "https://arxiv.org/abs/2407.14248": {"title": "Incertus", "link": "https://arxiv.org/abs/2407.14248", "description": "arXiv:2407.14248v1 Announce Type: new \nAbstract: In this paper, we present Insertus.jl, a Julia package that helps the user generate a randomization sequence of a given length for a multi-arm trial with a pre-specified target allocation ratio and assess the operating characteristics of the chosen randomization method through Monte Carlo simulations. The developed package is computationally efficient, and it can be invoked in R. Furthermore, the package is open-ended -- it can flexibly accommodate new randomization procedures and evaluate their statistical properties via simulation. It may also be helpful for validating other randomization methods for which software is not readily available. In summary, Insertus.jl can be used as ``Lego Blocks'' to construct a fit-for-purpose randomization procedure for a given clinical trial design."}, "https://arxiv.org/abs/2407.14311": {"title": "A Bayesian joint model of multiple longitudinal and categorical outcomes with application to multiple myeloma using permutation-based variable importance", "link": "https://arxiv.org/abs/2407.14311", "description": "arXiv:2407.14311v1 Announce Type: new \nAbstract: Joint models have proven to be an effective approach for uncovering potentially hidden connections between various types of outcomes, mainly continuous, time-to-event, and binary. Typically, longitudinal continuous outcomes are characterized by linear mixed-effects models, survival outcomes are described by proportional hazards models, and the link between outcomes is captured by shared random effects. Other modeling variations include generalized linear mixed-effects models for longitudinal data and logistic regression when a binary outcome is present, rather than time until an event of interest. However, in a clinical research setting, one might be interested in modeling the physician's chosen treatment based on the patient's medical history in order to identify prognostic factors. In this situation, there are often multiple treatment options, requiring the use of a multiclass classification approach. 
Inspired by this context, we develop a Bayesian joint model for longitudinal and categorical data. In particular, our motivation comes from a multiple myeloma study, in which biomarkers display nonlinear trajectories that are well captured through bi-exponential submodels, where patient-level information is shared with the categorical submodel. We also present a variable importance strategy for ranking prognostic factors. We apply our proposal and a competing model to the multiple myeloma data, compare the variable importance and inferential results for both models, and illustrate patient-level interpretations using our joint model."}, "https://arxiv.org/abs/2407.14349": {"title": "Measuring and testing tail equivalence", "link": "https://arxiv.org/abs/2407.14349", "description": "arXiv:2407.14349v1 Announce Type: new \nAbstract: We call two copulas tail equivalent if their first-order approximations in the tail coincide. As a special case, a copula is called tail symmetric if it is tail equivalent to the associated survival copula. We propose a novel measure and statistical test for tail equivalence. The proposed measure takes the value of zero if and only if the two copulas share a pair of tail order and tail order parameter in common. Moreover, taking the nature of these tail quantities into account, we design the proposed measure so that it takes a large value when tail orders are different, and a small value when tail order parameters are non-identical. We derive asymptotic properties of the proposed measure, and then propose a novel statistical test for tail equivalence. Performance of the proposed test is demonstrated in a series of simulation studies and empirical analyses of financial stock returns in the periods of the world financial crisis and the COVID-19 recession. Our empirical analysis reveals non-identical tail behaviors in different pairs of stocks, different parts of tails, and the two periods of recessions."}, "https://arxiv.org/abs/2407.14365": {"title": "Modified BART for Learning Heterogeneous Effects in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2407.14365", "description": "arXiv:2407.14365v1 Announce Type: new \nAbstract: This paper introduces BART-RDD, a sum-of-trees regression model built around a novel regression tree prior, which incorporates the special covariate structure of regression discontinuity designs. Specifically, the tree splitting process is constrained to ensure overlap within a narrow band surrounding the running variable cutoff value, where the treatment effect is identified. It is shown that unmodified BART-based models estimate RDD treatment effects poorly, while our modified model accurately recovers treatment effects at the cutoff. Specifically, BART-RDD is perhaps the first RDD method that effectively learns conditional average treatment effects. The new method is investigated in thorough simulation studies as well as an empirical application looking at the effect of academic probation on student performance in subsequent terms (Lindo et al., 2010)."}, "https://arxiv.org/abs/2407.14369": {"title": "tidychangepoint: a unified framework for analyzing changepoint detection in univariate time series", "link": "https://arxiv.org/abs/2407.14369", "description": "arXiv:2407.14369v1 Announce Type: new \nAbstract: We present tidychangepoint, a new R package for changepoint detection analysis. 
tidychangepoint leverages existing packages like changepoint, GA, tsibble, and broom to provide tidyverse-compliant tools for segmenting univariate time series using various changepoint detection algorithms. In addition, tidychangepoint also provides model-fitting procedures for commonly-used parametric models, tools for computing various penalized objective functions, and graphical diagnostic displays. tidychangepoint wraps both deterministic algorithms like PELT, and also flexible, randomized, genetic algorithms that can be used with any compliant model-fitting function and any penalized objective function. By bringing all of these disparate tools together in a cohesive fashion, tidychangepoint facilitates comparative analysis of changepoint detection algorithms and models."}, "https://arxiv.org/abs/2407.14003": {"title": "Time Series Generative Learning with Application to Brain Imaging Analysis", "link": "https://arxiv.org/abs/2407.14003", "description": "arXiv:2407.14003v1 Announce Type: cross \nAbstract: This paper focuses on the analysis of sequential image data, particularly brain imaging data such as MRI, fMRI, CT, with the motivation of understanding the brain aging process and neurodegenerative diseases. To achieve this goal, we investigate image generation in a time series context. Specifically, we formulate a min-max problem derived from the $f$-divergence between neighboring pairs to learn a time series generator in a nonparametric manner. The generator enables us to generate future images by transforming prior lag-k observations and a random vector from a reference distribution. With a deep neural network learned generator, we prove that the joint distribution of the generated sequence converges to the latent truth under a Markov and a conditional invariance condition. Furthermore, we extend our generation mechanism to a panel data scenario to accommodate multiple samples. The effectiveness of our mechanism is evaluated by generating real brain MRI sequences from the Alzheimer's Disease Neuroimaging Initiative. These generated image sequences can be used as data augmentation to enhance the performance of further downstream tasks, such as Alzheimer's disease detection."}, "https://arxiv.org/abs/2205.07689": {"title": "From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data", "link": "https://arxiv.org/abs/2205.07689", "description": "arXiv:2205.07689v3 Announce Type: replace \nAbstract: How can we tell complex point clouds with different small scale characteristics apart, while disregarding global features? Can we find a suitable transformation of such data in a way that allows to discriminate between differences in this sense with statistical guarantees? In this paper, we consider the analysis and classification of complex point clouds as they are obtained, e.g., via single molecule localization microscopy. We focus on the task of identifying differences between noisy point clouds based on small scale characteristics, while disregarding large scale information such as overall size. We propose an approach based on a transformation of the data via the so-called Distance-to-Measure (DTM) function, a transformation which is based on the average of nearest neighbor distances. For each data set, we estimate the probability density of average local distances of all data points and use the estimated densities for classification. 
While the applicability is immediate and the practical performance of the proposed methodology is very good, the theoretical study of the density estimators is quite challenging, as they are based on i.i.d. observations that have been obtained via a complicated transformation. In fact, the transformed data are stochastically dependent in a non-local way that is not captured by commonly considered dependence measures. Nonetheless, we show that the asymptotic behaviour of the density estimator is driven by a kernel density estimator of certain i.i.d. random variables by using theoretical properties of U-statistics, which allows to handle the dependencies via a Hoeffding decomposition. We show via a numerical study and in an application to simulated single molecule localization microscopy data of chromatin fibers that unsupervised classification tasks based on estimated DTM-densities achieve excellent separation results."}, "https://arxiv.org/abs/2401.15014": {"title": "A Robust Bayesian Method for Building Polygenic Risk Scores using Projected Summary Statistics and Bridge Prior", "link": "https://arxiv.org/abs/2401.15014", "description": "arXiv:2401.15014v2 Announce Type: replace \nAbstract: Polygenic risk scores (PRS) developed from genome-wide association studies (GWAS) are of increasing interest for clinical and research applications. Bayesian methods have been popular for building PRS because of their natural ability to regularize models and incorporate external information. In this article, we present new theoretical results, methods, and extensive numerical studies to advance Bayesian methods for PRS applications. We identify a potential risk, under a common Bayesian PRS framework, of posterior impropriety when integrating the required GWAS summary-statistics and linkage disequilibrium (LD) data from two distinct sources. As a principled remedy to this problem, we propose a projection of the summary statistics data that ensures compatibility between the two sources and in turn a proper behavior of the posterior. We further introduce a new PRS method, with accompanying software package, under the less-explored Bayesian bridge prior to more flexibly model varying sparsity levels in effect size distributions. We extensively benchmark it against alternative Bayesian methods using both synthetic and real datasets, quantifying the impact of both prior specification and LD estimation strategy. Our proposed PRS-Bridge, equipped with the projection technique and flexible prior, demonstrates the most consistent and generally superior performance across a variety of scenarios."}, "https://arxiv.org/abs/2407.14630": {"title": "Identification of changes in gene expression", "link": "https://arxiv.org/abs/2407.14630", "description": "arXiv:2407.14630v1 Announce Type: new \nAbstract: Evaluating the change in gene expression is a common goal in many research areas, such as in toxicological studies as well as in clinical trials. In practice, the analysis is often based on multiple t-tests evaluated at the observed time points. This severely limits the accuracy of determining the time points at which the gene changes in expression. Even if a parametric approach is chosen, the analysis is often restricted to identifying the onset of an effect. In this paper, we propose a parametric method to identify the time frame where the gene expression significantly changes. This is achieved by fitting a parametric model to the time-response data and constructing a confidence band for its first derivative. 
The confidence band is derived via a flexible two-step bootstrap approach, which can be applied to a wide variety of possible curves. Our method focuses on the first derivative, since it provides an easy-to-compute and reliable measure for the change in response. It is summarised in terms of a hypothesis test, such that rejecting the null hypothesis means detecting a significant change in gene expression. Furthermore, a method for calculating confidence intervals for time points of interest (e.g. the beginning and end of significant change) is developed. We demonstrate the validity of our approach through a simulation study and present a variety of applications to mouse gene expression data from a study investigating the effect of a Western diet on the progression of non-alcoholic fatty liver disease."}, "https://arxiv.org/abs/2407.14635": {"title": "Predicting the Distribution of Treatment Effects: A Covariate-Adjustment Approach", "link": "https://arxiv.org/abs/2407.14635", "description": "arXiv:2407.14635v1 Announce Type: new \nAbstract: Important questions for impact evaluation require knowledge not only of average effects, but of the distribution of treatment effects. What proportion of people are harmed? Does a policy help many by a little? Or a few by a lot? The inability to observe individual counterfactuals makes these empirical questions challenging. I propose an approach to inference on points of the distribution of treatment effects by incorporating predicted counterfactuals through covariate adjustment. I show that finite-sample inference is valid under weak assumptions, for example when data come from a Randomized Controlled Trial (RCT), and that large-sample inference is asymptotically exact under suitable conditions. Finally, I revisit five RCTs in microcredit where average effects are not statistically significant and find evidence of both positive and negative treatment effects in household income. On average across studies, at least 13.6% of households benefited and 12.5% were negatively affected."}, "https://arxiv.org/abs/2407.14666": {"title": "A Bayesian workflow for securitizing casualty insurance risk", "link": "https://arxiv.org/abs/2407.14666", "description": "arXiv:2407.14666v1 Announce Type: new \nAbstract: Casualty insurance-linked securities (ILS) are appealing to investors because the underlying insurance claims, which are directly related to resulting security performance, are uncorrelated with most other asset classes. Conversely, casualty ILS are appealing to insurers as an efficient capital management tool. However, securitizing casualty insurance risk is non-trivial, as it requires forecasting loss ratios for pools of insurance policies that have not yet been written, in addition to estimating how the underlying losses will develop over time within future accident years. In this paper, we lay out a Bayesian workflow that tackles these complexities by using: (1) theoretically informed time-series and state-space models to capture how loss ratios develop and change over time; (2) historic industry data to inform prior distributions of models fit to individual programs; (3) stacking to combine loss ratio predictions from candidate models; and (4) both prior predictive simulations and simulation-based calibration to aid model specification. 
Using historic Schedule P filings, we then show how our proposed Bayesian workflow can be used to assess and compare models across a variety of key model performance metrics evaluated on future accident year losses."}, "https://arxiv.org/abs/2407.14703": {"title": "Generalizing and transporting causal inferences from randomized trials in the presence of trial engagement effects", "link": "https://arxiv.org/abs/2407.14703", "description": "arXiv:2407.14703v1 Announce Type: new \nAbstract: Trial engagement effects are effects of trial participation on the outcome that are not mediated by treatment assignment. Most work on extending (generalizing or transporting) causal inferences from a randomized trial to a target population has, explicitly or implicitly, assumed that trial engagement effects are absent, allowing evidence about the effects of the treatments examined in trials to be applied to non-experimental settings. Here, we define novel causal estimands and present identification results for generalizability and transportability analyses in the presence of trial engagement effects. Our approach allows for trial engagement effects under assumptions of no causal interaction between trial participation and treatment assignment on the absolute or relative scales. We show that under these assumptions, even in the presence of trial engagement effects, the trial data can be combined with covariate data from the target population to identify average treatment effects in the context of usual care as implemented in the target population (i.e., outside the experimental setting). The identifying observed data functionals under these no-interaction assumptions are the same as those obtained under the stronger identifiability conditions that have been invoked in prior work. Therefore, our results suggest a new interpretation for previously proposed generalizability and transportability estimators; this interpretation may be useful in analyses under causal structures where background knowledge suggests that trial engagement effects are present but interactions between trial participation and treatment are negligible."}, "https://arxiv.org/abs/2407.14748": {"title": "Regression models for binary data with scale mixtures of centered skew-normal link functions", "link": "https://arxiv.org/abs/2407.14748", "description": "arXiv:2407.14748v1 Announce Type: new \nAbstract: For binary regression, the use of symmetrical link functions is not appropriate when we have evidence that the probability of success increases at a different rate than it decreases. In these cases, the use of link functions based on the cumulative distribution function of a skewed and heavy-tailed distribution can be useful. The most popular choices are scale mixtures of the skew-normal distribution. This family of distributions can have some identifiability problems, caused by the so-called direct parameterization. Also, in binary modeling with skewed link functions, we can have another identifiability problem caused by the presence of the intercept and the skewness parameter. To circumvent these issues, in this work we propose link functions based on the scale mixtures of skew-normal distributions under the centered parameterization. Furthermore, we propose to fix the sign of the skewness parameter, which is a new perspective in the literature for dealing with the identifiability problem in skewed link functions. Bayesian inference using MCMC algorithms and residual analysis are developed. 
Simulation studies are performed to evaluate the performance of the model. Also, the methodology is applied to a heart disease dataset."}, "https://arxiv.org/abs/2407.14914": {"title": "Leveraging Uniformization and Sparsity for Computation of Continuous Time Dynamic Discrete Choice Games", "link": "https://arxiv.org/abs/2407.14914", "description": "arXiv:2407.14914v1 Announce Type: new \nAbstract: Continuous-time formulations of dynamic discrete choice games offer notable computational advantages, particularly in modeling strategic interactions in oligopolistic markets. This paper extends these benefits by addressing computational challenges in order to improve model solution and estimation. We first establish new results on the rates of convergence of the value iteration, policy evaluation, and relative value iteration operators in the model, holding fixed player beliefs. Next, we introduce a new representation of the value function in the model based on uniformization -- a technique used in the analysis of continuous time Markov chains -- which allows us to draw a direct analogy to discrete time models. Furthermore, we show that uniformization also leads to a stable method to compute the matrix exponential, an operator appearing in the model's log likelihood function when only discrete time \"snapshot\" data are available. We also develop a new algorithm that concurrently computes the matrix exponential and its derivatives with respect to model parameters, enhancing computational efficiency. By leveraging the inherent sparsity of the model's intensity matrix, combined with sparse matrix techniques and precomputed addresses, we show how to significantly speed up computations. These strategies allow researchers to estimate more sophisticated and realistic models of strategic interactions and policy impacts in empirical industrial organization."}, "https://arxiv.org/abs/2407.15084": {"title": "High-dimensional log contrast models with measurement errors", "link": "https://arxiv.org/abs/2407.15084", "description": "arXiv:2407.15084v1 Announce Type: new \nAbstract: High-dimensional compositional data are frequently encountered in many fields of modern scientific research. In regression analysis of compositional data, the presence of covariate measurement errors poses grand challenges for existing statistical error-in-variable regression analysis methods since measurement error in one component of the composition has an impact on others. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose a new method named Error-in-composition (Eric) Lasso for regression analysis of corrupted compositional predictors. Estimation error bounds of Eric Lasso and its asymptotic sign-consistent selection properties are established. We then illustrate the finite sample performance of Eric Lasso using simulation studies and demonstrate its potential usefulness in a real data application example."}, "https://arxiv.org/abs/2407.15276": {"title": "Nonlinear Binscatter Methods", "link": "https://arxiv.org/abs/2407.15276", "description": "arXiv:2407.15276v1 Announce Type: new \nAbstract: Binned scatter plots are a powerful statistical tool for empirical work in the social, behavioral, and biomedical sciences. 
Available methods rely on a quantile-based partitioning estimator of the conditional mean regression function to primarily construct flexible yet interpretable visualization methods, but they can also be used to estimate treatment effects, assess uncertainty, and test substantive domain-specific hypotheses. This paper introduces novel binscatter methods based on nonlinear, possibly nonsmooth M-estimation methods, covering generalized linear, robust, and quantile regression models. We provide a host of theoretical results and practical tools for local constant estimation along with piecewise polynomial and spline approximations, including (i) optimal tuning parameter (number of bins) selection, (ii) confidence bands, and (iii) formal statistical tests regarding functional form or shape restrictions. Our main results rely on novel strong approximations for general partitioning-based estimators covering random, data-driven partitions, which may be of independent interest. We demonstrate our methods with an empirical application studying the relation between the percentage of individuals without health insurance and per capita income at the zip-code level. We provide general-purpose software packages implementing our methods in Python, R, and Stata."}, "https://arxiv.org/abs/2407.15340": {"title": "Random Survival Forest for Censored Functional Data", "link": "https://arxiv.org/abs/2407.15340", "description": "arXiv:2407.15340v1 Announce Type: new \nAbstract: This paper introduces a Random Survival Forest (RSF) method for functional data. The focus is specifically on defining a new functional data structure, the Censored Functional Data (CFD), for dealing with temporal observations that are censored due to study limitations or incomplete data collection. This approach allows for precise modelling of functional survival trajectories, leading to improved interpretation and prediction of survival dynamics across different groups. A medical survival study on the benchmark SOFA data set is presented. Results show good performance of the proposed approach, particularly in ranking the importance of predicting variables, as captured through dynamic changes in SOFA scores and patient mortality rates."}, "https://arxiv.org/abs/2407.15377": {"title": "Replicable Bandits for Digital Health Interventions", "link": "https://arxiv.org/abs/2407.15377", "description": "arXiv:2407.15377v1 Announce Type: new \nAbstract: Adaptive treatment assignment algorithms, such as bandit and reinforcement learning algorithms, are increasingly used in digital health intervention clinical trials. Causal inference and related data analyses are critical for evaluating digital health interventions, deciding how to refine the intervention, and deciding whether to roll-out the intervention more broadly. However the replicability of these analyses has received relatively little attention. This work investigates the replicability of statistical analyses from trials deploying adaptive treatment assignment algorithms. We demonstrate that many standard statistical estimators can be inconsistent and fail to be replicable across repetitions of the clinical trial, even as the sample size grows large. We show that this non-replicability is intimately related to properties of the adaptive algorithm itself. We introduce a formal definition of a \"replicable bandit algorithm\" and prove that under such algorithms, a wide variety of common statistical analyses are guaranteed to be consistent. 
We present both theoretical results and simulation studies based on a mobile health oral health self-care intervention. Our findings underscore the importance of designing adaptive algorithms with replicability in mind, especially for settings like digital health where deployment decisions rely heavily on replicated evidence. We conclude by discussing open questions on the connections between algorithm design, statistical inference, and experimental replicability."}, "https://arxiv.org/abs/2407.15461": {"title": "Forecasting mortality rates with functional signatures", "link": "https://arxiv.org/abs/2407.15461", "description": "arXiv:2407.15461v1 Announce Type: new \nAbstract: This study introduces an innovative methodology for mortality forecasting, which integrates signature-based methods within the functional data framework of the Hyndman-Ullah (HU) model. This new approach, termed the Hyndman-Ullah with truncated signatures (HUts) model, aims to enhance the accuracy and robustness of mortality predictions. By utilizing signature regression, the HUts model aims to capture complex, nonlinear dependencies in mortality data, which enhances forecasting accuracy across various demographic conditions. The model is applied to mortality data from 12 countries, comparing its forecasting performance against classical models like the Lee-Carter model and variants of the HU models across multiple forecast horizons. Our findings indicate that overall the HUts model not only provides more precise point forecasts but also shows robustness against data irregularities, such as those observed in countries with historical outliers. The integration of signature-based methods enables the HUts model to capture complex patterns in mortality data, making it a powerful tool for actuaries and demographers. Prediction intervals are also constructed using bootstrapping methods."}, "https://arxiv.org/abs/2407.15522": {"title": "Big Data Analytics-Enabled Dynamic Capabilities and Market Performance: Examining the Roles of Marketing Ambidexterity and Competitor Pressure", "link": "https://arxiv.org/abs/2407.15522", "description": "arXiv:2407.15522v1 Announce Type: new \nAbstract: This study, rooted in dynamic capability theory and the developing era of Big Data Analytics, explores the transformative effect of BDA EDCs on marketing ambidexterity and firms' market performance in the textile sector of Pakistan's cities. Specifically, focusing on firms that deal directly with customers, it investigates the nuanced role of BDA EDCs in textile retail firms' potential to navigate market dynamics. Emphasizing the exploitation component of marketing ambidexterity, the study investigates the mediating function of marketing ambidexterity and the moderating influence of competitive pressure. Using a survey questionnaire, the study targets key decision makers in textile firms of Faisalabad, Chiniot and Lahore, Pakistan. The PLS-SEM model was employed as an analytical technique, allowing for a full examination of the complicated relations between BDA EDCs, marketing ambidexterity, competitive pressure, and market performance. The study predicts a positive impact of Big Data on marketing ambidexterity, with a specific emphasis on exploitation. The study expects this exploitation-oriented marketing ambidexterity to significantly enhance firms' market performance. This research contributes to the existing literature on dynamic capabilities-based frameworks from the perspective of the retail segment of the textile industry. 
The study emphasizes the role of BDA-EDCs in the retail sector, imparting insights into the direct and indirect effects of BDA EDCs on market performance inside the retail area. The study's novelty lies in its contextualization of BDA-EDCs in the textile zone of Faisalabad, Lahore and Chiniot, providing a unique perspective on the effect of BDA on marketing ambidexterity and market performance in firms. Methodologically, the study uses numerous samples from the retail sector to ensure broader generalizability, contributing practical insights."}, "https://arxiv.org/abs/2407.15666": {"title": "Particle Based Inference for Continuous-Discrete State Space Models", "link": "https://arxiv.org/abs/2407.15666", "description": "arXiv:2407.15666v1 Announce Type: new \nAbstract: This article develops a methodology allowing application of the complete machinery of particle-based inference methods upon what we call the class of continuous-discrete State Space Models (CD-SSMs). Such models correspond to a latent continuous-time It\\^o diffusion process which is observed with noise at discrete time instances. Due to the continuous-time nature of the hidden signal, standard Feynman-Kac formulations and their accompanying particle-based approximations have to overcome several challenges, arising mainly due to the following considerations: (i) finite-time transition densities of the signal are typically intractable; (ii) ancestors of sampled signals are determined w.p.~1, thus cannot be resampled; (iii) diffusivity parameters given a sampled signal yield Dirac distributions. We overcome all of the above issues by introducing a framework based on carefully designed proposals and transformations thereof. That is, we obtain new expressions for the Feynman-Kac model that accommodate the effects of a continuous-time signal and overcome induced degeneracies. The constructed formulations will enable use of the full range of particle-based algorithms for CD-SSMs: for filtering/smoothing and parameter inference, whether online or offline. Our framework is compatible with guided proposals in the filtering steps that are essential for efficient algorithmic performance in the presence of informative observations or in higher dimensions, and is applicable to a very general class of CD-SSMs, including the case when the signal is modelled as a hypo-elliptic diffusion. Our methods can be immediately incorporated into available software packages for particle-based algorithms."}, "https://arxiv.org/abs/2407.15674": {"title": "LASSO Estimation in Exponential Random Graph models", "link": "https://arxiv.org/abs/2407.15674", "description": "arXiv:2407.15674v1 Announce Type: new \nAbstract: The paper demonstrates the use of LASSO-based estimation in network models. Taking the Exponential Random Graph Model (ERGM) as a flexible and widely used model for network data analysis, the paper focuses on the question of how to specify the (sufficient) statistics that define the model structure. This includes both endogenous network statistics (e.g. two-stars, triangles, etc.) and statistics involving exogenous covariates, on the node as well as on the edge level. LASSO estimation is a penalized estimation that shrinks some of the parameter estimates to be equal to zero. As such it allows for model selection by modifying the amount of penalty. 
The concept is well established in standard regression and we demonstrate its usage in network data analysis, with the advantage of automatically providing a model selection framework."}, "https://arxiv.org/abs/2407.15733": {"title": "Online closed testing with e-values", "link": "https://arxiv.org/abs/2407.15733", "description": "arXiv:2407.15733v1 Announce Type: new \nAbstract: In contemporary research, data scientists often test an infinite sequence of hypotheses $H_1,H_2,\\ldots $ one by one, and are required to make real-time decisions without knowing the future hypotheses or data. In this paper, we consider such an online multiple testing problem with the goal of providing simultaneous lower bounds for the number of true discoveries in data-adaptively chosen rejection sets. Using the (online) closure principle, we show that for this task it is necessary to use an anytime-valid test for each intersection hypothesis. Motivated by this result, we construct a new online closed testing procedure and a corresponding short-cut with a true discovery guarantee based on multiplying sequential e-values. This general but simple procedure gives uniform improvements over existing methods but also allows to construct entirely new and powerful procedures. In addition, we introduce new ideas for hedging and boosting of sequential e-values that provably increase power. Finally, we also propose the first online true discovery procedure for arbitrarily dependent e-values."}, "https://arxiv.org/abs/2407.14537": {"title": "Small but not least changes: The Art of Creating Disruptive Innovations", "link": "https://arxiv.org/abs/2407.14537", "description": "arXiv:2407.14537v1 Announce Type: cross \nAbstract: In the ever-evolving landscape of technology, product innovation thrives on replacing outdated technologies with groundbreaking ones or through the ingenious recombination of existing technologies. Our study embarks on a revolutionary journey by genetically representing products, extracting their chromosomal data, and constructing a comprehensive phylogenetic network of automobiles. We delve deep into the technological features that shape innovation, pinpointing the ancestral roots of products and mapping out intricate product-family triangles. By leveraging the similarities within these triangles, we introduce a pioneering \"Product Disruption Index\"-inspired by the CD index (Funk and Owen-Smith, 2017)-to quantify a product's disruptiveness. Our approach is rigorously validated against the scientifically recognized trend of decreasing disruptiveness over time (Park et al., 2023) and through compelling case studies. Our statistical analysis reveals a fascinating insight: disruptive product innovations often stem from minor, yet crucial, modifications."}, "https://arxiv.org/abs/2407.14861": {"title": "Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes", "link": "https://arxiv.org/abs/2407.14861", "description": "arXiv:2407.14861v1 Announce Type: cross \nAbstract: With the growing access to administrative health databases, retrospective studies have become crucial evidence for medical treatments. Yet, non-randomized studies frequently face selection biases, requiring mitigation strategies. Propensity score matching (PSM) addresses these biases by selecting comparable populations, allowing for analysis without further methodological constraints. However, PSM has several drawbacks. 
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria. To prevent cherry-picking the best method, public authorities must involve field experts and engage in extensive discussions with researchers.\n To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches. A2A constructs artificial matching tasks that mirror the original ones but with known outcomes, assessing each matching method's performance comprehensively from propensity estimation to ATE estimation. When combined with Standardized Mean Difference, A2A enhances the precision of model selection, resulting in a reduction of up to 50% in ATE estimation errors across synthetic tasks and up to 90% in predicted ATE variability across both synthetic and real-world datasets. To our knowledge, A2A is the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection.\n Computing A2A requires solving hundreds of PSMs, we therefore automate all manual steps of the PSM pipeline. We integrate PSM methods from Python and R, our automated pipeline, a new metric, and reproducible experiments into popmatch, our new Python package, to enhance reproducibility and accessibility to bias correction methods."}, "https://arxiv.org/abs/2407.15028": {"title": "Statistical Models for Outbreak Detection of Measles in North Cotabato, Philippines", "link": "https://arxiv.org/abs/2407.15028", "description": "arXiv:2407.15028v1 Announce Type: cross \nAbstract: A measles outbreak occurs when the number of cases of measles in the population exceeds the typical level. Outbreaks that are not detected and managed early can increase mortality and morbidity and incur costs from activities responding to these events. The number of measles cases in the Province of North Cotabato, Philippines, was used in this study. Weekly reported cases of measles from January 2016 to December 2021 were provided by the Epidemiology and Surveillance Unit of the North Cotabato Provincial Health Office. Several integer-valued autoregressive (INAR) time series models were used to explore the possibility of detecting and identifying measles outbreaks in the province along with the classical ARIMA model. These models were evaluated based on goodness of fit, measles outbreak detection accuracy, and timeliness. The results of this study confirmed that INAR models have the conceptual advantage over ARIMA since the latter produces non-integer forecasts, which are not realistic for count data such as measles cases. Among the INAR models, the ZINGINAR (1) model was recommended for having a good model fit and timely and accurate detection of outbreaks. Furthermore, policymakers and decision-makers from relevant government agencies can use the ZINGINAR (1) model to improve disease surveillance and implement preventive measures against contagious diseases beforehand."}, "https://arxiv.org/abs/2407.15256": {"title": "Weak-instrument-robust subvector inference in instrumental variables regression: A subvector Lagrange multiplier test and properties of subvector Anderson-Rubin confidence sets", "link": "https://arxiv.org/abs/2407.15256", "description": "arXiv:2407.15256v1 Announce Type: cross \nAbstract: We propose a weak-instrument-robust subvector Lagrange multiplier test for instrumental variables regression. We show that it is asymptotically size-correct under a technical condition. 
This is the first weak-instrument-robust subvector test for instrumental variables regression to recover the degrees of freedom of the commonly used Wald test, which is not robust to weak instruments. Additionally, we provide a closed-form solution for subvector confidence sets obtained by inverting the subvector Anderson-Rubin test. We show that they are centered around a k-class estimator. Also, we show that the subvector confidence sets for single coefficients of the causal parameter are jointly bounded if and only if Anderson's likelihood-ratio test rejects the hypothesis that the first-stage regression parameter is of reduced rank, that is, that the causal parameter is not identified. Finally, we show that if a confidence set obtained by inverting the Anderson-Rubin test is bounded and nonempty, it is equal to a Wald-based confidence set with a data-dependent confidence level. We explicitly compute this Wald-based confidence test."}, "https://arxiv.org/abs/2407.15564": {"title": "Non-parametric estimation of conditional quantiles for time series with heavy tails", "link": "https://arxiv.org/abs/2407.15564", "description": "arXiv:2407.15564v1 Announce Type: cross \nAbstract: We propose a modified weighted Nadaraya-Watson estimator for the conditional distribution of a time series with heavy tails. We establish the asymptotic normality of the proposed estimator. Simulation study is carried out to assess the performance of the estimator. We illustrate our method using a dataset."}, "https://arxiv.org/abs/2407.15636": {"title": "On-the-fly spectral unmixing based on Kalman filtering", "link": "https://arxiv.org/abs/2407.15636", "description": "arXiv:2407.15636v1 Announce Type: cross \nAbstract: This work introduces an on-the-fly (i.e., online) linear unmixing method which is able to sequentially analyze spectral data acquired on a spectrum-by-spectrum basis. After deriving a sequential counterpart of the conventional linear mixing model, the proposed approach recasts the linear unmixing problem into a linear state-space estimation framework. Under Gaussian noise and state models, the estimation of the pure spectra can be efficiently conducted by resorting to Kalman filtering. Interestingly, it is shown that this Kalman filter can operate in a lower-dimensional subspace while ensuring the nonnegativity constraint inherent to pure spectra. This dimensionality reduction allows significantly lightening the computational burden, while leveraging recent advances related to the representation of essential spectral information. The proposed method is evaluated through extensive numerical experiments conducted on synthetic and real Raman data sets. The results show that this Kalman filter-based method offers a convenient trade-off between unmixing accuracy and computational efficiency, which is crucial for operating in an on-the-fly setting. To the best of the authors' knowledge, this is the first operational method which is able to solve the spectral unmixing problem efficiently in a dynamic fashion. 
It also constitutes a valuable building block for benefiting from acquisition and processing frameworks recently proposed in the microscopy literature, which are motivated by practical issues such as reducing acquisition time and avoiding potential damage being inflicted on photosensitive samples."}, "https://arxiv.org/abs/2407.15764": {"title": "Huber means on Riemannian manifolds", "link": "https://arxiv.org/abs/2407.15764", "description": "arXiv:2407.15764v1 Announce Type: cross \nAbstract: This article introduces Huber means on Riemannian manifolds, providing a robust alternative to the Frechet mean by integrating elements of both square and absolute loss functions. The Huber means are designed to be highly resistant to outliers while maintaining efficiency, making them a valuable generalization of Huber's M-estimator for manifold-valued data. We comprehensively investigate the statistical and computational aspects of Huber means, demonstrating their utility in manifold-valued data analysis. Specifically, we establish minimal conditions for ensuring the existence and uniqueness of the Huber mean and discuss regularity conditions for unbiasedness. The Huber means are statistically consistent and enjoy the central limit theorem. Additionally, we propose a moment-based estimator for the limiting covariance matrix, which is used to construct a robust one-sample location test procedure and an approximate confidence region for location parameters. Huber means are shown to be highly robust and efficient in the presence of outliers or under heavy-tailed distributions. To be more specific, they achieve a breakdown point of at least 0.5, the highest among all isometric equivariant estimators, and are more efficient than the Frechet mean under heavy-tailed distributions. Numerical examples on spheres and the set of symmetric positive-definite matrices further illustrate the efficiency and reliability of the proposed Huber means on Riemannian manifolds."}, "https://arxiv.org/abs/2007.04229": {"title": "Estimating Monte Carlo variance from multiple Markov chains", "link": "https://arxiv.org/abs/2007.04229", "description": "arXiv:2007.04229v4 Announce Type: replace \nAbstract: Modern computational advances have enabled easy parallel implementations of Markov chain Monte Carlo (MCMC). However, almost all work in estimating the variance of Monte Carlo averages, including the efficient batch means (BM) estimator, focuses on a single-chain MCMC run. We demonstrate that simply averaging covariance matrix estimators from multiple chains can yield critical underestimates in small Monte Carlo sample sizes, especially for slow-mixing Markov chains. We extend the work of \\cite{arg:and:2006} and propose a multivariate replicated batch means (RBM) estimator that utilizes information from parallel chains, thereby correcting for the underestimation. Under weak conditions on the mixing rate of the process, RBM is strongly consistent and exhibits similar large-sample bias and variance to the BM estimator. We also exhibit superior theoretical properties of RBM by showing that the (negative) bias in the RBM estimator is less than that of the average BM estimator in the presence of positive correlation in MCMC. 
Consequently, in small runs, the RBM estimator can be dramatically superior and this is demonstrated through a variety of examples."}, "https://arxiv.org/abs/2007.12807": {"title": "Cross-validation Approaches for Multi-study Predictions", "link": "https://arxiv.org/abs/2007.12807", "description": "arXiv:2007.12807v4 Announce Type: replace \nAbstract: We consider prediction in multiple studies with potential differences in the relationships between predictors and outcomes. Our objective is to integrate data from multiple studies to develop prediction models for unseen studies. We propose and investigate two cross-validation approaches applicable to multi-study stacking, an ensemble method that linearly combines study-specific ensemble members to produce generalizable predictions. Among our cross-validation approaches are some that avoid reuse of the same data in both the training and stacking steps, as done in earlier multi-study stacking. We prove that under mild regularity conditions the proposed cross-validation approaches produce stacked prediction functions with oracle properties. We also identify analytically in which scenarios the proposed cross-validation approaches increase prediction accuracy compared to stacking with data reuse. We perform a simulation study to illustrate these results. Finally, we apply our method to predicting mortality from long-term exposure to air pollutants, using collections of datasets."}, "https://arxiv.org/abs/2104.04628": {"title": "Modeling Time-Varying Random Objects and Dynamic Networks", "link": "https://arxiv.org/abs/2104.04628", "description": "arXiv:2104.04628v2 Announce Type: replace \nAbstract: Samples of dynamic or time-varying networks and other random object data such as time-varying probability distributions are increasingly encountered in modern data analysis. Common methods for time-varying data such as functional data analysis are infeasible when observations are time courses of networks or other complex non-Euclidean random objects that are elements of general metric spaces. In such spaces, only pairwise distances between the data objects are available and a strong limitation is that one cannot carry out arithmetic operations due to the lack of an algebraic structure. We combat this complexity by a generalized notion of mean trajectory taking values in the object space. For this, we adopt pointwise Fr\\'echet means and then construct pointwise distance trajectories between the individual time courses and the estimated Fr\\'echet mean trajectory, thus representing the time-varying objects and networks by functional data. Functional principal component analysis of these distance trajectories can reveal interesting features of dynamic networks and object time courses and is useful for downstream analysis. Our approach also makes it possible to study the empirical dynamics of time-varying objects, including dynamic regression to the mean or explosive behavior over time. We demonstrate desirable asymptotic properties of sample based estimators for suitable population targets under mild assumptions. 
The utility of the proposed methodology is illustrated with dynamic networks, time-varying distribution data and longitudinal growth data."}, "https://arxiv.org/abs/2112.04626": {"title": "Bayesian Semiparametric Longitudinal Inverse-Probit Mixed Models for Category Learning", "link": "https://arxiv.org/abs/2112.04626", "description": "arXiv:2112.04626v4 Announce Type: replace \nAbstract: Understanding how the adult human brain learns novel categories is an important problem in neuroscience. Drift-diffusion models are popular in such contexts for their ability to mimic the underlying neural mechanisms. One such model for gradual longitudinal learning was recently developed by Paulon et al. (2021). Fitting conventional drift-diffusion models, however, requires data on both category responses and associated response times. In practice, category response accuracies are often the only reliable measure recorded by behavioral scientists to describe human learning. However, To our knowledge, drift-diffusion models for such scenarios have never been considered in the literature. To address this gap, in this article, we build carefully on Paulon et al. (2021), but now with latent response times integrated out, to derive a novel biologically interpretable class of `inverse-probit' categorical probability models for observed categories alone. However, this new marginal model presents significant identifiability and inferential challenges not encountered originally for the joint model by Paulon et al. (2021). We address these new challenges using a novel projection-based approach with a symmetry-preserving identifiability constraint that allows us to work with conjugate priors in an unconstrained space. We adapt the model for group and individual-level inference in longitudinal settings. Building again on the model's latent variable representation, we design an efficient Markov chain Monte Carlo algorithm for posterior computation. We evaluate the empirical performance of the method through simulation experiments. The practical efficacy of the method is illustrated in applications to longitudinal tone learning studies."}, "https://arxiv.org/abs/2205.04345": {"title": "A unified diagnostic test for regression discontinuity designs", "link": "https://arxiv.org/abs/2205.04345", "description": "arXiv:2205.04345v4 Announce Type: replace \nAbstract: Diagnostic tests for regression discontinuity design face a size-control problem. We document a massive over-rejection of the identifying restriction among empirical studies in the top five economics journals. At least one diagnostic test was rejected for 21 out of 60 studies, whereas less than 5% of the collected 799 tests rejected the null hypotheses. In other words, more than one-third of the studies rejected at least one of their diagnostic tests, whereas their underlying identifying restrictions appear valid. Multiple testing causes this problem because the median number of tests per study was as high as 12. Therefore, we offer unified tests to overcome the size-control problem. Our procedure is based on the new joint asymptotic normality of local polynomial mean and density estimates. In simulation studies, our unified tests outperformed the Bonferroni correction. 
We implement the procedure as an R package rdtest with two empirical examples in its vignettes."}, "https://arxiv.org/abs/2208.14015": {"title": "The SPDE approach for spatio-temporal datasets with advection and diffusion", "link": "https://arxiv.org/abs/2208.14015", "description": "arXiv:2208.14015v4 Announce Type: replace \nAbstract: In the task of predicting spatio-temporal fields in environmental science using statistical methods, introducing statistical models inspired by the physics of the underlying phenomena that are numerically efficient is of growing interest. Large space-time datasets call for new numerical methods to efficiently process them. The Stochastic Partial Differential Equation (SPDE) approach has proven to be effective for the estimation and the prediction in a spatial context. We present here the advection-diffusion SPDE with first order derivative in time which defines a large class of nonseparable spatio-temporal models. A Gaussian Markov random field approximation of the solution to the SPDE is built by discretizing the temporal derivative with a finite difference method (implicit Euler) and by solving the spatial SPDE with a finite element method (continuous Galerkin) at each time step. The ''Streamline Diffusion'' stabilization technique is introduced when the advection term dominates the diffusion. Computationally efficient methods are proposed to estimate the parameters of the SPDE and to predict the spatio-temporal field by kriging, as well as to perform conditional simulations. The approach is applied to a solar radiation dataset. Its advantages and limitations are discussed."}, "https://arxiv.org/abs/2209.01396": {"title": "Small Study Regression Discontinuity Designs: Density Inclusive Study Size Metric and Performance", "link": "https://arxiv.org/abs/2209.01396", "description": "arXiv:2209.01396v3 Announce Type: replace \nAbstract: Regression discontinuity (RD) designs are popular quasi-experimental studies in which treatment assignment depends on whether the value of a running variable exceeds a cutoff. RD designs are increasingly popular in educational applications due to the prevalence of cutoff-based interventions. In such applications sample sizes can be relatively small or there may be sparsity around the cutoff. We propose a metric, density inclusive study size (DISS), that characterizes the size of an RD study better than overall sample size by incorporating the density of the running variable. We show the usefulness of this metric in a Monte Carlo simulation study that compares the operating characteristics of popular nonparametric RD estimation methods in small studies. We also apply the DISS metric and RD estimation methods to school accountability data from the state of Indiana."}, "https://arxiv.org/abs/2302.03687": {"title": "Covariate Adjustment in Stratified Experiments", "link": "https://arxiv.org/abs/2302.03687", "description": "arXiv:2302.03687v4 Announce Type: replace \nAbstract: This paper studies covariate adjusted estimation of the average treatment effect in stratified experiments. We work in a general framework that includes matched tuples designs, coarse stratification, and complete randomization as special cases. Regression adjustment with treatment-covariate interactions is known to weakly improve efficiency for completely randomized designs. 
By contrast, we show that for stratified designs such regression estimators are generically inefficient, potentially even increasing estimator variance relative to the unadjusted benchmark. Motivated by this result, we derive the asymptotically optimal linear covariate adjustment for a given stratification. We construct several feasible estimators that implement this efficient adjustment in large samples. In the special case of matched pairs, for example, the regression including treatment, covariates, and pair fixed effects is asymptotically optimal. We also provide novel asymptotically exact inference methods that allow researchers to report smaller confidence intervals, fully reflecting the efficiency gains from both stratification and adjustment. Simulations and an empirical application demonstrate the value of our proposed methods."}, "https://arxiv.org/abs/2303.16101": {"title": "Two-step estimation of latent trait models", "link": "https://arxiv.org/abs/2303.16101", "description": "arXiv:2303.16101v2 Announce Type: replace \nAbstract: We consider two-step estimation of latent variable models, in which just the measurement model is estimated in the first step and the measurement parameters are then fixed at their estimated values in the second step where the structural model is estimated. We show how this approach can be implemented for latent trait models (item response theory models) where the latent variables are continuous and their measurement indicators are categorical variables. The properties of two-step estimators are examined using simulation studies and applied examples. They perform well, and have attractive practical and conceptual properties compared to the alternative one-step and three-step approaches. These results are in line with previous findings for other families of latent variable models. This provides strong evidence that two-step estimation is a flexible and useful general method of estimation for different types of latent variable models."}, "https://arxiv.org/abs/2306.05593": {"title": "Localized Neural Network Modelling of Time Series: A Case Study on US Monetary Policy", "link": "https://arxiv.org/abs/2306.05593", "description": "arXiv:2306.05593v2 Announce Type: replace \nAbstract: In this paper, we investigate a semiparametric regression model under the context of treatment effects via a localized neural network (LNN) approach. Due to a vast number of parameters involved, we reduce the number of effective parameters by (i) exploring the use of identification restrictions; and (ii) adopting a variable selection method based on the group-LASSO technique. Subsequently, we derive the corresponding estimation theory and propose a dependent wild bootstrap procedure to construct valid inferences accounting for the dependence of data. Finally, we validate our theoretical findings through extensive numerical studies. 
In an empirical study, we revisit the impacts of a tightening monetary policy action on a variety of economic variables, including short-/long-term interest rate, inflation, unemployment rate, industrial price and equity return via the newly proposed framework using a monthly dataset of the US."}, "https://arxiv.org/abs/2307.00127": {"title": "Large-scale Bayesian Structure Learning for Gaussian Graphical Models using Marginal Pseudo-likelihood", "link": "https://arxiv.org/abs/2307.00127", "description": "arXiv:2307.00127v3 Announce Type: replace \nAbstract: Bayesian methods for learning Gaussian graphical models offer a comprehensive framework that addresses model uncertainty and incorporates prior knowledge. Despite their theoretical strengths, the applicability of Bayesian methods is often constrained by computational demands, especially in modern contexts involving thousands of variables. To overcome this issue, we introduce two novel Markov chain Monte Carlo (MCMC) search algorithms with a significantly lower computational cost than leading Bayesian approaches. Our proposed MCMC-based search algorithms use the marginal pseudo-likelihood approach to bypass the complexities of computing intractable normalizing constants and iterative precision matrix sampling. These algorithms can deliver reliable results in mere minutes on standard computers, even for large-scale problems with one thousand variables. Furthermore, our proposed method efficiently addresses model uncertainty by exploring the full posterior graph space. We establish the consistency of graph recovery, and our extensive simulation study indicates that the proposed algorithms, particularly for large-scale sparse graphs, outperform leading Bayesian approaches in terms of computational efficiency and accuracy. We also illustrate the practical utility of our methods on medium and large-scale applications from human and mice gene expression studies. The implementation supporting the new approach is available through the R package BDgraph."}, "https://arxiv.org/abs/2308.13630": {"title": "Degrees of Freedom: Search Cost and Self-consistency", "link": "https://arxiv.org/abs/2308.13630", "description": "arXiv:2308.13630v2 Announce Type: replace \nAbstract: Model degrees of freedom ($\\df$) is a fundamental concept in statistics because it quantifies the flexibility of a fitting procedure and is indispensable in model selection. To investigate the gap between $\\df$ and the number of independent variables in the fitting procedure, \\textcite{tibshiraniDegreesFreedomModel2015} introduced the \\emph{search degrees of freedom} ($\\sdf$) concept to account for the search cost during model selection. However, this definition has two limitations: it does not consider fitting procedures in augmented spaces and does not use the same fitting procedure for $\\sdf$ and $\\df$. We propose a \\emph{modified search degrees of freedom} ($\\msdf$) to directly account for the cost of searching in either original or augmented spaces. We check this definition for various fitting procedures, including classical linear regressions, spline methods, adaptive regressions (the best subset and the lasso), regression trees, and multivariate adaptive regression splines (MARS). In many scenarios when $\\sdf$ is applicable, $\\msdf$ reduces to $\\sdf$. However, for certain procedures like the lasso, $\\msdf$ offers a fresh perspective on search costs. 
For some complex procedures like MARS, the $\\df$ has been pre-determined during model fitting, but the $\\df$ of the final fitted procedure might differ from the pre-determined one. To investigate this discrepancy, we introduce the concepts of \\emph{nominal} $\\df$ and \\emph{actual} $\\df$, and define the property of \\emph{self-consistency}, which occurs when there is no gap between these two $\\df$'s. We propose a correction procedure for MARS to align these two $\\df$'s, demonstrating improved fitting performance through extensive simulations and two real data applications."}, "https://arxiv.org/abs/2309.09872": {"title": "Moment-assisted GMM for Improving Subsampling-based MLE with Large-scale data", "link": "https://arxiv.org/abs/2309.09872", "description": "arXiv:2309.09872v2 Announce Type: replace \nAbstract: The maximum likelihood estimation is computationally demanding for large datasets, particularly when the likelihood function includes integrals. Subsampling can reduce the computational burden, but it typically results in efficiency loss. This paper proposes a moment-assisted subsampling (MAS) method that can improve the estimation efficiency of existing subsampling-based maximum likelihood estimators. The motivation behind this approach stems from the fact that sample moments can be efficiently computed even if the sample size of the whole data set is huge. Through the generalized method of moments, the proposed method incorporates informative sample moments of the whole data. The MAS estimator can be computed rapidly and is asymptotically normal with a smaller asymptotic variance than the corresponding estimator without incorporating sample moments of the whole data. The asymptotic variance of the MAS estimator depends on the specific sample moments incorporated. We derive the optimal moment that minimizes the resulting asymptotic variance in terms of Loewner order. Simulation studies and real data analysis were conducted to compare the proposed method with existing subsampling methods. Numerical results demonstrate the promising performance of the MAS method across various scenarios."}, "https://arxiv.org/abs/2311.05200": {"title": "Efficient Bayesian functional principal component analysis of irregularly-observed multivariate curves", "link": "https://arxiv.org/abs/2311.05200", "description": "arXiv:2311.05200v2 Announce Type: replace \nAbstract: The analysis of multivariate functional curves has the potential to yield important scientific discoveries in domains such as healthcare, medicine, economics and social sciences. However, it is common for real-world settings to present longitudinal data that are both irregularly and sparsely observed, which introduces important challenges for the current functional data methodology. A Bayesian hierarchical framework for multivariate functional principal component analysis is proposed, which accommodates the intricacies of such irregular observation settings by flexibly pooling information across subjects and correlated curves. The model represents common latent dynamics via shared functional principal component scores, thereby effectively borrowing strength across curves while circumventing the computationally challenging task of estimating covariance matrices. These scores also provide a parsimonious representation of the major modes of joint variation of the curves and constitute interpretable scalar summaries that can be employed in follow-up analyses. 
Estimation is carried out using variational inference, which combines efficiency, modularity and approximate posterior density estimation, enabling the joint analysis of large datasets with parameter uncertainty quantification. Detailed simulations assess the effectiveness of the approach in sharing information from sparse and irregularly sampled multivariate curves. The methodology is also exploited to estimate the molecular disease courses of individual patients with SARS-CoV-2 infection and characterise patient heterogeneity in recovery outcomes; this study reveals key coordinated dynamics across the immune, inflammatory and metabolic systems, which are associated with survival and long-COVID symptoms up to one year post disease onset. The approach is implemented in the R package bayesFPCA."}, "https://arxiv.org/abs/2311.18146": {"title": "Co-Active Subspace Methods for the Joint Analysis of Adjacent Computer Models", "link": "https://arxiv.org/abs/2311.18146", "description": "arXiv:2311.18146v2 Announce Type: replace \nAbstract: Active subspace (AS) methods are a valuable tool for understanding the relationship between the inputs and outputs of a Physics simulation. In this paper, an elegant generalization of the traditional ASM is developed to assess the co-activity of two computer models. This generalization, which we refer to as a Co-Active Subspace (C-AS) Method, allows for the joint analysis of two or more computer models allowing for thorough exploration of the alignment (or non-alignment) of the respective gradient spaces. We define co-active directions, co-sensitivity indices, and a scalar ``concordance\" metric (and complementary ``discordance\" pseudo-metric) and we demonstrate that these are powerful tools for understanding the behavior of a class of computer models, especially when used to supplement traditional AS analysis. Details for efficient estimation of the C-AS and an accompanying R package (github.com/knrumsey/concordance) are provided. Practical application is demonstrated through analyzing a set of simulated rate stick experiments for PBX 9501, a high explosive, offering insights into complex model dynamics."}, "https://arxiv.org/abs/2008.03073": {"title": "Degree distributions in networks: beyond the power law", "link": "https://arxiv.org/abs/2008.03073", "description": "arXiv:2008.03073v5 Announce Type: replace-cross \nAbstract: The power law is useful in describing count phenomena such as network degrees and word frequencies. With a single parameter, it captures the main feature that the frequencies are linear on the log-log scale. Nevertheless, there have been criticisms of the power law, for example that a threshold needs to be pre-selected without its uncertainty quantified, that the power law is simply inadequate, and that subsequent hypothesis tests are required to determine whether the data could have come from the power law. We propose a modelling framework that combines two different generalisations of the power law, namely the generalised Pareto distribution and the Zipf-polylog distribution, to resolve these issues. The proposed mixture distributions are shown to fit the data well and quantify the threshold uncertainty in a natural way. 
A model selection step embedded in the Bayesian inference algorithm further answers the question whether the power law is adequate."}, "https://arxiv.org/abs/2309.07810": {"title": "Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression", "link": "https://arxiv.org/abs/2309.07810", "description": "arXiv:2309.07810v3 Announce Type: replace-cross \nAbstract: Debiasing is a fundamental concept in high-dimensional statistics. While degrees-of-freedom adjustment is the state-of-the-art technique in high-dimensional linear regression, it is limited to i.i.d. samples and sub-Gaussian covariates. These constraints hinder its broader practical use. Here, we introduce Spectrum-Aware Debiasing--a novel method for high-dimensional regression. Our approach applies to problems with structured dependencies, heavy tails, and low-rank structures. Our method achieves debiasing through a rescaled gradient descent step, deriving the rescaling factor using spectral information of the sample covariance matrix. The spectrum-based approach enables accurate debiasing in much broader contexts. We study the common modern regime where the number of features and samples scale proportionally. We establish asymptotic normality of our proposed estimator (suitably centered and scaled) under various convergence notions when the covariates are right-rotationally invariant. Such designs have garnered recent attention due to their crucial role in compressed sensing. Furthermore, we devise a consistent estimator for its asymptotic variance.\n Our work has two notable by-products: first, we use Spectrum-Aware Debiasing to correct bias in principal components regression (PCR), providing the first debiased PCR estimator in high dimensions. Second, we introduce a principled test for checking alignment between the signal and the eigenvectors of the sample covariance matrix. This test is independently valuable for statistical methods developed using approximate message passing, leave-one-out, or convex Gaussian min-max theorems. We demonstrate our method through simulated and real data experiments. Technically, we connect approximate message passing algorithms with debiasing and provide the first proof of the Cauchy property of vector approximate message passing (V-AMP)."}, "https://arxiv.org/abs/2407.15874": {"title": "Spatially-clustered spatial autoregressive models with application to agricultural market concentration in Europe", "link": "https://arxiv.org/abs/2407.15874", "description": "arXiv:2407.15874v1 Announce Type: new \nAbstract: In this paper, we present an extension of the spatially-clustered linear regression models, namely, the spatially-clustered spatial autoregression (SCSAR) model, to deal with spatial heterogeneity issues in clustering procedures. In particular, we extend classical spatial econometrics models, such as the spatial autoregressive model, the spatial error model, and the spatially-lagged model, by allowing the regression coefficients to be spatially varying according to a cluster-wise structure. Cluster memberships and regression coefficients are jointly estimated through a penalized maximum likelihood algorithm which encourages neighboring units to belong to the same spatial cluster with shared regression coefficients. 
Motivated by the increase of observed values of the Gini index for the agricultural production in Europe between 2010 and 2020, the proposed methodology is employed to assess the presence of local spatial spillovers on the market concentration index for the European regions in the last decade. Empirical findings support the hypothesis of fragmentation of the European agricultural market, as the regions can be well represented by a clustering structure partitioning the continent into three-groups, roughly approximated by a division among Western, North Central and Southeastern regions. Also, we detect heterogeneous local effects induced by the selected explanatory variables on the regional market concentration. In particular, we find that variables associated with social, territorial and economic relevance of the agricultural sector seem to act differently throughout the spatial dimension, across the clusters and with respect to the pooled model, and temporal dimension."}, "https://arxiv.org/abs/2407.16024": {"title": "Generalized functional dynamic principal component analysis", "link": "https://arxiv.org/abs/2407.16024", "description": "arXiv:2407.16024v1 Announce Type: new \nAbstract: In this paper, we explore dimension reduction for time series of functional data within both stationary and non-stationary frameworks. We introduce a functional framework of generalized dynamic principal component analysis (GDPCA). The concept of GDPCA aims for better adaptation to possible nonstationary features of the series. We define the functional generalized dynamic principal component (GDPC) as static factor time series in a functional dynamic factor model and obtain the multivariate GDPC from a truncation of the functional dynamic factor model. GDFPCA uses a minimum squared error criterion to evaluate the reconstruction of the original functional time series. The computation of GDPC involves a two-step estimation of the coefficient vector of the loading curves in a basis expansion. We provide a proof of the consistency of the reconstruction of the original functional time series with GDPC converging in mean square to the original functional time series. Monte Carlo simulation studies indicate that the proposed GDFPCA is comparable to dynamic functional principal component analysis (DFPCA) when the data generating process is stationary, and outperforms DFPCA and FPCA when the data generating process is non-stationary. The results of applications to real data reaffirm the findings in simulation studies."}, "https://arxiv.org/abs/2407.16037": {"title": "Estimating Distributional Treatment Effects in Randomized Experiments: Machine Learning for Variance Reduction", "link": "https://arxiv.org/abs/2407.16037", "description": "arXiv:2407.16037v1 Announce Type: new \nAbstract: We propose a novel regression adjustment method designed for estimating distributional treatment effect parameters in randomized experiments. Randomized experiments have been extensively used to estimate treatment effects in various scientific fields. However, to gain deeper insights, it is essential to estimate distributional treatment effects rather than relying solely on average effects. Our approach incorporates pre-treatment covariates into a distributional regression framework, utilizing machine learning techniques to improve the precision of distributional treatment effect estimators. 
The proposed approach can be readily implemented with off-the-shelf machine learning methods and remains valid as long as the nuisance components are reasonably well estimated. Also, we establish the asymptotic properties of the proposed estimator and present a uniformly valid inference method. Through simulation results and real data analysis, we demonstrate the effectiveness of integrating machine learning techniques in reducing the variance of distributional treatment effect estimators in finite samples."}, "https://arxiv.org/abs/2407.16116": {"title": "Robust and consistent model evaluation criteria in high-dimensional regression", "link": "https://arxiv.org/abs/2407.16116", "description": "arXiv:2407.16116v1 Announce Type: new \nAbstract: In the last two decades, sparse regularization methods such as the LASSO have been applied in various fields. Most of the regularization methods have one or more regularization parameters, and selecting the value of the regularization parameter is essentially equivalent to selecting a model; thus we need to determine the regularization parameter adequately. To determine the regularization parameter in the linear regression model, we often apply information criteria such as the AIC and BIC; however, it has been pointed out that these criteria are sensitive to outliers and tend not to perform well in high-dimensional settings. Outliers generally have a negative influence not only on estimation but also on model selection; consequently, it is important to employ a selection method that is robust against outliers. In addition, when the number of explanatory variables is quite large, most conventional criteria are prone to select unnecessary explanatory variables. In this paper, we propose model evaluation criteria, based on statistical divergence, that are robust in both parameter estimation and model selection. Furthermore, the proposed criteria simultaneously achieve selection consistency and robustness even in high-dimensional settings. We also report numerical examples verifying that the proposed criteria perform robust and consistent variable selection compared with conventional selection methods."}, "https://arxiv.org/abs/2407.16212": {"title": "Optimal experimental design: Formulations and computations", "link": "https://arxiv.org/abs/2407.16212", "description": "arXiv:2407.16212v1 Announce Type: new \nAbstract: Questions of `how best to acquire data' are essential to modeling and prediction in the natural and social sciences, engineering applications, and beyond. Optimal experimental design (OED) formalizes these questions and creates computational methods to answer them. This article presents a systematic survey of modern OED, from its foundations in classical design theory to current research involving OED for complex models. We begin by reviewing criteria used to formulate an OED problem and thus to encode the goal of performing an experiment. We emphasize the flexibility of the Bayesian and decision-theoretic approach, which encompasses information-based criteria that are well-suited to nonlinear and non-Gaussian statistical models. We then discuss methods for estimating or bounding the values of these design criteria; this endeavor can be quite challenging due to strong nonlinearities, high parameter dimension, large per-sample costs, or settings where the model is implicit. 
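To make the criterion-evaluation step above concrete: for a linear-Gaussian model the Bayesian D-optimality / expected-information-gain criterion is available in closed form. The minimal sketch below uses that textbook formula purely as an illustration; it is not code from the survey.

```python
# Closed-form expected information gain for y = X theta + eps, eps ~ N(0, sigma^2 I),
# theta ~ N(0, prior_cov): EIG(X) = 0.5 * logdet(I + X prior_cov X^T / sigma^2).
import numpy as np

def expected_information_gain(X, prior_cov, noise_var):
    n = X.shape[0]
    M = np.eye(n) + X @ prior_cov @ X.T / noise_var
    _, logdet = np.linalg.slogdet(M)      # stable log-determinant
    return 0.5 * logdet

# Toy comparison of two three-point designs for a quadratic regression in x.
def design_matrix(x):
    return np.column_stack([np.ones_like(x), x, x**2])

spread_out = np.array([-1.0, 0.0, 1.0])
clustered = np.array([0.4, 0.5, 0.6])
prior = np.eye(3)
print(expected_information_gain(design_matrix(spread_out), prior, 0.01))  # larger EIG
print(expected_information_gain(design_matrix(clustered), prior, 0.01))
```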
A complementary set of computational issues involves optimization methods used to find a design; we discuss such methods in the discrete (combinatorial) setting of observation selection and in settings where an exact design can be continuously parameterized. Finally, we present emerging methods for sequential OED that build non-myopic design policies, rather than explicit designs; these methods naturally adapt to the outcomes of past experiments in proposing new experiments, while seeking coordination among all experiments to be performed. Throughout, we highlight important open questions and challenges."}, "https://arxiv.org/abs/2407.16299": {"title": "Sparse outlier-robust PCA for multi-source data", "link": "https://arxiv.org/abs/2407.16299", "description": "arXiv:2407.16299v1 Announce Type: new \nAbstract: Sparse and outlier-robust Principal Component Analysis (PCA) has been a very active field of research recently. Yet, most existing methods apply PCA to a single dataset, whereas multi-source data (i.e., multiple related datasets requiring joint analysis) arise across many scientific areas. We introduce a novel PCA methodology that simultaneously (i) selects important features, (ii) allows for the detection of global sparse patterns across multiple data sources as well as local source-specific patterns, and (iii) is resistant to outliers. To this end, we develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns, and where the ssMRCD estimator is used as a plug-in to permit joint outlier-robust analysis across multiple data sources. We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers and illustrate its practical advantages in simulation and in applications."}, "https://arxiv.org/abs/2407.16349": {"title": "Bayesian modelling of VAR precision matrices using stochastic block networks", "link": "https://arxiv.org/abs/2407.16349", "description": "arXiv:2407.16349v1 Announce Type: new \nAbstract: Commonly used priors for Vector Autoregressions (VARs) induce shrinkage on the autoregressive coefficients. Introducing shrinkage on the error covariance matrix is sometimes done but, in the vast majority of cases, without considering the network structure of the shocks and by placing the prior on the lower Cholesky factor of the precision matrix. In this paper, we propose a prior on the VAR error precision matrix directly. Our prior, which resembles a standard spike and slab prior, models variable inclusion probabilities through a stochastic block model that clusters shocks into groups. Within groups, the probability of having relations across group members is higher (inducing less sparsity), whereas relations across groups imply a lower probability that members of each group are conditionally related. We show in simulations that our approach recovers the true network structure well. Using a US macroeconomic data set, we illustrate how our approach can be used to cluster shocks together and that this feature leads to improved density forecasts."}, "https://arxiv.org/abs/2407.16366": {"title": "Robust Bayesian Model Averaging for Linear Regression Models With Heavy-Tailed Errors", "link": "https://arxiv.org/abs/2407.16366", "description": "arXiv:2407.16366v1 Announce Type: new \nAbstract: In this article, our goal is to develop a method for Bayesian model averaging in linear regression models to accommodate heavier-tailed error distributions than the normal distribution. 
Motivated by the use of the Huber loss function in the presence of outliers, Park and Casella (2008) proposed the concept of the Bayesian Huberized lasso, which has been recently developed and implemented by Kawakami and Hashimoto (2023), with hyperbolic errors. Because the Huberized lasso cannot enforce regression coefficients to be exactly zero, we propose a fully Bayesian variable selection approach with spike and slab priors that can address sparsity more effectively. Furthermore, while the hyperbolic distribution has heavier tails than a normal distribution, its tails are less heavy in comparison to a Cauchy distribution. Thus, we propose a regression model with an error distribution that encompasses both hyperbolic and Student-t distributions. Our model aims to capture the benefit of using the Huber loss, but it can also adapt to heavier tails and unknown levels of sparsity, as necessitated by the data. We develop an efficient Gibbs sampler with Metropolis-Hastings steps for posterior computation. Through simulation studies and analyses of the benchmark Boston housing dataset and NBA player salaries in the 2022-2023 season, we show that our method is competitive with various state-of-the-art methods."}, "https://arxiv.org/abs/2407.16374": {"title": "A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests", "link": "https://arxiv.org/abs/2407.16374", "description": "arXiv:2407.16374v1 Announce Type: new \nAbstract: In the statistical literature, as well as in artificial intelligence and machine learning, measures of discrepancy between two probability distributions are widely used to develop measures of goodness-of-fit. We concentrate on quadratic distances, which depend on a non-negative definite kernel. We propose a unified framework for the study of two-sample and k-sample goodness-of-fit tests based on the concept of matrix distance. We provide a succinct review of the goodness-of-fit literature related to the use of distance measures, and specifically to quadratic distances. We show that the quadratic distance kernel-based two-sample test has the same functional form as the maximum mean discrepancy test. We develop tests for the $k$-sample scenario, where the two-sample problem is a special case. We derive their asymptotic distribution under the null hypothesis and discuss computational aspects of the test procedures. We assess their performance, in terms of level and power, via extensive simulations and a real data example. The proposed framework is implemented in the QuadratiK package, available in both R and Python environments."}, "https://arxiv.org/abs/2407.16550": {"title": "A Kernel-Based Conditional Two-Sample Test Using Nearest Neighbors (with Applications to Calibration, Regression Curves, and Simulation-Based Inference)", "link": "https://arxiv.org/abs/2407.16550", "description": "arXiv:2407.16550v1 Announce Type: new \nAbstract: In this paper we introduce a kernel-based measure for detecting differences between two conditional distributions. Using the `kernel trick' and nearest-neighbor graphs, we propose a consistent estimate of this measure which can be computed in nearly linear time (for a fixed number of nearest neighbors). Moreover, when the two conditional distributions are the same, the estimate has a Gaussian limit and its asymptotic variance has a simple form that can be easily estimated from the data. 
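Both kernel-based entries above connect to the maximum mean discrepancy. As a generic point of reference (not either paper's own statistic), the sketch below computes the standard unbiased MMD^2 estimate with a Gaussian kernel and calibrates it by permutation.

```python
# Generic unbiased MMD^2 two-sample statistic with a Gaussian kernel, calibrated by
# permutation; a baseline illustration, not the QuadratiK or conditional-test code.
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    Kxx = gaussian_kernel(X, X, bandwidth); np.fill_diagonal(Kxx, 0.0)
    Kyy = gaussian_kernel(Y, Y, bandwidth); np.fill_diagonal(Kyy, 0.0)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    n, m = len(X), len(Y)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

def permutation_pvalue(X, Y, n_perm=500, bandwidth=1.0, seed=0):
    rng = np.random.default_rng(seed)
    obs = mmd2_unbiased(X, Y, bandwidth)
    pooled = np.vstack([X, Y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))            # relabel the pooled sample
        Xp, Yp = pooled[idx[:len(X)]], pooled[idx[len(X):]]
        count += mmd2_unbiased(Xp, Yp, bandwidth) >= obs
    return (count + 1) / (n_perm + 1)
```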
The resulting test attains precise asymptotic level and is universally consistent for detecting differences between two conditional distributions. We also provide a resampling based test using our estimate that applies to the conditional goodness-of-fit problem, which controls Type I error in finite samples and is asymptotically consistent with only a finite number of resamples. A method to de-randomize the resampling test is also presented. The proposed methods can be readily applied to a broad range of problems, ranging from classical nonparametric statistics to modern machine learning. Specifically, we explore three applications: testing model calibration, regression curve evaluation, and validation of emulator models in simulation-based inference. We illustrate the superior performance of our method for these tasks, both in simulations as well as on real data. In particular, we apply our method to (1) assess the calibration of neural network models trained on the CIFAR-10 dataset, (2) compare regression functions for wind power generation across two different turbines, and (3) validate emulator models on benchmark examples with intractable posteriors and for generating synthetic `redshift' associated with galaxy images."}, "https://arxiv.org/abs/2407.15854": {"title": "Decoding Digital Influence: The Role of Social Media Behavior in Scientific Stratification Through Logistic Attribution Method", "link": "https://arxiv.org/abs/2407.15854", "description": "arXiv:2407.15854v1 Announce Type: cross \nAbstract: Scientific social stratification is a classic theme in the sociology of science. The deep integration of social media has bridged the gap between scientometrics and sociology of science. This study comprehensively analyzes the impact of social media on scientific stratification and mobility, delving into the complex interplay between academic status and social media activity in the digital age. [Research Method] Innovatively, this paper employs An Explainable Logistic Attribution Analysis from a meso-level perspective to explore the correlation between social media behaviors and scientific social stratification. It examines the impact of scientists' use of social media in the digital age on scientific stratification and mobility, uniquely combining statistical methods with machine learning. This fusion effectively integrates hypothesis testing with a substantive interpretation of the contribution of independent variables to the model. [Research Conclusion] Empirical evidence demonstrates that social media promotes stratification and mobility within the scientific community, revealing a nuanced and non-linear facilitation mechanism. Social media activities positively impact scientists' status within the scientific social hierarchy to a certain extent, but beyond a specific threshold, this impact turns negative. It shows that the advent of social media has opened new channels for academic influence, transcending the limitations of traditional academic publishing, and prompting changes in scientific stratification. 
Additionally, the study acknowledges the limitations of its experimental design and suggests future research directions."}, "https://arxiv.org/abs/2407.15868": {"title": "A Survey on Differential Privacy for SpatioTemporal Data in Transportation Research", "link": "https://arxiv.org/abs/2407.15868", "description": "arXiv:2407.15868v1 Announce Type: cross \nAbstract: With low-cost computing devices, improved sensor technology, and the proliferation of data-driven algorithms, we have more data than we know what to do with. In transportation, we are seeing a surge in spatiotemporal data collection. At the same time, concerns over user privacy have led to research on differential privacy in applied settings. In this paper, we look at some recent developments in differential privacy in the context of spatiotemporal data. Spatiotemporal data contain not only features about users but also the geographical locations of their frequent visits. Hence, the public release of such data carries extreme risks. To address the need for such data in research and inference without exposing private information, significant work has been proposed. This survey paper aims to summarize these efforts and provide a review of differential privacy mechanisms and related software. We also discuss related work in transportation where such mechanisms have been applied. Furthermore, we address the challenges in the deployment and mass adoption of differential privacy in transportation spatiotemporal data for downstream analyses."}, "https://arxiv.org/abs/2407.16152": {"title": "Discovering overlapping communities in multi-layer directed networks", "link": "https://arxiv.org/abs/2407.16152", "description": "arXiv:2407.16152v1 Announce Type: cross \nAbstract: This article explores the challenging problem of detecting overlapping communities in multi-layer directed networks. Our goal is to understand the underlying asymmetric overlapping community structure by analyzing the mixed memberships of nodes. We introduce a new model, the multi-layer mixed membership stochastic co-block model (multi-layer MM-ScBM), to model multi-layer directed networks in which nodes can belong to multiple communities. We develop a spectral procedure to estimate nodes' memberships in both sending and receiving patterns. Our method uses a successive projection algorithm on a few leading eigenvectors of two debiased aggregation matrices. To our knowledge, this is the first work to detect asymmetric overlapping communities in multi-layer directed networks. We demonstrate the consistent estimation properties of our method by providing per-node error rates under the multi-layer MM-ScBM framework. Our theoretical analysis reveals that increasing the overall sparsity, the number of nodes, or the number of layers can improve the accuracy of overlapping community detection. Extensive numerical experiments are conducted to validate these theoretical findings. We also apply our method to one real-world multi-layer directed network, gaining insightful results."}, "https://arxiv.org/abs/2407.16283": {"title": "A Randomized Exchange Algorithm for Optimal Design of Multi-Response Experiments", "link": "https://arxiv.org/abs/2407.16283", "description": "arXiv:2407.16283v1 Announce Type: cross \nAbstract: Despite the increasing prevalence of vector observations, computation of optimal experimental design for multi-response models has received limited attention. 
To address this problem within the framework of approximate designs, we introduce mREX, an algorithm that generalizes the randomized exchange algorithm REX (J Am Stat Assoc 115:529, 2020), originally specialized for single-response models. The mREX algorithm incorporates several improvements: a novel method for computing efficient sparse initial designs, an extension to all differentiable Kiefer's optimality criteria, and an efficient method for performing optimal exchanges of weights. For the most commonly used D-optimality criterion, we propose a technique for optimal weight exchanges based on the characteristic matrix polynomial. The mREX algorithm is applicable to linear, nonlinear, and generalized linear models, and scales well to large problems. It typically converges to optimal designs faster than available alternative methods, although it does not require advanced mathematical programming solvers. We demonstrate the application of mREX to bivariate dose-response Emax models for clinical trials, both without and with the inclusion of covariates."}, "https://arxiv.org/abs/2407.16376": {"title": "Bayesian Autoregressive Online Change-Point Detection with Time-Varying Parameters", "link": "https://arxiv.org/abs/2407.16376", "description": "arXiv:2407.16376v1 Announce Type: cross \nAbstract: Change points in real-world systems mark significant regime shifts in system dynamics, possibly triggered by exogenous or endogenous factors. These points define regimes for the time evolution of the system and are crucial for understanding transitions in financial, economic, social, environmental, and technological contexts. Building upon the Bayesian approach introduced in \\cite{c:07}, we devise a new method for online change point detection in the mean of a univariate time series, which is well suited for real-time applications and is able to handle the general temporal patterns displayed by data in many empirical contexts. We first describe time series as an autoregressive process of an arbitrary order. Second, the variance and correlation of the data are allowed to vary within each regime driven by a scoring rule that updates the value of the parameters for a better fit of the observations. Finally, a change point is detected in a probabilistic framework via the posterior distribution of the current regime length. By modeling temporal dependencies and time-varying parameters, the proposed approach enhances both the estimate accuracy and the forecasting power. Empirical validations using various datasets demonstrate the method's effectiveness in capturing memory and dynamic patterns, offering deeper insights into the non-stationary dynamics of real-world systems."}, "https://arxiv.org/abs/2407.16567": {"title": "CASTRO -- Efficient constrained sampling method for material and chemical experimental design", "link": "https://arxiv.org/abs/2407.16567", "description": "arXiv:2407.16567v1 Announce Type: cross \nAbstract: The exploration of multicomponent material composition space requires significant time and financial investments, necessitating efficient use of resources for statistically relevant compositions. This article introduces a novel methodology, implemented in the open-source CASTRO (ConstrAined Sequential laTin hypeRcube sampling methOd) software package, to overcome equality-mixture constraints and ensure comprehensive design space coverage. 
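The online change-point entry above (arXiv:2407.16376) builds on the Bayesian run-length recursion of Adams and MacKay (2007), the \cite{c:07} in its abstract. The sketch below implements only that base recursion, with a Gaussian known-variance conjugate model and a constant hazard, as a point of reference for the autoregressive, time-varying extension the paper develops; it is not the paper's method.

```python
# Minimal Bayesian online change-point recursion (Adams & MacKay, 2007) for a
# mean-shift model with known observation variance and constant hazard rate.
import numpy as np

def bocpd_gaussian(x, hazard=1.0 / 100, mu0=0.0, var0=10.0, obs_var=1.0):
    """Return R with R[t, r] = P(run length r | x_1..t)."""
    T = len(x)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    mu = np.array([mu0])      # posterior mean of the segment mean, per run length
    var = np.array([var0])    # posterior variance, per run length
    for t, xt in enumerate(x, start=1):
        pred_var = var + obs_var
        pred = np.exp(-0.5 * (xt - mu) ** 2 / pred_var) / np.sqrt(2 * np.pi * pred_var)
        R[t, 1:t + 1] = R[t - 1, :t] * pred * (1.0 - hazard)   # run length grows by one
        R[t, 0] = np.sum(R[t - 1, :t] * pred * hazard)         # a change point resets it
        R[t] /= R[t].sum()
        post_var = 1.0 / (1.0 / var + 1.0 / obs_var)           # conjugate Gaussian update
        post_mu = post_var * (mu / var + xt / obs_var)
        mu = np.concatenate(([mu0], post_mu))
        var = np.concatenate(([var0], post_var))
    return R

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(bocpd_gaussian(data)[-1].argmax())   # most probable current run length (about 100)
```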
Our approach leverages Latin hypercube sampling (LHS) and LHS with multidimensional uniformity (LHSMDU) using a divide-and-conquer strategy to manage high-dimensional problems effectively. By incorporating previous experimental knowledge within a limited budget, our method strategically recommends a feasible number of experiments to explore the design space. We validate our methodology with two examples: a four-dimensional problem with near-uniform distributions and a nine-dimensional problem with additional mixture constraints, yielding specific component distributions. Our constrained sequential LHS or LHSMDU approach enables thorough design space exploration, proving robustness for experimental design. This research not only advances material science but also offers promising solutions for efficiency challenges in pharmaceuticals and chemicals. CASTRO and the case studies are available for free download on GitHub."}, "https://arxiv.org/abs/1910.12545": {"title": "Testing Forecast Rationality for Measures of Central Tendency", "link": "https://arxiv.org/abs/1910.12545", "description": "arXiv:1910.12545v5 Announce Type: replace \nAbstract: Rational respondents to economic surveys may report as a point forecast any measure of the central tendency of their (possibly latent) predictive distribution, for example the mean, median, mode, or any convex combination thereof. We propose tests of forecast rationality when the measure of central tendency used by the respondent is unknown. We overcome an identification problem that arises when the measures of central tendency are equal or in a local neighborhood of each other, as is the case for (exactly or nearly) symmetric distributions. As a building block, we also present novel tests for the rationality of mode forecasts. We apply our tests to income forecasts from the Federal Reserve Bank of New York's Survey of Consumer Expectations. We find these forecasts are rationalizable as mode forecasts, but not as mean or median forecasts. We also find heterogeneity in the measure of centrality used by respondents when stratifying the sample by past income, age, job stability, and survey experience."}, "https://arxiv.org/abs/2110.06450": {"title": "Online network change point detection with missing values and temporal dependence", "link": "https://arxiv.org/abs/2110.06450", "description": "arXiv:2110.06450v3 Announce Type: replace \nAbstract: In this paper we study online change point detection in dynamic networks with time heterogeneous missing pattern within networks and dependence across the time course. The missingness probabilities, the entrywise sparsity of networks, the rank of networks and the jump size in terms of the Frobenius norm, are all allowed to vary as functions of the pre-change sample size. On top of a thorough handling of all the model parameters, we notably allow the edges and missingness to be dependent. To the best of our knowledge, such general framework has not been rigorously nor systematically studied before in the literature. We propose a polynomial time change point detection algorithm, with a version of soft-impute algorithm (e.g. Mazumder et al., 2010; Klopp, 2015) as the imputation sub-routine. Piecing up these standard sub-routines algorithms, we are able to solve a brand new problem with sharp detection delay subject to an overall Type-I error control. 
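The network change-point procedure just described uses soft-impute (Mazumder et al., 2010) as its imputation sub-routine. A bare-bones version of that sub-routine, shown for orientation only and without the sequential monitoring built around it, is sketched below.

```python
# Bare-bones soft-impute (Mazumder et al., 2010): alternate filling in the missing
# entries with the current low-rank estimate and soft-thresholding singular values.
import numpy as np

def soft_impute(A, observed, lam=1.0, n_iters=200):
    """A: matrix with arbitrary values at missing entries; observed: boolean mask."""
    Z = np.zeros_like(A, dtype=float)
    for _ in range(n_iters):
        filled = np.where(observed, A, Z)                # keep observed entries, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt          # singular-value soft-thresholding
    return Z
```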
Extensive numerical experiments are conducted demonstrating the outstanding performances of our proposed method in practice."}, "https://arxiv.org/abs/2111.13774": {"title": "Robust Permutation Tests in Linear Instrumental Variables Regression", "link": "https://arxiv.org/abs/2111.13774", "description": "arXiv:2111.13774v4 Announce Type: replace \nAbstract: This paper develops permutation versions of identification-robust tests in linear instrumental variables (IV) regression. Unlike the existing randomization and rank-based tests in which independence between the instruments and the error terms is assumed, the permutation Anderson- Rubin (AR), Lagrange Multiplier (LM) and Conditional Likelihood Ratio (CLR) tests are asymptotically similar and robust to conditional heteroskedasticity under standard exclusion restriction i.e. the orthogonality between the instruments and the error terms. Moreover, when the instruments are independent of the structural error term, the permutation AR tests are exact, hence robust to heavy tails. As such, these tests share the strengths of the rank-based tests and the wild bootstrap AR tests. Numerical illustrations corroborate the theoretical results."}, "https://arxiv.org/abs/2312.07775": {"title": "On the construction of stationary processes and random fields", "link": "https://arxiv.org/abs/2312.07775", "description": "arXiv:2312.07775v2 Announce Type: replace \nAbstract: We propose a new method to construct a stationary process and random field with a given decreasing covariance function and any one-dimensional marginal distribution. The result is a new class of stationary processes and random fields. The construction method utilizes a correlated binary sequence, and it allows a simple and practical way to model dependence structures in a stationary process and random field as its dependence structure is induced by the correlation structure of a few disjoint sets in the support set of the marginal distribution. Simulation results of the proposed models are provided, which show the empirical behavior of a sample path."}, "https://arxiv.org/abs/2307.02781": {"title": "Dynamic Factor Analysis with Dependent Gaussian Processes for High-Dimensional Gene Expression Trajectories", "link": "https://arxiv.org/abs/2307.02781", "description": "arXiv:2307.02781v2 Announce Type: replace-cross \nAbstract: The increasing availability of high-dimensional, longitudinal measures of gene expression can facilitate understanding of biological mechanisms, as required for precision medicine. Biological knowledge suggests that it may be best to describe complex diseases at the level of underlying pathways, which may interact with one another. We propose a Bayesian approach that allows for characterising such correlation among different pathways through Dependent Gaussian Processes (DGP) and mapping the observed high-dimensional gene expression trajectories into unobserved low-dimensional pathway expression trajectories via Bayesian Sparse Factor Analysis. Our proposal is the first attempt to relax the classical assumption of independent factors for longitudinal data and has demonstrated a superior performance in recovering the shape of pathway expression trajectories, revealing the relationships between genes and pathways, and predicting gene expressions (closer point estimates and narrower predictive intervals), as demonstrated through simulations and real data analysis. 
To fit the model, we propose a Monte Carlo Expectation Maximization (MCEM) scheme that can be implemented conveniently by combining a standard Markov Chain Monte Carlo sampler and an R package GPFDA (Konzen and others, 2021), which returns the maximum likelihood estimates of DGP hyperparameters. The modular structure of MCEM makes it generalizable to other complex models involving the DGP model component. Our R package DGP4LCF that implements the proposed approach is available on CRAN."}, "https://arxiv.org/abs/2407.16786": {"title": "Generalised Causal Dantzig", "link": "https://arxiv.org/abs/2407.16786", "description": "arXiv:2407.16786v1 Announce Type: new \nAbstract: Prediction invariance of causal models under heterogeneous settings has been exploited by a number of recent methods for causal discovery, typically focussing on recovering the causal parents of a target variable of interest. When instrumental variables are not available, the causal Dantzig estimator exploits invariance under the more restrictive case of shift interventions. However, also in this case, one requires observational data from a number of sufficiently different environments, which is rarely available. In this paper, we consider a structural equation model where the target variable is described by a generalised additive model conditional on its parents. Besides having finite moments, no modelling assumptions are made on the conditional distributions of the other variables in the system. Under this setting, we characterise the causal model uniquely by means of two key properties: the Pearson residuals are invariant under the causal model and conditional on the causal parents the causal parameters maximise the population likelihood. These two properties form the basis of a computational strategy for searching the causal model among all possible models. Crucially, for generalised linear models with a known dispersion parameter, such as Poisson and logistic regression, the causal model can be identified from a single data environment."}, "https://arxiv.org/abs/2407.16870": {"title": "CoCA: Cooperative Component Analysis", "link": "https://arxiv.org/abs/2407.16870", "description": "arXiv:2407.16870v1 Announce Type: new \nAbstract: We propose Cooperative Component Analysis (CoCA), a new method for unsupervised multi-view analysis: it identifies the component that simultaneously captures significant within-view variance and exhibits strong cross-view correlation. The challenge of integrating multi-view data is particularly important in biology and medicine, where various types of \"-omic\" data, ranging from genomics to proteomics, are measured on the same set of samples. The goal is to uncover important, shared signals that represent underlying biological mechanisms. CoCA combines an approximation error loss to preserve information within data views and an \"agreement penalty\" to encourage alignment across data views. By balancing the trade-off between these two key components in the objective, CoCA has the property of interpolating between the commonly-used principal component analysis (PCA) and canonical correlation analysis (CCA) as special cases at the two ends of the solution path. CoCA chooses the degree of agreement in a data-adaptive manner, using a validation set or cross-validation to estimate test error. Furthermore, we propose a sparse variant of CoCA that incorporates the Lasso penalty to yield feature sparsity, facilitating the identification of key features driving the observed patterns. 
We demonstrate the effectiveness of CoCA on simulated data and two real multiomics studies of COVID-19 and ductal carcinoma in situ of breast. In both real data applications, CoCA successfully integrates multiomics data, extracting components that are not only consistently present across different data views but also more informative and predictive of disease progression. CoCA offers a powerful framework for discovering important shared signals in multi-view data, with the potential to uncover novel insights in an increasingly multi-view data world."}, "https://arxiv.org/abs/2407.16948": {"title": "Relative local dependence of bivariate copulas", "link": "https://arxiv.org/abs/2407.16948", "description": "arXiv:2407.16948v1 Announce Type: new \nAbstract: For a bivariate probability distribution, local dependence around a single point on the support is often formulated as the second derivative of the logarithm of the probability density function. However, this definition lacks the invariance under marginal distribution transformations, which is often required as a criterion for dependence measures. In this study, we examine the \\textit{relative local dependence}, which we define as the ratio of the local dependence to the probability density function, for copulas. By using this notion, we point out that typical copulas can be characterised as the solutions to the corresponding partial differential equations, particularly highlighting that the relative local dependence of the Frank copula remains constant. The estimation and visualization of the relative local dependence are demonstrated using simulation data. Furthermore, we propose a class of copulas where local dependence is proportional to the $k$-th power of the probability density function, and as an example, we demonstrate a newly discovered relationship derived from the density functions of two representative copulas, the Frank copula and the Farlie-Gumbel-Morgenstern (FGM) copula."}, "https://arxiv.org/abs/2407.16950": {"title": "Identification and inference of outcome conditioned partial effects of general interventions", "link": "https://arxiv.org/abs/2407.16950", "description": "arXiv:2407.16950v1 Announce Type: new \nAbstract: This paper proposes a new class of distributional causal quantities, referred to as the \\textit{outcome conditioned partial policy effects} (OCPPEs), to measure the \\textit{average} effect of a general counterfactual intervention of a target covariate on the individuals in different quantile ranges of the outcome distribution.\n The OCPPE approach is valuable in several aspects: (i) Unlike the unconditional quantile partial effect (UQPE) that is not $\\sqrt{n}$-estimable, an OCPPE is $\\sqrt{n}$-estimable. Analysts can use it to capture heterogeneity across the unconditional distribution of $Y$ as well as obtain accurate estimation of the aggregated effect at the upper and lower tails of $Y$. (ii) The semiparametric efficiency bound for an OCPPE is explicitly derived. (iii) We propose an efficient debiased estimator for OCPPE, and provide feasible uniform inference procedures for the OCPPE process. (iv) The efficient doubly robust score for an OCPPE can be used to optimize infinitesimal nudges to a continuous treatment by maximizing a quantile specific Empirical Welfare function. 
We illustrate the method by analyzing how anti-smoking policies impact low percentiles of live infants' birthweights."}, "https://arxiv.org/abs/2407.17036": {"title": "A Bayesian modelling framework for health care resource use and costs in trial-based economic evaluations", "link": "https://arxiv.org/abs/2407.17036", "description": "arXiv:2407.17036v1 Announce Type: new \nAbstract: Individual-level effectiveness and healthcare resource use (HRU) data are routinely collected in trial-based economic evaluations. While effectiveness is often expressed in terms of utility scores derived from some health-related quality of life instruments (e.g.~EQ-5D questionnaires), different types of HRU may be included. Costs are usually generated by applying unit prices to HRU data and statistical methods have been traditionally implemented to analyse costs and utilities or after combining them into aggregated variables (e.g. Quality-Adjusted Life Years). When outcome data are not fully observed, e.g. some patients drop out or only provided partial information, the validity of the results may be hindered both in terms of efficiency and bias. Often, partially-complete HRU data are handled using \"ad-hoc\" methods, implicitly relying on some assumptions (e.g. fill-in a zero) which are hard to justify beside the practical convenience of increasing the completion rate. We present a general Bayesian framework for the modelling of partially-observed HRUs which allows a flexible model specification to accommodate the typical complexities of the data and to quantify the impact of different types of uncertainty on the results. We show the benefits of using our approach using a motivating example and compare the results to those from traditional analyses focussed on the modelling of cost variables after adopting some ad-hoc imputation strategy for HRU data."}, "https://arxiv.org/abs/2407.17113": {"title": "Bayesian non-linear subspace shrinkage using horseshoe priors", "link": "https://arxiv.org/abs/2407.17113", "description": "arXiv:2407.17113v1 Announce Type: new \nAbstract: When modeling biological responses using Bayesian non-parametric regression, prior information may be available on the shape of the response in the form of non-linear function spaces that define the general shape of the response. To incorporate such information into the analysis, we develop a non-linear functional shrinkage (NLFS) approach that uniformly shrinks the non-parametric fitted function into a non-linear function space while allowing for fits outside of this space when the data suggest alternative shapes. This approach extends existing functional shrinkage approaches into linear subspaces to shrinkage into non-linear function spaces using a Taylor series expansion and corresponding updating of non-linear parameters. We demonstrate this general approach on the Hill model, a popular, biologically motivated model, and show that shrinkage into combined function spaces, i.e., where one has two or more non-linear functions a priori, is straightforward. We demonstrate this approach through synthetic and real data. Computational details on the underlying MCMC sampling are provided with data and analysis available in an online supplement."}, "https://arxiv.org/abs/2407.17225": {"title": "Asymmetry Analysis of Bilateral Shapes", "link": "https://arxiv.org/abs/2407.17225", "description": "arXiv:2407.17225v1 Announce Type: new \nAbstract: Many biological objects possess bilateral symmetry about a midline or midplane, up to a ``noise'' term. 
This paper uses landmark-based methods to measure departures from bilateral symmetry, especially for the two-group problem where one group is more asymmetric than the other. In this paper, we formulate our work in the framework of size-and-shape analysis including registration via rigid body motion. Our starting point is a vector of elementary asymmetry features defined at the individual landmark coordinates for each object. We introduce two approaches for testing. In the first, the elementary features are combined into a scalar composite asymmetry measure for each object. Then standard univariate tests can be used to compare the two groups. In the second approach, a univariate test statistic is constructed for each elementary feature. The maximum of these statistics lead to an overall test statistic to compare the two groups and we then provide a technique to extract the important features from the landmark data. Our methodology is illustrated on a pre-registered smile dataset collected to assess the success of cleft lip surgery on human subjects. The asymmetry in a group of cleft lip subjects is compared to a group of normal subjects, and statistically significant differences have been found by univariate tests in the first approach. Further, our feature extraction method leads to an anatomically plausible set of landmarks for medical applications."}, "https://arxiv.org/abs/2407.17385": {"title": "Causal modelling without counterfactuals and individualised effects", "link": "https://arxiv.org/abs/2407.17385", "description": "arXiv:2407.17385v1 Announce Type: new \nAbstract: The most common approach to causal modelling is the potential outcomes framework due to Neyman and Rubin. In this framework, outcomes of counterfactual treatments are assumed to be well-defined. This metaphysical assumption is often thought to be problematic yet indispensable. The conventional approach relies not only on counterfactuals, but also on abstract notions of distributions and assumptions of independence that are not directly testable. In this paper, we construe causal inference as treatment-wise predictions for finite populations where all assumptions are testable; this means that one can not only test predictions themselves (without any fundamental problem), but also investigate sources of error when they fail. The new framework highlights the model-dependence of causal claims as well as the difference between statistical and scientific inference."}, "https://arxiv.org/abs/2407.16797": {"title": "Estimating the hyperuniformity exponent of point processes", "link": "https://arxiv.org/abs/2407.16797", "description": "arXiv:2407.16797v1 Announce Type: cross \nAbstract: We address the challenge of estimating the hyperuniformity exponent $\\alpha$ of a spatial point process, given only one realization of it. Assuming that the structure factor $S$ of the point process follows a vanishing power law at the origin (the typical case of a hyperuniform point process), this exponent is defined as the slope near the origin of $\\log S$. Our estimator is built upon the (expanding window) asymptotic variance of some wavelet transforms of the point process. By combining several scales and several wavelets, we develop a multi-scale, multi-taper estimator $\\widehat{\\alpha}$. We analyze its asymptotic behavior, proving its consistency under various settings, and enabling the construction of asymptotic confidence intervals for $\\alpha$ when $\\alpha < d$ and under Brillinger mixing. 
This construction is derived from a multivariate central limit theorem where the normalisations are non-standard and vary among the components. We also present a non-asymptotic deviation inequality providing insights into the influence of tapers on the bias-variance trade-off of $\\widehat{\\alpha}$. Finally, we investigate the performance of $\\widehat{\\alpha}$ through simulations, and we apply our method to the analysis of hyperuniformity in a real dataset of marine algae."}, "https://arxiv.org/abs/2407.16975": {"title": "On the Parameter Identifiability of Partially Observed Linear Causal Models", "link": "https://arxiv.org/abs/2407.16975", "description": "arXiv:2407.16975v1 Announce Type: cross \nAbstract: Linear causal models are important tools for modeling causal dependencies and yet in practice, only a subset of the variables can be observed. In this paper, we examine the parameter identifiability of these models by investigating whether the edge coefficients can be recovered given the causal structure and partially observed data. Our setting is more general than that of prior research - we allow all variables, including both observed and latent ones, to be flexibly related, and we consider the coefficients of all edges, whereas most existing works focus only on the edges between observed variables. Theoretically, we identify three types of indeterminacy for the parameters in partially observed linear causal models. We then provide graphical conditions that are sufficient for all parameters to be identifiable and show that some of them are provably necessary. Methodologically, we propose a novel likelihood-based parameter estimation method that addresses the variance indeterminacy of latent variables in a specific way and can asymptotically recover the underlying parameters up to trivial indeterminacy. Empirical studies on both synthetic and real-world datasets validate our identifiability theory and the effectiveness of the proposed method in the finite-sample regime."}, "https://arxiv.org/abs/2407.17132": {"title": "Exploring Covid-19 Spatiotemporal Dynamics: Non-Euclidean Spatially Aware Functional Registration", "link": "https://arxiv.org/abs/2407.17132", "description": "arXiv:2407.17132v1 Announce Type: cross \nAbstract: When it came to Covid-19, timing was everything. This paper considers the spatiotemporal dynamics of the Covid-19 pandemic via a developed methodology of non-Euclidean spatially aware functional registration. In particular, the daily SARS-CoV-2 incidence in each of 380 local authorities in the UK from March to June 2020 is analysed to understand the phase variation of the waves when considered as curves. This is achieved by adapting a traditional registration method (that of local variation analysis) to account for the clear spatial dependencies in the data. This adapted methodology is shown via simulation studies to perform substantially better for the estimation of the registration functions than the non-spatial alternative. Moreover, it is found that the driving time between locations represents the spatial dependency in the Covid-19 data better than geographical distance. However, since driving time is non-Euclidean, the traditional spatial frameworks break down; to solve this, a methodology inspired by multidimensional scaling is developed to approximate the driving times by a Euclidean distance which enables the established theory to be applied. 
Finally, the resulting estimates of the registration/warping processes are analysed by taking functionals to understand the qualitatively observable earliness/lateness and sharpness/flatness of the Covid-19 waves quantitatively."}, "https://arxiv.org/abs/2407.17200": {"title": "Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems", "link": "https://arxiv.org/abs/2407.17200", "description": "arXiv:2407.17200v1 Announce Type: cross \nAbstract: A recent stream of structured learning approaches has improved the practical state of the art for a range of combinatorial optimization problems with complex objectives encountered in operations research. Such approaches train policies that chain a statistical model with a surrogate combinatorial optimization oracle to map any instance of the problem to a feasible solution. The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately. However learning such policies by risk minimization is challenging because the empirical risk is piecewise constant in the parameters, and few theoretical guarantees have been provided so far. In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and improves generalization. Our main contribution is a generalization bound that controls the perturbation bias, the statistical learning error, and the optimization error. Our analysis relies on the introduction of a uniform weak property, which captures and quantifies the interplay of the statistical model and the surrogate combinatorial optimization oracle. This property holds under mild assumptions on the statistical model, the surrogate optimization, and the instance data distribution. We illustrate the result on a range of applications such as stochastic vehicle scheduling. In particular, such policies are relevant for contextual stochastic optimization and our results cover this case."}, "https://arxiv.org/abs/2407.17329": {"title": "Low dimensional representation of multi-patient flow cytometry datasets using optimal transport for minimal residual disease detection in leukemia", "link": "https://arxiv.org/abs/2407.17329", "description": "arXiv:2407.17329v1 Announce Type: cross \nAbstract: Representing and quantifying Minimal Residual Disease (MRD) in Acute Myeloid Leukemia (AML), a type of cancer that affects the blood and bone marrow, is essential in the prognosis and follow-up of AML patients. As traditional cytological analysis cannot detect leukemia cells below 5\\%, the analysis of flow cytometry dataset is expected to provide more reliable results. In this paper, we explore statistical learning methods based on optimal transport (OT) to achieve a relevant low-dimensional representation of multi-patient flow cytometry measurements (FCM) datasets considered as high-dimensional probability distributions. Using the framework of OT, we justify the use of the K-means algorithm for dimensionality reduction of multiple large-scale point clouds through mean measure quantization by merging all the data into a single point cloud. After this quantization step, the visualization of the intra and inter-patients FCM variability is carried out by embedding low-dimensional quantized probability measures into a linear space using either Wasserstein Principal Component Analysis (PCA) through linearized OT or log-ratio PCA of compositional data. 
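A stripped-down version of the quantization-then-embedding pipeline in the flow cytometry entry above (K-means on the merged point cloud, then the log-ratio PCA option for the resulting compositions) could look as follows; the function and parameter names are mine, and the Wasserstein PCA branch is not shown.

```python
# Sketch: quantize the pooled events with K-means, represent each patient as a
# composition over the shared codewords, then embed via centred log-ratio PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def embed_patients(point_clouds, n_codewords=50, n_components=2, eps=1e-6):
    pooled = np.vstack(point_clouds)                               # merge all patients' events
    km = KMeans(n_clusters=n_codewords, n_init=10).fit(pooled)
    comps = np.array([np.bincount(km.predict(pc), minlength=n_codewords) / len(pc)
                      for pc in point_clouds])                     # per-patient histograms
    logc = np.log(comps + eps)
    clr = logc - logc.mean(axis=1, keepdims=True)                  # centred log-ratio transform
    return PCA(n_components=n_components).fit_transform(clr)      # 2-D patient embedding
```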
Using a publicly available FCM dataset and a FCM dataset from Bordeaux University Hospital, we demonstrate the benefits of our approach over the popular kernel mean embedding technique for statistical learning from multiple high-dimensional probability distributions. We also highlight the usefulness of our methodology for low-dimensional projection and clustering patient measurements according to their level of MRD in AML from FCM. In particular, our OT-based approach allows a relevant and informative two-dimensional representation of the results of the FlowSom algorithm, a state-of-the-art method for the detection of MRD in AML using multi-patient FCM."}, "https://arxiv.org/abs/2407.17401": {"title": "Estimation of bid-ask spreads in the presence of serial dependence", "link": "https://arxiv.org/abs/2407.17401", "description": "arXiv:2407.17401v1 Announce Type: cross \nAbstract: Starting from a basic model in which the dynamic of the transaction prices is a geometric Brownian motion disrupted by a microstructure white noise, corresponding to the random alternation of bids and asks, we propose moment-based estimators along with their statistical properties. We then make the model more realistic by considering serial dependence: we assume a geometric fractional Brownian motion for the price, then an Ornstein-Uhlenbeck process for the microstructure noise. In these two cases of serial dependence, we propose again consistent and asymptotically normal estimators. All our estimators are compared on simulated data with existing approaches, such as Roll, Corwin-Schultz, Abdi-Ranaldo, or Ardia-Guidotti-Kroencke estimators."}, "https://arxiv.org/abs/1902.09615": {"title": "Binscatter Regressions", "link": "https://arxiv.org/abs/1902.09615", "description": "arXiv:1902.09615v5 Announce Type: replace \nAbstract: We introduce the package Binsreg, which implements the binscatter methods developed by Cattaneo, Crump, Farrell, and Feng (2024b,a). The package includes seven commands: binsreg, binslogit, binsprobit, binsqreg, binstest, binspwc, and binsregselect. The first four commands implement binscatter plotting, point estimation, and uncertainty quantification (confidence intervals and confidence bands) for least squares linear binscatter regression (binsreg) and for nonlinear binscatter regression (binslogit for Logit regression, binsprobit for Probit regression, and binsqreg for quantile regression). The next two commands focus on pointwise and uniform inference: binstest implements hypothesis testing procedures for parametric specifications and for nonparametric shape restrictions of the unknown regression function, while binspwc implements multi-group pairwise statistical comparisons. Finally, the command binsregselect implements data-driven number of bins selectors. The commands offer binned scatter plots, and allow for covariate adjustment, weighting, clustering, and multi-sample analysis, which is useful when studying treatment effect heterogeneity in randomized and observational studies, among many other features."}, "https://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "https://arxiv.org/abs/2202.08419", "description": "arXiv:2202.08419v4 Announce Type: replace \nAbstract: In this paper, we develop a novel high-dimensional time-varying coefficient estimation method, based on high-dimensional Ito diffusion processes. 
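Among the benchmarks listed in the bid-ask spread entry above, the Roll estimator is simple enough to state in a few lines. The sketch below is the textbook serial-covariance version, included only for context; it is not the paper's moment-based estimators under serial dependence.

```python
# Classical Roll (1984) estimator: effective spread from the negative first-order
# autocovariance of price changes; returns 0 when the covariance is non-negative.
import numpy as np

def roll_spread(prices):
    dp = np.diff(np.log(prices))                 # log price changes
    autocov = np.cov(dp[1:], dp[:-1])[0, 1]      # first-order serial covariance
    return 2.0 * np.sqrt(-autocov) if autocov < 0 else 0.0
```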
To account for high-dimensional time-varying coefficients, we first estimate local (or instantaneous) coefficients using a time-localized Dantzig selection scheme under a sparsity condition, which results in biased local coefficient estimators due to the regularization. To handle the bias, we propose a debiasing scheme, which provides well-performing unbiased local coefficient estimators. With the unbiased local coefficient estimators, we estimate the integrated coefficient, and to further account for the sparsity of the coefficient process, we apply thresholding schemes. We call this Thresholding dEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED estimator. In the empirical analysis, we apply the TED procedure to analyzing high-dimensional factor models using high-frequency data."}, "https://arxiv.org/abs/2211.08858": {"title": "Unbalanced Kantorovich-Rubinstein distance, plan, and barycenter on finite spaces: A statistical perspective", "link": "https://arxiv.org/abs/2211.08858", "description": "arXiv:2211.08858v2 Announce Type: replace \nAbstract: We analyze statistical properties of plug-in estimators for unbalanced optimal transport quantities between finitely supported measures in different prototypical sampling models. Specifically, our main results provide non-asymptotic bounds on the expected error of empirical Kantorovich-Rubinstein (KR) distance, plans, and barycenters for mass penalty parameter $C>0$. The impact of the mass penalty parameter $C$ is studied in detail. Based on this analysis, we mathematically justify randomized computational schemes for KR quantities which can be used for fast approximate computations in combination with any exact solver. Using synthetic and real datasets, we empirically analyze the behavior of the expected errors in simulation studies and illustrate the validity of our theoretical bounds."}, "https://arxiv.org/abs/2008.07361": {"title": "Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?", "link": "https://arxiv.org/abs/2008.07361", "description": "arXiv:2008.07361v2 Announce Type: replace-cross \nAbstract: Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements.\n Materials and Methods: We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value.\n Results: The adequate sample size achieves a median reduction of the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The median reduction of the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively.\n Discussion: Based on our results a conservative, yet significant, reduction in sample size and model complexity can be estimated for future prediction work. 
Though, if a researcher is willing to generate a learning curve a much larger reduction of the model complexity may be possible as suggested by a large outcome-dependent variability.\n Conclusion: Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with a substantially reduced model complexity."}, "https://arxiv.org/abs/2306.00833": {"title": "When Does Bottom-up Beat Top-down in Hierarchical Community Detection?", "link": "https://arxiv.org/abs/2306.00833", "description": "arXiv:2306.00833v2 Announce Type: replace-cross \nAbstract: Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive ($\\textit{top-down}$) algorithms recursively partition the nodes into two communities, until a stopping rule indicates that no further split is needed. In contrast, agglomerative ($\\textit{bottom-up}$) algorithms first identify the smallest community structure and then repeatedly merge the communities using a $\\textit{linkage}$ method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive compared to those existing for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis."}, "https://arxiv.org/abs/2407.17534": {"title": "Extension of W-method and A-learner for multiple binary outcomes", "link": "https://arxiv.org/abs/2407.17534", "description": "arXiv:2407.17534v1 Announce Type: new \nAbstract: In this study, we compared two groups, in which subjects were assigned to either the treatment or the control group. In such trials, if the efficacy of the treatment cannot be demonstrated in a population that meets the eligibility criteria, identifying the subgroups for which the treatment is effective is desirable. Such subgroups can be identified by estimating heterogeneous treatment effects (HTE). In recent years, methods for estimating HTE have increasingly relied on complex models. Although these models improve the estimation accuracy, they often sacrifice interpretability. Despite significant advancements in the methods for continuous or univariate binary outcomes, methods for multiple binary outcomes are less prevalent, and existing interpretable methods, such as the W-method and A-learner, while capable of estimating HTE for a single binary outcome, still fail to capture the correlation structure when applied to multiple binary outcomes. We thus propose two methods for estimating HTE for multiple binary outcomes: one based on the W-method and the other based on the A-learner. 
We also demonstrate that the conventional A-learner introduces bias in the estimation of the treatment effect. The proposed method employs a framework based on reduced-rank regression to capture the correlation structure among multiple binary outcomes. We correct for the bias inherent in the A-learner estimates and investigate the impact of this bias through numerical simulations. Finally, we demonstrate the effectiveness of the proposed method using a real data application."}, "https://arxiv.org/abs/2407.17592": {"title": "Robust Maximum $L_q$-Likelihood Covariance Estimation for Replicated Spatial Data", "link": "https://arxiv.org/abs/2407.17592", "description": "arXiv:2407.17592v1 Announce Type: new \nAbstract: Parameter estimation with the maximum $L_q$-likelihood estimator (ML$q$E) is an alternative to the maximum likelihood estimator (MLE) that considers the $q$-th power of the likelihood values for some $q<1$. In this method, extreme values are down-weighted because of their lower likelihood values, which yields robust estimates. In this work, we study the properties of the ML$q$E for spatial data with replicates. We investigate the asymptotic properties of the ML$q$E for Gaussian random fields with a Mat\\'ern covariance function, and carry out simulation studies to investigate the numerical performance of the ML$q$E. We show that it can provide more robust and stable estimation results when some of the replicates in the spatial data contain outliers. In addition, we develop a mechanism to find the optimal choice of the hyper-parameter $q$ for the ML$q$E. The robustness of our approach is further verified on a United States precipitation dataset. Compared with other robust methods for spatial data, our proposal is more intuitive and easier to understand, yet it performs well when dealing with datasets containing outliers."}, "https://arxiv.org/abs/2407.17666": {"title": "Causal estimands and identification of time-varying effects in non-stationary time series from N-of-1 mobile device data", "link": "https://arxiv.org/abs/2407.17666", "description": "arXiv:2407.17666v1 Announce Type: new \nAbstract: Mobile technology (mobile phones and wearable devices) generates continuous data streams encompassing outcomes, exposures and covariates, presented as intensive longitudinal or multivariate time series data. The high frequency of measurements enables granular and dynamic evaluation of treatment effect, revealing their persistence and accumulation over time. Existing methods predominantly focus on the contemporaneous effect, temporal-average, or population-average effects, assuming stationarity or invariance of treatment effects over time, which are inadequate both conceptually and statistically to capture dynamic treatment effects in personalized mobile health data. We here propose new causal estimands for multivariate time series in N-of-1 studies. These estimands summarize how time-varying exposures impact outcomes in both short- and long-term. We propose identifiability assumptions and a g-formula estimator that accounts for exposure-outcome and outcome-covariate feedback. The g-formula employs a state space model framework innovatively to accommodate time-varying behavior of treatment effects in non-stationary time series. We apply the proposed method to a multi-year smartphone observational study of bipolar patients and estimate the dynamic effect of phone-based communication on mood of patients with bipolar disorder in an N-of-1 setting. 
Our approach reveals substantial heterogeneity in treatment effects over time and across individuals. A simulation-based strategy is also proposed for the development of a short-term, dynamic, and personalized treatment recommendation based on patient's past information, in combination with a novel positivity diagnostics plot, validating proper causal inference in time series data."}, "https://arxiv.org/abs/2407.17694": {"title": "Doubly Robust Conditional Independence Testing with Generative Neural Networks", "link": "https://arxiv.org/abs/2407.17694", "description": "arXiv:2407.17694v1 Announce Type: new \nAbstract: This article addresses the problem of testing the conditional independence of two generic random vectors $X$ and $Y$ given a third random vector $Z$, which plays an important role in statistical and machine learning applications. We propose a new non-parametric testing procedure that avoids explicitly estimating any conditional distributions but instead requires sampling from the two marginal conditional distributions of $X$ given $Z$ and $Y$ given $Z$. We further propose using a generative neural network (GNN) framework to sample from these approximated marginal conditional distributions, which tends to mitigate the curse of dimensionality due to its adaptivity to any low-dimensional structures and smoothness underlying the data. Theoretically, our test statistic is shown to enjoy a doubly robust property against GNN approximation errors, meaning that the test statistic retains all desirable properties of the oracle test statistic utilizing the true marginal conditional distributions, as long as the product of the two approximation errors decays to zero faster than the parametric rate. Asymptotic properties of our statistic and the consistency of a bootstrap procedure are derived under both null and local alternatives. Extensive numerical experiments and real data analysis illustrate the effectiveness and broad applicability of our proposed test."}, "https://arxiv.org/abs/2407.17804": {"title": "Bayesian Spatiotemporal Wombling", "link": "https://arxiv.org/abs/2407.17804", "description": "arXiv:2407.17804v1 Announce Type: new \nAbstract: Stochastic process models for spatiotemporal data underlying random fields find substantial utility in a range of scientific disciplines. Subsequent to predictive inference on the values of the random field (or spatial surface indexed continuously over time) at arbitrary space-time coordinates, scientific interest often turns to gleaning information regarding zones of rapid spatial-temporal change. We develop Bayesian modeling and inference for directional rates of change along a given surface. These surfaces, which demarcate regions of rapid change, are referred to as ``wombling'' surface boundaries. Existing methods for studying such changes have often been associated with curves and are not easily extendable to surfaces resulting from curves evolving over time. Our current contribution devises a fully model-based inferential framework for analyzing differential behavior in spatiotemporal responses by formalizing the notion of a ``wombling'' surface boundary using conventional multi-linear vector analytic frameworks and geometry followed by posterior predictive computations using triangulated surface approximations. 
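For readers new to wombling, one common way to formalize a boundary of rapid change is the average directional gradient of the latent surface across a candidate boundary. This is a notational sketch in the spirit of the Bayesian spatiotemporal wombling entry above, not necessarily its exact estimand:

```latex
\[
\Gamma(\mathcal{B}) \;=\; \frac{1}{|\mathcal{B}|}\int_{\mathcal{B}} \mathbf{n}(s,t)^{\top}\,\nabla_{s} Y(s,t)\, \mathrm{d}\sigma(s,t),
\]
```

where $Y(s,t)$ is the latent spatiotemporal surface, $\mathcal{B}$ is a candidate wombling boundary with surface measure $\sigma$, and $\mathbf{n}(s,t)$ is a chosen unit direction (for example, the normal to $\mathcal{B}$). A large value of $|\Gamma(\mathcal{B})|$ flags $\mathcal{B}$ as a zone of rapid change, and posterior inference proceeds by evaluating such functionals over posterior draws of the surface.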
We illustrate our methodology with comprehensive simulation experiments followed by multiple applications in environmental and climate science; pollutant analysis in environmental health; and brain imaging."}, "https://arxiv.org/abs/2407.17848": {"title": "Bayesian Benchmarking Small Area Estimation via Entropic Tilting", "link": "https://arxiv.org/abs/2407.17848", "description": "arXiv:2407.17848v1 Announce Type: new \nAbstract: Benchmarking estimation and its risk evaluation is a practically important issue in small area estimation. While hierarchical Bayesian methods have been widely adopted in small area estimation, a unified Bayesian approach to benchmarking estimation has not been fully discussed. This work employs an entropic tilting method to modify the posterior distribution of the small area parameters to meet the benchmarking constraint, which enables us to obtain benchmarked point estimation as well as reasonable uncertainty quantification. Using conditionally independent structures of the posterior, we first introduce general Monte Carlo methods for obtaining a benchmarked posterior and then show that the benchmarked posterior can be obtained in an analytical form for some representative small area models. We demonstrate the usefulness of the proposed method through simulation and empirical studies."}, "https://arxiv.org/abs/2407.17888": {"title": "Enhanced power enhancements for testing many moment equalities: Beyond the $2$- and $\infty$-norm", "link": "https://arxiv.org/abs/2407.17888", "description": "arXiv:2407.17888v1 Announce Type: new \nAbstract: Tests based on the $2$- and $\infty$-norm have received considerable attention in high-dimensional testing problems, as they are powerful against dense and sparse alternatives, respectively. The power enhancement principle of Fan et al. (2015) combines these two norms to construct tests that are powerful against both types of alternatives. Nevertheless, the $2$- and $\infty$-norm are just two out of the whole spectrum of $p$-norms that one can base a test on. In the context of testing whether a candidate parameter satisfies a large number of moment equalities, we construct a test that harnesses the strength of all $p$-norms with $p\in[2, \infty]$. As a result, this test is consistent against strictly more alternatives than any test based on a single $p$-norm. In particular, our test is consistent against more alternatives than tests based on the $2$- and $\infty$-norm, which is what most implementations of the power enhancement principle target.\n We illustrate the scope of our general results by using them to construct a test that simultaneously dominates the Anderson-Rubin test (based on $p=2$) and tests based on the $\infty$-norm in terms of consistency in the linear instrumental variable model with many (weak) instruments."}, "https://arxiv.org/abs/2407.17920": {"title": "Tobit Exponential Smoothing, towards an enhanced demand planning in the presence of censored data", "link": "https://arxiv.org/abs/2407.17920", "description": "arXiv:2407.17920v1 Announce Type: new \nAbstract: ExponenTial Smoothing (ETS) is a widely adopted forecasting technique in both research and practical applications. One critical development in ETS was the establishment of a robust statistical foundation based on state space models with a single source of error. However, an important challenge in ETS that remains unsolved is censored data estimation. 
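Relating to the "Enhanced power enhancements" entry above: a test that pools several $p$-norms can be calibrated generically by Monte Carlo under the null. The sketch below is a toy version of that idea (a finite grid of norms and Gaussian null draws), not the paper's actual construction.

```python
# Toy test that combines several p-norms of a standardized moment vector,
# calibrated by Monte Carlo under a Gaussian null (generic sketch only).
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 200, 0.05
p_grid = [2, 4, 8, np.inf]          # finite proxy for p in [2, infinity]

def norms(z):
    return np.array([np.linalg.norm(z, ord=p) for p in p_grid])

# Null calibration: per-norm critical values, then a critical value for the max
null_norms = np.array([norms(rng.standard_normal(d)) for _ in range(5000)])
c_p = np.quantile(null_norms, 1 - alpha, axis=0)
null_max = (null_norms / c_p).max(axis=1)
c_max = np.quantile(null_max, 1 - alpha)

# A sparse alternative: a few large moment violations
z_obs = rng.standard_normal(d)
z_obs[:3] += 6.0
reject = (norms(z_obs) / c_p).max() > c_max
print("reject null of all moment equalities:", reject)
```

The sparse violation above is picked up mainly through the larger norms, while a dense alternative would be picked up through the 2-norm; taking the maximum over the grid keeps power against both kinds.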
This issue is critical in supply chain management, in particular, when companies have to deal with stockouts. This work solves that problem by proposing the Tobit ETS, which extends the use of ETS models to handle censored data efficiently. This advancement builds upon the linear models taxonomy and extends it to encompass censored data scenarios. The results show that the Tobit ETS reduces considerably the forecast bias. Real and simulation data are used from the airline and supply chain industries to corroborate the findings."}, "https://arxiv.org/abs/2407.18077": {"title": "An Alternating Direction Method of Multipliers Algorithm for the Weighted Fused LASSO Signal Approximator", "link": "https://arxiv.org/abs/2407.18077", "description": "arXiv:2407.18077v1 Announce Type: new \nAbstract: We present an Alternating Direction Method of Multipliers (ADMM) algorithm designed to solve the Weighted Generalized Fused LASSO Signal Approximator (wFLSA). First, we show that wFLSAs can always be reformulated as a Generalized LASSO problem. With the availability of algorithms tailored to the Generalized LASSO, the issue appears to be, in principle, resolved. However, the computational complexity of these algorithms is high, with a time complexity of $O(p^4)$ for a single iteration, where $p$ represents the number of coefficients. To overcome this limitation, we propose an ADMM algorithm specifically tailored for wFLSA-equivalent problems, significantly reducing the complexity to $O(p^2)$. Our algorithm is publicly accessible through the R package wflsa."}, "https://arxiv.org/abs/2407.18166": {"title": "Identification and multiply robust estimation of causal effects via instrumental variables from an auxiliary heterogeneous population", "link": "https://arxiv.org/abs/2407.18166", "description": "arXiv:2407.18166v1 Announce Type: new \nAbstract: Evaluating causal effects in a primary population of interest with unmeasured confounders is challenging. Although instrumental variables (IVs) are widely used to address unmeasured confounding, they may not always be available in the primary population. Fortunately, IVs might have been used in previous observational studies on similar causal problems, and these auxiliary studies can be useful to infer causal effects in the primary population, even if they represent different populations. However, existing methods often assume homogeneity or equality of conditional average treatment effects between the primary and auxiliary populations, which may be limited in practice. This paper aims to remove the homogeneity requirement and establish a novel identifiability result allowing for different conditional average treatment effects across populations. We also construct a multiply robust estimator that remains consistent despite partial misspecifications of the observed data model and achieves local efficiency if all nuisance models are correct. The proposed approach is illustrated through simulation studies. 
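On the weighted fused LASSO signal approximator entry above: the objective can be handled with a standard generalized-lasso ADMM split. The code below is a dense, naive sketch of that split (it is not the package's $O(p^2)$ algorithm, and the weights and penalties are placeholders).

```python
# Generic ADMM for a weighted fused-lasso signal approximator:
#   minimize 0.5*||y - beta||^2 + sum_i w_i |(D beta)_i|,  D = [I; first differences],
# solved by splitting z = D @ beta (dense sketch; factorization done naively).
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_wflsa(y, lam1, lam2, rho=1.0, n_iter=300):
    p = len(y)
    D = np.vstack([np.eye(p), np.diff(np.eye(p), axis=0)])        # identity + differences
    w = np.concatenate([np.full(p, lam1), np.full(p - 1, lam2)])  # per-term weights
    A = np.eye(p) + rho * D.T @ D                                 # fixed linear system
    z = np.zeros(D.shape[0]); u = np.zeros_like(z)
    for _ in range(n_iter):
        beta = np.linalg.solve(A, y + rho * D.T @ (z - u))        # quadratic step
        z = soft(D @ beta + u, w / rho)                           # weighted shrinkage
        u += D @ beta - z                                         # dual update
    return beta

y = np.concatenate([np.zeros(40), 2 * np.ones(40), np.zeros(40)])
y += 0.3 * np.random.default_rng(2).standard_normal(120)
print(np.round(admm_wflsa(y, lam1=0.1, lam2=2.0)[35:45], 2))      # near-piecewise-constant fit
```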
We finally apply our approach by leveraging data from lower income individuals with cigarette price as a valid IV to evaluate the causal effect of smoking on physical functional status in higher income group where strong IVs are not available."}, "https://arxiv.org/abs/2407.18206": {"title": "Starting Small: Prioritizing Safety over Efficacy in Randomized Experiments Using the Exact Finite Sample Likelihood", "link": "https://arxiv.org/abs/2407.18206", "description": "arXiv:2407.18206v1 Announce Type: new \nAbstract: We use the exact finite sample likelihood and statistical decision theory to answer questions of ``why?'' and ``what should you have done?'' using data from randomized experiments and a utility function that prioritizes safety over efficacy. We propose a finite sample Bayesian decision rule and a finite sample maximum likelihood decision rule. We show that in finite samples from 2 to 50, it is possible for these rules to achieve better performance according to established maximin and maximum regret criteria than a rule based on the Boole-Frechet-Hoeffding bounds. We also propose a finite sample maximum likelihood criterion. We apply our rules and criterion to an actual clinical trial that yielded a promising estimate of efficacy, and our results point to safety as a reason for why results were mixed in subsequent trials."}, "https://arxiv.org/abs/2407.17565": {"title": "Periodicity significance testing with null-signal templates: reassessment of PTF's SMBH binary candidates", "link": "https://arxiv.org/abs/2407.17565", "description": "arXiv:2407.17565v1 Announce Type: cross \nAbstract: Periodograms are widely employed for identifying periodicity in time series data, yet they often struggle to accurately quantify the statistical significance of detected periodic signals when the data complexity precludes reliable simulations. We develop a data-driven approach to address this challenge by introducing a null-signal template (NST). The NST is created by carefully randomizing the period of each cycle in the periodogram template, rendering it non-periodic. It has the same frequentist properties as a periodic signal template regardless of the noise probability distribution, and we show with simulations that the distribution of false positives is the same as with the original periodic template, regardless of the underlying data. Thus, performing a periodicity search with the NST acts as an effective simulation of the null (no-signal) hypothesis, without having to simulate the noise properties of the data. We apply the NST method to the supermassive black hole binaries (SMBHB) search in the Palomar Transient Factory (PTF), where Charisi et al. had previously proposed 33 high signal to (white) noise candidates utilizing simulations to quantify their significance. Our approach reveals that these simulations do not capture the complexity of the real data. There are no statistically significant periodic signal detections above the non-periodic background. To improve the search sensitivity we introduce a Gaussian quadrature based algorithm for the Bayes Factor with correlated noise as a test statistic, in contrast to the standard signal to white noise. We show with simulations that this improves sensitivity to true signals by more than an order of magnitude. 
However, using the Bayes Factor approach also results in no statistically significant detections in the PTF data."}, "https://arxiv.org/abs/2407.17658": {"title": "Semiparametric Piecewise Accelerated Failure Time Model for the Analysis of Immune-Oncology Clinical Trials", "link": "https://arxiv.org/abs/2407.17658", "description": "arXiv:2407.17658v1 Announce Type: cross \nAbstract: The effectiveness of immune-oncology chemotherapies has been demonstrated in recent clinical trials. The Kaplan-Meier estimates of the survival functions of the immune therapy and the control often suggested the presence of a lag-time before the immune therapy began to act. This implies that the use of the hazard ratio under the proportional hazards assumption would not be appealing, and many alternatives have been investigated, such as the restricted mean survival time. In addition to such an overall summary of the treatment contrast, the lag-time is also an important feature of the treatment effect. Identical survival functions up to the lag-time imply that patients who are likely to die before the lag-time would not benefit from the treatment, and identifying such patients is very important. We propose the semiparametric piecewise accelerated failure time model and its inference procedure based on the semiparametric maximum likelihood method. It provides not only an overall treatment summary, but also a framework to identify patients who benefit less from the immune therapy in a unified way. Numerical experiments confirm that each parameter can be estimated with minimal bias. Through a real data analysis, we illustrate the evaluation of the effect of immune-oncology therapy and the characterization of covariates under which patients are unlikely to receive the benefit of treatment."}, "https://arxiv.org/abs/1904.00111": {"title": "Simple subvector inference on sharp identified set in affine models", "link": "https://arxiv.org/abs/1904.00111", "description": "arXiv:1904.00111v3 Announce Type: replace \nAbstract: This paper studies a regularized support function estimator for bounds on components of the parameter vector in the case in which the identified set is a polygon. The proposed regularized estimator has three important properties: (i) it has a uniform asymptotic Gaussian limit in the presence of flat faces in the absence of redundant (or overidentifying) constraints (or vice versa); (ii) the bias from regularization does not enter the first-order limiting distribution; (iii) the estimator remains consistent for the sharp (non-enlarged) identified set for the individual components even in the non-regular case. These properties are used to construct \emph{uniformly valid} confidence sets for an element $\theta_{1}$ of a parameter vector $\theta\in\mathbb{R}^{d}$ that is partially identified by affine moment equality and inequality conditions. The proposed confidence sets can be computed as a solution to a small number of linear and convex quadratic programs, leading to a substantial decrease in computation time and guaranteeing a global optimum. As a result, the method provides uniformly valid inference in applications in which the dimension of the parameter space, $d$, and the number of inequalities, $k$, were previously computationally unfeasible ($d,k=100$). The proposed approach can be extended to construct confidence sets for intersection bounds, to construct joint polygon-shaped confidence sets for multiple components of $\theta$, and to find the set of solutions to a linear program. 
Inference for coefficients in the linear IV regression model with an interval outcome is used as an illustrative example."}, "https://arxiv.org/abs/2106.11043": {"title": "Scalable Bayesian inference for time series via divide-and-conquer", "link": "https://arxiv.org/abs/2106.11043", "description": "arXiv:2106.11043v3 Announce Type: replace \nAbstract: Bayesian computational algorithms tend to scale poorly as data size increases. This has motivated divide-and-conquer-based approaches for scalable inference. These divide the data into subsets, perform inference for each subset in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that usually lack rigorous theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer method, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach."}, "https://arxiv.org/abs/2202.08728": {"title": "Nonparametric extensions of randomized response for private confidence sets", "link": "https://arxiv.org/abs/2202.08728", "description": "arXiv:2202.08728v4 Announce Type: replace \nAbstract: This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations $(X_1, \\dots, X_n)$ with mean $\\mu^\\star$ that are privatized into $(Z_1, \\dots, Z_n)$, we present confidence intervals (CI) and time-uniform confidence sequences (CS) for $\\mu^\\star$ when only given access to the privatized data. To achieve this, we study a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests."}, "https://arxiv.org/abs/2305.14275": {"title": "Variational Inference with Coverage Guarantees in Simulation-Based Inference", "link": "https://arxiv.org/abs/2305.14275", "description": "arXiv:2305.14275v3 Announce Type: replace \nAbstract: Amortized variational inference is an often employed framework in simulation-based inference that produces a posterior approximation that can be rapidly computed given any new observation. Unfortunately, there are few guarantees about the quality of these approximate posteriors. We propose Conformalized Amortized Neural Variational Inference (CANVI), a procedure that is scalable, easily implemented, and provides guaranteed marginal coverage. Given a collection of candidate amortized posterior approximators, CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. 
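The conformal step that CANVI builds on can be illustrated with a generic split-conformal construction (this sketch is not CANVI itself; the synthetic regression data, random-forest model, and 90% level are placeholders):

```python
# Generic split-conformal interval: calibrate a nonconformity score so the
# resulting region covers the truth with probability ~1-alpha.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(2000)

fit, cal = slice(0, 1000), slice(1000, 2000)
model = RandomForestRegressor(random_state=0).fit(X[fit], y[fit])

alpha = 0.1
scores = np.abs(y[cal] - model.predict(X[cal]))             # nonconformity scores
q = np.quantile(scores, np.ceil((scores.size + 1) * (1 - alpha)) / scores.size)

x_new = np.array([[1.0]])
pred = model.predict(x_new)[0]
print(f"90% conformal interval: [{pred - q:.2f}, {pred + q:.2f}]")
```

The marginal coverage guarantee holds for any choice of the underlying predictor, which is why candidates can then be ranked purely by the size (predictive efficiency) of the regions they produce.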
CANVI ensures that the resulting predictor constructs regions that contain the truth with a user-specified level of probability. CANVI is agnostic to design decisions in formulating the candidate approximators and only requires access to samples from the forward model, permitting its use in likelihood-free settings. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation. Finally, we demonstrate the accurate calibration and high predictive efficiency of CANVI on a suite of simulation-based inference benchmark tasks and an important scientific task: analyzing galaxy emission spectra."}, "https://arxiv.org/abs/2309.16373": {"title": "Regularization and Model Selection for Ordinal-on-Ordinal Regression with Applications to Food Products' Testing and Survey Data", "link": "https://arxiv.org/abs/2309.16373", "description": "arXiv:2309.16373v2 Announce Type: replace \nAbstract: Ordinal data are quite common in applied statistics. Although some model selection and regularization techniques for categorical predictors and ordinal response models have been developed over the past few years, less work has been done concerning ordinal-on-ordinal regression. Motivated by a consumer test and a survey on the willingness to pay for luxury food products consisting of Likert-type items, we propose a strategy for smoothing and selecting ordinally scaled predictors in the cumulative logit model. First, the group lasso is modified by the use of difference penalties on neighboring dummy coefficients, thus taking into account the predictors' ordinal structure. Second, a fused lasso-type penalty is presented for the fusion of predictor categories and factor selection. The performance of both approaches is evaluated in simulation studies and on real-world data."}, "https://arxiv.org/abs/2310.16213": {"title": "Bayes Factors Based on Test Statistics and Non-Local Moment Prior Densities", "link": "https://arxiv.org/abs/2310.16213", "description": "arXiv:2310.16213v2 Announce Type: replace \nAbstract: We describe Bayes factors based on z, t, $\\chi^2$, and F statistics when non-local moment prior distributions are used to define alternative hypotheses. The non-local alternative prior distributions are centered on standardized effects. The prior densities include a dispersion parameter that can be used to model prior precision and the variation of effect sizes across replicated experiments. We examine the convergence rates of Bayes factors under true null and true alternative hypotheses and show how these Bayes factors can be used to construct Bayes factor functions. An example illustrates the application of resulting Bayes factors to psychological experiments."}, "https://arxiv.org/abs/2407.18341": {"title": "Shrinking Coarsened Win Ratio and Testing of Composite Endpoint", "link": "https://arxiv.org/abs/2407.18341", "description": "arXiv:2407.18341v1 Announce Type: new \nAbstract: Composite endpoints consisting of both terminal and non-terminal events, such as death and hospitalization, are frequently used as primary endpoints in cardiovascular clinical trials. The Win Ratio method (WR) proposed by Pocock et al. (2012) [1] employs a hierarchical structure to combine fatal and non-fatal events by giving death information an absolute priority, which adversely affects power if the treatment effect is mainly on the non-fatal outcomes. 
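Before the SCWR proposal that follows, a toy calculation may help make the hierarchical comparison behind the standard win ratio concrete. This is a simplified sketch (all events assumed observed over a common follow-up window, no censoring handling), not the method proposed in this entry:

```python
# Toy standard win ratio (Pocock et al., 2012): compare every treated/control
# pair first on death, then on hospitalization; simplified, no censoring.
import numpy as np

def pairwise_result(t, c):
    # each subject: (death_time or np.inf, hosp_time or np.inf)
    for k in (0, 1):                       # 0 = death (priority), 1 = hospitalization
        if t[k] != c[k]:
            return "win" if t[k] > c[k] else "loss"   # later (or no) event wins
    return "tie"

rng = np.random.default_rng(4)
def arm(n, death_rate, hosp_rate):
    death = np.where(rng.random(n) < death_rate, rng.uniform(0, 5, n), np.inf)
    hosp = np.where(rng.random(n) < hosp_rate, rng.uniform(0, 5, n), np.inf)
    return list(zip(death, hosp))

treated, control = arm(100, 0.15, 0.30), arm(100, 0.20, 0.45)
results = [pairwise_result(t, c) for t in treated for c in control]
wins, losses = results.count("win"), results.count("loss")
print("win ratio:", round(wins / losses, 2))
```

Because death is compared first, a benefit that acts mostly on hospitalization only enters through pairs tied on death, which is the power issue the coarsened stages below are designed to mitigate.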
We hereby propose the Shrinking Coarsened Win Ratio method (SCWR), which relaxes the strict hierarchical structure of the standard WR by adding stages with coarsened thresholds that shrink to zero. A weighted adaptive approach is developed to determine the thresholds in SCWR. This method preserves the good statistical properties of the standard WR and has a greater capacity to detect treatment effects on non-fatal events. We show that SCWR has an overall more favorable performance than WR in our simulation study, which addresses the influence of follow-up time, the association between events, and the treatment effect levels, as well as in a case study based on the Digitalis Investigation Group clinical trial data."}, "https://arxiv.org/abs/2407.18389": {"title": "Doubly Robust Targeted Estimation of Conditional Average Treatment Effects for Time-to-event Outcomes with Competing Risks", "link": "https://arxiv.org/abs/2407.18389", "description": "arXiv:2407.18389v1 Announce Type: new \nAbstract: In recent years, precision treatment strategies have gained significant attention in medical research, particularly for patient care. We propose a novel framework for estimating conditional average treatment effects (CATE) in time-to-event data with competing risks, using ICU patients with sepsis as an illustrative example. Our approach, based on cumulative incidence functions and targeted maximum likelihood estimation (TMLE), achieves both asymptotic efficiency and double robustness. The primary contribution of this work lies in our derivation of the efficient influence function for the targeted causal parameter, CATE. We establish theoretical proofs for these properties and subsequently confirm them through simulations. Our TMLE framework is flexible, accommodating various regression and machine learning models, making it applicable in diverse scenarios. In order to identify variables contributing to treatment effect heterogeneity and to facilitate accurate estimation of CATE, we develop two distinct variable importance measures (VIMs). This work provides a powerful tool for optimizing personalized treatment strategies, furthering the pursuit of precision medicine."}, "https://arxiv.org/abs/2407.18432": {"title": "Accounting for reporting delays in real-time phylodynamic analyses with preferential sampling", "link": "https://arxiv.org/abs/2407.18432", "description": "arXiv:2407.18432v1 Announce Type: new \nAbstract: The COVID-19 pandemic demonstrated that fast and accurate analysis of continually collected infectious disease surveillance data is crucial for situational awareness and policy making. Coalescent-based phylodynamic analysis can use genetic sequences of a pathogen to estimate changes in its effective population size, a measure of genetic diversity. These changes in effective population size can be connected to the changes in the number of infections in the population of interest under certain conditions. Phylodynamics is an important set of tools because its methods are often resilient to the ascertainment biases present in traditional surveillance data (e.g., preferentially testing symptomatic individuals). Unfortunately, it takes weeks or months to sequence and deposit the sampled pathogen genetic sequences into a database, making them available for such analyses. These reporting delays severely decrease the precision of phylodynamic methods closer to the present time, and for some models can lead to extreme biases. 
Here we present a method that affords reliable estimation of the effective population size trajectory closer to the time of data collection, allowing for policy decisions to be based on more recent data. Our work uses readily available historic times between sampling and sequencing for a population of interest, and incorporates this information into the sampling model to mitigate the effects of reporting delay in real-time analyses. We illustrate our methodology on simulated data and on SARS-CoV-2 sequences collected in the state of Washington in 2021."}, "https://arxiv.org/abs/2407.18612": {"title": "Integration of Structural Equation Modeling and Bayesian Networks in the Context of Causal Inference: A Case Study on Personal Positive Youth Development", "link": "https://arxiv.org/abs/2407.18612", "description": "arXiv:2407.18612v1 Announce Type: new \nAbstract: In this study, the combined use of structural equation modeling (SEM) and Bayesian network modeling (BNM) in causal inference analysis is revisited. The perspective highlights the debate between proponents of using BNM as either an exploratory phase or even as the sole phase in the definition of structural models, and those advocating for SEM as the superior alternative for exploratory analysis. The individual strengths and limitations of SEM and BNM are recognized, but this exploration evaluates the contention between utilizing SEM's robust structural inference capabilities and the dynamic probabilistic modeling offered by BNM. A case study of the work of \citet{balaguer_2022} on a structural model for personal positive youth development (\textit{PYD}) as a function of positive parenting (\textit{PP}) and perception of the climate and functioning of the school (\textit{CFS}) is presented. The paper ultimately presents a clear stance on the analytical primacy of SEM in exploratory causal analysis, while acknowledging the potential of BNM in subsequent phases."}, "https://arxiv.org/abs/2407.18721": {"title": "Ensemble Kalman inversion approximate Bayesian computation", "link": "https://arxiv.org/abs/2407.18721", "description": "arXiv:2407.18721v1 Announce Type: new \nAbstract: Approximate Bayesian computation (ABC) is the most popular approach to inferring parameters in the case where the data model is specified in the form of a simulator. It is not possible to directly implement standard Monte Carlo methods for inference in such a model, due to the likelihood not being available to evaluate pointwise. The main idea of ABC is to perform inference on an alternative model with an approximate likelihood (the ABC likelihood), estimated at each iteration from points simulated from the data model. The central challenge of ABC is then to trade off the bias (introduced by approximating the model) against the variance introduced by estimating the ABC likelihood. Stabilising the variance of the ABC likelihood requires a computational cost that is exponential in the dimension of the data, thus the most common approach to reducing variance is to perform inference conditional on summary statistics. In this paper we introduce a new approach to estimating the ABC likelihood: using iterative ensemble Kalman inversion (IEnKI) (Iglesias, 2016; Iglesias et al., 2018). We first introduce new estimators of the marginal likelihood in the case of a Gaussian data model using the IEnKI output, then show how this may be used in ABC. 
Performance is illustrated on the Lotka-Volterra model, where we observe substantial improvements over standard ABC and other commonly-used approaches."}, "https://arxiv.org/abs/2407.18835": {"title": "Robust Estimation of Polychoric Correlation", "link": "https://arxiv.org/abs/2407.18835", "description": "arXiv:2407.18835v1 Announce Type: new \nAbstract: Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust to partial misspecification of the polychoric model, that is, the model is only misspecified for an unknown fraction of observations, for instance (but not limited to) careless respondents. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation and is consistent as well as asymptotically normally distributed. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify."}, "https://arxiv.org/abs/2407.18885": {"title": "Simulation Experiment Design for Calibration via Active Learning", "link": "https://arxiv.org/abs/2407.18885", "description": "arXiv:2407.18885v1 Announce Type: new \nAbstract: Simulation models often have parameters as input and return outputs to understand the behavior of complex systems. Calibration is the process of estimating the values of the parameters in a simulation model in light of observed data from the system that is being simulated. When simulation models are expensive, emulators are built with simulation data as a computationally efficient approximation of an expensive model. An emulator then can be used to predict model outputs, instead of repeatedly running an expensive simulation model during the calibration process. Sequential design with an intelligent selection criterion can guide the process of collecting simulation data to build an emulator, making the calibration process more efficient and effective. This article proposes two novel criteria for sequentially acquiring new simulation data in an active learning setting by considering uncertainties on the posterior density of parameters. Analysis of several simulation experiments and real-data simulation experiments from epidemiology demonstrates that proposed approaches result in improved posterior and field predictions."}, "https://arxiv.org/abs/2407.18905": {"title": "The nph2ph-transform: applications to the statistical analysis of completed clinical trials", "link": "https://arxiv.org/abs/2407.18905", "description": "arXiv:2407.18905v1 Announce Type: new \nAbstract: We present several illustrations from completed clinical trials on a statistical approach that allows us to gain useful insights regarding the time dependency of treatment effects. Our approach leans on a simple proposition: all non-proportional hazards (NPH) models are equivalent to a proportional hazards model. The nph2ph transform brings an NPH model into a PH form. 
We often find very simple approximations for this transform, enabling us to analyze complex NPH observations as though they had arisen under proportional hazards. Many techniques become available to us, and we use these to understand treatment effects better."}, "https://arxiv.org/abs/2407.18314": {"title": "Higher Partials of fStress", "link": "https://arxiv.org/abs/2407.18314", "description": "arXiv:2407.18314v1 Announce Type: cross \nAbstract: We define *fDistances*, which generalize Euclidean distances, squared distances, and log distances. The least squares loss function to fit fDistances to dissimilarity data is *fStress*. We give formulas and R/C code to compute partial derivatives of orders one to four of fStress, relying heavily on the use of Fa\`a di Bruno's chain rule formula for higher derivatives."}, "https://arxiv.org/abs/2407.18698": {"title": "Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation", "link": "https://arxiv.org/abs/2407.18698", "description": "arXiv:2407.18698v1 Announce Type: cross \nAbstract: Decoding from the output distributions of large language models to produce high-quality text is a complex challenge in language modeling. Various approaches, such as beam search, sampling with temperature, $k$-sampling, nucleus $p$-sampling, typical decoding, contrastive decoding, and contrastive search, have been proposed to address this problem, aiming to improve coherence and diversity, as well as resemblance to human-generated text. In this study, we introduce adaptive contrastive search, a novel decoding strategy extending contrastive search by incorporating an adaptive degeneration penalty, guided by the estimated uncertainty of the model at each generation step. This strategy is designed to enhance both the creativity and diversity of the language modeling process while at the same time producing coherent and high-quality generated text output. Our findings indicate performance enhancement in both aspects, across different model architectures and datasets, underscoring the effectiveness of our method in text generation tasks. Our code base, datasets, and models are publicly available."}, "https://arxiv.org/abs/2407.18755": {"title": "Score matching through the roof: linear, nonlinear, and latent variables causal discovery", "link": "https://arxiv.org/abs/2407.18755", "description": "arXiv:2407.18755v1 Announce Type: cross \nAbstract: Causal discovery from observational data holds great promise, but existing methods rely on strong assumptions about the underlying causal structure, often requiring full observability of all relevant variables. We tackle these challenges by leveraging the score function $\nabla \log p(X)$ of observed variables for causal discovery and propose the following contributions. First, we generalize the existing results of identifiability with the score to additive noise models with minimal requirements on the causal mechanisms. Second, we establish conditions for inferring causal relations from the score even in the presence of hidden variables; this result is twofold: we demonstrate the score's potential as an alternative to conditional independence tests to infer the equivalence class of causal graphs with hidden variables, and we provide the necessary conditions for identifying direct causes in latent variable models. 
Building on these insights, we propose a flexible algorithm for causal discovery across linear, nonlinear, and latent variable models, which we empirically validate."}, "https://arxiv.org/abs/2201.06898": {"title": "Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period", "link": "https://arxiv.org/abs/2201.06898", "description": "arXiv:2201.06898v4 Announce Type: replace \nAbstract: We propose difference-in-differences estimators in designs where the treatment is continuously distributed at every period, as is often the case when one studies the effects of taxes, tariffs, or prices. We assume that between consecutive periods, the treatment of some units, the switchers, changes, while the treatment of other units remains constant. We show that under a placebo-testable parallel-trends assumption, averages of the slopes of switchers' potential outcomes can be nonparametrically estimated. We generalize our estimators to the instrumental-variable case. We use our estimators to estimate the price-elasticity of gasoline consumption."}, "https://arxiv.org/abs/2209.04329": {"title": "Heterogeneous Treatment Effect Bounds under Sample Selection with an Application to the Effects of Social Media on Political Polarization", "link": "https://arxiv.org/abs/2209.04329", "description": "arXiv:2209.04329v5 Announce Type: replace \nAbstract: We propose a method for estimation and inference for bounds for heterogeneous causal effect parameters in general sample selection models where the treatment can affect whether an outcome is observed and no exclusion restrictions are available. The method provides conditional effect bounds as functions of policy relevant pre-treatment variables. It allows for conducting valid statistical inference on the unidentified conditional effects. We use a flexible debiased/double machine learning approach that can accommodate non-linear functional forms and high-dimensional confounders. Easily verifiable high-level conditions for estimation, misspecification robust confidence intervals, and uniform confidence bands are provided as well. We re-analyze data from a large scale field experiment on Facebook on counter-attitudinal news subscription with attrition. Our method yields substantially tighter effect bounds compared to conventional methods and suggests depolarization effects for younger users."}, "https://arxiv.org/abs/2307.16502": {"title": "Percolated stochastic block model via EM algorithm and belief propagation with non-backtracking spectra", "link": "https://arxiv.org/abs/2307.16502", "description": "arXiv:2307.16502v5 Announce Type: replace-cross \nAbstract: Whereas Laplacian and modularity based spectral clustering is apt to dense graphs, recent results show that for sparse ones, the non-backtracking spectrum is the best candidate to find assortative clusters of nodes. Here belief propagation in the sparse stochastic block model is derived with arbitrary given model parameters that results in a non-linear system of equations; with linear approximation, the spectrum of the non-backtracking matrix is able to specify the number $k$ of clusters. Then the model parameters themselves can be estimated by the EM algorithm. Bond percolation in the assortative model is considered in the following two senses: the within- and between-cluster edge probabilities decrease with the number of nodes and edges coming into existence in this way are retained with probability $\\beta$. 
As a consequence, the optimal $k$ is the number of structural real eigenvalues (greater than $\sqrt{c}$, where $c$ is the average degree) of the non-backtracking matrix of the graph. Assuming these eigenvalues $\mu_1 >\dots > \mu_k$ are distinct, the multiple phase transitions obtained for $\beta$ are $\beta_i =\frac{c}{\mu_i^2}$; further, at $\beta_i$ the number of detectable clusters is $i$, for $i=1,\dots ,k$. Inflation-deflation techniques are also discussed to classify the nodes themselves, which can form the basis of sparse spectral clustering."}, "https://arxiv.org/abs/2309.03731": {"title": "Using representation balancing to learn conditional-average dose responses from clustered data", "link": "https://arxiv.org/abs/2309.03731", "description": "arXiv:2309.03731v2 Announce Type: replace-cross \nAbstract: Estimating a unit's responses to interventions with an associated dose, the \"conditional average dose response\" (CADR), is relevant in a variety of domains, from healthcare to business, economics, and beyond. Such a response typically needs to be estimated from observational data, which introduces several challenges. That is why the machine learning (ML) community has proposed several tailored CADR estimators. Yet, most of these methods require strong assumptions on the distribution of the data and the assignment of interventions, which go beyond the standard assumptions in causal inference. Whereas previous works have so far focused on smooth shifts in covariate distributions across doses, in this work we study estimating CADR from clustered data, where different doses are assigned to different segments of a population. On a novel benchmarking dataset, we show the impacts of clustered data on model performance and propose an estimator, CBRNet, that learns cluster-agnostic and hence dose-agnostic covariate representations through representation balancing for unbiased CADR inference. We run extensive experiments to illustrate the workings of our method and compare it with the state of the art in ML for CADR estimation."}, "https://arxiv.org/abs/2401.13929": {"title": "HMM for Discovering Decision-Making Dynamics Using Reinforcement Learning Experiments", "link": "https://arxiv.org/abs/2401.13929", "description": "arXiv:2401.13929v2 Announce Type: replace-cross \nAbstract: Major depressive disorder (MDD) presents challenges in diagnosis and treatment due to its complex and heterogeneous nature. Emerging evidence indicates that reward processing abnormalities may serve as a behavioral marker for MDD. To measure reward processing, patients perform computer-based behavioral tasks that involve making choices or responding to stimuli that are associated with different outcomes. Reinforcement learning (RL) models are fitted to extract parameters that measure various aspects of reward processing to characterize how patients make decisions in behavioral tasks. Recent findings suggest the inadequacy of characterizing reward learning solely based on a single RL model; instead, there may be a switching of decision-making processes between multiple strategies. An important scientific question is how the dynamics of learning strategies in decision-making affect the reward learning ability of individuals with MDD. Motivated by the probabilistic reward task (PRT) within the EMBARC study, we propose a novel RL-HMM framework for analyzing reward-based decision-making. 
Our model accommodates learning strategy switching between two distinct approaches under a hidden Markov model (HMM): subjects making decisions based on the RL model or opting for random choices. We account for a continuous RL state space and allow time-varying transition probabilities in the HMM. We introduce a computationally efficient EM algorithm for parameter estimation and employ a nonparametric bootstrap for inference. We apply our approach to the EMBARC study to show that MDD patients are less engaged in RL compared to the healthy controls, and engagement is associated with brain activities in the negative affect circuitry during an emotional conflict task."}, "https://arxiv.org/abs/2407.19030": {"title": "Multimodal data integration and cross-modal querying via orchestrated approximate message passing", "link": "https://arxiv.org/abs/2407.19030", "description": "arXiv:2407.19030v1 Announce Type: new \nAbstract: The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, these reference data sets are often queried later by new subjects that are only partially observed. Leveraging the asymptotic normality of estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the prowess of both the data integration and the prediction set construction algorithms on a tri-modal single-cell dataset."}, "https://arxiv.org/abs/2407.19057": {"title": "Partial Identification of the Average Treatment Effect with Stochastic Counterfactuals and Discordant Twins", "link": "https://arxiv.org/abs/2407.19057", "description": "arXiv:2407.19057v1 Announce Type: new \nAbstract: We develop a novel approach to partially identify causal estimands, such as the average treatment effect (ATE), from observational data. To better satisfy the stable unit treatment value assumption (SUTVA), we utilize stochastic counterfactuals within a propensity-prognosis model of the data generating process. For more precise identification, we utilize knowledge of discordant twin outcomes as evidence for randomness in the data generating process. Our approach culminates with a constrained optimization problem; the solution gives upper and lower bounds for the ATE. We demonstrate the applicability of the proposed methodology with three example applications."}, "https://arxiv.org/abs/2407.19135": {"title": "Bayesian Mapping of Mortality Clusters", "link": "https://arxiv.org/abs/2407.19135", "description": "arXiv:2407.19135v1 Announce Type: new \nAbstract: Disease mapping analyses the distribution of several diseases within a territory. Primary goals include identifying areas with unexpected changes in mortality rates, studying the relation among multiple diseases, and dividing the analysed territory into clusters based on the observed levels of disease incidence or mortality. In this work, we focus on detecting spatial mortality clusters, which occur when neighbouring areas within a territory exhibit similar mortality levels due to one or more diseases. 
When multiple death causes are examined together, it is relevant to identify both the spatial boundaries of the clusters and the diseases that lead to their formation. However, existing methods in literature struggle to address this dual problem effectively and simultaneously. To overcome these limitations, we introduce Perla, a multivariate Bayesian model that clusters areas in a territory according to the observed mortality rates of multiple death causes, also exploiting the information of external covariates. Our model incorporates the spatial data structure directly into the clustering probabilities by leveraging the stick-breaking formulation of the multinomial distribution. Additionally, it exploits suitable global-local shrinkage priors to ensure that the detection of clusters is driven by concrete differences across mortality levels while excluding spurious differences. We propose an MCMC algorithm for posterior inference that consists of closed-form Gibbs sampling moves for nearly every model parameter, without requiring complex tuning operations. This work is primarily motivated by a case study on the territory of the local unit ULSS6 Euganea within the Italian public healthcare system. To demonstrate the flexibility and effectiveness of our methodology, we also validate Perla with a series of simulation experiments and an extensive case study on mortality levels in U.S. counties."}, "https://arxiv.org/abs/2407.19171": {"title": "Assessing Spatial Disparities: A Bayesian Linear Regression Approach", "link": "https://arxiv.org/abs/2407.19171", "description": "arXiv:2407.19171v1 Announce Type: new \nAbstract: Epidemiological investigations of regionally aggregated spatial data often involve detecting spatial health disparities between neighboring regions on a map of disease mortality or incidence rates. Analyzing such data introduces spatial dependence among the health outcomes and seeks to report statistically significant spatial disparities by delineating boundaries that separate neighboring regions with widely disparate health outcomes. However, current statistical methods are often inadequate for appropriately defining what constitutes a spatial disparity and for constructing rankings of posterior probabilities that are robust under changes to such a definition. More specifically, non-parametric Bayesian approaches endow spatial effects with discrete probability distributions using Dirichlet processes, or generalizations thereof, and rely upon computationally intensive methods for inferring on weakly identified parameters. In this manuscript, we introduce a Bayesian linear regression framework to detect spatial health disparities. This enables us to exploit Bayesian conjugate posterior distributions in a more accessible manner and accelerate computation significantly over existing Bayesian non-parametric approaches. 
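The computational gain mentioned above comes from conjugacy. As a generic illustration (a standard conjugate update, not the disparity-detection model itself), the Normal-Inverse-Gamma prior yields a closed-form posterior for Gaussian linear regression; the priors and simulated data below are placeholders.

```python
# Closed-form Normal-Inverse-Gamma posterior for Gaussian linear regression
# (generic conjugate update; the spatial-disparity model adds structure on top).
import numpy as np

def nig_posterior(X, y, m0, V0, a0, b0):
    n = len(y)
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    mn = Vn @ (V0_inv @ m0 + X.T @ y)
    an = a0 + n / 2.0
    bn = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0 - mn @ np.linalg.inv(Vn) @ mn)
    # beta | sigma^2, y ~ N(mn, sigma^2 Vn);  sigma^2 | y ~ Inverse-Gamma(an, bn)
    return mn, Vn, an, bn

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.5 * rng.standard_normal(200)

mn, Vn, an, bn = nig_posterior(X, y, m0=np.zeros(3), V0=10 * np.eye(3), a0=2.0, b0=1.0)
print("posterior mean of beta:", np.round(mn, 2))
```

Because every posterior quantity above is available in closed form, no MCMC over weakly identified parameters is needed, which is the source of the acceleration over the non-parametric alternatives.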
Simulation experiments conducted over a county map of the entire United States demonstrate the effectiveness of our method, and we apply it to a data set from the Institute of Health Metrics and Evaluation (IHME) on age-standardized US county-level estimates of mortality rates across tracheal, bronchus, and lung cancer."}, "https://arxiv.org/abs/2407.19269": {"title": "A Bayesian Approach Toward Robust Multidimensional Ellipsoid-Specific Fitting", "link": "https://arxiv.org/abs/2407.19269", "description": "arXiv:2407.19269v1 Announce Type: new \nAbstract: This work presents a novel and effective method for fitting multidimensional ellipsoids to scattered data contaminated by noise and outliers. We approach the problem as a Bayesian parameter estimation process and maximize the posterior probability of a certain ellipsoidal solution given the data. We establish a more robust correlation between these points based on the predictive distribution within the Bayesian framework. We incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain, ensuring ellipsoid-specific results regardless of inputs. We then establish the connection between measurement points and model data via Bayes' rule to enhance the method's robustness against noise. Because it is independent of the spatial dimension, the proposed method not only delivers high-quality fittings to challenging elongated ellipsoids but also generalizes well to multidimensional spaces. To address outlier disturbances, often overlooked by previous approaches, we further introduce a uniform distribution on top of the predictive distribution to significantly enhance the algorithm's robustness against outliers. We introduce an $\epsilon$-accelerated technique to expedite the convergence of EM considerably. To the best of our knowledge, this is the first comprehensive method capable of performing multidimensional ellipsoid-specific fitting within the Bayesian optimization paradigm under diverse disturbances. We evaluate it across lower and higher dimensional spaces in the presence of heavy noise, outliers, and substantial variations in axis ratios. Also, we apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks."}, "https://arxiv.org/abs/2407.19329": {"title": "Normality testing after transformation", "link": "https://arxiv.org/abs/2407.19329", "description": "arXiv:2407.19329v1 Announce Type: new \nAbstract: Transforming a random variable to improve its normality leads to a follow-up test for whether the transformed variable follows a normal distribution. Previous work has shown that the Anderson-Darling test for normality suffers from resubstitution bias following Box-Cox transformation, and indicates normality much too often. The work reported here extends this by adding the Shapiro-Wilk statistic and the two-parameter Box-Cox transformation, all of which show severe bias. We also develop a recalibration to correct the bias in all four settings. The methodology was motivated by finding reference ranges in biomarker studies where parametric analysis, possibly on a power-transformed measurand, can be much more informative than a nonparametric one. 
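A quick way to see the resubstitution bias described in the normality-testing entry above: simulate data whose Box-Cox transform is exactly normal, re-estimate the transformation on each sample, and check how often Shapiro-Wilk rejects at the 5% level. This is an illustrative sketch only (the paper's recalibration is not reproduced here); because the same data choose the transformation and take the test, the rejection rate falls well below the nominal level.

```python
# Resubstitution bias after Box-Cox: estimate lambda by maximum likelihood and
# test the transformed sample with Shapiro-Wilk (illustrative sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps, alpha = 50, 2000, 0.05
rejections = 0
for _ in range(reps):
    x = np.exp(rng.standard_normal(n))       # lognormal: Box-Cox (lambda=0) is exactly normal
    x_bc, lam_hat = stats.boxcox(x)           # lambda estimated from the same sample
    w_stat, p_value = stats.shapiro(x_bc)     # test applied to the fitted transform
    if p_value < alpha:
        rejections += 1
print("empirical rejection rate:", rejections / reps, "(nominal:", alpha, ")")
```

Replacing the standard reference distribution with one simulated under this fit-then-test procedure is the kind of recalibration the entry describes.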
It is illustrated with a data set on biomarkers."}, "https://arxiv.org/abs/2407.19339": {"title": "Accounting for Nonresponse in Election Polls: Total Margin of Error", "link": "https://arxiv.org/abs/2407.19339", "description": "arXiv:2407.19339v1 Announce Type: new \nAbstract: The potential impact of nonresponse on election polls is well known and frequently acknowledged. Yet measurement and reporting of polling error has focused solely on sampling error, represented by the margin of error of a poll. Survey statisticians have long recommended measurement of the total survey error of a sample estimate by its mean square error (MSE), which jointly measures sampling and non-sampling errors. Extending the conventional language of polling, we think it reasonable to use the square root of maximum MSE to measure the total margin of error. This paper demonstrates how to measure the potential impact of nonresponse using the concept of the total margin of error, which we argue should be a standard feature in the reporting of election poll results. We first show how to jointly measure statistical imprecision and response bias when a pollster lacks any knowledge of the candidate preferences of non-responders. We then extend the analysis to settings where the pollster has partial knowledge that bounds the preferences of non-responders."}, "https://arxiv.org/abs/2407.19378": {"title": "Penalized Principal Component Analysis for Large-dimension Factor Model with Group Pursuit", "link": "https://arxiv.org/abs/2407.19378", "description": "arXiv:2407.19378v1 Announce Type: new \nAbstract: This paper investigates the intrinsic group structures within the framework of large-dimensional approximate factor models, which portrays homogeneous effects of the common factors on the individuals that fall into the same group. To this end, we propose a fusion Penalized Principal Component Analysis (PPCA) method and derive a closed-form solution for the $\\ell_2$-norm optimization problem. We also show the asymptotic properties of our proposed PPCA estimates. With the PPCA estimates as an initialization, we identify the unknown group structure by a combination of the agglomerative hierarchical clustering algorithm and an information criterion. Then the factor loadings and factor scores are re-estimated conditional on the identified latent groups. Under some regularity conditions, we establish the consistency of the membership estimators as well as that of the group number estimator derived from the information criterion. Theoretically, we show that the post-clustering estimators for the factor loadings and factor scores with group pursuit achieve efficiency gains compared to the estimators by conventional PCA method. Thorough numerical studies validate the established theory and a real financial example illustrates the practical usefulness of the proposed method."}, "https://arxiv.org/abs/2407.19502": {"title": "Identifying arbitrary transformation between the slopes in functional regression", "link": "https://arxiv.org/abs/2407.19502", "description": "arXiv:2407.19502v1 Announce Type: new \nAbstract: In this article, we study whether the slope functions of two functional regression models in two samples are associated with any arbitrary transformation (barring constant and linear transformation) or not along the vertical axis. 
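For the total-margin-of-error abstract above, a minimal sketch of the definition it states, the square root of the maximum MSE (sampling variance plus squared worst-case bias): the `bias_bound` argument is a hypothetical bound supplied by the analyst, whereas the paper derives such bounds from explicit assumptions about non-responders.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Conventional (sampling-only) margin of error for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

def total_margin_of_error(p_hat, n, bias_bound):
    """Square root of the maximum MSE: sampling variance plus the squared
    worst-case nonresponse bias.  `bias_bound` is a hypothetical bound."""
    sampling_var = p_hat * (1.0 - p_hat) / n
    return math.sqrt(sampling_var + bias_bound ** 2)

print(margin_of_error(0.52, 1000))                         # ~0.031
print(total_margin_of_error(0.52, 1000, bias_bound=0.03))  # ~0.034
```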
In order to address this issue, a statistical hypothesis testing problem is formalized, and the test statistic is formed based on the estimated second derivative of the unknown transformation. The asymptotic properties of the test statistic are investigated using some advanced techniques related to the empirical process. Moreover, to implement the test for small-sample data, a bootstrap algorithm is proposed, and it is shown that the bootstrap version of the test is as good as the original test for sufficiently large sample sizes. Furthermore, the utility of the proposed methodology is shown for simulated data sets, and DTI data are analyzed using it."}, "https://arxiv.org/abs/2407.19509": {"title": "Heterogeneous Grouping Structures in Panel Data", "link": "https://arxiv.org/abs/2407.19509", "description": "arXiv:2407.19509v1 Announce Type: new \nAbstract: In this paper we examine the existence of heterogeneity within a group, in panels with latent grouping structure. The assumption of within-group homogeneity is prevalent in this literature, implying that the formation of groups alleviates cross-sectional heterogeneity, regardless of the prior knowledge of groups. While the latter hypothesis makes inference powerful, it can often be restrictive. We allow for models with richer heterogeneity that can be found both in the cross-section and within a group, without imposing the simple assumption that all groups must be heterogeneous. We further contribute to the method proposed by \\cite{su2016identifying}, by showing that the model parameters can be consistently estimated and the groups, while unknown, can be identified in the presence of different types of heterogeneity. Within the same framework we consider the validity of assuming both cross-sectional and within-group homogeneity, using testing procedures. Simulations demonstrate good finite-sample performance of the approach in both classification and estimation, while empirical applications across several datasets provide evidence of multiple clusters, as well as rejecting the hypothesis of within-group homogeneity."}, "https://arxiv.org/abs/2407.19558": {"title": "Identification and Inference with Invalid Instruments", "link": "https://arxiv.org/abs/2407.19558", "description": "arXiv:2407.19558v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are widely used to study the causal effect of an exposure on an outcome in the presence of unmeasured confounding. IVs require an instrument, a variable that is (A1) associated with the exposure, (A2) has no direct effect on the outcome except through the exposure, and (A3) is not related to unmeasured confounders. Unfortunately, finding variables that satisfy conditions (A2) or (A3) can be challenging in practice. This paper reviews works where instruments may not satisfy conditions (A2) or (A3), which we refer to as invalid instruments. We review identification and inference under different violations of (A2) or (A3), specifically under linear models, non-linear models, and heteroskedastic models. 
We conclude with an empirical comparison of various methods by re-analyzing the effect of body mass index on systolic blood pressure from the UK Biobank."}, "https://arxiv.org/abs/2407.19602": {"title": "Metropolis--Hastings with Scalable Subsampling", "link": "https://arxiv.org/abs/2407.19602", "description": "arXiv:2407.19602v1 Announce Type: new \nAbstract: The Metropolis-Hastings (MH) algorithm is one of the most widely used Markov Chain Monte Carlo schemes for generating samples from Bayesian posterior distributions. The algorithm is asymptotically exact, flexible and easy to implement. However, in the context of Bayesian inference for large datasets, evaluating the likelihood on the full data for thousands of iterations until convergence can be prohibitively expensive. This paper introduces a new subsample MH algorithm that satisfies detailed balance with respect to the target posterior and utilises control variates to enable exact, efficient Bayesian inference on datasets with large numbers of observations. Through theoretical results, simulation experiments and real-world applications on certain generalised linear models, we demonstrate that our method requires substantially smaller subsamples and is computationally more efficient than the standard MH algorithm and other exact subsample MH algorithms."}, "https://arxiv.org/abs/2407.19618": {"title": "Experimenting on Markov Decision Processes with Local Treatments", "link": "https://arxiv.org/abs/2407.19618", "description": "arXiv:2407.19618v1 Announce Type: new \nAbstract: As service systems grow increasingly complex and dynamic, many interventions become localized, available and effective only in specific states. This paper investigates experiments with local treatments on a widely-used class of dynamic models, Markov Decision Processes (MDPs). In particular, we focus on utilizing the local structure to improve the inference efficiency of the average treatment effect. We begin by demonstrating the efficiency of classical inference methods, including model-based estimation and temporal difference learning under a fixed policy, as well as classical A/B testing with general treatments. We then introduce a variance reduction technique that exploits the local treatment structure by sharing information for states unaffected by the treatment policy. Our new estimator effectively overcomes the variance lower bound for general treatments while matching the more stringent lower bound incorporating the local treatment structure. Furthermore, for a major part of the variance, our estimator optimally achieves a reduction that is linear in the number of test arms. Finally, we explore scenarios with perfect knowledge of the control arm and design estimators that further improve inference efficiency."}, "https://arxiv.org/abs/2407.19624": {"title": "Nonparametric independence tests in high-dimensional settings, with applications to the genetics of complex disease", "link": "https://arxiv.org/abs/2407.19624", "description": "arXiv:2407.19624v1 Announce Type: new \nAbstract: [PhD thesis of FCP.] Nowadays, genetics studies large numbers of very diverse variables. Mathematical statistics has evolved in parallel to its applications, with much recent interest in high-dimensional settings. In the genetics of human common disease, a number of relevant problems can be formulated as tests of independence. We show how defining adequate premetric structures on the support spaces of the genetic data allows for novel approaches to such testing. 
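For context on the Metropolis-Hastings abstract above, here is a baseline full-data random-walk MH sketch; every iteration evaluates the complete log-posterior, which is exactly the cost the paper's control-variate subsampling scheme is designed to avoid. The subsample algorithm itself is not reproduced here.

```python
import numpy as np

def random_walk_mh(log_post, theta0, n_iter=5000, step=0.5, rng=None):
    """Baseline full-data random-walk Metropolis-Hastings sampler."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)   # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:                 # accept/reject
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws

# toy target: a standard normal posterior in two dimensions
samples = random_walk_mh(lambda th: -0.5 * np.sum(th ** 2), theta0=[2.0, -2.0])
print(samples.mean(axis=0))   # should be close to (0, 0)
```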
This yields a solid theoretical framework, which reflects the underlying biology, and allows for computationally-efficient implementations. For each problem, we provide mathematical results, simulations and the application to real data."}, "https://arxiv.org/abs/2407.19659": {"title": "Estimating heterogeneous treatment effects by W-MCM based on Robust reduced rank regression", "link": "https://arxiv.org/abs/2407.19659", "description": "arXiv:2407.19659v1 Announce Type: new \nAbstract: Recently, from the personalized medicine perspective, there has been an increased demand to identify subgroups of subjects for whom treatment is effective. Consequently, the estimation of heterogeneous treatment effects (HTE) has been attracting attention. While various estimation methods have been developed for a single outcome, there are still limited approaches for estimating HTE for multiple outcomes. Accurately estimating HTE remains a challenge especially for datasets where there is a high correlation between outcomes or the presence of outliers. Therefore, this study proposes a method that uses a robust reduced-rank regression framework to estimate treatment effects and identify effective subgroups. This approach allows the consideration of correlations between treatment effects and the estimation of treatment effects with an accurate low-rank structure. It also provides robust estimates for outliers. This study demonstrates that, when treatment effects are estimated using the reduced rank regression framework with an appropriate rank, the expected value of the estimator equals the treatment effect. Finally, we illustrate the effectiveness and interpretability of the proposed method through simulations and real data examples."}, "https://arxiv.org/abs/2407.19744": {"title": "Robust classification via finite mixtures of matrix-variate skew t distributions", "link": "https://arxiv.org/abs/2407.19744", "description": "arXiv:2407.19744v1 Announce Type: new \nAbstract: Analysis of matrix-variate data is becoming increasingly common in the literature, particularly in the field of clustering and classification. It is well-known that real data, including real matrix-variate data, often exhibit high levels of asymmetry. To address this issue, one common approach is to introduce a tail or skewness parameter to a symmetric distribution. In this regard, we introduced here a new distribution called the matrix-variate skew t distribution (MVST), which provides flexibility in terms of heavy tail and skewness. We then conduct a thorough investigation of various characterizations and probabilistic properties of the MVST distribution. We also explore extensions of this distribution to a finite mixture model. To estimate the parameters of the MVST distribution, we develop an efficient EM-type algorithm that computes maximum likelihood (ML) estimates of the model parameters. To validate the effectiveness and usefulness of the developed models and associated methods, we perform empirical experiments using simulated data as well as three real data examples. Our results demonstrate the efficacy of the developed approach in handling asymmetric matrix-variate data."}, "https://arxiv.org/abs/2407.19978": {"title": "Inferring High-Dimensional Dynamic Networks Changing with Multiple Covariates", "link": "https://arxiv.org/abs/2407.19978", "description": "arXiv:2407.19978v1 Announce Type: new \nAbstract: High-dimensional networks play a key role in understanding complex relationships. 
These relationships are often dynamic in nature and can change with multiple external factors (e.g., time and groups). Methods for estimating graphical models are often restricted to static graphs or graphs that can change with a single covariate (e.g., time). We propose a novel class of graphical models, the covariate-varying network (CVN), that can change with multiple external covariates.\n In order to introduce sparsity, we apply a $L_1$-penalty to the precision matrices of $m \\geq 2$ graphs we want to estimate. These graphs often show a level of similarity. In order to model this 'smoothness', we introduce the concept of a 'meta-graph' where each node in the meta-graph corresponds to an individual graph in the CVN. The (weighted) adjacency matrix of the meta-graph represents the strength with which similarity is enforced between the $m$ graphs.\n The resulting optimization problem is solved by employing an alternating direction method of multipliers. We test our method using a simulation study and we show its applicability by applying it to a real-world data set, the gene expression networks from the study 'German Cancer in childhood and molecular-epidemiology' (KiKme). An implementation of the algorithm in R is publicly available under https://github.com/bips-hb/cvn"}, "https://arxiv.org/abs/2407.20027": {"title": "Graphical tools for detection and control of selection bias with multiple exposures and samples", "link": "https://arxiv.org/abs/2407.20027", "description": "arXiv:2407.20027v1 Announce Type: new \nAbstract: Among recent developments in definitions and analysis of selection bias is the potential outcomes approach of Kenah (Epidemiology, 2023), which allows non-parametric analysis using single-world intervention graphs, linking selection of study participants to identification of causal effects. Mohan & Pearl (JASA, 2021) provide a framework for missing data via directed acyclic graphs augmented with nodes indicating missingness for each sometimes-missing variable, which allows for analysis of more general missing data problems but cannot easily encode scenarios in which different groups of variables are observed in specific subsamples. We give an alternative formulation of the potential outcomes framework based on conditional separable effects and indicators for selection into subsamples. This is practical for problems between the single-sample scenarios considered by Kenah and the variable-wise missingness considered by Mohan & Pearl. This simplifies identification conditions and admits generalizations to scenarios with multiple, potentially nested or overlapping study samples, as well as multiple or time-dependent exposures. We give examples of identifiability arguments for case-cohort studies, multiple or time-dependent exposures, and direct effects of selection."}, "https://arxiv.org/abs/2407.20051": {"title": "Estimating risk factors for pathogenic dose accrual from longitudinal data", "link": "https://arxiv.org/abs/2407.20051", "description": "arXiv:2407.20051v1 Announce Type: new \nAbstract: Estimating risk factors for incidence of a disease is crucial for understanding its etiology. For diseases caused by enteric pathogens, off-the-shelf statistical model-based approaches do not provide biological plausibility and ignore important sources of variability. We propose a new approach to estimating incidence risk factors built on established work in quantitative microbiological risk assessment. 
Excepting those risk factors which affect both dose accrual and within-host pathogen survival rates, our model's regression parameters are easily interpretable as the dose accrual rate ratio due to the risk factors under study. We also describe a method for leveraging information across multiple pathogens. The proposed methods are available as an R package at \\url{https://github.com/dksewell/ladie}. Our simulation study shows unacceptable coverage rates from generalized linear models, while the proposed approach maintains the nominal rate even when the model is misspecified. Finally, we demonstrate our proposed approach by applying it to Nairobian infant data obtained through the PATHOME study (\\url{https://reporter.nih.gov/project-details/10227256}), discovering the impact of various environmental factors on infant enteric infections."}, "https://arxiv.org/abs/2407.20073": {"title": "Transfer Learning Targeting Mixed Population: A Distributional Robust Perspective", "link": "https://arxiv.org/abs/2407.20073", "description": "arXiv:2407.20073v1 Announce Type: new \nAbstract: Despite recent advances in transfer learning with multiple source data sets, developments are still lacking for mixture target populations that could be approximated through a composite of the sources due to certain key factors like ethnicity in practice. To address this open problem under distributional shifts of covariates and outcome models as well as the absence of accurate labels on the target, we propose a novel approach for distributionally robust transfer learning targeting a mixture population. It learns a set of covariate-specific weights to infer the target outcome model with multiple sources, relying on a joint source mixture assumption for the target population. Then our method incorporates a group adversarial learning step to enhance the robustness against moderate violation of the joint mixture assumption. In addition, our framework allows the use of side information like a small labeled sample as guidance to avoid over-conservative results. Statistical convergence and predictive accuracy of our method are quantified through asymptotic studies. Simulation and real-world studies demonstrate that our method outperforms existing multi-source and transfer learning approaches."}, "https://arxiv.org/abs/2407.20085": {"title": "Local Level Dynamic Random Partition Models for Changepoint Detection", "link": "https://arxiv.org/abs/2407.20085", "description": "arXiv:2407.20085v1 Announce Type: new \nAbstract: Motivated by an increasing demand for models that can effectively describe features of complex multivariate time series, e.g. from sensor data in biomechanics, motion analysis, and sports science, we introduce a novel state-space modeling framework where the state equation encodes the evolution of latent partitions of the data over time. Building on the principles of dynamic linear models, our approach develops a random partition model capable of linking data partitions to previous ones over time, using a straightforward Markov structure that accounts for temporal persistence and facilitates changepoint detection. The selection of changepoints involves multiple dependent decisions, and we address this time-dependence by adopting a non-marginal false discovery rate control. 
This leads to a simple decision rule that ensures more stringent control of the false discovery rate compared to approaches that do not consider dependence. The method is efficiently implemented using a Gibbs sampling algorithm, leading to a straightforward approach compared to existing methods for dependent random partition models. Additionally, we show how the proposed method can be adapted to handle multi-view clustering scenarios. Simulation studies and the analysis of a human gesture phase dataset collected through various sensing technologies show the effectiveness of the method in clustering multivariate time series and detecting changepoints."}, "https://arxiv.org/abs/2407.19092": {"title": "Boosted generalized normal distributions: Integrating machine learning with operations knowledge", "link": "https://arxiv.org/abs/2407.19092", "description": "arXiv:2407.19092v1 Announce Type: cross \nAbstract: Applications of machine learning (ML) techniques to operational settings often face two challenges: i) ML methods mostly provide point predictions whereas many operational problems require distributional information; and ii) They typically do not incorporate the extensive body of knowledge in the operations literature, particularly the theoretical and empirical findings that characterize specific distributions. We introduce a novel and rigorous methodology, the Boosted Generalized Normal Distribution ($b$GND), to address these challenges. The Generalized Normal Distribution (GND) encompasses a wide range of parametric distributions commonly encountered in operations, and $b$GND leverages gradient boosting with tree learners to flexibly estimate the parameters of the GND as functions of covariates. We establish $b$GND's statistical consistency, thereby extending this key property to special cases studied in the ML literature that lacked such guarantees. Using data from a large academic emergency department in the United States, we show that the distributional forecasting of patient wait and service times can be meaningfully improved by leveraging findings from the healthcare operations literature. Specifically, $b$GND performs 6% and 9% better than the distribution-agnostic ML benchmark used to forecast wait and service times respectively. Further analysis suggests that these improvements translate into a 9% increase in patient satisfaction and a 4% reduction in mortality for myocardial infarction patients. Our work underscores the importance of integrating ML with operations knowledge to enhance distributional forecasts."}, "https://arxiv.org/abs/2407.19191": {"title": "Network sampling based inference for subgraph counts and clustering coefficient in a Stochastic Block Model framework with some extensions to a sparse case", "link": "https://arxiv.org/abs/2407.19191", "description": "arXiv:2407.19191v1 Announce Type: cross \nAbstract: Sampling is frequently used to collect data from large networks. In this article we provide valid asymptotic prediction intervals for subgraph counts and clustering coefficient of a population network when a network sampling scheme is used to observe the population. The theory is developed under a model based framework, where it is assumed that the population network is generated by a Stochastic Block Model (SBM). 
We study the effects of induced and ego-centric network formation, following the initial selection of nodes by Bernoulli sampling, and establish asymptotic normality of sample-based subgraph count and clustering coefficient statistics under both network formation methods. The asymptotic results are developed under a joint design and model based approach, where the effect of sampling design is not ignored. In the case of the sample-based clustering coefficient statistic, we find that a bias correction is required in the ego-centric case, but there is no such bias in the induced case. We also extend the asymptotic normality results for estimated subgraph counts to a mildly sparse SBM framework, where edge probabilities decay to zero at a slow rate. In this sparse setting we find that the scaling and the maximum allowable decay rate for edge probabilities depend on the choice of the target subgraph. We obtain an expression for this maximum allowable decay rate and our results suggest that the rate becomes slower if the target subgraph has more edges in a certain sense. The simulation results suggest that the proposed prediction intervals have excellent coverage, even when the node selection probability is small and unknown SBM parameters are replaced by their estimates. Finally, the proposed methodology is applied to a real data set."}, "https://arxiv.org/abs/2407.19399": {"title": "Large-scale Multiple Testing of Cross-covariance Functions with Applications to Functional Network Models", "link": "https://arxiv.org/abs/2407.19399", "description": "arXiv:2407.19399v1 Announce Type: cross \nAbstract: The estimation of functional networks through functional covariance and graphical models has recently attracted increasing attention in settings with high dimensional functional data, where the number of functional variables p is comparable to, and possibly larger than, the number of subjects. In this paper, we first reframe the functional covariance model estimation as a tuning-free problem of simultaneously testing p(p-1)/2 hypotheses for cross-covariance functions. Our procedure begins by constructing a Hilbert-Schmidt-norm-based test statistic for each pair, and employs normal quantile transformations for all test statistics, upon which a multiple testing step is proposed. We then explore the multiple testing procedure under a general error-contamination framework and establish that our procedure can control false discoveries asymptotically. Additionally, we demonstrate that our proposed methods for two concrete examples, the functional covariance model with partial observations and, importantly, the more challenging functional graphical model, can be seamlessly integrated into the general error-contamination framework and, under verifiable conditions, achieve theoretical guarantees on effective false discovery control. Finally, we showcase the superiority of our proposals through extensive simulations and functional connectivity analysis of two neuroimaging datasets."}, "https://arxiv.org/abs/2407.19613": {"title": "Causal effect estimation under network interference with mean-field methods", "link": "https://arxiv.org/abs/2407.19613", "description": "arXiv:2407.19613v1 Announce Type: cross \nAbstract: We study causal effect estimation from observational data under interference. The interference pattern is captured by an observed network. We adopt the chain graph framework of Tchetgen Tchetgen et al. 
(2021), which allows (i) interaction among the outcomes of distinct study units connected along the graph and (ii) long range interference, whereby the outcome of an unit may depend on the treatments assigned to distant units connected along the interference network. For ``mean-field\" interaction networks, we develop a new scalable iterative algorithm to estimate the causal effects. For gaussian weighted networks, we introduce a novel causal effect estimation algorithm based on Approximate Message Passing (AMP). Our algorithms are provably consistent under a ``high-temperature\" condition on the underlying model. We estimate the (unknown) parameters of the model from data using maximum pseudo-likelihood and establish $\\sqrt{n}$-consistency of this estimator in all parameter regimes. Finally, we prove that the downstream estimators obtained by plugging in estimated parameters into the aforementioned algorithms are consistent at high-temperature. Our methods can accommodate dense interactions among the study units -- a setting beyond reach using existing techniques. Our algorithms originate from the study of variational inference approaches in high-dimensional statistics; overall, we demonstrate the usefulness of these ideas in the context of causal effect estimation under interference."}, "https://arxiv.org/abs/2407.19688": {"title": "Causal Interventional Prediction System for Robust and Explainable Effect Forecasting", "link": "https://arxiv.org/abs/2407.19688", "description": "arXiv:2407.19688v1 Announce Type: cross \nAbstract: Although the widespread use of AI systems in today's world is growing, many current AI systems are found vulnerable due to hidden bias and missing information, especially in the most commonly used forecasting system. In this work, we explore the robustness and explainability of AI-based forecasting systems. We provide an in-depth analysis of the underlying causality involved in the effect prediction task and further establish a causal graph based on treatment, adjustment variable, confounder, and outcome. Correspondingly, we design a causal interventional prediction system (CIPS) based on a variational autoencoder and fully conditional specification of multiple imputations. Extensive results demonstrate the superiority of our system over state-of-the-art methods and show remarkable versatility and extensibility in practice."}, "https://arxiv.org/abs/2407.19932": {"title": "Testing for the Asymmetric Optimal Hedge Ratios: With an Application to Bitcoin", "link": "https://arxiv.org/abs/2407.19932", "description": "arXiv:2407.19932v1 Announce Type: cross \nAbstract: Reducing financial risk is of paramount importance to investors, financial institutions, and corporations. Since the pioneering contribution of Johnson (1960), the optimal hedge ratio based on futures is regularly utilized. The current paper suggests an explicit and efficient method for testing the null hypothesis of a symmetric optimal hedge ratio against an asymmetric alternative one within a multivariate setting. If the null is rejected, the position dependent optimal hedge ratios can be estimated via the suggested model. This approach is expected to enhance the accuracy of the implemented hedging strategies compared to the standard methods since it accounts for the fact that the source of risk depends on whether the investor is a buyer or a seller of the risky asset. An application is provided using spot and futures prices of Bitcoin. 
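For the hedging abstract above, a minimal numpy sketch of the classical symmetric minimum-variance hedge ratio of Johnson (1960), h* = Cov(spot, futures)/Var(futures), which is the benchmark that the paper's asymmetric, position-dependent test is built around; the simulated returns below are illustrative, not Bitcoin data.

```python
import numpy as np

def min_variance_hedge_ratio(spot_returns, futures_returns):
    """Classical symmetric optimal hedge ratio:
    h* = Cov(spot, futures) / Var(futures), i.e. the OLS slope of spot
    returns on futures returns."""
    s = np.asarray(spot_returns, dtype=float)
    f = np.asarray(futures_returns, dtype=float)
    return np.cov(s, f)[0, 1] / np.var(f, ddof=1)

# toy illustration with simulated returns
rng = np.random.default_rng(42)
f = rng.normal(0.0, 0.02, 500)
s = 0.9 * f + rng.normal(0.0, 0.005, 500)
h = min_variance_hedge_ratio(s, f)
hedged = s - h * f
print(h, np.var(s, ddof=1), np.var(hedged, ddof=1))   # hedged variance is smaller
```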
The results strongly support the view that the optimal hedge ratio for this cryptocurrency is position dependent. The investor that is long in Bitcoin has a much higher conditional optimal hedge ratio compared to the one that is short in the asset. The difference between the two conditional optimal hedge ratios is statistically significant, which has important repercussions for implementing risk management strategies."}, "https://arxiv.org/abs/2112.11870": {"title": "Bag of DAGs: Inferring Directional Dependence in Spatiotemporal Processes", "link": "https://arxiv.org/abs/2112.11870", "description": "arXiv:2112.11870v5 Announce Type: replace \nAbstract: We propose a class of nonstationary processes to characterize space- and time-varying directional associations in point-referenced data. We are motivated by spatiotemporal modeling of air pollutants in which local wind patterns are key determinants of the pollutant spread, but information regarding prevailing wind directions may be missing or unreliable. We propose to map a discrete set of wind directions to edges in a sparse directed acyclic graph (DAG), accounting for uncertainty in directional correlation patterns across a domain. The resulting Bag of DAGs processes (BAGs) lead to interpretable nonstationarity and scalability for large data due to sparsity of DAGs in the bag. We outline Bayesian hierarchical models using BAGs and illustrate inferential and performance gains of our methods compared to other state-of-the-art alternatives. We analyze fine particulate matter using high-resolution data from low-cost air quality sensors in California during the 2020 wildfire season. An R package is available on GitHub."}, "https://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "https://arxiv.org/abs/2208.09638", "description": "arXiv:2208.09638v3 Announce Type: replace \nAbstract: What is the purpose of pre-analysis plans, and how should they be designed? We model the interaction between an agent who analyzes data and a principal who makes a decision based on agent reports. The agent could be the manufacturer of a new drug, and the principal a regulator deciding whether the drug is approved. Or the agent could be a researcher submitting a research paper, and the principal an editor deciding whether it is published. The agent decides which statistics to report to the principal. The principal cannot verify whether the analyst reported selectively. Absent a pre-analysis message, if there are conflicts of interest, then many desirable decision rules cannot be implemented. Allowing the agent to send a message before seeing the data increases the set of decision rules that can be implemented, and allows the principal to leverage agent expertise. The optimal mechanisms that we characterize require pre-analysis plans. Applying these results to hypothesis testing, we show that optimal rejection rules pre-register a valid test, and make worst-case assumptions about unreported statistics. 
Optimal tests can be found as a solution to a linear-programming problem."}, "https://arxiv.org/abs/2306.14019": {"title": "Instrumental Variable Approach to Estimating Individual Causal Effects in N-of-1 Trials: Application to ISTOP Study", "link": "https://arxiv.org/abs/2306.14019", "description": "arXiv:2306.14019v2 Announce Type: replace \nAbstract: An N-of-1 trial is a multiple crossover trial conducted in a single individual to provide evidence to directly inform personalized treatment decisions. Advancements in wearable devices have greatly improved the feasibility of adopting these trials to identify optimal individual treatment plans, particularly when treatments differ among individuals and responses are highly heterogeneous. Our work was motivated by the I-STOP-AFib Study, which examined the impact of different triggers on atrial fibrillation (AF) occurrence. We describe a causal framework for 'N-of-1' trials using potential treatment selection paths and potential outcome paths. Two estimands of the individual causal effect are defined: (a) the effect of continuous exposure, and (b) the effect of an individual's observed behavior. We address three challenges: (a) imperfect compliance with the randomized treatment assignment; (b) binary treatments and binary outcomes, which lead to the 'non-collapsibility' issue in estimating odds ratios; and (c) serial correlation in the longitudinal observations. We adopt a Bayesian IV approach in which the study randomization is the IV, as it impacts a subject's choice of exposure but not the outcome directly. Estimation proceeds through a system of two parametric Bayesian models for the individual causal effect. Our model circumvents non-collapsibility and non-consistency by modeling the confounding mechanism through latent structural models and by basing inference on Bayesian posteriors of functionals. Autocorrelation present in the repeated measurements is also accounted for. The simulation study shows that our method largely reduces bias and greatly improves the coverage of the estimated causal effect, compared to existing methods (ITT, PP, and AT). We apply the method to the I-STOP-AFib Study to estimate the individual effect of alcohol on AF occurrence."}, "https://arxiv.org/abs/2307.16048": {"title": "Structural restrictions in local causal discovery: identifying direct causes of a target variable", "link": "https://arxiv.org/abs/2307.16048", "description": "arXiv:2307.16048v2 Announce Type: replace \nAbstract: We consider the problem of learning a set of direct causes of a target variable from an observational joint distribution. Learning directed acyclic graphs (DAGs) that represent the causal structure is a fundamental problem in science. Several results are known when the full DAG is identifiable from the distribution, such as assuming a nonlinear Gaussian data-generating process. Here, we are only interested in identifying the direct causes of one target variable (local causal structure), not the full DAG. This allows us to relax the identifiability assumptions and develop possibly faster and more robust algorithms. In contrast to the Invariant Causal Prediction framework, we only assume that we observe one environment without any interventions. We discuss different assumptions for the data-generating process of the target variable under which the set of direct causes is identifiable from the distribution. While doing so, we put essentially no assumptions on the variables other than the target variable. 
In addition to the novel identifiability results, we provide two practical algorithms for estimating the direct causes from a finite random sample and demonstrate their effectiveness on several benchmark and real datasets."}, "https://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "https://arxiv.org/abs/2310.05646", "description": "arXiv:2310.05646v4 Announce Type: replace \nAbstract: We study transfer learning for estimating piecewise-constant signals when source data, which may be relevant but disparate, are available in addition to the target data. We first investigate transfer learning estimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for unisource data scenarios and then generalise these estimators to accommodate multisources. To further reduce estimation errors, especially when some sources significantly differ from the target, we introduce an informative source selection algorithm. We then examine these estimators with multisource selection and establish their minimax optimality. Unlike the common narrative in the transfer learning literature that the performance is enhanced through large source sample sizes, our approaches leverage higher observation frequencies and accommodate diverse frequencies across multiple sources. Our theoretical findings are supported by extensive numerical experiments, with the code available online, see https://github.com/chrisfanwang/transferlearning"}, "https://arxiv.org/abs/2310.09597": {"title": "Adaptive maximization of social welfare", "link": "https://arxiv.org/abs/2310.09597", "description": "arXiv:2310.09597v2 Announce Type: replace \nAbstract: We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed, but indirectly inferred. Response functions are learned through experimentation. We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of $T^{2/3}$. This implies that (i) welfare maximization is harder than the multi-armed bandit problem (with a rate of $T^{1/2}$ for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for continuous policy sets), using a dyadic search algorithm. We analyze an extension to nonlinear income taxation, and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier), and price setting for bilateral trade (which is harder)."}, "https://arxiv.org/abs/2312.02404": {"title": "Nonparametric Bayesian Adjustment of Unmeasured Confounders in Cox Proportional Hazards Models", "link": "https://arxiv.org/abs/2312.02404", "description": "arXiv:2312.02404v3 Announce Type: replace \nAbstract: In observational studies, unmeasured confounders present a crucial challenge in accurately estimating desired causal effects. To calculate the hazard ratio (HR) in Cox proportional hazard models for time-to-event outcomes, two-stage residual inclusion and limited information maximum likelihood are typically employed. However, these methods are known to entail difficulty in terms of potential bias of HR estimates and parameter identification. 
This study introduces a novel nonparametric Bayesian method designed to estimate an unbiased HR, addressing concerns that previous research methods have had. Our proposed method consists of two phases: 1) detecting clusters based on the likelihood of the exposure and outcome variables, and 2) estimating the hazard ratio within each cluster. Although it is implicitly assumed that unmeasured confounders affect outcomes through cluster effects, our algorithm is well-suited for such data structures. The proposed Bayesian estimator has good performance compared with some competitors."}, "https://arxiv.org/abs/2312.03268": {"title": "Design-based inference for generalized network experiments with stochastic interventions", "link": "https://arxiv.org/abs/2312.03268", "description": "arXiv:2312.03268v2 Announce Type: replace \nAbstract: A growing number of researchers are conducting randomized experiments to analyze causal relationships in network settings where units influence one another. A dominant methodology for analyzing these experiments is design-based, leveraging random treatment assignments as the basis for inference. In this paper, we generalize this design-based approach to accommodate complex experiments with a variety of causal estimands and different target populations. An important special case of such generalized network experiments is a bipartite network experiment, in which treatment is randomized among one set of units, and outcomes are measured on a separate set of units. We propose a broad class of causal estimands based on stochastic interventions for generalized network experiments. Using a design-based approach, we show how to estimate these causal quantities without bias and develop conservative variance estimators. We apply our methodology to a randomized experiment in education where participation in an anti-conflict promotion program is randomized among selected students. Our analysis estimates the causal effects of treating each student or their friends among different target populations in the network. We find that the program improves the overall conflict awareness among students but does not significantly reduce the total number of such conflicts."}, "https://arxiv.org/abs/2312.14086": {"title": "A Bayesian approach to functional regression: theory and computation", "link": "https://arxiv.org/abs/2312.14086", "description": "arXiv:2312.14086v2 Announce Type: replace \nAbstract: We propose a novel Bayesian methodology for inference in functional linear and logistic regression models based on the theory of reproducing kernel Hilbert spaces (RKHS's). We introduce general models that build upon the RKHS generated by the covariance function of the underlying stochastic process, and whose formulation includes as particular cases all finite-dimensional models based on linear combinations of marginals of the process, which can collectively be seen as a dense subspace made of simple approximations. By imposing a suitable prior distribution on this dense functional space we can perform data-driven inference via standard Bayes methodology, estimating the posterior distribution through reversible jump Markov chain Monte Carlo methods. In this context, our contribution is two-fold. First, we derive a theoretical result that guarantees posterior consistency, based on an application of a classic theorem of Doob to our RKHS setting. 
Second, we show that several prediction strategies stemming from our Bayesian procedure are competitive against other usual alternatives in both simulations and real data sets, including a Bayesian-motivated variable selection method."}, "https://arxiv.org/abs/2312.17566": {"title": "Doublethink: simultaneous Bayesian-frequentist model-averaged hypothesis testing", "link": "https://arxiv.org/abs/2312.17566", "description": "arXiv:2312.17566v2 Announce Type: replace \nAbstract: Establishing the frequentist properties of Bayesian approaches widens their appeal and offers new understanding. In hypothesis testing, Bayesian model averaging addresses the problem that conclusions are sensitive to variable selection. But Bayesian false discovery rate (FDR) guarantees are contingent on prior assumptions that may be disputed. Here we show that Bayesian model-averaged hypothesis testing is a closed testing procedure that controls the frequentist familywise error rate (FWER) in the strong sense. The rate converges pointwise as the sample size grows and, under some conditions, uniformly. The `Doublethink' method computes simultaneous posterior odds and asymptotic p-values for model-averaged hypothesis testing. We explore its benefits, including post-hoc variable selection, and limitations, including finite-sample inflation, through a Mendelian randomization study and simulations comparing approaches like LASSO, stepwise regression, the Benjamini-Hochberg procedure and e-values."}, "https://arxiv.org/abs/2202.04796": {"title": "The Transfer Performance of Economic Models", "link": "https://arxiv.org/abs/2202.04796", "description": "arXiv:2202.04796v4 Announce Type: replace-cross \nAbstract: Economists often estimate models using data from a particular domain, e.g. estimating risk preferences in a particular subject pool or for a specific class of lotteries. Whether a model's predictions extrapolate well across domains depends on whether the estimated model has captured generalizable structure. We provide a tractable formulation for this \"out-of-domain\" prediction problem and define the transfer error of a model based on how well it performs on data from a new domain. We derive finite-sample forecast intervals that are guaranteed to cover realized transfer errors with a user-selected probability when domains are iid, and use these intervals to compare the transferability of economic models and black box algorithms for predicting certainty equivalents. We find that in this application, the black box algorithms we consider outperform standard economic models when estimated and tested on data from the same domain, but the economic models generalize across domains better than the black-box algorithms do."}, "https://arxiv.org/abs/2407.20386": {"title": "On the power properties of inference for parameters with interval identified sets", "link": "https://arxiv.org/abs/2407.20386", "description": "arXiv:2407.20386v1 Announce Type: new \nAbstract: This paper studies a specific inference problem for a partially-identified parameter of interest with an interval identified set. We consider the favorable situation in which a researcher has two possible estimators to construct the confidence interval proposed in Imbens and Manski (2004) and Stoye (2009), and one is more efficient than the other. While the literature shows that both estimators deliver asymptotically exact confidence intervals for the parameter of interest, their inference in terms of statistical power is not compared. 
One would expect that using the more efficient estimator would result in more powerful inference. We formally prove this result."}, "https://arxiv.org/abs/2407.20491": {"title": "High dimensional inference for extreme value indices", "link": "https://arxiv.org/abs/2407.20491", "description": "arXiv:2407.20491v1 Announce Type: new \nAbstract: When applying multivariate extreme value statistics to analyze tail risk in compound events defined by a multivariate random vector, one often assumes that all dimensions share the same extreme value index. While such an assumption can be tested using a Wald-type test, the performance of such a test deteriorates as the dimensionality increases. This paper introduces a novel test for testing extreme value indices in a high dimensional setting. We establish the asymptotic behavior of the test statistic and conduct simulation studies to evaluate its finite-sample performance. The proposed test significantly outperforms existing methods in high dimensional settings. We apply this test to examine two datasets previously assumed to have identical extreme value indices across all dimensions."}, "https://arxiv.org/abs/2407.20520": {"title": "Uncertainty Quantification under Noisy Constraints, with Applications to Raking", "link": "https://arxiv.org/abs/2407.20520", "description": "arXiv:2407.20520v1 Announce Type: new \nAbstract: We consider statistical inference problems under uncertain equality constraints, and provide asymptotically valid uncertainty estimates for inferred parameters. The proposed approach leverages the implicit function theorem and primal-dual optimality conditions for a particular problem class. The motivating application is multi-dimensional raking, where observations are adjusted to match marginals; for example, adjusting estimated deaths across race, county, and cause in order to match state all-race all-cause totals. We review raking from a convex optimization perspective, providing explicit primal-dual formulations, algorithms, and optimality conditions for a wide array of raking applications, which are then leveraged to obtain the uncertainty estimates. Empirical results show that the approach obtains, at the cost of a single solve, nearly the same uncertainty estimates as computationally intensive Monte Carlo techniques that pass thousands of draws of the observations and marginals through the entire raking process."}, "https://arxiv.org/abs/2407.20580": {"title": "Laplace approximation for Bayesian variable selection via Le Cam's one-step procedure", "link": "https://arxiv.org/abs/2407.20580", "description": "arXiv:2407.20580v1 Announce Type: new \nAbstract: Variable selection in high-dimensional spaces is a pervasive challenge in contemporary scientific exploration and decision-making. However, existing approaches that are known to enjoy strong statistical guarantees often struggle to cope with the computational demands arising from the high dimensionality. To address this issue, we propose a novel Laplace approximation method based on Le Cam's one-step procedure (\\textsf{OLAP}), designed to effectively tackle the computational burden. Under some classical high-dimensional assumptions we show that \\textsf{OLAP} is a statistically consistent variable selection procedure. Furthermore, we show that the approach produces a posterior distribution that can be explored in polynomial time using a simple Gibbs sampling algorithm. 
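To fix ideas for the raking abstract above, a minimal sketch of classical iterative proportional fitting for a two-way table; it performs only the deterministic raking step, while the paper's contribution, propagating uncertainty through the primal-dual optimality conditions, is not implemented here.

```python
import numpy as np

def ipf_rake(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting: rescale a positive 2-D table until its
    row and column sums match the given marginal totals (which must agree)."""
    x = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        x *= (row_targets / x.sum(axis=1))[:, None]   # match row totals
        x *= col_targets / x.sum(axis=0)               # match column totals
        if np.allclose(x.sum(axis=1), row_targets, atol=tol):
            break
    return x

seed = np.array([[10.0, 20.0], [30.0, 40.0]])
raked = ipf_rake(seed, row_targets=[35.0, 65.0], col_targets=[45.0, 55.0])
print(raked, raked.sum())   # marginals now match the targets
```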
Toward that polynomial complexity result, we also make some general, noteworthy contributions to the mixing time analysis of Markov chains. We illustrate the method using logistic and Poisson regression models applied to simulated and real data examples."}, "https://arxiv.org/abs/2407.20683": {"title": "An online generalization of the e-BH procedure", "link": "https://arxiv.org/abs/2407.20683", "description": "arXiv:2407.20683v1 Announce Type: new \nAbstract: In online multiple testing, hypotheses arrive one by one over time, and at each time a decision on the current hypothesis must be made based solely on the data and hypotheses observed so far. In this paper we relax this setup by allowing initially accepted hypotheses to be rejected due to information gathered at later steps. We propose online e-BH, an online ARC (online with acceptance-to-rejection changes) version of the e-BH procedure. Online e-BH is the first nontrivial online procedure which provably controls the FDR at data-adaptive stopping times and under arbitrary dependence between the test statistics. Online e-BH uniformly improves e-LOND, the existing method for e-value based online FDR control. In addition, we introduce new boosting techniques for online e-BH to increase the power in the case of locally dependent e-values. Furthermore, based on the same proof technique as used for online e-BH, we show that all existing online procedures with valid FDR control under arbitrary dependence also control the FDR at data-adaptive stopping times."}, "https://arxiv.org/abs/2407.20738": {"title": "A Local Modal Outer-Product-Gradient Estimator for Dimension Reduction", "link": "https://arxiv.org/abs/2407.20738", "description": "arXiv:2407.20738v1 Announce Type: new \nAbstract: Sufficient dimension reduction (SDR) is a valuable approach for handling high-dimensional data. The Outer Product Gradient (OPG) method is a popular approach. However, because it focuses on the mean regression function, OPG may ignore some directions of the central subspace (CS) when the distribution of errors is symmetric about zero. The mode of a distribution can provide an important summary of data. A Local Modal OPG (LMOPG) and its algorithm, based on mode regression, are proposed to estimate the basis of the CS under skewed error distributions. The estimator is shown to be consistent and asymptotically normal under some mild conditions. Monte Carlo simulation is used to evaluate the performance and demonstrate the efficiency and robustness of the proposed method."}, "https://arxiv.org/abs/2407.20796": {"title": "Linear mixed modelling of federated data when only the mean, covariance, and sample size are available", "link": "https://arxiv.org/abs/2407.20796", "description": "arXiv:2407.20796v1 Announce Type: new \nAbstract: In medical research, individual-level patient data provide invaluable information, but the patients' right to confidentiality remains of utmost priority. This poses a huge challenge when estimating statistical models such as linear mixed models, which are an extension of linear regression models that can account for potential heterogeneity whenever data come from different data providers. Federated learning algorithms tackle this hurdle by estimating parameters without retrieving individual-level data. Instead, iterative communication of parameter estimate updates between the data providers and the analyst is required. In this paper, we propose an alternative framework to federated learning algorithms for fitting linear mixed models. 
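As background for the online e-BH abstract above, a minimal sketch of the base (offline) e-BH procedure that the paper generalizes: reject the hypotheses with the k* largest e-values, where k* = max{k : k e_(k) / n >= 1/alpha} and e_(k) is the k-th largest e-value; the online ARC extension itself is not implemented here.

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """Base e-BH: reject the hypotheses carrying the k* largest e-values,
    with k* = max{k : k * e_(k) / n >= 1 / alpha} (no rejections if empty)."""
    e = np.asarray(e_values, dtype=float)
    n = e.size
    order = np.argsort(-e)                  # indices sorted by decreasing e-value
    sorted_e = e[order]
    ks = np.arange(1, n + 1)
    passing = np.nonzero(ks * sorted_e / n >= 1.0 / alpha)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k_star = passing.max() + 1
    return np.sort(order[:k_star])          # indices of rejected hypotheses

print(e_bh([130.0, 45.0, 2.0, 0.5, 60.0], alpha=0.05))   # rejects hypotheses 0, 1 and 4
```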
Specifically, our approach only requires the mean, covariance, and sample size of the covariates from each data provider, shared once. Using the principle of statistical sufficiency within the likelihood framework as theoretical support, the proposed framework achieves estimates identical to those derived from the actual individual-level data. We demonstrate this approach through real data on 15 068 patient records from 70 clinics at the Children's Hospital of Pennsylvania (CHOP). Assuming that each clinic only shares summary statistics once, we model the COVID-19 PCR test cycle threshold as a function of patient information. Simplicity, communication efficiency, and wider scope of implementation in any statistical software distinguish our approach from existing strategies in the literature."}, "https://arxiv.org/abs/2407.20819": {"title": "Design and inference for multi-arm clinical trials with informational borrowing: the interacting urns design", "link": "https://arxiv.org/abs/2407.20819", "description": "arXiv:2407.20819v1 Announce Type: new \nAbstract: This paper deals with a new design methodology for stratified comparative experiments based on interacting reinforced urn systems. The key idea is to model the interaction between urns for borrowing information across strata and to use it in the design phase in order to i) enhance the information exchange at the beginning of the study, when only a few subjects have been enrolled and the stratum-specific information on treatments' efficacy could be scarce, ii) let the information sharing adaptively evolve via a reinforcement mechanism based on the observed outcomes, skewing the allocations at each step towards the stratum-specific most promising treatment, and iii) make the contribution of the strata with different treatment efficacy vanish as the stratum information grows. In particular, we introduce the Interacting Urns Design, namely a new Covariate-Adjusted Response-Adaptive procedure that randomizes the treatment allocations according to the evolution of the urn system. The theoretical properties of this proposal are described and the corresponding asymptotic inference is provided. Moreover, by a functional central limit theorem, we obtain the asymptotic joint distribution of the Wald-type sequential test statistics, which allows the suggested design to be monitored sequentially in clinical practice."}, "https://arxiv.org/abs/2407.20929": {"title": "ROC curve analysis for functional markers", "link": "https://arxiv.org/abs/2407.20929", "description": "arXiv:2407.20929v1 Announce Type: new \nAbstract: Functional markers are becoming a more frequent tool in medical diagnosis. In this paper, we aim to define an index that allows discrimination between populations when the observations are functional data belonging to a Hilbert space. We discuss some of the problems arising when estimating optimal directions defined to maximize the area under the curve of a projection index, and we construct the corresponding ROC curve. We also go one step further and consider the case of possibly different covariance operators, for which we recommend a quadratic discrimination rule. Consistency results are derived for both linear and quadratic indexes, under mild conditions. The results of our numerical experiments show the advantages of the quadratic rule when the populations have different covariance operators. 
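To illustrate the sufficiency idea in the federated modelling abstract above, here is a sketch restricted to the plain linear-regression (fixed-effects-only) case: per-site means, covariances, and sample sizes of the joint vector (covariates, outcome) are enough to rebuild the pooled normal equations exactly; the paper's treatment of full linear mixed models is not reproduced here.

```python
import numpy as np

def pooled_ols_from_summaries(site_stats):
    """Recover pooled OLS estimates (intercept + slopes) from per-site
    summaries (n_j, mean_j, cov_j) of the joint vector z = (x_1, ..., x_p, y),
    where cov_j is the usual sample covariance (denominator n_j - 1)."""
    n_tot, sum_z, ss_z = 0, None, None
    for n_j, mean_j, cov_j in site_stats:
        m = np.asarray(mean_j, dtype=float)
        S = np.asarray(cov_j, dtype=float)
        zz = (n_j - 1) * S + n_j * np.outer(m, m)   # site-level sum of z z^T
        sum_z = n_j * m if sum_z is None else sum_z + n_j * m
        ss_z = zz if ss_z is None else ss_z + zz
        n_tot += n_j
    p = sum_z.size - 1                              # last coordinate of z is the outcome y
    XtX = np.zeros((p + 1, p + 1))
    XtX[0, 0] = n_tot
    XtX[0, 1:] = XtX[1:, 0] = sum_z[:p]
    XtX[1:, 1:] = ss_z[:p, :p]
    Xty = np.concatenate(([sum_z[p]], ss_z[:p, p]))
    return np.linalg.solve(XtX, Xty)

# check against individual-level OLS on simulated 'clinics'
rng = np.random.default_rng(0)
summaries, rows = [], []
for n_j in (120, 200, 80):
    x = rng.normal(size=(n_j, 2))
    y = 1.0 + x @ np.array([0.5, -2.0]) + rng.normal(scale=0.3, size=n_j)
    z = np.column_stack([x, y])
    summaries.append((n_j, z.mean(axis=0), np.cov(z, rowvar=False)))
    rows.append(z)
Z = np.vstack(rows)
A = np.column_stack([np.ones(len(Z)), Z[:, :2]])
print(pooled_ols_from_summaries(summaries))        # from summaries only
print(np.linalg.lstsq(A, Z[:, 2], rcond=None)[0])  # matches the individual-level fit
```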
We also illustrate the considered methods on a real data set."}, "https://arxiv.org/abs/2407.20995": {"title": "Generalized Multivariate Functional Additive Mixed Models for Location, Scale, and Shape", "link": "https://arxiv.org/abs/2407.20995", "description": "arXiv:2407.20995v1 Announce Type: new \nAbstract: We propose a flexible regression framework to model the conditional distribution of multilevel generalized multivariate functional data of potentially mixed type, e.g. binary and continuous data. We make pointwise parametric distributional assumptions for each dimension of the multivariate functional data and model each distributional parameter as an additive function of covariates. The dependency between the different outcomes and, for multilevel functional data, also between different functions within a level is modelled by shared latent multivariate Gaussian processes. For a parsimonious representation of the latent processes, (generalized) multivariate functional principal components are estimated from the data and used as an empirical basis for these latent processes in the regression framework. Our modular two-step approach is very general and can easily incorporate new developments in the estimation of functional principal components for all types of (generalized) functional data. Flexible additive covariate effects for scalar or even functional covariates are available and are estimated in a Bayesian framework. We provide an easy-to-use implementation in the accompanying R package 'gmfamm' on CRAN and conduct a simulation study to confirm the validity of our regression framework and estimation strategy. The proposed multivariate functional model is applied to four-dimensional traffic data in Berlin, which consists of the hourly numbers and mean speed of cars and trucks at different locations."}, "https://arxiv.org/abs/2407.20295": {"title": "Warped multifidelity Gaussian processes for data fusion of skewed environmental data", "link": "https://arxiv.org/abs/2407.20295", "description": "arXiv:2407.20295v1 Announce Type: cross \nAbstract: Understanding the dynamics of climate variables is paramount for numerous sectors, like energy and environmental monitoring. This study focuses on the critical need for a precise mapping of environmental variables for national or regional monitoring networks, a task notably challenging when dealing with skewed data. To address this issue, we propose a novel data fusion approach, the \textit{warped multifidelity Gaussian process} (WMFGP). The method performs prediction using multiple time-series, accommodating varying reliability and resolutions and effectively handling skewness. In an extended simulation experiment the benefits and the limitations of the methods are explored, while as a case study we focus on the wind speed monitored by the network of ARPA Lombardia, one of the regional environmental agencies operating in Italy. ARPA grapples with data gaps, and due to the connection between wind speed and air quality, it struggles with effective air quality management. We illustrate the efficacy of our approach in filling the wind speed data gaps through two extensive simulation experiments. 
The case study provides more informative wind speed predictions crucial for predicting air pollutant concentrations, enhancing network maintenance, and advancing understanding of relevant meteorological and climatic phenomena."}, "https://arxiv.org/abs/2407.20352": {"title": "Designing Time-Series Models With Hypernetworks & Adversarial Portfolios", "link": "https://arxiv.org/abs/2407.20352", "description": "arXiv:2407.20352v1 Announce Type: cross \nAbstract: This article describes the methods that achieved 4th and 6th place in the forecasting and investment challenges, respectively, of the M6 competition, ultimately securing the 1st place in the overall duathlon ranking. In the forecasting challenge, we tested a novel meta-learning model that utilizes hypernetworks to design a parametric model tailored to a specific family of forecasting tasks. This approach allowed us to leverage similarities observed across individual forecasting tasks while also acknowledging potential heterogeneity in their data generating processes. The model's training can be directly performed with backpropagation, eliminating the need for reliance on higher-order derivatives and is equivalent to a simultaneous search over the space of parametric functions and their optimal parameter values. The proposed model's capabilities extend beyond M6, demonstrating superiority over state-of-the-art meta-learning methods in the sinusoidal regression task and outperforming conventional parametric models on time-series from the M4 competition. In the investment challenge, we adjusted portfolio weights to induce greater or smaller correlation between our submission and that of other participants, depending on the current ranking, aiming to maximize the probability of achieving a good rank."}, "https://arxiv.org/abs/2407.20377": {"title": "Leveraging Natural Language and Item Response Theory Models for ESG Scoring", "link": "https://arxiv.org/abs/2407.20377", "description": "arXiv:2407.20377v1 Announce Type: cross \nAbstract: This paper explores an innovative approach to Environmental, Social, and Governance (ESG) scoring by integrating Natural Language Processing (NLP) techniques with Item Response Theory (IRT), specifically the Rasch model. The study utilizes a comprehensive dataset of news articles in Portuguese related to Petrobras, a major oil company in Brazil, collected from 2022 and 2023. The data is filtered and classified for ESG-related sentiments using advanced NLP methods. The Rasch model is then applied to evaluate the psychometric properties of these ESG measures, providing a nuanced assessment of ESG sentiment trends over time. The results demonstrate the efficacy of this methodology in offering a more precise and reliable measurement of ESG factors, highlighting significant periods and trends. This approach may enhance the robustness of ESG metrics and contribute to the broader field of sustainability and finance by offering a deeper understanding of the temporal dynamics in ESG reporting."}, "https://arxiv.org/abs/2407.20553": {"title": "DiffusionCounterfactuals: Inferring High-dimensional Counterfactuals with Guidance of Causal Representations", "link": "https://arxiv.org/abs/2407.20553", "description": "arXiv:2407.20553v1 Announce Type: cross \nAbstract: Accurate estimation of counterfactual outcomes in high-dimensional data is crucial for decision-making and understanding causal relationships and intervention outcomes in various domains, including healthcare, economics, and social sciences. 
However, existing methods often struggle to generate accurate and consistent counterfactuals, particularly when the causal relationships are complex. We propose a novel framework that incorporates causal mechanisms and diffusion models to generate high-quality counterfactual samples guided by causal representation. Our approach introduces a novel, theoretically grounded training and sampling process that enables the model to consistently generate accurate counterfactual high-dimensional data under multiple intervention steps. Experimental results on various synthetic and real benchmarks demonstrate the proposed approach outperforms state-of-the-art methods in generating accurate and high-quality counterfactuals, using different evaluation metrics."}, "https://arxiv.org/abs/2407.20700": {"title": "Industrial-Grade Smart Troubleshooting through Causal Technical Language Processing: a Proof of Concept", "link": "https://arxiv.org/abs/2407.20700", "description": "arXiv:2407.20700v1 Announce Type: cross \nAbstract: This paper describes the development of a causal diagnosis approach for troubleshooting an industrial environment on the basis of the technical language expressed in Return on Experience records. The proposed method leverages the vectorized linguistic knowledge contained in the distributed representation of a Large Language Model, and the causal associations entailed by the embedded failure modes and mechanisms of the industrial assets. The paper presents the elementary but essential concepts of the solution, which is conceived as a causality-aware retrieval augmented generation system, and illustrates them experimentally on a real-world Predictive Maintenance setting. Finally, it discusses avenues of improvement for the maturity of the utilized causal technology to meet the robustness challenges of increasingly complex scenarios in the industry."}, "https://arxiv.org/abs/2210.08346": {"title": "Inferring a population composition from survey data with nonignorable nonresponse: Borrowing information from external sources", "link": "https://arxiv.org/abs/2210.08346", "description": "arXiv:2210.08346v2 Announce Type: replace \nAbstract: We introduce a method to make inference on the composition of a heterogeneous population using survey data, accounting for the possibility that capture heterogeneity is related to key survey variables. To deal with nonignorable nonresponse, we combine different data sources and propose the use of Fisher's noncentral hypergeometric model in a Bayesian framework. To illustrate the potentialities of our methodology, we focus on a case study aimed at estimating the composition of the population of Italian graduates by their occupational status one year after graduating, stratifying by gender and degree program. We account for the possibility that surveys inquiring about the occupational status of new graduates may have response rates that depend on individuals' employment status, implying the nonignorability of the nonresponse. Our findings show that employed people are generally more inclined to answer the questionnaire. Neglecting the nonresponse bias in such contexts might lead to overestimating the employment rate."}, "https://arxiv.org/abs/2306.05498": {"title": "Monte Carlo inference for semiparametric Bayesian regression", "link": "https://arxiv.org/abs/2306.05498", "description": "arXiv:2306.05498v2 Announce Type: replace \nAbstract: Data transformations are essential for broad applicability of parametric regression models. 
However, for Bayesian analysis, joint inference of the transformation and model parameters typically involves restrictive parametric transformations or nonparametric representations that are computationally inefficient and cumbersome for implementation and theoretical analysis, which limits their usability in practice. This paper introduces a simple, general, and efficient strategy for joint posterior inference of an unknown transformation and all regression model parameters. The proposed approach directly targets the posterior distribution of the transformation by linking it with the marginal distributions of the independent and dependent variables, and then deploys a Bayesian nonparametric model via the Bayesian bootstrap. Crucially, this approach delivers (1) joint posterior consistency under general conditions, including multiple model misspecifications, and (2) efficient Monte Carlo (not Markov chain Monte Carlo) inference for the transformation and all parameters for important special cases. These tools apply across a variety of data domains, including real-valued, positive, and compactly-supported data. Simulation studies and an empirical application demonstrate the effectiveness and efficiency of this strategy for semiparametric Bayesian analysis with linear models, quantile regression, and Gaussian processes. The R package SeBR is available on CRAN."}, "https://arxiv.org/abs/2309.00706": {"title": "Causal Effect Estimation after Propensity Score Trimming with Continuous Treatments", "link": "https://arxiv.org/abs/2309.00706", "description": "arXiv:2309.00706v2 Announce Type: replace \nAbstract: Propensity score trimming, which discards subjects with propensity scores below a threshold, is a common way to address positivity violations that complicate causal effect estimation. However, most works on trimming assume treatment is discrete and models for the outcome regression and propensity score are parametric. This work proposes nonparametric estimators for trimmed average causal effects in the case of continuous treatments based on efficient influence functions. For continuous treatments, an efficient influence function for a trimmed causal effect does not exist, due to a lack of pathwise differentiability induced by trimming and a continuous treatment. Thus, we target a smoothed version of the trimmed causal effect for which an efficient influence function exists. Our resulting estimators exhibit doubly-robust style guarantees, with error involving products or squares of errors for the outcome regression and propensity score, which allows for valid inference even when nonparametric models are used. Our results allow the trimming threshold to be fixed or defined as a quantile of the propensity score, such that confidence intervals incorporate uncertainty involved in threshold estimation. These findings are validated via simulation and an application, thereby showing how to efficiently-but-flexibly estimate trimmed causal effects with continuous treatments."}, "https://arxiv.org/abs/2301.03894": {"title": "Location- and scale-free procedures for distinguishing between distribution tail models", "link": "https://arxiv.org/abs/2301.03894", "description": "arXiv:2301.03894v2 Announce Type: replace-cross \nAbstract: We consider distinguishing between two distribution tail models when tails of one model are lighter (or heavier) than those of the other. Two procedures are proposed: one scale-free and one location- and scale-free, and their asymptotic properties are established. 
We show the advantage of using these procedures for distinguishing between certain tail models in comparison with the tests proposed in the literature by simulation and apply them to data on daily precipitation in Green Bay, US and Saentis, Switzerland."}, "https://arxiv.org/abs/2305.01392": {"title": "Multi-Scale CUSUM Tests for Time Dependent Spherical Random Fields", "link": "https://arxiv.org/abs/2305.01392", "description": "arXiv:2305.01392v2 Announce Type: replace-cross \nAbstract: This paper investigates the asymptotic behavior of structural break tests in the harmonic domain for time dependent spherical random fields. In particular, we prove a functional central limit theorem result for the fluctuations over time of the sample spherical harmonic coefficients, under the null of isotropy and stationarity; furthermore, we prove consistency of the corresponding CUSUM test, under a broad range of alternatives, including deterministic trend, abrupt change, and a nontrivial power alternative. Our results are then applied to NCEP data on global temperature: our estimates suggest that Climate Change does not simply affect global average temperatures, but also the nature of spatial fluctuations at different scales."}, "https://arxiv.org/abs/2308.09869": {"title": "Symmetrisation of a class of two-sample tests by mutually considering depth ranks including functional spaces", "link": "https://arxiv.org/abs/2308.09869", "description": "arXiv:2308.09869v2 Announce Type: replace-cross \nAbstract: Statistical depth functions provide measures of the outlyingness, or centrality, of the elements of a space with respect to a distribution. It is a nonparametric concept applicable to spaces of any dimension, for instance, multivariate and functional. Liu and Singh (1993) presented a multivariate two-sample test based on depth-ranks. We dedicate this paper to improving the power of the associated test statistic and incorporating its applicability to functional data. In doing so, we obtain a more natural test statistic that is symmetric in both samples. We derive the null asymptotic of the proposed test statistic, also proving the validity of the testing procedure for functional data. Finally, the finite sample performance of the test for functional data is illustrated by means of a simulation study and a real data analysis on annual temperature curves of ocean drifters is executed."}, "https://arxiv.org/abs/2310.17546": {"title": "A changepoint approach to modelling non-stationary soil moisture dynamics", "link": "https://arxiv.org/abs/2310.17546", "description": "arXiv:2310.17546v2 Announce Type: replace-cross \nAbstract: Soil moisture dynamics provide an indicator of soil health that scientists model via drydown curves. The typical modelling process requires the soil moisture time series to be manually separated into drydown segments and then exponential decay models are fitted to them independently. Sensor development over recent years means that experiments that were previously conducted over a few field campaigns can now be scaled to months or years at a higher sampling rate. To better meet the challenge of increasing data size, this paper proposes a novel changepoint-based approach to automatically identify structural changes in the soil drying process and simultaneously estimate the drydown parameters that are of interest to soil scientists. A simulation study is carried out to demonstrate the performance of the method in detecting changes and retrieving model parameters. 
Practical aspects of the method such as adding covariates and penalty learning are discussed. The method is applied to hourly soil moisture time series from the NEON data portal to investigate the temporal dynamics of soil moisture drydown. We recover known relationships previously identified manually, alongside delivering new insights into the temporal variability across soil types and locations."}, "https://arxiv.org/abs/2407.21119": {"title": "Potential weights and implicit causal designs in linear regression", "link": "https://arxiv.org/abs/2407.21119", "description": "arXiv:2407.21119v1 Announce Type: new \nAbstract: When do linear regressions estimate causal effects in quasi-experiments? This paper provides a generic diagnostic that assesses whether a given linear regression specification on a given dataset admits a design-based interpretation. To do so, we define a notion of potential weights, which encode counterfactual decisions a given regression makes to unobserved potential outcomes. If the specification does admit such an interpretation, this diagnostic can find a vector of unit-level treatment assignment probabilities -- which we call an implicit design -- under which the regression estimates a causal effect. This diagnostic also finds the implicit causal effect estimand. Knowing the implicit design and estimand adds transparency, leads to further sanity checks, and opens the door to design-based statistical inference. When applied to regression specifications studied in the causal inference literature, our framework recovers and extends existing theoretical results. When applied to widely-used specifications not covered by existing causal inference literature, our framework generates new theoretical insights."}, "https://arxiv.org/abs/2407.21154": {"title": "Bayesian thresholded modeling for integrating brain node and network predictors", "link": "https://arxiv.org/abs/2407.21154", "description": "arXiv:2407.21154v1 Announce Type: new \nAbstract: Progress in neuroscience has provided unprecedented opportunities to advance our understanding of brain alterations and their correspondence to phenotypic profiles. With data collected from various imaging techniques, studies have integrated different types of information ranging from brain structure, function, or metabolism. More recently, an emerging way to categorize imaging traits is through a metric hierarchy, including localized node-level measurements and interactive network-level metrics. However, limited research has been conducted to integrate these different hierarchies and achieve a better understanding of the neurobiological mechanisms and communications. In this work, we address this literature gap by proposing a Bayesian regression model under both vector-variate and matrix-variate predictors. To characterize the interplay between different predicting components, we propose a set of biologically plausible prior models centered on an innovative joint thresholded prior. This captures the coupling and grouping effect of signal patterns, as well as their spatial contiguity across brain anatomy. By developing a posterior inference, we can identify and quantify the uncertainty of signaling node- and network-level neuromarkers, as well as their predictive mechanism for phenotypic outcomes. Through extensive simulations, we demonstrate that our proposed method outperforms the alternative approaches substantially in both out-of-sample prediction and feature selection. 
By implementing the model to study children's general mental abilities, we establish a powerful predictive mechanism based on the identified task contrast traits and resting-state sub-networks."}, "https://arxiv.org/abs/2407.21253": {"title": "An overview of methods for receiver operating characteristic analysis, with an application to SARS-CoV-2 vaccine-induced humoral responses in solid organ transplant recipients", "link": "https://arxiv.org/abs/2407.21253", "description": "arXiv:2407.21253v1 Announce Type: new \nAbstract: Receiver operating characteristic (ROC) analysis is a tool to evaluate the capacity of a numeric measure to distinguish between groups, often employed in the evaluation of diagnostic tests. Overall classification ability is sometimes crudely summarized by a single numeric measure such as the area under the empirical ROC curve. However, it may also be of interest to estimate the full ROC curve while leveraging assumptions regarding the nature of the data (parametric) or about the ROC curve directly (semiparametric). Although there has been recent interest in methods to conduct comparisons by way of stochastic ordering, nuances surrounding ROC geometry and estimation are not widely known in the broader scientific and statistical community. The overarching goals of this manuscript are to (1) provide an overview of existing frameworks for ROC curve estimation with examples, (2) offer intuition for and considerations regarding methodological trade-offs, and (3) supply sample R code to guide implementation. We utilize simulations to demonstrate the bias-variance trade-off across various methods. As an illustrative example, we analyze data from a recent cohort study in order to compare responses to SARS-CoV-2 vaccination between solid organ transplant recipients and healthy controls."}, "https://arxiv.org/abs/2407.21322": {"title": "Randomized Controlled Trials of Service Interventions: The Impact of Capacity Constraints", "link": "https://arxiv.org/abs/2407.21322", "description": "arXiv:2407.21322v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs), or experiments, are the gold standard for intervention evaluation. However, the main appeal of RCTs, the clean identification of causal effects, can be compromised by interference, when one subject's treatment assignment can influence another subject's behavior or outcomes. In this paper, we formalise and study a type of interference stemming from the operational implementation of a subclass of interventions we term Service Interventions (SIs): interventions that include an on-demand service component provided by a costly and limited resource (e.g., healthcare providers or teachers).\n We show that in such a system, the capacity constraints induce dependencies across experiment subjects, where an individual may need to wait before receiving the intervention. By modeling these dependencies using a queueing system, we show how increasing the number of subjects without increasing the capacity of the system can lead to a nonlinear decrease in the treatment effect size. This has implications for conventional power analysis and recruitment strategies: increasing the sample size of an RCT without appropriately expanding capacity can decrease the study's power. To address this issue, we propose a method to jointly select the system capacity and number of users using the square root staffing rule from queueing theory. 
We show how incorporating knowledge of the queueing structure can help an experimenter reduce the amount of capacity and number of subjects required while still maintaining high power. In addition, our analysis of congestion-driven interference provides one concrete mechanism to explain why similar protocols can result in different RCT outcomes and why promising interventions at the RCT stage may not perform well at scale."}, "https://arxiv.org/abs/2407.21407": {"title": "Deep Fr\\'echet Regression", "link": "https://arxiv.org/abs/2407.21407", "description": "arXiv:2407.21407v1 Announce Type: new \nAbstract: Advancements in modern science have led to the increasing availability of non-Euclidean data in metric spaces. This paper addresses the challenge of modeling relationships between non-Euclidean responses and multivariate Euclidean predictors. We propose a flexible regression model capable of handling high-dimensional predictors without imposing parametric assumptions. Two primary challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we demonstrate the feasibility of mapping the metric space where responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, employing local Fr\\'echet regression, to map the low-dimensional manifold representations back to objects in the original metric space. We develop a theoretical framework, investigating the convergence rate of deep neural networks under dependent sub-Gaussian noise with bias. The convergence rate of the proposed regression model is then obtained by expanding the scope of local Fr\\'echet regression to accommodate multivariate predictors in the presence of errors in predictors. Simulations and case studies show that the proposed model outperforms existing methods for non-Euclidean responses, focusing on the special cases of probability measures and networks."}, "https://arxiv.org/abs/2407.21588": {"title": "A Bayesian Bootstrap Approach for Dynamic Borrowing for Minimizing Mean Squared Error", "link": "https://arxiv.org/abs/2407.21588", "description": "arXiv:2407.21588v1 Announce Type: new \nAbstract: For dynamic borrowing to leverage external data to augment the control arm of small RCTs, the key step is determining the amount of borrowing based on the similarity of the outcomes in the controls from the trial and the external data sources. A simple approach for this task uses the empirical Bayesian approach, which maximizes the marginal likelihood (maxML) of the amount of borrowing, while a likelihood-independent alternative minimizes the mean squared error (minMSE). We consider two minMSE approaches that differ from each other in the way of estimating the parameters in the minMSE rule. The classical one adjusts for bias due to sample variance, which in some situations is equivalent to the maxML rule. We propose a simplified alternative without the variance adjustment, which has asymptotic properties partially similar to the maxML rule, leading to no borrowing if means of control outcomes from the two data sources are different and may have less bias than that of the maxML rule. In contrast, the maxML rule may lead to full borrowing even when two datasets are moderately different, which may not be a desirable property. 
For inference, we propose a Bayesian bootstrap (BB) based approach taking the uncertainty of the estimated amount of borrowing and that of pre-adjustment into account. The approach can also be used with a pre-adjustment on the external controls for population difference between the two data sources using, e.g., inverse probability weighting. The proposed approach is computationally efficient and is implemented via a simple algorithm. We conducted a simulation study to examine properties of the proposed approach, including the coverage of 95% CIs based on the Bayesian bootstrapped posterior samples or asymptotic normality. The approach is illustrated by an example of borrowing controls for an AML trial from another study."}, "https://arxiv.org/abs/2407.21682": {"title": "Shape-restricted transfer learning analysis for generalized linear regression model", "link": "https://arxiv.org/abs/2407.21682", "description": "arXiv:2407.21682v1 Announce Type: new \nAbstract: Transfer learning has emerged as a highly sought-after and actively pursued research area within the statistical community. The core concept of transfer learning involves leveraging insights and information from auxiliary datasets to enhance the analysis of the primary dataset of interest. In this paper, our focus is on datasets originating from distinct yet interconnected distributions. We assume that the training data conforms to a standard generalized linear model, while the testing data exhibit a connection to the training data based on a prior probability shift assumption. Ultimately, we discover that the two-sample conditional means are interrelated through an unknown, nondecreasing function. We integrate the power of generalized estimating equations with the shape-restricted score function, creating a robust framework for improved inference regarding the underlying parameters. We theoretically establish the asymptotic properties of our estimator and demonstrate, through simulation studies, that our method yields more accurate parameter estimates compared to those based solely on the testing or training data. Finally, we apply our method to a real-world example."}, "https://arxiv.org/abs/2407.21695": {"title": "Unveiling land use dynamics: Insights from a hierarchical Bayesian spatio-temporal modelling of Compositional Data", "link": "https://arxiv.org/abs/2407.21695", "description": "arXiv:2407.21695v1 Announce Type: new \nAbstract: Changes in land use patterns have significant environmental and socioeconomic impacts, making it crucial for policymakers to understand their causes and consequences. This study, part of the European LAMASUS (Land Management for Sustainability) project, aims to support the EU's climate neutrality target by developing a governance model through collaboration between policymakers, land users, and researchers. We present a methodological synthesis for treating land use data using a Bayesian approach within spatial and spatio-temporal modeling frameworks.\n The study tackles the challenges of analyzing land use changes, particularly the presence of zero values and computational issues with large datasets. It introduces joint model structures to address zeros and employs sequential inference and consensus methods for Big Data problems. 
Spatial downscaling models approximate smaller scales from aggregated data, circumventing high-resolution data complications.\n We explore Beta regression and Compositional Data Analysis (CoDa) for land use data, review relevant spatial and spatio-temporal models, and present strategies for handling zeros. The paper demonstrates the implementation of key models, downscaling techniques, and solutions to Big Data challenges with examples from simulated data and the LAMASUS project, providing a comprehensive framework for understanding and managing land use changes."}, "https://arxiv.org/abs/2407.21025": {"title": "Reinforcement Learning in High-frequency Market Making", "link": "https://arxiv.org/abs/2407.21025", "description": "arXiv:2407.21025v1 Announce Type: cross \nAbstract: This paper establishes a new and comprehensive theoretical analysis for the application of reinforcement learning (RL) in high-frequency market making. We bridge the modern RL theory and the continuous-time statistical models in high-frequency financial economics. Unlike most of the existing literature, which focuses on methodological research developing various RL methods for the market making problem, our work is a pilot study providing theoretical analysis. We target the effects of sampling frequency, and find an interesting tradeoff between the error and the complexity of the RL algorithm when tweaking the value of the time increment $\Delta$ $-$ as $\Delta$ becomes smaller, the error will be smaller but the complexity will be larger. We also study the two-player case under the general-sum game framework and establish the convergence of Nash equilibrium to the continuous-time game equilibrium as $\Delta\rightarrow0$. The Nash Q-learning algorithm, which is an online multi-agent RL method, is applied to solve the equilibrium. Our theories are not only useful for practitioners to choose the sampling frequency, but also very general and applicable to other high-frequency financial decision making problems, e.g., optimal executions, as long as the time-discretization of a continuous-time Markov decision process is adopted. Monte Carlo simulation evidence supports all of our theories."}, "https://arxiv.org/abs/2407.21238": {"title": "Quantile processes and their applications in finite populations", "link": "https://arxiv.org/abs/2407.21238", "description": "arXiv:2407.21238v1 Announce Type: cross \nAbstract: The weak convergence of the quantile processes, which are constructed based on different estimators of the finite population quantiles, is shown under various well-known sampling designs based on a superpopulation model. The results related to the weak convergence of these quantile processes are applied to find asymptotic distributions of the smooth $L$-estimators and the estimators of smooth functions of finite population quantiles. Based on these asymptotic distributions, confidence intervals are constructed for several finite population parameters like the median, the $\alpha$-trimmed means, the interquartile range and the quantile based measure of skewness. Comparisons of various estimators are carried out based on their asymptotic distributions. We show that the use of the auxiliary information in the construction of the estimators sometimes has an adverse effect on the performances of the smooth $L$-estimators and the estimators of smooth functions of finite population quantiles under several sampling designs. 
Further, the performance of each of the above-mentioned estimators sometimes becomes worse under sampling designs, which use the auxiliary information, than their performances under simple random sampling without replacement (SRSWOR)."}, "https://arxiv.org/abs/2407.21456": {"title": "A Ball Divergence Based Measure For Conditional Independence Testing", "link": "https://arxiv.org/abs/2407.21456", "description": "arXiv:2407.21456v1 Announce Type: cross \nAbstract: In this paper we introduce a new measure of conditional dependence between two random vectors ${\\boldsymbol X}$ and ${\\boldsymbol Y}$ given another random vector $\\boldsymbol Z$ using the ball divergence. Our measure characterizes conditional independence and does not require any moment assumptions. We propose a consistent estimator of the measure using a kernel averaging technique and derive its asymptotic distribution. Using this statistic we construct two tests for conditional independence, one in the model-${\\boldsymbol X}$ framework and the other based on a novel local wild bootstrap algorithm. In the model-${\\boldsymbol X}$ framework, which assumes the knowledge of the distribution of ${\\boldsymbol X}|{\\boldsymbol Z}$, applying the conditional randomization test we obtain a method that controls Type I error in finite samples and is asymptotically consistent, even if the distribution of ${\\boldsymbol X}|{\\boldsymbol Z}$ is incorrectly specified up to distance preserving transformations. More generally, in situations where ${\\boldsymbol X}|{\\boldsymbol Z}$ is unknown or hard to estimate, we design a double-bandwidth based local wild bootstrap algorithm that asymptotically controls both Type I error and power. We illustrate the advantage of our method, both in terms of Type I error and power, in a range of simulation settings and also in a real data example. A consequence of our theoretical results is a general framework for studying the asymptotic properties of a 2-sample conditional $V$-statistic, which is of independent interest."}, "https://arxiv.org/abs/2303.15376": {"title": "Identifiability of causal graphs under nonadditive conditionally parametric causal models", "link": "https://arxiv.org/abs/2303.15376", "description": "arXiv:2303.15376v4 Announce Type: replace \nAbstract: Causal discovery from observational data typically requires strong assumptions about the data-generating process. Previous research has established the identifiability of causal graphs under various models, including linear non-Gaussian, post-nonlinear, and location-scale models. However, these models may have limited applicability in real-world situations that involve a mixture of discrete and continuous variables or where the cause affects the variance or tail behavior of the effect. In this study, we introduce a new class of models, called Conditionally Parametric Causal Models (CPCM), which assume that the distribution of the effect, given the cause, belongs to well-known families such as Gaussian, Poisson, Gamma, or heavy-tailed Pareto distributions. These models are adaptable to a wide range of practical situations where the cause can influence the variance or tail behavior of the effect. We demonstrate the identifiability of CPCM by leveraging the concept of sufficient statistics. Furthermore, we propose an algorithm for estimating the causal structure from random samples drawn from CPCM. 
We evaluate the empirical properties of our methodology on various datasets, demonstrating state-of-the-art performance across multiple benchmarks."}, "https://arxiv.org/abs/2310.13232": {"title": "Interaction Screening and Pseudolikelihood Approaches for Tensor Learning in Ising Models", "link": "https://arxiv.org/abs/2310.13232", "description": "arXiv:2310.13232v2 Announce Type: replace \nAbstract: In this paper, we study two well known methods of Ising structure learning, namely the pseudolikelihood approach and the interaction screening approach, in the context of tensor recovery in $k$-spin Ising models. We show that both these approaches, with proper regularization, retrieve the underlying hypernetwork structure using a sample size logarithmic in the number of network nodes, and exponential in the maximum interaction strength and maximum node-degree. We also track down the exact dependence of the rate of tensor recovery on the interaction order $k$, that is allowed to grow with the number of samples and nodes, for both the approaches. We then provide a comparative discussion of the performance of the two approaches based on simulation studies, which also demonstrates the exponential dependence of the tensor recovery rate on the maximum coupling strength. Our tensor recovery methods are then applied on gene data taken from the Curated Microarray Database (CuMiDa), where we focus on understanding the important genes related to hepatocellular carcinoma."}, "https://arxiv.org/abs/2311.14846": {"title": "Fast Estimation of the Renshaw-Haberman Model and Its Variants", "link": "https://arxiv.org/abs/2311.14846", "description": "arXiv:2311.14846v2 Announce Type: replace \nAbstract: In mortality modelling, cohort effects are often taken into consideration as they add insights about variations in mortality across different generations. Statistically speaking, models such as the Renshaw-Haberman model may provide a better fit to historical data compared to their counterparts that incorporate no cohort effects. However, when such models are estimated using an iterative maximum likelihood method in which parameters are updated one at a time, convergence is typically slow and may not even be reached within a reasonably established maximum number of iterations. Among others, the slow convergence problem hinders the study of parameter uncertainty through bootstrapping methods. In this paper, we propose an intuitive estimation method that minimizes the sum of squared errors between actual and fitted log central death rates. The complications arising from the incorporation of cohort effects are overcome by formulating part of the optimization as a principal component analysis with missing values. Using mortality data from various populations, we demonstrate that our proposed method produces satisfactory estimation results and is significantly more efficient compared to the traditional likelihood-based approach."}, "https://arxiv.org/abs/2408.00032": {"title": "Methodological Foundations of Modern Causal Inference in Social Science Research", "link": "https://arxiv.org/abs/2408.00032", "description": "arXiv:2408.00032v1 Announce Type: new \nAbstract: This paper serves as a literature review of methodology concerning the (modern) causal inference methods to address the causal estimand with observational/survey data that have been or will be used in social science research. 
This paper is mainly divided into two parts: inference from the statistical estimand to the causal estimand, in which we review the assumptions for causal identification and the methodological strategies for addressing violations of some of these assumptions; and the asymptotic analysis linking the measure from the observational data to the theoretical measure, in which we replicate the derivation of the efficient/doubly robust average treatment effect estimator commonly used in current social science analysis."}, "https://arxiv.org/abs/2408.00100": {"title": "A new unit-bimodal distribution based on correlated Birnbaum-Saunders random variables", "link": "https://arxiv.org/abs/2408.00100", "description": "arXiv:2408.00100v1 Announce Type: new \nAbstract: In this paper, we propose a new distribution over the unit interval which can be characterized as a ratio of the type Z=Y/(X+Y) where X and Y are two correlated Birnbaum-Saunders random variables. The stress-strength probability between X and Y is calculated explicitly when the respective scale parameters are equal. Two applications of the ratio distribution are discussed."}, "https://arxiv.org/abs/2408.00177": {"title": "Fast variational Bayesian inference for correlated survival data: an application to invasive mechanical ventilation duration analysis", "link": "https://arxiv.org/abs/2408.00177", "description": "arXiv:2408.00177v1 Announce Type: new \nAbstract: Correlated survival data are prevalent in various clinical settings and have been extensively discussed in the literature. One of the most common types of correlated survival data is clustered survival data, where the survival times from individuals in a cluster are associated. Our study is motivated by invasive mechanical ventilation data from different intensive care units (ICUs) in Ontario, Canada, forming multiple clusters. The survival times from patients within the same ICU cluster are correlated. To address this association, we introduce a shared frailty log-logistic accelerated failure time model that accounts for intra-cluster correlation through a cluster-specific random intercept. We present a novel, fast variational Bayes (VB) algorithm for parameter inference and evaluate its performance using simulation studies varying the number of clusters and their sizes. We further compare the performance of our proposed VB algorithm with the h-likelihood method and a Markov Chain Monte Carlo (MCMC) algorithm. The proposed algorithm delivers satisfactory results and demonstrates computational efficiency over the MCMC algorithm. We apply our method to the ICU ventilation data from Ontario to investigate the ICU site random effect on ventilation duration."}, "https://arxiv.org/abs/2408.00289": {"title": "Operator on Operator Regression in Quantum Probability", "link": "https://arxiv.org/abs/2408.00289", "description": "arXiv:2408.00289v1 Announce Type: new \nAbstract: This article introduces operator on operator regression in quantum probability. In this regression model, the response and the independent variables are certain operator valued observables, and they are linearly associated with an unknown scalar coefficient (denoted by $\beta$), and the error is a random operator. 
In the course of this study, we propose a quantum version of a class of estimators (denoted by $M$ estimators) of $\beta$, and the large sample behaviour of these quantum estimators is derived, given that the true model is also linear and the samples are observed eigenvalue pairs of the operator valued observables."}, "https://arxiv.org/abs/2408.00291": {"title": "Bayesian Synthetic Control Methods with Spillover Effects: Estimating the Economic Cost of the 2011 Sudan Split", "link": "https://arxiv.org/abs/2408.00291", "description": "arXiv:2408.00291v1 Announce Type: new \nAbstract: The synthetic control method (SCM) is widely used for causal inference with panel data, particularly when there are few treated units. SCM assumes the stable unit treatment value assumption (SUTVA), which posits that potential outcomes are unaffected by the treatment status of other units. However, interventions often impact not only treated units but also untreated units, known as spillover effects. This study introduces a novel panel data method that extends SCM to allow for spillover effects and estimate both treatment and spillover effects. This method leverages a spatial autoregressive panel data model to account for spillover effects. We also propose Bayesian inference methods using Bayesian horseshoe priors for regularization. We apply the proposed method to two empirical studies: evaluating the effect of the California tobacco tax on consumption and estimating the economic impact of the 2011 division of Sudan on GDP per capita."}, "https://arxiv.org/abs/2408.00618": {"title": "Facilitating heterogeneous effect estimation via statistically efficient categorical modifiers", "link": "https://arxiv.org/abs/2408.00618", "description": "arXiv:2408.00618v1 Announce Type: new \nAbstract: Categorical covariates such as race, sex, or group are ubiquitous in regression analysis. While main-only (or ANCOVA) linear models are predominant, cat-modified linear models that include categorical-continuous or categorical-categorical interactions are increasingly important and allow heterogeneous, group-specific effects. However, with standard approaches, the addition of cat-modifiers fundamentally alters the estimates and interpretations of the main effects, often inflates their standard errors, and introduces significant concerns about group (e.g., racial) biases. We advocate an alternative parametrization and estimation scheme using abundance-based constraints (ABCs). ABCs induce a model parametrization that is both interpretable and equitable. Crucially, we show that with ABCs, the addition of cat-modifiers 1) leaves main effect estimates unchanged and 2) enhances their statistical power, under reasonable conditions. Thus, analysts can, and arguably should, include cat-modifiers in linear regression models to discover potential heterogeneous effects--without compromising estimation, inference, and interpretability for the main effects. Using simulated data, we verify these invariance properties for estimation and inference and showcase the capabilities of ABCs to increase statistical power. We apply these tools to study demographic heterogeneities among the effects of social and environmental factors on STEM educational outcomes for children in North Carolina. 
An R package lmabc is available."}, "https://arxiv.org/abs/2408.00651": {"title": "A Dirichlet stochastic block model for composition-weighted networks", "link": "https://arxiv.org/abs/2408.00651", "description": "arXiv:2408.00651v1 Announce Type: new \nAbstract: Network data are observed in various applications where the individual entities of the system interact with or are connected to each other, and often these interactions are defined by their associated strength or importance. Clustering is a common task in network analysis that involves finding groups of nodes displaying similarities in the way they interact with the rest of the network. However, most clustering methods use the strengths of connections between entities in their original form, ignoring the possible differences in the capacities of individual nodes to send or receive edges. This often leads to clustering solutions that are heavily influenced by the nodes' capacities. One way to overcome this is to analyse the strengths of connections in relative rather than absolute terms, expressing each edge weight as a proportion of the sending (or receiving) capacity of the respective node. This, however, induces additional modelling constraints that most existing clustering methods are not designed to handle. In this work we propose a stochastic block model for composition-weighted networks based on direct modelling of compositional weight vectors using a Dirichlet mixture, with the parameters determined by the cluster labels of the sender and the receiver nodes. Inference is implemented via an extension of the classification expectation-maximisation algorithm that uses a working independence assumption, expressing the complete data likelihood of each node of the network as a function of fixed cluster labels of the remaining nodes. A model selection criterion is derived to aid the choice of the number of clusters. The model is validated using simulation studies, and showcased on network data from the Erasmus exchange program and a bike sharing network for the city of London."}, "https://arxiv.org/abs/2408.00237": {"title": "Empirical Bayes Linked Matrix Decomposition", "link": "https://arxiv.org/abs/2408.00237", "description": "arXiv:2408.00237v1 Announce Type: cross \nAbstract: Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular \"omics\" technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. 
For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for \"blockwise\" imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation."}, "https://arxiv.org/abs/2408.00270": {"title": "Strong Oracle Guarantees for Partial Penalized Tests of High Dimensional Generalized Linear Models", "link": "https://arxiv.org/abs/2408.00270", "description": "arXiv:2408.00270v1 Announce Type: cross \nAbstract: Partial penalized tests provide flexible approaches to testing linear hypotheses in high dimensional generalized linear models. However, because the estimators used in these tests are local minimizers of potentially non-convex folded-concave penalized objectives, the solutions one computes in practice may not coincide with the unknown local minima for which we have nice theoretical guarantees. To close this gap between theory and computation, we introduce local linear approximation (LLA) algorithms to compute the full and reduced model estimators for these tests and develop theory specifically for the LLA solutions. We prove that our LLA algorithms converge to oracle estimators for the full and reduced models in two steps with overwhelming probability. We then leverage this strong oracle result and the asymptotic properties of the oracle estimators to show that the partial penalized test statistics evaluated at the two-step LLA solutions are approximately chi-square in large samples, giving us guarantees for the tests using specific computed solutions and thereby closing the theoretical gap. We conduct simulation studies to assess the finite-sample performance of our testing procedures, finding that partial penalized tests using the LLA solutions agree with tests using the oracle estimators, and demonstrate our testing procedures in a real data application."}, "https://arxiv.org/abs/2408.00399": {"title": "Unsupervised Pairwise Causal Discovery on Heterogeneous Data using Mutual Information Measures", "link": "https://arxiv.org/abs/2408.00399", "description": "arXiv:2408.00399v1 Announce Type: cross \nAbstract: A fundamental task in science is to determine the underlying causal relations because it is the knowledge of this functional structure what leads to the correct interpretation of an effect given the apparent associations in the observed data. In this sense, Causal Discovery is a technique that tackles this challenge by analyzing the statistical properties of the constituent variables. In this work, we target the generalizability of the discovery method by following a reductionist approach that only involves two variables, i.e., the pairwise or bi-variate setting. We question the current (possibly misleading) baseline results on the basis that they were obtained through supervised learning, which is arguably contrary to this genuinely exploratory endeavor. 
In consequence, we approach this problem in an unsupervised way, using robust Mutual Information measures, and observing the impact of the different variable types, which is oftentimes ignored in the design of solutions. Thus, we provide a novel set of standard unbiased results that can serve as a reference to guide future discovery tasks in completely unknown environments."}, "https://arxiv.org/abs/2205.03706": {"title": "Identification and Estimation of Dynamic Games with Unknown Information Structure", "link": "https://arxiv.org/abs/2205.03706", "description": "arXiv:2205.03706v3 Announce Type: replace \nAbstract: This paper studies the identification and estimation of dynamic games when the underlying information structure is unknown to the researcher. To tractably characterize the set of Markov perfect equilibrium predictions while maintaining weak assumptions on players' information, we introduce \\textit{Markov correlated equilibrium}, a dynamic analog of Bayes correlated equilibrium. The set of Markov correlated equilibrium predictions coincides with the set of Markov perfect equilibrium predictions that can arise when the players can observe more signals than assumed by the analyst. Using Markov correlated equilibrium as the solution concept, we propose tractable computational strategies for informationally robust estimation, inference, and counterfactual analysis that deal with the non-convexities arising in dynamic environments. We use our method to analyze the dynamic competition between Starbucks and Dunkin' in the US and the role of informational assumptions."}, "https://arxiv.org/abs/2208.07900": {"title": "Statistical Inferences and Predictions for Areal Data and Spatial Data Fusion with Hausdorff--Gaussian Processes", "link": "https://arxiv.org/abs/2208.07900", "description": "arXiv:2208.07900v2 Announce Type: replace \nAbstract: Accurate modeling of spatial dependence is pivotal in analyzing spatial data, influencing parameter estimation and out-of-sample predictions. The spatial structure and geometry of the data significantly impact valid statistical inference. Existing models for areal data often rely on adjacency matrices, struggling to differentiate between polygons of varying sizes and shapes. Conversely, data fusion models, while effective, rely on computationally intensive numerical integrals, presenting challenges for moderately large datasets. In response to these issues, we propose the Hausdorff-Gaussian process (HGP), a versatile model class utilizing the Hausdorff distance to capture spatial dependence in both point and areal data. We introduce a valid correlation function within the HGP framework, accommodating diverse modeling techniques, including geostatistical and areal models. Integration into generalized linear mixed-effects models enhances its applicability, particularly in addressing change of support and data fusion challenges. We validate our approach through a comprehensive simulation study and application to two real-world scenarios: one involving areal data and another demonstrating its effectiveness in data fusion. The results suggest that the HGP is competitive with specialized models regarding goodness-of-fit and prediction performances. 
In summary, the HGP offers a flexible and robust solution for modeling spatial data of various types and shapes, with potential applications spanning fields such as public health and climate science."}, "https://arxiv.org/abs/2209.08273": {"title": "Low-Rank Covariance Completion for Graph Quilting with Applications to Functional Connectivity", "link": "https://arxiv.org/abs/2209.08273", "description": "arXiv:2209.08273v2 Announce Type: replace \nAbstract: As a tool for estimating networks in high dimensions, graphical models are commonly applied to calcium imaging data to estimate functional neuronal connectivity, i.e. relationships between the activities of neurons. However, in many calcium imaging data sets, the full population of neurons is not recorded simultaneously, but instead in partially overlapping blocks. This leads to the Graph Quilting problem, as first introduced by (Vinci et.al. 2019), in which the goal is to infer the structure of the full graph when only subsets of features are jointly observed. In this paper, we study a novel two-step approach to Graph Quilting, which first imputes the complete covariance matrix using low-rank covariance completion techniques before estimating the graph structure. We introduce three approaches to solve this problem: block singular value decomposition, nuclear norm penalization, and non-convex low-rank factorization. While prior works have studied low-rank matrix completion, we address the challenges brought by the block-wise missingness and are the first to investigate the problem in the context of graph learning. We discuss theoretical properties of the two-step procedure, showing graph selection consistency of one proposed approach by proving novel L infinity-norm error bounds for matrix completion with block-missingness. We then investigate the empirical performance of the proposed methods on simulations and on real-world data examples, through which we show the efficacy of these methods for estimating functional connectivity from calcium imaging data."}, "https://arxiv.org/abs/2211.16502": {"title": "Identified vaccine efficacy for binary post-infection outcomes under misclassification without monotonicity", "link": "https://arxiv.org/abs/2211.16502", "description": "arXiv:2211.16502v4 Announce Type: replace \nAbstract: In order to meet regulatory approval, pharmaceutical companies often must demonstrate that new vaccines reduce the total risk of a post-infection outcome like transmission, symptomatic disease, severe illness, or death in randomized, placebo-controlled trials. Given that infection is a necessary precondition for a post-infection outcome, one can use principal stratification to partition the total causal effect of vaccination into two causal effects: vaccine efficacy against infection, and the principal effect of vaccine efficacy against a post-infection outcome in the patients that would be infected under both placebo and vaccination. Despite the importance of such principal effects to policymakers, these estimands are generally unidentifiable, even under strong assumptions that are rarely satisfied in real-world trials. We develop a novel method to nonparametrically point identify these principal effects while eliminating the monotonicity assumption and allowing for measurement error. Furthermore, our results allow for multiple treatments, and are general enough to be applicable outside of vaccine efficacy. 
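The Graph Quilting entry above (arXiv:2209.08273) is a two-step pipeline: complete the covariance matrix that is only observed on overlapping blocks, then estimate the graph. A hedged sketch of that pipeline using a simple iterative SVD-truncation completion followed by the graphical lasso; the rank, penalty level, and completion scheme here are illustrative choices, not the paper's estimators.

```python
# Sketch of a two-step "graph quilting" pipeline: complete a covariance matrix
# observed only on overlapping variable blocks via iterative low-rank SVD
# truncation, then run the graphical lasso on the completed matrix.
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(1)
p, n = 10, 500
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)

# Two overlapping recording blocks; entries linking variables that were never
# observed together are treated as missing.
obs = np.zeros((p, p), dtype=bool)
obs[np.ix_(range(0, 7), range(0, 7))] = True
obs[np.ix_(range(4, 10), range(4, 10))] = True

def complete_low_rank(S, mask, rank=3, n_iter=200):
    """Fill unobserved covariance entries by repeated rank-r SVD projection."""
    M = np.where(mask, S, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M, hermitian=True)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        M = np.where(mask, S, low)            # keep observed entries fixed
    return M

S_hat = complete_low_rank(S, obs)
w, V = np.linalg.eigh((S_hat + S_hat.T) / 2)  # project to positive definite
S_psd = (V * np.clip(w, 1e-4, None)) @ V.T
_, precision = graphical_lasso(S_psd, alpha=0.1)
print("nonzero off-diagonal precision entries:", int((np.abs(precision) > 1e-4).sum()) - p)
```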
Our method relies on the fact that many vaccine trials are run at geographically disparate health centers, and measure biologically-relevant categorical pretreatment covariates. We show that our method can be applied to a variety of clinical trial settings where vaccine efficacy against infection and a post-infection outcome can be jointly inferred. This can yield new insights from existing vaccine efficacy trial data and will aid researchers in designing new multi-arm clinical trials."}, "https://arxiv.org/abs/2305.06645": {"title": "Causal Inference for Continuous Multiple Time Point Interventions", "link": "https://arxiv.org/abs/2305.06645", "description": "arXiv:2305.06645v3 Announce Type: replace \nAbstract: There are limited options to estimate the treatment effects of variables which are continuous and measured at multiple time points, particularly if the true dose-response curve should be estimated as closely as possible. However, these situations may be of relevance: in pharmacology, one may be interested in how outcomes of people living with -- and treated for -- HIV, such as viral failure, would vary for time-varying interventions such as different drug concentration trajectories. A challenge for doing causal inference with continuous interventions is that the positivity assumption is typically violated. To address positivity violations, we develop projection functions, which reweigh and redefine the estimand of interest based on functions of the conditional support for the respective interventions. With these functions, we obtain the desired dose-response curve in areas of enough support, and otherwise a meaningful estimand that does not require the positivity assumption. We develop $g$-computation type plug-in estimators for this case. Those are contrasted with g-computation estimators which are applied to continuous interventions without specifically addressing positivity violations, which we propose to be presented with diagnostics. The ideas are illustrated with longitudinal data from HIV positive children treated with an efavirenz-based regimen as part of the CHAPAS-3 trial, which enrolled children $<13$ years in Zambia/Uganda. Simulations show in which situations a standard g-computation approach is appropriate, and in which it leads to bias and how the proposed weighted estimation approach then recovers the alternative estimand of interest."}, "https://arxiv.org/abs/2307.13124": {"title": "Conformal prediction for frequency-severity modeling", "link": "https://arxiv.org/abs/2307.13124", "description": "arXiv:2307.13124v3 Announce Type: replace \nAbstract: We present a model-agnostic framework for the construction of prediction intervals of insurance claims, with finite sample statistical guarantees, extending the technique of split conformal prediction to the domain of two-stage frequency-severity modeling. The framework effectiveness is showcased with simulated and real datasets using classical parametric models and contemporary machine learning methods. 
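The frequency-severity entry just above (arXiv:2307.13124) extends split conformal prediction to a two-stage setting. As a reference point, here is the standard single-stage split conformal recipe for a regression model; the learner, coverage level, and simulated data are illustrative, and the paper's two-stage frequency-severity extension layers a claim-count model on top of this.

```python
# Standard split conformal prediction intervals for a regression model.
# Single-stage sketch; not the paper's two-stage frequency-severity algorithm.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(2000, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=2000)

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

alpha = 0.1                                    # target 90% coverage
scores = np.abs(y_cal - model.predict(X_cal))  # conformity scores on calibration set
k = int(np.ceil((1 - alpha) * (len(scores) + 1)))
q = np.sort(scores)[k - 1]                     # finite-sample-valid quantile

X_new = rng.uniform(-3, 3, size=(5, 3))
pred = model.predict(X_new)
print(np.column_stack([pred - q, pred, pred + q]).round(2))
```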
When the underlying severity model is a random forest, we extend the two-stage split conformal prediction algorithm, showing how the out-of-bag mechanism can be leveraged to eliminate the need for a calibration set in the conformal procedure."}, "https://arxiv.org/abs/2308.13346": {"title": "GARHCX-NoVaS: A Model-free Approach to Incorporate Exogenous Variables", "link": "https://arxiv.org/abs/2308.13346", "description": "arXiv:2308.13346v2 Announce Type: replace \nAbstract: In this work, we explore the forecasting ability of a recently proposed normalizing and variance-stabilizing (NoVaS) transformation with the possible inclusion of exogenous variables. From an applied point of view, extra knowledge such as fundamentals- and sentiments-based information could be beneficial to improve the prediction accuracy of market volatility if it is incorporated into the forecasting process. In the classical approach, these models including exogenous variables are typically termed GARCHX-type models. Being a model-free prediction method, NoVaS has generally shown more accurate, stable and robust (to misspecifications) performance than classical GARCH-type methods. This motivates us to extend this framework to GARCHX forecasting as well. We derive the NoVaS transformation needed to include exogenous covariates and then construct the corresponding prediction procedure. Extensive simulation studies bolster our claim that the NoVaS method outperforms traditional ones, especially for long-term time-aggregated predictions. We also provide an interesting data analysis to exhibit how our method could possibly shed light on the role of geopolitical risks in forecasting volatility in national stock market indices for three different countries in Europe."}, "https://arxiv.org/abs/2401.10233": {"title": "Likelihood-ratio inference on differences in quantiles", "link": "https://arxiv.org/abs/2401.10233", "description": "arXiv:2401.10233v2 Announce Type: replace \nAbstract: Quantiles can represent key operational and business metrics, but the computational challenges associated with inference have hampered their adoption in online experimentation. One-sample confidence intervals are trivial to construct; however, two-sample inference has traditionally required bootstrapping or a density estimator. This paper presents a new two-sample difference-in-quantile hypothesis test and confidence interval based on a likelihood-ratio test statistic. A conservative version of the test does not involve a density estimator; a second version of the test, which uses a density estimator, yields confidence intervals very close to the nominal coverage level. It can be computed using only four order statistics from each sample."}, "https://arxiv.org/abs/2201.05102": {"title": "Space-time extremes of severe US thunderstorm environments", "link": "https://arxiv.org/abs/2201.05102", "description": "arXiv:2201.05102v3 Announce Type: replace-cross \nAbstract: Severe thunderstorms cause substantial economic and human losses in the United States. Simultaneous high values of convective available potential energy (CAPE) and storm relative helicity (SRH) are favorable to severe weather, and both they and the composite variable $\\mathrm{PROD}=\\sqrt{\\mathrm{CAPE}} \\times \\mathrm{SRH}$ can be used as indicators of severe thunderstorm activity.
Their extremal spatial dependence exhibits temporal non-stationarity due to seasonality and large-scale atmospheric signals such as El Ni\\~no-Southern Oscillation (ENSO). In order to investigate this, we introduce a space-time model based on a max-stable, Brown--Resnick, field whose range depends on ENSO and on time through a tensor product spline. We also propose a max-stability test based on empirical likelihood and the bootstrap. The marginal and dependence parameters must be estimated separately owing to the complexity of the model, and we develop a bootstrap-based model selection criterion that accounts for the marginal uncertainty when choosing the dependence model. In the case study, the out-sample performance of our model is good. We find that extremes of PROD, CAPE and SRH are generally more localized in summer and, in some regions, less localized during El Ni\\~no and La Ni\\~na events, and give meteorological interpretations of these phenomena."}, "https://arxiv.org/abs/2209.15224": {"title": "Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models", "link": "https://arxiv.org/abs/2209.15224", "description": "arXiv:2209.15224v3 Announce Type: replace-cross \nAbstract: Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on GMMs, which aims to leverage potentially similar GMM parameter structures among tasks to obtain improved learning performance compared to single-task learning. We propose a multi-task GMM learning procedure based on the EM algorithm that effectively utilizes unknown similarities between related tasks and is robust against a fraction of outlier tasks from arbitrary distributions. The proposed procedure is shown to achieve the minimax optimal rate of convergence for both parameter estimation error and the excess mis-clustering error, in a wide range of regimes. Moreover, we generalize our approach to tackle the problem of transfer learning for GMMs, where similar theoretical results are derived. Additionally, iterative unsupervised multi-task and transfer learning methods may suffer from an initialization alignment problem, and two alignment algorithms are proposed to resolve the issue. Finally, we demonstrate the effectiveness of our methods through simulations and real data examples. To the best of our knowledge, this is the first work studying multi-task and transfer learning on GMMs with theoretical guarantees."}, "https://arxiv.org/abs/2310.12428": {"title": "Enhanced Local Explainability and Trust Scores with Random Forest Proximities", "link": "https://arxiv.org/abs/2310.12428", "description": "arXiv:2310.12428v2 Announce Type: replace-cross \nAbstract: We initiate a novel approach to explain the predictions and out of sample performance of random forest (RF) regression and classification models by exploiting the fact that any RF can be mathematically formulated as an adaptive weighted K nearest-neighbors model. Specifically, we employ a recent result that, for both regression and classification tasks, any RF prediction can be rewritten exactly as a weighted sum of the training targets, where the weights are RF proximities between the corresponding pairs of data points. 
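The explainability entry above (arXiv:2310.12428) rests on the fact that a regression forest's prediction can be written as a proximity-weighted sum of the training targets. A small sketch verifying this identity with leaf co-membership weights; `bootstrap=False` is assumed so the equality is exact (with bootstrapping, per-tree in-bag counts would be needed).

```python
# Verify: an RF regression prediction equals a weighted sum of training targets,
# with weights given by leaf co-membership proximities. Bootstrap is disabled so
# the identity holds exactly with plain training-leaf counts.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, bootstrap=False,
                           max_features=0.5, random_state=0).fit(X, y)

x_new = rng.normal(size=(1, 5))
train_leaves = rf.apply(X)          # (n_train, n_trees) leaf indices
new_leaves = rf.apply(x_new)[0]     # (n_trees,)

weights = np.zeros(len(X))
for t in range(rf.n_estimators):
    same_leaf = train_leaves[:, t] == new_leaves[t]
    weights += same_leaf / same_leaf.sum() / rf.n_estimators

print("proximity-weighted sum of targets:", float(weights @ y))
print("rf.predict                       :", float(rf.predict(x_new)[0]))  # matches
```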
We show that this linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established feature-based methods like SHAP, which generate attributions for a model prediction across input features. We show how this proximity-based approach to explainability can be used in conjunction with SHAP to explain not just the model predictions, but also out-of-sample performance, in the sense that proximities furnish a novel means of assessing when a given model prediction is more or less likely to be correct. We demonstrate this approach in the modeling of US corporate bond prices and returns in both regression and classification cases."}, "https://arxiv.org/abs/2401.06925": {"title": "Modeling Latent Selection with Structural Causal Models", "link": "https://arxiv.org/abs/2401.06925", "description": "arXiv:2401.06925v2 Announce Type: replace-cross \nAbstract: Selection bias is ubiquitous in real-world data, and can lead to misleading results if not dealt with properly. We introduce a conditioning operation on Structural Causal Models (SCMs) to model latent selection from a causal perspective. We show that the conditioning operation transforms an SCM with the presence of an explicit latent selection mechanism into an SCM without such selection mechanism, which partially encodes the causal semantics of the selected subpopulation according to the original SCM. Furthermore, we show that this conditioning operation preserves the simplicity, acyclicity, and linearity of SCMs, and commutes with marginalization. Thanks to these properties, combined with marginalization and intervention, the conditioning operation offers a valuable tool for conducting causal reasoning tasks within causal models where latent details have been abstracted away. We demonstrate by example how classical results of causal inference can be generalized to include selection bias and how the conditioning operation helps with modeling of real-world problems."}, "https://arxiv.org/abs/2408.00908": {"title": "Early Stopping Based on Repeated Significance", "link": "https://arxiv.org/abs/2408.00908", "description": "arXiv:2408.00908v1 Announce Type: new \nAbstract: For a bucket test with a single criterion for success and a fixed number of samples or testing period, requiring a $p$-value less than a specified value of $\\alpha$ for the success criterion produces statistical confidence at level $1 - \\alpha$. For multiple criteria, a Bonferroni correction that partitions $\\alpha$ among the criteria produces statistical confidence, at the cost of requiring lower $p$-values for each criterion. The same concept can be applied to decisions about early stopping, but that can lead to strict requirements for $p$-values. We show how to address that challenge by requiring criteria to be successful at multiple decision points."}, "https://arxiv.org/abs/2408.01023": {"title": "Distilling interpretable causal trees from causal forests", "link": "https://arxiv.org/abs/2408.01023", "description": "arXiv:2408.01023v1 Announce Type: new \nAbstract: Machine learning methods for estimating treatment effect heterogeneity promise greater flexibility than existing methods that test a few pre-specified hypotheses. However, one problem these methods can have is that it can be challenging to extract insights from complicated machine learning models. 
A high-dimensional distribution of conditional average treatment effects may give accurate, individual-level estimates, but it can be hard to understand the underlying patterns and to know what the implications of the analysis are. This paper proposes the Distilled Causal Tree, a method for distilling a single, interpretable causal tree from a causal forest. This compares well to existing methods of extracting a single tree, particularly in noisy data or high-dimensional data where there are many correlated features. Here it even outperforms the base causal forest in most simulations. Its estimates are doubly robust and asymptotically normal just as those of the causal forest are."}, "https://arxiv.org/abs/2408.01208": {"title": "Distributional Difference-in-Differences Models with Multiple Time Periods: A Monte Carlo Analysis", "link": "https://arxiv.org/abs/2408.01208", "description": "arXiv:2408.01208v1 Announce Type: new \nAbstract: Researchers are often interested in evaluating the impact of a policy on the entire (or specific parts of the) distribution of the outcome of interest. In this paper, I provide a practical toolkit to recover the whole counterfactual distribution of the untreated potential outcome for the treated group in non-experimental settings with staggered treatment adoption by generalizing the existing quantile treatment effects on the treated (QTT) estimator proposed by Callaway and Li (2019). Besides the QTT, I consider different approaches that anonymously summarize the quantiles of the distribution of the outcome of interest (such as tests for stochastic dominance rankings) without relying on rank invariance assumptions. The finite-sample properties of the estimator proposed are analyzed via different Monte Carlo simulations. Despite being slightly biased for relatively small sample sizes, the proposed method's performance increases substantially when the sample size increases."}, "https://arxiv.org/abs/2408.00955": {"title": "Aggregation Models with Optimal Weights for Distributed Gaussian Processes", "link": "https://arxiv.org/abs/2408.00955", "description": "arXiv:2408.00955v1 Announce Type: cross \nAbstract: Gaussian process (GP) models have received increasing attention in recent years due to their superb prediction accuracy and modeling flexibility. To address the computational burdens of GP models for large-scale datasets, distributed learning for GPs is often adopted. Current aggregation models for distributed GPs are not time-efficient when incorporating correlations between GP experts. In this work, we propose a novel approach for aggregated prediction in distributed GPs. The technique is suitable for both the exact and sparse variational GPs. The proposed method incorporates correlations among experts, leading to better prediction accuracy with manageable computational requirements. As demonstrated by empirical studies, the proposed approach results in more stable predictions in less time than state-of-the-art consistent aggregation models."}, "https://arxiv.org/abs/2408.01017": {"title": "Application of Superconducting Technology in the Electricity Industry: A Game-Theoretic Analysis of Government Subsidy Policies and Power Company Equipment Upgrade Decisions", "link": "https://arxiv.org/abs/2408.01017", "description": "arXiv:2408.01017v1 Announce Type: cross \nAbstract: This study investigates the potential impact of \"LK-99,\" a novel material developed by a Korean research team, on the power equipment industry.
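For the distributed-GP aggregation entry above (arXiv:2408.00955), a useful point of comparison is the standard equal-weight product-of-experts (PoE) baseline, which combines expert predictions by their predictive precisions. The sketch below shows that baseline with scikit-learn; it is not the optimal-weight scheme the paper proposes, and the kernel, shard count, and data are assumptions.

```python
# Baseline product-of-experts (PoE) aggregation of distributed GP experts:
# precision-weighted combination of the experts' predictive means.
# Not the paper's optimal-weight scheme; illustrative only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=600)

# Split the data among experts and fit one GP per shard.
experts = []
for shard in np.array_split(rng.permutation(600), 4):
    gp = GaussianProcessRegressor(kernel=RBF(1.0), alpha=0.04, normalize_y=True)
    experts.append(gp.fit(X[shard], y[shard]))

X_test = np.linspace(0, 10, 7).reshape(-1, 1)
means, sds = zip(*(gp.predict(X_test, return_std=True) for gp in experts))
means, variances = np.array(means), np.array(sds) ** 2

prec = 1.0 / variances                   # each expert's predictive precision
agg_var = 1.0 / prec.sum(axis=0)         # PoE aggregated variance
agg_mean = agg_var * (prec * means).sum(axis=0)
print(np.column_stack([agg_mean, np.sqrt(agg_var)]).round(3))
```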
Using evolutionary game theory, the interactions between governmental subsidies and technology adoption by power companies are modeled. A key innovation of this research is the introduction of sensitivity analyses concerning time delays and initial subsidy amounts, which significantly influence the strategic decisions of both government and corporate entities. The findings indicate that these factors are critical in determining the rate of technology adoption and the efficiency of the market as a whole. Due to existing data limitations, the study offers a broad overview of likely trends and recommends the inclusion of real-world data for more precise modeling once the material demonstrates room-temperature superconducting characteristics. The research contributes foundational insights valuable for future policy design and has significant implications for advancing the understanding of technology adoption and market dynamics."}, "https://arxiv.org/abs/2408.01117": {"title": "Reduced-Rank Estimation for Ill-Conditioned Stochastic Linear Model with High Signal-to-Noise Ratio", "link": "https://arxiv.org/abs/2408.01117", "description": "arXiv:2408.01117v1 Announce Type: cross \nAbstract: Reduced-rank approach has been used for decades in robust linear estimation of both deterministic and random vector of parameters in linear model y=Hx+\\sqrt{epsilon}n. In practical settings, estimation is frequently performed under incomplete or inexact model knowledge, which in the stochastic case significantly increases mean-square-error (MSE) of an estimate obtained by the linear minimum mean-square-error (MMSE) estimator, which is MSE-optimal among linear estimators in the theoretical case of perfect model knowledge. However, the improved performance of reduced-rank estimators over MMSE estimator in estimation under incomplete or inexact model knowledge has been established to date only by means of numerical simulations and arguments indicating that the reduced-rank approach may provide improved performance over MMSE estimator in certain settings. In this paper we focus on the high signal-to-noise ratio (SNR) case, which has not been previously considered as a natural area of application of reduced-rank estimators. We first show explicit sufficient conditions under which familiar reduced-rank MMSE and truncated SVD estimators achieve lower MSE than MMSE estimator if singular values of array response matrix H are perturbed. We then extend these results to the case of a generic perturbation of array response matrix H, and demonstrate why MMSE estimator frequently attains higher MSE than reduced-rank MMSE and truncated SVD estimators if H is ill-conditioned. The main results of this paper are verified in numerical simulations."}, "https://arxiv.org/abs/2408.01298": {"title": "Probabilistic Inversion Modeling of Gas Emissions: A Gradient-Based MCMC Estimation of Gaussian Plume Parameters", "link": "https://arxiv.org/abs/2408.01298", "description": "arXiv:2408.01298v1 Announce Type: cross \nAbstract: In response to global concerns regarding air quality and the environmental impact of greenhouse gas emissions, detecting and quantifying sources of emissions has become critical. To understand this impact and target mitigations effectively, methods for accurate quantification of greenhouse gas emissions are required. In this paper, we focus on the inversion of concentration measurements to estimate source location and emission rate. 
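The reduced-rank estimation entry above (arXiv:2408.01117) concerns the advantage of rank-truncated estimators over the full linear estimator when the response matrix H is ill-conditioned. The following hedged numerical sketch illustrates the basic effect, with ordinary least squares standing in for the full-rank estimator and an arbitrary truncation rank, noise level, and spectrum (not the paper's MMSE setting or conditions).

```python
# Illustration: for an ill-conditioned H in y = H x + noise, a truncated-SVD
# estimator can achieve much lower MSE than the full least-squares solution.
import numpy as np

rng = np.random.default_rng(6)
m, p, rank_keep = 40, 20, 10

U, _ = np.linalg.qr(rng.normal(size=(m, p)))
V, _ = np.linalg.qr(rng.normal(size=(p, p)))
sing = np.logspace(0, -6, p)               # rapidly decaying singular values
H = U @ np.diag(sing) @ V.T                # ill-conditioned response matrix

def tsvd_solve(H, y, r):
    """Least-squares solution restricted to the top-r singular directions of H."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])

mse_full = mse_tsvd = 0.0
for _ in range(500):
    x = rng.normal(size=p)
    y = H @ x + 1e-3 * rng.normal(size=m)
    mse_full += np.sum((np.linalg.lstsq(H, y, rcond=None)[0] - x) ** 2) / 500
    mse_tsvd += np.sum((tsvd_solve(H, y, rank_keep) - x) ** 2) / 500

print(f"full LS MSE: {mse_full:.3f}   truncated-SVD MSE: {mse_tsvd:.3f}")
```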
In practice, such methods often rely on atmospheric stability class-based Gaussian plume dispersion models. However, incorrectly identifying the atmospheric stability class can lead to significant bias in estimates of source characteristics. We present a robust approach that reduces this bias by jointly estimating the horizontal and vertical dispersion parameters of the Gaussian plume model, together with source location and emission rate, atmospheric background concentration, and sensor measurement error variance. Uncertainty in parameter estimation is quantified through probabilistic inversion using gradient-based MCMC methods. A simulation study is performed to assess the inversion methodology. We then focus on inference for the published Chilbolton dataset which contains controlled methane releases and demonstrates the practical benefits of estimating dispersion parameters in source inversion problems."}, "https://arxiv.org/abs/2408.01326": {"title": "Nonparametric Mean and Covariance Estimation for Discretely Observed High-Dimensional Functional Data: Rates of Convergence and Division of Observational Regimes", "link": "https://arxiv.org/abs/2408.01326", "description": "arXiv:2408.01326v1 Announce Type: cross \nAbstract: Nonparametric estimation of the mean and covariance parameters for functional data is a critical task, with local linear smoothing being a popular choice. In recent years, many scientific domains are producing high-dimensional functional data for which $p$, the number of curves per subject, is often much larger than the sample size $n$. Much of the methodology developed for such data rely on preliminary nonparametric estimates of the unknown mean functions and the auto- and cross-covariance functions. We investigate the convergence rates of local linear estimators in terms of the maximal error across components and pairs of components for mean and covariance functions, respectively, in both $L^2$ and uniform metrics. The local linear estimators utilize a generic weighting scheme that can adjust for differing numbers of discrete observations $N_{ij}$ across curves $j$ and subjects $i$, where the $N_{ij}$ vary with $n$. Particular attention is given to the equal weight per observation (OBS) and equal weight per subject (SUBJ) weighting schemes. The theoretical results utilize novel applications of concentration inequalities for functional data and demonstrate that, similar to univariate functional data, the order of the $N_{ij}$ relative to $p$ and $n$ divides high-dimensional functional data into three regimes: sparse, dense, and ultra-dense, with the high-dimensional parametric convergence rate of $\\left\\{\\log(p)/n\\right\\}^{1/2}$ being attainable in the latter two."}, "https://arxiv.org/abs/2211.06337": {"title": "Differentially Private Methods for Compositional Data", "link": "https://arxiv.org/abs/2211.06337", "description": "arXiv:2211.06337v3 Announce Type: replace \nAbstract: Confidential data, such as electronic health records, activity data from wearable devices, and geolocation data, are becoming increasingly prevalent. Differential privacy provides a framework to conduct statistical analyses while mitigating the risk of leaking private information. Compositional data, which consist of vectors with positive components that add up to a constant, have received little attention in the differential privacy literature. This article proposes differentially private approaches for analyzing compositional data using the Dirichlet distribution. 
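Returning to the plume-inversion entry above (arXiv:2408.01298), which jointly estimates dispersion parameters with the source characteristics: a minimal sketch of the forward Gaussian plume model and the Gaussian log-likelihood a gradient-based MCMC sampler would target is given below. The textbook ground-reflection form is assumed, the dispersion parameters are held constant across sensors, and background concentration, stability-class dependence, and priors are omitted, so this is not the paper's full model.

```python
# Forward Gaussian plume model and the measurement-error log-likelihood that an
# MCMC sampler would target. Simplified sketch under the assumptions above.
import numpy as np

def plume_concentration(y, z, Q, u, Hs, sigma_y, sigma_z):
    """Textbook Gaussian plume with ground reflection.
    y: crosswind offset, z: sensor height, Q: emission rate, u: wind speed,
    Hs: source (stack) height, sigma_y/sigma_z: dispersion parameters."""
    lateral = np.exp(-y ** 2 / (2 * sigma_y ** 2))
    vertical = (np.exp(-(z - Hs) ** 2 / (2 * sigma_z ** 2))
                + np.exp(-(z + Hs) ** 2 / (2 * sigma_z ** 2)))
    return Q / (2 * np.pi * u * sigma_y * sigma_z) * lateral * vertical

def log_likelihood(params, obs, y, z, u=4.0, Hs=2.0):
    """Gaussian log-likelihood (up to a constant); tau is measurement-error s.d."""
    Q, sigma_y, sigma_z, tau = params
    if min(Q, sigma_y, sigma_z, tau) <= 0:
        return -np.inf
    mu = plume_concentration(y, z, Q, u, Hs, sigma_y, sigma_z)
    return -0.5 * np.sum(((obs - mu) / tau) ** 2) - obs.size * np.log(tau)

# Tiny synthetic example with five sensors.
y_off = np.array([0.0, 10.0, -15.0, 5.0, 0.0])
z_sen = np.full(5, 2.0)
true = (1.5, 20.0, 10.0, 1e-4)   # Q, sigma_y, sigma_z, tau
rng = np.random.default_rng(7)
obs = plume_concentration(y_off, z_sen, true[0], 4.0, 2.0, true[1], true[2])
obs = obs + rng.normal(scale=true[3], size=obs.size)
print(log_likelihood(true, obs, y_off, z_sen))
```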
We explore several methods, including Bayesian and bootstrap procedures. For the Bayesian methods, we consider posterior inference techniques based on Markov Chain Monte Carlo, Approximate Bayesian Computation, and asymptotic approximations. We conduct an extensive simulation study to compare these approaches and make evidence-based recommendations. Finally, we apply the methodology to a data set from the American Time Use Survey."}, "https://arxiv.org/abs/2212.09996": {"title": "A marginalized three-part interrupted time series regression model for proportional data", "link": "https://arxiv.org/abs/2212.09996", "description": "arXiv:2212.09996v2 Announce Type: replace \nAbstract: Interrupted time series (ITS) is often used to evaluate the effectiveness of a health policy intervention that accounts for the temporal dependence of outcomes. When the outcome of interest is a percentage or percentile, the data can be highly skewed, bounded in $[0, 1]$, and have many zeros or ones. A three-part Beta regression model is commonly used to separate zeros, ones, and positive values explicitly by three submodels. However, incorporating temporal dependence into the three-part Beta regression model is challenging. In this article, we propose a marginalized zero-one-inflated Beta time series model that captures the temporal dependence of outcomes through copula and allows investigators to examine covariate effects on the marginal mean. We investigate its practical performance using simulation studies and apply the model to a real ITS study."}, "https://arxiv.org/abs/2309.13666": {"title": "More power to you: Using machine learning to augment human coding for more efficient inference in text-based randomized trials", "link": "https://arxiv.org/abs/2309.13666", "description": "arXiv:2309.13666v2 Announce Type: replace \nAbstract: For randomized trials that use text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by trained human raters. This process, the current standard, is both time-consuming and limiting: even the largest human coding efforts are typically constrained to measure only a small set of dimensions across a subsample of available texts. In this work, we present an inferential framework that can be used to increase the power of an impact assessment, given a fixed human-coding budget, by taking advantage of any \"untapped\" observations -- those documents not manually scored due to time or resource constraints -- as a supplementary resource. Our approach, a methodological combination of causal inference, survey sampling methods, and machine learning, has four steps: (1) select and code a sample of documents; (2) build a machine learning model to predict the human-coded outcomes from a set of automatically extracted text features; (3) generate machine-predicted scores for all documents and use these scores to estimate treatment impacts; and (4) adjust the final impact estimates using the residual differences between human-coded and machine-predicted outcomes. This final step ensures any biases in the modeling procedure do not propagate to biases in final estimated effects. 
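The four-step procedure just described (code a subsample, fit a text model, estimate impacts on the machine predictions, then correct with the coded-subsample residuals) can be sketched in a few lines for a simple difference-in-means impact estimate. This is a stylized illustration under assumed conditions (random coding subsample, randomized treatment, synthetic features), not the authors' full estimator or inference procedure.

```python
# Stylized version of the four-step ML-augmented impact estimate:
# (1) human-code a random subsample, (2) fit a model on text features,
# (3) estimate the impact on machine predictions for all documents,
# (4) correct with the human-vs-machine residual difference on the coded set.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(8)
n, d = 5000, 50
features = rng.normal(size=(n, d))                 # extracted text features
treat = rng.integers(0, 2, size=n)                 # randomized assignment
y_human = features @ rng.normal(size=d) * 0.3 + 0.4 * treat + rng.normal(size=n)

coded = rng.choice(n, size=500, replace=False)     # step 1: coding budget of 500
model = Ridge().fit(features[coded], y_human[coded])          # step 2
y_pred = model.predict(features)                              # step 3

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

impact_pred = diff_in_means(y_pred, treat)                    # impact on predictions
resid = y_human[coded] - y_pred[coded]
impact_resid = diff_in_means(resid, treat[coded])             # step 4: correction
print("adjusted impact estimate:", round(impact_pred + impact_resid, 3))
```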
Through an extensive simulation study and an application to a recent field trial in education, we show that our proposed approach can be used to reduce the scope of a human-coding effort while maintaining nominal power to detect a significant treatment impact."}, "https://arxiv.org/abs/2312.15222": {"title": "Sequential decision rules for 2-arm clinical trials: a Bayesian perspective", "link": "https://arxiv.org/abs/2312.15222", "description": "arXiv:2312.15222v2 Announce Type: replace \nAbstract: Practical employment of Bayesian trial designs has been rare. Even if accepted in principle, the regulators have commonly required that such designs be calibrated according to an upper bound for the frequentist Type 1 error rate. This represents an internally inconsistent hybrid methodology, where important advantages from applying the Bayesian principles are lost. In particular, all pre-planned interim looks have an inflating multiplicity effect on Type 1 error rate. To present an alternative approach, we consider the prototype case of a 2-arm superiority trial with binary outcomes. The design is adaptive, using error tolerance criteria based on sequentially updated posterior probabilities, to conclude efficacy of the experimental treatment or futility of the trial. In the proposed approach, the regulators are assumed to have the main responsibility in defining criteria for the error control against false conclusions of efficacy, whereas the trial investigators will have a natural role in determining the criteria for concluding futility and thereby stopping the trial. It is suggested that the control of Type 1 error rate be replaced by the control of a criterion called regulators' False Discovery Probability (rFDP), the term corresponding directly to the probability interpretation of this criterion. Importantly, the sequential error control during the data analysis based on posterior probabilities will satisfy the rFDP criterion automatically, so that no separate computations are needed for such a purpose. The method contains the option of applying a decision rule for terminating the trial early if the predicted costs from continuing would exceed the corresponding gains. The proposed approach can lower the ultimately unnecessary barriers from the practical application of Bayesian trial designs."}, "https://arxiv.org/abs/2401.13975": {"title": "Sparse signal recovery and source localization via covariance learning", "link": "https://arxiv.org/abs/2401.13975", "description": "arXiv:2401.13975v2 Announce Type: replace \nAbstract: In the Multiple Measurements Vector (MMV) model, measurement vectors are connected to unknown, jointly sparse signal vectors through a linear regression model employing a single known measurement matrix (or dictionary). Typically, the number of atoms (columns of the dictionary) is greater than the number measurements and the sparse signal recovery problem is generally ill-posed. In this paper, we treat the signals and measurement noise as independent Gaussian random vectors with unknown signal covariance matrix and noise variance, respectively, and characterize the solution of the likelihood equation in terms of fixed point equation, thereby enabling the recovery of the sparse signal support (sources with non-zero variances) via a block coordinate descent (BCD) algorithm that leverage the FP characterization of the likelihood equation. Additionally, a greedy pursuit method, analogous to popular simultaneous orthogonal matching pursuit (OMP), is introduced. 
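For the sequential 2-arm trial entry above (arXiv:2312.15222), the core computation is a sequentially updated posterior probability that the experimental arm is better, compared against efficacy and futility thresholds. A hedged sketch with Beta-Binomial posteriors follows; the priors, thresholds, and interim data are placeholders and do not implement the paper's rFDP calibration.

```python
# Interim monitoring of a 2-arm binary trial with Beta-Binomial posteriors:
# stop for efficacy if P(p_treat > p_control | data) is high, for futility if low.
# Thresholds and priors are placeholders, not a calibrated rFDP rule.
import numpy as np

def prob_treat_better(s_t, n_t, s_c, n_c, draws=200_000, seed=0):
    """Monte Carlo estimate of P(p_treat > p_control | data), Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    p_t = rng.beta(1 + s_t, 1 + n_t - s_t, size=draws)
    p_c = rng.beta(1 + s_c, 1 + n_c - s_c, size=draws)
    return float(np.mean(p_t > p_c))

eff_threshold, fut_threshold = 0.985, 0.05
interim_data = [  # (successes_treat, n_treat, successes_control, n_control)
    (14, 25, 10, 25),
    (31, 50, 20, 50),
    (49, 75, 30, 75),
]
for look, (s_t, n_t, s_c, n_c) in enumerate(interim_data, start=1):
    post = prob_treat_better(s_t, n_t, s_c, n_c)
    decision = ("stop: efficacy" if post > eff_threshold
                else "stop: futility" if post < fut_threshold else "continue")
    print(f"look {look}: P(p_T > p_C | data) = {post:.3f} -> {decision}")
```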
Our numerical examples demonstrate effectiveness of the proposed covariance learning (CL) algorithms both in classic sparse signal recovery as well as in direction-of-arrival (DOA) estimation problems where they perform favourably compared to the state-of-the-art algorithms under a broad variety of settings."}, "https://arxiv.org/abs/2110.15501": {"title": "Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning", "link": "https://arxiv.org/abs/2110.15501", "description": "arXiv:2110.15501v4 Announce Type: replace-cross \nAbstract: Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instructions on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection for consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulation studies and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method."}, "https://arxiv.org/abs/2408.01626": {"title": "Weighted Brier Score -- an Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration", "link": "https://arxiv.org/abs/2408.01626", "description": "arXiv:2408.01626v1 Announce Type: new \nAbstract: As advancements in novel biomarker-based algorithms and models accelerate disease risk prediction and stratification in medicine, it is crucial to evaluate these models within the context of their intended clinical application. Prediction models output the absolute risk of disease; subsequently, patient counseling and shared decision-making are based on the estimated individual risk and cost-benefit assessment. The overall impact of the application is often referred to as clinical utility, which received significant attention in terms of model assessment lately. The classic Brier score is a popular measure of prediction accuracy; however, it is insufficient for effectively assessing clinical utility. To address this limitation, we propose a class of weighted Brier scores that aligns with the decision-theoretic framework of clinical utility. Additionally, we decompose the weighted Brier score into discrimination and calibration components, examining how weighting influences the overall score and its individual components. Through this decomposition, we link the weighted Brier score to the $H$ measure, which has been proposed as a coherent alternative to the area under the receiver operating characteristic curve. 
This theoretical link to the $H$ measure further supports our weighting method and underscores the essential elements of discrimination and calibration in risk prediction evaluation. The practical use of the weighted Brier score as an overall summary is demonstrated using data from the Prostate Cancer Active Surveillance Study (PASS)."}, "https://arxiv.org/abs/2408.01628": {"title": "On Nonparametric Estimation of Covariograms", "link": "https://arxiv.org/abs/2408.01628", "description": "arXiv:2408.01628v1 Announce Type: new \nAbstract: The paper overviews and investigates several nonparametric methods of estimating covariograms. It provides a unified approach and notation to compare the main approaches used in applied research. The primary focus is on methods that utilise the actual values of observations, rather than their ranks. We concentrate on such desirable properties of covariograms as bias, positive-definiteness and behaviour at large distances. The paper discusses several theoretical properties and demonstrates some surprising drawbacks of well-known estimators. Numerical studies provide a comparison of representatives from different methods using various metrics. The results provide important insight and guidance for practitioners who use estimated covariograms in various applications, including kriging, monitoring network optimisation, cross-validation, and other related tasks."}, "https://arxiv.org/abs/2408.01662": {"title": "Principal component analysis balancing prediction and approximation accuracy for spatial data", "link": "https://arxiv.org/abs/2408.01662", "description": "arXiv:2408.01662v1 Announce Type: new \nAbstract: Dimension reduction is often the first step in statistical modeling or prediction of multivariate spatial data. However, most existing dimension reduction techniques do not account for the spatial correlation between observations and do not take the downstream modeling task into consideration when finding the lower-dimensional representation. We formalize the closeness of approximation to the original data and the utility of lower-dimensional scores for downstream modeling as two complementary, sometimes conflicting, metrics for dimension reduction. We illustrate how existing methodologies fall into this framework and propose a flexible dimension reduction algorithm that achieves the optimal trade-off. We derive a computationally simple form for our algorithm and illustrate its performance through simulation studies, as well as two applications in air pollution modeling and spatial transcriptomics."}, "https://arxiv.org/abs/2408.01893": {"title": "Minimum Gamma Divergence for Regression and Classification Problems", "link": "https://arxiv.org/abs/2408.01893", "description": "arXiv:2408.01893v1 Announce Type: new \nAbstract: The book is structured into four main chapters. Chapter 1 introduces the foundational concepts of divergence measures, including the well-known Kullback-Leibler divergence and its limitations. It then presents a detailed exploration of power divergences, such as the $\\alpha$, $\\beta$, and $\\gamma$-divergences, highlighting their unique properties and advantages. Chapter 2 explores minimum divergence methods for regression models, demonstrating how these methods can improve robustness and efficiency in statistical estimation. Chapter 3 extends these methods to Poisson point processes, with a focus on ecological applications, providing a robust framework for modeling species distributions and other spatial phenomena. 
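The covariogram entry above (arXiv:2408.01628) surveys estimators built from the actual observation values. As a concrete reference point, here is a sketch of the classical method-of-moments covariogram estimator binned by distance; the simulated field, stationarity/isotropy assumption, and bin width are illustrative choices, and the surveyed estimators differ in detail.

```python
# Classical method-of-moments covariogram estimator on distance bins.
# Assumes second-order stationarity and isotropy; bin width is arbitrary.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(9)
n = 400
coords = rng.uniform(0, 10, size=(n, 2))

# Simulate a Gaussian random field with exponential covariance for illustration.
D = squareform(pdist(coords))
z = rng.multivariate_normal(np.zeros(n), np.exp(-D / 2.0))

def empirical_covariogram(coords, z, bin_width=0.5, max_lag=5.0):
    zc = z - z.mean()
    i, j = np.triu_indices(len(z), k=1)
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    prods = zc[i] * zc[j]
    bins = np.arange(0, max_lag + bin_width, bin_width)
    idx = np.digitize(d, bins) - 1
    lags, est = [], []
    for b in range(len(bins) - 1):
        keep = idx == b
        if keep.any():
            lags.append(bins[b] + bin_width / 2)
            est.append(prods[keep].mean())
    return np.array(lags), np.array(est)

lags, c_hat = empirical_covariogram(coords, z)
print(np.column_stack([lags, c_hat.round(3)]))
```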
Finally, Chapter 4 explores the use of divergence measures in machine learning, including applications in Boltzmann machines, AdaBoost, and active learning. The chapter emphasizes the practical benefits of these measures in enhancing model robustness and performance."}, "https://arxiv.org/abs/2408.01985": {"title": "Analysis of Factors Affecting the Entry of Foreign Direct Investment into Indonesia (Case Study of Three Industrial Sectors in Indonesia)", "link": "https://arxiv.org/abs/2408.01985", "description": "arXiv:2408.01985v1 Announce Type: new \nAbstract: The realization of FDI and DDI from January to December 2022 reached Rp1,207.2 trillion. The largest FDI investment realization by sector was led by the Basic Metal, Metal Goods, Non-Machinery, and Equipment Industry sector, followed by the Mining sector and the Electricity, Gas, and Water sector. The uneven amount of FDI investment realization in each industry and the impact of the COVID-19 pandemic in Indonesia are the main issues addressed in this study. This study aims to identify the factors that influence the entry of FDI into industries in Indonesia and measure the extent of these factors' influence on the entry of FDI. In this study, classical assumption tests and hypothesis tests are conducted to investigate whether the research model is robust enough to provide strategic options nationally. Moreover, this study uses the ordinary least squares (OLS) method. The results show that the electricity factor does not influence FDI inflows in the three industries. The Human Development Index (HDI) factor has a significant negative effect on FDI in the Mining Industry and a significant positive effect on FDI in the Basic Metal, Metal Goods, Non-Machinery, and Equipment Industries. However, HDI does not influence FDI in the Electricity, Gas, and Water Industries in Indonesia."}, "https://arxiv.org/abs/2408.02028": {"title": "Multivariate Information Measures: A Copula-based Approach", "link": "https://arxiv.org/abs/2408.02028", "description": "arXiv:2408.02028v1 Announce Type: new \nAbstract: Multivariate datasets are common in various real-world applications. Recently, copulas have received significant attention for modeling dependencies among random variables. A copula-based information measure is required to quantify the uncertainty inherent in these dependencies. This paper introduces a multivariate variant of the cumulative copula entropy and explores its various properties, including bounds, stochastic orders, and convergence-related results. Additionally, we define a cumulative copula information generating function and derive it for several well-known families of multivariate copulas. A fractional generalization of the multivariate cumulative copula entropy is also introduced and examined. We present a non-parametric estimator of the cumulative copula entropy using empirical beta copula. Furthermore, we propose a new distance measure between two copulas based on the Kullback-Leibler divergence and discuss a goodness-of-fit test based on this measure."}, "https://arxiv.org/abs/2408.02159": {"title": "SPINEX-TimeSeries: Similarity-based Predictions with Explainable Neighbors Exploration for Time Series and Forecasting Problems", "link": "https://arxiv.org/abs/2408.02159", "description": "arXiv:2408.02159v1 Announce Type: new \nAbstract: This paper introduces a new addition to the SPINEX (Similarity-based Predictions with Explainable Neighbors Exploration) family, tailored specifically for time series and forecasting analysis. 
This new algorithm leverages the concept of similarity and higher-order temporal interactions across multiple time scales to enhance predictive accuracy and interpretability in forecasting. To evaluate the effectiveness of SPINEX, we present comprehensive benchmarking experiments comparing it against 18 algorithms across 49 synthetic and real datasets characterized by varying trends, seasonality, and noise levels. Our performance assessment focused on forecasting accuracy and computational efficiency. Our findings reveal that SPINEX consistently ranks among the top 5 performers in forecasting precision and has a superior ability to handle complex temporal dynamics compared to commonly adopted algorithms. Moreover, the algorithm's explainability features, Pareto efficiency, and medium complexity (on the order of O(log n)) are demonstrated through detailed visualizations to enhance the prediction and decision-making process. We note that integrating similarity-based concepts opens new avenues for research in predictive analytics, promising more accurate and transparent decision making."}, "https://arxiv.org/abs/2408.02331": {"title": "Explaining and Connecting Kriging with Gaussian Process Regression", "link": "https://arxiv.org/abs/2408.02331", "description": "arXiv:2408.02331v1 Announce Type: new \nAbstract: Kriging and Gaussian Process Regression are statistical methods that allow predicting the outcome of a random process or a random field by using a sample of correlated observations. In other words, the random process or random field is partially observed, and by using a sample a prediction is made, pointwise or as a whole, where the latter can be thought of as a reconstruction. In addition, the techniques provide a measure of uncertainty for the prediction. The methods have different origins. Kriging comes from geostatistics, a field which started to develop around 1950, oriented to mining valuation problems, whereas Gaussian Process Regression has gained popularity in the area of machine learning in the last decade of the previous century. In the literature, the methods are usually presented as being the same technique. However, beyond this affirmation, the techniques have not yet been compared on a thorough mathematical basis, nor has it been explained why and under which conditions this affirmation holds. Furthermore, Kriging has many variants and this affirmation should be made precise. In this paper, this gap is filled. It is shown, step by step, how both methods are deduced from first principles (with a major focus on Kriging), the mathematical connection between them, and which Kriging variant corresponds to which Gaussian Process Regression setup. The three most widely used versions of Kriging are considered: Simple Kriging, Ordinary Kriging and Universal Kriging. It is found that, despite their closeness, the techniques are different in their approach and assumptions, much as the Least Squares method, the Best Linear Unbiased Estimator method, and the Likelihood method do in regression.
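The simplest of the correspondences discussed in the Kriging entry above can be checked numerically: with a known zero mean, the Simple Kriging predictor coincides with the GP regression posterior mean under the same covariance function and noise level. The sketch below assumes a fixed RBF kernel and noise variance and skips hyperparameter estimation; it is a small check under these assumptions, not a substitute for the paper's derivation.

```python
# Numerical check: Simple Kriging weights reproduce the GP regression posterior
# mean when the same covariance function, zero prior mean, and noise are used.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(10)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=30)
X_new = np.array([[1.3], [2.7], [4.1]])

length_scale, noise = 1.0, 0.1 ** 2

def rbf(a, b):
    d2 = (a[:, None, 0] - b[None, :, 0]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Simple Kriging (known zero mean): weights = K^{-1} k, prediction = weights' y.
K = rbf(X, X) + noise * np.eye(len(X))
k_star = rbf(X, X_new)
kriging_pred = k_star.T @ np.linalg.solve(K, y)

# GP regression with the same fixed kernel and noise (no hyperparameter tuning).
gp = GaussianProcessRegressor(kernel=RBF(length_scale), alpha=noise, optimizer=None)
gp.fit(X, y)
print(np.column_stack([kriging_pred, gp.predict(X_new)]).round(6))  # columns agree
```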
I hope this work can serve for a deeper understanding of the relationship between Kriging and Gaussian Process Regression, as well as a cohesive introductory resource for researchers."}, "https://arxiv.org/abs/2408.02343": {"title": "Unified Principal Components Analysis of Irregularly Observed Functional Time Series", "link": "https://arxiv.org/abs/2408.02343", "description": "arXiv:2408.02343v1 Announce Type: new \nAbstract: Irregularly observed functional time series (FTS) are increasingly available in many real-world applications. To analyze FTS, it is crucial to account for both serial dependencies and the irregularly observed nature of functional data. However, existing methods for FTS often rely on specific model assumptions in capturing serial dependencies, or cannot handle the irregular observational scheme of functional data. To solve these issues, one can perform dimension reduction on FTS via functional principal component analysis (FPCA) or dynamic FPCA. Nonetheless, these methods may either be not theoretically optimal or too redundant to represent serially dependent functional data. In this article, we introduce a novel dimension reduction method for FTS based on dynamic FPCA. Through a new concept called optimal functional filters, we unify the theories of FPCA and dynamic FPCA, providing a parsimonious and optimal representation for FTS adapting to its serial dependence structure. This framework is referred to as principal analysis via dependency-adaptivity (PADA). Under a hierarchical Bayesian model, we establish an estimation procedure for dimension reduction via PADA. Our method can be used for both sparsely and densely observed FTS, and is capable of predicting future functional data. We investigate the theoretical properties of PADA and demonstrate its effectiveness through extensive simulation studies. Finally, we illustrate our method via dimension reduction and prediction of daily PM2.5 data."}, "https://arxiv.org/abs/2408.02393": {"title": "Graphical Modelling without Independence Assumptions for Uncentered Data", "link": "https://arxiv.org/abs/2408.02393", "description": "arXiv:2408.02393v1 Announce Type: new \nAbstract: The independence assumption is a useful tool to increase the tractability of one's modelling framework. However, this assumption does not match reality; failing to take dependencies into account can cause models to fail dramatically. The field of multi-axis graphical modelling (also called multi-way modelling, Kronecker-separable modelling) has seen growth over the past decade, but these models require that the data have zero mean. In the multi-axis case, inference is typically done in the single sample scenario, making mean inference impossible.\n In this paper, we demonstrate how the zero-mean assumption can cause egregious modelling errors, as well as propose a relaxation to the zero-mean assumption that allows the avoidance of such errors. Specifically, we propose the \"Kronecker-sum-structured mean\" assumption, which leads to models with nonconvex-but-unimodal log-likelihoods that can be solved efficiently with coordinate descent."}, "https://arxiv.org/abs/2408.02513": {"title": "The appeal of the gamma family distribution to protect the confidentiality of contingency tables", "link": "https://arxiv.org/abs/2408.02513", "description": "arXiv:2408.02513v1 Announce Type: new \nAbstract: Administrative databases, such as the English School Census (ESC), are rich sources of information that are potentially useful for researchers. 
For such data sources to be made available, however, strict guarantees of privacy would be required. To achieve this, synthetic data methods can be used. Such methods, when protecting the confidentiality of tabular data (contingency tables), often utilise the Poisson or Poisson-mixture distributions, such as the negative binomial (NBI). These distributions, however, are either equidispersed (in the case of the Poisson) or overdispersed (e.g. in the case of the NBI), which results in excessive noise being applied to large low-risk counts. This paper proposes the use of the (discretized) gamma family (GAF) distribution, which allows noise to be applied in a more bespoke fashion. Specifically, it allows less noise to be applied as cell counts become larger, providing an optimal balance in relation to the risk-utility trade-off. We illustrate the suitability of the GAF distribution on an administrative-type data set that is reminiscent of the ESC."}, "https://arxiv.org/abs/2408.02573": {"title": "Testing identifying assumptions in Tobit Models", "link": "https://arxiv.org/abs/2408.02573", "description": "arXiv:2408.02573v1 Announce Type: new \nAbstract: This paper develops sharp testable implications for Tobit and IV-Tobit models' identifying assumptions: linear index specification, (joint) normality of latent errors, and treatment (instrument) exogeneity and relevance. The new sharp testable equalities can detect all possible observable violations of the identifying conditions. We propose a testing procedure for the model's validity using existing inference methods for intersection bounds. Simulation results suggests proper size for large samples and that the test is powerful to detect large violation of the exogeneity assumption and violations in the error structure. Finally, we review and propose new alternative paths to partially identify the parameters of interest under less restrictive assumptions."}, "https://arxiv.org/abs/2408.02594": {"title": "Time-series imputation using low-rank matrix completion", "link": "https://arxiv.org/abs/2408.02594", "description": "arXiv:2408.02594v1 Announce Type: new \nAbstract: We investigate the use of matrix completion methods for time-series imputation. Specifically we consider low-rank completion of the block-Hankel matrix representation of a time-series. Simulation experiments are used to compare the method with five recognised imputation techniques with varying levels of computational effort. The Hankel Imputation (HI) method is seen to perform competitively at interpolating missing time-series data, and shows particular potential for reproducing sharp peaks in the data."}, "https://arxiv.org/abs/2408.02667": {"title": "Evaluating and Utilizing Surrogate Outcomes in Covariate-Adjusted Response-Adaptive Designs", "link": "https://arxiv.org/abs/2408.02667", "description": "arXiv:2408.02667v1 Announce Type: new \nAbstract: This manuscript explores the intersection of surrogate outcomes and adaptive designs in statistical research. While surrogate outcomes have long been studied for their potential to substitute long-term primary outcomes, current surrogate evaluation methods do not directly account for the potential benefits of using surrogate outcomes to adapt randomization probabilities in adaptive randomized trials that aim to learn and respond to treatment effect heterogeneity. In this context, surrogate outcomes can benefit participants in the trial directly (i.e. 
improve expected outcome of newly-enrolled participants) by allowing for more rapid adaptation of randomization probabilities, particularly when surrogates enable earlier detection of heterogeneous treatment effects and/or indicate the optimal (individualized) treatment with stronger signals. Our study introduces a novel approach for surrogate evaluation that quantifies both of these benefits in the context of sequential adaptive experiment designs. We also propose a new Covariate-Adjusted Response-Adaptive (CARA) design that incorporates an Online Superlearner to assess and adaptively choose surrogate outcomes for updating treatment randomization probabilities. We introduce a Targeted Maximum Likelihood Estimator that addresses data dependency challenges in adaptively collected data and achieves asymptotic normality under reasonable assumptions without relying on parametric model assumptions. The robust performance of our adaptive design with Online Superlearner is presented via simulations. Our framework not only contributes a method to more comprehensively quantifying the benefits of candidate surrogate outcomes and choosing between them, but also offers an easily generalizable tool for evaluating various adaptive designs and making inferences, providing insights into alternative choices of designs."}, "https://arxiv.org/abs/2408.01582": {"title": "Conformal Diffusion Models for Individual Treatment Effect Estimation and Inference", "link": "https://arxiv.org/abs/2408.01582", "description": "arXiv:2408.01582v1 Announce Type: cross \nAbstract: Estimating treatment effects from observational data is of central interest across numerous application domains. Individual treatment effect offers the most granular measure of treatment effect on an individual level, and is the most useful to facilitate personalized care. However, its estimation and inference remain underdeveloped due to several challenges. In this article, we propose a novel conformal diffusion model-based approach that addresses those intricate challenges. We integrate the highly flexible diffusion modeling, the model-free statistical inference paradigm of conformal inference, along with propensity score and covariate local approximation that tackle distributional shifts. We unbiasedly estimate the distributions of potential outcomes for individual treatment effect, construct an informative confidence interval, and establish rigorous theoretical guarantees. We demonstrate the competitive performance of the proposed method over existing solutions through extensive numerical studies."}, "https://arxiv.org/abs/2408.01617": {"title": "Review and Demonstration of a Mixture Representation for Simulation from Densities Involving Sums of Powers", "link": "https://arxiv.org/abs/2408.01617", "description": "arXiv:2408.01617v1 Announce Type: cross \nAbstract: Penalized and robust regression, especially when approached from a Bayesian perspective, can involve the problem of simulating a random variable $\\boldsymbol z$ from a posterior distribution that includes a term proportional to a sum of powers, $\\|\\boldsymbol z \\|^q_q$, on the log scale. However, many popular gradient-based methods for Markov Chain Monte Carlo simulation from such posterior distributions use Hamiltonian Monte Carlo and accordingly require conditions on the differentiability of the unnormalized posterior distribution that do not hold when $q \\leq 1$ (Plummer, 2023). 
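Circling back to the time-series imputation entry above (arXiv:2408.02594), which fills gaps via low-rank completion of a Hankel matrix representation: a hedged univariate sketch of that idea is given below, alternating SVD truncation with re-imposing the observed values and reading imputations off anti-diagonal averages. The window length, rank, and iteration count are arbitrary choices, not the paper's settings.

```python
# Hankel-matrix time-series imputation: embed the series in a Hankel matrix,
# alternate low-rank SVD truncation with re-imposing observed values, and
# recover the imputed series from anti-diagonal averages.
import numpy as np

def hankel_impute(series, window=24, rank=4, n_iter=100):
    series = series.astype(float)
    missing = np.isnan(series)
    x = np.where(missing, np.nanmean(series), series)     # crude initial fill
    n = len(series)
    idx = np.arange(window)[:, None] + np.arange(n - window + 1)[None, :]
    counts = np.bincount(idx.ravel(), minlength=n)
    for _ in range(n_iter):
        H = x[idx]                                         # Hankel embedding
        U, s, Vt = np.linalg.svd(H, full_matrices=False)
        H_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        x_low = np.bincount(idx.ravel(), weights=H_low.ravel(), minlength=n) / counts
        x = np.where(missing, x_low, series)               # keep observed values
    return x

rng = np.random.default_rng(11)
t = np.arange(300)
truth = np.sin(2 * np.pi * t / 25) + 0.5 * np.sin(2 * np.pi * t / 60)
y = truth + 0.05 * rng.normal(size=300)
y[rng.choice(300, 40, replace=False)] = np.nan             # knock out 40 points

imputed = hankel_impute(y)
gap = np.isnan(y)
print("RMSE on missing entries:",
      round(float(np.sqrt(np.mean((imputed[gap] - truth[gap]) ** 2))), 4))
```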
This is limiting; the setting where $q \leq 1$ includes widely used sparsity-inducing penalized regression models and heavy-tailed robust regression models. In the special case where $q = 1$, a latent variable representation that facilitates simulation from such a posterior distribution is well known. However, the setting where $q < 1$ has not been treated as thoroughly. In this note, we review the availability of a latent variable representation described in Devroye (2009), show how it can be used to simulate from such posterior distributions when $0 < q < 2$, and demonstrate its utility in the context of estimating the parameters of a Bayesian penalized regression model."}, "https://arxiv.org/abs/2408.01926": {"title": "Efficient Decision Trees for Tensor Regressions", "link": "https://arxiv.org/abs/2408.01926", "description": "arXiv:2408.01926v1 Announce Type: cross \nAbstract: We propose the tensor-input tree (TT) method for scalar-on-tensor and tensor-on-tensor regression problems. We first address the scalar-on-tensor problem by proposing scalar-output regression tree models whose input variables are tensors (i.e., multi-way arrays). We devise and implement fast randomized and deterministic algorithms for efficient fitting of scalar-on-tensor trees, making TT competitive against tensor-input GP models. Based on scalar-on-tensor tree models, we extend our method to tensor-on-tensor problems using additive tree ensemble approaches. Theoretical justification and extensive experiments on real and synthetic datasets are provided to illustrate the performance of TT."}, "https://arxiv.org/abs/2408.02060": {"title": "Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection", "link": "https://arxiv.org/abs/2408.02060", "description": "arXiv:2408.02060v1 Announce Type: cross \nAbstract: We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop a test statistic that is asymptotically normal, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape."}, "https://arxiv.org/abs/2408.02106": {"title": "A Functional Data Approach for Structural Health Monitoring", "link": "https://arxiv.org/abs/2408.02106", "description": "arXiv:2408.02106v1 Announce Type: cross \nAbstract: Structural Health Monitoring (SHM) is increasingly applied in civil engineering. One of its primary purposes is detecting and assessing changes in structure conditions to increase safety and reduce potential maintenance downtime. Recent advancements, especially in sensor technology, facilitate data measurements, collection, and process automation, leading to large data streams. We propose a function-on-function regression framework for (nonlinear) modeling of the sensor data and adjusting for covariate-induced variation. Our approach is particularly suited for long-term monitoring when several months or years of training data are available. It combines highly flexible yet interpretable semi-parametric modeling with functional principal component analysis and uses the corresponding out-of-sample phase-II scores for monitoring. 
The method proposed can also be described as a combination of an ``input-output'' and an ``output-only'' method."}, "https://arxiv.org/abs/2408.02122": {"title": "Graph-Enabled Fast MCMC Sampling with an Unknown High-Dimensional Prior Distribution", "link": "https://arxiv.org/abs/2408.02122", "description": "arXiv:2408.02122v1 Announce Type: cross \nAbstract: Posterior sampling is a task of central importance in Bayesian inference. For many applications in Bayesian meta-analysis and Bayesian transfer learning, the prior distribution is unknown and needs to be estimated from samples. In practice, the prior distribution can be high-dimensional, adding to the difficulty of efficient posterior inference. In this paper, we propose a novel Markov chain Monte Carlo algorithm, which we term graph-enabled MCMC, for posterior sampling with unknown and potentially high-dimensional prior distributions. The algorithm is based on constructing a geometric graph from prior samples and subsequently uses the graph structure to guide the transition of the Markov chain. Through extensive theoretical and numerical studies, we demonstrate that our graph-enabled MCMC algorithm provides reliable approximation to the posterior distribution and is highly computationally efficient."}, "https://arxiv.org/abs/2408.02391": {"title": "Kullback-Leibler-based characterizations of score-driven updates", "link": "https://arxiv.org/abs/2408.02391", "description": "arXiv:2408.02391v1 Announce Type: cross \nAbstract: Score-driven models have been applied in some 400 published articles over the last decade. Much of this literature cites the optimality result in Blasques et al. (2015), which, roughly, states that sufficiently small score-driven updates are unique in locally reducing the Kullback-Leibler (KL) divergence relative to the true density for every observation. This is at odds with other well-known optimality results; the Kalman filter, for example, is optimal in a mean squared error sense, but may move in the wrong direction for atypical observations. We show that score-driven filters are, similarly, not guaranteed to improve the localized KL divergence at every observation. The seemingly stronger result in Blasques et al. (2015) is due to their use of an improper (localized) scoring rule. Even as a guaranteed improvement for every observation is unattainable, we prove that sufficiently small score-driven updates are unique in reducing the KL divergence relative to the true density in expectation. This positive$-$albeit weaker$-$result justifies the continued use of score-driven models and places their information-theoretic properties on solid footing."}, "https://arxiv.org/abs/1904.08895": {"title": "The uniform general signed rank test and its design sensitivity", "link": "https://arxiv.org/abs/1904.08895", "description": "arXiv:1904.08895v2 Announce Type: replace \nAbstract: A sensitivity analysis in an observational study tests whether the qualitative conclusions of an analysis would change if we were to allow for the possibility of limited bias due to confounding. The design sensitivity of a hypothesis test quantifies the asymptotic performance of the test in a sensitivity analysis against a particular alternative. We propose a new, non-asymptotic, distribution-free test, the uniform general signed rank test, for observational studies with paired data, and examine its performance under Rosenbaum's sensitivity analysis model. 
Our test can be viewed as adaptively choosing from among a large underlying family of signed rank tests, and we show that the uniform test achieves design sensitivity equal to the maximum design sensitivity over the underlying family of signed rank tests. Our test thus achieves superior, and sometimes infinite, design sensitivity, indicating it will perform well in sensitivity analyses on large samples. We support this conclusion with simulations and a data example, showing that the advantages of our test extend to moderate sample sizes as well."}, "https://arxiv.org/abs/2010.13599": {"title": "Design-Based Inference for Spatial Experiments under Unknown Interference", "link": "https://arxiv.org/abs/2010.13599", "description": "arXiv:2010.13599v5 Announce Type: replace \nAbstract: We consider design-based causal inference for spatial experiments in which treatments may have effects that bleed out and feed back in complex ways. Such spatial spillover effects violate the standard ``no interference'' assumption for standard causal inference methods. The complexity of spatial spillover effects also raises the risk of misspecification and bias in model-based analyses. We offer an approach for robust inference in such settings without having to specify a parametric outcome model. We define a spatial ``average marginalized effect'' (AME) that characterizes how, in expectation, units of observation that are a specified distance from an intervention location are affected by treatment at that location, averaging over effects emanating from other intervention nodes. We show that randomization is sufficient for non-parametric identification of the AME even if the nature of interference is unknown. Under mild restrictions on the extent of interference, we establish asymptotic distributions of estimators and provide methods for both sample-theoretic and randomization-based inference. We show conditions under which the AME recovers a structural effect. We illustrate our approach with a simulation study. Then we re-analyze a randomized field experiment and a quasi-experiment on forest conservation, showing how our approach offers robust inference on policy-relevant spillover effects."}, "https://arxiv.org/abs/2201.01010": {"title": "A Double Robust Approach for Non-Monotone Missingness in Multi-Stage Data", "link": "https://arxiv.org/abs/2201.01010", "description": "arXiv:2201.01010v2 Announce Type: replace \nAbstract: Multivariate missingness with a non-monotone missing pattern is complicated to deal with in empirical studies. The traditional Missing at Random (MAR) assumption is difficult to justify in such cases. Previous studies have strengthened the MAR assumption, suggesting that the missing mechanism of any variable is random when conditioned on a uniform set of fully observed variables. However, empirical evidence indicates that this assumption may be violated for variables collected at different stages. This paper proposes a new MAR-type assumption that fits non-monotone missing scenarios involving multi-stage variables. Based on this assumption, we construct an Augmented Inverse Probability Weighted GMM (AIPW-GMM) estimator. This estimator features an asymmetric format for the augmentation term, guarantees double robustness, and achieves the closed-form semiparametric efficiency bound. We apply this method to cases of missingness in both endogenous regressor and outcome, using the Oregon Health Insurance Experiment as an example. 
We check the correlation between missing probabilities and partially observed variables to justify the assumption. Moreover, we find that excluding incomplete data results in a loss of efficiency and insignificant estimators. The proposed estimator reduces the standard error by more than 50% for the estimated effects of the Oregon Health Plan on the elderly."}, "https://arxiv.org/abs/2211.14903": {"title": "Inference in Cluster Randomized Trials with Matched Pairs", "link": "https://arxiv.org/abs/2211.14903", "description": "arXiv:2211.14903v4 Announce Type: replace \nAbstract: This paper studies inference in cluster randomized trials where treatment status is determined according to a \"matched pairs\" design. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the cluster; by a \"matched pairs\" design, we mean that a sample of clusters is paired according to baseline, cluster-level covariates and, within each pair, one cluster is selected at random for treatment. We study the large-sample behavior of a weighted difference-in-means estimator and derive two distinct sets of results depending on if the matching procedure does or does not match on cluster size. We then propose a single variance estimator which is consistent in either regime. Combining these results establishes the asymptotic exactness of tests based on these estimators. Next, we consider the properties of two common testing procedures based on t-tests constructed from linear regressions, and argue that both are generally conservative in our framework. We additionally study the behavior of a randomization test which permutes the treatment status for clusters within pairs, and establish its finite-sample and asymptotic validity for testing specific null hypotheses. Finally, we propose a covariate-adjusted estimator which adjusts for additional baseline covariates not used for treatment assignment, and establish conditions under which such an estimator leads to strict improvements in precision. A simulation study confirms the practical relevance of our theoretical results."}, "https://arxiv.org/abs/2307.00575": {"title": "Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model", "link": "https://arxiv.org/abs/2307.00575", "description": "arXiv:2307.00575v2 Announce Type: replace \nAbstract: This paper introduces a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data. To enhance the understanding of the framework, we introduce a class of matrix-variate spiked covariance models that serve as inspiration for the development of the MOP-UP algorithm. The MOP-UP algorithm consists of two steps: Average Subspace Capture (ASC) and Alternating Projection (AP). These steps are specifically designed to capture the row-wise and column-wise dimension-reduced subspaces which contain the most informative features of the data. ASC utilizes a novel average projection operator as initialization and achieves exact recovery in the noiseless setting. We analyze the convergence and non-asymptotic error bounds of MOP-UP, introducing a blockwise matrix eigenvalue perturbation bound that proves the desired bound, where classic perturbation bounds fail. The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets. 
Lastly, we discuss generalizations of our approach to higher-order data."}, "https://arxiv.org/abs/2308.14996": {"title": "The projected dynamic linear model for time series on the sphere", "link": "https://arxiv.org/abs/2308.14996", "description": "arXiv:2308.14996v2 Announce Type: replace \nAbstract: Time series on the unit n-sphere arise in directional statistics, compositional data analysis, and many scientific fields. There are few models for such data, and the ones that exist suffer from several limitations: they are often computationally challenging to fit, many of them apply only to the circular case of n=2, and they are usually based on families of distributions that are not flexible enough to capture the complexities observed in real data. Furthermore, there is little work on Bayesian methods for spherical time series. To address these shortcomings, we propose a state space model based on the projected normal distribution that can be applied to spherical time series of arbitrary dimension. We describe how to perform fully Bayesian offline inference for this model using a simple and efficient Gibbs sampling algorithm, and we develop a Rao-Blackwellized particle filter to perform online inference for streaming data. In analyses of wind direction and energy market time series, we show that the proposed model outperforms competitors in terms of point, set, and density forecasting."}, "https://arxiv.org/abs/2309.03714": {"title": "An efficient joint model for high dimensional longitudinal and survival data via generic association features", "link": "https://arxiv.org/abs/2309.03714", "description": "arXiv:2309.03714v2 Announce Type: replace \nAbstract: This paper introduces a prognostic method called FLASH that addresses the problem of joint modelling of longitudinal data and censored durations when a large number of both longitudinal and time-independent features are available. In the literature, standard joint models are either of the shared random effect or joint latent class type. Combining ideas from both worlds and using appropriate regularisation techniques, we define a new model with the ability to automatically identify significant prognostic longitudinal features in a high-dimensional context, which is of increasing importance in many areas such as personalised medicine or churn prediction. We develop an estimation methodology based on the EM algorithm and provide an efficient implementation. The statistical performance of the method is demonstrated both in extensive Monte Carlo simulation studies and on publicly available real-world datasets. Our method significantly outperforms the state-of-the-art joint models in predicting the latent class membership probability in terms of the C-index in a so-called ``real-time'' prediction setting, with a computational speed that is orders of magnitude faster than competing methods. In addition, our model automatically identifies significant features that are relevant from a practical perspective, making it interpretable."}, "https://arxiv.org/abs/2310.06926": {"title": "Bayesian inference and cure rate modeling for event history data", "link": "https://arxiv.org/abs/2310.06926", "description": "arXiv:2310.06926v2 Announce Type: replace \nAbstract: Estimating model parameters of a general family of cure models is always a challenging task mainly due to flatness and multimodality of the likelihood function. In this work, we propose a fully Bayesian approach in order to overcome these issues. 
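The generative idea behind the projected-normal construction for spherical time series mentioned above is simply a latent Gaussian process pushed onto the unit sphere by normalisation. A toy simulation under an assumed latent VAR(1) is sketched below to give a feel for such data; it is not the Gibbs sampler or the Rao-Blackwellized particle filter developed in that paper.
import numpy as np

rng = np.random.default_rng(3)
T, d = 500, 3                      # length of series and ambient dimension (sphere S^{d-1})
phi, sigma = 0.95, 0.3             # assumed latent AR coefficient and innovation scale

z = np.zeros((T, d))
z[0] = rng.normal(size=d)
for t in range(1, T):              # latent Gaussian VAR(1)
    z[t] = phi * z[t - 1] + sigma * rng.normal(size=d)

y = z / np.linalg.norm(z, axis=1, keepdims=True)     # projected (spherical) observations
print(np.allclose(np.linalg.norm(y, axis=1), 1.0))   # every observation lies on the unit sphere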
Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. It is demonstrated via simulations that the proposed algorithm freely explores the multimodal posterior distribution and produces robust point estimates, while it outperforms maximum likelihood estimation via the Expectation-Maximization algorithm. A by-product of our Bayesian implementation is to control the False Discovery Rate when classifying items as cured or not. Finally, the proposed method is illustrated in a real dataset which refers to recidivism for offenders released from prison; the event of interest is whether the offender was re-incarcerated after probation or not."}, "https://arxiv.org/abs/2307.01357": {"title": "Adaptive Principal Component Regression with Applications to Panel Data", "link": "https://arxiv.org/abs/2307.01357", "description": "arXiv:2307.01357v3 Announce Type: replace-cross \nAbstract: Principal component regression (PCR) is a popular technique for fixed-design error-in-variables regression, a generalization of the linear regression setting in which the observed covariates are corrupted with random noise. We provide the first time-uniform finite sample guarantees for (regularized) PCR whenever data is collected adaptively. Since the proof techniques for analyzing PCR in the fixed design setting do not readily extend to the online setting, our results rely on adapting tools from modern martingale concentration to the error-in-variables setting. We demonstrate the usefulness of our bounds by applying them to the domain of panel data, a ubiquitous setting in econometrics and statistics. As our first application, we provide a framework for experiment design in panel data settings when interventions are assigned adaptively. Our framework may be thought of as a generalization of the synthetic control and synthetic interventions frameworks, where data is collected via an adaptive intervention assignment policy. Our second application is a procedure for learning such an intervention assignment policy in a setting where units arrive sequentially to be treated. In addition to providing theoretical performance guarantees (as measured by regret), we show that our method empirically outperforms a baseline which does not leverage error-in-variables regression."}, "https://arxiv.org/abs/2312.04972": {"title": "Comparison of Probabilistic Structural Reliability Methods for Ultimate Limit State Assessment of Wind Turbines", "link": "https://arxiv.org/abs/2312.04972", "description": "arXiv:2312.04972v2 Announce Type: replace-cross \nAbstract: The probabilistic design of offshore wind turbines aims to ensure structural safety in a cost-effective way. This involves conducting structural reliability assessments for different design options and considering different structural responses. There are several structural reliability methods, and this paper will apply and compare different approaches in some simplified case studies. In particular, the well known environmental contour method will be compared to a more novel approach based on sequential sampling and Gaussian processes regression for an ultimate limit state case study. 
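The Metropolis-coupled (parallel tempering) scheme described above can be illustrated on a deliberately multimodal toy target: run several random-walk chains on tempered versions of the density and occasionally propose swaps between adjacent temperatures. The sketch below uses an assumed two-component normal mixture and illustrative tuning constants; it is not the Langevin-based sampler for the cure model.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)

def log_target(x):
    # Bimodal target 0.5*N(-4, 1) + 0.5*N(4, 1), evaluated on the log scale
    return logsumexp([-0.5 * (x + 4)**2, -0.5 * (x - 4)**2]) + np.log(0.5) - 0.5 * np.log(2 * np.pi)

betas = np.array([1.0, 0.5, 0.2, 0.05])   # inverse temperatures (beta = 1 is the target chain)
x = np.zeros(len(betas))
samples = []

for it in range(20000):
    # Random-walk Metropolis update within each tempered chain
    for k, beta in enumerate(betas):
        prop = x[k] + rng.normal(scale=1.5)
        if np.log(rng.uniform()) < beta * (log_target(prop) - log_target(x[k])):
            x[k] = prop
    # Propose a swap between a random pair of adjacent temperatures
    k = rng.integers(len(betas) - 1)
    log_acc = (betas[k] - betas[k + 1]) * (log_target(x[k + 1]) - log_target(x[k]))
    if np.log(rng.uniform()) < log_acc:
        x[k], x[k + 1] = x[k + 1], x[k]
    samples.append(x[0])

samples = np.array(samples[2000:])
print((samples < 0).mean())   # roughly 0.5: the cold chain visits both modes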
For one of the case studies, results will also be compared to results from a brute force simulation approach. Interestingly, the comparison turns out very differently in the two case studies. In one of the cases the environmental contour method agrees well with the sequential sampling method, but in the other, results vary considerably. Probably, this can be explained by the violation of some of the assumptions associated with the environmental contour approach, i.e. that the short-term variability of the response is large compared to the long-term variability of the environmental conditions. Results from this simple comparison study suggest that the sequential sampling method can be a robust and computationally effective approach for structural reliability assessment."}, "https://arxiv.org/abs/2408.02757": {"title": "A nonparametric test for diurnal variation in spot correlation processes", "link": "https://arxiv.org/abs/2408.02757", "description": "arXiv:2408.02757v1 Announce Type: new \nAbstract: The association between log-price increments of exchange-traded equities, as measured by their spot correlation estimated from high-frequency data, exhibits a pronounced upward-sloping and almost piecewise linear relationship at the intraday horizon. There is notably lower (on average, less positive) correlation in the morning than in the afternoon. We develop a nonparametric testing procedure to detect such deterministic variation in a correlation process. The test statistic has a known distribution under the null hypothesis, whereas it diverges under the alternative. It is robust against stochastic correlation. We run a Monte Carlo simulation to assess the finite sample properties of the test statistic, which are close to the large sample predictions, even for small sample sizes and realistic levels of diurnal variation. In an application, we implement the test on a monthly basis for a high-frequency dataset covering the stock market over an extended period. The test leads to rejection of the null most of the time. This suggests that diurnal variation in the correlation process is a nontrivial effect in practice."}, "https://arxiv.org/abs/2408.02770": {"title": "Measuring the Impact of New Risk Factors Within Survival Models", "link": "https://arxiv.org/abs/2408.02770", "description": "arXiv:2408.02770v1 Announce Type: new \nAbstract: Survival is poor for patients with metastatic cancer, and it is vital to examine new biomarkers that can improve patient prognostication and identify those who would benefit from more aggressive therapy. In metastatic prostate cancer, two new assays have become available: one that quantifies the number of cancer cells circulating in the peripheral blood, and the other a marker of the aggressiveness of the disease. It is critical to determine the magnitude of the effect of these biomarkers on the discrimination of a model-based risk score. To do so, analysts frequently consider the discrimination of two separate survival models: one that includes both the new and standard factors and a second that includes the standard factors alone. However, this analysis is ultimately incorrect for many of the scale-transformation models ubiquitous in survival, as the reduced model is misspecified if the full model is specified correctly. To circumvent this issue, we developed a projection-based approach to estimate the impact of the two prostate cancer biomarkers. 
The results indicate that the new biomarkers can influence model discrimination and justify their inclusion in the risk model; however, the hunt remains for an applicable model to risk-stratify patients with metastatic prostate cancer."}, "https://arxiv.org/abs/2408.02821": {"title": "Continuous Monitoring via Repeated Significance", "link": "https://arxiv.org/abs/2408.02821", "description": "arXiv:2408.02821v1 Announce Type: new \nAbstract: Requiring statistical significance at multiple interim analyses to declare a statistically significant result for an AB test allows less stringent requirements for significance at each interim analysis. Repeated significance competes well with methods built on assumptions about the test -- assumptions that may be impossible to evaluate a priori and may require extra data to evaluate empirically.\n Instead, requiring repeated significance allows the data itself to prove directly that the required results are not due to chance alone. We explain how to apply tests with repeated significance to continuously monitor unbounded tests -- tests that do not have an a priori bound on running time or number of observations. We show that it is impossible to maintain a constant requirement for significance for unbounded tests, but that we can come arbitrarily close to that goal."}, "https://arxiv.org/abs/2408.02830": {"title": "Setting the duration of online A/B experiments", "link": "https://arxiv.org/abs/2408.02830", "description": "arXiv:2408.02830v1 Announce Type: new \nAbstract: In designing an online A/B experiment, it is crucial to select a sample size and duration that ensure the resulting confidence interval (CI) for the treatment effect is the right width to detect an effect of meaningful magnitude with sufficient statistical power without wasting resources. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time for experiments where the same users are kept in the same experiment arm across different days. The remaining error variance component, by contrast, decays to zero as T gets large. The formula we derive introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. Higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0 -- as for experiments where users shuffle in and out of the experiment across days -- the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show our formula closely explains CI widths on real A/B experiments at YouTube."}, "https://arxiv.org/abs/2408.03024": {"title": "Weighted shape-constrained estimation for the autocovariance sequence from a reversible Markov chain", "link": "https://arxiv.org/abs/2408.03024", "description": "arXiv:2408.03024v1 Announce Type: new \nAbstract: We present a novel weighted $\\ell_2$ projection method for estimating autocovariance sequences and spectral density functions from reversible Markov chains. 
Berg and Song (2023) introduced a least-squares shape-constrained estimation approach for the autocovariance function by projecting an initial estimate onto a shape-constrained space using an $\\ell_2$ projection. While the least-squares objective is commonly used in shape-constrained regression, it can be suboptimal due to correlation and unequal variances in the input function. To address this, we propose a weighted least-squares method that defines a weighted norm on transformed data. Specifically, we transform an input autocovariance sequence into the Fourier domain and apply weights based on the asymptotic variance of the sample periodogram, leveraging the asymptotic independence of periodogram ordinates. Our proposal can equivalently be viewed as estimating a spectral density function by applying shape constraints to its Fourier series. We demonstrate that our weighted approach yields strongly consistent estimates for both the spectral density and the autocovariance sequence. Empirical studies show its effectiveness in uncertainty quantification for Markov chain Monte Carlo estimation, outperforming the unweighted moment LS estimator and other state-of-the-art methods."}, "https://arxiv.org/abs/2408.03137": {"title": "Efficient Asymmetric Causality Tests", "link": "https://arxiv.org/abs/2408.03137", "description": "arXiv:2408.03137v1 Announce Type: new \nAbstract: Asymmetric causality tests are increasingly gaining popularity in different scientific fields. This approach corresponds better to reality since logical reasons behind asymmetric behavior exist and need to be considered in empirical investigations. Hatemi-J (2012) introduced the asymmetric causality tests via partial cumulative sums for positive and negative components of the variables operating within the vector autoregressive (VAR) model. However, since the residuals across the equations in the VAR model are not independent, the ordinary least squares method for estimating the parameters is not efficient. Additionally, asymmetric causality tests mean having different causal parameters (i.e., for positive or negative components); thus, it is crucial to assess not only if these causal parameters are individually statistically significant, but also if their difference is statistically significant. Consequently, tests of the difference between estimated causal parameters should explicitly be conducted, yet these are neglected in the existing literature. The purpose of the current paper is to deal with these issues explicitly. An application is provided, and ten different hypotheses pertinent to the asymmetric causal interaction between the two largest financial markets worldwide are efficiently tested within a multivariate setting."}, "https://arxiv.org/abs/2408.03138": {"title": "Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data", "link": "https://arxiv.org/abs/2408.03138", "description": "arXiv:2408.03138v1 Announce Type: new \nAbstract: It is crucial to assess the predictive performance of a model in order to establish its practicality and relevance in real-world scenarios, particularly for high-dimensional data analysis. Among data splitting or resampling methods, cross-validation (CV) is extensively used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model among competing alternatives. 
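For context on the partition dependence discussed next, the following numpy-only sketch estimates the prediction error of ridge regression by K-fold cross-validation; re-running it with different random seeds for the fold assignment gives visibly different risk estimates on small samples. The data-generating process, the ridge penalty, and K are all illustrative choices.
import numpy as np

def kfold_cv_mse(X, y, lam=1.0, K=5, seed=0):
    # K-fold CV estimate of prediction MSE for ridge regression (closed-form fit per fold)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K
    p = X.shape[1]
    errs = []
    for k in range(K):
        tr, te = folds != k, folds == k
        beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(p), X[tr].T @ y[tr])
        errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(5)
n, p = 60, 20                                    # small n relative to p: noisy CV estimates
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

# Same data, same estimator, different partitions -> noticeably different risk estimates
print([round(kfold_cv_mse(X, y, seed=s), 3) for s in range(5)])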
K-fold cross-validation is a popular CV method, but its limitation is that the risk estimates are highly dependent on the partitioning of the data (for training and testing). Here, the issues regarding the reproducibility of the K-fold CV estimator are demonstrated in hypothesis testing, wherein different partitions lead to notably disparate conclusions. This study presents an alternative novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation for determining the difference in prediction error between two model-fitting algorithms. A naive implementation of the exhaustive nested cross-validation is computationally costly. Here, we address concerns regarding computational complexity by devising a computationally tractable closed-form expression for the proposed cross-validation estimator using ridge regularization. Our study also investigates strategies aimed at enhancing statistical power within high-dimensional scenarios while controlling the Type I error rate. To illustrate the practical utility of our method, we apply it to an RNA sequencing study and demonstrate its effectiveness in the context of biological data analysis."}, "https://arxiv.org/abs/2408.03268": {"title": "Regression analysis of elliptically symmetric direction data", "link": "https://arxiv.org/abs/2408.03268", "description": "arXiv:2408.03268v1 Announce Type: new \nAbstract: A comprehensive toolkit is developed for regression analysis of directional data based on a flexible class of angular Gaussian distributions. Informative testing procedures for isotropy and covariate effects on the directional response are proposed. Moreover, a prediction region that achieves the smallest volume in a class of ellipsoidal prediction regions of the same coverage probability is constructed. The efficacy of these inference procedures is demonstrated in simulation experiments. Finally, this new toolkit is used to analyze directional data originating from a hydrology study and a bioinformatics application."}, "https://arxiv.org/abs/2408.02679": {"title": "Visual Analysis of Multi-outcome Causal Graphs", "link": "https://arxiv.org/abs/2408.02679", "description": "arXiv:2408.02679v1 Announce Type: cross \nAbstract: We introduce a visual analysis method for multiple causal graphs with different outcome variables, namely, multi-outcome causal graphs. Multi-outcome causal graphs are important in healthcare for understanding multimorbidity and comorbidity. To support the visual analysis, we collaborated with medical experts to devise two comparative visualization techniques at different stages of the analysis process. First, a progressive visualization method is proposed for comparing multiple state-of-the-art causal discovery algorithms. The method can handle mixed-type datasets comprising both continuous and categorical variables and assist in the creation of a fine-tuned causal graph of a single outcome. Second, a comparative graph layout technique and specialized visual encodings are devised for the quick comparison of multiple causal graphs. In our visual analysis approach, analysts start by building individual causal graphs for each outcome variable, and then, multi-outcome causal graphs are generated and visualized with our comparative technique for analyzing differences and commonalities of these causal graphs. 
Evaluation includes quantitative measurements on benchmark datasets, a case study with a medical expert, and expert user studies with real-world health research data."}, "https://arxiv.org/abs/2408.03051": {"title": "The multivariate fractional Ornstein-Uhlenbeck process", "link": "https://arxiv.org/abs/2408.03051", "description": "arXiv:2408.03051v1 Announce Type: cross \nAbstract: Starting from the notion of multivariate fractional Brownian Motion introduced in [F. Lavancier, A. Philippe, and D. Surgailis. Covariance function of vector self-similar processes. Statistics & Probability Letters, 2009] we define a multivariate version of the fractional Ornstein-Uhlenbeck process. This multivariate Gaussian process is stationary, ergodic and allows for different Hurst exponents on each component. We characterize its correlation matrix and its short and long time asymptotics. Besides the marginal parameters, the cross correlation between one-dimensional marginal components is ruled by two parameters. We consider the problem of their inference, proposing two types of estimator, constructed from discrete observations of the process. We establish their asymptotic theory, in one case in the long time asymptotic setting, in the other case in the infill and long time asymptotic setting. The limit behavior can be asymptotically Gaussian or non-Gaussian, depending on the values of the Hurst exponents of the marginal components. The technical core of the paper relies on the analysis of asymptotic properties of functionals of Gaussian processes, that we establish using Malliavin calculus and Stein's method. We provide numerical experiments that support our theoretical analysis and also suggest a conjecture on the application of one of these estimators to the multivariate fractional Brownian Motion."}, "https://arxiv.org/abs/2306.15581": {"title": "Advances in projection predictive inference", "link": "https://arxiv.org/abs/2306.15581", "description": "arXiv:2306.15581v2 Announce Type: replace \nAbstract: The concepts of Bayesian prediction, model comparison, and model selection have developed significantly over the last decade. As a result, the Bayesian community has witnessed a rapid growth in theoretical and applied contributions to building and selecting predictive models. Projection predictive inference in particular has shown promise to this end, finding application across a broad range of fields. It is less prone to over-fitting than na\\\"ive selection based purely on cross-validation or information criteria performance metrics, and has been known to out-perform other methods in terms of predictive performance. We survey the core concept and contemporary contributions to projection predictive inference, and present a safe, efficient, and modular workflow for prediction-oriented model selection therein. We also provide an interpretation of the projected posteriors achieved by projection predictive inference in terms of their limitations in causal settings."}, "https://arxiv.org/abs/2307.10694": {"title": "PySDTest: a Python/Stata Package for Stochastic Dominance Tests", "link": "https://arxiv.org/abs/2307.10694", "description": "arXiv:2307.10694v2 Announce Type: replace \nAbstract: We introduce PySDTest, a Python/Stata package for statistical tests of stochastic dominance. PySDTest implements various testing procedures such as Barrett and Donald (2003), Linton et al. (2005), Linton et al. (2010), and Donald and Hsu (2016), along with their extensions. 
Users can flexibly combine several resampling methods and test statistics, including the numerical delta method (D\\\"umbgen, 1993; Hong and Li, 2018; Fang and Santos, 2019). The package allows for testing advanced hypotheses on stochastic dominance relations, such as stochastic maximality among multiple prospects. We first provide an overview of the concepts of stochastic dominance and testing methods. Then, we offer practical guidance for using the package and the Stata command pysdtest. We apply PySDTest to investigate the portfolio choice problem between the daily returns of Bitcoin and the S&P 500 index as an empirical illustration. Our findings indicate that the S&P 500 index returns second-order stochastically dominate the Bitcoin returns."}, "https://arxiv.org/abs/2401.01833": {"title": "Credible Distributions of Overall Ranking of Entities", "link": "https://arxiv.org/abs/2401.01833", "description": "arXiv:2401.01833v2 Announce Type: replace \nAbstract: Inference on overall ranking of a set of entities, such as athletes or players, schools and universities, hospitals, cities, restaurants, movies or books, companies, states, countries or subpopulations, based on appropriate characteristics or performances, is an important problem. Estimation of ranks based on point estimates of means does not account for the uncertainty in those estimates. Treating estimated ranks without any regard for uncertainty is problematic. We propose a novel solution using the Bayesian approach. It is easily implementable, competitive with a popular frequentist method, more effective and informative. Using suitable joint credible sets for entity means, we appropriately create {\\it credible distributions} (CDs, a phrase we coin), which are probability distributions, for the rank vector of entities. As a byproduct, the supports of the CDs are credible sets for overall ranking. We evaluate our proposed procedure in terms of accuracy and stability using a number of applications and a simulation study. While the frequentist approach cannot utilize covariates, the proposed method handles them routinely to its benefit."}, "https://arxiv.org/abs/2408.03415": {"title": "A Novel Approximate Bayesian Inference Method for Compartmental Models in Epidemiology using Stan", "link": "https://arxiv.org/abs/2408.03415", "description": "arXiv:2408.03415v1 Announce Type: new \nAbstract: Mechanistic compartmental models are widely used in epidemiology to study the dynamics of infectious disease transmission. These models have significantly contributed to designing and evaluating effective control strategies during pandemics. However, the increasing complexity and the number of parameters needed to describe rapidly evolving transmission scenarios present significant challenges for parameter estimation due to intractable likelihoods. To overcome this issue, likelihood-free methods have proven effective for accurately and efficiently fitting these models to data. In this study, we focus on approximate Bayesian computation (ABC) and synthetic likelihood methods for parameter inference. We develop a method that employs ABC to select the most informative subset of summary statistics, which are then used to construct a synthetic likelihood for posterior sampling. Posterior sampling is performed using Hamiltonian Monte Carlo as implemented in the Stan software. 
The proposed algorithm is demonstrated through simulation studies, showing promising results for inference in a simulated epidemic scenario."}, "https://arxiv.org/abs/2408.03463": {"title": "Identifying treatment response subgroups in observational time-to-event data", "link": "https://arxiv.org/abs/2408.03463", "description": "arXiv:2408.03463v1 Announce Type: new \nAbstract: Identifying patient subgroups with different treatment responses is an important task to inform medical recommendations, guidelines, and the design of future clinical trials. Existing approaches for subgroup analysis primarily focus on Randomised Controlled Trials (RCTs), in which treatment assignment is randomised. Furthermore, the patient cohort of an RCT is often constrained by cost, and is not representative of the heterogeneity of patients likely to receive treatment in real-world clinical practice. Therefore, when applied to observational studies, such approaches suffer from significant statistical biases because of the non-randomisation of treatment. Our work introduces a novel, outcome-guided method for identifying treatment response subgroups in observational studies. Our approach assigns each patient to a subgroup associated with two time-to-event distributions: one under treatment and one under control regime. It hence positions itself in between individualised and average treatment effect estimation. The assumptions of our model result in a simple correction of the statistical bias from treatment non-randomisation through inverse propensity weighting. In experiments, our approach significantly outperforms the current state-of-the-art method for outcome-guided subgroup analysis in both randomised and observational treatment regimes."}, "https://arxiv.org/abs/2408.03530": {"title": "Robust Identification in Randomized Experiments with Noncompliance", "link": "https://arxiv.org/abs/2408.03530", "description": "arXiv:2408.03530v1 Announce Type: new \nAbstract: This paper considers a robust identification of causal parameters in a randomized experiment setting with noncompliance where the standard local average treatment effect assumptions could be violated. Following Li, K\\'edagni, and Mourifi\\'e (2024), we propose a misspecification robust bound for a real-valued vector of various causal parameters. We discuss identification under two sets of weaker assumptions: random assignment and exclusion restriction (without monotonicity), and random assignment and monotonicity (without exclusion restriction). We introduce two causal parameters: the local average treatment-controlled direct effect (LATCDE), and the local average instrument-controlled direct effect (LAICDE). Under the random assignment and monotonicity assumptions, we derive sharp bounds on the local average treatment-controlled direct effects for the always-takers and never-takers, respectively, and the total average controlled direct effect for the compliers. Additionally, we show that the intent-to-treat effect can be expressed as a convex weighted average of these three effects. 
Finally, we apply our method to the proximity-to-college instrument and find that growing up near a four-year college increases the wage of never-takers (who represent more than 70% of the population) by a range of 4.15% to 27.07%."}, "https://arxiv.org/abs/2408.03590": {"title": "Sensitivity analysis using the Metamodel of Optimal Prognosis", "link": "https://arxiv.org/abs/2408.03590", "description": "arXiv:2408.03590v1 Announce Type: new \nAbstract: In real-world applications within the virtual prototyping process, it is not always possible to reduce the complexity of the physical models and to obtain numerical models which can be solved quickly. Usually, every single numerical simulation takes hours or even days. Despite the progress in numerical methods and high-performance computing, in such cases it is not possible to explore various model configurations; hence, efficient surrogate models are required. Generally, the available meta-model techniques show several advantages and disadvantages depending on the investigated problem. In this paper we present an automatic approach for the selection of the most suitable meta-model for the problem at hand. Together with an automatic reduction of the variable space using advanced filter techniques, an efficient approximation is enabled even for high-dimensional problems. These filter techniques enable a reduction of the high-dimensional variable space to a much smaller subspace, where meta-model-based sensitivity analyses are carried out to assess the influence of important variables and to identify the optimal subspace with a corresponding surrogate model that enables the most accurate probabilistic analysis. For this purpose we investigate variance-based and moment-free sensitivity measures in combination with advanced meta-models such as moving least squares and kriging."}, "https://arxiv.org/abs/2408.03602": {"title": "Piecewise Constant Hazard Estimation with the Fused Lasso", "link": "https://arxiv.org/abs/2408.03602", "description": "arXiv:2408.03602v1 Announce Type: new \nAbstract: In applied time-to-event analysis, a flexible parametric approach is to model the hazard rate as a piecewise constant function of time. However, the change points and values of the piecewise constant hazard are usually unknown and need to be estimated. In this paper, we develop a fully data-driven procedure for piecewise constant hazard estimation. We work in a general counting process framework which nests a wide range of popular models in time-to-event analysis including Cox's proportional hazards model with potentially high-dimensional covariates, competing risks models as well as more general multi-state models. To construct our estimator, we set up a regression model for the increments of the Breslow estimator and then use fused lasso techniques to approximate the piecewise constant signal in this regression model. In the theoretical part of the paper, we derive the convergence rate of our estimator as well as some results on how well the change points of the piecewise constant hazard are approximated by our method. 
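To fix ideas, a piecewise constant hazard on a prespecified grid can be estimated as events divided by time at risk within each interval, which is the simple occurrence/exposure estimator that the data-driven procedure above generalises by learning the change points with the fused lasso. The sketch below uses right-censored toy data and a fixed, assumed grid; it does not implement the fused lasso step.
import numpy as np

rng = np.random.default_rng(7)
n = 2000
# True hazard: 0.2 before time 2, 0.8 afterwards

# Simulate event times by inversion of the piecewise exponential, plus uniform censoring
u = rng.uniform(size=n)
t_event = np.where(-np.log(u) < 0.4, -np.log(u) / 0.2, 2.0 + (-np.log(u) - 0.4) / 0.8)
t_cens = rng.uniform(0, 5, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

# Occurrence/exposure estimate on a fixed grid (assumed change-point locations)
grid = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
for a, b in zip(grid[:-1], grid[1:]):
    at_risk_time = np.clip(time, a, b) - a                  # exposure contributed to [a, b)
    events_in = ((time >= a) & (time < b) & (event == 1)).sum()
    print(f"[{a}, {b}): hazard ~ {events_in / at_risk_time.sum():.3f}")   # ~0.2, 0.2, 0.8, 0.8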
We complement the theory by both simulations and a real data example, illustrating that our results apply in rather general event histories such as multi-state models."}, "https://arxiv.org/abs/2408.03738": {"title": "Parameter estimation for the generalized extreme value distribution: a method that combines bootstrapping and r largest order statistics", "link": "https://arxiv.org/abs/2408.03738", "description": "arXiv:2408.03738v1 Announce Type: new \nAbstract: A critical problem in extreme value theory (EVT) is the estimation of parameters for the limit probability distributions. Block maxima (BM), an approach in EVT that seeks estimates of parameters of the generalized extreme value distribution (GEV), can be generalized to take into account not just the maximum realization from a given dataset, but the r largest order statistics for a given r. In this work we propose a parameter estimation method that combines the r largest order statistic (r-LOS) extension of BM with permutation bootstrapping: surrogate realizations are obtained by randomly reordering the original data set, and then r-LOS is applied to these shuffled measurements - the mean estimate computed from these surrogate realizations is the desired estimate. We used synthetic observations and real meteorological time series to verify the performance of our method; we found that the combination of r-LOS and bootstrapping resulted in estimates more accurate than when either approach was implemented separately."}, "https://arxiv.org/abs/2408.03777": {"title": "Combining BART and Principal Stratification to estimate the effect of intermediate on primary outcomes with application to estimating the effect of family planning on employment in sub-Saharan Africa", "link": "https://arxiv.org/abs/2408.03777", "description": "arXiv:2408.03777v1 Announce Type: new \nAbstract: There is interest in learning about the causal effect of family planning (FP) on empowerment related outcomes. Experimental data related to this question are available from trials in which FP programs increase access to FP. While program assignment is unconfounded, FP uptake and subsequent empowerment may share common causes. We use principal stratification to estimate the causal effect of an intermediate FP outcome on a primary outcome of interest, among women affected by a FP program. Within strata defined by the potential reaction to the program, FP uptake is unconfounded. To minimize the need for parametric assumptions, we propose to use Bayesian Additive Regression Trees (BART) for modeling stratum membership and outcomes of interest. We refer to the combined approach as Prince BART. We evaluate Prince BART through a simulation study and use it to assess the causal effect of modern contraceptive use on employment in six cities in Nigeria, based on quasi-experimental data from a FP program trial during the first half of the 2010s. We show that findings differ between Prince BART and alternative modeling approaches based on parametric assumptions."}, "https://arxiv.org/abs/2408.03930": {"title": "Robust Estimation of Regression Models with Potentially Endogenous Outliers via a Modern Optimization Lens", "link": "https://arxiv.org/abs/2408.03930", "description": "arXiv:2408.03930v1 Announce Type: new \nAbstract: This paper addresses the robust estimation of linear regression models in the presence of potentially endogenous outliers. 
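Returning to the block-maxima estimator combined with permutation bootstrapping described above: a reduced sketch with r = 1 (plain block maxima rather than the r largest order statistics) is shown below, using scipy's GEV fitting routine on a synthetic daily series. The block length, the number of surrogate realizations, and the data-generating process are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
years, days = 50, 365
x = rng.gumbel(loc=10.0, scale=2.0, size=years * days)     # synthetic "daily" measurements

def gev_fit_from_block_maxima(series):
    maxima = series.reshape(years, days).max(axis=1)        # one maximum per block (r = 1)
    shape, loc, scale = stats.genextreme.fit(maxima)
    return np.array([shape, loc, scale])

# Permutation bootstrap: reshuffle the series, recompute block maxima, refit
estimates = np.array([
    gev_fit_from_block_maxima(rng.permutation(x)) for _ in range(200)
])
print(estimates.mean(axis=0))   # bootstrap-averaged (shape, loc, scale) estimate
print(estimates.std(axis=0))    # spread across surrogate realizations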
Through Monte Carlo simulations, we demonstrate that existing $L_1$-regularized estimation methods, including the Huber estimator and the least absolute deviation (LAD) estimator, exhibit significant bias when outliers are endogenous. Motivated by this finding, we investigate $L_0$-regularized estimation methods. We propose systematic heuristic algorithms, notably an iterative hard-thresholding algorithm and a local combinatorial search refinement, to solve the combinatorial optimization problem of the \\(L_0\\)-regularized estimation efficiently. Our Monte Carlo simulations yield two key results: (i) The local combinatorial search algorithm substantially improves solution quality compared to the initial projection-based hard-thresholding algorithm while offering greater computational efficiency than directly solving the mixed integer optimization problem. (ii) The $L_0$-regularized estimator demonstrates superior performance in terms of bias reduction, estimation accuracy, and out-of-sample prediction errors compared to $L_1$-regularized alternatives. We illustrate the practical value of our method through an empirical application to stock return forecasting."}, "https://arxiv.org/abs/2408.03425": {"title": "Sequential Conditional Transport on Probabilistic Graphs for Interpretable Counterfactual Fairness", "link": "https://arxiv.org/abs/2408.03425", "description": "arXiv:2408.03425v1 Announce Type: cross \nAbstract: In this paper, we link two existing approaches to derive counterfactuals: adaptations based on a causal graph, as suggested in Ple\\v{c}ko and Meinshausen (2020) and optimal transport, as in De Lara et al. (2024). We extend \"Knothe's rearrangement\" Bonnotte (2013) and \"triangular transport\" Zech and Marzouk (2022a) to probabilistic graphical models, and use this counterfactual approach, referred to as sequential transport, to discuss individual fairness. After establishing the theoretical foundations of the proposed method, we demonstrate its application through numerical experiments on both synthetic and real datasets."}, "https://arxiv.org/abs/2408.03608": {"title": "InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation", "link": "https://arxiv.org/abs/2408.03608", "description": "arXiv:2408.03608v1 Announce Type: cross \nAbstract: Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training ones. Domain generalization (DG) solves this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, homeostatic score, through causal perturbation (HoPer) to construct a prototype classifier in test time. 
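The iterative hard-thresholding idea behind the $L_0$-regularized outlier estimation above can be caricatured in a few lines: treat the k observations with the largest current residuals as outliers, refit by least squares on the rest, and iterate. This is only a simple alternating heuristic under an assumed outlier budget k; it omits the local combinatorial search refinement and the mixed integer formulation.
import numpy as np

rng = np.random.default_rng(9)
n, p, k = 300, 5, 30
X = rng.normal(size=(n, p))
beta_true = np.ones(p)
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:k] += 8.0                                    # plant k gross outliers

outliers = np.zeros(n, dtype=bool)
for _ in range(20):
    beta = np.linalg.lstsq(X[~outliers], y[~outliers], rcond=None)[0]    # refit on "clean" rows
    resid = np.abs(y - X @ beta)
    new_outliers = np.zeros(n, dtype=bool)
    new_outliers[np.argsort(resid)[-k:]] = True                          # flag k largest residuals
    if np.array_equal(new_outliers, outliers):
        break
    outliers = new_outliers

print(beta)                       # close to beta_true despite the planted outliers
print(np.flatnonzero(outliers))   # mostly the first k indices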
Experimental results across multiple cross-domain tasks confirm the efficacy of InPer."}, "https://arxiv.org/abs/2408.03626": {"title": "On the choice of the non-trainable internal weights in random feature maps", "link": "https://arxiv.org/abs/2408.03626", "description": "arXiv:2408.03626v1 Announce Type: cross \nAbstract: The computationally cheap machine learning architecture of random feature maps can be viewed as a single-layer feedforward network in which the weights of the hidden layer are random but fixed and only the outer weights are learned via linear regression. The internal weights are typically chosen from a prescribed distribution. The choice of the internal weights significantly impacts the accuracy of random feature maps. We address here the task of how to best select the internal weights. In particular, we consider the forecasting problem whereby random feature maps are used to learn a one-step propagator map for a dynamical system. We provide a computationally cheap hit-and-run algorithm to select good internal weights which lead to good forecasting skill. We show that the number of good features is the main factor controlling the forecasting skill of random feature maps and acts as an effective feature dimension. Lastly, we compare random feature maps with single-layer feedforward neural networks in which the internal weights are now learned using gradient descent. We find that random feature maps have superior forecasting capabilities whilst having several orders of magnitude lower computational cost."}, "https://arxiv.org/abs/2103.02235": {"title": "Prewhitened Long-Run Variance Estimation Robust to Nonstationarity", "link": "https://arxiv.org/abs/2103.02235", "description": "arXiv:2103.02235v3 Announce Type: replace \nAbstract: We introduce a nonparametric nonlinear VAR prewhitened long-run variance (LRV) estimator for the construction of standard errors robust to autocorrelation and heteroskedasticity that can be used for hypothesis testing in a variety of contexts including the linear regression model. Existing methods either are theoretically valid only under stationarity and have poor finite-sample properties under nonstationarity (i.e., fixed-b methods), or are theoretically valid under the null hypothesis but lead to tests that are not consistent under nonstationary alternative hypothesis (i.e., both fixed-b and traditional HAC estimators). The proposed estimator accounts explicitly for nonstationarity, unlike previous prewhitened procedures which are known to be unreliable, and leads to tests with accurate null rejection rates and good monotonic power. We also establish MSE bounds for LRV estimation that are sharper than previously established and use them to determine the data-dependent bandwidths."}, "https://arxiv.org/abs/2103.02981": {"title": "Theory of Evolutionary Spectra for Heteroskedasticity and Autocorrelation Robust Inference in Possibly Misspecified and Nonstationary Models", "link": "https://arxiv.org/abs/2103.02981", "description": "arXiv:2103.02981v2 Announce Type: replace \nAbstract: We develop a theory of evolutionary spectra for heteroskedasticity and autocorrelation robust (HAR) inference when the data may not satisfy second-order stationarity. Nonstationarity is a common feature of economic time series which may arise either from parameter variation or model misspecification. In such a context, the theories that support HAR inference are either not applicable or do not provide accurate approximations. 
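The random feature map recipe described above (random, fixed internal weights; only the outer weights learned by linear regression) is compact enough to write out in full for a one-step forecasting task. The sketch below learns a propagator for the chaotic logistic map with tanh features and a ridge solve; the internal weights are simply drawn i.i.d. from a prescribed distribution rather than selected by the hit-and-run algorithm proposed in that paper, and all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(10)

# Training data: one-step pairs (x_t, x_{t+1}) from the logistic map
x = np.empty(2000)
x[0] = 0.3
for t in range(1999):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])
X_in, X_out = x[:-1, None], x[1:]

# Random feature map: fixed random internal weights, tanh nonlinearity
D = 300                                            # number of random features (assumed)
W = rng.normal(scale=2.0, size=(D, 1))
b = rng.uniform(-1, 1, size=D)
Phi = np.tanh(X_in @ W.T + b)                      # (n, D) feature matrix

# Only the outer weights are learned, via ridge regression
lam = 1e-6
w_out = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ X_out)

# One-step forecast skill on fresh points
x_test = rng.uniform(0, 1, size=500)
pred = np.tanh(x_test[:, None] @ W.T + b) @ w_out
truth = 3.9 * x_test * (1 - x_test)
print(np.sqrt(np.mean((pred - truth) ** 2)))       # small one-step RMSE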
HAR tests standardized by existing long-run variance estimators then may display size distortions and little or no power. This issue can be more severe for methods that use long bandwidths (i.e., fixed-b HAR tests). We introduce a class of nonstationary processes that have a time-varying spectral representation which evolves continuously except at a finite number of time points. We present an extension of the classical heteroskedasticity and autocorrelation consistent (HAC) estimators that applies two smoothing procedures. One is over the lagged autocovariances, akin to classical HAC estimators, and the other is over time. The latter element is important to flexibly account for nonstationarity. We name them double kernel HAC (DK-HAC) estimators. We show the consistency of the estimators and obtain an optimal DK-HAC estimator under the mean squared error (MSE) criterion. Overall, HAR tests standardized by the proposed DK-HAC estimators are competitive with fixed-b HAR tests, when the latter work well, with regard to size control even when there is strong dependence. Notably, in those empirically relevant situations in which previous HAR tests are undersized and have little or no power, the DK-HAC estimator leads to tests that have good size and power."}, "https://arxiv.org/abs/2111.14590": {"title": "The Fixed-b Limiting Distribution and the ERP of HAR Tests Under Nonstationarity", "link": "https://arxiv.org/abs/2111.14590", "description": "arXiv:2111.14590v2 Announce Type: replace \nAbstract: We show that the nonstandard limiting distribution of HAR test statistics under fixed-b asymptotics is not pivotal (even after studentization) when the data are nonstationary. It takes the form of a complicated function of Gaussian processes and depends on the integrated local long-run variance and on the second moments of the relevant series (e.g., of the regressors and errors for the case of the linear regression model). Hence, existing fixed-b inference methods based on stationarity are not theoretically valid in general. The nuisance parameters entering the fixed-b limiting distribution can be consistently estimated under small-b asymptotics but only with a nonparametric rate of convergence. Hence, we show that the error in rejection probability (ERP) is an order of magnitude larger than that under stationarity and is also larger than that of HAR tests based on HAC estimators under conventional asymptotics. These theoretical results reconcile with recent finite-sample evidence in Casini (2021) and Casini, Deng and Perron (2021) who show that fixed-b HAR tests can perform poorly when the data are nonstationary. They can be conservative under the null hypothesis and have non-monotonic power under the alternative hypothesis irrespective of how large the sample size is."}, "https://arxiv.org/abs/2211.16121": {"title": "Bayesian Multivariate Quantile Regression with alternative Time-varying Volatility Specifications", "link": "https://arxiv.org/abs/2211.16121", "description": "arXiv:2211.16121v2 Announce Type: replace \nAbstract: This article proposes a novel Bayesian multivariate quantile regression to forecast the tail behavior of energy commodities, where the homoskedasticity assumption is relaxed to allow for time-varying volatility. 
In particular, we exploit the mixture representation of the multivariate asymmetric Laplace likelihood and the Cholesky-type decomposition of the scale matrix to introduce stochastic volatility and GARCH processes and then provide an efficient MCMC to estimate them. The proposed models outperform the homoskedastic benchmark mainly when predicting the distribution's tails. We provide a model combination using a quantile score-based weighting scheme, which leads to improved performances, notably when no single model uniformly outperforms the other across quantiles, time, or variables."}, "https://arxiv.org/abs/2307.15348": {"title": "The curse of isotropy: from principal components to principal subspaces", "link": "https://arxiv.org/abs/2307.15348", "description": "arXiv:2307.15348v3 Announce Type: replace \nAbstract: This paper raises an important issue about the interpretation of principal component analysis. The curse of isotropy states that a covariance matrix with repeated eigenvalues yields rotation-invariant eigenvectors. In other words, principal components associated with equal eigenvalues show large intersample variability and are arbitrary combinations of potentially more interpretable components. However, empirical eigenvalues are never exactly equal in practice due to sampling errors. Therefore, most users overlook the problem. In this paper, we propose to identify datasets that are likely to suffer from the curse of isotropy by introducing a generative Gaussian model with repeated eigenvalues and comparing it to traditional models via the principle of parsimony. This yields an explicit criterion to detect the curse of isotropy in practice. We notably argue that in a dataset with 1000 samples, all the eigenvalue pairs with a relative eigengap lower than 21% should be assumed equal. This demonstrates that the curse of isotropy cannot be overlooked. In this context, we propose to transition from fuzzy principal components to much-more-interpretable principal subspaces. The final methodology (principal subspace analysis) is extremely simple and shows promising results on a variety of datasets from different fields."}, "https://arxiv.org/abs/2309.03742": {"title": "Efficient estimation and correction of selection-induced bias with order statistics", "link": "https://arxiv.org/abs/2309.03742", "description": "arXiv:2309.03742v3 Announce Type: replace \nAbstract: Model selection aims to identify a sufficiently well performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are marred by excessive noise. In finite data regimes, cross-validated estimates can encourage the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible in the case of few models, when the pool of candidates grows, and model selection decisions are compounded (as in step-wise selection), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach to estimate and correct selection-induced bias based on order statistics. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. 
This work represents a lightweight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretical assumptions, and we provide a diagnostic to help understand when these may not be valid and when to fall back on safer, albeit more computationally expensive, approaches. The accompanying code facilitates its practical implementation and fosters further exploration in this area."}, "https://arxiv.org/abs/2311.15988": {"title": "A novel CFA+EFA model to detect aberrant respondents", "link": "https://arxiv.org/abs/2311.15988", "description": "arXiv:2311.15988v2 Announce Type: replace \nAbstract: Aberrant respondents are common yet extremely detrimental to the quality of social surveys or questionnaires. Recently, factor mixture models have been employed to identify individuals providing deceptive or careless responses. We propose a comprehensive factor mixture model for continuous outcomes that combines confirmatory and exploratory factor models to classify both the non-aberrant and aberrant respondents. The flexibility of the proposed classification model allows for the identification of two of the most common aberrant response styles, namely faking and careless responding. We validated our approach by means of two simulations and two case studies. The results indicate the effectiveness of the proposed model in dealing with aberrant responses in social and behavioural surveys."}, "https://arxiv.org/abs/2312.17623": {"title": "Decision Theory for Treatment Choice Problems with Partial Identification", "link": "https://arxiv.org/abs/2312.17623", "description": "arXiv:2312.17623v2 Announce Type: replace \nAbstract: We apply classical statistical decision theory to a large class of treatment choice problems with partial identification, revealing important theoretical and practical challenges but also interesting research opportunities. The challenges are: In a general class of problems with Gaussian likelihood, all decision rules are admissible; it is maximin-welfare optimal to ignore all data; and, for severe enough partial identification, there are infinitely many minimax-regret optimal decision rules, all of which sometimes randomize the policy recommendation. The opportunities are: We introduce a profiled regret criterion that can reveal important differences between rules and render some of them inadmissible; and we uniquely characterize the minimax-regret optimal rule that least frequently randomizes. We apply our results to aggregation of experimental estimates for policy adoption, to extrapolation of Local Average Treatment Effects, and to policy making in the presence of omitted variable bias."}, "https://arxiv.org/abs/2106.02031": {"title": "Change-Point Analysis of Time Series with Evolutionary Spectra", "link": "https://arxiv.org/abs/2106.02031", "description": "arXiv:2106.02031v3 Announce Type: replace-cross \nAbstract: This paper develops change-point methods for the spectrum of a locally stationary time series. We focus on series with a bounded spectral density that changes smoothly under the null hypothesis but exhibits change-points or becomes less smooth under the alternative. We address two local problems. The first is the detection of discontinuities (or breaks) in the spectrum at unknown dates and frequencies. The second involves abrupt yet continuous changes in the spectrum over a short time period at an unknown frequency without signifying a break. 
Both problems can be cast into changes in the degree of smoothness of the spectral density over time. We consider estimation and minimax-optimal testing. We determine the optimal rate for the minimax distinguishable boundary, i.e., the minimum break magnitude such that we are able to uniformly control type I and type II errors. We propose a novel procedure for the estimation of the change-points based on a wild sequential top-down algorithm and show its consistency under shrinking shifts and a possibly growing number of change-points. Our method can be used across many fields, and a companion program is made available in popular software packages."}, "https://arxiv.org/abs/2408.04056": {"title": "Testing for a general changepoint in psychometric studies: changes detection and sample size planning", "link": "https://arxiv.org/abs/2408.04056", "description": "arXiv:2408.04056v1 Announce Type: new \nAbstract: This paper introduces a new method for change detection in psychometric studies based on the recently introduced pseudo Score statistic, for which the sampling distribution under the alternative hypothesis has been determined. Our approach has the advantage of simplicity in its computation, eliminating the need for resampling or simulations to obtain critical values. Additionally, it comes with a known null/alternative distribution, facilitating easy calculations for power levels and sample size planning. The paper also discusses the topic of power analysis in segmented regression, namely the estimation of sample size or power level when the study data being collected focuses on a covariate expected to affect the mean response via a piecewise relationship with an unknown breakpoint. We present simulation results showing that our method outperforms other Tests for a Change Point (TFCP) with both normally distributed and binary data, and we carry out an analysis of real SAT Critical reading data. The proposed test contributes to the framework of psychometric research, and it is available on the Comprehensive R Archive Network (CRAN) and in a more user-friendly Shiny App, both illustrated at the end of the paper."}, "https://arxiv.org/abs/2408.04176": {"title": "A robust approach for generalized linear models based on maximum Lq-likelihood procedure", "link": "https://arxiv.org/abs/2408.04176", "description": "arXiv:2408.04176v1 Announce Type: new \nAbstract: In this paper we propose a procedure for robust estimation in the context of generalized linear models based on the maximum Lq-likelihood method. Alongside this, an estimation algorithm that represents a natural extension of the usual iteratively weighted least squares method in generalized linear models is presented. It is through the discussion of the asymptotic distribution of the proposed estimator and a set of statistics for testing linear hypotheses that it is possible to define standardized residuals using the mean-shift outlier model. In addition, robust versions of the deviance function and the Akaike information criterion are defined with the aim of providing tools for model selection. Finally, the performance of the proposed methodology is illustrated through a simulation study and the analysis of a real dataset."}, "https://arxiv.org/abs/2408.04213": {"title": "Hypothesis testing for general network models", "link": "https://arxiv.org/abs/2408.04213", "description": "arXiv:2408.04213v1 Announce Type: new \nAbstract: Network data have attracted considerable attention in modern statistics. 
In research on complex network data, one key issue is finding the underlying connection structure given a network sample. The methods that have been proposed in the literature usually assume that the underlying structure is a known model. In practice, however, the true model is usually unknown, and network learning procedures based on these methods may suffer from model misspecification. To handle this issue, based on random matrix theory, we first give a spectral property of the normalized adjacency matrix under a mild condition. Further, we establish a general goodness-of-fit test procedure for unweighted and undirected networks. We prove that the null distribution of the proposed statistic converges in distribution to the standard normal distribution. Theoretically, this testing procedure is suitable for nearly all popular network models, such as stochastic block models and latent space models. Further, we apply the proposed method to the degree-corrected mixed membership model and give a sequential estimator of the number of communities. Both simulation studies and real-world data examples indicate that the proposed method works well."}, "https://arxiv.org/abs/2408.04327": {"title": "BayesFBHborrow: An R Package for Bayesian borrowing for time-to-event data from a flexible baseline hazard", "link": "https://arxiv.org/abs/2408.04327", "description": "arXiv:2408.04327v1 Announce Type: new \nAbstract: There is currently a focus on statistical methods which can use external trial information to help accelerate the discovery, development and delivery of medicine. Bayesian methods facilitate borrowing which is \"dynamic\" in the sense that the similarity of the data helps to determine how much information is used. We propose a Bayesian semiparametric model, which allows the baseline hazard to take any form through an ensemble average. We introduce priors to smooth the posterior baseline hazard, improving both model estimation and borrowing characteristics. A \"lump-and-smear\" borrowing prior accounts for non-exchangeable historical data and helps reduce the maximum type I error in the presence of prior-data conflict. In this article, we present BayesFBHborrow, an R package, which enables the user to perform Bayesian borrowing with a historical control dataset in a semiparametric time-to-event model. User-defined hyperparameters smooth an ensemble averaged posterior baseline hazard. The model offers the specification of lump-and-smear priors on the commensurability parameter, where the associated hyperparameters can be chosen according to the user's tolerance for differences between the log baseline hazards. We demonstrate the performance of our Bayesian flexible baseline hazard model on a simulated and a real-world dataset."}, "https://arxiv.org/abs/2408.04419": {"title": "Analysing symbolic data by pseudo-marginal methods", "link": "https://arxiv.org/abs/2408.04419", "description": "arXiv:2408.04419v1 Announce Type: new \nAbstract: Symbolic data analysis (SDA) aggregates large individual-level datasets into a small number of distributional summaries, such as random rectangles or random histograms. Inference is carried out using these summaries in place of the original dataset, resulting in computational gains at the loss of some information. In likelihood-based SDA, the likelihood function is characterised by an integral with a large exponent, which limits the method's utility, as for typical models the integral is unavailable in closed form. 
In addition, the likelihood function is known to produce biased parameter estimates in some circumstances. Our article develops a Bayesian framework for SDA methods in these settings that resolves the issues resulting from integral intractability and biased parameter estimation using pseudo-marginal Markov chain Monte Carlo methods. We develop an exact but computationally expensive method based on path sampling and the block-Poisson estimator, and a much faster, but approximate, method based on Taylor expansion. Through simulation and real-data examples we demonstrate the performance of the developed methods, showing large reductions in computation time compared to the full-data analysis, with only a small loss of information."}, "https://arxiv.org/abs/2408.04552": {"title": "Semiparametric Estimation of Individual Coefficients in a Dyadic Link Formation Model Lacking Observable Characteristics", "link": "https://arxiv.org/abs/2408.04552", "description": "arXiv:2408.04552v1 Announce Type: new \nAbstract: Dyadic network formation models have wide applicability in economic research, yet are difficult to estimate in the presence of individual-specific effects and in the absence of distributional assumptions regarding the model noise component. The availability of (continuously distributed) individual or link characteristics generally facilitates estimation. Yet, while data on social networks has recently become more abundant, the characteristics of the entities involved in the link may not be measured. Adapting the procedure of \\citet{KS}, I propose to use network data alone in a semiparametric estimation of the individual fixed effect coefficients, which carry the interpretation of the individual relative popularity. This makes it possible to anticipate how a newly arriving individual will connect in a pre-existing group. The estimator, chosen for its fast convergence, does not impose the monotonicity assumption regarding the model noise component, thereby potentially reversing the order of the fixed effect coefficients. This and other numerical issues can be conveniently tackled by my novel, data-driven way of normalising the fixed effects, which proves to outperform a conventional standardisation in many cases. I demonstrate that the normalised coefficients converge both at the same rate and to the same limiting distribution as if the true error distribution was known. The cost of semiparametric estimation is thus purely computational, while the potential benefits are large whenever the errors have a strongly convex or strongly concave distribution."}, "https://arxiv.org/abs/2408.04313": {"title": "Better Locally Private Sparse Estimation Given Multiple Samples Per User", "link": "https://arxiv.org/abs/2408.04313", "description": "arXiv:2408.04313v1 Announce Type: cross \nAbstract: Previous studies yielded discouraging results for item-level locally differentially private linear regression with the $s^*$-sparsity assumption, where the minimax rate for $nm$ samples is $\\mathcal{O}(s^{*}d / nm\\varepsilon^2)$. This can be challenging for high-dimensional data, where the dimension $d$ is extremely large. In this work, we investigate user-level locally differentially private sparse linear regression. We show that with $n$ users each contributing $m$ samples, the linear dependency of dimension $d$ can be eliminated, yielding an error upper bound of $\\mathcal{O}(s^{*2} / nm\\varepsilon^2)$. 
We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space, which is extendable to general sparse estimation problems with tight error bounds. Experiments on both synthetic and real datasets demonstrate the superiority of the proposed methods. Both the theoretical and empirical results suggest that, with the same number of samples, locally private sparse estimation is better conducted when multiple samples per user are available."}, "https://arxiv.org/abs/2408.04617": {"title": "Difference-in-Differences for Health Policy and Practice: A Review of Modern Methods", "link": "https://arxiv.org/abs/2408.04617", "description": "arXiv:2408.04617v1 Announce Type: cross \nAbstract: Difference-in-differences (DiD) is the most popular observational causal inference method in health policy, employed to evaluate the real-world impact of policies and programs. To estimate treatment effects, DiD relies on the \"parallel trends assumption\", that on average treatment and comparison groups would have had parallel trajectories in the absence of an intervention. Historically, DiD has been considered broadly applicable and straightforward to implement, but recent years have seen rapid advancements in DiD methods. This paper reviews and synthesizes these innovations for medical and health policy researchers. We focus on four topics: (1) assessing the parallel trends assumption in health policy contexts; (2) relaxing the parallel trends assumption when appropriate; (3) employing estimators to account for staggered treatment timing; and (4) conducting robust inference for analyses in which normal-based clustered standard errors are inappropriate. For each, we explain challenges and common pitfalls in traditional DiD and modern methods available to address these issues."}, "https://arxiv.org/abs/2010.01800": {"title": "Robust and Efficient Estimation of Potential Outcome Means under Random Assignment", "link": "https://arxiv.org/abs/2010.01800", "description": "arXiv:2010.01800v2 Announce Type: replace \nAbstract: We study efficiency improvements in randomized experiments for estimating a vector of potential outcome means using regression adjustment (RA) when there are more than two treatment levels. We show that linear RA which estimates separate slopes for each assignment level is never worse, asymptotically, than using the subsample averages. We also show that separate RA improves over pooled RA except in the obvious case where slope parameters in the linear projections are identical across the different assignment levels. We further characterize the class of nonlinear RA methods that preserve consistency of the potential outcome means despite arbitrary misspecification of the conditional mean functions. Finally, we apply these regression adjustment techniques to efficiently estimate the lower bound mean willingness to pay for an oil spill prevention program in California."}, "https://arxiv.org/abs/2408.04730": {"title": "Vela: A Data-Driven Proposal for Joint Collaboration in Space Exploration", "link": "https://arxiv.org/abs/2408.04730", "description": "arXiv:2408.04730v1 Announce Type: new \nAbstract: The UN Office of Outer Space Affairs identifies synergy of space development activities and international cooperation through data and infrastructure sharing in their Sustainable Development Goal 17 (SDG17). 
Current multilateral space exploration paradigms, however, are divided between the Artemis and the Roscosmos-CNSA programs to return to the moon and establish permanent human settlements. As space agencies work to expand human presence in space, economic resource consolidation in pursuit of technologically ambitious space expeditions is the most sensible path to accomplish SDG17. This paper compiles a budget dataset for the top five federally-funded space agencies: CNSA, ESA, JAXA, NASA, and Roscosmos. Using time-series econometric analysis methods in STATA, this work analyzes each agency's economic contributions toward space exploration. The dataset results are used to propose a multinational space mission, Vela, for the development of an orbiting space station around Mars in the late 2030s. A distribution of economic resources and technological capabilities across the respective space programs is proposed to ensure programmatic redundancy and increase the odds of success on the given timeline."}, "https://arxiv.org/abs/2408.04854": {"title": "A propensity score weighting approach to integrate aggregated data in random-effect individual-level data meta-analysis", "link": "https://arxiv.org/abs/2408.04854", "description": "arXiv:2408.04854v1 Announce Type: new \nAbstract: In evidence synthesis, collecting individual participant data (IPD) across eligible studies is the most reliable way to investigate the treatment effects in different subgroups defined by participant characteristics. Nonetheless, access to all IPD from all studies might be very challenging due to privacy concerns. To overcome this, many approaches such as multilevel modeling have been proposed to incorporate the vast amount of aggregated data from the literature into IPD meta-analysis. These methods, however, often rely on specifying separate models for trial-level versus patient-level data, which likely suffers from ecological bias when there are non-linearities in the outcome generating mechanism. In this paper, we introduce a novel method to combine aggregated data and IPD in meta-analysis that is free from ecological bias. The proposed approach relies on modeling the study membership given covariates, then using inverse weighting to estimate the trial-specific coefficients in the individual-level outcome model of studies without accessible IPD. The weights derived from this approach also provide insight into the similarity of the case-mix across studies, which is useful to assess whether eligible trials are sufficiently similar to be meta-analyzed. We evaluate the proposed method using synthetic data, then apply it to a real-world meta-analysis comparing the chance of response between guselkumab and adalimumab among patients with psoriasis."}, "https://arxiv.org/abs/2408.04933": {"title": "Variance-based sensitivity analysis in the presence of correlated input variables", "link": "https://arxiv.org/abs/2408.04933", "description": "arXiv:2408.04933v1 Announce Type: new \nAbstract: In this paper we propose an extension of the classical Sobol' estimator for the estimation of variance-based sensitivity indices. The approach assumes a linear correlation model between the input variables, which is used to decompose the contribution of an input variable into a correlated and an uncorrelated part. 
This method provides sampling matrices following the original joint probability distribution which are used directly to compute the model output without any assumptions or approximations of the model response function."}, "https://arxiv.org/abs/2408.05106": {"title": "Spatial Deconfounding is Reasonable Statistical Practice: Interpretations, Clarifications, and New Benefits", "link": "https://arxiv.org/abs/2408.05106", "description": "arXiv:2408.05106v1 Announce Type: new \nAbstract: The spatial linear mixed model (SLMM) consists of fixed and spatial random effects that can be confounded. Restricted spatial regression (RSR) models restrict the spatial random effects to be in the orthogonal column space of the covariates, which \"deconfounds\" the SLMM. Recent articles have shown that the RSR generally performs worse than the SLMM under a certain interpretation of the RSR. We show that every additive model can be reparameterized as a deconfounded model leading to what we call the linear reparameterization of additive models (LRAM). Under this reparameterization the coefficients of the covariates (referred to as deconfounded regression effects) are different from the (confounded) regression effects in the SLMM. It is shown that under the LRAM interpretation, existing deconfounded spatial models produce estimated deconfounded regression effects, spatial prediction, and spatial prediction variances equivalent to that of SLMM in Bayesian contexts. Furthermore, a general RSR (GRSR) and the SLMM produce identical inferences on confounded regression effects. While our results are in complete agreement with recent criticisms, our new results under the LRAM interpretation provide clarifications that lead to different and sometimes contrary conclusions. Additionally, we discuss the inferential and computational benefits to deconfounding, which we illustrate via a simulation."}, "https://arxiv.org/abs/2408.05209": {"title": "What are the real implications for $CO_2$ as generation from renewables increases?", "link": "https://arxiv.org/abs/2408.05209", "description": "arXiv:2408.05209v1 Announce Type: new \nAbstract: Wind and solar electricity generation account for 14% of total electricity generation in the United States and are expected to continue to grow in the next decades. In low carbon systems, generation from renewable energy sources displaces conventional fossil fuel power plants resulting in lower system-level emissions and emissions intensity. However, we find that intermittent generation from renewables changes the way conventional thermal power plants operate, and that the displacement of generation is not 1 to 1 as expected. Our work provides a method that allows policy and decision makers to continue to track the effect of additional renewable capacity and the resulting thermal power plant operational responses."}, "https://arxiv.org/abs/2408.04907": {"title": "Causal Discovery of Linear Non-Gaussian Causal Models with Unobserved Confounding", "link": "https://arxiv.org/abs/2408.04907", "description": "arXiv:2408.04907v1 Announce Type: cross \nAbstract: We consider linear non-Gaussian structural equation models that involve latent confounding. In this setting, the causal structure is identifiable, but, in general, it is not possible to identify the specific causal effects. Instead, a finite number of different causal effects result in the same observational distribution. 
Most existing algorithms for identifying these causal effects use overcomplete independent component analysis (ICA), which often suffers from convergence to local optima. Furthermore, the number of latent variables must be known a priori. To address these issues, we propose an algorithm that operates recursively rather than using overcomplete ICA. The algorithm first infers a source, estimates the effect of the source and its latent parents on their descendants, and then eliminates their influence from the data. For both source identification and effect size estimation, we use rank conditions on matrices formed from higher-order cumulants. We prove asymptotic correctness under the mild assumption that locally, the number of latent variables never exceeds the number of observed variables. Simulation studies demonstrate that our method achieves comparable performance to overcomplete ICA even though it does not know the number of latents in advance."}, "https://arxiv.org/abs/2206.03038": {"title": "Asymptotic Distribution-free Change-point Detection for Modern Data Based on a New Ranking Scheme", "link": "https://arxiv.org/abs/2206.03038", "description": "arXiv:2206.03038v3 Announce Type: replace \nAbstract: Change-point detection (CPD) involves identifying distributional changes in a sequence of independent observations. Among nonparametric methods, rank-based methods are attractive due to their robustness and effectiveness and have been extensively studied for univariate data. However, they are not well explored for high-dimensional or non-Euclidean data. This paper proposes a new method, Rank INduced by Graph Change-Point Detection (RING-CPD), which utilizes graph-induced ranks to handle high-dimensional and non-Euclidean data. The new method is asymptotically distribution-free under the null hypothesis, and an analytic $p$-value approximation is provided for easy type-I error control. Simulation studies show that RING-CPD effectively detects change points across a wide range of alternatives and is also robust to heavy-tailed distribution and outliers. The new method is illustrated by the detection of seizures in a functional connectivity network dataset, changes of digit images, and travel pattern changes in the New York City Taxi dataset."}, "https://arxiv.org/abs/2306.15199": {"title": "A new classification framework for high-dimensional data", "link": "https://arxiv.org/abs/2306.15199", "description": "arXiv:2306.15199v2 Announce Type: replace \nAbstract: Classification, a fundamental problem in many fields, faces significant challenges when handling a large number of features, a scenario commonly encountered in modern applications, such as identifying tumor subtypes from genomic data or categorizing customer attitudes based on online reviews. We propose a novel framework that utilizes the ranks of pairwise distances among observations and identifies consistent patterns in moderate- to high- dimensional data, which previous methods have overlooked. The proposed method exhibits superior performance across a variety of scenarios, from high-dimensional data to network data. 
We further explore a typical setting to investigate key quantities that play essential roles in our framework, which reveal the framework's capabilities in distinguishing differences in the first and/or second moment, as well as distinctions in higher moments."}, "https://arxiv.org/abs/2312.05365": {"title": "Product Centered Dirichlet Processes for Dependent Clustering", "link": "https://arxiv.org/abs/2312.05365", "description": "arXiv:2312.05365v2 Announce Type: replace \nAbstract: While there is an immense literature on Bayesian methods for clustering, the multiview case has received little attention. This problem focuses on obtaining distinct but statistically dependent clusterings in a common set of entities for different data types. For example, clustering patients into subgroups with subgroup membership varying according to the domain of the patient variables. A challenge is how to model the across-view dependence between the partitions of patients into subgroups. The complexities of the partition space make standard methods to model dependence, such as correlation, infeasible. In this article, we propose CLustering with Independence Centering (CLIC), a clustering prior that uses a single parameter to explicitly model dependence between clusterings across views. CLIC is induced by the product centered Dirichlet process (PCDP), a novel hierarchical prior that bridges between independent and equivalent partitions. We show appealing theoretic properties, provide a finite approximation and prove its accuracy, present a marginal Gibbs sampler for posterior computation, and derive closed form expressions for the marginal and joint partition distributions for the CLIC model. On synthetic data and in an application to epidemiology, CLIC accurately characterizes view-specific partitions while providing inference on the dependence level."}, "https://arxiv.org/abs/2408.05297": {"title": "Bootstrap Matching: a robust and efficient correction for non-random A/B test, and its applications", "link": "https://arxiv.org/abs/2408.05297", "description": "arXiv:2408.05297v1 Announce Type: new \nAbstract: A/B testing, a widely used form of Randomized Controlled Trial (RCT), is a fundamental tool in business data analysis and experimental design. However, despite its intent to maintain randomness, A/B testing often faces challenges that compromise this randomness, leading to significant limitations in practice. In this study, we introduce Bootstrap Matching, an innovative approach that integrates Bootstrap resampling, Matching techniques, and high-dimensional hypothesis testing to address the shortcomings of A/B tests when true randomization is not achieved. Unlike traditional methods such as Difference-in-Differences (DID) and Propensity Score Matching (PSM), Bootstrap Matching is tailored for large-scale datasets, offering enhanced robustness and computational efficiency. 
We illustrate the effectiveness of this methodology through a real-world application in online advertising and further discuss its potential applications in digital marketing, empirical economics, clinical trials, and high-dimensional bioinformatics."}, "https://arxiv.org/abs/2408.05342": {"title": "Optimal Treatment Allocation Strategies for A/B Testing in Partially Observable Time Series Experiments", "link": "https://arxiv.org/abs/2408.05342", "description": "arXiv:2408.05342v1 Announce Type: new \nAbstract: Time series experiments, in which experimental units receive a sequence of treatments over time, are frequently employed in many technological companies to evaluate the performance of a newly developed policy, product, or treatment relative to a baseline control. Many existing A/B testing solutions assume a fully observable experimental environment that satisfies the Markov condition, which often does not hold in practice. This paper studies the optimal design for A/B testing in partially observable environments. We introduce a controlled (vector) autoregressive moving average model to capture partial observability. We introduce a small signal asymptotic framework to simplify the analysis of asymptotic mean squared errors of average treatment effect estimators under various designs. We develop two algorithms to estimate the optimal design: one utilizing constrained optimization and the other employing reinforcement learning. We demonstrate the superior performance of our designs using a dispatch simulator and two real datasets from a ride-sharing company."}, "https://arxiv.org/abs/2408.05386": {"title": "Debiased Estimating Equation Method for Versatile and Efficient Mendelian Randomization Using Large Numbers of Correlated Weak and Invalid Instruments", "link": "https://arxiv.org/abs/2408.05386", "description": "arXiv:2408.05386v1 Announce Type: new \nAbstract: Mendelian randomization (MR) is a widely used tool for causal inference in the presence of unobserved confounders, which uses single nucleotide polymorphisms (SNPs) as instrumental variables (IVs) to estimate causal effects. However, SNPs often have weak effects on complex traits, leading to bias and low statistical efficiency in existing MR analysis due to weak instruments that are often used. The linkage disequilibrium (LD) among SNPs poses additional statistical hurdles. Specifically, existing MR methods often restrict analysis to independent SNPs via LD clumping and result in efficiency loss in estimating the causal effect. To address these issues, we propose the Debiased Estimating Equation Method (DEEM), a summary statistics-based MR approach that incorporates a large number of correlated weak-effect and invalid SNPs. DEEM not only effectively eliminates the weak IV bias but also improves the statistical efficiency of the causal effect estimation by leveraging information from many correlated SNPs. DEEM is a versatile method that allows for pleiotropic effects, adjusts for Winner's curse, and is applicable to both two-sample and one-sample MR analyses. Asymptotic analyses of the DEEM estimator demonstrate its attractive theoretical properties. 
Through extensive simulations and two real data examples, we demonstrate that DEEM improves the efficiency and robustness of MR analysis compared with existing methods."}, "https://arxiv.org/abs/2408.05665": {"title": "Change-Point Detection in Time Series Using Mixed Integer Programming", "link": "https://arxiv.org/abs/2408.05665", "description": "arXiv:2408.05665v1 Announce Type: new \nAbstract: We use cutting-edge mixed integer optimization (MIO) methods to develop a framework for detection and estimation of structural breaks in time series regression models. The framework is constructed based on the least squares problem subject to a penalty on the number of breakpoints. We restate the $l_0$-penalized regression problem as a quadratic programming problem with integer- and real-valued arguments and show that MIO is capable of finding provably optimal solutions using a well-known optimization solver. Compared to the popular $l_1$-penalized regression (LASSO) and other classical methods, the MIO framework permits simultaneous estimation of the number and location of structural breaks as well as regression coefficients, while accommodating the option of specifying a given or minimal number of breaks. We derive the asymptotic properties of the estimator and demonstrate its effectiveness through extensive numerical experiments, confirming a more accurate estimation of multiple breaks as compared to popular non-MIO alternatives. Two empirical examples demonstrate usefulness of the framework in applications from business and economic statistics."}, "https://arxiv.org/abs/2408.05688": {"title": "Bank Cost Efficiency and Credit Market Structure Under a Volatile Exchange Rate", "link": "https://arxiv.org/abs/2408.05688", "description": "arXiv:2408.05688v1 Announce Type: new \nAbstract: We study the impact of exchange rate volatility on cost efficiency and market structure in a cross-section of banks that have non-trivial exposures to foreign currency (FX) operations. We use unique data on quarterly revaluations of FX assets and liabilities (Revals) that Russian banks were reporting between 2004 Q1 and 2020 Q2. {\\it First}, we document that Revals constitute the largest part of the banks' total costs, 26.5\\% on average, with considerable variation across banks. {\\it Second}, we find that stochastic estimates of cost efficiency are both severely downward biased -- by 30\\% on average -- and generally not rank preserving when Revals are ignored, except for the tails, as our nonparametric copulas reveal. To ensure generalizability to other emerging market economies, we suggest a two-stage approach that does not rely on Revals but is able to shrink the downward bias in cost efficiency estimates by two-thirds. {\\it Third}, we show that Revals are triggered by the mismatch in the banks' FX operations, which, in turn, is driven by household FX deposits and the instability of Ruble's exchange rate. {\\it Fourth}, we find that the failure to account for Revals leads to the erroneous conclusion that the credit market is inefficient, which is driven by the upper quartile of the banks' distribution by total assets. 
Revals have considerable negative implications for financial stability which can be attenuated by the cross-border diversification of bank assets."}, "https://arxiv.org/abs/2408.05747": {"title": "Addressing Outcome Reporting Bias in Meta-analysis: A Selection Model Perspective", "link": "https://arxiv.org/abs/2408.05747", "description": "arXiv:2408.05747v1 Announce Type: new \nAbstract: Outcome Reporting Bias (ORB) poses significant threats to the validity of meta-analytic findings. It occurs when researchers selectively report outcomes based on the significance or direction of results, potentially leading to distorted treatment effect estimates. Despite its critical implications, ORB remains an under-recognized issue, with few comprehensive adjustment methods available. The goal of this research is to investigate ORB-adjustment techniques through a selection model lens, thereby extending some of the existing methodological approaches available in the literature. To gain a better insight into the effects of ORB in meta-analysis of clinical trials, specifically in the presence of heterogeneity, and to assess the effectiveness of ORB-adjustment techniques, we apply the methodology to real clinical data affected by ORB and conduct a simulation study focusing on treatment effect estimation with a secondary interest in heterogeneity quantification."}, "https://arxiv.org/abs/2408.05816": {"title": "BOP2-TE: Bayesian Optimal Phase 2 Design for Jointly Monitoring Efficacy and Toxicity with Application to Dose Optimization", "link": "https://arxiv.org/abs/2408.05816", "description": "arXiv:2408.05816v1 Announce Type: new \nAbstract: We propose a Bayesian optimal phase 2 design for jointly monitoring efficacy and toxicity, referred to as BOP2-TE, to improve the operating characteristics of the BOP2 design proposed by Zhou et al. (2017). BOP2-TE utilizes a Dirichlet-multinomial model to jointly model the distribution of toxicity and efficacy endpoints, making go/no-go decisions based on the posterior probability of toxicity and futility. In comparison to the original BOP2 and other existing designs, BOP2-TE offers the advantage of providing rigorous type I error control in cases where the treatment is toxic and futile, effective but toxic, or safe but futile, while optimizing power when the treatment is effective and safe. As a result, BOP2-TE enhances trial safety and efficacy. We also explore the incorporation of BOP2-TE into multiple-dose randomized trials for dose optimization, and consider a seamless design that integrates phase I dose finding with phase II randomized dose optimization. BOP2-TE is user-friendly, as its decision boundary can be determined prior to the trial's onset. Simulations demonstrate that BOP2-TE possesses desirable operating characteristics. We have developed a user-friendly web application as part of the BOP2 app, which is freely available at www.trialdesign.org."}, "https://arxiv.org/abs/2408.05847": {"title": "Correcting invalid regression discontinuity designs with multiple time period data", "link": "https://arxiv.org/abs/2408.05847", "description": "arXiv:2408.05847v1 Announce Type: new \nAbstract: A common approach to Regression Discontinuity (RD) designs relies on a continuity assumption of the mean potential outcomes at the cutoff defining the RD design. In practice, this assumption is often implausible when changes other than the intervention of interest occur at the cutoff (e.g., other policies are implemented at the same cutoff). 
When the continuity assumption is implausible, researchers often retreat to ad-hoc analyses that are not supported by any theory and yield results with unclear causal interpretation. These analyses seek to exploit additional data where either all units are treated or all units are untreated (regardless of their running variable value), for example, when data from multiple time periods are available. We first derive the bias of RD designs when the continuity assumption does not hold. We then present a theoretical foundation for analyses using multiple time periods by means of a general identification framework incorporating data from additional time periods to overcome the bias. We discuss this framework under various RD designs, and also extend our work to carry-over effects and time-varying running variables. We develop local linear regression estimators, bias correction procedures, and standard errors that are robust to bias-correction for the multiple period setup. The approach is illustrated using an application that studied the effect of new fiscal laws on the debt of Italian municipalities."}, "https://arxiv.org/abs/2408.05862": {"title": "Censored and extreme losses: functional convergence and applications to tail goodness-of-fit", "link": "https://arxiv.org/abs/2408.05862", "description": "arXiv:2408.05862v1 Announce Type: new \nAbstract: This paper establishes the functional convergence of the Extreme Nelson--Aalen and Extreme Kaplan--Meier estimators, which are designed to capture the heavy-tailed behaviour of censored losses. The resulting limit representations can be used to obtain the distributions of pathwise functionals with respect to the so-called tail process. For instance, we may recover the convergence of a censored Hill estimator, and we further investigate two goodness-of-fit statistics for the tail of the loss distribution. Using the latter limit theorems, we propose two rules for selecting a suitable number of order statistics, both based on test statistics derived from the functional convergence results. The effectiveness of these selection rules is investigated through simulations and an application to a real dataset comprised of French motor insurance claim sizes."}, "https://arxiv.org/abs/2408.05887": {"title": "Statistically Optimal Uncertainty Quantification for Expensive Black-Box Models", "link": "https://arxiv.org/abs/2408.05887", "description": "arXiv:2408.05887v1 Announce Type: new \nAbstract: Uncertainty quantification, by means of confidence interval (CI) construction, has been a fundamental problem in statistics and also important in risk-aware decision-making. In this paper, we revisit the basic problem of CI construction, but in the setting of expensive black-box models. This means we are confined to using a low number of model runs, and without the ability to obtain auxiliary model information such as gradients. In this case, there exist classical methods based on data splitting, and newer methods based on suitable resampling. However, while all these resulting CIs have similarly accurate coverage in large samples, their efficiencies in terms of interval length differ, and a systematic understanding of which method and configuration attains the shortest interval appears open. Motivated by this, we create a theoretical framework to study the statistical optimality of CI tightness under computation constraints. 
Our theory shows that standard batching, but also carefully constructed new formulas using uneven-size or overlapping batches, batched jackknife, and the so-called cheap bootstrap and its weighted generalizations, are statistically optimal. Our developments build on a new bridge of the classical notion of uniformly most accurate unbiasedness with batching and resampling, by viewing model runs as asymptotically Gaussian \"data\", as well as a suitable notion of homogeneity for CIs."}, "https://arxiv.org/abs/2408.05896": {"title": "Scalable recommender system based on factor analysis", "link": "https://arxiv.org/abs/2408.05896", "description": "arXiv:2408.05896v1 Announce Type: new \nAbstract: Recommender systems have become crucial in the modern digital landscape, where personalized content, products, and services are essential for enhancing user experience. This paper explores statistical models for recommender systems, focusing on crossed random effects models and factor analysis. We extend the crossed random effects model to include random slopes, enabling the capture of varying covariate effects among users and items. Additionally, we investigate the use of factor analysis in recommender systems, particularly for settings with incomplete data. The paper also discusses scalable solutions using the Expectation Maximization (EM) and variational EM algorithms for parameter estimation, highlighting the application of these models to predict user-item interactions effectively."}, "https://arxiv.org/abs/2408.05961": {"title": "Cross-Spectral Analysis of Bivariate Graph Signals", "link": "https://arxiv.org/abs/2408.05961", "description": "arXiv:2408.05961v1 Announce Type: new \nAbstract: With the advancements in technology and monitoring tools, we often encounter multivariate graph signals, which can be seen as the realizations of multivariate graph processes, and revealing the relationship between their constituent quantities is one of the important problems. To address this issue, we propose a cross-spectral analysis tool for bivariate graph signals. The main goal of this study is to extend the scope of spectral analysis of graph signals to multivariate graph signals. In this study, we define joint weak stationarity graph processes and introduce graph cross-spectral density and coherence for multivariate graph processes. We propose several estimators for the cross-spectral density and investigate the theoretical properties of the proposed estimators. Furthermore, we demonstrate the effectiveness of the proposed estimators through numerical experiments, including simulation studies and a real data application. Finally, as an interesting extension, we discuss robust spectral analysis of graph signals in the presence of outliers."}, "https://arxiv.org/abs/2408.06046": {"title": "Identifying Total Causal Effects in Linear Models under Partial Homoscedasticity", "link": "https://arxiv.org/abs/2408.06046", "description": "arXiv:2408.06046v1 Announce Type: new \nAbstract: A fundamental challenge of scientific research is inferring causal relations based on observed data. One commonly used approach involves utilizing structural causal models that postulate noisy functional relations among interacting variables. A directed graph naturally represents these models and reflects the underlying causal structure. 
However, classical identifiability results suggest that, without conducting additional experiments, this causal graph can only be identified up to a Markov equivalence class of indistinguishable models. Recent research has shown that focusing on linear relations with equal error variances can enable the identification of the causal structure from mere observational data. Nonetheless, practitioners are often primarily interested in the effects of specific interventions, rendering the complete identification of the causal structure unnecessary. In this work, we investigate the extent to which less restrictive assumptions of partial homoscedasticity are sufficient for identifying the causal effects of interest. Furthermore, we construct mathematically rigorous confidence regions for total causal effects under structure uncertainty and explore the performance gain of relying on stricter error assumptions in a simulation study."}, "https://arxiv.org/abs/2408.06211": {"title": "Extreme-based causal effect learning with endogenous exposures and a light-tailed error", "link": "https://arxiv.org/abs/2408.06211", "description": "arXiv:2408.06211v1 Announce Type: new \nAbstract: Endogeneity poses significant challenges in causal inference across various research domains. This paper proposes a novel approach to identify and estimate causal effects in the presence of endogeneity. We consider a structural equation with endogenous exposures and an additive error term. Assuming the light-tailedness of the error term, we show that the causal effect can be identified by contrasting extreme conditional quantiles of the outcome given the exposures. Unlike many existing results, our identification approach does not rely on additional parametric assumptions or auxiliary variables. Building on the identification result, we develop an EXtreme-based Causal Effect Learning (EXCEL) method that estimates the causal effect using extreme quantile regression. We establish the consistency of the EXCEL estimator under a general additive structural equation and demonstrate its asymptotic normality in the linear model setting. These results reveal that extreme quantile regression is invulnerable to endogeneity when the error term is light-tailed, which is not appreciated in the literature to our knowledge. The EXCEL method is applied to causal inference problems with invalid instruments to construct a valid confidence set for the causal effect. Simulations and data analysis of an automobile sale dataset show the effectiveness of our method in addressing endogeneity."}, "https://arxiv.org/abs/2408.06263": {"title": "Optimal Integrative Estimation for Distributed Precision Matrices with Heterogeneity Adjustment", "link": "https://arxiv.org/abs/2408.06263", "description": "arXiv:2408.06263v1 Announce Type: new \nAbstract: Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring communication efficiency pose significant challenges on distributed statistical analysis. In this article, we focus on integrative estimation of distributed heterogeneous precision matrices, a crucial task related to joint precision matrix estimation where computation-efficient algorithms and statistical optimality theories are still underdeveloped. 
To tackle these challenges, we introduce a novel HEterogeneity-adjusted Aggregating and Thresholding (HEAT) approach for distributed integrative estimation. HEAT is designed to be both communication- and computation-efficient, and we demonstrate its statistical optimality by establishing the convergence rates and the corresponding minimax lower bounds under various integrative losses. To enhance the optimality of HEAT, we further propose an iterative HEAT (IteHEAT) approach. By iteratively refining the higher-order errors of HEAT estimators through multi-round communications, IteHEAT achieves geometric contraction rates of convergence. Extensive simulations and real data applications validate the numerical performance of HEAT and IteHEAT methods."}, "https://arxiv.org/abs/2408.06323": {"title": "Infer-and-widen versus split-and-condition: two tales of selective inference", "link": "https://arxiv.org/abs/2408.06323", "description": "arXiv:2408.06323v1 Announce Type: new \nAbstract: Recent attention has focused on the development of methods for post-selection inference. However, the connections between these methods, and the extent to which one might be preferred to another, remain unclear. In this paper, we classify existing methods for post-selection inference into one of two frameworks: infer-and-widen or split-and-condition. The infer-and-widen framework produces confidence intervals whose midpoints are biased due to selection, and must be wide enough to account for this bias. By contrast, split-and-condition directly adjusts the intervals' midpoints to account for selection. We compare the two frameworks in three vignettes: the winner's curse, maximal contrasts, and inference after the lasso. Our results are striking: in each of these examples, a split-and-condition strategy leads to confidence intervals that are much narrower than the state-of-the-art infer-and-widen proposal, when methods are tuned to yield identical selection events. Furthermore, even an ``oracle\" infer-and-widen confidence interval -- the narrowest possible interval that could be theoretically attained via infer-and-widen -- is not necessarily narrower than a feasible split-and-condition method. Taken together, these results point to split-and-condition as the most promising framework for post-selection inference in real-world settings."}, "https://arxiv.org/abs/2408.05428": {"title": "Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression", "link": "https://arxiv.org/abs/2408.05428", "description": "arXiv:2408.05428v1 Announce Type: cross \nAbstract: In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements act as instrumental variables (IVs), facilitating the identification of causal effects through leveraging exogenous perturbations in discrete treatment scenarios. However, real-world applications of encouragement designs often face challenges such as incomplete randomization, limited experimental data, and significantly fewer encouragements compared to treatments, hindering precise causal effect estimation. 
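To make the instrumental-variable role of randomized encouragement in the arXiv:2408.05428 abstract above concrete, here is a minimal Python sketch of the classical Wald/IV ratio Cov(Y,Z)/Cov(D,Z), the textbook baseline that encouragement designs build on; it is not the EnCounteR estimator, and the data-generating process below (constant treatment effect, valid encouragement) is hypothetical.

    import numpy as np

    def wald_iv(y, d, z):
        # Ratio of reduced-form to first-stage covariances: Cov(Y, Z) / Cov(D, Z)
        return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

    rng = np.random.default_rng(1)
    n = 20_000
    z = rng.binomial(1, 0.5, n)                                        # randomized encouragement
    u = rng.normal(size=n)                                             # unobserved confounder
    d = (0.5 * z + 0.8 * u + rng.normal(size=n) > 0.5).astype(float)   # treatment uptake
    y = 2.0 * d + u + rng.normal(size=n)                               # true effect of D on Y is 2
    print(wald_iv(y, d, z))                                            # approximately 2 despite confounding by u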
To address these challenges, this paper introduces novel theories and algorithms for identifying the Conditional Average Treatment Effect (CATE) using variations in encouragement. Further, by leveraging both observational and encouragement data, we propose a generalized IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR), to effectively estimate the causal effects. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of EnCounteR over existing methods."}, "https://arxiv.org/abs/2408.05514": {"title": "Testing Elliptical Models in High Dimensions", "link": "https://arxiv.org/abs/2408.05514", "description": "arXiv:2408.05514v1 Announce Type: cross \nAbstract: Due to the broad applications of elliptical models, there is a long line of research on goodness-of-fit tests for empirically validating them. However, the existing literature on this topic is generally confined to low-dimensional settings, and to the best of our knowledge, there are no established goodness-of-fit tests for elliptical models that are supported by theoretical guarantees in high dimensions. In this paper, we propose a new goodness-of-fit test for this problem, and our main result shows that the test is asymptotically valid when the dimension and sample size diverge proportionally. Remarkably, it also turns out that the asymptotic validity of the test requires no assumptions on the population covariance matrix. With regard to numerical performance, we confirm that the empirical level of the test is close to the nominal level across a range of conditions, and that the test is able to reliably detect non-elliptical distributions. Moreover, when the proposed test is specialized to the problem of testing normality in high dimensions, we show that it compares favorably with a state-of-the-art method, and hence, this way of using the proposed test is of independent interest."}, "https://arxiv.org/abs/2408.05584": {"title": "Dynamical causality under invisible confounders", "link": "https://arxiv.org/abs/2408.05584", "description": "arXiv:2408.05584v1 Announce Type: cross \nAbstract: Causal inference is prone to spurious causal interactions due to the many confounders present in a complex system. While many existing methods based on statistical or dynamical approaches attempt to address misidentification challenges, there remains a notable lack of effective methods to infer causality, in particular in the presence of invisible/unobservable confounders. As a result, accurately inferring causation with invisible confounders remains a largely unexplored and outstanding issue in the data science and AI fields. In this work, we propose a method to infer dynamical causality under invisible confounders (the CIC method) and further reconstruct the invisible confounders from time-series data by developing an orthogonal decomposition theorem in a delay embedding space. The core of our CIC method lies in its ability to decompose the observed variables, not in their original space but in their delay embedding space, into common and private subspaces, thereby quantifying causality between those variables both theoretically and computationally. This theoretical foundation ensures causal detection for any high-dimensional system even with only two observed variables under many invisible confounders, which is actually a long-standing problem in the field. 
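For readers unfamiliar with the delay embedding space used by the CIC method in the arXiv:2408.05584 abstract above, the Python sketch below shows only the generic Takens-style delay-embedding construction; the orthogonal decomposition into common and private subspaces described in the abstract is not reproduced here, and the toy series is hypothetical.

    import numpy as np

    def delay_embed(x, dim, lag):
        # Rows are delay vectors [x_t, x_{t-lag}, ..., x_{t-(dim-1)*lag}]
        x = np.asarray(x)
        n = len(x) - (dim - 1) * lag
        cols = [x[(dim - 1 - k) * lag : (dim - 1 - k) * lag + n] for k in range(dim)]
        return np.column_stack(cols)

    x = np.sin(0.1 * np.arange(200))        # toy observed time series
    E = delay_embed(x, dim=3, lag=2)        # shape (196, 3)
    print(E.shape, E[0])                    # first row: [x_4, x_2, x_0]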
In addition to the invisible confounder problem, such a decomposition actually makes the intertwined variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out using various real datasets, and the experimental results demonstrate its effectiveness in reconstructing real biological networks even with unobserved confounders."}, "https://arxiv.org/abs/2408.05647": {"title": "Controlling for discrete unmeasured confounding in nonlinear causal models", "link": "https://arxiv.org/abs/2408.05647", "description": "arXiv:2408.05647v1 Announce Type: cross \nAbstract: Unmeasured confounding is a major challenge for identifying causal relationships from non-experimental data. Here, we propose a method that can accommodate unmeasured discrete confounding. Extending recent identifiability results in deep latent variable models, we show theoretically that confounding can be detected and corrected under the assumption that the observed data is a piecewise affine transformation of a latent Gaussian mixture model and that the identity of the mixture components is confounded. We provide a flow-based algorithm to estimate this model and perform deconfounding. Experimental results on synthetic and real-world data provide support for the effectiveness of our approach."}, "https://arxiv.org/abs/2408.05854": {"title": "On the Robustness of Kernel Goodness-of-Fit Tests", "link": "https://arxiv.org/abs/2408.05854", "description": "arXiv:2408.05854v1 Announce Type: cross \nAbstract: Goodness-of-fit testing is often criticized for its lack of practical relevance; since ``all models are wrong'', the null hypothesis that the data conform to our model is ultimately always rejected when the sample size is large enough. Despite this, probabilistic models are still used extensively, raising the more pertinent question of whether the model is good enough for a specific task. This question can be formalized as a robust goodness-of-fit testing problem by asking whether the data were generated by a distribution corresponding to our model up to some mild perturbation. In this paper, we show that existing kernel goodness-of-fit tests are not robust according to common notions of robustness including qualitative and quantitative robustness. We also show that robust techniques based on tilted kernels from the parameter estimation literature are not sufficient for ensuring both types of robustness in the context of goodness-of-fit testing. We therefore propose the first robust kernel goodness-of-fit test which resolves this open problem using kernel Stein discrepancy balls, which encompass perturbation models such as Huber contamination models and density uncertainty bands."}, "https://arxiv.org/abs/2408.06103": {"title": "Method-of-Moments Inference for GLMs and Doubly Robust Functionals under Proportional Asymptotics", "link": "https://arxiv.org/abs/2408.06103", "description": "arXiv:2408.06103v1 Announce Type: cross \nAbstract: In this paper, we consider the estimation of regression coefficients and signal-to-noise (SNR) ratio in high-dimensional Generalized Linear Models (GLMs), and explore their implications in inferring popular estimands such as average treatment effects in high-dimensional observational studies. 
Under the ``proportional asymptotic'' regime and Gaussian covariates with known (population) covariance $\\Sigma$, we derive Consistent and Asymptotically Normal (CAN) estimators of our targets of inference through a Method-of-Moments type of estimators that bypasses estimation of high dimensional nuisance functions and hyperparameter tuning altogether. Additionally, under non-Gaussian covariates, we demonstrate universality of our results under certain additional assumptions on the regression coefficients and $\\Sigma$. We also demonstrate that knowing $\\Sigma$ is not essential to our proposed methodology when the sample covariance matrix estimator is invertible. Finally, we complement our theoretical results with numerical experiments and comparisons with existing literature."}, "https://arxiv.org/abs/2408.06277": {"title": "Multi-marginal Schr\\\"odinger Bridges with Iterative Reference", "link": "https://arxiv.org/abs/2408.06277", "description": "arXiv:2408.06277v1 Announce Type: cross \nAbstract: Practitioners frequently aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For instance, in single-cell sequencing, scientists would like to learn how gene expression evolves over time. But sequencing any cell destroys that cell. So we cannot access any cell's full trajectory, but we can access snapshot samples from many cells. Stochastic differential equations are commonly used to analyze systems with full individual-trajectory access; since here we have only sample snapshots, these methods are inapplicable. The deep learning community has recently explored using Schr\\\"odinger bridges (SBs) and their extensions to estimate these dynamics. However, these methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic within the SB, which is often just set to be Brownian motion. But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model class for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a class of reference dynamics, not a single fixed one. In particular, we suggest an iterative projection method inspired by Schr\\\"odinger bridges; we alternate between learning a piecewise SB on the unobserved trajectories and using the learned SB to refine our best guess for the dynamics within the reference class. We demonstrate the advantages of our method via a well-known simulated parametric model from ecology, simulated and real data from systems biology, and real motion-capture data."}, "https://arxiv.org/abs/2108.12682": {"title": "Feature Selection in High-dimensional Spaces Using Graph-Based Methods", "link": "https://arxiv.org/abs/2108.12682", "description": "arXiv:2108.12682v3 Announce Type: replace \nAbstract: High-dimensional feature selection is a central problem in a variety of application domains such as machine learning, image analysis, and genomics. In this paper, we propose graph-based tests as a useful basis for feature selection. We describe an algorithm for selecting informative features in high-dimensional data, where each observation comes from one of $K$ different distributions. 
Our algorithm can be applied in a completely nonparametric setup without any distributional assumptions on the data, and it aims to output those features in the data that contribute the most to the overall distributional variation. At the heart of our method is the recursive application of distribution-free graph-based tests on subsets of the feature set, located at different depths of a hierarchical clustering tree constructed from the data. Our algorithm recovers all truly contributing features with high probability, while ensuring optimal control of false discoveries. We show the superior performance of our method over other existing ones on synthetic data, and demonstrate the utility of this method on several real-life datasets from the domains of climate change and biology, wherein our algorithm not only detects known features expected to be associated with the underlying process, but also discovers novel targets that can be subsequently studied."}, "https://arxiv.org/abs/2210.17121": {"title": "Powerful Spatial Multiple Testing via Borrowing Neighboring Information", "link": "https://arxiv.org/abs/2210.17121", "description": "arXiv:2210.17121v2 Announce Type: replace \nAbstract: Clustered effects are often encountered in multiple hypothesis testing of spatial signals. In this paper, we propose a new method, termed the \\textit{two-dimensional spatial multiple testing} (2d-SMT) procedure, to control the false discovery rate (FDR) and improve the detection power by exploiting the spatial information encoded in neighboring observations. The proposed method provides a novel perspective on utilizing spatial information by gathering signal patterns and spatial dependence into an auxiliary statistic. 2d-SMT rejects the null when a primary statistic at the location of interest and the auxiliary statistic constructed based on nearby observations are greater than their corresponding cutoffs. 2d-SMT can also be combined with different variants of the weighted BH procedures to improve the detection power further. A fast algorithm is developed to accelerate the search for optimal cutoffs in 2d-SMT. In theory, we establish the asymptotic FDR control of 2d-SMT under weak spatial dependence. Extensive numerical experiments demonstrate that the 2d-SMT method combined with various weighted BH procedures achieves the most competitive performance in FDR and power trade-off."}, "https://arxiv.org/abs/2310.14419": {"title": "Variable Selection and Minimax Prediction in High-dimensional Functional Linear Model", "link": "https://arxiv.org/abs/2310.14419", "description": "arXiv:2310.14419v3 Announce Type: replace \nAbstract: High-dimensional functional data have become increasingly prevalent in modern applications such as high-frequency financial data and neuroimaging data analysis. We investigate a class of high-dimensional linear regression models, where each predictor is a random element in an infinite-dimensional function space, and the number of functional predictors $p$ can potentially be ultra-high. Assuming that each of the unknown coefficient functions belongs to some reproducing kernel Hilbert space (RKHS), we regularize the fitting of the model by imposing a group elastic-net type of penalty on the RKHS norms of the coefficient functions. We show that our loss function is Gateaux sub-differentiable, and our functional elastic-net estimator exists uniquely in the product RKHS. 
Under suitable sparsity assumptions and a functional version of the irrepresentable condition, we derive a non-asymptotic tail bound for variable selection consistency of our method. Allowing the number of true functional predictors $q$ to diverge with the sample size, we also show that a post-selection refined estimator can achieve the oracle minimax optimal prediction rate. The proposed methods are illustrated through simulation studies and a real-data application from the Human Connectome Project."}, "https://arxiv.org/abs/2209.00991": {"title": "E-backtesting", "link": "https://arxiv.org/abs/2209.00991", "description": "arXiv:2209.00991v4 Announce Type: replace-cross \nAbstract: In the recent Basel Accords, the Expected Shortfall (ES) replaces the Value-at-Risk (VaR) as the standard risk measure for market risk in the banking sector, making it the most important risk measure in financial regulation. One of the most challenging tasks in risk modeling practice is to backtest ES forecasts provided by financial institutions. To design a model-free backtesting procedure for ES, we make use of the recently developed techniques of e-values and e-processes. Backtest e-statistics are introduced to formulate e-processes for risk measure forecasts, and unique forms of backtest e-statistics for VaR and ES are characterized using recent results on identification functions. For a given backtest e-statistic, a few criteria for optimally constructing the e-processes are studied. The proposed method can be naturally applied to many other risk measures and statistical quantities. We conduct extensive simulation studies and data analysis to illustrate the advantages of the model-free backtesting method, and compare it with the ones in the literature."}, "https://arxiv.org/abs/2408.06464": {"title": "Causal Graph Aided Causal Discovery in an Observational Aneurysmal Subarachnoid Hemorrhage Study", "link": "https://arxiv.org/abs/2408.06464", "description": "arXiv:2408.06464v1 Announce Type: new \nAbstract: Causal inference methods for observational data are increasingly recognized as a valuable complement to randomized clinical trials (RCTs). They can, under strong assumptions, emulate RCTs or help refine their focus. Our approach to causal inference uses causal directed acyclic graphs (DAGs). We are motivated by a concern that many observational studies in medicine begin without a clear definition of their objectives, without awareness of the scientific potential, and without tools to identify the necessary mid-course adjustments. We present and illustrate methods that provide \"midway insights\" during the study's course, identify meaningful causal questions within the study's reach, and point to the necessary database enhancements for these questions to be meaningfully tackled. The method hinges on concepts of identification and positivity. Concepts are illustrated through an analysis of data generated by patients with aneurysmal Subarachnoid Hemorrhage (aSAH) halfway through a study, focusing in particular on the consequences of external ventricular drain (EVD) in strata of the aSAH population. 
In addition, we propose a method for multicenter studies, to monitor the impact of changes in practice at an individual center's level, by leveraging principles of instrumental variable (IV) inference."}, "https://arxiv.org/abs/2408.06504": {"title": "Generative Bayesian Modeling with Implicit Priors", "link": "https://arxiv.org/abs/2408.06504", "description": "arXiv:2408.06504v1 Announce Type: new \nAbstract: Generative models are a cornerstone of Bayesian data analysis, enabling predictive simulations and model validation. However, in practice, manually specified priors often lead to unreasonable simulation outcomes, a common obstacle for full Bayesian simulations. As a remedy, we propose to add small portions of real or simulated data, which creates implicit priors that improve the stability of Bayesian simulations. We formalize this approach, providing a detailed process for constructing implicit priors and relating them to existing research in this area. We also integrate the implicit priors into simulation-based calibration, a prominent Bayesian simulation task. Through two case studies, we demonstrate that implicit priors quickly improve simulation stability and model calibration. Our findings provide practical guidelines for implementing implicit priors in various Bayesian modeling scenarios, showcasing their ability to generate more reliable simulation outcomes."}, "https://arxiv.org/abs/2408.06517": {"title": "Post-selection inference for high-dimensional mediation analysis with survival outcomes", "link": "https://arxiv.org/abs/2408.06517", "description": "arXiv:2408.06517v1 Announce Type: new \nAbstract: It is of substantial scientific interest to detect mediators that lie in the causal pathway from an exposure to a survival outcome. However, with high-dimensional mediators, as often encountered in modern genomic data settings, there is a lack of powerful methods that can provide valid post-selection inference for the identified marginal mediation effect. To resolve this challenge, we develop a post-selection inference procedure for the maximally selected natural indirect effect using a semiparametric efficient influence function approach. To this end, we establish the asymptotic normality of a stabilized one-step estimator that takes the selection of the mediator into account. Simulation studies show that our proposed method has good empirical performance. We further apply our proposed approach to a lung cancer dataset and find multiple DNA methylation CpG sites that might mediate the effect of cigarette smoking on lung cancer survival."}, "https://arxiv.org/abs/2408.06519": {"title": "An unbounded intensity model for point processes", "link": "https://arxiv.org/abs/2408.06519", "description": "arXiv:2408.06519v1 Announce Type: new \nAbstract: We develop a model for point processes on the real line, where the intensity can be locally unbounded without inducing an explosion. In contrast to an orderly point process, for which the probability of observing more than one event over a short time interval is negligible, the bursting intensity causes an extreme clustering of events around the singularity. We propose a nonparametric approach to detect such bursts in the intensity. It relies on a heavy traffic condition, which admits inference for point processes over a finite time interval. With Monte Carlo evidence, we show that our testing procedure exhibits size control under the null, whereas it has high rejection rates under the alternative. 
We implement our approach on high-frequency data for the EUR/USD spot exchange rate, where the test statistic captures abnormal surges in trading activity. We detect a nontrivial amount of intensity bursts in these data and describe their basic properties. Trading activity during an intensity burst is positively related to volatility, illiquidity, and the probability of observing a drift burst. The latter effect is reinforced if the order flow is imbalanced or the price elasticity of the limit order book is large."}, "https://arxiv.org/abs/2408.06539": {"title": "Conformal predictive intervals in survival analysis: a re-sampling approach", "link": "https://arxiv.org/abs/2408.06539", "description": "arXiv:2408.06539v1 Announce Type: new \nAbstract: The distribution-free method of conformal prediction (Vovk et al, 2005) has gained considerable attention in computer science, machine learning, and statistics. Candes et al. (2023) extended this method to right-censored survival data, addressing right-censoring complexity by creating a covariate shift setting, extracting a subcohort of subjects with censoring times exceeding a fixed threshold. Their approach only estimates the lower prediction bound for type I censoring, where all subjects have available censoring times regardless of their failure status. In medical applications, we often encounter more general right-censored data, observing only the minimum of failure time and censoring time. Subjects with observed failure times have unavailable censoring times. To address this, we propose a bootstrap method to construct one-sided as well as two-sided conformal predictive intervals for general right-censored survival data under different working regression models. Through simulations, our method demonstrates excellent average coverage for the lower bound and good coverage for the two-sided predictive interval, regardless of whether the working model is correctly specified or not, particularly under moderate censoring. We further extend the proposed method in several directions for medical applications. We apply this method to predict breast cancer patients' future survival times based on tumour characteristics and treatment."}, "https://arxiv.org/abs/2408.06612": {"title": "Double Robust high dimensional alpha test for linear factor pricing model", "link": "https://arxiv.org/abs/2408.06612", "description": "arXiv:2408.06612v1 Announce Type: new \nAbstract: In this paper, we investigate alpha testing for high-dimensional linear factor pricing models. We propose a spatial sign-based max-type test to handle sparse alternative cases. Additionally, we prove that this test is asymptotically independent of the spatial-sign-based sum-type test proposed by Liu et al. (2023). Based on this result, we introduce a Cauchy Combination test procedure that combines both the max-type and sum-type tests. Simulation studies and real data applications demonstrate that the newly proposed test procedure is robust not only to heavy-tailed distributions but also to the sparsity of the alternative hypothesis."}, "https://arxiv.org/abs/2408.06624": {"title": "Estimation and Inference of Average Treatment Effect in Percentage Points under Heterogeneity", "link": "https://arxiv.org/abs/2408.06624", "description": "arXiv:2408.06624v1 Announce Type: new \nAbstract: In semi-log regression models with heterogeneous treatment effects, the average treatment effect (ATE) in log points and its exponential transformation minus one underestimate the ATE in percentage points. 
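A quick numerical check of the opening claim in the arXiv:2408.06624 abstract above, namely that exponentiating the log-point ATE understates the percentage-point ATE under heterogeneous effects (a consequence of Jensen's inequality); the effect distribution in this Python sketch is hypothetical and is not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(2)
    beta = rng.normal(0.10, 0.30, 1_000_000)     # heterogeneous treatment effects in log points
    ate_pct = np.mean(np.exp(beta) - 1.0)        # ATE in percentage points (as a proportion)
    naive = np.exp(np.mean(beta)) - 1.0          # exponentiated log-point ATE, minus one
    print(f"mean(exp(beta) - 1) = {100 * ate_pct:.2f}%, exp(mean(beta)) - 1 = {100 * naive:.2f}%")
    # The first quantity exceeds the second whenever the effects are heterogeneous.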
I propose new estimation and inference methods for the ATE in percentage points, with inference utilizing the Fenton-Wilkinson approximation. These methods are particularly relevant for staggered difference-in-differences designs, where treatment effects often vary across groups and periods. I prove the methods' large-sample properties and demonstrate their finite-sample performance through simulations, revealing substantial discrepancies between conventional and proposed measures. Two empirical applications further underscore the practical importance of these methods."}, "https://arxiv.org/abs/2408.06739": {"title": "Considerations for missing data, outliers and transformations in permutation testing for ANOVA, ASCA(+) and related factorizations", "link": "https://arxiv.org/abs/2408.06739", "description": "arXiv:2408.06739v1 Announce Type: new \nAbstract: Multifactorial experimental designs allow us to assess the contribution of several factors, and potentially their interactions, to one or several responses of interest. Following the principles of the partition of the variance advocated by Sir R.A. Fisher, the experimental responses are factored into the quantitative contribution of main factors and interactions. A popular approach to perform this factorization in both ANOVA and ASCA(+) is through General Linear Models. Subsequently, different inferential approaches can be used to identify whether the contributions are statistically significant or not. Unfortunately, the performance of inferential approaches in terms of Type I and Type II errors can be heavily affected by missing data, outliers and/or the departure from normality of the distribution of the responses, which are commonplace problems in modern analytical experiments. In this paper, we study these problems and suggest good practices for application."}, "https://arxiv.org/abs/2408.06760": {"title": "Stratification in Randomised Clinical Trials and Analysis of Covariance: Some Simple Theory and Recommendations", "link": "https://arxiv.org/abs/2408.06760", "description": "arXiv:2408.06760v1 Announce Type: new \nAbstract: A simple device for balancing for a continuous covariate in clinical trials is to stratify by whether the covariate is above or below some target value, typically the predicted median. This raises an issue as to which model should be used for modelling the effect of treatment on the outcome variable, $Y$. Should one fit the stratum indicator, $S$, the continuous covariate, $X$, both, or neither?\n This question has been investigated in the literature using simulations targeting the overall effect on inferences about treatment. However, when a covariate is added to a model there are three consequences for inference: 1) the mean squared error effect, 2) the variance inflation factor, and 3) second-order precision. We argue that it is valuable to consider these three factors separately even if, ultimately, it is their joint effect that matters.\n We present some simple theory, concentrating in particular on the variance inflation factor, that may be used to guide trialists in their choice of model. We also consider the case where the precise form of the relationship between the outcome and the covariate is not known. 
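In the spirit of the stratification question posed in the arXiv:2408.06760 abstract above, the following Python sketch compares the standard error of the randomized treatment estimate when adjusting for the continuous covariate $X$, the stratum indicator $S$, or neither; the data-generating process is hypothetical and the sketch is an illustration, not the paper's theory.

    import numpy as np

    def treatment_se(design, y):
        # OLS standard error of the coefficient on the last column (the treatment indicator)
        coef, ssr, *_ = np.linalg.lstsq(design, y, rcond=None)
        sigma2 = ssr[0] / (design.shape[0] - design.shape[1])
        cov = sigma2 * np.linalg.inv(design.T @ design)
        return np.sqrt(cov[-1, -1])

    rng = np.random.default_rng(4)
    n = 5_000
    x = rng.normal(size=n)
    s = (x > np.median(x)).astype(float)             # stratum indicator
    t = rng.binomial(1, 0.5, n).astype(float)        # randomized treatment
    y = 1.0 + 2.0 * x + 0.5 * t + rng.normal(size=n)
    one = np.ones(n)

    for label, design in [("adjust for neither", np.column_stack([one, t])),
                          ("adjust for S", np.column_stack([one, s, t])),
                          ("adjust for X", np.column_stack([one, x, t]))]:
        print(label, treatment_se(design, y))        # adjusting for X gives the smallest SE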
We conclude by recommending that the continuous covariate should always be in the model but that, depending on circumstances, there may be some justification for fitting the stratum indicator as well."}, "https://arxiv.org/abs/2408.06769": {"title": "Joint model for interval-censored semi-competing events and longitudinal data with subject-specific within and between visits variabilities", "link": "https://arxiv.org/abs/2408.06769", "description": "arXiv:2408.06769v1 Announce Type: new \nAbstract: Dementia currently affects about 50 million people worldwide, and this number is rising. Since there is still no cure, the primary focus remains on preventing modifiable risk factors such as cardiovascular factors. It is now recognized that high blood pressure is a risk factor for dementia. An increasing number of studies suggest that blood pressure variability may also be a risk factor for dementia. However, these studies have significant methodological weaknesses and fail to distinguish between long-term and short-term variability. The aim of this work was to propose a new joint model that combines a mixed-effects model, which handles the residual variance distinguishing inter-visit variability from intra-visit variability, and an illness-death model that allows for interval censoring and semi-competing risks. A subject-specific random effect is included in the model for both variances. Risks can simultaneously depend on the current value and slope of the marker, as well as on each of the two components of the residual variance. The model estimation is performed by maximizing the likelihood function using the Marquardt-Levenberg algorithm. A simulation study validates the estimation procedure, which is implemented in an R package. The model was estimated using data from the Three-City (3C) cohort to study the impact of intra- and inter-visit blood pressure variability on the risk of dementia and death."}, "https://arxiv.org/abs/2408.06977": {"title": "Endogeneity Corrections in Binary Outcome Models with Nonlinear Transformations: Identification and Inference", "link": "https://arxiv.org/abs/2408.06977", "description": "arXiv:2408.06977v1 Announce Type: new \nAbstract: For binary outcome models, an endogeneity correction based on nonlinear rank-based transformations is proposed. Identification without external instruments is achieved under one of two assumptions: either the endogenous regressor is a nonlinear function of one component of the error term conditionally on exogenous regressors, or the dependence between the endogenous regressor and the exogenous regressors is nonlinear. Under these conditions, we prove consistency and asymptotic normality. Monte Carlo simulations and an application to German insolvency data illustrate the usefulness of the method."}, "https://arxiv.org/abs/2408.06987": {"title": "Optimal Network Pairwise Comparison", "link": "https://arxiv.org/abs/2408.06987", "description": "arXiv:2408.06987v1 Announce Type: new \nAbstract: We are interested in the problem of two-sample network hypothesis testing: given two networks with the same set of nodes, we wish to test whether the underlying Bernoulli probability matrices of the two networks are the same or not. We propose Interlacing Balance Measure (IBM) as a new two-sample testing approach. We consider the {\\it Degree-Corrected Mixed-Membership (DCMM)} model for undirected networks, where we allow severe degree heterogeneity, mixed-memberships, flexible sparsity levels, and weak signals. 
In such a broad setting, how to find a test that has a tractable limiting null and optimal testing performances is a challenging problem. We show that IBM is such a test: in a broad DCMM setting with only mild regularity conditions, IBM has $N(0,1)$ as the limiting null and achieves the optimal phase transition.\n While the above is for undirected networks, IBM is a unified approach and is directly implementable for directed networks. For a broad directed-DCMM (extension of DCMM for directed networks) setting, we show that IBM has $N(0, 1/2)$ as the limiting null and continues to achieve the optimal phase transition. We have also applied IBM to the Enron email network and a gene co-expression network, with interesting results."}, "https://arxiv.org/abs/2408.07006": {"title": "The Complexities of Differential Privacy for Survey Data", "link": "https://arxiv.org/abs/2408.07006", "description": "arXiv:2408.07006v1 Announce Type: new \nAbstract: The concept of differential privacy (DP) has gained substantial attention in recent years, most notably since the U.S. Census Bureau announced the adoption of the concept for its 2020 Decennial Census. However, despite its attractive theoretical properties, implementing DP in practice remains challenging, especially when it comes to survey data. In this paper we present some results from an ongoing project funded by the U.S. Census Bureau that is exploring the possibilities and limitations of DP for survey data. Specifically, we identify five aspects that need to be considered when adopting DP in the survey context: the multi-staged nature of data production; the limited privacy amplification from complex sampling designs; the implications of survey-weighted estimates; the weighting adjustments for nonresponse and other data deficiencies, and the imputation of missing values. We summarize the project's key findings with respect to each of these aspects and also discuss some of the challenges that still need to be addressed before DP could become the new data protection standard at statistical agencies."}, "https://arxiv.org/abs/2408.07025": {"title": "GARCH copulas and GARCH-mimicking copulas", "link": "https://arxiv.org/abs/2408.07025", "description": "arXiv:2408.07025v1 Announce Type: new \nAbstract: The bivariate copulas that describe the dependencies and partial dependencies of lagged variables in strictly stationary, first-order GARCH-type processes are investigated. It is shown that the copulas of symmetric GARCH processes are jointly symmetric but non-exchangeable, while the copulas of processes with symmetric innovation distributions and asymmetric leverage effects have weaker h-symmetry; copulas with asymmetric innovation distributions have neither form of symmetry. Since the actual copulas are typically inaccessible, due to the unknown functional forms of the marginal distributions of GARCH processes, methods of mimicking them are proposed. These rely on constructions that combine standard bivariate copulas for positive dependence with two uniformity-preserving transformations known as v-transforms. A variety of new copulas are introduced and the ones providing the best fit to simulated data from GARCH-type processes are identified. 
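To make the uniformity-preserving v-transforms mentioned in the arXiv:2408.07025 abstract above concrete, the Python sketch below checks empirically that the simplest symmetric v-transform, V(u) = |2u - 1|, maps a Uniform(0,1) variable to another Uniform(0,1) variable; this illustrates the transform only, not the proposed copula constructions.

    import numpy as np

    u = np.random.default_rng(5).uniform(size=1_000_000)
    v = np.abs(2.0 * u - 1.0)                    # simplest symmetric v-transform

    # Empirical quantiles of V(U) should match the Uniform(0,1) quantiles.
    probs = np.linspace(0.1, 0.9, 9)
    print(np.round(np.quantile(v, probs), 3))    # approximately 0.1, 0.2, ..., 0.9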
A method of constructing tractable simplified d-vines using linear v-transforms is described and shown to coincide with the vt-d-vine model when the two v-transforms are identical."}, "https://arxiv.org/abs/2408.07066": {"title": "Conformal prediction after efficiency-oriented model selection", "link": "https://arxiv.org/abs/2408.07066", "description": "arXiv:2408.07066v1 Announce Type: new \nAbstract: Given a family of pretrained models and a hold-out set, how can we construct a valid conformal prediction set while selecting a model that minimizes the width of the set? If we use the same hold-out data set both to select a model (the model that yields the smallest conformal prediction sets) and then to construct a conformal prediction set based on that selected model, we suffer a loss of coverage due to selection bias. Alternatively, we could further split the data to perform selection and calibration separately, but this comes at a steep cost if the size of the dataset is limited. In this paper, we address the challenge of constructing a valid prediction set after efficiency-oriented model selection. Our novel methods can be implemented efficiently and admit finite-sample validity guarantees without invoking additional sample-splitting. We show that our methods yield prediction sets with asymptotically optimal size under a certain notion of continuity for the model class. The improved efficiency of the prediction sets constructed by our methods is further demonstrated through applications to synthetic datasets in various settings and a real data example."}, "https://arxiv.org/abs/2003.00593": {"title": "True and false discoveries with independent and sequential e-values", "link": "https://arxiv.org/abs/2003.00593", "description": "arXiv:2003.00593v2 Announce Type: replace \nAbstract: In this paper we use e-values in the context of multiple hypothesis testing assuming that the base tests produce independent, or sequential, e-values. Our simulation and empirical studies and theoretical considerations suggest that, under this assumption, our new algorithms are superior to the known algorithms using independent p-values and to our recent algorithms designed for e-values without the assumption of independence."}, "https://arxiv.org/abs/2211.04459": {"title": "flexBART: Flexible Bayesian regression trees with categorical predictors", "link": "https://arxiv.org/abs/2211.04459", "description": "arXiv:2211.04459v3 Announce Type: replace \nAbstract: Most implementations of Bayesian additive regression trees (BART) one-hot encode categorical predictors, replacing each one with several binary indicators, one for every level or category. Regression trees built with these indicators partition the discrete set of categorical levels by repeatedly removing one level at a time. Unfortunately, the vast majority of partitions cannot be built with this strategy, severely limiting BART's ability to partially pool data across groups of levels. Motivated by analyses of baseball data and neighborhood-level crime dynamics, we overcame this limitation by re-implementing BART with regression trees that can assign multiple levels to both branches of a decision tree node. To model spatial data aggregated into small regions, we further proposed a new decision rule prior that creates spatially contiguous regions by deleting a random edge from a random spanning tree of a suitably defined network. 
Our re-implementation, which is available in the flexBART package, often yields improved out-of-sample predictive performance and scales better to larger datasets than existing implementations of BART."}, "https://arxiv.org/abs/2311.12726": {"title": "Assessing variable importance in survival analysis using machine learning", "link": "https://arxiv.org/abs/2311.12726", "description": "arXiv:2311.12726v3 Announce Type: replace \nAbstract: Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. For example, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of HIV acquisition over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to HIV acquisition are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 vaccine trial to inform enrollment strategies for future HIV vaccine trials."}, "https://arxiv.org/abs/2408.07185": {"title": "A Sparse Grid Approach for the Nonparametric Estimation of High-Dimensional Random Coefficient Models", "link": "https://arxiv.org/abs/2408.07185", "description": "arXiv:2408.07185v1 Announce Type: new \nAbstract: A severe limitation of many nonparametric estimators for random coefficient models is the exponential increase of the number of parameters in the number of random coefficients included into the model. This property, known as the curse of dimensionality, restricts the application of such estimators to models with moderately few random coefficients. This paper proposes a scalable nonparametric estimator for high-dimensional random coefficient models. The estimator uses a truncated tensor product of one-dimensional hierarchical basis functions to approximate the underlying random coefficients' distribution. Due to the truncation, the number of parameters increases at a much slower rate than in the regular tensor product basis, rendering the nonparametric estimation of high-dimensional random coefficient models feasible. The derived estimator allows estimating the underlying distribution with constrained least squares, making the approach computationally simple and fast. Monte Carlo experiments and an application to data on the regulation of air pollution illustrate the good performance of the estimator."}, "https://arxiv.org/abs/2408.07193": {"title": "A comparison of methods for estimating the average treatment effect on the treated for externally controlled trials", "link": "https://arxiv.org/abs/2408.07193", "description": "arXiv:2408.07193v1 Announce Type: new \nAbstract: While randomized trials may be the gold standard for evaluating the effectiveness of the treatment intervention, in some special circumstances, single-arm clinical trials utilizing external control may be considered. 
The causal treatment effect of interest for single-arm studies is usually the average treatment effect on the treated (ATT) rather than the average treatment effect (ATE). Although methods have been developed to estimate the ATT, the selection and use of these methods require a thorough comparison and an in-depth understanding of their advantages and disadvantages. In this study, we conducted simulations under different identifiability assumptions to compare the performance metrics (e.g., bias, standard deviation (SD), mean squared error (MSE), type I error rate) for a variety of methods, including the regression model, propensity score matching, Mahalanobis distance matching, coarsened exact matching, inverse probability weighting, augmented inverse probability weighting (AIPW), AIPW with SuperLearner, and targeted maximum likelihood estimator (TMLE) with SuperLearner.\n Our simulation results demonstrate that the doubly robust methods in general have smaller biases than other methods. In terms of SD, nonmatching methods in general have smaller SDs than matching-based methods. The MSE reflects a trade-off between bias and SD, and no method consistently performs better in terms of MSE. The identifiability assumptions are critical to the models' performance: violation of the positivity assumption can lead to a significant inflation of type I errors in some methods; violation of the unconfoundedness assumption can lead to a large bias for all methods... (Further details are available in the main body of the paper)."}, "https://arxiv.org/abs/2408.07231": {"title": "Estimating the FDR of variable selection", "link": "https://arxiv.org/abs/2408.07231", "description": "arXiv:2408.07231v1 Announce Type: new \nAbstract: We introduce a generic estimator for the false discovery rate of any model selection procedure, in common statistical modeling settings including the Gaussian linear model, Gaussian graphical model, and model-X setting. We prove that our method has a conservative (non-negative) bias in finite samples under standard statistical assumptions, and provide a bootstrap method for assessing its standard error. For methods like the Lasso, forward-stepwise regression, and the graphical Lasso, our estimator serves as a valuable companion to cross-validation, illuminating the tradeoff between prediction error and variable selection accuracy as a function of the model complexity parameter."}, "https://arxiv.org/abs/2408.07240": {"title": "Sensitivity of MCMC-based analyses to small-data removal", "link": "https://arxiv.org/abs/2408.07240", "description": "arXiv:2408.07240v1 Announce Type: new \nAbstract: If the conclusion of a data analysis is sensitive to dropping very few data points, that conclusion might hinge on the particular data at hand rather than representing a more broadly applicable truth. How could we check whether this sensitivity holds? One idea is to consider every small subset of data, drop it from the dataset, and re-run our analysis. But running MCMC to approximate a Bayesian posterior is already very expensive; running multiple times is prohibitive, and the number of re-runs needed here is combinatorially large. Recent work proposes a fast and accurate approximation to find the worst-case dropped data subset, but that work was developed for problems based on estimating equations -- and does not directly handle Bayesian posterior approximations using MCMC. We make two principal contributions in the present work. 
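Returning to the ATT estimators compared in the arXiv:2408.07193 abstract above, here is a minimal Python sketch of one of them, inverse probability weighting for the ATT, where controls receive odds weights e(x)/(1-e(x)) from a fitted propensity model; the data-generating process is hypothetical and the sketch simplifies away the study's full simulation design.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    n = 20_000
    x = rng.normal(size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))   # true propensity score
    a = rng.binomial(1, p)                                       # treatment indicator
    y = x[:, 0] + 1.0 * a + rng.normal(size=n)                   # true ATT equals 1

    e = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]    # estimated propensity score
    w = e / (1.0 - e)                                            # odds weights for controls
    att = y[a == 1].mean() - np.average(y[a == 0], weights=w[a == 0])
    print(att)                                                   # close to 1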
We adapt the existing data-dropping approximation to estimators computed via MCMC. Observing that Monte Carlo errors induce variability in the approximation, we use a variant of the bootstrap to quantify this uncertainty. We demonstrate how to use our approximation in practice to determine whether there is non-robustness in a problem. Empirically, our method is accurate in simple models, such as linear regression. In models with complicated structure, such as hierarchical models, the performance of our method is mixed."}, "https://arxiv.org/abs/2408.07365": {"title": "Fast Bayesian inference in a class of sparse linear mixed effects models", "link": "https://arxiv.org/abs/2408.07365", "description": "arXiv:2408.07365v1 Announce Type: new \nAbstract: Linear mixed effects models are widely used in statistical modelling. We consider a mixed effects model with Bayesian variable selection in the random effects using spike-and-slab priors and develop a variational Bayes inference scheme that can be applied to large data sets. An EM algorithm is proposed for the model with normal errors where the posterior distribution of the variable inclusion parameters is approximated using an Occam's window approach. Placing this approach within a variational Bayes scheme also allows the algorithm to be extended to the model with skew-t errors. The performance of the algorithm is evaluated in a simulation study, and the method is applied to a longitudinal model for elite athlete performance in the 100 metre sprint and weightlifting."}, "https://arxiv.org/abs/2408.07463": {"title": "A novel framework for quantifying nominal outlyingness", "link": "https://arxiv.org/abs/2408.07463", "description": "arXiv:2408.07463v1 Announce Type: new \nAbstract: Outlier detection is an important data mining tool that becomes particularly challenging when dealing with nominal data. First and foremost, flagging observations as outlying requires a well-defined notion of nominal outlyingness. This paper presents a definition of nominal outlyingness and introduces a general framework for quantifying outlyingness of nominal data. The proposed framework makes use of ideas from the association rule mining literature and can be used for calculating scores that indicate how outlying a nominal observation is. Methods for determining the involved hyperparameter values are presented and the concepts of variable contributions and outlyingness depth are introduced, in an attempt to enhance interpretability of the results. An implementation of the framework is tested on five real-world data sets and the key findings are outlined. The ideas presented can serve as a tool for assessing the degree to which an observation differs from the rest of the data, under the assumption that the sequences of nominal levels have been generated from a Multinomial distribution with varying event probabilities."}, "https://arxiv.org/abs/2408.07678": {"title": "Your MMM is Broken: Identification of Nonlinear and Time-varying Effects in Marketing Mix Models", "link": "https://arxiv.org/abs/2408.07678", "description": "arXiv:2408.07678v1 Announce Type: new \nAbstract: Recent years have seen a resurgence in interest in marketing mix models (MMMs), which are aggregate-level models of marketing effectiveness. Often these models incorporate nonlinear effects, and either implicitly or explicitly assume that marketing effectiveness varies over time. 
In this paper, we show that nonlinear and time-varying effects are often not identifiable from standard marketing mix data: while certain data patterns may be suggestive of nonlinear effects, such patterns may also emerge under simpler models that incorporate dynamics in marketing effectiveness. This lack of identification is problematic because nonlinearities and dynamics suggest fundamentally different optimal marketing allocations. We examine this identification issue through theory and simulations, wherein we explore the exact conditions under which conflation between the two types of models is likely to occur. In doing so, we introduce a flexible Bayesian nonparametric model that allows us to both flexibly simulate and estimate different data-generating processes. We show that conflating the two types of effects is especially likely in the presence of autocorrelated marketing variables, which are common in practice, especially given the widespread use of stock variables to capture long-run effects of advertising. We illustrate these ideas through numerous empirical applications to real-world marketing mix data, showing the prevalence of the conflation issue in practice. Finally, we show how marketers can avoid this conflation, by designing experiments that strategically manipulate spending in ways that pin down model form."}, "https://arxiv.org/abs/2408.07209": {"title": "Local linear smoothing for regression surfaces on the simplex using Dirichlet kernels", "link": "https://arxiv.org/abs/2408.07209", "description": "arXiv:2408.07209v1 Announce Type: cross \nAbstract: This paper introduces a local linear smoother for regression surfaces on the simplex. The estimator solves a least-squares regression problem weighted by a locally adaptive Dirichlet kernel, ensuring excellent boundary properties. Asymptotic results for the bias, variance, mean squared error, and mean integrated squared error are derived, generalizing the univariate results of Chen (2002). A simulation study shows that the proposed local linear estimator with Dirichlet kernel outperforms its only direct competitor in the literature, the Nadaraya-Watson estimator with Dirichlet kernel due to Bouzebda, Nezzal, and Elhattab (2024)."}, "https://arxiv.org/abs/2408.07219": {"title": "Causal Effect Estimation using identifiable Variational AutoEncoder with Latent Confounders and Post-Treatment Variables", "link": "https://arxiv.org/abs/2408.07219", "description": "arXiv:2408.07219v1 Announce Type: cross \nAbstract: Estimating causal effects from observational data is challenging, especially in the presence of latent confounders. Much work has been done on addressing this challenge, but most of the existing research ignores the bias introduced by the post-treatment variables. In this paper, we propose a novel method of joint Variational AutoEncoder (VAE) and identifiable Variational AutoEncoder (iVAE) for learning the representations of latent confounders and latent post-treatment variables from their proxy variables, termed CPTiVAE, to achieve unbiased causal effect estimation from observational data. We further prove the identifiability in terms of the representation of latent post-treatment variables. Extensive experiments on synthetic and semi-synthetic datasets demonstrate that the CPTiVAE outperforms the state-of-the-art methods in the presence of latent confounders and post-treatment variables. 
We further apply CPTiVAE to a real-world dataset to show its potential application."}, "https://arxiv.org/abs/2408.07298": {"title": "Improving the use of social contact studies in epidemic modelling", "link": "https://arxiv.org/abs/2408.07298", "description": "arXiv:2408.07298v1 Announce Type: cross \nAbstract: Social contact studies, which investigate social contact patterns in a population sample, have made an important contribution to helping epidemic models better fit real-life epidemics. A contact matrix $M$, having the \\emph{mean} number of contacts between individuals of different age groups as its elements, is estimated and used in combination with a multitype epidemic model to produce better data fitting and to give more appropriate expressions for $R_0$ and other model outcomes. However, $M$ does not capture \\emph{variation} in contacts \\emph{within} each age group, which is often large in empirical settings. Here such variation within age groups is included in a simple way by dividing each age group into two halves: the socially active and the socially less active. The extended contact matrix, and its associated epidemic model, empirically show that acknowledging variation in social activity within age groups has a substantial impact on modelling outcomes such as $R_0$ and the final fraction $\\tau$ getting infected. In fact, the variation in social activity within age groups is often more important for data fitting than the division into different age groups itself. However, a difficulty with heterogeneity in social activity is that social contact studies typically lack information on whether mixing with respect to social activity is assortative or not, i.e.\\ do the socially active tend to mix more with other socially active individuals or more with the socially less active? The analyses show that accounting for heterogeneity in social activity improves the analyses irrespective of whether such mixing is assortative or not, but the different assumptions give rather different output. Future social contact studies should hence also try to infer the degree of assortativity of contacts with respect to social activity."}, "https://arxiv.org/abs/2408.07575": {"title": "A General Framework for Constraint-based Causal Learning", "link": "https://arxiv.org/abs/2408.07575", "description": "arXiv:2408.07575v1 Announce Type: cross \nAbstract: By representing any constraint-based causal learning algorithm via a placeholder property, we decompose the correctness condition into a part relating the distribution and the true causal graph, and a part that depends solely on the distribution. This provides a general framework to obtain correctness conditions for causal learning, and has the following implications. We provide exact correctness conditions for the PC algorithm, which are then related to correctness conditions of some other existing causal discovery algorithms. We show that the sparsest Markov representation condition is the weakest correctness condition resulting from existing notions of minimality for maximal ancestral graphs and directed acyclic graphs. 
We also reason that additional knowledge beyond just Pearl-minimality is necessary for causal learning beyond faithfulness."}, "https://arxiv.org/abs/2201.00468": {"title": "A General Framework for Treatment Effect Estimation in Semi-Supervised and High Dimensional Settings", "link": "https://arxiv.org/abs/2201.00468", "description": "arXiv:2201.00468v3 Announce Type: replace \nAbstract: In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two such estimands: (a) the average treatment effect and (b) the quantile treatment effect, as prototype cases, in an SS setting, characterized by two available data sets: (i) a labeled data set of size $n$, providing observations for a response and a set of high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size $N$, much larger than $n$, but without the response observed. Using these two data sets, we develop a family of SS estimators which are ensured to be: (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement of robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency."}, "https://arxiv.org/abs/2201.10208": {"title": "Semi-Supervised Quantile Estimation: Robust and Efficient Inference in High Dimensional Settings", "link": "https://arxiv.org/abs/2201.10208", "description": "arXiv:2201.10208v2 Announce Type: replace \nAbstract: We consider quantile estimation in a semi-supervised setting, characterized by two available data sets: (i) a small or moderate sized labeled data set containing observations for a response and a set of possibly high dimensional covariates, and (ii) a much larger unlabeled data set where only the covariates are observed. We propose a family of semi-supervised estimators for the response quantile(s) based on the two data sets, to improve the estimation accuracy compared to the supervised estimator, i.e., the sample quantile from the labeled data. These estimators use a flexible imputation strategy applied to the estimating equation along with a debiasing step that allows for full robustness against misspecification of the imputation model. Further, a one-step update strategy is adopted to enable easy implementation of our method and handle the complexity from the non-linear nature of the quantile estimating equation.
Under mild assumptions, our estimators are fully robust to the choice of the nuisance imputation model, in the sense of always maintaining root-n consistency and asymptotic normality, while having improved efficiency relative to the supervised estimator. They also attain semi-parametric optimality if the relation between the response and the covariates is correctly specified via the imputation model. As an illustration of estimating the nuisance imputation function, we consider kernel smoothing type estimators on lower dimensional and possibly estimated transformations of the high dimensional covariates, and we establish novel results on their uniform convergence rates in high dimensions, involving responses indexed by a function class and usage of dimension reduction techniques. These results may be of independent interest. Numerical results on both simulated and real data confirm our semi-supervised approach's improved performance, in terms of both estimation and inference."}, "https://arxiv.org/abs/2206.01076": {"title": "Likelihood-based Inference for Random Networks with Changepoints", "link": "https://arxiv.org/abs/2206.01076", "description": "arXiv:2206.01076v3 Announce Type: replace \nAbstract: Generative, temporal network models play an important role in analyzing the dependence structure and evolution patterns of complex networks. Due to the complicated nature of real network data, it is often naive to assume that the underlying data-generative mechanism itself is invariant with time. Such observation leads to the study of changepoints or sudden shifts in the distributional structure of the evolving network. In this paper, we propose a likelihood-based methodology to detect changepoints in undirected, affine preferential attachment networks, and establish a hypothesis testing framework to detect a single changepoint, together with a consistent estimator for the changepoint. Such results require establishing consistency and asymptotic normality of the MLE under the changepoint regime, which suffers from long range dependence. The methodology is then extended to the multiple changepoint setting via both a sliding window method and a more computationally efficient score statistic. We also compare the proposed methodology with previously developed non-parametric estimators of the changepoint via simulation, and the methods developed herein are applied to modeling the popularity of a topic in a Twitter network over time."}, "https://arxiv.org/abs/2306.03073": {"title": "Inference for Local Projections", "link": "https://arxiv.org/abs/2306.03073", "description": "arXiv:2306.03073v2 Announce Type: replace \nAbstract: Inference for impulse responses estimated with local projections presents interesting challenges and opportunities. Analysts typically want to assess the precision of individual estimates, explore the dynamic evolution of the response over particular regions, and generally determine whether the impulse generates a response that is any different from the null of no effect. Each of these goals requires a different approach to inference. 
In this article, we provide an overview of results that have appeared in the literature in the past 20 years along with some new procedures that we introduce here."}, "https://arxiv.org/abs/2310.17712": {"title": "Community Detection Guarantees Using Embeddings Learned by Node2Vec", "link": "https://arxiv.org/abs/2310.17712", "description": "arXiv:2310.17712v2 Announce Type: replace-cross \nAbstract: Embedding the nodes of a large network into a Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state-of-the-art performance. With the exception of spectral clustering methods, there is little theoretical understanding of commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that the use of $k$-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate this result empirically, and examine how this relates to other embedding tools for network data."}, "https://arxiv.org/abs/2408.07738": {"title": "A discrete-time survival model to handle interval-censored covariates", "link": "https://arxiv.org/abs/2408.07738", "description": "arXiv:2408.07738v1 Announce Type: new \nAbstract: Methods are lacking to handle the problem of survival analysis in the presence of an interval-censored covariate, specifically the case in which the conditional hazard of the primary event of interest depends on the occurrence of a secondary event, the observation time of which is subject to interval censoring. We propose and study a flexible class of discrete-time parametric survival models that handle the censoring problem through joint modeling of the interval-censored secondary event, the outcome, and the censoring mechanism. We apply this model to the research question that motivated the methodology, estimating the effect of HIV status on all-cause mortality in a prospective cohort study in South Africa."}, "https://arxiv.org/abs/2408.07842": {"title": "Quantile and Distribution Treatment Effects on the Treated with Possibly Non-Continuous Outcomes", "link": "https://arxiv.org/abs/2408.07842", "description": "arXiv:2408.07842v1 Announce Type: new \nAbstract: Quantile and Distribution Treatment effects on the Treated (QTT/DTT) for non-continuous outcomes are either not identified or inference thereon is infeasible using existing methods. By introducing functional index parallel trends and no anticipation assumptions, this paper identifies and provides uniform inference procedures for QTT/DTT. The inference procedure applies under both the canonical two-group and staggered treatment designs with balanced panels, unbalanced panels, or repeated cross-sections.
Monte Carlo experiments demonstrate the proposed method's robust and competitive performance, while an empirical application illustrates its practical utility."}, "https://arxiv.org/abs/2408.08135": {"title": "Combined p-value functions for meta-analysis", "link": "https://arxiv.org/abs/2408.08135", "description": "arXiv:2408.08135v1 Announce Type: new \nAbstract: P-value functions are modern statistical tools that unify effect estimation and hypothesis testing and can provide alternative point and interval estimates compared to standard meta-analysis methods, using any of the many p-value combination procedures available (Xie et al., 2011, JASA). We provide a systematic comparison of different combination procedures, both from a theoretical perspective and through simulation. We show that many prominent p-value combination methods (e.g. Fisher's method) are not invariant to the orientation of the underlying one-sided p-values. Only Edgington's method, a lesser-known combination method based on the sum of p-values, is orientation-invariant and provides confidence intervals not restricted to be symmetric around the point estimate. Adjustments for heterogeneity can also be made and results from a simulation study indicate that the approach can compete with more standard meta-analytic methods."}, "https://arxiv.org/abs/2408.08177": {"title": "Localized Sparse Principal Component Analysis of Multivariate Time Series in Frequency Domain", "link": "https://arxiv.org/abs/2408.08177", "description": "arXiv:2408.08177v1 Announce Type: new \nAbstract: Principal component analysis has been a main tool in multivariate analysis for estimating a low dimensional linear subspace that explains most of the variability in the data. However, in high-dimensional regimes, naive estimates of the principal loadings are not consistent and difficult to interpret. In the context of time series, principal component analysis of spectral density matrices can provide valuable, parsimonious information about the behavior of the underlying process, particularly if the principal components are interpretable in that they are sparse in coordinates and localized in frequency bands. In this paper, we introduce a formulation and consistent estimation procedure for interpretable principal component analysis for high-dimensional time series in the frequency domain. An efficient frequency-sequential algorithm is developed to compute sparse-localized estimates of the low-dimensional principal subspaces of the signal process. The method is motivated by and used to understand neurological mechanisms from high-density resting-state EEG in a study of first episode psychosis."}, "https://arxiv.org/abs/2408.08200": {"title": "Analysing kinematic data from recreational runners using functional data analysis", "link": "https://arxiv.org/abs/2408.08200", "description": "arXiv:2408.08200v1 Announce Type: new \nAbstract: We present a multivariate functional mixed effects model for kinematic data from a large number of recreational runners. The runners' sagittal plane hip and knee angles are modelled jointly as a bivariate function with random effects functions used to account for the dependence among measurements from either side of the body. The model is fitted by first applying multivariate functional principal component analysis (mv-FPCA) and then modelling the mv-FPCA scores using scalar linear mixed effects models. 
Simulation and bootstrap approaches are introduced to construct simultaneous confidence bands for the fixed effects functions, and covariance functions are reconstructed to summarise the variability structure in the data and thoroughly investigate the suitability of the proposed model. In our scientific application, we observe a statistically significant effect of running speed on both the hip and knee angles. We also observe strong within-subject correlations, reflecting the highly idiosyncratic nature of running technique. Our approach is more generally applicable to modelling multiple streams of smooth kinematic or kinetic data measured repeatedly for multiple subjects in complex experimental designs."}, "https://arxiv.org/abs/2408.08259": {"title": "Incorporating Local Step-Size Adaptivity into the No-U-Turn Sampler using Gibbs Self Tuning", "link": "https://arxiv.org/abs/2408.08259", "description": "arXiv:2408.08259v1 Announce Type: new \nAbstract: Adapting the step size locally in the no-U-turn sampler (NUTS) is challenging because the step-size and path-length tuning parameters are interdependent. The determination of an optimal path length requires a predefined step size, while the ideal step size must account for errors along the selected path. Ensuring reversibility further complicates this tuning problem. In this paper, we present a method for locally adapting the step size in NUTS that is an instance of the Gibbs self-tuning (GIST) framework. Our approach guarantees reversibility with an acceptance probability that depends exclusively on the conditional distribution of the step size. We validate our step-size-adaptive NUTS method on Neal's funnel density and a high-dimensional normal distribution, demonstrating its effectiveness in challenging scenarios."}, "https://arxiv.org/abs/2408.07765": {"title": "A Bayesian Classification Trees Approach to Treatment Effect Variation with Noncompliance", "link": "https://arxiv.org/abs/2408.07765", "description": "arXiv:2408.07765v1 Announce Type: cross \nAbstract: Estimating varying treatment effects in randomized trials with noncompliance is inherently challenging since variation comes from two separate sources: variation in the impact itself and variation in the compliance rate. In this setting, existing frequentist and flexible machine learning methods are highly sensitive to the weak instruments problem, in which the compliance rate is (locally) close to zero. Bayesian approaches, on the other hand, can naturally account for noncompliance via imputation. We propose a Bayesian machine learning approach that combines the best features of both approaches. Our main methodological contribution is to present a Bayesian Causal Forest model for binary response variables in scenarios with noncompliance by repeatedly imputing individuals' compliance types, allowing us to flexibly estimate varying treatment effects among compliers. Simulation studies demonstrate the usefulness of our approach when compliance and treatment effects are heterogeneous. We apply the method to detect and analyze heterogeneity in the treatment effects in the Illinois Workplace Wellness Study, which not only features heterogeneous and one-sided compliance but also several binary outcomes of interest. We demonstrate the methodology on three outcomes one year after intervention. 
We confirm a null effect on the presence of a chronic condition, discover meaningful heterogeneity in a \"bad health\" outcome that cancels out to null in classical partial effect estimates, and find substantial heterogeneity in individuals' perception of management prioritization of health and safety."}, "https://arxiv.org/abs/1902.07447": {"title": "Eliciting ambiguity with mixing bets", "link": "https://arxiv.org/abs/1902.07447", "description": "arXiv:1902.07447v5 Announce Type: replace \nAbstract: Preferences for mixing can reveal ambiguity perception and attitude on a single event. The validity of the approach is discussed for multiple preference classes including maxmin, maxmax, variational, and smooth second-order preferences. An experimental implementation suggests that participants perceive almost as much ambiguity for the stock index and actions of other participants as for the Ellsberg urn, indicating the importance of ambiguity in real-world decision-making."}, "https://arxiv.org/abs/2009.09614": {"title": "Spillovers of Program Benefits with Missing Network Links", "link": "https://arxiv.org/abs/2009.09614", "description": "arXiv:2009.09614v3 Announce Type: replace \nAbstract: The issue of missing network links in partially observed networks is frequently neglected in empirical studies. This paper addresses this issue when investigating the spillovers of program benefits in the presence of network interactions. Our method is flexible enough to account for non-i.i.d. missing links. It relies on two network measures that can be easily constructed based on the incoming and outgoing links of the same observed network. The treatment and spillover effects can be point identified and consistently estimated if network degrees are bounded for all units. We also demonstrate the bias reduction property of our method if network degrees of some units are unbounded. Monte Carlo experiments and a naturalistic simulation on real-world network data are implemented to verify the finite-sample performance of our method. We also re-examine the spillover effects of home computer use on children's self-empowered learning."}, "https://arxiv.org/abs/2010.02297": {"title": "Testing homogeneity in dynamic discrete games in finite samples", "link": "https://arxiv.org/abs/2010.02297", "description": "arXiv:2010.02297v3 Announce Type: replace \nAbstract: The literature on dynamic discrete games often assumes that the conditional choice probabilities and the state transition probabilities are homogeneous across markets and over time. We refer to this as the \"homogeneity assumption\" in dynamic discrete games. This assumption enables empirical studies to estimate the game's structural parameters by pooling data from multiple markets and from many time periods. In this paper, we propose a hypothesis test to evaluate whether the homogeneity assumption holds in the data. Our hypothesis test is the result of an approximate randomization test, implemented via a Markov chain Monte Carlo (MCMC) algorithm. We show that our hypothesis test becomes valid as the (user-defined) number of MCMC draws diverges, for any fixed number of markets, time periods, and players. 
We apply our test to the empirical study of the U.S.\\ Portland cement industry in Ryan (2012)."}, "https://arxiv.org/abs/2208.02806": {"title": "A tree perspective on stick-breaking models in covariate-dependent mixtures", "link": "https://arxiv.org/abs/2208.02806", "description": "arXiv:2208.02806v4 Announce Type: replace \nAbstract: Stick-breaking (SB) processes are often adopted in Bayesian mixture models for generating mixing weights. When covariates influence the sizes of clusters, SB mixtures are particularly convenient as they can leverage their connection to binary regression to ease both the specification of covariate effects and posterior computation. Existing SB models are typically constructed based on continually breaking a single remaining piece of the unit stick. We view this from a dyadic tree perspective in terms of a lopsided bifurcating tree that extends only in one side. We show that two unsavory characteristics of SB models are in fact largely due to this lopsided tree structure. We consider a generalized class of SB models with alternative bifurcating tree structures and examine the influence of the underlying tree topology on the resulting Bayesian analysis in terms of prior assumptions, posterior uncertainty, and computational effectiveness. In particular, we provide evidence that a balanced tree topology, which corresponds to continually breaking all remaining pieces of the unit stick, can resolve or mitigate these undesirable properties of SB models that rely on a lopsided tree."}, "https://arxiv.org/abs/2212.10301": {"title": "Probabilistic Quantile Factor Analysis", "link": "https://arxiv.org/abs/2212.10301", "description": "arXiv:2212.10301v3 Announce Type: replace \nAbstract: This paper extends quantile factor analysis to a probabilistic variant that incorporates regularization and computationally efficient variational approximations. We establish through synthetic and real data experiments that the proposed estimator can, in many cases, achieve better accuracy than a recently proposed loss-based estimator. We contribute to the factor analysis literature by extracting new indexes of \\emph{low}, \\emph{medium}, and \\emph{high} economic policy uncertainty, as well as \\emph{loose}, \\emph{median}, and \\emph{tight} financial conditions. We show that the high uncertainty and tight financial conditions indexes have superior predictive ability for various measures of economic activity. In a high-dimensional exercise involving about 1000 daily financial series, we find that quantile factors also provide superior out-of-sample information compared to mean or median factors."}, "https://arxiv.org/abs/2305.00578": {"title": "A new clustering framework", "link": "https://arxiv.org/abs/2305.00578", "description": "arXiv:2305.00578v2 Announce Type: replace \nAbstract: Detecting clusters is a critical task in various fields, including statistics, engineering and bioinformatics. Our focus is primarily on the modern high-dimensional scenario, where traditional methods often fail due to the curse of dimensionality. In this study, we introduce a non-parametric framework for clustering that is applicable to any number of dimensions. Simulation results demonstrate that this new framework surpasses existing methods across a wide range of settings. 
We illustrate the proposed method with real data applications in distinguishing cancer tissues from normal tissues through gene expression data."}, "https://arxiv.org/abs/2401.00987": {"title": "Inverting estimating equations for causal inference on quantiles", "link": "https://arxiv.org/abs/2401.00987", "description": "arXiv:2401.00987v2 Announce Type: replace \nAbstract: The causal inference literature frequently focuses on estimating the mean of the potential outcome, whereas quantiles of the potential outcome may carry important additional information. We propose a unified approach, based on the inverse estimating equations, to generalize a class of causal inference solutions from estimating the mean of the potential outcome to its quantiles. We assume that a moment function is available to identify the mean of the threshold-transformed potential outcome, based on which a convenient construction of the estimating equation of quantiles of potential outcome is proposed. In addition, we give a general construction of the efficient influence functions of the mean and quantiles of potential outcomes, and explicate their connection. We motivate estimators for the quantile estimands with the efficient influence function, and develop their asymptotic properties when either parametric models or data-adaptive machine learners are used to estimate the nuisance functions. A broad implication of our results is that one can rework the existing result for mean causal estimands to facilitate causal inference on quantiles. Our general results are illustrated by several analytical and numerical examples."}, "https://arxiv.org/abs/2408.08481": {"title": "A Multivariate Multilevel Longitudinal Functional Model for Repeatedly Observed Human Movement Data", "link": "https://arxiv.org/abs/2408.08481", "description": "arXiv:2408.08481v1 Announce Type: new \nAbstract: Biomechanics and human movement research often involves measuring multiple kinematic or kinetic variables regularly throughout a movement, yielding data that present as smooth, multivariate, time-varying curves and are naturally amenable to functional data analysis. It is now increasingly common to record the same movement repeatedly for each individual, resulting in curves that are serially correlated and can be viewed as longitudinal functional data. We present a new approach for modelling multivariate multilevel longitudinal functional data, with application to kinematic data from recreational runners collected during a treadmill run. For each stride, the runners' hip, knee and ankle angles are modelled jointly as smooth multivariate functions that depend on subject-specific covariates. Longitudinally varying multivariate functional random effects are used to capture the dependence among adjacent strides and changes in the multivariate functions over the course of the treadmill run. A basis modelling approach is adopted to fit the model -- we represent each observation using a multivariate functional principal components basis and model the basis coefficients using scalar longitudinal mixed effects models. The predicted random effects are used to understand and visualise changes in the multivariate functional data over the course of the treadmill run. In our application, our method quantifies the effects of scalar covariates on the multivariate functional data, revealing a statistically significant effect of running speed at the hip, knee and ankle joints. 
Analysis of the predicted random effects reveals that individuals' kinematics are generally stable but certain individuals who exhibit strong changes during the run can also be identified. A simulation study is presented to demonstrate the efficacy of the proposed methodology under realistic data-generating scenarios."}, "https://arxiv.org/abs/2408.08580": {"title": "Revisiting the Many Instruments Problem using Random Matrix Theory", "link": "https://arxiv.org/abs/2408.08580", "description": "arXiv:2408.08580v1 Announce Type: new \nAbstract: We use recent results from the theory of random matrices to improve instrumental variables estimation with many instruments. In settings where the first-stage parameters are dense, we show that Ridge lowers the implicit price of a bias adjustment. This comes along with improved (finite-sample) properties in the second stage regression. Our theoretical results nest existing results on bias approximation and bias adjustment. Moreover, our results extend them to settings with more instruments than observations."}, "https://arxiv.org/abs/2408.08630": {"title": "Spatial Principal Component Analysis and Moran Statistics for Multivariate Functional Areal Data", "link": "https://arxiv.org/abs/2408.08630", "description": "arXiv:2408.08630v1 Announce Type: new \nAbstract: In this article, we present the bivariate and multivariate functional Moran's I statistics and multivariate functional areal spatial principal component analysis (mfasPCA). These methods are the first of their kind in the field of multivariate areal spatial functional data analysis. The multivariate functional Moran's I statistic is employed to assess spatial autocorrelation, while mfasPCA is utilized for dimension reduction in both univariate and multivariate functional areal data. Through simulation studies and real-world examples, we demonstrate that the multivariate functional Moran's I statistic and mfasPCA are powerful tools for evaluating spatial autocorrelation in univariate and multivariate functional areal data analysis."}, "https://arxiv.org/abs/2408.08636": {"title": "Augmented Binary Method for Basket Trials (ABBA)", "link": "https://arxiv.org/abs/2408.08636", "description": "arXiv:2408.08636v1 Announce Type: new \nAbstract: In several clinical areas, traditional clinical trials often use a responder outcome, a composite endpoint that involves dichotomising a continuous measure. An augmented binary method that improves power whilst retaining the original responder endpoint has previously been proposed. The method leverages information from the undichotomised component to improve power. We extend this method to basket trials, which are gaining popularity in many clinical areas. For clinical areas where response outcomes are used, we propose the new Augmented Binary method for BAsket trials (ABBA), which enhances efficiency by borrowing information on the treatment effect between subtrials. The method is developed within a latent variable framework using a Bayesian hierarchical modelling approach. We investigate the properties of the proposed methodology by analysing point estimates and credible intervals in various simulation scenarios, comparing them to the standard analysis for basket trials that assumes a binary outcome. Our method results in a reduction of the 95% high density interval of the posterior distribution of the log odds ratio and an increase in power when the treatment effect is consistent across subtrials.
We illustrate our approach using real data from two clinical trials in rheumatology."}, "https://arxiv.org/abs/2408.08771": {"title": "Dynamic factor analysis for sparse and irregular longitudinal data: an application to metabolite measurements in a COVID-19 study", "link": "https://arxiv.org/abs/2408.08771", "description": "arXiv:2408.08771v1 Announce Type: new \nAbstract: It is of scientific interest to identify essential biomarkers in biological processes underlying diseases to facilitate precision medicine. Factor analysis (FA) has long been used to address this goal: by assuming latent biological pathways drive the activity of measurable biomarkers, a biomarker is more influential if its absolute factor loading is larger. Although correlation between biomarkers has been properly handled under this framework, correlation between latent pathways is often overlooked, as one classical assumption in FA is the independence between factors. However, this assumption may not be realistic in the context of pathways, as existing biological knowledge suggests that pathways interact with one another rather than functioning independently. Motivated by sparsely and irregularly collected longitudinal measurements of metabolites in a COVID-19 study of large sample size, we propose a dynamic factor analysis model that can account for the potential cross-correlations between pathways, through a multi-output Gaussian process (MOGP) prior on the factor trajectories. To mitigate overfitting caused by the sparsity of longitudinal measurements, we introduce a roughness penalty upon MOGP hyperparameters and allow for non-zero mean functions. To estimate these hyperparameters, we develop a stochastic expectation maximization (StEM) algorithm that scales well to the large sample size. In our simulation studies, StEM leads, across all sample sizes considered, to a more accurate and stable estimate of the MOGP hyperparameters than a comparator algorithm used in previous research. Application to the motivating example identifies a kynurenine pathway that affects the clinical severity of patients with COVID-19. In particular, a novel biomarker, taurine, is discovered, which has been receiving increased attention clinically, yet its role was overlooked in a previous analysis."}, "https://arxiv.org/abs/2408.08388": {"title": "Classification of High-dimensional Time Series in Spectral Domain using Explainable Features", "link": "https://arxiv.org/abs/2408.08388", "description": "arXiv:2408.08388v1 Announce Type: cross \nAbstract: Interpretable classification of time series presents significant challenges in high dimensions. Traditional feature selection methods in the frequency domain often assume sparsity in spectral density matrices (SDMs) or their inverses, which can be restrictive for real-world applications. In this article, we propose a model-based approach for classifying high-dimensional stationary time series by assuming sparsity in the difference between inverse SDMs. Our approach emphasizes the interpretability of model parameters, making it especially suitable for fields like neuroscience, where understanding differences in brain network connectivity across various states is crucial. The estimators for model parameters demonstrate consistency under appropriate conditions. We further propose using standard deep learning optimizers for parameter estimation, employing techniques such as mini-batching and learning rate scheduling.
Additionally, we introduce a method to screen the most discriminatory frequencies for classification, which exhibits the sure screening property under general conditions. The flexibility of the proposed model allows the significance of covariates to vary across frequencies, enabling nuanced inferences and deeper insights into the underlying problem. The novelty of our method lies in the interpretability of the model parameters, addressing critical needs in neuroscience. The proposed approaches have been evaluated on simulated examples and the `Alert-vs-Drowsy' EEG dataset."}, "https://arxiv.org/abs/2408.08450": {"title": "Smooth and shape-constrained quantile distributed lag models", "link": "https://arxiv.org/abs/2408.08450", "description": "arXiv:2408.08450v1 Announce Type: cross \nAbstract: Exposure to environmental pollutants during the gestational period can significantly impact infant health outcomes, such as birth weight and neurological development. Identifying critical windows of susceptibility, which are specific periods during pregnancy when exposure has the most profound effects, is essential for developing targeted interventions. Distributed lag models (DLMs) are widely used in environmental epidemiology to analyze the temporal patterns of exposure and their impact on health outcomes. However, traditional DLMs focus on modeling the conditional mean, which may fail to capture heterogeneity in the relationship between predictors and the outcome. Moreover, when modeling the distribution of health outcomes like gestational birthweight, it is the extreme quantiles that are of most clinical relevance. We introduce two new quantile distributed lag model (QDLM) estimators designed to address the limitations of existing methods by leveraging smoothness and shape constraints, such as unimodality and concavity, to enhance interpretability and efficiency. We apply our QDLM estimators to the Colorado birth cohort data, demonstrating their effectiveness in identifying critical windows of susceptibility and informing public health interventions."}, "https://arxiv.org/abs/2408.08764": {"title": "Generalized logistic model for $r$ largest order statistics, with hydrological application", "link": "https://arxiv.org/abs/2408.08764", "description": "arXiv:2408.08764v1 Announce Type: cross \nAbstract: The effective use of available information in extreme value analysis is critical because extreme values are scarce. Thus, using the $r$ largest order statistics (rLOS) instead of the block maxima is encouraged. Based on the four-parameter kappa model for the rLOS (rK4D), we introduce a new distribution for the rLOS as a special case of the rK4D, namely the generalized logistic model for rLOS (rGLO). This distribution can be useful when the generalized extreme value model for rLOS is no longer efficient to capture the variability of extreme values. Moreover, the rGLO enriches the pool of candidate distributions for determining the best model to yield accurate and robust quantile estimates. We derive the joint probability density function and the marginal and conditional distribution functions of the new model. Maximum likelihood estimation, the delta method, profile likelihood, order selection by the entropy difference test, cross-validated likelihood criteria, and model averaging were considered for inference.
The usefulness and practical effectiveness of the rGLO are illustrated by the Monte Carlo simulation and an application to extreme streamflow data in Bevern Stream, UK."}, "https://arxiv.org/abs/2007.12031": {"title": "The r-largest four parameter kappa distribution", "link": "https://arxiv.org/abs/2007.12031", "description": "arXiv:2007.12031v2 Announce Type: replace \nAbstract: The generalized extreme value distribution (GEVD) has been widely used to model extreme events in many areas. It is, however, limited to using only block maxima, which motivated extending the GEVD to deal with $r$-largest order statistics (rGEVD). The rGEVD, which uses more than one extreme per block, can significantly improve the performance of the GEVD. The four parameter kappa distribution (K4D) is a generalization of some three-parameter distributions including the GEVD. It can be useful in fitting data when three parameters in the GEVD are not sufficient to capture the variability of the extreme observations. The K4D still uses only block maxima. In this study, we thus extend the K4D to deal with $r$-largest order statistics, in analogy with how the GEVD is extended to the rGEVD. The new distribution is called the $r$-largest four parameter kappa distribution (rK4D). We derive a joint probability density function (PDF) of the rK4D, and the marginal and conditional cumulative distribution functions and PDFs. The maximum likelihood method is considered to estimate parameters. The usefulness and some practical concerns of the rK4D are illustrated by applying it to Venice sea-level data. This example study shows that the rK4D gives a better fit but larger variances of the parameter estimates than the rGEVD. Some new $r$-largest distributions are derived as special cases of the rK4D, such as the $r$-largest logistic (rLD), generalized logistic (rGLD), and generalized Gumbel distributions (rGGD)."}, "https://arxiv.org/abs/2210.10768": {"title": "Anytime-valid off-policy inference for contextual bandits", "link": "https://arxiv.org/abs/2210.10768", "description": "arXiv:2210.10768v3 Announce Type: replace \nAbstract: Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as ``off-policy evaluation'' (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions form a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE.
These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and if they are, we do not need to know these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data."}, "https://arxiv.org/abs/2301.01345": {"title": "Inspecting discrepancy between multivariate distributions using half-space depth based information criteria", "link": "https://arxiv.org/abs/2301.01345", "description": "arXiv:2301.01345v3 Announce Type: replace \nAbstract: This article inspects whether a multivariate distribution is different from a specified distribution or not, and it also tests the equality of two multivariate distributions. In the course of this study, a graphical tool-kit using well-known half-space depth based information criteria is proposed, which is a two-dimensional plot, regardless of the dimension of the data, and it is even useful in comparing high-dimensional distributions. The simple interpretability of the proposed graphical tool-kit motivates us to formulate test statistics to carry out the corresponding testing of hypothesis problems. It is established that the proposed tests based on the same information criteria are consistent, and moreover, the asymptotic distributions of the test statistics under contiguous/local alternatives are derived, which enable us to compute the asymptotic power of these tests. Furthermore, it is observed that the computations associated with the proposed tests are unburdensome. Besides, these tests perform better than many other tests available in the literature when data are generated from various distributions such as heavy tailed distributions, which indicates that the proposed methodology is robust as well. Finally, the usefulness of the proposed graphical tool-kit and tests is shown on two benchmark real data sets."}, "https://arxiv.org/abs/2310.11357": {"title": "A Pseudo-likelihood Approach to Under-5 Mortality Estimation", "link": "https://arxiv.org/abs/2310.11357", "description": "arXiv:2310.11357v2 Announce Type: replace \nAbstract: Accurate and precise estimates of the under-5 mortality rate (U5MR) are an important health summary for countries. However, full survival curves allow us to better understand the pattern of mortality in children under five. Modern demographic methods for estimating a full mortality schedule for children have been developed for countries with good vital registration and reliable census data, but perform poorly in many low- and middle-income countries (LMICs). In these countries, the need to utilize nationally representative surveys to estimate the U5MR requires additional care to mitigate potential biases in survey data, acknowledge the survey design, and handle the usual characteristics of survival data, for example, censoring and truncation. In this paper, we develop parametric and non-parametric pseudo-likelihood approaches to estimating child mortality across calendar time from complex survey data. We show that the parametric approach is particularly useful in scenarios where data are sparse and parsimonious models allow efficient estimation. 
We compare a variety of parametric models to two existing methods for obtaining a full survival curve for children under the age of 5, and argue that a parametric pseudo-likelihood approach is advantageous in LMICs. We apply our proposed approaches to survey data from four LMICs."}, "https://arxiv.org/abs/2311.11256": {"title": "Bayesian Modeling of Incompatible Spatial Data: A Case Study Involving Post-Adrian Storm Forest Damage Assessment", "link": "https://arxiv.org/abs/2311.11256", "description": "arXiv:2311.11256v2 Announce Type: replace \nAbstract: Modeling incompatible spatial data, i.e., data with different spatial resolutions, is a pervasive challenge in remote sensing data analysis. Typical approaches to addressing this challenge aggregate information to a common coarse resolution, i.e., compatible resolutions, prior to modeling. Such pre-processing aggregation simplifies analysis, but potentially causes information loss and hence compromised inference and predictive performance. To avoid losing potential information provided by finer spatial resolution data and improve predictive performance, we propose a new Bayesian method that constructs a latent spatial process model at the finest spatial resolution. This model is tailored to settings where the outcome variable is measured on a coarser spatial resolution than predictor variables -- a configuration seen increasingly when high spatial resolution remotely sensed predictors are used in analysis. A key contribution of this work is an efficient algorithm that enables full Bayesian inference using finer resolution data while optimizing computational and storage costs. The proposed method is applied to a forest damage assessment for the 2018 Adrian storm in Carinthia, Austria, that uses high-resolution laser imaging detection and ranging (LiDAR) measurements and relatively coarse resolution forest inventory measurements. Extensive simulation studies demonstrate the proposed approach substantially improves inference for small prediction units."}, "https://arxiv.org/abs/2408.08908": {"title": "Panel Data Unit Root testing: Overview", "link": "https://arxiv.org/abs/2408.08908", "description": "arXiv:2408.08908v1 Announce Type: new \nAbstract: This review discusses methods of testing for a panel unit root. Modern approaches to testing in cross-sectionally correlated panels are discussed, preceded by an analysis of independent panels. In addition, methods for testing in the case of non-linearity in the data (for example, in the case of structural breaks) are presented, as well as methods for testing in short panels, when the time dimension is small and finite. In conclusion, links to existing packages that allow one to implement some of the described methods are provided."}, "https://arxiv.org/abs/2408.08990": {"title": "Adaptive Uncertainty Quantification for Generative AI", "link": "https://arxiv.org/abs/2408.08990", "description": "arXiv:2408.08990v1 Announce Type: new \nAbstract: This work is concerned with conformal prediction in contemporary applications (including generative AI) where a black-box model has been trained on data that are not accessible to the user. Mirroring split-conformal inference, we design a wrapper around a black-box algorithm which calibrates conformity scores. This calibration is local and proceeds in two stages by first adaptively partitioning the predictor space into groups and then calibrating sectionally group by group.
Adaptive partitioning (self-grouping) is achieved by fitting a robust regression tree to the conformity scores on the calibration set. This new tree variant is designed in such a way that adding a single new observation does not change the tree fit with overwhelmingly large probability. This add-one-in robustness property allows us to establish a finite sample group-conditional coverage guarantee, a refinement of the marginal guarantee. In addition, unlike traditional split-conformal inference, adaptive splitting and within-group calibration yield adaptive bands which can stretch and shrink locally. We demonstrate the benefits of local tightening on several simulated as well as real examples using non-parametric regression. Finally, we consider two contemporary classification applications for obtaining uncertainty quantification around GPT-4o predictions. We conformalize skin disease diagnoses based on self-reported symptoms as well as predicted states of U.S. legislators based on summaries of their ideology. We demonstrate substantial local tightening of the uncertainty sets while attaining similar marginal coverage."}, "https://arxiv.org/abs/2408.09008": {"title": "Approximations to worst-case data dropping: unmasking failure modes", "link": "https://arxiv.org/abs/2408.09008", "description": "arXiv:2408.09008v1 Announce Type: new \nAbstract: A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose using additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one outlier may hide or conceal the effect of another outlier. Based on the failures we identify, we provide recommendations for users and suggest directions for future improvements."}, "https://arxiv.org/abs/2408.09012": {"title": "Autoregressive models for panel data causal inference with application to state-level opioid policies", "link": "https://arxiv.org/abs/2408.09012", "description": "arXiv:2408.09012v1 Announce Type: new \nAbstract: Motivated by the study of state opioid policies, we propose a novel approach that uses autoregressive models for causal effect estimation in settings with panel data and staggered treatment adoption. Specifically, we seek to estimate the impact of key opioid-related policies by quantifying the effects of must-access prescription drug monitoring programs (PDMPs), naloxone access laws (NALs), and medical marijuana laws on opioid prescribing. Existing methods, such as differences-in-differences and synthetic controls, are challenging to apply in these types of dynamic policy landscapes where multiple policies are implemented over time and sample sizes are small. Autoregressive models are an alternative strategy that has been used to estimate policy effects in similar settings, but until this paper has lacked formal justification.
We outline a set of assumptions that tie these models to causal effects, and we study biases of estimates based on this approach when key causal assumptions are violated. In a set of simulation studies that mirror the structure of our application, we also show that our proposed estimators frequently outperform existing estimators. In short, we justify the use of autoregressive models to provide robust evidence on the effectiveness of four state policies in combating the opioid crisis."}, "https://arxiv.org/abs/2408.09096": {"title": "Dynamic linear regression models for forecasting time series with semi long memory errors", "link": "https://arxiv.org/abs/2408.09096", "description": "arXiv:2408.09096v1 Announce Type: new \nAbstract: Dynamic linear regression models forecast the values of a time series based on a linear combination of a set of exogenous time series while incorporating a time series process for the error term. This error process is often assumed to follow an autoregressive integrated moving average (ARIMA) model, or seasonal variants thereof, which are unable to capture a long-range dependency structure of the error process. We propose a novel dynamic linear regression model that incorporates the long-range dependency feature of the errors. We demonstrate that the proposed error process may (i) have a significant impact on the posterior uncertainty of the estimated regression parameters and (ii) improve the model's forecasting ability. We develop a Markov chain Monte Carlo method to fit general dynamic linear regression models based on a frequency domain approach that enables fast, asymptotically exact Bayesian inference for large datasets. We demonstrate that our approximate algorithm is faster than the traditional time domain approaches, such as the Kalman filter and the multivariate Gaussian likelihood, while retaining a high accuracy when approximating the posterior distribution. We illustrate the method in simulated examples and two energy forecasting applications."}, "https://arxiv.org/abs/2408.09155": {"title": "Learning Robust Treatment Rules for Censored Data", "link": "https://arxiv.org/abs/2408.09155", "description": "arXiv:2408.09155v1 Announce Type: new \nAbstract: There is a fast-growing literature on estimating optimal treatment rules directly by maximizing the expected outcome. In biomedical studies and operations applications, censored survival outcome is frequently observed, in which case the restricted mean survival time and survival probability are of great interest. In this paper, we propose two robust criteria for learning optimal treatment rules with censored survival outcomes; the former targets an optimal treatment rule maximizing the restricted mean survival time, where the restriction is specified by a given quantile such as the median; the latter targets an optimal treatment rule maximizing buffered survival probabilities, where the predetermined threshold is adjusted to account for the restricted mean survival time. We provide theoretical justifications for the proposed optimal treatment rules and develop a sampling-based difference-of-convex algorithm for learning them. In simulation studies, our estimators show improved performance compared to existing methods.
We also demonstrate the proposed method using AIDS clinical trial data."}, "https://arxiv.org/abs/2408.09187": {"title": "Externally Valid Selection of Experimental Sites via the k-Median Problem", "link": "https://arxiv.org/abs/2408.09187", "description": "arXiv:2408.09187v1 Announce Type: new \nAbstract: We present a decision-theoretic justification for viewing the question of how to best choose where to experiment in order to optimize external validity as a k-median (clustering) problem, a popular problem in computer science and operations research. We present conditions under which minimizing the worst-case, welfare-based regret among all nonrandom schemes that select k sites to experiment is approximately equal - and sometimes exactly equal - to finding the k most central vectors of baseline site-level covariates. The k-median problem can be formulated as a linear integer program. Two empirical applications illustrate the theoretical and computational benefits of the suggested procedure."}, "https://arxiv.org/abs/2408.09271": {"title": "Counterfactual and Synthetic Control Method: Causal Inference with Instrumented Principal Component Analysis", "link": "https://arxiv.org/abs/2408.09271", "description": "arXiv:2408.09271v1 Announce Type: new \nAbstract: The fundamental problem of causal inference lies in the absence of counterfactuals. Traditional methodologies impute the missing counterfactuals implicitly or explicitly based on untestable or overly stringent assumptions. Synthetic control method (SCM) utilizes a weighted average of control units to impute the missing counterfactual for the treated unit. Although SCM relaxes some strict assumptions, it still requires the treated unit to be inside the convex hull formed by the controls, avoiding extrapolation. In recent advances, researchers have modeled the entire data generating process (DGP) to explicitly impute the missing counterfactual. This paper expands the interactive fixed effect (IFE) model by instrumenting covariates into factor loadings, adding additional robustness. This methodology offers multiple benefits: firstly, it incorporates the strengths of previous SCM approaches, such as the relaxation of the untestable parallel trends assumption (PTA). Secondly, it does not require the targeted outcomes to be inside the convex hull formed by the controls. Thirdly, it eliminates the need for correct model specification required by the IFE model. Finally, it inherits the ability of principal component analysis (PCA) to effectively handle high-dimensional data and enhances the value extracted from numerous covariates."}, "https://arxiv.org/abs/2408.09415": {"title": "An exhaustive selection of sufficient adjustment sets for causal inference", "link": "https://arxiv.org/abs/2408.09415", "description": "arXiv:2408.09415v1 Announce Type: new \nAbstract: A subvector of predictor that satisfies the ignorability assumption, whose index set is called a sufficient adjustment set, is crucial for conducting reliable causal inference based on observational data. In this paper, we propose a general family of methods to detect all such sets for the first time in the literature, with no parametric assumptions on the outcome models and with flexible parametric and semiparametric assumptions on the predictor within the treatment groups; the latter induces desired sample-level accuracy. 
We show that the collection of sufficient adjustment sets can uniquely facilitate multiple types of studies in causal inference, including sharpening the estimation of the average causal effect and recovering fundamental connections between the outcome and the treatment hidden in the dependence structure of the predictor. These findings are illustrated by simulation studies and a real data example at the end."}, "https://arxiv.org/abs/2408.09418": {"title": "Grade of membership analysis for multi-layer categorical data", "link": "https://arxiv.org/abs/2408.09418", "description": "arXiv:2408.09418v1 Announce Type: new \nAbstract: Consider a group of individuals (subjects) participating in the same psychological tests with numerous questions (items) at different times. The observed responses can be recorded in multiple response matrices over time, named multi-layer categorical data. Assuming that each subject has a common mixed membership shared across all layers, enabling each subject to be affiliated with multiple latent classes with varying weights, the objective of the grade of membership (GoM) analysis is to estimate these mixed memberships from the data. When the test is conducted only once, the data becomes traditional single-layer categorical data. The GoM model is a popular choice for describing single-layer categorical data with a latent mixed membership structure. However, GoM cannot handle multi-layer categorical data. In this work, we propose a new model, multi-layer GoM, which extends GoM to multi-layer categorical data. To estimate the common mixed memberships, we propose a new approach, GoM-DSoG, based on a debiased sum of Gram matrices. We establish GoM-DSoG's per-subject convergence rate under the multi-layer GoM model. Our theoretical results suggest that fewer no-responses, more subjects, more items, and more layers are beneficial for GoM analysis. We also propose an approach to select the number of latent classes. Extensive experimental studies verify the theoretical findings and show GoM-DSoG's superiority over its competitors, as well as the accuracy of our method in determining the number of latent classes."}, "https://arxiv.org/abs/2408.09560": {"title": "Deep Learning for the Estimation of Heterogeneous Parameters in Discrete Choice Models", "link": "https://arxiv.org/abs/2408.09560", "description": "arXiv:2408.09560v1 Announce Type: new \nAbstract: This paper studies the finite sample performance of the flexible estimation approach of Farrell, Liang, and Misra (2021a), who propose to use deep learning for the estimation of heterogeneous parameters in economic models, in the context of discrete choice models. The approach combines the structure imposed by economic models with the flexibility of deep learning, which ensures the interpretability of results on the one hand, and allows estimating flexible functional forms of observed heterogeneity on the other hand. For inference after the estimation with deep learning, Farrell et al. (2021a) derive an influence function that can be applied to many quantities of interest. We conduct a series of Monte Carlo experiments that investigate the impact of regularization on the proposed estimation and inference procedure in the context of discrete choice models. The results show that the deep learning approach generally leads to precise estimates of the true average parameters and that regular robust standard errors lead to invalid inference results, showing the need for the influence function approach for inference.
Without regularization, the influence function approach can lead to substantial bias and large estimated standard errors caused by extreme outliers. Regularization mitigates this problem and stabilizes the estimation procedure, but at the expense of inducing an additional bias. The bias in combination with decreasing variance associated with increasing regularization leads to the construction of invalid inferential statements in our experiments. Repeated sample splitting, unlike regularization, stabilizes the estimation approach without introducing an additional bias, thereby allowing for the construction of valid inferential statements."}, "https://arxiv.org/abs/2408.09598": {"title": "Anytime-Valid Inference for Double/Debiased Machine Learning of Causal Parameters", "link": "https://arxiv.org/abs/2408.09598", "description": "arXiv:2408.09598v1 Announce Type: new \nAbstract: Double (debiased) machine learning (DML) has seen widespread use in recent years for learning causal/structural parameters, in part due to its flexibility and adaptability to high-dimensional nuisance functions as well as its ability to avoid bias from regularization or overfitting. However, the classic double-debiased framework is only valid asymptotically for a predetermined sample size, thus lacking the flexibility of collecting more data if sharper inference is needed, or stopping data collection early if useful inferences can be made earlier than expected. This can be of particular concern in large-scale experimental studies with huge financial costs or human lives at stake, as well as in observational studies where the lengths of confidence intervals do not shrink to zero even with increasing sample size due to partial identifiability of a structural parameter. In this paper, we present time-uniform counterparts to the asymptotic DML results, enabling valid inference and confidence intervals for structural parameters to be constructed at any arbitrary (possibly data-dependent) stopping time. We provide conditions which are only slightly stronger than the standard DML conditions, but offer the stronger guarantee of anytime-valid inference. This facilitates the transformation of any existing DML method to provide anytime-valid guarantees with minimal modifications, making it highly adaptable and easy to use. We illustrate our procedure using two instances: a) local average treatment effect in online experiments with non-compliance, and b) partial identification of the average treatment effect in observational studies with potential unmeasured confounding."}, "https://arxiv.org/abs/2408.09607": {"title": "Experimental Design For Causal Inference Through An Optimization Lens", "link": "https://arxiv.org/abs/2408.09607", "description": "arXiv:2408.09607v1 Announce Type: new \nAbstract: The study of experimental design offers tremendous benefits for answering causal questions across a wide range of applications, including agricultural experiments, clinical trials, industrial experiments, social experiments, and digital experiments. Although valuable in such applications, the costs of experiments often drive experimenters to seek more efficient designs. Recently, experimenters have started to examine such efficiency questions from an optimization perspective, as experimental design problems are fundamentally decision-making problems. This perspective offers a lot of flexibility in leveraging various existing optimization tools to study experimental design problems.
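For orientation, here is a generic cross-fitted double/debiased machine learning (AIPW) estimate of an average treatment effect, the fixed-sample-size baseline that the anytime-valid paper above extends. This is a sketch under a made-up data-generating process; the nuisance learners, clipping level, and fold count are arbitrary, and nothing here implements the time-uniform confidence sequences.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import KFold

def dml_ate(Y, D, X, n_folds=5, seed=0):
    """Cross-fitted AIPW/DML estimate of the ATE with random-forest nuisances."""
    n = len(Y)
    psi = np.zeros(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        m1 = RandomForestRegressor(random_state=seed).fit(X[train][D[train] == 1], Y[train][D[train] == 1])
        m0 = RandomForestRegressor(random_state=seed).fit(X[train][D[train] == 0], Y[train][D[train] == 0])
        ps = RandomForestClassifier(random_state=seed).fit(X[train], D[train])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)      # propensity scores
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        psi[test] = (mu1 - mu0
                     + D[test] * (Y[test] - mu1) / e
                     - (1 - D[test]) * (Y[test] - mu0) / (1 - e))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)                   # estimate and std. error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 0.5 * D + X[:, 0] + rng.normal(size=500)
print(dml_ate(Y, D, X))
```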
This manuscript thus aims to examine the foundations of experimental design problems in the context of causal inference as viewed through an optimization lens."}, "https://arxiv.org/abs/2408.09619": {"title": "Statistical Inference for Regression with Imputed Binary Covariates with Application to Emotion Recognition", "link": "https://arxiv.org/abs/2408.09619", "description": "arXiv:2408.09619v1 Announce Type: new \nAbstract: In the flourishing live streaming industry, accurate recognition of streamers' emotions has become a critical research focus, with profound implications for audience engagement and content optimization. However, precise emotion coding typically requires manual annotation by trained experts, making it extremely expensive and time-consuming to obtain complete observational data for large-scale studies. Motivated by this challenge in streamer emotion recognition, we develop here a novel imputation method together with a principled statistical inference procedure for analyzing partially observed binary data. Specifically, we assume for each observation an auxiliary feature vector, which is sufficiently cheap to be fully collected for the whole sample. We next assume a small pilot sample with both the target binary covariates (i.e., the emotion status) and the auxiliary features fully observed, of which the size could be considerably smaller than that of the whole sample. Thereafter, a regression model can be constructed for the target binary covariates and the auxiliary features. This enables us to impute the missing binary features using the fully observed auxiliary features for the entire sample. We establish the associated asymptotic theory for principled statistical inference and present extensive simulation experiments, demonstrating the effectiveness and theoretical soundness of our proposed method. Furthermore, we validate our approach using a comprehensive dataset on emotion recognition in live streaming, demonstrating that our imputation method yields smaller standard errors and is more statistically efficient than using pilot data only. Our findings have significant implications for enhancing user experience and optimizing engagement on streaming platforms."}, "https://arxiv.org/abs/2408.09631": {"title": "Penalized Likelihood Approach for the Four-parameter Kappa Distribution", "link": "https://arxiv.org/abs/2408.09631", "description": "arXiv:2408.09631v1 Announce Type: new \nAbstract: The four-parameter kappa distribution (K4D) is a generalized form of some commonly used distributions such as generalized logistic, generalized Pareto, generalized Gumbel, and generalized extreme value (GEV) distributions. Owing to its flexibility, the K4D is widely applied in modeling in several fields such as hydrology and climatic change. For the estimation of the four parameters, the maximum likelihood approach and the method of L-moments are usually employed. The L-moment estimator (LME) method works well for some parameter spaces, with up to a moderate sample size, but it is sometimes not feasible in terms of computing the appropriate estimates. Meanwhile, the maximum likelihood estimator (MLE) is optimal for large samples and applicable to a very wide range of situations, including non-stationary data. However, using the MLE of K4D with small sample sizes shows substantially poor performance in terms of a large variance of the estimator. 
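The pilot-sample imputation idea in the emotion-recognition abstract above can be mimicked in a few lines: fit a model of the expensive binary covariate on cheap auxiliary features using the pilot subsample, then impute it for the rest of the sample. The sketch below is a toy with simulated data and hypothetical names; the paper's downstream regression and its asymptotic inference corrections are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, n_pilot = 5000, 300
Z = rng.normal(size=(n, 4))                        # cheap auxiliary features, fully observed
true_beta = np.array([1.5, -1.0, 0.5, 0.0])
emotion = rng.binomial(1, 1 / (1 + np.exp(-Z @ true_beta)))   # expensive binary covariate

# Only a small pilot subsample has the expensive label coded by experts.
pilot = rng.choice(n, n_pilot, replace=False)
clf = LogisticRegression().fit(Z[pilot], emotion[pilot])

# Impute (here: predicted probabilities) for the remaining observations,
# then carry the imputed covariate into the downstream regression.
emotion_hat = clf.predict_proba(Z)[:, 1]
emotion_hat[pilot] = emotion[pilot]                # keep observed values where available
print(emotion_hat[:5])
```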
We therefore propose a maximum penalized likelihood estimation (MPLE) of K4D by adjusting the existing penalty functions that restrict the parameter space. Eighteen combinations of penalties for two shape parameters are considered and compared. The MPLE retains modeling flexibility and large sample optimality while also improving on small sample properties. The properties of the proposed estimator are verified through a Monte Carlo simulation, and an application is demonstrated using Thailand's annual maximum temperature data. Based on this study, we suggest using combinations of penalty functions in general."}, "https://arxiv.org/abs/2408.09634": {"title": "Branch and Bound to Assess Stability of Regression Coefficients in Uncertain Models", "link": "https://arxiv.org/abs/2408.09634", "description": "arXiv:2408.09634v1 Announce Type: new \nAbstract: It can be difficult to interpret a coefficient of an uncertain model. A slope coefficient of a regression model may change as covariates are added or removed from the model. In the context of high-dimensional data, there are too many model extensions to check. However, as we show here, it is possible to efficiently search, with a branch and bound algorithm, for maximum and minimum values of that adjusted slope coefficient over a discrete space of regularized regression models. Here we introduce our algorithm, along with supporting mathematical results, an example application, and a link to our computer code, to help researchers summarize high-dimensional data and assess the stability of regression coefficients in uncertain models."}, "https://arxiv.org/abs/2408.09755": {"title": "Ensemble Prediction via Covariate-dependent Stacking", "link": "https://arxiv.org/abs/2408.09755", "description": "arXiv:2408.09755v1 Announce Type: new \nAbstract: This paper presents a novel approach to ensemble prediction called \"Covariate-dependent Stacking\" (CDST). Unlike traditional stacking methods, CDST allows model weights to vary flexibly as a function of covariates, thereby enhancing predictive performance in complex scenarios. We formulate the covariate-dependent weights through combinations of basis functions, estimate them by optimizing a cross-validation criterion, and develop an Expectation-Maximization algorithm, ensuring computational efficiency. To analyze the theoretical properties, we establish an oracle inequality regarding the expected loss to be minimized for estimating model weights. Through comprehensive simulation studies and an application to large-scale land price prediction, we demonstrate that CDST consistently outperforms conventional model averaging methods, particularly on datasets where some models fail to capture the underlying complexity. Our findings suggest that CDST is especially valuable for, but not limited to, spatio-temporal prediction problems, offering a powerful tool for researchers and practitioners in various fields of data analysis."}, "https://arxiv.org/abs/2408.09760": {"title": "Regional and spatial dependence of poverty factors in Thailand, and its use into Bayesian hierarchical regression analysis", "link": "https://arxiv.org/abs/2408.09760", "description": "arXiv:2408.09760v1 Announce Type: new \nAbstract: Poverty is a serious issue that harms humanity's progress. The simplest solution is to use a one-size-fits-all policy to alleviate it. Nevertheless, each region has its own unique issues, which require tailored solutions.
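To make the covariate-dependent stacking idea concrete, the sketch below parameterizes the model weights as a softmax of basis functions of a covariate and fits them by direct numerical minimization of a squared-error criterion on held-out-style predictions. The basis, the stand-in base-model predictions, and the optimizer are assumptions for illustration; the paper's cross-validation scheme and EM algorithm are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def cdst_weights(preds_cv, y, basis):
    """Fit covariate-dependent stacking weights w_k(x) = softmax(B(x) @ theta[:, k])
    by minimizing squared error of the weighted ensemble prediction."""
    n, K = preds_cv.shape
    J = basis.shape[1]

    def loss(theta_flat):
        theta = theta_flat.reshape(J, K)
        w = softmax(basis @ theta)                 # (n, K) weights summing to one per row
        yhat = (w * preds_cv).sum(axis=1)
        return np.mean((y - yhat) ** 2)

    res = minimize(loss, np.zeros(J * K), method="L-BFGS-B")
    return res.x.reshape(J, K)

# toy usage: two base models, intercept plus one covariate as basis functions
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 400)
y = np.where(x > 0, 2 * x, -x) + 0.1 * rng.normal(size=400)
preds = np.column_stack([2 * x, -x])               # stand-in out-of-fold predictions
B = np.column_stack([np.ones_like(x), x])
print(cdst_weights(preds, y, B))
```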
From a spatial analysis perspective, neighboring regions can provide useful information for analyzing the issues of a given region. In this work, we propose inferred boundaries of regions of Thailand that explain the poverty dynamics better than the usual government administrative regions do. The proposed regions maximize a trade-off between poverty-related features and geographical coherence. We use a spatial analysis together with Moran's cluster algorithms and Bayesian hierarchical regression models, with the potential to assist the implementation of the right policy to alleviate poverty. We found that all variables considered show positive spatial autocorrelation. The results of the analysis illustrate that 1) Northern, Northeastern Thailand, and to a lesser extent Northcentral Thailand are the regions that require more attention with respect to poverty issues, 2) Northcentral, Northeastern, Northern and Southern Thailand present dramatically low levels of education, income and amount of savings contrasted with large cities such as Bangkok-Pattaya and Central Thailand, and 3) Bangkok-Pattaya is the only region whose average years of education is above 12 years, which corresponds approximately to a completed senior high school education."}, "https://arxiv.org/abs/2408.09770": {"title": "Shift-Dispersion Decompositions of Wasserstein and Cram\\'er Distances", "link": "https://arxiv.org/abs/2408.09770", "description": "arXiv:2408.09770v1 Announce Type: new \nAbstract: Divergence functions are measures of distance or dissimilarity between probability distributions that serve various purposes in statistics and applications. We propose decompositions of Wasserstein and Cram\\'er distances$-$which compare two distributions by integrating over their differences in distribution or quantile functions$-$into directed shift and dispersion components. These components are obtained by dividing the differences between the quantile functions into contributions arising from shift and dispersion, respectively. Our decompositions add information on how the distributions differ in a condensed form and consequently enhance the interpretability of the underlying divergences. We show that our decompositions satisfy a number of natural properties and are unique in doing so in location-scale families. The decompositions allow one to derive sensitivities of the divergence measures to changes in location and dispersion, and they give rise to weak stochastic order relations that are linked to the usual stochastic and the dispersive order. Our theoretical developments are illustrated in two applications, where we focus on forecast evaluation of temperature extremes and on the design of probabilistic surveys in economics."}, "https://arxiv.org/abs/2408.09868": {"title": "Weak instruments in multivariable Mendelian randomization: methods and practice", "link": "https://arxiv.org/abs/2408.09868", "description": "arXiv:2408.09868v1 Announce Type: new \nAbstract: The method of multivariable Mendelian randomization uses genetic variants to instrument multiple exposures, to estimate the effect that a given exposure has on an outcome conditional on all other exposures included in a linear model. Unfortunately, the inclusion of every additional exposure makes a weak instruments problem more likely, because we require conditionally strong genetic predictors of each exposure. This issue is well appreciated in practice, with different versions of F-statistics routinely reported as measures of instrument strength.
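Since the poverty study above reports positive spatial autocorrelation, a brief reminder of how a global Moran's I statistic is commonly computed may help; the toy adjacency matrix below is an assumption for illustration, and this is not the paper's Moran-based clustering or Bayesian hierarchical model.

```python
import numpy as np

def morans_I(x, W):
    """Global Moran's I for values x and a spatial weight matrix W (zero diagonal)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    n = len(x)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# toy usage: 4 regions on a line, rook-style adjacency
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_I([1.0, 1.2, 3.0, 3.1], W))      # positive value: similar neighbors cluster
```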
Less transparently, however, these F-statistics are sometimes used to guide instrument selection, and even to decide whether to report empirical results. Rather than discarding findings with low F-statistics, weak instrument-robust methods can provide valid inference under weak instruments. For multivariable Mendelian randomization with two-sample summary data, we encourage use of the inference strategy of Andrews (2018) that reports both robust and non-robust confidence sets, along with a statistic that measures how reliable the non-robust confidence set is in terms of coverage. We also propose a novel adjusted-Kleibergen statistic that corrects for overdispersion heterogeneity in genetic associations with the outcome."}, "https://arxiv.org/abs/2408.09876": {"title": "Improving Genomic Prediction using High-dimensional Secondary Phenotypes: the Genetic Latent Factor Approach", "link": "https://arxiv.org/abs/2408.09876", "description": "arXiv:2408.09876v1 Announce Type: new \nAbstract: Decreasing costs and new technologies have led to an increase in the amount of data available to plant breeding programs. High-throughput phenotyping (HTP) platforms routinely generate high-dimensional datasets of secondary features that may be used to improve genomic prediction accuracy. However, integration of this data comes with challenges such as multicollinearity, parameter estimation in $p > n$ settings, and the computational complexity of many standard approaches. Several methods have emerged to analyze such data, but interpretation of model parameters often remains challenging.\n We propose genetic factor best linear unbiased prediction (gfBLUP), a seven-step prediction pipeline that reduces the dimensionality of the original secondary HTP data using generative factor analysis. In short, gfBLUP uses redundancy filtered and regularized genetic and residual correlation matrices to fit a maximum likelihood factor model and estimate genetic latent factor scores. These latent factors are subsequently used in multi-trait genomic prediction. Our approach performs on par or better than alternatives in extensive simulations and a real-world application, while producing easily interpretable and biologically relevant parameters. We discuss several possible extensions and highlight gfBLUP as the basis for a flexible and modular multi-trait genomic prediction framework."}, "https://arxiv.org/abs/2408.10091": {"title": "Non-Plug-In Estimators Could Outperform Plug-In Estimators: a Cautionary Note and a Diagnosis", "link": "https://arxiv.org/abs/2408.10091", "description": "arXiv:2408.10091v1 Announce Type: new \nAbstract: Objectives: Highly flexible nonparametric estimators have gained popularity in causal inference and epidemiology. Popular examples of such estimators include targeted maximum likelihood estimators (TMLE) and double machine learning (DML). TMLE is often argued or suggested to be better than DML estimators and several other estimators in small to moderate samples -- even if they share the same large-sample properties -- because TMLE is a plug-in estimator and respects the known bounds on the parameter, while other estimators might fall outside the known bounds and yield absurd estimates. However, this argument is not a rigorously proven result and may fail in certain cases. Methods: In a carefully chosen simulation setting, I compare the performance of several versions of TMLE and DML estimators of the average treatment effect among treated in small to moderate samples. 
Results: In this simulation setting, DML estimators outperform some versions of TMLE in small samples. TMLE fluctuations are unstable, and hence empirically checking the magnitude of the TMLE fluctuation might flag cases where TMLE might perform poorly. Conclusions: As a plug-in estimator, TMLE is not guaranteed to outperform non-plug-in counterparts such as DML estimators in small samples. Checking the fluctuation magnitude might be a useful diagnosis for TMLE. More rigorous theoretical justification is needed to understand and compare the finite-sample performance of these highly flexible estimators in general."}, "https://arxiv.org/abs/2408.10142": {"title": "Insights of the Intersection of Phase-Type Distributions and Positive Systems", "link": "https://arxiv.org/abs/2408.10142", "description": "arXiv:2408.10142v1 Announce Type: new \nAbstract: In this paper, we consider the relationship between phase-type distributions and positive systems through practical examples. Phase-type distributions, commonly used in modelling dynamic systems, represent the temporal evolution of a set of variables based on their phase. On the other hand, positive systems, prevalent in a wide range of disciplines, are those where the involved variables maintain non-negative values over time. Through some examples, we demonstrate how phase-type distributions can be useful in describing and analyzing positive systems, providing a perspective on their dynamic behavior. Our main objective is to establish clear connections between these seemingly different concepts, highlighting their relevance and utility in various fields of study. The findings presented here contribute to a better understanding of the interaction between phase-type distribution theory and positive system theory, opening new opportunities for future research in this exciting interdisciplinary field."}, "https://arxiv.org/abs/2408.10149": {"title": "A non-parametric U-statistic testing approach for multi-arm clinical trials with multivariate longitudinal data", "link": "https://arxiv.org/abs/2408.10149", "description": "arXiv:2408.10149v1 Announce Type: new \nAbstract: Randomized clinical trials (RCTs) often involve multiple longitudinal primary outcomes to comprehensively assess treatment efficacy. The Longitudinal Rank-Sum Test (LRST), a robust U-statistics-based, non-parametric, rank-based method, effectively controls Type I error and enhances statistical power by leveraging the temporal structure of the data without relying on distributional assumptions. However, the LRST is limited to two-arm comparisons. To address the need for comparing multiple doses against a control group in many RCTs, we extend the LRST to a multi-arm setting. This novel multi-arm LRST provides a flexible and powerful approach for evaluating treatment efficacy across multiple arms and outcomes, with a strong capability for detecting the most effective dose in multi-arm trials. Extensive simulations demonstrate that this method maintains excellent Type I error control while providing greater power compared to the two-arm LRST with multiplicity adjustments.
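As a small companion to the phase-type abstract above, the sketch below evaluates the standard phase-type CDF, F(t) = 1 - alpha exp(S t) 1, for an Erlang distribution written in phase-type form; the example parameters are arbitrary.

```python
import numpy as np
from scipy.linalg import expm

def phase_type_cdf(t, alpha, S):
    """CDF of a phase-type distribution: F(t) = 1 - alpha @ expm(S t) @ 1,
    where alpha is the initial distribution over transient states and S the subgenerator."""
    ones = np.ones(S.shape[0])
    return 1.0 - alpha @ expm(S * t) @ ones

# toy usage: an Erlang(2, rate = 1.5) distribution in phase-type form
alpha = np.array([1.0, 0.0])
S = np.array([[-1.5, 1.5],
              [0.0, -1.5]])
print([round(phase_type_cdf(t, alpha, S), 4) for t in (0.5, 1.0, 2.0)])
```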
Application to the Bapineuzumab (Bapi) 301 trial further validates the multi-arm LRST's practical utility and robustness, confirming its efficacy in complex clinical trial analyses."}, "https://arxiv.org/abs/2408.09060": {"title": "[Invited Discussion] Randomization Tests to Address Disruptions in Clinical Trials: A Report from the NISS Ingram Olkin Forum Series on Unplanned Clinical Trial Disruptions", "link": "https://arxiv.org/abs/2408.09060", "description": "arXiv:2408.09060v1 Announce Type: cross \nAbstract: Disruptions in clinical trials may be due to external events like pandemics, warfare, and natural disasters. Resulting complications may lead to unforeseen intercurrent events (events that occur after treatment initiation and affect the interpretation of the clinical question of interest or the existence of the measurements associated with it). In Uschner et al. (2023), several example clinical trial disruptions are described: treatment effect drift, population shift, change of care, change of data collection, and change of availability of study medication. A complex randomized controlled trial (RCT) setting with (planned or unplanned) intercurrent events is then described, and randomization tests are presented as a means for non-parametric inference that is robust to violations of assumption typically made in clinical trials. While estimation methods like Targeted Learning (TL) are valid in such settings, we do not see where the authors make the case that one should be going for a randomization test in such disrupted RCTs. In this discussion, we comment on the appropriateness of TL and the accompanying TL Roadmap in the context of disrupted clinical trials. We highlight a few key articles related to the broad applicability of TL for RCTs and real-world data (RWD) analyses with intercurrent events. We begin by introducing TL and motivating its utility in Section 2, and then in Section 3 we provide a brief overview of the TL Roadmap. In Section 4 we recite the example clinical trial disruptions presented in Uschner et al. (2023), discussing considerations and solutions based on the principles of TL. We request in an authors' rejoinder a clear theoretical demonstration with specific examples in this setting that a randomization test is the only valid inferential method relative to one based on following the TL Roadmap."}, "https://arxiv.org/abs/2408.09185": {"title": "Method of Moments Estimation for Affine Stochastic Volatility Models", "link": "https://arxiv.org/abs/2408.09185", "description": "arXiv:2408.09185v1 Announce Type: cross \nAbstract: We develop moment estimators for the parameters of affine stochastic volatility models. We first address the challenge of calculating moments for the models by introducing a recursive equation for deriving closed-form expressions for moments of any order. Consequently, we propose our moment estimators. We then establish a central limit theorem for our estimators and derive the explicit formulas for the asymptotic covariance matrix. Finally, we provide numerical results to validate our method."}, "https://arxiv.org/abs/2408.09532": {"title": "Deep Limit Model-free Prediction in Regression", "link": "https://arxiv.org/abs/2408.09532", "description": "arXiv:2408.09532v1 Announce Type: cross \nAbstract: In this paper, we provide a novel Model-free approach based on Deep Neural Network (DNN) to accomplish point prediction and prediction interval under a general regression setting. 
Usually, people rely on parametric or non-parametric models to bridge dependent and independent variables (Y and X). However, this classical method relies heavily on the correct model specification. Even for the non-parametric approach, some additive form is often assumed. A newly proposed Model-free prediction principle sheds light on a prediction procedure without any model assumption. Previous work regarding this principle has shown better performance than other standard alternatives. Recently, DNN, one of the machine learning methods, has received increasing attention due to its great performance in practice. Guided by the Model-free prediction idea, we attempt to apply a fully connected forward DNN to map X and some appropriate reference random variable Z to Y. The targeted DNN is trained by minimizing a specially designed loss function so that the randomness of Y conditional on X is outsourced to Z through the trained DNN. Our method is more stable and accurate compared to other DNN-based counterparts, especially for optimal point predictions. With a specific prediction procedure, our prediction interval can capture the estimation variability so that it can render a better coverage rate for finite sample cases. The superior performance of our method is verified by simulation and empirical studies."}, "https://arxiv.org/abs/2408.09537": {"title": "Sample-Optimal Large-Scale Optimal Subset Selection", "link": "https://arxiv.org/abs/2408.09537", "description": "arXiv:2408.09537v1 Announce Type: cross \nAbstract: Ranking and selection (R&S) conventionally aims to select the unique best alternative with the largest mean performance from a finite set of alternatives. However, to better support decision making, it may be more informative to deliver a small menu of alternatives whose mean performances are among the top $m$. Such a problem, called optimal subset selection (OSS), is generally more challenging to address than the conventional R&S. This challenge becomes even more significant when the number of alternatives is considerably large. Thus, the focus of this paper is on addressing the large-scale OSS problem. To achieve this goal, we design a top-$m$ greedy selection mechanism that keeps sampling the current top $m$ alternatives with the top $m$ running sample means and propose the explore-first top-$m$ greedy (EFG-$m$) procedure. Through an extended boundary-crossing framework, we prove that the EFG-$m$ procedure is both sample optimal and consistent in terms of the probability of good selection, confirming its effectiveness in solving the large-scale OSS problem. Surprisingly, we also demonstrate that the EFG-$m$ procedure achieves an indifference-based ranking within the selected subset of alternatives at no extra cost. This is highly beneficial as it delivers deeper insights to decision-makers, enabling more informed decision-making. Lastly, numerical experiments validate our results and demonstrate the efficiency of our procedures."}, "https://arxiv.org/abs/2408.09582": {"title": "A Likelihood-Free Approach to Goal-Oriented Bayesian Optimal Experimental Design", "link": "https://arxiv.org/abs/2408.09582", "description": "arXiv:2408.09582v1 Announce Type: cross \nAbstract: Conventional Bayesian optimal experimental design seeks to maximize the expected information gain (EIG) on model parameters. However, the end goal of the experiment often is not to learn the model parameters, but to predict downstream quantities of interest (QoIs) that depend on the learned parameters.
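A stripped-down version of the explore-first top-m greedy idea from the subset-selection abstract above: sample every alternative a fixed number of times, then repeatedly sample the m alternatives with the largest running sample means. The simulation below uses Gaussian noise and arbitrary budgets, and omits the paper's boundary-crossing analysis and indifference-based ranking.

```python
import numpy as np

def efg_m(means, m, n0=10, budget=20000, seed=0):
    """Illustrative explore-first top-m greedy selection with Gaussian observations."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.full(k, n0, dtype=float)
    sums = np.array([rng.normal(mu, 1.0, n0).sum() for mu in means])   # exploration phase
    rounds = (budget - k * n0) // m
    for _ in range(rounds):
        top = np.argsort(sums / counts)[-m:]       # current top-m by running sample mean
        for i in top:                              # greedily sample each of them once
            sums[i] += rng.normal(means[i], 1.0)
            counts[i] += 1
    return np.argsort(sums / counts)[-m:]          # final selected subset

means = np.linspace(0.0, 1.0, 50)                  # 50 alternatives; the best are the last ones
print(sorted(efg_m(means, m=5)))
```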
Moreover, designs that offer high EIG for parameters may not translate to high EIG for QoIs. Goal-oriented optimal experimental design (GO-OED) thus directly targets maximizing the EIG of QoIs.\n We introduce LF-GO-OED (likelihood-free goal-oriented optimal experimental design), a computational method for conducting GO-OED with nonlinear observation and prediction models. LF-GO-OED is specifically designed to accommodate implicit models, where the likelihood is intractable. In particular, it builds a density ratio estimator from samples generated from approximate Bayesian computation (ABC), thereby sidestepping the need for likelihood evaluations or density estimations. The overall method is validated on benchmark problems against existing methods, and demonstrated on scientific applications in epidemiology and neuroscience."}, "https://arxiv.org/abs/2408.09618": {"title": "kendallknight: Efficient Implementation of Kendall's Correlation Coefficient Computation", "link": "https://arxiv.org/abs/2408.09618", "description": "arXiv:2408.09618v1 Announce Type: cross \nAbstract: The kendallknight package introduces an efficient implementation of Kendall's correlation coefficient computation, significantly improving the processing time for large datasets without sacrificing accuracy. The kendallknight package, following Knight (1966) and subsequent literature, reduces the computational complexity, resulting in drastic reductions in computation time, transforming operations that would take minutes or hours into milliseconds or minutes, while maintaining precision and correctly handling edge cases and errors. The package is particularly advantageous in econometric and statistical contexts where rapid and accurate calculation of Kendall's correlation coefficient is desirable. Benchmarks demonstrate substantial performance gains over the base R implementation, especially for large datasets."}, "https://arxiv.org/abs/2408.10136": {"title": "Robust spectral clustering with rank statistics", "link": "https://arxiv.org/abs/2408.10136", "description": "arXiv:2408.10136v1 Announce Type: cross \nAbstract: This paper analyzes the statistical performance of a robust spectral clustering method for latent structure recovery in noisy data matrices. We consider eigenvector-based clustering applied to a matrix of nonparametric rank statistics that is derived entrywise from the raw, original data matrix. This approach is robust in the sense that, unlike traditional spectral clustering procedures, it can provably recover population-level latent block structure even when the observed data matrix includes heavy-tailed entries and has a heterogeneous variance profile.\n Our main theoretical contributions are threefold and hold under flexible data generating conditions. First, we establish that robust spectral clustering with rank statistics can consistently recover latent block structure, viewed as communities of nodes in a graph, in the sense that unobserved community memberships for all but a vanishing fraction of nodes are correctly recovered with high probability when the data matrix is large. Second, we refine the former result and further establish that, under certain conditions, the community membership of any individual, specified node of interest can be asymptotically exactly recovered with probability tending to one in the large-data limit.
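The kendallknight abstract above refers to Knight's O(n log n) approach; the sketch below shows the core idea for the tie-free case, computing Kendall's tau by sorting on x and counting inversions in y with a merge sort. Tie handling and the package's edge-case logic are deliberately omitted, so this is an illustration of the algorithmic idea rather than the package itself.

```python
def _count_inversions(a):
    """Merge sort that also counts inversions; returns (sorted list, inversion count)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_l = _count_inversions(a[:mid])
    right, inv_r = _count_inversions(a[mid:])
    merged, inv, i, j = [], inv_l + inv_r, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i                 # every remaining left element is inverted
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inv

def kendall_tau(x, y):
    """Kendall's tau without ties in O(n log n): sort by x, count inversions in y."""
    n = len(x)
    y_by_x = [yy for _, yy in sorted(zip(x, y))]
    _, discordant = _count_inversions(y_by_x)
    return 1.0 - 4.0 * discordant / (n * (n - 1))

print(kendall_tau([1, 2, 3, 4, 5], [3, 1, 2, 5, 4]))   # 0.4
```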
Third, we establish asymptotic normality results associated with the truncated eigenstructure of matrices whose entries are rank statistics, made possible by synthesizing contemporary entrywise matrix perturbation analysis with the classical nonparametric theory of so-called simple linear rank statistics. Collectively, these results demonstrate the statistical utility of rank-based data transformations when paired with spectral techniques for dimensionality reduction. Additionally, for a dataset of human connectomes, our approach yields parsimonious dimensionality reduction and improved recovery of ground-truth neuroanatomical cluster structure."}, "https://arxiv.org/abs/2112.09259": {"title": "Robustness, Heterogeneous Treatment Effects and Covariate Shifts", "link": "https://arxiv.org/abs/2112.09259", "description": "arXiv:2112.09259v2 Announce Type: replace \nAbstract: This paper studies the robustness of estimated policy effects to changes in the distribution of covariates. Robustness to covariate shifts is important, for example, when evaluating the external validity of quasi-experimental results, which are often used as a benchmark for evidence-based policy-making. I propose a novel scalar robustness metric. This metric measures the magnitude of the smallest covariate shift needed to invalidate a claim on the policy effect (for example, $ATE \\geq 0$) supported by the quasi-experimental evidence. My metric links the heterogeneity of policy effects and robustness in a flexible, nonparametric way and does not require functional form assumptions. I cast the estimation of the robustness metric as a de-biased GMM problem. This approach guarantees a parametric convergence rate for the robustness metric while allowing for machine learning-based estimators of policy effect heterogeneity (for example, lasso, random forest, boosting, neural nets). I apply my procedure to the Oregon Health Insurance experiment. I study the robustness of policy effects estimates of health-care utilization and financial strain outcomes, relative to a shift in the distribution of context-specific covariates. Such covariates are likely to differ across US states, making quantification of robustness an important exercise for adoption of the insurance policy in states other than Oregon. I find that the effect on outpatient visits is the most robust among the metrics of health-care utilization considered."}, "https://arxiv.org/abs/2209.07672": {"title": "Nonparametric Estimation via Partial Derivatives", "link": "https://arxiv.org/abs/2209.07672", "description": "arXiv:2209.07672v2 Announce Type: replace \nAbstract: Traditional nonparametric estimation methods often lead to a slow convergence rate in large dimensions and require unrealistically enormous sizes of datasets for reliable conclusions. We develop an approach based on partial derivatives, either observed or estimated, to effectively estimate the function at near-parametric convergence rates. The novel approach and computational algorithm could lead to methods useful to practitioners in many areas of science and engineering. Our theoretical results reveal a behavior universal to this class of nonparametric estimation problems. We explore a general setting involving tensor product spaces and build upon the smoothing spline analysis of variance (SS-ANOVA) framework. 
For $d$-dimensional models under full interaction, the optimal rates with gradient information on $p$ covariates are identical to those for the $(d-p)$-interaction models without gradients and, therefore, the models are immune to the \"curse of interaction.\" For additive models, the optimal rates using gradient information are root-$n$, thus achieving the \"parametric rate.\" We demonstrate aspects of the theoretical results through synthetic and real data applications."}, "https://arxiv.org/abs/2212.01943": {"title": "Unbiased Test Error Estimation in the Poisson Means Problem via Coupled Bootstrap Techniques", "link": "https://arxiv.org/abs/2212.01943", "description": "arXiv:2212.01943v3 Announce Type: replace \nAbstract: We propose a coupled bootstrap (CB) method for the test error of an arbitrary algorithm that estimates the mean in a Poisson sequence, often called the Poisson means problem. The idea behind our method is to generate two carefully-designed data vectors from the original data vector, by using synthetic binomial noise. One such vector acts as the training sample and the second acts as the test sample. To stabilize the test error estimate, we average this over $B$ bootstrap draws of the synthetic noise. A key property of the CB estimator is that it is unbiased for the test error in a Poisson problem where the original mean has been shrunken by a small factor, driven by the success probability $p$ in the binomial noise. Further, in the limit as $B \\to \\infty$ and $p \\to 0$, we show that the CB estimator recovers a known unbiased estimator for test error based on Hudson's lemma, under no assumptions on the given algorithm for estimating the mean (in particular, no smoothness assumptions). Our methodology applies to two central loss functions that can be used to define test error: Poisson deviance and squared loss. Via a bias-variance decomposition, for each loss function, we analyze the effects of the binomial success probability and the number of bootstrap samples on the accuracy of the estimator. We also investigate our method empirically across a variety of settings, using simulated as well as real data."}, "https://arxiv.org/abs/2306.08719": {"title": "Off-policy Evaluation in Doubly Inhomogeneous Environments", "link": "https://arxiv.org/abs/2306.08719", "description": "arXiv:2306.08719v4 Announce Type: replace \nAbstract: This work aims to study off-policy evaluation (OPE) under scenarios where two key reinforcement learning (RL) assumptions -- temporal stationarity and individual homogeneity -- are both violated. To handle the ``double inhomogeneities\", we propose a class of latent factor models for the reward and observation transition functions, under which we develop a general OPE framework that consists of both model-based and model-free approaches. To our knowledge, this is the first paper that develops statistically sound OPE methods in offline RL with double inhomogeneities. It contributes to a deeper understanding of OPE in environments where standard RL assumptions are not met, and provides several practical approaches in these settings. We establish the theoretical properties of the proposed value estimators and empirically show that our approach outperforms competing methods that ignore either temporal nonstationarity or individual heterogeneity.
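The coupled bootstrap abstract above relies on binomial thinning of Poisson counts to create independent training and test vectors. The sketch below illustrates only that split and the averaging over bootstrap draws; the squared-error quantity computed here is a crude plug-in proxy, not the paper's unbiased test-error estimators for Poisson deviance or squared loss, and the mean estimator is a deliberately simple placeholder.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = rng.gamma(2.0, 2.0, size=200)           # unknown Poisson means
X = rng.poisson(mu)                          # observed counts

p, B = 0.1, 50                               # thinning fraction and number of bootstrap draws
errs = []
for _ in range(B):
    Z = rng.binomial(X, p)                   # synthetic binomial noise given X
    X_train, X_test = X - Z, Z               # independent Poisson((1-p)mu) and Poisson(p*mu)
    mu_hat = np.full(len(X), X_train.mean()) # any mean estimator fitted on the training vector
    # rough squared-error proxy comparing the (rescaled) fit to the held-out counts
    errs.append(np.mean((X_test - p * mu_hat / (1 - p)) ** 2))
print(np.mean(errs))
```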
Finally, we illustrate our method on a data set from the Medical Information Mart for Intensive Care."}, "https://arxiv.org/abs/2309.04957": {"title": "Winner's Curse Free Robust Mendelian Randomization with Summary Data", "link": "https://arxiv.org/abs/2309.04957", "description": "arXiv:2309.04957v2 Announce Type: replace \nAbstract: In the past decade, the increased availability of genome-wide association studies summary data has popularized Mendelian Randomization (MR) for conducting causal inference. MR analyses, incorporating genetic variants as instrumental variables, are known for their robustness against reverse causation bias and unmeasured confounders. Nevertheless, classical MR analyses utilizing summary data may still produce biased causal effect estimates due to the winner's curse and pleiotropic issues. To address these two issues and establish valid causal conclusions, we propose a unified robust Mendelian Randomization framework with summary data, which systematically removes the winner's curse and screens out invalid genetic instruments with pleiotropic effects. Different from existing robust MR literature, our framework delivers valid statistical inference on the causal effect, neither requiring the genetic pleiotropy effects to follow any parametric distribution nor relying on a perfect instrument screening property. Under appropriate conditions, we show that our proposed estimator converges to a normal distribution and its variance can be well estimated. We demonstrate the performance of our proposed estimator through Monte Carlo simulations and two case studies. The code implementing the procedures is available at https://github.com/ChongWuLab/CARE/."}, "https://arxiv.org/abs/2310.13826": {"title": "A p-value for Process Tracing and other N=1 Studies", "link": "https://arxiv.org/abs/2310.13826", "description": "arXiv:2310.13826v2 Announce Type: replace \nAbstract: We introduce a method for calculating \\(p\\)-values when testing causal theories about a single case, for instance when conducting process tracing. As in Fisher's (1935) original design, our \\(p\\)-value indicates how frequently one would find the same or more favorable evidence while entertaining a rival theory (the null) for the sake of argument. We use an urn model to represent the null distribution and calibrate it to privilege false negative errors and reduce false positive errors. We also present an approach to sensitivity analysis and to representing the evidentiary weight of different observations. Our test suits any type of evidence, such as data from interviews and archives, observed in any combination. We apply our hypothesis test in two studies: a process tracing classic about the cause of the cholera outbreak in Soho (Snow 1855) and a recent process-tracing-based explanation of the cause of a welfare policy shift in Uruguay (Rossel, Antia, and Manzi 2023)."}, "https://arxiv.org/abs/2311.17575": {"title": "Identifying Causal Effects of Nonbinary, Ordered Treatments using Multiple Instrumental Variables", "link": "https://arxiv.org/abs/2311.17575", "description": "arXiv:2311.17575v2 Announce Type: replace \nAbstract: This paper introduces a novel method for identifying causal effects of ordered, nonbinary treatments using multiple binary instruments. Extending the two-stage least squares (TSLS) framework, the approach accommodates ordered treatments under any monotonicity assumption.
The key contribution is the identification of a new causal parameter that simplifies the interpretation of causal effects and is broadly applicable due to a mild monotonicity assumption, offering a compelling alternative to TSLS. The paper builds upon recent causal machine learning methodology for estimation and demonstrates how causal forests can detect local violations of the underlying monotonicity assumption. The methodology is applied to estimate the returns to education using the seminal dataset of Card (1995) and to evaluate the impact of an additional child on female labor market outcomes using data from Angrist and Evans (1998)."}, "https://arxiv.org/abs/2312.12786": {"title": "Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets", "link": "https://arxiv.org/abs/2312.12786", "description": "arXiv:2312.12786v2 Announce Type: replace \nAbstract: Development of comprehensive prediction models is often of great interest in many disciplines of science, but datasets with information on all desired features often have small sample sizes. We describe a transfer learning approach for building high-dimensional generalized linear models using data from a main study with detailed information on all predictors and an external, potentially much larger, study that has ascertained a more limited set of predictors. We propose using the external dataset to build a reduced model and then \"transfer\" the information on underlying parameters for the analysis of the main study through a set of calibration equations which can account for the study-specific effects of design variables. We then propose a penalized generalized method of moments framework for inference and a one-step estimation method that can be implemented using the standard glmnet package. We develop asymptotic theory and conduct extensive simulation studies to investigate both predictive performance and post-selection inference properties of the proposed method. Finally, we illustrate an application of the proposed method for the development of risk models for five common diseases using the UK Biobank study, combining information on low-dimensional risk factors and high-throughput proteomic biomarkers."}, "https://arxiv.org/abs/2204.00180": {"title": "Measuring Diagnostic Test Performance Using Imperfect Reference Tests: A Partial Identification Approach", "link": "https://arxiv.org/abs/2204.00180", "description": "arXiv:2204.00180v4 Announce Type: replace-cross \nAbstract: Diagnostic tests are almost never perfect. Studies quantifying their performance use knowledge of the true health status, measured with a reference diagnostic test. Researchers commonly assume that the reference test is perfect, which is often not the case in practice. When the assumption fails, conventional studies identify \"apparent\" performance or performance with respect to the reference, but not true performance. This paper provides the smallest possible bounds on the measures of true performance - sensitivity (true positive rate) and specificity (true negative rate), or equivalently false positive and negative rates, in standard settings. Implied bounds on policy-relevant parameters are derived: 1) Prevalence in screened populations; 2) Predictive values. Methods for inference based on moment inequalities are used to construct uniformly consistent confidence sets in level over a relevant family of data distributions.
Emergency Use Authorization (EUA) and independent study data for the BinaxNOW COVID-19 antigen test demonstrate that the bounds can be very informative. Analysis reveals that the estimated false negative rates for symptomatic and asymptomatic patients are up to 3.17 and 4.59 times higher than the frequently cited \"apparent\" false negative rate. Further applicability of the results in the context of imperfect proxies such as survey responses and imputed protected classes is indicated."}, "https://arxiv.org/abs/2309.06673": {"title": "Ridge detection for nonstationary multicomponent signals with time-varying wave-shape functions and its applications", "link": "https://arxiv.org/abs/2309.06673", "description": "arXiv:2309.06673v2 Announce Type: replace-cross \nAbstract: We introduce a novel ridge detection algorithm for time-frequency (TF) analysis, particularly tailored for intricate nonstationary time series encompassing multiple non-sinusoidal oscillatory components. The algorithm is rooted in the distinctive geometric patterns that emerge in the TF domain due to such non-sinusoidal oscillations. We term this method \\textit{shape-adaptive mode decomposition-based multiple harmonic ridge detection} (\\textsf{SAMD-MHRD}). A swift implementation is available when supplementary information is at hand. We demonstrate the practical utility of \\textsf{SAMD-MHRD} through its application to a real-world challenge. We employ it to devise a cutting-edge walking activity detection algorithm, leveraging accelerometer signals from an inertial measurement unit across diverse body locations of a moving subject."}, "https://arxiv.org/abs/2408.10396": {"title": "Highly Multivariate High-dimensionality Spatial Stochastic Processes-A Mixed Conditional Approach", "link": "https://arxiv.org/abs/2408.10396", "description": "arXiv:2408.10396v1 Announce Type: new \nAbstract: We propose a hybrid mixed spatial graphical model framework and novel concepts, e.g., cross-Markov Random Field (cross-MRF), to comprehensively address all feature aspects of highly multivariate high-dimensionality (HMHD) spatial data class when constructing the desired joint variance and precision matrix (where both p and n are large). Specifically, the framework accommodates any customized conditional independence (CI) among any number of p variate fields at the first stage, alleviating dynamic memory burden. Meanwhile, it facilitates parallel generation of covariance and precision matrix, with the latter's generation order scaling only linearly in p. In the second stage, we demonstrate the multivariate Hammersley-Clifford theorem from a column-wise conditional perspective and unearth the existence of cross-MRF. The link of the mixed spatial graphical framework and the cross-MRF allows for a mixed conditional approach, resulting in the sparsest possible representation of the precision matrix via accommodating the doubly CI among both p and n, with the highest possible exact-zero-value percentage. We also explore the possibility of the co-existence of geostatistical and MRF modelling approaches in one unified framework, imparting a potential solution to an open problem. 
The derived theories are illustrated with 1D simulation and 2D real-world spatial data."}, "https://arxiv.org/abs/2408.10401": {"title": "Spatial Knockoff Bayesian Variable Selection in Genome-Wide Association Studies", "link": "https://arxiv.org/abs/2408.10401", "description": "arXiv:2408.10401v1 Announce Type: new \nAbstract: High-dimensional variable selection has emerged as one of the prevailing statistical challenges in the big data revolution. Many variable selection methods have been adapted for identifying single nucleotide polymorphisms (SNPs) linked to phenotypic variation in genome-wide association studies. We develop a Bayesian variable selection regression model for identifying SNPs linked to phenotypic variation. We modify our Bayesian variable selection regression models to control the false discovery rate of SNPs using a knockoff variable approach. We reduce spurious associations by regressing the phenotype of interest against a set of basis functions that account for the relatedness of individuals. Using a restricted regression approach, we simultaneously estimate the SNP-level effects while removing variation in the phenotype that can be explained by population structure. We also accommodate the spatial structure among causal SNPs by modeling their inclusion probabilities jointly with a reduced rank Gaussian process. In a simulation study, we demonstrate that our spatial Bayesian variable selection regression model controls the false discovery rate and increases power when the relevant SNPs are clustered. We conclude with an analysis of Arabidopsis thaliana flowering time, a polygenic trait that is confounded with population structure, and find the discoveries of our method cluster near described flowering time genes."}, "https://arxiv.org/abs/2408.10478": {"title": "On a fundamental difference between Bayesian and frequentist approaches to robustness", "link": "https://arxiv.org/abs/2408.10478", "description": "arXiv:2408.10478v1 Announce Type: new \nAbstract: Heavy-tailed models are often used as a way to gain robustness against outliers in Bayesian analyses. On the other side, in frequentist analyses, M-estimators are often employed. In this paper, the two approaches are reconciled by considering M-estimators as maximum likelihood estimators of heavy-tailed models. We realize that, even from this perspective, there is a fundamental difference in that frequentists do not require these heavy-tailed models to be proper. It is shown what the difference between improper and proper heavy-tailed models can be in terms of estimation results through two real-data analyses based on linear regression. The findings of this paper make us ponder on the use of improper heavy-tailed data models in Bayesian analyses, an approach which is seen to fit within the generalized Bayesian framework of Bissiri et al. (2016) when combined with proper prior distributions yielding proper (generalized) posterior distributions."}, "https://arxiv.org/abs/2408.10509": {"title": "Continuous difference-in-differences with double/debiased machine learning", "link": "https://arxiv.org/abs/2408.10509", "description": "arXiv:2408.10509v1 Announce Type: new \nAbstract: This paper extends difference-in-differences to settings involving continuous treatments. Specifically, the average treatment effect on the treated (ATT) at any level of continuous treatment intensity is identified using a conditional parallel trends assumption. 
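The Bayesian/frequentist robustness abstract above views M-estimators as maximum likelihood under heavy-tailed models. As a rough illustration of that reading, the sketch below compares ordinary least squares with the MLE of a linear model with (proper) Student-t errors on data containing gross outliers; the simulated data and fixed degrees of freedom are assumptions, and the paper's point about improper models and generalized posteriors is not addressed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import t as student_t

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[:10] += 15                                   # a few gross outliers

X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def neg_loglik(params, nu=3.0):
    """Negative log-likelihood of a linear model with Student-t errors (df = nu)."""
    b0, b1, log_s = params
    resid = (y - b0 - b1 * x) / np.exp(log_s)
    return -(student_t.logpdf(resid, df=nu) - log_s).sum()

beta_t = minimize(neg_loglik, x0=[0.0, 0.0, 0.0]).x[:2]
print("OLS:", beta_ols, "t-MLE:", beta_t)      # the t-MLE is far less affected by outliers
```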
In this framework, estimating the ATTs requires first estimating infinite-dimensional nuisance parameters, especially the conditional density of the continuous treatment, which can introduce significant biases. To address this challenge, estimators for the causal parameters are proposed under the double/debiased machine learning framework. We show that these estimators are asymptotically normal and provide consistent variance estimators. To illustrate the effectiveness of our methods, we re-examine the study by Acemoglu and Finkelstein (2008), which assessed the effects of the 1983 Medicare Prospective Payment System (PPS) reform. By reinterpreting their research design using a difference-in-differences approach with continuous treatment, we nonparametrically estimate the treatment effects of the 1983 PPS reform, thereby providing a more detailed understanding of its impact."}, "https://arxiv.org/abs/2408.10542": {"title": "High-Dimensional Covariate-Augmented Overdispersed Multi-Study Poisson Factor Model", "link": "https://arxiv.org/abs/2408.10542", "description": "arXiv:2408.10542v1 Announce Type: new \nAbstract: Factor analysis for high-dimensional data is a canonical problem in statistics and has a wide range of applications. However, there is currently no factor model tailored to effectively analyze high-dimensional count responses with corresponding covariates across multiple studies, such as the single-cell sequencing dataset from a case-control study. In this paper, we introduce factor models designed to jointly analyze multiple studies by extracting study-shared and specified factors. Our factor models account for heterogeneous noises and overdispersion among counts with augmented covariates. We propose an efficient and speedy variational estimation procedure for estimating model parameters, along with a novel criterion for selecting the optimal number of factors and the rank of regression coefficient matrix. The consistency and asymptotic normality of estimators are systematically investigated by connecting variational likelihood and profile M-estimation. Extensive simulations and an analysis of a single-cell sequencing dataset are conducted to demonstrate the effectiveness of the proposed multi-study Poisson factor model."}, "https://arxiv.org/abs/2408.10558": {"title": "Multi-Attribute Preferences: A Transfer Learning Approach", "link": "https://arxiv.org/abs/2408.10558", "description": "arXiv:2408.10558v1 Announce Type: new \nAbstract: This contribution introduces a novel statistical learning methodology based on the Bradley-Terry method for pairwise comparisons, where the novelty arises from the method's capacity to estimate the worth of objects for a primary attribute by incorporating data of secondary attributes. These attributes are properties on which objects are evaluated in a pairwise fashion by individuals. By assuming that the main interest of practitioners lies in the primary attribute, and the secondary attributes only serve to improve estimation of the parameters underlying the primary attribute, this paper utilises the well-known transfer learning framework. To wit, the proposed method first estimates a biased worth vector using data pertaining to both the primary attribute and the set of informative secondary attributes, which is followed by a debiasing step based on a penalised likelihood of the primary attribute. When the set of informative secondary attributes is unknown, we allow for their estimation by a data-driven algorithm. 
Theoretically, we show that, under mild conditions, the $\\ell_\\infty$ and $\\ell_2$ rates are improved compared to fitting a Bradley-Terry model on just the data pertaining to the primary attribute. The favourable (comparative) performance under more general settings is shown by means of a simulation study. To illustrate the usage and interpretation of the method, an application of the proposed method is provided on consumer preference data pertaining to a cassava-derived food product: eba. An R package containing the proposed methodology can be found at https://CRAN.R-project.org/package=BTTL."}, "https://arxiv.org/abs/2408.10570": {"title": "A two-sample test based on averaged Wilcoxon rank sums over interpoint distances", "link": "https://arxiv.org/abs/2408.10570", "description": "arXiv:2408.10570v1 Announce Type: new \nAbstract: An important class of two-sample multivariate homogeneity tests is based on identifying differences between the distributions of interpoint distances. While generating distances from point clouds offers a straightforward and intuitive way for dimensionality reduction, it also introduces dependencies to the resulting distance samples. We propose a simple test based on Wilcoxon's rank sum statistic for which we prove asymptotic normality under the null hypothesis and fixed alternatives under mild conditions on the underlying distributions of the point clouds. Furthermore, we show consistency of the test and derive a variance approximation that allows us to construct a computationally feasible, distribution-free test with good finite sample performance. The power and robustness of the test for high-dimensional data and low sample sizes are demonstrated by numerical simulations. Finally, we apply the proposed test to case-control testing on microarray data in genetic studies, which is considered a notorious case for a high number of variables and low sample sizes."}, "https://arxiv.org/abs/2408.10650": {"title": "Principal component analysis for max-stable distributions", "link": "https://arxiv.org/abs/2408.10650", "description": "arXiv:2408.10650v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is one of the most popular dimension reduction techniques in statistics and is especially powerful when a multivariate distribution is concentrated near a lower-dimensional subspace. Multivariate extreme value distributions have turned out to provide challenges for the application of PCA since their constrained support impedes the detection of lower-dimensional structures and heavy tails can imply that second moments do not exist, thereby preventing the application of classical variance-based techniques for PCA. We adapt PCA to max-stable distributions using a regression setting and employ max-linear maps to project the random vector to a lower-dimensional space while preserving max-stability. We also provide a characterization of those distributions which allow for a perfect reconstruction from the lower-dimensional representation. 
Finally, we demonstrate how an optimal projection matrix can be consistently estimated and show viability in practice with a simulation study and an application to a benchmark dataset."}, "https://arxiv.org/abs/2408.10686": {"title": "Gradient Wild Bootstrap for Instrumental Variable Quantile Regressions with Weak and Few Clusters", "link": "https://arxiv.org/abs/2408.10686", "description": "arXiv:2408.10686v1 Announce Type: new \nAbstract: We study the gradient wild bootstrap-based inference for instrumental variable quantile regressions in the framework of a small number of large clusters in which the number of clusters is viewed as fixed, and the number of observations for each cluster diverges to infinity. For the Wald inference, we show that our wild bootstrap Wald test, with or without studentization using the cluster-robust covariance estimator (CRVE), controls size asymptotically up to a small error as long as the parameter of the endogenous variable is strongly identified in at least one of the clusters. We further show that the wild bootstrap Wald test with CRVE studentization is more powerful for distant local alternatives than that without. Lastly, we develop a wild bootstrap Anderson-Rubin (AR) test for weak-identification-robust inference. We show it controls size asymptotically up to a small error, even under weak or partial identification for all clusters. We illustrate the good finite-sample performance of the new inference methods using simulations and provide an empirical application to a well-known dataset about US local labor markets."}, "https://arxiv.org/abs/2408.10825": {"title": "Conditional nonparametric variable screening by neural factor regression", "link": "https://arxiv.org/abs/2408.10825", "description": "arXiv:2408.10825v1 Announce Type: new \nAbstract: High-dimensional covariates often admit a linear factor structure. To effectively screen correlated covariates in high dimensions, we propose a conditional variable screening test based on non-parametric regression using neural networks due to their representation power. We ask whether individual covariates have additional contributions given the latent factors or more generally a set of variables. Our test statistics are based on the estimated partial derivative of the regression function of the candidate variable for screening and an observable proxy for the latent factors. Hence, our test reveals how much predictors contribute additionally to the non-parametric regression after accounting for the latent factors. Our derivative estimator is the convolution of a deep neural network regression estimator and a smoothing kernel. We demonstrate that when the neural network size diverges with the sample size, unlike estimating the regression function itself, it is necessary to smooth the partial derivative of the neural network estimator to recover the desired convergence rate for the derivative. Moreover, our screening test achieves asymptotic normality under the null after finely centering our test statistics, which makes the biases negligible, as well as consistency for local alternatives under mild conditions. 
We demonstrate the performance of our test in a simulation study and two real-world applications."}, "https://arxiv.org/abs/2408.10915": {"title": "Neural Networks for Parameter Estimation in Geometrically Anisotropic Geostatistical Models", "link": "https://arxiv.org/abs/2408.10915", "description": "arXiv:2408.10915v1 Announce Type: new \nAbstract: This article presents a neural network approach for estimating the covariance function of spatial Gaussian random fields defined in a portion of the Euclidean plane. Our proposal builds upon recent contributions, expanding from the purely isotropic setting to encompass geometrically anisotropic correlation structures, i.e., random fields with correlation ranges that vary across different directions. We conduct experiments with both simulated and real data to assess the performance of the methodology and to provide guidelines to practitioners."}, "https://arxiv.org/abs/2408.11003": {"title": "DEEPEAST technique to enhance power in two-sample tests via the same-attraction function", "link": "https://arxiv.org/abs/2408.11003", "description": "arXiv:2408.11003v1 Announce Type: new \nAbstract: Data depth has emerged as an invaluable nonparametric measure for the ranking of multivariate samples. The main contribution of depth-based two-sample comparisons is the introduction of the Q statistic (Liu and Singh, 1993), a quality index. Unlike traditional methods, data depth does not require the assumption of normal distributions and adheres to four fundamental properties. Many existing two-sample homogeneity tests, which assess mean and/or scale changes in distributions, often suffer from low statistical power or indeterminate asymptotic distributions. To overcome these challenges, we introduced a DEEPEAST (depth-explored same-attraction sample-to-sample central-outward ranking) technique for improving statistical power in two-sample tests via the same-attraction function. We proposed two novel and powerful depth-based test statistics: the sum test statistic and the product test statistic, which are rooted in Q statistics, share a \"common attractor\" and are applicable across all depth functions. We further proved the asymptotic distribution of these statistics for various depth functions. To assess the performance of power gain, we apply three depth functions: Mahalanobis depth (Liu and Singh, 1993), Spatial depth (Brown, 1958; Gower, 1974), and Projection depth (Liu, 1992). Through two-sample simulations, we have demonstrated that our sum and product statistics exhibit superior power performance, utilizing a strategic block permutation algorithm, and compare favourably with popular methods in the literature. Our tests are further validated through analysis of Raman spectral data, acquired from cellular and tissue samples, highlighting the effective discrimination between healthy and cancerous samples."}, "https://arxiv.org/abs/2408.11012": {"title": "Discriminant Analysis in stationary time series based on robust cepstral coefficients", "link": "https://arxiv.org/abs/2408.11012", "description": "arXiv:2408.11012v1 Announce Type: new \nAbstract: Time series analysis is crucial in fields like finance, economics, environmental science, and biomedical engineering, aiding in forecasting, pattern identification, and understanding underlying mechanisms. While traditional time-domain methods focus on trends and seasonality, they often miss periodicities better captured in the frequency domain. 
Analyzing time series in the frequency domain uncovers spectral properties, offering deeper insights into underlying processes, aiding in differentiating data-generating processes of various populations, and assisting in classification. Common approaches use smoothed estimators, such as the smoothed periodogram, to minimize bias by averaging spectra from individual replicates within a population. However, these methods struggle with spectral variability among replicates, and abrupt values can skew estimators, complicating discrimination and classification. There's a gap in the literature for methods that account for within-population spectral variability, separate white noise effects from autocorrelations, and employ robust estimators in the presence of outliers. This paper fills that gap by introducing a robust framework for classifying time series groups. The process involves transforming time series into the frequency domain using the Fourier Transform, computing the power spectrum, and using the inverse Fourier Transform to obtain the cepstrum. To enhance spectral estimates' robustness and consistency, we apply the multitaper periodogram and the M-periodogram. These features are then used in Linear Discriminant Analysis (LDA) to improve classification accuracy and interpretability, offering a powerful tool for precise temporal pattern distinction and resilience to data anomalies."}, "https://arxiv.org/abs/2408.10251": {"title": "Impossible temperatures are not as rare as you think", "link": "https://arxiv.org/abs/2408.10251", "description": "arXiv:2408.10251v1 Announce Type: cross \nAbstract: The last decade has seen numerous record-shattering heatwaves in all corners of the globe. In the aftermath of these devastating events, there is interest in identifying worst-case thresholds or upper bounds that quantify just how hot temperatures can become. Generalized Extreme Value theory provides a data-driven estimate of extreme thresholds; however, upper bounds may be exceeded by future events, which undermines attribution and planning for heatwave impacts. Here, we show how the occurrence and relative probability of observed events that exceed a priori upper bound estimates, so-called \"impossible\" temperatures, has changed over time. We find that many unprecedented events are actually within data-driven upper bounds, but only when using modern spatial statistical methods. Furthermore, there are clear connections between anthropogenic forcing and the \"impossibility\" of the most extreme temperatures. Robust understanding of heatwave thresholds provides critical information about future record-breaking events and how their extremity relates to historical measurements."}, "https://arxiv.org/abs/2408.10610": {"title": "On the Approximability of Stationary Processes using the ARMA Model", "link": "https://arxiv.org/abs/2408.10610", "description": "arXiv:2408.10610v1 Announce Type: cross \nAbstract: We identify certain gaps in the literature on the approximability of stationary random variables using the Autoregressive Moving Average (ARMA) model. To quantify approximability, we propose that an ARMA model be viewed as an approximation of a stationary random variable. We map these stationary random variables to Hardy space functions, and formulate a new function approximation problem that corresponds to random variable approximation, and thus to ARMA. Based on this Hardy space formulation we identify a class of stationary processes where approximation guarantees are feasible. 
We also identify an idealized stationary random process for which we conjecture that a good ARMA approximation is not possible. Next, we provide a constructive proof that Pad\\'e approximations do not always correspond to the best ARMA approximation. Finally, we note that the spectral methods adopted in this paper can be seen as a generalization of unit root methods for stationary processes even when an ARMA model is not defined."}, "https://arxiv.org/abs/2208.05949": {"title": "Valid Inference After Causal Discovery", "link": "https://arxiv.org/abs/2208.05949", "description": "arXiv:2208.05949v3 Announce Type: replace \nAbstract: Causal discovery and causal effect estimation are two fundamental tasks in causal inference. While many methods have been developed for each task individually, statistical challenges arise when applying these methods jointly: estimating causal effects after running causal discovery algorithms on the same data leads to \"double dipping,\" invalidating the coverage guarantees of classical confidence intervals. To this end, we develop tools for valid post-causal-discovery inference. Across empirical studies, we show that a naive combination of causal discovery and subsequent inference algorithms leads to highly inflated miscoverage rates; on the other hand, applying our method provides reliable coverage while achieving more accurate causal discovery than data splitting."}, "https://arxiv.org/abs/2208.08693": {"title": "Matrix Quantile Factor Model", "link": "https://arxiv.org/abs/2208.08693", "description": "arXiv:2208.08693v3 Announce Type: replace \nAbstract: This paper introduces a matrix quantile factor model for matrix-valued data with low-rank structure. We estimate the row and column factor spaces via minimizing the empirical check loss function with orthogonal rotation constraints. We show that the estimates converge at rate $(\\min\\{p_1p_2,p_2T,p_1T\\})^{-1/2}$ in the average Frobenius norm, where $p_1$, $p_2$ and $T$ are the row dimensionality, column dimensionality and length of the matrix sequence, respectively. This rate is faster than that of the quantile estimates via ``flattening\" the matrix model into a large vector model. To derive the central limit theorem, we introduce a novel augmented Lagrangian function, which is equivalent to the original constrained empirical check loss minimization problem. Via the equivalence, we prove that the Hessian matrix of the augmented Lagrangian function is locally positive definite, resulting in a locally convex penalized loss function around the true factors and their loadings. This easily leads to a feasible second-order expansion of the score function and readily established central limit theorems of the smoothed estimates of the loadings. We provide three consistent criteria to determine the pair of row and column factor numbers. Extensive simulation studies and an empirical study justify our theory."}, "https://arxiv.org/abs/2307.02188": {"title": "Improving Algorithms for Fantasy Basketball", "link": "https://arxiv.org/abs/2307.02188", "description": "arXiv:2307.02188v4 Announce Type: replace \nAbstract: Fantasy basketball has a rich underlying mathematical structure which makes optimal drafting strategy unclear. A central issue for category leagues is how to aggregate a player's statistics from all categories into a single number representing general value. It is shown that under a simplified model of fantasy basketball, a novel metric dubbed the \"G-score\" is appropriate for this purpose. 
The traditional metric used by analysts, \"Z-score\", is a special case of the G-score under the condition that future player performances are known exactly. The distinction between Z-score and G-score is particularly meaningful for head-to-head formats, because there is a large degree of uncertainty in player performance from one week to another. Simulated fantasy basketball seasons with head-to-head scoring provide evidence that G-scores do in fact outperform Z-scores in that context."}, "https://arxiv.org/abs/2310.02273": {"title": "A New measure of income inequality", "link": "https://arxiv.org/abs/2310.02273", "description": "arXiv:2310.02273v2 Announce Type: replace \nAbstract: A new measure of income inequality that captures the heavy tail behavior of the income distribution is proposed. We discuss two different approaches to find the estimators of the proposed measure. We show that these estimators are consistent and have an asymptotically normal distribution. We also obtain a jackknife empirical likelihood (JEL) confidence interval of the income inequality measure. A Monte Carlo simulation study is conducted to evaluate the finite sample properties of the estimators and the JEL-based confidence interval. Finally, we use our measure to study the income inequality of three states in India."}, "https://arxiv.org/abs/2305.00050": {"title": "Causal Reasoning and Large Language Models: Opening a New Frontier for Causality", "link": "https://arxiv.org/abs/2305.00050", "description": "arXiv:2305.00050v3 Announce Type: replace-cross \nAbstract: The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a \"behavioral\" study of LLMs to benchmark their capability in generating causal arguments. Across a wide range of tasks, we find that LLMs can generate text corresponding to correct causal arguments with high probability, surpassing the best-performing existing methods. Algorithms based on GPT-3.5 and 4 outperform existing algorithms on a pairwise causal discovery task (97%, 13 points gain), counterfactual reasoning task (92%, 20 points gain) and event causality (86% accuracy in determining necessary and sufficient causes in vignettes). We perform robustness checks across tasks and show that the capabilities cannot be explained by dataset memorization alone, especially since LLMs generalize to novel datasets that were created after the training cutoff date.\n That said, LLMs exhibit unpredictable failure modes, and we discuss the kinds of errors that may be improved and what the fundamental limits of LLM-based answers are. Overall, by operating on the text metadata, LLMs bring capabilities so far understood to be restricted to humans, such as using collected knowledge to generate causal graphs or identifying background causal context from natural language. As a result, LLMs may be used by human domain experts to save effort in setting up a causal analysis, one of the biggest impediments to the widespread adoption of causal methods. Given that LLMs ignore the actual data, our results also point to a fruitful research direction of developing algorithms that combine LLMs with existing causal techniques. 
Code and datasets are available at https://github.com/py-why/pywhy-llm."}, "https://arxiv.org/abs/2305.03205": {"title": "Risk management in the use of published statistical results for policy decisions", "link": "https://arxiv.org/abs/2305.03205", "description": "arXiv:2305.03205v2 Announce Type: replace-cross \nAbstract: Statistical inferential results generally come with a measure of reliability for decision-making purposes. For a policy implementer, the value of implementing published policy research depends critically upon this reliability. For a policy researcher, the value of policy implementation may depend weakly or not at all upon the policy's outcome. Some researchers might benefit from overstating the reliability of statistical results. Implementers may find it difficult or impossible to determine whether researchers are overstating reliability. This information asymmetry between researchers and implementers can lead to an adverse selection problem where, at best, the full benefits of a policy are not realized or, at worst, a policy is deemed too risky to implement at any scale. Researchers can remedy this by guaranteeing the policy outcome. Researchers can overcome their own risk aversion and wealth constraints by exchanging risks with other researchers or offering only partial insurance. The problem and remedy are illustrated using a confidence interval for the success probability of a binomial policy outcome."}, "https://arxiv.org/abs/2408.11193": {"title": "Inference with Many Weak Instruments and Heterogeneity", "link": "https://arxiv.org/abs/2408.11193", "description": "arXiv:2408.11193v1 Announce Type: new \nAbstract: This paper considers inference in a linear instrumental variable regression model with many potentially weak instruments and treatment effect heterogeneity. I show that existing tests can be arbitrarily oversized in this setup. Then, I develop a valid procedure that is robust to weak instrument asymptotics and heterogeneous treatment effects. The procedure targets a JIVE estimand, calculates an LM statistic, and compares it with critical values from a normal distribution. To establish this procedure's validity, this paper shows that the LM statistic is asymptotically normal and a leave-three-out variance estimator is unbiased and consistent. The power of the LM test is also close to a power envelope in an empirical application."}, "https://arxiv.org/abs/2408.11272": {"title": "High-Dimensional Overdispersed Generalized Factor Model with Application to Single-Cell Sequencing Data Analysis", "link": "https://arxiv.org/abs/2408.11272", "description": "arXiv:2408.11272v1 Announce Type: new \nAbstract: The current high-dimensional linear factor models fail to account for the different types of variables, while high-dimensional nonlinear factor models often overlook the overdispersion present in mixed-type data. However, overdispersion is prevalent in practical applications, particularly in fields like biomedical and genomics studies. To address this practical demand, we propose an overdispersed generalized factor model (OverGFM) for performing high-dimensional nonlinear factor analysis on overdispersed mixed-type data. Our approach incorporates an additional error term to capture the overdispersion that cannot be accounted for by factors alone. However, this introduces significant computational challenges due to the involvement of two high-dimensional latent random matrices in the nonlinear model. 
To overcome these challenges, we propose a novel variational EM algorithm that integrates Laplace and Taylor approximations. This algorithm provides iterative explicit solutions for the complex variational parameters and is proven to possess excellent convergence properties. We also develop a criterion based on the singular value ratio to determine the optimal number of factors. Numerical results demonstrate the effectiveness of this criterion. Through comprehensive simulation studies, we show that OverGFM outperforms state-of-the-art methods in terms of estimation accuracy and computational efficiency. Furthermore, we demonstrate the practical merit of our method through its application to two datasets from genomics. To facilitate its usage, we have integrated the implementation of OverGFM into the R package GFM."}, "https://arxiv.org/abs/2408.11315": {"title": "Locally Adaptive Random Walk Stochastic Volatility", "link": "https://arxiv.org/abs/2408.11315", "description": "arXiv:2408.11315v1 Announce Type: new \nAbstract: We introduce a novel Bayesian framework for estimating time-varying volatility by extending the Random Walk Stochastic Volatility (RWSV) model with a new Dynamic Shrinkage Process (DSP) in (log) variances. Unlike classical Stochastic Volatility or GARCH-type models with restrictive parametric stationarity assumptions, our proposed Adaptive Stochastic Volatility (ASV) model provides smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty (vol of vol). We derive the theoretical properties of the proposed global-local shrinkage prior. Through simulation studies, we demonstrate that ASV exhibits remarkable misspecification resilience with low prediction error across various data generating scenarios. Furthermore, ASV's capacity to yield locally smooth and interpretable estimates facilitates a clearer understanding of underlying patterns and trends in volatility. Additionally, we propose and illustrate an extension for Bayesian Trend Filtering simultaneously in both mean and variance. Finally, we show that this attribute makes ASV a robust tool applicable across a wide range of disciplines, including finance, environmental science, epidemiology, and medicine, among others."}, "https://arxiv.org/abs/2408.11497": {"title": "Climate Change in Austria: Precipitation and Dry Spells over 50 years", "link": "https://arxiv.org/abs/2408.11497", "description": "arXiv:2408.11497v1 Announce Type: new \nAbstract: We propose a spatio-temporal generalised additive model (GAM) to study if precipitation patterns have changed between two 10-year time periods in the last 50 years in Austria. In particular, we model three scenarios: monthly mean and monthly maximum precipitation as well as the maximum length of a dry spell per month with a gamma, blended generalised extreme value and negative binomial distribution, respectively, over the periods 1973-1982 and 2013-2022. In order to model the spatial dependencies in the data more realistically, we intend to take the mountainous landscape of Austria into account. Therefore, we have chosen a non-stationary version of the Mat\\'ern covariance function, which accounts for elevation differences, as a spatial argument of the latent field in the GAM. The temporal part of the latent field is captured by an AR(1) process. We use the stochastic partial differential equation approach in combination with integrated nested Laplace approximation to perform inference in a computationally efficient manner. 
The model outputs are visualised and support existing climate change studies in the Alpine region obtained with, for example, projections from regional climate models."}, "https://arxiv.org/abs/2408.11519": {"title": "Towards an Inclusive Approach to Corporate Social Responsibility (CSR) in Morocco: CGEM's Commitment", "link": "https://arxiv.org/abs/2408.11519", "description": "arXiv:2408.11519v1 Announce Type: new \nAbstract: Corporate social responsibility encourages companies to integrate social and environmental concerns into their activities and their relations with stakeholders. It encompasses all actions aimed at the social good, above and beyond corporate interests and legal requirements. Various international organizations, authors and researchers have explored the notion of CSR and proposed a range of definitions reflecting their perspectives on the concept. In Morocco, although Moroccan companies are not overwhelmingly embracing CSR, several factors are encouraging them to integrate the CSR approach not only into their discourse, but also into their strategies. The CGEM is actively involved in promoting CSR within Moroccan companies, awarding the \"CGEM Label for CSR\" to companies that meet the criteria set out in the CSR Charter. The process of labeling Moroccan companies is in full expansion. The graphs presented in this article are broken down according to several criteria, such as company size, sector of activity and listing on the Casablanca Stock Exchange, in order to provide an overview of CSR-labeled companies in Morocco. The approach adopted for this article is a qualitative one aimed at presenting, firstly, the different definitions of the CSR concept and its evolution over time. In this way, the study focuses on the Moroccan context to dissect and analyze the state of progress of CSR integration in Morocco and the various efforts made by the CGEM to implement it. According to the data, 124 Moroccan companies have been awarded the CSR label. For a label in existence since 2006, this figure reflects a certain reluctance on the part of Moroccan companies to fully implement the CSR approach in their strategies. Nevertheless, Morocco is in a transitional phase, marked by the gradual adoption of various socially responsible practices."}, "https://arxiv.org/abs/2408.11594": {"title": "On the handling of method failure in comparison studies", "link": "https://arxiv.org/abs/2408.11594", "description": "arXiv:2408.11594v1 Announce Type: new \nAbstract: Comparison studies in methodological research are intended to compare methods in an evidence-based manner, offering guidance to data analysts to select a suitable method for their application. To provide trustworthy evidence, they must be carefully designed, implemented, and reported, especially given the many decisions made in planning and running. A common challenge in comparison studies is to handle the ``failure'' of one or more methods to produce a result for some (real or simulated) data sets, such that their performances cannot be measured in those instances. Despite an increasing emphasis on this topic in recent literature (focusing on non-convergence as a common manifestation), there is little guidance on proper handling and interpretation, and reporting of the chosen approach is often neglected. This paper aims to fill this gap and provides practical guidance for handling method failure in comparison studies. 
In particular, we show that the popular approaches of discarding data sets yielding failure (either for all or the failing methods only) and imputing are inappropriate in most cases. We also discuss how method failure in published comparison studies -- in various contexts from classical statistics and predictive modeling -- may manifest differently, but is often caused by a complex interplay of several aspects. Building on this, we provide recommendations derived from realistic considerations on suitable fallbacks when encountering method failure, hence avoiding the need for discarding data sets or imputation. Finally, we illustrate our recommendations and the dangers of inadequate handling of method failure through two illustrative comparison studies."}, "https://arxiv.org/abs/2408.11621": {"title": "Robust Bayes Treatment Choice with Partial Identification", "link": "https://arxiv.org/abs/2408.11621", "description": "arXiv:2408.11621v1 Announce Type: new \nAbstract: We study a class of binary treatment choice problems with partial identification, through the lens of robust (multiple prior) Bayesian analysis. We use a convenient set of prior distributions to derive ex-ante and ex-post robust Bayes decision rules, both for decision makers who can randomize and for decision makers who cannot.\n Our main messages are as follows: First, ex-ante and ex-post robust Bayes decision rules do not tend to agree in general, whether or not randomized rules are allowed. Second, randomized treatment assignment for some data realizations can be optimal in both ex-ante and, perhaps more surprisingly, ex-post problems. Therefore, it is usually with loss of generality to exclude randomized rules from consideration, even when regret is evaluated ex-post.\n We apply our results to a stylized problem where a policy maker uses experimental data to choose whether to implement a new policy in a population of interest, but is concerned about the external validity of the experiment at hand (Stoye, 2012); and to the aggregation of data generated by multiple randomized control trials in different sites to make a policy choice in a population for which no experimental data are available (Manski, 2020; Ishihara and Kitagawa, 2021)."}, "https://arxiv.org/abs/2408.11672": {"title": "Evidential Analysis: An Alternative to Hypothesis Testing in Normal Linear Models", "link": "https://arxiv.org/abs/2408.11672", "description": "arXiv:2408.11672v1 Announce Type: new \nAbstract: Statistical hypothesis testing, as formalized by 20th Century statisticians and taught in college statistics courses, has been a cornerstone of 100 years of scientific progress. Nevertheless, the methodology is increasingly questioned in many scientific disciplines. We demonstrate in this paper how many of the worrisome aspects of statistical hypothesis testing can be ameliorated with concepts and methods from evidential analysis. The model family we treat is the familiar normal linear model with fixed effects, embracing multiple regression and analysis of variance, a warhorse of everyday science in labs and field stations. Questions about study design, the applicability of the null hypothesis, the effect size, error probabilities, evidence strength, and model misspecification become more naturally housed in an evidential setting. 
We provide a completely worked example featuring a 2-way analysis of variance."}, "https://arxiv.org/abs/2408.11676": {"title": "L2-Convergence of the Population Principal Components in the Approximate Factor Model", "link": "https://arxiv.org/abs/2408.11676", "description": "arXiv:2408.11676v1 Announce Type: new \nAbstract: We prove that under the condition that the eigenvalues are asymptotically well separated and stable, the normalised principal components of an r-static factor sequence converge in mean square. Consequently, we have a generic interpretation of the principal components estimator as the normalised principal components of the statically common space. We illustrate why this can be useful for the interpretation of the PC-estimated factors, developing an asymptotic theory without rotation matrices and avoiding singularity issues in factor-augmented regressions."}, "https://arxiv.org/abs/2408.11718": {"title": "Scalable and non-iterative graphical model estimation", "link": "https://arxiv.org/abs/2408.11718", "description": "arXiv:2408.11718v1 Announce Type: new \nAbstract: Graphical models have found widespread applications in many areas of modern statistics and machine learning. Iterative Proportional Fitting (IPF) and its variants have become the default method for undirected graphical model estimation, and are thus ubiquitous in the field. As the IPF is an iterative approach, it is not always readily scalable to modern high-dimensional data regimes. In this paper we propose a novel and fast non-iterative method for positive definite graphical model estimation in high dimensions, one that directly addresses the shortcomings of IPF and its variants. In addition, the proposed method has a number of other attractive properties. First, we show formally that as the dimension p grows, the proportion of graphs for which the proposed method will outperform the state-of-the-art in terms of computational complexity and performance tends to 1, affirming its efficacy in modern settings. Second, the proposed approach can be readily combined with scalable non-iterative thresholding-based methods for high-dimensional sparsity selection. Third, the proposed method has high-dimensional statistical guarantees. Moreover, our numerical experiments also show that the proposed method achieves scalability without compromising on statistical precision. Fourth, unlike the IPF, which depends on the Gaussian likelihood, the proposed method is much more robust."}, "https://arxiv.org/abs/2408.11803": {"title": "Bayesian Nonparametric Risk Assessment in Developmental Toxicity Studies with Ordinal Responses", "link": "https://arxiv.org/abs/2408.11803", "description": "arXiv:2408.11803v1 Announce Type: new \nAbstract: We develop a nonparametric Bayesian modeling framework for clustered ordinal responses in developmental toxicity studies, which typically exhibit extensive heterogeneity. The primary focus of these studies is to examine the dose-response relationship, which is depicted by the (conditional) probability of an endpoint across the dose (toxin) levels. Standard parametric approaches, limited in terms of the response distribution and/or the dose-response relationship, hinder reliable uncertainty quantification in this context. We propose nonparametric mixture models that are built from dose-dependent stick-breaking process priors, leveraging the continuation-ratio logits representation of the multinomial distribution to formulate the mixture kernel. 
We further elaborate the modeling approach, amplifying the mixture models with an overdispersed kernel which offers enhanced control of variability. We conduct a simulation study to demonstrate the benefits of both the discrete nonparametric mixing structure and the overdispersed kernel in delivering coherent uncertainty quantification. Further illustration is provided with different forms of risk assessment, using data from a toxicity experiment on the effects of ethylene glycol."}, "https://arxiv.org/abs/2408.11808": {"title": "Distance Correlation in Multiple Biased Sampling Models", "link": "https://arxiv.org/abs/2408.11808", "description": "arXiv:2408.11808v1 Announce Type: new \nAbstract: Testing the independence between random vectors is a fundamental problem in statistics. Distance correlation, a recently popular dependence measure, is universally consistent for testing independence against all distributions with finite moments. However, when data are subject to selection bias or collected from multiple sources or schemes, spurious dependence may arise. This creates a need for methods that can effectively utilize data from different sources and correct these biases. In this paper, we study the estimation of distance covariance and distance correlation under multiple biased sampling models, which provide a natural framework for addressing these issues. Theoretical properties, including the strong consistency and asymptotic null distributions of the distance covariance and correlation estimators, and the rate at which the test statistic diverges under sequences of alternatives approaching the null, are established. A weighted permutation procedure is proposed to determine the critical value of the independence test. Simulation studies demonstrate that our approach improves both the estimation of distance correlation and the power of the test."}, "https://arxiv.org/abs/2408.11164": {"title": "The Ensemble Epanechnikov Mixture Filter", "link": "https://arxiv.org/abs/2408.11164", "description": "arXiv:2408.11164v1 Announce Type: cross \nAbstract: In the high-dimensional setting, Gaussian mixture kernel density estimates become increasingly suboptimal. In this work we aim to show that it is practical to instead use the optimal multivariate Epanechnikov kernel. We make use of this optimal Epanechnikov mixture kernel density estimate for the sequential filtering scenario through what we term the ensemble Epanechnikov mixture filter (EnEMF). We provide a practical implementation of the EnEMF that is as cost efficient as the comparable ensemble Gaussian mixture filter. We show on a static example that the EnEMF is robust to growth in dimension, and also that the EnEMF has a significant reduction in error per particle on the 40-variable Lorenz '96 system."}, "https://arxiv.org/abs/2408.11753": {"title": "Small Sample Behavior of Wasserstein Projections, Connections to Empirical Likelihood, and Other Applications", "link": "https://arxiv.org/abs/2408.11753", "description": "arXiv:2408.11753v1 Announce Type: cross \nAbstract: The empirical Wasserstein projection (WP) distance quantifies the Wasserstein distance from the empirical distribution to a set of probability measures satisfying given expectation constraints. 
The WP is a powerful tool because it mitigates the curse of dimensionality inherent in the Wasserstein distance, making it valuable for various tasks, including constructing statistics for hypothesis testing, optimally selecting the ambiguity size in Wasserstein distributionally robust optimization, and studying algorithmic fairness. While the weak convergence analysis of the WP as the sample size $n$ grows is well understood, higher-order (i.e., sharp) asymptotics of WP remain unknown. In this paper, we study the second-order asymptotic expansion and the Edgeworth expansion of WP, both expressed as power series of $n^{-1/2}$. These expansions are essential to develop improved confidence level accuracy and a power expansion analysis for the WP-based tests for moment equations null against local alternative hypotheses. As a by-product, we obtain insightful criteria for comparing the power of the Empirical Likelihood and Hotelling's $T^2$ tests against the WP-based test. This insight provides the first comprehensive guideline for selecting the most powerful local test among WP-based, empirical-likelihood-based, and Hotelling's $T^2$ tests for a null. Furthermore, we introduce Bartlett-type corrections to improve the approximation to WP distance quantiles and, thus, improve the coverage in WP applications."}, "https://arxiv.org/abs/1511.04745": {"title": "The matryoshka doll prior: principled penalization in Bayesian selection", "link": "https://arxiv.org/abs/1511.04745", "description": "arXiv:1511.04745v2 Announce Type: replace \nAbstract: This paper introduces a general and principled construction of model space priors with a focus on regression problems. The proposed formulation regards each model as a ``local'' null hypothesis whose alternatives are the set of models that nest it. A simple proportionality principle yields a natural isomorphism of model spaces induced by conditioning on predictor inclusion before or after observing data. This isomorphism produces the Poisson distribution as the unique limiting distribution over model dimension under mild assumptions. We compare this model space prior theoretically and in simulations to widely adopted Beta-Binomial constructions and show that the proposed prior yields a ``just-right'' penalization profile."}, "https://arxiv.org/abs/2304.03809": {"title": "Estimating Shapley Effects in Big-Data Emulation and Regression Settings using Bayesian Additive Regression Trees", "link": "https://arxiv.org/abs/2304.03809", "description": "arXiv:2304.03809v2 Announce Type: replace \nAbstract: Shapley effects are a particularly interpretable approach to assessing how a function depends on its various inputs. The existing literature contains various estimators for this class of sensitivity indices in the context of nonparametric regression where the function is observed with noise, but there does not seem to be an estimator that is computationally tractable for input dimensions in the hundreds scale. This article provides such an estimator that is computationally tractable on this scale. The estimator uses a metamodel-based approach by first fitting a Bayesian Additive Regression Trees model which is then used to compute Shapley-effect estimates. This article also establishes a theoretical guarantee of posterior consistency on a large function class for this Shapley-effect estimator. 
Finally, this paper explores the performance of these Shapley-effect estimators on four different test functions for various input dimensions, including $p=500$."}, "https://arxiv.org/abs/2306.01566": {"title": "Fatigue detection via sequential testing of biomechanical data using martingale statistic", "link": "https://arxiv.org/abs/2306.01566", "description": "arXiv:2306.01566v2 Announce Type: replace \nAbstract: Injuries to the knee joint are very common for long-distance and frequent runners, an issue which is often attributed to fatigue. We address the problem of fatigue detection from biomechanical data from different sources, consisting of lower extremity joint angles and ground reaction forces from running athletes, with the goal of better understanding the impact of fatigue on the biomechanics of runners in general and on an individual level. This is done by sequentially testing for change in a datastream using a simple martingale test statistic. Time-uniform probabilistic martingale bounds are provided which are used as thresholds for the test statistic. Sharp bounds can be developed using a hybrid of a piece-wise linear bound and a law of iterated logarithm bound over all time regimes, where the probability of an early detection is controlled in a uniform way. If the underlying distribution of the data gradually changes over the course of a run, then a timely upcrossing of the martingale over these bounds is expected. The methods are developed for a setting when change sets in gradually in an incoming stream of data. Parameter selection for the bounds is based on simulations, and a methodological comparison is made with respect to existing advances. The algorithms presented here can be easily adapted to an online change-detection setting. Finally, we provide a detailed data analysis based on extensive measurements of several athletes and benchmark the fatigue detection results with the runners' individual feedback over the course of the data collection. Qualitative conclusions on the biomechanical profiles of the athletes can be made based on the shape of the martingale trajectories even in the absence of an upcrossing of the threshold."}, "https://arxiv.org/abs/2311.01638": {"title": "Inference on summaries of a model-agnostic longitudinal variable importance trajectory with application to suicide prevention", "link": "https://arxiv.org/abs/2311.01638", "description": "arXiv:2311.01638v2 Announce Type: replace \nAbstract: Risk of suicide attempt varies over time. Understanding the importance of risk factors measured at a mental health visit can help clinicians evaluate future risk and provide appropriate care during the visit. In prediction settings where data are collected over time, such as in mental health care, it is often of interest to understand both the importance of variables for predicting the response at each time point and the importance summarized over the time series. Building on recent advances in estimation and inference for variable importance measures, we define summaries of variable importance trajectories and corresponding estimators. The same approaches for inference can be applied to these measures regardless of the choice of the algorithm(s) used to estimate the prediction function. We propose a nonparametric efficient estimation and inference procedure as well as a null hypothesis testing procedure that are valid even when complex machine learning tools are used for prediction. 
Through simulations, we demonstrate that our proposed procedures have good operating characteristics. We use these approaches to analyze electronic health records data from two large health systems to investigate the longitudinal importance of risk factors for suicide attempt to inform future suicide prevention research and clinical workflow."}, "https://arxiv.org/abs/2401.07000": {"title": "Counterfactual Slopes and Their Applications in Social Stratification", "link": "https://arxiv.org/abs/2401.07000", "description": "arXiv:2401.07000v2 Announce Type: replace \nAbstract: This paper addresses two prominent theses in social stratification research, the great equalizer thesis and Mare's (1980) school transition thesis. Both theses are premised on a descriptive regularity: the association between socioeconomic background and an outcome variable changes when conditioning on an intermediate treatment. However, if the descriptive regularity is driven by differential selection into treatment, then the two theses do not have substantive interpretations. We propose a set of novel counterfactual slope estimands, which capture the two theses under the hypothetical scenario where differential selection into treatment is eliminated. Thus, we use the counterfactual slopes to construct selection-free tests for the two theses. Compared with the existing literature, we are the first to explicitly provide nonparametric and causal estimands, which enable us to conduct more principled analysis. We are also the first to develop flexible, efficient, and robust estimators for the two theses based on efficient influence functions. We apply our framework to a nationally representative dataset in the United States and re-evaluate the two theses. Findings from our selection-free tests show that the descriptive regularity is sometimes misleading for substantive interpretations."}, "https://arxiv.org/abs/2212.05053": {"title": "Joint Spectral Clustering in Multilayer Degree-Corrected Stochastic Blockmodels", "link": "https://arxiv.org/abs/2212.05053", "description": "arXiv:2212.05053v2 Announce Type: replace-cross \nAbstract: Modern network datasets are often composed of multiple layers, either as different views, time-varying observations, or independent sample units, resulting in collections of networks over the same set of vertices but with potentially different connectivity patterns on each network. These data require models and methods that are flexible enough to capture local and global differences across the networks, while at the same time being parsimonious and tractable to yield computationally efficient and theoretically sound solutions that are capable of aggregating information across the networks. This paper considers the multilayer degree-corrected stochastic blockmodel, where a collection of networks share the same community structure, but degree-corrections and block connection probability matrices are permitted to be different. We establish the identifiability of this model and propose a spectral clustering algorithm for community detection in this setting. Our theoretical results demonstrate that the misclustering error rate of the algorithm improves exponentially with multiple network realizations, even in the presence of significant layer heterogeneity with respect to degree corrections, signal strength, and spectral properties of the block connection probability matrices. 
Simulation studies show that this approach improves on existing multilayer community detection methods in this challenging regime. Furthermore, in a case study of US airport data through January 2016 -- September 2021, we find that this methodology identifies meaningful community structure and trends in airport popularity influenced by pandemic impacts on travel."}, "https://arxiv.org/abs/2308.09104": {"title": "Spike-and-slab shrinkage priors for structurally sparse Bayesian neural networks", "link": "https://arxiv.org/abs/2308.09104", "description": "arXiv:2308.09104v2 Announce Type: replace-cross \nAbstract: Network complexity and computational efficiency have become increasingly significant aspects of deep learning. Sparse deep learning addresses these challenges by recovering a sparse representation of the underlying target function by reducing heavily over-parameterized deep neural networks. Specifically, deep neural architectures compressed via structured sparsity (e.g. node sparsity) provide low latency inference, higher data throughput, and reduced energy consumption. In this paper, we explore two well-established shrinkage techniques, Lasso and Horseshoe, for model compression in Bayesian neural networks. To this end, we propose structurally sparse Bayesian neural networks which systematically prune excessive nodes with (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS) priors, and develop computationally tractable variational inference including continuous relaxation of Bernoulli variables. We establish the contraction rates of the variational posterior of our proposed models as a function of the network topology, layer-wise node cardinalities, and bounds on the network weights. We empirically demonstrate the competitive performance of our models compared to the baseline models in prediction accuracy, model compression, and inference latency."}, "https://arxiv.org/abs/2408.11951": {"title": "SPORTSCausal: Spill-Over Time Series Causal Inference", "link": "https://arxiv.org/abs/2408.11951", "description": "arXiv:2408.11951v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) have long been the gold standard for causal inference across various fields, including business analysis, economic studies, sociology, clinical research, and network learning. The primary advantage of RCTs over observational studies lies in their ability to significantly reduce noise from individual variance. However, RCTs depend on strong assumptions, such as group independence, time independence, and group randomness, which are not always feasible in real-world applications. Traditional inferential methods, including analysis of covariance (ANCOVA), often fail when these assumptions do not hold. In this paper, we propose a novel approach named \\textbf{Sp}ill\\textbf{o}ve\\textbf{r} \\textbf{T}ime \\textbf{S}eries \\textbf{Causal} (\\verb+SPORTSCausal+), which enables the estimation of treatment effects without relying on these stringent assumptions. We demonstrate the practical applicability of \\verb+SPORTSCausal+ through a real-world budget-control experiment. In this experiment, data was collected from both a 5\\% live experiment and a 50\\% live experiment using the same treatment. 
Due to the spillover effect, the vanilla estimation of the treatment effect was not robust across different treatment sizes, whereas \\verb+SPORTSCausal+ provided a robust estimation."}, "https://arxiv.org/abs/2408.11994": {"title": "Fast and robust cross-validation-based scoring rule inference for spatial statistics", "link": "https://arxiv.org/abs/2408.11994", "description": "arXiv:2408.11994v1 Announce Type: new \nAbstract: Scoring rules are aimed at evaluation of the quality of predictions, but can also be used for estimation of parameters in statistical models. We propose estimating parameters of multivariate spatial models by maximising the average leave-one-out cross-validation score. This method, LOOS, thus optimises predictions instead of maximising the likelihood. The method allows for fast computations for Gaussian models with sparse precision matrices, such as spatial Markov models. It also makes it possible to tailor the estimator's robustness to outliers and their sensitivity to spatial variations of uncertainty through the choice of the scoring rule which is used in the maximisation. The effects of the choice of scoring rule which is used in LOOS are studied by simulation in terms of computation time, statistical efficiency, and robustness. Various popular scoring rules and a new scoring rule, the root score, are compared to maximum likelihood estimation. The results confirmed that for spatial Markov models the computation time for LOOS was much smaller than for maximum likelihood estimation. Furthermore, the standard deviations of parameter estimates were smaller for maximum likelihood estimation, although the differences often were small. The simulations also confirmed that the usage of a robust scoring rule results in robust LOOS estimates and that the robustness provides better predictive quality for spatial data with outliers. Finally, the new inference method was applied to ERA5 temperature reanalysis data for the contiguous United States and the average July temperature for the years 1940 to 2023, and this showed that the LOOS estimator provided parameter estimates that were more than a hundred times faster to compute compared to maximum-likelihood estimation, and resulted in a model with better predictive performance."}, "https://arxiv.org/abs/2408.12078": {"title": "L1 Prominence Measures for Directed Graphs", "link": "https://arxiv.org/abs/2408.12078", "description": "arXiv:2408.12078v1 Announce Type: new \nAbstract: We introduce novel measures, L1 prestige and L1 centrality, for quantifying the prominence of each vertex in a strongly connected and directed graph by utilizing the concept of L1 data depth (Vardi and Zhang, Proc. Natl. Acad. Sci. U.S.A.\\ 97(4):1423--1426, 2000). The former measure quantifies the degree of prominence of each vertex in receiving choices, whereas the latter measure evaluates the degree of importance in giving choices. The proposed measures can handle graphs with both edge and vertex weights, as well as undirected graphs. However, examining a graph using a measure defined over a single `scale' inevitably leads to a loss of information, as each vertex may exhibit distinct structural characteristics at different levels of locality. To this end, we further develop local versions of the proposed measures with a tunable locality parameter. Using these tools, we present a multiscale network analysis framework that provides much richer structural information about each vertex than a single-scale inspection. 
By applying the proposed measures to the networks constructed from the Seoul Mobility Flow Data, it is demonstrated that these measures accurately depict and uncover the inherent characteristics of individual city regions."}, "https://arxiv.org/abs/2408.12098": {"title": "Temporal discontinuity trials and randomization: success rates versus design strength", "link": "https://arxiv.org/abs/2408.12098", "description": "arXiv:2408.12098v1 Announce Type: new \nAbstract: We consider the following comparative effectiveness scenario. There are two treatments for a particular medical condition: a randomized experiment has demonstrated mediocre effectiveness for the first treatment, while a non-randomized study of the second treatment reports a much higher success rate. On what grounds might one justifiably prefer the second treatment over the first treatment, given only the information from those two studies, including design details? This situation occurs in reality and warrants study. We consider a particular example involving studies of treatments for Crohn's disease. In order to help resolve these cases of asymmetric evidence, we make three contributions and apply them to our example. First, we demonstrate the potential to improve success rates above those found in a randomized trial, given heterogeneous effects. Second, we prove that deliberate treatment assignment can be more efficient than randomization when study results are to be transported to formulate an intervention policy on a wider population. Third, we provide formal conditions under which a temporal-discontinuity design approximates a randomized trial, and we introduce a novel design parameter to inform researchers about the strength of that approximation. Overall, our results indicate that while randomization certainly provides special advantages, other study designs such as temporal-discontinuity designs also have distinct advantages, and can produce valuable evidence that informs treatment decisions and intervention policy."}, "https://arxiv.org/abs/2408.12272": {"title": "Decorrelated forward regression for high dimensional data analysis", "link": "https://arxiv.org/abs/2408.12272", "description": "arXiv:2408.12272v1 Announce Type: new \nAbstract: Forward regression is a crucial methodology for automatically identifying important predictors from a large pool of potential covariates. In contexts with moderate predictor correlation, forward selection techniques can achieve screening consistency. However, this property gradually becomes invalid in the presence of substantially correlated variables, especially in high-dimensional datasets where strong correlations exist among predictors. This dilemma is encountered by other model selection methods in literature as well. To address these challenges, we introduce a novel decorrelated forward (DF) selection framework for generalized mean regression models, including prevalent models, such as linear, logistic, Poisson, and quasi likelihood. The DF selection framework stands out because of its ability to convert generalized mean regression models into linear ones, thus providing a clear interpretation of the forward selection process. It also offers a closed-form expression for forward iteration, to improve practical applicability and efficiency. Theoretically, we establish the screening consistency of DF selection and determine the upper bound of the selected submodel's size. 
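As background to the decorrelated forward (DF) selection abstract above (arXiv:2408.12272), the following sketch shows plain greedy forward selection for a linear model, adding at each step the predictor that most reduces the residual sum of squares. It is the classical procedure that DF refines; the decorrelation step and the closed-form DF iteration are not implemented here, and the simulated data and function name are illustrative only.

```python
# Minimal greedy forward selection for a linear model: at each step, add the
# predictor that most reduces the residual sum of squares.
import numpy as np

rng = np.random.default_rng(1)
n, p, k_true = 200, 50, 3
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:k_true] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=n)

def forward_select(X, y, max_steps=10):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_steps):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef, rss, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = rss[0] if rss.size else np.sum((y - X[:, cols] @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

print("selection path:", forward_select(X, y, max_steps=5))
```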
To reduce computational burden, we develop a thresholding DF algorithm that provides a stopping rule for the forward-searching process. Simulations and two real data applications show the outstanding performance of our method compared with some existing model selection methods."}, "https://arxiv.org/abs/2408.12286": {"title": "Momentum Informed Inflation-at-Risk", "link": "https://arxiv.org/abs/2408.12286", "description": "arXiv:2408.12286v1 Announce Type: new \nAbstract: Growth-at-Risk has recently become a key measure of macroeconomic tail-risk, which has seen it be researched extensively. Surprisingly, the same cannot be said for Inflation-at-Risk where both tails, deflation and high inflation, are of key concern to policymakers, which has seen comparatively much less research. This paper will tackle this gap and provide estimates for Inflation-at-Risk. The key insight of the paper is that inflation is best characterised by a combination of two types of nonlinearities: quantile variation, and conditioning on the momentum of inflation."}, "https://arxiv.org/abs/2408.12339": {"title": "Inference for decorated graphs and application to multiplex networks", "link": "https://arxiv.org/abs/2408.12339", "description": "arXiv:2408.12339v1 Announce Type: new \nAbstract: A graphon is a limiting object used to describe the behaviour of large networks through a function that captures the probability of edge formation between nodes. Although the merits of graphons to describe large and unlabelled networks are clear, they traditionally are used for describing only binary edge information, which limits their utility for more complex relational data. Decorated graphons were introduced to extend the graphon framework by incorporating richer relationships, such as edge weights and types. This specificity in modelling connections provides more granular insight into network dynamics. Yet, there are no existing inference techniques for decorated graphons. We develop such an estimation method, extending existing techniques from traditional graphon estimation to accommodate these richer interactions. We derive the rate of convergence for our method and show that it is consistent with traditional non-parametric theory when the decoration space is finite. Simulations confirm that these theoretical rates are achieved in practice. Our method, tested on synthetic and empirical data, effectively captures additional edge information, resulting in improved network models. This advancement extends the scope of graphon estimation to encompass more complex networks, such as multiplex networks and attributed graphs, thereby increasing our understanding of their underlying structures."}, "https://arxiv.org/abs/2408.12347": {"title": "Preregistration does not improve the transparent evaluation of severity in Popper's philosophy of science or when deviations are allowed", "link": "https://arxiv.org/abs/2408.12347", "description": "arXiv:2408.12347v1 Announce Type: new \nAbstract: One justification for preregistering research hypotheses, methods, and analyses is that it improves the transparent evaluation of the severity of hypothesis tests. In this article, I consider two cases in which preregistration does not improve this evaluation. First, I argue that, although preregistration can facilitate the transparent evaluation of severity in Mayo's error statistical philosophy of science, it does not facilitate this evaluation in Popper's theory-centric philosophy. 
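The Inflation-at-Risk abstract above (arXiv:2408.12286) combines quantile variation with conditioning on inflation momentum. A minimal sketch of that idea, assuming simulated data and a crude momentum indicator (whether inflation rose last period), is to fit quantile regressions of inflation on lagged inflation and the momentum dummy at both tails; none of this reproduces the paper's actual specification.

```python
# Sketch of an Inflation-at-Risk style regression: conditional quantiles of
# inflation given lagged inflation and a simple "momentum" indicator, so
# that the deflation and high-inflation tails can be examined separately.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 400
infl = np.zeros(T)
for t in range(1, T):
    infl[t] = 0.5 + 0.8 * infl[t - 1] + rng.standard_t(df=5)

df = pd.DataFrame({"infl": infl})
df["lag1"] = df["infl"].shift(1)
df["momentum"] = (df["infl"].shift(1) > df["infl"].shift(2)).astype(float)
df = df.dropna()

X = sm.add_constant(df[["lag1", "momentum"]])
for q in (0.05, 0.5, 0.95):
    fit = sm.QuantReg(df["infl"], X).fit(q=q)
    print(f"quantile {q:.2f}:", fit.params.round(3).to_dict())
```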
To illustrate, I show that associated concerns about Type I error rate inflation are only relevant in the error statistical approach and not in a theory-centric approach. Second, I argue that a preregistered test procedure that allows deviations in its implementation does not provide a more transparent evaluation of Mayoian severity than a non-preregistered procedure. In particular, I argue that sample-based validity-enhancing deviations cause an unknown inflation of the test procedure's Type I (familywise) error rate and, consequently, an unknown reduction in its capability to license inferences severely. I conclude that preregistration does not improve the transparent evaluation of severity in Popper's philosophy of science or when deviations are allowed."}, "https://arxiv.org/abs/2408.12482": {"title": "Latent Gaussian Graphical Models with Golazo Penalty", "link": "https://arxiv.org/abs/2408.12482", "description": "arXiv:2408.12482v1 Announce Type: new \nAbstract: The existence of latent variables in practical problems is common, for example when some variables are difficult or expensive to measure, or simply unknown. When latent variables are unaccounted for, structure learning for Gaussian graphical models can be blurred by additional correlation between the observed variables that is incurred by the latent variables. A standard approach for this problem is a latent version of the graphical lasso that splits the inverse covariance matrix into a sparse and a low-rank part that are penalized separately. In this paper we propose a generalization of this via the flexible Golazo penalty. This allows us to introduce latent versions of for example the adaptive lasso, positive dependence constraints or predetermined sparsity patterns, and combinations of those. We develop an algorithm for the latent Gaussian graphical model with the Golazo penalty and demonstrate it on simulated and real data."}, "https://arxiv.org/abs/2408.12541": {"title": "Clarifying the Role of the Mantel-Haenszel Risk Difference Estimator in Randomized Clinical Trials", "link": "https://arxiv.org/abs/2408.12541", "description": "arXiv:2408.12541v1 Announce Type: new \nAbstract: The Mantel-Haenszel (MH) risk difference estimator, commonly used in randomized clinical trials for binary outcomes, calculates a weighted average of stratum-specific risk difference estimators. Traditionally, this method requires the stringent assumption that risk differences are homogeneous across strata, also known as the common risk difference assumption. In our article, we relax this assumption and adopt a modern perspective, viewing the MH risk difference estimator as an approach for covariate adjustment in randomized clinical trials, distinguishing its use from that in meta-analysis and observational studies. We demonstrate that the MH risk difference estimator consistently estimates the average treatment effect within a standard super-population framework, which is often the primary interest in randomized clinical trials, in addition to estimating a weighted average of stratum-specific risk difference. We rigorously study its properties under both the large-stratum and sparse-stratum asymptotic regimes. Furthermore, for either estimand, we propose a unified robust variance estimator that improves over the popular variance estimators by Greenland and Robins (1985) and Sato et al. (1989) and has provable consistency across both asymptotic regimes, regardless of assuming common risk differences. 
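For reference, the Mantel-Haenszel risk difference estimator discussed above (arXiv:2408.12541) is a weighted average of stratum-specific risk differences with weights n_{1k} n_{0k} / n_k. A minimal sketch with toy counts follows; the paper's unified robust variance estimator is not reproduced.

```python
# Mantel-Haenszel risk difference: weighted average of stratum-specific
# risk differences with weights n1_k * n0_k / n_k.
import numpy as np

# per-stratum counts: events and group sizes (toy numbers)
x1 = np.array([12, 30, 8])    # events, treatment arm
n1 = np.array([50, 120, 40])  # size, treatment arm
x0 = np.array([20, 45, 15])   # events, control arm
n0 = np.array([55, 115, 45])  # size, control arm

def mh_risk_difference(x1, n1, x0, n0):
    w = n1 * n0 / (n1 + n0)          # Mantel-Haenszel weights
    rd_k = x1 / n1 - x0 / n0         # stratum-specific risk differences
    return np.sum(w * rd_k) / np.sum(w)

print("MH risk difference: %.4f" % mh_risk_difference(x1, n1, x0, n0))
```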
Extensions of our theoretical results also provide new insights into the Cochran-Mantel-Haenszel test and the post-stratification estimator. Our findings are thoroughly validated through simulations and a clinical trial example."}, "https://arxiv.org/abs/2408.12577": {"title": "Integrating an agent-based behavioral model in microtransit forecasting and revenue management", "link": "https://arxiv.org/abs/2408.12577", "description": "arXiv:2408.12577v1 Announce Type: new \nAbstract: As an IT-enabled multi-passenger mobility service, microtransit has the potential to improve accessibility, reduce congestion, and enhance flexibility in transportation options. However, due to its heterogeneous impacts on different communities and population segments, there is a need for better tools in microtransit forecast and revenue management, especially when actual usage data are limited. We propose a novel framework based on an agent-based mixed logit model estimated with microtransit usage data and synthetic trip data. The framework involves estimating a lower-branch mode choice model with synthetic trip data, combining lower-branch parameters with microtransit data to estimate an upper-branch ride pass subscription model, and applying the nested model to evaluate microtransit pricing and subsidy policies. The framework enables further decision-support analysis to consider diverse travel patterns and heterogeneous tastes of the total population. We test the framework in a case study with synthetic trip data from Replica Inc. and microtransit data from Arlington Via. The lower-branch model result in a rho-square value of 0.603 on weekdays and 0.576 on weekends. Predictions made by the upper-branch model closely match the marginal subscription data. In a ride pass pricing policy scenario, we show that a discount in weekly pass (from $25 to $18.9) and monthly pass (from $80 to $71.5) would surprisingly increase total revenue by $102/day. In an event- or place-based subsidy policy scenario, we show that a 100% fare discount would reduce 80 car trips during peak hours at AT&T Stadium, requiring a subsidy of $32,068/year."}, "https://arxiv.org/abs/2408.11967": {"title": "Valuing an Engagement Surface using a Large Scale Dynamic Causal Model", "link": "https://arxiv.org/abs/2408.11967", "description": "arXiv:2408.11967v1 Announce Type: cross \nAbstract: With recent rapid growth in online shopping, AI-powered Engagement Surfaces (ES) have become ubiquitous across retail services. These engagement surfaces perform an increasing range of functions, including recommending new products for purchase, reminding customers of their orders and providing delivery notifications. Understanding the causal effect of engagement surfaces on value driven for customers and businesses remains an open scientific question. In this paper, we develop a dynamic causal model at scale to disentangle value attributable to an ES, and to assess its effectiveness. 
We demonstrate the application of this model to inform business decision-making by understanding returns on investment in the ES, and identifying product lines and features where the ES adds the most value."}, "https://arxiv.org/abs/2408.12004": {"title": "CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies", "link": "https://arxiv.org/abs/2408.12004", "description": "arXiv:2408.12004v1 Announce Type: cross \nAbstract: When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising. Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements, so too often they must revert to the baseline to maintain safety. We overcome these issues by leveraging the most powerful safety test in the asymptotic regime and allowing for multiple candidates to be tested for improvement over the baseline. We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level, even in moderate sample sizes. We present CSPI and CSPI-MT, two novel heuristics for selecting cutoff(s) to maximize the policy improvement from baseline. We demonstrate through both synthetic and external datasets that our approaches improve both the detection rates of safe policies and the realized improvement, particularly under stringent safety requirements and low signal-to-noise conditions."}, "https://arxiv.org/abs/2408.12014": {"title": "An Econometric Analysis of Large Flexible Cryptocurrency-mining Consumers in Electricity Markets", "link": "https://arxiv.org/abs/2408.12014", "description": "arXiv:2408.12014v1 Announce Type: cross \nAbstract: In recent years, power grids have seen a surge in large cryptocurrency mining firms, with individual consumption levels reaching 700MW. This study examines the behavior of these firms in Texas, focusing on how their consumption is influenced by cryptocurrency conversion rates, electricity prices, local weather, and other factors. We transform the skewed electricity consumption data of these firms, perform correlation analysis, and apply a seasonal autoregressive moving average model for analysis. Our findings reveal that, surprisingly, short-term mining electricity consumption is not correlated with cryptocurrency conversion rates. Instead, the primary influencers are the temperature and electricity prices. These firms also respond to avoid transmission and distribution network (T\\&D) charges -- famously known as four Coincident peak (4CP) charges -- during summer times. As the scale of these firms is likely to surge in future years, the developed electricity consumption model can be used to generate public, synthetic datasets to understand the overall impact on power grid. 
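The cryptocurrency-mining abstract above (arXiv:2408.12014) describes transforming skewed consumption data and fitting a seasonal autoregressive moving average model with drivers such as temperature and electricity prices. A hedged sketch of that kind of fit, on simulated hourly data and with placeholder model orders (not the paper's specification), is:

```python
# Illustrative seasonal ARMA fit for (log-transformed) electricity
# consumption with temperature and price as exogenous regressors.
# The data and the (1,0,1)x(1,0,1,24) order are placeholders.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
T = 24 * 30  # 30 days of hourly data
hour = np.arange(T) % 24
temperature = 25 + 8 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, T)
price = 40 + 10 * rng.random(T)
load = (500 - 2.0 * price + 3.0 * temperature
        + 20 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 5, T))

y = pd.Series(np.log(load))          # log-transform the skewed consumption
exog = pd.DataFrame({"temperature": temperature, "price": price})

model = SARIMAX(y, exog=exog, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24))
fit = model.fit(disp=False)
print(fit.params[["temperature", "price"]])
```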
The developed model could also lead to better pricing mechanisms to effectively use the flexibility of these resources towards improving power grid reliability."}, "https://arxiv.org/abs/2408.12210": {"title": "Enhancing Causal Discovery in Financial Networks with Piecewise Quantile Regression", "link": "https://arxiv.org/abs/2408.12210", "description": "arXiv:2408.12210v1 Announce Type: cross \nAbstract: Financial networks can be constructed using statistical dependencies found within the price series of speculative assets. Across the various methods used to infer these networks, there is a general reliance on predictive modelling to capture cross-correlation effects. These methods usually model the flow of mean-response information, or the propagation of volatility and risk within the market. Such techniques, though insightful, don't fully capture the broader distribution-level causality that is possible within speculative markets. This paper introduces a novel approach, combining quantile regression with a piecewise linear embedding scheme - allowing us to construct causality networks that identify the complex tail interactions inherent to financial markets. Applying this method to 260 cryptocurrency return series, we uncover significant tail-tail causal effects and substantial causal asymmetry. We identify a propensity for coins to be self-influencing, with comparatively sparse cross variable effects. Assessing all link types in conjunction, Bitcoin stands out as the primary influencer - a nuance that is missed in conventional linear mean-response analyses. Our findings introduce a comprehensive framework for modelling distributional causality, paving the way towards more holistic representations of causality in financial markets."}, "https://arxiv.org/abs/2408.12288": {"title": "Demystifying Functional Random Forests: Novel Explainability Tools for Model Transparency in High-Dimensional Spaces", "link": "https://arxiv.org/abs/2408.12288", "description": "arXiv:2408.12288v1 Announce Type: cross \nAbstract: The advent of big data has raised significant challenges in analysing high-dimensional datasets across various domains such as medicine, ecology, and economics. Functional Data Analysis (FDA) has proven to be a robust framework for addressing these challenges, enabling the transformation of high-dimensional data into functional forms that capture intricate temporal and spatial patterns. However, despite advancements in functional classification methods and very high performance demonstrated by combining FDA and ensemble methods, a critical gap persists in the literature concerning the transparency and interpretability of black-box models, e.g. Functional Random Forests (FRF). In response to this need, this paper introduces a novel suite of explainability tools to illuminate the inner mechanisms of FRF. We propose using Functional Partial Dependence Plots (FPDPs), Functional Principal Component (FPC) Probability Heatmaps, various model-specific and model-agnostic FPCs' importance metrics, and the FPC Internal-External Importance and Explained Variance Bubble Plot. These tools collectively enhance the transparency of FRF models by providing a detailed analysis of how individual FPCs contribute to model predictions. 
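As a rough analogue of the FPC importance metrics described in the Functional Random Forests abstract above (arXiv:2408.12288), one can project discretised curves onto principal components (a stand-in for functional PCA), fit a random forest on the scores, and compute standard permutation importance for each component. The ad-hoc conditional permutation scheme and the other explainability tools proposed in the paper are not reproduced; data and settings below are illustrative.

```python
# FPC-like importance for a functional classifier: PCA scores of
# discretised curves, a random forest on the scores, and standard
# permutation importance per component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, grid = 300, np.linspace(0, 1, 100)
labels = rng.integers(0, 2, n)
# two classes of curves differing in the amplitude of one basis function
curves = (np.outer(1 + labels, np.sin(2 * np.pi * grid))
          + np.outer(rng.normal(size=n), np.cos(2 * np.pi * grid))
          + rng.normal(0, 0.3, size=(n, grid.size)))

scores = PCA(n_components=5).fit_transform(curves)   # FPC-like scores
X_tr, X_te, y_tr, y_te = train_test_split(scores, labels, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for k, (m, s) in enumerate(zip(imp.importances_mean, imp.importances_std)):
    print(f"FPC {k + 1}: importance {m:.3f} +/- {s:.3f}")
```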
By applying these methods to an ECG dataset, we demonstrate the effectiveness of these tools in revealing critical patterns and improving the explainability of FRF."}, "https://arxiv.org/abs/2408.12296": {"title": "Multiple testing for signal-agnostic searches of new physics with machine learning", "link": "https://arxiv.org/abs/2408.12296", "description": "arXiv:2408.12296v1 Announce Type: cross \nAbstract: In this work, we address the question of how to enhance signal-agnostic searches by leveraging multiple testing strategies. Specifically, we consider hypothesis tests relying on machine learning, where model selection can introduce a bias towards specific families of new physics signals. We show that it is beneficial to combine different tests, characterised by distinct choices of hyperparameters, and that performances comparable to the best available test are generally achieved while providing a more uniform response to various types of anomalies. Focusing on the New Physics Learning Machine, a methodology to perform a signal-agnostic likelihood-ratio test, we explore a number of approaches to multiple testing, such as combining p-values and aggregating test statistics."}, "https://arxiv.org/abs/2408.12346": {"title": "A logical framework for data-driven reasoning", "link": "https://arxiv.org/abs/2408.12346", "description": "arXiv:2408.12346v1 Announce Type: cross \nAbstract: We introduce and investigate a family of consequence relations with the goal of capturing certain important patterns of data-driven inference. The inspiring idea for our framework is the fact that data may reject, possibly to some degree, and possibly by mistake, any given scientific hypothesis. There is no general agreement in science about how to do this, which motivates putting forward a logical formulation of the problem. We do so by investigating distinct definitions of \"rejection degrees\" each yielding a consequence relation. Our investigation leads to novel variations on the theme of rational consequence relations, prominent among non-monotonic logics."}, "https://arxiv.org/abs/2408.12564": {"title": "Factor Adjusted Spectral Clustering for Mixture Models", "link": "https://arxiv.org/abs/2408.12564", "description": "arXiv:2408.12564v1 Announce Type: cross \nAbstract: This paper studies a factor modeling-based approach for clustering high-dimensional data generated from a mixture of strongly correlated variables. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, wireless sensing, etc., with factor modeling being one of the popular techniques for explaining the common dependence. Standard techniques for clustering high-dimensional data, e.g., naive spectral clustering, often fail to yield insightful results as their performances heavily depend on the mixture components having a weakly correlated structure. To address the clustering problem in the presence of a latent factor model, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which uses an additional data denoising step via eliminating the factor component to cope with the data dependency. We prove this method achieves an exponentially low mislabeling rate, with respect to the signal to noise ratio under a general set of assumptions. Our assumption bridges many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. 
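The multiple-testing abstract above (arXiv:2408.12296) mentions combining p-values and aggregating test statistics across hyperparameter choices. Two generic combinations -- Fisher's method and a Bonferroni-adjusted minimum p-value -- can be sketched as follows (the p-values are placeholders, and the NPLM-specific aggregation is not shown):

```python
# Two simple ways to combine p-values from several signal-agnostic tests
# (e.g. the same test run with different hyperparameters).
import numpy as np
from scipy.stats import combine_pvalues

pvals = np.array([0.04, 0.20, 0.008, 0.55])   # p-values from 4 test configurations

stat, p_fisher = combine_pvalues(pvals, method="fisher")
p_minp = min(1.0, pvals.size * pvals.min())   # Bonferroni bound on the min-p test

print(f"Fisher combination: statistic={stat:.2f}, p={p_fisher:.4f}")
print(f"Min-p (Bonferroni): p={p_minp:.4f}")
```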
The FASC algorithm is also computationally efficient, requiring only near-linear sample complexity with respect to the data dimension. We also show the applicability of the FASC algorithm with real data experiments and numerical studies, and establish that FASC provides significant results in many cases where traditional spectral clustering fails."}, "https://arxiv.org/abs/2105.08868": {"title": "Markov-Restricted Analysis of Randomized Trials with Non-Monotone Missing Binary Outcomes", "link": "https://arxiv.org/abs/2105.08868", "description": "arXiv:2105.08868v2 Announce Type: replace \nAbstract: Scharfstein et al. (2021) developed a sensitivity analysis model for analyzing randomized trials with repeatedly measured binary outcomes that are subject to nonmonotone missingness. Their approach becomes computationally intractable when the number of measurements is large (e.g., greater than 15). In this paper, we repair this problem by introducing mth-order Markovian restrictions. We establish identification results for the joint distribution of the binary outcomes by representing the model as a directed acyclic graph (DAG). We develop a novel estimation strategy for a smooth functional of the joint distrubution. We illustrate our methodology in the context of a randomized trial designed to evaluate a web-delivered psychosocial intervention to reduce substance use, assessed by evaluating abstinence twice weekly for 12 weeks, among patients entering outpatient addiction treatment."}, "https://arxiv.org/abs/2209.07111": {"title": "$\\rho$-GNF: A Copula-based Sensitivity Analysis to Unobserved Confounding Using Normalizing Flows", "link": "https://arxiv.org/abs/2209.07111", "description": "arXiv:2209.07111v2 Announce Type: replace \nAbstract: We propose a novel sensitivity analysis to unobserved confounding in observational studies using copulas and normalizing flows. Using the idea of interventional equivalence of structural causal models, we develop $\\rho$-GNF ($\\rho$-graphical normalizing flow), where $\\rho{\\in}[-1,+1]$ is a bounded sensitivity parameter. This parameter represents the back-door non-causal association due to unobserved confounding, and which is encoded with a Gaussian copula. In other words, the $\\rho$-GNF enables scholars to estimate the average causal effect (ACE) as a function of $\\rho$, while accounting for various assumed strengths of the unobserved confounding. The output of the $\\rho$-GNF is what we denote as the $\\rho_{curve}$ that provides the bounds for the ACE given an interval of assumed $\\rho$ values. In particular, the $\\rho_{curve}$ enables scholars to identify the confounding strength required to nullify the ACE, similar to other sensitivity analysis methods (e.g., the E-value). Leveraging on experiments from simulated and real-world data, we show the benefits of $\\rho$-GNF. One benefit is that the $\\rho$-GNF uses a Gaussian copula to encode the distribution of the unobserved causes, which is commonly used in many applied settings. This distributional assumption produces narrower ACE bounds compared to other popular sensitivity analysis methods."}, "https://arxiv.org/abs/2309.13159": {"title": "Estimating a k-modal nonparametric mixed logit model with market-level data", "link": "https://arxiv.org/abs/2309.13159", "description": "arXiv:2309.13159v2 Announce Type: replace \nAbstract: We propose a group-level agent-based mixed (GLAM) logit model that is estimated using market-level choice share data. 
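A rough sketch of the factor-adjustment idea behind FASC (arXiv:2408.12564), under the assumption that the common factors dominate the top principal components: estimate the factor component by PCA, remove it, and cluster the residuals. This is not the authors' exact algorithm or theory; the simulation below only illustrates why naive clustering can fail under strong factors.

```python
# Factor-adjusted clustering sketch: remove the estimated common factor
# component (top principal components), then run k-means on the residuals.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(5)
n, p, n_factors = 400, 100, 2
true_labels = rng.integers(0, 2, n)
means = np.where(true_labels[:, None] == 1, 0.6, -0.6) * rng.normal(size=(1, p))
factors = rng.normal(size=(n, n_factors))              # strong common factors
loadings = rng.normal(scale=3.0, size=(n_factors, p))
X = means + factors @ loadings + rng.normal(size=(n, p))

# naive clustering on the raw data is dominated by the factor structure
naive = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# factor adjustment: remove the top principal components, then cluster
pca = PCA(n_components=n_factors).fit(X)
X_adj = X - pca.inverse_transform(pca.transform(X))
adjusted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_adj)

print("ARI naive:    %.2f" % adjusted_rand_score(true_labels, naive))
print("ARI adjusted: %.2f" % adjusted_rand_score(true_labels, adjusted))
```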
The model non-parametrically represents taste heterogeneity through market-specific parameters by solving a multiagent inverse utility maximization problem, addressing the limitations of existing market-level choice models with parametric taste heterogeneity. A case study of mode choice in New York State is conducted using synthetic population data of 53.55 million trips made by 19.53 million residents in 2019. These trips are aggregated based on population segments and census block group-level origin-destination (OD) pairs, resulting in 120,740 markets/agents. We benchmark in-sample and out-of-sample predictive performance of the GLAM logit model against multinomial logit, nested logit, inverse product differentiation logit, and random coefficient logit (RCL) models. The results show that GLAM logit outperforms benchmark models, improving the overall in-sample predictive accuracy from 78.7% to 96.71% and out-of-sample accuracy from 65.30% to 81.78%. The price elasticities and diversion ratios retrieved from GLAM logit and benchmark models exhibit similar substitution patterns among the six travel modes. GLAM logit is scalable and computationally efficient, taking less than one-tenth of the time taken to estimate the RCL model. The agent-specific parameters in GLAM logit provide additional insights such as value-of-time (VOT) across segments and regions, which has been further utilized to demonstrate its application in analyzing NYS travelers' mode choice response to the congestion pricing. The agent-specific parameters in GLAM logit facilitate their seamless integration into supply-side optimization models for revenue management and system design."}, "https://arxiv.org/abs/2401.08175": {"title": "Bayesian Function-on-Function Regression for Spatial Functional Data", "link": "https://arxiv.org/abs/2401.08175", "description": "arXiv:2401.08175v2 Announce Type: replace \nAbstract: Spatial functional data arise in many settings, such as particulate matter curves observed at monitoring stations and age population curves at each areal unit. Most existing functional regression models have limited applicability because they do not consider spatial correlations. Although functional kriging methods can predict the curves at unobserved spatial locations, they are based on variogram fittings rather than constructing hierarchical statistical models. In this manuscript, we propose a Bayesian framework for spatial function-on-function regression that can carry out parameter estimations and predictions. However, the proposed model has computational and inferential challenges because the model needs to account for within and between-curve dependencies. Furthermore, high-dimensional and spatially correlated parameters can lead to the slow mixing of Markov chain Monte Carlo algorithms. To address these issues, we first utilize a basis transformation approach to simplify the covariance and apply projection methods for dimension reduction. We also develop a simultaneous band score for the proposed model to detect the significant region in the regression function. 
We apply our method to both areal and point-level spatial functional data, showing the proposed method is computationally efficient and provides accurate estimations and predictions."}, "https://arxiv.org/abs/2204.06544": {"title": "Features of the Earth's seasonal hydroclimate: Characterizations and comparisons across the Koppen-Geiger climates and across continents", "link": "https://arxiv.org/abs/2204.06544", "description": "arXiv:2204.06544v2 Announce Type: replace-cross \nAbstract: Detailed investigations of time series features across climates, continents and variable types can progress our understanding and modelling ability of the Earth's hydroclimate and its dynamics. They can also improve our comprehension of the climate classification systems appearing in their core. Still, such investigations for seasonal hydroclimatic temporal dependence, variability and change are currently missing from the literature. Herein, we propose and apply at the global scale a methodological framework for filling this specific gap. We analyse over 13 000 earth-observed quarterly temperature, precipitation and river flow time series. We adopt the Koppen-Geiger climate classification system and define continental-scale geographical regions for conducting upon them seasonal hydroclimatic feature summaries. The analyses rely on three sample autocorrelation features, a temporal variation feature, a spectral entropy feature, a Hurst feature, a trend strength feature and a seasonality strength feature. We find notable differences to characterize the magnitudes of these features across the various Koppen-Geiger climate classes, as well as between continental-scale geographical regions. We, therefore, deem that the consideration of the comparative summaries could be beneficial in water resources engineering contexts. Lastly, we apply explainable machine learning to compare the investigated features with respect to how informative they are in distinguishing either the main Koppen-Geiger climates or the continental-scale regions. In this regard, the sample autocorrelation, temporal variation and seasonality strength features are found to be more informative than the spectral entropy, Hurst and trend strength features at the seasonal time scale."}, "https://arxiv.org/abs/2206.06885": {"title": "Neural interval-censored survival regression with feature selection", "link": "https://arxiv.org/abs/2206.06885", "description": "arXiv:2206.06885v3 Announce Type: replace-cross \nAbstract: Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances on sparse neural network architectures, ii) a regression model targeting prediction of the interval-censored response. 
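A few of the seasonal time-series features compared in the hydroclimate abstract above (arXiv:2204.06544) can be computed with standard tools, e.g. sample autocorrelations and STL-based trend and seasonality strengths using the common definitions 1 - Var(remainder)/Var(trend + remainder) and 1 - Var(remainder)/Var(seasonal + remainder). These are generic feature definitions and may differ from the paper's exact choices; the quarterly series below is simulated.

```python
# Sketch of seasonal time-series features: sample autocorrelations and
# STL-based trend and seasonality strengths for a simulated quarterly series.
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(6)
n_quarters = 160
season = np.tile([5.0, 12.0, 20.0, 8.0], n_quarters // 4)   # quarterly cycle
trend = np.linspace(10, 13, n_quarters)
y = trend + season + rng.normal(0, 2, n_quarters)

rho = acf(y, nlags=4, fft=True)
res = STL(y, period=4).fit()
r, t, s = res.resid, res.trend, res.seasonal
trend_strength = max(0.0, 1 - r.var() / (t + r).var())
seasonal_strength = max(0.0, 1 - r.var() / (s + r).var())

print("lag-1..4 autocorrelations:", np.round(rho[1:], 3))
print(f"trend strength: {trend_strength:.2f}, seasonal strength: {seasonal_strength:.2f}")
```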
To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring non-linear relationships."}, "https://arxiv.org/abs/2304.06353": {"title": "Bayesian mixture models for phylogenetic source attribution from consensus sequences and time since infection estimates", "link": "https://arxiv.org/abs/2304.06353", "description": "arXiv:2304.06353v2 Announce Type: replace-cross \nAbstract: In stopping the spread of infectious diseases, pathogen genomic data can be used to reconstruct transmission events and characterize population-level sources of infection. Most approaches for identifying transmission pairs do not account for the time passing since divergence of pathogen variants in individuals, which is problematic in viruses with high within-host evolutionary rates. This prompted us to consider possible transmission pairs in terms of phylogenetic data and additional estimates of time since infection derived from clinical biomarkers. We develop Bayesian mixture models with an evolutionary clock as signal component and additional mixed effects or covariate random functions describing the mixing weights to classify potential pairs into likely and unlikely transmission pairs. We demonstrate that although sources cannot be identified at the individual level with certainty, even with the additional data on time elapsed, inferences into the population-level sources of transmission are possible, and more accurate than using only phylogenetic data without time since infection estimates. We apply the approach to estimate age-specific sources of HIV infection in Amsterdam MSM transmission networks between 2010-2021. This study demonstrates that infection time estimates provide informative data to characterize transmission sources, and shows how phylogenetic source attribution can then be done with multi-dimensional mixture models."}, "https://arxiv.org/abs/2408.12863": {"title": "Machine Learning and the Yield Curve: Tree-Based Macroeconomic Regime Switching", "link": "https://arxiv.org/abs/2408.12863", "description": "arXiv:2408.12863v1 Announce Type: new \nAbstract: We explore tree-based macroeconomic regime-switching in the context of the dynamic Nelson-Siegel (DNS) yield-curve model. In particular, we customize the tree-growing algorithm to partition macroeconomic variables based on the DNS model's marginal likelihood, thereby identifying regime-shifting patterns in the yield curve. Compared to traditional Markov-switching models, our model offers clear economic interpretation via macroeconomic linkages and ensures computational simplicity. In an empirical application to U.S. 
Treasury bond yields, we find (1) important yield curve regime switching, and (2) evidence that macroeconomic variables have predictive power for the yield curve when the short rate is high, but not in other regimes, thereby refining the notion of yield curve ``macro-spanning\"."}, "https://arxiv.org/abs/2408.13000": {"title": "Air-HOLP: Adaptive Regularized Feature Screening for High Dimensional Data", "link": "https://arxiv.org/abs/2408.13000", "description": "arXiv:2408.13000v1 Announce Type: new \nAbstract: Handling high-dimensional datasets presents substantial computational challenges, particularly when the number of features far exceeds the number of observations and when features are highly correlated. A modern approach to mitigate these issues is feature screening. In this work, the High-dimensional Ordinary Least-squares Projection (HOLP) feature screening method is advanced by employing adaptive ridge regularization. The impact of the ridge penalty on the Ridge-HOLP method is examined and Air-HOLP is proposed, a data-adaptive advance to Ridge-HOLP where the ridge-regularization parameter is selected iteratively and optimally for better feature screening performance. The proposed method addresses the challenges of penalty selection in high dimensions by offering a computationally efficient and stable alternative to traditional methods like bootstrapping and cross-validation. Air-HOLP is evaluated using simulated data and a prostate cancer genetic dataset. The empirical results demonstrate that Air-HOLP has improved performance over a large range of simulation settings. We provide R codes implementing the Air-HOLP feature screening method and integrating it into existing feature screening methods that utilize the HOLP formula."}, "https://arxiv.org/abs/2408.13047": {"title": "Difference-in-differences with as few as two cross-sectional units -- A new perspective to the democracy-growth debate", "link": "https://arxiv.org/abs/2408.13047", "description": "arXiv:2408.13047v1 Announce Type: new \nAbstract: Pooled panel analyses tend to mask heterogeneity in unit-specific treatment effects. For example, existing studies on the impact of democracy on economic growth do not reach a consensus as empirical findings are substantially heterogeneous in the country composition of the panel. In contrast to pooled panel analyses, this paper proposes a Difference-in-Differences (DiD) estimator that exploits the temporal dimension in the data and estimates unit-specific average treatment effects on the treated (ATT) with as few as two cross-sectional units. Under weak identification and temporal dependence conditions, the DiD estimator is asymptotically normal. The estimator is further complemented with a test of identification granted at least two candidate control units. Empirical results using the DiD estimator suggest Benin's economy would have been 6.3% smaller on average over the 1993-2018 period had she not democratised."}, "https://arxiv.org/abs/2408.13143": {"title": "A Restricted Latent Class Model with Polytomous Ordinal Correlated Attributes and Respondent-Level Covariates", "link": "https://arxiv.org/abs/2408.13143", "description": "arXiv:2408.13143v1 Announce Type: new \nAbstract: We present an exploratory restricted latent class model where response data is for a single time point, polytomous, and differing across items, and where latent classes reflect a multi-attribute state where each attribute is ordinal. 
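The HOLP screening formula that Air-HOLP (arXiv:2408.13000) builds on is beta = X'(XX' + rI)^{-1} y, computed in the n-by-n sample space, with features ranked by |beta|. A minimal sketch with a fixed, arbitrary ridge value r follows; the paper's contribution -- the adaptive, iterative choice of r -- is not reproduced.

```python
# Ridge-HOLP screening: beta = X'(XX' + r I)^{-1} y, rank features by |beta|.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 2000
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[[3, 17, 256]] = [4.0, -3.0, 5.0]
y = X @ beta_true + rng.normal(size=n)

def ridge_holp(X, y, r=10.0, keep=20):
    n = X.shape[0]
    K = X @ X.T + r * np.eye(n)            # n x n system, cheap when p >> n
    beta = X.T @ np.linalg.solve(K, y)
    return np.argsort(np.abs(beta))[::-1][:keep]

print("top-ranked features:", np.sort(ridge_holp(X, y)[:10]))
```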
Our model extends previous work to allow for correlation of the attributes through a multivariate probit and to allow for respondent-specific covariates. We demonstrate that the model recovers parameters well in a variety of realistic scenarios, and apply the model to the analysis of a particular dataset designed to diagnose depression. The application demonstrates the utility of the model in identifying the latent structure of depression beyond single-factor approaches which have been used in the past."}, "https://arxiv.org/abs/2408.13162": {"title": "A latent space model for multivariate count data time series analysis", "link": "https://arxiv.org/abs/2408.13162", "description": "arXiv:2408.13162v1 Announce Type: new \nAbstract: Motivated by a dataset of burglaries in Chicago, USA, we introduce a novel framework to analyze time series of count data combining common multivariate time series models with latent position network models. This novel methodology allows us to gain a new latent variable perspective on the crime dataset that we consider, allowing us to disentangle and explain the complex patterns exhibited by the data, while providing a natural time series framework that can be used to make future predictions. Our model is underpinned by two well known statistical approaches: a log-linear vector autoregressive model, which is prominent in the literature on multivariate count time series, and a latent projection model, which is a popular latent variable model for networks. The role of the projection model is to characterize the interaction parameters of the vector autoregressive model, thus uncovering the underlying network that is associated with the pairwise relationships between the time series. Estimation and inferential procedures are performed using an optimization algorithm and a Hamiltonian Monte Carlo procedure for efficient Bayesian inference. We also include a simulation study to illustrate the merits of our methodology in recovering consistent parameter estimates, and in making accurate future predictions for the time series. As we demonstrate in our application to the crime dataset, this new methodology can provide very meaningful model-based interpretations of the data, and it can be generalized to other time series contexts and applications."}, "https://arxiv.org/abs/2408.13022": {"title": "Estimation of ratios of normalizing constants using stochastic approximation : the SARIS algorithm", "link": "https://arxiv.org/abs/2408.13022", "description": "arXiv:2408.13022v1 Announce Type: cross \nAbstract: Computing ratios of normalizing constants plays an important role in statistical modeling. Two important examples are hypothesis testing in latent variables models, and model comparison in Bayesian statistics. In both examples, the likelihood ratio and the Bayes factor are defined as the ratio of the normalizing constants of posterior distributions. We propose in this article a novel methodology that estimates this ratio using stochastic approximation principle. Our estimator is consistent and asymptotically Gaussian. Its asymptotic variance is smaller than the one of the popular optimal bridge sampling estimator. Furthermore, it is much more robust to little overlap between the two unnormalized distributions considered. Thanks to its online definition, our procedure can be integrated in an estimation process in latent variables model, and therefore reduce the computational effort. 
The performances of the estimator are illustrated through a simulation study and compared to two other estimators : the ratio importance sampling and the optimal bridge sampling estimators."}, "https://arxiv.org/abs/2408.13176": {"title": "Non-parametric estimators of scaled cash flows", "link": "https://arxiv.org/abs/2408.13176", "description": "arXiv:2408.13176v1 Announce Type: cross \nAbstract: In multi-state life insurance, incidental policyholder behavior gives rise to expected cash flows that are not easily targeted by classic non-parametric estimators if data is subject to sampling effects. We introduce a scaled version of the classic Aalen--Johansen estimator that overcomes this challenge. Strong uniform consistency and asymptotic normality are established under entirely random right-censoring, subject to lax moment conditions on the multivariate counting process. In a simulation study, the estimator outperforms earlier proposals from the literature. Finally, we showcase the potential of the presented method to other areas of actuarial science."}, "https://arxiv.org/abs/2408.13179": {"title": "Augmented Functional Random Forests: Classifier Construction and Unbiased Functional Principal Components Importance through Ad-Hoc Conditional Permutations", "link": "https://arxiv.org/abs/2408.13179", "description": "arXiv:2408.13179v1 Announce Type: cross \nAbstract: This paper introduces a novel supervised classification strategy that integrates functional data analysis (FDA) with tree-based methods, addressing the challenges of high-dimensional data and enhancing the classification performance of existing functional classifiers. Specifically, we propose augmented versions of functional classification trees and functional random forests, incorporating a new tool for assessing the importance of functional principal components. This tool provides an ad-hoc method for determining unbiased permutation feature importance in functional data, particularly when dealing with correlated features derived from successive derivatives. Our study demonstrates that these additional features can significantly enhance the predictive power of functional classifiers. Experimental evaluations on both real-world and simulated datasets showcase the effectiveness of the proposed methodology, yielding promising results compared to existing methods."}, "https://arxiv.org/abs/1911.04696": {"title": "Extended MinP Tests for Global and Multiple testing", "link": "https://arxiv.org/abs/1911.04696", "description": "arXiv:1911.04696v2 Announce Type: replace \nAbstract: Empirical economic studies often involve multiple propositions or hypotheses, with researchers aiming to assess both the collective and individual evidence against these propositions or hypotheses. To rigorously assess this evidence, practitioners frequently employ tests with quadratic test statistics, such as $F$-tests and Wald tests, or tests based on minimum/maximum type test statistics. This paper introduces a combination test that merges these two classes of tests using the minimum $p$-value principle. 
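To fix ideas for the SARIS abstract above (arXiv:2408.13022), the simplest comparator it mentions -- ratio importance sampling -- estimates Z1/Z2 by averaging the ratio of unnormalised densities over draws from the second distribution. A toy sketch with two Gaussians (where the true ratio is known) follows; the stochastic-approximation SARIS estimator itself is not shown.

```python
# Ratio importance sampling for a ratio of normalising constants:
# if x_i ~ p2, then mean_i[ p1_unnorm(x_i) / p2_unnorm(x_i) ] -> Z1 / Z2.
import numpy as np

rng = np.random.default_rng(8)

def p1_unnorm(x):                      # N(0.5, 1.5) up to its constant
    return np.exp(-(x - 0.5) ** 2 / (2 * 1.5))

def p2_unnorm(x):                      # N(0, 1) up to its constant
    return np.exp(-x ** 2 / 2)

x = rng.normal(size=100_000)           # exact draws from p2
ratio_hat = np.mean(p1_unnorm(x) / p2_unnorm(x))

print(f"estimated Z1/Z2 = {ratio_hat:.4f}  (true value {np.sqrt(1.5):.4f})")
```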
The proposed test capitalizes on the global power advantages of both constituent tests while retaining the benefits of the stepdown procedure from minimum/maximum type tests."}, "https://arxiv.org/abs/2308.02974": {"title": "Combining observational and experimental data for causal inference considering data privacy", "link": "https://arxiv.org/abs/2308.02974", "description": "arXiv:2308.02974v2 Announce Type: replace \nAbstract: Combining observational and experimental data for causal inference can improve treatment effect estimation. However, many observational data sets cannot be released due to data privacy considerations, so one researcher may not have access to both experimental and observational data. Nonetheless, a small amount of risk of disclosing sensitive information might be tolerable to organizations that house confidential data. In these cases, organizations can employ data privacy techniques, which decrease disclosure risk, potentially at the expense of data utility. In this paper, we explore disclosure limiting transformations of observational data, which can be combined with experimental data to estimate the sample and population average treatment effects. We consider leveraging observational data to improve generalizability of treatment effect estimates when a randomized experiment (RCT) is not representative of the population of interest, and to increase precision of treatment effect estimates. Through simulation studies, we illustrate the trade-off between privacy and utility when employing different disclosure limiting transformations. We find that leveraging transformed observational data in treatment effect estimation can still improve estimation over only using data from an RCT."}, "https://arxiv.org/abs/2408.13393": {"title": "WASP: Voting-based ex Ante method for Selecting joint Prediction strategy", "link": "https://arxiv.org/abs/2408.13393", "description": "arXiv:2408.13393v1 Announce Type: new \nAbstract: This paper addresses the topic of choosing a prediction strategy when using parametric or nonparametric regression models. It emphasizes the importance of ex ante prediction accuracy, ensemble approaches, and forecasting not only the values of the dependent variable but also a function of these values, such as total income or median loss. It proposes a method for selecting a strategy for predicting the vector of functions of the dependent variable using various ex ante accuracy measures. The final decision is made through voting, where the candidates are prediction strategies and the voters are diverse prediction models with their respective prediction errors. Because the method is based on a Monte Carlo simulation, it allows for new scenarios, not previously observed, to be considered. The first part of the article provides a detailed theoretical description of the proposed method, while the second part presents its practical use in managing a portfolio of communication insurance. The example uses data from the Polish insurance market. All calculations are performed using the R programme."}, "https://arxiv.org/abs/2408.13409": {"title": "Leveraging external data in the analysis of randomized controlled trials: a comparative analysis", "link": "https://arxiv.org/abs/2408.13409", "description": "arXiv:2408.13409v1 Announce Type: new \nAbstract: The use of patient-level information from previous studies, registries, and other external datasets can support the analysis of single-arm and randomized clinical trials to evaluate and test experimental treatments. 
However, the integration of external data in the analysis of clinical trials can also compromise the scientific validity of the results due to selection bias, study-to-study differences, unmeasured confounding, and other distortion mechanisms. Therefore, leveraging external data in the analysis of a clinical trial requires the use of appropriate methods that can detect, prevent or mitigate the risks of bias and potential distortion mechanisms. We review several methods that have been previously proposed to leverage external datasets, such as matching procedures or random effects modeling. Different methods present distinct trade-offs between risks and efficiencies. We conduct a comparative analysis of statistical methods to leverage external data and analyze randomized clinical trials. Multiple operating characteristics are discussed, such as the control of false positive results, power, and the bias of the treatment effect estimates, across candidate statistical methods. We compare the statistical methods through a broad set of simulation scenarios. We then compare the methods using a collection of datasets with individual patient-level information from several glioblastoma studies in order to provide recommendations for future glioblastoma trials."}, "https://arxiv.org/abs/2408.13411": {"title": "Estimating the Effective Sample Size for an inverse problem in subsurface flows", "link": "https://arxiv.org/abs/2408.13411", "description": "arXiv:2408.13411v1 Announce Type: new \nAbstract: The Effective Sample Size (ESS) and Integrated Autocorrelation Time (IACT) are two popular criteria for comparing Markov Chain Monte Carlo (MCMC) algorithms and detecting their convergence. Our goal is to assess those two quantities in the context of an inverse problem in subsurface flows. We begin by presenting a review of some popular methods for their estimation, and then simulate their sample distributions on AR(1) sequences for which the exact values were known. We find that those ESS estimators may not be statistically consistent, because their variance grows linearly in the number of sample values of the MCMC. Next, we analyze the output of two distinct MCMC algorithms for the Bayesian approach to the simulation of an elliptic inverse problem. Here, the estimators cannot even agree about the order of magnitude of the ESS. Our conclusion is that the ESS has major limitations and should not be used on MCMC outputs of complex models."}, "https://arxiv.org/abs/2408.13414": {"title": "Epistemically robust selection of fitted models", "link": "https://arxiv.org/abs/2408.13414", "description": "arXiv:2408.13414v1 Announce Type: new \nAbstract: Fitting models to data is an important part of the practice of science, made almost ubiquitous by advances in machine learning. Very often however, fitted solutions are not unique, but form an ensemble of candidate models -- qualitatively different, yet with comparable quantitative performance. One then needs a criterion which can select the best candidate models, or at least falsify (reject) the worst ones. Because standard statistical approaches to model selection rely on assumptions which are usually invalid in scientific contexts, they tend to be overconfident, rejecting models based on little more than statistical noise. The ideal objective for fitting models is generally considered to be the risk: this is the theoretical average loss of a model (assuming unlimited data). 
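The ESS abstract above (arXiv:2408.13411) reviews estimators of the effective sample size of MCMC output. A common baseline estimator is N divided by the integrated autocorrelation time, 1 + 2 times the sum of sample autocorrelations, with the sum truncated at the first non-positive term. The sketch below uses that simple truncation rule on an AR(1) chain; many variants exist, and their disagreement on hard problems is part of the abstract's point.

```python
# Baseline ESS estimator: N / (1 + 2 * sum of autocorrelations), truncating
# the sum at the first non-positive sample autocorrelation.
import numpy as np

def effective_sample_size(chain):
    x = np.asarray(chain, dtype=float)
    n = x.size
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        tau += 2.0 * rho[k]
    return n / tau

# AR(1) chain with autocorrelation 0.9: theoretical ESS ~ N * (1-0.9)/(1+0.9)
rng = np.random.default_rng(9)
n, phi = 5000, 0.9
chain = np.zeros(n)
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()
print(f"ESS estimate: {effective_sample_size(chain):.0f}"
      f"  (theory ~ {n * (1 - phi) / (1 + phi):.0f})")
```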
In this work we develop a nonparametric method for estimating, for each candidate model, the epistemic uncertainty on its risk: in other words we associate to each model a distribution of scores which accounts for expected modelling errors. We then propose that a model falsification criterion should mirror established experimental practice: a falsification result should be accepted only if it is reproducible across experimental variations. The strength of this approach is illustrated using examples from physics and neuroscience."}, "https://arxiv.org/abs/2408.13437": {"title": "Cross-sectional Dependence in Idiosyncratic Volatility", "link": "https://arxiv.org/abs/2408.13437", "description": "arXiv:2408.13437v1 Announce Type: new \nAbstract: This paper introduces an econometric framework for analyzing cross-sectional dependence in the idiosyncratic volatilities of assets using high frequency data. We first consider the estimation of standard measures of dependence in the idiosyncratic volatilities such as covariances and correlations. Naive estimators of these measures are biased due to the use of the error-laden estimates of idiosyncratic volatilities. We provide bias-corrected estimators and the relevant asymptotic theory. Next, we introduce an idiosyncratic volatility factor model, in which we decompose the variation in idiosyncratic volatilities into two parts: the variation related to the systematic factors such as the market volatility, and the residual variation. Again, naive estimators of the decomposition are biased, and we provide bias-corrected estimators. We also provide the asymptotic theory that allows us to test whether the residual (non-systematic) components of the idiosyncratic volatilities exhibit cross-sectional dependence. We apply our methodology to the S&P 100 index constituents, and document strong cross-sectional dependence in their idiosyncratic volatilities. We consider two different sets of idiosyncratic volatility factors, and find that neither can fully account for the cross-sectional dependence in idiosyncratic volatilities. For each model, we map out the network of dependencies in residual (non-systematic) idiosyncratic volatilities across all stocks."}, "https://arxiv.org/abs/2408.13453": {"title": "Unifying design-based and model-based sampling theory -- some suggestions to clear the cobwebs", "link": "https://arxiv.org/abs/2408.13453", "description": "arXiv:2408.13453v1 Announce Type: new \nAbstract: This paper gives a holistic overview of both the design-based and model-based paradigms for sampling theory. Both methods are presented within a unified framework with a simple consistent notation, and the differences in the two paradigms are explained within this common framework. We examine the different definitions of the \"population variance\" within the two paradigms and examine the use of Bessel's correction for a population variance. We critique some messy aspects of the presentation of the design-based paradigm and implore readers to avoid the standard presentation of this framework in favour of a more explicit presentation that includes explicit conditioning in probability statements. 
We also discuss a number of confusions that arise from the standard presentation of the design-based paradigm and argue that Bessel's correction should be applied to the population variance."}, "https://arxiv.org/abs/2408.13474": {"title": "Ridge, lasso, and elastic-net estimations of the modified Poisson and least-squares regressions for binary outcome data", "link": "https://arxiv.org/abs/2408.13474", "description": "arXiv:2408.13474v1 Announce Type: new \nAbstract: Logistic regression is a standard method in multivariate analysis for binary outcome data in epidemiological and clinical studies; however, the resultant odds-ratio estimates fail to provide directly interpretable effect measures. The modified Poisson and least-squares regressions are alternative standard methods that can provide risk-ratio and risk difference estimates without computational problems. However, the bias and invalid inference problems of these regression analyses under small or sparse data conditions (i.e.,the \"separation\" problem) have been insufficiently investigated. We show that the separation problem can adversely affect the inferences of the modified Poisson and least squares regressions, and to address these issues, we apply the ridge, lasso, and elastic-net estimating approaches to the two regression methods. As the methods are not founded on the maximum likelihood principle, we propose regularized quasi-likelihood approaches based on the estimating equations for these generalized linear models. The methods provide stable shrinkage estimates of risk ratios and risk differences even under separation conditions, and the lasso and elastic-net approaches enable simultaneous variable selection. We provide a bootstrap method to calculate the confidence intervals on the basis of the regularized quasi-likelihood estimation. The proposed methods are applied to a hematopoietic stem cell transplantation cohort study and the National Child Development Survey. We also provide an R package, regconfint, to implement these methods with simple commands."}, "https://arxiv.org/abs/2408.13514": {"title": "Cross Sectional Regression with Cluster Dependence: Inference based on Averaging", "link": "https://arxiv.org/abs/2408.13514", "description": "arXiv:2408.13514v1 Announce Type: new \nAbstract: We re-investigate the asymptotic properties of the traditional OLS (pooled) estimator, $\\hat{\\beta} _P$, with cluster dependence. The present study considers various scenarios under various restrictions on the cluster sizes and number of clusters. It is shown that $\\hat{\\beta}_P$ could be inconsistent in many realistic situations. We propose a simple estimator, $\\hat{\\beta}_A$ based on data averaging. The asymptotic properties of $\\hat{\\beta}_A$ is studied. It is shown that $\\hat{\\beta}_A$ is consistent even when $\\hat{\\beta}_P$ is inconsistent. It is further shown that the proposed estimator $\\hat{\\beta}_A$ could be more efficient as compared to $\\hat{\\beta}_P$ in many practical scenarios. As a consequence of averaging, we show that $\\hat{\\beta}_A$ retains consistency, asymptotic normality under classical measurement error problem circumventing the use of IV. A detailed simulation study shows the efficacy of $\\hat{\\beta}_A$. 
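For context on the regularised estimators proposed above (arXiv:2408.13474), the unpenalised modified Poisson regression they extend is simply a Poisson GLM fitted to a binary outcome with robust (sandwich) standard errors, so that exponentiated coefficients are risk ratios. A minimal sketch on simulated data follows; the ridge, lasso, and elastic-net quasi-likelihood versions and the bootstrap confidence intervals are the paper's contribution and are not reproduced.

```python
# Modified Poisson regression for a binary outcome: Poisson GLM with robust
# standard errors, so exp(coefficient) is a risk ratio.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 2000
exposure = rng.integers(0, 2, n)
age = rng.normal(50, 10, n)
risk = 0.1 * np.exp(0.5 * exposure + 0.01 * (age - 50))   # true risk ratio e^0.5
y = rng.binomial(1, np.clip(risk, 0, 1))

X = sm.add_constant(np.column_stack([exposure, age]))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")

print("risk ratio for exposure: %.3f" % np.exp(fit.params[1]))
print("robust 95%% CI: %s" % np.round(np.exp(fit.conf_int()[1]), 3))
```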
It is also seen that $\\hat{\\beta}_A$ yields a better goodness of fit."}, "https://arxiv.org/abs/2408.13555": {"title": "Local statistical moments to capture Kramers-Moyal coefficients", "link": "https://arxiv.org/abs/2408.13555", "description": "arXiv:2408.13555v1 Announce Type: new \nAbstract: This study introduces an innovative local statistical moment approach for estimating Kramers-Moyal coefficients, effectively bridging the gap between nonparametric and parametric methodologies. These coefficients play a crucial role in characterizing stochastic processes. Our proposed approach provides a versatile framework for localized coefficient estimation, combining the flexibility of nonparametric methods with the interpretability of global parametric approaches. We showcase the efficacy of our approach through use cases involving both stationary and non-stationary time series analysis. Additionally, we demonstrate its applicability to real-world complex systems, specifically in the energy conversion process analysis of a wind turbine."}, "https://arxiv.org/abs/2408.13596": {"title": "Robust Principal Components by Casewise and Cellwise Weighting", "link": "https://arxiv.org/abs/2408.13596", "description": "arXiv:2408.13596v1 Announce Type: new \nAbstract: Principal component analysis (PCA) is a fundamental tool for analyzing multivariate data. Here the focus is on dimension reduction to the principal subspace, characterized by its projection matrix. The classical principal subspace can be strongly affected by the presence of outliers. Traditional robust approaches consider casewise outliers, that is, cases generated by an unspecified outlier distribution that differs from that of the clean cases. But there may also be cellwise outliers, which are suspicious entries that can occur anywhere in the data matrix. Another common issue is that some cells may be missing. This paper proposes a new robust PCA method, called cellPCA, that can simultaneously deal with casewise outliers, cellwise outliers, and missing cells. Its single objective function combines two robust loss functions that together mitigate the effect of casewise and cellwise outliers. The objective function is minimized by an iteratively reweighted least squares (IRLS) algorithm. Residual cellmaps and enhanced outlier maps are proposed for outlier detection. The casewise and cellwise influence functions of the principal subspace are derived, and its asymptotic distribution is obtained. Extensive simulations and two real data examples illustrate the performance of cellPCA."}, "https://arxiv.org/abs/2408.13606": {"title": "Influence Networks: Bayesian Modeling and Diffusion", "link": "https://arxiv.org/abs/2408.13606", "description": "arXiv:2408.13606v1 Announce Type: new \nAbstract: In this article, we adapt a Bayesian latent space model based on projections in a novel way to analyze influence networks. By appropriately reparameterizing the model, we establish a formal metric for quantifying each individual's influencing capacity and estimating their latent position embedded in a social space. This modeling approach introduces a novel mechanism for fully characterizing the diffusion of an idea based on the estimated latent characteristics. It assumes that each individual takes one of the following states: unknown, undecided, supporting, or rejecting an idea. This approach is demonstrated using an influence network from Twitter (now $\\mathbb{X}$) related to the 2022 Tax Reform in Colombia. 
An exhaustive simulation exercise is also performed to evaluate the proposed diffusion process."}, "https://arxiv.org/abs/2408.13649": {"title": "Tree-structured Markov random fields with Poisson marginal distributions", "link": "https://arxiv.org/abs/2408.13649", "description": "arXiv:2408.13649v1 Announce Type: new \nAbstract: A new family of tree-structured Markov random fields for a vector of discrete counting random variables is introduced. According to the characteristics of the family, the marginal distributions of the Markov random fields are all Poisson with the same mean, and are untied from the strength or structure of their built-in dependence. This key feature is uncommon for Markov random fields and most convenient for applications purposes. The specific properties of this new family confer a straightforward sampling procedure and analytic expressions for the joint probability mass function and the joint probability generating function of the vector of counting random variables, thus granting computational methods that scale well to vectors of high dimension. We study the distribution of the sum of random variables constituting a Markov random field from the proposed family, analyze a random variable's individual contribution to that sum through expected allocations, and establish stochastic orderings to assess a wide understanding of their behavior."}, "https://arxiv.org/abs/2408.13949": {"title": "Inference on Consensus Ranking of Distributions", "link": "https://arxiv.org/abs/2408.13949", "description": "arXiv:2408.13949v1 Announce Type: new \nAbstract: Instead of testing for unanimous agreement, I propose learning how broad of a consensus favors one distribution over another (of earnings, productivity, asset returns, test scores, etc.). Specifically, given a sample from each of two distributions, I propose statistical inference methods to learn about the set of utility functions for which the first distribution has higher expected utility than the second distribution. With high probability, an \"inner\" confidence set is contained within this true set, while an \"outer\" confidence set contains the true set. Such confidence sets can be formed by inverting a proposed multiple testing procedure that controls the familywise error rate. Theoretical justification comes from empirical process results, given that very large classes of utility functions are generally Donsker (subject to finite moments). The theory additionally justifies a uniform (over utility functions) confidence band of expected utility differences, as well as tests with a utility-based \"restricted stochastic dominance\" as either the null or alternative hypothesis. Simulated and empirical examples illustrate the methodology."}, "https://arxiv.org/abs/2408.13971": {"title": "Endogenous Treatment Models with Social Interactions: An Application to the Impact of Exercise on Self-Esteem", "link": "https://arxiv.org/abs/2408.13971", "description": "arXiv:2408.13971v1 Announce Type: new \nAbstract: We address the estimation of endogenous treatment models with social interactions in both the treatment and outcome equations. We model the interactions between individuals in an internally consistent manner via a game theoretic approach based on discrete Bayesian games. This introduces a substantial computational burden in estimation which we address through a sequential version of the nested fixed point algorithm. 
We also provide some relevant treatment effects, and procedures for their estimation, which capture the impact on both the individual and the total sample. Our empirical application examines the impact of an individual's exercise frequency on her level of self-esteem. We find that an individual's exercise frequency is influenced by her expectation of her friends'. We also find that an individual's level of self-esteem is affected by her level of exercise and, at relatively lower levels of self-esteem, by the expectation of her friends' self-esteem."}, "https://arxiv.org/abs/2408.14015": {"title": "Robust likelihood ratio tests for composite nulls and alternatives", "link": "https://arxiv.org/abs/2408.14015", "description": "arXiv:2408.14015v1 Announce Type: new \nAbstract: We propose an e-value based framework for testing composite nulls against composite alternatives when an $\\epsilon$ fraction of the data can be arbitrarily corrupted. Our tests are inherently sequential, being valid at arbitrary data-dependent stopping times, but they are new even for fixed sample sizes, giving type-I error control without any regularity conditions. We achieve this by modifying and extending a proposal by Huber (1965) in the point null versus point alternative case. Our test statistic is a nonnegative supermartingale under the null, even with a sequentially adaptive contamination model where the conditional distribution of each observation given the past data lies within an $\\epsilon$ (total variation) ball of the null. The test is powerful within an $\\epsilon$ ball of the alternative. As a consequence, one obtains anytime-valid p-values that enable continuous monitoring of the data, and adaptive stopping. We analyze the growth rate of our test supermartingale and demonstrate that as $\\epsilon\\to 0$, it approaches a certain Kullback-Leibler divergence between the null and alternative, which is the optimal non-robust growth rate. A key step is the derivation of a robust Reverse Information Projection (RIPr). Simulations validate the theory and demonstrate excellent practical performance."}, "https://arxiv.org/abs/2408.14036": {"title": "Robust subgroup-classifier learning and testing in change-plane regressions", "link": "https://arxiv.org/abs/2408.14036", "description": "arXiv:2408.14036v1 Announce Type: new \nAbstract: Considered here are robust subgroup-classifier learning and testing in change-plane regressions with heavy-tailed errors, which can identify subgroups as a basis for making optimal recommendations for individualized treatment. A new subgroup classifier is proposed by smoothing the indicator function, which is learned by minimizing the smoothed Huber loss. Nonasymptotic properties and the Bahadur representation of estimators are established, in which the proposed estimators of the grouping difference parameter and baseline parameter achieve sub-Gaussian tails. The hypothesis test considered here belongs to the class of test problems for which some parameters are not identifiable under the null hypothesis. The classic supremum of the squared score test statistic may lose power in practice when the dimension of the grouping parameter is large, so to overcome this drawback and make full use of the data's heavy-tailed error distribution, a robust weighted average of the squared score test statistic is proposed, which achieves a closed form when an appropriate weight is chosen. 
Asymptotic distributions of the proposed robust test statistic are derived under the null and alternative hypotheses. The proposed robust subgroup classifier and test statistic perform well on finite samples, and their performance is further demonstrated by applying them to a medical dataset. The proposed procedure can be applied directly to recommend optimal individualized treatments."}, "https://arxiv.org/abs/2408.14038": {"title": "Jackknife Empirical Likelihood Method for U Statistics Based on Multivariate Samples and its Applications", "link": "https://arxiv.org/abs/2408.14038", "description": "arXiv:2408.14038v1 Announce Type: new \nAbstract: Empirical likelihood (EL) and its extension via the jackknife empirical likelihood (JEL) method provide robust alternatives to parametric approaches in contexts with uncertain data distributions. This paper explores the theoretical foundations and practical applications of JEL in the context of multivariate sample-based U-statistics. In this study we develop the JEL method for multivariate U-statistics with three (or more) samples. This study enhances the JEL method's capability to handle complex data structures while preserving the computational efficiency of the empirical likelihood method. To demonstrate the applications of the JEL method, we compute confidence intervals for differences in VUS measurements, which have potential applications in classification problems. Monte Carlo simulation studies are conducted to evaluate the efficiency of the JEL, normal approximation, and kernel-based confidence intervals. These studies validate the superior performance of the JEL approach in terms of coverage probability and computational efficiency compared to the other two methods. Additionally, a real data application illustrates the practical utility of the approach. The JEL method developed here has potential applications in dealing with complex data structures."}, "https://arxiv.org/abs/2408.14214": {"title": "Modeling the Dynamics of Growth in Master-Planned Communities", "link": "https://arxiv.org/abs/2408.14214", "description": "arXiv:2408.14214v1 Announce Type: new \nAbstract: This paper describes how a time-varying Markov model was used to forecast housing development at a master-planned community during a transition from high to low growth. Our approach draws on detailed historical data to model the dynamics of the market participants, producing results that are entirely data-driven and free of bias. While traditional time series forecasting methods often struggle to account for nonlinear regime changes in growth, our approach successfully captures the onset of buildout as well as external economic shocks, such as the 1990 and 2008-2011 recessions and the 2021 post-pandemic boom.\n This research serves as a valuable tool for urban planners, homeowner associations, and property stakeholders aiming to navigate the complexities of growth at master-planned communities during periods of both system stability and instability."}, "https://arxiv.org/abs/2408.14402": {"title": "A quasi-Bayesian sequential approach to deconvolution density estimation", "link": "https://arxiv.org/abs/2408.14402", "description": "arXiv:2408.14402v1 Announce Type: new \nAbstract: Density deconvolution addresses the estimation of the unknown (probability) density function $f$ of a random signal from data that are observed with an independent additive random noise. 
This is a classical problem in statistics, for which frequentist and Bayesian nonparametric approaches are available to deal with static or batch data. In this paper, we consider the problem of density deconvolution in a streaming or online setting where noisy data arrive progressively, with no predetermined sample size, and we develop a sequential nonparametric approach to estimate $f$. By relying on a quasi-Bayesian sequential approach, often referred to as Newton's algorithm, we obtain estimates of $f$ that are of easy evaluation, computationally efficient, and with a computational cost that remains constant as the amount of data increases, which is critical in the streaming setting. Large sample asymptotic properties of the proposed estimates are studied, yielding provable guarantees with respect to the estimation of $f$ at a point (local) and on an interval (uniform). In particular, we establish local and uniform central limit theorems, providing corresponding asymptotic credible intervals and bands. We validate empirically our methods on synthetic and real data, by considering the common setting of Laplace and Gaussian noise distributions, and make a comparison with respect to the kernel-based approach and a Bayesian nonparametric approach with a Dirichlet process mixture prior."}, "https://arxiv.org/abs/2408.14410": {"title": "Generalized Bayesian nonparametric clustering framework for high-dimensional spatial omics data", "link": "https://arxiv.org/abs/2408.14410", "description": "arXiv:2408.14410v1 Announce Type: new \nAbstract: The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has transformed genomic research by enabling high-throughput gene expression profiling while preserving spatial context. Identifying spatial domains within SRT data is a critical task, with numerous computational approaches currently available. However, most existing methods rely on a multi-stage process that involves ad-hoc dimension reduction techniques to manage the high dimensionality of SRT data. These low-dimensional embeddings are then subjected to model-based or distance-based clustering methods. Additionally, many approaches depend on arbitrarily specifying the number of clusters (i.e., spatial domains), which can result in information loss and suboptimal downstream analysis. To address these limitations, we propose a novel Bayesian nonparametric mixture of factor analysis (BNPMFA) model, which incorporates a Markov random field-constrained Gibbs-type prior for partitioning high-dimensional spatial omics data. This new prior effectively integrates the spatial constraints inherent in SRT data while simultaneously inferring cluster membership and determining the optimal number of spatial domains. We have established the theoretical identifiability of cluster membership within this framework. The efficacy of our proposed approach is demonstrated through realistic simulations and applications to two SRT datasets. 
Our results show that the BNPMFA model not only surpasses state-of-the-art methods in clustering accuracy and estimating the number of clusters but also offers novel insights for identifying cellular regions within tissue samples."}, "https://arxiv.org/abs/2401.05218": {"title": "Invariant Causal Prediction with Locally Linear Models", "link": "https://arxiv.org/abs/2401.05218", "description": "arXiv:2401.05218v1 Announce Type: cross \nAbstract: We consider the task of identifying the causal parents of a target variable among a set of candidate variables from observational data. Our main assumption is that the candidate variables are observed in different environments which may, for example, correspond to different settings of a machine or different time intervals in a dynamical process. Under certain assumptions, different environments can be regarded as interventions on the observed system. We assume a linear relationship between target and covariates, which can be different in each environment with the only restriction that the causal structure is invariant across environments. This is an extension of the ICP ($\\textbf{I}$nvariant $\\textbf{C}$ausal $\\textbf{P}$rediction) principle by Peters et al. [2016], who assumed a fixed linear relationship across all environments. Within our proposed setting we provide sufficient conditions for identifiability of the causal parents and introduce a practical method called LoLICaP ($\\textbf{Lo}$cally $\\textbf{L}$inear $\\textbf{I}$nvariant $\\textbf{Ca}$usal $\\textbf{P}$rediction), which is based on a hypothesis test for parent identification using a ratio of minimum and maximum statistics. We then show in a simplified setting that the statistical power of LoLICaP converges exponentially fast in the sample size, and finally we analyze the behavior of LoLICaP experimentally in more general settings."}, "https://arxiv.org/abs/2408.13448": {"title": "Efficient Reinforced DAG Learning without Acyclicity Constraints", "link": "https://arxiv.org/abs/2408.13448", "description": "arXiv:2408.13448v1 Announce Type: cross \nAbstract: Unraveling cause-effect structures embedded in mere observational data is of great scientific interest, owing to the wealth of knowledge that can benefit from such structures. Recently, reinforcement learning (RL) has emerged as an enhancement of classical techniques to search for the most probable causal explanation in the form of a directed acyclic graph (DAG). Yet, effectively exploring the DAG space is challenging due to the vast number of candidates and the intricate constraint of acyclicity. In this study, we present REACT (REinforced DAG learning without acyclicity ConstrainTs), a novel causal discovery approach fueled by the RL machinery with an efficient DAG generation policy. Through a novel parametrization of DAGs, which allows for directly mapping a real-valued vector to an adjacency matrix representing a valid DAG in a single step without enforcing any acyclicity constraint, we are able to navigate the search space much more effectively with policy gradient methods. In addition, our comprehensive numerical evaluations on a diverse set of both synthetic and real data confirm the effectiveness of our method compared with state-of-the-art baselines."}, "https://arxiv.org/abs/2408.13556": {"title": "What if? 
Causal Machine Learning in Supply Chain Risk Management", "link": "https://arxiv.org/abs/2408.13556", "description": "arXiv:2408.13556v1 Announce Type: cross \nAbstract: The ultimate goal of developing machine learning models in supply chain management is to make optimal interventions. However, most machine learning models identify correlations in data rather than inferring causation, making it difficult to systematically plan for better outcomes. In this article, we propose and evaluate the use of causal machine learning for developing supply chain risk intervention models, and demonstrate its use with a case study in supply chain risk management in the maritime engineering sector. Our findings highlight that causal machine learning enhances decision-making processes by identifying changes that can be achieved under different supply chain interventions, allowing \"what-if\" scenario planning. We therefore propose different machine learning development pathways for predicting risk and for planning interventions to minimise risk, and we outline key steps for supply chain researchers to explore causal machine learning."}, "https://arxiv.org/abs/2408.13642": {"title": "Change Point Detection in Pairwise Comparison Data with Covariates", "link": "https://arxiv.org/abs/2408.13642", "description": "arXiv:2408.13642v1 Announce Type: cross \nAbstract: This paper introduces the novel piecewise stationary covariate-assisted ranking estimation (PS-CARE) model for analyzing time-evolving pairwise comparison data, enhancing item ranking accuracy through the integration of covariate information. By partitioning the data into distinct, stationary segments, the PS-CARE model adeptly detects temporal shifts in item rankings, known as change points, whose number and positions are initially unknown. Leveraging the minimum description length (MDL) principle, this paper establishes a statistically consistent model selection criterion to estimate these unknowns. The practical optimization of this MDL criterion is done with the pruned exact linear time (PELT) algorithm. Empirical evaluations reveal the method's promising performance in accurately locating change points across various simulated scenarios. An application to an NBA dataset yielded meaningful insights that aligned with significant historical events, highlighting the method's practical utility and the MDL criterion's effectiveness in capturing temporal ranking changes. To the best of the authors' knowledge, this research pioneers change point detection in pairwise comparison data with covariate information, representing a significant leap forward in the field of dynamic ranking analysis."}, "https://arxiv.org/abs/2408.14012": {"title": "Bayesian Cointegrated Panels in Digital Marketing", "link": "https://arxiv.org/abs/2408.14012", "description": "arXiv:2408.14012v1 Announce Type: cross \nAbstract: In this paper, we fully develop and apply a novel extension of Bayesian cointegrated panels modeling in digital marketing, particularly in modeling a system where key ROI metrics, such as clicks or impressions of a given digital campaign, are considered. Thus, in this context, our goal is to evaluate how the system reacts to investment perturbations due to changes in the investment strategy and its impact on the visibility of specific campaigns. To do so, we fit the model using a set of real marketing data with different investment campaigns over the same geographic territory. 
Employing forecast error variance decomposition, we find that clicks and impressions have a significant impact on session generation. Also, we evaluate our approach through a comprehensive simulation study that considers different processes. The results indicate that our proposal has substantial capabilities in terms of estimability and accuracy."}, "https://arxiv.org/abs/2408.14073": {"title": "Score-based change point detection via tracking the best of infinitely many experts", "link": "https://arxiv.org/abs/2408.14073", "description": "arXiv:2408.14073v1 Announce Type: cross \nAbstract: We suggest a novel algorithm for online change point detection based on sequential score function estimation and tracking the best expert approach. The core of the procedure is a version of the fixed share forecaster for the case of an infinite number of experts and quadratic loss functions. The algorithm shows promising performance in numerical experiments on artificial and real-world data sets. We also derive new upper bounds on the dynamic regret of the fixed share forecaster with a varying parameter, which are of independent interest."}, "https://arxiv.org/abs/2408.14408": {"title": "Consistent diffusion matrix estimation from population time series", "link": "https://arxiv.org/abs/2408.14408", "description": "arXiv:2408.14408v1 Announce Type: cross \nAbstract: Progress on modern scientific questions regularly depends on using large-scale datasets to understand complex dynamical systems. An especially challenging case that has grown to prominence with advances in single-cell sequencing technologies is learning the behavior of individuals from population snapshots. In the absence of individual-level time series, standard stochastic differential equation models are often nonidentifiable because intrinsic diffusion cannot be distinguished from measurement noise. Despite the difficulty, accurately recovering diffusion terms is required to answer even basic questions about the system's behavior. We show how to combine population-level time series with velocity measurements to build a provably consistent estimator of the diffusion matrix."}, "https://arxiv.org/abs/2204.05793": {"title": "Coarse Personalization", "link": "https://arxiv.org/abs/2204.05793", "description": "arXiv:2204.05793v3 Announce Type: replace \nAbstract: Advances in estimating heterogeneous treatment effects enable firms to personalize marketing mix elements and target individuals at an unmatched level of granularity, but feasibility constraints limit such personalization. In practice, firms choose which unique treatments to offer and which individuals to offer these treatments with the goal of maximizing profits: we call this the coarse personalization problem. We propose a two-step solution that makes segmentation and targeting decisions in concert. First, the firm personalizes by estimating conditional average treatment effects. Second, the firm discretizes by utilizing treatment effects to choose which unique treatments to offer and who to assign to these treatments. We show that a combination of available machine learning tools for estimating heterogeneous treatment effects and a novel application of optimal transport methods provides a viable and efficient solution. With data from a large-scale field experiment for promotions management, we find that our methodology outperforms extant approaches that segment on consumer characteristics or preferences and those that only search over a prespecified grid. 
Using our procedure, the firm recoups over 99.5% of its expected incremental profits under fully granular personalization while offering only five unique treatments. We conclude by discussing how coarse personalization arises in other domains."}, "https://arxiv.org/abs/2211.11673": {"title": "Asymptotically Normal Estimation of Local Latent Network Curvature", "link": "https://arxiv.org/abs/2211.11673", "description": "arXiv:2211.11673v4 Announce Type: replace \nAbstract: Network data, commonly used throughout the physical, social, and biological sciences, consist of nodes (individuals) and the edges (interactions) between them. One way to represent network data's complex, high-dimensional structure is to embed the graph into a low-dimensional geometric space. The curvature of this space, in particular, provides insights about the structure in the graph, such as the propensity to form triangles or present tree-like structures. We derive an estimating function for curvature based on triangle side lengths and the length of the midpoint of a side to the opposing corner. We construct an estimator where the only input is a distance matrix and also establish asymptotic normality. We next introduce a novel latent distance matrix estimator for networks and an efficient algorithm to compute the estimate via solving iterative quadratic programs. We apply this method to the Los Alamos National Laboratory Unified Network and Host dataset and show how curvature estimates can be used to detect a red-team attack faster than naive methods, as well as discover non-constant latent curvature in co-authorship networks in physics. The code for this paper is available at https://github.com/SteveJWR/netcurve, and the methods are implemented in the R package https://github.com/SteveJWR/lolaR."}, "https://arxiv.org/abs/2302.02310": {"title": "$\\ell_1$-penalized Multinomial Regression: Estimation, inference, and prediction, with an application to risk factor identification for different dementia subtypes", "link": "https://arxiv.org/abs/2302.02310", "description": "arXiv:2302.02310v2 Announce Type: replace \nAbstract: High-dimensional multinomial regression models are very useful in practice but have received less research attention than logistic regression models, especially from the perspective of statistical inference. In this work, we analyze the estimation and prediction error of the contrast-based $\\ell_1$-penalized multinomial regression model and extend the debiasing method to the multinomial case, providing a valid confidence interval for each coefficient and $p$-value of the individual hypothesis test. We also examine cases of model misspecification and non-identically distributed data to demonstrate the robustness of our method when some assumptions are violated. We apply the debiasing method to identify important predictors in the progression into dementia of different subtypes. Results from extensive simulations show the superiority of the debiasing method compared to other inference methods."}, "https://arxiv.org/abs/2305.04113": {"title": "Inferring Covariance Structure from Multiple Data Sources via Subspace Factor Analysis", "link": "https://arxiv.org/abs/2305.04113", "description": "arXiv:2305.04113v3 Announce Type: replace \nAbstract: Factor analysis provides a canonical framework for imposing lower-dimensional structure such as sparse covariance in high-dimensional data. 
High-dimensional data on the same set of variables are often collected under different conditions, for instance in reproducing studies across research groups. In such cases, it is natural to seek to learn the shared versus condition-specific structure. Existing hierarchical extensions of factor analysis have been proposed, but face practical issues including identifiability problems. To address these shortcomings, we propose a class of SUbspace Factor Analysis (SUFA) models, which characterize variation across groups at the level of a lower-dimensional subspace. We prove that the proposed class of SUFA models leads to identifiability of the shared versus group-specific components of the covariance, and study their posterior contraction properties. Taking a Bayesian approach, we develop these contributions alongside efficient posterior computation algorithms. Our sampler fully integrates out latent variables, is easily parallelizable, and has complexity that does not depend on sample size. We illustrate the methods through application to integration of multiple gene expression datasets relevant to immunology."}, "https://arxiv.org/abs/2308.05534": {"title": "Collective Outlier Detection and Enumeration with Conformalized Closed Testing", "link": "https://arxiv.org/abs/2308.05534", "description": "arXiv:2308.05534v2 Announce Type: replace \nAbstract: This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set."}, "https://arxiv.org/abs/2311.04159": {"title": "Uncertainty Quantification using Simulation Output: Batching as an Inferential Device", "link": "https://arxiv.org/abs/2311.04159", "description": "arXiv:2311.04159v2 Announce Type: replace \nAbstract: We present batching as an omnibus device for uncertainty quantification using simulation output. We consider the classical context of a simulationist performing uncertainty quantification on an estimator $\\theta_n$ (of an unknown fixed quantity $\\theta$) using only the output data $(Y_1,Y_2,\\ldots,Y_n)$ gathered from a simulation. By uncertainty quantification, we mean approximating the sampling distribution of the error $\\theta_n-\\theta$ toward: (A) estimating an assessment functional $\\psi$, e.g., bias, variance, or quantile; or (B) constructing a $(1-\\alpha)$-confidence region on $\\theta$. We argue that batching is a remarkably simple and effective device for this purpose, and is especially suited for handling dependent output data such as what one frequently encounters in simulation contexts. We demonstrate that if the number of batches and the extent of their overlap are chosen appropriately, batching retains bootstrap's attractive theoretical properties of strong consistency and higher-order accuracy. 
For constructing confidence regions, we characterize two limiting distributions associated with a Studentized statistic. Our extensive numerical experience confirms theoretical insight, especially about the effects of batch size and batch overlap."}, "https://arxiv.org/abs/2302.02560": {"title": "Causal Estimation of Exposure Shifts with Neural Networks", "link": "https://arxiv.org/abs/2302.02560", "description": "arXiv:2302.02560v4 Announce Type: replace-cross \nAbstract: A fundamental task in causal inference is estimating the effect of distribution shift in the treatment variable. We refer to this problem as shift-response function (SRF) estimation. Existing neural network methods for causal inference lack theoretical guarantees and practical implementations for SRF estimation. In this paper, we introduce Targeted Regularization for Exposure Shifts with Neural Networks (TRESNET), a method to estimate SRFs with robustness and efficiency guarantees. Our contributions are twofold. First, we propose a targeted regularization loss for neural networks with theoretical properties that ensure double robustness and asymptotic efficiency specific to SRF estimation. Second, we extend targeted regularization to support loss functions from the exponential family to accommodate non-continuous outcome distributions (e.g., discrete counts). We conduct benchmark experiments demonstrating TRESNET's broad applicability and competitiveness. We then apply our method to a key policy question in public health to estimate the causal effect of revising the US National Ambient Air Quality Standards (NAAQS) for PM 2.5 from 12 ${\\mu}g/m^3$ to 9 ${\\mu}g/m^3$. This change has been recently proposed by the US Environmental Protection Agency (EPA). Our goal is to estimate the reduction in deaths that would result from this anticipated revision using data consisting of 68 million individuals across the U.S."}, "https://arxiv.org/abs/2408.14604": {"title": "Co-factor analysis of citation networks", "link": "https://arxiv.org/abs/2408.14604", "description": "arXiv:2408.14604v1 Announce Type: new \nAbstract: One compelling use of citation networks is to characterize papers by their relationships to the surrounding literature. We propose a method to characterize papers by embedding them into two distinct \"co-factor\" spaces: one describing how papers send citations, and the other describing how papers receive citations. This approach presents several challenges. First, older documents cannot cite newer documents, and thus it is not clear that co-factors are even identifiable. We resolve this challenge by developing a co-factor model for asymmetric adjacency matrices with missing lower triangles and showing that identification is possible. We then frame estimation as a matrix completion problem and develop a specialized implementation of matrix completion because prior implementations are memory bound in our setting. Simulations show that our estimator has promising finite sample properties, and that naive approaches fail to recover latent co-factor structure. We leverage our estimator to investigate 237,794 papers published in statistics journals from 1898 to 2022, resulting in the most comprehensive topic model of the statistics literature to date. 
We find interpretable co-factors corresponding to many statistical subfields, including time series, variable selection, spatial methods, graphical models, GLM(M)s, causal inference, multiple testing, quantile regression, resampling, semi-parametrics, dimension reduction, and several more."}, "https://arxiv.org/abs/2408.14661": {"title": "Non-Parametric Bayesian Inference for Partial Orders with Ties from Rank Data observed with Mallows Noise", "link": "https://arxiv.org/abs/2408.14661", "description": "arXiv:2408.14661v1 Announce Type: new \nAbstract: Partial orders may be used for modeling and summarising ranking data when the underlying order relations are less strict than a total order. They are a natural choice when the data are lists recording individuals' positions in queues in which queue order is constrained by a social hierarchy, as it may be appropriate to model the social hierarchy as a partial order and the lists as random linear extensions respecting the partial order. In this paper, we set up a new prior model for partial orders incorporating ties by clustering tied actors using a Poisson Dirichlet process. The family of models is projective. We perform Bayesian inference with different choices of noisy observation model. In particular, we propose a Mallows observation model for our partial orders and give a recursive likelihood evaluation algorithm. We demonstrate our model on the 'Royal Acta' (Bishop) list data, where we find the model is favored over well-known alternatives that fit only total orders."}, "https://arxiv.org/abs/2408.14669": {"title": "Inspection-Guided Randomization: A Flexible and Transparent Restricted Randomization Framework for Better Experimental Design", "link": "https://arxiv.org/abs/2408.14669", "description": "arXiv:2408.14669v1 Announce Type: new \nAbstract: Randomized experiments are considered the gold standard for estimating causal effects. However, out of the set of possible randomized assignments, some may be likely to produce poor effect estimates and misleading conclusions. Restricted randomization is an experimental design strategy that filters out undesirable treatment assignments, but its application has primarily been limited to ensuring covariate balance in two-arm studies where the target estimand is the average treatment effect. Other experimental settings with different design desiderata and target effect estimands could also stand to benefit from a restricted randomization approach. We introduce Inspection-Guided Randomization (IGR), a transparent and flexible framework for restricted randomization that filters out undesirable treatment assignments by inspecting assignments against analyst-specified, domain-informed design desiderata. In IGR, the acceptable treatment assignments are locked in ex ante and pre-registered in the trial protocol, thus safeguarding against $p$-hacking and promoting reproducibility. 
Through illustrative simulation studies motivated by education and behavioral health interventions, we demonstrate how IGR can be used to improve effect estimates compared to benchmark designs in group formation experiments and experiments with interference."}, "https://arxiv.org/abs/2408.14671": {"title": "Double/Debiased CoCoLASSO of Treatment Effects with Mismeasured High-Dimensional Control Variables", "link": "https://arxiv.org/abs/2408.14671", "description": "arXiv:2408.14671v1 Announce Type: new \nAbstract: We develop an estimator for treatment effects in high-dimensional settings with additive measurement error, a prevalent challenge in modern econometrics. We introduce the Double/Debiased Convex Conditioned LASSO (Double/Debiased CoCoLASSO), which extends the double/debiased machine learning framework to accommodate mismeasured covariates. Our principal contributions are threefold. (1) We construct a Neyman-orthogonal score function that remains valid under measurement error, incorporating a bias correction term to account for error-induced correlations. (2) We propose a method of moments estimator for the measurement error variance, enabling implementation without prior knowledge of the error covariance structure. (3) We establish the $\\sqrt{N}$-consistency and asymptotic normality of our estimator under general conditions, allowing for both the number of covariates and the magnitude of measurement error to increase with the sample size. Our theoretical results demonstrate the estimator's efficiency within the class of regularized high-dimensional estimators accounting for measurement error. Monte Carlo simulations corroborate our asymptotic theory and illustrate the estimator's robust performance across various levels of measurement error. Notably, our covariance-oblivious approach nearly matches the efficiency of methods that assume known error variance."}, "https://arxiv.org/abs/2408.14691": {"title": "Effects Among the Affected", "link": "https://arxiv.org/abs/2408.14691", "description": "arXiv:2408.14691v1 Announce Type: new \nAbstract: Many interventions are both beneficial to initiate and harmful to stop. Traditionally, the decision of whether to deploy such an intervention in a time-limited way depends on whether, on average, the increase in the benefits of starting it outweighs the increase in the harms of stopping it. We propose a novel causal estimand that provides a more nuanced understanding of the effects of such treatments, in particular, how the response to an earlier treatment (e.g., treatment initiation) modifies the effect of a later treatment (e.g., treatment discontinuation), thus learning whether there are effects among the (un)affected. Specifically, we consider a marginal structural working model summarizing how the average effect of a later treatment varies as a function of the (estimated) conditional average effect of an earlier treatment. We allow for estimation of this conditional average treatment effect using machine learning, such that the causal estimand is a data-adaptive parameter. We show how a sequentially randomized design can be used to identify this causal estimand, and we describe a targeted maximum likelihood estimator for the resulting statistical estimand, with influence curve-based inference. 
Throughout, we use the Adaptive Strategies for Preventing and Treating Lapses of Retention in HIV Care trial (NCT02338739) as an illustrative example, showing that discontinuation of conditional cash transfers for HIV care adherence was most harmful among those who most had an increase in benefits from them initially."}, "https://arxiv.org/abs/2408.14710": {"title": "On the distinction between the per-protocol effect and the effect of the treatment strategy", "link": "https://arxiv.org/abs/2408.14710", "description": "arXiv:2408.14710v1 Announce Type: new \nAbstract: In randomized trials, the per-protocol effect, that is, the effect of being assigned a treatment strategy and receiving treatment according to the assigned strategy, is sometimes thought to reflect the effect of the treatment strategy itself, without intervention on assignment. Here, we argue by example that this is not necessarily the case. We examine a causal structure for a randomized trial where these two causal estimands -- the per-protocol effect and the effect of the treatment strategy -- are not equal, and where their corresponding identifying observed data functionals are not the same, but both require information on assignment for identification. Our example highlights the conceptual difference between the per-protocol effect and the effect of the treatment strategy itself, the conditions under which the observed data functionals for these estimands are equal, and suggests that in some cases their identification requires information on assignment, even when assignment is randomized. An implication of these findings is that in observational analyses that aim to emulate a target randomized trial in which an analog of assignment is well-defined, the effect of the treatment strategy is not necessarily an observational analog of the per-protocol effect. Furthermore, either of these effects may be unidentifiable without information on treatment assignment, unless one makes additional assumptions; informally, that assignment does not affect the outcome except through treatment (i.e., an exclusion-restriction assumption), and that assignment is not a confounder of the treatment outcome association conditional on other variables in the analysis."}, "https://arxiv.org/abs/2408.14766": {"title": "Differentially Private Estimation of Weighted Average Treatment Effects for Binary Outcomes", "link": "https://arxiv.org/abs/2408.14766", "description": "arXiv:2408.14766v1 Announce Type: new \nAbstract: In the social and health sciences, researchers often make causal inferences using sensitive variables. These researchers, as well as the data holders themselves, may be ethically and perhaps legally obligated to protect the confidentiality of study participants' data. It is now known that releasing any statistics, including estimates of causal effects, computed with confidential data leaks information about the underlying data values. Thus, analysts may desire to use causal estimators that can provably bound this information leakage. Motivated by this goal, we develop algorithms for estimating weighted average treatment effects with binary outcomes that satisfy the criterion of differential privacy. We present theoretical results on the accuracy of several differentially private estimators of weighted average treatment effects. 
We illustrate the empirical performance of these estimators using simulated data and a causal analysis using data on education and income."}, "https://arxiv.org/abs/2408.15052": {"title": "stopp: An R Package for Spatio-Temporal Point Pattern Analysis", "link": "https://arxiv.org/abs/2408.15052", "description": "arXiv:2408.15052v1 Announce Type: new \nAbstract: stopp is a novel R package specifically designed for the analysis of spatio-temporal point patterns which might have occurred in a subset of the Euclidean space or on some specific linear network, such as the roads of a city. It represents the first package providing a comprehensive modelling framework for spatio-temporal Poisson point processes. While many specialized models exist in the scientific literature for analyzing complex spatio-temporal point patterns, we address the lack of general software for comparing simpler alternative models and their goodness of fit. The package's main functionalities include modelling and diagnostics, together with exploratory analysis tools and the simulation of point processes. A particular focus is given to local first-order and second-order characteristics. The package aggregates existing methods within one coherent framework, including those we proposed in recent papers, and it aims to welcome many further proposals and extensions from the R community."}, "https://arxiv.org/abs/2408.15058": {"title": "Competing risks models with two time scales", "link": "https://arxiv.org/abs/2408.15058", "description": "arXiv:2408.15058v1 Announce Type: new \nAbstract: Competing risks models can involve more than one time scale. A relevant example is the study of mortality after a cancer diagnosis, where time since diagnosis but also age may jointly determine the hazards of death due to different causes. Multiple time scales have rarely been explored in the context of competing events. Here, we propose a model in which the cause-specific hazards vary smoothly over two time scales. It is estimated by two-dimensional $P$-splines, exploiting the equivalence between hazard smoothing and Poisson regression. The data are arranged on a grid so that we can make use of generalized linear array models for efficient computations. The R-package TwoTimeScales implements the model.\n As a motivating example we analyse mortality after diagnosis of breast cancer and we distinguish between death due to breast cancer and all other causes of death. The time scales are age and time since diagnosis. We use data from the Surveillance, Epidemiology and End Results (SEER) program. In the SEER data, age at diagnosis is provided with a last open-ended category, leading to coarsely grouped data. We use the two-dimensional penalised composite link model to ungroup the data before applying the competing risks model with two time scales."}, "https://arxiv.org/abs/2408.15061": {"title": "Diagnosing overdispersion in longitudinal analyses with grouped nominal polytomous data", "link": "https://arxiv.org/abs/2408.15061", "description": "arXiv:2408.15061v1 Announce Type: new \nAbstract: Experiments in Agricultural Sciences often involve the analysis of longitudinal nominal polytomous variables, both in individual and grouped structures. Marginal and mixed-effects models are two common approaches. The distributional assumptions induce specific mean-variance relationships; however, in many instances, the observed variability is greater than assumed by the model. 
This characterizes overdispersion, whose identification is crucial for choosing an appropriate modeling framework to make inferences reliable. We propose an initial exploration of constructing a longitudinal multinomial dispersion index as a descriptive and diagnostic tool. This index is calculated as the ratio between the observed and assumed variances. The performance of this index was evaluated through a simulation study covering different scenarios. We found that, as the index approaches one, it is more likely to correspond to a high degree of overdispersion. Conversely, values closer to zero indicate a low degree of overdispersion. As a case study, we present an application in animal science, in which the behaviour of pigs (grouped in stalls) is evaluated, considering three response categories."}, "https://arxiv.org/abs/2408.14625": {"title": "A Bayesian approach for fitting semi-Markov mixture models of cancer latency to individual-level data", "link": "https://arxiv.org/abs/2408.14625", "description": "arXiv:2408.14625v1 Announce Type: cross \nAbstract: Multi-state models of cancer natural history are widely used for designing and evaluating cancer early detection strategies. Calibrating such models against longitudinal data from screened cohorts is challenging, especially when fitting non-Markovian mixture models against individual-level data. Here, we consider a family of semi-Markov mixture models of cancer natural history and introduce an efficient data-augmented Markov chain Monte Carlo sampling algorithm for fitting these models to individual-level screening and cancer diagnosis histories. Our fully Bayesian approach supports rigorous uncertainty quantification and model selection through leave-one-out cross-validation, and it enables the estimation of screening-related overdiagnosis rates. We demonstrate the effectiveness of our approach using synthetic data, showing that the sampling algorithm efficiently explores the joint posterior distribution of model parameters and latent variables. Finally, we apply our method to data from the US Breast Cancer Surveillance Consortium and estimate the extent of breast cancer overdiagnosis associated with mammography screening. The sampler and model comparison method are available in the R package baclava."}, "https://arxiv.org/abs/2302.01269": {"title": "Adjusting for Incomplete Baseline Covariates in Randomized Controlled Trials: A Cross-World Imputation Framework", "link": "https://arxiv.org/abs/2302.01269", "description": "arXiv:2302.01269v2 Announce Type: replace \nAbstract: In randomized controlled trials, adjusting for baseline covariates is often applied to improve the precision of treatment effect estimation. However, missingness in covariates is common. Recently, Zhao & Ding (2022) studied two simple strategies, the single imputation method and missingness indicator method (MIM), to deal with missing covariates, and showed that both methods can provide efficiency gain. To better understand and compare these two strategies, we propose and investigate a novel imputation framework termed cross-world imputation (CWI), which includes single imputation and MIM as special cases. Through the lens of CWI, we show that MIM implicitly searches for the optimal CWI values and thus achieves optimal efficiency. 
We also derive conditions under which the single imputation method, by searching for the optimal single imputation values, can achieve the same efficiency as the MIM."}, "https://arxiv.org/abs/2310.12391": {"title": "Online Semiparametric Regression via Sequential Monte Carlo", "link": "https://arxiv.org/abs/2310.12391", "description": "arXiv:2310.12391v2 Announce Type: replace \nAbstract: We develop and describe online algorithms for performing online semiparametric regression analyses. Earlier work on this topic is in Luts, Broderick & Wand (J. Comput. Graph. Statist., 2014) where online mean field variational Bayes was employed. In this article we instead develop sequential Monte Carlo approaches to circumvent well-known inaccuracies inherent in variational approaches. Even though sequential Monte Carlo is not as fast as online mean field variational Bayes, it can be a viable alternative for applications where the data rate is not overly high. For Gaussian response semiparametric regression models our new algorithms share the online mean field variational Bayes property of only requiring updating and storage of sufficient statistics quantities of streaming data. In the non-Gaussian case accurate real-time semiparametric regression requires the full data to be kept in storage. The new algorithms allow for new options concerning accuracy/speed trade-offs for online semiparametric regression."}, "https://arxiv.org/abs/2308.15062": {"title": "Forecasting with Feedback", "link": "https://arxiv.org/abs/2308.15062", "description": "arXiv:2308.15062v3 Announce Type: replace-cross \nAbstract: Systematically biased forecasts are typically interpreted as evidence of forecasters' irrationality and/or asymmetric loss. In this paper we propose an alternative explanation: when forecasts inform economic policy decisions, and the resulting actions affect the realization of the forecast target itself, forecasts may be optimally biased even under quadratic loss. The result arises in environments in which the forecaster is uncertain about the decision maker's reaction to the forecast, which is presumably the case in most applications. We illustrate the empirical relevance of our theory by reviewing some stylized properties of Green Book inflation forecasts and relating them to the predictions from our model. Our results point out that the presence of policy feedback poses a challenge to traditional tests of forecast rationality."}, "https://arxiv.org/abs/2310.09702": {"title": "Inference with Mondrian Random Forests", "link": "https://arxiv.org/abs/2310.09702", "description": "arXiv:2310.09702v2 Announce Type: replace-cross \nAbstract: Random forests are popular methods for regression and classification analysis, and many different variants have been proposed in recent years. One interesting example is the Mondrian random forest, in which the underlying constituent trees are constructed via a Mondrian process. We give precise bias and variance characterizations, along with a Berry-Esseen-type central limit theorem, for the Mondrian random forest regression estimator. By combining these results with a carefully crafted debiasing approach and an accurate variance estimator, we present valid statistical inference methods for the unknown regression function. These methods come with explicitly characterized error bounds in terms of the sample size, tree complexity parameter, and number of trees in the forest, and include coverage error rates for feasible confidence interval estimators. 
Our novel debiasing procedure for the Mondrian random forest also allows it to achieve the minimax-optimal point estimation convergence rate in mean squared error for multivariate $\\beta$-H\\\"older regression functions, for all $\\beta > 0$, provided that the underlying tuning parameters are chosen appropriately. Efficient and implementable algorithms are devised for both batch and online learning settings, and we carefully study the computational complexity of different Mondrian random forest implementations. Finally, simulations with synthetic data validate our theory and methodology, demonstrating their excellent finite-sample properties."}, "https://arxiv.org/abs/2408.15452": {"title": "The effects of data preprocessing on probability of default model fairness", "link": "https://arxiv.org/abs/2408.15452", "description": "arXiv:2408.15452v1 Announce Type: new \nAbstract: In the context of financial credit risk evaluation, the fairness of machine learning models has become a critical concern, especially given the potential for biased predictions that disproportionately affect certain demographic groups. This study investigates the impact of data preprocessing, with a specific focus on Truncated Singular Value Decomposition (SVD), on the fairness and performance of probability of default models. Using a comprehensive dataset sourced from Kaggle, various preprocessing techniques, including SVD, were applied to assess their effect on model accuracy, discriminatory power, and fairness."}, "https://arxiv.org/abs/2408.15454": {"title": "BayesSRW: Bayesian Sampling and Re-weighting approach for variance reduction", "link": "https://arxiv.org/abs/2408.15454", "description": "arXiv:2408.15454v1 Announce Type: new \nAbstract: In this paper, we address the challenge of sampling in scenarios where limited resources prevent exhaustive measurement across all subjects. We consider a setting where samples are drawn from multiple groups, each following a distribution with unknown mean and variance parameters. We introduce a novel sampling strategy, motivated simply by Cauchy-Schwarz inequality, which minimizes the variance of the population mean estimator by allocating samples proportionally to both the group size and the standard deviation. This approach improves the efficiency of sampling by focusing resources on groups with greater variability, thereby enhancing the precision of the overall estimate. Additionally, we extend our method to a two-stage sampling procedure in a Bayes approach, named BayesSRW, where a preliminary stage is used to estimate the variance, which then informs the optimal allocation of the remaining sampling budget. Through simulation examples, we demonstrate the effectiveness of our approach in reducing estimation uncertainty and providing more reliable insights in applications ranging from user experience surveys to high-dimensional peptide array studies."}, "https://arxiv.org/abs/2408.15502": {"title": "ROMI: A Randomized Two-Stage Basket Trial Design to Optimize Doses for Multiple Indications", "link": "https://arxiv.org/abs/2408.15502", "description": "arXiv:2408.15502v1 Announce Type: new \nAbstract: Optimizing doses for multiple indications is challenging. The pooled approach of finding a single optimal biological dose (OBD) for all indications ignores that dose-response or dose-toxicity curves may differ between indications, resulting in varying OBDs. Conversely, indication-specific dose optimization often requires a large sample size. 
To address this challenge, we propose a Randomized two-stage basket trial design that Optimizes doses in Multiple Indications (ROMI). In stage 1, for each indication, response and toxicity are evaluated for a high dose, which may be a previously obtained MTD, with a rule that stops accrual to indications where the high dose is unsafe or ineffective. Indications not terminated proceed to stage 2, where patients are randomized between the high dose and a specified lower dose. A latent-cluster Bayesian hierarchical model is employed to borrow information between indications, while considering the potential heterogeneity of OBD across indications. Indication-specific utilities are used to quantify response-toxicity trade-offs. At the end of stage 2, for each indication with at least one acceptable dose, the dose with highest posterior mean utility is selected as optimal. Two versions of ROMI are presented, one using only stage 2 data for dose optimization and the other optimizing doses using data from both stages. Simulations show that both versions have desirable operating characteristics compared to designs that either ignore indications or optimize dose independently for each indication."}, "https://arxiv.org/abs/2408.15607": {"title": "Comparing restricted mean survival times in small sample clinical trials using pseudo-observations", "link": "https://arxiv.org/abs/2408.15607", "description": "arXiv:2408.15607v1 Announce Type: new \nAbstract: The widely used proportional hazard assumption cannot be assessed reliably in small-scale clinical trials and might often in fact be unjustified, e.g. due to delayed treatment effects. An alternative to the hazard ratio as effect measure is the difference in restricted mean survival time (RMST) that does not rely on model assumptions. Although an asymptotic test for two-sample comparisons of the RMST exists, it has been shown to suffer from an inflated type I error rate in samples of small or moderate sizes. Recently, permutation tests, including the studentized permutation test, have been introduced to address this issue. In this paper, we propose two methods based on pseudo-observations (PO) regression models as alternatives for such scenarios and assess their properties in comparison to previously proposed approaches in an extensive simulation study. Furthermore, we apply the proposed PO methods to data from a clinical trial and, by doing so, point out some extensions that might be very useful for practical applications such as covariate adjustments."}, "https://arxiv.org/abs/2408.15617": {"title": "Network Representation of Higher-Order Interactions Based on Information Dynamics", "link": "https://arxiv.org/abs/2408.15617", "description": "arXiv:2408.15617v1 Announce Type: new \nAbstract: Many complex systems in science and engineering are modeled as networks whose nodes and links depict the temporal evolution of each system unit and the dynamic interaction between pairs of units, which are assessed respectively using measures of auto- and cross-correlation or variants thereof. However, a growing body of work is documenting that this standard network representation can neglect potentially crucial information shared by three or more dynamic processes in the form of higher-order interactions (HOIs). While several measures, mostly derived from information theory, are available to assess HOIs in network systems mapped by multivariate time series, none of them is able to provide a compact and detailed representation of higher-order interdependencies. 
In this work, we fill this gap by introducing a framework for the assessment of HOIs in dynamic network systems at different levels of resolution. The framework is grounded on the dynamic implementation of the O-information, a new measure assessing HOIs in dynamic networks, which is here used together with its local counterpart and its gradient to quantify HOIs respectively for the network as a whole, for each link, and for each node. The integration of these measures into the conventional network representation results in a tool for the representation of HOIs as networks, which is defined formally using measures of information dynamics, implemented in its linear version by using vector regression models and statistical validation techniques, illustrated in simulated network systems, and finally applied to an illustrative example in the field of network physiology."}, "https://arxiv.org/abs/2408.15623": {"title": "Correlation-Adjusted Simultaneous Testing for Ultra High-dimensional Grouped Data", "link": "https://arxiv.org/abs/2408.15623", "description": "arXiv:2408.15623v1 Announce Type: new \nAbstract: Epigenetics plays a crucial role in understanding the underlying molecular processes of several types of cancer as well as the determination of innovative therapeutic tools. To investigate the complex interplay between genetics and environment, we develop a novel procedure to identify differentially methylated probes (DMPs) among cases and controls. Statistically, this translates to an ultra high-dimensional testing problem with sparse signals and an inherent grouping structure. When the total number of variables being tested is massive and typically exhibits some degree of dependence, existing group-wise multiple comparisons adjustment methods lead to inflated false discoveries. We propose a class of Correlation-Adjusted Simultaneous Testing (CAST) procedures incorporating the general dependence among probes within and between genes to control the false discovery rate (FDR). Simulations demonstrate that CASTs have superior empirical power while maintaining the FDR compared to the benchmark group-wise methods. Moreover, while the benchmark fails to control FDR for small-sized grouped correlated data, CAST exhibits robustness in controlling FDR across varying group sizes. In bladder cancer data, the proposed CAST method confirms some existing differentially methylated probes implicated in the disease (Langevin et al., 2014). However, CAST was able to detect novel DMPs that the previous study (Langevin et al., 2014) failed to identify. The CAST method can accurately identify significant potential biomarkers and facilitate informed decision-making aligned with precision medicine in the context of complex data analysis."}, "https://arxiv.org/abs/2408.15670": {"title": "Adaptive Weighted Random Isolation (AWRI): a simple design to estimate causal effects under network interference", "link": "https://arxiv.org/abs/2408.15670", "description": "arXiv:2408.15670v1 Announce Type: new \nAbstract: Recently, causal inference under interference has gained increasing attention in the literature. In this paper, we focus on randomized designs for estimating the total treatment effect (TTE), defined as the average difference in potential outcomes between fully treated and fully controlled groups. We propose a simple design called weighted random isolation (WRI) along with a restricted difference-in-means estimator (RDIM) for TTE estimation. 
Additionally, we derive a novel mean squared error surrogate for the RDIM estimator, supported by a network-adaptive weight selection algorithm. This can help us determine a fair weight for the WRI design, thereby effectively reducing the bias. Our method accommodates directed networks, extending previous frameworks. Extensive simulations demonstrate that the proposed method outperforms nine established methods across a wide range of scenarios."}, "https://arxiv.org/abs/2408.15701": {"title": "Robust discriminant analysis", "link": "https://arxiv.org/abs/2408.15701", "description": "arXiv:2408.15701v1 Announce Type: new \nAbstract: Discriminant analysis (DA) is one of the most popular methods for classification due to its conceptual simplicity, low computational cost, and often solid performance. In its standard form, DA uses the arithmetic mean and sample covariance matrix to estimate the center and scatter of each class. We discuss and illustrate how this makes standard DA very sensitive to suspicious data points, such as outliers and mislabeled cases. We then present an overview of techniques for robust DA, which are more reliable in the presence of deviating cases. In particular, we review DA based on robust estimates of location and scatter, along with graphical diagnostic tools for visualizing the results of DA."}, "https://arxiv.org/abs/2408.15806": {"title": "Bayesian analysis of product feature allocation models", "link": "https://arxiv.org/abs/2408.15806", "description": "arXiv:2408.15806v1 Announce Type: new \nAbstract: Feature allocation models are an extension of Bayesian nonparametric clustering models, where individuals can share multiple features. We study a broad class of models whose probability distribution has a product form, which includes the popular Indian buffet process. This class plays a prominent role among existing priors, and it shares structural characteristics with Gibbs-type priors in the species sampling framework. We develop a general theory for the entire class, obtaining closed form expressions for the predictive structure and the posterior law of the underlying stochastic process. Additionally, we describe the distribution for the number of features and the number of hitherto unseen features in a future sample, leading to the $\\alpha$-diversity for feature models. We also examine notable novel examples, such as mixtures of Indian buffet processes and beta Bernoulli models, where the latter entails a finite random number of features. This methodology finds significant applications in ecology, allowing the estimation of species richness for incidence data, as we demonstrate by analyzing plant diversity in Danish forests and trees in Barro Colorado Island."}, "https://arxiv.org/abs/2408.15862": {"title": "Marginal homogeneity tests with panel data", "link": "https://arxiv.org/abs/2408.15862", "description": "arXiv:2408.15862v1 Announce Type: new \nAbstract: A panel dataset satisfies marginal homogeneity if the time-specific marginal distributions are homogeneous or time-invariant. Marginal homogeneity is relevant in economic settings such as dynamic discrete games. In this paper, we propose several tests for the hypothesis of marginal homogeneity and investigate their properties. We consider an asymptotic framework in which the number of individuals n in the panel diverges, and the number of periods T is fixed. 
We implement our tests by comparing a studentized or non-studentized T-sample version of the Cramer-von Mises statistic with a suitable critical value. We propose three methods to construct the critical value: asymptotic approximations, the bootstrap, and time permutations. We show that the first two methods result in asymptotically exact hypothesis tests. The permutation test based on a non-studentized statistic is asymptotically exact when T=2, but is asymptotically invalid when T>2. In contrast, the permutation test based on a studentized statistic is always asymptotically exact. Finally, under a time-exchangeability assumption, the permutation test is exact in finite samples, both with and without studentization."}, "https://arxiv.org/abs/2408.15964": {"title": "On harmonic oscillator hazard functions", "link": "https://arxiv.org/abs/2408.15964", "description": "arXiv:2408.15964v1 Announce Type: new \nAbstract: We propose a parametric hazard model obtained by enforcing positivity in the damped harmonic oscillator. The resulting model has closed-form hazard and cumulative hazard functions, facilitating likelihood and Bayesian inference on the parameters. We show that this model can capture a range of hazard shapes, such as increasing, decreasing, unimodal, bathtub, and oscillatory patterns, and characterize the tails of the corresponding survival function. We illustrate the use of this model in survival analysis using real data."}, "https://arxiv.org/abs/2408.15979": {"title": "Comparing the Pearson and Spearman Correlation Coefficients Across Distributions and Sample Sizes: A Tutorial Using Simulations and Empirical Data", "link": "https://arxiv.org/abs/2408.15979", "description": "arXiv:2408.15979v1 Announce Type: new \nAbstract: The Pearson product-moment correlation coefficient (rp) and the Spearman rank correlation coefficient (rs) are widely used in psychological research. We compare rp and rs on 3 criteria: variability, bias with respect to the population value, and robustness to an outlier. Using simulations across low (N = 5) to high (N = 1,000) sample sizes we show that, for normally distributed variables, rp and rs have similar expected values but rs is more variable, especially when the correlation is strong. However, when the variables have high kurtosis, rp is more variable than rs. Next, we conducted a sampling study of a psychometric dataset featuring symmetrically distributed data with light tails, and of 2 Likert-type survey datasets, 1 with light-tailed and the other with heavy-tailed distributions. Consistent with the simulations, rp had lower variability than rs in the psychometric dataset. In the survey datasets with heavy-tailed variables in particular, rs had lower variability than rp, and often corresponded more accurately to the population Pearson correlation coefficient (Rp) than rp did. The simulations and the sampling studies showed that variability in terms of standard deviations can be reduced by about 20% by choosing rs instead of rp. In comparison, increasing the sample size by a factor of 2 results in a 41% reduction of the standard deviations of rs and rp. 
In conclusion, rp is suitable for light-tailed distributions, whereas rs is preferable when variables feature heavy-tailed distributions or when outliers are present, as is often the case in psychological research."}, "https://arxiv.org/abs/2408.15260": {"title": "Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data", "link": "https://arxiv.org/abs/2408.15260", "description": "arXiv:2408.15260v1 Announce Type: cross \nAbstract: Synthetic Data is not new, but recent advances in Generative AI have raised interest in expanding the research toolbox, creating new opportunities and risks. This article provides a taxonomy of the full breadth of the Synthetic Data domain. We discuss its place in the research ecosystem by linking the advances in computational social science with the idea of the Fourth Paradigm of scientific discovery that integrates the elements of the evolution from empirical to theoretic to computational models. Further, leveraging the framework of Truth, Beauty, and Justice, we discuss how evaluation criteria vary across use cases as the information is used to add value and draw insights. Building a framework to organize different types of synthetic data, we end by describing the opportunities and challenges with detailed examples of using Generative AI to create synthetic quantitative and qualitative datasets and discuss the broader spectrum including synthetic populations, expert systems, survey data replacement, and personabots."}, "https://arxiv.org/abs/2408.15387": {"title": "Semiparametric Modelling of Cancer Mortality Trends in Colombia", "link": "https://arxiv.org/abs/2408.15387", "description": "arXiv:2408.15387v1 Announce Type: cross \nAbstract: In this paper, we compare semiparametric and parametric model fits for cancer mortality in breast and cervical cancer in women and prostate and lung cancer in men, according to age and period of death. Semiparametric models were fitted to the number of deaths from the two localizations of greatest mortality by sex: breast and cervix in women; prostate and lungs in men. Fits of different semiparametric models were compared, which included using different distributions and variable combinations in the parametric and non-parametric parts, for location as well as for scale. Finally, the semiparametric model with the best fit was selected and compared to the traditional model; that is, to the generalized linear model with Poisson response and logarithmic link. The best results for the four kinds of cancer were obtained for the selected semiparametric model when compared to the traditional Poisson model, based upon AIC and the envelope correlation between the estimated and real log rates. In general, we observe that the estimated rate increases with age; however, with respect to period, breast cancer and stomach cancer in men show a tendency to rise over time; on the other hand, for cervical cancer, it remains virtually constant, but for lung cancer in men, as of 2007, it tends to decrease."}, "https://arxiv.org/abs/2408.15451": {"title": "Certified Causal Defense with Generalizable Robustness", "link": "https://arxiv.org/abs/2408.15451", "description": "arXiv:2408.15451v1 Announce Type: cross \nAbstract: While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense. 
Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials."}, "https://arxiv.org/abs/2408.15612": {"title": "Cellwise robust and sparse principal component analysis", "link": "https://arxiv.org/abs/2408.15612", "description": "arXiv:2408.15612v1 Announce Type: cross \nAbstract: A first proposal of a sparse and cellwise robust PCA method is presented. Robustness to single outlying cells in the data matrix is achieved by substituting the squared loss function for the approximation error by a robust version. The integration of a sparsity-inducing $L_1$ or elastic net penalty offers additional modeling flexibility. For the resulting challenging optimization problem, an algorithm based on Riemannian stochastic gradient descent is developed, with the advantage of being scalable to high-dimensional data, both in terms of many variables as well as observations. The resulting method is called SCRAMBLE (Sparse Cellwise Robust Algorithm for Manifold-based Learning and Estimation). Simulations reveal the superiority of this approach in comparison to established methods, both in the casewise and cellwise robustness paradigms. Two applications from the field of tribology underline the advantages of a cellwise robust and sparse PCA method."}, "https://arxiv.org/abs/2305.14265": {"title": "Adapting to Misspecification", "link": "https://arxiv.org/abs/2305.14265", "description": "arXiv:2305.14265v4 Announce Type: replace \nAbstract: Empirical research typically involves a robustness-efficiency tradeoff. A researcher seeking to estimate a scalar parameter can invoke strong assumptions to motivate a restricted estimator that is precise but may be heavily biased, or they can relax some of these assumptions to motivate a more robust, but variable, unrestricted estimator. When a bound on the bias of the restricted estimator is available, it is optimal to shrink the unrestricted estimator towards the restricted estimator. For settings where a bound on the bias of the restricted estimator is unknown, we propose adaptive estimators that minimize the percentage increase in worst case risk relative to an oracle that knows the bound. 
We show that adaptive estimators solve a weighted convex minimax problem and provide lookup tables facilitating their rapid computation. Revisiting some well known empirical studies where questions of model specification arise, we examine the advantages of adapting to -- rather than testing for -- misspecification."}, "https://arxiv.org/abs/2309.00870": {"title": "Robust estimation for number of factors in high dimensional factor modeling via Spearman correlation matrix", "link": "https://arxiv.org/abs/2309.00870", "description": "arXiv:2309.00870v2 Announce Type: replace \nAbstract: Determining the number of factors in high-dimensional factor modeling is essential but challenging, especially when the data are heavy-tailed. In this paper, we introduce a new estimator based on the spectral properties of Spearman sample correlation matrix under the high-dimensional setting, where both dimension and sample size tend to infinity proportionally. Our estimator is robust against heavy tails in either the common factors or idiosyncratic errors. The consistency of our estimator is established under mild conditions. Numerical experiments demonstrate the superiority of our estimator compared to existing methods."}, "https://arxiv.org/abs/2310.04576": {"title": "Challenges in Statistically Rejecting the Perfect Competition Hypothesis Using Imperfect Competition Data", "link": "https://arxiv.org/abs/2310.04576", "description": "arXiv:2310.04576v4 Announce Type: replace \nAbstract: We theoretically prove why statistically rejecting the null hypothesis of perfect competition is challenging, known as a common problem in the literature. We also assess the finite sample performance of the conduct parameter test in homogeneous goods markets, showing that statistical power increases with the number of markets, a larger conduct parameter, and a stronger demand rotation instrument. However, even with a moderate number of markets and five firms, rejecting the null hypothesis of perfect competition remains difficult, irrespective of instrument strength or the use of optimal instruments. Our findings suggest that empirical results failing to reject perfect competition are due to the limited number of markets rather than methodological shortcomings."}, "https://arxiv.org/abs/2002.09377": {"title": "Misspecification-robust likelihood-free inference in high dimensions", "link": "https://arxiv.org/abs/2002.09377", "description": "arXiv:2002.09377v4 Announce Type: replace-cross \nAbstract: Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy to a useful tool for practitioners. However, models with more than a handful of parameters still generally remain a challenge for the Approximate Bayesian Computation (ABC) based inference. To advance the possibilities for performing likelihood-free inference in higher dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation based approach to approximate discrepancy functions in a probabilistic manner which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher dimensional parameter spaces by using separate acquisition functions and discrepancies for each parameter. The efficient additive acquisition structure is combined with exponentiated loss -likelihood to provide a misspecification-robust characterisation of the marginal posterior distribution for all model parameters. 
The method successfully performs computationally efficient inference in a 100-dimensional space on canonical examples and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space."}, "https://arxiv.org/abs/2204.06943": {"title": "Option Pricing with Time-Varying Volatility Risk Aversion", "link": "https://arxiv.org/abs/2204.06943", "description": "arXiv:2204.06943v3 Announce Type: replace-cross \nAbstract: We introduce a pricing kernel with time-varying volatility risk aversion that can explain the observed time variation in the shape of the pricing kernel. Dynamic volatility risk aversion, combined with the Heston-Nandi GARCH model, leads to a convenient option pricing model, denoted DHNG. The variance risk ratio emerges as a fundamental variable, and we show that it is closely related to economic fundamentals and common measures of sentiment and uncertainty. DHNG yields a closed-form pricing formula for the VIX, and we propose a novel approximation method that provides analytical expressions for option prices. We estimate the model using S&P 500 returns, the VIX, and option prices, and find that dynamic volatility risk aversion leads to a substantial reduction in VIX and option pricing errors."}, "https://arxiv.org/abs/2408.16023": {"title": "Inferring the parameters of Taylor's law in ecology", "link": "https://arxiv.org/abs/2408.16023", "description": "arXiv:2408.16023v1 Announce Type: new \nAbstract: Taylor's power law (TL) or fluctuation scaling has been verified empirically for the abundances of many species, human and non-human, and in many other fields including physics, meteorology, computer science, and finance. TL asserts that the variance is directly proportional to a power of the mean, exactly for population moments and, whether or not population moments exist, approximately for sample moments. In many papers, linear regression of log variance as a function of log mean is used to estimate TL's parameters. We provide some statistical guarantees with large-sample asymptotics for this kind of inference under general conditions, and we derive confidence intervals for the parameters. In many ecological applications, the means and variances are estimated over time or across space from arrays of abundance data collected at different locations and time points. When the ratio between the time-series length and the number of spatial points converges to a constant as both become large, the usual normalized statistics are asymptotically biased. We provide a bias correction to get correct confidence intervals. TL, widely studied in multiple sciences, is a source of challenging new statistical problems in a nonstationary spatiotemporal framework. We illustrate our results with both simulated and real data sets."}, "https://arxiv.org/abs/2408.16039": {"title": "Group Difference in Differences can Identify Effect Heterogeneity in Non-Canonical Settings", "link": "https://arxiv.org/abs/2408.16039", "description": "arXiv:2408.16039v1 Announce Type: new \nAbstract: Consider a very general setting in which data on an outcome of interest is collected in two `groups' at two time periods, with certain group-periods deemed `treated' and others `untreated'. 
A special case is the canonical Difference-in-Differences (DiD) setting in which one group is treated only in the second period while the other is treated in neither period. Then it is well known that under a parallel trends assumption across the two groups the classic DiD formula (the average change in the outcome across periods in the treated group minus the average change in the outcome across periods in the untreated group) identifies the average treatment effect on the treated in the second period. But other relations between group, period, and treatment are possible. For example, the groups might be demographic (or other baseline covariate) categories with all units in both groups treated in the second period and none treated in the first, i.e. a pre-post design. Or one group might be treated in both periods while the other is treated in neither. In these non-canonical settings (lacking a control group or a pre-period), some researchers still compute DiD estimators, while others avoid causal inference altogether. In this paper, we will elucidate the group-period-treatment scenarios and corresponding parallel trends assumptions under which a DiD formula identifies meaningful causal estimands and what those causal estimands are. We find that in non-canonical settings, under a group parallel trends assumption the DiD formula identifies effect heterogeneity in the treated across groups or across time periods (depending on the setting)."}, "https://arxiv.org/abs/2408.16129": {"title": "Direct-Assisted Bayesian Unit-level Modeling for Small Area Estimation of Rare Event Prevalence", "link": "https://arxiv.org/abs/2408.16129", "description": "arXiv:2408.16129v1 Announce Type: new \nAbstract: Small area estimation using survey data can be achieved by using either a design-based or a model-based inferential approach. With respect to assumptions, design-based direct estimators are generally preferable because of their consistency and asymptotic normality. However, when data are sparse at the desired area level, as is often the case when measuring rare events for example, these direct estimators can have extremely large uncertainty, making a model-based approach preferable. A model-based approach with a random spatial effect borrows information from surrounding areas at the cost of inducing shrinkage towards the local average. As a result, estimates may be over-smoothed and inconsistent with design-based estimates at higher area levels when aggregated. We propose a unit-level Bayesian model for small area estimation of rare event prevalence which uses design-based direct estimates at a higher area level to increase accuracy, precision, and consistency in aggregation. After introducing the model and its implementation, we conduct a simulation study to compare its properties to alternative models and apply it to the estimation of the neonatal mortality rate in Zambia, using 2014 DHS data."}, "https://arxiv.org/abs/2408.16153": {"title": "Statistical comparison of quality attributes_a range-based approach", "link": "https://arxiv.org/abs/2408.16153", "description": "arXiv:2408.16153v1 Announce Type: new \nAbstract: A novel approach for comparing quality attributes of different products when there is considerable product-related variability is proposed. In such a case, the whole range of possible realizations must be considered. 
Looking, for example, at the respective information published by agencies like the EMA or the FDA, one can see that commonly accepted tests together with the proper statistical framework are not yet available. This work attempts to close this gap in the treatment of range-based comparisons. The question of when two products can be considered similar with respect to a certain property is discussed and a framework for such a statistical comparison is presented, which is based on the proposed concept of kappa-cover. Assuming normally distributed quality attributes, a statistical test termed covering-test is proposed. Simulations show that this test possesses desirable statistical properties with respect to small sample size and power. In order to demonstrate the usefulness of the suggested concept, the proposed test is applied to a data set from the pharmaceutical industry."}, "https://arxiv.org/abs/2408.16330": {"title": "Sensitivity Analysis for Dynamic Discrete Choice Models", "link": "https://arxiv.org/abs/2408.16330", "description": "arXiv:2408.16330v1 Announce Type: new \nAbstract: In dynamic discrete choice models, some parameters, such as the discount factor, are fixed instead of being estimated. This paper proposes two sensitivity analysis procedures for dynamic discrete choice models with respect to the fixed parameters. First, I develop a local sensitivity measure that estimates the change in the target parameter for a unit change in the fixed parameter. This measure is fast to compute as it does not require model re-estimation. Second, I propose a global sensitivity analysis procedure that uses model primitives to study the relationship between target parameters and fixed parameters. I show how to apply the sensitivity analysis procedures of this paper through two empirical applications."}, "https://arxiv.org/abs/2408.16381": {"title": "Uncertainty quantification for intervals", "link": "https://arxiv.org/abs/2408.16381", "description": "arXiv:2408.16381v1 Announce Type: new \nAbstract: Data following an interval structure are increasingly prevalent in many scientific applications. In medicine, clinical events are often monitored between two clinical visits, making the exact time of the event unknown and generating outcomes with a range format. As interest in automating healthcare decisions grows, uncertainty quantification via predictive regions becomes essential for developing reliable and trustworthy predictive algorithms. However, the statistical literature currently lacks a general methodology for interval targets, especially when these outcomes are incomplete due to censoring. We propose an uncertainty quantification algorithm and establish its theoretical properties using empirical process arguments based on a newly developed class of functions specifically designed for interval data structures. Although this paper primarily focuses on deriving predictive regions for interval-censored data, the approach can also be applied to other statistical modeling tasks, such as goodness-of-fit assessments. 
Finally, the applicability of the methods developed here is illustrated through various biomedical applications, including two clinical examples: i) sleep time and its link with cardiovascular diseases, and ii) survival time and physical activity values."}, "https://arxiv.org/abs/2408.16384": {"title": "Nonparametric goodness of fit tests for Pareto type-I distribution with complete and censored data", "link": "https://arxiv.org/abs/2408.16384", "description": "arXiv:2408.16384v1 Announce Type: new \nAbstract: Two new goodness of fit tests for the Pareto type-I distribution for complete and right censored data are proposed using a fixed point characterization based on Stein's type identity. The asymptotic distributions of the test statistics under both the null and alternative hypotheses are obtained. The performance of the proposed tests is evaluated and compared with existing tests through a Monte Carlo simulation experiment. The newly proposed tests exhibit greater power than existing tests for the Pareto type-I distribution. Finally, the methodology is applied to real-world data sets."}, "https://arxiv.org/abs/2408.16485": {"title": "A multiple imputation approach to distinguish curative from life-prolonging effects in the presence of missing covariates", "link": "https://arxiv.org/abs/2408.16485", "description": "arXiv:2408.16485v1 Announce Type: new \nAbstract: Medical advancements have increased cancer survival rates and the possibility of finding a cure. Hence, it is crucial to evaluate the impact of treatments in terms of both curing the disease and prolonging survival. We may use a Cox proportional hazards (PH) cure model to achieve this. However, a significant challenge in applying such a model is the potential presence of partially observed covariates in the data. We aim to refine the methods for imputing partially observed covariates based on multiple imputation and fully conditional specification (FCS) approaches. To be more specific, we consider a more general case, where different covariate vectors are used to model the cure probability and the survival of patients who are not cured. We also propose an approximation of the exact conditional distribution using a regression approach, which helps draw imputed values at a lower computational cost. To assess its effectiveness, we compare the proposed approach with a complete case analysis and an analysis without any missing covariates. We discuss the application of these techniques to a real-world dataset from the BO06 clinical trial on osteosarcoma."}, "https://arxiv.org/abs/2408.16670": {"title": "A Causal Framework for Evaluating Heterogeneous Policy Mechanisms Using Difference-in-Differences", "link": "https://arxiv.org/abs/2408.16670", "description": "arXiv:2408.16670v1 Announce Type: new \nAbstract: In designing and evaluating public policies, policymakers and researchers often hypothesize about the mechanisms through which a policy may affect a population and aim to assess these mechanisms in practice. For example, when studying an excise tax on sweetened beverages, researchers might explore how cross-border shopping, economic competition, and store-level price changes differentially affect store sales. However, many policy evaluation designs, including the difference-in-differences (DiD) approach, traditionally target the average effect of the intervention rather than the underlying mechanisms. 
Extensions of these approaches to evaluate policy mechanisms often involve exploratory subgroup analyses or outcome models parameterized by mechanism-specific variables. However, neither approach studies the mechanisms within a causal framework, limiting the analysis to associative relationships between mechanisms and outcomes, which may be confounded by differences among sub-populations exposed to varying levels of the mechanisms. Therefore, rigorous mechanism evaluation requires robust techniques to adjust for confounding and accommodate the interconnected relationship between stores within competitive economic landscapes. In this paper, we present a framework for evaluating policy mechanisms by studying Philadelphia beverage tax. Our approach builds on recent advancements in causal effect curve estimators under DiD designs, offering tools and insights for assessing policy mechanisms complicated by confounding and network interference."}, "https://arxiv.org/abs/2408.16708": {"title": "Effect Aliasing in Observational Studies", "link": "https://arxiv.org/abs/2408.16708", "description": "arXiv:2408.16708v1 Announce Type: new \nAbstract: In experimental design, aliasing of effects occurs in fractional factorial experiments, where certain low order factorial effects are indistinguishable from certain high order interactions: low order contrasts may be orthogonal to one another, while their higher order interactions are aliased and not identified. In observational studies, aliasing occurs when certain combinations of covariates -- e.g., time period and various eligibility criteria for treatment -- perfectly predict the treatment that an individual will receive, so a covariate combination is aliased with a particular treatment. In this situation, when a contrast among several groups is used to estimate a treatment effect, collections of individuals defined by contrast weights may be balanced with respect to summaries of low-order interactions between covariates and treatments, but necessarily not balanced with respect to summaries of high-order interactions between covariates and treatments. We develop a theory of aliasing in observational studies, illustrate that theory in an observational study whose aliasing is more robust than conventional difference-in-differences, and develop a new form of matching to construct balanced confounded factorial designs from observational data."}, "https://arxiv.org/abs/2408.16763": {"title": "Finite Sample Valid Inference via Calibrated Bootstrap", "link": "https://arxiv.org/abs/2408.16763", "description": "arXiv:2408.16763v1 Announce Type: new \nAbstract: While widely used as a general method for uncertainty quantification, the bootstrap method encounters difficulties that raise concerns about its validity in practical applications. This paper introduces a new resampling-based method, termed $\\textit{calibrated bootstrap}$, designed to generate finite sample-valid parametric inference from a sample of size $n$. The central idea is to calibrate an $m$-out-of-$n$ resampling scheme, where the calibration parameter $m$ is determined against inferential pivotal quantities derived from the cumulative distribution functions of loss functions in parameter estimation. The method comprises two algorithms. 
The first, named $\\textit{resampling approximation}$ (RA), employs a $\\textit{stochastic approximation}$ algorithm to find the value of the calibration parameter $m=m_\\alpha$ for a given $\\alpha$ in a manner that ensures the resulting $m$-out-of-$n$ bootstrapped $1-\\alpha$ confidence set is valid. The second algorithm, termed $\\textit{distributional resampling}$ (DR), is developed to further select samples of bootstrapped estimates from the RA step when constructing $1-\\alpha$ confidence sets for a range of $\\alpha$ values is of interest. The proposed method is illustrated and compared to existing methods using linear regression with and without $L_1$ penalty, within the context of a high-dimensional setting and a real-world data application. The paper concludes with remarks on a few open problems worthy of consideration."}, "https://arxiv.org/abs/1001.2055": {"title": "Reversible jump Markov chain Monte Carlo and multi-model samplers", "link": "https://arxiv.org/abs/1001.2055", "description": "arXiv:1001.2055v2 Announce Type: replace \nAbstract: To appear in the second edition of the MCMC handbook, S. P. Brooks, A. Gelman, G. Jones and X.-L. Meng (eds), Chapman & Hall."}, "https://arxiv.org/abs/2304.01098": {"title": "The synthetic instrument: From sparse association to sparse causation", "link": "https://arxiv.org/abs/2304.01098", "description": "arXiv:2304.01098v2 Announce Type: replace \nAbstract: In many observational studies, researchers are often interested in studying the effects of multiple exposures on a single outcome. Standard approaches for high-dimensional data such as the lasso assume the associations between the exposures and the outcome are sparse. These methods, however, do not estimate the causal effects in the presence of unmeasured confounding. In this paper, we consider an alternative approach that assumes the causal effects in view are sparse. We show that with sparse causation, the causal effects are identifiable even with unmeasured confounding. At the core of our proposal is a novel device, called the synthetic instrument, that in contrast to standard instrumental variables, can be constructed using the observed exposures directly. We show that under linear structural equation models, the problem of causal effect estimation can be formulated as an $\\ell_0$-penalization problem, and hence can be solved efficiently using off-the-shelf software. Simulations show that our approach outperforms state-of-art methods in both low-dimensional and high-dimensional settings. We further illustrate our method using a mouse obesity dataset."}, "https://arxiv.org/abs/2306.07181": {"title": "Bayesian estimation of covariate assisted principal regression for brain functional connectivity", "link": "https://arxiv.org/abs/2306.07181", "description": "arXiv:2306.07181v2 Announce Type: replace \nAbstract: This paper presents a Bayesian reformulation of covariate-assisted principal (CAP) regression of Zhao and others (2021), which aims to identify components in the covariance of response signal that are associated with covariates in a regression framework. We introduce a geometric formulation and reparameterization of individual covariance matrices in their tangent space. By mapping the covariance matrices to the tangent space, we leverage Euclidean geometry to perform posterior inference. 
This approach enables joint estimation of all parameters and uncertainty quantification within a unified framework, fusing dimension reduction for covariance matrices with regression model estimation. We validate the proposed method through simulation studies and apply it to analyze associations between covariates and brain functional connectivity, utilizing data from the Human Connectome Project."}, "https://arxiv.org/abs/2310.12000": {"title": "Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian Process Models", "link": "https://arxiv.org/abs/2310.12000", "description": "arXiv:2310.12000v2 Announce Type: replace \nAbstract: Latent Gaussian process (GP) models are flexible probabilistic non-parametric function models. Vecchia approximations are accurate approximations for GPs to overcome computational bottlenecks for large data, and the Laplace approximation is a fast method with asymptotic convergence guarantees to approximate marginal likelihoods and posterior predictive distributions for non-Gaussian likelihoods. Unfortunately, the computational complexity of combined Vecchia-Laplace approximations grows faster than linearly in the sample size when used in combination with direct solver methods such as the Cholesky decomposition. Computations with Vecchia-Laplace approximations can thus become prohibitively slow precisely when the approximations are usually the most accurate, i.e., on large data sets. In this article, we present iterative methods to overcome this drawback. Among other things, we introduce and analyze several preconditioners, derive new convergence results, and propose novel methods for accurately approximating predictive variances. We analyze our proposed methods theoretically and in experiments with simulated and real-world data. In particular, we obtain a speed-up of an order of magnitude compared to Cholesky-based calculations and a threefold increase in prediction accuracy in terms of the continuous ranked probability score compared to a state-of-the-art method on a large satellite data set. All methods are implemented in a free C++ software library with high-level Python and R packages."}, "https://arxiv.org/abs/2401.11272": {"title": "Asymptotics for non-degenerate multivariate $U$-statistics with estimated nuisance parameters under the null and local alternative hypotheses", "link": "https://arxiv.org/abs/2401.11272", "description": "arXiv:2401.11272v2 Announce Type: replace-cross \nAbstract: The large-sample behavior of non-degenerate multivariate $U$-statistics of arbitrary degree is investigated under the assumption that their kernel depends on parameters that can be estimated consistently. Mild regularity conditions are given which guarantee that once properly normalized, such statistics are asymptotically multivariate Gaussian both under the null hypothesis and sequences of local alternatives. The work of Randles (1982, \\emph{Ann. Statist.}) is extended in three ways: the data and the kernel values can be multivariate rather than univariate, the limiting behavior under local alternatives is studied for the first time, and the effect of knowing some of the nuisance parameters is quantified. 
These results can be applied to a broad range of goodness-of-fit testing contexts, as shown in two specific examples."}, "https://arxiv.org/abs/2408.16963": {"title": "Nonparametric Density Estimation for Data Scattered on Irregular Spatial Domains: A Likelihood-Based Approach Using Bivariate Penalized Spline Smoothing", "link": "https://arxiv.org/abs/2408.16963", "description": "arXiv:2408.16963v1 Announce Type: new \nAbstract: Accurately estimating data density is crucial for making informed decisions and modeling in various fields. This paper presents a novel nonparametric density estimation procedure that utilizes bivariate penalized spline smoothing over triangulation for data scattered over irregular spatial domains. The approach is likelihood-based with a regularization term that addresses the roughness of the logarithm of density based on a second-order differential operator. The proposed method offers greater efficiency and flexibility in estimating density over complex domains and has been theoretically supported by establishing the asymptotic convergence rate under mild natural conditions. Through extensive simulation studies and a real-world application that analyzes motor vehicle theft data from Portland City, Oregon, we demonstrate the advantages of the proposed method over existing techniques detailed in the literature."}, "https://arxiv.org/abs/2408.17022": {"title": "Non-parametric Monitoring of Spatial Dependence", "link": "https://arxiv.org/abs/2408.17022", "description": "arXiv:2408.17022v1 Announce Type: new \nAbstract: In process monitoring applications, measurements are often taken regularly or randomly from different spatial locations in two or three dimensions. Here, we consider streams of regular, rectangular data sets and use spatial ordinal patterns (SOPs) as a non-parametric approach to detect spatial dependencies. A key feature of our proposed SOP charts is that they are distribution-free and do not require prior Phase-I analysis. We conduct an extensive simulation study, demonstrating the superiority and effectiveness of the proposed charts compared to traditional parametric approaches. We apply the SOP-based control charts to detect heavy rainfall in Germany, war-related fires in (eastern) Ukraine, and manufacturing defects in textile production. The wide range of applications and insights illustrate the broad utility of our non-parametric approach."}, "https://arxiv.org/abs/2408.17040": {"title": "Model-based clustering for covariance matrices via penalized Wishart mixture models", "link": "https://arxiv.org/abs/2408.17040", "description": "arXiv:2408.17040v1 Announce Type: new \nAbstract: Covariance matrices provide a valuable source of information about complex interactions and dependencies within the data. However, from a clustering perspective, this information has often been underutilized and overlooked. Indeed, commonly adopted distance-based approaches tend to rely primarily on mean levels to characterize and differentiate between groups. Recently, there have been promising efforts to cluster covariance matrices directly, thereby distinguishing groups solely based on the relationships between variables. From a model-based perspective, a probabilistic formalization has been provided by considering a mixture model with component densities following a Wishart distribution. Notwithstanding, this approach faces challenges when dealing with a large number of variables, as the number of parameters to be estimated increases quadratically. 
To address this issue, we propose a sparse Wishart mixture model, which assumes that the component scale matrices possess a cluster-dependent degree of sparsity. Model estimation is performed by maximizing a penalized log-likelihood, enforcing a covariance graphical lasso penalty on the component scale matrices. This penalty not only reduces the number of non-zero parameters, mitigating the challenges of high-dimensional settings, but also enhances the interpretability of results by emphasizing the most relevant relationships among variables. The proposed methodology is tested on both simulated and real data, demonstrating its ability to unravel the complexities of neuroimaging data and effectively cluster subjects based on the relational patterns among distinct brain regions."}, "https://arxiv.org/abs/2408.17153": {"title": "Scalable Bayesian Clustering for Integrative Analysis of Multi-View Data", "link": "https://arxiv.org/abs/2408.17153", "description": "arXiv:2408.17153v1 Announce Type: new \nAbstract: In the era of Big Data, scalable and accurate clustering algorithms for high-dimensional data are essential. We present new Bayesian Distance Clustering (BDC) models and inference algorithms with improved scalability while maintaining the predictive accuracy of modern Bayesian non-parametric models. Unlike traditional methods, BDC models the distance between observations rather than the observations directly, offering a compromise between the scalability of distance-based methods and the enhanced predictive power and probabilistic interpretation of model-based methods. However, existing BDC models still rely on performing inference on the partition model to group observations into clusters. The support of this partition model grows exponentially with the dataset's size, complicating posterior space exploration and leading to many costly likelihood evaluations. Inspired by K-medoids, we propose using tessellations in discrete space to simplify inference by focusing the learning task on finding the best tessellation centers, or \"medoids.\" Additionally, we extend our models to effectively handle multi-view data, such as data comprised of clusters that evolve across time, enhancing their applicability to complex datasets. The real data application in numismatics demonstrates the efficacy of our approach."}, "https://arxiv.org/abs/2408.17187": {"title": "State Space Model of Realized Volatility under the Existence of Dependent Market Microstructure Noise", "link": "https://arxiv.org/abs/2408.17187", "description": "arXiv:2408.17187v1 Announce Type: new \nAbstract: Volatility means the degree of variation of a stock price which is important in finance. Realized Volatility (RV) is an estimator of the volatility calculated using high-frequency observed prices. RV has lately attracted considerable attention of econometrics and mathematical finance. However, it is known that high-frequency data includes observation errors called market microstructure noise (MN). Nagakura and Watanabe[2015] proposed a state space model that resolves RV into true volatility and influence of MN. 
In this paper, we assume a dependent MN that is autocorrelated and correlated with returns, as reported by Hansen and Lunde [2006], extend the results of Nagakura and Watanabe [2015], and compare the models using both simulated and actual data."}, "https://arxiv.org/abs/2408.17188": {"title": "A note on promotion time cure models with a new biological consideration", "link": "https://arxiv.org/abs/2408.17188", "description": "arXiv:2408.17188v1 Announce Type: new \nAbstract: We introduce a generalized promotion time cure model motivated by a new biological consideration. The new approach is flexible enough to model heterogeneous survival data, in particular for addressing intra-sample heterogeneity."}, "https://arxiv.org/abs/2408.17205": {"title": "Estimation and inference of average treatment effects under heterogeneous additive treatment effect model", "link": "https://arxiv.org/abs/2408.17205", "description": "arXiv:2408.17205v1 Announce Type: new \nAbstract: Randomized experiments are the gold standard for estimating treatment effects, yet network interference challenges the validity of traditional estimators by violating the stable unit treatment value assumption and introducing bias. While cluster randomized experiments mitigate this bias, they encounter limitations in handling network complexity and fail to distinguish between direct and indirect effects. To address these challenges, we develop a design-based asymptotic theory for the existing Horvitz--Thompson estimators of the direct, indirect, and global average treatment effects under Bernoulli trials. We assume the heterogeneous additive treatment effect model with a hidden network that drives interference. Observing that these estimators are inconsistent in dense networks, we introduce novel eigenvector-based regression adjustment estimators to ensure consistency. We establish the asymptotic normality of the proposed estimators and provide conservative variance estimators under the design-based inference framework, offering robust conclusions independent of the underlying stochastic processes of the network and model parameters. Our method's adaptability is demonstrated across various interference structures, including partial interference and local interference in a two-sided marketplace. Numerical studies further illustrate the efficacy of the proposed estimators, offering practical insights into handling network interference."}, "https://arxiv.org/abs/2408.17257": {"title": "Likelihood estimation for stochastic differential equations with mixed effects", "link": "https://arxiv.org/abs/2408.17257", "description": "arXiv:2408.17257v1 Announce Type: new \nAbstract: Stochastic differential equations provide a powerful and versatile tool for modelling dynamic phenomena affected by random noise. In the case of repeated observations of time series for several experimental units, it is often the case that some of the parameters vary between the individual experimental units, which has motivated a considerable interest in stochastic differential equations with mixed effects, where a subset of the parameters are random. These models enable simultaneous representation of randomness in the dynamics and variability between experimental units. When the data are observations at discrete time points, the likelihood function is only rarely explicitly available, so for likelihood-based inference numerical methods are needed. 
We present Gibbs samplers and stochastic EM-algorithms based on the simple methods for simulation of diffusion bridges in Bladt and S{\\o}rensen (2014). These methods are easy to implement and have no tuning parameters. They are, moreover, computationally efficient at low sampling frequencies because the computing time increases linearly with the time between observations. The algorithms are shown to simplify considerably for exponential families of diffusion processes. In a simulation study, the estimation methods are shown to work well for Ornstein-Uhlenbeck processes and t-diffusions with mixed effects. Finally, the Gibbs sampler is applied to neuronal data."}, "https://arxiv.org/abs/2408.17278": {"title": "Incorporating Memory into Continuous-Time Spatial Capture-Recapture Models", "link": "https://arxiv.org/abs/2408.17278", "description": "arXiv:2408.17278v1 Announce Type: new \nAbstract: Obtaining reliable and precise estimates of wildlife species abundance and distribution is essential for the conservation and management of animal populations and natural reserves. Remote sensors such as camera traps are increasingly employed to gather data on uniquely identifiable individuals. Spatial capture-recapture (SCR) models provide estimates of population and spatial density from such data. These models introduce spatial correlation between observations of the same individual through a latent activity center. However SCR models assume that observations are independent over time and space, conditional on their given activity center, so that observed sightings at a given time and location do not influence the probability of being seen at future times and/or locations. With detectors like camera traps, this is ecologically unrealistic given the smooth movement of animals over space through time. We propose a new continuous-time modeling framework that incorporates both an individual's (latent) activity center and (known) previous location and time of detection. We demonstrate that standard SCR models can produce substantially biased density estimates when there is correlation in the times and locations of detections, and that our new model performs substantially better than standard SCR models on data simulated through a movement model as well as in a real camera trap study of American martens where an improvement in model fit is observed when incorporating the observed locations and times of previous observations."}, "https://arxiv.org/abs/2408.17346": {"title": "On Nonparanormal Likelihoods", "link": "https://arxiv.org/abs/2408.17346", "description": "arXiv:2408.17346v1 Announce Type: new \nAbstract: Nonparanormal models describe the joint distribution of multivariate responses via latent Gaussian, and thus parametric, copulae while allowing flexible nonparametric marginals. Some aspects of such distributions, for example conditional independence, are formulated parametrically. Other features, such as marginal distributions, can be formulated non- or semiparametrically. Such models are attractive when multivariate normality is questionable.\n Most estimation procedures perform two steps, first estimating the nonparametric part. The copula parameters come second, treating the marginal estimates as known. This is sufficient for some applications. For other applications, e.g. 
when a semiparametric margin features parameters of interest or when standard errors are important, a simultaneous estimation of all parameters might be more advantageous.\n We present suitable parameterisations of nonparanormal models, possibly including semiparametric effects, and define four novel nonparanormal log-likelihood functions. In general, the corresponding one-step optimization problems are shown to be non-convex. In some cases, however, biconvex problems emerge. Several convex approximations are discussed.\n From a low-level computational point of view, the core contribution is the score function for multivariate normal log-probabilities computed via Genz' procedure. We present transformation discriminant analysis when some biomarkers are subject to limit-of-detection problems as an application and illustrate possible empirical gains in semiparametric efficient polychoric correlation analysis."}, "https://arxiv.org/abs/2408.17385": {"title": "Comparing Propensity Score-Based Methods in Estimating the Treatment Effects: A Simulation Study", "link": "https://arxiv.org/abs/2408.17385", "description": "arXiv:2408.17385v1 Announce Type: new \nAbstract: In observational studies, the recorded treatment assignment is not purely random, but it is influenced by external factors such as patient characteristics, reimbursement policies, and existing guidelines. Therefore, the treatment effect can be estimated only after accounting for confounding factors. Propensity score (PS) methods are a family of methods that is widely used for this purpose. Although they are all based on the estimation of the a posteriori probability of treatment assignment given patient covariates, they estimate the treatment effect from different statistical points of view and are, thus, relatively hard to compare. In this work, we propose a simulation experiment in which a hypothetical cohort of subjects is simulated in seven scenarios of increasing complexity of the associations between covariates and treatment, but where the two main definitions of treatment effect (average treatment effect, ATE, and average effect of the treatment on the treated, ATT) coincide. Our purpose is to compare the performance of a wide array of PS-based methods (matching, stratification, and inverse probability weighting) in estimating the treatment effect and their robustness in different scenarios. We find that inverse probability weighting provides estimates of the treatment effect that are closer to the expected value by weighting all subjects of the starting population. Conversely, matching and stratification ensure that the subpopulation that generated the final estimate is made up of real instances drawn from the starting population, and, thus, provide a higher degree of control on the validity domain of the estimates."}, "https://arxiv.org/abs/2408.17392": {"title": "Dual-criterion Dose Finding Designs Based on Dose-Limiting Toxicity and Tolerability", "link": "https://arxiv.org/abs/2408.17392", "description": "arXiv:2408.17392v1 Announce Type: new \nAbstract: The primary objective of Phase I oncology trials is to assess the safety and tolerability of novel therapeutics. Conventional dose escalation methods identify the maximum tolerated dose (MTD) based on dose-limiting toxicity (DLT). However, as cancer therapies have evolved from chemotherapy to targeted therapies, these traditional methods have become problematic. 
Many targeted therapies rarely produce DLT and are administered over multiple cycles, potentially resulting in the accumulation of lower-grade toxicities, which can lead to intolerance, such as dose reduction or interruption. To address this issue, we propose dual-criterion designs that find the MTD based on both DLT and non-DLT-caused intolerance. We consider a model-based design and a model-assisted design that allow real-time decision-making in the presence of pending data due to long event assessment windows. Compared to DLT-based methods, our approaches exhibit superior operating characteristics when intolerance is the primary driver for determining the MTD and comparable operating characteristics when DLT is the primary driver."}, "https://arxiv.org/abs/2408.17410": {"title": "Family of multivariate extended skew-elliptical distributions: Statistical properties, inference and application", "link": "https://arxiv.org/abs/2408.17410", "description": "arXiv:2408.17410v1 Announce Type: new \nAbstract: In this paper we propose a family of multivariate asymmetric distributions over an arbitrary subset of the set of real numbers, which is defined in terms of the well-known elliptically symmetric distributions. We explore essential properties, including the characterization of the density function for various distribution types, as well as other key aspects such as identifiability, quantiles, stochastic representation, conditional and marginal distributions, moments, Kullback-Leibler Divergence, and parameter estimation. A Monte Carlo simulation study is performed to examine the performance of the developed parameter estimation method. Finally, the proposed models are used to analyze socioeconomic data."}, "https://arxiv.org/abs/2408.17426": {"title": "Weighted Regression with Sybil Networks", "link": "https://arxiv.org/abs/2408.17426", "description": "arXiv:2408.17426v1 Announce Type: new \nAbstract: In many online domains, Sybil networks -- or cases where a single user assumes multiple identities -- are a pervasive feature. This complicates experiments, as off-the-shelf regression estimators at least assume known network topologies (if not fully independent observations), whereas Sybil network topologies in practice are often unknown. The literature has exclusively focused on techniques to detect Sybil networks, leading many experimenters to subsequently exclude suspected networks entirely before estimating treatment effects. I present a more efficient solution in the presence of these suspected Sybil networks: a weighted regression framework that applies weights based on the probabilities that sets of observations are controlled by single actors. I show in the paper that the MSE-minimizing solution is to set the weight matrix equal to the inverse of the expected network topology. I demonstrate the methodology on simulated data, and then I apply the technique to a competition with suspected Sybil networks run on the Sui blockchain and show reductions in the standard error of the estimate by 6 - 24%."}, "https://arxiv.org/abs/2408.16791": {"title": "Multi-faceted Neuroimaging Data Integration via Analysis of Subspaces", "link": "https://arxiv.org/abs/2408.16791", "description": "arXiv:2408.16791v1 Announce Type: cross \nAbstract: Neuroimaging studies, such as the Human Connectome Project (HCP), often collect multi-faceted and multi-block data to study the complex human brain. 
However, these data are often analyzed in a pairwise fashion, which can hinder our understanding of how different brain-related measures interact with each other. In this study, we comprehensively analyze the multi-block HCP data using the Data Integration via Analysis of Subspaces (DIVAS) method. We integrate structural and functional brain connectivity, substance use, cognition, and genetics in an exhaustive five-block analysis. This gives rise to the important finding that genetics is the single data modality most predictive of brain connectivity, outside of brain connectivity itself. Nearly 14\\% of the variation in functional connectivity (FC) and roughly 12\\% of the variation in structural connectivity (SC) are attributed to shared spaces with genetics. Moreover, investigations of shared space loadings provide interpretable associations between particular brain regions and drivers of variability, such as alcohol consumption in the substance-use data block. Novel Jackstraw hypothesis tests are developed for the DIVAS framework to establish statistically significant loadings. For example, in the (FC, SC, and Substance Use) shared space, these novel hypothesis tests highlight largely negative functional and structural connections, suggesting the brain's role in physiological responses to increased substance use. Furthermore, our findings have been validated using a subset of genetically relevant siblings or twins not studied in the main analysis."}, "https://arxiv.org/abs/2408.17087": {"title": "On the choice of the two tuning parameters for nonparametric estimation of an elliptical distribution generator", "link": "https://arxiv.org/abs/2408.17087", "description": "arXiv:2408.17087v1 Announce Type: cross \nAbstract: Elliptical distributions are a simple and flexible class of distributions that depend on a one-dimensional function, called the density generator. In this article, we study the non-parametric estimator of this generator that was introduced by Liebscher (2005). This estimator depends on two tuning parameters: a bandwidth $h$ -- as usual in kernel smoothing -- and an additional parameter $a$ that controls the behavior near the center of the distribution. We give an explicit expression for the asymptotic MSE at a point $x$, and derive explicit expressions for the optimal tuning parameters $h$ and $a$. Estimation of the derivatives of the generator is also discussed. A simulation study shows the performance of the new methods."}, "https://arxiv.org/abs/2408.17230": {"title": "cosimmr: an R package for fast fitting of Stable Isotope Mixing Models with covariates", "link": "https://arxiv.org/abs/2408.17230", "description": "arXiv:2408.17230v1 Announce Type: cross \nAbstract: The study of animal diets and the proportional contribution that different foods make to their diets is an important task in ecology. Stable Isotope Mixing Models (SIMMs) are an important tool for studying an animal's diet and understanding how the animal interacts with its environment. We present cosimmr, a new R package designed to include covariates when estimating diet proportions in SIMMs, with simple functions to produce plots and summary statistics. The inclusion of covariates allows users to perform a more in-depth analysis of their system and to gain new insights into the diets of the organisms being studied. 
A common problem with the previous generation of SIMMs is that they are very slow to produce a posterior distribution of dietary estimates, especially for more complex model structures, such as when covariates are included. The widely-used Markov chain Monte Carlo (MCMC) algorithm used by many traditional SIMMs often requires a very large number of iterations to reach convergence. In contrast, cosimmr uses Fixed Form Variational Bayes (FFVB), which we demonstrate gives up to an order of magnitude speed improvement with no discernible loss of accuracy. We provide a full mathematical description of the model, which includes corrections for trophic discrimination and concentration dependence, and evaluate its performance against the state of the art MixSIAR model. Whilst MCMC is guaranteed to converge to the posterior distribution in the long term, FFVB converges to an approximation of the posterior distribution, which may lead to sub-optimal performance. However we show that the package produces equivalent results in a fraction of the time for all the examples on which we test. The package is designed to be user-friendly and is based on the existing simmr framework."}, "https://arxiv.org/abs/2207.08868": {"title": "Isotonic propensity score matching", "link": "https://arxiv.org/abs/2207.08868", "description": "arXiv:2207.08868v2 Announce Type: replace \nAbstract: We propose a one-to-many matching estimator of the average treatment effect based on propensity scores estimated by isotonic regression. The method relies on the monotonicity assumption on the propensity score function, which can be justified in many applications in economics. We show that the nature of the isotonic estimator can help us to fix many problems of existing matching methods, including efficiency, choice of the number of matches, choice of tuning parameters, robustness to propensity score misspecification, and bootstrap validity. As a by-product, a uniformly consistent isotonic estimator is developed for our proposed matching method."}, "https://arxiv.org/abs/2211.04012": {"title": "A functional regression model for heterogeneous BioGeoChemical Argo data in the Southern Ocean", "link": "https://arxiv.org/abs/2211.04012", "description": "arXiv:2211.04012v2 Announce Type: replace \nAbstract: Leveraging available measurements of our environment can help us understand complex processes. One example is Argo Biogeochemical data, which aims to collect measurements of oxygen, nitrate, pH, and other variables at varying depths in the ocean. We focus on the oxygen data in the Southern Ocean, which has implications for ocean biology and the Earth's carbon cycle. Systematic monitoring of such data has only recently begun to be established, and the data is sparse. In contrast, Argo measurements of temperature and salinity are much more abundant. In this work, we introduce and estimate a functional regression model describing dependence in oxygen, temperature, and salinity data at all depths covered by the Argo data simultaneously. Our model elucidates important aspects of the joint distribution of temperature, salinity, and oxygen. Due to fronts that establish distinct spatial zones in the Southern Ocean, we augment this functional regression model with a mixture component. By modelling spatial dependence in the mixture component and in the data itself, we provide predictions onto a grid and improve location estimates of fronts. 
Our approach is scalable to the size of the Argo data, and we demonstrate its success in cross-validation and a comprehensive interpretation of the model."}, "https://arxiv.org/abs/2301.10468": {"title": "Model selection-based estimation for generalized additive models using mixtures of g-priors: Towards systematization", "link": "https://arxiv.org/abs/2301.10468", "description": "arXiv:2301.10468v4 Announce Type: replace \nAbstract: We explore the estimation of generalized additive models using basis expansion in conjunction with Bayesian model selection. Although Bayesian model selection is useful for regression splines, it has traditionally been applied mainly to Gaussian regression owing to the availability of a tractable marginal likelihood. We extend this method to handle an exponential family of distributions by using the Laplace approximation of the likelihood. Although this approach works well with any Gaussian prior distribution, consensus has not been reached on the best prior for nonparametric regression with basis expansions. Our investigation indicates that the classical unit information prior may not be ideal for nonparametric regression. Instead, we find that mixtures of g-priors are more effective. We evaluate various mixtures of g-priors to assess their performance in estimating generalized additive models. Additionally, we compare several priors for knots to determine the most effective strategy. Our simulation studies demonstrate that model selection-based approaches outperform other Bayesian methods."}, "https://arxiv.org/abs/2311.00878": {"title": "Backward Joint Model for the Dynamic Prediction of Both Competing Risk and Longitudinal Outcomes", "link": "https://arxiv.org/abs/2311.00878", "description": "arXiv:2311.00878v2 Announce Type: replace \nAbstract: Joint modeling is a useful approach to dynamic prediction of clinical outcomes using longitudinally measured predictors. When the outcomes are competing risk events, fitting the conventional shared random effects joint model often involves intensive computation, especially when multiple longitudinal biomarkers are used as predictors, as is often desired in prediction problems. This paper proposes a new joint model for the dynamic prediction of competing risk outcomes. The model factorizes the likelihood into the distribution of the competing risks data and the distribution of longitudinal data given the competing risks data. It extends the basic idea of the recently published backward joint model (BJM) to the competing risk setting, and we call this model crBJM. This model also enables the prediction of future longitudinal data trajectories conditional on being at risk at a future time, a practically important problem that has not been studied in the statistical literature. The model fitting with the EM algorithm is efficient, stable and computationally fast, with a one-dimensional integral in the E-step and convex optimization for most parameters in the M-step, regardless of the number of longitudinal predictors. The model also comes with a consistent albeit less efficient estimation method that can be quickly implemented with standard software, ideal for model building and diagnostics. 
We study the numerical properties of the proposed method using simulations and illustrate its use in a chronic kidney disease study."}, "https://arxiv.org/abs/2312.01162": {"title": "Inference on many jumps in nonparametric panel regression models", "link": "https://arxiv.org/abs/2312.01162", "description": "arXiv:2312.01162v2 Announce Type: replace \nAbstract: We investigate the significance of change-point or jump effects within fully nonparametric regression contexts, with a particular focus on panel data scenarios where data generation processes vary across individual or group units, and error terms may display complex dependency structures. In our setting the threshold effect depends on a specific covariate, and we permit the true nonparametric regression to vary based on additional latent variables. We propose two uniform testing procedures: one to assess the existence of change-point effects and another to evaluate the uniformity of such effects across units. Even though the underlying data generation processes are neither independent nor identically distributed, our approach involves deriving a straightforward analytical expression to approximate the variance-covariance structure of change-point effects under general dependency conditions. Notably, when Gaussian approximations are made to these test statistics, the intricate dependency structures within the data can be safely disregarded owing to the localized nature of the statistics. This finding bears significant implications for obtaining critical values. Through extensive simulations, we demonstrate that our tests exhibit excellent control over size and reasonable power performance in finite samples, irrespective of strong cross-sectional and weak serial dependency within the data. Furthermore, applying our tests to two datasets reveals the existence of significant nonsmooth effects in both cases."}, "https://arxiv.org/abs/2312.13097": {"title": "Power calculation for cross-sectional stepped wedge cluster randomized trials with a time-to-event endpoint", "link": "https://arxiv.org/abs/2312.13097", "description": "arXiv:2312.13097v2 Announce Type: replace \nAbstract: Stepped wedge cluster randomized trials (SW-CRTs) are a form of randomized trial whereby clusters are progressively transitioned from control to intervention, with the timing of transition randomized for each cluster. An important task at the design stage is to ensure that the planned trial has sufficient power to observe a clinically meaningful effect size. While methods for determining study power have been well-developed for SW-CRTs with continuous and binary outcomes, limited methods for power calculation are available for SW-CRTs with censored time-to-event outcomes. In this article, we propose a stratified marginal Cox model to account for secular trend in cross-sectional SW-CRTs, and derive an explicit expression of the robust sandwich variance to facilitate power calculations without the need for computationally intensive simulations. Power formulas based on both the Wald and robust score tests are developed and validated via simulation under different finite-sample scenarios. Finally, we illustrate our methods in the context of a SW-CRT testing the effect of a new electronic reminder system on time to catheter removal in hospital settings. 
We also offer an R Shiny application to facilitate sample size and power calculations using our proposed methods."}, "https://arxiv.org/abs/2302.05955": {"title": "Recursive Estimation of Conditional Kernel Mean Embeddings", "link": "https://arxiv.org/abs/2302.05955", "description": "arXiv:2302.05955v2 Announce Type: replace-cross \nAbstract: Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is, in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces."}, "https://arxiv.org/abs/2409.00291": {"title": "Variable selection in the joint frailty model of recurrent and terminal events using Broken Adaptive Ridge regression", "link": "https://arxiv.org/abs/2409.00291", "description": "arXiv:2409.00291v1 Announce Type: new \nAbstract: We introduce a novel method to simultaneously perform variable selection and estimation in the joint frailty model of recurrent and terminal events using the Broken Adaptive Ridge Regression penalty. The BAR penalty can be summarized as an iteratively reweighted squared $L_2$-penalized regression, which approximates the $L_0$-regularization method. Our method allows for the number of covariates to diverge with the sample size. Under certain regularity conditions, we prove that the BAR estimator implemented under the model framework is consistent and asymptotically normally distributed, which are known as the oracle properties in the variable selection literature. In our simulation studies, we compare our proposed method to the Minimum Information Criterion (MIC) method. We apply our method to the Medical Information Mart for Intensive Care (MIMIC-III) database, with the aim of investigating which variables affect the risks of repeated ICU admissions and death during ICU stay."}, "https://arxiv.org/abs/2409.00379": {"title": "Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance", "link": "https://arxiv.org/abs/2409.00379", "description": "arXiv:2409.00379v1 Announce Type: new \nAbstract: Static supervised learning, in which experimental data serves as a training sample for the estimation of an optimal treatment assignment policy, is a commonly assumed framework of policy learning. An arguably more realistic but challenging scenario is a dynamic setting in which the planner performs experimentation and exploitation simultaneously with subjects that arrive sequentially. This paper studies bandit algorithms for learning an optimal individualised treatment assignment policy. 
Specifically, we study the applicability of the EXP4.P (Exponential weighting for Exploration and Exploitation with Experts) algorithm developed by Beygelzimer et al. (2011) to policy learning. Assuming that the class of policies has a finite Vapnik-Chervonenkis dimension and that the number of subjects to be allocated is known, we present a high probability welfare-regret bound of the algorithm. To implement the algorithm, we use an incremental enumeration algorithm for hyperplane arrangements. We perform extensive numerical analysis to assess the algorithm's sensitivity to its tuning parameters and its welfare-regret performance. Further simulation exercises are calibrated to the National Job Training Partnership Act (JTPA) Study sample to determine how the algorithm performs when applied to economic data. Our findings highlight various computational challenges and suggest that the limited welfare gain from the algorithm is due to substantial heterogeneity in causal effects in the JTPA data."}, "https://arxiv.org/abs/2409.00453": {"title": "Bayesian nonparametric mixtures of categorical directed graphs for heterogeneous causal inference", "link": "https://arxiv.org/abs/2409.00453", "description": "arXiv:2409.00453v1 Announce Type: new \nAbstract: Quantifying causal effects of exposures on outcomes, such as a treatment and a disease respectively, is a crucial issue in medical science for the administration of effective therapies. Importantly, any related causal analysis should account for all those variables, e.g. clinical features, that can act as risk factors involved in the occurrence of a disease. In addition, the selection of targeted strategies for therapy administration requires quantifying such treatment effects at the personalized level rather than at the population level. We address these issues by proposing a methodology based on categorical Directed Acyclic Graphs (DAGs) which provide an effective tool to infer causal relationships and causal effects between variables. In addition, we account for population heterogeneity by considering a Dirichlet Process mixture of categorical DAGs, which clusters individuals into homogeneous groups characterized by common causal structures, dependence parameters and causal effects. We develop computational strategies for Bayesian posterior inference, from which a battery of causal effects at the subject-specific level is recovered. Our methodology is evaluated through simulations and applied to a dataset of breast cancer patients to investigate cardiotoxic side effects that can be induced by the administered anticancer therapies."}, "https://arxiv.org/abs/2409.00470": {"title": "Examining the robustness of a model selection procedure in the binary latent block model through a language placement test data set", "link": "https://arxiv.org/abs/2409.00470", "description": "arXiv:2409.00470v1 Announce Type: new \nAbstract: When entering a French university, the students' foreign language level is assessed through a placement test. In this work, we model the placement test results using binary latent block models, which allow us to simultaneously form homogeneous groups of students and of items. However, a major difficulty in latent block models is to correctly select the number of groups of rows and the number of groups of columns. The first purpose of this paper is to tune the number of initializations needed to limit the initial values problem in the estimation algorithm, in order to propose a model selection procedure in the placement test context. 
Computational studies based on simulated data sets and on two placement test data sets are investigated. The second purpose is to investigate the robustness of the proposed model selection procedure in terms of stability of the students groups when the number of students varies."}, "https://arxiv.org/abs/2409.00679": {"title": "Exact Exploratory Bi-factor Analysis: A Constraint-based Optimisation Approach", "link": "https://arxiv.org/abs/2409.00679", "description": "arXiv:2409.00679v1 Announce Type: new \nAbstract: Bi-factor analysis is a form of confirmatory factor analysis widely used in psychological and educational measurement. The use of a bi-factor model requires the specification of an explicit bi-factor structure on the relationship between the observed variables and the group factors. In practice, the bi-factor structure is sometimes unknown, in which case an exploratory form of bi-factor analysis is needed to find the bi-factor structure. Unfortunately, there are few methods for exploratory bi-factor analysis, with the exception of a rotation-based method proposed in Jennrich and Bentler (2011, 2012). However, this method only finds approximate bi-factor structures, as it does not yield an exact bi-factor loading structure, even after applying hard thresholding. In this paper, we propose a constraint-based optimisation method that learns an exact bi-factor loading structure from data, overcoming the issue with the rotation-based method. The key to the proposed method is a mathematical characterisation of the bi-factor loading structure as a set of equality constraints, which allows us to formulate the exploratory bi-factor analysis problem as a constrained optimisation problem in a continuous domain and solve the optimisation problem with an augmented Lagrangian method. The power of the proposed method is shown via simulation studies and a real data example. Extending the proposed method to exploratory hierarchical factor analysis is also discussed. The codes are available on ``https://anonymous.4open.science/r/Bifactor-ALM-C1E6\"."}, "https://arxiv.org/abs/2409.00817": {"title": "Structural adaptation via directional regularity: rate accelerated estimation in multivariate functional data", "link": "https://arxiv.org/abs/2409.00817", "description": "arXiv:2409.00817v1 Announce Type: new \nAbstract: We introduce directional regularity, a new definition of anisotropy for multivariate functional data. Instead of taking the conventional view which determines anisotropy as a notion of smoothness along a dimension, directional regularity additionally views anisotropy through the lens of directions. We show that faster rates of convergence can be obtained through a change-of-basis by adapting to the directional regularity of a multivariate process. An algorithm for the estimation and identification of the change-of-basis matrix is constructed, made possible due to the unique replication structure of functional data. Non-asymptotic bounds are provided for our algorithm, supplemented by numerical evidence from an extensive simulation study. 
We discuss two possible applications of the directional regularity approach, and advocate its consideration as a standard pre-processing step in multivariate functional data analysis."}, "https://arxiv.org/abs/2409.01017": {"title": "Linear spline index regression model: Interpretability, nonlinearity and dimension reduction", "link": "https://arxiv.org/abs/2409.01017", "description": "arXiv:2409.01017v1 Announce Type: new \nAbstract: Inspired by the complexity of certain real-world datasets, this article introduces a novel flexible linear spline index regression model. The model posits piecewise linear effects of an index on the response, with continuous changes occurring at knots. Significantly, it possesses the interpretability of linear models, captures nonlinear effects similar to nonparametric models, and achieves dimension reduction like single-index models. In addition, the locations and number of knots remain unknown, which further enhances the adaptability of the model in practical applications. We propose a new method that combines penalized approaches and convolution techniques to simultaneously estimate the unknown parameters and determine the number of knots. Noteworthy is that the proposed method allows the number of knots to diverge with the sample size. We demonstrate that the proposed estimators can identify the number of knots with a probability approaching one and estimate the coefficients as efficiently as if the number of knots is known in advance. We also introduce a procedure to test the presence of knots. Simulation studies and two real datasets are employed to assess the finite sample performance of the proposed method."}, "https://arxiv.org/abs/2409.01208": {"title": "Statistical Jump Model for Mixed-Type Data with Missing Data Imputation", "link": "https://arxiv.org/abs/2409.01208", "description": "arXiv:2409.01208v1 Announce Type: new \nAbstract: In this paper, we address the challenge of clustering mixed-type data with temporal evolution by introducing the statistical jump model for mixed-type data. This novel framework incorporates regime persistence, enhancing interpretability and reducing the frequency of state switches, and efficiently handles missing data. The model is easily interpretable through its state-conditional means and modes, making it accessible to practitioners and policymakers. We validate our approach through extensive simulation studies and an empirical application to air quality data, demonstrating its superiority in inferring persistent air quality regimes compared to the traditional air quality index. Our contributions include a robust method for mixed-type temporal clustering, effective missing data management, and practical insights for environmental monitoring."}, "https://arxiv.org/abs/2409.01248": {"title": "Nonparametric Estimation of Path-specific Effects in Presence of Nonignorable Missing Covariates", "link": "https://arxiv.org/abs/2409.01248", "description": "arXiv:2409.01248v1 Announce Type: new \nAbstract: The path-specific effect (PSE) is of primary interest in mediation analysis when multiple intermediate variables between treatment and outcome are observed, as it can isolate the specific effect through each mediator, thus mitigating potential bias arising from other intermediate variables serving as mediator-outcome confounders. 
However, estimation and inference of PSE become challenging in the presence of nonignorable missing covariates, a situation particularly common in epidemiological research involving sensitive patient information. In this paper, we propose a fully nonparametric methodology to address this challenge. We establish identification for PSE by expressing it as a functional of observed data and demonstrate that the associated nuisance functions can be uniquely determined through sequential optimization problems by leveraging a shadow variable. Then we propose a sieve-based regression imputation approach for estimation. We establish the large-sample theory for the proposed estimator, and introduce a robust and efficient approach to make inference for PSE. The proposed method is applied to the NHANES dataset to investigate the mediation roles of dyslipidemia and obesity in the pathway from Type 2 diabetes mellitus to cardiovascular disease."}, "https://arxiv.org/abs/2409.01266": {"title": "Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions", "link": "https://arxiv.org/abs/2409.01266", "description": "arXiv:2409.01266v1 Announce Type: new \nAbstract: Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods."}, "https://arxiv.org/abs/2409.01295": {"title": "Pearson's Correlation under the scope: Assessment of the efficiency of Pearson's correlation to select predictor variables for linear models", "link": "https://arxiv.org/abs/2409.01295", "description": "arXiv:2409.01295v1 Announce Type: new \nAbstract: This article examines the limitations of Pearson's correlation in selecting predictor variables for linear models. Using mtcars and iris datasets from R, this paper demonstrates the limitation of this correlation measure when selecting a proper independent variable to model miles per gallon (mpg) from mtcars data and the petal length from the iris data. 
This paper presents the findings by reporting Pearson's correlation values for two potential predictor variables for each response variable, then builds a linear model to predict the response variable using each predictor variable. The error metrics for each model are then reported to evaluate how reliable Pearson's correlation is in selecting the best predictor variable. The results show that Pearson's correlation can be deceiving if used to select the predictor variable to build a linear model for a dependent variable."}, "https://arxiv.org/abs/2409.01444": {"title": "A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions", "link": "https://arxiv.org/abs/2409.01444", "description": "arXiv:2409.01444v1 Announce Type: new \nAbstract: Prediction models inform important clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. However, changes in the distribution of the data impact model performance. In health-care, a typical change is a shift in case-mix: for example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital.\n This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not.\n A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. This framework provides critical insights for evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task."}, "https://arxiv.org/abs/2409.01521": {"title": "Modelling Volatilities of High-dimensional Count Time Series with Network Structure and Asymmetry", "link": "https://arxiv.org/abs/2409.01521", "description": "arXiv:2409.01521v1 Announce Type: new \nAbstract: Modelling high-dimensional volatilities is a challenging topic, especially for high-dimensional discrete-valued time series data. This paper proposes a threshold spatial GARCH-type model for high-dimensional count data with network structure. The proposed model can simplify the parameterization by making use of the network structure in the data, and can capture the asymmetry in dynamics of volatilities by adopting a threshold structure. Our model is called the Poisson Threshold Network GARCH model, because the conditional distributions are assumed to be Poisson. Asymptotic theory of our maximum likelihood estimator (MLE) for the proposed spatial model is derived when both sample size and network dimension go to infinity. We obtain asymptotic statistical inference by investigating the weak dependence among components of the model and using limit theorems for weakly dependent random fields. 
Simulations are conducted to test the theoretical results, and the model is fitted to real count data as illustration of the proposed methodology."}, "https://arxiv.org/abs/2409.01599": {"title": "Multivariate Inference of Network Moments by Subsampling", "link": "https://arxiv.org/abs/2409.01599", "description": "arXiv:2409.01599v1 Announce Type: new \nAbstract: In this paper, we study the characterization of a network population by analyzing a single observed network, focusing on the counts of multiple network motifs or their corresponding multivariate network moments. We introduce an algorithm based on node subsampling to approximate the nontrivial joint distribution of the network moments, and prove its asymptotic accuracy. By examining the joint distribution of these moments, our approach captures complex dependencies among network motifs, making a significant advancement over earlier methods that rely on individual motifs marginally. This enables more accurate and robust network inference. Through real-world applications, such as comparing coexpression networks of distinct gene sets and analyzing collaboration patterns within the statistical community, we demonstrate that the multivariate inference of network moments provides deeper insights than marginal approaches, thereby enhancing our understanding of network mechanisms."}, "https://arxiv.org/abs/2409.01735": {"title": "Multi-objective Bayesian optimization for Likelihood-Free inference in sequential sampling models of decision making", "link": "https://arxiv.org/abs/2409.01735", "description": "arXiv:2409.01735v1 Announce Type: new \nAbstract: Joint modeling of different data sources in decision-making processes is crucial for understanding decision dynamics in consumer behavior models. Sequential Sampling Models (SSMs), grounded in neuro-cognitive principles, provide a systematic approach to combining information from multi-source data, such as those based on response times and choice outcomes. However, parameter estimation of SSMs is challenging due to the complexity of joint likelihood functions. Likelihood-Free inference (LFI) approaches enable Bayesian inference in complex models with intractable likelihoods, like SSMs, and only require the ability to simulate synthetic data from the model. Extending a popular approach to simulation efficient LFI for single-source data, we propose Multi-objective Bayesian Optimization for Likelihood-Free Inference (MOBOLFI) to estimate the parameters of SSMs calibrated using multi-source data. MOBOLFI models a multi-dimensional discrepancy between observed and simulated data, using a discrepancy for each data source. Multi-objective Bayesian Optimization is then used to ensure simulation efficient approximation of the SSM likelihood. The use of a multivariate discrepancy allows for approximations to individual data source likelihoods in addition to the joint likelihood, enabling both the detection of conflicting information and a deeper understanding of the importance of different data sources in estimating individual SSM parameters. We illustrate the advantages of our approach in comparison with the use of a single discrepancy in a simple synthetic data example and an SSM example with real-world data assessing preferences of ride-hailing drivers in Singapore to rent electric vehicles. 
Although we focus on applications to SSMs, our approach applies to the Likelihood-Free calibration of other models using multi-source data."}, "https://arxiv.org/abs/2409.01794": {"title": "Estimating Joint interventional distributions from marginal interventional data", "link": "https://arxiv.org/abs/2409.01794", "description": "arXiv:2409.01794v1 Announce Type: new \nAbstract: In this paper we show how to exploit interventional data to acquire the joint conditional distribution of all the variables using the Maximum Entropy principle. To this end, we extend the Causal Maximum Entropy method to make use of interventional data in addition to observational data. Using Lagrange duality, we prove that the solution to the Causal Maximum Entropy problem with interventional constraints lies in the exponential family, as in the Maximum Entropy solution. Our method allows us to perform two tasks of interest when marginal interventional distributions are provided for any subset of the variables. First, we show how to perform causal feature selection from a mixture of observational and single-variable interventional data, and, second, how to infer joint interventional distributions. For the former task, we show on synthetically generated data that our proposed method outperforms the state-of-the-art method on merging datasets, and yields comparable results to the KCI-test, which requires access to joint observations of all variables."}, "https://arxiv.org/abs/2409.01874": {"title": "Partial membership models for soft clustering of multivariate football player performance data", "link": "https://arxiv.org/abs/2409.01874", "description": "arXiv:2409.01874v1 Announce Type: new \nAbstract: The standard mixture modelling framework has been widely used to study heterogeneous populations, by modelling them as being composed of a finite number of homogeneous sub-populations. However, the standard mixture model assumes that each data point belongs to one and only one mixture component, or cluster, but when data points have fractional membership in multiple clusters this assumption is unrealistic. It is in fact conceptually very different to represent an observation as partly belonging to multiple groups instead of belonging to one group with uncertainty. For this purpose, various soft clustering approaches, or individual-level mixture models, have been developed. In this context, Heller et al. (2008) formulated the Bayesian partial membership model (PM) as an alternative structure for individual-level mixtures, which also captures partial membership in the form of attribute-specific mixtures, but does not assume a factorization over attributes. Our work proposes using the PM for soft clustering of count data arising in football performance analysis and compares the results with those achieved with the mixed membership model and the finite mixture model. Learning and inference are carried out using Markov chain Monte Carlo methods. The method is applied to Serie A football player data from the 2022/2023 football season, to estimate the positions on the field where the players tend to play, in addition to their primary position, based on their playing style. The application of the partial membership model to football data could have practical implications for coaches, talent scouts, team managers and analysts. 
These stakeholders can utilize the findings to make informed decisions related to team strategy, talent acquisition, and statistical research, ultimately enhancing performance and understanding in the field of football."}, "https://arxiv.org/abs/2409.01908": {"title": "Bayesian CART models for aggregate claim modeling", "link": "https://arxiv.org/abs/2409.01908", "description": "arXiv:2409.01908v1 Announce Type: new \nAbstract: This paper proposes three types of Bayesian CART (or BCART) models for aggregate claim amount, namely, frequency-severity models, sequential models and joint models. We propose a general framework for the BCART models applicable to data with multivariate responses, which is particularly useful for the joint BCART models with a bivariate response: the number of claims and aggregate claim amount. To facilitate frequency-severity modeling, we investigate BCART models for the right-skewed and heavy-tailed claim severity data by using various distributions. We discover that the Weibull distribution is superior to gamma and lognormal distributions, due to its ability to capture different tail characteristics in tree models. Additionally, we find that sequential BCART models and joint BCART models, which incorporate dependence between the number of claims and average severity, are beneficial and thus preferable to the frequency-severity BCART models in which independence is assumed. The effectiveness of these models' performance is illustrated by carefully designed simulations and real insurance data."}, "https://arxiv.org/abs/2409.01911": {"title": "Variable selection in convex nonparametric least squares via structured Lasso: An application to the Swedish electricity market", "link": "https://arxiv.org/abs/2409.01911", "description": "arXiv:2409.01911v1 Announce Type: new \nAbstract: We study the problem of variable selection in convex nonparametric least squares (CNLS). Whereas the least absolute shrinkage and selection operator (Lasso) is a popular technique for least squares, its variable selection performance is unknown in CNLS problems. In this work, we investigate the performance of the Lasso CNLS estimator and find out it is usually unable to select variables efficiently. Exploiting the unique structure of the subgradients in CNLS, we develop a structured Lasso by combining $\\ell_1$-norm and $\\ell_{\\infty}$-norm. To improve its predictive performance, we propose a relaxed version of the structured Lasso where we can control the two effects--variable selection and model shrinkage--using an additional tuning parameter. A Monte Carlo study is implemented to verify the finite sample performances of the proposed approaches. In the application of Swedish electricity distribution networks, when the regression model is assumed to be semi-nonparametric, our methods are extended to the doubly penalized CNLS estimators. 
The results from the simulation and application confirm that the proposed structured Lasso performs favorably, generally leading to sparser and more accurate predictive models, relative to the other variable selection methods in the literature."}, "https://arxiv.org/abs/2409.01926": {"title": "$Q_B$ Optimal Two-Level Designs for the Baseline Parameterization", "link": "https://arxiv.org/abs/2409.01926", "description": "arXiv:2409.01926v1 Announce Type: new \nAbstract: We have established the association matrix that expresses the estimator of effects under baseline parameterization, which has been considered in some recent literature, in an equivalent form as a linear combination of estimators of effects under the traditional centered parameterization. This allows the generalization of the $Q_B$ criterion, which evaluates designs under model uncertainty in the traditional centered parameterization, to be applicable to the baseline parameterization. Some optimal designs under the baseline parameterization seen in the previous literature are evaluated, and it is shown that, at a given prior probability of a main effect being in the best model, the design converges to $Q_B$ optimality as the probability of an interaction being in the best model converges to 0 from above. The $Q_B$ optimal designs for two setups of factors and run sizes at various priors are found by an extended coordinate exchange algorithm and the evaluation of their performance is discussed. Comparisons have been made to those optimal designs restricted to level balance and orthogonality conditions."}, "https://arxiv.org/abs/2409.01943": {"title": "Spatially-dependent Indian Buffet Processes", "link": "https://arxiv.org/abs/2409.01943", "description": "arXiv:2409.01943v1 Announce Type: new \nAbstract: We develop a new stochastic process called spatially-dependent Indian buffet processes (SIBP) for spatially correlated binary matrices and propose general spatial factor models for various multivariate response variables. We introduce spatial dependency through the stick-breaking representation of the original Indian buffet process (IBP) and a latent Gaussian process for the logit-transformed breaking proportion to capture underlying spatial correlation. We show that the marginal limiting properties of the number of non-zero entries under SIBP are the same as those in the original IBP, while the joint probability is affected by the spatial correlation. Using binomial expansion and Polya-gamma data augmentation, we provide a novel Gibbs sampling algorithm for posterior computation. The usefulness of the SIBP is demonstrated through simulation studies and two applications for large-dimensional multinomial data of areal dialects and geographical distribution of multiple tree species."}, "https://arxiv.org/abs/2409.01983": {"title": "Formalizing the causal interpretation in accelerated failure time models with unmeasured heterogeneity", "link": "https://arxiv.org/abs/2409.01983", "description": "arXiv:2409.01983v1 Announce Type: new \nAbstract: In the presence of unmeasured heterogeneity, the hazard ratio for exposure has a complex causal interpretation. To address this, accelerated failure time (AFT) models, which assess the effect on the survival time ratio scale, are often suggested as a better alternative. AFT models also allow for straightforward confounder adjustment. 
In this work, we formalize the causal interpretation of the acceleration factor in AFT models using structural causal models and data under independent censoring. We prove that the acceleration factor is a valid causal effect measure, even in the presence of frailty and treatment effect heterogeneity. Through simulations, we show that the acceleration factor better captures the causal effect than the hazard ratio when both AFT and proportional hazards models apply. Additionally, we extend the interpretation to systems with time-dependent acceleration factors, revealing the challenge of distinguishing between a time-varying homogeneous effect and unmeasured heterogeneity. While the causal interpretation of acceleration factors is promising, we caution practitioners about potential challenges in estimating these factors in the presence of effect heterogeneity."}, "https://arxiv.org/abs/2409.02087": {"title": "Objective Weights for Scoring: The Automatic Democratic Method", "link": "https://arxiv.org/abs/2409.02087", "description": "arXiv:2409.02087v1 Announce Type: new \nAbstract: When comparing performance (of products, services, entities, etc.), multiple attributes are involved. This paper deals with a way of weighting these attributes when one is seeking an overall score. It presents an objective approach to generating the weights in a scoring formula which avoids personal judgement. The first step is to find the maximum possible score for each assessed entity. These upper bound scores are found using Data Envelopment Analysis. In the second step the weights in the scoring formula are found by regressing the unique DEA scores on the attribute data. Reasons for using least squares and avoiding other distance measures are given. The method is tested on data where the true scores and weights are known. The method enables the construction of an objective scoring formula which has been generated from the data arising from all assessed entities and is, in that sense, democratic."}, "https://arxiv.org/abs/2409.00013": {"title": "CEopt: A MATLAB Package for Non-convex Optimization with the Cross-Entropy Method", "link": "https://arxiv.org/abs/2409.00013", "description": "arXiv:2409.00013v1 Announce Type: cross \nAbstract: This paper introduces CEopt (https://ceopt.org), a MATLAB tool leveraging the Cross-Entropy method for non-convex optimization. Due to the relative simplicity of the algorithm, it provides a kind of transparent ``gray-box'' optimization solver, with intuitive control parameters. Unique in its approach, CEopt effectively handles both equality and inequality constraints using an augmented Lagrangian method, offering robustness and scalability for moderately sized complex problems. Through select case studies, the package's applicability and effectiveness in various optimization scenarios are showcased, marking CEopt as a practical addition to optimization research and application toolsets."}, "https://arxiv.org/abs/2409.00417": {"title": "Learning linear acyclic causal model including Gaussian noise using ancestral relationships", "link": "https://arxiv.org/abs/2409.00417", "description": "arXiv:2409.00417v1 Announce Type: cross \nAbstract: This paper discusses algorithms for learning causal DAGs. The PC algorithm makes no assumptions other than the faithfulness to the causal model and can identify only up to the Markov equivalence class. 
LiNGAM assumes linearity and continuous non-Gaussian disturbances for the causal model, and the causal DAG defining LiNGAM is shown to be fully identifiable. The PC-LiNGAM, a hybrid of the PC algorithm and LiNGAM, can identify up to the distribution-equivalence pattern of a linear causal model, even in the presence of Gaussian disturbances. However, in the worst case, the PC-LiNGAM has factorial time complexity for the number of variables. In this paper, we propose an algorithm for learning the distribution-equivalence patterns of a linear causal model with a lower time complexity than PC-LiNGAM, using the causal ancestor finding algorithm in Maeda and Shimizu, which is generalized to account for Gaussian disturbances."}, "https://arxiv.org/abs/2409.00582": {"title": "CRUD-Capable Mobile Apps with R and shinyMobile: a Case Study in Rapid Prototyping", "link": "https://arxiv.org/abs/2409.00582", "description": "arXiv:2409.00582v1 Announce Type: cross \nAbstract: \"Harden\" is a Progressive Web Application (PWA) for Ecological Momentary Assessment (EMA) developed mostly in R, which runs on all platforms with an internet connection, including iOS and Android. It leverages the shinyMobile package for creating a reactive mobile user interface (UI), PostgreSQL for the database backend, and Google Cloud Run for scalable hosting in the cloud, with serverless execution. Using this technology stack, it was possible to rapidly prototype a fully CRUD-capable (Create, Read, Update, Delete) mobile app, with persistent user data across sessions, interactive graphs, and real-time statistical calculation. This framework is compared with current alternative frameworks for creating data science apps; it is argued that the shinyMobile package provides one of the most efficient methods for rapid prototyping and creation of statistical mobile apps that require advanced graphing capabilities. This paper outlines the methodology used to create the Harden application, and discusses the advantages and limitations of the shinyMobile approach to app development. It is hoped that this information will encourage other programmers versed in R to consider developing mobile apps with this framework."}, "https://arxiv.org/abs/2409.00704": {"title": "Stochastic Monotonicity and Random Utility Models: The Good and The Ugly", "link": "https://arxiv.org/abs/2409.00704", "description": "arXiv:2409.00704v1 Announce Type: cross \nAbstract: When it comes to structural estimation of risk preferences from data on choices, random utility models have long been one of the standard research tools in economics. A recent literature has challenged these models, pointing out some concerning monotonicity and, thus, identification problems. In this paper, we take a second look and point out that some of the criticism - while extremely valid - may have gone too far, demanding monotonicity of choice probabilities in decisions where it is not so clear whether it should be imposed. We introduce a new class of random utility models based on carefully constructed generalized risk premia which always satisfy our relaxed monotonicity criteria. Moreover, we show that some of the models used in applied research like the certainty-equivalent-based random utility model for CARA utility actually lie in this class of monotonic stochastic choice models. 
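For the certainty-equivalent-based random utility model under CARA utility mentioned just above (arXiv:2409.00704), a minimal sketch is the following; the logit noise on the certainty-equivalent difference and the parameter values are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def cara_certainty_equivalent(outcomes, probs, a):
    """Certainty equivalent of a lottery under CARA utility u(x) = -exp(-a*x)/a, a > 0."""
    expected_utility = np.sum(np.asarray(probs) * (-np.exp(-a * np.asarray(outcomes)) / a))
    return -np.log(-a * expected_utility) / a

def choice_probability(lottery_a, lottery_b, a, sigma):
    """Logit choice probability driven by the difference of certainty equivalents."""
    ce_a = cara_certainty_equivalent(*lottery_a, a)
    ce_b = cara_certainty_equivalent(*lottery_b, a)
    return 1.0 / (1.0 + np.exp(-(ce_a - ce_b) / sigma))

# example: a 50/50 lottery over 10 or 0 versus a sure payment of 4
p = choice_probability(([10.0, 0.0], [0.5, 0.5]), ([4.0], [1.0]), a=0.3, sigma=1.0)
```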
We conclude that not all random utility models are bad."}, "https://arxiv.org/abs/2409.00908": {"title": "EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification", "link": "https://arxiv.org/abs/2409.00908", "description": "arXiv:2409.00908v1 Announce Type: cross \nAbstract: Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely \\textsc{EnsLoss}, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration given to preserving the ``legitimacy'' of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Then, inspired by Dropout, \\textsc{EnsLoss} enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of \\textsc{EnsLoss} compared to fixed loss methods is demonstrated through experiments on a broad range of 14 OpenML tabular datasets and 46 image datasets with various deep learning architectures. The Python repository and source code are available on \\textsc{GitHub} at \\url{https://github.com/statmlben/rankseg}."}, "https://arxiv.org/abs/2409.01220": {"title": "Simultaneous Inference for Non-Stationary Random Fields, with Application to Gridded Data Analysis", "link": "https://arxiv.org/abs/2409.01220", "description": "arXiv:2409.01220v1 Announce Type: cross \nAbstract: Current statistics literature on statistical inference of random fields typically assumes that the fields are stationary or focuses on models of non-stationary Gaussian fields with parametric/semiparametric covariance families, which may not be sufficiently flexible to tackle complex modern-era random field data. This paper performs simultaneous nonparametric statistical inference for a general class of non-stationary and non-Gaussian random fields by modeling the fields as nonlinear systems with location-dependent transformations of an underlying `shift random field'. Asymptotic results, including concentration inequalities and Gaussian approximation theorems for high dimensional sparse linear forms of the random field, are derived. A computationally efficient locally weighted multiplier bootstrap algorithm is proposed and theoretically verified as a unified tool for the simultaneous inference of the aforementioned non-stationary non-Gaussian random field. 
Simulations and real-life data examples demonstrate good performances and broad applications of the proposed algorithm."}, "https://arxiv.org/abs/2409.01243": {"title": "Sample Complexity of the Sign-Perturbed Sums Method", "link": "https://arxiv.org/abs/2409.01243", "description": "arXiv:2409.01243v1 Announce Type: cross \nAbstract: We study the sample complexity of the Sign-Perturbed Sums (SPS) method, which constructs exact, non-asymptotic confidence regions for the true system parameters under mild statistical assumptions, such as independent and symmetric noise terms. The standard version of SPS deals with linear regression problems, however, it can be generalized to stochastic linear (dynamical) systems, even with closed-loop setups, and to nonlinear and nonparametric problems, as well. Although the strong consistency of the method was rigorously proven, the sample complexity of the algorithm was only analyzed so far for scalar linear regression problems. In this paper we study the sample complexity of SPS for general linear regression problems. We establish high probability upper bounds for the diameters of SPS confidence regions for finite sample sizes and show that the SPS regions shrink at the same, optimal rate as the classical asymptotic confidence ellipsoids. Finally, the difference between the theoretical bounds and the empirical sizes of SPS confidence regions is investigated experimentally."}, "https://arxiv.org/abs/2409.01464": {"title": "Stein transport for Bayesian inference", "link": "https://arxiv.org/abs/2409.01464", "description": "arXiv:2409.01464v1 Announce Type: cross \nAbstract: We introduce $\\textit{Stein transport}$, a novel methodology for Bayesian inference designed to efficiently push an ensemble of particles along a predefined curve of tempered probability distributions. The driving vector field is chosen from a reproducing kernel Hilbert space and can be derived either through a suitable kernel ridge regression formulation or as an infinitesimal optimal transport map in the Stein geometry. The update equations of Stein transport resemble those of Stein variational gradient descent (SVGD), but introduce a time-varying score function as well as specific weights attached to the particles. While SVGD relies on convergence in the long-time limit, Stein transport reaches its posterior approximation at finite time $t=1$. Studying the mean-field limit, we discuss the errors incurred by regularisation and finite-particle effects, and we connect Stein transport to birth-death dynamics and Fisher-Rao gradient flows. In a series of experiments, we show that in comparison to SVGD, Stein transport not only often reaches more accurate posterior approximations with a significantly reduced computational budget, but that it also effectively mitigates the variance collapse phenomenon commonly observed in SVGD."}, "https://arxiv.org/abs/2409.01570": {"title": "Smoothed Robust Phase Retrieval", "link": "https://arxiv.org/abs/2409.01570", "description": "arXiv:2409.01570v1 Announce Type: cross \nAbstract: The phase retrieval problem in the presence of noise aims to recover the signal vector of interest from a set of quadratic measurements with infrequent but arbitrary corruptions, and it plays an important role in many scientific applications. 
However, the essential geometric structure of the nonconvex robust phase retrieval based on the $\\ell_1$-loss remains largely unexplored, particularly with regard to spurious local solutions, even under the ideal noiseless setting, and its intrinsic nonsmooth nature also impacts the efficiency of optimization algorithms. This paper introduces the smoothed robust phase retrieval (SRPR) based on a family of convolution-type smoothed loss functions. Theoretically, we prove that the SRPR enjoys a benign geometric structure with high probability: (1) under the noiseless situation, the SRPR has no spurious local solutions, and the target signals are global solutions, and (2) under the infrequent but arbitrary corruptions, we characterize the stationary points of the SRPR and prove its benign landscape, which is the first landscape analysis of phase retrieval with corruption in the literature. Moreover, we prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless situation. Experiments on both simulated datasets and image recovery are provided to demonstrate the numerical performance of the SRPR."}, "https://arxiv.org/abs/2409.02086": {"title": "Taming Randomness in Agent-Based Models using Common Random Numbers", "link": "https://arxiv.org/abs/2409.02086", "description": "arXiv:2409.02086v1 Announce Type: cross \nAbstract: Random numbers are at the heart of every agent-based model (ABM) of health and disease. By representing each individual in a synthetic population, agent-based models enable detailed analysis of intervention impact and parameter sensitivity. Yet agent-based modeling has a fundamental signal-to-noise problem, in which small differences between simulations cannot be reliably differentiated from stochastic noise resulting from misaligned random number realizations. We introduce a novel methodology that eliminates noise due to misaligned random numbers, a first for agent-based modeling. Our approach enables meaningful individual-level analysis between ABM scenarios because all differences are driven by mechanistic effects rather than random number noise. A key result is that many fewer simulations are needed for some applications. We demonstrate the benefits of our approach on three disparate examples and discuss limitations."}, "https://arxiv.org/abs/2011.11558": {"title": "Beta-CoRM: A Bayesian Approach for $n$-gram Profiles Analysis", "link": "https://arxiv.org/abs/2011.11558", "description": "arXiv:2011.11558v3 Announce Type: replace \nAbstract: $n$-gram profiles have been successfully and widely used to analyse long sequences of potentially differing lengths for clustering or classification. Mainly, machine learning algorithms have been used for this purpose but, despite their predictive performance, these methods cannot discover hidden structures or provide a full probabilistic representation of the data. A novel class of Bayesian generative models for $n$-gram profiles used as binary attributes has been designed to address this. The flexibility of the proposed modelling allows for a straightforward approach to feature selection in the generative model. 
Furthermore, a slice sampling algorithm is derived for a fast inferential procedure, which is applied to synthetic and real data scenarios and shows that feature selection can improve classification accuracy."}, "https://arxiv.org/abs/2205.01016": {"title": "Evidence Estimation in Gaussian Graphical Models Using a Telescoping Block Decomposition of the Precision Matrix", "link": "https://arxiv.org/abs/2205.01016", "description": "arXiv:2205.01016v4 Announce Type: replace \nAbstract: Marginal likelihood, also known as model evidence, is a fundamental quantity in Bayesian statistics. It is used for model selection using Bayes factors or for empirical Bayes tuning of prior hyper-parameters. Yet, the calculation of evidence has remained a longstanding open problem in Gaussian graphical models. Currently, the only feasible solutions that exist are for special cases such as the Wishart or G-Wishart, in moderate dimensions. We develop an approach based on a novel telescoping block decomposition of the precision matrix that allows the estimation of evidence by application of Chib's technique under a very broad class of priors under mild requirements. Specifically, the requirements are: (a) the priors on the diagonal terms on the precision matrix can be written as gamma or scale mixtures of gamma random variables and (b) those on the off-diagonal terms can be represented as normal or scale mixtures of normal. This includes structured priors such as the Wishart or G-Wishart, and more recently introduced element-wise priors, such as the Bayesian graphical lasso and the graphical horseshoe. Among these, the true marginal is known in an analytically closed form for Wishart, providing a useful validation of our approach. For the general setting of the other three, and several more priors satisfying conditions (a) and (b) above, the calculation of evidence has remained an open question that this article resolves under a unifying framework."}, "https://arxiv.org/abs/2207.11686": {"title": "Inference for linear functionals of high-dimensional longitudinal proteomics data using generalized estimating equations", "link": "https://arxiv.org/abs/2207.11686", "description": "arXiv:2207.11686v3 Announce Type: replace \nAbstract: Regression analysis of correlated data, where multiple correlated responses are recorded on the same unit, is ubiquitous in many scientific areas. With the advent of new technologies, in particular high-throughput omics profiling assays, such correlated data increasingly consist of large number of variables compared with the available sample size. Motivated by recent longitudinal proteomics studies of COVID-19, we propose a novel inference procedure for linear functionals of high-dimensional regression coefficients in generalized estimating equations, which are widely used to analyze correlated data. Our estimator for this more general inferential target, obtained via constructing projected estimating equations, is shown to be asymptotically normally distributed under mild regularity conditions. We also introduce a data-driven cross-validation procedure to select the tuning parameter for estimating the projection direction, which is not addressed in the existing procedures. 
We illustrate the utility of the proposed procedure in providing confidence intervals for associations of individual proteins and severe COVID risk scores obtained based on high-dimensional proteomics data, and demonstrate its robust finite-sample performance, especially in estimation bias and confidence interval coverage, via extensive simulations."}, "https://arxiv.org/abs/2207.12804": {"title": "Large-Scale Low-Rank Gaussian Process Prediction with Support Points", "link": "https://arxiv.org/abs/2207.12804", "description": "arXiv:2207.12804v2 Announce Type: replace \nAbstract: Low-rank approximation is a popular strategy to tackle the \"big n problem\" associated with large-scale Gaussian process regressions. Basis functions for developing low-rank structures are crucial and should be carefully specified. Predictive processes simplify the problem by inducing basis functions with a covariance function and a set of knots. The existing literature suggests certain practical implementations of knot selection and covariance estimation; however, theoretical foundations explaining the influence of these two factors on predictive processes are lacking. In this paper, the asymptotic prediction performance of the predictive process and Gaussian process predictions is derived and the impacts of the selected knots and estimated covariance are studied. We suggest the use of support points as knots, which best represent data locations. Extensive simulation studies demonstrate the superiority of support points and verify our theoretical results. Real data of precipitation and ozone are used as examples, and the efficiency of our method over other widely used low-rank approximation methods is verified."}, "https://arxiv.org/abs/2212.12041": {"title": "Estimating network-mediated causal effects via principal components network regression", "link": "https://arxiv.org/abs/2212.12041", "description": "arXiv:2212.12041v3 Announce Type: replace \nAbstract: We develop a method to decompose causal effects on a social network into an indirect effect mediated by the network, and a direct effect independent of the social network. To handle the complexity of network structures, we assume that latent social groups act as causal mediators. We develop principal components network regression models to differentiate the social effect from the non-social effect. Fitting the regression models is as simple as principal components analysis followed by ordinary least squares estimation. We prove asymptotic theory for regression coefficients from this procedure and show that it is widely applicable, allowing for a variety of distributions on the regression errors and network edges. We carefully characterize the counterfactual assumptions necessary to use the regression models for causal inference, and show that current approaches to causal network regression may result in over-control bias. 
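As a bare-bones illustration of the "principal components analysis followed by ordinary least squares" recipe described in the abstract above (arXiv:2212.12041), the sketch below embeds the network through a truncated spectral decomposition and then runs OLS; the known latent dimension k, the variable names, and the omission of the paper's direct/indirect effect decomposition are simplifying assumptions.

```python
import numpy as np

def pc_network_regression(A, y, treatment, X, k):
    """Sketch: spectral (principal components) embedding of a symmetric
    adjacency matrix A, followed by OLS of the outcome on treatment,
    covariates and the estimated latent positions."""
    U, s, _ = np.linalg.svd(A)
    Z = U[:, :k] * np.sqrt(s[:k])                      # latent positions
    D = np.column_stack([np.ones(len(y)), treatment, X, Z])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return coef, Z
```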
The structure of our method is very general, so that it is applicable to many types of structured data beyond social networks, such as text, areal data, psychometrics, images and omics."}, "https://arxiv.org/abs/2301.06742": {"title": "Robust Realized Integrated Beta Estimator with Application to Dynamic Analysis of Integrated Beta", "link": "https://arxiv.org/abs/2301.06742", "description": "arXiv:2301.06742v2 Announce Type: replace \nAbstract: In this paper, we develop a robust non-parametric realized integrated beta estimator using high-frequency financial data contaminated by microstructure noise, which is robust to the stylized features, such as the time-varying beta and the dependence structure of microstructure noise. With this robust realized integrated beta estimator, we investigate dynamic structures of integrated betas and find an auto-regressive--moving-average (ARMA) structure. To model this dynamic structure, we utilize the ARMA model for daily integrated market betas. We call this the dynamic realized beta (DR Beta). We further introduce a high-frequency data generating process by filling the gap between the high-frequency-based non-parametric estimator and low-frequency dynamic structure. Then, we propose a quasi-likelihood procedure for estimating the model parameters with the robust realized integrated beta estimator as the proxy. We also establish asymptotic theorems for the proposed estimator and conduct a simulation study to check the finite-sample performance of the estimator. The empirical study with the S&P 500 index and the top 50 large trading volume stocks from the S&P 500 illustrates that the proposed DR Beta model with the robust realized beta estimator effectively accounts for dynamics in the market beta of individual stocks and better predicts future market betas."}, "https://arxiv.org/abs/2306.02360": {"title": "Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors", "link": "https://arxiv.org/abs/2306.02360", "description": "arXiv:2306.02360v2 Announce Type: replace \nAbstract: Dirichlet process mixtures are particularly sensitive to the value of the precision parameter controlling the behavior of the latent partition. Randomization of the precision through a prior distribution is a common solution, which leads to more robust inferential procedures. However, existing prior choices do not allow for transparent elicitation, due to the lack of analytical results. We introduce and investigate a novel prior for the Dirichlet process precision, the Stirling-gamma distribution. We study the distributional properties of the induced random partition, with an emphasis on the number of clusters. Our theoretical investigation clarifies the reasons for the improved robustness properties of the proposed prior. Moreover, we show that, under specific choices of its hyperparameters, the Stirling-gamma distribution is conjugate to the random partition of a Dirichlet process. We illustrate with an ecological application the usefulness of our approach for the detection of communities of ant workers."}, "https://arxiv.org/abs/2307.00224": {"title": "Flexible Bayesian Modeling for Longitudinal Binary and Ordinal Responses", "link": "https://arxiv.org/abs/2307.00224", "description": "arXiv:2307.00224v2 Announce Type: replace \nAbstract: Longitudinal studies with binary or ordinal responses are widely encountered in various disciplines, where the primary focus is on the temporal evolution of the probability of each response category. 
Traditional approaches build from the generalized mixed effects modeling framework. Even amplified with nonparametric priors placed on the fixed or random effects, such models are restrictive due to the implied assumptions on the marginal expectation and covariance structure of the responses. We tackle the problem from a functional data analysis perspective, treating the observations for each subject as realizations from subject-specific stochastic processes at the measured times. We develop the methodology focusing initially on binary responses, for which we assume the stochastic processes have Binomial marginal distributions. Leveraging the logits representation, we model the discrete space processes through sequences of continuous space processes. We utilize a hierarchical framework to model the mean and covariance kernel of the continuous space processes nonparametrically and simultaneously through a Gaussian process prior and an Inverse-Wishart process prior, respectively. The prior structure results in flexible inference for the evolution and correlation of binary responses, while allowing for borrowing of strength across all subjects. The modeling approach can be naturally extended to ordinal responses. Here, the continuation-ratio logits factorization of the multinomial distribution is key for efficient modeling and inference, including a practical way of dealing with unbalanced longitudinal data. The methodology is illustrated with synthetic data examples and an analysis of college students' mental health status data."}, "https://arxiv.org/abs/2308.12470": {"title": "Scalable Estimation of Multinomial Response Models with Random Consideration Sets", "link": "https://arxiv.org/abs/2308.12470", "description": "arXiv:2308.12470v3 Announce Type: replace \nAbstract: A common assumption in the fitting of unordered multinomial response models for $J$ mutually exclusive categories is that the responses arise from the same set of $J$ categories across subjects. However, when responses measure a choice made by the subject, it is more appropriate to condition the distribution of multinomial responses on a subject-specific consideration set, drawn from the power set of $\\{1,2,\\ldots,J\\}$. This leads to a mixture of multinomial response models governed by a probability distribution over the $J^{\\ast} = 2^J -1$ consideration sets. We introduce a novel method for estimating such generalized multinomial response models based on the fundamental result that any mass distribution over $J^{\\ast}$ consideration sets can be represented as a mixture of products of $J$ component-specific inclusion-exclusion probabilities. Moreover, under time-invariant consideration sets, the conditional posterior distribution of consideration sets is sparse. These features enable a scalable MCMC algorithm for sampling the posterior distribution of parameters, random effects, and consideration sets. Under regularity conditions, the posterior distributions of the marginal response probabilities and the model parameters satisfy consistency. 
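To make the representation result quoted above (arXiv:2308.12470) concrete, the sketch below enumerates the probability of every non-empty consideration set as a mixture of products of component-specific inclusion probabilities; the inputs pi and theta and the renormalisation over non-empty sets are hypothetical choices for illustration, not the paper's MCMC algorithm.

```python
from itertools import product
import numpy as np

def consideration_set_probs(pi, theta):
    """Mixture-of-products distribution over the 2^J - 1 non-empty
    consideration sets; pi: K mixture weights, theta: K x J inclusion probs."""
    K, J = theta.shape
    probs = {}
    for s in product([0, 1], repeat=J):
        if not any(s):
            continue                                   # empty set excluded
        s_arr = np.array(s)
        component = (theta ** s_arr) * ((1 - theta) ** (1 - s_arr))  # K x J
        probs[s] = float(pi @ component.prod(axis=1))
    total = sum(probs.values())
    return {s: p / total for s, p in probs.items()}    # renormalise

example = consideration_set_probs(np.array([0.6, 0.4]),
                                  np.array([[0.9, 0.2, 0.1],
                                            [0.3, 0.8, 0.5]]))
```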
The methodology is demonstrated in a longitudinal data set on weekly cereal purchases that cover $J = 101$ brands, a dimension substantially beyond the reach of existing methods."}, "https://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "https://arxiv.org/abs/2310.06330", "description": "arXiv:2310.06330v2 Announce Type: replace \nAbstract: Markov chain Monte Carlo (MCMC) is a commonly used method for approximating expectations with respect to probability distributions. Uncertainty assessment for MCMC estimators is essential in practical applications. Moreover, for multivariate functions of a Markov chain, it is important to estimate not only the auto-correlation for each component but also to estimate cross-correlations, in order to better assess sample quality, improve estimates of effective sample size, and use more effective stopping rules. Berg and Song [2022] introduced the moment least squares (momentLS) estimator, a shape-constrained estimator for the autocovariance sequence from a reversible Markov chain, for univariate functions of the Markov chain. Based on this sequence estimator, they proposed an estimator of the asymptotic variance of the sample mean from MCMC samples. In this study, we propose novel autocovariance sequence and asymptotic variance estimators for Markov chain functions with multiple components, based on the univariate momentLS estimators from Berg and Song [2022]. We demonstrate strong consistency of the proposed auto(cross)-covariance sequence and asymptotic variance matrix estimators. We conduct empirical comparisons of our method with other state-of-the-art approaches on simulated and real-data examples, using popular samplers including the random-walk Metropolis sampler and the No-U-Turn sampler from STAN."}, "https://arxiv.org/abs/2210.04482": {"title": "Leave-group-out cross-validation for latent Gaussian models", "link": "https://arxiv.org/abs/2210.04482", "description": "arXiv:2210.04482v5 Announce Type: replace-cross \nAbstract: Evaluating the predictive performance of a statistical model is commonly done using cross-validation. Although the leave-one-out method is frequently employed, its application is justified primarily for independent and identically distributed observations. However, this method tends to mimic interpolation rather than prediction when dealing with dependent observations. This paper proposes a modified cross-validation for dependent observations. This is achieved by excluding an automatically determined set of observations from the training set to mimic a more reasonable prediction scenario. Also, within the framework of latent Gaussian models, we illustrate a method to adjust the joint posterior for this modified cross-validation to avoid model refitting. This new approach is accessible in the R-INLA package (www.r-inla.org)."}, "https://arxiv.org/abs/2212.02935": {"title": "A multi-language toolkit for supporting automated checking of research outputs", "link": "https://arxiv.org/abs/2212.02935", "description": "arXiv:2212.02935v2 Announce Type: replace-cross \nAbstract: This article presents the automatic checking of research outputs package acro, which assists researchers and data governance teams by automatically applying best-practice principles-based statistical disclosure control (SDC) techniques on-the-fly as researchers conduct their analyses. 
acro distinguishes between: research output that is safe to publish; output that requires further analysis; and output that cannot be published because it creates substantial risk of disclosing private data. This is achieved through the use of a lightweight Python wrapper that sits over well-known analysis tools that produce outputs such as tables, plots, and statistical models. This adds functionality to (i) identify potentially disclosive outputs against a range of commonly used disclosure tests; (ii) apply disclosure mitigation strategies where required; (iii) report reasons for applying SDC; and (iv) produce simple summary documents trusted research environment staff can use to streamline their workflow. The major analytical programming languages used by researchers are supported: Python, R, and Stata. The acro code and documentation are available under an MIT license at https://github.com/AI-SDC/ACRO"}, "https://arxiv.org/abs/2312.04444": {"title": "Parameter Inference for Hypo-Elliptic Diffusions under a Weak Design Condition", "link": "https://arxiv.org/abs/2312.04444", "description": "arXiv:2312.04444v2 Announce Type: replace-cross \nAbstract: We address the problem of parameter estimation for degenerate diffusion processes defined via the solution of Stochastic Differential Equations (SDEs) with diffusion matrix that is not full-rank. For this class of hypo-elliptic diffusions recent works have proposed contrast estimators that are asymptotically normal, provided that the step-size in-between observations $\\Delta=\\Delta_n$ and their total number $n$ satisfy $n \\to \\infty$, $n \\Delta_n \\to \\infty$, $\\Delta_n \\to 0$, and additionally $\\Delta_n = o (n^{-1/2})$. This latter restriction places a requirement for a so-called `rapidly increasing experimental design'. In this paper, we overcome this limitation and develop a general contrast estimator satisfying asymptotic normality under the weaker design condition $\\Delta_n = o(n^{-1/p})$ for general $p \\ge 2$. Such a result has been obtained for elliptic SDEs in the literature, but its derivation in a hypo-elliptic setting is highly non-trivial. We provide numerical results to illustrate the advantages of the developed theory."}, "https://arxiv.org/abs/2401.12967": {"title": "Measure transport with kernel mean embeddings", "link": "https://arxiv.org/abs/2401.12967", "description": "arXiv:2401.12967v2 Announce Type: replace-cross \nAbstract: Kalman filters constitute a scalable and robust methodology for approximate Bayesian inference, matching first and second order moments of the target posterior. To improve the accuracy in nonlinear and non-Gaussian settings, we extend this principle to include more or different characteristics, based on kernel mean embeddings (KMEs) of probability measures into reproducing kernel Hilbert spaces. Focusing on the continuous-time setting, we develop a family of interacting particle systems (termed $\\textit{KME-dynamics}$) that bridge between prior and posterior, and that include the Kalman-Bucy filter as a special case. KME-dynamics does not require the score of the target, but rather estimates the score implicitly and intrinsically, and we develop links to score-based generative modeling and importance reweighting. A variant of KME-dynamics has recently been derived from an optimal transport and Fisher-Rao gradient flow perspective by Maurais and Marzouk, and we expose further connections to (kernelised) diffusion maps, leading to a variational formulation of regression type. 
Finally, we conduct numerical experiments on toy examples and the Lorenz 63 and 96 models, comparing our results against the ensemble Kalman filter and the mapping particle filter (Pulido and van Leeuwen, 2019, J. Comput. Phys.). Our experiments show particular promise for a hybrid modification (called Kalman-adjusted KME-dynamics)."}, "https://arxiv.org/abs/2409.02204": {"title": "Moment-type estimators for a weighted exponential family", "link": "https://arxiv.org/abs/2409.02204", "description": "arXiv:2409.02204v1 Announce Type: new \nAbstract: In this paper, we propose and study closed-form moment type estimators for a weighted exponential family. We also develop a bias-reduced version of these proposed closed-form estimators using bootstrap techniques. The estimators are evaluated using Monte Carlo simulation. This shows favourable results for the proposed bootstrap bias-reduced estimators."}, "https://arxiv.org/abs/2409.02209": {"title": "Estimand-based Inference in Presence of Long-Term Survivors", "link": "https://arxiv.org/abs/2409.02209", "description": "arXiv:2409.02209v1 Announce Type: new \nAbstract: In this article, we develop nonparametric inference methods for comparing survival data across two samples, which are beneficial for clinical trials of novel cancer therapies where long-term survival is a critical outcome. These therapies, including immunotherapies or other advanced treatments, aim to establish durable effects. They often exhibit distinct survival patterns such as crossing or delayed separation and potentially leveling-off at the tails of survival curves, clearly violating the proportional hazards assumption and rendering the hazard ratio inappropriate for measuring treatment effects. The proposed methodology utilizes the mixture cure framework to separately analyze the cure rates of long-term survivors and the survival functions of susceptible individuals. We evaluate a nonparametric estimator for the susceptible survival function in the one-sample setting. Under sufficient follow-up, it is expressed as a location-scale-shift variant of the Kaplan-Meier (KM) estimator. It retains several desirable features of the KM estimator, including inverse-probability-censoring weighting, product-limit estimation, self-consistency, and nonparametric efficiency. In scenarios of insufficient follow-up, it can easily be adapted by incorporating a suitable cure rate estimator. In the two-sample setting, besides using the difference in cure rates to measure the long-term effect, we propose a graphical estimand to compare the relative treatment effects on susceptible subgroups. This process, inspired by Kendall's tau, compares the order of survival times among susceptible individuals. The proposed methods' large-sample properties are derived for further inference, and the finite-sample properties are examined through extensive simulation studies. The proposed methodology is applied to analyze the digitized data from the CheckMate 067 immunotherapy clinical trial."}, "https://arxiv.org/abs/2409.02258": {"title": "Generalized implementation of invariant coordinate selection with positive semi-definite scatter matrices", "link": "https://arxiv.org/abs/2409.02258", "description": "arXiv:2409.02258v1 Announce Type: new \nAbstract: Invariant coordinate selection (ICS) is an unsupervised multivariate data transformation useful in many contexts such as outlier detection or clustering. 
It is based on the simultaneous diagonalization of two affine equivariant and positive definite scatter matrices. Its classical implementation relies on a non-symmetric eigenvalue problem (EVP) by diagonalizing one scatter matrix relative to the other. In the case of collinearity, at least one of the scatter matrices is singular and the problem cannot be solved. To address this limitation, three approaches are proposed based on: a Moore-Penrose pseudo-inverse (GINV), a dimension reduction (DR), and a generalized singular value decomposition (GSVD). Their properties are investigated theoretically and in different empirical applications. Overall, the extension based on GSVD seems the most promising even if it restricts the choice of scatter matrices to those that can be expressed as cross-products. In practice, some of the approaches also look suitable in the context of data in high dimension low sample size (HDLSS)."}, "https://arxiv.org/abs/2409.02269": {"title": "Simulation-calibration testing for inference in Lasso regressions", "link": "https://arxiv.org/abs/2409.02269", "description": "arXiv:2409.02269v1 Announce Type: new \nAbstract: We propose a test of the significance of a variable appearing on the Lasso path and use it in a procedure for selecting one of the models of the Lasso path, controlling the Family-Wise Error Rate. Our null hypothesis depends on a set A of already selected variables and states that it contains all the active variables. We focus on the regularization parameter value from which a first variable outside A is selected. As the test statistic, we use this quantity's conditional p-value, which we define conditional on the non-penalized estimated coefficients of the model restricted to A. We estimate this by simulating outcome vectors and then calibrating them on the observed outcome's estimated coefficients. We adapt the calibration heuristically to the case of generalized linear models in which it turns into an iterative stochastic procedure. We prove that the test controls the risk of selecting a false positive in linear models, both under the null hypothesis and, under a correlation condition, when A does not contain all active variables. We assess the performance of our procedure through extensive simulation studies. We also illustrate it in the detection of exposures associated with drug-induced liver injuries in the French pharmacovigilance database."}, "https://arxiv.org/abs/2409.02311": {"title": "Distribution Regression Difference-In-Differences", "link": "https://arxiv.org/abs/2409.02311", "description": "arXiv:2409.02311v1 Announce Type: new \nAbstract: We provide a simple distribution regression estimator for treatment effects in the difference-in-differences (DiD) design. Our procedure is particularly useful when the treatment effect differs across the distribution of the outcome variable. Our proposed estimator easily incorporates covariates and, importantly, can be extended to settings where the treatment potentially affects the joint distribution of multiple outcomes. Our key identifying restriction is that the counterfactual distribution of the treated in the untreated state has no interaction effect between treatment and time. This assumption results in a parallel trend assumption on a transformation of the distribution. We highlight the relationship between our procedure and assumptions with the changes-in-changes approach of Athey and Imbens (2006). 
We also reexamine two existing empirical examples which highlight the utility of our approach."}, "https://arxiv.org/abs/2409.02331": {"title": "A parameterization of anisotropic Gaussian fields with penalized complexity priors", "link": "https://arxiv.org/abs/2409.02331", "description": "arXiv:2409.02331v1 Announce Type: new \nAbstract: Gaussian random fields (GFs) are fundamental tools in spatial modeling and can be represented flexibly and efficiently as solutions to stochastic partial differential equations (SPDEs). The SPDEs depend on specific parameters, which enforce various field behaviors and can be estimated using Bayesian inference. However, the likelihood typically only provides limited insights into the covariance structure under in-fill asymptotics. In response, it is essential to leverage priors to achieve appropriate, meaningful covariance structures in the posterior. This study introduces a smooth, invertible parameterization of the correlation length and diffusion matrix of an anisotropic GF and constructs penalized complexity (PC) priors for the model when the parameters are constant in space. The formulated prior is weakly informative, effectively penalizing complexity by pushing the correlation range toward infinity and the anisotropy to zero."}, "https://arxiv.org/abs/2409.02372": {"title": "A Principal Square Response Forward Regression Method for Dimension Reduction", "link": "https://arxiv.org/abs/2409.02372", "description": "arXiv:2409.02372v1 Announce Type: new \nAbstract: Dimension reduction techniques, such as Sufficient Dimension Reduction (SDR), are indispensable for analyzing high-dimensional datasets. This paper introduces a novel SDR method named Principal Square Response Forward Regression (PSRFR) for estimating the central subspace of the response variable Y, given the vector of predictor variables $\\bm{X}$. We provide a computational algorithm for implementing PSRFR and establish its consistency and asymptotic properties. Monte Carlo simulations are conducted to assess the performance, efficiency, and robustness of the proposed method. Notably, PSRFR exhibits commendable performance in scenarios where the variance of each component becomes increasingly dissimilar, particularly when the predictor variables follow an elliptical distribution. Furthermore, we illustrate and validate the effectiveness of PSRFR using a real-world dataset concerning wine quality. Our findings underscore the utility and reliability of the PSRFR method in practical applications of dimension reduction for high-dimensional data analysis."}, "https://arxiv.org/abs/2409.02397": {"title": "High-dimensional Bayesian Model for Disease-Specific Gene Detection in Spatial Transcriptomics", "link": "https://arxiv.org/abs/2409.02397", "description": "arXiv:2409.02397v1 Announce Type: new \nAbstract: Identifying disease-indicative genes is critical for deciphering disease mechanisms and has attracted significant interest in biomedical research. Spatial transcriptomics offers unprecedented insights for the detection of disease-specific genes by enabling within-tissue contrasts. However, this new technology poses challenges for conventional statistical models developed for RNA-sequencing, as these models often neglect the spatial organization of tissue spots. 
In this article, we propose a Bayesian shrinkage model to characterize the relationship between high-dimensional gene expressions and the disease status of each tissue spot, incorporating spatial correlation among these spots through autoregressive terms. Our model adopts a hierarchical structure to facilitate the analysis of multiple correlated samples and is further extended to accommodate the missing data within tissues. To ensure the model's applicability to datasets of varying sizes, we develop two computational frameworks for Bayesian parameter estimation, tailored to both small and large sample scenarios. Simulation studies are conducted to evaluate the performance of the proposed model. The proposed model is applied to analyze the data arising from a HER2-positive breast cancer study."}, "https://arxiv.org/abs/2409.02573": {"title": "Fitting an Equation to Data Impartially", "link": "https://arxiv.org/abs/2409.02573", "description": "arXiv:2409.02573v1 Announce Type: new \nAbstract: We consider the problem of fitting a relationship (e.g. a potential scientific law) to data involving multiple variables. Ordinary (least squares) regression is not suitable for this because the estimated relationship will differ according to which variable is chosen as being dependent, and the dependent variable is unrealistically assumed to be the only variable which has any measurement error (noise). We present a very general method for estimating a linear functional relationship between multiple noisy variables, which are treated impartially, i.e. with no distinction between dependent and independent variables. The data are not assumed to follow any distribution, but all variables are treated as being equally reliable. Our approach extends the geometric mean functional relationship to multiple dimensions. This is especially useful with variables measured in different units, as it is naturally scale-invariant, whereas orthogonal regression is not. This is because our approach is not based on minimizing distances, but on the symmetric concept of correlation. The estimated coefficients are easily obtained from the covariances or correlations, and correspond to geometric means of associated least squares coefficients. The ease of calculation will hopefully allow widespread application of impartial fitting to estimate relationships in a neutral way."}, "https://arxiv.org/abs/2409.02642": {"title": "The Application of Green GDP and Its Impact on Global Economy and Environment: Analysis of GGDP based on SEEA model", "link": "https://arxiv.org/abs/2409.02642", "description": "arXiv:2409.02642v1 Announce Type: new \nAbstract: This paper presents an analysis of Green Gross Domestic Product (GGDP) using the System of Environmental-Economic Accounting (SEEA) model to evaluate its impact on global climate mitigation and economic health. GGDP is proposed as a superior measure to traditional GDP by incorporating natural resource consumption, environmental pollution control, and degradation factors. The study develops a GGDP model and employs grey correlation analysis and grey prediction models to assess its relationship with these factors. Key findings demonstrate that replacing GDP with GGDP can positively influence climate change, particularly in reducing CO2 emissions and stabilizing global temperatures. The analysis further explores the implications of GGDP adoption across developed and developing countries, with specific predictions for China and the United States. 
The results indicate a potential increase in economic levels for developing countries, while developed nations may experience a decrease. Additionally, the shift to GGDP is shown to significantly reduce natural resource depletion and population growth rates in the United States, suggesting broader environmental and economic benefits. This paper highlights the universal applicability of the GGDP model and its potential to enhance environmental and economic policies globally."}, "https://arxiv.org/abs/2409.02662": {"title": "The Impact of Data Elements on Narrowing the Urban-Rural Consumption Gap in China: Mechanisms and Policy Analysis", "link": "https://arxiv.org/abs/2409.02662", "description": "arXiv:2409.02662v1 Announce Type: new \nAbstract: The urban-rural consumption gap, as one of the important indicators in social development, directly reflects the imbalance in urban and rural economic and social development. Data elements, as an important component of New Quality Productivity, are of significant importance in promoting economic development and improving people's living standards in the information age. This study, through the analysis of fixed-effects regression models, system GMM regression models, and the intermediate effect model, found that the development level of data elements to some extent promotes the narrowing of the urban-rural consumption gap. At the same time, the intermediate variable of urban-rural income gap plays an important role between data elements and the consumption gap, with a significant intermediate effect. The results of the study indicate that the advancement of data elements can promote the balance of urban and rural residents' consumption levels by reducing the urban-rural income gap, providing theoretical support and policy recommendations for achieving common prosperity and promoting coordinated urban-rural development. Building upon this, this paper emphasizes the complex correlation between the development of data elements and the urban-rural consumption gap, and puts forward policy suggestions such as promoting the development of the data element market, strengthening the construction of the digital economy and e-commerce, and promoting integrated urban-rural development. Overall, the development of data elements is not only an important path to reducing the urban-rural consumption gap but also one of the key drivers for promoting the balanced development of China's economy and society. This study has theoretical and practical significance for understanding the mechanism of the urban-rural consumption gap and improving policies for urban-rural economic development."}, "https://arxiv.org/abs/2409.02705": {"title": "A family of toroidal diffusions with exact likelihood inference", "link": "https://arxiv.org/abs/2409.02705", "description": "arXiv:2409.02705v1 Announce Type: new \nAbstract: We provide a class of diffusion processes for continuous time-varying multivariate angular data with explicit transition probability densities, enabling exact likelihood inference. The presented diffusions are time-reversible and can be constructed for any pre-specified stationary distribution on the torus, including highly-multimodal mixtures. We give results on asymptotic likelihood theory allowing one-sample inference and tests of linear hypotheses for $k$ groups of diffusions, including homogeneity. We show that exact and direct diffusion bridge simulation is possible too. 
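The toroidal diffusions above (arXiv:2409.02705) are constructed so that their transition densities are explicit; the sketch below is not that family, only a generic Langevin-type diffusion on the two-dimensional torus with a prescribed product-von-Mises stationary density, discretised by an Euler scheme with angular wrapping, to show the basic ingredients.

```python
import numpy as np

def simulate_toroidal_langevin(T=10.0, dt=0.01, mu=(0.0, 0.0), kappa=(2.0, 1.0), seed=0):
    """Euler scheme for a Langevin diffusion on the 2-torus whose stationary
    density is a product of von Mises densities (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu, kappa = np.asarray(mu), np.asarray(kappa)
    n_steps = int(T / dt)
    theta = np.zeros((n_steps + 1, 2))
    for t in range(n_steps):
        drift = -0.5 * kappa * np.sin(theta[t] - mu)   # 0.5 * grad log density
        step = drift * dt + np.sqrt(dt) * rng.standard_normal(2)
        theta[t + 1] = np.mod(theta[t] + step + np.pi, 2 * np.pi) - np.pi  # wrap
    return theta
```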
A class of circular jump processes with similar properties is also proposed. Several numerical experiments illustrate the methodology for the circular and two-dimensional torus cases. The new family of diffusions is applied (i) to test several homogeneity hypotheses on the movement of ants and (ii) to simulate bridges between the three-dimensional backbones of two related proteins."}, "https://arxiv.org/abs/2409.02872": {"title": "Momentum Dynamics in Competitive Sports: A Multi-Model Analysis Using TOPSIS and Logistic Regression", "link": "https://arxiv.org/abs/2409.02872", "description": "arXiv:2409.02872v1 Announce Type: new \nAbstract: This paper explores the concept of \"momentum\" in sports competitions through the use of the TOPSIS model and 0-1 logistic regression model. First, the TOPSIS model is employed to evaluate the performance of two tennis players, with visualizations used to analyze the situation's evolution at every moment in the match, explaining how \"momentum\" manifests in sports. Then, the 0-1 logistic regression model is utilized to verify the impact of \"momentum\" on match outcomes, demonstrating that fluctuations in player performance and the successive occurrence of successes are not random. Additionally, this paper examines the indicators that influence the reversal of game situations by analyzing key match data and testing the accuracy of the models with match data. The findings show that the model accurately explains the conditions during matches and can be generalized to other sports competitions. Finally, the strengths, weaknesses, and potential future improvements of the model are discussed."}, "https://arxiv.org/abs/2409.02888": {"title": "Cost-Effectiveness Analysis for Disease Prevention -- A Case Study on Colorectal Cancer Screening", "link": "https://arxiv.org/abs/2409.02888", "description": "arXiv:2409.02888v1 Announce Type: new \nAbstract: Cancer Screening has been widely recognized as an effective strategy for preventing the disease. Despite its effectiveness, determining when to start screening is complicated, because starting too early increases the number of screenings over lifetime and thus costs but starting too late may miss the cancer that could have been prevented. Therefore, to make an informed recommendation on the age to start screening, it is necessary to conduct cost-effectiveness analysis to assess the gain in life years relative to the cost of screenings. As more large-scale observational studies become accessible, there is growing interest in evaluating cost-effectiveness based on empirical evidence. In this paper, we propose a unified measure for evaluating cost-effectiveness and a causal analysis for the continuous intervention of screening initiation age, under the multi-state modeling with semi-competing risks. Extensive simulation results show that the proposed estimators perform well in realistic scenarios. We perform a cost-effectiveness analysis of the colorectal cancer screening, utilizing data from the large-scale Women's Health Initiative. 
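As a small numerical companion to the cost-effectiveness analysis described just above (arXiv:2409.02888), the sketch below computes quality-adjusted life years as the quality-weighted area under a survival curve and forms the incremental cost-effectiveness ratio; the survival curves, quality weight and costs are entirely made-up illustrative inputs, not estimates from the Women's Health Initiative data.

```python
import numpy as np

def qaly(times, surv, quality=1.0):
    """Quality-adjusted life years: quality-weighted area under the survival
    curve (restricted mean survival), via the trapezoidal rule."""
    return quality * float(np.sum(np.diff(times) * (surv[1:] + surv[:-1]) / 2.0))

def icer(cost_new, cost_ref, qaly_new, qaly_ref):
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    return (cost_new - cost_ref) / (qaly_new - qaly_ref)

# entirely made-up inputs: screening initiation at age 50 versus no screening
t = np.linspace(50.0, 90.0, 401)
q_screen = qaly(t, np.exp(-0.020 * (t - 50.0)), quality=0.9)
q_none = qaly(t, np.exp(-0.024 * (t - 50.0)), quality=0.9)
ratio = icer(35_000.0, 21_000.0, q_screen, q_none)     # cost per QALY gained
```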
Our analysis reveals that initiating screening at age 50 years yields the highest quality-adjusted life years with an acceptable incremental cost-effectiveness ratio compared to no screening, providing real-world evidence in support of screening recommendation for colorectal cancer."}, "https://arxiv.org/abs/2409.02116": {"title": "Discovering Candidate Genes Regulated by GWAS Signals in Cis and Trans", "link": "https://arxiv.org/abs/2409.02116", "description": "arXiv:2409.02116v1 Announce Type: cross \nAbstract: Understanding the genetic underpinnings of complex traits and diseases has been greatly advanced by genome-wide association studies (GWAS). However, a significant portion of trait heritability remains unexplained, known as ``missing heritability\". Most GWAS loci reside in non-coding regions, posing challenges in understanding their functional impact. Integrating GWAS with functional genomic data, such as expression quantitative trait loci (eQTLs), can bridge this gap. This study introduces a novel approach to discover candidate genes regulated by GWAS signals in both cis and trans. Unlike existing eQTL studies that focus solely on cis-eQTLs or consider cis- and trans-QTLs separately, we utilize adaptive statistical metrics that can reflect both the strong, sparse effects of cis-eQTLs and the weak, dense effects of trans-eQTLs. Consequently, candidate genes regulated by the joint effects can be prioritized. We demonstrate the efficiency of our method through theoretical and numerical analyses and apply it to adipose eQTL data from the METabolic Syndrome in Men (METSIM) study, uncovering genes playing important roles in the regulatory networks influencing cardiometabolic traits. Our findings offer new insights into the genetic regulation of complex traits and present a practical framework for identifying key regulatory genes based on joint eQTL effects."}, "https://arxiv.org/abs/2409.02135": {"title": "Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling", "link": "https://arxiv.org/abs/2409.02135", "description": "arXiv:2409.02135v1 Announce Type: cross \nAbstract: Learning-based methods have gained attention as general-purpose solvers because they can automatically learn problem-specific heuristics, reducing the need for manually crafted heuristics. However, these methods often face challenges with scalability. To address these issues, the improved Sampling algorithm for Combinatorial Optimization (iSCO) using discrete Langevin dynamics has been proposed, demonstrating better performance than several learning-based solvers. This study proposes a different approach that integrates gradient-based update through continuous relaxation, combined with Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function from a simple convex form, where half-integral solutions dominate, to the original objective function, where the variables are restricted to 0 or 1. Furthermore, we incorporate parallel run communication leveraging GPUs, enhancing exploration capabilities and accelerating convergence. Numerical experiments demonstrate that our approach is a competitive general-purpose solver, achieving comparable performance to iSCO across various benchmark problems. 
Notably, our method exhibits superior trade-offs between speed and solution quality for large-scale instances compared to iSCO, commercial solvers, and specialized algorithms."}, "https://arxiv.org/abs/2409.02332": {"title": "Double Machine Learning at Scale to Predict Causal Impact of Customer Actions", "link": "https://arxiv.org/abs/2409.02332", "description": "arXiv:2409.02332v1 Announce Type: cross \nAbstract: The Causal Impact (CI) of customer actions is broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate the CI values across 100s of customer actions of business interest and 100s of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundreds of actions and millions of customers). We outline the DML methodology and implementation, and associated benefits over the traditional potential outcomes based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, and the ability to onboard new use cases, and improves the accessibility of the underlying code for partner teams."}, "https://arxiv.org/abs/2409.02604": {"title": "Hypothesizing Missing Causal Variables with LLMs", "link": "https://arxiv.org/abs/2409.02604", "description": "arXiv:2409.02604v1 Announce Type: cross \nAbstract: Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establish the relationship between the cause and the effect. Motivated by the scientific discovery process, in this work, we formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We design a benchmark with varying difficulty levels and knowledge assumptions about the causal graph. With the growing interest in using Large Language Models (LLMs) to assist in scientific discovery, we benchmark open-source and closed models on our testbed. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. In contrast, they underperform in hypothesizing the cause and effect variables themselves. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model."}, "https://arxiv.org/abs/2409.02708": {"title": "Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit", "link": "https://arxiv.org/abs/2409.02708", "description": "arXiv:2409.02708v1 Announce Type: cross \nAbstract: Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets.
One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects."}, "https://arxiv.org/abs/2103.05909": {"title": "A variational inference framework for inverse problems", "link": "https://arxiv.org/abs/2103.05909", "description": "arXiv:2103.05909v4 Announce Type: replace \nAbstract: A framework is presented for fitting inverse problem models via variational Bayes approximations. This methodology guarantees flexibility to statistical model specification for a broad range of applications, good accuracy and reduced model fitting times. The message passing and factor graph fragment approach to variational Bayes that is also described facilitates streamlined implementation of approximate inference algorithms and allows for supple inclusion of numerous response distributions and penalizations into the inverse problem model. Models for one- and two-dimensional response variables are examined and an infrastructure is laid down where efficient algorithm updates based on nullifying weak interactions between variables can also be derived for inverse problems in higher dimensions. An image processing application and a simulation exercise motivated by biomedical problems reveal the computational advantage offered by efficient implementation of variational Bayes over Markov chain Monte Carlo."}, "https://arxiv.org/abs/2112.02822": {"title": "A stableness of resistance model for nonresponse adjustment with callback data", "link": "https://arxiv.org/abs/2112.02822", "description": "arXiv:2112.02822v4 Announce Type: replace \nAbstract: Nonresponse arises frequently in surveys and follow-ups are routinely made to increase the response rate. In order to monitor the follow-up process, callback data have been used in social sciences and survey studies for decades. In modern surveys, the availability of callback data is increasing because the response rate is decreasing and follow-ups are essential to collect maximum information. Although callback data are helpful to reduce the bias in surveys, such data have not been widely used in statistical analysis until recently. We propose a stableness of resistance assumption for nonresponse adjustment with callback data. We establish the identification and the semiparametric efficiency theory under this assumption, and propose a suite of semiparametric estimation methods including doubly robust estimators, which generalize existing parametric approaches for callback data analysis. 
We apply the approach to a Consumer Expenditure Survey dataset. The results suggest an association between nonresponse and high housing expenditures."}, "https://arxiv.org/abs/2208.13323": {"title": "Safe Policy Learning under Regression Discontinuity Designs with Multiple Cutoffs", "link": "https://arxiv.org/abs/2208.13323", "description": "arXiv:2208.13323v4 Announce Type: replace \nAbstract: The regression discontinuity (RD) design is widely used for program evaluation with observational data. The primary focus of the existing literature has been the estimation of the local average treatment effect at the existing treatment cutoff. In contrast, we consider policy learning under the RD design. Because the treatment assignment mechanism is deterministic, learning better treatment cutoffs requires extrapolation. We develop a robust optimization approach to finding optimal treatment cutoffs that improve upon the existing ones. We first decompose the expected utility into point-identifiable and unidentifiable components. We then propose an efficient doubly-robust estimator for the identifiable parts. To account for the unidentifiable components, we leverage the existence of multiple cutoffs that are common under the RD design. Specifically, we assume that the heterogeneity in the conditional expectations of potential outcomes across different groups varies smoothly along the running variable. Under this assumption, we minimize the worst case utility loss relative to the status quo policy. The resulting new treatment cutoffs have a safety guarantee that they will not yield a worse overall outcome than the existing cutoffs. Finally, we establish the asymptotic regret bounds for the learned policy using semi-parametric efficiency theory. We apply the proposed methodology to empirical and simulated data sets."}, "https://arxiv.org/abs/2301.02739": {"title": "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values", "link": "https://arxiv.org/abs/2301.02739", "description": "arXiv:2301.02739v3 Announce Type: replace \nAbstract: Many testing problems are readily amenable to randomised tests such as those employing data splitting. However, despite their usefulness in principle, randomised tests have obvious drawbacks. Firstly, two analyses of the same dataset may lead to different results. Secondly, the test typically loses power because it does not fully utilise the entire sample. As a remedy to these drawbacks, we study how to combine the test statistics or p-values resulting from multiple random realisations such as through random data splits. We develop rank-transformed subsampling as a general method for delivering large sample inference about the combined statistic or p-value under mild assumptions. We apply our methodology to a wide range of problems, including testing unimodality in high-dimensional data, testing goodness-of-fit of parametric quantile regression models, testing no direct effect in a sequentially randomised trial and calibrating cross-fit double machine learning confidence intervals. In contrast to existing p-value aggregation schemes that can be highly conservative, our method enjoys type-I error control that asymptotically approaches the nominal level.
Moreover, compared to using ordinary subsampling, we show that our rank transform can remove the first-order bias in approximating the null under alternatives and greatly improve power."}, "https://arxiv.org/abs/2409.02942": {"title": "Categorical Data Analysis", "link": "https://arxiv.org/abs/2409.02942", "description": "arXiv:2409.02942v1 Announce Type: new \nAbstract: Categorical data are common in educational and social science research; however, methods for their analysis are generally not covered in introductory statistics courses. This chapter overviews fundamental concepts and methods in categorical data analysis. It describes and illustrates the analysis of contingency tables given different sampling processes and distributions, estimation of probabilities, hypothesis testing, measures of associations, and tests of no association with nominal variables, as well as the test of linear association with ordinal variables. Three data sets illustrate fatal police shootings in the United States, clinical trials of the Moderna vaccine, and responses to General Social Survey questions."}, "https://arxiv.org/abs/2409.02950": {"title": "On Inference of Weitzman Overlapping Coefficient in Two Weibull Distributions", "link": "https://arxiv.org/abs/2409.02950", "description": "arXiv:2409.02950v1 Announce Type: new \nAbstract: Studying overlapping coefficients has recently become of great benefit, especially after their use in goodness-of-fit tests. These coefficients are defined as the amount of similarity between two statistical distributions. This research examines the estimation of one of these overlapping coefficients, which is the Weitzman coefficient {\\Delta}, assuming two Weibull distributions and without using any restrictions on the parameters of these distributions. We studied the relative bias and relative mean square error of the resulting estimator by implementing a simulation study. The results show the importance of the resulting estimator."}, "https://arxiv.org/abs/2409.03069": {"title": "Discussion of \"Data fission: splitting a single data point\"", "link": "https://arxiv.org/abs/2409.03069", "description": "arXiv:2409.03069v1 Announce Type: new \nAbstract: Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operations in practice, leaving the reader unsure of how to apply data fission outside of the Gaussian and Poisson settings. In this discussion, we describe how our own work provides P1 fission operations in a wide variety of families and offers insight into when P1 fission is possible. We also provide guidance on how to actually apply P2 fission in practice, with a special focus on logistic regression. Finally, we interpret P2 fission as a remedy for distributional misspecification when carrying out P1 fission operations."}, "https://arxiv.org/abs/2409.03085": {"title": "Penalized Subgrouping of Heterogeneous Time Series", "link": "https://arxiv.org/abs/2409.03085", "description": "arXiv:2409.03085v1 Announce Type: new \nAbstract: Interest in the study and analysis of dynamic processes in the social, behavioral, and health sciences has burgeoned in recent years due to the increased availability of intensive longitudinal data.
However, how best to model and account for the persistent heterogeneity characterizing such processes remains an open question. The multi-VAR framework, a recent methodological development built on the vector autoregressive model, accommodates heterogeneous dynamics in multiple-subject time series through structured penalization. In the original multi-VAR proposal, individual-level transition matrices are decomposed into common and unique dynamics, allowing for generalizable and person-specific features. The current project extends this framework to allow additionally for the identification and penalized estimation of subgroup-specific dynamics; that is, patterns of dynamics that are shared across subsets of individuals. The performance of the proposed subgrouping extension is evaluated in the context of both a simulation study and empirical application, and results are compared to alternative methods for subgrouping multiple-subject, multivariate time series."}, "https://arxiv.org/abs/2409.03126": {"title": "Co-Developing Causal Graphs with Domain Experts Guided by Weighted FDR-Adjusted p-values", "link": "https://arxiv.org/abs/2409.03126", "description": "arXiv:2409.03126v1 Announce Type: new \nAbstract: This paper proposes an approach facilitating co-design of causal graphs between subject matter experts and statistical modellers. Modern causal analysis starting with formulation of causal graphs provides benefits for robust analysis and well-grounded decision support. Moreover, this process can enrich the discovery and planning phase of data science projects.\n The key premise is that plotting relevant statistical information on a causal graph structure can facilitate an intuitive discussion between domain experts and modellers. Furthermore, hand-crafting causal graphs, integrating human expertise with robust statistical methodology, helps ensure responsible AI practices.\n The paper focuses on using multiplicity-adjusted p-values, controlling for the false discovery rate (FDR), as an aid for co-designing the graph. A family of hypotheses relevant to causal graph construction is identified, including assessing correlation strengths, directions of causal effects, and how well an estimated structural causal model induces the observed covariance structure.\n An iterative flow is described where an initial causal graph is drafted based on expert beliefs about likely causal relationships. The subject matter expert's beliefs, communicated as ranked scores, could be incorporated into the control of the measure proposed by Benjamini and Kling, the FDCR (False Discovery Cost Rate). The FDCR-adjusted p-values then provide feedback on which parts of the graph are supported or contradicted by the data. This co-design process continues, adding, removing, or revising arcs in the graph, until the expert and modeller converge on a satisfactory causal structure grounded in both domain knowledge and data evidence."}, "https://arxiv.org/abs/2409.03136": {"title": "A New Forward Discriminant Analysis Framework Based On Pillai's Trace and ULDA", "link": "https://arxiv.org/abs/2409.03136", "description": "arXiv:2409.03136v1 Announce Type: new \nAbstract: Linear discriminant analysis (LDA), a traditional classification tool, suffers from limitations such as sensitivity to noise and computational challenges when dealing with non-invertible within-class scatter matrices.
Traditional stepwise LDA frameworks, which iteratively select the most informative features, often exacerbate these issues by relying heavily on Wilks' $\\Lambda$, potentially causing premature stopping of the selection process. This paper introduces a novel forward discriminant analysis framework that integrates Pillai's trace with Uncorrelated Linear Discriminant Analysis (ULDA) to address these challenges, and offers a unified and stand-alone classifier. Through simulations and real-world datasets, the new framework demonstrates effective control of Type I error rates and improved classification accuracy, particularly in cases involving perfect group separations. The results highlight the potential of this approach as a robust alternative to the traditional stepwise LDA framework."}, "https://arxiv.org/abs/2409.03181": {"title": "Wrapped Gaussian Process Functional Regression Model for Batch Data on Riemannian Manifolds", "link": "https://arxiv.org/abs/2409.03181", "description": "arXiv:2409.03181v1 Announce Type: new \nAbstract: Regression is an essential and fundamental methodology in statistical analysis. The majority of the literature focuses on linear and nonlinear regression in the context of the Euclidean space. However, regression models in non-Euclidean spaces deserve more attention due to the collection of increasing volumes of manifold-valued data. In this context, this paper proposes a concurrent functional regression model for batch data on Riemannian manifolds by estimating both mean structure and covariance structure simultaneously. The response variable is assumed to follow a wrapped Gaussian process distribution. Nonlinear relationships between manifold-valued response variables and multiple Euclidean covariates can be captured by this model in which the covariates can be functional and/or scalar. The performance of our model has been tested on both simulated data and real data, showing it is an effective and efficient tool for conducting functional data regression on Riemannian manifolds."}, "https://arxiv.org/abs/2409.03292": {"title": "Directional data analysis using the spherical Cauchy and the Poisson-kernel based distribution", "link": "https://arxiv.org/abs/2409.03292", "description": "arXiv:2409.03292v1 Announce Type: new \nAbstract: The spherical Cauchy distribution and the Poisson-kernel based distribution were both proposed in 2020, for the analysis of directional data. The paper explores both of them under various frameworks. Alternative parametrizations that offer numerical and estimation advantages, including a straightforward Newton-Raphson algorithm to estimate the parameters, are suggested, which further facilitate a more direct formulation under the regression setting. A two-sample location test based on the log-likelihood ratio test is suggested, complemented by discriminant analysis.
The two distributions are put to the test-bed for all aforementioned cases, through simulation studies and via real data examples comparing and illustrating their performance."}, "https://arxiv.org/abs/2409.03296": {"title": "An Efficient Two-Dimensional Functional Mixed-Effect Model Framework for Repeatedly Measured Functional Data", "link": "https://arxiv.org/abs/2409.03296", "description": "arXiv:2409.03296v1 Announce Type: new \nAbstract: With the rapid development of wearable device technologies, accelerometers can record minute-by-minute physical activity for consecutive days, which provides important insight into a dynamic association between the intensity of physical activity and mental health outcomes for large-scale population studies. Using Shanghai school adolescent cohort we estimate the effect of health assessment results on physical activity profiles recorded by accelerometers throughout a week, which is recognized as repeatedly measured functional data. To achieve this goal, we propose an innovative two-dimensional functional mixed-effect model (2dFMM) for the specialized data, which smoothly varies over longitudinal day observations with covariate-dependent mean and covariance functions. The modeling framework characterizes the longitudinal and functional structures while incorporating two-dimensional fixed effects for covariates of interest. We also develop a fast three-stage estimation procedure to provide accurate fixed-effect inference for model interpretability and improve computational efficiency when encountering large datasets. We find strong evidence of intraday and interday varying significant associations between physical activity and mental health assessments among our cohort population, which shed light on possible intervention strategies targeting daily physical activity patterns to improve school adolescent mental health. Our method is also used in environmental data to illustrate the wide applicability. Supplementary materials for this article are available online."}, "https://arxiv.org/abs/2409.03502": {"title": "Testing Whether Reported Treatment Effects are Unduly Dependent on the Specific Outcome Measure Used", "link": "https://arxiv.org/abs/2409.03502", "description": "arXiv:2409.03502v1 Announce Type: new \nAbstract: This paper addresses the situation in which treatment effects are reported using educational or psychological outcome measures comprised of multiple questions or ``items.'' A distinction is made between a treatment effect on the construct being measured, which is referred to as impact, and item-specific treatment effects that are not due to impact, which are referred to as differential item functioning (DIF). By definition, impact generalizes to other measures of the same construct (i.e., measures that use different items), while DIF is dependent upon the specific items that make up the outcome measure. To distinguish these two cases, two estimators of impact are compared: an estimator that naively aggregates over items, and a less efficient one that is highly robust to DIF. The null hypothesis that both are consistent estimators of the true treatment impact leads to a Hausman-like specification test of whether the naive estimate is affected by item-level variation that would not be expected to generalize beyond the specific outcome measure used. The performance of the test is illustrated with simulation studies and a re-analysis of 34 item-level datasets from 22 randomized evaluations of educational interventions. 
In the empirical example, the dependence of reported effect sizes on the type of outcome measure (researcher-developed or independently developed) was substantially reduced after accounting for DIF. Implications for the ongoing debate about the role of researcher-developed assessments in education sciences are discussed."}, "https://arxiv.org/abs/2409.03513": {"title": "Bias correction of posterior means using MCMC outputs", "link": "https://arxiv.org/abs/2409.03513", "description": "arXiv:2409.03513v1 Announce Type: new \nAbstract: We propose algorithms for addressing the bias of the posterior mean when used as an estimator of parameters. These algorithms build upon the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)) and can be implemented using the posterior covariance and third-order combined cumulants easily calculated from MCMC outputs. Two algorithms are introduced: the first utilises the output of a single-run MCMC with the original likelihood and prior to estimate the bias. A notable feature of the algorithm is its ability to estimate definitional bias (Efron (2015)), which is crucial for Bayesian estimators. The second algorithm is designed for high-dimensional and sparse data settings, where a ``quasi-prior'' for bias correction is introduced. The quasi-prior is iteratively refined using the output of the first algorithm as a measure of the residual bias at each step. These algorithms have been successfully implemented and tested for parameter estimation in the Weibull distribution and logistic regression in moderately high-dimensional settings."}, "https://arxiv.org/abs/2409.03572": {"title": "Extrinsic Principal Component Analysis", "link": "https://arxiv.org/abs/2409.03572", "description": "arXiv:2409.03572v1 Announce Type: new \nAbstract: One develops a fast computational methodology for principal component analysis on manifolds. Instead of estimating intrinsic principal components on an object space with a Riemannian structure, one embeds the object space in a numerical space, and the resulting chord distance is used. This method helps us analyze high-dimensional, theoretically even infinite-dimensional, data from a new perspective. We define the extrinsic principal sub-manifolds of a random object on a Hilbert manifold embedded in a Hilbert space, and the sample counterparts. The resulting extrinsic principal components are useful for data dimension reduction. For an application, one retains a very small number of such extrinsic principal components for a sample of contour shape data extracted from imaging data."}, "https://arxiv.org/abs/2409.03575": {"title": "Detecting Spatial Dependence in Transcriptomics Data using Vectorised Persistence Diagrams", "link": "https://arxiv.org/abs/2409.03575", "description": "arXiv:2409.03575v1 Announce Type: new \nAbstract: Evaluating spatial patterns in data is an integral task across various domains, including geostatistics, astronomy, and spatial tissue biology. The analysis of transcriptomics data in particular relies on methods for detecting spatially-dependent features that exhibit significant spatial patterns for both explanatory analysis and feature selection. However, given the complex and high-dimensional nature of these data, there is a need for robust, stable, and reliable descriptors of spatial dependence. We leverage the stability and multiscale properties of persistent homology to address this task.
To this end, we introduce a novel framework using functional topological summaries, such as Betti curves and persistence landscapes, for identifying and describing non-random patterns in spatial data. In particular, we propose a non-parametric one-sample permutation test for spatial dependence and investigate its utility across both simulated and real spatial omics data. Our vectorised approach outperforms baseline methods at accurately detecting spatial dependence. Further, we find that our method is more robust to outliers than alternative tests using Moran's I."}, "https://arxiv.org/abs/2409.03606": {"title": "Performance of Empirical Risk Minimization For Principal Component Regression", "link": "https://arxiv.org/abs/2409.03606", "description": "arXiv:2409.03606v1 Announce Type: new \nAbstract: This paper establishes bounds on the predictive performance of empirical risk minimization for principal component regression. Our analysis is nonparametric, in the sense that the relation between the prediction target and the predictors is not specified. In particular, we do not rely on the assumption that the prediction target is generated by a factor model. In our analysis we consider the cases in which the largest eigenvalues of the covariance matrix of the predictors grow linearly in the number of predictors (strong signal regime) or sublinearly (weak signal regime). The main result of this paper shows that empirical risk minimization for principal component regression is consistent for prediction and, under appropriate conditions, it achieves optimal performance (up to a logarithmic factor) in both the strong and weak signal regimes."}, "https://arxiv.org/abs/2302.10687": {"title": "Boosting the Power of Kernel Two-Sample Tests", "link": "https://arxiv.org/abs/2302.10687", "description": "arXiv:2302.10687v2 Announce Type: replace \nAbstract: The kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces. In this paper we propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance. We derive the asymptotic null distribution of the proposed test statistic and use a multiplier bootstrap approach to efficiently compute the rejection region. The resulting test is universally consistent and, since it is obtained by aggregating over a collection of kernels/bandwidths, is more powerful in detecting a wide range of alternatives in finite samples. We also derive the distribution of the test statistic for both fixed and local contiguous alternatives. The latter, in particular, implies that the proposed test is statistically efficient, that is, it has non-trivial asymptotic (Pitman) efficiency. The consistency properties of the Mahalanobis and other natural aggregation methods are also explored when the number of kernels is allowed to grow with the sample size. Extensive numerical experiments are performed on both synthetic and real-world datasets to illustrate the efficacy of the proposed method over single kernel tests. The computational complexity of the proposed method is also studied, both theoretically and in simulations. 
Our asymptotic results rely on deriving the joint distribution of MMD estimates using the framework of multiple stochastic integrals, which is more broadly useful, specifically, in understanding the efficiency properties of recently proposed adaptive MMD tests based on kernel aggregation and also in developing more computationally efficient (linear time) tests that combine multiple kernels. We conclude with an application of the Mahalanobis aggregation method for kernels with diverging scaling parameters."}, "https://arxiv.org/abs/2401.04156": {"title": "LASPATED: a Library for the Analysis of SPAtio-TEmporal Discrete data", "link": "https://arxiv.org/abs/2401.04156", "description": "arXiv:2401.04156v2 Announce Type: replace \nAbstract: We describe methods, tools, and a software library called LASPATED, available on GitHub (at https://github.com/vguigues/) to fit models using spatio-temporal data and space-time discretization. A video tutorial for this library is available on YouTube. We consider two types of methods to estimate a non-homogeneous Poisson process in space and time. The methods approximate the arrival intensity function of the Poisson process by discretizing space and time, and estimating arrival intensity as a function of subregion and time interval. With such methods, it is typical that the dimension of the estimator is large relative to the amount of data, and therefore the performance of the estimator can be improved by using additional data. The first method uses additional data to add a regularization term to the likelihood function for calibrating the intensity of the Poisson process. The second method uses additional data to estimate arrival intensity as a function of covariates. We describe a Python package to perform various types of space and time discretization. We also describe two packages for the calibration of the models, one in Matlab and one in C++. We demonstrate the advantages of our methods compared to basic maximum likelihood estimation with simulated and real data. The experiments with real data calibrate models of the arrival process of emergencies to be handled by the Rio de Janeiro emergency medical service."}, "https://arxiv.org/abs/2409.03805": {"title": "Exploratory Visual Analysis for Increasing Data Readiness in Artificial Intelligence Projects", "link": "https://arxiv.org/abs/2409.03805", "description": "arXiv:2409.03805v1 Announce Type: new \nAbstract: We present experiences and lessons learned from increasing data readiness of heterogeneous data for artificial intelligence projects using visual analysis methods. Increasing the data readiness level involves understanding both the data as well as the context in which it is used, which are challenges well suitable to visual analysis. For this purpose, we contribute a mapping between data readiness aspects and visual analysis techniques suitable for different data types. We use the defined mapping to increase data readiness levels in use cases involving time-varying data, including numerical, categorical, and text. In addition to the mapping, we extend the data readiness concept to better take aspects of the task and solution into account and explicitly address distribution shifts during data collection time. 
We report on our experiences in using the presented visual analysis techniques to aid future artificial intelligence projects in raising the data readiness level."}, "https://arxiv.org/abs/2409.03962": {"title": "Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria", "link": "https://arxiv.org/abs/2409.03962", "description": "arXiv:2409.03962v1 Announce Type: new \nAbstract: The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when using flexible statistical and machine learning models for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extend classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, efficiency, and staying within the bounds of the target parameter space. We establish conditions for nuisance functional estimates in terms of L2(P)-norms to achieve root-n consistent causal effect estimates. To facilitate practical application, we have developed the flexCausal package in R."}, "https://arxiv.org/abs/2409.03979": {"title": "Extreme Quantile Treatment Effects under Endogeneity: Evaluating Policy Effects for the Most Vulnerable Individuals", "link": "https://arxiv.org/abs/2409.03979", "description": "arXiv:2409.03979v1 Announce Type: new \nAbstract: We introduce a novel method for estimating and conducting inference about extreme quantile treatment effects (QTEs) in the presence of endogeneity. Our approach is applicable to a broad range of empirical research designs, including instrumental variables design and regression discontinuity design, among others. By leveraging regular variation and subsampling, the method ensures robust performance even in extreme tails, where data may be sparse or entirely absent. Simulation studies confirm the theoretical robustness of our approach. Applying our method to assess the impact of job training provided by the Job Training Partnership Act (JTPA), we find significantly negative QTEs for the lowest quantiles (i.e., the most disadvantaged individuals), contrasting with previous literature that emphasizes positive QTEs for intermediate quantiles."}, "https://arxiv.org/abs/2409.04079": {"title": "Fitting the Discrete Swept Skeletal Representation to Slabular Objects", "link": "https://arxiv.org/abs/2409.04079", "description": "arXiv:2409.04079v1 Announce Type: new \nAbstract: Statistical shape analysis of slabular objects like groups of hippocampi is highly useful for medical researchers as it can be useful for diagnoses and understanding diseases. 
This work proposes a novel object representation based on locally parameterized discrete swept skeletal structures. Further, model fitting and analysis of such representations are discussed. The model fitting procedure is based on boundary division and surface flattening. The quality of the model fitting is evaluated based on the symmetry and tidiness of the skeletal structure as well as the volume of the implied boundary. The power of the method is demonstrated by visual inspection and statistical analysis of a synthetic and an actual data set in comparison with an available skeletal representation."}, "https://arxiv.org/abs/2409.04126": {"title": "Incorporating external data for analyzing randomized clinical trials: A transfer learning approach", "link": "https://arxiv.org/abs/2409.04126", "description": "arXiv:2409.04126v1 Announce Type: new \nAbstract: Randomized clinical trials are the gold standard for analyzing treatment effects, but high costs and ethical concerns can limit recruitment, potentially leading to invalid inferences. Incorporating external trial data with similar characteristics into the analysis using transfer learning appears promising for addressing these issues. In this paper, we present a formal framework for applying transfer learning to the analysis of clinical trials, considering three key perspectives: transfer algorithm, theoretical foundation, and inference method. For the algorithm, we adopt a parameter-based transfer learning approach to enhance the lasso-adjusted stratum-specific estimator developed for estimating treatment effects. A key component in constructing the transfer learning estimator is deriving the regression coefficient estimates within each stratum, accounting for the bias between source and target data. To provide a theoretical foundation, we derive the $l_1$ convergence rate for the estimated regression coefficients and establish the asymptotic normality of the transfer learning estimator. Our results show that when external trial data resembles current trial data, the sample size requirements can be reduced compared to using only the current trial data. Finally, we propose a consistent nonparametric variance estimator to facilitate inference. Numerical studies demonstrate the effectiveness and robustness of our proposed estimator across various scenarios."}, "https://arxiv.org/abs/2409.04162": {"title": "Modelling multivariate spatio-temporal data with identifiable variational autoencoders", "link": "https://arxiv.org/abs/2409.04162", "description": "arXiv:2409.04162v1 Announce Type: new \nAbstract: Modelling multivariate spatio-temporal data with complex dependency structures is a challenging task but can be simplified by assuming that the original variables are generated from independent latent components. If these components are found, they can be modelled univariately. Blind source separation aims to recover the latent components by estimating the unmixing transformation based on the observed data only. The current methods for spatio-temporal blind source separation are restricted to linear unmixing, and nonlinear variants have not been implemented. In this paper, we extend identifiable variational autoencoder to the nonlinear nonstationary spatio-temporal blind source separation setting and demonstrate its performance using comprehensive simulation studies. Additionally, we introduce two alternative methods for the latent dimension estimation, which is a crucial task in order to obtain the correct latent representation. 
Finally, we illustrate the proposed methods using a meteorological application, where we estimate the latent dimension and the latent components, interpret the components, and show how nonstationarity can be accounted for and prediction accuracy improved by using the proposed nonlinear blind source separation method as a preprocessing method."}, "https://arxiv.org/abs/2409.04256": {"title": "The $\\infty$-S test via regression quantile affine LASSO", "link": "https://arxiv.org/abs/2409.04256", "description": "arXiv:2409.04256v1 Announce Type: new \nAbstract: The nonparametric sign test dates back to the early 18th century with a data analysis by John Arbuthnot. It is an alternative to Gosset's more recent $t$-test for consistent differences between two sets of observations. Fisher's $F$-test is a generalization of the $t$-test to linear regression and linear null hypotheses. Only the sign test is robust to non-Gaussianity. Gutenbrunner et al. [1993] derived a version of the sign test for linear null hypotheses in the spirit of the $F$-test, which requires the difficult estimation of the sparsity function. We propose instead a new sign test, called the $\\infty$-S test, via the convex analysis of a point estimator that thresholds the estimate towards the null hypothesis of the test."}, "https://arxiv.org/abs/2409.04378": {"title": "An MPEC Estimator for the Sequential Search Model", "link": "https://arxiv.org/abs/2409.04378", "description": "arXiv:2409.04378v1 Announce Type: new \nAbstract: This paper proposes a constrained maximum likelihood estimator for sequential search models, using the MPEC (Mathematical Programming with Equilibrium Constraints) approach. This method enhances numerical accuracy while avoiding ad hoc components and errors related to equilibrium conditions. Monte Carlo simulations show that the estimator performs better in small samples, with lower bias and root-mean-squared error, though less effectively in large samples. Despite these mixed results, the MPEC approach remains valuable for identifying candidate parameters comparable to the benchmark, without relying on ad hoc look-up tables, as it generates the table through solved equilibrium constraints."}, "https://arxiv.org/abs/2409.04412": {"title": "Robust Elicitable Functionals", "link": "https://arxiv.org/abs/2409.04412", "description": "arXiv:2409.04412v1 Announce Type: new \nAbstract: Elicitable functionals and (strictly) consistent scoring functions are of interest due to their utility in determining (uniquely) optimal forecasts, and thus the ability to effectively backtest predictions. However, in practice, assuming that a distribution is correctly specified is too strong a belief to reliably hold. To remedy this, we incorporate a notion of statistical robustness into the framework of elicitable functionals, meaning that our robust functional accounts for \"small\" misspecifications of a baseline distribution. Specifically, we propose a robustified version of elicitable functionals by using the Kullback-Leibler divergence to quantify potential misspecifications from a baseline distribution. We show that the robust elicitable functionals admit unique solutions lying at the boundary of the uncertainty region. Since every elicitable functional possesses infinitely many scoring functions, we propose the class of b-homogeneous strictly consistent scoring functions, for which the robust functionals maintain desirable statistical properties.
We show the applicability of the robust elicitable functional (REF) in two examples: in the reinsurance setting and in robust regression problems."}, "https://arxiv.org/abs/2409.03876": {"title": "A tutorial on panel data analysis using partially observed Markov processes via the R package panelPomp", "link": "https://arxiv.org/abs/2409.03876", "description": "arXiv:2409.03876v1 Announce Type: cross \nAbstract: The R package panelPomp supports analysis of panel data via a general class of partially observed Markov process models (PanelPOMP). This package tutorial describes how the mathematical concept of a PanelPOMP is represented in the software and demonstrates typical use-cases of panelPomp. Monte Carlo methods used for POMP models require adaptation for PanelPOMP models due to the higher dimensionality of panel data. The package takes advantage of recent advances for PanelPOMP, including an iterated filtering algorithm, Monte Carlo adjusted profile methodology and block optimization methodology to assist with the large parameter spaces that can arise with panel models. In addition, tools for manipulation of models and data are provided that take advantage of the panel structure."}, "https://arxiv.org/abs/2409.04001": {"title": "Over-parameterized regression methods and their application to semi-supervised learning", "link": "https://arxiv.org/abs/2409.04001", "description": "arXiv:2409.04001v1 Announce Type: cross \nAbstract: The minimum norm least squares is an estimation strategy in the over-parameterized case and, in machine learning, is known as a helpful tool for understanding the nature of deep learning. In this paper, to apply it in the context of non-parametric regression problems, we established several methods which are based on thresholding of SVD (singular value decomposition) components, which are referred to as SVD regression methods. We considered several thresholding methods: singular-value-based thresholding, hard-thresholding with cross validation, universal thresholding and bridge thresholding. Information on output samples is not utilized in the first method, while it is utilized in the other methods. We then applied them to semi-supervised learning, in which unlabeled input samples are incorporated into kernel functions in a regressor. The experimental results for real data showed that, depending on the datasets, the SVD regression methods are superior to a naive ridge regression method. Unfortunately, there was no clear advantage for the methods utilizing information on output samples. Furthermore, depending on the datasets, incorporation of unlabeled input samples into kernels is found to have certain advantages."}, "https://arxiv.org/abs/2409.04365": {"title": "Leveraging Machine Learning for Official Statistics: A Statistical Manifesto", "link": "https://arxiv.org/abs/2409.04365", "description": "arXiv:2409.04365v1 Announce Type: cross \nAbstract: It is important for official statistics production to apply ML with statistical rigor, as it presents both opportunities and challenges. Although machine learning has enjoyed rapid technological advances in recent years, its application does not possess the methodological robustness necessary to produce high quality statistical results. In order to account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology.
As a means of ensuring that ML models are both internally valid as well as externally valid, the TMLE model addresses issues such as representativeness and measurement errors. There are several case studies presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics."}, "https://arxiv.org/abs/2409.04377": {"title": "Local times of self-intersection and sample path properties of Volterra Gaussian processes", "link": "https://arxiv.org/abs/2409.04377", "description": "arXiv:2409.04377v1 Announce Type: cross \nAbstract: We study a Volterra Gaussian process of the form $X(t)=\\int^t_0K(t,s)d{W(s)},$ where $W$ is a Wiener process and $K$ is a continuous kernel. In dimension one, we prove a law of the iterated logarithm, discuss the existence of local times and verify a continuous dependence between the local time and the kernel that generates the process. Furthermore, we prove the existence of the Rosen renormalized self-intersection local times for a planar Gaussian Volterra process."}, "https://arxiv.org/abs/1808.04945": {"title": "A confounding bridge approach for double negative control inference on causal effects", "link": "https://arxiv.org/abs/1808.04945", "description": "arXiv:1808.04945v4 Announce Type: replace \nAbstract: Unmeasured confounding is a key challenge for causal inference. In this paper, we establish a framework for unmeasured confounding adjustment with negative control variables. A negative control outcome is associated with the confounder but not causally affected by the exposure in view, and a negative control exposure is correlated with the primary exposure or the confounder but does not causally affect the outcome of interest. We introduce an outcome confounding bridge function that depicts the relationship between the confounding effects on the primary outcome and the negative control outcome, and we incorporate a negative control exposure to identify the bridge function and the average causal effect. We also consider the extension to the positive control setting by allowing for nonzero causal effect of the primary exposure on the control outcome. We illustrate our approach with simulations and apply it to a study about the short-term effect of air pollution on mortality. Although a standard analysis shows a significant acute effect of PM2.5 on mortality, our analysis indicates that this effect may be confounded, and after double negative control adjustment, the effect is attenuated toward zero."}, "https://arxiv.org/abs/2311.11290": {"title": "Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias", "link": "https://arxiv.org/abs/2311.11290", "description": "arXiv:2311.11290v2 Announce Type: replace \nAbstract: Firth (1993, Biometrika) shows that the maximum Jeffreys' prior penalized likelihood estimator in logistic regression has asymptotic bias decreasing with the square of the number of observations when the number of parameters is fixed, which is an order faster than the typical rate from maximum likelihood. The widespread use of that estimator in applied work is supported by the results in Kosmidis and Firth (2021, Biometrika), who show that it takes finite values, even in cases where the maximum likelihood estimate does not exist. 
Kosmidis and Firth (2021, Biometrika) also provide empirical evidence that the estimator has good bias properties in high-dimensional settings where the number of parameters grows asymptotically linearly but slower than the number of observations. We design and carry out a large-scale computer experiment covering a wide range of such high-dimensional settings and produce strong empirical evidence for a simple rescaling of the maximum Jeffreys' prior penalized likelihood estimator that delivers high accuracy in signal recovery, in terms of aggregate bias, in the presence of an intercept parameter. The rescaled estimator is effective even in cases where estimates from maximum likelihood and other recently proposed corrective methods based on approximate message passing do not exist."}, "https://arxiv.org/abs/2210.12277": {"title": "The Stochastic Proximal Distance Algorithm", "link": "https://arxiv.org/abs/2210.12277", "description": "arXiv:2210.12277v4 Announce Type: replace-cross \nAbstract: Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\\rho \\rightarrow \\infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks."}, "https://arxiv.org/abs/2211.03933": {"title": "A Hypergraph-Based Machine Learning Ensemble Network Intrusion Detection System", "link": "https://arxiv.org/abs/2211.03933", "description": "arXiv:2211.03933v3 Announce Type: replace-cross \nAbstract: Network intrusion detection systems (NIDS) to detect malicious attacks continue to meet challenges. NIDS are often developed offline while they face auto-generated port scan infiltration attempts, resulting in a significant time lag from adversarial adaption to NIDS response. To address these challenges, we use hypergraphs focused on internet protocol addresses and destination ports to capture evolving patterns of port scan attacks. The derived set of hypergraph-based metrics are then used to train an ensemble machine learning (ML) based NIDS that allows for real-time adaption in monitoring and detecting port scanning activities, other types of attacks, and adversarial intrusions at high accuracy, precision and recall performances. This ML adapting NIDS was developed through the combination of (1) intrusion examples, (2) NIDS update rules, (3) attack threshold choices to trigger NIDS retraining requests, and (4) a production environment with no prior knowledge of the nature of network traffic. 40 scenarios were auto-generated to evaluate the ML ensemble NIDS comprising three tree-based models. The resulting ML Ensemble NIDS was extended and evaluated with the CIC-IDS2017 dataset. 
Results show that under the model settings of an Update-ALL-NIDS rule (specifically retrain and update all the three models upon the same NIDS retraining request) the proposed ML ensemble NIDS evolved intelligently and produced the best results with nearly 100% detection performance throughout the simulation."}, "https://arxiv.org/abs/2306.15422": {"title": "Debiasing Piecewise Deterministic Markov Process samplers using couplings", "link": "https://arxiv.org/abs/2306.15422", "description": "arXiv:2306.15422v2 Announce Type: replace-cross \nAbstract: Monte Carlo methods -- such as Markov chain Monte Carlo (MCMC) and piecewise deterministic Markov process (PDMP) samplers -- provide asymptotically exact estimators of expectations under a target distribution. There is growing interest in alternatives to this asymptotic regime, in particular in constructing estimators that are exact in the limit of an infinite amount of computing processors, rather than in the limit of an infinite number of Markov iterations. In particular, Jacob et al. (2020) introduced coupled MCMC estimators to remove the non-asymptotic bias, resulting in MCMC estimators that can be embarrassingly parallelised. In this work, we extend the estimators of Jacob et al. (2020) to the continuous-time context and derive couplings for the bouncy, the boomerang and the coordinate samplers. Some preliminary empirical results are included that demonstrate the reasonable scaling of our method with the dimension of the target."}, "https://arxiv.org/abs/2312.11437": {"title": "Clustering Consistency of General Nonparametric Classification Methods in Cognitive Diagnosis", "link": "https://arxiv.org/abs/2312.11437", "description": "arXiv:2312.11437v2 Announce Type: replace-cross \nAbstract: Cognitive diagnosis models have been popularly used in fields such as education, psychology, and social sciences. While parametric likelihood estimation is a prevailing method for fitting cognitive diagnosis models, nonparametric methodologies are attracting increasing attention due to their ease of implementation and robustness, particularly when sample sizes are relatively small. However, existing clustering consistency results of the nonparametric estimation methods often rely on certain restrictive conditions, which may not be easily satisfied in practice. In this article, the clustering consistency of the general nonparametric classification method is reestablished under weaker and more practical conditions."}, "https://arxiv.org/abs/2409.04589": {"title": "Lee Bounds with Multilayered Sample Selection", "link": "https://arxiv.org/abs/2409.04589", "description": "arXiv:2409.04589v1 Announce Type: new \nAbstract: This paper investigates the causal effect of job training on wage rates in the presence of firm heterogeneity. When training affects worker sorting to firms, sample selection is no longer binary but is \"multilayered\". This paper extends the canonical Heckman (1979) sample selection model - which assumes selection is binary - to a setting where it is multilayered, and shows that in this setting Lee bounds set identifies a total effect that combines a weighted-average of the causal effect of job training on wage rates across firms with a weighted-average of the contrast in wages between different firms for a fixed level of training. Thus, Lee bounds set identifies a policy-relevant estimand only when firms pay homogeneous wages and/or when job training does not affect worker sorting across firms. 
We derive sharp closed-form bounds for the causal effect of job training on wage rates at each firm which leverage information on firm-specific wages. We illustrate our partial identification approach with an empirical application to the Job Corps Study. Results show that while conventional Lee bounds are strictly positive, our within-firm bounds include 0 showing that canonical Lee bounds may be capturing a pure sorting effect of job training."}, "https://arxiv.org/abs/2409.04684": {"title": "Establishing the Parallels and Differences Between Right-Censored and Missing Covariates", "link": "https://arxiv.org/abs/2409.04684", "description": "arXiv:2409.04684v1 Announce Type: new \nAbstract: While right-censored time-to-event outcomes have been studied for decades, handling time-to-event covariates, also known as right-censored covariates, is now of growing interest. So far, the literature has treated right-censored covariates as distinct from missing covariates, overlooking the potential applicability of estimators to both scenarios. We bridge this gap by establishing connections between right-censored and missing covariates under various assumptions about censoring and missingness, allowing us to identify parallels and differences to determine when estimators can be used in both contexts. These connections reveal adaptations to five estimators for right-censored covariates in the unexplored area of informative covariate right-censoring and to formulate a new estimator for this setting, where the event time depends on the censoring time. We establish the asymptotic properties of the six estimators, evaluate their robustness under incorrect distributional assumptions, and establish their comparative efficiency. We conducted a simulation study to confirm our theoretical results, and then applied all estimators to a Huntington disease observational study to analyze cognitive impairments as a function of time to clinical diagnosis."}, "https://arxiv.org/abs/2409.04729": {"title": "A Unified Framework for Cluster Methods with Tensor Networks", "link": "https://arxiv.org/abs/2409.04729", "description": "arXiv:2409.04729v1 Announce Type: new \nAbstract: Markov Chain Monte Carlo (MCMC), and Tensor Networks (TN) are two powerful frameworks for numerically investigating many-body systems, each offering distinct advantages. MCMC, with its flexibility and theoretical consistency, is well-suited for simulating arbitrary systems by sampling. TN, on the other hand, provides a powerful tensor-based language for capturing the entanglement properties intrinsic to many-body systems, offering a universal representation of these systems. In this work, we leverage the computational strengths of TN to design a versatile cluster MCMC sampler. Specifically, we propose a general framework for constructing tensor-based cluster MCMC methods, enabling arbitrary cluster updates by utilizing TNs to compute the distributions required in the MCMC sampler. Our framework unifies several existing cluster algorithms as special cases and allows for natural extensions. We demonstrate our method by applying it to the simulation of the two-dimensional Edwards-Anderson Model and the three-dimensional Ising Model. This work is dedicated to the memory of Prof. 
David Draper."}, "https://arxiv.org/abs/2409.04836": {"title": "Spatial Interference Detection in Treatment Effect Model", "link": "https://arxiv.org/abs/2409.04836", "description": "arXiv:2409.04836v1 Announce Type: new \nAbstract: Modeling the interference effect is an important issue in the field of causal inference. Existing studies rely on explicit and often homogeneous assumptions regarding interference structures. In this paper, we introduce a low-rank and sparse treatment effect model that leverages data-driven techniques to identify the locations of interference effects. A profiling algorithm is proposed to estimate the model coefficients, and based on these estimates, global test and local detection methods are established to detect the existence of interference and the interference neighbor locations for each unit. We derive the non-asymptotic bound of the estimation error, and establish theoretical guarantees for the global test and the accuracy of the detection method in terms of the Jaccard index. Simulations and real data examples are provided to demonstrate the usefulness of the proposed method."}, "https://arxiv.org/abs/2409.04874": {"title": "Improving the Finite Sample Performance of Double/Debiased Machine Learning with Propensity Score Calibration", "link": "https://arxiv.org/abs/2409.04874", "description": "arXiv:2409.04874v1 Announce Type: new \nAbstract: Machine learning techniques are widely used for estimating causal effects. Double/debiased machine learning (DML) (Chernozhukov et al., 2018) uses a double-robust score function that relies on the prediction of nuisance functions, such as the propensity score, which is the probability of treatment assignment conditional on covariates. Estimators relying on double-robust score functions are highly sensitive to errors in propensity score predictions. Machine learners increase the severity of this problem as they tend to over- or underestimate these probabilities. Several calibration approaches have been proposed to improve probabilistic forecasts of machine learners. This paper investigates the use of probability calibration approaches within the DML framework. Simulation results demonstrate that calibrating propensity scores may significantly reduce the root mean squared error of DML estimates of the average treatment effect in finite samples. We showcase it in an empirical example and provide conditions under which calibration does not alter the asymptotic properties of the DML estimator."}, "https://arxiv.org/abs/2409.04876": {"title": "DEPLOYERS: An agent based modeling tool for multi country real world data", "link": "https://arxiv.org/abs/2409.04876", "description": "arXiv:2409.04876v1 Announce Type: new \nAbstract: We present recent progress in the design and development of DEPLOYERS, an agent-based macroeconomics modeling (ABM) framework, capable of deploying and simulating a full economic system (individual workers, goods and services firms, government, central and private banks, financial market, external sectors) whose structure and activity analysis reproduce the desired calibration data, which can be, for example, a Social Accounting Matrix (SAM), a Supply-Use Table (SUT), or an Input-Output Table (IOT). Here we extend our previous work to a multi-country version and show an example using data from a 46-countries 64-sectors FIGARO Inter-Country IOT.
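The propensity-calibration abstract above lends itself to a small illustration: the sketch below (synthetic data; the learners, the isotonic calibrator, and the simplified outcome fits are assumptions of this sketch, not the paper's exact procedure) calibrates cross-fitted propensity scores before plugging them into a standard AIPW / doubly robust score for the average treatment effect.

```python
# Generic illustration: calibrate cross-fitted propensity scores with isotonic
# regression before using them in the AIPW / doubly robust score for the ATE.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 5))
e_true = 1 / (1 + np.exp(-1.5 * X[:, 0]))
D = rng.binomial(1, e_true)
Y = 1.0 * D + X[:, 0] + rng.normal(size=n)          # true ATE = 1

# out-of-fold propensity predictions (often miscalibrated)
e_raw = cross_val_predict(GradientBoostingClassifier(random_state=0),
                          X, D, cv=5, method="predict_proba")[:, 1]
# post-hoc calibration of the cross-fitted scores
e_cal = np.clip(IsotonicRegression(out_of_bounds="clip").fit(e_raw, D).predict(e_raw), 0.01, 0.99)

# simple outcome regressions by arm (full-sample fits, to keep the sketch short)
mu1 = LinearRegression().fit(X[D == 1], Y[D == 1]).predict(X)
mu0 = LinearRegression().fit(X[D == 0], Y[D == 0]).predict(X)

def aipw(e):
    # doubly robust / AIPW estimate of the average treatment effect
    return np.mean(mu1 - mu0 + D * (Y - mu1) / e - (1 - D) * (Y - mu0) / (1 - e))

print("ATE with raw propensities:       ", round(aipw(e_raw.clip(0.01, 0.99)), 3))
print("ATE with calibrated propensities:", round(aipw(e_cal), 3))
```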
The simulation of each country runs on a separate thread or CPU core to simulate the activity of one step (month, week, or day) and then interacts (updates imports, exports, transfer) with that country's foreign partners, and proceeds to the next step. This interaction can be chosen to be aggregated (a single row and column IO account) or disaggregated (64 rows and columns) with each partner. A typical run simulates thousands of individuals and firms engaged in their monthly activity and then records the results, much like a survey of the country's economic system. This data can then be subjected to, for example, an Input-Output analysis to find out the sources of observed stylized effects as a function of time in the detailed and realistic modeling environment that can be easily implemented in an ABM framework."}, "https://arxiv.org/abs/2409.04933": {"title": "Marginal Structural Modeling of Representative Treatment Trajectories", "link": "https://arxiv.org/abs/2409.04933", "description": "arXiv:2409.04933v1 Announce Type: new \nAbstract: Marginal structural models (MSMs) are widely used in observational studies to estimate the causal effect of time-varying treatments. Despite its popularity, limited attention has been paid to summarizing the treatment history in the outcome model, which proves particularly challenging when individuals' treatment trajectories exhibit complex patterns over time. Commonly used metrics such as the average treatment level fail to adequately capture the treatment history, hindering causal interpretation. For scenarios where treatment histories exhibit distinct temporal patterns, we develop a new approach to parameterize the outcome model. We apply latent growth curve analysis to identify representative treatment trajectories from the observed data and use the posterior probability of latent class membership to summarize the different treatment trajectories. We demonstrate its use in parameterizing the MSMs, which facilitates the interpretations of the results. We apply the method to analyze data from an existing cohort of lung transplant recipients to estimate the effect of Tacrolimus concentrations on the risk of incident chronic kidney disease."}, "https://arxiv.org/abs/2409.04970": {"title": "A response-adaptive multi-arm design for continuous endpoints based on a weighted information measure", "link": "https://arxiv.org/abs/2409.04970", "description": "arXiv:2409.04970v1 Announce Type: new \nAbstract: Multi-arm trials are gaining interest in practice given the statistical and logistical advantages that they can offer. The standard approach is to use a fixed (throughout the trial) allocation ratio, but there is a call for making it adaptive and skewing the allocation of patients towards better performing arms. However, among other challenges, it is well-known that these approaches might suffer from lower statistical power. We present a response-adaptive design for continuous endpoints which explicitly allows to control the trade-off between the number of patients allocated to the 'optimal' arm and the statistical power. Such a balance is achieved through the calibration of a tuning parameter, and we explore various strategies to effectively select it. The proposed criterion is based on a context-dependent information measure which gives a greater weight to those treatment arms which have characteristics close to a pre-specified clinical target. 
We also introduce a simulation-based hypothesis testing procedure which focuses on selecting the target arm, discussing strategies to effectively control the type-I error rate. The potential advantage of the proposed criterion over currently used alternatives is evaluated in simulations, and its practical implementation is illustrated in the context of early Phase IIa proof-of-concept oncology clinical trials."}, "https://arxiv.org/abs/2409.04981": {"title": "Forecasting Age Distribution of Deaths: Cumulative Distribution Function Transformation", "link": "https://arxiv.org/abs/2409.04981", "description": "arXiv:2409.04981v1 Announce Type: new \nAbstract: Like density functions, period life-table death counts are nonnegative and have a constrained integral, and thus live in a constrained nonlinear space. Implementing established modelling and forecasting methods without obeying these constraints can be problematic for such nonlinear data. We introduce cumulative distribution function transformation to forecast the life-table death counts. Using the Japanese life-table death counts obtained from the Japanese Mortality Database (2024), we evaluate the point and interval forecast accuracies of the proposed approach, which compares favourably to an existing compositional data analytic approach. The improved forecast accuracy of life-table death counts is of great interest to demographers for estimating age-specific survival probabilities and life expectancy and actuaries for determining temporary annuity prices for different ages and maturities."}, "https://arxiv.org/abs/2409.04995": {"title": "Projective Techniques in Consumer Research: A Mixed Methods-Focused Review and Empirical Reanalysis", "link": "https://arxiv.org/abs/2409.04995", "description": "arXiv:2409.04995v1 Announce Type: new \nAbstract: This article gives an integrative review of research using projective methods in the consumer research domain. We give a general historical overview of the use of projective methods, both in psychology and in consumer research applications, and discuss the reliability and validity aspects and measurement for projective techniques. We review the literature on projective techniques in the areas of marketing, hospitality & tourism, and consumer & food science, with a mixed methods research focus on the interplay of qualitative and quantitative techniques. We review the use of several quantitative techniques used for structuring and analyzing projective data and run an empirical reanalysis of previously gathered data. We give recommendations for improved rigor and for potential future work involving mixed methods in projective techniques."}, "https://arxiv.org/abs/2409.05038": {"title": "An unbiased rank-based estimator of the Mann-Whitney variance including the case of ties", "link": "https://arxiv.org/abs/2409.05038", "description": "arXiv:2409.05038v1 Announce Type: new \nAbstract: Many estimators of the variance of the well-known unbiased and uniform most powerful estimator $\\htheta$ of the Mann-Whitney effect, $\\theta = P(X < Y) + \\nfrac12 P(X=Y)$, are considered in the literature. Some of these estimators are only valid in case of no ties or are biased in case of small sample sizes where the amount of the bias is not discussed. Here we derive an unbiased estimator that is based on different rankings, the so-called 'placements' (Orban and Wolfe, 1980), and is therefore easy to compute. This estimator does not require the assumption of continuous \\dfs\\ and is also valid in the case of ties. 
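A schematic of the cumulative-distribution-function transformation described above, applied to synthetic life-table death counts; the specific transform (a logit) and the random-walk-with-drift forecast are assumptions of this sketch, not necessarily the authors' choices.

```python
# Schematic of the general idea: turn life-table death counts into a CDF,
# move to an unconstrained scale via a logit, forecast each age component with
# a random walk with drift, and back-transform so forecasts respect the radix.
import numpy as np

rng = np.random.default_rng(3)
ages, years, radix = 101, 40, 100000
grid = np.arange(ages)
# toy "death counts": a slowly drifting Gaussian bump over age, by year
raw = np.array([np.exp(-0.5 * ((grid - (70 + 0.2 * t)) / 12) ** 2) for t in range(years)])
dx = radix * raw / raw.sum(axis=1, keepdims=True)            # rows sum to the radix

F = np.cumsum(dx, axis=1) / radix                            # empirical CDF by year
Z = np.log(F[:, :-1] / (1 - F[:, :-1]))                      # logit of interior CDF values

h = 10                                                       # forecast horizon
drift = (Z[-1] - Z[0]) / (years - 1)
Z_fc = Z[-1] + np.outer(np.arange(1, h + 1), drift)          # random walk with drift

F_fc = 1 / (1 + np.exp(-Z_fc))                               # back to CDF scale
F_fc = np.minimum.accumulate(F_fc[:, ::-1], axis=1)[:, ::-1] # enforce monotonicity
F_fc = np.hstack([F_fc, np.ones((h, 1))])
dx_fc = radix * np.diff(np.hstack([np.zeros((h, 1)), F_fc]), axis=1)
print(dx_fc.shape, dx_fc.sum(axis=1)[:3])                    # each forecast year sums to the radix
```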
Moreover, it is shown that this estimator is non-negative and has a sharp upper bound which may be considered an empirical version of the well-known Birnbaum-Klose inequality. The derivation of this estimator provides an option to compute the biases of some commonly used estimators in the literature. Simulations demonstrate that, for small sample sizes, the biases of these estimators depend on the underlying \\dfs\\ and thus are not under control. This means that in the case of a biased estimator, simulation results for the type-I error of a test or the coverage probability of a \\ci\\ do not only depend on the quality of the approximation of $\\htheta$ by a normal \\db\\ but also an additional unknown bias caused by the variance estimator. Finally, it is shown that this estimator is $L_2$-consistent."}, "https://arxiv.org/abs/2409.05160": {"title": "Inference for Large Scale Regression Models with Dependent Errors", "link": "https://arxiv.org/abs/2409.05160", "description": "arXiv:2409.05160v1 Announce Type: new \nAbstract: The exponential growth in data sizes and storage costs has brought considerable challenges to the data science community, requiring solutions to run learning methods on such data. While machine learning has scaled to achieve predictive accuracy in big data settings, statistical inference and uncertainty quantification tools are still lagging. Priority scientific fields collect vast data to understand phenomena typically studied with statistical methods like regression. In this setting, regression parameter estimation can benefit from efficient computational procedures, but the main challenge lies in computing error process parameters with complex covariance structures. Identifying and estimating these structures is essential for inference and often used for uncertainty quantification in machine learning with Gaussian Processes. However, estimating these structures becomes burdensome as data scales, requiring approximations that compromise the reliability of outputs. These approximations are even more unreliable when complexities like long-range dependencies or missing data are present. This work defines and proves the statistical properties of the Generalized Method of Wavelet Moments with Exogenous variables (GMWMX), a highly scalable, stable, and statistically valid method for estimating and delivering inference for linear models using stochastic processes in the presence of data complexities like latent dependence structures and missing data. Applied examples from Earth Sciences and extensive simulations highlight the advantages of the GMWMX."}, "https://arxiv.org/abs/2409.05161": {"title": "Really Doing Great at Model Evaluation for CATE Estimation? A Critical Consideration of Current Model Evaluation Practices in Treatment Effect Estimation", "link": "https://arxiv.org/abs/2409.05161", "description": "arXiv:2409.05161v1 Announce Type: new \nAbstract: This paper critically examines current methodologies for evaluating models in Conditional and Average Treatment Effect (CATE/ATE) estimation, identifying several key pitfalls in existing practices. The current approach of over-reliance on specific metrics and empirical means and lack of statistical tests necessitates a more rigorous evaluation approach. We propose an automated algorithm for selecting appropriate statistical tests, addressing the trade-offs and assumptions inherent in these tests. 
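The placement idea referenced above can be illustrated with a standard placement-based (DeLong-type) variance estimator for the Mann-Whitney effect; note that this sketch is not the exactly unbiased estimator derived in the paper.

```python
# Standard placement-based (DeLong-type) variance estimator for the
# Mann-Whitney effect theta = P(X < Y) + 0.5 * P(X = Y); illustrative only.
import numpy as np

def mw_effect_and_variance(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, n = len(x), len(y)
    # pairwise kernel psi(x_i, y_j) = 1{x < y} + 0.5 * 1{x == y}
    psi = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    theta_hat = psi.mean()
    v10 = psi.mean(axis=1)          # placements of the x's among the y's
    v01 = psi.mean(axis=0)          # placements of the y's among the x's
    var_hat = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    return theta_hat, var_hat

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 30)
y = rng.normal(0.5, 1.0, 25)
theta_hat, var_hat = mw_effect_and_variance(x, y)
print(round(theta_hat, 3), round(var_hat, 5))
```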
Additionally, we emphasize the importance of reporting empirical standard deviations alongside performance metrics and advocate for using Squared Error for Coverage (SEC) and Absolute Error for Coverage (AEC) metrics and empirical histograms of the coverage results as supplementary metrics. These enhancements provide a more comprehensive understanding of model performance in heterogeneous data-generating processes (DGPs). The practical implications are demonstrated through two examples, showcasing the benefits of these methodological improvements, which can significantly improve the robustness and accuracy of future research in statistical models for CATE and ATE estimation."}, "https://arxiv.org/abs/2409.05184": {"title": "Difference-in-Differences with Multiple Events", "link": "https://arxiv.org/abs/2409.05184", "description": "arXiv:2409.05184v1 Announce Type: new \nAbstract: Confounding events with correlated timing violate the parallel trends assumption in Difference-in-Differences (DiD) designs. I show that the standard staggered DiD estimator is biased in the presence of confounding events. Identification can be achieved with units not yet treated by either event as controls and a double DiD design using variation in treatment timing. I apply this method to examine the effect of states' staggered minimum wage raise on teen employment from 2010 to 2020. The Medicaid expansion under the ACA confounded the raises, leading to a spurious negative estimate."}, "https://arxiv.org/abs/2409.05271": {"title": "Priors from Envisioned Posterior Judgments: A Novel Elicitation Approach With Application to Bayesian Clinical Trials", "link": "https://arxiv.org/abs/2409.05271", "description": "arXiv:2409.05271v1 Announce Type: new \nAbstract: The uptake of formalized prior elicitation from experts in Bayesian clinical trials has been limited, largely due to the challenges associated with complex statistical modeling, the lack of practical tools, and the cognitive burden on experts required to quantify their uncertainty using probabilistic language. Additionally, existing methods do not address prior-posterior coherence, i.e., does the posterior distribution, obtained mathematically from combining the estimated prior with the trial data, reflect the expert's actual posterior beliefs? We propose a new elicitation approach that seeks to ensure prior-posterior coherence and reduce the expert's cognitive burden. This is achieved by eliciting responses about the expert's envisioned posterior judgments under various potential data outcomes and inferring the prior distribution by minimizing the discrepancies between these responses and the expected responses obtained from the posterior distribution. The feasibility and potential value of the new approach are illustrated through an application to a real trial currently underway."}, "https://arxiv.org/abs/2409.05276": {"title": "An Eigengap Ratio Test for Determining the Number of Communities in Network Data", "link": "https://arxiv.org/abs/2409.05276", "description": "arXiv:2409.05276v1 Announce Type: new \nAbstract: To characterize the community structure in network data, researchers have introduced various block-type models, including the stochastic block model, degree-corrected stochastic block model, mixed membership block model, degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. 
However, to our knowledge, existing methods for estimating the number of network communities often require model estimations or are unable to simultaneously account for network sparsity and a divergent number of communities. In this paper, we propose an eigengap-ratio based test that addresses these challenges. The test is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. We show that the proposed test statistic converges to a function of the type-I Tracy-Widom distributions under the null hypothesis, and that the test is asymptotically powerful under alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test."}, "https://arxiv.org/abs/2409.05412": {"title": "Censored Data Forecasting: Applying Tobit Exponential Smoothing with Time Aggregation", "link": "https://arxiv.org/abs/2409.05412", "description": "arXiv:2409.05412v1 Announce Type: new \nAbstract: This study introduces a novel approach to forecasting by Tobit Exponential Smoothing with time aggregation constraints. This model, a particular case of the Tobit Innovations State Space system, handles censored observed time series effectively, such as sales data, with known and potentially variable censoring levels over time. The paper provides a comprehensive analysis of the model structure, including its representation in system equations and the optimal recursive estimation of states. It also explores the benefits of time aggregation in state space systems, particularly for inventory management and demand forecasting. Through a series of case studies, the paper demonstrates the effectiveness of the model across various scenarios, including hourly and daily censoring levels. The results highlight the model's ability to produce accurate forecasts and confidence bands comparable to those from uncensored models, even under severe censoring conditions. The study further discusses the implications for inventory policy, emphasizing the importance of avoiding spiral-down effects in demand estimation. The paper concludes by showcasing the superiority of the proposed model over standard methods, particularly in reducing lost sales and excess stock, thereby optimizing inventory costs. This research contributes to the field of forecasting by offering a robust model that effectively addresses the challenges of censored data and time aggregation."}, "https://arxiv.org/abs/2409.05632": {"title": "Efficient nonparametric estimators of discrimination measures with censored survival data", "link": "https://arxiv.org/abs/2409.05632", "description": "arXiv:2409.05632v1 Announce Type: new \nAbstract: Discrimination measures such as concordance statistics (e.g. the c-index or the concordance probability) and the cumulative-dynamic time-dependent area under the ROC-curve (AUC) are widely used in the medical literature for evaluating the predictive accuracy of a scoring rule which relates a set of prognostic markers to the risk of experiencing a particular event. Often the scoring rule being evaluated in terms of discriminatory ability is the linear predictor of a survival regression model such as the Cox proportional hazards model.
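As a simplified caricature of the Tobit exponential smoothing abstract above (not the paper's innovations state-space model), the sketch below imputes censored sales with a truncated-normal conditional mean before the usual simple exponential smoothing update; the known noise scale and fixed stock level are assumptions of the sketch.

```python
# Simplified caricature of Tobit-style exponential smoothing: when sales are
# capped at the stock level, replace the observation with the truncated-normal
# conditional mean E[demand | demand >= cap] before updating the level.
import numpy as np
from scipy.stats import norm

def tobit_ses(y, censored, level0, sigma, alpha=0.3):
    """y: observed sales; censored[t] is True if y[t] was capped at the stock level."""
    level, fitted = level0, []
    for yt, c in zip(y, censored):
        fitted.append(level)
        if c:
            # inverse-Mills adjustment: expected demand given demand >= observed cap
            z = (yt - level) / sigma
            yt_adj = level + sigma * norm.pdf(z) / max(norm.sf(z), 1e-12)
        else:
            yt_adj = yt
        level = level + alpha * (yt_adj - level)   # standard SES update on adjusted data
    return np.array(fitted), level

rng = np.random.default_rng(6)
demand = 100 + np.cumsum(rng.normal(0.5, 2.0, 60))    # slowly drifting demand
stock = np.full(60, 110.0)                             # fixed stock / censoring level
sales = np.minimum(demand, stock)
fitted, level = tobit_ses(sales, sales >= stock, level0=100.0, sigma=5.0)
print("final level estimate:", round(level, 1), "vs final demand:", round(demand[-1], 1))
```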
This has the undesirable feature that the scoring rule depends on the censoring distribution when the model is misspecified. In this work we focus on linear scoring rules where the coefficient vector is a nonparametric estimand defined in the setting where there is no censoring. We propose so-called debiased estimators of the aforementioned discrimination measures for this class of scoring rules. The proposed estimators make efficient use of the data and minimize bias by allowing for the use of data-adaptive methods for model fitting. Moreover, the estimators do not rely on correct specification of the censoring model to produce consistent estimation. We compare the estimators to existing methods in a simulation study, and we illustrate the method by an application to a brain cancer study."}, "https://arxiv.org/abs/2409.05713": {"title": "The Surprising Robustness of Partial Least Squares", "link": "https://arxiv.org/abs/2409.05713", "description": "arXiv:2409.05713v1 Announce Type: new \nAbstract: Partial least squares (PLS) is a simple factorisation method that works well with high dimensional problems in which the number of observations is limited given the number of independent variables. In this article, we show that PLS can perform better than ordinary least squares (OLS), least absolute shrinkage and selection operator (LASSO) and ridge regression in forecasting quarterly gross domestic product (GDP) growth, covering the period from 2000 to 2023. In fact, through dimension reduction, PLS proved to be effective in lowering the out-of-sample forecasting error, specially since 2020. For the period 2000-2019, the four methods produce similar results, suggesting that PLS is a valid regularisation technique like LASSO or ridge."}, "https://arxiv.org/abs/2409.04500": {"title": "Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm", "link": "https://arxiv.org/abs/2409.04500", "description": "arXiv:2409.04500v1 Announce Type: cross \nAbstract: Estimating the effect of treatments from natural experiments, where treatments are pre-assigned, is an important and well-studied problem. We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit. Surprisingly, applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy. To address this, we create a benchmark to evaluate estimator accuracy using synthetic outcomes, whose design was guided by domain experts. The benchmark extensively explores performance as real world conditions like sample size, treatment correlation, and propensity score accuracy vary. Based on our benchmark, we observe that the class of doubly robust treatment effect estimators, which are based on simple and intuitive regression adjustment, generally outperform other more complicated estimators by orders of magnitude. To better support our theoretical understanding of doubly robust estimators, we derive a closed form expression for the variance of any such estimator that uses dataset splitting to obtain an unbiased estimate. This expression motivates the design of a new doubly robust estimator that uses a novel loss function when fitting functions for regression adjustment. 
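The PLS comparison above is easy to mimic on synthetic factor-structured data; the sketch below (not the paper's GDP dataset, regressors, or tuning choices) runs an expanding-window out-of-sample comparison of PLS against OLS, LASSO, and ridge when predictors are many and collinear.

```python
# Illustration only (synthetic data): expanding-window one-step-ahead forecast
# comparison of PLS against OLS, LASSO and ridge with many collinear predictors.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression, Lasso, Ridge

rng = np.random.default_rng(7)
T, p, k = 96, 40, 3                                  # 96 "quarters", 40 indicators, 3 factors
F = rng.normal(size=(T, k))
X = F @ rng.normal(size=(k, p)) + 0.5 * rng.normal(size=(T, p))
y = F @ np.array([0.8, -0.5, 0.3]) + 0.3 * rng.normal(size=T)

models = {"PLS": PLSRegression(n_components=3), "OLS": LinearRegression(),
          "LASSO": Lasso(alpha=0.05, max_iter=5000), "Ridge": Ridge(alpha=10.0)}
errors = {name: [] for name in models}
for t in range(60, T):                               # expanding estimation window
    for name, model in models.items():
        model.fit(X[:t], y[:t])
        pred = np.ravel(model.predict(X[t:t + 1]))[0]
        errors[name].append((y[t] - pred) ** 2)
for name, errs in errors.items():
    print(f"{name:6s} out-of-sample MSE: {np.mean(errs):.3f}")
```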
We release the dataset and benchmark in a Python package; the package is built in a modular way to facilitate new datasets and estimators."}, "https://arxiv.org/abs/2409.04789": {"title": "forester: A Tree-Based AutoML Tool in R", "link": "https://arxiv.org/abs/2409.04789", "description": "arXiv:2409.04789v1 Announce Type: cross \nAbstract: The majority of automated machine learning (AutoML) solutions are developed in Python, however a large percentage of data scientists are associated with the R language. Unfortunately, there are limited R solutions available. Moreover high entry level means they are not accessible to everyone, due to required knowledge about machine learning (ML). To fill this gap, we present the forester package, which offers ease of use regardless of the user's proficiency in the area of machine learning.\n The forester is an open-source AutoML package implemented in R designed for training high-quality tree-based models on tabular data. It fully supports binary and multiclass classification, regression, and partially survival analysis tasks. With just a few functions, the user is capable of detecting issues regarding the data quality, preparing the preprocessing pipeline, training and tuning tree-based models, evaluating the results, and creating the report for further analysis."}, "https://arxiv.org/abs/2409.05036": {"title": "Estimating velocities of infectious disease spread through spatio-temporal log-Gaussian Cox point processes", "link": "https://arxiv.org/abs/2409.05036", "description": "arXiv:2409.05036v1 Announce Type: cross \nAbstract: Understanding the spread of infectious diseases such as COVID-19 is crucial for informed decision-making and resource allocation. A critical component of disease behavior is the velocity with which disease spreads, defined as the rate of change between time and space. In this paper, we propose a spatio-temporal modeling approach to determine the velocities of infectious disease spread. Our approach assumes that the locations and times of people infected can be considered as a spatio-temporal point pattern that arises as a realization of a spatio-temporal log-Gaussian Cox process. The intensity of this process is estimated using fast Bayesian inference by employing the integrated nested Laplace approximation (INLA) and the Stochastic Partial Differential Equations (SPDE) approaches. The velocity is then calculated using finite differences that approximate the derivatives of the intensity function. Finally, the directions and magnitudes of the velocities can be mapped at specific times to examine better the spread of the disease throughout the region. We demonstrate our method by analyzing COVID-19 spread in Cali, Colombia, during the 2020-2021 pandemic."}, "https://arxiv.org/abs/2409.05192": {"title": "Bellwether Trades: Characteristics of Trades influential in Predicting Future Price Movements in Markets", "link": "https://arxiv.org/abs/2409.05192", "description": "arXiv:2409.05192v1 Announce Type: cross \nAbstract: In this study, we leverage powerful non-linear machine learning methods to identify the characteristics of trades that contain valuable information. First, we demonstrate the effectiveness of our optimized neural network predictor in accurately predicting future market movements. 
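The finite-difference step in the disease-velocity abstract above can be sketched on a synthetic gridded intensity; the level-set velocity formula used here, v = -(dlambda/dt) * grad(lambda) / ||grad(lambda)||^2, is an assumption of this sketch, standing in for however the authors define velocity from the fitted log-Gaussian Cox intensity.

```python
# Schematic of the finite-difference step only: a synthetic travelling Gaussian
# bump stands in for the fitted intensity lambda(x, y, t); velocity components
# are formed from its time and space derivatives via a level-set formula.
import numpy as np

nx, ny, nt = 60, 60, 20
xs, ys, ts = np.arange(nx), np.arange(ny), np.arange(nt)
X, Y, T = np.meshgrid(xs, ys, ts, indexing="ij")
# bump moving 1 cell/step in x and 0.5 cells/step in y
lam = np.exp(-0.5 * (((X - (10 + 1.0 * T)) / 6) ** 2 + ((Y - (10 + 0.5 * T)) / 6) ** 2))

dlam_dx, dlam_dy, dlam_dt = np.gradient(lam, xs, ys, ts)
grad_sq = dlam_dx ** 2 + dlam_dy ** 2 + 1e-12
vx = -dlam_dt * dlam_dx / grad_sq          # level-set front velocity components
vy = -dlam_dt * dlam_dy / grad_sq

# evaluate at a grid point lying along the direction of motion at t = 10
i, j, k = 22, 16, 10
print("estimated velocity:", round(vx[i, j, k], 2), round(vy[i, j, k], 2), "(true drift: 1.0, 0.5)")
```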
Then, we utilize the information from this successful neural network predictor to pinpoint the individual trades within each data point (trading window) that had the most impact on the optimized neural network's prediction of future price movements. This approach helps us uncover important insights about the heterogeneity in information content provided by trades of different sizes, venues, trading contexts, and over time."}, "https://arxiv.org/abs/2409.05354": {"title": "Recursive Nested Filtering for Efficient Amortized Bayesian Experimental Design", "link": "https://arxiv.org/abs/2409.05354", "description": "arXiv:2409.05354v1 Announce Type: cross \nAbstract: This paper introduces the Inside-Out Nested Particle Filter (IO-NPF), a novel, fully recursive, algorithm for amortized sequential Bayesian experimental design in the non-exchangeable setting. We frame policy optimization as maximum likelihood estimation in a non-Markovian state-space model, achieving (at most) $\\mathcal{O}(T^2)$ computational complexity in the number of experiments. We provide theoretical convergence guarantees and introduce a backward sampling algorithm to reduce trajectory degeneracy. IO-NPF offers a practical, extensible, and provably consistent approach to sequential Bayesian experimental design, demonstrating improved efficiency over existing methods."}, "https://arxiv.org/abs/2409.05529": {"title": "Bootstrapping Estimators based on the Block Maxima Method", "link": "https://arxiv.org/abs/2409.05529", "description": "arXiv:2409.05529v1 Announce Type: cross \nAbstract: The block maxima method is a standard approach for analyzing the extremal behavior of a potentially multivariate time series. It has recently been found that the classical approach based on disjoint block maxima may be universally improved by considering sliding block maxima instead. However, the asymptotic variance formula for estimators based on sliding block maxima involves an integral over the covariance of a certain family of multivariate extreme value distributions, which makes its estimation, and inference in general, an intricate problem. As an alternative, one may rely on bootstrap approximations: we show that naive block-bootstrap approaches from time series analysis are inconsistent even in i.i.d.\\ situations, and provide a consistent alternative based on resampling circular block maxima. As a by-product, we show consistency of the classical resampling bootstrap for disjoint block maxima, and that estimators based on circular block maxima have the same asymptotic variance as their sliding block maxima counterparts. The finite sample properties are illustrated by Monte Carlo experiments, and the methods are demonstrated by a case study of precipitation extremes."}, "https://arxiv.org/abs/2409.05630": {"title": "Multilevel testing of constraints induced by structural equation modeling in fMRI effective connectivity analysis: A proof of concept", "link": "https://arxiv.org/abs/2409.05630", "description": "arXiv:2409.05630v1 Announce Type: cross \nAbstract: In functional MRI (fMRI), effective connectivity analysis aims at inferring the causal influences that brain regions exert on one another. A common method for this type of analysis is structural equation modeling (SEM). We here propose a novel method to test the validity of a given model of structural equation. 
Given a structural model in the form of a directed graph, the method extracts the set of all constraints of conditional independence induced by the absence of links between pairs of regions in the model and tests for their validity in a Bayesian framework, either individually (constraint by constraint), jointly (e.g., by gathering all constraints associated with a given missing link), or globally (i.e., all constraints associated with the structural model). This approach has two main advantages. First, it only tests what is testable from observational data and does not allow for false causal interpretation. Second, it makes it possible to test each constraint (or group of constraints) separately and, therefore, quantify to what extent each constraint (or, e.g., missing link) is respected in the data. We validate our approach using a simulation study and illustrate its potential benefits through the reanalysis of published data."}, "https://arxiv.org/abs/2409.05715": {"title": "Uniform Estimation and Inference for Nonparametric Partitioning-Based M-Estimators", "link": "https://arxiv.org/abs/2409.05715", "description": "arXiv:2409.05715v1 Announce Type: cross \nAbstract: This paper presents uniform estimation and inference theory for a large class of nonparametric partitioning-based M-estimators. The main theoretical results include: (i) uniform consistency for convex and non-convex objective functions; (ii) optimal uniform Bahadur representations; (iii) optimal uniform (and mean square) convergence rates; (iv) valid strong approximations and feasible uniform inference methods; and (v) extensions to functional transformations of underlying estimators. Uniformity is established over both the evaluation point of the nonparametric functional parameter and a Euclidean parameter indexing the class of loss functions. The results also account explicitly for the smoothness degree of the loss function (if any), and allow for a possibly non-identity (inverse) link function. We illustrate the main theoretical and methodological results with four substantive applications: quantile regression, distribution regression, $L_p$ regression, and Logistic regression; many other possibly non-smooth, nonlinear, generalized, robust M-estimation settings are covered by our theoretical results. We provide detailed comparisons with the existing literature and demonstrate substantive improvements: we achieve the best (in some cases optimal) known results under improved (in some cases minimal) requirements in terms of regularity conditions and side rate restrictions. The supplemental appendix reports other technical results that may be of independent interest."}, "https://arxiv.org/abs/2409.05729": {"title": "Efficient estimation with incomplete data via generalised ANOVA decomposition", "link": "https://arxiv.org/abs/2409.05729", "description": "arXiv:2409.05729v1 Announce Type: cross \nAbstract: We study the efficient estimation of a class of mean functionals in settings where a complete multivariate dataset is complemented by additional datasets recording subsets of the variables of interest. These datasets are allowed to have a general, in particular non-monotonic, structure. Our main contribution is to characterise the asymptotic minimal mean squared error for these problems and to introduce an estimator whose risk approximately matches this lower bound.
We show that the efficient rescaled variance can be expressed as the minimal value of a quadratic optimisation problem over a function space, thus establishing a fundamental link between these estimation problems and the theory of generalised ANOVA decompositions. Our estimation procedure uses iterated nonparametric regression to mimic an approximate influence function derived through gradient descent. We prove that this estimator is approximately normally distributed, provide an estimator of its variance and thus develop confidence intervals of asymptotically minimal width. Finally we study a more direct estimator, which can be seen as a U-statistic with a data-dependent kernel, showing that it is also efficient under stronger regularity conditions."}, "https://arxiv.org/abs/2409.05798": {"title": "Enhancing Preference-based Linear Bandits via Human Response Time", "link": "https://arxiv.org/abs/2409.05798", "description": "arXiv:2409.05798v1 Announce Type: cross \nAbstract: Binary human choice feedback is widely used in interactive preference learning for its simplicity, but it provides limited information about preference strength. To overcome this limitation, we leverage human response times, which inversely correlate with preference strength, as complementary information. Our work integrates the EZ-diffusion model, which jointly models human choices and response times, into preference-based linear bandits. We introduce a computationally efficient utility estimator that reformulates the utility estimation problem using both choices and response times as a linear regression problem. Theoretical and empirical comparisons with traditional choice-only estimators reveal that for queries with strong preferences (\"easy\" queries), choices alone provide limited information, while response times offer valuable complementary information about preference strength. As a result, incorporating response times makes easy queries more useful. We demonstrate this advantage in the fixed-budget best-arm identification problem, with simulations based on three real-world datasets, consistently showing accelerated learning when response times are incorporated."}, "https://arxiv.org/abs/2209.05399": {"title": "Principles of Statistical Inference in Online Problems", "link": "https://arxiv.org/abs/2209.05399", "description": "arXiv:2209.05399v2 Announce Type: replace \nAbstract: To investigate a dilemma of statistical and computational efficiency faced by long-run variance estimators, we propose a decomposition of kernel weights in a quadratic form and some online inference principles. These proposals allow us to characterize efficient online long-run variance estimators. Our asymptotic theory and simulations show that this principle-driven approach leads to online estimators with a uniformly lower mean squared error than all existing works. We also discuss practical enhancements such as mini-batch and automatic updates to handle fast streaming data and optimal parameters tuning. Beyond variance estimation, we consider the proposals in the context of online quantile regression, online change point detection, Markov chain Monte Carlo convergence diagnosis, and stochastic approximation. 
Substantial improvements in computational cost and finite-sample statistical properties are observed when we apply our principle-driven variance estimator to original and modified inference procedures."}, "https://arxiv.org/abs/2211.13610": {"title": "Cross-Sectional Dynamics Under Network Structure: Theory and Macroeconomic Applications", "link": "https://arxiv.org/abs/2211.13610", "description": "arXiv:2211.13610v4 Announce Type: replace \nAbstract: Many environments in economics feature a cross-section of units linked by bilateral ties. I develop a framework for studying dynamics of cross-sectional variables that exploits this network structure. The Network-VAR (NVAR) is a vector autoregression in which innovations transmit cross-sectionally via bilateral links and which can accommodate rich patterns of how network effects of higher order accumulate over time. It can be used to estimate dynamic network effects, with the network given or inferred from dynamic cross-correlations in the data. It also offers a dimensionality-reduction technique for modeling high-dimensional (cross-sectional) processes, owing to networks' ability to summarize complex relations among variables (units) by relatively few bilateral links. In a first application, consistent with an RBC economy with lagged input-output conversion, I estimate how sectoral productivity shocks transmit along supply chains and affect sectoral prices in the US economy. In a second application, I forecast monthly industrial production growth across 44 countries by assuming and estimating a network underlying the dynamics. This reduces out-of-sample mean squared errors by up to 23% relative to a factor model, consistent with an equivalence result I derive."}, "https://arxiv.org/abs/2212.10042": {"title": "Guarantees for Comprehensive Simulation Assessment of Statistical Methods", "link": "https://arxiv.org/abs/2212.10042", "description": "arXiv:2212.10042v3 Announce Type: replace \nAbstract: Simulation can evaluate a statistical method for properties such as Type I Error, FDR, or bias on a grid of hypothesized parameter values. But what about the gaps between the grid-points? Continuous Simulation Extension (CSE) is a proof-by-simulation framework which can supplement simulations with (1) confidence bands valid over regions of parameter space or (2) calibration of rejection thresholds to provide rigorous proof of strong Type I Error control. CSE extends simulation estimates at grid-points into bounds over nearby space using a model shift bound related to the Renyi divergence, which we analyze for models in exponential family or canonical GLM form. CSE can work with adaptive sampling, nuisance parameters, administrative censoring, multiple arms, multiple testing, Bayesian randomization, Bayesian decision-making, and inference algorithms of arbitrary complexity. As a case study, we calibrate for strong Type I Error control a Phase II/III Bayesian selection design with 4 unknown statistical parameters. Potential applications include calibration of new statistical procedures or streamlining regulatory review of adaptive trial designs. 
Our open-source software implementation imprint is available at https://github.com/Confirm-Solutions/imprint"}, "https://arxiv.org/abs/2409.06172": {"title": "Nonparametric Inference for Balance in Signed Networks", "link": "https://arxiv.org/abs/2409.06172", "description": "arXiv:2409.06172v1 Announce Type: new \nAbstract: In many real-world networks, relationships often go beyond simple dyadic presence or absence; they can be positive, like friendship, alliance, and mutualism, or negative, characterized by enmity, disputes, and competition. To understand the formation mechanism of such signed networks, the social balance theory sheds light on the dynamics of positive and negative connections. In particular, it characterizes the proverbs, \"a friend of my friend is my friend\" and \"an enemy of my enemy is my friend\". In this work, we propose a nonparametric inference approach for assessing empirical evidence for the balance theory in real-world signed networks. We first characterize the generating process of signed networks with node exchangeability and propose a nonparametric sparse signed graphon model. Under this model, we construct confidence intervals for the population parameters associated with balance theory and establish their theoretical validity. Our inference procedure is as computationally efficient as a simple normal approximation but offers higher-order accuracy. By applying our method, we find strong real-world evidence for balance theory in signed networks across various domains, extending its applicability beyond social psychology."}, "https://arxiv.org/abs/2409.06180": {"title": "Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach", "link": "https://arxiv.org/abs/2409.06180", "description": "arXiv:2409.06180v1 Announce Type: new \nAbstract: Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment."}, "https://arxiv.org/abs/2409.06288": {"title": "Ensemble Doubly Robust Bayesian Inference via Regression Synthesis", "link": "https://arxiv.org/abs/2409.06288", "description": "arXiv:2409.06288v1 Announce Type: new \nAbstract: The doubly robust estimator, which models both the propensity score and outcomes, is a popular approach to estimate the average treatment effect in the potential outcome setting.
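The learning-curve approach above can be illustrated generically: estimate classifier accuracy at several training-set sizes, fit an inverse power law accuracy(n) ~ a - b*n^(-c), and invert it for a target accuracy. The sketch below uses synthetic data and scikit-learn; the functional form and target are assumptions of the sketch, and the paper's data-augmentation strategy is not reproduced.

```python
# Generic learning-curve sketch: subsample at several sizes, fit an inverse
# power law to accuracy, and invert it to find n reaching a target accuracy.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X, y = make_classification(n_samples=6000, n_features=50, n_informative=10, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2000, random_state=0)

sizes = np.array([50, 100, 200, 400, 800, 1600, 3200])
acc = []
for n in sizes:
    idx = rng.choice(len(X_pool), size=n, replace=False)
    clf = LogisticRegression(max_iter=2000).fit(X_pool[idx], y_pool[idx])
    acc.append(clf.score(X_test, y_test))

def power_law(n, a, b, c):
    return a - b * n ** (-c)

(a, b, c), _ = curve_fit(power_law, sizes, acc, p0=[0.9, 1.0, 0.5], maxfev=10000)
target = 0.95 * a                                   # e.g. 95% of the estimated plateau
n_needed = (b / (a - target)) ** (1 / c)
print("fitted plateau:", round(a, 3), "| n for target accuracy:", int(np.ceil(n_needed)))
```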
The primary appeal of this estimator is its theoretical property, wherein the estimator achieves consistency as long as either the propensity score or outcomes is correctly specified. In most applications, however, both are misspecified, leading to considerable bias that cannot be checked. In this paper, we propose a Bayesian ensemble approach that synthesizes multiple models for both the propensity score and outcomes, which we call doubly robust Bayesian regression synthesis. Our approach applies Bayesian updating to the ensemble model weights that adapt at the unit level, incorporating data heterogeneity, to significantly mitigate misspecification bias. Theoretically, we show that our proposed approach is consistent regarding the estimation of both the propensity score and outcomes, ensuring that the doubly robust estimator is consistent, even if no single model is correctly specified. An efficient algorithm for posterior computation facilitates the characterization of uncertainty regarding the treatment effect. Our proposed approach is compared against standard and state-of-the-art methods through two comprehensive simulation studies, where we find that our approach is superior in all cases. An empirical study on the impact of maternal smoking on birth weight highlights the practical applicability of our proposed method."}, "https://arxiv.org/abs/2409.06413": {"title": "This is not normal! (Re-) Evaluating the lower $n$ guidelines for regression analysis", "link": "https://arxiv.org/abs/2409.06413", "description": "arXiv:2409.06413v1 Announce Type: new \nAbstract: The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound for the number of observations required for regression analysis by exploring how different distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient that either the dependent or independent variable follows a symmetric distribution for the t-values to converge to the t-distribution at much smaller sample sizes than $n=30$. This is contrary to previous guidance which suggests that the error term needs to be normally distributed for this convergence to happen at low $n$. On the other hand, if both dependent and independent variables are highly skewed, the required sample size is substantially higher. In cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n\geq30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics.
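A tiny-scale version of the kind of simulation described above: under a true zero slope, compare the empirical size of a nominal 5% t-test across sample sizes when both regressor and error are heavily skewed versus both normal. The distributions, scales, and replication count here are assumptions of the sketch, not the paper's simulation grid.

```python
# Empirical size of the nominal 5% slope t-test under the null beta_1 = 0,
# for symmetric (normal) versus heavily right-skewed (log-normal) data.
import numpy as np
from scipy import stats

def empirical_size(n, skewed, reps=10000, rng=np.random.default_rng(9)):
    rejections = 0
    for _ in range(reps):
        if skewed:
            x = rng.lognormal(0, 1.5, n)
            e = rng.lognormal(0, 1.5, n) - np.exp(1.5 ** 2 / 2)   # centred skewed error
        else:
            x, e = rng.normal(size=n), rng.normal(size=n)
        y = 1.0 + 0.0 * x + e                      # true slope is zero
        res = stats.linregress(x, y)
        rejections += res.pvalue < 0.05            # two-sided t-test on the slope
    return rejections / reps

for n in (10, 30, 100):
    print(f"n={n:4d}  size (normal): {empirical_size(n, False):.3f}"
          f"   size (both skewed): {empirical_size(n, True):.3f}")
```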
This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis."}, "https://arxiv.org/abs/2409.06654": {"title": "Estimation and Inference for Causal Functions with Multiway Clustered Data", "link": "https://arxiv.org/abs/2409.06654", "description": "arXiv:2409.06654v1 Announce Type: new \nAbstract: This paper proposes methods of estimation and uniform inference for a general class of causal functions, such as the conditional average treatment effects and the continuous treatment effects, under multiway clustering. The causal function is identified as a conditional expectation of an adjusted (Neyman-orthogonal) signal that depends on high-dimensional nuisance parameters. We propose a two-step procedure where the first step uses machine learning to estimate the high-dimensional nuisance parameters. The second step projects the estimated Neyman-orthogonal signal onto a dictionary of basis functions whose dimension grows with the sample size. For this two-step procedure, we propose both the full-sample and the multiway cross-fitting estimation approaches. A functional limit theory is derived for these estimators. To construct the uniform confidence bands, we develop a novel resampling procedure, called the multiway cluster-robust sieve score bootstrap, that extends the sieve score bootstrap (Chen and Christensen, 2018) to the novel setting with multiway clustering. Extensive numerical simulations showcase that our methods achieve desirable finite-sample behaviors. We apply the proposed methods to analyze the causal relationship between mistrust levels in Africa and the historical slave trade. Our analysis rejects the null hypothesis of uniformly zero effects and reveals heterogeneous treatment effects, with significant impacts at higher levels of trade volumes."}, "https://arxiv.org/abs/2409.06680": {"title": "Sequential stratified inference for the mean", "link": "https://arxiv.org/abs/2409.06680", "description": "arXiv:2409.06680v1 Announce Type: new \nAbstract: We develop conservative tests for the mean of a bounded population using data from a stratified sample. The sample may be drawn sequentially, with or without replacement. The tests are \"anytime valid,\" allowing optional stopping and continuation in each stratum. We call this combination of properties sequential, finite-sample, nonparametric validity. The methods express a hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. They test each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. The $P$-value of the global null hypothesis is then the maximum $P$-value of any intersection hypothesis in the union. This approach has three primary moving parts: (i) the rule for deciding which stratum to draw from next to test each intersection null, given the sample so far; (ii) the form of the TSM for each null in each stratum; and (iii) the method of combining evidence across strata. These choices interact. We examine the performance of a variety of rules with differing computational complexity. Approximately optimal methods have a prohibitive computational cost, while naive rules may be inconsistent -- they will never reject for some alternative populations, no matter how large the sample. We present a method that is statistically comparable to optimal methods in examples where optimal methods are computable, but computationally tractable for arbitrarily many strata. 
In numerical examples its expected sample size is substantially smaller than that of previous methods."}, "https://arxiv.org/abs/2409.05934": {"title": "Predicting Electricity Consumption with Random Walks on Gaussian Processes", "link": "https://arxiv.org/abs/2409.05934", "description": "arXiv:2409.05934v1 Announce Type: cross \nAbstract: We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm called \\textsc{Domino} (ranDOM walk on gaussIaN prOcesses) and present numerical experiments to support its merits."}, "https://arxiv.org/abs/2409.06157": {"title": "Causal Analysis of Shapley Values: Conditional vs", "link": "https://arxiv.org/abs/2409.06157", "description": "arXiv:2409.06157v1 Announce Type: cross \nAbstract: Shapley values, a game theoretic concept, has been one of the most popular tools for explaining Machine Learning (ML) models in recent years. Unfortunately, the two most common approaches, conditional and marginal, to calculating Shapley values can lead to different results along with some undesirable side effects when features are correlated. This in turn has led to the situation in the literature where contradictory recommendations regarding choice of an approach are provided by different authors. In this paper we aim to resolve this controversy through the use of causal arguments. We show that the differences arise from the implicit assumptions that are made within each method to deal with missing causal information. We also demonstrate that the conditional approach is fundamentally unsound from a causal perspective. This, together with previous work in [1], leads to the conclusion that the marginal approach should be preferred over the conditional one."}, "https://arxiv.org/abs/2409.06271": {"title": "A new paradigm for global sensitivity analysis", "link": "https://arxiv.org/abs/2409.06271", "description": "arXiv:2409.06271v1 Announce Type: cross \nAbstract: Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope-for instance, the analysis is limited to the output's variance and the inputs have to be mutually independent-and leads to sensitivity indices the interpretation of which is not fully clear, especially interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. 
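To make the marginal-versus-conditional distinction above concrete, the sketch below computes marginal (interventional) Shapley values by permutation sampling with a background sample, which by construction ignores dependence between features; it is a generic toy, not the implementation discussed in the paper.

```python
# Toy Monte Carlo estimator of *marginal* (interventional) Shapley values:
# features outside the coalition are drawn from a background sample, so
# feature dependence is deliberately ignored (the marginal approach).
import numpy as np

def marginal_shapley(model, x, background, n_perm=2000, rng=np.random.default_rng(10)):
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = background[rng.integers(len(background))].copy()   # start from a background draw
        prev = model(z)
        for j in order:                                        # add features one at a time
            z[j] = x[j]
            curr = model(z)
            phi[j] += curr - prev
            prev = curr
    return phi / n_perm

rng = np.random.default_rng(11)
bg = rng.normal(size=(500, 3))
bg[:, 1] = 0.9 * bg[:, 0] + 0.1 * rng.normal(size=500)          # correlated features
model = lambda z: 2.0 * z[0] + 1.0 * z[1] + 0.0 * z[2]
x = np.array([1.0, 1.0, 1.0])
phi = marginal_shapley(model, x, bg)
print("marginal Shapley values:", np.round(phi, 2))
print("f(x) - mean f(background):", round(model(x) - np.mean([model(b) for b in bg]), 2))
```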
By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.
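For context on the classical indices that the abstract links to weighted factorial effects, a pick-freeze estimator of a first-order Sobol index is sketched below. This is standard background only, not the factorial-experiment paradigm proposed in the paper; the function names and the toy model are illustrative.

# Context sketch: pick-freeze estimator of a first-order Sobol index for independent U(0,1) inputs.
import numpy as np

def sobol_first_order(f, d, i, n=100_000, rng=None):
    """Estimate S_i = Var(E[f(X) | X_i]) / Var(f(X))."""
    rng = np.random.default_rng(rng)
    x, x_prime = rng.random((n, d)), rng.random((n, d))
    x_mix = x_prime.copy()
    x_mix[:, i] = x[:, i]            # "pick" coordinate i from x, "freeze" the rest from x'
    y, y_mix = f(x), f(x_mix)
    return np.cov(y, y_mix)[0, 1] / np.var(y, ddof=1)

# Toy model: the third input is inert, so its index should be near zero.
f = lambda x: np.sin(2 * np.pi * x[:, 0]) + 0.5 * (2 * x[:, 1] - 1) ** 2
print([round(sobol_first_order(f, d=3, i=i, rng=1), 3) for i in range(3)])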
"}, "https://arxiv.org/abs/2409.06565": {"title": "Enzyme kinetic reactions as interacting particle systems: Stochastic averaging and parameter inference", "link": "https://arxiv.org/abs/2409.06565", "description": "arXiv:2409.06565v1 Announce Type: cross \nAbstract: We consider a stochastic model of multistage Michaelis--Menten (MM) type enzyme kinetic reactions describing the conversion of substrate molecules to a product through several intermediate species. The high-dimensional, multiscale nature of these reaction networks presents significant computational challenges, especially in statistical estimation of reaction rates. This difficulty is amplified when direct data on system states are unavailable, and one only has access to a random sample of product formation times. To address this, we proceed in two stages. First, under certain technical assumptions akin to those made in the Quasi-steady-state approximation (QSSA) literature, we prove two asymptotic results: a stochastic averaging principle that yields a lower-dimensional model, and a functional central limit theorem that quantifies the associated fluctuations. Next, for statistical inference of the parameters of the original MM reaction network, we develop a mathematical framework involving an interacting particle system (IPS) and prove a propagation of chaos result that allows us to write a product-form likelihood function. The novelty of the IPS-based inference method is that it does not require information about the state of the system and works with only a random sample of product formation times. We provide numerical examples to illustrate the efficacy of the theoretical results."}, "https://arxiv.org/abs/2409.07018": {"title": "Clustered Factor Analysis for Multivariate Spatial Data", "link": "https://arxiv.org/abs/2409.07018", "description": "arXiv:2409.07018v1 Announce Type: new \nAbstract: Factor analysis has been extensively used to reveal the dependence structures among multivariate variables, offering valuable insight in various fields. However, it cannot incorporate the spatial heterogeneity that is typically present in spatial data. To address this issue, we introduce an effective method specifically designed to discover the potential dependence structures in multivariate spatial data. Our approach assumes that spatial locations can be approximately divided into a finite number of clusters, with locations within the same cluster sharing similar dependence structures. By leveraging an iterative algorithm that combines spatial clustering with factor analysis, we simultaneously detect spatial clusters and estimate a unique factor model for each cluster. The proposed method is evaluated through comprehensive simulation studies, demonstrating its flexibility. In addition, we apply the proposed method to a dataset of railway station attributes in the Tokyo metropolitan area, highlighting its practical applicability and effectiveness in uncovering complex spatial dependencies."}, "https://arxiv.org/abs/2409.07087": {"title": "Testing for a Forecast Accuracy Breakdown under Long Memory", "link": "https://arxiv.org/abs/2409.07087", "description": "arXiv:2409.07087v1 Announce Type: new \nAbstract: We propose a test to detect a forecast accuracy breakdown in a long memory time series and provide theoretical and simulation evidence on the memory transfer from the time series to the forecast residuals. 
The proposed method uses a double sup-Wald test against the alternative of a structural break in the mean of an out-of-sample loss series. To address the problem of estimating the long-run variance under long memory, a robust estimator is applied. The corresponding breakpoint results from a long memory robust CUSUM test. The finite sample size and power properties of the test are derived in a Monte Carlo simulation. A monotonic power function is obtained for the fixed forecasting scheme. In our practical application, we find that the global energy crisis that began in 2021 led to a forecast break in European electricity prices, while the results for the U.S. are mixed."}, "https://arxiv.org/abs/2409.07125": {"title": "Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning", "link": "https://arxiv.org/abs/2409.07125", "description": "arXiv:2409.07125v1 Announce Type: new \nAbstract: Modeling with multi-omics data presents multiple challenges such as the high-dimensionality of the problem ($p \\gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data from the integration of two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables."}, "https://arxiv.org/abs/2409.07176": {"title": "Non-parametric estimation of transition intensities in interval censored Markov multi-state models without loops", "link": "https://arxiv.org/abs/2409.07176", "description": "arXiv:2409.07176v1 Announce Type: new \nAbstract: Panel data arises when transitions between different states are interval-censored in multi-state data. The analysis of such data using non-parametric multi-state models was not possible until recently, but is very desirable as it allows for more flexibility than its parametric counterparts. The single available result to date has some unique drawbacks. We propose a non-parametric estimator of the transition intensities for panel data using an Expectation Maximisation algorithm. The method allows for a mix of interval-censored and right-censored (exactly observed) transitions. A condition to check for the convergence of the algorithm to the non-parametric maximum likelihood estimator is given. A simulation study comparing the proposed estimator to a consistent estimator is performed, and shown to yield near identical estimates at smaller computational cost. A data set on the emergence of teeth in children is analysed. Code to perform the analyses is publicly available."}, "https://arxiv.org/abs/2409.07233": {"title": "Extended-support beta regression for $[0, 1]$ responses", "link": "https://arxiv.org/abs/2409.07233", "description": "arXiv:2409.07233v1 Announce Type: new \nAbstract: We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of $(0, 1)$. 
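The censoring construction behind the extended-support beta distribution described above can be sketched as follows. This is a rough simulation sketch assuming the four-parameter beta lives on an interval (-u, 1+u) with the same exceedance u on both sides; the parameter names (a, b, u) are illustrative, and the paper's exact parameterisation and the exponential mixing over the extra parameter are not reproduced here.

# Sketch: simulate a censored four-parameter beta with equal exceedance u around (0, 1),
# i.e. a Beta(a, b) rescaled to (-u, 1 + u) and then censored to [0, 1].
import numpy as np

def r_extended_support_beta(n, a, b, u, rng=None):
    rng = np.random.default_rng(rng)
    y = -u + (1.0 + 2.0 * u) * rng.beta(a, b, size=n)   # support (-u, 1 + u)
    return np.clip(y, 0.0, 1.0)                         # point masses appear at 0 and 1

draws = r_extended_support_beta(10_000, a=2.0, b=2.0, u=0.15, rng=0)
print((draws == 0).mean(), (draws == 1).mean())          # empirical boundary probabilities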
Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both $0$ and $1$ -- known as the heteroscedastic two-limit tobit model in the econometrics literature -- are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distributions for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models."}, "https://arxiv.org/abs/2409.07263": {"title": "Order selection in GARMA models for count time series: a Bayesian perspective", "link": "https://arxiv.org/abs/2409.07263", "description": "arXiv:2409.07263v1 Announce Type: new \nAbstract: Estimation in GARMA models has traditionally been carried out under the frequentist approach. To date, Bayesian approaches for such estimation have been relatively limited. In the context of GARMA models for count time series, Bayesian estimation achieves satisfactory results in terms of point estimation. Model selection in this context often relies on the use of information criteria. Despite its prominence in the literature, the use of information criteria for model selection in GARMA models for count time series has been shown to perform poorly in simulations, especially in terms of correctly identifying models, even under large sample sizes. In this study, we address the problem of order selection in GARMA models for count time series, adopting a Bayesian perspective through the application of the Reversible Jump Markov Chain Monte Carlo approach. Monte Carlo simulation studies are conducted to assess the finite sample performance of the developed ideas, including point and interval inference, sensitivity analysis, effects of burn-in and thinning, as well as the choice of related priors and hyperparameters. Two real-data applications are presented, one considering automobile production in Brazil and the other considering bus exportation in Brazil before and after the COVID-19 pandemic, showcasing the method's capabilities and further exploring its flexibility."}, "https://arxiv.org/abs/2409.07350": {"title": "Local Effects of Continuous Instruments without Positivity", "link": "https://arxiv.org/abs/2409.07350", "description": "arXiv:2409.07350v1 Announce Type: new \nAbstract: Instrumental variables have become a popular study design for the estimation of treatment effects in the presence of unobserved confounders. In the canonical instrumental variables design, the instrument is a binary variable, and most extant methods are tailored to this context. In many settings, however, the instrument is a continuous measure.
Standard estimation methods can be applied with continuous instruments, but they require strong assumptions regarding functional form. Moreover, while some recent work has introduced more flexible approaches for continuous instruments, these methods require an assumption known as positivity that is unlikely to hold in many applications. We derive a novel family of causal estimands using a stochastic dynamic intervention framework that considers a range of intervention distributions that are absolutely continuous with respect to the observed distribution of the instrument. These estimands focus on a specific form of local effect but do not require a positivity assumption. Next, we develop doubly robust estimators for these estimands that allow for estimation of the nuisance functions via nonparametric estimators. We use empirical process theory and sample splitting to derive asymptotic properties of the proposed estimators under weak conditions. In addition, we derive methods for profiling the principal strata as well as a method for sensitivity analysis for assessing robustness to an underlying monotonicity assumption. We evaluate our methods via simulation and demonstrate their feasibility using an application on the effectiveness of surgery for specific emergency conditions."}, "https://arxiv.org/abs/2409.07380": {"title": "Multi-source Stable Variable Importance Measure via Adversarial Machine Learning", "link": "https://arxiv.org/abs/2409.07380", "description": "arXiv:2409.07380v1 Announce Type: new \nAbstract: As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial in reliable and generalizable decision-making. In this paper, we propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single source variable importance. Numerical studies with various types of data generation setups and machine learning implementation are conducted to justify the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations."}, "https://arxiv.org/abs/2409.07391": {"title": "Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap", "link": "https://arxiv.org/abs/2409.07391", "description": "arXiv:2409.07391v1 Announce Type: new \nAbstract: To estimate the average treatment effect in real-world populations, observational studies are typically designed around real-world cohorts. However, even when study samples from these designs represent the population, unmeasured confounders can introduce bias. Sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on the strict mathematical assumptions of other existing methods. 
This article introduces a new approach that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even with limited overlap due to inclusion/exclusion criteria. Theoretical proofs and simulations show that this method provides tighter bounds than existing approaches. We also apply this method to both a trial dataset and a real-world drug effectiveness comparison dataset for practical analysis."}, "https://arxiv.org/abs/2409.07111": {"title": "Local Sequential MCMC for Data Assimilation with Applications in Geoscience", "link": "https://arxiv.org/abs/2409.07111", "description": "arXiv:2409.07111v1 Announce Type: cross \nAbstract: This paper presents a new data assimilation (DA) scheme based on a sequential Markov Chain Monte Carlo (SMCMC) DA technique [Ruzayqat et al. 2024] which is provably convergent and has been recently used for filtering, particularly for high-dimensional non-linear, and potentially, non-Gaussian state-space models. Unlike particle filters, which can be considered exact methods and can be used for filtering non-linear, non-Gaussian models, SMCMC does not assign weights to the samples/particles, and therefore, the method does not suffer from the issue of weight-degeneracy when a relatively small number of samples is used. We design a localization approach within the SMCMC framework that focuses on regions where observations are located and restricts the transition densities included in the filtering distribution of the state to these regions. This immensely reduces the effective degrees of freedom and thus improves efficiency. We test the new technique on a high-dimensional ($d \\sim 10^4 - 10^5$) linear Gaussian model and on non-linear shallow water models with Gaussian noise, using real and synthetic observations. For two of the numerical examples, the observations mimic the data generated by the Surface Water and Ocean Topography (SWOT) mission led by NASA: a swath of ocean height observations whose location changes at every assimilation time step. We also use a set of real ocean drifter observations in which the drifters move according to the ocean kinematics and are assumed to have uncertain locations at the time of assimilation. We show that when higher accuracy is required, the proposed algorithm is superior to competing ensemble methods and the original SMCMC filter in terms of efficiency and accuracy."}, "https://arxiv.org/abs/2409.07389": {"title": "Dynamic Bayesian Networks, Elicitation and Data Embedding for Secure Environments", "link": "https://arxiv.org/abs/2409.07389", "description": "arXiv:2409.07389v1 Announce Type: cross \nAbstract: Serious crime modelling typically needs to be undertaken securely behind a firewall where police knowledge and capabilities can remain undisclosed. Data informing an ongoing incident is often sparse, with a large proportion of relevant data only coming to light after the incident culminates or after police intervene -- by which point it is too late to make use of the data to aid real-time decision making for the incident in question. Much of the data that is available to police to support real-time decision making is highly confidential so cannot be shared with academics, and is therefore missing to them. In this paper, we describe the development of a formal protocol where a graphical model is used as a framework for securely translating a model designed by an academic team to a model for use by a police team.
We then show, for the first time, how libraries of these models can be built and used for real-time decision support to circumvent the challenges of data missingness and tardiness seen in such a secure environment. The parallel development described by this protocol ensures that any sensitive information collected by police, and missing to academics, remains secured behind a firewall. The protocol nevertheless guides police so that they are able to combine the typically incomplete data streams that are open source with their more sensitive information in a formal and justifiable way. We illustrate the application of this protocol by describing how a new entry - a suspected vehicle attack - can be embedded into such a police library of criminal plots."}, "https://arxiv.org/abs/2211.04027": {"title": "Bootstraps for Dynamic Panel Threshold Models", "link": "https://arxiv.org/abs/2211.04027", "description": "arXiv:2211.04027v3 Announce Type: replace \nAbstract: This paper develops valid bootstrap inference methods for the dynamic short panel threshold regression. We demonstrate that the standard nonparametric bootstrap is inconsistent for the first-differenced generalized method of moments (GMM) estimator. The inconsistency is due to an $n^{1/4}$-consistent non-normal asymptotic distribution for the threshold estimate when the parameter resides within the continuity region of the parameter space. It stems from the rank deficiency of the approximate Jacobian of the sample moment conditions on the continuity region. To address this, we propose a grid bootstrap to construct confidence intervals of the threshold, a residual bootstrap to construct confidence intervals of the coefficients, and a bootstrap for testing continuity. They are shown to be valid under uncertain continuity, while the grid bootstrap is additionally shown to be uniformly valid. A set of Monte Carlo experiments demonstrate that the proposed bootstraps perform well in the finite samples and improve upon the standard nonparametric bootstrap."}, "https://arxiv.org/abs/2302.02468": {"title": "Circular and Spherical Projected Cauchy Distributions: A Novel Framework for Circular and Directional Data Modeling", "link": "https://arxiv.org/abs/2302.02468", "description": "arXiv:2302.02468v4 Announce Type: replace \nAbstract: We introduce a novel family of projected distributions on the circle and the sphere, namely the circular and spherical projected Cauchy distributions, as promising alternatives for modelling circular and spherical data. The circular distribution encompasses the wrapped Cauchy distribution as a special case, while featuring a more convenient parameterisation. We also propose a generalised wrapped Cauchy distribution that includes an extra parameter, enhancing the fit of the distribution. In the spherical context, we impose two conditions on the scatter matrix of the Cauchy distribution, resulting in an elliptically symmetric distribution. Our projected distributions exhibit attractive properties, such as a closed-form normalising constant and straightforward random value generation. The distribution parameters can be estimated using maximum likelihood, and we assess their bias through numerical studies. 
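The "straightforward random value generation" noted in the projected Cauchy abstract above follows the usual projected-distribution recipe: draw a bivariate Cauchy vector and keep only its direction. A rough sketch is given below, with an illustrative location/scatter parameterisation (mu, Sigma) rather than the paper's, and using the standard representation of the multivariate Cauchy as a normal scale mixture.

# Sketch: sample angles from a circular projected Cauchy distribution by normalising
# bivariate Cauchy draws. mu and Sigma are illustrative parameters, not the paper's.
import numpy as np

def r_projected_cauchy(n, mu, Sigma, rng=None):
    rng = np.random.default_rng(rng)
    z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    w = rng.chisquare(1, size=n)                 # bivariate Cauchy = mu + z / sqrt(chi2_1)
    x = np.asarray(mu) + z / np.sqrt(w)[:, None]
    return np.arctan2(x[:, 1], x[:, 0])          # keep only the direction, in (-pi, pi]

theta = r_projected_cauchy(5, mu=[2.0, 0.5], Sigma=np.eye(2), rng=0)
print(theta)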
Further, we compare our proposed distributions to existing models with real datasets, demonstrating equal or superior fitting both with and without covariates."}, "https://arxiv.org/abs/2302.14423": {"title": "The First-stage F Test with Many Weak Instruments", "link": "https://arxiv.org/abs/2302.14423", "description": "arXiv:2302.14423v2 Announce Type: replace \nAbstract: A widely adopted approach for detecting weak instruments is to use the first-stage $F$ statistic. While this method was developed with a fixed number of instruments, its performance with many instruments remains insufficiently explored. We show that the first-stage $F$ test exhibits distorted sizes for detecting many weak instruments, regardless of the choice of pretested estimators or Wald tests. These distortions occur due to the inadequate approximation provided by classical noncentral Chi-squared distributions. As a byproduct of our main result, we present an alternative approach to pre-test many weak instruments with the corrected first-stage $F$ statistic. An empirical illustration with Angrist and Krueger (1991)'s returns to education data confirms its usefulness."}, "https://arxiv.org/abs/2303.10525": {"title": "Robustifying likelihoods by optimistically re-weighting data", "link": "https://arxiv.org/abs/2303.10525", "description": "arXiv:2303.10525v2 Announce Type: replace \nAbstract: Likelihood-based inferences have been remarkably successful in wide-spanning application areas. However, even after due diligence in selecting a good model for the data at hand, there is inevitably some amount of model misspecification: outliers, data contamination or inappropriate parametric assumptions such as Gaussianity mean that most models are at best rough approximations of reality. A significant practical concern is that for certain inferences, even small amounts of model misspecification may have a substantial impact; a problem we refer to as brittleness. This article attempts to address the brittleness problem in likelihood-based inferences by choosing the most model-friendly data generating process in a distance-based neighborhood of the empirical measure. This leads to a new Optimistically Weighted Likelihood (OWL), which robustifies the original likelihood by formally accounting for a small amount of model misspecification. Focusing on total variation (TV) neighborhoods, we study theoretical properties, develop estimation algorithms and illustrate the methodology in applications to mixture models and regression."}, "https://arxiv.org/abs/2211.15769": {"title": "Graphical models for infinite measures with applications to extremes", "link": "https://arxiv.org/abs/2211.15769", "description": "arXiv:2211.15769v2 Announce Type: replace-cross \nAbstract: Conditional independence and graphical models are well studied for probability distributions on product spaces. We propose a new notion of conditional independence for any measure $\\Lambda$ on the punctured Euclidean space $\\mathbb R^d\\setminus \\{0\\}$ that explodes at the origin. The importance of such measures stems from their connection to infinitely divisible and max-infinitely divisible distributions, where they appear as L\\'evy measures and exponent measures, respectively. We characterize independence and conditional independence for $\\Lambda$ in various ways through kernels and factorization of a modified density, including a Hammersley-Clifford type theorem for undirected graphical models.
As opposed to the classical conditional independence, our notion is intimately connected to the support of the measure $\\Lambda$. Our general theory unifies and extends recent approaches to graphical modeling in the fields of extreme value analysis and L\\'evy processes. Our results for the corresponding undirected and directed graphical models lay the foundation for new statistical methodology in these areas."}, "https://arxiv.org/abs/2409.07559": {"title": "Spatial Deep Convolutional Neural Networks", "link": "https://arxiv.org/abs/2409.07559", "description": "arXiv:2409.07559v1 Announce Type: new \nAbstract: Spatial prediction problems often use Gaussian process models, which can be computationally burdensome in high dimensions. Specification of an appropriate covariance function for the model can be challenging when complex non-stationarities exist. Recent work has shown that pre-computed spatial basis functions and a feed-forward neural network can capture complex spatial dependence structures while remaining computationally efficient. This paper builds on this literature by tailoring spatial basis functions for use in convolutional neural networks. Through both simulated and real data, we demonstrate that this approach yields more accurate spatial predictions than existing methods. Uncertainty quantification is also considered."}, "https://arxiv.org/abs/2409.07568": {"title": "Debiased high-dimensional regression calibration for errors-in-variables log-contrast models", "link": "https://arxiv.org/abs/2409.07568", "description": "arXiv:2409.07568v1 Announce Type: new \nAbstract: Motivated by the challenges in analyzing gut microbiome and metagenomic data, this work aims to tackle the issue of measurement errors in high-dimensional regression models that involve compositional covariates. This paper marks a pioneering effort in conducting statistical inference on high-dimensional compositional data affected by mismeasured or contaminated data. We introduce a calibration approach tailored for the linear log-contrast model. Under relatively lenient conditions regarding the sparsity level of the parameter, we have established the asymptotic normality of the estimator for inference. Numerical experiments and an application in microbiome study have demonstrated the efficacy of our high-dimensional calibration strategy in minimizing bias and achieving the expected coverage rates for confidence intervals. Moreover, the potential application of our proposed methodology extends well beyond compositional data, suggesting its adaptability for a wide range of research contexts."}, "https://arxiv.org/abs/2409.07617": {"title": "Determining number of factors under stability considerations", "link": "https://arxiv.org/abs/2409.07617", "description": "arXiv:2409.07617v1 Announce Type: new \nAbstract: This paper proposes a novel method for determining the number of factors in linear factor models under stability considerations. An instability measure is proposed based on the principal angle between the estimated loading spaces obtained by data splitting. Based on this measure, criteria for determining the number of factors are proposed and shown to be consistent. This consistency is obtained using results from random matrix theory, especially the complete delocalization of non-outlier eigenvectors. 
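The principal-angle instability measure described in the factor-number abstract above is simple to compute. Below is a rough sketch under simplifying assumptions (PCA-based loading estimates, a single random row split, instability taken as the largest principal angle); this is not the paper's exact criterion, and the function names are illustrative.

# Sketch: instability of a k-factor loading space under data splitting, measured by the
# largest principal angle between the two estimated column spaces.
import numpy as np

def loading_space(X, k):
    # Orthonormal basis of the top-k principal-component loading space of X (n x p).
    _, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return vt[:k].T                                   # p x k, orthonormal columns

def split_instability(X, k, rng=None):
    rng = np.random.default_rng(rng)
    idx = rng.permutation(X.shape[0])
    half = X.shape[0] // 2
    Q1 = loading_space(X[idx[:half]], k)
    Q2 = loading_space(X[idx[half:]], k)
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)    # cosines of the principal angles
    return np.arccos(np.clip(s.min(), -1.0, 1.0))     # largest principal angle

# Toy data with 3 genuine factors: the instability typically jumps once k exceeds 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 30)) + 0.3 * rng.normal(size=(400, 30))
print([round(split_instability(X, k, rng=1), 3) for k in range(1, 6)])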
The advantage of the proposed methods over the existing ones is shown via weaker asymptotic requirements for consistency, simulation studies and a real data example."}, "https://arxiv.org/abs/2409.07738": {"title": "A model-based approach for clustering binned data", "link": "https://arxiv.org/abs/2409.07738", "description": "arXiv:2409.07738v1 Announce Type: new \nAbstract: Binned data often appears in different fields of research, and it is generated by summarizing the original data into a sequence of pairs of bins (or their midpoints) and frequencies. There may be different reasons to provide only this summary, but, more importantly, it is necessary to be able to perform statistical analyses based only on it. We present a Bayesian nonparametric model for clustering applicable to binned data. Clusters are modeled via random partitions, and within them a model-based approach is assumed. Inferences are performed by a Markov chain Monte Carlo method and the complete proposal is tested using simulated and real data. Having particular interest in studying marine populations, we analyze samples of Lobatus (Strombus) gigas' lengths and find the presence of up to three cohorts throughout the year."}, "https://arxiv.org/abs/2409.07745": {"title": "Generalized Independence Test for Modern Data", "link": "https://arxiv.org/abs/2409.07745", "description": "arXiv:2409.07745v1 Announce Type: new \nAbstract: The test of independence is a crucial component of modern data analysis. However, traditional methods often struggle with the complex dependency structures found in high-dimensional data. To overcome this challenge, we introduce a novel test statistic that captures intricate relationships using similarity and dissimilarity information derived from the data. The statistic exhibits strong power across a broad range of alternatives for high-dimensional data, as demonstrated in extensive simulation studies. Under mild conditions, we show that the new test statistic converges to the $\\chi^2_4$ distribution under the permutation null distribution, ensuring straightforward type I error control. Furthermore, our research advances the moment method in proving the joint asymptotic normality of multiple double-indexed permutation statistics. We showcase the practical utility of this new test with an application to the Genotype-Tissue Expression dataset, where it effectively measures associations between human tissues."}, "https://arxiv.org/abs/2409.07795": {"title": "Robust and efficient estimation in the presence of a randomly censored covariate", "link": "https://arxiv.org/abs/2409.07795", "description": "arXiv:2409.07795v1 Announce Type: new \nAbstract: In Huntington's disease research, a current goal is to understand how symptoms change prior to a clinical diagnosis. Statistically, this entails modeling symptom severity as a function of the covariate 'time until diagnosis', which is often heavily right-censored in observational studies. Existing estimators that handle right-censored covariates have varying statistical efficiency and robustness to misspecified models for nuisance distributions (those of the censored covariate and censoring variable). On one extreme, complete case estimation, which utilizes uncensored data only, is free of nuisance distribution models but discards informative censored observations. On the other extreme, maximum likelihood estimation is maximally efficient but inconsistent when the covariate's distribution is misspecified.
We propose a semiparametric estimator that is robust and efficient. When the nuisance distributions are modeled parametrically, the estimator is doubly robust, i.e., consistent if at least one distribution is correctly specified, and semiparametric efficient if both models are correctly specified. When the nuisance distributions are estimated via nonparametric or machine learning methods, the estimator is consistent and semiparametric efficient. We show empirically that the proposed estimator, implemented in the R package sparcc, has its claimed properties, and we apply it to study Huntington's disease symptom trajectories using data from the Enroll-HD study."}, "https://arxiv.org/abs/2409.07859": {"title": "Bootstrap Adaptive Lasso Solution Path Unit Root Tests", "link": "https://arxiv.org/abs/2409.07859", "description": "arXiv:2409.07859v1 Announce Type: new \nAbstract: We propose sieve wild bootstrap analogues to the adaptive Lasso solution path unit root tests of Arnold and Reinschl\\\"ussel (2024) arXiv:2404.06205 to improve finite sample properties and extend their applicability to a generalised framework, allowing for non-stationary volatility. Numerical evidence shows the bootstrap to improve the tests' precision for error processes that promote spurious rejections of the unit root null, depending on the detrending procedure. The bootstrap mitigates finite-sample size distortions and restores asymptotically valid inference when the data features time-varying unconditional variance. We apply the bootstrap tests to real residential property prices of the top six Eurozone economies and find evidence of stationarity to be period-specific, supporting the conjecture that exuberance in the housing market characterises the development of Euro-era residential property prices in the recent past."}, "https://arxiv.org/abs/2409.07881": {"title": "Cellwise outlier detection in heterogeneous populations", "link": "https://arxiv.org/abs/2409.07881", "description": "arXiv:2409.07881v1 Announce Type: new \nAbstract: Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations can encompass valuable information in some observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The proposal is estimated via an Expectation-Maximization (EM) algorithm with an additional step for flagging the contaminated cells of a data matrix and then imputing -- instead of discarding -- them before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real data sets."}, "https://arxiv.org/abs/2409.07917": {"title": "Multiple tests for restricted mean time lost with competing risks data", "link": "https://arxiv.org/abs/2409.07917", "description": "arXiv:2409.07917v1 Announce Type: new \nAbstract: Easy-to-interpret effect estimands are highly desirable in survival analysis. In the competing risks framework, one good candidate is the restricted mean time lost (RMTL). 
It is defined as the area under the cumulative incidence function up to a prespecified time point and, thus, it summarizes the cumulative incidence function into a meaningful estimand. While existing RMTL-based tests are limited to two-sample comparisons and mostly to two event types, we aim to develop general contrast tests for factorial designs and an arbitrary number of event types based on a Wald-type test statistic. Furthermore, we avoid the often-made, rather restrictive continuity assumption on the event time distribution. This allows for ties in the data, which often occur in practical applications, e.g., when event times are measured in whole days. In addition, we develop more reliable tests for RMTL comparisons that are based on a permutation approach to improve the small sample performance. In a second step, multiple tests for RMTL comparisons are developed to test several null hypotheses simultaneously. Here, we incorporate the asymptotically exact dependence structure between the local test statistics to gain more power. The small sample performance of the proposed testing procedures is analyzed in simulations and finally illustrated by analyzing a real data example about leukemia patients who underwent bone marrow transplantation."}, "https://arxiv.org/abs/2409.07956": {"title": "Community detection in multi-layer networks by regularized debiased spectral clustering", "link": "https://arxiv.org/abs/2409.07956", "description": "arXiv:2409.07956v1 Announce Type: new \nAbstract: Community detection is a crucial problem in the analysis of multi-layer networks. In this work, we introduce a new method, called regularized debiased sum of squared adjacency matrices (RDSoS), to detect latent communities in multi-layer networks. RDSoS is developed based on a novel regularized Laplacian matrix that regularizes the debiased sum of squared adjacency matrices. In contrast, the classical regularized Laplacian matrix typically regularizes the adjacency matrix of a single-layer network. Therefore, at a high level, our regularized Laplacian matrix extends the classical regularized Laplacian matrix to multi-layer networks. We establish the consistency property of RDSoS under the multi-layer stochastic block model (MLSBM) and further extend RDSoS and its theoretical results to the degree-corrected version of the MLSBM model. The effectiveness of the proposed methods is evaluated and demonstrated through synthetic and real datasets."}, "https://arxiv.org/abs/2409.08112": {"title": "Review of Recent Advances in Gaussian Process Regression Methods", "link": "https://arxiv.org/abs/2409.08112", "description": "arXiv:2409.08112v1 Announce Type: new \nAbstract: Gaussian process (GP) methods have been widely studied recently, especially for large-scale systems with big data and even more extreme cases when data is sparse. Key advantages of these methods consist in: 1) the ability to provide inherent ways to assess the impact of uncertainties (especially in the data, and environment) on the solutions, 2) have efficient factorisation based implementations and 3) can be implemented easily in distributed manners and hence provide scalable solutions. This paper reviews the recently developed key factorised GP methods such as the hierarchical off-diagonal low-rank approximation methods and GP with Kronecker structures. 
An example illustrates the performance of these methods with respect to accuracy and computational complexity."}, "https://arxiv.org/abs/2409.08158": {"title": "Trends and biases in the social cost of carbon", "link": "https://arxiv.org/abs/2409.08158", "description": "arXiv:2409.08158v1 Announce Type: new \nAbstract: An updated and extended meta-analysis confirms that the central estimate of the social cost of carbon is around $200/tC with a large, right-skewed uncertainty and trending up. The pure rate of time preference and the inverse of the elasticity of intertemporal substitution are key assumptions, the total impact of 2.5K warming less so. The social cost of carbon is much higher if climate change is assumed to affect economic growth rather than the level of output and welfare. The literature is dominated by a relatively small network of authors, based in a few countries. Publication and citation bias have pushed the social cost of carbon up."}, "https://arxiv.org/abs/2409.07679": {"title": "Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning", "link": "https://arxiv.org/abs/2409.07679", "description": "arXiv:2409.07679v1 Announce Type: cross \nAbstract: We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strength of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the \"notorious\" issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strength of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase."}, "https://arxiv.org/abs/2409.07874": {"title": "Fused $L_{1/2}$ prior for large scale linear inverse problem with Gibbs bouncy particle sampler", "link": "https://arxiv.org/abs/2409.07874", "description": "arXiv:2409.07874v1 Announce Type: cross \nAbstract: In this paper, we study a Bayesian approach for solving large-scale linear inverse problems arising in various scientific and engineering fields. We propose a fused $L_{1/2}$ prior with edge-preserving and sparsity-promoting properties and show that it can be formulated as a Gaussian mixture Markov random field. Since the density function of this family of priors is neither log-concave nor Lipschitz, gradient-based Markov chain Monte Carlo methods cannot be applied to sample the posterior. Thus, we present a Gibbs sampler in which all the conditional posteriors involved have closed-form expressions. The Gibbs sampler works well for small-size problems, but it is computationally intractable for large-scale problems due to the need to sample from high-dimensional Gaussian distributions.
To reduce the computational burden, we construct a Gibbs bouncy particle sampler (Gibbs-BPS) based on a piecewise deterministic Markov process. This new sampler combines elements of the Gibbs sampler with the bouncy particle sampler, and its computational complexity is an order of magnitude smaller. We show that the new sampler converges to the target distribution. With computed tomography examples, we demonstrate that the proposed method shows competitive performance with existing popular Bayesian methods and is highly efficient in large-scale problems."}, "https://arxiv.org/abs/2409.07879": {"title": "Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series", "link": "https://arxiv.org/abs/2409.07879", "description": "arXiv:2409.07879v1 Announce Type: cross \nAbstract: Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14\\%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis."}, "https://arxiv.org/abs/2409.08059": {"title": "Causal inference and racial bias in policing: New estimands and the importance of mobility data", "link": "https://arxiv.org/abs/2409.08059", "description": "arXiv:2409.08059v1 Announce Type: cross \nAbstract: Studying racial bias in policing is a critically important problem, but one that comes with a number of inherent difficulties due to the nature of the available data. In this manuscript we tackle multiple key issues in the causal analysis of racial bias in policing. First, we formalize race and place policing, the idea that individuals of one race are policed differently when they are in neighborhoods primarily made up of individuals of other races. We develop an estimand to study this question rigorously, show the assumptions necessary for causal identification, and develop sensitivity analyses to assess robustness to violations of key assumptions. Additionally, we investigate difficulties with existing estimands targeting racial bias in policing. We show for these estimands, and the estimands developed in this manuscript, that estimation can benefit from incorporating mobility data into analyses. We apply these ideas to a study in New York City, where we find a large amount of racial bias, as well as race and place policing, and that these findings are robust to large violations of untestable assumptions.
We additionally show that mobility data can have a substantial impact on the resulting estimates, suggesting it should be used whenever possible in subsequent studies."}, "https://arxiv.org/abs/2409.08201": {"title": "Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study", "link": "https://arxiv.org/abs/2409.08201", "description": "arXiv:2409.08201v1 Announce Type: cross \nAbstract: The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the distribution of test statistics for the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. All results from numerical experiments were obtained from a synthetic dataset generated using the Smirnov transform (Inverse Transform Sampling) and replicated multiple times through Monte Carlo simulation. The proposed methods can be used to test the two-sample problem with right-censored observations. All necessary materials (source code, example scripts, dataset, and samples) are available on GitHub and Hugging Face."}, "https://arxiv.org/abs/2211.01727": {"title": "Bayesian inference of vector autoregressions with tensor decompositions", "link": "https://arxiv.org/abs/2211.01727", "description": "arXiv:2211.01727v5 Announce Type: replace \nAbstract: Vector autoregressions (VARs) are popular models for analyzing multivariate economic time series. However, VARs can be over-parameterized if the numbers of variables and lags are moderately large. Tensor VAR, a recent solution to over-parameterization, treats the coefficient matrix as a third-order tensor and estimates the corresponding tensor decomposition to achieve parsimony. In this paper, we employ the Tensor VAR structure with a CANDECOMP/PARAFAC (CP) decomposition and conduct Bayesian inference to estimate parameters. Firstly, we determine the rank by imposing the Multiplicative Gamma Prior on the tensor margins, i.e., the elements in the decomposition, and accelerate the computation with an adaptive inferential scheme. Secondly, to obtain interpretable margins, we propose an interweaving algorithm to improve the mixing of margins and identify the margins using a post-processing procedure. In an application to US macroeconomic data, our models outperform standard VARs in point and density forecasting and yield a summary of the dynamics of the US economy."}, "https://arxiv.org/abs/2301.11711": {"title": "ADDIS-Graphs for online error control with application to platform trials", "link": "https://arxiv.org/abs/2301.11711", "description": "arXiv:2301.11711v3 Announce Type: replace \nAbstract: In contemporary research, online error control is often required, where an error criterion, such as familywise error rate (FWER) or false discovery rate (FDR), shall remain under control while testing an a priori unbounded sequence of hypotheses. The existing online literature mainly considered large-scale designs and constructed blackbox-like algorithms for these.
However, smaller studies, such as platform trials, require high flexibility and easy interpretability to take study objectives into account and facilitate communication. Another challenge in platform trials is that, due to the shared control arm, some of the p-values are dependent and significance levels need to be prespecified before the decisions for all the past treatments are available. We propose ADDIS-Graphs with FWER control that, due to their graphical structure, perfectly adapt to such settings and provably uniformly improve the state-of-the-art method. We introduce several extensions of these ADDIS-Graphs, including the incorporation of information about the joint distribution of the p-values and a version for FDR control."}, "https://arxiv.org/abs/2306.07017": {"title": "Multivariate extensions of the Multilevel Best Linear Unbiased Estimator for ensemble-variational data assimilation", "link": "https://arxiv.org/abs/2306.07017", "description": "arXiv:2306.07017v2 Announce Type: replace \nAbstract: Multilevel estimators aim at reducing the variance of Monte Carlo statistical estimators, by combining samples generated with simulators of different costs and accuracies. In particular, the recent work of Schaden and Ullmann (2020) on the multilevel best linear unbiased estimator (MLBLUE) introduces a framework unifying several multilevel and multifidelity techniques. The MLBLUE is reintroduced here using a variance minimization approach rather than the regression approach of Schaden and Ullmann. We then discuss possible extensions of the scalar MLBLUE to a multidimensional setting, i.e., from the expectation of scalar random variables to the expectation of random vectors. Several estimators of increasing complexity are proposed: a) multilevel estimators with scalar weights, b) with element-wise weights, c) with spectral weights and d) with general matrix weights. The computational cost of each method is discussed. We finally extend the MLBLUE to the estimation of second-order moments in the multidimensional case, i.e., to the estimation of covariance matrices. The multilevel estimators proposed are d) a multilevel estimator with scalar weights and e) with element-wise weights. In large-dimension applications such as data assimilation for geosciences, the latter estimator is computationally unaffordable. As a remedy, we also propose f) a multilevel covariance matrix estimator with optimal multilevel localization, inspired by the optimal localization theory of M\\'en\\'etrier and Aulign\\'e (2015). Some practical details on weighted MLMC estimators of covariance matrices are given in the appendix."}, "https://arxiv.org/abs/2308.01747": {"title": "Fusion regression methods with repeated functional data", "link": "https://arxiv.org/abs/2308.01747", "description": "arXiv:2308.01747v3 Announce Type: replace \nAbstract: Linear regression and classification methods with repeated functional data are considered. For each statistical unit in the sample, a real-valued parameter is observed over time under different conditions related by some neighborhood structure (spatial, group, etc.). Two regression methods based on fusion penalties are proposed to consider the dependence induced by this structure. These methods aim to obtain parsimonious coefficient regression functions, by determining if close conditions are associated with common regression coefficient functions. The first method is a generalization to functional data of the variable fusion methodology based on the 1-nearest neighbor.
The second one relies on the group fusion lasso penalty which assumes some grouping structure of conditions and allows for homogeneity among the regression coefficient functions within groups. Numerical simulations and an application of electroencephalography data are presented."}, "https://arxiv.org/abs/2310.11741": {"title": "Graph of Graphs: From Nodes to Supernodes in Graphical Models", "link": "https://arxiv.org/abs/2310.11741", "description": "arXiv:2310.11741v2 Announce Type: replace \nAbstract: High-dimensional data analysis typically focuses on low-dimensional structure, often to aid interpretation and computational efficiency. Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data by representing variables as nodes and dependencies as edges. Inference is often focused on individual edges in the latent graph. Nonetheless, there is increasing interest in determining more complex structures, such as communities of nodes, for multiple reasons, including more effective information retrieval and better interpretability. In this work, we propose a hierarchical graphical model where we first cluster nodes and then, at the higher level, investigate the relationships among groups of nodes. Specifically, nodes are partitioned into supernodes with a data-coherent size-biased tessellation prior which combines ideas from Bayesian nonparametrics and Voronoi tessellations. This construct also allows accounting for the dependence of nodes within supernodes. At the higher level, dependence structure among supernodes is modeled through a Gaussian graphical model, where the focus of inference is on superedges. We provide theoretical justification for our modeling choices. We design tailored Markov chain Monte Carlo schemes, which also enable parallel computations. We demonstrate the effectiveness of our approach for large-scale structure learning in simulations and a transcriptomics application."}, "https://arxiv.org/abs/2312.16160": {"title": "SymmPI: Predictive Inference for Data with Group Symmetries", "link": "https://arxiv.org/abs/2312.16160", "description": "arXiv:2312.16160v3 Announce Type: replace \nAbstract: Quantifying the uncertainty of predictions is a core problem in modern statistics. Methods for predictive inference have been developed under a variety of assumptions, often -- for instance, in standard conformal prediction -- relying on the invariance of the distribution of the data under special groups of transformations such as permutation groups. Moreover, many existing methods for predictive inference aim to predict unobserved outcomes in sequences of feature-outcome observations. Meanwhile, there is interest in predictive inference under more general observation models (e.g., for partially observed features) and for data satisfying more general distributional symmetries (e.g., rotationally invariant or coordinate-independent observations in physics). Here we propose SymmPI, a methodology for predictive inference when data distributions have general group symmetries in arbitrary observation models. Our methods leverage the novel notion of distributional equivariant transformations, which process the data while preserving their distributional invariances. We show that SymmPI has valid coverage under distributional invariance and characterize its performance under distribution shift, recovering recent results as special cases. 
We apply SymmPI to predict unobserved values associated to vertices in a network, where the distribution is unchanged under relabelings that keep the network structure unchanged. In several simulations in a two-layer hierarchical model, and in an empirical data analysis example, SymmPI performs favorably compared to existing methods."}, "https://arxiv.org/abs/2310.02008": {"title": "fmeffects: An R Package for Forward Marginal Effects", "link": "https://arxiv.org/abs/2310.02008", "description": "arXiv:2310.02008v2 Announce Type: replace-cross \nAbstract: Forward marginal effects have recently been introduced as a versatile and effective model-agnostic interpretation method particularly suited for non-linear and non-parametric prediction models. They provide comprehensible model explanations of the form: if we change feature values by a pre-specified step size, what is the change in the predicted outcome? We present the R package fmeffects, the first software implementation of the theory surrounding forward marginal effects. The relevant theoretical background, package functionality and handling, as well as the software design and options for future extensions are discussed in this paper."}, "https://arxiv.org/abs/2409.08347": {"title": "Substitution in the perturbed utility route choice model", "link": "https://arxiv.org/abs/2409.08347", "description": "arXiv:2409.08347v1 Announce Type: new \nAbstract: This paper considers substitution patterns in the perturbed utility route choice model. We provide a general result that determines the marginal change in link flows following a marginal change in link costs across the network. We give a general condition on the network structure under which all paths are necessarily substitutes and an example in which some paths are complements. The presence of complementarity contradicts a result in a previous paper in this journal; we point out and correct the error."}, "https://arxiv.org/abs/2409.08354": {"title": "Bayesian Dynamic Factor Models for High-dimensional Matrix-valued Time Series", "link": "https://arxiv.org/abs/2409.08354", "description": "arXiv:2409.08354v1 Announce Type: new \nAbstract: High-dimensional matrix-valued time series are of significant interest in economics and finance, with prominent examples including cross region macroeconomic panels and firms' financial data panels. We introduce a class of Bayesian matrix dynamic factor models that utilize matrix structures to identify more interpretable factor patterns and factor impacts. Our model accommodates time-varying volatility, adjusts for outliers, and allows cross-sectional correlations in the idiosyncratic components. To determine the dimension of the factor matrix, we employ an importance-sampling estimator based on the cross-entropy method to estimate marginal likelihoods. Through a series of Monte Carlo experiments, we show the properties of the factor estimators and the performance of the marginal likelihood estimator in correctly identifying the true dimensions of the factor matrices. 
Applying our model to a macroeconomic dataset and a financial dataset, we demonstrate its ability in unveiling interesting features within matrix-valued time series."}, "https://arxiv.org/abs/2409.08756": {"title": "Cubature-based uncertainty estimation for nonlinear regression models", "link": "https://arxiv.org/abs/2409.08756", "description": "arXiv:2409.08756v1 Announce Type: new \nAbstract: Calibrating model parameters to measured data by minimizing loss functions is an important step in obtaining realistic predictions from model-based approaches, e.g., for process optimization. This is applicable to both knowledge-driven and data-driven model setups. Due to measurement errors, the calibrated model parameters also carry uncertainty. In this contribution, we use cubature formulas based on sparse grids to calculate the variance of the regression results. The number of cubature points is close to the theoretical minimum required for a given level of exactness. We present exact benchmark results, which we also compare to other cubatures. This scheme is then applied to estimate the prediction uncertainty of the NRTL model, calibrated to observations from different experimental designs."}, "https://arxiv.org/abs/2409.08773": {"title": "The Clustered Dose-Response Function Estimator for continuous treatment with heterogeneous treatment effects", "link": "https://arxiv.org/abs/2409.08773", "description": "arXiv:2409.08773v1 Announce Type: new \nAbstract: Many treatments are non-randomly assigned, continuous in nature, and exhibit heterogeneous effects even at identical treatment intensities. Taken together, these characteristics pose significant challenges for identifying causal effects, as no existing estimator can provide an unbiased estimate of the average causal dose-response function. To address this gap, we introduce the Clustered Dose-Response Function (Cl-DRF), a novel estimator designed to discern the continuous causal relationships between treatment intensity and the dependent variable across different subgroups. This approach leverages both theoretical and data-driven sources of heterogeneity and operates under relaxed versions of the conditional independence and positivity assumptions, which are required to be met only within each identified subgroup. To demonstrate the capabilities of the Cl-DRF estimator, we present both simulation evidence and an empirical application examining the impact of European Cohesion funds on economic growth."}, "https://arxiv.org/abs/2409.08779": {"title": "The underreported death toll of wars: a probabilistic reassessment from a structured expert elicitation", "link": "https://arxiv.org/abs/2409.08779", "description": "arXiv:2409.08779v1 Announce Type: new \nAbstract: Event datasets including those provided by Uppsala Conflict Data Program (UCDP) are based on reports from the media and international organizations, and are likely to suffer from reporting bias. Since the UCDP has strict inclusion criteria, they most likely under-estimate conflict-related deaths, but we do not know by how much. Here, we provide a generalizable, cross-national measure of uncertainty around UCDP reported fatalities that is more robust and realistic than UCDP's documented low and high estimates, and make available a dataset and R package accounting for the measurement uncertainty. 
We use a structured expert elicitation combined with statistical modelling to derive a distribution of plausible numbers of fatalities given the number of battle-related deaths and the type of violence documented by the UCDP. The results can help scholars understand the extent of bias affecting their empirical analyses of organized violence and contribute to improving the accuracy of conflict forecasting systems."}, "https://arxiv.org/abs/2409.08821": {"title": "High-dimensional regression with a count response", "link": "https://arxiv.org/abs/2409.08821", "description": "arXiv:2409.08821v1 Announce Type: new \nAbstract: We consider high-dimensional regression with a count response modeled by a Poisson or negative binomial generalized linear model (GLM). We propose a penalized maximum likelihood estimator with a properly chosen complexity penalty and establish its adaptive minimaxity across models of various sparsity. To make the procedure computationally feasible for high-dimensional data we consider its LASSO and SLOPE convex surrogates. Their performance is illustrated through simulated and real-data examples."}, "https://arxiv.org/abs/2409.08838": {"title": "Angular Co-variance using intrinsic geometry of torus: Non-parametric change points detection in meteorological data", "link": "https://arxiv.org/abs/2409.08838", "description": "arXiv:2409.08838v1 Announce Type: new \nAbstract: In many temporal datasets, the parameters of the underlying distribution may change abruptly at unknown times. Detecting these changepoints is crucial for numerous applications. While this problem has been extensively studied for linear data, there has been remarkably less research on bivariate angular data. For the first time, we address the changepoint problem for the mean direction of toroidal and spherical data, which are types of bivariate angular data. By leveraging the intrinsic geometry of a curved torus, we introduce the concept of the ``square'' of an angle. This leads us to define the ``curved dispersion matrix'' for bivariate angular random variables, analogous to the dispersion matrix for bivariate linear random variables. Using this analogous measure of the ``Mahalanobis distance,'' we develop two new non-parametric tests to identify changes in the mean direction parameters for toroidal and spherical distributions. We derive the limiting distributions of the test statistics and evaluate their power surface and contours through extensive simulations. We also apply the proposed methods to detect changes in mean direction for hourly wind-wave direction measurements and the path of the cyclonic storm ``Biporjoy,'' which occurred between 6th and 19th June 2023 over the Arabian Sea, western coast of India."}, "https://arxiv.org/abs/2409.08863": {"title": "Change point analysis with irregular signals", "link": "https://arxiv.org/abs/2409.08863", "description": "arXiv:2409.08863v1 Announce Type: new \nAbstract: This paper considers the problem of testing and estimation of a change point where signals after the change point can be highly irregular, which departs from the existing literature that assumes signals after the change point to be piece-wise constant or vary smoothly. A two-step approach is proposed to effectively estimate the location of the change point. The first step consists of a preliminary estimation of the change point that allows us to obtain unknown parameters for the second step. In the second step we use a new procedure to determine the position of the change point. 
We show that, under suitable conditions, the desirable $\\mathcal{O}_P(1)$ rate of convergence of the estimated change point can be obtained. We apply our method to analyze the Baidu search index of COVID-19 related symptoms and find 8 December 2019 to be the starting date of the COVID-19 pandemic."}, "https://arxiv.org/abs/2409.08912": {"title": "Joint spatial modeling of mean and non-homogeneous variance combining semiparametric SAR and GAMLSS models for hedonic prices", "link": "https://arxiv.org/abs/2409.08912", "description": "arXiv:2409.08912v1 Announce Type: new \nAbstract: In the context of spatial econometrics, it is very useful to have methodologies that allow modeling the spatial dependence of the observed variables and obtaining more precise predictions of both the mean and the variability of the response variable, something very useful in territorial planning and public policies. This paper proposes a new methodology that jointly models the mean and the variance. Also, it allows one to model the spatial dependence of the dependent variable as a function of covariates and to model the semiparametric effects in both models. The algorithms developed are based on generalized additive models that allow the inclusion of non-parametric terms in both the mean and the variance, maintaining the traditional theoretical framework of spatial regression. The theoretical developments of the estimation of this model are carried out, obtaining desirable statistical properties in the estimators. A simulation study is developed to verify that the proposed method has a remarkable predictive capacity in terms of the mean square error and shows a notable improvement in the estimation of the spatial autoregressive parameter, compared to other traditional methods and some recent developments. The model is also tested on data from the construction of a hedonic price model for the city of Bogota, highlighting as the main result the ability to model the variability of housing prices, and the richness of the analysis obtained."}, "https://arxiv.org/abs/2409.08924": {"title": "Regression-based proximal causal inference for right-censored time-to-event data", "link": "https://arxiv.org/abs/2409.08924", "description": "arXiv:2409.08924v1 Announce Type: new \nAbstract: Unmeasured confounding is one of the major concerns in causal inference from observational data. Proximal causal inference (PCI) is an emerging methodological framework to detect and potentially account for confounding bias by carefully leveraging a pair of negative control exposure (NCE) and outcome (NCO) variables, also known as treatment and outcome confounding proxies. Although regression-based PCI is well developed for binary and continuous outcomes, analogous PCI regression methods for right-censored time-to-event outcomes are currently lacking. In this paper, we propose a novel two-stage regression PCI approach for right-censored survival data under an additive hazard structural model. We provide theoretical justification for the proposed approach tailored to different types of NCOs, including continuous, count, and right-censored time-to-event variables. We illustrate the approach with an evaluation of the effectiveness of right heart catheterization among critically ill patients using data from the SUPPORT study. 
Our method is implemented in the open-access R package 'pci2s'."}, "https://arxiv.org/abs/2409.08965": {"title": "Dynamic Bayesian Networks with Conditional Dynamics in Edge Addition and Deletion", "link": "https://arxiv.org/abs/2409.08965", "description": "arXiv:2409.08965v1 Announce Type: new \nAbstract: This study presents a dynamic Bayesian network framework that facilitates intuitive gradual edge changes. We use two conditional dynamics to model the edge addition and deletion, and edge selection separately. Unlike previous research that uses a mixture network approach, which restricts the number of possible edge changes, or structural priors to induce gradual changes, which can lead to unclear network evolution, our model induces more frequent and intuitive edge change dynamics. We employ Markov chain Monte Carlo (MCMC) sampling to estimate the model structures and parameters and demonstrate the model's effectiveness in a portfolio selection application."}, "https://arxiv.org/abs/2409.08350": {"title": "An efficient heuristic for approximate maximum flow computations", "link": "https://arxiv.org/abs/2409.08350", "description": "arXiv:2409.08350v1 Announce Type: cross \nAbstract: Several concepts borrowed from graph theory are routinely used to better understand the inner workings of the (human) brain. To this end, a connectivity network of the brain is built first, which then allows one to assess quantities such as information flow and information routing via shortest path and maximum flow computations. Since brain networks typically contain several thousand nodes and edges, computational scaling is a key research area. In this contribution, we focus on approximate maximum flow computations in large brain networks. By combining graph partitioning with maximum flow computations, we propose a new approximation algorithm for the computation of the maximum flow with runtime O(|V||E|^2/k^2) compared to the usual runtime of O(|V||E|^2) for the Edmonds-Karp algorithm, where $V$ is the set of vertices, $E$ is the set of edges, and $k$ is the number of partitions. We assess both accuracy and runtime of the proposed algorithm on simulated graphs as well as on graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com)."}, "https://arxiv.org/abs/2409.08908": {"title": "Tracing the impacts of Mount Pinatubo eruption on global climate using spatially-varying changepoint detection", "link": "https://arxiv.org/abs/2409.08908", "description": "arXiv:2409.08908v1 Announce Type: cross \nAbstract: Significant events such as volcanic eruptions can have global and long lasting impacts on climate. These global impacts, however, are not uniform across space and time. Understanding how the Mt. Pinatubo eruption affects global and regional climate is of great interest for predicting impact on climate due to similar events. We propose a Bayesian framework to simultaneously detect and estimate spatially-varying temporal changepoints for regional climate impacts. Our approach takes into account the diffusing nature of the changes caused by the volcanic eruption and leverages spatial correlation. We illustrate our method on simulated datasets and compare it with an existing changepoint detection method. Finally, we apply our method on monthly stratospheric aerosol optical depth and surface temperature data from 1985 to 1995 to detect and estimate changepoints following the 1991 Mt. 
Pinatubo eruption."}, "https://arxiv.org/abs/2409.08925": {"title": "Multi forests: Variable importance for multi-class outcomes", "link": "https://arxiv.org/abs/2409.08925", "description": "arXiv:2409.08925v1 Announce Type: cross \nAbstract: In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), like permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance compared to conventional RFs. This is, however, not a limiting factor given the algorithm's primary purpose of calculating the multi-class VIM."}, "https://arxiv.org/abs/2409.08928": {"title": "Self-Organized State-Space Models with Artificial Dynamics", "link": "https://arxiv.org/abs/2409.08928", "description": "arXiv:2409.08928v1 Announce Type: cross \nAbstract: In this paper we consider a state-space model (SSM) parametrized by some parameter $\\theta$, and our aim is to perform joint parameter and state inference. A simple idea to perform this task, which almost dates back to the origin of the Kalman filter, is to replace the static parameter $\\theta$ by a Markov chain $(\\theta_t)_{t\\geq 0}$ on the parameter space and then to apply a standard filtering algorithm to the extended, or self-organized SSM. However, the practical implementation of this idea in a theoretically justified way has remained an open problem. In this paper we fill this gap by introducing various possible constructions of the Markov chain $(\\theta_t)_{t\\geq 0}$ that ensure the validity of the self-organized SSM (SO-SSM) for joint parameter and state inference. Notably, we show that theoretically valid SO-SSMs can be defined even if $\\|\\mathrm{Var}(\\theta_{t}|\\theta_{t-1})\\|$ converges to 0 slowly as $t\\rightarrow\\infty$. This result is important since, as illustrated in our numerical experiments, such models can be efficiently approximated using standard particle filter algorithms. 
While the idea studied in this work was first introduced for online inference in SSMs, it has also been proved to be useful for computing the maximum likelihood estimator (MLE) of a given SSM, since iterated filtering algorithms can be seen as particle filters applied to SO-SSMs for which the target parameter value is the MLE of interest. Based on this observation, we also derive constructions of $(\\theta_t)_{t\\geq 0}$ and theoretical results tailored to these specific applications of SO-SSMs, and as a result, we introduce new iterated filtering algorithms. From a practical point of view, the algorithms introduced in this work have the merit of being simple to implement and only requiring minimal tuning to perform well."}, "https://arxiv.org/abs/2102.07008": {"title": "A Distance Covariance-based Estimator", "link": "https://arxiv.org/abs/2102.07008", "description": "arXiv:2102.07008v2 Announce Type: replace \nAbstract: This paper introduces an estimator that considerably weakens the conventional relevance condition of instrumental variable (IV) methods, allowing for instruments that are weakly correlated, uncorrelated, or even mean-independent but not independent of endogenous covariates. Under the relevance condition, the estimator achieves consistent estimation and reliable inference without requiring instrument excludability, and it remains robust even when the first moment of the disturbance term does not exist. In contrast to conventional IV methods, it maximises the set of feasible instruments in any empirical setting. Under a weak conditional median independence condition on pairwise differences in disturbances and mild regularity assumptions, identification holds, and the estimator is consistent and asymptotically normal."}, "https://arxiv.org/abs/2208.14960": {"title": "Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case", "link": "https://arxiv.org/abs/2208.14960", "description": "arXiv:2208.14960v4 Announce Type: replace \nAbstract: Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. 
Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners."}, "https://arxiv.org/abs/2209.03318": {"title": "On the Wasserstein median of probability measures", "link": "https://arxiv.org/abs/2209.03318", "description": "arXiv:2209.03318v3 Announce Type: replace \nAbstract: The primary choice to summarize a finite collection of random objects is by using measures of central tendency, such as mean and median. In the field of optimal transport, the Wasserstein barycenter corresponds to the Fr\\'{e}chet or geometric mean of a set of probability measures, which is defined as a minimizer of the sum of squared distances to each element in a given set with respect to the Wasserstein distance of order 2. We introduce the Wasserstein median as a robust alternative to the Wasserstein barycenter. The Wasserstein median corresponds to the Fr\\'{e}chet median under the 2-Wasserstein metric. The existence and consistency of the Wasserstein median are first established, along with its robustness property. In addition, we present a general computational pipeline that employs any recognized algorithms for the Wasserstein barycenter in an iterative fashion and demonstrate its convergence. The utility of the Wasserstein median as a robust measure of central tendency is demonstrated using real and simulated data."}, "https://arxiv.org/abs/2209.13918": {"title": "Inference in generalized linear models with robustness to misspecified variances", "link": "https://arxiv.org/abs/2209.13918", "description": "arXiv:2209.13918v3 Announce Type: replace \nAbstract: Generalized linear models usually assume a common dispersion parameter, an assumption that is seldom true in practice. Consequently, standard parametric methods may suffer appreciable loss of type I error control. As an alternative, we present a semi-parametric group-invariance method based on sign flipping of score contributions. Our method requires only the correct specification of the mean model, but is robust against any misspecification of the variance. We present tests for single as well as multiple regression coefficients. The test is asymptotically valid but shows excellent performance in small samples. We illustrate the method using RNA sequencing count data, for which it is difficult to model the overdispersion correctly. The method is available in the R library flipscores."}, "https://arxiv.org/abs/2211.07351": {"title": "A Tutorial on Asymptotic Properties for Biostatisticians with Applications to COVID-19 Data", "link": "https://arxiv.org/abs/2211.07351", "description": "arXiv:2211.07351v2 Announce Type: replace \nAbstract: Asymptotic properties of statistical estimators play a significant role both in practice and in theory. However, many asymptotic results in statistics rely heavily on the independent and identically distributed (iid) assumption, which is not realistic when we have fixed designs. In this article, we build a roadmap of general procedures for deriving asymptotic properties under fixed designs where the observations need not be iid. We further demonstrate their use in many statistical applications. 
Finally, we apply our results to Poisson regression using a COVID-19 dataset as an illustration to demonstrate the power of these results in practice."}, "https://arxiv.org/abs/2311.18699": {"title": "Correlated Bayesian Additive Regression Trees with Gaussian Process for Regression Analysis of Dependent Data", "link": "https://arxiv.org/abs/2311.18699", "description": "arXiv:2311.18699v2 Announce Type: replace \nAbstract: Bayesian Additive Regression Trees (BART) has gained widespread popularity, prompting the development of various extensions for different applications. However, limited attention has been given to analyzing dependent data. Based on a general correlated error assumption and an innovative dummy representation, we introduce a novel extension of BART, called Correlated BART (CBART), designed to handle correlated errors. By integrating CBART with a Gaussian process (GP), we propose the CBART-GP model, in which the CBART and GP components are loosely coupled, allowing them to be estimated and applied independently. CBART captures the covariate mean function E[y|x]=f(x), while the Gaussian process models the dependency structure in the response $y$. We also develop a computationally efficient approach, named two-stage analysis of variance with weighted residuals, for the estimation of CBART-GP. Simulation studies demonstrate the superiority of CBART-GP over other models, and a real-world application illustrates its practical applicability."}, "https://arxiv.org/abs/2401.01949": {"title": "Adjacency Matrix Decomposition Clustering for Human Activity Data", "link": "https://arxiv.org/abs/2401.01949", "description": "arXiv:2401.01949v2 Announce Type: replace \nAbstract: Mobile apps and wearable devices accurately and continuously measure human activity; patterns within this data can provide a wealth of information applicable to fields such as transportation and health. Despite the potential utility of this data, there has been limited development of analysis methods for sequences of daily activities. In this paper, we propose a novel clustering method and cluster evaluation metric for human activity data that leverages an adjacency matrix representation to cluster the data without the calculation of a distance matrix. Our technique is substantially faster than conventional methods based on computing pairwise distances via sequence alignment algorithms and also enhances interpretability of results. We compare our method to distance-based hierarchical clustering and nTreeClus through simulation studies and an application to data collected by Daynamica, an app that turns sensor data into a daily summary of a user's activities. Among days that contain a large portion of time spent at home, our method distinguishes days that also contain multiple hours of travel or other activities, while both comparison methods fail to identify these patterns. We further identify which day patterns classified by our method are associated with higher concern for contracting COVID-19 with implications for public health messaging."}, "https://arxiv.org/abs/2312.15469": {"title": "Efficient Estimation of the Central Mean Subspace via Smoothed Gradient Outer Products", "link": "https://arxiv.org/abs/2312.15469", "description": "arXiv:2312.15469v2 Announce Type: replace-cross \nAbstract: We consider the problem of sufficient dimension reduction (SDR) for multi-index models. 
The estimators of the central mean subspace in prior works either have slow (non-parametric) convergence rates, or rely on stringent distributional conditions (e.g., the covariate distribution $P_{\\mathbf{X}}$ being elliptically symmetric). In this paper, we show that a fast parametric convergence rate of the form $C_d \\cdot n^{-1/2}$ is achievable via estimating the \\emph{expected smoothed gradient outer product}, for a general class of distributions $P_{\\mathbf{X}}$ admitting Gaussian or heavier distributions. When the link function is a polynomial with a degree of at most $r$ and $P_{\\mathbf{X}}$ is the standard Gaussian, we show that the prefactor depends on the ambient dimension $d$ as $C_d \\propto d^r$."}, "https://arxiv.org/abs/2409.09065": {"title": "Automatic Pricing and Replenishment Strategies for Vegetable Products Based on Data Analysis and Nonlinear Programming", "link": "https://arxiv.org/abs/2409.09065", "description": "arXiv:2409.09065v1 Announce Type: new \nAbstract: In the field of fresh produce retail, vegetables generally have a relatively limited shelf life, and their quality deteriorates with time. Most vegetable varieties, if not sold on the day of delivery, become difficult to sell the following day. Therefore, retailers usually perform daily quantitative replenishment based on historical sales data and demand conditions. Vegetable pricing typically uses a \"cost-plus pricing\" method, with retailers often discounting products affected by transportation loss and quality decline. In this context, reliable market demand analysis is crucial as it directly impacts replenishment and pricing decisions. Given the limited retail space, a rational sales mix becomes essential. This paper first uses data analysis and visualization techniques to examine the distribution patterns and interrelationships of vegetable sales quantities by category and individual item, based on provided data on vegetable types, sales records, wholesale prices, and recent loss rates. Next, it constructs a functional relationship between total sales volume and cost-plus pricing for vegetable categories, forecasts future wholesale prices using the ARIMA model, and establishes a sales profit function and constraints. A nonlinear programming model is then developed and solved to provide daily replenishment quantities and pricing strategies for each vegetable category for the upcoming week. Further, we optimize the profit function and constraints based on the actual sales conditions and requirements, providing replenishment quantities and pricing strategies for individual items on July 1 to maximize retail profit. Finally, to better formulate replenishment and pricing decisions for vegetable products, we discuss and forecast the data that retailers need to collect and analyze how the collected data can be applied to the above issues."}, "https://arxiv.org/abs/2409.09178": {"title": "Identification of distributions for risks based on the first moment and c-statistic", "link": "https://arxiv.org/abs/2409.09178", "description": "arXiv:2409.09178v1 Announce Type: new \nAbstract: We show that for any family of distributions with support on [0,1] with strictly monotonic cumulative distribution function (CDF) that has no jumps and is quantile-identifiable (i.e., any two distinct quantiles identify the distribution), knowing the first moment and c-statistic is enough to identify the distribution. 
The derivations motivate numerical algorithms for mapping a given pair of expected value and c-statistic to the parameters of specified two-parameter distributions for probabilities. We implemented these algorithms in R and in a simulation study evaluated their numerical accuracy for common families of distributions for risks (beta, logit-normal, and probit-normal). An area of application for these developments is in risk prediction modeling (e.g., sample size calculations and Value of Information analysis), where one might need to estimate the parameters of the distribution of predicted risks from the reported summary statistics."}, "https://arxiv.org/abs/2409.09236": {"title": "Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times", "link": "https://arxiv.org/abs/2409.09236", "description": "arXiv:2409.09236v1 Announce Type: new \nAbstract: While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of the dynamic policies, this paper constructs a novel OPE framework, which concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and with the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments."}, "https://arxiv.org/abs/2409.09243": {"title": "Unconditional Randomization Tests for Interference", "link": "https://arxiv.org/abs/2409.09243", "description": "arXiv:2409.09243v1 Announce Type: new \nAbstract: In social networks or spatial experiments, one unit's outcome often depends on another's treatment, a phenomenon called interference. Researchers are interested in not only the presence and magnitude of interference but also its pattern based on factors like distance, neighboring units, and connection strength. However, the non-random nature of these factors and complex correlations across units pose challenges for inference. This paper introduces the partial null randomization tests (PNRT) framework to address these issues. The proposed method is finite-sample valid and applicable with minimal network structure assumptions, utilizing randomization testing and pairwise comparisons. Unlike existing conditional randomization tests, PNRT avoids the need for conditioning events, making it more straightforward to implement. 
Simulations demonstrate the method's desirable power properties and its applicability to general interference scenarios."}, "https://arxiv.org/abs/2409.09310": {"title": "Exact Posterior Mean and Covariance for Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2409.09310", "description": "arXiv:2409.09310v1 Announce Type: new \nAbstract: A novel method is proposed for the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response is not normally distributed. The research solves a long-standing problem in Bayesian statistics when an intractable integral appears in the posterior distribution. It is well-known that the posterior distribution of the random effects given the response in a GLMM when the response is not normally distributed contains intractable integrals. Previous methods rely on Monte Carlo simulations for the posterior distributions. They do not provide the exact posterior mean and covariance of the random effects given the response. The special integral computation (SIC) method is proposed to overcome the difficulty. The SIC method does not use the posterior distribution in the computation. It devises an optimization problem to accomplish the task. An advantage is that the computation of the posterior distribution is unnecessary. The proposed SIC avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution."}, "https://arxiv.org/abs/2409.09355": {"title": "A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions", "link": "https://arxiv.org/abs/2409.09355", "description": "arXiv:2409.09355v1 Announce Type: new \nAbstract: Linear model prediction with a large number of potential predictors is both statistically and computationally challenging. The traditional approaches are largely based on shrinkage selection/estimation methods, which are applicable even when the number of potential predictors is (much) larger than the sample size. The latter scenario occurs when the candidate predictors involve many binary indicators corresponding to categories of some categorical predictors as well as their interactions. We propose an alternative approach to the shrinkage prediction methods in such a case based on mixed model prediction, which effectively treats combinations of the categorical effects as random effects. We establish theoretical validity of the proposed method, and demonstrate empirically its advantage over the shrinkage methods. We also develop measures of uncertainty for the proposed method and evaluate their performance empirically. A real-data example is considered."}, "https://arxiv.org/abs/2409.09440": {"title": "Group Sequential Testing of a Treatment Effect Using a Surrogate Marker", "link": "https://arxiv.org/abs/2409.09440", "description": "arXiv:2409.09440v1 Announce Type: new \nAbstract: The identification of surrogate markers is motivated by their potential to make decisions sooner about a treatment effect. However, few methods have been developed to actually use a surrogate marker to test for a treatment effect in a future study. Most existing methods consider combining surrogate marker and primary outcome information to test for a treatment effect, rely on fully parametric methods where strict parametric assumptions are made about the relationship between the surrogate and the outcome, and/or assume the surrogate marker is measured at only a single time point. 
Recent work has proposed a nonparametric test for a treatment effect using only surrogate marker information measured at a single time point by borrowing information learned from a prior study where both the surrogate and primary outcome were measured. In this paper, we utilize this nonparametric test and propose group sequential procedures that allow for early stopping of treatment effect testing in a setting where the surrogate marker is measured repeatedly over time. We derive the properties of the correlated surrogate-based nonparametric test statistics at multiple time points and compute stopping boundaries that allow for early stopping for a significant treatment effect, or for futility. We examine the performance of our testing procedure using a simulation study and illustrate the method using data from two distinct AIDS clinical trials."}, "https://arxiv.org/abs/2409.09512": {"title": "Doubly robust and computationally efficient high-dimensional variable selection", "link": "https://arxiv.org/abs/2409.09512", "description": "arXiv:2409.09512v1 Announce Type: new \nAbstract: The variable selection problem is to discover which of a large set of predictors is associated with an outcome of interest, conditionally on the other predictors. This problem has been widely studied, but existing approaches lack either power against complex alternatives, robustness to model misspecification, computational efficiency, or quantification of evidence against individual hypotheses. We present tower PCM (tPCM), a statistically and computationally efficient solution to the variable selection problem that does not suffer from these shortcomings. tPCM adapts the best aspects of two existing procedures that are based on similar functionals: the holdout randomization test (HRT) and the projected covariance measure (PCM). The former is a model-X test that utilizes many resamples and few machine learning fits, while the latter is an asymptotic doubly-robust style test for a single hypothesis that requires no resamples and many machine learning fits. Theoretically, we demonstrate the validity of tPCM, and perhaps surprisingly, the asymptotic equivalence of HRT, PCM, and tPCM. In so doing, we clarify the relationship between two methods from two separate literatures. An extensive simulation study verifies that tPCM can have significant computational savings compared to HRT and PCM, while maintaining nearly identical power."}, "https://arxiv.org/abs/2409.09577": {"title": "Structural counterfactual analysis in macroeconomics: theory and inference", "link": "https://arxiv.org/abs/2409.09577", "description": "arXiv:2409.09577v1 Announce Type: new \nAbstract: We propose a structural model-free methodology to analyze two types of macroeconomic counterfactuals related to policy path deviation: hypothetical trajectory and policy intervention. Our model-free approach is built on a structural vector moving-average (SVMA) model that relies solely on the identification of policy shocks, thereby eliminating the need to specify an entire structural model. Analytical solutions are derived for the counterfactual parameters, and statistical inference for these parameter estimates is provided using the Delta method. By utilizing external instruments, we introduce a projection-based method for the identification, estimation, and inference of these parameters. This approach connects our counterfactual analysis with the Local Projection literature. 
A simulation-based approach with a nonlinear model is provided to aid in addressing the Lucas critique. The innovative model-free methodology is applied in three counterfactual studies on U.S. monetary policy: (1) a historical scenario analysis for a hypothetical interest rate path in the post-pandemic era, (2) a future scenario analysis under either hawkish or dovish interest rate policy, and (3) an evaluation of the policy intervention effect of an oil price shock by zeroing out the systematic responses of the interest rate."}, "https://arxiv.org/abs/2409.09660": {"title": "On the Proofs of the Predictive Synthesis Formula", "link": "https://arxiv.org/abs/2409.09660", "description": "arXiv:2409.09660v1 Announce Type: new \nAbstract: Bayesian predictive synthesis is useful in synthesizing multiple predictive distributions coherently. However, the proof for the fundamental equation of the synthesized predictive density has been missing. In this technical report, we review the line of research on predictive synthesis, then fill the gap between the known results and the equation used in modern applications. We provide two proofs and clarify the structure of predictive synthesis."}, "https://arxiv.org/abs/2409.09865": {"title": "A general approach to fitting multistate cure models based on an extended-long-format data structure", "link": "https://arxiv.org/abs/2409.09865", "description": "arXiv:2409.09865v1 Announce Type: new \nAbstract: A multistate cure model is a statistical framework used to analyze and represent the transitions that individuals undergo between different states over time, taking into account the possibility of being cured by initial treatment. This model is particularly useful in pediatric oncology where a fraction of the patient population achieves cure through treatment and therefore will never experience some events. Our study develops a generalized algorithm based on the extended long data format, an extension of the long data format in which a transition can be split into up to two rows, each with an assigned weight reflecting the posterior probability of its cure status. The multistate cure model is fit on top of the current framework of multistate models and mixture cure models. The proposed algorithm makes use of the Expectation-Maximization (EM) algorithm and weighted likelihood representation such that it is easy to implement with standard packages. As an example, the proposed algorithm is applied to data from the European Society for Blood and Marrow Transplantation (EBMT). Standard errors of the estimated parameters are obtained via a non-parametric bootstrap procedure, while the method involving the calculation of the second-derivative matrix of the observed log-likelihood is also presented."}, "https://arxiv.org/abs/2409.09884": {"title": "Dynamic quantification of player value for fantasy basketball", "link": "https://arxiv.org/abs/2409.09884", "description": "arXiv:2409.09884v1 Announce Type: new \nAbstract: Previous work on fantasy basketball quantifies player value for category leagues without taking draft circumstances into account. Quantifying value in this way is convenient, but inherently limited as a strategy, because it precludes the possibility of dynamic adaptation. This work introduces a framework for dynamic algorithms, dubbed \"H-scoring\", and describes an implementation of the framework for head-to-head formats, dubbed $H_0$. 
$H_0$ models many of the main aspects of category league strategy including category weighting, positional assignments, and format-specific objectives. Head-to-head simulations provide evidence that $H_0$ outperforms static ranking lists. Category-level results from the simulations reveal that one component of $H_0$'s strategy is punting a subset of categories, which it learns to do implicitly."}, "https://arxiv.org/abs/2409.09962": {"title": "A Simple and Adaptive Confidence Interval when Nuisance Parameters Satisfy an Inequality", "link": "https://arxiv.org/abs/2409.09962", "description": "arXiv:2409.09962v1 Announce Type: new \nAbstract: Inequalities may appear in many models. They can be as simple as assuming a parameter is nonnegative, possibly a regression coefficient or a treatment effect. This paper focuses on the case that there is only one inequality and proposes a confidence interval that is particularly attractive, called the inequality-imposed confidence interval (IICI). The IICI is simple. It does not require simulations or tuning parameters. The IICI is adaptive. It reduces to the usual confidence interval (calculated by adding and subtracting the standard error times the $1 - \\alpha/2$ standard normal quantile) when the inequality is sufficiently slack. When the inequality is sufficiently violated, the IICI reduces to an equality-imposed confidence interval (the usual confidence interval for the submodel where the inequality holds with equality). Also, the IICI is uniformly valid and has (weakly) shorter length than the usual confidence interval; it is never longer. The first empirical application considers a linear regression when a coefficient is known to be nonpositive. A second empirical application considers an instrumental variables regression when the endogeneity of a regressor is known to be nonnegative."}, "https://arxiv.org/abs/2409.10001": {"title": "Generalized Matrix Factor Model", "link": "https://arxiv.org/abs/2409.10001", "description": "arXiv:2409.10001v1 Announce Type: new \nAbstract: This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in the matrix factor modeling and leads to feasible central limit theorems of the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constraint maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. 
An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix."}, "https://arxiv.org/abs/2409.10030": {"title": "On LASSO Inference for High Dimensional Predictive Regression", "link": "https://arxiv.org/abs/2409.10030", "description": "arXiv:2409.10030v1 Announce Type: new \nAbstract: LASSO introduces shrinkage bias into estimated coefficients, which can adversely affect the desirable asymptotic normality and invalidate the standard inferential procedure based on the $t$-statistic. The desparsified LASSO has emerged as a well-known remedy for this issue. In the context of high dimensional predictive regression, the desparsified LASSO faces an additional challenge: the Stambaugh bias arising from nonstationary regressors. To restore the standard inferential procedure, we propose a novel estimator called IVX-desparsified LASSO (XDlasso). XDlasso eliminates the shrinkage bias and the Stambaugh bias simultaneously and does not require prior knowledge about the identities of nonstationary and stationary regressors. We establish the asymptotic properties of XDlasso for hypothesis testing, and our theoretical findings are supported by Monte Carlo simulations. Applying our method to real-world applications from the FRED-MD database -- which includes a rich set of control variables -- we investigate two important empirical questions: (i) the predictability of the U.S. stock returns based on the earnings-price ratio, and (ii) the predictability of the U.S. inflation using the unemployment rate."}, "https://arxiv.org/abs/2409.10174": {"title": "Information criteria for the number of directions of extremes in high-dimensional data", "link": "https://arxiv.org/abs/2409.10174", "description": "arXiv:2409.10174v1 Announce Type: new \nAbstract: In multivariate extreme value analysis, the estimation of the dependence structure in extremes is a challenging task, especially in the context of high-dimensional data. Therefore, a common approach is to reduce the model dimension by considering only the directions in which extreme values occur. In this paper, we use the concept of sparse regular variation recently introduced by Meyer and Wintenberger (2021) to derive information criteria for the number of directions in which extreme events occur, such as a Bayesian information criterion (BIC), a mean-squared error-based information criterion (MSEIC), and a quasi-Akaike information criterion (QAIC) based on the Gaussian likelihood function. As is typical in extreme value analysis, a challenging task is the choice of the number $k_n$ of observations used for the estimation. Therefore, for all information criteria, we present a two-step procedure to estimate both the number of directions of extremes and an optimal choice of $k_n$. We prove that the AIC of Meyer and Wintenberger (2023) and the MSEIC are inconsistent information criteria for the number of extreme directions whereas the BIC and the QAIC are consistent information criteria. 
Finally, the performance of the different information criteria is compared in a simulation study and applied to wind speed data."}, "https://arxiv.org/abs/2409.10221": {"title": "bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R", "link": "https://arxiv.org/abs/2409.10221", "description": "arXiv:2409.10221v1 Announce Type: new \nAbstract: The family of cure models provides a unique opportunity to simultaneously model both the proportion of cured subjects (those not facing the event of interest) and the distribution function of time-to-event for susceptibles (those facing the event). In practice, the application of cure models is mainly facilitated by the availability of various R packages. However, most of these packages primarily focus on the mixture or promotion time cure rate model. This article presents a fully Bayesian approach implemented in R to estimate a general family of cure rate models in the presence of covariates. It builds upon the work by Papastamoulis and Milienos (2024) by additionally considering various options for describing the promotion time, including the Weibull, exponential, Gompertz, log-logistic and finite mixtures of gamma distributions, among others. Moreover, the user can choose any proper distribution function for modeling the promotion time (provided that some specific conditions are met). Posterior inference is carried out by constructing a Metropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the latent cure indicators and Metropolis-Hastings steps with Langevin diffusion dynamics for parameter updates. The main MCMC algorithm is embedded within a parallel tempering scheme by considering heated versions of the target posterior distribution. The package is illustrated on a real dataset analyzing the duration of the first marriage in the presence of various covariates such as race, age, and the presence of kids."}, "https://arxiv.org/abs/2409.10318": {"title": "Systematic comparison of Bayesian basket trial designs with unequal sample sizes and proposal of a new method based on power priors", "link": "https://arxiv.org/abs/2409.10318", "description": "arXiv:2409.10318v1 Announce Type: new \nAbstract: Basket trials examine the efficacy of an intervention in multiple patient subgroups simultaneously. The division into subgroups, called baskets, is based on matching medical characteristics, which may result in small sample sizes within baskets that are also likely to differ. Sparse data complicate statistical inference. Several Bayesian methods have been proposed in the literature that allow information sharing between baskets to increase statistical power. In this work, we provide a systematic comparison of five different Bayesian basket trial designs when sample sizes differ between baskets. We consider the power prior approach with both known and new weighting methods, a design by Fujikawa et al., as well as models based on Bayesian hierarchical modeling and Bayesian model averaging. The results of our simulation study show a high sensitivity to changing sample sizes for Fujikawa's design and the power prior approach. Limiting the amount of shared information was found to be decisive for the robustness to varying basket sizes. 
In combination with the power prior approach, this resulted in the best performance and the most reliable detection of an effect of the treatment under investigation and its absence."}, "https://arxiv.org/abs/2409.10352": {"title": "Partial Ordering Bayesian Logistic Regression Model for Phase I Combination Trials and Computationally Efficient Approach to Operational Prior Specification", "link": "https://arxiv.org/abs/2409.10352", "description": "arXiv:2409.10352v1 Announce Type: new \nAbstract: Recent years have seen increased interest in combining drug agents and/or schedules. Several methods for Phase I combination-escalation trials have been proposed, among which the partial ordering continual reassessment method (POCRM) has gained great attention for its simplicity and good operational characteristics. However, the one-parameter nature of the POCRM makes it restrictive in more complicated settings such as the inclusion of a control group. This paper proposes a Bayesian partial ordering logistic model (POBLRM), which combines partial ordering and the more flexible (than CRM) two-parameter logistic model. Simulation studies show that the POBLRM performs similarly to the POCRM in non-randomised settings. When patients are randomised between the experimental dose-combinations and a control, performance is drastically improved.\n Most designs require specifying hyper-parameters, often chosen from statistical considerations (operational prior). The conventional \"grid search\" calibration approach requires large simulations, which are computationally costly. A novel \"cyclic calibration\" has been proposed to reduce the computation from multiplicative to additive. Furthermore, calibration processes should consider wide ranges of scenarios of true toxicity probabilities to avoid bias. A method to reduce scenarios based on scenario-complexities is suggested. This can reduce the computation by more than 500-fold while maintaining operational characteristics similar to those of the grid search."}, "https://arxiv.org/abs/2409.10448": {"title": "Why you should also use OLS estimation of tail exponents", "link": "https://arxiv.org/abs/2409.10448", "description": "arXiv:2409.10448v1 Announce Type: new \nAbstract: Even though practitioners often estimate Pareto exponents by running OLS rank-size regressions, the usual recommendation is to use the Hill MLE with a small-sample correction instead, due to its unbiasedness and efficiency. In this paper, we advocate that you should also apply OLS in empirical applications. On the one hand, we demonstrate that, with a small-sample correction, the OLS estimator is also unbiased. On the other hand, we show that the MLE assigns significantly greater weight to smaller observations. This suggests that the OLS estimator may outperform the MLE in cases where the distribution is (i) strictly Pareto but only in the upper tail or (ii) regularly varying rather than strictly Pareto. We substantiate our theoretical findings with Monte Carlo simulations and real-world applications, demonstrating the practical relevance of the OLS method in estimating tail exponents."}, "https://arxiv.org/abs/2409.09066": {"title": "Replicating The Log of Gravity", "link": "https://arxiv.org/abs/2409.09066", "description": "arXiv:2409.09066v1 Announce Type: cross \nAbstract: This document replicates the main results from Santos Silva and Tenreyro (2006) in R. The original results were obtained in TSP back in 2006. The idea here is to be explicit regarding the conceptual approach to regression in R. 
For most of the replication I used base R without external libraries except when it was absolutely necessary. The findings are consistent with the original article and reveal that the replication effort is minimal, without the need to contact the authors for clarifications or to resort to data transformations or filtering not mentioned in the article."}, "https://arxiv.org/abs/2409.09894": {"title": "Estimating Wage Disparities Using Foundation Models", "link": "https://arxiv.org/abs/2409.09894", "description": "arXiv:2409.09894v1 Announce Type: cross \nAbstract: One thread of empirical work in social science focuses on decomposing group differences in outcomes into unexplained components and components explained by observable factors. In this paper, we study gender wage decompositions, which require estimating the portion of the gender wage gap explained by career histories of workers. Classical methods for decomposing the wage gap employ simple predictive models of wages which condition on a small set of simple summaries of labor history. The problem is that these predictive models cannot take advantage of the full complexity of a worker's history, and the resulting decompositions thus suffer from omitted variable bias (OVB), where covariates that are correlated with both gender and wages are not included in the model. Here we explore an alternative methodology for wage gap decomposition that employs powerful foundation models, such as large language models, as the predictive engine. Foundation models excel at making accurate predictions from complex, high-dimensional inputs. We use a custom-built foundation model, designed to predict wages from full labor histories, to decompose the gender wage gap. We prove that the way such models are usually trained might still lead to OVB, but develop fine-tuning algorithms that empirically mitigate this issue. Our model captures a richer representation of career history than simple models and predicts wages more accurately. In detail, we first provide a novel set of conditions under which an estimator of the wage gap based on a fine-tuned foundation model is $\\sqrt{n}$-consistent. Building on the theory, we then propose methods for fine-tuning foundation models that minimize OVB. Using data from the Panel Study of Income Dynamics, we find that history explains more of the gender wage gap than standard econometric models can measure, and we identify elements of history that are important for reducing OVB."}, "https://arxiv.org/abs/2409.09903": {"title": "Learning large softmax mixtures with warm start EM", "link": "https://arxiv.org/abs/2409.09903", "description": "arXiv:2409.09903v1 Announce Type: cross \nAbstract: Mixed multinomial logits are discrete mixtures introduced several decades ago to model the probability of choosing an attribute from $p$ possible candidates, in heterogeneous populations. The model has recently attracted attention in the AI literature, under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number $p$ of vectors in $\\mathbb{R}^L$ to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via algorithms whose running time scales polynomially in $L$, are not known. 
This paper provides a solution to this problem for contemporary applications, such as large language models, in which the mixture has a large number $p$ of support points, and the size $N$ of the sample observed from the mixture is also large. Our proposed estimator combines two classical estimators, obtained respectively via a method of moments (MoM) and the expectation-maximization (EM) algorithm. Although both estimator types have been studied, from a theoretical perspective, for Gaussian mixtures, no similar results exist for softmax mixtures for either procedure. We develop a new MoM parameter estimator based on latent moment estimation that is tailored to our model, and provide the first theoretical analysis for a MoM-based procedure in softmax mixtures. Although consistent, MoM for softmax mixtures can exhibit poor numerical performance, as observed in other mixture models. Nevertheless, as MoM is provably in a neighborhood of the target, it can be used as a warm start for any iterative algorithm. We study in detail the EM algorithm, and provide its first theoretical analysis for softmax mixtures. Our final proposal for parameter estimation is the EM algorithm with a MoM warm start."}, "https://arxiv.org/abs/2409.09973": {"title": "Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data", "link": "https://arxiv.org/abs/2409.09973", "description": "arXiv:2409.09973v1 Announce Type: cross \nAbstract: We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation."}, "https://arxiv.org/abs/2409.10374": {"title": "Nonlinear Causality in Brain Networks: With Application to Motor Imagery vs Execution", "link": "https://arxiv.org/abs/2409.10374", "description": "arXiv:2409.10374v1 Announce Type: cross \nAbstract: One fundamental challenge of data-driven analysis in neuroscience is modeling causal interactions and exploring the connectivity of nodes in a brain network. 
Various statistical methods, relying on various perspectives and employing different data modalities, are being developed to examine and comprehend the underlying causal structures inherent to brain dynamics. This study introduces a novel statistical approach, TAR4C, to dissect causal interactions in multichannel EEG recordings. TAR4C uses the threshold autoregressive model to describe the causal interaction between nodes or clusters of nodes in a brain network. The perspective involves testing whether one node, which may represent a brain region, can control the dynamics of the other. The node that has such an impact on the other is called a threshold variable and can be classified as causative because its functionality is the leading source operating as an instantaneous switching mechanism that regulates the time-varying autoregressive structure of the other. This statistical concept is commonly referred to as threshold non-linearity. Once threshold non-linearity has been verified between a pair of nodes, the subsequent essential facet of TAR modeling is to assess the predictive ability of the causal node for the current activity on the other and represent causal interactions in autoregressive terms. This predictive ability is what underlies Granger causality. The TAR4C approach can discover non-linear and time-dependent causal interactions without negating the G-causality perspective. The efficacy of the proposed approach is exemplified by analyzing the EEG signals recorded during the motor movement/imagery experiment. The similarities and differences between the causal interactions manifesting during the execution and the imagery of a given motor movement are demonstrated by analyzing EEG recordings from multiple subjects."}, "https://arxiv.org/abs/2103.01604": {"title": "Theory of Low Frequency Contamination from Nonstationarity and Misspecification: Consequences for HAR Inference", "link": "https://arxiv.org/abs/2103.01604", "description": "arXiv:2103.01604v3 Announce Type: replace \nAbstract: We establish theoretical results about the low frequency contamination (i.e., long memory effects) induced by general nonstationarity for estimates such as the sample autocovariance and the periodogram, and deduce consequences for heteroskedasticity and autocorrelation robust (HAR) inference. We present explicit expressions for the asymptotic bias of these estimates. We distinguish cases where this contamination only occurs as a small-sample problem and cases where the contamination continues to hold asymptotically. We show theoretically that nonparametric smoothing over time is robust to low frequency contamination. Our results provide new insights on the debate between consistent versus inconsistent long-run variance (LRV) estimation. Existing LRV estimators tend to be inflated when the data are nonstationary. This results in HAR tests that can be undersized and exhibit dramatic power losses. Our theory indicates that long bandwidths or fixed-b HAR tests suffer more from low frequency contamination relative to HAR tests based on HAC estimators, whereas recently introduced double kernel HAC estimators do not suffer from this problem. 
Finally, we present second-order Edgeworth expansions under nonstationarity about the distribution of HAC and DK-HAC estimators and about the corresponding t-test in the linear regression model."}, "https://arxiv.org/abs/2103.15298": {"title": "Empirical Welfare Maximization with Constraints", "link": "https://arxiv.org/abs/2103.15298", "description": "arXiv:2103.15298v2 Announce Type: replace \nAbstract: Empirical Welfare Maximization (EWM) is a framework that can be used to select welfare program eligibility policies based on data. This paper extends EWM by allowing for uncertainty in estimating the budget needed to implement the selected policy, in addition to its welfare. Due to the additional estimation error, I show there exist no rules that achieve the highest welfare possible while satisfying a budget constraint uniformly over a wide range of DGPs. This differs from the setting without a budget constraint where uniformity is achievable. I propose an alternative trade-off rule and illustrate it with Medicaid expansion, a setting with imperfect take-up and varying program costs."}, "https://arxiv.org/abs/2306.00453": {"title": "A Gaussian Sliding Windows Regression Model for Hydrological Inference", "link": "https://arxiv.org/abs/2306.00453", "description": "arXiv:2306.00453v3 Announce Type: replace \nAbstract: Statistical models are an essential tool to model, forecast and understand the hydrological processes in watersheds. In particular, the understanding of time lags associated with the delay between rainfall occurrence and subsequent changes in streamflow is of high practical importance. Since water can take a variety of flow paths to generate streamflow, a series of distinct runoff pulses may combine to create the observed streamflow time series. Current state-of-the-art models are not able to sufficiently confront the problem complexity with interpretable parametrization, thus preventing novel insights about the dynamics of distinct flow paths from being formed. The proposed Gaussian Sliding Windows Regression Model targets this problem by combining the concept of multiple windows sliding along the time axis with multiple linear regression. The window kernels, which indicate the weights applied to different time lags, are implemented via Gaussian-shaped kernels. As a result, straightforward process inference can be achieved since each window can represent one flow path. Experiments on simulated and real-world scenarios underline that the proposed model achieves accurate parameter estimates and competitive predictive performance, while fostering explainable and interpretable hydrological modeling."}, "https://arxiv.org/abs/2306.15947": {"title": "Separable pathway effects of semi-competing risks using multi-state models", "link": "https://arxiv.org/abs/2306.15947", "description": "arXiv:2306.15947v2 Announce Type: replace \nAbstract: Semi-competing risks refer to the phenomenon where a primary event (such as mortality) can ``censor'' an intermediate event (such as relapse of a disease), but not vice versa. Under the multi-state model, the primary event consists of two specific types: the direct outcome event and an indirect outcome event developed from intermediate events. Within this framework, we show that the total treatment effect on the cumulative incidence of the primary event can be decomposed into three separable pathway effects, capturing treatment effects on population-level transition rates between states. 
We next propose two estimators for the counterfactual cumulative incidences of the primary event under hypothetical treatment components. One estimator is given by the generalized Nelson--Aalen estimator with inverse probability weighting under covariates isolation, and the other is based on the efficient influence function. The asymptotic normality of these estimators is established. The first estimator only involves a propensity score model and avoids modeling the cause-specific hazards. The second estimator has robustness against the misspecification of submodels. As an illustration of its potential usefulness, the proposed method is applied to compare effects of different allogeneic stem cell transplantation types on overall survival after transplantation."}, "https://arxiv.org/abs/2311.00528": {"title": "On the Comparative Analysis of Average Treatment Effects Estimation via Data Combination", "link": "https://arxiv.org/abs/2311.00528", "description": "arXiv:2311.00528v2 Announce Type: replace \nAbstract: There is growing interest in exploring causal effects in target populations via data combination. However, most approaches are tailored to specific settings and lack comprehensive comparative analyses. In this article, we focus on a typical scenario involving a source dataset and a target dataset. We first design six settings under covariate shift and conduct a comparative analysis by deriving the semiparametric efficiency bounds for the ATE in the target population. We then extend this analysis to six new settings that incorporate both covariate shift and posterior drift. Our study uncovers the key factors that influence efficiency gains and the ``effective sample size'' when combining two datasets, with a particular emphasis on the roles of the variance ratio of potential outcomes between datasets and the derivatives of the posterior drift function. To the best of our knowledge, this is the first paper that explicitly explores the role of the posterior drift functions in causal inference. Additionally, we propose novel methods for conducting sensitivity analysis to address violations of transportability between the two datasets. We empirically validate our findings by constructing locally efficient estimators and conducting extensive simulations. We demonstrate the proposed methods in two real-world applications."}, "https://arxiv.org/abs/2110.10650": {"title": "Attention Overload", "link": "https://arxiv.org/abs/2110.10650", "description": "arXiv:2110.10650v4 Announce Type: replace-cross \nAbstract: We introduce an Attention Overload Model that captures the idea that alternatives compete for the decision maker's attention, and hence the attention that each alternative receives decreases as the choice problem becomes larger. Using this nonparametric restriction on the random attention formation, we show that a fruitful revealed preference theory can be developed and provide testable implications on the observed choice behavior that can be used to (point or partially) identify the decision maker's preference and attention frequency. We then enhance our attention overload model to accommodate heterogeneous preferences. Due to the nonparametric nature of our identifying assumption, we must discipline the amount of heterogeneity in the choice model: we propose the idea of List-based Attention Overload, where alternatives are presented to the decision makers as a list that correlates with both heterogeneous preferences and random attention. 
We show that preference and attention frequencies are (point or partially) identifiable under nonparametric assumptions on the list and attention formation mechanisms, even when the true underlying list is unknown to the researcher. Building on our identification results, for both preference and attention frequencies, we develop econometric methods for estimation and inference that are valid in settings with a large number of alternatives and choice problems, a distinctive feature of the economic environment we consider. We provide a software package in R implementing our empirical methods, and illustrate them in a simulation study."}, "https://arxiv.org/abs/2206.10240": {"title": "Core-Elements for Large-Scale Least Squares Estimation", "link": "https://arxiv.org/abs/2206.10240", "description": "arXiv:2206.10240v4 Announce Type: replace-cross \nAbstract: The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample and has found extensive applications in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome this limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\\mathrm{nnz}(X)+rp^2)$ computational cost, where $X$ is an $n\\times p$ predictor matrix, $r$ is the number of elements selected from each column of $X$, and $\\mathrm{nnz}(\\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors."}, "https://arxiv.org/abs/2307.02616": {"title": "Federated Epidemic Surveillance", "link": "https://arxiv.org/abs/2307.02616", "description": "arXiv:2307.02616v2 Announce Type: replace-cross \nAbstract: Epidemic surveillance is a challenging task, especially when crucial data is fragmented across institutions and data custodians are unable or unwilling to share it. This study aims to explore the feasibility of a simple federated surveillance approach. The idea is to conduct hypothesis tests for a rise in counts behind each custodian's firewall and then combine p-values from these tests using techniques from meta-analysis. We propose a hypothesis testing framework to identify surges in epidemic-related data streams and conduct experiments on real and semi-synthetic data to assess the power of different p-value combination methods to detect surges without needing to combine the underlying counts. 
Our findings show that relatively simple combination methods achieve a high degree of fidelity and suggest that infectious disease outbreaks can be detected without needing to share even aggregate data across institutions."}, "https://arxiv.org/abs/2307.10272": {"title": "A Shrinkage Likelihood Ratio Test for High-Dimensional Subgroup Analysis with a Logistic-Normal Mixture Model", "link": "https://arxiv.org/abs/2307.10272", "description": "arXiv:2307.10272v2 Announce Type: replace-cross \nAbstract: In subgroup analysis, testing the existence of a subgroup with a differential treatment effect serves as protection against spurious subgroup discovery. Despite its importance, this hypothesis testing possesses a complicated nature: parameter characterizing subgroup classification is not identified under the null hypothesis of no subgroup. Due to this irregularity, the existing methods have the following two limitations. First, the asymptotic null distribution of test statistics often takes an intractable form, which necessitates computationally demanding resampling methods to calculate the critical value. Second, the dimension of personal attributes characterizing subgroup membership is not allowed to be of high dimension. To solve these two problems simultaneously, this study develops a shrinkage likelihood ratio test for the existence of a subgroup using a logistic-normal mixture model. The proposed test statistics are built on a modified likelihood function that shrinks possibly high-dimensional unidentified parameters toward zero under the null hypothesis while retaining power under the alternative."}, "https://arxiv.org/abs/2401.04778": {"title": "Generative neural networks for characteristic functions", "link": "https://arxiv.org/abs/2401.04778", "description": "arXiv:2401.04778v2 Announce Type: replace-cross \nAbstract: We provide a simulation algorithm to simulate from a (multivariate) characteristic function, which is only accessible in a black-box format. The method is based on a generative neural network, whose loss function exploits a specific representation of the Maximum-Mean-Discrepancy metric to directly incorporate the targeted characteristic function. The algorithm is universal in the sense that it is independent of the dimension and that it does not require any assumptions on the given characteristic function. Furthermore, finite sample guarantees on the approximation quality in terms of the Maximum-Mean Discrepancy metric are derived. The method is illustrated in a simulation study."}, "https://arxiv.org/abs/2409.10750": {"title": "GPT takes the SAT: Tracing changes in Test Difficulty and Math Performance of Students", "link": "https://arxiv.org/abs/2409.10750", "description": "arXiv:2409.10750v1 Announce Type: new \nAbstract: Scholastic Aptitude Test (SAT) is crucial for college admissions but its effectiveness and relevance are increasingly questioned. This paper enhances Synthetic Control methods by introducing \"Transformed Control\", a novel method that employs Large Language Models (LLMs) powered by Artificial Intelligence to generate control groups. We utilize OpenAI's API to generate a control group where GPT-4, or ChatGPT, takes multiple SATs annually from 2008 to 2023. This control group helps analyze shifts in SAT math difficulty over time, starting from the baseline year of 2008. Using parallel trends, we calculate the Average Difference in Scores (ADS) to assess changes in high school students' math performance. 
Our results indicate a significant decrease in the difficulty of the SAT math section over time, alongside a decline in students' math performance. The analysis shows a 71-point drop in the rigor of SAT math from 2008 to 2023, with student performance decreasing by 36 points, resulting in a 107-point total divergence in average student math performance. We investigate possible mechanisms for this decline in math proficiency, such as changing university selection criteria, increased screen time, grade inflation, and worsening adolescent mental health. Disparities among demographic groups show a 104-point drop for White students, 84 points for Black students, and 53 points for Asian students. Male students saw a 117-point reduction, while female students had a 100-point decrease."}, "https://arxiv.org/abs/2409.10771": {"title": "Flexible survival regression with variable selection for heterogeneous population", "link": "https://arxiv.org/abs/2409.10771", "description": "arXiv:2409.10771v1 Announce Type: new \nAbstract: Survival regression is widely used to model time-to-event data, to explore how covariates may influence the occurrence of events. Modern datasets often encompass a vast number of covariates across many subjects, with only a subset of the covariates significantly affecting survival. Additionally, subjects often belong to an unknown number of latent groups, where covariate effects on survival differ significantly across groups. The proposed methodology addresses both challenges by simultaneously identifying the latent sub-groups in the heterogeneous population and evaluating covariate significance within each sub-group. This approach is shown to enhance the predictive accuracy for time-to-event outcomes, by uncovering varying risk profiles within the underlying heterogeneous population, and is thereby helpful to devise targeted disease management strategies."}, "https://arxiv.org/abs/2409.10812": {"title": "Statistical Inference for Chi-square Statistics or F-Statistics Based on Multiple Imputation", "link": "https://arxiv.org/abs/2409.10812", "description": "arXiv:2409.10812v1 Announce Type: new \nAbstract: Missing data is a common issue in medical, psychiatric, and social studies. In the literature, Multiple Imputation (MI) has been proposed to multiply impute datasets and combine analysis results from imputed datasets for statistical inference using Rubin's rule. However, Rubin's rule only works for combined inference on statistical tests with point and variance estimates and cannot be used to combine general F-statistics or Chi-square statistics. In this manuscript, we provide a solution to combine F-test statistics from multiply imputed datasets, when the F-statistic has an explicit fractional form (that is, both the numerator and denominator of the F-statistic are reported). Then we extend the method to combine Chi-square statistics from multiply imputed datasets. Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA and Type-III tests of fixed effects in mixed effects models, which do not have the explicit fractional form. 
SAS macros are also developed to facilitate applications."}, "https://arxiv.org/abs/2409.10820": {"title": "Simple robust two-stage estimation and inference for generalized impulse responses and multi-horizon causality", "link": "https://arxiv.org/abs/2409.10820", "description": "arXiv:2409.10820v1 Announce Type: new \nAbstract: This paper introduces a novel two-stage estimation and inference procedure for generalized impulse responses (GIRs). GIRs encompass all coefficients in a multi-horizon linear projection model of future outcomes of y on lagged values (Dufour and Renault, 1998), which include Sims' impulse response. The conventional use of Least Squares (LS) with heteroskedasticity- and autocorrelation-consistent covariance estimation is less precise and often results in unreliable finite sample tests, further complicated by the selection of bandwidth and kernel functions. Our two-stage method surpasses the LS approach in terms of estimation efficiency and inference robustness. The robustness stems from our proposed covariance matrix estimates, which eliminate the need to correct for serial correlation in the multi-horizon projection residuals. Our method accommodates non-stationary data and allows the projection horizon to grow with sample size. Monte Carlo simulations demonstrate that our two-stage method outperforms the LS method. We apply the two-stage method to investigate the GIRs, implement a multi-horizon Granger causality test, and find that economic uncertainty exerts both short-run (1-3 months) and long-run (30 months) effects on economic activities."}, "https://arxiv.org/abs/2409.10835": {"title": "BMRMM: An R Package for Bayesian Markov (Renewal) Mixed Models", "link": "https://arxiv.org/abs/2409.10835", "description": "arXiv:2409.10835v1 Announce Type: new \nAbstract: We introduce the BMRMM package implementing Bayesian inference for a class of Markov renewal mixed models which can characterize the stochastic dynamics of a collection of sequences, each comprising alternative instances of categorical states and associated continuous duration times, while being influenced by a set of exogenous factors as well as a 'random' individual. The default setting flexibly models the state transition probabilities using mixtures of Dirichlet distributions and the duration times using mixtures of gamma kernels while also allowing variable selection for both. Modeling such data using simpler Markov mixed models also remains an option, either by ignoring the duration times altogether or by replacing them with instances of an additional category obtained by discretizing them by a user-specified unit. The option is also useful when data on duration times may not be available in the first place. We demonstrate the package's utility using two data sets."}, "https://arxiv.org/abs/2409.10855": {"title": "Calibrated Multivariate Regression with Localized PIT Mappings", "link": "https://arxiv.org/abs/2409.10855", "description": "arXiv:2409.10855v1 Announce Type: new \nAbstract: Calibration ensures that predicted uncertainties align with observed uncertainties. While there is an extensive literature on recalibration methods for univariate probabilistic forecasts, work on calibration for multivariate forecasts is much more limited. This paper introduces a novel post-hoc recalibration approach that addresses multivariate calibration for potentially misspecified models. 
Our method involves constructing local mappings between vectors of marginal probability integral transform values and the space of observations, providing a flexible and model-free solution applicable to continuous, discrete, and mixed responses. We present two versions of our approach: one uses K-nearest neighbors, and the other uses normalizing flows. Each method has its own strengths in different situations. We demonstrate the effectiveness of our approach on two real data applications: recalibrating a deep neural network's currency exchange rate forecast and improving a regression model for childhood malnutrition in India for which the multivariate response has both discrete and continuous components."}, "https://arxiv.org/abs/2409.10860": {"title": "Cointegrated Matrix Autoregression Models", "link": "https://arxiv.org/abs/2409.10860", "description": "arXiv:2409.10860v1 Announce Type: new \nAbstract: We propose a novel cointegrated autoregressive model for matrix-valued time series, with bi-linear cointegrating vectors corresponding to the rows and columns of the matrix data. Compared to the traditional cointegration analysis, our proposed matrix cointegration model better preserves the inherent structure of the data and enables corresponding interpretations. To estimate the cointegrating vectors as well as other coefficients, we introduce two types of estimators based on least squares and maximum likelihood. We investigate the asymptotic properties of the cointegrated matrix autoregressive model under the existence of trend and establish the asymptotic distributions for the cointegrating vectors, as well as other model parameters. We conduct extensive simulations to demonstrate its superior performance over traditional methods. In addition, we apply our proposed model to Fama-French portfolios and develop an effective pairs trading strategy."}, "https://arxiv.org/abs/2409.10943": {"title": "Comparison of g-estimation approaches for handling symptomatic medication at multiple timepoints in Alzheimer's Disease with a hypothetical strategy", "link": "https://arxiv.org/abs/2409.10943", "description": "arXiv:2409.10943v1 Announce Type: new \nAbstract: For handling intercurrent events in clinical trials, one of the strategies outlined in the ICH E9(R1) addendum targets the hypothetical scenario of non-occurrence of the intercurrent event. While this strategy is often implemented by setting data after the intercurrent event to missing even if they have been collected, g-estimation allows for a more efficient estimation by using the information contained in post-IE data. As the g-estimation methods have largely been developed outside of randomised clinical trials, optimisations for the application in clinical trials are possible. In this work, we describe and investigate the performance of modifications to the established g-estimation methods, leveraging the assumption that some intercurrent events are expected to have the same impact on the outcome regardless of the timing of their occurrence. 
In a simulation study in Alzheimer's disease, the modifications show a substantial efficiency advantage for the estimation of an estimand that applies the hypothetical strategy to the use of symptomatic treatment while retaining unbiasedness and adequate type I error control."}, "https://arxiv.org/abs/2409.11040": {"title": "Estimation and imputation of missing data in longitudinal models with Zero-Inflated Poisson response variable", "link": "https://arxiv.org/abs/2409.11040", "description": "arXiv:2409.11040v1 Announce Type: new \nAbstract: This research deals with the estimation and imputation of missing data in longitudinal models with a zero-inflated Poisson response variable. A methodology is proposed that is based on the use of maximum likelihood, assuming that data is missing at random and that there is a correlation between the response variables. At each time point, the expectation-maximization (EM) algorithm is used: in the E step, a weighted regression is carried out, conditioned on the previous times that are taken as covariates. In the M step, the estimation and imputation of the missing data are performed. The good performance of the methodology in different loss scenarios is demonstrated in a simulation study comparing the model with one using only complete data, and with estimating missing data using the mode of the data of each individual. Furthermore, the methodology is tested on real data from a study related to the growth of corn, demonstrating the algorithm in a practical scenario."}, "https://arxiv.org/abs/2409.11134": {"title": "E-Values for Exponential Families: the General Case", "link": "https://arxiv.org/abs/2409.11134", "description": "arXiv:2409.11134v1 Announce Type: new \nAbstract: We analyze common types of e-variables and e-processes for composite exponential family nulls: the optimal e-variable based on the reverse information projection (RIPr), the conditional (COND) e-variable, and the universal inference (UI) and sequentialized RIPr e-processes. We characterize the RIPr prior for simple and Bayes-mixture based alternatives, either precisely (for Gaussian nulls and alternatives) or in an approximate sense (general exponential families). We provide conditions under which the RIPr e-variable is (again exactly vs. approximately) equal to the COND e-variable. Based on these and other interrelations which we establish, we determine the e-power of the four e-statistics as a function of sample size, exactly for Gaussian and up to $o(1)$ in general. For $d$-dimensional null and alternative, the e-power of UI tends to be smaller by a term of $(d/2) \\log n + O(1)$ than that of the COND e-variable, which is the clear winner."}, "https://arxiv.org/abs/2409.11162": {"title": "Chasing Shadows: How Implausible Assumptions Skew Our Understanding of Causal Estimands", "link": "https://arxiv.org/abs/2409.11162", "description": "arXiv:2409.11162v1 Announce Type: new \nAbstract: The ICH E9 (R1) addendum on estimands, coupled with recent advancements in causal inference, has prompted a shift towards using model-free treatment effect estimands that are more closely aligned with the underlying scientific question. This represents a departure from traditional, model-dependent approaches where the statistical model often overshadows the inquiry itself. While this shift is a positive development, it has unintentionally led to the prioritization of an estimand's theoretical appeal over its practical learnability from data under plausible assumptions. 
We illustrate this by scrutinizing assumptions in the recent clinical trials literature on principal stratum estimands, demonstrating that some popular assumptions are not only implausible but often inevitably violated. We advocate for a more balanced approach to estimand formulation, one that carefully considers both the scientific relevance and the practical feasibility of estimation under realistic conditions."}, "https://arxiv.org/abs/2409.11167": {"title": "Poisson and Gamma Model Marginalisation and Marginal Likelihood calculation using Moment-generating Functions", "link": "https://arxiv.org/abs/2409.11167", "description": "arXiv:2409.11167v1 Announce Type: new \nAbstract: We present a new analytical method to derive the likelihood function that has the population of parameters marginalised out in Bayesian hierarchical models. This method is also useful to find the marginal likelihoods in Bayesian models or in random-effect linear mixed models. The key to this method is to take high-order (sometimes fractional) derivatives of the prior moment-generating function if particular existence and differentiability conditions hold.\n In particular, this analytical method assumes that the likelihood is either Poisson or gamma. Under Poisson likelihoods, the observed Poisson count determines the order of the derivative. Under gamma likelihoods, the shape parameter, which is assumed to be known, determines the order of the fractional derivative.\n We also present some examples validating this new analytical method."}, "https://arxiv.org/abs/2409.11265": {"title": "Performance of Cross-Validated Targeted Maximum Likelihood Estimation", "link": "https://arxiv.org/abs/2409.11265", "description": "arXiv:2409.11265v1 Announce Type: new \nAbstract: Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for statistical inference. However, in situations where there is not differentiability due to data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, TMLE variance can suffer from inflation of the type I error and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve on performance compared to TMLE in settings of positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings.\n Methods: We utilised the data-generating mechanism as described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. Then, we evaluated the respective statistical performances of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods.\n Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees using standard TMLE with ensemble super learner-based initial estimates increases bias and variance leading to invalid statistical inference.\n Conclusions: It has been shown that when using CVTMLE the Donsker class condition is no longer necessary to obtain valid statistical inference when using regression trees and under either data sparsity or near-positivity violations. 
We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference in cases where the super learner library uses more flexible candidates and is prone to overfitting."}, "https://arxiv.org/abs/2409.11385": {"title": "Probability-scale residuals for event-time data", "link": "https://arxiv.org/abs/2409.11385", "description": "arXiv:2409.11385v1 Announce Type: new \nAbstract: The probability-scale residual (PSR) is defined as $E\\{sign(y, Y^*)\\}$, where $y$ is the observed outcome and $Y^*$ is a random variable from the fitted distribution. The PSR is particularly useful for ordinal and censored outcomes for which fitted values are not available without additional assumptions. Previous work has defined the PSR for continuous, binary, ordinal, right-censored, and current status outcomes; however, development of the PSR has not yet been considered for data subject to general interval censoring. We develop extensions of the PSR, first to mixed-case interval-censored data, and then to data subject to several types of common censoring schemes. We derive the statistical properties of the PSR and show that our more general PSR encompasses several previously defined PSR for continuous and censored outcomes as special cases. The performance of the residual is illustrated in real data from the Caribbean, Central, and South American Network for HIV Epidemiology."}, "https://arxiv.org/abs/2409.11080": {"title": "Data-driven stochastic 3D modeling of the nanoporous binder-conductive additive phase in battery cathodes", "link": "https://arxiv.org/abs/2409.11080", "description": "arXiv:2409.11080v1 Announce Type: cross \nAbstract: A stochastic 3D modeling approach for the nanoporous binder-conductive additive phase in hierarchically structured cathodes of lithium-ion batteries is presented. The binder-conductive additive phase of these electrodes consists of carbon black, polyvinylidene difluoride binder and graphite particles. For its stochastic 3D modeling, a three-step procedure based on methods from stochastic geometry is used. First, the graphite particles are described by a Boolean model with ellipsoidal grains. Second, the mixture of carbon black and binder is modeled by an excursion set of a Gaussian random field in the complement of the graphite particles. Third, large pore regions within the mixture of carbon black and binder are described by a Boolean model with spherical grains. The model parameters are calibrated to 3D image data of cathodes in lithium-ion batteries acquired by focused ion beam scanning electron microscopy. Subsequently, model validation is performed by comparing model realizations with measured image data in terms of various morphological descriptors that are not used for model fitting. Finally, we use the stochastic 3D model for predictive simulations, where we generate virtual, yet realistic, image data of nanoporous binder-conductive additives with varying amounts of graphite particles. Based on these virtual nanostructures, we can investigate structure-property relationships. 
In particular, we quantitatively study the influence of graphite particles on effective transport properties in the nanoporous binder-conductive additive phase, which have a crucial impact on electrochemical processes in the cathode and thus on the performance of battery cells."}, "https://arxiv.org/abs/2305.12616": {"title": "Conformal Prediction With Conditional Guarantees", "link": "https://arxiv.org/abs/2305.12616", "description": "arXiv:2305.12616v4 Announce Type: replace \nAbstract: We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of pre-specified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions."}, "https://arxiv.org/abs/2307.12325": {"title": "A Robust Framework for Graph-based Two-Sample Tests Using Weights", "link": "https://arxiv.org/abs/2307.12325", "description": "arXiv:2307.12325v3 Announce Type: replace \nAbstract: Graph-based tests are a class of non-parametric two-sample tests useful for analyzing high-dimensional data. The framework offers both flexibility and power in a wide-range of testing scenarios. The test statistics are constructed from similarity graphs (such as K-nearest neighbor graphs) and consequently, their performance is sensitive to the structure of the graph. When the graph has problematic structures, as is common for high-dimensional data, this can result in poor or unstable performance among existing graph-based tests. We address this challenge and develop graph-based test statistics that are robust to problematic structures of the graph. The limiting null distribution of the robust test statistics is derived. We illustrate the new tests via simulation studies and a real-world application on Chicago taxi trip-data."}, "https://arxiv.org/abs/2012.04485": {"title": "Occupational segregation in a Roy model with composition preferences", "link": "https://arxiv.org/abs/2012.04485", "description": "arXiv:2012.04485v3 Announce Type: replace-cross \nAbstract: We propose a model of labor market sector self-selection that combines comparative advantage, as in the Roy model, and sector composition preference. Two groups choose between two sectors based on heterogeneous potential incomes and group compositions in each sector. 
Potential incomes incorporate group specific human capital accumulation and wage discrimination. Composition preferences are interpreted as reflecting group specific amenity preferences as well as homophily and aversion to minority status. We show that occupational segregation is amplified by the composition preferences and we highlight a resulting tension between redistribution and diversity. The model also exhibits tipping from extreme compositions to more balanced ones. Tipping occurs when a small nudge, associated with affirmative action, pushes the system to a very different equilibrium, and when the set of equilibria changes abruptly when a parameter governing the relative importance of pecuniary and composition preferences crosses a threshold."}, "https://arxiv.org/abs/2306.02821": {"title": "A unified analysis of likelihood-based estimators in the Plackett--Luce model", "link": "https://arxiv.org/abs/2306.02821", "description": "arXiv:2306.02821v3 Announce Type: replace-cross \nAbstract: The Plackett--Luce model has been extensively used for rank aggregation in social choice theory. A central question in this model concerns estimating the utility vector that governs the model's likelihood. In this paper, we investigate the asymptotic theory of utility vector estimation by maximizing different types of likelihood, such as full, marginal, and quasi-likelihood. Starting from interpreting the estimating equations of these estimators to gain some initial insights, we analyze their asymptotic behavior as the number of compared objects increases. In particular, we establish both the uniform consistency and asymptotic normality of these estimators and discuss the trade-off between statistical efficiency and computational complexity. For generality, our results are proven for deterministic graph sequences under appropriate graph topology conditions. These conditions are shown to be revealing and sharp when applied to common sampling scenarios, such as nonuniform random hypergraph models and hypergraph stochastic block models. Numerical results are provided to support our findings."}, "https://arxiv.org/abs/2306.07819": {"title": "False discovery proportion envelopes with m-consistency", "link": "https://arxiv.org/abs/2306.07819", "description": "arXiv:2306.07819v2 Announce Type: replace-cross \nAbstract: We provide new non-asymptotic false discovery proportion (FDP) confidence envelopes in several multiple testing settings relevant for modern high dimensional-data methods. We revisit the multiple testing scenarios considered in the recent work of Katsevich and Ramdas (2020): top-$k$, preordered (including knockoffs), online. Our emphasis is on obtaining FDP confidence bounds that both have non-asymptotic coverage and are asymptotically accurate in a specific sense, as the number $m$ of tested hypotheses grows. Namely, we introduce and study the property (which we call $m$-consistency) that the confidence bound converges to or below the desired level $\\alpha$ when applied to a specific reference $\\alpha$-level false discovery rate (FDR) controlling procedure. In this perspective, we derive new bounds that provide improvements over existing ones, both theoretically and practically, and are suitable for situations where at least a moderate number of rejections is expected. These improvements are illustrated with numerical experiments and real data examples. In particular, the improvement is significant in the knockoffs setting, which shows the impact of the method for a practical use. 
As side results, we introduce a new confidence envelope for the empirical cumulative distribution function of i.i.d. uniform variables, and we provide new power results in sparse cases, both being of independent interest."}, "https://arxiv.org/abs/2409.11497": {"title": "Decomposing Gaussians with Unknown Covariance", "link": "https://arxiv.org/abs/2409.11497", "description": "arXiv:2409.11497v1 Announce Type: new \nAbstract: Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable."}, "https://arxiv.org/abs/2409.11525": {"title": "Interpretability Indices and Soft Constraints for Factor Models", "link": "https://arxiv.org/abs/2409.11525", "description": "arXiv:2409.11525v1 Announce Type: new \nAbstract: Factor analysis is a way to characterize the relationships between many (observable) variables in terms of a smaller number of unobservable random variables which are called factors. However, the application of factor models and its success can be subjective or difficult to gauge, since infinitely many factor models that produce the same correlation matrix can be fit given sample data. Thus, there is a need to operationalize a criterion that measures how meaningful or \"interpretable\" a factor model is in order to select the best among many factor models.\n While there are already techniques that aim to measure and enhance interpretability, new indices, as well as rotation methods via mathematical optimization based on them, are proposed to measure interpretability. The proposed methods directly incorporate semantics with the help of natural language processing and are generalized to incorporate any \"prior information\". Moreover, the indices allow for complete or partial specification of relationships at a pairwise level. 
Aside from these, two other main benefits of the proposed methods are that they do not require the estimation of factor scores, which avoids the factor score indeterminacy problem, and that no additional explanatory variables are necessary.\n The implementation of the proposed methods is written in Python 3 and is made available together with several helper functions through the package interpretablefa on the Python Package Index. The methods' application is demonstrated here using data on the Experiences in Close Relationships Scale, obtained from the Open-Source Psychometrics Project."}, "https://arxiv.org/abs/2409.11701": {"title": "Bias Reduction in Matched Observational Studies with Continuous Treatments: Calipered Non-Bipartite Matching and Bias-Corrected Estimation and Inference", "link": "https://arxiv.org/abs/2409.11701", "description": "arXiv:2409.11701v1 Announce Type: new \nAbstract: Matching is a commonly used causal inference framework in observational studies. By pairing individuals with different treatment values but with the same values of covariates (i.e., exact matching), the sample average treatment effect (SATE) can be consistently estimated and inferred using the classic Neyman-type (difference-in-means) estimator and confidence interval. However, inexact matching typically exists in practice and may cause substantial bias for the downstream treatment effect estimation and inference. Many methods have been proposed to reduce bias due to inexact matching in the binary treatment case. However, to our knowledge, no existing work has systematically investigated bias due to inexact matching in the continuous treatment case. To fill this blank, we propose a general framework for reducing bias in inexactly matched observational studies with continuous treatments. In the matching stage, we propose a carefully formulated caliper that incorporates the information of both the paired covariates and treatment doses to better tailor matching for the downstream SATE estimation and inference. In the estimation and inference stage, we propose a bias-corrected Neyman estimator paired with the corresponding bias-corrected variance estimator to leverage the information on propensity density discrepancies after inexact matching to further reduce the bias due to inexact matching. We apply our proposed framework to COVID-19 social mobility data to showcase differences between classic and bias-corrected SATE estimation and inference."}, "https://arxiv.org/abs/2409.11967": {"title": "Incremental effects for continuous exposures", "link": "https://arxiv.org/abs/2409.11967", "description": "arXiv:2409.11967v1 Announce Type: new \nAbstract: Causal inference problems often involve continuous treatments, such as dose, duration, or frequency. However, continuous exposures bring many challenges, both with identification and estimation. For example, identifying standard dose-response estimands requires that everyone has some chance of receiving any particular level of the exposure (i.e., positivity). In this work, we explore an alternative approach: rather than estimating dose-response curves, we consider stochastic interventions based on exponentially tilting the treatment distribution by some parameter $\\delta$, which we term an incremental effect. This increases or decreases the likelihood a unit receives a given treatment level, and crucially, does not require positivity for identification. 
We begin by deriving the efficient influence function and semiparametric efficiency bound for these incremental effects under continuous exposures. We then show that estimation of the incremental effect is dependent on the size of the exponential tilt, as measured by $\\delta$. In particular, we derive new minimax lower bounds illustrating how the best possible root mean squared error scales with an effective sample size of $n/\\delta$, instead of usual sample size $n$. Further, we establish new convergence rates and bounds on the bias of double machine learning-style estimators. Our novel analysis gives a better dependence on $\\delta$ compared to standard analyses, by using mixed supremum and $L_2$ norms, instead of just $L_2$ norms from Cauchy-Schwarz bounds. Finally, we show that taking $\\delta \\to \\infty$ gives a new estimator of the dose-response curve at the edge of the support, and we give a detailed study of convergence rates in this regime."}, "https://arxiv.org/abs/2409.12081": {"title": "Optimising the Trade-Off Between Type I and Type II Errors: A Review and Extensions", "link": "https://arxiv.org/abs/2409.12081", "description": "arXiv:2409.12081v1 Announce Type: new \nAbstract: In clinical studies upon which decisions are based there are two types of errors that can be made: a type I error arises when the decision is taken to declare a positive outcome when the truth is in fact negative, and a type II error arises when the decision is taken to declare a negative outcome when the truth is in fact positive. Commonly the primary analysis of such a study entails a two-sided hypothesis test with a type I error rate of 5% and the study is designed to have a sufficiently low type II error rate, for example 10% or 20%. These values are arbitrary and often do not reflect the clinical, or regulatory, context of the study and ignore both the relative costs of making either type of error and the sponsor's prior belief that the drug is superior to either placebo, or a standard of care if relevant. This simplistic approach has recently been challenged by numerous authors both from a frequentist and Bayesian perspective since when resources are constrained there will be a need to consider a trade-off between type I and type II errors. In this paper we review proposals to utilise the trade-off by formally acknowledging the costs to optimise the choice of error rates for simple, point null and alternative hypotheses and extend the results to composite, or interval hypotheses, showing links to the Probability of Success of a clinical study."}, "https://arxiv.org/abs/2409.12173": {"title": "Poisson approximate likelihood compared to the particle filter", "link": "https://arxiv.org/abs/2409.12173", "description": "arXiv:2409.12173v1 Announce Type: new \nAbstract: Filtering algorithms are fundamental for inference on partially observed stochastic dynamic systems, since they provide access to the likelihood function and hence enable likelihood-based or Bayesian inference. A novel Poisson approximate likelihood (PAL) filter was introduced by Whitehouse et al. (2023). PAL employs a Poisson approximation to conditional densities, offering a fast approximation to the likelihood function for a certain subset of partially observed Markov process models. A central piece of evidence for PAL is the comparison in Table 1 of Whitehouse et al. (2023), which claims a large improvement for PAL over a standard particle filter algorithm. 
This evidence, based on a model and data from a previous scientific study by Stocks et al. (2020), might suggest that researchers confronted with similar models should use PAL rather than particle filter methods. Taken at face value, this evidence also reduces the credibility of Stocks et al. (2020) by indicating a shortcoming with the numerical methods that they used. However, we show that the comparison of log-likelihood values made by Whitehouse et al. (2023) is flawed because their PAL calculations were carried out using a dataset scaled differently from the previous study. If PAL and the particle filter are applied to the same data, the advantage claimed for PAL disappears. On simulations where the model is correctly specified, the particle filter outperforms PAL."}, "https://arxiv.org/abs/2409.11658": {"title": "Forecasting age distribution of life-table death counts via {\\alpha}-transformation", "link": "https://arxiv.org/abs/2409.11658", "description": "arXiv:2409.11658v1 Announce Type: cross \nAbstract: We introduce a compositional power transformation, known as an {\\alpha}-transformation, to model and forecast a time series of life-table death counts, possibly with zero counts observed at older ages. As a generalisation of the isometric log-ratio transformation (i.e., {\\alpha} = 0), the {\\alpha} transformation relies on the tuning parameter {\\alpha}, which can be determined in a data-driven manner. Using the Australian age-specific period life-table death counts from 1921 to 2020, the {\\alpha} transformation can produce more accurate short-term point and interval forecasts than the log-ratio transformation. The improved forecast accuracy of life-table death counts is of great importance to demographers and government planners for estimating survival probabilities and life expectancy and actuaries for determining annuity prices and reserves for various initial ages and maturity terms."}, "https://arxiv.org/abs/2204.02872": {"title": "Cluster randomized trials designed to support generalizable inferences", "link": "https://arxiv.org/abs/2204.02872", "description": "arXiv:2204.02872v2 Announce Type: replace \nAbstract: Background: When planning a cluster randomized trial, evaluators often have access to an enumerated cohort representing the target population of clusters. Practicalities of conducting the trial, such as the need to oversample clusters with certain characteristics to improve trial economy or to support inference about subgroups of clusters, may preclude simple random sampling from the cohort into the trial, and thus interfere with the goal of producing generalizable inferences about the target population.\n Methods: We describe a nested trial design where the randomized clusters are embedded within a cohort of trial-eligible clusters from the target population and where clusters are selected for inclusion in the trial with known sampling probabilities that may depend on cluster characteristics (e.g., allowing clusters to be chosen to facilitate trial conduct or to examine hypotheses related to their characteristics). We develop and evaluate methods for analyzing data from this design to generalize causal inferences to the target population underlying the cohort.\n Results: We present identification and estimation results for the expectation of the average potential outcome and for the average treatment effect, in the entire target population of clusters and in its non-randomized subset. 
In simulation studies we show that all the estimators have low bias but markedly different precision.\n Conclusions: Cluster randomized trials where clusters are selected for inclusion with known sampling probabilities that depend on cluster characteristics, combined with efficient estimation methods, can precisely quantify treatment effects in the target population, while addressing objectives of trial conduct that require oversampling clusters on the basis of their characteristics."}, "https://arxiv.org/abs/2307.15205": {"title": "A new robust graph for graph-based methods", "link": "https://arxiv.org/abs/2307.15205", "description": "arXiv:2307.15205v3 Announce Type: replace \nAbstract: Graph-based two-sample tests and change-point detection are powerful tools for analyzing high-dimensional and non-Euclidean data, as they do not impose distributional assumptions and perform effectively across a wide range of scenarios. These methods utilize a similarity graph constructed from the observations, with $K$-nearest neighbor graphs or $K$-minimum spanning trees being the current state-of-the-art choices. However, in high-dimensional settings, these graphs tend to form hubs -- nodes with disproportionately large degrees -- and graph-based methods are sensitive to hubs. To address this issue, we propose a robust graph that is significantly less prone to forming hubs in high-dimensional settings. Incorporating this robust graph can substantially improve the power of graph-based methods across various scenarios. Furthermore, we establish a theoretical foundation for graph-based methods using the proposed robust graph, demonstrating its consistency under fixed alternatives in both low-dimensional and high-dimensional contexts."}, "https://arxiv.org/abs/2401.04036": {"title": "A regularized MANOVA test for semicontinuous high-dimensional data", "link": "https://arxiv.org/abs/2401.04036", "description": "arXiv:2401.04036v2 Announce Type: replace \nAbstract: We propose a MANOVA test for semicontinuous data that is applicable also when the dimensionality exceeds the sample size. The test statistic is obtained as a likelihood ratio, where numerator and denominator are computed at the maxima of penalized likelihood functions under each hypothesis. Closed form solutions for the regularized estimators allow us to avoid computational overheads. We derive the null distribution using a permutation scheme. The power and level of the resulting test are evaluated in a simulation study. We illustrate the new methodology with two original data analyses, one regarding microRNA expression in human blastocyst cultures, and another regarding alien plant species invasion in the island of Socotra (Yemen)."}, "https://arxiv.org/abs/2205.08609": {"title": "Bagged Polynomial Regression and Neural Networks", "link": "https://arxiv.org/abs/2205.08609", "description": "arXiv:2205.08609v2 Announce Type: replace-cross \nAbstract: Series and polynomial regression are able to approximate the same function classes as neural networks. However, these methods are rarely used in practice, although they offer more interpretability than neural networks. In this paper, we show that a potential reason for this is the slow convergence rate of polynomial regression estimators and propose the use of \\textit{bagged} polynomial regression (BPR) as an attractive alternative to neural networks. Theoretically, we derive new finite sample and asymptotic $L^2$ convergence rates for series estimators. 
We show that the rates can be improved in smooth settings by splitting the feature space and generating polynomial features separately for each partition. Empirically, we show that our proposed estimator, the BPR, can perform as well as more complex models with more parameters. Our estimator also performs close to state-of-the-art prediction methods in the benchmark MNIST handwritten digit dataset. We demonstrate that BPR performs as well as neural networks in crop classification using satellite data, a setting where prediction accuracy is critical and interpretability is often required for addressing research questions."}, "https://arxiv.org/abs/2311.04318": {"title": "Estimation for multistate models subject to reporting delays and incomplete event adjudication", "link": "https://arxiv.org/abs/2311.04318", "description": "arXiv:2311.04318v2 Announce Type: replace-cross \nAbstract: Complete observation of event histories is often impossible due to sampling effects such as right-censoring and left-truncation, but also due to reporting delays and incomplete event adjudication. This is for example the case for health insurance claims and during interim stages of clinical trials. In this paper, we develop a parametric method that takes the aforementioned effects into account, treating the latter two as partially exogenous. The method, which takes the form of a two-step M-estimation procedure, is applicable to multistate models in general, including competing risks and recurrent event models. The effect of reporting delays is derived via thinning, offering an alternative to existing results for Poisson models. To address incomplete event adjudication, we propose an imputed likelihood approach which, compared to existing methods, has the advantage of allowing for dependencies between the event history and adjudication processes as well as allowing for unreported events and multiple event types. We establish consistency and asymptotic normality under standard identifiability, integrability, and smoothness conditions, and we demonstrate the validity of the percentile bootstrap. Finally, a simulation study shows favorable finite sample performance of our method compared to other alternatives, while an application to disability insurance data illustrates its practical potential."}, "https://arxiv.org/abs/2312.11582": {"title": "Shapley-PC: Constraint-based Causal Structure Learning with Shapley Values", "link": "https://arxiv.org/abs/2312.11582", "description": "arXiv:2312.11582v2 Announce Type: replace-cross \nAbstract: Causal Structure Learning (CSL), also referred to as causal discovery, amounts to extracting causal relations among variables in data. CSL enables the estimation of causal effects from observational data alone, avoiding the need to perform real life experiments. Constraint-based CSL leverages conditional independence tests to perform causal discovery. We propose Shapley-PC, a novel method to improve constraint-based CSL algorithms by using Shapley values over the possible conditioning sets, to decide which variables are responsible for the observed conditional (in)dependences. 
We prove soundness, completeness and asymptotic consistency of Shapley-PC and run a simulation study showing that our proposed algorithm is superior to existing versions of PC."}, "https://arxiv.org/abs/2409.12275": {"title": "Simultaneous Estimation of Many Sparse Networks via Hierarchical Poisson Log-Normal Model", "link": "https://arxiv.org/abs/2409.12275", "description": "arXiv:2409.12275v1 Announce Type: new \nAbstract: The advancement of single-cell RNA-sequencing (scRNA-seq) technologies allows us to study the individual level cell-type-specific gene expression networks by direct inference of genes' conditional independence structures. scRNA-seq data facilitates the analysis of gene expression data across different conditions or samples, enabling simultaneous estimation of condition- or sample-specific gene networks. Since the scRNA-seq data are count data with many zeros, existing network inference methods based on Gaussian graphs cannot be applied to such single cell data directly. We propose a hierarchical Poisson Log-Normal model to simultaneously estimate many such networks to effectively incorporate the shared network structures. We develop an efficient simultaneous estimation method that uses the variational EM and alternating direction method of multipliers (ADMM) algorithms, optimized for parallel processing. Simulation studies show this method outperforms traditional methods in network structure recovery and parameter estimation across various network models. We apply the method to two single cell RNA-seq datasets, a yeast single-cell gene expression dataset measured under 11 different environmental conditions, and a single-cell gene expression dataset from 13 inflammatory bowel disease patients. We demonstrate that simultaneous estimation can uncover a wider range of conditional dependence networks among genes, offering deeper insights into gene expression mechanisms."}, "https://arxiv.org/abs/2409.12348": {"title": "Heckman Selection Contaminated Normal Model", "link": "https://arxiv.org/abs/2409.12348", "description": "arXiv:2409.12348v1 Announce Type: new \nAbstract: The Heckman selection model is one of the most well-renowned econometric models in the analysis of data with sample selection. This model is designed to rectify sample selection biases based on the assumption of bivariate normal error terms. However, real data diverge from this assumption in the presence of heavy tails and/or atypical observations. Recently, this assumption has been relaxed via a more flexible Student's t-distribution, which has appealing statistical properties. This paper introduces a novel Heckman selection model using a bivariate contaminated normal distribution for the error terms. We present an efficient ECM algorithm for parameter estimation with closed-form expressions at the E-step based on truncated multinormal distribution formulas. The identifiability of the proposed model is also discussed, and its properties have been examined. Through simulation studies, we compare our proposed model with the normal and Student's t counterparts and investigate the finite-sample properties and the variation in missing rate. Results obtained from two real data analyses showcase the usefulness and effectiveness of our model. 
The proposed algorithms are implemented in the R package HeckmanEM."}, "https://arxiv.org/abs/2409.12353": {"title": "A Way to Synthetic Triple Difference", "link": "https://arxiv.org/abs/2409.12353", "description": "arXiv:2409.12353v1 Announce Type: new \nAbstract: This paper introduces a novel approach that combines synthetic control with triple difference to address violations of the parallel trends assumption. While synthetic control has been widely applied to improve causal estimates in difference-in-differences (DID) frameworks, its use in triple-difference models has been underexplored. By transforming triple difference into a DID structure, this paper extends the applicability of synthetic control to a triple-difference framework, enabling more robust estimates when parallel trends are violated across multiple dimensions. The empirical example focuses on China's \"4+7 Cities\" Centralized Drug Procurement pilot program. Based on the proposed procedure for synthetic triple difference, I find that the program can promote pharmaceutical innovation in terms of the number of patent applications even based on the recommended clustered standard error. This method contributes to improving causal inference in policy evaluations and offers a valuable tool for researchers dealing with heterogeneous treatment effects across subgroups."}, "https://arxiv.org/abs/2409.12498": {"title": "Neymanian inference in randomized experiments", "link": "https://arxiv.org/abs/2409.12498", "description": "arXiv:2409.12498v1 Announce Type: new \nAbstract: In his seminal work in 1923, Neyman studied the variance estimation problem for the difference-in-means estimator of the average treatment effect in completely randomized experiments. He proposed a variance estimator that is conservative in general and unbiased when treatment effects are homogeneous. While widely used under complete randomization, there is no unique or natural way to extend this estimator to more complex designs. To this end, we show that Neyman's estimator can be alternatively derived in two ways, leading to two novel variance estimation approaches: the imputation approach and the contrast approach. While both approaches recover Neyman's estimator under complete randomization, they yield fundamentally different variance estimators for more general designs. In the imputation approach, the variance is expressed as a function of observed and missing potential outcomes and then estimated by imputing the missing potential outcomes, akin to Fisherian inference. In the contrast approach, the variance is expressed as a function of several unobservable contrasts of potential outcomes and then estimated by exchanging each unobservable contrast with an observable contrast. Unlike the imputation approach, the contrast approach does not require separately estimating the missing potential outcome for each unit. We examine the theoretical properties of both approaches, showing that for a large class of designs, each produces conservative variance estimators that are unbiased in finite samples or asymptotically under homogeneous treatment effects."}, "https://arxiv.org/abs/2409.12592": {"title": "Choice of the hypothesis matrix for using the Anova-type-statistic", "link": "https://arxiv.org/abs/2409.12592", "description": "arXiv:2409.12592v1 Announce Type: new \nAbstract: Initially developed in Brunner et al. 
(1997), the Anova-type-statistic (ATS) is one of the most used quadratic forms for testing multivariate hypotheses for a variety of different parameter vectors $\\boldsymbol{\\theta}\\in\\mathbb{R}^d$. Such tests can be based on several versions of ATS and in most settings, they are preferable over those based on other quadratic forms, as for example the Wald-type-statistic (WTS). However, the same null hypothesis $\\boldsymbol{H}\\boldsymbol{\\theta}=\\boldsymbol{y}$ can be expressed by a multitude of hypothesis matrices $\\boldsymbol{H}\\in\\mathbb{R}^{m\\times d}$ and corresponding vectors $\\boldsymbol{y}\\in\\mathbb{R}^m$, which leads to different values of the test statistic, as it can be seen in simple examples. Since this can entail distinct test decisions, it remains to investigate under which conditions tests using different hypothesis matrices coincide. Here, the dimensions of the different hypothesis matrices can be substantially different, which has exceptional potential to save computation effort.\n In this manuscript, we show that for the Anova-type-statistic and some versions thereof, it is possible for each hypothesis $\\boldsymbol{H}\\boldsymbol{\\theta}=\\boldsymbol{y}$ to construct a companion matrix $\\boldsymbol{L}$ with a minimal number of rows, which not only tests the same hypothesis but also always yields the same test decisions. This allows a substantial reduction of computation time, which is investigated in several conducted simulations."}, "https://arxiv.org/abs/2409.12611": {"title": "Parameters on the boundary in predictive regression", "link": "https://arxiv.org/abs/2409.12611", "description": "arXiv:2409.12611v1 Announce Type: new \nAbstract: We consider bootstrap inference in predictive (or Granger-causality) regressions when the parameter of interest may lie on the boundary of the parameter space, here defined by means of a smooth inequality constraint. For instance, this situation occurs when the definition of the parameter space allows for the cases of either no predictability or sign-restricted predictability. We show that in this context constrained estimation gives rise to bootstrap statistics whose limit distribution is, in general, random, and thus distinct from the limit null distribution of the original statistics of interest. This is due to both (i) the possible location of the true parameter vector on the boundary of the parameter space, and (ii) the possible non-stationarity of the posited predicting (resp. Granger-causing) variable. We discuss a modification of the standard fixed-regressor wild bootstrap scheme where the bootstrap parameter space is shifted by a data-dependent function in order to eliminate the portion of limiting bootstrap randomness attributable to the boundary, and prove validity of the associated bootstrap inference under non-stationarity of the predicting variable as the only remaining source of limiting bootstrap randomness. Our approach, which is initially presented in a simple location model, has bearing on inference in parameter-on-the-boundary situations beyond the predictive regression problem."}, "https://arxiv.org/abs/2409.12662": {"title": "Testing for equal predictive accuracy with strong dependence", "link": "https://arxiv.org/abs/2409.12662", "description": "arXiv:2409.12662v1 Announce Type: new \nAbstract: We analyse the properties of the Diebold and Mariano (1995) test in the presence of autocorrelation in the loss differential. 
We show that the power of the Diebold and Mariano (1995) test decreases as the dependence increases, making it more difficult to obtain statistically significant evidence of superior predictive ability against less accurate benchmarks. We also find that, after a certain threshold, the test has no power and the correct null hypothesis is spuriously rejected. Taken together, these results caution to seriously consider the dependence properties of the loss differential before the application of the Diebold and Mariano (1995) test."}, "https://arxiv.org/abs/2409.12848": {"title": "Bridging the Gap Between Design and Analysis: Randomization Inference and Sensitivity Analysis for Matched Observational Studies with Treatment Doses", "link": "https://arxiv.org/abs/2409.12848", "description": "arXiv:2409.12848v1 Announce Type: new \nAbstract: Matching is a commonly used causal inference study design in observational studies. Through matching on measured confounders between different treatment groups, valid randomization inferences can be conducted under the no unmeasured confounding assumption, and sensitivity analysis can be further performed to assess sensitivity of randomization inference results to potential unmeasured confounding. However, for many common matching designs, there is still a lack of valid downstream randomization inference and sensitivity analysis approaches. Specifically, in matched observational studies with treatment doses (e.g., continuous or ordinal treatments), with the exception of some special cases such as pair matching, there is no existing randomization inference or sensitivity analysis approach for studying analogs of the sample average treatment effect (Neyman-type weak nulls), and no existing valid sensitivity analysis approach for testing the sharp null of no effect for any subject (Fisher's sharp null) when the outcome is non-binary. To fill these gaps, we propose new methods for randomization inference and sensitivity analysis that can work for general matching designs with treatment doses, applicable to general types of outcome variables (e.g., binary, ordinal, or continuous), and cover both Fisher's sharp null and Neyman-type weak nulls. We illustrate our approaches via comprehensive simulation studies and a real-data application."}, "https://arxiv.org/abs/2409.12856": {"title": "Scaleable Dynamic Forecast Reconciliation", "link": "https://arxiv.org/abs/2409.12856", "description": "arXiv:2409.12856v1 Announce Type: new \nAbstract: We introduce a dynamic approach to probabilistic forecast reconciliation at scale. Our model differs from the existing literature in this area in several important ways. Firstly we explicitly allow the weights allocated to the base forecasts in forming the combined, reconciled forecasts to vary over time. Secondly we drop the assumption, near ubiquitous in the literature, that in-sample base forecasts are appropriate for determining these weights, and use out of sample forecasts instead. Most existing probabilistic reconciliation approaches rely on time consuming sampling based techniques, and therefore do not scale well (or at all) to large data sets. 
We address this problem in two main ways, firstly by utilising a closed form estimator of covariance structure appropriate to hierarchical forecasting problems, and secondly by decomposing large hierarchies into components which can be reconciled separately."}, "https://arxiv.org/abs/2409.12928": {"title": "A general condition for bias attenuation by a nondifferentially mismeasured confounder", "link": "https://arxiv.org/abs/2409.12928", "description": "arXiv:2409.12928v1 Announce Type: new \nAbstract: In real-world studies, the collected confounders may suffer from measurement error. Although mismeasurement of confounders is typically unintentional -- originating from sources such as human oversight or imprecise machinery -- deliberate mismeasurement also occurs and is becoming increasingly common. For example, in the 2020 U.S. Census, noise was added to measurements to assuage privacy concerns. Sensitive variables such as income or age are oftentimes partially censored and are only known up to a range of values. In such settings, obtaining valid estimates of the causal effect of a binary treatment can be impossible, as mismeasurement of confounders constitutes a violation of the no unmeasured confounding assumption. A natural question is whether the common practice of simply adjusting for the mismeasured confounder is justifiable. In this article, we answer this question in the affirmative and demonstrate that in many realistic scenarios not covered by previous literature, adjusting for the mismeasured confounders reduces bias compared to not adjusting."}, "https://arxiv.org/abs/2409.12328": {"title": "SplitVAEs: Decentralized scenario generation from siloed data for stochastic optimization problems", "link": "https://arxiv.org/abs/2409.12328", "description": "arXiv:2409.12328v1 Announce Type: cross \nAbstract: Stochastic optimization problems in large-scale multi-stakeholder networked systems (e.g., power grids and supply chains) rely on data-driven scenarios to encapsulate complex spatiotemporal interdependencies. However, centralized aggregation of stakeholder data is challenging due to the existence of data silos resulting from computational and logistical bottlenecks. In this paper, we present SplitVAEs, a decentralized scenario generation framework that leverages variational autoencoders to generate high-quality scenarios without moving stakeholder data. With the help of experiments on distributed memory systems, we demonstrate the broad applicability of SplitVAEs in a variety of domain areas that are dominated by a large number of stakeholders. Our experiments indicate that SplitVAEs can learn spatial and temporal interdependencies in large-scale networks to generate scenarios that match the joint historical distribution of stakeholder data in a decentralized manner. Our experiments show that SplitVAEs deliver robust performance compared to centralized, state-of-the-art benchmark methods while significantly reducing data transmission costs, leading to a scalable, privacy-enhancing alternative to scenario generation."}, "https://arxiv.org/abs/2409.12890": {"title": "Stable and Robust Hyper-Parameter Selection Via Robust Information Sharing Cross-Validation", "link": "https://arxiv.org/abs/2409.12890", "description": "arXiv:2409.12890v1 Announce Type: cross \nAbstract: Robust estimators for linear regression require non-convex objective functions to shield against adverse effects of outliers. 
This non-convexity brings challenges, particularly when combined with penalization in high-dimensional settings. Selecting hyper-parameters for the penalty based on a finite sample is a critical task. In practice, cross-validation (CV) is the prevalent strategy with good performance for convex estimators. Applied with robust estimators, however, CV often gives sub-par results due to the interplay between multiple local minima and the penalty. The best local minimum attained on the full training data may not be the minimum with the desired statistical properties. Furthermore, there may be a mismatch between this minimum and the minima attained in the CV folds. This paper introduces a novel adaptive CV strategy that tracks multiple minima for each combination of hyper-parameters and subsets of the data. A matching scheme is presented for correctly evaluating minima computed on the full training data using the best-matching minima from the CV folds. It is shown that the proposed strategy reduces the variability of the estimated performance metric, leads to smoother CV curves, and therefore substantially increases the reliability and utility of robust penalized estimators."}, "https://arxiv.org/abs/2211.09099": {"title": "Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2211.09099", "description": "arXiv:2211.09099v3 Announce Type: replace \nAbstract: The Brazil Bolsa Familia (BF) program is a conditional cash transfer program aimed to reduce short-term poverty by direct cash transfers and to fight long-term poverty by increasing human capital among poor Brazilian people. Eligibility for Bolsa Familia benefits depends on a cutoff rule, which classifies the BF study as a regression discontinuity (RD) design. Extracting causal information from RD studies is challenging. Following Li et al (2015) and Branson and Mealli (2019), we formally describe the BF RD design as a local randomized experiment within the potential outcome approach. Under this framework, causal effects can be identified and estimated on a subpopulation where a local overlap assumption, a local SUTVA and a local ignorability assumption hold. We first discuss the potential advantages of this framework over local regression methods based on continuity assumptions, which concern the definition of the causal estimands, the design and the analysis of the study, and the interpretation and generalizability of the results. A critical issue of this local randomization approach is how to choose subpopulations for which we can draw valid causal inference. We propose a Bayesian model-based finite mixture approach to clustering to classify observations into subpopulations where the RD assumptions hold and do not hold. This approach has important advantages: a) it allows to account for the uncertainty in the subpopulation membership, which is typically neglected; b) it does not impose any constraint on the shape of the subpopulation; c) it is scalable to high-dimensional settings; e) it allows to target alternative causal estimands than the average treatment effect (ATE); and f) it is robust to a certain degree of manipulation/selection of the running variable. 
We apply our proposed approach to assess causal effects of the Bolsa Familia program on leprosy incidence in 2009."}, "https://arxiv.org/abs/2309.15793": {"title": "Targeting relative risk heterogeneity with causal forests", "link": "https://arxiv.org/abs/2309.15793", "description": "arXiv:2309.15793v2 Announce Type: replace \nAbstract: The estimation of heterogeneous treatment effects (HTE) across different subgroups in a population is of significant interest in clinical trial analysis. State-of-the-art HTE estimation methods, including causal forests (Wager and Athey, 2018), generally rely on recursive partitioning for non-parametric identification of relevant covariates and interactions. However, like many other methods in this area, causal forests partition subgroups based on differences in absolute risk. This can dilute statistical power by masking variability in the relative risk, which is often a more appropriate quantity of clinical interest. In this work, we propose and implement a methodology for modifying causal forests to target relative risk, using a novel node-splitting procedure based on exhaustive generalized linear model comparison. We present results that suggest relative risk causal forests can capture otherwise undetected sources of heterogeneity."}, "https://arxiv.org/abs/2306.06291": {"title": "Optimal Multitask Linear Regression and Contextual Bandits under Sparse Heterogeneity", "link": "https://arxiv.org/abs/2306.06291", "description": "arXiv:2306.06291v2 Announce Type: replace-cross \nAbstract: Large and complex datasets are often collected from several, possibly heterogeneous sources. Multitask learning methods improve efficiency by leveraging commonalities across datasets while accounting for possible differences among them. Here, we study multitask linear regression and contextual bandits under sparse heterogeneity, where the source/task-associated parameters are equal to a global parameter plus a sparse task-specific term. We propose a novel two-stage estimator called MOLAR that leverages this structure by first constructing a covariate-wise weighted median of the task-wise linear regression estimates and then shrinking the task-wise estimates towards the weighted median. Compared to task-wise least squares estimates, MOLAR improves the dependence of the estimation error on the data dimension. Extensions of MOLAR to generalized linear models and constructing confidence intervals are discussed in the paper. We then apply MOLAR to develop methods for sparsely heterogeneous multitask contextual bandits, obtaining improved regret guarantees over single-task bandit methods. We further show that our methods are minimax optimal by providing a number of lower bounds. Finally, we support the efficiency of our methods by performing experiments on both synthetic data and the PISA dataset on student educational outcomes from heterogeneous countries."}, "https://arxiv.org/abs/2409.13041": {"title": "Properly constrained reference priors decay rates for efficient and robust posterior inference", "link": "https://arxiv.org/abs/2409.13041", "description": "arXiv:2409.13041v1 Announce Type: new \nAbstract: In Bayesian analysis, reference priors are widely recognized for their objective nature. Yet, they often lead to intractable and improper priors, which complicates their application. Besides, informed prior elicitation methods are penalized by the subjectivity of the choices they require. In this paper, we aim at proposing a reconciliation of the aforementioned aspects. 
Leveraging the objective aspect of reference prior theory, we introduce two strategies of constraint incorporation to build tractable reference priors. One provides a simple and easy-to-compute solution when the improper aspect is not questioned, and the other introduces constraints to ensure that the reference prior is proper or that it yields a proper posterior. Our methodology emphasizes the central role of Jeffreys prior decay rates in this process, and the practical applicability of our results is demonstrated using an example taken from the literature."}, "https://arxiv.org/abs/2409.13060": {"title": "Forecasting Causal Effects of Future Interventions: Confounding and Transportability Issues", "link": "https://arxiv.org/abs/2409.13060", "description": "arXiv:2409.13060v1 Announce Type: new \nAbstract: Recent developments in causal inference allow us to transport a causal effect of a time-fixed treatment from a randomized trial to a target population across space but within the same time frame. In contrast to transportability across space, transporting causal effects across time or forecasting causal effects of future interventions is more challenging due to time-varying confounders and time-varying effect modifiers. In this article, we seek to formally clarify the causal estimands for forecasting causal effects over time and the structural assumptions required to identify these estimands. Specifically, we develop a set of novel nonparametric identification formulas--g-computation formulas--for these causal estimands, and lay out the conditions required to accurately forecast causal effects from a past observed sample to a future population in a future time window. Our overarching objective is to leverage the modern causal inference theory to provide a theoretical framework for investigating whether the effects seen in a past sample would carry over to a new future population. Throughout the article, a working example addressing the effect of public policies or social events on COVID-related deaths is considered to contextualize the developments of analytical results."}, "https://arxiv.org/abs/2409.13097": {"title": "Incremental Causal Effect for Time to Treatment Initialization", "link": "https://arxiv.org/abs/2409.13097", "description": "arXiv:2409.13097v1 Announce Type: new \nAbstract: We consider time to treatment initialization. This can commonly occur in preventive medicine, such as disease screening and vaccination; it can also occur with non-fatal health conditions such as HIV infection without the onset of AIDS; or in the tech industry where items wait to be reviewed manually as abusive or not, etc. While traditional causal inference focused on `when to treat' and its effects, including their possible dependence on subject characteristics, we consider the incremental causal effect when the intensity of time to treatment initialization is intervened upon. We provide identification of the incremental causal effect without the commonly required positivity assumption, as well as an estimation framework using inverse probability weighting. 
We illustrate our approach via simulation, and apply it to a rheumatoid arthritis study to evaluate the incremental effect of time to start methotrexate on joint pain."}, "https://arxiv.org/abs/2409.13140": {"title": "Non-parametric Replication of Instrumental Variable Estimates Across Studies", "link": "https://arxiv.org/abs/2409.13140", "description": "arXiv:2409.13140v1 Announce Type: new \nAbstract: Replicating causal estimates across different cohorts is crucial for increasing the integrity of epidemiological studies. However, strong assumptions regarding unmeasured confounding and effect modification often hinder this goal. By employing an instrumental variable (IV) approach and targeting the local average treatment effect (LATE), these assumptions can be relaxed to some degree; however, little work has addressed the replicability of IV estimates. In this paper, we propose a novel survey weighted LATE (SWLATE) estimator that incorporates unknown sampling weights and leverages machine learning for flexible modeling of nuisance functions, including the weights. Our approach, based on influence function theory and cross-fitting, provides a doubly-robust and efficient framework for valid inference, aligned with the growing \"double machine learning\" literature. We further extend our method to provide bounds on a target population ATE. The effectiveness of our approach, particularly in non-linear settings, is demonstrated through simulations and applied to a Mendelian randomization analysis of the relationship between triglycerides and cognitive decline."}, "https://arxiv.org/abs/2409.13190": {"title": "Nonparametric Causal Survival Analysis with Clustered Interference", "link": "https://arxiv.org/abs/2409.13190", "description": "arXiv:2409.13190v1 Announce Type: new \nAbstract: Inferring treatment effects on a survival time outcome based on data from an observational study is challenging due to the presence of censoring and possible confounding. An additional challenge occurs when a unit's treatment affects the outcome of other units, i.e., there is interference. In some settings, units may be grouped into clusters such that it is reasonable to assume interference only occurs within clusters, i.e., there is clustered interference. In this paper, methods are developed which can accommodate confounding, censored outcomes, and clustered interference. The approach avoids parametric assumptions and permits inference about counterfactual scenarios corresponding to any stochastic policy which modifies the propensity score distribution, and thus may have application across diverse settings. The proposed nonparametric sample splitting estimators allow for flexible data-adaptive estimation of nuisance functions and are consistent and asymptotically normal with parametric convergence rates. Simulation studies demonstrate the finite sample performance of the proposed estimators, and the methods are applied to a cholera vaccine study in Bangladesh."}, "https://arxiv.org/abs/2409.13267": {"title": "Spatial Sign based Principal Component Analysis for High Dimensional Data", "link": "https://arxiv.org/abs/2409.13267", "description": "arXiv:2409.13267v1 Announce Type: new \nAbstract: This article focuses on the robust principal component analysis (PCA) of high-dimensional data with elliptical distributions. We investigate the PCA of the sample spatial-sign covariance matrix in both nonsparse and sparse contexts, referring to them as SPCA and SSPCA, respectively. 
We present both nonasymptotic and asymptotic analyses to quantify the theoretical performance of SPCA and SSPCA. In sparse settings, we demonstrate that SSPCA, implemented through a combinatoric program, achieves the optimal rate of convergence. Our proposed SSPCA method is computationally efficient and exhibits robustness against heavy-tailed distributions compared to existing methods. Simulation studies and real-world data applications further validate the superiority of our approach."}, "https://arxiv.org/abs/2409.13300": {"title": "A Two-stage Inference Procedure for Sample Local Average Treatment Effects in Randomized Experiments", "link": "https://arxiv.org/abs/2409.13300", "description": "arXiv:2409.13300v1 Announce Type: new \nAbstract: In a given randomized experiment, individuals are often volunteers and can differ in important ways from a population of interest. It is thus of interest to focus on the sample at hand. This paper focuses on inference about the sample local average treatment effect (LATE) in randomized experiments with non-compliance. We present a two-stage procedure that provides an asymptotically correct coverage rate of the sample LATE in randomized experiments. The procedure uses a first-stage test to decide whether the instrument is strong or weak, and uses different confidence sets depending on the first-stage result. Proofs of the procedure are developed for the situations with and without regression adjustment and for two experimental designs (complete randomization and Mahalanobis distance based rerandomization). The finite sample performance of the methods is studied using extensive Monte Carlo simulations and the methods are applied to data from a voter encouragement experiment."}, "https://arxiv.org/abs/2409.13336": {"title": "Two-level D- and A-optimal main-effects designs with run sizes one and two more than a multiple of four", "link": "https://arxiv.org/abs/2409.13336", "description": "arXiv:2409.13336v1 Announce Type: new \nAbstract: For run sizes that are a multiple of four, the literature offers many two-level designs that are D- and A-optimal for the main-effects model and minimize the aliasing between main effects and interaction effects and among interaction effects. For run sizes that are not a multiple of four, no conclusive results are known. In this paper, we propose two algorithms that generate all non-isomorphic D- and A-optimal main-effects designs for run sizes that are one and two more than a multiple of four. We enumerate all such designs for run sizes up to 18, report the numbers of designs we obtained, and identify those that minimize the aliasing between main effects and interaction effects and among interaction effects. Finally, we compare the minimally aliased designs we found with benchmark designs from the literature."}, "https://arxiv.org/abs/2409.13458": {"title": "Interpretable meta-analysis of model or marker performance", "link": "https://arxiv.org/abs/2409.13458", "description": "arXiv:2409.13458v1 Announce Type: new \nAbstract: Conventional meta-analysis of model performance conducted using datasources from different underlying populations often results in estimates that cannot be interpreted in the context of a well defined target population. In this manuscript we develop methods for meta-analysis of several measures of model performance that are interpretable in the context of a well defined target population when the populations underlying the datasources used in the meta-analysis are heterogeneous. 
This includes developing identifiability conditions, as well as inverse-weighting, outcome-model, and doubly robust estimators. We illustrate the methods using simulations and data from two large lung cancer screening trials."}, "https://arxiv.org/abs/2409.13479": {"title": "Efficiency gain in association studies based on population surveys by augmenting outcome data from the target population", "link": "https://arxiv.org/abs/2409.13479", "description": "arXiv:2409.13479v1 Announce Type: new \nAbstract: Routinely collected nation-wide registers contain socio-economic and health-related information from a large number of individuals. However, important information on lifestyle, biological and other risk factors is available at most for small samples of the population through surveys. A majority of health surveys lack detailed medical information necessary for assessing the disease burden. Hence, traditionally data from the registers and the surveys are combined to obtain the necessary information for the survey sample. Our idea is to base analyses on a combined sample obtained by adding a (large) sample of individuals from the population to the survey sample. The main objective is to assess the bias and gain in efficiency of such combined analyses with a binary or time-to-event outcome. We employ (i) the complete-case analysis (CCA) using the respondents of the survey, (ii) analysis of the full survey sample with both unit- and item-nonresponse under the missing at random (MAR) assumption and (iii) analysis of the combined sample under a mixed type of missing data mechanism. We handle the missing data using multiple imputation (MI)-based analysis in (ii) and (iii). We utilize simulated as well as empirical data on ischemic heart disease obtained from the Finnish population. Our results suggested that the MI methods improved the efficiency of the estimates when we used the combined data for a binary outcome, but in the case of a time-to-event outcome the CCA was at least as good as the MI using the larger datasets, in terms of the mean absolute and squared errors. Increasing the participation in the surveys and having good statistical methods for handling missing covariate data when the outcome is time-to-event would be needed for implementation of the proposed ideas."}, "https://arxiv.org/abs/2409.13516": {"title": "Dynamic tail risk forecasting: what do realized skewness and kurtosis add?", "link": "https://arxiv.org/abs/2409.13516", "description": "arXiv:2409.13516v1 Announce Type: new \nAbstract: This paper compares the accuracy of tail risk forecasts with a focus on including realized skewness and kurtosis in \"additive\" and \"multiplicative\" models. Utilizing a panel of 960 US stocks, we conduct diagnostic tests, employ scoring functions, and implement rolling window forecasting to evaluate the performance of Value at Risk (VaR) and Expected Shortfall (ES) forecasts. Additionally, we examine the impact of the window length on forecast accuracy. We propose model specifications that incorporate realized skewness and kurtosis for enhanced precision. 
Our findings provide insights into the importance of considering skewness and kurtosis in tail risk modeling, contributing to the existing literature and offering practical implications for risk practitioners and researchers."}, "https://arxiv.org/abs/2409.13531": {"title": "A simple but powerful tail index regression", "link": "https://arxiv.org/abs/2409.13531", "description": "arXiv:2409.13531v1 Announce Type: new \nAbstract: This paper introduces a flexible framework for the estimation of the conditional tail index of heavy tailed distributions. In this framework, the tail index is computed from an auxiliary linear regression model that facilitates estimation and inference based on established econometric methods, such as ordinary least squares (OLS), least absolute deviations, or M-estimation. We show theoretically and via simulations that OLS provides interesting results. Our Monte Carlo results highlight the adequate finite sample properties of the OLS tail index estimator computed from the proposed new framework and contrast its behavior to that of tail index estimates obtained by maximum likelihood estimation of exponential regression models, which is one of the approaches currently in use in the literature. An empirical analysis of the impact of determinants of the conditional left- and right-tail indexes of commodities' return distributions highlights the empirical relevance of our proposed approach. The novel framework's flexibility allows for extensions and generalizations in various directions, empowering researchers and practitioners to straightforwardly explore a wide range of research questions."}, "https://arxiv.org/abs/2409.13688": {"title": "Morphological Detection and Classification of Microplastics and Nanoplastics Emerged from Consumer Products by Deep Learning", "link": "https://arxiv.org/abs/2409.13688", "description": "arXiv:2409.13688v1 Announce Type: cross \nAbstract: Plastic pollution presents an escalating global issue, impacting health and environmental systems, with micro- and nanoplastics found across mediums from potable water to air. Traditional methods for studying these contaminants are labor-intensive and time-consuming, necessitating a shift towards more efficient technologies. In response, this paper introduces micro- and nanoplastics (MiNa), a novel and open-source dataset engineered for the automatic detection and classification of micro and nanoplastics using object detection algorithms. The dataset, comprising scanning electron microscopy images simulated under realistic aquatic conditions, categorizes plastics by polymer type across a broad size spectrum. We demonstrate the application of state-of-the-art detection algorithms on MiNa, assessing their effectiveness and identifying the unique challenges and potential of each method. The dataset not only fills a critical gap in available resources for microplastic research but also provides a robust foundation for future advancements in the field."}, "https://arxiv.org/abs/2109.00404": {"title": "Perturbation graphs, invariant prediction and causal relations in psychology", "link": "https://arxiv.org/abs/2109.00404", "description": "arXiv:2109.00404v2 Announce Type: replace \nAbstract: Networks (graphs) in psychology are often restricted to settings without interventions. Here we consider a framework borrowed from biology that involves multiple interventions from different contexts (observations and experiments) in a single analysis. The method is called perturbation graphs. 
In gene regulatory networks, the induced change in one gene is measured on all other genes in the analysis, thereby assessing possible causal relations. This is repeated for each gene in the analysis. A perturbation graph leads to the correct set of causes (not necessarily direct causes). Subsequent pruning of paths in the graph (called transitive reduction) should reveal direct causes. We show that transitive reduction will not in general lead to the correct underlying graph. We also show that invariant causal prediction is a generalisation of the perturbation graph method, where including additional variables does reveal direct causes, and thereby replacing transitive reduction. We conclude that perturbation graphs provide a promising new tool for experimental designs in psychology, and combined with invariant causal prediction make it possible to reveal direct causes instead of causal paths. As an illustration we apply the perturbation graphs and invariant causal prediction to a data set about attitudes on meat consumption and to a time series of a patient diagnosed with major depression disorder."}, "https://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "https://arxiv.org/abs/2302.11505", "description": "arXiv:2302.11505v4 Announce Type: replace \nAbstract: This paper studies settings where the analyst is interested in identifying and estimating the average causal effect of a binary treatment on an outcome. We consider a setup in which the outcome realization does not get immediately realized after the treatment assignment, a feature that is ubiquitous in empirical settings. The period between the treatment and the realization of the outcome allows other observed actions to occur and affect the outcome. In this context, we study several regression-based estimands routinely used in empirical work to capture the average treatment effect and shed light on interpreting them in terms of ceteris paribus effects, indirect causal effects, and selection terms. We obtain three main and related takeaways. First, the three most popular estimands do not generally satisfy what we call \\emph{strong sign preservation}, in the sense that these estimands may be negative even when the treatment positively affects the outcome conditional on any possible combination of other actions. Second, the most popular regression that includes the other actions as controls satisfies strong sign preservation \\emph{if and only if} these actions are mutually exclusive binary variables. Finally, we show that a linear regression that fully stratifies the other actions leads to estimands that satisfy strong sign preservation."}, "https://arxiv.org/abs/2308.04374": {"title": "Are Information criteria good enough to choose the right the number of regimes in Hidden Markov Models?", "link": "https://arxiv.org/abs/2308.04374", "description": "arXiv:2308.04374v2 Announce Type: replace \nAbstract: Selecting the number of regimes in Hidden Markov models is an important problem. There are many criteria that are used to select this number, such as Akaike information criterion (AIC), Bayesian information criterion (BIC), integrated completed likelihood (ICL), deviance information criterion (DIC), and Watanabe-Akaike information criterion (WAIC), to name a few. 
In this article, we introduce goodness-of-fit tests for general Hidden Markov models with covariates, where the distribution of the observations is arbitrary, i.e., continuous, discrete, or a mixture of both. Then, a selection procedure is proposed based on this goodness-of-fit test. The main aim of this article is to compare the classical information criteria with the new criterion, when the outcome is either continuous, discrete or zero-inflated. Numerical experiments assess the finite sample performance of the goodness-of-fit tests, and comparisons between the different criteria are made."}, "https://arxiv.org/abs/2310.16989": {"title": "Randomization Inference When N Equals One", "link": "https://arxiv.org/abs/2310.16989", "description": "arXiv:2310.16989v2 Announce Type: replace \nAbstract: N-of-1 experiments, where a unit serves as its own control and treatment in different time windows, have been used in certain medical contexts for decades. However, due to effects that accumulate over long time windows and interventions that have complex evolution, a lack of robust inference tools has limited the widespread applicability of such N-of-1 designs. This work combines techniques from experiment design in causal inference and system identification from control theory to provide such an inference framework. We derive a model of the dynamic interference effect that arises in linear time-invariant dynamical systems. We show that a family of causal estimands analogous to those studied in potential outcomes are estimable via a standard estimator derived from the method of moments. We derive formulae for higher moments of this estimator and describe conditions under which N-of-1 designs may provide faster ways to estimate the effects of interventions in dynamical systems. We also provide conditions under which our estimator is asymptotically normal and derive valid confidence intervals for this setting."}, "https://arxiv.org/abs/2209.05894": {"title": "Nonparametric estimation of trawl processes: Theory and Applications", "link": "https://arxiv.org/abs/2209.05894", "description": "arXiv:2209.05894v2 Announce Type: replace-cross \nAbstract: Trawl processes belong to the class of continuous-time, strictly stationary, infinitely divisible processes; they are defined as L\'{e}vy bases evaluated over deterministic trawl sets. This article presents the first nonparametric estimator of the trawl function characterising the trawl set and the serial correlation of the process. Moreover, it establishes a detailed asymptotic theory for the proposed estimator, including a law of large numbers and a central limit theorem for various asymptotic relations between an in-fill and a long-span asymptotic regime. In addition, it develops consistent estimators for both the asymptotic bias and variance, which are subsequently used for establishing feasible central limit theorems which can be applied to data. A simulation study shows the good finite sample performance of the proposed estimators. 
The new methodology is applied to forecasting high-frequency financial spread data from a limit order book and to estimating the busy-time distribution of a stochastic queue."}, "https://arxiv.org/abs/2309.05630": {"title": "Boundary Peeling: Outlier Detection Method Using One-Class Peeling", "link": "https://arxiv.org/abs/2309.05630", "description": "arXiv:2309.05630v2 Announce Type: replace-cross \nAbstract: Unsupervised outlier detection constitutes a crucial phase within data analysis and remains a dynamic realm of research. A good outlier detection algorithm should be computationally efficient, robust to tuning parameter selection, and perform consistently well across diverse underlying data distributions. We introduce One-Class Boundary Peeling, an unsupervised outlier detection algorithm. One-class Boundary Peeling uses the average signed distance from iteratively-peeled, flexible boundaries generated by one-class support vector machines. One-class Boundary Peeling has robust hyperparameter settings and, for increased flexibility, can be cast as an ensemble method. In synthetic data simulations One-Class Boundary Peeling outperforms all state of the art methods when no outliers are present while maintaining comparable or superior performance in the presence of outliers, as compared to benchmark methods. One-Class Boundary Peeling performs competitively in terms of correct classification, AUC, and processing time using common benchmark data sets."}, "https://arxiv.org/abs/2409.13819": {"title": "Supervised low-rank approximation of high-dimensional multivariate functional data via tensor decomposition", "link": "https://arxiv.org/abs/2409.13819", "description": "arXiv:2409.13819v1 Announce Type: new \nAbstract: Motivated by the challenges of analyzing high-dimensional ($p \\gg n$) sequencing data from longitudinal microbiome studies, where samples are collected at multiple time points from each subject, we propose supervised functional tensor singular value decomposition (SupFTSVD), a novel dimensionality reduction method that leverages auxiliary information in the dimensionality reduction of high-dimensional functional tensors. Although multivariate functional principal component analysis is a natural choice for dimensionality reduction of multivariate functional data, it becomes computationally burdensome in high-dimensional settings. Low-rank tensor decomposition is a feasible alternative and has gained popularity in recent literature, but existing methods in this realm are often incapable of simultaneously utilizing the temporal structure of the data and subject-level auxiliary information. SupFTSVD overcomes these limitations by generating low-rank representations of high-dimensional functional tensors while incorporating subject-level auxiliary information and accounting for the functional nature of the data. Moreover, SupFTSVD produces low-dimensional representations of subjects, features, and time, as well as subject-specific trajectories, providing valuable insights into the biological significance of variations within the data. In simulation studies, we demonstrate that our method achieves notable improvement in tensor approximation accuracy and loading estimation by utilizing auxiliary information. 
Finally, we applied SupFTSVD to two longitudinal microbiome studies where biologically meaningful patterns in the data were revealed."}, "https://arxiv.org/abs/2409.13858": {"title": "BRcal: An R Package to Boldness-Recalibrate Probability Predictions", "link": "https://arxiv.org/abs/2409.13858", "description": "arXiv:2409.13858v1 Announce Type: new \nAbstract: When probability predictions are too cautious for decision making, boldness-recalibration enables responsible emboldening while maintaining the probability of calibration required by the user. We introduce BRcal, an R package implementing boldness-recalibration and supporting methodology as recently proposed. The BRcal package provides direct control of the calibration-boldness tradeoff and visualizes how different calibration levels change individual predictions. We describe the implementation details in BRcal related to non-linear optimization of boldness with a non-linear inequality constraint on calibration. Package functionality is demonstrated via a real-world case study involving housing foreclosure predictions. The BRcal package is available on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/BRcal/index.html) and on Github (https://github.com/apguthrie/BRcal)."}, "https://arxiv.org/abs/2409.13873": {"title": "Jointly modeling time-to-event and longitudinal data with individual-specific change points: a case study in modeling tumor burden", "link": "https://arxiv.org/abs/2409.13873", "description": "arXiv:2409.13873v1 Announce Type: new \nAbstract: In oncology clinical trials, tumor burden (TB) stands as a crucial longitudinal biomarker, reflecting the toll a tumor takes on a patient's prognosis. With certain treatments, the disease's natural progression shows the tumor burden initially receding before rising once more. Biologically, the point of change may be different between individuals and must have occurred between the baseline measurement and progression time of the patient, implying a random effects model obeying a bound constraint. However, in practice, patients may drop out of the study due to progression or death, presenting a non-ignorable missing data problem. In this paper, we introduce a novel joint model that combines time-to-event data and longitudinal data, where the latter is parameterized by a random change point augmented by random pre-slope and post-slope dynamics. Importantly, the model is equipped to incorporate covariates for both the longitudinal and survival models, adding significant flexibility. Adopting a Bayesian approach, we propose an efficient Hamiltonian Monte Carlo algorithm for parameter inference. We demonstrate the superiority of our approach compared to a longitudinal-only model via simulations and apply our method to a data set in oncology."}, "https://arxiv.org/abs/2409.13946": {"title": "Chauhan Weighted Trajectory Analysis of combined efficacy and safety outcomes for risk-benefit analysis", "link": "https://arxiv.org/abs/2409.13946", "description": "arXiv:2409.13946v1 Announce Type: new \nAbstract: Analyzing and effectively communicating the efficacy and toxicity of treatment is the basis of risk-benefit analysis (RBA). More efficient and objective tools are needed. We apply Chauhan Weighted Trajectory Analysis (CWTA) to perform RBA with superior objectivity, power, and clarity.\n We used CWTA to perform 1000-fold simulations of RCTs using ordinal endpoints for both treatment efficacy and toxicity. 
RCTs were simulated with 1:1 allocation at defined sample sizes and hazard ratios. We studied the simplest case of 3 levels each of toxicity and efficacy and the general case of the advanced cancer trial, with efficacy graded by five RECIST 1.1 health statuses and toxicity by the six-point CTCAE scale (6 x 5 matrix). The latter model was applied to a real-world dose escalation phase I trial in advanced cancer.\n Simulations in both the 3 x 3 and the 6 x 5 advanced cancer matrix confirmed that drugs with both superior efficacy and toxicity profiles synergize for greater statistical power with CWTA-RBA. The CWTA-RBA 6 x 5 matrix reduced sample size requirements over CWTA efficacy-only analysis. Application to the dose finding phase I clinical trial provided objective, statistically significant validation for the selected dose.\n CWTA-RBA, by incorporating both drug efficacy and toxicity, provides a single test statistic and plot that analyzes and effectively communicates therapeutic risks and benefits. CWTA-RBA requires fewer patients than CWTA efficacy-only analysis when the experimental drug is both more effective and less toxic. CWTA-RBA facilitates the objective and efficient assessment of new therapies throughout the drug development pathway. Furthermore, several advantages over competing tests in communicating risk-benefit will assist regulatory review, clinical adoption, and understanding of therapeutic risks and benefits by clinicians and patients alike."}, "https://arxiv.org/abs/2409.13963": {"title": "Functional Factor Modeling of Brain Connectivity", "link": "https://arxiv.org/abs/2409.13963", "description": "arXiv:2409.13963v1 Announce Type: new \nAbstract: Many fMRI analyses examine functional connectivity, or statistical dependencies among remote brain regions. Yet popular methods for studying whole-brain functional connectivity often yield results that are difficult to interpret. Factor analysis offers a natural framework in which to study such dependencies, particularly given its emphasis on interpretability. However, multivariate factor models break down when applied to functional and spatiotemporal data, like fMRI. We present a factor model for discretely-observed multidimensional functional data that is well-suited to the study of functional connectivity. Unlike classical factor models which decompose a multivariate observation into a \"common\" term that captures covariance between observed variables and an uncorrelated \"idiosyncratic\" term that captures variance unique to each observed variable, our model decomposes a functional observation into two uncorrelated components: a \"global\" term that captures long-range dependencies and a \"local\" term that captures short-range dependencies. We show that if the global covariance is smooth with finite rank and the local covariance is banded with potentially infinite rank, then this decomposition is identifiable. Under these conditions, recovery of the global covariance amounts to rank-constrained matrix completion, which we exploit to formulate consistent loading estimators. 
We study these estimators, and their more interpretable post-processed counterparts, through simulations, then use our approach to uncover a rich covariance structure in a collection of resting-state fMRI scans."}, "https://arxiv.org/abs/2409.13990": {"title": "Batch Predictive Inference", "link": "https://arxiv.org/abs/2409.13990", "description": "arXiv:2409.13990v1 Announce Type: new \nAbstract: Constructing prediction sets with coverage guarantees for unobserved outcomes is a core problem in modern statistics. Methods for predictive inference have been developed for a wide range of settings, but usually only consider test data points one at a time. Here we study the problem of distribution-free predictive inference for a batch of multiple test points, aiming to construct prediction sets for functions -- such as the mean or median -- of any number of unobserved test datapoints. This setting includes constructing simultaneous prediction sets with a high probability of coverage, and selecting datapoints satisfying a specified condition while controlling the number of false claims.\n For the general task of predictive inference on a function of a batch of test points, we introduce a methodology called batch predictive inference (batch PI), and provide a distribution-free coverage guarantee under exchangeability of the calibration and test data. Batch PI requires the quantiles of a rank ordering function defined on certain subsets of ranks. While computing these quantiles is NP-hard in general, we show that it can be done efficiently in many cases of interest, most notably for batch score functions with a compositional structure -- which includes examples of interest such as the mean -- via a dynamic programming algorithm that we develop. Batch PI has advantages over naive approaches (such as partitioning the calibration data or directly extending conformal prediction) in many settings, as it can deliver informative prediction sets even using small calibration sample sizes. We illustrate that our procedures provide informative inference across the use cases mentioned above, through experiments on both simulated data and a drug-target interaction dataset."}, "https://arxiv.org/abs/2409.14032": {"title": "Refitted cross-validation estimation for high-dimensional subsamples from low-dimension full data", "link": "https://arxiv.org/abs/2409.14032", "description": "arXiv:2409.14032v1 Announce Type: new \nAbstract: The technique of subsampling has been extensively employed to address the challenges posed by limited computing resources and meet the needs for expedited data analysis. Various subsampling methods have been developed to meet the challenges characterized by a large sample size with a small number of parameters. However, direct applications of these subsampling methods may not be suitable when the dimension is also high and the computing facilities at hand are only able to analyze a subsample of size similar to or even smaller than the dimension. In this case, although there is no high-dimensional problem in the full data, the subsample may have a sample size comparable to or even smaller than the number of parameters, making it a high-dimensional problem. We call this scenario the high-dimensional subsample from low-dimension full data problem. In this paper, we tackle this problem by proposing a novel subsampling-based approach that combines penalty-based dimension reduction and refitted cross-validation. 
The asymptotic normality of the refitted cross-validation subsample estimator is established, which plays a crucial role in statistical inference. The proposed method demonstrates appealing performance in numerical experiments on simulated data and a real data application."}, "https://arxiv.org/abs/2409.14049": {"title": "Adaptive radar detection of subspace-based distributed target in power heterogeneous clutter", "link": "https://arxiv.org/abs/2409.14049", "description": "arXiv:2409.14049v1 Announce Type: new \nAbstract: This paper investigates the problem of adaptive detection of distributed targets in power heterogeneous clutter. In the considered scenario, all the data share the identical structure of clutter covariance matrix, but with varying and unknown power mismatches. To address this problem, we iteratively estimate all the unknowns, including the coordinate matrix of the target, the clutter covariance matrix, and the corresponding power mismatches, and propose three detectors based on the generalized likelihood ratio test (GLRT), Rao and the Wald tests. The results from simulated and real data both illustrate that the detectors based on GLRT and Rao test have higher probabilities of detection (PDs) than the existing competitors. Among them, the Rao test-based detector exhibits the best overall detection performance. We also analyze the impact of the target extended dimensions, the signal subspace dimensions, and the number of training samples on the detection performance. Furthermore, simulation experiments also demonstrate that the proposed detectors have a constant false alarm rate (CFAR) property for the structure of clutter covariance matrix."}, "https://arxiv.org/abs/2409.14167": {"title": "Skew-symmetric approximations of posterior distributions", "link": "https://arxiv.org/abs/2409.14167", "description": "arXiv:2409.14167v1 Announce Type: new \nAbstract: Routinely-implemented deterministic approximations of posterior distributions from, e.g., Laplace method, variational Bayes and expectation-propagation, generally rely on symmetric approximating densities, often taken to be Gaussian. This choice facilitates optimization and inference, but typically affects the quality of the overall approximation. In fact, even in basic parametric models, the posterior distribution often displays asymmetries that yield bias and reduced accuracy when considering symmetric approximations. Recent research has moved towards more flexible approximating densities that incorporate skewness. However, current solutions are model-specific, lack general supporting theory, increase the computational complexity of the optimization problem, and do not provide a broadly-applicable solution to include skewness in any symmetric approximation. This article addresses such a gap by introducing a general and provably-optimal strategy to perturb any off-the-shelf symmetric approximation of a generic posterior distribution. Crucially, this novel perturbation is derived without additional optimization steps, and yields a similarly-tractable approximation within the class of skew-symmetric densities that provably enhances the finite-sample accuracy of the original symmetric approximation, and, under suitable assumptions, improves its convergence rate to the exact posterior by at least a $\\sqrt{n}$ factor, in asymptotic regimes. 
These advancements are illustrated in numerical studies focusing on skewed perturbations of state-of-the-art Gaussian approximations."}, "https://arxiv.org/abs/2409.14202": {"title": "Mining Causality: AI-Assisted Search for Instrumental Variables", "link": "https://arxiv.org/abs/2409.14202", "description": "arXiv:2409.14202v1 Announce Type: new \nAbstract: The instrumental variables (IVs) method is a leading empirical strategy for causal inference. Finding IVs is a heuristic and creative process, and justifying its validity (especially exclusion restrictions) is largely rhetorical. We propose using large language models (LLMs) to search for new IVs through narratives and counterfactual reasoning, similar to how a human researcher would. The stark difference, however, is that LLMs can accelerate this process exponentially and explore an extremely large search space. We demonstrate how to construct prompts to search for potentially valid IVs. We argue that multi-step prompting is useful and role-playing prompts are suitable for mimicking the endogenous decisions of economic agents. We apply our method to three well-known examples in economics: returns to schooling, production functions, and peer effects. We then extend our strategy to finding (i) control variables in regression and difference-in-differences and (ii) running variables in regression discontinuity designs."}, "https://arxiv.org/abs/2409.14255": {"title": "On the asymptotic distributions of some test statistics for two-way contingency tables", "link": "https://arxiv.org/abs/2409.14255", "description": "arXiv:2409.14255v1 Announce Type: new \nAbstract: Pearson's Chi-square test is a widely used tool for analyzing categorical data, yet its statistical power has remained theoretically underexplored. Due to the difficulties in obtaining its power function in the usual manner, Cochran (1952) suggested the derivation of its Pitman limiting power, which is later implemented by Mitra (1958) and Meng & Chapman (1966). Nonetheless, this approach is suboptimal for practical power calculations under fixed alternatives. In this work, we solve this long-standing problem by establishing the asymptotic normality of the Chi-square statistic under fixed alternatives and deriving an explicit formula for its variance. For finite samples, we suggest a second-order expansion based on the multivariate delta method to improve the approximations. As a further contribution, we obtain the power functions of two distance covariance tests. We apply our findings to study the statistical power of these tests under different simulation settings."}, "https://arxiv.org/abs/2409.14256": {"title": "POI-SIMEX for Conditionally Poisson Distributed Biomarkers from Tissue Histology", "link": "https://arxiv.org/abs/2409.14256", "description": "arXiv:2409.14256v1 Announce Type: new \nAbstract: Covariate measurement error in regression analysis is an important issue that has been studied extensively under the classical additive and the Berkson error models. Here, we consider cases where covariates are derived from tumor tissue histology, and in particular tissue microarrays. In such settings, biomarkers are evaluated from tissue cores that are subsampled from a larger tissue section so that these biomarkers are only estimates of the true cell densities. The resulting measurement error is non-negligible but is seldom accounted for in the analysis of cancer studies involving tissue microarrays. 
To adjust for this type of measurement error, we assume that these discrete-valued biomarkers are conditionally Poisson distributed, based on a Poisson process model governing the spatial locations of marker-positive cells. Existing methods for addressing conditional Poisson surrogates, particularly in the absence of internal validation data, are limited. We extend the simulation extrapolation (SIMEX) algorithm to accommodate the conditional Poisson case (POI-SIMEX), where measurement errors are non-Gaussian with heteroscedastic variance. The proposed estimator is shown to be strongly consistent in a linear regression model under the assumption of a conditional Poisson distribution for the observed biomarker. Simulation studies evaluate the performance of POI-SIMEX, comparing it with the naive method and an alternative corrected likelihood approach in linear regression and survival analysis contexts. POI-SIMEX is then applied to a study of high-grade serous cancer, examining the association between survival and the presence of triple-positive biomarker (CD3+CD8+FOXP3+ cells)"}, "https://arxiv.org/abs/2409.14263": {"title": "Potential root mean square error skill score", "link": "https://arxiv.org/abs/2409.14263", "description": "arXiv:2409.14263v1 Announce Type: new \nAbstract: Consistency, in a narrow sense, denotes the alignment between the forecast-optimization strategy and the verification directive. The current recommended deterministic solar forecast verification practice is to report the skill score based on root mean square error (RMSE), which would violate the notion of consistency if the forecasts are optimized under another strategy such as minimizing the mean absolute error (MAE). This paper overcomes such difficulty by proposing a so-called \"potential RMSE skill score,\" which depends only on: (1) the crosscorrelation between forecasts and observations, and (2) the autocorrelation of observations. While greatly simplifying the calculation, the new skill score does not discriminate inconsistent forecasts as much, e.g., even MAE-optimized forecasts can attain a high RMSE skill score."}, "https://arxiv.org/abs/2409.14397": {"title": "High-Dimensional Tensor Classification with CP Low-Rank Discriminant Structure", "link": "https://arxiv.org/abs/2409.14397", "description": "arXiv:2409.14397v1 Announce Type: new \nAbstract: Tensor classification has become increasingly crucial in statistics and machine learning, with applications spanning neuroimaging, computer vision, and recommendation systems. However, the high dimensionality of tensors presents significant challenges in both theory and practice. To address these challenges, we introduce a novel data-driven classification framework based on linear discriminant analysis (LDA) that exploits the CP low-rank structure in the discriminant tensor. Our approach includes an advanced iterative projection algorithm for tensor LDA and incorporates a novel initialization scheme called Randomized Composite PCA (\\textsc{rc-PCA}). \\textsc{rc-PCA}, potentially of independent interest beyond tensor classification, relaxes the incoherence and eigen-ratio assumptions of existing algorithms and provides a warm start close to the global optimum. We establish global convergence guarantees for the tensor estimation algorithm using \\textsc{rc-PCA} and develop new perturbation analyses for noise with cross-correlation, extending beyond the traditional i.i.d. assumption. 
This theoretical advancement has potential applications across various fields dealing with correlated data and allows us to derive statistical upper bounds on tensor estimation errors. Additionally, we confirm the rate-optimality of our classifier by establishing minimax optimal misclassification rates across a wide class of parameter spaces. Extensive simulations and real-world applications validate our method's superior performance.\n Keywords: Tensor classification; Linear discriminant analysis; Tensor iterative projection; CP low-rank; High-dimensional data; Minimax optimality."}, "https://arxiv.org/abs/2409.14646": {"title": "Scalable Expectation Propagation for Mixed-Effects Regression", "link": "https://arxiv.org/abs/2409.14646", "description": "arXiv:2409.14646v1 Announce Type: new \nAbstract: Mixed-effects regression models represent a useful subclass of regression models for grouped data; the introduction of random effects allows for the correlation between observations within each group to be conveniently captured when inferring the fixed effects. At a time when such regression models are being fit to increasingly large datasets with many groups, it is ideal if (a) the time it takes to make the inferences scales linearly with the number of groups and (b) the inference workload can be distributed across multiple computational nodes in a numerically stable way, if the dataset cannot be stored in one location. Current Bayesian inference approaches for mixed-effects regression models do not seem to account for both challenges simultaneously. To address this, we develop an expectation propagation (EP) framework in this setting that is both scalable and numerically stable when distributed for the case where there is only one grouping factor. The main technical innovations lie in the sparse reparameterisation of the EP algorithm, and a moment propagation (MP) based refinement for multivariate random effect factor approximations. Experiments are conducted to show that this EP framework achieves linear scaling, while having comparable accuracy to other scalable approximate Bayesian inference (ABI) approaches."}, "https://arxiv.org/abs/2409.14684": {"title": "Consistent Order Determination of Markov Decision Process", "link": "https://arxiv.org/abs/2409.14684", "description": "arXiv:2409.14684v1 Announce Type: new \nAbstract: The Markov assumption in Markov Decision Processes (MDPs) is fundamental in reinforcement learning, influencing both theoretical research and practical applications. Existing methods that rely on the Bellman equation benefit tremendously from this assumption for policy evaluation and inference. Testing the Markov assumption or selecting the appropriate order is important for further analysis. Existing tests primarily utilize sequential hypothesis testing methodology, increasing the tested order if the previously-tested one is rejected. However, this methodology accumulates type-I and type-II errors across the sequential testing procedure, which causes inconsistent order estimation even with large sample sizes. To tackle this challenge, we develop a procedure that consistently distinguishes the true order from others. We first propose a novel estimator that equivalently represents any order Markov assumption. Based on this estimator, we thus construct a signal function and an associated signal statistic to achieve estimation consistency. Additionally, the curve pattern of the signal statistic facilitates easy visualization, assisting the order determination process in practice. 
Numerical studies validate the efficacy of our approach."}, "https://arxiv.org/abs/2409.14706": {"title": "Analysis of Stepped-Wedge Cluster Randomized Trials when treatment effect varies by exposure time or calendar time", "link": "https://arxiv.org/abs/2409.14706", "description": "arXiv:2409.14706v1 Announce Type: new \nAbstract: Stepped-wedge cluster randomized trials (SW-CRTs) are traditionally analyzed with models that assume an immediate and sustained treatment effect. Previous work has shown that making such an assumption in the analysis of SW-CRTs when the true underlying treatment effect varies by exposure time can produce severely misleading estimates. Alternatively, the true underlying treatment effect might vary by calendar time. Comparatively less work has examined treatment effect structure misspecification in this setting. Here, we evaluate the behavior of the mixed effects model-based immediate treatment effect, exposure time-averaged treatment effect, and calendar time-averaged treatment effect estimators in different scenarios where they are misspecified for the true underlying treatment effect structure. We prove that the immediate treatment effect estimator can be relatively robust to bias when estimating a true underlying calendar time-averaged treatment effect estimand. However, when there is a true underlying calendar (exposure) time-varying treatment effect, misspecifying an analysis with an exposure (calendar) time-averaged treatment effect estimator can yield severely misleading estimates and even converge to a value of the opposite sign of the true calendar (exposure) time-averaged treatment effect estimand. Researchers should carefully consider how the treatment effect may vary as a function of exposure time and/or calendar time in the analysis of SW-CRTs."}, "https://arxiv.org/abs/2409.14770": {"title": "Clinical research and methodology What usage and what hierarchical order for secondary endpoints?", "link": "https://arxiv.org/abs/2409.14770", "description": "arXiv:2409.14770v1 Announce Type: new \nAbstract: In a randomised clinical trial, when the result of the primary endpoint shows a significant benefit, the secondary endpoints are scrutinised to identify additional effects of the treatment. However, this approach entails a risk of concluding that there is a benefit for one of these endpoints when such benefit does not exist (inflation of type I error risk). There are mainly two methods used to control the risk of drawing erroneous conclusions for secondary endpoints. The first method consists of distributing the risk over several co-primary endpoints, so as to maintain an overall risk of 5%. The second is the hierarchical test procedure, which consists of first establishing a hierarchy of the endpoints, then evaluating each endpoint in succession according to this hierarchy while the endpoints continue to show statistical significance. This simple method makes it possible to show the additional advantages of treatments and to identify the factors that differentiate them."}, "https://arxiv.org/abs/2409.14776": {"title": "Inequality Sensitive Optimal Treatment Assignment", "link": "https://arxiv.org/abs/2409.14776", "description": "arXiv:2409.14776v1 Announce Type: new \nAbstract: The egalitarian equivalent, $ee$, of a societal distribution of outcomes with mean $m$ is the outcome level such that the evaluator is indifferent between the distribution of outcomes and a society in which everyone obtains an outcome of $ee$. 
For an inequality averse evaluator, $ee < m$. In this paper, I extend the optimal treatment choice framework in Manski (2024) to the case where the welfare evaluation is made using egalitarian equivalent measures, and derive optimal treatment rules for the Bayesian, maximin and minimax regret inequality averse evaluators. I illustrate how the methodology operates in the context of the JobCorps education and training program for disadvantaged youth (Schochet, Burghardt, and McConnell 2008) and in Meager (2022)'s Bayesian meta-analysis of the microcredit literature."}, "https://arxiv.org/abs/2409.14806": {"title": "Rescaled Bayes factors: a class of e-variables", "link": "https://arxiv.org/abs/2409.14806", "description": "arXiv:2409.14806v1 Announce Type: new \nAbstract: A class of e-variables is introduced and analyzed. Some examples are presented."}, "https://arxiv.org/abs/2409.14926": {"title": "Early and Late Buzzards: Comparing Different Approaches for Quantile-based Multiple Testing in Heavy-Tailed Wildlife Research Data", "link": "https://arxiv.org/abs/2409.14926", "description": "arXiv:2409.14926v1 Announce Type: new \nAbstract: In medical, ecological and psychological research, there is a need for methods to handle multiple testing, for example to consider group comparisons with more than two groups. Typical approaches that deal with multiple testing are mean or variance based, which can be less effective in the context of heavy-tailed and skewed data. Here, the median is the preferred measure of location and the interquartile range (IQR) is an adequate alternative to the variance. Therefore, it may be fruitful to formulate research questions of interest in terms of the median or the IQR. For this reason, we compare different inference approaches for two-sided and non-inferiority hypotheses formulated in terms of medians or IQRs in an extensive simulation study. We consider multiple contrast testing procedures combined with a bootstrap method as well as testing procedures with Bonferroni correction. As an example of a multiple testing problem based on heavy-tailed data we analyse an ecological trait variation in early and late breeding in a medium-sized bird of prey."}, "https://arxiv.org/abs/2409.14937": {"title": "Risk Estimate under a Nonstationary Autoregressive Model for Data-Driven Reproduction Number Estimation", "link": "https://arxiv.org/abs/2409.14937", "description": "arXiv:2409.14937v1 Announce Type: new \nAbstract: The COVID-19 pandemic has brought to the fore epidemiological models which, though describing a rich variety of behaviors, have previously received little attention in the signal processing literature. During the pandemic, several works successfully leveraged state-of-the-art signal processing strategies to robustly infer epidemiological indicators despite the low quality of COVID-19 data. In the present work, a novel nonstationary autoregressive model is introduced, encompassing, but not reducing to, one of the most popular models for the propagation of viral epidemics. Using a variational framework, penalized likelihood estimators of the parameters of this new model are designed. In practice, the main bottleneck is that the estimation accuracy strongly depends on hyperparameter tuning. Without available ground truth, hyperparameters are selected by minimizing specifically designed data-driven oracles, used as proxies for the estimation error. 
Focusing on the nonstationary autoregressive Poisson model, Stein's Unbiased Risk Estimate formalism is generalized to construct asymptotically unbiased risk estimators based on the derivation of an original autoregressive counterpart of Stein's lemma. The accuracy of these oracles and of the resulting estimates is assessed through intensive Monte Carlo simulations on synthetic data. Then, elaborating on recent epidemiological models, a novel weekly scaled Poisson model is proposed, which better accounts for the intrinsic variability of the contamination while being robust to reporting errors. Finally, the overall data-driven procedure is particularized to the estimation of the COVID-19 reproduction number and exemplified on real COVID-19 infection counts in different countries and at different stages of the pandemic, demonstrating its ability to yield consistent estimates."}, "https://arxiv.org/abs/2409.15070": {"title": "Non-linear dependence and Granger causality: A vine copula approach", "link": "https://arxiv.org/abs/2409.15070", "description": "arXiv:2409.15070v1 Announce Type: new \nAbstract: Inspired by Jang et al. (2022), we propose a Granger causality-in-the-mean test for bivariate $k$-Markov stationary processes based on a recently introduced class of non-linear models, i.e., vine copula models. By means of a simulation study, we show that the proposed test improves on the statistical properties of the original test in Jang et al. (2022), constituting an excellent tool for testing Granger causality in the presence of non-linear dependence structures. Finally, we apply our test to study the pairwise relationships between energy consumption, GDP and investment in the U.S. and, notably, we find that Granger causality runs both ways between GDP and energy consumption."}, "https://arxiv.org/abs/2409.15145": {"title": "Adaptive weight selection for time-to-event data under non-proportional hazards", "link": "https://arxiv.org/abs/2409.15145", "description": "arXiv:2409.15145v1 Announce Type: new \nAbstract: When planning a clinical trial for a time-to-event endpoint, we require an estimated effect size and need to consider the type of effect. Usually, an effect of proportional hazards is assumed with the hazard ratio as the corresponding effect measure. Thus, the standard procedure for survival data is generally based on a single-stage log-rank test. Knowing that the assumption of proportional hazards is often violated and sufficient knowledge to derive reasonable effect sizes is usually unavailable, such an approach is relatively rigid. We introduce a more flexible procedure by combining two methods designed to be more robust in case we have little to no prior knowledge. First, we employ a more flexible adaptive multi-stage design instead of a single-stage design. Second, we apply combination-type tests in the first stage of our suggested procedure to benefit from their robustness under uncertainty about the deviation pattern. We can then use the data collected during this period to choose a more specific single-weighted log-rank test for the subsequent stages. In this step, we employ Royston-Parmar spline models to extrapolate the survival curves to make a reasonable decision. Based on a real-world data example, we show that our approach can save a trial that would otherwise end with an inconclusive result. 
Additionally, our simulation studies demonstrate a sufficient power performance while maintaining more flexibility."}, "https://arxiv.org/abs/2409.14079": {"title": "Grid Point Approximation for Distributed Nonparametric Smoothing and Prediction", "link": "https://arxiv.org/abs/2409.14079", "description": "arXiv:2409.14079v1 Announce Type: cross \nAbstract: Kernel smoothing is a widely used nonparametric method in modern statistical analysis. The problem of efficiently conducting kernel smoothing for a massive dataset on a distributed system is a problem of great importance. In this work, we find that the popularly used one-shot type estimator is highly inefficient for prediction purposes. To this end, we propose a novel grid point approximation (GPA) method, which has the following advantages. First, the resulting GPA estimator is as statistically efficient as the global estimator under mild conditions. Second, it requires no communication and is extremely efficient in terms of computation for prediction. Third, it is applicable to the case where the data are not randomly distributed across different machines. To select a suitable bandwidth, two novel bandwidth selectors are further developed and theoretically supported. Extensive numerical studies are conducted to corroborate our theoretical findings. Two real data examples are also provided to demonstrate the usefulness of our GPA method."}, "https://arxiv.org/abs/2409.14284": {"title": "Survey Data Integration for Distribution Function Estimation", "link": "https://arxiv.org/abs/2409.14284", "description": "arXiv:2409.14284v1 Announce Type: cross \nAbstract: Integration of probabilistic and non-probabilistic samples for the estimation of finite population totals (or means) has recently received considerable attention in the field of survey sampling; yet, to the best of our knowledge, this framework has not been extended to cumulative distribution function (CDF) estimation. To address this gap, we propose a novel CDF estimator that integrates data from probability samples with data from (potentially large) nonprobability samples. Assuming that a set of shared covariates are observed in both samples, while the response variable is observed only in the latter, the proposed estimator uses a survey-weighted empirical CDF of regression residuals trained on the convenience sample to estimate the CDF of the response variable. Under some regularity conditions, we show that our CDF estimator is both design-consistent for the finite population CDF and asymptotically normally distributed. Additionally, we define and study a quantile estimator based on the proposed CDF estimator. Furthermore, we use both the bootstrap and asymptotic formulae to estimate their respective sampling variances. Our empirical results show that the proposed CDF estimator is robust to model misspecification under ignorability, and robust to ignorability under model misspecification. When both assumptions are violated, our residual-based CDF estimator still outperforms its 'plug-in' mass imputation and naive siblings, albeit with noted decreases in efficiency."}, "https://arxiv.org/abs/2409.14326": {"title": "Optimal sequencing depth for single-cell RNA-sequencing in Wasserstein space", "link": "https://arxiv.org/abs/2409.14326", "description": "arXiv:2409.14326v1 Announce Type: cross \nAbstract: How many samples should one collect for an empirical distribution to be as close as possible to the true population? 
This question is not trivial in the context of single-cell RNA-sequencing. With limited sequencing depth, profiling more cells comes at the cost of fewer reads per cell. Therefore, one must strike a balance between the number of cells sampled and the accuracy of each measured gene expression profile. In this paper, we analyze an empirical distribution of cells and obtain upper and lower bounds on the Wasserstein distance to the true population. Our analysis holds for general, non-parametric distributions of cells, and is validated by simulation experiments on a real single-cell dataset."}, "https://arxiv.org/abs/2409.14593": {"title": "Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies", "link": "https://arxiv.org/abs/2409.14593", "description": "arXiv:2409.14593v1 Announce Type: cross \nAbstract: Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm."}, "https://arxiv.org/abs/2409.14734": {"title": "The continuous-time limit of quasi score-driven volatility models", "link": "https://arxiv.org/abs/2409.14734", "description": "arXiv:2409.14734v1 Announce Type: cross \nAbstract: This paper explores the continuous-time limit of a class of Quasi Score-Driven (QSD) models that characterize volatility. As the sampling frequency increases and the time interval tends to zero, the model weakly converges to a continuous-time stochastic volatility model where the two Brownian motions are correlated, thereby capturing the leverage effect in the market. Subsequently, we identify that a necessary condition for non-degenerate correlation is that the distribution of the driving innovations differs from the one used to compute the score, with at least one of them being asymmetric. We then illustrate this with two typical examples. As an application, the QSD model is used as an approximation for correlated stochastic volatility diffusions and quasi maximum likelihood estimation is performed. 
Simulation results confirm the method's effectiveness, particularly in estimating the correlation coefficient."}, "https://arxiv.org/abs/2111.06985": {"title": "Nonparametric Bayesian Knockoff Generators for Feature Selection Under Complex Data Structure", "link": "https://arxiv.org/abs/2111.06985", "description": "arXiv:2111.06985v2 Announce Type: replace \nAbstract: The recent proliferation of high-dimensional data, such as electronic health records and genetics data, offers new opportunities to find novel predictors of outcomes. Presented with a large set of candidate features, interest often lies in selecting the ones most likely to be predictive of an outcome for further study. Controlling the false discovery rate (FDR) at a specified level is often desired in evaluating these variables. Knockoff filtering is an innovative strategy for conducting FDR-controlled feature selection. This paper proposes a nonparametric Bayesian model for generating high-quality knockoff copies that can improve the accuracy of predictive feature identification for variables arising from complex distributions, which can be skewed, highly dispersed and/or a mixture of distributions. This paper provides a detailed description for generating knockoff copies from a GDPM model via MCMC posterior sampling. Additionally, we provide a theoretical guarantee on the robustness of the knockoff procedure. Through simulations, the method is shown to identify important features with accurate FDR control and improved power over the popular second-order Gaussian knockoff generator. Furthermore, the model is compared with finite Gaussian mixture knockoff generator in FDR and power. The proposed technique is applied for detecting genes predictive of survival in ovarian cancer patients using data from The Cancer Genome Atlas (TCGA)."}, "https://arxiv.org/abs/2201.05430": {"title": "Detecting Multiple Structural Breaks in Systems of Linear Regression Equations with Integrated and Stationary Regressors", "link": "https://arxiv.org/abs/2201.05430", "description": "arXiv:2201.05430v4 Announce Type: replace \nAbstract: In this paper, we propose a two-step procedure based on the group LASSO estimator in combination with a backward elimination algorithm to detect multiple structural breaks in linear regressions with multivariate responses. Applying the two-step estimator, we jointly detect the number and location of structural breaks, and provide consistent estimates of the coefficients. Our framework is flexible enough to allow for a mix of integrated and stationary regressors, as well as deterministic terms. Using simulation experiments, we show that the proposed two-step estimator performs competitively against the likelihood-based approach (Qu and Perron, 2007; Li and Perron, 2017; Oka and Perron, 2018) in finite samples. However, the two-step estimator is computationally much more efficient. An economic application to the identification of structural breaks in the term structure of interest rates illustrates this methodology."}, "https://arxiv.org/abs/2203.09330": {"title": "Fighting Noise with Noise: Causal Inference with Many Candidate Instruments", "link": "https://arxiv.org/abs/2203.09330", "description": "arXiv:2203.09330v3 Announce Type: replace \nAbstract: Instrumental variable methods provide useful tools for inferring causal effects in the presence of unmeasured confounding. To apply these methods with large-scale data sets, a major challenge is to find valid instruments from a possibly large candidate set. 
In practice, most of the candidate instruments are often not relevant for studying a particular exposure of interest. Moreover, not all relevant candidate instruments are valid as they may directly influence the outcome of interest. In this article, we propose a data-driven method for causal inference with many candidate instruments that addresses these two challenges simultaneously. A key component of our proposal involves using pseudo variables, known to be irrelevant, to remove variables from the original set that exhibit spurious correlations with the exposure. Synthetic data analyses show that the proposed method performs favourably compared to existing methods. We apply our method to a Mendelian randomization study estimating the effect of obesity on health-related quality of life."}, "https://arxiv.org/abs/2401.11827": {"title": "Flexible Models for Simple Longitudinal Data", "link": "https://arxiv.org/abs/2401.11827", "description": "arXiv:2401.11827v2 Announce Type: replace \nAbstract: We propose a new method for modelling simple longitudinal data. We aim to do this in a flexible manner (without restrictive assumptions about the shapes of individual trajectories), while exploiting structural similarities between the trajectories. Hierarchical models (such as linear mixed models, generalised additive mixed models and hierarchical generalised additive models) are commonly used to model longitudinal data, but fail to meet one or other of these requirements: either they make restrictive assumptions about the shape of individual trajectories, or fail to exploit structural similarities between trajectories. Functional principal components analysis promises to fulfil both requirements, and methods for functional principal components analysis have been developed for longitudinal data. However, we find that existing methods sometimes give poor-quality estimates of individual trajectories, particularly when the number of observations on each individual is small. We develop a new approach, which we call hierarchical modelling with functional principal components. Inference is conducted based on the full likelihood of all unknown quantities, with a penalty term to control the balance between fit to the data and smoothness of the trajectories. We run simulation studies to demonstrate that the new method substantially improves the quality of inference relative to existing methods across a range of examples, and apply the method to data on changes in body composition in adolescent girls."}, "https://arxiv.org/abs/2409.15530": {"title": "Identifying Elasticities in Autocorrelated Time Series Using Causal Graphs", "link": "https://arxiv.org/abs/2409.15530", "description": "arXiv:2409.15530v1 Announce Type: new \nAbstract: The price elasticity of demand can be estimated from observational data using instrumental variables (IV). However, naive IV estimators may be inconsistent in settings with autocorrelated time series. We argue that causal time graphs can simplify IV identification and help select consistent estimators. To do so, we propose to first model the equilibrium condition by an unobserved confounder, deriving a directed acyclic graph (DAG) while maintaining the assumption of a simultaneous determination of prices and quantities. We then exploit recent advances in graphical inference to derive valid IV estimators, including estimators that achieve consistency by simultaneously estimating nuisance effects. 
We further argue that observing significant differences between the estimates of presumably valid estimators can help to reject false model assumptions, thereby improving our understanding of underlying economic dynamics. We apply this approach to the German electricity market, estimating the price elasticity of demand on simulated and real-world data. The findings underscore the importance of accounting for structural autocorrelation in IV-based analysis."}, "https://arxiv.org/abs/2409.15597": {"title": "Higher-criticism for sparse multi-sensor change-point detection", "link": "https://arxiv.org/abs/2409.15597", "description": "arXiv:2409.15597v1 Announce Type: new \nAbstract: We present a procedure based on higher criticism (Donoho \& Jin 2004) to address the sparse multi-sensor quickest change-point detection problem. Namely, we aim to detect a change in the distribution of the multi-sensor data that might affect a few sensors out of potentially many, while those affected sensors, if they exist, are unknown to us in advance. Our procedure involves testing for a change point in individual sensors and combining multiple tests using higher criticism. As a by-product, our procedure also indicates a set of sensors suspected to be affected by the change. We demonstrate the effectiveness of our method compared to other procedures using extensive numerical evaluations. We analyze our procedure under a theoretical framework involving normal data sensors that might experience a change in both mean and variance. We consider individual tests based on the likelihood ratio or the generalized likelihood ratio statistics and show that our procedure attains the information-theoretic limits of detection. These limits coincide with the existing literature when the change is only in the mean."}, "https://arxiv.org/abs/2409.15663": {"title": "BARD: A seamless two-stage dose optimization design integrating backfill and adaptive randomization", "link": "https://arxiv.org/abs/2409.15663", "description": "arXiv:2409.15663v1 Announce Type: new \nAbstract: One common approach for dose optimization is a two-stage design, which initially conducts dose escalation to identify the maximum tolerated dose (MTD), followed by a randomization stage where patients are assigned to two or more doses to further assess and compare their risk-benefit profiles to identify the optimal dose. A limitation of this approach is its requirement for a relatively large sample size. To address this challenge, we propose a seamless two-stage design, BARD (Backfill and Adaptive Randomization for Dose Optimization), which incorporates two key features to reduce sample size and shorten trial duration. The first feature is the integration of backfilling into the stage 1 dose escalation, enhancing patient enrollment and data generation without prolonging the trial. The second feature involves seamlessly combining patients treated in stage 1 with those in stage 2, enabled by covariate-adaptive randomization, to inform the optimal dose and thereby reduce the sample size. Our simulation study demonstrates that BARD reduces the sample size, improves the accuracy of identifying the optimal dose, and maintains covariate balance in randomization, allowing for unbiased comparisons between doses. 
BARD designs offer an efficient solution to meet the dose optimization requirements set by Project Optimus, with software freely available at www.trialdesign.org."}, "https://arxiv.org/abs/2409.15676": {"title": "TUNE: Algorithm-Agnostic Inference after Changepoint Detection", "link": "https://arxiv.org/abs/2409.15676", "description": "arXiv:2409.15676v1 Announce Type: new \nAbstract: In multiple changepoint analysis, assessing the uncertainty of detected changepoints is crucial for enhancing detection reliability -- a topic that has garnered significant attention. Despite advancements through selective p-values, current methodologies often rely on stringent assumptions tied to specific changepoint models and detection algorithms, potentially compromising the accuracy of post-detection statistical inference. We introduce TUNE (Thresholding Universally and Nullifying change Effect), a novel algorithm-agnostic approach that uniformly controls error probabilities across detected changepoints. TUNE sets a universal threshold for multiple test statistics, applicable across a wide range of algorithms, and directly controls the family-wise error rate without the need for selective p-values. Through extensive theoretical and numerical analyses, TUNE demonstrates versatility, robustness, and competitive power, offering a viable and reliable alternative for model-agnostic post-detection inference."}, "https://arxiv.org/abs/2409.15756": {"title": "A penalized online sequential test of heterogeneous treatment effects for generalized linear models", "link": "https://arxiv.org/abs/2409.15756", "description": "arXiv:2409.15756v1 Announce Type: new \nAbstract: Identification of heterogeneous treatment effects (HTEs) has been increasingly popular and critical in various penalized strategy decisions using the A/B testing approach, especially in the scenario of a consecutive online collection of samples. However, in high-dimensional settings, such an identification remains challenging in the sense of lack of detection power of HTEs with insufficient sample instances for each batch sequentially collected online. In this article, a novel high-dimensional test is proposed, named as the penalized online sequential test (POST), to identify HTEs and select useful covariates simultaneously under continuous monitoring in generalized linear models (GLMs), which achieves high detection power and controls the Type I error. A penalized score test statistic is developed along with an extended p-value process for the online collection of samples, and the proposed POST method is further extended to multiple online testing scenarios, where both high true positive rates and under-controlled false discovery rates are achieved simultaneously. Asymptotic results are established and justified to guarantee properties of the POST, and its performance is evaluated through simulations and analysis of real data, compared with the state-of-the-art online test methods. 
Our findings indicate that the POST method exhibits selection consistency and superb detection power of HTEs as well as excellent control over the Type I error, which endows our method with the capability for timely and efficient inference for online A/B testing in the high-dimensional GLM framework."}, "https://arxiv.org/abs/2409.15995": {"title": "Robust Inference for Non-Linear Regression Models with Applications in Enzyme Kinetics", "link": "https://arxiv.org/abs/2409.15995", "description": "arXiv:2409.15995v1 Announce Type: new \nAbstract: Despite linear regression being the most popular statistical modelling technique, in real life we often need to deal with situations where the true relationship between the response and the covariates is nonlinear in parameters. In such cases one needs to adopt appropriate non-linear regression (NLR) analysis, having wider applications in biochemical and medical studies among many others. In this paper we propose new improved robust estimation and testing methodologies for general NLR models based on the minimum density power divergence approach and apply our proposal to analyze the widely popular Michaelis-Menten (MM) model in enzyme kinetics. We establish the asymptotic properties of our proposed estimator and tests, along with their theoretical robustness characteristics through influence function analysis. For the particular MM model, we have further empirically justified the robustness and the efficiency of our proposed estimator and the testing procedure through extensive simulation studies and several interesting real data examples of enzyme-catalyzed (biochemical) reactions."}, "https://arxiv.org/abs/2409.16003": {"title": "Easy Conditioning far Beyond Gaussian", "link": "https://arxiv.org/abs/2409.16003", "description": "arXiv:2409.16003v1 Announce Type: new \nAbstract: Estimating and sampling from conditional densities plays a critical role in statistics and data science, with a plethora of applications. Numerous methods are available ranging from simple fitting approaches to sophisticated machine learning algorithms. However, selecting from among these often involves a trade-off between conflicting objectives of efficiency, flexibility and interpretability. Starting from well-known easy conditioning results in the Gaussian case, we show, thanks to results pertaining to stability by mixing and marginal transformations, that the latter carry over far beyond the Gaussian case. This enables us to flexibly model multivariate data by accommodating broad classes of multi-modal dependence structures and marginal distributions, while enjoying fast conditioning of fitted joint distributions. In applications, we primarily focus on conditioning via Gaussian versus Gaussian mixture copula models, comparing different fitting implementations for the latter. Numerical experiments with simulated and real data demonstrate the relevance of the approach for conditional sampling, evaluated using multivariate scoring rules."}, "https://arxiv.org/abs/2409.16132": {"title": "Large Bayesian Tensor VARs with Stochastic Volatility", "link": "https://arxiv.org/abs/2409.16132", "description": "arXiv:2409.16132v1 Announce Type: new \nAbstract: We consider Bayesian tensor vector autoregressions (TVARs) in which the VAR coefficients are arranged as a three-dimensional array or tensor, and this coefficient tensor is parameterized using a low-rank CP decomposition. 
We develop a family of TVARs using a general stochastic volatility specification, which includes a wide variety of commonly-used multivariate stochastic volatility and COVID-19 outlier-augmented models. In a forecasting exercise involving 40 US quarterly variables, we show that these TVARs outperform the standard Bayesian VAR with the Minnesota prior. The results also suggest that the parsimonious common stochastic volatility model tends to forecast better than the more flexible Cholesky stochastic volatility model."}, "https://arxiv.org/abs/2409.16276": {"title": "Bayesian Variable Selection and Sparse Estimation for High-Dimensional Graphical Models", "link": "https://arxiv.org/abs/2409.16276", "description": "arXiv:2409.16276v1 Announce Type: new \nAbstract: We introduce a novel Bayesian approach for both covariate selection and sparse precision matrix estimation in the context of high-dimensional Gaussian graphical models involving multiple responses. Our approach provides a sparse estimation of the three distinct sparsity structures: the regression coefficient matrix, the conditional dependency structure among responses, and between responses and covariates. This contrasts with existing methods, which typically focus on any two of these structures but seldom achieve simultaneous sparse estimation for all three. A key aspect of our method is that it leverages the structural sparsity information gained from the presence of irrelevant covariates in the dataset to introduce covariate-level sparsity in the precision and regression coefficient matrices. This is achieved through a Bayesian conditional random field model using a hierarchical spike and slab prior setup. Despite the non-convex nature of the problem, we establish statistical accuracy for points in the high posterior density region, including the maximum-a-posteriori (MAP) estimator. We also present an efficient Expectation-Maximization (EM) algorithm for computing the estimators. Through simulation experiments, we demonstrate the competitive performance of our method, particularly in scenarios with weak signal strength in the precision matrices. Finally, we apply our method to a bike-share dataset, showcasing its predictive performance."}, "https://arxiv.org/abs/2409.15532": {"title": "A theory of generalised coordinates for stochastic differential equations", "link": "https://arxiv.org/abs/2409.15532", "description": "arXiv:2409.15532v1 Announce Type: cross \nAbstract: Stochastic differential equations are ubiquitous modelling tools in physics and the sciences. In most modelling scenarios, random fluctuations driving dynamics or motion have some non-trivial temporal correlation structure, which renders the SDE non-Markovian; a phenomenon commonly known as ``colored'' noise. Thus, an important objective is to develop effective tools for mathematically and numerically studying (possibly non-Markovian) SDEs. In this report, we formalise a mathematical theory for analysing and numerically studying SDEs based on so-called `generalised coordinates of motion'. Like the theory of rough paths, we analyse SDEs pathwise for any given realisation of the noise, not solely probabilistically. Like the established theory of Markovian realisation, we realise non-Markovian SDEs as a Markov process in an extended space. Unlike the established theory of Markovian realisation however, the Markovian realisations here are accurate on short timescales and may be exact globally in time, when flows and fluctuations are analytic. 
This theory is exact for SDEs with analytic flows and fluctuations, and is approximate when flows and fluctuations are differentiable. It provides useful analysis tools, which we employ to solve linear SDEs with analytic fluctuations. It may also be useful for studying rougher SDEs, as these may be identified as the limit of smoother ones. This theory supplies effective, computationally straightforward methods for simulation, filtering and control of SDEs; amongst others, we re-derive generalised Bayesian filtering, a state-of-the-art method for time-series analysis. Looking forward, this report suggests that generalised coordinates have far-reaching applications throughout stochastic differential equations."}, "https://arxiv.org/abs/2409.15677": {"title": "Smoothing the Conditional Value-at-Risk based Pickands Estimators", "link": "https://arxiv.org/abs/2409.15677", "description": "arXiv:2409.15677v1 Announce Type: cross \nAbstract: We incorporate the conditional value-at-risk (CVaR) quantity into a generalized class of Pickands estimators. By introducing CVaR, the newly developed estimators not only retain the desirable properties of consistency, location, and scale invariance inherent to Pickands estimators, but also achieve a reduction in mean squared error (MSE). To address the issue of sensitivity to the choice of the number of top order statistics used for the estimation, and ensure robust estimation, which are crucial in practice, we first propose a beta measure, which is a modified beta density function, to smooth the estimator. Then, we develop an algorithm to approximate the asymptotic mean squared error (AMSE) and determine the optimal beta measure that minimizes AMSE. A simulation study involving a wide range of distributions shows that our estimators have good and highly stable finite-sample performance and compare favorably with the other estimators."}, "https://arxiv.org/abs/2409.15682": {"title": "Linear Contextual Bandits with Interference", "link": "https://arxiv.org/abs/2409.15682", "description": "arXiv:2409.15682v1 Announce Type: cross \nAbstract: Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings, where multiple units are present in the same round, potential interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision-making process. Although some prior work has explored multi-agent and adversarial bandits in interference-aware settings, the effect of interference in CB, as well as the underlying theory, remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in Linear CB (LinCB), bridging the gap between causal inference and online decision-making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite sample upper bounds, and asymptotic properties. 
The effectiveness of our approach is demonstrated through simulations and synthetic data generated based on MovieLens data."}, "https://arxiv.org/abs/2409.15844": {"title": "Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection", "link": "https://arxiv.org/abs/2409.15844", "description": "arXiv:2409.15844v1 Announce Type: cross \nAbstract: We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and hyperparameter tuning for engineering systems, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds."}, "https://arxiv.org/abs/2409.16044": {"title": "Stable Survival Extrapolation via Transfer Learning", "link": "https://arxiv.org/abs/2409.16044", "description": "arXiv:2409.16044v1 Announce Type: cross \nAbstract: The mean survival is the key ingredient of the decision process in several applications, including health economic evaluations. It is defined as the area under the complete survival curve, thus necessitating extrapolation of the observed data. In this article we employ a Bayesian mortality model and transfer its projections in order to construct the baseline population that acts as an anchor of the survival model. This can be seen as an implicit bias-variance trade-off in unseen data. We then propose extrapolation methods based on flexible parametric polyhazard models which can naturally accommodate diverse shapes, including non-proportional hazards and crossing survival curves while typically maintaining a natural interpretation. We estimate the mean survival and related estimands in three cases, namely breast cancer, cardiac arrhythmia and advanced melanoma. Specifically, we evaluate the survival disadvantage of triple negative breast cancer cases, the efficacy of combining immunotherapy with an mRNA cancer therapeutic for advanced melanoma treatment and the suitability of implantable cardioverter defibrillators for cardiac arrhythmia. The latter is conducted in a competing risks context, illustrating how working on the cause-specific hazard alone minimizes potential instability. The results suggest that the proposed approach offers a flexible, interpretable and robust solution when survival extrapolation is required."}, "https://arxiv.org/abs/2210.09339": {"title": "Probability Weighted Clustered Coefficients Regression Models in Complex Survey Sampling", "link": "https://arxiv.org/abs/2210.09339", "description": "arXiv:2210.09339v3 Announce Type: replace \nAbstract: Regression analysis is commonly conducted in survey sampling. However, existing methods fail when the relationships vary across different areas or domains. In this paper, we propose a unified framework to study the group-wise covariate effect under complex survey sampling based on pairwise penalties, and the associated objective function is solved by the alternating direction method of multipliers. 
Theoretical properties of the proposed method are investigated under some generality conditions. Numerical experiments demonstrate the superiority of the proposed method in terms of identifying groups and estimation efficiency for both linear regression models and logistic regression models."}, "https://arxiv.org/abs/2305.08834": {"title": "Elastic Bayesian Model Calibration", "link": "https://arxiv.org/abs/2305.08834", "description": "arXiv:2305.08834v2 Announce Type: replace \nAbstract: Functional data are ubiquitous in scientific modeling. For instance, quantities of interest are modeled as functions of time, space, energy, density, etc. Uncertainty quantification methods for computer models with functional response have resulted in tools for emulation, sensitivity analysis, and calibration that are widely used. However, many of these tools do not perform well when the computer model's parameters control both the amplitude variation of the functional output and its alignment (or phase variation). This paper introduces a framework for Bayesian model calibration when the model responses are misaligned functional data. The approach generates two types of data out of the misaligned functional responses: (1) aligned functions so that the amplitude variation is isolated and (2) warping functions that isolate the phase variation. These two types of data are created for the computer simulation data (both of which may be emulated) and the experimental data. The calibration approach uses both types so that it seeks to match both the amplitude and phase of the experimental data. The framework is careful to respect constraints that arise especially when modeling phase variation, and is framed in a way that it can be done with readily available calibration software. We demonstrate the techniques on two simulated data examples and on two dynamic material science problems: a strength model calibration using flyer plate experiments and an equation of state model calibration using experiments performed on the Sandia National Laboratories' Z-machine."}, "https://arxiv.org/abs/2109.13819": {"title": "Perturbation theory for killed Markov processes and quasi-stationary distributions", "link": "https://arxiv.org/abs/2109.13819", "description": "arXiv:2109.13819v2 Announce Type: replace-cross \nAbstract: Motivated by recent developments of quasi-stationary Monte Carlo methods, we investigate the stability of quasi-stationary distributions of killed Markov processes under perturbations of the generator. We first consider a general bounded self-adjoint perturbation operator, and after that, study a particular unbounded perturbation corresponding to truncation of the killing rate. In both scenarios, we quantify the difference between eigenfunctions of the smallest eigenvalue of the perturbed and unperturbed generators in a Hilbert space norm. As a consequence, L1 norm estimates of the difference of the resulting quasi-stationary distributions in terms of the perturbation are provided."}, "https://arxiv.org/abs/2110.03200": {"title": "High Dimensional Logistic Regression Under Network Dependence", "link": "https://arxiv.org/abs/2110.03200", "description": "arXiv:2110.03200v3 Announce Type: replace-cross \nAbstract: Logistic regression is a key method for modeling the probability of a binary outcome based on a collection of covariates. 
However, the classical formulation of logistic regression relies on the independent sampling assumption, which is often violated when the outcomes interact through an underlying network structure, such as over a temporal/spatial domain or on a social network. This necessitates the development of models that can simultaneously handle both the network `peer-effect' and the effect of high-dimensional covariates. In this paper, we develop a framework for incorporating such dependencies in a high-dimensional logistic regression model by introducing a quadratic interaction term, as in the Ising model, designed to capture the pairwise interactions from the underlying network. The resulting model can also be viewed as an Ising model, where the node-dependent external fields linearly encode the high-dimensional covariates. We propose a penalized maximum pseudo-likelihood method for estimating the network peer-effect and the effect of the covariates (the regression coefficients), which, in addition to handling the high-dimensionality of the parameters, conveniently avoids the computational intractability of the maximum likelihood approach. Under various standard regularity conditions, we show that the corresponding estimate attains the classical high-dimensional rate of consistency. Our results imply that even under network dependence it is possible to consistently estimate the model parameters at the same rate as in classical (independent) logistic regression, when the true parameter is sparse and the underlying network is not too dense. We also develop an efficient algorithm for computing the estimates and validate our theoretical results in numerical experiments. An application to selecting genes in clustering spatial transcriptomics data is also discussed."}, "https://arxiv.org/abs/2302.09034": {"title": "Bayesian Mixtures Models with Repulsive and Attractive Atoms", "link": "https://arxiv.org/abs/2302.09034", "description": "arXiv:2302.09034v3 Announce Type: replace-cross \nAbstract: The study of almost surely discrete random probability measures is an active line of research in Bayesian nonparametrics. The idea of assuming interaction across the atoms of the random probability measure has recently spurred significant interest in the context of Bayesian mixture models. This allows the definition of priors that encourage well-separated and interpretable clusters. In this work, we provide a unified framework for the construction and the Bayesian analysis of random probability measures with interacting atoms, encompassing both repulsive and attractive behaviours. Specifically, we derive closed-form expressions for the posterior distribution, the marginal and predictive distributions, which were not previously available except for the case of measures with i.i.d. atoms. We show how these quantities are fundamental both for prior elicitation and to develop new posterior simulation algorithms for hierarchical mixture models. Our results are obtained without any assumption on the finite point process that governs the atoms of the random measure. Their proofs rely on analytical tools borrowed from the Palm calculus theory, which might be of independent interest. We specialise our treatment to the classes of Poisson, Gibbs, and determinantal point processes, as well as in the case of shot-noise Cox processes. 
Finally, we illustrate the performance of different modelling strategies on simulated and real datasets."}, "https://arxiv.org/abs/2409.16409": {"title": "Robust Mean Squared Prediction Error Estimators of EBLUP of a Small Area Mean Under the Fay-Herriot Model", "link": "https://arxiv.org/abs/2409.16409", "description": "arXiv:2409.16409v1 Announce Type: new \nAbstract: In this paper we derive a second-order unbiased (or nearly unbiased) mean squared prediction error (MSPE) estimator of empirical best linear unbiased predictor (EBLUP) of a small area mean for a non-normal extension to the well-known Fay-Herriot model. Specifically, we derive our MSPE estimator essentially assuming certain moment conditions on both the sampling and random effects distributions. The normality-based Prasad-Rao MSPE estimator has a surprising robustness property in that it remains second-order unbiased under the non-normality of random effects when a simple method-of-moments estimator is used for the variance component and the sampling error distribution is normal. We show that the normality-based MSPE estimator is no longer second-order unbiased when the sampling error distribution is non-normal or when the Fay-Herriot moment method is used to estimate the variance component, even when the sampling error distribution is normal. It is interesting to note that when the simple method-of moments estimator is used for the variance component, our proposed MSPE estimator does not require the estimation of kurtosis of the random effects. Results of a simulation study on the accuracy of the proposed MSPE estimator, under non-normality of both sampling and random effects distributions, are also presented."}, "https://arxiv.org/abs/2409.16463": {"title": "Double-Estimation-Friendly Inference for High Dimensional Misspecified Measurement Error Models", "link": "https://arxiv.org/abs/2409.16463", "description": "arXiv:2409.16463v1 Announce Type: new \nAbstract: In this paper, we introduce an innovative testing procedure for assessing individual hypotheses in high-dimensional linear regression models with measurement errors. This method remains robust even when either the X-model or Y-model is misspecified. We develop a double robust score function that maintains a zero expectation if one of the models is incorrect, and we construct a corresponding score test. We first show the asymptotic normality of our approach in a low-dimensional setting, and then extend it to the high-dimensional models. Our analysis of high-dimensional settings explores scenarios both with and without the sparsity condition, establishing asymptotic normality and non-trivial power performance under local alternatives. Simulation studies and real data analysis demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2409.16683": {"title": "Robust Max Statistics for High-Dimensional Inference", "link": "https://arxiv.org/abs/2409.16683", "description": "arXiv:2409.16683v1 Announce Type: new \nAbstract: Although much progress has been made in the theory and application of bootstrap approximations for max statistics in high dimensions, the literature has largely been restricted to cases involving light-tailed data. To address this issue, we propose an approach to inference based on robust max statistics, and we show that their distributions can be accurately approximated via bootstrapping when the data are both high-dimensional and heavy-tailed. 
In particular, the data are assumed to satisfy an extended version of the well-established $L^{4}$-$L^2$ moment equivalence condition, as well as a weak variance decay condition. In this setting, we show that near-parametric rates of bootstrap approximation can be achieved in the Kolmogorov metric, independently of the data dimension. Moreover, this theoretical result is complemented by favorable empirical results involving both synthetic data and an application to financial data."}, "https://arxiv.org/abs/2409.16829": {"title": "Conditional Testing based on Localized Conformal p-values", "link": "https://arxiv.org/abs/2409.16829", "description": "arXiv:2409.16829v1 Announce Type: new \nAbstract: In this paper, we address conditional testing problems through the conformal inference framework. We define the localized conformal p-values by inverting prediction intervals and prove their theoretical properties. These defined p-values are then applied to several conditional testing problems to illustrate their practicality. Firstly, we propose a conditional outlier detection procedure to test for outliers in the conditional distribution with finite-sample false discovery rate (FDR) control. We also introduce a novel conditional label screening problem with the goal of screening multivariate response variables and propose a screening procedure to control the family-wise error rate (FWER). Finally, we consider the two-sample conditional distribution test and define a weighted U-statistic through the aggregation of localized p-values. Numerical simulations and real-data examples validate the superior performance of our proposed strategies."}, "https://arxiv.org/abs/2409.17039": {"title": "A flexiable approach: variable selection procedures with multilayer FDR control via e-values", "link": "https://arxiv.org/abs/2409.17039", "description": "arXiv:2409.17039v1 Announce Type: new \nAbstract: Consider a scenario where a large number of explanatory features targeting a response variable are analyzed, such that these features are partitioned into different groups according to their domain-specific structures. Furthermore, there may be several such partitions. Such multiple partitions may exist in many real-life scenarios. One such example is spatial genome-wide association studies. Researchers may not only be interested in identifying the features relevant to the response but also aim to determine the relevant groups within each partition. A group is considered relevant if it contains at least one relevant feature. To ensure the replicability of the findings at various resolutions, it is essential to provide false discovery rate (FDR) control for findings at multiple layers simultaneously. This paper presents a general approach that leverages various existing controlled selection procedures to generate more stable results using multilayer FDR control. The key contributions of our proposal are the development of a generalized e-filter that provides multilayer FDR control and the construction of a specific type of generalized e-values to evaluate feature importance. A primary application of our method is an improved version of Data Splitting (DS), called the eDS-filter. Furthermore, we combine the eDS-filter with the version of the group knockoff filter (gKF), resulting in a more flexible approach called the eDS+gKF filter. Simulation studies demonstrate that the proposed methods effectively control the FDR at multiple levels while maintaining or even improving power compared to other approaches. 
Finally, we apply the proposed method to analyze HIV mutation data."}, "https://arxiv.org/abs/2409.17129": {"title": "Bayesian Bivariate Conway-Maxwell-Poisson Regression Model for Correlated Count Data in Sports", "link": "https://arxiv.org/abs/2409.17129", "description": "arXiv:2409.17129v1 Announce Type: new \nAbstract: Count data play a crucial role in sports analytics, providing valuable insights into various aspects of the game. Models that accurately capture the characteristics of count data are essential for making reliable inferences. In this paper, we propose the use of the Conway-Maxwell-Poisson (CMP) model for analyzing count data in sports. The CMP model offers flexibility in modeling data with different levels of dispersion. Here we consider a bivariate CMP model that models the potential correlation between home and away scores by incorporating a random effect specification. We illustrate the advantages of the CMP model through simulations. We then analyze data from baseball and soccer games before, during, and after the COVID-19 pandemic. The performance of our proposed CMP model matches or outperforms standard Poisson and Negative Binomial models, providing a good fit and an accurate estimation of the observed effects in count data with any level of dispersion. The results highlight the robustness and flexibility of the CMP model in analyzing count data in sports, making it a suitable default choice for modeling a diverse range of count data types in sports, where the data dispersion may vary."}, "https://arxiv.org/abs/2409.16407": {"title": "Towards Representation Learning for Weighting Problems in Design-Based Causal Inference", "link": "https://arxiv.org/abs/2409.16407", "description": "arXiv:2409.16407v1 Announce Type: cross \nAbstract: Reweighting a distribution to minimize a distance to a target distribution is a powerful and flexible strategy for estimating a wide range of causal effects, but can be challenging in practice because optimal weights typically depend on knowledge of the underlying data generating process. In this paper, we focus on design-based weights, which do not incorporate outcome information; prominent examples include prospective cohort studies, survey weighting, and the weighting portion of augmented weighting estimators. In such applications, we explore the central role of representation learning in finding desirable weights in practice. Unlike the common approach of assuming a well-specified representation, we highlight the error due to the choice of a representation and outline a general framework for finding suitable representations that minimize this error. Building on recent work that combines balancing weights and neural networks, we propose an end-to-end estimation procedure that learns a flexible representation, while retaining promising theoretical properties. We show that this approach is competitive in a range of common causal inference tasks."}, "https://arxiv.org/abs/1507.08689": {"title": "Multiple Outlier Detection in Samples with Exponential & Pareto Tails: Redeeming the Inward Approach & Detecting Dragon Kings", "link": "https://arxiv.org/abs/1507.08689", "description": "arXiv:1507.08689v2 Announce Type: replace \nAbstract: We introduce two ratio-based robust test statistics, max-robust-sum (MRS) and sum-robust-sum (SRS), designed to enhance the robustness of outlier detection in samples with exponential or Pareto tails. 
We also reintroduce the inward sequential testing method -- formerly relegated since the introduction of outward testing -- and show that MRS and SRS tests reduce the susceptibility of the inward approach to masking, making the inward test as powerful as, and potentially less error-prone than, outward tests. Moreover, inward testing does not require the complicated type I error control of outward tests. A comprehensive comparison of the test statistics is conducted, considering the performance of the proposed tests in both block and sequential tests, and contrasting their performance with classical test statistics across various data scenarios. In five case studies -- financial crashes, nuclear power generation accidents, stock market returns, epidemic fatalities, and city sizes -- significant outliers are detected and related to the concept of `Dragon King' events, defined as meaningful outliers that arise from a unique generating mechanism."}, "https://arxiv.org/abs/2108.02196": {"title": "Synthetic Controls for Experimental Design", "link": "https://arxiv.org/abs/2108.02196", "description": "arXiv:2108.02196v4 Announce Type: replace \nAbstract: This article studies experimental design in settings where the experimental units are large aggregate entities (e.g., markets), and only one or a small number of units can be exposed to the treatment. In such settings, randomization of the treatment may result in treated and control groups with very different characteristics at baseline, inducing biases. We propose a variety of experimental non-randomized synthetic control designs (Abadie, Diamond and Hainmueller, 2010, Abadie and Gardeazabal, 2003) that select the units to be treated, as well as the untreated units to be used as a control group. Average potential outcomes are estimated as weighted averages of the outcomes of treated units for potential outcomes with treatment, and weighted averages of the outcomes of control units for potential outcomes without treatment. We analyze the properties of estimators based on synthetic control designs and propose new inferential techniques. We show that in experimental settings with aggregate units, synthetic control designs can substantially reduce estimation biases in comparison to randomization of the treatment."}, "https://arxiv.org/abs/2209.06294": {"title": "Graph-constrained Analysis for Multivariate Functional Data", "link": "https://arxiv.org/abs/2209.06294", "description": "arXiv:2209.06294v3 Announce Type: replace \nAbstract: Functional Gaussian graphical models (GGM) used for analyzing multivariate functional data customarily estimate an unknown graphical model representing the conditional relationships between the functional variables. However, in many applications of multivariate functional data, the graph is known and existing functional GGM methods cannot preserve a given graphical constraint. In this manuscript, we demonstrate how to conduct multivariate functional analysis that exactly conforms to a given inter-variable graph. We first show the equivalence between partially separable functional GGM and graphical Gaussian processes (GP), proposed originally for constructing optimal covariance functions for multivariate spatial data that retain the conditional independence relations in a given graphical model. The theoretical connection helps design a new algorithm that leverages Dempster's covariance selection to calculate the maximum likelihood estimate of the covariance function for multivariate functional data under graphical constraints. 
We also show that the finite-term truncation of the functional GGM basis expansion used in practice is equivalent to a low-rank graphical GP, which is known to oversmooth marginal distributions. To remedy this, we extend our algorithm to better preserve marginal distributions while still respecting the graph and retaining computational scalability. The insights obtained from the new results presented in this manuscript will help practitioners better understand the relationship between these graphical models and decide on the appropriate method for their specific multivariate data analysis task. The benefits of the proposed algorithms are illustrated using empirical experiments and an application to functional modeling of neuroimaging data using the connectivity graph among regions of the brain."}, "https://arxiv.org/abs/2305.12336": {"title": "Estimation of finite population proportions for small areas -- a statistical data integration approach", "link": "https://arxiv.org/abs/2305.12336", "description": "arXiv:2305.12336v2 Announce Type: replace \nAbstract: Empirical best prediction (EBP) is a well-known method for producing reliable proportion estimates when the primary data source provides only a small or no sample from finite populations. There are potential challenges in implementing existing EBP methodology, such as limited auxiliary variables in the frame (not adequate for building a reasonable working predictive model) or the inability to accurately link the sample to the finite population frame due to the absence of identifiers. In this paper, we propose a new data linkage approach where the finite population frame is replaced by a big probability sample, having a large set of auxiliary variables but not the binary outcome variable of interest. We fit an assumed model on the small probability sample and then impute the outcome variable for all units of the big sample to obtain standard weighted proportions. We develop a new adjusted maximum likelihood (ML) method so that the estimate of model variance does not fall on the boundary, as is otherwise encountered with the commonly used ML method. We also propose an estimator of the mean squared prediction error using a parametric bootstrap method and address computational issues by developing an efficient Expectation Maximization algorithm. The proposed methodology is illustrated in the context of election projection for small areas."}, "https://arxiv.org/abs/2306.03663": {"title": "Bayesian inference for group-level cortical surface image-on-scalar-regression with Gaussian process priors", "link": "https://arxiv.org/abs/2306.03663", "description": "arXiv:2306.03663v2 Announce Type: replace \nAbstract: In regression-based analyses of group-level neuroimage data, researchers typically fit a series of marginal general linear models to image outcomes at each spatially-referenced pixel. Spatial regularization of effects of interest is usually induced indirectly by applying spatial smoothing to the data during preprocessing. While this procedure often works well, resulting inference can be poorly calibrated. Spatial modeling of effects of interest leads to more powerful analyses; however, the number of locations in a typical neuroimage can preclude standard computation with explicitly spatial models. Here we contribute a Bayesian spatial regression model for group-level neuroimaging analyses. We induce regularization of spatially varying regression coefficient functions through Gaussian process priors. 
When combined with a simple nonstationary model for the error process, our prior hierarchy can lead to more data-adaptive smoothing than standard methods. We achieve computational tractability through Vecchia approximation of our prior which, critically, can be constructed for a wide class of spatial correlation functions and results in prior models that retain full spatial rank. We outline several ways to work with our model in practice and compare performance against standard vertex-wise analyses. Finally we illustrate our method in an analysis of cortical surface fMRI task contrast data from a large cohort of children enrolled in the Adolescent Brain Cognitive Development study."}, "https://arxiv.org/abs/1910.03821": {"title": "Quasi Maximum Likelihood Estimation and Inference of Large Approximate Dynamic Factor Models via the EM algorithm", "link": "https://arxiv.org/abs/1910.03821", "description": "arXiv:1910.03821v5 Announce Type: replace-cross \nAbstract: We study estimation of large Dynamic Factor models implemented through the Expectation Maximization (EM) algorithm, jointly with the Kalman smoother. We prove that as both the cross-sectional dimension, $n$, and the sample size, $T$, diverge to infinity: (i) the estimated loadings are $\\sqrt T$-consistent, asymptotically normal and equivalent to their Quasi Maximum Likelihood estimates; (ii) the estimated factors are $\\sqrt n$-consistent, asymptotically normal and equivalent to their Weighted Least Squares estimates. Moreover, the estimated loadings are asymptotically as efficient as those obtained by Principal Components analysis, while the estimated factors are more efficient if the idiosyncratic covariance is sparse enough."}, "https://arxiv.org/abs/2401.11948": {"title": "The Ensemble Kalman Filter for Dynamic Inverse Problems", "link": "https://arxiv.org/abs/2401.11948", "description": "arXiv:2401.11948v2 Announce Type: replace-cross \nAbstract: In inverse problems, the goal is to estimate unknown model parameters from noisy observational data. Traditionally, inverse problems are solved under the assumption of a fixed forward operator describing the observation model. In this article, we consider the extension of this approach to situations where we have a dynamic forward model, motivated by applications in scientific computation and engineering. We specifically consider this extension for a derivative-free optimizer, the ensemble Kalman inversion (EKI). We introduce and justify a new methodology called dynamic-EKI, which is a particle-based method with a changing forward operator. We analyze our new method, presenting results related to the control of our particle system through its covariance structure. This analysis includes moment bounds and an ensemble collapse, which are essential for demonstrating a convergence result. We establish convergence in expectation and validate our theoretical findings through experiments with dynamic-EKI applied to a 2D Darcy flow partial differential equation."}, "https://arxiv.org/abs/2409.17195": {"title": "When Sensitivity Bias Varies Across Subgroups: The Impact of Non-uniform Polarity in List Experiments", "link": "https://arxiv.org/abs/2409.17195", "description": "arXiv:2409.17195v1 Announce Type: new \nAbstract: Survey researchers face the problem of sensitivity bias: since people are reluctant to reveal socially undesirable or otherwise risky traits, aggregate estimates of these traits will be biased. List experiments offer a solution by conferring respondents greater privacy. 
However, little is known about how list experiments fare when sensitivity bias varies across respondent subgroups. For example, a trait that is socially undesirable to one group may be socially desirable in a second group, leading sensitivity bias to be negative in the first group, while it is positive in the second. Or a trait may not be sensitive in one group, leading sensitivity bias to be zero in one group and non-zero in another. We use Monte Carlo simulations to explore what happens when the polarity (sign) of sensitivity bias is non-uniform. We find that a general diagnostic test yields false positives and that commonly used estimators return biased estimates of the prevalence of the sensitive trait, coefficients of covariates, and sensitivity bias itself. The bias is worse when polarity runs in opposite directions across subgroups, and as the difference in subgroup sizes increases. Significantly, non-uniform polarity could explain why some list experiments appear to 'fail'. By defining and systematically investigating the problem of non-uniform polarity, we hope to save some studies from the file-drawer and provide some guidance for future research."}, "https://arxiv.org/abs/2409.17298": {"title": "Sparsity, Regularization and Causality in Agricultural Yield: The Case of Paddy Rice in Peru", "link": "https://arxiv.org/abs/2409.17298", "description": "arXiv:2409.17298v1 Announce Type: new \nAbstract: This study introduces a novel approach that integrates agricultural census data with remotely sensed time series to develop precise predictive models for paddy rice yield across various regions of Peru. By utilizing sparse regression and Elastic-Net regularization techniques, the study identifies causal relationships between key remotely sensed variables -- such as NDVI, precipitation, and temperature -- and agricultural yield. To further enhance prediction accuracy, the first- and second-order dynamic transformations (velocity and acceleration) of these variables are applied, capturing non-linear patterns and delayed effects on yield. The findings highlight the improved predictive performance when combining regularization techniques with climatic and geospatial variables, enabling more precise forecasts of yield variability. The results confirm the existence of causal relationships in the Granger sense, emphasizing the value of this methodology for strategic agricultural management. This contributes to more efficient and sustainable production in paddy rice cultivation."}, "https://arxiv.org/abs/2409.17404": {"title": "Bayesian Covariate-Dependent Graph Learning with a Dual Group Spike-and-Slab Prior", "link": "https://arxiv.org/abs/2409.17404", "description": "arXiv:2409.17404v1 Announce Type: new \nAbstract: Covariate-dependent graph learning has gained increasing interest in the graphical modeling literature for the analysis of heterogeneous data. This task, however, poses challenges to modeling, computational efficiency, and interpretability. The parameter of interest can be naturally represented as a three-dimensional array with elements that can be grouped according to two directions, corresponding to node level and covariate level, respectively. In this article, we propose a novel dual group spike-and-slab prior that enables multi-level selection at covariate-level and node-level, as well as individual (local) level sparsity. We introduce a nested strategy with specific choices to address distinct challenges posed by the various grouping directions. 
For posterior inference, we develop a tuning-free Gibbs sampler for all parameters, which mitigates the difficulties of parameter tuning often encountered in high-dimensional graphical models and facilitates routine implementation. Through simulation studies, we demonstrate that the proposed model outperforms existing methods in its accuracy of graph recovery. We show the practical utility of our model via an application to microbiome data where we seek to better understand the interactions among microbes as well as how these are affected by relevant covariates."}, "https://arxiv.org/abs/2409.17441": {"title": "Factor pre-training in Bayesian multivariate logistic models", "link": "https://arxiv.org/abs/2409.17441", "description": "arXiv:2409.17441v1 Announce Type: new \nAbstract: This article focuses on inference in logistic regression for high-dimensional binary outcomes. A popular approach induces dependence across the outcomes by including latent factors in the linear predictor. Bayesian approaches are useful for characterizing uncertainty in inferring the regression coefficients, factors and loadings, while also incorporating hierarchical and shrinkage structure. However, Markov chain Monte Carlo algorithms for posterior computation face challenges in scaling to high-dimensional outcomes. Motivated by applications in ecology, we exploit a blessing of dimensionality to motivate pre-estimation of the latent factors. Conditionally on the factors, the outcomes are modeled via independent logistic regressions. We implement Gaussian approximations in parallel in inferring the posterior on the regression coefficients and loadings, including a simple adjustment to obtain credible intervals with valid frequentist coverage. We show posterior concentration properties and excellent empirical performance in simulations. The methods are applied to insect biodiversity data in Madagascar."}, "https://arxiv.org/abs/2409.17631": {"title": "Invariant Coordinate Selection and Fisher discriminant subspace beyond the case of two groups", "link": "https://arxiv.org/abs/2409.17631", "description": "arXiv:2409.17631v1 Announce Type: new \nAbstract: Invariant Coordinate Selection (ICS) is a multivariate technique that relies on the simultaneous diagonalization of two scatter matrices. It serves various purposes, including its use as a dimension reduction tool prior to clustering or outlier detection. Unlike methods such as Principal Component Analysis, ICS has a theoretical foundation that explains why and when the identified subspace should contain relevant information. These general results have been examined in detail primarily for specific scatter combinations within a two-cluster framework. In this study, we expand these investigations to include more clusters and scatter combinations. The case of three clusters in particular is studied at length. Based on these expanded theoretical insights and supported by numerical studies, we conclude that ICS is indeed suitable for recovering Fisher's discriminant subspace under very general settings and cases of failure seem rare."}, "https://arxiv.org/abs/2409.17706": {"title": "Stationarity of Manifold Time Series", "link": "https://arxiv.org/abs/2409.17706", "description": "arXiv:2409.17706v1 Announce Type: new \nAbstract: In modern interdisciplinary research, manifold time series data have been garnering more attention. 
A critical question in analyzing such data is ``stationarity'', which reflects the underlying dynamic behavior and is crucial across various fields like cell biology, neuroscience and empirical finance. Yet, there has been an absence of a formal definition of stationarity that is tailored to manifold time series. This work bridges this gap by proposing the first definitions of first-order and second-order stationarity for manifold time series. Additionally, we develop novel statistical procedures to test the stationarity of manifold time series and study their asymptotic properties. Our methods account for the curved nature of manifolds, leading to a more intricate analysis than that in Euclidean space. The effectiveness of our methods is evaluated through numerical simulations and their practical merits are demonstrated through analyzing a cell-type proportion time series dataset from a paper recently published in Cell. The first-order stationarity test result aligns with the biological findings of this paper, while the second-order stationarity test provides numerical support for a critical assumption made therein."}, "https://arxiv.org/abs/2409.17751": {"title": "Granger Causality for Mixed Time Series Generalized Linear Models: A Case Study on Multimodal Brain Connectivity", "link": "https://arxiv.org/abs/2409.17751", "description": "arXiv:2409.17751v1 Announce Type: new \nAbstract: This paper is motivated by studies in neuroscience experiments to understand interactions between nodes in a brain network using different types of data modalities that capture different distinct facets of brain activity. To assess Granger-causality, we introduce a flexible framework through a general class of models that accommodates mixed types of data (binary, count, continuous, and positive components) formulated in a generalized linear model (GLM) fashion. Statistical inference for causality is performed based on both frequentist and Bayesian approaches, with a focus on the latter. Here, we develop a procedure for conducting inference through the proposed Bayesian mixed time series model. By introducing spike and slab priors for some parameters in the model, our inferential approach guides causality order selection and provides proper uncertainty quantification. The proposed methods are then utilized to study the rat spike train and local field potentials (LFP) data recorded during the olfaction working memory task. The proposed methodology provides critical insights into the causal relationship between the rat spiking activity and LFP spectral power. Specifically, power in the LFP beta band is predictive of spiking activity 300 milliseconds later, providing a novel analytical tool for this area of emerging interest in neuroscience and demonstrating its usefulness and flexibility in the study of causality in general."}, "https://arxiv.org/abs/2409.17968": {"title": "Nonparametric Inference Framework for Time-dependent Epidemic Models", "link": "https://arxiv.org/abs/2409.17968", "description": "arXiv:2409.17968v1 Announce Type: new \nAbstract: Compartmental models, especially the Susceptible-Infected-Removed (SIR) model, have long been used to understand the behaviour of various diseases. Allowing parameters, such as the transmission rate, to be time-dependent functions makes it possible to adjust for and make inferences about changes in the process due to mitigation strategies or evolutionary changes of the infectious agent. 
In this article, we attempt to build a nonparametric inference framework for stochastic SIR models with time dependent infection rate. The framework includes three main steps: likelihood approximation, parameter estimation and confidence interval construction. The likelihood function of the stochastic SIR model, which is often intractable, can be approximated using methods such as diffusion approximation or tau leaping. The infection rate is modelled by a B-spline basis whose knot location and number of knots are determined by a fast knot placement method followed by a criterion-based model selection procedure. Finally, a point-wise confidence interval is built using a parametric bootstrap procedure. The performance of the framework is observed through various settings for different epidemic patterns. The model is then applied to the Ontario COVID-19 data across multiple waves."}, "https://arxiv.org/abs/2409.18005": {"title": "Collapsible Kernel Machine Regression for Exposomic Analyses", "link": "https://arxiv.org/abs/2409.18005", "description": "arXiv:2409.18005v1 Announce Type: new \nAbstract: An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibility comes at the cost of low power and difficult interpretation, particularly in exposomic analyses when the number of exposures is large. We propose a flexible framework that allows for separate selection of additive and non-additive effects, unifying additive models and kernel machine regression. The proposed approach yields increased power and simpler interpretation when there is little evidence of interaction. Further, it allows users to specify separate priors for additive and non-additive effects, and allows for tests of non-additive interaction. We extend the approach to the class of multiple index models, in which the special case of kernel machine-distributed lag models are nested. We apply the method to motivating data from a subcohort of the Human Early Life Exposome (HELIX) study containing 65 mixture components grouped into 13 distinct exposure classes."}, "https://arxiv.org/abs/2409.18091": {"title": "Incorporating sparse labels into biologging studies using hidden Markov models with weighted likelihoods", "link": "https://arxiv.org/abs/2409.18091", "description": "arXiv:2409.18091v1 Announce Type: new \nAbstract: Ecologists often use a hidden Markov model to decode a latent process, such as a sequence of an animal's behaviours, from an observed biologging time series. Modern technological devices such as video recorders and drones now allow researchers to directly observe an animal's behaviour. Using these observations as labels of the latent process can improve a hidden Markov model's accuracy when decoding the latent process. However, many wild animals are observed infrequently. Including such rare labels often has a negligible influence on parameter estimates, which in turn does not meaningfully improve the accuracy of the decoded latent process. We introduce a weighted likelihood approach that increases the relative influence of labelled observations. 
We use this approach to develop two hidden Markov models to decode the foraging behaviour of killer whales (Orcinus orca) off the coast of British Columbia, Canada. Using cross-validated evaluation metrics, we show that our weighted likelihood approach produces more accurate and understandable decoded latent processes compared to existing methods. Thus, our method effectively leverages sparse labels to enhance researchers' ability to accurately decode hidden processes across various fields."}, "https://arxiv.org/abs/2409.18117": {"title": "Formulating the Proxy Pattern-Mixture Model as a Selection Model to Assist with Sensitivity Analysis", "link": "https://arxiv.org/abs/2409.18117", "description": "arXiv:2409.18117v1 Announce Type: new \nAbstract: Proxy pattern-mixture models (PPMM) have previously been proposed as a model-based framework for assessing the potential for nonignorable nonresponse in sample surveys and nonignorable selection in nonprobability samples. One defining feature of the PPMM is the single sensitivity parameter, $\\phi$, that ranges from 0 to 1 and governs the degree of departure from ignorability. While this sensitivity parameter is attractive in its simplicity, it may also be of interest to describe departures from ignorability in terms of how the odds of response (or selection) depend on the outcome being measured. In this paper, we re-express the PPMM as a selection model, using the known relationship between pattern-mixture models and selection models, in order to better understand the underlying assumptions of the PPMM and the implied effect of the outcome on nonresponse. The selection model that corresponds to the PPMM is a quadratic function of the survey outcome and proxy variable, and the magnitude of the effect depends on the value of the sensitivity parameter, $\\phi$ (missingness/selection mechanism), the differences in the proxy means and standard deviations for the respondent and nonrespondent populations, and the strength of the proxy, $\\rho^{(1)}$. Large values of $\\phi$ (beyond $0.5$) often result in unrealistic selection mechanisms, and the corresponding selection model can be used to establish more realistic bounds on $\\phi$. We illustrate the results using data from the U.S. Census Household Pulse Survey."}, "https://arxiv.org/abs/2409.17544": {"title": "Optimizing the Induced Correlation in Omnibus Joint Graph Embeddings", "link": "https://arxiv.org/abs/2409.17544", "description": "arXiv:2409.17544v1 Announce Type: cross \nAbstract: Theoretical and empirical evidence suggests that joint graph embedding algorithms induce correlation across the networks in the embedding space. In the Omnibus joint graph embedding framework, previous results explicitly delineated the dual effects of the algorithm-induced and model-inherent correlations on the correlation across the embedded networks. Accounting for and mitigating the algorithm-induced correlation is key to subsequent inference, as sub-optimal Omnibus matrix constructions have been demonstrated to lead to loss in inference fidelity. This work presents the first efforts to automate the Omnibus construction in order to address two key questions in this joint embedding framework: the correlation-to-OMNI problem and the flat correlation problem. In the flat correlation problem, we seek to understand the minimum algorithm-induced flat correlation (i.e., the same across all graph pairs) produced by a generalized Omnibus embedding. 
Working in a subspace of the fully general Omnibus matrices, we prove both a lower bound for this flat correlation and that the classical Omnibus construction induces the maximal flat correlation. In the correlation-to-OMNI problem, we present an algorithm -- named corr2Omni -- that, from a given matrix of estimated pairwise graph correlations, estimates the matrix of generalized Omnibus weights that induces optimal correlation in the embedding space. Moreover, in both simulated and real data settings, we demonstrate the increased effectiveness of our corr2Omni algorithm versus the classical Omnibus construction."}, "https://arxiv.org/abs/2409.17804": {"title": "Enriched Functional Tree-Based Classifiers: A Novel Approach Leveraging Derivatives and Geometric Features", "link": "https://arxiv.org/abs/2409.17804", "description": "arXiv:2409.17804v1 Announce Type: cross \nAbstract: The positioning of this research falls within the scalar-on-function classification literature, a field of significant interest across various domains, particularly in statistics, mathematics, and computer science. This study introduces an advanced methodology for supervised classification by integrating Functional Data Analysis (FDA) with tree-based ensemble techniques for classifying high-dimensional time series. The proposed framework, Enriched Functional Tree-Based Classifiers (EFTCs), leverages derivative and geometric features, benefiting from the diversity inherent in ensemble methods to further enhance predictive performance and reduce variance. While our approach has been tested on the enrichment of Functional Classification Trees (FCTs), Functional K-NN (FKNN), Functional Random Forest (FRF), Functional XGBoost (FXGB), and Functional LightGBM (FLGBM), it could be extended to other tree-based and non-tree-based classifiers, with appropriate considerations emerging from this investigation. Through extensive experimental evaluations on seven real-world datasets and six simulated scenarios, this proposal demonstrates fascinating improvements over traditional approaches, providing new insights into the application of FDA in complex, high-dimensional learning problems."}, "https://arxiv.org/abs/2409.18118": {"title": "Slowly Scaling Per-Record Differential Privacy", "link": "https://arxiv.org/abs/2409.18118", "description": "arXiv:2409.18118v1 Announce Type: cross \nAbstract: We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released.\n Formal privacy mechanisms generally add randomness, or \"noise,\" to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. 
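Editor's illustration (not the corr2Omni algorithm of arXiv:2409.17544 above): a small numpy sketch of the classical Omnibus matrix construction and its adjacency spectral embedding, which is the baseline that the paper generalizes and optimizes. The toy Erdos-Renyi graphs and embedding dimension are assumptions chosen only for demonstration.

```python
# Hedged sketch: classical Omnibus embedding of m graphs on a common vertex set.
import numpy as np

def omnibus_matrix(adjs):
    """Stack m (n x n) adjacency matrices into the mn x mn Omnibus matrix
    whose (i, j) block is (A_i + A_j) / 2."""
    m, n = len(adjs), adjs[0].shape[0]
    M = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(m):
            M[i * n:(i + 1) * n, j * n:(j + 1) * n] = (adjs[i] + adjs[j]) / 2.0
    return M

def adjacency_spectral_embedding(M, d):
    """Embed with the top-d eigenpairs (by magnitude): rows of U_d * sqrt(|Lambda_d|)."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

# Toy example: three ER(0.3) graphs on 50 vertices, embedded into R^2.
rng = np.random.default_rng(1)
n, m, d = 50, 3, 2
adjs = []
for _ in range(m):
    upper = rng.random((n, n)) < 0.3
    A = np.triu(upper, 1).astype(float)
    adjs.append(A + A.T)
X = adjacency_spectral_embedding(omnibus_matrix(adjs), d)
print(X.shape)  # (150, 2): one d-dimensional latent position per vertex per graph
```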
While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data.\n We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility."}, "https://arxiv.org/abs/2309.02073": {"title": "Debiased regression adjustment in completely randomized experiments with moderately high-dimensional covariates", "link": "https://arxiv.org/abs/2309.02073", "description": "arXiv:2309.02073v2 Announce Type: replace \nAbstract: Completely randomized experiment is the gold standard for causal inference. When the covariate information for each experimental candidate is available, one typical way is to include them in covariate adjustments for more accurate treatment effect estimation. In this paper, we investigate this problem under the randomization-based framework, i.e., that the covariates and potential outcomes of all experimental candidates are assumed as deterministic quantities and the randomness comes solely from the treatment assignment mechanism. Under this framework, to achieve asymptotically valid inference, existing estimators usually require either (i) that the dimension of covariates $p$ grows at a rate no faster than $O(n^{3 / 4})$ as sample size $n \\to \\infty$; or (ii) certain sparsity constraints on the linear representations of potential outcomes constructed via possibly high-dimensional covariates. In this paper, we consider the moderately high-dimensional regime where $p$ is allowed to be in the same order of magnitude as $n$. We develop a novel debiased estimator with a corresponding inference procedure and establish its asymptotic normality under mild assumptions. Our estimator is model-free and does not require any sparsity constraint on potential outcome's linear representations. We also discuss its asymptotic efficiency improvements over the unadjusted treatment effect estimator under different dimensionality constraints. Numerical analysis confirms that compared to other regression adjustment based treatment effect estimators, our debiased estimator performs well in moderately high dimensions."}, "https://arxiv.org/abs/2310.13387": {"title": "Assumption violations in causal discovery and the robustness of score matching", "link": "https://arxiv.org/abs/2310.13387", "description": "arXiv:2310.13387v2 Announce Type: replace \nAbstract: When domain knowledge is limited and experimentation is restricted by ethical, financial, or time constraints, practitioners turn to observational causal discovery methods to recover the causal structure, exploiting the statistical properties of their data. Because causal discovery without further assumptions is an ill-posed problem, each algorithm comes with its own set of usually untestable assumptions, some of which are hard to meet in real datasets. Motivated by these considerations, this paper extensively benchmarks the empirical performance of recent causal discovery methods on observational i.i.d. 
data generated under different background conditions, allowing for violations of the critical assumptions required by each selected approach. Our experimental findings show that score matching-based methods demonstrate surprising performance in the false positive and false negative rate of the inferred graph in these challenging scenarios, and we provide theoretical insights into their performance. This work is also the first effort to benchmark the stability of causal discovery algorithms with respect to the values of their hyperparameters. Finally, we hope this paper will set a new standard for the evaluation of causal discovery methods and can serve as an accessible entry point for practitioners interested in the field, highlighting the empirical implications of different algorithm choices."}, "https://arxiv.org/abs/2312.10176": {"title": "Spectral estimation for spatial point processes and random fields", "link": "https://arxiv.org/abs/2312.10176", "description": "arXiv:2312.10176v2 Announce Type: replace \nAbstract: Spatial variables can be observed in many different forms, such as regularly sampled random fields (lattice data), point processes, and randomly sampled spatial processes. Joint analysis of such collections of observations is clearly desirable, but complicated by the lack of an easily implementable analysis framework. It is well known that Fourier transforms provide such a framework, but its form has eluded data analysts. We formalize it by providing a multitaper analysis framework using coupled discrete and continuous data tapers, combined with the discrete Fourier transform for inference. Using this set of tools is important, as it forms the backbone for practical spectral analysis. In higher dimensions it is important not to be constrained to Cartesian product domains, and so we develop the methodology for spectral analysis using irregular domain data tapers, and the tapered discrete Fourier transform. We discuss its fast implementation, and the asymptotic as well as large finite domain properties. Estimators of partial association between different spatial processes are provided as are principled methods to determine their significance, and we demonstrate their practical utility on a large-scale ecological dataset."}, "https://arxiv.org/abs/2401.01645": {"title": "Model Averaging and Double Machine Learning", "link": "https://arxiv.org/abs/2401.01645", "description": "arXiv:2401.01645v2 Announce Type: replace \nAbstract: This paper discusses pairing double/debiased machine learning (DDML) with stacking, a model averaging method for combining multiple candidate learners, to estimate structural parameters. In addition to conventional stacking, we consider two stacking variants available for DDML: short-stacking exploits the cross-fitting step of DDML to substantially reduce the computational burden and pooled stacking enforces common stacking weights over cross-fitting folds. Using calibrated simulation studies and two applications estimating gender gaps in citations and wages, we show that DDML with stacking is more robust to partially unknown functional forms than common alternative approaches based on single pre-selected learners. 
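Editor's illustration for the DDML-with-stacking abstract (arXiv:2401.01645) above: a hedged Python sketch of double/debiased machine learning with a stacked learner for the partially linear model Y = theta*D + g(X) + e, using cross-fitted residualization and the partialling-out score. The simulated data, learner choices, and fold counts are assumptions; the authors' actual Stata/R implementations are not reproduced here.

```python
# Hedged sketch: DDML (partialling-out) with a stacking learner for the nuisances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta = 1000, 10, 0.5
X = rng.normal(size=(n, p))
g = np.sin(X[:, 0]) + X[:, 1] ** 2           # nonlinear nuisance in the outcome
m = 0.5 * X[:, 0] + np.cos(X[:, 2])          # nonlinear nuisance in the treatment
D = m + rng.normal(size=n)
Y = theta * D + g + rng.normal(size=n)

def stacked_learner():
    # Conventional stacking: combine a linear and a tree-based candidate learner.
    return StackingRegressor(
        estimators=[("lasso", LassoCV(cv=3)),
                    ("rf", RandomForestRegressor(n_estimators=100, random_state=0))],
        final_estimator=LinearRegression(),
    )

# Cross-fitting: out-of-fold predictions of E[Y|X] and E[D|X].
y_hat = cross_val_predict(stacked_learner(), X, Y, cv=5)
d_hat = cross_val_predict(stacked_learner(), X, D, cv=5)

# Robinson-style partialling-out estimate of theta from the residuals.
v, u = D - d_hat, Y - y_hat
theta_hat = np.sum(v * u) / np.sum(v * v)
se = np.sqrt(np.mean((u - v * theta_hat) ** 2 * v ** 2) / n) / np.mean(v * v)
print(round(theta_hat, 3), round(se, 3))
```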
We provide Stata and R software implementing our proposals."}, "https://arxiv.org/abs/2401.07400": {"title": "Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data", "link": "https://arxiv.org/abs/2401.07400", "description": "arXiv:2401.07400v2 Announce Type: replace \nAbstract: Investigating the relationship, particularly the lead-lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, due to technical reasons, the time points at which observations are made are not at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pair-wise time series when considering their strength of lead-lag effects with external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods."}, "https://arxiv.org/abs/2107.07575": {"title": "Optimal tests of the composite null hypothesis arising in mediation analysis", "link": "https://arxiv.org/abs/2107.07575", "description": "arXiv:2107.07575v2 Announce Type: replace-cross \nAbstract: The indirect effect of an exposure on an outcome through an intermediate variable can be identified by a product of regression coefficients under certain causal and regression modeling assumptions. In this context, the null hypothesis of no indirect effect is a composite null hypothesis, as the null holds if either regression coefficient is zero. A consequence is that traditional hypothesis tests are severely underpowered near the origin (i.e., when both coefficients are small with respect to standard errors). We propose hypothesis tests that (i) preserve level alpha type 1 error, (ii) meaningfully improve power when both true underlying effects are small relative to sample size, and (iii) preserve power when at least one is not. One approach gives a closed-form test that is minimax optimal with respect to local power over the alternative parameter space. Another uses sparse linear programming to produce an approximately optimal test for a Bayes risk criterion. We discuss adaptations for performing large-scale hypothesis testing as well as modifications that yield improved interpretability. We provide an R package that implements the minimax optimal test."}, "https://arxiv.org/abs/2205.13469": {"title": "Proximal Estimation and Inference", "link": "https://arxiv.org/abs/2205.13469", "description": "arXiv:2205.13469v3 Announce Type: replace-cross \nAbstract: We build a unifying convex analysis framework characterizing the statistical properties of a large class of penalized estimators, both under a regular and an irregular design. 
Our framework interprets penalized estimators as proximal estimators, defined by a proximal operator applied to a corresponding initial estimator. We characterize the asymptotic properties of proximal estimators, showing that their asymptotic distribution follows a closed-form formula depending only on (i) the asymptotic distribution of the initial estimator, (ii) the estimator's limit penalty subgradient and (iii) the inner product defining the associated proximal operator. In parallel, we characterize the Oracle features of proximal estimators from the properties of their penalty's subgradients. We exploit our approach to systematically cover linear regression settings with a regular or irregular design. For these settings, we build new $\\sqrt{n}-$consistent, asymptotically normal Ridgeless-type proximal estimators, which feature the Oracle property and are shown to perform satisfactorily in practically relevant Monte Carlo settings."}, "https://arxiv.org/abs/2409.18358": {"title": "A Capture-Recapture Approach to Facilitate Causal Inference for a Trial-eligible Observational Cohort", "link": "https://arxiv.org/abs/2409.18358", "description": "arXiv:2409.18358v1 Announce Type: new \nAbstract: Background: We extend recently proposed design-based capture-recapture methods for prevalence estimation among registry participants, in order to support causal inference among a trial-eligible target population. The proposed design for CRC analysis integrates an observational study cohort with a randomized trial involving a small representative study sample, and enhances the generalizability and transportability of the findings. Methods: We develop a novel CRC-type estimator derived via multinomial distribution-based maximum-likelihood that exploits the design to deliver benefits in terms of validity and efficiency for comparing the effects of two treatments on a binary outcome. Additionally, the design enables a direct standardization-type estimator for efficient estimation of general means (e.g., of biomarker levels) under a specific treatment, and for their comparison across treatments. For inference, we propose a tailored Bayesian credible interval approach to improve coverage properties in conjunction with the proposed CRC estimator for binary outcomes, along with a bootstrap percentile interval approach for use in the case of continuous outcomes. Results: Simulations demonstrate the proposed estimators derived from the CRC design. The multinomial-based maximum-likelihood estimator shows benefits in terms of validity and efficiency in treatment effect comparisons, while the direct standardization-type estimator allows comprehensive comparison of treatment effects within the target population. Conclusion: The extended CRC methods provide a useful framework for causal inference in a trial-eligible target population by integrating observational and randomized trial data. The novel estimators enhance the generalizability and transportability of findings, offering efficient and valid tools for treatment effect comparisons on both binary and continuous outcomes."}, "https://arxiv.org/abs/2409.18392": {"title": "PNOD: An Efficient Projected Newton Framework for Exact Optimal Experimental Designs", "link": "https://arxiv.org/abs/2409.18392", "description": "arXiv:2409.18392v1 Announce Type: new \nAbstract: Computing the exact optimal experimental design has been a longstanding challenge in various scientific fields. 
This problem, when formulated using a specific information function, becomes a mixed-integer nonlinear programming (MINLP) problem, which is typically NP-hard, thus making the computation of a globally optimal solution extremely difficult. The branch and bound (BnB) method is a widely used approach for solving such MINLPs, but its practical efficiency heavily relies on the ability to solve continuous relaxations effectively within the BnB search tree. In this paper, we propose a novel projected Newton framework, combining with a vertex exchange method for efficiently solving the associated subproblems, designed to enhance the BnB method. This framework offers strong convergence guarantees by utilizing recent advances in solving self-concordant optimization and convex quadratic programming problems. Extensive numerical experiments on A-optimal and D-optimal design problems, two of the most commonly used models, demonstrate the framework's promising numerical performance. Specifically, our framework significantly improves the efficiency of node evaluation within the BnB search tree and enhances the accuracy of solutions compared to state-of-the-art methods. The proposed framework is implemented in an open source Julia package called \\texttt{PNOD.jl}, which opens up possibilities for its application in a wide range of real-world scenarios."}, "https://arxiv.org/abs/2409.18527": {"title": "Handling Missingness, Failures, and Non-Convergence in Simulation Studies: A Review of Current Practices and Recommendations", "link": "https://arxiv.org/abs/2409.18527", "description": "arXiv:2409.18527v1 Announce Type: new \nAbstract: Simulation studies are commonly used in methodological research for the empirical evaluation of data analysis methods. They generate artificial data sets under specified mechanisms and compare the performance of methods across conditions. However, simulation repetitions do not always produce valid outputs, e.g., due to non-convergence or other algorithmic failures. This phenomenon complicates the interpretation of results, especially when its occurrence differs between methods and conditions. Despite the potentially serious consequences of such \"missingness\", quantitative data on its prevalence and specific guidance on how to deal with it are currently limited. To this end, we reviewed 482 simulation studies published in various methodological journals and systematically assessed the prevalence and handling of missingness. We found that only 23.0% (111/482) of the reviewed simulation studies mention missingness, with even fewer reporting frequency (92/482 = 19.1%) or how it was handled (67/482 = 13.9%). We propose a classification of missingness and possible solutions. We give various recommendations, most notably to always quantify and report missingness, even if none was observed, to align missingness handling with study goals, and to share code and data for reproduction and reanalysis. Using a case study on publication bias adjustment methods, we illustrate common pitfalls and solutions."}, "https://arxiv.org/abs/2409.18550": {"title": "Iterative Trace Minimization for the Reconciliation of Very Short Hierarchical Time Series", "link": "https://arxiv.org/abs/2409.18550", "description": "arXiv:2409.18550v1 Announce Type: new \nAbstract: Time series often appear in an additive hierarchical structure. In such cases, time series on higher levels are the sums of their subordinate time series. This hierarchical structure places a natural constraint on forecasts. 
However, univariate forecasting techniques are incapable of ensuring this forecast coherence. An obvious solution is to forecast only bottom time series and obtain higher level forecasts through aggregation. This approach is also known as the bottom-up approach. In their seminal paper, \\citep{Wickramasuriya2019} propose an optimal reconciliation approach named MinT. It tries to minimize the trace of the underlying covariance matrix of all forecast errors. The MinT algorithm has demonstrated superior performance to the bottom-up and other approaches and enjoys great popularity. This paper provides a simulation study examining the performance of MinT for very short time series and larger hierarchical structures. This scenario makes the covariance estimation required by MinT difficult. A novel iterative approach is introduced which significantly reduces the number of estimated parameters. This approach is capable of improving forecast accuracy further. The application of MinTit is also demonstrated with a case study at the hand of a semiconductor dataset based on data provided by the World Semiconductor Trade Statistics (WSTS), a premier provider of semiconductor market data."}, "https://arxiv.org/abs/2409.18603": {"title": "Which depth to use to construct functional boxplots?", "link": "https://arxiv.org/abs/2409.18603", "description": "arXiv:2409.18603v1 Announce Type: new \nAbstract: This paper answers the question of which functional depth to use to construct a boxplot for functional data. It shows that integrated depths, e.g., the popular modified band depth, do not result in well-defined boxplots. Instead, we argue that infimal depths are the only functional depths that provide a valid construction of a functional boxplot. We also show that the properties of the boxplot are completely determined by properties of the one-dimensional depth function used in defining the infimal depth for functional data. Our claims are supported by (i) a motivating example, (ii) theoretical results concerning the properties of the boxplot, and (iii) a simulation study."}, "https://arxiv.org/abs/2409.18640": {"title": "Time-Varying Multi-Seasonal AR Models", "link": "https://arxiv.org/abs/2409.18640", "description": "arXiv:2409.18640v1 Announce Type: new \nAbstract: We propose a seasonal AR model with time-varying parameter processes in both the regular and seasonal parameters. The model is parameterized to guarantee stability at every time point and can accommodate multiple seasonal periods. The time evolution is modeled by dynamic shrinkage processes to allow for long periods of essentially constant parameters, periods of rapid change as well as abrupt jumps. A Gibbs sampler is developed with a particle Gibbs update step for the AR parameter trajectories. The near-degeneracy of the model, caused by the dynamic shrinkage processes, is shown to pose a challenge for particle methods. To address this, a more robust, faster and accurate approximate sampler based on the extended Kalman filter is proposed. The model and the numerical effectiveness of the Gibbs sampler are investigated on simulated and real data. An application to more than a century of monthly US industrial production data shows interesting clear changes in seasonality over time, particularly during the Great Depression and the recent Covid-19 pandemic. 
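Editor's illustration for the hierarchical reconciliation abstract (arXiv:2409.18550) above: a small numpy sketch contrasting bottom-up aggregation with the standard MinT reconciliation formula of Wickramasuriya et al. cited there. The summing matrix, base forecasts, and error covariance are illustrative assumptions; the paper's iterative MinTit procedure is not reproduced.

```python
# Hedged sketch: MinT reconciliation vs. bottom-up for a toy two-level hierarchy.
import numpy as np

# Hierarchy: total = b1 + b2 + b3 (one aggregate over three bottom series).
S = np.array([[1, 1, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)          # summing matrix (4 series x 3 bottom)

y_hat = np.array([105.0, 30.0, 40.0, 28.0])     # incoherent base forecasts
W = np.diag([4.0, 1.0, 1.5, 1.0])               # assumed forecast-error covariance

# MinT: y_tilde = S (S' W^{-1} S)^{-1} S' W^{-1} y_hat
W_inv = np.linalg.inv(W)
P = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
y_mint = S @ (P @ y_hat)

# Bottom-up: keep only the bottom-level forecasts and aggregate them.
y_bu = S @ y_hat[1:]

print("MinT      :", np.round(y_mint, 2))   # coherent: first entry equals the sum of the rest
print("Bottom-up :", np.round(y_bu, 2))
```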
Keywords: Bayesian inference; Extended Kalman filter; Locally stationary processes; Particle MCMC; Seasonality."}, "https://arxiv.org/abs/2409.18712": {"title": "Computational and Numerical Properties of a Broadband Subspace-Based Likelihood Ratio Test", "link": "https://arxiv.org/abs/2409.18712", "description": "arXiv:2409.18712v1 Announce Type: new \nAbstract: This paper investigates the performance of a likelihood ratio test in combination with a polynomial subspace projection approach to detect weak transient signals in broadband array data. Based on previous empirical evidence that a likelihood ratio test is advantageously applied in a lower-dimensional subspace, we present analysis that highlights how the polynomial subspace projection whitens a crucial part of the signals, enabling a detector to operate with a shortened temporal window. This reduction in temporal correlation, together with a spatial compaction of the data, also leads to both computational and numerical advantages over a likelihood ratio test that is directly applied to the array data. The results of our analysis are illustrated by examples and simulations."}, "https://arxiv.org/abs/2409.18719": {"title": "New flexible versions of extended generalized Pareto model for count data", "link": "https://arxiv.org/abs/2409.18719", "description": "arXiv:2409.18719v1 Announce Type: new \nAbstract: Accurate modeling is essential in integer-valued real phenomena, including the distribution of entire data, zero-inflated (ZI) data, and discrete exceedances. The Poisson and Negative Binomial distributions, along with their ZI variants, are considered suitable for modeling the entire data distribution, but they fail to capture the heavy tail behavior effectively alongside the bulk of the distribution. In contrast, the discrete generalized Pareto distribution (DGPD) is preferred for high threshold exceedances, but it becomes less effective for low threshold exceedances. However, in some applications, the selection of a suitable high threshold is challenging, and the asymptotic conditions required for using DGPD are not always met. To address these limitations, extended versions of DGPD are proposed. These extensions are designed to model one of three scenarios: first, the entire distribution of the data, including both bulk and tail and bypassing the threshold selection step; second, the entire distribution along with ZI; and third, the tail of the distribution for low threshold exceedances. The proposed extensions offer improved estimates across all three scenarios compared to existing models, providing more accurate and reliable results in simulation studies and real data applications."}, "https://arxiv.org/abs/2409.18782": {"title": "Non-parametric efficient estimation of marginal structural models with multi-valued time-varying treatments", "link": "https://arxiv.org/abs/2409.18782", "description": "arXiv:2409.18782v1 Announce Type: new \nAbstract: Marginal structural models are a popular method for estimating causal effects in the presence of time-varying exposures. In spite of their popularity, no scalable non-parametric estimator exist for marginal structural models with multi-valued and time-varying treatments. In this paper, we use machine learning together with recent developments in semiparametric efficiency theory for longitudinal studies to propose such an estimator. 
The proposed estimator is based on a study of the non-parametric identifying functional, including first order von-Mises expansions as well as the efficient influence function and the efficiency bound. We show conditions under which the proposed estimator is efficient, asymptotically normal, and sequentially doubly robust in the sense that it is consistent if, for each time point, either the outcome or the treatment mechanism is consistently estimated. We perform a simulation study to illustrate the properties of the estimators, and present the results of our motivating study on a COVID-19 dataset studying the impact of mobility on the cumulative number of observed cases."}, "https://arxiv.org/abs/2409.18908": {"title": "Inference with Sequential Monte-Carlo Computation of $p$-values: Fast and Valid Approaches", "link": "https://arxiv.org/abs/2409.18908", "description": "arXiv:2409.18908v1 Announce Type: new \nAbstract: Hypothesis tests calibrated by (re)sampling methods (such as permutation, rank and bootstrap tests) are useful tools for statistical analysis, at the computational cost of requiring Monte-Carlo sampling for calibration. It is common and almost universal practice to execute such tests with a predetermined and large number of Monte-Carlo samples, and disregard any randomness from this sampling at the time of drawing and reporting inference. At best, this approach leads to computational inefficiency, and at worst to invalid inference. That being said, a number of approaches in the literature have been proposed to adaptively guide analysts in choosing the number of Monte-Carlo samples, by sequentially deciding when to stop collecting samples and draw inference. These works introduce varying competing notions of what constitutes \"valid\" inference, complicating the landscape for analysts seeking suitable methodology. Furthermore, the majority of these approaches solely guarantee a meaningful estimate of the testing outcome, not the $p$-value itself $\unicode{x2014}$ which is insufficient for many practical applications. In this paper, we survey the relevant literature, and build bridges between the scattered validity notions, highlighting some of their complementary roles. We also introduce a new practical methodology that provides an estimate of the $p$-value of the Monte-Carlo test, endowed with practically relevant validity guarantees. Moreover, our methodology is sequential, updating the $p$-value estimate after each new Monte-Carlo sample has been drawn, while retaining important validity guarantees regardless of the selected stopping time. We conclude this paper with a set of recommendations for the practitioner, both in terms of selection of methodology and manner of reporting results."}, "https://arxiv.org/abs/2409.18321": {"title": "Local Prediction-Powered Inference", "link": "https://arxiv.org/abs/2409.18321", "description": "arXiv:2409.18321v1 Announce Type: cross \nAbstract: To infer a function value at a specific point $x$, it is essential to assign higher weights to the points closer to $x$, which is called local polynomial / multivariable regression. In many practical cases, a limited sample size may ruin this method, but such conditions can be improved by the Prediction-Powered Inference (PPI) technique. This paper introduces a specific algorithm for local multivariable regression using PPI, which can significantly reduce the variance of estimations without enlarging the error. 
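Editor's illustration for the local prediction-powered inference abstract (arXiv:2409.18321) above: a minimal Python sketch of plain local linear regression with Gaussian kernel weights, i.e., the "higher weight to points closer to x" baseline that the PPI-augmented procedure builds on. The simulated data, bandwidth, and query point are illustrative assumptions; the paper's PPI correction itself is not reproduced.

```python
# Hedged sketch: local linear regression via kernel-weighted least squares.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-2, 2, size=n)
Y = np.sin(2 * X) + 0.3 * rng.normal(size=n)

def local_linear(x0, X, Y, h=0.3):
    """Weighted least squares of Y on (1, X - x0) with Gaussian kernel weights;
    the fitted intercept estimates m(x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    Z = np.column_stack([np.ones_like(X), X - x0])
    WZ = Z * w[:, None]
    beta = np.linalg.solve(Z.T @ WZ, WZ.T @ Y)
    return beta[0]

x0 = 0.7
print(local_linear(x0, X, Y), np.sin(2 * x0))  # estimate vs. true value at x0
```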
The confidence intervals, bias correction, and coverage probabilities are analyzed, establishing the correctness and superiority of our algorithm. Numerical simulations and real-data experiments support these conclusions. A further contribution relative to PPI is improved theoretical computational efficiency and explainability, achieved by taking into account the dependency structure of the dependent variable."}, "https://arxiv.org/abs/2409.18374": {"title": "Adaptive Learning of the Latent Space of Wasserstein Generative Adversarial Networks", "link": "https://arxiv.org/abs/2409.18374", "description": "arXiv:2409.18374v1 Announce Type: cross \nAbstract: Generative models based on latent variables, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs), have gained a lot of interest due to their impressive performance in many fields. However, many data such as natural images usually do not populate the ambient Euclidean space but instead reside in a lower-dimensional manifold. Thus an inappropriate choice of the latent dimension fails to uncover the structure of the data, possibly resulting in mismatch of latent representations and poor generative qualities. Towards addressing these problems, we propose a novel framework called the latent Wasserstein GAN (LWGAN) that fuses the Wasserstein auto-encoder and the Wasserstein GAN so that the intrinsic dimension of the data manifold can be adaptively learned by a modified informative latent distribution. We prove that there exist an encoder network and a generator network in such a way that the intrinsic dimension of the learned encoding distribution is equal to the dimension of the data manifold. We theoretically establish that our estimated intrinsic dimension is a consistent estimate of the true dimension of the data manifold. Meanwhile, we provide an upper bound on the generalization error of LWGAN, implying that we force the synthetic data distribution to be similar to the real data distribution from a population perspective. Comprehensive empirical experiments verify our framework and show that LWGAN is able to identify the correct intrinsic dimension under several scenarios, and simultaneously generate high-quality synthetic data by sampling from the learned latent distribution."}, "https://arxiv.org/abs/2409.18643": {"title": "Tail Risk Analysis for Financial Time Series", "link": "https://arxiv.org/abs/2409.18643", "description": "arXiv:2409.18643v1 Announce Type: cross \nAbstract: This book chapter illustrates how to apply extreme value statistics to financial time series data. Such data often exhibits strong serial dependence, which complicates assessment of tail risks. We discuss the two main approaches to tail risk estimation, unconditional and conditional quantile forecasting. We use the S&P 500 index as a case study to assess serial (extremal) dependence, perform an unconditional and conditional risk analysis, and apply backtesting methods. Additionally, the chapter explores the impact of serial dependence on multivariate tail dependence."}, "https://arxiv.org/abs/1709.01050": {"title": "Modeling Interference Via Symmetric Treatment Decomposition", "link": "https://arxiv.org/abs/1709.01050", "description": "arXiv:1709.01050v3 Announce Type: replace \nAbstract: Classical causal inference assumes treatments meant for a given unit do not have an effect on other units. This assumption is violated in interference problems, where new types of spillover causal effects arise, and causal inference becomes much more difficult. 
In addition, interference introduces a unique complication where variables may transmit treatment influences to each other, which is a relationship that has some features of a causal one, but is symmetric.\n In this paper, we develop a new approach to decomposing the spillover effect into unit-specific components that extends the DAG based treatment decomposition approach to mediation of Robins and Richardson to causal models that admit stable symmetric relationships among variables in a network. We discuss two interpretations of such models: a network structural model interpretation, and an interpretation based on equilibrium of structural equation models discussed in (Lauritzen and Richardson, 2002). We show that both interpretations yield identical identification theory, and give conditions for components of the spillover effect to be identified.\n We discuss statistical inference for identified components of the spillover effect, including a maximum likelihood estimator, and a doubly robust estimator for the special case of two interacting outcomes. We verify consistency and robustness of our estimators via a simulation study, and illustrate our method by assessing the causal effect of education attainment on depressive symptoms using the data on households from the Wisconsin Longitudinal Study."}, "https://arxiv.org/abs/2203.08879": {"title": "A Simple and Computationally Trivial Estimator for Grouped Fixed Effects Models", "link": "https://arxiv.org/abs/2203.08879", "description": "arXiv:2203.08879v4 Announce Type: replace \nAbstract: This paper introduces a new fixed effects estimator for linear panel data models with clustered time patterns of unobserved heterogeneity. The method avoids non-convex and combinatorial optimization by combining a preliminary consistent estimator of the slope coefficient, an agglomerative pairwise-differencing clustering of cross-sectional units, and a pooled ordinary least squares regression. Asymptotic guarantees are established in a framework where $T$ can grow at any power of $N$, as both $N$ and $T$ approach infinity. Unlike most existing approaches, the proposed estimator is computationally straightforward and does not require a known upper bound on the number of groups. As existing approaches, this method leads to a consistent estimation of well-separated groups and an estimator of common parameters asymptotically equivalent to the infeasible regression controlling for the true groups. An application revisits the statistical association between income and democracy."}, "https://arxiv.org/abs/2307.07068": {"title": "Scalable Resampling in Massive Generalized Linear Models via Subsampled Residual Bootstrap", "link": "https://arxiv.org/abs/2307.07068", "description": "arXiv:2307.07068v2 Announce Type: replace \nAbstract: Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple and versatile scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. 
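Editor's illustration for the subsampled residual bootstrap abstract (arXiv:2307.07068) above: a minimal Python sketch of the classical residual bootstrap for linear regression, the baseline that SRB accelerates; the subsampling step itself is not reproduced, and the simulated design and number of replications are assumptions.

```python
# Hedged sketch: classical residual bootstrap standard errors for OLS.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
resid_centered = resid - resid.mean()

B = 1000
boot = np.empty((B, p + 1))
for b in range(B):
    # Resample residuals with replacement and refit on the synthetic responses.
    e_star = rng.choice(resid_centered, size=n, replace=True)
    y_star = X @ beta_hat + e_star
    boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

se = boot.std(axis=0, ddof=1)
print(np.round(beta_hat, 3), np.round(se, 3))  # coefficient estimates and bootstrap SEs
```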
We prove consistency and distributional results that establish that the SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository."}, "https://arxiv.org/abs/2401.07294": {"title": "Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies", "link": "https://arxiv.org/abs/2401.07294", "description": "arXiv:2401.07294v3 Announce Type: replace \nAbstract: Metamodels, or the regression analysis of Monte Carlo simulation results, provide a powerful tool to summarize simulation findings. However, an underutilized approach is the multilevel metamodel (MLMM) that accounts for the dependent data structure that arises from fitting multiple models to the same simulated data set. In this study, we articulate the theoretical rationale for the MLMM and illustrate how it can improve the interpretability of simulation results, better account for complex simulation designs, and provide new insights into the generalizability of simulation findings."}, "https://arxiv.org/abs/2401.11263": {"title": "Estimating Heterogeneous Treatment Effects on Survival Outcomes Using Counterfactual Censoring Unbiased Transformations", "link": "https://arxiv.org/abs/2401.11263", "description": "arXiv:2401.11263v2 Announce Type: replace \nAbstract: Methods for estimating heterogeneous treatment effects (HTE) from observational data have largely focused on continuous or binary outcomes, with less attention paid to survival outcomes and almost none to settings with competing risks. In this work, we develop censoring unbiased transformations (CUTs) for survival outcomes both with and without competing risks. After converting time-to-event outcomes using these CUTs, direct application of HTE learners for continuous outcomes yields consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. Our CUTs enable application of a much larger set of state of the art HTE learners for censored outcomes than had previously been available, especially in competing risks settings. We provide generic model-free learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies."}, "https://arxiv.org/abs/2212.09900": {"title": "Policy learning \"without\" overlap: Pessimism and generalized empirical Bernstein's inequality", "link": "https://arxiv.org/abs/2212.09900", "description": "arXiv:2212.09900v3 Announce Type: replace-cross \nAbstract: This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded. 
As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions.\n In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which only depends on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized type concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm via Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL."}, "https://arxiv.org/abs/2401.03820": {"title": "Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices", "link": "https://arxiv.org/abs/2401.03820", "description": "arXiv:2401.03820v2 Announce Type: replace-cross \nAbstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We propose computationally efficient differentially private estimators and prove their minimax optimality for sub-Gaussian distributions, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, compared to the existing literature, our results accommodate a diverging rank, a broader range of signal strengths, and remain valid even when the sample size is much smaller than the dimension, provided the signal strength is sufficiently strong. Both simulation studies and real data experiments demonstrate the merits of our method."}, "https://arxiv.org/abs/2409.19230": {"title": "Propensity Score Augmentation in Matching-based Estimation of Causal Effects", "link": "https://arxiv.org/abs/2409.19230", "description": "arXiv:2409.19230v1 Announce Type: new \nAbstract: When assessing the causal effect of a binary exposure using observational data, confounder imbalance across exposure arms must be addressed. 
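Editor's illustration for the matching-based estimation abstract (arXiv:2409.19230) above: a hedged Python sketch of traditional 1:1 nearest-neighbour propensity-score matching for the average treatment effect on the treated, the kind of estimator whose efficiency the paper's augmented propensity score aims to improve. The data-generating values are illustrative assumptions.

```python
# Hedged sketch: 1:1 propensity-score matching (with replacement) for the ATT.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
p_treat = 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, p_treat)
Y = 1.0 * A + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)   # true effect = 1

# Estimate the propensity score and match each treated unit to its nearest control.
ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
treated, control = np.where(A == 1)[0], np.where(A == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_control = control[idx[:, 0]]

att_hat = np.mean(Y[treated] - Y[matched_control])
print(round(att_hat, 3))
```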
Matching methods, including propensity score-based matching, can be used to deconfound the causal relationship of interest. They have been particularly popular in practice, at least in part due to their simplicity and interpretability. However, these methods can suffer from low statistical efficiency compared to many competing methods. In this work, we propose a novel matching-based estimator of the average treatment effect based on a suitably-augmented propensity score model. Our procedure can be shown to have greater statistical efficiency than traditional matching estimators whenever prognostic variables are available, and in some cases, can nearly reach the nonparametric efficiency bound. In addition to a theoretical study, we provide numerical results to illustrate our findings. Finally, we use our novel procedure to estimate the effect of circumcision on risk of HIV-1 infection using vaccine efficacy trial data."}, "https://arxiv.org/abs/2409.19241": {"title": "Estimating Interpretable Heterogeneous Treatment Effect with Causal Subgroup Discovery in Survival Outcomes", "link": "https://arxiv.org/abs/2409.19241", "description": "arXiv:2409.19241v1 Announce Type: new \nAbstract: Estimating heterogeneous treatment effect (HTE) for survival outcomes has gained increasing attention, as it captures the variation in treatment efficacy across patients or subgroups in delaying disease progression. However, most existing methods focus on post-hoc subgroup identification rather than simultaneously estimating HTE and selecting relevant subgroups. In this paper, we propose an interpretable HTE estimation framework that integrates three meta-learners that simultaneously estimate CATE for survival outcomes and identify predictive subgroups. We evaluated the performance of our method through comprehensive simulation studies across various randomized clinical trial (RCT) settings. Additionally, we demonstrated its application in a large RCT for age-related macular degeneration (AMD), a polygenic progressive eye disease, to estimate the HTE of an antioxidant and mineral supplement on time-to-AMD progression and to identify genetics-based subgroups with enhanced treatment effects. Our method offers a direct interpretation of the estimated HTE and provides evidence to support precision healthcare."}, "https://arxiv.org/abs/2409.19287": {"title": "Factors in Fashion: Factor Analysis towards the Mode", "link": "https://arxiv.org/abs/2409.19287", "description": "arXiv:2409.19287v1 Announce Type: new \nAbstract: The modal factor model represents a new factor model for dimension reduction in high dimensional panel data. Unlike the approximate factor model that targets for the mean factors, it captures factors that influence the conditional mode of the distribution of the observables. Statistical inference is developed with the aid of mode estimation, where the modal factors and the loadings are estimated through maximizing a kernel-type objective function. An easy-to-implement alternating maximization algorithm is designed to obtain the estimators numerically. Two model selection criteria are further proposed to determine the number of factors. The asymptotic properties of the proposed estimators are established under some regularity conditions. Simulations demonstrate the nice finite sample performance of our proposed estimators, even in the presence of heavy-tailed and asymmetric idiosyncratic error distributions. 
Finally, the application to inflation forecasting illustrates the practical merits of modal factors."}, "https://arxiv.org/abs/2409.19400": {"title": "The co-varying ties between networks and item responses via latent variables", "link": "https://arxiv.org/abs/2409.19400", "description": "arXiv:2409.19400v1 Announce Type: new \nAbstract: Relationships among teachers are known to influence their teaching-related perceptions. We study whether and how teachers' advising relationships (networks) are related to their perceptions of satisfaction, students, and influence over educational policies, recorded as their responses to a questionnaire (item responses). We propose a novel joint model of network and item responses (JNIRM) with correlated latent variables to understand these co-varying ties. This methodology allows the analyst to test and interpret the dependence between a network and item responses. Using JNIRM, we discover that teachers' advising relationships contribute to their perceptions of satisfaction and students more often than their perceptions of influence over educational policies. In addition, we observe that the complementarity principle applies in certain schools, where teachers tend to seek advice from those who are different from them. JNIRM shows superior parameter estimation and model fit over separately modeling the network and item responses with latent variable models."}, "https://arxiv.org/abs/2409.19673": {"title": "Priors for Reducing Asymptotic Bias of the Posterior Mean", "link": "https://arxiv.org/abs/2409.19673", "description": "arXiv:2409.19673v1 Announce Type: new \nAbstract: It is shown that the first-order term of the asymptotic bias of the posterior mean is removed by a suitable choice of a prior density. In regular statistical models including exponential families, and linear and logistic regression models, such a prior is given by the squared Jeffreys prior. We also explain the relationship between the proposed prior distribution, the moment matching prior, and the prior distribution that reduces the bias term of the posterior mode."}, "https://arxiv.org/abs/2409.19712": {"title": "Posterior Conformal Prediction", "link": "https://arxiv.org/abs/2409.19712", "description": "arXiv:2409.19712v1 Announce Type: new \nAbstract: Conformal prediction is a popular technique for constructing prediction intervals with distribution-free coverage guarantees. The coverage is marginal, meaning it only holds on average over the entire population but not necessarily for any specific subgroup. This article introduces a new method, posterior conformal prediction (PCP), which generates prediction intervals with both marginal and approximate conditional validity for clusters (or subgroups) naturally discovered in the data. PCP achieves these guarantees by modelling the conditional conformity score distribution as a mixture of cluster distributions. Compared to other methods with approximate conditional validity, this approach produces tighter intervals, particularly when the test data is drawn from clusters that are well represented in the validation data. PCP can also be applied to guarantee conditional coverage on user-specified subgroups, in which case it achieves robust coverage on smaller subgroups within the specified subgroups. In classification, the theory underlying PCP allows for adjusting the coverage level based on the classifier's confidence, achieving significantly smaller sets than standard conformal prediction sets. 
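Editor's illustration for the posterior conformal prediction abstract (arXiv:2409.19712) above: a minimal Python sketch of standard split conformal prediction with absolute-residual scores, the marginal-coverage baseline that PCP refines toward cluster-conditional validity. The simulated data, regressor, and miscoverage level are assumptions.

```python
# Hedged sketch: split conformal prediction intervals with absolute-residual scores.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 3000
X = rng.uniform(-3, 3, size=(n, 2))
Y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=n)

# Split into proper training, calibration, and test folds.
tr, cal, te = np.split(rng.permutation(n), [1500, 2500])
model = GradientBoostingRegressor(random_state=0).fit(X[tr], Y[tr])

alpha = 0.1
scores = np.abs(Y[cal] - model.predict(X[cal]))            # conformity scores
k = int(np.ceil((1 - alpha) * (len(cal) + 1)))             # finite-sample quantile rank
q = np.sort(scores)[k - 1]

pred = model.predict(X[te])
covered = np.abs(Y[te] - pred) <= q                        # interval is pred +/- q
print(round(covered.mean(), 3))   # empirical coverage, close to 0.9 on average
```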
We evaluate the performance of PCP on diverse datasets from socio-economic, scientific and healthcare applications."}, "https://arxiv.org/abs/2409.19729": {"title": "Prior Sensitivity Analysis without Model Re-fit", "link": "https://arxiv.org/abs/2409.19729", "description": "arXiv:2409.19729v1 Announce Type: new \nAbstract: Prior sensitivity analysis is a fundamental method to check the effects of prior distributions on the posterior distribution in Bayesian inference. Exploring the posteriors under several alternative priors can be computationally intensive, particularly for complex latent variable models. To address this issue, we propose a novel method for quantifying the prior sensitivity that does not require model re-fit. Specifically, we present a method to compute the Hellinger and Kullback-Leibler distances between two posterior distributions with base and alternative priors, as well as posterior expectations under the alternative prior, using Monte Carlo integration based only on the base posterior distribution. This method significantly reduces computational costs in prior sensitivity analysis. We also extend the above approach for assessing the influence of hyperpriors in general latent variable models. We demonstrate the proposed method through examples of a simple normal distribution model, hierarchical binomial-beta model, and Gaussian process regression model."}, "https://arxiv.org/abs/2409.19812": {"title": "Compound e-values and empirical Bayes", "link": "https://arxiv.org/abs/2409.19812", "description": "arXiv:2409.19812v1 Announce Type: new \nAbstract: We explicitly define the notion of (exact or approximate) compound e-values which have been implicitly presented and extensively used in the recent multiple testing literature. We show that every FDR controlling procedure can be recovered by instantiating the e-BH procedure with certain compound e-values. Since compound e-values are closed under averaging, this allows for combination and derandomization of any FDR procedure. We then point out and exploit the connections to empirical Bayes. In particular, we use the fundamental theorem of compound decision theory to derive the log-optimal simple separable compound e-value for testing a set of point nulls against point alternatives: it is a ratio of mixture likelihoods. We extend universal inference to the compound setting. As one example, we construct approximate compound e-values for multiple t-tests, where the (nuisance) variances may be different across hypotheses. We provide connections to related notions in the literature stated in terms of p-values."}, "https://arxiv.org/abs/2409.19910": {"title": "Subset Simulation for High-dimensional and Multi-modal Bayesian Inference", "link": "https://arxiv.org/abs/2409.19910", "description": "arXiv:2409.19910v1 Announce Type: new \nAbstract: Bayesian analysis plays a crucial role in estimating distribution of unknown parameters for given data and model. Due to the curse of dimensionality, it is difficult to infer high-dimensional problems, especially when multiple modes exist. This paper introduces an efficient Bayesian posterior sampling algorithm that draws inspiration from subset simulation (SuS). It is based on a new interpretation of evidence from the perspective of structural reliability estimation, regarding the likelihood function as a limit state function. The posterior samples can be obtained following the principle of importance resampling as a postprocessing procedure. 
The estimation variance is derived to quantify the inherent uncertainty associated with the SuS estimator of evidence. The effective sample size is introduced to measure the quality of the posterior sampling. Three benchmark examples are first considered to illustrate the performance of the proposed algorithm by comparing it with two state-of-the-art algorithms. It is then applied to finite element (FE) model updating, showing its applicability in practical engineering problems. The proposed SuS algorithm exhibits comparable or even better performance in evidence estimation and posterior sampling, compared to the aBUS and MULTINEST algorithms, especially when the dimension of unknown parameters is high. In the application of the proposed algorithm for FE model updating, satisfactory results are obtained when the configuration (number and location) of the sensory system is proper, underscoring the importance of adequate sensor placement around critical degrees of freedom."}, "https://arxiv.org/abs/2409.20084": {"title": "Conformal prediction for functional Ordinary kriging", "link": "https://arxiv.org/abs/2409.20084", "description": "arXiv:2409.20084v1 Announce Type: new \nAbstract: Functional Ordinary Kriging is the most widely used method to predict a curve at a given spatial point. However, uncertainty quantification remains an open issue. In this article, a distribution-free prediction method based on two different modulation functions and two conformity scores is proposed. Through simulations and benchmark data analyses, we demonstrate the advantages of our approach when compared to standard methods."}, "https://arxiv.org/abs/2409.20199": {"title": "Synthetic Difference in Differences for Repeated Cross-Sectional Data", "link": "https://arxiv.org/abs/2409.20199", "description": "arXiv:2409.20199v1 Announce Type: new \nAbstract: The synthetic difference-in-differences method provides an efficient way to estimate a causal effect with a latent factor model. However, it relies on the use of panel data. This paper presents an adaptation of the synthetic difference-in-differences method for repeated cross-sectional data. The treatment is considered to be at the group level so that it is possible to aggregate data by group to compute the two types of synthetic difference-in-differences weights on these aggregated data. Then, I develop and compute a third type of weight that accounts for the different number of observations in each cross-section. Simulation results show that the performance of the synthetic difference-in-differences estimator is improved when using the third type of weights on repeated cross-sectional data."}, "https://arxiv.org/abs/2409.20262": {"title": "Bootstrap-based goodness-of-fit test for parametric families of conditional distributions", "link": "https://arxiv.org/abs/2409.20262", "description": "arXiv:2409.20262v1 Announce Type: new \nAbstract: In various scientific fields, researchers are interested in exploring the relationship between some response variable Y and a vector of covariates X. In order to make use of a specific model for the dependence structure, it first has to be checked whether the conditional density function of Y given X fits into a given parametric family. We propose a consistent bootstrap-based goodness-of-fit test for this purpose. The test statistic traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. 
As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine the critical value. A simulation study shows that, in some cases, the new method is more sensitive to deviations from the parametric model than other tests found in the literature. We also apply our method to real-world datasets."}, "https://arxiv.org/abs/2409.20415": {"title": "New Tests of Equal Forecast Accuracy for Factor-Augmented Regressions with Weaker Loadings", "link": "https://arxiv.org/abs/2409.20415", "description": "arXiv:2409.20415v1 Announce Type: new \nAbstract: We provide the theoretical foundation for the recently proposed tests of equal forecast accuracy and encompassing by Pitarakis (2023a) and Pitarakis (2023b), when the competing forecast specification is that of a factor-augmented regression model, whose loadings are allowed to be homogeneously/heterogeneously weak. This should be of interest for practitioners, as at the moment there is no theory available to justify the use of these simple and powerful tests in such a context."}, "https://arxiv.org/abs/2409.19060": {"title": "CURATE: Scaling-up Differentially Private Causal Graph Discovery", "link": "https://arxiv.org/abs/2409.19060", "description": "arXiv:2409.19060v1 Announce Type: cross \nAbstract: Causal Graph Discovery (CGD) is the process of estimating the underlying probabilistic graphical model that represents the joint distribution of features of a dataset. CGD algorithms are broadly classified into two categories: (i) constraint-based algorithms (the outcome depends on conditional independence (CI) tests), and (ii) score-based algorithms (the outcome depends on an optimized score function). Since sensitive features of observational data are prone to privacy leakage, Differential Privacy (DP) has been adopted to ensure user privacy in CGD. Adding the same amount of noise throughout this sequential estimation process affects the predictive performance of the algorithms. As the initial CI tests in constraint-based algorithms and the later iterations of the optimization process of score-based algorithms are crucial, they need to be more accurate and less noisy. Based on this key observation, we present CURATE (CaUsal gRaph AdapTivE privacy), a DP-CGD framework with adaptive privacy budgeting. In contrast to existing DP-CGD algorithms with uniform privacy budgeting across all iterations, CURATE allows adaptive privacy budgeting by minimizing error probability (for constraint-based algorithms) and maximizing iterations of the optimization problem (for score-based algorithms) while keeping the cumulative leakage bounded. To validate our framework, we present a comprehensive set of experiments on several datasets and show that CURATE achieves higher utility compared to existing DP-CGD algorithms with less privacy leakage."}, "https://arxiv.org/abs/2409.19642": {"title": "Solving Fredholm Integral Equations of the Second Kind via Wasserstein Gradient Flows", "link": "https://arxiv.org/abs/2409.19642", "description": "arXiv:2409.19642v1 Announce Type: cross \nAbstract: Motivated by a recent method for approximate solution of Fredholm equations of the first kind, we develop a corresponding method for a class of Fredholm equations of the \\emph{second kind}. In particular, we consider the class of equations for which the solution is a probability measure. 
The approach centres around specifying a functional whose gradient flow admits a minimizer corresponding to a regularized version of the solution of the underlying equation and using a mean-field particle system to approximately simulate that flow. Theoretical support for the method is presented, along with some illustrative numerical results."}, "https://arxiv.org/abs/2409.19777": {"title": "Automatic debiasing of neural networks via moment-constrained learning", "link": "https://arxiv.org/abs/2409.19777", "description": "arXiv:2409.19777v1 Announce Type: cross \nAbstract: Causal and nonparametric estimands in economics and biostatistics can often be viewed as the mean of a linear functional applied to an unknown outcome regression function. Naively learning the regression function and taking a sample mean of the target functional results in biased estimators, and a rich debiasing literature has developed where one additionally learns the so-called Riesz representer (RR) of the target estimand (targeted learning, double ML, automatic debiasing etc.). Learning the RR via its derived functional form can be challenging, e.g. due to extreme inverse probability weights or the need to learn conditional density functions. Such challenges have motivated recent advances in automatic debiasing (AD), where the RR is learned directly via minimization of a bespoke loss. We propose moment-constrained learning as a new RR learning approach that addresses some shortcomings in AD, constraining the predicted moments and improving the robustness of RR estimates to optimization hyperparameters. Though our approach is not tied to a particular class of learner, we illustrate it using neural networks, and evaluate it on the problems of average treatment/derivative effect estimation using semi-synthetic data. Our numerical experiments show improved performance versus state-of-the-art benchmarks."}, "https://arxiv.org/abs/2409.20187": {"title": "Choosing DAG Models Using Markov and Minimal Edge Count in the Absence of Ground Truth", "link": "https://arxiv.org/abs/2409.20187", "description": "arXiv:2409.20187v1 Announce Type: cross \nAbstract: We give a novel nonparametric pointwise consistent statistical test (the Markov Checker) of the Markov condition for directed acyclic graph (DAG) or completed partially directed acyclic graph (CPDAG) models given a dataset. We also introduce the Cross-Algorithm Frugality Search (CAFS) for rejecting DAG models that either do not pass the Markov Checker test or that are not edge minimal. Edge minimality has been used previously by Raskutti and Uhler as a nonparametric simplicity criterion, though CAFS readily generalizes to other simplicity conditions. Reference to the ground truth is not necessary for CAFS, so it is useful for finding causal structure learning algorithms and tuning parameter settings that output causal models that are approximately true from a given data set. We provide a software tool for this analysis that is suitable for even quite large or dense models, provided a suitably fast pointwise consistent test of conditional independence is available. 
In addition, we show in simulation that the CAFS procedure can pick approximately correct models without knowing the ground truth."}, "https://arxiv.org/abs/2110.11856": {"title": "L-2 Regularized maximum likelihood for $\\beta$-model in large and sparse networks", "link": "https://arxiv.org/abs/2110.11856", "description": "arXiv:2110.11856v4 Announce Type: replace \nAbstract: The $\\beta$-model is a powerful tool for modeling large and sparse networks driven by degree heterogeneity, where many network models become infeasible due to computational challenges and network sparsity. However, existing estimation algorithms for the $\\beta$-model do not scale up. Also, theoretical understanding remains limited to dense networks. This paper brings several significant improvements over existing results to address the urgent needs of practice. We propose a new $\\ell_2$-penalized MLE algorithm that can comfortably handle sparse networks of millions of nodes with much-improved memory parsimony. We establish the first rate-optimal error bounds and high-dimensional asymptotic normality results for $\\beta$-models, under much weaker network sparsity assumptions than the best existing results.\n We apply our method to large COVID-19 network data sets and discover meaningful results."}, "https://arxiv.org/abs/2112.09170": {"title": "Reinforcing RCTs with Multiple Priors while Learning about External Validity", "link": "https://arxiv.org/abs/2112.09170", "description": "arXiv:2112.09170v5 Announce Type: replace \nAbstract: This paper introduces a framework for incorporating prior information into the design of sequential experiments. These sources of prior information may include past experiments, expert opinions, or the experimenter's intuition. We model the problem using a multi-prior Bayesian approach, mapping each source to a Bayesian model and aggregating them based on posterior probabilities. Policies are evaluated on three criteria: learning the parameters of payoff distributions, the probability of choosing the wrong treatment, and average rewards. Our framework demonstrates several desirable properties, including robustness to sources lacking external validity, while maintaining strong finite sample performance."}, "https://arxiv.org/abs/2210.05558": {"title": "Causal and counterfactual views of missing data models", "link": "https://arxiv.org/abs/2210.05558", "description": "arXiv:2210.05558v2 Announce Type: replace \nAbstract: It is often said that the fundamental problem of causal inference is a missing data problem -- the comparison of responses to two hypothetical treatment assignments is made difficult because for every experimental unit only one potential response is observed. In this paper, we consider the implications of the converse view: that missing data problems are a form of causal inference. We make explicit how the missing data problem of recovering the complete data law from the observed law can be viewed as identification of a joint distribution over counterfactual variables corresponding to values had we (possibly contrary to fact) been able to observe them. Drawing analogies with causal inference, we show how identification assumptions in missing data can be encoded in terms of graphical models defined over counterfactual and observed variables. We review recent results in missing data identification from this viewpoint. 
In doing so, we note interesting similarities and differences between missing data and causal identification theories."}, "https://arxiv.org/abs/2210.15829": {"title": "Estimation of Heterogeneous Treatment Effects Using a Conditional Moment Based Approach", "link": "https://arxiv.org/abs/2210.15829", "description": "arXiv:2210.15829v3 Announce Type: replace \nAbstract: We propose a new estimator for heterogeneous treatment effects in a partially linear model (PLM) with multiple exogenous covariates and a potentially endogenous treatment variable. Our approach integrates a Robinson transformation to handle the nonparametric component, the Smooth Minimum Distance (SMD) method to leverage conditional mean independence restrictions, and a Neyman-Orthogonalized first-order condition (FOC). By employing regularized model selection techniques like the Lasso method, our estimator accommodates numerous covariates while exhibiting reduced bias, consistency, and asymptotic normality. Simulations demonstrate its robust performance with diverse instrument sets compared to traditional GMM-type estimators. Applying this method to estimate Medicaid's heterogeneous treatment effects from the Oregon Health Insurance Experiment reveals more robust and reliable results than conventional GMM approaches."}, "https://arxiv.org/abs/2301.10228": {"title": "Augmenting a simulation campaign for hybrid computer model and field data experiments", "link": "https://arxiv.org/abs/2301.10228", "description": "arXiv:2301.10228v2 Announce Type: replace \nAbstract: The Kennedy and O'Hagan (KOH) calibration framework uses coupled Gaussian processes (GPs) to meta-model an expensive simulator (first GP), tune its ``knobs\" (calibration inputs) to best match observations from a real physical/field experiment and correct for any modeling bias (second GP) when predicting under new field conditions (design inputs). There are well-established methods for placement of design inputs for data-efficient planning of a simulation campaign in isolation, i.e., without field data: space-filling, or via criterion like minimum integrated mean-squared prediction error (IMSPE). Analogues within the coupled GP KOH framework are mostly absent from the literature. Here we derive a closed form IMSPE criterion for sequentially acquiring new simulator data for KOH. We illustrate how acquisitions space-fill in design space, but concentrate in calibration space. Closed form IMSPE precipitates a closed-form gradient for efficient numerical optimization. We demonstrate that our KOH-IMSPE strategy leads to a more efficient simulation campaign on benchmark problems, and conclude with a showcase on an application to equilibrium concentrations of rare earth elements for a liquid-liquid extraction reaction."}, "https://arxiv.org/abs/2312.02860": {"title": "Spectral Deconfounding for High-Dimensional Sparse Additive Models", "link": "https://arxiv.org/abs/2312.02860", "description": "arXiv:2312.02860v2 Announce Type: replace \nAbstract: Many high-dimensional data sets suffer from hidden confounding which affects both the predictors and the response of interest. In such situations, standard regression methods or algorithms lead to biased estimates. This paper substantially extends previous work on spectral deconfounding for high-dimensional linear models to the nonlinear setting and with this, establishes a proof of concept that spectral deconfounding is valid for general nonlinear models. 
Concretely, we propose an algorithm to estimate high-dimensional sparse additive models in the presence of hidden dense confounding: arguably, this is a simple yet practically useful nonlinear scope. We prove consistency and convergence rates for our method and evaluate it on synthetic data and a genetic data set."}, "https://arxiv.org/abs/2312.06289": {"title": "A graphical framework for interpretable correlation matrix models", "link": "https://arxiv.org/abs/2312.06289", "description": "arXiv:2312.06289v2 Announce Type: replace \nAbstract: In this work, we present a new approach for constructing models for correlation matrices with a user-defined graphical structure. The graphical structure makes correlation matrices interpretable and avoids the quadratic increase of parameters as a function of the dimension. We suggest an automatic approach to define a prior using a natural sequence of simpler models within the Penalized Complexity framework for the unknown parameters in these models.\n We illustrate this approach with three applications: a multivariate linear regression of four biomarkers, a multivariate disease mapping, and a multivariate longitudinal joint modelling. Each application underscores our method's intuitive appeal, signifying a substantial advancement toward a more cohesive and enlightening model that facilitates a meaningful interpretation of correlation matrices."}, "https://arxiv.org/abs/2008.07063": {"title": "To Bag is to Prune", "link": "https://arxiv.org/abs/2008.07063", "description": "arXiv:2008.07063v5 Announce Type: replace-cross \nAbstract: It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent \"true\" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts -- or better."}, "https://arxiv.org/abs/2010.13604": {"title": "A Sparse Beta Regression Model for Network Analysis", "link": "https://arxiv.org/abs/2010.13604", "description": "arXiv:2010.13604v3 Announce Type: replace-cross \nAbstract: For statistical analysis of network data, the $\\beta$-model has emerged as a useful tool, thanks to its flexibility in incorporating nodewise heterogeneity and theoretical tractability. To generalize the $\\beta$-model, this paper proposes the Sparse $\\beta$-Regression Model (S$\\beta$RM) that unites two research themes developed recently in modelling homophily and sparsity. In particular, we employ differential heterogeneity that assigns weights only to important nodes and propose penalized likelihood with an $\\ell_1$ penalty for parameter estimation. While our estimation method is closely related to the LASSO method for logistic regression, we develop new theory emphasizing the use of our model for dealing with a parameter regime that can handle sparse networks usually seen in practice. 
More interestingly, the resulting inference on the homophily parameter demands no debiasing normally employed in LASSO type estimation. We provide extensive simulation and data analysis to illustrate the use of the model. As a special case of our model, we extend the Erd\\H{o}s-R\\'{e}nyi model by including covariates and develop the associated statistical inference for sparse networks, which may be of independent interest."}, "https://arxiv.org/abs/2303.02887": {"title": "Empirical partially Bayes multiple testing and compound $\\chi^2$ decisions", "link": "https://arxiv.org/abs/2303.02887", "description": "arXiv:2303.02887v2 Announce Type: replace-cross \nAbstract: A common task in high-throughput biology is to screen for associations across thousands of units of interest, e.g., genes or proteins. Often, the data for each unit are modeled as Gaussian measurements with unknown mean and variance and are summarized as per-unit sample averages and sample variances. The downstream goal is multiple testing for the means. In this domain, it is routine to \"moderate\" (that is, to shrink) the sample variances through parametric empirical Bayes methods before computing p-values for the means. Such an approach is asymmetric in that a prior is posited and estimated for the nuisance parameters (variances) but not the primary parameters (means). Our work initiates the formal study of this paradigm, which we term \"empirical partially Bayes multiple testing.\" In this framework, if the prior for the variances were known, one could proceed by computing p-values conditional on the sample variances -- a strategy called partially Bayes inference by Sir David Cox. We show that these conditional p-values satisfy an Eddington/Tweedie-type formula and are approximated at nearly-parametric rates when the prior is estimated by nonparametric maximum likelihood. The estimated p-values can be used with the Benjamini-Hochberg procedure to guarantee asymptotic control of the false discovery rate. Even in the compound setting, wherein the variances are fixed, the approach retains asymptotic type-I error guarantees."}, "https://arxiv.org/abs/2410.00116": {"title": "Bayesian Calibration in a multi-output transposition context", "link": "https://arxiv.org/abs/2410.00116", "description": "arXiv:2410.00116v1 Announce Type: new \nAbstract: Bayesian calibration is an effective approach for ensuring that numerical simulations accurately reflect the behavior of physical systems. However, because numerical models are never perfect, a discrepancy known as model error exists between the model outputs and the observed data, and must be quantified. Conventional methods can not be implemented in transposition situations, such as when a model has multiple outputs but only one is experimentally observed. To account for the model error in this context, we propose augmenting the calibration process by introducing additional input numerical parameters through a hierarchical Bayesian model, which includes hyperparameters for the prior distribution of the calibration variables. Importance sampling estimators are used to avoid increasing computational costs. Performance metrics are introduced to assess the proposed probabilistic model and the accuracy of its predictions. The method is applied on a computer code with three outputs that models the Taylor cylinder impact test. The outputs are considered as the observed variables one at a time, to work with three different transposition situations. 
The proposed method is compared with other approaches that embed model errors to demonstrate the significance of the hierarchical formulation."}, "https://arxiv.org/abs/2410.00125": {"title": "Relative Cumulative Residual Information Measure", "link": "https://arxiv.org/abs/2410.00125", "description": "arXiv:2410.00125v1 Announce Type: new \nAbstract: In this paper, we develop a relative cumulative residual information (RCRI) measure intended to quantify the divergence between two survival functions. The dynamic relative cumulative residual information (DRCRI) measure is also introduced. We establish some characterization results under the proportional hazards model assumption. Additionally, we obtain the non-parametric estimators of the RCRI and DRCRI measures based on the kernel density type estimator for the survival function. The effectiveness of the estimators is assessed through an extensive Monte Carlo simulation study. We consider the data from the third Gaia data release (Gaia DR3) to demonstrate the use of the proposed measure. For this study, we have collected epoch photometry data for the objects Gaia DR3 4111834567779557376 and Gaia DR3 5090605830056251776."}, "https://arxiv.org/abs/2410.00142": {"title": "On the posterior property of the Rician distribution", "link": "https://arxiv.org/abs/2410.00142", "description": "arXiv:2410.00142v1 Announce Type: new \nAbstract: The Rician distribution, a well-known statistical distribution frequently encountered in fields like magnetic resonance imaging and wireless communications, is particularly useful for describing many real phenomena such as signal processing data. In this paper, we introduce objective Bayesian inference for the Rician distribution parameters; specifically, the Jeffreys rule and Jeffreys priors are derived. We prove that the first of these priors leads to an improper posterior, while the Jeffreys prior leads to a proper one. To evaluate the effectiveness of our proposed Bayesian estimation method, we perform extensive numerical simulations and compare the results with those obtained from traditional moment-based and maximum likelihood estimators. Our simulations illustrate that the Bayesian estimators derived from the Jeffreys prior provide nearly unbiased estimates, showcasing the advantages of our approach over classical techniques."}, "https://arxiv.org/abs/2410.00183": {"title": "Generalised mixed effects models for changepoint analysis of biomedical time series data", "link": "https://arxiv.org/abs/2410.00183", "description": "arXiv:2410.00183v1 Announce Type: new \nAbstract: Motivated by two distinct types of biomedical time series data, digital health monitoring and neuroimaging, we develop a novel approach for changepoint analysis that uses a generalised linear mixed model framework. The generalised linear mixed model framework lets us incorporate structure that is usually present in biomedical time series data. We embed the mixed model in a dynamic programming algorithm for detecting multiple changepoints in the fMRI data. We evaluate the performance of our proposed method across several scenarios using simulations. 
Finally, we show the utility of our proposed method on our two distinct motivating applications."}, "https://arxiv.org/abs/2410.00217": {"title": "Valid Inference on Functions of Causal Effects in the Absence of Microdata", "link": "https://arxiv.org/abs/2410.00217", "description": "arXiv:2410.00217v1 Announce Type: new \nAbstract: Economists are often interested in functions of multiple causal effects, a leading example of which is evaluating the cost-effectiveness of a government policy. In such settings, the benefits and costs might be captured by multiple causal effects and aggregated into a scalar measure of cost-effectiveness. Oftentimes, the microdata underlying these estimates is not accessible; only the published estimates and their corresponding standard errors are available for post-hoc analysis. We provide a method to conduct inference on functions of causal effects when the only information available is the point estimates and their corresponding standard errors. We apply our method to conduct inference on the Marginal Value of Public Funds (MVPF) for 8 different policies, and show that even in the absence of any microdata, it is possible to conduct valid and meaningful inference on the MVPF."}, "https://arxiv.org/abs/2410.00259": {"title": "Robust Emax Model Fitting: Addressing Nonignorable Missing Binary Outcome in Dose-Response Analysis", "link": "https://arxiv.org/abs/2410.00259", "description": "arXiv:2410.00259v1 Announce Type: new \nAbstract: The Binary Emax model is widely employed in dose-response analysis during drug development, where missing data often pose significant challenges. Addressing nonignorable missing binary responses, where the likelihood of missing data is related to unobserved outcomes, is particularly important, yet existing methods often lead to biased estimates. This issue is compounded when using the regulatory-recommended approach of imputing missing responses as treatment failures, known as non-responder imputation (NRI). Moreover, the problem of separation, where a predictor perfectly distinguishes between outcome classes, can further complicate likelihood maximization. In this paper, we introduce a penalized likelihood-based method that integrates a modified Expectation-Maximization algorithm in the spirit of Ibrahim and Lipsitz to effectively manage both nonignorable missing data and separation issues. Our approach applies a noninformative Jeffreys prior to the likelihood, reducing bias in parameter estimation. Simulation studies demonstrate that our method outperforms existing methods, such as NRI, and the superiority is further supported by its application to data from a Phase II clinical trial. Additionally, we have developed an R package, ememax, to facilitate the implementation of the proposed method."}, "https://arxiv.org/abs/2410.00300": {"title": "Visualization for departures from symmetry with the power-divergence-type measure in two-way contingency tables", "link": "https://arxiv.org/abs/2410.00300", "description": "arXiv:2410.00300v1 Announce Type: new \nAbstract: When the row and column variables consist of the same categories in a two-way contingency table, it is specifically called a square contingency table. Since it is clear that square contingency tables have an association structure, a primary objective is to examine symmetric relationships and transitions between variables. 
While various models and measures have been proposed to analyze these structures and to understand changes in behavior between two time points or cohorts, a detailed investigation of individual categories and their interrelationships, such as shifts in brand preferences, is also necessary. This paper proposes a novel approach to correspondence analysis (CA) for evaluating departures from symmetry in square contingency tables with nominal categories, using a power-divergence-type measure. The approach ensures that well-known divergences can also be visualized and, regardless of the divergence used, the CA plot consists of two principal axes with equal contribution rates. Additionally, the scaling is independent of sample size, making it well-suited for comparing departures from symmetry across multiple contingency tables. Confidence regions are also constructed to enhance the accuracy of the CA plot."}, "https://arxiv.org/abs/2410.00338": {"title": "The generalized Nelson--Aalen estimator by inverse probability of treatment weighting", "link": "https://arxiv.org/abs/2410.00338", "description": "arXiv:2410.00338v1 Announce Type: new \nAbstract: Inverse probability of treatment weighting (IPTW) has been widely applied in causal inference. For time-to-event outcomes, IPTW is performed by weighting the event counting process and at-risk process, resulting in a generalized Nelson--Aalen estimator for population-level hazards. In the presence of competing events, we adopt the counterfactual cumulative incidence of a primary event as the estimand. When the propensity score is estimated, we derive the influence function of the hazard estimator, and then establish the asymptotic property of the incidence estimator. We show that the uncertainty in the estimated propensity score contributes to an additional variation in the IPTW estimator of the cumulative incidence. However, through simulation and real-data application, we find that the additional variation is usually small."}, "https://arxiv.org/abs/2410.00370": {"title": "Covariate Adjusted Functional Mixed Membership Models", "link": "https://arxiv.org/abs/2410.00370", "description": "arXiv:2410.00370v1 Announce Type: new \nAbstract: Mixed membership models are a flexible class of probabilistic data representations used for unsupervised and semi-supervised learning, allowing each observation to partially belong to multiple clusters or features. In this manuscript, we extend the framework of functional mixed membership models to allow for covariate-dependent adjustments. The proposed model utilizes a multivariate Karhunen-Lo\\`eve decomposition, which allows for a scalable and flexible model. Within this framework, we establish a set of sufficient conditions ensuring the identifiability of the mean, covariance, and allocation structure up to a permutation of the labels. This manuscript is primarily motivated by studies on functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD). Specifically, we are interested in characterizing the heterogeneity of alpha oscillations for typically developing (TD) children and children with ASD. Since alpha oscillations are known to change as children develop, we aim to characterize the heterogeneity of alpha oscillations conditionally on the age of the child. 
Using the proposed framework, we were able to gain novel information on the developmental trajectories of alpha oscillations for children with ASD and how the developmental trajectories differ between TD children and children with ASD."}, "https://arxiv.org/abs/2410.00566": {"title": "Research Frontiers in Ambit Stochastics: In memory of Ole E", "link": "https://arxiv.org/abs/2410.00566", "description": "arXiv:2410.00566v1 Announce Type: new \nAbstract: This article surveys key aspects of ambit stochastics and remembers Ole E. Barndorff-Nielsen's important contributions to the foundation and advancement of this new research field over the last two decades. It also highlights some of the emerging trends in ambit stochastics."}, "https://arxiv.org/abs/2410.00574": {"title": "Asymmetric GARCH modelling without moment conditions", "link": "https://arxiv.org/abs/2410.00574", "description": "arXiv:2410.00574v1 Announce Type: new \nAbstract: There is a serious and long-standing restriction in the literature on heavy-tailed phenomena in that moment conditions, which are unrealistic, are almost always assumed in modelling such phenomena. Further, the issue of stability is often insufficiently addressed. To this end, we develop a comprehensive statistical inference for an asymmetric generalized autoregressive conditional heteroskedasticity model with standardized non-Gaussian symmetric stable innovation (sAGARCH) in a unified framework, covering both the stationary case and the explosive case. We consider first the maximum likelihood estimation of the model including the asymptotic properties of the estimator of the stable exponent parameter among others. We then propose a modified Kolmogorov-type test statistic for diagnostic checking, as well as those for strict stationarity and asymmetry testing. We conduct Monte Carlo simulation studies to examine the finite-sample performance of our entire statistical inference procedure. We include empirical examples of stock returns to highlight the usefulness and merits of our sAGARCH model."}, "https://arxiv.org/abs/2410.00662": {"title": "Bias in mixed models when analysing longitudinal data subject to irregular observation: when should we worry about it and how can recommended visit intervals help in specifying joint models when needed?", "link": "https://arxiv.org/abs/2410.00662", "description": "arXiv:2410.00662v1 Announce Type: new \nAbstract: In longitudinal studies using routinely collected data, such as electronic health records (EHRs), patients tend to have more measurements when they are unwell; this informative observation pattern may lead to bias. While semi-parametric approaches to modelling longitudinal data subject to irregular observation are known to be sensitive to misspecification of the visit process, parametric models may provide a more robust alternative. Robustness of parametric models on the outcome alone has been assessed under the assumption that the visit intensity is independent of the time since the last visit, given the covariates and random effects. However, this assumption of a memoryless visit process may not be realistic in the context of EHR data. In a special case which includes memory embedded into the visit process, we derive an expression for the bias in parametric models for the outcome alone and use this to identify factors that lead to increasing bias. Using simulation studies, we show that this bias is often small in practice. 
We suggest diagnostics for identifying the specific cases when the outcome model may be susceptible to meaningful bias, and propose a novel joint model of the outcome and visit processes that can eliminate or reduce the bias. We apply these diagnostics and the joint model to a study of juvenile dermatomyositis. We recommend that future studies using EHR data avoid relying only on the outcome model and instead first evaluate its appropriateness with our proposed diagnostics, applying our proposed joint model if necessary."}, "https://arxiv.org/abs/2410.00733": {"title": "A Nonparametric Test of Heterogeneous Treatment Effects under Interference", "link": "https://arxiv.org/abs/2410.00733", "description": "arXiv:2410.00733v1 Announce Type: new \nAbstract: Statistical inference of heterogeneous treatment effects (HTEs) across predefined subgroups is challenging when units interact because treatment effects may vary by pre-treatment variables, post-treatment exposure variables (that measure the exposure to other units' treatment statuses), or both. Thus, the conventional HTEs testing procedures may be invalid under interference. In this paper, I develop statistical methods to infer HTEs and disentangle the drivers of treatment effects heterogeneity in populations where units interact. Specifically, I incorporate clustered interference into the potential outcomes model and propose kernel-based test statistics for the null hypotheses of (i) no HTEs by treatment assignment (or post-treatment exposure variables) for all pre-treatment variables values and (ii) no HTEs by pre-treatment variables for all treatment assignment vectors. I recommend a multiple-testing algorithm to disentangle the source of heterogeneity in treatment effects. I prove the asymptotic properties of the proposed test statistics. Finally, I illustrate the application of the test procedures in an empirical setting using an experimental data set from a Chinese weather insurance program."}, "https://arxiv.org/abs/2410.00781": {"title": "Modeling Neural Switching via Drift-Diffusion Models", "link": "https://arxiv.org/abs/2410.00781", "description": "arXiv:2410.00781v1 Announce Type: new \nAbstract: Neural encoding, or neural representation, is a field in neuroscience that focuses on characterizing how information is encoded in the spiking activity of neurons. Currently, little is known about how sensory neurons can preserve information from multiple stimuli given their broad receptive fields. Multiplexing is a neural encoding theory that posits that neurons temporally switch between encoding various stimuli in their receptive field. Here, we construct a statistically falsifiable single-neuron model for multiplexing using a competition-based framework. The spike train models are constructed using drift-diffusion models, implying an integrate-and-fire framework to model the temporal dynamics of the membrane potential of the neuron. In addition to a multiplexing-specific model, we develop alternative models that represent alternative encoding theories (normalization, winner-take-all, subadditivity, etc.) with some level of abstraction. Using information criteria, we perform model comparison to determine whether the data favor multiplexing over alternative theories of neural encoding. 
Analysis of spike trains from the inferior colliculus of two macaque monkeys provides tenable evidence of multiplexing and offers new insight into the timescales at which switching occurs."}, "https://arxiv.org/abs/2410.00845": {"title": "Control Variate-based Stochastic Sampling from the Probability Simplex", "link": "https://arxiv.org/abs/2410.00845", "description": "arXiv:2410.00845v1 Announce Type: new \nAbstract: This paper presents a control variate-based Markov chain Monte Carlo algorithm for efficient sampling from the probability simplex, with a focus on applications in large-scale Bayesian models such as latent Dirichlet allocation. Standard Markov chain Monte Carlo methods, particularly those based on Langevin diffusions, suffer from significant discretization errors near the boundaries of the simplex, which are exacerbated in sparse data settings. To address this issue, we propose an improved approach based on the stochastic Cox--Ingersoll--Ross process, which eliminates discretization errors and enables exact transition densities. Our key contribution is the integration of control variates, which significantly reduces the variance of the stochastic gradient estimator in the Cox--Ingersoll--Ross process, thereby enhancing the accuracy and computational efficiency of the algorithm. We provide a theoretical analysis showing the variance reduction achieved by the control variates approach and demonstrate the practical advantages of our method in data subsampling settings. Empirical results on large datasets show that the proposed method outperforms existing approaches in both accuracy and scalability."}, "https://arxiv.org/abs/2410.00002": {"title": "Machine Learning and Econometric Approaches to Fiscal Policies: Understanding Industrial Investment Dynamics in Uruguay (1974-2010)", "link": "https://arxiv.org/abs/2410.00002", "description": "arXiv:2410.00002v1 Announce Type: cross \nAbstract: This paper examines the impact of fiscal incentives on industrial investment in Uruguay from 1974 to 2010. Using a mixed-method approach that combines econometric models with machine learning techniques, the study investigates both the short-term and long-term effects of fiscal benefits on industrial investment. The results confirm the significant role of fiscal incentives in driving long-term industrial growth, while also highlighting the importance of a stable macroeconomic environment, public investment, and access to credit. Machine learning models provide additional insights into nonlinear interactions between fiscal benefits and other macroeconomic factors, such as exchange rates, emphasizing the need for tailored fiscal policies. The findings have important policy implications, suggesting that fiscal incentives, when combined with broader economic reforms, can effectively promote industrial development in emerging economies."}, "https://arxiv.org/abs/2410.00301": {"title": "Network Science in Psychology", "link": "https://arxiv.org/abs/2410.00301", "description": "arXiv:2410.00301v1 Announce Type: cross \nAbstract: Social network analysis can answer research questions such as why or how individuals interact or form relationships and how those relationships impact other outcomes. Despite the breadth of methods available to address psychological research questions, social network analysis is not yet a standard practice in psychological research. 
To promote the use of social network analysis in psychological research, we present an overview of network methods, situating each method within the context of research studies and questions in psychology."}, "https://arxiv.org/abs/2410.00865": {"title": "How should we aggregate ratings? Accounting for personal rating scales via Wasserstein barycenters", "link": "https://arxiv.org/abs/2410.00865", "description": "arXiv:2410.00865v1 Announce Type: cross \nAbstract: A common method of making quantitative conclusions in qualitative situations is to collect numerical ratings on a linear scale. We investigate the problem of calculating aggregate numerical ratings from individual numerical ratings and propose a new, non-parametric model for the problem. We show that, with minimal modeling assumptions, the equal-weights average is inconsistent for estimating the quality of items. Analyzing the problem from the perspective of optimal transport, we derive an alternative rating estimator, which we show is asymptotically consistent almost surely and in $L^p$ for estimating quality, with an optimal rate of convergence. Further, we generalize Kendall's W, a non-parametric coefficient of preference concordance between raters, from the special case of rankings to the more general case of arbitrary numerical ratings. Along the way, we prove Glivenko--Cantelli-type theorems for uniform convergence of the cumulative distribution functions and quantile functions for Wasserstein-2 Fr\\'echet means on [0,1]."}, "https://arxiv.org/abs/2212.08766": {"title": "Asymptotically Optimal Knockoff Statistics via the Masked Likelihood Ratio", "link": "https://arxiv.org/abs/2212.08766", "description": "arXiv:2212.08766v2 Announce Type: replace \nAbstract: In feature selection problems, knockoffs are synthetic controls for the original features. Employing knockoffs allows analysts to use nearly any variable importance measure or \"feature statistic\" to select features while rigorously controlling false positives. However, it is not clear which statistic maximizes power. In this paper, we argue that state-of-the-art lasso-based feature statistics often prioritize features that are unlikely to be discovered, leading to low power in real applications. Instead, we introduce masked likelihood ratio (MLR) statistics, which prioritize features according to one's ability to distinguish each feature from its knockoff. Although no single feature statistic is uniformly most powerful in all situations, we show that MLR statistics asymptotically maximize the number of discoveries under a user-specified Bayesian model of the data. (Like all feature statistics, MLR statistics always provide frequentist error control.) This result places no restrictions on the problem dimensions and makes no parametric assumptions; instead, we require a \"local dependence\" condition that depends only on known quantities. In simulations and three real applications, MLR statistics outperform state-of-the-art feature statistics, including in settings where the Bayesian model is misspecified. 
We implement MLR statistics in the python package knockpy; our implementation is often faster than computing a cross-validated lasso."}, "https://arxiv.org/abs/2401.08626": {"title": "Validation and Comparison of Non-Stationary Cognitive Models: A Diffusion Model Application", "link": "https://arxiv.org/abs/2401.08626", "description": "arXiv:2401.08626v3 Announce Type: replace-cross \nAbstract: Cognitive processes undergo various fluctuations and transient states across different temporal scales. Superstatistics are emerging as a flexible framework for incorporating such non-stationary dynamics into existing cognitive model classes. In this work, we provide the first experimental validation of superstatistics and formal comparison of four non-stationary diffusion decision models in a specifically designed perceptual decision-making task. Task difficulty and speed-accuracy trade-off were systematically manipulated to induce expected changes in model parameters. To validate our models, we assess whether the inferred parameter trajectories align with the patterns and sequences of the experimental manipulations. To address computational challenges, we present novel deep learning techniques for amortized Bayesian estimation and comparison of models with time-varying parameters. Our findings indicate that transition models incorporating both gradual and abrupt parameter shifts provide the best fit to the empirical data. Moreover, we find that the inferred parameter trajectories closely mirror the sequence of experimental manipulations. Posterior re-simulations further underscore the ability of the models to faithfully reproduce critical data patterns. Accordingly, our results suggest that the inferred non-stationary dynamics may reflect actual changes in the targeted psychological constructs. We argue that our initial experimental validation paves the way for the widespread application of superstatistics in cognitive modeling and beyond."}, "https://arxiv.org/abs/2410.00931": {"title": "A simple emulator that enables interpretation of parameter-output relationships, applied to two climate model PPEs", "link": "https://arxiv.org/abs/2410.00931", "description": "arXiv:2410.00931v1 Announce Type: new \nAbstract: We present a new additive method, nicknamed sage for Simplified Additive Gaussian processes Emulator, to emulate climate model Perturbed Parameter Ensembles (PPEs). It estimates the value of a climate model output as the sum of additive terms. Each additive term is the mean of a Gaussian Process, and corresponds to the impact of a parameter or parameter group on the variable of interest. This design caters to the sparsity of PPEs which are characterized by limited ensemble members and high dimensionality of the parameter space. sage quantifies the variability explained by different parameters and parameter groups, providing additional insights on the parameter-climate model output relationship. We apply the method to two climate model PPEs and compare it to a fully connected Neural Network. The two methods have comparable performance with both PPEs, but sage provides insights on parameter and parameter group importance as well as diagnostics useful for optimizing PPE design. Insights gained are valid regardless of the emulator method used, and have not been previously addressed. 
Our work highlights that analyzing the PPE used to train an emulator is different from analyzing data generated from an emulator trained on the PPE, as the former provides more insights on the data structure in the PPE which could help inform the emulator design."}, "https://arxiv.org/abs/2410.00971": {"title": "Data-Driven Random Projection and Screening for High-Dimensional Generalized Linear Models", "link": "https://arxiv.org/abs/2410.00971", "description": "arXiv:2410.00971v1 Announce Type: new \nAbstract: We address the challenge of correlated predictors in high-dimensional GLMs, where regression coefficients range from sparse to dense, by proposing a data-driven random projection method. This is particularly relevant for applications where the number of predictors is (much) larger than the number of observations and the underlying structure -- whether sparse or dense -- is unknown. We achieve this by using ridge-type estimates for variable screening and random projection to incorporate information about the response-predictor relationship when performing dimensionality reduction. We demonstrate that a ridge estimator with a small penalty is effective for random projection and screening, but the penalty value must be carefully selected. Unlike in linear regression, where penalties approaching zero work well, this approach leads to overfitting in non-Gaussian families. Instead, we recommend a data-driven method for penalty selection. This data-driven random projection improves prediction performance over conventional random projections, even surpassing benchmarks like elastic net. Furthermore, an ensemble of multiple such random projections combined with probabilistic variable screening delivers the best aggregated results in prediction and variable ranking across varying sparsity levels in simulations at a rather low computational cost. Finally, three applications with count and binary responses demonstrate the method's advantages in interpretability and prediction accuracy."}, "https://arxiv.org/abs/2410.00985": {"title": "Nonparametric tests of treatment effect homogeneity for policy-makers", "link": "https://arxiv.org/abs/2410.00985", "description": "arXiv:2410.00985v1 Announce Type: new \nAbstract: Recent work has focused on nonparametric estimation of conditional treatment effects, but inference has remained relatively unexplored. We propose a class of nonparametric tests for both quantitative and qualitative treatment effect heterogeneity. The tests can incorporate a variety of structured assumptions on the conditional average treatment effect, allow for both continuous and discrete covariates, and do not require sample splitting. Furthermore, we show how the tests are tailored to detect alternatives where the population impact of adopting a personalized decision rule differs from using a rule that discards covariates. The proposal is thus relevant for guiding treatment policies. The utility of the proposal is borne out in simulation studies and a re-analysis of an AIDS clinical trial."}, "https://arxiv.org/abs/2410.01008": {"title": "Interval Estimation of Coefficients in Penalized Regression Models of Insurance Data", "link": "https://arxiv.org/abs/2410.01008", "description": "arXiv:2410.01008v1 Announce Type: new \nAbstract: The Tweedie exponential dispersion family is a popular choice among many to model insurance losses that consist of zero-inflated semicontinuous data. 
In such data, it is often important to obtain credibility (inference) of the most important features that describe the endogenous variables. Post-selection inference is the standard procedure in statistics to obtain confidence intervals of model parameters after performing a feature extraction procedure. For a linear model, the lasso estimate often has non-negligible estimation bias for large coefficients corresponding to exogenous variables. To have valid inference on those coefficients, it is necessary to correct the bias of the lasso estimate. Traditional statistical methods, such as hypothesis testing or standard confidence interval construction, might lead to incorrect conclusions during post-selection, as they are generally too optimistic. Here we discuss a few methodologies for constructing confidence intervals of the coefficients after feature selection in the Generalized Linear Model (GLM) family with application to insurance data."}, "https://arxiv.org/abs/2410.01051": {"title": "A class of priors to perform asymmetric Bayesian wavelet shrinkage", "link": "https://arxiv.org/abs/2410.01051", "description": "arXiv:2410.01051v1 Announce Type: new \nAbstract: This paper proposes a class of asymmetric priors to perform Bayesian wavelet shrinkage in the standard nonparametric regression model with Gaussian error. The priors are composed of mixtures of a point mass function at zero and one of the following distributions: asymmetric beta, Kumaraswamy, asymmetric triangular or skew normal. Statistical properties of the associated shrinkage rules such as squared bias, variance and risks are obtained numerically and discussed. Monte Carlo simulation studies are described to evaluate the performances of the rules against standard techniques. An application of the asymmetric rules to a stock market index time series is also illustrated."}, "https://arxiv.org/abs/2410.01063": {"title": "Functional summary statistics and testing for independence in marked point processes on the surface of three dimensional convex shapes", "link": "https://arxiv.org/abs/2410.01063", "description": "arXiv:2410.01063v1 Announce Type: new \nAbstract: The fundamental functional summary statistics used for studying spatial point patterns are developed for marked homogeneous and inhomogeneous point processes on the surface of a sphere. These are extended to point processes on the surface of three dimensional convex shapes given that the bijective mapping from the shape to the sphere is known. These functional summary statistics are used to test for independence between the marginals of multi-type spatial point processes, with methods for sampling the null distribution proposed and discussed. This is illustrated on both simulated data and the RNGC galaxy point pattern, revealing attractive dependencies between different galaxy types."}, "https://arxiv.org/abs/2410.01133": {"title": "A subcopula characterization of dependence for the Multivariate Bernoulli Distribution", "link": "https://arxiv.org/abs/2410.01133", "description": "arXiv:2410.01133v1 Announce Type: new \nAbstract: By applying Sklar's theorem to the Multivariate Bernoulli Distribution (MBD), a framework is proposed that decouples the marginal distributions from the dependence structure, providing a clearer understanding of how binary variables interact. Explicit formulas are derived under the MBD using subcopulas to introduce dependence measures for interactions of all orders, not just pairwise. 
A Bayesian inference approach is also applied to estimate the parameters of the MBD, offering practical tools for parameter estimation and dependence analysis in real-world applications. The results obtained contribute to the application of subcopulas to multivariate binary data, with a real data example of comorbidities in COVID-19 patients."}, "https://arxiv.org/abs/2410.01159": {"title": "Partially Identified Heterogeneous Treatment Effect with Selection: An Application to Gender Gaps", "link": "https://arxiv.org/abs/2410.01159", "description": "arXiv:2410.01159v1 Announce Type: new \nAbstract: This paper addresses the sample selection model within the context of the gender gap problem, where even random treatment assignment is affected by selection bias. By offering a robust alternative free from distributional or specification assumptions, we bound the treatment effect under the sample selection model with an exclusion restriction, an assumption whose validity is tested in the literature. This exclusion restriction allows for further segmentation of the population into distinct types based on observed and unobserved characteristics. For each type, we derive the proportions and bound the gender gap accordingly. Notably, trends in type proportions and gender gap bounds reveal an increasing proportion of always-working individuals over time, alongside variations in bounds, including a general decline across time and consistently higher bounds for those in high-potential wage groups. Further analysis, considering additional assumptions, highlights persistent gender gaps for some types, while other types exhibit differing or inconclusive trends. This underscores the necessity of separating individuals by type to understand the heterogeneous nature of the gender gap."}, "https://arxiv.org/abs/2410.01163": {"title": "Perturbation-Robust Predictive Modeling of Social Effects by Network Subspace Generalized Linear Models", "link": "https://arxiv.org/abs/2410.01163", "description": "arXiv:2410.01163v1 Announce Type: new \nAbstract: Network-linked data, where multivariate observations are interconnected by a network, are becoming increasingly prevalent in fields such as sociology and biology. These data often exhibit inherent noise and complex relational structures, complicating conventional modeling and statistical inference. Motivated by empirical challenges in analyzing such data sets, this paper introduces a family of network subspace generalized linear models designed for analyzing noisy, network-linked data. We propose a model inference method based on subspace-constrained maximum likelihood, which emphasizes flexibility in capturing network effects and provides a robust inference framework against network perturbations. We establish the asymptotic distributions of the estimators under network perturbations, demonstrating the method's accuracy through extensive simulations involving random network models and deep-learning-based embedding algorithms. 
The proposed methodology is applied to a comprehensive analysis of a large-scale study on school conflicts, where it identifies significant social effects, offering meaningful and interpretable insights into student behaviors."}, "https://arxiv.org/abs/2410.01175": {"title": "Forecasting short-term inflation in Argentina with Random Forest Models", "link": "https://arxiv.org/abs/2410.01175", "description": "arXiv:2410.01175v1 Announce Type: new \nAbstract: This paper examines the performance of Random Forest models in forecasting short-term monthly inflation in Argentina, based on a database of monthly indicators since 1962. It is found that these models achieve forecast accuracy that is statistically comparable to the consensus of market analysts' expectations surveyed by the Central Bank of Argentina (BCRA) and to traditional econometric models. One advantage of Random Forest models is that, as they are non-parametric, they allow for the exploration of nonlinear effects in the predictive power of certain macroeconomic variables on inflation. Among other findings, the relative importance of the exchange rate gap in forecasting inflation increases when the gap between the parallel and official exchange rates exceeds 60%. The predictive power of the exchange rate on inflation rises when the BCRA's net international reserves are negative or close to zero (specifically, below USD 2 billion). The relative importance of inflation inertia and the nominal interest rate in forecasting the following month's inflation increases when the nominal levels of inflation and/or interest rates rise."}, "https://arxiv.org/abs/2410.01283": {"title": "Bayesian estimation for novel geometric INGARCH model", "link": "https://arxiv.org/abs/2410.01283", "description": "arXiv:2410.01283v1 Announce Type: new \nAbstract: This paper introduces an integer-valued generalized autoregressive conditional heteroskedasticity (INGARCH) model based on the novel geometric distribution and discusses some of its properties. The parameter estimation problem of the models is studied by conditional maximum likelihood and a Bayesian approach using the Hamiltonian Monte Carlo (HMC) algorithm. The results of the simulation studies and real data analysis affirm the good performance of the estimators and the model."}, "https://arxiv.org/abs/2410.01475": {"title": "Exploring Learning Rate Selection in Generalised Bayesian Inference using Posterior Predictive Checks", "link": "https://arxiv.org/abs/2410.01475", "description": "arXiv:2410.01475v1 Announce Type: new \nAbstract: Generalised Bayesian Inference (GBI) attempts to address model misspecification in a standard Bayesian setup by tempering the likelihood. The likelihood is raised to a fractional power, called the learning rate, which reduces its importance in the posterior and has been established as a method to address certain kinds of model misspecification. Posterior Predictive Checks (PPC) attempt to detect model misspecification by locating a diagnostic, computed on the observed data, within the posterior predictive distribution of the diagnostic. This can be used to construct a hypothesis test where a small $p$-value indicates potential misfit. The recent Embedded Diachronic Sense Change (EDiSC) model suffers from misspecification and benefits from likelihood tempering. Using EDiSC as a case study, this exploratory work examines whether PPC could be used in a novel way to set the learning rate in a GBI setup. 
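For reference, the tempered posterior that defines GBI above can be written in standard notation (this is the generic construction, not a detail specific to the EDiSC case study):
\[
\pi_\eta(\theta \mid x) \;\propto\; L(x \mid \theta)^{\eta}\, \pi(\theta), \qquad \eta \in (0, 1],
\]
where $\eta$ is the learning rate; $\eta = 1$ recovers standard Bayesian updating, and smaller values downweight a potentially misspecified likelihood.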
Specifically, the learning rate selected is the lowest value for which a hypothesis test using the log likelihood diagnostic is not rejected at the 10% level. The experimental results are promising, though not definitive, and indicate the need for further research along the lines suggested here."}, "https://arxiv.org/abs/2410.01658": {"title": "Smaller Confidence Intervals From IPW Estimators via Data-Dependent Coarsening", "link": "https://arxiv.org/abs/2410.01658", "description": "arXiv:2410.01658v1 Announce Type: new \nAbstract: Inverse propensity-score weighted (IPW) estimators are prevalent in causal inference for estimating average treatment effects in observational studies. Under unconfoundedness, given accurate propensity scores and $n$ samples, the size of confidence intervals of IPW estimators scales down with $n$, and, several of their variants improve the rate of scaling. However, neither IPW estimators nor their variants are robust to inaccuracies: even if a single covariate has an $\\varepsilon>0$ additive error in the propensity score, the size of confidence intervals of these estimators can increase arbitrarily. Moreover, even without errors, the rate with which the confidence intervals of these estimators go to zero with $n$ can be arbitrarily slow in the presence of extreme propensity scores (those close to 0 or 1).\n We introduce a family of Coarse IPW (CIPW) estimators that captures existing IPW estimators and their variants. Each CIPW estimator is an IPW estimator on a coarsened covariate space, where certain covariates are merged. Under mild assumptions, e.g., Lipschitzness in expected outcomes and sparsity of extreme propensity scores, we give an efficient algorithm to find a robust estimator: given $\\varepsilon$-inaccurate propensity scores and $n$ samples, its confidence interval size scales with $\\varepsilon+1/\\sqrt{n}$. In contrast, under the same assumptions, existing estimators' confidence interval sizes are $\\Omega(1)$ irrespective of $\\varepsilon$ and $n$. Crucially, our estimator is data-dependent and we show that no data-independent CIPW estimator can be robust to inaccuracies."}, "https://arxiv.org/abs/2410.01783": {"title": "On metric choice in dimension reduction for Fr\\'echet regression", "link": "https://arxiv.org/abs/2410.01783", "description": "arXiv:2410.01783v1 Announce Type: new \nAbstract: Fr\\'echet regression is becoming a mainstay in modern data analysis for analyzing non-traditional data types belonging to general metric spaces. This novel regression method utilizes the pairwise distances between the random objects, which makes the choice of metric crucial in the estimation. In this paper, the effect of metric choice on the estimation of the dimension reduction subspace for the regression between random responses and Euclidean predictors is investigated. Extensive numerical studies illustrate how different metrics affect the central and central mean space estimates for regression involving responses belonging to some popular metric spaces versus Euclidean predictors. 
An analysis of the distributions of glycaemia based on continuous glucose monitoring data demonstrates how metric choice can influence findings in real applications."}, "https://arxiv.org/abs/2410.01168": {"title": "MDDC: An R and Python Package for Adverse Event Identification in Pharmacovigilance Data", "link": "https://arxiv.org/abs/2410.01168", "description": "arXiv:2410.01168v1 Announce Type: cross \nAbstract: The safety of medical products continues to be a significant health concern worldwide. Spontaneous reporting systems (SRS) and pharmacovigilance databases are essential tools for postmarketing surveillance of medical products. Various SRS are employed globally, such as the Food and Drug Administration Adverse Event Reporting System (FAERS), EudraVigilance, and VigiBase. In the pharmacovigilance literature, numerous methods have been proposed to assess product-adverse event pairs for potential signals. In this paper, we introduce an R and Python package that implements a novel pattern discovery method for postmarketing adverse event identification, named Modified Detecting Deviating Cells (MDDC). The package also includes a data generation function that considers adverse events as groups, as well as additional utility functions. We illustrate the usage of the package through the analysis of real datasets derived from the FAERS database."}, "https://arxiv.org/abs/2410.01194": {"title": "Maximum Ideal Likelihood Estimator: A New Estimation and Inference Framework for Latent Variable Models", "link": "https://arxiv.org/abs/2410.01194", "description": "arXiv:2410.01194v1 Announce Type: cross \nAbstract: In this paper, a new estimation framework, Maximum Ideal Likelihood Estimator (MILE), is proposed for general parametric models with latent variables and missing values. Instead of focusing on the marginal likelihood of the observed data as in many traditional approaches, the MILE directly considers the joint distribution of the complete dataset by treating the latent variables as parameters (the ideal likelihood). The MILE framework remains valid, even when traditional methods are not applicable, e.g., non-finite conditional expectation of the marginal likelihood function, via different optimization techniques and algorithms. The statistical properties of the MILE, such as the asymptotic equivalence to the Maximum Likelihood Estimation (MLE), are proved under some mild conditions, which facilitate statistical inference and prediction. Simulation studies illustrate that MILE outperforms traditional approaches with computational feasibility and scalability using existing and our proposed algorithms."}, "https://arxiv.org/abs/2410.01265": {"title": "Transformers Handle Endogeneity in In-Context Linear Regression", "link": "https://arxiv.org/abs/2410.01265", "description": "arXiv:2410.01265v1 Announce Type: cross \nAbstract: We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares $(\\textsf{2SLS})$ solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. 
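As background for the 2SLS solution referenced above, a minimal NumPy sketch of the classical two-stage least squares estimator (the textbook estimator only; the paper's in-context transformer procedure is not reproduced, and all variable names are illustrative):

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """Textbook 2SLS: regress X on the instruments Z, then regress y on the fitted X."""
    # First stage: project the endogenous regressors onto the instrument space.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: OLS of y on the projected regressors.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

# Simulated example with one endogenous regressor and one instrument.
rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=(n, 1))              # instrument
u = rng.normal(size=(n, 1))              # unobserved confounder
x = z + u + rng.normal(size=(n, 1))      # endogenous regressor
y = 2.0 * x + u + rng.normal(size=(n, 1))
print(two_stage_least_squares(y, x, z))  # close to the true coefficient 2.0 despite endogeneity
```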
Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the $\\textsf{2SLS}$ method, in the presence of endogeneity."}, "https://arxiv.org/abs/2410.01427": {"title": "Regularized e-processes: anytime valid inference with knowledge-based efficiency gains", "link": "https://arxiv.org/abs/2410.01427", "description": "arXiv:2410.01427v1 Announce Type: cross \nAbstract: Classical statistical methods have theoretical justification when the sample size is predetermined by the data-collection plan. In applications, however, it's often the case that sample sizes aren't predetermined; instead, investigators might use the data observed along the way to make on-the-fly decisions about when to stop data collection. Since those methods designed for static sample sizes aren't reliable when sample sizes are dynamic, there's been a recent surge of interest in e-processes and the corresponding tests and confidence sets that are anytime valid in the sense that their justification holds up for arbitrary dynamic data-collection plans. But if the investigator has relevant-yet-incomplete prior information about the quantity of interest, then there's an opportunity for efficiency gain, but existing approaches can't accommodate this. Here I build a new, regularized e-process framework that features a knowledge-based, imprecise-probabilistic regularization that offers improved efficiency. A generalized version of Ville's inequality is established, ensuring that inference based on the regularized e-process remains anytime valid in a novel, knowledge-dependent sense. In addition to anytime valid hypothesis tests and confidence sets, the proposed regularized e-processes facilitate possibility-theoretic uncertainty quantification with strong frequentist-like calibration properties and other Bayesian-like features: satisfies the likelihood principle, avoids sure-loss, and offers formal decision-making with reliability guarantees."}, "https://arxiv.org/abs/2410.01480": {"title": "Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales", "link": "https://arxiv.org/abs/2410.01480", "description": "arXiv:2410.01480v1 Announce Type: cross \nAbstract: Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In this study, we present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders. Using both simulated scenarios and real data from the Swedish Scholastic Aptitude Test, we demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit. Furthermore, we illustrate how the latent trait scale from any fitted IRT model can be transformed into a ratio scale, aiding in score interpretation and making it easier to compare different types of IRT models. We refer to these new scales as bit scales. 
Bit scales are especially useful for models for which minimal or no assumptions are made for the latent trait scale distributions, such as for the autoencoder fitted models in this study."}, "https://arxiv.org/abs/2410.01522": {"title": "Uncertainty quantification in neutron and gamma time correlation measurements", "link": "https://arxiv.org/abs/2410.01522", "description": "arXiv:2410.01522v1 Announce Type: cross \nAbstract: Neutron noise analysis is a predominant technique for fissile matter identification with passive methods. Quantifying the uncertainties associated with the estimated nuclear parameters is crucial for decision-making. A conservative uncertainty quantification procedure is possible by solving a Bayesian inverse problem with the help of statistical surrogate models but generally leads to large uncertainties due to the surrogate models' errors. In this work, we develop two methods for robust uncertainty quantification in neutron and gamma noise analysis based on the resolution of Bayesian inverse problems. We show that the uncertainties can be reduced by including information on gamma correlations. The investigation of a joint analysis of the neutron and gamma observations is also conducted with the help of active learning strategies to fine-tune surrogate models. We test our methods on a model of the SILENE reactor core, using simulated and real-world measurements."}, "https://arxiv.org/abs/2311.07736": {"title": "Use of Expected Utility (EU) to Evaluate Artificial Intelligence-Enabled Rule-Out Devices for Mammography Screening", "link": "https://arxiv.org/abs/2311.07736", "description": "arXiv:2311.07736v2 Announce Type: replace \nAbstract: Background: An artificial intelligence (AI)-enabled rule-out device may autonomously remove patient images unlikely to have cancer from radiologist review. Many published studies evaluate this type of device by retrospectively applying the AI to large datasets and use sensitivity and specificity as the performance metrics. However, these metrics have fundamental shortcomings because they are bound to have opposite changes with the rule-out application of AI. Method: We reviewed two performance metrics to compare the screening performance between the radiologist-with-rule-out-device and radiologist-without-device workflows: positive/negative predictive values (PPV/NPV) and expected utility (EU). We applied both methods to a recent study that reported improved performance in the radiologist-with-device workflow using a retrospective U.S. dataset. We then applied the EU method to a European study based on the reported recall and cancer detection rates at different AI thresholds to compare the potential utility among different thresholds. Results: For the U.S. study, neither PPV/NPV nor EU can demonstrate significant improvement for any of the algorithm thresholds reported. For the study using European data, we found that EU is lower as AI rules out more patients including false-negative cases and reduces the overall screening performance. Conclusions: Due to the nature of the retrospective simulated study design, sensitivity and specificity can be ambiguous in evaluating a rule-out device. We showed that using PPV/NPV or EU can resolve the ambiguity. 
The EU method can be applied with only recall rates and cancer detection rates, which is convenient as ground truth is often unavailable for non-recalled patients in screening mammography."}, "https://arxiv.org/abs/2401.01977": {"title": "Conformal causal inference for cluster randomized trials: model-robust inference without asymptotic approximations", "link": "https://arxiv.org/abs/2401.01977", "description": "arXiv:2401.01977v2 Announce Type: replace \nAbstract: Traditional statistical inference in cluster randomized trials typically invokes the asymptotic theory that requires the number of clusters to approach infinity. In this article, we propose an alternative conformal causal inference framework for analyzing cluster randomized trials that achieves the target inferential goal in finite samples without the need for asymptotic approximations. Different from traditional inference focusing on estimating the average treatment effect, our conformal causal inference aims to provide prediction intervals for the difference of counterfactual outcomes, thereby providing a new decision-making tool for clusters and individuals in the same target population. We prove that this framework is compatible with arbitrary working outcome models -- including data-adaptive machine learning methods that maximally leverage information from baseline covariates, and enjoys robustness against misspecification of working outcome models. Under our conformal causal inference framework, we develop efficient computation algorithms to construct prediction intervals for treatment effects at both the cluster and individual levels, and further extend to address inferential targets defined based on pre-specified covariate subgroups. Finally, we demonstrate the properties of our methods via simulations and a real data application based on a completed cluster randomized trial for treating chronic pain."}, "https://arxiv.org/abs/2304.06251": {"title": "Importance is Important: Generalized Markov Chain Importance Sampling Methods", "link": "https://arxiv.org/abs/2304.06251", "description": "arXiv:2304.06251v2 Announce Type: replace-cross \nAbstract: We show that for any multiple-try Metropolis algorithm, one can always accept the proposal and evaluate the importance weight that is needed to correct for the bias without extra computational cost. This results in a general, convenient, and rejection-free Markov chain Monte Carlo (MCMC) sampling scheme. By further leveraging the importance sampling perspective on Metropolis--Hastings algorithms, we propose an alternative MCMC sampler on discrete spaces that is also outside the Metropolis--Hastings framework, along with a general theory on its complexity. Numerical examples suggest that the proposed algorithms are consistently more efficient than the original Metropolis--Hastings versions."}, "https://arxiv.org/abs/2304.09398": {"title": "Minimax Signal Detection in Sparse Additive Models", "link": "https://arxiv.org/abs/2304.09398", "description": "arXiv:2304.09398v2 Announce Type: replace-cross \nAbstract: Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. 
Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature."}, "https://arxiv.org/abs/2304.13831": {"title": "Random evolutionary games and random polynomials", "link": "https://arxiv.org/abs/2304.13831", "description": "arXiv:2304.13831v2 Announce Type: replace-cross \nAbstract: In this paper, we discover that the class of random polynomials arising from the equilibrium analysis of random asymmetric evolutionary games is \\textit{exactly} the Kostlan-Shub-Smale system of random polynomials, revealing an intriguing connection between evolutionary game theory and the theory of random polynomials. Through this connection, we analytically characterize the statistics of the number of internal equilibria of random asymmetric evolutionary games, namely its mean value, probability distribution, central limit theorem and universality phenomena. Biologically, these quantities enable prediction of the levels of social and biological diversity as well as the overall complexity in a dynamical system. By comparing symmetric and asymmetric random games, we establish that symmetry in group interactions increases the expected number of internal equilibria. Our research establishes new theoretical understanding of asymmetric evolutionary games and highlights the significance of symmetry and asymmetry in group interactions."}, "https://arxiv.org/abs/2410.02403": {"title": "A unified approach to penalized likelihood estimation of covariance matrices in high dimensions", "link": "https://arxiv.org/abs/2410.02403", "description": "arXiv:2410.02403v1 Announce Type: new \nAbstract: We consider the problem of estimation of a covariance matrix for Gaussian data in a high dimensional setting. Existing approaches include maximum likelihood estimation under a pre-specified sparsity pattern, l_1-penalized loglikelihood optimization and ridge regularization of the sample covariance. We show that these three approaches can be addressed in a unified way, by considering the constrained optimization of an objective function that involves two suitably defined penalty terms. This unified procedure exploits the advantages of each individual approach, while bringing novelty in the combination of the three. We provide an efficient algorithm for the optimization of the regularized objective function and describe the relationship between the two penalty terms, thereby highlighting the importance of the joint application of the three methods. A simulation study shows how the sparse estimates of covariance matrices returned by the procedure are stable and accurate, both in low and high dimensional settings, and how their calculation is more efficient than existing approaches under a partially known sparsity pattern. An illustration on sonar data is presented for the identification of the covariance structure among signals bounced off a certain material. 
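One way to write a combined objective of the kind described above is the following hedged sketch (the precise penalty terms and constraints used in the paper may differ):
\[
\widehat{\Sigma} \;=\; \operatorname*{arg\,min}_{\Sigma \succ 0} \Bigl\{ \log\det\Sigma + \operatorname{tr}\bigl(S\Sigma^{-1}\bigr) + \lambda_1 \lVert \Sigma \rVert_{1,\mathrm{off}} + \lambda_2 \lVert \Sigma \rVert_F^2 \Bigr\},
\]
where $S$ is the sample covariance, the first two terms are proportional to the Gaussian negative loglikelihood, the $\ell_1$ term on off-diagonal entries encourages sparsity, the ridge term stabilises the estimate, and a pre-specified sparsity pattern corresponds to constraining selected entries to zero.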
The method is implemented in the publicly available R package gicf."}, "https://arxiv.org/abs/2410.02448": {"title": "Bayesian Calibration and Uncertainty Quantification for a Large Nutrient Load Impact Model", "link": "https://arxiv.org/abs/2410.02448", "description": "arXiv:2410.02448v1 Announce Type: new \nAbstract: Nutrient load simulators are large, deterministic models that simulate the hydrodynamics and biogeochemical processes in aquatic ecosystems. They are central tools for planning cost-efficient actions to fight eutrophication since they allow scenario predictions on impacts of nutrient load reductions to, e.g., harmful algal biomass growth. Because these simulators are computationally heavy, however, the uncertainties related to these predictions are typically not rigorously assessed. In this work, we developed a novel Bayesian computational approach for estimating the uncertainties in predictions of the Finnish coastal nutrient load model FICOS. First, we constructed a likelihood function for the multivariate spatiotemporal outputs of the FICOS model. Then, we used Bayes optimization to locate the posterior mode for the model parameters conditional on long term monitoring data. After that, we constructed a space-filling design for FICOS model runs around the posterior mode and used it to train a Gaussian process emulator for the (log) posterior density of the model parameters. We then integrated over this (approximate) parameter posterior to produce probabilistic predictions for algal biomass and chlorophyll a concentration under alternative nutrient load reduction scenarios. Our computational algorithm allowed for fast posterior inference and the Gaussian process emulator had good predictive accuracy within the highest posterior probability mass region. The posterior predictive scenarios showed that the probability of reaching the EU's Water Framework Directive objectives in the Finnish Archipelago Sea is generally low even under large load reductions."}, "https://arxiv.org/abs/2410.02649": {"title": "Stochastic Gradient Variational Bayes in the Stochastic Blockmodel", "link": "https://arxiv.org/abs/2410.02649", "description": "arXiv:2410.02649v1 Announce Type: new \nAbstract: Stochastic variational Bayes algorithms have become very popular in the machine learning literature, particularly in the context of nonparametric Bayesian inference. These algorithms replace the true but intractable posterior distribution with the best (in the sense of Kullback-Leibler divergence) member of a tractable family of distributions, using stochastic gradient algorithms to perform the optimization step. Stochastic variational Bayes inference implicitly trades off computational speed for accuracy, but the loss of accuracy is highly model (and even dataset) specific. In this paper we carry out an empirical evaluation of this trade-off in the context of stochastic blockmodels, which are a widely used class of probabilistic models for network and relational data. 
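For context, the stochastic blockmodel targeted by these algorithms can be written in its standard form (generic notation; the paper's variational family and subsampling scheme are not shown):
\[
z_i \overset{\text{iid}}{\sim} \mathrm{Categorical}(\pi), \qquad A_{ij} \mid z_i, z_j \sim \mathrm{Bernoulli}\bigl(B_{z_i z_j}\bigr), \quad i < j,
\]
where $z_i$ is the latent block label of node $i$, $\pi$ the block proportions, and $B$ the matrix of between-block edge probabilities; stochastic variational Bayes approximates the posterior over $(z, \pi, B)$ using gradients computed on subsampled parts of the network.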
Our experiments indicate that, in the context of stochastic blockmodels, relatively large subsamples are required for these algorithms to find accurate approximations of the posterior, and that even then the quality of the approximations provided by stochastic gradient variational algorithms can be highly variable."}, "https://arxiv.org/abs/2410.02655": {"title": "Exact Bayesian Inference for Multivariate Spatial Data of Any Size with Application to Air Pollution Monitoring", "link": "https://arxiv.org/abs/2410.02655", "description": "arXiv:2410.02655v1 Announce Type: new \nAbstract: Fine particulate matter and aerosol optical thickness are of interest to atmospheric scientists for understanding air quality and its various health/environmental impacts. The available data are extremely large, making uncertainty quantification in a fully Bayesian framework quite difficult, as traditional implementations do not scale reasonably to the size of the data. We specifically consider roughly 8 million observations obtained from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. To analyze data on this scale, we introduce Scalable Multivariate Exact Posterior Regression (SM-EPR) which combines the recently introduced data subset approach and Exact Posterior Regression (EPR). EPR is a new Bayesian hierarchical model where it is possible to sample independent replicates of fixed and random effects directly from the posterior without the use of Markov chain Monte Carlo (MCMC) or approximate Bayesian techniques. We extend EPR to the multivariate spatial context, where the multiple variables may be distributed according to different distributions. The combination of the data subset approach with EPR allows one to perform exact Bayesian inference without MCMC for effectively any sample size. We demonstrate our new SM-EPR method using this motivating big remote sensing data application and provide several simulations."}, "https://arxiv.org/abs/2410.02727": {"title": "Regression Discontinuity Designs Under Interference", "link": "https://arxiv.org/abs/2410.02727", "description": "arXiv:2410.02727v1 Announce Type: new \nAbstract: We extend the continuity-based framework to Regression Discontinuity Designs (RDDs) to identify and estimate causal effects in the presence of interference when units are connected through a network. In this setting, assignment to an \"effective treatment,\" which comprises the individual treatment and a summary of the treatment of interfering units (e.g., friends, classmates), is determined by the unit's score and the scores of other interfering units, leading to a multiscore RDD with potentially complex, multidimensional boundaries. We characterize these boundaries and derive generalized continuity assumptions to identify the proposed causal estimands, i.e., point and boundary causal effects. Additionally, we develop a distance-based nonparametric estimator, derive its asymptotic properties under restrictions on the network degree distribution, and introduce a novel variance estimator that accounts for network correlation. 
Finally, we apply our methodology to the PROGRESA/Oportunidades dataset to estimate the direct and indirect effects of receiving cash transfers on children's school attendance."}, "https://arxiv.org/abs/2410.01826": {"title": "Shocks-adaptive Robust Minimum Variance Portfolio for a Large Universe of Assets", "link": "https://arxiv.org/abs/2410.01826", "description": "arXiv:2410.01826v1 Announce Type: cross \nAbstract: This paper proposes a robust, shocks-adaptive portfolio in a large-dimensional assets universe where the number of assets could be comparable to or even larger than the sample size. It is well documented that portfolios based on optimizations are sensitive to outliers in return data. We deal with outliers by proposing a robust factor model, contributing methodologically through the development of a robust principal component analysis (PCA) for factor model estimation and a shrinkage estimation for the random error covariance matrix. This approach extends the well-regarded Principal Orthogonal Complement Thresholding (POET) method (Fan et al., 2013), enabling it to effectively handle heavy tails and sudden shocks in data. The novelty of the proposed robust method is its adaptiveness to both global and idiosyncratic shocks, without the need to distinguish them, which is useful in forming portfolio weights when facing outliers. We develop the theoretical results of the robust factor model and the robust minimum variance portfolio. Numerical and empirical results show the superior performance of the new portfolio."}, "https://arxiv.org/abs/2410.01839": {"title": "A Divide-and-Conquer Approach to Persistent Homology", "link": "https://arxiv.org/abs/2410.01839", "description": "arXiv:2410.01839v1 Announce Type: cross \nAbstract: Persistent homology is a tool of topological data analysis that has been used in a variety of settings to characterize different dimensional holes in data. However, persistent homology computations can be memory intensive with a computational complexity that does not scale well as the data size becomes large. In this work, we propose a divide-and-conquer (DaC) method to mitigate these issues. The proposed algorithm efficiently finds small, medium, and large-scale holes by partitioning data into sub-regions and uses a Vietoris-Rips filtration. Furthermore, we provide theoretical results that quantify the bottleneck distance between DaC and the true persistence diagram and the recovery probability of holes in the data. We empirically verify that the rate coincides with our theoretical rate, and find that the memory and computational complexity of DaC outperforms an alternative method that relies on a clustering preprocessing step to reduce the memory and computational complexity of the persistent homology computations. Finally, we test our algorithm using spatial data of the locations of lakes in Wisconsin, where the classical persistent homology is computationally infeasible."}, "https://arxiv.org/abs/2410.02015": {"title": "Instrumental variables: A non-asymptotic viewpoint", "link": "https://arxiv.org/abs/2410.02015", "description": "arXiv:2410.02015v1 Announce Type: cross \nAbstract: We provide a non-asymptotic analysis of the linear instrumental variable estimator allowing for the presence of exogenous covariates. In addition, we introduce a novel measure of the strength of an instrument that can be used to derive non-asymptotic confidence intervals. 
For strong instruments, these non-asymptotic intervals match the asymptotic ones exactly up to higher order corrections; for weaker instruments, our intervals involve adaptive adjustments to the instrument strength, and thus remain valid even when asymptotic predictions break down. We illustrate our results via an analysis of the effect of PM2.5 pollution on various health conditions, using wildfire smoke exposure as an instrument. Our analysis shows that exposure to PM2.5 pollution leads to statistically significant increases in the incidence of health conditions such as asthma, heart disease, and strokes."}, "https://arxiv.org/abs/2410.02025": {"title": "A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models", "link": "https://arxiv.org/abs/2410.02025", "description": "arXiv:2410.02025v1 Announce Type: cross \nAbstract: In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings."}, "https://arxiv.org/abs/2410.02050": {"title": "A fast, flexible simulation framework for Bayesian adaptive designs -- the R package BATSS", "link": "https://arxiv.org/abs/2410.02050", "description": "arXiv:2410.02050v1 Announce Type: cross \nAbstract: The use of Bayesian adaptive designs for randomised controlled trials has been hindered by the lack of software readily available to statisticians. We have developed a new software package (Bayesian Adaptive Trials Simulator Software - BATSS) for the statistical software R, which provides a flexible structure for the fast simulation of Bayesian adaptive designs for clinical trials. We illustrate how the BATSS package can be used to define and evaluate the operating characteristics of Bayesian adaptive designs for various different types of primary outcomes (e.g., those that follow a normal, binary, Poisson or negative binomial distribution) and can incorporate the most common types of adaptations: stopping treatments (or the entire trial) for efficacy or futility, and Bayesian response adaptive randomisation - based on user-defined adaptation rules. 
Other important features of this highly modular package include: the use of (Integrated Nested) Laplace approximations to compute posterior distributions, parallel processing on a computer or a cluster, customisability, adjustment for covariates and a wide range of available conditional distributions for the response."}, "https://arxiv.org/abs/2410.02208": {"title": "Fast nonparametric feature selection with error control using integrated path stability selection", "link": "https://arxiv.org/abs/2410.02208", "description": "arXiv:2410.02208v1 Announce Type: cross \nAbstract: Feature selection can greatly improve performance and interpretability in machine learning problems. However, existing nonparametric feature selection methods either lack theoretical error control or fail to accurately control errors in practice. Many methods are also slow, especially in high dimensions. In this paper, we introduce a general feature selection method that applies integrated path stability selection to thresholding to control false positives and the false discovery rate. The method also estimates q-values, which are better suited to high-dimensional data than p-values. We focus on two special cases of the general method based on gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive simulations with RNA sequencing data show that IPSSGB and IPSSRF have better error control, detect more true positives, and are faster than existing methods. We also use both methods to detect microRNAs and genes related to ovarian cancer, finding that they make better predictions with fewer features than other methods."}, "https://arxiv.org/abs/2410.02306": {"title": "Choosing alpha post hoc: the danger of multiple standard significance thresholds", "link": "https://arxiv.org/abs/2410.02306", "description": "arXiv:2410.02306v1 Announce Type: cross \nAbstract: A fundamental assumption of classical hypothesis testing is that the significance threshold $\\alpha$ is chosen independently from the data. The validity of confidence intervals likewise relies on choosing $\\alpha$ beforehand. We point out that the independence of $\\alpha$ is guaranteed in practice because, in most fields, there exists one standard $\\alpha$ that everyone uses -- so that $\\alpha$ is automatically independent of everything. However, there have been recent calls to decrease $\\alpha$ from $0.05$ to $0.005$. We note that this may lead to multiple accepted standard thresholds within one scientific field. For example, different journals may require different significance thresholds. As a consequence, some researchers may be tempted to conveniently choose their $\\alpha$ based on their p-value. We use examples to illustrate that this severely invalidates hypothesis tests, and mention some potential solutions."}, "https://arxiv.org/abs/2410.02629": {"title": "Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression", "link": "https://arxiv.org/abs/2410.02629", "description": "arXiv:2410.02629v1 Announce Type: cross \nAbstract: This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. 
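A minimal sketch of proximal SGD for l1-penalised Huber regression, the kind of robust-regression iterate trajectory discussed above (an illustrative implementation under our own choices of loss and penalty, not the authors' risk estimators):

```python
import numpy as np

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss with respect to the residual r."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_sgd_huber(X, y, lam=0.1, step=0.01, epochs=20, delta=1.0, seed=0):
    """Proximal SGD iterates for l1-penalised Huber regression."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            r = y[i] - X[i] @ beta
            grad = -huber_grad(r, delta) * X[i]                     # stochastic gradient of the Huber loss
            beta = soft_threshold(beta - step * grad, step * lam)   # proximal (soft-thresholding) step
    return beta

# Heavy-tailed noise example.
rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_t(df=2, size=n)
print(np.round(proximal_sgd_huber(X, y), 2))
```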
These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates."}, "https://arxiv.org/abs/2203.14760": {"title": "Functional principal component analysis for longitudinal observations with sampling at random", "link": "https://arxiv.org/abs/2203.14760", "description": "arXiv:2203.14760v2 Announce Type: replace \nAbstract: Functional principal component analysis has been shown to be invaluable for revealing variation modes of longitudinal outcomes, which serves as important building blocks for forecasting and model building. Decades of research have advanced methods for functional principal component analysis often assuming independence between the observation times and longitudinal outcomes. Yet such assumptions are fragile in real-world settings where observation times may be driven by outcome-related reasons. Rather than ignoring the informative observation time process, we explicitly model the observational times by a counting process dependent on time-varying prognostic factors. Identification of the mean, covariance function, and functional principal components ensues via inverse intensity weighting. We propose using weighted penalized splines for estimation and establish consistency and convergence rates for the weighted estimators. Simulation studies demonstrate that the proposed estimators are substantially more accurate than the existing ones in the presence of a correlation between the observation time process and the longitudinal outcome process. We further examine the finite-sample performance of the proposed method using the Acute Infection and Early Disease Research Program study."}, "https://arxiv.org/abs/2310.07399": {"title": "Randomized Runge-Kutta-Nystr\\\"om Methods for Unadjusted Hamiltonian and Kinetic Langevin Monte Carlo", "link": "https://arxiv.org/abs/2310.07399", "description": "arXiv:2310.07399v2 Announce Type: replace-cross \nAbstract: We introduce $5/2$- and $7/2$-order $L^2$-accurate randomized Runge-Kutta-Nystr\\\"{o}m methods, tailored for approximating Hamiltonian flows within non-reversible Markov chain Monte Carlo samplers, such as unadjusted Hamiltonian Monte Carlo and unadjusted kinetic Langevin Monte Carlo. We establish quantitative $5/2$-order $L^2$-accuracy upper bounds under gradient and Hessian Lipschitz assumptions on the potential energy function. The numerical experiments demonstrate the superior efficiency of the proposed unadjusted samplers on a variety of well-behaved, high-dimensional target distributions."}, "https://arxiv.org/abs/2410.02905": {"title": "Multiscale Multi-Type Spatial Bayesian Analysis of Wildfires and Population Change That Avoids MCMC and Approximating the Posterior Distribution", "link": "https://arxiv.org/abs/2410.02905", "description": "arXiv:2410.02905v1 Announce Type: new \nAbstract: In recent years, wildfires have significantly increased in the United States (U.S.), making certain areas harder to live in. 
This motivates us to jointly analyze active fires and population changes in the U.S. from July 2020 to June 2021. The available data are recorded on different scales (or spatial resolutions) and by different types of distributions (referred to as multi-type data). Moreover, wildfires are known to have a feedback mechanism that creates signal-to-noise dependence. We analyze point-referenced remote sensing fire data from the National Aeronautics and Space Administration (NASA) and county-level population change data provided by the U.S. Census Bureau's Population Estimates Program (PEP). To do this, we develop a multiscale multi-type spatial Bayesian hierarchical model that models the average number of fires as zero-inflated normal, the incidence of fire as Bernoulli, and the percentage population change as normally distributed. This high-dimensional dataset makes Markov chain Monte Carlo (MCMC) implementation infeasible. We bypass MCMC by extending a computationally efficient Bayesian framework to directly sample from the exact posterior distribution, referred to as Exact Posterior Regression (EPR), which includes a term to model feedback. A simulation study is included to compare our new EPR method to the traditional Bayesian model fitted via MCMC. In our analysis, we obtained predictions of wildfire probabilities, identified several useful covariates, and found that regions with many fires were directly related to population change."}, "https://arxiv.org/abs/2410.02918": {"title": "Moving sum procedure for multiple change point detection in large factor models", "link": "https://arxiv.org/abs/2410.02918", "description": "arXiv:2410.02918v1 Announce Type: new \nAbstract: The paper proposes a moving sum methodology for detecting multiple change points in high-dimensional time series under a factor model, where changes are attributed to those in loadings as well as emergence or disappearance of factors. We establish the asymptotic null distribution of the proposed test for family-wise error control, and show the consistency of the procedure for multiple change point estimation. Simulation studies and an application to a large dataset of volatilities demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2410.02920": {"title": "Statistical Inference with Nonignorable Non-Probability Survey Samples", "link": "https://arxiv.org/abs/2410.02920", "description": "arXiv:2410.02920v1 Announce Type: new \nAbstract: Statistical inference with non-probability survey samples is an emerging topic in survey sampling and official statistics and has gained increased attention from researchers and practitioners in the field. Much of the existing literature, however, assumes that the participation mechanism for non-probability samples is ignorable. In this paper, we develop a pseudo-likelihood approach to estimate participation probabilities for nonignorable non-probability samples when auxiliary information is available from an existing reference probability sample. We further construct three estimators for the finite population mean using regression-based prediction, inverse probability weighting (IPW), and augmented IPW estimators, and study their asymptotic properties. Variance estimation for the proposed methods is considered within the same framework. 
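As a hedged illustration of the IPW estimator of a finite population mean mentioned above (a Hájek-type form with the participation probabilities treated as known; the paper's pseudo-likelihood estimation step is not reproduced):

```python
import numpy as np

def ipw_mean(y, participation_prob):
    """Hajek-type IPW estimator of a finite population mean from a
    non-probability sample, given (estimated) participation probabilities."""
    w = 1.0 / participation_prob
    return np.sum(w * y) / np.sum(w)

# Toy example: units with larger y are more likely to participate (nonignorable).
rng = np.random.default_rng(2)
N = 100_000
y_pop = rng.normal(loc=50, scale=10, size=N)
p_pop = 1 / (1 + np.exp(-(y_pop - 50) / 10))    # participation probability depends on y itself
selected = rng.random(N) < p_pop
naive = y_pop[selected].mean()                   # biased upward under this selection mechanism
adjusted = ipw_mean(y_pop[selected], p_pop[selected])
print(round(naive, 2), round(adjusted, 2))       # the weighted estimate is close to the population mean
```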
The efficiency of our proposed methods is demonstrated through simulation studies and a real data analysis using the ESPACOV survey on the effects of the COVID-19 pandemic in Spain."}, "https://arxiv.org/abs/2410.02929": {"title": "Identifying Hierarchical Structures in Network Data", "link": "https://arxiv.org/abs/2410.02929", "description": "arXiv:2410.02929v1 Announce Type: new \nAbstract: In this paper, we introduce a hierarchical extension of the stochastic blockmodel to identify multilevel community structures in networks. We also present a Markov chain Monte Carlo (MCMC) and a variational Bayes algorithm to fit the model and obtain approximate posterior inference. Through simulated and real datasets, we demonstrate that the model successfully identifies communities and supercommunities when they exist in the data. Additionally, we observe that the model returns a single supercommunity when there is no evidence of multilevel community structure. As expected in the case of the single-level stochastic blockmodel, we observe that the MCMC algorithm consistently outperforms its variational Bayes counterpart. Therefore, we recommend using MCMC whenever the network size allows for computational feasibility."}, "https://arxiv.org/abs/2410.02941": {"title": "Efficient collaborative learning of the average treatment effect under data sharing constraints", "link": "https://arxiv.org/abs/2410.02941", "description": "arXiv:2410.02941v1 Announce Type: new \nAbstract: Driven by the need to generate real-world evidence from multi-site collaborative studies, we introduce an efficient collaborative learning approach to evaluate the average treatment effect in a multi-site setting under data sharing constraints. Specifically, the proposed method operates in a federated manner, using individual-level data from a user-defined target population and summary statistics from other source populations, to construct an efficient estimator for the average treatment effect on the target population of interest. Our federated approach does not require iterative communications between sites, making it particularly suitable for research consortia with limited resources for developing automated data-sharing infrastructures. Compared to existing data integration methods in causal inference, it allows for distributional shifts in outcomes, treatments, and baseline covariates, and achieves the semiparametric efficiency bound under appropriate conditions. We illustrate the magnitude of efficiency gains from incorporating extra data sources by examining the effect of insulin vs. non-insulin treatments on heart failure for patients with type II diabetes using electronic health record data collected from the All of Us program."}, "https://arxiv.org/abs/2410.02965": {"title": "BSNMani: Bayesian Scalar-on-network Regression with Manifold Learning", "link": "https://arxiv.org/abs/2410.02965", "description": "arXiv:2410.02965v1 Announce Type: new \nAbstract: Brain connectivity analysis is crucial for understanding brain structure and neurological function, shedding light on the mechanisms of mental illness. To study the association between individual brain connectivity networks and the clinical characteristics, we develop BSNMani: a Bayesian scalar-on-network regression model with manifold learning. 
BSNMani comprises two components: the network manifold learning model for brain connectivity networks, which extracts shared connectivity structures and subject-specific network features, and the joint predictive model for clinical outcomes, which studies the association between clinical phenotypes and subject-specific network features while adjusting for potential confounding covariates. For posterior computation, we develop a novel two-stage hybrid algorithm combining Metropolis-Adjusted Langevin Algorithm (MALA) and Gibbs sampling. Our method is not only able to extract meaningful subnetwork features that reveal shared connectivity patterns, but can also reveal their association with clinical phenotypes, further enabling clinical outcome prediction. We demonstrate our method through simulations and through its application to real resting-state fMRI data from a study focusing on Major Depressive Disorder (MDD). Our approach sheds light on the intricate interplay between brain connectivity and clinical features, offering insights that can contribute to our understanding of psychiatric and neurological disorders, as well as mental health."}, "https://arxiv.org/abs/2410.02982": {"title": "Imputing Missing Values with External Data", "link": "https://arxiv.org/abs/2410.02982", "description": "arXiv:2410.02982v1 Announce Type: new \nAbstract: Missing data is a common challenge across scientific disciplines. Current imputation methods require the availability of individual data to impute missing values. Often, however, missingness requires using external data for the imputation. In this paper, we introduce a new Stata command, mi impute from, designed to impute missing values using linear predictors and their related covariance matrix from imputation models estimated in one or multiple external studies. This allows for the imputation of any missing values without sharing individual data between studies. We describe the underlying method and present the syntax of mi impute from alongside practical examples of missing data in collaborative research projects."}, "https://arxiv.org/abs/2410.03012": {"title": "Determining adequate consistency levels for aggregation of expert estimates", "link": "https://arxiv.org/abs/2410.03012", "description": "arXiv:2410.03012v1 Announce Type: new \nAbstract: To obtain reliable results of expertise, which usually use individual and group expert pairwise comparisons, it is important to summarize (aggregate) expert estimates provided that they are sufficiently consistent. There are several ways to determine the threshold level of consistency sufficient for aggregation of estimates. They can be used for different consistency indices, but none of them relates the threshold value to the requirements for the reliability of the expertise's results. Therefore, a new approach to determining this consistency threshold is required. The proposed approach is based on simulation modeling of expert pairwise comparisons and a targeted search for the most inconsistent among the modeled pairwise comparison matrices. Thus, the search for the least consistent matrix is carried out for a given perturbation of the perfectly consistent matrix. 
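For context on the kind of consistency index such thresholds apply to, a minimal sketch of the classical Saaty consistency ratio for a pairwise comparison matrix (shown only as a familiar example; the paper's simulation-based threshold search uses its own criterion):

```python
import numpy as np

# Saaty's random index values for matrix orders 1..10 (commonly cited table).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def consistency_ratio(A):
    """Consistency ratio CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    n = A.shape[0]
    lambda_max = np.max(np.real(np.linalg.eigvals(A)))  # Perron eigenvalue of the reciprocal matrix
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]

# Example 3x3 pairwise comparison matrix (reciprocal, nearly consistent).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
print(round(consistency_ratio(A), 3))  # values below roughly 0.1 are usually deemed acceptable
```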
This allows for determining the consistency threshold corresponding to a given permissible relative deviation of the resulting weight of an alternative from its hypothetical reference value."}, "https://arxiv.org/abs/2410.03091": {"title": "Time-In-Range Analyses of Functional Data Subject to Missing with Applications to Inpatient Continuous Glucose Monitoring", "link": "https://arxiv.org/abs/2410.03091", "description": "arXiv:2410.03091v1 Announce Type: new \nAbstract: Continuous glucose monitoring (CGM) has been increasingly used in US hospitals for the care of patients with diabetes. Time in range (TIR), which measures the percent of time over a specified time window with glucose values within a target range, has served as a pivotal CGM-metric for assessing glycemic control. However, inpatient CGM is prone to a prevailing issue that a limited length of hospital stay can cause insufficient CGM sampling, leading to a scenario with functional data plagued by complex missingness. Current analyses of inpatient CGM studies, however, ignore this issue and typically compute the TIR as the proportion of available CGM glucose values in range. As shown by simulation studies, this can result in considerably biased estimation and inference, largely owing to the nonstationary nature of inpatient CGM trajectories. In this work, we develop a rigorous statistical framework that confers valid inference on TIR in realistic inpatient CGM settings. Our proposals utilize a novel probabilistic representation of TIR, which enables leveraging the technique of inverse probability weighting and semiparametric survival modeling to obtain unbiased estimators of mean TIR that properly account for incompletely observed CGM trajectories. We establish desirable asymptotic properties of the proposed estimators. Results from our numerical studies demonstrate good finite-sample performance of the proposed method as well as its advantages over existing approaches. The proposed method is generally applicable to other functional data settings similar to CGM."}, "https://arxiv.org/abs/2410.03096": {"title": "Expected value of sample information calculations for risk prediction model development", "link": "https://arxiv.org/abs/2410.03096", "description": "arXiv:2410.03096v1 Announce Type: new \nAbstract: Uncertainty around predictions from a model due to the finite size of the development sample has traditionally been approached using classical inferential techniques. The finite size of the sample will result in a discrepancy between the final model and the correct model that maps covariates to predicted risks. From a decision-theoretic perspective, this discrepancy might affect the subsequent treatment decisions, and thus is associated with utility loss. From this perspective, procuring more development data is associated with an expected gain in the utility of using the model. In this work, we define the Expected Value of Sample Information (EVSI) as the expected gain in clinical utility, defined in net benefit (NB) terms in net true positive units, by procuring a further development sample of a given size. We propose a bootstrap-based algorithm for EVSI computations, and show its feasibility and face validity in a case study. 
Decision-theoretic metrics can complement classical inferential methods when designing studies that are aimed at developing risk prediction models."}, "https://arxiv.org/abs/2410.03239": {"title": "A new GARCH model with a deterministic time-varying intercept", "link": "https://arxiv.org/abs/2410.03239", "description": "arXiv:2410.03239v1 Announce Type: new \nAbstract: It is common for long financial time series to exhibit gradual change in the unconditional volatility. We propose a new model that captures this type of nonstationarity in a parsimonious way. The model augments the volatility equation of a standard GARCH model by a deterministic time-varying intercept. It captures structural change that slowly affects the amplitude of a time series while keeping the short-run dynamics constant. We parameterize the intercept as a linear combination of logistic transition functions. We show that the model can be derived from a multiplicative decomposition of volatility and preserves the financial motivation of variance decomposition. We use the theory of locally stationary processes to show that the quasi maximum likelihood estimator (QMLE) of the parameters of the model is consistent and asymptotically normally distributed. We examine the quality of the asymptotic approximation in a small simulation study. An empirical application to Oracle Corporation stock returns demonstrates the usefulness of the model. We find that the persistence implied by the GARCH parameter estimates is reduced by including a time-varying intercept in the volatility equation."}, "https://arxiv.org/abs/2410.03393": {"title": "General linear hypothesis testing in ill-conditioned functional response model", "link": "https://arxiv.org/abs/2410.03393", "description": "arXiv:2410.03393v1 Announce Type: new \nAbstract: The paper concerns inference in the ill-conditioned functional response model, which is a part of functional data analysis. In this regression model, the functional response is modeled using several independent scalar variables. To verify linear hypotheses, we develop new test statistics by aggregating pointwise statistics using either integral or supremum. The new tests are scale-invariant, in contrast to the existing ones. To construct tests, we use different bootstrap methods. The performance of the new tests is compared with the performance of known tests through a simulation study and an application to a real data example."}, "https://arxiv.org/abs/2410.03557": {"title": "Robust Bond Risk Premia Predictability Test in the Quantiles", "link": "https://arxiv.org/abs/2410.03557", "description": "arXiv:2410.03557v1 Announce Type: new \nAbstract: Different from existing literature on testing the macro-spanning hypothesis of bond risk premia, which only considers mean regressions, this paper investigates whether the yield curve represented by CP factor (Cochrane and Piazzesi, 2005) contains all available information about future bond returns in a predictive quantile regression with many other macroeconomic variables. In this study, we introduce the Trend in Debt Holding (TDH) as a novel predictor, testing it alongside established macro indicators such as Trend Inflation (TI) (Cieslak and Povala, 2015), and macro factors from Ludvigson and Ng (2009). A significant challenge in this study is the invalidity of traditional quantile model inference approaches, given the high persistence of many macro variables involved. 
Furthermore, the existing methods addressing this issue do not perform well in the marginal test with many highly persistent predictors. Thus, we suggest a robust inference approach, whose size and power performance are shown to be better than those of existing tests. Using data from 1980-2022, the macro-spanning hypothesis is strongly supported at center quantiles by the empirical finding that the CP factor has predictive power while all other macro variables have negligible predictive power in this case. On the other hand, the evidence against the macro-spanning hypothesis is found at tail quantiles, in which TDH has predictive power at right tail quantiles while TI has predictive power at both tail quantiles. Finally, we show that the in-sample and out-of-sample predictive performance of the proposed method is better than that of existing methods."}, "https://arxiv.org/abs/2410.03593": {"title": "Robust Quickest Correlation Change Detection in High-Dimensional Random Vectors", "link": "https://arxiv.org/abs/2410.03593", "description": "arXiv:2410.03593v1 Announce Type: new \nAbstract: Detecting changes in high-dimensional vectors presents significant challenges, especially when the post-change distribution is unknown and time-varying. This paper introduces a novel robust algorithm for correlation change detection in high-dimensional data. The approach utilizes the summary statistic of the maximum magnitude correlation coefficient to detect the change. This summary statistic captures the level of correlation present in the data and also has an asymptotic density. The robust test is designed using the asymptotic density. The proposed approach is robust because it can help detect a change in correlation level from some known level to unknown, time-varying levels. The proposed test is also computationally efficient and valid for a broad class of data distributions. The effectiveness of the proposed algorithm is demonstrated on simulated data."}, "https://arxiv.org/abs/2410.03619": {"title": "Functional Singular Value Decomposition", "link": "https://arxiv.org/abs/2410.03619", "description": "arXiv:2410.03619v1 Announce Type: new \nAbstract: Heterogeneous functional data are commonly seen in time series and longitudinal data analysis. To capture the statistical structures of such data, we propose the framework of Functional Singular Value Decomposition (FSVD), a unified framework with structure-adaptive interpretability for the analysis of heterogeneous functional data. We establish the mathematical foundation of FSVD by proving its existence and providing its fundamental properties using operator theory. We then develop an implementation approach for noisy and irregularly observed functional data based on a novel joint kernel ridge regression scheme and provide theoretical guarantees for its convergence and estimation accuracy. The framework of FSVD also introduces the concepts of intrinsic basis functions and intrinsic basis vectors, which represent two fundamental statistical structures for random functions and connect FSVD to various tasks including functional principal component analysis, factor models, functional clustering, and functional completion. We compare the performance of FSVD with existing methods in several tasks through extensive simulation studies. 
To demonstrate the value of FSVD in real-world datasets, we apply it to extract temporal patterns from a COVID-19 case count dataset and perform data completion on an electronic health record dataset."}, "https://arxiv.org/abs/2410.03648": {"title": "Spatial Hyperspheric Models for Compositional Data", "link": "https://arxiv.org/abs/2410.03648", "description": "arXiv:2410.03648v1 Announce Type: new \nAbstract: Compositional data are an increasingly prevalent data source in spatial statistics. Analysis of such data is typically done on log-ratio transformations or via Dirichlet regression. However, these approaches often make unnecessarily strong assumptions (e.g., strictly positive components, exclusively negative correlations). An alternative approach uses square-root transformed compositions and directional distributions. Such distributions naturally allow for zero-valued components and positive correlations, yet they may include support outside the non-negative orthant and are not generative for compositional data. To overcome this challenge, we truncate the elliptically symmetric angular Gaussian (ESAG) distribution to the non-negative orthant. Additionally, we propose a spatial hyperspheric regression that contains fixed and random multivariate spatial effects. The proposed method also contains a term that can be used to propagate uncertainty that may arise from precursory stochastic models (i.e., machine learning classification). We demonstrate our method on a simulation study and on classified bioacoustic signals of the Dryobates pubescens (downy woodpecker)."}, "https://arxiv.org/abs/2410.02774": {"title": "Estimating the Unobservable Components of Electricity Demand Response with Inverse Optimization", "link": "https://arxiv.org/abs/2410.02774", "description": "arXiv:2410.02774v1 Announce Type: cross \nAbstract: Understanding and predicting the electricity demand responses to prices are critical activities for system operators, retailers, and regulators. While conventional machine learning and time series analyses have been adequate for the routine demand patterns that have adapted only slowly over many years, the emergence of active consumers with flexible assets such as solar-plus-storage systems, and electric vehicles, introduces new challenges. These active consumers exhibit more complex consumption patterns, the drivers of which are often unobservable to the retailers and system operators. In practice, system operators and retailers can only monitor the net demand (metered at grid connection points), which reflects the overall energy consumption or production exchanged with the grid. As a result, all \"behind-the-meter\" activities-such as the use of flexibility-remain hidden from these entities. Such behind-the-meter behavior may be controlled by third party agents or incentivized by tariffs; in either case, the retailer's revenue and the system loads would be impacted by these activities behind the meter, but their details can only be inferred. We define the main components of net demand, as baseload, flexible, and self-generation, each having nonlinear responses to market price signals. As flexible demand response and self generation are increasing, this raises a pressing question of whether existing methods still perform well and, if not, whether there is an alternative way to understand and project the unobserved components of behavior. In response to this practical challenge, we evaluate the potential of a data-driven inverse optimization (IO) methodology. 
This approach characterizes decomposed consumption patterns without requiring direct observation of behind-the-meter behavior or device-level metering [...]"}, "https://arxiv.org/abs/2410.02799": {"title": "A Data Envelopment Analysis Approach for Assessing Fairness in Resource Allocation: Application to Kidney Exchange Programs", "link": "https://arxiv.org/abs/2410.02799", "description": "arXiv:2410.02799v1 Announce Type: cross \nAbstract: Kidney exchange programs have significantly increased transplantation rates but raise pressing questions about fairness in organ allocation. We present a novel framework leveraging Data Envelopment Analysis (DEA) to evaluate multiple fairness criteria--Priority, Access, and Outcome--within a single model, capturing complexities that may be overlooked in single-metric analyses. Using data from the United Network for Organ Sharing, we analyze these criteria individually, measuring Priority fairness through waitlist durations, Access fairness through Kidney Donor Profile Index scores, and Outcome fairness through graft lifespan. We then apply our DEA model to demonstrate significant disparities in kidney allocation efficiency across ethnic groups. To quantify uncertainty, we employ conformal prediction within the DEA framework, yielding group-conditional prediction intervals with finite sample coverage guarantees. Our findings show notable differences in efficiency distributions between ethnic groups. Our study provides a rigorous framework for evaluating fairness in complex resource allocation systems, where resource scarcity and mutual compatibility constraints exist. All code for using the proposed method and reproducing results is available on GitHub."}, "https://arxiv.org/abs/2410.02835": {"title": "The MLE is minimax optimal for LGC", "link": "https://arxiv.org/abs/2410.02835", "description": "arXiv:2410.02835v1 Announce Type: cross \nAbstract: We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convegence rates of the Maximum Likelihood Estimator (MLE). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the MLE. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions -- which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures."}, "https://arxiv.org/abs/2410.03346": {"title": "Implementing Response-Adaptive Randomisation in Stratified Rare-disease Trials: Design Challenges and Practical Solutions", "link": "https://arxiv.org/abs/2410.03346", "description": "arXiv:2410.03346v1 Announce Type: cross \nAbstract: Although response-adaptive randomisation (RAR) has gained substantial attention in the literature, it still has limited use in clinical trials. Amongst other reasons, the implementation of RAR in the real world raises important practical questions, often neglected. Motivated by an innovative phase-II stratified RAR trial, this paper addresses two challenges: (1) How to ensure that RAR allocations are both desirable and faithful to target probabilities, even in small samples? and (2) What adaptations to trigger after interim analyses in the presence of missing data? 
We propose a Mapping strategy that discretises the randomisation probabilities into a vector of allocation ratios, resulting in improved frequentist errors. Under the implementation of Mapping, we analyse the impact of missing data on operating characteristics by examining selected scenarios. Finally, we discuss additional concerns including: pooling data across trial strata, analysing the level of blinding in the trial, and reporting safety results."}, "https://arxiv.org/abs/2410.03630": {"title": "Is Gibbs sampling faster than Hamiltonian Monte Carlo on GLMs?", "link": "https://arxiv.org/abs/2410.03630", "description": "arXiv:2410.03630v1 Announce Type: cross \nAbstract: The Hamiltonian Monte Carlo (HMC) algorithm is often lauded for its ability to effectively sample from high-dimensional distributions. In this paper we challenge the presumed domination of HMC for the Bayesian analysis of GLMs. By utilizing the structure of the compute graph rather than the graphical model, we reduce the time per sweep of a full-scan Gibbs sampler from $O(d^2)$ to $O(d)$, where $d$ is the number of GLM parameters. Our simple changes to the implementation of the Gibbs sampler allow us to perform Bayesian inference on high-dimensional GLMs that are practically infeasible with traditional Gibbs sampler implementations. We empirically demonstrate a substantial increase in effective sample size per time when comparing our Gibbs algorithms to state-of-the-art HMC algorithms. While Gibbs is superior in terms of dimension scaling, neither Gibbs nor HMC dominate the other: we provide numerical and theoretical evidence that HMC retains an edge in certain circumstances thanks to its advantageous condition number scaling. Interestingly, for GLMs of fixed data size, we observe that increasing dimensionality can stabilize or even decrease condition number, shedding light on the empirical advantage of our efficient Gibbs sampler."}, "https://arxiv.org/abs/2410.03651": {"title": "Minimax-optimal trust-aware multi-armed bandits", "link": "https://arxiv.org/abs/2410.03651", "description": "arXiv:2410.03651v1 Announce Type: cross \nAbstract: Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, it assumes that the recommended and actually implemented policy differs depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. 
A simulation study is conducted to illustrate the benefits of our proposed algorithm when dealing with the trust issue."}, "https://arxiv.org/abs/2207.08373": {"title": "Functional varying-coefficient model under heteroskedasticity with application to DTI data", "link": "https://arxiv.org/abs/2207.08373", "description": "arXiv:2207.08373v2 Announce Type: replace \nAbstract: In this paper, we develop a multi-step estimation procedure to simultaneously estimate the varying-coefficient functions using a local-linear generalized method of moments (GMM) based on continuous moment conditions. To incorporate spatial dependence, the continuous moment conditions are first projected onto eigen-functions and then combined by weighted eigen-values, thereby, solving the challenges of using an inverse covariance operator directly. We propose an optimal instrument variable that minimizes the asymptotic variance function among the class of all local-linear GMM estimators, and it outperforms the initial estimates which do not incorporate the spatial dependence. Our proposed method significantly improves the accuracy of the estimation under heteroskedasticity and its asymptotic properties have been investigated. Extensive simulation studies illustrate the finite sample performance, and the efficacy of the proposed method is confirmed by real data analysis."}, "https://arxiv.org/abs/2410.03814": {"title": "Graphical models for inference: A model comparison approach for analyzing bacterial conjugation", "link": "https://arxiv.org/abs/2410.03814", "description": "arXiv:2410.03814v1 Announce Type: new \nAbstract: We present a proof-of-concept of a model comparison approach for analyzing spatio-temporal observations of interacting populations. Our model variants are a collection of structurally similar Bayesian networks. Their distinct Noisy-Or conditional probability distributions describe interactions within the population, with each distribution corresponding to a specific mechanism of interaction. To determine which distributions most accurately represent the underlying mechanisms, we examine the accuracy of each Bayesian network with respect to observational data. We implement such a system for observations of bacterial populations engaged in conjugation, a type of horizontal gene transfer that allows microbes to share genetic material with nearby cells through physical contact. Evaluating cell-specific factors that affect conjugation is generally difficult because of the stochastic nature of the process. Our approach provides a new method for gaining insight into this process. We compare eight model variations for each of three experimental trials and rank them using two different metrics"}, "https://arxiv.org/abs/2410.03911": {"title": "Efficient Bayesian Additive Regression Models For Microbiome Studies", "link": "https://arxiv.org/abs/2410.03911", "description": "arXiv:2410.03911v1 Announce Type: new \nAbstract: Statistical analysis of microbiome data is challenging. Bayesian multinomial logistic-normal (MLN) models have gained popularity due to their ability to account for the count compositional nature of these data. However, these models are often computationally intractable to infer. Recently, we developed a computationally efficient and accurate approach to inferring MLN models with a Marginally Latent Matrix-T Process (MLTP) form: MLN-MLTPs. Our approach is based on a novel sampler with a marginal Laplace approximation -- called the \\textit{Collapse-Uncollapse} (CU) sampler. 
However, existing work with MLTPs has been limited to linear models or models of a single non-linear process. Moreover, existing methods lack an efficient means of estimating model hyperparameters. This article addresses both deficiencies. We introduce a new class of MLN Additive Gaussian Process models (\\textit{MultiAddGPs}) for deconvolution of overlapping linear and non-linear processes. We show that MultiAddGPs are examples of MLN-MLTPs and derive an efficient CU sampler for this model class. Moreover, we derive efficient Maximum Marginal Likelihood estimation for hyperparameters in MLTP models by taking advantage of Laplace approximation in the CU sampler. We demonstrate our approach using simulated and real data studies. Our models produce novel biological insights from a previously published artificial gut study."}, "https://arxiv.org/abs/2410.04020": {"title": "\"6 choose 4\": A framework to understand and facilitate discussion of strategies for overall survival safety monitoring", "link": "https://arxiv.org/abs/2410.04020", "description": "arXiv:2410.04020v1 Announce Type: new \nAbstract: Advances in anticancer therapies have significantly contributed to declining death rates in certain disease and clinical settings. However, they have also made it difficult to power a clinical trial in these settings with overall survival (OS) as the primary efficacy endpoint. A number of statistical approaches have therefore been recently proposed for the pre-specified analysis of OS as a safety endpoint (Shan, 2023; Fleming et al., 2024; Rodriguez et al., 2024). In this paper, we provide a simple, unifying framework that includes the aforementioned approaches (and a couple others) as special cases. By highlighting each approach's focus, priority, tolerance for risk, and strengths or challenges for practical implementation, this framework can help to facilitate discussions between stakeholders on \"fit-for-purpose OS data collection and assessment of harm\" (American Association for Cancer Research, 2024). We apply this framework to a real clinical trial in large B-cell lymphoma to illustrate its application and value. Several recommendations and open questions are also raised."}, "https://arxiv.org/abs/2410.04028": {"title": "Penalized Sparse Covariance Regression with High Dimensional Covariates", "link": "https://arxiv.org/abs/2410.04028", "description": "arXiv:2410.04028v1 Announce Type: new \nAbstract: Covariance regression offers an effective way to model the large covariance matrix with the auxiliary similarity matrices. In this work, we propose a sparse covariance regression (SCR) approach to handle the potentially high-dimensional predictors (i.e., similarity matrices). Specifically, we use the penalization method to identify the informative predictors and estimate their associated coefficients simultaneously. We first investigate the Lasso estimator and subsequently consider the folded concave penalized estimation methods (e.g., SCAD and MCP). However, the theoretical analysis of the existing penalization methods is primarily based on i.i.d. data, which is not directly applicable to our scenario. To address this difficulty, we establish the non-asymptotic error bounds by exploiting the spectral properties of the covariance matrix and similarity matrices. Then, we derive the estimation error bound for the Lasso estimator and establish the desirable oracle property of the folded concave penalized estimator. 
Extensive simulation studies are conducted to corroborate our theoretical results. We also illustrate the usefulness of the proposed method by applying it to a Chinese stock market dataset."}, "https://arxiv.org/abs/2410.04082": {"title": "Jackknife empirical likelihood ratio test for log symmetric distribution using probability weighted moments", "link": "https://arxiv.org/abs/2410.04082", "description": "arXiv:2410.04082v1 Announce Type: new \nAbstract: Log symmetric distributions are useful in modeling data which show high skewness and have found applications in various fields. Using a recent characterization for log symmetric distributions, we propose a goodness of fit test for testing log symmetry. The asymptotic distributions of the test statistics under both null and alternative distributions are obtained. As the normal-based test is difficult to implement, we also propose a jackknife empirical likelihood (JEL) ratio test for testing log symmetry. We conduct a Monte Carlo simulation to evaluate the performance of the JEL ratio test. Finally, we illustrate our methodology using different data sets."}, "https://arxiv.org/abs/2410.04132": {"title": "A Novel Unit Distribution Named As Median Based Unit Rayleigh (MBUR): Properties and Estimations", "link": "https://arxiv.org/abs/2410.04132", "description": "arXiv:2410.04132v1 Announce Type: new \nAbstract: The continual emergence of new distributions is essential for understanding the world and the environment surrounding us. In this paper, the author discusses a new distribution defined on the interval (0,1), covering the methodology for deducing its PDF, some of its properties, and related functions. A simulation and a real data analysis are also highlighted."}, "https://arxiv.org/abs/2410.04165": {"title": "How to Compare Copula Forecasts?", "link": "https://arxiv.org/abs/2410.04165", "description": "arXiv:2410.04165v1 Announce Type: new \nAbstract: This paper lays out a principled approach to compare copula forecasts via strictly consistent scores. We first establish the negative result that, in general, copulas fail to be elicitable, implying that copula predictions cannot sensibly be compared on their own. A notable exception is on Fr\\'echet classes, that is, when the marginal distribution structure is given and fixed, in which case we give suitable scores for the copula forecast comparison. As a remedy for the general non-elicitability of copulas, we establish novel multi-objective scores for copula forecasts along with marginal forecasts. They give rise to two-step tests of equal or superior predictive ability which admit attribution of the forecast ranking to the accuracy of the copulas or the marginals. Simulations show that our two-step tests work well in terms of size and power. We illustrate our new methodology via an empirical example using copula forecasts for international stock market indices."}, "https://arxiv.org/abs/2410.04170": {"title": "Physics-encoded Spatio-temporal Regression", "link": "https://arxiv.org/abs/2410.04170", "description": "arXiv:2410.04170v1 Announce Type: new \nAbstract: Physics-informed methods have gained great success in analyzing data with partial differential equation (PDE) constraints, which are ubiquitous when modeling dynamical systems. Different from the common penalty-based approach, this work promotes adherence to the underlying physical mechanism that facilitates statistical procedures. 
The motivating application concerns modeling fluorescence recovery after photobleaching, which is used for characterization of diffusion processes. We propose a physics-encoded regression model for handling spatio-temporally distributed data, which enables principled interpretability, parsimonious computation and efficient estimation by exploiting the structure of solutions of a governing evolution equation. The rate of convergence attaining the minimax optimality is theoretically demonstrated, generalizing the result obtained for the spatial regression. We conduct simulation studies to assess the performance of our proposed estimator and illustrate its usage in the aforementioned real data example."}, "https://arxiv.org/abs/2410.04312": {"title": "Adjusting for Spatial Correlation in Machine and Deep Learning", "link": "https://arxiv.org/abs/2410.04312", "description": "arXiv:2410.04312v1 Announce Type: new \nAbstract: Spatial data display correlation between observations collected at neighboring locations. Generally, machine and deep learning methods either do not account for this correlation or do so indirectly through correlated features and thereby forfeit predictive accuracy. To remedy this shortcoming, we propose preprocessing the data using a spatial decorrelation transform derived from properties of a multivariate Gaussian distribution and Vecchia approximations. The transformed data can then be ported into a machine or deep learning tool. After model fitting on the transformed data, the output can be spatially re-correlated via the corresponding inverse transformation. We show that including this spatial adjustment results in higher predictive accuracy on simulated and real spatial datasets."}, "https://arxiv.org/abs/2410.04330": {"title": "Inference in High-Dimensional Linear Projections: Multi-Horizon Granger Causality and Network Connectedness", "link": "https://arxiv.org/abs/2410.04330", "description": "arXiv:2410.04330v1 Announce Type: new \nAbstract: This paper presents a Wald test for multi-horizon Granger causality within a high-dimensional sparse Vector Autoregression (VAR) framework. The null hypothesis focuses on the causal coefficients of interest in a local projection (LP) at a given horizon. Nevertheless, the post-double-selection method on LP may not be applicable in this context, as a sparse VAR model does not necessarily imply a sparse LP for horizon h>1. To validate the proposed test, we develop two types of de-biased estimators for the causal coefficients of interest, both relying on first-step machine learning estimators of the VAR slope parameters. The first estimator is derived from the Least Squares method, while the second is obtained through a two-stage approach that offers potential efficiency gains. We further derive heteroskedasticity- and autocorrelation-consistent (HAC) inference for each estimator. Additionally, we propose a robust inference method for the two-stage estimator, eliminating the need to correct for serial correlation in the projection residuals. Monte Carlo simulations show that the two-stage estimator with robust inference outperforms the Least Squares method in terms of the Wald test size, particularly for longer projection horizons. We apply our methodology to analyze the interconnectedness of policy-related economic uncertainty among a large set of countries in both the short and long run. Specifically, we construct a causal network to visualize how economic uncertainty spreads across countries over time. 
Our empirical findings reveal, among other insights, that in the short run (1 and 3 months), the U.S. influences China, while in the long run (9 and 12 months), China influences the U.S. Identifying these connections can help anticipate a country's potential vulnerabilities and propose proactive solutions to mitigate the transmission of economic uncertainty."}, "https://arxiv.org/abs/2410.04356": {"title": "Subspace decompositions for association structure learning in multivariate categorical response regression", "link": "https://arxiv.org/abs/2410.04356", "description": "arXiv:2410.04356v1 Announce Type: new \nAbstract: Modeling the complex relationships between multiple categorical response variables as a function of predictors is a fundamental task in the analysis of categorical data. However, existing methods can be difficult to interpret and may lack flexibility. To address these challenges, we introduce a penalized likelihood method for multivariate categorical response regression that relies on a novel subspace decomposition to parameterize interpretable association structures. Our approach models the relationships between categorical responses by identifying mutual, joint, and conditionally independent associations, which yields a linear problem within a tensor product space. We establish theoretical guarantees for our estimator, including error bounds in high-dimensional settings, and demonstrate the method's interpretability and prediction accuracy through comprehensive simulation studies."}, "https://arxiv.org/abs/2410.04359": {"title": "Efficient, Cross-Fitting Estimation of Semiparametric Spatial Point Processes", "link": "https://arxiv.org/abs/2410.04359", "description": "arXiv:2410.04359v1 Announce Type: new \nAbstract: We study a broad class of models called semiparametric spatial point processes where the first-order intensity function contains both a parametric component and a nonparametric component. We propose a novel, spatial cross-fitting estimator of the parametric component based on random thinning, a common simulation technique in point processes. The proposed estimator is shown to be consistent and in many settings, asymptotically Normal. Also, we generalize the notion of semiparametric efficiency lower bound in i.i.d. settings to spatial point processes and show that the proposed estimator achieves the efficiency lower bound if the process is Poisson. Next, we present a new spatial kernel regression estimator that can estimate the nonparametric component of the intensity function at the desired rates for inference. Despite the dependence induced by the point process, we show that our estimator can be computed using existing software for generalized partial linear models in i.i.d. settings. We conclude with a small simulation study and a re-analysis of the spatial distribution of rainforest trees."}, "https://arxiv.org/abs/2410.04384": {"title": "Statistical Inference for Four-Regime Segmented Regression Models", "link": "https://arxiv.org/abs/2410.04384", "description": "arXiv:2410.04384v1 Announce Type: new \nAbstract: Segmented regression models offer model flexibility and interpretability as compared to the global parametric and the nonparametric models, and yet are challenging in both estimation and inference. We consider a four-regime segmented model for temporally dependent data with segmenting boundaries depending on multivariate covariates with non-diminishing boundary effects. 
A mixed integer quadratic programming algorithm is formulated to facilitate the least square estimation of the regression and the boundary parameters. The rates of convergence and the asymptotic distributions of the least square estimators are obtained for the regression and the boundary coefficients, respectively. We propose a smoothed regression bootstrap to facilitate inference on the parameters and a model selection procedure to select the most suitable model within the model class with at most four segments. Numerical simulations and a case study on air pollution in Beijing are conducted to demonstrate the proposed approach, which shows that the segmented models with three or four regimes are suitable for the modeling of the meteorological effects on the PM2.5 concentration."}, "https://arxiv.org/abs/2410.04390": {"title": "Approximate Maximum Likelihood Inference for Acoustic Spatial Capture-Recapture with Unknown Identities, Using Monte Carlo Expectation Maximization", "link": "https://arxiv.org/abs/2410.04390", "description": "arXiv:2410.04390v1 Announce Type: new \nAbstract: Acoustic spatial capture-recapture (ASCR) surveys with an array of synchronized acoustic detectors can be an effective way of estimating animal density or call density. However, constructing the capture histories required for ASCR analysis is challenging, as recognizing which detections at different detectors are of which calls is not a trivial task. Because calls from different distances take different times to arrive at detectors, the order in which calls are detected is not necessarily the same as the order in which they are made, and without knowing which detections are of the same call, we do not know how many different calls are detected. We propose a Monte Carlo expectation-maximization (MCEM) estimation method to resolve this unknown call identity problem. To implement the MCEM method in this context, we sample the latent variables from a complete-data likelihood model in the expectation step and use a semi-complete-data likelihood or conditional likelihood in the maximization step. We use a parametric bootstrap to obtain confidence intervals. When we apply our method to a survey of moss frogs, it gives an estimate within 15% of the estimate obtained using data with call capture histories constructed by experts, and unlike this latter estimate, our confidence interval incorporates the uncertainty about call identities. Simulations show it to have a low bias (6%) and coverage probabilities close to the nominal 95% value."}, "https://arxiv.org/abs/2410.04398": {"title": "Transfer Learning with General Estimating Equations", "link": "https://arxiv.org/abs/2410.04398", "description": "arXiv:2410.04398v1 Announce Type: new \nAbstract: We consider statistical inference for parameters defined by general estimating equations under the covariate shift transfer learning. Different from the commonly used density ratio weighting approach, we undertake a set of formulations to make the statistical inference semiparametric efficient with simple inference. It starts with re-constructing the estimation equations to make them Neyman orthogonal, which facilitates more robustness against errors in the estimation of two key nuisance functions, the density ratio and the conditional mean of the moment function. We present a divergence-based method to estimate the density ratio function, which is amenable to machine learning algorithms including the deep learning. 
To address the challenge that the conditional mean is parametric-dependent, we adopt a nonparametric multiple-imputation strategy that avoids regression at all possible parameter values. With the estimated nuisance functions and the orthogonal estimation equation, the inference for the target parameter is formulated via the empirical likelihood without sample splittings. We show that the proposed estimator attains the semiparametric efficiency bound, and the inference can be conducted with the Wilks' theorem. The proposed method is further evaluated by simulations and an empirical study on a transfer learning inference for ground-level ozone pollution"}, "https://arxiv.org/abs/2410.04431": {"title": "A Structural Approach to Growth-at-Risk", "link": "https://arxiv.org/abs/2410.04431", "description": "arXiv:2410.04431v1 Announce Type: new \nAbstract: We identify the structural impulse responses of quantiles of the outcome variable to a shock. Our estimation strategy explicitly distinguishes treatment from control variables, allowing us to model responses of unconditional quantiles while using controls for identification. Disentangling the effect of adding control variables on identification versus interpretation brings our structural quantile impulse responses conceptually closer to structural mean impulse responses. Applying our methodology to study the impact of financial shocks on lower quantiles of output growth confirms that financial shocks have an outsized effect on growth-at-risk, but the magnitude of our estimates is more extreme than in previous studies."}, "https://arxiv.org/abs/2410.04438": {"title": "A Reflection on the Impact of Misspecifying Unidentifiable Causal Inference Models in Surrogate Endpoint Evaluation", "link": "https://arxiv.org/abs/2410.04438", "description": "arXiv:2410.04438v1 Announce Type: new \nAbstract: Surrogate endpoints are often used in place of expensive, delayed, or rare true endpoints in clinical trials. However, regulatory authorities require thorough evaluation to accept these surrogate endpoints as reliable substitutes. One evaluation approach is the information-theoretic causal inference framework, which quantifies surrogacy using the individual causal association (ICA). Like most causal inference methods, this approach relies on models that are only partially identifiable. For continuous outcomes, a normal model is often used. Based on theoretical elements and a Monte Carlo procedure we studied the impact of model misspecification across two scenarios: 1) the true model is based on a multivariate t-distribution, and 2) the true model is based on a multivariate log-normal distribution. In the first scenario, the misspecification has a negligible impact on the results, while in the second, it has a significant impact when the misspecification is detectable using the observed data. Finally, we analyzed two data sets using the normal model and several D-vine copula models that were indistinguishable from the normal model based on the data at hand. We observed that the results may vary when different models are used."}, "https://arxiv.org/abs/2410.04496": {"title": "Hybrid Monte Carlo for Failure Probability Estimation with Gaussian Process Surrogates", "link": "https://arxiv.org/abs/2410.04496", "description": "arXiv:2410.04496v1 Announce Type: new \nAbstract: We tackle the problem of quantifying failure probabilities for expensive computer experiments with stochastic inputs. 
The computational cost of evaluating the computer simulation prohibits direct Monte Carlo (MC) and necessitates a statistical surrogate model. Surrogate-informed importance sampling -- which leverages the surrogate to identify suspected failures, fits a bias distribution to these locations, then calculates failure probabilities using a weighted average -- is popular, but it is data hungry and can provide erroneous results when budgets are limited. Instead, we propose a hybrid MC scheme which first uses the uncertainty quantification (UQ) of a Gaussian process (GP) surrogate to identify areas of high classification uncertainty, then combines surrogate predictions in certain regions with true simulator evaluation in uncertain regions. We also develop a stopping criterion which informs the allocation of a fixed budget of simulator evaluations between surrogate training and failure probability estimation. Our method is agnostic to surrogate choice (as long as UQ is provided); we showcase functionality with both GPs and deep GPs. It is also agnostic to design choices; we deploy contour locating sequential designs throughout. With these tools, we are able to effectively estimate small failure probabilities with only hundreds of simulator evaluations. We validate our method on a variety of synthetic benchmarks before deploying it on an expensive computer experiment of fluid flow around an airfoil."}, "https://arxiv.org/abs/2410.04561": {"title": "A Bayesian Method for Adverse Effects Estimation in Observational Studies with Truncation by Death", "link": "https://arxiv.org/abs/2410.04561", "description": "arXiv:2410.04561v1 Announce Type: new \nAbstract: Death among subjects is common in observational studies evaluating the causal effects of interventions among geriatric or severely ill patients. High mortality rates complicate the comparison of the prevalence of adverse events (AEs) between interventions. This problem is often referred to as outcome \"truncation\" by death. A possible solution is to estimate the survivor average causal effect (SACE), an estimand that evaluates the effects of interventions among those who would have survived under both treatment assignments. However, because the SACE does not include subjects who would have died under one or both arms, it does not consider the relationship between AEs and death. We propose a Bayesian method which imputes the unobserved mortality and AE outcomes for each participant under the intervention they did not receive. Using the imputed outcomes we define a composite ordinal outcome for each patient, combining the occurrence of death and the AE in an increasing scale of severity. This allows for the comparison of the effects of the interventions on death and the AE simultaneously among the entire sample. We implement the procedure to analyze the incidence of heart failure among geriatric patients being treated for Type II diabetes with sulfonylureas or dipeptidyl peptidase-4 inhibitors."}, "https://arxiv.org/abs/2410.04667": {"title": "A Finite Mixture Hidden Markov Model for Intermittently Observed Disease Process with Heterogeneity and Partially Known Disease Type", "link": "https://arxiv.org/abs/2410.04667", "description": "arXiv:2410.04667v1 Announce Type: new \nAbstract: Continuous-time multistate models are widely used for analyzing interval-censored data on disease progression over time. 
Sometimes, diseases manifest differently and what appears to be a coherent collection of symptoms is the expression of multiple distinct disease subtypes. To address this complexity, we propose a mixture hidden Markov model, where the observation process encompasses states representing common symptomatic stages across these diseases, and each underlying process corresponds to a distinct disease subtype. Our method models both the overall and the type-specific disease incidence/prevalence accounting for sampling conditions and exactly observed death times. Additionally, it can utilize partially available disease-type information, which offers insights into the pathway through specific hidden states in the disease process, to aid in the estimation. We present both a frequentist and a Bayesian way to obtain the estimates. The finite sample performance is evaluated through simulation studies. We demonstrate our method using the Nun Study and model the development and progression of dementia, encompassing both Alzheimer's disease (AD) and non-AD dementia."}, "https://arxiv.org/abs/2410.04676": {"title": "Democratizing Strategic Planning in Master-Planned Communities", "link": "https://arxiv.org/abs/2410.04676", "description": "arXiv:2410.04676v1 Announce Type: new \nAbstract: This paper introduces a strategic planning tool for master-planned communities designed specifically to quantify residents' subjective preferences about large investments in amenities and infrastructure projects. Drawing on data obtained from brief online surveys, the tool ranks alternative plans by considering the aggregate anticipated utilization of each proposed amenity and cost sensitivity to it (or risk sensitivity for infrastructure plans). In addition, the tool estimates the percentage of households that favor the preferred plan and predicts whether residents would actually be willing to fund the project. The mathematical underpinnings of the tool are borrowed from utility theory, incorporating exponential functions to model diminishing marginal returns on quality, cost, and risk mitigation."}, "https://arxiv.org/abs/2410.04696": {"title": "Efficient Input Uncertainty Quantification for Ratio Estimator", "link": "https://arxiv.org/abs/2410.04696", "description": "arXiv:2410.04696v1 Announce Type: new \nAbstract: We study the construction of a confidence interval (CI) for a simulation output performance measure that accounts for input uncertainty when the input models are estimated from finite data. In particular, we focus on performance measures that can be expressed as a ratio of two dependent simulation outputs' means. We adopt the parametric bootstrap method to mimic input data sampling and construct the percentile bootstrap CI after estimating the ratio at each bootstrap sample. The standard estimator, which takes the ratio of two sample averages, tends to exhibit large finite-sample bias and variance, leading to overcoverage of the percentile bootstrap CI. To address this, we propose two new ratio estimators that replace the sample averages with pooled mean estimators via the $k$-nearest neighbor ($k$NN) regression: the $k$NN estimator and the $k$LR estimator. The $k$NN estimator performs well in low dimensions but its theoretical performance guarantee degrades as the dimension increases. 
The $k$LR estimator combines the likelihood ratio (LR) method with the $k$NN regression, leveraging the strengths of both while mitigating their weaknesses; the LR method removes dependence on dimension, while the variance inflation introduced by the LR is controlled by $k$NN. Based on asymptotic analyses and finite-sample heuristics, we propose an experiment design that maximizes the efficiency of the proposed estimators and demonstrate their empirical performances using three examples including one in the enterprise risk management application."}, "https://arxiv.org/abs/2410.04700": {"title": "Nonparametric tests for interaction in two-way ANOVA with balanced replications", "link": "https://arxiv.org/abs/2410.04700", "description": "arXiv:2410.04700v1 Announce Type: new \nAbstract: Nonparametric procedures are more powerful for detecting interaction in two-way ANOVA when the data are non-normal. In this paper, we compute null critical values for the aligned rank-based tests (APCSSA/APCSSM) where the levels of the factors are between 2 and 6. We compare the performance of these new procedures with the ANOVA F-test for interaction, the adjusted rank transform test (ART), Conover's rank transform procedure (RT), and a rank-based ANOVA test (raov) using Monte Carlo simulations. The new procedures APCSSA/APCSSM are comparable with existing competitors in all settings. Even though there is no single dominant test in detecting interaction effects for non-normal data, nonparametric procedure APCSSM is the most highly recommended procedure for Cauchy errors settings."}, "https://arxiv.org/abs/2410.04864": {"title": "Order of Addition in Mixture-Amount Experiments", "link": "https://arxiv.org/abs/2410.04864", "description": "arXiv:2410.04864v1 Announce Type: new \nAbstract: In a mixture experiment, we study the behavior and properties of $m$ mixture components, where the primary focus is on the proportions of the components that make up the mixture rather than the total amount. Mixture-amount experiments are specialized types of mixture experiments where both the proportions of the components in the mixture and the total amount of the mixture are of interest. Such experiments consider the total amount of the mixture as a variable. In this paper, we consider an Order-of-Addition (OofA) mixture-amount experiment in which the response depends on both the mixture amounts of components and their order of addition. Full Mixture OofA designs are constructed to maintain orthogonality between the mixture-amount model terms and the effects of the order of addition. These designs enable the estimation of mixture-component model parameters and the order-of-addition effects. The G-efficiency criterion assesses how well the design supports precise and unbiased estimation of the model parameters. The Fraction of Design Space (FDS) plot is used to provide a visual assessment of the prediction capabilities of a design across the entire design space."}, "https://arxiv.org/abs/2410.04922": {"title": "Random-projection ensemble dimension reduction", "link": "https://arxiv.org/abs/2410.04922", "description": "arXiv:2410.04922v1 Announce Type: new \nAbstract: We introduce a new framework for dimension reduction in the context of high-dimensional regression. Our proposal is to aggregate an ensemble of random projections, which have been carefully chosen based on the empirical regression performance after being applied to the covariates. 
More precisely, we consider disjoint groups of independent random projections, apply a base regression method after each projection, and retain the projection in each group based on the empirical performance. We aggregate the selected projections by taking the singular value decomposition of their empirical average and then output the leading order singular vectors. A particularly appealing aspect of our approach is that the singular values provide a measure of the relative importance of the corresponding projection directions, which can be used to select the final projection dimension. We investigate in detail (and provide default recommendations for) various aspects of our general framework, including the projection distribution and the base regression method, as well as the number of random projections used. Additionally, we investigate the possibility of further reducing the dimension by applying our algorithm twice in cases where projection dimension recommended in the initial application is too large. Our theoretical results show that the error of our algorithm stabilises as the number of groups of projections increases. We demonstrate the excellent empirical performance of our proposal in a large numerical study using simulated and real data."}, "https://arxiv.org/abs/2410.04996": {"title": "Assumption-Lean Post-Integrated Inference with Negative Control Outcomes", "link": "https://arxiv.org/abs/2410.04996", "description": "arXiv:2410.04996v1 Announce Type: new \nAbstract: Data integration has become increasingly common in aligning multiple heterogeneous datasets. With high-dimensional outcomes, data integration methods aim to extract low-dimensional embeddings of observations to remove unwanted variations, such as batch effects and unmeasured covariates, inherent in data collected from different sources. However, multiple hypothesis testing after data integration can be substantially biased due to the data-dependent integration processes. To address this challenge, we introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes. By leveraging causal interpretations, we derive nonparametric identification conditions that form the basis of our PII approach.\n Our assumption-lean semiparametric inference method extends robustness and generality to projected direct effect estimands that account for mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide deterministic quantifications of the bias of target estimands induced by estimated embeddings and finite-sample linear expansions of the estimators with uniform concentration bounds on the residuals for all outcomes.\n The proposed doubly robust estimators are consistent and efficient under minimal assumptions, facilitating data-adaptive estimation with machine learning algorithms. Using random forests, we evaluate empirical statistical errors in simulations and analyze single-cell CRISPR perturbed datasets with potential unmeasured confounders."}, "https://arxiv.org/abs/2410.05008": {"title": "Testing procedures based on maximum likelihood estimation for Marked Hawkes processes", "link": "https://arxiv.org/abs/2410.05008", "description": "arXiv:2410.05008v1 Announce Type: new \nAbstract: The Hawkes model is a past-dependent point process, widely used in various fields for modeling temporal clustering of events. 
Extending this framework, the multidimensional marked Hawkes process incorporates multiple interacting event types and additional marks, enhancing its capability to model complex dependencies in multivariate time series data. However, increasing the complexity of the model also increases the computational cost of the associated estimation methods and may induce an overfitting of the model. Therefore, it is essential to find a trade-off between accuracy and artificial complexity of the model. In order to find the appropriate version of Hawkes processes, we address, in this paper, the tasks of model fit evaluation and parameter testing for marked Hawkes processes. This article focuses on parametric Hawkes processes with exponential memory kernels, a popular variant for its theoretical and practical advantages. Our work introduces robust testing methodologies for assessing model parameters and complexity, building upon and extending previous theoretical frameworks. We then validate the practical robustness of these tests through comprehensive numerical studies, especially in scenarios where theoretical guarantees remains incomplete."}, "https://arxiv.org/abs/2410.05082": {"title": "Large datasets for the Euro Area and its member countries and the dynamic effects of the common monetary policy", "link": "https://arxiv.org/abs/2410.05082", "description": "arXiv:2410.05082v1 Announce Type: new \nAbstract: We present and describe a new publicly available large dataset which encompasses quarterly and monthly macroeconomic time series for both the Euro Area (EA) as a whole and its ten primary member countries. The dataset, which is called EA-MD-QD, includes more than 800 time series and spans the period from January 2000 to the latest available month. Since January 2024 EA-MD-QD is updated on a monthly basis and constantly revised, making it an essential resource for conducting policy analysis related to economic outcomes in the EA. To illustrate the usefulness of EA-MD-QD, we study the country specific Impulse Responses of the EA wide monetary policy shock by means of the Common Component VAR plus either Instrumental Variables or Sign Restrictions identification schemes. The results reveal asymmetries in the transmission of the monetary policy shock across countries, particularly between core and peripheral countries. Additionally, we find comovements across Euro Area countries' business cycles to be driven mostly by real variables, compared to nominal ones."}, "https://arxiv.org/abs/2410.05169": {"title": "False Discovery Rate Control for Fast Screening of Large-Scale Genomics Biobanks", "link": "https://arxiv.org/abs/2410.05169", "description": "arXiv:2410.05169v1 Announce Type: new \nAbstract: Genomics biobanks are information treasure troves with thousands of phenotypes (e.g., diseases, traits) and millions of single nucleotide polymorphisms (SNPs). The development of methodologies that provide reproducible discoveries is essential for the understanding of complex diseases and precision drug development. Without statistical reproducibility guarantees, valuable efforts are spent on researching false positives. Therefore, scalable multivariate and high-dimensional false discovery rate (FDR)-controlling variable selection methods are urgently needed, especially, for complex polygenic diseases and traits. In this work, we propose the Screen-T-Rex selector, a fast FDR-controlling method based on the recently developed T-Rex selector. 
The method is tailored to screening large-scale biobanks and it does not require choosing additional parameters (sparsity parameter, target FDR level, etc). Numerical simulations and a real-world HIV-1 drug resistance example demonstrate that the performance of the Screen-T-Rex selector is superior, and its computation time is multiple orders of magnitude lower compared to current benchmark knockoff methods."}, "https://arxiv.org/abs/2410.05211": {"title": "The Informed Elastic Net for Fast Grouped Variable Selection and FDR Control in Genomics Research", "link": "https://arxiv.org/abs/2410.05211", "description": "arXiv:2410.05211v1 Announce Type: new \nAbstract: Modern genomics research relies on genome-wide association studies (GWAS) to identify the few genetic variants among potentially millions that are associated with diseases of interest. Only reproducible discoveries of groups of associations improve our understanding of complex polygenic diseases and enable the development of new drugs and personalized medicine. Thus, fast multivariate variable selection methods that have a high true positive rate (TPR) while controlling the false discovery rate (FDR) are crucial. Recently, the T-Rex+GVS selector, a version of the T-Rex selector that uses the elastic net (EN) as a base selector to perform grouped variable selection, was proposed. Although it significantly increased the TPR in simulated GWAS compared to the original T-Rex, its comparably high computational cost limits scalability. Therefore, we propose the informed elastic net (IEN), a new base selector that significantly reduces computation time while retaining the grouped variable selection property. We quantify its grouping effect and derive its formulation as a Lasso-type optimization problem, which is solved efficiently within the T-Rex framework by the terminated LARS algorithm. Numerical simulations and a GWAS study demonstrate that the proposed T-Rex+GVS (IEN) exhibits the desired grouping effect, reduces computation time, and achieves the same TPR as T-Rex+GVS (EN) but with lower FDR, which makes it a promising method for large-scale GWAS."}, "https://arxiv.org/abs/2410.05212": {"title": "$\\texttt{rdid}$ and $\\texttt{rdidstag}$: Stata commands for robust difference-in-differences", "link": "https://arxiv.org/abs/2410.05212", "description": "arXiv:2410.05212v1 Announce Type: new \nAbstract: This article provides a Stata package for the implementation of the robust difference-in-differences (RDID) method developed in Ban and K\\'edagni (2023). It contains three main commands: $\\texttt{rdid}$, $\\texttt{rdid_dy}$, $\\texttt{rdidstag}$, which we describe in the introduction and the main text. We illustrate these commands through simulations and empirical examples."}, "https://arxiv.org/abs/2410.04554": {"title": "Fast algorithm for sparse least trimmed squares via trimmed-regularized reformulation", "link": "https://arxiv.org/abs/2410.04554", "description": "arXiv:2410.04554v1 Announce Type: cross \nAbstract: The least trimmed squares (LTS) is a reasonable formulation of robust regression whereas it suffers from high computational cost due to the nonconvexity and nonsmoothness of its objective function. The most frequently used FAST-LTS algorithm is particularly slow when a sparsity-inducing penalty such as the $\\ell_1$ norm is added. This paper proposes a computationally inexpensive algorithm for the sparse LTS, which is based on the proximal gradient method with a reformulation technique. 
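As a concrete illustration of the sparse LTS objective just mentioned (the sum of the h smallest squared residuals plus an l1 penalty), the following minimal sketch alternates between trimming the largest residuals and taking a proximal-gradient (soft-thresholding) step on the retained subset. It is not the reformulation or the algorithm proposed in the paper; the trimming fraction h_frac, the penalty lam, and the step size are arbitrary illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_lts_sketch(X, y, h_frac=0.75, lam=0.1, n_iter=500):
    """Alternate trimming of the largest residuals with a proximal-gradient step."""
    n, p = X.shape
    h = int(h_frac * n)
    step = h / np.linalg.norm(X, 2) ** 2       # crude step size from the full design
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        keep = np.argsort(r ** 2)[:h]          # keep the h smallest squared residuals
        grad = -X[keep].T @ r[keep] / h        # gradient of the trimmed least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:20] += 10.0                                 # gross outliers that should be trimmed
print(np.round(sparse_lts_sketch(X, y), 2))
```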
The proposed method is equipped with a theoretical convergence guarantee that is preferable to those of existing methods. Numerical experiments show that our method efficiently yields small objective values."}, "https://arxiv.org/abs/2410.04655": {"title": "Graph Fourier Neural Kernels (G-FuNK): Learning Solutions of Nonlinear Diffusive Parametric PDEs on Multiple Domains", "link": "https://arxiv.org/abs/2410.04655", "description": "arXiv:2410.04655v1 Announce Type: cross \nAbstract: Predicting time-dependent dynamics of complex systems governed by non-linear partial differential equations (PDEs) with varying parameters and domains is a challenging task motivated by applications across various fields. We introduce a novel family of neural operators based on our Graph Fourier Neural Kernels, designed to learn solution generators for nonlinear PDEs in which the highest-order term is diffusive, across multiple domains and parameters. G-FuNK combines components that are parameter- and domain-adapted with others that are not. The domain-adapted components are constructed using a weighted graph on the discretized domain, where the graph Laplacian approximates the highest-order diffusive term, ensuring boundary condition compliance and capturing the parameter and domain-specific behavior. Meanwhile, the learned components transfer across domains and parameters via Fourier Neural Operators. This approach naturally embeds geometric and directional information, improving generalization to new test domains without the need for retraining the network. To handle temporal dynamics, our method incorporates an integrated ODE solver to predict the evolution of the system. Experiments show G-FuNK's capability to accurately approximate heat, reaction diffusion, and cardiac electrophysiology equations across various geometries and anisotropic diffusivity fields. G-FuNK achieves low relative errors on unseen domains and fiber fields, significantly accelerating predictions compared to traditional finite-element solvers."}, "https://arxiv.org/abs/2410.05263": {"title": "Regression Conformal Prediction under Bias", "link": "https://arxiv.org/abs/2410.05263", "description": "arXiv:2410.05263v1 Announce Type: cross \nAbstract: Uncertainty quantification is crucial to account for the imperfect predictions of machine learning algorithms for high-impact applications. Conformal prediction (CP) is a powerful framework for uncertainty quantification that generates calibrated prediction intervals with valid coverage. In this work, we study how CP intervals are affected by bias - the systematic deviation of a prediction from ground truth values - a phenomenon prevalent in many real-world applications. We investigate the influence of bias on interval lengths of two different types of adjustments -- symmetric adjustments, the conventional method where both sides of the interval are adjusted equally, and asymmetric adjustments, a more flexible method where the interval can be adjusted unequally in positive or negative directions. We present theoretical and empirical analyses characterizing how symmetric and asymmetric adjustments impact the \"tightness\" of CP intervals for regression tasks. 
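The contrast between the two adjustment types can be seen in a toy split-conformal example with an absolute-residual score: add a constant bias b to otherwise accurate predictions and compare a single quantile of the absolute residuals (symmetric) with separate lower and upper quantiles of the signed residuals (asymmetric). The data-generating process below is an arbitrary stand-in, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_test, alpha, b = 2000, 2000, 0.1, 3.0

mu_cal, mu_test = rng.normal(size=n_cal), rng.normal(size=n_test)
y_cal = mu_cal + rng.normal(size=n_cal)
y_test = mu_test + rng.normal(size=n_test)
pred_cal = mu_cal + b          # predictions accurate up to a constant bias b
pred_test = mu_test + b

# Symmetric adjustment: a single quantile of the absolute residuals
# (the finite-sample (n+1) correction is omitted for brevity).
q_sym = np.quantile(np.abs(y_cal - pred_cal), 1 - alpha)
len_sym = 2 * q_sym

# Asymmetric adjustment: separate lower/upper quantiles of the signed residuals.
r_cal = y_cal - pred_cal
q_lo, q_hi = np.quantile(r_cal, alpha / 2), np.quantile(r_cal, 1 - alpha / 2)
len_asym = q_hi - q_lo

r_test = y_test - pred_test
cov_sym = np.mean(np.abs(r_test) <= q_sym)
cov_asym = np.mean((r_test >= q_lo) & (r_test <= q_hi))
print(f"symmetric : length {len_sym:.2f}, coverage {cov_sym:.3f}")
print(f"asymmetric: length {len_asym:.2f}, coverage {cov_asym:.3f}")
```

With b = 3 the symmetric length inflates by roughly 2|b|, while the asymmetric length and coverage remain essentially what they would be without the bias, consistent with the results stated next.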
Specifically for absolute residual and quantile-based non-conformity scores, we prove: 1) the upper bound of symmetrically adjusted interval lengths increases by $2|b|$ where $b$ is a globally applied scalar value representing bias, 2) asymmetrically adjusted interval lengths are not affected by bias, and 3) conditions when asymmetrically adjusted interval lengths are guaranteed to be smaller than symmetric ones. Our analyses suggest that even if predictions exhibit significant drift from ground truth values, asymmetrically adjusted intervals are still able to maintain the same tightness and validity of intervals as if the drift had never happened, while symmetric ones significantly inflate the lengths. We demonstrate our theoretical results with two real-world prediction tasks: sparse-view computed tomography (CT) reconstruction and time-series weather forecasting. Our work paves the way for more bias-robust machine learning systems."}, "https://arxiv.org/abs/2302.01861": {"title": "Covariance Matrix Estimation for High-Throughput Biomedical Data with Interconnected Communities", "link": "https://arxiv.org/abs/2302.01861", "description": "arXiv:2302.01861v3 Announce Type: replace \nAbstract: Estimating a covariance matrix is central to high-dimensional data analysis. Empirical analyses of high-dimensional biomedical data, including genomics, proteomics, microbiome, and neuroimaging, among others, consistently reveal strong modularity in the dependence patterns. In these analyses, intercorrelated high-dimensional biomedical features often form communities or modules that can be interconnected with others. While the interconnected community structure has been extensively studied in biomedical research (e.g., gene co-expression networks), its potential to assist in the estimation of covariance matrices remains largely unexplored. To address this gap, we propose a procedure that leverages the commonly observed interconnected community structure in high-dimensional biomedical data to estimate large covariance and precision matrices. We derive the uniformly minimum-variance unbiased estimators for covariance and precision matrices in closed forms and provide theoretical results on their asymptotic properties. Our proposed method enhances the accuracy of covariance- and precision-matrix estimation and demonstrates superior performance compared to the competing methods in both simulations and real data analyses."}, "https://arxiv.org/abs/2305.14506": {"title": "Confidence Sets for Causal Orderings", "link": "https://arxiv.org/abs/2305.14506", "description": "arXiv:2305.14506v2 Announce Type: replace \nAbstract: Causal discovery procedures aim to deduce causal relationships among variables in a multivariate dataset. While various methods have been proposed for estimating a single causal model or a single equivalence class of models, less attention has been given to quantifying uncertainty in causal discovery in terms of confidence statements. A primary challenge in causal discovery of directed acyclic graphs is determining a causal ordering among the variables, and our work offers a framework for constructing confidence sets of causal orderings that the data do not rule out. Our methodology specifically applies to identifiable structural equation models with additive errors and is based on a residual bootstrap procedure to test the goodness-of-fit of causal orderings. 
We demonstrate the asymptotic validity of the confidence set constructed using this goodness-of-fit test and explain how the confidence set may be used to form sub/supersets of ancestral relationships as well as confidence intervals for causal effects that incorporate model uncertainty."}, "https://arxiv.org/abs/2306.14851": {"title": "Stability-Adjusted Cross-Validation for Sparse Linear Regression", "link": "https://arxiv.org/abs/2306.14851", "description": "arXiv:2306.14851v2 Announce Type: replace-cross \nAbstract: Given a high-dimensional covariate matrix and a response vector, ridge-regularized sparse linear regression selects a subset of features that explains the relationship between covariates and the response in an interpretable manner. To select the sparsity and robustness of linear regressors, techniques like k-fold cross-validation are commonly used for hyperparameter tuning. However, cross-validation substantially increases the computational cost of sparse regression as it requires solving many mixed-integer optimization problems (MIOs). Additionally, validation metrics often serve as noisy estimators of test set errors, with different hyperparameter combinations leading to models with different noise levels. Therefore, optimizing over these metrics is vulnerable to out-of-sample disappointment, especially in underdetermined settings. To improve upon this state of affairs, we make two key contributions. First, motivated by the generalization theory literature, we propose selecting hyperparameters that minimize a weighted sum of a cross-validation metric and a model's output stability, thus reducing the risk of poor out-of-sample performance. Second, we leverage ideas from the mixed-integer optimization literature to obtain computationally tractable relaxations of k-fold cross-validation metrics and the output stability of regressors, facilitating hyperparameter selection after solving fewer MIOs. These relaxations result in an efficient cyclic coordinate descent scheme, achieving lower validation errors than via traditional methods such as grid search. On synthetic datasets, our confidence adjustment procedure improves out-of-sample performance by 2%-5% compared to minimizing the k-fold error alone. On 13 real-world datasets, our confidence adjustment procedure reduces test set error by 2%, on average."}, "https://arxiv.org/abs/2410.05512": {"title": "A New Method for Multinomial Inference using Dempster-Shafer Theory", "link": "https://arxiv.org/abs/2410.05512", "description": "arXiv:2410.05512v1 Announce Type: new \nAbstract: A new method for multinomial inference is proposed by representing the cell probabilities as unordered segments on the unit interval and following Dempster-Shafer (DS) theory. The resulting DS posterior is then strengthened to improve symmetry and learning properties with the final posterior model being characterized by a Dirichlet distribution. In addition to computational simplicity, the new model has desirable invariance properties related to category permutations, refinements, and coarsenings. Furthermore, posterior inference on relative probabilities amongst certain cells depends only on data for the cells in question. Finally, the model is quite flexible with regard to parameterization and the range of testable assertions. 
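Because the strengthened posterior is characterized by a Dirichlet distribution, routine inference can be sketched with ordinary Dirichlet posterior sampling; the counts and the uniform prior weights below are placeholders rather than the paper's recommended specification.

```python
import numpy as np

rng = np.random.default_rng(2)
counts = np.array([12, 7, 30, 1])      # observed multinomial cell counts
prior = np.ones(len(counts))           # placeholder prior weights

# Posterior characterized by a Dirichlet distribution.
draws = rng.dirichlet(prior + counts, size=20000)
print("posterior means:", np.round(draws.mean(axis=0), 3))

# Relative probability between cells 0 and 2 depends only on those cells:
# p0 / (p0 + p2) has a Beta(prior0 + n0, prior2 + n2) posterior.
rel = draws[:, 0] / (draws[:, 0] + draws[:, 2])
print("P(cell 0 | cell 0 or 2):", np.round(rel.mean(), 3))
```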
Comparisons are made to existing methods and illustrated with two examples."}, "https://arxiv.org/abs/2410.05594": {"title": "Comparing HIV Vaccine Immunogenicity across Trials with Different Populations and Study Designs", "link": "https://arxiv.org/abs/2410.05594", "description": "arXiv:2410.05594v1 Announce Type: new \nAbstract: Safe and effective preventive vaccines have the potential to help stem the HIV epidemic. The efficacy of such vaccines is typically measured in randomized, double-blind phase IIb/III trials and described as a reduction in newly acquired HIV infections. However, such trials are often expensive, time-consuming, and/or logistically challenging. These challenges lead to a great interest in immune responses induced by vaccination, and in identifying which immune responses predict vaccine efficacy. These responses are termed vaccine correlates of protection. Studies of vaccine-induced immunogenicity vary in size and design, ranging from small, early phase trials, to case-control studies nested in a broader late-phase randomized trial. Moreover, trials can be conducted in geographically diverse study populations across the world. Such diversity presents a challenge for objectively comparing vaccine-induced immunogenicity. To address these practical challenges, we propose a framework that is capable of identifying appropriate causal estimands and estimators, which can be used to provide standardized comparisons of vaccine-induced immunogenicity across trials. We evaluate the performance of the proposed estimands via extensive simulation studies. Our estimators are well-behaved and enjoy robustness properties. The proposed technique is applied to compare vaccine immunogenicity using data from three recent HIV vaccine trials."}, "https://arxiv.org/abs/2410.05630": {"title": "Navigating Inflation in Ghana: How Can Machine Learning Enhance Economic Stability and Growth Strategies", "link": "https://arxiv.org/abs/2410.05630", "description": "arXiv:2410.05630v1 Announce Type: new \nAbstract: Inflation remains a persistent challenge for many African countries. This research investigates the critical role of machine learning (ML) in understanding and managing inflation in Ghana, emphasizing its significance for the country's economic stability and growth. Utilizing a comprehensive dataset spanning from 2010 to 2022, the study aims to employ advanced ML models, particularly those adept in time series forecasting, to predict future inflation trends. The methodology is designed to provide accurate and reliable inflation forecasts, offering valuable insights for policymakers and advocating for a shift towards data-driven approaches in economic decision-making. This study aims to significantly advance the academic field of economic analysis by applying machine learning (ML) and offering practical guidance for integrating advanced technological tools into economic governance, ultimately demonstrating ML's potential to enhance Ghana's economic resilience and support sustainable development through effective inflation management."}, "https://arxiv.org/abs/2410.05634": {"title": "Identification and estimation for matrix time series CP-factor models", "link": "https://arxiv.org/abs/2410.05634", "description": "arXiv:2410.05634v1 Announce Type: new \nAbstract: We investigate the identification and the estimation for matrix time series CP-factor models. Unlike the generalized eigenanalysis-based method of Chang et al. 
(2023) which requires the two factor loading matrices to be full-ranked, the newly proposed estimation can handle rank-deficient factor loading matrices. The estimation procedure consists of the spectral decomposition of several matrices and a matrix joint diagonalization algorithm, resulting in low computational cost. The theoretical guarantee established without the stationarity assumption shows that the proposed estimation exhibits a faster convergence rate than that of Chang et al. (2023). In fact the new estimator is free from the adverse impact of any eigen-gaps, unlike most eigenanalysis-based methods such as that of Chang et al. (2023). Furthermore, in terms of the error rates of the estimation, the proposed procedure is equivalent to handling a vector time series of dimension $\\max(p,q)$ instead of $p \\times q$, where $(p, q)$ are the dimensions of the matrix time series concerned. We have achieved this without assuming the \"near orthogonality\" of the loadings under various incoherence conditions often imposed in the CP-decomposition literature, see Han and Zhang (2022), Han et al. (2024) and the references within. Illustration with both simulated and real matrix time series data shows the usefulness of the proposed approach."}, "https://arxiv.org/abs/2410.05741": {"title": "The Transmission of Monetary Policy via Common Cycles in the Euro Area", "link": "https://arxiv.org/abs/2410.05741", "description": "arXiv:2410.05741v1 Announce Type: new \nAbstract: We use a FAVAR model with proxy variables and sign restrictions to investigate the role of the euro area common output and inflation cycles in the transmission of monetary policy shocks. We find that common cycles explain most of the variation in output and inflation across member countries, while Southern European economies show larger deviations from the cycles in the aftermath of the financial crisis. Building on this evidence, we show that monetary policy is homogeneously propagated to member countries via the common cycles. In contrast, country-specific transmission channels lead to heterogeneous country responses to monetary policy shocks. Consequently, our empirical results suggest that the divergent effects of ECB monetary policy are due to heterogeneous country-specific exposures to financial markets and not due to dis-synchronized economies of the euro area."}, "https://arxiv.org/abs/2410.05858": {"title": "Detecting dependence structure: visualization and inference", "link": "https://arxiv.org/abs/2410.05858", "description": "arXiv:2410.05858v1 Announce Type: new \nAbstract: Identifying dependency between two random variables is a fundamental problem. Clear interpretability and ability of a procedure to provide information on the form of the possible dependence is particularly important in evaluating dependencies. We introduce a new estimator of the quantile dependence function and pertinent local acceptance regions. This leads to insightful visualization and evaluation of underlying dependence structure. We also propose a test of independence of two random variables, pertinent to this new estimator. Our procedures are based on ranks and we derive a finite-sample theory that guarantees the inferential validity of our solutions at any given sample size. The procedures are simple to implement and computationally efficient. Large-sample consistency of the proposed test is also proved. 
We show that, in terms of power, new test is one of the best statistics for independence testing when considering a wide range of alternative models. Finally, we demonstrate use of our approach to visualize dependence structure and to detect local departures from independence through analyzing some datasets."}, "https://arxiv.org/abs/2410.05861": {"title": "Persistence-Robust Break Detection in Predictive Quantile and CoVaR Regressions", "link": "https://arxiv.org/abs/2410.05861", "description": "arXiv:2410.05861v1 Announce Type: new \nAbstract: Forecasting risk (as measured by quantiles) and systemic risk (as measured by Adrian and Brunnermeiers's (2016) CoVaR) is important in economics and finance. However, past research has shown that predictive relationships may be unstable over time. Therefore, this paper develops structural break tests in predictive quantile and CoVaR regressions. These tests can detect changes in the forecasting power of covariates, and are based on the principle of self-normalization. We show that our tests are valid irrespective of whether the predictors are stationary or near-stationary, rendering the tests suitable for a range of practical applications. Simulations illustrate the good finite-sample properties of our tests. Two empirical applications concerning equity premium and systemic risk forecasting models show the usefulness of the tests."}, "https://arxiv.org/abs/2410.05893": {"title": "Model Uncertainty and Missing Data: An Objective Bayesian Perspective", "link": "https://arxiv.org/abs/2410.05893", "description": "arXiv:2410.05893v1 Announce Type: new \nAbstract: The interplay between missing data and model uncertainty -- two classic statistical problems -- leads to primary questions that we formally address from an objective Bayesian perspective. For the general regression problem, we discuss the probabilistic justification of Rubin's rules applied to the usual components of Bayesian variable selection, arguing that prior predictive marginals should be central to the pursued methodology. In the regression settings, we explore the conditions of prior distributions that make the missing data mechanism ignorable. Moreover, when comparing multiple linear models, we provide a complete methodology for dealing with special cases, such as variable selection or uncertainty regarding model errors. In numerous simulation experiments, we demonstrate that our method outperforms or equals others, in consistently producing results close to those obtained using the full dataset. In general, the difference increases with the percentage of missing data and the correlation between the variables used for imputation. Finally, we summarize possible directions for future research."}, "https://arxiv.org/abs/2410.06125": {"title": "Dynamic graphical models: Theory, structure and counterfactual forecasting", "link": "https://arxiv.org/abs/2410.06125", "description": "arXiv:2410.06125v1 Announce Type: new \nAbstract: Simultaneous graphical dynamic linear models (SGDLMs) provide advances in flexibility, parsimony and scalability of multivariate time series analysis, with proven utility in forecasting. Core theoretical aspects of such models are developed, including new results linking dynamic graphical and latent factor models. Methodological developments extend existing Bayesian sequential analyses for model marginal likelihood evaluation and counterfactual forecasting. 
The latter, involving new Bayesian computational developments for missing data in SGDLMs, is motivated by causal applications. A detailed example illustrating the models and new methodology concerns global macroeconomic time series with complex, time-varying cross-series relationships and primary interests in potential causal effects."}, "https://arxiv.org/abs/2410.06281": {"title": "Sequential Design with Derived Win Statistics", "link": "https://arxiv.org/abs/2410.06281", "description": "arXiv:2410.06281v1 Announce Type: new \nAbstract: The Win Ratio has gained significant traction in cardiovascular trials as a novel method for analyzing composite endpoints (Pocock and others, 2012). Compared with conventional approaches based on time to the first event, the Win Ratio accommodates the varying priorities and types of outcomes among components, potentially offering greater statistical power by fully utilizing the information contained within each outcome. However, studies using the Win Ratio have largely been confined to fixed designs, limiting flexibility for early decisions, such as stopping for futility or efficacy. Our study proposes a sequential design framework incorporating multiple interim analyses based on Win Ratio or Net Benefit statistics. Moreover, we provide rigorous proof of the canonical joint distribution for sequential Win Ratio and Net Benefit statistics, and an algorithm for sample size determination is developed. We also provide results from a finite sample simulation study, which show that our proposed method controls the Type I error, maintains the power level, and has a smaller average sample size than the fixed design. A real cardiovascular study is used to illustrate the proposed method."}, "https://arxiv.org/abs/2410.06326": {"title": "A convex formulation of covariate-adjusted Gaussian graphical models via natural parametrization", "link": "https://arxiv.org/abs/2410.06326", "description": "arXiv:2410.06326v1 Announce Type: new \nAbstract: Gaussian graphical models (GGMs) are widely used for recovering the conditional independence structure among random variables. Recently, several key advances have been made to exploit an additional set of variables for better estimating the GGMs of the variables of interest. For example, in co-expression quantitative trait locus (eQTL) studies, both the mean expression level of genes as well as their pairwise conditional independence structure may be adjusted by genetic variants local to those genes. Existing methods to estimate covariate-adjusted GGMs either allow only the mean to depend on covariates or suffer from poor scaling assumptions due to the inherent non-convexity of simultaneously estimating the mean and precision matrix. In this paper, we propose a convex formulation that jointly estimates the covariate-adjusted mean and precision matrix by utilizing the natural parametrization of the multivariate Gaussian likelihood. This convexity yields theoretically better performance as the sparsity and dimension of the covariates grow large relative to the number of samples. 
We verify our theoretical results with numerical simulations and perform a reanalysis of an eQTL study of glioblastoma multiforme (GBM), an aggressive form of brain cancer."}, "https://arxiv.org/abs/2410.06377": {"title": "A Non-parametric Direct Learning Approach to Heterogeneous Treatment Effect Estimation under Unmeasured Confounding", "link": "https://arxiv.org/abs/2410.06377", "description": "arXiv:2410.06377v1 Announce Type: new \nAbstract: In many social, behavioral, and biomedical sciences, treatment effect estimation is a crucial step in understanding the impact of an intervention, policy, or treatment. In recent years, an increasing emphasis has been placed on heterogeneity in treatment effects, leading to the development of various methods for estimating Conditional Average Treatment Effects (CATE). These approaches hinge on a crucial identifying condition of no unmeasured confounding, an assumption that is not always guaranteed in observational studies or randomized control trials with non-compliance. In this paper, we proposed a general framework for estimating CATE with a possible unmeasured confounder using Instrumental Variables. We also construct estimators that exhibit greater efficiency and robustness against various scenarios of model misspecification. The efficacy of the proposed framework is demonstrated through simulation studies and a real data example."}, "https://arxiv.org/abs/2410.06394": {"title": "Nested Compound Random Measures", "link": "https://arxiv.org/abs/2410.06394", "description": "arXiv:2410.06394v1 Announce Type: new \nAbstract: Nested nonparametric processes are vectors of random probability measures widely used in the Bayesian literature to model the dependence across distinct, though related, groups of observations. These processes allow a two-level clustering, both at the observational and group levels. Several alternatives have been proposed starting from the nested Dirichlet process by Rodr\\'iguez et al. (2008). However, most of the available models are neither computationally efficient or mathematically tractable. In the present paper, we aim to introduce a range of nested processes that are mathematically tractable, flexible, and computationally efficient. Our proposal builds upon Compound Random Measures, which are vectors of dependent random measures early introduced by Griffin and Leisen (2017). We provide a complete investigation of theoretical properties of our model. In particular, we prove a general posterior characterization for vectors of Compound Random Measures, which is interesting per se and still not available in the current literature. Based on our theoretical results and the available posterior representation, we develop the first Ferguson & Klass algorithm for nested nonparametric processes. We specialize our general theorems and algorithms in noteworthy examples. We finally test the model's performance on different simulated scenarios, and we exploit the construction to study air pollution in different provinces of an Italian region (Lombardy). 
We empirically show how nested processes based on Compound Random Measures outperform other Bayesian competitors."}, "https://arxiv.org/abs/2410.06451": {"title": "False Discovery Rate Control via Data Splitting for Testing-after-Clustering", "link": "https://arxiv.org/abs/2410.06451", "description": "arXiv:2410.06451v1 Announce Type: new \nAbstract: Testing for differences in features between clusters in various applications often leads to inflated false positives when practitioners use the same dataset to identify clusters and then test features, an issue commonly known as ``double dipping''. To address this challenge, inspired by data-splitting strategies for controlling the false discovery rate (FDR) in regressions \\parencite{daiFalseDiscoveryRate2023}, we present a novel method that applies data-splitting to control FDR while maintaining high power in unsupervised clustering. We first divide the dataset into two halves, then apply the conventional testing-after-clustering procedure to each half separately and combine the resulting test statistics to form a new statistic for each feature. The new statistic can help control the FDR due to its property of having a sampling distribution that is symmetric around zero for any null feature. To further enhance stability and power, we suggest multiple data splitting, which involves repeatedly splitting the data and combining results. Our proposed data-splitting methods are mathematically proven to asymptotically control FDR in Gaussian settings. Through extensive simulations and analyses of single-cell RNA sequencing (scRNA-seq) datasets, we demonstrate that the data-splitting methods are easy to implement, adaptable to existing single-cell data analysis pipelines, and often outperform other approaches when dealing with weak signals and high correlations among features."}, "https://arxiv.org/abs/2410.06484": {"title": "Model-assisted and Knowledge-guided Transfer Regression for the Underrepresented Population", "link": "https://arxiv.org/abs/2410.06484", "description": "arXiv:2410.06484v1 Announce Type: new \nAbstract: Covariate shift and outcome model heterogeneity are two prominent challenges in leveraging external sources to improve risk modeling for underrepresented cohorts in paucity of accurate labels. We consider the transfer learning problem targeting some unlabeled minority sample encountering (i) covariate shift to the labeled source sample collected on a different cohort; and (ii) outcome model heterogeneity with some majority sample informative to the targeted minority model. In this scenario, we develop a novel model-assisted and knowledge-guided transfer learning targeting underrepresented population (MAKEUP) approach for high-dimensional regression models. Our MAKEUP approach includes a model-assisted debiasing step in response to the covariate shift, accompanied by a knowledge-guided sparsifying procedure leveraging the majority data to enhance learning on the minority group. We also develop a model selection method to avoid negative knowledge transfer that can work in the absence of gold standard labels on the target sample. Theoretical analyses show that MAKEUP provides efficient estimation for the target model on the minority group. It maintains robustness to the high complexity and misspecification of the nuisance models used for covariate shift correction, as well as adaptivity to the model heterogeneity and potential negative transfer between the majority and minority groups. 
Numerical studies demonstrate similar advantages in finite sample settings over existing approaches. We also illustrate our approach through a real-world application about the transfer learning of Type II diabetes genetic risk models on some underrepresented ancestry group."}, "https://arxiv.org/abs/2410.06564": {"title": "Green bubbles: a four-stage paradigm for detection and propagation", "link": "https://arxiv.org/abs/2410.06564", "description": "arXiv:2410.06564v1 Announce Type: new \nAbstract: Climate change has emerged as a significant global concern, attracting increasing attention worldwide. While green bubbles may be examined through a social bubble hypothesis, it is essential not to neglect a Climate Minsky moment triggered by sudden asset price changes. The significant increase in green investments highlights the urgent need for a comprehensive understanding of these market dynamics. Therefore, the current paper introduces a novel paradigm for studying such phenomena. Focusing on the renewable energy sector, Statistical Process Control (SPC) methodologies are employed to identify green bubbles within time series data. Furthermore, search volume indexes and social factors are incorporated into established econometric models to reveal potential implications for the financial system. Inspired by Joseph Schumpeter's perspectives on business cycles, this study recognizes green bubbles as a necessary evil for facilitating a successful transition towards a more sustainable future."}, "https://arxiv.org/abs/2410.06580": {"title": "When Does Interference Matter? Decision-Making in Platform Experiments", "link": "https://arxiv.org/abs/2410.06580", "description": "arXiv:2410.06580v1 Announce Type: new \nAbstract: This paper investigates decision-making in A/B experiments for online platforms and marketplaces. In such settings, due to constraints on inventory, A/B experiments typically lead to biased estimators because of interference; this phenomenon has been well studied in recent literature. By contrast, there has been relatively little discussion of the impact of interference on decision-making. In this paper, we analyze a benchmark Markovian model of an inventory-constrained platform, where arriving customers book listings that are limited in supply; our analysis builds on a self-contained analysis of general A/B experiments for Markov chains. We focus on the commonly used frequentist hypothesis testing approach for making launch decisions based on data from customer-randomized experiments, and we study the impact of interference on (1) false positive probability and (2) statistical power.\n We obtain three main findings. First, we show that for {\\em monotone} treatments -- i.e., those where the treatment changes booking probabilities in the same direction relative to control in all states -- the false positive probability of the na\\\"ive difference-in-means estimator with classical variance estimation is correctly controlled. This result stems from a novel analysis of A/A experiments with arbitrary dependence structures, which may be of independent interest. Second, we demonstrate that for monotone treatments, the statistical power of this na\\\"ive approach is higher than that of any similar pipeline using a debiased estimator. Taken together, these two findings suggest that platforms may be better off *not* debiasing when treatments are monotone. 
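For reference, the "naive" pipeline discussed above amounts to a two-sample difference-in-means z-test with the classical, independence-based variance estimate; a minimal sketch with synthetic customer-level booking indicators standing in for experiment data follows.

```python
import numpy as np
from scipy.stats import norm

def naive_diff_in_means_test(y_treat, y_ctrl, alpha=0.05):
    """Two-sample z-test with the classical variance estimate (no debiasing)."""
    diff = y_treat.mean() - y_ctrl.mean()
    se = np.sqrt(y_treat.var(ddof=1) / len(y_treat) +
                 y_ctrl.var(ddof=1) / len(y_ctrl))
    z = diff / se
    p_value = 2 * norm.sf(abs(z))
    return diff, z, p_value, p_value < alpha

rng = np.random.default_rng(3)
# Synthetic customer-level booking indicators (placeholder for experiment data).
y_ctrl = rng.binomial(1, 0.20, size=50_000)
y_treat = rng.binomial(1, 0.21, size=50_000)
diff, z, p, launch = naive_diff_in_means_test(y_treat, y_ctrl)
print(f"estimated lift {diff:.4f}, z = {z:.2f}, p = {p:.4f}, launch = {launch}")
```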
Finally, using simulations, we investigate false positive probability and statistical power when treatments are non-monotone, and we show that the performance of the na\\\"ive approach can be arbitrarily worse than a debiased approach in such cases."}, "https://arxiv.org/abs/2410.06726": {"title": "Sharp Bounds of the Causal Effect Under MNAR Confounding", "link": "https://arxiv.org/abs/2410.06726", "description": "arXiv:2410.06726v1 Announce Type: new \nAbstract: We report bounds for any contrast between the probabilities of the counterfactual outcome under exposure and non-exposure when the confounders are missing not at random. We assume that the missingness mechanism is outcome-independent, and prove that our bounds are arbitrarily sharp, i.e., practically attainable or logically possible."}, "https://arxiv.org/abs/2410.06875": {"title": "Group Shapley Value and Counterfactual Simulations in a Structural Model", "link": "https://arxiv.org/abs/2410.06875", "description": "arXiv:2410.06875v1 Announce Type: new \nAbstract: We propose a variant of the Shapley value, the group Shapley value, to interpret counterfactual simulations in structural economic models by quantifying the importance of different components. Our framework compares two sets of parameters, partitioned into multiple groups, and applying group Shapley value decomposition yields unique additive contributions to the changes between these sets. The relative contributions sum to one, enabling us to generate an importance table that is as easily interpretable as a regression table. The group Shapley value can be characterized as the solution to a constrained weighted least squares problem. Using this property, we develop robust decomposition methods to address scenarios where inputs for the group Shapley value are missing. We first apply our methodology to a simple Roy model and then illustrate its usefulness by revisiting two published papers."}, "https://arxiv.org/abs/2410.06939": {"title": "Direct Estimation for Commonly Used Pattern-Mixture Models in Clinical Trials", "link": "https://arxiv.org/abs/2410.06939", "description": "arXiv:2410.06939v1 Announce Type: new \nAbstract: Pattern-mixture models have received increasing attention as they are commonly used to assess treatment effects in primary or sensitivity analyses for clinical trials with nonignorable missing data. Pattern-mixture models have traditionally been implemented using multiple imputation, where the variance estimation may be a challenge because the Rubin's approach of combining between- and within-imputation variance may not provide consistent variance estimation while bootstrap methods may be time-consuming. Direct likelihood-based approaches have been proposed in the literature and implemented for some pattern-mixture models, but the assumptions are sometimes restrictive, and the theoretical framework is fragile. In this article, we propose an analytical framework for an efficient direct likelihood estimation method for commonly used pattern-mixture models corresponding to return-to-baseline, jump-to-reference, placebo washout, and retrieved dropout imputations. A parsimonious tipping point analysis is also discussed and implemented. Results from simulation studies demonstrate that the proposed methods provide consistent estimators. 
We further illustrate the utility of the proposed methods using data from a clinical trial evaluating a treatment for type 2 diabetes."}, "https://arxiv.org/abs/2410.07091": {"title": "Collusion Detection with Graph Neural Networks", "link": "https://arxiv.org/abs/2410.07091", "description": "arXiv:2410.07091v1 Announce Type: new \nAbstract: Collusion is a complex phenomenon in which companies secretly collaborate to engage in fraudulent practices. This paper presents an innovative methodology for detecting and predicting collusion patterns in different national markets using neural networks (NNs) and graph neural networks (GNNs). GNNs are particularly well suited to this task because they can exploit the inherent network structures present in collusion and many other economic problems. Our approach consists of two phases: In Phase I, we develop and train models on individual market datasets from Japan, the United States, two regions in Switzerland, Italy, and Brazil, focusing on predicting collusion in single markets. In Phase II, we extend the models' applicability through zero-shot learning, employing a transfer learning approach that can detect collusion in markets in which training data is unavailable. This phase also incorporates out-of-distribution (OOD) generalization to evaluate the models' performance on unseen datasets from other countries and regions. In our empirical study, we show that GNNs outperform NNs in detecting complex collusive patterns. This research contributes to the ongoing discourse on preventing collusion and optimizing detection methodologies, providing valuable guidance on the use of NNs and GNNs in economic applications to enhance market fairness and economic welfare."}, "https://arxiv.org/abs/2410.07161": {"title": "Spatiotemporal Modeling and Forecasting at Scale with Dynamic Generalized Linear Models", "link": "https://arxiv.org/abs/2410.07161", "description": "arXiv:2410.07161v1 Announce Type: new \nAbstract: Spatiotemporal data consisting of timestamps, GPS coordinates, and IDs occurs in many settings. Modeling approaches for this type of data must address challenges in terms of sensor noise, uneven sampling rates, and non-persistent IDs. In this work, we characterize and forecast human mobility at scale with dynamic generalized linear models (DGLMs). We represent mobility data as occupancy counts of spatial cells over time and use DGLMs to model the occupancy counts for each spatial cell in an area of interest. DGLMs are flexible to varying numbers of occupancy counts across spatial cells, are dynamic, and easily incorporate daily and weekly seasonality in the aggregate-level behavior. Our overall approach is robust to various types of noise and scales linearly in the number of spatial cells, time bins, and agents. Our results show that DGLMs provide accurate occupancy count forecasts over a variety of spatial resolutions and forecast horizons. We also present scaling results for spatiotemporal data consisting of hundreds of millions of observations. 
Our approach is flexible to support several downstream applications, including characterizing human mobility, forecasting occupancy counts, and anomaly detection for aggregate-level behaviors."}, "https://arxiv.org/abs/2410.05419": {"title": "Refining Counterfactual Explanations With Joint-Distribution-Informed Shapley Towards Actionable Minimality", "link": "https://arxiv.org/abs/2410.05419", "description": "arXiv:2410.05419v1 Announce Type: cross \nAbstract: Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanations that are presented to users and stakeholders. We address this problem by proposing a method that minimizes the required feature changes while maintaining the validity of CE, without imposing restrictions on models or CE algorithms, whether instance- or group-based. The key innovation lies in computing a joint distribution between observed and counterfactual data and leveraging it to inform Shapley values for feature attributions (FA). We demonstrate that optimal transport (OT) effectively derives this distribution, especially when the alignment between observed and counterfactual data is unclear in used CE methods. Additionally, a counterintuitive finding is uncovered: it may be misleading to rely on an exact alignment defined by the CE generation mechanism in conducting FA. Our proposed method is validated on extensive experiments across multiple datasets, showcasing its effectiveness in refining CE towards greater actionable efficiency."}, "https://arxiv.org/abs/2410.05444": {"title": "Online scalable Gaussian processes with conformal prediction for guaranteed coverage", "link": "https://arxiv.org/abs/2410.05444", "description": "arXiv:2410.05444v1 Announce Type: cross \nAbstract: The Gaussian process (GP) is a Bayesian nonparametric paradigm that is widely adopted for uncertainty quantification (UQ) in a number of safety-critical applications, including robotics, healthcare, as well as surveillance. The consistency of the resulting uncertainty values however, hinges on the premise that the learning function conforms to the properties specified by the GP model, such as smoothness, periodicity and more, which may not be satisfied in practice, especially with data arriving on the fly. To combat against such model mis-specification, we propose to wed the GP with the prevailing conformal prediction (CP), a distribution-free post-processing framework that produces it prediction sets with a provably valid coverage under the sole assumption of data exchangeability. However, this assumption is usually violated in the online setting, where a prediction set is sought before revealing the true label. To ensure long-term coverage guarantee, we will adaptively set the key threshold parameter based on the feedback whether the true label falls inside the prediction set. 
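One common way to implement such a feedback rule, in the spirit of adaptive conformal inference, is to nudge the working miscoverage level after every round according to whether the revealed label landed inside the set. The sketch below applies this idea to plain online quantile tracking of absolute residuals; it is not the GP-CP procedure itself, and the learning rate gamma and the placeholder point prediction are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
T, alpha, gamma = 2000, 0.1, 0.01
alpha_t = alpha
scores, errs = [], []

for t in range(T):
    y_hat = 0.0                          # placeholder point prediction
    y = rng.standard_t(df=3)             # true label revealed after prediction
    if len(scores) > 20:
        # Threshold = empirical (1 - alpha_t) quantile of past scores.
        q = np.quantile(scores, min(max(1 - alpha_t, 0.0), 1.0))
        err = float(abs(y - y_hat) > q)  # 1 if the label fell outside the set
        errs.append(err)
        # Feedback update: widen the set after a miss, tighten it after a hit.
        alpha_t += gamma * (alpha - err)
    scores.append(abs(y - y_hat))

print("long-run miscoverage:", np.round(np.mean(errs), 3), "target:", alpha)
```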
Numerical results demonstrate the merits of the online GP-CP approach relative to existing alternatives in the long-term coverage performance."}, "https://arxiv.org/abs/2410.05458": {"title": "Testing Credibility of Public and Private Surveys through the Lens of Regression", "link": "https://arxiv.org/abs/2410.05458", "description": "arXiv:2410.05458v1 Announce Type: cross \nAbstract: Testing whether a sample survey is a credible representation of the population is an important question to ensure the validity of any downstream research. While this problem, in general, does not have an efficient solution, one might take a task-based approach and aim to understand whether a certain data analysis tool, like linear regression, would yield similar answers both on the population and the sample survey. In this paper, we design an algorithm to test the credibility of a sample survey in terms of linear regression. In other words, we design an algorithm that can certify if a sample survey is good enough to guarantee the correctness of data analysis done using linear regression tools. Nowadays, one is naturally concerned about data privacy in surveys. Thus, we further test the credibility of surveys published in a differentially private manner. Specifically, we focus on Local Differential Privacy (LDP), which is a standard technique to ensure privacy in surveys where the survey participants might not trust the aggregator. We extend our algorithm to work even when the data analysis has been done using surveys with LDP. In the process, we also propose an algorithm that learns with high probability the guarantees a linear regression model on a survey published with LDP. Our algorithm also serves as a mechanism to learn linear regression models from data corrupted with noise coming from any subexponential distribution. We prove that it achieves the optimal estimation error bound for $\\ell_1$ linear regression, which might be of broader interest. We prove the theoretical correctness of our algorithms while trying to reduce the sample complexity for both public and private surveys. We also numerically demonstrate the performance of our algorithms on real and synthetic datasets."}, "https://arxiv.org/abs/2410.05484": {"title": "Neural Networks Decoded: Targeted and Robust Analysis of Neural Network Decisions via Causal Explanations and Reasoning", "link": "https://arxiv.org/abs/2410.05484", "description": "arXiv:2410.05484v1 Announce Type: cross \nAbstract: Despite their success and widespread adoption, the opaque nature of deep neural networks (DNNs) continues to hinder trust, especially in critical applications. Current interpretability solutions often yield inconsistent or oversimplified explanations, or require model changes that compromise performance. In this work, we introduce TRACER, a novel method grounded in causal inference theory designed to estimate the causal dynamics underpinning DNN decisions without altering their architecture or compromising their performance. Our approach systematically intervenes on input features to observe how specific changes propagate through the network, affecting internal activations and final outputs. Based on this analysis, we determine the importance of individual features, and construct a high-level causal map by grouping functionally similar layers into cohesive causal nodes, providing a structured and interpretable view of how different parts of the network influence the decisions. 
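As a rough, generic illustration of intervention-style attribution (perturb one input at a time and measure how much the output moves), rather than the TRACER estimator itself, consider the following sketch for an arbitrary callable model; the perturbation scheme and toy model are placeholders.

```python
import numpy as np

def intervention_importance(model, X, n_draws=20, rng=None):
    """Score each input feature by how much intervening on it moves the output.

    For every feature j, replace column j with values resampled from its own
    marginal and record the mean absolute change in the model's prediction.
    (A generic sketch of intervention-style attribution, not the TRACER method.)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    base = model(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_draws):
            X_int = X.copy()
            X_int[:, j] = rng.permutation(X[:, j])   # intervene on feature j only
            scores[j] += np.mean(np.abs(model(X_int) - base))
    return scores / n_draws

# Toy model in which only the first two features matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5))
model = lambda Z: np.tanh(2 * Z[:, 0] - Z[:, 1])
print(np.round(intervention_importance(model, X, rng=rng), 3))
```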
TRACER further enhances explainability by generating counterfactuals that reveal possible model biases and offer contrastive explanations for misclassifications. Through comprehensive evaluations across diverse datasets, we demonstrate TRACER's effectiveness over existing methods and show its potential for creating highly compressed yet accurate models, illustrating its dual versatility in both understanding and optimizing DNNs."}, "https://arxiv.org/abs/2410.05548": {"title": "Scalable Inference for Bayesian Multinomial Logistic-Normal Dynamic Linear Models", "link": "https://arxiv.org/abs/2410.05548", "description": "arXiv:2410.05548v1 Announce Type: cross \nAbstract: Many scientific fields collect longitudinal count compositional data. Each observation is a multivariate count vector, where the total counts are arbitrary, and the information lies in the relative frequency of the counts. Multiple authors have proposed Bayesian Multinomial Logistic-Normal Dynamic Linear Models (MLN-DLMs) as a flexible approach to modeling these data. However, adoption of these methods has been limited by computational challenges. This article develops an efficient and accurate approach to posterior state estimation, called $\\textit{Fenrir}$. Our approach relies on a novel algorithm for MAP estimation and an accurate approximation to a key posterior marginal of the model. As there are no equivalent methods against which we can compare, we also develop an optimized Stan implementation of MLN-DLMs. Our experiments suggest that Fenrir can be three orders of magnitude more efficient than Stan and can even be incorporated into larger sampling schemes for joint inference of model hyperparameters. Our methods are made available to the community as a user-friendly software library written in C++ with an R interface."}, "https://arxiv.org/abs/2410.05567": {"title": "With random regressors, least squares inference is robust to correlated errors with unknown correlation structure", "link": "https://arxiv.org/abs/2410.05567", "description": "arXiv:2410.05567v1 Announce Type: cross \nAbstract: Linear regression is arguably the most widely used statistical method. With fixed regressors and correlated errors, the conventional wisdom is to modify the variance-covariance estimator to accommodate the known correlation structure of the errors. We depart from the literature by showing that with random regressors, linear regression inference is robust to correlated errors with unknown correlation structure. The existing theoretical analyses for linear regression are no longer valid because even the asymptotic normality of the least-squares coefficients breaks down in this regime. We first prove the asymptotic normality of the t statistics by establishing their Berry-Esseen bounds based on a novel probabilistic analysis of self-normalized statistics. We then study the local power of the corresponding t tests and show that, perhaps surprisingly, error correlation can even enhance power in the regime of weak signals. 
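This claim is easy to probe by simulation: draw a random regressor independent of strongly correlated errors (an AR(1) process below, chosen arbitrarily) and record how often the conventional OLS t-test at the 5% level rejects a true null slope.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps, rho = 500, 2000, 0.8
crit = stats.t.ppf(0.975, df=n - 2)
rejections = 0

for _ in range(reps):
    x = rng.normal(size=n)      # random regressor, independent of the errors
    e = np.empty(n)             # AR(1) errors: correlated, structure unknown to the analyst
    e[0] = rng.normal() / np.sqrt(1 - rho**2)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    y = e                       # true slope is zero, so the null hypothesis holds
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])   # conventional iid-based standard error
    rejections += abs(beta[1] / se) > crit

print("empirical size of the nominal 5% t-test:", rejections / reps)
```

Under this design the empirical rejection rate should stay close to the nominal 5%, in line with the robustness claim.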
Overall, our results show that linear regression is applicable more broadly than the conventional theory suggests, and further demonstrate the value of randomization to ensure robustness of inference."}, "https://arxiv.org/abs/2410.05668": {"title": "Diversity and Inclusion Index with Networks and Similarity: Analysis and its Application", "link": "https://arxiv.org/abs/2410.05668", "description": "arXiv:2410.05668v1 Announce Type: cross \nAbstract: In recent years, the concepts of ``diversity'' and ``inclusion'' have attracted considerable attention across a range of fields, encompassing both social and biological disciplines. To fully understand these concepts, it is critical to not only examine the number of categories but also the similarities and relationships among them. In this study, I introduce a novel index for diversity and inclusion that considers similarities and network connections. I analyzed the properties of these indices and investigated their mathematical relationships using established measures of diversity and networks. Moreover, I developed a methodology for estimating similarities based on the utility of diversity. I also created a method for visualizing proportions, similarities, and network connections. Finally, I evaluated the correlation with external metrics using real-world data, confirming that both the proposed indices and our index can be effectively utilized. This study contributes to a more nuanced understanding of diversity and inclusion analysis."}, "https://arxiv.org/abs/2410.05753": {"title": "Pathwise Gradient Variance Reduction with Control Variates in Variational Inference", "link": "https://arxiv.org/abs/2410.05753", "description": "arXiv:2410.05753v1 Announce Type: cross \nAbstract: Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution. In these cases, pathwise and score-function gradient estimators are the most common approaches. The pathwise estimator is often favoured for its substantially lower variance compared to the score-function estimator, which typically requires variance reduction techniques. However, recent research suggests that even pathwise gradient estimators could benefit from variance reduction. In this work, we review existing control-variates-based variance reduction methods for pathwise gradient estimators to assess their effectiveness. Notably, these methods often rely on integrand approximations and are applicable only to simple variational families. To address this limitation, we propose applying zero-variance control variates to pathwise gradient estimators. This approach offers the advantage of requiring minimal assumptions about the variational distribution, other than being able to sample from it."}, "https://arxiv.org/abs/2410.05757": {"title": "Temperature Optimization for Bayesian Deep Learning", "link": "https://arxiv.org/abs/2410.05757", "description": "arXiv:2410.05757v1 Announce Type: cross \nAbstract: The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term `CPE' suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. 
In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily focuses on predictive performance of the PPD, the latter emphasizes calibrated uncertainty and robustness to model misspecification; these distinct objectives lead to different temperature preferences."}, "https://arxiv.org/abs/2410.05843": {"title": "A time warping model for seasonal data with application to age estimation from narwhal tusks", "link": "https://arxiv.org/abs/2410.05843", "description": "arXiv:2410.05843v1 Announce Type: cross \nAbstract: Signals with varying periodicity frequently appear in real-world phenomena, necessitating the development of efficient modelling techniques to map the measured nonlinear timeline to linear time. Here we propose a regression model that allows for a representation of periodic and dynamic patterns observed in time series data. The model incorporates a hidden strictly increasing stochastic process that represents the instantaneous frequency, allowing the model to adapt and accurately capture varying time scales. A case study focusing on age estimation of narwhal tusks is presented, where cyclic element signals associated with annual growth layer groups are analyzed. We apply the methodology to data from one such tusk collected in West Greenland and use the fitted model to estimate the age of the narwhal. The proposed method is validated using simulated signals with known cycle counts and practical considerations and modelling challenges are discussed in detail. This research contributes to the field of time series analysis, providing a tool and valuable insights for understanding and modeling complex cyclic patterns in diverse domains."}, "https://arxiv.org/abs/2410.06163": {"title": "Likelihood-based Differentiable Structure Learning", "link": "https://arxiv.org/abs/2410.06163", "description": "arXiv:2410.06163v1 Announce Type: cross \nAbstract: Existing approaches to differentiable structure learning of directed acyclic graphs (DAGs) rely on strong identifiability assumptions in order to guarantee that global minimizers of the acyclicity-constrained optimization problem identifies the true DAG. Moreover, it has been observed empirically that the optimizer may exploit undesirable artifacts in the loss function. We explain and remedy these issues by studying the behavior of differentiable acyclicity-constrained programs under general likelihoods with multiple global minimizers. By carefully regularizing the likelihood, it is possible to identify the sparsest model in the Markov equivalence class, even in the absence of an identifiable parametrization. We first study the Gaussian case in detail, showing how proper regularization of the likelihood defines a score that identifies the sparsest model. Assuming faithfulness, it also recovers the Markov equivalence class. These results are then generalized to general models and likelihoods, where the same claims hold. 
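For readers less familiar with this literature, the differentiable acyclicity constraint typically takes the NOTEARS form h(W) = tr(exp(W o W)) - d, where W o W is the elementwise square; h vanishes exactly when W is the weighted adjacency matrix of a DAG. The sketch below runs plain gradient descent on a least-squares score with l1 and acyclicity penalties; it is a generic illustration of this class of programs, not the regularized-likelihood procedure discussed above, and the penalty weights, threshold, and toy SEM are arbitrary.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """NOTEARS-style constraint h(W) = tr(exp(W*W)) - d; zero iff W is a DAG."""
    E = expm(W * W)                       # elementwise square inside the matrix exponential
    return np.trace(E) - W.shape[0], 2 * E.T * W

def structure_learn_sketch(X, lam=0.05, rho=5.0, lr=1e-2, n_iter=4000):
    """Plain gradient descent on a least-squares score with l1 and acyclicity penalties."""
    n, d = X.shape
    W = np.zeros((d, d))
    for _ in range(n_iter):
        grad_loss = -X.T @ (X - X @ W) / n
        _, grad_h = acyclicity(W)
        W -= lr * (grad_loss + rho * grad_h + lam * np.sign(W))
        np.fill_diagonal(W, 0.0)          # no self-loops
    W[np.abs(W) < 0.3] = 0.0              # crude thresholding of small weights
    return W

rng = np.random.default_rng(7)
n, d = 1000, 4
W_true = np.array([[0, 1.0, 0, 0],
                   [0, 0, 1.5, 0],
                   [0, 0, 0, -1.0],
                   [0, 0, 0, 0]])
X = rng.normal(size=(n, d))               # exogenous noise
for j in range(d):                        # generate a linear SEM in topological order
    X[:, j] += X @ W_true[:, j]
print(np.round(structure_learn_sketch(X), 2))
```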
These theoretical results are validated empirically, showing how this can be done using standard gradient-based optimizers, thus paving the way for differentiable structure learning under general models and losses."}, "https://arxiv.org/abs/2410.06349": {"title": "Robust Domain Generalisation with Causal Invariant Bayesian Neural Networks", "link": "https://arxiv.org/abs/2410.06349", "description": "arXiv:2410.06349v1 Announce Type: cross \nAbstract: Deep neural networks can obtain impressive performance on various tasks under the assumption that their training domain is identical to their target domain. Performance can drop dramatically when this assumption does not hold. One explanation for this discrepancy is the presence of spurious domain-specific correlations in the training data that the network exploits. Causal mechanisms, on the other hand, can be made invariant under distribution changes as they allow disentangling the factors of distribution underlying the data generation. Yet, learning causal mechanisms to improve out-of-distribution generalisation remains an under-explored area. We propose a Bayesian neural architecture that disentangles the learning of the data distribution from the inference process mechanisms. We show theoretically and experimentally that our model approximates reasoning under causal interventions. We demonstrate the performance of our method, outperforming point-estimate counterparts, on out-of-distribution image recognition tasks where the data distribution acts as strong adversarial confounders."}, "https://arxiv.org/abs/2410.06381": {"title": "Statistical Inference for Low-Rank Tensors: Heteroskedasticity, Subgaussianity, and Applications", "link": "https://arxiv.org/abs/2410.06381", "description": "arXiv:2410.06381v1 Announce Type: cross \nAbstract: In this paper, we consider inference and uncertainty quantification for low Tucker rank tensors with additive noise in the high-dimensional regime. Focusing on the output of the higher-order orthogonal iteration (HOOI) algorithm, a commonly used algorithm for tensor singular value decomposition, we establish non-asymptotic distributional theory and study how to construct confidence regions and intervals for both the estimated singular vectors and the tensor entries in the presence of heteroskedastic subgaussian noise, which are further shown to be optimal for homoskedastic Gaussian noise. Furthermore, as a byproduct of our theoretical results, we establish the entrywise convergence of HOOI when initialized via diagonal deletion. To further illustrate the utility of our theoretical results, we then consider several concrete statistical inference tasks. First, in the tensor mixed-membership blockmodel, we consider a two-sample test for equality of membership profiles, and we propose a test statistic with consistency under local alternatives that exhibits a power improvement relative to the corresponding matrix test considered in several previous works. Next, we consider simultaneous inference for small collections of entries of the tensor, and we obtain consistent confidence regions. Finally, focusing on the particular case of testing whether entries of the tensor are equal, we propose a consistent test statistic that shows how index overlap results in different asymptotic standard deviations. 
All of our proposed procedures are fully data-driven, adaptive to noise distribution and signal strength, and do not rely on sample-splitting, and our main results highlight the effect of higher-order structures on estimation relative to the matrix setting. Our theoretical results are demonstrated through numerical simulations."}, "https://arxiv.org/abs/2410.06407": {"title": "A Skewness-Based Criterion for Addressing Heteroscedastic Noise in Causal Discovery", "link": "https://arxiv.org/abs/2410.06407", "description": "arXiv:2410.06407v1 Announce Type: cross \nAbstract: Real-world data often violates the equal-variance assumption (homoscedasticity), making it essential to account for heteroscedastic noise in causal discovery. In this work, we explore heteroscedastic symmetric noise models (HSNMs), where the effect $Y$ is modeled as $Y = f(X) + \\sigma(X)N$, with $X$ as the cause and $N$ as independent noise following a symmetric distribution. We introduce a novel criterion for identifying HSNMs based on the skewness of the score (i.e., the gradient of the log density) of the data distribution. This criterion establishes a computationally tractable measurement that is zero in the causal direction but nonzero in the anticausal direction, enabling discovery of the causal direction. We extend this skewness-based criterion to the multivariate setting and propose SkewScore, an algorithm that handles heteroscedastic noise without requiring the extraction of exogenous noise. We also conduct a case study on the robustness of SkewScore in a bivariate model with a latent confounder, providing theoretical insights into its performance. Empirical studies further validate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2410.06774": {"title": "Retrieved dropout imputation considering administrative study withdrawal", "link": "https://arxiv.org/abs/2410.06774", "description": "arXiv:2410.06774v1 Announce Type: cross \nAbstract: The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E9 (R1) Addendum provides a framework for defining estimands in clinical trials. The treatment policy strategy is the most commonly used approach to handle intercurrent events in defining estimands. Imputing missing values for potential outcomes under the treatment policy strategy has been discussed in the literature. Missing values as a result of administrative study withdrawals (such as site closures due to business reasons, COVID-19 control measures, and geopolitical conflicts) are often imputed in the same way as other missing values occurring after intercurrent events related to safety or efficacy. Some research suggests using a hypothetical strategy to handle the treatment discontinuations due to administrative study withdrawal in defining the estimands and imputing the missing values based on completer data assuming missing at random, but this approach ignores the fact that subjects might experience other intercurrent events had they not had the administrative study withdrawal. In this article, we consider that administrative study withdrawal censors the normal, real-world-like intercurrent events and propose two methods for handling the corresponding missing values under the retrieved dropout imputation framework. Simulations show that the two methods perform well. 
We also applied the methods to actual clinical trial data evaluating an anti-diabetes treatment."}, "https://arxiv.org/abs/1906.09581": {"title": "Resistant convex clustering: How does the fusion penalty enhance resistance?", "link": "https://arxiv.org/abs/1906.09581", "description": "arXiv:1906.09581v3 Announce Type: replace \nAbstract: Convex clustering is a convex relaxation of $k$-means and hierarchical clustering. It involves solving a convex optimization problem with the objective function being a squared error loss plus a fusion penalty that encourages the estimated centroids for observations in the same cluster to be identical. However, when data are contaminated, convex clustering with a squared error loss fails even when there is only one arbitrary outlier. To address this challenge, we propose a resistant convex clustering method. Theoretically, we show that the new estimator is resistant to arbitrary outliers: it does not break down until more than half of the observations are arbitrary outliers. Perhaps surprisingly, the fusion penalty can help enhance resistance by fusing the estimators to the cluster centers of uncontaminated samples, but not the other way around. Numerical studies demonstrate the competitive performance of the proposed method."}, "https://arxiv.org/abs/2103.08035": {"title": "Testing for the Network Small-World Property", "link": "https://arxiv.org/abs/2103.08035", "description": "arXiv:2103.08035v2 Announce Type: replace \nAbstract: Researchers have long observed that the ``small-world\" property, which combines the concepts of high transitivity or clustering with a low average path length, is ubiquitous for networks obtained from a variety of disciplines, including social sciences, biology, neuroscience, and ecology. However, we find several shortcomings of the currently prevalent definition and detection methods rendering the concept less powerful. First, the widely used \\textit{small world coefficient} metric combines high transitivity with a low average path length in a single measure that confounds the two separate aspects. We find that the value of the metric is dominated by transitivity, and in several cases, networks get flagged as ``small world\" solely because of their high transitivity. Second, the detection methods lack formal statistical inference. Third, the comparison is typically performed against simplistic random graph models as the baseline, ignoring well-known network characteristics, which risks confounding the small world property with other network properties. We decouple the properties of high transitivity and low average path length as separate events to test for. Then we define the property as a statistical test between a suitable null hypothesis and a superimposed alternative hypothesis. We propose a parametric bootstrap test with several null hypothesis models to allow a wide range of background structures in the network. In addition to the bootstrap tests, we also propose an asymptotic test under the Erd\\H{o}s-R\\'{e}nyi null model for which we provide theoretical guarantees on the asymptotic level and power. Our theoretical results include asymptotic distributions of the clustering coefficient for various asymptotic growth rates on the probability of an edge. 
Applying the proposed methods to a large number of network datasets, we uncover new insights about their small-world property."}, "https://arxiv.org/abs/2209.01754": {"title": "Learning from a Biased Sample", "link": "https://arxiv.org/abs/2209.01754", "description": "arXiv:2209.01754v3 Announce Type: replace \nAbstract: The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction."}, "https://arxiv.org/abs/2301.07855": {"title": "Digital Divide: Empirical Study of CIUS 2020", "link": "https://arxiv.org/abs/2301.07855", "description": "arXiv:2301.07855v3 Announce Type: replace \nAbstract: As Canada and other major economies consider implementing \"digital money\" or Central Bank Digital Currencies, understanding how demographic and geographic factors influence public engagement with digital technologies becomes increasingly important. This paper uses data from the 2020 Canadian Internet Use Survey and employs survey-adapted Lasso inference methods to identify individual socio-economic and demographic characteristics determining the digital divide in Canada. We also introduce a score to measure and compare the digital literacy of various segments of Canadian population. Our findings reveal that disparities in the use of e.g. online banking, emailing, and digital payments exist across different demographic and socio-economic groups. In addition, we document the effects of COVID-19 pandemic on internet use in Canada and describe changes in the characteristics of Canadian internet users over the last decade."}, "https://arxiv.org/abs/2306.04315": {"title": "Inferring unknown unknowns: Regularized bias-aware ensemble Kalman filter", "link": "https://arxiv.org/abs/2306.04315", "description": "arXiv:2306.04315v3 Announce Type: replace \nAbstract: Because of physical assumptions and numerical approximations, low-order models are affected by uncertainties in the state and parameters, and by model biases. 
Model biases, also known as model errors or systematic errors, are difficult to infer because they are `unknown unknowns', i.e., we do not necessarily know their functional form a priori. With biased models, data assimilation methods may be ill-posed because either (i) they are 'bias-unaware' because the estimators are assumed unbiased, (ii) they rely on an a priori parametric model for the bias, or (iii) they can infer model biases that are not unique for the same model and data. First, we design a data assimilation framework to perform combined state, parameter, and bias estimation. Second, we propose a mathematical solution with a sequential method, i.e., the regularized bias-aware ensemble Kalman Filter (r-EnKF), which requires a model of the bias and its gradient (i.e., the Jacobian). Third, we propose an echo state network as the model bias estimator. We derive the Jacobian of the network, and design a robust training strategy with data augmentation to accurately infer the bias in different scenarios. Fourth, we apply the r-EnKF to nonlinearly coupled oscillators (with and without time-delay) affected by different forms of bias. The r-EnKF infers in real-time parameters and states, and a unique bias. The applications that we showcase are relevant to acoustics, thermoacoustics, and vibrations; however, the r-EnKF opens new opportunities for combined state, parameter and bias estimation for real-time and on-the-fly prediction in nonlinear systems."}, "https://arxiv.org/abs/2307.00214": {"title": "Utilizing a Capture-Recapture Strategy to Accelerate Infectious Disease Surveillance", "link": "https://arxiv.org/abs/2307.00214", "description": "arXiv:2307.00214v2 Announce Type: replace \nAbstract: Monitoring key elements of disease dynamics (e.g., prevalence, case counts) is of great importance in infectious disease prevention and control, as emphasized during the COVID-19 pandemic. To facilitate this effort, we propose a new capture-recapture (CRC) analysis strategy that takes misclassification into account from easily-administered, imperfect diagnostic test kits, such as the Rapid Antigen Test-kits or saliva tests. Our method is based on a recently proposed \"anchor stream\" design, whereby an existing voluntary surveillance data stream is augmented by a smaller and judiciously drawn random sample. It incorporates manufacturer-specified sensitivity and specificity parameters to account for imperfect diagnostic results in one or both data streams. For inference to accompany case count estimation, we improve upon traditional Wald-type confidence intervals by developing an adapted Bayesian credible interval for the CRC estimator that yields favorable frequentist coverage properties. When feasible, the proposed design and analytic strategy provides a more efficient solution than traditional CRC methods or random sampling-based biased-corrected estimation to monitor disease prevalence while accounting for misclassification. 
We demonstrate the benefits of this approach through simulation studies that underscore its potential utility in practice for economical disease monitoring among a registered closed population."}, "https://arxiv.org/abs/2308.02293": {"title": "Outlier-Robust Neural Network Training: Efficient Optimization of Transformed Trimmed Loss with Variation Regularization", "link": "https://arxiv.org/abs/2308.02293", "description": "arXiv:2308.02293v3 Announce Type: replace \nAbstract: In this study, we consider outlier-robust predictive modeling using highly expressive neural networks. To this end, we employ (1) a transformed trimmed loss (TTL), which is a computationally feasible variant of the classical trimmed loss, and (2) a higher-order variation regularization (HOVR) of the prediction model. Note that a neural network trained using only TTL may remain vulnerable to outliers, as its high expressive power causes it to overfit even the outliers perfectly. However, simultaneously introducing HOVR constrains the effective degrees of freedom, thereby avoiding fitting outliers. We also provide a new efficient stochastic algorithm for optimization, along with a theoretical convergence guarantee. (*Two authors contributed equally to this work.)"}, "https://arxiv.org/abs/2308.05073": {"title": "Harmonized Estimation of Subgroup-Specific Treatment Effects in Randomized Trials: The Use of External Control Data", "link": "https://arxiv.org/abs/2308.05073", "description": "arXiv:2308.05073v2 Announce Type: replace \nAbstract: Subgroup analyses of randomized controlled trials (RCTs) constitute an important component of the drug development process in precision medicine. In particular, subgroup analyses of early-stage trials often influence the design and eligibility criteria of subsequent confirmatory trials and ultimately influence which subpopulations will receive the treatment after regulatory approval. However, subgroup analyses are often complicated by small sample sizes, which leads to substantial uncertainty about subgroup-specific treatment effects. We explore the use of external control (EC) data to augment RCT subgroup analyses. We define and discuss harmonized estimators of subpopulation-specific treatment effects that leverage EC data. Our approach can be used to modify any subgroup-specific treatment effect estimates that are obtained by combining RCT and EC data, such as linear regression. We alter these subgroup-specific estimates to make them coherent with a robust estimate of the average effect in the randomized population based only on RCT data. The weighted average of the resulting subgroup-specific harmonized estimates matches the RCT-only estimate of the overall effect in the randomized population. We discuss the proposed harmonized estimators through analytic results and simulations, and investigate standard performance metrics. The method is illustrated with a case study in oncology."}, "https://arxiv.org/abs/2310.15266": {"title": "Causal progress with imperfect placebo treatments and outcomes", "link": "https://arxiv.org/abs/2310.15266", "description": "arXiv:2310.15266v2 Announce Type: replace \nAbstract: In the quest to make defensible causal claims from observational data, it is sometimes possible to leverage information from \"placebo treatments\" and \"placebo outcomes\". 
Existing approaches employing such information focus largely on point identification and assume (i) \"perfect placebos\", meaning placebo treatments have precisely zero effect on the outcome and the real treatment has precisely zero effect on a placebo outcome; and (ii) \"equiconfounding\", meaning that the treatment-outcome relationship where one is a placebo suffers the same amount of confounding as does the real treatment-outcome relationship, on some scale. We instead consider an omitted variable bias framework, in which users can postulate ranges of values for the degree of unequal confounding and the degree of placebo imperfection. Once postulated, these assumptions identify or bound the linear estimates of treatment effects. Our approach also does not require using both a placebo treatment and placebo outcome, as some others do. While applicable in many settings, one ubiquitous use-case for this approach is to employ pre-treatment outcomes as (perfect) placebo outcomes, as in difference-in-differences. The parallel trends assumption in this setting is identical to the equiconfounding assumption, on a particular scale, which our framework allows the user to relax. Finally, we demonstrate the use of our framework with two applications and a simulation, employing an R package that implements these approaches."}, "https://arxiv.org/abs/2311.16984": {"title": "FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings", "link": "https://arxiv.org/abs/2311.16984", "description": "arXiv:2311.16984v3 Announce Type: replace \nAbstract: External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients' rights by strictly controlling data processing make pooling data from multiple sources in a central server often difficult. To address these limitations, we develop a new method, 'FedECA', that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity culminating with a real-world use-case in which FedECA provides evidence for a differential effect between two drugs that would otherwise have gone unnoticed. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development."}, "https://arxiv.org/abs/2312.06018": {"title": "A Multivariate Polya Tree Model for Meta-Analysis with Event Time Distributions", "link": "https://arxiv.org/abs/2312.06018", "description": "arXiv:2312.06018v2 Announce Type: replace \nAbstract: We develop a non-parametric Bayesian prior for a family of random probability measures by extending the Polya tree ($PT$) prior to a joint prior for a set of probability measures $G_1,\\dots,G_n$, suitable for meta-analysis with event time outcomes. In the application to meta-analysis $G_i$ is the event time distribution specific to study $i$. The proposed model defines a regression on study-specific covariates by introducing increased correlation for any pair of studies with similar characteristics. 
The desired multivariate $PT$ model is constructed by introducing a hierarchical prior on the conditional splitting probabilities in the $PT$ construction for each of the $G_i$. The hierarchical prior replaces the independent beta priors for the splitting probability in the $PT$ construction with a Gaussian process prior for corresponding (logit) splitting probabilities across all studies. The Gaussian process is indexed by study-specific covariates, introducing the desired dependence with increased correlation for similar studies. The main feature of the proposed construction is (conditionally) conjugate posterior updating with commonly reported inference summaries for event time data. The construction is motivated by a meta-analysis over cancer immunotherapy studies."}, "https://arxiv.org/abs/2401.03834": {"title": "On the error control of invariant causal prediction", "link": "https://arxiv.org/abs/2401.03834", "description": "arXiv:2401.03834v3 Announce Type: replace \nAbstract: Invariant causal prediction (ICP, Peters et al. (2016)) provides a novel way for identifying causal predictors of a response by utilizing heterogeneous data from different environments. One notable advantage of ICP is that it guarantees to make no false causal discoveries with high probability. Such a guarantee, however, can be overly conservative in some applications, resulting in few or no causal discoveries. This raises a natural question: Can we use less conservative error control guarantees for ICP so that more causal information can be extracted from data? We address this question in the paper. We focus on two commonly used and more liberal guarantees: false discovery rate control and simultaneous true discovery bound. Unexpectedly, we find that false discovery rate does not seem to be a suitable error criterion for ICP. The simultaneous true discovery bound, on the other hand, proves to be an ideal choice, enabling users to explore potential causal predictors and extract more causal information. Importantly, the additional information comes for free, in the sense that no extra assumptions are required and the discoveries from the original ICP approach are fully retained. We demonstrate the practical utility of our method through simulations and a real dataset about the educational attainment of teenagers in the US."}, "https://arxiv.org/abs/2410.07229": {"title": "Fast spatio-temporally varying coefficient modeling with reluctant interaction selection", "link": "https://arxiv.org/abs/2410.07229", "description": "arXiv:2410.07229v1 Announce Type: new \nAbstract: Spatially and temporally varying coefficient (STVC) models are currently attracting attention as a flexible tool to explore the spatio-temporal patterns in regression coefficients. However, these models often struggle with balancing computational efficiency and model flexibility. To address this challenge, this study develops a fast and flexible method for STVC modeling. For enhanced flexibility in modeling, we assume multiple processes in each varying coefficient, including purely spatial, purely temporal, and spatio-temporal interaction processes with or without time cyclicity. While considering multiple processes can be time consuming, we combine a pre-conditioning method with a model selection procedure, inspired by reluctant interaction modeling. This approach allows us to computationally efficiently select and specify the latent space-time structure. 
Monte Carlo experiments demonstrate that the proposed method outperforms alternatives in terms of coefficient estimation accuracy and computational efficiency. Finally, we apply the proposed method to crime analysis using a sample size of 279,360, confirming that the proposed method provides reasonable estimates of varying coefficients. The STVC model is implemented in an R package spmoran."}, "https://arxiv.org/abs/2410.07242": {"title": "Dynamic borrowing from historical controls via the synthetic prior with covariates in randomized clinical trials", "link": "https://arxiv.org/abs/2410.07242", "description": "arXiv:2410.07242v1 Announce Type: new \nAbstract: Motivated by a rheumatoid arthritis clinical trial, we propose a new Bayesian method called SPx, standing for synthetic prior with covariates, to borrow information from historical trials to reduce the control group size in a new trial. The method involves a novel use of Bayesian model averaging to balance between multiple possible relationships between the historical and new trial data, allowing the historical data to be dynamically trusted or discounted as appropriate. We require only trial-level summary statistics, which are available more often than patient-level data. Through simulations and an application to the rheumatoid arthritis trial we show that SPx can substantially reduce the control group size while maintaining Frequentist properties."}, "https://arxiv.org/abs/2410.07293": {"title": "Feature-centric nonlinear autoregressive models", "link": "https://arxiv.org/abs/2410.07293", "description": "arXiv:2410.07293v1 Announce Type: new \nAbstract: We propose a novel feature-centric approach to surrogate modeling of dynamical systems driven by time-varying exogenous excitations. This approach, named Functional Nonlinear AutoRegressive with eXogenous inputs (F-NARX), aims to approximate the system response based on temporal features of both the exogenous inputs and the system response, rather than on their values at specific time lags. This is a major step away from the discrete-time-centric approach of classical NARX models, which attempts to determine the relationship between selected time steps of the input/output time series. By modeling the system in a time-feature space instead of the original time axis, F-NARX can provide more stable long-term predictions and drastically reduce the reliance of the model performance on the time discretization of the problem. F-NARX, like NARX, acts as a framework and is not tied to a single implementation. In this work, we introduce an F-NARX implementation based on principal component analysis and polynomial basis functions. To further improve prediction accuracy and computational efficiency, we also introduce a strategy to identify and fit a sparse model structure, thanks to a modified hybrid least angle regression approach that minimizes the expected forecast error, rather than the one-step-ahead prediction error. Since F-NARX is particularly well-suited to modeling engineering structures typically governed by physical processes, we investigate the behavior and capabilities of our F-NARX implementation on two case studies: an eight-story building under wind loading and a three-story steel frame under seismic loading. 
Our results demonstrate that F-NARX has several favorable properties over classical NARX, making it well suited to emulate engineering systems with high accuracy over extended time periods."}, "https://arxiv.org/abs/2410.07357": {"title": "Penalized regression with negative-unlabeled data: An approach to developing a long COVID research index", "link": "https://arxiv.org/abs/2410.07357", "description": "arXiv:2410.07357v1 Announce Type: new \nAbstract: Moderate to severe post-acute sequelae of SARS-CoV-2 infection (PASC), also called long COVID, is estimated to impact as many as 10% of SARS-CoV-2 infected individuals, representing a chronic condition with a substantial global public health burden. An expansive literature has identified over 200 long-term and persistent symptoms associated with a history of SARS-CoV-2 infection; yet, there is still no clear consensus on a syndrome definition. Such a definition is a critical first step in future studies of risk and resiliency factors, mechanisms of disease, and interventions for both treatment and prevention. We recently applied a strategy for defining a PASC research index based on a Lasso-penalized logistic regression on history of SARS-CoV-2 infection. In the current paper we formalize and evaluate this approach through theoretical derivations and simulation studies. We demonstrate that this approach appropriately selects symptoms associated with PASC and results in a score that has high discriminatory power for detecting PASC. An application to data on participants enrolled in the RECOVER (Researching COVID to Enhance Recovery) Adult Cohort is presented to illustrate our findings."}, "https://arxiv.org/abs/2410.07374": {"title": "Predicting Dengue Outbreaks: A Dynamic Approach with Variable Length Markov Chains and Exogenous Factors", "link": "https://arxiv.org/abs/2410.07374", "description": "arXiv:2410.07374v1 Announce Type: new \nAbstract: Variable Length Markov Chains with Exogenous Covariates (VLMCX) are stochastic models that use Generalized Linear Models to compute transition probabilities, taking into account both the state history and time-dependent exogenous covariates. The beta-context algorithm selects a relevant finite suffix (context) for predicting the next symbol. This algorithm estimates flexible tree-structured models by aggregating irrelevant states in the process history and enables the model to incorporate exogenous covariates over time.\n This research uses data from multiple sources to extend the beta-context algorithm to incorporate time-dependent and time-invariant exogenous covariates. Within this approach, we have a distinct Markov chain for every data source, allowing for a comprehensive understanding of the process behavior across multiple situations, such as different geographic locations. Despite using data from different sources, we assume that all sources are independent and share identical parameters: we explore contexts within each data source and combine them to compute transition probabilities, deriving a unified tree. This approach eliminates the necessity for spatially dependent structural considerations within the model. 
Furthermore, we incorporate modifications in the estimation procedure to address contexts that appear with low frequency.\n Our motivation was to investigate the impact of previous dengue rates, weather conditions, and socioeconomic factors on subsequent dengue rates across various municipalities in Brazil, providing insights into dengue transmission dynamics."}, "https://arxiv.org/abs/2410.07443": {"title": "On the Lower Confidence Band for the Optimal Welfare", "link": "https://arxiv.org/abs/2410.07443", "description": "arXiv:2410.07443v1 Announce Type: new \nAbstract: This article addresses the question of reporting a lower confidence band (LCB) for optimal welfare in policy learning problems. A straightforward procedure inverts a one-sided t-test based on an efficient estimator of the optimal welfare. We argue that in an empirically relevant class of data-generating processes, an LCB corresponding to suboptimal welfare may exceed the straightforward LCB, with the average difference of order $N^{-1/2}$. We relate this result to a lack of uniformity in the so-called margin assumption, commonly imposed in policy learning and debiased inference. We advocate for using uniformly valid asymptotic approximations and show how existing methods for inference in moment inequality models can be used to construct valid and tight LCBs for the optimal welfare. We illustrate our findings in the context of the National JTPA study."}, "https://arxiv.org/abs/2410.07454": {"title": "Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning", "link": "https://arxiv.org/abs/2410.07454", "description": "arXiv:2410.07454v1 Announce Type: new \nAbstract: A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating a reliable knowledge graph by leveraging multi-source information from the existing literature, however, is challenging, especially with a large number of nodes and heterogeneous relations. In this paper, we propose a general theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into the initial entity embedding of a neural network that approximates the score function for the knowledge graph and continuously trains the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with the ReLU activation function, in the scenarios where the embedding parameters are either fixed or trainable. Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. 
The experiments justify our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models."}, "https://arxiv.org/abs/2410.07483": {"title": "Doubly robust estimation and sensitivity analysis with outcomes truncated by death in multi-arm clinical trials", "link": "https://arxiv.org/abs/2410.07483", "description": "arXiv:2410.07483v1 Announce Type: new \nAbstract: In clinical trials, the observation of participant outcomes may frequently be hindered by death, leading to ambiguity in defining a scientifically meaningful final outcome for those who die. Principal stratification methods are valuable tools for addressing the average causal effect among always-survivors, i.e., the average treatment effect among a subpopulation in the principal strata of those who would survive regardless of treatment assignment. Although robust methods for the truncation-by-death problem in two-arm clinical trials have been previously studied, its expansion to multi-arm clinical trials remains unknown. In this article, we study the identification of a class of survivor average causal effect estimands with multiple treatments under monotonicity and principal ignorability, and first propose simple weighting and regression approaches. As a further improvement, we then derive the efficient influence function to motivate doubly robust estimators for the survivor average causal effects in multi-arm clinical trials. We also articulate sensitivity methods under violations of key causal assumptions. Extensive simulations are conducted to investigate the finite-sample performance of the proposed methods, and a real data example is used to illustrate how to operationalize the proposed estimators and the sensitivity methods in practice."}, "https://arxiv.org/abs/2410.07487": {"title": "Learning associations of COVID-19 hospitalizations with wastewater viral signals by Markov modulated models", "link": "https://arxiv.org/abs/2410.07487", "description": "arXiv:2410.07487v1 Announce Type: new \nAbstract: Viral signal in wastewater offers a promising opportunity to assess and predict the burden of infectious diseases. That has driven the widespread adoption and development of wastewater monitoring tools by public health organizations. Recent research highlights a strong correlation between COVID-19 hospitalizations and wastewater viral signals, and validates that increases in wastewater measurements may offer early warnings of an increase in hospital admissions. Previous studies (e.g. Peng et al. 2023) utilize distributed lag models to explore associations of COVID-19 hospitalizations with lagged SARS-CoV-2 wastewater viral signals. However, the conventional distributed lag models assume the duration time of the lag to be fixed, which is not always plausible. This paper presents Markov-modulated models with distributed lasting time, treating the duration of the lag as a random variable defined by a hidden Markov chain. We evaluate exposure effects over the duration time and estimate the distribution of the lasting time using the wastewater data and COVID-19 hospitalization records from Ottawa, Canada during June 2020 to November 2022. The different COVID-19 waves are accommodated in the statistical learning. In particular, two strategies for comparing the associations over different time intervals are exemplified using the Ottawa data. 
Of note, the proposed Markov modulated models, an extension of distributed lag models, are potentially applicable to many different problems where the lag time is not fixed."}, "https://arxiv.org/abs/2410.07555": {"title": "A regression framework for studying relationships among attributes under network interference", "link": "https://arxiv.org/abs/2410.07555", "description": "arXiv:2410.07555v1 Announce Type: new \nAbstract: To understand how the interconnected and interdependent world of the twenty-first century operates and make model-based predictions, joint probability models for networks and interdependent outcomes are needed. We propose a comprehensive regression framework for networks and interdependent outcomes with multiple advantages, including interpretability, scalability, and provable theoretical guarantees. The regression framework can be used for studying relationships among attributes of connected units and captures complex dependencies among connections and attributes, while retaining the virtues of linear regression, logistic regression, and other regression models by being interpretable and widely applicable. On the computational side, we show that the regression framework is amenable to scalable statistical computing based on convex optimization of pseudo-likelihoods using minorization-maximization methods. On the theoretical side, we establish convergence rates for pseudo-likelihood estimators based on a single observation of dependent connections and attributes. We demonstrate the regression framework using simulations and an application to hate speech on the social media platform X in the six months preceding the insurrection at the U.S. Capitol on January 6, 2021."}, "https://arxiv.org/abs/2410.07934": {"title": "panelPomp: Analysis of Panel Data via Partially Observed Markov Processes in R", "link": "https://arxiv.org/abs/2410.07934", "description": "arXiv:2410.07934v1 Announce Type: new \nAbstract: Panel data arise when time series measurements are collected from multiple, dynamically independent but structurally related systems. In such cases, each system's time series can be modeled as a partially observed Markov process (POMP), and the ensemble of these models is called a PanelPOMP. If the time series are relatively short, statistical inference for each time series must draw information from across the entire panel. Every time series has a name, called its unit label, which may correspond to an object on which that time series was collected. Differences between units may be of direct inferential interest or may be a nuisance for studying the commonalities. The R package panelPomp supports analysis of panel data via a general class of PanelPOMP models. This includes a suite of tools for manipulation of models and data that take advantage of the panel structure. The panelPomp package currently emphasizes recent advances enabling likelihood-based inference via simulation-based algorithms. However, the general framework provided by panelPomp supports development of additional, new inference methodology for panel data."}, "https://arxiv.org/abs/2410.07996": {"title": "Smoothed pseudo-population bootstrap methods with applications to finite population quantiles", "link": "https://arxiv.org/abs/2410.07996", "description": "arXiv:2410.07996v1 Announce Type: new \nAbstract: This paper introduces smoothed pseudo-population bootstrap methods for the purposes of variance estimation and the construction of confidence intervals for finite population quantiles. In an i.i.d. 
context, it has been shown that resampling from a smoothed estimate of the distribution function instead of the usual empirical distribution function can improve the convergence rate of the bootstrap variance estimator of a sample quantile. We extend the smoothed bootstrap to the survey sampling framework by implementing it in pseudo-population bootstrap methods for high entropy, single-stage survey designs, such as simple random sampling without replacement and Poisson sampling. Given a kernel function and a bandwidth, it consists of smoothing the pseudo-population from which bootstrap samples are drawn using the original sampling design. Given that the implementation of the proposed algorithms requires the specification of the bandwidth, we develop a plug-in selection method along with a grid search selection method based on a bootstrap estimate of the mean squared error. Simulation results suggest a gain in efficiency associated with the smoothed approach as compared to the standard pseudo-population bootstrap for estimating the variance of a quantile estimator together with mixed results regarding confidence interval coverage."}, "https://arxiv.org/abs/2410.08078": {"title": "Negative Control Outcome Adjustment in Early-Phase Randomized Trials: Estimating Vaccine Effects on Immune Responses in HIV Exposed Uninfected Infants", "link": "https://arxiv.org/abs/2410.08078", "description": "arXiv:2410.08078v1 Announce Type: new \nAbstract: Adjustment for prognostic baseline covariates when estimating treatment effects in randomized trials can reduce bias due to covariate imbalance and yield guaranteed efficiency gains in large samples. Gains in precision and reductions in finite-sample bias are arguably most important in the resource-limited setting of early-phase trials. Despite their favorable large-sample properties, the utility of covariate-adjusted estimators in early-phase trials is complicated by precision loss due to adjustment for weakly prognostic covariates and type I error rate inflation and undercoverage of asymptotic confidence intervals in finite samples. We propose adjustment for a valid negative control outcome (NCO), i.e., an auxiliary post-randomization outcome assumed completely unaffected by treatment but correlated with the outcome of interest. We articulate the causal assumptions that permit adjustment for NCOs, describe when NCO adjustment may improve upon adjustment for baseline covariates, illustrate performance and provide practical recommendations regarding model selection and finite-sample variance corrections in early-phase trials using numerical experiments, and demonstrate superior performance of NCO adjustment in the reanalysis of two early-phase vaccine trials in HIV exposed uninfected (HEU) infants. In early-phase studies without knowledge of baseline predictors of outcomes, we advocate for eschewing baseline covariate adjustment in favor of adjustment for NCOs believed to be unaffected by the experimental intervention."}, "https://arxiv.org/abs/2410.08080": {"title": "Bayesian Nonparametric Sensitivity Analysis of Multiple Comparisons Under Dependence", "link": "https://arxiv.org/abs/2410.08080", "description": "arXiv:2410.08080v1 Announce Type: new \nAbstract: This short communication introduces a sensitivity analysis method for Multiple Testing Procedures (MTPs), based on marginal $p$-values and the Dirichlet process prior distribution. 
The method measures each $p$-value's insensitivity towards a significance decision, with respect to the entire space of MTPs controlling either the family-wise error rate (FWER) or the false discovery rate (FDR) under arbitrary dependence between $p$-values, supported by this nonparametric prior. The sensitivity analysis method is illustrated through 1,081 hypothesis tests of the effects of the COVID-19 pandemic on educational processes for 15-year-old students, performed on a 2022 public dataset. Software code for the method is provided."}, "https://arxiv.org/abs/2410.07191": {"title": "Curb Your Attention: Causal Attention Gating for Robust Trajectory Prediction in Autonomous Driving", "link": "https://arxiv.org/abs/2410.07191", "description": "arXiv:2410.07191v1 Announce Type: cross \nAbstract: Trajectory prediction models in autonomous driving are vulnerable to perturbations from non-causal agents whose actions should not affect the ego-agent's behavior. Such perturbations can lead to incorrect predictions of other agents' trajectories, potentially compromising the safety and efficiency of the ego-vehicle's decision-making process. Motivated by this challenge, we propose $\\textit{Causal tRajecTory predICtion}$ $\\textbf{(CRiTIC)}$, a novel model that utilizes a $\\textit{Causal Discovery Network}$ to identify inter-agent causal relations over a window of past time steps. To incorporate discovered causal relationships, we propose a novel $\\textit{Causal Attention Gating}$ mechanism to selectively filter information in the proposed Transformer-based architecture. We conduct extensive experiments on two autonomous driving benchmark datasets to evaluate the robustness of our model against non-causal perturbations and its generalization capacity. Our results indicate that the robustness of predictions can be improved by up to $\\textbf{54%}$ without a significant detriment to prediction accuracy. Lastly, we demonstrate the superior domain generalizability of the proposed model, which achieves up to $\\textbf{29%}$ improvement in cross-domain performance. These results underscore the potential of our model to enhance both robustness and generalization capacity for trajectory prediction in diverse autonomous driving domains. Further details can be found on our project page: https://critic-model.github.io/."}, "https://arxiv.org/abs/2410.07234": {"title": "A Dynamic Approach to Stock Price Prediction: Comparing RNN and Mixture of Experts Models Across Different Volatility Profiles", "link": "https://arxiv.org/abs/2410.07234", "description": "arXiv:2410.07234v1 Announce Type: cross \nAbstract: This study evaluates the effectiveness of a Mixture of Experts (MoE) model for stock price prediction by comparing it to a Recurrent Neural Network (RNN) and a linear regression model. The MoE framework combines an RNN for volatile stocks and a linear model for stable stocks, dynamically adjusting the weight of each model through a gating network. Results indicate that the MoE approach significantly improves predictive accuracy across different volatility profiles. The RNN effectively captures non-linear patterns for volatile companies but tends to overfit stable data, whereas the linear model performs well for predictable trends. The MoE model's adaptability allows it to outperform each individual model, reducing errors such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). 
Future work should focus on enhancing the gating mechanism and validating the model with real-world datasets to optimize its practical applicability."}, "https://arxiv.org/abs/2410.07899": {"title": "A multivariate spatial regression model using signatures", "link": "https://arxiv.org/abs/2410.07899", "description": "arXiv:2410.07899v1 Announce Type: cross \nAbstract: We propose a spatial autoregressive model for a multivariate response variable and functional covariates. The approach is based on the notion of signature, which represents a function as an infinite series of its iterated integrals and presents the advantage of being applicable to a wide range of processes. We have provided theoretical guarantees for the choice of the signature truncation order, and we have shown in a simulation study that this approach outperforms existing approaches in the literature."}, "https://arxiv.org/abs/2308.12506": {"title": "General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays", "link": "https://arxiv.org/abs/2308.12506", "description": "arXiv:2308.12506v4 Announce Type: replace \nAbstract: We present a general central limit theorem with simple, easy-to-check covariance-based sufficient conditions for triangular arrays of random vectors when all variables could be interdependent. The result is constructed from Stein's method, but the conditions are distinct from related work. We show that these covariance conditions nest standard assumptions studied in the literature such as $M$-dependence, mixing random fields, non-mixing autoregressive processes, and dependency graphs, which themselves need not imply each other. This permits researchers to work with high-level but intuitive conditions based on overall correlation instead of more complicated and restrictive conditions such as strong mixing in random fields that may not have any obvious micro-foundation. As examples of the implications, we show how the theorem implies asymptotic normality in estimating: treatment effects with spillovers in more settings than previously admitted, covariance matrices, processes with global dependencies such as epidemic spread and information diffusion, and spatial process with Mat\\'{e}rn dependencies."}, "https://arxiv.org/abs/2309.07365": {"title": "Addressing selection bias in cluster randomized experiments via weighting", "link": "https://arxiv.org/abs/2309.07365", "description": "arXiv:2309.07365v2 Announce Type: replace \nAbstract: In cluster randomized experiments, individuals are often recruited after the cluster treatment assignment, and data are typically only available for the recruited sample. Post-randomization recruitment can lead to selection bias, inducing systematic differences between the overall and the recruited populations, and between the recruited intervention and control arms. In this setting, we define causal estimands for the overall and the recruited populations. We prove, under the assumption of ignorable recruitment, that the average treatment effect on the recruited population can be consistently estimated from the recruited sample using inverse probability weighting. Generally we cannot identify the average treatment effect on the overall population. 
Nonetheless, we show, via a principal stratification formulation, that one can use weighting of the recruited sample to identify treatment effects on two meaningful subpopulations of the overall population: individuals who would be recruited into the study regardless of the assignment, and individuals who would be recruited into the study under treatment but not under control. We develop an estimation strategy and a sensitivity analysis approach for checking the ignorable recruitment assumption. The proposed methods are applied to the ARTEMIS cluster randomized trial, where removing co-payment barriers increases the persistence of P2Y12 inhibitor among the always-recruited population."}, "https://arxiv.org/abs/2310.09428": {"title": "Sparse higher order partial least squares for simultaneous variable selection, dimension reduction, and tensor denoising", "link": "https://arxiv.org/abs/2310.09428", "description": "arXiv:2310.09428v2 Announce Type: replace \nAbstract: Partial Least Squares (PLS) regression emerged as an alternative to ordinary least squares for addressing multicollinearity in a wide range of scientific applications. As multidimensional tensor data is becoming more widespread, tensor adaptations of PLS have been developed. In this paper, we first establish the statistical behavior of Higher Order PLS (HOPLS) of Zhao et al. (2012), by showing that the consistency of the HOPLS estimator cannot be guaranteed as the tensor dimensions and the number of features increase faster than the sample size. To tackle this issue, we propose Sparse Higher Order Partial Least Squares (SHOPS) regression and an accompanying algorithm. SHOPS simultaneously accommodates variable selection, dimension reduction, and tensor response denoising. We further establish the asymptotic results of the SHOPS algorithm under a high-dimensional regime. The results also complete the unknown theoretic properties of SPLS algorithm (Chun and Kele\\c{s}, 2010). We verify these findings through comprehensive simulation experiments, and application to an emerging high-dimensional biological data analysis."}, "https://arxiv.org/abs/2302.00860": {"title": "Modeling Causal Mechanisms with Diffusion Models for Interventional and Counterfactual Queries", "link": "https://arxiv.org/abs/2302.00860", "description": "arXiv:2302.00860v3 Announce Type: replace-cross \nAbstract: We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings. These encodings enable us to directly sample under interventions and perform abduction for counterfactuals. Diffusion models are a natural fit here, since they can encode each node to a latent representation that acts as a proxy for exogenous noise. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. 
Furthermore, we provide theoretical results that offer a methodology for analyzing counterfactual estimation in general encoder-decoder models, which could be useful in settings beyond our proposed approach."}, "https://arxiv.org/abs/2410.08283": {"title": "Adaptive sparsening and smoothing of the treatment model for longitudinal causal inference using outcome-adaptive LASSO and marginal fused LASSO", "link": "https://arxiv.org/abs/2410.08283", "description": "arXiv:2410.08283v1 Announce Type: new \nAbstract: Causal variable selection in time-varying treatment settings is challenging due to evolving confounding effects. Existing methods mainly focus on time-fixed exposures and are not directly applicable to time-varying scenarios. We propose a novel two-step procedure for variable selection when modeling the treatment probability at each time point. We first introduce a novel approach to longitudinal confounder selection using a Longitudinal Outcome Adaptive LASSO (LOAL) that will data-adaptively select covariates with theoretical justification of variance reduction of the estimator of the causal effect. We then propose an Adaptive Fused LASSO that can collapse treatment model parameters over time points with the goal of simplifying the models in order to improve the efficiency of the estimator while minimizing model misspecification bias compared with naive pooled logistic regression models. Our simulation studies highlight the need for and usefulness of the proposed approach in practice. We implemented our method on data from the Nicotine Dependence in Teens study to estimate the effect of the timing of alcohol initiation during adolescence on depressive symptoms in early adulthood."}, "https://arxiv.org/abs/2410.08382": {"title": "Bivariate Variable Ranking for censored time-to-event data via Copula Link Based Additive models", "link": "https://arxiv.org/abs/2410.08382", "description": "arXiv:2410.08382v1 Announce Type: new \nAbstract: In this paper, we present a variable ranking approach established on a novel measure to select important variables in bivariate Copula Link-Based Additive Models (Marra & Radice, 2020). The proposal allows for identifying two sets of relevant covariates for the two time-to-events without neglecting the dependency structure that may exist between the two survivals. The procedure suggested is evaluated via a simulation study and then is applied for analyzing the Age-Related Eye Disease Study dataset. The algorithm is implemented in a new R package, called BRBVS.."}, "https://arxiv.org/abs/2410.08422": {"title": "Principal Component Analysis in the Graph Frequency Domain", "link": "https://arxiv.org/abs/2410.08422", "description": "arXiv:2410.08422v1 Announce Type: new \nAbstract: We propose a novel principal component analysis in the graph frequency domain for dimension reduction of multivariate data residing on graphs. The proposed method not only effectively reduces the dimensionality of multivariate graph signals, but also provides a closed-form reconstruction of the original data. In addition, we investigate several propositions related to principal components and the reconstruction errors, and introduce a graph spectral envelope that aids in identifying common graph frequencies in multivariate graph signals. 
We demonstrate the validity of the proposed method through a simulation study and further analyze the boarding and alighting patterns of Seoul Metropolitan Subway passengers using the proposed method."}, "https://arxiv.org/abs/2410.08441": {"title": "A scientific review on advances in statistical methods for crossover design", "link": "https://arxiv.org/abs/2410.08441", "description": "arXiv:2410.08441v1 Announce Type: new \nAbstract: A comprehensive review of the literature on crossover design is needed to highlight its evolution, applications, and methodological advancements across various fields. Given its widespread use in clinical trials and other research domains, understanding this design's challenges, assumptions, and innovations is essential for optimizing its implementation and ensuring accurate, unbiased results. This article extensively reviews the history and statistical inference methods for crossover designs. A primary focus is given to the AB-BA design as it is the most widely used design in the literature. Extension from two periods to higher-order designs is discussed, and a general inference procedure for continuous response is studied. Analysis of multivariate and categorical responses is also reviewed in this context. A number of open problems in this area are shortlisted."}, "https://arxiv.org/abs/2410.08488": {"title": "Fractional binomial regression model for count data with excess zeros", "link": "https://arxiv.org/abs/2410.08488", "description": "arXiv:2410.08488v1 Announce Type: new \nAbstract: This paper proposes a new generalized linear model with a fractional binomial distribution.\n Zero-inflated Poisson/negative binomial distributions are used for count data that has many zeros. To analyze the association of such a count variable with covariates, zero-inflated Poisson/negative binomial regression models are widely used. In this work, we develop a regression model with the fractional binomial distribution that can serve as an additional tool for modeling the count response variable with covariates. Data analysis results show that on some occasions, our model outperforms the existing zero-inflated regression models."}, "https://arxiv.org/abs/2410.08492": {"title": "Exact MLE for Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2410.08492", "description": "arXiv:2410.08492v1 Announce Type: new \nAbstract: Exact MLE for generalized linear mixed models (GLMMs) is a long-standing problem unsolved until today. The proposed research solves the problem. In this problem, the main difficulty is caused by intractable integrals in the likelihood function when the response does not follow a normal distribution and the prior distribution for the random effects is specified as normal. Previous methods use Laplace approximations or Monte Carlo simulations to compute the MLE approximately. These methods cannot provide the exact MLEs of the parameters and the hyperparameters. The exact MLE problem has remained unsolved until the proposed work. The idea is to construct a sequence of mathematical functions in the optimization procedure. Optimization of the mathematical functions can be numerically computed. The result can lead to the exact MLEs of the parameters and hyperparameters. 
Because computing the likelihood is unnecessary, the proposed method avoids the main difficulty caused by the intractable integrals in the likelihood function."}, "https://arxiv.org/abs/2410.08523": {"title": "Parametric multi-fidelity Monte Carlo estimation with applications to extremes", "link": "https://arxiv.org/abs/2410.08523", "description": "arXiv:2410.08523v1 Announce Type: new \nAbstract: In a multi-fidelity setting, data are available from two sources, high- and low-fidelity. Low-fidelity data has larger size and can be leveraged to make more efficient inference about quantities of interest, e.g. the mean, for high-fidelity variables. In this work, such multi-fidelity setting is studied when the goal is to fit more efficiently a parametric model to high-fidelity data. Three multi-fidelity parameter estimation methods are considered, joint maximum likelihood, (multi-fidelity) moment estimation and (multi-fidelity) marginal maximum likelihood, and are illustrated on several parametric models, with the focus on parametric families used in extreme value analysis. An application is also provided concerning quantification of occurrences of extreme ship motions generated by two computer codes of varying fidelity."}, "https://arxiv.org/abs/2410.08733": {"title": "An alignment-agnostic methodology for the analysis of designed separations data", "link": "https://arxiv.org/abs/2410.08733", "description": "arXiv:2410.08733v1 Announce Type: new \nAbstract: Chemical separations data are typically analysed in the time domain using methods that integrate the discrete elution bands. Integrating the same chemical components across several samples must account for retention time drift over the course of an entire experiment as the physical characteristics of the separation are altered through several cycles of use. Failure to consistently integrate the components within a matrix of $M \\times N$ samples and variables create artifacts that have a profound effect on the analysis and interpretation of the data. This work presents an alternative where the raw separations data are analysed in the frequency domain to account for the offset of the chromatographic peaks as a matrix of complex Fourier coefficients. We present a generalization of the permutation testing, and visualization steps in ANOVA-Simultaneous Component Analysis (ASCA) to handle complex matrices, and use this method to analyze a synthetic dataset with known significant factors and compare the interpretation of a real dataset via its peak table and frequency domain representations."}, "https://arxiv.org/abs/2410.08782": {"title": "Half-KFN: An Enhanced Detection Method for Subtle Covariate Drift", "link": "https://arxiv.org/abs/2410.08782", "description": "arXiv:2410.08782v1 Announce Type: new \nAbstract: Detecting covariate drift is a common task of significant practical value in supervised learning. Once covariate drift occurs, the models may no longer be applicable, hence numerous studies have been devoted to the advancement of detection methods. However, current research methods are not particularly effective in handling subtle covariate drift when dealing with small proportions of drift samples. In this paper, inspired by the $k$-nearest neighbor (KNN) approach, a novel method called Half $k$-farthest neighbor (Half-KFN) is proposed in response to specific scenarios. 
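For the parametric multi-fidelity entry (arXiv:2410.08523) above, the following is a minimal sketch of the classical control-variate multi-fidelity mean estimator that motivates the setting; the paper's joint/marginal likelihood and moment estimators for parametric families are not reproduced, and the simulated high/low-fidelity data are purely illustrative.

```python
# Classical multi-fidelity (control-variate) mean estimator: background for
# arXiv:2410.08523. Not the paper's parametric estimators, just the standard
# mean-estimation idea that paired high/low-fidelity data enable.
import numpy as np

rng = np.random.default_rng(1)

# Paired data: n expensive high-fidelity samples with matching low-fidelity runs,
# plus N >> n additional cheap low-fidelity samples.
n, N = 50, 5000
z = rng.normal(size=N)                         # latent driver of both fidelities
low_all = z + 0.3 * rng.normal(size=N)         # cheap, noisy surrogate
high = 2.0 + z[:n] + 0.1 * rng.normal(size=n)  # expensive, accurate model
low_paired = low_all[:n]

# Control-variate coefficient estimated from the paired subsample.
alpha = np.cov(high, low_paired)[0, 1] / np.var(low_paired, ddof=1)

mf_mean = high.mean() + alpha * (low_all.mean() - low_paired.mean())
print("high-fidelity-only estimate:", high.mean())
print("multi-fidelity estimate:    ", mf_mean)
```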
Compared to traditional ones, Half-KFN exhibits higher power due to the inherent capability of the farthest neighbors which could better characterize the nature of drift. Furthermore, with larger sample sizes, the employment of the bootstrap for hypothesis testing is recommended. It is leveraged to calculate $p$-values dramatically faster than permutation tests, with speed undergoing an exponential growth as sample size increases. Numerical experiments on simulated and real data are conducted to evaluate our proposed method, and the results demonstrate that it consistently displays superior sensitivity and rapidity in covariate drift detection across various cases."}, "https://arxiv.org/abs/2410.08803": {"title": "Generalised logistic regression with vine copulas", "link": "https://arxiv.org/abs/2410.08803", "description": "arXiv:2410.08803v1 Announce Type: new \nAbstract: We propose a generalisation of the logistic regression model, that aims to account for non-linear main effects and complex interactions, while keeping the model inherently explainable. This is obtained by starting with log-odds that are linear in the covariates, and adding non-linear terms that depend on at least two covariates. More specifically, we use a generative specification of the model, consisting of a combination of certain margins on natural exponential form, combined with vine copulas. The estimation of the model is however based on the discriminative likelihood, and dependencies between covariates are included in the model, only if they contribute significantly to the distinction between the two classes. Further, a scheme for model selection and estimation is presented. The methods described in this paper are implemented in the R package LogisticCopula. In order to assess the performance of our model, we ran an extensive simulation study. The results from the study, as well as from a couple of examples on real data, showed that our model performs at least as well as natural competitors, especially in the presence of non-linearities and complex interactions, even when $n$ is not large compared to $p$."}, "https://arxiv.org/abs/2410.08849": {"title": "Causal inference targeting a concentration index for studies of health inequalities", "link": "https://arxiv.org/abs/2410.08849", "description": "arXiv:2410.08849v1 Announce Type: new \nAbstract: A concentration index, a standardized covariance between a health variable and relative income ranks, is often used to quantify income-related health inequalities. There is a lack of formal approach to study the effect of an exposure, e.g., education, on such measures of inequality. In this paper we contribute by filling this gap and developing the necessary theory and method. Thus, we define a counterfactual concentration index for different levels of an exposure. We give conditions for their identification, and then deduce their efficient influence function. This allows us to propose estimators, which are regular asymptotic linear under certain conditions. In particular, these estimators are $\\sqrt n$-consistent and asymptotically normal, as well as locally efficient. The implementation of the estimators is based on the fit of several nuisance functions. The estimators proposed have rate robustness properties allowing for convergence rates slower than $\\sqrt{n}$-rate for some of the nuisance function fits. The relevance of the asymptotic results for finite samples is studied with simulation experiments. 
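The causal concentration-index entry (arXiv:2410.08849) above takes as its estimand a counterfactual version of the standard concentration index, i.e. twice the covariance between the health variable and the fractional income rank, divided by mean health. Below is a minimal sketch of the standard (non-counterfactual) index on made-up data; the paper's counterfactual and efficient-influence-function machinery is not implemented.

```python
# Standard concentration index C = 2 * cov(h, r) / mean(h), where r is the
# fractional income rank. Background for arXiv:2410.08849.
import numpy as np

def concentration_index(health, income):
    n = len(health)
    # Fractional rank in the income distribution, in (0, 1).
    order = np.argsort(income)
    ranks = np.empty(n)
    ranks[order] = (np.arange(1, n + 1) - 0.5) / n
    return 2.0 * np.cov(health, ranks, bias=True)[0, 1] / health.mean()

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)
# Health improves with income => pro-rich inequality => positive index.
health = 50 + 5 * np.log(income) + rng.normal(scale=3, size=1000)
print("concentration index:", concentration_index(health, income))
```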
We also present a case study of the effect of education on income-related health inequalities for a Swedish cohort."}, "https://arxiv.org/abs/2410.09027": {"title": "Variance reduction combining pre-experiment and in-experiment data", "link": "https://arxiv.org/abs/2410.09027", "description": "arXiv:2410.09027v1 Announce Type: new \nAbstract: Online controlled experiments (A/B testing) are essential in data-driven decision-making for many companies. Increasing the sensitivity of these experiments, particularly with a fixed sample size, relies on reducing the variance of the estimator for the average treatment effect (ATE). Existing methods like CUPED and CUPAC use pre-experiment data to reduce variance, but their effectiveness depends on the correlation between the pre-experiment data and the outcome. In contrast, in-experiment data is often more strongly correlated with the outcome and thus more informative. In this paper, we introduce a novel method that combines both pre-experiment and in-experiment data to achieve greater variance reduction than CUPED and CUPAC, without introducing bias or additional computation complexity. We also establish asymptotic theory and provide consistent variance estimators for our method. Applying this method to multiple online experiments at Etsy, we reach substantial variance reduction over CUPAC with the inclusion of only a few in-experiment covariates. These results highlight the potential of our approach to significantly improve experiment sensitivity and accelerate decision-making."}, "https://arxiv.org/abs/2410.09039": {"title": "Semi-Supervised Learning of Noisy Mixture of Experts Models", "link": "https://arxiv.org/abs/2410.09039", "description": "arXiv:2410.09039v1 Announce Type: new \nAbstract: The mixture of experts (MoE) model is a versatile framework for predictive modeling that has gained renewed interest in the age of large language models. A collection of predictive ``experts'' is learned along with a ``gating function'' that controls how much influence each expert is given when a prediction is made. This structure allows relatively simple models to excel in complex, heterogeneous data settings. In many contemporary settings, unlabeled data are widely available while labeled data are difficult to obtain. Semi-supervised learning methods seek to leverage the unlabeled data. We propose a novel method for semi-supervised learning of MoE models. We start from a semi-supervised MoE model that was developed by oceanographers that makes the strong assumption that the latent clustering structure in unlabeled data maps directly to the influence that the gating function should give each expert in the supervised task. We relax this assumption, imagining a noisy connection between the two, and propose an algorithm based on least trimmed squares, which succeeds even in the presence of misaligned data. Our theoretical analysis characterizes the conditions under which our approach yields estimators with a near-parametric rate of convergence. Simulated and real data examples demonstrate the method's efficacy."}, "https://arxiv.org/abs/2410.08362": {"title": "Towards Optimal Environmental Policies: Policy Learning under Arbitrary Bipartite Network Interference", "link": "https://arxiv.org/abs/2410.08362", "description": "arXiv:2410.08362v1 Announce Type: cross \nAbstract: The substantial effect of air pollution on cardiovascular disease and mortality burdens is well-established. 
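For the variance-reduction entry (arXiv:2410.09027) above, the sketch below shows the standard CUPED adjustment with a single pre-experiment covariate, which is the baseline the paper improves on by also exploiting in-experiment data; the data and effect sizes are illustrative only.

```python
# Standard CUPED adjustment (background for arXiv:2410.09027): the adjusted
# outcome Y - theta * (X - mean(X)) has the same mean as Y but lower variance
# when the pre-experiment covariate X is correlated with Y.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x_pre = rng.normal(size=n)                 # pre-experiment metric
treat = rng.integers(0, 2, size=n)         # random assignment
y = 0.1 * treat + 0.8 * x_pre + rng.normal(size=n)

theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre, ddof=1)
y_cuped = y - theta * (x_pre - x_pre.mean())

def ate(outcome):
    return outcome[treat == 1].mean() - outcome[treat == 0].mean()

print("ATE (raw):  ", ate(y), " variance:", y.var())
print("ATE (CUPED):", ate(y_cuped), " variance:", y_cuped.var())
```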
Emissions-reducing interventions on coal-fired power plants -- a major source of hazardous air pollution -- have proven to be an effective, but costly, strategy for reducing pollution-related health burdens. Targeting the power plants that achieve maximum health benefits while satisfying realistic cost constraints is challenging. The primary difficulty lies in quantifying the health benefits of intervening at particular plants. This is further complicated because interventions are applied to power plants, while health impacts occur in potentially distant communities, a setting known as bipartite network interference (BNI). In this paper, we introduce novel policy learning methods based on Q- and A-Learning to determine the optimal policy under arbitrary BNI. We derive asymptotic properties and demonstrate finite sample efficacy in simulations. We apply our novel methods to a comprehensive dataset of Medicare claims, power plant data, and pollution transport networks. Our goal is to determine the optimal strategy for installing power plant scrubbers to minimize ischemic heart disease (IHD) hospitalizations under various cost constraints. We find that annual IHD hospitalization rates could be reduced by 20.66 to 44.51 per 10,000 person-years through optimal policies under different cost constraints."}, "https://arxiv.org/abs/2410.08378": {"title": "Deep Generative Quantile Bayes", "link": "https://arxiv.org/abs/2410.08378", "description": "arXiv:2410.08378v1 Announce Type: cross \nAbstract: We develop a multivariate posterior sampling procedure through deep generative quantile learning. Simulation proceeds implicitly through a push-forward mapping that can transform i.i.d. random vector samples from the posterior. We utilize Monge-Kantorovich depth in multivariate quantiles to directly sample from Bayesian credible sets, a unique feature not offered by typical posterior sampling methods. To enhance the training of the quantile mapping, we design a neural network that automatically performs summary statistic extraction. This additional neural network structure has performance benefits, including support shrinkage (i.e., contraction of our posterior approximation) as the observation sample size increases. We demonstrate the usefulness of our approach on several examples where the absence of likelihood renders classical MCMC infeasible. Finally, we provide the following frequentist theoretical justifications for our quantile learning framework: consistency of the estimated vector quantile, of the recovered posterior distribution, and of the corresponding Bayesian credible sets."}, "https://arxiv.org/abs/2410.08574": {"title": "Change-point detection in regression models for ordered data via the max-EM algorithm", "link": "https://arxiv.org/abs/2410.08574", "description": "arXiv:2410.08574v1 Announce Type: cross \nAbstract: We consider the problem of breakpoint detection in a regression modeling framework. To that end, we introduce a novel method, the max-EM algorithm, which combines a constrained Hidden Markov Model with the Classification-EM (CEM) algorithm. This algorithm has linear complexity and provides accurate breakpoint detection and parameter estimation. We derive a theoretical result that shows that the likelihood of the data as a function of the regression parameters and the breakpoint locations is increased at each step of the algorithm. We also present two initialization methods for the location of the breakpoints in order to deal with local maxima issues. 
Finally, a statistical test for the one-breakpoint situation is developed. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which includes the initialization procedure and the max-EM algorithm, has strong performance both in terms of parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and a strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing and the health disease data, illustrating the ability of the method to detect heterogeneity in the distribution of the data."}, "https://arxiv.org/abs/2410.08831": {"title": "Distribution-free uncertainty quantification for inverse problems: application to weak lensing mass mapping", "link": "https://arxiv.org/abs/2410.08831", "description": "arXiv:2410.08831v1 Announce Type: cross \nAbstract: In inverse problems, distribution-free uncertainty quantification (UQ) aims to obtain error bars with coverage guarantees that are independent of any prior assumptions about the data distribution. In the context of mass mapping, uncertainties could lead to errors that affect our understanding of the underlying mass distribution, or could propagate to cosmological parameter estimation, thereby impacting the precision and reliability of cosmological models. Current surveys, such as Euclid or Rubin, will provide new weak lensing datasets of very high quality. Accurately quantifying uncertainties in mass maps is therefore critical to perform reliable cosmological parameter inference. In this paper, we extend the conformalized quantile regression (CQR) algorithm, initially proposed for scalar regression, to inverse problems. We compare our approach with another distribution-free approach based on risk-controlling prediction sets (RCPS). Both methods are based on a calibration dataset, and offer finite-sample coverage guarantees that are independent of the data distribution. Furthermore, they are applicable to any mass mapping method, including blackbox predictors. In our experiments, we apply UQ to three mass-mapping methods: the Kaiser-Squires inversion, iterative Wiener filtering, and the MCALens algorithm. Our experiments reveal that RCPS tends to produce overconservative confidence bounds with small calibration sets, whereas CQR is designed to avoid this issue. Although the expected miscoverage rate is guaranteed to stay below a user-prescribed threshold regardless of the mass mapping method, selecting an appropriate reconstruction algorithm remains crucial for obtaining accurate estimates, especially around peak-like structures, which are particularly important for inferring cosmological parameters. Additionally, the choice of mass mapping method influences the size of the error bars."}, "https://arxiv.org/abs/2410.08939": {"title": "Linear-cost unbiased posterior estimates for crossed effects and matrix factorization models via couplings", "link": "https://arxiv.org/abs/2410.08939", "description": "arXiv:2410.08939v1 Announce Type: cross \nAbstract: We design and analyze unbiased Markov chain Monte Carlo (MCMC) schemes based on couplings of blocked Gibbs samplers (BGSs), whose total computational costs scale linearly with the number of parameters and data points. Our methodology is designed for and applicable to high-dimensional BGS with conditionally independent blocks, which are often encountered in Bayesian modeling. 
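The weak-lensing UQ entry (arXiv:2410.08831) above extends conformalized quantile regression (CQR) to inverse problems. As orientation, here is a minimal sketch of scalar split-CQR (Romano et al.) on synthetic data, assuming scikit-learn's quantile gradient boosting is available; the mass-mapping extension itself is not implemented and the data are made up.

```python
# Scalar conformalized quantile regression (CQR): fit lower/upper quantile
# regressors, compute conformity scores on a held-out calibration set, and
# widen the quantile band by the calibrated margin. Background for
# arXiv:2410.08831, which extends this idea to inverse problems.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n, alpha = 2000, 0.1
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + (0.2 + 0.2 * np.abs(X[:, 0])) * rng.normal(size=n)

# Split into a training set and a calibration set.
X_tr, y_tr = X[:1000], y[:1000]
X_cal, y_cal = X[1000:], y[1000:]

lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)

# Conformity scores: how far calibration points fall outside the quantile band.
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
k = int(np.ceil((1 - alpha) * (len(y_cal) + 1)))
margin = np.sort(scores)[k - 1]

# Calibrated prediction interval for a new point.
x_new = np.array([[1.5]])
print("interval:", lo.predict(x_new) - margin, hi.predict(x_new) + margin)
```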
We provide bounds on the expected number of iterations needed for coalescence for Gaussian targets, which imply that practical two-step coupling strategies achieve coalescence times that match the relaxation times of the original BGS scheme up to a logarithmic factor. To illustrate the practical relevance of our methodology, we apply it to high-dimensional crossed random effect and probabilistic matrix factorization models, for which we develop a novel BGS scheme with improved convergence speed. Our methodology provides unbiased posterior estimates at linear cost (usually requiring only a few BGS iterations for problems with thousands of parameters), matching state-of-the-art procedures for both frequentist and Bayesian estimation of those models."}, "https://arxiv.org/abs/2205.03505": {"title": "A Flexible Quasi-Copula Distribution for Statistical Modeling", "link": "https://arxiv.org/abs/2205.03505", "description": "arXiv:2205.03505v2 Announce Type: replace \nAbstract: Copulas, generalized estimating equations, and generalized linear mixed models promote the analysis of grouped data where non-normal responses are correlated. Unfortunately, parameter estimation remains challenging in these three frameworks. Based on prior work of Tonda, we derive a new class of probability density functions that allow explicit calculation of moments, marginal and conditional distributions, and the score and observed information needed in maximum likelihood estimation. We also illustrate how the new distribution flexibly models longitudinal data following a non-Gaussian distribution. Finally, we conduct a tri-variate genome-wide association analysis on dichotomized systolic and diastolic blood pressure and body mass index data from the UK-Biobank, showcasing the modeling prowess and computational scalability of the new distribution."}, "https://arxiv.org/abs/2307.14160": {"title": "On the application of Gaussian graphical models to paired data problems", "link": "https://arxiv.org/abs/2307.14160", "description": "arXiv:2307.14160v2 Announce Type: replace \nAbstract: Gaussian graphical models are nowadays commonly applied to the comparison of groups sharing the same variables, by jointly learning their independence structures. We consider the case where there are exactly two dependent groups and the association structure is represented by a family of coloured Gaussian graphical models suited to deal with paired data problems. To learn the two dependent graphs, together with their across-graph association structure, we implement a fused graphical lasso penalty. We carry out a comprehensive analysis of this approach, with special attention to the role played by some relevant submodel classes. In this way, we provide a broad set of tools for the application of Gaussian graphical models to paired data problems. These include results useful for the specification of penalty values in order to obtain a path of lasso solutions and an ADMM algorithm that solves the fused graphical lasso optimization problem. Finally, we present an application of our method to cancer genomics where it is of interest to compare cancer cells with a control sample from histologically normal tissues adjacent to the tumor. 
All the methods described in this article are implemented in the $\\texttt{R}$ package $\\texttt{pdglasso}$ available at: https://github.com/savranciati/pdglasso."}, "https://arxiv.org/abs/2401.10824": {"title": "A versatile trivariate wrapped Cauchy copula with applications to toroidal and cylindrical data", "link": "https://arxiv.org/abs/2401.10824", "description": "arXiv:2401.10824v2 Announce Type: replace \nAbstract: In this paper, we propose a new flexible distribution for data on the three-dimensional torus which we call a trivariate wrapped Cauchy copula. Our trivariate copula has several attractive properties. It has a simple form of density and is unimodal. Its parameters are interpretable, allow an adjustable degree of dependence between every pair of variables, and can be easily estimated. The conditional distributions of the model are well-studied bivariate wrapped Cauchy distributions. Furthermore, the distribution can be easily simulated. Parameter estimation via maximum likelihood for the distribution is given and we highlight the simple implementation procedure to obtain these estimates. We compare our model to its competitors for analysing trivariate data and provide some evidence of its advantages. Another interesting feature of this model is that it can be extended to a cylindrical copula; we describe this new cylindrical copula and then give its properties. We illustrate our trivariate wrapped Cauchy copula on data from protein bioinformatics of conformational angles, and our cylindrical copula using climate data related to a buoy in the Adriatic Sea. The paper is motivated by these real trivariate datasets, but we indicate how the model can be extended to multivariate copulas."}, "https://arxiv.org/abs/1609.07630": {"title": "Low-complexity Image and Video Coding Based on an Approximate Discrete Tchebichef Transform", "link": "https://arxiv.org/abs/1609.07630", "description": "arXiv:1609.07630v4 Announce Type: replace-cross \nAbstract: The usage of linear transformations has great relevance for data decorrelation applications, like image and video compression. In that sense, the discrete Tchebichef transform (DTT) possesses useful coding and decorrelation properties. The DTT transform kernel does not depend on the input data and fast algorithms can be developed for real-time applications. However, the DTT fast algorithm presented in the literature possesses high computational complexity. In this work, we introduce a new low-complexity approximation for the DTT. The fast algorithm of the proposed transform is multiplication-free and requires a reduced number of additions and bit-shifting operations. Image and video compression simulations in popular standards show good performance of the proposed transform. Regarding hardware resource consumption, the FPGA implementation shows a 43.1% reduction in configurable logic blocks, and the ASIC place-and-route realization shows a 57.7% reduction in the area-time figure when compared with the 2-D version of the exact DTT."}, "https://arxiv.org/abs/2305.12407": {"title": "Federated Offline Policy Learning", "link": "https://arxiv.org/abs/2305.12407", "description": "arXiv:2305.12407v2 Announce Type: replace-cross \nAbstract: We consider the problem of learning personalized decision policies from observational bandit feedback data across multiple heterogeneous data sources. 
In our approach, we introduce a novel regret analysis that establishes finite-sample upper bounds on distinguishing notions of global regret for all data sources on aggregate and of local regret for any given data source. We characterize these regret bounds by expressions of source heterogeneity and distribution shift. Moreover, we examine the practical considerations of this problem in the federated setting where a central server aims to train a policy on data distributed across the heterogeneous sources without collecting any of their raw data. We present a policy learning algorithm amenable to federation based on the aggregation of local policies trained with doubly robust offline policy evaluation strategies. Our analysis and supporting experimental results provide insights into tradeoffs in the participation of heterogeneous data sources in offline policy learning."}, "https://arxiv.org/abs/2306.01890": {"title": "Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning", "link": "https://arxiv.org/abs/2306.01890", "description": "arXiv:2306.01890v3 Announce Type: replace-cross \nAbstract: Distance-based clustering and classification are widely used in various fields to group mixed numeric and categorical data. In many algorithms, a predefined distance measurement is used to cluster data points based on their dissimilarity. While there exist numerous distance-based measures for data with purely numerical attributes and several ordered and unordered categorical metrics, an efficient and accurate distance for mixed-type data that utilizes the continuous and discrete properties simultaneously is an open problem. Many metrics convert numerical attributes to categorical ones or vice versa. They either handle the data points as a single attribute type or calculate a distance for each attribute separately and add them up. We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity, with cross-validated optimal bandwidth selection. We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric, and improves clustering accuracy when utilized in existing distance-based clustering algorithms on simulated and real-world datasets containing continuous-only, categorical-only, and mixed-type data."}, "https://arxiv.org/abs/2312.15595": {"title": "Zero-Inflated Bandits", "link": "https://arxiv.org/abs/2312.15595", "description": "arXiv:2312.15595v2 Announce Type: replace-cross \nAbstract: Many real applications of bandits have sparse non-zero rewards, leading to slow learning speed. Using problem-specific structures for careful distribution modeling is known to be critical to estimation efficiency in statistics, yet is under-explored in bandits. We initiate the study of zero-inflated bandits, where the reward is modeled as a classic semi-parametric distribution called zero-inflated distribution. We design Upper Confidence Bound- and Thompson Sampling-type algorithms for this specific structure. We derive the regret bounds under both multi-armed bandits with general reward assumptions and contextual generalized linear bandits with sub-Gaussian rewards. In many settings, the regret rates of our algorithms are either minimax optimal or state-of-the-art. 
The superior empirical performance of our methods is demonstrated via numerical studies."}, "https://arxiv.org/abs/2410.09217": {"title": "Flexibly Modeling Shocks to Demographic and Health Indicators with Bayesian Shrinkage Priors", "link": "https://arxiv.org/abs/2410.09217", "description": "arXiv:2410.09217v1 Announce Type: new \nAbstract: Demographic and health indicators may exhibit short or large short-term shocks; for example, armed conflicts, epidemics, or famines may cause shocks in period measures of life expectancy. Statistical models for estimating historical trends and generating future projections of these indicators for a large number of populations may be biased or not well probabilistically calibrated if they do not account for the presence of shocks. We propose a flexible method for modeling shocks when producing estimates and projections for multiple populations. The proposed approach makes no assumptions about the shape or duration of a shock, and requires no prior knowledge of when shocks may have occurred. Our approach is based on the modeling of shocks in level of the indicator of interest. We use Bayesian shrinkage priors such that shock terms are shrunk to zero unless the data suggest otherwise. The method is demonstrated in a model for male period life expectancy at birth. We use as a starting point an existing projection model and expand it by including the shock terms, modeled by the Bayesian shrinkage priors. Out-of-sample validation exercises find that including shocks in the model results in sharper uncertainty intervals without sacrificing empirical coverage or prediction error."}, "https://arxiv.org/abs/2410.09267": {"title": "Experimentation on Endogenous Graphs", "link": "https://arxiv.org/abs/2410.09267", "description": "arXiv:2410.09267v1 Announce Type: new \nAbstract: We study experimentation under endogenous network interference. Interference patterns are mediated by an endogenous graph, where edges can be formed or eliminated as a result of treatment. We show that conventional estimators are biased in these circumstances, and present a class of unbiased, consistent and asymptotically normal estimators of total treatment effects in the presence of such interference. Our results apply both to bipartite experimentation, in which the units of analysis and measurement differ, and the standard network experimentation case, in which they are the same."}, "https://arxiv.org/abs/2410.09274": {"title": "Hierarchical Latent Class Models for Mortality Surveillance Using Partially Verified Verbal Autopsies", "link": "https://arxiv.org/abs/2410.09274", "description": "arXiv:2410.09274v1 Announce Type: new \nAbstract: Monitoring data on causes of death is an important part of understanding the burden of diseases and effects of public health interventions. Verbal autopsy (VA) is a well-established method for gathering information about deaths outside of hospitals by conducting an interview to family members or caregivers of a deceased person. Existing cause-of-death assignment algorithms using VA data require either domain knowledge about the symptom-cause relationship, or large training datasets. When a new disease emerges, however, only limited information on symptom-cause relationship exists and training data are usually lacking, making it challenging to evaluate the impact of the disease. In this paper, we propose a novel Bayesian framework to estimate the fraction of deaths due to an emerging disease using VAs collected with partially verified cause of death. 
We use a latent class model to capture the distribution of symptoms and their dependence in a parsimonious way. We discuss potential sources of bias that may occur due to the cause-of-death verification process and adapt our framework to account for the verification mechanism. We also develop structured priors to improve prevalence estimation for sub-populations. We demonstrate the performance of our model using a mortality surveillance dataset that includes suspected COVID-19 related deaths in Brazil in 2021."}, "https://arxiv.org/abs/2410.09278": {"title": "Measurement Error Correction for Spatially Defined Environmental Exposures in Survival Analysis", "link": "https://arxiv.org/abs/2410.09278", "description": "arXiv:2410.09278v1 Announce Type: new \nAbstract: Environmental exposures are often defined using buffer zones around geocoded home addresses, but these static boundaries can miss dynamic daily activity patterns, leading to biased results. This paper presents a novel measurement error correction method for spatially defined environmental exposures within a survival analysis framework using the Cox proportional hazards model. The method corrects high-dimensional surrogate exposures from geocoded residential data at multiple buffer radii by applying principal component analysis for dimension reduction and leveraging external GPS-tracked validation datasets containing true exposure measurements. It also derives the asymptotic properties and variances of the proposed estimators. Extensive simulations are conducted to evaluate the performance of the proposed estimators, demonstrating its ability to improve accuracy in estimated exposure effects. An illustrative application assesses the impact of greenness exposure on depression incidence in the Nurses' Health Study (NHS). The results demonstrate that correcting for measurement error significantly enhances the accuracy of exposure estimates. This method offers a critical advancement for accurately assessing the health impacts of environmental exposures, outperforming traditional static buffer approaches."}, "https://arxiv.org/abs/2410.09282": {"title": "Anytime-Valid Continuous-Time Confidence Processes for Inhomogeneous Poisson Processes", "link": "https://arxiv.org/abs/2410.09282", "description": "arXiv:2410.09282v1 Announce Type: new \nAbstract: Motivated by monitoring the arrival of incoming adverse events such as customer support calls or crash reports from users exposed to an experimental product change, we consider sequential hypothesis testing of continuous-time inhomogeneous Poisson point processes. Specifically, we provide an interval-valued confidence process $C^\\alpha(t)$ over continuous time $t$ for the cumulative arrival rate $\\Lambda(t) = \\int_0^t \\lambda(s) \\mathrm{d}s$ with a continuous-time anytime-valid coverage guarantee $\\mathbb{P}[\\Lambda(t) \\in C^\\alpha(t) \\, \\forall t >0] \\geq 1-\\alpha$. We extend our results to compare two independent arrival processes by constructing multivariate confidence processes and a closed-form $e$-process for testing the equality of rates with a time-uniform Type-I error guarantee at a nominal $\\alpha$. We characterize the asymptotic growth rate of the proposed $e$-process under the alternative and show that it has power 1 when the average rates of the two Poisson process differ in the limit. 
We also observe a complementary relationship between our multivariate confidence process and the universal inference $e$-process for testing composite null hypotheses."}, "https://arxiv.org/abs/2410.09504": {"title": "Bayesian Transfer Learning for Artificially Intelligent Geospatial Systems: A Predictive Stacking Approach", "link": "https://arxiv.org/abs/2410.09504", "description": "arXiv:2410.09504v1 Announce Type: new \nAbstract: Building artificially intelligent geospatial systems require rapid delivery of spatial data analysis at massive scales with minimal human intervention. Depending upon their intended use, data analysis may also entail model assessment and uncertainty quantification. This article devises transfer learning frameworks for deployment in artificially intelligent systems, where a massive data set is split into smaller data sets that stream into the analytical framework to propagate learning and assimilate inference for the entire data set. Specifically, we introduce Bayesian predictive stacking for multivariate spatial data and demonstrate its effectiveness in rapidly analyzing massive data sets. Furthermore, we make inference feasible in a reasonable amount of time, and without excessively demanding hardware settings. We illustrate the effectiveness of this approach in extensive simulation experiments and subsequently analyze massive data sets in climate science on sea surface temperatures and on vegetation index."}, "https://arxiv.org/abs/2410.09506": {"title": "Distribution-Aware Mean Estimation under User-level Local Differential Privacy", "link": "https://arxiv.org/abs/2410.09506", "description": "arXiv:2410.09506v1 Announce Type: new \nAbstract: We consider the problem of mean estimation under user-level local differential privacy, where $n$ users are contributing through their local pool of data samples. Previous work assume that the number of data samples is the same across users. In contrast, we consider a more general and realistic scenario where each user $u \\in [n]$ owns $m_u$ data samples drawn from some generative distribution $\\mu$; $m_u$ being unknown to the statistician but drawn from a known distribution $M$ over $\\mathbb{N}^\\star$. Based on a distribution-aware mean estimation algorithm, we establish an $M$-dependent upper bounds on the worst-case risk over $\\mu$ for the task of mean estimation. We then derive a lower bound. The two bounds are asymptotically matching up to logarithmic factors and reduce to known bounds when $m_u = m$ for any user $u$."}, "https://arxiv.org/abs/2410.09552": {"title": "Model-based clustering of time-dependent observations with common structural changes", "link": "https://arxiv.org/abs/2410.09552", "description": "arXiv:2410.09552v1 Announce Type: new \nAbstract: We propose a novel model-based clustering approach for samples of time series. We assume as a unique commonality that two observations belong to the same group if structural changes in their behaviours happen at the same time. We resort to a latent representation of structural changes in each time series based on random orders to induce ties among different observations. Such an approach results in a general modeling strategy and can be combined with many time-dependent models known in the literature. 
Our studies have been motivated by an epidemiological problem, where we want to provide clusters of different countries of the European Union, where two countries belong to the same cluster if the spreading processes of the COVID-19 virus had structural changes at the same time."}, "https://arxiv.org/abs/2410.09665": {"title": "ipd: An R Package for Conducting Inference on Predicted Data", "link": "https://arxiv.org/abs/2410.09665", "description": "arXiv:2410.09665v1 Announce Type: new \nAbstract: Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recent proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage. Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage `vignette' are available at github.com/ipd-tools/ipd. Contact: jtleek@fredhutch.org and tylermc@uw.edu"}, "https://arxiv.org/abs/2410.09712": {"title": "Random effects model-based sufficient dimension reduction for independent clustered data", "link": "https://arxiv.org/abs/2410.09712", "description": "arXiv:2410.09712v1 Announce Type: new \nAbstract: Sufficient dimension reduction (SDR) is a popular class of regression methods which aim to find a small number of linear combinations of covariates that capture all the information of the responses i.e., a central subspace. The majority of current methods for SDR focus on the setting of independent observations, while the few techniques that have been developed for clustered data assume the linear transformation is identical across clusters. In this article, we introduce random effects SDR, where cluster-specific random effect central subspaces are assumed to follow a distribution on the Grassmann manifold, and the random effects distribution is characterized by a covariance matrix that captures the heterogeneity between clusters in the SDR process itself. We incorporate random effect SDR within a model-based inverse regression framework. Specifically, we propose a random effects principal fitted components model, where a two-stage algorithm is used to estimate the overall fixed effect central subspace, and predict the cluster-specific random effect central subspaces. We demonstrate the consistency of the proposed estimators, while simulation studies demonstrate the superior performance of the proposed approach compared to global and cluster-specific SDR approaches. We also present extensions of the above model to handle mixed predictors, demonstrating how random effects SDR can be achieved in the case of mixed continuous and binary covariates. 
Applying the proposed methods to study the longitudinal association between the life expectancy of women and socioeconomic variables across 117 countries, we find log income per capita, infant mortality, and income inequality are the main drivers of a two-dimensional fixed effect central subspace, although there is considerable heterogeneity in how the country-specific central subspaces are driven by the predictors."}, "https://arxiv.org/abs/2410.09808": {"title": "Optimal item calibration in the context of the Swedish Scholastic Aptitude Test", "link": "https://arxiv.org/abs/2410.09808", "description": "arXiv:2410.09808v1 Announce Type: new \nAbstract: Large scale achievement tests require the existence of item banks with items for use in future tests. Before an item is included into the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of the future achievement tests, it is important to perform this calibration well and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate calibration items to examinees with the most suited ability. Theoretical evidence shows advantages with using ability-dependent allocation of calibration items. However, it is not clear whether these theoretical results hold also in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify for what kind of items the method works well. This enables targeting specific item types for optimal calibration. We also discuss possibilities for improvements of the method."}, "https://arxiv.org/abs/2410.09810": {"title": "Doubly unfolded adjacency spectral embedding of dynamic multiplex graphs", "link": "https://arxiv.org/abs/2410.09810", "description": "arXiv:2410.09810v1 Announce Type: new \nAbstract: Many real-world networks evolve dynamically over time and present different types of connections between nodes, often called layers. In this work, we propose a latent position model for these objects, called the dynamic multiplex random dot product graph (DMPRDPG), which uses an inner product between layer-specific and time-specific latent representations of the nodes to obtain edge probabilities. We further introduce a computationally efficient spectral embedding method for estimation of DMPRDPG parameters, called doubly unfolded adjacency spectral embedding (DUASE). The DUASE estimates are proved to be consistent and asymptotically normally distributed, demonstrating the optimality properties of the proposed estimator. A key strength of our method is the encoding of time-specific node representations and layer-specific effects in separate latent spaces, which allows the model to capture complex behaviours while maintaining relatively low dimensionality. The embedding method we propose can also be efficiently used for subsequent inference tasks. In particular, we highlight the use of the ISOMAP algorithm in conjunction with DUASE as a way to efficiently capture trends and global changepoints within a network, and the use of DUASE for graph clustering. 
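The DMPRDPG entry (arXiv:2410.09810) above builds on unfolded adjacency spectral embedding. The sketch below shows standard UASE for a sequence of adjacency matrices on a shared node set: concatenate the matrices column-wise, take a truncated SVD, and read off a shared node embedding plus time-specific embeddings. It is a simplified illustration, not the doubly unfolded (DUASE) estimator of the paper, and the toy network generator is made up.

```python
# Standard unfolded adjacency spectral embedding (UASE), background for
# arXiv:2410.09810 (DUASE). Left singular vectors give a shared node
# embedding; right singular vectors give time-specific embeddings.
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(5)
n, T, d = 100, 4, 2

# Toy dynamic network: two communities whose cross-community link rate drifts.
z = rng.integers(0, 2, size=n)
adjacencies = []
for t in range(T):
    B = np.array([[0.5, 0.1 + 0.05 * t], [0.1 + 0.05 * t, 0.5]])
    P = B[z][:, z]
    A = rng.binomial(1, P)
    A = np.triu(A, 1)
    A = A + A.T                       # symmetric, no self-loops
    adjacencies.append(A.astype(float))

unfolded = np.hstack(adjacencies)                 # n x (n*T)
U, s, Vt = svds(unfolded, k=d)
shared = U * np.sqrt(s)                           # n x d shared node embedding
per_time = (Vt.T * np.sqrt(s)).reshape(T, n, d)   # time-specific embeddings

print("shared embedding shape:", shared.shape)
print("per-time embedding shape:", per_time.shape)
```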
Applications on real-world networks describing geopolitical interactions between countries and financial news reporting demonstrate practical uses of our method."}, "https://arxiv.org/abs/2410.09825": {"title": "Nickell Meets Stambaugh: A Tale of Two Biases in Panel Predictive Regressions", "link": "https://arxiv.org/abs/2410.09825", "description": "arXiv:2410.09825v1 Announce Type: new \nAbstract: In panel predictive regressions with persistent covariates, coexistence of the Nickell bias and the Stambaugh bias imposes challenges for hypothesis testing. This paper introduces a new estimator, the IVX-X-Jackknife (IVXJ), which effectively removes this composite bias and reinstates standard inferential procedures. The IVXJ estimator is inspired by the IVX technique in time series. In panel data where the cross section is of the same order as the time dimension, the bias of the baseline panel IVX estimator can be corrected via an analytical formula by leveraging an innovative X-Jackknife scheme that divides the time dimension into the odd and even indices. IVXJ is the first procedure that achieves unified inference across a wide range of modes of persistence in panel predictive regressions, whereas such unified inference is unattainable for the popular within-group estimator. Extended to accommodate long-horizon predictions with multiple regressions, IVXJ is used to examine the impact of debt levels on financial crises by panel local projection. Our empirics provide comparable results across different categories of debt."}, "https://arxiv.org/abs/2410.09884": {"title": "Detecting Structural Shifts and Estimating Change-Points in Interval-Based Time Series", "link": "https://arxiv.org/abs/2410.09884", "description": "arXiv:2410.09884v1 Announce Type: new \nAbstract: This paper addresses the open problem of conducting change-point analysis for interval-valued time series data using the maximum likelihood estimation (MLE) framework. Motivated by financial time series, we analyze data that includes daily opening (O), up (U), low (L), and closing (C) values, rather than just a closing value as traditionally used. To tackle this, we propose a fundamental model based on stochastic differential equations, which also serves as a transformation of other widely used models, such as the log-transformed geometric Brownian motion model. We derive the joint distribution for these interval-valued observations using the reflection principle and Girsanov's theorem. The MLE is obtained by optimizing the log-likelihood function through first and second-order derivative calculations, utilizing the Newton-Raphson algorithm. We further propose a novel parametric bootstrap method to compute confidence intervals, addressing challenges related to temporal dependency and interval-based data relationships. The performance of the model is evaluated through extensive simulations and real data analysis using S&P500 returns during the 2022 Russo-Ukrainian War. 
The results demonstrate that the proposed OULC model consistently outperforms the traditional OC model, offering more accurate and reliable change-point detection and parameter estimates."}, "https://arxiv.org/abs/2410.09892": {"title": "A Bayesian promotion time cure model with current status data", "link": "https://arxiv.org/abs/2410.09892", "description": "arXiv:2410.09892v1 Announce Type: new \nAbstract: Analysis of lifetime data from epidemiological studies or destructive testing often involves current status censoring, wherein individuals are examined only once and their event status is recorded only at that specific time point. In practice, some of these individuals may never experience the event of interest, leading to current status data with a cured fraction. Cure models are used to estimate the proportion of non-susceptible individuals, the distribution of susceptible ones, and covariate effects. Motivated from a biological interpretation of cancer metastasis, promotion time cure model is a popular alternative to the mixture cure rate model for analysing such data. The current study is the first to put forth a Bayesian inference procedure for analysing current status data with a cure fraction, resorting to a promotion time cure model. An adaptive Metropolis-Hastings algorithm is utilised for posterior computation. Simulation studies prove our approach's efficiency, while analyses of lung tumor and breast cancer data illustrate its practical utility. This approach has the potential to improve clinical cure rates through the incorporation of prior knowledge regarding the disease dynamics and therapeutic options."}, "https://arxiv.org/abs/2410.09898": {"title": "A Bayesian Joint Modelling of Current Status and Current Count Data", "link": "https://arxiv.org/abs/2410.09898", "description": "arXiv:2410.09898v1 Announce Type: new \nAbstract: Current status censoring or case I interval censoring takes place when subjects in a study are observed just once to check if a particular event has occurred. If the event is recurring, the data are classified as current count data; if non-recurring, they are classified as current status data. Several instances of dependence of these recurring and non-recurring events are observable in epidemiology and pathology. Estimation of the degree of this dependence and identification of major risk factors for the events are the major objectives of such studies. The current study proposes a Bayesian method for the joint modelling of such related events, employing a shared frailty-based semiparametric regression model. Computational implementation makes use of an adaptive Metropolis-Hastings algorithm. Simulation studies are put into use to show the effectiveness of the method proposed and fracture-osteoporosis data are worked through to highlight its application."}, "https://arxiv.org/abs/2410.09952": {"title": "Large Scale Longitudinal Experiments: Estimation and Inference", "link": "https://arxiv.org/abs/2410.09952", "description": "arXiv:2410.09952v1 Announce Type: new \nAbstract: Large-scale randomized experiments are seldom analyzed using panel regression methods because of computational challenges arising from the presence of millions of nuisance parameters. We leverage Mundlak's insight that unit intercepts can be eliminated by using carefully chosen averages of the regressors to rewrite several common estimators in a form that is amenable to weighted-least squares estimation with frequency weights. 
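To make the compression idea in the large-scale experiments entry (arXiv:2410.09952) concrete, the sketch below collapses an experiment with strata intercepts to (stratum, treatment) cells and recovers the full-data OLS point estimates by weighted least squares with cell counts as frequency weights. This works because the regressors are constant within cells; it is an illustration with made-up data and plain numpy/pandas, not the pyfixest or duckreg implementations mentioned in the abstract.

```python
# Compression sketch for arXiv:2410.09952: regressing cell means on
# cell-level regressors with cell counts as weights reproduces the
# full-data OLS coefficients when regressors are constant within cells.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n, S = 100_000, 20
df = pd.DataFrame({
    "stratum": rng.integers(0, S, size=n),
    "treat": rng.integers(0, 2, size=n),
})
df["y"] = 0.25 * df["treat"] + 0.1 * df["stratum"] + rng.normal(size=n)

def design(frame):
    # Treatment indicator plus a full set of stratum dummies (no intercept).
    strata = pd.get_dummies(frame["stratum"], dtype=float).to_numpy()
    return np.column_stack([frame["treat"].to_numpy(dtype=float), strata])

# Full-data OLS.
beta_full, *_ = np.linalg.lstsq(design(df), df["y"].to_numpy(), rcond=None)

# Compressed data: one row per (stratum, treatment) cell.
cells = df.groupby(["stratum", "treat"], as_index=False).agg(
    y=("y", "mean"), n=("y", "size"))
Xc, yc, w = design(cells), cells["y"].to_numpy(), cells["n"].to_numpy(dtype=float)

# Weighted least squares with frequency weights via sqrt-weight rescaling.
sw = np.sqrt(w)
beta_comp, *_ = np.linalg.lstsq(Xc * sw[:, None], yc * sw, rcond=None)

print("treatment effect (full):      ", beta_full[0])
print("treatment effect (compressed):", beta_comp[0])
```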
This renders regressions involving arbitrary strata intercepts tractable with very large datasets, optionally with the key compression step computed out-of-memory in SQL. We demonstrate that these methods yield more precise estimates than other commonly used estimators, and also find that the compression strategy greatly increases computational efficiency. We provide in-memory (pyfixest) and out-of-memory (duckreg) Python libraries to implement these estimators."}, "https://arxiv.org/abs/2410.10025": {"title": "Sparse Multivariate Linear Regression with Strongly Associated Response Variables", "link": "https://arxiv.org/abs/2410.10025", "description": "arXiv:2410.10025v1 Announce Type: new \nAbstract: We propose new methods for multivariate linear regression when the regression coefficient matrix is sparse and the error covariance matrix is dense. We assume that the error covariance matrix has equicorrelation across the response variables. Two procedures are proposed: one is based on constant marginal response variance (compound symmetry), and the other is based on general varying marginal response variance. Two approximate procedures are also developed for high dimensions. We propose an approximation to the Gaussian validation likelihood for tuning parameter selection. Extensive numerical experiments illustrate when our procedures outperform relevant competitors as well as their robustness to model misspecification."}, "https://arxiv.org/abs/2410.10345": {"title": "Effective Positive Cauchy Combination Test", "link": "https://arxiv.org/abs/2410.10345", "description": "arXiv:2410.10345v1 Announce Type: new \nAbstract: In the field of multiple hypothesis testing, combining p-values represents a fundamental statistical method. The Cauchy combination test (CCT) (Liu and Xie, 2020) excels among numerous methods for combining p-values with powerful and computationally efficient performance. However, large p-values may diminish the significance of testing, even when extremely small p-values exist. We propose a novel approach named the positive Cauchy combination test (PCCT) to surmount this flaw. Building on the relationship between the PCCT and CCT methods, we obtain critical values by applying the Cauchy distribution to the PCCT statistic. We find, however, that the PCCT tends to be effective only when the significance level is substantially small or the test statistics are strongly correlated. Otherwise, it becomes challenging to control type I errors, a problem that also pertains to the CCT. Thanks to the theories of stable distributions and the generalized central limit theorem, we derive critical values under weak dependence that effectively control type I errors for any given significance level. For more general scenarios, we correct the test statistic using the generalized mean method, which can control the size under any dependence structure and cannot be further optimized. Our method exhibits excellent performance, as demonstrated through comprehensive simulation studies. We further validate the effectiveness of our proposed method by applying it to a genetic dataset."}, "https://arxiv.org/abs/2410.10354": {"title": "Bayesian nonparametric modeling of heterogeneous populations of networks", "link": "https://arxiv.org/abs/2410.10354", "description": "arXiv:2410.10354v1 Announce Type: new \nAbstract: The increasing availability of multiple network data has highlighted the need for statistical models for heterogeneous populations of networks. 
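For reference alongside the PCCT entry (arXiv:2410.10345) above, here is a minimal sketch of the standard Cauchy combination test of Liu and Xie (2020) that the proposal modifies: a weighted sum of Cauchy-transformed p-values whose tail is approximated by a standard Cauchy distribution. The positive-part correction introduced in the paper is not implemented, and the example p-values are made up.

```python
# Standard Cauchy combination test (Liu & Xie, 2020), the baseline that the
# PCCT entry (arXiv:2410.10345) modifies. The combined statistic is a weighted
# sum of Cauchy-transformed p-values; its right tail is approximated by a
# standard Cauchy distribution even under many forms of dependence.
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvalues, weights=None):
    p = np.asarray(pvalues, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights)
    stat = np.sum(w * np.tan((0.5 - p) * np.pi))
    return stat, cauchy.sf(stat)   # right-tail p-value of a standard Cauchy

# Example: mostly null p-values plus one strong signal.
pvals = [0.8, 0.45, 0.3, 0.6, 1e-6]
stat, p_combined = cauchy_combination(pvals)
print("CCT statistic:", stat, " combined p-value:", p_combined)
```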
A convenient framework makes use of metrics to measure similarity between networks. In this context, we propose a novel Bayesian nonparametric model that identifies clusters of networks characterized by similar connectivity patterns. Our approach relies on a location-scale Dirichlet process mixture of centered Erd\\H{o}s--R\\'enyi kernels, with components parametrized by a unique network representative, or mode, and a univariate measure of dispersion around the mode. We demonstrate that this model has full support in the Kullback--Leibler sense and is strongly consistent. An efficient Markov chain Monte Carlo scheme is proposed for posterior inference and clustering of multiple network data. The performance of the model is validated through extensive simulation studies, showing improvements over state-of-the-art methods. Additionally, we present an effective strategy to extend the application of the proposed model to datasets with a large number of nodes. We illustrate our approach with the analysis of human brain network data."}, "https://arxiv.org/abs/2410.10482": {"title": "Regression Model for Speckled Data with Extremely Variability", "link": "https://arxiv.org/abs/2410.10482", "description": "arXiv:2410.10482v1 Announce Type: new \nAbstract: Synthetic aperture radar (SAR) is an efficient and widely used remote sensing tool. However, data extracted from SAR images are contaminated with speckle, which precludes the application of techniques based on the assumption of additive and normally distributed noise. One of the most successful approaches to describing such data is the multiplicative model, where intensities can follow a variety of distributions with positive support. The $\\mathcal{G}^0_I$ model is among the most successful ones. Although several estimation methods for the $\\mathcal{G}^0_I$ parameters have been proposed, there is no work exploring a regression structure for this model. Such a structure could allow us to infer unobserved values from available ones. In this work, we propose a $\\mathcal{G}^0_I$ regression model and use it to describe the influence of intensities from other polarimetric channels. We derive some theoretical properties for the new model: Fisher information matrix, residual measures, and influential tools. Maximum likelihood point and interval estimation methods are proposed and evaluated by Monte Carlo experiments. Results from simulated and actual data show that the new model can be helpful for SAR image analysis."}, "https://arxiv.org/abs/2410.10618": {"title": "An Approximate Identity Link Function for Bayesian Generalized Linear Models", "link": "https://arxiv.org/abs/2410.10618", "description": "arXiv:2410.10618v1 Announce Type: new \nAbstract: In this note, we consider using a link function that has heavier tails than the usual exponential link function. We construct efficient Gibbs algorithms for Poisson and Multinomial models based on this link function by introducing gamma and inverse Gaussian latent variables and show that the algorithms generate geometrically ergodic Markov chains in simple settings. Our algorithms can be used for more complicated models with many parameters. We fit our simple Poisson model to a real dataset and confirm that the posterior distribution has similar implications to those under the usual Poisson regression model based on the exponential link function. 
Although less interpretable, our models are potentially more tractable or flexible from a computational point of view in some cases."}, "https://arxiv.org/abs/2410.10619": {"title": "Partially exchangeable stochastic block models for multilayer networks", "link": "https://arxiv.org/abs/2410.10619", "description": "arXiv:2410.10619v1 Announce Type: new \nAbstract: Multilayer networks generalize single-layered connectivity data in several directions. These generalizations include, among others, settings where multiple types of edges are observed among the same set of nodes (edge-colored networks) or where a single notion of connectivity is measured between nodes belonging to different pre-specified layers (node-colored networks). While progress has been made in statistical modeling of edge-colored networks, principled approaches that flexibly account for both within and across layer block-connectivity structures while incorporating layer information through a rigorous probabilistic construction are still lacking for node-colored multilayer networks. We fill this gap by introducing a novel class of partially exchangeable stochastic block models specified in terms of a hierarchical random partition prior for the allocation of nodes to groups, whose number is learned by the model. This goal is achieved without jeopardizing probabilistic coherence, uncertainty quantification and derivation of closed-form predictive within- and across-layer co-clustering probabilities. Our approach facilitates prior elicitation, the understanding of theoretical properties and the development of yet-unexplored predictive strategies for both the connections and the allocations of future incoming nodes. Posterior inference proceeds via a tractable collapsed Gibbs sampler, while performance is illustrated in simulations and in a real-world criminal network application. The notable gains achieved over competitors clarify the importance of developing general stochastic block models based on suitable node-exchangeability structures coherent with the type of multilayer network being analyzed."}, "https://arxiv.org/abs/2410.10633": {"title": "Missing data imputation using a truncated infinite factor model with application to metabolomics data", "link": "https://arxiv.org/abs/2410.10633", "description": "arXiv:2410.10633v1 Announce Type: new \nAbstract: In metabolomics, the study of small molecules in biological samples, data are often acquired through mass spectrometry. The resulting data contain highly correlated variables, typically with a larger number of variables than observations. Missing data are prevalent, and imputation is critical as data acquisition can be difficult and expensive, and many analysis methods necessitate complete data. In such data, missing at random (MAR) missingness occurs due to acquisition or processing error, while missing not at random (MNAR) missingness occurs when true values lie below the threshold for detection. Existing imputation methods generally assume one missingness type, or impute values outside the physical constraints of the data, which lack utility. A truncated factor analysis model with an infinite number of factors (tIFA) is proposed to facilitate imputation in metabolomics data, in a statistically and physically principled manner. Truncated distributional assumptions underpin tIFA, ensuring cognisance of the data's physical constraints when imputing. 
Further, tIFA allows for both MAR and MNAR missingness, and a Bayesian inferential approach provides uncertainty quantification for imputed values and missingness types. The infinite factor model parsimoniously models the high-dimensional, multicollinear data, with nonparametric shrinkage priors obviating the need for model selection tools to infer the number of latent factors. A simulation study is performed to assess the performance of tIFA and an application to a urinary metabolomics dataset results in a full dataset with practically useful imputed values, and associated uncertainty, ready for use in metabolomics analyses. Open-source R code accompanies tIFA, facilitating its widespread use."}, "https://arxiv.org/abs/2410.10723": {"title": "Statistically and computationally efficient conditional mean imputation for censored covariates", "link": "https://arxiv.org/abs/2410.10723", "description": "arXiv:2410.10723v1 Announce Type: new \nAbstract: Censored, missing, and error-prone covariates are all coarsened data types for which the true values are unknown. Many methods to handle the unobserved values, including imputation, are shared between these data types, with nuances based on the mechanism dominating the unobserved values and any other available information. For example, in prospective studies, the time to a specific disease diagnosis will be incompletely observed if only some patients are diagnosed by the end of the follow-up. Specifically, some times will be randomly right-censored, and patients' disease-free follow-up times must be incorporated into their imputed values. Assuming noninformative censoring, these censored values are replaced with their conditional means, the calculations of which require (i) estimating the conditional distribution of the censored covariate and (ii) integrating over the corresponding survival function. Semiparametric approaches are common, which estimate the distribution with a Cox proportional hazards model and then the integral with the trapezoidal rule. While these approaches offer robustness, they come at the cost of statistical and computational efficiency. We propose a general framework for parametric conditional mean imputation of censored covariates that offers better statistical precision and requires less computational strain by modeling the survival function parametrically, where conditional means often have an analytic solution. The framework is implemented in the open-source R package, speedyCMI."}, "https://arxiv.org/abs/2410.10749": {"title": "Testing the order of fractional integration in the presence of smooth trends, with an application to UK Great Ratios", "link": "https://arxiv.org/abs/2410.10749", "description": "arXiv:2410.10749v1 Announce Type: new \nAbstract: This note proposes semi-parametric tests for investigating whether a stochastic process is fractionally integrated of order $\\delta$, where $|\\delta| < 1/2$, when smooth trends are present in the model. We combine the semi-parametric approach by Iacone, Nielsen & Taylor (2022) to model the short range dependence with the use of Chebyshev polynomials by Cuestas & Gil-Alana to describe smooth trends. Our proposed statistics have standard limiting null distributions and match the asymptotic local power of infeasible tests based on unobserved errors. We also establish the conditions under which an information criterion can consistently estimate the order of the Chebyshev polynomial. 
The finite sample performance is evaluated using simulations, and an empirical application is given for the UK Great Ratios."}, "https://arxiv.org/abs/2410.10772": {"title": "Peer effects in the linear-in-means model may be inestimable even when identified", "link": "https://arxiv.org/abs/2410.10772", "description": "arXiv:2410.10772v1 Announce Type: new \nAbstract: Linear-in-means models are widely used to investigate peer effects. Identifying peer effects in these models is challenging, but conditions for identification are well-known. However, even when peer effects are identified, they may not be estimable, due to an asymptotic colinearity issue: as sample size increases, peer effects become more and more linearly dependent. We show that asymptotic colinearity occurs whenever nodal covariates are independent of the network and the minimum degree of the network is growing. Asymptotic colinearity can cause estimators to be inconsistent or to converge at slower than expected rates. We also demonstrate that dependence between nodal covariates and network structure can alleviate colinearity issues in random dot product graphs. These results suggest that linear-in-means models are less reliable for studying peer influence than previously believed."}, "https://arxiv.org/abs/2410.09227": {"title": "Fast Data-independent KLT Approximations Based on Integer Functions", "link": "https://arxiv.org/abs/2410.09227", "description": "arXiv:2410.09227v1 Announce Type: cross \nAbstract: The Karhunen-Lo\\`eve transform (KLT) stands as a well-established discrete transform, demonstrating optimal characteristics in data decorrelation and dimensionality reduction. Its ability to condense energy compression into a select few main components has rendered it instrumental in various applications within image compression frameworks. However, computing the KLT depends on the covariance matrix of the input data, which makes it difficult to develop fast algorithms for its implementation. Approximations for the KLT, utilizing specific rounding functions, have been introduced to reduce its computational complexity. Therefore, our paper introduces a category of low-complexity, data-independent KLT approximations, employing a range of round-off functions. The design methodology of the approximate transform is defined for any block-length $N$, but emphasis is given to transforms of $N = 8$ due to its wide use in image and video compression. The proposed transforms perform well when compared to the exact KLT and approximations considering classical performance measures. For particular scenarios, our proposed transforms demonstrated superior performance when compared to KLT approximations documented in the literature. We also developed fast algorithms for the proposed transforms, further reducing the arithmetic cost associated with their implementation. Evaluation of field programmable gate array (FPGA) hardware implementation metrics was conducted. Practical applications in image encoding showed the relevance of the proposed transforms. In fact, we showed that one of the proposed transforms outperformed the exact KLT given certain compression ratios."}, "https://arxiv.org/abs/2410.09835": {"title": "Knockoffs for exchangeable categorical covariates", "link": "https://arxiv.org/abs/2410.09835", "description": "arXiv:2410.09835v1 Announce Type: cross \nAbstract: Let $X=(X_1,\\ldots,X_p)$ be a $p$-variate random vector and $F$ a fixed finite set. 
In a number of applications, mainly in genetics, it turns out that $X_i\\in F$ for each $i=1,\\ldots,p$. Despite the latter fact, to obtain a knockoff $\\widetilde{X}$ (in the sense of \\cite{CFJL18}), $X$ is usually modeled as an absolutely continuous random vector. While comprehensible from the point of view of applications, this approximate procedure does not make sense theoretically, since $X$ is supported by the finite set $F^p$. In this paper, explicit formulae for the joint distribution of $(X,\\widetilde{X})$ are provided when $P(X\\in F^p)=1$ and $X$ is exchangeable or partially exchangeable. In fact, when $X_i\\in F$ for all $i$, there seem to be various reasons for assuming $X$ exchangeable or partially exchangeable. The robustness of $\\widetilde{X}$, with respect to the de Finetti's measure $\\pi$ of $X$, is investigated as well. Let $\\mathcal{L}_\\pi(\\widetilde{X}\\mid X=x)$ denote the conditional distribution of $\\widetilde{X}$, given $X=x$, when the de Finetti's measure is $\\pi$. It is shown that $$\\norm{\\mathcal{L}_{\\pi_1}(\\widetilde{X}\\mid X=x)-\\mathcal{L}_{\\pi_2}(\\widetilde{X}\\mid X=x)}\\le c(x)\\,\\norm{\\pi_1-\\pi_2}$$ where $\\norm{\\cdot}$ is total variation distance and $c(x)$ a suitable constant. Finally, a numerical experiment is performed. Overall, the knockoffs of this paper outperform the alternatives (i.e., the knockoffs obtained by giving $X$ an absolutely continuous distribution) as regards the false discovery rate but are slightly weaker in terms of power."}, "https://arxiv.org/abs/2410.10282": {"title": "Exact MCMC for Intractable Proposals", "link": "https://arxiv.org/abs/2410.10282", "description": "arXiv:2410.10282v1 Announce Type: cross \nAbstract: Accept-reject based Markov chain Monte Carlo (MCMC) methods are the workhorse algorithm for Bayesian inference. These algorithms, like Metropolis-Hastings, require the choice of a proposal distribution which is typically informed by the desired target distribution. Surprisingly, proposal distributions with unknown normalizing constants are not uncommon, even though for such a choice of a proposal, the Metropolis-Hastings acceptance ratio cannot be evaluated exactly. Across the literature, authors resort to approximation methods that yield inexact MCMC or develop specialized algorithms to combat this problem. We show how Bernoulli factory MCMC algorithms, originally proposed for doubly intractable target distributions, can quite naturally be adapted to this situation. We present three diverse and relevant examples demonstrating the usefulness of the Bernoulli factory approach to this problem."}, "https://arxiv.org/abs/2410.10309": {"title": "Optimal lower bounds for logistic log-likelihoods", "link": "https://arxiv.org/abs/2410.10309", "description": "arXiv:2410.10309v1 Announce Type: cross \nAbstract: The logit transform is arguably the most widely-employed link function beyond linear settings. This transformation routinely appears in regression models for binary data and provides, either explicitly or implicitly, a core building-block within state-of-the-art methodologies for both classification and regression. Its widespread use, combined with the lack of analytical solutions for the optimization of general losses involving the logit transform, still motivates active research in computational statistics. 
Among the directions explored, a central one has focused on the design of tangent lower bounds for logistic log-likelihoods that can be tractably optimized, while providing a tight approximation of these log-likelihoods. Although progress along these lines has led to the development of effective minorize-maximize (MM) algorithms for point estimation and coordinate ascent variational inference schemes for approximate Bayesian inference under several logit models, the overarching focus in the literature has been on tangent quadratic minorizers. In fact, it is still unclear whether tangent lower bounds sharper than quadratic ones can be derived without undermining the tractability of the resulting minorizer. This article addresses such a challenging question through the design and study of a novel piece-wise quadratic lower bound that uniformly improves any tangent quadratic minorizer, including the sharpest ones, while admitting a direct interpretation in terms of the classical generalized lasso problem. As illustrated in a ridge logistic regression, this unique connection facilitates more effective implementations than those provided by available piece-wise bounds, while improving the convergence speed of quadratic ones."}, "https://arxiv.org/abs/2410.10538": {"title": "Data-Driven Approaches for Modelling Target Behaviour", "link": "https://arxiv.org/abs/2410.10538", "description": "arXiv:2410.10538v1 Announce Type: cross \nAbstract: The performance of tracking algorithms strongly depends on the chosen model assumptions regarding the target dynamics. If there is a strong mismatch between the chosen model and the true object motion, the track quality may be poor or the track is easily lost. Still, the true dynamics might not be known a priori or it is too complex to be expressed in a tractable mathematical formulation. This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion based on training data. The first method builds on Gaussian Processes (GPs) for predicting the object motion, the second learns the parameters of an Interacting Multiple Model (IMM) filter and the third uses a Long Short-Term Memory (LSTM) network as a motion model. All methods are compared against an Extended Kalman Filter (EKF) with an analytic motion model as a benchmark and their respective strengths are highlighted in one simulated and two real-world scenarios."}, "https://arxiv.org/abs/2410.10641": {"title": "Echo State Networks for Spatio-Temporal Area-Level Data", "link": "https://arxiv.org/abs/2410.10641", "description": "arXiv:2410.10641v1 Announce Type: cross \nAbstract: Spatio-temporal area-level datasets play a critical role in official statistics, providing valuable insights for policy-making and regional planning. Accurate modeling and forecasting of these datasets can be extremely useful for policymakers to develop informed strategies for future planning. Echo State Networks (ESNs) are efficient methods for capturing nonlinear temporal dynamics and generating forecasts. However, ESNs lack a direct mechanism to account for the neighborhood structure inherent in area-level data. Ignoring these spatial relationships can significantly compromise the accuracy and utility of forecasts. In this paper, we incorporate approximate graph spectral filters at the input stage of the ESN, thereby improving forecast accuracy while preserving the model's computational efficiency during training. 
We demonstrate the effectiveness of our approach using Eurostat's tourism occupancy dataset and show how it can support more informed decision-making in policy and planning contexts."}, "https://arxiv.org/abs/2410.10649": {"title": "Vecchia Gaussian Processes: Probabilistic Properties, Minimax Rates and Methodological Developments", "link": "https://arxiv.org/abs/2410.10649", "description": "arXiv:2410.10649v1 Announce Type: cross \nAbstract: Gaussian Processes (GPs) are widely used to model dependency in spatial statistics and machine learning, yet the exact computation suffers an intractable time complexity of $O(n^3)$. Vecchia approximation allows scalable Bayesian inference of GPs in $O(n)$ time by introducing sparsity in the spatial dependency structure that is characterized by a directed acyclic graph (DAG). Despite the popularity in practice, it is still unclear how to choose the DAG structure and there are still no theoretical guarantees in nonparametric settings. In this paper, we systematically study the Vecchia GPs as standalone stochastic processes and uncover important probabilistic properties and statistical results in methodology and theory. For probabilistic properties, we prove that the conditional distributions of the Mat\\'{e}rn GPs, as well as the Vecchia approximations of the Mat\\'{e}rn GPs, can be characterized by polynomials. This allows us to prove a series of results regarding the small ball probabilities and RKHSs of Vecchia GPs. For statistical methodology, we provide a principled guideline to choose parent sets as norming sets with fixed cardinality and provide detailed algorithms following such guidelines. For statistical theory, we prove posterior contraction rates for applying Vecchia GPs to regression problems, where minimax optimality is achieved by optimally tuned GPs via either oracle rescaling or hierarchical Bayesian methods. Our theory and methodology are demonstrated with numerical studies, where we also provide efficient implementation of our methods in C++ with R interfaces."}, "https://arxiv.org/abs/2410.10704": {"title": "Estimation beyond Missing (Completely) at Random", "link": "https://arxiv.org/abs/2410.10704", "description": "arXiv:2410.10704v1 Announce Type: cross \nAbstract: We study the effects of missingness on the estimation of population parameters. Moving beyond restrictive missing completely at random (MCAR) assumptions, we first formulate a missing data analogue of Huber's arbitrary $\\epsilon$-contamination model. For mean estimation with respect to squared Euclidean error loss, we show that the minimax quantiles decompose as a sum of the corresponding minimax quantiles under a heterogeneous, MCAR assumption, and a robust error term, depending on $\\epsilon$, that reflects the additional error incurred by departure from MCAR.\n We next introduce natural classes of realisable $\\epsilon$-contamination models, where an MCAR version of a base distribution $P$ is contaminated by an arbitrary missing not at random (MNAR) version of $P$. These classes are rich enough to capture various notions of biased sampling and sensitivity conditions, yet we show that they enjoy improved minimax performance relative to our earlier arbitrary contamination classes for both parametric and nonparametric classes of base distributions. 
For instance, with a univariate Gaussian base distribution, consistent mean estimation over realisable $\\epsilon$-contamination classes is possible even when $\\epsilon$ and the proportion of missingness converge (slowly) to 1. Finally, we extend our results to the setting of departures from missing at random (MAR) in normal linear regression with a realisable missing response."}, "https://arxiv.org/abs/1904.04484": {"title": "Meta-analysis of Bayesian analyses", "link": "https://arxiv.org/abs/1904.04484", "description": "arXiv:1904.04484v2 Announce Type: replace \nAbstract: Meta-analysis aims to generalize results from multiple related statistical analyses through a combined analysis. While the natural outcome of a Bayesian study is a posterior distribution, traditional Bayesian meta-analyses proceed by combining summary statistics (i.e., point-valued estimates) computed from data. In this paper, we develop a framework for combining posterior distributions from multiple related Bayesian studies into a meta-analysis. Importantly, the method is capable of reusing pre-computed posteriors from computationally costly analyses, without needing the implementation details from each study. Besides providing a consensus across studies, the method enables updating the local posteriors post-hoc and therefore refining them by sharing statistical strength between the studies, without rerunning the original analyses. We illustrate the wide applicability of the framework by combining results from likelihood-free Bayesian analyses, which would be difficult to carry out using standard methodology."}, "https://arxiv.org/abs/2210.05026": {"title": "Uncertainty Quantification in Synthetic Controls with Staggered Treatment Adoption", "link": "https://arxiv.org/abs/2210.05026", "description": "arXiv:2210.05026v4 Announce Type: replace \nAbstract: We propose principled prediction intervals to quantify the uncertainty of a large class of synthetic control predictions (or estimators) in settings with staggered treatment adoption, offering precise non-asymptotic coverage probability guarantees. From a methodological perspective, we provide a detailed discussion of different causal quantities to be predicted, which we call causal predictands, allowing for multiple treated units with treatment adoption at possibly different points in time. From a theoretical perspective, our uncertainty quantification methods improve on prior literature by (i) covering a large class of causal predictands in staggered adoption settings, (ii) allowing for synthetic control methods with possibly nonlinear constraints, (iii) proposing scalable robust conic optimization methods and principled data-driven tuning parameter selection, and (iv) offering valid uniform inference across post-treatment periods. We illustrate our methodology with an empirical application studying the effects of economic liberalization on real GDP per capita for Sub-Saharan African countries. Companion general-purpose software packages are provided in Python, R, and Stata."}, "https://arxiv.org/abs/2211.11884": {"title": "Parameter Estimation in Nonlinear Multivariate Stochastic Differential Equations Based on Splitting Schemes", "link": "https://arxiv.org/abs/2211.11884", "description": "arXiv:2211.11884v3 Announce Type: replace \nAbstract: The likelihood functions for discretely observed nonlinear continuous-time models based on stochastic differential equations are not available except for a few cases. 
Various parameter estimation techniques have been proposed, each with advantages, disadvantages, and limitations depending on the application. Most applications still use the Euler-Maruyama discretization, despite many proofs of its bias. More sophisticated methods, such as Kessler's Gaussian approximation, Ozaki's Local Linearization, A\\\"it-Sahalia's Hermite expansions, or MCMC methods, might be complex to implement, do not scale well with increasing model dimension, or can be numerically unstable. We propose two efficient and easy-to-implement likelihood-based estimators based on the Lie-Trotter (LT) and the Strang (S) splitting schemes. We prove that S has $L^p$ convergence rate of order 1, a property already known for LT. We show that the estimators are consistent and asymptotically efficient under the less restrictive one-sided Lipschitz assumption. A numerical study on the 3-dimensional stochastic Lorenz system complements our theoretical findings. The simulation shows that the S estimator performs the best when measured on precision and computational speed compared to the state-of-the-art."}, "https://arxiv.org/abs/2212.05524": {"title": "Bayesian inference for partial orders from random linear extensions: power relations from 12th Century Royal Acta", "link": "https://arxiv.org/abs/2212.05524", "description": "arXiv:2212.05524v3 Announce Type: replace \nAbstract: In the eleventh and twelfth centuries in England, Wales and Normandy, Royal Acta were legal documents in which witnesses were listed in order of social status. Any bishops present were listed as a group. For our purposes, each witness-list is an ordered permutation of bishop names with a known date or date-range. Changes over time in the order bishops are listed may reflect changes in their authority. Historians would like to detect and quantify these changes. There is no reason to assume that the underlying social order which constrains bishop-order within lists is a complete order. We therefore model the evolving social order as an evolving partial ordered set or {\\it poset}.\n We construct a Hidden Markov Model for these data. The hidden state is an evolving poset (the evolving social hierarchy) and the emitted data are random total orders (dated lists) respecting the poset present at the time the order was observed. This generalises existing models for rank-order data such as Mallows and Plackett-Luce. We account for noise via a random ``queue-jumping'' process. Our latent-variable prior for the random process of posets is marginally consistent. A parameter controls poset depth and actor-covariates inform the position of actors in the hierarchy. We fit the model, estimate posets and find evidence for changes in status over time. We interpret our results in terms of court politics. Simpler models, based on Bucket Orders and vertex-series-parallel orders, are rejected. We compare our results with a time-series extension of the Plackett-Luce model. Our software is publicly available."}, "https://arxiv.org/abs/2301.03799": {"title": "Tensor Formulation of the General Linear Model with Einstein Notation", "link": "https://arxiv.org/abs/2301.03799", "description": "arXiv:2301.03799v2 Announce Type: replace \nAbstract: The general linear model is a universally accepted method to conduct and test multiple linear regression models. Using this model one has the ability to simultaneously regress covariates among different groups of data. 
Moreover, there are hundreds of applications and statistical tests associated with the general linear model. However, the conventional matrix formulation is relatively inelegant which yields multiple difficulties including slow computation speed due to a large number of computations, increased memory usage due to needlessly large data structures, and organizational inconsistency. This is due to the fundamental incongruence between the degrees of freedom of the information the data structures in the conventional formulation of the general linear model are intended to represent and the rank of the data structures themselves. Here, I briefly suggest an elegant reformulation of the general linear model which involves the use of tensors and multidimensional arrays as opposed to exclusively flat structures in the conventional formulation. To demonstrate the efficacy of this approach I translate a few common applications of the general linear model from the conventional formulation to the tensor formulation."}, "https://arxiv.org/abs/2310.09345": {"title": "A Unified Bayesian Framework for Modeling Measurement Error in Multinomial Data", "link": "https://arxiv.org/abs/2310.09345", "description": "arXiv:2310.09345v2 Announce Type: replace \nAbstract: Measurement error in multinomial data is a well-known and well-studied inferential problem that is encountered in many fields, including engineering, biomedical and omics research, ecology, finance, official statistics, and social sciences. Methods developed to accommodate measurement error in multinomial data are typically equipped to handle false negatives or false positives, but not both. We provide a unified framework for accommodating both forms of measurement error using a Bayesian hierarchical approach. We demonstrate the proposed method's performance on simulated data and apply it to acoustic bat monitoring and official crime data."}, "https://arxiv.org/abs/2311.12252": {"title": "Partial identification and unmeasured confounding with multiple treatments and multiple outcomes", "link": "https://arxiv.org/abs/2311.12252", "description": "arXiv:2311.12252v2 Announce Type: replace \nAbstract: In this work, we develop a framework for partial identification of causal effects in settings with multiple treatments and multiple outcomes. We highlight several advantages of jointly analyzing causal effects across multiple estimands under a \"factor confounding assumption\" where residual dependence amongst treatments and outcomes is assumed to be driven by unmeasured confounding. In this setting, we show that joint partial identification regions for multiple estimands can be more informative than considering partial identification for individual estimands one at a time. We additionally show how assumptions related to the strength of confounding or the magnitude of plausible effect sizes for any one estimand can reduce the partial identification regions for other estimands. As a special case of this result, we explore how negative control assumptions reduce partial identification regions and discuss conditions under which point identification can be obtained. We develop novel computational approaches to finding partial identification regions under a variety of these assumptions. 
Lastly, we demonstrate our approach in an analysis of the causal effects of multiple air pollutants on several health outcomes in the United States using claims data from Medicare, where we find that some exposures have effects that are robust to the presence of unmeasured confounding."}, "https://arxiv.org/abs/1908.09173": {"title": "Welfare Analysis in Dynamic Models", "link": "https://arxiv.org/abs/1908.09173", "description": "arXiv:1908.09173v2 Announce Type: replace-cross \nAbstract: This paper provides welfare metrics for dynamic choice. We give estimation and inference methods for functions of the expected value of dynamic choice. These parameters include average value by group, average derivatives with respect to endowments, and structural decompositions. The example of dynamic discrete choice is considered. We give dual and doubly robust representations of these parameters. A least squares estimator of the dynamic Riesz representer for the parameter of interest is given. Debiased machine learners are provided and asymptotic theory given."}, "https://arxiv.org/abs/2303.06198": {"title": "Deflated HeteroPCA: Overcoming the curse of ill-conditioning in heteroskedastic PCA", "link": "https://arxiv.org/abs/2303.06198", "description": "arXiv:2303.06198v2 Announce Type: replace-cross \nAbstract: This paper is concerned with estimating the column subspace of a low-rank matrix $\\boldsymbol{X}^\\star \\in \\mathbb{R}^{n_1\\times n_2}$ from contaminated data. How to obtain optimal statistical accuracy while accommodating the widest range of signal-to-noise ratios (SNRs) becomes particularly challenging in the presence of heteroskedastic noise and unbalanced dimensionality (i.e., $n_2\\gg n_1$). While the state-of-the-art algorithm $\\textsf{HeteroPCA}$ emerges as a powerful solution for solving this problem, it suffers from \"the curse of ill-conditioning,\" namely, its performance degrades as the condition number of $\\boldsymbol{X}^\\star$ grows. In order to overcome this critical issue without compromising the range of allowable SNRs, we propose a novel algorithm, called $\\textsf{Deflated-HeteroPCA}$, that achieves near-optimal and condition-number-free theoretical guarantees in terms of both $\\ell_2$ and $\\ell_{2,\\infty}$ statistical accuracy. The proposed algorithm divides the spectrum of $\\boldsymbol{X}^\\star$ into well-conditioned and mutually well-separated subblocks, and applies $\\textsf{HeteroPCA}$ to conquer each subblock successively. Further, an application of our algorithm and theory to two canonical examples -- the factor model and tensor PCA -- leads to remarkable improvement for each application."}, "https://arxiv.org/abs/2410.11093": {"title": "rquest: An R package for hypothesis tests and confidence intervals for quantiles and summary measures based on quantiles", "link": "https://arxiv.org/abs/2410.11093", "description": "arXiv:2410.11093v1 Announce Type: new \nAbstract: Sample quantiles, such as the median, are often better suited than the sample mean for summarising location characteristics of a data set. Similarly, linear combinations of sample quantiles and ratios of such linear combinations, e.g. the interquartile range and quantile-based skewness measures, are often used to quantify characteristics such as spread and skew. While often reported, it is uncommon to accompany quantile estimates with confidence intervals or standard errors. 
The rquest package provides a simple way to conduct hypothesis tests and derive confidence intervals for quantiles, linear combinations of quantiles, ratios of dependent linear combinations (e.g., Bowley's measure of skewness) and differences and ratios of all of the above for comparisons between independent samples. Many commonly used measures based on quantiles are included, although it is also very simple for users to define their own. Additionally, quantile-based measures of inequality are also considered. The methods are based on recent research showing that reliable distribution-free confidence intervals can be obtained, even for moderate sample sizes. Several examples are provided herein."}, "https://arxiv.org/abs/2410.11103": {"title": "Models for spatiotemporal data with some missing locations and application to emergency calls models calibration", "link": "https://arxiv.org/abs/2410.11103", "description": "arXiv:2410.11103v1 Announce Type: new \nAbstract: We consider two classes of models for spatiotemporal data: one without covariates and one with covariates. If $\mathcal{T}$ is a partition of time and $\mathcal{I}$ a partition of the studied area into zones and if $\mathcal{C}$ is the set of arrival types, we assume that the process of arrivals for time interval $t \in \mathcal{T}$, zone $i \in \mathcal{I}$, and arrival type $c \in \mathcal{C}$ is Poisson with some intensity $\lambda_{c,i,t}$. We discussed the calibration and implementation of such models in \cite{laspatedpaper, laspatedmanual} with corresponding software LASPATED (Library for the Analysis of SPAtioTEmporal Discrete data) available on GitHub at https://github.com/vguigues/LASPATED. In this paper, we discuss the extension of these models when some of the locations are missing in the historical data. We propose three models to deal with missing locations and implement them in both Matlab and C++. The corresponding code is available on GitHub as an extension of LASPATED at https://github.com/vguigues/LASPATED/Missing_Data. We tested our implementation using the process of emergency calls to an Emergency Health Service, where many calls come with missing locations, and show the importance and benefit of using models that consider missing locations rather than discarding the calls with missing locations when calibrating statistical models for such calls."}, "https://arxiv.org/abs/2410.11151": {"title": "Discovering the critical number of respondents to validate an item in a questionnaire: The Binomial Cut-level Content Validity proposal", "link": "https://arxiv.org/abs/2410.11151", "description": "arXiv:2410.11151v1 Announce Type: new \nAbstract: The question that drives this research is: \"How to discover the number of respondents that are necessary to validate items of a questionnaire as actually essential to reach the questionnaire's proposal?\" Among the efforts on this subject, \cite{Lawshe1975, Wilson2012, Ayre_CVR_2014} approached this issue by proposing and refining the Content Validity Ratio (CVR), which seeks to identify items that are actually essential. Despite their contribution, these studies do not check whether an item validated as \"essential\" could also be validated as \"not essential\" by the same sample, which would be a paradox. Another issue is the assignment of a 50\% probability to an item being randomly checked as essential by a respondent, even though an evaluator has three options to choose from. 
Our proposal addresses these issues, making it possible to verify whether a paradoxical situation occurs and to recommend more precisely whether an item should be retained in or discarded from a questionnaire."}, "https://arxiv.org/abs/2410.11192": {"title": "Multi-scale tests of independence powerful for detecting explicit or implicit functional relationship", "link": "https://arxiv.org/abs/2410.11192", "description": "arXiv:2410.11192v1 Announce Type: new \nAbstract: In this article, we consider the problem of testing the independence between two random variables. Our primary objective is to develop tests that are highly effective at detecting associations arising from an explicit or implicit functional relationship between the two variables. We adopt a multi-scale approach by analyzing neighborhoods of varying sizes within the dataset and aggregating the results. We introduce a general testing framework designed to enhance the power of existing independence tests to achieve our objective. Additionally, we propose a novel test method that is powerful as well as computationally efficient. The performance of these tests is compared with existing methods using various simulated datasets."}, "https://arxiv.org/abs/2410.11263": {"title": "Closed-form estimation and inference for panels with attrition and refreshment samples", "link": "https://arxiv.org/abs/2410.11263", "description": "arXiv:2410.11263v1 Announce Type: new \nAbstract: It has long been established that, if a panel dataset suffers from attrition, auxiliary (refreshment) sampling restores full identification under additional assumptions that still allow for nontrivial attrition mechanisms. Such identification results rely on implausible assumptions about the attrition process or lead to theoretically and computationally challenging estimation procedures. We propose an alternative identifying assumption that, despite its nonparametric nature, suggests a simple estimation algorithm based on a transformation of the empirical cumulative distribution function of the data. This estimation procedure requires neither tuning parameters nor optimization in the first step, i.e. has a closed form. We prove that our estimator is consistent and asymptotically normal and demonstrate its good performance in simulations. We provide an empirical illustration with income data from the Understanding America Study."}, "https://arxiv.org/abs/2410.11320": {"title": "Regularized Estimation of High-Dimensional Matrix-Variate Autoregressive Models", "link": "https://arxiv.org/abs/2410.11320", "description": "arXiv:2410.11320v1 Announce Type: new \nAbstract: Matrix-variate time series data are increasingly popular in economics, statistics, and environmental studies, among other fields. This paper develops regularized estimation methods for analyzing high-dimensional matrix-variate time series using bilinear matrix-variate autoregressive models. The bilinear autoregressive structure is widely used for matrix-variate time series, as it reduces model complexity while capturing interactions between rows and columns. However, when dealing with large dimensions, the commonly used iterated least-squares method results in numerous estimated parameters, making interpretation difficult. To address this, we propose two regularized estimation methods to further reduce model dimensionality. The first assumes banded autoregressive coefficient matrices, where each data point interacts only with nearby points. 
A two-step estimation method is used: first, traditional iterated least-squares is applied for initial estimates, followed by a banded iterated least-squares approach. A Bayesian Information Criterion (BIC) is introduced to estimate the bandwidth of the coefficient matrices. The second method assumes sparse autoregressive matrices, applying the LASSO technique for regularization. We derive asymptotic properties for both methods as the dimensions diverge and the sample size $T\\rightarrow\\infty$. Simulations and real data examples demonstrate the effectiveness of our methods, comparing their forecasting performance against common autoregressive models in the literature."}, "https://arxiv.org/abs/2410.11408": {"title": "Aggregation Trees", "link": "https://arxiv.org/abs/2410.11408", "description": "arXiv:2410.11408v1 Announce Type: new \nAbstract: Uncovering the heterogeneous effects of particular policies or \"treatments\" is a key concern for researchers and policymakers. A common approach is to report average treatment effects across subgroups based on observable covariates. However, the choice of subgroups is crucial as it poses the risk of $p$-hacking and requires balancing interpretability with granularity. This paper proposes a nonparametric approach to construct heterogeneous subgroups. The approach enables a flexible exploration of the trade-off between interpretability and the discovery of more granular heterogeneity by constructing a sequence of nested groupings, each with an optimality property. By integrating our approach with \"honesty\" and debiased machine learning, we provide valid inference about the average treatment effect of each group. We validate the proposed methodology through an empirical Monte-Carlo study and apply it to revisit the impact of maternal smoking on birth weight, revealing systematic heterogeneity driven by parental and birth-related characteristics."}, "https://arxiv.org/abs/2410.11438": {"title": "Effect modification and non-collapsibility leads to conflicting treatment decisions: a review of marginal and conditional estimands and recommendations for decision-making", "link": "https://arxiv.org/abs/2410.11438", "description": "arXiv:2410.11438v1 Announce Type: new \nAbstract: Effect modification occurs when a covariate alters the relative effectiveness of treatment compared to control. It is widely understood that, when effect modification is present, treatment recommendations may vary by population and by subgroups within the population. Population-adjustment methods are increasingly used to adjust for differences in effect modifiers between study populations and to produce population-adjusted estimates in a relevant target population for decision-making. It is also widely understood that marginal and conditional estimands for non-collapsible effect measures, such as odds ratios or hazard ratios, do not in general coincide even without effect modification. However, the consequences of both non-collapsibility and effect modification together are little-discussed in the literature.\n In this paper, we set out the definitions of conditional and marginal estimands, illustrate their properties when effect modification is present, and discuss the implications for decision-making. In particular, we show that effect modification can result in conflicting treatment rankings between conditional and marginal estimates. 
This is because conditional and marginal estimands correspond to different decision questions that are no longer aligned when effect modification is present. For time-to-event outcomes, the presence of covariates implies that marginal hazard ratios are time-varying, and effect modification can cause marginal hazard curves to cross. We conclude with practical recommendations for decision-making in the presence of effect modification, based on pragmatic comparisons of both conditional and marginal estimates in the decision target population. Currently, multilevel network meta-regression is the only population-adjustment method capable of producing both conditional and marginal estimates, in any decision target population."}, "https://arxiv.org/abs/2410.11482": {"title": "Scalable likelihood-based estimation and variable selection for the Cox model with incomplete covariates", "link": "https://arxiv.org/abs/2410.11482", "description": "arXiv:2410.11482v1 Announce Type: new \nAbstract: Regression analysis with missing data is a long-standing and challenging problem, particularly when there are many missing variables with arbitrary missing patterns. Likelihood-based methods, although theoretically appealing, are often computationally inefficient or even infeasible when dealing with a large number of missing variables. In this paper, we consider the Cox regression model with incomplete covariates that are missing at random. We develop an expectation-maximization (EM) algorithm for nonparametric maximum likelihood estimation, employing a transformation technique in the E-step so that it involves only a one-dimensional integration. This innovation makes our methods scalable with respect to the dimension of the missing variables. In addition, for variable selection, we extend the proposed EM algorithm to accommodate a LASSO penalty in the likelihood. We demonstrate the feasibility and advantages of the proposed methods over existing methods by large-scale simulation studies and apply the proposed methods to a cancer genomic study."}, "https://arxiv.org/abs/2410.11562": {"title": "Functional Data Analysis on Wearable Sensor Data: A Systematic Review", "link": "https://arxiv.org/abs/2410.11562", "description": "arXiv:2410.11562v1 Announce Type: new \nAbstract: Wearable devices and sensors have recently become a popular way to collect data, especially in the health sciences. The use of sensors allows patients to be monitored over a period of time with a high observation frequency. Due to the continuous-on-time structure of the data, novel statistical methods are recommended for the analysis of sensor data. One of the popular approaches in the analysis of wearable sensor data is functional data analysis. The main objective of this paper is to review functional data analysis methods applied to wearable device data according to the type of sensor. In addition, we introduce several freely available software packages and open databases of wearable device data to facilitate access to sensor data in different fields."}, "https://arxiv.org/abs/2410.11637": {"title": "Prediction-Centric Uncertainty Quantification via MMD", "link": "https://arxiv.org/abs/2410.11637", "description": "arXiv:2410.11637v1 Announce Type: new \nAbstract: Deterministic mathematical models, such as those specified via differential equations, are a powerful tool to communicate scientific insight. However, such models are necessarily simplified descriptions of the real world. 
Generalised Bayesian methodologies have been proposed for inference with misspecified models, but these are typically associated with vanishing parameter uncertainty as more data are observed. In the context of a misspecified deterministic mathematical model, this has the undesirable consequence that posterior predictions become deterministic and certain, while being incorrect. Taking this observation as a starting point, we propose Prediction-Centric Uncertainty Quantification, where a mixture distribution based on the deterministic model confers improved uncertainty quantification in the predictive context. Computation of the mixing distribution is cast as a (regularised) gradient flow of the maximum mean discrepancy (MMD), enabling consistent numerical approximations to be obtained. Results are reported on both a toy model from population ecology and a real model of protein signalling in cell biology."}, "https://arxiv.org/abs/2410.11713": {"title": "Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing", "link": "https://arxiv.org/abs/2410.11713", "description": "arXiv:2410.11713v1 Announce Type: new \nAbstract: Randomized controlled trials (RCTs) are the gold standard for causal inference on treatment effects. However, they can be underpowered due to small population sizes in rare diseases and limited number of patients willing to participate due to questions regarding equipoise among treatment groups in common diseases. Hybrid controlled trials use external controls (ECs) from historical studies or large observational databases to enhance statistical efficiency, and as such, are considered for drug evaluation in rare diseases or indications associated with low trial participation. However, non-randomized ECs can introduce biases that compromise validity and inflate type I errors for treatment discovery, particularly in small samples. To address this, we extend the Fisher randomization test (FRT) to hybrid controlled trials. Our approach involves a test statistic combining RCT and EC data and is based solely on randomization in the RCT. This method strictly controls the type I error rate, even with biased ECs, and improves power by incorporating unbiased ECs. To mitigate the power loss caused by biased ECs, we introduce conformal selective borrowing, which uses conformal p-values to individually select unbiased ECs, offering the flexibility to use either computationally efficient parametric models or off-the-shelf machine learning models to construct the conformal score function, along with model-agnostic reliability. We identify a risk-benefit trade-off in the power of FRT associated with different selection thresholds for conformal p-values. We offer both fixed and adaptive selection threshold options, enabling robust performance across varying levels of hidden bias. The advantages of our method are demonstrated through simulations and an application to a small lung cancer RCT with ECs from the National Cancer Database. Our method is available in the R package intFRT on GitHub."}, "https://arxiv.org/abs/2410.11716": {"title": "Randomization-based Inference for MCP-Mod", "link": "https://arxiv.org/abs/2410.11716", "description": "arXiv:2410.11716v1 Announce Type: new \nAbstract: Dose selection is critical in pharmaceutical drug development, as it directly impacts therapeutic efficacy and patient safety of a drug. 
The Generalized Multiple Comparison Procedures and Modeling (MCP-Mod) approach is commonly used in Phase II trials for testing and estimation of dose-response relationships. However, its effectiveness in small sample sizes, particularly with binary endpoints, is hindered by issues like complete separation in logistic regression, leading to non-existence of estimates. Motivated by an actual clinical trial using the MCP-Mod approach, this paper introduces penalized maximum likelihood estimation (MLE) and randomization-based inference techniques to address these challenges. Randomization-based inference allows for exact finite sample inference, while population-based inference for MCP-Mod typically relies on asymptotic approximations. Simulation studies demonstrate that randomization-based tests can enhance statistical power in small to medium-sized samples while maintaining control over type-I error rates, even in the presence of time trends. Our results show that residual-based randomization tests using penalized MLEs not only improve computational efficiency but also outperform standard randomization-based methods, making them an adequate choice for dose-finding analyses within the MCP-Mod framework. Additionally, we apply these methods to pharmacometric settings, demonstrating their effectiveness in such scenarios. The results in this paper underscore the potential of randomization-based inference for the analysis of dose-finding trials, particularly in small sample contexts."}, "https://arxiv.org/abs/2410.11743": {"title": "Addressing the Null Paradox in Epidemic Models: Correcting for Collider Bias in Causal Inference", "link": "https://arxiv.org/abs/2410.11743", "description": "arXiv:2410.11743v1 Announce Type: new \nAbstract: We address the null paradox in epidemic models, where standard methods estimate a non-zero treatment effect despite the true effect being zero. This occurs when epidemic models mis-specify how causal effects propagate over time, especially when covariates act as colliders between past interventions and latent variables, leading to spurious correlations. Standard approaches like maximum likelihood and Bayesian methods can misinterpret these biases, inferring false causal relationships. While semi-parametric models and inverse propensity weighting offer potential solutions, they often limit the ability of domain experts to incorporate epidemic-specific knowledge. To resolve this, we propose an alternative estimating equation that corrects for collider bias while allowing for statistical inference with frequentist guarantees, previously unavailable for complex models like SEIR."}, "https://arxiv.org/abs/2410.11797": {"title": "Unraveling Heterogeneous Treatment Effects in Networks: A Non-Parametric Approach Based on Node Connectivity", "link": "https://arxiv.org/abs/2410.11797", "description": "arXiv:2410.11797v1 Announce Type: new \nAbstract: In network settings, interference between units makes causal inference more challenging as outcomes may depend on the treatments received by others in the network. Typical estimands in network settings focus on treatment effects aggregated across individuals in the population. We propose a framework for estimating node-wise counterfactual means, allowing for more granular insights into the impact of network structure on treatment effect heterogeneity. 
We develop a doubly robust and non-parametric estimation procedure, KECENI (Kernel Estimation of Causal Effect under Network Interference), which offers consistency and asymptotic normality under network dependence. The utility of this method is demonstrated through an application to microfinance data, revealing the impact of network characteristics on treatment effects."}, "https://arxiv.org/abs/2410.11113": {"title": "Statistical Properties of Deep Neural Networks with Dependent Data", "link": "https://arxiv.org/abs/2410.11113", "description": "arXiv:2410.11113v1 Announce Type: cross \nAbstract: This paper establishes statistical properties of deep neural network (DNN) estimators under dependent data. Two general results for nonparametric sieve estimators directly applicable to DNN estimators are given. The first establishes rates for convergence in probability under nonstationary data. The second provides non-asymptotic probability bounds on $\\mathcal{L}^{2}$-errors under stationary $\\beta$-mixing data. I apply these results to DNN estimators in both regression and classification contexts imposing only a standard H\\\"older smoothness assumption. These results are then used to demonstrate how asymptotic inference can be conducted on the finite dimensional parameter of a partially linear regression model after first-stage DNN estimation of infinite dimensional parameters. The DNN architectures considered are common in applications, featuring fully connected feedforward networks with any continuous piecewise linear activation function, unbounded weights, and a width and depth that grow with sample size. The framework provided also offers potential for research into other DNN architectures and time-series applications."}, "https://arxiv.org/abs/2410.11238": {"title": "Impact of existence and nonexistence of pivot on the coverage of empirical best linear prediction intervals for small areas", "link": "https://arxiv.org/abs/2410.11238", "description": "arXiv:2410.11238v1 Announce Type: cross \nAbstract: We advance the theory of parametric bootstrap in constructing highly efficient empirical best (EB) prediction intervals of small area means. The coverage error of such a prediction interval is of the order $O(m^{-3/2})$, where $m$ is the number of small areas to be pooled using a linear mixed normal model. In the context of an area level model where the random effects follow a non-normal known distribution except possibly for unknown hyperparameters, we analytically show that the order of the coverage error of the empirical best linear (EBL) prediction interval remains the same even if we relax the normality of the random effects by the existence of a pivot for suitably standardized random effects when hyperparameters are known. Recognizing the challenge of showing the existence of a pivot, we develop a simple moment-based method to claim non-existence of a pivot. We show that the existing parametric bootstrap EBL prediction interval fails to achieve the desired order of the coverage error, i.e. $O(m^{-3/2})$, in the absence of a pivot. We obtain a surprising result that the order $O(m^{-1})$ term is always positive under certain conditions, indicating possible overcoverage of the existing parametric bootstrap EBL prediction interval. In general, we analytically show for the first time that the coverage problem can be corrected by adopting a suitably devised double parametric bootstrap.
Our Monte Carlo simulations show that our proposed single bootstrap method performs reasonably well when compared to rival methods."}, "https://arxiv.org/abs/2410.11429": {"title": "Filtering coupled Wright-Fisher diffusions", "link": "https://arxiv.org/abs/2410.11429", "description": "arXiv:2410.11429v1 Announce Type: cross \nAbstract: Coupled Wright-Fisher diffusions have been recently introduced to model the temporal evolution of finitely-many allele frequencies at several loci. These are vectors of multidimensional diffusions whose dynamics are weakly coupled among loci through interaction coefficients, which make the reproductive rates for each allele depend on its frequencies at several loci. Here we consider the problem of filtering a coupled Wright-Fisher diffusion with parent-independent mutation, when this is seen as an unobserved signal in a hidden Markov model. We assume individuals are sampled multinomially at discrete times from the underlying population, whose type configuration at the loci is described by the diffusion states, and adapt recently introduced duality methods to derive the filtering and smoothing distributions. These respectively provide the conditional distribution of the diffusion states given past data, and that conditional on the entire dataset, and are key to be able to perform parameter inference on models of this type. We show that for this model these distributions are countable mixtures of tilted products of Dirichlet kernels, and describe their mixing weights and how these can be updated sequentially. The evaluation of the weights involves the transition probabilities of the dual process, which are not available in closed form. We lay out pseudo codes for the implementation of the algorithms, discuss how to handle the unavailable quantities, and briefly illustrate the procedure with synthetic data."}, "https://arxiv.org/abs/2210.00200": {"title": "Semiparametric Efficient Fusion of Individual Data and Summary Statistics", "link": "https://arxiv.org/abs/2210.00200", "description": "arXiv:2210.00200v4 Announce Type: replace \nAbstract: Suppose we have individual data from an internal study and various summary statistics from relevant external studies. External summary statistics have the potential to improve statistical inference for the internal population; however, it may lead to efficiency loss or bias if not used properly. We study the fusion of individual data and summary statistics in a semiparametric framework to investigate the efficient use of external summary statistics. Under a weak transportability assumption, we establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution, which is no larger than that using only internal data and underpins the potential efficiency gain of integrating individual data and summary statistics. We propose a data-fused efficient estimator that achieves this efficiency bound. In addition, an adaptive fusion estimator is proposed to eliminate the bias of the original data-fused estimator when the transportability assumption fails. We establish the asymptotic oracle property of the adaptive fusion estimator. 
Simulations and application to a Helicobacter pylori infection dataset demonstrate the promising numerical performance of the proposed method."}, "https://arxiv.org/abs/2308.11026": {"title": "Harnessing The Collective Wisdom: Fusion Learning Using Decision Sequences From Diverse Sources", "link": "https://arxiv.org/abs/2308.11026", "description": "arXiv:2308.11026v2 Announce Type: replace \nAbstract: Learning from the collective wisdom of crowds is related to the statistical notion of fusion learning from multiple data sources or studies. However, fusing inferences from diverse sources is challenging since cross-source heterogeneity and potential data-sharing complicate statistical inference. Moreover, studies may rely on disparate designs and employ myriad modeling techniques, and prevailing data privacy norms may forbid sharing even summary statistics across the studies for an overall analysis. We propose an Integrative Ranking and Thresholding (IRT) framework for fusion learning in multiple testing. IRT operates under the setting where from each study a triplet is available: the vector of binary accept-reject decisions on the tested hypotheses, its False Discovery Rate (FDR) level, and the hypotheses tested by it. Under this setting, IRT constructs an aggregated and nonparametric measure of evidence against each null hypothesis, which facilitates ranking the hypotheses in the order of their likelihood of being rejected. We show that IRT guarantees overall FDR control if the studies control their respective FDR at the desired levels. IRT is extremely flexible, and a comprehensive numerical study demonstrates its practical relevance for pooling inferences. A real data illustration and extensions to alternative forms of Type I error control are discussed."}, "https://arxiv.org/abs/2309.10792": {"title": "Frequentist Inference for Semi-mechanistic Epidemic Models with Interventions", "link": "https://arxiv.org/abs/2309.10792", "description": "arXiv:2309.10792v2 Announce Type: replace \nAbstract: The effect of public health interventions on an epidemic is often estimated by adding the intervention to epidemic models. During the Covid-19 epidemic, numerous papers used such methods for making scenario predictions. The majority of these papers use Bayesian methods to estimate the parameters of the model. In this paper we show how to use frequentist methods for estimating these effects, which avoids having to specify prior distributions. We also use model-free shrinkage methods to improve estimation when there are many different geographic regions. This allows us to borrow strength from different regions while still getting confidence intervals with correct coverage and without having to specify a hierarchical model. Throughout, we focus on a semi-mechanistic model which provides a simple, tractable alternative to compartmental methods."}, "https://arxiv.org/abs/2304.01010": {"title": "COBRA: Comparison-Optimal Betting for Risk-limiting Audits", "link": "https://arxiv.org/abs/2304.01010", "description": "arXiv:2304.01010v2 Announce Type: replace-cross \nAbstract: Risk-limiting audits (RLAs) can provide routine, affirmative evidence that reported election outcomes are correct by checking a random sample of cast ballots. An efficient RLA requires checking relatively few ballots. Here we construct highly efficient RLAs by optimizing supermartingale tuning parameters--$\\textit{bets}$--for ballot-level comparison audits.
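On the betting supermartingales just introduced, the following is a minimal ballot-polling-style sketch in Python with a single fixed bet and simulated ballots; COBRA is about choosing such bets optimally for ballot-level comparison audits, which this sketch does not attempt.

import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05        # risk limit
lam = 0.4           # a fixed bet in [0, 2]; optimizing this choice is the hard part
true_share = 0.60   # assumed true vote share of the reported winner

# X_i = 1 if the i-th sampled ballot shows the reported winner. Under the null
# (true share <= 1/2) the running product below is a nonnegative supermartingale,
# so stopping once it exceeds 1/alpha keeps the risk below alpha (Ville's inequality).
wealth, n = 1.0, 0
while wealth < 1.0 / alpha and n < 10000:
    x = rng.binomial(1, true_share)
    wealth *= 1.0 + lam * (x - 0.5)
    n += 1

print(f"audit stopped after {n} ballots with wealth {wealth:.1f}")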
The exactly optimal bets depend on the true rate of errors in cast-vote records (CVRs)--digital receipts detailing how machines tabulated each ballot. We evaluate theoretical and simulated workloads for audits of contests with a range of diluted margins and CVR error rates. Compared to bets recommended in past work, using these optimal bets can dramatically reduce expected workloads--by 93% on average over our simulated audits. Because the exactly optimal bets are unknown in practice, we offer some strategies for approximating them. As with the ballot-polling RLAs described in ALPHA and RiLACs, adapting bets to previously sampled data or diversifying them over a range of suspected error rates can lead to substantially more efficient audits than fixing bets to $\\textit{a priori}$ values, especially when those values are far from correct. We sketch extensions to other designs and social choice functions, and conclude with some recommendations for real-world comparison audits."}, "https://arxiv.org/abs/2410.11892": {"title": "Copula based joint regression models for correlated data: an analysis in the bivariate case", "link": "https://arxiv.org/abs/2410.11892", "description": "arXiv:2410.11892v1 Announce Type: new \nAbstract: Regression analysis of non-normal correlated data is commonly performed using generalized linear mixed models (GLMM) and generalized estimating equations (GEE). The recent development of generalized joint regression models (GJRM) presents an alternative to these approaches by using copulas to flexibly model response variables and their dependence structures. This paper provides a simulation study that compares the GJRM with alternative methods. We focus on the case of the marginal distributions having the same form, for example, in models for longitudinal data.\n We find that for the normal model with identity link, all models provide accurate estimates of the parameters of interest. However, for non-normal models and when a non-identity link function is used, GLMMs in general provide biased estimates of marginal model parameters with inaccurately low standard errors. GLMM bias is more pronounced when the marginal distributions are more skewed or highly correlated. However, in the case that a GLMM parameter is estimated independently of the random effect term, we show it is possible to extract accurate parameter estimates, shown for a longitudinal time parameter with a logarithmic link model. In contrast, we find that GJRM and GEE provide unbiased estimates for all parameters with accurate standard errors when using a logarithmic link. In addition, we show that GJRM provides a model fit comparable to GLMM. In a real-world study of doctor visits, we further demonstrate that the GJRM provides better model fits than a comparable GEE or GLM, due to its greater flexibility in choice of marginal distribution and copula fit to dependence structures. We conclude that the GJRM provides a superior approach to current popular models for analysis of non-normal correlated data."}, "https://arxiv.org/abs/2410.11897": {"title": "A Structural Text-Based Scaling Model for Analyzing Political Discourse", "link": "https://arxiv.org/abs/2410.11897", "description": "arXiv:2410.11897v1 Announce Type: new \nAbstract: Scaling political actors based on their individual characteristics and behavior helps profiling and grouping them as well as understanding changes in the political landscape. 
In this paper we introduce the Structural Text-Based Scaling (STBS) model to infer ideological positions of speakers for latent topics from text data. We expand the usual Poisson factorization specification for topic modeling of text data and use flexible shrinkage priors to induce sparsity and enhance interpretability. We also incorporate speaker-specific covariates to assess their association with ideological positions. Applying STBS to U.S. Senate speeches from Congress session 114, we identify immigration and gun violence as the most polarizing topics between the two major parties in Congress. Additionally, we find that, in discussions about abortion, the gender of the speaker significantly influences their position, with female speakers focusing more on women's health. We also see that a speaker's region of origin influences their ideological position more than their religious affiliation."}, "https://arxiv.org/abs/2410.11902": {"title": "Stochastic Nonlinear Model Updating in Structural Dynamics Using a Novel Likelihood Function within the Bayesian-MCMC Framework", "link": "https://arxiv.org/abs/2410.11902", "description": "arXiv:2410.11902v1 Announce Type: new \nAbstract: The study presents a novel approach for stochastic nonlinear model updating in structural dynamics, employing a Bayesian framework integrated with Markov Chain Monte Carlo (MCMC) sampling for parameter estimation by using an approximated likelihood function. The proposed methodology is applied to both numerical and experimental cases. The paper commences by introducing Bayesian inference and its constituents: the likelihood function, prior distribution, and posterior distribution. The resonant decay method is employed to extract backbone curves, which capture the non-linear behaviour of the system. A mathematical model based on a single degree of freedom (SDOF) system is formulated, and backbone curves are obtained from time response data. Subsequently, MCMC sampling is employed to estimate the parameters using both numerical and experimental data. The obtained results demonstrate the convergence of the Markov chain, present parameter trace plots, and provide estimates of posterior distributions of updated parameters along with their uncertainties. Experimental validation is performed on a cantilever beam system equipped with permanent magnets and electromagnets. The proposed methodology demonstrates promising results in estimating parameters of stochastic non-linear dynamical systems. Through the use of the proposed likelihood functions using backbone curves, the probability distributions of both linear and non-linear parameters are simultaneously identified. Based on this view, the necessity to segregate stochastic linear and non-linear model updating is eliminated."}, "https://arxiv.org/abs/2410.11916": {"title": "\"The Simplest Idea One Can Have\" for Seamless Forecasts with Postprocessing", "link": "https://arxiv.org/abs/2410.11916", "description": "arXiv:2410.11916v1 Announce Type: new \nAbstract: Seamless forecasts are based on a combination of different sources to produce the best possible forecasts. Statistical multimodel postprocessing helps to combine various sources to achieve these seamless forecasts. However, when one of the combined sources of the forecast is not available due to reaching the end of its forecasting horizon, forecasts can be temporally inconsistent and sudden drops in skill can be observed. 
To obtain a seamless forecast, the output of multimodel postprocessing is often blended across these transitions, although this unnecessarily worsens the forecasts immediately before the transition. Additionally, large differences between the latest observation and the first forecasts can be present. This paper presents an idea to preserve a smooth temporal prediction until the end of the forecast range and increase its predictability. This optimal seamless forecast is simply accomplished by not excluding any model from the multimodel by using the latest possible lead time as model persistence into the future. Furthermore, the gap between the latest available observation and the first model step is seamlessly closed with the persistence of the observation by using the latest observation as additional predictor. With this idea, no visible jump in forecasts is observed and the verification presents a seamless quality in terms of scores. The benefit of accounting for observation and forecast persistence in multimodel postprocessing is illustrated using a simple temperature example with linear regression but can also be extended to other predictors and postprocessing methods."}, "https://arxiv.org/abs/2410.12093": {"title": "A Unified Framework for Causal Estimand Selection", "link": "https://arxiv.org/abs/2410.12093", "description": "arXiv:2410.12093v1 Announce Type: new \nAbstract: To determine the causal effect of a treatment using observational data, it is important to balance the covariate distributions between treated and control groups. However, achieving balance can be difficult when treated and control groups lack overlap. In the presence of limited overlap, researchers typically choose between two types of methods: 1) methods (e.g., inverse propensity score weighting) that imply traditional estimands (e.g., ATE) but whose estimators are at risk of variance inflation and considerable statistical bias; and 2) methods (e.g., overlap weighting) which imply a different estimand, thereby changing the target population to reduce variance. In this work, we introduce a framework for characterizing estimands by their target populations and the statistical performance of their estimators. We introduce a bias decomposition that encapsulates bias due to 1) the statistical bias of the estimator; and 2) estimand mismatch, i.e., deviation from the population of interest. We propose a design-based estimand selection procedure that helps navigate the tradeoff between these two sources of bias and variance of the resulting estimators. Our procedure allows the analyst to incorporate their domain-specific preference for preservation of the original population versus reduction of statistical bias. We demonstrate how to select an estimand based on these preferences by applying our framework to a right heart catheterization study."}, "https://arxiv.org/abs/2410.12098": {"title": "Testing Identifying Assumptions in Parametric Separable Models: A Conditional Moment Inequality Approach", "link": "https://arxiv.org/abs/2410.12098", "description": "arXiv:2410.12098v1 Announce Type: new \nAbstract: In this paper, we propose a simple method for testing identifying assumptions in parametric separable models, namely treatment exogeneity, instrument validity, and/or homoskedasticity. 
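On the trade-off between inverse propensity weighting and overlap weighting described in the estimand-selection abstract above, a short simulation sketch; for clarity it uses the true propensity score and a constant treatment effect, both of which are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-2.5 * x))        # true propensity score, limited overlap
a = rng.binomial(1, e)
y = 1.0 * a + x + rng.normal(size=n)      # constant treatment effect of 1

def hajek_contrast(w_treat, w_ctrl):
    # Weighted-mean difference between treated and control groups.
    return np.sum(w_treat * y) / np.sum(w_treat) - np.sum(w_ctrl * y) / np.sum(w_ctrl)

# Inverse propensity weights target the ATE but can explode when e is near 0 or 1;
# overlap weights target the overlap population (ATO) and are bounded by construction.
ate_ipw = hajek_contrast(a / e, (1 - a) / (1 - e))
ato_ow = hajek_contrast(a * (1 - e), (1 - a) * e)
print(f"IPW (ATE) estimate: {ate_ipw:.3f}   overlap-weighted (ATO) estimate: {ato_ow:.3f}")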
We show that the testable implications can be written in the intersection bounds framework, which is easy to implement using the inference method proposed in Chernozhukov, Lee, and Rosen (2013) and the Stata package of Chernozhukov et al. (2015). Monte Carlo simulations confirm that our test is consistent and controls size. We use our proposed method to test the validity of some commonly used instrumental variables, such as the average price in other markets in Nevo and Rosen (2012) and the Bartik instrument in Card (2009); the test rejects both instrumental variable models. When the identifying assumptions are rejected, we discuss solutions that allow researchers to identify some causal parameters of interest after relaxing functional form assumptions. We show that the IV model is nontestable if no functional form assumption is made on the outcome equation, when there exists a one-to-one mapping between the continuous treatment variable, the instrument, and the first-stage unobserved heterogeneity."}, "https://arxiv.org/abs/2410.12108": {"title": "A General Latent Embedding Approach for Modeling Non-uniform High-dimensional Sparse Hypergraphs with Multiplicity", "link": "https://arxiv.org/abs/2410.12108", "description": "arXiv:2410.12108v1 Announce Type: new \nAbstract: Recent research has shown growing interest in modeling hypergraphs, which capture polyadic interactions among entities beyond traditional dyadic relations. However, most existing methodologies for hypergraphs face significant limitations, including their heavy reliance on uniformity restrictions for hyperlink orders and their inability to account for repeated observations of identical hyperlinks. In this work, we introduce a novel and general latent embedding approach that addresses these challenges through the integration of latent embeddings, vertex degree heterogeneity parameters, and an order-adjusting parameter. Theoretically, we investigate the identifiability conditions for the latent embeddings and associated parameters, and we establish the convergence rates of their estimators along with asymptotic distributions. Computationally, we employ a projected gradient ascent algorithm for parameter estimation. Comprehensive simulation studies demonstrate the effectiveness of the algorithm and validate the theoretical findings. Moreover, an application to a co-citation hypergraph illustrates the advantages of the proposed method."}, "https://arxiv.org/abs/2410.12117": {"title": "Empirical Bayes estimation via data fission", "link": "https://arxiv.org/abs/2410.12117", "description": "arXiv:2410.12117v1 Announce Type: new \nAbstract: We demonstrate how data fission, a method for creating synthetic replicates from single observations, can be applied to empirical Bayes estimation. This extends recent work on empirical Bayes with multiple replicates to the classical single-replicate setting. The key insight is that after data fission, empirical Bayes estimation can be cast as a general regression problem."}, "https://arxiv.org/abs/2410.12151": {"title": "Root cause discovery via permutations and Cholesky decomposition", "link": "https://arxiv.org/abs/2410.12151", "description": "arXiv:2410.12151v1 Announce Type: new \nAbstract: This work is motivated by the following problem: Can we identify the disease-causing gene in a patient affected by a monogenic disorder? This problem is an instance of root cause discovery.
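On the data-fission abstract above, a minimal Gaussian illustration (the simplest, equal-variance case): with X ~ N(theta, sigma^2) and an independent auxiliary draw Z ~ N(0, sigma^2), the fissioned pieces X + Z and X - Z are independent, so one can be used to fit an empirical Bayes prior and the other held out; the regression step itself is omitted here.

import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
theta = rng.normal(0.0, 2.0, size=5000)      # unknown means
x = rng.normal(theta, sigma)                 # one observation per mean

z = rng.normal(0.0, sigma, size=x.shape)     # auxiliary Gaussian noise
x_fit, x_holdout = x + z, x - z              # independent synthetic replicates

noise_corr = np.corrcoef(x_fit - theta, x_holdout - theta)[0, 1]
print(f"correlation between the two noise components: {noise_corr:.3f}  (should be near 0)")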
In particular, we aim to identify the intervened variable in one interventional sample using a set of observational samples as reference. We consider a linear structural equation model where the causal ordering is unknown. We begin by examining a simple method that uses squared z-scores and characterize the conditions under which this method succeeds and fails, showing that it generally cannot identify the root cause. We then prove, without additional assumptions, that the root cause is identifiable even if the causal ordering is not. Two key ingredients of this identifiability result are the use of permutations and the Cholesky decomposition, which allow us to exploit an invariant property across different permutations to discover the root cause. Furthermore, we characterize permutations that yield the correct root cause and, based on this, propose a valid method for root cause discovery. We also adapt this approach to high-dimensional settings. Finally, we evaluate the performance of our methods through simulations and apply the high-dimensional method to discover disease-causing genes in the gene expression dataset that motivates this work."}, "https://arxiv.org/abs/2410.12300": {"title": "Sparse Causal Effect Estimation using Two-Sample Summary Statistics in the Presence of Unmeasured Confounding", "link": "https://arxiv.org/abs/2410.12300", "description": "arXiv:2410.12300v1 Announce Type: new \nAbstract: Observational genome-wide association studies are now widely used for causal inference in genetic epidemiology. To maintain privacy, such data is often only publicly available as summary statistics, and often studies for the endogenous covariates and the outcome are available separately. This has necessitated methods tailored to two-sample summary statistics. Current state-of-the-art methods modify linear instrumental variable (IV) regression -- with genetic variants as instruments -- to account for unmeasured confounding. However, since the endogenous covariates can be high dimensional, standard IV assumptions are generally insufficient to identify all causal effects simultaneously. We ensure identifiability by assuming the causal effects are sparse and propose a sparse causal effect two-sample IV estimator, spaceTSIV, adapting the spaceIV estimator by Pfister and Peters (2022) for two-sample summary statistics. We provide two methods, based on L0- and L1-penalization, respectively. We prove identifiability of the sparse causal effects in the two-sample setting and consistency of spaceTSIV. The performance of spaceTSIV is compared with existing two-sample IV methods in simulations. Finally, we showcase our methods using real proteomic and gene-expression data for drug-target discovery."}, "https://arxiv.org/abs/2410.12333": {"title": "Quantifying Treatment Effects: Estimating Risk Ratios in Causal Inference", "link": "https://arxiv.org/abs/2410.12333", "description": "arXiv:2410.12333v1 Announce Type: new \nAbstract: Randomized Controlled Trials (RCT) are the current gold standards to empirically measure the effect of a new drug. However, they may be of limited size and resorting to complementary non-randomized data, referred to as observational, is promising, as additional sources of evidence. In both RCT and observational data, the Risk Difference (RD) is often used to characterize the effect of a drug. Additionally, medical guidelines recommend to also report the Risk Ratio (RR), which may provide a different comprehension of the effect of the same drug. 
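On the risk ratio being introduced at this point, a minimal plug-in estimate with a Katz-style confidence interval on the log scale, using made-up two-arm counts; this is the simple unadjusted estimator, not the covariate-adjusted or doubly robust estimators the abstract goes on to develop.

import numpy as np
from scipy import stats

# Hypothetical two-arm trial counts: events / patients in each arm.
x1, n1 = 30, 200    # treated
x0, n0 = 45, 200    # control

p1, p0 = x1 / n1, x0 / n0
rr = p1 / p0

# Delta-method standard error of log(RR) and the resulting 95% interval.
se_log = np.sqrt(1 / x1 - 1 / n1 + 1 / x0 - 1 / n0)
z = stats.norm.ppf(0.975)
lower, upper = np.exp(np.log(rr) + np.array([-z, z]) * se_log)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")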
While different methods have been proposed and studied to estimate the RD, few methods exist to estimate the RR. In this paper, we propose estimators of the RR both in RCT and observational data and provide both asymptotic and finite-sample analyses. We show that, even in an RCT, estimating the treatment allocation probability or adjusting for covariates leads to lower asymptotic variance. In observational studies, we propose weighting and outcome modeling estimators and derive their asymptotic bias and variance for well-specified models. Using semi-parametric theory, we define two doubly robust estimators with minimal variances among unbiased estimators. We support our theoretical analysis with empirical evaluations and illustrate our findings through experiments."}, "https://arxiv.org/abs/2410.12374": {"title": "Forecasting Densities of Fatalities from State-based Conflicts using Observed Markov Models", "link": "https://arxiv.org/abs/2410.12374", "description": "arXiv:2410.12374v1 Announce Type: new \nAbstract: In this contribution to the VIEWS 2023 prediction challenge, we propose using an observed Markov model for making predictions of densities of fatalities from armed conflicts. The observed Markov model can be conceptualized as a two-stage model. The first stage involves a standard Markov model, where the latent states are pre-defined based on domain knowledge about conflict states. The second stage is a set of regression models conditional on the latent Markov-states which predict the number of fatalities. In the VIEWS 2023/24 prediction competition, we use a random forest classifier for modeling the transitions between the latent Markov states and a quantile regression forest to model the fatalities conditional on the latent states. For the predictions, we dynamically simulate latent state paths and randomly draw fatalities for each country-month from the conditional distribution of fatalities given the latent states. Interim evaluation of out-of-sample performance indicates that the observed Markov model produces well-calibrated forecasts which outperform the benchmark models and are among the top performing models across the evaluation metrics."}, "https://arxiv.org/abs/2410.12386": {"title": "Bias correction of quadratic spectral estimators", "link": "https://arxiv.org/abs/2410.12386", "description": "arXiv:2410.12386v1 Announce Type: new \nAbstract: The three cardinal, statistically consistent, families of non-parametric estimators to the power spectral density of a time series are lag-window, multitaper and Welch estimators. However, when estimating power spectral densities from a finite sample, each can be subject to non-ignorable bias. Astfalck et al. (2024) developed a method that offers significant bias reduction for finite samples for Welch's estimator, which this article extends to the larger family of quadratic estimators, thus offering similar theory for bias correction of lag-window and multitaper estimators as well as combinations thereof. Importantly, this theory may be used in conjunction with any and all tapers and lag-sequences designed for bias reduction, and so should be seen as an extension to valuable work in these fields, rather than a supplanting methodology. The order of computation is larger than the O(n log n) typical in spectral analyses, but not insurmountable in practice.
Simulation studies support the theory with comparisons across variations of quadratic estimators."}, "https://arxiv.org/abs/2410.12454": {"title": "Conditional Outcome Equivalence: A Quantile Alternative to CATE", "link": "https://arxiv.org/abs/2410.12454", "description": "arXiv:2410.12454v1 Announce Type: new \nAbstract: The conditional quantile treatment effect (CQTE) can provide insight into the effect of a treatment beyond the conditional average treatment effect (CATE). This ability to provide information over multiple quantiles of the response makes CQTE especially valuable in cases where the effect of a treatment is not well-modelled by a location shift, even conditionally on the covariates. Nevertheless, the estimation of CQTE is challenging and often depends upon the smoothness of the individual quantiles as a function of the covariates rather than smoothness of the CQTE itself. This is in stark contrast to CATE, where it is possible to obtain high-quality estimates which have less dependency upon the smoothness of the nuisance parameters when the CATE itself is smooth. Moreover, relative smoothness of the CQTE lacks the interpretability of smoothness of the CATE, making it less clear whether it is a reasonable assumption to make. We combine the desirable properties of CATE and CQTE by considering a new estimand, the conditional quantile comparator (CQC). The CQC not only retains information about the whole treatment distribution, similar to CQTE, but also has more natural examples of smoothness and is able to leverage simplicity in an auxiliary estimand. We provide finite sample bounds on the error of our estimator, demonstrating its ability to exploit simplicity. We validate our theory in numerical simulations which show that our method produces more accurate estimates than baselines. Finally, we apply our methodology to a study on the effect of employment incentives on earnings across different age groups. We see that our method is able to reveal heterogeneity of the effect across different quantiles."}, "https://arxiv.org/abs/2410.12590": {"title": "Flexible and Efficient Estimation of Causal Effects with Error-Prone Exposures: A Control Variates Approach for Measurement Error", "link": "https://arxiv.org/abs/2410.12590", "description": "arXiv:2410.12590v1 Announce Type: new \nAbstract: Exposure measurement error is a ubiquitous but often overlooked challenge in causal inference with observational data. Existing methods accounting for exposure measurement error largely rely on restrictive parametric assumptions, while emerging data-adaptive estimation approaches allow for less restrictive assumptions but at the cost of flexibility, as they are typically tailored towards rigidly-defined statistical quantities. There remains a critical need for assumption-lean estimation methods that are both flexible and possess desirable theoretical properties across a variety of study designs. In this paper, we introduce a general framework for estimation of causal quantities in the presence of exposure measurement error, adapted from the control variates approach of Yang and Ding (2019). Our method can be implemented in various two-phase sampling study designs, where one obtains gold-standard exposure measurements for a small subset of the full study sample, called the validation data.
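On the control-variates construction being described here, a textbook two-phase sketch for a simple population mean, assuming a gold-standard measurement available only on a validation subsample and an error-prone proxy observed for everyone; the causal-effect estimators in the abstract share this augmentation structure but are considerably more involved.

import numpy as np

rng = np.random.default_rng(4)
N, n_val = 20000, 500

x_true = rng.gamma(2.0, 1.0, size=N)              # gold-standard measurement (true mean 2.0)
x_err = x_true + rng.normal(0.0, 0.7, size=N)     # error-prone proxy, available for all units
val = rng.choice(N, size=n_val, replace=False)    # phase-two validation subsample

theta_val = x_true[val].mean()                    # initial estimator from validation data only

# Control variate: the proxy mean, computable on both the full and validation samples.
phi_full, phi_val = x_err.mean(), x_err[val].mean()
gamma = np.cov(x_true[val], x_err[val])[0, 1] / np.var(x_err[val], ddof=1)
theta_cv = theta_val + gamma * (phi_full - phi_val)   # variance-reduced estimator

print(f"validation-only: {theta_val:.3f}   control variates: {theta_cv:.3f}   truth: 2.0")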
The control variates framework leverages both the error-prone and error-free exposure measurements by augmenting an initial consistent estimator from the validation data with a variance reduction term formed from the full data. We show that our method inherits double-robustness properties under standard causal assumptions. Simulation studies show that our approach performs favorably compared to leading methods under various two-phase sampling schemes. We illustrate our method with observational electronic health record data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic."}, "https://arxiv.org/abs/2410.12709": {"title": "A Simple Interactive Fixed Effects Estimator for Short Panels", "link": "https://arxiv.org/abs/2410.12709", "description": "arXiv:2410.12709v1 Announce Type: new \nAbstract: We study the interactive effects (IE) model as an extension of the conventional additive effects (AE) model. For the AE model, the fixed effects estimator can be obtained by applying least squares to a regression that adds a linear projection of the fixed effect on the explanatory variables (Mundlak, 1978; Chamberlain, 1984). In this paper, we develop a novel estimator -- the projection-based IE (PIE) estimator -- for the IE model that is based on a similar approach. We show that, for the IE model, fixed effects estimators that have appeared in the literature are not equivalent to our PIE estimator, though both can be expressed as a generalized within estimator. Unlike the fixed effects estimators for the IE model, the PIE estimator is consistent for a fixed number of time periods with no restrictions on serial correlation or conditional heteroskedasticity in the errors. We also derive a statistic for testing the consistency of the two-way fixed effects estimator in the possible presence of interactive effects. Moreover, although the PIE estimator is the solution to a high-dimensional nonlinear least squares problem, we show that it can be computed by iterating between two steps, both of which have simple analytical solutions. The computational simplicity is an important advantage relative to other strategies that have been proposed for estimating the IE model for short panels. Finally, we compare the finite sample performance of IE estimators through simulations."}, "https://arxiv.org/abs/2410.12731": {"title": "Counterfactual Analysis in Empirical Games", "link": "https://arxiv.org/abs/2410.12731", "description": "arXiv:2410.12731v1 Announce Type: new \nAbstract: We address counterfactual analysis in empirical models of games with partially identified parameters and multiple equilibria and/or randomized strategies, by constructing and analyzing the counterfactual predictive distribution set (CPDS). This framework accommodates various outcomes of interest, including behavioral and welfare outcomes. It allows a variety of changes to the environment to generate the counterfactual, including modifications of the utility functions, the distribution of utility determinants, the number of decision makers, and the solution concept. We use a Bayesian approach to summarize statistical uncertainty. We establish conditions under which the population CPDS is sharp from the point of view of identification. We also establish conditions under which the posterior CPDS is consistent if the posterior distribution for the underlying model parameter is consistent.
Consequently, our results can be employed to conduct counterfactual analysis after a preliminary step of identifying and estimating the underlying model parameter based on the existing literature. Our consistency results involve the development of a new general theory for Bayesian consistency of posterior distributions for mappings of sets. Although we primarily focus on a model of a strategic game, our approach is applicable to other structural models with similar features."}, "https://arxiv.org/abs/2410.12740": {"title": "Just Ramp-up: Unleash the Potential of Regression-based Estimator for A/B Tests under Network Interference", "link": "https://arxiv.org/abs/2410.12740", "description": "arXiv:2410.12740v1 Announce Type: new \nAbstract: Recent research in causal inference under network interference has explored various experimental designs and estimation techniques to address this issue. However, existing methods, which typically rely on single experiments, often reach a performance bottleneck and face limitations in handling diverse interference structures. In contrast, we propose leveraging multiple experiments to overcome these limitations. In industry, the use of sequential experiments, often known as the ramp-up process, where traffic to the treatment gradually increases, is common due to operational needs like risk management and cost control. Our approach shifts the focus from operational aspects to the statistical advantages of merging data from multiple experiments. By combining data from sequentially conducted experiments, we aim to estimate the global average treatment effect more effectively. In this paper, we begin by analyzing the bias and variance of the linear regression estimator for GATE under general linear network interference. We demonstrate that bias plays a dominant role in the bias-variance tradeoff and highlight the intrinsic bias reduction achieved by merging data from experiments with strictly different treatment proportions. Herein the improvement introduced by merging two steps of experimental data is essential. In addition, we show that merging more steps of experimental data is unnecessary under general linear interference, while it can become beneficial when nonlinear interference occurs. Furthermore, we look into a more advanced estimator based on graph neural networks. Through extensive simulation studies, we show that the regression-based estimator benefits remarkably from training on merged experiment data, achieving outstanding statistical performance."}, "https://arxiv.org/abs/2410.12047": {"title": "Testing Causal Explanations: A Case Study for Understanding the Effect of Interventions on Chronic Kidney Disease", "link": "https://arxiv.org/abs/2410.12047", "description": "arXiv:2410.12047v1 Announce Type: cross \nAbstract: Randomized controlled trials (RCTs) are the standard for evaluating the effectiveness of clinical interventions. To address the limitations of RCTs on real-world populations, we developed a methodology that uses a large observational electronic health record (EHR) dataset. Principles of regression discontinuity (rd) were used to derive randomized data subsets to test expert-driven interventions using dynamic Bayesian Networks (DBNs) do-operations. 
This combined method was applied to a chronic kidney disease (CKD) cohort of more than two million individuals and used to understand the associational and causal relationships of CKD variables with respect to a surrogate outcome of >=40% decline in estimated glomerular filtration rate (eGFR). The associational and causal analyses depicted similar findings across DBNs from two independent healthcare systems. The associational analysis showed that the most influential variables were eGFR, urine albumin-to-creatinine ratio, and pulse pressure, whereas the causal analysis showed eGFR as the most influential variable, followed by modifiable factors such as medications that may impact kidney function over time. This methodology demonstrates how real-world EHR data can be used to provide population-level insights to inform improved healthcare delivery."}, "https://arxiv.org/abs/2410.12201": {"title": "SAT: Data-light Uncertainty Set Merging via Synthetics, Aggregation, and Test Inversion", "link": "https://arxiv.org/abs/2410.12201", "description": "arXiv:2410.12201v1 Announce Type: cross \nAbstract: The integration of uncertainty sets has diverse applications but also presents challenges, particularly when only initial sets and their control levels are available, along with potential dependencies. Examples include merging confidence sets from different distributed sites with communication constraints, as well as combining conformal prediction sets generated by different learning algorithms or data splits. In this article, we introduce an efficient and flexible Synthetic, Aggregation, and Test inversion (SAT) approach to merge various potentially dependent uncertainty sets into a single set. The proposed method constructs a novel class of synthetic test statistics, aggregates them, and then derives merged sets through test inversion. Our approach leverages the duality between set estimation and hypothesis testing, ensuring reliable coverage in dependent scenarios. The procedure is data-light, meaning it relies solely on initial sets and control levels without requiring raw data, and it adapts to any user-specified initial uncertainty sets, accommodating potentially varying coverage levels. Theoretical analyses and numerical experiments confirm that SAT provides finite-sample coverage guarantees and achieves small set sizes."}, "https://arxiv.org/abs/2410.12209": {"title": "Global Censored Quantile Random Forest", "link": "https://arxiv.org/abs/2410.12209", "description": "arXiv:2410.12209v1 Announce Type: cross \nAbstract: In recent years, censored quantile regression has enjoyed an increasing popularity for survival analysis while many existing works rely on linearity assumptions. In this work, we propose a Global Censored Quantile Random Forest (GCQRF) for predicting a conditional quantile process on data subject to right censoring, a forest-based flexible, competitive method able to capture complex nonlinear relationships. Taking into account the randomness in trees and connecting the proposed method to a randomized incomplete infinite degree U-process (IDUP), we quantify the prediction process' variation without assuming an infinite forest and establish its weak convergence. Moreover, feature importance ranking measures based on out-of-sample predictive accuracy are proposed. 
We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives and illustrate the use of the proposed importance ranking measures on both simulated and real data."}, "https://arxiv.org/abs/2410.12224": {"title": "Causally-Aware Unsupervised Feature Selection Learning", "link": "https://arxiv.org/abs/2410.12224", "description": "arXiv:2410.12224v1 Announce Type: cross \nAbstract: Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data. However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization."}, "https://arxiv.org/abs/2410.12367": {"title": "Adaptive and Stratified Subsampling Techniques for High Dimensional Non-Standard Data Environments", "link": "https://arxiv.org/abs/2410.12367", "description": "arXiv:2410.12367v1 Announce Type: cross \nAbstract: This paper addresses the challenge of estimating high-dimensional parameters in non-standard data environments, where traditional methods often falter due to issues such as heavy-tailed distributions, data contamination, and dependent observations. We propose robust subsampling techniques, specifically Adaptive Importance Sampling (AIS) and Stratified Subsampling, designed to enhance the reliability and efficiency of parameter estimation. Under some clearly outlined conditions, we establish consistency and asymptotic normality for the proposed estimators, providing non-asymptotic error bounds that quantify their performance. Our theoretical foundations are complemented by controlled experiments demonstrating the superiority of our methods over conventional approaches. 
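On the stratified subsampling just mentioned, a minimal sketch with proportional allocation across quartile strata of an auxiliary variable, on simulated heavy-tailed data; the adaptive importance sampling variant and the robustness theory are not shown.

import numpy as np

rng = np.random.default_rng(5)
N, n = 100000, 1000

x = rng.standard_t(df=3, size=N)                  # heavy-tailed auxiliary variable
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=N)  # target variable

# Stratify the population by quartiles of x and subsample proportionally.
edges = np.quantile(x, [0.25, 0.5, 0.75])
stratum = np.digitize(x, edges)

strat_est = 0.0
for s in range(4):
    members = np.flatnonzero(stratum == s)
    take = rng.choice(members, size=n // 4, replace=False)
    strat_est += (len(members) / N) * y[take].mean()  # stratum weight times stratum sample mean

srs_est = y[rng.choice(N, size=n, replace=False)].mean()
print(f"stratified: {strat_est:.3f}   simple random subsample: {srs_est:.3f}   population mean: {y.mean():.3f}")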
By bridging the gap between theory and practice, this work offers significant contributions to robust statistical estimation, paving the way for advancements in various applied domains."}, "https://arxiv.org/abs/2410.12690": {"title": "Local transfer learning Gaussian process modeling, with applications to surrogate modeling of expensive computer simulators", "link": "https://arxiv.org/abs/2410.12690", "description": "arXiv:2410.12690v1 Announce Type: cross \nAbstract: A critical bottleneck for scientific progress is the costly nature of computer simulations for complex systems. Surrogate models provide an appealing solution: such models are trained on simulator evaluations, then used to emulate and quantify uncertainty on the expensive simulator at unexplored inputs. In many applications, one often has available data on related systems. For example, in designing a new jet turbine, there may be existing studies on turbines with similar configurations. A key question is how information from such \"source\" systems can be transferred for effective surrogate training on the \"target\" system of interest. We thus propose a new LOcal transfer Learning Gaussian Process (LOL-GP) model, which leverages a carefully-designed Gaussian process to transfer such information for surrogate modeling. The key novelty of the LOL-GP is a latent regularization model, which identifies regions where transfer should be performed and regions where it should be avoided. This \"local transfer\" property is desirable in scientific systems: at certain parameters, such systems may behave similarly and thus transfer is beneficial; at other parameters, they may behave differently and thus transfer is detrimental. By accounting for local transfer, the LOL-GP can rectify a critical limitation of \"negative transfer\" in existing transfer learning models, where the transfer of information worsens predictive performance. We derive a Gibbs sampling algorithm for efficient posterior predictive sampling on the LOL-GP, for both the multi-source and multi-fidelity transfer settings. We then show, via a suite of numerical experiments and an application for jet turbine design, the improved surrogate performance of the LOL-GP over existing methods."}, "https://arxiv.org/abs/2306.06932": {"title": "Revisiting Whittaker-Henderson Smoothing", "link": "https://arxiv.org/abs/2306.06932", "description": "arXiv:2306.06932v3 Announce Type: replace \nAbstract: Introduced nearly a century ago, Whittaker-Henderson smoothing is still widely used by actuaries for constructing one-dimensional and two-dimensional experience tables for mortality, disability and other Life Insurance risks. Our paper reframes this smoothing technique within a modern statistical framework and addresses six questions of practical relevance regarding its use.Firstly, we adopt a Bayesian view of this smoothing method to build credible intervals. Next, we shed light on the choice of the observation and weight vectors to which the smoothing should be applied by linking it to a maximum likelihood estimator introduced in the context of duration models. We then enhance the precision of the smoothing by relaxing an implicit asymptotic normal approximation on which it relies. Afterward, we select the smoothing parameters based on maximizing a marginal likelihood function. We later improve numerical performance in the presence of many observation points and consequently parameters. 
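On Whittaker-Henderson smoothing as set up above, a small numpy implementation of the classical one-dimensional criterion and its closed-form solution, applied to made-up mortality-like rates; the Bayesian credible intervals, likelihood-based weighting, and marginal-likelihood choice of the smoothing parameter discussed in the abstract are not included.

import numpy as np

def whittaker_henderson(y, w=None, lam=100.0, d=2):
    # Minimize sum_i w_i (y_i - theta_i)^2 + lam * ||D_d theta||^2,
    # whose closed-form solution is theta = (W + lam * D'D)^{-1} W y.
    n = len(y)
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    D = np.diff(np.eye(n), n=d, axis=0)    # d-th order difference matrix
    return np.linalg.solve(np.diag(w) + lam * D.T @ D, w * y)

# Noisy, made-up log mortality rates over ages 40-89, then smoothed.
rng = np.random.default_rng(6)
ages = np.arange(40, 90)
log_raw = np.log(0.0005) + 0.09 * (ages - 40) + rng.normal(0.0, 0.15, ages.size)
log_smooth = whittaker_henderson(log_raw, lam=50.0, d=2)
print(np.round(np.exp(log_smooth[:5]), 5))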
Finally, we extrapolate the results of the smoothing while preserving, through the use of constraints, consistency between estimated and predicted values."}, "https://arxiv.org/abs/2401.14910": {"title": "Modeling Extreme Events: Univariate and Multivariate Data-Driven Approaches", "link": "https://arxiv.org/abs/2401.14910", "description": "arXiv:2401.14910v2 Announce Type: replace \nAbstract: This article summarizes the contribution of team genEVA to the EVA (2023) Conference Data Challenge. The challenge comprises four individual tasks, with two focused on univariate extremes and two related to multivariate extremes. In the first univariate assignment, we estimate a conditional extremal quantile using a quantile regression approach with neural networks. For the second, we develop a fine-tuning procedure for improved extremal quantile estimation with a given conservative loss function. In the first multivariate sub-challenge, we approximate the data-generating process with a copula model. In the remaining task, we use clustering to separate a high-dimensional problem into approximately independent components. Overall, competitive results were achieved for all challenges, and our approaches for the univariate tasks yielded the most accurate quantile estimates in the competition."}, "https://arxiv.org/abs/1207.0258": {"title": "Control of Probability Flow in Markov Chain Monte Carlo -- Nonreversibility and Lifting", "link": "https://arxiv.org/abs/1207.0258", "description": "arXiv:1207.0258v2 Announce Type: replace-cross \nAbstract: The Markov Chain Monte Carlo (MCMC) method is widely used in various fields as a powerful numerical integration technique for systems with many degrees of freedom. In MCMC methods, probabilistic state transitions can be considered as a random walk in state space, and random walks allow for sampling from complex distributions. However, paradoxically, it is necessary to carefully suppress the randomness of the random walk to improve computational efficiency. By breaking detailed balance, we can create a probability flow in the state space and perform more efficient sampling along this flow. Motivated by this idea, practical and efficient nonreversible MCMC methods have been developed over the past ten years. In particular, the lifting technique, which introduces probability flows in an extended state space, has been applied to various systems and has proven more efficient than conventional reversible updates. We review and discuss several practical approaches to implementing nonreversible MCMC methods, including the shift method in the cumulative distribution and the directed-worm algorithm."}, "https://arxiv.org/abs/1905.10806": {"title": "Score-Driven Exponential Random Graphs: A New Class of Time-Varying Parameter Models for Dynamical Networks", "link": "https://arxiv.org/abs/1905.10806", "description": "arXiv:1905.10806v3 Announce Type: replace-cross \nAbstract: Motivated by the increasing abundance of data describing real-world networks that exhibit dynamical features, we propose an extension of the Exponential Random Graph Models (ERGMs) that accommodates the time variation of its parameters. Inspired by the fast-growing literature on Dynamic Conditional Score models, each parameter evolves according to an updating rule driven by the score of the ERGM distribution. We demonstrate the flexibility of score-driven ERGMs (SD-ERGMs) as data-generating processes and filters and show the advantages of the dynamic version over the static one. 
We discuss two applications to temporal networks from financial and political systems. First, we consider the prediction of future links in the Italian interbank credit network. Second, we show that the SD-ERGM allows discriminating between static and time-varying parameters when used to model the U.S. Congress co-voting network dynamics."}, "https://arxiv.org/abs/2401.13943": {"title": "Is the age pension in Australia sustainable and fair? Evidence from forecasting the old-age dependency ratio using the Hamilton-Perry model", "link": "https://arxiv.org/abs/2401.13943", "description": "arXiv:2401.13943v2 Announce Type: replace-cross \nAbstract: The age pension aims to assist eligible elderly Australians who meet specific age and residency criteria in maintaining basic living standards. In designing efficient pension systems, government policymakers seek to satisfy the expectations of the overall aging population in Australia. However, the population's unique demographic characteristics at the state and territory level are often overlooked due to the lack of available data. We use the Hamilton-Perry model, which requires minimum input, to model and forecast the evolution of age-specific populations at the state level. We also integrate the obtained sub-national demographic information to determine sustainable pension ages up to 2051. We further investigate pension welfare distribution in all states and territories to identify disadvantaged residents under the current pension system. Using the sub-national mortality data for Australia from 1971 to 2021 obtained from AHMD (2023), we implement the Hamilton-Perry model with the help of functional time series forecasting techniques. With forecasts of age-specific population sizes for each state and territory, we compute the old age dependency ratio to determine the nationwide sustainable pension age."}, "https://arxiv.org/abs/2410.12930": {"title": "Probabilistic inference when the population space is open", "link": "https://arxiv.org/abs/2410.12930", "description": "arXiv:2410.12930v1 Announce Type: new \nAbstract: In using observed data to make inferences about a population quantity, it is commonly assumed that the sampling distribution from which the data were drawn belongs to a given parametric family of distributions, or at least, a given finite set of such families, i.e. the population space is assumed to be closed. Here, we address the problem of how to determine an appropriate post-data distribution for a given population quantity when such an assumption about the underlying sampling distribution is not made, i.e. when the population space is open. The strategy used to address this problem is based on the fact that even though, due to an open population space being non-measurable, we are not able to place a post-data distribution over all the sampling distributions contained in such a population space, it is possible to partition this type of space into a finite, countable or uncountable number of subsets such that a distribution can be placed over a variable that simply indicates which of these subsets contains the true sampling distribution. Moreover, it is argued that, by using sampling distributions that belong to a number of parametric families, it is possible to adequately and elegantly represent the sampling distributions that belong to each of the subsets of such a partition.
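On the Hamilton-Perry model used in the age-pension abstract above, a minimal cohort-change-ratio projection with made-up five-year age-group counts; the functional time series forecasting layer described there is omitted, as are the special rules needed for the youngest and the open-ended oldest age groups.

import numpy as np

# Illustrative population counts (thousands) by 5-year age group at two
# censuses five years apart; the numbers are invented for the example.
age_groups = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29"]
pop_2011 = np.array([310.0, 300.0, 295.0, 290.0, 320.0, 340.0])
pop_2016 = np.array([315.0, 312.0, 301.0, 293.0, 335.0, 338.0])

# Hamilton-Perry cohort change ratios: CCR_x = P_{x+5}(2016) / P_x(2011).
ccr = pop_2016[1:] / pop_2011[:-1]

# Project 2021 by ageing the 2016 cohorts forward with the ratios.
pop_2021 = np.empty_like(pop_2016)
pop_2021[0] = pop_2016[0]          # placeholder: the youngest group normally uses child-woman ratios
pop_2021[1:] = ccr * pop_2016[:-1]

for group, projected in zip(age_groups, np.round(pop_2021, 1)):
    print(group, projected)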
Since a statistical model is conceived as being a model of a population space rather than a model of a sampling distribution, it is also argued that neither the type of models that are put forward nor the expression of pre-data knowledge via such models can be directly brought into question by the data. Finally, the case is made that, as well as not being required in the modelling process that is proposed, the standard practice of using P values to measure the absolute compatibility of an individual or family of sampling distributions with observed data is neither meaningful nor useful."}, "https://arxiv.org/abs/2410.13050": {"title": "Improved control of Dirichlet location and scale near the boundary", "link": "https://arxiv.org/abs/2410.13050", "description": "arXiv:2410.13050v1 Announce Type: new \nAbstract: Dirichlet distributions are commonly used for modeling vectors in a probability simplex. When used as a prior or a proposal distribution, it is natural to set the mean of a Dirichlet to be equal to the location where one wants the distribution to be centered. However, if the mean is near the boundary of the probability simplex, then a Dirichlet distribution becomes highly concentrated either (i) at the mean or (ii) extremely close to the boundary. Consequently, centering at the mean provides poor control over the location and scale near the boundary. In this article, we introduce a method for improved control over the location and scale of Beta and Dirichlet distributions. Specifically, given a target location point and a desired scale, we maximize the density at the target location point while constraining a specified measure of scale. We consider various choices of scale constraint, such as fixing the concentration parameter, the mean cosine error, or the variance in the Beta case. In several examples, we show that this maximum density method provides superior performance for constructing priors, defining Metropolis-Hastings proposals, and generating simulated probability vectors."}, "https://arxiv.org/abs/2410.13113": {"title": "Analyzing longitudinal electronic health records data with clinically informative visiting process: possible choices and comparisons", "link": "https://arxiv.org/abs/2410.13113", "description": "arXiv:2410.13113v1 Announce Type: new \nAbstract: Analyzing longitudinal electronic health records (EHR) data is challenging due to clinically informative observation processes, where the timing and frequency of patient visits often depend on their underlying health condition. Traditional longitudinal data analysis rarely considers observation times as stochastic or models them as functions of covariates. In this work, we evaluate the impact of informative visiting processes on two common statistical tasks: (1) estimating the effect of an exposure on a longitudinal biomarker, and (2) assessing the effect of longitudinal biomarkers on a time-to-event outcome, such as disease diagnosis. The methods we consider include simple summaries of the observed longitudinal data, imputation, inverse weighting by the estimated intensity of the visit process, and joint modeling. To our knowledge, this is the most comprehensive evaluation of inferential methods specifically tailored for EHR-like visiting processes. For the first task, we speed up certain methods by a factor of around 18, making them more suitable for large-scale EHR data analysis. 
For the second task, where no method currently accounts for the visiting process in joint models of longitudinal and time-to-event data, we propose incorporating the historical number of visits to adjust for informative visiting. Using data from the longitudinal biobank at the University of Michigan Health System, we investigate two case studies: 1) the association between genetic variants and lab markers with repeated measures (known as the LabWAS); and 2) the association between cardiometabolic health markers and time to hypertension diagnosis. We show how accounting for informative visiting processes affects the analysis results. We develop an R package CIMPLE (Clinically Informative Missingness handled through Probabilities, Likelihood, and Estimating equations) that integrates all these methods."}, "https://arxiv.org/abs/2410.13115": {"title": "Online conformal inference for multi-step time series forecasting", "link": "https://arxiv.org/abs/2410.13115", "description": "arXiv:2410.13115v1 Announce Type: new \nAbstract: We consider the problem of constructing distribution-free prediction intervals for multi-step time series forecasting, with a focus on the temporal dependencies inherent in multi-step forecast errors. We establish that the optimal $h$-step-ahead forecast errors exhibit serial correlation up to lag $(h-1)$ under a general non-stationary autoregressive data generating process. To leverage these properties, we propose the Autocorrelated Multi-step Conformal Prediction (AcMCP) method, which effectively incorporates autocorrelations in multi-step forecast errors, resulting in more statistically efficient prediction intervals. This method ensures theoretical long-run coverage guarantees for multi-step prediction intervals, though we note that increased forecasting horizons may exacerbate deviations from the target coverage, particularly in the context of limited sample sizes. Additionally, we extend several easy-to-implement conformal prediction methods, originally designed for single-step forecasting, to accommodate multi-step scenarios. Through empirical evaluations, including simulations and applications to data, we demonstrate that AcMCP achieves coverage that closely aligns with the target within local windows, while providing adaptive prediction intervals that effectively respond to varying conditions."}, "https://arxiv.org/abs/2410.13142": {"title": "Agnostic Characterization of Interference in Randomized Experiments", "link": "https://arxiv.org/abs/2410.13142", "description": "arXiv:2410.13142v1 Announce Type: new \nAbstract: We give an approach for characterizing interference by lower bounding the number of units whose outcome depends on certain groups of treated individuals, such as depending on the treatment of others, or others who are at least a certain distance away. The approach is applicable to randomized experiments with binary-valued outcomes. Asymptotically conservative point estimates and one-sided confidence intervals may be constructed with no assumptions beyond the known randomization design, allowing the approach to be used when interference is poorly understood, or when an observed network might only be a crude proxy for the underlying social mechanisms. Point estimates are equal to Hajek-weighted comparisons of units with differing levels of treatment exposure. 
Empirically, we find that the sizes of our interval estimates are competitive with (and often smaller than) those of the EATE, an assumption-lean treatment effect, suggesting that the proposed estimands may be intrinsically easier to estimate than treatment effects."}, "https://arxiv.org/abs/2410.13164": {"title": "Markov Random Fields with Proximity Constraints for Spatial Data", "link": "https://arxiv.org/abs/2410.13164", "description": "arXiv:2410.13164v1 Announce Type: new \nAbstract: The conditional autoregressive (CAR) model, simultaneous autoregressive (SAR) model, and their variants have become the predominant strategies for modeling regional or areal-referenced spatial data. The overwhelmingly wide use of CAR/SAR models motivates the need for new classes of models for areal-referenced data. Thus, we develop a novel class of Markov random fields based on truncating the full-conditional distribution. We define this truncation in two ways, leading to versions of what we call the truncated autoregressive (TAR) model. First, we truncate the full conditional distribution so that a response at one location is close to the average of its neighbors. This strategy establishes relationships between the TAR and CAR models. Second, we truncate on the joint distribution of the data process in a similar way. This specification leads to a connection between the TAR and SAR models. Our Bayesian implementation does not use Markov chain Monte Carlo (MCMC) for Bayesian computation, and instead generates samples directly from the posterior distribution. Moreover, TAR does not have a range parameter that arises in the CAR/SAR models, which can be difficult to learn. We present the results of the proposed truncated autoregressive model on several simulated datasets and on a dataset of average property prices."}, "https://arxiv.org/abs/2410.13170": {"title": "Adaptive LAD-Based Bootstrap Unit Root Tests under Unconditional Heteroskedasticity", "link": "https://arxiv.org/abs/2410.13170", "description": "arXiv:2410.13170v1 Announce Type: new \nAbstract: This paper explores testing unit roots based on least absolute deviations (LAD) regression under unconditional heteroskedasticity. We first derive the asymptotic properties of the LAD estimator for a first-order autoregressive process with the coefficient (local to) unity under unconditional heteroskedasticity and weak dependence, revealing that the limiting distribution of the LAD estimator (and consequently the derived test statistics) is closely associated with unknown time-varying variances. To conduct feasible LAD-based unit root tests under heteroskedasticity and serial dependence, we develop an adaptive block bootstrap procedure, which accommodates time-varying volatility and serial dependence, both of unknown forms, to compute critical values for LAD-based tests. The asymptotic validity is established. We then extend the testing procedure to allow for deterministic components. Simulation results indicate that, in the presence of unconditional heteroskedasticity and serial dependence, the classic LAD-based tests demonstrate severe size distortion, whereas the proposed LAD-based bootstrap tests exhibit good size-control capability. Additionally, the newly developed tests show superior testing power in heavy-tailed cases compared to the considered benchmarks. 
Finally, an empirical analysis of the real effective exchange rates of 16 EU countries is conducted to illustrate the applicability of the newly proposed tests."}, "https://arxiv.org/abs/2410.13261": {"title": "Novel Bayesian algorithms for ARFIMA long-memory processes: a comparison between MCMC and ABC approaches", "link": "https://arxiv.org/abs/2410.13261", "description": "arXiv:2410.13261v1 Announce Type: new \nAbstract: This paper presents a comparative study of two Bayesian approaches - Markov Chain Monte Carlo (MCMC) and Approximate Bayesian Computation (ABC) - for estimating the parameters of autoregressive fractionally-integrated moving average (ARFIMA) models, which are widely used to capture long-memory in time series data. We propose a novel MCMC algorithm that filters the time series into distinct long-memory and ARMA components, and benchmark it against standard approaches. Additionally, a new ABC method is proposed, using three different summary statistics for posterior estimation. The methods are implemented and evaluated through an extensive simulation study, as well as applied to a real-world financial dataset, specifically the quarterly U.S. Gross National Product (GNP) series. The results demonstrate the effectiveness of the Bayesian methods in estimating long-memory and short-memory parameters, with the filtered MCMC showing superior performance in various metrics. This study enhances our understanding of Bayesian techniques in ARFIMA modeling, providing insights into their advantages and limitations when applied to complex time series data."}, "https://arxiv.org/abs/2410.13420": {"title": "Spatial Proportional Hazards Model with Differential Regularization", "link": "https://arxiv.org/abs/2410.13420", "description": "arXiv:2410.13420v1 Announce Type: new \nAbstract: This paper introduces a novel Spatial Proportional Hazards model that incorporates spatial dependence through differential regularization. We address limitations of existing methods that overlook domain geometry by proposing an approach based on the Generalized Spatial Regression with PDE Penalization. Our method handles complex-shaped domains, enabling accurate modeling of spatial fields in survival data. Using a penalized log-likelihood functional, we estimate both covariate effects and the spatial field. The methodology is implemented via finite element methods, efficiently handling irregular domain geometries. We demonstrate its efficacy through simulations and apply it to real-world data."}, "https://arxiv.org/abs/2410.13522": {"title": "Fair comparisons of causal parameters with many treatments and positivity violations", "link": "https://arxiv.org/abs/2410.13522", "description": "arXiv:2410.13522v1 Announce Type: new \nAbstract: Comparing outcomes across treatments is essential in medicine and public policy. To do so, researchers typically estimate a set of parameters, possibly counterfactual, with each targeting a different treatment. Treatment-specific means (TSMs) are commonly used, but their identification requires a positivity assumption -- that every subject has a non-zero probability of receiving each treatment. This assumption is often implausible, especially when treatment can take many values. Causal parameters based on dynamic stochastic interventions can be robust to positivity violations. However, comparing these parameters may be unfair because they may depend on outcomes under non-target treatments. 
To address this, and clarify when fair comparisons are possible, we propose a fairness criterion: if the conditional TSM for one treatment is greater than that for another, then the corresponding causal parameter should also be greater. We derive two intuitive properties equivalent to this criterion and show that only a mild positivity assumption is needed to identify fair parameters. We then provide examples that satisfy this criterion and are identifiable under the milder positivity assumption. These parameters are non-smooth, making standard nonparametric efficiency theory inapplicable, so we propose smooth approximations of them. We then develop doubly robust-style estimators that attain parametric convergence rates under nonparametric conditions. We illustrate our methods with an analysis of dialysis providers in New York State."}, "https://arxiv.org/abs/2410.13658": {"title": "The Subtlety of Optimal Paternalism in a Population with Bounded Rationality", "link": "https://arxiv.org/abs/2410.13658", "description": "arXiv:2410.13658v1 Announce Type: new \nAbstract: We consider a utilitarian planner with the power to design a discrete choice set for a heterogeneous population with bounded rationality. We find that optimal paternalism is subtle. The policy that most effectively constrains or influences choices depends on the preference distribution of the population and on the choice probabilities conditional on preferences that measure the suboptimality of behavior. We first consider the planning problem in abstraction. We next examine policy choice when individuals measure utility with additive random error and maximize mismeasured rather than actual utility. We then analyze a class of problems of binary treatment choice under uncertainty. Here we suppose that a planner can mandate a treatment conditional on publicly observed personal covariates or can decentralize decision making, enabling persons to choose their own treatments. Bounded rationality may take the form of deviations between subjective personal beliefs and objective probabilities of uncertain outcomes. We apply our analysis to clinical decision making in medicine. Having documented that optimization of paternalism requires the planner to possess extensive knowledge that is rarely available, we address the difficult problem of paternalistic policy choice when the planner is boundedly rational."}, "https://arxiv.org/abs/2410.13693": {"title": "A multiscale method for data collected from network edges via the line graph", "link": "https://arxiv.org/abs/2410.13693", "description": "arXiv:2410.13693v1 Announce Type: new \nAbstract: Data collected over networks can be modelled as noisy observations of an unknown function over the nodes of a graph or network structure, fully described by its nodes and their connections, the edges. In this context, function estimation has been proposed in the literature and typically makes use of the network topology such as relative node arrangement, often using given or artificially constructed node Euclidean coordinates. However, networks that arise in fields such as hydrology (for example, river networks) present features that challenge these established modelling setups since the target function may naturally live on edges (e.g., river flow) and/or the node-oriented modelling uses noisy edge data as weights. This work tackles these challenges and develops a novel lifting scheme along with its associated (second) generation wavelets that permit data decomposition across the network edges. 
The transform, which we refer to under the acronym LG-LOCAAT, makes use of a line graph construction that first maps the data into the line graph domain. We thoroughly investigate the proposed algorithm's properties and illustrate its performance versus existing methodologies. We conclude with an application pertaining to hydrology that involves the denoising of a water quality index over the river network of England, backed up by a simulation study for a river flow dataset."}, "https://arxiv.org/abs/2410.13723": {"title": "A Subsequence Approach to Topological Data Analysis for Irregularly-Spaced Time Series", "link": "https://arxiv.org/abs/2410.13723", "description": "arXiv:2410.13723v1 Announce Type: new \nAbstract: A time-delay embedding (TDE), grounded in the framework of Takens's Theorem, provides a mechanism to represent and analyze the inherent dynamics of time-series data. Recently, topological data analysis (TDA) methods have been applied to study this time series representation mainly through the lens of persistent homology. The current literature on the fusion of TDE and TDA is adept at analyzing uniformly-spaced time series observations. This work introduces a novel {\em subsequence} embedding method for irregularly-spaced time-series data. We show that this method preserves the original state space topology while reducing spurious homological features. Theoretical stability results and convergence properties of the proposed method in the presence of noise and varying levels of irregularity in the spacing of the time series are established. Numerical studies and an application to real data illustrate the performance of the proposed method."}, "https://arxiv.org/abs/2410.13744": {"title": "Inferring the dynamics of quasi-reaction systems via nonlinear local mean-field approximations", "link": "https://arxiv.org/abs/2410.13744", "description": "arXiv:2410.13744v1 Announce Type: new \nAbstract: In the modelling of stochastic phenomena, such as quasi-reaction systems, parameter estimation of kinetic rates can be challenging, particularly when the time gap between consecutive measurements is large. Local linear approximation approaches account for the stochasticity in the system but fail to capture the nonlinear nature of the underlying process. At the mean level, the dynamics of the system can be described by a system of ODEs, which have an explicit solution only for simple unitary systems. An analytical solution for generic quasi-reaction systems is proposed via a first order Taylor approximation of the hazard rate. This allows a nonlinear forward prediction of the future dynamics given the current state of the system. Predictions and corresponding observations are embedded in a nonlinear least-squares approach for parameter estimation. The performance of the algorithm is compared to existing SDE and ODE-based methods via a simulation study. Besides the increased computational efficiency of the approach, the results show an improvement in the kinetic rate estimation, particularly for data observed at large time intervals. Additionally, the availability of an explicit solution makes the method robust to stiffness, which is often present in biological systems. 
An illustration on Rhesus Macaque data shows the applicability of the approach to the study of cell differentiation."}, "https://arxiv.org/abs/2410.13112": {"title": "Distributional Matrix Completion via Nearest Neighbors in the Wasserstein Space", "link": "https://arxiv.org/abs/2410.13112", "description": "arXiv:2410.13112v1 Announce Type: cross \nAbstract: We introduce the problem of distributional matrix completion: Given a sparsely observed matrix of empirical distributions, we seek to impute the true distributions associated with both observed and unobserved matrix entries. This is a generalization of traditional matrix completion where the observations per matrix entry are scalar valued. To do so, we utilize tools from optimal transport to generalize the nearest neighbors method to the distributional setting. Under a suitable latent factor model on probability distributions, we establish that our method recovers the distributions in the Wasserstein norm. We demonstrate through simulations that our method is able to (i) provide better distributional estimates for an entry compared to using observed samples for that entry alone, (ii) yield accurate estimates of distributional quantities such as standard deviation and value-at-risk, and (iii) inherently support heteroscedastic noise. We also prove novel asymptotic results for Wasserstein barycenters over one-dimensional distributions."}, "https://arxiv.org/abs/2410.13495": {"title": "On uniqueness of the set of k-means", "link": "https://arxiv.org/abs/2410.13495", "description": "arXiv:2410.13495v1 Announce Type: cross \nAbstract: We provide necessary and sufficient conditions for the uniqueness of the k-means set of a probability distribution. This uniqueness problem is related to the choice of k: depending on the underlying distribution, some values of this parameter could lead to multiple sets of k-means, which hampers the interpretation of the results and/or the stability of the algorithms. We give a general assessment on consistency of the empirical k-means adapted to the setting of non-uniqueness and determine the asymptotic distribution of the within cluster sum of squares (WCSS). We also provide statistical characterizations of k-means uniqueness in terms of the asymptotic behavior of the empirical WCSS. As a consequence, we derive a bootstrap test for uniqueness of the set of k-means. The results are illustrated with examples of different types of non-uniqueness and we check by simulations the performance of the proposed methodology."}, "https://arxiv.org/abs/2410.13681": {"title": "Ab initio nonparametric variable selection for scalable Symbolic Regression with large $p$", "link": "https://arxiv.org/abs/2410.13681", "description": "arXiv:2410.13681v1 Announce Type: cross \nAbstract: Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which are common in modern scientific applications. This ``large $p$'' setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. 
To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 17 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets."}, "https://arxiv.org/abs/2410.13735": {"title": "Optimizing Probabilistic Conformal Prediction with Vectorized Non-Conformity Scores", "link": "https://arxiv.org/abs/2410.13735", "description": "arXiv:2410.13735v1 Announce Type: cross \nAbstract: Generative models have shown significant promise in critical domains such as medical diagnosis, autonomous driving, and climate science, where reliable decision-making hinges on accurate uncertainty quantification. While probabilistic conformal prediction (PCP) offers a powerful framework for this purpose, its coverage efficiency -- the size of the uncertainty set -- is limited when dealing with complex underlying distributions and a finite number of generated samples. In this paper, we propose a novel PCP framework that enhances efficiency by first vectorizing the non-conformity scores with ranked samples and then optimizing the shape of the prediction set by varying the quantiles for samples at the same rank. Our method delivers valid coverage while producing discontinuous and more efficient prediction sets, making it particularly suited for high-stakes applications. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets."}, "https://arxiv.org/abs/2303.14801": {"title": "FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions", "link": "https://arxiv.org/abs/2303.14801", "description": "arXiv:2303.14801v3 Announce Type: replace \nAbstract: Functional regression analysis is an established tool for many contemporary scientific applications. Regression problems involving large and complex data sets are ubiquitous, and feature selection is crucial for avoiding overfitting and achieving accurate predictions. We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse high dimensional function-on-function regression problem, and we show how to extend it to the scalar-on-function framework. Our method, called FAStEN, combines functional data, optimization, and machine learning techniques to perform feature selection and parameter estimation simultaneously. We exploit the properties of Functional Principal Components and the sparsity inherent to the Dual Augmented Lagrangian problem to significantly reduce computational cost, and we introduce an adaptive scheme to improve selection accuracy. In addition, we derive asymptotic oracle properties, which guarantee estimation and selection consistency for the proposed FAStEN estimator. Through an extensive simulation study, we benchmark our approach to the best existing competitors and demonstrate a massive gain in terms of CPU time and selection performance, without sacrificing the quality of the coefficients' estimation. 
The theoretical derivations and the simulation study provide a strong motivation for our approach. Finally, we present an application to brain fMRI data from the AOMIC PIOP1 study. Complete FAStEN code is provided at https://github.com/IBM/funGCN."}, "https://arxiv.org/abs/2312.17716": {"title": "Dependent Random Partitions by Shrinking Toward an Anchor", "link": "https://arxiv.org/abs/2312.17716", "description": "arXiv:2312.17716v2 Announce Type: replace \nAbstract: Although exchangeable processes from Bayesian nonparametrics have been used as a generating mechanism for random partition models, we deviate from this paradigm to explicitly incorporate clustering information in the formulation of our random partition model. Our shrinkage partition distribution takes any partition distribution and shrinks its probability mass toward an anchor partition. We show how this provides a framework to model hierarchically-dependent and temporally-dependent random partitions. The shrinkage parameter controls the degree of dependence, accommodating at its extremes both independence and complete equality. Since a priori knowledge of items may vary, our formulation allows the degree of shrinkage toward the anchor to be item-specific. Our random partition model has a tractable normalizing constant which allows for standard Markov chain Monte Carlo algorithms for posterior sampling. We prove intuitive theoretical properties for our distribution and compare it to related partition distributions. We show that our model provides better out-of-sample fit in a real data application."}, "https://arxiv.org/abs/2301.06615": {"title": "Data-Driven Estimation of Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2301.06615", "description": "arXiv:2301.06615v2 Announce Type: replace-cross \nAbstract: Estimating how a treatment affects different individuals, known as heterogeneous treatment effect estimation, is an important problem in empirical sciences. In the last few years, there has been a considerable interest in adapting machine learning algorithms to the problem of estimating heterogeneous effects from observational and experimental data. However, these algorithms often make strong assumptions about the observed features in the data and ignore the structure of the underlying causal model, which can lead to biased estimation. At the same time, the underlying causal mechanism is rarely known in real-world datasets, making it hard to take it into consideration. In this work, we provide a survey of state-of-the-art data-driven methods for heterogeneous treatment effect estimation using machine learning, broadly categorizing them as methods that focus on counterfactual prediction and methods that directly estimate the causal effect. We also provide an overview of a third category of methods which rely on structural causal models and learn the model structure from data. Our empirical evaluation under various underlying structural model mechanisms shows the advantages and deficiencies of existing estimators and of the metrics for measuring their performance."}, "https://arxiv.org/abs/2307.12395": {"title": "Concentration inequalities for high-dimensional linear processes with dependent innovations", "link": "https://arxiv.org/abs/2307.12395", "description": "arXiv:2307.12395v2 Announce Type: replace-cross \nAbstract: We develop concentration inequalities for the $l_\\infty$ norm of vector linear processes with sub-Weibull, mixingale innovations. 
This inequality is used to obtain a concentration bound for the maximum entrywise norm of the lag-$h$ autocovariance matrix of linear processes. We apply these inequalities to sparse estimation of large-dimensional VAR(p) systems and heteroskedasticity and autocorrelation consistent (HAC) high-dimensional covariance estimation."}, "https://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation", "link": "https://arxiv.org/abs/2310.06653", "description": "arXiv:2310.06653v2 Announce Type: replace-cross \nAbstract: In clinical trials, patients may discontinue treatments prematurely, breaking the initial randomization and, thus, challenging inference. Stakeholders in drug development are generally interested in going beyond the Intention-To-Treat (ITT) analysis, which provides valid causal estimates of the effect of treatment assignment but does not inform on the effect of the actual treatment receipt. Our study is motivated by an RCT in oncology, where patients assigned the investigational treatment may discontinue it due to adverse events. We propose adopting a principal stratum strategy and decomposing the overall ITT effect into principal causal effects for groups of patients defined by their potential discontinuation behavior. We first show how to implement a principal stratum strategy to assess causal effects on a survival outcome in the presence of continuous time treatment discontinuation, its advantages, and the conclusions one can draw. Our strategy deals with the time-to-event intermediate variable that may not be defined for patients who would not discontinue; moreover, discontinuation time and the primary endpoint are subject to censoring. We employ a flexible model-based Bayesian approach to tackle these complexities, providing easily interpretable results. We apply this Bayesian principal stratification framework to analyze synthetic data of the motivating oncology trial. We simulate data under different assumptions that reflect real scenarios where patients' behavior depends on critical baseline covariates. Supported by a simulation study, we shed light on the role of covariates in this framework: beyond making structural and parametric assumptions more credible, they lead to more precise inference and can be used to characterize patients' discontinuation behavior, which could help inform clinical practice and future protocols."}, "https://arxiv.org/abs/2410.13949": {"title": "Modeling Zero-Inflated Correlated Dental Data through Gaussian Copulas and Approximate Bayesian Computation", "link": "https://arxiv.org/abs/2410.13949", "description": "arXiv:2410.13949v1 Announce Type: new \nAbstract: We develop a new longitudinal count data regression model that accounts for zero-inflation and spatio-temporal correlation across responses. This project is motivated by an analysis of Iowa Fluoride Study (IFS) data, a longitudinal cohort study with data on caries (cavity) experience scores measured for each tooth across five time points. To that end, we use a hurdle model for zero-inflation with two parts: the presence model indicating whether a count is non-zero through logistic regression and the severity model that considers the non-zero counts through a shifted Negative Binomial distribution allowing overdispersion. To incorporate dependence across measurement occasion and teeth, these marginal models are embedded within a Gaussian copula that introduces spatio-temporal correlations. 
A distinct advantage of this formulation is that it allows us to determine covariate effects with population-level (marginal) interpretations in contrast to mixed model choices. Standard Bayesian sampling from such a model is infeasible, so we use approximate Bayesian computation for inference. This approach is applied to the IFS data to gain insight into the risk factors for dental caries and the correlation structure across teeth and time."}, "https://arxiv.org/abs/2410.14002": {"title": "A note on Bayesian R-squared for generalized additive mixed models", "link": "https://arxiv.org/abs/2410.14002", "description": "arXiv:2410.14002v1 Announce Type: new \nAbstract: We present a novel Bayesian framework to decompose the posterior predictive variance in a fitted Generalized Additive Mixed Model (GAMM) into explained and unexplained components. This decomposition enables a rigorous definition of Bayesian $R^{2}$. We show that the new definition aligns with the intuitive Bayesian $R^{2}$ proposed by Gelman, Goodrich, Gabry, and Vehtari (2019) [\\emph{The American Statistician}, \\textbf{73}(3), 307-309], but extends its applicability to a broader class of models. Furthermore, we introduce a partial Bayesian $R^{2}$ to quantify the contribution of individual model terms to the explained variation in the posterior predictions."}, "https://arxiv.org/abs/2410.14301": {"title": "Confidence interval for the sensitive fraction in Item Count Technique model", "link": "https://arxiv.org/abs/2410.14301", "description": "arXiv:2410.14301v1 Announce Type: new \nAbstract: The problem is the estimation of the fraction of a population with a sensitive characteristic. We consider the Item Count Technique, an indirect method of questioning designed to protect respondents' privacy. An exact confidence interval for the sensitive fraction is constructed. The length of the proposed CI depends on both the given parameter of the model and the sample size. For this CI the model's parameter is established in relation to the provided level of privacy protection of the interviewee. The optimal sample size for obtaining a CI of a given length is discussed in this context."}, "https://arxiv.org/abs/2410.14308": {"title": "Adaptive L-statistics for high dimensional test problem", "link": "https://arxiv.org/abs/2410.14308", "description": "arXiv:2410.14308v1 Announce Type: new \nAbstract: In this study, we focus on applying L-statistics to the high-dimensional one-sample location test problem. Intuitively, an L-statistic with $k$ parameters tends to perform optimally when the sparsity level of the alternative hypothesis matches $k$. We begin by deriving the limiting distributions for both L-statistics with fixed parameters and those with diverging parameters. To ensure robustness across varying sparsity levels of alternative hypotheses, we first establish the asymptotic independence between L-statistics with fixed and diverging parameters. Building on this, we propose a Cauchy combination test that integrates L-statistics with different parameters. Both simulation results and real-data applications highlight the advantages of our proposed methods."}, "https://arxiv.org/abs/2410.14317": {"title": "Identification of a Rank-dependent Peer Effect Model", "link": "https://arxiv.org/abs/2410.14317", "description": "arXiv:2410.14317v1 Announce Type: new \nAbstract: This paper develops an econometric model to analyse heterogeneity in peer effects in network data with endogenous spillover across units. 
We introduce a rank-dependent peer effect model that captures how the relative ranking of a peer outcome shapes the influence units have on one another, by modeling the peer effect to be linear in ordered peer outcomes. In contrast to the traditional linear-in-means model, our approach allows for greater flexibility in peer effects by accounting for the distribution of peer outcomes as well as the size of peer groups. Under a minimal condition, the rank-dependent peer effect model admits a unique equilibrium and is therefore tractable. Our simulations show that estimation performs well in finite samples given sufficient covariate strength. We then apply our model to educational data from Norway, where we see that higher-performing students disproportionately drive GPA spillovers."}, "https://arxiv.org/abs/2410.14459": {"title": "To Vary or Not To Vary: A Simple Empirical Bayes Factor for Testing Variance Components", "link": "https://arxiv.org/abs/2410.14459", "description": "arXiv:2410.14459v1 Announce Type: new \nAbstract: Random effects are a flexible addition to statistical models to capture structural heterogeneity in the data, such as spatial dependencies, individual differences, temporal dependencies, or non-linear effects. Testing for the presence (or absence) of random effects is an important but challenging endeavor, however, as testing a variance component, which must be non-negative, is a boundary problem. Various methods exist, but they have potential shortcomings or limitations. As an alternative, we propose a flexible empirical Bayes factor (EBF) for testing for the presence of random effects. Rather than testing whether a variance component equals zero or not, the proposed EBF tests the equivalent assumption of whether all random effects are zero. The Bayes factor is `empirical' because the distribution of the random effects on the lower level, which serves as a prior, is estimated from the data as it is part of the model. Empirical Bayes factors can be computed using the output from classical (MLE) or Bayesian (MCMC) approaches. Analyses on synthetic data were carried out to assess the general behavior of the criterion. To illustrate the methodology, the EBF is used for testing random effects under various models including logistic crossed mixed effects models, spatial random effects models, dynamic structural equation models, random intercept cross-lagged panel models, and nonlinear regression models."}, "https://arxiv.org/abs/2410.14507": {"title": "Bin-Conditional Conformal Prediction of Fatalities from Armed Conflict", "link": "https://arxiv.org/abs/2410.14507", "description": "arXiv:2410.14507v1 Announce Type: new \nAbstract: Forecasting of armed conflicts is an important area of research that has the potential to save lives and prevent suffering. However, most existing forecasting models provide only point predictions without any individual-level uncertainty estimates. In this paper, we introduce a novel extension to the conformal prediction algorithm, which we call bin-conditional conformal prediction. This method allows users to obtain individual-level prediction intervals for any arbitrary prediction model while maintaining a specific level of coverage across user-defined ranges of values. We apply the bin-conditional conformal prediction algorithm to forecast fatalities from armed conflict. Our results demonstrate that the method provides well-calibrated uncertainty estimates for the predicted number of fatalities. 
Compared to standard conformal prediction, the bin-conditional method offers improved calibration of coverage rates across different values of the outcome, but at the cost of wider prediction intervals."}, "https://arxiv.org/abs/2410.14513": {"title": "GARCH option valuation with long-run and short-run volatility components: A novel framework ensuring positive variance", "link": "https://arxiv.org/abs/2410.14513", "description": "arXiv:2410.14513v1 Announce Type: new \nAbstract: Christoffersen, Jacobs, Ornthanalai, and Wang (2008) (CJOW) proposed an improved Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model for valuing European options, where the return volatility is comprised of two distinct components. Empirical studies indicate that the model developed by CJOW outperforms widely-used single-component GARCH models and provides a superior fit to options data compared to models that combine conditional heteroskedasticity with Poisson-normal jumps. However, a significant limitation of this model is that it allows the variance process to become negative. Oh and Park [2023] partially addressed this issue by developing a related model, yet the positivity of the volatility components is not guaranteed, either theoretically or empirically. In this paper we introduce a new GARCH model that improves upon the models by CJOW and Oh and Park [2023], ensuring the positivity of the return volatility. In comparison to the two earlier GARCH approaches, our novel methodology shows comparable in-sample performance on returns data and superior performance on S&P500 options data."}, "https://arxiv.org/abs/2410.14533": {"title": "The Traveling Bandit: A Framework for Bayesian Optimization with Movement Costs", "link": "https://arxiv.org/abs/2410.14533", "description": "arXiv:2410.14533v1 Announce Type: new \nAbstract: This paper introduces a framework for Bayesian Optimization (BO) with metric movement costs, addressing a critical challenge in practical applications where input alterations incur varying costs. Our approach is a convenient plug-in that seamlessly integrates with the existing literature on batched algorithms, where designs within batches are observed following the solution of a Traveling Salesman Problem. The proposed method provides a theoretical guarantee of convergence in terms of movement costs for BO. Empirically, our method effectively reduces average movement costs over time while maintaining comparable regret performance to conventional BO methods. This framework also shows promise for broader applications in various bandit settings with movement costs."}, "https://arxiv.org/abs/2410.14585": {"title": "A GARCH model with two volatility components and two driving factors", "link": "https://arxiv.org/abs/2410.14585", "description": "arXiv:2410.14585v1 Announce Type: new \nAbstract: We introduce a novel GARCH model that integrates two sources of uncertainty to better capture the rich, multi-component dynamics often observed in the volatility of financial assets. This model provides a quasi closed-form representation of the characteristic function for future log-returns, from which semi-analytical formulas for option pricing can be derived. A theoretical analysis is conducted to establish sufficient conditions for strict stationarity and geometric ergodicity, while also obtaining the continuous-time diffusion limit of the model. 
Empirical evaluations, conducted both in-sample and out-of-sample using S\\&P500 time series data, show that our model outperforms widely used single-factor models in predicting returns and option prices."}, "https://arxiv.org/abs/2410.14046": {"title": "Tensor Decomposition with Unaligned Observations", "link": "https://arxiv.org/abs/2410.14046", "description": "arXiv:2410.14046v1 Announce Type: cross \nAbstract: This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset."}, "https://arxiv.org/abs/2410.14333": {"title": "Predicting the trajectory of intracranial pressure in patients with traumatic brain injury: evaluation of a foundation model for time series", "link": "https://arxiv.org/abs/2410.14333", "description": "arXiv:2410.14333v1 Announce Type: cross \nAbstract: Patients with traumatic brain injury (TBI) often experience pathological increases in intracranial pressure (ICP), leading to intracranial hypertension (tIH), a common and serious complication. Early warning of an impending rise in ICP could potentially improve patient outcomes by enabling preemptive clinical intervention. However, the limited availability of patient data poses a challenge in developing reliable prediction models. In this study, we aim to determine whether foundation models, which leverage transfer learning, may offer a promising solution."}, "https://arxiv.org/abs/2410.14483": {"title": "Spectral Representations for Accurate Causal Uncertainty Quantification with Gaussian Processes", "link": "https://arxiv.org/abs/2410.14483", "description": "arXiv:2410.14483v1 Announce Type: cross \nAbstract: Accurate uncertainty quantification for causal effects is essential for robust decision making in complex systems, but remains challenging in non-parametric settings. One promising framework represents conditional distributions in a reproducing kernel Hilbert space and places Gaussian process priors on them to infer posteriors on causal effects, but requires restrictive nuclear dominant kernels and approximations that lead to unreliable uncertainty estimates. In this work, we introduce a method, IMPspec, that addresses these limitations via a spectral representation of the Hilbert space. We show that posteriors in this model can be obtained explicitly, by extending a result in Hilbert space regression theory. We also learn the spectral representation to optimise posterior calibration. 
Our method achieves state-of-the-art performance in uncertainty quantification and causal Bayesian optimisation across simulations and a healthcare application."}, "https://arxiv.org/abs/2410.14490": {"title": "Matrix normal distribution and elliptic distribution", "link": "https://arxiv.org/abs/2410.14490", "description": "arXiv:2410.14490v1 Announce Type: cross \nAbstract: In this paper, we introduce the matrix normal distribution according to the tensor decomposition of its covariance. Based on the canonical diagonal form, the moment generating function of sample covariance matrix and the distribution of latent roots are explicitly calculated. We also discuss the connections between matrix normal distributions, elliptic distributions, and their relevance to multivariate analysis and matrix variate distributions."}, "https://arxiv.org/abs/2203.09380": {"title": "Identifiability of Sparse Causal Effects using Instrumental Variables", "link": "https://arxiv.org/abs/2203.09380", "description": "arXiv:2203.09380v4 Announce Type: replace \nAbstract: Exogenous heterogeneity, for example, in the form of instrumental variables can help us learn a system's underlying causal structure and predict the outcome of unseen intervention experiments. In this paper, we consider linear models in which the causal effect from covariates $X$ on a response $Y$ is sparse. We provide conditions under which the causal coefficient becomes identifiable from the observed distribution. These conditions can be satisfied even if the number of instruments is as small as the number of causal parents. We also develop graphical criteria under which identifiability holds with probability one if the edge coefficients are sampled randomly from a distribution that is absolutely continuous with respect to Lebesgue measure and $Y$ is childless. As an estimator, we propose spaceIV and prove that it consistently estimates the causal effect if the model is identifiable and evaluate its performance on simulated data. If identifiability does not hold, we show that it may still be possible to recover a subset of the causal parents."}, "https://arxiv.org/abs/2310.19051": {"title": "A Survey of Methods for Estimating Hurst Exponent of Time Sequence", "link": "https://arxiv.org/abs/2310.19051", "description": "arXiv:2310.19051v2 Announce Type: replace \nAbstract: The Hurst exponent is a significant indicator for characterizing the self-similarity and long-term memory properties of time sequences. It has wide applications in physics, technologies, engineering, mathematics, statistics, economics, psychology and so on. Currently, available methods for estimating the Hurst exponent of time sequences can be divided into different categories: time-domain methods and spectrum-domain methods based on the representation of time sequence, linear regression methods and Bayesian methods based on parameter estimation methods. Although various methods are discussed in literature, there are still some deficiencies: the descriptions of the estimation algorithms are just mathematics-oriented and the pseudo-codes are missing; the effectiveness and accuracy of the estimation algorithms are not clear; the classification of estimation methods is not considered and there is a lack of guidance for selecting the estimation methods. In this work, the emphasis is put on thirteen dominant methods for estimating the Hurst exponent. 
To make the estimation methods easier to implement in computer programs, the mathematical principles are discussed briefly and the pseudo-codes of the algorithms are presented with the necessary details. It is expected that the survey will help researchers to select, implement and apply the estimation algorithms of interest in practical situations in an easy way."}, "https://arxiv.org/abs/2311.16440": {"title": "Inference for Low-rank Models without Estimating the Rank", "link": "https://arxiv.org/abs/2311.16440", "description": "arXiv:2311.16440v2 Announce Type: replace \nAbstract: This paper studies the inference about linear functionals of high-dimensional low-rank matrices. While most existing inference methods would require consistent estimation of the true rank, our procedure is robust to rank misspecification, making it a promising approach in applications where rank estimation can be unreliable. We estimate the low-rank spaces using pre-specified weighting matrices, known as diversified projections. A novel statistical insight is that, unlike the usual statistical wisdom that overfitting mainly introduces additional variance, the over-estimated low-rank space also gives rise to a non-negligible bias due to an implicit ridge-type regularization. We develop a new inference procedure and show that the central limit theorem holds as long as the pre-specified rank is no smaller than the true rank. In one of our applications, we study multiple testing with incomplete data in the presence of confounding factors and show that our method remains valid as long as the number of controlled confounding factors is at least as large as the true number, even when no confounding factors are present."}, "https://arxiv.org/abs/2410.14789": {"title": "Differentially Private Covariate Balancing Causal Inference", "link": "https://arxiv.org/abs/2410.14789", "description": "arXiv:2410.14789v1 Announce Type: new \nAbstract: Differential privacy is the leading mathematical framework for privacy protection, providing a probabilistic guarantee that safeguards individuals' private information when publishing statistics from a dataset. This guarantee is achieved by applying a randomized algorithm to the original data, which introduces unique challenges in data analysis by distorting inherent patterns. In particular, causal inference using observational data in privacy-sensitive contexts is challenging because it requires covariate balance between treatment groups, yet checking the true covariates is prohibited to prevent leakage of sensitive information. In this article, we present a differentially private two-stage covariate balancing weighting estimator to infer causal effects from observational data. Our algorithm produces both point and interval estimators with statistical guarantees, such as consistency and rate optimality, under a given privacy budget."}, "https://arxiv.org/abs/2410.14857": {"title": "A New One Parameter Unit Distribution: Median Based Unit Rayleigh (MBUR): Parametric Quantile Regression Model", "link": "https://arxiv.org/abs/2410.14857", "description": "arXiv:2410.14857v1 Announce Type: new \nAbstract: Parametric quantile regression is illustrated for the new one-parameter unit Rayleigh distribution, called the Median Based Unit Rayleigh (MBUR) distribution. The estimation process using the re-parameterized maximum likelihood function is highlighted with a real dataset example. 
Inference and goodness of fit are also explored."}, "https://arxiv.org/abs/2410.14866": {"title": "Fast and Optimal Changepoint Detection and Localization using Bonferroni Triplets", "link": "https://arxiv.org/abs/2410.14866", "description": "arXiv:2410.14866v1 Announce Type: new \nAbstract: The paper considers the problem of detecting and localizing changepoints in a sequence of independent observations. We propose to evaluate a local test statistic on a triplet of time points, for each such triplet in a particular collection. This collection is sparse enough so that the results of the local tests can simply be combined with a weighted Bonferroni correction. This results in a simple and fast method, {\\sl Lean Bonferroni Changepoint detection} (LBD), that provides finite sample guarantees for the existence of changepoints as well as simultaneous confidence intervals for their locations. LBD is free of tuning parameters, and we show that LBD allows optimal inference for the detection of changepoints. To this end, we provide a lower bound for the critical constant that measures the difficulty of the changepoint detection problem, and we show that LBD attains this critical constant. We illustrate LBD for a number of distributional settings, namely when the observations are homoscedastic normal with known or unknown variance, for observations from a natural exponential family, and in a nonparametric setting where we assume only exchangeability for segments without a changepoint."}, "https://arxiv.org/abs/2410.14871": {"title": "Learning the Effect of Persuasion via Difference-In-Differences", "link": "https://arxiv.org/abs/2410.14871", "description": "arXiv:2410.14871v1 Announce Type: new \nAbstract: The persuasion rate is a key parameter for measuring the causal effect of a directional message on influencing the recipient's behavior. Its identification analysis has largely relied on the availability of credible instruments, but the requirement is not always satisfied in observational studies. Therefore, we develop a framework for identifying, estimating, and conducting inference for the average persuasion rates on the treated using a difference-in-differences approach. The average treatment effect on the treated is a standard parameter with difference-in-differences, but it underestimates the persuasion rate in our setting. Our estimation and inference methods include regression-based approaches and semiparametrically efficient estimators. Beginning with the canonical two-period case, we extend the framework to staggered treatment settings, where we show how to conduct rich analyses like the event-study design. We revisit previous studies of the British election and the Chinese curriculum reform to illustrate the usefulness of our methodology."}, "https://arxiv.org/abs/2410.14985": {"title": "Stochastic Loss Reserving: Dependence and Estimation", "link": "https://arxiv.org/abs/2410.14985", "description": "arXiv:2410.14985v1 Announce Type: new \nAbstract: Nowadays insurers have to account for potentially complex dependence between risks. In the field of loss reserving, there are many parametric and non-parametric models attempting to capture dependence between business lines. One common approach has been to use additive background risk models (ABRMs), which provide rich and interpretable dependence structures via a common shock model. Unfortunately, ABRMs are often restrictive. Models that capture the necessary features may have parameters that are impractical to estimate. 
For example, models may lack a closed-form likelihood function because no probability density function is available (e.g. some Tweedie and stable distributions).\n We apply to loss reserving a modification of the continuous generalised method of moments (CGMM) of [Carrasco and Florens, 2000], which delivers estimators comparable to the MLE. We examine models such as the one proposed by [Avanzi et al., 2016] and a related but novel one derived from the stable family of distributions. Our CGMM method of estimation provides conventional non-Bayesian estimates in the case where MLEs are impractical."}, "https://arxiv.org/abs/2410.15090": {"title": "Fast and Efficient Bayesian Analysis of Structural Vector Autoregressions Using the R Package bsvars", "link": "https://arxiv.org/abs/2410.15090", "description": "arXiv:2410.15090v1 Announce Type: new \nAbstract: The R package bsvars provides a wide range of tools for empirical macroeconomic and financial analyses using Bayesian Structural Vector Autoregressions. It uses frontier econometric techniques and C++ code to ensure fast and efficient estimation of these multivariate dynamic structural models, possibly with many variables, complex identification strategies, and non-linear characteristics. The models can be identified using adjustable exclusion restrictions and heteroskedastic or non-normal shocks. They feature a flexible three-level equation-specific local-global hierarchical prior distribution for the estimated level of shrinkage for autoregressive and structural parameters. Additionally, the package facilitates predictive and structural analyses such as impulse responses, forecast error variance and historical decompositions, forecasting, statistical verification of identification and hypotheses on autoregressive parameters, and analyses of structural shocks, volatilities, and fitted values. These features differentiate bsvars from existing R packages that either focus on a specific structural model, do not consider heteroskedastic shocks, or lack an implementation using compiled code."}, "https://arxiv.org/abs/2410.15097": {"title": "Predictive Quantile Regression with High-Dimensional Predictors: The Variable Screening Approach", "link": "https://arxiv.org/abs/2410.15097", "description": "arXiv:2410.15097v1 Announce Type: new \nAbstract: This paper advances a variable screening approach to enhance conditional quantile forecasts using high-dimensional predictors. We have refined and augmented the quantile partial correlation (QPC)-based variable screening proposed by Ma et al. (2017) to accommodate $\beta$-mixing time-series data. Our approach is inclusive of i.i.d. scenarios but introduces new convergence bounds for time-series contexts, suggesting the performance of QPC-based screening is influenced by the degree of time-series dependence. Through Monte Carlo simulations, we validate the effectiveness of QPC under weak dependence. Our empirical assessment of variable selection for growth-at-risk (GaR) forecasting underscores the method's advantages, revealing that specific labor market determinants play a pivotal role in forecasting GaR. 
While prior empirical research has predominantly considered a limited set of predictors, we employ the comprehensive Fred-QD dataset, retaining a richer breadth of information for GaR forecasts."}, "https://arxiv.org/abs/2410.15102": {"title": "Bayesian-based Propensity Score Subclassification Estimator", "link": "https://arxiv.org/abs/2410.15102", "description": "arXiv:2410.15102v1 Announce Type: new \nAbstract: Subclassification estimators are one of the methods used to estimate causal effects of interest using the propensity score. This method is more stable compared to other weighting methods, such as inverse probability weighting estimators, in terms of the variance of the estimators. In subclassification estimators, the number of strata is traditionally set at five, and this number is not typically chosen based on data information. Even when the number of strata is selected, the uncertainty from the selection process is often not properly accounted for. In this study, we propose a novel Bayesian-based subclassification estimator that can assess the uncertainty in the number of strata, rather than selecting a single optimal number, using a Bayesian paradigm. To achieve this, we apply a general Bayesian procedure that does not rely on a likelihood function. This procedure allows us to avoid making strong assumptions about the outcome model, maintaining the same flexibility as traditional causal inference methods. With the proposed Bayesian procedure, it is expected that uncertainties from the design phase can be appropriately reflected in the analysis phase, which is sometimes overlooked in non-Bayesian contexts."}, "https://arxiv.org/abs/2410.15381": {"title": "High-dimensional prediction for count response via sparse exponential weights", "link": "https://arxiv.org/abs/2410.15381", "description": "arXiv:2410.15381v1 Announce Type: new \nAbstract: Count data is prevalent in various fields like ecology, medical research, and genomics. In high-dimensional settings, where the number of features exceeds the sample size, feature selection becomes essential. While frequentist methods like Lasso have advanced in handling high-dimensional count data, Bayesian approaches remain under-explored with no theoretical results on prediction performance. This paper introduces a novel probabilistic machine learning framework for high-dimensional count data prediction. We propose a pseudo-Bayesian method that integrates a scaled Student prior to promote sparsity and uses an exponential weight aggregation procedure. A key contribution is a novel risk measure tailored to count data prediction, with theoretical guarantees for prediction risk using PAC-Bayesian bounds. Our results include non-asymptotic oracle inequalities, demonstrating rate-optimal prediction error without prior knowledge of sparsity. We implement this approach efficiently using Langevin Monte Carlo method. Simulations and a real data application highlight the strong performance of our method compared to the Lasso in various settings."}, "https://arxiv.org/abs/2410.15383": {"title": "Probabilities for asymmetric p-outside values", "link": "https://arxiv.org/abs/2410.15383", "description": "arXiv:2410.15383v1 Announce Type: new \nAbstract: In 2017-2020 Jordanova and co-authors investigate probabilities for p-outside values and determine them in many particular cases. They show that these probabilities are closely related to the concept for heavy tails. Tukey's boxplots are very popular and useful in practice. 
Analogously to the chi-square criterion, the relative frequencies with which observations fall in the different parts of the boxplot, compared with the corresponding probabilities that an observation from a fixed probability distribution falls in the same parts, help practitioners to identify an accurate probability distribution for the observed random variable. This opens the door to working with distribution-sensitive estimators, which in many cases are more accurate, especially in small-sample investigations. All these methods, however, suffer from the disadvantage that they use the inter-quantile range in a symmetric way. The concept of outside values should take into account the form of the distribution. Therefore, here, we allow for more asymmetry in the analysis of the tails of distributions. We suggest new theoretical and empirical box-plots and characteristics of the tails of distributions. These are theoretical asymmetric p-outside values functions. We partially investigate some of their properties and give some examples. It turns out that they do not depend on the center and the scaling factor of the distribution. Therefore, they are very appropriate for comparing the tails of distributions and, later on, for estimating the parameters that govern the tail behaviour of the cumulative distribution function."}, "https://arxiv.org/abs/2410.15421": {"title": "A New Framework for Bayesian Function Registration", "link": "https://arxiv.org/abs/2410.15421", "description": "arXiv:2410.15421v1 Announce Type: new \nAbstract: Function registration, also referred to as alignment, has been one of the fundamental problems in the field of functional data analysis. Classical registration methods such as the Fisher-Rao alignment focus on estimating an optimal time warping function between functions. In recent studies, a model on time warping has attracted more attention, and it can be used as a prior term to combine with the classical method (as a likelihood term) in a Bayesian framework. The Bayesian approaches have shown improvement over the classical methods. However, the prior model on time warping is often based on a nonlinear approximation, which may introduce inaccuracy and inefficiency. To overcome these problems, we propose a new Bayesian approach by adopting a prior which provides a linear representation and under which various stochastic processes (Gaussian or non-Gaussian) can be effectively utilized for time warping. No linearization approximation is needed in the time warping computation, and the posterior can be obtained via a conventional Markov Chain Monte Carlo approach. We thoroughly investigate the impact of the prior on the performance of functional registration with multiple simulation examples, which demonstrate the superiority of the new framework over the previous methods. We finally utilize the new method in a real dataset and obtain a desirable alignment result."}, "https://arxiv.org/abs/2410.15477": {"title": "Randomization Inference for Before-and-After Studies with Multiple Units: An Application to a Criminal Procedure Reform in Uruguay", "link": "https://arxiv.org/abs/2410.15477", "description": "arXiv:2410.15477v1 Announce Type: new \nAbstract: We study the immediate impact of a new code of criminal procedure on crime. 
In November 2017, Uruguay switched from an inquisitorial system (where a single judge leads the investigation and decides the appropriate punishment for a particular crime) to an adversarial system (where the investigation is now led by prosecutors and the judge plays an overseeing role). To analyze the short-term effects of this reform, we develop a randomization-based approach for before-and-after studies with multiple units. Our framework avoids parametric time series assumptions and eliminates extrapolation by basing statistical inferences on finite-sample methods that rely only on the time periods closest to the time of the policy intervention. A key identification assumption underlying our method is that there would have been no time trends in the absence of the intervention, which is most plausible in a small window around the time of the reform. We also discuss several falsification methods to assess the plausibility of this assumption. Using our proposed inferential approach, we find statistically significant short-term causal effects of the crime reform. Our unbiased estimate shows an average increase of approximately 25 police reports per day in the week following the implementation of the new adversarial system in Montevideo, representing an 8 percent increase compared to the previous week under the old system."}, "https://arxiv.org/abs/2410.15530": {"title": "Simultaneous Inference in Multiple Matrix-Variate Graphs for High-Dimensional Neural Recordings", "link": "https://arxiv.org/abs/2410.15530", "description": "arXiv:2410.15530v1 Announce Type: new \nAbstract: As large-scale neural recordings become common, many neuroscientific investigations are focused on identifying functional connectivity from spatio-temporal measurements in two or more brain areas across multiple sessions. Spatial-temporal data in neural recordings can be represented as matrix-variate data, with time as the first dimension and space as the second. In this paper, we exploit the multiple matrix-variate Gaussian Graphical model to encode the common underlying spatial functional connectivity across multiple sessions of neural recordings. By effectively integrating information across multiple graphs, we develop a novel inferential framework that allows simultaneous testing to detect meaningful connectivity for a target edge subset of arbitrary size. Our test statistics are based on a group penalized regression approach and a high-dimensional Gaussian approximation technique. The validity of simultaneous testing is demonstrated theoretically under mild assumptions on sample size and non-stationary autoregressive temporal dependence. Our test is nearly optimal in achieving the testable region boundary. Additionally, our method involves only convex optimization and parametric bootstrap, making it computationally attractive. We demonstrate the efficacy of the new method through both simulations and an experimental study involving multiple local field potential (LFP) recordings in the Prefrontal Cortex (PFC) and visual area V4 during a memory-guided saccade task."}, "https://arxiv.org/abs/2410.15560": {"title": "Ablation Studies for Novel Treatment Effect Estimation Models", "link": "https://arxiv.org/abs/2410.15560", "description": "arXiv:2410.15560v1 Announce Type: new \nAbstract: Ablation studies are essential for understanding the contribution of individual components within complex models, yet their application in nonparametric treatment effect estimation remains limited. 
This paper emphasizes the importance of ablation studies by examining the Bayesian Causal Forest (BCF) model, particularly the inclusion of the estimated propensity score $\hat{\pi}(x_i)$ intended to mitigate regularization-induced confounding (RIC). Through a partial ablation study utilizing five synthetic data-generating processes with varying baseline and propensity score complexities, we demonstrate that excluding $\hat{\pi}(x_i)$ does not diminish the model's performance in estimating average and conditional average treatment effects or in uncertainty quantification. Moreover, omitting $\hat{\pi}(x_i)$ reduces computational time by approximately 21\%. These findings suggest that the BCF model's inherent flexibility suffices to adjust for confounding without explicitly incorporating the propensity score. The study advocates for the routine use of ablation studies in treatment effect estimation to ensure model components are essential and to prevent unnecessary complexity."}, "https://arxiv.org/abs/2410.15596": {"title": "Assessing mediation in cross-sectional stepped wedge cluster randomized trials", "link": "https://arxiv.org/abs/2410.15596", "description": "arXiv:2410.15596v1 Announce Type: new \nAbstract: Mediation analysis has been comprehensively studied for independent data but relatively little work has been done for correlated data, especially for the increasingly adopted stepped wedge cluster randomized trials (SW-CRTs). Motivated by challenges in understanding the effect mechanisms in pragmatic and implementation science clinical trials, we develop new methods for mediation analysis in SW-CRTs. Specifically, based on linear and generalized linear mixed models, we demonstrate how to estimate the natural indirect effect and mediation proportion in typical SW-CRTs with four data types, including both continuous and binary mediators and outcomes. Furthermore, to address the emerging challenges in exposure-time treatment effect heterogeneity, we derive the mediation expressions in SW-CRTs when the total effect varies as a function of the exposure time. The cluster jackknife approach is considered for inference across all data types and treatment effect structures. We conduct extensive simulations to evaluate the finite-sample performance of the proposed mediation estimators and demonstrate the proposed approach in a real data example. A user-friendly R package mediateSWCRT has been developed to facilitate the practical implementation of the estimators."}, "https://arxiv.org/abs/2410.15634": {"title": "Distributionally Robust Instrumental Variables Estimation", "link": "https://arxiv.org/abs/2410.15634", "description": "arXiv:2410.15634v1 Announce Type: new \nAbstract: Instrumental variables (IV) estimation is a fundamental method in econometrics and statistics for estimating causal effects in the presence of unobserved confounding. However, challenges such as untestable model assumptions and poor finite sample properties have undermined its reliability in practice. Viewing common issues in IV estimation as distributional uncertainties, we propose DRIVE, a distributionally robust framework of the classical IV estimation method. When the ambiguity set is based on a Wasserstein distance, DRIVE minimizes a square root ridge regularized variant of the two stage least squares (TSLS) objective. 
We develop a novel asymptotic theory for this regularized regression estimator based on the square root ridge, showing that it achieves consistency without requiring the regularization parameter to vanish. This result follows from a fundamental property of the square root ridge, which we call ``delayed shrinkage''. This novel property, which also holds for a class of generalized method of moments (GMM) estimators, ensures that the estimator is robust to distributional uncertainties that persist in large samples. We further derive the asymptotic distribution of Wasserstein DRIVE and propose data-driven procedures to select the regularization parameter based on theoretical results. Simulation studies confirm the superior finite sample performance of Wasserstein DRIVE. Thanks to its regularization and robustness properties, Wasserstein DRIVE could be preferable in practice, particularly when the practitioner is uncertain about model assumptions or distributional shifts in data."}, "https://arxiv.org/abs/2410.15705": {"title": "Variable screening for covariate dependent extreme value index estimation", "link": "https://arxiv.org/abs/2410.15705", "description": "arXiv:2410.15705v1 Announce Type: new \nAbstract: One of the main topics of extreme value analysis is to estimate the extreme value index, an important parameter that controls the tail behavior of the distribution. In many cases, estimating the extreme value index of the target variable associated with covariates is useful. Although the estimation of the covariate-dependent extreme value index has been developed by numerous researchers, no results have been presented regarding covariate selection. This paper proposes a sure independence screening method for covariate-dependent extreme value index estimation. For the screening, the marginal utility between the target variable and each covariate is calculated using the conditional Pickands estimator. A single-index model that uses the covariates selected by screening is further provided to estimate the extreme value index after screening. Monte Carlo simulations confirmed the finite sample performance of the proposed method. In addition, a real-data application is presented."}, "https://arxiv.org/abs/2410.15713": {"title": "Nonparametric method of structural break detection in stochastic time series regression model", "link": "https://arxiv.org/abs/2410.15713", "description": "arXiv:2410.15713v1 Announce Type: new \nAbstract: We propose a nonparametric algorithm to detect structural breaks in the conditional mean and/or variance of a time series. Our method does not assume any specific parametric form for the dependence structure of the regressor, the time series model, or the distribution of the model noise. This flexibility allows our algorithm to be applicable to a wide range of time series structures commonly encountered in financial econometrics. The effectiveness of the proposed algorithm is validated through an extensive simulation study and a real data application in detecting structural breaks in the mean and volatility of Bitcoin returns. 
The algorithm's ability to identify structural breaks in the data highlights its practical utility in econometric analysis and financial modeling."}, "https://arxiv.org/abs/2410.15734": {"title": "A Kernelization-Based Approach to Nonparametric Binary Choice Models", "link": "https://arxiv.org/abs/2410.15734", "description": "arXiv:2410.15734v1 Announce Type: new \nAbstract: We propose a new estimator for nonparametric binary choice models that does not impose a parametric structure on either the systematic function of covariates or the distribution of the error term. A key advantage of our approach is its computational efficiency. For instance, even when assuming a normal error distribution as in probit models, commonly used sieves for approximating an unknown function of covariates can lead to a large-dimensional optimization problem when the number of covariates is moderate. Our approach, motivated by kernel methods in machine learning, views certain reproducing kernel Hilbert spaces as special sieve spaces, coupled with spectral cut-off regularization for dimension reduction. We establish the consistency of the proposed estimator for both the systematic function of covariates and the distribution function of the error term, and asymptotic normality of the plug-in estimator for weighted average partial derivatives. Simulation studies show that, compared to parametric estimation methods, the proposed method effectively improves finite sample performance in cases of misspecification, and has a rather mild efficiency loss if the model is correctly specified. Using administrative data on the grant decisions of US asylum applications to immigration courts, along with nine case-day variables on weather and pollution, we re-examine the effect of outdoor temperature on court judges' \"mood\", and thus, their grant decisions."}, "https://arxiv.org/abs/2410.15874": {"title": "A measure of departure from symmetry via the Fisher-Rao distance for contingency tables", "link": "https://arxiv.org/abs/2410.15874", "description": "arXiv:2410.15874v1 Announce Type: new \nAbstract: A measure of asymmetry is a quantification method that allows for the comparison of categorical evaluations before and after treatment effects or among different target populations, irrespective of sample size. We focus on square contingency tables that summarize survey results between two time points or cohorts, represented by the same categorical variables. We propose a measure to evaluate the degree of departure from a symmetry model using cosine similarity. This proposal is based on the Fisher-Rao distance, allowing asymmetry to be interpreted as a geodesic distance between two distributions. Various measures of asymmetry have been proposed, but visualizing the relationship of these quantification methods on a two-dimensional plane demonstrates that the proposed measure provides the geometrically simplest and most natural quantification. Moreover, the visualized figure indicates that the proposed method for measuring departures from symmetry is less affected by very few cells with extreme asymmetry. 
A simulation study shows that for square contingency tables with an underlying asymmetry model, our method can directly extract and quantify only the asymmetric structure of the model, and can more sensitively detect departures from symmetry than divergence-type measures."}, "https://arxiv.org/abs/2410.15968": {"title": "A Causal Transformation Model for Time-to-Event Data Affected by Unobserved Confounding: Revisiting the Illinois Reemployment Bonus Experiment", "link": "https://arxiv.org/abs/2410.15968", "description": "arXiv:2410.15968v1 Announce Type: new \nAbstract: Motivated by studies investigating causal effects in survival analysis, we propose a transformation model to quantify the impact of a binary treatment on a time-to-event outcome. The approach is based on a flexible linear transformation structural model that links a monotone function of the time-to-event with the propensity for treatment through a bivariate Gaussian distribution. The model equations are specified as functions of additive predictors, allowing the impacts of observed confounders to be accounted for flexibly. Furthermore, the effect of the instrumental variable may be regularized through a ridge penalty, while interactions between the treatment and modifier variables can be incorporated into the model to acknowledge potential variations in treatment effects across different subgroups. The baseline survival function is estimated in a flexible manner using monotonic P-splines, while unobserved confounding is captured through the dependence parameter of the bivariate Gaussian. Parameter estimation is achieved via a computationally efficient and stable penalized maximum likelihood estimation approach and intervals constructed using the related inferential results. We revisit a dataset from the Illinois Reemployment Bonus Experiment to estimate the causal effect of a cash bonus on unemployment duration, unveiling new insights. The modeling framework is incorporated into the R package GJRM, enabling researchers and practitioners to fit the proposed causal survival model and obtain easy-to-interpret numerical and visual summaries."}, "https://arxiv.org/abs/2410.16017": {"title": "Semiparametric Bayesian Inference for a Conditional Moment Equality Model", "link": "https://arxiv.org/abs/2410.16017", "description": "arXiv:2410.16017v1 Announce Type: new \nAbstract: Conditional moment equality models are regularly encountered in empirical economics, yet they are difficult to estimate. These models map a conditional distribution of data to a structural parameter via the restriction that a conditional mean equals zero. Using this observation, I introduce a Bayesian inference framework in which an unknown conditional distribution is replaced with a nonparametric posterior, and structural parameter inference is then performed using an implied posterior. The method has the same flexibility as frequentist semiparametric estimators and does not require converting conditional moments to unconditional moments. Importantly, I prove a semiparametric Bernstein-von Mises theorem, providing conditions under which, in large samples, the posterior for the structural parameter is approximately normal, centered at an efficient estimator, and has variance equal to the Chamberlain (1987) semiparametric efficiency bound. As byproducts, I show that Bayesian uncertainty quantification methods are asymptotically optimal frequentist confidence sets and derive low-level sufficient conditions for Gaussian process priors. 
The latter sheds light on a key prior stability condition and relates to the numerical aspects of the paper in which these priors are used to predict the welfare effects of price changes."}, "https://arxiv.org/abs/2410.16076": {"title": "Improving the (approximate) sequential probability ratio test by avoiding overshoot", "link": "https://arxiv.org/abs/2410.16076", "description": "arXiv:2410.16076v1 Announce Type: new \nAbstract: The sequential probability ratio test (SPRT) by Wald (1945) is a cornerstone of sequential analysis. Based on desired type-I, II error levels $\alpha, \beta \in (0,1)$, it stops when the likelihood ratio statistic crosses certain upper and lower thresholds, guaranteeing optimality of the expected sample size. However, these thresholds are not closed form and the test is often applied with approximate thresholds $(1-\beta)/\alpha$ and $\beta/(1-\alpha)$ (approximate SPRT). When $\beta > 0$, this neither guarantees type I,II error control at $\alpha,\beta$ nor optimality. When $\beta=0$ (power-one SPRT), it guarantees type I error control at $\alpha$ that is in general conservative, and thus not optimal. The looseness in both cases is caused by overshoot: the test statistic overshoots the thresholds at the stopping time. One standard way to address this is to calculate the right thresholds numerically, but many papers and software packages do not do this. In this paper, we describe a different way to improve the approximate SPRT: we change the test statistic to avoid overshoot. Our technique uniformly improves power-one SPRTs $(\beta=0)$ for simple nulls and alternatives, or for one-sided nulls and alternatives in exponential families. When $\beta > 0$, our techniques provide valid type I error guarantees and lead to type II error similar to Wald's, but often need fewer samples. These improved sequential tests can also be used for deriving tighter parametric confidence sequences, and can be extended to nontrivial settings like sampling without replacement and conformal martingales."}, "https://arxiv.org/abs/2410.16096": {"title": "Dynamic Time Warping-based imputation of long gaps in human mobility trajectories", "link": "https://arxiv.org/abs/2410.16096", "description": "arXiv:2410.16096v1 Announce Type: new \nAbstract: Individual mobility trajectories are difficult to measure and often incur long periods of missingness. Aggregation of this mobility data without accounting for the missingness leads to erroneous results, underestimating travel behavior. This paper proposes Dynamic Time Warping-Based Multiple Imputation (DTWBMI) as a method of filling long gaps in human mobility trajectories in order to use the available data to the fullest extent. This method reduces spatiotemporal trajectories to time series of particular travel behavior, then selects candidates for multiple imputation on the basis of the dynamic time warping distance between the potential donor series and the series preceding and following the gap in the recipient series, and finally imputes values multiple times. A simulation study designed to establish optimal parameters for DTWBMI provides two versions of the method. These two methods are applied to a real-world dataset of individual mobility trajectories with simulated missingness and compared against other methods of handling missingness. Linear interpolation outperforms DTWBMI and other methods when gaps are short and data are limited. 
DTWBMI outperforms other methods when gaps become longer and when more data are available."}, "https://arxiv.org/abs/2410.16112": {"title": "Dynamic Biases of Static Panel Data Estimators", "link": "https://arxiv.org/abs/2410.16112", "description": "arXiv:2410.16112v1 Announce Type: new \nAbstract: This paper identifies an important bias - termed dynamic bias - in fixed effects panel estimators that arises when dynamic feedback is ignored in the estimating equation. Dynamic feedback occurs if past outcomes impact current outcomes, a feature of many settings ranging from economic growth to agricultural and labor markets. When estimating equations omit past outcomes, dynamic bias can lead to significantly inaccurate treatment effect estimates, even with randomly assigned treatments. This dynamic bias in simulations is larger than Nickell bias. I show that dynamic bias stems from the estimation of fixed effects, as their estimation generates confounding in the data. To recover consistent treatment effects, I develop a flexible estimator that provides fixed-T bias correction. I apply this approach to study the impact of temperature shocks on GDP, a canonical example where economic theory points to an important feedback from past to future outcomes. Accounting for dynamic bias lowers the estimated effects of higher yearly temperatures on GDP growth by 10% and GDP levels by 120%."}, "https://arxiv.org/abs/2410.16214": {"title": "Asymmetries in Financial Spillovers", "link": "https://arxiv.org/abs/2410.16214", "description": "arXiv:2410.16214v1 Announce Type: new \nAbstract: This paper analyzes nonlinearities in the international transmission of financial shocks originating in the US. To do so, we develop a flexible nonlinear multi-country model. Our framework is capable of producing asymmetries in the responses to financial shocks for shock size and sign, and over time. We show that international reactions to US-based financial shocks are asymmetric along these dimensions. Particularly, we find that adverse shocks trigger stronger declines in output, inflation, and stock markets than benign shocks. Further, we investigate time variation in the estimated dynamic effects and characterize the responsiveness of three major central banks to financial shocks."}, "https://arxiv.org/abs/2410.14783": {"title": "High-Dimensional Tensor Discriminant Analysis with Incomplete Tensors", "link": "https://arxiv.org/abs/2410.14783", "description": "arXiv:2410.14783v1 Announce Type: cross \nAbstract: Tensor classification has gained prominence across various fields, yet the challenge of handling partially observed tensor data in real-world applications remains largely unaddressed. This paper introduces a novel approach to tensor classification with incomplete data, framed within the tensor high-dimensional linear discriminant analysis. Specifically, we consider a high-dimensional tensor predictor with missing observations under the Missing Completely at Random (MCR) assumption and employ the Tensor Gaussian Mixture Model to capture the relationship between the tensor predictor and class label. We propose the Tensor LDA-MD algorithm, which manages high-dimensional tensor predictors with missing entries by leveraging the low-rank structure of the discriminant tensor. A key feature of our approach is a novel covariance estimation method under the tensor-based MCR model, supported by theoretical results that allow for correlated entries under mild conditions. 
Our work establishes the convergence rate of the estimation error of the discriminant tensor with incomplete data and minimax optimal bounds for the misclassification rate, addressing key gaps in the literature. Additionally, we derive large deviation results for the generalized mode-wise (separable) sample covariance matrix and its inverse, which are crucial tools in our analysis and hold independent interest. Our method demonstrates excellent performance in simulations and real data analysis, even with significant proportions of missing data. This research advances high-dimensional LDA and tensor learning, providing practical tools for applications with incomplete data and a solid theoretical foundation for classification accuracy in complex settings."}, "https://arxiv.org/abs/2410.14812": {"title": "Isolated Causal Effects of Natural Language", "link": "https://arxiv.org/abs/2410.14812", "description": "arXiv:2410.14812v1 Announce Type: cross \nAbstract: As language technologies become widespread, it is important to understand how variations in language affect reader perceptions -- formalized as the isolated causal effect of some focal language-encoded intervention on an external outcome. A core challenge of estimating isolated effects is the need to approximate all non-focal language outside of the intervention. In this paper, we introduce a formal estimation framework for isolated causal effects and explore how different approximations of non-focal language impact effect estimates. Drawing on the principle of omitted variable bias, we present metrics for evaluating the quality of isolated effect estimation and non-focal language approximation along the axes of fidelity and overlap. In experiments on semi-synthetic and real-world data, we validate the ability of our framework to recover ground truth isolated effects, and we demonstrate the utility of our proposed metrics as measures of quality for both isolated effect estimates and non-focal language approximations."}, "https://arxiv.org/abs/2410.14843": {"title": "Predictive variational inference: Learn the predictively optimal posterior distribution", "link": "https://arxiv.org/abs/2410.14843", "description": "arXiv:2410.14843v1 Announce Type: cross \nAbstract: Vanilla variational inference finds an optimal approximation to the Bayesian posterior distribution, but even the exact Bayesian posterior is often not meaningful under model misspecification. We propose predictive variational inference (PVI): a general inference framework that seeks and samples from an optimal posterior density such that the resulting posterior predictive distribution is as close to the true data generating process as possible, where this closeness is measured by multiple scoring rules. By optimizing this objective, predictive variational inference is generally not the same as, and does not even attempt to approximate, the Bayesian posterior, even asymptotically. Rather, we interpret it as an implicit hierarchical expansion. Further, the learned posterior uncertainty detects heterogeneity of parameters among the population, enabling automatic model diagnosis. This framework applies to both likelihood-exact and likelihood-free models. 
We demonstrate its application in real data examples."}, "https://arxiv.org/abs/2410.14904": {"title": "Switchback Price Experiments with Forward-Looking Demand", "link": "https://arxiv.org/abs/2410.14904", "description": "arXiv:2410.14904v1 Announce Type: cross \nAbstract: We consider a retailer running a switchback experiment for the price of a single product, with infinite supply. In each period, the seller chooses a price $p$ from a set of predefined prices that consist of a reference price and a few discounted price levels. The goal is to estimate the demand gradient at the reference price point, with the goal of adjusting the reference price to improve revenue after the experiment. In our model, in each period, a unit mass of buyers arrives on the market, with values distributed based on a time-varying process. Crucially, buyers are forward looking with a discounted utility and will choose to not purchase now if they expect to face a discounted price in the near future. We show that forward-looking demand introduces bias in naive estimators of the demand gradient, due to intertemporal interference. Furthermore, we prove that there is no estimator that uses data from price experiments with only two price points that can recover the correct demand gradient, even in the limit of an infinitely long experiment with an infinitesimal price discount. Moreover, we characterize the form of the bias of naive estimators. Finally, we show that with a simple three price level experiment, the seller can remove the bias due to strategic forward-looking behavior and construct an estimator for the demand gradient that asymptotically recovers the truth."}, "https://arxiv.org/abs/2410.15049": {"title": "Modeling Time-Varying Effects of Mobile Health Interventions Using Longitudinal Functional Data from HeartSteps Micro-Randomized Trial", "link": "https://arxiv.org/abs/2410.15049", "description": "arXiv:2410.15049v1 Announce Type: cross \nAbstract: To optimize mobile health interventions and advance domain knowledge on intervention design, it is critical to understand how the intervention effect varies over time and with contextual information. This study aims to assess how a push notification suggesting physical activity influences individuals' step counts using data from the HeartSteps micro-randomized trial (MRT). The statistical challenges include the time-varying treatments and longitudinal functional step count measurements. We propose the first semiparametric causal excursion effect model with varying coefficients to model the time-varying effects within a decision point and across decision points in an MRT. The proposed model incorporates double time indices to accommodate the longitudinal functional outcome, enabling the assessment of time-varying effect moderation by contextual variables. We propose a two-stage causal effect estimator that is robust against a misspecified high-dimensional outcome regression nuisance model. We establish asymptotic theory and conduct simulation studies to validate the proposed estimator. 
Our analysis provides new insights into individuals' change in response profiles (such as how soon a response occurs) due to the activity suggestions, how such changes differ by the type of suggestions received, and how such changes depend on other contextual information such as being recently sedentary and the day being a weekday."}, "https://arxiv.org/abs/2410.15057": {"title": "Asymptotic Time-Uniform Inference for Parameters in Averaged Stochastic Approximation", "link": "https://arxiv.org/abs/2410.15057", "description": "arXiv:2410.15057v1 Announce Type: cross \nAbstract: We study time-uniform statistical inference for parameters in stochastic approximation (SA), which encompasses a wide range of applications in optimization and machine learning. To that end, we analyze the almost-sure convergence rates of the averaged iterates to a scaled sum of Gaussians in both linear and nonlinear SA problems. We then construct three types of asymptotic confidence sequences that are valid uniformly across all times with coverage guarantees, in the asymptotic sense that the starting time is sufficiently large. These coverage guarantees remain valid if the unknown covariance matrix is replaced by its plug-in estimator, and we conduct experiments to validate our methodology."}, "https://arxiv.org/abs/2410.15166": {"title": "Joint Probability Estimation of Many Binary Outcomes via Localized Adversarial Lasso", "link": "https://arxiv.org/abs/2410.15166", "description": "arXiv:2410.15166v1 Announce Type: cross \nAbstract: In this work we consider estimating the probability of many (possibly dependent) binary outcomes which is at the core of many applications, e.g., multi-level treatments in causal inference, demand for bundles of products, etc. Without further conditions, the probability distribution of an M dimensional binary vector is characterized by a number of coefficients that is exponential in M, which can lead to a high-dimensional problem even without the presence of covariates. Understanding the (in)dependence structure allows us to substantially improve the estimation as it allows for an effective factorization of the probability distribution. In order to estimate the probability distribution of an M dimensional binary vector, we leverage a Bahadur representation that connects the sparsity of its coefficients with independence across the components. We propose to use regularized and adversarial regularized estimators to obtain an adaptive estimator with respect to the dependence structure which allows for rates of convergence to depend on this intrinsic (lower) dimension. These estimators are needed to handle several challenges within this setting, including estimating nuisance parameters, estimating covariates, and nonseparable moment conditions. Our main results consider the presence of (low dimensional) covariates for which we propose a locally penalized estimator. We provide pointwise rates of convergence addressing several issues in the theoretical analyses as we strive for a computationally tractable formulation. We apply our results in the estimation of causal effects with multiple binary treatments and show how our estimators can improve the finite sample performance when compared with non-adaptive estimators that try to estimate all the probabilities directly. 
We also provide simulations that are consistent with our theoretical findings."}, "https://arxiv.org/abs/2410.15180": {"title": "HACSurv: A Hierarchical Copula-based Approach for Survival Analysis with Dependent Competing Risks", "link": "https://arxiv.org/abs/2410.15180", "description": "arXiv:2410.15180v1 Announce Type: cross \nAbstract: In survival analysis, subjects often face competing risks; for example, individuals with cancer may also suffer from heart disease or other illnesses, which can jointly influence the prognosis of risks and censoring. Traditional survival analysis methods often treat competing risks as independent and fail to accommodate the dependencies between different conditions. In this paper, we introduce HACSurv, a survival analysis method that learns Hierarchical Archimedean Copula structures and cause-specific survival functions from data with competing risks. HACSurv employs a flexible dependency structure using hierarchical Archimedean copulas to represent the relationships between competing risks and censoring. By capturing the dependencies between risks and censoring, HACSurv achieves better survival predictions and offers insights into risk interactions. Experiments on synthetic datasets demonstrate that our method can accurately identify the complex dependency structure and precisely predict survival distributions, whereas the compared methods exhibit significant deviations between their predictions and the true distributions. Experiments on multiple real-world datasets also demonstrate that our method achieves better survival prediction compared to previous state-of-the-art methods."}, "https://arxiv.org/abs/2410.15244": {"title": "Extensions on low-complexity DCT approximations for larger blocklengths based on minimal angle similarity", "link": "https://arxiv.org/abs/2410.15244", "description": "arXiv:2410.15244v1 Announce Type: cross \nAbstract: The discrete cosine transform (DCT) is a central tool for image and video coding because it can be related to the Karhunen-Lo\`eve transform (KLT), which is the optimal transform in terms of retained transform coefficients and data decorrelation. In this paper, we introduce 16-, 32-, and 64-point low-complexity DCT approximations by minimizing individually the angle between the rows of the exact DCT matrix and the matrix induced by the approximate transforms. According to some classical figures of merit, the proposed transforms outperformed the approximations for the DCT already known in the literature. Fast algorithms were also developed for the low-complexity transforms, striking a good balance between performance and computational cost. Practical applications in image encoding showed the relevance of the transforms in this context. In fact, the experiments showed that the proposed transforms had better results than the known approximations in the literature for blocklengths of 16, 32, and 64."}, "https://arxiv.org/abs/2410.15491": {"title": "Structural Causality-based Generalizable Concept Discovery Models", "link": "https://arxiv.org/abs/2410.15491", "description": "arXiv:2410.15491v1 Announce Type: cross \nAbstract: The rising need for explainable deep neural network architectures has utilized semantic concepts as explainable units. Several approaches utilizing disentangled representation learning estimate the generative factors and utilize them as concepts for explaining DNNs. 
However, even though the generative factors for a dataset remain fixed, concepts are not fixed entities and vary based on downstream tasks. In this paper, we propose a disentanglement mechanism utilizing a variational autoencoder (VAE) for learning mutually independent generative factors for a given dataset and subsequently learning task-specific concepts using a structural causal model (SCM). Our method assumes generative factors and concepts to form a bipartite graph, with directed causal edges from generative factors to concepts. Experiments are conducted on datasets with known generative factors: D-sprites and Shapes3D. On specific downstream tasks, our proposed method successfully learns task-specific concepts which are explained well by the causal edges from the generative factors. Lastly, unlike current causal concept discovery methods, our methodology is generalizable to an arbitrary number of concepts and flexible to any downstream task."}, "https://arxiv.org/abs/2410.15564": {"title": "Reward Maximization for Pure Exploration: Minimax Optimal Good Arm Identification for Nonparametric Multi-Armed Bandits", "link": "https://arxiv.org/abs/2410.15564", "description": "arXiv:2410.15564v1 Announce Type: cross \nAbstract: In multi-armed bandits, the tasks of reward maximization and pure exploration are often at odds with each other. The former focuses on exploiting arms with the highest means, while the latter may require constant exploration across all arms. In this work, we focus on good arm identification (GAI), a practical bandit inference objective that aims to label arms with means above a threshold as quickly as possible. We show that GAI can be efficiently solved by combining a reward-maximizing sampling algorithm with a novel nonparametric anytime-valid sequential test for labeling arm means. We first establish that our sequential test maintains error control under highly nonparametric assumptions and asymptotically achieves the minimax optimal e-power, a notion of power for anytime-valid tests. Next, by pairing regret-minimizing sampling schemes with our sequential test, we provide an approach that achieves minimax optimal stopping times for labeling arms with means above a threshold, under an error probability constraint. Our empirical results validate our approach beyond the minimax setting, reducing the expected number of samples for all stopping times by at least 50% across both synthetic and real-world settings."}, "https://arxiv.org/abs/2410.15648": {"title": "Linking Model Intervention to Causal Interpretation in Model Explanation", "link": "https://arxiv.org/abs/2410.15648", "description": "arXiv:2410.15648v1 Announce Type: cross \nAbstract: Intervention intuition is often used in model explanation where the intervention effect of a feature on the outcome is quantified by the difference of a model prediction when the feature value is changed from the current value to the baseline value. Such a model intervention effect of a feature is inherently associational. In this paper, we study the conditions under which an intuitive model intervention effect has a causal interpretation, i.e., when it indicates whether a feature is a direct cause of the outcome. This work links the model intervention effect to the causal interpretation of a model. Such an interpretation capability is important since it indicates whether a machine learning model is trustworthy to domain experts. 
The conditions also reveal the limitations of using a model intervention effect for causal interpretation in an environment with unobserved features. Experiments on semi-synthetic datasets have been conducted to validate theorems and show the potential for using the model intervention effect for model interpretation."}, "https://arxiv.org/abs/2410.15655": {"title": "Accounting for Missing Covariates in Heterogeneous Treatment Estimation", "link": "https://arxiv.org/abs/2410.15655", "description": "arXiv:2410.15655v1 Announce Type: cross \nAbstract: Many applications of causal inference require using treatment effects estimated on a study population to make decisions in a separate target population. We consider the challenging setting where there are covariates that are observed in the target population that were not seen in the original study. Our goal is to estimate the tightest possible bounds on heterogeneous treatment effects conditioned on such newly observed covariates. We introduce a novel partial identification strategy based on ideas from ecological inference; the main idea is that estimates of conditional treatment effects for the full covariate set must marginalize correctly when restricted to only the covariates observed in both populations. Furthermore, we introduce a bias-corrected estimator for these bounds and prove that it enjoys fast convergence rates and statistical guarantees (e.g., asymptotic normality). Experimental results on both real and synthetic data demonstrate that our framework can produce bounds that are much tighter than would otherwise be possible."}, "https://arxiv.org/abs/2410.15711": {"title": "Quantiles and Quantile Regression on Riemannian Manifolds: a measure-transportation-based approach", "link": "https://arxiv.org/abs/2410.15711", "description": "arXiv:2410.15711v1 Announce Type: cross \nAbstract: Increased attention has been given recently to the statistical analysis of variables with values on nonlinear manifolds. A natural but nontrivial problem in that context is the definition of quantile concepts. We are proposing a solution for compact Riemannian manifolds without boundaries; typical examples are polyspheres, hyperspheres, and toro\\\"{\\i}dal manifolds equipped with their Riemannian metrics. Our concept of quantile function comes along with a concept of distribution function and, in the empirical case, ranks and signs. The absence of a canonical ordering is offset by resorting to the data-driven ordering induced by optimal transports. Theoretical properties, such as the uniform convergence of the empirical distribution and conditional (and unconditional) quantile functions and distribution-freeness of ranks and signs, are established. Statistical inference applications, from goodness-of-fit to distribution-free rank-based testing, are without number. Of particular importance is the case of quantile regression with directional or toro\\\"{\\i}dal multiple output, which is given special attention in this paper. Extensive simulations are carried out to illustrate these novel concepts."}, "https://arxiv.org/abs/2410.15931": {"title": "Towards more realistic climate model outputs: A multivariate bias correction based on zero-inflated vine copulas", "link": "https://arxiv.org/abs/2410.15931", "description": "arXiv:2410.15931v1 Announce Type: cross \nAbstract: Climate model large ensembles are an essential research tool for analysing and quantifying natural climate variability and providing robust information for rare extreme events. 
The models' simulated representations of reality are susceptible to bias due to incomplete understanding of physical processes. This paper aims to correct the bias of five climate variables from the CRCM5 Large Ensemble over Central Europe at a 3-hourly temporal resolution. At this high temporal resolution, two variables, precipitation and radiation, exhibit a high share of zero inflation. We propose a novel bias-correction method, VBC (Vine copula bias correction), that models and transfers multivariate dependence structures for zero-inflated margins in the data from its error-prone model domain to a reference domain. VBC estimates the model and reference distribution using vine copulas and corrects the model distribution via (inverse) Rosenblatt transformation. To deal with the variables' zero-inflated nature, we develop a new vine density decomposition that accommodates such variables and employs an adequately randomized version of the Rosenblatt transform. This novel approach allows for more accurate modelling of multivariate zero-inflated climate data. Compared with state-of-the-art correction methods, VBC is generally the best-performing correction and the most accurate method for correcting zero-inflated events."}, "https://arxiv.org/abs/2410.15938": {"title": "Quantifying world geography as seen through the lens of Soviet propaganda", "link": "https://arxiv.org/abs/2410.15938", "description": "arXiv:2410.15938v1 Announce Type: cross \nAbstract: Cultural data typically contains a variety of biases. In particular, geographical locations are unequally portrayed in media, creating a distorted representation of the world. Identifying and measuring such biases is crucial to understand both the data and the socio-cultural processes that have produced them. Here we suggest measuring geographical biases in a large historical news media corpus by studying the representation of cities. Leveraging ideas of quantitative urban science, we develop a mixed quantitative-qualitative procedure, which allows us to get robust quantitative estimates of the biases. These biases can be further qualitatively interpreted resulting in a hermeneutic feedback loop. We apply this procedure to a corpus of the Soviet newsreel series 'Novosti Dnya' (News of the Day) and show that city representation grows super-linearly with city size, and is further biased by city specialization and geographical location. This allows us to systematically identify geographical regions which are explicitly or sneakily emphasized by Soviet propaganda and quantify their importance."}, "https://arxiv.org/abs/1903.00037": {"title": "Sequential and Simultaneous Distance-based Dimension Reduction", "link": "https://arxiv.org/abs/1903.00037", "description": "arXiv:1903.00037v3 Announce Type: replace \nAbstract: This paper introduces a method called Sequential and Simultaneous Distance-based Dimension Reduction ($S^2D^2R$) that performs simultaneous dimension reduction for a pair of random vectors based on Distance Covariance (dCov). Compared with Sufficient Dimension Reduction (SDR) and Canonical Correlation Analysis (CCA)-based approaches, $S^2D^2R$ is a model-free approach that does not impose dimensional or distributional restrictions on variables and is more sensitive to nonlinear relationships. Theoretically, we establish a non-asymptotic error bound to guarantee the performance of $S^2D^2R$. Numerically, $S^2D^2R$ performs comparably to or better than other state-of-the-art algorithms and is computationally faster. 
All codes of our $S^2D^2R$ method can be found on Github, including an R package named S2D2R."}, "https://arxiv.org/abs/2204.08100": {"title": "Multi-Model Subset Selection", "link": "https://arxiv.org/abs/2204.08100", "description": "arXiv:2204.08100v3 Announce Type: replace \nAbstract: The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the L0-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by \"blackbox\" multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding L0-penalized regression models by extending recent developments in L0-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of predictors and the response variable. Empirical studies and theoretical knowledge about ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. The optimization algorithms are implemented in publicly available software packages."}, "https://arxiv.org/abs/2206.09819": {"title": "Double soft-thresholded model for multi-group scalar on vector-valued image regression", "link": "https://arxiv.org/abs/2206.09819", "description": "arXiv:2206.09819v4 Announce Type: replace \nAbstract: In this paper, we develop a novel spatial variable selection method for scalar on vector-valued image regression in a multi-group setting. Here, 'vector-valued image' refers to the imaging datasets that contain vector-valued information at each pixel/voxel location, such as in RGB color images, multimodal medical images, DTI imaging, etc. The focus of this work is to identify the spatial locations in the image having an important effect on the scalar outcome measure. Specifically, the overall effect of each voxel is of interest. We thus develop a novel shrinkage prior by soft-thresholding the \\ell_2 norm of a latent multivariate Gaussian process. It will allow us to estimate sparse and piecewise-smooth spatially varying vector-valued regression coefficient functions. For posterior inference, an efficient MCMC algorithm is developed. We establish the posterior contraction rate for parameter estimation and consistency for variable selection of the proposed Bayesian model, assuming that the true regression coefficients are Holder smooth. 
Finally, we demonstrate the advantages of the proposed method in simulation studies and further illustrate in an ADNI dataset for modeling MMSE scores based on DTI-based vector-valued imaging markers."}, "https://arxiv.org/abs/2208.04669": {"title": "Boosting with copula-based components", "link": "https://arxiv.org/abs/2208.04669", "description": "arXiv:2208.04669v2 Announce Type: replace \nAbstract: The authors propose new additive models for binary outcomes, where the components are copula-based regression models (Noh et al, 2013), and designed such that the model may capture potentially complex interaction effects. The models do not require discretisation of continuous covariates, and are therefore suitable for problems with many such covariates. A fitting algorithm, and efficient procedures for model selection and evaluation of the components are described. Software is provided in the R-package copulaboost. Simulations and illustrations on data sets indicate that the method's predictive performance is either better than or comparable to the other methods."}, "https://arxiv.org/abs/2305.05106": {"title": "Mixed effects models for extreme value index regression", "link": "https://arxiv.org/abs/2305.05106", "description": "arXiv:2305.05106v4 Announce Type: replace \nAbstract: Extreme value theory (EVT) provides an elegant mathematical tool for the statistical analysis of rare events. When data are collected from multiple population subgroups, because some subgroups may have less data available for extreme value analysis, a scientific interest of many researchers would be to improve the estimates obtained directly from each subgroup. To achieve this, we incorporate the mixed effects model (MEM) into the regression technique in EVT. In small area estimation, the MEM has attracted considerable attention as a primary tool for producing reliable estimates for subgroups with small sample sizes, i.e., ``small areas.'' The key idea of MEM is to incorporate information from all subgroups into a single model and to borrow strength from all subgroups to improve estimates for each subgroup. Using this property, in extreme value analysis, the MEM may contribute to reducing the bias and variance of the direct estimates from each subgroup. This prompts us to evaluate the effectiveness of the MEM for EVT through theoretical studies and numerical experiments, including its application to the risk assessment of a number of stocks in the cryptocurrency market."}, "https://arxiv.org/abs/2307.12892": {"title": "A Statistical View of Column Subset Selection", "link": "https://arxiv.org/abs/2307.12892", "description": "arXiv:2307.12892v2 Announce Type: replace \nAbstract: We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. 
Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework."}, "https://arxiv.org/abs/2309.02631": {"title": "A Bayesian Nonparametric Method to Adjust for Unmeasured Confounding with Negative Controls", "link": "https://arxiv.org/abs/2309.02631", "description": "arXiv:2309.02631v2 Announce Type: replace \nAbstract: Unmeasured confounding bias threatens the validity of observational studies. While sensitivity analyses and study designs have been proposed to address this issue, they often overlook the growing availability of auxiliary data. Using negative controls from these data is a promising new approach to reduce unmeasured confounding bias. In this article, we develop a Bayesian nonparametric method to estimate a causal exposure-response function (CERF) leveraging information from negative controls to adjust for unmeasured confounding. We model the CERF as a mixture of linear models. This strategy captures the potential nonlinear shape of CERFs while maintaining computational efficiency, and it leverages closed-form results that hold under the linear model assumption. We assess the performance of our method through simulation studies. We found that the proposed method can recover the true shape of the CERF in the presence of unmeasured confounding under assumptions. To show the practical utility of our approach, we apply it to adjust for a possible unmeasured confounder when evaluating the relationship between long-term exposure to ambient $PM_{2.5}$ and cardiovascular hospitalization rates among the elderly in the continental US. We implement our estimation procedure in open-source software and have made the code publicly available to ensure reproducibility."}, "https://arxiv.org/abs/2401.00354": {"title": "Maximum Likelihood Estimation under the Emax Model: Existence, Geometry and Efficiency", "link": "https://arxiv.org/abs/2401.00354", "description": "arXiv:2401.00354v2 Announce Type: replace \nAbstract: This study focuses on the estimation of the Emax dose-response model, a widely utilized framework in clinical trials, agriculture, and environmental experiments. Existing challenges in obtaining maximum likelihood estimates (MLE) for model parameters are often ascribed to computational issues but, in reality, stem from the absence of a MLE. Our contribution provides a new understanding and control of all the experimental situations that practitioners might face, guiding them in the estimation process. We derive the exact MLE for a three-point experimental design and we identify the two scenarios where the MLE fails. To address these challenges, we propose utilizing Firth's modified score, providing its analytical expression as a function of the experimental design. Through a simulation study, we demonstrate that, in one of the problematic cases, the Firth modification yields a finite estimate. For the remaining case, we introduce a design-augmentation strategy akin to a hypothesis test."}, "https://arxiv.org/abs/2302.12093": {"title": "Experimenting under Stochastic Congestion", "link": "https://arxiv.org/abs/2302.12093", "description": "arXiv:2302.12093v4 Announce Type: replace-cross \nAbstract: We study randomized experiments in a service system when stochastic congestion can arise from temporarily limited supply or excess demand. 
Such congestion gives rise to cross-unit interference between the waiting customers, and analytic strategies that do not account for this interference may be biased. In current practice, one of the most widely used ways to address stochastic congestion is to use switchback experiments that alternatively turn a target intervention on and off for the whole system. We find, however, that under a queueing model for stochastic congestion, the standard way of analyzing switchbacks is inefficient, and that estimators that leverage the queueing model can be materially more accurate. Additionally, we show how the queueing model enables estimation of total policy gradients from unit-level randomized experiments, thus giving practitioners an alternative experimental approach they can use without needing to pre-commit to a fixed switchback length before data collection."}, "https://arxiv.org/abs/2306.15709": {"title": "Privacy-Preserving Community Detection for Locally Distributed Multiple Networks", "link": "https://arxiv.org/abs/2306.15709", "description": "arXiv:2306.15709v2 Announce Type: replace-cross \nAbstract: Modern multi-layer networks are commonly stored and analyzed in a local and distributed fashion because of the privacy, ownership, and communication costs. The literature on the model-based statistical methods for community detection based on these data is still limited. This paper proposes a new method for consensus community detection and estimation in a multi-layer stochastic block model using locally stored and computed network data with privacy protection. A novel algorithm named privacy-preserving Distributed Spectral Clustering (ppDSC) is developed. To preserve the edges' privacy, we adopt the randomized response (RR) mechanism to perturb the network edges, which satisfies the strong notion of differential privacy. The ppDSC algorithm is performed on the squared RR-perturbed adjacency matrices to prevent possible cancellation of communities among different layers. To remove the bias incurred by RR and the squared network matrices, we develop a two-step bias-adjustment procedure. Then we perform eigen-decomposition on the debiased matrices, aggregation of the local eigenvectors using an orthogonal Procrustes transformation, and k-means clustering. We provide theoretical analysis on the statistical errors of ppDSC in terms of eigen-vector estimation. In addition, the blessings and curses of network heterogeneity are well-explained by our bounds."}, "https://arxiv.org/abs/2407.01032": {"title": "Overcoming Common Flaws in the Evaluation of Selective Classification Systems", "link": "https://arxiv.org/abs/2407.01032", "description": "arXiv:2407.01032v2 Announce Type: replace-cross \nAbstract: Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the $\\mathrm{AUROC}$ in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. 
We propose the Area under the Generalized Risk Coverage curve ($\\mathrm{AUGRC}$), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of $\\mathrm{AUGRC}$ on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets."}, "https://arxiv.org/abs/2410.16391": {"title": "Causal Data Fusion for Panel Data without Pre-Intervention Period", "link": "https://arxiv.org/abs/2410.16391", "description": "arXiv:2410.16391v1 Announce Type: new \nAbstract: Traditional panel data causal inference frameworks, such as difference-in-differences and synthetic control methods, rely on pre-intervention data to estimate counterfactuals, which may not be available in real-world settings when interventions are implemented in response to sudden events. In this paper, we introduce two data fusion methods for causal inference from panel data in scenarios where pre-intervention data is unavailable. These methods leverage auxiliary reference domains with related panel data to estimate causal effects in the target domain, overcoming the limitations imposed by the absence of pre-intervention data. We show the efficacy of these methods by obtaining converging bounds on the absolute bias as well as through simulations, showing their robustness in a variety of panel data settings. Our findings provide a framework for applying causal inference in urgent and data-constrained environments, such as public health crises or epidemiological shocks."}, "https://arxiv.org/abs/2410.16477": {"title": "Finite-Sample and Distribution-Free Fair Classification: Optimal Trade-off Between Excess Risk and Fairness, and the Cost of Group-Blindness", "link": "https://arxiv.org/abs/2410.16477", "description": "arXiv:2410.16477v1 Announce Type: new \nAbstract: Algorithmic fairness in machine learning has recently garnered significant attention. However, two pressing challenges remain: (1) The fairness guarantees of existing fair classification methods often rely on specific data distribution assumptions and large sample sizes, which can lead to fairness violations when the sample size is moderate-a common situation in practice. (2) Due to legal and societal considerations, using sensitive group attributes during decision-making (referred to as the group-blind setting) may not always be feasible.\n In this work, we quantify the impact of enforcing algorithmic fairness and group-blindness in binary classification under group fairness constraints. Specifically, we propose a unified framework for fair classification that provides distribution-free and finite-sample fairness guarantees with controlled excess risk. This framework is applicable to various group fairness notions in both group-aware and group-blind scenarios. Furthermore, we establish a minimax lower bound on the excess risk, showing the minimax optimality of our proposed algorithm up to logarithmic factors. 
Through extensive simulation studies and real data analysis, we further demonstrate the superior performance of our algorithm compared to existing methods, and provide empirical support for our theoretical findings."}, "https://arxiv.org/abs/2410.16526": {"title": "A Dynamic Spatiotemporal and Network ARCH Model with Common Factors", "link": "https://arxiv.org/abs/2410.16526", "description": "arXiv:2410.16526v1 Announce Type: new \nAbstract: We introduce a dynamic spatiotemporal volatility model that extends traditional approaches by incorporating spatial, temporal, and spatiotemporal spillover effects, along with volatility-specific observed and latent factors. The model offers a more general network interpretation, making it applicable for studying various types of network spillovers. The primary innovation lies in incorporating volatility-specific latent factors into the dynamic spatiotemporal volatility model. Using Bayesian estimation via the Markov Chain Monte Carlo (MCMC) method, the model offers a robust framework for analyzing the spatial, temporal, and spatiotemporal effects of a log-squared outcome variable on its volatility. We recommend using the deviance information criterion (DIC) and a regularized Bayesian MCMC method to select the number of relevant factors in the model. The model's flexibility is demonstrated through two applications: a spatiotemporal model applied to the U.S. housing market and another applied to financial stock market networks, both highlighting the model's ability to capture varying degrees of interconnectedness. In both applications, we find strong spatial/network interactions with relatively stronger spillover effects in the stock market."}, "https://arxiv.org/abs/2410.16577": {"title": "High-dimensional Grouped-regression using Bayesian Sparse Projection-posterior", "link": "https://arxiv.org/abs/2410.16577", "description": "arXiv:2410.16577v1 Announce Type: new \nAbstract: We consider a novel Bayesian approach to estimation, uncertainty quantification, and variable selection for a high-dimensional linear regression model under sparsity. The number of predictors can be nearly exponentially large relative to the sample size. We put a conjugate normal prior initially disregarding sparsity, but for making an inference, instead of the original multivariate normal posterior, we use the posterior distribution induced by a map transforming the vector of regression coefficients to a sparse vector obtained by minimizing the sum of squares of deviations plus a suitably scaled $\\ell_1$-penalty on the vector. We show that the resulting sparse projection-posterior distribution contracts around the true value of the parameter at the optimal rate adapted to the sparsity of the vector. We show that the true sparsity structure gets a large sparse projection-posterior probability. We further show that an appropriately recentred credible ball has the correct asymptotic frequentist coverage. Finally, we describe how the computational burden can be distributed to many machines, each dealing with only a small fraction of the whole dataset. We conduct a comprehensive simulation study under a variety of settings and found that the proposed method performs well for finite sample sizes. We also apply the method to several real datasets, including the ADNI data, and compare its performance with the state-of-the-art methods. 
We implemented the method in the \\texttt{R} package called \\texttt{sparseProj}, and all computations have been carried out using this package."}, "https://arxiv.org/abs/2410.16608": {"title": "Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective", "link": "https://arxiv.org/abs/2410.16608", "description": "arXiv:2410.16608v1 Announce Type: new \nAbstract: Visualizing high-dimensional data is an important routine for understanding biomedical data and interpreting deep learning models. Neighbor embedding methods, such as t-SNE, UMAP, and LargeVis, among others, are a family of popular visualization methods which reduce high-dimensional data to two dimensions. However, recent studies suggest that these methods often produce visual artifacts, potentially leading to incorrect scientific conclusions. Recognizing that the current limitation stems from a lack of data-independent notions of embedding maps, we introduce a novel conceptual and computational framework, LOO-map, that learns the embedding maps based on a classical statistical idea known as the leave-one-out. LOO-map extends the embedding over a discrete set of input points to the entire input space, enabling a systematic assessment of map continuity, and thus the reliability of the visualizations. We find for many neighbor embedding methods, their embedding maps can be intrinsically discontinuous. The discontinuity induces two types of observed map distortion: ``overconfidence-inducing discontinuity,\" which exaggerates cluster separation, and ``fracture-inducing discontinuity,\" which creates spurious local structures. Building upon LOO-map, we propose two diagnostic point-wise scores -- perturbation score and singularity score -- to address these limitations. These scores can help identify unreliable embedding points, detect out-of-distribution data, and guide hyperparameter selection. Our approach is flexible and works as a wrapper around many neighbor embedding algorithms. We test our methods across multiple real-world datasets from computer vision and single-cell omics to demonstrate their effectiveness in enhancing the interpretability and accuracy of visualizations."}, "https://arxiv.org/abs/2410.16627": {"title": "Enhancing Computational Efficiency in High-Dimensional Bayesian Analysis: Applications to Cancer Genomics", "link": "https://arxiv.org/abs/2410.16627", "description": "arXiv:2410.16627v1 Announce Type: new \nAbstract: In this study, we present a comprehensive evaluation of the Two-Block Gibbs (2BG) sampler as a robust alternative to the traditional Three-Block Gibbs (3BG) sampler in Bayesian shrinkage models. Through extensive simulation studies, we demonstrate that the 2BG sampler exhibits superior computational efficiency and faster convergence rates, particularly in high-dimensional settings where the ratio of predictors to samples is large. We apply these findings to real-world data from the NCI-60 cancer cell panel, leveraging gene expression data to predict protein expression levels. Our analysis incorporates feature selection, identifying key genes that influence protein expression while shedding light on the underlying genetic mechanisms in cancer cells. The results indicate that the 2BG sampler not only produces more effective samples than the 3BG counterpart but also significantly reduces computational costs, thereby enhancing the applicability of Bayesian methods in high-dimensional data analysis. 
This contribution extends the understanding of shrinkage techniques in statistical modeling and offers valuable insights for cancer genomics research."}, "https://arxiv.org/abs/2410.16656": {"title": "Parsimonious Dynamic Mode Decomposition: A Robust and Automated Approach for Optimally Sparse Mode Selection in Complex Systems", "link": "https://arxiv.org/abs/2410.16656", "description": "arXiv:2410.16656v1 Announce Type: new \nAbstract: This paper introduces the Parsimonious Dynamic Mode Decomposition (parsDMD), a novel algorithm designed to automatically select an optimally sparse subset of dynamic modes for both spatiotemporal and purely temporal data. By incorporating time-delay embedding and leveraging Orthogonal Matching Pursuit (OMP), parsDMD ensures robustness against noise and effectively handles complex, nonlinear dynamics. The algorithm is validated on a diverse range of datasets, including standing wave signals, identifying hidden dynamics, fluid dynamics simulations (flow past a cylinder and transonic buffet), and atmospheric sea-surface temperature (SST) data. ParsDMD addresses a significant limitation of the traditional sparsity-promoting DMD (spDMD), which requires manual tuning of sparsity parameters through a rigorous trial-and-error process to balance between single-mode and all-mode solutions. In contrast, parsDMD autonomously determines the optimally sparse subset of modes without user intervention, while maintaining minimal computational complexity. Comparative analyses demonstrate that parsDMD consistently outperforms spDMD by providing more accurate mode identification and effective reconstruction in noisy environments. These advantages render parsDMD an effective tool for real-time diagnostics, forecasting, and reduced-order model construction across various disciplines."}, "https://arxiv.org/abs/2410.16716": {"title": "A class of modular and flexible covariate-based covariance functions for nonstationary spatial modeling", "link": "https://arxiv.org/abs/2410.16716", "description": "arXiv:2410.16716v1 Announce Type: new \nAbstract: The assumptions of stationarity and isotropy often stated over spatial processes have not aged well during the last two decades, partly explained by the combination of computational developments and the increasing availability of high-resolution spatial data. While a plethora of approaches have been developed to relax these assumptions, it is often a costly tradeoff between flexibility and a diversity of computational challenges. In this paper, we present a class of covariance functions that relies on fixed, observable spatial information that provides a convenient tradeoff while offering an extra layer of numerical and visual representation of the flexible spatial dependencies. This model allows for separate parametric structures for different sources of nonstationarity, such as marginal standard deviation, geometric anisotropy, and smoothness. It simplifies to a Mat\\'ern covariance function in its basic form and is adaptable for large datasets, enhancing flexibility and computational efficiency. 
We analyze the capabilities of the presented model through simulation studies and an application to Swiss precipitation data."}, "https://arxiv.org/abs/2410.16722": {"title": "Robust Variable Selection for High-dimensional Regression with Missing Data and Measurement Errors", "link": "https://arxiv.org/abs/2410.16722", "description": "arXiv:2410.16722v1 Announce Type: new \nAbstract: In this paper, we focus on robust variable selection for data with missing values and measurement errors. Missing data and measurement errors can lead to a distorted data distribution. We propose an exponential loss function with a tuning parameter for data affected by missingness and measurement errors. By adjusting the parameter, the loss function remains robust across a variety of data distributions. We use inverse probability weighting and additive error models to address missing data and measurement errors. We also find that the Atan penalty works better. We use Monte Carlo simulations to assess the validity of the robust variable selection procedure and validate our findings with a breast cancer dataset."}, "https://arxiv.org/abs/2410.16806": {"title": "Simplified vine copula models: state of science and affairs", "link": "https://arxiv.org/abs/2410.16806", "description": "arXiv:2410.16806v1 Announce Type: new \nAbstract: Vine copula models have become highly popular practical tools for modeling multivariate dependencies. To maintain tractability, a commonly employed simplifying assumption is that conditional copulas remain unchanged by the conditioning variables. This assumption has sparked a somewhat polarizing debate within our community. The fact that much of this dispute occurs outside the public record has placed the field in an unfortunate position, impeding scientific progress. In this article, I will review what we know about the flexibility and limitations of simplified vine copula models, explore the broader implications, and offer my own, hopefully reconciling, perspective on the issue."}, "https://arxiv.org/abs/2410.16903": {"title": "Second-order characteristics for spatial point processes with graph-valued marks", "link": "https://arxiv.org/abs/2410.16903", "description": "arXiv:2410.16903v1 Announce Type: new \nAbstract: The immense progress in data collection and storage capacities has yielded rather complex, challenging spatial event-type data, where each event location is augmented by a non-simple mark. Despite the growing interest in analysing such complex event patterns, the methodology for such analyses is not well embedded in the literature. In particular, the literature lacks statistical methods to analyse marks which are characterised by an inherent relational structure, i.e.\\ where the mark is graph-valued. Motivated by epidermal nerve fibre data, we introduce different mark summary characteristics, which investigate the average variation or association between pairs of graph-valued marks, and apply some of the methods to the nerve data."}, "https://arxiv.org/abs/2410.16998": {"title": "Identifying Conduct Parameters with Separable Demand: A Counterexample to Lau (1982)", "link": "https://arxiv.org/abs/2410.16998", "description": "arXiv:2410.16998v1 Announce Type: new \nAbstract: We provide a counterexample to the conduct parameter identification result established in the foundational work of Lau (1982), which generalizes the identification theorem of Bresnahan (1982) by relaxing the linearity assumptions. 
We identify a separable demand function that still permits identification and validate this case both theoretically and through numerical simulations."}, "https://arxiv.org/abs/2410.17046": {"title": "Mesoscale two-sample testing for network data", "link": "https://arxiv.org/abs/2410.17046", "description": "arXiv:2410.17046v1 Announce Type: new \nAbstract: Networks arise naturally in many scientific fields as a representation of pairwise connections. Statistical network analysis has most often considered a single large network, but it is common in a number of applications, for example, neuroimaging, to observe multiple networks on a shared node set. When these networks are grouped by case-control status or another categorical covariate, the classical statistical question of two-sample comparison arises. In this work, we address the problem of testing for statistically significant differences in a given arbitrary subset of connections. This general framework allows an analyst to focus on a single node, a specific region of interest, or compare whole networks. Our ability to conduct \"mesoscale\" testing on a meaningful group of edges is particularly relevant for applications such as neuroimaging and distinguishes our approach from prior work, which tends to focus either on a single node or the whole network. In this mesoscale setting, we develop statistically sound projection-based tests for two-sample comparison in both weighted and binary edge networks. Our approach can leverage all available network information, and learn informative projections which improve testing power when low-dimensional latent network structure is present."}, "https://arxiv.org/abs/2410.17105": {"title": "General Seemingly Unrelated Local Projections", "link": "https://arxiv.org/abs/2410.17105", "description": "arXiv:2410.17105v1 Announce Type: new \nAbstract: We provide a framework for efficiently estimating impulse response functions with Local Projections (LPs). Our approach offers a Bayesian treatment for LPs with Instrumental Variables, accommodating multiple shocks and instruments per shock, accounts for autocorrelation in multi-step forecasts by jointly modeling all LPs as a seemingly unrelated system of equations, defines a flexible yet parsimonious joint prior for impulse responses based on a Gaussian Process, allows for joint inference about the entire vector of impulse responses, and uses all available data across horizons by imputing missing values."}, "https://arxiv.org/abs/2410.17108": {"title": "A general framework for probabilistic model uncertainty", "link": "https://arxiv.org/abs/2410.17108", "description": "arXiv:2410.17108v1 Announce Type: new \nAbstract: Existing approaches to model uncertainty typically either compare models using a quantitative model selection criterion or evaluate posterior model probabilities having set a prior. In this paper, we propose an alternative strategy which views missing observations as the source of model uncertainty, where the true model would be identified with the complete data. To quantify model uncertainty, it is then necessary to provide a probability distribution for the missing observations conditional on what has been observed. This can be set sequentially using one-step-ahead predictive densities, which recursively sample from the best model according to some consistent model selection criterion. Repeated predictive sampling of the missing data, to give a complete dataset and hence a best model each time, provides our measure of model uncertainty. 
This approach bypasses the need for subjective prior specification or integration over parameter spaces, addressing issues with standard methods such as the Bayes factor. Predictive resampling also suggests an alternative view of hypothesis testing as a decision problem based on a population statistic, where we directly index the probabilities of competing models. In addition to hypothesis testing, we provide illustrations from density estimation and variable selection, demonstrating our approach on a range of standard problems."}, "https://arxiv.org/abs/2410.17153": {"title": "A Bayesian Perspective on the Maximum Score Problem", "link": "https://arxiv.org/abs/2410.17153", "description": "arXiv:2410.17153v1 Announce Type: new \nAbstract: This paper presents a Bayesian inference framework for a linear index threshold-crossing binary choice model that satisfies a median independence restriction. The key idea is that the model is observationally equivalent to a probit model with nonparametric heteroskedasticity. Consequently, Gibbs sampling techniques from Albert and Chib (1993) and Chib and Greenberg (2013) lead to a computationally attractive Bayesian inference procedure in which a Gaussian process forms a conditionally conjugate prior for the natural logarithm of the skedastic function."}, "https://arxiv.org/abs/2410.16307": {"title": "Functional Clustering of Discount Functions for Behavioral Investor Profiling", "link": "https://arxiv.org/abs/2410.16307", "description": "arXiv:2410.16307v1 Announce Type: cross \nAbstract: Classical finance models are based on the premise that investors act rationally and utilize all available information when making portfolio decisions. However, these models often fail to capture the anomalies observed in intertemporal choices and decision-making under uncertainty, particularly when accounting for individual differences in preferences and consumption patterns. Such limitations hinder traditional finance theory's ability to address key questions like: How do personal preferences shape investment choices? What drives investor behaviour? And how do individuals select their portfolios? One prominent contribution is Pompian's model of four Behavioral Investor Types (BITs), which links behavioural finance studies with Keirsey's temperament theory, highlighting the role of personality in financial decision-making. Yet, traditional parametric models struggle to capture how these distinct temperaments influence intertemporal decisions, such as how individuals evaluate trade-offs between present and future outcomes. To address this gap, the present study employs Functional Data Analysis (FDA) to specifically investigate temporal discounting behaviours revealing nuanced patterns in how different temperaments perceive and manage uncertainty over time. Our findings show heterogeneity within each temperament, suggesting that investor profiles are far more diverse than previously thought. This refined classification provides deeper insights into the role of temperament in shaping intertemporal financial decisions, offering practical implications for financial advisors to better tailor strategies to individual risk preferences and decision-making styles."}, "https://arxiv.org/abs/2410.16333": {"title": "Conformal Predictive Portfolio Selection", "link": "https://arxiv.org/abs/2410.16333", "description": "arXiv:2410.16333v1 Announce Type: cross \nAbstract: This study explores portfolio selection using predictive models for portfolio returns. 
Portfolio selection is a fundamental task in finance, and various methods have been developed to achieve this goal. For example, the mean-variance approach constructs portfolios by balancing the trade-off between the mean and variance of asset returns, while the quantile-based approach optimizes portfolios by accounting for tail risk. These traditional methods often rely on distributional information estimated from historical data. However, a key concern is the uncertainty of future portfolio returns, which may not be fully captured by simple reliance on historical data, such as using the sample average. To address this, we propose a framework for predictive portfolio selection using conformal inference, called Conformal Predictive Portfolio Selection (CPPS). Our approach predicts future portfolio returns, computes corresponding prediction intervals, and selects the desirable portfolio based on these intervals. The framework is flexible and can accommodate a variety of predictive models, including autoregressive (AR) models, random forests, and neural networks. We demonstrate the effectiveness of our CPPS framework using an AR model and validate its performance through empirical studies, showing that it provides superior returns compared to simpler strategies."}, "https://arxiv.org/abs/2410.16419": {"title": "Data Augmentation of Multivariate Sensor Time Series using Autoregressive Models and Application to Failure Prognostics", "link": "https://arxiv.org/abs/2410.16419", "description": "arXiv:2410.16419v1 Announce Type: cross \nAbstract: This work presents a novel data augmentation solution for non-stationary multivariate time series and its application to failure prognostics. The method extends previous work from the authors which is based on time-varying autoregressive processes. It can be employed to extract key information from a limited number of samples and generate new synthetic samples in a way that potentially improves the performance of PHM solutions. This is especially valuable in situations of data scarcity which are very usual in PHM, especially for failure prognostics. The proposed approach is tested based on the CMAPSS dataset, commonly employed for prognostics experiments and benchmarks. An AutoML approach from PHM literature is employed for automating the design of the prognostics solution. The empirical evaluation provides evidence that the proposed method can substantially improve the performance of PHM solutions."}, "https://arxiv.org/abs/2410.16523": {"title": "Efficient Neural Network Training via Subset Pretraining", "link": "https://arxiv.org/abs/2410.16523", "description": "arXiv:2410.16523v1 Announce Type: cross \nAbstract: In training neural networks, it is common practice to use partial gradients computed over batches, mostly very small subsets of the training set. This approach is motivated by the argument that such a partial gradient is close to the true one, with precision growing only with the square root of the batch size. A theoretical justification is with the help of stochastic approximation theory. However, the conditions for the validity of this theory are not satisfied in the usual learning rate schedules. Batch processing is also difficult to combine with efficient second-order optimization methods. This proposal is based on another hypothesis: the loss minimum of the training set can be expected to be well-approximated by the minima of its subsets. 
Such subset minima can be computed in a fraction of the time necessary for optimizing over the whole training set. This hypothesis has been tested with the help of the MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks, optionally extended by training data augmentation. The experiments have confirmed that results equivalent to conventional training can be reached. In summary, even small subsets are representative if the overdetermination ratio for the given model parameter set sufficiently exceeds unity. The computing expense can be reduced to a tenth or less."}, "https://arxiv.org/abs/2410.16554": {"title": "On the breakdown point of transport-based quantiles", "link": "https://arxiv.org/abs/2410.16554", "description": "arXiv:2410.16554v1 Announce Type: cross \nAbstract: Recent work has used optimal transport ideas to generalize the notion of (center-outward) quantiles to dimension $d\\geq 2$. We study the robustness properties of these transport-based quantiles by deriving their breakdown point, roughly, the smallest amount of contamination required to make these quantiles take arbitrarily aberrant values. We prove that the transport median defined in Chernozhukov et al.~(2017) and Hallin et al.~(2021) has breakdown point of $1/2$. Moreover, a point in the transport depth contour of order $\\tau\\in [0,1/2]$ has breakdown point of $\\tau$. This shows that the multivariate transport depth shares the same breakdown properties as its univariate counterpart. Our proof relies on a general argument connecting the breakdown point of transport maps evaluated at a point to the Tukey depth of that point in the reference measure."}, "https://arxiv.org/abs/2410.16702": {"title": "HDNRA: An R package for HDLSS location testing with normal-reference approaches", "link": "https://arxiv.org/abs/2410.16702", "description": "arXiv:2410.16702v1 Announce Type: cross \nAbstract: The challenge of location testing for high-dimensional data in statistical inference is notable. Existing literature suggests various methods, many of which impose strong regularity conditions on underlying covariance matrices to ensure asymptotic normal distribution of test statistics, leading to difficulties in size control. To address this, a recent set of tests employing the normal-reference approach has been proposed. Moreover, the availability of tests for high-dimensional location testing in R packages implemented in C++ is limited. This paper introduces the latest methods utilizing normal-reference approaches to test the equality of mean vectors in high-dimensional samples with potentially different covariance matrices. We present an R package named HDNRA to illustrate the implementation of these tests, extending beyond the two-sample problem to encompass general linear hypothesis testing (GLHT). The package offers easy and user-friendly access to these tests, with its core implemented in C++ using Rcpp, OpenMP and RcppArmadillo for efficient execution. 
Theoretical properties of these normal-reference tests are revisited, and examples based on real datasets using different tests are provided."}, "https://arxiv.org/abs/2106.13856": {"title": "Nonparametric inference on counterfactuals in first-price auctions", "link": "https://arxiv.org/abs/2106.13856", "description": "arXiv:2106.13856v3 Announce Type: replace \nAbstract: In a classical model of the first-price sealed-bid auction with independent private values, we develop nonparametric estimators for several policy-relevant targets, such as the bidder's surplus and auctioneer's revenue under counterfactual reserve prices. Motivated by the linearity of these targets in the quantile function of bidders' values, we propose an estimator of the latter and derive its Bahadur-Kiefer expansion. This makes it possible to construct uniform confidence bands and test complex hypotheses about the auction design. Using the data on U.S. Forest Service timber auctions, we test whether setting zero reserve prices in these auctions was revenue maximizing."}, "https://arxiv.org/abs/2305.00694": {"title": "Scaling of Piecewise Deterministic Monte Carlo for Anisotropic Targets", "link": "https://arxiv.org/abs/2305.00694", "description": "arXiv:2305.00694v2 Announce Type: replace \nAbstract: Piecewise deterministic Markov processes (PDMPs) are a type of continuous-time Markov process that combine deterministic flows with jumps. Recently, PDMPs have garnered attention within the Monte Carlo community as a potential alternative to traditional Markov chain Monte Carlo (MCMC) methods. The Zig-Zag sampler and the Bouncy Particle Sampler are commonly used examples of the PDMP methodology which have also yielded impressive theoretical properties, but little is known about their robustness to extreme dependence or anisotropy of the target density. It turns out that PDMPs may suffer from poor mixing due to anisotropy and this paper investigates this effect in detail in the stylised but important Gaussian case. To this end, we employ a multi-scale analysis framework in this paper. Our results show that when the Gaussian target distribution has two scales, of order $1$ and $\\epsilon$, the computational cost of the Bouncy Particle Sampler is of order $\\epsilon^{-1}$, and the computational cost of the Zig-Zag sampler is $\\epsilon^{-2}$. In comparison, the cost of the traditional MCMC methods such as RWM is of order $\\epsilon^{-2}$, at least when the dimensionality of the small component is more than $1$. Therefore, there is a robustness advantage to using PDMPs in this context."}, "https://arxiv.org/abs/2309.07107": {"title": "Reducing Symbiosis Bias Through Better A/B Tests of Recommendation Algorithms", "link": "https://arxiv.org/abs/2309.07107", "description": "arXiv:2309.07107v3 Announce Type: replace \nAbstract: It is increasingly common in digital environments to use A/B tests to compare the performance of recommendation algorithms. However, such experiments often violate the stable unit treatment value assumption (SUTVA), particularly SUTVA's \"no hidden treatments\" assumption, due to the shared data between algorithms being compared. This results in a novel form of bias, which we term \"symbiosis bias,\" where the performance of each algorithm is influenced by the training data generated by its competitor. In this paper, we investigate three experimental designs--cluster-randomized, data-diverted, and user-corpus co-diverted experiments--aimed at mitigating symbiosis bias. 
We present a theoretical model of symbiosis bias and simulate the impact of each design in dynamic recommendation environments. Our results show that while each design reduces symbiosis bias to some extent, they also introduce new challenges, such as reduced training data in data-diverted experiments. We further validate the existence of symbiosis bias using data from a large-scale A/B test conducted on a global recommender system, demonstrating that symbiosis bias affects treatment effect estimates in the field. Our findings provide actionable insights for researchers and practitioners seeking to design experiments that accurately capture algorithmic performance without bias in treatment effect estimates introduced by shared data."}, "https://arxiv.org/abs/2410.17399": {"title": "An Anatomy of Event Studies: Hypothetical Experiments, Exact Decomposition, and Weighting Diagnostics", "link": "https://arxiv.org/abs/2410.17399", "description": "arXiv:2410.17399v1 Announce Type: new \nAbstract: In recent decades, event studies have emerged as a central methodology in health and social research for evaluating the causal effects of staggered interventions. In this paper, we analyze event studies from the perspective of experimental design and focus on the use of information across units and time periods in the construction of effect estimators. As a particular case of this approach, we offer a novel decomposition of the classical dynamic two-way fixed effects (TWFE) regression estimator for event studies. Our decomposition is expressed in closed form and reveals in finite samples the hypothetical experiment that TWFE regression adjustments approximate. This decomposition offers insights into how standard regression estimators borrow information across different units and times, clarifying and supplementing the notion of forbidden comparison noted in the literature. We propose a robust weighting approach for estimation in event studies, which allows investigators to progressively build larger valid weighted contrasts by leveraging increasingly stronger assumptions on the treatment assignment and potential outcomes mechanisms. This weighting approach also allows for the generalization of treatment effect estimates to a target population. We provide diagnostics and visualization tools and illustrate these methods in a case study of the impact of divorce reforms on female suicide."}, "https://arxiv.org/abs/2410.17601": {"title": "Flexible Approach for Statistical Disclosure Control in Geospatial Data", "link": "https://arxiv.org/abs/2410.17601", "description": "arXiv:2410.17601v1 Announce Type: new \nAbstract: We develop a flexible approach by combining the Quadtree-based method with suppression to maximize the utility of the grid data and simultaneously to reduce the risk of disclosing private information from individual units. To protect data confidentiality, we produce a high resolution grid from geo-reference data with a minimum size of 1 km nested in grids with increasingly larger resolution on the basis of statistical disclosure control methods (i.e threshold and concentration rule). While our implementation overcomes certain weaknesses of Quadtree-based method by accounting for irregularly distributed and relatively isolated marginal units, it also allows creating joint aggregation of several variables. 
The method is illustrated by relying on synthetic data from the Danish agricultural census 2020 for a set of key agricultural indicators, such as the number of agricultural holdings, the utilized agricultural area and the number of organic farms. We demonstrate the need to assess the reliability of indicators when using a sub-sample of synthetic data, followed by an example that presents the same approach for generating a ratio (i.e., the share of organic farming). The methodology is provided as the open-source \\textit{R}-package \\textit{MRG}, which is adaptable for use with other geo-referenced survey data subject to confidentiality or other privacy restrictions."}, "https://arxiv.org/abs/2410.17604": {"title": "Ranking of Multi-Response Experiment Treatments", "link": "https://arxiv.org/abs/2410.17604", "description": "arXiv:2410.17604v1 Announce Type: new \nAbstract: We present a probabilistic ranking model to identify the optimal treatment in multiple-response experiments. In contemporary practice, treatments are applied to individuals with the goal of achieving multiple ideal properties on them simultaneously. However, there are often competing properties, and the optimality of one cannot be achieved without compromising the optimality of another. Typically, we still want to know which treatment is the overall best. In our framework, we first formulate overall optimality in terms of treatment ranks. Then we infer the latent ranking that allows us to report treatments from optimal to least optimal, given the ideal desirable properties. We demonstrate through simulations and real data analysis how we can achieve reliability of inferred ranks in practice. We adopt a Bayesian approach and derive an associated Markov Chain Monte Carlo algorithm to fit our model to data. Finally, we discuss the prospects of adopting our method as a standard tool for experiment evaluation in trials-based research."}, "https://arxiv.org/abs/2410.17680": {"title": "Unraveling Residualization: enhancing its application and exposing its relationship with the FWL theorem", "link": "https://arxiv.org/abs/2410.17680", "description": "arXiv:2410.17680v1 Announce Type: new \nAbstract: The residualization procedure has been applied in many different fields to estimate models with multicollinearity. However, there exists a lack of understanding of this methodology, and some authors discourage its use. This paper aims to contribute to a better understanding of the residualization procedure to promote its adequate application and interpretation in statistics and data science. We highlight its interesting potential applications, not only to mitigate multicollinearity but also when the study is oriented toward analyzing the isolated effect of independent variables. The relation between the residualization methodology and the Frisch-Waugh-Lovell (FWL) theorem is also analyzed, concluding that, although both provide the same estimates, the interpretation of the estimated coefficients is different. These different interpretations justify the application of the residualization methodology regardless of the FWL theorem. A real data example is presented to better illustrate the contribution of this paper."}, "https://arxiv.org/abs/2410.17864": {"title": "Longitudinal Causal Inference with Selective Eligibility", "link": "https://arxiv.org/abs/2410.17864", "description": "arXiv:2410.17864v1 Announce Type: new \nAbstract: Dropout often threatens the validity of causal inference in longitudinal studies. 
While existing studies have focused on the problem of missing outcomes caused by treatment, we study an important but overlooked source of dropout, selective eligibility. For example, patients may become ineligible for subsequent treatments due to severe side effects or complete recovery. Selective eligibility differs from the problem of ``truncation by death'' because dropout occurs after observing the outcome but before receiving the subsequent treatment. This difference makes the standard approach to dropout inapplicable. We propose a general methodological framework for longitudinal causal inference with selective eligibility. By focusing on subgroups of units who would become eligible for treatment given a specific treatment history, we define the time-specific eligible treatment effect (ETE) and expected number of outcome events (EOE) under a treatment sequence of interest. Assuming a generalized version of sequential ignorability, we derive two nonparametric identification formulae, each leveraging different parts of the observed data distribution. We then derive the efficient influence function of each causal estimand, yielding the corresponding doubly robust estimator. Finally, we apply the proposed methodology to an impact evaluation of a pre-trial risk assessment instrument in the criminal justice system, in which selective eligibility arises due to recidivism."}, "https://arxiv.org/abs/2410.18021": {"title": "Deep Nonparametric Inference for Conditional Hazard Function", "link": "https://arxiv.org/abs/2410.18021", "description": "arXiv:2410.18021v1 Announce Type: new \nAbstract: We propose a novel deep learning approach to nonparametric statistical inference for the conditional hazard function of survival time with right-censored data. We use a deep neural network (DNN) to approximate the logarithm of a conditional hazard function given covariates and obtain a DNN likelihood-based estimator of the conditional hazard function. Such an estimation approach renders model flexibility and hence relaxes structural and functional assumptions on conditional hazard or survival functions. We establish the nonasymptotic error bound and functional asymptotic normality of the proposed estimator. Subsequently, we develop new one-sample tests for goodness-of-fit evaluation and two-sample tests for treatment comparison. Both simulation studies and real application analysis show superior performances of the proposed estimators and tests in comparison with existing methods."}, "https://arxiv.org/abs/2410.17398": {"title": "Sacred and Profane: from the Involutive Theory of MCMC to Helpful Hamiltonian Hacks", "link": "https://arxiv.org/abs/2410.17398", "description": "arXiv:2410.17398v1 Announce Type: cross \nAbstract: In the first edition of this Handbook, two remarkable chapters consider seemingly distinct yet deeply connected subjects ..."}, "https://arxiv.org/abs/2410.17692": {"title": "Asymptotics for parametric martingale posteriors", "link": "https://arxiv.org/abs/2410.17692", "description": "arXiv:2410.17692v1 Announce Type: cross \nAbstract: The martingale posterior framework is a generalization of Bayesian inference where one elicits a sequence of one-step ahead predictive densities instead of the likelihood and prior. Posterior sampling then involves the imputation of unseen observables, and can then be carried out in an expedient and parallelizable manner using predictive resampling without requiring Markov chain Monte Carlo. 
Recent work has investigated the use of plug-in parametric predictive densities, combined with stochastic gradient descent, to specify a parametric martingale posterior. This paper investigates the asymptotic properties of this class of parametric martingale posteriors. In particular, two central limit theorems based on martingale limit theory are introduced and applied. The first is a predictive central limit theorem, which enables a significant acceleration of the predictive resampling scheme through a hybrid sampling algorithm based on a normal approximation. The second is a Bernstein-von Mises result, which is novel for martingale posteriors, and provides methodological guidance on attaining desirable frequentist properties. We demonstrate the utility of the theoretical results in simulations and a real data example."}, "https://arxiv.org/abs/2209.04977": {"title": "Semi-supervised Triply Robust Inductive Transfer Learning", "link": "https://arxiv.org/abs/2209.04977", "description": "arXiv:2209.04977v2 Announce Type: replace \nAbstract: In this work, we propose a Semi-supervised Triply Robust Inductive transFer LEarning (STRIFLE) approach, which integrates heterogeneous data from a label-rich source population and a label-scarce target population and utilizes a large amount of unlabeled data simultaneously to improve the learning accuracy in the target population. Specifically, we consider a high dimensional covariate shift setting and employ two nuisance models, a density ratio model and an imputation model, to combine transfer learning and surrogate-assisted semi-supervised learning strategies effectively and achieve triple robustness. While the STRIFLE approach assumes the target and source populations to share the same conditional distribution of outcome Y given both the surrogate features S and predictors X, it allows the true underlying model of Y|X to differ between the two populations due to the potential covariate shift in S and X. Different from double robustness, even if both nuisance models are misspecified or the distribution of Y|(S, X) is not the same between the two populations, the triply robust STRIFLE estimator can still partially use the source population when the shifted source population and the target population share enough similarities. Moreover, it is guaranteed to be no worse than the target-only surrogate-assisted semi-supervised estimator with an additional error term from transferability detection. These desirable properties of our estimator are established theoretically and verified in finite samples via extensive simulation studies. We utilize the STRIFLE estimator to train a Type II diabetes polygenic risk prediction model for the African American target population by transferring knowledge from electronic health records linked genomic data observed in a larger European source population."}, "https://arxiv.org/abs/2301.08056": {"title": "Geodesic slice sampling on the sphere", "link": "https://arxiv.org/abs/2301.08056", "description": "arXiv:2301.08056v3 Announce Type: replace \nAbstract: Probability measures on the sphere form an important class of statistical models and are used, for example, in modeling directional data or shapes. Due to their widespread use, but also as an algorithmic building block, efficient sampling of distributions on the sphere is highly desirable. We propose a shrinkage based and an idealized geodesic slice sampling Markov chain, designed to generate approximate samples from distributions on the sphere. 
In particular, the shrinkage-based version of the algorithm can be implemented such that it runs efficiently in any dimension and has no tuning parameters. We verify reversibility and prove that under weak regularity conditions geodesic slice sampling is uniformly ergodic. Numerical experiments show that the proposed slice samplers achieve excellent mixing on challenging targets including the Bingham distribution and mixtures of von Mises-Fisher distributions. In these settings our approach outperforms standard samplers such as random-walk Metropolis-Hastings and Hamiltonian Monte Carlo."}, "https://arxiv.org/abs/2302.07227": {"title": "Transport map unadjusted Langevin algorithms: learning and discretizing perturbed samplers", "link": "https://arxiv.org/abs/2302.07227", "description": "arXiv:2302.07227v4 Announce Type: replace \nAbstract: Langevin dynamics are widely used in sampling high-dimensional, non-Gaussian distributions whose densities are known up to a normalizing constant. In particular, there is strong interest in unadjusted Langevin algorithms (ULA), which directly discretize Langevin dynamics to estimate expectations over the target distribution. We study the use of transport maps that approximately normalize a target distribution as a way to precondition and accelerate the convergence of Langevin dynamics. We show that in continuous time, when a transport map is applied to Langevin dynamics, the result is a Riemannian manifold Langevin dynamics (RMLD) with metric defined by the transport map. We also show that applying a transport map to an irreversibly-perturbed ULA results in a geometry-informed irreversible perturbation (GiIrr) of the original dynamics. These connections suggest more systematic ways of learning metrics and perturbations, and also yield alternative discretizations of the RMLD described by the map, which we study. Under appropriate conditions, these discretized processes can be endowed with non-asymptotic bounds describing convergence to the target distribution in 2-Wasserstein distance. Illustrative numerical results complement our theoretical claims."}, "https://arxiv.org/abs/2306.07362": {"title": "Large-Scale Multiple Testing of Composite Null Hypotheses Under Heteroskedasticity", "link": "https://arxiv.org/abs/2306.07362", "description": "arXiv:2306.07362v2 Announce Type: replace \nAbstract: Heteroskedasticity poses several methodological challenges in designing valid and powerful procedures for simultaneous testing of composite null hypotheses. In particular, the conventional practice of standardizing or re-scaling heteroskedastic test statistics in this setting may severely affect the power of the underlying multiple testing procedure. Additionally, when the inferential parameter of interest is correlated with the variance of the test statistic, methods that ignore this dependence may fail to control the type I error at the desired level. We propose a new Heteroskedasticity Adjusted Multiple Testing (HAMT) procedure that avoids data reduction by standardization, and directly incorporates the side information from the variances into the testing procedure. Our approach relies on an improved nonparametric empirical Bayes deconvolution estimator that offers a practical strategy for capturing the dependence between the inferential parameter of interest and the variance of the test statistic. We develop theory to show that HAMT is asymptotically valid and optimal for FDR control. 
Simulation results demonstrate that HAMT outperforms existing procedures with substantial power gain across many settings at the same FDR level. The method is illustrated on an application involving the detection of engaged users on a mobile game app."}, "https://arxiv.org/abs/2312.01925": {"title": "Coefficient Shape Alignment in Multiple Functional Linear Regression", "link": "https://arxiv.org/abs/2312.01925", "description": "arXiv:2312.01925v4 Announce Type: replace \nAbstract: In multivariate functional data analysis, different functional covariates often exhibit homogeneity. The covariates with pronounced homogeneity can be analyzed jointly within the same group, offering a parsimonious approach to modeling multivariate functional data. In this paper, a novel grouped multiple functional regression model with a new regularization approach termed {\\it ``coefficient shape alignment\"} is developed to tackle functional covariates homogeneity. The modeling procedure includes two steps: first aggregate covariates into disjoint groups using the new regularization approach; then the grouped multiple functional regression model is established based on the detected grouping structure. In this grouped model, the coefficient functions of covariates in the same group share the same shape, invariant to scaling. The new regularization approach works by penalizing differences in the shape of the coefficients. We establish conditions under which the true grouping structure can be accurately identified and derive the asymptotic properties of the model estimates. Extensive simulation studies are conducted to assess the finite-sample performance of the proposed methods. The practical applicability of the model is demonstrated through real data analysis in the context of sugar quality evaluation. This work offers a novel framework for analyzing the homogeneity of functional covariates and constructing parsimonious models for multivariate functional data."}, "https://arxiv.org/abs/2410.18144": {"title": "Using Platt's scaling for calibration after undersampling -- limitations and how to address them", "link": "https://arxiv.org/abs/2410.18144", "description": "arXiv:2410.18144v1 Announce Type: new \nAbstract: When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. 
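For context on the baseline examined in the Platt's scaling entry above (arXiv:2410.18144), a bare-bones version of calibration after undersampling might look as follows. This is a hypothetical sketch with synthetic data and scikit-learn defaults, not the authors' code; in practice a held-out calibration set would be preferable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Highly imbalanced binary data
X, y = make_classification(n_samples=20000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Undersample the majority class before fitting the base model
rng = np.random.default_rng(0)
maj, mino = np.flatnonzero(y_tr == 0), np.flatnonzero(y_tr == 1)
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
base = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])

# Platt's scaling: logistic regression of the labels on the base model's scores,
# fit on data with the original class balance
scores_tr = base.decision_function(X_tr).reshape(-1, 1)
platt = LogisticRegression(max_iter=1000).fit(scores_tr, y_tr)

p_raw = base.predict_proba(X_te)[:, 1]
p_cal = platt.predict_proba(base.decision_function(X_te).reshape(-1, 1))[:, 1]
print("mean raw prob:", p_raw.mean(),
      "mean calibrated prob:", p_cal.mean(),
      "true prevalence:", y_te.mean())
```

The entry's recommended modification swaps the plain logistic regression in the second step for a logistic generalized additive model applied to the logit of the base predictions.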
If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study."}, "https://arxiv.org/abs/2410.18159": {"title": "On the Existence of One-Sided Representations in the Generalised Dynamic Factor Model", "link": "https://arxiv.org/abs/2410.18159", "description": "arXiv:2410.18159v1 Announce Type: new \nAbstract: We consider the generalised dynamic factor model (GDFM) and assume that the dynamic common component is purely non-deterministic. We show that then the common shocks (and therefore the dynamic common component) can always be represented in terms of current and past observed variables. Hence, we further generalise existing results on the so called One-Sidedness problem of the GDFM. We may conclude that the existence of a one-sided representation that is causally subordinated to the observed variables is in the very nature of the GDFM and the lack of one-sidedness is an artefact of the chosen representation."}, "https://arxiv.org/abs/2410.18261": {"title": "Detecting Spatial Outliers: the Role of the Local Influence Function", "link": "https://arxiv.org/abs/2410.18261", "description": "arXiv:2410.18261v1 Announce Type: new \nAbstract: In the analysis of large spatial datasets, identifying and treating spatial outliers is essential for accurately interpreting geographical phenomena. While spatial correlation measures, particularly Local Indicators of Spatial Association (LISA), are widely used to detect spatial patterns, the presence of abnormal observations frequently distorts the landscape and conceals critical spatial relationships. These outliers can significantly impact analysis due to the inherent spatial dependencies present in the data. Traditional influence function (IF) methodologies, commonly used in statistical analysis to measure the impact of individual observations, are not directly applicable in the spatial context because the influence of an observation is determined not only by its own value but also by its spatial location, its connections with neighboring regions, and the values of those neighboring observations. In this paper, we introduce a local version of the influence function (LIF) that accounts for these spatial dependencies. Through the analysis of both simulated and real-world datasets, we demonstrate how the LIF provides a more nuanced and accurate detection of spatial outliers compared to traditional LISA measures and local impact assessments, improving our understanding of spatial patterns."}, "https://arxiv.org/abs/2410.18272": {"title": "Partially Identified Rankings from Pairwise Interactions", "link": "https://arxiv.org/abs/2410.18272", "description": "arXiv:2410.18272v1 Announce Type: new \nAbstract: This paper considers the problem of ranking objects based on their latent merits using data from pairwise interactions. Existing approaches rely on the restrictive assumption that all the interactions are either observed or missed randomly. We investigate what can be inferred about rankings when this assumption is relaxed. First, we demonstrate that in parametric models, such as the popular Bradley-Terry-Luce model, rankings are point-identified if and only if the tournament graph is connected. Second, we show that in nonparametric models based on strong stochastic transitivity, rankings in a connected tournament are only partially identified. 
Finally, we propose two statistical tests to determine whether a ranking belongs to the identified set. One test is valid in finite samples but computationally intensive, while the other is easy to implement and valid asymptotically. We illustrate our procedure using Brazilian employer-employee data to test whether male and female workers rank firms differently when making job transitions."}, "https://arxiv.org/abs/2410.18338": {"title": "Robust function-on-function interaction regression", "link": "https://arxiv.org/abs/2410.18338", "description": "arXiv:2410.18338v1 Announce Type: new \nAbstract: A function-on-function regression model with quadratic and interaction effects of the covariates provides greater flexibility. Despite several attempts to estimate the model's parameters, almost all existing estimation strategies are non-robust against outliers. Outliers in the quadratic and interaction effects may deteriorate the model structure more severely than outliers in the main effects. We propose a robust estimation strategy based on the robust functional principal component decomposition of the function-valued variables and the $\\tau$-estimator. The performance of the proposed method relies on the truncation parameters in the robust functional principal component decomposition of the function-valued variables. A robust Bayesian information criterion is used to determine the optimum truncation constants. A forward stepwise variable selection procedure is employed to determine relevant main, quadratic, and interaction effects to address a possible model misspecification. The finite-sample performance of the proposed method is investigated via a series of Monte-Carlo experiments. The proposed method's asymptotic consistency and influence function are also studied in the supplement, and its empirical performance is further investigated using a U.S. COVID-19 dataset."}, "https://arxiv.org/abs/2410.18381": {"title": "Inference on High Dimensional Selective Labeling Models", "link": "https://arxiv.org/abs/2410.18381", "description": "arXiv:2410.18381v1 Announce Type: new \nAbstract: A class of simultaneous equation models arises in the many domains where observed binary outcomes are themselves a consequence of the existing choices of one of the agents in the model. These models are gaining increasing interest in the computer science and machine learning literatures, where the potentially endogenous sample selection is referred to as the {\\em selective labels} problem. Empirical settings for such models arise in fields as diverse as criminal justice, health care, and insurance. For important recent work in this area, see for example Lakkaraju et al. (2017), Kleinberg et al. (2018), and Coston et al. (2021), where the authors focus on judicial bail decisions, and where one observes the outcome of whether a defendant failed to return for their court appearance only if the judge in the case decides to release the defendant on bail. Identifying and estimating such models can be computationally challenging for two reasons. One is the nonconcavity of the bivariate likelihood function, and the other is the large number of covariates in each equation. Despite these challenges, in this paper we propose a novel distribution-free estimation procedure that is computationally friendly in settings with many covariates. The new method combines the semiparametric batched gradient descent algorithm introduced in Khan et al. (2023) with a novel sorting algorithm incorporated to control for selection bias. 
Asymptotic properties of the new procedure are established under increasing dimension conditions in both equations, and its finite sample properties are explored through a simulation study and an application using judicial bail data."}, "https://arxiv.org/abs/2410.18409": {"title": "Doubly protected estimation for survival outcomes utilizing external controls for randomized clinical trials", "link": "https://arxiv.org/abs/2410.18409", "description": "arXiv:2410.18409v1 Announce Type: new \nAbstract: Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach incorporates machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework."}, "https://arxiv.org/abs/2410.18437": {"title": "Studentized Tests of Independence: Random-Lifter approach", "link": "https://arxiv.org/abs/2410.18437", "description": "arXiv:2410.18437v1 Announce Type: new \nAbstract: The exploration of associations between random objects with complex geometric structures has catalyzed the development of various novel statistical tests encompassing distance-based and kernel-based statistics. These methods have various strengths and limitations. One problem is that their test statistics tend to converge to asymptotic null distributions involving second-order Wiener chaos, which are hard to compute and need approximation or permutation techniques that use much computing power to build rejection regions. In this work, we take an entirely different and novel strategy by using the so-called ``Random-Lifter''. This method is engineered to yield test statistics with the standard normal limit under null distributions without the need for sample splitting. In other words, we set our sights on having simple limiting distributions and finding the proper statistics through reverse engineering. We use the Central Limit Theorems (CLTs) for degenerate U-statistics derived from our novel association measures to do this. As a result, the asymptotic distributions of our proposed tests are straightforward to compute. Our test statistics also have the minimax property. We further substantiate that our method maintains competitive power against existing methods with minimal adjustments to constant factors. 
Both numerical simulations and real-data analysis corroborate the efficacy of the Random-Lifter method."}, "https://arxiv.org/abs/2410.18445": {"title": "Inferring Latent Graphs from Stationary Signals Using a Graphical Autoregressive Model", "link": "https://arxiv.org/abs/2410.18445", "description": "arXiv:2410.18445v1 Announce Type: new \nAbstract: Graphs are an intuitive way to represent relationships between variables in fields such as finance and neuroscience. However, these graphs often need to be inferred from data. In this paper, we propose a novel framework to infer a latent graph by treating the observed multidimensional data as graph-referenced stationary signals. Specifically, we introduce the graphical autoregressive model (GAR), where the inverse covariance matrix of the observed signals is expressed as a second-order polynomial of the normalized graph Laplacian of the latent graph. The GAR model extends the autoregressive model from time series analysis to general undirected graphs, offering a new approach to graph inference. To estimate the latent graph, we develop a three-step procedure based on penalized maximum likelihood, supported by theoretical analysis and numerical experiments. Simulation studies and an application to S&P 500 stock price data show that the GAR model can outperform Gaussian graphical models when it fits the observed data well. Our results suggest that the GAR model offers a promising new direction for inferring latent graphs across diverse applications. Codes and example scripts are available at https://github.com/jed-harwood/SGM ."}, "https://arxiv.org/abs/2410.18486": {"title": "Evolving Voices Based on Temporal Poisson Factorisation", "link": "https://arxiv.org/abs/2410.18486", "description": "arXiv:2410.18486v1 Announce Type: new \nAbstract: The world is evolving and so is the vocabulary used to discuss topics in speech. Analysing political speech data from more than 30 years requires the use of flexible topic models to uncover the latent topics and their change in prevalence over time as well as the change in the vocabulary of the topics. We propose the temporal Poisson factorisation (TPF) model as an extension to the Poisson factorisation model to model sparse count data matrices obtained based on the bag-of-words assumption from text documents with time stamps. We discuss and empirically compare different model specifications for the time-varying latent variables consisting either of a flexible auto-regressive structure of order one or a random walk. Estimation is based on variational inference where we consider a combination of coordinate ascent updates with automatic differentiation using batching of documents. Suitable variational families are proposed to ease inference. We compare results obtained using independent univariate variational distributions for the time-varying latent variables to those obtained with a multivariate variant. We discuss in detail the results of the TPF model when analysing speeches from 18 sessions in the U.S. Senate (1981-2016)."}, "https://arxiv.org/abs/2410.18696": {"title": "Latent Functional PARAFAC for modeling multidimensional longitudinal data", "link": "https://arxiv.org/abs/2410.18696", "description": "arXiv:2410.18696v1 Announce Type: new \nAbstract: In numerous settings, it is increasingly common to deal with longitudinal data organized as high-dimensional multi-dimensional arrays, also known as tensors. 
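To illustrate the GAR construction from the graph-inference entry above (arXiv:2410.18445), the sketch below builds a precision matrix as a second-order polynomial of a normalized graph Laplacian and samples Gaussian signals from it. The graph and polynomial coefficients are arbitrary choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small undirected graph given by its adjacency matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt        # normalized graph Laplacian

# GAR-style precision: second-order polynomial of L (positive coefficients
# keep it positive definite since L's eigenvalues are non-negative)
c0, c1, c2 = 1.0, 0.5, 0.25
Theta = c0 * np.eye(4) + c1 * L + c2 * (L @ L)

# Graph-stationary Gaussian signals with inverse covariance Theta
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(Theta), size=2000)
print("sample precision matrix:\n",
      np.round(np.linalg.inv(np.cov(X, rowvar=False)), 2))
```

Graph inference in the entry goes the other way: it estimates the latent Laplacian and coefficients from observed signals via penalized maximum likelihood.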
Within this framework, the time-continuous property of longitudinal data often implies a smooth functional structure on one of the tensor modes. To help researchers investigate such data, we introduce a new tensor decomposition approach based on the CANDECOMP/PARAFAC decomposition. Our approach allows for representing a high-dimensional functional tensor as a low-dimensional set of functions and feature matrices. Furthermore, to capture the underlying randomness of the statistical setting more efficiently, we introduce a probabilistic latent model in the decomposition. A covariance-based block-relaxation algorithm is derived to obtain estimates of model parameters. Thanks to the covariance formulation of the solving procedure and thanks to the probabilistic modeling, the method can be used in sparse and irregular sampling schemes, making it applicable in numerous settings. We apply our approach to help characterize multiple neurocognitive scores observed over time in the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Finally, intensive simulations show a notable advantage of our method in reconstructing tensors."}, "https://arxiv.org/abs/2410.18734": {"title": "Response Surface Designs for Crossed and Nested Multi-Stratum Structures", "link": "https://arxiv.org/abs/2410.18734", "description": "arXiv:2410.18734v1 Announce Type: new \nAbstract: Response surface designs are usually described as being run under complete randomization of the treatment combinations to the experimental units. In practice, however, it is often necessary or beneficial to run them under some kind of restriction to the randomization, leading to multi-stratum designs. In particular, some factors are often hard to set, so they cannot have their levels reset for each experimental unit. This paper presents a general solution to designing response surface experiments in any multi-stratum structure made up of crossing and/or nesting of unit factors. A stratum-by-stratum approach to constructing designs using compound optimal design criteria is used and illustrated. It is shown that good designs can be found even for large experiments in complex structures."}, "https://arxiv.org/abs/2410.18939": {"title": "Adaptive partition Factor Analysis", "link": "https://arxiv.org/abs/2410.18939", "description": "arXiv:2410.18939v1 Announce Type: new \nAbstract: Factor Analysis has traditionally been utilized across diverse disciplines to extrapolate latent traits that influence the behavior of multivariate observed variables. Historically, the focus has been on analyzing data from a single study, neglecting the potential study-specific variations present in data from multiple studies. Multi-study factor analysis has emerged as a recent methodological advancement that addresses this gap by distinguishing between latent traits shared across studies and study-specific components arising from artifactual or population-specific sources of variation. In this paper, we extend the current methodologies by introducing novel shrinkage priors for the latent factors, thereby accommodating a broader spectrum of scenarios -- from the absence of study-specific latent factors to models in which factors pertain only to small subgroups nested within or shared between the studies. For the proposed construction we provide conditions for identifiability of factor loadings and guidelines to perform straightforward posterior computation via Gibbs sampling. 
Through comprehensive simulation studies, we demonstrate that our proposed method exhibits competitive performance across a variety of scenarios compared to existing methods, while providing richer insights. The practical benefits of our approach are further illustrated through applications to bird species co-occurrence data and ovarian cancer gene expression data."}, "https://arxiv.org/abs/2410.18243": {"title": "Saddlepoint Monte Carlo and its Application to Exact Ecological Inference", "link": "https://arxiv.org/abs/2410.18243", "description": "arXiv:2410.18243v1 Announce Type: cross \nAbstract: Assuming X is a random vector and A a non-invertible matrix, one sometimes needs to perform inference while only having access to samples of Y = AX. The corresponding likelihood is typically intractable. One may still be able to perform exact Bayesian inference using a pseudo-marginal sampler, but this requires an unbiased estimator of the intractable likelihood.\n We propose saddlepoint Monte Carlo, a method for obtaining an unbiased estimate of the density of Y with very low variance, for any model belonging to an exponential family. Our method relies on importance sampling of the characteristic function, with insights brought by the standard saddlepoint approximation scheme with exponential tilting. We show that saddlepoint Monte Carlo makes it possible to perform exact inference on particularly challenging problems and datasets. We focus on the ecological inference problem, where one observes only aggregates at a fine level. We present in particular a study of the carryover of votes between the two rounds of various French elections, using the finest available data (number of votes for each candidate in about 60,000 polling stations over most of the French territory).\n We show that existing, popular approximate methods for ecological inference can lead to substantial bias, which saddlepoint Monte Carlo is immune from. We also present original results for the 2024 legislative elections on political centre-to-left and left-to-centre conversion rates when the far-right is present in the second round. Finally, we discuss other exciting applications for saddlepoint Monte Carlo, such as dealing with aggregate data in privacy or inverse problems."}, "https://arxiv.org/abs/2410.18268": {"title": "Stabilizing black-box model selection with the inflated argmax", "link": "https://arxiv.org/abs/2410.18268", "description": "arXiv:2410.18268v1 Announce Type: cross \nAbstract: Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. This paper presents a new approach to stabilizing model selection that leverages a combination of bagging and an \"inflated\" argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. 
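The single-point instability motivating the inflated-argmax entry above (arXiv:2410.18268) is easy to reproduce in a toy leave-one-out experiment. The sketch below uses strongly correlated covariates and LassoCV from scikit-learn; it illustrates the instability only and is not the authors' bagging/inflated-argmax procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 60, 10
Z = rng.normal(size=(n, p))
Z[:, 1] = Z[:, 0] + 0.05 * rng.normal(size=n)      # nearly collinear pair
y = 2.0 * Z[:, 0] + rng.normal(scale=0.5, size=n)  # only column 0 matters

def selected_support(X, y):
    """Indices of covariates with nonzero LASSO coefficients (CV-chosen penalty)."""
    coef = LassoCV(cv=5, random_state=0).fit(X, y).coef_
    return frozenset(np.flatnonzero(np.abs(coef) > 1e-8))

# Drop one observation at a time and record which model gets selected
supports = {selected_support(np.delete(Z, i, axis=0), np.delete(y, i))
            for i in range(n)}
print("distinct selected models across leave-one-out fits:", len(supports))
```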
In addition to developing theoretical guarantees, we illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable and (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances. In both settings, the proposed method yields stable and compact collections of selected models, outperforming a variety of benchmarks."}, "https://arxiv.org/abs/2410.18435": {"title": "Forecasting Australian fertility by age, region, and birthplace", "link": "https://arxiv.org/abs/2410.18435", "description": "arXiv:2410.18435v1 Announce Type: cross \nAbstract: Fertility differentials by urban-rural residence and nativity of women in Australia significantly impact population composition at sub-national levels. We aim to provide consistent fertility forecasts for Australian women characterized by age, region, and birthplace. Age-specific fertility rates at the national and sub-national levels obtained from census data between 1981-2011 are jointly modeled and forecast by the grouped functional time series method. Forecasts for women of each region and birthplace are reconciled following the chosen hierarchies to ensure that results at various disaggregation levels consistently sum up to the respective national total. Coupling the region of residence disaggregation structure with the trace minimization reconciliation method produces the most accurate point and interval forecasts. In addition, age-specific fertility rates disaggregated by the birthplace of women show significant heterogeneity that supports the application of the grouped forecasting method."}, "https://arxiv.org/abs/2410.18844": {"title": "Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints", "link": "https://arxiv.org/abs/2410.18844", "description": "arXiv:2410.18844v1 Announce Type: cross \nAbstract: Pure exploration in bandits models multiple real-world problems, such as tuning hyper-parameters or conducting user studies, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an $r$$\\textit{-good feasible policy}$. First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. We show how this lower bound evolves with the sequential estimation of constraints. Second, we leverage the Lagrangian lower bound and the properties of convex optimisation to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. To this end, we propose a constraint-adaptive stopping rule, and while tracking the lower bound, use pessimistic estimate of the feasible set at each step. We show that these algorithms achieve asymptotically optimal sample complexity upper bounds up to constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate efficient performance of LAGEX and LATS with respect to baselines."}, "https://arxiv.org/abs/2202.04146": {"title": "A Neural Phillips Curve and a Deep Output Gap", "link": "https://arxiv.org/abs/2202.04146", "description": "arXiv:2202.04146v2 Announce Type: replace \nAbstract: Many problems plague empirical Phillips curves (PCs). 
Among them is the hurdle that the two key components, inflation expectations and the output gap, are both unobserved. Traditional remedies include proxying for the absentees or extracting them via assumptions-heavy filtering procedures. I propose an alternative route: a Hemisphere Neural Network (HNN) whose architecture yields a final layer where components can be interpreted as latent states within a Neural PC. There are benefits. First, HNN conducts the supervised estimation of nonlinearities that arise when translating a high-dimensional set of observed regressors into latent states. Second, forecasts are economically interpretable. Among other findings, the contribution of real activity to inflation appears understated in traditional PCs. In contrast, HNN captures the 2021 upswing in inflation and attributes it to a large positive output gap starting from late 2020. The unique path of HNN's gap comes from dispensing with unemployment and GDP in favor of an amalgam of nonlinearly processed alternative tightness indicators."}, "https://arxiv.org/abs/2210.08149": {"title": "Distance and Kernel-Based Measures for Global and Local Two-Sample Conditional Distribution Testing", "link": "https://arxiv.org/abs/2210.08149", "description": "arXiv:2210.08149v2 Announce Type: replace \nAbstract: Testing the equality of two conditional distributions is crucial in various modern applications, including transfer learning and causal inference. Despite its importance, this fundamental problem has received surprisingly little attention in the literature. This work aims to present a unified framework based on distance and kernel methods for both global and local two-sample conditional distribution testing. To this end, we introduce distance and kernel-based measures that characterize the homogeneity of two conditional distributions. Drawing from the concept of conditional U-statistics, we propose consistent estimators for these measures. Theoretically, we derive the convergence rates and the asymptotic distributions of the estimators under both the null and alternative hypotheses. Utilizing these measures, along with a local bootstrap approach, we develop global and local tests that can detect discrepancies between two conditional distributions at global and local levels, respectively. Our tests demonstrate reliable performance through simulations and real data analyses."}, "https://arxiv.org/abs/2401.12031": {"title": "Multi-objective optimisation using expected quantile improvement for decision making in disease outbreaks", "link": "https://arxiv.org/abs/2401.12031", "description": "arXiv:2401.12031v2 Announce Type: replace \nAbstract: Optimization under uncertainty is important in many applications, particularly to inform policy and decision making in areas such as public health. A key source of uncertainty arises from the incorporation of environmental variables as inputs into computational models or simulators. Such variables represent uncontrollable features of the optimization problem and reliable decision making must account for the uncertainty they propagate to the simulator outputs. Often, multiple, competing objectives are defined from these outputs such that the final optimal decision is a compromise between different goals.\n Here, we present emulation-based optimization methodology for such problems that extends expected quantile improvement (EQI) to address multi-objective optimization. 
Focusing on the practically important case of two objectives, we use a sequential design strategy to identify the Pareto front of optimal solutions. Uncertainty from the environmental variables is integrated out using Monte Carlo samples from the simulator. Interrogation of the expected output from the simulator is facilitated by use of (Gaussian process) emulators. The methodology is demonstrated on an optimization problem from public health involving the dispersion of anthrax spores across a spatial terrain. Environmental variables include meteorological features that impact the dispersion, and the methodology identifies the Pareto front even when there is considerable input uncertainty."}, "https://arxiv.org/abs/2410.19019": {"title": "Median Based Unit Weibull (MBUW): a new unit distribution Properties", "link": "https://arxiv.org/abs/2410.19019", "description": "arXiv:2410.19019v1 Announce Type: new \nAbstract: A new two-parameter unit Weibull distribution is defined on the unit interval (0,1). The methodology for deriving its PDF, some of its properties, and related functions are discussed. The paper is supplied with many figures illustrating the new distribution and how this can make it suitable for fitting a wide range of skewed data. The new distribution carries the nickname (Attia)."}, "https://arxiv.org/abs/2410.19028": {"title": "Enhancing Approximate Modular Bayesian Inference by Emulating the Conditional Posterior", "link": "https://arxiv.org/abs/2410.19028", "description": "arXiv:2410.19028v1 Announce Type: new \nAbstract: In modular Bayesian analyses, complex models are composed of distinct modules, each representing different aspects of the data or prior information. In this context, fully Bayesian approaches can sometimes lead to undesirable feedback between modules, compromising the integrity of the inference. This paper focuses on the \"cut-distribution\", which prevents unwanted influence between modules by \"cutting\" feedback. The multiple imputation (DS) algorithm is standard practice for approximating the cut-distribution, but it can be computationally intensive, especially when the number of imputations required is large. An enhanced method is proposed, the Emulating the Conditional Posterior (ECP) algorithm, which leverages emulation to increase the number of imputations. Through numerical experiments it is demonstrated that the ECP algorithm outperforms the traditional DS approach in terms of accuracy and computational efficiency, particularly when resources are constrained. It is also shown how the DS algorithm can be improved using ideas from design of experiments. This work also provides practical recommendations on algorithm choice based on the computational demands of sampling from the prior and cut-distributions."}, "https://arxiv.org/abs/2410.19031": {"title": "Model-free Variable Selection and Inference for High-dimensional Data", "link": "https://arxiv.org/abs/2410.19031", "description": "arXiv:2410.19031v1 Announce Type: new \nAbstract: Statistical inference is challenging in high-dimensional data analysis. Existing post-selection inference requires an explicitly specified regression model as well as sparsity in that model. The performance of such procedures can be poor under either misspecified nonlinear models or a violation of the sparsity assumption. In this paper, we propose a sufficient dimension association (SDA) technique that measures the association between each predictor and the response variable conditioning on other predictors. 
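To fix ideas for the cut-distribution entry above (arXiv:2410.19028), here is a toy multiple-imputation-style sampler for a conjugate two-module model. This is our illustrative sketch; in non-conjugate settings the second stage typically requires a separate MCMC run for each imputed value, which is why the number of imputations drives the computational cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-module toy model:
#   module 1:  z_i ~ N(phi, 1),          prior phi   ~ N(0, 100)
#   module 2:  w_j ~ N(phi + theta, 1),  prior theta ~ N(0, 100)
z = rng.normal(loc=1.0, size=30)         # informs phi
w = rng.normal(loc=1.5, size=5)          # informs theta given phi

def cut_posterior_draws(z, w, n_draws=5000, prior_var=100.0):
    """Cut distribution via multiple imputation: draw phi from its module-1
    posterior only (no feedback from w), then draw theta given each phi."""
    n, m = len(z), len(w)
    var_phi = 1.0 / (n + 1.0 / prior_var)            # conjugate normal update
    phi = rng.normal(var_phi * z.sum(), np.sqrt(var_phi), size=n_draws)
    var_theta = 1.0 / (m + 1.0 / prior_var)
    theta = rng.normal(var_theta * (w.sum() - m * phi), np.sqrt(var_theta))
    return phi, theta

phi, theta = cut_posterior_draws(z, w)
print("phi   mean/sd:", phi.mean(), phi.std())
print("theta mean/sd:", theta.mean(), theta.std())
```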
Our proposed SDA method requires neither a specific form of regression model nor sparsity in the regression. Alternatively, our method assumes normalized or Gaussian-distributed predictors with a Markov blanket property. We propose an estimator for the SDA and prove asymptotic properties for the estimator. For simultaneous hypothesis testing and variable selection, we construct test statistics based on the Kolmogorov-Smirnov principle and the Cram{\\\"e}r-von-Mises principle. A multiplier bootstrap approach is used for computing critical values and $p$-values. Extensive simulation studies have been conducted to show the validity and superiority of our SDA method. Gene expression data from the Alzheimer Disease Neuroimaging Initiative are used to demonstrate a real application."}, "https://arxiv.org/abs/2410.19060": {"title": "Heterogeneous Intertemporal Treatment Effects via Dynamic Panel Data Models", "link": "https://arxiv.org/abs/2410.19060", "description": "arXiv:2410.19060v1 Announce Type: new \nAbstract: We study the identification and estimation of heterogeneous, intertemporal treatment effects (TE) when potential outcomes depend on past treatments. First, applying a dynamic panel data model to observed outcomes, we show that instrument-based GMM estimators, such as Arellano and Bond (1991), converge to a non-convex (negatively weighted) aggregate of TE plus non-vanishing trends. We then provide restrictions on sequential exchangeability (SE) of treatment and TE heterogeneity that reduce the GMM estimand to a convex (positively weighted) aggregate of TE. Second, we introduce an adjusted inverse-propensity-weighted (IPW) estimator for a new notion of average treatment effect (ATE) over past observed treatments. Third, we show that when potential outcomes are generated by dynamic panel data models with homogeneous TE, such GMM estimators converge to causal parameters (even when SE is generically violated without conditioning on individual fixed effects). Finally, we motivate SE and compare it with parallel trends (PT) in various settings with observational data (when treatments are dynamic, rational choices under learning) or experimental data (when treatments are sequentially randomized)."}, "https://arxiv.org/abs/2410.19073": {"title": "Doubly Robust Nonparametric Efficient Estimation for Provider Evaluation", "link": "https://arxiv.org/abs/2410.19073", "description": "arXiv:2410.19073v1 Announce Type: new \nAbstract: Provider profiling has the goal of identifying healthcare providers with exceptional patient outcomes. When evaluating providers, adjustment is necessary to control for differences in case-mix between different providers. Direct and indirect standardization are two popular risk adjustment methods. In causal terms, direct standardization examines a counterfactual in which the entire target population is treated by one provider. Indirect standardization, commonly expressed as a standardized outcome ratio, examines the counterfactual in which the population treated by a provider had instead been randomly assigned to another provider. Our first contribution is to present nonparametric efficiency bound for direct and indirectly standardized provider metrics by deriving their efficient influence functions. Our second contribution is to propose fully nonparametric estimators based on targeted minimum loss-based estimation that achieve the efficiency bounds. The finite-sample performance of the estimator is investigated through simulation studies. 
We apply our methods to evaluate dialysis facilities in New York State in terms of unplanned readmission rates using a large Medicare claims dataset. A software implementation of our methods is available in the R package TargetedRisk."}, "https://arxiv.org/abs/2410.19140": {"title": "Enhancing Spatial Functional Linear Regression with Robust Dimension Reduction Methods", "link": "https://arxiv.org/abs/2410.19140", "description": "arXiv:2410.19140v1 Announce Type: new \nAbstract: This paper introduces a robust estimation strategy for the spatial functional linear regression model using dimension reduction methods, specifically functional principal component analysis (FPCA) and functional partial least squares (FPLS). These techniques are designed to address challenges associated with spatially correlated functional data, particularly the impact of outliers on parameter estimation. By projecting the infinite-dimensional functional predictor onto a finite-dimensional space defined by orthonormal basis functions and employing M-estimation to mitigate outlier effects, our approach improves the accuracy and reliability of parameter estimates in the spatial functional linear regression context. Simulation studies and empirical data analysis substantiate the effectiveness of our methods, while an appendix explores the Fisher consistency and influence function of the FPCA-based approach. The rfsac package in R implements these robust estimation strategies, ensuring practical applicability for researchers and practitioners."}, "https://arxiv.org/abs/2410.19154": {"title": "Cross Spline Net and a Unified World", "link": "https://arxiv.org/abs/2410.19154", "description": "arXiv:2410.19154v1 Announce Type: new \nAbstract: In today's machine learning world for tabular data, XGBoost and fully connected neural network (FCNN) are two most popular methods due to their good model performance and convenience to use. However, they are highly complicated, hard to interpret, and can be overfitted. In this paper, we propose a new modeling framework called cross spline net (CSN) that is based on a combination of spline transformation and cross-network (Wang et al. 2017, 2021). We will show CSN is as performant and convenient to use, and is less complicated, more interpretable and robust. Moreover, the CSN framework is flexible, as the spline layer can be configured differently to yield different models. With different choices of the spline layer, we can reproduce or approximate a set of non-neural network models, including linear and spline-based statistical models, tree, rule-fit, tree-ensembles (gradient boosting trees, random forest), oblique tree/forests, multi-variate adaptive regression spline (MARS), SVM with polynomial kernel, etc. Therefore, CSN provides a unified modeling framework that puts the above set of non-neural network models under the same neural network framework. By using scalable and powerful gradient descent algorithms available in neural network libraries, CSN avoids some pitfalls (such as being ad-hoc, greedy or non-scalable) in the case-specific optimization methods used in the above non-neural network models. We will use a special type of CSN, TreeNet, to illustrate our point. We will compare TreeNet with XGBoost and FCNN to show the benefits of TreeNet. 
We believe CSN will provide a flexible and convenient framework for practitioners to build performant, robust and more interpretable models."}, "https://arxiv.org/abs/2410.19190": {"title": "A novel longitudinal rank-sum test for multiple primary endpoints in clinical trials: Applications to neurodegenerative disorders", "link": "https://arxiv.org/abs/2410.19190", "description": "arXiv:2410.19190v1 Announce Type: new \nAbstract: Neurodegenerative disorders such as Alzheimer's disease (AD) present a significant global health challenge, characterized by cognitive decline, functional impairment, and other debilitating effects. Current AD clinical trials often assess multiple longitudinal primary endpoints to comprehensively evaluate treatment efficacy. Traditional methods, however, may fail to capture global treatment effects, require larger sample sizes due to multiplicity adjustments, and may not fully exploit multivariate longitudinal data. To address these limitations, we introduce the Longitudinal Rank Sum Test (LRST), a novel nonparametric rank-based omnibus test statistic. The LRST enables a comprehensive assessment of treatment efficacy across multiple endpoints and time points without multiplicity adjustments, effectively controlling Type I error while enhancing statistical power. It offers flexibility against various data distributions encountered in AD research and maximizes the utilization of longitudinal data. Extensive simulations and real-data applications demonstrate the LRST's performance, underscoring its potential as a valuable tool in AD clinical trials."}, "https://arxiv.org/abs/2410.19212": {"title": "Inference on Multiple Winners with Applications to Microcredit and Economic Mobility", "link": "https://arxiv.org/abs/2410.19212", "description": "arXiv:2410.19212v1 Announce Type: new \nAbstract: While policymakers and researchers are often concerned with conducting inference based on a data-dependent selection, a strictly larger class of inference problems arises when considering multiple data-dependent selections, such as when selecting on statistical significance or quantiles. Given this, we study the problem of conducting inference on multiple selections, which we dub the inference on multiple winners problem. In this setting, we encounter both selective and multiple testing problems, making existing approaches either not applicable or too conservative. Instead, we propose a novel, two-step approach to the inference on multiple winners problem, with the first step modeling the selection of winners, and the second step using this model to conduct inference only on the set of likely winners. Our two-step approach reduces over-coverage error by up to 96%. We apply our two-step approach to revisit the winner's curse in the creating moves to opportunity (CMTO) program, and to study external validity issues in the microcredit literature. In the CMTO application, we find that, after correcting for the inference on multiple winners problem, we fail to reject the possibility of null effects in the majority of census tracts selected by the CMTO program. In our microcredit application, we find that heterogeneity in treatment effect estimates remains largely unaffected even after our proposed inference corrections."}, "https://arxiv.org/abs/2410.19226": {"title": "Deep Transformation Model", "link": "https://arxiv.org/abs/2410.19226", "description": "arXiv:2410.19226v1 Announce Type: new \nAbstract: There has been a significant recent surge in deep neural network (DNN) techniques. 
Most of the existing DNN techniques have restricted model formats/assumptions. To overcome their limitations, we propose the nonparametric transformation model, which encompasses many popular models as special cases and hence is less sensitive to model mis-specification. This model also has the potential to accommodate heavy-tailed errors, a robustness property not broadly shared. Accordingly, a new loss function, which fundamentally differs from the existing ones, is developed. For computational feasibility, we further develop a double rectified linear unit (DReLU)-based estimator. To accommodate the scenario with a diverging number of input variables and/or noises, we propose variable selection based on group penalization. We further expand the scope to coherently accommodate censored survival data. The estimation and variable selection properties are rigorously established. Extensive numerical studies, including simulations and data analyses, establish the satisfactory practical utility of the proposed methods."}, "https://arxiv.org/abs/2410.19271": {"title": "Feed-Forward Panel Estimation for Discrete-time Survival Analysis of Recurrent Events with Frailty", "link": "https://arxiv.org/abs/2410.19271", "description": "arXiv:2410.19271v1 Announce Type: new \nAbstract: In recurrent survival analysis where the event of interest can occur multiple times for each subject, frailty models play a crucial role by capturing unobserved heterogeneity at the subject level within a population. Frailty models traditionally face challenges due to the lack of a closed-form solution for the maximum likelihood estimation that is unconditional on frailty. In this paper, we propose a novel method: Feed-Forward Panel estimation for discrete-time Survival Analysis (FFPSurv). Our model uses variational Bayesian inference to sequentially update the posterior distribution of frailty as recurrent events are observed, and derives a closed form for the panel likelihood, effectively addressing the limitation of existing frailty models. We demonstrate the efficacy of our method through extensive experiments on numerical examples and real-world recurrent survival data. Furthermore, we mathematically prove that our model is identifiable under minor assumptions."}, "https://arxiv.org/abs/2410.19407": {"title": "Insights into regression-based cross-temporal forecast reconciliation", "link": "https://arxiv.org/abs/2410.19407", "description": "arXiv:2410.19407v1 Announce Type: new \nAbstract: Cross-temporal forecast reconciliation aims to ensure consistency across forecasts made at different temporal and cross-sectional levels. We explore the relationships between sequential, iterative, and optimal combination approaches, and discuss the conditions under which a sequential reconciliation approach (either first-cross-sectional-then-temporal, or first-temporal-then-cross-sectional) is equivalent to a fully (i.e., cross-temporally) coherent iterative heuristic. Furthermore, we show that for specific patterns of the error covariance matrix in the regression model on which the optimal combination approach is grounded, iterative reconciliation naturally converges to the optimal combination solution, regardless of the order of application of the uni-dimensional cross-sectional and temporal reconciliation approaches. Theoretical and empirical properties of the proposed approaches are investigated through a forecasting experiment using a dataset of hourly photovoltaic power generation. 
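As background for the cross-temporal reconciliation entry above (arXiv:2410.19407), the basic regression-based (optimal combination) step can be sketched for a single cross-sectional hierarchy. The identity weight matrix below corresponds to OLS reconciliation and is a simplification of the covariance patterns the paper analyses.

```python
import numpy as np

# Hierarchy: total = A + B.  The summing matrix maps bottom series to all series.
S = np.array([[1.0, 1.0],    # total
              [1.0, 0.0],    # A
              [0.0, 1.0]])   # B

# Incoherent base forecasts for (total, A, B): 100 != 55 + 40
y_hat = np.array([100.0, 55.0, 40.0])

def reconcile(y_hat, S, W=None):
    """Projection-based reconciliation y_tilde = S (S'W^-1 S)^-1 S'W^-1 y_hat.
    W = identity gives OLS reconciliation; other choices of W yield WLS/MinT-type
    variants based on the forecast-error covariance."""
    W = np.eye(len(y_hat)) if W is None else W
    Wi = np.linalg.inv(W)
    P = np.linalg.solve(S.T @ Wi @ S, S.T @ Wi)
    return S @ (P @ y_hat)

y_tilde = reconcile(y_hat, S)
print("reconciled forecasts:", y_tilde)
print("coherence check:", y_tilde[0], "==", y_tilde[1] + y_tilde[2])
```

Cross-temporal reconciliation applies this kind of projection across both the cross-sectional and the temporal aggregation structures, which is where the sequential and iterative variants discussed in the entry come in.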
The study presents a comprehensive framework for understanding and enhancing cross-temporal forecast reconciliation, considering both forecast accuracy and the often overlooked computational aspects, showing that significant improvement can be achieved in terms of memory space and computation time, two particularly important aspects in the high-dimensional contexts that usually arise in cross-temporal forecast reconciliation."}, "https://arxiv.org/abs/2410.19469": {"title": "Unified Causality Analysis Based on the Degrees of Freedom", "link": "https://arxiv.org/abs/2410.19469", "description": "arXiv:2410.19469v1 Announce Type: new \nAbstract: Temporally evolving systems are typically modeled by dynamic equations. A key challenge in accurate modeling is understanding the causal relationships between subsystems, as well as identifying the presence and influence of unobserved hidden drivers on the observed dynamics. This paper presents a unified method capable of identifying fundamental causal relationships between pairs of systems, whether deterministic or stochastic. Notably, the method also uncovers hidden common causes beyond the observed variables. By analyzing the degrees of freedom in the system, our approach provides a more comprehensive understanding of both causal influence and hidden confounders. This unified framework is validated through theoretical models and simulations, demonstrating its robustness and potential for broader application."}, "https://arxiv.org/abs/2410.19523": {"title": "OCEAN: Flexible Feature Set Aggregation for Analysis of Multi-omics Data", "link": "https://arxiv.org/abs/2410.19523", "description": "arXiv:2410.19523v1 Announce Type: new \nAbstract: Integrated analysis of multi-omics datasets holds great promise for uncovering complex biological processes. However, the large dimension of omics data poses significant interpretability and multiple testing challenges. Simultaneous Enrichment Analysis (SEA) was introduced to address these issues in single-omics analysis, providing an in-built multiple testing correction and enabling simultaneous feature set testing. In this paper, we introduce OCEAN, an extension of SEA to multi-omics data. OCEAN is a flexible approach to analyze potentially all possible two-way feature sets from any pair of genomics datasets. We also propose two new error rates which are in line with the two-way structure of the data and facilitate interpretation of the results. The power and utility of OCEAN is demonstrated by analyzing copy number and gene expression data for breast and colon cancer."}, "https://arxiv.org/abs/2410.19682": {"title": "trajmsm: An R package for Trajectory Analysis and Causal Modeling", "link": "https://arxiv.org/abs/2410.19682", "description": "arXiv:2410.19682v1 Announce Type: new \nAbstract: The R package trajmsm provides functions designed to simplify the estimation of the parameters of a model combining latent class growth analysis (LCGA), a trajectory analysis technique, and marginal structural models (MSMs) called LCGA-MSM. LCGA summarizes similar patterns of change over time into a few distinct categories called trajectory groups, which are then included as \"treatments\" in the MSM. MSMs are a class of causal models that correctly handle treatment-confounder feedback. The parameters of LCGA-MSMs can be consistently estimated using different estimators, such as inverse probability weighting (IPW), g-computation, and pooled longitudinal targeted maximum likelihood estimation (pooled LTMLE). 
These three estimators of the parameters of LCGA-MSMs are currently implemented in our package. In the context of a time-dependent outcome, we previously proposed a combination of LCGA and history-restricted MSMs (LCGA-HRMSMs). Our package provides additional functions to estimate the parameters of such models. Version 0.1.3 of the package is currently available on CRAN."}, "https://arxiv.org/abs/2410.18993": {"title": "Deterministic Fokker-Planck Transport -- With Applications to Sampling, Variational Inference, Kernel Mean Embeddings & Sequential Monte Carlo", "link": "https://arxiv.org/abs/2410.18993", "description": "arXiv:2410.18993v1 Announce Type: cross \nAbstract: The Fokker-Planck equation can be reformulated as a continuity equation, which naturally suggests using the associated velocity field in particle flow methods. While the resulting probability flow ODE offers appealing properties - such as defining a gradient flow of the Kullback-Leibler divergence between the current and target densities with respect to the 2-Wasserstein distance - it relies on evaluating the current probability density, which is intractable in most practical applications. By closely examining the drawbacks of approximating this density via kernel density estimation, we uncover opportunities to turn these limitations into advantages in contexts such as variational inference, kernel mean embeddings, and sequential Monte Carlo."}, "https://arxiv.org/abs/2410.19125": {"title": "A spectral method for multi-view subspace learning using the product of projections", "link": "https://arxiv.org/abs/2410.19125", "description": "arXiv:2410.19125v1 Announce Type: cross \nAbstract: Multi-view data provides complementary information on the same set of observations, with multi-omics and multimodal sensor data being common examples. Analyzing such data typically requires distinguishing between shared (joint) and unique (individual) signal subspaces from noisy, high-dimensional measurements. Despite many proposed methods, the conditions for reliably identifying joint and individual subspaces remain unclear. We rigorously quantify these conditions, which depend on the ratio of the signal rank to the ambient dimension, principal angles between true subspaces, and noise levels. Our approach characterizes how spectrum perturbations of the product of projection matrices, derived from each view's estimated subspaces, affect subspace separation. Using these insights, we provide an easy-to-use and scalable estimation algorithm. In particular, we employ rotational bootstrap and random matrix theory to partition the observed spectrum into joint, individual, and noise subspaces. Diagnostic plots visualize this partitioning, providing practical and interpretable insights into the estimation performance. In simulations, our method estimates joint and individual subspaces more accurately than existing approaches. Applications to multi-omics data from colorectal cancer patients and nutrigenomic study of mice demonstrate improved performance in downstream predictive tasks."}, "https://arxiv.org/abs/2410.19300": {"title": "Golden Ratio-Based Sufficient Dimension Reduction", "link": "https://arxiv.org/abs/2410.19300", "description": "arXiv:2410.19300v1 Announce Type: cross \nAbstract: Many machine learning applications deal with high dimensional data. 
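A minimal sketch of the quantity at the core of "A spectral method for multi-view subspace learning using the product of projections" above: the nonzero eigenvalues of the product of the two views' projection matrices, which equal the squared cosines of the principal angles between the estimated signal subspaces. The simulated data, chosen ranks, and informal reading of the spectrum are assumptions for illustration; the paper partitions the spectrum using rotational bootstrap and random matrix theory rather than an eyeballed cutoff.

```python
import numpy as np

def principal_angle_spectrum(X1, X2, r1, r2):
    """Spectrum of P1 @ P2 for the rank-r1 and rank-r2 score subspaces of two views.

    X1, X2 : (n, p1), (n, p2) data matrices sharing the same n observations.
    Returns the squared cosines of the principal angles between the estimated
    column (score) subspaces; values near 1 suggest joint signal directions.
    """
    U1 = np.linalg.svd(X1, full_matrices=False)[0][:, :r1]
    U2 = np.linalg.svd(X2, full_matrices=False)[0][:, :r2]
    # Nonzero eigenvalues of P1 P2 equal the squared singular values of U1' U2
    return np.linalg.svd(U1.T @ U2, compute_uv=False) ** 2

rng = np.random.default_rng(0)
n = 200
joint = rng.normal(size=(n, 1))                         # shared latent direction
X1 = (joint @ rng.normal(size=(1, 30))
      + rng.normal(size=(n, 2)) @ rng.normal(size=(2, 30))
      + 0.3 * rng.normal(size=(n, 30)))
X2 = (joint @ rng.normal(size=(1, 40))
      + rng.normal(size=(n, 2)) @ rng.normal(size=(2, 40))
      + 0.3 * rng.normal(size=(n, 40)))
spec = principal_angle_spectrum(X1, X2, r1=3, r2=3)
print(spec)   # one value close to 1 (joint), the others noticeably smaller (individual)
```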
To make computations feasible and learning more efficient, it is often desirable to reduce the dimensionality of the input variables by finding linear combinations of the predictors that can retain as much original information as possible in the relationship between the response and the original predictors. We propose a neural network based sufficient dimension reduction method that not only identifies the structural dimension effectively, but also estimates the central space well. It takes advantage of the approximation capabilities of neural networks for functions in Barron classes and leads to reduced computation cost compared to other dimension reduction methods in the literature. Additionally, the framework can be extended to practical dimension reduction, making the methodology more applicable in real-world settings."}, "https://arxiv.org/abs/2410.19412": {"title": "Robust Time Series Causal Discovery for Agent-Based Model Validation", "link": "https://arxiv.org/abs/2410.19412", "description": "arXiv:2410.19412v1 Announce Type: cross \nAbstract: Agent-Based Model (ABM) validation is crucial as it helps ensure the reliability of simulations, and causal discovery has become a powerful tool in this context. However, current causal discovery methods often face accuracy and robustness challenges when applied to complex and noisy time series data, which is typical in ABM scenarios. This study addresses these issues by proposing a Robust Cross-Validation (RCV) approach to enhance causal structure learning for ABM validation. We develop RCV-VarLiNGAM and RCV-PCMCI, novel extensions of two prominent causal discovery algorithms. These aim to better reduce the impact of noise and give more reliable causal relations, even with high-dimensional, time-dependent data. The proposed approach is then integrated into an enhanced ABM validation framework, which is designed to handle diverse data and model structures.\n The approach is evaluated using synthetic datasets and a complex simulated fMRI dataset. The results demonstrate greater reliability in causal structure identification. The study examines how various characteristics of datasets affect the performance of established causal discovery methods. These characteristics include linearity, noise distribution, stationarity, and causal structure density. This analysis is then extended to the RCV method to see how it compares in these different situations. This examination helps confirm whether the results are consistent with existing literature and also reveals the strengths and weaknesses of the novel approaches.\n By tackling key methodological challenges, the study aims to enhance ABM validation by presenting a more resilient validation framework. These improvements increase the reliability of model-driven decision-making processes in complex systems analysis."}, "https://arxiv.org/abs/2410.19596": {"title": "On the robustness of semi-discrete optimal transport", "link": "https://arxiv.org/abs/2410.19596", "description": "arXiv:2410.19596v1 Announce Type: cross \nAbstract: We derive the breakdown point for solutions of semi-discrete optimal transport problems, which characterizes the robustness of the multivariate quantiles based on optimal transport proposed in Ghosal and Sen (2022). We do so under very mild assumptions: the absolutely continuous reference measure is only assumed to have a support that is compact and convex, whereas the target measure is a general discrete measure on a finite number, $n$ say, of atoms.
The breakdown point depends on the target measure only through its probability weights (hence not on the location of the atoms) and involves the geometry of the reference measure through the Tukey (1975) concept of halfspace depth. Remarkably, depending on this geometry, the breakdown point of the optimal transport median can be strictly smaller than the breakdown point of the univariate median or the breakdown point of the spatial median, namely $\\lceil n/2\\rceil /2$. In the context of robust location estimation, our results provide a subtle insight into how to perform multivariate trimming when constructing trimmed means based on optimal transport."}, "https://arxiv.org/abs/2204.11979": {"title": "Semi-Parametric Sensitivity Analysis for Trials with Irregular and Informative Assessment Times", "link": "https://arxiv.org/abs/2204.11979", "description": "arXiv:2204.11979v4 Announce Type: replace \nAbstract: Many trials are designed to collect outcomes at or around pre-specified times after randomization. If there is variability in the times when participants are actually assessed, this can pose a challenge to learning the effect of treatment, since not all participants have outcome assessments at the times of interest. Furthermore, observed outcome values may not be representative of all participants' outcomes at a given time. Methods have been developed that account for some types of such irregular and informative assessment times; however, since these methods rely on untestable assumptions, sensitivity analyses are needed. We develop a methodology that is benchmarked at the explainable assessment (EA) assumption, under which assessment and outcomes at each time are related only through data collected prior to that time. Our method uses an exponential tilting assumption, governed by a sensitivity analysis parameter, that posits deviations from the EA assumption. Our inferential strategy is based on a new influence function-based, augmented inverse intensity-weighted estimator. Our approach allows for flexible semiparametric modeling of the observed data, which is separated from specification of the sensitivity parameter. We apply our method to a randomized trial of low-income individuals with uncontrolled asthma, and we illustrate implementation of our estimation procedure in detail."}, "https://arxiv.org/abs/2312.05858": {"title": "Causal inference and policy evaluation without a control group", "link": "https://arxiv.org/abs/2312.05858", "description": "arXiv:2312.05858v2 Announce Type: replace \nAbstract: Without a control group, the most widespread methodologies for estimating causal effects cannot be applied. To fill this gap, we propose the Machine Learning Control Method, a new approach for causal panel analysis that estimates causal parameters without relying on untreated units. We formalize identification within the potential outcomes framework and then provide estimation based on machine learning algorithms. To illustrate the practical relevance of our method, we present simulation evidence, a replication study, and an empirical application on the impact of the COVID-19 crisis on educational inequality.
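Since the breakdown-point result in "On the robustness of semi-discrete optimal transport" above is expressed through Tukey halfspace depth, a minimal Monte Carlo approximation of that depth may help fix ideas; the random-direction search below is an assumption for illustration and only upper-bounds the exact depth, for which dedicated low-dimensional algorithms exist.

```python
import numpy as np

def halfspace_depth(x, sample, n_dir=2000, rng=None):
    """Monte Carlo approximation of the Tukey (1975) halfspace depth of x.

    depth(x) = min over unit directions u of P_n( u'(X - x) >= 0 );
    sampling random directions gives an upward-biased approximation.
    """
    rng = np.random.default_rng(rng)
    d = sample.shape[1]
    U = rng.normal(size=(n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = (sample - x) @ U.T              # (n, n_dir) signed distances per direction
    frac = (proj >= 0).mean(axis=0)        # empirical halfspace mass for each direction
    return frac.min()

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))
print(halfspace_depth(np.zeros(2), X))           # close to 0.5 at the centre
print(halfspace_depth(np.array([3.0, 0.0]), X))  # small in the tail
```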
We implement the proposed approach in the companion R package MachineControl"}, "https://arxiv.org/abs/2312.14013": {"title": "Two-Stage Pseudo Maximum Likelihood Estimation of Semiparametric Copula-based Regression Models for Semi-Competing Risks Data", "link": "https://arxiv.org/abs/2312.14013", "description": "arXiv:2312.14013v2 Announce Type: replace \nAbstract: We propose a two-stage estimation procedure for a copula-based model with semi-competing risks data, where the non-terminal event is subject to dependent censoring by the terminal event, and both events are subject to independent censoring. With a copula-based model, the marginal survival functions of individual event times are specified by semiparametric transformation models, and the dependence between the bivariate event times is specified by a parametric copula function. For the estimation procedure, in the first stage, the parameters associated with the marginal of the terminal event are estimated using only the corresponding observed outcomes, and in the second stage, the marginal parameters for the non-terminal event time and the copula parameter are estimated together via maximizing a pseudo-likelihood function based on the joint distribution of the bivariate event times. We derived the asymptotic properties of the proposed estimator and provided an analytic variance estimator for inference. Through simulation studies, we showed that our approach leads to consistent estimates with less computational cost and more robustness than the one-stage procedure developed in Chen (2012), where all parameters were estimated simultaneously. In addition, our approach demonstrates more desirable finite-sample performances over another existing two-stage estimation method proposed in Zhu et al. (2021). An R package PMLE4SCR is developed to implement our proposed method."}, "https://arxiv.org/abs/2401.13179": {"title": "Realized Stochastic Volatility Model with Skew-t Distributions for Improved Volatility and Quantile Forecasting", "link": "https://arxiv.org/abs/2401.13179", "description": "arXiv:2401.13179v2 Announce Type: replace \nAbstract: Forecasting volatility and quantiles of financial returns is essential for accurately measuring financial tail risks, such as value-at-risk and expected shortfall. The critical elements in these forecasts involve understanding the distribution of financial returns and accurately estimating volatility. This paper introduces an advancement to the traditional stochastic volatility model, termed the realized stochastic volatility model, which integrates realized volatility as a precise estimator of volatility. To capture the well-known characteristics of return distribution, namely skewness and heavy tails, we incorporate three types of skew-t distributions. Among these, two distributions include the skew-normal feature, offering enhanced flexibility in modeling the return distribution. We employ a Bayesian estimation approach using the Markov chain Monte Carlo method and apply it to major stock indices. 
Our empirical analysis, utilizing data from US and Japanese stock indices, indicates that the inclusion of both skewness and heavy tails in daily returns significantly improves the accuracy of volatility and quantile forecasts."}, "https://arxiv.org/abs/2207.10464": {"title": "Rate-optimal estimation of mixed semimartingales", "link": "https://arxiv.org/abs/2207.10464", "description": "arXiv:2207.10464v2 Announce Type: replace-cross \nAbstract: Consider the sum $Y=B+B(H)$ of a Brownian motion $B$ and an independent fractional Brownian motion $B(H)$ with Hurst parameter $H\\in(0,1)$. Even though $B(H)$ is not a semimartingale, it was shown in [\\textit{Bernoulli} \\textbf{7} (2001) 913--934] that $Y$ is a semimartingale if $H>3/4$. Moreover, $Y$ is locally equivalent to $B$ in this case, so $H$ cannot be consistently estimated from local observations of $Y$. This paper pivots on another unexpected feature in this model: if $B$ and $B(H)$ become correlated, then $Y$ will never be a semimartingale, and $H$ can be identified, regardless of its value. This and other results will follow from a detailed statistical analysis of a more general class of processes called \\emph{mixed semimartingales}, which are semiparametric extensions of $Y$ with stochastic volatility in both the martingale and the fractional component. In particular, we derive consistent estimators and feasible central limit theorems for all parameters and processes that can be identified from high-frequency observations. We further show that our estimators achieve optimal rates in a minimax sense."}, "https://arxiv.org/abs/2410.19947": {"title": "Testing the effects of an unobservable factor: Do marriage prospects affect college major choice?", "link": "https://arxiv.org/abs/2410.19947", "description": "arXiv:2410.19947v1 Announce Type: new \nAbstract: Motivated by studying the effects of marriage prospects on students' college major choices, this paper develops a new econometric test for analyzing the effects of an unobservable factor in a setting where this factor potentially influences both agents' decisions and a binary outcome variable. Our test is built upon a flexible copula-based estimation procedure and leverages the ordered nature of latent utilities of the polychotomous choice model. Using the proposed method, we demonstrate that marriage prospects significantly influence the college major choices of college graduates participating in the National Longitudinal Study of Youth (97) Survey. Furthermore, we validate the robustness of our findings with alternative tests that use stated marriage expectation measures from our data, thereby demonstrating the applicability and validity of our testing procedure in real-life scenarios."}, "https://arxiv.org/abs/2410.20029": {"title": "Jacobian-free Efficient Pseudo-Likelihood (EPL) Algorithm", "link": "https://arxiv.org/abs/2410.20029", "description": "arXiv:2410.20029v1 Announce Type: new \nAbstract: This study proposes a simple procedure to compute Efficient Pseudo Likelihood (EPL) estimator proposed by Dearing and Blevins (2024) for estimating dynamic discrete games, without computing Jacobians of equilibrium constraints. EPL estimator is efficient, convergent, and computationally fast. However, the original algorithm requires deriving and coding the Jacobians, which are cumbersome and prone to coding mistakes especially when considering complicated models. 
The current study proposes to avoid the computation of Jacobians by combining the ideas of numerical derivatives (for computing Jacobian-vector products) and the Krylov method (for solving linear equations). It shows good computational performance of the proposed method by numerical experiments."}, "https://arxiv.org/abs/2410.20045": {"title": "Accurate Inference for Penalized Logistic Regression", "link": "https://arxiv.org/abs/2410.20045", "description": "arXiv:2410.20045v1 Announce Type: new \nAbstract: Inference for high-dimensional logistic regression models using penalized methods has been a challenging research problem. As an illustration, a major difficulty is the significant bias of the Lasso estimator, which limits its direct application in inference. Although various bias corrected Lasso estimators have been proposed, they often still exhibit substantial biases in finite samples, undermining their inference performance. These finite sample biases become particularly problematic in one-sided inference problems, such as one-sided hypothesis testing. This paper proposes a novel two-step procedure for accurate inference in high-dimensional logistic regression models. In the first step, we propose a Lasso-based variable selection method to select a suitable submodel of moderate size for subsequent inference. In the second step, we introduce a bias corrected estimator to fit the selected submodel. We demonstrate that the resulting estimator from this two-step procedure has a small bias order and enables accurate inference. Numerical studies and an analysis of alcohol consumption data are included, where our proposed method is compared to alternative approaches. Our results indicate that the proposed method exhibits significantly smaller biases than alternative methods in finite samples, thereby leading to improved inference performance."}, "https://arxiv.org/abs/2410.20138": {"title": "Functional Mixture Regression Control Chart", "link": "https://arxiv.org/abs/2410.20138", "description": "arXiv:2410.20138v1 Announce Type: new \nAbstract: Industrial applications often exhibit multiple in-control patterns due to varying operating conditions, which makes a single functional linear model (FLM) inadequate to capture the complexity of the true relationship between a functional quality characteristic and covariates, which gives rise to the multimode profile monitoring problem. This issue is clearly illustrated in the resistance spot welding (RSW) process in the automotive industry, where different operating conditions lead to multiple in-control states. In these states, factors such as electrode tip wear and dressing may influence the functional quality characteristic differently, resulting in distinct FLMs across subpopulations. To address this problem, this article introduces the functional mixture regression control chart (FMRCC) to monitor functional quality characteristics with multiple in-control patterns and covariate information, modeled using a mixture of FLMs. A monitoring strategy based on the likelihood ratio test is proposed to monitor any deviation from the estimated in-control heterogeneous population. 
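A minimal sketch of the Jacobian-free idea described in "Jacobian-free Efficient Pseudo-Likelihood (EPL) Algorithm" above: replace explicit Jacobians with finite-difference Jacobian-vector products and hand the resulting linear operator to a Krylov solver. The nonlinear system below is a generic stand-in, not the EPL equilibrium constraints, and the step size and tolerances are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def jvp(F, x, v, eps=1e-7):
    """Finite-difference Jacobian-vector product J_F(x) @ v, without forming J."""
    return (F(x + eps * v) - F(x)) / eps

def newton_krylov_step(F, x):
    """One Newton step x + delta with J(x) delta = -F(x), solved matrix-free by GMRES."""
    n = x.size
    J_op = LinearOperator((n, n), matvec=lambda v: jvp(F, x, v))
    delta, info = gmres(J_op, -F(x))
    return x + delta

# Stand-in nonlinear system (not the EPL constraints): F(x) = x - 0.5*tanh(Ax) - b
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50)) / 10
b = rng.normal(size=50)
F = lambda x: x - 0.5 * np.tanh(A @ x) - b

x = np.zeros(50)
for _ in range(10):
    x = newton_krylov_step(F, x)
print(np.linalg.norm(F(x)))   # residual should be small after a few iterations
```

The point of the combination is that only evaluations of F are needed: the Krylov solver never asks for the Jacobian itself, only for its action on vectors, which the finite difference supplies.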
An extensive Monte Carlo simulation study is performed to compare the FMRCC with competing monitoring schemes that have already appeared in the literature, and a case study in the monitoring of an RSW process in the automotive industry, which motivated this research, illustrates its practical applicability."}, "https://arxiv.org/abs/2410.20169": {"title": "Robust Bayes-assisted Confidence Regions", "link": "https://arxiv.org/abs/2410.20169", "description": "arXiv:2410.20169v1 Announce Type: new \nAbstract: The Frequentist, Assisted by Bayes (FAB) framework aims to construct confidence regions that leverage information about parameter values in the form of a prior distribution. FAB confidence regions (FAB-CRs) have smaller volume for values of the parameter that are likely under the prior, while maintaining exact frequentist coverage. This work introduces several methodological and theoretical contributions to the FAB framework. For Gaussian likelihoods, we show that the posterior mean of the parameter of interest is always contained in the FAB-CR. As such, the posterior mean constitutes a natural notion of FAB estimator to be reported alongside the FAB-CR. More generally, we show that for a likelihood in the natural exponential family, a transformation of the posterior mean of the natural parameter is always contained in the FAB-CR. For Gaussian likelihoods, we show that power law tails conditions on the marginal likelihood induce robust FAB-CRs, that are uniformly bounded and revert to the standard frequentist confidence intervals for extreme observations. We translate this result into practice by proposing a class of shrinkage priors for the FAB framework that satisfy this condition without sacrificing analytical tractability. The resulting FAB estimators are equal to prominent Bayesian shrinkage estimators, including the horseshoe estimator, thereby establishing insightful connections between robust FAB-CRs and Bayesian shrinkage methods."}, "https://arxiv.org/abs/2410.20208": {"title": "scpQCA: Enhancing mvQCA Applications through Set-Covering-Based QCA Method", "link": "https://arxiv.org/abs/2410.20208", "description": "arXiv:2410.20208v1 Announce Type: new \nAbstract: In fields such as sociology, political science, public administration, and business management, particularly in the direction of international relations, Qualitative Comparative Analysis (QCA) has been widely adopted as a research method. This article addresses the limitations of the QCA method in its application, specifically in terms of low coverage, factor limitations, and value limitations. scpQCA enhances the coverage of results and expands the tolerance of the QCA method for multi-factor and multi-valued analyses by maintaining the consistency threshold. To validate these capabilities, we conducted experiments on both random data and specific case datasets, utilizing different approaches of CCM (Configurational Comparative Methods) such as scpQCA, CNA, and QCApro, and presented the different results. 
In addition, the robustness of scpQCA has been examined from both internal and external perspectives across different case datasets, thereby demonstrating its extensive applicability and advantages over existing QCA algorithms."}, "https://arxiv.org/abs/2410.20319": {"title": "High-dimensional partial linear model with trend filtering", "link": "https://arxiv.org/abs/2410.20319", "description": "arXiv:2410.20319v1 Announce Type: new \nAbstract: We study the high-dimensional partial linear model, where the linear part has a high-dimensional sparse regression coefficient and the nonparametric part includes a function whose derivatives are of bounded total variation. We expand upon univariate trend filtering to develop partial linear trend filtering--a doubly penalized least squares estimation approach based on $\\ell_1$ penalty and total variation penalty. Analogous to the advantages of trend filtering in univariate nonparametric regression, partial linear trend filtering not only can be efficiently computed, but also achieves the optimal error rate for estimating the nonparametric function. This in turn leads to the oracle rate for the linear part as if the underlying nonparametric function were known. We compare the proposed approach with a standard smoothing spline based method, and show both empirically and theoretically that the former outperforms the latter when the underlying function possesses heterogeneous smoothness. We apply our approach to the IDATA study to investigate the relationship between metabolomic profiles and ultra-processed food (UPF) intake, efficiently identifying key metabolites associated with UPF consumption and demonstrating strong predictive performance."}, "https://arxiv.org/abs/2410.20443": {"title": "A Robust Topological Framework for Detecting Regime Changes in Multi-Trial Experiments with Application to Predictive Maintenance", "link": "https://arxiv.org/abs/2410.20443", "description": "arXiv:2410.20443v1 Announce Type: new \nAbstract: We present a general and flexible framework for detecting regime changes in complex, non-stationary data across multi-trial experiments. Traditional change point detection methods focus on identifying abrupt changes within a single time series (single trial), targeting shifts in statistical properties such as the mean, variance, and spectrum over time within that sole trial. In contrast, our approach considers changes occurring across trials, accommodating changes that may arise within individual trials due to experimental inconsistencies, such as varying delays or event durations. By leveraging diverse metrics to analyze time-frequency characteristics, specifically topological changes in the spectrum and spectrograms, our approach offers a comprehensive framework for detecting such variations. Our approach can handle different scenarios with various statistical assumptions, including varying levels of stationarity within and across trials, making our framework highly adaptable. We validate our approach through simulations using time-varying autoregressive processes that exhibit different regime changes. Our results demonstrate the effectiveness of detecting changes across trials under diverse conditions. Furthermore, we illustrate the effectiveness of our method by applying it to predictive maintenance using the NASA bearing dataset.
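A minimal sketch of the doubly penalized least-squares criterion in "High-dimensional partial linear model with trend filtering" above, assuming the observations are ordered by the scalar covariate and using a generic convex solver; the tuning parameters and data-generating process are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def diff_matrix(n, order):
    """Discrete difference operator of the given order, with n - order rows."""
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)
    return D

rng = np.random.default_rng(0)
n, p = 200, 50
t = np.linspace(0, 1, n)                       # scalar covariate, already sorted
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
f_true = np.sin(4 * np.pi * t) * (t > 0.5)     # function with heterogeneous smoothness
y = X @ beta_true + f_true + 0.3 * rng.normal(size=n)

beta = cp.Variable(p)
theta = cp.Variable(n)                         # f evaluated at the sorted design points
D2 = diff_matrix(n, order=2)                   # penalizes deviations from piecewise linearity
lam1, lam2 = 0.5, 1.0
obj = (0.5 * cp.sum_squares(y - X @ beta - theta)
       + lam1 * cp.norm1(beta)                 # lasso penalty on the linear part
       + lam2 * cp.norm1(D2 @ theta))          # total variation penalty on the nonparametric part
cp.Problem(cp.Minimize(obj)).solve()
print(np.round(beta.value[:5], 2))             # sparse estimate of the linear coefficients
```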
By analyzing the time-frequency characteristics of vibration signals recorded by accelerometers, our approach accurately identifies bearing failures, showcasing its strong potential for early fault detection in mechanical systems."}, "https://arxiv.org/abs/2410.20628": {"title": "International vulnerability of inflation", "link": "https://arxiv.org/abs/2410.20628", "description": "arXiv:2410.20628v1 Announce Type: new \nAbstract: In a globalised world, inflation in a given country may be becoming less responsive to domestic economic activity, while being increasingly determined by international conditions. Consequently, understanding the international sources of vulnerability of domestic inflation is becoming fundamental for policy makers. In this paper, we propose the construction of Inflation-at-risk and Deflation-at-risk measures of vulnerability obtained using factor-augmented quantile regressions estimated with international factors extracted from a multi-level Dynamic Factor Model with overlapping blocks of inflation rates corresponding to economies grouped either in a given geographical region or according to their development level. The methodology is applied to inflation observed monthly from 1999 to 2022 for over 115 countries. We conclude that, in a large number of developed countries, international factors are relevant to explain the right tail of the distribution of inflation, and, consequently, they are more relevant for the vulnerability related to high inflation than for average or low inflation. However, while inflation of developing low-income countries is hardly affected by international conditions, the results for middle-income countries are mixed. Finally, based on a rolling-window out-of-sample forecasting exercise, we show that the predictive power of international factors has increased in the most recent years of high inflation."}, "https://arxiv.org/abs/2410.20641": {"title": "The Curious Problem of the Normal Inverse Mean", "link": "https://arxiv.org/abs/2410.20641", "description": "arXiv:2410.20641v1 Announce Type: new \nAbstract: In astronomical observations, the estimation of distances from parallaxes is a challenging task due to the inherent measurement errors and the non-linear relationship between the parallax and the distance. This study leverages ideas from robust Bayesian inference to tackle these challenges, investigating a broad class of prior densities for estimating distances with a reduced bias and variance. Through theoretical analysis, simulation experiments, and the application to data from the Gaia Data Release 1 (GDR1), we demonstrate that heavy-tailed priors provide more reliable distance estimates, particularly in the presence of large fractional parallax errors. Theoretical results highlight the \"curse of a single observation,\" where the likelihood dominates the posterior, limiting the impact of the prior. Nevertheless, heavy-tailed priors can delay the explosion of posterior risk, offering a more robust framework for distance estimation.
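A minimal sketch of the factor-augmented quantile regression step behind the Inflation-at-risk and Deflation-at-risk measures in "International vulnerability of inflation" above, assuming simulated data and plain principal-component factors rather than the paper's multi-level dynamic factor model with overlapping blocks.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T, N = 300, 60
global_factor = 0.1 * np.cumsum(rng.normal(size=T))           # persistent international factor
panel = global_factor[:, None] * rng.uniform(0.5, 1.5, N) + rng.normal(size=(T, N))

# Static principal-component factors from the standardized inflation panel
Z = (panel - panel.mean(0)) / panel.std(0)
U, s, _ = np.linalg.svd(Z, full_matrices=False)
factors = U[:, :2] * s[:2]

# Factor-augmented quantile regressions for one country's inflation, one step ahead
h = 1
y = panel[h:, 0]
X = sm.add_constant(factors[:-h])
iar = sm.QuantReg(y, X).fit(q=0.90)   # right tail: Inflation-at-risk
dar = sm.QuantReg(y, X).fit(q=0.10)   # left tail: Deflation-at-risk
print(iar.params, dar.params)
```

Comparing the factor coefficients across the two tails is what lets the abstract conclude that international factors matter more for high-inflation vulnerability than for average or low inflation.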
The findings suggest that reciprocal invariant priors, with polynomial decay in their tails, such as the Half-Cauchy and Product Half-Cauchy, are particularly well-suited for this task, providing a balance between bias reduction and variance control."}, "https://arxiv.org/abs/2410.20671": {"title": "Statistical Inference in High-dimensional Poisson Regression with Applications to Mediation Analysis", "link": "https://arxiv.org/abs/2410.20671", "description": "arXiv:2410.20671v1 Announce Type: new \nAbstract: Large-scale datasets with count outcome variables are widely present in various applications, and the Poisson regression model is among the most popular models for handling count outcomes. This paper considers the high-dimensional sparse Poisson regression model and proposes bias-corrected estimators for both linear and quadratic transformations of high-dimensional regression vectors. We establish the asymptotic normality of the estimators, construct asymptotically valid confidence intervals, and conduct related hypothesis testing. We apply the devised methodology to high-dimensional mediation analysis with count outcome, with particular application of testing for the existence of interaction between the treatment variable and high-dimensional mediators. We demonstrate the proposed methods through extensive simulation studies and application to real-world epigenetic data."}, "https://arxiv.org/abs/2410.20860": {"title": "Robust Network Targeting with Multiple Nash Equilibria", "link": "https://arxiv.org/abs/2410.20860", "description": "arXiv:2410.20860v1 Announce Type: new \nAbstract: Many policy problems involve designing individualized treatment allocation rules to maximize the equilibrium social welfare of interacting agents. Focusing on large-scale simultaneous decision games with strategic complementarities, we develop a method to estimate an optimal treatment allocation rule that is robust to the presence of multiple equilibria. Our approach remains agnostic about changes in the equilibrium selection mechanism under counterfactual policies, and we provide a closed-form expression for the boundary of the set-identified equilibrium outcomes. To address the incompleteness that arises when an equilibrium selection mechanism is not specified, we use the maximin welfare criterion to select a policy, and implement this policy using a greedy algorithm. We establish a performance guarantee for our method by deriving a welfare regret bound, which accounts for sampling uncertainty and the use of the greedy algorithm. We demonstrate our method with an application to the microfinance dataset of Banerjee et al. (2013)."}, "https://arxiv.org/abs/2410.20885": {"title": "A Distributed Lag Approach to the Generalised Dynamic Factor Model (GDFM)", "link": "https://arxiv.org/abs/2410.20885", "description": "arXiv:2410.20885v1 Announce Type: new \nAbstract: We provide estimation and inference for the Generalised Dynamic Factor Model (GDFM) under the assumption that the dynamic common component can be expressed in terms of a finite number of lags of contemporaneously pervasive factors. 
The proposed estimator is simply an OLS regression of the observed variables on factors extracted via static principal components and therefore avoids frequency domain techniques entirely."}, "https://arxiv.org/abs/2410.20908": {"title": "Making all pairwise comparisons in multi-arm clinical trials without control treatment", "link": "https://arxiv.org/abs/2410.20908", "description": "arXiv:2410.20908v1 Announce Type: new \nAbstract: The standard paradigm for confirmatory clinical trials is to compare experimental treatments with a control, for example the standard of care or a placebo. However, it is not always the case that a suitable control exists. Efficient statistical methodology is well studied in the setting of randomised controlled trials. This is not the case if one wishes to compare several experimental treatments with no control arm. We propose hypothesis testing methods suitable for use in such a setting. These methods are efficient, ensuring the error rate is controlled at exactly the desired rate with no conservatism. This in turn yields an improvement in power when compared with standard methods one might otherwise consider using, such as a Bonferroni adjustment. The proposed testing procedure is also highly flexible. We show how it may be extended for use in multi-stage adaptive trials, covering the majority of scenarios in which one might consider the use of such procedures in the clinical trials setting. With such a highly flexible nature, these methods may also be applied more broadly outside of a clinical trials setting."}, "https://arxiv.org/abs/2410.20915": {"title": "On Spatio-Temporal Stochastic Frontier Models", "link": "https://arxiv.org/abs/2410.20915", "description": "arXiv:2410.20915v1 Announce Type: new \nAbstract: In the literature on stochastic frontier models until the early 2000s, the joint consideration of spatial and temporal dimensions was often inadequately addressed, if not completely neglected. However, from an evolutionary economics perspective, the production process of the decision-making units constantly changes over both dimensions: it is not stable over time due to managerial enhancements and/or internal or external shocks, and is influenced by the nearest territorial neighbours. This paper proposes an extension of the Fusco and Vidoli [2013] SEM-like approach, which globally accounts for spatial and temporal effects in the inefficiency term. In particular, consistent with the stochastic panel frontier literature, two different versions of the model are proposed: the time-invariant and the time-varying spatial stochastic frontier models. In order to evaluate the inferential properties of the proposed estimators, we first run Monte Carlo experiments and we then present the results of an application to a set of commonly referenced data, demonstrating robustness and stability of estimates across all scenarios."}, "https://arxiv.org/abs/2410.20918": {"title": "Almost goodness-of-fit tests", "link": "https://arxiv.org/abs/2410.20918", "description": "arXiv:2410.20918v1 Announce Type: new \nAbstract: We introduce the almost goodness-of-fit test, a procedure to decide if a (parametric) model provides a good representation of the probability distribution generating the observed sample. We consider the approximate model determined by an M-estimator of the parameters as the best representative of the unknown distribution within the parametric class.
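A minimal sketch of the distributed-lag GDFM estimator summarised at the start of this passage ("A Distributed Lag Approach to the Generalised Dynamic Factor Model (GDFM)"): extract static principal-component factors and regress each observable on their current and lagged values by OLS. The simulated panel and lag order are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, r, q = 400, 30, 2, 2              # sample size, variables, factors, lags

f = np.zeros((T, r))
for t in range(1, T):                   # persistent common factors
    f[t] = 0.7 * f[t - 1] + rng.normal(size=r)
Lam0, Lam1 = rng.normal(size=(r, N)), rng.normal(size=(r, N))
f_lag = np.vstack([np.zeros((1, r)), f[:-1]])
X = f @ Lam0 + f_lag @ Lam1 + rng.normal(size=(T, N))

# (i) static principal components of the standardized panel
Z = (X - X.mean(0)) / X.std(0)
U, s, _ = np.linalg.svd(Z, full_matrices=False)
F_hat = U[:, :r] * s[:r]

# (ii) OLS of each observable on current and q lagged factors
idx = np.arange(q, T)
regressors = np.hstack([F_hat[idx - l] for l in range(q + 1)])
regressors = np.column_stack([np.ones(len(idx)), regressors])
B, *_ = np.linalg.lstsq(regressors, X[q:], rcond=None)
common = regressors @ B                 # estimated dynamic common component

resid = X[q:] - common
print(1 - resid.var(0).mean() / X[q:].var(0).mean())   # share of variance explained
```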
The objective is the approximate validation of a distribution or an entire parametric family up to a pre-specified threshold value, the margin of error. The methodology also allows quantifying the percentage improvement of the proposed model compared to a non-informative (constant) one. The test statistic is the $\\mathrm{L}^p$-distance between the empirical distribution function and the corresponding one of the estimated (parametric) model. The value of the parameter $p$ allows modulating the impact of the tails of the distribution in the validation of the model. By deriving the asymptotic distribution of the test statistic, as well as proving the consistency of its bootstrap approximation, we present an easy-to-implement and flexible method. The performance of the proposal is illustrated with a simulation study and the analysis of a real dataset."}, "https://arxiv.org/abs/2410.21098": {"title": "Single CASANOVA? Not in multiple comparisons", "link": "https://arxiv.org/abs/2410.21098", "description": "arXiv:2410.21098v1 Announce Type: new \nAbstract: When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any groups but rather the location. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply some corrections or introduce tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated [1] and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent [2]. We propose two new tests based on combined weighted log-rank tests. One as a simple multiple contrast test of weighted log-rank tests and one as an extension of the so-called CASANOVA test [3]. The latter was introduced for factorial designs. We propose a new multiple contrast test based on the CASANOVA approach. Our test promises to be more powerful under crossing hazards and eliminates the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power."}, "https://arxiv.org/abs/2410.21104": {"title": "Topological Identification of Agent Status in Information Contagions: Application to Financial Markets", "link": "https://arxiv.org/abs/2410.21104", "description": "arXiv:2410.21104v1 Announce Type: new \nAbstract: Cascade models serve as effective tools for understanding the propagation of information and diseases within social networks. Nevertheless, their applicability becomes constrained when the states of the agents (nodes) are hidden and can only be inferred through indirect observations or symptoms. This study proposes a Mapper-based strategy to infer the status of agents within a hidden information cascade model using expert knowledge. To verify and demonstrate the method we identify agents who are likely to take advantage of information obtained from an inside information network. We do this using data on insider networks and stock market transactions. 
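A minimal sketch of the test statistic in "Almost goodness-of-fit tests" above: the $\mathrm{L}^p$ distance between the empirical distribution function and the fitted parametric CDF, here for a Gaussian family with $p=2$ and a simple parametric bootstrap of the statistic's distribution under the fitted model. The margin-of-error calibration and M-estimation details of the paper are not reproduced.

```python
import numpy as np
from scipy import stats

def lp_distance(x, cdf, p=2, grid_size=2000):
    """Approximate L^p distance between the empirical CDF of x and a fitted CDF."""
    lo, hi = x.min() - 3 * x.std(), x.max() + 3 * x.std()
    grid = np.linspace(lo, hi, grid_size)
    ecdf = np.searchsorted(np.sort(x), grid, side="right") / x.size
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(ecdf - cdf(grid)) ** p) * dx) ** (1 / p)

rng = np.random.default_rng(0)
x = rng.standard_t(df=5, size=300)          # data slightly heavier-tailed than normal

mu, sigma = x.mean(), x.std(ddof=1)         # fitted Gaussian model
stat = lp_distance(x, lambda t: stats.norm.cdf(t, mu, sigma))

# Parametric bootstrap of the statistic under the fitted model
boot = []
for _ in range(500):
    xb = rng.normal(mu, sigma, size=x.size)
    mb, sb = xb.mean(), xb.std(ddof=1)
    boot.append(lp_distance(xb, lambda t: stats.norm.cdf(t, mb, sb)))
print(stat, np.mean(np.array(boot) >= stat))   # statistic and bootstrap p-value
```

The choice of $p$ modulates how much the tails of the distribution drive the comparison, which is the role the abstract assigns to it.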
Recognizing the sensitive nature of allegations of insider trading, we design a conservative approach to minimize false positives, ensuring that innocent agents are not wrongfully implicated. The Mapper-based results systematically outperform other methods, such as clustering and unsupervised anomaly detection, on synthetic data. We also apply the method to empirical data and verify the results using a statistical validation method based on persistent homology. Our findings highlight that the proposed Mapper-based technique successfully identifies a subpopulation of opportunistic agents within the information cascades. The adaptability of this method to diverse data types and sizes is demonstrated, with potential for tailoring for specific applications."}, "https://arxiv.org/abs/2410.21105": {"title": "Difference-in-Differences with Time-varying Continuous Treatments using Double/Debiased Machine Learning", "link": "https://arxiv.org/abs/2410.21105", "description": "arXiv:2410.21105v1 Announce Type: new \nAbstract: We propose a difference-in-differences (DiD) method for a time-varying continuous treatment and multiple time periods. Our framework assesses the average treatment effect on the treated (ATET) when comparing two non-zero treatment doses. The identification is based on a conditional parallel trend assumption imposed on the mean potential outcome under the lower dose, given observed covariates and past treatment histories. We employ kernel-based ATET estimators for repeated cross-sections and panel data adopting the double/debiased machine learning framework to control for covariates and past treatment histories in a data-adaptive manner. We also demonstrate the asymptotic normality of our estimation approach under specific regularity conditions. In a simulation study, we find a compelling finite sample performance of undersmoothed versions of our estimators in setups with several thousand observations."}, "https://arxiv.org/abs/2410.21166": {"title": "A Componentwise Estimation Procedure for Multivariate Location and Scatter: Robustness, Efficiency and Scalability", "link": "https://arxiv.org/abs/2410.21166", "description": "arXiv:2410.21166v1 Announce Type: new \nAbstract: Covariance matrix estimation is an important problem in multivariate data analysis, from both theoretical and applied points of view. Many simple and popular covariance matrix estimators are known to be severely affected by model misspecification and the presence of outliers in the data; on the other hand, robust estimators with reasonably high efficiency are often computationally challenging for modern large and complex datasets. In this work, we propose a new, simple, robust and highly efficient method for estimation of the location vector and the scatter matrix for elliptically symmetric distributions. The proposed estimation procedure is designed in the spirit of the minimum density power divergence (DPD) estimation approach with appropriate modifications which make our proposal (sequential minimum DPD estimation) computationally very economical and scalable to large as well as higher dimensional datasets. Consistency and asymptotic normality of the proposed sequential estimators of the multivariate location and scatter are established along with asymptotic positive definiteness of the estimated scatter matrix. Robustness of our estimators is studied by means of influence functions. All theoretical results are illustrated further under multivariate normality.
A large-scale simulation study is presented to assess finite sample performances and scalability of our method in comparison to the usual maximum likelihood estimator (MLE), the ordinary minimum DPD estimator (MDPDE) and other popular non-parametric methods. The applicability of our method is further illustrated with a real dataset on credit card transactions."}, "https://arxiv.org/abs/2410.21213": {"title": "Spatial causal inference in the presence of preferential sampling to study the impacts of marine protected areas", "link": "https://arxiv.org/abs/2410.21213", "description": "arXiv:2410.21213v1 Announce Type: new \nAbstract: Marine Protected Areas (MPAs) have been established globally to conserve marine resources. Given their maintenance costs and impact on commercial fishing, it is critical to evaluate their effectiveness to support future conservation. In this paper, we use data collected from the Australian coast to estimate the effect of MPAs on biodiversity. Environmental studies such as these are often observational, and processes of interest exhibit spatial dependence, which presents challenges in estimating the causal effects. Spatial data can also be subject to preferential sampling, where the sampling locations are related to the response variable, further complicating inference and prediction. To address these challenges, we propose a spatial causal inference method that simultaneously accounts for unmeasured spatial confounders in both the sampling process and the treatment allocation. We prove the identifiability of key parameters in the model and the consistency of the posterior distributions of those parameters. We show via simulation studies that the causal effect of interest can be reliably estimated under the proposed model. The proposed method is applied to assess the effect of MPAs on fish biomass. We find evidence of preferential sampling and that properly accounting for this source of bias impacts the estimate of the causal effect."}, "https://arxiv.org/abs/2410.21263": {"title": "Adaptive Transfer Clustering: A Unified Framework", "link": "https://arxiv.org/abs/2410.21263", "description": "arXiv:2410.21263v1 Announce Type: new \nAbstract: We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios."}, "https://arxiv.org/abs/2410.19780": {"title": "Sampling from Bayesian Neural Network Posteriors with Symmetric Minibatch Splitting Langevin Dynamics", "link": "https://arxiv.org/abs/2410.19780", "description": "arXiv:2410.19780v1 Announce Type: cross \nAbstract: We propose a scalable kinetic Langevin dynamics algorithm for sampling parameter spaces of big data and AI applications. Our scheme combines a symmetric forward/backward sweep over minibatches with a symmetric discretization of Langevin dynamics. 
For a particular Langevin splitting method (UBU), we show that the resulting Symmetric Minibatch Splitting-UBU (SMS-UBU) integrator has bias $O(h^2 d^{1/2})$ in dimension $d>0$ with stepsize $h>0$, despite only using one minibatch per iteration, thus providing excellent control of the sampling bias as a function of the stepsize. We apply the algorithm to explore local modes of the posterior distribution of Bayesian neural networks (BNNs) and evaluate the calibration performance of the posterior predictive probabilities for neural networks with convolutional neural network architectures for classification problems on three different datasets (Fashion-MNIST, Celeb-A and chest X-ray). Our results indicate that BNNs sampled with SMS-UBU can offer significantly better calibration performance compared to standard methods of training and stochastic weight averaging."}, "https://arxiv.org/abs/2410.19923": {"title": "Language Agents Meet Causality -- Bridging LLMs and Causal World Models", "link": "https://arxiv.org/abs/2410.19923", "description": "arXiv:2410.19923v1 Announce Type: cross \nAbstract: Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRLs with LLMs to enable causally-aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons."}, "https://arxiv.org/abs/2410.19952": {"title": "L\\'evy graphical models", "link": "https://arxiv.org/abs/2410.19952", "description": "arXiv:2410.19952v1 Announce Type: cross \nAbstract: Conditional independence and graphical models are crucial concepts for sparsity and statistical modeling in higher dimensions. For L\\'evy processes, a widely applied class of stochastic processes, these notions have not been studied. By the L\\'evy-It\\^o decomposition, a multivariate L\\'evy process can be decomposed into the sum of a Brownian motion part and an independent jump process. We show that conditional independence statements between the marginal processes can be studied separately for these two parts. While the Brownian part is well-understood, we derive a novel characterization of conditional independence between the sample paths of the jump process in terms of the L\\'evy measure. We define L\\'evy graphical models as L\\'evy processes that satisfy undirected or directed Markov properties. We prove that the graph structure is invariant under changes of the univariate marginal processes. 
L\\'evy graphical models allow the construction of flexible, sparse dependence models for L\\'evy processes in large dimensions, which are interpretable thanks to the underlying graph. For trees, we develop statistical methodology to learn the underlying structure from low- or high-frequency observations of the L\\'evy process and show consistent graph recovery. We apply our method to model stock returns from U.S. companies to illustrate the advantages of our approach."}, "https://arxiv.org/abs/2410.20089": {"title": "Sample Efficient Bayesian Learning of Causal Graphs from Interventions", "link": "https://arxiv.org/abs/2410.20089", "description": "arXiv:2410.20089v1 Announce Type: cross \nAbstract: Causal discovery is a fundamental problem with applications spanning various areas in science and engineering. It is well understood that solely using observational data, one can only orient the causal graph up to its Markov equivalence class, necessitating interventional data to learn the complete causal graph. Most works in the literature design causal discovery policies with perfect interventions, i.e., they have access to infinite interventional samples. This study considers a Bayesian approach for learning causal graphs with limited interventional samples, mirroring real-world scenarios where such samples are usually costly to obtain. By leveraging the recent result of Wien\\\"obst et al. (2023) on uniform DAG sampling in polynomial time, we can efficiently enumerate all the cut configurations and their corresponding interventional distributions of a target set, and further track their posteriors. Given any number of interventional samples, our proposed algorithm randomly intervenes on a set of target vertices that cut all the edges in the graph and returns a causal graph according to the posterior of each target set. When the number of interventional samples is large enough, we show theoretically that our proposed algorithm will return the true causal graph with high probability. We compare our algorithm against various baseline methods on simulated datasets, demonstrating its superior accuracy measured by the structural Hamming distance between the learned DAG and the ground truth. Additionally, we present a case study showing how this algorithm could be modified to answer more general causal questions without learning the whole graph. As an example, we illustrate that our method can be used to estimate the causal effect of a variable that cannot be intervened."}, "https://arxiv.org/abs/2410.20318": {"title": "Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds", "link": "https://arxiv.org/abs/2410.20318", "description": "arXiv:2410.20318v1 Announce Type: cross \nAbstract: We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and quantifying the associated uncertainty. Firstly, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. Then, we design a geodesic Hamiltonian Monte Carlo (-within-Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. 
More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach by fitting the categorical data of a mice protein dataset and addressing the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered."}, "https://arxiv.org/abs/2410.20896": {"title": "BSD: a Bayesian framework for parametric models of neural spectra", "link": "https://arxiv.org/abs/2410.20896", "description": "arXiv:2410.20896v1 Announce Type: cross \nAbstract: The analysis of neural power spectra plays a crucial role in understanding brain function and dysfunction. While recent efforts have led to the development of methods for decomposing spectral data, challenges remain in performing statistical analysis and group-level comparisons. Here, we introduce Bayesian Spectral Decomposition (BSD), a Bayesian framework for analysing neural spectral power. BSD allows for the specification, inversion, comparison, and analysis of parametric models of neural spectra, addressing limitations of existing methods. We first establish the face validity of BSD on simulated data and show how it outperforms an established method (fooof) for peak detection on artificial spectral data. We then demonstrate the efficacy of BSD on a group-level study of EEG spectra in 204 healthy subjects from the LEMON dataset. Our results not only highlight the effectiveness of BSD in model selection and parameter estimation, but also illustrate how BSD enables straightforward group-level regression of the effect of continuous covariates such as age. By using Bayesian inference techniques, BSD provides a robust framework for studying neural spectral data and their relationship to brain function and dysfunction."}, "https://arxiv.org/abs/2208.05543": {"title": "A novel decomposition to explain heterogeneity in observational and randomized studies of causality", "link": "https://arxiv.org/abs/2208.05543", "description": "arXiv:2208.05543v2 Announce Type: replace \nAbstract: This paper introduces a novel decomposition framework to explain heterogeneity in causal effects observed across different studies, considering both observational and randomized settings. We present a formal decomposition of between-study heterogeneity, identifying sources of variability in treatment effects across studies. The proposed methodology allows for robust estimation of causal parameters under various assumptions, addressing differences in pre-treatment covariate distributions, mediating variables, and the outcome mechanism. Our approach is validated through a simulation study and applied to data from the Moving to Opportunity (MTO) study, demonstrating its practical relevance.
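For "BSD: a Bayesian framework for parametric models of neural spectra" above, a minimal non-Bayesian sketch of the kind of parametric spectral model being fitted: an aperiodic 1/f-type background plus a Gaussian peak, as in fooof-style decompositions. The specific model form and the least-squares fit are illustrative assumptions; BSD itself performs Bayesian inversion, model comparison, and group-level analysis of such models.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_spectrum(f, offset, exponent, amp, mu, sigma):
    """Aperiodic 1/f^exponent background plus one Gaussian peak, in log10 power."""
    aperiodic = offset - exponent * np.log10(f)
    peak = amp * np.exp(-0.5 * ((f - mu) / sigma) ** 2)
    return aperiodic + peak

rng = np.random.default_rng(0)
freqs = np.linspace(2, 40, 200)
true = log_spectrum(freqs, offset=1.0, exponent=1.2, amp=0.6, mu=10.0, sigma=1.5)
observed = true + 0.05 * rng.normal(size=freqs.size)      # noisy log-power spectrum

p0 = [0.0, 1.0, 0.5, 9.0, 2.0]                            # rough initial guess
params, _ = curve_fit(log_spectrum, freqs, observed, p0=p0)
print(dict(zip(["offset", "exponent", "amp", "mu", "sigma"], np.round(params, 2))))
```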
This work contributes to the broader understanding of causal inference in multi-study environments, with potential applications in evidence synthesis and policy-making."}, "https://arxiv.org/abs/2211.03274": {"title": "A General Framework for Cutting Feedback within Modularised Bayesian Inference", "link": "https://arxiv.org/abs/2211.03274", "description": "arXiv:2211.03274v3 Announce Type: replace \nAbstract: Standard Bayesian inference can build models that combine information from various sources, but this inference may not be reliable if components of a model are misspecified. Cut inference, as a particular type of modularized Bayesian inference, is an alternative which splits a model into modules and cuts the feedback from the suspect module. Previous studies have focused on a two-module case, but a more general definition of a \"module\" remains unclear. We present a formal definition of a \"module\" and discuss its properties. We formulate methods for identifying modules; determining the order of modules; and building the cut distribution that should be used for cut inference within an arbitrary directed acyclic graph structure. We justify the cut distribution by showing that it not only cuts the feedback but also is the best approximation satisfying this condition to the joint distribution in the Kullback-Leibler divergence. We also extend cut inference for the two-module case to a general multiple-module case via a sequential splitting technique and demonstrate this via illustrative applications."}, "https://arxiv.org/abs/2310.16638": {"title": "Double Debiased Covariate Shift Adaptation Robust to Density-Ratio Estimation", "link": "https://arxiv.org/abs/2310.16638", "description": "arXiv:2310.16638v3 Announce Type: replace \nAbstract: Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. 
Finally, we confirm the soundness of our proposed method via simulation studies."}, "https://arxiv.org/abs/2008.13619": {"title": "A model and method for analyzing the precision of binary measurement methods based on beta-binomial distributions, and related statistical tests", "link": "https://arxiv.org/abs/2008.13619", "description": "arXiv:2008.13619v3 Announce Type: replace-cross \nAbstract: This study developed a new statistical model and method for analyzing the precision of binary measurement methods from collaborative studies. The model is based on beta-binomial distributions. In other words, it assumes that the sensitivity of each laboratory obeys a beta distribution, and the binary measured values under a given sensitivity follow a binomial distribution. We propose the key precision measures of repeatability and reproducibility for the model, and provide their unbiased estimates. Further, through consideration of a number of statistical test methods for homogeneity of proportions, we propose appropriate methods for determining laboratory effects in the new model. Finally, we apply the results to real-world examples in the fields of food safety and chemical risk assessment and management."}, "https://arxiv.org/abs/2304.01111": {"title": "Theoretical guarantees for neural control variates in MCMC", "link": "https://arxiv.org/abs/2304.01111", "description": "arXiv:2304.01111v2 Announce Type: replace-cross \nAbstract: In this paper, we propose a variance reduction approach for Markov chains based on additive control variates and the minimization of an appropriate estimate for the asymptotic variance. We focus on the particular case when control variates are represented as deep neural networks. We derive the optimal convergence rate of the asymptotic variance under various ergodicity assumptions on the underlying Markov chain. The proposed approach relies upon recent results on the stochastic errors of variance reduction algorithms and function approximation theory."}, "https://arxiv.org/abs/2401.01426": {"title": "Modular Learning of Deep Causal Generative Models for High-dimensional Causal Inference", "link": "https://arxiv.org/abs/2401.01426", "description": "arXiv:2401.01426v2 Announce Type: replace-cross \nAbstract: Sound and complete algorithms have been proposed to compute identifiable causal queries using the causal structure and data. However, most of these algorithms assume accurate estimation of the data distribution, which is impractical for high-dimensional variables such as images. On the other hand, modern deep generative architectures can be trained to sample from high-dimensional distributions. However, training these networks is typically very costly. Thus, it is desirable to leverage pre-trained models to answer causal queries using such high-dimensional data. To address this, we propose modular training of deep causal generative models that not only makes learning more efficient, but also allows us to utilize large, pre-trained conditional generative models. To the best of our knowledge, our algorithm, Modular-DCM, is the first algorithm that, given the causal structure, uses adversarial training to learn the network weights, and can make use of pre-trained models to provably sample from any identifiable causal query in the presence of latent confounders. With extensive experiments on the Colored-MNIST dataset, we demonstrate that our algorithm outperforms the baselines. 
We also show our algorithm's convergence on the COVIDx dataset and its utility with a causal invariant prediction problem on CelebA-HQ."}, "https://arxiv.org/abs/2410.21343": {"title": "Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects", "link": "https://arxiv.org/abs/2410.21343", "description": "arXiv:2410.21343v1 Announce Type: new \nAbstract: Data from observational studies (OSs) is widely available and readily obtainable yet frequently contains confounding biases. On the other hand, data derived from randomized controlled trials (RCTs) helps to reduce these biases; however, it is expensive to gather, resulting in only a small amount of randomized data. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require \\textit{complete} observational data, meaning that both treated subjects and untreated subjects must be included in OSs. This prerequisite confines the applicability of such methods to very specific situations, given that including all subjects, whether treated or untreated, in observational studies is not consistently achievable. In our paper, we propose a resilient approach to \\textbf{C}ombine \\textbf{I}ncomplete \\textbf{O}bservational data and randomized data for HTE estimation, which we abbreviate as \\textbf{CIO}. The CIO is capable of estimating HTEs efficiently regardless of the completeness of the observational data, be it full or partial. Concretely, a confounding bias function is first derived using the pseudo-experimental group from OSs, in conjunction with the pseudo-control group from RCTs, via an effect estimation procedure. This function is subsequently utilized as a corrective residual to rectify the observed outcomes of observational data during the HTE estimation by combining the available observational data and all the randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets."}, "https://arxiv.org/abs/2410.21350": {"title": "Enhanced sequential directional importance sampling for structural reliability analysis", "link": "https://arxiv.org/abs/2410.21350", "description": "arXiv:2410.21350v1 Announce Type: new \nAbstract: Sequential directional importance sampling (SDIS) is an efficient adaptive simulation method for estimating failure probabilities. It expresses the failure probability as the product of a group of integrals that are easy to estimate, wherein the first one is estimated with Monte Carlo simulation (MCS), and all the subsequent ones are estimated with directional importance sampling. In this work, we propose an enhanced SDIS method for structural reliability analysis. We discuss the efficiency of MCS for estimating the first integral in standard SDIS and propose using Subset Simulation as an alternative method. Additionally, we propose a Kriging-based active learning algorithm tailored to identify multiple roots in certain important directions within a specified search interval. The performance of the enhanced SDIS is demonstrated through various complex benchmark problems. 
The results show that the enhanced SDIS is a versatile reliability analysis method that can efficiently and robustly solve challenging reliability problems."}, "https://arxiv.org/abs/2410.21355": {"title": "Generalized Method of Moments and Percentile Method: Estimating parameters of the Novel Median Based Unit Weibull Distribution", "link": "https://arxiv.org/abs/2410.21355", "description": "arXiv:2410.21355v1 Announce Type: new \nAbstract: The Median Based Unit Weibull is a new two-parameter unit Weibull distribution defined on the unit interval (0,1). Estimation of the parameters using MLE encounters problems such as large variance. Using the generalized method of moments (GMM) and the percentile method may ameliorate this condition. This paper introduces GMM and percentile methods for estimating the parameters of the new distribution with an illustrative real data analysis."}, "https://arxiv.org/abs/2410.21464": {"title": "Causal Bootstrap for General Randomized Designs", "link": "https://arxiv.org/abs/2410.21464", "description": "arXiv:2410.21464v1 Announce Type: new \nAbstract: We distinguish between two sources of uncertainty in experimental causal inference: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a super-population. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. Most bootstrap procedures used by practitioners primarily estimate sampling uncertainty. Other methods for quantifying design uncertainty also fall short, because they are restricted to common designs and estimators, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer programming approach, which allows us to estimate design uncertainty for any known, probabilistic assignment mechanism, and for linear-in-treatment and quadratic-in-treatment estimators. We include asymptotic validity results and demonstrate the refined confidence intervals achieved by accurately accounting for non-standard design uncertainty through simulations of geographical experiments."}, "https://arxiv.org/abs/2410.21469": {"title": "Hybrid Bayesian Smoothing on Surfaces", "link": "https://arxiv.org/abs/2410.21469", "description": "arXiv:2410.21469v1 Announce Type: new \nAbstract: Modeling spatial processes that exhibit both smooth and rough features poses a significant challenge. This is especially true in fields where complex physical variables are observed across spatial domains. Traditional spatial techniques, such as Gaussian processes (GPs), are ill-suited to capture sharp transitions and discontinuities in spatial fields. In this paper, we propose a new approach incorporating non-Gaussian processes (NGPs) into a hybrid model which identifies both smooth and rough components. Specifically, we model the rough process using scaled mixtures of Gaussian distributions in a Bayesian hierarchical model (BHM).\n Our motivation comes from the Community Earth System Model Large Ensemble (CESM-LE), where we seek to emulate climate sensitivity fields that exhibit complex spatial patterns, including abrupt transitions at ocean-land boundaries. We demonstrate that traditional GP models fail to capture such abrupt changes and that our proposed hybrid model, implemented through a full Gibbs sampler, 
significantly improves model interpretability and the accurate recovery of process parameters.\n Through a multi-factor simulation study, we evaluate the performance of several scaled mixtures designed to model the rough process. The results highlight the advantages of using these heavier-tailed priors as a replacement for the Bayesian fused LASSO. One prior in particular, the normal Jeffreys prior, stands above the rest. We apply our model to the CESM-LE dataset, demonstrating its ability to better represent the mean function and its uncertainty in climate sensitivity fields.\n This work combines the strengths of GPs for smooth processes with the flexibility of NGPs for abrupt changes. We provide a computationally efficient Gibbs sampler and include additional strategies for accelerating Markov chain Monte Carlo (MCMC) sampling."}, "https://arxiv.org/abs/2410.21498": {"title": "Bayesian Nonparametric Models for Multiple Raters: a General Statistical Framework", "link": "https://arxiv.org/abs/2410.21498", "description": "arXiv:2410.21498v1 Announce Type: new \nAbstract: The rating procedure is crucial in many applied fields (e.g., educational, clinical, emergency). It implies that a rater (e.g., teacher, doctor) rates a subject (e.g., student, doctor) on a rating scale. Given raters' variability, several statistical methods have been proposed for assessing and improving the quality of ratings. The analysis and estimation of the Intraclass Correlation Coefficient (ICC) are major concerns in such cases. As evidenced by the literature, ICC might differ across different subgroups of raters and might be affected by contextual factors and subject heterogeneity. Model estimation in the presence of heterogeneity has been one of the recent challenges in this research line. Consequently, several methods have been proposed to address this issue under a parametric multilevel modelling framework, in which strong distributional assumptions are made. We propose a more flexible model under the Bayesian nonparametric (BNP) framework, in which most of those assumptions are relaxed. By eliciting hierarchical discrete nonparametric priors, the model accommodates clusters among raters and subjects, naturally accounts for heterogeneity, and improves estimation accuracy. We propose a general BNP heteroscedastic framework to analyze rating data and possible latent differences among subjects and raters. The estimated densities are used to make inferences about the rating process and the quality of the ratings. By exploiting a stick-breaking representation of the Dirichlet Process, a general class of ICC indices can be derived for these models. Theoretical results about the ICC are provided together with computational strategies. Simulations and a real-world application are presented and possible future directions are discussed."}, "https://arxiv.org/abs/2410.21505": {"title": "Economic Diversification and Social Progress in the GCC Countries: A Study on the Transition from Oil-Dependency to Knowledge-Based Economies", "link": "https://arxiv.org/abs/2410.21505", "description": "arXiv:2410.21505v1 Announce Type: new \nAbstract: The Gulf Cooperation Council countries -- Oman, Bahrain, Kuwait, UAE, Qatar, and Saudi Arabia -- hold strategic significance due to their large oil reserves. However, these nations face considerable challenges in shifting from oil-dependent economies to more diversified, knowledge-based systems. 
This study examines the progress of Gulf Cooperation Council (GCC) countries in achieving economic diversification and social development, focusing on the Social Progress Index (SPI), which provides a broader measure of societal well-being beyond just economic growth. Using data from the World Bank, covering 2010 to 2023, the study employs the XGBoost machine learning model to forecast SPI values for the period of 2024 to 2026. Key components of the methodology include data preprocessing, feature selection, and the simulation of independent variables through ARIMA modeling. The results highlight significant improvements in education, healthcare, and women's rights, contributing to enhanced SPI performance across the GCC countries. However, notable challenges persist in areas like personal rights and inclusivity. The study further indicates that despite economic setbacks caused by global disruptions, including the COVID-19 pandemic and oil price volatility, GCC nations are expected to see steady improvements in their SPI scores through 2027. These findings underscore the critical importance of economic diversification, investment in human capital, and ongoing social reforms to reduce dependence on hydrocarbons and build knowledge-driven economies. This research offers valuable insights for policymakers aiming to strengthen both social and economic resilience in the region while advancing long-term sustainable development goals."}, "https://arxiv.org/abs/2410.21516": {"title": "Forecasting Political Stability in GCC Countries", "link": "https://arxiv.org/abs/2410.21516", "description": "arXiv:2410.21516v1 Announce Type: new \nAbstract: Political stability is crucial for the socioeconomic development of nations, particularly in geopolitically sensitive regions such as the Gulf Cooperation Council countries: Saudi Arabia, UAE, Kuwait, Qatar, Oman, and Bahrain. This study focuses on predicting the political stability index for these six countries using machine learning techniques. The study uses data from the World Bank's comprehensive dataset, comprising 266 indicators covering economic, political, social, and environmental factors. Employing the Edit Distance on Real Sequence method for feature selection and XGBoost for model training, the study forecasts political stability trends for the next five years. The model achieves high accuracy, with mean absolute percentage error values under 10, indicating reliable predictions. The forecasts suggest that Oman, the UAE, and Qatar will experience relatively stable political conditions, while Saudi Arabia and Bahrain may continue to face negative political stability indices. The findings underscore the significance of economic factors such as GDP and foreign investment, along with variables related to military expenditure and international tourism, as key predictors of political stability. These results provide valuable insights for policymakers, enabling proactive measures to enhance governance and mitigate potential risks."}, "https://arxiv.org/abs/2410.21559": {"title": "Enhancing parameter estimation in finite mixture of generalized normal distributions", "link": "https://arxiv.org/abs/2410.21559", "description": "arXiv:2410.21559v1 Announce Type: new \nAbstract: Mixtures of generalized normal distributions (MGND) have gained popularity for modelling datasets with complex statistical behaviours. 
However, the estimation of the shape parameter within the maximum likelihood framework is quite complex, presenting the risk of numerical and degeneracy issues. This study introduces an expectation conditional maximization algorithm that includes an adaptive step size function within Newton-Raphson updates of the shape parameter and a modified criterion for stopping the EM iterations. Extensive simulations show the effectiveness of the proposed algorithm in overcoming the limitations of existing approaches, especially in scenarios with high shape parameter values, high parameter overlap, and low sample sizes. A detailed comparative analysis with a mixture of normals and Student-t distributions revealed that the MGND model exhibited superior goodness-of-fit performance when used to fit the density of the returns of 50 stocks belonging to the Euro Stoxx index."}, "https://arxiv.org/abs/2410.21603": {"title": "Approximate Bayesian Computation with Statistical Distances for Model Selection", "link": "https://arxiv.org/abs/2410.21603", "description": "arXiv:2410.21603v1 Announce Type: new \nAbstract: Model selection is a key task in statistics, playing a critical role across various scientific disciplines. While no model can fully capture the complexities of a real-world data-generating process, identifying the model that best approximates it can provide valuable insights. Bayesian statistics offers a flexible framework for model selection by updating prior beliefs as new data becomes available, allowing for ongoing refinement of candidate models. This is typically achieved by calculating posterior probabilities, which quantify the support for each model given the observed data. However, in cases where likelihood functions are intractable, exact computation of these posterior probabilities becomes infeasible. Approximate Bayesian Computation (ABC) has emerged as a likelihood-free method; it is traditionally used with summary statistics to reduce data dimensionality, but this often results in information loss that is difficult to quantify, particularly in model selection contexts. Recent advancements propose the use of full data approaches based on statistical distances, offering a promising alternative that bypasses the need for summary statistics and potentially allows recovery of the exact posterior distribution. Despite these developments, full data ABC approaches have not yet been widely applied to model selection problems. This paper seeks to address this gap by investigating the performance of ABC with statistical distances in model selection. Through simulation studies and an application to toad movement models, this work explores whether full data approaches can overcome the limitations of summary statistic-based ABC for model choice."}, "https://arxiv.org/abs/2410.21832": {"title": "Robust Estimation and Model Selection for the Controlled Directed Effect with Unmeasured Mediator-Outcome Confounders", "link": "https://arxiv.org/abs/2410.21832", "description": "arXiv:2410.21832v1 Announce Type: new \nAbstract: The Controlled Direct Effect (CDE) is one of the causal estimands used to evaluate both exposure and mediation effects on an outcome. When unmeasured confounders exist between the mediator and the outcome, the ordinary identification assumption does not hold. In this manuscript, we consider an identification condition to identify CDE in the presence of unmeasured confounders. 
The key assumptions are: 1) the random allocation of the exposure, and 2) the existence of instrumental variables directly related to the mediator. Under these conditions, we propose a novel doubly robust estimation method, which works well if either the propensity score model or the baseline outcome model is correctly specified. Additionally, we propose a Generalized Information Criterion (GIC)-based model selection criterion for CDE that ensures model selection consistency. Our proposed procedure and related methods are applied to both simulated and real datasets to confirm their performance. Our proposed method can select the correct model with high probability and accurately estimate the CDE."}, "https://arxiv.org/abs/2410.21858": {"title": "Joint Estimation of Conditional Mean and Covariance for Unbalanced Panels", "link": "https://arxiv.org/abs/2410.21858", "description": "arXiv:2410.21858v1 Announce Type: new \nAbstract: We propose a novel nonparametric kernel-based estimator of cross-sectional conditional mean and covariance matrices for large unbalanced panels. We show its consistency and provide finite-sample guarantees. In an empirical application, we estimate conditional mean and covariance matrices for a large unbalanced panel of monthly stock excess returns given macroeconomic and firm-specific covariates from 1962 to 2021. The estimator performs well with respect to statistical measures. It is informative for empirical asset pricing, generating conditional mean-variance efficient portfolios with substantial out-of-sample Sharpe ratios far beyond equal-weighted benchmarks."}, "https://arxiv.org/abs/2410.21914": {"title": "Bayesian Stability Selection and Inference on Inclusion Probabilities", "link": "https://arxiv.org/abs/2410.21914", "description": "arXiv:2410.21914v1 Announce Type: new \nAbstract: Stability selection is a versatile framework for structure estimation and variable selection in high-dimensional settings, primarily grounded in frequentist principles. In this paper, we propose an enhanced methodology that integrates Bayesian analysis to refine the inference of inclusion probabilities within the stability selection framework. Traditional approaches rely on selection frequencies for decision-making, often disregarding domain-specific knowledge and failing to account for the inherent uncertainty in the variable selection process. Our methodology uses prior information to derive posterior distributions of inclusion probabilities, thereby improving both inference and decision-making. We present a two-step process for engaging with domain experts, enabling statisticians to elicit prior distributions informed by expert knowledge while allowing experts to control the weight of their input on the final results. Using posterior distributions, we offer Bayesian credible intervals to quantify uncertainty in the variable selection process. In addition, we highlight how selection frequencies can be uninformative or even misleading when covariates are correlated with each other, and demonstrate how domain expertise can alleviate such issues. 
Our approach preserves the versatility of stability selection and is suitable for a broad range of structure estimation challenges."}, "https://arxiv.org/abs/2410.21954": {"title": "Inference of a Susceptible-Infectious stochastic model", "link": "https://arxiv.org/abs/2410.21954", "description": "arXiv:2410.21954v1 Announce Type: new \nAbstract: We consider a time-inhomogeneous diffusion process able to describe the dynamics of infected people in a susceptible-infectious epidemic model in which the transmission intensity function is time-dependent. Such a model is well suited to describe some classes of micro-parasitic infections in which individuals never acquire lasting immunity and over the course of the epidemic everyone eventually becomes infected. The stochastic process related to the deterministic model is transformable into a non-homogeneous Wiener process, so the probability distribution can be obtained. Here we focus on inference for such a process, by providing an estimation procedure for the involved parameters. We point out that the time dependence in the infinitesimal moments of the diffusion process makes classical inference methods inapplicable. The proposed procedure is based on the Generalized Method of Moments in order to find suitable estimates of the infinitesimal drift and variance of the transformed process. Several simulation studies are conducted to test the procedure; these include the time-homogeneous case, for which a comparison with the results obtained by applying the MLE is made, and cases in which the intensity function is time-dependent, with particular attention to periodic cases. Finally, we apply the estimation procedure to a real dataset."}, "https://arxiv.org/abs/2410.21989": {"title": "On the Consistency of Partial Ordering Continual Reassessment Method with Model and Ordering Misspecification", "link": "https://arxiv.org/abs/2410.21989", "description": "arXiv:2410.21989v1 Announce Type: new \nAbstract: One of the aims of Phase I clinical trial designs in oncology is typically to find the maximum tolerated doses. A number of innovative dose-escalation designs were proposed in the literature to achieve this goal efficiently. Although the sample size of Phase I trials is usually small, the asymptotic properties (e.g. consistency) of dose-escalation designs can provide useful guidance on the design parameters and improve fundamental understanding of these designs. For the first proposed model-based monotherapy dose-escalation design, the Continual Reassessment Method (CRM), sufficient consistency conditions have been previously derived and have greatly influenced how these studies are run in practice. At the same time, there is increasing interest in Phase I combination-escalation trials in which two or more drugs are combined. The monotherapy dose-escalation design cannot be generally applied in this case due to uncertainty in the monotonic ordering between some of the combinations, and, as a result, specialised designs were proposed. However, the theoretical and asymptotic properties of these proposals have not been evaluated. In this paper, we derive the consistency conditions of the Partial Ordering CRM (POCRM) design when there exists uncertainty in the monotonic ordering with a focus on dual-agent combination-escalation trials. 
Based on the derived consistency condition, we provide guidance on how the design parameters and ordering of the POCRM should be defined."}, "https://arxiv.org/abs/2410.22248": {"title": "Model-free Estimation of Latent Structure via Multiscale Nonparametric Maximum Likelihood", "link": "https://arxiv.org/abs/2410.22248", "description": "arXiv:2410.22248v1 Announce Type: new \nAbstract: Multivariate distributions often carry latent structures that are difficult to identify and estimate, and which better reflect the data generating mechanism than extrinsic structures exhibited simply by the raw data. In this paper, we propose a model-free approach for estimating such latent structures whenever they are present, without assuming they exist a priori. Given an arbitrary density $p_0$, we construct a multiscale representation of the density and propose data-driven methods for selecting representative models that capture meaningful discrete structure. Our approach uses a nonparametric maximum likelihood estimator to estimate the latent structure at different scales, and we further characterize their asymptotic limits. By carrying out such a multiscale analysis, we obtain coarse-to-fine structures inherent in the original distribution, which are integrated via a model selection procedure to yield an interpretable discrete representation of it. As an application, we design a clustering algorithm based on the proposed procedure and demonstrate its effectiveness in capturing a wide range of latent structures."}, "https://arxiv.org/abs/2410.22300": {"title": "A Latent Variable Model with Change Points and Its Application to Time Pressure Effects in Educational Assessment", "link": "https://arxiv.org/abs/2410.22300", "description": "arXiv:2410.22300v1 Announce Type: new \nAbstract: Educational assessments are valuable tools for measuring student knowledge and skills, but their validity can be compromised when test takers exhibit changes in response behavior due to factors such as time pressure. To address this issue, we introduce a novel latent factor model with change-points for item response data, designed to detect and account for individual-level shifts in response patterns during testing. This model extends traditional Item Response Theory (IRT) by incorporating person-specific change-points, which enables simultaneous estimation of item parameters, person latent traits, and the location of behavioral changes. We evaluate the proposed model through extensive simulation studies, which demonstrate its ability to accurately recover item parameters, change-point locations, and individual ability estimates under various conditions. Our findings show that accounting for change-points significantly reduces bias in ability estimates, particularly for respondents affected by time pressure. Application of the model to two real-world educational testing datasets reveals distinct patterns of change-point occurrence between high-stakes and lower-stakes tests, providing insights into how test-taking behavior evolves during the tests. 
This approach offers a more nuanced understanding of test-taking dynamics, with important implications for test design, scoring, and interpretation."}, "https://arxiv.org/abs/2410.22333": {"title": "Hypothesis tests and model parameter estimation on data sets with missing correlation information", "link": "https://arxiv.org/abs/2410.22333", "description": "arXiv:2410.22333v1 Announce Type: new \nAbstract: Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix between all data points is not always available, either because a result was published without a covariance matrix or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that will behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor to ensure that the results remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits. It then presents some example applications of the methods to real neutrino interaction data and model comparisons."}, "https://arxiv.org/abs/2410.21790": {"title": "Reconstructing East Asian Temperatures from 1368 to 1911 Using Historical Documents, Climate Models, and Data Assimilation", "link": "https://arxiv.org/abs/2410.21790", "description": "arXiv:2410.21790v1 Announce Type: cross \nAbstract: We present a novel approach for reconstructing annual temperatures in East Asia from 1368 to 1911, leveraging the Reconstructed East Asian Climate Historical Encoded Series (REACHES). The lack of instrumental data during this period poses significant challenges to understanding past climate conditions. REACHES digitizes historical documents from the Ming and Qing dynasties of China, converting qualitative descriptions into a four-level ordinal temperature scale. However, these index-based data are biased toward abnormal or extreme weather phenomena, leading to data gaps that likely correspond to normal conditions. To address this bias and reconstruct historical temperatures at any point within East Asia, including locations without direct historical data, we employ a three-tiered statistical framework. First, we perform kriging to interpolate temperature data across East Asia, adopting a zero-mean assumption to handle missing information. Next, we utilize the Last Millennium Ensemble (LME) reanalysis data and apply quantile mapping to calibrate the kriged REACHES data to Celsius temperature scales. Finally, we introduce a novel Bayesian data assimilation method that integrates the kriged Celsius data with LME simulations to enhance reconstruction accuracy. We model the LME data at each geographic location using a flexible nonstationary autoregressive time series model and employ regularized maximum likelihood estimation with a fused lasso penalty. The resulting dynamic distribution serves as a prior, which is refined via Kalman filtering by incorporating the kriged Celsius REACHES data to yield posterior temperature estimates. 
This comprehensive integration of historical documentation, contemporary climate models, and advanced statistical methods improves the accuracy of historical temperature reconstructions and provides a crucial resource for future environmental and climate studies."}, "https://arxiv.org/abs/2410.22119": {"title": "Deep Q-Exponential Processes", "link": "https://arxiv.org/abs/2410.22119", "description": "arXiv:2410.22119v1 Announce Type: cross \nAbstract: Motivated by deep neural networks, the deep Gaussian process (DGP) generalizes the standard GP by stacking multiple layers of GPs. Despite the enhanced expressiveness, GP, as an $L_2$ regularization prior, tends to be over-smooth and sub-optimal for inhomogeneous subjects, such as images with edges. Recently, Q-exponential process (Q-EP) has been proposed as an $L_q$ relaxation to GP and demonstrated with more desirable regularization properties through a parameter $q>0$ with $q=2$ corresponding to GP. Sharing the similar tractability of posterior and predictive distributions with GP, Q-EP can also be stacked to improve its modeling flexibility. In this paper, we generalize Q-EP to deep Q-EP to enjoy both proper regularization and improved expressiveness. The generalization is realized by introducing shallow Q-EP as a latent variable model and then building a hierarchy of the shallow Q-EP layers. Sparse approximation by inducing points and scalable variational strategy are applied to facilitate the inference. We demonstrate the numerical advantages of the proposed deep Q-EP model by comparing with multiple state-of-the-art deep probabilistic models."}, "https://arxiv.org/abs/2303.06384": {"title": "Measuring Information Transfer Between Nodes in a Brain Network through Spectral Transfer Entropy", "link": "https://arxiv.org/abs/2303.06384", "description": "arXiv:2303.06384v3 Announce Type: replace \nAbstract: Brain connectivity characterizes interactions between different regions of a brain network during resting-state or performance of a cognitive task. In studying brain signals such as electroencephalograms (EEG), one formal approach to investigating connectivity is through an information-theoretic causal measure called transfer entropy (TE). To enhance the functionality of TE in brain signal analysis, we propose a novel methodology that captures cross-channel information transfer in the frequency domain. Specifically, we introduce a new measure, the spectral transfer entropy (STE), to quantify the magnitude and direction of information flow from a band-specific oscillation of one channel to another band-specific oscillation of another channel. The main advantage of our proposed approach is that it formulates TE in a novel way to perform inference on band-specific oscillations while maintaining robustness to the inherent problems associated with filtering. In addition, an advantage of STE is that it allows adjustments for multiple comparisons to control false positive rates. Another novel contribution is a simple yet efficient method for estimating STE using vine copula theory. This method can produce an exact zero estimate of STE (which is the boundary point of the parameter space) without the need for bias adjustments. With the vine copula representation, a null copula model, which exhibits zero STE, is defined, thus enabling straightforward significance testing through standard resampling. 
Lastly, we demonstrate the advantage of the proposed STE measure through numerical experiments and provide interesting and novel findings on the analysis of EEG data in a visual-memory experiment."}, "https://arxiv.org/abs/2401.06447": {"title": "Uncertainty-aware multi-fidelity surrogate modeling with noisy data", "link": "https://arxiv.org/abs/2401.06447", "description": "arXiv:2401.06447v2 Announce Type: replace \nAbstract: Emulating high-accuracy computationally expensive models is crucial for tasks requiring numerous model evaluations, such as uncertainty quantification and optimization. When lower-fidelity models are available, they can be used to improve the predictions of high-fidelity models. Multi-fidelity surrogate models combine information from sources of varying fidelities to construct an efficient surrogate model. However, in real-world applications, uncertainty is present in both high- and low-fidelity models due to measurement or numerical noise, as well as lack of knowledge due to the limited experimental design budget. This paper introduces a comprehensive framework for multi-fidelity surrogate modeling that handles noise-contaminated data and is able to estimate the underlying noise-free high-fidelity model. Our methodology quantitatively incorporates the different types of uncertainty affecting the problem and emphasizes delivering precise estimates of the uncertainty in its predictions with respect to both the underlying high-fidelity model and unseen noise-contaminated high-fidelity observations, presented through confidence and prediction intervals, respectively. Additionally, the proposed framework offers a natural approach to combining physical experiments and computational models by treating noisy experimental data as high-fidelity sources and white-box computational models as their low-fidelity counterparts. The effectiveness of our methodology is showcased through synthetic examples and a wind turbine application."}, "https://arxiv.org/abs/2410.22481": {"title": "Bayesian Counterfactual Prediction Models for HIV Care Retention with Incomplete Outcome and Covariate Information", "link": "https://arxiv.org/abs/2410.22481", "description": "arXiv:2410.22481v1 Announce Type: new \nAbstract: Like many chronic diseases, human immunodeficiency virus (HIV) is managed over time at regular clinic visits. At each visit, patient features are assessed, treatments are prescribed, and a subsequent visit is scheduled. There is a need for data-driven methods for both predicting retention and recommending scheduling decisions that optimize retention. Prediction models can be useful for estimating retention rates across a range of scheduling options. However, training such models with electronic health records (EHR) involves several complexities. First, formal causal inference methods are needed to adjust for observed confounding when estimating retention rates under counterfactual scheduling decisions. Second, competing events such as death preclude retention, while censoring events render retention missing. Third, inconsistent monitoring of features such as viral load and CD4 count leads to covariate missingness. This paper presents an all-in-one approach for both predicting HIV retention and optimizing scheduling while accounting for these complexities. We formulate and identify causal retention estimands in terms of potential return-time under a hypothetical scheduling decision. 
Flexible Bayesian approaches are used to model the observed return-time distribution while accounting for competing and censoring events and form posterior point and uncertainty estimates for these estimands. We address the urgent need for data-driven decision support in HIV care by applying our method to EHR from the Academic Model Providing Access to Healthcare (AMPATH) - a consortium of clinics that treat HIV in Western Kenya."}, "https://arxiv.org/abs/2410.22501": {"title": "Order of Addition in Orthogonally Blocked Mixture and Component-Amount Designs", "link": "https://arxiv.org/abs/2410.22501", "description": "arXiv:2410.22501v1 Announce Type: new \nAbstract: Mixture experiments often involve process variables, such as different chemical reactors in a laboratory or varying mixing speeds in a production line. Organizing the runs in orthogonal blocks allows the mixture model to be fitted independently of the process effects, ensuring clearer insights into the role of each mixture component. Current literature on mixture designs in orthogonal blocks ignores the order of addition of mixture components in mixture blends. This paper considers the order of addition of components in mixture and mixture-amount experiments, using the variable total amount taken into orthogonal blocks. The response depends on both the mixture proportions or the amounts of the components and the order of their addition. Mixture designs in orthogonal blocks are constructed to enable the estimation of mixture or component-amount model parameters and the order-of-addition effects. The G-efficiency criterion is used to assess how well the design supports precise and unbiased estimation of the model parameters. The fraction of the Design Space plot is used to provide a visual assessment of the prediction capabilities of a design across the entire design space."}, "https://arxiv.org/abs/2410.22534": {"title": "Bayesian shared parameter joint models for heterogeneous populations", "link": "https://arxiv.org/abs/2410.22534", "description": "arXiv:2410.22534v1 Announce Type: new \nAbstract: Joint models (JMs) for longitudinal and time-to-event data are an important class of biostatistical models in health and medical research. When the study population consists of heterogeneous subgroups, the standard JM may be inadequate and lead to misleading results. Joint latent class models (JLCMs) and their variants have been proposed to incorporate latent class structures into JMs. JLCMs are useful for identifying latent subgroup structures, obtaining a more nuanced understanding of the relationships between longitudinal outcomes, and improving prediction performance. We consider the generic form of JLCM, which poses significant computational challenges for both frequentist and Bayesian approaches due to the numerical intractability and multimodality of the associated model's likelihood or posterior. Focusing on the less explored Bayesian paradigm, we propose a new Bayesian inference framework to tackle key limitations in the existing method. Our algorithm leverages state-of-the-art Markov chain Monte Carlo techniques and parallel computing for parameter estimation and model selection. Through a simulation study, we demonstrate the feasibility and superiority of our proposed method over the existing approach. Our simulations also generate important computational insights and practical guidance for implementing such complex models. 
We illustrate our method using data from the PAQUID prospective cohort study, where we jointly investigate the association between a repeatedly measured cognitive score and the risk of dementia and the latent class structure defined from the longitudinal outcomes."}, "https://arxiv.org/abs/2410.22574": {"title": "Inference in Partially Linear Models under Dependent Data with Deep Neural Networks", "link": "https://arxiv.org/abs/2410.22574", "description": "arXiv:2410.22574v1 Announce Type: new \nAbstract: I consider inference in a partially linear regression model under stationary $\\beta$-mixing data after first stage deep neural network (DNN) estimation. Using the DNN results of Brown (2024), I show that the estimator for the finite dimensional parameter, constructed using DNN-estimated nuisance components, achieves $\\sqrt{n}$-consistency and asymptotic normality. By avoiding sample splitting, I address one of the key challenges in applying machine learning techniques to econometric models with dependent data. In a future version of this work, I plan to extend these results to obtain general conditions for semiparametric inference after DNN estimation of nuisance components, which will allow for considerations such as more efficient estimation procedures, and instrumental variable settings."}, "https://arxiv.org/abs/2410.22617": {"title": "Bayesian Inference for Relational Graph in a Causal Vector Autoregressive Time Series", "link": "https://arxiv.org/abs/2410.22617", "description": "arXiv:2410.22617v1 Announce Type: new \nAbstract: We propose a method for simultaneously estimating a contemporaneous graph structure and autocorrelation structure for a causal high-dimensional vector autoregressive process (VAR). The graph is estimated by estimating the stationary precision matrix using a Bayesian framework. We introduce a novel parameterization that is convenient for jointly estimating the precision matrix and the autocovariance matrices. The methodology based on the new parameterization has several desirable properties. A key feature of the proposed methodology is that it maintains causality of the process in its estimates and also provides a fast feasible way for computing the reduced rank likelihood for a high-dimensional Gaussian VAR. We use sparse priors along with the likelihood under the new parameterization to obtain the posterior of the graphical parameters as well as that of the temporal parameters. An efficient Markov Chain Monte Carlo (MCMC) algorithm is developed for posterior computation. We also establish theoretical consistency properties for the high-dimensional posterior. The proposed methodology shows excellent performance in simulations and real data applications."}, "https://arxiv.org/abs/2410.22675": {"title": "Clustering Computer Mouse Tracking Data with Informed Hierarchical Shrinkage Partition Priors", "link": "https://arxiv.org/abs/2410.22675", "description": "arXiv:2410.22675v1 Announce Type: new \nAbstract: Mouse-tracking data, which record computer mouse trajectories while participants perform an experimental task, provide valuable insights into subjects' underlying cognitive processes. Neuroscientists are interested in clustering the subjects' responses during computer mouse-tracking tasks to reveal patterns of individual decision-making behaviors and identify population subgroups with similar neurobehavioral responses. These data can be combined with neuro-imaging data to provide additional information for personalized interventions. 
In this article, we develop a novel hierarchical shrinkage partition (HSP) prior for clustering summary statistics derived from the trajectories of mouse-tracking data. The HSP model defines a subjects' cluster as a set of subjects that gives rise to more similar (rather than identical) nested partitions of the conditions. The proposed model can incorporate prior information about the partitioning of either subjects or conditions to facilitate clustering, and it allows for deviations of the nested partitions within each subject group. These features distinguish the HSP model from other bi-clustering methods that typically create identical nested partitions of conditions within a subject group. Furthermore, it differs from existing nested clustering methods, which define clusters based on common parameters in the sampling model and identify subject groups by different distributions. We illustrate the unique features of the HSP model on a mouse tracking dataset from a pilot study and in simulation studies. Our results show the ability and effectiveness of the proposed exploratory framework in clustering and revealing possible different behavioral patterns across subject groups."}, "https://arxiv.org/abs/2410.22751": {"title": "Novel Subsampling Strategies for Heavily Censored Reliability Data", "link": "https://arxiv.org/abs/2410.22751", "description": "arXiv:2410.22751v1 Announce Type: new \nAbstract: Computational capability often falls short when confronted with massive data, posing a common challenge in establishing a statistical model or statistical inference method dealing with big data. While subsampling techniques have been extensively developed to downsize the data volume, there is a notable gap in addressing the unique challenge of handling extensive reliability data, in which a common situation is that a large proportion of data is censored. In this article, we propose an efficient subsampling method for reliability analysis in the presence of censoring data, intending to estimate the parameters of lifetime distribution. Moreover, a novel subsampling method for subsampling from severely censored data is proposed, i.e., only a tiny proportion of data is complete. The subsampling-based estimators are given, and their asymptotic properties are derived. The optimal subsampling probabilities are derived through the L-optimality criterion, which minimizes the trace of the product of the asymptotic covariance matrix and a constant matrix. Efficient algorithms are proposed to implement the proposed subsampling methods to address the challenge that optimal subsampling strategy depends on unknown parameter estimation from full data. Real-world hard drive dataset case and simulative empirical studies are employed to demonstrate the superior performance of the proposed methods."}, "https://arxiv.org/abs/2410.22824": {"title": "Surface data imputation with stochastic processes", "link": "https://arxiv.org/abs/2410.22824", "description": "arXiv:2410.22824v1 Announce Type: new \nAbstract: Spurious measurements in surface data are common in technical surfaces. Excluding or ignoring these spurious points may lead to incorrect surface characterization if these points inherit features of the surface. Therefore, data imputation must be applied to ensure that the estimated data points at spurious measurements do not strongly deviate from the true surface and its characteristics. 
Traditional surface data imputation methods rely on simple assumptions and ignore existing knowledge of the surface, yielding suboptimal estimates. In this paper, we propose using stochastic processes for data imputation. This approach, which originates from surface simulation, allows for the straightforward integration of a priori knowledge. We employ Gaussian processes for surface data imputation; both the surfaces and the missing features are generated artificially. Our results demonstrate that the proposed method fills the missing values and interpolates data points with better alignment to the measured surface compared to traditional approaches, particularly when surface features are missing."}, "https://arxiv.org/abs/2410.22989": {"title": "Propensity Score Methods for Local Test Score Equating: Stratification and Inverse Probability Weighting", "link": "https://arxiv.org/abs/2410.22989", "description": "arXiv:2410.22989v1 Announce Type: new \nAbstract: In test equating, ensuring score comparability across different test forms is crucial but particularly challenging when test groups are non-equivalent and no anchor test is available. Local test equating aims to satisfy Lord's equity requirement by conditioning equating transformations on individual-level information, typically using anchor test scores as proxies for latent ability. However, anchor tests are not always available in practice. This paper introduces two novel propensity score-based methods for local equating: stratification and inverse probability weighting (IPW). These methods use covariates to account for group differences, with propensity scores serving as proxies for latent ability differences between test groups. The stratification method partitions examinees into comparable groups based on similar propensity scores, while IPW assigns weights inversely proportional to the probability of group membership. We evaluate these methods through empirical analysis and simulation studies. Results indicate both methods can effectively adjust for group differences, with their relative performance depending on the strength of covariate-ability correlations. The study extends local equating methodology to cases where only covariate information is available, providing testing programs with new tools for ensuring fair score comparability."}, "https://arxiv.org/abs/2410.23081": {"title": "General Bayesian quantile regression for counts via generative modeling", "link": "https://arxiv.org/abs/2410.23081", "description": "arXiv:2410.23081v1 Announce Type: new \nAbstract: Although quantile regression has emerged as a powerful tool for understanding various quantiles of a response variable conditioned on a set of covariates, the development of quantile regression for count responses has received far less attention. This paper proposes a new Bayesian approach to quantile regression for count data, which provides a more flexible and interpretable alternative to the existing approaches. The proposed approach associates the continuous latent variable with the discrete response and nonparametrically estimates the joint distribution of the latent variable and a set of covariates. Then, by regressing the estimated continuous conditional quantile on the covariates, the posterior distributions of the covariate effects on the conditional quantiles are obtained through general Bayesian updating via simple optimization. 
The simulation study and real data analysis demonstrate that the proposed method overcomes the existing limitations and enhances quantile estimation and interpretation of variable relationships, making it a valuable tool for practitioners handling count data."}, "https://arxiv.org/abs/2410.23246": {"title": "Progression: an extrapolation principle for regression", "link": "https://arxiv.org/abs/2410.23246", "description": "arXiv:2410.23246v1 Announce Type: new \nAbstract: The problem of regression extrapolation, or out-of-distribution generalization, arises when predictions are required at test points outside the range of the training data. In such cases, the non-parametric guarantees for regression methods from both statistics and machine learning typically fail. Based on the theory of tail dependence, we propose a novel statistical extrapolation principle. After a suitable, data-adaptive marginal transformation, it assumes a simple relationship between predictors and the response at the boundary of the training predictor samples. This assumption holds for a wide range of models, including non-parametric regression functions with additive noise. Our semi-parametric method, progression, leverages this extrapolation principle and offers guarantees on the approximation error beyond the training data range. We demonstrate how this principle can be effectively integrated with existing approaches, such as random forests and additive models, to improve extrapolation performance on out-of-distribution samples."}, "https://arxiv.org/abs/2410.22502": {"title": "Gender disparities in rehospitalisations after coronary artery bypass grafting: evidence from a functional causal mediation analysis of the MIMIC-IV data", "link": "https://arxiv.org/abs/2410.22502", "description": "arXiv:2410.22502v1 Announce Type: cross \nAbstract: Hospital readmissions following coronary artery bypass grafting (CABG) not only impose a substantial cost burden on healthcare systems but also serve as a potential indicator of the quality of medical care. Previous studies of gender effects on complications after CABG surgery have consistently revealed that women tend to suffer worse outcomes. To better understand the causal pathway from gender to the number of rehospitalisations, we study the postoperative central venous pressure (CVP), frequently recorded over patients' intensive care unit (ICU) stay after the CABG surgery, as a functional mediator. Confronted with time-varying CVP measurements and zero-inflated rehospitalisation counts within 60 days following discharge, we propose a parameter-simulating quasi-Bayesian Monte Carlo approximation method that accommodates a functional mediator and a zero-inflated count outcome for causal mediation analysis. We find a causal relationship between the female gender and increased rehospitalisation counts after CABG, and that time-varying central venous pressure mediates this causal effect."}, "https://arxiv.org/abs/2410.22591": {"title": "FGCE: Feasible Group Counterfactual Explanations for Auditing Fairness", "link": "https://arxiv.org/abs/2410.22591", "description": "arXiv:2410.22591v1 Announce Type: cross \nAbstract: This paper introduces the first graph-based framework for generating group counterfactual explanations to audit model fairness, a crucial aspect of trustworthy machine learning. Counterfactual explanations are instrumental in understanding and mitigating unfairness by revealing how inputs should change to achieve a desired outcome. 
Our framework, named Feasible Group Counterfactual Explanations (FGCEs), captures real-world feasibility constraints and constructs subgroups with similar counterfactuals, setting it apart from existing methods. It also addresses key trade-offs in counterfactual generation, including the balance between the number of counterfactuals, their associated costs, and the breadth of coverage achieved. To evaluate these trade-offs and assess fairness, we propose measures tailored to group counterfactual generation. Our experimental results on benchmark datasets demonstrate the effectiveness of our approach in managing feasibility constraints and trade-offs, as well as the potential of our proposed metrics in identifying and quantifying fairness issues."}, "https://arxiv.org/abs/2410.22647": {"title": "Adaptive Robust Confidence Intervals", "link": "https://arxiv.org/abs/2410.22647", "description": "arXiv:2410.22647v1 Announce Type: cross \nAbstract: This paper studies the construction of adaptive confidence intervals under Huber's contamination model when the contamination proportion is unknown. For the robust confidence interval of a Gaussian mean, we show that the optimal length of an adaptive interval must be exponentially wider than that of a non-adaptive one. An optimal construction is achieved through simultaneous uncertainty quantification of quantiles at all levels. The results are further extended beyond the Gaussian location model by addressing a general family of robust hypothesis testing. In contrast to adaptive robust estimation, our findings reveal that the optimal length of an adaptive robust confidence interval critically depends on the distribution's shape."}, "https://arxiv.org/abs/2410.22703": {"title": "On tail inference in scale-free inhomogeneous random graphs", "link": "https://arxiv.org/abs/2410.22703", "description": "arXiv:2410.22703v1 Announce Type: cross \nAbstract: Both empirical and theoretical investigations of scale-free network models have found that large degrees in a network exert an outsized impact on its structure. However, the tools used to infer the tail behavior of degree distributions in scale-free networks often lack a strong theoretical foundation. In this paper, we introduce a new framework for analyzing the asymptotic distribution of estimators for degree tail indices in scale-free inhomogeneous random graphs. Our framework leverages the relationship between the large weights and large degrees of Norros-Reittu and Chung-Lu random graphs. In particular, we determine a rate for the number of nodes $k(n) \\rightarrow \\infty$ such that for all $i = 1, \\dots, k(n)$, the node with the $i$-th largest weight will have the $i$-th largest degree with high probability. Such alignment of upper-order statistics is then employed to establish the asymptotic normality of three different tail index estimators based on the upper degrees. These results suggest potential applications of the framework to threshold selection and goodness-of-fit testing in scale-free networks, issues that have long challenged the network science community."}, "https://arxiv.org/abs/2410.22754": {"title": "An Overview of Causal Inference using Kernel Embeddings", "link": "https://arxiv.org/abs/2410.22754", "description": "arXiv:2410.22754v1 Announce Type: cross \nAbstract: Kernel embeddings have emerged as a powerful tool for representing probability measures in a variety of statistical inference problems. 
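The degree-based tail estimation discussed in the scale-free graphs entry (arXiv:2410.22703) builds on upper-order-statistics estimators. The sketch below applies the classical Hill estimator to the largest values of a heavy-tailed sample standing in for degrees; it is only an illustration of this kind of estimator, not one of the three estimators whose asymptotic normality the paper establishes.

    # Minimal sketch: a Hill-type tail-index estimate from the largest "degrees".
    # The classical Hill estimator on upper order statistics; illustrative only.
    import numpy as np

    def hill_estimator(degrees, k):
        """Estimate the tail index gamma from the k largest values of `degrees`."""
        x = np.sort(np.asarray(degrees, dtype=float))[::-1]   # decreasing order statistics
        top, threshold = x[:k], x[k]
        return np.mean(np.log(top) - np.log(threshold))       # gamma_hat (= 1/alpha_hat)

    rng = np.random.default_rng(1)
    # Pareto-like synthetic "degrees" with tail index gamma = 0.5 (alpha = 2).
    degrees = (rng.pareto(2.0, size=20_000) + 1.0) * 3.0
    print("gamma_hat:", hill_estimator(degrees, k=500))        # expect roughly 0.5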
By mapping probability measures into a reproducing kernel Hilbert space (RKHS), kernel embeddings enable flexible representations of complex relationships between variables. They serve as a mechanism for efficiently transferring the representation of a distribution downstream to other tasks, such as hypothesis testing or causal effect estimation. In the context of causal inference, the main challenges include identifying causal associations and estimating the average treatment effect from observational data, where confounding variables may obscure direct cause-and-effect relationships. Kernel embeddings provide a robust nonparametric framework for addressing these challenges. They allow for the representations of distributions of observational data and their seamless transformation into representations of interventional distributions to estimate relevant causal quantities. We overview recent research that leverages the expressiveness of kernel embeddings in tandem with causal inference."}, "https://arxiv.org/abs/2410.23147": {"title": "FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection", "link": "https://arxiv.org/abs/2410.23147", "description": "arXiv:2410.23147v1 Announce Type: cross \nAbstract: Traditional decision trees are limited by axis-orthogonal splits, which can perform poorly when true decision boundaries are oblique. While oblique decision tree methods address this limitation, they often face high computational costs, difficulties with multi-class classification, and a lack of effective feature selection. In this paper, we introduce LDATree and FoLDTree, two novel frameworks that integrate Uncorrelated Linear Discriminant Analysis (ULDA) and Forward ULDA into a decision tree structure. These methods enable efficient oblique splits, handle missing values, support feature selection, and provide both class labels and probabilities as model outputs. Through evaluations on simulated and real-world datasets, LDATree and FoLDTree consistently outperform axis-orthogonal and other oblique decision tree methods, achieving accuracy levels comparable to the random forest. The results highlight the potential of these frameworks as robust alternatives to traditional single-tree methods."}, "https://arxiv.org/abs/2410.23155": {"title": "QWO: Speeding Up Permutation-Based Causal Discovery in LiGAMs", "link": "https://arxiv.org/abs/2410.23155", "description": "arXiv:2410.23155v1 Announce Type: cross \nAbstract: Causal discovery is essential for understanding relationships among variables of interest in many scientific domains. In this paper, we focus on permutation-based methods for learning causal graphs in Linear Gaussian Acyclic Models (LiGAMs), where the permutation encodes a causal ordering of the variables. Existing methods in this setting are not scalable due to their high computational complexity. These methods are comprised of two main components: (i) constructing a specific DAG, $\\mathcal{G}^\\pi$, for a given permutation $\\pi$, which represents the best structure that can be learned from the available data while adhering to $\\pi$, and (ii) searching over the space of permutations (i.e., causal orders) to minimize the number of edges in $\\mathcal{G}^\\pi$. We introduce QWO, a novel approach that significantly enhances the efficiency of computing $\\mathcal{G}^\\pi$ for a given permutation $\\pi$. 
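For the kernel-embedding overview (arXiv:2410.22754), the basic building block is the kernel mean embedding, and comparing two embeddings gives the maximum mean discrepancy (MMD). Below is a minimal numpy sketch of an unbiased MMD^2 estimate with an RBF kernel; the bandwidth and the two synthetic samples are assumptions made purely for illustration.

    # Minimal sketch: comparing two samples through RBF kernel mean embeddings via an
    # unbiased MMD^2 estimate; bandwidth and data are illustrative assumptions.
    import numpy as np

    def rbf_kernel(a, b, bandwidth=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    def mmd2_unbiased(x, y, bandwidth=1.0):
        kxx = rbf_kernel(x, x, bandwidth); np.fill_diagonal(kxx, 0.0)
        kyy = rbf_kernel(y, y, bandwidth); np.fill_diagonal(kyy, 0.0)
        kxy = rbf_kernel(x, y, bandwidth)
        m, n = len(x), len(y)
        return kxx.sum() / (m * (m - 1)) + kyy.sum() / (n * (n - 1)) - 2.0 * kxy.mean()

    rng = np.random.default_rng(2)
    x = rng.normal(0.0, 1.0, size=(300, 2))
    y = rng.normal(0.5, 1.0, size=(300, 2))          # shifted distribution
    print("MMD^2 estimate:", mmd2_unbiased(x, y))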
QWO has a speed-up of $O(n^2)$ ($n$ is the number of variables) compared to the state-of-the-art BIC-based method, making it highly scalable. We show that our method is theoretically sound and can be integrated into existing search strategies such as GRASP and hill-climbing-based methods to improve their performance."}, "https://arxiv.org/abs/2410.23174": {"title": "On the fundamental limitations of multiproposal Markov chain Monte Carlo algorithms", "link": "https://arxiv.org/abs/2410.23174", "description": "arXiv:2410.23174v1 Announce Type: cross \nAbstract: We study multiproposal Markov chain Monte Carlo algorithms, such as Multiple-try or generalised Metropolis-Hastings schemes, which have recently received renewed attention due to their amenability to parallel computing. First, we prove that no multiproposal scheme can speed-up convergence relative to the corresponding single proposal scheme by more than a factor of $K$, where $K$ denotes the number of proposals at each iteration. This result applies to arbitrary target distributions and it implies that serial multiproposal implementations are always less efficient than single proposal ones. Secondly, we consider log-concave distributions over Euclidean spaces, proving that, in this case, the speed-up is at most logarithmic in $K$, which implies that even parallel multiproposal implementations are fundamentally limited in the computational gain they can offer. Crucially, our results apply to arbitrary multiproposal schemes and purely rely on the two-step structure of the associated kernels (i.e. first generate $K$ candidate points, then select one among those). Our theoretical findings are validated through numerical simulations."}, "https://arxiv.org/abs/2410.23412": {"title": "BAMITA: Bayesian Multiple Imputation for Tensor Arrays", "link": "https://arxiv.org/abs/2410.23412", "description": "arXiv:2410.23412v1 Announce Type: new \nAbstract: Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework, that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and used to infer trends in species diversity for the population. 
Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation ."}, "https://arxiv.org/abs/2410.23587": {"title": "Fractional Moments by the Moment-Generating Function", "link": "https://arxiv.org/abs/2410.23587", "description": "arXiv:2410.23587v1 Announce Type: new \nAbstract: We introduce a novel method for obtaining a wide variety of moments of a random variable with a well-defined moment-generating function (MGF). We derive new expressions for fractional moments and fractional absolute moments, both central and non-central moments. The new moment expressions are relatively simple integrals that involve the MGF, but do not require its derivatives. We label the new method CMGF because it uses a complex extension of the MGF and can be used to obtain complex moments. We illustrate the new method with three applications where the MGF is available in closed-form, while the corresponding densities and the derivatives of the MGF are either unavailable or very difficult to obtain."}, "https://arxiv.org/abs/2410.23590": {"title": "The Nudge Average Treatment Effect", "link": "https://arxiv.org/abs/2410.23590", "description": "arXiv:2410.23590v1 Announce Type: new \nAbstract: The instrumental variable method is a prominent approach to recover, under certain conditions, valid inference about a treatment causal effect even when unmeasured confounding might be present. In a groundbreaking paper, Imbens and Angrist (1994) established that a valid instrument nonparametrically identifies the average causal effect among compliers, also known as the local average treatment effect, under a certain monotonicity assumption which rules out the existence of so-called defiers. An often-cited attractive property of monotonicity is that it facilitates a causal interpretation of the instrumental variable estimand without restricting the degree of heterogeneity of the treatment causal effect. In this paper, we introduce an alternative, equally straightforward and interpretable condition for identification, which accommodates both the presence of defiers and heterogeneous treatment effects. Mainly, we show that under our new conditions, the instrumental variable estimand recovers the average causal effect for the subgroup of units for whom the treatment is manipulable by the instrument, a subgroup which may consist of both defiers and compliers, therefore recovering an effect estimand we aptly call the Nudge Average Treatment Effect."}, "https://arxiv.org/abs/2410.23706": {"title": "Asynchronous Jump Testing and Estimation in High Dimensions Under Complex Temporal Dynamics", "link": "https://arxiv.org/abs/2410.23706", "description": "arXiv:2410.23706v1 Announce Type: new \nAbstract: Most high dimensional changepoint detection methods assume the error process is stationary and changepoints occur synchronously across dimensions. The violation of these assumptions, which in applied settings is increasingly likely as the dimensionality of the time series being analyzed grows, can dramatically curtail the sensitivity or the accuracy of these methods. We propose AJDN (Asynchronous Jump Detection under Nonstationary noise). AJDN is a high dimensional multiscale jump detection method that tests and estimates jumps in an otherwise smoothly varying mean function for high dimensional time series with nonstationary noise where the jumps across dimensions may not occur at the same time. 
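The estimand reinterpreted in the Nudge Average Treatment Effect entry (arXiv:2410.23590) is the usual instrumental variable (Wald) estimand. The sketch below computes the sample Wald estimator on synthetic data; the data-generating process is an assumption for illustration and none of the paper's identification conditions are encoded.

    # Minimal sketch: the sample Wald / IV estimator with binary instrument Z, binary
    # treatment D and outcome Y; the simulated setup below is purely illustrative.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 50_000
    z = rng.integers(0, 2, n)                        # instrument (e.g., an encouragement)
    u = rng.normal(size=n)                           # unmeasured confounder
    d = (0.3 * z + 0.5 * u + rng.normal(size=n) > 0).astype(int)   # treatment take-up
    y = 2.0 * d + u + rng.normal(size=n)             # outcome with true effect 2

    wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
    print("IV (Wald) estimate:", wald)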
AJDN is correct in the sense that it detects the correct number of jumps with a prescribed probability asymptotically and its accuracy in estimating the locations of the jumps is asymptotically nearly optimal under the asynchronous jump assumption. Through a simulation study we demonstrate AJDN's robustness across a wide variety of stationary and nonstationary high dimensional time series, and we show its strong performance relative to some existing high dimensional changepoint detection methods. We apply AJDN to a seismic time series to demonstrate its ability to accurately detect jumps in real-world high dimensional time series with complex temporal dynamics."}, "https://arxiv.org/abs/2410.23785": {"title": "Machine Learning Debiasing with Conditional Moment Restrictions: An Application to LATE", "link": "https://arxiv.org/abs/2410.23785", "description": "arXiv:2410.23785v1 Announce Type: new \nAbstract: Models with Conditional Moment Restrictions (CMRs) are popular in economics. These models involve finite and infinite dimensional parameters. The infinite dimensional components include conditional expectations, conditional choice probabilities, or policy functions, which might be flexibly estimated using Machine Learning tools. This paper presents a characterization of locally debiased moments for regular models defined by general semiparametric CMRs with possibly different conditioning variables. These moments are appealing as they are known to be less affected by first-step bias. Additionally, we study their existence and relevance. Such results apply to a broad class of smooth functionals of finite and infinite dimensional parameters that do not necessarily appear in the CMRs. As a leading application of our theory, we characterize debiased machine learning for settings of treatment effects with endogeneity, giving necessary and sufficient conditions. We present a large class of relevant debiased moments in this context. We then propose the Compliance Machine Learning Estimator (CML), based on a practically convenient orthogonal relevant moment. We show that the resulting estimand can be written as a convex combination of conditional local average treatment effects (LATE). Altogether, CML enjoys three appealing properties in the LATE framework: (1) local robustness to first-stage estimation, (2) an estimand that can be identified under a minimal relevance condition, and (3) a meaningful causal interpretation. Our numerical experimentation shows satisfactory relative performance of such an estimator. Finally, we revisit the Oregon Health Insurance Experiment, analyzed by Finkelstein et al. (2012). We find that the use of machine learning and CML suggest larger positive effects on health care utilization than previously determined."}, "https://arxiv.org/abs/2410.23786": {"title": "Conformal inference for cell type annotation with graph-structured constraints", "link": "https://arxiv.org/abs/2410.23786", "description": "arXiv:2410.23786v1 Announce Type: new \nAbstract: Conformal inference is a method that provides prediction sets for machine learning models, operating independently of the underlying distributional assumptions and relying solely on the exchangeability of training and test data. Despite its wide applicability and popularity, its application in graph-structured problems remains underexplored. 
This paper addresses this gap by developing an approach that leverages the rich information encoded in the graph structure of predicted classes to enhance the interpretability of conformal sets. Using a motivating example from genomics, specifically imaging-based spatial transcriptomics data and single-cell RNA sequencing data, we demonstrate how incorporating graph-structured constraints can improve the interpretation of cell type predictions. This approach aims to generate more coherent conformal sets that align with the inherent relationships among classes, facilitating clearer and more intuitive interpretations of model predictions. Additionally, we provide a technique to address non-exchangeability, particularly when the distribution of the response variable changes between training and test datasets. We implemented our method in the open-source R package scConform, available at https://github.com/ccb-hms/scConform."}, "https://arxiv.org/abs/2410.23852": {"title": "Estimation and Inference in Dyadic Network Formation Models with Nontransferable Utilities", "link": "https://arxiv.org/abs/2410.23852", "description": "arXiv:2410.23852v1 Announce Type: new \nAbstract: This paper studies estimation and inference in a dyadic network formation model with observed covariates, unobserved heterogeneity, and nontransferable utilities. In the presence of the high dimensional fixed effects, the maximum likelihood estimator is numerically difficult to compute and suffers from the incidental parameter bias. We propose an easy-to-compute one-step estimator for the homophily parameter of interest, which is further refined to achieve $\sqrt{N}$-consistency via split-network jackknife and efficiency by the bootstrap aggregating (bagging) technique. We establish consistency for the estimator of the fixed effects and prove asymptotic normality for the unconditional average partial effects. Simulation studies show that our method works well with finite samples, and an empirical application using the risk-sharing data from Nyakatoke highlights the importance of employing proper statistical inferential procedures."}, "https://arxiv.org/abs/2410.24003": {"title": "On testing for independence between generalized error models of several time series", "link": "https://arxiv.org/abs/2410.24003", "description": "arXiv:2410.24003v1 Announce Type: new \nAbstract: We propose new copula-based models for multivariate time series having continuous or discrete distributions, or a mixture of both. These models include stochastic volatility models and regime-switching models. We also propose statistics for testing independence between the generalized errors of these models, extending previous results of Duchesne, Ghoudi and Remillard (2012) obtained for stochastic volatility models. We define families of empirical processes constructed from lagged generalized errors, and we show that their joint asymptotic distributions are Gaussian and independent of the estimated parameters of the individual time series. Moebius transformations of the empirical processes are used to obtain tractable covariances. Several test statistics are then proposed, based on Cramer-von Mises statistics and dependence measures, as well as graphical methods to visualize the dependence. In addition, numerical experiments are performed to assess the power of the proposed tests. Finally, to show the usefulness of our methodologies, examples of applications for financial data and crime data are given to cover both discrete and continuous cases. 
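The conformal cell-type entry (arXiv:2410.23786) builds on standard split conformal prediction sets for classification. The sketch below shows that generic construction, using 1 minus the predicted probability of the true class as the conformity score; the graph-structured constraints and the non-exchangeability adjustment that the paper adds are not implemented here, and the simulated classification task is an assumption.

    # Minimal sketch: plain split conformal prediction sets for a multi-class classifier.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                               n_classes=4, random_state=0)
    X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    model = LogisticRegression(max_iter=2000).fit(X_fit, y_fit)

    alpha = 0.1
    cal_scores = 1.0 - model.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
    n = len(cal_scores)
    qhat = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

    test_probs = model.predict_proba(X_test)
    pred_sets = test_probs >= 1.0 - qhat             # boolean matrix: is each class in the set?
    coverage = pred_sets[np.arange(len(y_test)), y_test].mean()
    print("empirical coverage:", coverage, "average set size:", pred_sets.sum(1).mean())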
All developed methodologies are implemented in the CRAN package IndGenErrors."}, "https://arxiv.org/abs/2410.24094": {"title": "Adaptive Sphericity Tests for High Dimensional Data", "link": "https://arxiv.org/abs/2410.24094", "description": "arXiv:2410.24094v1 Announce Type: new \nAbstract: In this paper, we investigate sphericity testing in high-dimensional settings, where existing methods primarily rely on sum-type test procedures that often underperform under sparse alternatives. To address this limitation, we propose two max-type test procedures utilizing the sample covariance matrix and the sample spatial-sign covariance matrix, respectively. Furthermore, we introduce two Cauchy combination test procedures that integrate both sum-type and max-type tests, demonstrating their superiority across a wide range of sparsity levels in the alternative hypothesis. Our simulation studies corroborate these findings, highlighting the enhanced performance of our proposed methodologies in high-dimensional sphericity testing."}, "https://arxiv.org/abs/2410.24136": {"title": "Two-sided conformalized survival analysis", "link": "https://arxiv.org/abs/2410.24136", "description": "arXiv:2410.24136v1 Announce Type: new \nAbstract: This paper presents a novel method using conformal prediction to generate two-sided or one-sided prediction intervals for survival times. Specifically, the method provides both lower and upper predictive bounds for individuals deemed sufficiently similar to the non-censored population, while returning only a lower bound for others. The prediction intervals offer finite-sample coverage guarantees, requiring no distributional assumptions other than that the sampled data points are independent and identically distributed. The performance of the procedure is assessed using both synthetic and real-world datasets."}, "https://arxiv.org/abs/2410.24163": {"title": "Improve the Precision of Area Under the Curve Estimation for Recurrent Events Through Covariate Adjustment", "link": "https://arxiv.org/abs/2410.24163", "description": "arXiv:2410.24163v1 Announce Type: new \nAbstract: The area under the curve (AUC) of the mean cumulative function (MCF) has recently been introduced as a novel estimand for evaluating treatment effects in recurrent event settings, capturing a totality of evidence in relation to disease progression. While the Lin-Wei-Yang-Ying (LWYY) model is commonly used for analyzing recurrent events, it relies on the proportional rate assumption between treatment arms, which is often violated in practice. In contrast, the AUC under MCFs does not depend on such proportionality assumptions and offers a clinically interpretable measure of treatment effect. To improve the precision of the AUC estimation while preserving its unconditional interpretability, we propose a nonparametric covariate adjustment approach. This approach guarantees efficiency gain compared to unadjusted analysis, as demonstrated by theoretical asymptotic distributions, and is universally applicable to various randomization schemes, including both simple and covariate-adaptive designs. Extensive simulations across different scenarios further support its advantage in increasing statistical power. 
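The adaptive sphericity entry (arXiv:2410.24094) combines sum-type and max-type tests through the Cauchy combination rule. The sketch below shows that generic combination step with placeholder component p-values; computing the sphericity p-values themselves is the paper's contribution and is not reproduced.

    # Minimal sketch: the Cauchy combination rule for merging p-values; the two input
    # p-values below are hypothetical placeholders for a sum-type and a max-type test.
    import numpy as np

    def cauchy_combination(pvals, weights=None):
        p = np.asarray(pvals, dtype=float)
        w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, float)
        t = np.sum(w * np.tan((0.5 - p) * np.pi))            # Cauchy combination statistic
        return 0.5 - np.arctan(t) / np.pi                    # approximate combined p-value

    p_sum, p_max = 0.04, 0.30                                # hypothetical component p-values
    print("combined p-value:", cauchy_combination([p_sum, p_max]))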
Our findings highlight the importance of covariate adjustment for the analysis of AUC in recurrent event settings, offering practical guidance for its application in randomized clinical trials."}, "https://arxiv.org/abs/2410.24194": {"title": "Bayesian hierarchical models with calibrated mixtures of g-priors for assessing treatment effect moderation in meta-analysis", "link": "https://arxiv.org/abs/2410.24194", "description": "arXiv:2410.24194v1 Announce Type: new \nAbstract: Assessing treatment effect moderation is critical in biomedical research and many other fields, as it guides personalized intervention strategies to improve participant's outcomes. Individual participant-level data meta-analysis (IPD-MA) offers a robust framework for such assessments by leveraging data from multiple trials. However, its performance is often compromised by challenges such as high between-trial variability. Traditional Bayesian shrinkage methods have gained popularity, but are less suitable in this context, as their priors do not discern heterogeneous studies. In this paper, we propose the calibrated mixtures of g-priors methods in IPD-MA to enhance efficiency and reduce risk in the estimation of moderation effects. Our approach incorporates a trial-level sample size tuning function, and a moderator-level shrinkage parameter in the prior, offering a flexible spectrum of shrinkage levels that enables practitioners to evaluate moderator importance, from conservative to optimistic perspectives. Compared with existing Bayesian shrinkage methods, our extensive simulation studies demonstrate that the calibrated mixtures of g-priors exhibit superior performances in terms of efficiency and risk metrics, particularly under high between-trial variability, high model sparsity, weak moderation effects and correlated design matrices. We further illustrate their application in assessing effect moderators of two active treatments for major depressive disorder, using IPD from four randomized controlled trials."}, "https://arxiv.org/abs/2410.23525": {"title": "On the consistency of bootstrap for matching estimators", "link": "https://arxiv.org/abs/2410.23525", "description": "arXiv:2410.23525v1 Announce Type: cross \nAbstract: In a landmark paper, Abadie and Imbens (2008) showed that the naive bootstrap is inconsistent when applied to nearest neighbor matching estimators of the average treatment effect with a fixed number of matches. Since then, this finding has inspired numerous efforts to address the inconsistency issue, typically by employing alternative bootstrap methods. In contrast, this paper shows that the naive bootstrap is provably consistent for the original matching estimator, provided that the number of matches, $M$, diverges. The bootstrap inconsistency identified by Abadie and Imbens (2008) thus arises solely from the use of a fixed $M$."}, "https://arxiv.org/abs/2410.23580": {"title": "Bayesian Hierarchical Model for Synthesizing Registry and Survey Data on Female Breast Cancer Prevalence", "link": "https://arxiv.org/abs/2410.23580", "description": "arXiv:2410.23580v1 Announce Type: cross \nAbstract: In public health, it is critical for policymakers to assess the relationship between the disease prevalence and associated risk factors or clinical characteristics, facilitating effective resources allocation. 
However, for diseases like female breast cancer (FBC), reliable prevalence data at specific geographical levels, such as the county-level, are limited because the gold standard data typically come from long-term cancer registries, which do not necessarily collect needed risk factors. In addition, it remains unclear whether fitting each model separately or jointly results in better estimation. In this paper, we identify two data sources to produce reliable county-level prevalence estimates in Missouri, USA: the population-based Missouri Cancer Registry (MCR) and the survey-based Missouri County-Level Study (CLS). We propose a two-stage Bayesian model to synthesize these sources, accounting for their differences in the methodological design, case definitions, and collected information. The first stage involves estimating the county-level FBC prevalence using the raking method for CLS data and the counting method for MCR data, calibrating the differences in the methodological design and case definition. The second stage includes synthesizing two sources with different sets of covariates using a Bayesian generalized linear mixed model with a Zellner-Siow prior for the coefficients. Our data analyses demonstrate that using both data sources yields better results than relying on either source alone, and that accounting for data source membership matters when there exist systematic differences between these sources. Finally, we translate results into policy making and discuss methodological differences for data synthesis of registry and survey data."}, "https://arxiv.org/abs/2410.23614": {"title": "Hypothesis testing with e-values", "link": "https://arxiv.org/abs/2410.23614", "description": "arXiv:2410.23614v1 Announce Type: cross \nAbstract: This book is written to offer a humble, but unified, treatment of e-values in hypothesis testing. The book is organized into three parts: Fundamental Concepts, Core Ideas, and Advanced Topics. The first part includes three chapters that introduce the basic concepts. The second part includes five chapters of core ideas such as universal inference, log-optimality, e-processes, operations on e-values, and e-values in multiple testing. The third part contains five chapters of advanced topics. We hope that, by putting the materials together in this book, the concept of e-values becomes more accessible for educational, research, and practical use."}, "https://arxiv.org/abs/2410.23838": {"title": "Zero-inflated stochastic block modeling of efficiency-security tradeoffs in weighted criminal networks", "link": "https://arxiv.org/abs/2410.23838", "description": "arXiv:2410.23838v1 Announce Type: cross \nAbstract: Criminal networks arise from the unique attempt to balance a need of establishing frequent ties among affiliates to facilitate the coordination of illegal activities, with the necessity to sparsify the overall connectivity architecture to hide from law enforcement. This efficiency-security tradeoff is also combined with the creation of groups of redundant criminals that exhibit similar connectivity patterns, thus guaranteeing resilient network architectures. State-of-the-art models for such data are not designed to infer these unique structures. In contrast to such solutions, we develop a computationally-tractable Bayesian zero-inflated Poisson stochastic block model (ZIP-SBM), which identifies groups of redundant criminals with similar connectivity patterns, and infers both overt and covert block interactions within and across such groups. 
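For the e-values entry (arXiv:2410.23614), the simplest example of an e-value is a likelihood ratio against a fixed alternative, and independent e-values combine by multiplication. The sketch below uses an assumed simple-vs-simple Gaussian setup purely for illustration; it is not an example taken from the book.

    # Minimal sketch: a likelihood-ratio e-value (expectation 1 under the null) and the
    # product combination of independent e-values; rejecting when e > 1/alpha is valid
    # by Markov's inequality. The N(0,1)-vs-N(0.5,1) setup is an illustrative assumption.
    import numpy as np
    from scipy.stats import norm

    def lr_evalue(x, mu0=0.0, mu1=0.5):
        """Likelihood ratio alternative/null, evaluated on a batch of observations."""
        return np.prod(norm.pdf(x, loc=mu1) / norm.pdf(x, loc=mu0))

    rng = np.random.default_rng(4)
    batches = [rng.normal(0.5, 1.0, size=20) for _ in range(5)]   # data drawn from the alternative
    e_values = [lr_evalue(b) for b in batches]
    e_total = np.prod(e_values)                                   # product of independent e-values
    print("per-batch e-values:", np.round(e_values, 2))
    print("combined e-value:", e_total, "-> reject at level 0.05 if e > 20:", e_total > 20)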
This is accomplished by modeling weighted ties (corresponding to counts of interactions among pairs of criminals) via zero-inflated Poisson distributions with block-specific parameters that quantify complex patterns in the excess of zero ties in each block (security) relative to the distribution of the observed weighted ties within that block (efficiency). The performance of ZIP-SBM is illustrated in simulations and in a study of summits co-attendances in a complex Mafia organization, where we unveil efficiency-security structures adopted by the criminal organization that were hidden to previous analyses."}, "https://arxiv.org/abs/2410.23975": {"title": "Average Controlled and Average Natural Micro Direct Effects in Summary Causal Graphs", "link": "https://arxiv.org/abs/2410.23975", "description": "arXiv:2410.23975v1 Announce Type: cross \nAbstract: In this paper, we investigate the identifiability of average controlled direct effects and average natural direct effects in causal systems represented by summary causal graphs, which are abstractions of full causal graphs, often used in dynamic systems where cycles and omitted temporal information complicate causal inference. Unlike in the traditional linear setting, where direct effects are typically easier to identify and estimate, non-parametric direct effects, which are crucial for handling real-world complexities, particularly in epidemiological contexts where relationships between variables (e.g, genetic, environmental, and behavioral factors) are often non-linear, are much harder to define and identify. In particular, we give sufficient conditions for identifying average controlled micro direct effect and average natural micro direct effect from summary causal graphs in the presence of hidden confounding. Furthermore, we show that the conditions given for the average controlled micro direct effect become also necessary in the setting where there is no hidden confounding and where we are only interested in identifiability by adjustment."}, "https://arxiv.org/abs/2410.24056": {"title": "A Martingale-Free Introduction to Conditional Gaussian Nonlinear Systems", "link": "https://arxiv.org/abs/2410.24056", "description": "arXiv:2410.24056v1 Announce Type: cross \nAbstract: The Conditional Gaussian Nonlinear System (CGNS) is a broad class of nonlinear stochastic dynamical systems. Given the trajectories for a subset of state variables, the remaining follow a Gaussian distribution. Despite the conditionally linear structure, the CGNS exhibits strong nonlinearity, thus capturing many non-Gaussian characteristics observed in nature through its joint and marginal distributions. Desirably, it enjoys closed analytic formulae for the time evolution of its conditional Gaussian statistics, which facilitate the study of data assimilation and other related topics. In this paper, we develop a martingale-free approach to improve the understanding of CGNSs. This methodology provides a tractable approach to proving the time evolution of the conditional statistics by deriving results through time discretization schemes, with the continuous-time regime obtained via a formal limiting process as the discretization time-step vanishes. This discretized approach further allows for developing analytic formulae for optimal posterior sampling of unobserved state variables with correlated noise. These tools are particularly valuable for studying extreme events and intermittency and apply to high-dimensional systems. 
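The ZIP-SBM entry (arXiv:2410.23838) rests on the zero-inflated Poisson likelihood for weighted ties. The sketch below writes out that likelihood for a single block's counts; the block structure, priors, and posterior computation are not shown, and the simulated counts are an assumption.

    # Minimal sketch: the zero-inflated Poisson log-likelihood used inside ZIP-type models.
    import numpy as np
    from scipy.stats import poisson

    def zip_loglik(y, pi, lam):
        """Log-likelihood of counts y under zero-inflation probability pi and Poisson rate lam."""
        y = np.asarray(y)
        ll_zero = np.log(pi + (1.0 - pi) * np.exp(-lam))          # structural or Poisson zero
        ll_pos = np.log(1.0 - pi) + poisson.logpmf(y, lam)        # strictly positive counts
        return np.sum(np.where(y == 0, ll_zero, ll_pos))

    rng = np.random.default_rng(5)
    counts = rng.poisson(3.0, size=200) * (rng.random(200) > 0.4)  # roughly 40% excess zeros
    print("log-likelihood at (pi=0.4, lam=3):", zip_loglik(counts, 0.4, 3.0))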
Moreover, the approach improves the understanding of different sampling methods in characterizing uncertainty. The effectiveness of the framework is demonstrated through a physics-constrained, triad-interaction climate model with cubic nonlinearity and state-dependent cross-interacting noise."}, "https://arxiv.org/abs/2410.24145": {"title": "Conformal prediction of circular data", "link": "https://arxiv.org/abs/2410.24145", "description": "arXiv:2410.24145v1 Announce Type: cross \nAbstract: Split conformal prediction techniques are applied to regression problems with circular responses by introducing a suitable conformity score, leading to prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under exchangeable data. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear response regression model into one suitable for circular responses. When random forests serve as basis models in this projection procedure, we harness the out-of-bag dynamics to eliminate the necessity for a separate calibration sample in the construction of prediction sets. For synthetic and real datasets the resulting projected random forests model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, when compared to the split conformal prediction sets generated by two existing alternative models."}, "https://arxiv.org/abs/2006.00371": {"title": "Ridge Regularization: an Essential Concept in Data Science", "link": "https://arxiv.org/abs/2006.00371", "description": "arXiv:2006.00371v2 Announce Type: replace \nAbstract: Ridge or more formally $\\ell_2$ regularization shows up in many areas of statistics and machine learning. It is one of those essential devices that any good data scientist needs to master for their craft. In this brief ridge fest I have collected together some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics."}, "https://arxiv.org/abs/2111.10721": {"title": "Identifying Dynamic Discrete Choice Models with Hyperbolic Discounting", "link": "https://arxiv.org/abs/2111.10721", "description": "arXiv:2111.10721v4 Announce Type: replace \nAbstract: We study identification of dynamic discrete choice models with hyperbolic discounting. We show that the standard discount factor, present bias factor, and instantaneous utility functions for the sophisticated agent are point-identified from observed conditional choice probabilities and transition probabilities in a finite horizon model. The main idea to achieve identification is to exploit variation in the observed conditional choice probabilities over time. We present the estimation method and demonstrate a good performance of the estimator by simulation."}, "https://arxiv.org/abs/2309.02674": {"title": "Denoising and Multilinear Projected-Estimation of High-Dimensional Matrix-Variate Factor Time Series", "link": "https://arxiv.org/abs/2309.02674", "description": "arXiv:2309.02674v2 Announce Type: replace \nAbstract: This paper proposes a new multi-linear projection method for denoising and estimation of high-dimensional matrix-variate factor time series. It assumes that a $p_1\\times p_2$ matrix-variate time series consists of a dynamically dependent, lower-dimensional matrix-variate factor process and a $p_1\\times p_2$ matrix idiosyncratic series. 
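The ridge entry (arXiv:2006.00371) is built around one closed-form expression. As a reminder, the sketch below computes the ridge solution (X'X + lambda I)^{-1} X'y on synthetic data and compares its norm with ordinary least squares; the penalty value and data are arbitrary choices for illustration.

    # Minimal sketch: closed-form ridge regression on synthetic data.
    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 200, 50
    X = rng.standard_normal((n, p))
    beta = np.concatenate([np.ones(5), np.zeros(p - 5)])
    y = X @ beta + rng.standard_normal(n)

    lam = 5.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print("shrinkage of coefficient norm:", np.linalg.norm(beta_ridge) / np.linalg.norm(beta_ols))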
In addition, the latter series assumes a matrix-variate factor structure such that its row and column covariances may have diverging/spiked eigenvalues to accommodate the case of low signal-to-noise ratio often encountered in applications. We use an iterative projection procedure to reduce the dimensions and noise effects in estimating front and back loading matrices and to obtain faster convergence rates than those of the traditional methods available in the literature. We further introduce a two-way projected Principal Component Analysis to mitigate the diverging noise effects, and implement a high-dimensional white-noise testing procedure to estimate the dimension of the matrix factor process. Asymptotic properties of the proposed method are established if the dimensions and sample size go to infinity. We also use simulations and real examples to assess the performance of the proposed method in finite samples and to compare its forecasting ability with some existing ones in the literature. The proposed method fares well in out-of-sample forecasting. In a supplement, we demonstrate the efficacy of the proposed approach even when the idiosyncratic terms exhibit serial correlations with or without a diverging white noise effect."}, "https://arxiv.org/abs/2311.10900": {"title": "Max-Rank: Efficient Multiple Testing for Conformal Prediction", "link": "https://arxiv.org/abs/2311.10900", "description": "arXiv:2311.10900v3 Announce Type: replace \nAbstract: Multiple hypothesis testing (MHT) commonly arises in various scientific fields, from genomics to psychology, where testing many hypotheses simultaneously increases the risk of Type-I errors. These errors can mislead scientific inquiry, rendering MHT corrections essential. In this paper, we address MHT within the context of conformal prediction, a flexible method for predictive uncertainty quantification. Some conformal prediction settings can require simultaneous testing, and positive dependencies among tests typically exist. We propose a novel correction named $\\texttt{max-rank}$ that leverages these dependencies, whilst ensuring that the joint Type-I error rate is efficiently controlled. Inspired by permutation-based corrections from Westfall & Young (1993), $\\texttt{max-rank}$ exploits rank order information to improve performance, and readily integrates with any conformal procedure. We demonstrate both its theoretical and empirical advantages over the common Bonferroni correction and its compatibility with conformal prediction, highlighting the potential to enhance predictive uncertainty quantification."}, "https://arxiv.org/abs/2312.04601": {"title": "Weak Supervision Performance Evaluation via Partial Identification", "link": "https://arxiv.org/abs/2312.04601", "description": "arXiv:2312.04601v2 Announce Type: replace-cross \nAbstract: Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fr\\'echet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. 
Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications."}, "https://arxiv.org/abs/2411.00191": {"title": "Sharp Bounds on the Variance of General Regression Adjustment in Randomized Experiments", "link": "https://arxiv.org/abs/2411.00191", "description": "arXiv:2411.00191v1 Announce Type: new \nAbstract: Building on statistical foundations laid by Neyman [1923] a century ago, a growing literature focuses on problems of causal inference that arise in the context of randomized experiments where the target of inference is the average treatment effect in a finite population and random assignment determines which subjects are allocated to one of the experimental conditions. In this framework, variances of average treatment effect estimators remain unidentified because they depend on the covariance between treated and untreated potential outcomes, which are never jointly observed. Aronow et al. [2014] provide an estimator for the variance of the difference-in-means estimator that is asymptotically sharp. In practice, researchers often use some form of covariate adjustment, such as linear regression when estimating the average treatment effect. Here we extend the Aronow et al. [2014] result, providing asymptotically sharp variance bounds for general regression adjustment. We apply these results to linear regression adjustment and show benefits both in a simulation as well as an empirical application."}, "https://arxiv.org/abs/2411.00256": {"title": "Bayesian Smoothing and Feature Selection Using variational Automatic Relevance Determination", "link": "https://arxiv.org/abs/2411.00256", "description": "arXiv:2411.00256v1 Announce Type: new \nAbstract: This study introduces Variational Automatic Relevance Determination (VARD), a novel approach tailored for fitting sparse additive regression models in high-dimensional settings. VARD distinguishes itself by its ability to independently assess the smoothness of each feature while enabling precise determination of whether a feature's contribution to the response is zero, linear, or nonlinear. Further, an efficient coordinate descent algorithm is introduced to implement VARD. Empirical evaluations on simulated and real-world data underscore VARD's superiority over alternative variable selection methods for additive models."}, "https://arxiv.org/abs/2411.00346": {"title": "Estimating Broad Sense Heritability via Kernel Ridge Regression", "link": "https://arxiv.org/abs/2411.00346", "description": "arXiv:2411.00346v1 Announce Type: new \nAbstract: The broad sense genetic heritability, which quantifies the total proportion of phenotypic variation in a population due to genetic factors, is crucial for understanding trait inheritance. While many existing methods focus on estimating narrow sense heritability, which accounts only for additive genetic variation, this paper introduces a kernel ridge regression approach to estimate broad-sense heritability. We provide both upper and lower bounds for the estimator. The effectiveness of the proposed method was evaluated through extensive simulations of both synthetic data and real data from the 1000 Genomes Project. 
Additionally, the estimator was applied to data from the Alzheimer's Disease Neuroimaging Initiative to demonstrate its practical utility."}, "https://arxiv.org/abs/2411.00358": {"title": "Inference in a Stationary/Nonstationary Autoregressive Time-Varying-Parameter Model", "link": "https://arxiv.org/abs/2411.00358", "description": "arXiv:2411.00358v1 Announce Type: new \nAbstract: This paper considers nonparametric estimation and inference in first-order autoregressive (AR(1)) models with deterministically time-varying parameters. A key feature of the proposed approach is to allow for time-varying stationarity in some time periods, time-varying nonstationarity (i.e., unit root or local-to-unit root behavior) in other periods, and smooth transitions between the two. The estimation of the AR parameter at any time point is based on a local least squares regression method, where the relevant initial condition is endogenous. We obtain limit distributions for the AR parameter estimator and t-statistic at a given point $\\tau$ in time when the parameter exhibits unit root, local-to-unity, or stationary/stationary-like behavior at time $\\tau$. These results are used to construct confidence intervals and median-unbiased interval estimators for the AR parameter at any specified point in time. The confidence intervals have correct asymptotic coverage probabilities with the coverage holding uniformly over stationary and nonstationary behavior of the observations."}, "https://arxiv.org/abs/2411.00429": {"title": "Unbiased mixed variables distance", "link": "https://arxiv.org/abs/2411.00429", "description": "arXiv:2411.00429v1 Announce Type: new \nAbstract: Defining a distance in a mixed setting requires the quantification of observed differences of variables of different types and of variables that are measured on different scales. There exist several proposals for mixed variable distances, however, such distances tend to be biased towards specific variable types and measurement units. That is, the variable types and scales influence the contribution of individual variables to the overall distance. In this paper, we define unbiased mixed variable distances for which the contributions of individual variables to the overall distance are not influenced by measurement types or scales. We define the relevant concepts to quantify such biases and we provide a general formulation that can be used to construct unbiased mixed variable distances."}, "https://arxiv.org/abs/2411.00471": {"title": "Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models", "link": "https://arxiv.org/abs/2411.00471", "description": "arXiv:2411.00471v1 Announce Type: new \nAbstract: This paper introduces Dirichlet process mixtures of block $g$ priors for model selection and prediction in linear models. These priors are extensions of traditional mixtures of $g$ priors that allow for differential shrinkage for various (data-selected) blocks of parameters while fully accounting for the predictors' correlation structure, providing a bridge between the literatures on model selection and continuous shrinkage priors. We show that Dirichlet process mixtures of block $g$ priors are consistent in various senses and, in particular, that they avoid the conditional Lindley ``paradox'' highlighted by Som et al.(2016). Further, we develop a Markov chain Monte Carlo algorithm for posterior inference that requires only minimal ad-hoc tuning. 
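The time-varying AR entry (arXiv:2411.00358) estimates the AR(1) parameter at each time point by local least squares. The sketch below implements a simple rolling-window version of that idea; the flat window, the bandwidth, and the simulated path are assumptions, and the paper's limit theory and confidence intervals are not reproduced.

    # Minimal sketch: local least-squares estimation of a time-varying AR(1) coefficient.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 1000
    t = np.arange(n)
    rho = 0.6 + 0.39 * np.sin(np.pi * t / n)        # smoothly varying between 0.6 and ~0.99
    y = np.zeros(n)
    for i in range(1, n):
        y[i] = rho[i] * y[i - 1] + rng.standard_normal()

    def local_ar1(y, center, bandwidth=80):
        lo, hi = max(1, center - bandwidth), min(len(y), center + bandwidth)
        x_lag, x_now = y[lo - 1:hi - 1], y[lo:hi]
        return np.dot(x_lag, x_now) / np.dot(x_lag, x_lag)   # local least-squares slope

    for c in (100, 500, 900):
        print(f"t={c}: true rho={rho[c]:.2f}, local LS estimate={local_ar1(y, c):.2f}")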
Finally, we investigate the empirical performance of the prior in various real and simulated datasets. In the presence of a small number of very large effects, Dirichlet process mixtures of block $g$ priors lead to higher power for detecting smaller but significant effects with only a minimal increase in the number of false discoveries."}, "https://arxiv.org/abs/2411.00520": {"title": "Calibrated quantile prediction for Growth-at-Risk", "link": "https://arxiv.org/abs/2411.00520", "description": "arXiv:2411.00520v1 Announce Type: new \nAbstract: Accurate computation of robust estimates for extremal quantiles of empirical distributions is an essential task for a wide range of applicative fields, including economic policymaking and the financial industry. Such estimates are particularly critical in calculating risk measures, such as Growth-at-Risk (GaR). This work proposes a conformal framework to estimate calibrated quantiles, and presents an extensive simulation study and a real-world analysis of GaR to examine its benefits with respect to the state of the art. Our findings show that CP methods consistently improve the calibration and robustness of quantile estimates at all levels. The calibration gains are especially pronounced at extremal quantiles, which are critical for risk assessment and where traditional methods tend to fall short. In addition, we introduce a novel property that guarantees coverage under the exchangeability assumption, providing a valuable tool for managing risks by quantifying and controlling the likelihood of future extreme observations."}, "https://arxiv.org/abs/2411.00644": {"title": "What can we learn from marketing skills as a bipartite network from accredited programs?", "link": "https://arxiv.org/abs/2411.00644", "description": "arXiv:2411.00644v1 Announce Type: new \nAbstract: The relationship between professional skills and higher education programs is modeled as a non-directed bipartite network with binary entries representing the links between 28 skills (as captured by the occupational information network, O*NET) and 258 graduate program summaries (as captured by commercial brochures of graduate programs in marketing with accreditation standards of the Association to Advance Collegiate Schools of Business). While descriptive analysis for skills suggests a qualitative lack of alignment between the job demands captured by O*NET, inferential analyses based on exponential random graph model estimates show that skills' popularity and homophily coexist with a systematic yet weak alignment to job demands for marketing managers."}, "https://arxiv.org/abs/2411.00534": {"title": "Change-point detection in functional time series: Applications to age-specific mortality and fertility", "link": "https://arxiv.org/abs/2411.00534", "description": "arXiv:2411.00534v1 Announce Type: cross \nAbstract: We consider determining change points in a time series of age-specific mortality and fertility curves observed over time. We propose two detection methods for identifying these change points. The first method uses a functional cumulative sum statistic to pinpoint the change point. The second method computes a univariate time series of integrated squared forecast errors after fitting a functional time-series model before applying a change-point detection method to the errors to determine the change point. 
Using Australian age-specific fertility and mortality data, we apply these methods to locate the change points and identify the optimal training period to achieve improved forecast accuracy."}, "https://arxiv.org/abs/2411.00621": {"title": "Nonparametric estimation of Hawkes processes with RKHSs", "link": "https://arxiv.org/abs/2411.00621", "description": "arXiv:2411.00621v1 Announce Type: cross \nAbstract: This paper addresses nonparametric estimation of nonlinear multivariate Hawkes processes, where the interaction functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). Motivated by applications in neuroscience, the model allows complex interaction functions, in order to express exciting and inhibiting effects, but also a combination of both (which is particularly interesting to model the refractory period of neurons), and considers in return that conditional intensities are rectified by the ReLU function. The latter feature incurs several methodological challenges, for which workarounds are proposed in this paper. In particular, it is shown that a representer theorem can be obtained for approximated versions of the log-likelihood and the least-squares criteria. Based on it, we propose an estimation method, that relies on two simple approximations (of the ReLU function and of the integral operator). We provide an approximation bound, justifying the negligible statistical effect of these approximations. Numerical results on synthetic data confirm this fact as well as the good asymptotic behavior of the proposed estimator. It also shows that our method achieves a better performance compared to related nonparametric estimation techniques and suits neuronal applications."}, "https://arxiv.org/abs/1801.09236": {"title": "Structure and Sensitivity in Differential Privacy: Comparing K-Norm Mechanisms", "link": "https://arxiv.org/abs/1801.09236", "description": "arXiv:1801.09236v4 Announce Type: replace \nAbstract: Differential privacy (DP), provides a framework for provable privacy protection against arbitrary adversaries, while allowing the release of summary statistics and synthetic data. We address the problem of releasing a noisy real-valued statistic vector $T$, a function of sensitive data under DP, via the class of $K$-norm mechanisms with the goal of minimizing the noise added to achieve privacy. First, we introduce the sensitivity space of $T$, which extends the concepts of sensitivity polytope and sensitivity hull to the setting of arbitrary statistics $T$. We then propose a framework consisting of three methods for comparing the $K$-norm mechanisms: 1) a multivariate extension of stochastic dominance, 2) the entropy of the mechanism, and 3) the conditional variance given a direction, to identify the optimal $K$-norm mechanism. In all of these criteria, the optimal $K$-norm mechanism is generated by the convex hull of the sensitivity space. Using our methodology, we extend the objective perturbation and functional mechanisms and apply these tools to logistic and linear regression, allowing for private releases of statistical results. 
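The functional change-point entry (arXiv:2411.00534) reduces, in its second method, to applying a change-point detector to a univariate series of integrated squared forecast errors. The sketch below applies a textbook CUSUM change-in-mean detector to such a univariate series; the functional modelling step is skipped and the synthetic error series is an assumption.

    # Minimal sketch: a CUSUM change-in-mean detector on a univariate error series.
    import numpy as np

    def cusum_changepoint(x):
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = np.cumsum(x - x.mean())                  # S_k - (k/n) S_n
        stat = np.abs(s) / (x.std(ddof=1) * np.sqrt(n))
        k = int(np.argmax(stat))
        return k, stat[k]

    rng = np.random.default_rng(8)
    errors = np.concatenate([rng.normal(1.0, 0.3, 60), rng.normal(1.8, 0.3, 40)])  # shift at t=60
    k_hat, stat = cusum_changepoint(errors)
    print("estimated change point:", k_hat, "CUSUM statistic:", round(stat, 2))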
Via simulations and an application to a housing price dataset, we demonstrate that our proposed methodology offers a substantial improvement in utility for the same level of risk."}, "https://arxiv.org/abs/2205.07090": {"title": "Evaluating Forecasts with scoringutils in R", "link": "https://arxiv.org/abs/2205.07090", "description": "arXiv:2205.07090v2 Announce Type: replace \nAbstract: Evaluating forecasts is essential to understand and improve forecasting and make forecasts useful to decision makers. A variety of R packages provide a broad variety of scoring rules, visualisations and diagnostic tools. One particular challenge, which scoringutils aims to address, is handling the complexity of evaluating and comparing forecasts from several forecasters across multiple dimensions such as time, space, and different types of targets. scoringutils extends the existing landscape by offering a convenient and flexible data.table-based framework for evaluating and comparing probabilistic forecasts (forecasts represented by a full predictive distribution). Notably, scoringutils is the first package to offer extensive support for probabilistic forecasts in the form of predictive quantiles, a format that is currently used by several infectious disease Forecast Hubs. The package is easily extendable, meaning that users can supply their own scoring rules or extend existing classes to handle new types of forecasts. scoringutils provides broad functionality to check the data and diagnose issues, to visualise forecasts and missing data, to transform data before scoring, to handle missing forecasts, to aggregate scores, and to visualise the results of the evaluation. The paper presents the package and its core functionality and illustrates common workflows using example data of forecasts for COVID-19 cases and deaths submitted to the European COVID-19 Forecast Hub."}, "https://arxiv.org/abs/2301.10592": {"title": "Hierarchical Regularizers for Reverse Unrestricted Mixed Data Sampling Regressions", "link": "https://arxiv.org/abs/2301.10592", "description": "arXiv:2301.10592v2 Announce Type: replace \nAbstract: Reverse Unrestricted MIxed DAta Sampling (RU-MIDAS) regressions are used to model high-frequency responses by means of low-frequency variables. However, due to the periodic structure of RU-MIDAS regressions, the dimensionality grows quickly if the frequency mismatch between the high- and low-frequency variables is large. Additionally the number of high-frequency observations available for estimation decreases. We propose to counteract this reduction in sample size by pooling the high-frequency coefficients and further reduce the dimensionality through a sparsity-inducing convex regularizer that accounts for the temporal ordering among the different lags. To this end, the regularizer prioritizes the inclusion of lagged coefficients according to the recency of the information they contain. 
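The K-norm entry (arXiv:1801.09236) studies a family of mechanisms whose l1-norm member reduces, coordinate-wise, to the familiar Laplace mechanism. The sketch below releases a differentially private mean with Laplace noise; the assumed data range [0, 1] (and hence the sensitivity 1/n) is an illustrative choice, and none of the paper's comparison framework is implemented.

    # Minimal sketch: the Laplace mechanism for releasing a mean under epsilon-DP.
    import numpy as np

    def dp_mean(x, epsilon, lower=0.0, upper=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        x = np.clip(np.asarray(x, dtype=float), lower, upper)
        sensitivity = (upper - lower) / len(x)              # l1 sensitivity of the mean
        return x.mean() + rng.laplace(scale=sensitivity / epsilon)

    rng = np.random.default_rng(9)
    data = rng.random(1000)
    print("private mean (eps=0.5):", dp_mean(data, epsilon=0.5, rng=rng))
    print("non-private mean      :", data.mean())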
We demonstrate the proposed method on two empirical applications, one on realized volatility forecasting with macroeconomic data and another on demand forecasting for a bicycle-sharing system with ridership data on other transportation types."}, "https://arxiv.org/abs/2411.00795": {"title": "Simulations for estimation of random effects and overall effect in three-level meta-analysis of standardized mean differences using constant and inverse-variance weights", "link": "https://arxiv.org/abs/2411.00795", "description": "arXiv:2411.00795v1 Announce Type: new \nAbstract: We consider a three-level meta-analysis of standardized mean differences. The standard method of estimation uses inverse-variance weights and REML/PL estimation of variance components for the random effects. We introduce new moment-based point and interval estimators for the two variance components and related estimators of the overall mean. Similar to traditional analysis of variance, our method is based on two conditional $Q$ statistics with effective-sample-size weights. We study, by simulation, bias and coverage of these new estimators. For comparison, we also study bias and coverage of the REML/PL-based approach as implemented in {\\it rma.mv} in {\\it metafor}. Our results demonstrate that the new methods are often considerably better and do not have convergence problems, which plague the standard analysis."}, "https://arxiv.org/abs/2411.00886": {"title": "The ET Interview: Professor Joel L", "link": "https://arxiv.org/abs/2411.00886", "description": "arXiv:2411.00886v1 Announce Type: new \nAbstract: Joel L. Horowitz has made profound contributions to many areas in econometrics and statistics. These include bootstrap methods, semiparametric and nonparametric estimation, specification testing, nonparametric instrumental variables estimation, high-dimensional models, functional data analysis, and shape restrictions, among others. Originally trained as a physicist, Joel made a pivotal transition to econometrics, greatly benefiting our profession. Throughout his career, he has collaborated extensively with a diverse range of coauthors, including students, departmental colleagues, and scholars from around the globe. Joel was born in 1941 in Pasadena, California. He attended Stanford for his undergraduate studies and obtained his Ph.D. in physics from Cornell in 1967. He has been Charles E. and Emma H. Morrison Professor of Economics at Northwestern University since 2001. Prior to that, he was a faculty member at the University of Iowa (1982-2001). He has served as a co-editor of Econometric Theory (1992-2000) and Econometrica (2000-2004). He is a Fellow of the Econometric Society and of the American Statistical Association, and an elected member of the International Statistical Institute. The majority of this interview took place in London during June 2022."}, "https://arxiv.org/abs/2411.00921": {"title": "Differentially Private Algorithms for Linear Queries via Stochastic Convex Optimization", "link": "https://arxiv.org/abs/2411.00921", "description": "arXiv:2411.00921v1 Announce Type: new \nAbstract: This article establishes a method to answer a finite set of linear queries on a given dataset while ensuring differential privacy. To achieve this, we formulate the corresponding task as a saddle-point problem, i.e. 
an optimization problem whose solution corresponds to a distribution minimizing the difference between answers to the linear queries based on the true distribution and answers from a differentially private distribution. Against this background, we establish two new algorithms for corresponding differentially private data release: the first is based on the differentially private Frank-Wolfe method, the second combines randomized smoothing with stochastic convex optimization techniques for a solution to the saddle-point problem. While previous works assess the accuracy of differentially private algorithms with reference to the empirical data distribution, a key contribution of our work is a more natural evaluation of the proposed algorithms' accuracy with reference to the true data-generating distribution."}, "https://arxiv.org/abs/2411.00947": {"title": "Asymptotic theory for the quadratic assignment procedure", "link": "https://arxiv.org/abs/2411.00947", "description": "arXiv:2411.00947v1 Announce Type: new \nAbstract: The quadratic assignment procedure (QAP) is a popular tool for analyzing network data in medical and social sciences. To test the association between two network measurements represented by two symmetric matrices, QAP calculates the $p$-value by permuting the units, or equivalently, by simultaneously permuting the rows and columns of one matrix. Its extension to the regression setting, known as the multiple regression QAP, has also gained popularity, especially in psychometrics. However, the statistics theory for QAP has not been fully established in the literature. We fill the gap in this paper. We formulate the network models underlying various QAPs. We derive (a) the asymptotic sampling distributions of some canonical test statistics and (b) the corresponding asymptotic permutation distributions induced by QAP under strong and weak null hypotheses. Task (a) relies on applying the theory of U-statistics, and task (b) relies on applying the theory of double-indexed permutation statistics. The combination of tasks (a) and (b) provides a relatively complete picture of QAP. Overall, our asymptotic theory suggests that using properly studentized statistics in QAP is a robust choice in that it is finite-sample exact under the strong null hypothesis and preserves the asymptotic type one error rate under the weak null hypothesis."}, "https://arxiv.org/abs/2411.00950": {"title": "A Semiparametric Approach to Causal Inference", "link": "https://arxiv.org/abs/2411.00950", "description": "arXiv:2411.00950v1 Announce Type: new \nAbstract: In causal inference, an important problem is to quantify the effects of interventions or treatments. Many studies focus on estimating the mean causal effects; however, these estimands may offer limited insight since two distributions can share the same mean yet exhibit significant differences. Examining the causal effects from a distributional perspective provides a more thorough understanding. In this paper, we employ a semiparametric density ratio model (DRM) to characterize the counterfactual distributions, introducing a framework that assumes a latent structure shared by these distributions. Our model offers flexibility by avoiding strict parametric assumptions on the counterfactual distributions. Specifically, the DRM incorporates a nonparametric component that can be estimated through the method of empirical likelihood (EL), using the data from all the groups stemming from multiple interventions. 
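A minimal sketch of the basic QAP permutation test described above: correlate the off-diagonal entries of the two symmetric matrices, then build the null by simultaneously permuting the rows and columns of one matrix. Only the plain (non-studentized) statistic is shown; the studentized versions the paper recommends are not implemented here.

# Basic QAP permutation test for association between two symmetric network matrices.
import numpy as np

def qap_test(A, B, n_perm=2000, rng=None):
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    mask = ~np.eye(n, dtype=bool)                     # off-diagonal entries only

    def offdiag_corr(X, Y):
        return np.corrcoef(X[mask], Y[mask])[0, 1]

    observed = offdiag_corr(A, B)
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(n)
        null[b] = offdiag_corr(A, B[np.ix_(perm, perm)])   # permute rows and columns together
    p_value = (1 + np.sum(np.abs(null) >= np.abs(observed))) / (n_perm + 1)
    return observed, p_value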
Consequently, the EL-DRM framework enables inference of the counterfactual distribution functions and their functionals, facilitating direct and transparent causal inference from a distributional perspective. Numerical studies on both synthetic and real-world data validate the effectiveness of our approach."}, "https://arxiv.org/abs/2411.01052": {"title": "Multivariate Gini-type discrepancies", "link": "https://arxiv.org/abs/2411.01052", "description": "arXiv:2411.01052v1 Announce Type: new \nAbstract: Measuring distances in a multidimensional setting is a challenging problem, which appears in many fields of science and engineering. In this paper, to measure the distance between two multivariate distributions, we introduce a new measure of discrepancy which is scale invariant and which, in the case of two independent copies of the same distribution, and after normalization, coincides with the scaling invariant multidimensional version of the Gini index recently proposed in [34]. A byproduct of the analysis is an easy-to-handle discrepancy metric, obtained by application of the theory to a pair of Gaussian multidimensional densities. The obtained metric does improve the standard metrics, based on the mean squared error, as it is scale invariant. The importance of this theoretical finding is illustrated by means of a real problem that concerns measuring the importance of Environmental, Social and Governance factors for the growth of small and medium enterprises."}, "https://arxiv.org/abs/2411.01064": {"title": "Empirical Welfare Analysis with Hedonic Budget Constraints", "link": "https://arxiv.org/abs/2411.01064", "description": "arXiv:2411.01064v1 Announce Type: new \nAbstract: We analyze demand settings where heterogeneous consumers maximize utility for product attributes subject to a nonlinear budget constraint. We develop nonparametric methods for welfare-analysis of interventions that change the constraint. Two new findings are Roy's identity for smooth, nonlinear budgets, which yields a Partial Differential Equation system, and a Slutsky-like symmetry condition for demand. Under scalar unobserved heterogeneity and single-crossing preferences, the coefficient functions in the PDEs are nonparametrically identified, and under symmetry, lead to path-independent, money-metric welfare. We illustrate our methods with welfare evaluation of a hypothetical change in relationship between property rent and neighborhood school-quality using British microdata."}, "https://arxiv.org/abs/2411.01065": {"title": "Local Indicators of Mark Association for Spatial Marked Point Processes", "link": "https://arxiv.org/abs/2411.01065", "description": "arXiv:2411.01065v1 Announce Type: new \nAbstract: The emergence of distinct local mark behaviours is becoming increasingly common in the applications of spatial marked point processes. This dynamic highlights the limitations of existing global mark correlation functions in accurately identifying the true patterns of mark associations/variations among points as distinct mark behaviours might dominate one another, giving rise to an incomplete understanding of mark associations. In this paper, we introduce a family of local indicators of mark association (LIMA) functions for spatial marked point processes. These functions are defined on general state spaces and can include marks that are either real-valued or function-valued. 
Unlike global mark correlation functions, which are often distorted by the existence of distinct mark behaviours, LIMA functions reliably identify all types of mark associations and variations among points. Additionally, they accurately determine the interpoint distances where individual points show significant mark associations. Through simulation studies, featuring various scenarios, and four real applications in forestry, criminology, and urban mobility, we study spatial marked point processes in $\\R^2$ and on linear networks with either real-valued or function-valued marks, demonstrating that LIMA functions significantly outperform the existing global mark correlation functions."}, "https://arxiv.org/abs/2411.01249": {"title": "A novel method for synthetic control with interference", "link": "https://arxiv.org/abs/2411.01249", "description": "arXiv:2411.01249v1 Announce Type: new \nAbstract: Synthetic control methods have been widely used for evaluating policy effects in comparative case studies. However, most existing synthetic control methods implicitly rule out interference effects on the untreated units, and their validity may be jeopardized in the presence of interference. In this paper, we propose a novel synthetic control method, which admits interference but does not require modeling the interference structure. Identification of the effects is achieved under the assumption that the number of interfered units is no larger than half of the total number of units minus the dimension of confounding factors. We propose consistent and asymptotically normal estimation and establish statistical inference for the direct and interference effects averaged over post-intervention periods. We evaluate the performance of our method and compare it to competing methods via numerical experiments. A real data analysis, evaluating the effects of the announcement of relocating the US embassy to Jerusalem on the number of Middle Eastern conflicts, provides empirical evidence that the announcement not only increases the number of conflicts in Israel-Palestine but also has interference effects on several other Middle Eastern countries."}, "https://arxiv.org/abs/2411.01250": {"title": "Hierarchical and Density-based Causal Clustering", "link": "https://arxiv.org/abs/2411.01250", "description": "arXiv:2411.01250v1 Announce Type: new \nAbstract: Understanding treatment effect heterogeneity is vital for scientific and policy research. However, identifying and evaluating heterogeneous treatment effects pose significant challenges due to the typically unknown subgroup structure. Recently, a novel approach, causal k-means clustering, has emerged to assess heterogeneity of treatment effect by applying the k-means algorithm to unknown counterfactual regression functions. In this paper, we expand upon this framework by integrating hierarchical and density-based clustering algorithms. We propose plug-in estimators that are simple and readily implementable using off-the-shelf algorithms. Unlike k-means clustering, which requires the margin condition, our proposed estimators do not rely on strong structural assumptions on the outcome process. We go on to study their rate of convergence, and show that under the minimal regularity conditions, the additional cost of causal clustering is essentially the estimation error of the outcome regression functions. 
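One plausible reading of the plug-in idea above, sketched under the assumption that the clustering is applied to estimated counterfactual regression values; the learners, number of clusters, and variable names are illustrative, and this is not the authors' estimator.

# Plug-in causal clustering sketch: off-the-shelf outcome learners, then hierarchical clustering.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import AgglomerativeClustering

def causal_hierarchical_clusters(X, treatment, y, n_clusters=3):
    mu1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[treatment == 1], y[treatment == 1])
    mu0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[treatment == 0], y[treatment == 0])
    pseudo = np.column_stack([mu0.predict(X), mu1.predict(X)])   # estimated counterfactual surfaces
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(pseudo)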
Our findings significantly extend the capabilities of the causal clustering framework, thereby contributing to the progression of methodologies for identifying homogeneous subgroups in treatment response, consequently facilitating more nuanced and targeted interventions. The proposed methods also open up new avenues for clustering with generic pseudo-outcomes. We explore finite sample properties via simulation, and illustrate the proposed methods in voting and employment projection datasets."}, "https://arxiv.org/abs/2411.01317": {"title": "Distributed Pseudo-Likelihood Method for Community Detection in Large-Scale Networks", "link": "https://arxiv.org/abs/2411.01317", "description": "arXiv:2411.01317v1 Announce Type: new \nAbstract: This paper proposes a distributed pseudo-likelihood method (DPL) to conveniently identify the community structure of large-scale networks. Specifically, we first propose a block-wise splitting method to divide large-scale network data into several subnetworks and distribute them among multiple workers. For simplicity, we assume the classical stochastic block model. Then, the DPL algorithm is iteratively implemented for the distributed optimization of the sum of the local pseudo-likelihood functions. At each iteration, the worker updates its local community labels and communicates with the master. The master then broadcasts the combined estimator to each worker for the new iterative steps. Based on the distributed system, DPL significantly reduces the computational complexity of the traditional pseudo-likelihood method using a single machine. Furthermore, to ensure statistical accuracy, we theoretically discuss the requirements of the worker sample size. Moreover, we extend the DPL method to estimate degree-corrected stochastic block models. The superior performance of the proposed distributed algorithm is demonstrated through extensive numerical studies and real data analysis."}, "https://arxiv.org/abs/2411.01381": {"title": "Modeling the restricted mean survival time using pseudo-value random forests", "link": "https://arxiv.org/abs/2411.01381", "description": "arXiv:2411.01381v1 Announce Type: new \nAbstract: The restricted mean survival time (RMST) has become a popular measure to summarize event times in longitudinal studies. Defined as the area under the survival function up to a time horizon $\\tau$ > 0, the RMST can be interpreted as the life expectancy within the time interval [0, $\\tau$]. In addition to its straightforward interpretation, the RMST also allows for the definition of valid estimands for the causal analysis of treatment contrasts in medical studies. In this work, we introduce a non-parametric approach to model the RMST conditional on a set of baseline variables (including, e.g., treatment variables and confounders). Our method is based on a direct modeling strategy for the RMST, using leave-one-out jackknife pseudo-values within a random forest regression framework. In this way, it can be employed to obtain precise estimates of both patient-specific RMST values and confounder-adjusted treatment contrasts. Since our method (termed \"pseudo-value random forest\", PVRF) is model-free, RMST estimates are not affected by restrictive assumptions like the proportional hazards assumption. Particularly, PVRF offers a high flexibility in detecting relevant covariate effects from higher-dimensional data, thereby expanding the range of existing pseudo-value modeling techniques for RMST estimation. 
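A sketch of the pseudo-value construction described above: compute the Kaplan-Meier RMST up to tau, form leave-one-out jackknife pseudo-values, and hand them to any off-the-shelf regressor (a random forest in the paper). The Kaplan-Meier code below is bare-bones and handles ties and censoring only in the simplest way.

import numpy as np

def km_rmst(time, event, tau):
    # Restricted mean survival time: integral of the Kaplan-Meier curve over [0, tau].
    order = np.argsort(time)
    time, event = np.asarray(time, float)[order], np.asarray(event, int)[order]
    n = len(time)
    surv, rmst, prev_t = 1.0, 0.0, 0.0
    for i, (t, d) in enumerate(zip(time, event)):
        if t > tau:
            break
        rmst += surv * (t - prev_t)
        if d == 1:                       # event: step the curve down; censorings leave it flat
            surv *= 1.0 - 1.0 / (n - i)
        prev_t = t
    return rmst + surv * max(tau - prev_t, 0.0)

def rmst_pseudo_values(time, event, tau):
    # Leave-one-out jackknife pseudo-values: n * theta_full - (n - 1) * theta_without_i.
    n = len(time)
    full = km_rmst(time, event, tau)
    return np.array([n * full - (n - 1) * km_rmst(np.delete(time, i), np.delete(event, i), tau)
                     for i in range(n)])

# The pseudo-values can then be regressed on baseline covariates with any learner, e.g.
# sklearn.ensemble.RandomForestRegressor().fit(X, rmst_pseudo_values(time, event, tau)).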
We investigate the properties of our method using simulations and illustrate its use by an application to data from the SUCCESS-A breast cancer trial. Our numerical experiments demonstrate that PVRF yields accurate estimates of both patient-specific RMST values and RMST-based treatment contrasts."}, "https://arxiv.org/abs/2411.01382": {"title": "On MCMC mixing under unidentified nonparametric models with an application to survival predictions under transformation models", "link": "https://arxiv.org/abs/2411.01382", "description": "arXiv:2411.01382v1 Announce Type: new \nAbstract: The multi-modal posterior under unidentified nonparametric models yields poor mixing of Markov Chain Monte Carlo (MCMC), which is a stumbling block to Bayesian predictions. In this article, we conceptualize a prior informativeness threshold that is essentially the variance of posterior modes and expressed by the uncertainty hyperparameters of nonparametric priors. The threshold plays the role of a lower bound of the within-chain MCMC variance to ensure MCMC mixing, and engines prior modification through hyperparameter tuning to descend the mode variance. Our method distinguishes from existing postprocessing methods in that it directly samples well-mixed MCMC chains on the unconstrained space, and inherits the original posterior predictive distribution in predictive inference. Our method succeeds in Bayesian survival predictions under an unidentified nonparametric transformation model, guarded by the inferential theories of the posterior variance, under elicitation of two delicate nonparametric priors. Comprehensive simulations and real-world data analysis demonstrate that our method achieves MCMC mixing and outperforms existing approaches in survival predictions."}, "https://arxiv.org/abs/2411.01383": {"title": "Automated Analysis of Experiments using Hierarchical Garrote", "link": "https://arxiv.org/abs/2411.01383", "description": "arXiv:2411.01383v1 Announce Type: new \nAbstract: In this work, we propose an automatic method for the analysis of experiments that incorporates hierarchical relationships between the experimental variables. We use a modified version of nonnegative garrote method for variable selection which can incorporate hierarchical relationships. The nonnegative garrote method requires a good initial estimate of the regression parameters for it to work well. To obtain the initial estimate, we use generalized ridge regression with the ridge parameters estimated from a Gaussian process prior placed on the underlying input-output relationship. The proposed method, called HiGarrote, is fast, easy to use, and requires no manual tuning. Analysis of several real experiments are presented to demonstrate its benefits over the existing methods."}, "https://arxiv.org/abs/2411.01498": {"title": "Educational Effects in Mathematics: Conditional Average Treatment Effect depending on the Number of Treatments", "link": "https://arxiv.org/abs/2411.01498", "description": "arXiv:2411.01498v1 Announce Type: new \nAbstract: This study examines the educational effect of the Academic Support Center at Kogakuin University. Following the initial assessment, it was suggested that group bias had led to an underestimation of the Center's true impact. To address this issue, the authors applied the theory of causal inference. By using T-learner, the conditional average treatment effect (CATE) of the Center's face-to-face (F2F) personal assistance program was evaluated. 
Extending T-learner, the authors produced a new CATE function that depends on the number of treatments (F2F sessions) and used the estimated function to predict the CATE performance of F2F assistance."}, "https://arxiv.org/abs/2411.01528": {"title": "Enhancing Forecasts Using Real-Time Data Flow and Hierarchical Forecast Reconciliation, with Applications to the Energy Sector", "link": "https://arxiv.org/abs/2411.01528", "description": "arXiv:2411.01528v1 Announce Type: new \nAbstract: A novel framework for hierarchical forecast updating is presented, addressing a critical gap in the forecasting literature. By assuming a temporal hierarchy structure, the innovative approach extends hierarchical forecast reconciliation to effectively manage the challenge posed by partially observed data. This crucial extension makes it possible, in conjunction with real-time data, to obtain updated and coherent forecasts across the entire temporal hierarchy, thereby enhancing decision-making accuracy. The framework involves updating base models in response to new data, which produces revised base forecasts. A subsequent pruning step integrates the newly available data, allowing for the application of any forecast reconciliation method to obtain fully updated reconciled forecasts. Additionally, the framework not only ensures coherence among forecasts but also improves overall accuracy throughout the hierarchy. Its inherent flexibility and interpretability enable users to perform hierarchical forecast updating concisely. The methodology is extensively demonstrated in a simulation study with various settings, comparing different data-generating processes, hierarchies, and reconciliation methods. Practical applicability is illustrated through two case studies in the energy sector, energy generation and solar power data, where the framework yields superior results compared to base models that do not incorporate new data, leading to more precise decision-making outcomes."}, "https://arxiv.org/abs/2411.01588": {"title": "Statistical Inference on High Dimensional Gaussian Graphical Regression Models", "link": "https://arxiv.org/abs/2411.01588", "description": "arXiv:2411.01588v1 Announce Type: new \nAbstract: Gaussian graphical regressions have emerged as a powerful approach for regressing the precision matrix of a Gaussian graphical model on covariates, which, unlike traditional Gaussian graphical models, can help determine how graphs are modulated by high dimensional subject-level covariates, and recover both the population-level and subject-level graphs. To fit the model, a multi-task learning approach achieves lower error rates compared to node-wise regressions. However, due to the high complexity and dimensionality of the Gaussian graphical regression problem, the important task of statistical inference remains unexplored. We propose a class of debiased estimators based on multi-task learners for statistical inference in Gaussian graphical regressions. We show that debiasing can be performed quickly and separately for the multi-task learners. In a key debiasing step that estimates the inverse covariance matrix, we propose a novel projection technique that dramatically reduces computational costs in optimization to scale only with the sample size $n$. 
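Returning to the T-learner used in the Kogakuin University study above, a minimal sketch of that baseline estimator: fit separate outcome models for treated and control units and difference their predictions. The gradient-boosting learners and variable names are illustrative, and the paper's extension to the number of F2F sessions is not shown.

# T-learner sketch: two outcome models, CATE = mu1(x) - mu0(x).
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, treated, y, new_X):
    m1 = GradientBoostingRegressor().fit(X[treated == 1], y[treated == 1])
    m0 = GradientBoostingRegressor().fit(X[treated == 0], y[treated == 0])
    return m1.predict(new_X) - m0.predict(new_X)   # estimated CATE at new covariate values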
We show that our debiased estimators enjoy a fast convergence rate and asymptotically follow a normal distribution, enabling valid statistical inference such as constructing confidence intervals and performing hypothesis testing. Simulation studies confirm the practical utility of the proposed approach, and we further apply it to analyze gene co-expression graph data from a brain cancer study, revealing meaningful biological relationships."}, "https://arxiv.org/abs/2411.01617": {"title": "Changes-In-Changes For Discrete Treatment", "link": "https://arxiv.org/abs/2411.01617", "description": "arXiv:2411.01617v1 Announce Type: new \nAbstract: This paper generalizes the changes-in-changes (CIC) model to handle discrete treatments with more than two categories, extending the binary case of Athey and Imbens (2006). While the original CIC model is well-suited for binary treatments, it cannot accommodate multi-category discrete treatments often found in economic and policy settings. Although recent work has extended CIC to continuous treatments, there remains a gap for multi-category discrete treatments. I introduce a generalized CIC model that adapts the rank invariance assumption to multiple treatment levels, allowing for robust modeling while capturing the distinct effects of varying treatment intensities."}, "https://arxiv.org/abs/2411.01686": {"title": "FRODO: A novel approach to micro-macro multilevel regression", "link": "https://arxiv.org/abs/2411.01686", "description": "arXiv:2411.01686v1 Announce Type: new \nAbstract: Within the field of hierarchical modelling, little attention is paid to micro-macro models: those in which group-level outcomes are dependent on covariates measured at the level of individuals within groups. Although such models are perhaps underrepresented in the literature, they have applications in economics, epidemiology, and the social sciences. Despite the strong mathematical similarities between micro-macro and measurement error models, few efforts have been made to apply the much better-developed methodology of the latter to the former. Here, we present a new empirical Bayesian technique for micro-macro data, called FRODO (Functional Regression On Densities of Observations). The method jointly infers group-specific densities for multilevel covariates and uses them as functional predictors in a functional linear regression, resulting in a model that is analogous to a generalized additive model (GAM). In doing so, it achieves a level of generality comparable to more sophisticated methods developed for errors-in-variables models, while further leveraging the larger group sizes characteristic of multilevel data to provide richer information about the within-group covariate distributions. After explaining the hierarchical structure of FRODO, its power and versatility are demonstrated on several simulated datasets, showcasing its ability to accommodate a wide variety of covariate distributions and regression models."}, "https://arxiv.org/abs/2411.01697": {"title": "A probabilistic diagnostic for Laplace approximations: Introduction and experimentation", "link": "https://arxiv.org/abs/2411.01697", "description": "arXiv:2411.01697v1 Announce Type: new \nAbstract: Many models require integrals of high-dimensional functions: for instance, to obtain marginal likelihoods. Such integrals may be intractable, or too expensive to compute numerically. Instead, we can use the Laplace approximation (LA). 
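As a small, self-contained illustration of the Laplace approximation just introduced, the sketch below compares the LA of a one-dimensional integral against numerical quadrature; the integrand is made up for illustration, and the curvature at the mode is taken by finite differences.

# Laplace approximation of a 1-D integral versus quadrature; the gap reflects how far
# the integrand is from Gaussian shape, which is what the diagnostic above probes.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

f = lambda x: -0.5 * x**2 - 0.25 * x**4               # log-integrand (illustrative)
exact, _ = quad(lambda x: np.exp(f(x)), -np.inf, np.inf)

mode = minimize_scalar(lambda x: -f(x)).x              # mode of the log-integrand
h = 1e-3
curvature = (f(mode + h) - 2 * f(mode) + f(mode - h)) / h**2   # finite-difference f''(mode)
laplace = np.exp(f(mode)) * np.sqrt(2 * np.pi / -curvature)

print(exact, laplace)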
The LA is exact if the function is proportional to a normal density; its effectiveness therefore depends on the function's true shape. Here, we propose the use of the probabilistic numerical framework to develop a diagnostic for the LA and its underlying shape assumptions, modelling the function and its integral as a Gaussian process and devising a \"test\" by conditioning on a finite number of function values. The test is decidedly non-asymptotic and is not intended as a full substitute for numerical integration - rather, it is simply intended to test the feasibility of the assumptions underpinning the LA with as minimal computation. We discuss approaches to optimize and design the test, apply it to known sample functions, and highlight the challenges of high dimensions."}, "https://arxiv.org/abs/2411.01704": {"title": "Understanding the decision-making process of choice modellers", "link": "https://arxiv.org/abs/2411.01704", "description": "arXiv:2411.01704v1 Announce Type: new \nAbstract: Discrete Choice Modelling serves as a robust framework for modelling human choice behaviour across various disciplines. Building a choice model is a semi structured research process that involves a combination of a priori assumptions, behavioural theories, and statistical methods. This complex set of decisions, coupled with diverse workflows, can lead to substantial variability in model outcomes. To better understand these dynamics, we developed the Serious Choice Modelling Game, which simulates the real world modelling process and tracks modellers' decisions in real time using a stated preference dataset. Participants were asked to develop choice models to estimate Willingness to Pay values to inform policymakers about strategies for reducing noise pollution. The game recorded actions across multiple phases, including descriptive analysis, model specification, and outcome interpretation, allowing us to analyse both individual decisions and differences in modelling approaches. While our findings reveal a strong preference for using data visualisation tools in descriptive analysis, it also identifies gaps in missing values handling before model specification. We also found significant variation in the modelling approach, even when modellers were working with the same choice dataset. Despite the availability of more complex models, simpler models such as Multinomial Logit were often preferred, suggesting that modellers tend to avoid complexity when time and resources are limited. Participants who engaged in more comprehensive data exploration and iterative model comparison tended to achieve better model fit and parsimony, which demonstrate that the methodological choices made throughout the workflow have significant implications, particularly when modelling outcomes are used for policy formulation."}, "https://arxiv.org/abs/2411.01723": {"title": "Comparing multilevel and fixed effect approaches in the generalized linear model setting", "link": "https://arxiv.org/abs/2411.01723", "description": "arXiv:2411.01723v1 Announce Type: new \nAbstract: We extend prior work comparing linear multilevel models (MLM) and fixed effect (FE) models to the generalized linear model (GLM) setting, where the coefficient on a treatment variable is of primary interest. This leads to three key insights. (i) First, as in the linear setting, MLM can be thought of as a regularized form of FE. This explains why MLM can show large biases in its treatment coefficient estimates when group-level confounding is present. 
However, unlike the linear setting, there is not an exact equivalence between MLM and regularized FE coefficient estimates in GLMs. (ii) Second, we study a generalization of \"bias-corrected MLM\" (bcMLM) to the GLM setting. Neither FE nor bcMLM entirely solves MLM's bias problem in GLMs, but bcMLM tends to show less bias than does FE. (iii) Third, and finally, just like in the linear setting, MLM's default standard errors can misspecify the true intragroup dependence structure in the GLM setting, which can lead to downwardly biased standard errors. A cluster bootstrap is a more agnostic alternative. Ultimately, for non-linear GLMs, we recommend bcMLM for estimating the treatment coefficient, and a cluster bootstrap for standard errors and confidence intervals. If a bootstrap is not computationally feasible, then we recommend FE with cluster-robust standard errors."}, "https://arxiv.org/abs/2411.01732": {"title": "Alignment and matching tests for high-dimensional tensor signals via tensor contraction", "link": "https://arxiv.org/abs/2411.01732", "description": "arXiv:2411.01732v1 Announce Type: new \nAbstract: We consider two hypothesis testing problems for low-rank and high-dimensional tensor signals, namely the tensor signal alignment and tensor signal matching problems. These problems are challenging due to the high dimension of tensors and lack of meaningful test statistics. By exploiting a recent tensor contraction method, we propose and validate relevant test statistics using eigenvalues of a data matrix resulting from the tensor contraction. The matrix has a long range dependence among its entries, which makes the analysis of the matrix challenging, involved and distinct from standard random matrix theory. Our approach provides a novel framework for addressing hypothesis testing problems in the context of high-dimensional tensor signals."}, "https://arxiv.org/abs/2411.01799": {"title": "Estimating Nonseparable Selection Models: A Functional Contraction Approach", "link": "https://arxiv.org/abs/2411.01799", "description": "arXiv:2411.01799v1 Announce Type: new \nAbstract: We propose a new method for estimating nonseparable selection models. We show that, given the selection rule and the observed selected outcome distribution, the potential outcome distribution can be characterized as the fixed point of an operator, and we prove that this operator is a functional contraction. We propose a two-step semiparametric maximum likelihood estimator to estimate the selection model and the potential outcome distribution. The consistency and asymptotic normality of the estimator are established. Our approach performs well in Monte Carlo simulations and is applicable in a variety of empirical settings where only a selected sample of outcomes is observed. Examples include consumer demand models with only transaction prices, auctions with incomplete bid data, and Roy models with data on accepted wages."}, "https://arxiv.org/abs/2411.01820": {"title": "Dynamic Supervised Principal Component Analysis for Classification", "link": "https://arxiv.org/abs/2411.01820", "description": "arXiv:2411.01820v1 Announce Type: new \nAbstract: This paper introduces a novel framework for dynamic classification in high dimensional spaces, addressing the evolving nature of class distributions over time or other index variables. Traditional discriminant analysis techniques are adapted to learn dynamic decision rules with respect to the index variable. 
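A minimal sketch of the cluster bootstrap recommended above for standard errors and confidence intervals: resample whole groups with replacement, refit the GLM, and take percentile intervals. A plain logistic regression stands in for the treatment-effect model here, and the formula, column names, and replication count are all illustrative.

# Cluster bootstrap for the treatment coefficient of a GLM (logit shown for brevity).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def cluster_bootstrap_ci(df, formula, group_col, param, n_boot=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    groups = df[group_col].unique()
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(groups, size=len(groups), replace=True)   # resample whole clusters
        boot_df = pd.concat([df[df[group_col] == g] for g in sampled], ignore_index=True)
        fit = smf.logit(formula, data=boot_df).fit(disp=0)
        draws.append(fit.params[param])
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

# e.g. cluster_bootstrap_ci(df, "y ~ treatment + x1", "group", "treatment")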
In particular, we propose and study a new supervised dimension reduction method employing kernel smoothing to identify the optimal subspace, and provide a comprehensive examination of this approach for both linear discriminant analysis and quadratic discriminant analysis. We illustrate the effectiveness of the proposed methods through numerical simulations and real data examples. The results show considerable improvements in classification accuracy and computational efficiency. This work contributes to the field by offering a robust and adaptive solution to the challenges of scalability and non-staticity in high-dimensional data classification."}, "https://arxiv.org/abs/2411.01864": {"title": "On the Asymptotic Properties of Debiased Machine Learning Estimators", "link": "https://arxiv.org/abs/2411.01864", "description": "arXiv:2411.01864v1 Announce Type: new \nAbstract: This paper studies the properties of debiased machine learning (DML) estimators under a novel asymptotic framework, offering insights for improving the performance of these estimators in applications. DML is an estimation method suited to economic models where the parameter of interest depends on unknown nuisance functions that must be estimated. It requires weaker conditions than previous methods while still ensuring standard asymptotic properties. Existing theoretical results do not distinguish between two alternative versions of DML estimators, DML1 and DML2. Under a new asymptotic framework, this paper demonstrates that DML2 asymptotically dominates DML1 in terms of bias and mean squared error, formalizing a previous conjecture based on simulation results regarding their relative performance. Additionally, this paper provides guidance for improving the performance of DML2 in applications."}, "https://arxiv.org/abs/2411.02123": {"title": "Uncertainty quantification and multi-stage variable selection for personalized treatment regimes", "link": "https://arxiv.org/abs/2411.02123", "description": "arXiv:2411.02123v1 Announce Type: new \nAbstract: A dynamic treatment regime is a sequence of medical decisions that adapts to the evolving clinical status of a patient over time. To facilitate personalized care, it is crucial to assess the probability of each available treatment option being optimal for a specific patient, while also identifying the key prognostic factors that determine the optimal sequence of treatments. This task has become increasingly challenging due to the growing number of individual prognostic factors typically available. In response to these challenges, we propose a Bayesian model for optimizing dynamic treatment regimes that addresses the uncertainty in identifying optimal decision sequences and incorporates dimensionality reduction to manage high-dimensional individual covariates. The first task is achieved through a suitable augmentation of the model to handle counterfactual variables. For the second, we introduce a novel class of spike-and-slab priors for the multi-stage selection of significant factors, to favor the sharing of information across stages. 
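To make the DML1/DML2 distinction above concrete, here is a sketch for the partially linear model with cross-fitted partialling-out: DML1 solves the moment condition fold by fold and averages the fold estimates, while DML2 pools the residualized data and solves a single moment condition. The random-forest nuisance learners and two folds are illustrative choices.

# Cross-fitted partialling-out estimates of the treatment coefficient, DML1 vs DML2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partially_linear(X, d, y, n_folds=2, seed=0):
    y_res, d_res = np.zeros(len(y)), np.zeros(len(y))
    fold_ids = np.zeros(len(y), dtype=int)
    for k, (train, test) in enumerate(KFold(n_folds, shuffle=True, random_state=seed).split(X)):
        y_res[test] = y[test] - RandomForestRegressor(random_state=seed).fit(X[train], y[train]).predict(X[test])
        d_res[test] = d[test] - RandomForestRegressor(random_state=seed).fit(X[train], d[train]).predict(X[test])
        fold_ids[test] = k
    # DML1: solve the moment condition within each fold, then average the fold estimates.
    theta_dml1 = np.mean([np.sum(d_res[fold_ids == k] * y_res[fold_ids == k]) /
                          np.sum(d_res[fold_ids == k] ** 2) for k in range(n_folds)])
    # DML2: pool residuals from all folds and solve one moment condition.
    theta_dml2 = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    return theta_dml1, theta_dml2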
The effectiveness of the proposed approach is demonstrated through extensive simulation studies and illustrated using clinical trial data on severe acute arterial hypertension."}, "https://arxiv.org/abs/2411.02231": {"title": "Sharp Bounds for Continuous-Valued Treatment Effects with Unobserved Confounders", "link": "https://arxiv.org/abs/2411.02231", "description": "arXiv:2411.02231v1 Announce Type: new \nAbstract: In causal inference, treatment effects are typically estimated under the ignorability, or unconfoundedness, assumption, which is often unrealistic in observational data. By relaxing this assumption and conducting a sensitivity analysis, we introduce novel bounds and derive confidence intervals for the Average Potential Outcome (APO) - a standard metric for evaluating continuous-valued treatment or exposure effects. We demonstrate that these bounds are sharp under a continuous sensitivity model, in the sense that they give the smallest possible interval under this model, and propose a doubly robust version of our estimators. In a comparative analysis with the method of Jesson et al. (2022) (arXiv:2204.10022), using both simulated and real datasets, we show that our approach not only yields sharper bounds but also achieves good coverage of the true APO, with significantly reduced computation times."}, "https://arxiv.org/abs/2411.02239": {"title": "Powerful batch conformal prediction for classification", "link": "https://arxiv.org/abs/2411.02239", "description": "arXiv:2411.02239v1 Announce Type: new \nAbstract: In a supervised classification split conformal/inductive framework with $K$ classes, a calibration sample of $n$ labeled examples is observed for inference on the label of a new unlabeled example. In this work, we explore the case where a \"batch\" of $m$ independent such unlabeled examples is given, and a multivariate prediction set with $1-\\alpha$ coverage should be provided for this batch. Hence, the batch prediction set takes the form of a collection of label vectors of size $m$, while the calibration sample only contains univariate labels. Using the Bonferroni correction consists in concatenating the individual prediction sets at level $1-\\alpha/m$ (Vovk 2013). We propose a uniformly more powerful solution, based on specific combinations of conformal $p$-values that exploit the Simes inequality (Simes 1986). Intuitively, the pooled evidence of fairly \"easy\" examples of the batch can help provide narrower batch prediction sets. We also introduced adaptive versions of the novel procedure that are particularly effective when the batch prediction set is expected to be large. The theoretical guarantees are provided when all examples are iid, as well as more generally when iid is assumed only conditionally within each class. In particular, our results are also valid under a label distribution shift since the distribution of the labels need not be the same in the calibration sample and in the new `batch'. The usefulness of the method is illustrated on synthetic and real data examples."}, "https://arxiv.org/abs/2411.02276": {"title": "A Bayesian Model for Co-clustering Ordinal Data with Informative Missing Entries", "link": "https://arxiv.org/abs/2411.02276", "description": "arXiv:2411.02276v1 Announce Type: new \nAbstract: Several approaches have been proposed in the literature for clustering multivariate ordinal data. These methods typically treat missing values as absent information, rather than recognizing them as valuable for profiling population characteristics. 
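A sketch of the Simes combination that the batch procedure above exploits: given the m per-example conformal p-values for one candidate vector of labels, the combined p-value decides whether that vector enters the batch prediction set. Only the combination step is shown, not the construction of the conformal p-values themselves, and the numbers are illustrative.

# Simes global p-value from m individual p-values.
import numpy as np

def simes_p_value(p_values):
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    m = len(p_sorted)
    return np.min(m * p_sorted / np.arange(1, m + 1))

# A candidate label vector stays in the 1 - alpha batch prediction set when its combined
# p-value exceeds alpha (the Bonferroni concatenation instead requires m * min(p) > alpha).
keep = simes_p_value([0.30, 0.08, 0.45]) > 0.05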
To address this gap, we introduce a Bayesian nonparametric model for co-clustering multivariate ordinal data that treats censored observations as informative, rather than merely missing. We demonstrate that this offers a significant improvement in understanding the underlying structure of the data. Our model exploits the flexibility of two independent Dirichlet processes, allowing us to infer potentially distinct subpopulations that characterize the latent structure of both subjects and variables. The ordinal nature of the data is addressed by introducing latent variables, while a matrix factorization specification is adopted to handle the high dimensionality of the data in a parsimonious way. The conjugate structure of the model enables an explicit derivation of the full conditional distributions of all the random variables in the model, which facilitates seamless posterior inference using a Gibbs sampling algorithm. We demonstrate the method's performance through simulations and by analyzing politician and movie ratings data."}, "https://arxiv.org/abs/2411.02396": {"title": "Fusion of Tree-induced Regressions for Clinico-genomic Data", "link": "https://arxiv.org/abs/2411.02396", "description": "arXiv:2411.02396v1 Announce Type: new \nAbstract: Cancer prognosis is often based on a set of omics covariates and a set of established clinical covariates such as age and tumor stage. Combining these two sets poses challenges. First, dimension difference: clinical covariates should be favored because they are low-dimensional and usually have stronger prognostic ability than high-dimensional omics covariates. Second, interactions: genetic profiles and their prognostic effects may vary across patient subpopulations. Last, redundancy: a (set of) gene(s) may encode similar prognostic information as a clinical covariate. To address these challenges, we combine regression trees, employing clinical covariates only, with a fusion-like penalized regression framework in the leaf nodes for the omics covariates. The fusion penalty controls the variability in genetic profiles across subpopulations. We prove that the shrinkage limit of the proposed method equals a benchmark model: a ridge regression with penalized omics covariates and unpenalized clinical covariates. Furthermore, the proposed method allows researchers to evaluate, for different subpopulations, whether the overall omics effect enhances prognosis compared to only employing clinical covariates. In an application to colorectal cancer prognosis based on established clinical covariates and 20,000+ gene expressions, we illustrate the features of our method."}, "https://arxiv.org/abs/2411.00794": {"title": "HOUND: High-Order Universal Numerical Differentiator for a Parameter-free Polynomial Online Approximation", "link": "https://arxiv.org/abs/2411.00794", "description": "arXiv:2411.00794v1 Announce Type: cross \nAbstract: This paper introduces a scalar numerical differentiator, represented as a system of nonlinear differential equations of any high order. We derive the explicit solution for this system and demonstrate that, with a suitable choice of differentiator order, the error converges to zero for polynomial signals with additive white noise. In more general cases, the error remains bounded, provided that the highest estimated derivative is also bounded. A notable advantage of this numerical differentiation method is that it does not require tuning parameters based on the specific characteristics of the signal being differentiated. 
We propose a discretization method for the equations that implements a cumulative smoothing algorithm for time series. This algorithm operates online, without the need for data accumulation, and it solves both interpolation and extrapolation problems without fitting any coefficients to the data."}, "https://arxiv.org/abs/2411.00839": {"title": "CausAdv: A Causal-based Framework for Detecting Adversarial Examples", "link": "https://arxiv.org/abs/2411.00839", "description": "arXiv:2411.00839v1 Announce Type: cross \nAbstract: Deep learning has led to tremendous success in many real-world applications of computer vision, thanks to sophisticated architectures such as convolutional neural networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations in inputs. These inputs appear almost indistinguishable from natural images, yet they are incorrectly classified by CNN architectures. This vulnerability to adversarial examples has led researchers to focus on enhancing the robustness of deep learning models in general, and CNNs in particular, by creating defense and detection methods to distinguish adversarial inputs from natural ones. In this paper, we address the adversarial robustness of CNNs through causal reasoning.\n We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. Then we perform statistical analysis on the filters' CI of every sample, whether clean or adversarial, to demonstrate how adversarial examples indeed exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarial detection without the need to train a separate detector. In addition, we illustrate the efficiency of causal explanations as a helpful detection technique through visualizing the causal features. The results can be reproduced using the code available in the repository: https://github.com/HichemDebbi/CausAdv."}, "https://arxiv.org/abs/2411.00945": {"title": "Higher-Order Causal Message Passing for Experimentation with Complex Interference", "link": "https://arxiv.org/abs/2411.00945", "description": "arXiv:2411.00945v1 Announce Type: cross \nAbstract: Accurate estimation of treatment effects is essential for decision-making across various scientific fields. This task, however, becomes challenging in areas like social sciences and online marketplaces, where treating one experimental unit can influence outcomes for others through direct or indirect interactions. Such interference can lead to biased treatment effect estimates, particularly when the structure of these interactions is unknown. We address this challenge by introducing a new class of estimators based on causal message-passing, specifically designed for settings with pervasive, unknown interference. Our estimator draws on information from the sample mean and variance of unit outcomes and treatments over time, enabling efficient use of observed data to estimate the evolution of the system state. Concretely, we construct non-linear features from the moments of unit outcomes and treatments and then learn a function that maps these features to future mean and variance of unit outcomes. This allows for the estimation of the treatment effect over time. 
Extensive simulations across multiple domains, using synthetic and real network data, demonstrate the efficacy of our approach in estimating total treatment effect dynamics, even in cases where interference exhibits non-monotonic behavior in the probability of treatment."}, "https://arxiv.org/abs/2411.01100": {"title": "Transfer Learning Between U", "link": "https://arxiv.org/abs/2411.01100", "description": "arXiv:2411.01100v1 Announce Type: cross \nAbstract: For the 2024 U.S. presidential election, would negative, digital ads against Donald Trump impact voter turnout in Pennsylvania (PA), a key \"tipping point\" state? The gold standard to address this question, a randomized experiment where voters get randomized to different ads, yields unbiased estimates of the ad effect, but is very expensive. Instead, we propose a less-than-ideal, but significantly cheaper and likely faster framework based on transfer learning, where we transfer knowledge from a past ad experiment in 2020 to evaluate ads for 2024. A key component of our framework is a sensitivity analysis that quantifies the unobservable differences between past and future elections, which can be calibrated in a data-driven manner. We propose two estimators of the 2024 ad effect: a simple regression estimator with bootstrap, which we recommend for practitioners in this field, and an estimator based on the efficient influence function for broader applications. Using our framework, we estimate the effect of running a negative, digital ad campaign against Trump on voter turnout in PA for the 2024 election. Our findings indicate effect heterogeneity across counties of PA and among important subgroups stratified by gender, urbanicity, and education attainment."}, "https://arxiv.org/abs/2411.01292": {"title": "Causal reasoning in difference graphs", "link": "https://arxiv.org/abs/2411.01292", "description": "arXiv:2411.01292v1 Announce Type: cross \nAbstract: In epidemiology, understanding causal mechanisms across different populations is essential for designing effective public health interventions. Recently, difference graphs have been introduced as a tool to visually represent causal variations between two distinct populations. While there has been progress in inferring these graphs from data through causal discovery methods, there remains a gap in systematically leveraging their potential to enhance causal reasoning. This paper addresses that gap by establishing conditions for identifying causal changes and effects using difference graphs and observational data. It specifically focuses on identifying total causal changes and total effects in a nonparametric framework, as well as direct causal changes and direct effects in a linear context. In doing so, it provides a novel approach to causal reasoning that holds potential for various public health applications."}, "https://arxiv.org/abs/2411.01295": {"title": "Marginal Causal Flows for Validation and Inference", "link": "https://arxiv.org/abs/2411.01295", "description": "arXiv:2411.01295v1 Announce Type: cross \nAbstract: Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. 
In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets."}, "https://arxiv.org/abs/2411.01319": {"title": "Efficient Nested Estimation of CoVaR: A Decoupled Approach", "link": "https://arxiv.org/abs/2411.01319", "description": "arXiv:2411.01319v1 Announce Type: cross \nAbstract: This paper addresses the estimation of the systemic risk measure known as CoVaR, which quantifies the risk of a financial portfolio conditional on another portfolio being at risk. We identify two principal challenges: conditioning on a zero-probability event and the repricing of portfolios. To tackle these issues, we propose a decoupled approach utilizing smoothing techniques and develop a model-independent theoretical framework grounded in a functional perspective. We demonstrate that the rate of convergence of the decoupled estimator can achieve approximately $O_{\\rm P}(\\Gamma^{-1/2})$, where $\\Gamma$ represents the computational budget. Additionally, we establish the smoothness of the portfolio loss functions, highlighting its crucial role in enhancing sample efficiency. Our numerical results confirm the effectiveness of the decoupled estimators and provide practical insights for the selection of appropriate smoothing techniques."}, "https://arxiv.org/abs/2411.01394": {"title": "Centrality in Collaboration: A Novel Algorithm for Social Partitioning Gradients in Community Detection for Multiple Oncology Clinical Trial Enrollments", "link": "https://arxiv.org/abs/2411.01394", "description": "arXiv:2411.01394v1 Announce Type: cross \nAbstract: Patients at a comprehensive cancer center who do not achieve cure or remission following standard treatments often become candidates for clinical trials. Patients who participate in a clinical trial may be suitable for other studies. A key factor influencing patient enrollment in subsequent clinical trials is the structured collaboration between oncologists and most responsible physicians. Possible identification of these collaboration networks can be achieved through the analysis of patient movements between clinical trial intervention types with social network analysis and community detection algorithms. In the detection of oncologist working groups, the present study evaluates three community detection algorithms: Girvan-Newman, Louvain and an algorithm developed by the author. Girvan-Newman identifies each intervention as their own community, while Louvain groups interventions in a manner that is difficult to interpret. In contrast, the author's algorithm groups interventions in a way that is both intuitive and informative, with a gradient effect that is particularly useful for epidemiological research. 
This lays the groundwork for future subgroup analysis of clustered interventions."}, "https://arxiv.org/abs/2411.01694": {"title": "Modeling Home Range and Intra-Specific Spatial Interaction in Wild Animal Populations", "link": "https://arxiv.org/abs/2411.01694", "description": "arXiv:2411.01694v1 Announce Type: cross \nAbstract: Interactions among individuals from the same-species of wild animals are an important component of population dynamics. An interaction can be either static (based on overlap of space use) or dynamic (based on movement). The goal of this work is to determine the level of static interactions between individuals from the same-species of wild animals using 95\\% and 50\\% home ranges, as well as to model their movement interactions, which could include attraction, avoidance (or repulsion), or lack of interaction, in order to gain new insights and improve our understanding of ecological processes. Home range estimation methods (minimum convex polygon, kernel density estimator, and autocorrelated kernel density estimator), inhomogeneous multitype (or cross-type) summary statistics, and envelope testing methods (pointwise and global envelope tests) were proposed to study the nature of the same-species wild-animal spatial interactions. Using GPS collar data, we applied the methods to quantify both static and dynamic interactions between black bears in southern Alabama, USA. In general, our findings suggest that the black bears in our dataset showed no significant preference to live together or apart, i.e., there was no significant deviation from independence toward association or avoidance (i.e., segregation) between the bears."}, "https://arxiv.org/abs/2411.01772": {"title": "Joint optimization for production operations considering reworking", "link": "https://arxiv.org/abs/2411.01772", "description": "arXiv:2411.01772v1 Announce Type: cross \nAbstract: In pursuit of enhancing the comprehensive efficiency of production systems, our study focused on the joint optimization problem of scheduling and machine maintenance in scenarios where product rework occurs. The primary challenge lies in the interdependence between product \\underline{q}uality, machine \\underline{r}eliability, and \\underline{p}roduction scheduling, compounded by the uncertainties from machine degradation and product quality, which is prevalent in sophisticated manufacturing systems. To address this issue, we investigated the dynamic relationship among these three aspects, named as QRP-co-effect. On this basis, we constructed an optimization model that integrates production scheduling, machine maintenance, and product rework decisions, encompassing the context of stochastic degradation and product quality uncertainties within a mixed-integer programming problem. To effectively solve this problem, we proposed a dual-module solving framework that integrates planning and evaluation for solution improvement via dynamic communication. By analyzing the structural properties of this joint optimization problem, we devised an efficient solving algorithm with an interactive mechanism that leverages \\emph{in-situ} condition information regarding the production system's state and computational resources. The proposed methodology has been validated through comparative and ablation experiments. 
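Of the home-range estimators listed above, the minimum convex polygon is simple enough to sketch directly: keep the fixes closest to the centroid at the chosen percentage and take the area of their convex hull. Coordinates are assumed to be projected (e.g. UTM metres); the kernel and autocorrelated kernel estimators are not shown, and the variable names are illustrative.

# Minimum convex polygon (MCP) home-range area from GPS fixes.
import numpy as np
from scipy.spatial import ConvexHull

def mcp_area(xy, percent=95):
    xy = np.asarray(xy, dtype=float)
    dist = np.linalg.norm(xy - xy.mean(axis=0), axis=1)        # distance to the centroid
    keep = xy[dist <= np.percentile(dist, percent)]            # drop the farthest fixes
    return ConvexHull(keep).volume                             # for 2-D points, .volume is the area

# e.g. mcp_area(gps_fixes_utm, percent=50) for the core (50%) home range (hypothetical array)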
The experimental results demonstrated the significant enhancement of production system efficiency, along with a reduction in machine maintenance costs in scenarios involving rework."}, "https://arxiv.org/abs/2411.01881": {"title": "Causal Discovery and Classification Using Lempel-Ziv Complexity", "link": "https://arxiv.org/abs/2411.01881", "description": "arXiv:2411.01881v1 Announce Type: cross \nAbstract: Inferring causal relationships in the decision-making processes of machine learning algorithms is a crucial step toward achieving explainable Artificial Intelligence (AI). In this research, we introduce a novel causality measure and a distance metric derived from Lempel-Ziv (LZ) complexity. We explore how the proposed causality measure can be used in decision trees by enabling splits based on features that most strongly \\textit{cause} the outcome. We further evaluate the effectiveness of the causality-based decision tree and the distance-based decision tree in comparison to a traditional decision tree using Gini impurity. While the proposed methods demonstrate comparable classification performance overall, the causality-based decision tree significantly outperforms both the distance-based decision tree and the Gini-based decision tree on datasets generated from causal models. This result indicates that the proposed approach can capture insights beyond those of classical decision trees, especially in causally structured data. Based on the features used in the LZ causal measure based decision tree, we introduce a causal strength for each features in the dataset so as to infer the predominant causal variables for the occurrence of the outcome."}, "https://arxiv.org/abs/2411.02221": {"title": "Targeted Learning for Variable Importance", "link": "https://arxiv.org/abs/2411.02221", "description": "arXiv:2411.02221v1 Announce Type: cross \nAbstract: Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results."}, "https://arxiv.org/abs/2411.02374": {"title": "Identifying Economic Factors Affecting Unemployment Rates in the United States", "link": "https://arxiv.org/abs/2411.02374", "description": "arXiv:2411.02374v1 Announce Type: cross \nAbstract: In this study, we seek to understand how macroeconomic factors such as GDP, inflation, Unemployment Insurance, and S&P 500 index; as well as microeconomic factors such as health, race, and educational attainment impacted the unemployment rate for about 20 years in the United States. 
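As an illustration of the Lempel-Ziv ingredients referenced in the entry above, the sketch below computes an LZ78-style phrase-counting complexity and a simple concatenation-based score; both are generic proxies chosen for brevity, not the paper's exact LZ measure or causality definition.

```python
# Illustrative sketch of a Lempel-Ziv-style complexity and a simple score built
# from it. The parsing below is LZ78-style phrase counting, used here only as a
# proxy; the paper's exact LZ measure and causality definition may differ.
def lz_complexity(seq: str) -> int:
    """Number of distinct phrases in an incremental (LZ78-style) parsing."""
    phrases, phrase = set(), ""
    for symbol in seq:
        phrase += symbol
        if phrase not in phrases:
            phrases.add(phrase)
            phrase = ""
    return len(phrases) + (1 if phrase else 0)

def lz_score(x: str, y: str) -> float:
    """Hypothetical score: reduction in the complexity of y when it is parsed
    after x, i.e. lz(y) - (lz(x + y) - lz(x)). Larger values suggest that x
    carries structure useful for describing y."""
    return lz_complexity(y) - (lz_complexity(x + y) - lz_complexity(x))

# Toy example: y copies x with a one-step lag, so x should help describe y
x = "0110100110010110" * 4
y = x[1:] + "0"
print(lz_complexity(x), lz_complexity(y), lz_score(x, y))
```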
Our research question is to identify which factor(s) contributed the most to the unemployment rate surge using linear regression. Results from our studies showed that GDP (negative), inflation (positive), Unemployment Insurance (contrary to popular opinion; negative), and S&P 500 index (negative) were all significant factors, with inflation being the most important one. As for health factors, our model produced correlation scores for occurrences of Cardiovascular Disease, Neurological Disease, and Interpersonal Violence with unemployment. Race as a factor showed a huge discrepancy in the unemployment rate between Black Americans and their counterparts. Asians had the lowest unemployment rate throughout the years. As for educational attainment, results showed that higher educational attainment significantly reduced one's chance of unemployment. People with higher degrees had the lowest unemployment rate. Results of this study will be beneficial for policymakers and researchers in understanding the unemployment rate during the pandemic."}, "https://arxiv.org/abs/2411.02380": {"title": "Robust Bayesian regression in astronomy", "link": "https://arxiv.org/abs/2411.02380", "description": "arXiv:2411.02380v1 Announce Type: cross \nAbstract: Model mis-specification (e.g. the presence of outliers) is commonly encountered in astronomical analyses, often requiring the use of ad hoc algorithms (e.g. sigma-clipping). We develop and implement a generic Bayesian approach to linear regression, based on Student's t-distributions, that is robust to outliers and mis-specification of the noise model. Our method is validated using simulated datasets with various degrees of model mis-specification; the derived constraints are shown to be systematically less biased than those from a similar model using normal distributions. We demonstrate that, for a dataset without outliers, a worst-case inference using t-distributions would give unbiased results with $\\lesssim\\!10$ per cent increase in the reported parameter uncertainties. We also compare with existing analyses of real-world datasets, finding qualitatively different results where normal distributions have been used and agreement where more robust methods have been applied. A Python implementation of this model, t-cup, is made available for others to use."}, "https://arxiv.org/abs/2204.10291": {"title": "Structural Nested Mean Models Under Parallel Trends Assumptions", "link": "https://arxiv.org/abs/2204.10291", "description": "arXiv:2204.10291v5 Announce Type: replace \nAbstract: We link and extend two approaches to estimating time-varying treatment effects on repeated continuous outcomes--time-varying Difference in Differences (DiD; see Roth et al. (2023) and Chaisemartin et al. (2023) for reviews) and Structural Nested Mean Models (SNMMs; see Vansteelandt and Joffe (2014) for a review). In particular, we show that SNMMs, which were previously only known to be nonparametrically identified under a no unobserved confounding assumption, are also identified under a generalized version of the parallel trends assumption typically used to justify time-varying DiD methods. Because SNMMs model a broader set of causal estimands, our results allow practitioners of existing time-varying DiD approaches to address additional types of substantive questions under similar assumptions.
SNMMs enable estimation of time-varying effect heterogeneity, lasting effects of a `blip' of treatment at a single time point, effects of sustained interventions (possibly on continuous or multi-dimensional treatments) when treatment repeatedly changes value in the data, controlled direct effects, effects of dynamic treatment strategies that depend on covariate history, and more. Our results also allow analysts who apply SNMMs under the no unobserved confounding assumption to estimate some of the same causal effects under alternative identifying assumptions. We provide a method for sensitivity analysis to violations of our parallel trends assumption. We further explain how to estimate optimal treatment regimes via optimal regime SNMMs under parallel trends assumptions plus an assumption that there is no effect modification by unobserved confounders. Finally, we illustrate our methods with real data applications estimating effects of Medicaid expansion on uninsurance rates, effects of floods on flood insurance take-up, and effects of sustained changes in temperature on crop yields."}, "https://arxiv.org/abs/2212.05515": {"title": "Reliability Study of Battery Lives: A Functional Degradation Analysis Approach", "link": "https://arxiv.org/abs/2212.05515", "description": "arXiv:2212.05515v2 Announce Type: replace \nAbstract: Renewable energy is critical for combating climate change, whose first step is the storage of electricity generated from renewable energy sources. Li-ion batteries are a popular kind of storage units. Their continuous usage through charge-discharge cycles eventually leads to degradation. This can be visualized in plotting voltage discharge curves (VDCs) over discharge cycles. Studies of battery degradation have mostly concentrated on modeling degradation through one scalar measurement summarizing each VDC. Such simplification of curves can lead to inaccurate predictive models. Here we analyze the degradation of rechargeable Li-ion batteries from a NASA data set through modeling and predicting their full VDCs. With techniques from longitudinal and functional data analysis, we propose a new two-step predictive modeling procedure for functional responses residing on heterogeneous domains. We first predict the shapes and domain end points of VDCs using functional regression models. Then we integrate these predictions to perform a degradation analysis. Our approach is fully functional, allows the incorporation of usage information, produces predictions in a curve form, and thus provides flexibility in the assessment of battery degradation. Through extensive simulation studies and cross-validated data analysis, our approach demonstrates better prediction than the existing approach of modeling degradation directly with aggregated data."}, "https://arxiv.org/abs/2309.13441": {"title": "Anytime valid and asymptotically optimal inference driven by predictive recursion", "link": "https://arxiv.org/abs/2309.13441", "description": "arXiv:2309.13441v4 Announce Type: replace \nAbstract: Distinguishing two candidate models is a fundamental and practically important statistical problem. Error rate control is crucial to the testing logic but, in complex nonparametric settings, can be difficult to achieve, especially when the stopping rule that determines the data collection process is not available. This paper proposes an e-process construction based on the predictive recursion (PR) algorithm originally designed to recursively fit nonparametric mixture models. 
The resulting PRe-process affords anytime valid inference and is asymptotically efficient in the sense that its growth rate is first-order optimal relative to PR's mixture model."}, "https://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "https://arxiv.org/abs/2310.04660", "description": "arXiv:2310.04660v2 Announce Type: replace \nAbstract: Many scientific questions in biomedical, environmental, and psychological research involve understanding the effects of multiple factors on outcomes. While factorial experiments are ideal for this purpose, randomized controlled treatment assignment is generally infeasible in many empirical studies. Therefore, investigators must rely on observational data, where drawing reliable causal inferences for multiple factors remains challenging. As the number of treatment combinations grows exponentially with the number of factors, some treatment combinations can be rare or missing by chance in observed data, further complicating factorial effects estimation. To address these challenges, we propose a novel weighting method tailored to observational studies with multiple factors. Our approach uses weighted observational data to emulate a randomized factorial experiment, enabling simultaneous estimation of the effects of multiple factors and their interactions. Our investigations reveal a crucial nuance: achieving balance among covariates, as in single-factor scenarios, is necessary but insufficient for unbiasedly estimating factorial effects; balancing the factors is also essential in multi-factor settings. Moreover, we extend our weighting method to handle missing treatment combinations in observed data. Finally, we study the asymptotic behavior of the new weighting estimators and propose a consistent variance estimator, providing reliable inferences on factorial effects in observational studies."}, "https://arxiv.org/abs/2310.18836": {"title": "Cluster-Randomized Trials with Cross-Cluster Interference", "link": "https://arxiv.org/abs/2310.18836", "description": "arXiv:2310.18836v3 Announce Type: replace \nAbstract: The literature on cluster-randomized trials typically assumes no interference across clusters. This may be implausible when units are irregularly distributed in space without well-separated communities, in which case clusters may not represent significant geographic, social, or economic divisions. In this paper, we develop methods for reducing bias due to cross-cluster interference. First, we propose an estimation strategy that excludes units not surrounded by clusters assigned to the same treatment arm. We show that this substantially reduces asymptotic bias relative to conventional difference-in-means estimators without substantial cost to variance. Second, we formally establish a bias-variance trade-off in the choice of clusters: constructing fewer, larger clusters reduces bias due to interference but increases variance. We provide a rule for choosing the number of clusters to balance the asymptotic orders of the bias and variance of our estimator. 
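A minimal simulation of the boundary-exclusion idea described in the cluster-randomized entry above; the spatial layout, exposure model, and all coefficients are hypothetical.

```python
# Units near cluster boundaries can be exposed to the other arm, so dropping
# units not fully surrounded by same-arm assignments reduces interference bias.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, per_cluster, radius = 60, 30, 10
n = n_clusters * per_cluster
cluster = np.repeat(np.arange(n_clusters), per_cluster)   # units on a line

arm = rng.binomial(1, 0.5, n_clusters)                    # cluster randomization
z = arm[cluster]

# Exposure: share of treated units in a spatial neighbourhood of each unit
frac = np.array([z[max(0, i - radius): i + radius + 1].mean() for i in range(n)])
y = 1.0 * z + 2.0 * frac + rng.normal(0, 1, n)            # direct 1, spillover 2

# Target: outcome under "everyone treated" minus "no one treated" = 1 + 2 = 3
naive = y[z == 1].mean() - y[z == 0].mean()

# Boundary exclusion: keep only units whose whole neighbourhood shares their arm
keep = np.array([(z[max(0, i - radius): i + radius + 1] == z[i]).all()
                 for i in range(n)])
trimmed = y[keep & (z == 1)].mean() - y[keep & (z == 0)].mean()
print(f"naive: {naive:.2f}   boundary-excluded: {trimmed:.2f}   target: 3.00")
```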
Finally, we consider unsupervised learning for cluster construction and provide theoretical guarantees for $k$-medoids."}, "https://arxiv.org/abs/2311.11858": {"title": "Theory coherent shrinkage of Time-Varying Parameters in VARs", "link": "https://arxiv.org/abs/2311.11858", "description": "arXiv:2311.11858v2 Announce Type: replace \nAbstract: This paper introduces a novel theory-coherent shrinkage prior for Time-Varying Parameter VARs (TVP-VARs). The prior centers the time-varying parameters on a path implied a priori by an underlying economic theory, chosen to describe the dynamics of the macroeconomic variables in the system. Leveraging information from conventional economic theory using this prior significantly improves inference precision and forecast accuracy compared to the standard TVP-VAR. In an application, I use this prior to incorporate information from a New Keynesian model that includes both the Zero Lower Bound (ZLB) and forward guidance into a medium-scale TVP-VAR model. This approach leads to more precise estimates of the impulse response functions, revealing a distinct propagation of risk premium shocks inside and outside the ZLB in US data."}, "https://arxiv.org/abs/2401.02694": {"title": "Nonconvex High-Dimensional Time-Varying Coefficient Estimation for Noisy High-Frequency Observations with a Factor Structure", "link": "https://arxiv.org/abs/2401.02694", "description": "arXiv:2401.02694v2 Announce Type: replace \nAbstract: In this paper, we propose a novel high-dimensional time-varying coefficient estimator for noisy high-frequency observations with a factor structure. In high-frequency finance, we often observe that noises dominate the signal of underlying true processes and that covariates exhibit a factor structure due to their strong dependence. Thus, we cannot apply usual regression procedures to analyze high-frequency observations. To handle the noises, we first employ a smoothing method for the observed dependent and covariate processes. Then, to handle the strong dependence of the covariate processes, we apply Principal Component Analysis (PCA) and transform the highly correlated covariate structure into a weakly correlated structure. However, the variables from PCA still contain non-negligible noises. To manage these non-negligible noises and the high dimensionality, we propose a nonconvex penalized regression method for each local coefficient. This method produces consistent but biased local coefficient estimators. To estimate the integrated coefficients, we propose a debiasing scheme and obtain a debiased integrated coefficient estimator using debiased local coefficient estimators. Then, to further account for the sparsity structure of the coefficients, we apply a thresholding scheme to the debiased integrated coefficient estimator. We call this scheme the Factor Adjusted Thresholded dEbiased Nonconvex LASSO (FATEN-LASSO) estimator. Furthermore, this paper establishes the concentration properties of the FATEN-LASSO estimator and discusses a nonconvex optimization algorithm."}, "https://arxiv.org/abs/1706.09072": {"title": "Decomposing Network Influence: Social Influence Regression", "link": "https://arxiv.org/abs/1706.09072", "description": "arXiv:1706.09072v2 Announce Type: replace-cross \nAbstract: Understanding network influence and its determinants are key challenges in political science and network analysis. 
Traditional latent variable models position actors within a social space based on network dependencies but often do not elucidate the underlying factors driving these interactions. To overcome this limitation, we propose the Social Influence Regression (SIR) model, an extension of vector autoregression tailored for relational data that incorporates exogenous covariates into the estimation of influence patterns. The SIR model captures influence dynamics via a pair of $n \\times n$ matrices that quantify how the actions of one actor affect the future actions of another. This framework not only provides a statistical mechanism for explaining actor influence based on observable traits but also improves computational efficiency through an iterative block coordinate descent method. We showcase the SIR model's capabilities by applying it to monthly conflict events between countries, using data from the Integrated Crisis Early Warning System (ICEWS). Our findings demonstrate the SIR model's ability to elucidate complex influence patterns within networks by linking them to specific covariates. This paper's main contributions are: (1) introducing a model that explains third-order dependencies through exogenous covariates and (2) offering an efficient estimation approach that scales effectively with large, complex networks."}, "https://arxiv.org/abs/1806.02935": {"title": "Causal effects based on distributional distances", "link": "https://arxiv.org/abs/1806.02935", "description": "arXiv:1806.02935v3 Announce Type: replace-cross \nAbstract: Comparing counterfactual distributions can provide more nuanced and valuable measures for causal effects, going beyond typical summary statistics such as averages. In this work, we consider characterizing causal effects via distributional distances, focusing on two kinds of target parameters. The first is the counterfactual outcome density. We propose a doubly robust-style estimator for the counterfactual density and study its rates of convergence and limiting distributions. We analyze asymptotic upper bounds on the $L_q$ and the integrated $L_q$ risks of the proposed estimator, and propose a bootstrap-based confidence band. The second is a novel distributional causal effect defined by the $L_1$ distance between different counterfactual distributions. We study three approaches for estimating the proposed distributional effect: smoothing the counterfactual density, smoothing the $L_1$ distance, and imposing a margin condition. For each approach, we analyze asymptotic properties and error bounds of the proposed estimator, and discuss potential advantages and disadvantages. We go on to present a bootstrap approach for obtaining confidence intervals, and propose a test of no distributional effect. We conclude with a numerical illustration and a real-world example."}, "https://arxiv.org/abs/2411.02531": {"title": "Comment on 'Sparse Bayesian Factor Analysis when the Number of Factors is Unknown' by S", "link": "https://arxiv.org/abs/2411.02531", "description": "arXiv:2411.02531v1 Announce Type: new \nAbstract: The techniques suggested in Fr\\\"uhwirth-Schnatter et al. (2024) concern sparsity and factor selection and have enormous potential beyond standard factor analysis applications. We show how these techniques can be applied to Latent Space (LS) models for network data. These models suffer from well-known identification issues of the latent factors due to likelihood invariance to factor translation, reflection, and rotation (see Hoff et al., 2002). 
A set of observables can be instrumental in identifying the latent factors via auxiliary equations (see Liu et al., 2021). These, in turn, share many analogies with the equations used in factor modeling, and we argue that the factor loading restrictions may be beneficial for achieving identification."}, "https://arxiv.org/abs/2411.02675": {"title": "Does Regression Produce Representative Causal Rankings?", "link": "https://arxiv.org/abs/2411.02675", "description": "arXiv:2411.02675v1 Announce Type: new \nAbstract: We examine the challenges in ranking multiple treatments based on their estimated effects when using linear regression or its popular double-machine-learning variant, the Partially Linear Model (PLM), in the presence of treatment effect heterogeneity. We demonstrate by example that overlap-weighting performed by linear models like PLM can produce Weighted Average Treatment Effects (WATE) whose rankings are inconsistent with the rankings of the underlying Average Treatment Effects (ATE). We define these as ranking reversals and derive a necessary and sufficient condition for ranking reversals under the PLM. We conclude with several simulation studies illustrating conditions under which ranking reversals occur."}, "https://arxiv.org/abs/2411.02755": {"title": "Restricted Win Probability with Bayesian Estimation for Implementing the Estimand Framework in Clinical Trials With a Time-to-Event Outcome", "link": "https://arxiv.org/abs/2411.02755", "description": "arXiv:2411.02755v1 Announce Type: new \nAbstract: We propose a restricted win probability estimand for comparing treatments in a randomized trial with a time-to-event outcome. We also propose Bayesian estimators for this summary measure as well as the unrestricted win probability. Bayesian estimation is scalable and facilitates seamless handling of censoring mechanisms as compared to related non-parametric pairwise approaches like win ratios. Unlike the log-rank test, these measures effectuate the estimand framework as they reflect a clearly defined population quantity related to the probability of a later event time with the potential restriction that event times exceeding a pre-specified time are deemed equivalent. We compare efficacy with established methods using computer simulation and apply the proposed approach to 304 reconstructed datasets from oncology trials. We show that the proposed approach has more power than the log-rank test in early treatment difference scenarios, and at least as much power as the win ratio in all scenarios considered. We also find that the proposed approach's statistical significance is concordant with the log-rank test for the vast majority of the oncology datasets examined. The proposed approach offers an interpretable, efficient alternative for trials with time-to-event outcomes that aligns with the estimand framework."}, "https://arxiv.org/abs/2411.02763": {"title": "Identifying nonlinear relations among random variables: A network analytic approach", "link": "https://arxiv.org/abs/2411.02763", "description": "arXiv:2411.02763v1 Announce Type: new \nAbstract: Nonlinear relations between variables, such as the curvilinear relationship between childhood trauma and resilience in patients with schizophrenia and the moderation relationship between mentalizing, and internalizing and externalizing symptoms and quality of life in youths, are more prevalent than our current methods have been able to detect.
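The ranking-reversal phenomenon described earlier in this entry can be reproduced with a small simulation: the partially linear model recovers an overlap-weighted effect, so a treatment whose large effects sit in a poor-overlap stratum can be ranked below a uniformly modest treatment. All numbers below are hypothetical.

```python
# Minimal simulation of a ranking reversal under the partially linear model.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.5, n)

def plm_coefficient(e_x, tau_x):
    """Residual-on-residual regression using the true nuisances for clarity."""
    d = rng.binomial(1, e_x)
    y = 2.0 * x + tau_x * d + rng.normal(0, 1, n)
    y_res = y - (2.0 * x + tau_x * e_x)       # Y - E[Y | X]
    d_res = d - e_x                           # D - E[D | X]
    return np.sum(d_res * y_res) / np.sum(d_res ** 2)

# Treatment A: big effect (10) in a stratum with poor overlap (e = 0.05)
e_a, tau_a = np.where(x == 1, 0.05, 0.5), np.where(x == 1, 10.0, 1.0)
# Treatment B: constant effect 3 with good overlap everywhere
e_b, tau_b = np.full(n, 0.5), np.full(n, 3.0)

print("ATE  A vs B:", tau_a.mean(), tau_b.mean())                    # approx. 5.5 vs 3.0
print("PLM  A vs B:", plm_coefficient(e_a, tau_a), plm_coefficient(e_b, tau_b))
# The ATE ranks A above B, while the PLM (overlap-weighted) ranking reverses.
```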
Although there has been a rise in network models, network construction for the standard Gaussian graphical model depends solely upon linearity. While nonlinear models are an active field of study in psychological methodology, many of these models require the analyst to specify the functional form of the relation. When performing more exploratory modeling, such as with cross-sectional network psychometrics, specifying the functional form a nonlinear relation might take becomes infeasible given the number of possible relations modeled. Here, we apply a nonparametric approach to identifying nonlinear relations using partial distance correlations. We found that partial distance correlations excel overall at identifying nonlinear relations regardless of functional form when compared with Pearson's and Spearman's partial correlations. Through simulation studies and an empirical example, we show that partial distance correlations can be used to identify possible nonlinear relations in psychometric networks, enabling researchers to then explore the shape of these relations with more confirmatory models."}, "https://arxiv.org/abs/2411.02769": {"title": "Assessment of Misspecification in CDMs Using a Generalized Information Matrix Test", "link": "https://arxiv.org/abs/2411.02769", "description": "arXiv:2411.02769v1 Announce Type: new \nAbstract: If the probability model is correctly specified, then we can estimate the covariance matrix of the asymptotic maximum likelihood estimate distribution using either the first or second derivatives of the likelihood function. Therefore, if the determinants of these two different covariance matrix estimation formulas differ, this indicates model misspecification. This misspecification detection strategy is the basis of the Determinant Information Matrix Test ($GIMT_{Det}$). To investigate the performance of the $GIMT_{Det}$, a Deterministic Input Noisy And gate (DINA) Cognitive Diagnostic Model (CDM) was fit to the Fraction-Subtraction dataset. Next, various misspecified versions of the original DINA CDM were fit to bootstrap data sets generated by sampling from the original fitted DINA CDM. The $GIMT_{Det}$ showed good discrimination performance for larger levels of misspecification. In addition, the $GIMT_{Det}$ did not detect model misspecification when none was present, and likewise did not detect misspecification when the level of misspecification was very low. However, the $GIMT_{Det}$ discrimination performance was highly variable across different misspecification strategies when the misspecification level was moderately sized. The proposed new misspecification detection methodology is promising but additional empirical studies are required to further characterize its strengths and limitations."}, "https://arxiv.org/abs/2411.02771": {"title": "Automatic doubly robust inference for linear functionals via calibrated debiased machine learning", "link": "https://arxiv.org/abs/2411.02771", "description": "arXiv:2411.02771v1 Announce Type: new \nAbstract: In causal inference, many estimands of interest can be expressed as a linear functional of the outcome regression function; this includes, for example, average causal effects of static, dynamic and stochastic interventions.
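The sketch below spells out plain (non-partial) distance correlation from its double-centred distance matrices, to show why it detects a purely nonlinear relation that Pearson correlation misses; the study above relies on the partial version, which additionally adjusts for the remaining variables and is not reproduced here.

```python
# Plain distance correlation, written out from double-centred distance matrices.
import numpy as np

def distance_correlation(x, y):
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    def centred(v):
        d = np.abs(v - v.T)                      # pairwise distances
        return d - d.mean(0) - d.mean(1, keepdims=True) + d.mean()
    A, B = centred(x), centred(y)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 1000)
y = x ** 2 + rng.normal(0, 0.2, 1000)            # purely nonlinear relation

print("Pearson:", np.corrcoef(x, y)[0, 1])                    # close to zero
print("distance correlation:", distance_correlation(x, y))    # clearly positive
```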
For learning such estimands, in this work, we propose novel debiased machine learning estimators that are doubly robust asymptotically linear, thus providing not only doubly robust consistency but also facilitating doubly robust inference (e.g., confidence intervals and hypothesis tests). To do so, we first establish a key link between calibration, a machine learning technique typically used in prediction and classification tasks, and the conditions needed to achieve doubly robust asymptotic linearity. We then introduce calibrated debiased machine learning (C-DML), a unified framework for doubly robust inference, and propose a specific C-DML estimator that integrates cross-fitting, isotonic calibration, and debiased machine learning estimation. A C-DML estimator maintains asymptotic linearity when either the outcome regression or the Riesz representer of the linear functional is estimated sufficiently well, allowing the other to be estimated at arbitrarily slow rates or even inconsistently. We propose a simple bootstrap-assisted approach for constructing doubly robust confidence intervals. Our theoretical and empirical results support the use of C-DML to mitigate bias arising from the inconsistent or slow estimation of nuisance functions."}, "https://arxiv.org/abs/2411.02804": {"title": "Beyond the Traditional VIX: A Novel Approach to Identifying Uncertainty Shocks in Financial Markets", "link": "https://arxiv.org/abs/2411.02804", "description": "arXiv:2411.02804v1 Announce Type: new \nAbstract: We introduce a new identification strategy for uncertainty shocks to explain macroeconomic volatility in financial markets. The Chicago Board Options Exchange Volatility Index (VIX) measures market expectations of future volatility, but traditional methods based on second-moment shocks and time-varying volatility of the VIX often fail to capture the non-Gaussian, heavy-tailed nature of asset returns. To address this, we construct a revised VIX by fitting a double-subordinated Normal Inverse Gaussian Levy process to S&P 500 option prices, providing a more comprehensive measure of volatility that reflects the extreme movements and heavy tails observed in financial data. Using an axiomatic approach, we introduce a general family of risk-reward ratios, computed with our revised VIX and fitted over a fractional time series to more accurately identify uncertainty shocks in financial markets."}, "https://arxiv.org/abs/2411.02811": {"title": "Temporal Wasserstein Imputation: Versatile Missing Data Imputation for Time Series", "link": "https://arxiv.org/abs/2411.02811", "description": "arXiv:2411.02811v1 Announce Type: new \nAbstract: Missing data often significantly hamper standard time series analysis, yet in practice they are frequently encountered. In this paper, we introduce temporal Wasserstein imputation, a novel method for imputing missing data in time series. Unlike existing techniques, our approach is fully nonparametric, circumventing the need for model specification prior to imputation, making it suitable for potential nonlinear dynamics. Its principled algorithmic implementation can seamlessly handle univariate or multivariate time series with any missing pattern. In addition, the plausible range and side information of the missing entries (such as box constraints) can easily be incorporated. As a key advantage, our method mitigates the distributional bias typical of many existing approaches, ensuring more reliable downstream statistical analysis using the imputed series. 
Leveraging the benign landscape of the optimization formulation, we establish the convergence of an alternating minimization algorithm to critical points. Furthermore, we provide conditions under which the marginal distributions of the underlying time series can be identified. Our numerical experiments, including extensive simulations covering linear and nonlinear time series models and an application to a real-world groundwater dataset laden with missing data, corroborate the practical usefulness of the proposed method."}, "https://arxiv.org/abs/2411.02924": {"title": "A joint model of correlated ordinal and continuous variables", "link": "https://arxiv.org/abs/2411.02924", "description": "arXiv:2411.02924v1 Announce Type: new \nAbstract: In this paper we build a joint model that can accommodate binary, ordinal and continuous responses, by assuming that the errors of the continuous variables and the errors underlying the ordinal and binary outcomes follow a multivariate normal distribution. We employ composite likelihood methods to estimate the model parameters and use composite likelihood inference for model comparison and uncertainty quantification. The accompanying R package mvordnorm implements estimation of this model using composite likelihood methods and is available for download from GitHub. We present two use cases in the area of risk management to illustrate our approach."}, "https://arxiv.org/abs/2411.02956": {"title": "On Distributional Discrepancy for Experimental Design with General Assignment Probabilities", "link": "https://arxiv.org/abs/2411.02956", "description": "arXiv:2411.02956v1 Announce Type: new \nAbstract: We investigate experimental design for randomized controlled trials (RCTs) with both equal and unequal treatment-control assignment probabilities. Our work makes progress on the connection between the distributional discrepancy minimization (DDM) problem introduced by Harshaw et al. (2024) and the design of RCTs. We make two main contributions: First, we prove that approximating the optimal solution of the DDM problem within even a certain constant error is NP-hard. Second, we introduce a new Multiplicative Weights Update (MWU) algorithm for the DDM problem, which improves on the Gram-Schmidt walk algorithm used by Harshaw et al. (2024) when assignment probabilities are unequal. Building on the framework of Harshaw et al. (2024) and our MWU algorithm, we then develop the MWU design, which reduces the worst-case mean-squared error in estimating the average treatment effect. Finally, we present a comprehensive simulation study comparing our design with commonly used designs."}, "https://arxiv.org/abs/2411.03014": {"title": "Your copula is a classifier in disguise: classification-based copula density estimation", "link": "https://arxiv.org/abs/2411.03014", "description": "arXiv:2411.03014v1 Announce Type: new \nAbstract: We propose reinterpreting copula density estimation as a discriminative task. Under this novel estimation scheme, we train a classifier to distinguish samples from the joint density from those of the product of independent marginals, recovering the copula density in the process. We derive equivalences between well-known copula classes and classification problems naturally arising in our interpretation. Furthermore, we show our estimator achieves theoretical guarantees akin to maximum likelihood estimation. By identifying a connection with density ratio estimation, we benefit from the rich literature and models available for such problems.
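A compact sketch of the classification-based recipe just described: pseudo-observations from the joint sample are contrasted with an artificial independence sample, and the fitted class-probability ratio recovers the copula density. The simulated Gaussian dependence and the particular classifier are illustrative choices only.

```python
# Classification-based copula density estimation on simulated data.
import numpy as np
from scipy.stats import rankdata, multivariate_normal
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000
# Simulate data with Gaussian dependence (rho = 0.7) and map to pseudo-observations
z = multivariate_normal(mean=[0, 0], cov=[[1, 0.7], [0.7, 1]]).rvs(n, random_state=0)
u = np.column_stack([rankdata(z[:, j]) / (n + 1) for j in range(2)])

# "Independence" sample: permute one margin to break the dependence
u_indep = np.column_stack([u[:, 0], rng.permutation(u[:, 1])])

X = np.vstack([u, u_indep])
label = np.r_[np.ones(n), np.zeros(n)]           # 1 = joint, 0 = independent
clf = HistGradientBoostingClassifier().fit(X, label)

def copula_density(points):
    p = clf.predict_proba(points)[:, 1]
    return p / (1 - p)                           # density ratio = copula density

# The Gaussian copula density is largest near (0,0) and (1,1) when rho > 0
print(copula_density(np.array([[0.1, 0.1], [0.1, 0.9], [0.5, 0.5]])))
```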
Empirically, we demonstrate the applicability of our approach by estimating copulas of real and high-dimensional datasets, outperforming competing copula estimators in density evaluation as well as sampling."}, "https://arxiv.org/abs/2411.03100": {"title": "Modeling sparsity in count-weighted networks", "link": "https://arxiv.org/abs/2411.03100", "description": "arXiv:2411.03100v1 Announce Type: new \nAbstract: Community detection methods have been extensively studied to recover community structures in network data. While many models and methods focus on binary data, real-world networks also contain information on the strength of connections, which can be incorporated into the network analysis. We propose a probabilistic model for generating weighted networks that allows us to control network sparsity and incorporates degree corrections for each node. We propose a community detection method based on the Variational Expectation-Maximization (VEM) algorithm. We show that the proposed method works well in practice for simulated networks. We analyze the Brazilian airport network to compare the community structures before and during the COVID-19 pandemic."}, "https://arxiv.org/abs/2411.03208": {"title": "Randomly assigned first differences?", "link": "https://arxiv.org/abs/2411.03208", "description": "arXiv:2411.03208v1 Announce Type: new \nAbstract: I consider treatment-effect estimation with panel data over two periods, using a first-difference regression of the outcome evolution $\\Delta Y_g$ on the treatment evolution $\\Delta D_g$. A design-based justification of this estimator assumes that $\\Delta D_g$ is as good as randomly assigned. This note shows that if one posits a causal model in levels between the treatment and the outcome, then $\\Delta D_g$ randomly assigned implies that $\\Delta D_g$ is independent of $D_{g,1}$. This is a very strong, fully testable condition. If $\\Delta D_g$ is not independent of $D_{g,1}$, the first-difference estimator can still identify a convex combination of effects if $\\Delta D_g$ is randomly assigned and treatment effects do not change over time, or if the treatment path $(D_{g,1},D_{g,2})$ is randomly assigned and any of the following conditions holds: i) $D_{g,1}$ and $D_{g,2}$ are binary; ii) $D_{g,1}$ and $D_{g,2}$ have the same variance; iii) $D_{g,1}$ and $D_{g,2}$ are uncorrelated or negatively correlated. I use these results to revisit Acemoglu et al. (2016)."}, "https://arxiv.org/abs/2411.03304": {"title": "Bayesian Controlled FDR Variable Selection via Knockoffs", "link": "https://arxiv.org/abs/2411.03304", "description": "arXiv:2411.03304v1 Announce Type: new \nAbstract: In many research fields, researchers aim to identify significant associations between a set of explanatory variables and a response while controlling the false discovery rate (FDR). To this aim, we develop a fully Bayesian generalization of the classical model-X knockoff filter. The knockoff filter introduces controlled noise into the model in the form of cleverly constructed copies of the predictors as auxiliary variables. In our approach, we consider the joint model of the covariates and the response and incorporate the conditional independence structure of the covariates into the prior distribution of the auxiliary knockoff variables. We further incorporate the estimation of a graphical model among the covariates, which in turn aids knockoff generation and improves the estimation of the covariate effects on the response.
We use a modified spike-and-slab prior on the regression coefficients, which avoids the increase of the model dimension as typical in the classical knockoff filter. Our model performs variable selection using an upper bound on the posterior probability of non-inclusion. We show how our model construction leads to valid model-X knockoffs and demonstrate that the proposed characterization is sufficient for controlling the BFDR at an arbitrary level, in finite samples. We also show that the model selection is robust to the estimation of the precision matrix. We use simulated data to demonstrate that our proposal increases the stability of the selection with respect to classical knockoff methods, as it relies on the entire posterior distribution of the knockoff variables instead of a single sample. With respect to Bayesian variable selection methods, we show that our selection procedure achieves comparable or better performances, while maintaining control over the FDR. Finally, we show the usefulness of the proposed model with an application to real data."}, "https://arxiv.org/abs/2411.02726": {"title": "Elliptical Wishart distributions: information geometry, maximum likelihood estimator, performance analysis and statistical learning", "link": "https://arxiv.org/abs/2411.02726", "description": "arXiv:2411.02726v1 Announce Type: cross \nAbstract: This paper deals with Elliptical Wishart distributions - which generalize the Wishart distribution - in the context of signal processing and machine learning. Two algorithms to compute the maximum likelihood estimator (MLE) are proposed: a fixed point algorithm and a Riemannian optimization method based on the derived information geometry of Elliptical Wishart distributions. The existence and uniqueness of the MLE are characterized as well as the convergence of both estimation algorithms. Statistical properties of the MLE are also investigated such as consistency, asymptotic normality and an intrinsic version of Fisher efficiency. On the statistical learning side, novel classification and clustering methods are designed. For the $t$-Wishart distribution, the performance of the MLE and statistical learning algorithms are evaluated on both simulated and real EEG and hyperspectral data, showcasing the interest of our proposed methods."}, "https://arxiv.org/abs/2411.02909": {"title": "When is it worthwhile to jackknife? Breaking the quadratic barrier for Z-estimators", "link": "https://arxiv.org/abs/2411.02909", "description": "arXiv:2411.02909v1 Announce Type: cross \nAbstract: Resampling methods are especially well-suited to inference with estimators that provide only \"black-box'' access. Jackknife is a form of resampling, widely used for bias correction and variance estimation, that is well-understood under classical scaling where the sample size $n$ grows for a fixed problem. We study its behavior in application to estimating functionals using high-dimensional $Z$-estimators, allowing both the sample size $n$ and problem dimension $d$ to diverge. We begin showing that the plug-in estimator based on the $Z$-estimate suffers from a quadratic breakdown: while it is $\\sqrt{n}$-consistent and asymptotically normal whenever $n \\gtrsim d^2$, it fails for a broad class of problems whenever $n \\lesssim d^2$. We then show that under suitable regularity conditions, applying a jackknife correction yields an estimate that is $\\sqrt{n}$-consistent and asymptotically normal whenever $n\\gtrsim d^{3/2}$. 
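For reference, the generic delete-one jackknife correction discussed above looks as follows on a simple nonlinear functional (the squared mean), whose plug-in estimate carries O(1/n) bias; the paper's high-dimensional Z-estimation setting is of course far more involved.

```python
# Delete-one jackknife bias correction and standard error for a generic estimator.
import numpy as np

def jackknife(estimator, data):
    n = len(data)
    theta_full = estimator(data)
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    theta_jack = n * theta_full - (n - 1) * loo.mean()        # bias-corrected
    se_jack = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return theta_jack, se_jack

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)
estimator = lambda d: d.mean() ** 2        # plug-in estimate of (E[X])^2 = 1

print("plug-in:", estimator(x))
print("jackknife-corrected, SE:", jackknife(estimator, x))
```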
This provides strong motivation for the use of jackknife in high-dimensional problems where the dimension is moderate relative to sample size. We illustrate consequences of our general theory for various specific $Z$-estimators, including non-linear functionals in linear models; generalized linear models; and the inverse propensity score weighting (IPW) estimate for the average treatment effect, among others."}, "https://arxiv.org/abs/2411.03021": {"title": "Testing Generalizability in Causal Inference", "link": "https://arxiv.org/abs/2411.03021", "description": "arXiv:2411.03021v1 Announce Type: cross \nAbstract: Ensuring robust model performance across diverse real-world scenarios requires addressing both transportability across domains with covariate shifts and extrapolation beyond observed data ranges. However, there is no formal procedure for statistically evaluating generalizability in machine learning algorithms, particularly in causal inference. Existing methods often rely on arbitrary metrics like AUC or MSE and focus predominantly on toy datasets, providing limited insights into real-world applicability. To address this gap, we propose a systematic and quantitative framework for evaluating model generalizability under covariate distribution shifts, specifically within causal inference settings. Our approach leverages the frugal parameterization, allowing for flexible simulations from fully and semi-synthetic benchmarks, offering comprehensive evaluations for both mean and distributional regression methods. By basing simulations on real data, our method ensures more realistic evaluations, which is often missing in current work relying on simplified datasets. Furthermore, using simulations and statistical testing, our framework is robust and avoids over-reliance on conventional metrics. Grounded in real-world data, it provides realistic insights into model performance, bridging the gap between synthetic evaluations and practical applications."}, "https://arxiv.org/abs/2411.03026": {"title": "Robust Market Interventions", "link": "https://arxiv.org/abs/2411.03026", "description": "arXiv:2411.03026v1 Announce Type: cross \nAbstract: A large differentiated oligopoly yields inefficient market equilibria. An authority with imprecise information about the primitives of the market aims to design tax/subsidy interventions that increase efficiency robustly, i.e., with high probability. We identify a condition on demand that guarantees the existence of such interventions, and we show how to construct them using noisy estimates of demand complementarities and substitutabilities across products. The analysis works by deriving a novel description of the incidence of market interventions in terms of spectral statistics of a Slutsky matrix. Our notion of recoverable structure ensures that parts of the spectrum that are useful for the design of interventions are statistically recoverable from noisy demand estimates."}, "https://arxiv.org/abs/2109.00408": {"title": "How to Detect Network Dependence in Latent Factor Models? A Bias-Corrected CD Test", "link": "https://arxiv.org/abs/2109.00408", "description": "arXiv:2109.00408v4 Announce Type: replace \nAbstract: In a recent paper Juodis and Reese (2022) (JR) show that the application of the CD test proposed by Pesaran (2004) to residuals from panels with latent factors results in over-rejection. They propose a randomized test statistic to correct for over-rejection, and add a screening component to achieve power. 
This paper considers the same problem but from a different perspective, and shows that the standard CD test remains valid if the latent factors are weak in the sense the strength is less than half. In the case where latent factors are strong, we propose a bias-corrected version, CD*, which is shown to be asymptotically standard normal under the null of error cross-sectional independence and have power against network type alternatives. This result is shown to hold for pure latent factor models as well as for panel regression models with latent factors. The case where the errors are serially correlated is also considered. Small sample properties of the CD* test are investigated by Monte Carlo experiments and are shown to have the correct size for strong and weak factors as well as for Gaussian and non-Gaussian errors. In contrast, it is found that JR's test tends to over-reject in the case of panels with non-Gaussian errors, and has low power against spatial network alternatives. In an empirical application, using the CD* test, it is shown that there remains spatial error dependence in a panel data model for real house price changes across 377 Metropolitan Statistical Areas in the U.S., even after the effects of latent factors are filtered out."}, "https://arxiv.org/abs/2208.03632": {"title": "Quantile Random-Coefficient Regression with Interactive Fixed Effects: Heterogeneous Group-Level Policy Evaluation", "link": "https://arxiv.org/abs/2208.03632", "description": "arXiv:2208.03632v3 Announce Type: replace \nAbstract: We propose a quantile random-coefficient regression with interactive fixed effects to study the effects of group-level policies that are heterogeneous across individuals. Our approach is the first to use a latent factor structure to handle the unobservable heterogeneities in the random coefficient. The asymptotic properties and an inferential method for the policy estimators are established. The model is applied to evaluate the effect of the minimum wage policy on earnings between 1967 and 1980 in the United States. Our results suggest that the minimum wage policy has significant and persistent positive effects on black workers and female workers up to the median. Our results also indicate that the policy helps reduce income disparity up to the median between two groups: black, female workers versus white, male workers. However, the policy is shown to have little effect on narrowing the income gap between low- and high-income workers within the subpopulations."}, "https://arxiv.org/abs/2311.05914": {"title": "Efficient Case-Cohort Design using Balanced Sampling", "link": "https://arxiv.org/abs/2311.05914", "description": "arXiv:2311.05914v2 Announce Type: replace \nAbstract: A case-cohort design is a two-phase sampling design frequently used to analyze censored survival data in a cost-effective way, where a subcohort is usually selected using simple random sampling or stratified simple random sampling. In this paper, we propose an efficient sampling procedure based on balanced sampling when selecting a subcohort in a case-cohort design. A sample selected via a balanced sampling procedure automatically calibrates auxiliary variables. When fitting a Cox model, calibrating sampling weights has been shown to lead to more efficient estimators of the regression coefficients (Breslow et al., 2009a, b). The reduced variabilities over its counterpart with a simple random sampling are shown via extensive simulation experiments. 
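The standard Pesaran (2004) CD statistic at the heart of the CD/CD* discussion above can be computed directly from an N x T array of panel residuals, as sketched below; the bias correction and latent-factor filtering of CD* are not shown.

```python
# Pesaran's CD statistic from panel residuals.
import numpy as np

def cd_statistic(resid):
    """resid: array of shape (N, T) of residuals for N units over T periods."""
    N, T = resid.shape
    corr = np.corrcoef(resid)                    # N x N pairwise correlations
    upper = corr[np.triu_indices(N, k=1)]        # rho_ij for i < j
    return np.sqrt(2 * T / (N * (N - 1))) * upper.sum()

# Under cross-sectional independence the CD statistic is approximately N(0,1)
rng = np.random.default_rng(0)
print(cd_statistic(rng.normal(size=(50, 100))))                   # near zero
strong_factor = rng.normal(size=100) * rng.normal(size=(50, 1))   # common factor
print(cd_statistic(rng.normal(size=(50, 100)) + strong_factor))   # far from zero
```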
The proposed design and estimation procedure are also illustrated with the well-known National Wilms Tumor Study dataset."}, "https://arxiv.org/abs/2311.16260": {"title": "Using Multiple Outcomes to Improve the Synthetic Control Method", "link": "https://arxiv.org/abs/2311.16260", "description": "arXiv:2311.16260v2 Announce Type: replace \nAbstract: When there are multiple outcome series of interest, Synthetic Control analyses typically proceed by estimating separate weights for each outcome. In this paper, we instead propose estimating a common set of weights across outcomes, by balancing either a vector of all outcomes or an index or average of them. Under a low-rank factor model, we show that these approaches lead to lower bias bounds than separate weights, and that averaging leads to further gains when the number of outcomes grows. We illustrate this via a re-analysis of the impact of the Flint water crisis on educational outcomes."}, "https://arxiv.org/abs/2208.06115": {"title": "A Nonparametric Approach with Marginals for Modeling Consumer Choice", "link": "https://arxiv.org/abs/2208.06115", "description": "arXiv:2208.06115v5 Announce Type: replace-cross \nAbstract: Given data on the choices made by consumers for different offer sets, a key challenge is to develop parsimonious models that describe and predict consumer choice behavior while being amenable to prescriptive tasks such as pricing and assortment optimization. The marginal distribution model (MDM) is one such model, which requires only the specification of marginal distributions of the random utilities. This paper aims to establish necessary and sufficient conditions for given choice data to be consistent with the MDM hypothesis, inspired by the usefulness of similar characterizations for the random utility model (RUM). This endeavor leads to an exact characterization of the set of choice probabilities that the MDM can represent. Verifying the consistency of choice data with this characterization is equivalent to solving a polynomial-sized linear program. Since the analogous verification task for RUM is computationally intractable and neither of these models subsumes the other, MDM is helpful in striking a balance between tractability and representational power. The characterization is then used with robust optimization for making data-driven sales and revenue predictions for new unseen assortments. When the choice data lacks consistency with the MDM hypothesis, finding the best-fitting MDM choice probabilities reduces to solving a mixed integer convex program. Numerical results using real world data and synthetic data demonstrate that MDM exhibits competitive representational power and prediction performance compared to RUM and parametric models while being significantly faster in computation than RUM."}, "https://arxiv.org/abs/2411.03388": {"title": "Extending Cluster-Weighted Factor Analyzers for multivariate prediction and high-dimensional interpretability", "link": "https://arxiv.org/abs/2411.03388", "description": "arXiv:2411.03388v1 Announce Type: new \nAbstract: Cluster-weighted factor analyzers (CWFA) are a versatile class of mixture models designed to estimate the joint distribution of a random vector that includes a response variable along with a set of explanatory variables. They are particularly valuable in situations involving high dimensionality. This paper enhances CWFA models in two notable ways. First, it enables the prediction of multiple response variables while considering their potential interactions. 
Second, it identifies factors associated with disjoint groups of explanatory variables, thereby improving interpretability. This development leads to the introduction of the multivariate cluster-weighted disjoint factor analyzers (MCWDFA) model. An alternating expectation-conditional maximization algorithm is employed for parameter estimation. The effectiveness of the proposed model is assessed through an extensive simulation study that examines various scenarios. The proposal is applied to crime data from the United States, sourced from the UCI Machine Learning Repository, with the aim of capturing potential latent heterogeneity within communities and identifying groups of socio-economic features that are similarly associated with factors predicting crime rates. Results provide valuable insights into the underlying structures influencing crime rates which may potentially be helpful for effective cluster-specific policymaking and social interventions."}, "https://arxiv.org/abs/2411.03489": {"title": "A Bayesian nonparametric approach to mediation and spillover effects with multiple mediators in cluster-randomized trials", "link": "https://arxiv.org/abs/2411.03489", "description": "arXiv:2411.03489v1 Announce Type: new \nAbstract: Cluster randomized trials (CRTs) with multiple unstructured mediators present significant methodological challenges for causal inference due to within-cluster correlation, interference among units, and the complexity introduced by multiple mediators. Existing causal mediation methods often fall short in simultaneously addressing these complexities, particularly in disentangling mediator-specific effects under interference that are central to studying complex mechanisms. To address this gap, we propose new causal estimands for spillover mediation effects that differentiate the roles of each individual's own mediator and the spillover effects resulting from interactions among individuals within the same cluster. We establish identification results for each estimand and, to flexibly model the complex data structures inherent in CRTs, we develop a new Bayesian nonparametric prior -- the Nested Dependent Dirichlet Process Mixture -- designed for flexibly capture the outcome and mediator surfaces at different levels. We conduct extensive simulations across various scenarios to evaluate the frequentist performance of our methods, compare them with a Bayesian parametric counterpart and illustrate our new methods in an analysis of a completed CRT."}, "https://arxiv.org/abs/2411.03504": {"title": "Exact Designs for OofA Experiments Under a Transition-Effect Model", "link": "https://arxiv.org/abs/2411.03504", "description": "arXiv:2411.03504v1 Announce Type: new \nAbstract: In the chemical, pharmaceutical, and food industries, sometimes the order of adding a set of components has an impact on the final product. These are instances of the order-of-addition (OofA) problem, which aims to find the optimal sequence of the components. Extensive research on this topic has been conducted, but almost all designs are found by optimizing the $D-$optimality criterion. However, when prediction of the response is important, there is still a need for $I-$optimal designs. A new model for OofA experiments is presented that uses transition effects to model the effect of order on the response, and the model is extended to cover cases where block-wise constraints are placed on the order of addition. 
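A toy illustration of the transition-effect idea in the order-of-addition entry above: if the response is an overall mean plus an effect for each immediate transition, the best order can be found by enumeration for a small number of components. The effect values and the exact parameterization are hypothetical and may differ from the paper's model.

```python
# Brute-force search for the best order of addition under hypothetical
# transition effects (effect of adding component b immediately after a).
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
m = 4
trans_effect = rng.normal(0, 1, size=(m, m))       # hypothetical effects
np.fill_diagonal(trans_effect, 0.0)

def expected_response(order, mu=10.0):
    return mu + sum(trans_effect[a, b] for a, b in zip(order, order[1:]))

best = max(permutations(range(m)), key=expected_response)
print("best order of addition:", best,
      "expected response:", round(expected_response(best), 3))
```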
Several algorithms are used to find both $D-$ and $I-$efficient designs under this new model for many run sizes and for large numbers of components. Finally, two examples are shown to illustrate the effectiveness of the proposed designs and model at identifying the optimal order of addition, even under block-wise constraints."}, "https://arxiv.org/abs/2411.03530": {"title": "Improving precision of A/B experiments using trigger intensity", "link": "https://arxiv.org/abs/2411.03530", "description": "arXiv:2411.03530v1 Announce Type: new \nAbstract: In industry, online randomized controlled experiments (a.k.a. A/B experiments) are a standard approach to measuring the impact of a causal change. These experiments have small treatment effects to reduce the potential blast radius. As a result, these experiments often lack statistical significance due to a low signal-to-noise ratio. To improve the precision (or reduce standard error), we introduce the idea of trigger observations, where the outputs of the treatment and control models differ. We show that evaluation with full information about trigger observations (full knowledge) improves precision in comparison to a baseline method. However, detecting all such trigger observations is a costly affair, so we propose a sampling-based evaluation method (partial knowledge) to reduce the cost. The randomness of sampling introduces bias in the estimated outcome. We theoretically analyze this bias and show that the bias is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods using simulation and empirical data. In simulation, evaluation with full knowledge reduces the standard error by as much as 85%. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%."}, "https://arxiv.org/abs/2411.03625": {"title": "Identification and Inference in General Bunching Designs", "link": "https://arxiv.org/abs/2411.03625", "description": "arXiv:2411.03625v1 Announce Type: new \nAbstract: This paper develops a formal econometric framework and tools for the identification and inference of a structural parameter in general bunching designs. We present both point and partial identification results, which generalize previous approaches in the literature. The key assumption for point identification is the analyticity of the counterfactual density, which defines a broader class of distributions than many well-known parametric families. In the partial identification approach, the analyticity condition is relaxed and various shape restrictions can be incorporated, including those found in the literature. Both of our identification results account for observable heterogeneity in the model, which has previously been permitted only in limited ways. We provide a suite of counterfactual estimation and inference methods, termed the generalized polynomial strategy. Our method restores the merits of the original polynomial strategy proposed by Chetty et al. (2011) while addressing several weaknesses in the widespread practice. The efficacy of the proposed method is demonstrated relative to a version of the polynomial estimator in a series of Monte Carlo studies within the augmented isoelastic model.
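A small simulation of the trigger idea described above: restricting the comparison to observations where the treatment and control systems actually differ concentrates the signal. The trigger rate and effect size are hypothetical.

```python
# Diluted all-unit comparison vs. comparison restricted to trigger observations.
import numpy as np

rng = np.random.default_rng(0)
n, trigger_rate, effect = 100_000, 0.1, 0.3
assign = rng.binomial(1, 0.5, n)                 # A/B assignment
trigger = rng.binomial(1, trigger_rate, n) == 1  # where the two systems differ
y = rng.normal(0, 1, n) + effect * assign * trigger

def diff_in_means_se(y, a):
    diff = y[a == 1].mean() - y[a == 0].mean()
    se = np.sqrt(y[a == 1].var() / (a == 1).sum() + y[a == 0].var() / (a == 0).sum())
    return diff, se

print("all units     :", diff_in_means_se(y, assign))
print("trigger units :", diff_in_means_se(y[trigger], assign[trigger]))
# The trigger-only analysis targets the effect among triggered units (0.3)
# with a much better signal-to-noise ratio than the diluted all-unit effect (0.03).
```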
We revisit the data used in Saez (2010) and find substantially different results relative to those from the polynomial strategy."}, "https://arxiv.org/abs/2411.03848": {"title": "Monotone Missing Data: A Blessing and a Curse", "link": "https://arxiv.org/abs/2411.03848", "description": "arXiv:2411.03848v1 Announce Type: new \nAbstract: Monotone missingness is commonly encountered in practice, where a missing measurement compels another measurement to be missing. In graphical missing data models, monotonicity has implications for the identifiability of the full law, i.e., the joint distribution of actual variables and response indicators. In the general nonmonotone case, the full law is known to be nonparametrically identifiable if and only if neither colluders nor self-censoring edges are present in the graph. We show that monotonicity may enable the identification of the full law despite colluders and prevent the identification under mediated (pathwise) self-censoring. The results emphasize the importance of proper treatment of monotone missingness in the analysis of incomplete data."}, "https://arxiv.org/abs/2411.03992": {"title": "Sparse Bayesian joint modal estimation for exploratory item factor analysis", "link": "https://arxiv.org/abs/2411.03992", "description": "arXiv:2411.03992v1 Announce Type: new \nAbstract: This study presents a scalable Bayesian estimation algorithm for sparse estimation in exploratory item factor analysis based on a classical Bayesian estimation method, namely Bayesian joint modal estimation (BJME). BJME estimates the model parameters and factor scores that maximize the complete-data joint posterior density. Simulation studies show that the proposed algorithm has high computational efficiency and accuracy in variable selection over latent factors and the recovery of the model parameters. Moreover, we conducted a real data analysis using large-scale data from a psychological assessment that targeted the Big Five personality traits. These results indicate that the proposed algorithm achieves computationally efficient parameter estimation and extracts an interpretable factor loading structure."}, "https://arxiv.org/abs/2411.04002": {"title": "Federated mixed effects logistic regression based on one-time shared summary statistics", "link": "https://arxiv.org/abs/2411.04002", "description": "arXiv:2411.04002v1 Announce Type: new \nAbstract: Upholding data privacy, especially in medical research, has become tantamount to facing difficulties in accessing individual-level patient data. Estimating mixed effects binary logistic regression models involving data from multiple data providers, such as hospitals, thus becomes more challenging. Federated learning has emerged as an option to preserve the privacy of individual observations while still estimating a global model that can be interpreted on the individual level, but it usually involves iterative communication between the data providers and the data analyst. In this paper, we present a strategy to estimate a mixed effects binary logistic regression model that requires data providers to share summary statistics only once. It involves generating pseudo-data whose summary statistics match those of the actual data and using these in the model estimation process instead of the actual unavailable data. Our strategy is able to include multiple predictors, which can be a combination of continuous and categorical variables. 
Through simulation, we show that our approach estimates the true model at least as well as the approach that requires the pooled individual observations. An illustrative example using real data is provided. Unlike typical federated learning algorithms, our approach eliminates infrastructure requirements and security issues while being communication-efficient and accounting for heterogeneity."}, "https://arxiv.org/abs/2411.03333": {"title": "Analysis of Bipartite Networks in Anime Series: Textual Analysis, Topic Clustering, and Modeling", "link": "https://arxiv.org/abs/2411.03333", "description": "arXiv:2411.03333v1 Announce Type: cross \nAbstract: This article analyzes a specific bipartite network that shows the relationships between users and anime, examining how the descriptions of anime influence the formation of user communities. In particular, we introduce a new variable that quantifies the frequency with which words from a description appear in specific word clusters. These clusters are generated from a bigram analysis derived from all descriptions in the database. This approach fully characterizes the dynamics of these communities and shows how textual content affects the cohesion and structure of the social network among anime enthusiasts. Our findings suggest that there may be significant implications for the design of recommendation systems and the enhancement of user experience on anime platforms."}, "https://arxiv.org/abs/2411.03623": {"title": "Asymptotic analysis of estimators of ergodic stochastic differential equations", "link": "https://arxiv.org/abs/2411.03623", "description": "arXiv:2411.03623v1 Announce Type: cross \nAbstract: The paper studies asymptotic properties of estimators of multidimensional stochastic differential equations driven by Brownian motions from high-frequency discrete data. Consistency and central limit properties of a class of estimators of the diffusion parameter and an approximate maximum likelihood estimator of the drift parameter based on a discretized likelihood function have been established in a suitable scaling regime involving the time-gap between the observations and the overall time span. Our framework is more general than that typically considered in the literature and, thus, has the potential to be applicable to a wider range of stochastic models."}, "https://arxiv.org/abs/2411.03641": {"title": "Constrained Multi-objective Bayesian Optimization through Optimistic Constraints Estimation", "link": "https://arxiv.org/abs/2411.03641", "description": "arXiv:2411.03641v1 Announce Type: cross \nAbstract: Multi-objective Bayesian optimization has been widely adopted in scientific experiment design, including drug discovery and hyperparameter optimization. In practice, regulatory or safety concerns often impose additional thresholds on certain attributes of the experimental outcomes. Previous work has primarily focused on constrained single-objective optimization tasks or active search under constraints. We propose CMOBO, a sample-efficient constrained multi-objective Bayesian optimization algorithm that balances learning of the feasible region (defined on multiple unknowns) with multi-objective optimization within the feasible region in a principled manner. 
We provide both theoretical justification and empirical evidence, demonstrating the efficacy of our approach on various synthetic benchmarks and real-world applications."}, "https://arxiv.org/abs/2202.06420": {"title": "Statistical Inference for Cell Type Deconvolution", "link": "https://arxiv.org/abs/2202.06420", "description": "arXiv:2202.06420v4 Announce Type: replace \nAbstract: Integrating data from different platforms, such as bulk and single-cell RNA sequencing, is crucial for improving the accuracy and interpretability of complex biological analyses like cell type deconvolution. However, this task is complicated by measurement and biological heterogeneity between target and reference datasets. For the problem of cell type deconvolution, existing methods often neglect the correlation and uncertainty in cell type proportion estimates, possibly leading to an additional concern of false positives in downstream comparisons across multiple individuals. We introduce MEAD, a comprehensive statistical framework that not only estimates cell type proportions but also provides asymptotically valid statistical inference on the estimates. One of our key contributions is the identifiability result, which rigorously establishes the conditions under which cell type proportions are identifiable despite arbitrary heterogeneity of measurement biases between platforms. MEAD also supports the comparison of cell type proportions across individuals after deconvolution, accounting for gene-gene correlations and biological variability. Through simulations and real-data analysis, MEAD demonstrates superior reliability for inferring cell type compositions in complex biological systems."}, "https://arxiv.org/abs/2205.15422": {"title": "Profile Monitoring via Eigenvector Perturbation", "link": "https://arxiv.org/abs/2205.15422", "description": "arXiv:2205.15422v2 Announce Type: replace \nAbstract: In Statistical Process Control, control charts are often used to detect undesirable behavior of sequentially observed quality characteristics. Designing a control chart with desirably low False Alarm Rate (FAR) and detection delay ($ARL_1$) is an important challenge especially when the sampling rate is high and the control chart has an In-Control Average Run Length, called $ARL_0$, of 200 or more, as commonly found in practice. Unfortunately, arbitrary reduction of the FAR typically increases the $ARL_1$. Motivated by eigenvector perturbation theory, we propose the Eigenvector Perturbation Control Chart for computationally fast nonparametric profile monitoring. Our simulation studies show that it outperforms the competition and achieves both $ARL_1 \\approx 1$ and $ARL_0 > 10^6$."}, "https://arxiv.org/abs/2302.09103": {"title": "Multiple change-point detection for some point processes", "link": "https://arxiv.org/abs/2302.09103", "description": "arXiv:2302.09103v3 Announce Type: replace \nAbstract: The aim of change-point detection is to identify behavioral shifts within time series data. This article focuses on scenarios where the data is derived from an inhomogeneous Poisson process or a marked Poisson process. We present a methodology for detecting multiple offline change-points using a minimum contrast estimator. Specifically, we address how to manage the continuous nature of the process given the available discrete observations. Additionally, we select the appropriate number of changes via a cross-validation procedure which is particularly effective given the characteristics of the Poisson process. 
Lastly, we show how to apply this methodology to self-exciting processes with changes in the intensity. Through experiments with both simulated and real datasets, we showcase the advantages of the proposed method, which has been implemented in the R package \\texttt{CptPointProcess}."}, "https://arxiv.org/abs/2311.01147": {"title": "Variational Inference for Sparse Poisson Regression", "link": "https://arxiv.org/abs/2311.01147", "description": "arXiv:2311.01147v4 Announce Type: replace \nAbstract: We utilize the non-conjugate VB method for the problem of the sparse Poisson regression model. To provide an approximated conjugacy in the model, the likelihood is approximated by a quadratic function, which provides the conjugacy of the approximation component with the Gaussian prior on the regression coefficient. Three sparsity-enforcing priors are used for this problem. The proposed models are compared with each other and with two frequentist sparse Poisson methods (LASSO and SCAD) to evaluate the estimation, prediction, and sparsity performance of the proposed methods. Through a simulated data example, the accuracy of the VB methods is assessed in comparison to the corresponding benchmark MCMC methods. The proposed VB methods provide a good approximation to the posterior distribution of the parameters while being much faster than the MCMC ones. Using several benchmark count response data sets, the prediction performance of the proposed methods is evaluated in real-world applications."}, "https://arxiv.org/abs/2212.01887": {"title": "The Optimality of Blocking Designs in Equally and Unequally Allocated Randomized Experiments with General Response", "link": "https://arxiv.org/abs/2212.01887", "description": "arXiv:2212.01887v3 Announce Type: replace-cross \nAbstract: We consider the performance of the difference-in-means estimator in a two-arm randomized experiment under common experimental endpoints such as continuous (regression), incidence, proportion and survival. We examine performance under both equal and unequal allocation to treatment groups, and we consider both the Neyman randomization model and the population model. We show that in the Neyman model, where the only source of randomness is the treatment manipulation, there is no free lunch: complete randomization is minimax for the estimator's mean squared error. In the population model, where each subject experiences response noise with zero mean, the optimal design is the deterministic perfect-balance allocation. However, this allocation is generally NP-hard to compute and, moreover, depends on unknown response parameters. When considering the tail criterion of Kapelner et al. (2021), we show the optimal design is less random than complete randomization and more random than the deterministic perfect-balance allocation. We prove that Fisher's blocking design provides the asymptotically optimal degree of experimental randomness. Theoretical results are supported by simulations in all considered experimental settings."}, "https://arxiv.org/abs/2411.04166": {"title": "Kernel density estimation with polyspherical data and its applications", "link": "https://arxiv.org/abs/2411.04166", "description": "arXiv:2411.04166v1 Announce Type: new \nAbstract: A kernel density estimator for data on the polysphere $\\mathbb{S}^{d_1}\\times\\cdots\\times\\mathbb{S}^{d_r}$, with $r,d_1,\\ldots,d_r\\geq 1$, is presented in this paper. 
We derive the main asymptotic properties of the estimator, including mean square error, normality, and optimal bandwidths. We address the kernel theory of the estimator beyond the von Mises-Fisher kernel, introducing new kernels that are more efficient and investigating normalizing constants, moments, and sampling methods thereof. Plug-in and cross-validated bandwidth selectors are also obtained. As a spin-off of the kernel density estimator, we propose a nonparametric $k$-sample test based on the Jensen-Shannon divergence. Numerical experiments illuminate the asymptotic theory of the kernel density estimator and demonstrate the superior performance of the $k$-sample test with respect to parametric alternatives in certain scenarios. Our smoothing methodology is applied to the analysis of the morphology of a sample of hippocampi of infants embedded in the high-dimensional polysphere $(\\mathbb{S}^2)^{168}$ via skeletal representations ($s$-reps)."}, "https://arxiv.org/abs/2411.04228": {"title": "dsld: A Socially Relevant Tool for Teaching Statistics", "link": "https://arxiv.org/abs/2411.04228", "description": "arXiv:2411.04228v1 Announce Type: new \nAbstract: The growing power of data science can play a crucial role in addressing social discrimination, necessitating a nuanced understanding of potential biases and effective mitigation strategies. Data Science Looks At Discrimination (dsld) is an R and Python package designed to provide users with a comprehensive toolkit of statistical and graphical methods for assessing possible discrimination related to protected groups, such as race, gender, and age. Our software offers techniques for discrimination analysis by identifying and mitigating confounding variables, along with methods for reducing bias in predictive models.\n In educational settings, dsld offers instructors powerful tools to teach important statistical principles through motivating real-world examples of discrimination analysis. The inclusion of an 80-page Quarto book further supports users, from statistics educators to legal professionals, in effectively applying these analytical tools to real-world scenarios."}, "https://arxiv.org/abs/2411.04229": {"title": "Detecting State Changes in Functional Neuronal Connectivity using Factorial Switching Linear Dynamical Systems", "link": "https://arxiv.org/abs/2411.04229", "description": "arXiv:2411.04229v1 Announce Type: new \nAbstract: A key question in brain sciences is how to identify time-evolving functional connectivity, such as that obtained from recordings of neuronal activity over time. We wish to explain the observed phenomena in terms of latent states which, in the case of neuronal activity, might correspond to subnetworks of neurons within a brain or organoid. Many existing approaches assume that only one latent state can be active at a time, in contrast to our domain knowledge. We propose a switching dynamical system based on the factorial hidden Markov model. Unlike existing approaches, our model acknowledges that neuronal activity can be caused by multiple subnetworks, which may be activated either jointly or independently. A change in one part of the network does not mean that the entire connectivity pattern will change. We pair our model with a scalable variational inference algorithm, using a concrete relaxation of the underlying factorial hidden Markov model, to effectively infer the latent states and model parameters. 
We show that our algorithm can recover ground-truth structure and yield insights about the maturation of neuronal activity in microelectrode array recordings from in vitro neuronal cultures."}, "https://arxiv.org/abs/2411.04239": {"title": "An Adversarial Approach to Identification and Inference", "link": "https://arxiv.org/abs/2411.04239", "description": "arXiv:2411.04239v1 Announce Type: new \nAbstract: We introduce a novel framework to characterize identified sets of structural and counterfactual parameters in econometric models. Our framework centers on a discrepancy function, which we construct using insights from convex analysis. The zeros of the discrepancy function determine the identified set, which may be a singleton. The discrepancy function has an adversarial game interpretation: a critic maximizes the discrepancy between data and model features, while a defender minimizes it by adjusting the probability measure of the unobserved heterogeneity. Our approach enables fast computation via linear programming. We use the sample analog of the discrepancy function as a test statistic, and show that it provides asymptotically valid inference for the identified set. Applied to nonlinear panel models with fixed effects, it offers a unified approach for identifying both structural and counterfactual parameters across exogeneity conditions, including strict and sequential, without imposing parametric restrictions on the distribution of error terms or functional form assumptions."}, "https://arxiv.org/abs/2411.04286": {"title": "Bounded Rationality in Central Bank Communication", "link": "https://arxiv.org/abs/2411.04286", "description": "arXiv:2411.04286v1 Announce Type: new \nAbstract: This study explores the influence of FOMC sentiment on market expectations, focusing on cognitive differences between experts and non-experts. Using sentiment analysis of FOMC minutes, we integrate these insights into a bounded rationality model to examine the impact on inflation expectations. Results show that experts form more conservative expectations, anticipating FOMC stabilization actions, while non-experts react more directly to inflation concerns. A lead-lag analysis indicates that institutions adjust faster, though the gap with individual investors narrows in the short term. These findings highlight the need for tailored communication strategies to better align public expectations with policy goals."}, "https://arxiv.org/abs/2411.04310": {"title": "Mediation analysis of community context effects on heart failure using the survival R2D2 prior", "link": "https://arxiv.org/abs/2411.04310", "description": "arXiv:2411.04310v1 Announce Type: new \nAbstract: Congestive heart failure (CHF) is a leading cause of morbidity, mortality and healthcare costs, impacting $>$23 million individuals worldwide. Large electronic health records data provide an opportunity to improve clinical management of diseases, but statistical inference on large amounts of relevant personal data is still a challenge. Thus, accurately identifying influential risk factors is pivotal to reducing the dimensionality of information. Bayesian variable selection in survival regression is a common approach towards solving this problem. In this paper, we propose placing a beta prior directly on the model coefficient of determination (Bayesian $R^2$), which induces a prior on the global variance of the predictors and provides shrinkage. 
Through reparameterization using an auxiliary variable, we are able to update a majority of the parameters with Gibbs sampling, simplifying computation and speeding up convergence. Performance gains over competing variable selection methods are showcased through an extensive simulation study. Finally, the method is applied in a mediation analysis to identify community context attributes impacting time to first congestive heart failure diagnosis of patients enrolled in the University of North Carolina Cardiovascular Device Surveillance Registry. The model has high predictive performance with a C-index of over 0.7, and we find that factors associated with higher socioeconomic inequality and air pollution increase the risk of heart failure."}, "https://arxiv.org/abs/2411.04312": {"title": "Lee Bounds with a Continuous Treatment in Sample Selection", "link": "https://arxiv.org/abs/2411.04312", "description": "arXiv:2411.04312v1 Announce Type: new \nAbstract: Sample selection problems arise when treatment affects both the outcome and the researcher's ability to observe it. This paper generalizes Lee (2009) bounds for the average treatment effect of a binary treatment to a continuous/multivalued treatment. We evaluate the Job Corps program to study the causal effect of training hours on wages. To identify the average treatment effect of always-takers who are selected regardless of the treatment values, we assume that if a subject is selected at some sufficient treatment values, then they remain selected at all treatment values. For example, if program participants are employed with one month of training, then they remain employed with any training hours. This sufficient treatment values assumption includes the monotone assumption on the treatment effect on selection as a special case. We further allow the conditional independence assumption and subjects with different pretreatment covariates to have different sufficient treatment values. The estimation and inference theory utilizes the orthogonal moment function and cross-fitting for double debiased machine learning."}, "https://arxiv.org/abs/2411.04380": {"title": "Identification of Long-Term Treatment Effects via Temporal Links, Observational, and Experimental Data", "link": "https://arxiv.org/abs/2411.04380", "description": "arXiv:2411.04380v1 Announce Type: new \nAbstract: Recent literature proposes combining short-term experimental and long-term observational data to provide credible alternatives to conventional observational studies for identification of long-term average treatment effects (LTEs). I show that experimental data have an auxiliary role in this context. They bring no identifying power without additional modeling assumptions. When modeling assumptions are imposed, experimental data serve to amplify their identifying power. If the assumptions fail, adding experimental data may only yield results that are farther from the truth. Motivated by this, I introduce two assumptions on treatment response that may be defensible based on economic theory or intuition. To utilize them, I develop a novel two-step identification approach that centers on bounding temporal link functions -- the relationship between short-term and mean long-term potential outcomes. 
The approach provides sharp bounds on LTEs for a general class of assumptions, and allows for imperfect experimental compliance -- extending existing results."}, "https://arxiv.org/abs/2411.04411": {"title": "Parsimoniously Fitting Large Multivariate Random Effects in glmmTMB", "link": "https://arxiv.org/abs/2411.04411", "description": "arXiv:2411.04411v1 Announce Type: new \nAbstract: Multivariate random effects with unstructured variance-covariance matrices of large dimensions, $q$, can be a major challenge to estimate. In this paper, we introduce a new implementation of a reduced-rank approach to fit large dimensional multivariate random effects by writing them as a linear combination of $d < q$ latent variables. By adding reduced-rank functionality to the package glmmTMB, we enhance the mixed models available to include random effects of dimensions that were previously not feasible. We apply the reduced-rank random effect to two examples, estimating a generalized latent variable model for multivariate abundance data and a random-slopes model."}, "https://arxiv.org/abs/2411.04450": {"title": "Partial Identification of Distributional Treatment Effects in Panel Data using Copula Equality Assumptions", "link": "https://arxiv.org/abs/2411.04450", "description": "arXiv:2411.04450v1 Announce Type: new \nAbstract: This paper aims to partially identify the distributional treatment effects (DTEs) that depend on the unknown joint distribution of treated and untreated potential outcomes. We construct the DTE bounds using panel data and allow individuals to switch between the treated and untreated states more than once over time. Individuals are grouped based on their past treatment history, and DTEs are allowed to be heterogeneous across different groups. We provide two alternative group-wise copula equality assumptions to bound the unknown joint and the DTEs, both of which leverage information from the past observations. Testability of these two assumptions is also discussed, and test results are presented. We apply this method to study the treatment effect heterogeneity of exercising on adults' body weight. These results demonstrate that our method improves the identification power of the DTE bounds compared to existing methods."}, "https://arxiv.org/abs/2411.04520": {"title": "A Structured Estimator for large Covariance Matrices in the Presence of Pairwise and Spatial Covariates", "link": "https://arxiv.org/abs/2411.04520", "description": "arXiv:2411.04520v1 Announce Type: new \nAbstract: We consider the problem of estimating a high-dimensional covariance matrix from a small number of observations when covariates on pairs of variables are available and the variables can have spatial structure. This is motivated by the problem arising in demography of estimating the covariance matrix of the total fertility rate (TFR) of 195 different countries when only 11 observations are available. We construct an estimator for high-dimensional covariance matrices by exploiting information about pairwise covariates, such as whether pairs of variables belong to the same cluster, or spatial structure of the variables, and interactions between the covariates. We reformulate the problem in terms of a mixed effects model. This requires the estimation of only a small number of parameters, which are easy to interpret and which can be selected using standard procedures. The estimator is consistent under general conditions, and asymptotically normal. 
It works if the mean and variance structure of the data is already specified or if some of the data are missing. We assess its performance under our model assumptions, as well as under model misspecification, using simulations. We find that it outperforms several popular alternatives. We apply it to the TFR dataset and draw some conclusions."}, "https://arxiv.org/abs/2411.04640": {"title": "The role of expansion strategies and operational attributes on hotel performance: a compositional approach", "link": "https://arxiv.org/abs/2411.04640", "description": "arXiv:2411.04640v1 Announce Type: new \nAbstract: This study aims to explore the impact of expansion strategies and specific attributes of hotel establishments on the performance of international hotel chains, focusing on four key performance indicators: RevPAR, efficiency, occupancy, and asset turnover. Data were collected from 255 hotels across various international hotel chains, providing a comprehensive assessment of how different expansion strategies and hotel attributes influence performance. The research employs compositional data analysis (CoDA) to address the methodological limitations of traditional financial ratios in statistical analysis. The findings indicate that ownership-based expansion strategies result in higher operational performance, as measured by revenue per available room, but yield lower economic performance due to the high capital investment required. Non-ownership strategies, such as management contracts and franchising, show superior economic efficiency, offering more flexibility and reduced financial risk. This study contributes to the hospitality management literature by applying CoDA, a novel methodological approach in this field, to examine the performance of different hotel expansion strategies with a sound and more appropriate method. The insights provided can guide hotel managers and investors in making informed decisions to optimize both operational and economic performance."}, "https://arxiv.org/abs/2411.04909": {"title": "Doubly robust inference with censoring unbiased transformations", "link": "https://arxiv.org/abs/2411.04909", "description": "arXiv:2411.04909v1 Announce Type: new \nAbstract: This paper extends doubly robust censoring unbiased transformations to a broad class of censored data structures under the assumption of coarsening at random and positivity. This includes the classic survival and competing risks setting, but also encompasses multiple events. A doubly robust representation for the conditional bias of the transformed data is derived. This leads to rate double robustness and oracle efficiency properties for estimating conditional expectations when combined with cross-fitting and linear smoothers. Simulation studies demonstrate favourable performance of the proposed method relative to existing approaches. An application of the methods to a regression discontinuity design with censored data illustrates its practical utility."}, "https://arxiv.org/abs/2411.04236": {"title": "Differentially Private Finite Population Estimation via Survey Weight Regularization", "link": "https://arxiv.org/abs/2411.04236", "description": "arXiv:2411.04236v1 Announce Type: cross \nAbstract: In general, it is challenging to release differentially private versions of survey-weighted statistics with low error for acceptable privacy loss. 
This is because weighted statistics from complex sample survey data can be more sensitive to individual survey responses and weight values than unweighted statistics, resulting in differentially private mechanisms that can add substantial noise to the unbiased estimate of the finite population quantity. On the other hand, simply disregarding the survey weights adds noise to a biased estimator, which can also result in an inaccurate estimate. Thus, the problem of releasing an accurate survey-weighted estimate essentially involves a trade-off among bias, precision, and privacy. We leverage this trade-off to develop a differentially private method for estimating finite population quantities. The key step is to privately estimate a hyperparameter that determines how much to regularize or shrink survey weights as a function of privacy loss. We illustrate the differentially private finite population estimation using the Panel Study of Income Dynamics. We show that optimal strategies for releasing DP survey-weighted mean income estimates require orders-of-magnitude less noise than naively using the original survey weights without modification."}, "https://arxiv.org/abs/2411.04522": {"title": "Testing for changes in the error distribution in functional linear models", "link": "https://arxiv.org/abs/2411.04522", "description": "arXiv:2411.04522v1 Announce Type: cross \nAbstract: We consider linear models with scalar responses and covariates from a separable Hilbert space. The aim is to detect change points in the error distribution, based on sequential residual empirical distribution functions. Expansions for those estimated functions are more challenging in models with infinite-dimensional covariates than in regression models with scalar or vector-valued covariates due to a slower rate of convergence of the parameter estimators. Yet the suggested change point test is asymptotically distribution-free and consistent for one-change-point alternatives. In the latter case we also show consistency of a change point estimator."}, "https://arxiv.org/abs/2411.04729": {"title": "Conjugate gradient methods for high-dimensional GLMMs", "link": "https://arxiv.org/abs/2411.04729", "description": "arXiv:2411.04729v1 Announce Type: cross \nAbstract: Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high-dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multipartite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. 
Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle."}, "https://arxiv.org/abs/2204.04119": {"title": "Using negative controls to identify causal effects with invalid instrumental variables", "link": "https://arxiv.org/abs/2204.04119", "description": "arXiv:2204.04119v4 Announce Type: replace \nAbstract: Many proposals for the identification of causal effects require an instrumental variable that satisfies strong, untestable unconfoundedness and exclusion restriction assumptions. In this paper, we show how one can potentially identify causal effects under violations of these assumptions by harnessing a negative control population or outcome. This strategy allows one to leverage sub-populations for whom the exposure is degenerate, and requires that the instrument-outcome association satisfies a certain parallel trend condition. We develop the semiparametric efficiency theory for a general instrumental variable model, and obtain a multiply robust, locally efficient estimator of the average treatment effect in the treated. The utility of the estimators is demonstrated in simulation studies and an analysis of the Life Span Study."}, "https://arxiv.org/abs/2305.05934": {"title": "Does Principal Component Analysis Preserve the Sparsity in Sparse Weak Factor Models?", "link": "https://arxiv.org/abs/2305.05934", "description": "arXiv:2305.05934v2 Announce Type: replace \nAbstract: This paper studies the principal component (PC) method-based estimation of weak factor models with sparse loadings. We uncover an intrinsic near-sparsity preservation property for the PC estimators of loadings, which comes from the approximately upper triangular (block) structure of the rotation matrix. It implies an asymmetric relationship among factors: the rotated loadings for a stronger factor can be contaminated by those from a weaker one, but the loadings for a weaker factor are almost free of the impact of those from a stronger one. More importantly, the finding implies that there is no need to use complicated penalties to sparsify the loading estimators. Instead, we adopt a simple screening method to recover the sparsity and construct estimators for various factor strengths. In addition, for sparse weak factor models, we provide a singular value thresholding-based approach to determine the number of factors and establish uniform convergence rates for PC estimators, which complement Bai and Ng (2023). The accuracy and efficiency of the proposed estimators are investigated via Monte Carlo simulations. The application to the FRED-QD dataset reveals the underlying factor strengths and loading sparsity as well as their dynamic features."}, "https://arxiv.org/abs/1602.00856": {"title": "Bayesian Dynamic Quantile Model Averaging", "link": "https://arxiv.org/abs/1602.00856", "description": "arXiv:1602.00856v2 Announce Type: replace-cross \nAbstract: This article introduces a novel dynamic framework for Bayesian model averaging in time-varying parameter quantile regressions. By employing sequential Markov chain Monte Carlo, we combine empirical estimates derived from dynamically chosen quantile regressions, thereby facilitating a comprehensive understanding of the quantile model instabilities. 
The effectiveness of our methodology is initially validated through the examination of simulated datasets and, subsequently, by two applications to the US inflation rates and to the US real estate market. Our empirical findings suggest that a more intricate and nuanced analysis is needed when examining different sub-period regimes, since the determinants of inflation and real estate prices are clearly shown to be time-varying. In conclusion, we suggest that our proposed approach could offer valuable insights to aid decision making in a rapidly changing environment."}, "https://arxiv.org/abs/2411.05215": {"title": "Misclassification of Vaccination Status in Electronic Health Records: A Bayesian Approach in Cluster Randomized Trials", "link": "https://arxiv.org/abs/2411.05215", "description": "arXiv:2411.05215v1 Announce Type: new \nAbstract: Misclassification in binary outcomes is not uncommon, and statistical methods to investigate its impact on policy-driving study results are lacking. While misclassifying binary outcomes is a statistically ubiquitous phenomenon, we focus on misclassification in a public health application: vaccinations. One such study design in public health that addresses policy is the cluster controlled randomized trial (CCRT). A CCRT that measures the impact of a novel behavioral intervention on increasing vaccine uptake can be severely biased when the supporting data are incomplete vaccination records. In particular, these vaccine records may often be prone to negative misclassification; that is, a clinic's record of an individual patient's vaccination status may be unvaccinated when, in reality, this patient was vaccinated outside of the clinic. With large nation-wide endeavors to encourage vaccinations without a gold-standard vaccine record system, sensitivity analyses that incorporate misclassification rates are promising for robust inference. In this work we introduce a novel extension of Bayesian logistic regression where we perturb the clinic size and vaccination count with random draws from expert-elicited prior distributions. These prior distributions represent the misclassification rates for each clinic that stochastically add unvaccinated counts to the observed vaccinated counts. These prior distributions are assigned for each clinic (the first level in a group-level randomized trial). We demonstrate this method with a data application from a CCRT evaluating the influence of a behavioral intervention on vaccination uptake among U.S. veterans. A simulation study is carried out demonstrating its estimation properties."}, "https://arxiv.org/abs/2411.05220": {"title": "Inference for Treatment Effects Conditional on Generalized Principal Strata using Instrumental Variables", "link": "https://arxiv.org/abs/2411.05220", "description": "arXiv:2411.05220v1 Announce Type: new \nAbstract: In a setting with a multi-valued outcome, treatment and instrument, this paper considers the problem of inference for a general class of treatment effect parameters. The class of parameters considered consists of those that can be expressed as the expectation of a function of the response type conditional on a generalized principal stratum. Here, the response type simply refers to the vector of potential outcomes and potential treatments, and a generalized principal stratum is a set of possible values for the response type. 
In addition to instrument exogeneity, the main substantive restriction imposed rules out certain values for the response types in the sense that they are assumed to occur with probability zero. It is shown through a series of examples that this framework includes a wide variety of parameters and assumptions that have been considered in the previous literature. A key result in our analysis is a characterization of the identified set for such parameters under these assumptions in terms of existence of a non-negative solution to linear systems of equations with a special structure. We propose methods for inference exploiting this special structure and recent results in Fang et al. (2023)."}, "https://arxiv.org/abs/2411.05246": {"title": "Caliper Synthetic Matching: Generalized Radius Matching with Local Synthetic Controls", "link": "https://arxiv.org/abs/2411.05246", "description": "arXiv:2411.05246v1 Announce Type: new \nAbstract: Matching promises transparent causal inferences for observational data, making it an intuitive approach for many applications. In practice, however, standard matching methods often perform poorly compared to modern approaches such as response-surface modeling and optimizing balancing weights. We propose Caliper Synthetic Matching (CSM) to address these challenges while preserving simple and transparent matches and match diagnostics. CSM extends Coarsened Exact Matching by incorporating general distance metrics, adaptive calipers, and locally constructed synthetic controls. We show that CSM can be viewed as a monotonic imbalance bounding matching method, so that it inherits the usual bounds on imbalance and bias enjoyed by MIB methods. We further provide a bound on a measure of joint covariate imbalance. Using a simulation study, we illustrate how CSM can even outperform modern matching methods in certain settings, and finally illustrate its use in an empirical example. Overall, we find CSM allows for many of the benefits of matching while avoiding some of the costs."}, "https://arxiv.org/abs/2411.05315": {"title": "Differentiable Calibration of Inexact Stochastic Simulation Models via Kernel Score Minimization", "link": "https://arxiv.org/abs/2411.05315", "description": "arXiv:2411.05315v1 Announce Type: new \nAbstract: Stochastic simulation models are generative models that mimic complex systems to help with decision-making. The reliability of these models heavily depends on well-calibrated input model parameters. However, in many practical scenarios, only output-level data are available to learn the input model parameters, which is challenging due to the often intractable likelihood of the stochastic simulation model. Moreover, stochastic simulation models are frequently inexact, with discrepancies between the model and the target system. No existing methods can effectively learn and quantify the uncertainties of input parameters using only output-level data. In this paper, we propose to learn differentiable input parameters of stochastic simulation models using output-level data via kernel score minimization with stochastic gradient descent. We quantify the uncertainties of the learned input parameters using a frequentist confidence set procedure based on a new asymptotic normality result that accounts for model inexactness. 
The proposed method is evaluated on exact and inexact G/G/1 queueing models."}, "https://arxiv.org/abs/2411.05601": {"title": "Detecting Cointegrating Relations in Non-stationary Matrix-Valued Time Series", "link": "https://arxiv.org/abs/2411.05601", "description": "arXiv:2411.05601v1 Announce Type: new \nAbstract: This paper proposes a Matrix Error Correction Model to identify cointegration relations in matrix-valued time series. We hereby allow separate cointegrating relations along the rows and columns of the matrix-valued time series and use information criteria to select the cointegration ranks. Through Monte Carlo simulations and a macroeconomic application, we demonstrate that our approach provides a reliable estimation of the number of cointegrating relationships."}, "https://arxiv.org/abs/2411.05629": {"title": "Nowcasting distributions: a functional MIDAS model", "link": "https://arxiv.org/abs/2411.05629", "description": "arXiv:2411.05629v1 Announce Type: new \nAbstract: We propose a functional MIDAS model to leverage high-frequency information for forecasting and nowcasting distributions observed at a lower frequency. We approximate the low-frequency distribution using Functional Principal Component Analysis and consider a group lasso spike-and-slab prior to identify the relevant predictors in the finite-dimensional SUR-MIDAS approximation of the functional MIDAS model. In our application, we use the model to nowcast the U.S. households' income distribution. Our findings indicate that the model enhances forecast accuracy for the entire target distribution and for key features of the distribution that signal changes in inequality."}, "https://arxiv.org/abs/2411.05695": {"title": "Firm Heterogeneity and Macroeconomic Fluctuations: a Functional VAR model", "link": "https://arxiv.org/abs/2411.05695", "description": "arXiv:2411.05695v1 Announce Type: new \nAbstract: We develop a Functional Augmented Vector Autoregression (FunVAR) model to explicitly incorporate firm-level heterogeneity observed in more than one dimension and study its interaction with aggregate macroeconomic fluctuations. Our methodology employs dimensionality reduction techniques for tensor data objects to approximate the joint distribution of firm-level characteristics. More broadly, our framework can be used for assessing predictions from structural models that account for micro-level heterogeneity observed on multiple dimensions. Leveraging firm-level data from the Compustat database, we use the FunVAR model to analyze the propagation of total factor productivity (TFP) shocks, examining their impact on both macroeconomic aggregates and the cross-sectional distribution of capital and labor across firms."}, "https://arxiv.org/abs/2411.05324": {"title": "SASWISE-UE: Segmentation and Synthesis with Interpretable Scalable Ensembles for Uncertainty Estimation", "link": "https://arxiv.org/abs/2411.05324", "description": "arXiv:2411.05324v1 Announce Type: cross \nAbstract: This paper introduces an efficient sub-model ensemble framework aimed at enhancing the interpretability of medical deep learning models, thus increasing their clinical applicability. By generating uncertainty maps, this framework enables end-users to evaluate the reliability of model outputs. We developed a strategy to develop diverse models from a single well-trained checkpoint, facilitating the training of a model family. 
This involves producing multiple outputs from a single input, fusing them into a final output, and estimating uncertainty based on output disagreements. Implemented using U-Net and UNETR models for segmentation and synthesis tasks, this approach was tested on CT body segmentation and MR-CT synthesis datasets. It achieved a mean Dice coefficient of 0.814 in segmentation and a Mean Absolute Error of 88.17 HU in synthesis, improved from 89.43 HU by pruning. Additionally, the framework was evaluated under corruption and undersampling, maintaining correlation between uncertainty and error, which highlights its robustness. These results suggest that the proposed approach not only maintains the performance of well-trained models but also enhances interpretability through effective uncertainty estimation, applicable to both convolutional and transformer models in a range of imaging tasks."}, "https://arxiv.org/abs/2411.05580": {"title": "Increasing power and robustness in screening trials by testing stored specimens in the control arm", "link": "https://arxiv.org/abs/2411.05580", "description": "arXiv:2411.05580v1 Announce Type: cross \nAbstract: Background: Screening trials require large sample sizes and long time-horizons to demonstrate mortality reductions. We recently proposed increasing statistical power by testing stored control-arm specimens, called the Intended Effect (IE) design. To evaluate feasibility of the IE design, the US National Cancer Institute (NCI) is collecting blood specimens in the control-arm of the NCI Vanguard Multicancer Detection pilot feasibility trial. However, key assumptions of the IE design require more investigation and relaxation. Methods: We relax the IE design to (1) reduce costs by testing only a stratified sample of control-arm specimens by incorporating inverse-probability sampling weights, (2) correct for potential loss-of-signal in stored control-arm specimens, and (3) correct for non-compliance with control-arm specimen collections. We also examine sensitivity to unintended effects of screening. Results: In simulations, testing all primary-outcome control-arm specimens and a 50% sample of the rest maintains nearly all the power of the IE while only testing half the control-arm specimens. Power remains increased from the IE analysis (versus the standard analysis) even if unintended effects exist. The IE design is robust to some loss-of-signal scenarios, but otherwise requires retest-positive fractions that correct bias at a small loss of power. The IE can be biased and lose power under control-arm non-compliance scenarios, but corrections correct bias and can increase power. Conclusions: The IE design can be made more cost-efficient and robust to loss-of-signal. Unintended effects will not typically reduce the power gain over the standard trial design. Non-compliance with control-arm specimen collections can cause bias and loss of power that can be mitigated by corrections. Although promising, practical experience with the IE design in screening trials is necessary."}, "https://arxiv.org/abs/2411.05625": {"title": "Cross-validating causal discovery via Leave-One-Variable-Out", "link": "https://arxiv.org/abs/2411.05625", "description": "arXiv:2411.05625v1 Announce Type: cross \nAbstract: We propose a new approach to falsify causal discovery algorithms without ground truth, which is based on testing the causal model on a pair of variables that has been dropped when learning the causal model. 
To this end, we use the \"Leave-One-Variable-Out (LOVO)\" prediction where $Y$ is inferred from $X$ without any joint observations of $X$ and $Y$, given only training data from $X,Z_1,\\dots,Z_k$ and from $Z_1,\\dots,Z_k,Y$. We demonstrate that causal models on the two subsets, in the form of Acyclic Directed Mixed Graphs (ADMGs), often entail conclusions on the dependencies between $X$ and $Y$, enabling this type of prediction. The prediction error can then be estimated since the joint distribution $P(X, Y)$ is assumed to be available, and $X$ and $Y$ have only been omitted for the purpose of falsification. After presenting this graphical method, which is applicable to general causal discovery algorithms, we illustrate how to construct a LOVO predictor tailored towards algorithms relying on specific a priori assumptions, such as linear additive noise models. Simulations indicate that the LOVO prediction error is indeed correlated with the accuracy of the causal outputs, affirming the method's effectiveness."}, "https://arxiv.org/abs/2411.05758": {"title": "On the limiting variance of matching estimators", "link": "https://arxiv.org/abs/2411.05758", "description": "arXiv:2411.05758v1 Announce Type: cross \nAbstract: This paper examines the limiting variance of nearest neighbor matching estimators for average treatment effects with a fixed number of matches. We present, for the first time, a closed-form expression for this limit. Here the key is the establishment of the limiting second moment of the catchment area's volume, which resolves a question of Abadie and Imbens. At the core of our approach is a new universality theorem on the measures of high-order Voronoi cells, extending a result by Devroye, Gy\\\"orfi, Lugosi, and Walk."}, "https://arxiv.org/abs/2302.10160": {"title": "Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift", "link": "https://arxiv.org/abs/2302.10160", "description": "arXiv:2302.10160v3 Announce Type: replace \nAbstract: We develop and analyze a principled approach to kernel ridge regression under covariate shift. The goal is to learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution. We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model. We use the latter to fill the missing labels and then select the best candidate accordingly. Our non-asymptotic excess risk bounds demonstrate that our estimator adapts effectively to both the structure of the target distribution and the covariate shift. This adaptation is quantified through a notion of effective sample size that reflects the value of labeled source data for the target regression task. Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance."}, "https://arxiv.org/abs/2309.16861": {"title": "Demystifying Spatial Confounding", "link": "https://arxiv.org/abs/2309.16861", "description": "arXiv:2309.16861v2 Announce Type: replace \nAbstract: Spatial confounding is a fundamental issue in spatial regression models which arises because spatial random effects, included to approximate unmeasured spatial variation, are typically not independent of covariates in the model. This can lead to significant bias in covariate effect estimates. 
The problem is complex and has been the topic of extensive research with sometimes puzzling and seemingly contradictory results. Here, we develop a broad theoretical framework that brings mathematical clarity to the mechanisms of spatial confounding, providing explicit analytical expressions for the resulting bias. We see that the problem is directly linked to spatial smoothing and identify exactly how the size and occurrence of bias relate to the features of the spatial model as well as the underlying confounding scenario. Using our results, we can explain subtle and counter-intuitive behaviours. Finally, we propose a general approach for dealing with spatial confounding bias in practice, applicable for any spatial model specification. When a covariate has non-spatial information, we show that a general form of the so-called spatial+ method can be used to eliminate bias. When no such information is present, the situation is more challenging but, under the assumption of unconfounded high frequencies, we develop a procedure in which multiple capped versions of spatial+ are applied to assess the bias in this case. We illustrate our approach with an application to air temperature in Germany."}, "https://arxiv.org/abs/2411.05839": {"title": "Simulation Studies For Goodness-of-Fit and Two-Sample Methods For Univariate Data", "link": "https://arxiv.org/abs/2411.05839", "description": "arXiv:2411.05839v1 Announce Type: new \nAbstract: We present the results of a large number of simulation studies regarding the power of various goodness-of-fit as well as nonparametric two-sample tests for univariate data. This includes both continuous and discrete data. In general, no single method can be relied upon to provide good power; any one method may be quite good for some combination of null hypothesis and alternative and may fail badly for another. Based on the results of these studies, we propose a fairly small number of methods chosen such that for any of the case studies included here at least one of the methods has good power.\n The studies were carried out using the R packages R2sample and Rgof, available from CRAN."}, "https://arxiv.org/abs/2411.06129": {"title": "Rational Expectations Nonparametric Empirical Bayes", "link": "https://arxiv.org/abs/2411.06129", "description": "arXiv:2411.06129v1 Announce Type: new \nAbstract: We examine the uniqueness of the posterior distribution within an Empirical Bayes framework using a discretized prior. To achieve this, we impose Rational Expectations conditions on the prior, focusing on coherence and stability properties. We derive the conditions necessary for posterior uniqueness when observations are drawn from either discrete or continuous distributions. Additionally, we discuss the properties of our discretized prior as an approximation of the true underlying prior."}, "https://arxiv.org/abs/2411.06150": {"title": "It's About Time: What A/B Test Metrics Estimate", "link": "https://arxiv.org/abs/2411.06150", "description": "arXiv:2411.06150v1 Announce Type: new \nAbstract: Online controlled experiments, or A/B tests, are large-scale randomized trials in digital environments. This paper investigates the estimands of the difference-in-means estimator in these experiments, focusing on scenarios with repeated measurements on users. We compare cumulative metrics that use all post-exposure data for each user to windowed metrics that measure each user over a fixed time window. We analyze the estimands and highlight trade-offs between the two types of metrics. 
Our findings reveal that while cumulative metrics eliminate the need for pre-defined measurement windows, they result in estimands that are more intricately tied to the experiment intake and runtime. This complexity can lead to counter-intuitive practical consequences, such as decreased statistical power with more observations. However, cumulative metrics offer earlier results and can quickly detect strong initial signals. We conclude that neither metric type is universally superior. The optimal choice depends on the temporal profile of the treatment effect, the distribution of exposure, and the stopping time of the experiment. This research provides insights for experimenters to make informed decisions about how to define metrics based on their specific experimental contexts and objectives."}, "https://arxiv.org/abs/2411.06298": {"title": "Efficient subsampling for high-dimensional data", "link": "https://arxiv.org/abs/2411.06298", "description": "arXiv:2411.06298v1 Announce Type: new \nAbstract: In the field of big data analytics, the search for efficient subdata selection methods that enable robust statistical inferences with minimal computational resources is of high importance. A procedure prior to subdata selection could perform variable selection, as only a subset of a large number of variables is active. We propose an approach when both the size of the full dataset and the number of variables are large. This approach first identifies the active variables by applying a procedure inspired by random LASSO (Least Absolute Shrinkage and Selection Operator) and then selects subdata based on leverage scores to build a predictive model. Our proposed approach outperforms approaches that already exist in the current literature, including the usage of the full dataset, in both variable selection and prediction, while also exhibiting significant improvements in computing time. Simulation experiments as well as a real data application are provided."}, "https://arxiv.org/abs/2411.06327": {"title": "Return-forecasting and Volatility-forecasting Power of On-chain Activities in the Cryptocurrency Market", "link": "https://arxiv.org/abs/2411.06327", "description": "arXiv:2411.06327v1 Announce Type: new \nAbstract: We investigate the return-forecasting and volatility-forecasting power of intraday on-chain flow data for BTC, ETH, and USDT, and the associated option strategies. First, we find that USDT net inflow into cryptocurrency exchanges positively forecasts future returns of both BTC and ETH, with the strongest effect at the 1-hour frequency. Second, we find that ETH net inflow into cryptocurrency exchanges negatively forecasts future returns of ETH. Third, we find that BTC net inflow into cryptocurrency exchanges does not significantly forecast future returns of BTC. Finally, we confirm that selling 0DTE ETH call options is a profitable trading strategy when the net inflow into cryptocurrency exchanges is high. Our study lends new insights into the emerging literature that studies the on-chain activities and their asset-pricing impact in the cryptocurrency market."}, "https://arxiv.org/abs/2411.06342": {"title": "Stabilized Inverse Probability Weighting via Isotonic Calibration", "link": "https://arxiv.org/abs/2411.06342", "description": "arXiv:2411.06342v1 Announce Type: new \nAbstract: Inverse weighting with an estimated propensity score is widely used by estimation methods in causal inference to adjust for confounding bias. 
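The two-stage subdata idea in the "Efficient subsampling for high-dimensional data" entry above (screen variables, then subsample by leverage) can be sketched as follows; plain LassoCV stands in for the random-LASSO-inspired screening step, and the data, subdata size, and tuning choices are hypothetical.

```python
# Sketch of the two-stage subdata idea: screen active variables with a
# LASSO-type step, then keep the rows with the largest leverage scores
# (computed on the screened design) and fit the final model on them.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p, k = 10000, 200, 500                  # full size, variables, subdata size
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]
y = X @ beta + rng.normal(size=n)

# Stage 1: variable screening (plain LassoCV stands in for random LASSO).
active = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_ != 0)
Xa = X[:, active]

# Stage 2: leverage scores h_i are the squared row norms of Q from a thin QR.
Q, _ = np.linalg.qr(Xa)
leverage = np.sum(Q ** 2, axis=1)
keep = np.argsort(leverage)[-k:]           # k highest-leverage rows

model = LinearRegression().fit(Xa[keep], y[keep])
print("screened variables:", active)
print("subdata coefficient estimates:", np.round(model.coef_, 2))
```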
However, directly inverting propensity score estimates can lead to instability, bias, and excessive variability due to large inverse weights, especially when treatment overlap is limited. In this work, we propose a post-hoc calibration algorithm for inverse propensity weights that generates well-calibrated, stabilized weights from user-supplied, cross-fitted propensity score estimates. Our approach employs a variant of isotonic regression with a loss function specifically tailored to the inverse propensity weights. Through theoretical analysis and empirical studies, we demonstrate that isotonic calibration improves the performance of doubly robust estimators of the average treatment effect."}, "https://arxiv.org/abs/2411.06591": {"title": "Analysis of spatially clustered survival data with unobserved covariates using SBART", "link": "https://arxiv.org/abs/2411.06591", "description": "arXiv:2411.06591v1 Announce Type: new \nAbstract: Usual parametric and semi-parametric regression methods are inappropriate and inadequate for large clustered survival studies when the appropriate functional forms of the covariates and their interactions in hazard functions are unknown, and random cluster effects and cluster-level covariates are spatially correlated. We present a general nonparametric method for such studies under the Bayesian ensemble learning paradigm called Soft Bayesian Additive Regression Trees. Our methodological and computational challenges include a large number of clusters, variable cluster sizes, and proper statistical augmentation of the unobservable cluster-level covariate using a data registry different from the main survival study. We use an innovative 3-step approach based on latent variables to address our computational challenges. We illustrate our method and its advantages over existing methods by assessing the impacts of interventions on some county-level and patient-level covariates to mitigate existing racial disparity in breast cancer survival in 67 Florida counties (clusters) using two different data resources. The Florida Cancer Registry (FCR) is used to obtain clustered survival data with patient-level covariates, and the Behavioral Risk Factor Surveillance Survey (BRFSS) is used to obtain further information on an unobservable county-level covariate of Screening Mammography Utilization (SMU)."}, "https://arxiv.org/abs/2411.06716": {"title": "Parameter Estimation for Partially Observed McKean-Vlasov Diffusions", "link": "https://arxiv.org/abs/2411.06716", "description": "arXiv:2411.06716v1 Announce Type: new \nAbstract: In this article we consider likelihood-based estimation of static parameters for a class of partially observed McKean-Vlasov (POMV) diffusion processes with discrete-time observations over a fixed time interval. In particular, using the framework of [5] we develop a new randomized multilevel Monte Carlo method for estimating the parameters, based upon Markovian stochastic approximation methodology. New Markov chain Monte Carlo algorithms for the POMV model are introduced, facilitating the application of [5]. We prove, under assumptions, that our estimator is biased, but that the bias is small and controllable. 
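A minimal sketch of the isotonic-calibration idea for inverse propensity weights (the "Stabilized Inverse Probability Weighting via Isotonic Calibration" entry above) follows; it uses scikit-learn's standard squared-error isotonic regression rather than the tailored loss described in the abstract, and the propensity estimates and data-generating process are made up for illustration.

```python
# Minimal sketch of post-hoc isotonic calibration of propensity scores:
# regress the treatment indicator on cross-fitted propensity estimates with
# isotonic regression, then invert the calibrated scores to form stabilized
# weights. (The paper tailors the isotonic loss to the inverse weights; the
# standard squared-error isotonic fit below is a stand-in.)
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-1.5 * X))        # true propensity
A = rng.binomial(1, p_true)                    # treatment indicator
pi_hat = np.clip(p_true ** 1.8, 0.01, 0.99)    # miscalibrated cross-fitted estimates (assumed given)

# Isotonic calibration: a monotone map from raw scores to calibrated scores.
iso = IsotonicRegression(y_min=1e-3, y_max=1 - 1e-3, out_of_bounds="clip")
pi_cal = iso.fit_transform(pi_hat, A)

# Inverse probability weights before and after calibration.
w_raw = A / pi_hat + (1 - A) / (1 - pi_hat)
w_cal = A / pi_cal + (1 - A) / (1 - pi_cal)
print("max raw weight:", round(w_raw.max(), 1), " max calibrated weight:", round(w_cal.max(), 1))
```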
Our approach is implemented on several examples."}, "https://arxiv.org/abs/2411.06913": {"title": "BudgetIV: Optimal Partial Identification of Causal Effects with Mostly Invalid Instruments", "link": "https://arxiv.org/abs/2411.06913", "description": "arXiv:2411.06913v1 Announce Type: new \nAbstract: Instrumental variables (IVs) are widely used to estimate causal effects in the presence of unobserved confounding between exposure and outcome. An IV must affect the outcome exclusively through the exposure and be unconfounded with the outcome. We present a framework for relaxing either or both of these strong assumptions with tuneable and interpretable budget constraints. Our algorithm returns a feasible set of causal effects that can be identified exactly given relevant covariance parameters. The feasible set may be disconnected but is a finite union of convex subsets. We discuss conditions under which this set is sharp, i.e., contains all and only effects consistent with the background assumptions and the joint distribution of observable variables. Our method applies to a wide class of semiparametric models, and we demonstrate how its ability to select specific subsets of instruments confers an advantage over convex relaxations in both linear and nonlinear settings. We also adapt our algorithm to form confidence sets that are asymptotically valid under a common statistical assumption from the Mendelian randomization literature."}, "https://arxiv.org/abs/2411.07028": {"title": "Estimating abilities with an Elo-informed growth model", "link": "https://arxiv.org/abs/2411.07028", "description": "arXiv:2411.07028v1 Announce Type: new \nAbstract: An intelligent tutoring system (ITS) aims to provide instructions and exercises tailored to the ability of a student. To do this, the ITS needs to estimate the ability based on student input. Rather than including frequent full-scale tests to update our ability estimate, we want to base estimates on the outcomes of practice exercises that are part of the learning process. A challenge with this approach is that the ability changes as the student learns, which makes traditional item response theory (IRT) models inappropriate. Most IRT models estimate an ability based on a test result, and assume that the ability is constant throughout a test.\n We review some existing methods for measuring abilities that change throughout the measurement period, and propose a new method which we call the Elo-informed growth model. This method assumes that the abilities for a group of respondents who are all in the same stage of the learning process follow a distribution that can be estimated. The method does not assume a particular shape of the growth curve. It performs better than the standard Elo algorithm when the measured outcomes are far apart in time, or when the ability change is rapid."}, "https://arxiv.org/abs/2411.07221": {"title": "Self-separated and self-connected models for mediator and outcome missingness in mediation analysis", "link": "https://arxiv.org/abs/2411.07221", "description": "arXiv:2411.07221v1 Announce Type: new \nAbstract: Missing data is a common problem that challenges the study of effects of treatments. In the context of mediation analysis, this paper addresses missingness in the two key variables, mediator and outcome, focusing on identification. 
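For context on the baseline that the "Estimating abilities with an Elo-informed growth model" entry above compares against, here is a minimal sketch of a standard Elo-style ability update with a logistic success model; the K-factor, difficulty scale, and ability drift are illustrative assumptions.

```python
# Sketch of the standard Elo-style update used as a baseline: after each
# practice exercise, the student's tracked ability moves toward the observed
# outcome by K times the prediction error of a logistic success model.
import numpy as np

def expected_score(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def elo_update(theta, b, outcome, k=0.4):
    """Return the updated ability after observing outcome in {0, 1}."""
    return theta + k * (outcome - expected_score(theta, b))

rng = np.random.default_rng(4)
theta_hat, theta_true = 0.0, -1.0
difficulties = rng.normal(0.0, 1.0, 200)

for b in difficulties:
    theta_true += 0.01                          # the student slowly improves
    outcome = rng.binomial(1, expected_score(theta_true, b))
    theta_hat = elo_update(theta_hat, b, outcome)

print("final tracked ability:", round(theta_hat, 2), "true ability:", round(theta_true, 2))
```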
We consider self-separated missingness models where identification is achieved by conditional independence assumptions only and self-connected missingness models where identification relies on so-called shadow variables. The first class is somewhat limited as it is constrained by the need to remove a certain number of connections from the model. The second class turns out to include substantial variation in the position of the shadow variable in the causal structure (vis-a-vis the mediator and outcome) and the corresponding implications for the model. In constructing the models, to improve plausibility, we pay close attention to allowing, where possible, dependencies due to unobserved causes of the missingness. In this exploration, we develop theory where needed. This results in templates for identification in this mediation setting, generally useful identification techniques, and perhaps most significantly, synthesis and substantial expansion of shadow variable theory."}, "https://arxiv.org/abs/2411.05869": {"title": "Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data", "link": "https://arxiv.org/abs/2411.05869", "description": "arXiv:2411.05869v1 Announce Type: cross \nAbstract: The Gaussian process (GP) is a widely used probabilistic machine learning method for stochastic function approximation, stochastic modeling, and analyzing real-world measurements of nonlinear processes. Unlike many other machine learning methods, GPs include an implicit characterization of uncertainty, making them extremely useful across many areas of science, technology, and engineering. Traditional implementations of GPs involve stationary kernels (also termed covariance functions) that limit their flexibility and exact methods for inference that prevent application to data sets with more than about ten thousand points. Modern approaches to address stationarity assumptions generally fail to accommodate large data sets, while all attempts to address scalability focus on approximating the Gaussian likelihood, which can involve subjectivity and lead to inaccuracies. In this work, we explicitly derive an alternative kernel that can discover and encode both sparsity and nonstationarity. We embed the kernel within a fully Bayesian GP model and leverage high-performance computing resources to enable the analysis of massive data sets. We demonstrate the favorable performance of our novel kernel relative to existing exact and approximate GP methods across a variety of synthetic data examples. Furthermore, we conduct space-time prediction based on more than one million measurements of daily maximum temperature and verify that our results outperform state-of-the-art methods in the Earth sciences. More broadly, having access to exact GPs that use ultra-scalable, sparsity-discovering, nonstationary kernels allows GP methods to truly compete with a wide variety of machine learning methods."}, "https://arxiv.org/abs/2411.05870": {"title": "An Adaptive Online Smoother with Closed-Form Solutions and Information-Theoretic Lag Selection for Conditional Gaussian Nonlinear Systems", "link": "https://arxiv.org/abs/2411.05870", "description": "arXiv:2411.05870v1 Announce Type: cross \nAbstract: Data assimilation (DA) combines partial observations with a dynamical model to improve state estimation. Filter-based DA uses only past and present data and is the prerequisite for real-time forecasts. Smoother-based DA exploits both past and future observations. 
It aims to fill in missing data, provide more accurate estimations, and develop high-quality datasets. However, the standard smoothing procedure requires using all historical state estimations, which is storage-demanding, especially for high-dimensional systems. This paper develops an adaptive-lag online smoother for a large class of complex dynamical systems with strong nonlinear and non-Gaussian features, which has important applications to many real-world problems. The adaptive lag allows the DA to utilize only observations within a nearby window, significantly reducing computational storage. Online lag adjustment is essential for tackling turbulent systems, where temporal autocorrelation varies significantly over time due to intermittency, extreme events, and nonlinearity. Based on the uncertainty reduction in the estimated state, an information criterion is developed to systematically determine the adaptive lag. Notably, the mathematical structure of these systems facilitates the use of closed analytic formulae to calculate the online smoother and the adaptive lag, avoiding empirical tunings as in ensemble-based DA methods. The adaptive online smoother is applied to studying three important scientific problems. First, it helps detect online causal relationships between state variables. Second, its advantage of computational storage is illustrated via Lagrangian DA, a high-dimensional nonlinear problem. Finally, the adaptive smoother advances online parameter estimation with partial observations, emphasizing the role of the observed extreme events in accelerating convergence."}, "https://arxiv.org/abs/2411.06140": {"title": "Deep Nonparametric Conditional Independence Tests for Images", "link": "https://arxiv.org/abs/2411.06140", "description": "arXiv:2411.06140v1 Announce Type: cross \nAbstract: Conditional independence tests (CITs) test for conditional dependence between random variables. As existing CITs are limited in their applicability to complex, high-dimensional variables such as images, we introduce deep nonparametric CITs (DNCITs). The DNCITs combine embedding maps, which extract feature representations of high-dimensional variables, with nonparametric CITs applicable to these feature representations. For the embedding maps, we derive general properties on their parameter estimators to obtain valid DNCITs and show that these properties include embedding maps learned through (conditional) unsupervised or transfer learning. For the nonparametric CITs, appropriate tests are selected and adapted to be applicable to feature representations. Through simulations, we investigate the performance of the DNCITs for different embedding maps and nonparametric CITs under varying confounder dimensions and confounder relationships. We apply the DNCITs to brain MRI scans and behavioral traits, given confounders, of healthy individuals from the UK Biobank (UKB), confirming null results from a number of ambiguous personality neuroscience studies with a larger data set and with our more powerful tests. In addition, in a confounder control study, we apply the DNCITs to brain MRI scans and a confounder set to test for sufficient confounder control, leading to a potential reduction in the confounder dimension under improved confounder control compared to existing state-of-the-art confounder control studies for the UKB. 
Finally, we provide an R package implementing the DNCITs."}, "https://arxiv.org/abs/2411.06324": {"title": "Amortized Bayesian Local Interpolation NetworK: Fast covariance parameter estimation for Gaussian Processes", "link": "https://arxiv.org/abs/2411.06324", "description": "arXiv:2411.06324v1 Announce Type: cross \nAbstract: Gaussian processes (GPs) are a ubiquitous tool for geostatistical modeling with high levels of flexibility and interpretability, and the ability to make predictions at unseen spatial locations through a process called Kriging. Estimation of Kriging weights relies on the inversion of the process' covariance matrix, creating a computational bottleneck for large spatial datasets. In this paper, we propose an Amortized Bayesian Local Interpolation NetworK (A-BLINK) for fast covariance parameter estimation, which uses two pre-trained deep neural networks to learn a mapping from spatial location coordinates and covariance function parameters to Kriging weights and the spatial variance, respectively. The fast prediction time of these networks allows us to bypass the matrix inversion step, creating large computational speedups over competing methods in both frequentist and Bayesian settings, and also provides full posterior inference and predictions using Markov chain Monte Carlo sampling methods. We show significant increases in computational efficiency over comparable scalable GP methodology in an extensive simulation study with lower parameter estimation error. The efficacy of our approach is also demonstrated using a temperature dataset of US climate normals for 1991--2020 based on over 7,000 weather stations."}, "https://arxiv.org/abs/2411.06518": {"title": "Causal Representation Learning from Multimodal Biological Observations", "link": "https://arxiv.org/abs/2411.06518", "description": "arXiv:2411.06518v1 Announce Type: cross \nAbstract: Prevalent in biological applications (e.g., human phenotype measurements), multimodal datasets can provide valuable insights into the underlying biological mechanisms. However, current machine learning models designed to analyze such datasets still lack interpretability and theoretical guarantees, which are essential to biological applications. Recent advances in causal representation learning have shown promise in uncovering the interpretable latent causal variables with formal theoretical certificates. Unfortunately, existing works for multimodal distributions either rely on restrictive parametric assumptions or provide rather coarse identification results, limiting their applicability to biological research which favors a detailed understanding of the mechanisms.\n In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biological datasets. Theoretically, we consider a flexible nonparametric latent distribution (c.f., parametric assumptions in prior work) permitting causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work. Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities, which, as we will discuss, is natural for a large collection of biological systems. Empirically, we propose a practical framework to instantiate our theoretical insights. 
We demonstrate the effectiveness of our approach through extensive experiments on both numerical and synthetic datasets. Results on a real-world human phenotype dataset are consistent with established medical research, validating our theoretical and methodological framework."}, "https://arxiv.org/abs/2411.06593": {"title": "Algebraic and Statistical Properties of the Partially Regularized Ordinary Least Squares Interpolator", "link": "https://arxiv.org/abs/2411.06593", "description": "arXiv:2411.06593v1 Announce Type: cross \nAbstract: Modern deep learning has revealed a surprising statistical phenomenon known as benign overfitting, with high-dimensional linear regression being a prominent example. This paper contributes to ongoing research on the ordinary least squares (OLS) interpolator, focusing on the partial regression setting, where only a subset of coefficients is implicitly regularized. On the algebraic front, we extend Cochran's formula and the leave-one-out residual formula for the partial regularization framework. On the stochastic front, we leverage our algebraic results to design several homoskedastic variance estimators under the Gauss-Markov model. These estimators serve as a basis for conducting statistical inference, albeit with slight conservatism in their performance. Through simulations, we study the finite-sample properties of these variance estimators across various generative models."}, "https://arxiv.org/abs/2411.06631": {"title": "SequentialSamplingModels", "link": "https://arxiv.org/abs/2411.06631", "description": "arXiv:2411.06631v1 Announce Type: cross \nAbstract: Sequential sampling models (SSMs) are a widely used framework describing decision-making as a stochastic, dynamic process of evidence accumulation. SSMs' popularity across cognitive science has driven the development of various software packages that lower the barrier for simulating, estimating, and comparing existing SSMs. Here, we present a software tool, SequentialSamplingModels.jl (SSM.jl), designed to make SSM simulations more accessible to Julia users, and to integrate with the Julia ecosystem. We demonstrate the basic use of SSM.jl for simulation, plotting, and Bayesian inference."}, "https://arxiv.org/abs/2208.10974": {"title": "Beta-Sorted Portfolios", "link": "https://arxiv.org/abs/2208.10974", "description": "arXiv:2208.10974v3 Announce Type: replace \nAbstract: Beta-sorted portfolios -- portfolios comprised of assets with similar covariation to selected risk factors -- are a popular tool in empirical finance to analyze models of (conditional) expected returns. Despite their widespread use, little is known of their statistical properties in contrast to comparable procedures such as two-pass regressions. We formally investigate the properties of beta-sorted portfolio returns by casting the procedure as a two-step nonparametric estimator with a nonparametric first step and a beta-adaptive portfolio construction. Our framework rationalizes the well-known estimation algorithm with precise economic and statistical assumptions on the general data generating process and characterizes its key features. We study beta-sorted portfolios for both a single cross-section as well as for aggregation over time (e.g., the grand mean), offering conditions that ensure consistency and asymptotic normality along with new uniform inference procedures allowing for uncertainty quantification and testing of various relevant hypotheses in financial applications. 
We also highlight some limitations of current empirical practices and discuss what inferences can and cannot be drawn from returns to beta-sorted portfolios for either a single cross-section or across the whole sample. Finally, we illustrate the functionality of our new procedures in an empirical application."}, "https://arxiv.org/abs/2212.08446": {"title": "Score function-based tests for ultrahigh-dimensional linear models", "link": "https://arxiv.org/abs/2212.08446", "description": "arXiv:2212.08446v2 Announce Type: replace \nAbstract: In this paper, we investigate score function-based tests to check the significance of an ultrahigh-dimensional sub-vector of the model coefficients when the nuisance parameter vector is also ultrahigh-dimensional in linear models. We first reanalyze and extend a recently proposed score function-based test to derive, under weaker conditions, its limiting distributions under the null and local alternative hypotheses. As it may fail to work when the correlation between testing covariates and nuisance covariates is high, we propose an orthogonalized score function-based test with two merits: debiasing to make the non-degenerate error term degenerate and reducing the asymptotic variance to enhance power performance. Simulations evaluate the finite-sample performances of the proposed tests, and a real data analysis illustrates its application."}, "https://arxiv.org/abs/2304.00874": {"title": "A mixture transition distribution modeling for higher-order circular Markov processes", "link": "https://arxiv.org/abs/2304.00874", "description": "arXiv:2304.00874v5 Announce Type: replace \nAbstract: The stationary higher-order Markov process for circular data is considered. We employ the mixture transition distribution (MTD) model to express the transition density of the process on the circle. The underlying circular transition distribution is based on Wehrly and Johnson's bivariate joint circular models. The structures of the circular autocorrelation function together with the circular partial autocorrelation function are found to be similar to those of the autocorrelation and partial autocorrelation functions of the real-valued autoregressive process when the underlying binding density has zero sine moments. The validity of the model is assessed by applying it to some Monte Carlo simulations and real directional data."}, "https://arxiv.org/abs/2306.02711": {"title": "Truly Multivariate Structured Additive Distributional Regression", "link": "https://arxiv.org/abs/2306.02711", "description": "arXiv:2306.02711v2 Announce Type: replace \nAbstract: Generalized additive models for location, scale and shape (GAMLSS) are a popular extension to mean regression models where each parameter of an arbitrary distribution is modelled through covariates. While such models have been developed for univariate and bivariate responses, the truly multivariate case remains extremely challenging for both computational and theoretical reasons. Alternative approaches to GAMLSS may allow for higher dimensional response vectors to be modelled jointly but often assume a fixed dependence structure not depending on covariates or are limited with respect to modelling flexibility or computational aspects. We contribute to this gap in the literature and propose a truly multivariate distributional model, which allows one to benefit from the flexibility of GAMLSS even when the response has dimension larger than two or three. 
Building on copula regression, we model the dependence structure of the response through a Gaussian copula, while the marginal distributions can vary across components. Our model is highly parameterized but estimation becomes feasible with Bayesian inference employing shrinkage priors. We demonstrate the competitiveness of our approach in a simulation study and illustrate how it complements existing models along the examples of childhood malnutrition and a yet unexplored data set on traffic detection in Berlin."}, "https://arxiv.org/abs/2306.16402": {"title": "Guidance on Individualized Treatment Rule Estimation in High Dimensions", "link": "https://arxiv.org/abs/2306.16402", "description": "arXiv:2306.16402v2 Announce Type: replace \nAbstract: Individualized treatment rules, cornerstones of precision medicine, inform patient treatment decisions with the goal of optimizing patient outcomes. These rules are generally unknown functions of patients' pre-treatment covariates, meaning they must be estimated from clinical or observational study data. Myriad methods have been developed to learn these rules, and these procedures are demonstrably successful in traditional asymptotic settings with moderate number of covariates. The finite-sample performance of these methods in high-dimensional covariate settings, which are increasingly the norm in modern clinical trials, has not been well characterized, however. We perform a comprehensive comparison of state-of-the-art individualized treatment rule estimators, assessing performance on the basis of the estimators' accuracy, interpretability, and computational efficacy. Sixteen data-generating processes with continuous outcomes and binary treatment assignments are considered, reflecting a diversity of randomized and observational studies. We summarize our findings and provide succinct advice to practitioners needing to estimate individualized treatment rules in high dimensions. All code is made publicly available, facilitating modifications and extensions to our simulation study. A novel pre-treatment covariate filtering procedure is also proposed and is shown to improve estimators' accuracy and interpretability."}, "https://arxiv.org/abs/2308.02005": {"title": "Randomization-Based Inference for Average Treatment Effect in Inexactly Matched Observational Studies", "link": "https://arxiv.org/abs/2308.02005", "description": "arXiv:2308.02005v3 Announce Type: replace \nAbstract: Matching is a widely used causal inference study design in observational studies. It seeks to mimic a randomized experiment by forming matched sets of treated and control units based on proximity in covariates. Ideally, treated units are exactly matched with controls for the covariates, and randomization-based inference for the treatment effect can then be conducted as in a randomized experiment under the ignorability assumption. However, matching is typically inexact when continuous covariates or many covariates exist. Previous studies have routinely ignored inexact matching in the downstream randomization-based inference as long as some covariate balance criteria are satisfied. Some recent studies found that this routine practice can cause severe bias. They proposed new inference methods for correcting for bias due to inexact matching. However, these inference methods focus on the constant treatment effect (i.e., Fisher's sharp null) and are not directly applicable to the average treatment effect (i.e., Neyman's weak null). 
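The Gaussian-copula backbone of the multivariate distributional model in the "Truly Multivariate Structured Additive Distributional Regression" entry above can be illustrated compactly; the sketch below fits only the copula correlation from rank-based normal scores and does not reproduce the covariate-dependent marginals or the Bayesian shrinkage estimation.

```python
# Sketch of the Gaussian-copula backbone: map each response component to
# normal scores through its empirical marginal CDF and estimate the copula
# correlation matrix from the scores.
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(5)
n, d = 1000, 3
# Hypothetical multivariate response with different marginal shapes.
z = rng.multivariate_normal(np.zeros(d), [[1, .6, .3], [.6, 1, .5], [.3, .5, 1]], n)
Y = np.column_stack([np.exp(z[:, 0]),              # lognormal margin
                     z[:, 1] ** 3,                 # skewed margin
                     norm.cdf(z[:, 2])])           # bounded margin

# Pseudo-observations via ranks, then normal scores.
U = rankdata(Y, axis=0) / (n + 1)
S = norm.ppf(U)

# Gaussian copula dependence estimate.
R_hat = np.corrcoef(S, rowvar=False)
print(np.round(R_hat, 2))
```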
To address this problem, we propose a new framework - inverse post-matching probability weighting (IPPW) - for randomization-based average treatment effect inference under inexact matching. Compared with the routinely used randomization-based inference framework based on the difference-in-means estimator, our proposed IPPW framework can substantially reduce bias due to inexact matching and improve the coverage rate. We have also developed an open-source R package RIIM (Randomization-Based Inference under Inexact Matching) for implementing our methods."}, "https://arxiv.org/abs/2311.09446": {"title": "Scalable simulation-based inference for implicitly defined models using a metamodel for Monte Carlo log-likelihood estimator", "link": "https://arxiv.org/abs/2311.09446", "description": "arXiv:2311.09446v2 Announce Type: replace \nAbstract: Models implicitly defined through a random simulator of a process have become widely used in scientific and industrial applications in recent years. However, simulation-based inference methods for such implicit models, like approximate Bayesian computation (ABC), often scale poorly as data size increases. We develop a scalable inference method for implicitly defined models using a metamodel for the Monte Carlo log-likelihood estimator derived from simulations. This metamodel characterizes both statistical and simulation-based randomness in the distribution of the log-likelihood estimator across different parameter values. Our metamodel-based method quantifies uncertainty in parameter estimation in a principled manner, leveraging the local asymptotic normality of the mean function of the log-likelihood estimator. We apply this method to construct accurate confidence intervals for parameters of partially observed Markov process models where the Monte Carlo log-likelihood estimator is obtained using the bootstrap particle filter. We numerically demonstrate that our method enables accurate and highly scalable parameter inference across several examples, including a mechanistic compartment model for infectious diseases."}, "https://arxiv.org/abs/2106.16149": {"title": "When Frictions are Fractional: Rough Noise in High-Frequency Data", "link": "https://arxiv.org/abs/2106.16149", "description": "arXiv:2106.16149v4 Announce Type: replace-cross \nAbstract: The analysis of high-frequency financial data is often impeded by the presence of noise. This article is motivated by intraday return data in which market microstructure noise appears to be rough, that is, best captured by a continuous-time stochastic process that locally behaves as fractional Brownian motion. Assuming that the underlying efficient price process follows a continuous It\\^o semimartingale, we derive consistent estimators and asymptotic confidence intervals for the roughness parameter of the noise and the integrated price and noise volatilities, in all cases where these quantities are identifiable. 
In addition to desirable features such as serial dependence of increments, compatibility between different sampling frequencies and diurnal effects, the rough noise model can further explain divergence rates in volatility signature plots that vary considerably over time and between assets."}, "https://arxiv.org/abs/2208.00830": {"title": "Short-time expansion of characteristic functions in a rough volatility setting with applications", "link": "https://arxiv.org/abs/2208.00830", "description": "arXiv:2208.00830v2 Announce Type: replace-cross \nAbstract: We derive a higher-order asymptotic expansion of the conditional characteristic function of the increment of an It\\^o semimartingale over a shrinking time interval. The spot characteristics of the It\\^o semimartingale are allowed to have dynamics of general form. In particular, their paths can be rough, that is, exhibit local behavior like that of a fractional Brownian motion, while at the same time have jumps with arbitrary degree of activity. The expansion result shows the distinct roles played by the different features of the spot characteristics dynamics. As an application of our result, we construct a nonparametric estimator of the Hurst parameter of the diffusive volatility process from portfolios of short-dated options written on an underlying asset."}, "https://arxiv.org/abs/2311.01762": {"title": "Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel", "link": "https://arxiv.org/abs/2311.01762", "description": "arXiv:2311.01762v2 Announce Type: replace-cross \nAbstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes solving a system of linear equations, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks."}, "https://arxiv.org/abs/2312.09862": {"title": "Wasserstein-based Minimax Estimation of Dependence in Multivariate Regularly Varying Extremes", "link": "https://arxiv.org/abs/2312.09862", "description": "arXiv:2312.09862v2 Announce Type: replace-cross \nAbstract: We present the first minimax risk bounds for estimators of the spectral measure in multivariate linear factor models, where observations are linear combinations of regularly varying latent factors. Non-asymptotic convergence rates are derived for the multivariate Peak-over-Threshold estimator in terms of the $p$-th order Wasserstein distance, and information-theoretic lower bounds for the minimax risks are established. 
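A minimal sketch of the idea in the "Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel" entry above: train the KRR dual coefficients by gradient descent on a penalized mean squared error while the RBF bandwidth shrinks across iterations. The geometric bandwidth schedule, step size, and penalty are illustrative choices, not the authors' update scheme.

```python
# Kernel ridge regression trained by gradient descent on the dual
# coefficients, minimizing (1/n)*||K a - y||^2 + lam * a' K a, while the
# RBF bandwidth decreases over the iterations.
import numpy as np

def rbf_kernel(A, B, bandwidth):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

alpha = np.zeros(n)                       # dual coefficients
lam, step = 1e-3, 1e-3
for t in range(400):
    bw = 2.0 * 0.995 ** t                 # bandwidth decreasing toward zero
    K = rbf_kernel(X, X, bw)
    resid = K @ alpha - y
    grad = (2.0 / n) * K @ resid + 2.0 * lam * K @ alpha
    alpha -= step * grad

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
print(np.round(rbf_kernel(X_test, X, bw) @ alpha, 2))
```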
The convergence rate of the estimator is shown to be minimax optimal under a class of Pareto-type models analogous to the standard class used in the setting of one-dimensional observations known as the Hall-Welsh class. When the estimator is minimax inefficient, a novel two-step estimator is introduced and demonstrated to attain the minimax lower bound. Our analysis bridges the gaps in understanding trade-offs between estimation bias and variance in multivariate extreme value theory."}, "https://arxiv.org/abs/2411.07399": {"title": "Cumulative differences between subpopulations versus body mass index in the Behavioral Risk Factor Surveillance System data", "link": "https://arxiv.org/abs/2411.07399", "description": "arXiv:2411.07399v1 Announce Type: new \nAbstract: Prior works have demonstrated many advantages of cumulative statistics over the classical methods of reliability diagrams, ECEs (empirical, estimated, or expected calibration errors), and ICIs (integrated calibration indices). The advantages pertain to assessing calibration of predicted probabilities, comparison of responses from a subpopulation to the responses from the full population, and comparison of responses from one subpopulation to those from a separate subpopulation. The cumulative statistics include graphs of cumulative differences as a function of the scalar covariate, as well as metrics due to Kuiper and to Kolmogorov and Smirnov that summarize the graphs into single scalar statistics and associated P-values (also known as \"attained significance levels\" for significance tests). However, the prior works have not yet treated data from biostatistics.\n Fortunately, the advantages of the cumulative statistics extend to the Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention. This is unsurprising, since the mathematics is the same as in earlier works. Nevertheless, detailed analysis of the BRFSS data is revealing and corroborates the findings of earlier works.\n Two methodological extensions beyond prior work that facilitate analysis of the BRFSS are (1) empirical estimators of uncertainty for graphs of the cumulative differences between two subpopulations, such that the estimators are valid for any real-valued responses, and (2) estimators of the weighted average treatment effect for the differences in the responses between the subpopulations. Both of these methods concern the case in which none of the covariate's observed values for one subpopulation is equal to any of the covariate's values for the other subpopulation. The data analysis presented reports results for this case as well as several others."}, "https://arxiv.org/abs/2411.07604": {"title": "Dynamic Evolutionary Game Analysis of How Fintech in Banking Mitigates Risks in Agricultural Supply Chain Finance", "link": "https://arxiv.org/abs/2411.07604", "description": "arXiv:2411.07604v1 Announce Type: new \nAbstract: This paper explores the impact of banking fintech on reducing financial risks in the agricultural supply chain, focusing on the secondary allocation of commercial credit. The study constructs a three-player evolutionary game model involving banks, core enterprises, and SMEs to analyze how fintech innovations, such as big data credit assessment, blockchain, and AI-driven risk evaluation, influence financial risks and access to credit. 
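The cumulative-difference idea in the "Cumulative differences between subpopulations versus body mass index" entry above can be sketched as follows; the binned trend estimate and the simulated BMI-like data are assumptions, and the exact estimators and weighting used in the paper are not reproduced.

```python
# Sketch of a cumulative-difference comparison of a subpopulation against
# the full population along a scalar covariate: accumulate response
# differences in covariate order and summarize the curve with
# Kolmogorov-Smirnov and Kuiper style statistics.
import numpy as np

rng = np.random.default_rng(7)
n = 4000
bmi = rng.normal(27, 5, n)                          # scalar covariate
sub = rng.random(n) < 0.3                           # subpopulation indicator
y = 0.05 * bmi + 0.2 * sub * (bmi > 30) + rng.normal(0, 1, n)   # responses

# Full-population trend of the response as a step function of the covariate,
# estimated within quantile bins of the covariate.
bins = np.quantile(bmi, np.linspace(0, 1, 41))
which = np.clip(np.digitize(bmi, bins) - 1, 0, 39)
trend = np.array([y[which == b].mean() for b in range(40)])

# Cumulative differences for the subpopulation, in covariate order.
order = np.argsort(bmi[sub])
diffs = (y[sub] - trend[which[sub]])[order]
cum = np.cumsum(diffs) / sub.sum()

ks = np.abs(cum).max()                              # Kolmogorov-Smirnov style
kuiper = cum.max() - cum.min()                      # Kuiper style
print("KS-type statistic:", round(ks, 3), " Kuiper-type statistic:", round(kuiper, 3))
```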
The findings reveal that banking fintech reduces financing costs and mitigates financial risks by improving transaction reliability, enhancing risk identification, and minimizing information asymmetry. By optimizing cooperation between banks, core enterprises, and SMEs, fintech solutions enhance the stability of the agricultural supply chain, contributing to rural revitalization goals and sustainable agricultural development. The study provides new theoretical insights and practical recommendations for improving agricultural finance systems and reducing financial risks.\n Keywords: banking fintech, agricultural supply chain, financial risk, commercial credit, SMEs, evolutionary game model, big data, blockchain, AI-driven risk evaluation."}, "https://arxiv.org/abs/2411.07617": {"title": "Semi-supervised learning using copula-based regression and model averaging", "link": "https://arxiv.org/abs/2411.07617", "description": "arXiv:2411.07617v1 Announce Type: new \nAbstract: The available data in semi-supervised learning usually consists of relatively small sized labeled data and much larger sized unlabeled data. How to effectively exploit unlabeled data is the key issue. In this paper, we write the regression function in the form of a copula and marginal distributions, and the unlabeled data can be exploited to improve the estimation of the marginal distributions. The predictions based on different copulas are weighted, where the weights are obtained by minimizing an asymptotic unbiased estimator of the prediction risk. Error-ambiguity decomposition of the prediction risk is performed such that unlabeled data can be exploited to improve the prediction risk estimation. We demonstrate the asymptotic normality of copula parameters and regression function estimators of the candidate models under the semi-supervised framework, as well as the asymptotic optimality and weight consistency of the model averaging estimator. Our model averaging estimator achieves faster convergence rates of asymptotic optimality and weight consistency than the supervised counterpart. Extensive simulation experiments and the California housing dataset demonstrate the effectiveness of the proposed method."}, "https://arxiv.org/abs/2411.07651": {"title": "Quasi-Bayes empirical Bayes: a sequential approach to the Poisson compound decision problem", "link": "https://arxiv.org/abs/2411.07651", "description": "arXiv:2411.07651v1 Announce Type: new \nAbstract: The Poisson compound decision problem is a classical problem in statistics, for which parametric and nonparametric empirical Bayes methodologies are available to estimate the Poisson's means in static or batch domains. In this paper, we consider the Poisson compound decision problem in a streaming or online domain. By relying on a quasi-Bayesian approach, often referred to as Newton's algorithm, we obtain sequential Poisson's mean estimates that are of easy evaluation, computationally efficient and with a constant computational cost as data increase, which is desirable for streaming data. Large sample asymptotic properties of the proposed estimates are investigated, also providing frequentist guarantees in terms of a regret analysis. 
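A minimal sketch of a Newton-style quasi-Bayes recursion for the Poisson compound decision problem (the "Quasi-Bayes empirical Bayes" entry above) follows; the grid, the decaying weights, and the simulated gamma-distributed means are illustrative choices rather than the paper's exact specification.

```python
# Newton-style quasi-Bayes recursion: keep a discrete mixing distribution
# over a grid of candidate Poisson means, and after each observation mix the
# current distribution with its posterior given that observation.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(8)
true_means = rng.gamma(2.0, 2.0, 5000)            # latent Poisson means
x_stream = rng.poisson(true_means)                # streaming counts

grid = np.linspace(0.05, 20.0, 400)               # support of the mixing distribution
f = np.full(len(grid), 1.0 / len(grid))           # initial (uniform) mixing weights

def posterior_mean(x, f):
    """Empirical-Bayes estimate E[theta | x] under the current mixing distribution."""
    post = f * poisson.pmf(x, grid)
    return np.sum(grid * post) / np.sum(post)

estimates = np.empty(len(x_stream))
for n, x in enumerate(x_stream, start=1):
    estimates[n - 1] = posterior_mean(x, f)       # predict before updating
    lik = poisson.pmf(x, grid)
    post = f * lik / np.sum(f * lik)
    w = 1.0 / (n + 1)                             # decaying quasi-Bayes weight
    f = (1.0 - w) * f + w * post

mse_plugin = np.mean((x_stream - true_means) ** 2)
mse_qb = np.mean((estimates - true_means) ** 2)
print("plug-in (x itself) MSE:", round(mse_plugin, 3), " quasi-Bayes MSE:", round(mse_qb, 3))
```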
We validate our methodology empirically, on both synthetic and real data, comparing against the most popular alternatives."}, "https://arxiv.org/abs/2411.07808": {"title": "Spatial Competition on Psychological Pricing Strategies: Preliminary Evidence from an Online Marketplace", "link": "https://arxiv.org/abs/2411.07808", "description": "arXiv:2411.07808v1 Announce Type: new \nAbstract: According to Kadir et al. (2023), online marketplaces are used to buy and sell products and services, as well as to exchange money and data between users or the platform. Due to the large product selection, low costs and the ease of shopping without physical restrictions as well as the technical possibilities, online marketplaces have grown rapidly (Kadir et al., 2023). Online marketplaces are also used in the consumer-to-consumer (C2C) sector and thus offer a broad user group a marketplace, for example for used products. This article focuses on Willhaben.at (2024), a leading C2C marketplace in Austria, as stated by Obersteiner, Schmied, and Pamperl (2023). The empirical analysis in this course essay centers on the offer ads of Woom Bikes, a standardised product which is sold on Willhaben. Through web scraping, a dataset of approximately 826 observations was created, focusing on mid-to-high price segment bicycles, which we argue are characterized by price stability and uniformity. This analysis aims to analyse ad listing prices through predictive models using Willhaben product listing attributes and the spatial distribution of one of the product attributes."}, "https://arxiv.org/abs/2411.07817": {"title": "Impact of R&D and AI Investments on Economic Growth and Credit Rating", "link": "https://arxiv.org/abs/2411.07817", "description": "arXiv:2411.07817v1 Announce Type: new \nAbstract: The research and development (R&D) phase is essential for fostering innovation and aligns with long-term strategies in both public and private sectors. This study addresses two primary research questions: (1) assessing the relationship between R&D investments and GDP through regression analysis, and (2) estimating the economic value added (EVA) that Georgia must generate to progress from a BB to a BBB credit rating. Using World Bank data from 2014-2022, this analysis found that increasing R&D, with an emphasis on AI, by 30-35% has a measurable impact on GDP. Regression results reveal a coefficient of 7.02%, indicating that a 10% increase in R&D leads to a 0.70% GDP rise, with an 81.1% determination coefficient and a strong 90.1% correlation.\n Georgia's EVA model was calculated to determine the additional value needed for a BBB rating, comparing indicators from Greece, Hungary, India, and Kazakhstan as benchmarks. Key economic indicators considered were nominal GDP, GDP per capita, real GDP growth, and fiscal indicators (government balance/GDP, debt/GDP). The EVA model projects that to achieve a BBB rating within nine years, Georgia requires $61.7 billion in investments. Utilizing EVA and comprehensive economic indicators will support informed decision-making and enhance the analysis of Georgia's economic trajectory."}, "https://arxiv.org/abs/2411.07874": {"title": "Changepoint Detection in Complex Models: Cross-Fitting Is Needed", "link": "https://arxiv.org/abs/2411.07874", "description": "arXiv:2411.07874v1 Announce Type: new \nAbstract: Changepoint detection is commonly approached by minimizing the sum of in-sample losses to quantify the model's overall fit across distinct data segments. 
However, we observe that flexible modeling techniques, particularly those involving hyperparameter tuning or model selection, often lead to inaccurate changepoint estimation due to biases that distort the target of in-sample loss minimization. To mitigate this issue, we propose a novel cross-fitting methodology that incorporates out-of-sample loss evaluations using independent samples separate from those used for model fitting. This approach ensures consistent changepoint estimation, contingent solely upon the models' predictive accuracy across nearly homogeneous data segments. Extensive numerical experiments demonstrate that our proposed cross-fitting strategy significantly enhances the reliability and adaptability of changepoint detection in complex scenarios."}, "https://arxiv.org/abs/2411.07952": {"title": "Matching $\\leq$ Hybrid $\\leq$ Difference in Differences", "link": "https://arxiv.org/abs/2411.07952", "description": "arXiv:2411.07952v1 Announce Type: new \nAbstract: Since LaLonde's (1986) seminal paper, there has been ongoing interest in estimating treatment effects using pre- and post-intervention data. Scholars have traditionally used experimental benchmarks to evaluate the accuracy of alternative econometric methods, including Matching, Difference-in-Differences (DID), and their hybrid forms (e.g., Heckman et al., 1998b; Dehejia and Wahba, 2002; Smith and Todd, 2005). We revisit these methodologies in the evaluation of job training and educational programs using four datasets (LaLonde, 1986; Heckman et al., 1998a; Smith and Todd, 2005; Chetty et al., 2014a; Athey et al., 2020), and show that the inequality relationship, Matching $\\leq$ Hybrid $\\leq$ DID, appears as a consistent norm, rather than a mere coincidence. We provide a formal theoretical justification for this puzzling phenomenon under plausible conditions such as negative selection, by generalizing the classical bracketing (Angrist and Pischke, 2009, Section 5). Consequently, when treatments are expected to be non-negative, DID tends to provide optimistic estimates, while Matching offers more conservative ones. Keywords: bias, difference in differences, educational program, job training program, matching."}, "https://arxiv.org/abs/2411.07978": {"title": "Doubly Robust Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2411.07978", "description": "arXiv:2411.07978v1 Announce Type: new \nAbstract: This study introduces a doubly robust (DR) estimator for regression discontinuity (RD) designs. In RD designs, treatment effects are estimated in a quasi-experimental setting where treatment assignment depends on whether a running variable surpasses a predefined cutoff. A common approach in RD estimation is to apply nonparametric regression methods, such as local linear regression. In such an approach, the validity relies heavily on the consistency of nonparametric estimators and is limited by the nonparametric convergence rate, thereby preventing $\\sqrt{n}$-consistency. To address these issues, we propose the DR-RD estimator, which combines two distinct estimators for the conditional expected outcomes. If either of these estimators is consistent, the treatment effect estimator remains consistent. 
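The cross-fitting idea in the "Changepoint Detection in Complex Models: Cross-Fitting Is Needed" entry above can be illustrated in the simplest mean-shift setting; the odd/even split and the use of segment means as the fitted models are simplifying assumptions, whereas the paper targets flexible, tuned models.

```python
# Cross-fitted estimation of a single mean-shift changepoint: fit segment
# models (here, plain segment means) on odd-indexed observations and score
# each candidate changepoint with the out-of-sample squared loss on the
# even-indexed observations.
import numpy as np

rng = np.random.default_rng(9)
n, tau_true = 600, 400
y = np.concatenate([rng.normal(0.0, 1.0, tau_true),
                    rng.normal(0.8, 1.0, n - tau_true)])

fit_idx = np.arange(0, n, 2)                      # used to fit segment models
val_idx = np.arange(1, n, 2)                      # used only for evaluation

def out_of_sample_loss(tau):
    """Held-out squared loss when the series is split at candidate tau."""
    loss = 0.0
    for seg in (np.arange(n) < tau, np.arange(n) >= tau):
        fit = fit_idx[seg[fit_idx]]
        val = val_idx[seg[val_idx]]
        if len(fit) == 0 or len(val) == 0:
            return np.inf
        loss += np.sum((y[val] - y[fit].mean()) ** 2)
    return loss

candidates = np.arange(20, n - 20)
tau_hat = candidates[np.argmin([out_of_sample_loss(t) for t in candidates])]
print("estimated changepoint:", tau_hat, "(true:", tau_true, ")")
```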
Furthermore, due to the debiasing effect, our proposed estimator achieves $\\sqrt{n}$-consistency if both regression estimators satisfy certain mild conditions, which also simplifies statistical inference."}, "https://arxiv.org/abs/2411.07984": {"title": "Scalable piecewise smoothing with BART", "link": "https://arxiv.org/abs/2411.07984", "description": "arXiv:2411.07984v1 Announce Type: new \nAbstract: Although it is an extremely effective, easy-to-use, and increasingly popular tool for nonparametric regression, the Bayesian Additive Regression Trees (BART) model is limited by the fact that it can only produce discontinuous output. Initial attempts to overcome this limitation were based on regression trees that output Gaussian Processes instead of constants. Unfortunately, implementations of these extensions cannot scale to large datasets. We propose ridgeBART, an extension of BART built with trees that output linear combinations of ridge functions (i.e., a composition of an affine transformation of the inputs and a non-linearity); that is, we build a Bayesian ensemble of localized neural networks with a single hidden layer. We develop a new MCMC sampler that updates trees in linear time and establish nearly minimax-optimal posterior contraction rates for estimating Sobolev and piecewise anisotropic Sobolev functions. We demonstrate ridgeBART's effectiveness on synthetic data and use it to estimate the probability that a professional basketball player makes a shot from any location on the court in a spatially smooth fashion."}, "https://arxiv.org/abs/2411.07800": {"title": "Kernel-based retrieval models for hyperspectral image data optimized with Kernel Flows", "link": "https://arxiv.org/abs/2411.07800", "description": "arXiv:2411.07800v1 Announce Type: cross \nAbstract: Kernel-based statistical methods are efficient, but their performance depends heavily on the selection of kernel parameters. In the literature, optimization studies on kernel-based chemometric methods are limited and often reduced to grid searching. Previously, the authors introduced Kernel Flows (KF) to learn kernel parameters for Kernel Partial Least-Squares (K-PLS) regression. KF is easy to implement and helps minimize overfitting. In cases of high collinearity between spectra and biogeophysical quantities in spectroscopy, simpler methods like Principal Component Regression (PCR) may be more suitable. In this study, we propose a new KF-type approach to optimize Kernel Principal Component Regression (K-PCR) and test it alongside KF-PLS. Both methods are benchmarked against non-linear regression techniques using two hyperspectral remote sensing datasets."}, "https://arxiv.org/abs/2411.08019": {"title": "Language Models as Causal Effect Generators", "link": "https://arxiv.org/abs/2411.08019", "description": "arXiv:2411.08019v1 Announce Type: cross \nAbstract: We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. 
We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure."}, "https://arxiv.org/abs/2304.03247": {"title": "A Bayesian Framework for Causal Analysis of Recurrent Events with Timing Misalignment", "link": "https://arxiv.org/abs/2304.03247", "description": "arXiv:2304.03247v2 Announce Type: replace \nAbstract: Observational studies of recurrent event rates are common in biomedical statistics. Broadly, the goal is to estimate differences in event rates under two treatments within a defined target population over a specified followup window. Estimation with observational data is challenging because, while membership in the target population is defined in terms of eligibility criteria, treatment is rarely observed exactly at the time of eligibility. Ad-hoc solutions to this timing misalignment can induce bias by incorrectly attributing prior event counts and person-time to treatment. Even if eligibility and treatment are aligned, a terminal event process (e.g. death) often stops the recurrent event process of interest. In practice, both processes can be censored so that events are not observed over the entire followup window. Our approach addresses misalignment by casting it as a time-varying treatment problem: some patients are on treatment at eligibility while others are off treatment but may switch to treatment at a specified time - if they survive long enough. We define and identify an average causal effect estimand under right-censoring. Estimation is done using a g-computation procedure with a joint semiparametric Bayesian model for the death and recurrent event processes. We apply the method to contrast hospitalization rates among patients with different opioid treatments using Medicare insurance claims data."}, "https://arxiv.org/abs/2306.04182": {"title": "Simultaneous Estimation and Dataset Selection for Transfer Learning in High Dimensions by a Non-convex Penalty", "link": "https://arxiv.org/abs/2306.04182", "description": "arXiv:2306.04182v3 Announce Type: replace \nAbstract: In this paper, we propose to estimate model parameters and identify informative source datasets simultaneously for high-dimensional transfer learning problems with the aid of a non-convex penalty, in contrast to the separate useful dataset selection and transfer learning procedures in the existing literature. To numerically solve the non-convex problem with respect to two specific statistical models, namely the sparse linear regression and the generalized low-rank trace regression models, we adopt the difference of convex (DC) programming with the alternating direction method of multipliers (ADMM) procedures. We theoretically justify the proposed algorithm from both statistical and computational perspectives. 
Extensive numerical results are reported alongside to validate the theoretical assertions. An \\texttt{R} package \\texttt{MHDTL} is developed to implement the proposed methods."}, "https://arxiv.org/abs/2306.13362": {"title": "Factor-augmented sparse MIDAS regressions with an application to nowcasting", "link": "https://arxiv.org/abs/2306.13362", "description": "arXiv:2306.13362v3 Announce Type: replace \nAbstract: This article investigates factor-augmented sparse MIDAS (Mixed Data Sampling) regressions for high-dimensional time series data, which may be observed at different frequencies. Our novel approach integrates sparse and dense dimensionality reduction techniques. We derive the convergence rate of our estimator under misspecification, $\\tau$-mixing dependence, and polynomial tails. Our method's finite sample performance is assessed via Monte Carlo simulations. We apply the methodology to nowcasting U.S. GDP growth and demonstrate that it outperforms both sparse regression and standard factor-augmented regression during the COVID-19 pandemic. To ensure the robustness of these results, we also implement factor-augmented sparse logistic regression, which further confirms the superior accuracy of our nowcast probabilities during recessions. These findings indicate that recessions are influenced by both idiosyncratic (sparse) and common (dense) shocks."}, "https://arxiv.org/abs/2309.08039": {"title": "Flexible Functional Treatment Effect Estimation", "link": "https://arxiv.org/abs/2309.08039", "description": "arXiv:2309.08039v2 Announce Type: replace \nAbstract: We study treatment effect estimation with functional treatments where the average potential outcome functional is a function of functions, in contrast to continuous treatment effect estimation where the target is a function of real numbers. By considering a flexible scalar-on-function marginal structural model, a weight-modified kernel ridge regression (WMKRR) is adopted for estimation. The weights are constructed by directly minimizing the uniform balancing error resulting from a decomposition of the WMKRR estimator, instead of being estimated under a particular treatment selection model. Despite the complex structure of the uniform balancing error derived under WMKRR, finite-dimensional convex algorithms can be applied to efficiently solve for the proposed weights thanks to a representer theorem. The optimal convergence rate is shown to be attainable by the proposed WMKRR estimator without any smoothness assumption on the true weight function. Corresponding empirical performance is demonstrated by a simulation study and a real data application."}, "https://arxiv.org/abs/2310.01589": {"title": "Exact Gradient Evaluation for Adaptive Quadrature Approximate Marginal Likelihood in Mixed Models for Grouped Data", "link": "https://arxiv.org/abs/2310.01589", "description": "arXiv:2310.01589v2 Announce Type: replace \nAbstract: A method is introduced for approximate marginal likelihood inference via adaptive Gaussian quadrature in mixed models with a single grouping factor. The core technical contribution is an algorithm for computing the exact gradient of the approximate log-marginal likelihood. This leads to efficient maximum likelihood via quasi-Newton optimization that is demonstrated to be faster than existing approaches based on finite-differenced gradients or derivative-free optimization. 
The method is specialized to Bernoulli mixed models with multivariate, correlated Gaussian random effects; here computations are performed using an inverse log-Cholesky parameterization of the Gaussian density that involves no matrix decomposition during model fitting, while Wald confidence intervals are provided for variance parameters on the original scale. Simulations give evidence of these intervals attaining nominal coverage if enough quadrature points are used, for data comprised of a large number of very small groups exhibiting large between-group heterogeneity. The Laplace approximation is well-known to give especially poor coverage and high bias for data comprised of a large number of small groups. Adaptive quadrature mitigates this, and the methods in this paper improve the computational feasibility of this more accurate method. All results may be reproduced using code available at \\url{https://github.com/awstringer1/aghmm-paper-code}."}, "https://arxiv.org/abs/2411.08150": {"title": "Targeted Maximum Likelihood Estimation for Integral Projection Models in Population Ecology", "link": "https://arxiv.org/abs/2411.08150", "description": "arXiv:2411.08150v1 Announce Type: new \nAbstract: Integral projection models (IPMs) are widely used to study population growth and the dynamics of demographic structure (e.g. age and size distributions) within a population. These models use data on individuals' growth, survival, and reproduction to predict changes in the population from one time point to the next and use these in turn to ask about long-term growth rates, the sensitivity of that growth rate to environmental factors, and aspects of the long term population such as how much reproduction concentrates in a few individuals; these quantities are not directly measurable from data and must be inferred from the model. Building IPMs requires us to develop models for individual fates over the next time step -- Did they survive? How much did they grow or shrink? Did they reproduce? -- conditional on their initial state as well as on environmental covariates in a manner that accounts for the unobservable quantities that we are ultimately interested in estimating. Targeted maximum likelihood estimation (TMLE) methods are particularly well-suited to a framework in which we are largely interested in the consequences of models. These build machine learning-based models that estimate the probability distribution of the data we observe and define a target of inference as a function of these. The initial estimate for the distribution is then modified by tilting in the direction of the efficient influence function to both de-bias the parameter estimate and provide more accurate inference. In this paper, we employ TMLE to develop robust and efficient estimators for properties derived from a fitted IPM. Mathematically, we derive the efficient influence function and formulate the paths for the least favorable sub-models. Empirically, we conduct extensive simulations using real data from both long term studies of Idaho steppe plant communities and experimental Rotifer populations."}, "https://arxiv.org/abs/2411.08188": {"title": "MSTest: An R-Package for Testing Markov Switching Models", "link": "https://arxiv.org/abs/2411.08188", "description": "arXiv:2411.08188v1 Announce Type: new \nAbstract: We present the R package MSTest, which implements hypothesis testing procedures to identify the number of regimes in Markov switching models. 
These models have wide-ranging applications in economics, finance, and numerous other fields. The MSTest package includes the Monte Carlo likelihood ratio test procedures proposed by Rodriguez-Rondon and Dufour (2024), the moment-based tests of Dufour and Luger (2017), the parameter stability tests of Carrasco, Hu, and Ploberger (2014), and the likelihood ratio test of Hansen (1992). Additionally, the package enables users to simulate and estimate univariate and multivariate Markov switching and hidden Markov processes, using the expectation-maximization (EM) algorithm or maximum likelihood estimation (MLE). We demonstrate the functionality of the MSTest package through both simulation experiments and an application to U.S. GNP growth data."}, "https://arxiv.org/abs/2411.08256": {"title": "$K$-means clustering for sparsely observed longitudinal data", "link": "https://arxiv.org/abs/2411.08256", "description": "arXiv:2411.08256v1 Announce Type: new \nAbstract: In longitudinal data analysis, observation points of repeated measurements over time often vary among subjects except in well-designed experimental studies. Additionally, measurements for each subject are typically obtained at only a few time points. From such sparsely observed data, identifying underlying cluster structures can be challenging. This paper proposes a fast and simple clustering method that generalizes the classical $k$-means method to identify cluster centers in sparsely observed data. The proposed method employs the basis function expansion to model the cluster centers, providing an effective way to estimate cluster centers from fragmented data. We establish the statistical consistency of the proposed method, as with the classical $k$-means method. Through numerical experiments, we demonstrate that the proposed method performs competitively with, or even outperforms, existing clustering methods. Moreover, the proposed method offers significant gains in computational efficiency due to its simplicity. Applying the proposed method to real-world data illustrates its effectiveness in identifying cluster structures in sparsely observed data."}, "https://arxiv.org/abs/2411.08262": {"title": "Adaptive Shrinkage with a Nonparametric Bayesian Lasso", "link": "https://arxiv.org/abs/2411.08262", "description": "arXiv:2411.08262v1 Announce Type: new \nAbstract: Modern approaches to perform Bayesian variable selection rely mostly on the use of shrinkage priors. That said, an ideal shrinkage prior should be adaptive to different signal levels, ensuring that small effects are ruled out, while keeping relatively intact the important ones. With this task in mind, we develop the nonparametric Bayesian Lasso, an adaptive and flexible shrinkage prior for Bayesian regression and variable selection, particularly useful when the number of predictors is comparable or larger than the number of available data points. We build on spike-and-slab Lasso ideas and extend them by placing a Dirichlet Process prior on the shrinkage parameters. The result is a prior on the regression coefficients that can be seen as an infinite mixture of Double Exponential densities, all offering different amounts of regularization, ensuring a more adaptive and flexible shrinkage. We also develop an efficient Markov chain Monte Carlo algorithm for posterior inference. 
Through various simulation exercises and real-world data analyses, we demonstrate that our proposed method leads to a better recovery of the true regression coefficients, a better variable selection, and better out-of-sample predictions, highlighting the benefits of the nonparametric Bayesian Lasso over existing shrinkage priors."}, "https://arxiv.org/abs/2411.08315": {"title": "Optimal individualized treatment regimes for survival data with competing risks", "link": "https://arxiv.org/abs/2411.08315", "description": "arXiv:2411.08315v1 Announce Type: new \nAbstract: Precision medicine leverages patient heterogeneity to estimate individualized treatment regimens, formalized, data-driven approaches designed to match patients with optimal treatments. In the presence of competing events, where multiple causes of failure can occur and one cause precludes others, it is crucial to assess the risk of the specific outcome of interest, such as one type of failure over another. This helps clinicians tailor interventions based on the factors driving that particular cause, leading to more precise treatment strategies. Currently, no precision medicine methods simultaneously account for both survival and competing risk endpoints. To address this gap, we develop a nonparametric individualized treatment regime estimator. Our two-phase method accounts for both overall survival from all events as well as the cumulative incidence of a main event of interest. Additionally, we introduce a multi-utility value function that incorporates both outcomes. We develop random survival and random cumulative incidence forests to construct individual survival and cumulative incidence curves. Simulation studies demonstrated that our proposed method performs well, which we applied to a cohort of peripheral artery disease patients at high risk for limb loss and mortality."}, "https://arxiv.org/abs/2411.08352": {"title": "Imputation-based randomization tests for randomized experiments with interference", "link": "https://arxiv.org/abs/2411.08352", "description": "arXiv:2411.08352v1 Announce Type: new \nAbstract: The presence of interference renders classic Fisher randomization tests infeasible due to nuisance unknowns. To address this issue, we propose imputing the nuisance unknowns and computing Fisher randomization p-values multiple times, then averaging them. We term this approach the imputation-based randomization test and provide theoretical results on its asymptotic validity. Our method leverages the merits of randomization and the flexibility of the Bayesian framework: for multiple imputations, we can either employ the empirical distribution of observed outcomes to achieve robustness against model mis-specification or utilize a parametric model to incorporate prior information. Simulation results demonstrate that our method effectively controls the type I error rate and significantly enhances the testing power compared to existing randomization tests for randomized experiments with interference. 
We apply our method to a two-round randomized experiment with multiple treatments and one-way interference, where existing randomization tests exhibit limited power."}, "https://arxiv.org/abs/2411.08390": {"title": "Expected Information Gain Estimation via Density Approximations: Sample Allocation and Dimension Reduction", "link": "https://arxiv.org/abs/2411.08390", "description": "arXiv:2411.08390v1 Announce Type: new \nAbstract: Computing expected information gain (EIG) from prior to posterior (equivalently, mutual information between candidate observations and model parameters or other quantities of interest) is a fundamental challenge in Bayesian optimal experimental design. We formulate flexible transport-based schemes for EIG estimation in general nonlinear/non-Gaussian settings, compatible with both standard and implicit Bayesian models. These schemes are representative of two-stage methods for estimating or bounding EIG using marginal and conditional density estimates. In this setting, we analyze the optimal allocation of samples between training (density estimation) and approximation of the outer prior expectation. We show that with this optimal sample allocation, the MSE of the resulting EIG estimator converges more quickly than that of a standard nested Monte Carlo scheme. We then address the estimation of EIG in high dimensions, by deriving gradient-based upper bounds on the mutual information lost by projecting the parameters and/or observations to lower-dimensional subspaces. Minimizing these upper bounds yields projectors and hence low-dimensional EIG approximations that outperform approximations obtained via other linear dimension reduction schemes. Numerical experiments on a PDE-constrained Bayesian inverse problem also illustrate a favorable trade-off between dimension truncation and the modeling of non-Gaussianity, when estimating EIG from finite samples in high dimensions."}, "https://arxiv.org/abs/2411.08491": {"title": "Covariate Adjustment in Randomized Experiments Motivated by Higher-Order Influence Functions", "link": "https://arxiv.org/abs/2411.08491", "description": "arXiv:2411.08491v1 Announce Type: new \nAbstract: Higher-Order Influence Functions (HOIF), developed in a series of papers over the past twenty years, is a fundamental theoretical device for constructing rate-optimal causal-effect estimators from observational studies. However, the value of HOIF for analyzing well-conducted randomized controlled trials (RCT) has not been explicitly explored. In the recent US Food \\& Drug Administration (FDA) and European Medicines Agency (EMA) guidelines on the practice of covariate adjustment in analyzing RCT, in addition to the simple, unadjusted difference-in-mean estimator, it was also recommended to report the estimator adjusting for baseline covariates via a simple parametric working model, such as a linear model. In this paper, we show that an HOIF-motivated estimator for the treatment-specific mean has significantly improved statistical properties compared to popular adjusted estimators in practice when the number of baseline covariates $p$ is relatively large compared to the sample size $n$. We also characterize the conditions under which the HOIF-motivated estimator improves upon the unadjusted estimator. Furthermore, we demonstrate that a novel debiased adjusted estimator proposed recently by Lu et al. is, in fact, another HOIF-motivated estimator in disguise. 
Finally, simulation studies are conducted to corroborate our theoretical findings."}, "https://arxiv.org/abs/2411.08495": {"title": "Confidence intervals for adaptive trial designs I: A methodological review", "link": "https://arxiv.org/abs/2411.08495", "description": "arXiv:2411.08495v1 Announce Type: new \nAbstract: Regulatory guidance notes the need for caution in the interpretation of confidence intervals (CIs) constructed during and after an adaptive clinical trial. Conventional CIs of the treatment effects are prone to undercoverage (as well as other undesirable properties) in many adaptive designs, because they do not take into account the potential and realised trial adaptations. This paper is the first in a two-part series that explores CIs for adaptive trials. It provides a comprehensive review of the methods to construct CIs for adaptive designs, while the second paper illustrates how to implement these in practice and proposes a set of guidelines for trial statisticians. We describe several classes of techniques for constructing CIs for adaptive clinical trials, before providing a systematic literature review of available methods, classified by the type of adaptive design. As part of this, we assess, through a proposed traffic light system, which of several desirable features of CIs (such as achieving nominal coverage and consistency with the hypothesis test decision) each of these methods holds."}, "https://arxiv.org/abs/2411.08524": {"title": "Evaluating Parameter Uncertainty in the Poisson Lognormal Model with Corrected Variational Estimators", "link": "https://arxiv.org/abs/2411.08524", "description": "arXiv:2411.08524v1 Announce Type: new \nAbstract: Count data analysis is essential across diverse fields, from ecology and accident analysis to single-cell RNA sequencing (scRNA-seq) and metagenomics. While log transformations are computationally efficient, model-based approaches such as the Poisson-Log-Normal (PLN) model provide robust statistical foundations and are more amenable to extensions. The PLN model, with its latent Gaussian structure, not only captures overdispersion but also enables correlation between variables and inclusion of covariates, making it suitable for multivariate count data analysis. Variational approximations are a gold standard for estimating parameters of complex latent variable models such as PLN, maximizing a surrogate likelihood. However, variational estimators lack theoretical statistical properties such as consistency and asymptotic normality. In this paper, we investigate the consistency and variance estimation of PLN parameters using M-estimation theory. We derive the Sandwich estimator, previously studied in Westling and McCormick (2019), specifically for the PLN model. We compare this approach to the variational Fisher Information method, demonstrating the Sandwich estimator's effectiveness in terms of coverage through simulation studies. Finally, we validate our method on a scRNA-seq dataset."}, "https://arxiv.org/abs/2411.08625": {"title": "A Conjecture on Group Decision Accuracy in Voter Networks through the Regularized Incomplete Beta Function", "link": "https://arxiv.org/abs/2411.08625", "description": "arXiv:2411.08625v1 Announce Type: new \nAbstract: This paper presents a conjecture on the regularized incomplete beta function in the context of majority decision systems modeled through a voter framework. 
We examine a network where voters interact, with some voters fixed in their decisions while others are free to change their states based on the influence of their neighbors. We demonstrate that as the number of free voters increases, the probability of selecting the correct majority outcome converges to $1-I_{0.5}(\\alpha,\\beta)$, where $I_{0.5}(\\alpha,\\beta)$ is the regularized incomplete beta function. The conjecture posits that when $\\alpha > \\beta$, $1-I_{0.5}(\\alpha,\\beta) > \\alpha/(\\alpha+\\beta)$, meaning the group's decision accuracy exceeds that of an individual voter. We provide partial results, including a proof for integer values of $\\alpha$ and $\\beta$, and support the general case using a probability bound. This work extends Condorcet's Jury Theorem by incorporating voter dependence driven by network dynamics, showing that group decision accuracy can exceed individual accuracy under certain conditions."}, "https://arxiv.org/abs/2411.08698": {"title": "Statistical Operating Characteristics of Current Early Phase Dose Finding Designs with Toxicity and Efficacy in Oncology", "link": "https://arxiv.org/abs/2411.08698", "description": "arXiv:2411.08698v1 Announce Type: new \nAbstract: Traditional phase I dose finding cancer clinical trial designs aim to determine the maximum tolerated dose (MTD) of the investigational cytotoxic agent based on a single toxicity outcome, assuming a monotone dose-response relationship. However, this assumption might not always hold for newly emerging therapies such as immuno-oncology therapies and molecularly targeted therapies, making conventional dose finding trial designs based on toxicity no longer appropriate. To tackle this issue, numerous early phase dose finding clinical trial designs have been developed to identify the optimal biological dose (OBD), which takes both toxicity and efficacy outcomes into account. In this article, we review the current model-assisted dose finding designs, BOIN-ET, BOIN12, UBI, TEPI-2, PRINTE, STEIN, and uTPI to identify the OBD and compare their operating characteristics. Extensive simulation studies and a case study using a CAR T-cell therapy phase I trial have been conducted to compare the performance of the aforementioned designs under different possible dose-response relationship scenarios. The simulation results demonstrate that the performance of different designs varies depending on the particular dose-response relationship and the specific metric considered. Based on our simulation results and practical considerations, STEIN, PRINTE, and BOIN12 outperform the other designs from different perspectives."}, "https://arxiv.org/abs/2411.08771": {"title": "Confidence intervals for adaptive trial designs II: Case study and practical guidance", "link": "https://arxiv.org/abs/2411.08771", "description": "arXiv:2411.08771v1 Announce Type: new \nAbstract: In adaptive clinical trials, the conventional confidence interval (CI) for a treatment effect is prone to undesirable properties such as undercoverage and potential inconsistency with the final hypothesis testing decision. Accordingly, as is stated in recent regulatory guidance on adaptive designs, there is the need for caution in the interpretation of CIs constructed during and after an adaptive clinical trial. However, it may be unclear which of the available CIs in the literature are preferable. This paper is the second in a two-part series that explores CIs for adaptive trials. 
Part I provided a methodological review of approaches to construct CIs for adaptive designs. In this paper (part II), we present an extended case study based around a two-stage group sequential trial, including a comprehensive simulation study of the proposed CIs for this setting. This facilitates an expanded description of considerations around what makes for an effective CI procedure following an adaptive trial. We show that the CIs can have notably different properties. Finally, we propose a set of guidelines for researchers around the choice of CIs and the reporting of CIs following an adaptive design."}, "https://arxiv.org/abs/2411.08778": {"title": "Causal-DRF: Conditional Kernel Treatment Effect Estimation using Distributional Random Forest", "link": "https://arxiv.org/abs/2411.08778", "description": "arXiv:2411.08778v1 Announce Type: new \nAbstract: The conditional average treatment effect (CATE) is a commonly targeted statistical parameter for measuring the mean effect of a treatment conditional on covariates. However, the CATE will fail to capture effects of treatments beyond differences in conditional expectations. Inspired by causal forests for CATE estimation, we develop a forest-based method to estimate the conditional kernel treatment effect (CKTE), based on the recently introduced Distributional Random Forest (DRF) algorithm. Adapting the splitting criterion of DRF, we show how one forest fit can be used to obtain a consistent and asymptotically normal estimator of the CKTE, as well as an approximation of its sampling distribution. This allows us to study the difference in distribution between the control and treatment groups and thus yields a more comprehensive understanding of the treatment effect. In particular, this enables the construction of a conditional kernel-based test for distributional effects with provably valid type-I error. We show the effectiveness of the proposed estimator in simulations and apply it to the 1991 Survey of Income and Program Participation (SIPP) pension data to study the effect of 401(k) eligibility on wealth."}, "https://arxiv.org/abs/2411.08861": {"title": "Interaction Testing in Variation Analysis", "link": "https://arxiv.org/abs/2411.08861", "description": "arXiv:2411.08861v1 Announce Type: new \nAbstract: Relationships of cause and effect are of prime importance for explaining scientific phenomena. Often, rather than just understanding the effects of causes, researchers also wish to understand how a cause $X$ affects an outcome $Y$ mechanistically -- i.e., what are the causal pathways that are activated between $X$ and $Y$. For analyzing such questions, a range of methods has been developed over decades under the rubric of causal mediation analysis. Traditional mediation analysis focuses on decomposing the average treatment effect (ATE) into direct and indirect effects, and therefore focuses on the ATE as the central quantity. This corresponds to providing explanations for associations in the interventional regime, such as when the treatment $X$ is randomized. Commonly, however, it is of interest to explain associations in the observational regime, and not just in the interventional regime. In this paper, we introduce \\text{variation analysis}, an extension of mediation analysis that focuses on the total variation (TV) measure between $X$ and $Y$, written as $\\mathrm{E}[Y \\mid X=x_1] - \\mathrm{E}[Y \\mid X=x_0]$. 
The TV measure encompasses both causal and confounded effects, as opposed to the ATE which only encompasses causal (direct and mediated) variations. In this way, the TV measure is suitable for providing explanations in the natural regime and answering questions such as ``why is $X$ associated with $Y$?''. Our focus is on decomposing the TV measure, in a way that explicitly includes direct, indirect, and confounded variations. Furthermore, we also decompose the TV measure to include interaction terms between these different pathways. Subsequently, interaction testing is introduced, involving hypothesis tests to determine if interaction terms are significantly different from zero. If interactions are not significant, more parsimonious decompositions of the TV measure can be used."}, "https://arxiv.org/abs/2411.07031": {"title": "Evaluating the Accuracy of Chatbots in Financial Literature", "link": "https://arxiv.org/abs/2411.07031", "description": "arXiv:2411.07031v1 Announce Type: cross \nAbstract: We evaluate the reliability of two chatbots, ChatGPT (4o and o1-preview versions), and Gemini Advanced, in providing references on financial literature and employing novel methodologies. Alongside the conventional binary approach commonly used in the literature, we developed a nonbinary approach and a recency measure to assess how hallucination rates vary with how recent a topic is. After analyzing 150 citations, ChatGPT-4o had a hallucination rate of 20.0% (95% CI, 13.6%-26.4%), while the o1-preview had a hallucination rate of 21.3% (95% CI, 14.8%-27.9%). In contrast, Gemini Advanced exhibited higher hallucination rates: 76.7% (95% CI, 69.9%-83.4%). While hallucination rates increased for more recent topics, this trend was not statistically significant for Gemini Advanced. These findings emphasize the importance of verifying chatbot-provided references, particularly in rapidly evolving fields."}, "https://arxiv.org/abs/2411.08338": {"title": "Quantifying uncertainty in the numerical integration of evolution equations based on Bayesian isotonic regression", "link": "https://arxiv.org/abs/2411.08338", "description": "arXiv:2411.08338v1 Announce Type: cross \nAbstract: This paper presents a new Bayesian framework for quantifying discretization errors in numerical solutions of ordinary differential equations. By modelling the errors as random variables, we impose a monotonicity constraint on the variances, referred to as discretization error variances. The key to our approach is the use of a shrinkage prior for the variances coupled with variable transformations. This methodology extends existing Bayesian isotonic regression techniques to tackle the challenge of estimating the variances of a normal distribution. An additional key feature is the use of a Gaussian mixture model for the $\\log$-$\\chi^2_1$ distribution, enabling the development of an efficient Gibbs sampling algorithm for the corresponding posterior."}, "https://arxiv.org/abs/2007.12922": {"title": "Data fusion methods for the heterogeneity of treatment effect and confounding function", "link": "https://arxiv.org/abs/2007.12922", "description": "arXiv:2007.12922v3 Announce Type: replace \nAbstract: The heterogeneity of treatment effect (HTE) lies at the heart of precision medicine. Randomized controlled trials are gold-standard for treatment effect estimation but are typically underpowered for heterogeneous effects. 
In contrast, large observational studies have high predictive power but are often confounded due to the lack of randomization of treatment. We show that an observational study, even subject to hidden confounding, may be used to empower trials in estimating the HTE using the notion of confounding function. The confounding function summarizes the impact of unmeasured confounders on the difference between the observed treatment effect and the causal treatment effect, given the observed covariates, which is unidentifiable based only on the observational study. Coupling the trial and observational study, we show that the HTE and confounding function are identifiable. We then derive the semiparametric efficient scores and the integrative estimators of the HTE and confounding function. We clarify the conditions under which the integrative estimator of the HTE is strictly more efficient than the trial estimator. Finally, we illustrate the integrative estimators via simulation and an application."}, "https://arxiv.org/abs/2105.10809": {"title": "Exact PPS Sampling with Bounded Sample Size", "link": "https://arxiv.org/abs/2105.10809", "description": "arXiv:2105.10809v2 Announce Type: replace \nAbstract: Probability proportional to size (PPS) sampling schemes with a target sample size aim to produce a sample comprising a specified number $n$ of items while ensuring that each item in the population appears in the sample with a probability proportional to its specified \"weight\" (also called its \"size\"). These two objectives, however, cannot always be achieved simultaneously. Existing PPS schemes prioritize control of the sample size, violating the PPS property if necessary. We provide a new PPS scheme that allows a different trade-off: our method enforces the PPS property at all times while ensuring that the sample size never exceeds the target value $n$. The sample size is exactly equal to $n$ if possible, and otherwise has maximal expected value and minimal variance. Thus we bound the sample size, thereby avoiding storage overflows and helping to control the time required for analytics over the sample, while allowing the user complete control over the sample contents. The method is both simple to implement and efficient, being a one-pass streaming algorithm with an amortized processing time of $O(1)$ per item."}, "https://arxiv.org/abs/2206.04902": {"title": "Forecasting macroeconomic data with Bayesian VARs: Sparse or dense? It depends!", "link": "https://arxiv.org/abs/2206.04902", "description": "arXiv:2206.04902v4 Announce Type: replace \nAbstract: Vector autoregressions (VARs) are widely applied when it comes to modeling and forecasting macroeconomic variables. In high dimensions, however, they are prone to overfitting. Bayesian methods, more concretely shrinkage priors, have been shown to be successful in improving prediction performance. In the present paper, we introduce the semi-global framework, in which we replace the traditional global shrinkage parameter with group-specific shrinkage parameters. We show how this framework can be applied to various shrinkage priors, such as global-local priors and stochastic search variable selection priors. We demonstrate the virtues of the proposed framework in an extensive simulation study and in an empirical application forecasting data of the US economy. 
Further, we shed more light on the ongoing ``Illusion of Sparsity'' debate, finding that forecasting performances under sparse/dense priors vary across evaluated economic variables and across time frames. Dynamic model averaging, however, can combine the merits of both worlds."}, "https://arxiv.org/abs/2306.16642": {"title": "Improving randomized controlled trial analysis via data-adaptive borrowing", "link": "https://arxiv.org/abs/2306.16642", "description": "arXiv:2306.16642v2 Announce Type: replace \nAbstract: In recent years, real-world external controls have grown in popularity as a tool to empower randomized placebo-controlled trials, particularly in rare diseases or cases where balanced randomization is unethical or impractical. However, as external controls are not always comparable to the trials, direct borrowing without scrutiny may heavily bias the treatment effect estimator. Our paper proposes a data-adaptive integrative framework capable of preventing unknown biases of the external controls. The adaptive nature is achieved by dynamically sorting out a comparable subset of the external controls via bias penalization. Our proposed method can simultaneously achieve (a) the semiparametric efficiency bound when the external controls are comparable and (b) selective borrowing that mitigates the impact of the existence of incomparable external controls. Furthermore, we establish statistical guarantees, including consistency, asymptotic distribution, and inference, providing type-I error control and good power. Extensive simulations and two real-data applications show that the proposed method leads to improved performance over the trial-only estimator across various bias-generating scenarios."}, "https://arxiv.org/abs/2310.09673": {"title": "Robust Quickest Change Detection in Non-Stationary Processes", "link": "https://arxiv.org/abs/2310.09673", "description": "arXiv:2310.09673v2 Announce Type: replace \nAbstract: Optimal algorithms are developed for robust detection of changes in non-stationary processes. These are processes in which the distribution of the data after change varies with time. The decision-maker does not have access to precise information on the post-change distribution. It is shown that if the post-change non-stationary family has a distribution that is least favorable in a well-defined sense, then the algorithms designed using the least favorable distributions are robust and optimal. Non-stationary processes are encountered in public health monitoring and space and military applications. The robust algorithms are applied to real and simulated data to show their effectiveness."}, "https://arxiv.org/abs/2401.01804": {"title": "Efficient Computation of Confidence Sets Using Classification on Equidistributed Grids", "link": "https://arxiv.org/abs/2401.01804", "description": "arXiv:2401.01804v2 Announce Type: replace \nAbstract: Economic models produce moment inequalities, which can be used to form tests of the true parameters. Confidence sets (CS) of the true parameters are derived by inverting these tests. However, they often lack analytical expressions, necessitating a grid search to obtain the CS numerically by retaining the grid points that pass the test. When the statistic is not asymptotically pivotal, constructing the critical value for each grid point in the parameter space adds to the computational burden. In this paper, we convert the computational issue into a classification problem by using a support vector machine (SVM) classifier. 
Its decision function provides a faster and more systematic way of dividing the parameter space into two regions: inside vs. outside of the confidence set. We label those points in the CS as 1 and those outside as -1. Researchers can train the SVM classifier on a grid of manageable size and use it to determine whether points on denser grids are in the CS or not. We establish certain conditions for the grid so that there is a tuning that allows us to asymptotically reproduce the test in the CS. This means that in the limit, a point is classified as belonging to the confidence set if and only if it is labeled as 1 by the SVM."}, "https://arxiv.org/abs/2104.04716": {"title": "Selecting Penalty Parameters of High-Dimensional M-Estimators using Bootstrapping after Cross-Validation", "link": "https://arxiv.org/abs/2104.04716", "description": "arXiv:2104.04716v5 Announce Type: replace-cross \nAbstract: We develop a new method for selecting the penalty parameter for $\\ell_{1}$-penalized M-estimators in high dimensions, which we refer to as bootstrapping after cross-validation. We derive rates of convergence for the corresponding $\\ell_1$-penalized M-estimator and also for the post-$\\ell_1$-penalized M-estimator, which refits the non-zero entries of the former estimator without penalty in the criterion function. We demonstrate via simulations that our methods are not dominated by cross-validation in terms of estimation errors and can outperform cross-validation in terms of inference. As an empirical illustration, we revisit Fryer Jr (2019), who investigated racial differences in police use of force, and confirm his findings."}, "https://arxiv.org/abs/2411.08929": {"title": "Power and Sample Size Calculations for Cluster Randomized Hybrid Type 2 Effectiveness-Implementation Studies", "link": "https://arxiv.org/abs/2411.08929", "description": "arXiv:2411.08929v1 Announce Type: new \nAbstract: Hybrid studies allow investigators to simultaneously study an intervention effectiveness outcome and an implementation research outcome. In particular, type 2 hybrid studies support research that places equal importance on both outcomes rather than focusing on one and secondarily on the other (i.e., type 1 and type 3 studies). Hybrid 2 studies introduce the statistical issue of multiple testing, complicated by the fact that they are typically also cluster randomized trials. Standard statistical methods do not apply in this scenario. Here, we describe the design methodologies available for validly powering hybrid type 2 studies and producing reliable sample size calculations in a cluster-randomized design with a focus on binary outcomes. Through a literature search, 18 publications were identified that included methods relevant to the design of hybrid 2 studies. Five methods were identified, two of which did not account for clustering but are extended in this article to do so, namely the combined outcomes approach and the single 1-degree of freedom combined test. Procedures for powering hybrid 2 studies using these five methods are described and illustrated using input parameters inspired by a study from the Community Intervention to Reduce CardiovascuLar Disease in Chicago (CIRCL-Chicago) Implementation Research Center. In this illustrative example, the intervention effectiveness outcome was controlled blood pressure, and the implementation outcome was reach. 
The conjunctive test resulted in higher power than the popular p-value adjustment methods, and the newly extended combined outcomes and single 1-DF test were found to be the most powerful among all of the tests."}, "https://arxiv.org/abs/2411.08984": {"title": "Using Principal Progression Rate to Quantify and Compare Disease Progression in Comparative Studies", "link": "https://arxiv.org/abs/2411.08984", "description": "arXiv:2411.08984v1 Announce Type: new \nAbstract: In comparative studies of progressive diseases, such as randomized controlled trials (RCTs), the mean Change From Baseline (CFB) of a continuous outcome at a pre-specified follow-up time across subjects in the target population is a standard estimand used to summarize the overall disease progression. Despite its simplicity in interpretation, the mean CFB may not efficiently capture important features of the trajectory of the mean outcome relevant to the evaluation of the treatment effect of an intervention. Additionally, the estimation of the mean CFB does not use all longitudinal data points. To address these limitations, we propose a class of estimands called Principal Progression Rate (PPR). The PPR is a weighted average of local or instantaneous slope of the trajectory of the population mean during the follow-up. The flexibility of the weight function allows the PPR to cover a broad class of intuitive estimands, including the mean CFB, the slope of ordinary least-square fit to the trajectory, and the area under the curve. We showed that properly chosen PPRs can enhance statistical power over the mean CFB by amplifying the signal of treatment effect and/or improving estimation precision. We evaluated different versions of PPRs and the performance of their estimators through numerical studies. A real dataset was analyzed to demonstrate the advantage of using alternative PPR over the mean CFB."}, "https://arxiv.org/abs/2411.09017": {"title": "Debiased machine learning for counterfactual survival functionals based on left-truncated right-censored data", "link": "https://arxiv.org/abs/2411.09017", "description": "arXiv:2411.09017v1 Announce Type: new \nAbstract: Learning causal effects of a binary exposure on time-to-event endpoints can be challenging because survival times may be partially observed due to censoring and systematically biased due to truncation. In this work, we present debiased machine learning-based nonparametric estimators of the joint distribution of a counterfactual survival time and baseline covariates for use when the observed data are subject to covariate-dependent left truncation and right censoring and when baseline covariates suffice to deconfound the relationship between exposure and survival time. Our inferential procedures explicitly allow the integration of flexible machine learning tools for nuisance estimation, and enjoy certain robustness properties. The approach we propose can be directly used to make pointwise or uniform inference on smooth summaries of the joint counterfactual survival time and covariate distribution, and can be valuable even in the absence of interventions, when summaries of a marginal survival distribution are of interest. 
We showcase how our procedures can be used to learn a variety of inferential targets and illustrate their performance in simulation studies."}, "https://arxiv.org/abs/2411.09097": {"title": "On the Selection Stability of Stability Selection and Its Applications", "link": "https://arxiv.org/abs/2411.09097", "description": "arXiv:2411.09097v1 Announce Type: new \nAbstract: Stability selection is a widely adopted resampling-based framework for high-dimensional structure estimation and variable selection. However, the concept of 'stability' is often narrowly addressed, primarily through examining selection frequencies, or 'stability paths'. This paper seeks to broaden the use of an established stability estimator to evaluate the overall stability of the stability selection framework, moving beyond single-variable analysis. We suggest that the stability estimator offers two advantages: it can serve as a reference to reflect the robustness of the outcomes obtained and help identify an optimal regularization value to improve stability. By determining this value, we aim to calibrate key stability selection parameters, namely, the decision threshold and the expected number of falsely selected variables, within established theoretical bounds. Furthermore, we explore a novel selection criterion based on this regularization value. With the asymptotic distribution of the stability estimator previously established, convergence to true stability is ensured, allowing us to observe stability trends over successive sub-samples. This approach sheds light on the required number of sub-samples addressing a notable gap in prior studies. The 'stabplot' package is developed to facilitate the use of the plots featured in this manuscript, supporting their integration into further statistical analysis and research workflows."}, "https://arxiv.org/abs/2411.09218": {"title": "On the (Mis)Use of Machine Learning with Panel Data", "link": "https://arxiv.org/abs/2411.09218", "description": "arXiv:2411.09218v1 Announce Type: new \nAbstract: Machine Learning (ML) is increasingly employed to inform and support policymaking interventions. This methodological article cautions practitioners about common but often overlooked pitfalls associated with the uncritical application of supervised ML algorithms to panel data. Ignoring the cross-sectional and longitudinal structure of this data can lead to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of ML models. After clarifying these issues, we provide practical guidelines and best practices for applied researchers to ensure the correct implementation of supervised ML in panel data environments, emphasizing the need to define ex ante the primary goal of the analysis and align the ML pipeline accordingly. An empirical application based on over 3,000 US counties from 2000 to 2019 illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks."}, "https://arxiv.org/abs/2411.09221": {"title": "Difference-in-Differences with Sample Selection", "link": "https://arxiv.org/abs/2411.09221", "description": "arXiv:2411.09221v1 Announce Type: new \nAbstract: Endogenous treatment and sample selection are two concomitant sources of endogeneity that challenge the validity of causal inference. 
In this paper, we focus on the partial identification of treatment effects within a standard two-period difference-in-differences framework when the outcome is observed for an endogenously selected subpopulation. The identification strategy embeds Lee's (2009) bounding approach based on principal stratification, which divides the population into latent subgroups based on selection behaviour in counterfactual treatment states in both periods. We establish identification results for four latent types and illustrate the proposed approach by applying it to estimate 1) the effect of a job training program on earnings and 2) the effect of a working-from-home policy on employee performance."}, "https://arxiv.org/abs/2411.09452": {"title": "Sparse Interval-valued Time Series Modeling with Machine Learning", "link": "https://arxiv.org/abs/2411.09452", "description": "arXiv:2411.09452v1 Announce Type: new \nAbstract: By treating intervals as inseparable sets, this paper proposes sparse machine learning regressions for high-dimensional interval-valued time series. With LASSO or adaptive LASSO techniques, we develop a penalized minimum distance estimator, which covers point-based estimators as special cases. We establish the consistency and oracle properties of the proposed penalized estimator, regardless of whether the number of predictors is diverging with the sample size. Monte Carlo simulations demonstrate the favorable finite sample properties of the proposed estimation. Empirical applications to interval-valued crude oil price forecasting and sparse index-tracking portfolio construction illustrate the robustness and effectiveness of our method against competing approaches, including random forest and multilayer perceptron for interval-valued data. Our findings highlight the potential of machine learning techniques in interval-valued time series analysis, offering new insights for financial forecasting and portfolio management."}, "https://arxiv.org/abs/2411.09579": {"title": "Propensity Score Matching: Should We Use It in Designing Observational Studies?", "link": "https://arxiv.org/abs/2411.09579", "description": "arXiv:2411.09579v1 Announce Type: new \nAbstract: Propensity Score Matching (PSM) stands as a widely embraced method in comparative effectiveness research. PSM crafts matched datasets, mimicking some attributes of randomized designs, from observational data. In a valid PSM design where all baseline confounders are measured and matched, the confounders would be balanced, allowing the treatment status to be considered as if it were randomly assigned. Nevertheless, recent research has unveiled a different facet of PSM, termed \"the PSM paradox.\" As PSM approaches exact matching by progressively pruning matched sets in order of decreasing propensity score distance, it can paradoxically lead to greater covariate imbalance, heightened model dependence, and increased bias, contrary to its intended purpose. Methods: We used analytic formulas, simulation, and the literature to demonstrate that this paradox stems from the misuse of metrics for assessing chance imbalance and bias. Results: Firstly, matched pairs typically exhibit different covariate values despite having identical propensity scores. However, this disparity represents a \"chance\" difference and will average to zero over a large number of matched pairs. 
Common distance metrics cannot capture this \"chance\" nature of covariate imbalance, instead reflecting increasing variability in chance imbalance as units are pruned and the sample size diminishes. Secondly, the largest estimate among numerous fitted models, reflecting researchers' uncertainty over the correct model, was used to determine statistical bias. This cherry-picking procedure ignores the most significant benefit of matching designs: reducing model dependence through robustness against model misspecification bias. Conclusions: We conclude that the PSM paradox is not a legitimate concern and should not stop researchers from using PSM designs."}, "https://arxiv.org/abs/2411.08998": {"title": "Microfoundation Inference for Strategic Prediction", "link": "https://arxiv.org/abs/2411.08998", "description": "arXiv:2411.08998v1 Announce Type: cross \nAbstract: Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adoption of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents' responses as a cost-adjusted utility maximization problem and propose estimates for said cost. Our approach leverages optimal transport to align pre-model exposure (ex ante) and post-model exposure (ex post) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit-scoring dataset."}, "https://arxiv.org/abs/2411.09100": {"title": "General linear threshold models with application to influence maximization", "link": "https://arxiv.org/abs/2411.09100", "description": "arXiv:2411.09100v1 Announce Type: cross \nAbstract: A number of models have been developed for information spread through networks, often for solving the Influence Maximization (IM) problem. IM is the task of choosing a fixed number of nodes to \"seed\" with information in order to maximize the spread of this information through the network, with applications in areas such as marketing and public health. Most methods for this problem rely heavily on the assumption of known strength of connections between network members (edge weights), which is often unrealistic. In this paper, we develop a likelihood-based approach to estimate edge weights from the fully and partially observed information diffusion paths. We also introduce a broad class of information diffusion models, the general linear threshold (GLT) model, which generalizes the well-known linear threshold (LT) model by allowing arbitrary distributions of node activation thresholds. We then show our weight estimator is consistent under the GLT and some mild assumptions. For the special case of the standard LT model, we also present a much faster expectation-maximization approach for weight estimation. Finally, we prove that for the GLT models, the IM problem can be solved by a natural greedy algorithm with standard optimality guarantees if all node threshold distributions have concave cumulative distribution functions. 
Extensive experiments on synthetic and real-world networks demonstrate that the flexibility in the choice of threshold distribution combined with the estimation of edge weights significantly improves the quality of IM solutions, spread prediction, and the estimates of the node activation probabilities."}, "https://arxiv.org/abs/2411.09225": {"title": "fdesigns: Bayesian Optimal Designs of Experiments for Functional Models in R", "link": "https://arxiv.org/abs/2411.09225", "description": "arXiv:2411.09225v1 Announce Type: cross \nAbstract: This paper describes the R package fdesigns that implements a methodology for identifying Bayesian optimal experimental designs for models whose factor settings are functions, known as profile factors. These types of experiments involve factors that vary dynamically over time, presenting unique challenges in both estimation and design due to the infinite-dimensional nature of functions. The package fdesigns implements a dimension reduction method leveraging basis functions of the B-spline basis system. The package fdesigns contains functions that effectively reduce the design problem to the optimisation of basis coefficients for functional linear and functional generalised linear models, and it accommodates various options. Applications of the fdesigns package are demonstrated through a series of examples that showcase its capabilities in identifying optimal designs for functional linear and generalised linear models. The examples highlight how the package's functions can be used to efficiently design experiments involving both profile and scalar factors, including interactions and polynomial effects."}, "https://arxiv.org/abs/2411.09258": {"title": "On Asymptotic Optimality of Least Squares Model Averaging When True Model Is Included", "link": "https://arxiv.org/abs/2411.09258", "description": "arXiv:2411.09258v1 Announce Type: cross \nAbstract: Asymptotic optimality is a key theoretical property in model averaging. Due to technical difficulties, existing studies rely on restricted weight sets or the assumption that there is no true model with fixed dimensions in the candidate set. The focus of this paper is to overcome these difficulties. Surprisingly, we discover that when the penalty factor in the weight selection criterion diverges with a certain order and the true model dimension is fixed, asymptotic loss optimality does not hold, but asymptotic risk optimality does. This result differs from the corresponding result of Fang et al. (2023, Econometric Theory 39, 412-441) and reveals that using the discrete weight set of Hansen (2007, Econometrica 75, 1175-1189) can yield opposite asymptotic properties compared to using the usual weight set. Simulation studies illustrate the theoretical findings in a variety of settings."}, "https://arxiv.org/abs/2411.09353": {"title": "Monitoring time to event in registry data using CUSUMs based on excess hazard models", "link": "https://arxiv.org/abs/2411.09353", "description": "arXiv:2411.09353v1 Announce Type: cross \nAbstract: An aspect of interest in surveillance of diseases is whether the survival time distribution changes over time. By following data in health registries over time, this can be monitored, either in real time or retrospectively. With relevant risk factors registered, these can be taken into account in the monitoring as well. A challenge in monitoring survival times based on registry data is that data on cause of death might either be missing or uncertain. 
To quantify the burden of disease in such cases, excess hazard methods can be used, where the total hazard is modelled as the population hazard plus the excess hazard due to the disease.\n We propose a CUSUM procedure for monitoring for changes in the survival time distribution in cases where use of excess hazard models is relevant. The procedure is based on a survival log-likelihood ratio and extends previously suggested methods for monitoring of time to event to the excess hazard setting. The procedure takes into account changes in the population risk over time, as well as changes in the excess hazard which is explained by observed covariates. Properties, challenges and an application to cancer registry data will be presented."}, "https://arxiv.org/abs/2411.09514": {"title": "On importance sampling and independent Metropolis-Hastings with an unbounded weight function", "link": "https://arxiv.org/abs/2411.09514", "description": "arXiv:2411.09514v1 Announce Type: cross \nAbstract: Importance sampling and independent Metropolis-Hastings (IMH) are among the fundamental building blocks of Monte Carlo methods. Both require a proposal distribution that globally approximates the target distribution. The Radon-Nikodym derivative of the target distribution relative to the proposal is called the weight function. Under the weak assumption that the weight is unbounded but has a number of finite moments under the proposal distribution, we obtain new results on the approximation error of importance sampling and of the particle independent Metropolis-Hastings algorithm (PIMH), which includes IMH as a special case. For IMH and PIMH, we show that the common random numbers coupling is maximal. Using that coupling we derive bounds on the total variation distance of a PIMH chain to the target distribution. The bounds are sharp with respect to the number of particles and the number of iterations. Our results allow a formal comparison of the finite-time biases of importance sampling and IMH. We further consider bias removal techniques using couplings of PIMH, and provide conditions under which the resulting unbiased estimators have finite moments. We compare the asymptotic efficiency of regular and unbiased importance sampling estimators as the number of particles goes to infinity."}, "https://arxiv.org/abs/2310.08479": {"title": "Personalised dynamic super learning: an application in predicting hemodiafiltration convection volumes", "link": "https://arxiv.org/abs/2310.08479", "description": "arXiv:2310.08479v2 Announce Type: replace \nAbstract: Obtaining continuously updated predictions is a major challenge for personalised medicine. Leveraging combinations of parametric regressions and machine learning approaches, the personalised online super learner (POSL) can achieve such dynamic and personalised predictions. We adapt POSL to predict a repeated continuous outcome dynamically and propose a new way to validate such personalised or dynamic prediction models. We illustrate its performance by predicting the convection volume of patients undergoing hemodiafiltration. POSL outperformed its candidate learners with respect to median absolute error, calibration-in-the-large, discrimination, and net benefit. 
We finally discuss the choices and challenges underlying the use of POSL."}, "https://arxiv.org/abs/2401.06403": {"title": "Fourier analysis of spatial point processes", "link": "https://arxiv.org/abs/2401.06403", "description": "arXiv:2401.06403v2 Announce Type: replace \nAbstract: In this article, we develop comprehensive frequency domain methods for estimating and inferring the second-order structure of spatial point processes. The main element here is on utilizing the discrete Fourier transform (DFT) of the point pattern and its tapered counterpart. Under second-order stationarity, we show that both the DFTs and the tapered DFTs are asymptotically jointly independent Gaussian even when the DFTs share the same limiting frequencies. Based on these results, we establish an $\\alpha$-mixing central limit theorem for a statistic formulated as a quadratic form of the tapered DFT. As applications, we derive the asymptotic distribution of the kernel spectral density estimator and establish a frequency domain inferential method for parametric stationary point processes. For the latter, the resulting model parameter estimator is computationally tractable and yields meaningful interpretations even in the case of model misspecification. We investigate the finite sample performance of our estimator through simulations, considering scenarios of both correctly specified and misspecified models. Furthermore, we extend our proposed DFT-based frequency domain methods to a class of non-stationary spatial point processes."}, "https://arxiv.org/abs/2401.07344": {"title": "Robust Genomic Prediction and Heritability Estimation using Density Power Divergence", "link": "https://arxiv.org/abs/2401.07344", "description": "arXiv:2401.07344v2 Announce Type: replace \nAbstract: This manuscript delves into the intersection of genomics and phenotypic prediction, focusing on the statistical innovation required to navigate the complexities introduced by noisy covariates and confounders. The primary emphasis is on the development of advanced robust statistical models tailored for genomic prediction from single nucleotide polymorphism data in plant and animal breeding and multi-field trials. The manuscript highlights the significance of incorporating all estimated effects of marker loci into the statistical framework and aiming to reduce the high dimensionality of data while preserving critical information. This paper introduces a new robust statistical framework for genomic prediction, employing one-stage and two-stage linear mixed model analyses along with utilizing the popular robust minimum density power divergence estimator (MDPDE) to estimate genetic effects on phenotypic traits. The study illustrates the superior performance of the proposed MDPDE-based genomic prediction and associated heritability estimation procedures over existing competitors through extensive empirical experiments on artificial datasets and application to a real-life maize breeding dataset. 
The results showcase the robustness and accuracy of the proposed MDPDE-based approaches, especially in the presence of data contamination, emphasizing their potential applications in improving breeding programs and advancing genomic prediction of phenotypic traits."}, "https://arxiv.org/abs/2011.14762": {"title": "M-Variance Asymptotics and Uniqueness of Descriptors", "link": "https://arxiv.org/abs/2011.14762", "description": "arXiv:2011.14762v2 Announce Type: replace-cross \nAbstract: Asymptotic theory for M-estimation problems usually focuses on the asymptotic convergence of the sample descriptor, defined as the minimizer of the sample loss function. Here, we explore a related question and formulate asymptotic theory for the minimum value of sample loss, the M-variance. Since the loss function value is always a real number, the asymptotic theory for the M-variance is comparatively simple. M-variance often satisfies a standard central limit theorem, even in situations where the asymptotics of the descriptor is more complicated, as for example in the case of smeariness, or if no asymptotic distribution can be given, as can be the case if the descriptor space is a general metric space. We use the asymptotic results for the M-variance to formulate a hypothesis test to systematically determine for a given sample whether the underlying population loss function may have multiple global minima. We discuss three applications of our test to data, each of which presents a typical scenario in which non-uniqueness of descriptors may occur. These model scenarios are the mean on a non-Euclidean space, non-linear regression and Gaussian mixture clustering."}, "https://arxiv.org/abs/2411.09771": {"title": "Bayesian estimation of finite mixtures of Tobit models", "link": "https://arxiv.org/abs/2411.09771", "description": "arXiv:2411.09771v1 Announce Type: new \nAbstract: This paper outlines a Bayesian approach to estimate finite mixtures of Tobit models. The method consists of an MCMC approach that combines Gibbs sampling with data augmentation and is simple to implement. I show through simulations that the flexibility provided by this method is especially helpful when censoring is not negligible. In addition, I demonstrate the broad utility of this methodology with applications to a job training program, labor supply, and demand for medical care. I find that this approach allows for non-trivial additional flexibility that can alter results considerably and beyond improving model fit."}, "https://arxiv.org/abs/2411.09808": {"title": "Sharp Testable Implications of Encouragement Designs", "link": "https://arxiv.org/abs/2411.09808", "description": "arXiv:2411.09808v1 Announce Type: new \nAbstract: This paper studies the sharp testable implications of an additive random utility model with a discrete multi-valued treatment and a discrete multi-valued instrument, in which each value of the instrument only weakly increases the utility of one choice. Borrowing the terminology used in randomized experiments, we call such a setting an encouragement design. We derive inequalities in terms of the conditional choice probabilities that characterize when the distribution of the observed data is consistent with such a model.
Through a novel constructive argument, we further show these inequalities are sharp in the sense that any distribution of the observed data that satisfies these inequalities is generated by this additive random utility model."}, "https://arxiv.org/abs/2411.10009": {"title": "Semiparametric inference for impulse response functions using double/debiased machine learning", "link": "https://arxiv.org/abs/2411.10009", "description": "arXiv:2411.10009v1 Announce Type: new \nAbstract: We introduce a double/debiased machine learning (DML) estimator for the impulse response function (IRF) in settings where a time series of interest is subjected to multiple discrete treatments, assigned over time, which can have a causal effect on future outcomes. The proposed estimator can rely on fully nonparametric relations between treatment and outcome variables, opening up the possibility to use flexible machine learning approaches to estimate IRFs. To this end, we extend the theory of DML from an i.i.d. to a time series setting and show that the proposed DML estimator for the IRF is consistent and asymptotically normally distributed at the parametric rate, allowing for semiparametric inference for dynamic effects in a time series setting. The properties of the estimator are validated numerically in finite samples by applying it to learn the IRF in the presence of serial dependence in both the confounder and observation innovation processes. We also illustrate the methodology empirically by applying it to the estimation of the effects of macroeconomic shocks."}, "https://arxiv.org/abs/2411.10064": {"title": "Adaptive Physics-Guided Neural Network", "link": "https://arxiv.org/abs/2411.10064", "description": "arXiv:2411.10064v1 Announce Type: new \nAbstract: This paper introduces an adaptive physics-guided neural network (APGNN) framework for predicting quality attributes from image data by integrating physical laws into deep learning models. The APGNN adaptively balances data-driven and physics-informed predictions, enhancing model accuracy and robustness across different environments. Our approach is evaluated on both synthetic and real-world datasets, with comparisons to conventional data-driven models such as ResNet. For the synthetic data, 2D domains were generated using three distinct governing equations: the diffusion equation, the advection-diffusion equation, and the Poisson equation. Non-linear transformations were applied to these domains to emulate complex physical processes in image form.\n In real-world experiments, the APGNN consistently demonstrated superior performance in the diverse thermal image dataset. On the cucumber dataset, characterized by low material diversity and controlled conditions, APGNN and PGNN showed similar performance, both outperforming the data-driven ResNet. However, in the more complex thermal dataset, particularly for outdoor materials with higher environmental variability, APGNN outperformed both PGNN and ResNet by dynamically adjusting its reliance on physics-based versus data-driven insights. This adaptability allowed APGNN to maintain robust performance across structured, low-variability settings and more heterogeneous scenarios. 
These findings underscore the potential of adaptive physics-guided learning to integrate physical constraints effectively, even in challenging real-world contexts with diverse environmental conditions."}, "https://arxiv.org/abs/2411.10089": {"title": "G-computation for increasing performances of clinical trials with individual randomization and binary response", "link": "https://arxiv.org/abs/2411.10089", "description": "arXiv:2411.10089v1 Announce Type: new \nAbstract: In a clinical trial, the random allocation aims to balance prognostic factors between arms, preventing true confounding. However, residual differences due to chance may introduce near-confounders. Adjusting for prognostic factors is therefore recommended, especially because of the related increase in power. In this paper, we hypothesized that G-computation associated with machine learning could be a suitable method for randomized clinical trials even with small sample sizes. It allows for flexible estimation of the outcome model, even when the covariates' relationships with outcomes are complex. Through simulations, penalized regressions (Lasso, Elasticnet) and algorithm-based methods (neural network, support vector machine, super learner) were compared. Penalized regressions reduced variance but may introduce a slight increase in bias. The associated reductions in sample size ranged from 17\% to 54\%. In contrast, algorithm-based methods, while effective for larger and more complex data structures, underestimated the standard deviation, especially with small sample sizes. In conclusion, G-computation with penalized models, particularly Elasticnet with splines when appropriate, represents a relevant approach for increasing the power of RCTs and accounting for potential near-confounders."}, "https://arxiv.org/abs/2411.10121": {"title": "Quadratic Form based Multiple Contrast Tests for Comparison of Group Means", "link": "https://arxiv.org/abs/2411.10121", "description": "arXiv:2411.10121v1 Announce Type: new \nAbstract: Comparing the mean vectors across different groups is a cornerstone in the realm of multivariate statistics, with quadratic forms commonly serving as test statistics. However, when the overall hypothesis is rejected, identifying specific vector components or determining the groups among which differences exist requires additional investigations. Conversely, employing multiple contrast tests (MCT) allows conclusions about which components or groups contribute to these differences. However, they come with a trade-off, as MCT lose some benefits inherent to quadratic forms. In this paper, we combine both approaches to get a quadratic form based multiple contrast test that leverages the advantages of both. To understand its theoretical properties, we investigate its asymptotic distribution in a semiparametric model. We thereby focus on two common quadratic forms - the Wald-type statistic and the Anova-type statistic - although our findings are applicable to any quadratic form.\n Furthermore, we employ Monte-Carlo and resampling techniques to enhance the test's performance in small sample scenarios.
Through an extensive simulation study, we assess the performance of our proposed tests against existing alternatives, highlighting their advantages."}, "https://arxiv.org/abs/2411.10204": {"title": "Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport", "link": "https://arxiv.org/abs/2411.10204", "description": "arXiv:2411.10204v1 Announce Type: new \nAbstract: Wasserstein distances form a family of metrics on spaces of probability measures that have recently seen many applications. However, statistical analysis in these spaces is complex due to the nonlinearity of Wasserstein spaces. One potential solution to this problem is Linear Optimal Transport (LOT). This method allows one to find a Euclidean embedding, called LOT embedding, of measures in some Wasserstein spaces, but some information is lost in this embedding. So, to understand whether statistical analysis relying on LOT embeddings can make valid inferences about original data, it is helpful to quantify how well these embeddings describe that data. To answer this question, we present a decomposition of the Fr\\'echet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures. We then extend this decomposition to the Fused Gromov-Wasserstein setting. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained by the embedding, and the classification accuracy of machine learning classifiers built on the embedded data. We use the MNIST handwritten digits dataset, IMDB-50000 dataset, and Diffusion Tensor MRI images for these experiments. Our results illustrate the effectiveness of low dimensional LOT embeddings in terms of the percentage of variance explained and the classification accuracy of models built on the embedded data."}, "https://arxiv.org/abs/2411.10218": {"title": "Bayesian Adaptive Tucker Decompositions for Tensor Factorization", "link": "https://arxiv.org/abs/2411.10218", "description": "arXiv:2411.10218v1 Announce Type: new \nAbstract: Tucker tensor decomposition offers a more effective representation for multiway data compared to the widely used PARAFAC model. However, its flexibility brings the challenge of selecting the appropriate latent multi-rank. To overcome the issue of pre-selecting the latent multi-rank, we introduce a Bayesian adaptive Tucker decomposition model that infers the multi-rank automatically via an infinite increasing shrinkage prior. The model introduces local sparsity in the core tensor, inducing rich and at the same time parsimonious dependency structures. Posterior inference proceeds via an efficient adaptive Gibbs sampler, supporting both continuous and binary data and allowing for straightforward missing data imputation when dealing with incomplete multiway data. We discuss fundamental properties of the proposed modeling framework, providing theoretical justification. 
Simulation studies and applications to chemometrics and complex ecological data offer compelling evidence of its advantages over existing tensor factorization methods."}, "https://arxiv.org/abs/2411.10312": {"title": "Generalized Conditional Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2411.10312", "description": "arXiv:2411.10312v1 Announce Type: new \nAbstract: We propose generalized conditional functional principal components analysis (GC-FPCA) for the joint modeling of the fixed and random effects of non-Gaussian functional outcomes. The method scales up to very large functional data sets by estimating the principal components of the covariance matrix on the linear predictor scale conditional on the fixed effects. This is achieved by combining three modeling innovations: (1) fit local generalized linear mixed models (GLMMs) conditional on covariates in windows along the functional domain; (2) conduct a functional principal component analysis (FPCA) on the person-specific functional effects obtained by assembling the estimated random effects from the local GLMMs; and (3) fit a joint functional mixed effects model conditional on covariates and the estimated principal components from the previous step. GC-FPCA was motivated by modeling the minute-level active/inactive profiles over the day ($1{,}440$ 0/1 measurements per person) for $8{,}700$ study participants in the National Health and Nutrition Examination Survey (NHANES) 2011-2014. We show that state-of-the-art approaches cannot handle data of this size and complexity, while GC-FPCA can."}, "https://arxiv.org/abs/2411.10381": {"title": "An Instrumental Variables Framework to Unite Spatial Confounding Methods", "link": "https://arxiv.org/abs/2411.10381", "description": "arXiv:2411.10381v1 Announce Type: new \nAbstract: Studies investigating the causal effects of spatially varying exposures on health$\\unicode{x2013}$such as air pollution, green space, or crime$\\unicode{x2013}$often rely on observational and spatially indexed data. A prevalent challenge is unmeasured spatial confounding, where an unobserved spatially varying variable affects both exposure and outcome, leading to biased causal estimates and invalid confidence intervals. In this paper, we introduce a general framework based on instrumental variables (IV) that encompasses and unites most of the existing methods designed to account for an unmeasured spatial confounder. We show that a common feature of all existing methods is their reliance on small-scale variation in exposure, which functions as an IV. In this framework, we outline the underlying assumptions and the estimation strategy of each method. Furthermore, we demonstrate that the IV can be used to identify and estimate the exposure-response curve under more relaxed assumptions. We conclude by estimating the exposure-response curve between long-term exposure to fine particulate matter and all-cause mortality among 33,454 zip codes in the United States while adjusting for unmeasured spatial confounding."}, "https://arxiv.org/abs/2411.10415": {"title": "Dynamic Causal Effects in a Nonlinear World: the Good, the Bad, and the Ugly", "link": "https://arxiv.org/abs/2411.10415", "description": "arXiv:2411.10415v1 Announce Type: new \nAbstract: Applied macroeconomists frequently use impulse response estimators motivated by linear models. We study whether the estimands of such procedures have a causal interpretation when the true data generating process is in fact nonlinear. 
We show that vector autoregressions and linear local projections onto observed shocks or proxies identify weighted averages of causal effects regardless of the extent of nonlinearities. By contrast, identification approaches that exploit heteroskedasticity or non-Gaussianity of latent shocks are highly sensitive to departures from linearity. Our analysis is based on new results on the identification of marginal treatment effects through weighted regressions, which may also be of interest to researchers outside macroeconomics."}, "https://arxiv.org/abs/2411.09726": {"title": "Spatio-Temporal Jump Model for Urban Thermal Comfort Monitoring", "link": "https://arxiv.org/abs/2411.09726", "description": "arXiv:2411.09726v1 Announce Type: cross \nAbstract: Thermal comfort is essential for well-being in urban spaces, especially as cities face increasing heat from urbanization and climate change. Existing thermal comfort models usually overlook temporal dynamics alongside spatial dependencies. We address this problem by introducing a spatio-temporal jump model that clusters data with persistence across both spatial and temporal dimensions. This framework enhances interpretability, minimizes abrupt state changes, and easily handles missing data. We validate our approach through extensive simulations, demonstrating its accuracy in recovering the true underlying partition. When applied to hourly environmental data gathered from a set of weather stations located across the city of Singapore, our proposal identifies meaningful thermal comfort regimes, demonstrating its effectiveness in dynamic urban settings and suitability for real-world monitoring. The comparison of these regimes with feedback on thermal preference indicates the potential of an unsupervised approach to avoid extensive surveys."}, "https://arxiv.org/abs/2411.10314": {"title": "Estimating the Cost of Informal Care with a Novel Two-Stage Approach to Individual Synthetic Control", "link": "https://arxiv.org/abs/2411.10314", "description": "arXiv:2411.10314v1 Announce Type: cross \nAbstract: Informal carers provide the majority of care for people living with challenges related to older age, long-term illness, or disability. However, the care they provide often results in a significant income penalty for carers, a factor largely overlooked in the economics literature and policy discourse. Leveraging data from the UK Household Longitudinal Study, this paper provides the first robust causal estimates of the caring income penalty using a novel individual synthetic control based method that accounts for unit-level heterogeneity in post-treatment trajectories over time. Our baseline estimates identify an average relative income gap of up to 45%, with an average decrease of {\\pounds}162 in monthly income, peaking at {\\pounds}192 per month after 4 years, based on the difference between informal carers providing the highest-intensity of care and their synthetic counterparts. We find that the income penalty is more pronounced for women than for men, and varies by ethnicity and age."}, "https://arxiv.org/abs/2009.00401": {"title": "Time-Varying Parameters as Ridge Regressions", "link": "https://arxiv.org/abs/2009.00401", "description": "arXiv:2009.00401v4 Announce Type: replace \nAbstract: Time-varying parameters (TVPs) models are frequently used in economics to capture structural change. I highlight a rather underutilized fact -- that these are actually ridge regressions. 
Instantly, this makes computations, tuning, and implementation much easier than in the state-space paradigm. Among other things, solving the equivalent dual ridge problem is computationally very fast even in high dimensions, and the crucial \"amount of time variation\" is tuned by cross-validation. Evolving volatility is dealt with using a two-step ridge regression. I consider extensions that incorporate sparsity (the algorithm selects which parameters vary and which do not) and reduced-rank restrictions (variation is tied to a factor model). To demonstrate the usefulness of the approach, I use it to study the evolution of monetary policy in Canada using large time-varying local projections. The application requires the estimation of about 4600 TVPs, a task well within the reach of the new method."}, "https://arxiv.org/abs/2309.00141": {"title": "Causal Inference under Network Interference Using a Mixture of Randomized Experiments", "link": "https://arxiv.org/abs/2309.00141", "description": "arXiv:2309.00141v2 Announce Type: replace \nAbstract: In randomized experiments, the classic Stable Unit Treatment Value Assumption (SUTVA) posits that the outcome for one experimental unit is unaffected by the treatment assignments of other units. However, this assumption is frequently violated in settings such as online marketplaces and social networks, where interference between units is common. We address the estimation of the total treatment effect in a network interference model by employing a mixed randomization design that combines two widely used experimental methods: Bernoulli randomization, where treatment is assigned independently to each unit, and cluster-based randomization, where treatment is assigned at the aggregate level. The mixed randomization design simultaneously incorporates both methods, thereby mitigating the bias present in cluster-based designs. We propose an unbiased estimator for the total treatment effect under this mixed design and show that its variance is bounded by $O(d^2 n^{-1} p^{-1} (1-p)^{-1})$, where $d$ is the maximum degree of the network, $n$ is the network size, and $p$ is the treatment probability. Additionally, we establish a lower bound of $\\Omega(d^{1.5} n^{-1} p^{-1} (1-p)^{-1})$ for the variance of any mixed design. Moreover, when the interference weights on the network's edges are unknown, we propose a weight-invariant design that achieves a variance bound of $O(d^3 n^{-1} p^{-1} (1-p)^{-1})$, which is aligned with the estimator introduced by Cortez-Rodriguez et al. (2023) under similar conditions."}, "https://arxiv.org/abs/2311.14204": {"title": "Reproducible Aggregation of Sample-Split Statistics", "link": "https://arxiv.org/abs/2311.14204", "description": "arXiv:2311.14204v3 Announce Type: replace \nAbstract: Statistical inference is often simplified by sample-splitting. This simplification comes at the cost of the introduction of randomness not native to the data. We propose a simple procedure for sequentially aggregating statistics constructed with multiple splits of the same sample. The user specifies a bound and a nominal error rate. If the procedure is implemented twice on the same data, the nominal error rate approximates the chance that the results differ by more than the bound. 
We illustrate the application of the procedure to several widely applied econometric methods."}, "https://arxiv.org/abs/2411.10494": {"title": "Efficient inference for differential equation models without numerical solvers", "link": "https://arxiv.org/abs/2411.10494", "description": "arXiv:2411.10494v1 Announce Type: new \nAbstract: Parameter inference is essential when we wish to interpret observational data using mathematical models. Standard inference methods for differential equation models typically rely on obtaining repeated numerical solutions of the differential equation(s). Recent results have explored how numerical truncation error can have major, detrimental, and sometimes hidden impacts on likelihood-based inference by introducing false local maxima into the log-likelihood function. We present a straightforward approach for inference that eliminates the need for solving the underlying differential equations, thereby completely avoiding the impact of truncation error. Open-access Jupyter notebooks, available on GitHub, allow others to implement this method for a broad class of widely-used models to interpret biological data."}, "https://arxiv.org/abs/2411.10600": {"title": "Monetary Incentives, Landowner Preferences: Estimating Cross-Elasticities in Farmland Conversion to Renewable Energy", "link": "https://arxiv.org/abs/2411.10600", "description": "arXiv:2411.10600v1 Announce Type: new \nAbstract: This study examines the impact of monetary factors on the conversion of farmland to renewable energy generation, specifically solar and wind, in the context of expanding U.S. energy production. We propose a new econometric method that accounts for the diverse circumstances of landowners, including their unordered alternative land use options, non-monetary benefits from farming, and the influence of local regulations. We demonstrate that identifying the cross elasticity of landowners' farming income in relation to the conversion of farmland to renewable energy requires an understanding of their preferences. By utilizing county legislation that we assume to be shaped by land-use preferences, we estimate the cross-elasticities of farming income. Our findings indicate that monetary incentives may only influence landowners' decisions in areas with potential for future residential development, underscoring the importance of considering both preferences and regulatory contexts."}, "https://arxiv.org/abs/2411.10615": {"title": "Inference for overparametrized hierarchical Archimedean copulas", "link": "https://arxiv.org/abs/2411.10615", "description": "arXiv:2411.10615v1 Announce Type: new \nAbstract: Hierarchical Archimedean copulas (HACs) are multivariate uniform distributions constructed by nesting Archimedean copulas into one another, and provide a flexible approach to modeling non-exchangeable data. However, this flexibility in the model structure may lead to over-fitting when the model estimation procedure is not performed properly. In this paper, we examine the problem of structure estimation and more generally on the selection of a parsimonious model from the hypothesis testing perspective. Formal tests for structural hypotheses concerning HACs have been lacking so far, most likely due to the restrictions on their associated parameter space which hinders the use of standard inference methodology. 
Building on previously developed asymptotic methods for these non-standard parameter spaces, we provide an asymptotic stochastic representation for the maximum likelihood estimators of (potentially) overparametrized HACs, which we then use to formulate a likelihood ratio test for certain common structural hypotheses. Additionally, we also derive analytical expressions for the first- and second-order partial derivatives of two-level HACs based on Clayton and Gumbel generators, as well as general numerical approximation schemes for the Fisher information matrix."}, "https://arxiv.org/abs/2411.10620": {"title": "Doubly Robust Estimation of Causal Excursion Effects in Micro-Randomized Trials with Missing Longitudinal Outcomes", "link": "https://arxiv.org/abs/2411.10620", "description": "arXiv:2411.10620v1 Announce Type: new \nAbstract: Micro-randomized trials (MRTs) are increasingly utilized for optimizing mobile health interventions, with the causal excursion effect (CEE) as a central quantity for evaluating interventions under policies that deviate from the experimental policy. However, MRT often contains missing data due to reasons such as missed self-reports or participants not wearing sensors, which can bias CEE estimation. In this paper, we propose a two-stage, doubly robust estimator for CEE in MRTs when longitudinal outcomes are missing at random, accommodating continuous, binary, and count outcomes. Our two-stage approach allows for both parametric and nonparametric modeling options for two nuisance parameters: the missingness model and the outcome regression. We demonstrate that our estimator is doubly robust, achieving consistency and asymptotic normality if either the missingness or the outcome regression model is correctly specified. Simulation studies further validate the estimator's desirable finite-sample performance. We apply the method to HeartSteps, an MRT for developing mobile health interventions that promote physical activity."}, "https://arxiv.org/abs/2411.10623": {"title": "Sensitivity Analysis for Observational Studies with Flexible Matched Designs", "link": "https://arxiv.org/abs/2411.10623", "description": "arXiv:2411.10623v1 Announce Type: new \nAbstract: Observational studies provide invaluable opportunities to draw causal inference, but they may suffer from biases due to pretreatment difference between treated and control units. Matching is a popular approach to reduce observed covariate imbalance. To tackle unmeasured confounding, a sensitivity analysis is often conducted to investigate how robust a causal conclusion is to the strength of unmeasured confounding. For matched observational studies, Rosenbaum proposed a sensitivity analysis framework that uses the randomization of treatment assignments as the \"reasoned basis\" and imposes no model assumptions on the potential outcomes as well as their dependence on the observed and unobserved confounding factors. However, this otherwise appealing framework requires exact matching to guarantee its validity, which is hard to achieve in practice. In this paper we provide an alternative inferential framework that shares the same procedure as Rosenbaum's approach but relies on a different justification. 
Our framework allows flexible matching algorithms and utilizes an alternative source of randomness, in particular random permutations of potential outcomes instead of treatment assignments, to guarantee statistical validity."}, "https://arxiv.org/abs/2411.10628": {"title": "Feature Importance of Climate Vulnerability Indicators with Gradient Boosting across Five Global Cities", "link": "https://arxiv.org/abs/2411.10628", "description": "arXiv:2411.10628v1 Announce Type: new \nAbstract: Efforts are needed to identify and measure both communities' exposure to climate hazards and the social vulnerabilities that interact with these hazards, but the science of validating hazard vulnerability indicators is still in its infancy. Progress is needed to improve: 1) the selection of variables that are used as proxies to represent hazard vulnerability; 2) the applicability and scale for which these indicators are intended, including their transnational applicability. We administered an international urban survey in Buenos Aires, Argentina; Johannesburg, South Africa; London, United Kingdom; New York City, United States; and Seoul, South Korea in order to collect data on exposure to various types of extreme weather events, socioeconomic characteristics commonly used as proxies for vulnerability (i.e., income, education level, gender, and age), and additional characteristics not often included in existing composite indices (i.e., queer identity, disability identity, non-dominant primary language, and self-perceptions of both discrimination and vulnerability to flood risk). We then use feature importance analysis with gradient-boosted decision trees to measure the importance that these variables have in predicting exposure to various types of extreme weather events. Our results show that non-traditional variables were more relevant to self-reported exposure to extreme weather events than traditionally employed variables such as income or age. Furthermore, differences in variable relevance across different types of hazards and across urban contexts suggest that vulnerability indicators need to be fit to context and should not be used in a one-size-fits-all fashion."}, "https://arxiv.org/abs/2411.10647": {"title": "False Discovery Control in Multiple Testing: A Brief Overview of Theories and Methodologies", "link": "https://arxiv.org/abs/2411.10647", "description": "arXiv:2411.10647v1 Announce Type: new \nAbstract: As the volume and complexity of data continue to expand across various scientific disciplines, the need for robust methods to account for the multiplicity of comparisons has grown widespread. A popular measure of the type 1 error rate in the multiple testing literature is the false discovery rate (FDR). The FDR provides a powerful and practical approach to large-scale multiple testing and has been successfully used in a wide range of applications. The concept of FDR has gained wide acceptance in the statistical community and various methods have been proposed to control the FDR. In this work, we review the latest developments in FDR control methodologies.
We also develop a conceptual framework to better describe this vast literature; understand its intuition and key ideas; and provide guidance for the researcher interested in both the application and development of the methodology."}, "https://arxiv.org/abs/2411.10648": {"title": "Subsampling-based Tests in Mediation Analysis", "link": "https://arxiv.org/abs/2411.10648", "description": "arXiv:2411.10648v1 Announce Type: new \nAbstract: Testing for a mediation effect poses a challenge since the null hypothesis (i.e., the absence of mediation effects) is composite, making most existing mediation tests quite conservative and often underpowered. In this work, we propose a subsampling-based procedure to construct a test statistic whose asymptotic null distribution is pivotal and remains the same regardless of the three null cases encountered in mediation analysis. The method, when combined with the popular Sobel test, leads to an accurate size control under the null. We further introduce a Cauchy combination test to construct p-values from different subsample splits, which reduces variability in the testing results and increases detection power. Through numerical studies, our approach has demonstrated a more accurate size and higher detection power than the competing classical and contemporary methods."}, "https://arxiv.org/abs/2411.10768": {"title": "Building Interpretable Climate Emulators for Economics", "link": "https://arxiv.org/abs/2411.10768", "description": "arXiv:2411.10768v1 Announce Type: new \nAbstract: This paper presents a framework for developing efficient and interpretable carbon-cycle emulators (CCEs) as part of climate emulators in Integrated Assessment Models, enabling economists to custom-build CCEs accurately calibrated to advanced climate science. We propose a generalized multi-reservoir linear box-model CCE that preserves key physical quantities and can be tailored for specific use cases. Three CCEs are presented for illustration: the 3SR model (replicating DICE-2016), the 4PR model (including the land biosphere), and the 4PR-X model (accounting for dynamic land-use changes like deforestation that impact the reservoir's storage capacity). Evaluation of these models within the DICE framework shows that land-use changes in the 4PR-X model significantly impact atmospheric carbon and temperatures -- emphasizing the importance of using tailored climate emulators. By providing a transparent and flexible tool for policy analysis, our framework allows economists to assess the economic impacts of climate policies more accurately."}, "https://arxiv.org/abs/2411.10801": {"title": "Improving Causal Estimation by Mixing Samples to Address Weak Overlap in Observational Studies", "link": "https://arxiv.org/abs/2411.10801", "description": "arXiv:2411.10801v1 Announce Type: new \nAbstract: Sufficient overlap of propensity scores is one of the most critical assumptions in observational studies. In this article, we examine the severe consequences for statistical inference when this assumption fails under weighting, one of the most dominant causal inference methodologies. Then we propose a simple, yet novel remedy: \"mixing\" the treated and control groups in the observed dataset. We state that our strategy has three key advantages: (1) Improvement in estimators' accuracy, especially under weak overlap, (2) An identical target population for the treatment effect, (3) High flexibility.
We introduce a property of the mixed sample that offers safer inference when applied to both traditional and modern weighting methods. We illustrate this with several extensive simulation studies and guide readers through a real-data analysis for practice."}, "https://arxiv.org/abs/2411.10858": {"title": "Scalable Gaussian Process Regression Via Median Posterior Inference for Estimating Multi-Pollutant Mixture Health Effects", "link": "https://arxiv.org/abs/2411.10858", "description": "arXiv:2411.10858v1 Announce Type: new \nAbstract: Humans are exposed to complex mixtures of environmental pollutants rather than single chemicals, necessitating methods to quantify the health effects of such mixtures. Research on environmental mixtures provides insights into realistic exposure scenarios, informing regulatory policies that better protect public health. However, statistical challenges, including complex correlations among pollutants and nonlinear multivariate exposure-response relationships, complicate such analyses. A popular Bayesian semi-parametric Gaussian process regression framework (Coull et al., 2015) addresses these challenges by modeling exposure-response functions with Gaussian processes and performing feature selection to manage high-dimensional exposures while accounting for confounders. Originally designed for small to moderate-sized cohort studies, this framework does not scale well to massive datasets. To address this, we propose a divide-and-conquer strategy, partitioning data, computing posterior distributions in parallel, and combining results using the generalized median. While we focus on Gaussian process models for environmental mixtures, the proposed distributed computing strategy is broadly applicable to other Bayesian models with computationally prohibitive full-sample Markov Chain Monte Carlo fitting. We provide theoretical guarantees for the convergence of the proposed posterior distributions to those derived from the full sample. We apply this method to estimate associations between a mixture of ambient air pollutants and ~650,000 birthweights recorded in Massachusetts during 2001-2012. Our results reveal negative associations between birthweight and traffic pollution markers, including elemental and organic carbon and PM2.5, and positive associations with ozone and vegetation greenness."}, "https://arxiv.org/abs/2411.10908": {"title": "The Conflict Graph Design: Estimating Causal Effects under Arbitrary Neighborhood Interference", "link": "https://arxiv.org/abs/2411.10908", "description": "arXiv:2411.10908v1 Announce Type: new \nAbstract: A fundamental problem in network experiments is selecting an appropriate experimental design in order to precisely estimate a given causal effect of interest. In fact, optimal rates of estimation remain unknown for essentially all causal effects in network experiments. In this work, we propose a general approach for constructing experiment designs under network interference with the goal of precisely estimating a pre-specified causal effect. A central aspect of our approach is the notion of a conflict graph, which captures the fundamental unobservability associated with the causal effect and the underlying network. We refer to our experimental design as the Conflict Graph Design. In order to estimate effects, we propose a modified Horvitz--Thompson estimator.
We show that its variance under the Conflict Graph Design is bounded as $O(\\lambda(H) / n )$, where $\\lambda(H)$ is the largest eigenvalue of the adjacency matrix of the conflict graph. These rates depend on both the underlying network and the particular causal effect under investigation. Not only does this yield the best known rates of estimation for several well-studied causal effects (e.g. the global and direct effects) but it also provides new methods for effects which have received less attention from the perspective of experiment design (e.g. spill-over effects). Our results corroborate two implicitly understood points in the literature: (1) that in order to increase precision, experiment designs should be tailored to specific causal effects of interest and (2) that \"more local\" effects are easier to estimate than \"more global\" effects. In addition to point estimation, we construct conservative variance estimators which facilitate the construction of asymptotically valid confidence intervals for the causal effect of interest."}, "https://arxiv.org/abs/2411.10959": {"title": "Program Evaluation with Remotely Sensed Outcomes", "link": "https://arxiv.org/abs/2411.10959", "description": "arXiv:2411.10959v1 Announce Type: new \nAbstract: While traditional program evaluations typically rely on surveys to measure outcomes, certain economic outcomes such as living standards or environmental quality may be infeasible or costly to collect. As a result, recent empirical work estimates treatment effects using remotely sensed variables (RSVs), such as mobile phone activity or satellite images, instead of ground-truth outcome measurements. Common practice predicts the economic outcome from the RSV, using an auxiliary sample of labeled RSVs, and then uses such predictions as the outcome in the experiment. We prove that this approach leads to biased estimates of treatment effects when the RSV is a post-outcome variable. We nonparametrically identify the treatment effect, using an assumption that reflects the logic of recent empirical research: the conditional distribution of the RSV remains stable across both samples, given the outcome and treatment. Our results do not require researchers to know or consistently estimate the relationship between the RSV, outcome, and treatment, which is typically mis-specified with unstructured data. We form a representation of the RSV for downstream causal inference by predicting the outcome and predicting the treatment, with better predictions leading to more precise causal estimates. We re-evaluate the efficacy of a large-scale public program in India, showing that the program's measured effects on local consumption and poverty can be replicated using satellite imagery."}, "https://arxiv.org/abs/2411.10971": {"title": "A novel density-based approach for estimating unknown means, distribution visualisations and meta-analyses of quantiles", "link": "https://arxiv.org/abs/2411.10971", "description": "arXiv:2411.10971v1 Announce Type: new \nAbstract: In meta-analysis with continuous outcomes, the use of effect sizes based on the means is the most common. It is often found, however, that only the quantile summary measures are reported in some studies, and in certain scenarios, a meta-analysis of the quantiles themselves is of interest. We propose a novel density-based approach to support the implementation of a comprehensive meta-analysis, when only the quantile summary measures are reported.
The proposed approach uses flexible quantile-based distributions and percentile matching to estimate the unknown parameters without making any prior assumptions about the underlying distributions. Using simulated and real data, we show that the proposed novel density-based approach works as well as or better than the widely-used methods in estimating the means using quantile summaries without assuming a distribution a priori, and provides a novel tool for distribution visualisations. In addition to this, we introduce quantile-based meta-analysis methods for situations where a comparison of quantiles between groups themselves is of interest and found to be more suitable. Using both real and simulated data, we also demonstrate the applicability of these quantile-based methods."}, "https://arxiv.org/abs/2411.10980": {"title": "A joint modeling approach to treatment effects estimation with unmeasured confounders", "link": "https://arxiv.org/abs/2411.10980", "description": "arXiv:2411.10980v1 Announce Type: new \nAbstract: Estimating treatment effects using observational data often relies on the assumption of no unmeasured confounders. However, unmeasured confounding variables may exist in many real-world problems. It can lead to biased estimation if the unmeasured confounding effect is not incorporated. To address this problem, this paper proposes a new mixed-effects joint modeling approach to identifying and estimating the OR function and the PS function in the presence of unmeasured confounders in longitudinal data settings. As a result, we can obtain the estimators of the average treatment effect and heterogeneous treatment effects. In our proposed setting, we allow interaction effects of the treatment and unmeasured confounders on the outcome. Moreover, we propose a new Laplacian-variant EM algorithm to estimate the parameters in the joint models. We apply the method to a real-world application from the CitieS-Health Barcelona Panel Study, in which we study the effect of short-term air pollution exposure on mental health."}, "https://arxiv.org/abs/2411.11058": {"title": "Econometrics and Formalism of Psychological Archetypes of Scientific Workers with Introverted Thinking Type", "link": "https://arxiv.org/abs/2411.11058", "description": "arXiv:2411.11058v1 Announce Type: new \nAbstract: The chronological hierarchy and classification of psychological types of individuals are examined. The anomalous nature of psychological activity in individuals involved in scientific work is highlighted. Certain aspects of the introverted thinking type in scientific activities are analyzed. For the first time, psychological archetypes of scientists with pronounced introversion are postulated in the context of twelve hypotheses about the specifics of professional attributes of introverted scientific activities.\n A linear regression and Bayesian equation are proposed for quantitatively assessing the econometric degree of introversion in scientific employees, considering a wide range of characteristics inherent to introverts in scientific processing.
Specifically, expressions for a comprehensive assessment of introversion in a linear model and the posterior probability of the econometric (scientometric) degree of introversion in a Bayesian model are formulated.\n The models are based on several econometric (scientometric) hypotheses regarding various aspects of professional activities of introverted scientists, such as a preference for solo publications, low social activity, narrow specialization, high research depth, and so forth. Empirical data and multiple linear regression methods can be used to calibrate the equations. The model can be applied to gain a deeper understanding of the psychological characteristics of scientific employees, which is particularly useful in ergonomics and the management of scientific teams and projects. The proposed method also provides scientists with pronounced introversion the opportunity to develop their careers, focusing on individual preferences and features."}, "https://arxiv.org/abs/2411.11248": {"title": "A model-free test of the time-reversibility of climate change processes", "link": "https://arxiv.org/abs/2411.11248", "description": "arXiv:2411.11248v1 Announce Type: new \nAbstract: Time-reversibility is a crucial feature of many time series models, while time-irreversibility is the rule rather than the exception in real-life data. Testing the null hypothesis of time-reversibility, therefore, should be an important step preliminary to the identification and estimation of most traditional time-series models. Existing procedures, however, mostly consist of testing necessary but not sufficient conditions, leading to under-rejection, or sufficient but non-necessary ones, which leads to over-rejection. Moreover, they are generally model-based. In contrast, the copula spectrum studied by Goto et al. ($\\textit{Ann. Statist.}$ 2022, $\\textbf{50}$: 3563--3591) allows for a model-free necessary and sufficient time-reversibility condition. A test based on this copula-spectrum-based characterization has been proposed by the authors. This paper illustrates the performance of this test with an application to the analysis of climatic data."}, "https://arxiv.org/abs/2411.11270": {"title": "Unbiased Approximations for Stationary Distributions of McKean-Vlasov SDEs", "link": "https://arxiv.org/abs/2411.11270", "description": "arXiv:2411.11270v1 Announce Type: new \nAbstract: We consider the development of unbiased estimators, to approximate the stationary distribution of McKean-Vlasov stochastic differential equations (MVSDEs). These are an important class of processes, which frequently appear in applications such as mathematical finance, biology and opinion dynamics. Typically the stationary distribution is unknown and indeed one cannot simulate such processes exactly. As a result one commonly requires a time-discretization scheme which results in a discretization bias and a bias from not being able to simulate the associated stationary distribution. To overcome this bias, we present a new unbiased estimator taking motivation from the literature on unbiased Monte Carlo. We prove the unbiasedness of our estimator, under assumptions. In order to prove this we require developing ergodicity results of various discrete time processes, through an appropriate discretization scheme, towards the invariant measure. Numerous numerical experiments are provided, on a range of MVSDEs, to demonstrate the effectiveness of our unbiased estimator.
Such examples include the Curie-Weiss model, a 3D neuroscience model and a parameter estimation problem."}, "https://arxiv.org/abs/2411.11301": {"title": "Subgroup analysis in multi level hierarchical cluster randomized trials", "link": "https://arxiv.org/abs/2411.11301", "description": "arXiv:2411.11301v1 Announce Type: new \nAbstract: Cluster or group randomized trials (CRTs) are increasingly used for both behavioral and system-level interventions, where entire clusters are randomly assigned to a study condition or intervention. Apart from the assigned cluster-level analysis, investigating whether an intervention has a differential effect for specific subgroups remains an important issue, though it is often considered an afterthought in pivotal clinical trials. Determining such subgroup effects in a CRT is a challenging task due to its inherent nested cluster structure. Motivated by a real-life HIV prevention CRT, we consider a three-level cross-sectional CRT, where randomization is carried out at the highest level and subgroups may exist at different levels of the hierarchy. We employ a linear mixed-effects model to estimate the subgroup-specific effects through their maximum likelihood estimators (MLEs). Consequently, we develop a consistent test for the significance of the differential intervention effect between two subgroups at different levels of the hierarchy, which is the key methodological contribution of this work. We also derive explicit formulae for sample size determination to detect a differential intervention effect between two subgroups, aiming to achieve a given statistical power in the case of a planned confirmatory subgroup analysis. The application of our methodology is illustrated through extensive simulation studies using synthetic data, as well as with real-world data from an HIV prevention CRT in The Bahamas."}, "https://arxiv.org/abs/2411.11498": {"title": "Efficient smoothness selection for nonparametric Markov-switching models via quasi restricted maximum likelihood", "link": "https://arxiv.org/abs/2411.11498", "description": "arXiv:2411.11498v1 Announce Type: new \nAbstract: Markov-switching models are powerful tools that allow capturing complex patterns from time series data driven by latent states. Recent work has highlighted the benefits of estimating components of these models nonparametrically, enhancing their flexibility and reducing biases, which in turn can improve state decoding, forecasting, and overall inference. Formulating such models using penalized splines is straightforward, but practically feasible methods for a data-driven smoothness selection in these models are still lacking. Traditional techniques, such as cross-validation and information criteria-based selection, suffer from major drawbacks, most importantly their reliance on computationally expensive grid search methods, hampering practical usability for Markov-switching models. Michelot (2022) suggested treating spline coefficients as random effects with a multivariate normal distribution and using the R package TMB (Kristensen et al., 2016) for marginal likelihood maximization. While this method avoids grid search and typically results in adequate smoothness selection, it entails a nested optimization problem, thus being computationally demanding. We propose to exploit the simple structure of penalized splines treated as random effects, thereby greatly reducing the computational burden while potentially improving fixed effects parameter estimation accuracy.
Our proposed method offers a reliable and efficient mechanism for smoothness selection, rendering the estimation of Markov-switching models involving penalized splines feasible for complex data structures."}, "https://arxiv.org/abs/2411.11559": {"title": "Treatment Effect Estimators as Weighted Outcomes", "link": "https://arxiv.org/abs/2411.11559", "description": "arXiv:2411.11559v1 Announce Type: new \nAbstract: Estimators weighting observed outcomes to form an effect estimate have a long tradition. The corresponding outcome weights are utilized in established procedures, e.g. to check covariate balancing, to characterize target populations, or to detect and manage extreme weights. This paper provides a general framework to derive the functional form of such weights. It establishes when and how numerical equivalence between an original estimator representation as moment condition and a unique weighted representation can be obtained. The framework is applied to derive novel outcome weights for the leading cases of double machine learning and generalized random forests, with existing results recovered as special cases. The analysis highlights that implementation choices determine (i) the availability of outcome weights and (ii) their properties. Notably, standard implementations of partially linear regression based estimators - like causal forest - implicitly apply outcome weights that do not sum to (minus) one in the (un)treated group as usually considered desirable."}, "https://arxiv.org/abs/2411.11580": {"title": "Metric Oja Depth, New Statistical Tool for Estimating the Most Central Objects", "link": "https://arxiv.org/abs/2411.11580", "description": "arXiv:2411.11580v1 Announce Type: new \nAbstract: The Oja depth (simplicial volume depth) is one of the classical statistical techniques for measuring the central tendency of data in multivariate space. Despite the widespread emergence of object data like images, texts, matrices or graphs, a well-developed and suitable version of Oja depth for object data is lacking. To address this shortcoming, in this study we propose a novel measure of statistical depth, the metric Oja depth applicable to any object data. Then, we develop two competing strategies for optimizing metric depth functions, i.e., finding the deepest objects with respect to them. Finally, we compare the performance of the metric Oja depth with three other depth functions (half-space, lens, and spatial) in diverse data scenarios.\n Keywords: Object Data, Metric Oja depth, Statistical depth, Optimization, Genetic algorithm, Metric statistics"}, "https://arxiv.org/abs/2411.11675": {"title": "Nonparametric Bayesian approach for dynamic borrowing of historical control data", "link": "https://arxiv.org/abs/2411.11675", "description": "arXiv:2411.11675v1 Announce Type: new \nAbstract: When incorporating historical control data into the analysis of current randomized controlled trial data, it is critical to account for differences between the datasets. When the cause of the difference is an unmeasured factor and adjustment for observed covariates only is insufficient, it is desirable to use a dynamic borrowing method that reduces the impact of heterogeneous historical controls. We propose a nonparametric Bayesian approach for borrowing historical controls that are homogeneous with the current control. Additionally, to emphasize the resolution of conflicts between the historical controls and current control, we introduce a method based on the dependent Dirichlet process mixture. 
The proposed methods can be implemented using the same procedure, regardless of whether the outcome data comprise aggregated study-level data or individual participant data. We also develop a novel index of similarity between the historical and current control data, based on the posterior distribution of the parameter of interest. We conduct a simulation study and analyze clinical trial examples to evaluate the performance of the proposed methods compared to existing methods. The proposed method based on the dependent Dirichlet process mixture can more accurately borrow from homogeneous historical controls while reducing the impact of heterogeneous historical controls compared to the typical Dirichlet process mixture. The proposed methods outperform existing methods in scenarios with heterogeneous historical controls, in which the meta-analytic approach is ineffective."}, "https://arxiv.org/abs/2411.11728": {"title": "Davis-Kahan Theorem in the two-to-infinity norm and its application to perfect clustering", "link": "https://arxiv.org/abs/2411.11728", "description": "arXiv:2411.11728v1 Announce Type: new \nAbstract: Many statistical applications, such as the Principal Component Analysis, matrix completion, tensor regression and many others, rely on accurate estimation of leading eigenvectors of a matrix. The Davis-Kahan theorem is known to be instrumental for bounding from above the distances between matrices $U$ and $\\widehat{U}$ of population eigenvectors and their sample versions. While those distances can be measured in various metrics, recent developments have shown the advantages of evaluating the deviation in the two-to-infinity norm. The purpose of this paper is to provide upper bounds for the distances between $U$ and $\\widehat{U}$ in the two-to-infinity norm for a variety of possible scenarios and competitive approaches. Although this problem has been studied by several authors, the difference between this paper and its predecessors is that the upper bounds are obtained with no or mild probabilistic assumptions on the error distributions. Those bounds are subsequently refined, when some generic probabilistic assumptions on the errors hold. In addition, the paper provides alternative methods for evaluation of $\\widehat{U}$ and, therefore, enables one to compare the resulting accuracies. As an example of an application of the results in the paper, we derive sufficient conditions for perfect clustering in a generic setting, and then employ them in various scenarios."}, "https://arxiv.org/abs/2411.11737": {"title": "Randomization-based Z-estimation for evaluating average and individual treatment effects", "link": "https://arxiv.org/abs/2411.11737", "description": "arXiv:2411.11737v1 Announce Type: new \nAbstract: Randomized experiments have been the gold standard for drawing causal inference. The conventional model-based approach has been one of the most popular ways for analyzing treatment effects from randomized experiments, which is often carried out through inference for certain model parameters. In this paper, we provide a systematic investigation of model-based analyses for treatment effects under the randomization-based inference framework. This framework does not impose any distributional assumptions on the outcomes, covariates and their dependence, and utilizes only randomization as the \"reasoned basis\". We first derive the asymptotic theory for Z-estimation in completely randomized experiments, and propose sandwich-type conservative covariance estimation. 
We then apply the developed theory to analyze both average and individual treatment effects in randomized experiments. For the average treatment effect, we consider three estimation strategies: model-based, model-imputed, and model-assisted, where the first two can be sensitive to model misspecification or require specific ways for parameter estimation. The model-assisted approach is robust to arbitrary model misspecification and always provides consistent average treatment effect estimation. We propose optimal ways to conduct model-assisted estimation using generally nonlinear least squares for parameter estimation. For the individual treatment effects, we propose to directly model the relationship between individual effects and covariates, and discuss the model's identifiability, inference and interpretation, allowing for model misspecification."}, "https://arxiv.org/abs/2411.10646": {"title": "Wasserstein Spatial Depth", "link": "https://arxiv.org/abs/2411.10646", "description": "arXiv:2411.10646v1 Announce Type: cross \nAbstract: Modeling observations as random distributions embedded within Wasserstein spaces is becoming increasingly popular across scientific fields, as it captures the variability and geometric structure of the data more effectively. However, the distinct geometry and unique properties of Wasserstein space pose challenges to the application of conventional statistical tools, which are primarily designed for Euclidean spaces. Consequently, adapting and developing new methodologies for analysis within Wasserstein spaces has become essential. The space of distributions on $\\mathbb{R}^d$ with $d>1$ is not linear, and ''mimics'' the geometry of a Riemannian manifold. In this paper, we extend the concept of statistical depth to distribution-valued data, introducing the notion of {\\it Wasserstein spatial depth}. This new measure provides a way to rank and order distributions, enabling the development of order-based clustering techniques and inferential tools. We show that Wasserstein spatial depth (WSD) preserves critical properties of conventional statistical depths, notably, ranging within $[0,1]$, transformation invariance, vanishing at infinity, reaching a maximum at the geometric median, and continuity. Additionally, the population WSD has a straightforward plug-in estimator based on sampled empirical distributions. We establish the estimator's consistency and asymptotic normality. Extensive simulation and real-data application showcase the practical efficacy of WSD."}, "https://arxiv.org/abs/2411.10982": {"title": "Towards a framework on tabular synthetic data generation: a minimalist approach: theory, use cases, and limitations", "link": "https://arxiv.org/abs/2411.10982", "description": "arXiv:2411.10982v1 Announce Type: cross \nAbstract: We propose and study a minimalist approach towards synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with contingent clustering step or log transformation to handle nonlinearity) and XGboost decoder which is SOTA for structured data regression and classification tasks. We study and contrast the methodologies with (variational) autoencoders in several toy low dimensional scenarios to derive necessary intuitions. The framework is applied to high dimensional simulated credit scoring data which parallels real-life financial applications. We applied the method to robustness testing to demonstrate practical use cases. 
The case study result suggests that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, guarantees interpretability all the way through, does not require extra tuning and provides unique benefits."}, "https://arxiv.org/abs/2411.11132": {"title": "Variational Bayesian Bow tie Neural Networks with Shrinkage", "link": "https://arxiv.org/abs/2411.11132", "description": "arXiv:2411.11132v1 Announce Type: cross \nAbstract: Despite the dominant role of deep models in machine learning, limitations persist, including overconfident predictions, susceptibility to adversarial attacks, and underestimation of variability in predictions. The Bayesian paradigm provides a natural framework to overcome such issues and has become the gold standard for uncertainty estimation with deep models, also providing improved accuracy and a framework for tuning critical hyperparameters. However, exact Bayesian inference is challenging, typically involving variational algorithms that impose strong independence and distributional assumptions. Moreover, existing methods are sensitive to the architectural choice of the network. We address these issues by constructing a relaxed version of the standard feed-forward rectified neural network, and employing Polya-Gamma data augmentation tricks to render a conditionally linear and Gaussian model. Additionally, we use sparsity-promoting priors on the weights of the neural network for data-driven architectural design. To approximate the posterior, we derive a variational inference algorithm that avoids distributional assumptions and independence across layers and is a faster alternative to the usual Markov Chain Monte Carlo schemes."}, "https://arxiv.org/abs/2411.11203": {"title": "Debiasing Watermarks for Large Language Models via Maximal Coupling", "link": "https://arxiv.org/abs/2411.11203", "description": "arXiv:2411.11203v1 Announce Type: cross \nAbstract: Watermarking language models is essential for distinguishing between human and machine-generated text and thus maintaining the integrity and trustworthiness of digital communication. We present a novel green/red list watermarking approach that partitions the token set into ``green'' and ``red'' lists, subtly increasing the generation probability for green tokens. To correct token distribution bias, our method employs maximal coupling, using a uniform coin flip to decide whether to apply bias correction, with the result embedded as a pseudorandom watermark signal. Theoretical analysis confirms this approach's unbiased nature and robust detection capabilities. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability, and it demonstrates resilience to targeted modifications aimed at improving text quality. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality."}, "https://arxiv.org/abs/2411.11271": {"title": "Mean Estimation in Banach Spaces Under Infinite Variance and Martingale Dependence", "link": "https://arxiv.org/abs/2411.11271", "description": "arXiv:2411.11271v1 Announce Type: cross \nAbstract: We consider estimating the shared mean of a sequence of heavy-tailed random variables taking values in a Banach space. In particular, we revisit and extend a simple truncation-based mean estimator by Catoni and Giulini. 
While existing truncation-based approaches require a bound on the raw (non-central) second moment of observations, our results hold under a bound on either the central or non-central $p$th moment for some $p > 1$. In particular, our results hold for distributions with infinite variance. The main contributions of the paper follow from exploiting connections between truncation-based mean estimation and the concentration of martingales in 2-smooth Banach spaces. We prove two types of time-uniform bounds on the distance between the estimator and unknown mean: line-crossing inequalities, which can be optimized for a fixed sample size $n$, and non-asymptotic law of the iterated logarithm type inequalities, which match the tightness of line-crossing inequalities at all points in time up to a doubly logarithmic factor in $n$. Our results do not depend on the dimension of the Banach space, hold under martingale dependence, and all constants in the inequalities are known and small."}, "https://arxiv.org/abs/2411.11748": {"title": "Debiased Regression for Root-N-Consistent Conditional Mean Estimation", "link": "https://arxiv.org/abs/2411.11748", "description": "arXiv:2411.11748v1 Announce Type: cross \nAbstract: This study introduces a debiasing method for regression estimators, including high-dimensional and nonparametric regression estimators. For example, nonparametric regression methods allow for the estimation of regression functions in a data-driven manner with minimal assumptions; however, these methods typically fail to achieve $\\sqrt{n}$-consistency in their convergence rates, and many, including those in machine learning, lack guarantees that their estimators asymptotically follow a normal distribution. To address these challenges, we propose a debiasing technique for nonparametric estimators by adding a bias-correction term to the original estimators, extending the conventional one-step estimator used in semiparametric analysis. Specifically, for each data point, we estimate the conditional expected residual of the original nonparametric estimator, which can, for instance, be computed using kernel (Nadaraya-Watson) regression, and incorporate it as a bias-reduction term. Our theoretical analysis demonstrates that the proposed estimator achieves $\\sqrt{n}$-consistency and asymptotic normality under a mild convergence rate condition for both the original nonparametric estimator and the conditional expected residual estimator. Notably, this approach remains model-free as long as the original estimator and the conditional expected residual estimator satisfy the convergence rate condition. The proposed method offers several advantages, including improved estimation accuracy and simplified construction of confidence intervals."}, "https://arxiv.org/abs/2411.11824": {"title": "Theoretical Foundations of Conformal Prediction", "link": "https://arxiv.org/abs/2411.11824", "description": "arXiv:2411.11824v1 Announce Type: cross \nAbstract: This book is about conformal prediction and related inferential techniques that build on permutation tests and exchangeability. These techniques are useful in a diverse array of tasks, including hypothesis testing and providing uncertainty quantification guarantees for machine learning systems. Much of the current interest in conformal prediction is due to its ability to integrate into complex machine learning workflows, solving the problem of forming prediction sets without any assumptions on the form of the data generating distribution. 
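As a concrete illustration of the truncation idea described in the arXiv:2411.11271 abstract above, the sketch below implements one common form of a Catoni-Giulini style estimator in plain NumPy: each observation is shrunk toward zero once its norm exceeds a threshold, and the shrunken observations are averaged. The threshold `lam`, the heavy-tailed t sample, and the dimension are illustrative choices, not taken from the paper, whose threshold depends on the assumed moment bound and confidence level.

```python
import numpy as np

def truncated_mean(X, lam):
    """Truncation-style mean: scale each row X_i by min(1, lam / ||X_i||), then average."""
    norms = np.linalg.norm(X, axis=1)
    scale = np.minimum(1.0, lam / np.maximum(norms, 1e-12))  # avoid division by zero
    return (X * scale[:, None]).mean(axis=0)

# Heavy-tailed example: t-distributed entries with 1.5 degrees of freedom (infinite variance).
rng = np.random.default_rng(1)
X = rng.standard_t(df=1.5, size=(2000, 3))       # true mean is the zero vector
print("truncated:", truncated_mean(X, lam=5.0))
print("raw mean :", X.mean(axis=0))              # typically far noisier under heavy tails
```

The paper's contribution is the time-uniform analysis of such estimators in 2-smooth Banach spaces under martingale dependence; the sketch only conveys the basic shrink-and-average mechanic.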
Since contemporary machine learning algorithms have generally proven difficult to analyze directly, conformal prediction's main appeal is its ability to provide formal, finite-sample guarantees when paired with such methods.\n The goal of this book is to teach the reader about the fundamental technical arguments that arise when researching conformal prediction and related questions in distribution-free inference. Many of these proof strategies, especially the more recent ones, are scattered among research papers, making it difficult for researchers to understand where to look, which results are important, and how exactly the proofs work. We hope to bridge this gap by curating what we believe to be some of the most important results in the literature and presenting their proofs in a unified language, with illustrations, and with an eye towards pedagogy."}, "https://arxiv.org/abs/2310.08115": {"title": "Model-Agnostic Covariate-Assisted Inference on Partially Identified Causal Effects", "link": "https://arxiv.org/abs/2310.08115", "description": "arXiv:2310.08115v2 Announce Type: replace \nAbstract: Many causal estimands are only partially identifiable since they depend on the unobservable joint distribution between potential outcomes. Stratification on pretreatment covariates can yield sharper bounds; however, unless the covariates are discrete with relatively small support, this approach typically requires binning covariates or estimating the conditional distributions of the potential outcomes given the covariates. Binning can result in substantial efficiency loss and become challenging to implement, even with a moderate number of covariates. Estimating conditional distributions, on the other hand, may yield invalid inference if the distributions are inaccurately estimated, such as when a misspecified model is used or when the covariates are high-dimensional. In this paper, we propose a unified and model-agnostic inferential approach for a wide class of partially identified estimands. Our method, based on duality theory for optimal transport problems, has four key properties. First, in randomized experiments, our approach can wrap around any estimates of the conditional distributions and provide uniformly valid inference, even if the initial estimates are arbitrarily inaccurate. A simple extension of our method to observational studies is doubly robust in the usual sense. Second, if nuisance parameters are estimated at semiparametric rates, our estimator is asymptotically unbiased for the sharp partial identification bound. Third, we can apply the multiplier bootstrap to select covariates and models without sacrificing validity, even if the true model is not selected. Finally, our method is computationally efficient. Overall, in three empirical applications, our method consistently reduces the width of estimated identified sets and confidence intervals without making additional structural assumptions."}, "https://arxiv.org/abs/2312.06265": {"title": "Type I Error Rates are Not Usually Inflated", "link": "https://arxiv.org/abs/2312.06265", "description": "arXiv:2312.06265v4 Announce Type: replace \nAbstract: The inflation of Type I error rates is thought to be one of the causes of the replication crisis. Questionable research practices such as p-hacking are thought to inflate Type I error rates above their nominal level, leading to unexpectedly high levels of false positives in the literature and, consequently, unexpectedly low replication rates. In this article, I offer an alternative view. 
I argue that questionable and other research practices do not usually inflate relevant Type I error rates. I begin by introducing the concept of Type I error rates and distinguishing between statistical errors and theoretical errors. I then illustrate my argument with respect to model misspecification, multiple testing, selective inference, forking paths, exploratory analyses, p-hacking, optional stopping, double dipping, and HARKing. In each case, I demonstrate that relevant Type I error rates are not usually inflated above their nominal level, and in the rare cases that they are, the inflation is easily identified and resolved. I conclude that the replication crisis may be explained, at least in part, by researchers' misinterpretation of statistical errors and their underestimation of theoretical errors."}, "https://arxiv.org/abs/2411.12086": {"title": "A Comparison of Zero-Inflated Models for Modern Biomedical Data", "link": "https://arxiv.org/abs/2411.12086", "description": "arXiv:2411.12086v1 Announce Type: new \nAbstract: Many data sets cannot be accurately described by standard probability distributions due to the excess number of zero values present. For example, zero-inflation is prevalent in microbiome data and single-cell RNA sequencing data, which serve as our real data examples. Several models have been proposed to address zero-inflated datasets including the zero-inflated negative binomial, hurdle negative binomial model, and the truncated latent Gaussian copula model. This study aims to compare various models and determine which one performs optimally under different conditions using both simulation studies and real data analyses. We are particularly interested in investigating how dependence among the variables, level of zero-inflation or deflation, and variance of the data affects model selection."}, "https://arxiv.org/abs/2411.12184": {"title": "Testability of Instrumental Variables in Additive Nonlinear, Non-Constant Effects Models", "link": "https://arxiv.org/abs/2411.12184", "description": "arXiv:2411.12184v1 Announce Type: new \nAbstract: We address the issue of the testability of instrumental variables derived from observational data. Most existing testable implications are centered on scenarios where the treatment is a discrete variable, e.g., instrumental inequality (Pearl, 1995), or where the effect is assumed to be constant, e.g., instrumental variables condition based on the principle of independent mechanisms (Burauel, 2023). However, treatments can often be continuous variables, such as drug dosages or nutritional content levels, and non-constant effects may occur in many real-world scenarios. In this paper, we consider an additive nonlinear, non-constant effects model with unmeasured confounders, in which treatments can be either discrete or continuous, and propose an Auxiliary-based Independence Test (AIT) condition to test whether a variable is a valid instrument. We first show that if the candidate instrument is valid, then the AIT condition holds. Moreover, we illustrate the implications of the AIT condition and demonstrate that, in certain conditions, AIT conditions are necessary and sufficient to detect all invalid IVs. We also extend the AIT condition to include covariates and introduce a practical testing algorithm. 
Experimental results on synthetic data and three different real-world datasets show the effectiveness of our proposed condition."}, "https://arxiv.org/abs/2411.12294": {"title": "Adaptive Forward Stepwise Regression", "link": "https://arxiv.org/abs/2411.12294", "description": "arXiv:2411.12294v1 Announce Type: new \nAbstract: This paper proposes a sparse regression method that continuously interpolates between Forward Stepwise selection (FS) and the LASSO. When tuned appropriately, our solutions are much sparser than typical LASSO fits but, unlike FS fits, benefit from the stabilizing effect of shrinkage. Our method, Adaptive Forward Stepwise Regression (AFS), addresses this need for sparser models with shrinkage. We show its connection with boosting via a soft-thresholding viewpoint and demonstrate the ease of adapting the method to classification tasks. In both simulations and real data, our method has lower mean squared error and fewer selected features across multiple settings compared to popular sparse modeling procedures."}, "https://arxiv.org/abs/2411.12367": {"title": "Left-truncated discrete lifespans: The AFiD enterprise panel", "link": "https://arxiv.org/abs/2411.12367", "description": "arXiv:2411.12367v1 Announce Type: new \nAbstract: Our model for the lifespan of an enterprise is the geometric distribution. We do not formulate a model for enterprise foundation, but assume that foundations and lifespans are independent. We aim to fit the model to information about foundation and closure of German enterprises in the AFiD panel. The lifespan for an enterprise that has been founded before the first wave of the panel is either left truncated, when the enterprise is contained in the panel, or missing, when it already closed down before the first wave. Marginalizing the likelihood to that part of the enterprise history after the first wave contributes to the aim of a closed-form estimate and standard error. Invariance under the foundation distribution is achieved by conditioning on observability of the enterprises. The conditional marginal likelihood can be written as a function of a martingale. The latter arises when calculating the compensator, with respect to some filtration, of a process that counts the closures. The estimator itself can then also be written as a martingale transform and consistency as well as asymptotic normality are easily proven. The life expectancy of German enterprises, estimated from the demographic information about 1.4 million enterprises for the years 2018 and 2019, is ten years. The width of the confidence interval is two months. Closure after the last wave is taken into account as right-censored."}, "https://arxiv.org/abs/2411.12407": {"title": "Bayesian multilevel compositional data analysis with the R package multilevelcoda", "link": "https://arxiv.org/abs/2411.12407", "description": "arXiv:2411.12407v1 Announce Type: new \nAbstract: Multilevel compositional data, such as data sampled over time that are non-negative and sum to a constant value, are common in various fields. However, there is currently no software specifically built to model compositional data in a multilevel framework. The R package multilevelcoda implements a collection of tools for modelling compositional data in a Bayesian multivariate, multilevel pipeline. The user-friendly setup only requires the data, model formula, and minimal specification of the analysis. 
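To make the "boosting via soft-thresholding" connection in the arXiv:2411.12294 abstract above more tangible, here is a generic forward-stagewise sketch in which the selected coordinate is updated by a soft-thresholded gradient step. This is a schematic interpolation between forward stepwise selection and lasso-style shrinkage, not the authors' AFS algorithm; the step size, threshold `lam`, and stopping rule are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def forward_stagewise_soft(X, y, lam=0.05, n_steps=500, step=0.5):
    """Greedy coordinate updates with soft-thresholding.
    Assumes the columns of X are standardized and y is centered."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ resid / n                 # per-coordinate gradient of squared loss
        j = int(np.argmax(np.abs(corr)))       # forward-stepwise style selection
        delta = step * soft_threshold(corr[j], lam)
        if delta == 0.0:                       # no coordinate clears the threshold
            break
        beta[j] += delta
        resid -= delta * X[:, j]
    return beta
```

With lam set to zero this reduces to an incremental forward-stagewise fit; larger lam produces sparser, more heavily shrunken coefficients, which is the kind of interpolation the abstract alludes to.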
This paper outlines the statistical theory underlying the Bayesian compositional multilevel modelling approach and details the implementation of the functions available in multilevelcoda, using an example dataset of compositional daily sleep-wake behaviours. This innovative method can be used to gain robust answers to scientific questions using the increasingly available multilevel compositional data from intensive, longitudinal studies."}, "https://arxiv.org/abs/2411.12423": {"title": "Nonstationary functional time series forecasting", "link": "https://arxiv.org/abs/2411.12423", "description": "arXiv:2411.12423v1 Announce Type: new \nAbstract: We propose a nonstationary functional time series forecasting method with an application to age-specific mortality rates observed over the years. The method begins by taking the first-order differencing and estimates its long-run covariance function. Through eigen-decomposition, we obtain a set of estimated functional principal components and their associated scores for the differenced series. These components allow us to reconstruct the original functional data and compute the residuals. To model the temporal patterns in the residuals, we again perform dynamic functional principal component analysis and extract its estimated principal components and the associated scores for the residuals. As a byproduct, we introduce a geometrically decaying weighted approach to assign higher weights to the most recent data than those from the distant past. Using the Swedish age-specific mortality rates from 1751 to 2022, we demonstrate that the weighted dynamic functional factor model can produce more accurate point and interval forecasts, particularly for male series exhibiting higher volatility."}, "https://arxiv.org/abs/2411.12477": {"title": "Robust Bayesian causal estimation for causal inference in medical diagnosis", "link": "https://arxiv.org/abs/2411.12477", "description": "arXiv:2411.12477v1 Announce Type: new \nAbstract: Causal effect estimation is a critical task in statistical learning that aims to find the causal effect on subjects by identifying causal links between a number of predictor (or, explanatory) variables and the outcome of a treatment. In a regressional framework, we assign a treatment and outcome model to estimate the average causal effect. Additionally, for high dimensional regression problems, variable selection methods are also used to find a subset of predictor variables that maximises the predictive performance of the underlying model for better estimation of the causal effect. In this paper, we propose a different approach. We focus on the variable selection aspects of high dimensional causal estimation problem. We suggest a cautious Bayesian group LASSO (least absolute shrinkage and selection operator) framework for variable selection using prior sensitivity analysis. We argue that in some cases, abstaining from selecting (or, rejecting) a predictor is beneficial and we should gather more information to obtain a more decisive result. We also show that for problems with very limited information, expert elicited variable selection can give us a more stable causal effect estimation as it avoids overfitting. 
Lastly, we carry out a comparative study with a synthetic dataset and show the applicability of our method in real-life situations."}, "https://arxiv.org/abs/2411.12479": {"title": "Graph-based Square-Root Estimation for Sparse Linear Regression", "link": "https://arxiv.org/abs/2411.12479", "description": "arXiv:2411.12479v1 Announce Type: new \nAbstract: Sparse linear regression is one of the classic problems in the field of statistics, which has deep connections and high intersections with optimization, computation, and machine learning. To address the effective handling of high-dimensional data, the diversity of real noise, and the challenges in estimating the standard deviation of the noise, we propose a novel and general graph-based square-root estimation (GSRE) model for sparse linear regression. Specifically, we use a square-root loss function to encourage the estimators to be independent of the unknown standard deviation of the error terms and design a sparse regularization term by using the graphical structure among predictors in a node-by-node form. Based on the predictor graphs with special structure, we highlight its generality by showing that the model in this paper is equivalent to several classic regression models. Theoretically, we also analyze the finite sample bounds, asymptotic normality and model selection consistency of GSRE method without relying on the standard deviation of error terms. In terms of computation, we employ the fast and efficient alternating direction method of multipliers. Finally, based on a large number of simulated and real data with various types of noise, we demonstrate the performance advantages of the proposed method in estimation, prediction and model selection."}, "https://arxiv.org/abs/2411.12555": {"title": "Multivariate and Online Transfer Learning with Uncertainty Quantification", "link": "https://arxiv.org/abs/2411.12555", "description": "arXiv:2411.12555v1 Announce Type: new \nAbstract: Untreated periodontitis causes inflammation within the supporting tissue of the teeth and can ultimately lead to tooth loss. Modeling periodontal outcomes is beneficial as they are difficult and time consuming to measure, but disparities in representation between demographic groups must be considered. There may not be enough participants to build group specific models and it can be ineffective, and even dangerous, to apply a model to participants in an underrepresented group if demographic differences were not considered during training. We propose an extension to the RECaST Bayesian transfer learning framework. Our method jointly models multivariate outcomes, exhibiting significant improvement over the previous univariate RECaST method. Further, we introduce an online approach to model sequential data sets. Negative transfer is mitigated to ensure that the information shared from the other demographic groups does not negatively impact the modeling of the underrepresented participants. The Bayesian framework naturally provides uncertainty quantification on predictions. Especially important in medical applications, our method does not share data between domains. 
We demonstrate the effectiveness of our method in both predictive performance and uncertainty quantification on simulated data and on a database of dental records from the HealthPartners Institute."}, "https://arxiv.org/abs/2411.12578": {"title": "Robust Inference for High-dimensional Linear Models with Heavy-tailed Errors via Partial Gini Covariance", "link": "https://arxiv.org/abs/2411.12578", "description": "arXiv:2411.12578v1 Announce Type: new \nAbstract: This paper introduces the partial Gini covariance, a novel dependence measure that addresses the challenges of high-dimensional inference with heavy-tailed errors, often encountered in fields like finance, insurance, climate, and biology. Conventional high-dimensional regression inference methods suffer from inaccurate type I errors and reduced power in heavy-tailed contexts, limiting their effectiveness. Our proposed approach leverages the partial Gini covariance to construct a robust statistical inference framework that requires minimal tuning and does not impose restrictive moment conditions on error distributions. Unlike traditional methods, it circumvents the need for estimating the density of random errors and enhances the computational feasibility and robustness. Extensive simulations demonstrate the proposed method's superior power and robustness over standard high-dimensional inference approaches, such as those based on the debiased Lasso. The asymptotic relative efficiency analysis provides additional theoretical insight on the improved efficiency of the new approach in the heavy-tailed setting. Additionally, the partial Gini covariance extends to the multivariate setting, enabling chi-square testing for a group of coefficients. We illustrate the method's practical application with a real-world data example."}, "https://arxiv.org/abs/2411.12585": {"title": "Semiparametric quantile functional regression analysis of adolescent physical activity distributions in the presence of missing data", "link": "https://arxiv.org/abs/2411.12585", "description": "arXiv:2411.12585v1 Announce Type: new \nAbstract: In the age of digital healthcare, passively collected physical activity profiles from wearable sensors are a preeminent tool for evaluating health outcomes. In order to fully leverage the vast amounts of data collected through wearable accelerometers, we propose to use quantile functional regression to model activity profiles as distributional outcomes through quantile responses, which can be used to evaluate activity level differences across covariates based on any desired distributional summary. Our proposed framework addresses two key problems not handled in existing distributional regression literature. First, we use spline mixed model formulations in the basis space to model nonparametric effects of continuous predictors on the distributional response. Second, we address the underlying missingness problem that is common in these types of wearable data but typically not addressed. We show that the missingness can induce bias in the subject-specific distributional summaries that leads to biased distributional regression estimates and even bias the frequently used scalar summary measures, and introduce a nonparametric function-on-function modeling approach that adjusts for each subject's missingness profile to address this problem. 
We evaluate our nonparametric modeling and missing data adjustment using simulation studies based on realistically simulated activity profiles and use it to gain insights into adolescent activity profiles from the Teen Environment and Neighborhood study."}, "https://arxiv.org/abs/2411.11847": {"title": "Modelling financial returns with mixtures of generalized normal distributions", "link": "https://arxiv.org/abs/2411.11847", "description": "arXiv:2411.11847v1 Announce Type: cross \nAbstract: This PhD Thesis presents an investigation into the analysis of financial returns using mixture models, focusing on mixtures of generalized normal distributions (MGND) and their extensions. The study addresses several critical issues encountered in the estimation process and proposes innovative solutions to enhance accuracy and efficiency. In Chapter 2, the focus lies on the MGND model and its estimation via expectation conditional maximization (ECM) and generalized expectation maximization (GEM) algorithms. A thorough exploration reveals a degeneracy issue when estimating the shape parameter. Several algorithms are proposed to overcome this critical issue. Chapter 3 extends the theoretical perspective by applying the MGND model on several stock market indices. A two-step approach is proposed for identifying turmoil days and estimating returns and volatility. Chapter 4 introduces constrained mixture of generalized normal distributions (CMGND), enhancing interpretability and efficiency by imposing constraints on parameters. Simulation results highlight the benefits of constrained parameter estimation. Finally, Chapter 5 introduces generalized normal distribution-hidden Markov models (GND-HMMs) able to capture the dynamic nature of financial returns. This manuscript contributes to the statistical modelling of financial returns by offering flexible, parsimonious, and interpretable frameworks. The proposed mixture models capture complex patterns in financial data, thereby facilitating more informed decision-making in financial analysis and risk management."}, "https://arxiv.org/abs/2411.12036": {"title": "Prediction-Guided Active Experiments", "link": "https://arxiv.org/abs/2411.12036", "description": "arXiv:2411.12036v1 Announce Type: cross \nAbstract: In this work, we introduce a new framework for active experimentation, the Prediction-Guided Active Experiment (PGAE), which leverages predictions from an existing machine learning model to guide sampling and experimentation. Specifically, at each time step, an experimental unit is sampled according to a designated sampling distribution, and the actual outcome is observed based on an experimental probability. Otherwise, only a prediction for the outcome is available. We begin by analyzing the non-adaptive case, where full information on the joint distribution of the predictor and the actual outcome is assumed. For this scenario, we derive an optimal experimentation strategy by minimizing the semi-parametric efficiency bound for the class of regular estimators. We then introduce an estimator that meets this efficiency bound, achieving asymptotic optimality. Next, we move to the adaptive case, where the predictor is continuously updated with newly sampled data. We show that the adaptive version of the estimator remains efficient and attains the same semi-parametric bound under certain regularity assumptions. 
Finally, we validate PGAE's performance through simulations and a semi-synthetic experiment using data from the US Census Bureau. The results underscore the PGAE framework's effectiveness and superiority compared to other existing methods."}, "https://arxiv.org/abs/2411.12119": {"title": "Asymptotics in Multiple Hypotheses Testing under Dependence: beyond Normality", "link": "https://arxiv.org/abs/2411.12119", "description": "arXiv:2411.12119v1 Announce Type: cross \nAbstract: Correlated observations are ubiquitous phenomena in a plethora of scientific avenues. Tackling this dependence among test statistics has been one of the pertinent problems in simultaneous inference. However, very little literature exists that elucidates the effect of correlation on different testing procedures under general distributional assumptions. In this work, we address this gap in a unified way by considering the multiple testing problem under a general correlated framework. We establish an upper bound on the family-wise error rate (FWER) of Bonferroni's procedure for equicorrelated test statistics. Consequently, we find that for a quite general class of distributions, Bonferroni FWER asymptotically tends to zero when the number of hypotheses approaches infinity. We extend this result to general positively correlated elliptically contoured setups. We also present examples of distributions for which Bonferroni FWER has a strictly positive limit under equicorrelation. We extend the limiting zero results to the class of step-down procedures under quite general correlated setups. Specifically, the probability of rejecting at least one hypothesis approaches zero asymptotically for any step-down procedure. The results obtained in this work generalize existing results for correlated Normal test statistics and facilitate new insights into the performances of multiple testing procedures under dependence."}, "https://arxiv.org/abs/2411.12258": {"title": "E-STGCN: Extreme Spatiotemporal Graph Convolutional Networks for Air Quality Forecasting", "link": "https://arxiv.org/abs/2411.12258", "description": "arXiv:2411.12258v1 Announce Type: cross \nAbstract: Modeling and forecasting air quality plays a crucial role in informed air pollution management and protecting public health. The air quality data of a region, collected through various pollution monitoring stations, display nonlinearity, nonstationarity, and a highly dynamic nature, and exhibit intense stochastic spatiotemporal correlation. Geometric deep learning models such as Spatiotemporal Graph Convolutional Networks (STGCN) can capture spatial dependence while forecasting temporal time series data for different sensor locations. Another key characteristic often ignored by these models is the presence of extreme observations in the air pollutant levels for severely polluted cities worldwide. Extreme value theory is a commonly used statistical method to predict the expected number of violations of the National Ambient Air Quality Standards for air pollutant concentration levels. This study develops an extreme value theory-based STGCN model (E-STGCN) for air pollution data to incorporate extreme behavior across pollutant concentrations. Along with spatial and temporal components, E-STGCN uses generalized Pareto distribution to investigate the extreme behavior of different air pollutants and incorporate it inside graph convolutional networks. The proposal is then applied to analyze air pollution data (PM2.5, PM10, and NO2) of 37 monitoring stations across Delhi, India. 
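The equicorrelated-Bonferroni phenomenon described in the arXiv:2411.12119 abstract above is easy to probe numerically. The sketch below simulates equicorrelated standard normal test statistics under the global null via a one-factor construction and estimates the Bonferroni family-wise error rate; the correlation value, replication count, and grid of hypothesis counts are arbitrary illustration choices, and the Monte Carlo output is only suggestive of the asymptotic result proved in the paper.

```python
import numpy as np
from scipy.stats import norm

def bonferroni_fwer_equicorr(m, rho=0.5, alpha=0.05, reps=5000, seed=0):
    """Monte Carlo FWER of Bonferroni for m equicorrelated N(0,1) statistics under the global null."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((reps, 1))               # shared factor inducing equicorrelation rho
    e = rng.standard_normal((reps, m))               # idiosyncratic noise
    z = np.sqrt(rho) * w + np.sqrt(1.0 - rho) * e    # equicorrelated test statistics
    p = 2.0 * norm.sf(np.abs(z))                     # two-sided p-values
    return float(np.mean((p < alpha / m).any(axis=1)))  # P(at least one rejection)

for m in (10, 100, 1000):
    print(m, bonferroni_fwer_equicorr(m))
```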
The forecasting performance for different test horizons is evaluated compared to benchmark forecasters (both temporal and spatiotemporal). It was found that E-STGCN has consistent performance across all the seasons in Delhi, India, and the robustness of our results has also been evaluated empirically. Moreover, combined with conformal prediction, E-STGCN can also produce probabilistic prediction intervals."}, "https://arxiv.org/abs/2411.12623": {"title": "Random signed measures", "link": "https://arxiv.org/abs/2411.12623", "description": "arXiv:2411.12623v1 Announce Type: cross \nAbstract: Point processes and, more generally, random measures are ubiquitous in modern statistics. However, they can only take positive values, which is a severe limitation in many situations. In this work, we introduce and study random signed measures, also known as real-valued random measures, and apply them to construct various Bayesian non-parametric models. In particular, we provide an existence result for random signed measures, allowing us to obtain a canonical definition for them and solve a 70-year-old open problem. Further, we provide a representation of completely random signed measures (CRSMs), which extends the celebrated Kingman's representation for completely random measures (CRMs) to the real-valued case. We then introduce specific classes of random signed measures, including the Skellam point process, which plays the role of the Poisson point process in the real-valued case, and the Gaussian random measure. We use the theoretical results to develop two Bayesian nonparametric models -- one for topic modeling and the other for random graphs -- and to investigate mean function estimation in Bayesian nonparametric regression."}, "https://arxiv.org/abs/2411.12674": {"title": "OrigamiPlot: An R Package and Shiny Web App Enhanced Visualizations for Multivariate Data", "link": "https://arxiv.org/abs/2411.12674", "description": "arXiv:2411.12674v1 Announce Type: cross \nAbstract: We introduce OrigamiPlot, an open-source R package and Shiny web application designed to enhance the visualization of multivariate data. This package implements the origami plot, a novel visualization technique proposed by Duan et al. in 2023, which improves upon traditional radar charts by ensuring that the area of the connected region is invariant to the ordering of attributes, addressing a key limitation of radar charts. The software facilitates multivariate decision-making by supporting comparisons across multiple objects and attributes, offering customizable features such as auxiliary axes and weighted attributes for enhanced clarity. Through the R package and user-friendly Shiny interface, researchers can efficiently create and customize plots without requiring extensive programming knowledge. Demonstrated using network meta-analysis as a real-world example, OrigamiPlot proves to be a versatile tool for visualizing multivariate data across various fields. 
This package opens new opportunities for simplifying decision-making processes with complex data."}, "https://arxiv.org/abs/2009.00085": {"title": "Identification of Semiparametric Panel Multinomial Choice Models with Infinite-Dimensional Fixed Effects", "link": "https://arxiv.org/abs/2009.00085", "description": "arXiv:2009.00085v2 Announce Type: replace \nAbstract: This paper proposes a robust method for semiparametric identification and estimation in panel multinomial choice models, where we allow for infinite-dimensional fixed effects that enter into consumer utilities in an additively nonseparable way, thus incorporating rich forms of unobserved heterogeneity. Our identification strategy exploits multivariate monotonicity in parametric indexes, and uses the logical contraposition of an intertemporal inequality on choice probabilities to obtain identifying restrictions. We provide a consistent estimation procedure, and demonstrate the practical advantages of our method with Monte Carlo simulations and an empirical illustration on popcorn sales with the Nielsen data."}, "https://arxiv.org/abs/2110.00982": {"title": "Identification and Estimation in a Time-Varying Endogenous Random Coefficient Panel Data Model", "link": "https://arxiv.org/abs/2110.00982", "description": "arXiv:2110.00982v2 Announce Type: replace \nAbstract: This paper proposes a correlated random coefficient linear panel data model, where regressors can be correlated with time-varying and individual-specific random coefficients through both a fixed effect and a time-varying random shock. I develop a new panel data-based identification method to identify the average partial effect and the local average response function. The identification strategy employs a sufficient statistic to control for the fixed effect and a conditional control variable for the random shock. Conditional on these two controls, the residual variation in the regressors is driven solely by the exogenous instrumental variables, and thus can be exploited to identify the parameters of interest. The constructive identification analysis leads to three-step series estimators, for which I establish rates of convergence and asymptotic normality. To illustrate the method, I estimate a heterogeneous Cobb-Douglas production function for manufacturing firms in China, finding substantial variations in output elasticities across firms."}, "https://arxiv.org/abs/2310.20460": {"title": "Aggregating Dependent Signals with Heavy-Tailed Combination Tests", "link": "https://arxiv.org/abs/2310.20460", "description": "arXiv:2310.20460v2 Announce Type: replace \nAbstract: Combining dependent p-values to evaluate the global null hypothesis presents a longstanding challenge in statistical inference, particularly when aggregating results from diverse methods to boost signal detection. P-value combination tests using heavy-tailed distribution based transformations, such as the Cauchy combination test and the harmonic mean p-value, have recently garnered significant interest for their potential to efficiently handle arbitrary p-value dependencies. Despite their growing popularity in practical applications, there is a gap in comprehensive theoretical and empirical evaluations of these methods. 
This paper conducts an extensive investigation, revealing that, theoretically, while these combination tests are asymptotically valid for pairwise quasi-asymptotically independent test statistics, such as bivariate normal variables, they are also asymptotically equivalent to the Bonferroni test under the same conditions. However, extensive simulations unveil their practical utility, especially in scenarios where stringent type-I error control is not necessary and signals are dense. Both the heaviness of the distribution and its support substantially impact the tests' non-asymptotic validity and power, and we recommend using a truncated Cauchy distribution in practice. Moreover, we show that under the violation of quasi-asymptotic independence among test statistics, these tests remain valid and, in fact, can be considerably less conservative than the Bonferroni test. We also present two case studies in genetics and genomics, showcasing the potential of the combination tests to significantly enhance statistical power while effectively controlling type-I errors."}, "https://arxiv.org/abs/2305.01277": {"title": "Zero-Truncated Modelling in a Meta-Analysis on Suicide Data after Bariatric Surgery", "link": "https://arxiv.org/abs/2305.01277", "description": "arXiv:2305.01277v2 Announce Type: replace-cross \nAbstract: Meta-analysis is a well-established method for integrating results from several independent studies to estimate a common quantity of interest. However, meta-analysis is prone to selection bias, notably when particular studies are systematically excluded. This can lead to bias in estimating the quantity of interest. Motivated by a meta-analysis to estimate the rate of completed-suicide after bariatric surgery, where studies which reported no suicides were excluded, a novel zero-truncated count modelling approach was developed. This approach addresses heterogeneity, both observed and unobserved, through covariate and overdispersion modelling, respectively. Additionally, through the Horvitz-Thompson estimator, an approach is developed to estimate the number of excluded studies, a quantity of potential interest for researchers. Uncertainty quantification for both estimation of suicide rates and number of excluded studies is achieved through a parametric bootstrapping approach."}, "https://arxiv.org/abs/2411.12871": {"title": "Modelling Directed Networks with Reciprocity", "link": "https://arxiv.org/abs/2411.12871", "description": "arXiv:2411.12871v1 Announce Type: new \nAbstract: Asymmetric relational data is increasingly prevalent across diverse fields, underscoring the need for directed network models to address the complex challenges posed by their unique structures. Unlike undirected models, directed models can capture reciprocity, the tendency of nodes to form mutual links. In this work, we address a fundamental question: what is the effective sample size for modeling reciprocity? We examine this by analyzing the Bernoulli model with reciprocity, allowing for varying sparsity levels between non-reciprocal and reciprocal effects. We then extend this framework to a model that incorporates node-specific heterogeneity and link-specific reciprocity using covariates. Our findings reveal intriguing interplays between non-reciprocal and reciprocal effects in sparse networks. 
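For readers who want to experiment with the heavy-tailed combination tests discussed in the arXiv:2310.20460 abstract above, the following is a minimal sketch of the standard (untruncated) Cauchy combination statistic: each p-value is mapped through the transform tan((0.5 - p)π), the transformed values are averaged with weights, and the global p-value is read off the standard Cauchy tail. The truncated variant recommended in the paper, and any special handling of p-values numerically equal to 0 or 1, are not implemented here.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(pvals, weights=None):
    """Combine possibly dependent p-values via the Cauchy combination statistic."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.size, 1.0 / p.size) if weights is None else np.asarray(weights, float) / np.sum(weights)
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # heavy-tailed transform of each p-value
    return float(cauchy.sf(t))                  # global p-value from the standard Cauchy tail

print(cauchy_combination([0.01, 0.20, 0.60, 0.03]))
```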
We propose a straightforward inference procedure based on maximum likelihood estimation that operates without prior knowledge of sparsity levels, whether covariates are included or not."}, "https://arxiv.org/abs/2411.12889": {"title": "Goodness-of-fit tests for generalized Poisson distributions", "link": "https://arxiv.org/abs/2411.12889", "description": "arXiv:2411.12889v1 Announce Type: new \nAbstract: This paper presents and examines computationally convenient goodness-of-fit tests for the family of generalized Poisson distributions, which encompasses notable distributions such as the Compound Poisson and the Katz distributions. The tests are consistent against fixed alternatives and their null distribution can be consistently approximated by a parametric bootstrap. The goodness of the bootstrap estimator and the power for finite sample sizes are numerically assessed through an extensive simulation experiment, including comparisons with other tests. In many cases, the novel tests either outperform or match the performance of existing ones. Real data applications are considered for illustrative purposes."}, "https://arxiv.org/abs/2411.12944": {"title": "From Estimands to Robust Inference of Treatment Effects in Platform Trials", "link": "https://arxiv.org/abs/2411.12944", "description": "arXiv:2411.12944v1 Announce Type: new \nAbstract: A platform trial is an innovative clinical trial design that uses a master protocol (i.e., one overarching protocol) to evaluate multiple treatments in an ongoing manner and can accelerate the evaluation of new treatments. However, the flexibility that marks the potential of platform trials also creates inferential challenges. Two key challenges are the precise definition of treatment effects and the robust and efficient inference on these effects. To address these challenges, we first define a clinically meaningful estimand that characterizes the treatment effect as a function of the expected outcomes under two given treatments among concurrently eligible patients. Then, we develop weighting and post-stratification methods for estimation of treatment effects with minimal assumptions. To fully leverage the efficiency potential of data from concurrently eligible patients, we also consider a model-assisted approach for baseline covariate adjustment to gain efficiency while maintaining robustness against model misspecification. We derive and compare asymptotic distributions of proposed estimators in theory and propose robust variance estimators. The proposed estimators are empirically evaluated in a simulation study and illustrated using the SIMPLIFY trial. Our methods are implemented in the R package RobinCID."}, "https://arxiv.org/abs/2411.13131": {"title": "Bayesian Parameter Estimation of Normal Distribution from Sample Mean and Extreme Values", "link": "https://arxiv.org/abs/2411.13131", "description": "arXiv:2411.13131v1 Announce Type: new \nAbstract: This paper proposes a Bayesian method for estimating the parameters of a normal distribution when only limited summary statistics (sample mean, minimum, maximum, and sample size) are available. To estimate these parameters, we introduce a data augmentation approach using the Gibbs sampler, where intermediate values are treated as missing values and sampled from a truncated normal distribution conditional on the observed sample mean, minimum, and maximum values. 
Through simulation studies, we demonstrate that our method achieves estimation accuracy comparable to theoretical expectations."}, "https://arxiv.org/abs/2411.13372": {"title": "Clustering with Potential Multidimensionality: Inference and Practice", "link": "https://arxiv.org/abs/2411.13372", "description": "arXiv:2411.13372v1 Announce Type: new \nAbstract: We show how clustering standard errors in one or more dimensions can be justified in M-estimation when there is sampling or assignment uncertainty. Since existing procedures for variance estimation are either conservative or invalid, we propose a variance estimator that refines a conservative procedure and remains valid. We then interpret environments where clustering is frequently employed in empirical work from our design-based perspective and provide insights on their estimands and inference procedures."}, "https://arxiv.org/abs/2411.13432": {"title": "Spatial error models with heteroskedastic normal perturbations and joint modeling of mean and variance", "link": "https://arxiv.org/abs/2411.13432", "description": "arXiv:2411.13432v1 Announce Type: new \nAbstract: This work presents the spatial error model with heteroskedasticity, which allows the joint modeling of the parameters associated with both the mean and the variance, within a traditional approach to spatial econometrics. The estimation algorithm is based on the log-likelihood function and incorporates the use of GAMLSS models in an iterative form. Two theoretical results show the advantages of the model to the usual models of spatial econometrics and allow obtaining the bias of weighted least squares estimators. The proposed methodology is tested through simulations, showing notable results in terms of the ability to recover all parameters and the consistency of its estimates. Finally, this model is applied to identify the factors associated with school desertion in Colombia."}, "https://arxiv.org/abs/2411.13542": {"title": "The R\\'enyi Outlier Test", "link": "https://arxiv.org/abs/2411.13542", "description": "arXiv:2411.13542v1 Announce Type: new \nAbstract: Cox and Kartsonaki proposed a simple outlier test for a vector of p-values based on the R\\'enyi transformation that is fast for large $p$ and numerically stable for very small p-values -- key properties for large data analysis. We propose and implement a generalization of this procedure we call the R\\'enyi Outlier Test (ROT). This procedure maintains the key properties of the original but is much more robust to uncertainty in the number of outliers expected a priori among the p-values. The ROT can also account for two types of prior information that are common in modern data analysis. The first is the prior probability that a given p-value may be outlying. The second is an estimate of how far of an outlier a p-value might be, conditional on it being an outlier; in other words, an estimate of effect size. Using a series of pre-calculated spline functions, we provide a fast and numerically stable implementation of the ROT in our R package renyi."}, "https://arxiv.org/abs/2411.12845": {"title": "Underlying Core Inflation with Multiple Regimes", "link": "https://arxiv.org/abs/2411.12845", "description": "arXiv:2411.12845v1 Announce Type: cross \nAbstract: This paper introduces a new approach for estimating core inflation indicators based on common factors across a broad range of price indices. 
Specifically, by utilizing procedures for detecting multiple regimes in high-dimensional factor models, we propose two types of core inflation indicators: one incorporating multiple structural breaks and another based on Markov switching. The structural breaks approach can eliminate revisions for past regimes, though it functions as an offline indicator, as real-time detection of breaks is not feasible with this method. On the other hand, the Markov switching approach can reduce revisions while being useful in real time, making it a simple and robust core inflation indicator suitable for real-time monitoring and as a short-term guide for monetary policy. Additionally, this approach allows us to estimate the probability of being in different inflationary regimes. To demonstrate the effectiveness of these indicators, we apply them to Canadian price data. To compare the real-time performance of the Markov switching approach to the benchmark model without regime-switching, we assess their abilities to forecast headline inflation and minimize revisions. We find that the Markov switching model delivers superior predictive accuracy and significantly reduces revisions during periods of substantial inflation changes. Hence, our findings suggest that accounting for time-varying factors and parameters enhances inflation signal accuracy and reduces data requirements, especially following sudden economic shifts."}, "https://arxiv.org/abs/2411.12965": {"title": "On adaptivity and minimax optimality of two-sided nearest neighbors", "link": "https://arxiv.org/abs/2411.12965", "description": "arXiv:2411.12965v1 Announce Type: cross \nAbstract: Nearest neighbor (NN) algorithms have been extensively used for missing data problems in recommender systems and sequential decision-making systems. Prior theoretical analysis has established favorable guarantees for NN when the underlying data is sufficiently smooth and the missingness probabilities are lower bounded. Here we analyze NN with non-smooth non-linear functions with vast amounts of missingness. In particular, we consider matrix completion settings where the entries of the underlying matrix follow a latent non-linear factor model, with the non-linearity belonging to a \\Holder function class that is less smooth than Lipschitz. Our results establish following favorable properties for a suitable two-sided NN: (1) The mean squared error (MSE) of NN adapts to the smoothness of the non-linearity, (2) under certain regularity conditions, the NN error rate matches the rate obtained by an oracle equipped with the knowledge of both the row and column latent factors, and finally (3) NN's MSE is non-trivial for a wide range of settings even when several matrix entries might be missing deterministically. We support our theoretical findings via extensive numerical simulations and a case study with data from a mobile health study, HeartSteps."}, "https://arxiv.org/abs/2411.13080": {"title": "Distribution-free Measures of Association based on Optimal Transport", "link": "https://arxiv.org/abs/2411.13080", "description": "arXiv:2411.13080v1 Announce Type: cross \nAbstract: In this paper we propose and study a class of nonparametric, yet interpretable measures of association between two random vectors $X$ and $Y$ taking values in $\\mathbb{R}^{d_1}$ and $\\mathbb{R}^{d_2}$ respectively ($d_1, d_2\\ge 1$). 
These nonparametric measures -- defined using the theory of reproducing kernel Hilbert spaces coupled with optimal transport -- capture the strength of dependence between $X$ and $Y$ and have the property that they are 0 if and only if the variables are independent and 1 if and only if one variable is a measurable function of the other. Further, these population measures can be consistently estimated using the general framework of geometric graphs which include $k$-nearest neighbor graphs and minimum spanning trees. Additionally, these measures can also be readily used to construct an exact finite sample distribution-free test of mutual independence between $X$ and $Y$. In fact, as far as we are aware, these are the only procedures that possess all the above-mentioned desirable properties. The correlation coefficient proposed in Dette et al. (2013), Chatterjee (2021), Azadkia and Chatterjee (2021), at the population level, can be seen as a special case of this general class of measures."}, "https://arxiv.org/abs/2411.13293": {"title": "Revealed Information", "link": "https://arxiv.org/abs/2411.13293", "description": "arXiv:2411.13293v1 Announce Type: cross \nAbstract: An analyst observes the frequency with which a decision maker (DM) takes actions, but does not observe the frequency of actions conditional on the payoff-relevant state. We ask when the analyst can rationalize the DM's choices as if the DM first learns something about the state before taking action. We provide a support function characterization of the triples of utility functions, prior beliefs, and (marginal) distributions over actions such that the DM's action distribution is consistent with information given the agent's prior and utility function. Assumptions on the cardinality of the state space and the utility function allow us to refine this characterization, obtaining a sharp system of finitely many inequalities the utility function, prior, and action distribution must satisfy. We apply our characterization to study comparative statics and ring-network games, and to identify conditions under which a data set is consistent with a public information structure in first-order Bayesian persuasion games. We characterize the set of distributions over posterior beliefs that are consistent with the DM's choices. Assuming the first-order approach applies, we extend our results to settings with a continuum of actions and/or states."}, "https://arxiv.org/abs/2104.07773": {"title": "Jointly Modeling and Clustering Tensors in High Dimensions", "link": "https://arxiv.org/abs/2104.07773", "description": "arXiv:2104.07773v3 Announce Type: replace \nAbstract: We consider the problem of jointly modeling and clustering populations of tensors by introducing a high-dimensional tensor mixture model with heterogeneous covariances. To effectively tackle the high dimensionality of tensor objects, we employ plausible dimension reduction assumptions that exploit the intrinsic structures of tensors such as low-rankness in the mean and separability in the covariance. In estimation, we develop an efficient high-dimensional expectation-conditional-maximization (HECM) algorithm that breaks the intractable optimization in the M-step into a sequence of much simpler conditional optimization problems, each of which is convex, admits regularization and has closed-form updating formulas.
Our theoretical analysis is challenged by both the non-convexity in the EM-type estimation and having access to only the solutions of conditional maximizations in the M-step, leading to the notion of dual non-convexity. We demonstrate that the proposed HECM algorithm, with an appropriate initialization, converges geometrically to a neighborhood that is within statistical precision of the true parameter. The efficacy of our proposed method is demonstrated through comparative numerical experiments and an application to a medical study, where our proposal achieves an improved clustering accuracy over existing benchmarking methods."}, "https://arxiv.org/abs/2307.16720": {"title": "Clustering multivariate functional data using the epigraph and hypograph indices: a case study on Madrid air quality", "link": "https://arxiv.org/abs/2307.16720", "description": "arXiv:2307.16720v4 Announce Type: replace \nAbstract: With the rapid growth of data generation, advancements in functional data analysis (FDA) have become essential, especially for approaches that handle multiple variables at the same time. This paper introduces a novel formulation of the epigraph and hypograph indices, along with their generalized expressions, specifically designed for multivariate functional data (MFD). These new definitions account for interrelationships between variables, enabling effective clustering of MFD based on the original data curves and their first two derivatives. The methodology developed here has been tested on simulated datasets, demonstrating strong performance compared to state-of-the-art methods. Its practical utility is further illustrated with two environmental datasets: the Canadian weather dataset and a 2023 air quality study in Madrid. These applications highlight the potential of the method as a great tool for analyzing complex environmental data, offering valuable insights for researchers and policymakers in climate and environmental research."}, "https://arxiv.org/abs/2308.05858": {"title": "Inconsistency and Acausality of Model Selection in Bayesian Inverse Problems", "link": "https://arxiv.org/abs/2308.05858", "description": "arXiv:2308.05858v3 Announce Type: replace \nAbstract: Bayesian inference paradigms are regarded as powerful tools for the solution of inverse problems. However, when applied to inverse problems in physical sciences, Bayesian formulations suffer from a number of inconsistencies that are often overlooked. A well-known, but mostly neglected, difficulty is connected to the notion of conditional probability densities. Borel, and later Kolmogorov (1933/1956), found that the traditional definition of conditional densities is incomplete: In different parameterizations it leads to different results. We will show an example where two apparently correct procedures applied to the same problem lead to two widely different results. Another type of inconsistency involves violation of causality. This problem is found in model selection strategies in Bayesian inversion, such as Hierarchical Bayes and Trans-Dimensional Inversion where so-called hyperparameters are included as variables to control either the number (or type) of unknowns, or the prior uncertainties on data or model parameters. For Hierarchical Bayes we demonstrate that the calculated 'prior' distributions of data or model parameters are not prior-, but posterior information.
In fact, the calculated 'standard deviations' of the data are a measure of the inability of the forward function to model the data, rather than uncertainties of the data. For trans-dimensional inverse problems we show that the so-called evidence is, in fact, not a measure of the success of fitting the data for the given choice (or number) of parameters, as often claimed. We also find that the notion of Natural Parsimony is ill-defined, because of its dependence on the parameter prior. Based on this study, we find that careful rethinking of Bayesian inversion practices is required, with special emphasis on ways of avoiding the Borel-Kolmogorov inconsistency, and on the way we interpret model selection results."}, "https://arxiv.org/abs/2311.15485": {"title": "Calibrated Generalized Bayesian Inference", "link": "https://arxiv.org/abs/2311.15485", "description": "arXiv:2311.15485v2 Announce Type: replace \nAbstract: We provide a simple and general solution for accurate uncertainty quantification of Bayesian inference in misspecified or approximate models, and for generalized posteriors more generally. While existing solutions are based on explicit Gaussian posterior approximations, or post-processing procedures, we demonstrate that correct uncertainty quantification can be achieved by substituting the usual posterior with an intuitively appealing alternative posterior that conveys the same information. This solution applies to both likelihood-based and loss-based posteriors, and we formally demonstrate the reliable uncertainty quantification of this approach. The new approach is demonstrated through a range of examples, including linear models, and doubly intractable models."}, "https://arxiv.org/abs/1710.06078": {"title": "Estimate exponential memory decay in Hidden Markov Model and its applications", "link": "https://arxiv.org/abs/1710.06078", "description": "arXiv:1710.06078v2 Announce Type: replace-cross \nAbstract: Inference in hidden Markov model has been challenging in terms of scalability due to dependencies in the observation data. In this paper, we utilize the inherent memory decay in hidden Markov models, such that the forward and backward probabilities can be carried out with subsequences, enabling efficient inference over long sequences of observations. We formulate this forward filtering process in the setting of the random dynamical system and there exist Lyapunov exponents in the i.i.d random matrices production. And the rate of the memory decay is known as $\\lambda_2-\\lambda_1$, the gap of the top two Lyapunov exponents almost surely. An efficient and accurate algorithm is proposed to numerically estimate the gap after the soft-max parametrization. The length of subsequences $B$ given the controlled error $\\epsilon$ is $B=\\log(\\epsilon)/(\\lambda_2-\\lambda_1)$. We theoretically prove the validity of the algorithm and demonstrate the effectiveness with numerical examples. The method developed here can be applied to widely used algorithms, such as mini-batch stochastic gradient method. 
Moreover, the continuity of the Lyapunov spectrum ensures that the estimated $B$ can be reused for nearby parameter values during inference."}, "https://arxiv.org/abs/2211.12692": {"title": "Empirical Bayes estimation: When does $g$-modeling beat $f$-modeling in theory (and in practice)?", "link": "https://arxiv.org/abs/2211.12692", "description": "arXiv:2211.12692v2 Announce Type: replace-cross \nAbstract: Empirical Bayes (EB) is a popular framework for large-scale inference that aims to find data-driven estimators to compete with the Bayesian oracle that knows the true prior. Two principled approaches to EB estimation have emerged over the years: $f$-modeling, which constructs an approximate Bayes rule by estimating the marginal distribution of the data, and $g$-modeling, which estimates the prior from data and then applies the learned Bayes rule. For the Poisson model, the prototypical examples are the celebrated Robbins estimator and the nonparametric MLE (NPMLE), respectively. It has long been recognized in practice that the Robbins estimator, while being conceptually appealing and computationally simple, lacks robustness and can be easily derailed by ``outliers'', unlike the NPMLE which provides a more stable and interpretable fit thanks to its Bayes form. On the other hand, not only do the existing theories shed little light on this phenomenon, but they all point to the opposite, as both methods have recently been shown optimal in terms of regret (excess over the Bayes risk) for compactly supported and subexponential priors.\n In this paper we provide a theoretical justification for the superiority of $g$-modeling over $f$-modeling for heavy-tailed data by considering priors with bounded $p>1$th moment. We show that with mild regularization, any $g$-modeling method that is Hellinger rate-optimal in density estimation achieves an optimal total regret $\tilde \Theta(n^{\frac{3}{2p+1}})$; in particular, the special case of NPMLE succeeds without regularization. In contrast, there exists an $f$-modeling estimator whose density estimation rate is optimal but whose EB regret is suboptimal by a polynomial factor. These results show that the proper Bayes form provides a ``general recipe of success'' for optimal EB estimation that applies to all $g$-modeling (but not $f$-modeling) methods."}, "https://arxiv.org/abs/2411.13570": {"title": "Inconsistency and Acausality in Bayesian Inference for Physical Problems", "link": "https://arxiv.org/abs/2411.13570", "description": "arXiv:2411.13570v1 Announce Type: new \nAbstract: Bayesian inference is used to estimate continuous parameter values given measured data in many fields of science. The method relies on conditional probability densities to describe information about both data and parameters, yet the notion of conditional densities is inadmissible: probabilities of the same physical event, computed from conditional densities under different parameterizations, may be inconsistent. We show that this inconsistency, together with acausality in hierarchical methods, invalidates a variety of commonly applied Bayesian methods when applied to problems in the physical world, including trans-dimensional inference, general Bayesian dimensionality reduction methods, and hierarchical and empirical Bayes. Models in parameter spaces of different dimensionalities cannot be compared, invalidating the concept of natural parsimony, the probabilistic counterpart to Occam's Razor.
Bayes theorem itself is inadmissible, and Bayesian inference applied to parameters that characterize physical properties requires reformulation."}, "https://arxiv.org/abs/2411.13692": {"title": "Randomized Basket Trial with an Interim Analysis (RaBIt) and Applications in Mental Health", "link": "https://arxiv.org/abs/2411.13692", "description": "arXiv:2411.13692v1 Announce Type: new \nAbstract: Basket trials can efficiently evaluate a single treatment across multiple diseases with a common shared target. Prior methods for randomized basket trials required baskets to have the same sample and effect sizes. To that end, we developed a general randomized basket trial with an interim analysis (RaBIt) that allows for unequal sample sizes and effect sizes per basket. RaBIt is characterized by pruning at an interim stage and then analyzing a pooling of the remaining baskets. We derived the analytical power and type 1 error for the design. We first show that our results are consistent with the prior methods when the sample and effect sizes were the same across baskets. As we adjust the sample allocation between baskets, our threshold for the final test statistic becomes more stringent in order to maintain the same overall type 1 error. Finally, we notice that if we fix a sample size for the baskets proportional to their accrual rate, then at the cost of an almost negligible amount of power, the trial overall is expected to take substantially less time than the non-generalized version."}, "https://arxiv.org/abs/2411.13748": {"title": "An Economical Approach to Design Posterior Analyses", "link": "https://arxiv.org/abs/2411.13748", "description": "arXiv:2411.13748v1 Announce Type: new \nAbstract: To design Bayesian studies, criteria for the operating characteristics of posterior analyses - such as power and the type I error rate - are often assessed by estimating sampling distributions of posterior probabilities via simulation. In this paper, we propose an economical method to determine optimal sample sizes and decision criteria for such studies. Using our theoretical results that model posterior probabilities as a function of the sample size, we assess operating characteristics throughout the sample size space given simulations conducted at only two sample sizes. These theoretical results are used to construct bootstrap confidence intervals for the optimal sample sizes and decision criteria that reflect the stochastic nature of simulation-based design. We also repurpose the simulations conducted in our approach to efficiently investigate various sample sizes and decision criteria using contour plots. The broad applicability and wide impact of our methodology is illustrated using two clinical examples."}, "https://arxiv.org/abs/2411.13764": {"title": "Selective inference is easier with p-values", "link": "https://arxiv.org/abs/2411.13764", "description": "arXiv:2411.13764v1 Announce Type: new \nAbstract: Selective inference is a subfield of statistics that enables valid inference after selection of a data-dependent question. In this paper, we introduce selectively dominant p-values, a class of p-values that allow practitioners to easily perform inference after arbitrary selection procedures. 
Unlike a traditional p-value, whose distribution must stochastically dominate the uniform distribution under the null, a selectively dominant p-value must have a post-selection distribution that stochastically dominates that of a uniform having undergone the same selection process; moreover, this property must hold simultaneously for all possible selection processes. Despite the strength of this condition, we show that all commonly used p-values (e.g., p-values from two-sided testing in parametric families, one-sided testing in monotone likelihood ratio and exponential families, $F$-tests for linear regression, and permutation tests) are selectively dominant. By recasting two canonical selective inference problems (inference on winners and rank verification) in our selective dominance framework, we provide simpler derivations, a deeper conceptual understanding, and new generalizations and variations of these methods. Additionally, we use our insights to introduce selective variants of methods that combine p-values, such as Fisher's combination test."}, "https://arxiv.org/abs/2411.13810": {"title": "Dynamic spatial interaction models for a leader's resource allocation and followers' multiple activities", "link": "https://arxiv.org/abs/2411.13810", "description": "arXiv:2411.13810v1 Announce Type: new \nAbstract: This paper introduces a novel spatial interaction model to explore the decision-making processes of two types of agents, a leader and followers, with central and local governments serving as empirical representations. The model accounts for three key features: (i) resource allocations from the leader to the followers and the resulting strategic interactions, (ii) followers' choices across multiple activities, and (iii) interactions among these activities. We develop a network game to examine the micro-foundations of these processes. In this game, followers engage in multiple activities, while the leader allocates resources by monitoring the externalities arising from followers' interactions. The game's unique Nash equilibrium (NE) is the foundation for our econometric framework, providing equilibrium measures to understand the short-term impacts of changes in followers' characteristics and their long-term consequences. To estimate the agent payoff parameters, we employ the quasi-maximum likelihood (QML) estimation method and examine the asymptotic properties of the QML estimator to ensure robust statistical inferences. Empirically, we investigate interactions among U.S. states in public welfare expenditures (PWE) and housing and community development expenditures (HCDE), focusing on how federal grants influence these expenditures and the interactions among state governments. Our findings reveal positive spillovers in states' PWEs, complementarity between the two expenditures within states, and negative cross-variable spillovers between them. Additionally, we observe positive effects of federal grants on both expenditures. Counterfactual simulations indicate that federal interventions lead to a 6.46% increase in social welfare by increasing the states' efforts on PWE and HCDE.
However, due to the limited flexibility in federal grants, their magnitudes are smaller than the proportion of federal grants within the states' total revenues."}, "https://arxiv.org/abs/2411.13822": {"title": "High-Dimensional Extreme Quantile Regression", "link": "https://arxiv.org/abs/2411.13822", "description": "arXiv:2411.13822v1 Announce Type: new \nAbstract: The estimation of conditional quantiles at extreme tails is of great interest in numerous applications. Various methods that integrate regression analysis with an extrapolation strategy derived from extreme value theory have been proposed to estimate extreme conditional quantiles in scenarios with a fixed number of covariates. However, these methods prove ineffective in high-dimensional settings, where the number of covariates increases with the sample size. In this article, we develop new estimation methods tailored for extreme conditional quantiles with high-dimensional covariates. We establish the asymptotic properties of the proposed estimators and demonstrate their superior performance through simulation studies, particularly in scenarios of growing dimension and high dimension where existing methods may fail. Furthermore, the analysis of auto insurance data validates the efficacy of our methods in estimating extreme conditional insurance claims and selecting important variables."}, "https://arxiv.org/abs/2411.13829": {"title": "Sensitivity analysis methods for outcome missingness using substantive-model-compatible multiple imputation and their application in causal inference", "link": "https://arxiv.org/abs/2411.13829", "description": "arXiv:2411.13829v1 Announce Type: new \nAbstract: When using multiple imputation (MI) for missing data, maintaining compatibility between the imputation model and substantive analysis is important for avoiding bias. For example, some causal inference methods incorporate an outcome model with exposure-confounder interactions that must be reflected in the imputation model. Two approaches for compatible imputation with multivariable missingness have been proposed: Substantive-Model-Compatible Fully Conditional Specification (SMCFCS) and a stacked-imputation-based approach (SMC-stack). If the imputation model is correctly specified, both approaches are guaranteed to be unbiased under the \"missing at random\" assumption. However, this assumption is violated when the outcome causes its own missingness, which is common in practice. In such settings, sensitivity analyses are needed to assess the impact of alternative assumptions on results. An appealing solution for sensitivity analysis is delta-adjustment using MI, specifically \"not-at-random\" (NAR)FCS. However, the issue of imputation model compatibility has not been considered in sensitivity analysis, with a naive implementation of NARFCS being susceptible to bias. To address this gap, we propose two approaches for compatible sensitivity analysis when the outcome causes its own missingness. The proposed approaches, NAR-SMCFCS and NAR-SMC-stack, extend SMCFCS and SMC-stack, respectively, with delta-adjustment for the outcome. We evaluate these approaches using a simulation study that is motivated by a case study, to which the methods were also applied. The simulation results confirmed that a naive implementation of NARFCS produced bias in effect estimates, while NAR-SMCFCS and NAR-SMC-stack were approximately unbiased. 
The proposed compatible approaches provide promising avenues for conducting sensitivity analysis to missingness assumptions in causal inference."}, "https://arxiv.org/abs/2411.14010": {"title": "Spectral domain likelihoods for Bayesian inference in time-varying parameter models", "link": "https://arxiv.org/abs/2411.14010", "description": "arXiv:2411.14010v1 Announce Type: new \nAbstract: Inference for locally stationary processes is often based on some local Whittle-type approximation of the likelihood function defined in the frequency domain. The main reasons for using such a likelihood approximation are that i) it has substantially lower computational cost and better scalability to long time series compared to the time domain likelihood, particularly when used for Bayesian inference via Markov Chain Monte Carlo (MCMC), ii) it is convenient when the model itself is specified in the frequency domain, and iii) it provides access to bootstrap and subsampling MCMC which exploits the asymptotic independence of Fourier transformed data. Most of the existing literature compares the asymptotic performance of the maximum likelihood estimator (MLE) from such a frequency domain likelihood approximation with the exact time domain MLE. Our article uses three simulation studies to assess the finite-sample accuracy of several frequency domain likelihood functions when used to approximate the posterior distribution in time-varying parameter models. The methods are illustrated on an application to egg price data."}, "https://arxiv.org/abs/2411.14016": {"title": "Robust Mutual Fund Selection with False Discovery Rate Control", "link": "https://arxiv.org/abs/2411.14016", "description": "arXiv:2411.14016v1 Announce Type: new \nAbstract: In this article, we address the challenge of identifying skilled mutual funds among a large pool of candidates, utilizing the linear factor pricing model. Assuming observable factors with a weak correlation structure for the idiosyncratic error, we propose a spatial-sign based multiple testing procedure (SS-BH). When latent factors are present, we first extract them using the elliptical principal component method (He et al. 2022) and then propose a factor-adjusted spatial-sign based multiple testing procedure (FSS-BH). Simulation studies demonstrate that our proposed FSS-BH procedure performs exceptionally well across various applications and exhibits robustness to variations in the covariance structure and the distribution of the error term. Additionally, a real data application further highlights the superiority of the FSS-BH procedure."}, "https://arxiv.org/abs/2411.14185": {"title": "A note on numerical evaluation of conditional Akaike information for nonlinear mixed-effects models", "link": "https://arxiv.org/abs/2411.14185", "description": "arXiv:2411.14185v1 Announce Type: new \nAbstract: We propose two methods to evaluate the conditional Akaike information (cAI) for nonlinear mixed-effects models with no restriction on cluster size. Method 1 is designed for continuous data and includes formulae for the derivatives of fixed and random effects estimators with respect to observations. Method 2, compatible with any type of observation, requires modeling the marginal (or prior) distribution of random effects as a multivariate normal distribution.
Simulations show that Method 1 performs well with Gaussian data but struggles with skewed continuous distributions, whereas Method 2 consistently performs well across various distributions, including normal, gamma, negative binomial, and Tweedie, with flexible link functions. Based on our findings, we recommend Method 2 as a distributionally robust cAI criterion for model selection in nonlinear mixed-effects models."}, "https://arxiv.org/abs/2411.14265": {"title": "A Bayesian mixture model for Poisson network autoregression", "link": "https://arxiv.org/abs/2411.14265", "description": "arXiv:2411.14265v1 Announce Type: new \nAbstract: In this paper, we propose a new Bayesian Poisson network autoregression mixture model (PNARM). Our model combines ideas from the models of Dahl 2008, Ren et al. 2024 and Armillotta and Fokianos 2024, as it is motivated by the following aims. We consider the problem of modelling multivariate count time series since they arise in many real-world data sets, but have been studied less than their Gaussian-distributed counterpart (Fokianos 2024). Additionally, we assume that the time series occur on the nodes of a known underlying network where the edges dictate the form of the structural vector autoregression model, as a means of imposing sparsity. A further aim is to accommodate heterogeneous node dynamics, and to develop a probabilistic model for clustering nodes that exhibit similar behaviour. We develop an MCMC algorithm for sampling from the model's posterior distribution. The model is applied to a data set of COVID-19 cases in the counties of the Republic of Ireland."}, "https://arxiv.org/abs/2411.14285": {"title": "Stochastic interventions, sensitivity analysis, and optimal transport", "link": "https://arxiv.org/abs/2411.14285", "description": "arXiv:2411.14285v1 Announce Type: new \nAbstract: Recent methodological research in causal inference has focused on effects of stochastic interventions, which assign treatment randomly, often according to subject-specific covariates. In this work, we demonstrate that the usual notion of stochastic interventions has a surprising property: when there is unmeasured confounding, bounds on their effects do not collapse when the policy approaches the observational regime. As an alternative, we propose to study generalized policies, treatment rules that can depend on covariates, the natural value of treatment, and auxiliary randomness. We show that certain generalized policy formulations can resolve the \"non-collapsing\" bound issue: bounds narrow to a point when the target treatment distribution approaches that in the observed data. Moreover, drawing connections to the theory of optimal transport, we characterize generalized policies that minimize worst-case bound width in various sensitivity analysis models, as well as corresponding sharp bounds on their causal effects. These optimal policies are new, and can have a more parsimonious interpretation compared to their usual stochastic policy analogues.
Finally, we develop flexible, efficient, and robust estimators for the sharp nonparametric bounds that emerge from the framework."}, "https://arxiv.org/abs/2411.14323": {"title": "Estimands and Their Implications for Evidence Synthesis for Oncology: A Simulation Study of Treatment Switching in Meta-Analysis", "link": "https://arxiv.org/abs/2411.14323", "description": "arXiv:2411.14323v1 Announce Type: new \nAbstract: The ICH E9(R1) addendum provides guidelines on accounting for intercurrent events in clinical trials using the estimands framework. However, there has been limited attention to the estimands framework for meta-analysis. Using treatment switching, a well-known intercurrent event that occurs frequently in oncology, we conducted a simulation study to explore the bias introduced by pooling together estimates targeting different estimands in a meta-analysis of randomized clinical trials (RCTs) that allowed for treatment switching. We simulated overall survival data of a collection of RCTs that allowed patients in the control group to switch to the intervention treatment after disease progression under fixed-effects and random-effects models. For each RCT, we calculated effect estimates for a treatment policy estimand that ignored treatment switching, and a hypothetical estimand that accounted for treatment switching by censoring switchers at the time of switching. Then, we performed random-effects and fixed-effects meta-analyses to pool together RCT effect estimates while varying the proportions of treatment policy and hypothetical effect estimates. We compared the results of meta-analyses that pooled different types of effect estimates with those that pooled only treatment policy or hypothetical estimates. We found that pooling estimates targeting different estimands results in pooled estimators that reflect neither the treatment policy estimand nor the hypothetical estimand. This finding shows that pooling estimates of varying target estimands can generate misleading results, even under a random-effects model. Adopting the estimands framework for meta-analysis may improve alignment between meta-analytic results and the clinical research question of interest."}, "https://arxiv.org/abs/2411.13763": {"title": "Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data", "link": "https://arxiv.org/abs/2411.13763", "description": "arXiv:2411.13763v1 Announce Type: cross \nAbstract: In measurement-constrained problems, despite the availability of large datasets, we may only be able to afford to observe the labels on a small portion of the large dataset. This poses the critical question of which data points are most beneficial to label given a budget constraint. In this paper, we focus on the estimation of the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter $\theta$ in a linear threshold $\theta^T Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\theta^T Z$ and a binary outcome $Y$ is minimized. We propose a novel $K$-step active subsampling algorithm to estimate $\theta$, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to $\beta\geq 1$, the smoothness of the conditional density of $X$ given $Y$ and $Z$.
For $\\beta>(1+\\sqrt{3})/2$, we show that the two-step algorithm yields an estimator with the parametric convergence rate $O_p((s \\log d /N)^{1/2})$ in $l_2$ norm. The rate of our estimator is strictly faster than the minimax optimal rate with $N$ i.i.d. samples drawn from the population. For the other two scenarios $1<\\beta\\leq (1+\\sqrt{3})/2$ and $\\beta=1$, the estimator from the two-step algorithm is sub-optimal. The former requires to run $K>2$ steps to attain the same parametric rate, whereas in the latter case only a near parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply the method to analyze a large diabetes dataset."}, "https://arxiv.org/abs/2411.13868": {"title": "Robust Detection of Watermarks for Large Language Models Under Human Edits", "link": "https://arxiv.org/abs/2411.13868", "description": "arXiv:2411.13868v1 Announce Type: cross \nAbstract: Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \\textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families."}, "https://arxiv.org/abs/2102.08591": {"title": "Data-Driven Logistic Regression Ensembles With Applications in Genomics", "link": "https://arxiv.org/abs/2102.08591", "description": "arXiv:2102.08591v5 Announce Type: replace \nAbstract: Advances in data collecting technologies in genomics have significantly increased the need for tools designed to study the genetic basis of many diseases. Statistical tools used to discover patterns between the expression of certain genes and the presence of diseases should ideally perform well in terms of both prediction accuracy and identification of key biomarkers. We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. 
The ensembles are comprised of a relatively small number of highly accurate and interpretable models that are learned directly from the data by minimizing a global objective function. We derive the asymptotic properties of our method and develop an efficient algorithm to compute the ensembles. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical genomics datasets involving common diseases such as cancer, multiple sclerosis and psoriasis. In several applications our method could identify key biomarkers that were absent in state-of-the-art competitor methods. We develop a variable importance ranking tool that may guide the focus of researchers on the most promising genes. Based on numerical experiments we provide guidelines for the choice of the number of models in our ensembles."}, "https://arxiv.org/abs/2411.14542": {"title": "Combining missing data imputation and internal validation in clinical risk prediction models", "link": "https://arxiv.org/abs/2411.14542", "description": "arXiv:2411.14542v1 Announce Type: new \nAbstract: Methods to handle missing data have been extensively explored in the context of estimation and descriptive studies, with multiple imputation being the most widely used method in clinical research. However, in the context of clinical risk prediction models, where the goal is often to achieve high prediction accuracy and to make predictions for future patients, there are different considerations regarding the handling of missing data. As a result, deterministic imputation is better suited to the setting of clinical risk prediction models, since the outcome is not included in the imputation model and the imputation method can be easily applied to future patients. In this paper, we provide a tutorial demonstrating how to conduct bootstrapping followed by deterministic imputation of missing data to construct and internally validate the performance of a clinical risk prediction model in the presence of missing data. Extensive simulation study results are provided to help guide decision-making in real-world applications."}, "https://arxiv.org/abs/2411.14548": {"title": "A Random-Effects Approach to Linear Mixed Model Analysis of Incomplete Longitudinal Data", "link": "https://arxiv.org/abs/2411.14548", "description": "arXiv:2411.14548v1 Announce Type: new \nAbstract: We propose a random-effects approach to missing values for linear mixed model (LMM) analysis. The method converts a LMM with missing covariates to another LMM without missing covariates. The standard LMM analysis tools for longitudinal data then apply. Performance of the method is evaluated empirically, and compared with alternative approaches, including the popular MICE procedure of multiple imputation. Theoretical explanations are given for the patterns observed in the simulation studies. A real-data example is discussed."}, "https://arxiv.org/abs/2411.14570": {"title": "Gradient-based optimization for variational empirical Bayes multiple regression", "link": "https://arxiv.org/abs/2411.14570", "description": "arXiv:2411.14570v1 Announce Type: new \nAbstract: Variational empirical Bayes (VEB) methods provide a practically attractive approach to fitting large, sparse, multiple regression models. These methods usually use coordinate ascent to optimize the variational objective function, an approach known as coordinate ascent variational inference (CAVI). 
Here we propose alternative optimization approaches based on gradient-based (quasi-Newton) methods, which we call gradient-based variational inference (GradVI). GradVI exploits a recent result from Kim et al. [arXiv:2208.10910] which writes the VEB regression objective function as a penalized regression. Unfortunately, the penalty function is not available in closed form, and we present and compare two approaches to dealing with this problem. In simple situations where CAVI performs well, we show that GradVI produces similar predictive performance, and GradVI converges in fewer iterations when the predictors are highly correlated. Furthermore, unlike CAVI, the key computations in GradVI are simple matrix-vector products, and so GradVI is much faster than CAVI in settings where the design matrix admits fast matrix-vector products (e.g., as we show here, trendfiltering applications) and lends itself to parallelized implementations in ways that CAVI does not. GradVI is also very flexible, and could exploit automatic differentiation to easily implement different prior families. Our methods are implemented in an open-source Python software, GradVI (available from https://github.com/stephenslab/gradvi )."}, "https://arxiv.org/abs/2411.14635": {"title": "Modelling Loss of Complexity in Intermittent Time Series and its Application", "link": "https://arxiv.org/abs/2411.14635", "description": "arXiv:2411.14635v1 Announce Type: new \nAbstract: In this paper, we developed a nonparametric relative entropy (RlEn) for modelling loss of complexity in intermittent time series. This technique consists of two steps. First, we fit a nonlinear autoregressive model where the lag order is determined by a Bayesian Information Criterion (BIC), and the complexity of each intermittent time series is obtained by our novel relative entropy. Second, change-points in complexity are detected by using the cumulative sum (CUSUM) based method. Using simulations and comparisons with the popular approximate entropy (ApEn) method, the performance of RlEn was assessed for its (1) ability to localise complexity change-points in intermittent time series; (2) ability to faithfully estimate underlying nonlinear models. The performance of the proposal was then examined in a real analysis of fatigue-induced changes in the complexity of human motor outputs. The results demonstrated that the proposed method outperformed the ApEn in accurately detecting complexity changes in intermittent time series segments."}, "https://arxiv.org/abs/2411.14674": {"title": "Summarizing Bayesian Nonparametric Mixture Posterior -- Sliced Optimal Transport Metrics for Gaussian Mixtures", "link": "https://arxiv.org/abs/2411.14674", "description": "arXiv:2411.14674v1 Announce Type: new \nAbstract: Existing methods to summarize posterior inference for mixture models focus on identifying a point estimate of the implied random partition for clustering, with density estimation as a secondary goal (Wade and Ghahramani, 2018; Dahl et al., 2022). We propose a novel approach for summarizing posterior inference in nonparametric Bayesian mixture models, prioritizing density estimation of the mixing measure (or mixture) as an inference target. One of the key features is the model-agnostic nature of the approach, which remains valid under arbitrarily complex dependence structures in the underlying sampling model. Using a decision-theoretic framework, our method identifies a point estimate by minimizing posterior expected loss.
The loss function is defined as a discrepancy between mixing measures. Estimating the mixing measure implies inference on the mixture density. Exploiting the discrete nature of the mixing measure, we use a version of the sliced Wasserstein distance. We introduce two specific variants for Gaussian mixtures. The first, mixed sliced Wasserstein, applies generalized geodesic projections on the product of the Euclidean space and the manifold of symmetric positive definite matrices. The second, sliced mixture Wasserstein, leverages the linearity of Gaussian mixture measures for efficient projection."}, "https://arxiv.org/abs/2411.14690": {"title": "Deep Gaussian Process Emulation and Uncertainty Quantification for Large Computer Experiments", "link": "https://arxiv.org/abs/2411.14690", "description": "arXiv:2411.14690v1 Announce Type: new \nAbstract: Computer models are used as a way to explore complex physical systems. Stationary Gaussian process emulators, with their accompanying uncertainty quantification, are popular surrogates for computer models. However, many computer models are not well represented by stationary Gaussian process models. Deep Gaussian processes have been shown to be capable of capturing non-stationary behaviors and abrupt regime changes in the computer model response. In this paper, we explore the properties of two deep Gaussian process formulations within the context of computer model emulation. For one of these formulations, we introduce a new parameter that controls the amount of smoothness in the deep Gaussian process layers. We adapt a stochastic variational approach to inference for this model, allowing for prior specification and posterior exploration of the smoothness of the response surface. Our approach can be applied to a large class of computer models, and scales to arbitrarily large simulation designs. The proposed methodology was motivated by the need to emulate an astrophysical model of the formation of binary black hole mergers."}, "https://arxiv.org/abs/2411.14763": {"title": "From Replications to Revelations: Heteroskedasticity-Robust Inference", "link": "https://arxiv.org/abs/2411.14763", "description": "arXiv:2411.14763v1 Announce Type: new \nAbstract: We compare heteroskedasticity-robust inference methods with a large-scale Monte Carlo study based on regressions from 155 reproduction packages of leading economic journals. The results confirm established wisdom and uncover new insights. Among well-established methods, HC2 standard errors with the degree-of-freedom specification proposed by Bell and McCaffrey (2002) perform best. To further improve the accuracy of t-tests, we propose a novel degree-of-freedom specification based on partial leverages. We also show how HC2 to HC4 standard errors can be refined by more effectively addressing the 15.6% of cases where at least one observation exhibits a leverage of one."}, "https://arxiv.org/abs/2411.14864": {"title": "Bayesian optimal change point detection in high-dimensions", "link": "https://arxiv.org/abs/2411.14864", "description": "arXiv:2411.14864v1 Announce Type: new \nAbstract: We propose the first Bayesian methods for detecting change points in high-dimensional mean and covariance structures. These methods are constructed using pairwise Bayes factors, leveraging modularization to identify significant changes in individual components efficiently.
We establish that the proposed methods consistently detect and estimate change points under much milder conditions than existing approaches in the literature. Additionally, we demonstrate that their localization rates are nearly optimal. The practical performance of the proposed methods is evaluated through extensive simulation studies, where they are compared to state-of-the-art techniques. The results show comparable or superior performance across most scenarios. Notably, the methods effectively detect change points whenever signals of sufficient magnitude are present, irrespective of the number of signals. Finally, we apply the proposed methods to genetic and financial datasets, illustrating their practical utility in real-world applications."}, "https://arxiv.org/abs/2411.14999": {"title": "The EE-Classifier: A classification method for functional data based on extremality indexes", "link": "https://arxiv.org/abs/2411.14999", "description": "arXiv:2411.14999v1 Announce Type: new \nAbstract: Functional data analysis has gained significant attention due to its wide applicability. This research explores the extension of statistical analysis methods for functional data, with a primary focus on supervised classification techniques. It provides a review of the existing depth-based methods used for functional data samples. Building on this foundation, it introduces an extremality-based approach, which uses the properties of the modified epigraph and hypograph indexes as classification techniques. To demonstrate the effectiveness of the classifier, it is applied to both real-world and synthetic data sets. The results show its efficacy in accurately classifying functional data. Additionally, the classifier is used to analyze the fluctuations in the S\&P 500 stock value. This research contributes to the field of functional data analysis by introducing a new extremality-based classifier. The successful application to various data sets shows its potential for supervised classification tasks and provides valuable insights into financial data analysis."}, "https://arxiv.org/abs/2411.14472": {"title": "Exploring the Potential Role of Generative AI in the TRAPD Procedure for Survey Translation", "link": "https://arxiv.org/abs/2411.14472", "description": "arXiv:2411.14472v1 Announce Type: cross \nAbstract: This paper explores and assesses in what ways generative AI can assist in translating survey instruments. Writing effective survey questions is a challenging and complex task, made even more difficult for surveys that will be translated and deployed in multiple linguistic and cultural settings. Translation errors can be detrimental, with known errors rendering data unusable for its intended purpose and undetected errors leading to incorrect conclusions. A growing number of institutions face this problem as surveys deployed by private and academic organizations globalize, and the success of their current efforts depends heavily on researchers' and translators' expertise and the amount of time each party has to contribute to the task. Thus, multilinguistic and multicultural surveys produced by teams with limited expertise, budgets, or time are at significant risk for translation-based errors in their data. We implement a zero-shot prompt experiment using ChatGPT to explore generative AI's ability to identify features of questions that might be difficult to translate to a linguistic audience other than the source language.
We find that ChatGPT can provide meaningful feedback on translation issues, including common source survey language, inconsistent conceptualization, sensitivity and formality issues, and nonexistent concepts. In addition, we provide detailed information on the practicality of the approach, including accessing the necessary software, associated costs, and computational run times. Lastly, based on our findings, we propose avenues for future research that integrate AI into survey translation practices."}, "https://arxiv.org/abs/2411.15075": {"title": "The Effects of Major League Baseball's Ban on Infield Shifts: A Quasi-Experimental Analysis", "link": "https://arxiv.org/abs/2411.15075", "description": "arXiv:2411.15075v1 Announce Type: cross \nAbstract: From 2020 to 2023, Major League Baseball changed rules affecting team composition, player positioning, and game time. Understanding the effects of these rules is crucial for leagues, teams, players, and other relevant parties to assess their impact and to advocate either for further changes or undoing previous ones. Panel data and quasi-experimental methods provide useful tools for causal inference in these settings. I demonstrate this potential by analyzing the effect of the 2023 shift ban at both the league-wide and player-specific levels. Using difference-in-differences analysis, I show that the policy increased batting average on balls in play and on-base percentage for left-handed batters by a modest amount (nine points). For individual players, synthetic control analyses identify several players whose offensive performance (on-base percentage, on-base plus slugging percentage, and weighted on-base average) improved substantially (over 70 points in several cases) because of the rule change, and other players with previously high shift rates for whom it had little effect. This article both estimates the impact of this specific rule change and demonstrates how these methods for causal inference are potentially valuable for sports analytics -- at the player, team, and league levels -- more broadly."}, "https://arxiv.org/abs/2305.03134": {"title": "Debiased Inference for Dynamic Nonlinear Panels with Multi-dimensional Heterogeneities", "link": "https://arxiv.org/abs/2305.03134", "description": "arXiv:2305.03134v3 Announce Type: replace \nAbstract: We introduce a generic class of dynamic nonlinear heterogeneous parameter models that incorporate individual and time effects in both the intercept and slope. To address the incidental parameter problem inherent in this class of models, we develop an analytical bias correction procedure to construct a bias-corrected likelihood. The resulting maximum likelihood estimators are automatically bias-corrected. Moreover, likelihood-based tests statistics -- including likelihood-ratio, Lagrange-multiplier, and Wald tests -- follow the limiting chi-square distribution under the null hypothesis. 
Simulations demonstrate the effectiveness of the proposed correction method, and an empirical application on the labor force participation of single mothers underscores its practical importance."}, "https://arxiv.org/abs/2309.10284": {"title": "Rank-adaptive covariance testing with applications to genomics and neuroimaging", "link": "https://arxiv.org/abs/2309.10284", "description": "arXiv:2309.10284v2 Announce Type: replace \nAbstract: In biomedical studies, testing for differences in covariance offers scientific insights beyond mean differences, especially when differences are driven by complex joint behavior between features. However, when differences in joint behavior are weakly dispersed across many dimensions and arise from differences in low-rank structures within the data, as is often the case in genomics and neuroimaging, existing two-sample covariance testing methods may suffer from power loss. The Ky-Fan(k) norm, defined as the sum of the top k singular values, is a simple and intuitive matrix norm able to capture signals caused by differences in low-rank structures between matrices, but its statistical properties in hypothesis testing have not been well studied. In this paper, we investigate the behavior of the Ky-Fan(k) norm in two-sample covariance testing. Ultimately, we propose a novel methodology, Rank-Adaptive Covariance Testing (RACT), which is able to leverage differences in low-rank structures found in the covariance matrices of two groups in order to maximize power. RACT uses permutation for statistical inference, ensuring exact Type I error control. We validate RACT in simulation studies and evaluate its performance when testing for differences in gene expression networks between two types of lung cancer, as well as testing for covariance heterogeneity in diffusion tensor imaging (DTI) data taken on two different scanner types."}, "https://arxiv.org/abs/2311.13202": {"title": "Robust Multi-Model Subset Selection", "link": "https://arxiv.org/abs/2311.13202", "description": "arXiv:2311.13202v3 Announce Type: replace \nAbstract: Outlying observations can be challenging to handle and adversely affect subsequent analyses, particularly in complex high-dimensional datasets. Although outliers are not always undesired anomalies in the data and may possess valuable insights, only methods that are robust to outliers are able to accurately identify them and resist their influence. In this paper, we propose a method that generates an ensemble of sparse and diverse predictive models that are resistant to outliers. We show that the ensembles generally outperform single-model sparse and robust methods in high-dimensional prediction tasks. Cross-validation is used to tune model parameters to control levels of sparsity, diversity and resistance to outliers. We establish the finite-sample breakdown point of the ensembles and the models that comprise them, and we develop a tailored computing algorithm to learn the ensembles by leveraging recent developments in L0 optimization. 
Our extensive numerical experiments on synthetic and artificially contaminated real datasets from bioinformatics and cheminformatics demonstrate the competitive advantage of our method over state-of-the-art single-model methods."}, "https://arxiv.org/abs/2309.03122": {"title": "Bayesian Evidence Synthesis for Modeling SARS-CoV-2 Transmission", "link": "https://arxiv.org/abs/2309.03122", "description": "arXiv:2309.03122v2 Announce Type: replace-cross \nAbstract: The acute phase of the Covid-19 pandemic has made apparent the need for decision support based upon accurate epidemic modeling. This process is substantially hampered by under-reporting of cases and related data incompleteness issues. In this article we adopt the Bayesian paradigm and synthesize publicly available data via a discrete-time stochastic epidemic modeling framework. The models allow for estimating the total number of infections while accounting for the endemic phase of the pandemic. We assess the prediction of the infection rate utilizing mobility information, notably the principal components of the mobility data. We evaluate variational Bayes in this context and find that Hamiltonian Monte Carlo offers a robust inference alternative for such models. We elaborate upon vector analysis of the epidemic dynamics, thus enriching the traditional tools used for decision making. In particular, we show how certain 2-dimensional plots on the phase plane may yield intuitive information regarding the speed and the type of transmission dynamics. We investigate the potential of a two-stage analysis as a consequence of cutting feedback, for inference on certain functionals of the model parameters. Finally, we show that a point mass on critical parameters is overly restrictive and investigate informative priors as a suitable alternative."}, "https://arxiv.org/abs/2311.12214": {"title": "Random Fourier Signature Features", "link": "https://arxiv.org/abs/2311.12214", "description": "arXiv:2311.12214v2 Announce Type: replace-cross \nAbstract: Tensor algebras give rise to one of the most powerful measures of similarity for sequences of arbitrary length called the signature kernel accompanied with attractive theoretical guarantees from stochastic analysis. Previous algorithms to compute the signature kernel scale quadratically in terms of the length and the number of the sequences. To mitigate this severe computational bottleneck, we develop a random Fourier feature-based acceleration of the signature kernel acting on the inherently non-Euclidean domain of sequences. We show uniform approximation guarantees for the proposed unbiased estimator of the signature kernel, while keeping its computation linear in the sequence length and number. In addition, combined with recent advances on tensor projections, we derive two even more scalable time series features with favourable concentration properties and computational complexity both in time and memory. Our empirical results show that the reduction in computational cost comes at a negligible price in terms of accuracy on moderate-sized datasets, and it enables one to scale to large datasets up to a million time series."}, "https://arxiv.org/abs/2411.15326": {"title": "Scalar-on-Shape Regression Models for Functional Data Analysis", "link": "https://arxiv.org/abs/2411.15326", "description": "arXiv:2411.15326v1 Announce Type: new \nAbstract: Functional data contains two components: shape (or amplitude) and phase. 
This paper focuses on a branch of functional data analysis (FDA), namely Shape-Based FDA, that isolates and focuses on shapes of functions. Specifically, it focuses on Scalar-on-Shape (ScoSh) regression models that incorporate the shapes of predictor functions and discard their phases. This aspect sets ScoSh models apart from the traditional Scalar-on-Function (ScoF) regression models that incorporate full predictor functions. ScoSh is motivated by object data analysis, {\\it e.g.}, for neuro-anatomical objects, where object morphologies are relevant and their parameterizations are arbitrary. ScoSh also differs from methods that arbitrarily pre-register data and use it in subsequent analysis. In contrast, ScoSh models perform registration during regression, using the (non-parametric) Fisher-Rao inner product and nonlinear index functions to capture complex predictor-response relationships. This formulation results in novel concepts of {\\it regression phase} and {\\it regression mean} of functions. Regression phases are time-warpings of predictor functions that optimize prediction errors, and regression means are optimal regression coefficients. We demonstrate practical applications of the ScoSh model using extensive simulated and real-data examples, including predicting COVID outcomes when daily rate curves are predictors."}, "https://arxiv.org/abs/2411.15398": {"title": "The ultimate issue error: mistaking parameters for hypotheses", "link": "https://arxiv.org/abs/2411.15398", "description": "arXiv:2411.15398v1 Announce Type: new \nAbstract: In a criminal investigation, an inferential error occurs when the probability that a suspect is the source of some evidence -- such as a fingerprint -- is taken as the probability of guilt. This is known as the ultimate issue error, and the same error occurs in statistical inference when the probability that a parameter equals some value is incorrectly taken to be the probability of a hypothesis. Almost all statistical inference in the social and biological sciences is subject to this error, and replacing every instance of \"hypothesis testing\" with \"parameter testing\" in these fields would more accurately describe the target of inference. Parameter values and quantities derived from them, such as p-values or Bayes factors, have no direct quantitative relationship with scientific hypotheses. Here, we describe the problem, its consequences, and suggest options for improving scientific inference."}, "https://arxiv.org/abs/2411.15499": {"title": "Asymmetric Errors", "link": "https://arxiv.org/abs/2411.15499", "description": "arXiv:2411.15499v1 Announce Type: new \nAbstract: We present a procedure for handling asymmetric errors. Many results in particle physics are presented as values with different positive and negative errors, and there is no consistent procedure for handling them. We consider the difference between errors quoted using pdfs and using likelihoods, and the difference between the rms spread of a measurement and the 68\\% central confidence region. 
We provide a comprehensive analysis of the possibilities, and software tools to enable their use."}, "https://arxiv.org/abs/2411.15511": {"title": "Forecasting with Markovian max-stable fields in space and time: An application to wind gust speeds", "link": "https://arxiv.org/abs/2411.15511", "description": "arXiv:2411.15511v1 Announce Type: new \nAbstract: Hourly maxima of 3-second wind gust speeds are prominent indicators of the severity of wind storms, and accurately forecasting them is thus essential for populations, civil authorities and insurance companies. Space-time max-stable models appear as natural candidates for this, but those explored so far are not suited for forecasting and, more generally, the forecasting literature for max-stable fields is limited. To fill this gap, we consider a specific space-time max-stable model, more precisely a max-autoregressive model with advection, that is well-adapted to model and forecast atmospheric variables. We apply it, as well as our related forecasting strategy, to reanalysis 3-second wind gust data for France in 1999, and show good performance compared to a competitor model. On top of demonstrating the practical relevance of our model, we meticulously study its theoretical properties and show the consistency and asymptotic normality of the space-time pairwise likelihood estimator which is used to calibrate the model."}, "https://arxiv.org/abs/2411.15566": {"title": "Sensitivity Analysis on Interaction Effects of Policy-Augmented Bayesian Networks", "link": "https://arxiv.org/abs/2411.15566", "description": "arXiv:2411.15566v1 Announce Type: new \nAbstract: Biomanufacturing plays an important role in supporting public health and the growth of the bioeconomy. Modeling and studying the interaction effects among various input variables is very critical for obtaining a scientific understanding and process specification in biomanufacturing. In this paper, we use the ShapleyOwen indices to measure the interaction effects for the policy-augmented Bayesian network (PABN) model, which characterizes the risk- and science-based understanding of production bioprocess mechanisms. In order to facilitate efficient interaction effect quantification, we propose a sampling-based simulation estimation framework. In addition, to further improve the computational efficiency, we develop a non-nested simulation algorithm with sequential sampling, which can dynamically allocate the simulation budget to the interactions with high uncertainty and therefore estimate the interaction effects more accurately under a total fixed budget setting."}, "https://arxiv.org/abs/2411.15625": {"title": "Canonical Correlation Analysis: review", "link": "https://arxiv.org/abs/2411.15625", "description": "arXiv:2411.15625v1 Announce Type: new \nAbstract: For over a century canonical correlations, variables, and related concepts have been studied across various fields, with contributions dating back to Jordan [1875] and Hotelling [1936]. This text surveys the evolution of canonical correlation analysis, a fundamental statistical tool, beginning with its foundational theorems and progressing to recent developments and open research problems. Along the way we introduce and review methods, notions, and fundamental concepts from linear algebra, random matrix theory, and high-dimensional statistics, placing particular emphasis on rigorous mathematical treatment.\n The survey is intended for technically proficient graduate students and other researchers with an interest in this area. 
The content is organized into five chapters, supplemented by six sets of exercises found in Chapter 6. These exercises introduce additional material, reinforce key concepts, and serve to bridge ideas across chapters. We recommend the following sequence: first, solve Problem Set 0, then proceed with Chapter 1, solve Problem Set 1, and so on through the text."}, "https://arxiv.org/abs/2411.15691": {"title": "Data integration using covariate summaries from external sources", "link": "https://arxiv.org/abs/2411.15691", "description": "arXiv:2411.15691v1 Announce Type: new \nAbstract: In modern data analysis, information is frequently collected from multiple sources, often leading to challenges such as data heterogeneity and imbalanced sample sizes across datasets. Robust and efficient data integration methods are crucial for improving the generalization and transportability of statistical findings. In this work, we address scenarios where, in addition to having full access to individualized data from a primary source, supplementary covariate information from external sources is also available. While traditional data integration methods typically require individualized covariates from external sources, such requirements can be impractical due to limitations related to accessibility, privacy, storage, and cost. Instead, we propose novel data integration techniques that rely solely on external summary statistics, such as sample means and covariances, to construct robust estimators for the mean outcome under both homogeneous and heterogeneous data settings. Additionally, we extend this framework to causal inference, enabling the estimation of average treatment effects for both generalizability and transportability."}, "https://arxiv.org/abs/2411.15713": {"title": "Bayesian High-dimensional Grouped-regression using Sparse Projection-posterior", "link": "https://arxiv.org/abs/2411.15713", "description": "arXiv:2411.15713v1 Announce Type: new \nAbstract: We present a novel Bayesian approach for high-dimensional grouped regression under sparsity. We leverage a sparse projection method that uses a sparsity-inducing map to derive an induced posterior on a lower-dimensional parameter space. Our method introduces three distinct projection maps based on popular penalty functions: the Group LASSO Projection Posterior, Group SCAD Projection Posterior, and Adaptive Group LASSO Projection Posterior. Each projection map is constructed to immerse dense posterior samples into a structured, sparse space, allowing for effective group selection and estimation in high-dimensional settings. We derive optimal posterior contraction rates for estimation and prediction, proving that the methods are model selection consistent. Additionally, we propose a Debiased Group LASSO Projection Map, which ensures exact coverage of credible sets. Our methodology is particularly suited for applications in nonparametric additive models, where we apply it with B-spline expansions to capture complex relationships between covariates and response. Extensive simulations validate our theoretical findings, demonstrating the robustness of our approach across different settings. 
Finally, we illustrate the practical utility of our method with an application to brain MRI volume data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), where our model identifies key brain regions associated with Alzheimer's progression."}, "https://arxiv.org/abs/2411.15771": {"title": "A flexible and general semi-supervised approach to multiple hypothesis testing", "link": "https://arxiv.org/abs/2411.15771", "description": "arXiv:2411.15771v1 Announce Type: new \nAbstract: Standard multiple testing procedures are designed to report a list of discoveries, or suspected false null hypotheses, given the hypotheses' p-values or test scores. Recently there has been a growing interest in enhancing such procedures by combining additional information with the primary p-value or score. Specifically, such so-called ``side information'' can be leveraged to improve the separation between true and false nulls along additional ``dimensions'' thereby increasing the overall sensitivity. In line with this idea, we develop RESET (REScoring via Estimating and Training) which uses a unique data-splitting protocol that subsequently allows any semi-supervised learning approach to factor in the available side-information while maintaining finite-sample error rate control. Our practical implementation, RESET Ensemble, selects from an ensemble of classification algorithms so that it is compatible to a range of multiple testing scenarios without the need for the user to select the appropriate one. We apply RESET to both p-value and competition based multiple testing problems and show that RESET is (1) power-wise competitive, (2) fast compared to most tools and (3) is able to uniquely achieve finite sample FDR or FDP control, depending on the user's preference."}, "https://arxiv.org/abs/2411.15797": {"title": "Utilization and Profitability of Tractor Services for Maize Farming in Ejura-Sekyedumase Municipality, Ghana", "link": "https://arxiv.org/abs/2411.15797", "description": "arXiv:2411.15797v1 Announce Type: new \nAbstract: Maize farming is a major livelihood activity for many farmers in Ghana. Unfortunately, farmers usually do not obtain the expected returns on their investment due to reliance on rudimentary, labor-intensive, and inefficient methods of production. Using cross-sectional data from 359 maize farmers, this study investigates the profitability and determinants of the use of tractor services for maize production in Ejura-Sekyedumase, Ashanti Region of Ghana. Results from descriptive and profitability analyses reveal that tractor services such as ploughing and shelling are widely used, but their profitability varies significantly among farmers. Key factors influencing profitability include farm size, fertilizer quantity applied, and farmer experience. Results from a multivariate probit analysis also showed that farming experience, fertilizer quantity, and profit per acre have a positive influence on tractor service use for shelling, while household size, farm size, and FBO have a negative influence. Farming experience, fertilizer quantity, and profit per acre positively influence tractor service use for ploughing, while farm size has a negative influence. A t-test result reveals a statistically significant difference in profit between farmers who use tractor services and those who do not. Specifically, farmers who utilize tractor services on their maize farm had a return to cost of 9 percent more than those who do not (p-value < 0.05). 
The Kendall's test result showed a moderate agreement among the maize farmers regarding their ability to access and utilize tractor services on their farm, with financial issues being ranked first."}, "https://arxiv.org/abs/2411.15819": {"title": "A Copula-Based Approach to Modelling and Testing for Heavy-tailed Data with Bivariate Heteroscedastic Extremes", "link": "https://arxiv.org/abs/2411.15819", "description": "arXiv:2411.15819v1 Announce Type: new \nAbstract: Heteroscedasticity and correlated data pose challenges for extreme value analysis, particularly in two-sample testing problems for tail behaviors. In this paper, we propose a novel copula-based multivariate model for independent but not identically distributed heavy-tailed data with heterogeneous marginal distributions and a varying copula structure. The proposed model encompasses classical models with independent and identically distributed data and some models with a mixture of correlation. To understand the tail behavior, we introduce the quasi-tail copula, which integrates both marginal heteroscedasticity and the dependence structure of the varying copula, and further propose an estimation approach. We then establish the joint asymptotic properties for the Hill estimator, scedasis functions, and quasi-tail copula. In addition, a multiplier bootstrap method is applied to estimate their complex covariance. Moreover, it is of practical interest to develop four typical two-sample testing problems under the new model, which include the equivalence of the extreme value indices and scedasis functions. Finally, we conduct simulation studies to validate our tests and apply the new model to the data from the stock market."}, "https://arxiv.org/abs/2411.15822": {"title": "Semi-parametric least-area linear-circular regression through M\\\"obius transformation", "link": "https://arxiv.org/abs/2411.15822", "description": "arXiv:2411.15822v1 Announce Type: new \nAbstract: This paper introduces a new area-based regression model where the responses are angular variables and the predictors are linear. The regression curve is formulated using a generalized M\\\"obius transformation that maps the real axis to the circle. A novel area-based loss function is introduced for parameter estimation, utilizing the intrinsic geometry of a curved torus. The model is semi-parametric, requiring no specific distributional assumptions for the angular error. Extensive simulation studies are performed with von Mises and wrapped Cauchy distributions as angular errors. The practical utility of the model is illustrated through real data analysis of two well-known cryptocurrencies, Bitcoin and Ethereum."}, "https://arxiv.org/abs/2411.15826": {"title": "Expert-elicitation method for non-parametric joint priors using normalizing flows", "link": "https://arxiv.org/abs/2411.15826", "description": "arXiv:2411.15826v1 Announce Type: new \nAbstract: We propose an expert-elicitation method for learning non-parametric joint prior distributions using normalizing flows. Normalizing flows are a class of generative models that enable exact, single-step density evaluation and can capture complex density functions through specialized deep neural networks. Building on our previously introduced simulation-based framework, we adapt and extend the methodology to accommodate non-parametric joint priors. Our framework thus supports the development of elicitation methods for learning both parametric and non-parametric priors, as well as independent or joint priors for model parameters. 
To evaluate the performance of the proposed method, we perform four simulation studies and present an evaluation pipeline that incorporates diagnostics and additional evaluation tools to support decision-making at each stage of the elicitation process."}, "https://arxiv.org/abs/2411.15908": {"title": "Selective Inference for Time-Varying Effect Moderation", "link": "https://arxiv.org/abs/2411.15908", "description": "arXiv:2411.15908v1 Announce Type: new \nAbstract: Causal effect moderation investigates how the effect of interventions (or treatments) on outcome variables changes based on observed characteristics of individuals, known as potential effect moderators. With advances in data collection, datasets containing many observed features as potential moderators have become increasingly common. High-dimensional analyses often lack interpretability, with important moderators masked by noise, while low-dimensional, marginal analyses yield many false positives due to strong correlations with true moderators. In this paper, we propose a two-step method for selective inference on time-varying causal effect moderation that addresses the limitations of both high-dimensional and marginal analyses. Our method first selects a relatively smaller, more interpretable model to estimate a linear causal effect moderation using a Gaussian randomization approach. We then condition on the selection event to construct a pivot, enabling uniformly asymptotic semi-parametric inference in the selected model. Through simulations and real data analyses, we show that our method consistently achieves valid coverage rates, even when existing conditional methods and common sample splitting techniques fail. Moreover, our method yields shorter, bounded intervals, unlike existing methods that may produce infinitely long intervals."}, "https://arxiv.org/abs/2411.15996": {"title": "Homeopathic Modernization and the Middle Science Trap: conceptual context of ergonomics, econometrics and logic of some national scientific case", "link": "https://arxiv.org/abs/2411.15996", "description": "arXiv:2411.15996v1 Announce Type: new \nAbstract: This article analyses the structural and institutional barriers hindering the development of scientific systems in transition economies, such as Kazakhstan. The main focus is on the concept of the \"middle science trap,\" which is characterized by steady growth in quantitative indicators (publications, grants) but a lack of qualitative advancement. Excessive bureaucracy, weak integration into the international scientific community, and ineffective science management are key factors limiting development. This paper proposes an approach of \"homeopathic modernization,\" which focuses on minimal yet strategically significant changes aimed at reducing bureaucratic barriers and enhancing the effectiveness of the scientific ecosystem. A comparative analysis of international experience (China, India, and the European Union) is provided, demonstrating how targeted reforms in the scientific sector can lead to significant results. Social and cultural aspects, including the influence of mentality and institutional structure, are also examined, and practical recommendations for reforming the scientific system in Kazakhstan and Central Asia are offered. 
The conclusions of the article could be useful for developing national science modernization programs, particularly in countries with high levels of bureaucracy and conservatism."}, "https://arxiv.org/abs/2411.16153": {"title": "Analysis of longitudinal data with destructive sampling using linear mixed models", "link": "https://arxiv.org/abs/2411.16153", "description": "arXiv:2411.16153v1 Announce Type: new \nAbstract: This paper proposes an analysis methodology for the case where there is longitudinal data with destructive sampling of observational units, which come from experimental units that are measured at all times of the analysis. A mixed linear model is proposed and compared with regression models with fixed and mixed effects, including a similar model used for so-called pseudo-panel data, and with a multivariate analysis of variance model, all of which are common in statistics. To compare the models, the mean square error was used, demonstrating the advantage of the proposed methodology. In addition, the methodology is applied to real-life data on scores from the Saber 11 tests taken by students in Colombia, illustrating its advantage in practical scenarios."}, "https://arxiv.org/abs/2411.16192": {"title": "Modeling large dimensional matrix time series with partially known and latent factors", "link": "https://arxiv.org/abs/2411.16192", "description": "arXiv:2411.16192v1 Announce Type: new \nAbstract: This article considers modeling large-dimensional matrix time series by introducing a regression term into the matrix factor model. This is an extension of the classic matrix factor model that incorporates information from known factors or useful covariates. We establish the convergence rates of the coefficient matrix, the loading matrices, and the signal part. The theoretical results coincide with the rates in Wang et al. (2019). We conduct numerical studies to verify the performance of our estimation procedure in finite samples. Finally, we demonstrate the superiority of our proposed model using daily stock return data."}, "https://arxiv.org/abs/2411.16220": {"title": "On the achievability of efficiency bounds for covariate-adjusted response-adaptive randomization", "link": "https://arxiv.org/abs/2411.16220", "description": "arXiv:2411.16220v1 Announce Type: new \nAbstract: In the context of precision medicine, covariate-adjusted response-adaptive randomization (CARA) has garnered much attention from both academia and industry due to its benefits in providing ethical and tailored treatment assignments based on patients' profiles while still preserving favorable statistical properties. Recent years have seen substantial progress in understanding the inference for various adaptive experimental designs. In particular, research has focused on two important perspectives: how to obtain robust inference in the presence of model misspecification, and what is the smallest variance, i.e., the efficiency bound, that an estimator can achieve. Notably, Armstrong (2022) derived the asymptotic efficiency bound for any randomization procedure that assigns treatments depending on covariates and accrued responses, thus including CARA, among others. However, to the best of our knowledge, no existing literature has addressed whether and how the asymptotic efficiency bound can be achieved under CARA. 
In this paper, by connecting two strands of literature on adaptive randomization, namely robust inference and efficiency bound, we provide a definitive answer to this question for an important practical scenario where only discrete covariates are observed and used to form stratification. We consider a specific type of CARA, i.e., a stratified version of doubly-adaptive biased coin design, and prove that the stratified difference-in-means estimator achieves Armstrong (2022)'s efficiency bound, with possible ethical constraints on treatment assignments. Our work provides new insights and demonstrates the potential for more research regarding the design and analysis of CARA that maximizes efficiency while adhering to ethical considerations. Future studies could explore how to achieve the asymptotic efficiency bound for general CARA with continuous covariates, which remains an open question."}, "https://arxiv.org/abs/2411.16311": {"title": "Bayesian models for missing and misclassified variables using integrated nested Laplace approximations", "link": "https://arxiv.org/abs/2411.16311", "description": "arXiv:2411.16311v1 Announce Type: new \nAbstract: Misclassified variables used in regression models, either as a covariate or as the response, may lead to biased estimators and incorrect inference. Even though Bayesian models to adjust for misclassification error exist, it has not been shown how these models can be implemented using integrated nested Laplace approximation (INLA), a popular framework for fitting Bayesian models due to its computational efficiency. Since INLA requires the latent field to be Gaussian, and the Bayesian models adjusting for covariate misclassification error necessarily introduce a latent categorical variable, it is not obvious how to fit these models in INLA. Here, we show how INLA can be combined with importance sampling to overcome this limitation. We also discuss how to account for a misclassified response variable using INLA directly without any additional sampling procedure. The proposed methods are illustrated through a number of simulations and applications to real-world data, and all examples are presented with detailed code in the supporting information."}, "https://arxiv.org/abs/2411.16429": {"title": "Multivariate Adjustments for Average Equivalence Testing", "link": "https://arxiv.org/abs/2411.16429", "description": "arXiv:2411.16429v1 Announce Type: new \nAbstract: Multivariate (average) equivalence testing is widely used to assess whether the means of two conditions of interest are `equivalent' for different outcomes simultaneously. The multivariate Two One-Sided Tests (TOST) procedure is typically used in this context by checking if, outcome by outcome, the marginal $100(1-2\\alpha$)\\% confidence intervals for the difference in means between the two conditions of interest lie within pre-defined lower and upper equivalence limits. This procedure, known to be conservative in the univariate case, leads to a rapid power loss when the number of outcomes increases, especially when one or more outcome variances are relatively large. In this work, we propose a finite-sample adjustment for this procedure, the multivariate $\\alpha$-TOST, that consists in a correction of $\\alpha$, the significance level, taking the (arbitrary) dependence between the outcomes of interest into account and making it uniformly more powerful than the conventional multivariate TOST. 
We present an iterative algorithm that efficiently defines $\\alpha^{\\star}$, the corrected significance level, a task that proves challenging in the multivariate setting due to the inter-relationship between $\\alpha^{\\star}$ and the sets of values belonging to the null hypothesis space and defining the test size. We study the operating characteristics of the multivariate $\\alpha$-TOST both theoretically and via an extensive simulation study considering cases relevant for real-world analyses -- i.e.,~relatively small sample sizes, unknown and heterogeneous variances, and different correlation structures -- and show the superior finite-sample properties of the multivariate $\\alpha$-TOST compared to its conventional counterpart. We finally re-visit a case study on ticlopidine hydrochloride and compare both methods when simultaneously assessing bioequivalence for multiple pharmacokinetic parameters."}, "https://arxiv.org/abs/2411.16662": {"title": "A Supervised Machine Learning Approach for Assessing Grant Peer Review Reports", "link": "https://arxiv.org/abs/2411.16662", "description": "arXiv:2411.16662v1 Announce Type: new \nAbstract: Peer review in grant evaluation informs funding decisions, but the contents of peer review reports are rarely analyzed. In this work, we develop a thoroughly tested pipeline to analyze the texts of grant peer review reports using methods from applied Natural Language Processing (NLP) and machine learning. We start by developing twelve categories reflecting the content of grant peer review reports that are of interest to research funders. This is followed by multiple human annotators' iterative annotation of these categories in a novel text corpus of grant peer review reports submitted to the Swiss National Science Foundation. After validating the human annotation, we use the annotated texts to fine-tune pre-trained transformer models to classify these categories at scale, while conducting several robustness and validation checks. Our results show that many categories can be reliably identified by human annotators and machine learning approaches. However, the choice of text classification approach considerably influences the classification performance. We also find a high correspondence between out-of-sample classification performance and human annotators' perceived difficulty in identifying categories. Our results and publicly available fine-tuned transformer models will allow researchers, research funders, and anybody interested in peer review to examine and report on the contents of these reports in a structured manner. Ultimately, we hope our approach can contribute to ensuring the quality and trustworthiness of grant peer review."}, "https://arxiv.org/abs/2411.15306": {"title": "Heavy-tailed Contamination is Easier than Adversarial Contamination", "link": "https://arxiv.org/abs/2411.15306", "description": "arXiv:2411.15306v1 Announce Type: cross \nAbstract: A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to statistically and computationally efficient outlier-robust estimators. Two particular outlier models have received significant attention: the adversarial and heavy-tailed models. While the former models outliers as the result of a malicious adversary manipulating the data, the latter relaxes distributional assumptions on the data, allowing outliers to naturally occur as part of the data generating process. 
In the first setting, the goal is to develop estimators robust to the largest fraction of outliers while in the second, one seeks estimators to combat the loss of statistical efficiency, where the dependence on the failure probability is paramount.\n Despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that any adversarially robust estimator is also resilient to heavy-tailed outliers for any statistical estimation problem with i.i.d data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we construct heavy-tailed estimators whose application to the adversarial setting requires any black-box reduction to remove almost all the outliers in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation opening the door to novel algorithmic approaches for the heavy-tailed setting. Additionally, confidence intervals obtained for adversarially robust estimation also hold with high-probability."}, "https://arxiv.org/abs/2411.15590": {"title": "From Complexity to Parsimony: Integrating Latent Class Analysis to Uncover Multimodal Learning Patterns in Collaborative Learning", "link": "https://arxiv.org/abs/2411.15590", "description": "arXiv:2411.15590v1 Announce Type: cross \nAbstract: Multimodal Learning Analytics (MMLA) leverages advanced sensing technologies and artificial intelligence to capture complex learning processes, but integrating diverse data sources into cohesive insights remains challenging. This study introduces a novel methodology for integrating latent class analysis (LCA) within MMLA to map monomodal behavioural indicators into parsimonious multimodal ones. Using a high-fidelity healthcare simulation context, we collected positional, audio, and physiological data, deriving 17 monomodal indicators. LCA identified four distinct latent classes: Collaborative Communication, Embodied Collaboration, Distant Interaction, and Solitary Engagement, each capturing unique monomodal patterns. Epistemic network analysis compared these multimodal indicators with the original monomodal indicators and found that the multimodal approach was more parsimonious while offering higher explanatory power regarding students' task and collaboration performances. The findings highlight the potential of LCA in simplifying the analysis of complex multimodal data while capturing nuanced, cross-modality behaviours, offering actionable insights for educators and enhancing the design of collaborative learning interventions. This study proposes a pathway for advancing MMLA, making it more parsimonious and manageable, and aligning with the principles of learner-centred education."}, "https://arxiv.org/abs/2411.15624": {"title": "Trans-Glasso: A Transfer Learning Approach to Precision Matrix Estimation", "link": "https://arxiv.org/abs/2411.15624", "description": "arXiv:2411.15624v1 Announce Type: cross \nAbstract: Precision matrix estimation is essential in various fields, yet it is challenging when samples for the target study are limited. Transfer learning can enhance estimation accuracy by leveraging data from related source studies. 
We propose Trans-Glasso, a two-step transfer learning method for precision matrix estimation. First, we obtain initial estimators using a multi-task learning objective that captures shared and unique features across studies. Then, we refine these estimators through differential network estimation to adjust for structural differences between the target and source precision matrices. Under the assumption that most entries of the target precision matrix are shared with source matrices, we derive non-asymptotic error bounds and show that Trans-Glasso achieves minimax optimality under certain conditions. Extensive simulations demonstrate Trans Glasso's superior performance compared to baseline methods, particularly in small-sample settings. We further validate Trans-Glasso in applications to gene networks across brain tissues and protein networks for various cancer subtypes, showcasing its effectiveness in biological contexts. Additionally, we derive the minimax optimal rate for differential network estimation, representing the first such guarantee in this area."}, "https://arxiv.org/abs/2411.15674": {"title": "Quantile deep learning models for multi-step ahead time series prediction", "link": "https://arxiv.org/abs/2411.15674", "description": "arXiv:2411.15674v1 Announce Type: cross \nAbstract: Uncertainty quantification is crucial in time series prediction, and quantile regression offers a valuable mechanism for uncertainty quantification which is useful for extreme value forecasting. Although deep learning models have been prominent in multi-step ahead prediction, the development and evaluation of quantile deep learning models have been limited. We present a novel quantile regression deep learning framework for multi-step time series prediction. In this way, we elevate the capabilities of deep learning models by incorporating quantile regression, thus providing a more nuanced understanding of predictive values. We provide an implementation of prominent deep learning models for multi-step ahead time series prediction and evaluate their performance under high volatility and extreme conditions. We include multivariate and univariate modelling, strategies and provide a comparison with conventional deep learning models from the literature. Our models are tested on two cryptocurrencies: Bitcoin and Ethereum, using daily close-price data and selected benchmark time series datasets. The results show that integrating a quantile loss function with deep learning provides additional predictions for selected quantiles without a loss in the prediction accuracy when compared to the literature. Our quantile model has the ability to handle volatility more effectively and provides additional information for decision-making and uncertainty quantification through the use of quantiles when compared to conventional deep learning models."}, "https://arxiv.org/abs/2411.16244": {"title": "What events matter for exchange rate volatility ?", "link": "https://arxiv.org/abs/2411.16244", "description": "arXiv:2411.16244v1 Announce Type: cross \nAbstract: This paper expands on stochastic volatility models by proposing a data-driven method to select the macroeconomic events most likely to impact volatility. The paper identifies and quantifies the effects of macroeconomic events across multiple countries on exchange rate volatility using high-frequency currency returns, while accounting for persistent stochastic volatility effects and seasonal components capturing time-of-day patterns. 
Given the hundreds of macroeconomic announcements and their lags, we rely on sparsity-based methods to select relevant events for the model. We contribute to the exchange rate literature in four ways: First, we identify the macroeconomic events that drive currency volatility, estimate their effects and connect them to macroeconomic fundamentals. Second, we find a link between intraday seasonality, trading volume, and the opening hours of major markets across the globe. We provide a simple labor-based explanation for this observed pattern. Third, we show that including macroeconomic events and seasonal components is crucial for forecasting exchange rate volatility. Fourth, our proposed model yields the lowest volatility and highest Sharpe ratio in portfolio allocations when compared to standard SV and GARCH models."}, "https://arxiv.org/abs/2411.16366": {"title": "Connections between sequential Bayesian inference and evolutionary dynamics", "link": "https://arxiv.org/abs/2411.16366", "description": "arXiv:2411.16366v1 Announce Type: cross \nAbstract: It has long been posited that there is a connection between the dynamical equations describing evolutionary processes in biology and sequential Bayesian learning methods. This manuscript describes new research in which this precise connection is rigorously established in the continuous time setting. Here we focus on a partial differential equation known as the Kushner-Stratonovich equation describing the evolution of the posterior density in time. Of particular importance is a piecewise smooth approximation of the observation path, from which the discrete-time filtering equations are derived; these are shown to converge to a Stratonovich interpretation of the Kushner-Stratonovich equation. This smooth formulation will then be used to draw precise connections between nonlinear stochastic filtering and replicator-mutator dynamics. Additionally, gradient flow formulations will be investigated as well as a form of replicator-mutator dynamics which is shown to be beneficial for the misspecified model filtering problem. It is hoped this work will spur further research into exchanges between sequential learning and evolutionary biology and inspire new algorithms in filtering and sampling."}, "https://arxiv.org/abs/1904.01047": {"title": "Dynamically Optimal Treatment Allocation", "link": "https://arxiv.org/abs/1904.01047", "description": "arXiv:1904.01047v5 Announce Type: replace \nAbstract: Dynamic decisions are pivotal to economic policy making. We show how existing evidence from randomized control trials can be utilized to guide personalized decisions in challenging dynamic environments with budget and capacity constraints. Recent advances in reinforcement learning now enable the solution of many complex, real-world problems for the first time. We allow for restricted classes of policy functions and prove that their regret decays at rate n^(-0.5), the same as in the static case. 
Applying our methods to job training, we find that by exploiting the problem's dynamic structure, we achieve significantly higher welfare compared to static approaches."}, "https://arxiv.org/abs/2208.09542": {"title": "Improving knockoffs with conditional calibration", "link": "https://arxiv.org/abs/2208.09542", "description": "arXiv:2208.09542v3 Announce Type: replace \nAbstract: The knockoff filter of Barber and Candes (arXiv:1404.5609) is a flexible framework for multiple testing in supervised learning models, based on introducing synthetic predictor variables to control the false discovery rate (FDR). Using the conditional calibration framework of Fithian and Lei (arXiv:2007.10438), we introduce the calibrated knockoff procedure, a method that uniformly improves the power of any fixed-X or model-X knockoff procedure. We show theoretically and empirically that the improvement is especially notable in two contexts where knockoff methods can be nearly powerless: when the rejection set is small, and when the structure of the design matrix in fixed-X knockoffs prevents us from constructing good knockoff variables. In these contexts, calibrated knockoffs even outperform competing FDR-controlling methods like the (dependence-adjusted) Benjamini-Hochberg procedure in many scenarios."}, "https://arxiv.org/abs/2301.10911": {"title": "Posterior risk of modular and semi-modular Bayesian inference", "link": "https://arxiv.org/abs/2301.10911", "description": "arXiv:2301.10911v3 Announce Type: replace \nAbstract: Modular Bayesian methods perform inference in models that are specified through a collection of coupled sub-models, known as modules. These modules often arise from modelling different data sources or from combining domain knowledge from different disciplines. ``Cutting feedback'' is a Bayesian inference method that ensures misspecification of one module does not affect inferences for parameters in other modules, and produces what is known as the cut posterior. However, choosing between the cut posterior and the standard Bayesian posterior is challenging. When misspecification is not severe, cutting feedback can greatly increase posterior uncertainty without a large reduction of estimation bias, leading to a bias-variance trade-off. This trade-off motivates semi-modular posteriors, which interpolate between standard and cut posteriors based on a tuning parameter. In this work, we provide the first precise formulation of the bias-variance trade-off that is present in cutting feedback, and we propose a new semi-modular posterior that takes advantage of it. Under general regularity conditions, we prove that this semi-modular posterior is more accurate than the cut posterior according to a notion of posterior risk. An important implication of this result is that point inferences made under the cut posterior are inadmissible. The new method is demonstrated in a number of examples."}, "https://arxiv.org/abs/2307.11201": {"title": "Choosing the Right Approach at the Right Time: A Comparative Analysis of Causal Effect Estimation using Confounder Adjustment and Instrumental Variables", "link": "https://arxiv.org/abs/2307.11201", "description": "arXiv:2307.11201v3 Announce Type: replace \nAbstract: In observational studies, potential unobserved confounding is a major barrier in isolating the average causal effect (ACE). In these scenarios, two main approaches are often used: confounder adjustment for causality (CAC) and instrumental variable analysis for causation (IVAC). 
Nevertheless, both are subject to untestable assumptions and, therefore, it may be unclear under which assumption violation scenarios one method is superior to the other in terms of mitigating inconsistency for the ACE. Although general guidelines exist, direct theoretical comparisons of the trade-offs between the CAC and IVAC assumptions are limited. Using ordinary least squares (OLS) for CAC and two-stage least squares (2SLS) for IVAC, we analytically compare the relative inconsistency for the ACE of each approach under a variety of assumption violation scenarios and discuss rules of thumb for practice. Additionally, a sensitivity framework is proposed to guide analysts in determining which approach may result in less inconsistency for estimating the ACE with a given dataset. We demonstrate our findings both through simulation and by revisiting Card's analysis of the effect of educational attainment on earnings, which has been the subject of previous discussion on instrument validity. The implications of our findings for causal inference practice are discussed, providing guidance for analysts to judge whether CAC or IVAC may be more appropriate for a given situation."}, "https://arxiv.org/abs/2308.13713": {"title": "Causally Sound Priors for Binary Experiments", "link": "https://arxiv.org/abs/2308.13713", "description": "arXiv:2308.13713v3 Announce Type: replace \nAbstract: We introduce the BREASE framework for the Bayesian analysis of randomized controlled trials with a binary treatment and a binary outcome. Approaching the problem from a causal inference perspective, we propose parameterizing the likelihood in terms of the baseline risk, efficacy, and adverse side effects of the treatment, along with a flexible, yet intuitive and tractable jointly independent beta prior distribution on these parameters, which we show to be a generalization of the Dirichlet prior for the joint distribution of potential outcomes. Our approach has a number of desirable characteristics when compared to current mainstream alternatives: (i) it naturally induces prior dependence between expected outcomes in the treatment and control groups; (ii) as the baseline risk, efficacy and risk of adverse side effects are quantities commonly present in the clinicians' vocabulary, the hyperparameters of the prior are directly interpretable, thus facilitating the elicitation of prior knowledge and sensitivity analysis; and (iii) we provide analytical formulae for the marginal likelihood, Bayes factor, and other posterior quantities, as well as an exact posterior sampling algorithm and an accurate and fast data-augmented Gibbs sampler in cases where traditional MCMC fails. Empirical examples demonstrate the utility of our methods for estimation, hypothesis testing, and sensitivity analysis of treatment effects."}, "https://arxiv.org/abs/2311.12597": {"title": "Optimal Functional Bilinear Regression with Two-way Functional Covariates via Reproducing Kernel Hilbert Space", "link": "https://arxiv.org/abs/2311.12597", "description": "arXiv:2311.12597v2 Announce Type: replace \nAbstract: Traditional functional linear regression usually takes a one-dimensional functional predictor as input and estimates the continuous coefficient function. Modern applications often generate two-dimensional covariates, which become matrices when observed at grid points. 
To avoid the inefficiency of the classical method involving estimation of a two-dimensional coefficient function, we propose a functional bilinear regression model, and introduce an innovative three-term penalty to impose a roughness penalty in the estimation. The proposed estimator exhibits a minimax optimal property for prediction under the framework of reproducing kernel Hilbert space. An iterative generalized cross-validation approach is developed to choose tuning parameters, which significantly improves the computational efficiency over the traditional cross-validation approach. The statistical and computational advantages of the proposed method over existing methods are further demonstrated via simulated experiments, the Canadian weather data, and biochemical long-range infrared light detection and ranging data."}, "https://arxiv.org/abs/2411.16831": {"title": "Measuring Statistical Evidence: A Short Report", "link": "https://arxiv.org/abs/2411.16831", "description": "This short text tries to establish a big picture of what statistics is about and how an ideal inference method should behave. Moreover, by examining shortcomings of some of the currently used methods for measuring evidence and utilizing some intuitive principles, we motivate the Relative Belief Ratio as the primary method of characterizing statistical evidence. A number of topics have been omitted in the interest of brevity, and the reader is strongly advised to refer to (Evans, 2015) as the primary source for further reading on the subject.", "authors": "Zamani"}, "https://arxiv.org/abs/2411.16902": {"title": "Bounding causal effects with an unknown mixture of informative and non-informative censoring", "link": "https://arxiv.org/abs/2411.16902", "description": "In experimental and observational data settings, researchers often have limited knowledge of the reasons for censored or missing outcomes. To address this uncertainty, we propose bounds on causal effects for censored outcomes, accommodating the scenario where censoring is an unobserved mixture of informative and non-informative components. Within this mixed censoring framework, we explore several assumptions to derive bounds on causal effects, including bounds expressed as a function of user-specified sensitivity parameters. We develop influence-function based estimators of these bounds to enable flexible, non-parametric, and machine learning based estimation, achieving root-n convergence rates and asymptotic normality under relatively mild conditions. We further consider the identification and estimation of bounds for other causal quantities that remain meaningful when informative censoring reflects a competing risk, such as death. We conduct extensive simulation studies and illustrate our methodology with a study on the causal effect of antipsychotic drugs on diabetes risk using a health insurance dataset.", "authors": "Rubinstein, Agniel, Han et al"}, "https://arxiv.org/abs/2411.16906": {"title": "A Binary IV Model for Persuasion: Profiling Persuasion Types among Compliers", "link": "https://arxiv.org/abs/2411.16906", "description": "In an empirical study of persuasion, researchers often use a binary instrument to encourage individuals to consume information and take some action. We show that, with a binary Imbens-Angrist instrumental variable model and the monotone treatment response assumption, it is possible to identify the joint distribution of potential outcomes among compliers. 
This is necessary to identify the percentage of mobilised voters and their statistical characteristics defined by the moments of the joint distribution of treatment and covariates. Specifically, we develop a method that enables researchers to identify the statistical characteristics of persuasion types: always-voters, never-voters, and mobilised voters among compliers. These findings extend the kappa weighting results in Abadie (2003). We also provide a sharp test for the two sets of identification assumptions. The test boils down to testing whether there exists a nonnegative solution to a possibly under-determined system of linear equations with known coefficients. An application based on Green et al. (2003) is provided.", "authors": "Yu"}, "https://arxiv.org/abs/2411.16938": {"title": "Fragility Index for Time-to-Event Endpoints in Single-Arm Clinical Trials", "link": "https://arxiv.org/abs/2411.16938", "description": "The reliability of clinical trial outcomes is crucial, especially in guiding medical decisions. In this paper, we introduce the Fragility Index (FI) for time-to-event endpoints in single-arm clinical trials - a novel metric designed to quantify the robustness of study conclusions. The FI represents the smallest number of censored observations that, when reclassified as uncensored events, causes the posterior probability of the median survival time exceeding a specified threshold to fall below a predefined confidence level. While drug effectiveness is typically assessed by determining whether the posterior probability exceeds a specified confidence level, the FI offers a complementary measure, indicating how robust these conclusions are to potential shifts in the data. Using a Bayesian approach, we develop a practical framework for computing the FI based on the exponential survival model. To facilitate the application of our method, we developed an R package fi, which provides a tool to compute the Fragility Index. Through real-world case studies involving time-to-event data from single-arm clinical trials, we demonstrate the utility of this index. Our findings highlight how the FI can be a valuable tool for assessing the robustness of survival analyses in single-arm studies, aiding researchers and clinicians in making more informed decisions.", "authors": "Maity, Garg, Basu"}, "https://arxiv.org/abs/2411.16978": {"title": "Normal Approximation for U-Statistics with Cross-Sectional Dependence", "link": "https://arxiv.org/abs/2411.16978", "description": "We apply Stein's method to investigate the normal approximation for both non-degenerate and degenerate U-statistics with cross-sectionally dependent underlying processes in the Wasserstein metric. We show that the convergence rates depend on the mixing rates, the sparsity of the cross-sectional dependence, and the moments of the kernel functions. Conditions are derived for central limit theorems to hold as corollaries. We demonstrate one application of the theoretical results with a nonparametric specification test for data with cross-sectional dependence.", "authors": "Liu"}, "https://arxiv.org/abs/2411.17013": {"title": "Conditional Extremes with Graphical Models", "link": "https://arxiv.org/abs/2411.17013", "description": "Multivariate extreme value analysis quantifies the probability and magnitude of joint extreme events. 
River discharges from the upper Danube River basin provide a challenging dataset for such analysis because the data, which is measured on a spatial network, exhibits both asymptotic dependence and asymptotic independence. To account for both features, we extend the conditional multivariate extreme value model (CMEVM) with a new approach for the residual distribution. This allows sparse (graphical) dependence structures and fully parametric prediction. Our approach fills a current gap in statistical methodology for graphical extremes, where existing models require asymptotic independence. Further, the model can be used to learn the graphical dependence structure when it is unknown a priori. To support inference in high dimensions, we propose a stepwise inference procedure that is computationally efficient and loses no information or predictive power. We show our method is flexible and accurately captures the extremal dependence for the upper Danube River basin discharges.", "authors": "Farrell, Eastoe, Lee"}, "https://arxiv.org/abs/2411.17024": {"title": "Detecting Outliers in Multiple Sampling Results Without Thresholds", "link": "https://arxiv.org/abs/2411.17024", "description": "Bayesian statistics emphasizes the importance of prior distributions, yet finding an appropriate one is practically challenging. When multiple sample results are taken regarding the frequency of the same event, these samples may be influenced by different selection effects. In the absence of suitable prior distributions to correct for these selection effects, it is necessary to exclude outlier sample results to avoid compromising the final result. However, defining outliers based on different thresholds may change the result, which makes the result less persuasive. This work proposes a definition of outliers without the need to set thresholds.", "authors": "Shen"}, "https://arxiv.org/abs/2411.17033": {"title": "Quantile Graph Discovery through QuACC: Quantile Association via Conditional Concordance", "link": "https://arxiv.org/abs/2411.17033", "description": "Graphical structure learning is an effective way to assess and visualize cross-biomarker dependencies in biomedical settings. Standard approaches to estimating graphs rely on conditional independence tests that may not be sensitive to associations that manifest at the tails of joint distributions, i.e., they may miss connections among variables that exhibit associations mainly at lower or upper quantiles. In this work, we propose a novel measure of quantile-specific conditional association called QuACC: Quantile Association via Conditional Concordance. For a pair of variables and a conditioning set, QuACC quantifies agreement between the residuals from two quantile regression models, which may be linear or more complex, e.g., quantile forests. Using this measure as the basis for a test of null (quantile) association, we introduce a new class of quantile-specific graphical models. Through simulation we show our method is powerful for detecting dependencies that manifest at the tails of distributions. 
We apply our method to biobank data from All of Us and identify quantile-specific patterns of conditional association in a multivariate setting.", "authors": "Khan, Malinsky, Picard et al"}, "https://arxiv.org/abs/2411.17035": {"title": "Achieving Privacy Utility Balance for Multivariate Time Series Data", "link": "https://arxiv.org/abs/2411.17035", "description": "Utility-preserving data privatization is of utmost importance for data-producing agencies. The popular noise-addition privacy mechanism distorts autocorrelation patterns in time series data, thereby marring utility; in response, McElroy et al. (2023) introduced all-pass filtering (FLIP) as a utility-preserving time series data privatization method. Adapting this concept to multivariate data is more complex, and in this paper we propose a multivariate all-pass (MAP) filtering method, employing an optimization algorithm to achieve the best balance between data utility and privacy protection. To test the effectiveness of our approach, we apply MAP filtering to both simulated and real data, sourced from the U.S. Census Bureau's Quarterly Workforce Indicator (QWI) dataset.", "authors": "Hore, McElroy, Roy"}, "https://arxiv.org/abs/2411.17393": {"title": "Patient recruitment forecasting in clinical trials using time-dependent Poisson-gamma model and homogeneity testing criteria", "link": "https://arxiv.org/abs/2411.17393", "description": "Clinical trials in the modern era are characterized by their complexity and high costs and usually involve recruiting hundreds or thousands of patients across multiple clinical centres in many countries, as typically a rather large sample size is required in order to prove the efficacy of a particular drug. As the imperative to recruit vast numbers of patients across multiple clinical centres has become a major challenge, accurate forecasting of patient recruitment is one of the key factors for the operational success of clinical trials. A classic Poisson-gamma (PG) recruitment model assumes time-homogeneous recruitment rates. However, there can be potential time-trends in the recruitment driven by various factors, e.g. seasonal changes, exhaustion of patients on particular treatments in some centres, etc. Recently, a few authors considered some extensions of the PG model to time-dependent rates under some particular assumptions. In this paper, a natural generalization of the original PG model to a PG model with non-homogeneous time-dependent rates is introduced. A new analytic methodology is also proposed for modelling/forecasting patient recruitment using a Poisson-gamma approximation of recruitment processes in different countries and globally. The properties of some tests on homogeneity of the rates (a non-parametric one using a Poisson model and two parametric tests using Poisson and PG models) are investigated. The techniques for modeling and simulation of the recruitment using the time-dependent model are discussed. For re-projection of the remaining recruitment, it is proposed to use a moving window and re-estimate parameters at every interim time. 
The results are supported by simulation of some artificial data sets.", "authors": "Anisimov, Oliver"}, "https://arxiv.org/abs/2411.17394": {"title": "Variable selection via fused sparse-group lasso penalized multi-state models incorporating molecular data", "link": "https://arxiv.org/abs/2411.17394", "description": "In multi-state models based on high-dimensional data, effective modeling strategies are required to determine an optimal, ideally parsimonious model. In particular, linking covariate effects across transitions is needed to conduct joint variable selection. A useful technique to reduce model complexity is to address homogeneous covariate effects for distinct transitions. We integrate this approach into data-driven variable selection by extended regularization methods within multi-state model building. We propose the fused sparse-group lasso (FSGL) penalized Cox-type regression in the framework of multi-state models combining the penalization concepts of pairwise differences of covariate effects along with transition grouping. For optimization, we adapt the alternating direction method of multipliers (ADMM) algorithm to transition-specific hazards regression in the multi-state setting. In a simulation study and application to acute myeloid leukemia (AML) data, we evaluate the algorithm's ability to select a sparse model incorporating relevant transition-specific effects and similar cross-transition effects. We investigate settings in which the combined penalty is beneficial compared to global lasso regularization.", "authors": "Miah, Goeman, Putter et al"}, "https://arxiv.org/abs/2411.17402": {"title": "Receiver operating characteristic curve analysis with non-ignorable missing disease status", "link": "https://arxiv.org/abs/2411.17402", "description": "This article considers the receiver operating characteristic (ROC) curve analysis for medical data with non-ignorable missingness in the disease status. In the framework of the logistic regression models for both the disease status and the verification status, we first establish the identifiability of model parameters, and then propose a likelihood method to estimate the model parameters, the ROC curve, and the area under the ROC curve (AUC) for the biomarker. The asymptotic distributions of these estimators are established. Via extensive simulation studies, we compare our method with competing methods in the point estimation and assess the accuracy of confidence interval estimation under various scenarios. To illustrate the application of the proposed method in practical data, we apply our method to the National Alzheimer's Coordinating Center data set.", "authors": "Hu, Yu, Li"}, "https://arxiv.org/abs/2411.17464": {"title": "A new test for assessing the covariate effect in ROC curves", "link": "https://arxiv.org/abs/2411.17464", "description": "The ROC curve is a statistical tool that analyses the accuracy of a diagnostic test in which a variable is used to decide whether an individual is healthy or not. Along with that diagnostic variable, it is usual to have information on some other covariates. In some situations it is advisable to incorporate that information into the study, as the performance of the ROC curves can be affected by them. Using the covariate-adjusted, the covariate-specific or the pooled ROC curves, we discuss how to decide whether we can exclude the covariates from our study or not, and the implications this may have for further analyses of the ROC curve. 
A new test for comparing the covariate-adjusted and the pooled ROC curve is proposed, and the problem is illustrated by analysing a real database.", "authors": "Fanjul-Hevia, Pardo-Fern\\'andez, Gonz\\'alez-Manteiga"}, "https://arxiv.org/abs/2411.17533": {"title": "Simplifying Causal Mediation Analysis for Time-to-Event Outcomes using Pseudo-Values", "link": "https://arxiv.org/abs/2411.17533", "description": "Mediation analysis for survival outcomes is challenging. Most existing methods quantify the treatment effect using the hazard ratio (HR) and attempt to decompose the HR into the direct effect of treatment plus an indirect, or mediated, effect. However, the HR is not expressible as an expectation, which complicates this decomposition, both in terms of estimation and interpretation. Here, we present an alternative approach which leverages pseudo-values to simplify estimation and inference. Pseudo-values take censoring into account during their construction, and once derived, can be modeled in the same way as any continuous outcome. Thus, pseudo-values enable mediation analysis for a survival outcome to fit seamlessly into standard mediation software (e.g. CMAverse in R). Pseudo-values are easy to calculate via a leave-one-observation-out procedure (i.e. jackknifing) and the calculation can be accelerated when the influence function of the estimator is known. Mediation analysis for causal effects defined by survival probabilities, restricted mean survival time, and cumulative incidence functions - in the presence of competing risks - can all be performed within this framework. Extensive simulation studies demonstrate that the method is unbiased across 324 scenarios/estimands and controls the type-I error at the nominal level under the null of no mediation. We illustrate the approach using data from the PARADIGMS clinical trial for the treatment of pediatric multiple sclerosis using fingolimod. In particular, we evaluate whether an imaging biomarker lies on the causal path between treatment and time-to-relapse, which aids in justifying this biomarker as a surrogate outcome. Our approach greatly simplifies mediation analysis for survival data and provides a decomposition of the total effect that is both intuitive and interpretable.", "authors": "Ocampo, Giudice, H\\\"aring et al"}, "https://arxiv.org/abs/2411.17618": {"title": "Valid Bayesian Inference based on Variance Weighted Projection for High-Dimensional Logistic Regression with Binary Covariates", "link": "https://arxiv.org/abs/2411.17618", "description": "We address the challenge of conducting inference for a categorical treatment effect related to a binary outcome variable while taking into account high-dimensional baseline covariates. The conventional technique used to establish orthogonality for the treatment effect from nuisance variables in continuous cases is inapplicable in the context of binary treatment. To overcome this obstacle, an orthogonal score tailored specifically to this scenario is formulated which is based on a variance-weighted projection. Additionally, a novel Bayesian framework is proposed to facilitate valid inference for the desired low-dimensional parameter within the complex framework of high-dimensional logistic regression. We provide uniform convergence results, affirming the validity of credible intervals derived from the posterior distribution. 
The effectiveness of the proposed method is demonstrated through comprehensive simulation studies and real data analysis.", "authors": "Ojha, Narisetty"}, "https://arxiv.org/abs/2411.17639": {"title": "Intrepid MCMC: Metropolis-Hastings with Exploration", "link": "https://arxiv.org/abs/2411.17639", "description": "In engineering examples, one often encounters the need to sample from unnormalized distributions with complex shapes that may also be implicitly defined through a physical or numerical simulation model, making it computationally expensive to evaluate the associated density function. For such cases, MCMC has proven to be an invaluable tool. Random-walk Metropolis Methods (also known as Metropolis-Hastings (MH)), in particular, are highly popular for their simplicity, flexibility, and ease of implementation. However, most MH algorithms suffer from significant limitations when attempting to sample from distributions with multiple modes (particularly disconnected ones). In this paper, we present Intrepid MCMC - a novel MH scheme that utilizes a simple coordinate transformation to significantly improve the mode-finding ability and convergence rate to the target distribution of random-walk Markov chains while retaining most of the simplicity of the vanilla MH paradigm. Through multiple examples, we showcase the improvement in the performance of Intrepid MCMC over vanilla MH for a wide variety of target distribution shapes. We also provide an analysis of the mixing behavior of the Intrepid Markov chain, as well as the efficiency of our algorithm for increasing dimensions. A thorough discussion is presented on the practical implementation of the Intrepid MCMC algorithm. Finally, its utility is highlighted through a Bayesian parameter inference problem for a two-degree-of-freedom oscillator under free vibration.", "authors": "Chakroborty, Shields"}, "https://arxiv.org/abs/2411.17054": {"title": "Optimal Estimation of Shared Singular Subspaces across Multiple Noisy Matrices", "link": "https://arxiv.org/abs/2411.17054", "description": "Estimating singular subspaces from noisy matrices is a fundamental problem with wide-ranging applications across various fields. Driven by the challenges of data integration and multi-view analysis, this study focuses on estimating shared (left) singular subspaces across multiple matrices within a low-rank matrix denoising framework. A common approach for this task is to perform singular value decomposition on the stacked matrix (Stack-SVD), which is formed by concatenating all the individual matrices. We establish that Stack-SVD achieves minimax rate-optimality when the true (left) singular subspaces of the signal matrices are identical. Our analysis reveals some phase transition phenomena in the estimation problem as a function of the underlying signal-to-noise ratio, highlighting how the interplay among multiple matrices collectively determines the fundamental limits of estimation. We then tackle the more complex scenario where the true singular subspaces are only partially shared across matrices. For various cases of partial sharing, we rigorously characterize the conditions under which Stack-SVD remains effective, achieves minimax optimality, or fails to deliver consistent estimates, offering theoretical insights into its practical applicability. To overcome Stack-SVD's limitations in partial sharing scenarios, we propose novel estimators and an efficient algorithm to identify shared and unshared singular vectors, and prove their minimax rate-optimality. 
Extensive simulation studies and real-world data applications demonstrate the numerous advantages of our proposed approaches.", "authors": "Ma, Ma"}, "https://arxiv.org/abs/2411.17087": {"title": "On the Symmetry of Limiting Distributions of M-estimators", "link": "https://arxiv.org/abs/2411.17087", "description": "Many functionals of interest in statistics and machine learning can be written as minimizers of expected loss functions. Such functionals are called $M$-estimands, and can be estimated by $M$-estimators -- minimizers of empirical average losses. Traditionally, statistical inference (e.g., hypothesis tests and confidence sets) for $M$-estimands is obtained by proving asymptotic normality of $M$-estimators centered at the target. However, asymptotic normality is only one of several possible limiting distributions and (asymptotically) valid inference becomes significantly difficult with non-normal limits. In this paper, we provide conditions for the symmetry of three general classes of limiting distributions, enabling inference using HulC (Kuchibhotla et al. (2024)).", "authors": "Bhowmick, Kuchibhotla"}, "https://arxiv.org/abs/2411.17136": {"title": "Autoencoder Enhanced Realised GARCH on Volatility Forecasting", "link": "https://arxiv.org/abs/2411.17136", "description": "Realised volatility has become increasingly prominent in volatility forecasting due to its ability to capture intraday price fluctuations. With a growing variety of realised volatility estimators, each with unique advantages and limitations, selecting an optimal estimator may introduce challenges. In this thesis, aiming to synthesise the impact of various realised volatility measures on volatility forecasting, we propose an extension of the Realised GARCH model that incorporates an autoencoder-generated synthetic realised measure, combining the information from multiple realised measures in a nonlinear manner. Our proposed model extends existing linear methods, such as Principal Component Analysis and Independent Component Analysis, to reduce the dimensionality of realised measures. The empirical evaluation, conducted across four major stock markets from January 2000 to June 2022 and including the period of COVID-19, demonstrates both the feasibility of applying an autoencoder to synthesise volatility measures and the superior effectiveness of the proposed model in one-step-ahead rolling volatility forecasting. The model exhibits enhanced flexibility in parameter estimations across each rolling window, outperforming traditional linear approaches. These findings indicate that nonlinear dimension reduction offers further adaptability and flexibility in improving the synthetic realised measure, with promising implications for future volatility forecasting applications.", "authors": "Zhao, Wang, Gerlach et al"}, "https://arxiv.org/abs/2411.17224": {"title": "Double robust estimation of functional outcomes with data missing at random", "link": "https://arxiv.org/abs/2411.17224", "description": "We present and study semi-parametric estimators for the mean of functional outcomes in situations where some of these outcomes are missing and covariate information is available on all units. Assuming that the missingness mechanism depends only on the covariates (missing at random assumption), we present two estimators for the functional mean parameter, using working models for the functional outcome given the covariates, and the probability of missingness given the covariates. 
We contribute by establishing that both these estimators have Gaussian processes as limiting distributions and explicitly give their covariance functions. One of the estimators is double robust in the sense that the limiting distribution holds whenever at least one of the nuisance models is correctly specified. These results allow us to present simultaneous confidence bands for the mean function with asymptotically guaranteed coverage. A Monte Carlo study shows the finite sample properties of the proposed functional estimators and their associated simultaneous inference. The use of the method is illustrated in an application where the mean of counterfactual outcomes is targeted.", "authors": "Liu, Ecker, Schelin et al"}, "https://arxiv.org/abs/2411.17395": {"title": "Asymptotics for estimating a diverging number of parameters -- with and without sparsity", "link": "https://arxiv.org/abs/2411.17395", "description": "We consider high-dimensional estimation problems where the number of parameters diverges with the sample size. General conditions are established for consistency, uniqueness, and asymptotic normality in both unpenalized and penalized estimation settings. The conditions are weak and accommodate a broad class of estimation problems, including ones with non-convex and group structured penalties. The wide applicability of the results is illustrated through diverse examples, including generalized linear models, multi-sample inference, and stepwise estimation procedures.", "authors": "Gauss, Nagler"}, "https://arxiv.org/abs/2110.07361": {"title": "Distribution-Free Bayesian multivariate predictive inference", "link": "https://arxiv.org/abs/2110.07361", "description": "We introduce a comprehensive Bayesian multivariate predictive inference framework. The basis for our framework is a hierarchical Bayesian model, that is a mixture of finite Polya trees corresponding to multiple dyadic partitions of the unit cube. Given a sample of observations from an unknown multivariate distribution, the posterior predictive distribution is used to model and generate future observations from the unknown distribution. We illustrate the implementation of our methodology and study its performance on simulated examples. We introduce an algorithm for constructing conformal prediction sets, that provide finite sample probability assurances for future observations, with our Bayesian model.", "authors": "Yekutieli"}, "https://arxiv.org/abs/2302.03237": {"title": "Examination of Nonlinear Longitudinal Processes with Latent Variables, Latent Processes, Latent Changes, and Latent Classes in the Structural Equation Modeling Framework: The R package nlpsem", "link": "https://arxiv.org/abs/2302.03237", "description": "We introduce the R package nlpsem, a comprehensive toolkit for analyzing longitudinal processes within the structural equation modeling (SEM) framework, incorporating individual measurement occasions. This package emphasizes nonlinear longitudinal models, especially intrinsic ones, across four key scenarios: (1) univariate longitudinal processes with latent variables, optionally including covariates such as time-invariant covariates (TICs) and time-varying covariates (TVCs); (2) multivariate longitudinal analyses to explore correlations or unidirectional relationships between longitudinal variables; (3) multiple-group frameworks for comparing manifest classes in scenarios (1) and (2); and (4) mixture models for scenarios (1) and (2), accommodating latent class heterogeneity. 
Built on the OpenMx R package, nlpsem supports flexible model designs and uses the full information maximum likelihood method for parameter estimation. A notable feature is its algorithm for determining initial values directly from raw data, enhancing computational efficiency and convergence. Furthermore, nlpsem provides tools for goodness-of-fit tests, cluster analyses, visualization, derivation of p-values and three types of confidence intervals, as well as model selection for nested models using likelihood ratio tests and for non-nested models based on criteria such as Akaike Information Criterion and Bayesian Information Criterion. This article serves as a companion document to the nlpsem R package, providing a comprehensive guide to its modeling capabilities, estimation methods, implementation features, and application examples using synthetic intelligence growth data.", "authors": "Liu"}, "https://arxiv.org/abs/2209.14295": {"title": "Label Noise Robustness of Conformal Prediction", "link": "https://arxiv.org/abs/2209.14295", "description": "We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. We further extend our theory and formulate the requirements for correctly controlling a general loss function, such as the false negative proportion, with noisy labels. Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity.", "authors": "Einbinder, Feldman, Bates et al"}, "https://arxiv.org/abs/2411.17743": {"title": "Ranking probabilistic forecasting models with different loss functions", "link": "https://arxiv.org/abs/2411.17743", "description": "In this study, we introduced various statistical performance metrics, based on the pinball loss and the empirical coverage, for the ranking of probabilistic forecasting models. We tested the ability of the proposed metrics to determine the top performing forecasting model and investigated which metric corresponds to the highest average per-trade profit in the out-of-sample period. Our findings show that for the considered trading strategy, ranking the forecasting models according to the coverage of quantile forecasts used in the trading hours exhibits a superior economic performance.", "authors": "Serafin, Uniejewski"}, "https://arxiv.org/abs/2411.17841": {"title": "Defective regression models for cure rate modeling in Marshall-Olkin family", "link": "https://arxiv.org/abs/2411.17841", "description": "Regression models have a substantial impact on the interpretation of treatments, genetic characteristics and other covariates in survival analysis. In many datasets, the description of the censoring and the survival curve reveals the presence of a cure fraction in the data, which leads to alternative modelling. 
The most common approach to introducing covariates under parametric estimation is via cure rate models and their variations, although the use of defective distributions has introduced a more parsimonious and integrated approach. A defective distribution is given by a density function whose integral is not one after changing the domain of one of the parameters. In this work, we introduce two new defective regression models for long-term survival data in the Marshall-Olkin family: the Marshall-Olkin Gompertz and the Marshall-Olkin inverse Gaussian. The estimation process is conducted using maximum likelihood estimation and Bayesian inference. We evaluate the asymptotic properties of the classical approach in Monte Carlo studies as well as the behavior of Bayes estimates with vague information. The application of both models under classical and Bayesian inference is provided in an experiment of time until death from colon cancer with a dichotomous covariate. The Marshall-Olkin Gompertz regression presented the best fit, and we present some global diagnostics and residual analysis for this proposal.", "authors": "Neto, Tomazella"}, "https://arxiv.org/abs/2411.17859": {"title": "Sparse twoblock dimension reduction for simultaneous compression and variable selection in two blocks of variables", "link": "https://arxiv.org/abs/2411.17859", "description": "A method is introduced to perform simultaneous sparse dimension reduction on two blocks of variables. Beyond dimension reduction, it also yields an estimator for multivariate regression with the capability to intrinsically deselect uninformative variables in both independent and dependent blocks. An algorithm is provided that leads to a straightforward implementation of the method. The benefits of simultaneous sparse dimension reduction are shown to carry through to enhanced capability to predict a set of multivariate dependent variables jointly. Both in a simulation study and in two chemometric applications, the new method outperforms its dense counterpart, as well as multivariate partial least squares.", "authors": "Serneels"}, "https://arxiv.org/abs/2411.17905": {"title": "Repeated sampling of different individuals but the same clusters to improve precision of difference-in-differences estimators: the DISC design", "link": "https://arxiv.org/abs/2411.17905", "description": "We describe the DISC (Different Individuals, Same Clusters) design, a sampling scheme that can improve the precision of difference-in-differences (DID) estimators in settings involving repeated sampling of a population at multiple time points. Although cohort designs typically lead to more efficient DID estimators relative to repeated cross-sectional (RCS) designs, they are often impractical due to high rates of loss-to-follow-up, individuals leaving the risk set, or other reasons. The DISC design represents a hybrid between a cohort sampling design and a RCS sampling design, an alternative strategy in which the researcher takes a single sample of clusters, but then takes different cross-sectional samples of individuals within each cluster at two or more time points. We show that the DISC design can yield DID estimators with much higher precision relative to a RCS design, particularly if random cluster effects are present in the data-generating mechanism. 
For example, for a design in which 40 clusters and 25 individuals per cluster are sampled (for a total sample size of n=1,000), the variance of a commonly-used DID treatment effect estimator is 2.3 times higher in the RCS design for an intraclass correlation coefficient (ICC) of 0.05, 3.8 times higher for an ICC of 0.1, and 7.3 times higher for an ICC of 0.2.", "authors": "Downey, Kenny"}, "https://arxiv.org/abs/2411.17910": {"title": "Bayesian Variable Selection for High-Dimensional Mediation Analysis: Application to Metabolomics Data in Epidemiological Studies", "link": "https://arxiv.org/abs/2411.17910", "description": "In epidemiological research, causal models incorporating potential mediators along a pathway are crucial for understanding how exposures influence health outcomes. This work is motivated by integrated epidemiological and blood biomarker studies, investigating the relationship between long-term adherence to a Mediterranean diet and cardiometabolic health, with plasma metabolomes as potential mediators. Analyzing causal mediation in such high-dimensional omics data presents substantial challenges, including complex dependencies among mediators and the need for advanced regularization or Bayesian techniques to ensure stable and interpretable estimation and selection of indirect effects. To this end, we propose a novel Bayesian framework for identifying active pathways and estimating indirect effects in the presence of high-dimensional multivariate mediators. Our approach adopts a multivariate stochastic search variable selection method, tailored for such complex mediation scenarios. Central to our method is the introduction of a set of priors for the selection: a Markov random field prior and sequential subsetting Bernoulli priors. The first prior's Markov property leverages the inherent correlations among mediators, thereby increasing power to detect mediated effects. The sequential subsetting aspect of the second prior encourages the simultaneous selection of relevant mediators and their corresponding indirect effects from the two model parts, providing a more coherent and efficient variable selection framework, specific to mediation analysis. Comprehensive simulation studies demonstrate that the proposed method provides superior power in detecting active mediating pathways. We further illustrate the practical utility of the method through its application to metabolome data from two cohort studies, highlighting its effectiveness in real data setting.", "authors": "Bae, Kim, Wang et al"}, "https://arxiv.org/abs/2411.17983": {"title": "Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization", "link": "https://arxiv.org/abs/2411.17983", "description": "Model selection/optimization in conformal inference is challenging, since it may break the exchangeability between labeled and unlabeled data. We study this problem in the context of conformal selection, which uses conformal p-values to select ``interesting'' instances with large unobserved labels from a pool of unlabeled data, while controlling the FDR in finite sample. For validity, existing solutions require the model choice to be independent of the data used to construct the p-values and calibrate the selection set. 
However, when presented with many model choices and limited labeled data, it is desirable to (i) select the best model in a data-driven manner, and (ii) mitigate power loss due to sample splitting.\n This paper presents OptCS, a general framework that allows valid statistical testing (selection) after flexible data-driven model optimization. We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse and handles complex p-value dependencies to maintain finite-sample FDR control via a novel multiple testing procedure. We instantiate this general recipe to propose three FDR-controlling procedures, each optimizing the models differently: (i) selecting the most powerful one among multiple pre-trained candidate models, (ii) using all data for model fitting without sample splitting, and (iii) combining full-sample model fitting and selection. We demonstrate the efficacy of our methods via simulation studies and real applications in drug discovery and alignment of large language models in radiology report generation.", "authors": "Bai, Jin"}, "https://arxiv.org/abs/2411.18012": {"title": "Bayesian Inference of Spatially Varying Correlations via the Thresholded Correlation Gaussian Process", "link": "https://arxiv.org/abs/2411.18012", "description": "A central question in multimodal neuroimaging analysis is to understand the association between two imaging modalities and to identify brain regions where such an association is statistically significant. In this article, we propose a Bayesian nonparametric spatially varying correlation model to make inference of such regions. We build our model based on the thresholded correlation Gaussian process (TCGP). It ensures piecewise smoothness, sparsity, and jump discontinuity of spatially varying correlations, and is well applicable even when the number of subjects is limited or the signal-to-noise ratio is low. We study the identifiability of our model, establish the large support property, and derive the posterior consistency and selection consistency. We also develop a highly efficient Gibbs sampler and its variant to compute the posterior distribution. We illustrate the method with both simulations and an analysis of functional magnetic resonance imaging data from the Human Connectome Project.", "authors": "Li, Li, Kang"}, "https://arxiv.org/abs/2411.18334": {"title": "Multi-response linear regression estimation based on low-rank pre-smoothing", "link": "https://arxiv.org/abs/2411.18334", "description": "Pre-smoothing is a technique aimed at increasing the signal-to-noise ratio in data to improve subsequent estimation and model selection in regression problems. However, pre-smoothing has thus far been limited to the univariate response regression setting. Motivated by the widespread interest in multi-response regression analysis in many scientific applications, this article proposes a technique for data pre-smoothing in this setting based on low-rank approximation. We establish theoretical results on the performance of the proposed methodology, and quantify its benefit empirically in a number of simulated experiments. 
We also demonstrate our proposed low-rank pre-smoothing technique on real data arising from the environmental and biological sciences.", "authors": "Tian, Gibberd, Nunes et al"}, "https://arxiv.org/abs/2411.18351": {"title": "On an EM-based closed-form solution for 2 parameter IRT models", "link": "https://arxiv.org/abs/2411.18351", "description": "It is a well-known issue that in Item Response Theory models there is no closed-form for the maximum likelihood estimators of the item parameters. Parameter estimation is therefore typically achieved by means of numerical methods like gradient search. The present work has a two-fold aim: On the one hand, we revise the fundamental notions associated to the item parameter estimation in 2 parameter Item Response Theory models from the perspective of the complete-data likelihood. On the other hand, we argue that, within an Expectation-Maximization approach, a closed-form for discrimination and difficulty parameters can actually be obtained that simply corresponds to the Ordinary Least Square solution.", "authors": "Center, Tuebingen), Center et al"}, "https://arxiv.org/abs/2411.18398": {"title": "Derivative Estimation of Multivariate Functional Data", "link": "https://arxiv.org/abs/2411.18398", "description": "Existing approaches for derivative estimation are restricted to univariate functional data. We propose two methods to estimate the principal components and scores for the derivatives of multivariate functional data. As a result, the derivatives can be reconstructed by a multivariate Karhunen-Lo\\`eve expansion. The first approach is an extended version of multivariate functional principal component analysis (MFPCA) which incorporates the derivatives, referred to as derivative MFPCA (DMFPCA). The second approach is based on the derivation of multivariate Karhunen-Lo\\`eve (DMKL) expansion. We compare the performance of the two proposed methods with a direct approach in simulations. The simulation results indicate that DMFPCA outperforms DMKL and the direct approach, particularly for densely observed data. We apply DMFPCA and DMKL methods to coronary angiogram data to recover derivatives of diameter and quantitative flow ratio. We obtain the multivariate functional principal components and scores of the derivatives, which can be used to classify patterns of coronary artery disease.", "authors": "Zhu, Golovkine, Bargary et al"}, "https://arxiv.org/abs/2411.18416": {"title": "Probabilistic size-and-shape functional mixed models", "link": "https://arxiv.org/abs/2411.18416", "description": "The reliable recovery and uncertainty quantification of a fixed effect function $\\mu$ in a functional mixed model, for modelling population- and object-level variability in noisily observed functional data, is a notoriously challenging task: variations along the $x$ and $y$ axes are confounded with additive measurement error, and cannot in general be disentangled. The question then as to what properties of $\\mu$ may be reliably recovered becomes important. We demonstrate that it is possible to recover the size-and-shape of a square-integrable $\\mu$ under a Bayesian functional mixed model. The size-and-shape of $\\mu$ is a geometric property invariant to a family of space-time unitary transformations, viewed as rotations of the Hilbert space, that jointly transform the $x$ and $y$ axes. 
A random object-level unitary transformation then captures size-and-shape \\emph{preserving} deviations of $\\mu$ from an individual function, while a random linear term and measurement error capture size-and-shape \\emph{altering} deviations. The model is regularized by appropriate priors on the unitary transformations, posterior summaries of which may then be suitably interpreted as optimal data-driven rotations of a fixed orthonormal basis for the Hilbert space. Our numerical experiments demonstrate the utility of the proposed model, and superiority over the current state-of-the-art.", "authors": "Wang, Bharath, Chkrebtii et al"}, "https://arxiv.org/abs/2411.18433": {"title": "A Latent Space Approach to Inferring Distance-Dependent Reciprocity in Directed Networks", "link": "https://arxiv.org/abs/2411.18433", "description": "Reciprocity, or the stochastic tendency for actors to form mutual relationships, is an essential characteristic of directed network data. Existing latent space approaches to modeling directed networks are severely limited by the assumption that reciprocity is homogeneous across the network. In this work, we introduce a new latent space model for directed networks that can model heterogeneous reciprocity patterns that arise from the actors' latent distances. Furthermore, existing edge-independent latent space models are nested within the proposed model class, which allows for meaningful model comparisons. We introduce a Bayesian inference procedure to infer the model parameters using Hamiltonian Monte Carlo. Lastly, we use the proposed method to infer different reciprocity patterns in an advice network among lawyers, an information-sharing network between employees at a manufacturing company, and a friendship network between high school students.", "authors": "Loyal, Wu, Stewart"}, "https://arxiv.org/abs/2411.18481": {"title": "Bhirkuti's Test of Bias Acceptance: Examining in Psychometric Simulations", "link": "https://arxiv.org/abs/2411.18481", "description": "This study introduces Bhirkuti's Test of Bias Acceptance, a systematic graphical framework for evaluating bias and determining its acceptability under varying experimental conditions. Absolute Relative Bias (ARB), while useful for understanding bias, is sensitive to outliers and population parameter magnitudes, often overstating bias for small values and understating it for larger ones. Similarly, Relative Efficiency (RE) can be influenced by variance differences and outliers, occasionally producing counterintuitive values exceeding 100%, which complicates interpretation. By addressing the limitations of traditional metrics such as Absolute Relative Bias (ARB) and Relative Efficiency (RE), the proposed graphical methodology framework leverages ridgeline plots and standardized estimates to provide a comprehensive visualization of parameter estimate distributions. Ridgeline plots constructed in this way offer a robust alternative by visualizing full distributions, highlighting variability, trends, outliers, and descriptive statistics, and facilitating more informed decision-making. This study employs multivariate Latent Growth Models (LGM) and Monte Carlo simulations to examine the performance of growth curve modeling under planned missing data designs, focusing on parameter estimate recovery and efficiency. 
By combining innovative visualization techniques with rigorous simulation methods, Bhirkuti's Test of Bias Acceptance provides a versatile and interpretable toolset for advancing quantitative research in bias evaluation and efficiency assessment.", "authors": "Bhusal, Little"}, "https://arxiv.org/abs/2411.18510": {"title": "A subgroup-aware scoring approach to the study of effect modification in observational studies", "link": "https://arxiv.org/abs/2411.18510", "description": "Effect modification means the size of a treatment effect varies with an observed covariate. Generally speaking, a larger treatment effect with more stable error terms is less sensitive to bias. Thus, we might be able to conclude that a study is less sensitive to unmeasured bias by using these subgroups experiencing larger treatment effects. Lee et al. (2018) proposed the submax method that leverages the joint distribution of test statistics from subgroups to draw a firmer conclusion if effect modification occurs. However, one version of the submax method uses M-statistics as the test statistics and is implemented in the R package submax (Rosenbaum, 2017). The scaling factor in the M-statistics is computed using all observations combined across subgroups. We show that this combining can confuse effect modification with outliers. We propose a novel group M-statistic that scores the matched pairs in each subgroup to tackle the issue. We examine our novel scoring strategy in extensive settings to show the superior performance. The proposed method is applied to an observational study of the effect of a malaria prevention treatment in West Africa.", "authors": "Fan, Small"}, "https://arxiv.org/abs/2411.18549": {"title": "Finite population inference for skewness measures", "link": "https://arxiv.org/abs/2411.18549", "description": "In this article we consider Bowley's skewness measure and the Groeneveld-Meeden $b_{3}$ index in the context of finite population sampling. We employ the functional delta method to obtain asymptotic variance formulae for plug-in estimators and propose corresponding variance estimators. We then consider plug-in estimators based on the H\\'{a}jek cdf-estimator and on a Deville-S\\\"arndal type calibration estimator and test the performance of normal confidence intervals.", "authors": "Pasquazzi"}, "https://arxiv.org/abs/2411.16552": {"title": "When Is Heterogeneity Actionable for Personalization?", "link": "https://arxiv.org/abs/2411.16552", "description": "Targeting and personalization policies can be used to improve outcomes beyond the uniform policy that assigns the best performing treatment in an A/B test to everyone. Personalization relies on the presence of heterogeneity of treatment effects, yet, as we show in this paper, heterogeneity alone is not sufficient for personalization to be successful. We develop a statistical model to quantify \"actionable heterogeneity,\" or the conditions when personalization is likely to outperform the best uniform policy. We show that actionable heterogeneity can be visualized as crossover interactions in outcomes across treatments and depends on three population-level parameters: within-treatment heterogeneity, cross-treatment correlation, and the variation in average responses. Our model can be used to predict the expected gain from personalization prior to running an experiment and also allows for sensitivity analysis, providing guidance on how changing treatments can affect the personalization gain. 
To validate our model, we apply five common personalization approaches to two large-scale field experiments with many interventions that encouraged flu vaccination. We find an 18% gain from personalization in one and a more modest 4% gain in the other, which is consistent with our model. Counterfactual analysis shows that this difference in the gains from personalization is driven by a drastic difference in within-treatment heterogeneity. However, reducing cross-treatment correlation holds a larger potential to further increase personalization gains. Our findings provide a framework for assessing the potential from personalization and offer practical recommendations for improving gains from targeting in multi-intervention settings.", "authors": "Shchetkina, Berman"}, "https://arxiv.org/abs/2411.17808": {"title": "spar: Sparse Projected Averaged Regression in R", "link": "https://arxiv.org/abs/2411.17808", "description": "Package spar for R builds ensembles of predictive generalized linear models with high-dimensional predictors. It employs an algorithm utilizing variable screening and random projection tools to efficiently handle the computational challenges associated with large sets of predictors. The package is designed with a strong focus on extensibility. Screening and random projection techniques are implemented as S3 classes with user-friendly constructor functions, enabling users to easily integrate and develop new procedures. This design enhances the package's adaptability and makes it a powerful tool for a variety of high-dimensional applications.", "authors": "Parzer, Vana-G\\\"ur, Filzmoser"}, "https://arxiv.org/abs/2411.17989": {"title": "Regularized Multi-LLMs Collaboration for Enhanced Score-based Causal Discovery", "link": "https://arxiv.org/abs/2411.17989", "description": "As the significance of understanding the cause-and-effect relationships among variables increases in the development of modern systems and algorithms, learning causality from observational data has become a preferred and efficient approach over conducting randomized control trials. However, purely observational data could be insufficient to reconstruct the true causal graph. Consequently, many researchers have tried to utilise some form of prior knowledge to improve the causal discovery process. In this context, the impressive capabilities of large language models (LLMs) have emerged as a promising alternative to the costly acquisition of prior expert knowledge. In this work, we further explore the potential of using LLMs to enhance causal discovery approaches, particularly focusing on score-based methods, and we propose a general framework to utilise the capacity of not only one but multiple LLMs to augment the discovery process.", "authors": "Li, Liu, Wang et al"}, "https://arxiv.org/abs/2411.18008": {"title": "Causal and Local Correlations Based Network for Multivariate Time Series Classification", "link": "https://arxiv.org/abs/2411.18008", "description": "Recently, time series classification has attracted the attention of a large number of researchers, and hundreds of methods have been proposed. However, these methods often ignore the spatial correlations among dimensions and the local correlations among features. To address this issue, the causal and local correlations based network (CaLoNet) is proposed in this study for multivariate time series classification. First, pairwise spatial correlations between dimensions are modeled using causality modeling to obtain the graph structure. 
Then, a relationship extraction network is used to fuse local correlations to obtain long-term dependency features. Finally, the graph structure and long-term dependency features are integrated into the graph neural network. Experiments on the UEA datasets show that CaLoNet can obtain competitive performance compared with state-of-the-art methods.", "authors": "Du, Wei, Zheng et al"}, "https://arxiv.org/abs/2411.18502": {"title": "Isometry pursuit", "link": "https://arxiv.org/abs/2411.18502", "description": "Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.", "authors": "Koelle, Meila"}, "https://arxiv.org/abs/2411.18569": {"title": "A Flexible Defense Against the Winner's Curse", "link": "https://arxiv.org/abs/2411.18569", "description": "Across science and policy, decision-makers often need to draw conclusions about the best candidate among competing alternatives. For instance, researchers may seek to infer the effectiveness of the most successful treatment or determine which demographic group benefits most from a specific treatment. Similarly, in machine learning, practitioners are often interested in the population performance of the model that performs best empirically. However, cherry-picking the best candidate leads to the winner's curse: the observed performance for the winner is biased upwards, rendering conclusions based on standard measures of uncertainty invalid. We introduce the zoom correction, a novel approach for valid inference on the winner. Our method is flexible: it can be employed in both parametric and nonparametric settings, can handle arbitrary dependencies between candidates, and automatically adapts to the level of selection bias. The method easily extends to important related problems, such as inference on the top k winners, inference on the value and identity of the population winner, and inference on \"near-winners.\"", "authors": "Zrnic, Fithian"}, "https://arxiv.org/abs/2307.16353": {"title": "Single Proxy Synthetic Control", "link": "https://arxiv.org/abs/2307.16353", "description": "Synthetic control methods are widely used to estimate the treatment effect on a single treated unit in time-series settings. A common approach to estimate synthetic control weights is to regress the treated unit's pre-treatment outcome and covariates' time series measurements on those of untreated units via ordinary least squares. However, this approach can perform poorly if the pre-treatment fit is not near perfect, whether the weights are normalized or not. In this paper, we introduce a single proxy synthetic control approach, which views the outcomes of untreated units as proxies of the treatment-free potential outcome of the treated unit, a perspective we leverage to construct a valid synthetic control. Under this framework, we establish an alternative identification strategy and corresponding estimation methods for synthetic controls and the treatment effect on the treated unit. 
Notably, unlike existing proximal synthetic control methods, which require two types of proxies for identification, ours relies on a single type of proxy, thus facilitating its practical relevance. Additionally, we adapt a conformal inference approach to perform inference about the treatment effect, obviating the need for a large number of post-treatment observations. Lastly, our framework can accommodate time-varying covariates and nonlinear models. We demonstrate the proposed approach in a simulation study and a real-world application.", "authors": "Park, Tchetgen"}, "https://arxiv.org/abs/1903.04209": {"title": "From interpretability to inference: an estimation framework for universal approximators", "link": "https://arxiv.org/abs/1903.04209", "description": "We present a novel framework for estimation and inference for the broad class of universal approximators. Estimation is based on the decomposition of model predictions into Shapley values. Inference relies on analyzing the bias and variance properties of individual Shapley components. We show that Shapley value estimation is asymptotically unbiased, and we introduce Shapley regressions as a tool to uncover the true data generating process from noisy data alone. The well-known case of the linear regression is the special case in our framework if the model is linear in parameters. We present theoretical, numerical, and empirical results for the estimation of heterogeneous treatment effects as our guiding example.", "authors": "Joseph"}, "https://arxiv.org/abs/2211.13612": {"title": "Joint modeling of wind speed and wind direction through a conditional approach", "link": "https://arxiv.org/abs/2211.13612", "description": "Atmospheric near surface wind speed and wind direction play an important role in many applications, ranging from air quality modeling, building design, wind turbine placement to climate change research. It is therefore crucial to accurately estimate the joint probability distribution of wind speed and direction. In this work we develop a conditional approach to model these two variables, where the joint distribution is decomposed into the product of the marginal distribution of wind direction and the conditional distribution of wind speed given wind direction. To accommodate the circular nature of wind direction a von Mises mixture model is used; the conditional wind speed distribution is modeled as a directional dependent Weibull distribution via a two-stage estimation procedure, consisting of a directional binned Weibull parameter estimation, followed by a harmonic regression to estimate the dependence of the Weibull parameters on wind direction. A Monte Carlo simulation study indicates that our method outperforms an alternative method that uses periodic spline quantile regression in terms of estimation efficiency. We illustrate our method by using the output from a regional climate model to investigate how the joint distribution of wind speed and direction may change under some future climate scenarios.", "authors": "Murphy, Huang, Bessac et al"}, "https://arxiv.org/abs/2411.18646": {"title": "Temporal Models for Demographic and Global Health Outcomes in Multiple Populations: Introducing the Normal-with-Optional-Shrinkage Data Model Class", "link": "https://arxiv.org/abs/2411.18646", "description": "Statistical models are used to produce estimates of demographic and global health indicators in populations with limited data. 
Such models integrate multiple data sources to produce estimates and forecasts with uncertainty based on model assumptions. Model assumptions can be divided into assumptions that describe latent trends in the indicator of interest versus assumptions on the data generating process of the observed data, conditional on the latent process value. Focusing on the latter, we introduce a class of data models that can be used to combine data from multiple sources with various reporting issues. The proposed data model accounts for sampling errors and differences in observational uncertainty based on survey characteristics. In addition, the data model employs horseshoe priors to produce estimates that are robust to outlying observations. We refer to the data model class as the normal-with-optional-shrinkage (NOS) set up. We illustrate the use of the NOS data model for the estimation of modern contraceptive use and other family planning indicators at the national level for countries globally, using survey data.", "authors": "Alkema, Susmann, Ray"}, "https://arxiv.org/abs/2411.18739": {"title": "A Bayesian semi-parametric approach to causal mediation for longitudinal mediators and time-to-event outcomes with application to a cardiovascular disease cohort study", "link": "https://arxiv.org/abs/2411.18739", "description": "Causal mediation analysis of observational data is an important tool for investigating the potential causal effects of medications on disease-related risk factors, and on time-to-death (or disease progression) through these risk factors. However, when analyzing data from a cohort study, such analyses are complicated by the longitudinal structure of the risk factors and the presence of time-varying confounders. Leveraging data from the Atherosclerosis Risk in Communities (ARIC) cohort study, we develop a causal mediation approach, using (semi-parametric) Bayesian Additive Regression Tree (BART) models for the longitudinal and survival data. Our framework allows for time-varying exposures, confounders, and mediators, all of which can either be continuous or binary. We also identify and estimate direct and indirect causal effects in the presence of a competing event. We apply our methods to assess how medication, prescribed to target cardiovascular disease (CVD) risk factors, affects the time-to-CVD death.", "authors": "Bhandari, Daniels, Josefsson et al"}, "https://arxiv.org/abs/2411.18772": {"title": "Difference-in-differences Design with Outcomes Missing Not at Random", "link": "https://arxiv.org/abs/2411.18772", "description": "This paper addresses one of the most prevalent problems encountered by political scientists working with difference-in-differences (DID) design: missingness in panel data. A common practice for handling missing data, known as complete case analysis, is to drop cases with any missing values over time. A more principled approach involves using nonparametric bounds on causal effects or applying inverse probability weighting based on baseline covariates. Yet, these methods are general remedies that often under-utilize the assumptions already imposed on panel structure for causal identification. In this paper, I outline the pitfalls of complete case analysis and propose an alternative identification strategy based on principal strata. 
To be specific, I impose a parallel trends assumption within each latent group that shares the same missingness pattern (e.g., always-respondents, if-treated-respondents) and leverage missingness rates over time to estimate the proportions of these groups. Building on this, I tailor Lee bounds, well-known nonparametric bounds under selection bias, to partially identify the causal effect within the DID design. Unlike complete case analysis, the proposed method does not require independence between treatment selection and missingness patterns, nor does it assume homogeneous effects across these patterns.", "authors": "Shin"}, "https://arxiv.org/abs/2411.18773": {"title": "Inference on Dynamic Spatial Autoregressive Models with Change Point Detection", "link": "https://arxiv.org/abs/2411.18773", "description": "We analyze a varying-coefficient dynamic spatial autoregressive model with spatial fixed effects. One salient feature of the model is the incorporation of multiple spatial weight matrices through their linear combinations with varying coefficients, which helps solve the problem of choosing the most \"correct\" one for applied econometricians who often face the availability of multiple expert spatial weight matrices. We estimate and make inferences on the model coefficients and coefficients in basis expansions of the varying coefficients through penalized estimations, establishing the oracle properties of the estimators and the consistency of the overall estimated spatial weight matrix, which can be time-dependent. We further consider two applications of our model to change point detection in dynamic spatial autoregressive models, providing theoretical justifications for consistent change point location estimation and practical implementations. Simulation experiments demonstrate the performance of our proposed methodology, and a real data analysis is also carried out.", "authors": "Cen, Chen, Lam"}, "https://arxiv.org/abs/2411.18826": {"title": "Improved order selection method for hidden Markov models: a case study with movement data", "link": "https://arxiv.org/abs/2411.18826", "description": "Hidden Markov models (HMMs) are a versatile statistical framework commonly used in ecology to characterize behavioural patterns from animal movement data. In HMMs, the observed data depend on a finite number of underlying hidden states, generally interpreted as the animal's unobserved behaviour. The number of states is a crucial parameter, controlling the trade-off between ecological interpretability of behaviours (fewer states) and the goodness of fit of the model (more states). Selecting the number of states, commonly referred to as order selection, is notoriously challenging. Common model selection metrics, such as AIC and BIC, often perform poorly in determining the number of states, particularly when models are misspecified. Building on existing methods for HMMs and mixture models, we propose a double penalized likelihood maximum estimate (DPMLE) for the simultaneous estimation of the number of states and parameters of non-stationary HMMs. The DPMLE differs from traditional information criteria by using two penalty functions on the stationary probabilities and state-dependent parameters. For non-stationary HMMs, forward and backward probabilities are used to approximate stationary probabilities. Using a simulation study that includes scenarios with additional complexity in the data, we compare the performance of our method with that of AIC and BIC. 
We also illustrate how the DPMLE differs from AIC and BIC using narwhal (Monodon monoceros) movement data. The proposed method outperformed AIC and BIC in identifying the correct number of states under model misspecification. Furthermore, its capacity to handle non-stationary dynamics allowed for more realistic modeling of complex movement data, offering deeper insights into narwhal behaviour. Our method is a powerful tool for order selection in non-stationary HMMs, with potential applications extending beyond the field of ecology.", "authors": "Dupont, Marcoux, Hussey et al"}, "https://arxiv.org/abs/2411.18838": {"title": "Contrasting the optimal resource allocation to cybersecurity and cyber insurance using prospect theory versus expected utility theory", "link": "https://arxiv.org/abs/2411.18838", "description": "Protecting against cyber-threats is vital for every organization and can be done by investing in cybersecurity controls and purchasing cyber insurance. However, these are interlinked since insurance premiums could be reduced by investing more in cybersecurity controls. The expected utility theory and the prospect theory are two alternative theories explaining decision-making under risk and uncertainty, which can inform strategies for optimizing resource allocation. While the former is considered a rational approach, research has shown that most people make decisions consistent with the latter, including on insurance uptakes. We compare and contrast these two approaches to provide important insights into how the two approaches could lead to different optimal allocations resulting in differing risk exposure as well as financial costs. We introduce the concept of a risk curve and show that identifying the nature of the risk curve is a key step in deriving the optimal resource allocation.", "authors": "Joshi, Yang, Slapnicar et al"}, "https://arxiv.org/abs/2411.18864": {"title": "Redesigning the ensemble Kalman filter with a dedicated model of epistemic uncertainty", "link": "https://arxiv.org/abs/2411.18864", "description": "The problem of incorporating information from observations received serially in time is widespread in the field of uncertainty quantification. Within a probabilistic framework, such problems can be addressed using standard filtering techniques. However, in many real-world problems, some (or all) of the uncertainty is epistemic, arising from a lack of knowledge, and is difficult to model probabilistically. This paper introduces a possibilistic ensemble Kalman filter designed for this setting and characterizes some of its properties. Using possibility theory to describe epistemic uncertainty is appealing from a philosophical perspective, and it is easy to justify certain heuristics often employed in standard ensemble Kalman filters as principled approaches to capturing uncertainty within it. The possibilistic approach motivates a robust mechanism for characterizing uncertainty which shows good performance with small sample sizes, and can outperform standard ensemble Kalman filters at given sample size, even when dealing with genuinely aleatoric uncertainty.", "authors": "Kimchaiwong, Houssineau, Johansen"}, "https://arxiv.org/abs/2411.18879": {"title": "Learning treatment effects under covariate dependent left truncation and right censoring", "link": "https://arxiv.org/abs/2411.18879", "description": "In observational studies with delayed entry, causal inference for time-to-event outcomes can be challenging. 
The challenges arise because, in addition to the potential confounding bias from observational data, the collected data often also suffers from the selection bias due to left truncation, where only subjects with time-to-event (such as death) greater than the enrollment times are included, as well as bias from informative right censoring. To estimate the treatment effects on time-to-event outcomes in such settings, inverse probability weighting (IPW) is often employed. However, IPW is sensitive to model misspecifications, which makes it vulnerable, especially when faced with three sources of biases. Moreover, IPW is inefficient. To address these challenges, we propose a doubly robust framework to handle covariate dependent left truncation and right censoring that can be applied to a wide range of estimation problems, including estimating average treatment effect (ATE) and conditional average treatment effect (CATE). For average treatment effect, we develop estimators that enjoy model double robustness and rate double robustness. For conditional average treatment effect, we develop orthogonal and doubly robust learners that can achieve oracle rate of convergence. Our framework represents the first attempt to construct doubly robust estimators and orthogonal learners for treatment effects that account for all three sources of biases: confounding, selection from covariate-induced dependent left truncation, and informative right censoring. We apply the proposed methods to analyze the effect of midlife alcohol consumption on late-life cognitive impairment, using data from Honolulu Asia Aging Study.", "authors": "Wang, Ying, Xu"}, "https://arxiv.org/abs/2411.18957": {"title": "Bayesian Cluster Weighted Gaussian Models", "link": "https://arxiv.org/abs/2411.18957", "description": "We introduce a novel class of Bayesian mixtures for normal linear regression models which incorporates a further Gaussian random component for the distribution of the predictor variables. The proposed cluster-weighted model aims to encompass potential heterogeneity in the distribution of the response variable as well as in the multivariate distribution of the covariates for detecting signals relevant to the underlying latent structure. Of particular interest are potential signals originating from: (i) the linear predictor structures of the regression models and (ii) the covariance structures of the covariates. We model these two components using a lasso shrinkage prior for the regression coefficients and a graphical-lasso shrinkage prior for the covariance matrices. A fully Bayesian approach is followed for estimating the number of clusters, by treating the number of mixture components as random and implementing a trans-dimensional telescoping sampler. Alternative Bayesian approaches based on overfitting mixture models or using information criteria to select the number of components are also considered. The proposed method is compared against EM type implementation, mixtures of regressions and mixtures of experts. The method is illustrated using a set of simulation studies and a biomedical dataset.", "authors": "Papastamoulis, Perrakis"}, "https://arxiv.org/abs/2411.18978": {"title": "Warfare Ignited Price Contagion Dynamics in Early Modern Europe", "link": "https://arxiv.org/abs/2411.18978", "description": "Economic historians have long studied market integration and contagion dynamics during periods of warfare and global stress, but there is a lack of model-based evidence on these phenomena. 
This paper uses an econometric contagion model, the Diebold-Yilmaz framework, to examine the dynamics of economic shocks across European markets in the early modern period. Our findings suggest that violent conflicts, especially the Thirty Years' War, significantly increased food price spillover across cities, causing widespread disruptions across Europe. We also demonstrate the ability of this framework to capture relevant historical dynamics between the main trade centers of the period.", "authors": "Esmaili, Puma, Ludlow et al"}, "https://arxiv.org/abs/2411.19104": {"title": "Algorithmic modelling of a complex redundant multi-state system subject to multiple events, preventive maintenance, loss of units and a multiple vacation policy through a MMAP", "link": "https://arxiv.org/abs/2411.19104", "description": "A complex multi-state redundant system undergoing preventive maintenance and experiencing multiple events is being considered in a continuous time frame. The online unit is susceptible to various types of failures, both internal and external in nature, with multiple degradation levels present, both internally and externally. Random inspections are continuously monitoring these degradation levels, and if they reach a critical state, the unit is directed to a repair facility for preventive maintenance. The repair facility is managed by a single repairperson, who follows a multiple vacation policy dependent on the operational status of the units. The repairperson is responsible for two primary tasks: corrective repairs and preventive maintenance. The time durations within the system follow phase-type distributions, and the model is constructed using Markovian Arrival Processes with marked arrivals. A variety of performance measures, including transient and stationary distributions, are calculated using matrix-analytic methods. This approach enables the expression of key results and overall system behaviour in a matrix-algorithmic format. In order to optimize the model, costs and rewards are integrated into the analysis. A numerical example is presented to showcase the model's flexibility and effectiveness in real-world applications.", "authors": "Ruiz-Castro, Zapata-Ceballos"}, "https://arxiv.org/abs/2411.19138": {"title": "Nonparametric estimation on the circle based on Fej\\'er polynomials", "link": "https://arxiv.org/abs/2411.19138", "description": "This paper presents a comprehensive study of nonparametric estimation techniques on the circle using Fej\\'er polynomials, which are analogues of Bernstein polynomials for periodic functions. Building upon Fej\\'er's uniform approximation theorem, the paper introduces circular density and distribution function estimators based on Fej\\'er kernels. It establishes their theoretical properties, including uniform strong consistency and asymptotic expansions. The proposed methods are extended to account for measurement errors by incorporating classical and Berkson error models, adjusting the Fej\\'er estimator to mitigate their effects. Simulation studies analyze the finite-sample performance of these estimators under various scenarios, including mixtures of circular distributions and measurement error models. 
An application to rainfall data demonstrates the practical application of the proposed estimators, demonstrating their robustness and effectiveness in the presence of rounding-induced Berkson errors.", "authors": "Klar, Milo\\v{s}evi\\'c, Obradovi\\'c"}, "https://arxiv.org/abs/2411.19176": {"title": "A Unified Bayesian Framework for Mortality Model Selection", "link": "https://arxiv.org/abs/2411.19176", "description": "In recent years, a wide range of mortality models has been proposed to address the diverse factors influencing mortality rates, which has highlighted the need to perform model selection. Traditional mortality model selection methods, such as AIC and BIC, often require fitting multiple models independently and ranking them based on these criteria. This process can fail to account for uncertainties in model selection, which can lead to overly optimistic prediction interval, and it disregards the potential insights from combining models. To address these limitations, we propose a novel Bayesian model selection framework that integrates model selection and parameter estimation into the same process. This requires creating a model building framework that will give rise to different models by choosing different parametric forms for each term. Inference is performed using the reversible jump Markov chain Monte Carlo algorithm, which is devised to allow for transition between models of different dimensions, as is the case for the models considered here. We develop modelling frameworks for data stratified by age and period and for data stratified by age, period and product. Our results are presented in two case studies.", "authors": "Diana, Tze, Pittea"}, "https://arxiv.org/abs/2411.19184": {"title": "Flexible space-time models for extreme data", "link": "https://arxiv.org/abs/2411.19184", "description": "Extreme Value Analysis is an essential methodology in the study of rare and extreme events, which hold significant interest in various fields, particularly in the context of environmental sciences. Models that employ the exceedances of values above suitably selected high thresholds possess the advantage of capturing the \"sub-asymptotic\" dependence of data. This paper presents an extension of spatial random scale mixture models to the spatio-temporal domain. A comprehensive framework for characterizing the dependence structure of extreme events across both dimensions is provided. Indeed, the model is capable of distinguishing between asymptotic dependence and independence, both in space and time, through the use of parametric inference. The high complexity of the likelihood function for the proposed model necessitates a simulation approach based on neural networks for parameter estimation, which leverages summaries of the sub-asymptotic dependence present in the data. The effectiveness of the model in assessing the limiting dependence structure of spatio-temporal processes is demonstrated through both simulation studies and an application to rainfall datasets.", "authors": "Dell'Oro, Gaetan"}, "https://arxiv.org/abs/2411.19205": {"title": "On the application of Jammalamadaka-Jim\\'enez Gamero-Meintanis test for circular regression model assessment", "link": "https://arxiv.org/abs/2411.19205", "description": "We study a circular-circular multiplicative regression model, characterized by an angular error distribution assumed to be wrapped Cauchy. 
We propose a specification procedure for this model, focusing on adapting a recently proposed goodness-of-fit test for circular distributions. We derive its limiting properties and study the power performance of the test through extensive simulations, including the adaptation of some other well-known goodness-of-fit tests for this type of data. To emphasize the practical relevance of our methodology, we apply it to several small real-world datasets and wind direction measurements in the Black Forest region of southwestern Germany, demonstrating the power and versatility of the presented approach.", "authors": "Halaj, Klar, Milo\\v{s}evi\\'c et al"}, "https://arxiv.org/abs/2411.19225": {"title": "Sparse optimization for estimating the cross-power spectrum in linear inverse models : from theory to the application in brain connectivity", "link": "https://arxiv.org/abs/2411.19225", "description": "In this work we present a computationally efficient linear optimization approach for estimating the cross-power spectrum of a hidden multivariate stochastic process from that of another observed process. Sparsity in the resulting estimator of the cross-power is induced through $\\ell_1$ regularization and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is used for computing such an estimator. With respect to a standard implementation, we prove that a proper initialization step is sufficient to guarantee the required symmetric and antisymmetric properties of the involved quantities. Further, we show how structural properties of the forward operator can be exploited within the FISTA update in order to make our approach suitable also for large-scale problems such as those arising in the context of brain functional connectivity.\n The effectiveness of the proposed approach is shown in a practical scenario where we aim at quantifying the statistical relationships between brain regions in the context of non-invasive electromagnetic field recordings. Our results show that our method provides higher specificity than classical approaches based on a two-step procedure, in which the hidden process describing the brain activity is first estimated through a linear optimization step and the cortical cross-power spectrum is then computed from the estimated time series.", "authors": "Carini, Furci, Sommariva"}, "https://arxiv.org/abs/2411.19368": {"title": "Distribution-Free Calibration of Statistical Confidence Sets", "link": "https://arxiv.org/abs/2411.19368", "description": "Constructing valid confidence sets is a crucial task in statistical inference, yet traditional methods often face challenges when dealing with complex models or limited observed sample sizes. These challenges are frequently encountered in modern applications, such as Likelihood-Free Inference (LFI). In these settings, confidence sets may fail to maintain a confidence level close to the nominal value. In this paper, we introduce two novel methods, TRUST and TRUST++, for calibrating confidence sets to achieve distribution-free conditional coverage. These methods rely entirely on simulated data from the statistical model to perform calibration. Leveraging insights from conformal prediction techniques adapted to the statistical inference context, our methods ensure both finite-sample local coverage and asymptotic conditional coverage as the number of simulations increases, even if n is small. 
They effectively handle nuisance parameters and provide computationally efficient uncertainty quantification for the estimated confidence sets. This allows users to assess whether additional simulations are necessary for robust inference. Through theoretical analysis and experiments on models with both tractable and intractable likelihoods, we demonstrate that our methods outperform existing approaches, particularly in small-sample regimes. This work bridges the gap between conformal prediction and statistical inference, offering practical tools for constructing valid confidence sets in complex models.", "authors": "Cabezas, Soares, Ramos et al"}, "https://arxiv.org/abs/2411.19384": {"title": "Random Effects Misspecification and its Consequences for Prediction in Generalized Linear Mixed Models", "link": "https://arxiv.org/abs/2411.19384", "description": "When fitting generalized linear mixed models (GLMMs), one important decision to make relates to the choice of the random effects distribution. As the random effects are unobserved, misspecification of this distribution is a real possibility. In this article, we investigate the consequences of random effects misspecification for point prediction and prediction inference in GLMMs, a topic on which there is considerably less research compared to consequences for parameter estimation and inference. We use theory, simulation, and a real application to explore the effect of using the common normality assumption for the random effects distribution when the correct specification is a mixture of normal distributions, focusing on the impacts on point prediction, mean squared prediction errors (MSEPs), and prediction intervals. We found that the optimal shrinkage is different under the two random effect distributions, so is impacted by misspecification. The unconditional MSEPs for the random effects are almost always larger under the misspecified normal random effects distribution, especially when cluster sizes are small. Results for the MSEPs conditional on the random effects are more complicated, but they remain generally larger under the misspecified distribution when the true random effect is close to the mean of one of the component distributions in the true mixture distribution. Results for prediction intervals indicate that overall coverage probability is not greatly impacted by misspecification.", "authors": "Vu, Hui, Muller et al"}, "https://arxiv.org/abs/2411.19425": {"title": "Bayesian Hierarchical Modeling for Predicting Spatially Correlated Curves in Irregular Domains: A Case Study on PM10 Pollution", "link": "https://arxiv.org/abs/2411.19425", "description": "This study presents a Bayesian hierarchical model for analyzing spatially correlated functional data and handling irregularly spaced observations. The model uses Bernstein polynomial (BP) bases combined with autoregressive random effects, allowing for nuanced modeling of spatial correlations between sites and dependencies of observations within curves. Moreover, the proposed procedure introduces a distinct structure for the random effect component compared to previous works. Simulation studies conducted under various challenging scenarios verify the model's robustness, demonstrating its capacity to accurately recover spatially dependent curves and predict observations at unmonitored locations. The model's performance is further supported by its application to real-world data, specifically PM$_{10}$ particulate matter measurements from a monitoring network in Mexico City. 
This application is of practical importance, as particles can penetrate the respiratory system and aggravate various health conditions. The model effectively predicts concentrations at unmonitored sites, with uncertainty estimates that reflect spatial variability across the domain. This new methodology provides a flexible framework for functional data analysis (FDA) in spatial contexts and addresses challenges in analyzing irregular domains with potential applications in environmental monitoring.", "authors": "Moreno, Dias"}, "https://arxiv.org/abs/2411.19448": {"title": "Unsupervised Variable Selection for Ultrahigh-Dimensional Clustering Analysis", "link": "https://arxiv.org/abs/2411.19448", "description": "Compared to supervised variable selection, research on unsupervised variable selection lags far behind. A forward partial-variable clustering full-variable loss (FPCFL) method is proposed to address these challenges. An advantage is that the FPCFL method can distinguish active, redundant, and uninformative variables, which previous methods cannot achieve. Theoretical and simulation studies show that the performance of a clustering method using all the variables can be worse if many uninformative variables are involved. Better results are expected if the uninformative variables are excluded. The research addresses a previous concern about how variable selection affects the performance of clustering. Rather than attempting to select all the relevant variables, as many previous methods do, the proposed method selects a subset that can induce an equally good result. This phenomenon does not appear in supervised variable selection problems.", "authors": "Zhang, Huang"}, "https://arxiv.org/abs/2411.19523": {"title": "Density-Calibrated Conformal Quantile Regression", "link": "https://arxiv.org/abs/2411.19523", "description": "This paper introduces the Density-Calibrated Conformal Quantile Regression (CQR-d) method, a novel approach for constructing prediction intervals that adapts to varying uncertainty across the feature space. Building upon conformal quantile regression, CQR-d incorporates local information through a weighted combination of local and global conformity scores, where the weights are determined by local data density. We prove that CQR-d provides valid marginal coverage at level $1 - \\alpha - \\epsilon$, where $\\epsilon$ represents a small tolerance from numerical optimization. Through extensive simulation studies and an application to a heteroscedastic dataset available in R, we demonstrate that CQR-d maintains the desired coverage while producing substantially narrower prediction intervals compared to standard conformal quantile regression (CQR). Notably, in our application on heteroscedastic data, CQR-d achieves an $8.6\\%$ reduction in average interval width while maintaining comparable coverage. The method's effectiveness is particularly pronounced in settings with clear local uncertainty patterns, making it a valuable tool for prediction tasks in heterogeneous data environments.", "authors": "Lu"}, "https://arxiv.org/abs/2411.19572": {"title": "Canonical correlation analysis of stochastic trends via functional approximation", "link": "https://arxiv.org/abs/2411.19572", "description": "This paper proposes a novel canonical correlation analysis for semiparametric inference in $I(1)/I(0)$ systems via functional approximation. 
The approach can be applied coherently to panels of $p$ variables with a generic number $s$ of stochastic trends, as well as to subsets or aggregations of variables. This study discusses inferential tools on $s$ and on the loading matrix $\\psi$ of the stochastic trends (and on their duals $r$ and $\\beta$, the cointegration rank and the cointegrating matrix): asymptotically pivotal test sequences and consistent estimators of $s$ and $r$, $T$-consistent, mixed Gaussian and efficient estimators of $\\psi$ and $\\beta$, Wald tests thereof, and misspecification tests for checking model assumptions. Monte Carlo simulations show that these tools have reliable performance uniformly in $s$ for small, medium and large-dimensional systems, with $p$ ranging from 10 to 300. An empirical analysis of 20 exchange rates illustrates the methods.", "authors": "Franchi, Georgiev, Paruolo"}, "https://arxiv.org/abs/2411.19633": {"title": "Isotropy testing in spatial point patterns: nonparametric versus parametric replication under misspecification", "link": "https://arxiv.org/abs/2411.19633", "description": "Several hypothesis testing methods have been proposed to validate the assumption of isotropy in spatial point patterns. A majority of these methods are characterised by an unknown distribution of the test statistic under the null hypothesis of isotropy. Parametric approaches to approximating the distribution involve simulation of patterns from a user-specified isotropic model. Alternatively, nonparametric replicates of the test statistic under isotropy can be used to waive the need for specifying a model. In this paper, we first develop a general framework which allows for the integration of a selected nonparametric replication method into isotropy testing. We then conduct a large simulation study comprising application-like scenarios to assess the performance of tests with different parametric and nonparametric replication methods. In particular, we explore distortions in test size and power caused by model misspecification, and demonstrate the advantages of nonparametric replication in such scenarios.", "authors": "Pypkowski, Sykulski, Martin"}, "https://arxiv.org/abs/2411.19789": {"title": "Adjusting auxiliary variables under approximate neighborhood interference", "link": "https://arxiv.org/abs/2411.19789", "description": "Randomized experiments are the gold standard for causal inference. However, traditional assumptions, such as the Stable Unit Treatment Value Assumption (SUTVA), often fail in real-world settings where interference between units is present. Network interference, in particular, has garnered significant attention. Structural models, like the linear-in-means model, are commonly used to describe interference; but they rely on the correct specification of the model, which can be restrictive. Recent advancements in the literature, such as the Approximate Neighborhood Interference (ANI) framework, offer more flexible approaches by assuming negligible interference from distant units. In this paper, we introduce a general framework for regression adjustment for the network experiments under the ANI assumption. This framework expands traditional regression adjustment by accounting for imbalances in network-based covariates, ensuring precision improvement, and providing shorter confidence intervals. 
We establish the validity of our approach using a design-based inference framework, which relies solely on randomization of treatment assignments for inference without requiring correctly specified outcome models.", "authors": "Lu, Wang, Zhang"}, "https://arxiv.org/abs/2411.19871": {"title": "Thompson, Ulam, or Gauss? Multi-criteria recommendations for posterior probability computation methods in Bayesian response-adaptive trials", "link": "https://arxiv.org/abs/2411.19871", "description": "To implement a Bayesian response-adaptive trial it is necessary to evaluate a sequence of posterior probabilities. This sequence is often approximated by simulation due to the unavailability of closed-form formulae to compute it exactly. Approximating these probabilities by simulation can be computationally expensive and impact the accuracy or the range of scenarios that may be explored. An alternative approximation method based on Gaussian distributions can be faster but its accuracy is not guaranteed. The literature lacks practical recommendations for selecting approximation methods and comparing their properties, particularly considering trade-offs between computational speed and accuracy. In this paper, we focus on the case where the trial has a binary endpoint with Beta priors. We first outline an efficient way to compute the posterior probabilities exactly for any number of treatment arms. Then, using exact probability computations, we show how to benchmark calculation methods based on considerations of computational speed, patient benefit, and inferential accuracy. This is done through a range of simulations in the two-armed case, as well as an analysis of the three-armed Established Status Epilepticus Treatment Trial. Finally, we provide practical guidance for which calculation method is most appropriate in different settings, and how to choose the number of simulations if the simulation-based approximation method is used.", "authors": "Kaddaj, Pin, Baas et al"}, "https://arxiv.org/abs/2411.19933": {"title": "Transfer Learning for High-dimensional Quantile Regression with Distribution Shift", "link": "https://arxiv.org/abs/2411.19933", "description": "Information from related source studies can often enhance the findings of a target study. However, the distribution shift between target and source studies can severely impact the efficiency of knowledge transfer. In the high-dimensional regression setting, existing transfer approaches mainly focus on the parameter shift. In this paper, we focus on the high-dimensional quantile regression with knowledge transfer under three types of distribution shift: parameter shift, covariate shift, and residual shift. We propose a novel transferable set and a new transfer framework to address the above three discrepancies. Non-asymptotic estimation error bounds and source detection consistency are established to validate the availability and superiority of our method in the presence of distribution shift. Additionally, an orthogonal debiased approach is proposed for statistical inference with knowledge transfer, leading to sharper asymptotic results. 
Extensive simulation results as well as real data applications further demonstrate the effectiveness of our proposed procedure.", "authors": "Bai, Zhang, Yang et al"}, "https://arxiv.org/abs/2411.18830": {"title": "Double Descent in Portfolio Optimization: Dance between Theoretical Sharpe Ratio and Estimation Accuracy", "link": "https://arxiv.org/abs/2411.18830", "description": "We study the relationship between model complexity and out-of-sample performance in the context of mean-variance portfolio optimization. Representing model complexity by the number of assets, we find that the performance of low-dimensional models initially improves with complexity but then declines due to overfitting. As model complexity becomes sufficiently high, the performance improves with complexity again, resulting in a double ascent Sharpe ratio curve similar to the double descent phenomenon observed in artificial intelligence. The underlying mechanisms involve an intricate interaction between the theoretical Sharpe ratio and estimation accuracy. In high-dimensional models, the theoretical Sharpe ratio approaches its upper limit, and the overfitting problem is reduced because there are more parameters than data restrictions, which allows us to choose well-behaved parameters based on inductive bias.", "authors": "Lu, Yang, Zhang"}, "https://arxiv.org/abs/2411.18986": {"title": "ZIPG-SK: A Novel Knockoff-Based Approach for Variable Selection in Multi-Source Count Data", "link": "https://arxiv.org/abs/2411.18986", "description": "The rapid development of sequencing technology has generated complex, highly skewed, and zero-inflated multi-source count data. This has posed significant challenges in variable selection, which is crucial for uncovering shared disease mechanisms, such as tumor development and metabolic dysregulation. In this study, we propose a novel variable selection method called Zero-Inflated Poisson-Gamma based Simultaneous knockoff (ZIPG-SK) for multi-source count data. To address the highly skewed and zero-inflated properties of count data, we introduce a Gaussian copula based on the ZIPG distribution for constructing knockoffs, while also incorporating the information of covariates. This method successfully detects common features related to the results in multi-source data while controlling the false discovery rate (FDR). Additionally, our proposed method effectively combines e-values to enhance power. Extensive simulations demonstrate the superiority of our method over Simultaneous Knockoff and other existing methods in processing count data, as it improves power across different scenarios. Finally, we validated the method by applying it to two real-world multi-source datasets: colorectal cancer (CRC) and type 2 diabetes (T2D). The identified variable characteristics are consistent with existing studies and provided additional insights.", "authors": "Tang, Mao, Ma et al"}, "https://arxiv.org/abs/2411.19223": {"title": "On the Unknowable Limits to Prediction", "link": "https://arxiv.org/abs/2411.19223", "description": "This short Correspondence critiques the classic dichotomization of prediction error into reducible and irreducible components, noting that certain types of error can be eliminated at differential speeds. 
We propose an improved analytical framework that better distinguishes epistemic from aleatoric uncertainty, emphasizing that predictability depends on information sets and cautioning against premature claims of unpredictability.", "authors": "Yan, Rahal"}, "https://arxiv.org/abs/2411.19878": {"title": "Nonparametric Estimation for a Log-concave Distribution Function with Interval-censored Data", "link": "https://arxiv.org/abs/2411.19878", "description": "We consider the nonparametric maximum likelihood estimation for the underlying event time based on mixed-case interval-censored data, under a log-concavity assumption on its distribution function. This generalized framework relaxes the assumptions of a log-concave density function or a concave distribution function considered in the literature. A log-concave distribution function is fulfilled by many common parametric families in survival analysis and also allows for multi-modal and heavy-tailed distributions. We establish the existence, uniqueness and consistency of the log-concave nonparametric maximum likelihood estimator. A computationally efficient procedure that combines an active set algorithm with the iterative convex minorant algorithm is proposed. Numerical studies demonstrate the advantages of incorporating additional shape constraint compared to the unconstrained nonparametric maximum likelihood estimator. The results also show that our method achieves a balance between efficiency and robustness compared to assuming log-concavity in the density. An R package iclogcondist is developed to implement our proposed method.", "authors": "Chu, Ling, Yuan"}, "https://arxiv.org/abs/1902.07770": {"title": "Cross Validation for Penalized Quantile Regression with a Case-Weight Adjusted Solution Path", "link": "https://arxiv.org/abs/1902.07770", "description": "Cross validation is widely used for selecting tuning parameters in regularization methods, but it is computationally intensive in general. To lessen its computational burden, approximation schemes such as generalized approximate cross validation (GACV) are often employed. However, such approximations may not work well when non-smooth loss functions are involved. As a case in point, approximate cross validation schemes for penalized quantile regression do not work well for extreme quantiles. In this paper, we propose a new algorithm to compute the leave-one-out cross validation scores exactly for quantile regression with ridge penalty through a case-weight adjusted solution path. Resorting to the homotopy technique in optimization, we introduce a case weight for each individual data point as a continuous embedding parameter and decrease the weight gradually from one to zero to link the estimators based on the full data and those with a case deleted. This allows us to design a solution path algorithm to compute all leave-one-out estimators very efficiently from the full-data solution. We show that the case-weight adjusted solution path is piecewise linear in the weight parameter, and using the solution path, we examine case influences comprehensively and observe that different modes of case influences emerge, depending on the specified quantiles, data dimensions and penalty parameter. 
We further illustrate the utility of the proposed algorithm in real-world applications.", "authors": "Tu, Zhu, Lee et al"}, "https://arxiv.org/abs/2206.12152": {"title": "Estimation and Inference in High-Dimensional Panel Data Models with Interactive Fixed Effects", "link": "https://arxiv.org/abs/2206.12152", "description": "We develop new econometric methods for estimation and inference in high-dimensional panel data models with interactive fixed effects. Our approach can be regarded as a non-trivial extension of the very popular common correlated effects (CCE) approach. Roughly speaking, we proceed as follows: We first construct a projection device to eliminate the unobserved factors from the model by applying a dimensionality reduction transform to the matrix of cross-sectionally averaged covariates. The unknown parameters are then estimated by applying lasso techniques to the projected model. For inference purposes, we derive a desparsified version of our lasso-type estimator. While the original CCE approach is restricted to the low-dimensional case where the number of regressors is small and fixed, our methods can deal with both low- and high-dimensional situations where the number of regressors is large and may even exceed the overall sample size. We derive theory for our estimation and inference methods both in the large-T-case, where the time series length T tends to infinity, and in the small-T-case, where T is a fixed natural number. Specifically, we derive the convergence rate of our estimator and show that its desparsified version is asymptotically normal under suitable regularity conditions. The theoretical analysis of the paper is complemented by a simulation study and an empirical application to characteristic based asset pricing.", "authors": "Linton, Ruecker, Vogt et al"}, "https://arxiv.org/abs/2310.09257": {"title": "Reconstruct Ising Model with Global Optimality via SLIDE", "link": "https://arxiv.org/abs/2310.09257", "description": "The reconstruction of interaction networks between random events is a critical problem arising from statistical physics and politics, sociology, biology, psychology, and beyond. The Ising model lays the foundation for this reconstruction process, but finding the underlying Ising model from the least amount of observed samples in a computationally efficient manner has been historically challenging for half a century. Using sparsity learning, we present an approach named SLIDE whose sample complexity is globally optimal. Furthermore, a tuning-free algorithm is developed to give a statistically consistent solution of SLIDE in polynomial time with high probability. On extensive benchmarked cases, the SLIDE approach demonstrates dominant performance in reconstructing underlying Ising models, confirming its superior statistical properties. The application on the U.S. senators voting in the last six congresses reveals that both the Republicans and Democrats noticeably assemble in each congress; interestingly, the assembling of Democrats is particularly pronounced in the latest congress.", "authors": "Chen, Zhu, Zhu et al"}, "https://arxiv.org/abs/2412.00066": {"title": "New Axioms for Dependence Measurement and Powerful Tests", "link": "https://arxiv.org/abs/2412.00066", "description": "We build a context-free, comprehensive, flexible, and sound footing for measuring the dependence of two variables based on three new axioms, updating Renyi's (1959) seven postulates. 
We illustrate the superior footing of axioms by Vinod's (2014) asymmetric matrix of generalized correlation coefficients R*. We list five limitations explaining the poorer footing of axiom-failing Hellinger correlation proposed in 2022. We also describe a new implementation of a one-sided test with Taraldsen's (2021) exact density. This paper provides a new table for more powerful one-sided tests using the exact Taraldsen density and includes a published example where using Taraldsen's method makes a practical difference. The code to implement all our proposals is in R packages.", "authors": "Vinod"}, "https://arxiv.org/abs/2412.00106": {"title": "Scalable computation of the maximum flow in large brain connectivity networks", "link": "https://arxiv.org/abs/2412.00106", "description": "We are interested in computing an approximation of the maximum flow in large (brain) connectivity networks. The maximum flow in such networks is of interest in order to better understand the routing of information in the human brain. However, the runtime of $O(|V||E|^2)$ for the classic Edmonds-Karp algorithm renders computations of the maximum flow on networks with millions of vertices infeasible, where $V$ is the set of vertices and $E$ is the set of edges. In this contribution, we propose a new Monte Carlo algorithm which is capable of computing an approximation of the maximum flow in networks with millions of vertices via subsampling. Apart from giving a point estimate of the maximum flow, our algorithm also returns valid confidence bounds for the true maximum flow. Importantly, its runtime only scales as $O(B \\cdot |\\tilde{V}| |\\tilde{E}|^2)$, where $B$ is the number of Monte Carlo samples, $\\tilde{V}$ is the set of subsampled vertices, and $\\tilde{E}$ is the edge set induced by $\\tilde{V}$. Choosing $B \\in O(|V|)$ and $|\\tilde{V}| \\in O(\\sqrt{|V|})$ (implying $|\\tilde{E}| \\in O(|V|)$) yields an algorithm with runtime $O(|V|^{3.5})$ while still guaranteeing the usual \"root-n\" convergence of the confidence interval of the maximum flow estimate. We evaluate our proposed algorithm with respect to both accuracy and runtime on simulated graphs as well as graphs downloaded from the Brain Networks Data Repository (https://networkrepository.com).", "authors": "Qian, Hahn"}, "https://arxiv.org/abs/2412.00228": {"title": "A Doubly Robust Method to Counteract Outcome-Dependent Selection Bias in Multi-Cohort EHR Studies", "link": "https://arxiv.org/abs/2412.00228", "description": "Selection bias can hinder accurate estimation of association parameters in binary disease risk models using non-probability samples like electronic health records (EHRs). The issue is compounded when participants are recruited from multiple clinics or centers with varying selection mechanisms that may depend on the disease or outcome of interest. Traditional inverse-probability-weighted (IPW) methods, based on constructed parametric selection models, often struggle with misspecifications when selection mechanisms vary across cohorts. This paper introduces a new Joint Augmented Inverse Probability Weighted (JAIPW) method, which integrates individual-level data from multiple cohorts collected under potentially outcome-dependent selection mechanisms, with data from an external probability sample. JAIPW offers double robustness by incorporating a flexible auxiliary score model to address potential misspecifications in the selection models. 
We outline the asymptotic properties of the JAIPW estimator, and our simulations reveal that JAIPW achieves up to five times lower relative bias and three times lower root mean square error (RMSE) compared to the best performing joint IPW methods under scenarios with misspecified selection models. Applying JAIPW to the Michigan Genomics Initiative (MGI), a multi-clinic EHR-linked biobank, combined with external national probability samples, resulted in cancer-sex association estimates more closely aligned with national estimates. We also analyzed the association between cancer and polygenic risk scores (PRS) in MGI to illustrate a situation where the exposure is not available in the external probability sample.", "authors": "Kundu, Shi, Salvatore et al"}, "https://arxiv.org/abs/2412.00233": {"title": "Peer Effects and Herd Behavior: An Empirical Study Based on the \"Double 11\" Shopping Festival", "link": "https://arxiv.org/abs/2412.00233", "description": "This study employs a Bayesian Probit model to empirically analyze peer effects and herd behavior among consumers during the \"Double 11\" shopping festival, using data collected through a questionnaire survey. The results demonstrate that peer effects significantly influence consumer decision-making, with the probability of participation in the shopping event increasing notably when roommates are involved. Additionally, factors such as gender, online shopping experience, and fashion consciousness significantly impact consumers' herd behavior. This research not only enhances the understanding of online shopping behavior among college students but also provides empirical evidence for e-commerce platforms to formulate targeted marketing strategies. Finally, the study discusses the fragility of online consumption activities, the need for adjustments in corporate marketing strategies, and the importance of promoting a healthy online culture.", "authors": "Wang"}, "https://arxiv.org/abs/2412.00280": {"title": "Benchmarking covariates balancing methods, a simulation study", "link": "https://arxiv.org/abs/2412.00280", "description": "Causal inference in observational studies has advanced significantly since Rosenbaum and Rubin introduced propensity score methods. Inverse probability of treatment weighting (IPTW) is widely used to handle confounding bias. However, newer methods, such as energy balancing (EB), kernel optimal matching (KOM), and covariate balancing propensity score by tailored loss function (TLF), offer model-free or non-parametric alternatives. Despite these developments, guidance remains limited in selecting the most suitable method for treatment effect estimation in practical applications. This study compares IPTW with EB, KOM, and TLF, focusing on their ability to estimate treatment effects since this is the primary objective in many applications. Monte Carlo simulations are used to assess the ability of these balancing methods combined with different estimators to estimate the average treatment effect. We compare these methods across a range of scenarios varying the sample size, the level of confounding, and the proportion of treated units. In our simulation, we observe no significant advantages in using EB, KOM, or TLF methods over IPTW. Moreover, these recent methods make obtaining confidence intervals with nominal coverage difficult. 
We also compare the methods on the PROBITsim dataset and get results similar to those of our simulations.", "authors": "Peyrot, Porcher, Petit"}, "https://arxiv.org/abs/2412.00304": {"title": "Sparse Bayesian Factor Models with Mass-Nonlocal Factor Scores", "link": "https://arxiv.org/abs/2412.00304", "description": "Bayesian factor models are widely used for dimensionality reduction and pattern discovery in high-dimensional datasets across diverse fields. These models typically focus on imposing priors on factor loadings to induce sparsity and improve interpretability. However, factor scores, which play a critical role in individual-level associations with factors, have received less attention and are assumed to follow a standard multivariate normal distribution. This oversimplification fails to capture the heterogeneity observed in real-world applications. We propose the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN), a novel framework that addresses these limitations by introducing a mass-nonlocal prior for factor scores. This prior provides a more flexible posterior distribution that captures individual heterogeneity while assigning positive probability to zero values. The zero entries in the score matrix characterize the sparsity, offering a robust and novel approach for determining the optimal number of factors. Model parameters are estimated using a fast and efficient Gibbs sampler. Extensive simulations demonstrate that BFMAN outperforms standard Bayesian sparse factor models in factor recovery, sparsity detection, and score estimation. We apply BFMAN to the Hispanic Community Health Study/Study of Latinos and identify dietary patterns and their associations with cardiovascular outcomes, showcasing the model's ability to uncover meaningful dietary insights.", "authors": "Zorzetto, Huang, Vito"}, "https://arxiv.org/abs/2412.00311": {"title": "Disentangling The Effects of Air Pollution on Social Mobility: A Bayesian Principal Stratification Approach", "link": "https://arxiv.org/abs/2412.00311", "description": "Principal stratification provides a robust framework for causal inference, enabling the investigation of the causal link between air pollution exposure and social mobility, mediated by the education level. Studying the causal mechanisms through which air pollution affects social mobility is crucial for highlighting the role of education as a mediator and for offering evidence that can inform policies aimed at reducing both environmental and educational inequalities for more equitable social outcomes. In this paper, we introduce a novel Bayesian nonparametric model for principal stratification, leveraging the dependent Dirichlet process to flexibly model the distribution of potential outcomes. By incorporating confounders and potential outcomes for the post-treatment variable in the Bayesian mixture model for the final outcome, our approach improves the accuracy of missing data imputation and allows for the characterization of treatment effects. 
We assess the performance of our method through a simulation study and demonstrate its application in evaluating the principal causal effects of air pollution on social mobility in the United States.", "authors": "Zorzetto, Torre, Petrone et al"}, "https://arxiv.org/abs/2412.00607": {"title": "Risk models from tree-structured Markov random fields following multivariate Poisson distributions", "link": "https://arxiv.org/abs/2412.00607", "description": "We propose risk models for a portfolio of risks, each following a compound Poisson distribution, with dependencies introduced through a family of tree-based Markov random fields with Poisson marginal distributions inspired in C\\^ot\\'e et al. (2024b, arXiv:2408.13649). The diversity of tree topologies allows for the construction of risk models under several dependence schemes. We study the distribution of the random vector of risks and of the aggregate claim amount of the portfolio. We perform two risk management tasks: the assessment of the global risk of the portfolio and its allocation to each component. Numerical examples illustrate the findings and the efficiency of the computation methods developed throughout. We also show that the discussed family of Markov random fields is a subfamily of the multivariate Poisson distribution constructed through common shocks.", "authors": "Cossette, C\\^ot\\'e, Dubeau et al"}, "https://arxiv.org/abs/2412.00634": {"title": "Optimization of Delivery Routes for Fresh E-commerce in Pre-warehouse Mode", "link": "https://arxiv.org/abs/2412.00634", "description": "With the development of the economy, fresh food e-commerce has experienced rapid growth. One of the core competitive advantages of fresh food e-commerce platforms lies in selecting an appropriate logistics distribution model. This study focuses on the front warehouse model, aiming to minimize distribution costs. Considering the perishable nature and short shelf life of fresh food, a distribution route optimization model is constructed, and the saving mileage method is designed to determine the optimal distribution scheme. The results indicate that under certain conditions, different distribution schemes significantly impact the performance of fresh food e-commerce platforms. Based on a review of domestic and international research, this paper takes Dingdong Maicai as an example to systematically introduce the basic concepts of distribution route optimization in fresh food e-commerce platforms under the front warehouse model, analyze the advantages of logistics distribution, and thoroughly examine the importance of distribution routes for fresh products.", "authors": "Harward, Lin, Wang et al"}, "https://arxiv.org/abs/2412.00885": {"title": "Bayesian feature selection in joint models with application to a cardiovascular disease cohort study", "link": "https://arxiv.org/abs/2412.00885", "description": "Cardiovascular disease (CVD) cohorts collect data longitudinally to study the association between CVD risk factors and event times. An important area of scientific research is to better understand what features of CVD risk factor trajectories are associated with the disease. We develop methods for feature selection in joint models where feature selection is viewed as a bi-level variable selection problem with multiple features nested within multiple longitudinal risk factors. 
We modify a previously proposed Bayesian sparse group selection (BSGS) prior, which has not been implemented in joint models until now, to better represent prior beliefs when selecting features both at the group level (longitudinal risk factor) and within group (features of a longitudinal risk factor). One of the advantages of our method over the BSGS method is the ability to account for correlation among the features within a risk factor. As a result, it selects important features similarly, but excludes the unimportant features within risk factors more efficiently than BSGS. We evaluate our prior via simulations and apply our method to data from the Atherosclerosis Risk in Communities (ARIC) study, a population-based, prospective cohort study consisting of over 15,000 men and women aged 45-64, measured at baseline and at six additional times. We evaluate which CVD risk factors and which characteristics of their trajectories (features) are associated with death from CVD. We find that systolic and diastolic blood pressure, glucose, and total cholesterol are important risk factors with different important features associated with CVD death in both men and women.", "authors": "Islam, Daniels, Aghabazaz et al"}, "https://arxiv.org/abs/2412.00926": {"title": "A sensitivity analysis approach to principal stratification with a continuous longitudinal intermediate outcome: Applications to a cohort stepped wedge trial", "link": "https://arxiv.org/abs/2412.00926", "description": "Causal inference in the presence of intermediate variables is a challenging problem in many applications. Principal stratification (PS) provides a framework to estimate principal causal effects (PCE) in such settings. However, existing PS methods primarily focus on settings with binary intermediate variables. We propose a novel approach to estimate PCE with continuous intermediate variables in the context of stepped wedge cluster randomized trials (SW-CRTs). Our method leverages the time-varying treatment assignment in SW-CRTs to calibrate sensitivity parameters and identify the PCE under realistic assumptions. We demonstrate the application of our approach using data from a cohort SW-CRT evaluating the effect of a crowdsourcing intervention on HIV testing uptake among men who have sex with men in China, with social norms as a continuous intermediate variable. The proposed methodology expands the scope of PS to accommodate continuous variables and provides a practical tool for causal inference in SW-CRTs.", "authors": "Yang, Daniels, Li"}, "https://arxiv.org/abs/2412.00945": {"title": "Generalized spatial autoregressive model", "link": "https://arxiv.org/abs/2412.00945", "description": "This paper presents the generalized spatial autoregression (GSAR) model, a significant advance in spatial econometrics for non-normal response variables belonging to the exponential family. The GSAR model extends the logistic SAR, probit SAR, and Poisson SAR approaches by offering greater flexibility in modeling spatial dependencies while ensuring computational feasibility. Fundamentally, theoretical results are established on the convergence, efficiency, and consistency of the estimates obtained by the model. In addition, it improves the statistical properties of existing methods and extends them to new distributions. Simulation samples show the theoretical results and allow a visual comparison with existing methods. An empirical application is made to Republican voting patterns in the United States. 
The GSAR model outperforms standard spatial models by capturing nuanced spatial autocorrelation and accommodating regional heterogeneity, leading to more robust inferences. These findings underline the potential of the GSAR model as an analytical tool for researchers working with categorical or count data or skewed distributions with spatial dependence in diverse domains, such as political science, epidemiology, and market research. In addition, R code for estimating the model is provided, which allows it to be adapted to these scenarios.", "authors": "Cruz, Toloza-Delgado, Melo"}, "https://arxiv.org/abs/2412.01008": {"title": "Multiple Testing in Generalized Universal Inference", "link": "https://arxiv.org/abs/2412.01008", "description": "Compared to p-values, e-values provably guarantee safe, valid inference. If the goal is to test multiple hypotheses simultaneously, one can construct e-values for each individual test and then use the recently developed e-BH procedure to properly correct for multiplicity. Standard e-value constructions, however, require distributional assumptions that may not be justifiable. This paper demonstrates that the generalized universal inference framework can be used along with the e-BH procedure to control frequentist error rates in multiple testing when the quantities of interest are minimizers of risk functions, thereby avoiding the need for distributional assumptions. We demonstrate the validity and power of this approach via a simulation study, testing the significance of a predictor in quantile regression.", "authors": "Dey, Martin, Williams"}, "https://arxiv.org/abs/2412.01030": {"title": "Iterative Distributed Multinomial Regression", "link": "https://arxiv.org/abs/2412.01030", "description": "This article introduces an iterative distributed computing estimator for the multinomial logistic regression model with large choice sets. Compared to the maximum likelihood estimator, the proposed iterative distributed estimator achieves significantly faster computation and, when initialized with a consistent estimator, attains asymptotic efficiency under a weak dominance condition. Additionally, we propose a parametric bootstrap inference procedure based on the iterative distributed estimator and establish its consistency. Extensive simulation studies validate the effectiveness of the proposed methods and highlight the computational efficiency of the iterative distributed estimator.", "authors": "Fan, Okar, Shi"}, "https://arxiv.org/abs/2412.01084": {"title": "Stochastic Search Variable Selection for Bayesian Generalized Linear Mixed Effect Models", "link": "https://arxiv.org/abs/2412.01084", "description": "Variable selection remains a difficult problem, especially for generalized linear mixed models (GLMMs). While some frequentist approaches exist for the simultaneous selection of fixed and random effects, primarily through the use of penalization, approaches for Bayesian GLMMs are available only for special cases, such as logistic regression. In this work, we apply the Stochastic Search Variable Selection (SSVS) approach for the joint selection of fixed and random effects proposed in Yang et al. (2020) for linear mixed models to Bayesian GLMMs. We show that while computational issues remain, SSVS serves as a feasible and effective approach to jointly select fixed and random effects. We demonstrate the effectiveness of the proposed methodology on both simulated and real data. 
Furthermore, we study the role hyperparameters play in the model selection.", "authors": "Ding, Laga"}, "https://arxiv.org/abs/2412.01208": {"title": "Locally robust semiparametric estimation of sample selection models without exclusion restrictions", "link": "https://arxiv.org/abs/2412.01208", "description": "Existing identification and estimation methods for semiparametric sample selection models rely heavily on exclusion restrictions. However, it is difficult in practice to find a credible excluded variable that has a correlation with selection but no correlation with the outcome. In this paper, we establish a new identification result for a semiparametric sample selection model without the exclusion restriction. The key identifying assumptions are nonlinearity on the selection equation and linearity on the outcome equation. The difference in the functional form plays the role of an excluded variable and provides identification power. According to the identification result, we propose to estimate the model by a partially linear regression with a nonparametrically generated regressor. To accommodate modern machine learning methods in generating the regressor, we construct an orthogonalized moment by adding the first-step influence function and develop a locally robust estimator by solving the cross-fitted orthogonalized moment condition. We prove root-n-consistency and asymptotic normality of the proposed estimator under mild regularity conditions. A Monte Carlo simulation shows the satisfactory performance of the estimator in finite samples, and an application to wage regression illustrates its usefulness in the absence of exclusion restrictions.", "authors": "Pan, Zhang"}, "https://arxiv.org/abs/2412.01302": {"title": "The Deep Latent Position Block Model For The Block Clustering And Latent Representation Of Networks", "link": "https://arxiv.org/abs/2412.01302", "description": "The increased quantity of data has led to a soaring use of networks to model relationships between different objects, represented as nodes. Since the number of nodes can be particularly large, the network information must be summarised through node clustering methods. In order to make the results interpretable, a relevant visualisation of the network is also required. To tackle both issues, we propose a new methodology called deep latent position block model (Deep LPBM) which simultaneously provides a network visualisation coherent with block modelling, allowing a clustering more general than community detection methods, as well as a continuous representation of nodes in a latent space given by partial membership vectors. Our methodology is based on a variational autoencoder strategy, relying on a graph convolutional network, with a specifically designed decoder. The inference is done using both variational and stochastic approximations. In order to efficiently select the number of clusters, we provide a comparison of three model selection criteria. An extensive benchmark as well as an evaluation of the partial memberships are provided. 
We conclude with an analysis of a French political blogosphere network and a comparison with another methodology to illustrate the insights provided by Deep LPBM results.", "authors": "(SU, LPSM), (UCA et al"}, "https://arxiv.org/abs/2412.01367": {"title": "From rotational to scalar invariance: Enhancing identifiability in score-driven factor models", "link": "https://arxiv.org/abs/2412.01367", "description": "We show that, for a certain class of scaling matrices including the commonly used inverse square-root of the conditional Fisher Information, score-driven factor models are identifiable up to a multiplicative scalar constant under very mild restrictions. This result has no analogue in parameter-driven models, as it exploits the different structure of the score-driven factor dynamics. Consequently, score-driven models offer a clear advantage in terms of economic interpretability compared to parameter-driven factor models, which are identifiable only up to orthogonal transformations. Our restrictions are order-invariant and can be generalized to score-driven factor models with dynamic loadings and nonlinear factor models. We extensively test the identification strategy using simulated and real data. The empirical analysis on financial and macroeconomic data reveals a substantial increase of log-likelihood ratios and significantly improved out-of-sample forecast performance when switching from the classical restrictions adopted in the literature to our more flexible specifications.", "authors": "Buccheri, Corsi, Dzuverovic"}, "https://arxiv.org/abs/2412.01464": {"title": "Nonparametric directional variogram estimation in the presence of outlier blocks", "link": "https://arxiv.org/abs/2412.01464", "description": "This paper proposes robust estimators of the variogram, a statistical tool that is commonly used in geostatistics to capture the spatial dependence structure of data. The new estimators are based on the highly robust minimum covariance determinant estimator and estimate the directional variogram for several lags jointly. Simulations and breakdown considerations confirm the good robustness properties of the new estimators. While Genton's estimator based on the robust estimation of the variance of pairwise sums and differences performs well in the case of isolated outliers, the new estimators based on robust estimation of multivariate variance and covariance matrices outperform the established alternatives in the presence of outlier blocks in the data. The methods are illustrated by an application to satellite data, where outlier blocks may occur because of, e.g., clouds.", "authors": "Gierse, Fried"}, "https://arxiv.org/abs/2412.01603": {"title": "A Dimension-Agnostic Bootstrap Anderson-Rubin Test For Instrumental Variable Regressions", "link": "https://arxiv.org/abs/2412.01603", "description": "Weak-identification-robust Anderson-Rubin (AR) tests for instrumental variable (IV) regressions are typically developed separately depending on whether the number of IVs is treated as fixed or increasing with the sample size. These tests rely on distinct test statistics and critical values. To apply them, researchers are forced to take a stance on the asymptotic behavior of the number of IVs, which can be ambiguous when the number is moderate. In this paper, we propose a bootstrap-based, dimension-agnostic AR test. 
By deriving strong approximations for the test statistic and its bootstrap counterpart, we show that our new test has a correct asymptotic size regardless of whether the number of IVs is fixed or increasing -- allowing, but not requiring, the number of IVs to exceed the sample size. We also analyze the power properties of the proposed uniformly valid test under both fixed and increasing numbers of IVs.", "authors": "Lim, Wang, Zhang"}, "https://arxiv.org/abs/2412.00658": {"title": "Probabilistic Predictions of Option Prices Using Multiple Sources of Data", "link": "https://arxiv.org/abs/2412.00658", "description": "A new modular approximate Bayesian inferential framework is proposed that enables fast calculation of probabilistic predictions of future option prices. We exploit multiple information sources, including daily spot returns, high-frequency spot data and option prices. A benefit of this modular Bayesian approach is that it allows us to work with the theoretical option pricing model, without needing to specify an arbitrary statistical model that links the theoretical prices to their observed counterparts. We show that our approach produces accurate probabilistic predictions of option prices in realistic scenarios and, despite not explicitly modelling pricing errors, the method is shown to be robust to their presence. Predictive accuracy based on the Heston stochastic volatility model, with predictions produced via rapid real-time updates, is illustrated empirically for short-maturity options.", "authors": "Maneesoonthorn, Frazier, Martin"}, "https://arxiv.org/abs/2412.00753": {"title": "The ecological forecast horizon revisited: Potential, actual and relative system predictability", "link": "https://arxiv.org/abs/2412.00753", "description": "Ecological forecasts are model-based statements about currently unknown ecosystem states in time or space. For a model forecast to be useful to inform decision-makers, model validation and verification determine adequateness. The measure of forecast goodness that can be translated into a limit up to which a forecast is acceptable is known as the `forecast horizon'. While verification of meteorological models follows strict criteria with established metrics and forecast horizons, assessments of ecological forecasting models still remain experiment-specific and forecast horizons are rarely reported. As such, users of ecological forecasts remain uninformed of how far into the future statements can be trusted. In this work, we synthesise existing approaches, define empirical forecast horizons in a unified framework for assessing ecological predictability and offer recipes on their computation. We distinguish upper and lower boundary estimates of predictability limits, reflecting the model's potential and actual forecast horizon, and show how a benchmark model can help determine its relative forecast horizon. The approaches are demonstrated with four case studies from population, ecosystem, and earth system research.", "authors": "Wesselkamp, Albrecht, Pinnington et al"}, "https://arxiv.org/abs/2412.01344": {"title": "Practical Performative Policy Learning with Strategic Agents", "link": "https://arxiv.org/abs/2412.01344", "description": "This paper studies the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes, inducing an endogenous distribution shift. 
There has been growing interest in training machine learning models in strategic environments, including strategic classification and performative prediction. However, existing approaches often rely on restrictive parametric assumptions: micro-level utility models in strategic classification and macro-level data distribution maps in performative prediction, severely limiting scalability and generalizability. We approach this problem as a complex causal inference task, relaxing parametric assumptions on both micro-level agent behavior and macro-level data distribution. Leveraging bounded rationality, we uncover a practical low-dimensional structure in distribution shifts and construct an effective mediator in the causal path from the deployed model to the shifted data. We then propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map. Our algorithm efficiently utilizes batch feedback and limited manipulation patterns. Our approach achieves high sample efficiency compared to methods reliant on bandit feedback or zero-order optimization. We also provide theoretical guarantees for algorithmic convergence. Extensive and challenging experiments on high-dimensional settings demonstrate our method's practical efficacy.", "authors": "Chen, Chen, Li"}, "https://arxiv.org/abs/2412.01399": {"title": "Navigating Challenges in Spatio-temporal Modelling of Antarctic Krill Abundance: Addressing Zero-inflated Data and Misaligned Covariates", "link": "https://arxiv.org/abs/2412.01399", "description": "Antarctic krill (Euphausia superba) are among the most abundant species on our planet and serve as a vital food source for many marine predators in the Southern Ocean. In this paper, we utilise statistical spatio-temporal methods to combine data from various sources and resolutions, aiming to accurately model krill abundance. Our focus lies in fitting the model to a dataset comprising acoustic measurements of krill biomass. To achieve this, we integrate climate covariates obtained from satellite imagery and from drifting surface buoys (also known as drifters). Additionally, we use sparsely collected krill biomass data obtained from net fishing efforts (KRILLBASE) for validation. However, integrating these multiple heterogeneous data sources presents significant modelling challenges, including spatio-temporal misalignment and inflated zeros in the observed data. To address these challenges, we fit a Hurdle-Gamma model to jointly describe the occurrence of zeros and the krill biomass for the non-zero observations, while also accounting for misaligned and heterogeneous data sources, including drifters. Therefore, our work presents a comprehensive framework for analysing and predicting krill abundance in the Southern Ocean, leveraging information from various sources and formats. This is crucial due to the impact of krill fishing, as understanding their distribution is essential for informed management decisions and fishing regulations aimed at protecting the species.", "authors": "Amaral, Sykulski, Cavan et al"}, "https://arxiv.org/abs/2206.10475": {"title": "New possibilities in identification of binary choice models with fixed effects", "link": "https://arxiv.org/abs/2206.10475", "description": "We study the identification of binary choice models with fixed effects. We provide a condition called sign saturation and show that this condition is sufficient for the identification of the model. 
In particular, we can guarantee identification even with bounded regressors. We also show that without this condition, the model is not identified unless the error distribution belongs to a small class. The same sign saturation condition is also essential for identifying the sign of treatment effects. A test is provided to check the sign saturation condition and can be implemented using existing algorithms for the maximum score estimator.", "authors": "Zhu"}, "https://arxiv.org/abs/2209.04787": {"title": "A time-varying bivariate copula joint model for longitudinal and time-to-event data", "link": "https://arxiv.org/abs/2209.04787", "description": "A time-varying bivariate copula joint model, which models the repeatedly measured longitudinal outcome at each time point and the survival data jointly by both the random effects and time-varying bivariate copulas, is proposed in this paper. A regular joint model normally supposes that there exist subject-specific latent random effects or classes shared by the longitudinal and time-to-event processes and that the two processes are conditionally independent given these latent variables. Under this assumption, the joint likelihood of the two processes is straightforward to derive and their association, as well as heterogeneity among the population, are naturally introduced by the unobservable latent variables. However, because of the unobservable nature of these latent variables, the conditional independence assumption is difficult to verify. Therefore, besides the random effects, a time-varying bivariate copula is introduced to account for the extra time-dependent association between the two processes. The proposed model includes a regular joint model as a special case under some copulas. Simulation studies indicate that the parameter estimators in the proposed model are robust against copula misspecification and that it has superior performance in predicting survival probabilities compared to the regular joint model. A real data application to the primary biliary cirrhosis (PBC) data is performed.", "authors": "Zhang, Charalambous, Foster"}, "https://arxiv.org/abs/2308.02450": {"title": "Composite Quantile Factor Model", "link": "https://arxiv.org/abs/2308.02450", "description": "This paper introduces the composite quantile factor model for factor analysis in high-dimensional panel data. We propose to estimate the factors and factor loadings across multiple quantiles of the data, allowing the estimates to better adapt to features of the data at different quantiles while still modeling the mean of the data. We develop the limiting distribution of the estimated factors and factor loadings, and an information criterion for consistent factor number selection is also discussed. Simulations show that the proposed estimator and the information criterion have good finite sample properties for several non-normal distributions under consideration. We also consider an empirical study on the factor analysis for 246 quarterly macroeconomic variables. A companion R package cqrfactor is developed.", "authors": "Huang"}, "https://arxiv.org/abs/2309.01637": {"title": "The Robust F-Statistic as a Test for Weak Instruments", "link": "https://arxiv.org/abs/2309.01637", "description": "Montiel Olea and Pflueger (2013) proposed the effective F-statistic as a test for weak instruments in terms of the Nagar bias of the two-stage least squares (2SLS) estimator relative to a benchmark worst-case bias. 
We show that their methodology applies to a class of linear generalized method of moments (GMM) estimators with an associated class of generalized effective F-statistics. The standard nonhomoskedasticity robust F-statistic is a member of this class. The associated GMMf estimator, with the extension f for first-stage, is a novel and unusual estimator as the weight matrix is based on the first-stage residuals. As the robust F-statistic can also be used as a test for underidentification, expressions for the calculation of the weak-instruments critical values in terms of the Nagar bias of the GMMf estimator relative to the benchmark simplify and no simulation methods or Patnaik (1949) distributional approximations are needed. In the grouped-data IV designs of Andrews (2018), where the robust F-statistic is large but the effective F-statistic is small, the GMMf estimator is shown to behave much better in terms of bias than the 2SLS estimator, as expected by the weak-instruments test results.", "authors": "Windmeijer"}, "https://arxiv.org/abs/2312.13148": {"title": "Partially factorized variational inference for high-dimensional mixed models", "link": "https://arxiv.org/abs/2312.13148", "description": "While generalized linear mixed models are a fundamental tool in applied statistics, many specifications, such as those involving categorical factors with many levels or interaction terms, can be computationally challenging to estimate due to the need to compute or approximate high-dimensional integrals. Variational inference is a popular way to perform such computations, especially in the Bayesian context. However, naive use of such methods can provide unreliable uncertainty quantification. We show that this is indeed the case for mixed models, proving that standard mean-field variational inference dramatically underestimates posterior uncertainty in high-dimensions. We then show how appropriately relaxing the mean-field assumption leads to methods whose uncertainty quantification does not deteriorate in high-dimensions, and whose total computational cost scales linearly with the number of parameters and observations. Our theoretical and numerical results focus on mixed models with Gaussian or binomial likelihoods, and rely on connections to random graph theory to obtain sharp high-dimensional asymptotic analysis. We also provide generic results, which are of independent interest, relating the accuracy of variational inference to the convergence rate of the corresponding coordinate ascent algorithm that is used to find it. Our proposed methodology is implemented in the R package, see https://github.com/mgoplerud/vglmer . Numerical results with simulated and real data examples illustrate the favourable computation cost versus accuracy trade-off of our approach compared to various alternatives.", "authors": "Goplerud, Papaspiliopoulos, Zanella"}, "https://arxiv.org/abs/2303.03092": {"title": "Environment Invariant Linear Least Squares", "link": "https://arxiv.org/abs/2303.03092", "description": "This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. 
The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we show that the $\\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.", "authors": "Fan, Fang, Gu et al"}, "https://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "https://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated with a class of partially observed stochastic differential equations (SDE) driven by jump processes. Such models are routinely found in applications; here we focus on the case of neuroscience. The data are assumed to be observed regularly in time and driven by the SDE model with unknown parameters. In practice the SDE may not have an analytically tractable solution and this leads naturally to a time-discretization. We adapt the multilevel Markov chain Monte Carlo method of [11], which works with a hierarchy of time discretizations, and show empirically and theoretically that this is preferable to using a single time discretization. The improvement is in terms of the computational cost needed to obtain a pre-specified numerical error. Our approach is illustrated on models that are found in neuroscience.", "authors": "Maama, Jasra, Kamatani"}, "https://arxiv.org/abs/2311.15257": {"title": "Career Modeling with Missing Data and Traces", "link": "https://arxiv.org/abs/2311.15257", "description": "Many social scientists study the career trajectories of populations of interest, such as economic and administrative elites. However, data to document such processes are rarely completely available, which motivates the adoption of inference tools that can account for large numbers of missing values. Taking the example of public-private paths of elite civil servants in France, we introduce binary Markov switching models to perform Bayesian data augmentation. Our procedure relies on two data sources: (1) detailed observations of a small number of individual trajectories, and (2) less informative ``traces'' left by all individuals, which we model for imputation of missing data. An advantage of this model class is that it maintains the properties of hidden Markov models and enables a tailored sampler to target the posterior, while allowing for varying parameters across individuals and time. 
We provide two applied studies which demonstrate this can be used to properly test substantive hypotheses, and expand the social scientific literature in various ways. We notably show that the rate at which ENA graduates exit the French public sector has not increased since 1990, but that the rate at which they come back has increased.", "authors": "Voldoire, Ryder, Lahfa"}, "https://arxiv.org/abs/2412.02014": {"title": "Dynamic Prediction of High-density Generalized Functional Data with Fast Generalized Functional Principal Component Analysis", "link": "https://arxiv.org/abs/2412.02014", "description": "Dynamic prediction, which typically refers to the prediction of future outcomes using historical records, is often of interest in biomedical research. For datasets with large sample sizes, high measurement density, and complex correlation structures, traditional methods are often infeasible because of the computational burden associated with both data scale and model complexity. Moreover, many models do not directly facilitate out-of-sample predictions for generalized outcomes. To address these issues, we develop a novel approach for dynamic predictions based on a recently developed method estimating complex patterns of variation for exponential family data: fast Generalized Functional Principal Components Analysis (fGFPCA). Our method is able to handle large-scale, high-density repeated measures much more efficiently with its implementation feasible even on personal computational resources (e.g., a standard desktop or laptop computer). The proposed method makes highly flexible and accurate predictions of future trajectories for data that exhibit high degrees of nonlinearity, and allows for out-of-sample predictions to be obtained without reestimating any parameters. A simulation study is designed and implemented to illustrate the advantages of this method. To demonstrate its practical utility, we also conducted a case study to predict diurnal active/inactive patterns using accelerometry data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014. Both the simulation study and the data application demonstrate the better predictive performance and high computational efficiency of the proposed method compared to existing methods. The proposed method also obtains more personalized prediction that improves as more information becomes available, which is an essential goal of dynamic prediction that other methods fail to achieve.", "authors": "Jin, Leroux"}, "https://arxiv.org/abs/2412.02105": {"title": "The causal effects of modified treatment policies under network interference", "link": "https://arxiv.org/abs/2412.02105", "description": "Modified treatment policies are a widely applicable class of interventions used to study the causal effects of continuous exposures. Approaches to evaluating their causal effects assume no interference, meaning that such effects cannot be learned from data in settings where the exposure of one unit affects the outcome of others, as is common in spatial or network data. We introduce a new class of intervention, induced modified treatment policies, which we show identify such causal effects in the presence of network interference. Building on recent developments in network causal inference, we provide flexible, semi-parametric efficient estimators of the identified statistical estimand. Simulation experiments demonstrate that an induced modified treatment policy can eliminate causal (or identification) bias resulting from interference. 
We use the methods developed to evaluate the effect of zero-emission vehicle uptake on air pollution in California, strengthening prior evidence.", "authors": "Balkus, Delaney, Hejazi"}, "https://arxiv.org/abs/2412.02151": {"title": "Efficient Analysis of Latent Spaces in Heterogeneous Networks", "link": "https://arxiv.org/abs/2412.02151", "description": "This work proposes a unified framework for efficient estimation under latent space modeling of heterogeneous networks. We consider a class of latent space models that decompose latent vectors into shared and network-specific components across networks. We develop a novel procedure that first identifies the shared latent vectors and further refines estimates through efficient score equations to achieve statistical efficiency. Oracle error rates for estimating the shared and heterogeneous latent vectors are established simultaneously. The analysis framework offers remarkable flexibility, accommodating various types of edge weights under exponential family distributions.", "authors": "Tian, Sun, He"}, "https://arxiv.org/abs/2412.02182": {"title": "Searching for local associations while controlling the false discovery rate", "link": "https://arxiv.org/abs/2412.02182", "description": "We introduce local conditional hypotheses that express how the relation between explanatory variables and outcomes changes across different contexts, described by covariates. By expanding upon the model-X knockoff filter, we show how to adaptively discover these local associations, all while controlling the false discovery rate. Our enhanced inferences can help explain sample heterogeneity and uncover interactions, making better use of the capabilities offered by modern machine learning models. Specifically, our method is able to leverage any model for the identification of data-driven hypotheses pertaining to different contexts. It then rigorously tests these hypotheses without succumbing to selection bias. Importantly, our approach is efficient and does not require sample splitting. We demonstrate the effectiveness of our method through numerical experiments and by studying the genetic architecture of waist-hip ratio across different sexes in the UK Biobank.", "authors": "Gablenz, Sesia, Sun et al"}, "https://arxiv.org/abs/2412.02183": {"title": "Endogenous Interference in Randomized Experiments", "link": "https://arxiv.org/abs/2412.02183", "description": "This paper investigates the identification and inference of treatment effects in randomized controlled trials with social interactions. Two key network features characterize the setting and introduce endogeneity: (1) latent variables may affect both network formation and outcomes, and (2) the intervention may alter network structure, mediating treatment effects. I make three contributions. First, I define parameters within a post-treatment network framework, distinguishing direct effects of treatment from indirect effects mediated through changes in network structure. I provide a causal interpretation of the coefficients in a linear outcome model. For estimation and inference, I focus on a specific form of peer effects, represented by the fraction of treated friends. Second, in the absence of endogeneity, I establish the consistency and asymptotic normality of ordinary least squares estimators. 
Third, if endogeneity is present, I propose addressing it through shift-share instrumental variables, demonstrating the consistency and asymptotic normality of instrumental variable estimators in relatively sparse networks. For denser networks, I propose a denoised estimator based on eigendecomposition to restore consistency. Finally, I revisit Prina (2015) as an empirical illustration, demonstrating that treatment can influence outcomes both directly and through network structure changes.", "authors": "Gao"}, "https://arxiv.org/abs/2412.02333": {"title": "Estimation of a multivariate von Mises distribution for contaminated torus data", "link": "https://arxiv.org/abs/2412.02333", "description": "The occurrence of atypical circular observations on the torus can badly affect parameters estimation of the multivariate von Mises distribution. This paper addresses the problem of robust fitting of the multivariate von Mises model using the weighted likelihood methodology. The key ingredients are non-parametric density estimation for multivariate circular data and the definition of appropriate weighted estimating equations. The finite sample behavior of the proposed weighted likelihood estimator has been investigated by Monte Carlo numerical studies and real data examples.", "authors": "Bertagnolli, Greco, Agostinelli"}, "https://arxiv.org/abs/2412.02513": {"title": "Quantile-Crossing Spectrum and Spline Autoregression Estimation", "link": "https://arxiv.org/abs/2412.02513", "description": "The quantile-crossing spectrum is the spectrum of quantile-crossing processes created from a time series by the indicator function that shows whether or not the time series lies above or below a given quantile at a given time. This bivariate function of frequency and quantile level provides a richer view of serial dependence than that offered by the ordinary spectrum. We propose a new method for estimating the quantile-crossing spectrum as a bivariate function of frequency and quantile level. The proposed method, called spline autoregression (SAR), jointly fits an AR model to the quantile-crossing series across multiple quantiles; the AR coefficients are represented as spline functions of the quantile level and penalized for their roughness. Numerical experiments show that when the underlying spectrum is smooth in quantile level the proposed method is able to produce more accurate estimates in comparison with the alternative that ignores the smoothness.", "authors": "Li"}, "https://arxiv.org/abs/2412.02654": {"title": "Simple and Effective Portfolio Construction with Crypto Assets", "link": "https://arxiv.org/abs/2412.02654", "description": "We consider the problem of constructing a portfolio that combines traditional financial assets with crypto assets. We show that despite the documented attributes of crypto assets, such as high volatility, heavy tails, excess kurtosis, and skewness, a simple extension of traditional risk allocation provides robust solutions for integrating these emerging assets into broader investment strategies. 
Examination of the risk allocation holdings suggests an even simpler method, analogous to the traditional 60/40 stocks/bonds allocation, involving a fixed allocation to crypto and traditional assets, dynamically diluted with cash to achieve a target risk level.", "authors": "Johansson, Boyd"}, "https://arxiv.org/abs/2412.02660": {"title": "A Markowitz Approach to Managing a Dynamic Basket of Moving-Band Statistical Arbitrages", "link": "https://arxiv.org/abs/2412.02660", "description": "We consider the problem of managing a portfolio of moving-band statistical arbitrages (MBSAs), inspired by the Markowitz optimization framework. We show how to manage a dynamic basket of MBSAs, and illustrate the method on recent historical data, showing that it can perform very well in terms of risk-adjusted return, essentially uncorrelated with the market.", "authors": "Johansson, Schmelzer, Boyd"}, "https://arxiv.org/abs/2412.01953": {"title": "The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications", "link": "https://arxiv.org/abs/2412.01953", "description": "Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientific disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.", "authors": "Brouillard, Squires, Wahl et al"}, "https://arxiv.org/abs/2412.02251": {"title": "Selective Reviews of Bandit Problems in AI via a Statistical View", "link": "https://arxiv.org/abs/2412.02251", "description": "Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes stochastic multi-armed bandit (MAB) and continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration-exploitation trade-offs. We also extend the discussion to $K$-armed contextual bandits and SCAB, examining their methodologies, regret analyses, and discussing the relation between the SCAB problems and the functional data analysis. 
Finally, we highlight recent advances and ongoing challenges in the field.", "authors": "Zhou, Wei, Zhang"}, "https://arxiv.org/abs/2412.02355": {"title": "TITE-CLRM: Towards efficient time-to-event dose-escalation guidance of multi-cycle cancer therapies", "link": "https://arxiv.org/abs/2412.02355", "description": "Treatment of cancer has evolved rapidly and dramatically over time, for example from chemotherapies and targeted therapies to immunotherapies and chimeric antigen receptor T-cells. Nonetheless, early phase I trials in oncology still predominantly follow a dose-escalation design. These trials monitor safety over the first treatment cycle in order to escalate the dose of the investigated drug. However, over time, studying additional factors such as drug combinations and/or variation in the timing of dosing became important as well. Existing designs were continuously enhanced and expanded to account for increased trial complexity. With toxicities occurring at later stages beyond the first cycle and the need to treat patients over multiple cycles, the focus on the first treatment cycle alone is becoming a limitation for today's multi-cycle therapies. Here we introduce a multi-cycle time-to-event model (TITE-CLRM: Time-Interval-To-Event Complementary-Loglog Regression Model) that guides dose-escalation trials studying multi-cycle therapies. The challenge lies in balancing the need to monitor safety of longer treatment periods with the need to continuously enroll patients safely. The proposed multi-cycle time-to-event model is formulated as an extension of established concepts such as the escalation with overdose control principle. The model is motivated by a current drug development project and evaluated in a simulation study.", "authors": "Widmer, Weber, Xu et al"}, "https://arxiv.org/abs/2412.02380": {"title": "Use of surrogate endpoints in health technology assessment: a review of selected NICE technology appraisals in oncology", "link": "https://arxiv.org/abs/2412.02380", "description": "Objectives: Surrogate endpoints, used to substitute for and predict final clinical outcomes, are increasingly being used to support submissions to health technology assessment agencies. The increase in use of surrogate endpoints has been accompanied by literature describing frameworks and statistical methods to ensure their robust validation. The aim of this review was to assess how surrogate endpoints have recently been used in oncology technology appraisals by the National Institute for Health and Care Excellence (NICE) in England and Wales.\n Methods: This paper identified technology appraisals in oncology published by NICE between February 2022 and May 2023. Data were extracted on methods for the use and validation of surrogate endpoints.\n Results: Of the 47 technology appraisals in oncology available for review, 18 (38 percent) utilised surrogate endpoints, with 37 separate surrogate endpoints being discussed. However, the evidence supporting the validity of the surrogate relationship varied significantly across putative surrogate relationships, with 11 providing RCT evidence, 7 providing evidence from observational studies, 12 based on clinical opinion, and 7 providing no evidence for the use of surrogate endpoints.\n Conclusions: This review supports the assertion that surrogate endpoints are frequently used in oncology technology appraisals in England and Wales. 
Despite increasing availability of statistical methods and guidance on appropriate validation of surrogate endpoints, this review highlights that use and validation of surrogate endpoints can vary between technology appraisals which can lead to uncertainty in decision-making.", "authors": "Wheaton, Bujkiewicz"}, "https://arxiv.org/abs/2412.02439": {"title": "Nature versus nurture in galaxy formation: the effect of environment on star formation with causal machine learning", "link": "https://arxiv.org/abs/2412.02439", "description": "Understanding how galaxies form and evolve is at the heart of modern astronomy. With the advent of large-scale surveys and simulations, remarkable progress has been made in the last few decades. Despite this, the physical processes behind the phenomena, and particularly their importance, remain far from known, as correlations have primarily been established rather than the underlying causality. We address this challenge by applying the causal inference framework. Specifically, we tackle the fundamental open question of whether galaxy formation and evolution depends more on nature (i.e., internal processes) or nurture (i.e., external processes), by estimating the causal effect of environment on star-formation rate in the IllustrisTNG simulations. To do so, we develop a comprehensive causal model and employ cutting-edge techniques from epidemiology to overcome the long-standing problem of disentangling nature and nurture. We find that the causal effect is negative and substantial, with environment suppressing the SFR by a maximal factor of $\\sim100$. While the overall effect at $z=0$ is negative, in the early universe, environment is discovered to have a positive impact, boosting star formation by a factor of $\\sim10$ at $z\\sim1$ and by even greater amounts at higher redshifts. Furthermore, we show that: (i) nature also plays an important role, as ignoring it underestimates the causal effect in intermediate-density environments by a factor of $\\sim2$, (ii) controlling for the stellar mass at a snapshot in time, as is common in the literature, is not only insufficient to disentangle nature and nurture but actually has an adverse effect, though (iii) stellar mass is an adequate proxy of the effects of nature. Finally, this work may prove a useful blueprint for extracting causal insights in other fields that deal with dynamical systems with closed feedback loops, such as the Earth's climate.", "authors": "Mucesh, Hartley, Gilligan-Lee et al"}, "https://arxiv.org/abs/2412.02640": {"title": "On the optimality of coin-betting for mean estimation", "link": "https://arxiv.org/abs/2412.02640", "description": "Confidence sequences are sequences of confidence sets that adapt to incoming data while maintaining validity. Recent advances have introduced an algorithmic formulation for constructing some of the tightest confidence sequences for bounded real random variables. These approaches use a coin-betting framework, where a player sequentially bets on differences between potential mean values and observed data. 
This letter establishes that such coin-betting formulation is optimal among all possible algorithmic frameworks for constructing confidence sequences that build on e-variables and sequential hypothesis testing.", "authors": "Clerico"}, "https://arxiv.org/abs/1907.00287": {"title": "Estimating Treatment Effect under Additive Hazards Models with High-dimensional Covariates", "link": "https://arxiv.org/abs/1907.00287", "description": "Estimating causal effects for survival outcomes in the high-dimensional setting is an extremely important topic for many biomedical applications as well as areas of social sciences. We propose a new orthogonal score method for treatment effect estimation and inference that results in asymptotically valid confidence intervals assuming only good estimation properties of the hazard outcome model and the conditional probability of treatment. This guarantee allows us to provide valid inference for the conditional treatment effect under the high-dimensional additive hazards model under considerably more generality than existing approaches. In addition, we develop a new Hazards Difference (HDi), estimator. We showcase that our approach has double-robustness properties in high dimensions: with cross-fitting, the HDi estimate is consistent under a wide variety of treatment assignment models; the HDi estimate is also consistent when the hazards model is misspecified and instead the true data generating mechanism follows a partially linear additive hazards model. We further develop a novel sparsity doubly robust result, where either the outcome or the treatment model can be a fully dense high-dimensional model. We apply our methods to study the treatment effect of radical prostatectomy versus conservative management for prostate cancer patients using the SEER-Medicare Linked Data.", "authors": "Hou, Bradic, Xu"}, "https://arxiv.org/abs/2305.19242": {"title": "Nonstationary Gaussian Process Surrogates", "link": "https://arxiv.org/abs/2305.19242", "description": "We provide a survey of nonstationary surrogate models which utilize Gaussian processes (GPs) or variations thereof, including nonstationary kernel adaptations, partition and local GPs, and spatial warpings through deep Gaussian processes. We also overview publicly available software implementations and conclude with a bake-off involving an 8-dimensional satellite drag computer experiment. Code for this example is provided in a public git repository.", "authors": "Booth, Cooper, Gramacy"}, "https://arxiv.org/abs/2401.05315": {"title": "Multi-resolution filters via linear projection for large spatio-temporal datasets", "link": "https://arxiv.org/abs/2401.05315", "description": "Advances in compact sensing devices mounted on satellites have facilitated the collection of large spatio-temporal datasets with coordinates. Since such datasets are often incomplete and noisy, it is useful to create the prediction surface of a spatial field. To this end, we consider an online filtering inference by using the Kalman filter based on linear Gaussian state-space models. However, the Kalman filter is impractically time-consuming when the number of locations in spatio-temporal datasets is large. To address this problem, we propose a multi-resolution filter via linear projection (MRF-lp), a fast computation method for online filtering inference. 
In the MRF-lp, by carrying out a multi-resolution approximation via linear projection (MRA-lp), the forecast covariance matrix can be approximated while capturing both the large- and small-scale spatial variations. As a result of this approximation, our proposed MRF-lp preserves a block-sparse structure of some matrices appearing in the MRF-lp through time, which leads to the scalability of this algorithm. Additionally, we discuss extensions of the MRF-lp to a nonlinear and non-Gaussian case. Simulation studies and real data analysis for total precipitable water vapor demonstrate that our proposed approach performs well compared with the related methods.", "authors": "Hirano, Ishihara"}, "https://arxiv.org/abs/2312.00710": {"title": "SpaCE: The Spatial Confounding Environment", "link": "https://arxiv.org/abs/2312.00710", "description": "Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozen datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.", "authors": "Tec, Trisovic, Audirac et al"}, "https://arxiv.org/abs/2412.02767": {"title": "Endogenous Heteroskedasticity in Linear Models", "link": "https://arxiv.org/abs/2412.02767", "description": "Linear regressions with endogeneity are widely used to estimate causal effects. This paper studies a statistical framework that has two common issues, endogeneity of the regressors, and heteroskedasticity that is allowed to depend on endogenous regressors, i.e., endogenous heteroskedasticity. We show that the presence of such conditional heteroskedasticity in the structural regression renders the two-stage least squares estimator inconsistent. To solve this issue, we propose sufficient conditions together with a control function approach to identify and estimate the causal parameters of interest. We establish statistical properties of the estimator, namely consistency and asymptotic normality, and propose valid inference procedures. Monte Carlo simulations provide evidence of the finite sample performance of the proposed methods, and evaluate different implementation procedures. 
We revisit an empirical application about job training to illustrate the methods.", "authors": "Alejo, Galvao, Martinez-Iriarte et al"}, "https://arxiv.org/abs/2412.02791": {"title": "Chain-linked Multiple Matrix Integration via Embedding Alignment", "link": "https://arxiv.org/abs/2412.02791", "description": "Motivated by the increasing demand for multi-source data integration in various scientific fields, in this paper we study matrix completion in scenarios where the data exhibits certain block-wise missing structures -- specifically, where only a few noisy submatrices representing (overlapping) parts of the full matrix are available. We propose the Chain-linked Multiple Matrix Integration (CMMI) procedure to efficiently combine the information that can be extracted from these individual noisy submatrices. CMMI begins by deriving entity embeddings for each observed submatrix, then aligns these embeddings using overlapping entities between pairs of submatrices, and finally aggregates them to reconstruct the entire matrix of interest. We establish, under mild regularity conditions, entrywise error bounds and normal approximations for the CMMI estimates. Simulation studies and real data applications show that CMMI is computationally efficient and effective in recovering the full matrix, even when overlaps between the observed submatrices are minimal.", "authors": "Zheng, Tang"}, "https://arxiv.org/abs/2412.02945": {"title": "Detection of Multiple Influential Observations on Model Selection", "link": "https://arxiv.org/abs/2412.02945", "description": "Outlying observations are frequently encountered in a wide spectrum of scientific domains, posing significant challenges for the generalizability of statistical models and the reproducibility of downstream analysis. These observations can be identified through influential diagnosis, which refers to the detection of observations that are unduly influential on diverse facets of statistical inference. To date, methods for identifying observations influencing the choice of a stochastically selected submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors p exceeds the sample size n. Recently we proposed an improved diagnostic measure to handle this setting. However, its distributional properties and approximations have not yet been explored. To address this shortcoming, the notion of exchangeability is revived, and used to determine the exact finite- and large-sample distributions of our assessment metric. This forms the foundation for the introduction of both parametric and non-parametric approaches for its approximation and the establishment of thresholds for diagnosis. The resulting framework is extended to logistic regression models, followed by a simulation study conducted to assess the performance of various detection procedures. Finally the framework is applied to data from an fMRI study of thermal pain, with the goal of identifying outlying subjects that could distort the formulation of statistical models using functional brain activity in predicting physical pain ratings. Both linear and logistic regression models are used to demonstrate the benefits of detection and compare the performances of different detection procedures. 
In particular, two additional influential observations are identified, which are not discovered by previous studies.", "authors": "Zhang, Asgharian, Lindquist"}, "https://arxiv.org/abs/2412.02970": {"title": "Uncovering dynamics between SARS-CoV-2 wastewater concentrations and community infections via Bayesian spatial functional concurrent regression", "link": "https://arxiv.org/abs/2412.02970", "description": "Monitoring wastewater concentrations of SARS-CoV-2 yields a low-cost, noninvasive method for tracking disease prevalence and provides early warning signs of upcoming outbreaks in the serviced communities. There is tremendous clinical and public health interest in understanding the exact dynamics between wastewater viral loads and infection rates in the population. As both data sources may contain substantial noise and missingness, in addition to spatial and temporal dependencies, properly modeling this relationship must address these numerous complexities simultaneously while providing interpretable and clear insights. We propose a novel Bayesian functional concurrent regression model that accounts for both spatial and temporal correlations while estimating the dynamic effects between wastewater concentrations and positivity rates over time. We explicitly model the time lag between the two series and provide full posterior inference on the possible delay between spikes in wastewater concentrations and subsequent outbreaks. We estimate a time lag likely between 5 to 11 days between spikes in wastewater levels and reported clinical positivity rates. Additionally, we find a dynamic relationship between wastewater concentration levels and the strength of its association with positivity rates that fluctuates between outbreaks and non-outbreaks.", "authors": "Sun, Schedler, Kowal et al"}, "https://arxiv.org/abs/2412.02986": {"title": "Bayesian Transfer Learning for Enhanced Estimation and Inference", "link": "https://arxiv.org/abs/2412.02986", "description": "Transfer learning enhances model performance in a target population with limited samples by leveraging knowledge from related studies. While many works focus on improving predictive performance, challenges of statistical inference persist. Bayesian approaches naturally offer uncertainty quantification for parameter estimates, yet existing Bayesian transfer learning methods are typically limited to single-source scenarios or require individual-level data. We introduce TRansfer leArning via guideD horseshoE prioR (TRADER), a novel approach enabling multi-source transfer through pre-trained models in high-dimensional linear regression. TRADER shrinks target parameters towards a weighted average of source estimates, accommodating sources with different scales. Theoretical investigation shows that TRADER achieves faster posterior contraction rates than standard continuous shrinkage priors when sources align well with the target while preventing negative transfer from heterogeneous sources. The analysis of finite-sample marginal posterior behavior reveals that TRADER achieves desired frequentist coverage probabilities, even for coefficients with moderate signal strength--a scenario where standard continuous shrinkage priors struggle. 
Extensive numerical studies and a real-data application estimating the association between blood glucose and insulin use in the Hispanic diabetic population demonstrate that TRADER improves estimation and inference accuracy over continuous shrinkage priors using target data alone, while outperforming a state-of-the-art transfer learning method that requires individual-level data.", "authors": "Lai, Padilla, Gu"}, "https://arxiv.org/abs/2412.03042": {"title": "On a penalised likelihood approach for joint modelling of longitudinal covariates and partly interval-censored data -- an application to the Anti-PD1 brain collaboration trial", "link": "https://arxiv.org/abs/2412.03042", "description": "This article considers the joint modeling of longitudinal covariates and partly interval-censored time-to-event data. Longitudinal time-varying covariates play a crucial role in obtaining accurate clinically relevant predictions using a survival regression model. However, these covariates are often measured at limited time points and may be subject to measurement error. Further methodological challenges arise from the fact that, in many clinical studies, the event times of interest are interval-censored. A model that simultaneously accounts for all these factors is expected to improve the accuracy of survival model estimations and predictions. In this article, we consider joint models that combine longitudinal time-varying covariates with the Cox model for time-to-event data which is subject to interval censoring. The proposed model employs a novel penalised likelihood approach for estimating all parameters, including the random effects. The covariance matrix of the estimated parameters can be obtained from the penalised log-likelihood. The performance of the model is compared to an existing method under various scenarios. The simulation results demonstrate that our new method can provide reliable inferences when dealing with interval-censored data. Data from the Anti-PD1 brain collaboration clinical trial in advanced melanoma is used to illustrate the application of the new method.", "authors": "Webb, Zou, Lo et al"}, "https://arxiv.org/abs/2412.03185": {"title": "Information borrowing in Bayesian clinical trials: choice of tuning parameters for the robust mixture prior", "link": "https://arxiv.org/abs/2412.03185", "description": "Borrowing historical data for use in clinical trials has increased in recent years. This is accomplished in the Bayesian framework by specification of informative prior distributions. One such approach is the robust mixture prior, arising as a weighted mixture of an informative prior and a robust prior, which induces dynamic borrowing that allows borrowing most when the current and external data are observed to be similar. The robust mixture prior requires the choice of three additional quantities: the mixture weight, and the mean and dispersion of the robust component. Some general guidance is available, but a case-by-case study of the impact of these quantities on specific operating characteristics seems lacking. We focus on evaluating the impact of parameter choices for the robust component of the mixture prior in one-arm and hybrid-control trials. The results show that all three quantities can strongly impact the operating characteristics. In particular, as already known, the variance of the robust component is linked to robustness. Less known, however, is that its location can have a strong impact on the Type I error rate and MSE, which can even become unbounded. 
Further, the impact of the weight choice is strongly linked with the robust component's location and variance. Recommendations are provided for the choice of the robust component parameters, the prior weight, and alternative functional forms for this component, as well as considerations to keep in mind when evaluating operating characteristics.", "authors": "Weru, Kopp-Schneider, Wiesenfarth et al"}, "https://arxiv.org/abs/2412.03246": {"title": "Nonparametric estimation of the Patient Weighted While-Alive Estimand", "link": "https://arxiv.org/abs/2412.03246", "description": "In clinical trials with recurrent events, such as repeated hospitalizations terminating with death, it is important to consider the patient's overall event history for a thorough assessment of treatment effects. The occurrence of fewer events due to early deaths can lead to misinterpretation, emphasizing the importance of a while-alive strategy as suggested in Schmidli et al. (2023). In this paper, we focus on the patient-weighted while-alive estimand, represented as the expected number of events divided by the time alive within a target window, and develop efficient estimation for this estimand. We derive its efficient influence function and develop a one-step estimator, initially applied to the irreversible illness-death model. For the broader context of recurrent events, due to the increased complexity, the one-step estimator is practically intractable. We therefore suggest an alternative estimator that is also expected to have high efficiency, focusing on the randomized treatment setting. We compare the efficiency of these two estimators in the illness-death setting. Additionally, we apply our proposed estimator to a real-world case study involving metastatic colorectal cancer patients, demonstrating the practical applicability and benefits of the while-alive approach.", "authors": "Ragni, Martinussen, Scheike"}, "https://arxiv.org/abs/2412.03429": {"title": "Coherent forecast combination for linearly constrained multiple time series", "link": "https://arxiv.org/abs/2412.03429", "description": "Linearly constrained multiple time series may be encountered in many practical contexts, such as the National Accounts (e.g., GDP disaggregated by Income, Expenditure and Output), and multilevel frameworks where the variables are organized according to hierarchies or groupings, like the total energy consumption of a country disaggregated by region and energy sources. In these cases, when multiple incoherent base forecasts for each individual variable are available, a forecast combination-and-reconciliation approach, which we call coherent forecast combination, may be used to improve the accuracy of the base forecasts and achieve coherence in the final result. In this paper, we develop an optimization-based technique that combines multiple unbiased base forecasts while ensuring that the constraints valid for the series are satisfied. We present closed-form expressions for the coherent combined forecast vector and its error covariance matrix in the general case where a different number of forecasts is available for each variable. We also discuss practical issues related to the covariance matrix that is part of the optimal solution. 
Through simulations and a forecasting experiment on the daily Australian electricity generation hierarchical time series, we show that the proposed methodology, in addition to adhering to sound statistical principles, may yield significant improvements over base forecasts, single-task combination, and single-expert reconciliation approaches.", "authors": "Girolimetto, Fonzo"}, "https://arxiv.org/abs/2412.03484": {"title": "Visualisation for Exploratory Modelling Analysis of Bayesian Hierarchical Models", "link": "https://arxiv.org/abs/2412.03484", "description": "When developing Bayesian hierarchical models, selecting the most appropriate hierarchical structure can be a challenging task, and visualisation remains an underutilised tool in this context. In this paper, we consider visualisations for the display of hierarchical models in data space and compare a collection of multiple models via their parameters and hyper-parameter estimates. Specifically, with the aim of aiding model choice, we propose new visualisations to explore how the choice of Bayesian hierarchical modelling structure impacts parameter distributions. The visualisations are designed using a robust set of principles to provide richer comparisons that extend beyond the conventional plots and numerical summaries typically used. As a case study, we investigate five Bayesian hierarchical models fit using the brms R package, a high-level interface to Stan for Bayesian modelling, to model country mathematics trends from the PISA (Programme for International Student Assessment) database. Our case study demonstrates that by adhering to these principles, researchers can create visualisations that not only help them make more informed choices between Bayesian hierarchical model structures but also enable them to effectively communicate the rationale for those choices.", "authors": "Akinfenwa, Cahill, Hurley"}, "https://arxiv.org/abs/2412.02869": {"title": "Constrained Identifiability of Causal Effects", "link": "https://arxiv.org/abs/2412.02869", "description": "We study the identification of causal effects in the presence of different types of constraints (e.g., logical constraints) in addition to the causal graph. These constraints impose restrictions on the models (parameterizations) induced by the causal graph, reducing the set of models considered by the identifiability problem. We formalize the notion of constrained identifiability, which takes a set of constraints as another input to the classical definition of identifiability. We then introduce a framework for testing constrained identifiability by employing tractable Arithmetic Circuits (ACs), which enables us to accommodate constraints systematically. We show that this AC-based approach is at least as complete as existing algorithms (e.g., do-calculus) for testing classical identifiability, which only assumes the constraint of strict positivity. We use examples to demonstrate the effectiveness of this AC-based approach by showing that unidentifiable causal effects may become identifiable under different types of constraints.", "authors": "Chen, Darwiche"}, "https://arxiv.org/abs/2412.02878": {"title": "Modeling and Discovering Direct Causes for Predictive Models", "link": "https://arxiv.org/abs/2412.02878", "description": "We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. 
The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We present two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.", "authors": "Chen, Bhatia"}, "https://arxiv.org/abs/2412.02893": {"title": "Removing Spurious Correlation from Neural Network Interpretations", "link": "https://arxiv.org/abs/2412.02893", "description": "The existing algorithms for identification of neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.", "authors": "Fotouhi, Bahadori, Feyisetan et al"}, "https://arxiv.org/abs/2412.03491": {"title": "Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications", "link": "https://arxiv.org/abs/2412.03491", "description": "Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.", "authors": "Sauer, Boulesteix, Han{\\ss}um et al"}, "https://arxiv.org/abs/2412.03528": {"title": "The R.O.A.D. to clinical trial emulation", "link": "https://arxiv.org/abs/2412.03528", "description": "Observational studies provide the only evidence on the effectiveness of interventions when randomized controlled trials (RCTs) are impractical due to cost, ethical concerns, or time constraints. 
While many methodologies aim to draw causal inferences from observational data, there is a growing trend to model observational study designs after RCTs, a strategy known as \"target trial emulation.\" Despite its potential, causal inference through target trial emulation cannot fully address the confounding bias in real-world data due to the lack of randomization. In this work, we present a novel framework for target trial emulation that aims to overcome several key limitations, including confounding bias. The framework proceeds as follows: First, we apply the eligibility criteria of a specific trial to an observational cohort. We then \"correct\" this cohort by extracting a subset that matches both the distribution of covariates and the baseline prognosis of the control group in the target RCT. Next, we address unmeasured confounding by adjusting the prognosis estimates of the treated group to align with those observed in the trial. Following trial emulation, we go a step further by leveraging the emulated cohort to train optimal decision trees, to identify subgroups of patients with heterogeneity in treatment effects (HTE). The absence of confounding is verified using two external models, and the validity of the treatment recommendations is independently confirmed by the team responsible for the original trial we emulate. To our knowledge, this is the first framework to successfully address both observed and unobserved confounding, a challenge that has historically limited the use of randomized trial emulation and causal inference. Additionally, our framework holds promise in advancing precision medicine by identifying patient subgroups that benefit most from specific treatments.", "authors": "Bertsimas, Koulouras, Nagata et al"}, "https://arxiv.org/abs/2307.05708": {"title": "Bayesian inference on the order of stationary vector autoregressions", "link": "https://arxiv.org/abs/2307.05708", "description": "Vector autoregressions (VARs) are a widely used tool for modelling multivariate time-series. It is common to assume a VAR is stationary; this can be enforced by imposing the stationarity condition which restricts the parameter space of the autoregressive coefficients to the stationary region. However, implementing this constraint is difficult due to the complex geometry of the stationary region. Fortunately, recent work has provided a solution for autoregressions of fixed order $p$ based on a reparameterization in terms of a set of interpretable and unconstrained transformed partial autocorrelation matrices. In this work, focus is placed on the difficult problem of allowing $p$ to be unknown, developing a prior and computational inference that takes full account of order uncertainty. Specifically, the multiplicative gamma process is used to build a prior which encourages increasing shrinkage of the partial autocorrelations with increasing lag. Identifying the lag beyond which the partial autocorrelations become equal to zero then determines $p$. Based on classic time-series theory, a principled choice of truncation criterion identifies whether a partial autocorrelation matrix is effectively zero. Posterior inference utilizes Hamiltonian Monte Carlo via Stan. 
The work is illustrated in a substantive application to neural activity data to investigate ultradian brain rhythms.", "authors": "Binks, Heaps, Panagiotopoulou et al"}, "https://arxiv.org/abs/2401.13208": {"title": "Assessing Influential Observations in Pain Prediction using fMRI Data", "link": "https://arxiv.org/abs/2401.13208", "description": "Neuroimaging data allows researchers to model the relationship between multivariate patterns of brain activity and outcomes related to mental states and behaviors. However, the existence of outlying participants can potentially undermine the generalizability of these models and jeopardize the validity of downstream statistical analysis. To date, the ability to detect and account for participants unduly influencing various model selection approaches has been sorely lacking. Motivated by a task-based functional magnetic resonance imaging (fMRI) study of thermal pain, we propose and establish the asymptotic distribution for a diagnostic measure applicable to a number of different model selectors. A high-dimensional clustering procedure is further combined with this measure to detect multiple influential observations. In a series of simulations, our proposed method demonstrates clear advantages over existing methods in terms of improved detection performance, leading to enhanced predictive and variable selection outcomes. Application of our method to data from the thermal pain study illustrates the influence of outlying participants, in particular with regard to differences in activation between low and intense pain conditions. This allows for the selection of an interpretable model with high prediction power after removal of the detected observations. Though inspired by the fMRI-based thermal pain study, our methods are broadly applicable to other high-dimensional data types.", "authors": "Zhang, Asgharian, Lindquist"}, "https://arxiv.org/abs/2210.07114": {"title": "A Tutorial on Statistical Models Based on Counting Processes", "link": "https://arxiv.org/abs/2210.07114", "description": "Since the famous paper written by Kaplan and Meier in 1958, survival analysis has become one of the most important fields in statistics. Nowadays it is one of the most important statistical tools in analyzing epidemiological and clinical data, including data from the COVID-19 pandemic. This article reviews some of the most celebrated and important results and methods, including consistency, asymptotic normality, bias and variance estimation, in survival analysis, and the treatment is parallel to the monograph Statistical Models Based on Counting Processes. Other models and results, such as semi-Markov models and Turnbull's estimator, that fall outside the classical counting process martingale framework are also discussed.", "authors": "Cui"}, "https://arxiv.org/abs/2412.03596": {"title": "SMART-MC: Sparse Matrix Estimation with Covariate-Based Transitions in Markov Chain Modeling of Multiple Sclerosis Disease Modifying Therapies", "link": "https://arxiv.org/abs/2412.03596", "description": "A Markov model is a widely used tool for modeling sequences of events from a finite state-space and hence can be employed to identify the transition probabilities across treatments based on treatment sequence data. To understand how patient-level covariates impact these treatment transitions, the transition probabilities are modeled as a function of patient covariates. 
This approach enables the visualization of the effect of patient-level covariates on the treatment transitions across patient visits. The proposed method automatically estimates the entries of the transition matrix with smaller numbers of empirical transitions as constant; the user can set the desired cutoff on the number of empirical transition counts required for a particular transition probability to be estimated as a function of covariates. First, this strategy automatically ensures that the final estimated transition matrix contains zeros at the locations corresponding to zero empirical transition counts, efficiently avoiding more complicated model constructs to handle sparsity. Second, it avoids estimating transition probabilities as a function of covariates when the number of empirical transitions is particularly small, thus avoiding the identifiability issue that might arise due to the p>n scenario when estimating each transition probability as a function of patient covariates. To optimize the multi-modal likelihood, a parallelized scalable global optimization routine is also developed. The proposed method is applied to understand how the transitions across disease modifying therapies (DMTs) in Multiple Sclerosis (MS) patients are influenced by patient-level demographic and clinical phenotypes.", "authors": "Kim, Xia, Das"}, "https://arxiv.org/abs/2412.03668": {"title": "Hidden Markov graphical models with state-dependent generalized hyperbolic distributions", "link": "https://arxiv.org/abs/2412.03668", "description": "In this paper we develop a novel hidden Markov graphical model to investigate time-varying interconnectedness between different financial markets. To identify conditional correlation structures under varying market conditions and accommodate stylized facts embedded in financial time series, we rely upon the generalized hyperbolic family of distributions with time-dependent parameters evolving according to a latent Markov chain. We exploit its location-scale mixture representation to build a penalized EM algorithm for estimating the state-specific sparse precision matrices by means of an $L_1$ penalty. The proposed approach leads to regime-specific conditional correlation graphs that allow us to identify different degrees of network connectivity of returns over time. The methodology's effectiveness is validated through simulation exercises under different scenarios. In the empirical analysis we apply our model to daily returns of a large set of market indexes, cryptocurrencies and commodity futures over the period 2017-2023.", "authors": "Foroni, Merlo, Petrella"}, "https://arxiv.org/abs/2412.03731": {"title": "Using a Two-Parameter Sensitivity Analysis Framework to Efficiently Combine Randomized and Non-randomized Studies", "link": "https://arxiv.org/abs/2412.03731", "description": "Causal inference is vital for informed decision-making across fields such as biomedical research and social sciences. Randomized controlled trials (RCTs) are considered the gold standard for the internal validity of inferences, whereas observational studies (OSs) often provide the opportunity for greater external validity. However, both data sources have inherent limitations preventing their use for broadly valid statistical inferences: RCTs may lack generalizability due to their selective eligibility criterion, and OSs are vulnerable to unobserved confounding. 
This paper proposes an innovative approach to integrate RCT and OS that borrows the other study's strengths to remedy each study's limitations. The method uses a novel triplet matching algorithm to align RCT and OS samples and a new two-parameter sensitivity analysis framework to quantify internal and external biases. This combined approach yields causal estimates that are more robust to hidden biases than OSs alone and provides reliable inferences about the treatment effect in the general population. We apply this method to investigate the effects of lactation on maternal health using a small RCT and a long-term observational health records dataset from the California National Primate Research Center. This application demonstrates the practical utility of our approach in generating scientifically sound and actionable causal estimates.", "authors": "Yu, Karmakar, Vandeleest et al"}, "https://arxiv.org/abs/2412.03797": {"title": "A Two-stage Approach for Variable Selection in Joint Modeling of Multiple Longitudinal Markers and Competing Risk Outcomes", "link": "https://arxiv.org/abs/2412.03797", "description": "Background: In clinical and epidemiological research, the integration of longitudinal measurements and time-to-event outcomes is vital for understanding relationships and improving risk prediction. However, as the number of longitudinal markers increases, joint model estimation becomes more complex, leading to long computation times and convergence issues. This study introduces a novel two-stage Bayesian approach for variable selection in joint models, illustrated through a practical application.\n Methods: Our approach conceptualizes the analysis in two stages. In the first stage, we estimate one-marker joint models for each longitudinal marker related to the event, allowing for bias reduction from informative dropouts through individual marker trajectory predictions. The second stage employs a proportional hazard model that incorporates expected current values of all markers as time-dependent covariates. We explore continuous and Dirac spike-and-slab priors for variable selection, utilizing Markov chain Monte Carlo (MCMC) techniques.\n Results: The proposed method addresses the challenges of parameter estimation and risk prediction with numerous longitudinal markers, demonstrating robust performance through simulation studies. We further validate our approach by predicting dementia risk using the Three-City (3C) dataset, a longitudinal cohort study from France.\n Conclusions: This two-stage Bayesian method offers an efficient process for variable selection in joint modeling, enhancing risk prediction capabilities in longitudinal studies. The accompanying R package VSJM, which is freely available at https://github.com/tbaghfalaki/VSJM, facilitates implementation, making this approach accessible for diverse clinical applications.", "authors": "Baghfalaki, Hashemi, Tzourio et al"}, "https://arxiv.org/abs/2412.03827": {"title": "Optimal Correlation for Bernoulli Trials with Covariates", "link": "https://arxiv.org/abs/2412.03827", "description": "Given covariates for $n$ units, each of which is to receive a treatment with probability $1/2$, we study the question of how best to correlate their treatment assignments to minimize the variance of the IPW estimator of the average treatment effect. 
Past work by Bai (2022) found that the optimal stratified experiment is a matched-pair design, where the matching depends on oracle knowledge of the distributions of potential outcomes given covariates. We show that, in the strictly broader class of all admissible correlation structures, the optimal design is to divide the units into two clusters and uniformly assign treatment to exactly one of the two clusters. This design can be computed by solving a 0-1 knapsack problem that uses the same oracle information and can result in an arbitrarily large variance improvement. A shift-invariant version can be constructed by ensuring that exactly half of the units are treated. A method with just two clusters is not robust to a bad proxy for the oracle, and we mitigate this with a hybrid that uses $O(n^\\alpha)$ clusters for $0<\\alpha<1$. Under certain assumptions, we also derive a CLT for the IPW estimator under our design and a consistent estimator of the variance. We compare our proposed designs to the optimal stratified design in simulated examples and find improved performance.", "authors": "Morrison, Owen"}, "https://arxiv.org/abs/2412.03833": {"title": "A Note on the Identifiability of the Degree-Corrected Stochastic Block Model", "link": "https://arxiv.org/abs/2412.03833", "description": "In this short note, we address the identifiability issues inherent in the Degree-Corrected Stochastic Block Model (DCSBM). We provide a rigorous proof demonstrating that the parameters of the DCSBM are identifiable up to a scaling factor and a permutation of the community labels, under a mild condition.", "authors": "Park, Zhao, Hao"}, "https://arxiv.org/abs/2412.03918": {"title": "Selection of Ultrahigh-Dimensional Interactions Using $L_0$ Penalty", "link": "https://arxiv.org/abs/2412.03918", "description": "Selecting interactions from an ultrahigh-dimensional statistical model with $n$ observations and $p$ variables when $p\\gg n$ is difficult because the number of candidates for interactions is $p(p-1)/2$ and a selected model should satisfy the strong hierarchical (SH) restriction. A new method called the SHL0 is proposed to overcome the difficulty. The objective function of the SHL0 method is composed of a loglikelihood function and an $L_0$ penalty. A well-known approach in theoretical computer science called local combinatorial optimization is used to optimize the objective function. We show that any local solution of the SHL0 is consistent and enjoys the oracle properties, implying that it is unnecessary to use a global solution in practice. Three additional advantages are: a tuning parameter is used to penalize the main effects and interactions; the tuning parameter can be derived in closed form; and the idea can be extended to arbitrary ultrahigh-dimensional statistical models. The proposed method is more flexible than the previous methods for selecting interactions. 
A simulation study shows that the proposed SHL0 outperforms its competitors.", "authors": "Zhang"}, "https://arxiv.org/abs/2412.03975": {"title": "A shiny app for modeling the lifetime in primary breast cancer patients through phase-type distributions", "link": "https://arxiv.org/abs/2412.03975", "description": "Phase-type distributions (PHDs), which are defined as the distribution of the lifetime up to absorption in an absorbing Markov chain, are an appropriate candidate to model the lifetime of any system, since any non-negative probability distribution can be approximated by a PHD with sufficient precision. Despite the potential of PHDs, user-friendly statistical programs do not have a module implemented in their interfaces to handle PHDs. Thus, researchers must turn to other statistical software such as R, Matlab or Python, which work with the compilation of code chunks and functions. This fact might be an important handicap for those researchers who do not have sufficient knowledge in programming environments. In this paper, a new interactive web application developed with shiny is introduced in order to fit PHDs to an experimental dataset. This open access app does not require any kind of knowledge about programming or major mathematical concepts. Users can easily compare the graphical fit of several PHDs while estimating their parameters and assess the goodness of fit with just several clicks. All these functionalities are exhibited by means of a numerical simulation and by modeling the lifetime since diagnosis in primary breast cancer patients.", "authors": "Acal, Contreras, Montero et al"}, "https://arxiv.org/abs/2412.04109": {"title": "Pseudo-Observations for Bivariate Survival Data", "link": "https://arxiv.org/abs/2412.04109", "description": "The pseudo-observations approach has been gaining popularity as a method to estimate covariate effects on censored survival data. It is used regularly to estimate covariate effects on quantities such as survival probabilities, restricted mean life, cumulative incidence, and others. In this work, we propose to generalize the pseudo-observations approach to situations where a bivariate failure-time variable is observed, subject to right censoring. The idea is to first estimate the joint survival function of both failure times and then use it to define the relevant pseudo-observations. Once the pseudo-observations are calculated, they are used as the response in a generalized linear model. We consider two common nonparametric estimators of the joint survival function: the estimator of Lin and Ying (1993) and the Dabrowska estimator (Dabrowska, 1988). For both estimators, we show that our bivariate pseudo-observations approach produces regression estimates that are consistent and asymptotically normal. Our proposed method enables estimation of covariate effects on quantities such as the joint survival probability at a fixed bivariate time point, or simultaneously at several time points, and consequently can estimate covariate-adjusted conditional survival probabilities. We demonstrate the method using simulations and an analysis of two real-world datasets.", "authors": "Travis-Lumer, Mandel, Betensky"}, "https://arxiv.org/abs/2412.04265": {"title": "On Extrapolation of Treatment Effects in Multiple-Cutoff Regression Discontinuity Designs", "link": "https://arxiv.org/abs/2412.04265", "description": "Regression discontinuity (RD) designs typically identify the treatment effect at a single cutoff point. 
But when and how can we learn about treatment effects away from the cutoff? This paper addresses this question within a multiple-cutoff RD framework. We begin by examining the plausibility of the constant bias assumption proposed by Cattaneo, Keele, Titiunik, and Vazquez-Bare (2021) through the lens of rational decision-making behavior, which suggests that a kind of similarity between groups and whether individuals can influence the running variable are important factors. We then introduce an alternative set of assumptions and propose a broadly applicable partial identification strategy. The potential applicability and usefulness of the proposed bounds are illustrated through two empirical examples.", "authors": "Okamoto, Ozaki"}, "https://arxiv.org/abs/2412.04275": {"title": "Scoping review of methodology for aiding generalisability and transportability of clinical prediction models", "link": "https://arxiv.org/abs/2412.04275", "description": "Generalisability and transportability of clinical prediction models (CPMs) refer to their ability to maintain predictive performance when applied to new populations. While CPMs may show good generalisability or transportability to a specific new population, it is rare for a CPM to be developed using methods that prioritise good generalisability or transportability. There is an emerging literature on such techniques; therefore, this scoping review aims to summarise the main methodological approaches, assumptions, advantages, disadvantages and future development of methodology aiding generalisability/transportability. Relevant articles were systematically searched from the MEDLINE, Embase, medRxiv and arXiv databases until September 2023 using a predefined set of search terms. Extracted information included methodology description, assumptions, applied examples, advantages and disadvantages. The searches found 1,761 articles; 172 were retained for full text screening; 18 were finally included. We categorised the methodologies according to whether they were data-driven or knowledge-driven, and whether they are specifically tailored to the target population. Data-driven approaches range from data augmentation to ensemble methods and density ratio weighting, while knowledge-driven strategies rely on causal methodology. Future research could focus on comparing such methodologies on simulated and real datasets to identify their strengths and specific applicability, as well as synthesising these approaches to enhance their practical usefulness.", "authors": "Ploddi, Sperrin, Martin et al"}, "https://arxiv.org/abs/2412.04293": {"title": "Cubic-based Prediction Approach for Large Volatility Matrix using High-Frequency Financial Data", "link": "https://arxiv.org/abs/2412.04293", "description": "In this paper, we develop a novel method for predicting future large volatility matrices based on high-dimensional factor-based It\\^o processes. Several studies have proposed volatility matrix prediction methods using parametric models to account for volatility dynamics. However, these methods often impose restrictions, such as constant eigenvectors over time. To generalize the factor structure, we construct a cubic (order-3 tensor) form of an integrated volatility matrix process, which can be decomposed into low-rank tensor and idiosyncratic tensor components. To predict conditional expected large volatility matrices, we introduce the Projected Tensor Principal Orthogonal componEnt Thresholding (PT-POET) procedure and establish its asymptotic properties. 
Finally, the advantages of PT-POET are also verified by a simulation study and illustrated by an application to minimum variance portfolio allocation using high-frequency trading data.", "authors": "Choi, Kim"}, "https://arxiv.org/abs/2412.03723": {"title": "Bayesian Perspective for Orientation Estimation in Cryo-EM and Cryo-ET", "link": "https://arxiv.org/abs/2412.03723", "description": "Accurate orientation estimation is a crucial component of 3D molecular structure reconstruction, both in single-particle cryo-electron microscopy (cryo-EM) and in the increasingly popular field of cryo-electron tomography (cryo-ET). The dominant method, which involves searching for an orientation with maximum cross-correlation relative to given templates, falls short, particularly in low signal-to-noise environments. In this work, we propose a Bayesian framework to develop a more accurate and flexible orientation estimation approach, with the minimum mean square error (MMSE) estimator as a key example. This method effectively accommodates varying structural conformations and arbitrary rotational distributions. Through simulations, we demonstrate that our estimator consistently outperforms the cross-correlation-based method, especially in challenging conditions with low signal-to-noise ratios, and offer a theoretical framework to support these improvements. We further show that integrating our estimator into the iterative refinement in the 3D reconstruction pipeline markedly enhances overall accuracy, revealing substantial benefits across the algorithmic workflow. Finally, we show empirically that the proposed Bayesian approach enhances robustness against the ``Einstein from Noise'' phenomenon, reducing model bias and improving reconstruction reliability. These findings indicate that the proposed Bayesian framework could substantially advance cryo-EM and cryo-ET by enhancing the accuracy, robustness, and reliability of 3D molecular structure reconstruction, thereby facilitating deeper insights into complex biological systems.", "authors": "Xu, Balanov, Bendory"}, "https://arxiv.org/abs/2412.04493": {"title": "Robust Quickest Change Detection in Multi-Stream Non-Stationary Processes", "link": "https://arxiv.org/abs/2412.04493", "description": "The problem of robust quickest change detection (QCD) in non-stationary processes under a multi-stream setting is studied. In classical QCD theory, optimal solutions are developed to detect a sudden change in the distribution of stationary data. Most studies have focused on single-stream data. In non-stationary processes, the data distribution both before and after the change varies with time and is not precisely known. Multi-dimensional data further complicate these issues. It is shown that if the non-stationary family for each dimension or stream has a least favorable law (LFL) or distribution in a well-defined sense, then the algorithm designed using the LFLs is robust optimal. The notion of LFL defined in this work differs from the classical definitions due to the dependence of the post-change model on the change point. Examples of multi-stream non-stationary processes encountered in public health monitoring and aviation applications are provided. 
Our robust algorithm is applied to simulated and real data to show its effectiveness.", "authors": "Hou, Bidkhori, Banerjee"}, "https://arxiv.org/abs/2412.04605": {"title": "Semiparametric Bayesian Difference-in-Differences", "link": "https://arxiv.org/abs/2412.04605", "description": "This paper studies semiparametric Bayesian inference for the average treatment effect on the treated (ATT) within the difference-in-differences research design. We propose two new Bayesian methods with frequentist validity. The first one places a standard Gaussian process prior on the conditional mean function of the control group. We obtain asymptotic equivalence of our Bayesian estimator and an efficient frequentist estimator by establishing a semiparametric Bernstein-von Mises (BvM) theorem. The second method is a double robust Bayesian procedure that adjusts the prior distribution of the conditional mean function and subsequently corrects the posterior distribution of the resulting ATT. We establish a semiparametric BvM result under double robust smoothness conditions; i.e., the lack of smoothness of conditional mean functions can be compensated by high regularity of the propensity score, and vice versa. Monte Carlo simulations and an empirical application demonstrate that the proposed Bayesian DiD methods exhibit strong finite-sample performance compared to existing frequentist methods. Finally, we outline an extension to difference-in-differences with multiple periods and staggered entry.", "authors": "Breunig, Liu, Yu"}, "https://arxiv.org/abs/2412.04640": {"title": "Multi-Quantile Estimators for the parameters of Generalized Extreme Value distribution", "link": "https://arxiv.org/abs/2412.04640", "description": "We introduce and study Multi-Quantile estimators for the parameters $( \\xi, \\sigma, \\mu)$ of Generalized Extreme Value (GEV) distributions to provide a robust approach to extreme value modeling. Unlike classical estimators, such as the Maximum Likelihood Estimation (MLE) estimator and the Probability Weighted Moments (PWM) estimator, which impose strict constraints on the shape parameter $\\xi$, our estimators are always asymptotically normal and consistent across all values of the GEV parameters. The asymptotic variances of our estimators decrease with the number of quantiles increasing and can approach the Cram\\'er-Rao lower bound very closely whenever it exists. Our Multi-Quantile Estimators thus offer a more flexible and efficient alternative for practical applications. We also discuss how they can be implemented in the context of Block Maxima method.", "authors": "Lin, Kong, Azencott"}, "https://arxiv.org/abs/2412.04663": {"title": "Fairness-aware Principal Component Analysis for Mortality Forecasting and Annuity Pricing", "link": "https://arxiv.org/abs/2412.04663", "description": "Fairness-aware statistical learning is critical for data-driven decision-making to mitigate discrimination against protected attributes, such as gender, race, and ethnicity. This is especially important for high-stake decision-making, such as insurance underwriting and annuity pricing. This paper proposes a new fairness-regularized principal component analysis - Fair PCA, in the context of high-dimensional factor models. An efficient gradient descent algorithm is constructed with adaptive selection criteria for hyperparameter tuning. The Fair PCA is applied to mortality modelling to mitigate gender discrimination in annuity pricing. 
The model performance has been validated through both simulation studies and empirical data analysis.", "authors": "Huang, Shen, Yang et al"}, "https://arxiv.org/abs/2412.04736": {"title": "Modeling High-Dimensional Dependent Data in the Presence of Many Explanatory Variables and Weak Signals", "link": "https://arxiv.org/abs/2412.04736", "description": "This article considers a novel and widely applicable approach to modeling high-dimensional dependent data when a large number of explanatory variables are available and the signal-to-noise ratio is low. We postulate that a $p$-dimensional response series is the sum of a linear regression with many observable explanatory variables and an error term driven by some latent common factors and an idiosyncratic noise. The common factors have dynamic dependence whereas the covariance matrix of the idiosyncratic noise can have diverging eigenvalues to handle the situation of low signal-to-noise ratio commonly encountered in applications. The regression coefficient matrix is estimated using penalized methods when the dimensions involved are high. We apply factor modeling to the regression residuals, employ a high-dimensional white noise testing procedure to determine the number of common factors, and adopt a projected Principal Component Analysis when the signal-to-noise ratio is low. We establish asymptotic properties of the proposed method, both for fixed and diverging numbers of regressors, as $p$ and the sample size $T$ approach infinity. Finally, we use simulations and empirical applications to demonstrate the efficacy of the proposed approach in finite samples.", "authors": "Gao, Tsay"}, "https://arxiv.org/abs/2412.04744": {"title": "Marginally interpretable spatial logistic regression with bridge processes", "link": "https://arxiv.org/abs/2412.04744", "description": "In including random effects to account for dependent observations, the odds ratio interpretation of logistic regression coefficients is changed from population-averaged to subject-specific. This is unappealing in many applications, motivating a rich literature on methods that maintain the marginal logistic regression structure without random effects, such as generalized estimating equations. However, for spatial data, random effect approaches are appealing in providing a full probabilistic characterization of the data that can be used for prediction. We propose a new class of spatial logistic regression models that maintain both population-averaged and subject-specific interpretations through a novel class of bridge processes for spatial random effects. These processes are shown to have appealing computational and theoretical properties, including a scale mixture of normal representation. The new methodology is illustrated with simulations and an analysis of childhood malaria prevalence data in the Gambia.", "authors": "Lee, Dunson"}, "https://arxiv.org/abs/2412.04773": {"title": "Robust and Optimal Tensor Estimation via Robust Gradient Descent", "link": "https://arxiv.org/abs/2412.04773", "description": "Low-rank tensor models are widely used in statistics and machine learning. However, most existing methods rely heavily on the assumption that data follows a sub-Gaussian distribution. To address the challenges associated with heavy-tailed distributions encountered in real-world applications, we propose a novel robust estimation procedure based on truncated gradient descent for general low-rank tensor models. 
We establish the computational convergence of the proposed method and derive optimal statistical rates under heavy-tailed distributional settings of both covariates and noise for various low-rank models. Notably, the statistical error rates are governed by a local moment condition, which captures the distributional properties of tensor variables projected onto certain low-dimensional local regions. Furthermore, we present numerical results to demonstrate the effectiveness of our method.", "authors": "Zhang, Wang, Li et al"}, "https://arxiv.org/abs/2412.04803": {"title": "Regression Analysis of Cure Rate Models with Competing Risks Subjected to Interval Censoring", "link": "https://arxiv.org/abs/2412.04803", "description": "In this work, we present two defective regression models for the analysis of interval-censored competing risk data in the presence of cured individuals, viz., defective Gompertz and defective inverse Gaussian regression models. The proposed models enable us to estimate the cure fraction directly from the model. Simultaneously, we estimate the regression parameters corresponding to each cause of failure using the method of maximum likelihood. The finite sample behaviour of the proposed models is evaluated through Monte Carlo simulation studies. We illustrate the practical applicability of the models using a real-life data set on HIV patients.", "authors": "K., P., Sankaran"}, "https://arxiv.org/abs/2412.04816": {"title": "Linear Regressions with Combined Data", "link": "https://arxiv.org/abs/2412.04816", "description": "We study best linear predictions in a context where the outcome of interest and some of the covariates are observed in two different datasets that cannot be matched. Traditional approaches obtain point identification by relying, often implicitly, on exclusion restrictions. We show that without such restrictions, coefficients of interest can still be partially identified and we derive a constructive characterization of the sharp identified set. We then build on this characterization to develop computationally simple and asymptotically normal estimators of the corresponding bounds. We show that these estimators exhibit good finite sample performances.", "authors": "D'Haultfoeuille, Gaillac, Maurel"}, "https://arxiv.org/abs/2412.04956": {"title": "Fast Estimation of the Composite Link Model for Multidimensional Grouped Counts", "link": "https://arxiv.org/abs/2412.04956", "description": "This paper presents a significant advancement in the estimation of the Composite Link Model within a penalized likelihood framework, specifically designed to address indirect observations of grouped count data. While the model is effective in these contexts, its application becomes computationally challenging in large, high-dimensional settings. To overcome this, we propose a reformulated iterative estimation procedure that leverages Generalized Linear Array Models, enabling the disaggregation and smooth estimation of latent distributions in multidimensional data. Through applications to high-dimensional mortality datasets, we demonstrate the model's capability to capture fine-grained patterns while comparing its computational performance to the conventional algorithm. 
The proposed methodology offers notable improvements in computational speed, storage efficiency, and practical applicability, making it suitable for a wide range of fields where high-dimensional data are provided in grouped formats.", "authors": "Camarda, Durb\\'an"}, "https://arxiv.org/abs/2412.05018": {"title": "Application of generalized linear models in big data: a divide and recombine (D&R) approach", "link": "https://arxiv.org/abs/2412.05018", "description": "D&R is a statistical approach designed to handle large and complex datasets. It partitions the dataset into several manageable subsets and subsequently applies the analytic method to each subset independently to obtain results. Finally, the results from each subset are combined to yield the results for the entire dataset. D&R strategies can be implemented to fit GLMs to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a different division method named sequential partitioning for D&R-based estimators on GLMs. We show that the obtained D&R estimator with the proposed standard error attains efficiency equivalent to the full-data estimate. We illustrate this on a large synthetic dataset and verify that the results from D&R are accurate and identical to those from other available R packages.", "authors": "Nayem, Biswas"}, "https://arxiv.org/abs/2412.05032": {"title": "moonboot: An R Package Implementing m-out-of-n Bootstrap Methods", "link": "https://arxiv.org/abs/2412.05032", "description": "The m-out-of-n bootstrap is a possible workaround to compute confidence intervals for bootstrap-inconsistent estimators, because it works under weaker conditions than the n-out-of-n bootstrap. It has the disadvantage, however, that it requires knowledge of an appropriate scaling factor $\\tau_n$ and that the coverage probability for finite $n$ depends on the choice of $m$. This article presents an R package moonboot which implements the computation of m-out-of-n bootstrap confidence intervals and provides functions for estimating the parameters $\\tau_n$ and $m$. By means of Monte Carlo simulations, we evaluate the different methods and compare them for different estimators.", "authors": "Dalitz, L\\\"ogler"}, "https://arxiv.org/abs/2412.05195": {"title": "Piecewise-linear modeling of multivariate geometric extremes", "link": "https://arxiv.org/abs/2412.05195", "description": "A recent development in extreme value modeling uses the geometry of the dataset to perform inference on the multivariate tail. A key quantity in this inference is the gauge function, whose values define this geometry. Methodology proposed to date for capturing the gauge function either lacks flexibility due to parametric specifications, or relies on complex neural network specifications in dimensions greater than three. We propose a semiparametric gauge function that is piecewise-linear, making it simple to interpret while providing a good approximation to the true underlying gauge function. This linearity also makes optimization tasks computationally inexpensive.
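As a companion to the moonboot entry (arXiv:2412.05032) above, here is a minimal, hedged sketch of the generic m-out-of-n bootstrap confidence interval it describes. The function name and the example estimator (the sample maximum, a classic bootstrap-inconsistent case) are illustrative only and are not the package's API.

```python
import numpy as np

def m_out_of_n_ci(x, estimator, m, tau, b=2000, alpha=0.05, seed=0):
    """Generic m-out-of-n bootstrap confidence interval (sketch).

    x         : 1-D data array of size n
    estimator : callable mapping a sample to a scalar estimate
    m         : resample size (m < n; asymptotically m -> inf, m/n -> 0)
    tau       : callable giving the scaling rate tau(k), e.g. lambda k: k
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_n = estimator(x)
    # distribution of tau(m) * (theta*_m - theta_n) over b resamples of size m
    roots = np.empty(b)
    for i in range(b):
        xb = rng.choice(x, size=m, replace=True)
        roots[i] = tau(m) * (estimator(xb) - theta_n)
    lo_q, hi_q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    # invert the root at the original scale tau(n)
    return theta_n - hi_q / tau(n), theta_n - lo_q / tau(n)

# illustrative use: endpoint estimation with the sample maximum, rate tau(k) = k
x = np.random.default_rng(1).uniform(0, 1, size=500)
print(m_out_of_n_ci(x, np.max, m=50, tau=lambda k: k))
```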
The piecewise-linear gauge function can be used to define both a radial and an angular model, allowing for the joint fitting of extremal pseudo-polar coordinates, a key aspect of this geometric framework. We further expand the toolkit for geometric extremal modeling through the estimation of high radial quantiles at given angular values via kernel density estimation. We apply the new methodology to air pollution data, which exhibits a complex extremal dependence structure.", "authors": "Campbell, Wadsworth"}, "https://arxiv.org/abs/2412.05199": {"title": "Energy Based Equality of Distributions Testing for Compositional Data", "link": "https://arxiv.org/abs/2412.05199", "description": "Few tests exist for testing the equality of two or more multivariate distributions for compositional data, perhaps due to their constrained sample space. At the moment, there is only one test suggested that relies upon random projections. We propose a novel test termed the $\\alpha$-Energy Based Test ($\\alpha$-EBT) to compare the multivariate distributions of two (or more) compositional data sets. Similar to the aforementioned test, the new test makes no parametric assumptions about the data and, based on simulation studies, it exhibits higher power.", "authors": "Sevinc, Tsagris"}, "https://arxiv.org/abs/2412.05108": {"title": "Constructing optimal treatment length strategies to maximize quality-adjusted lifetimes", "link": "https://arxiv.org/abs/2412.05108", "description": "Real-world clinical decision making is a complex process that involves balancing the risks and benefits of treatments. Quality-adjusted lifetime is a composite outcome that combines a patient's quantity and quality of life, making it an attractive outcome in clinical research. We propose methods for constructing optimal treatment length strategies to maximize this outcome. Existing methods for estimating optimal treatment strategies for survival outcomes cannot be applied to a quality-adjusted lifetime due to induced informative censoring. We propose a weighted estimating equation that adjusts for both confounding and informative censoring. We also propose a nonparametric estimator of the mean counterfactual quality-adjusted lifetime survival curve under a given treatment length strategy, where the weights are estimated using an undersmoothed sieve-based estimator. We show that the estimator is asymptotically linear and provide a data-dependent undersmoothing criterion. We apply our method to obtain the optimal time for percutaneous endoscopic gastrostomy insertion in patients with amyotrophic lateral sclerosis.", "authors": "Sun, Ertefaie, Duttweiler et al"}, "https://arxiv.org/abs/2412.05174": {"title": "Compound Gaussian Radar Clutter Model With Positive Tempered Alpha-Stable Texture", "link": "https://arxiv.org/abs/2412.05174", "description": "The compound Gaussian (CG) family of distributions has achieved great success in modeling sea clutter. This work develops a flexible-tailed CG model to improve generality in clutter modeling, by introducing the positive tempered $\\alpha$-stable (PT$\\alpha$S) distribution to model clutter texture. The PT$\\alpha$S distribution exhibits widely tunable tails by tempering the heavy tails of the positive $\\alpha$-stable (P$\\alpha$S) distribution, thus providing greater flexibility in texture modeling.
Specifically, we first develop a bivariate isotropic CG-PT$\\alpha$S complex clutter model that is defined by an explicit characteristic function, based on which the corresponding amplitude model is derived. Then, we prove that the amplitude model can be expressed as a scale mixture of Rayleighs, just as the successful compound K and Pareto models. Furthermore, a characteristic function-based method is developed to estimate the parameters of the amplitude model. Finally, real-world sea clutter data analysis indicates the amplitude model's flexibility in modeling clutter data with various tail behaviors.", "authors": "Liao, Xie, Zhou"}, "https://arxiv.org/abs/2007.10432": {"title": "Treatment Effects with Targeting Instruments", "link": "https://arxiv.org/abs/2007.10432", "description": "Multivalued treatments are commonplace in applications. We explore the use of discrete-valued instruments to control for selection bias in this setting. Our discussion revolves around the concept of targeting: which instruments target which treatments. It allows us to establish conditions under which counterfactual averages and treatment effects are point- or partially-identified for composite complier groups. We illustrate the usefulness of our framework by applying it to data from the Head Start Impact Study. Under a plausible positive selection assumption, we derive informative bounds that suggest less beneficial effects of Head Start expansions than the parametric estimates of Kline and Walters (2016).", "authors": "Lee, Salani\\'e"}, "https://arxiv.org/abs/2412.05397": {"title": "Network Structural Equation Models for Causal Mediation and Spillover Effects", "link": "https://arxiv.org/abs/2412.05397", "description": "Social network interference induces spillover effects from neighbors' exposures, and the complexity of statistical analysis increases when mediators are involved with network interference. To address various technical challenges, we develop a theoretical framework employing a structural graphical modeling approach to investigate both mediation and interference effects within network data. Our framework enables us to capture the multifaceted mechanistic pathways through which neighboring units' exposures and mediators exert direct and indirect influences on an individual's outcome. We extend the exposure mapping paradigm in the context of a random-effects network structural equation models (REN-SEM), establishing its capacity to delineate spillover effects of interest. Our proposed methodology contributions include maximum likelihood estimation for REN-SEM and inference procedures with theoretical guarantees. Such guarantees encompass consistent asymptotic variance estimators, derived under a non-i.i.d. asymptotic theory. The robustness and practical utility of our methodology are demonstrated through simulation experiments and a real-world data analysis of the Twitch Gamers Network Dataset, underscoring its effectiveness in capturing the intricate dynamics of network-mediated exposure effects. 
This work is the first to provide a rigorous theoretical framework and analytic toolboxes for the mediation analysis of network data, including a robust assessment of the interplay of mediation and interference.", "authors": "Kundu, Song"}, "https://arxiv.org/abs/2412.05463": {"title": "The BPgWSP test: a Bayesian Weibull Shape Parameter signal detection test for adverse drug reactions", "link": "https://arxiv.org/abs/2412.05463", "description": "We develop a Bayesian Power generalized Weibull shape parameter (PgWSP) test as a statistical method for signal detection of possible drug-adverse event associations using electronic health records for pharmacovigilance. The Bayesian approach allows the incorporation of prior knowledge about the likely time of occurrence along time-to-event data. The test is based on the shape parameters of the Power generalized Weibull (PgW) distribution. When both shape parameters are equal to one, the PgW distribution reduces to an exponential distribution, i.e., a constant hazard function. This is interpreted as no temporal association between drug and adverse event. The Bayesian PgWSP test involves comparing a region of practical equivalence (ROPE) around one reflecting the null hypothesis with estimated credibility intervals reflecting the posterior means of the shape parameters. The decision to raise a signal is based on the outcomes of the ROPE test and the selected combination rule for these outcomes. The development of the test requires a simulation study for tuning of the ROPE and credibility intervals to optimize specificity and sensitivity of the test. Samples are generated under various conditions, including differences in sample size, prevalence of adverse drug reactions (ADRs), and the proportion of adverse events. We explore prior assumptions reflecting the belief in the presence or absence of ADRs at different points in the observation period. Various types of ROPE, credibility intervals, and combination rules are assessed and optimal tuning parameters are identified based on the area under the curve. The tuned Bayesian PgWSP test is illustrated in a case study in which the time-dependent correlation between the intake of bisphosphonates and four adverse events is investigated.", "authors": "Dyck, Sauzet"}, "https://arxiv.org/abs/2412.05508": {"title": "Optimizing Returns from Experimentation Programs", "link": "https://arxiv.org/abs/2412.05508", "description": "Experimentation in online digital platforms is used to inform decision making. Specifically, the goal of many experiments is to optimize a metric of interest. Null hypothesis statistical testing can be ill-suited to this task, as it is indifferent to the magnitude of effect sizes and opportunity costs. Given access to a pool of related past experiments, we discuss how experimentation practice should change when the goal is optimization. We survey the literature on empirical Bayes analyses of A/B test portfolios, and single out the A/B Testing Problem (Azevedo et al., 2020) as a starting point, which treats experimentation as a constrained optimization problem. We show that the framework can be solved with dynamic programming and implemented by appropriately tuning $p$-value thresholds. Furthermore, we develop several extensions of the A/B Testing Problem and discuss the implications of these results for experimentation programs in industry.
For example, under no-cost assumptions, firms should be testing many more ideas, reducing test allocation sizes, and relaxing $p$-value thresholds away from $p = 0.05$.", "authors": "Sudijono, Ejdemyr, Lal et al"}, "https://arxiv.org/abs/2412.05621": {"title": "Minimum Sliced Distance Estimation in a Class of Nonregular Econometric Models", "link": "https://arxiv.org/abs/2412.05621", "description": "This paper proposes minimum sliced distance estimation in structural econometric models with possibly parameter-dependent supports. In contrast to likelihood-based estimation, we show that under mild regularity conditions, the minimum sliced distance estimator is asymptotically normally distributed leading to simple inference regardless of the presence/absence of parameter dependent supports. We illustrate the performance of our estimator on an auction model.", "authors": "Fan, Park"}, "https://arxiv.org/abs/2412.05664": {"title": "Property of Inverse Covariance Matrix-based Financial Adjacency Matrix for Detecting Local Groups", "link": "https://arxiv.org/abs/2412.05664", "description": "In financial applications, we often observe both global and local factors that are modeled by a multi-level factor model. When detecting unknown local group memberships under such a model, employing a covariance matrix as an adjacency matrix for local group memberships is inadequate due to the predominant effect of global factors. Thus, to detect a local group structure more effectively, this study introduces an inverse covariance matrix-based financial adjacency matrix (IFAM) that utilizes negative values of the inverse covariance matrix. We show that IFAM ensures that the edge density between different groups vanishes, while that within the same group remains non-vanishing. This reduces falsely detected connections and helps identify local group membership accurately. To estimate IFAM under the multi-level factor model, we introduce a factor-adjusted GLASSO estimator to address the prevalent global factor effect in the inverse covariance matrix. An empirical study using returns from international stocks across 20 financial markets demonstrates that incorporating IFAM effectively detects latent local groups, which helps improve the minimum variance portfolio allocation performance.", "authors": "Oh, Kim"}, "https://arxiv.org/abs/2412.05673": {"title": "A generalized Bayesian approach for high-dimensional robust regression with serially correlated errors and predictors", "link": "https://arxiv.org/abs/2412.05673", "description": "This paper presents a loss-based generalized Bayesian methodology for high-dimensional robust regression with serially correlated errors and predictors. The proposed framework employs a novel scaled pseudo-Huber (SPH) loss function, which smooths the well-known Huber loss, achieving a balance between quadratic and absolute linear loss behaviors. This flexibility enables the framework to accommodate both thin-tailed and heavy-tailed data effectively. The generalized Bayesian approach constructs a working likelihood utilizing the SPH loss that facilitates efficient and stable estimation while providing rigorous estimation uncertainty quantification for all model parameters. Notably, this allows formal statistical inference without requiring ad hoc tuning parameter selection while adaptively addressing a wide range of tail behavior in the errors. 
By specifying appropriate prior distributions for the regression coefficients -- e.g., ridge priors for small or moderate-dimensional settings and spike-and-slab priors for high-dimensional settings -- the framework ensures principled inference. We establish rigorous theoretical guarantees for the accurate estimation of underlying model parameters and the correct selection of predictor variables under sparsity assumptions for a wide range of data generating setups. Extensive simulation studies demonstrate the superiority of our approach compared to traditional quadratic and absolute linear loss-based Bayesian regression methods, highlighting its flexibility and robustness in high-dimensional and challenging data contexts.", "authors": "Chakraborty, Khare, Michailidis"}, "https://arxiv.org/abs/2412.05687": {"title": "Bootstrap Model Averaging", "link": "https://arxiv.org/abs/2412.05687", "description": "Model averaging has gained significant attention in recent years due to its ability of fusing information from different models. The critical challenge in frequentist model averaging is the choice of weight vector. The bootstrap method, known for its favorable properties, presents a new solution. In this paper, we propose a bootstrap model averaging approach that selects the weights by minimizing a bootstrap criterion. Our weight selection criterion can also be interpreted as a bootstrap aggregating. We demonstrate that the resultant estimator is asymptotically optimal in the sense that it achieves the lowest possible squared error loss. Furthermore, we establish the convergence rate of bootstrap weights tending to the theoretically optimal weights. Additionally, we derive the limiting distribution for our proposed model averaging estimator. Through simulation studies and empirical applications, we show that our proposed method often has better performance than other commonly used model selection and model averaging methods, and bootstrap variants.", "authors": "Song, Zou, Wan"}, "https://arxiv.org/abs/2412.05736": {"title": "Convolution Mode Regression", "link": "https://arxiv.org/abs/2412.05736", "description": "For highly skewed or fat-tailed distributions, mean or median-based methods often fail to capture the central tendencies in the data. Despite being a viable alternative, estimating the conditional mode given certain covariates (or mode regression) presents significant challenges. Nonparametric approaches suffer from the \"curse of dimensionality\", while semiparametric strategies often lead to non-convex optimization problems. In order to avoid these issues, we propose a novel mode regression estimator that relies on an intermediate step of inverting the conditional quantile density. In contrast to existing approaches, we employ a convolution-type smoothed variant of the quantile regression. Our estimator converges uniformly over the design points of the covariates and, unlike previous quantile-based mode regressions, is uniform with respect to the smoothing bandwidth. 
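For context on the generalized Bayesian robust regression entry (arXiv:2412.05673) above: the standard pseudo-Huber loss that the paper's scaled SPH variant builds on is, in its common textbook form (the paper's exact scaling may differ),

\[
\ell_\delta(r) \;=\; \delta^2\left(\sqrt{1 + (r/\delta)^2} - 1\right),
\]

which behaves like $r^2/2$ for $|r|\ll\delta$ (the quadratic, thin-tailed regime) and like $\delta|r|$ for $|r|\gg\delta$ (the absolute-linear, heavy-tailed regime), smoothly interpolating between the two behaviors that the ordinary Huber loss joins non-smoothly.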
Additionally, the Convolution Mode Regression is dimension-free and poses no optimization issues, and preliminary simulations suggest the estimator is normally distributed in finite samples.", "authors": "Finn, Horta"}, "https://arxiv.org/abs/2412.05744": {"title": "PICS: A sequential approach to obtain optimal designs for non-linear models leveraging closed-form solutions for faster convergence", "link": "https://arxiv.org/abs/2412.05744", "description": "D-Optimal designs for estimating parameters of response models are derived by maximizing the determinant of the Fisher information matrix. For non-linear models, the Fisher information matrix depends on the unknown parameter vector of interest, leading to the circular situation that, in order to obtain the D-optimal design, one needs knowledge of the very parameter to be estimated. One solution to this problem is to choose the design points sequentially, optimizing the D-optimality criterion using parameter estimates based on available data, followed by updating the parameter estimates using maximum likelihood estimation. On the other hand, there are many non-linear models for which closed-form results for D-optimal designs are available, but because such solutions involve the parameters to be estimated, they can only be used by substituting \"guestimates\" of parameters. In this paper, a hybrid sequential strategy called PICS (Plug into closed-form solution) is proposed that replaces the optimization of the objective function at every single step by a draw from the probability distribution induced by the known optimal design by plugging in the current estimates. Under regularity conditions, asymptotic normality of the sequence of estimators generated by this approach is established. The usefulness of this approach, in terms of saving computational time and achieving greater estimation efficiency compared to the standard sequential approach, is demonstrated with simulations conducted on two different sets of models.", "authors": "Ghosh, Khamaru, Dasgupta"}, "https://arxiv.org/abs/2412.05765": {"title": "A Two-stage Joint Modeling Approach for Multiple Longitudinal Markers and Time-to-event Data", "link": "https://arxiv.org/abs/2412.05765", "description": "Collecting multiple longitudinal measurements and time-to-event outcomes is a common practice in clinical and epidemiological studies, often focusing on exploring associations between them. Joint modeling is the standard analytical tool for such data, with several R packages available. However, as the number of longitudinal markers increases, the computational burden and convergence challenges make joint modeling increasingly impractical.\n This paper introduces a novel two-stage Bayesian approach to estimate joint models for multiple longitudinal measurements and time-to-event outcomes. The method builds on the standard two-stage framework but improves the initial stage by estimating a separate one-marker joint model for the event and each longitudinal marker, rather than relying on mixed models. These estimates are used to derive predictions of individual marker trajectories, avoiding biases from informative dropouts. In the second stage, a proportional hazards model is fitted, incorporating the predicted current values and slopes of the markers as time-dependent covariates.
To address uncertainty in the first-stage predictions, a multiple imputation technique is employed when estimating the Cox model in the second stage.\n This two-stage method allows for the analysis of numerous longitudinal markers, which is often infeasible with traditional multi-marker joint modeling. The paper evaluates the approach through simulation studies and applies it to the PBC2 dataset and a real-world dementia dataset containing 17 longitudinal markers. An R package, TSJM, implementing the method is freely available on GitHub: https://github.com/tbaghfalaki/TSJM.", "authors": "Baghfalaki, Hashemi, Helmer et al"}, "https://arxiv.org/abs/2412.05794": {"title": "Bundle Choice Model with Endogenous Regressors: An Application to Soda Tax", "link": "https://arxiv.org/abs/2412.05794", "description": "This paper proposes a Bayesian factor-augmented bundle choice model to estimate joint consumption as well as the substitutability and complementarity of multiple goods in the presence of endogenous regressors. The model extends the two primary treatments of endogeneity in existing bundle choice models: (1) endogenous market-level prices and (2) time-invariant unobserved individual heterogeneity. A Bayesian sparse factor approach is employed to capture high-dimensional error correlations that induce taste correlation and endogeneity. Time-varying factor loadings allow for more general individual-level and time-varying heterogeneity and endogeneity, while the sparsity induced by the shrinkage prior on loadings balances flexibility with parsimony. Applied to a soda tax in the context of complementarities, the new approach captures broader effects of the tax that were previously overlooked. Results suggest that a soda tax could yield additional health benefits by marginally decreasing the consumption of salty snacks along with sugary drinks, extending the health benefits beyond the reduction in sugar consumption alone.", "authors": "Sun"}, "https://arxiv.org/abs/2412.05836": {"title": "Modeling time to failure using a temporal sequence of events", "link": "https://arxiv.org/abs/2412.05836", "description": "In recent years, the requirement for real-time understanding of machine behavior has become an important objective in industrial sectors to reduce the cost of unscheduled downtime and to maximize production with expected quality. The vast majority of high-end machines are equipped with a number of sensors that can record event logs over time. In this paper, we consider an injection molding (IM) machine that manufactures plastic bottles for soft drinks. We analyze the machine log data as a sequence of three types of events: ``running with alert'', ``running without alert'', and ``failure''. A failure event leads to machine downtime and necessitates maintenance. The sensors are capable of capturing the corresponding operational conditions of the machine as well as the defined states of events. This paper presents a new model to a) predict the time to failure of the IM machine and b) identify important sensors in the system that may explain the events which in turn lead to failure.
The proposed method is more efficient than the popular competitor and can help reduce the downtime costs by controlling operational parameters in advance to prevent failures from occurring too soon.", "authors": "Pal, Koley, Ranjan et al"}, "https://arxiv.org/abs/2412.05905": {"title": "Fast QR updating methods for statistical applications", "link": "https://arxiv.org/abs/2412.05905", "description": "This paper introduces fast R updating algorithms designed for statistical applications, including regression, filtering, and model selection, where data structures change frequently. Although traditional QR decomposition is essential for matrix operations, it becomes computationally intensive when dynamically updating the design matrix in statistical models. The proposed algorithms efficiently update the R matrix without recalculating Q, significantly reducing computational costs. These algorithms provide a scalable solution for high-dimensional regression models, enhancing the feasibility of large-scale statistical analyses and model selection in data-intensive fields. Comprehensive simulation studies and real-world data applications reveal that the methods significantly reduce computational time while preserving accuracy. An extensive discussion highlights the versatility of fast R updating algorithms, illustrating their benefits across a wide range of models and applications in statistics and machine learning.", "authors": "Bernardi, Busatto, Cattelan"}, "https://arxiv.org/abs/2412.05919": {"title": "Estimating Spillover Effects in the Presence of Isolated Nodes", "link": "https://arxiv.org/abs/2412.05919", "description": "In estimating spillover effects under network interference, practitioners often use linear regression with either the number or fraction of treated neighbors as regressors. An often overlooked fact is that the latter is undefined for units without neighbors (``isolated nodes\"). The common practice is to impute this fraction as zero for isolated nodes. This paper shows that such practice introduces bias through theoretical derivations and simulations. Causal interpretations of the commonly used spillover regression coefficients are also provided.", "authors": "Kim"}, "https://arxiv.org/abs/2412.05998": {"title": "B-MASTER: Scalable Bayesian Multivariate Regression Analysis for Selecting Targeted Essential Regressors to Identify the Key Genera in Microbiome-Metabolite Relation Dynamics", "link": "https://arxiv.org/abs/2412.05998", "description": "The gut microbiome significantly influences responses to cancer therapies, including immunotherapies, primarily through its impact on the metabolome. Despite some existing studies addressing the effects of specific microbial genera on individual metabolites, there is little to no prior work focused on identifying the key microbiome components at the genus level that shape the overall metabolome profile. To bridge this gap, we introduce B-MASTER (Bayesian Multivariate regression Analysis for Selecting Targeted Essential Regressors), a fully Bayesian framework incorporating an L1 penalty to promote sparsity in the coefficient matrix and an L2 penalty to shrink coefficients for non-major covariate components simultaneously, thereby isolating essential regressors. The method is complemented with a scalable Gibbs sampling algorithm, whose computational speed increases linearly with the number of parameters and remains largely unaffected by sample size and data-specific characteristics for models of fixed dimensions. 
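To make the "Fast QR updating" entry (arXiv:2412.05905) above concrete, here is a minimal, generic sketch of the classical Givens-rotation row append that updates the R factor without recomputing Q when one observation is added to the design matrix. The function name is illustrative and this is not the paper's implementation.

```python
import numpy as np

def add_row_update_R(R, x):
    """Update the p x p upper-triangular R factor of X = QR after appending
    one new row x to X, using a Givens sweep (O(p^2); Q is never touched)."""
    p = R.shape[0]
    R = R.astype(float).copy()
    x = np.asarray(x, dtype=float).copy()
    for j in range(p):
        a, b = R[j, j], x[j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        Rj = R[j, j:].copy()
        xj = x[j:].copy()
        R[j, j:] = c * Rj + s * xj   # rotated row j of R
        x[j:] = -s * Rj + c * xj     # new row: entry j annihilated
    return R

# quick check against a full re-decomposition (rows may differ only in sign)
rng = np.random.default_rng(0)
X, xnew = rng.normal(size=(50, 4)), rng.normal(size=4)
R1 = add_row_update_R(np.linalg.qr(X)[1], xnew)
R2 = np.linalg.qr(np.vstack([X, xnew]))[1]
print(np.allclose(np.abs(R1), np.abs(R2)))  # True
```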
Notably, B-MASTER achieves full posterior inference for models with up to four million parameters within a practical time-frame. Using this approach, we identify key microbial genera influencing the overall metabolite profile, conduct an in-depth analysis of their effects on the most abundant metabolites, and investigate metabolites differentially abundant in colorectal cancer patients. These results provide foundational insights into the impact of the microbiome at the genus level on metabolite profiles relevant to cancer, a relationship that remains largely unexplored in the existing literature.", "authors": "Das, Dey, Peterson et al"}, "https://arxiv.org/abs/2412.06018": {"title": "Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research", "link": "https://arxiv.org/abs/2412.06018", "description": "Longitudinal passive sensing studies for health and behavior outcomes often have missing and incomplete data. Handling missing data effectively is thus a critical data processing and modeling step. Our formative interviews with researchers working in longitudinal health and behavior passive sensing revealed a recurring theme: most researchers consider imputation a low-priority step in their analysis and inference pipeline, opting to use simple and off-the-shelf imputation strategies without comprehensively evaluating its impact on study outcomes. Through this paper, we call attention to the importance of imputation. Using publicly available passive sensing datasets for depression, we show that prioritizing imputation can significantly impact the study outcomes -- with our proposed imputation strategies resulting in up to 31% improvement in AUROC to predict depression over the original imputation strategy. We conclude by discussing the challenges and opportunities with effective imputation in longitudinal sensing studies.", "authors": "Choube, Majethia, Bhattacharya et al"}, "https://arxiv.org/abs/2412.06020": {"title": "New Additive OCBA Procedures for Robust Ranking and Selection", "link": "https://arxiv.org/abs/2412.06020", "description": "Robust ranking and selection (R&S) is an important and challenging variation of conventional R&S that seeks to select the best alternative among a finite set of alternatives. It captures the common input uncertainty in the simulation model by using an ambiguity set to include multiple possible input distributions and shifts to select the best alternative with the smallest worst-case mean performance over the ambiguity set. In this paper, we aim at developing new fixed-budget robust R&S procedures to minimize the probability of incorrect selection (PICS) under a limited sampling budget. Inspired by an additive upper bound of the PICS, we derive a new asymptotically optimal solution to the budget allocation problem. Accordingly, we design a new sequential optimal computing budget allocation (OCBA) procedure to solve robust R&S problems efficiently. We then conduct a comprehensive numerical study to verify the superiority of our robust OCBA procedure over existing ones. 
The numerical study also provides insights into the budget allocation behaviors that lead to enhanced efficiency.", "authors": "Wan, Li, Hong"}, "https://arxiv.org/abs/2412.06027": {"title": "A Generalized Mixture Cure Model Incorporating Known Cured Individuals", "link": "https://arxiv.org/abs/2412.06027", "description": "Mixture Cure (MC) models constitute an appropriate and easily interpretable method when studying a time-to-event variable in a population comprised of both susceptible and cured individuals. In the literature, these models usually assume that the latter are unobservable. However, there are cases in which a cured individual may be identified. For example, when studying distant metastasis during the lifetime or miscarriage during pregnancy, individuals that have died without a metastasis or have given birth are certainly non-susceptible. The same also holds when studying x-year overall survival or death during a hospital stay. Common MC models ignore this information and consider them all censored, thus risking the assignment of low immune probabilities to cured individuals. In this study, we consider an MC model that incorporates known information on cured individuals, with the time to cure identification being either deterministic or stochastic. We use the expectation-maximization algorithm to derive the maximum likelihood estimators. Furthermore, we compare different strategies that account for cure information, such as (1) assigning infinite event times to known cured cases and adjusting the traditional model, and (2) considering only the probability of cure identification but ignoring the time until that happens. Theoretical results and simulations demonstrate the value of the proposed model especially when the time to cure identification is stochastic, increasing precision and decreasing the mean squared error. On the other hand, the traditional models that ignore the known cured information perform well when cure is achieved after a known cutoff point. Moreover, the different strategies are compared through simulations as possible alternatives to the complete-information model.", "authors": "Karakatsoulis"}, "https://arxiv.org/abs/2412.06092": {"title": "Density forecast transformations", "link": "https://arxiv.org/abs/2412.06092", "description": "The popular choice of using a $direct$ forecasting scheme implies that the individual predictions do not contain information on cross-horizon dependence. However, this dependence is needed if the forecaster has to construct, based on $direct$ density forecasts, predictive objects that are functions of several horizons ($e.g.$ when constructing annual-average growth rates from quarter-on-quarter growth rates). To address this issue, we propose to use copulas to combine the individual $h$-step-ahead predictive distributions into a joint predictive distribution. Our method is particularly appealing to practitioners for whom changing the $direct$ forecasting specification is too costly. In a Monte Carlo study, we demonstrate that our approach leads to a better approximation of the true density than an approach that ignores the potential dependence.
We show the superior performance of our method in several empirical examples, where we construct (i) quarterly forecasts using month-on-month $direct$ forecasts, (ii) annual-average forecasts using monthly year-on-year $direct$ forecasts, and (iii) annual-average forecasts using quarter-on-quarter $direct$ forecasts.", "authors": "Mogliani, Odendahl"}, "https://arxiv.org/abs/2412.06098": {"title": "Bayesian Clustering Prior with Overlapping Indices for Effective Use of Multisource External Data", "link": "https://arxiv.org/abs/2412.06098", "description": "The use of external data in clinical trials offers numerous advantages, such as reducing the number of patients, increasing study power, and shortening trial durations. In Bayesian inference, information in external data can be transferred into an informative prior for future borrowing (i.e., prior synthesis). However, multisource external data often exhibits heterogeneity, which can lead to information distortion during the prior synthesis. Clustering helps identify the heterogeneity, enhancing the congruence between the synthesized prior and the external data and thereby preventing information distortion. Obtaining optimal clustering is challenging due to the trade-off between congruence with external data and robustness to future data. We introduce two overlapping indices: the overlapping clustering index (OCI) and the overlapping evidence index (OEI). Using these indices alongside a K-Means algorithm, the optimal clustering of external data can be identified by balancing the trade-off. Based on the clustering result, we propose a prior synthesis framework to effectively borrow information from multisource external data. By incorporating the (robust) meta-analytic predictive prior into this framework, we develop (robust) Bayesian clustering MAP priors. Simulation studies and real-data analysis demonstrate their superiority over commonly used priors in the presence of heterogeneity. Since the Bayesian clustering priors are constructed without needing data from the prospective study to be conducted, they can be applied to both study design and data analysis in clinical trials or experiments.", "authors": "Lu, Lee"}, "https://arxiv.org/abs/2412.06114": {"title": "Randomized interventional effects for semicompeting risks, with application to allogeneic stem cell transplantation study", "link": "https://arxiv.org/abs/2412.06114", "description": "In clinical studies, the risk of the terminal event can be modified by an intermediate event, resulting in semicompeting risks. To study the treatment effect on the terminal event mediated by the intermediate event, it is of interest to decompose the total effect into direct and indirect effects. Three typical frameworks for causal mediation analysis have been proposed: natural effects, separable effects and interventional effects. In this article, we extend the interventional approach to time-to-event data, and compare it with other frameworks. With no time-varying confounders, these three frameworks lead to the same identification formula. With time-varying confounders, the interventional effects framework outperforms the other two because it requires weaker assumptions and fewer restrictions on time-varying confounders for identification. We present the identification formula for interventional effects and discuss some variants of the identification assumptions.
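Related to the "Density forecast transformations" entry (arXiv:2412.06092) above, here is a minimal, hedged sketch of the generic copula idea it describes: combining h-step-ahead direct predictive marginals into joint paths. A Gaussian copula is used purely for illustration; the function name, correlation choice, and marginals are assumptions, not the paper's specification.

```python
import numpy as np
from scipy.stats import norm

def joint_paths_from_direct_forecasts(marginal_inv_cdfs, corr, n_draws=10000, seed=0):
    """Draw joint multi-horizon paths from per-horizon predictive marginals
    linked by a Gaussian copula (generic sketch).

    marginal_inv_cdfs : list of callables, one per horizon, mapping u in (0,1)
                        to a quantile of that horizon's predictive distribution
    corr              : assumed cross-horizon copula correlation matrix
    """
    rng = np.random.default_rng(seed)
    H = len(marginal_inv_cdfs)
    z = rng.multivariate_normal(np.zeros(H), corr, size=n_draws)  # latent Gaussians
    u = norm.cdf(z)                                               # copula uniforms
    return np.column_stack([marginal_inv_cdfs[h](u[:, h]) for h in range(H)])

# illustrative use with Gaussian marginals at H = 4 horizons
H = 4
corr = 0.8 ** np.abs(np.subtract.outer(np.arange(H), np.arange(H)))  # assumed AR(1)-type dependence
inv_cdfs = [lambda u, h=h: norm.ppf(u, loc=0.5, scale=1.0 + 0.1 * h) for h in range(H)]
paths = joint_paths_from_direct_forecasts(inv_cdfs, corr)
annual_avg = paths.mean(axis=1)  # predictive distribution of the multi-horizon average
```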
As an illustration, we study the effect of transplant modalities on death mediated by relapse in an allogeneic stem cell transplantation study to treat leukemia with a time-varying confounder.", "authors": "Deng, Wang"}, "https://arxiv.org/abs/2412.06119": {"title": "Sandwich regression for accurate and robust estimation in generalized linear multilevel and longitudinal models", "link": "https://arxiv.org/abs/2412.06119", "description": "Generalized linear models are a popular tool in applied statistics, with their maximum likelihood estimators enjoying asymptotic Gaussianity and efficiency. As all models are wrong, it is desirable to understand these estimators' behaviours under model misspecification. We study semiparametric multilevel generalized linear models, where only the conditional mean of the response is taken to follow a specific parametric form. Pre-existing estimators from mixed effects models and generalized estimating equations require specification of a conditional covariance, which when misspecified can result in inefficient estimates of fixed effects parameters. It is nevertheless often computationally attractive to consider a restricted, finite dimensional class of estimators, as these models naturally imply. We introduce sandwich regression, which selects the estimator of minimal variance within a parametric class of estimators over all distributions in the full semiparametric model. We demonstrate numerically on simulated and real data the attractive improvements our sandwich regression approach enjoys over classical mixed effects models and generalized estimating equations.", "authors": "Young, Shah"}, "https://arxiv.org/abs/2412.06311": {"title": "SID: A Novel Class of Nonparametric Tests of Independence for Censored Outcomes", "link": "https://arxiv.org/abs/2412.06311", "description": "We propose a new class of metrics, called the survival independence divergence (SID), to test dependence between a right-censored outcome and covariates. A key technique for deriving the SIDs is to use a counting process strategy, which equivalently transforms the intractable independence test due to the presence of censoring into a test problem for complete observations. The SIDs are equal to zero if and only if the right-censored response and covariates are independent, and they are capable of detecting various types of nonlinear dependence. We propose empirical estimates of the SIDs and establish their asymptotic properties. We further develop a wild bootstrap method to estimate the critical values and show the consistency of the bootstrap tests. The numerical studies demonstrate that our SID-based tests are highly competitive with existing methods in a wide range of settings.", "authors": "Li, Liu, You et al"}, "https://arxiv.org/abs/2412.06442": {"title": "Efficiency of nonparametric superiority tests based on restricted mean survival time versus the log-rank test under proportional hazards", "link": "https://arxiv.org/abs/2412.06442", "description": "Background: For RCTs with time-to-event endpoints, proportional hazards (PH) models are typically used to estimate treatment effects and log-rank tests are commonly used for hypothesis testing. There is growing support for replacing this approach with a model-free estimand and assumption-lean analysis method. One alternative is to base the analysis on the difference in restricted mean survival time (RMST) at a specific time, a single-number summary measure that can be defined without any restrictive assumptions on the outcome model.
In a simple setting without covariates, an assumption-lean analysis can be achieved using nonparametric methods such as Kaplan Meier estimation. The main advantage of moving to a model-free summary measure and assumption-lean analysis is that the validity and interpretation of conclusions do not depend on the PH assumption. The potential disadvantage is that the nonparametric analysis may lose efficiency under PH. There is disagreement in recent literature on this issue. Methods: Asymptotic results and simulations are used to compare the efficiency of a log-rank test against a nonparametric analysis of the difference in RMST in a superiority trial under PH. Previous studies have separately examined the effect of event rates and the censoring distribution on relative efficiency. This investigation clarifies conflicting results from earlier research by exploring the joint effect of event rate and censoring distribution together. Several illustrative examples are provided. Results: In scenarios with high event rates and/or substantial censoring across a large proportion of the study window, and when both methods make use of the same amount of data, relative efficiency is close to unity. However, in cases with low event rates but when censoring is concentrated at the end of the study window, the PH analysis has a considerable efficiency advantage.", "authors": "Magirr, Wang, Deng et al"}, "https://arxiv.org/abs/2412.06628": {"title": "Partial identification of principal causal effects under violations of principal ignorability", "link": "https://arxiv.org/abs/2412.06628", "description": "Principal stratification is a general framework for studying causal mechanisms involving post-treatment variables. When estimating principal causal effects, the principal ignorability assumption is commonly invoked, which we study in detail in this manuscript. Our first key contribution is studying a commonly used strategy of using parametric models to jointly model the outcome and principal strata without requiring the principal ignorability assumption. We show that even if the joint distribution of principal strata is known, this strategy necessarily leads to only partial identification of causal effects, even under very simple and correctly specified outcome models. While principal ignorability can lead to point identification in this setting, we discuss alternative, weaker assumptions and show how they lead to more informative partial identification regions. An additional contribution is that we provide theoretical support to strategies used in the literature for identifying association parameters that govern the joint distribution of principal strata. We prove that this is possible, but only if the principal ignorability assumption is violated. Additionally, due to partial identifiability of causal effects even when these association parameters are known, we show that these association parameters are only identifiable under strong parametric constraints. Lastly, we extend these results to more flexible semiparametric and nonparametric Bayesian models.", "authors": "Wu, Antonelli"}, "https://arxiv.org/abs/2412.06688": {"title": "Probabilistic Targeted Factor Analysis", "link": "https://arxiv.org/abs/2412.06688", "description": "We develop a probabilistic variant of Partial Least Squares (PLS) we call Probabilistic Targeted Factor Analysis (PTFA), which can be used to extract common factors in predictors that are useful to predict a set of predetermined target variables. 
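For the RMST entry above (arXiv:2412.06442), the estimand being compared with the log-rank test is the standard restricted mean survival time, which in its usual textbook definition (not quoted from the paper) is

\[
\mathrm{RMST}(t^*) \;=\; \int_0^{t^*} S(u)\,du,
\]

the area under the survival curve $S$ up to the truncation time $t^*$; an assumption-lean estimate replaces $S$ by the Kaplan-Meier curve, and the treatment effect is the difference in RMST between arms at the same $t^*$.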
Along with the technique, we provide an efficient expectation-maximization (EM) algorithm to learn the parameters and forecast the targets of interest. We develop a number of extensions to missing-at-random data, stochastic volatility, and mixed-frequency data for real-time forecasting. In a simulation exercise, we show that PTFA outperforms PLS at recovering the common underlying factors affecting both features and target variables delivering better in-sample fit, and providing valid forecasts under contamination such as measurement error or outliers. Finally, we provide two applications in Economics and Finance where PTFA performs competitively compared with PLS and Principal Component Analysis (PCA) at out-of-sample forecasting.", "authors": "Herculano, Montoya-Bland\\'on"}, "https://arxiv.org/abs/2412.06766": {"title": "Testing Mutual Independence in Metric Spaces Using Distance Profiles", "link": "https://arxiv.org/abs/2412.06766", "description": "This paper introduces a novel unified framework for testing mutual independence among a vector of random objects that may reside in different metric spaces, including some existing methodologies as special cases. The backbone of the proposed tests is the notion of joint distance profiles, which uniquely characterize the joint law of random objects under a mild condition on the joint law or on the metric spaces. Our test statistics measure the difference of the joint distance profiles of each data point with respect to the joint law and the product of marginal laws of the vector of random objects, where flexible data-adaptive weight profiles are incorporated for power enhancement. We derive the limiting distribution of the test statistics under the null hypothesis of mutual independence and show that the proposed tests with specific weight profiles are asymptotically distribution-free if the marginal distance profiles are continuous. We also establish the consistency of the tests under sequences of alternative hypotheses converging to the null. Furthermore, since the asymptotic tests with non-trivial weight profiles require the knowledge of the underlying data distribution, we adopt a permutation scheme to approximate the $p$-values and provide theoretical guarantees that the permutation-based tests control the type I error rate under the null and are consistent under the alternatives. We demonstrate the power of the proposed tests across various types of data objects through simulations and real data applications, where our tests are shown to have superior performance compared with popular existing approaches.", "authors": "Chen, Dubey"}, "https://arxiv.org/abs/1510.04315": {"title": "Deriving Priorities From Inconsistent PCM using the Network Algorithms", "link": "https://arxiv.org/abs/1510.04315", "description": "In several multiobjective decision problems Pairwise Comparison Matrices (PCM) are applied to evaluate the decision variants. The problem that arises very often is the inconsistency of a given PCM. In such a situation it is important to approximate the PCM with a consistent one. The most common way is to minimize the Euclidean distance between the matrices. In the paper we consider the problem of minimizing the maximum distance. After applying the logarithmic transformation we are able to formulate the obtained subproblem as a Shortest Path Problem and solve it more efficiently. 
We analyze and completely characterize the form of the set of optimal solutions and provide an algorithm that results in a unique, Pareto-efficient solution.", "authors": "Anholcer, F\\\"ul\\\"op"}, "https://arxiv.org/abs/2412.05506": {"title": "Ranking of Large Language Model with Nonparametric Prompts", "link": "https://arxiv.org/abs/2412.05506", "description": "We consider inference for the ranking of large language models (LLMs). Alignment is a major challenge in mitigating hallucinations when using LLMs. Ranking LLMs has been shown to be an effective tool for improving alignment based on the best-of-$N$ policy. In this paper, we propose a new inferential framework for testing hypotheses and constructing confidence intervals of the ranking of language models. We consider the widely adopted Bradley-Terry-Luce (BTL) model, where each item is assigned a positive preference score that determines its pairwise comparisons' outcomes. We further extend it into the contextual setting, where the score of each model varies with the prompt. We show the convergence rate of our estimator. By extending the current Gaussian multiplier bootstrap theory to accommodate the supremum of non-identically distributed empirical processes, we construct the confidence interval for ranking and propose a valid testing procedure. We also introduce the confidence diagram as a global ranking property. We conduct numerical experiments to assess the performance of our method.", "authors": "Wang, Han, Fang et al"}, "https://arxiv.org/abs/2412.05608": {"title": "Exact distribution-free tests of spherical symmetry applicable to high dimensional data", "link": "https://arxiv.org/abs/2412.05608", "description": "We develop some graph-based tests for spherical symmetry of a multivariate distribution using a method based on data augmentation. These tests are constructed using a new notion of signs and ranks that are computed along a path obtained by optimizing an objective function based on pairwise dissimilarities among the observations in the augmented data set. The resulting tests based on these signs and ranks have the exact distribution-free property, and irrespective of the dimension of the data, the null distributions of the test statistics remain the same. These tests can be conveniently used for high-dimensional data, even when the dimension is much larger than the sample size. Under appropriate regularity conditions, we prove the consistency of these tests in the high-dimensional asymptotic regime, where the dimension grows to infinity while the sample size may or may not grow with the dimension. We also propose a generalization of our methods to handle situations where the center of symmetry is not specified by the null hypothesis. Several simulated data sets and a real data set are analyzed to demonstrate the utility of the proposed tests.", "authors": "Banerjee, Ghosh"}, "https://arxiv.org/abs/2412.05759": {"title": "Leveraging Black-box Models to Assess Feature Importance in Unconditional Distribution", "link": "https://arxiv.org/abs/2412.05759", "description": "Understanding how changes in explanatory features affect the unconditional distribution of the outcome is important in many applications. However, existing black-box predictive models are not readily suited for analyzing such questions.
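For the LLM ranking entry above (arXiv:2412.05506): in the Bradley-Terry-Luce model it builds on, each item $i$ carries a positive preference score $w_i$ and, in the standard textbook form (stated here for orientation, not quoted from the paper),

\[
\Pr(i \text{ beats } j) \;=\; \frac{w_i}{w_i + w_j} \;=\; \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}, \qquad \theta_i = \log w_i,
\]

so the ranking of models is the ordering of the scores; the contextual extension described in the abstract lets $\theta_i$ vary with the prompt.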
In this work, we develop an approximation method to compute the feature importance curves relevant to the unconditional distribution of outcomes, while leveraging the power of pre-trained black-box predictive models. The feature importance curves measure the changes across quantiles of outcome distribution given an external impact of change in the explanatory features. Through extensive numerical experiments and real data examples, we demonstrate that our approximation method produces sparse and faithful results, and is computationally efficient.", "authors": "Zhou, Li"}, "https://arxiv.org/abs/2412.06042": {"title": "Infinite Mixture Models for Improved Modeling of Across-Site Evolutionary Variation", "link": "https://arxiv.org/abs/2412.06042", "description": "Scientific studies in many areas of biology routinely employ evolutionary analyses based on the probabilistic inference of phylogenetic trees from molecular sequence data. Evolutionary processes that act at the molecular level are highly variable, and properly accounting for heterogeneity in evolutionary processes is crucial for more accurate phylogenetic inference. Nucleotide substitution rates and patterns are known to vary among sites in multiple sequence alignments, and such variation can be modeled by partitioning alignments into categories corresponding to different substitution models. Determining $\\textit{a priori}$ appropriate partitions can be difficult, however, and better model fit can be achieved through flexible Bayesian infinite mixture models that simultaneously infer the number of partitions, the partition that each site belongs to, and the evolutionary parameters corresponding to each partition. Here, we consider several different types of infinite mixture models, including classic Dirichlet process mixtures, as well as novel approaches for modeling across-site evolutionary variation: hierarchical models for data with a natural group structure, and infinite hidden Markov models that account for spatial patterns in alignments. In analyses of several viral data sets, we find that different types of infinite mixture models emerge as the best choices in different scenarios. To enable these models to scale efficiently to large data sets, we adapt efficient Markov chain Monte Carlo algorithms and exploit opportunities for parallel computing. We implement this infinite mixture modeling framework in BEAST X, a widely-used software package for Bayesian phylogenetic inference.", "authors": "Gill, Baele, Suchard et al"}, "https://arxiv.org/abs/2412.06087": {"title": "Ethnography and Machine Learning: Synergies and New Directions", "link": "https://arxiv.org/abs/2412.06087", "description": "Ethnography (social scientific methods that illuminate how people understand, navigate and shape the real world contexts in which they live their lives) and machine learning (computational techniques that use big data and statistical learning models to perform quantifiable tasks) are each core to contemporary social science. Yet these tools have remained largely separate in practice. This chapter draws on a growing body of scholarship that argues that ethnography and machine learning can be usefully combined, particularly for large comparative studies. 
Specifically, this paper (a) explains the value (and challenges) of using machine learning alongside qualitative field research for certain types of projects, (b) discusses recent methodological trends to this effect, (c) provides examples that illustrate workflow drawn from several large projects, and (d) concludes with a roadmap for enabling productive coevolution of field methods and machine learning.", "authors": "Li, Abramson"}, "https://arxiv.org/abs/2412.06582": {"title": "Optimal estimation in private distributed functional data analysis", "link": "https://arxiv.org/abs/2412.06582", "description": "We systematically investigate the preservation of differential privacy in functional data analysis, beginning with functional mean estimation and extending to varying coefficient model estimation. Our work introduces a distributed learning framework involving multiple servers, each responsible for collecting several sparsely observed functions. This hierarchical setup introduces a mixed notion of privacy. Within each function, user-level differential privacy is applied to $m$ discrete observations. At the server level, central differential privacy is deployed to account for the centralised nature of data collection. Across servers, only private information is exchanged, adhering to federated differential privacy constraints. To address this complex hierarchy, we employ minimax theory to reveal several fundamental phenomena: from sparse to dense functional data analysis, from user-level to central and federated differential privacy costs, and the intricate interplay between different regimes of functional data analysis and privacy preservation.\n To the best of our knowledge, this is the first study to rigorously examine functional data estimation under multiple privacy constraints. Our theoretical findings are complemented by efficient private algorithms and extensive numerical evidence, providing a comprehensive exploration of this challenging problem.", "authors": "Xue, Lin, Yu"}, "https://arxiv.org/abs/2205.03970": {"title": "Policy Choice in Time Series by Empirical Welfare Maximization", "link": "https://arxiv.org/abs/2205.03970", "description": "This paper develops a novel method for policy choice in a dynamic setting where the available data is a multi-variate time series. Building on the statistical treatment choice framework, we propose Time-series Empirical Welfare Maximization (T-EWM) methods to estimate an optimal policy rule by maximizing an empirical welfare criterion constructed using nonparametric potential outcome time series. We characterize conditions under which T-EWM consistently learns a policy choice that is optimal in terms of conditional welfare given the time-series history. We derive a nonasymptotic upper bound for conditional welfare regret. To illustrate the implementation and uses of T-EWM, we perform simulation studies and apply the method to estimate optimal restriction rules against Covid-19.", "authors": "Kitagawa, Wang, Xu"}, "https://arxiv.org/abs/2207.12081": {"title": "A unified quantile framework for nonlinear heterogeneous transcriptome-wide associations", "link": "https://arxiv.org/abs/2207.12081", "description": "Transcriptome-wide association studies (TWAS) are powerful tools for identifying gene-level associations by integrating genome-wide association studies and gene expression data. 
However, most TWAS methods focus on linear associations between genes and traits, ignoring the complex nonlinear relationships that may be present in biological systems. To address this limitation, we propose a novel framework, QTWAS, which integrates a quantile-based gene expression model into the TWAS model, allowing for the discovery of nonlinear and heterogeneous gene-trait associations. Via comprehensive simulations and applications to both continuous and binary traits, we demonstrate that the proposed model is more powerful than conventional TWAS in identifying gene-trait associations.", "authors": "Wang, Ionita-Laza, Wei"}, "https://arxiv.org/abs/2211.05844": {"title": "Quantile Fourier Transform, Quantile Series, and Nonparametric Estimation of Quantile Spectra", "link": "https://arxiv.org/abs/2211.05844", "description": "A nonparametric method is proposed for estimating the quantile spectra and cross-spectra introduced in Li (2012; 2014) as bivariate functions of frequency and quantile level. The method is based on the quantile discrete Fourier transform (QDFT) defined by trigonometric quantile regression and the quantile series (QSER) defined by the inverse Fourier transform of the QDFT. A nonparametric spectral estimator is constructed from the autocovariance function of the QSER using the lag-window (LW) approach. Smoothing techniques are also employed to reduce the statistical variability of the LW estimator across quantiles when the underlying spectrum varies smoothly with respect to the quantile level. The performance of the proposed estimation method is evaluated through a simulation study.", "authors": "Li"}, "https://arxiv.org/abs/2211.13478": {"title": "A New Spatio-Temporal Model Exploiting Hamiltonian Equations", "link": "https://arxiv.org/abs/2211.13478", "description": "The solutions of Hamiltonian equations are known to describe the underlying phase space of a mechanical system. In this article, we propose a novel spatio-temporal model using a strategic modification of the Hamiltonian equations, incorporating appropriate stochasticity via Gaussian processes. The resultant spatio-temporal process, continuously varying with time, turns out to be nonparametric, non-stationary, non-separable, and non-Gaussian. Additionally, the lagged correlations converge to zero as the spatio-temporal lag goes to infinity. We investigate the theoretical properties of the new spatio-temporal process, including its continuity and smoothness properties. We derive methods for complete Bayesian inference using MCMC techniques in the Bayesian paradigm. The performance of our method has been compared with that of a non-stationary Gaussian process (GP) using two simulation studies, where our method shows a significant improvement over the non-stationary GP. Further, applying our new model to two real data sets revealed encouraging performance.", "authors": "Mazumder, Banerjee, Bhattacharya"}, "https://arxiv.org/abs/2305.07549": {"title": "Distribution free MMD tests for model selection with estimated parameters", "link": "https://arxiv.org/abs/2305.07549", "description": "There exist some testing procedures based on the maximum mean discrepancy (MMD) to address the challenge of model specification. However, they ignore the presence of estimated parameters in the case of composite null hypotheses. In this paper, we first illustrate the effect of parameter estimation in model specification tests based on the MMD. 
Second, we propose simple model specification and model selection tests in the case of models with estimated parameters. All our tests are asymptotically standard normal under the null, even when the true underlying distribution belongs to the competing parametric families. A simulation study and a real data analysis illustrate the performance of our tests in terms of power and level.", "authors": "Br\\\"uck, Fermanian, Min"}, "https://arxiv.org/abs/2306.11689": {"title": "Statistical Tests for Replacing Human Decision Makers with Algorithms", "link": "https://arxiv.org/abs/2306.11689", "description": "This paper proposes a statistical framework of using artificial intelligence to improve human decision making. The performance of each human decision maker is benchmarked against that of machine predictions. We replace the diagnoses made by a subset of the decision makers with the recommendation from the machine learning algorithm. We apply both a heuristic frequentist approach and a Bayesian posterior loss function approach to abnormal birth detection using a nationwide dataset of doctor diagnoses from prepregnancy checkups of reproductive age couples and pregnancy outcomes. We find that our algorithm on a test dataset results in a higher overall true positive rate and a lower false positive rate than the diagnoses made by doctors only.", "authors": "Feng, Hong, Tang et al"}, "https://arxiv.org/abs/2308.08152": {"title": "Estimating Effects of Long-Term Treatments", "link": "https://arxiv.org/abs/2308.08152", "description": "Estimating the effects of long-term treatments through A/B testing is challenging. Treatments, such as updates to product functionalities, user interface designs, and recommendation algorithms, are intended to persist within the system for a long duration of time after their initial launches. However, due to the constraints of conducting long-term experiments, practitioners often rely on short-term experimental results to make product launch decisions. It remains open how to accurately estimate the effects of long-term treatments using short-term experimental data. To address this question, we introduce a longitudinal surrogate framework that decomposes the long-term effects into functions based on user attributes, short-term metrics, and treatment assignments. We outline identification assumptions, estimation strategies, inferential techniques, and validation methods under this framework. Empirically, we demonstrate that our approach outperforms existing solutions by using data from two real-world experiments, each involving more than a million users on WeChat, one of the world's largest social networking platforms.", "authors": "(Jingjing), (Jingjing), (Jingjing) et al"}, "https://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "https://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set of likelihood components. This approach provides a flexible framework for drawing inference when the likelihood function of a statistical model is computationally intractable. While composite likelihood has computational advantages, it can still be demanding when dealing with numerous likelihood components and a large sample size. This paper tackles this challenge by employing an approximation of the conventional composite likelihood estimator, which is derived from an optimization procedure relying on stochastic gradients. 
This novel estimator is shown to be asymptotically normally distributed around the true parameter. In particular, based on the relative divergent rate of the sample size and the number of iterations of the optimization, the variance of the limiting distribution is shown to compound for two sources of uncertainty: the sampling variability of the data and the optimization noise, with the latter depending on the sampling distribution used to construct the stochastic gradients. The advantages of the proposed framework are illustrated through simulation studies on two working examples: an Ising model for binary data and a gamma frailty model for count data. Finally, a real-data application is presented, showing its effectiveness in a large-scale mental health survey.", "authors": "Alfonzetti, Bellio, Chen et al"}, "https://arxiv.org/abs/2211.16182": {"title": "Residual permutation test for regression coefficient testing", "link": "https://arxiv.org/abs/2211.16182", "description": "We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates $p$ can be up to a constant fraction of sample size $n$. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noises, whenever $p < n / 2$. Moreover, RPT is shown to be asymptotically powerful for heavy tailed noises with bounded $(1+t)$-th order moment when the true coefficient is at least of order $n^{-t/(1+t)}$ for $t \\in [0,1]$. We further proved that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.", "authors": "Wen, Wang, Wang"}, "https://arxiv.org/abs/2311.17797": {"title": "Learning to Simulate: Generative Metamodeling via Quantile Regression", "link": "https://arxiv.org/abs/2311.17797", "description": "Stochastic simulation models effectively capture complex system dynamics but are often too slow for real-time decision-making. Traditional metamodeling techniques learn relationships between simulator inputs and a single output summary statistic, such as the mean or median. These techniques enable real-time predictions without additional simulations. However, they require prior selection of one appropriate output summary statistic, limiting their flexibility in practical applications. We propose a new concept: generative metamodeling. It aims to construct a \"fast simulator of the simulator,\" generating random outputs significantly faster than the original simulator while preserving approximately equal conditional distributions. Generative metamodels enable rapid generation of numerous random outputs upon input specification, facilitating immediate computation of any summary statistic for real-time decision-making. We introduce a new algorithm, quantile-regression-based generative metamodeling (QRGMM), and establish its distributional convergence and convergence rate. 
Extensive numerical experiments demonstrate QRGMM's efficacy compared to other state-of-the-art generative algorithms in practical real-time decision-making scenarios.", "authors": "Hong, Hou, Zhang et al"}, "https://arxiv.org/abs/2401.13009": {"title": "Comparative Study of Causal Discovery Methods for Cyclic Models with Hidden Confounders", "link": "https://arxiv.org/abs/2401.13009", "description": "Nowadays, the need for causal discovery is ubiquitous. A better understanding of not just the stochastic dependencies between parts of a system, but also the actual cause-effect relations, is essential for all parts of science. Thus, the need for reliable methods to detect causal directions is growing constantly. In the last 50 years, many causal discovery algorithms have emerged, but most of them are applicable only under the assumption that the systems have no feedback loops and that they are causally sufficient, i.e. that there are no unmeasured subsystems that can affect multiple measured variables. This is unfortunate since those restrictions often cannot be presumed in practice. Feedback is an integral feature of many processes, and real-world systems are rarely completely isolated and fully measured. Fortunately, in recent years, several techniques that can cope with cyclic, causally insufficient systems have been developed. With multiple methods available, a practical application of those algorithms now requires knowledge of their respective strengths and weaknesses. Here, we focus on the problem of causal discovery for sparse linear models which are allowed to have cycles and hidden confounders. We have prepared a comprehensive and thorough comparative study of four causal discovery techniques: two versions of the LLC method [10] and two variants of the ASP-based algorithm [11]. The evaluation investigates the performance of those techniques for various experiments with multiple interventional setups and different dataset sizes.", "authors": "Lorbeer, Mohsen"}, "https://arxiv.org/abs/2412.06848": {"title": "Application of Random Matrix Theory in High-Dimensional Statistics", "link": "https://arxiv.org/abs/2412.06848", "description": "This review article provides an overview of random matrix theory (RMT) with a focus on its growing impact on the formulation and inference of statistical models and methodologies. Emphasizing applications within high-dimensional statistics, we explore key theoretical results from RMT and their role in addressing challenges associated with high-dimensional data. The discussion highlights how advances in RMT have significantly influenced the development of statistical methods, particularly in areas such as covariance matrix inference, principal component analysis (PCA), signal processing, and changepoint detection, demonstrating the close interplay between theory and practice in modern high-dimensional statistical inference.", "authors": "Bhattacharyya, Chattopadhyay, Basu"}, "https://arxiv.org/abs/2412.06870": {"title": "Variable Selection for Comparing High-dimensional Time-Series Data", "link": "https://arxiv.org/abs/2412.06870", "description": "Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. 
In applications where one time series is an output from a computationally expensive simulator, the approach may be used for validating the simulator against real data, for comparing the outputs of two simulators, and for validating a machine learning-based emulator against the simulator. With the proposed approach, the entire time interval is split into multiple subintervals, and on each subinterval, the two sample sets are compared to select variables that distinguish their distributions, and a two-sample test is performed. The validity and limitations of the proposed approach are investigated in synthetic data experiments. Its usefulness is demonstrated in an application with a particle-based fluid simulator, where a deep neural network model is compared against the simulator, and in an application with a microscopic traffic simulator, where the effects of changing the simulator's parameters on traffic flows are analysed.", "authors": "Mitsuzawa, Grossi, Bortoli et al"}, "https://arxiv.org/abs/2412.07031": {"title": "Large Language Models: An Applied Econometric Framework", "link": "https://arxiv.org/abs/2412.07031", "description": "Large language models (LLMs) are being used in economics research to form predictions, label text, simulate human responses, generate hypotheses, and even produce data for times and places where such data don't exist. While these uses are creative, are they valid? When can we abstract away from the inner workings of an LLM and simply rely on its outputs? We develop an econometric framework to answer this question. Our framework distinguishes between two types of empirical tasks. Using LLM outputs for prediction problems (including hypothesis generation) is valid under one condition: no \"leakage\" between the LLM's training dataset and the researcher's sample. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed by some text or from human subjects) requires an additional assumption: LLM outputs must be as good as the gold standard measurements they replace. Otherwise estimates can be biased, even if LLM outputs are highly accurate but not perfectly so. We document the extent to which these conditions are violated and the implications for research findings in illustrative applications to finance and political economy. We also provide guidance to empirical researchers. The only way to ensure no training leakage is to use open-source LLMs with documented training data and published weights. The only way to deal with LLM measurement error is to collect validation data and model the error structure. A corollary is that if such conditions can't be met for a candidate LLM application, our strong advice is: don't.", "authors": "Ludwig, Mullainathan, Rambachan"}, "https://arxiv.org/abs/2412.07170": {"title": "On the Consistency of Bayesian Adaptive Testing under the Rasch Model", "link": "https://arxiv.org/abs/2412.07170", "description": "This study establishes the consistency of Bayesian adaptive testing methods under the Rasch model, addressing a gap in the literature on their large-sample guarantees. Although Bayesian approaches are recognized for their finite-sample performance and their capability to circumvent issues such as the cold-start problem, rigorous proofs of their asymptotic properties, particularly in non-i.i.d. structures, remain lacking. 
We derive conditions under which the posterior distributions of latent traits converge to the true values for a sequence of given items, and demonstrate that Bayesian estimators remain robust under the mis-specification of the prior. Our analysis then extends to adaptive item selection methods in which items are chosen endogenously during the test. Additionally, we develop a Bayesian decision-theoretical framework for the item selection problem and propose a novel selection that aligns the test process with optimal estimator performance. These theoretical results provide a foundation for Bayesian methods in adaptive testing, complementing prior evidence of their finite-sample advantages.", "authors": "Yang, Wei, Chen"}, "https://arxiv.org/abs/2412.07184": {"title": "Automatic Doubly Robust Forests", "link": "https://arxiv.org/abs/2412.07184", "description": "This paper proposes the automatic Doubly Robust Random Forest (DRRF) algorithm for estimating the conditional expectation of a moment functional in the presence of high-dimensional nuisance functions. DRRF combines the automatic debiasing framework using the Riesz representer (Chernozhukov et al., 2022) with non-parametric, forest-based estimation methods for the conditional moment (Athey et al., 2019; Oprescu et al., 2019). In contrast to existing methods, DRRF does not require prior knowledge of the form of the debiasing term nor impose restrictive parametric or semi-parametric assumptions on the target quantity. Additionally, it is computationally efficient for making predictions at multiple query points and significantly reduces runtime compared to methods such as Orthogonal Random Forest (Oprescu et al., 2019). We establish the consistency and asymptotic normality results of DRRF estimator under general assumptions, allowing for the construction of valid confidence intervals. Through extensive simulations in heterogeneous treatment effect (HTE) estimation, we demonstrate the superior performance of DRRF over benchmark approaches in terms of estimation accuracy, robustness, and computational efficiency.", "authors": "Chen, Duan, Chernozhukov et al"}, "https://arxiv.org/abs/2412.07352": {"title": "Inference after discretizing unobserved heterogeneity", "link": "https://arxiv.org/abs/2412.07352", "description": "We consider a linear panel data model with nonseparable two-way unobserved heterogeneity corresponding to a linear version of the model studied in Bonhomme et al. (2022). We show that inference is possible in this setting using a straightforward two-step estimation procedure inspired by existing discretization approaches. In the first step, we construct a discrete approximation of the unobserved heterogeneity by (k-means) clustering observations separately across the individual ($i$) and time ($t$) dimensions. In the second step, we estimate a linear model with two-way group fixed effects specific to each cluster. Our approach shares similarities with methods from the double machine learning literature, as the underlying moment conditions exhibit the same type of bias-reducing properties. We provide a theoretical analysis of a cross-fitted version of our estimator, establishing its asymptotic normality at parametric rate under the condition $\\max(N,T)=o(\\min(N,T)^3)$. 
Simulation studies demonstrate that our methodology achieves excellent finite-sample performance, even when $T$ is negligible with respect to $N$.", "authors": "Beyhum, Mugnier"}, "https://arxiv.org/abs/2412.07404": {"title": "Median Based Unit Weibull Distribution (MBUW): Does the Higher Order Probability Weighted Moments (PWM) Add More Information over the Lower Order PWM in Parameter Estimation", "link": "https://arxiv.org/abs/2412.07404", "description": "In the present paper, the probability weighted moments (PWM) method for parameter estimation of the median-based unit Weibull (MBUW) distribution is discussed. The most widely used first-order PWMs are compared with higher-order PWMs for parameter estimation of the MBUW distribution. The asymptotic distribution of the PWM estimator is derived, and the comparison is illustrated with a real data analysis.", "authors": "Attia"}, "https://arxiv.org/abs/2412.07495": {"title": "A comparison of Kaplan--Meier-based inverse probability of censoring weighted regression methods", "link": "https://arxiv.org/abs/2412.07495", "description": "Weighting with the inverse probability of censoring is an approach to deal with censoring in regression analyses where the outcome may be missing due to right-censoring. In this paper, three separate approaches involving this idea in a setting where the Kaplan--Meier estimator is used for estimating the censoring probability are compared. In more detail, the three approaches involve weighted regression, regression with a weighted outcome, and regression of a jack-knife pseudo-observation based on a weighted estimator. Expressions of the asymptotic variances are given in each case and the expressions are compared to each other and to the uncensored case. In terms of low asymptotic variance, a clear winner cannot be found. Which approach will have the lowest asymptotic variance depends on the censoring distribution. Expressions of the limit of the standard sandwich variance estimator in the three cases are also provided, revealing an overestimation under the implied assumptions.", "authors": "Overgaard"}, "https://arxiv.org/abs/2412.07524": {"title": "Gaussian Process with dissolution spline kernel", "link": "https://arxiv.org/abs/2412.07524", "description": "In-vitro dissolution testing is a critical component in the quality control of manufactured drug products. The $\\mathrm{f}_2$ statistic is the standard for assessing similarity between two dissolution profiles. However, the $\\mathrm{f}_2$ statistic has known limitations: it lacks an uncertainty estimate, is a discrete-time metric, and is a biased measure, calculating the differences between profiles at discrete time points. To address these limitations, we propose a Gaussian Process (GP) with a dissolution spline kernel for dissolution profile comparison. The dissolution spline kernel is a new spline kernel using logistic functions as its basis functions, enabling the GP to capture the expected monotonic increase in dissolution curves. This results in better predictions of dissolution curves. This new GP model reduces bias in the $\\mathrm{f}_2$ calculation by allowing predictions to be interpolated in time between observed values, and provides uncertainty quantification. We assess the model's performance through simulations and real datasets, demonstrating its improvement over a previous GP-based model introduced for dissolution testing. We also show that the new model can be adapted to include dissolution-specific covariates. 
Applying the model to real ibuprofen dissolution data under various conditions, we demonstrate its ability to extrapolate curve shapes across different experimental settings.", "authors": "Murphy, Bachiller, D'Arcy et al"}, "https://arxiv.org/abs/2412.07604": {"title": "Nested exemplar latent space models for dimension reduction in dynamic networks", "link": "https://arxiv.org/abs/2412.07604", "description": "Dynamic latent space models are widely used for characterizing changes in networks and relational data over time. These models assign to each node latent attributes that characterize connectivity with other nodes, with these latent attributes dynamically changing over time. Node attributes can be organized as a three-way tensor with modes corresponding to nodes, latent space dimension, and time. Unfortunately, as the number of nodes and time points increases, the number of elements of this tensor becomes enormous, leading to computational and statistical challenges, particularly when data are sparse. We propose a new approach for massively reducing dimensionality by expressing the latent node attribute tensor as low rank. This leads to an interesting new {\\em nested exemplar} latent space model, which characterizes the node attribute tensor as dependent on low-dimensional exemplar traits for each node, weights for each latent space dimension, and exemplar curves characterizing time variation. We study properties of this framework, including expressivity, and develop efficient Bayesian inference algorithms. The approach leads to substantial advantages in simulations and applications to ecological networks.", "authors": "Kampe, Silva, Dunson et al"}, "https://arxiv.org/abs/2412.07611": {"title": "Deep Partially Linear Transformation Model for Right-Censored Survival Data", "link": "https://arxiv.org/abs/2412.07611", "description": "Although the Cox proportional hazards model is well established and extensively used in the analysis of survival data, the proportional hazards (PH) assumption may not always hold in practical scenarios. The semiparametric transformation model extends the conventional Cox model and also includes many other survival models as special cases. This paper introduces a deep partially linear transformation model (DPLTM) as a general and flexible framework for estimation, inference and prediction. The proposed method is capable of avoiding the curse of dimensionality while still retaining the interpretability of some covariates of interest. We derive the overall convergence rate of the maximum likelihood estimators, the minimax lower bound of the nonparametric deep neural network (DNN) estimator, the asymptotic normality and the semiparametric efficiency of the parametric estimator. Comprehensive simulation studies demonstrate the impressive performance of the proposed estimation procedure in terms of both estimation accuracy and prediction power, which is further validated by an application to a real-world dataset.", "authors": "Yin, Zhang, Yu"}, "https://arxiv.org/abs/2412.07649": {"title": "Machine Learning the Macroeconomic Effects of Financial Shocks", "link": "https://arxiv.org/abs/2412.07649", "description": "We propose a method to learn the nonlinear impulse responses to structural shocks using neural networks, and apply it to uncover the effects of US financial shocks. The results reveal substantial asymmetries with respect to the sign of the shock. 
Adverse financial shocks have powerful effects on the US economy, while benign shocks trigger much smaller reactions. With respect to the size of the shocks, by contrast, we find no discernible asymmetries.", "authors": "Hauzenberger, Huber, Klieber et al"}, "https://arxiv.org/abs/2412.07194": {"title": "Empirical Study on Various Symmetric Distributions for Modeling Time Series", "link": "https://arxiv.org/abs/2412.07194", "description": "This study evaluated probability distributions for modeling time series with abrupt structural changes. The Pearson type VII distribution, with an adjustable shape parameter $b$, proved versatile. The generalized Laplace distribution performed similarly to the Pearson model, occasionally surpassing it in terms of likelihood and AIC. Mixture models, including a mixture of a $\\delta$-function and a Gaussian distribution, showed potential but were less stable. Pearson type VII and extended Laplace models were deemed more reliable for general cases. Model selection depends on data characteristics and goals.", "authors": "Mathmatics)"}, "https://arxiv.org/abs/2412.07657": {"title": "Probabilistic Modelling of Multiple Long-Term Condition Onset Times", "link": "https://arxiv.org/abs/2412.07657", "description": "The co-occurrence of multiple long-term conditions (MLTC), or multimorbidity, in an individual can reduce their lifespan and severely impact their quality of life. Exploring the longitudinal patterns, e.g. clusters, of disease accrual can help better understand the genetic and environmental drivers of multimorbidity, and potentially identify individuals who may benefit from early targeted intervention. We introduce $\\textit{probabilistic modelling of onset times}$, or $\\texttt{ProMOTe}$, for clustering and forecasting MLTC trajectories. $\\texttt{ProMOTe}$ seamlessly learns from incomplete and unreliable disease trajectories that are commonplace in Electronic Health Records but often ignored in existing longitudinal clustering methods. We analyse data from 150,000 individuals in the UK Biobank and identify 50 clusters showing patterns of disease accrual that have also been reported by some recent studies. We further discuss the forecasting capabilities of the model given the history of disease accrual.", "authors": "Richards, Fleetwood, Prigge et al"}, "https://arxiv.org/abs/1912.03158": {"title": "High-frequency and heteroskedasticity identification in multicountry models: Revisiting spillovers of monetary shocks", "link": "https://arxiv.org/abs/1912.03158", "description": "We explore the international transmission of monetary policy and central bank information shocks originating from the United States and the euro area. Employing a panel vector autoregression, we use macroeconomic and financial variables across several major economies to address both static and dynamic spillovers. To identify structural shocks, we introduce a novel approach that combines external instruments with heteroskedasticity-based identification and sign restrictions. Our results suggest significant spillovers from European Central Bank and Federal Reserve policies to each other's economies, global aggregates, and other countries. 
These effects are more pronounced for central bank information shocks than for pure monetary policy shocks, and the dominance of the US in the global economy is reflected in our findings.", "authors": "Pfarrhofer, Stelzer"}, "https://arxiv.org/abs/2108.05990": {"title": "Nearly Optimal Learning using Sparse Deep ReLU Networks in Regularized Empirical Risk Minimization with Lipschitz Loss", "link": "https://arxiv.org/abs/2108.05990", "description": "We propose a sparse deep ReLU network (SDRN) estimator of the regression function obtained from regularized empirical risk minimization with a Lipschitz loss function. Our framework can be applied to a variety of regression and classification problems. We establish novel non-asymptotic excess risk bounds for our SDRN estimator when the regression function belongs to a Sobolev space with mixed derivatives. We obtain a new nearly optimal risk rate in the sense that the SDRN estimator can achieve nearly the same optimal minimax convergence rate as one-dimensional nonparametric regression with the dimension only involved in a logarithm term when the feature dimension is fixed. The estimator has a slightly slower rate when the dimension grows with the sample size. We show that the depth of the SDRN estimator grows with the sample size in logarithmic order, and the total number of nodes and weights grows in polynomial order of the sample size to have the nearly optimal risk rate. The proposed SDRN can go deeper with fewer parameters to well estimate the regression and overcome the overfitting problem encountered by conventional feed-forward neural networks.", "authors": "Huang, Liu, Ma"}, "https://arxiv.org/abs/2301.07513": {"title": "A Bayesian Nonparametric Stochastic Block Model for Directed Acyclic Graphs", "link": "https://arxiv.org/abs/2301.07513", "description": "Random graphs have been widely used in statistics, for example in network and social interaction analysis. In some applications, data may contain an inherent hierarchical ordering among its vertices, which prevents any directed edge between pairs of vertices that do not respect this order. For example, in bibliometrics, older papers cannot cite newer ones. In such situations, the resulting graph forms a Directed Acyclic Graph. In this article, we propose an extension of the popular Stochastic Block Model (SBM) to account for the presence of a latent hierarchical ordering in the data. The proposed approach includes a topological ordering in the likelihood of the model, which allows a directed edge to have positive probability only if the corresponding pair of vertices respect the order. This latent ordering is treated as an unknown parameter and endowed with a prior distribution. We describe how to formalize the model and perform posterior inference for a Bayesian nonparametric version of the SBM in which both the hierarchical ordering and the number of latent blocks are learnt from the data. Finally, an illustration with a real-world dataset from bibliometrics is presented. Additional supplementary materials are available online.", "authors": "Lee, Battiston"}, "https://arxiv.org/abs/2312.13195": {"title": "Principal Component Copulas for Capital Modelling and Systemic Risk", "link": "https://arxiv.org/abs/2312.13195", "description": "We introduce a class of copulas that we call Principal Component Copulas (PCCs). 
This class combines the strong points of copula-based techniques with principal component-based models, which results in flexibility when modelling tail dependence along the most important directions in high-dimensional data. We obtain theoretical results for PCCs that are important for practical applications. In particular, we derive tractable expressions for the high-dimensional copula density, which can be represented in terms of characteristic functions. We also develop algorithms to perform Maximum Likelihood and Generalized Method of Moment estimation in high-dimensions and show very good performance in simulation experiments. Finally, we apply the copula to the international stock market in order to study systemic risk. We find that PCCs lead to excellent performance on measures of systemic risk due to their ability to distinguish between parallel market movements, which increase systemic risk, and orthogonal movements, which reduce systemic risk. As a result, we consider the PCC promising for internal capital models, which financial institutions use to protect themselves against systemic risk.", "authors": "Gubbels, Ypma, Oosterlee"}, "https://arxiv.org/abs/2412.07787": {"title": "Anomaly Detection in California Electricity Price Forecasting: Enhancing Accuracy and Reliability Using Principal Component Analysis", "link": "https://arxiv.org/abs/2412.07787", "description": "Accurate and reliable electricity price forecasting has significant practical implications for grid management, renewable energy integration, power system planning, and price volatility management. This study focuses on enhancing electricity price forecasting in California's grid, addressing challenges from complex generation data and heteroskedasticity. Utilizing principal component analysis (PCA), we analyze CAISO's hourly electricity prices and demand from 2016-2021 to improve day-ahead forecasting accuracy. Initially, we apply traditional outlier analysis with the interquartile range method, followed by robust PCA (RPCA) for more effective outlier elimination. This approach improves data symmetry and reduces skewness. We then construct multiple linear regression models using both raw and PCA-transformed features. The model with transformed features, refined through traditional and SAS Sparse Matrix outlier removal methods, shows superior forecasting performance. The SAS Sparse Matrix method, in particular, significantly enhances model accuracy. Our findings demonstrate that PCA-based methods are key in advancing electricity price forecasting, supporting renewable integration and grid management in day-ahead markets.\n Keywords: Electricity price forecasting, principal component analysis (PCA), power system planning, heteroskedasticity, renewable energy integration.", "authors": "Nyangon, Akintunde"}, "https://arxiv.org/abs/2412.07824": {"title": "Improved Small Area Inference from Data Integration Using Global-Local Priors", "link": "https://arxiv.org/abs/2412.07824", "description": "We present and apply methodology to improve inference for small area parameters by using data from several sources. This work extends Cahoy and Sedransk (2023) who showed how to integrate summary statistics from several sources. Our methodology uses hierarchical global-local prior distributions to make inferences for the proportion of individuals in Florida's counties who do not have health insurance. 
Results from an extensive simulation study show that this methodology will provide improved inference by using several data sources. Among the five model variants evaluated, the ones using horseshoe priors for all variances have better performance than the ones using lasso priors for the local variances.", "authors": "Cahoy, Sedransk"}, "https://arxiv.org/abs/2412.07905": {"title": "Spectral Differential Network Analysis for High-Dimensional Time Series", "link": "https://arxiv.org/abs/2412.07905", "description": "Spectral networks derived from multivariate time series data arise in many domains, from brain science to Earth science. Often, it is of interest to study how these networks change under different conditions. For instance, to better understand epilepsy, it would be interesting to capture the changes in the brain connectivity network as a patient experiences a seizure, using electroencephalography data. A common approach relies on estimating the networks in each condition and calculating their difference. Such estimates may behave poorly in high dimensions as the networks themselves may not be sparse in structure while their difference may be. We build upon this observation to develop an estimator of the difference in inverse spectral densities across two conditions. Using an L1 penalty on the difference, consistency is established by only requiring the difference to be sparse. We illustrate the method on synthetic data experiments, on experiments with electroencephalography data, and on experiments with optogenetic stimulation and micro-electrocorticography data.", "authors": "Hellstern, Kim, Harchaoui et al"}, "https://arxiv.org/abs/2412.07957": {"title": "Spatial scale-aware tail dependence modeling for high-dimensional spatial extremes", "link": "https://arxiv.org/abs/2412.07957", "description": "Extreme events over large spatial domains may exhibit highly heterogeneous tail dependence characteristics, yet most existing spatial extremes models yield only one dependence class over the entire spatial domain. To accurately characterize \"data-level dependence\" in analysis of extreme events, we propose a mixture model that achieves flexible dependence properties and allows high-dimensional inference for extremes of spatial processes. We modify the popular random scale construction that multiplies a Gaussian random field by a single radial variable; we allow the radial variable to vary smoothly across space and add non-stationarity to the Gaussian process. As the level of extremeness increases, this single model exhibits both asymptotic independence at long ranges and either asymptotic dependence or independence at short ranges. We make joint inference on the dependence model and a marginal model using a copula approach within a Bayesian hierarchical model. Three different simulation scenarios show close to nominal frequentist coverage rates. Lastly, we apply the model to a dataset of extreme summertime precipitation over the central United States. We find that the joint tail of precipitation exhibits a non-stationary dependence structure that cannot be captured by limiting extreme value models or current state-of-the-art sub-asymptotic models.", "authors": "Shi, Zhang, Risser et al"}, "https://arxiv.org/abs/2412.07987": {"title": "Hypothesis Testing for High-Dimensional Matrix-Valued Data", "link": "https://arxiv.org/abs/2412.07987", "description": "This paper addresses hypothesis testing for the mean of matrix-valued data in high-dimensional settings. 
We investigate the minimum discrepancy test, originally proposed by Cragg (1997), which serves as a rank test for lower-dimensional matrices. We evaluate the performance of this test as the matrix dimensions increase proportionally with the sample size, and identify its limitations when matrix dimensions significantly exceed the sample size. To address these challenges, we propose a new test statistic tailored for high-dimensional matrix rank testing. The oracle version of this statistic is analyzed to highlight its theoretical properties. Additionally, we develop a novel approach for constructing a sparse singular value decomposition (SVD) estimator for singular vectors, providing a comprehensive examination of its theoretical aspects. Using the sparse SVD estimator, we explore the properties of the sample version of our proposed statistic. The paper concludes with simulation studies and two case studies involving surveillance video data, demonstrating the practical utility of our proposed methods.", "authors": "Cui, Li, Li et al"}, "https://arxiv.org/abs/2412.08042": {"title": "Robust and efficient estimation of time-varying treatment effects using marginal structural models dependent on partial treatment history", "link": "https://arxiv.org/abs/2412.08042", "description": "Inverse probability (IP) weighting of marginal structural models (MSMs) can provide consistent estimators of time-varying treatment effects under correct model specifications and identifiability assumptions, even in the presence of time-varying confounding. However, this method has two problems: (i) inefficiency due to IP-weights cumulating all time points and (ii) bias and inefficiency due to the MSM misspecification. To address these problems, we propose new IP-weights for estimating the parameters of the MSM dependent on partial treatment history and closed testing procedures for selecting the MSM under known IP-weights. In simulation studies, our proposed methods outperformed existing methods in terms of both performance in estimating time-varying treatment effects and in selecting the correct MSM. Our proposed methods were also applied to real data of hemodialysis patients with reasonable results.", "authors": "Seya, Taguri, Ishii"}, "https://arxiv.org/abs/2412.08051": {"title": "Two-way Node Popularity Model for Directed and Bipartite Networks", "link": "https://arxiv.org/abs/2412.08051", "description": "There has been extensive research on community detection in directed and bipartite networks. However, these studies often fail to consider the popularity of nodes in different communities, which is a common phenomenon in real-world networks. To address this issue, we propose a new probabilistic framework called the Two-Way Node Popularity Model (TNPM). The TNPM also accommodates edges from different distributions within a general sub-Gaussian family. We introduce the Delete-One-Method (DOM) for model fitting and community structure identification, and provide a comprehensive theoretical analysis with novel technical skills dealing with sub-Gaussian generalization. Additionally, we propose the Two-Stage Divided Cosine Algorithm (TSDC) to handle large-scale networks more efficiently. Our proposed methods offer multi-folded advantages in terms of estimation accuracy and computational efficiency, as demonstrated through extensive numerical studies. 
We apply our methods to two real-world applications, uncovering interesting findings.", "authors": "Jing, Li, Wang et al"}, "https://arxiv.org/abs/2412.08088": {"title": "Dynamic Classification of Latent Disease Progression with Auxiliary Surrogate Labels", "link": "https://arxiv.org/abs/2412.08088", "description": "Disease progression prediction based on patients' evolving health information is challenging when true disease states are unknown due to diagnostic capabilities or high costs. For example, the absence of gold-standard neurological diagnoses hinders distinguishing Alzheimer's disease (AD) from related conditions such as AD-related dementias (ADRDs), including Lewy body dementia (LBD). Combining temporally dependent surrogate labels and health markers may improve disease prediction. However, existing literature models informative surrogate labels and observed variables that reflect the underlying states using purely generative approaches, limiting the ability to predict future states. We propose integrating the conventional hidden Markov model as a generative model with a time-varying discriminative classification model to simultaneously handle potentially misspecified surrogate labels and incorporate important markers of disease progression. We develop an adaptive forward-backward algorithm with subjective labels for estimation, and utilize the modified posterior and Viterbi algorithms to predict the progression of future states or new patients based on objective markers only. Importantly, the adaptation eliminates the need to model the marginal distribution of longitudinal markers, a requirement in traditional algorithms. Asymptotic properties are established, and significant improvement with finite samples is demonstrated via simulation studies. Analysis of the neuropathological dataset of the National Alzheimer's Coordinating Center (NACC) shows much improved accuracy in distinguishing LBD from AD.", "authors": "Cai, Zeng, Marder et al"}, "https://arxiv.org/abs/2412.08225": {"title": "Improving Active Learning with a Bayesian Representation of Epistemic Uncertainty", "link": "https://arxiv.org/abs/2412.08225", "description": "A popular strategy for active learning is to specifically target a reduction in epistemic uncertainty, since aleatoric uncertainty is often considered as being intrinsic to the system of interest and therefore not reducible. Yet, distinguishing these two types of uncertainty remains challenging and there is no single strategy that consistently outperforms the others. We propose to use a particular combination of probability and possibility theories, with the aim of using the latter to specifically represent epistemic uncertainty, and we show how this combination leads to new active learning strategies that have desirable properties. 
In order to demonstrate the efficiency of these strategies in non-trivial settings, we introduce the notion of a possibilistic Gaussian process (GP) and consider GP-based multiclass and binary classification problems, for which the proposed methods display a strong performance for both simulated and real datasets.", "authors": "Thomas, Houssineau"}, "https://arxiv.org/abs/2412.08458": {"title": "Heavy Tail Robust Estimation and Inference for Average Treatment Effects", "link": "https://arxiv.org/abs/2412.08458", "description": "We study the probability tail properties of Inverse Probability Weighting (IPW) estimators of the Average Treatment Effect (ATE) when there is limited overlap between the covariate distributions of the treatment and control groups. Under unconfoundedness of treatment assignment conditional on covariates, such limited overlap is manifested in the propensity score for certain units being very close (but not equal) to 0 or 1. This renders IPW estimators possibly heavy tailed, and with a slower than sqrt(n) rate of convergence. Trimming or truncation is ultimately based on the covariates, ignoring important information about the inverse probability weighted random variable Z that identifies ATE by E[Z]= ATE. We propose a tail-trimmed IPW estimator whose performance is robust to limited overlap. In terms of the propensity score, which is generally unknown, we plug-in its parametric estimator in the infeasible Z, and then negligibly trim the resulting feasible Z adaptively by its large values. Trimming leads to bias if Z has an asymmetric distribution and an infinite variance, hence we estimate and remove the bias using important improvements on existing theory and methods. Our estimator sidesteps dimensionality, bias and poor correspondence properties associated with trimming by the covariates or propensity score. Monte Carlo experiments demonstrate that trimming by the covariates or the propensity score requires the removal of a substantial portion of the sample to render a low bias and close to normal estimator, while our estimator has low bias and mean-squared error, and is close to normal, based on the removal of very few sample extremes.", "authors": "Hill, Chaudhuri"}, "https://arxiv.org/abs/2412.08498": {"title": "A robust, scalable K-statistic for quantifying immune cell clustering in spatial proteomics data", "link": "https://arxiv.org/abs/2412.08498", "description": "Spatial summary statistics based on point process theory are widely used to quantify the spatial organization of cell populations in single-cell spatial proteomics data. Among these, Ripley's $K$ is a popular metric for assessing whether cells are spatially clustered or are randomly dispersed. However, the key assumption of spatial homogeneity is frequently violated in spatial proteomics data, leading to overestimates of cell clustering and colocalization. To address this, we propose a novel $K$-based method, termed \\textit{KAMP} (\\textbf{K} adjustment by \\textbf{A}nalytical \\textbf{M}oments of the \\textbf{P}ermutation distribution), for quantifying the spatial organization of cells in spatial proteomics samples. \\textit{KAMP} leverages background cells in each sample along with a new closed-form representation of the first and second moments of the permutation distribution of Ripley's $K$ to estimate an empirical null model. 
Our method is robust to inhomogeneity, computationally efficient even in large datasets, and provides approximate $p$-values for testing spatial clustering and colocalization. Methodological developments are motivated by a spatial proteomics study of 103 women with ovarian cancer, where our analysis using \textit{KAMP} shows a positive association between immune cell clustering and overall patient survival. Notably, we also find evidence that using $K$ without correcting for sample inhomogeneity may bias hazard ratio estimates in downstream analyses. \textit{KAMP} completes this analysis in just 5 minutes, compared to 538 minutes for the only competing method that adequately addresses inhomogeneity.", "authors": "Wrobel, Song"}, "https://arxiv.org/abs/2412.08533": {"title": "Rate accelerated inference for integrals of multivariate random functions", "link": "https://arxiv.org/abs/2412.08533", "description": "The computation of integrals is a fundamental task in the analysis of functional data, which are typically considered as random elements in a space of square integrable functions. Borrowing ideas from recent advances in the Monte Carlo integration literature, we propose effective unbiased estimation and inference procedures for integrals of uni- and multivariate random functions. Several applications to key problems in functional data analysis (FDA) involving random design points are studied and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and the usual algorithms for numerical integration. Moreover, the proposed estimator facilitates effective inference by generally providing better coverage with shorter confidence and prediction intervals, in both noisy and noiseless setups.", "authors": "Patilea, Wang"}, "https://arxiv.org/abs/2412.08567": {"title": "Identifiability of the instrumental variable model with the treatment and outcome missing not at random", "link": "https://arxiv.org/abs/2412.08567", "description": "The instrumental variable model of Imbens and Angrist (1994) and Angrist et al. (1996) allows for the identification of the local average treatment effect, also known as the complier average causal effect. However, many empirical studies are challenged by the missingness in the treatment and outcome. Generally, the complier average causal effect is not identifiable without further assumptions when the treatment and outcome are missing not at random. We study its identifiability even when the treatment and outcome are missing not at random. We review the existing results and provide new findings to unify the identification analysis in the literature.", "authors": "Zuo, Ding, Yang"}, "https://arxiv.org/abs/2412.07929": {"title": "Dirichlet-Neumann Averaging: The DNA of Efficient Gaussian Process Simulation", "link": "https://arxiv.org/abs/2412.07929", "description": "Gaussian processes (GPs) and Gaussian random fields (GRFs) are essential for modelling spatially varying stochastic phenomena. Yet, the efficient generation of corresponding realisations on high-resolution grids remains challenging, particularly when a large number of realisations are required. This paper presents two novel contributions. First, we propose a new methodology based on Dirichlet-Neumann averaging (DNA) to generate GPs and GRFs with isotropic covariance on regularly spaced grids. 
The combination of discrete cosine and sine transforms in the DNA sampling approach allows for rapid evaluations without the need for modification or padding of the desired covariance function. While this introduces an error in the covariance, our numerical experiments show that this error is negligible for most relevant applications, representing a trade-off between efficiency and precision. We provide explicit error estimates for Mat\\'ern covariances. The second contribution links our new methodology to the stochastic partial differential equation (SPDE) approach for sampling GRFs. We demonstrate that the concepts developed in our methodology can also guide the selection of boundary conditions in the SPDE framework. We prove that averaging specific GRFs sampled via the SPDE approach yields genuinely isotropic realisations without domain extension, with the error bounds established in the first part remaining valid.", "authors": "Mathematics, Heidelberg, Computing) et al"}, "https://arxiv.org/abs/2412.08064": {"title": "Statistical Convergence Rates of Optimal Transport Map Estimation between General Distributions", "link": "https://arxiv.org/abs/2412.08064", "description": "This paper studies the convergence rates of optimal transport (OT) map estimators, a topic of growing interest in statistics, machine learning, and various scientific fields. Despite recent advancements, existing results rely on regularity assumptions that are very restrictive in practice and much stricter than those in Brenier's Theorem, including the compactness and convexity of the probability support and the bi-Lipschitz property of the OT maps. We aim to broaden the scope of OT map estimation and fill this gap between theory and practice. Given the strong convexity assumption on Brenier's potential, we first establish the non-asymptotic convergence rates for the original plug-in estimator without requiring restrictive assumptions on probability measures. Additionally, we introduce a sieve plug-in estimator and establish its convergence rates without the strong convexity assumption on Brenier's potential, enabling the widely used cases such as the rank functions of normal or t-distributions. We also establish new Poincar\\'e-type inequalities, which are proved given sufficient conditions on the local boundedness of the probability density and mild topological conditions of the support, and these new inequalities enable us to achieve faster convergence rates for the Donsker function class. Moreover, we develop scalable algorithms to efficiently solve the OT map estimation using neural networks and present numerical experiments to demonstrate the effectiveness and robustness.", "authors": "Ding, Li, Xue"}, "https://arxiv.org/abs/2203.06743": {"title": "Bayesian Analysis of Sigmoidal Gaussian Cox Processes via Data Augmentation", "link": "https://arxiv.org/abs/2203.06743", "description": "Many models for point process data are defined through a thinning procedure where locations of a base process (often Poisson) are either kept (observed) or discarded (thinned). In this paper, we go back to the fundamentals of the distribution theory for point processes to establish a link between the base thinning mechanism and the joint density of thinned and observed locations in any of such models. In practice, the marginal model of observed points is often intractable, but thinned locations can be instantiated from their conditional distribution and typical data augmentation schemes can be employed to circumvent this problem. 
Such approaches have been employed in the recent literature, but some inconsistencies have been introduced across the different publications. We concentrate on an example: the so-called sigmoidal Gaussian Cox process. We apply our approach to resolve contradicting viewpoints in the data augmentation step of the inference procedures therein. We also provide a multitype extension to this process and conduct Bayesian inference on data consisting of positions of two different species of trees in Lansing Woods, Michigan. The emphasis is put on intertype dependence modeling with Bayesian uncertainty quantification.", "authors": "Alie, Stephens, Schmidt"}, "https://arxiv.org/abs/2210.06639": {"title": "Robust Estimation and Inference in Panels with Interactive Fixed Effects", "link": "https://arxiv.org/abs/2210.06639", "description": "We consider estimation and inference for a regression coefficient in panels with interactive fixed effects (i.e., with a factor structure). We demonstrate that existing estimators and confidence intervals (CIs) can be heavily biased and size-distorted when some of the factors are weak. We propose estimators with improved rates of convergence and bias-aware CIs that remain valid uniformly, regardless of factor strength. Our approach applies the theory of minimax linear estimation to form a debiased estimate, using a nuclear norm bound on the error of an initial estimate of the interactive fixed effects. Our resulting bias-aware CIs take into account the remaining bias caused by weak factors. Monte Carlo experiments show substantial improvements over conventional methods when factors are weak, with minimal costs to estimation accuracy when factors are strong.", "authors": "Armstrong, Weidner, Zeleneev"}, "https://arxiv.org/abs/2301.04876": {"title": "Interacting Treatments with Endogenous Takeup", "link": "https://arxiv.org/abs/2301.04876", "description": "We study causal inference in randomized experiments (or quasi-experiments) following a $2\\times 2$ factorial design. There are two treatments, denoted $A$ and $B$, and units are randomly assigned to one of four categories: treatment $A$ alone, treatment $B$ alone, joint treatment, or none. Allowing for endogenous non-compliance with the two binary instruments representing the intended assignment, as well as unrestricted interference across the two treatments, we derive the causal interpretation of various instrumental variable estimands under more general compliance conditions than in the literature. In general, if treatment takeup is driven by both instruments for some units, it becomes difficult to separate treatment interaction from treatment effect heterogeneity. We provide auxiliary conditions and various bounding strategies that may help zero in on causally interesting parameters. As an empirical illustration, we apply our results to a program randomly offering two different treatments, namely tutoring and financial incentives, to first year college students, in order to assess the treatments' effects on academic performance.", "authors": "Kormos, Lieli, Huber"}, "https://arxiv.org/abs/2308.01605": {"title": "Causal thinking for decision making on Electronic Health Records: why and how", "link": "https://arxiv.org/abs/2308.01605", "description": "Accurate predictions, as with machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, prediction can be driven by shortcuts in the data, such as racial biases. Causal thinking is needed for data-driven decisions. 
Here, we give an introduction to the key elements, focusing on routinely collected data: electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care: temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision making from real-life patient records by emulating a randomized trial before individualizing decisions, e.g., with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the various choices in studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV). We study the impact of various choices at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.", "authors": "(SODA), (MIT, USZ) et al"}, "https://arxiv.org/abs/2312.04078": {"title": "Methods for Quantifying Dataset Similarity: a Review, Taxonomy and Comparison", "link": "https://arxiv.org/abs/2312.04078", "description": "Quantifying the similarity between datasets has widespread applications in statistics and machine learning. The performance of a predictive model on novel datasets, referred to as generalizability, depends on how similar the training and evaluation datasets are. Exploiting or transferring insights between similar datasets is a key aspect of meta-learning and transfer-learning. In simulation studies, the similarity between distributions of simulated datasets and real datasets, for which the performance of methods is assessed, is crucial. In two- or $k$-sample testing, it is checked whether the underlying distributions of two or more datasets coincide.\n A great many approaches for quantifying dataset similarity have been proposed in the literature. We examine more than 100 methods and provide a taxonomy, classifying them into ten classes. In an extensive review of these methods, the main underlying ideas, formal definitions, and important properties are introduced.\n We compare the 118 methods in terms of their applicability, interpretability, and theoretical properties, in order to provide recommendations for selecting an appropriate dataset similarity measure based on the specific goal of the dataset comparison and on the properties of the datasets at hand. An online tool facilitates the choice of the appropriate dataset similarity measure.", "authors": "Stolte, Kappenberg, Rahnenf\\"uhrer et al"}, "https://arxiv.org/abs/2303.08653": {"title": "On the robustness of posterior means", "link": "https://arxiv.org/abs/2303.08653", "description": "Consider a normal location model $X \mid \theta \sim N(\theta, \sigma^2)$ with known $\sigma^2$. Suppose $\theta \sim G_0$, where the prior $G_0$ has zero mean and variance bounded by $V$. Let $G_1$ be a possibly misspecified prior with zero mean and variance bounded by $V$. 
We show that the squared error Bayes risk of the posterior mean under $G_1$ is bounded, subject to an additional tail condition on $G_1$, uniformly over $G_0, G_1, \sigma^2 > 0$.", "authors": "Chen"}, "https://arxiv.org/abs/2412.08762": {"title": "Modeling EEG Spectral Features through Warped Functional Mixed Membership Models", "link": "https://arxiv.org/abs/2412.08762", "description": "A common concern in the field of functional data analysis is the challenge of temporal misalignment, which is typically addressed using curve registration methods. Currently, most of these methods assume the data are governed by a single common shape or a finite mixture of population-level shapes. We introduce more flexibility using mixed membership models. Individual observations are assumed to partially belong to different clusters, allowing variation across multiple functional features. We propose a Bayesian hierarchical model to estimate the underlying shapes, as well as the individual time-transformation functions and levels of membership. Motivating this work is data from EEG signals in children with autism spectrum disorder (ASD). Our method agrees with the neuroimaging literature, recovering the 1/f pink noise feature distinctly from the peak in the alpha band. Furthermore, the introduction of a regression component in the estimation of time-transformation functions quantifies the effect of age and clinical designation on the location of the peak alpha frequency (PAF).", "authors": "Landry, Senturk, Jeste et al"}, "https://arxiv.org/abs/2412.08827": {"title": "A Debiased Estimator for the Mediation Functional in Ultra-High-Dimensional Setting in the Presence of Interaction Effects", "link": "https://arxiv.org/abs/2412.08827", "description": "Mediation analysis is crucial in many fields of science for understanding the mechanisms or processes through which an independent variable affects an outcome, thereby providing deeper insights into causal relationships and improving intervention strategies. Despite advances in analyzing the mediation effect with fixed/low-dimensional mediators and covariates, our understanding of estimation and inference of the mediation functional in the presence of (ultra)-high-dimensional mediators and covariates is still limited. In this paper, we present an estimator of the mediation functional in a high-dimensional setting that accommodates the interaction between covariates and treatment in generating mediators, as well as interactions between both covariates and treatment and mediators and treatment in generating the response. We demonstrate that our estimator is $\sqrt{n}$-consistent and asymptotically normal, thus enabling reliable inference on direct and indirect treatment effects with asymptotically valid confidence intervals. A key technical contribution of our work is to develop a multi-step debiasing technique, which may also be valuable in other statistical settings with similar structural complexities where accurate estimation depends on debiasing.", "authors": "Bo, Ghassami, Mukherjee"}, "https://arxiv.org/abs/2412.08828": {"title": "A Two-Stage Approach for Segmenting Spatial Point Patterns Applied to Multiplex Imaging", "link": "https://arxiv.org/abs/2412.08828", "description": "Recent advances in multiplex imaging have enabled researchers to locate different types of cells within a tissue sample. 
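The normal location setting from the posterior-robustness entry above can be probed numerically. The sketch below is an assumption-laden illustration rather than the paper's analysis: the possibly misspecified prior is restricted to N(0, V), and the true prior is a two-point distribution with the same variance; all names are invented.

```python
import numpy as np

def risk_of_normal_prior_posterior_mean(draw_theta, sigma, V, n=200_000, seed=2):
    """Monte Carlo squared-error risk of the posterior mean computed as if
    theta ~ N(0, V), when theta is actually drawn by draw_theta."""
    rng = np.random.default_rng(seed)
    theta = draw_theta(rng, n)
    x = theta + sigma * rng.normal(size=n)
    post_mean = (V / (V + sigma**2)) * x   # posterior mean under a N(0, V) prior
    return np.mean((post_mean - theta) ** 2)

V, sigma = 1.0, 1.0
correct = risk_of_normal_prior_posterior_mean(
    lambda r, n: r.normal(0.0, np.sqrt(V), n), sigma, V)
misspecified = risk_of_normal_prior_posterior_mean(
    lambda r, n: np.sqrt(V) * (2 * r.binomial(1, 0.5, n) - 1), sigma, V)
# both stay near V*sigma^2/(V+sigma^2) despite the wrong prior shape
print(round(correct, 3), round(misspecified, 3))
```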
This is especially relevant for tumor immunology, as clinical regimes corresponding to different stages of disease or responses to treatment may manifest as different spatial arrangements of tumor and immune cells. Spatial point pattern modeling can be used to partition multiplex tissue images according to these regimes. To this end, we propose a two-stage approach: first, local intensities and pair correlation functions are estimated from the spatial point pattern of cells within each image, and the pair correlation functions are reduced in dimension via spectral decomposition of the covariance function. Second, the estimates are clustered in a Bayesian hierarchical model with spatially-dependent cluster labels. The clusters correspond to regimes of interest that are present across subjects; the cluster labels segment the spatial point patterns according to those regimes. Through Markov Chain Monte Carlo sampling, we jointly estimate and quantify uncertainty in the cluster assignment and spatial characteristics of each cluster. Simulations demonstrate the performance of the method, and it is applied to a set of multiplex immunofluorescence images of diseased pancreatic tissue.", "authors": "Sheng, Reich, Staicu et al"}, "https://arxiv.org/abs/2412.08831": {"title": "Panel Stochastic Frontier Models with Latent Group Structures", "link": "https://arxiv.org/abs/2412.08831", "description": "Stochastic frontier models have attracted significant interest over the years due to their unique feature of including a distinct inefficiency term alongside the usual error term. To effectively separate these two components, strong distributional assumptions are often necessary. To overcome this limitation, numerous studies have sought to relax or generalize these models for more robust estimation. In line with these efforts, we introduce a latent group structure that accommodates heterogeneity across firms, addressing not only the stochastic frontiers but also the distribution of the inefficiency term. This framework accounts for the distinctive features of stochastic frontier models, and we propose a practical estimation procedure to implement it. Simulation studies demonstrate the strong performance of our proposed method, which is further illustrated through an application to study the cost efficiency of the U.S. commercial banking sector.", "authors": "Tomioka, Yang, Zhang"}, "https://arxiv.org/abs/2412.08857": {"title": "Dynamic prediction of an event using multiple longitudinal markers: a model averaging approach", "link": "https://arxiv.org/abs/2412.08857", "description": "Dynamic event prediction, using joint modeling of survival time and longitudinal variables, is extremely useful in personalized medicine. However, the estimation of joint models including many longitudinal markers is still a computational challenge because of the high number of random effects and parameters to be estimated. In this paper, we propose a model averaging strategy to combine predictions from several joint models for the event, including one longitudinal marker only or pairwise longitudinal markers. The prediction is computed as the weighted mean of the predictions from the one-marker or two-marker models, with the time-dependent weights estimated by minimizing the time-dependent Brier score. This method enables us to combine a large number of predictions issued from joint models to achieve a reliable and accurate individual prediction. 
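The weighting step in the model-averaging entry above (weights chosen to minimize a Brier score of the averaged prediction) can be sketched in a few lines. This toy version ignores censoring and the time-dependence of the weights, and the names and data are invented; the actual method combines dynamic predictions from fitted joint models.

```python
import numpy as np
from scipy.optimize import minimize

def brier_optimal_weights(pred, event):
    """Simplex weights w minimizing mean((pred @ w - event)^2).

    pred  : (n_subjects, n_models) predicted event probabilities at a horizon
    event : (n_subjects,) indicator of the event by that horizon
    """
    m = pred.shape[1]

    def loss(v):
        w = np.abs(v) / np.abs(v).sum()       # crude projection onto the simplex
        return np.mean((pred @ w - event) ** 2)

    res = minimize(loss, np.full(m, 1.0 / m), method="Nelder-Mead")
    return np.abs(res.x) / np.abs(res.x).sum()

# toy example: the second of three candidate models tracks the truth best
rng = np.random.default_rng(3)
event = rng.binomial(1, 0.3, size=500)
pred = np.column_stack([
    np.clip(0.3 * event + rng.normal(0.30, 0.20, 500), 0, 1),
    np.clip(0.6 * event + rng.normal(0.15, 0.10, 500), 0, 1),
    rng.uniform(0, 1, 500),
])
print(brier_optimal_weights(pred, event).round(2))
```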
Advantages and limits of the proposed methods are highlighted in a simulation study by comparison with the predictions from well-specified and misspecified all-marker joint models as well as the one-marker and two-marker joint models. Using the PBC2 data set, the method is used to predict the risk of death in patients with primary biliary cirrhosis. The method is also used to analyze a French cohort study called the 3C data. In our study, seventeen longitudinal markers are considered to predict the risk of death.", "authors": "Hashemi, Baghfalaki, Philipps et al"}, "https://arxiv.org/abs/2412.08916": {"title": "Beyond forecast leaderboards: Measuring individual model importance based on contribution to ensemble accuracy", "link": "https://arxiv.org/abs/2412.08916", "description": "Ensemble forecasts often outperform forecasts from individual standalone models, and have been used to support decision-making and policy planning in various fields. As collaborative forecasting efforts to create effective ensembles grow, so does interest in understanding individual models' relative importance in the ensemble. To this end, we propose two practical methods that measure the difference between ensemble performance when a given model is or is not included in the ensemble: a leave-one-model-out algorithm and a leave-all-subsets-of-models-out algorithm, which is based on the Shapley value. We explore the relationship between these metrics, forecast accuracy, and the similarity of errors, both analytically and through simulations. We illustrate this measure of the value a component model adds to an ensemble in the presence of other models using US COVID-19 death forecasts. This study offers valuable insight into individual models' unique features within an ensemble, which standard accuracy metrics alone cannot reveal.", "authors": "Kim, Ray, Reich"}, "https://arxiv.org/abs/2412.08934": {"title": "A cheat sheet for probability distributions of orientational data", "link": "https://arxiv.org/abs/2412.08934", "description": "The need for statistical models of orientations arises in many applications in engineering and computer science. Orientational data appear as sets of angles, unit vectors, rotation matrices or quaternions. In the field of directional statistics, a lot of advances have been made in modelling such types of data. However, only a few of these tools are used in engineering and computer science applications. Hence, this paper aims to serve as a cheat sheet for those probability distributions of orientations. Models for 1-DOF, 2-DOF and 3-DOF orientations are discussed. For each of them, expressions for the density function, fitting to data, and sampling are presented. The paper is written with a compromise between engineering and statistics in terms of notation and terminology. A Python library with functions for some of these models is provided. Using this library, two examples of applications to real data are presented.", "authors": "Lopez-Custodio"}, "https://arxiv.org/abs/2412.09166": {"title": "The square array design", "link": "https://arxiv.org/abs/2412.09166", "description": "This paper is concerned with the construction of augmented row-column designs for unreplicated trials. 
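The leave-one-model-out idea from the ensemble-importance entry above is simple enough to sketch directly; the error metric, equal weighting, and toy arrays below are illustrative assumptions rather than the paper's exact scoring rules.

```python
import numpy as np

def ensemble_loss(pred, y):
    """Mean absolute error of the equally weighted ensemble mean forecast."""
    return np.mean(np.abs(pred.mean(axis=1) - y))

def leave_one_model_out_importance(pred, y):
    """pred: (n_targets, n_models) member forecasts; y: observed values.
    Importance of model j = loss of the ensemble without j minus loss of the
    full ensemble, so positive values mean model j improves the ensemble."""
    full = ensemble_loss(pred, y)
    return np.array([ensemble_loss(np.delete(pred, j, axis=1), y) - full
                     for j in range(pred.shape[1])])

# toy check: two accurate members and one noisy member
rng = np.random.default_rng(4)
y = rng.normal(size=200)
pred = np.column_stack([y + rng.normal(0, 0.2, 200),
                        y + rng.normal(0, 0.2, 200),
                        y + rng.normal(0, 2.0, 200)])
print(leave_one_model_out_importance(pred, y).round(3))
```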
The idea is predicated on the representation of a $k \\times t$ equireplicate incomplete-block design with $t$ treatments in $t$ blocks of size $k$, termed an auxiliary block design, as a $t \\times t$ square array design with $k$ controls, where $k 0$) on the rate at which spillovers decay with the ``distance'' between units, defined in a generalized way to encompass spatial and quasi-spatial settings, e.g. where the economically relevant concept of distance is a gravity equation. Over all estimators linear in the outcomes and all cluster-randomized designs the optimal geometric rate of convergence is $n^{-\\frac{1}{2+\\frac{1}{\\eta}}}$, and this rate can be achieved using a generalized ``Scaling Clusters'' design that we provide. We then introduce the additional assumption, implicit in the OLS estimators used in recent applied studies, that potential outcomes are linear in population treatment assignments. These estimators are inconsistent for our estimand, but a refined OLS estimator is consistent and rate optimal, and performs better than IPW estimators when clusters must be small. Its finite-sample performance can be improved by incorporating prior information about the structure of spillovers. As a robust alternative to the linear approach we also provide a method to select estimator-design pairs that minimize a notion of worst-case risk when the data generating process is unknown. Finally, we provide asymptotically valid inference methods.", "authors": "Faridani, Niehaus"}, "https://arxiv.org/abs/2310.04578": {"title": "A Double Machine Learning Approach for the Evaluation of COVID-19 Vaccine Effectiveness under the Test-Negative Design: Analysis of Qu\\'ebec Administrative Data", "link": "https://arxiv.org/abs/2310.04578", "description": "The test-negative design (TND), which is routinely used for monitoring seasonal flu vaccine effectiveness (VE), has recently become integral to COVID-19 vaccine surveillance, notably in Qu\\'ebec, Canada. Some studies have addressed the identifiability and estimation of causal parameters under the TND, but efficiency bounds for nonparametric estimators of the target parameter under the unconfoundedness assumption have not yet been investigated. Motivated by the goal of improving adjustment for measured confounders when estimating COVID-19 VE among community-dwelling people aged $\\geq 60$ years in Qu\\'ebec, we propose a one-step doubly robust and locally efficient estimator called TNDDR (TND doubly robust), which utilizes cross-fitting (sample splitting) and can incorporate machine learning techniques to estimate the nuisance functions and thus improve control for measured confounders. We derive the efficient influence function (EIF) for the marginal expectation of the outcome under a vaccination intervention, explore the von Mises expansion, and establish the conditions for $\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR. The proposed estimator is supported by both theoretical and empirical justifications.", "authors": "Jiang, Talbot, Carazo et al"}, "https://arxiv.org/abs/2401.05256": {"title": "Tests of Missing Completely At Random based on sample covariance matrices", "link": "https://arxiv.org/abs/2401.05256", "description": "We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). 
We relax the problem of testing MCAR to the problem of testing the compatibility of a collection of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Our first contributions are to define a natural measure of the incompatibility of a collection of correlation matrices, which can be characterised as the optimal value of a Semi-definite Programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By analysing the concentration properties of the natural plug-in estimator for this measure, we propose a novel hypothesis test, which is calibrated via a bootstrap procedure and demonstrates power against any distribution with incompatible covariance matrices. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy tailed. Furthermore, tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest.", "authors": "Bordino, Berrett"}, "https://arxiv.org/abs/2412.09697": {"title": "Evaluating time-specific treatment effects in matched-pairs studies", "link": "https://arxiv.org/abs/2412.09697", "description": "This study develops methods for evaluating a treatment effect on a time-to-event outcome in matched-pair studies. While most methods for paired right-censored outcomes allow determining an overall treatment effect over the course of follow-up, they generally lack in providing detailed insights into how the effect changes over time. To address this gap, we propose time-specific and overall tests for paired right-censored outcomes under randomization inference. We further extend our tests to matched observational studies by developing corresponding sensitivity analysis methods to take into account departures from randomization. Simulations demonstrate the robustness of our approach against various non-proportional hazards alternatives, including a crossing survival curves scenario. We demonstrate the application of our methods using a matched observational study from the Korean Longitudinal Study of Aging (KLoSA) data, focusing on the effect of social engagement on survival.", "authors": "Lee, Lee"}, "https://arxiv.org/abs/2412.09729": {"title": "Doubly Robust Conformalized Survival Analysis with Right-Censored Data", "link": "https://arxiv.org/abs/2412.09729", "description": "We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for type-I censoring. This method imputes unobserved censoring times using a suitable model, and then analyzes the imputed data using weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. 
Empirical studies on simulated and real data sets demonstrate that our method is more robust than existing approaches in challenging settings where the survival model may be inaccurate, while achieving comparable performance in easier scenarios.", "authors": "Sesia, Svetnik"}, "https://arxiv.org/abs/2412.09786": {"title": "A class of nonparametric methods for evaluating the effect of continuous treatments on survival outcomes", "link": "https://arxiv.org/abs/2412.09786", "description": "In randomized trials and observational studies, it is often necessary to evaluate the extent to which an intervention affects a time-to-event outcome, which is only partially observed due to right censoring. For instance, in infectious disease studies, it is frequently of interest to characterize the relationship between the risk of acquiring infection with a pathogen and a previously measured biomarker of an immune response against that pathogen induced by prior infection and/or vaccination. It is common to conduct inference within a causal framework, wherein we desire to make inferences about the counterfactual probability of survival through a given time point, at any given exposure level. To determine whether a causal effect is present, one can assess if this quantity differs by exposure level. Recent work shows that, under typical causal assumptions, summaries of the counterfactual survival distribution are identifiable. Moreover, when the treatment is multi-level, these summaries are also pathwise differentiable in a nonparametric probability model, making it possible to construct estimators thereof that are unbiased and approximately normal. In cases where the treatment is continuous, the target estimand is no longer pathwise differentiable, rendering it difficult to construct well-behaved estimators without strong parametric assumptions. In this work, we extend beyond the traditional setting with multilevel interventions to develop approaches to nonparametric inference with a continuous exposure. We introduce methods for testing whether the counterfactual probability of survival through a given time point remains constant across the range of continuous exposure levels. The performance of our proposed methods is evaluated via numerical studies, and we apply our method to data from a recent pair of efficacy trials of an HIV monoclonal antibody.", "authors": "Jin, Gilbert, Hudson"}, "https://arxiv.org/abs/2412.09792": {"title": "Flexible Bayesian Nonparametric Product Mixtures for Multi-scale Functional Clustering", "link": "https://arxiv.org/abs/2412.09792", "description": "There is a rich literature on clustering functional data with applications to time-series modeling, trajectory data, and even spatio-temporal applications. However, existing methods routinely perform global clustering that enforces identical atom values within the same cluster. Such grouping may be inadequate for high-dimensional functions, where the clustering patterns may change between the more dominant high-level features and the finer resolution local features. While there is some limited literature on local clustering approaches to deal with the above problems, these methods are typically not scalable to high-dimensional functions, and their theoretical properties are not well-investigated. Focusing on basis expansions for high-dimensional functions, we propose a flexible non-parametric Bayesian approach for multi-resolution clustering. 
The proposed method imposes independent Dirichlet process (DP) priors on different subsets of basis coefficients that ultimately results in a product of DP mixture priors inducing local clustering. We generalize the approach to incorporate spatially correlated error terms when modeling random spatial functions to provide improved model fitting. An efficient Markov chain Monte Carlo (MCMC) algorithm is developed for implementation. We show posterior consistency properties under the local clustering approach that asymptotically recovers the true density of random functions. Extensive simulations illustrate the improved clustering and function estimation under the proposed method compared to classical approaches. We apply the proposed approach to a spatial transcriptomics application where the goal is to infer clusters of genes with distinct spatial patterns of expressions. Our method makes an important contribution by expanding the limited literature on local clustering methods for high-dimensional functions with theoretical guarantees.", "authors": "Yao, Kundu"}, "https://arxiv.org/abs/2412.09794": {"title": "Sequential Change Point Detection in High-dimensional Vector Auto-regressive Models", "link": "https://arxiv.org/abs/2412.09794", "description": "Sequential (online) change-point detection involves continuously monitoring time-series data and triggering an alarm when shifts in the data distribution are detected. We propose an algorithm for real-time identification of alterations in the transition matrices of high-dimensional vector autoregressive models. The algorithm estimates transition matrices and error term variances using regularization techniques applied to training data, then computes a specific test statistic to detect changes in transition matrices as new data batches arrive. We establish the asymptotic normality of the test statistic under the scenario of no change points, subject to mild conditions. An alarm is raised when the calculated test statistic exceeds a predefined quantile of the standard normal distribution. We demonstrate that, as the size of the change (jump size) increases, the test power approaches one. The effectiveness of the algorithm is validated empirically across various simulation scenarios. Finally, we present two applications of the proposed methodology: analyzing shocks in S&P 500 data and detecting the timing of seizures in EEG data.", "authors": "Tian, Safikhani"}, "https://arxiv.org/abs/2412.09804": {"title": "Unified optimal model averaging with a general loss function based on cross-validation", "link": "https://arxiv.org/abs/2412.09804", "description": "Studying unified model averaging estimation for situations with complicated data structures, we propose a novel model averaging method based on cross-validation (MACV). MACV unifies a large class of new and existing model averaging estimators and covers a very general class of loss functions. Furthermore, to reduce the computational burden caused by the conventional leave-subject/one-out cross validation, we propose a SEcond-order-Approximated Leave-one/subject-out (SEAL) cross validation, which largely improves the computation efficiency. In the context of non-independent and non-identically distributed random variables, we establish the unified theory for analyzing the asymptotic behaviors of the proposed MACV and SEAL methods, where the number of candidate models is allowed to diverge with sample size. 
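To make the monitoring scheme in the change-point entry above concrete, here is a heavily simplified sketch: a row-wise lasso fit of a VAR(1) transition matrix on training data, and a standardized residual statistic computed on each incoming batch and compared with a standard-normal quantile. The statistic, tuning values, and names are illustrative stand-ins, not the paper's test.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy.stats import norm

def fit_var1_lasso(X, alpha=0.05):
    """Row-wise lasso estimate of the VAR(1) transition matrix from X (T x p)."""
    Y, Z = X[1:], X[:-1]
    A = np.vstack([Lasso(alpha=alpha, max_iter=10_000).fit(Z, Y[:, j]).coef_
                   for j in range(X.shape[1])])
    resid = Y - Z @ A.T
    return A, resid.std(axis=0)

def batch_alarm(A, sigma, batch, level=0.01):
    """Alarm when the standardized one-step-ahead residuals of a new batch
    look too large relative to the training fit (simplified statistic)."""
    Y, Z = batch[1:], batch[:-1]
    resid = (Y - Z @ A.T) / sigma
    stat = (np.mean(resid**2) - 1.0) * np.sqrt(resid.size / 2.0)
    return stat, bool(stat > norm.ppf(1 - level))

# toy stream: stable regime for training, shifted transition matrix afterwards
rng = np.random.default_rng(5)
p, A0 = 10, 0.5 * np.eye(10)

def simulate(T, A):
    X = np.zeros((T, p))
    for t in range(1, T):
        X[t] = X[t - 1] @ A.T + rng.normal(size=p)
    return X

train = simulate(400, A0)
A_hat, sigma_hat = fit_var1_lasso(train)
print(batch_alarm(A_hat, sigma_hat, simulate(100, A0)))               # usually no alarm
print(batch_alarm(A_hat, sigma_hat, simulate(100, 0.9 * np.eye(p))))  # larger statistic
```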
To demonstrate the breadth of the proposed methodology, we exemplify four optimal model averaging estimators under four important situations, i.e., longitudinal data with discrete responses, within-cluster correlation structure modeling, conditional prediction in spatial data, and quantile regression with a potential correlation structure. We conduct extensive simulation studies and analyze real-data examples to illustrate the advantages of the proposed methods.", "authors": "Yu, Zhang, Liang"}, "https://arxiv.org/abs/2412.09830": {"title": "$L$-estimation of Claim Severity Models Weighted by Kumaraswamy Density", "link": "https://arxiv.org/abs/2412.09830", "description": "Statistical modeling of claim severity distributions is essential in insurance and risk management, where achieving a balance between robustness and efficiency in parameter estimation is critical against model contaminations. Two \\( L \\)-estimators, the method of trimmed moments (MTM) and the method of winsorized moments (MWM), are commonly used in the literature, but they are constrained by rigid weighting schemes that either discard or uniformly down-weight extreme observations, limiting their customized adaptability. This paper proposes a flexible robust \\( L \\)-estimation framework weighted by Kumaraswamy densities, offering smoothly varying observation-specific weights that preserve valuable information while improving robustness and efficiency. The framework is developed for parametric claim severity models, including Pareto, lognormal, and Fr{\\'e}chet distributions, with theoretical justifications on asymptotic normality and variance-covariance structures. Through simulations and application to a U.S. indemnity loss dataset, the proposed method demonstrates superior performance over MTM, MWM, and MLE approaches, particularly in handling outliers and heavy-tailed distributions, making it a flexible and reliable alternative for loss severity modeling.", "authors": "Poudyal, Aryal, Pokhrel"}, "https://arxiv.org/abs/2412.09845": {"title": "Addressing Positivity Violations in Extending Inference to a Target Population", "link": "https://arxiv.org/abs/2412.09845", "description": "Enhancing the external validity of trial results is essential for their applicability to real-world populations. However, violations of the positivity assumption can limit both the generalizability and transportability of findings. To address positivity violations in estimating the average treatment effect for a target population, we propose a framework that integrates characterizing the underrepresented group and performing sensitivity analysis for inference in the original target population. Our approach helps identify limitations in trial sampling and improves the robustness of trial findings for real-world populations. We apply this approach to extend findings from phase IV trials of treatments for opioid use disorder to a real-world population based on the 2021 Treatment Episode Data Set.", "authors": "Lu, Basu"}, "https://arxiv.org/abs/2412.09872": {"title": "Tail Risk Equivalent Level Transition and Its Application for Estimating Extreme $L_p$-quantiles", "link": "https://arxiv.org/abs/2412.09872", "description": "$L_p$-quantile has recently been receiving growing attention in risk management since it has desirable properties as a risk measure and is a generalization of two widely applied risk measures, Value-at-Risk and Expectile. 
The statistical methodology for the $L_p$-quantile is not only feasible but also straightforward to implement, as it represents a specific form of M-quantile using the $p$-power loss function. In this paper, we introduce the concept of Tail Risk Equivalent Level Transition (TRELT) to capture changes in tail risk when we make a risk transition between two $L_p$-quantiles. TRELT is motivated by PELVE in Li and Wang (2023), but targets tail risk. As it remains unknown in theory how this transition works, we investigate the existence, uniqueness, and asymptotic properties of TRELT (as well as dual TRELT) for $L_p$-quantiles. In addition, we study the inference methods for TRELT and extreme $L_p$-quantiles by using this risk transition, which turns out to be a novel extrapolation method in extreme value theory. The asymptotic properties of the proposed estimators are established, and both simulation studies and real data analysis are conducted to demonstrate their empirical performance.", "authors": "Zhong, Hou"}, "https://arxiv.org/abs/2412.10039": {"title": "Are you doing better than random guessing? A call for using negative controls when evaluating causal discovery algorithms", "link": "https://arxiv.org/abs/2412.10039", "description": "New proposals for causal discovery algorithms are typically evaluated using simulations and a few select real data examples with known data generating mechanisms. However, there does not exist a general guideline for how such evaluation studies should be designed, and therefore, comparing results across different studies can be difficult. In this article, we propose a common evaluation baseline by posing the question: Are we doing better than random guessing? For the task of graph skeleton estimation, we derive exact distributional results under random guessing for the expected behavior of a range of typical causal discovery evaluation metrics (including precision and recall). We show that these metrics can achieve very large values under random guessing in certain scenarios, and hence warn against using them without also reporting negative control results, i.e., performance under random guessing. We also propose an exact test of overall skeleton fit, and showcase its use on a real data application. Finally, we propose a general pipeline for using random controls beyond the skeleton estimation task, and apply it both in a simulated example and a real data application.", "authors": "Petersen"}, "https://arxiv.org/abs/2412.10196": {"title": "High-dimensional Statistics Applications to Batch Effects in Metabolomics", "link": "https://arxiv.org/abs/2412.10196", "description": "Batch effects are inevitable in large-scale metabolomics. Prior to formal data analysis, batch effect correction (BEC) is applied to prevent batch effects from obscuring biological variations, and batch effect evaluation (BEE) is used for correction assessment. However, existing BEE algorithms neglect covariances between the variables, and existing BEC algorithms might fail to adequately correct the covariances. Therefore, we resort to recent advancements in high-dimensional statistics, and respectively propose "quality control-based simultaneous tests (QC-ST)" and "covariance correction (CoCo)". Validated on simulated data, QC-ST can simultaneously detect the statistical significance of QC samples' mean vectors and covariance matrices across different batches, and has a satisfactory statistical performance in empirical sizes, empirical powers, and computational speed. 
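The warning in the negative-control entry above is easy to reproduce numerically: when the true skeleton is dense, even uniformly random edge guesses attain high precision. The Monte Carlo below is an illustrative stand-in for the paper's exact distributional results; the names and toy graph are invented.

```python
import numpy as np

def random_guess_skeleton_metrics(true_edges, p, n_guess, n_sim=5000, seed=6):
    """Expected precision and recall when n_guess undirected edges are guessed
    uniformly at random among the p*(p-1)/2 possible pairs."""
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    true_set = set(true_edges)
    prec, rec = [], []
    for _ in range(n_sim):
        idx = rng.choice(len(pairs), size=n_guess, replace=False)
        guessed = {pairs[k] for k in idx}
        tp = len(guessed & true_set)
        prec.append(tp / n_guess)
        rec.append(tp / len(true_set))
    return float(np.mean(prec)), float(np.mean(rec))

# dense true skeleton on 10 nodes: roughly 60% of all pairs are true edges
rng = np.random.default_rng(7)
p = 10
true_edges = [(i, j) for i in range(p) for j in range(i + 1, p)
              if rng.random() < 0.6]
# expected precision under pure guessing is close to the edge density (~0.6)
print(random_guess_skeleton_metrics(true_edges, p, n_guess=10))
```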
Then, we apply four QC-based BEC algorithms to two large cohort datasets, and find that extreme gradient boosting (XGBoost) performs best in relative standard deviation (RSD) and dispersion-ratio (D-ratio). After the initial BEC, if QC-ST still indicates significant batch effects between any two batches, CoCo should be implemented. After CoCo (if necessary), the four metrics (i.e., RSD, D-ratio, classification performance, and QC-ST) might be further improved. In summary, under the guidance of QC-ST, we can develop a matching strategy to integrate multiple BEC algorithms more rationally and flexibly, and minimize batch effects for reliable biological conclusions.", "authors": "Guo"}, "https://arxiv.org/abs/2412.10213": {"title": "Collaborative Design of Controlled Experiments in the Presence of Subject Covariates", "link": "https://arxiv.org/abs/2412.10213", "description": "We consider the optimal experimental design problem of allocating subjects to treatment or control when subjects participate in multiple, separate controlled experiments within a short time-frame and subject covariate information is available. Here, in addition to subject covariates, we consider the dependence among the responses coming from the subject's random effect across experiments. In this setting, the goal of the allocation is to provide precise estimates of treatment effects for each experiment. Deriving the precision matrix of the treatment effects and using D-optimality as our allocation criterion, we demonstrate the advantage of collaboratively designing and analyzing multiple experiments over traditional independent design and analysis, and propose two randomized algorithms to provide solutions to the D-optimality problem for collaborative design. The first algorithm decomposes the D-optimality problem into a sequence of subproblems, where each subproblem is a quadratic binary program that can be solved through a semi-definite relaxation based randomized algorithm with performance guarantees. The second algorithm involves solving a single semi-definite program, and randomly generating allocations for each experiment from the solution of this program. We showcase the performance of these algorithms through a simulation study, finding that our algorithms outperform covariate-agnostic methods when there are a large number of covariates.", "authors": "Fisher, Zhang, Kang et al"}, "https://arxiv.org/abs/2412.10245": {"title": "Regression trees for nonparametric diagnostics of sequential positivity violations in longitudinal causal inference", "link": "https://arxiv.org/abs/2412.10245", "description": "Sequential positivity is often a necessary assumption for drawing causal inferences, such as through marginal structural modeling. Unfortunately, verification of this assumption can be challenging because it usually relies on multiple parametric propensity score models, which are unlikely to all be correctly specified. Therefore, we propose a new algorithm, called "sequential Positivity Regression Tree" (sPoRT), to check this assumption with greater ease under either static or dynamic treatment strategies. This algorithm also identifies the subgroups found to be violating this assumption, allowing for insights about the nature of the violations and potential solutions. We first present different versions of sPoRT based on either stratifying or pooling over time. Finally, we illustrate its use in a real-life application to HIV-positive children in Southern Africa with and without pooling over time. 
An R notebook showing how to use sPoRT is available at github.com/ArthurChatton/sPoRT-notebook.", "authors": "Chatton, Schomaker, Luque-Fernandez et al"}, "https://arxiv.org/abs/2412.10252": {"title": "What if we had built a prediction model with a survival super learner instead of a Cox model 10 years ago?", "link": "https://arxiv.org/abs/2412.10252", "description": "Objective: This study sought to compare the drop in predictive performance over time according to the modeling approach (regression versus machine learning) used to build a kidney transplant failure prediction model with a time-to-event outcome.\n Study Design and Setting: The Kidney Transplant Failure Score (KTFS) was used as a benchmark. We reused the data from which it was developed (DIVAT cohort, n=2,169) to build another prediction algorithm using a survival super learner combining (semi-)parametric and non-parametric methods. Performance in DIVAT was estimated for the two prediction models using internal validation. Then, the drop in predictive performance was evaluated in the same geographical population approximately ten years later (EKiTE cohort, n=2,329).\n Results: In DIVAT, the super learner achieved better discrimination than the KTFS, with a tAUROC of 0.83 (0.79-0.87) compared to 0.76 (0.70-0.82). While the discrimination remained stable for the KTFS, it was not the case for the super learner, with a drop to 0.80 (0.76-0.83). Regarding calibration, the survival SL overestimated graft survival at development, while the KTFS underestimated graft survival ten years later. Brier score values were similar regardless of the approach and the timing.\n Conclusion: The more flexible SL provided superior discrimination on the population used to fit it compared to a Cox model and similar discrimination when applied to a future dataset of the same population. Both methods are subject to calibration drift over time. However, weak calibration on the population used to develop the prediction model was correct only for the Cox model, and recalibration should be considered in the future to correct the calibration drift.", "authors": "Chatton, Pilote, Feugo et al"}, "https://arxiv.org/abs/2412.10304": {"title": "A Neyman-Orthogonalization Approach to the Incidental Parameter Problem", "link": "https://arxiv.org/abs/2412.10304", "description": "A popular approach to perform inference on a target parameter in the presence of nuisance parameters is to construct estimating equations that are orthogonal to the nuisance parameters, in the sense that their expected first derivative is zero. Such first-order orthogonalization may, however, not suffice when the nuisance parameters are very imprecisely estimated. Leading examples where this is the case are models for panel and network data that feature fixed effects. In this paper, we show how, in the conditional-likelihood setting, estimating equations can be constructed that are orthogonal to any chosen order. Combining these equations with sample splitting yields higher-order bias-corrected estimators of target parameters. 
In an empirical application, we apply our method to a fixed-effect model of team production and obtain estimates of complementarity in production and impacts of counterfactual re-allocations.", "authors": "Bonhomme, Jochmans, Weidner"}, "https://arxiv.org/abs/2412.10005": {"title": "Matrix Completion via Residual Spectral Matching", "link": "https://arxiv.org/abs/2412.10005", "description": "Noisy matrix completion has attracted significant attention due to its applications in recommendation systems, signal processing and image restoration. Most existing works rely on (weighted) least squares methods under various low-rank constraints. However, minimizing the sum of squared residuals is not always efficient, as it may ignore the potential structural information in the residuals. In this study, we propose a novel residual spectral matching criterion that incorporates not only the numerical but also the locational information of residuals. This criterion is the first in noisy matrix completion to adopt the perspective of low-rank perturbation of random matrices and exploit the spectral properties of sparse random matrices. We derive optimal statistical properties by analyzing the spectral properties of sparse random matrices and bounding the effects of low-rank perturbations and partial observations. Additionally, we propose algorithms that efficiently approximate solutions by constructing easily computable pseudo-gradients. The iterative process of the proposed algorithms ensures convergence at a rate consistent with the optimal statistical error bound. Our method and algorithms demonstrate improved numerical performance in both simulated and real data examples, particularly in environments with high noise levels.", "authors": "Chen, Yao"}, "https://arxiv.org/abs/2412.10053": {"title": "De-Biasing Structure Function Estimates From Sparse Time Series of the Solar Wind: A Data-Driven Approach", "link": "https://arxiv.org/abs/2412.10053", "description": "Structure functions, which represent the moments of the increments of a stochastic process, are essential complementary statistics to power spectra for analysing the self-similar behaviour of a time series. However, many real-world environmental datasets, such as those collected by spacecraft monitoring the solar wind, contain gaps, which inevitably corrupt the statistics. The nature of this corruption for structure functions remains poorly understood; indeed, it is often overlooked. Here we simulate gaps in a large set of magnetic field intervals from Parker Solar Probe in order to characterize the behaviour of the structure function of a sparse time series of solar wind turbulence. We quantify the resultant error with regard to the overall shape of the structure function, and its slope in the inertial range. Noting the consistent underestimation of the true curve when using linear interpolation, we demonstrate the ability of an empirical correction factor to de-bias these estimates. This correction, "learnt" from the data from a single spacecraft, is shown to generalize well to data from a solar wind regime elsewhere in the heliosphere, producing smaller errors, on average, for missing fractions >25%. Given this success, we apply the correction to gap-affected Voyager intervals from the inner heliosheath and local interstellar medium, obtaining spectral indices similar to those from previous studies. 
This work provides a tool for future studies of fragmented solar wind time series, such as those from Voyager, MAVEN, and OMNI, as well as sparsely-sampled astrophysical and geophysical processes more generally.", "authors": "Wrench, Parashar"}, "https://arxiv.org/abs/2412.10069": {"title": "Multiscale Dynamical Indices Reveal Scale-Dependent Atmospheric Dynamics", "link": "https://arxiv.org/abs/2412.10069", "description": "Geophysical systems are inherently complex and span multiple spatial and temporal scales, making their dynamics challenging to understand and predict. This challenge is especially pronounced for extreme events, which are primarily governed by their instantaneous properties rather than their average characteristics. Advances in dynamical systems theory, including the development of local dynamical indices such as local dimension and inverse persistence, have provided powerful tools for studying these short-lasting phenomena. However, existing applications of such indices often rely on predefined fixed spatial domains and scales, with limited discussion on the influence of spatial scales on the results. In this work, we present a novel spatially multiscale methodology that applies a sliding window method to compute dynamical indices, enabling the exploration of scale-dependent properties. Applying this framework to high-impact European summertime heatwaves, we reconcile previously different perspectives, thereby underscoring the importance of spatial scales in such analyses. Furthermore, we emphasize that our novel methodology has broad applicability to other atmospheric phenomena, as well as to other geophysical and spatio-temporal systems.", "authors": "Dong, Messori, Faranda et al"}, "https://arxiv.org/abs/2412.10119": {"title": "AMUSE: Adaptive Model Updating using a Simulated Environment", "link": "https://arxiv.org/abs/2412.10119", "description": "Prediction models frequently face the challenge of concept drift, in which the underlying data distribution changes over time, weakening performance. Examples can include models which predict loan default, or those used in healthcare contexts. Typical management strategies involve regular model updates or updates triggered by concept drift detection. However, these simple policies do not necessarily balance the cost of model updating with improved classifier performance. We present AMUSE (Adaptive Model Updating using a Simulated Environment), a novel method leveraging reinforcement learning trained within a simulated data generating environment, to determine update timings for classifiers. The optimal updating policy depends on the current data generating process and ongoing drift process. Our key idea is that we can train an arbitrarily complex model updating policy by creating a training environment in which possible episodes of drift are simulated by a parametric model, which represents expectations of possible drift patterns. As a result, AMUSE proactively recommends updates based on estimated performance improvements, learning a policy that balances maintaining model performance with minimizing update costs. 
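The structure-function estimation problem discussed in the solar-wind entry above is easy to reproduce on synthetic data. The sketch below is a toy under stated assumptions (pairwise NaN handling instead of the linear interpolation whose bias the paper corrects, a random-walk signal, invented names).

```python
import numpy as np

def structure_function(x, lags, order=2):
    """S_q(tau) = mean(|x(t + tau) - x(t)|^q) for a regularly sampled series;
    increments touching a NaN gap are simply ignored."""
    return np.array([np.nanmean(np.abs(x[lag:] - x[:-lag]) ** order)
                     for lag in lags])

# random-walk toy signal with 30% of the samples removed
rng = np.random.default_rng(8)
x_full = np.cumsum(rng.normal(size=10_000))
x_gappy = x_full.copy()
x_gappy[rng.random(x_full.size) < 0.3] = np.nan
lags = np.unique(np.logspace(0, 3, 15).astype(int))
print(np.round(structure_function(x_full, lags)[:5], 2))
print(np.round(structure_function(x_gappy, lags)[:5], 2))  # compare with the complete series
```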
Empirical results confirm the effectiveness of AMUSE in simulated data.", "authors": "Chislett, Vallejos, Cannings et al"}, "https://arxiv.org/abs/2412.10288": {"title": "Performance evaluation of predictive AI models to support medical decisions: Overview and guidance", "link": "https://arxiv.org/abs/2412.10288", "description": "A myriad of measures to illustrate performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a \"proper\" measure), and (2) whether they reflect either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen measures exhibited one characteristic, and one measure possessed neither characteristic (the F1 measure). All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.", "authors": "initiative), initiative), initiative) et al"}, "https://arxiv.org/abs/2212.14622": {"title": "Identifying causal effects with subjective ordinal outcomes", "link": "https://arxiv.org/abs/2212.14622", "description": "Survey questions often ask respondents to select from ordered scales where the meanings of the categories are subjective, leaving each individual free to apply their own definitions when answering. This paper studies the use of these responses as an outcome variable in causal inference, accounting for variation in the interpretation of categories across individuals. I find that when a continuous treatment is statistically independent of both i) potential outcomes; and ii) heterogeneity in reporting styles, a nonparametric regression of response category number on that treatment variable recovers a quantity proportional to an average causal effect among individuals who are on the margin between successive response categories. The magnitude of a given regression coefficient is not meaningful on its own, but the ratio of local regression derivatives with respect to two such treatment variables identifies the relative magnitudes of convex averages of their effects. I find that comparisons involving discrete treatment variables are not as readily interpretable, but obtain a partial identification result for such cases under additional assumptions. 
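Among the measures recommended in the performance-evaluation entry above, net benefit has a particularly simple form, NB(t) = TP/n - (FP/n) * t/(1 - t) at risk threshold t. The snippet below sweeps a few thresholds on toy predictions; the names and data are illustrative only.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * threshold / (1.0 - threshold)

# decision-curve-style sweep over thresholds on toy predictions
rng = np.random.default_rng(9)
y = rng.binomial(1, 0.2, size=1000)
risk = np.clip(0.2 + 0.5 * (y - 0.2) + rng.normal(0, 0.15, size=1000), 0.001, 0.999)
for t in (0.05, 0.10, 0.20, 0.30):
    print(t, round(net_benefit(y, risk, t), 3))
```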
I illustrate the results by revisiting the effects of income comparisons on subjective well-being, without assuming cardinality or interpersonal comparability of responses.", "authors": "Goff"}, "https://arxiv.org/abs/2412.10521": {"title": "KenCoh: A Ranked-Based Canonical Coherence", "link": "https://arxiv.org/abs/2412.10521", "description": "In this paper, we consider the problem of characterizing a robust global dependence between two brain regions where each region may contain several voxels or channels. This work is driven by experiments to investigate the dependence between two cortical regions and to identify differences in brain networks between brain states, e.g., alert and drowsy states. The most common approach to explore dependence between two groups of variables (or signals) is via canonical correlation analysis (CCA). However, it is limited to only capturing linear associations and is sensitive to outlier observations. These limitations are crucial because brain network connectivity is likely to be more complex than linear and because brain signals may exhibit heavy-tailed properties. To overcome these limitations, we develop a robust method, Kendall canonical coherence (KenCoh), for learning monotonic connectivity structure among neuronal signals filtered at given frequency bands. Furthermore, we propose the KenCoh-based permutation test to investigate the differences in brain network connectivity between two different states. Our simulation study demonstrates that KenCoh is competitive with the traditional variance-covariance estimator and outperforms the latter when the underlying distributions are heavy-tailed. We apply our method to EEG recordings from a virtual-reality driving experiment. Our proposed method provides further insight into how the frontal-parietal cross-dependence network differs between the alert and drowsy states, and shows that the left-parietal channel drives this dependence at the beta-band.", "authors": "Talento, Roy, Ombao"}, "https://arxiv.org/abs/2412.10563": {"title": "Augmented two-stage estimation for treatment crossover in oncology trials: Leveraging external data for improved precision", "link": "https://arxiv.org/abs/2412.10563", "description": "Randomized controlled trials (RCTs) in oncology often allow control group participants to cross over to experimental treatments, a practice that, while often ethically necessary, complicates the accurate estimation of long-term treatment effects. When crossover rates are high or sample sizes are limited, commonly used methods for crossover adjustment (such as the rank-preserving structural failure time model, inverse probability of censoring weights, and two-stage estimation (TSE)) may produce imprecise estimates. Real-world data (RWD) can be used to develop an external control arm for the RCT, although this approach ignores evidence from trial subjects who did not cross over and from the data obtained prior to crossover for those subjects who did. This paper introduces augmented two-stage estimation (ATSE), a method that combines data from non-switching participants in an RCT with an external dataset, forming a 'hybrid non-switching arm'. With a simulation study, we evaluate the ATSE method's effectiveness compared to TSE crossover adjustment and an external control arm approach.
Results indicate that, relative to TSE and the external control arm approach, ATSE can increase precision and may be less susceptible to bias due to unmeasured confounding.", "authors": "Campbell, Jansen, Cope"}, "https://arxiv.org/abs/2412.10600": {"title": "The Front-door Criterion in the Potential Outcome Framework", "link": "https://arxiv.org/abs/2412.10600", "description": "In recent years, the front-door criterion (FDC) has been increasingly noticed in economics and social science. However, most economists still resist including this tool in their empirical toolkit. This article aims to incorporate the FDC into the framework of the potential outcome model (the Rubin causal model, RCM). It redefines the key assumptions of the FDC in the language of the RCM. These assumptions are more comprehensive and detailed than the original ones in the structural causal model (SCM). The causal connotations of the FDC estimates are elaborated in detail, and the estimation bias caused by violating some key assumptions is theoretically derived. Rigorous simulation data are used to confirm the theoretical derivation. It is proved that the FDC can still provide useful insights into causal relationships even when some key assumptions are violated. The FDC is also comprehensively compared with instrumental variables (IV) from the perspective of assumptions and causal connotations. The analyses of this paper show that the FDC can serve as a powerful empirical tool. It can provide new insights into causal relationships compared with the conventional methods in social science.", "authors": "Chen"}, "https://arxiv.org/abs/2412.10608": {"title": "An overview of meta-analytic methods for economic research", "link": "https://arxiv.org/abs/2412.10608", "description": "Meta-analysis is the use of statistical methods to combine the results of individual studies to estimate the overall effect size for a specific outcome of interest. The direction and magnitude of this estimated effect, along with its confidence interval, provide insights into the phenomenon or relationship being investigated. As an extension of the standard meta-analysis, meta-regression analysis incorporates multiple moderators representing identifiable study characteristics into the meta-analysis model, thereby explaining some of the heterogeneity in true effect sizes across studies. This form of meta-analysis is specifically designed to quantitatively synthesize empirical evidence in economics. This study provides an overview of the meta-analytic procedures tailored for economic research. By addressing key challenges, including between-study heterogeneity, publication bias, and effect size dependence, it aims to equip researchers with the tools and insights needed to conduct rigorous and informative meta-analytic studies in economics and related disciplines.", "authors": "Haghnejad, Farahati"}, "https://arxiv.org/abs/2412.10635": {"title": "Do LLMs Act as Repositories of Causal Knowledge?", "link": "https://arxiv.org/abs/2412.10635", "description": "Large language models (LLMs) offer the potential to automate a large number of tasks that previously have not been possible to automate, including some in science. There is considerable interest in whether LLMs can automate the process of causal inference by providing the information about causal links necessary to build a structural model. We use the case of confounding in the Coronary Drug Project (CDP), for which there are several studies listing expert-selected confounders that can serve as a ground truth.
LLMs exhibit mediocre performance in identifying confounders in this setting, even though text about the ground truth is in their training data. Variables that experts identify as confounders are only slightly more likely to be labeled as confounders by LLMs compared to variables that experts consider non-confounders. Further, LLM judgment on confounder status is highly inconsistent across models, prompts, and irrelevant concerns like multiple-choice option ordering. LLMs do not yet have the ability to automate the reporting of causal links.", "authors": "Huntington-Klein, Murray"}, "https://arxiv.org/abs/2412.10639": {"title": "A Multiprocess State Space Model with Feedback and Switching for Patterns of Clinical Measurements Associated with COVID-19", "link": "https://arxiv.org/abs/2412.10639", "description": "Clinical measurements, such as body temperature, are often collected over time to monitor an individual's underlying health condition. These measurements exhibit complex temporal dynamics, necessitating sophisticated statistical models to capture patterns and detect deviations. We propose a novel multiprocess state space model with feedback and switching mechanisms to analyze the dynamics of clinical measurements. This model captures the evolution of time series through distinct latent processes and incorporates feedback effects in the transition probabilities between latent processes. We develop estimation methods using the EM algorithm, integrated with multiprocess Kalman filtering and multiprocess fixed-interval smoothing. Simulation study shows that the algorithm is efficient and performs well. We apply the proposed model to body temperature measurements from COVID-19-infected hemodialysis patients to examine temporal dynamics and estimate infection and recovery probabilities.", "authors": "Ma, Guo, Kotanko et al"}, "https://arxiv.org/abs/2412.10658": {"title": "Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling", "link": "https://arxiv.org/abs/2412.10658", "description": "Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but often overlook fully mining and utilizing the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data under the limited data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimating method is Lipschitz continuous with respect to data distribution and requires a sample size of $3/B$ of that required for histogram binning, where $B$ represents the number of bins. Also, a new calibration metric ($TCE_{bpm}$), which leverages the estimated calibration curve to estimate the true calibration error (TCE), is designed. $TCE_{bpm}$ is proven to be a consistent calibration measure. 
Furthermore, realistic calibration datasets can be generated via binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified in real-world and simulated data.", "authors": "Dong, Jiang, Pan et al"}, "https://arxiv.org/abs/2412.10683": {"title": "Adaptive Nonparametric Perturbations of Parametric Bayesian Models", "link": "https://arxiv.org/abs/2412.10683", "description": "Parametric Bayesian modeling offers a powerful and flexible toolbox for scientific data analysis. Yet the model, however detailed, may still be wrong, and this can make inferences untrustworthy. In this paper we study nonparametrically perturbed parametric (NPP) Bayesian models, in which a parametric Bayesian model is relaxed via a distortion of its likelihood. We analyze the properties of NPP models when the target of inference is the true data distribution or some functional of it, such as in causal inference. We show that NPP models can offer the robustness of nonparametric models while retaining the data efficiency of parametric models, achieving fast convergence when the parametric model is close to true. To efficiently analyze data with an NPP model, we develop a generalized Bayes procedure to approximate its posterior. We demonstrate our method by estimating causal effects of gene expression from single cell RNA sequencing data. NPP modeling offers an efficient approach to robust Bayesian inference and can be used to robustify any parametric Bayesian model.", "authors": "Wu, Weinstein, Salehi et al"}, "https://arxiv.org/abs/2412.10721": {"title": "Model checking for high dimensional generalized linear models based on random projections", "link": "https://arxiv.org/abs/2412.10721", "description": "Most existing tests in the literature for model checking do not work in high-dimensional settings due to challenges arising from the \"curse of dimensionality\", or dependence on the normality of parameter estimators. To address these challenges, we propose a new goodness-of-fit test based on random projections for generalized linear models, when the dimension of covariates may substantially exceed the sample size. The tests only require the convergence rate of parameter estimators to derive the limiting distribution. The dimension is allowed to grow at an exponential rate relative to the sample size. As random projection converts covariates to one-dimensional space, our tests can detect the local alternative departing from the null at the rate of $n^{-1/2}h^{-1/4}$, where $h$ is the bandwidth and $n$ is the sample size. This sensitive rate is not related to the dimension of covariates, and thus the \"curse of dimensionality\" for our tests would be largely alleviated. An interesting and unexpected result is that for randomly chosen projections, the resulting test statistics can be asymptotically independent. We then propose combination methods to enhance the power performance of the tests.
Detailed simulation studies and a real data analysis are conducted to illustrate the effectiveness of our methodology.", "authors": "Chen, Liu, Peng et al"}, "https://arxiv.org/abs/2412.10791": {"title": "Forecasting realized covariances using HAR-type models", "link": "https://arxiv.org/abs/2412.10791", "description": "We investigate methods for forecasting multivariate realized covariance matrices applied to a set of 30 assets that were included in the DJ30 index at some point, including two novel methods that use existing (univariate) models for the log of realized variance that account for attenuation bias and time-varying parameters. We consider the implications of some modeling choices within the class of heterogeneous autoregressive models. The following are our key findings. First, modeling the logs of the marginal volatilities is strongly preferred over direct modeling of marginal volatility. Thus, our proposed model that accounts for attenuation bias (for the log-response) provides superior one-step-ahead forecasts over existing multivariate realized covariance approaches. Second, accounting for measurement errors in marginal realized variances generally improves multivariate forecasting performance, but to a lesser degree than previously found in the literature. Third, time-varying parameter models based on state-space models perform almost equally well. Fourth, statistical and economic criteria for comparing the forecasting performance lead to some differences in the models' rankings, which, as sub-sample analyses suggest, can partially be explained by the turbulent post-pandemic data in our out-of-sample validation dataset.", "authors": "Quiroz, Tafakori, Manner"}, "https://arxiv.org/abs/2412.10920": {"title": "Multiscale Autoregression on Adaptively Detected Timescales", "link": "https://arxiv.org/abs/2412.10920", "description": "We propose a multiscale approach to time series autoregression, in which linear regressors for the process in question include features of its own path that live on multiple timescales. We take these multiscale features to be the recent averages of the process over multiple timescales, whose number or spans are not known to the analyst and are estimated from the data via a change-point detection technique. The resulting construction, termed Adaptive Multiscale AutoRegression (AMAR), enables adaptive regularisation of linear autoregression of large orders. The AMAR model is designed to offer simplicity and interpretability on the one hand, and modelling flexibility on the other. Our theory permits the longest timescale to increase with the sample size. A simulation study is presented to show the usefulness of our approach. Some possible extensions are also discussed, including the Adaptive Multiscale Vector AutoRegressive model (AMVAR) for multivariate time series, which demonstrates promising performance in the data example on UK and US unemployment rates. The R package amar provides an efficient implementation of the AMAR framework.", "authors": "Baranowski, Chen, Fryzlewicz"}, "https://arxiv.org/abs/2412.10988": {"title": "Multiple Imputation for Nonresponse in Complex Surveys Using Design Weights and Auxiliary Margins", "link": "https://arxiv.org/abs/2412.10988", "description": "Survey data typically have missing values due to unit and item nonresponse. Sometimes, survey organizations know the marginal distributions of certain categorical variables in the survey.
As shown in previous work, survey organizations can leverage these distributions in multiple imputation for nonignorable unit nonresponse, generating imputations that result in plausible completed-data estimates for the variables with known margins. However, this prior work does not use the design weights for unit nonrespondents; rather, it relies on a set of fabricated weights for these units. We extend this previous work to utilize the design weights for all sampled units. We illustrate the approach using simulation studies.", "authors": "Xu, Reiter"}, "https://arxiv.org/abs/2412.11136": {"title": "Minimax Regret Estimation for Generalizing Heterogeneous Treatment Effects with Multisite Data", "link": "https://arxiv.org/abs/2412.11136", "description": "To test scientific theories and develop individualized treatment rules, researchers often wish to learn heterogeneous treatment effects that can be consistently found across diverse populations and contexts. We consider the problem of generalizing heterogeneous treatment effects (HTE) based on data from multiple sites. A key challenge is that a target population may differ from the source sites in unknown and unobservable ways. This means that the estimates from site-specific models lack external validity, and a simple pooled analysis risks bias. We develop a robust CATE (conditional average treatment effect) estimation methodology with multisite data from heterogeneous populations. We propose a minimax-regret framework that learns a generalizable CATE model by minimizing the worst-case regret over a class of target populations whose CATE can be represented as convex combinations of site-specific CATEs. Using robust optimization, the proposed methodology accounts for distribution shifts in both individual covariates and treatment effect heterogeneity across sites. We show that the resulting CATE model has an interpretable closed-form solution, expressed as a weighted average of site-specific CATE models. Thus, researchers can utilize a flexible CATE estimation method within each site and aggregate site-specific estimates to produce the final model. Through simulations and a real-world application, we show that the proposed methodology improves the robustness and generalizability of existing approaches.", "authors": "Zhang, Huang, Imai"}, "https://arxiv.org/abs/2412.11140": {"title": "BUPD: A Bayesian under-parameterized basket design with the unit information prior in oncology trials", "link": "https://arxiv.org/abs/2412.11140", "description": "Basket trials in oncology enroll multiple patients with cancer harboring identical gene alterations and evaluate their response to targeted therapies across cancer types. Several existing methods have extended a Bayesian hierarchical model borrowing information on the response rates in different cancer types to account for the heterogeneity of drug effects. However, these methods rely on several pre-specified parameters to account for the heterogeneity of response rates among different cancer types. Here, we propose a novel Bayesian under-parameterized basket design with a unit information prior (BUPD) that uses only one (or two) pre-specified parameters to control the amount of information borrowed among cancer types, considering the heterogeneity of response rates. BUPD adapts the unit information prior approach, originally developed for borrowing information from historical clinical trial data, to enable mutual information borrowing between two cancer types. 
BUPD enables flexible controls of the type 1 error rate and power by explicitly specifying the strength of borrowing while providing interpretable estimations of response rates. Simulation studies revealed that BUPD reduced the type 1 error rate in scenarios with few ineffective cancer types and improved the power in scenarios with few effective cancer types better than five existing methods. This study also illustrated the efficiency of BUPD using response rates from a real basket trial.", "authors": "Kitabayashi, Sato, Hirakawa"}, "https://arxiv.org/abs/2412.11153": {"title": "Balancing Accuracy and Costs in Cross-Temporal Hierarchies: Investigating Decision-Based and Validation-Based Reconciliation", "link": "https://arxiv.org/abs/2412.11153", "description": "Wind power forecasting is essential for managing daily operations at wind farms and enabling market operators to manage power uncertainty effectively in demand planning. This paper explores advanced cross-temporal forecasting models and their potential to enhance forecasting accuracy. First, we propose a novel approach that leverages validation errors, rather than traditional in-sample errors, for covariance matrix estimation and forecast reconciliation. Second, we introduce decision-based aggregation levels for forecasting and reconciliation where certain horizons are based on the required decisions in practice. Third, we evaluate the forecasting performance of the models not only on their ability to minimize errors but also on their effectiveness in reducing decision costs, such as penalties in ancillary services. Our results show that statistical-based hierarchies tend to adopt less conservative forecasts and reduce revenue losses. On the other hand, decision-based reconciliation offers a more balanced compromise between accuracy and decision cost, making them attractive for practical use.", "authors": "Abolghasemi, Girolimetto, Fonzo"}, "https://arxiv.org/abs/2412.11179": {"title": "Treatment Evaluation at the Intensive and Extensive Margins", "link": "https://arxiv.org/abs/2412.11179", "description": "This paper provides a solution to the evaluation of treatment effects in selective samples when neither instruments nor parametric assumptions are available. We provide sharp bounds for average treatment effects under a conditional monotonicity assumption for all principal strata, i.e. units characterizing the complete intensive and extensive margins. Most importantly, we allow for a large share of units whose selection is indifferent to treatment, e.g. due to non-compliance. The existence of such a population is crucially tied to the regularity of sharp population bounds and thus conventional asymptotic inference for methods such as Lee bounds can be misleading. It can be solved using smoothed outer identification regions for inference. We provide semiparametrically efficient debiased machine learning estimators for both regular and smooth bounds that can accommodate high-dimensional covariates and flexible functional forms. Our study of active labor market policy reveals the empirical prevalence of the aforementioned indifference population and supports results from previous impact analysis under much weaker assumptions.", "authors": "Heiler, Kaufmann, Veliyev"}, "https://arxiv.org/abs/2412.11267": {"title": "P3LS: Point Process Partial Least Squares", "link": "https://arxiv.org/abs/2412.11267", "description": "Many studies collect data that can be considered as a realization of a point process. 
Included are medical imaging data where photon counts are recorded by a gamma camera from patients injected with a gamma-emitting tracer. It is of interest to develop analytic methods that can help with diagnosis as well as with the training of inexpert radiologists. Partial least squares (PLS) is a popular analytic approach that combines features from linear modeling as well as dimension reduction to provide parsimonious prediction and classification. However, existing PLS methodologies do not include the analysis of point process predictors. In this article, we introduce point process PLS (P3LS) for analyzing latent time-varying intensity functions from collections of inhomogeneous point processes. A novel estimation procedure for P3LS is developed that utilizes the properties of log-Gaussian Cox processes, and its empirical properties are examined in simulation studies. The method is used to analyze kidney functionality in patients with renal disease in order to aid in the diagnosis of kidney obstruction.", "authors": "Namdari, Krafty, Manatunga"}, "https://arxiv.org/abs/2412.11278": {"title": "VAR models with an index structure: A survey with new results", "link": "https://arxiv.org/abs/2412.11278", "description": "The main aim of this paper is to review recent advances in the multivariate autoregressive index model [MAI], originally proposed by Reinsel (1983), and their applications to economic and financial time series. MAI has recently gained momentum because it can be seen as a link between two popular but distinct multivariate time series approaches: vector autoregressive modeling [VAR] and the dynamic factor model [DFM]. Indeed, on the one hand, the MAI is a VAR model with a peculiar reduced-rank structure; on the other hand, it allows for identification of common components and common shocks in a similar way to the DFM. The focus is on recent developments of the MAI, which include extending the original model with individual autoregressive structures, stochastic volatility, time-varying parameters, high-dimensionality, and cointegration. In addition, new insights on previous contributions and a novel model are also provided.", "authors": "Cubadda"}, "https://arxiv.org/abs/2412.11285": {"title": "Moderating the Mediation Bootstrap for Causal Inference", "link": "https://arxiv.org/abs/2412.11285", "description": "Mediation analysis is a form of causal inference that investigates indirect effects and causal mechanisms. Confidence intervals for indirect effects play a central role in conducting inference. The problem is non-standard, leading to coverage rates that deviate considerably from their nominal level. The default inference method in the mediation model is the paired bootstrap, which resamples directly from the observed data. However, a residual bootstrap that explicitly exploits the assumed causal structure (X->M->Y) could also be applied. There is also a debate about whether the bias-corrected (BC) bootstrap method is superior to the percentile method, with the former showing liberal behavior (actual coverage too low) in certain circumstances. Moreover, bootstrap methods tend to be very conservative (coverage higher than required) when mediation effects are small. Finally, iterated bootstrap methods like the double bootstrap have not been considered due to their high computational demands. We investigate these issues in the simple mediation model by means of a large-scale simulation.
Results are explained using graphical methods and the newly derived finite-sample distribution. The main findings are: (i) the conservative behavior of the bootstrap is caused by the extreme dependence of the bootstrap distribution's shape on the estimated coefficients; (ii) this dependence leads to counterproductive correction of the double bootstrap. The added randomness of the BC method inflates the coverage in the absence of mediation, but still leads to (invalid) liberal inference when the mediation effect is small.", "authors": "Garderen, Giersbergen"}, "https://arxiv.org/abs/2412.11326": {"title": "Spatial Cross-Recurrence Quantification Analysis for Multi-Platform Contact Tracing and Epidemiology Research", "link": "https://arxiv.org/abs/2412.11326", "description": "Contact tracing is an essential tool in slowing and containing outbreaks of contagious diseases. Current contact tracing methods range from interviews with public health personnel to Bluetooth pings from smartphones. While all methods offer various benefits, it is difficult for different methods to integrate with one another. Additionally, for contact tracing mobile applications, data privacy is a concern for many, as GPS data from users are saved to either a central server or the user's device. The current paper describes a method called spatial cross-recurrence quantification analysis (SpaRQ) that can combine and analyze contact tracing data, regardless of how it has been obtained, and generate a risk profile for the user without storing GPS data. Furthermore, the plots from SpaRQ can be used to investigate the nature of the infectious agent, such as how long it can remain viable in air or on surfaces after an infected person has passed, the chance of infection based on exposure time, and what type of exposure is maximally infective.", "authors": "Patten"}, "https://arxiv.org/abs/2412.11340": {"title": "Fast Bayesian Functional Principal Components Analysis", "link": "https://arxiv.org/abs/2412.11340", "description": "Functional Principal Components Analysis (FPCA) is one of the most successful and widely used analytic tools for exploration and dimension reduction of functional data. Standard implementations of FPCA estimate the principal components from the data but ignore their sampling variability in subsequent inferences. To address this problem, we propose Fast Bayesian Functional Principal Components Analysis (Fast BayesFPCA), which treats principal components as parameters on the Stiefel manifold. To ensure efficiency, stability, and scalability, we introduce three innovations: (1) project all eigenfunctions onto an orthonormal spline basis, reducing modeling considerations to a smaller-dimensional Stiefel manifold; (2) induce a uniform prior on the Stiefel manifold of the principal component spline coefficients via the polar representation of a matrix with entries following independent standard Normal priors; and (3) constrain sampling using the assumed FPCA structure to improve stability. We demonstrate the application of Fast BayesFPCA to characterize the variability in mealtime glucose from the Dietary Approaches to Stop Hypertension for Diabetes Continuous Glucose Monitoring (DASH4D CGM) study.
All relevant STAN code and simulation routines are available as supplementary material.", "authors": "Sartini, Zhou, Selvin et al"}, "https://arxiv.org/abs/2412.11348": {"title": "Analyzing zero-inflated clustered longitudinal ordinal outcomes using GEE-type models with an application to dental fluorosis studies", "link": "https://arxiv.org/abs/2412.11348", "description": "Motivated by the Iowa Fluoride Study (IFS) dataset, which comprises zero-inflated multi-level ordinal responses on tooth fluorosis, we develop an estimation scheme leveraging generalized estimating equations (GEEs) and James-Stein shrinkage. Previous analyses of this cohort study primarily focused on caries (count response) or employed a Bayesian approach to the ordinal fluorosis outcome. This study is based on the expanded dataset that now includes observations for age 23, whereas earlier works were restricted to ages 9, 13, and/or 17 according to the participants' ages at the time of measurement. The adoption of a frequentist perspective enhances the interpretability to a broader audience. Over a choice of several covariance structures, separate models are formulated for the presence (zero versus non-zero score) and severity (non-zero ordinal scores) of fluorosis, which are then integrated through shared regression parameters. This comprehensive framework effectively identifies risk or protective effects of dietary and non-dietary factors on dental fluorosis.", "authors": "Sarkar, Mukherjee, Gaskins et al"}, "https://arxiv.org/abs/2412.11575": {"title": "Cost-aware Portfolios in a Large Universe of Assets", "link": "https://arxiv.org/abs/2412.11575", "description": "This paper considers the finite horizon portfolio rebalancing problem in terms of mean-variance optimization, where decisions are made based on current information on asset returns and transaction costs. The study's novelty is that the transaction costs are integrated within the optimization problem in a high-dimensional portfolio setting where the number of assets is larger than the sample size. We propose portfolio construction and rebalancing models with nonconvex penalty considering two types of transaction cost, the proportional transaction cost and the quadratic transaction cost. We establish the desired theoretical properties under mild regularity conditions. Monte Carlo simulations and empirical studies using S&P 500 and Russell 2000 stocks show the satisfactory performance of the proposed portfolio and highlight the importance of involving the transaction costs when rebalancing a portfolio.", "authors": "Fan, Medeiros, Yang et al"}, "https://arxiv.org/abs/2412.11612": {"title": "Autoregressive hidden Markov models for high-resolution animal movement data", "link": "https://arxiv.org/abs/2412.11612", "description": "New types of high-resolution animal movement data allow for increasingly comprehensive biological inference, but method development to meet the statistical challenges associated with such data is lagging behind. In this contribution, we extend the commonly applied hidden Markov models for step lengths and turning angles to address the specific requirements posed by high-resolution movement data, in particular the very strong within-state correlation induced by the momentum in the movement. The models feature autoregressive components of general order in both the step length and the turning angle variable, with the possibility to automate the selection of the autoregressive degree using a lasso approach. 
In a simulation study, we identify potential for improved inference when using the new model instead of the commonly applied basic hidden Markov model in cases where there is strong within-state autocorrelation. The practical use of the model is illustrated using high-resolution movement tracks of terns foraging near an anthropogenic structure causing turbulent water flow features.", "authors": "Stoye, Hoyer, Langrock"}, "https://arxiv.org/abs/2412.11651": {"title": "A New Sampling Method Base on Sequential Tests with Fixed Sample Size Upper Limit", "link": "https://arxiv.org/abs/2412.11651", "description": "Sequential inspection is a technique employed to monitor product quality during the production process. For smaller batch sizes, the Acceptable Quality Limit (AQL) inspection theory is typically applied, whereas for larger batch sizes, the Poisson distribution is commonly utilized to determine the sample size and rejection thresholds. However, because the rate of defective products is usually low in actual production, these methods often require more samples to draw conclusions, resulting in longer inspection times. To address this, this paper proposes a sequential inspection method with a fixed upper limit on the sample size. This approach not only incorporates the Poisson distribution algorithm, allowing for rapid calculation of sample size and rejection thresholds to facilitate planning, but also adapts the concept of sequential inspection to dynamically modify the sampling plan and decision-making process. This method aims to decrease the number of samples required while preserving the inspection's efficacy. Finally, this paper shows through Monte Carlo simulation that the sequential test method with a fixed sample size upper limit significantly reduces the number of samples compared to the traditional Poisson distribution algorithm, while maintaining effective inspection outcomes.", "authors": "Huang"}, "https://arxiv.org/abs/2412.11692": {"title": "A partial likelihood approach to tree-based density modeling and its application in Bayesian inference", "link": "https://arxiv.org/abs/2412.11692", "description": "Tree-based models for probability distributions are usually specified using a predetermined, data-independent collection of candidate recursive partitions of the sample space. To characterize an unknown target density in detail over the entire sample space, candidate partitions must have the capacity to expand deeply into all areas of the sample space with potential non-zero sampling probability. Such an expansive system of partitions often incurs prohibitive computational costs and makes inference prone to overfitting, especially in regions with little probability mass. Existing models typically make a compromise and rely on relatively shallow trees. This hampers one of the most desirable features of trees, their ability to characterize local features, and results in reduced statistical efficiency. Traditional wisdom suggests that this compromise is inevitable to ensure coherent likelihood-based reasoning, as a data-dependent partition system that allows deeper expansion only in regions with more observations would induce double dipping of the data and thus lead to inconsistent inference. We propose a simple strategy to restore coherency while allowing the candidate partitions to be data-dependent, using Cox's partial likelihood.
This strategy parametrizes the tree-based sampling model according to the allocation of probability mass based on the observed data, and yet under appropriate specification, the resulting inference remains valid. Our partial likelihood approach is broadly applicable to existing likelihood-based methods and in particular to Bayesian inference on tree-based models. We give examples in density estimation in which the partial likelihood is endowed with existing priors on tree-based models and compare it with the standard, full-likelihood approach. The results show substantial gains in estimation accuracy and computational efficiency from using the partial likelihood.", "authors": "Ma, Bruni"}, "https://arxiv.org/abs/2412.11790": {"title": "Variable importance measures for heterogeneous treatment effects with survival outcome", "link": "https://arxiv.org/abs/2412.11790", "description": "Treatment effect heterogeneity plays an important role in many areas of causal inference and, within recent years, estimation of the conditional average treatment effect (CATE) has received much attention in the statistical community. While accurate estimation of the CATE-function through flexible machine learning procedures provides a tool for prediction of the individual treatment effect, it does not provide further insight into the driving features of potential treatment effect heterogeneity. Recent papers have addressed this problem by providing variable importance measures for treatment effect heterogeneity. Most of the suggestions have been developed for continuous or binary outcomes, while little attention has been given to censored time-to-event outcomes. In this paper, we extend the treatment effect variable importance measure (TE-VIM) proposed in Hines et al. (2022) to the survival setting with censored outcomes. We derive an estimator for the TE-VIM for two different CATE functions based on the survival function and RMST, respectively. Along with the TE-VIM, we propose a new measure of treatment effect heterogeneity based on the best partially linear projection of the CATE and suggest accompanying estimators for that projection. All estimators are based on semiparametric efficiency theory, and we give conditions under which they are asymptotically linear. The finite-sample performance of the derived estimators is investigated in a simulation study. Finally, the estimators are applied and contrasted in two real data examples.", "authors": "Ziersen, Martinussen"}, "https://arxiv.org/abs/2412.11850": {"title": "Causal Invariance Learning via Efficient Optimization of a Nonconvex Objective", "link": "https://arxiv.org/abs/2412.11850", "description": "Data from multiple environments offer valuable opportunities to uncover causal relationships among variables. Leveraging the assumption that the causal outcome model remains invariant across heterogeneous environments, state-of-the-art methods attempt to identify causal outcome models by learning invariant prediction models and rely on exhaustive searches over all (exponentially many) covariate subsets. These approaches present two major challenges: 1) determining the conditions under which the invariant prediction model aligns with the causal outcome model, and 2) devising computationally efficient causal discovery algorithms that scale polynomially, instead of exponentially, with the number of covariates.
To address both challenges, we focus on the additive intervention regime and propose nearly necessary and sufficient conditions for ensuring that the invariant prediction model matches the causal outcome model. Exploiting the essentially necessary identifiability conditions, we introduce Negative Weight Distributionally Robust Optimization (NegDRO), a nonconvex continuous minimax optimization problem whose global optimizer recovers the causal outcome model. Unlike standard group DRO problems that maximize over the simplex, NegDRO allows negative weights on environment losses, which breaks convexity. Despite its nonconvexity, we demonstrate that a standard gradient method converges to the causal outcome model, and we establish the convergence rate with respect to the sample size and the number of iterations. Our algorithm avoids exhaustive search, making it scalable especially when the number of covariates is large. The numerical results further validate the efficiency of the proposed method.", "authors": "Wang, Hu, B\\\"uhlmann et al"}, "https://arxiv.org/abs/2412.11971": {"title": "Multiplex Dirichlet stochastic block model for clustering multidimensional compositional networks", "link": "https://arxiv.org/abs/2412.11971", "description": "Network data often represent multiple types of relations, which can also denote exchanged quantities, and are typically encompassed in a weighted multiplex. Such data frequently exhibit clustering structures; however, traditional clustering methods are not well-suited for multiplex networks. Additionally, standard methods treat edge weights in their raw form, potentially biasing clustering towards a node's total weight capacity rather than reflecting cluster-related interaction patterns. To address this, we propose transforming edge weights into a compositional format, enabling the analysis of connection strengths in relative terms and removing the impact of nodes' total weights. We introduce a multiplex Dirichlet stochastic block model designed for multiplex networks with compositional layers. This model accounts for sparse compositional networks and enables joint clustering across different types of interactions. We validate the model through a simulation study and apply it to the international export data from the Food and Agriculture Organization of the United Nations.", "authors": "Promskaia, O'Hagan, Fop"}, "https://arxiv.org/abs/2412.10643": {"title": "Scientific Realism vs. Anti-Realism: Toward a Common Ground", "link": "https://arxiv.org/abs/2412.10643", "description": "The debate between scientific realism and anti-realism remains at a stalemate, with reconciliation seeming hopeless. Yet, important work remains: to seek a common ground, even if only to uncover deeper points of disagreement. I develop the idea that everyone values some truths, and use it to benefit both sides of the debate. More specifically, many anti-realists, such as instrumentalists, have yet to seriously engage with Sober's call to justify their preferred version of Ockham's razor through a positive epistemology. Meanwhile, realists face a similar challenge: providing a non-circular explanation of how their version of Ockham's razor connects to truth. Drawing insights from fields that study scientific inference -- statistics and machine learning -- I propose a common ground that addresses these challenges for both sides.
This common ground also isolates a distinctively epistemic root of the irreconcilability in the realism debate.", "authors": "Lin"}, "https://arxiv.org/abs/2412.10753": {"title": "Posterior asymptotics of high-dimensional spiked covariance model with inverse-Wishart prior", "link": "https://arxiv.org/abs/2412.10753", "description": "We consider Bayesian inference on the spiked eigenstructures of high-dimensional covariance matrices; specifically, we focus on estimating the eigenvalues and corresponding eigenvectors of high-dimensional covariance matrices in which a few eigenvalues are significantly larger than the rest. We impose an inverse-Wishart prior distribution on the unknown covariance matrix and derive the posterior distributions of the eigenvalues and eigenvectors by transforming the posterior distribution of the covariance matrix. We prove that the posterior distribution of the spiked eigenvalues and corresponding eigenvectors converges to the true parameters under the spiked high-dimensional covariance assumption, and also that the posterior distribution of the spiked eigenvector attains minimax optimality under the single spiked covariance model. Simulation studies and real data analysis demonstrate that our proposed method outperforms all existing methods in quantifying uncertainty.", "authors": "Lee, Park, Kim et al"}, "https://arxiv.org/abs/2412.10796": {"title": "Spatio-temporal analysis of extreme winter temperatures in Ireland", "link": "https://arxiv.org/abs/2412.10796", "description": "We analyse extreme daily minimum temperatures in winter months over the island of Ireland from 1950-2022. We model the marginal distributions of extreme winter minima using a generalised Pareto distribution (GPD), capturing temporal and spatial non-stationarities in the parameters of the GPD. We investigate two independent temporal non-stationarities in extreme winter minima. We model the long-term trend in magnitude of extreme winter minima as well as short-term, large fluctuations in magnitude caused by anomalous behaviour of the jet stream. We measure magnitudes of spatial events with a carefully chosen risk function and fit an r-Pareto process to extreme events exceeding a high-risk threshold. Our analysis is based on synoptic data observations courtesy of Met \\'Eireann and the Met Office. We show that the frequency of extreme cold winter events is decreasing over the study period. The magnitude of extreme winter events is also decreasing, indicating that winters are warming, and apparently warming at a faster rate than extreme summer temperatures. We also show that extremely cold winter temperatures are warming at a faster rate than non-extreme winter temperatures. We find that a climate model output previously shown to be informative as a covariate for modelling extremely warm summer temperatures is less effective as a covariate for extremely cold winter temperatures. However, we show that the climate model is useful for informing a non-extreme temperature model.", "authors": "Healy, Tawn, Thorne et al"}, "https://arxiv.org/abs/2412.11104": {"title": "ABC3: Active Bayesian Causal Inference with Cohn Criteria in Randomized Experiments", "link": "https://arxiv.org/abs/2412.11104", "description": "In causal inference, randomized experiments are the de facto method for overcoming various theoretical issues that arise in observational studies. However, running experiments is costly, so an efficient experimental design is necessary.
We propose ABC3, a Bayesian active learning policy for causal inference. We show that a policy minimizing the estimation error on the conditional average treatment effect is equivalent to one minimizing an integrated posterior variance, similar to the Cohn criterion (Cohn, 1994). We theoretically prove that ABC3 also minimizes the imbalance between the treatment and control groups and the type 1 error probability. The imbalance-minimizing property is especially notable, as several works have emphasized the importance of achieving balance. In extensive experiments on real-world data sets, ABC3 achieves the highest efficiency, while empirically confirming that the theoretical results hold.", "authors": "Cha, Lee"}, "https://arxiv.org/abs/2412.11743": {"title": "Generalized Bayesian deep reinforcement learning", "link": "https://arxiv.org/abs/2412.11743", "description": "Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Similar to other model-based RL approaches, it involves two key components: (1) inferring the posterior distribution of the data generating process (DGP) modeling the true environment and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. In conjunction, to achieve scalability in the high-dimensional parameter space of the neural networks, we use gradient-based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions, which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies assuming discrete action and state spaces. Finally, we extend our setup to a challenging problem with a continuous action space, albeit without theoretical guarantees.", "authors": "Roy, Everitt, Robert et al"}, "https://arxiv.org/abs/2105.08451": {"title": "Bayesian Levy-Dynamic Spatio-Temporal Process: Towards Big Data Analysis", "link": "https://arxiv.org/abs/2105.08451", "description": "In this era of big data, all scientific disciplines are evolving fast to cope with the enormity of the available information. So is statistics, the queen of science. Big data are particularly relevant to spatio-temporal statistics, thanks to much-improved technology in satellite-based remote sensing and Geographical Information Systems. However, none of the existing approaches seem to meet the simultaneous demands of reality emulation and cheap computation. In this article, with Levy random fields as the starting point, we construct a new Bayesian nonparametric, nonstationary and nonseparable dynamic spatio-temporal model with the additional realistic property that the lagged spatio-temporal correlations converge to zero as the lag tends to infinity.
Although our Bayesian model seems to be intricately structured and is variable-dimensional with respect to each time index, we are able to devise a fast and efficient parallel Markov Chain Monte Carlo (MCMC) algorithm for Bayesian inference. Our simulation experiment brings out quite encouraging performance from our Bayesian Levy-dynamic approach. We finally apply our Bayesian Levy-dynamic model and methods to a sea surface temperature dataset consisting of 139,300 data points in space and time. Although not big data in the true sense, this is a large and highly structured dataset by any standard. Even for this large and complex dataset, our parallel MCMC algorithm, implemented on 80 processors, generated 110,000 MCMC realizations from the Levy-dynamic posterior within a single day, and the resultant Bayesian posterior predictive analysis turned out to be encouraging. Thus, it is not unreasonable to expect that with significantly more computing resources, it is feasible to analyse terabytes of spatio-temporal data with our new model and methods.", "authors": "Bhattacharya"}, "https://arxiv.org/abs/2303.04258": {"title": "Extremes in High Dimensions: Methods and Scalable Algorithms", "link": "https://arxiv.org/abs/2303.04258", "description": "Extreme value theory for univariate and low-dimensional observations has been explored in considerable detail, but the field is still in an early stage regarding high-dimensional settings. This paper focuses on H\\\"usler-Reiss models, a popular class of models for multivariate extremes similar to multivariate Gaussian distributions, and their domain of attraction. We develop estimators for the model parameters based on score matching, and we equip these estimators with supporting theory and exceptionally scalable algorithms. Simulations and applications to weather extremes demonstrate that the estimators can estimate a large number of parameters reliably and quickly; for example, we show that H\\\"usler-Reiss models with thousands of parameters can be fitted within a couple of minutes on a standard laptop. More generally speaking, our work relates extreme value theory to modern concepts of high-dimensional statistics and convex optimization.", "authors": "Lederer, Oesting"}, "https://arxiv.org/abs/2304.06960": {"title": "Estimating Conditional Average Treatment Effects with Heteroscedasticity by Model Averaging and Matching", "link": "https://arxiv.org/abs/2304.06960", "description": "We propose a model averaging approach, combined with a partition and matching method, to estimate conditional average treatment effects under heteroskedastic error settings. The proposed approach enjoys asymptotic optimality and consistency of the weights and the resulting estimator. Numerical studies show that our method has good finite-sample performance.", "authors": "Shi, Zhang, Zhong"}, "https://arxiv.org/abs/2310.00356": {"title": "Functional conditional volatility modeling with missing data: inference and application to energy commodities", "link": "https://arxiv.org/abs/2310.00356", "description": "This paper explores the nonparametric estimation of the volatility component in a heteroscedastic scalar-on-function regression model, where the underlying discrete-time process is ergodic and subject to a missing-at-random mechanism. We first propose a simplified estimator for the regression and volatility operators, constructed solely from the observed data.
The asymptotic properties of these estimators, including the almost sure uniform consistency rate and asymptotic distribution, are rigorously analyzed. Subsequently, the simplified estimators are employed to impute the missing data in the original process, enhancing the estimation of the regression and volatility components. The asymptotic behavior of these imputed estimators is also thoroughly investigated. A numerical comparison of the simplified and imputed estimators is presented using simulated data. Finally, the methodology is applied to real-world data to model the volatility of daily natural gas returns, utilizing intraday EU/USD exchange rate return curves sampled at a 1-hour frequency.", "authors": "Djeniah, Chaouch, Bouchentouf"}, "https://arxiv.org/abs/2311.14892": {"title": "An Identification and Dimensionality Robust Test for Instrumental Variables Models", "link": "https://arxiv.org/abs/2311.14892", "description": "Using modifications of Lindeberg's interpolation technique, I propose a new identification-robust test for the structural parameter in a heteroskedastic instrumental variables model. While my analysis allows the number of instruments to be much larger than the sample size, it does not require many instruments, making my test applicable in settings that have not been well studied. Instead, the proposed test statistic has a limiting chi-squared distribution so long as an auxiliary parameter can be consistently estimated. This is possible using machine learning methods even when the number of instruments is much larger than the sample size. To improve power, a simple combination with the sup-score statistic of Belloni et al. (2012) is proposed. I point out that first-stage F-statistics calculated on LASSO selected variables may be misleading indicators of identification strength and demonstrate favorable performance of my proposed methods in both empirical data and simulation study.", "authors": "Navjeevan"}, "https://arxiv.org/abs/2312.01411": {"title": "Bayesian inference on Cox regression models using catalytic prior distributions", "link": "https://arxiv.org/abs/2312.01411", "description": "The Cox proportional hazards model (Cox model) is a popular model for survival data analysis. When the sample size is small relative to the dimension of the model, the standard maximum partial likelihood inference is often problematic. In this work, we propose the Cox catalytic prior distributions for Bayesian inference on Cox models, which is an extension of a general class of prior distributions originally designed for stabilizing complex parametric models. The Cox catalytic prior is formulated as a weighted likelihood of the regression coefficients based on synthetic data and a surrogate baseline hazard constant. This surrogate hazard can be either provided by the user or estimated from the data, and the synthetic data are generated from the predictive distribution of a fitted simpler model. For point estimation, we derive an approximation of the marginal posterior mode, which can be computed conveniently as a regularized log partial likelihood estimator. We prove that our prior distribution is proper and the resulting estimator is consistent under mild conditions. In simulation studies, our proposed method outperforms standard maximum partial likelihood inference and is on par with existing shrinkage methods. 
We further illustrate the application of our method to a real dataset.", "authors": "Li, Huang"}, "https://arxiv.org/abs/2312.01969": {"title": "FDR Control for Online Anomaly Detection", "link": "https://arxiv.org/abs/2312.01969", "description": "A new online multiple testing procedure is described in the context of anomaly detection, which controls the False Discovery Rate (FDR). An accurate anomaly detector must control the false positive rate at a prescribed level while keeping the false negative rate as low as possible. However in the online context, such a constraint remains highly challenging due to the usual lack of FDR control: the online framework makes it impossible to use classical multiple testing approaches such as the Benjamini-Hochberg (BH) procedure, which would require knowing the entire time series. The developed strategy relies on exploiting the local control of the ``modified FDR'' (mFDR) criterion. It turns out that the local control of mFDR enables global control of the FDR over the full series up to additional modifications of the multiple testing procedures. An important ingredient in this control is the cardinality of the calibration dataset used to compute the empirical p-values. A dedicated strategy for tuning this parameter is designed for achieving the prescribed FDR control over the entire time series. The good statistical performance of the full strategy is analyzed by theoretical guarantees. Its practical behavior is assessed by several simulation experiments which support our conclusions.", "authors": "Kr\\\"onert, C\\'elisse, Hattab"}, "https://arxiv.org/abs/2401.15793": {"title": "Doubly regularized generalized linear models for spatial observations with high-dimensional covariates", "link": "https://arxiv.org/abs/2401.15793", "description": "A discrete spatial lattice can be cast as a network structure over which spatially-correlated outcomes are observed. A second network structure may also capture similarities among measured features, when such information is available. Incorporating the network structures when analyzing such doubly-structured data can improve predictive power, and lead to better identification of important features in the data-generating process. Motivated by applications in spatial disease mapping, we develop a new doubly regularized regression framework to incorporate these network structures for analyzing high-dimensional datasets. Our estimators can be easily implemented with standard convex optimization algorithms. In addition, we describe a procedure to obtain asymptotically valid confidence intervals and hypothesis tests for our model parameters. We show empirically that our framework provides improved predictive accuracy and inferential power compared to existing high-dimensional spatial methods. These advantages hold given fully accurate network information, and also with networks which are partially misspecified or uninformative. The application of the proposed method to modeling COVID-19 mortality data suggests that it can improve prediction of deaths beyond standard spatial models, and that it selects relevant covariates more often.", "authors": "Sondhi, Cheng, Shojaie"}, "https://arxiv.org/abs/2412.12316": {"title": "Estimating HIV Cross-sectional Incidence Using Recency Tests from a Non-representative Sample", "link": "https://arxiv.org/abs/2412.12316", "description": "Cross-sectional incidence estimation based on recency testing has become a widely used tool in HIV research. 
Recently, this method has gained prominence in HIV prevention trials to estimate the \"placebo\" incidence that participants might experience without preventive treatment. The application of this approach faces challenges due to non-representative sampling, as individuals aware of their HIV-positive status may be less likely to participate in screening for an HIV prevention trial. To address this, a recent phase 3 trial excluded individuals based on whether they have had a recent HIV test. To the best of our knowledge, the validity of this approach has yet to be studied. In our work, we investigate the performance of cross-sectional HIV incidence estimation when excluding individuals based on prior HIV tests in realistic trial settings. We develop a statistical framework that incorporates a testing-based criterion and possible non-representative sampling. We introduce a metric we call the effective mean duration of recent infection (MDRI) that mathematically quantifies bias in incidence estimation. We conduct an extensive simulation study to evaluate incidence estimator performance under various scenarios. Our findings reveal that when screening attendance is affected by knowledge of HIV status, incidence estimators become unreliable unless all individuals with recent HIV tests are excluded. Additionally, we identified a trade-off between bias and variability: excluding more individuals reduces bias from non-representative sampling but in many cases increases the variability of incidence estimates. These findings highlight the need for caution when applying testing-based criteria and emphasize the importance of refining incidence estimation methods to improve the design and evaluation of future HIV prevention trials.", "authors": "Pan, Bannick, Gao"}, "https://arxiv.org/abs/2412.12335": {"title": "Target Aggregate Data Adjustment Method for Transportability Analysis Utilizing Summary-Level Data from the Target Population", "link": "https://arxiv.org/abs/2412.12335", "description": "Transportability analysis is a causal inference framework used to evaluate the external validity of randomized clinical trials (RCTs) or observational studies. Most existing transportability analysis methods require individual patient-level data (IPD) for both the source and the target population, narrowing its applicability when only target aggregate-level data (AgD) is available. Besides, accounting for censoring is essential to reduce bias in longitudinal data, yet AgD-based transportability methods in the presence of censoring remain underexplored. Here, we propose a two-stage weighting framework named \"Target Aggregate Data Adjustment\" (TADA) to address the mentioned challenges simultaneously. TADA is designed as a two-stage weighting scheme to simultaneously adjust for both censoring bias and distributional imbalances of effect modifiers (EM), where the final weights are the product of the inverse probability of censoring weights and participation weights derived using the method of moments. We have conducted an extensive simulation study to evaluate TADA's performance. 
Our results indicate that TADA can effectively control the bias resulting from censoring within a non-extreme range suitable for most practical scenarios, and enhance the application and clinical interpretability of transportability analyses in settings with limited data availability.", "authors": "Yan, Vuong, Metcalfe et al"}, "https://arxiv.org/abs/2412.12365": {"title": "On the Role of Surrogates in Conformal Inference of Individual Causal Effects", "link": "https://arxiv.org/abs/2412.12365", "description": "Learning the Individual Treatment Effect (ITE) is essential for personalized decision making, yet causal inference has traditionally focused on aggregated treatment effects. While integrating conformal prediction with causal inference can provide valid uncertainty quantification for ITEs, the resulting prediction intervals are often excessively wide, limiting their practical utility. To address this limitation, we introduce \\underline{S}urrogate-assisted \\underline{C}onformal \\underline{I}nference for \\underline{E}fficient I\\underline{N}dividual \\underline{C}ausal \\underline{E}ffects (SCIENCE), a framework designed to construct more efficient prediction intervals for ITEs. SCIENCE applies to various data configurations, including semi-supervised and surrogate-assisted semi-supervised learning. It accommodates covariate shifts between source data, which contain primary outcomes, and target data, which may include only surrogate outcomes or covariates. Leveraging semi-parametric efficiency theory, SCIENCE produces rate double-robust prediction intervals under mild rate convergence conditions, permitting the use of flexible non-parametric models to estimate nuisance functions. We quantify efficiency gains by comparing semi-parametric efficiency bounds with and without the incorporation of surrogates. Simulation studies demonstrate that our surrogate-assisted intervals offer substantial efficiency improvements over existing methods while maintaining valid group-conditional coverage. Applied to the phase 3 Moderna COVE COVID-19 vaccine trial, SCIENCE illustrates how multiple surrogate markers can be leveraged to generate more efficient prediction intervals.", "authors": "Gao, Gilbert, Han"}, "https://arxiv.org/abs/2412.12380": {"title": "The Estimand Framework and Causal Inference: Complementary not Competing Paradigms", "link": "https://arxiv.org/abs/2412.12380", "description": "The creation of the ICH E9 (R1) estimands framework has led to more precise specification of the treatment effects of interest in the design and statistical analysis of clinical trials. However, it is unclear how the new framework relates to causal inference, as both approaches appear to define what is being estimated and have a quantity labelled an estimand. Using illustrative examples, we show that both approaches can be used to define a population-based summary of an effect on an outcome for a specified population and highlight the similarities and differences between these approaches. We demonstrate that the ICH E9 (R1) estimand framework offers a descriptive, structured approach that is more accessible to non-mathematicians, facilitating clearer communication of trial objectives and results. We then contrast this with the causal inference framework, which provides a mathematically precise definition of an estimand, and allows the explicit articulation of assumptions through tools such as causal graphs. 
Despite these differences, the two paradigms should be viewed as complementary rather than competing. The combined use of both approaches enhances the ability to communicate what is being estimated. We encourage those familiar with one framework to appreciate the concepts of the other to strengthen the robustness and clarity of clinical trial design, analysis, and interpretation.", "authors": "Drury, Bartlett, Wright et al"}, "https://arxiv.org/abs/2412.12405": {"title": "Generalized entropy calibration for analyzing voluntary survey data", "link": "https://arxiv.org/abs/2412.12405", "description": "Statistical analysis of voluntary survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized entropy calibration is introduced as a unified tool for calibration weighting to control the selection bias. We first establish the relationship between the generalized calibration weighting and its dual expression for regression estimation. The dual relationship is critical in identifying the implied regression model and developing model selection for calibration weighting. Also, if a linear regression model for an important study variable is available, then a two-step calibration method can be used to smooth the final weights and achieve statistical efficiency. Asymptotic properties of the proposed estimator are investigated. Results from a limited simulation study are also presented.", "authors": "Kwon, Kim, Qiu"}, "https://arxiv.org/abs/2412.12407": {"title": "Inside-out cross-covariance for spatial multivariate data", "link": "https://arxiv.org/abs/2412.12407", "description": "As the spatial features of multivariate data are increasingly central in researchers' applied problems, there is a growing demand for novel spatially-aware methods that are flexible, easily interpretable, and scalable to large data. We develop inside-out cross-covariance (IOX) models for multivariate spatial likelihood-based inference. IOX leads to valid cross-covariance matrix functions which we interpret as inducing spatial dependence on independent replicates of a correlated random vector. The resulting sample cross-covariance matrices are \"inside-out\" relative to the ubiquitous linear model of coregionalization (LMC). However, unlike LMCs, our methods offer direct marginal inference, easy prior elicitation of covariance parameters, the ability to model outcomes with unequal smoothness, and flexible dimension reduction. As a covariance model for a q-variate Gaussian process, IOX leads to scalable models for noisy vector data as well as flexible latent models. For large n cases, IOX complements Vecchia approximations and related process-based methods based on sparse graphical models. We demonstrate superior performance of IOX on synthetic datasets as well as on colorectal cancer proteomics data. 
An R package implementing the proposed methods is available at github.com/mkln/spiox.", "authors": "Peruzzi"}, "https://arxiv.org/abs/2412.12868": {"title": "Bayesian nonparametric partial clustering: Quantifying the effectiveness of agricultural subsidies across Europe", "link": "https://arxiv.org/abs/2412.12868", "description": "The global climate crisis has underscored the need for effective policies to reduce greenhouse gas emissions from all sources, including those resulting from agricultural expansion, which is regulated by the Common Agricultural Policy (CAP) across the European Union (EU). To assess the effectiveness of these mitigation policies, statistical methods must account for the heterogeneous impact of policies across different countries. We propose a Bayesian approach that combines the multinomial logit model, which is suitable for compositional land-use data, with a Bayesian nonparametric (BNP) prior, in order to cluster regions with similar policy impacts. To simultaneously control for other relevant factors, we distinguish between cluster-specific and global covariates, coining this approach the Bayesian nonparametric partial clustering model. We develop a novel and efficient Markov Chain Monte Carlo (MCMC) algorithm, leveraging recent advances in the Bayesian literature. Using economic, geographic, and subsidy-related data from 22 EU member states, we examine the effectiveness of policies influencing land-use decisions across Europe and highlight the diversity of the problem. Our results indicate that the impact of CAP varies widely across the EU, emphasizing the need for subsidies to be tailored to optimize their effectiveness.", "authors": "Mozdzen, Addo, Krisztin et al"}, "https://arxiv.org/abs/2412.12945": {"title": "Meta-analysis models relaxing the random effects normality assumption: methodological systematic review and simulation study", "link": "https://arxiv.org/abs/2412.12945", "description": "Random effects meta-analysis is widely used for synthesizing studies under the assumption that underlying effects come from a normal distribution. However, under certain conditions the use of alternative distributions might be more appropriate. We conducted a systematic review to identify articles introducing alternative meta-analysis models assuming non-normal between-study distributions. We identified 27 eligible articles suggesting 24 alternative meta-analysis models based on long-tail and skewed distributions, on mixtures of distributions, and on Dirichlet process priors. Subsequently, we performed a simulation study to evaluate the performance of these models and to compare them with the standard normal model. We considered 22 scenarios varying the amount of between-study variance, the shape of the true distribution, and the number of included studies. We compared 15 models implemented in the Frequentist or in the Bayesian framework. We found small differences with respect to bias between the different models but larger differences in the level of coverage probability. In scenarios with large between-study variance, all models were substantially biased in the estimation of the mean treatment effect. 
This implies that focusing only on the mean treatment effect of random effects meta-analysis can be misleading when substantial heterogeneity is suspected or outliers are present.", "authors": "Panagiotopoulou, Evrenoglou, Schmid et al"}, "https://arxiv.org/abs/2412.12967": {"title": "Neural Posterior Estimation for Stochastic Epidemic Modeling", "link": "https://arxiv.org/abs/2412.12967", "description": "Stochastic infectious disease models capture uncertainty in public health outcomes and have become increasingly popular in epidemiological practice. However, calibrating these models to observed data is challenging with existing methods for parameter estimation. Stochastic epidemic models are nonlinear dynamical systems with potentially large latent state spaces, resulting in computationally intractable likelihood densities. We develop an approach to calibrating complex epidemiological models to high-dimensional data using Neural Posterior Estimation (NPE), a novel technique for simulation-based inference. In NPE, a neural conditional density estimator trained on simulated data learns to \"invert\" a stochastic simulator, returning a parametric approximation to the posterior distribution. We introduce a stochastic, discrete-time Susceptible Infected (SI) model with heterogeneous transmission for healthcare-associated infections (HAIs). HAIs are a major burden on healthcare systems. They exhibit high rates of asymptomatic carriage, making it difficult to estimate infection rates. Through extensive simulation experiments, we show that NPE produces accurate posterior estimates of infection rates with greater sample efficiency compared to Approximate Bayesian Computation (ABC). We then use NPE to fit our SI model to an outbreak of carbapenem-resistant Klebsiella pneumoniae in a long-term acute care facility, finding evidence of location-based heterogeneity in patient-to-patient transmission risk. We argue that our methodology can be fruitfully applied to a wide range of mechanistic transmission models and problems in the epidemiology of infectious disease.", "authors": "Chatha, Bu, Regier et al"}, "https://arxiv.org/abs/2412.13076": {"title": "Dual Interpretation of Machine Learning Forecasts", "link": "https://arxiv.org/abs/2412.13076", "description": "Machine learning predictions are typically interpreted as the sum of contributions of predictors. Yet, each out-of-sample prediction can also be expressed as a linear combination of in-sample values of the predicted variable, with weights corresponding to pairwise proximity scores between current and past economic events. While this dual route leads nowhere in some contexts (e.g., large cross-sectional datasets), it provides sparser interpretations in settings with many regressors and little training data, such as macroeconomic forecasting. In this case, the sequence of contributions can be visualized as a time series, allowing analysts to explain predictions as quantifiable combinations of historical analogies. Moreover, the weights can be viewed as those of a data portfolio, inspiring new diagnostic measures such as forecast concentration, short position, and turnover. We show how weights can be retrieved seamlessly for (kernel) ridge regression, random forest, boosted trees, and neural networks. Then, we apply these tools to analyze post-pandemic forecasts of inflation, GDP growth, and recession probabilities. 
In all cases, the approach opens the black box from a new angle and demonstrates how machine learning models leverage history partly repeating itself.", "authors": "Coulombe, Goebel, Klieber"}, "https://arxiv.org/abs/2412.13122": {"title": "Fr\\'echet Sufficient Dimension Reduction for Metric Space-Valued Data via Distance Covariance", "link": "https://arxiv.org/abs/2412.13122", "description": "We propose a novel Fr\\'echet sufficient dimension reduction (SDR) method based on kernel distance covariance, tailored for metric space-valued responses such as count data, probability densities, and other complex structures. The method leverages a kernel-based transformation to map metric space-valued responses into a feature space, enabling efficient dimension reduction. By incorporating kernel distance covariance, the proposed approach offers enhanced flexibility and adaptability for datasets with diverse and non-Euclidean characteristics. The effectiveness of the method is demonstrated through synthetic simulations and several real-world applications. In all cases, the proposed method runs faster and consistently outperforms the existing Fr\\'echet SDR approaches, demonstrating its broad applicability and robustness in addressing complex data challenges.", "authors": "Huang, Yu, Li et al"}, "https://arxiv.org/abs/2405.11284": {"title": "The Logic of Counterfactuals and the Epistemology of Causal Inference", "link": "https://arxiv.org/abs/2405.11284", "description": "The 2021 Nobel Prize in Economics recognizes a type of causal model known as the Rubin causal model, or potential outcome framework, which deserves far more attention from philosophers than it currently receives. To spark philosophers' interest, I develop a dialectic connecting the Rubin causal model to the Lewis-Stalnaker debate on a logical principle of counterfactuals: Conditional Excluded Middle (CEM). I begin by playing good cop for CEM, developing a new argument in its favor -- a Quine-Putnam-style indispensability argument. This argument is based on the observation that CEM seems to be indispensable to the Rubin causal model, which underpins our best scientific theory of causal inference in health and social sciences -- a Nobel Prize-winning theory. Indeed, CEM has long remained a core assumption of the Rubin causal model, despite challenges from within the statistics and economics communities over twenty years ago. I then switch sides to play bad cop for CEM, undermining the indispensability argument by developing a new theory of causal inference that dispenses with CEM while preserving the successes of the original theory (thanks to a new theorem proved here). The key, somewhat surprisingly, is to integrate two approaches to causal modeling: the Rubin causal model, more familiar in health and social sciences, and the causal Bayes net, more familiar in philosophy. 
The good cop/bad cop dialectic is concluded with a connection to broader philosophical issues, including intertheory relations, the revisability of logic, and the role of background assumptions in justifying scientific inference.", "authors": "Lin"}, "https://arxiv.org/abs/2412.12114": {"title": "Unlocking new capabilities in the analysis of GC$\\times$GC-TOFMS data with shift-invariant multi-linearity", "link": "https://arxiv.org/abs/2412.12114", "description": "This paper introduces a novel deconvolution algorithm, shift-invariant multi-linearity (SIML), which significantly enhances the analysis of data from a comprehensive two-dimensional gas chromatograph coupled to a mass spectrometric detector (GC$\\times$GC-TOFMS). Designed to address the challenges posed by retention time shifts and high noise levels, SIML incorporates wavelet-based smoothing and Fourier-Transform based shift-correction within the multivariate curve resolution-alternating least squares (MCR-ALS) framework. We benchmarked the SIML algorithm against traditional methods such as MCR-ALS and Parallel Factor Analysis 2 with flexible coupling (PARAFAC2$\\times$N) using both simulated and real GC$\\times$GC-TOFMS datasets. Our results demonstrate that SIML provides unique solutions with significantly improved robustness, particularly in low signal-to-noise ratio scenarios, where it maintains high accuracy in estimating mass spectra and concentrations. The enhanced reliability of quantitative analyses afforded by SIML underscores its potential for broad application in complex matrix analyses across environmental science, food chemistry, and biological research.", "authors": "Schneide, Armstrong, Gallagher et al"}, "https://arxiv.org/abs/2412.12198": {"title": "Pop-out vs. Glue: A Study on the pre-attentive and focused attention stages in Visual Search tasks", "link": "https://arxiv.org/abs/2412.12198", "description": "This study explores visual search asymmetry and the detection process between parallel and serial search strategies, building upon Treisman's Feature Integration Theory [3]. Our experiment examines how easy it is to locate an oblique line among vertical distractors versus a vertical line among oblique distractors, a framework previously validated by Treisman & Gormican (1988) [4] and Gupta et al. (2015) [1]. We hypothesised that an oblique target among vertical lines would produce a perceptual 'pop-out' effect, allowing for faster, parallel search, while the reverse condition would require serial search strategy. Seventy-eight participants from Utrecht University engaged in trials with varied target-distractor orientations and number of items. We measured reaction times and found a significant effect of target type on search speed: oblique targets were identified more quickly, reflecting 'pop-out' behaviour, while vertical targets demanded focused attention ('glue phase'). Our results align with past findings, supporting our hypothesis on search asymmetry and its dependency on distinct visual features. 
Future research could benefit from eye-tracking and neural network analysis, particularly for identifying the neural processing of visual features in both parallel and serial search conditions.", "authors": "Beukelman, Rodrigues"}, "https://arxiv.org/abs/2412.12807": {"title": "Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification", "link": "https://arxiv.org/abs/2412.12807", "description": "Selective classification frameworks are useful tools for automated decision making in highly risky scenarios, since they allow for a classifier to only make highly confident decisions, while abstaining from making a decision when it is not confident enough to do so, which is otherwise known as an indecision. For a given level of classification accuracy, we aim to make as many decisions as possible. For many problems, this can be achieved without abstaining from making decisions. But when the problem is hard enough, we show that we can still control the misclassification rate of a classifier up to any user specified level, while only abstaining from the minimum necessary amount of decisions, even if this level of misclassification is smaller than the Bayes optimal error rate. In many problem settings, the user could obtain a dramatic decrease in misclassification while only paying a comparatively small price in terms of indecisions.", "authors": "Ndaoud, Radchenko, Rava"}, "https://arxiv.org/abs/2412.13034": {"title": "Unified calibration and spatial mapping of fine particulate matter data from multiple low-cost air pollution sensor networks in Baltimore, Maryland", "link": "https://arxiv.org/abs/2412.13034", "description": "Low-cost air pollution sensor networks are increasingly being deployed globally, supplementing sparse regulatory monitoring with localized air quality data. In some areas, like Baltimore, Maryland, there are only few regulatory (reference) devices but multiple low-cost networks. While there are many available methods to calibrate data from each network individually, separate calibration of each network leads to conflicting air quality predictions. We develop a general Bayesian spatial filtering model combining data from multiple networks and reference devices, providing dynamic calibrations (informed by the latest reference data) and unified predictions (combining information from all available sensors) for the entire region. This method accounts for network-specific bias and noise (observation models), as different networks can use different types of sensors, and uses a Gaussian process (state-space model) to capture spatial correlations. We apply the method to calibrate PM$_{2.5}$ data from Baltimore in June and July 2023 -- a period including days of hazardous concentrations due to wildfire smoke. Our method helps mitigate the effects of preferential sampling of one network in Baltimore, results in better predictions and narrower confidence intervals. Our approach can be used to calibrate low-cost air pollution sensor data in Baltimore and any other areas with multiple low-cost networks.", "authors": "Heffernan, Koehler, Gentner et al"}, "https://arxiv.org/abs/2105.00879": {"title": "Identification and Estimation of Average Causal Effects in Fixed Effects Logit Models", "link": "https://arxiv.org/abs/2105.00879", "description": "This paper studies identification and estimation of average causal effects, such as average marginal or treatment effects, in fixed effects logit models with short panels. 
Relating the identified set of these effects to an extremal moment problem, we first show how to obtain sharp bounds on such effects simply, without any optimization. We also consider even simpler outer bounds, which, contrary to the sharp bounds, do not require any first-step nonparametric estimators. We build confidence intervals based on these two approaches and show their asymptotic validity. Monte Carlo simulations suggest that both approaches work well in practice, the second being typically competitive in terms of interval length. Finally, we show that our method is also useful to measure treatment effect heterogeneity.", "authors": "Davezies, D'Haultf{\\oe}uille, Laage"}, "https://arxiv.org/abs/2202.02029": {"title": "First-order integer-valued autoregressive processes with Generalized Katz innovations", "link": "https://arxiv.org/abs/2202.02029", "description": "A new integer--valued autoregressive process (INAR) with Generalised Lagrangian Katz (GLK) innovations is defined. This process family provides a flexible modelling framework for count data, allowing for under and over--dispersion, asymmetry, and excess of kurtosis, and includes standard INAR models such as Generalized Poisson and Negative Binomial as special cases. We show that the GLK--INAR process is discrete semi--self--decomposable, infinitely divisible, and stable by aggregation, and we provide stationarity conditions. Some extensions are discussed, such as the Markov--Switching and the zero--inflated GLK--INARs. A Bayesian inference framework and an efficient posterior approximation procedure are introduced. The proposed models are applied to 130 time series from Google Trend, which proxy the worldwide public concern about climate change. New evidence is found of heterogeneity across time, countries and keywords in the persistence, uncertainty, and long--run public awareness level.", "authors": "Lopez, Bassetti, Carallo et al"}, "https://arxiv.org/abs/2311.00013": {"title": "Semiparametric Discrete Choice Models for Bundles", "link": "https://arxiv.org/abs/2311.00013", "description": "We propose two approaches to estimate semiparametric discrete choice models for bundles. Our first approach is a kernel-weighted rank estimator based on a matching-based identification strategy. We establish its complete asymptotic properties and prove the validity of the nonparametric bootstrap for inference. We then introduce a new multi-index least absolute deviations (LAD) estimator as an alternative, of which the main advantage is its capacity to estimate preference parameters on both alternative- and agent-specific regressors. Both methods can account for arbitrary correlation in disturbances across choices, with the former also allowing for interpersonal heteroskedasticity. We also demonstrate that the identification strategy underlying these procedures can be extended naturally to panel data settings, producing an analogous localized maximum score estimator and a LAD estimator for estimating bundle choice models with fixed effects. We derive the limiting distribution of the former and verify the validity of the numerical bootstrap as an inference tool. All our proposed methods can be applied to general multi-index models. 
Monte Carlo experiments show that they perform well in finite samples.", "authors": "Ouyang, Yang"}, "https://arxiv.org/abs/2401.13379": {"title": "An Ising Similarity Regression Model for Modeling Multivariate Binary Data", "link": "https://arxiv.org/abs/2401.13379", "description": "Understanding the dependence structure between response variables is an important component in the analysis of correlated multivariate data. This article focuses on modeling dependence structures in multivariate binary data, motivated by a study aiming to understand how patterns in different U.S. senators' votes are determined by similarities (or lack thereof) in their attributes, e.g., political parties and social network profiles. To address such a research question, we propose a new Ising similarity regression model which regresses pairwise interaction coefficients in the Ising model against a set of similarity measures available/constructed from covariates. Model selection approaches are further developed through regularizing the pseudo-likelihood function with an adaptive lasso penalty to enable the selection of relevant similarity measures. We establish estimation and selection consistency of the proposed estimator under a general setting where the numbers of similarity measures and responses tend to infinity. A simulation study demonstrates the strong finite sample performance of the proposed estimator, particularly compared with several existing Ising model estimators in estimating the matrix of pairwise interaction coefficients. Applying the Ising similarity regression model to a dataset of roll call voting records of 100 U.S. senators, we are able to quantify how similarities in senators' parties, businessman occupations and social network profiles drive their voting associations.", "authors": "Tho, Hui, Zou"}, "https://arxiv.org/abs/2412.13453": {"title": "ExtremalDep: Modelling extremal dependence in high-dimensional extremes", "link": "https://arxiv.org/abs/2412.13453", "description": "From environmental sciences to finance, there are growing needs for assessing the risk of more extreme events than those observed. Extrapolating extreme events beyond the range of the data is not obvious and requires advanced tools based on extreme value theory. Furthermore, the complexity of risk assessments often requires the inclusion of multiple variables. Extreme value theory provides very important tools for the analysis of multivariate or spatial extreme events, but these are not easily accessible to professionals without appropriate expertise. This article provides a minimal background on multivariate and spatial extremes and gives simple yet thorough instructions to analyse high-dimensional extremes using the R package ExtremalDep. After briefly introducing the statistical methodologies, we focus on road testing the package's toolbox through several real-world applications.", "authors": "Beranger, Padoan"}, "https://arxiv.org/abs/2412.13556": {"title": "An Analysis of the Relationship Between the Characteristics of Innovative Consumers and the Degree of Serious Leisure in User Innovation", "link": "https://arxiv.org/abs/2412.13556", "description": "This study examines the relationship between the concept of serious leisure and user innovation. We adopted the characteristics of innovative consumers identified by Luthje (2004), namely product use experience, information exchange, and new product adoption speed, to analyze their correlation with serious leisure engagement. 
The analysis utilized consumer behavior survey data from the \"Marketing Analysis Contest 2023\" sponsored by Nomura Research Institute, examining the relationship between innovative consumer characteristics and the degree of serious leisure (Serious Leisure Inventory and Measure: SLIM). Since the contest data did not directly measure innovative consumer characteristics or serious leisure engagement, we established alternative variables for quantitative analysis. The results showed that the SLIM alternative variable had positive correlations with diverse product experiences and early adoption of new products. However, no clear relationship was found with information exchange among consumers. These findings suggest that serious leisure practice may serve as a potential antecedent to user innovation. The leisure career perspective of the serious leisure concept may capture the motivations of user innovators that Okada and Nishikawa (2019) identified.", "authors": "Abe, Morita"}, "https://arxiv.org/abs/2412.13572": {"title": "A Model-Based Clustering Approach for Bounded Data Using Transformation-Based Gaussian Mixture Models", "link": "https://arxiv.org/abs/2412.13572", "description": "The clustering of bounded data presents unique challenges in statistical analysis due to the constraints imposed on the data values. This paper introduces a novel method for model-based clustering specifically designed for bounded data. Building on the transformation-based approach to Gaussian mixture density estimation introduced by Scrucca (2019), we extend this framework to develop a probabilistic clustering algorithm for data with bounded support that allows for accurate clustering while respecting the natural bounds of the variables. In our proposal, a flexible range-power transformation is employed to map the data from its bounded domain to the unrestricted real space, hence enabling the estimation of Gaussian mixture models in the transformed space. This approach leads to improved cluster recovery and interpretation, especially for complex distributions within bounded domains. The performance of the proposed method is evaluated through real-world data applications involving both fully and partially bounded data, in both univariate and multivariate settings. The results demonstrate the effectiveness and advantages of our approach over traditional and advanced model-based clustering techniques that employ distributions with bounded support.", "authors": "Scrucca"}, "https://arxiv.org/abs/2412.13959": {"title": "A path-specific effect approach to mediation analysis with time-varying mediators and time-to-event outcomes accounting for competing risks", "link": "https://arxiv.org/abs/2412.13959", "description": "Not accounting for competing events in survival analysis can lead to biased estimates, as individuals who die from other causes do not have the opportunity to develop the event of interest. Formal definitions and considerations for causal effects in the presence of competing risks have been published, but not for the mediation analysis setting. We propose, for the first time, an approach based on the path-specific effects framework to account for competing risks in longitudinal mediation analysis with time-to-event outcomes. We do so by considering the pathway through the competing event as another mediator, which is nested within our longitudinal mediator of interest. 
We provide a theoretical formulation and related definitions of the effects of interest based on the mediational g-formula, as well as a detailed description of the algorithm. We also present an application of our algorithm to data from the Strong Heart Study, a prospective cohort of American Indian adults. In this application, we evaluated the mediating role of the blood pressure trajectory (measured during three visits) on the association of arsenic and cadmium, in separate models, with time to cardiovascular disease, accounting for competing risks due to death. Identifying the effects through different paths enables us to evaluate the impact of metals on the outcome of interest, as well as through competing risks, more transparently.", "authors": "Domingo-Relloso, Zhang, Wang et al"}, "https://arxiv.org/abs/2412.14124": {"title": "Semiparametric Joint Modeling to Estimate the Treatment Effect on a Longitudinal Surrogate with Application to Chronic Kidney Disease Trials", "link": "https://arxiv.org/abs/2412.14124", "description": "In clinical trials where long follow-up is required to measure the primary outcome of interest, there is substantial interest in using an accepted surrogate outcome that can be measured earlier in time or with less cost to estimate a treatment effect. For example, in clinical trials of chronic kidney disease (CKD), the effect of a treatment is often demonstrated on a surrogate outcome, the longitudinal trajectory of glomerular filtration rate (GFR). However, estimating the effect of a treatment on the GFR trajectory is complicated by the fact that GFR measurement can be terminated by the occurrence of a terminal event, such as death or kidney failure. Thus, to estimate this effect, one must consider both the longitudinal outcome of GFR, and the terminal event process. Available estimation methods either impose restrictive parametric assumptions, with corresponding maximum likelihood estimation that is computation-intensive, or rely on other assumptions not appropriate for the GFR setting. In this paper, we build a semiparametric framework to jointly model the longitudinal outcome and the terminal event, where the model for the longitudinal outcome is semiparametric, and the relationship between the longitudinal outcome and the terminal event is nonparametric. The proposed semiparametric joint model is flexible and can easily be extended to include a nonlinear trajectory of the longitudinal outcome. An estimating equation based method is proposed to estimate the treatment effect on the slope of the longitudinal outcome (e.g., GFR slope). Theoretical properties of the proposed estimators are derived. Finite sample performance of the proposed method is evaluated through simulation studies. We illustrate the proposed method using data from the Reduction of Endpoints in NIDDM with the Angiotensin II Antagonist Losartan (RENAAL) trial to examine the effect of Losartan on GFR slope.", "authors": "Wang, Zhou, Parast et al"}, "https://arxiv.org/abs/2412.14129": {"title": "Model-free Approach to Evaluate a Censored Intermediate Outcome as a Surrogate for Overall Survival", "link": "https://arxiv.org/abs/2412.14129", "description": "Clinical trials or studies oftentimes require long-term and/or costly follow-up of participants to evaluate a novel treatment/drug/vaccine. 
There has been increasing interest in the past few decades in using short-term surrogate outcomes as a replacement for the primary outcome, i.e., in using the surrogate outcome, which can potentially be observed sooner, to make inference about the treatment effect on the long-term primary outcome. Very few of the available statistical methods to evaluate a surrogate are applicable to settings where both the surrogate and the primary outcome are time-to-event outcomes subject to censoring. Methods that can handle this setting tend to require parametric assumptions or be limited to assessing only the restricted mean survival time. In this paper, we propose a non-parametric approach to evaluate a censored surrogate outcome, such as time to progression, when the primary outcome is also a censored time-to-event outcome, such as time to death, and the treatment effect of interest is the difference in overall survival. Specifically, we define the proportion of the treatment effect on the primary outcome that is explained (PTE) by the censored surrogate outcome in this context, and estimate this proportion by defining and deriving an optimal transformation of the surrogate information. Our approach provides the added advantage of relaxed assumptions to guarantee that the true PTE is within (0,1), along with being model-free. Finite sample performance of our estimators is illustrated via extensive simulation studies and a real data application examining progression-free survival as a surrogate for overall survival for patients with metastatic colorectal cancer.", "authors": "Wang, Cai, Tian et al"}, "https://arxiv.org/abs/2412.13608": {"title": "Disease Progression Modelling and Stratification for detecting sub-trajectories in the natural history of pathologies: application to Parkinson's Disease trajectory modelling", "link": "https://arxiv.org/abs/2412.13608", "description": "Modelling the progression of Degenerative Diseases (DD) is essential for detection, prevention, and treatment, yet it remains challenging due to the heterogeneity in disease trajectories among individuals. Factors such as demographics, genetic conditions, and lifestyle contribute to diverse phenotypical manifestations, necessitating patient stratification based on these variations. Recent methods like Subtype and Stage Inference (SuStaIn) have advanced unsupervised stratification of disease trajectories, but they face potential limitations in robustness, interpretability, and temporal granularity. To address these challenges, we introduce Disease Progression Modelling and Stratification (DP-MoSt), a novel probabilistic method that optimises clusters of continuous trajectories over a long-term disease time-axis while estimating the confidence of trajectory sub-types for each biomarker. We validate DP-MoSt using both synthetic and real-world data from the Parkinson's Progression Markers Initiative (PPMI). Our results demonstrate that DP-MoSt effectively identifies both sub-trajectories and subpopulations, and is a promising alternative to current state-of-the-art models.", "authors": "(CRISAM), (IIT), (UMC) et al"}, "https://arxiv.org/abs/2412.13667": {"title": "Exploring Multi-Modal Integration with Tool-Augmented LLM Agents for Precise Causal Discovery", "link": "https://arxiv.org/abs/2412.13667", "description": "Causal inference is an imperative foundation for decision-making across domains, such as smart health, AI for drug discovery and AIOps. 
Traditional statistical causal discovery methods, while well-established, predominantly rely on observational data and often overlook the semantic cues inherent in cause-and-effect relationships. The advent of Large Language Models (LLMs) has ushered in an affordable way of leveraging the semantic cues for knowledge-driven causal discovery, but the development of LLMs for causal discovery lags behind other areas, particularly in the exploration of multi-modality data. To bridge the gap, we introduce MATMCD, a multi-agent system powered by tool-augmented LLMs. MATMCD has two key agents: a Data Augmentation agent that retrieves and processes modality-augmented data, and a Causal Constraint agent that integrates multi-modal data for knowledge-driven inference. Delicate design of the inner-workings ensures successful cooperation of the agents. Our empirical study across seven datasets suggests the significant potential of multi-modality enhanced causal discovery.", "authors": "Shen, Chen, Luo et al"}, "https://arxiv.org/abs/2412.13731": {"title": "Reliability analysis for non-deterministic limit-states using stochastic emulators", "link": "https://arxiv.org/abs/2412.13731", "description": "Reliability analysis is a sub-field of uncertainty quantification that assesses the probability of a system performing as intended under various uncertainties. Traditionally, this analysis relies on deterministic models, where experiments are repeatable, i.e., they produce consistent outputs for a given set of inputs. However, real-world systems often exhibit stochastic behavior, leading to non-repeatable outcomes. These so-called stochastic simulators produce different outputs each time the model is run, even with fixed inputs. This paper formally introduces reliability analysis for stochastic models and addresses it by using suitable surrogate models to lower its typically high computational cost. Specifically, we focus on the recently introduced generalized lambda models and stochastic polynomial chaos expansions. These emulators are designed to learn the inherent randomness of the simulator's response and enable efficient uncertainty quantification at a much lower cost than traditional Monte Carlo simulation. We validate our methodology through three case studies. First, using an analytical function with a closed-form solution, we demonstrate that the emulators converge to the correct solution. Second, we present results obtained from the surrogates using a toy example of a simply supported beam. Finally, we apply the emulators to perform reliability analysis on a realistic wind turbine case study, where only a dataset of simulation results is available.", "authors": "Pires, Moustapha, Marelli et al"}, "https://arxiv.org/abs/2412.13855": {"title": "Breaching 1.5{\\deg}C: Give me the odds", "link": "https://arxiv.org/abs/2412.13855", "description": "Climate change communication is crucial to raising awareness and motivating action. In the context of breaching the limits set out by the Paris Agreement, we argue that climate scientists should move away from point estimates and towards reporting probabilities. Reporting probabilities will provide policymakers with a range of possible outcomes and will allow them to make informed timely decisions. To achieve this goal, we propose a method to calculate the probability of breaching the limits set out by the Paris Agreement. 
The method can be summarized as predicting future temperatures under different scenarios and calculating the number of possible outcomes that breach the limits as a proportion of the total number of outcomes. The probabilities can be computed for different time horizons and can be updated as new data become available. As an illustration, we performed a simulation study to investigate the probability of breaching the limits in a statistical model. Our results show that the probability of breaching the 1.5{\\deg}C limit is already greater than zero for 2024. Moreover, the probability of breaching the limit is greater than 99% by 2042 if no action is taken to reduce greenhouse gas emissions. Our methodology is simple to implement and can easily be extended to more complex models of the climate system. We encourage climate model developers to include the probabilities of breaching the limits in their reports.", "authors": "Vera-Vald\\'es, Kvist"}, "https://arxiv.org/abs/2301.06098": {"title": "A novel method and comparison of methods for constructing Markov bridges", "link": "https://arxiv.org/abs/2301.06098", "description": "In this study, we address the central issue of statistical inference for Markov jump processes using discrete time observations. The primary problem at hand is to accurately estimate the infinitesimal generator of a Markov jump process, a critical task in various applications. To tackle this problem, we begin by reviewing established methods for generating sample paths from a Markov jump process conditioned to endpoints, known as Markov bridges. Additionally, we introduce a novel algorithm grounded in the concept of time-reversal, which serves as our main contribution. Our proposed method is then employed to estimate the infinitesimal generator of a Markov jump process. To achieve this, we use a combination of Markov Chain Monte Carlo techniques and the Monte Carlo Expectation-Maximization algorithm. The results obtained from our approach demonstrate its effectiveness in providing accurate parameter estimates. To assess the efficacy of our proposed method, we conduct a comprehensive comparative analysis with existing techniques (Bisection, Uniformization, Direct, Rejection, and Modified Rejection), taking into consideration both speed and accuracy. Notably, our method stands out as the fastest among the alternatives while maintaining high levels of precision.", "authors": "Baltazar-Larios, Esparza"}, "https://arxiv.org/abs/2210.07416": {"title": "GLACIAL: Granger and Learning-based Causality Analysis for Longitudinal Imaging Studies", "link": "https://arxiv.org/abs/2210.07416", "description": "The Granger framework is useful for discovering causal relations in time-varying signals. However, most Granger causality (GC) methods are developed for densely sampled timeseries data. A substantially different setting, particularly common in medical imaging, is the longitudinal study design, where multiple subjects are followed and sparsely observed over time. Longitudinal studies commonly track several biomarkers, which are likely governed by nonlinear dynamics that might have subject-specific idiosyncrasies and exhibit both direct and indirect causes. Furthermore, real-world longitudinal data often suffer from widespread missingness. GC methods are not well-suited to handle these issues. 
In this paper, we propose an approach named GLACIAL (Granger and LeArning-based CausalIty Analysis for Longitudinal studies) to fill this methodological gap by marrying GC with a multi-task neural forecasting model. GLACIAL treats subjects as independent samples and uses the model's average prediction accuracy on hold-out subjects to probe causal links. Input dropout and model interpolation are used to efficiently learn nonlinear dynamic relationships between a large number of variables and to handle missing values respectively. Extensive simulations and experiments on a real longitudinal medical imaging dataset show GLACIAL beating competitive baselines and confirm its utility. Our code is available at https://github.com/mnhng/GLACIAL.", "authors": "Nguyen, Ngo, Sabuncu"}}
\ No newline at end of file